File Compression Formats and Container Standards

Container Formats vs. Compression Algorithms

A compression algorithm defines how input bytes are encoded into a more compact bitstream. A container format wraps that bitstream with structural metadata: file headers, integrity checksums, file-system attributes, and in some cases multiple compressed members. The distinction matters because many container formats support multiple compression methods, and some algorithms appear in more than one container.

For example, DEFLATE (RFC 1951) is the algorithm; gzip (RFC 1952) and zlib (RFC 1950) are two different container formats that carry DEFLATE data with different headers and checksums. ZIP is a third container that can optionally use DEFLATE among several methods. Understanding which layer you are working with avoids confusion when troubleshooting or selecting tools.
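The layering can be made concrete with Python's standard zlib module, whose wbits parameter selects the wrapper around the same DEFLATE stream. The following is a minimal sketch; the helper name deflate and the sample payload are illustrative assumptions, not part of any specification.

```python
import zlib

def deflate(data: bytes, wbits: int) -> bytes:
    """Compress data with DEFLATE; wbits selects the wrapper zlib emits."""
    comp = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=wbits)
    return comp.compress(data) + comp.flush()

payload = b"container versus algorithm " * 200

raw = deflate(payload, -15)        # raw DEFLATE bitstream (RFC 1951), no wrapper
zlib_blob = deflate(payload, 15)   # zlib container (RFC 1950)
gzip_blob = deflate(payload, 31)   # gzip container (RFC 1952)

# Strip each wrapper and decode what remains as raw DEFLATE.
from_zlib = zlib.decompress(zlib_blob[2:-4], wbits=-15)    # 2-byte header, 4-byte trailer
from_gzip = zlib.decompress(gzip_blob[10:-8], wbits=-15)   # 10-byte header, 8-byte trailer

print(len(raw), len(zlib_blob), len(gzip_blob))
```

Stripping the wrapper by fixed offsets only works here because zlib writes a gzip header with no optional fields; a general gzip parser has to honour the flags byte.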

Figure: Compression ratio comparison between Golomb and Rice coding variants. Ratio differences reflect input entropy characteristics. Source: Wikimedia Commons (CC BY-SA).

gzip

gzip (RFC 1952) is a single-file compressor originally developed for GNU systems. Its container structure consists of a 10-byte header (magic number 0x1f 0x8b, compression method byte, flags, modification timestamp, extra flags, OS ID), followed by DEFLATE-compressed data, and ending with a CRC-32 checksum and the original uncompressed size (modulo 2³²).
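The fixed header and trailer described above can be read with the struct module. This is a sketch that assumes a member with no optional header fields (no file name, comment, or extra field), which is what gzip.compress produces; the payload is made up for the example.

```python
import gzip
import struct
import zlib

payload = b"hello, gzip container\n" * 50
blob = gzip.compress(payload, mtime=0)   # mtime=0 keeps the output reproducible

# Fixed 10-byte header: magic, method, flags, mtime, extra flags, OS id.
magic, method, flags, mtime, xfl, os_id = struct.unpack("<2sBBIBB", blob[:10])

# 8-byte trailer: CRC-32 of the uncompressed data, then its length mod 2**32.
crc32, isize = struct.unpack("<II", blob[-8:])

print(magic.hex(), method, hex(crc32), isize)
```

All multi-byte fields in gzip are little-endian, hence the "<" prefix in both format strings.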

The HTTP/1.1 Content-Encoding: gzip mechanism (defined in RFC 7231, since superseded by RFC 9110) is the most widely deployed form of in-transit compression. Web servers compress response bodies with gzip before transmission; clients decompress in memory. Although Brotli and Zstandard typically offer better ratios, gzip remains the common baseline because virtually every HTTP client and server has supported it since the late 1990s. The Accept-Encoding request header lets clients advertise which encodings they support.
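A simplified view of the negotiation is sketched below. The function choose_encoding and the server preference order are hypothetical; the sketch treats any coding with a non-zero q-value as acceptable and lets the server's own ordering decide, which is one reasonable policy rather than the only one.

```python
def choose_encoding(accept_encoding: str,
                    supported=("zstd", "br", "gzip")) -> str:
    """Return the first server-preferred coding the client accepts."""
    weights = {}
    for item in accept_encoding.split(","):
        token, _, params = item.strip().partition(";")
        token = token.strip().lower()
        if not token:
            continue
        q = 1.0                                  # default weight when q is absent
        params = params.strip().lower()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0                          # treat a malformed weight as a refusal
        weights[token] = q
    for coding in supported:
        # Fall back to the "*" wildcard when the coding is not listed explicitly.
        if weights.get(coding, weights.get("*", 0.0)) > 0:
            return coding
    return "identity"                            # send the body uncompressed

print(choose_encoding("gzip, deflate, br;q=0.8"))
```

A production server would also rank codings by q-value and honour an explicit identity;q=0, both of which this sketch omits.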

zlib and the zlib wrapper

zlib (RFC 1950) uses the same DEFLATE algorithm but a different, simpler container: a 2-byte CMF/FLG header, DEFLATE data, and an Adler-32 checksum (cheaper to compute in software than CRC-32). The zlib format appears wherever data must be embedded within a larger structure rather than stored as a standalone file: PNG image data and PDF FlateDecode stream objects both use zlib-wrapped DEFLATE internally. HTTP/2 header compression (HPACK) is not an example: it uses static Huffman coding and indexed header tables rather than DEFLATE.
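The two header bytes and the trailer of a zlib stream can be inspected directly. The sketch below uses a made-up payload and assumes a stream without a preset dictionary.

```python
import struct
import zlib

payload = b"embedded stream " * 64
blob = zlib.compress(payload)

cmf, flg = blob[0], blob[1]
method = cmf & 0x0F                        # 8 means DEFLATE
window_bits = (cmf >> 4) + 8               # log2 of the LZ77 window size
header_ok = ((cmf << 8) | flg) % 31 == 0   # FCHECK makes the pair a multiple of 31

(adler,) = struct.unpack(">I", blob[-4:])  # Adler-32 trailer is big-endian

print(method, window_bits, header_ok, hex(adler))
```

Unlike gzip, the zlib trailer is stored most-significant byte first, and it carries no length field.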

ZIP Archive Format

The ZIP format, documented in PKWARE's Application Note (APPNOTE.TXT), differs fundamentally from gzip: it is a multi-file archive format where each member file has its own local file header and independently compressed data section. A central directory at the end of the archive lists all members with their offsets, enabling random access to individual files without decompressing the entire archive.
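Random access through the central directory can be seen with the standard zipfile module. The archive below is built in memory and its member names are invented for the example.

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for i in range(3):
        zf.writestr(f"member{i}.txt", f"contents of member {i}\n" * 100)

with zipfile.ZipFile(buf) as zf:
    # The central directory records where each local file header starts.
    offsets = {info.filename: info.header_offset for info in zf.infolist()}
    # Reading one member seeks to its offset; the others are not decompressed.
    middle = zf.read("member1.txt")

print(offsets)
```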

ZIP Compression Method	ID	Notes
Stored (no compression)	0	Used for already-compressed files (PNG, JPEG)
DEFLATE	8	Most common; supported by all implementations
DEFLATE64	9	Extended window size; limited tool support
BZIP2	12	Better ratio than DEFLATE; slower
LZMA	14	High ratio; used by 7-Zip
Zstandard	93	Added in recent APPNOTE versions
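Because the method is recorded per member, a single archive can mix several of the IDs in the table. The sketch below uses the four methods that Python's zipfile module has long exposed; DEFLATE64 is not among them, and Zstandard support depends on the Python version. File names and contents are illustrative.

```python
import io
import zipfile

methods = {
    "stored.bin": zipfile.ZIP_STORED,      # ID 0
    "deflate.txt": zipfile.ZIP_DEFLATED,   # ID 8
    "bzip2.txt": zipfile.ZIP_BZIP2,        # ID 12
    "lzma.txt": zipfile.ZIP_LZMA,          # ID 14
}

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for name, method in methods.items():
        zf.writestr(name, b"sample data " * 500, compress_type=method)

with zipfile.ZipFile(buf) as zf:
    recorded = {info.filename: info.compress_type for info in zf.infolist()}
    sizes = {info.filename: info.compress_size for info in zf.infolist()}

print(recorded)
print(sizes)
```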

ZIP64 extensions, introduced to remove the original 4 GiB per-file and 65,535-file limits, use 64-bit size and offset fields and an extended end-of-central-directory record. Any archive or member exceeding 4 GiB, or any archive with more than 65,535 entries, requires ZIP64. Current tools on the major operating systems support ZIP64, but some legacy tools do not.

bzip2

bzip2 combines the Burrows-Wheeler Transform (BWT), Move-to-Front (MTF) encoding, run-length encoding, and Huffman coding. The BWT permutes each block so that bytes which precede similar contexts end up next to one another, which tends to produce long runs of identical bytes; MTF and run-length encoding turn those runs into a stream dominated by small values that Huffman coding compresses well. bzip2 typically produces smaller output than DEFLATE, at the cost of higher CPU usage and a larger memory footprint during both compression and decompression.
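The first two stages can be illustrated on a toy input. This is a naive teaching sketch, not bzip2's implementation: it sorts all rotations explicitly and marks the end of the text with a "$" sentinel, whereas bzip2 stores an index to the original row and uses a far more efficient sort.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler Transform: last column of the sorted rotations."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def move_to_front(text: str) -> list[int]:
    """Replace each symbol by its position in a recency-ordered alphabet."""
    alphabet = sorted(set(text))
    ranks = []
    for ch in text:
        idx = alphabet.index(ch)
        ranks.append(idx)
        alphabet.insert(0, alphabet.pop(idx))   # most recent symbol moves to the front
    return ranks

transformed = bwt("banana")
ranks = move_to_front(transformed)
print(transformed, ranks)
```

On realistic inputs the BWT output contains many repeated symbols, so the MTF ranks are mostly zeros and small numbers, which is what makes the final Huffman stage effective.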

The bzip2 format compresses streams in blocks of up to 900 KB. Each block is independently compressed, which allows parallel decompression (tools like pbzip2 exploit this) and partial recovery of corrupted archives. The .tar.bz2 (or .tbz2) combination pairs the POSIX tar archive container with bzip2 compression and is common in Linux source tarballs.
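The block size is chosen by the compression level and is visible in the stream header. A short sketch with Python's bz2 module, using an arbitrary payload:

```python
import bz2

payload = b"block-oriented compression " * 1000

small_blocks = bz2.compress(payload, compresslevel=1)   # 100 kB blocks
large_blocks = bz2.compress(payload, compresslevel=9)   # 900 kB blocks

# The 4-byte stream header is "BZh" followed by the block-size digit 1-9.
headers = (small_blocks[:4], large_blocks[:4])
print(headers)
```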

xz / LZMA2

The xz format uses LZMA2, a container layer around LZMA (the Lempel-Ziv-Markov chain algorithm) that adds chunking, the ability to store incompressible chunks uncompressed, and better support for multithreading. LZMA achieves very high compression ratios by combining a large dictionary (the xz encoder accepts up to 1.5 GiB), a range coder (a variant of arithmetic coding), and careful modelling of the LZ77 match context. The xz format adds an integrity check (CRC-64 by default in the xz tool; CRC-32 and SHA-256 are also defined), support for multiple compression blocks (enabling parallel processing), and an index structure at the end of the file for random access.
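Python's lzma module exposes the container, the integrity check, and the LZMA2 filter parameters separately. The sketch below picks a 1 MiB dictionary purely for illustration.

```python
import lzma

payload = b"xz container with an LZMA2 filter chain\n" * 2000

# A custom filter chain: LZMA2 with preset 6 but a smaller dictionary.
filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": 1 << 20}]
blob = lzma.compress(payload, format=lzma.FORMAT_XZ,
                     check=lzma.CHECK_CRC64, filters=filters)

has_xz_magic = blob[:6] == b"\xfd7zXZ\x00"   # stream header magic bytes
restored = lzma.decompress(blob)

print(len(payload), len(blob), has_xz_magic)
```

A larger dictionary lets the encoder find matches further back, at the cost of more memory for both compression and decompression.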

The Linux kernel source tarball is distributed as .tar.xz. Debian uses xz for its binary packages because the higher compression ratio reduces download size; Fedora and Arch Linux did the same before moving to Zstandard for faster decompression. xz decompression speed is reasonable, though significantly slower than zstd at equivalent compression ratios.

Zstandard Frame Format

A Zstandard compressed file consists of one or more frames. Each frame begins with a magic number (0xFD2FB528, stored little-endian), a frame header encoding the window size and optional content size, and then blocks of compressed data. A flag in the frame header descriptor byte indicates whether a content checksum (the low 32 bits of an XXH64 hash of the decompressed content) is appended. The skippable frame type (magic 0x184D2A50–0x184D2A5F) allows arbitrary user data to be embedded in the stream without the decompressor interpreting it.
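The magic numbers are enough to tell the two frame types apart. The helpers classify_frame and make_skippable_frame below are hypothetical names for a sketch that works on raw bytes only and does not decode any compressed data.

```python
import struct

ZSTD_MAGIC = 0xFD2FB528
SKIPPABLE_FIRST, SKIPPABLE_LAST = 0x184D2A50, 0x184D2A5F

def classify_frame(blob: bytes) -> str:
    """Identify a frame by its 4-byte little-endian magic number."""
    if len(blob) < 4:
        return "unknown"
    (magic,) = struct.unpack_from("<I", blob)
    if magic == ZSTD_MAGIC:
        return "zstd"
    if SKIPPABLE_FIRST <= magic <= SKIPPABLE_LAST:
        return "skippable"
    return "unknown"

def make_skippable_frame(user_data: bytes, variant: int = 0) -> bytes:
    """Build a skippable frame: magic, 4-byte little-endian size, user data."""
    if not 0 <= variant <= 15:
        raise ValueError("variant must be in 0..15")
    return struct.pack("<II", SKIPPABLE_FIRST + variant, len(user_data)) + user_data

frame = make_skippable_frame(b"index metadata")
print(classify_frame(frame), frame[:8].hex())
```

On disk the Zstandard magic therefore appears as the byte sequence 28 B5 2F FD.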

Zstandard's seekable format extension splits the data into independently compressed frames and appends a seek table, stored in a skippable frame at the end of the file, that records the compressed and decompressed size of each frame. A reader can then map an uncompressed offset to the frame that contains it, enabling random access into large compressed files without full decompression. This is particularly useful for compressed log storage and database backup files.
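The lookup that such a table enables amounts to a binary search over cumulative sizes. The table contents and the function locate below are invented for illustration and do not reproduce the on-disk layout of the seekable format.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical table: (compressed_size, decompressed_size) for each frame.
seek_table = [(400, 1000), (350, 1000), (420, 1000), (90, 250)]

# Running totals give the start offset of every frame in both domains.
comp_starts = [0, *accumulate(c for c, _ in seek_table)]
decomp_starts = [0, *accumulate(d for _, d in seek_table)]

def locate(uncompressed_offset: int) -> tuple[int, int, int]:
    """Return (frame index, compressed start offset, offset inside the frame)."""
    if not 0 <= uncompressed_offset < decomp_starts[-1]:
        raise ValueError("offset outside the stored data")
    frame = bisect_right(decomp_starts, uncompressed_offset) - 1
    return frame, comp_starts[frame], uncompressed_offset - decomp_starts[frame]

print(locate(2500))
```

A reader would seek to the returned compressed offset, decompress that one frame, and discard the leading bytes up to the offset inside the frame.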

Media Formats: Lossy Compression

Image and video compression uses lossy algorithms that discard perceptually redundant information. The human visual system is less sensitive to high-frequency detail and to colour variation than to luminance variation, a property that JPEG, WebP, and modern video codecs exploit.

JPEG / JFIF

JPEG (ISO/IEC 10918) divides the image into 8×8 pixel blocks, applies the Discrete Cosine Transform (DCT) to convert spatial data to frequency coefficients, quantises the coefficients by dividing each by the corresponding entry of a quantisation matrix and rounding (which zeroes small high-frequency values), and then entropy-codes the result with Huffman or arithmetic coding. The quantisation step is the lossy element: finer quantisation matrices produce larger files with fewer visible artefacts; coarser matrices produce smaller files with blocking artefacts (often noticeable at encoder quality settings below roughly 60, although the scale is encoder-specific).
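The transform and quantisation steps can be sketched in pure Python. This is a deliberately slow, direct evaluation of the 8×8 DCT-II, and it quantises with a single uniform step (an assumption made for brevity) rather than a per-frequency matrix.

```python
import math

N = 8

def dct2(block):
    """Direct 8x8 forward DCT-II with the scaling used by JPEG."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * total
    return out

def quantise(coeffs, step):
    """Divide by a uniform step and round; this is where information is lost."""
    return [[round(value / step) for value in row] for row in coeffs]

# A smooth horizontal gradient, already level-shifted to be centred on zero.
block = [[(y - 3.5) * 8 for y in range(N)] for _ in range(N)]
coarse = quantise(dct2(block), step=40)
nonzero = sum(1 for row in coarse for value in row if value != 0)
print(nonzero, "of", N * N, "coefficients survive quantisation")
```

Smooth blocks concentrate their energy in a few low-frequency coefficients, so most quantised values are zero and cost almost nothing to entropy-code.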

JPEG 2000 (ISO/IEC 15444) replaces the DCT with the Discrete Wavelet Transform, generally providing better quality at the same file size and avoiding blocking artefacts. It is used in digital cinema (the DCI specification for Digital Cinema Packages) and medical imaging (DICOM), but its complexity prevented widespread web adoption.

WebP

WebP was developed by Google as a web-optimised image format. Its lossy mode uses the VP8 video codec's intra-frame prediction and DCT-based residual coding; its lossless mode uses a custom LZ77-derivative with a 2D spatial predictor and an entropy coding scheme based on canonical Huffman. WebP lossless typically produces smaller files than PNG for photographic content. WebP is supported in all major browsers and is now commonly used in content delivery networks to serve optimised images based on the browser's Accept header.

AV1 and AVIF

AV1 is a royalty-free video codec developed by the Alliance for Open Media (AOM) and published in 2018. It employs a large toolkit of coding tools: superblock partitioning, multiple intra-prediction modes, compound prediction, loop filters, and a multi-symbol adaptive arithmetic coder for entropy coding. AVIF is the image format that encodes still images as AV1 intra frames stored in a HEIF container, which is built on ISOBMFF (ISO/IEC 14496-12). AVIF is commonly reported to reach the same visual quality as JPEG at 30–50% smaller file sizes, although the saving depends on content and encoder settings.
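An ISOBMFF file is a sequence of boxes, each starting with a 32-bit big-endian size and a four-character type, and the leading ftyp box names the brands the file conforms to. The sketch below parses such a box from a hand-built header (not taken from a real file); the function name parse_ftyp is hypothetical, and boxes that use the 64-bit size extension are not handled.

```python
import struct

def parse_ftyp(blob: bytes) -> dict:
    """Parse a leading 'ftyp' box: major brand, minor version, compatible brands."""
    size, box_type = struct.unpack_from(">I4s", blob, 0)
    if box_type != b"ftyp" or size < 16 or size > len(blob):
        raise ValueError("no leading ftyp box found")
    major_brand, minor_version = struct.unpack_from(">4sI", blob, 8)
    compatible = [blob[i:i + 4] for i in range(16, size, 4)]
    return {"major": major_brand, "minor": minor_version, "compatible": compatible}

# Hand-built example header using the 'avif' brand.
body = b"avif" + struct.pack(">I", 0) + b"avif" + b"mif1" + b"miaf"
sample = struct.pack(">I4s", 8 + len(body), b"ftyp") + body

info = parse_ftyp(sample)
print(info)
```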

The ISO/IEC 23000-22 standard (MIAF, Multi-Image Application Format) specifies how AVIF and other image-sequence formats can embed multiple images, thumbnails, and metadata (EXIF, XMP) in a single file. German and European public broadcast media organisations (ARD, ZDF) are evaluating AVIF for web distribution of editorial photography.

Integrity Checking in Container Formats

Most container formats include a checksum or hash of the uncompressed data to detect corruption. The choice of checksum reflects a trade-off between speed and detection strength:

Common checksums in compression containers

CRC-32 (gzip, ZIP): Fast hardware-accelerated computation; 32-bit, detects single burst errors up to 32 bits long.
Adler-32 (zlib): Faster than CRC-32 on some architectures; weaker error detection for small inputs.
CRC-64 (xz): 64-bit; substantially lower collision probability than CRC-32 for large data.
XXH64 (Zstandard): Non-cryptographic hash; extremely fast, designed for bulk data integrity verification. Zstandard stores only the low 32 bits.
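The two 32-bit checksums are available in Python's zlib module, which makes it easy to see that both react to a single flipped bit. The test message and the damaged byte position are arbitrary choices.

```python
import zlib

crc_check = zlib.crc32(b"123456789")     # conventional CRC-32 check input
adler_check = zlib.adler32(b"Wikipedia")

original = bytearray(b"compressed containers carry checksums" * 10)
damaged = bytearray(original)
damaged[17] ^= 0x04                      # flip a single bit

crc_detects = zlib.crc32(bytes(original)) != zlib.crc32(bytes(damaged))
adler_detects = zlib.adler32(bytes(original)) != zlib.adler32(bytes(damaged))

print(hex(crc_check), hex(adler_check), crc_detects, adler_detects)
```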

None of these checksums are cryptographic; they detect accidental corruption but not deliberate tampering. For integrity verification in security contexts, formats combine compression checksums with separate cryptographic signatures or MACs.
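The difference shows up as soon as someone rewrites the data on purpose: the container checksum is recomputed along with the payload, while a keyed MAC cannot be. The sketch below uses HMAC-SHA-256 from the standard library; the key and the helper verify are illustrative assumptions, and key management is out of scope.

```python
import gzip
import hashlib
import hmac

key = b"demo-key-not-for-production"     # hypothetical shared secret
blob = gzip.compress(b"payload to protect\n" * 20, mtime=0)
tag = hmac.new(key, blob, hashlib.sha256).digest()

def verify(candidate: bytes, expected_tag: bytes) -> bool:
    """Recompute the MAC and compare it in constant time."""
    actual = hmac.new(key, candidate, hashlib.sha256).digest()
    return hmac.compare_digest(actual, expected_tag)

# A forged archive carries a perfectly valid CRC-32 for its own contents,
# so gzip accepts it, but it fails MAC verification without the key.
forged = gzip.compress(b"tampered payload\n" * 20, mtime=0)
forged_opens = gzip.decompress(forged).startswith(b"tampered")

print(verify(blob, tag), forged_opens, verify(forged, tag))
```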

References: RFC 1950 (zlib), RFC 1951 (DEFLATE), RFC 1952 (gzip), PKWARE APPNOTE.TXT, ISO/IEC 10918-1 (JPEG), ISO/IEC 15444-1 (JPEG 2000), AOM AV1 Bitstream Specification, ISO/IEC 14496-12 (ISOBMFF).