Entropy and the Limits of Compression

Shannon's source coding theorem, published in 1948, established a theoretical lower bound on lossless compression. Given a source emitting symbols with known probabilities, the minimum average code length approaches the Shannon entropy H(X) = −∑ p(x) log₂ p(x). No lossless compressor can produce output smaller than this bound on average.

In practice, real compressors do not know the true source distribution in advance. They either model it adaptively from the input data itself or use a pre-trained static model. The difference between the two strategies explains much of the design space for modern compression libraries.

Entropy is measured in bits per symbol. Highly repetitive data, such as a text file containing many common English words, has lower entropy than random binary data. A file filled with cryptographically random bytes is incompressible: any lossless encoding will produce output at least as large as the input.

Huffman coding tree showing variable-length codes assigned to symbols by frequency
Huffman coding tree: more frequent symbols receive shorter bit codes. Source: Wikimedia Commons (CC BY-SA).

Huffman Coding

David Huffman's 1952 algorithm produces a provably optimal prefix-free code for a known symbol probability distribution. It builds a binary tree bottom-up: the two least-probable symbols are repeatedly merged into a parent node until a single root remains. Traversing the tree from root to leaf assigns a bit string to each symbol, with shorter paths given to more frequent symbols.

Canonical Huffman coding, used in DEFLATE and many other formats, imposes an additional constraint: codes of the same length are assigned in lexicographic order. This makes it possible to reconstruct the code table from just the symbol lengths, saving overhead in the compressed stream.

Key properties of Huffman coding

  • Produces a prefix-free code: no codeword is a prefix of another
  • Optimal for symbol-by-symbol coding with known probabilities
  • Cannot achieve the entropy bound when symbol lengths are non-integer (arithmetic coding does)
  • O(n log n) construction time using a priority queue

Arithmetic Coding

Arithmetic coding encodes an entire message as a single number in the interval [0, 1). Each symbol narrows the interval proportionally to its probability. The final interval can be represented with a precision determined by the total message length, and any value within it can be decoded unambiguously.

Unlike Huffman coding, arithmetic coding can represent fractional bits per symbol, allowing it to reach the entropy limit more closely. This makes it superior for sources with highly skewed distributions. The ANS (Asymmetric Numeral Systems) family of algorithms, introduced by Jaroslaw Duda in 2013, achieves near-arithmetic-coding efficiency at speeds comparable to Huffman coding, and is now used in Zstandard and some AV1 implementations.

LZ77 and Sliding-Window Compression

Abraham Lempel and Jacob Ziv published two dictionary-based compression schemes in 1977 and 1978. LZ77 uses a sliding window over previously encoded data as an implicit dictionary. When the encoder encounters a string it has seen recently, it emits a reference (offset, length) pair pointing back to the earlier occurrence instead of the literal bytes. This exploits repetition in data without requiring prior knowledge of symbol probabilities.

LZ78 algorithm example showing dictionary-based compression steps
LZ78 compression step-by-step: the algorithm builds a dictionary of encountered substrings. Source: Wikimedia Commons (CC BY-SA).

LZ78, the 1978 variant, builds an explicit dictionary during encoding. Each new entry adds a previously-seen phrase plus one new character. LZ78 is the basis for the LZW (Lempel-Ziv-Welch) algorithm used in GIF and early TIFF files.

DEFLATE

DEFLATE, specified in RFC 1951 (1996), combines LZ77 sliding-window matching with canonical Huffman coding. The encoder scans the input, resolves back-references up to 32 KB back, and then entropy-codes the resulting sequence of literals and (distance, length) pairs using two Huffman trees: one for literals/lengths and one for distances.

DEFLATE output is divided into blocks. Each block carries its own Huffman trees, encoded with a third-level Huffman tree (to compress the code-length lists themselves). This three-level scheme adds complexity but improves compression for short or heterogeneous inputs. DEFLATE is the core algorithm in ZIP, gzip, zlib, and PNG.

FormatAlgorithmIETF / ISO StandardTypical Use
gzipDEFLATE + CRC-32RFC 1952Linux package archives, HTTP transport
ZIPDEFLATE (method 8)PKWARE APPNOTEGeneral-purpose archives on all platforms
PNGDEFLATE via zlibISO/IEC 15948Lossless raster images
zlibDEFLATE wrapperRFC 1950In-memory compression API

Brotli

Brotli was developed at Google and standardised in RFC 7932 (2016). It extends the LZ77 approach with a larger context window (up to 16 MiB), a pre-defined static dictionary of 120,512 common web-content tokens (HTML, CSS, JavaScript phrases), and an ANS-based entropy coder. The pre-trained dictionary means Brotli can achieve better compression than DEFLATE on typical web assets without increasing encoding time proportionally.

Brotli is supported natively in all major browsers and web servers. The Content-Encoding: br HTTP header signals Brotli-compressed responses. At quality level 11 (maximum), Brotli produces the smallest output among general-purpose compressors for HTML, CSS, and JSON, though encoding time increases substantially compared to lower quality levels.

Zstandard

Zstandard (zstd), standardised in RFC 8878 (2021), was developed at Meta (Facebook) with an explicit goal of matching or beating zlib compression ratios at much higher speeds. It uses a combination of LZ77 matching, a finite state entropy (FSE) coder derived from ANS, and optionally a user-supplied or pre-trained dictionary.

Zstd compression levels range from 1 (fastest) to 22 (smallest output). At level 3 (default), it typically achieves compression ratios comparable to gzip at three to five times the throughput on modern hardware. Facebook, Linux distributions, and several database systems use zstd for storage compression. The .zst frame format carries a content checksum and an optional dictionary ID for streaming decompression.

Context Modelling and PAQ-Family Algorithms

Context-mixing compressors such as PAQ and ZPAQ model the probability of the next bit given a large set of recent context bits. They blend predictions from multiple models using a neural network or logistic regression. These compressors achieve compression ratios well beyond DEFLATE or Zstandard but require minutes to compress a single megabyte. They are used primarily for archival compression benchmarks rather than operational workflows.

Standards and Adoption in Germany

German IT infrastructure follows international standards for compression. The Bundesamt für Sicherheit in der Informationstechnik (BSI) publishes technical guidelines (TR) that reference RFC-defined compression formats in contexts such as e-government document exchange and cryptographic container formats. The German administrative data exchange format XTA uses ZIP containers based on DEFLATE for transferring structured XML payloads between public authorities.

DIN, Germany's national standards body, adopts relevant ISO/IEC standards for compression and storage. ISO/IEC 21778:2017, which standardises JSON binary representation (CBOR), intersects with compression in contexts where schema-based models outperform general-purpose algorithms.


References: RFC 1951, RFC 7932, RFC 8878, Shannon (1948) "A Mathematical Theory of Communication", Huffman (1952) "A Method for the Construction of Minimum-Redundancy Codes", Ziv & Lempel (1977) "A Universal Algorithm for Sequential Data Compression".