Huffman Coding

Huffman and Adaptive Huffman coding explained.

Binary Codes

One thing I enjoy about tracking progress in machine learning research is that it is full of simple yet powerful ideas. Ideas that can be summarized in a few sentences and easily understood without much additional context or preliminary knowledge. Of course this is an oversimplification: there are lots of details that need to be done right for an idea to work, and a lot of experiments to be conducted to find the right ideas when searching over the infinite space of all simple ideas. Nonetheless, it is uplifting that simple ideas are still regularly discovered after decades of research.

In that sense, it feels similar to the beginning of information theory. In 1948 Shannon introduced entropy and proved that it is a lower bound on expected code length if the code is uniquely decodable. Additionally, in order to show that we can get arbitrarily close to entropy, Shannon described an efficient coding algorithm.

Shannon Coding

Let A = {a1, …, an} be an input alphabet with probabilities of occurrence p1, …, pn. We assume that the symbols are sorted in decreasing order of probability, p1 ≥ … ≥ pn. At the beginning we set P ← 0. For every symbol ai:

  1. set the codeword code(ai) of symbol ai to the first ⌈−log2(pi)⌉ bits of the binary expansion of P,
  2. set P ← P + pi.

Let's consider an example with an alphabet A = {a, b, c, d, e} and the corresponding probabilities 0.39, 0.2, 0.18, 0.13 and 0.1. The lower bound for the average code length is the entropy, H(A) ≈ 2.15.

ai   pi     ⌈−log2(pi)⌉    Pi                      code(ai)
a    0.39   ⌈1.35…⌉ = 2    0.00 = 0.00000…(2)      00
b    0.2    ⌈2.32…⌉ = 3    0.39 = 0.01100…(2)      011
c    0.18   ⌈2.47…⌉ = 3    0.59 = 0.10010…(2)      100
d    0.13   ⌈2.94…⌉ = 3    0.77 = 0.11000…(2)      110
e    0.1    ⌈3.32…⌉ = 4    0.90 = 0.11100…(2)      1110

The average code length is 2.71, but we can see that the code is far from optimal as some of the trailing bits are redundant and could be dropped.
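To make the procedure concrete, here is a minimal Python sketch of Shannon coding (my own illustration, not the code behind the demos). It reproduces the codewords from the table above along with the average code length and the entropy.

```python
import math

def shannon_codes(probs):
    """Shannon coding; probs maps symbols to probabilities, sorted by non-increasing probability."""
    codes, P = {}, 0.0
    for symbol, p in probs.items():
        length = math.ceil(-math.log2(p))     # codeword length ⌈−log2(pi)⌉
        bits, frac = [], P
        for _ in range(length):               # first `length` bits of the binary expansion of P
            frac *= 2
            bit = int(frac)
            bits.append(str(bit))
            frac -= bit
        codes[symbol] = ''.join(bits)
        P += p                                # P ← P + pi
    return codes

probs = {'a': 0.39, 'b': 0.2, 'c': 0.18, 'd': 0.13, 'e': 0.1}
codes = shannon_codes(probs)   # {'a': '00', 'b': '011', 'c': '100', 'd': '110', 'e': '1110'}
average = sum(p * len(codes[s]) for s, p in probs.items())    # 2.71
entropy = -sum(p * math.log2(p) for p in probs.values())      # ≈ 2.15
```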

Fano Coding

Fano's method is best described as building a binary tree in which edges are labeled with 0s (left edges) and 1s (right edges). Input symbols are represented by leaves. The codeword of each symbol is uniquely determined by the edge labels on the path from the root to the leaf corresponding to that symbol. You can see an example of such a tree in the demo below. Assuming that the input alphabet is sorted as before, to build a tree representing Fano codes we run the following procedure:

  1. if there's only one symbol, return a tree consisting of a single node labeled by that symbol,
  2. otherwise find a position i such that p1 + p2 + … + pi is as close as possible to pi+1 + pi+2 + … + pn,
  3. recursively build trees for the two subsets of the symbols,
  4. return a new tree with the two trees as subtrees, labeling the left and right edges with 0 and 1, respectively.

Shannon mentions in his paper that his coding "is substantially the same as one found independently by R. M. Fano", but in practice the methods may perform differently. Let's consider the previous example. We first split {a, b, c, d, e} into {a, b} (codes starting with 0) and {c, d, e} (codes starting with 1) with p1 + p2 = 0.59 and p3 + p4 + p5 = 0.41. The first subset {a, b} is further split into a and b, hence code(a) = 00 and code(b) = 01. The second subset {c, d, e} is split into c (which gets code(c) = 10) and {d, e}. The final split yields code(d) = 110 and code(e) = 111. The average code length is 2.23.
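A rough Python sketch of this recursive splitting (again only an illustration, not the demo code): on the example alphabet it yields the codes 00, 01, 10, 110, 111 and the average length 2.23.

```python
def fano(symbols, probs):
    """Recursive Fano coding; symbols are assumed sorted by non-increasing probability."""
    if len(symbols) == 1:
        return {symbols[0]: ''}
    total, acc = sum(probs), 0.0
    best_i, best_diff = 0, float('inf')
    for i in range(len(symbols) - 1):
        acc += probs[i]
        diff = abs(acc - (total - acc))          # how unbalanced this split would be
        if diff < best_diff:
            best_i, best_diff = i, diff
    left = fano(symbols[:best_i + 1], probs[:best_i + 1])    # codes starting with 0
    right = fano(symbols[best_i + 1:], probs[best_i + 1:])   # codes starting with 1
    return {**{s: '0' + c for s, c in left.items()},
            **{s: '1' + c for s, c in right.items()}}

symbols, probs = list('abcde'), [0.39, 0.2, 0.18, 0.13, 0.1]
codes = fano(symbols, probs)   # {'a': '00', 'b': '01', 'c': '10', 'd': '110', 'e': '111'}
average = sum(p * len(codes[s]) for s, p in zip(symbols, probs))   # 2.23
```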

Huffman Coding

Both the Shannon and Fano methods (the two are sometimes indiscriminately called Shannon–Fano coding) are efficient, but not optimal. Interestingly, one simple idea was still waiting to be discovered. According to Wikipedia:

In 1951, David A. Huffman and his MIT information theory classmates were given the choice of a term paper or a final exam. The professor, Robert M. Fano, assigned a term paper on the problem of finding the most efficient binary code. Huffman, unable to prove any codes were the most efficient, was about to give up and start studying for the final when he hit upon the idea of using a frequency-sorted binary tree and quickly proved this method the most efficient.

In doing so, Huffman outdid Fano, who had worked with information theory inventor Claude Shannon to develop a similar code. Building the tree from the bottom up guaranteed optimality, unlike the top-down approach of Shannon–Fano coding.

The Huffman coding algorithm builds a tree as follows. For each alphabet symbol ai we create a root node with label ai and probability pi. As long as there are at least two root nodes:

  1. pick the two root nodes with the smallest probabilities,
  2. replace them by a tree with the two nodes as subtrees and probability equal to the sum of probabilities of the two nodes.
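One compact way to implement this, sketched below in Python (an illustration rather than the demo's implementation), is to keep the partial subtrees in a priority queue and track each symbol's code directly instead of building explicit tree nodes.

```python
import heapq, itertools

def huffman_codes(probs):
    """Build Huffman codes with a min-heap; probs maps symbols to probabilities (or weights)."""
    counter = itertools.count()   # tie-breaker so the heap never compares the code dicts
    heap = [(p, next(counter), {s: ''}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # the two roots with the smallest probabilities
        p2, _, right = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in left.items()}       # left subtree gets prefix 0
        merged.update({s: '1' + c for s, c in right.items()}) # right subtree gets prefix 1
        heapq.heappush(heap, (p1 + p2, next(counter), merged))
    return heap[0][2]

probs = {'a': 0.39, 'b': 0.2, 'c': 0.18, 'd': 0.13, 'e': 0.1}
codes = huffman_codes(probs)   # a gets a 1-bit code; b, c, d, e get 3-bit codes
average = sum(p * len(codes[s]) for s, p in probs.items())   # 2.22
```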

In the demo below you can experiment with Huffman encoding and decoding on several trees built from different corpora (click the gear icon to choose one). The occurrences are rounded to the nearest integers for display purposes, so some numbers might not add up. The initial tree is the one built for the A = {a, b, c, d, e} example, and the average code length is 2.22.

[Interactive demo: Huffman encoding and decoding on the A = {a, b, c, d, e} tree; entropy 2.15 bits/symbol, expected code length 2.22 bits/symbol.]

Adaptive Huffman Coding

The Huffman algorithm yields optimal prefix-free codes, but to build them we need to know the distribution of input symbols. The frequencies can be calculated from the input sequence, but this requires two passes over the input. Additionally, the built tree must be encoded as well so that the output can be decoded. One solution to these problems is the elegant idea of adaptive Huffman coding.

The main idea behind adaptive Huffman encoding is quite simple. We start the encoding process with a Huffman tree consisting of exactly one node, representing a special Not-Yet-Transmitted (NYT) symbol. This symbol is used to introduce symbols newly encountered in the input. Let s1, s2, … be the input sequence. To encode the i-th symbol, si:

  1. if si is present in the tree, then output the corresponding code. Otherwise, output the code for the NYT symbol, followed by a binary representation of symbol si,
  2. rebuild the Huffman tree from scratch using the frequencies of the first i symbols and the NYT symbol.

That's basically the idea, but there are some details left. The frequency of the NYT symbol is assumed to be 0, but it is always included when rebuilding the tree, unless all but one symbol of the input alphabet have already been transmitted. The tree must be built in a deterministic way so that the same tree can be recreated when decoding.
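The sketch below spells out this rebuild-from-scratch variant in Python. It is deliberately inefficient and purely illustrative; it reuses the huffman_codes sketch from the Huffman section as the code builder, and the NYT marker and the raw 8-bit fallback for new symbols are assumptions of mine, not details of the demo.

```python
from collections import Counter

NYT = '\x00'   # stand-in for the Not-Yet-Transmitted pseudo-symbol (assumed absent from the input)

def adaptive_encode(text, build_codes):
    """Conceptual adaptive coding: rebuild the code table after every symbol.
    `build_codes` maps {symbol: weight} to {symbol: bitstring} and must be deterministic,
    e.g. the huffman_codes sketch above."""
    counts, out = Counter(), []
    for s in text:
        weights = dict(counts)
        weights[NYT] = 0                   # NYT is always included with weight 0
                                           # (kept even at the end, for simplicity)
        codes = build_codes(weights)
        if s in counts:
            out.append(codes[s])
        else:                              # new symbol: NYT code + raw 8-bit representation
            out.append(codes[NYT] + format(ord(s), '08b'))
        counts[s] += 1                     # the decoder performs exactly the same update
    return ''.join(out)

encoded = adaptive_encode('abracadabra', huffman_codes)
```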

In practice, the tree is never rebuilt from scratch; instead, it is updated after every symbol. In the FGK (Faller–Gallager–Knuth) algorithm each node is additionally assigned a unique id such that if we traverse the tree from bottom to top (level by level) and from left to right (within a level), the ids are in increasing order. Every time a given symbol is encoded, the occurrence counts are increased by 1, starting from the node representing that symbol and going up to the root. Let q be the node corresponding to the encoded symbol (or the newly created node if the symbol was introduced via NYT). We proceed as follows:

  1. from nodes having the same number of occurrences as q find the one with the highest id,
  2. if it is neither q nor the parent of q, swap q with that node,
  3. increase the number of occurrences of q by 1,
  4. if q is not the root, set q ← parent(q) and continue updating.
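For concreteness, here is a rough Python sketch of the FGK bookkeeping on an explicit node structure. It follows the four steps above literally, finds the highest-id node of a weight block with a simple linear scan, and falls back to a raw 8-bit representation for newly introduced symbols. It is an illustration of the update rules, not the demo's implementation.

```python
class Node:
    def __init__(self, symbol=None, weight=0, parent=None):
        self.symbol = symbol            # leaf symbol; None for internal nodes and NYT
        self.weight = weight
        self.parent = parent
        self.left = self.right = None

class FGK:
    def __init__(self):
        self.root = self.nyt = Node()   # the initial tree is just the NYT node
        self.nodes = [self.root]        # ordered by id: bottom-to-top, left-to-right
        self.leaves = {}                # symbol -> leaf node

    def code(self, node):
        bits = []
        while node.parent is not None:
            bits.append('0' if node.parent.left is node else '1')
            node = node.parent
        return ''.join(reversed(bits))

    def encode(self, symbol):
        if symbol in self.leaves:
            node = self.leaves[symbol]
            out = self.code(node)
        else:                            # NYT code followed by the raw symbol (8-bit assumption)
            out = self.code(self.nyt) + format(ord(symbol), '08b')
            node = self._insert(symbol)
        self._update(node)
        return out

    def _insert(self, symbol):
        old = self.nyt                                        # the NYT leaf becomes internal...
        old.left = self.nyt = Node(parent=old)                # ...with a fresh NYT child
        old.right = leaf = Node(symbol=symbol, parent=old)    # ...and a leaf for the new symbol
        self.leaves[symbol] = leaf
        self.nodes[:0] = [old.left, old.right]                # the two new nodes get the smallest ids
        return leaf

    def _update(self, node):
        while node is not None:
            leader = self._leader(node.weight)
            if leader is not node and leader is not node.parent:
                self._swap(node, leader)
            node.weight += 1
            node = node.parent

    def _leader(self, weight):
        # highest-id node with the given weight (linear scan keeps the sketch simple)
        for other in reversed(self.nodes):
            if other.weight == weight:
                return other

    def _swap(self, a, b):
        ia, ib = self.nodes.index(a), self.nodes.index(b)
        self.nodes[ia], self.nodes[ib] = b, a        # swap ids...
        pa, pb = a.parent, b.parent                  # ...and the subtrees' positions in the tree
        a_left, b_left = pa.left is a, pb.left is b
        if a_left: pa.left = b
        else:      pa.right = b
        if b_left: pb.left = a
        else:      pb.right = a
        a.parent, b.parent = pb, pa

coder = FGK()
bits = ''.join(coder.encode(ch) for ch in 'abracadabra')
```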

You can play with FGK in the demo below.

The FGK update rules keep the code adaptive, but they do not yield the most economical tree in every case. The Vitter algorithm modifies the update rules so that the maintained tree additionally has minimal height and external path length, which guarantees at most one extra bit per symbol compared to a static Huffman code.

[Interactive demo: adaptive Huffman (FGK) encoding and decoding, starting from a tree containing only the NYT node.]

The demo above uses UTF-8 encoding to represent newly encountered symbols, so a single symbol can take up to 32 bits. However, if the set A of all possible symbols is known in advance (and available both to the encoder and the decoder), we can encode newly encountered symbols more efficiently. The first new symbol is encoded with around log2(∣A∣) bits, the second with around log2(∣A∣ − 1) bits, and so on. Once ∣A∣ − 2 symbols are encoded, we need a single bit to denote which of the two remaining symbols is encountered next. When it is encountered, the tree is updated as usual and the Not-Yet-Transmitted node NYT is replaced with the last remaining symbol. The encoding of newly encountered symbols is equivalent to encoding a permutation of A, for which we need around ∣A∣ log2(∣A∣) bits (see the random permutation example for details).
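As a small illustration of that bookkeeping (assuming the encoder and decoder agree on an ordering of A; the function name is mine), a newly encountered symbol can be transmitted as its index among the not-yet-seen symbols:

```python
import math

def make_new_symbol_encoder(alphabet):
    """Emit the index of each newly seen symbol among the not-yet-transmitted ones,
    using ceil(log2(remaining)) bits; the last remaining symbol costs 0 bits."""
    remaining = sorted(alphabet)         # both sides must keep this list in sync
    def encode_new(symbol):
        n = len(remaining)
        width = math.ceil(math.log2(n)) if n > 1 else 0
        index = remaining.index(symbol)
        remaining.remove(symbol)
        return format(index, f'0{width}b') if width else ''
    return encode_new

encode_new = make_new_symbol_encoder('abcde')
encode_new('c')   # '010' (3 bits, about log2(5))
encode_new('a')   # '00'  (2 bits, 4 symbols left)
```

Summing these widths over the whole alphabet gives roughly log2(∣A∣!) ≈ ∣A∣ log2(∣A∣) bits, matching the permutation argument above.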