As mentioned in lecture yesterday, the final problem set will deal with data compression. Today we're going to present a simple data compression scheme known as Huffman coding.
Suppose we want to compress a 100,000-byte data file that we know contains only the letters A through F. Since we have only six distinct characters to encode, we can represent each one with three bits rather than the eight bits normally used to store characters:
    Letter      A    B    C    D    E    F
    Codeword    000  001  010  011  100  101
This fixed-length code gives us a compression ratio of 5/8 = 62.5%: the file shrinks from 100,000 bytes to 300,000 bits, or 37,500 bytes. Can we do better?
What if we knew the relative frequencies at which each letter occurred? It would be logical to assign shorter codes to the most frequent letters and save longer codes for the infrequent letters. For example, consider this code:
    Letter          A    B    C    D    E     F
    Frequency (K)   45   13   12   16   9     5
    Codeword        0    101  100  111  1101  1100
Using this code, our file can be represented with
    (45×1 + 13×3 + 12×3 + 16×3 + 9×4 + 5×4) × 1000 = 224,000 bits
or 28,000 bytes, which gives a compression ratio of 72%. In fact, this is an optimal character code for this file (which is not to say that the file is not further compressible by other means).
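To double-check the arithmetic, here is a small sketch in ML that computes the encoded size for both codes. The names table, totalBits, variableBits, and fixedBits are made up for this example, and the frequencies are the ones from the table above (in thousands).

```sml
(* Each entry: (letter, frequency in thousands, codeword as a string of bits). *)
val table =
  [(#"A", 45, "0"),   (#"B", 13, "101"), (#"C", 12, "100"),
   (#"D", 16, "111"), (#"E", 9, "1101"), (#"F", 5, "1100")]

(* Total encoded size in bits: sum of (occurrences * codeword length). *)
fun totalBits (entries : (char * int * string) list) : int =
  List.foldl (fn ((_, f, w), acc) => acc + f * 1000 * String.size w) 0 entries

(* The variable-length code from the table. *)
val variableBits = totalBits table

(* Any three-bit codeword gives the same size, so "000" stands in for all six. *)
val fixedBits = totalBits (List.map (fn (c, f, _) => (c, f, "000")) table)
```

Here variableBits comes out to 224000 and fixedBits to 300000, matching the figures above.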
Notice that in our variable-length code, no codeword is a prefix of any other codeword. For example, we have a codeword 0, so no other codeword starts with 0. And both of our four-bit codewords start with 110, which is not a codeword. Such codes are called prefix codes. Prefix codes are useful because they make a stream of bits unambiguous: we can simply accumulate bits from the stream until we have completed a codeword. (Notice that encoding is simple regardless of whether our code is a prefix code: we just build a dictionary of letters to codewords, look up each letter we're trying to encode, and append the codewords to an output stream.) It turns out that prefix codes can always be used to achieve the optimal compression for a character code, so we're not losing anything by restricting ourselves to this type of character code.
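The dictionary-based encoder and the prefix property described above can be sketched as follows. This is only an illustration: it represents bits as the characters 0 and 1 in a string rather than packing them, and the names code, codeword, encode, and isPrefixCode are hypothetical.

```sml
(* The variable-length code from the table, as an association list. *)
val code =
  [(#"A", "0"),   (#"B", "101"),  (#"C", "100"),
   (#"D", "111"), (#"E", "1101"), (#"F", "1100")]

(* Look up the codeword for one letter; raises Fail for unknown letters. *)
fun codeword (c : char) : string =
  case List.find (fn (c', _) => c' = c) code of
      SOME (_, w) => w
    | NONE => raise Fail ("no codeword for " ^ str c)

(* Encode a string by concatenating the codewords of its letters. *)
fun encode (s : string) : string =
  String.concat (List.map codeword (String.explode s))

(* True if no codeword in the list is a prefix of a different codeword. *)
fun isPrefixCode (ws : string list) : bool =
  List.all (fn w => List.all (fn v => w = v orelse not (String.isPrefix w v)) ws) ws

(* "1100" ^ "0" ^ "100" ^ "1101" *)
val bits = encode "FACE"
```

With this code, encode "FACE" yields "110001001101", and isPrefixCode accepts our codewords but rejects a list such as ["0", "01"].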
When we're decoding a stream of bits using a prefix code, what data structure might we want to use to help us determine whether we've read a whole codeword yet?
One convenient representation is to use a binary tree with the codewords stored in the leaves so that the bits determine the path to the leaf. In our example, the codeword 1100 is found by starting at the root, moving down the right subtree twice and the left subtree twice:
          100
        /     \
      A         55
    [45]     /      \
          25          30
         /  \        /  \
        C    B     14    D
      [12]  [13]  /  \  [16]
                 F    E
                [5]  [9]
Here I've labeled the leaves with their frequencies and the branches with the total frequencies of the leaves in their subtrees. You'll notice that this is a full binary tree: every nonleaf node has two children. This happens to be true of all optimal codes, so we can tell that our fixed-length code is suboptimal by observing its tree:
                        100
                   /           \
              86                  14
            /    \               /
        58          28         14
       /  \        /  \       /  \
      A    B      C    D     E    F
    [45] [13]   [12] [16]   [9]  [5]
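Returning to the decoding question, the tree for the variable-length code can drive a decoder directly: follow one branch per bit, and emit a letter each time a leaf is reached. The sketch below assumes bits arrive as a string of the characters 0 and 1; the names CodeTree, Lf, Nd, and decode are invented for this example and do not come from any library.

```sml
(* A prefix-code tree with letters at the leaves (frequencies are not needed here). *)
datatype CodeTree = Lf of char | Nd of CodeTree * CodeTree

(* The variable-length code: A=0, C=100, B=101, F=1100, E=1101, D=111. *)
val tree =
  Nd (Lf #"A",
      Nd (Nd (Lf #"C", Lf #"B"),
          Nd (Nd (Lf #"F", Lf #"E"), Lf #"D")))

(* Walk down from the root: 0 means left, 1 means right. Reaching a leaf
   completes a codeword, so we record its letter and restart at the root. *)
fun decode (root : CodeTree) (bits : string) : string =
  let
    fun walk (Lf c, bs, acc) = start (bs, c :: acc)
      | walk (Nd (l, _), #"0" :: bs, acc) = walk (l, bs, acc)
      | walk (Nd (_, r), #"1" :: bs, acc) = walk (r, bs, acc)
      | walk (Nd _, [], _) = raise Fail "input ends in the middle of a codeword"
      | walk (Nd _, _ :: _, _) = raise Fail "input contains a non-bit character"
    and start ([], acc) = String.implode (List.rev acc)
      | start (bs, acc) = walk (root, bs, acc)
  in
    case root of
        Lf _ => raise Fail "a one-leaf tree has no codewords to read"
      | Nd _ => start (String.explode bits, [])
  end
```

For example, decode tree "110001001101" walks right, right, left, left to reach F, then reads A, C, and E in turn, giving "FACE".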
Since we can restrict ourselves to full trees, we know that for an alphabet C, we will have a tree with exactly |C| leaves and |C|-1 internal nodes. Given a tree T corresponding to a prefix code, we also can compute the number of bits required to encode a file:
    B(T) = \sum_{c \in C} f(c) \, d_T(c)
where f(c) is the frequency of character c and d_T(c) is the depth of c's leaf in the tree (which is also the length of the codeword for c). We call B(T) the cost of the tree T.
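As a sanity check, the cost and the node counts can be computed by a straightforward recursion over the tree. The sketch below uses the same HTree datatype as the ML code later in these notes; the function names cost, leaves, and internals are hypothetical, and t is the variable-length tree from the earlier diagram with frequencies in thousands.

```sml
datatype HTree = Leaf of char * int | Branch of HTree * int * HTree

(* B(T): the sum over all leaves of frequency * depth. *)
fun cost (t : HTree) : int =
  let
    fun go (Leaf (_, f), d) = f * d
      | go (Branch (l, _, r), d) = go (l, d + 1) + go (r, d + 1)
  in
    go (t, 0)
  end

(* Number of leaves and number of internal (nonleaf) nodes. *)
fun leaves (Leaf _) = 1
  | leaves (Branch (l, _, r)) = leaves l + leaves r

fun internals (Leaf _) = 0
  | internals (Branch (l, _, r)) = 1 + internals l + internals r

val t =
  Branch (Leaf (#"A", 45), 100,
    Branch (Branch (Leaf (#"C", 12), 25, Leaf (#"B", 13)), 55,
            Branch (Branch (Leaf (#"F", 5), 14, Leaf (#"E", 9)), 30,
                    Leaf (#"D", 16))))
```

Here cost t is 224 (that is, 224,000 bits), leaves t is 6, and internals t is 5, in line with the |C| and |C|-1 counts above.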
Huffman invented a simple algorithm for constructing such trees given the set of characters and their frequencies. The algorithm is greedy, which means that it makes choices that are locally optimal.
The algorithm constructs the tree in a bottom-up way. Given a set of leaves containing the characters and their frequencies, we merge the current two subtrees with the smallest frequencies. We perform this merge by creating a parent node labeled with the sum of the frequencies of its two children. Then we repeat this process until we have performed |C|-1 merges to produce a single tree.
As an example, use Huffman's algorithm to construct the tree for our input.
How can we implement Huffman's algorithm efficiently? The operation we need to perform repeatedly is extraction of the two subtrees with the smallest frequencies, so we can use a priority queue. We can express this in ML as:
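    datatype HTree = Leaf of char * int | Branch of HTree * int * HTree

    (* new_heap, insert, and extract_min are assumed to come from a
       priority-queue (heap) module; they are not part of the Basis Library. *)
    fun huffmanTree(alpha : (char * int) list) : HTree =
      let
        val alphasize = length(alpha)
        (* The frequency stored at the root of a subtree. *)
        fun freq(node:HTree):int =
          case node of
            Leaf(_,i) => i
          | Branch(_,i,_) => i
        (* A min-heap of subtrees ordered by frequency. *)
        val q = new_heap (fn (x,y) => Int.compare(freq x, freq y)) alphasize
        (* Perform i merges, then return the single remaining tree. *)
        fun merge(i:int):HTree =
          if i = 0 then extract_min(q)
          else
            let
              val x = extract_min(q)
              val y = extract_min(q)
            in
              insert q (Branch(x, freq(x)+freq(y), y));
              merge(i-1)
            end
      in
        (* No ":unit" after the pattern: in a fn, that would annotate the argument. *)
        app (fn (c:char, i:int) => insert q (Leaf(c,i))) alpha;
        merge(alphasize-1)
      end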
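The code above depends on new_heap, insert, and extract_min, which are not part of the Standard ML Basis Library. If you want something you can run as-is, here is a self-contained sketch that swaps the heap for a list kept sorted by frequency. That makes each insertion linear rather than logarithmic, so it illustrates the merging order but not the efficient implementation. The names insertSorted, mergeAll, and codewords are invented for this example.

```sml
datatype HTree = Leaf of char * int | Branch of HTree * int * HTree

fun freq (Leaf (_, f)) = f
  | freq (Branch (_, f, _)) = f

(* Insert a tree into a list sorted by increasing frequency.
   A new tree goes after existing trees of equal frequency. *)
fun insertSorted (t : HTree, []) = [t]
  | insertSorted (t, x :: xs) =
      if freq t < freq x then t :: x :: xs else x :: insertSorted (t, xs)

(* Repeatedly merge the two lowest-frequency trees until one remains. *)
fun mergeAll [] = raise Fail "empty alphabet"
  | mergeAll [t] = t
  | mergeAll (x :: y :: rest) =
      mergeAll (insertSorted (Branch (x, freq x + freq y, y), rest))

fun huffmanTree (alpha : (char * int) list) : HTree =
  mergeAll (List.foldl (fn (p, q) => insertSorted (Leaf p, q)) [] alpha)

(* Read the codewords off the tree: 0 for a left branch, 1 for a right one. *)
fun codewords (t : HTree) : (char * string) list =
  let
    fun go (Leaf (c, _), path) = [(c, path)]
      | go (Branch (l, _, r), path) = go (l, path ^ "0") @ go (r, path ^ "1")
  in
    go (t, "")
  end

val alpha = [(#"A", 45), (#"B", 13), (#"C", 12), (#"D", 16), (#"E", 9), (#"F", 5)]
val tree = huffmanTree alpha
val table = codewords tree
```

On our input the merges happen in the order F+E = 14, C+B = 25, 14+D = 30, 25+30 = 55, and A+55 = 100, and table lists A=0, C=100, B=101, F=1100, E=1101, D=111, the same code as in the frequency table. With other tie-breaking rules or child orders the codewords can differ while the cost stays the same.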
We won't prove that the result is an optimal prefix tree, but why does this algorithm produce a valid and full prefix tree? We can see that every time we merge two subtrees, we're differentiating the codewords of all of their leaves by prepending a 0 to all the codewords of the left subtree and a 1 to all the codewords of the right subtree. And every nonleaf node has exactly two children by construction.
Let's analyze the running time of this algorithm if our alphabet has n characters. Building the initial queue takes time O(n log n) since each enqueue operation takes O(log n) time. Then we perform n-1 merges, each of which takes time O(log n). Thus Huffman's algorithm takes O(n log n) time.
If we want to compress a file with our current approach, we have to scan through the whole file to tally the frequencies of each character. Then we use the Huffman algorithm to compute an optimal prefix tree, and we scan the file a second time, writing out the codewords of each character of the file. But that's not sufficient. Why? We also need to write out the prefix tree so that the decompression algorithm knows how to interpret the stream of bits.
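The first pass, tallying the frequencies, might look like the following sketch. To keep it short it works on an in-memory string rather than reading a file, which is an assumption of this example; the name tally is hypothetical. The resulting list has the (char * int) list shape that huffmanTree expects.

```sml
(* Count how often each character occurs in s. Returns (character, count)
   pairs for the characters that occur at least once, in increasing
   character order. *)
fun tally (s : string) : (char * int) list =
  let
    (* One counter per possible character code, all starting at zero. *)
    val counts = Array.array (Char.maxOrd + 1, 0)
    fun bump c =
      Array.update (counts, Char.ord c, Array.sub (counts, Char.ord c) + 1)
    (* Walk the counters from the highest code down, consing nonzero entries. *)
    fun collect (i, acc) =
      if i < 0 then acc
      else if Array.sub (counts, i) > 0
      then collect (i - 1, (Char.chr i, Array.sub (counts, i)) :: acc)
      else collect (i - 1, acc)
  in
    List.app bump (String.explode s);
    collect (Char.maxOrd, [])
  end
```

For example, tally "ABACA" gives [(#"A", 3), (#"B", 1), (#"C", 1)].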
So our algorithm has one major potential drawback: we need to scan the whole input file before we can build the prefix tree. For large files, this can take a long time. (Disk access is very slow compared to CPU cycle times.) And in some cases it may be unreasonable; we may have a long stream of data that we'd like to compress, and it could be unreasonable to have to accumulate the data until we can scan it all. We'd like an algorithm that allows us to compress a stream of data without having to see the whole stream before building the prefix tree.
The solution is adaptive Huffman coding, which builds the prefix tree incrementally in such a way that the coding is always optimal for the sequence of characters already seen. We start with a tree that has a frequency of zero for each character. When we read an input character, we increment the frequency of that character (and the frequency in all branches above it). We then may have to modify the tree to maintain the invariant that the least frequent characters are at the greatest depths. Because the tree is constructed incrementally, the decoding algorithm can simply update its copy of the tree after every character is decoded, so we don't need to include the prefix tree along with the compressed data.
Cormen, Leiserson, and Rivest. Introduction to Algorithms.