Chapter 25

Huffman Coding

By Sariel Har-Peled, December 6, 2007¹

25.1 Huffman coding

(This portion of the class notes is based on Jeff Erickson's class notes.)

A binary code assigns a string of 0s and 1s to each character in the alphabet. More generally, a code assigns to each symbol of the input a codeword over some other alphabet. Such a coding is necessary, for example, for transmitting messages over a wire, where you can send only 0 or 1 (consider, for example, the good old telegraph and Morse code). The receiver gets a binary stream of bits and needs to decode the message sent.

A prefix code is a code where one can decipher the message character by character: read a prefix of the input binary string, match it to an original character, and continue deciphering the rest of the stream in the same way. A binary code (or a prefix code) is prefix-free if no codeword is a prefix of any other. ASCII and Unicode's UTF-8 are both prefix-free binary codes. Morse code is a binary code (and also a prefix code), but it is not prefix-free; for example, the code for S (· · ·) includes the code for E (·) as a prefix. (Hopefully the receiver knows that when it gets · · · it is extremely unlikely that this should be interpreted as EEE rather than as S.)

Any prefix-free binary code can be visualized as a binary tree with the encoded characters stored at the leaves. The code word for any symbol is given by the path from the root to the corresponding leaf: 0 for left, 1 for right. The length of a codeword for a symbol is the depth of the corresponding leaf. Such trees are usually referred to as prefix trees or code trees.

¹This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

    newline  16,492     '1'      61     'C'  13,896     'Q'     667
    space   130,376     '2'      10     'D'  28,041     'R'  37,187
    '!'         955     '3'      12     'E'  74,809     'S'  37,575
    '"'       5,681     '4'      10     'F'  13,559     'T'  54,024
    '$'           2     '5'      14     'G'  12,530     'U'  16,726
    '%'           1     '6'      11     'H'  38,961     'V'   5,199
    '''       1,174     '7'      13     'I'  41,005     'W'  14,113
    '('         151     '8'      13     'J'     710     'X'     724
    ')'         151     '9'      14     'K'   4,782     'Y'  12,177
    '*'          70     ':'     267     'L'  22,030     'Z'     215
    ','      13,276     ';'   1,108     'M'  15,298     '_'     182
    '–'       2,430     '?'     913     'N'  42,380     '`'      93
    '.'       6,769     'A'  48,165     'O'  46,499     '@'       2
    '0'          20     'B'   8,414     'P'   9,957     '/'      26

Figure 25.1: Frequency of characters in the book "A tale of two cities" by Dickens. For the sake of brevity, small letters were counted together with capital letters.

[Figure: a small code tree over the characters a, b, c, d, with edges labeled 0 (left) and 1 (right); the resulting codes are a = 00, b = 010, c = 011, d = 1.]

The beauty of prefix trees (and thus of prefix codes) is that decoding is very easy. As a concrete example, consider the tree shown above. Given the string '010100', we traverse down the tree from the root, going left if we read a '0' and right if we read a '1'. Whenever we reach a leaf, we output the character stored in that leaf and jump back to the root before reading the next bit. For the example '010100', after reading '010' our traversal leads us to the leaf marked 'b'; we jump back to the root and read the next input digit, which is '1', and this leads us to the leaf marked 'd', which we output, and we again jump back to the root. Finally, '00' leads us to the leaf marked 'a', which the algorithm outputs. Thus, the binary string '010100' encodes the string "bda".
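To make the decoding procedure concrete, here is a minimal Python sketch (not part of the original notes; the tree representation and the names are illustrative choices). It decodes '010100' against the example tree above:

```python
# Illustrative decoding sketch (not from the original notes).
# An internal node is a pair (left_child, right_child); a leaf is the character it stores.
EXAMPLE_TREE = (('a', ('b', 'c')), 'd')   # codes: a = 00, b = 010, c = 011, d = 1

def decode(tree, bits):
    """Walk the prefix tree: 0 goes left, 1 goes right; at a leaf, emit the character and restart at the root."""
    out = []
    node = tree
    for bit in bits:
        node = node[0] if bit == '0' else node[1]
        if isinstance(node, str):   # reached a leaf
            out.append(node)
            node = tree             # jump back to the root
    return ''.join(out)

print(decode(EXAMPLE_TREE, '010100'))   # prints "bda"
```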
Suppose we want to encode messages in an n-character alphabet so that the encoded message is as short as possible. Specifically, given an array of frequency counts f[1..n], we want to compute a prefix-free binary code that minimizes the total encoded length of the message. That is, we would like to compute a tree T that minimizes

\[
\mathrm{cost}(T) = \sum_{i=1}^{n} f[i] \cdot \mathrm{len}\bigl(\mathrm{code}(i)\bigr), \qquad (25.1)
\]

where code(i) is the binary string encoding the ith character and len(s) is the length (in bits) of the binary string s.

As a concrete example, consider Figure 25.1, which shows the frequency of characters in the book "A tale of two cities", which we would like to encode. Consider the characters 'E' and 'Q'. The first appears more than 74,000 times in the text, while the other appears only 667 times. Clearly, it would be logical to give 'E', the most frequent letter in English, a very short code, and to give 'Q' a very long (in number of bits) code.

A nice property of this problem is that, given two trees for some parts of the alphabet, we can easily put them together into a larger tree by just creating a new node and hanging the trees from this common node. For example, putting two characters together, we have the following.

[The preview omits a portion of the notes here: the description of Huffman's algorithm and most of the exchange argument showing that the two lowest-frequency characters can be assumed to be siblings of maximum depth.]

... another optimal code tree. In this final optimal code tree, x and y are maximum-depth siblings, as required.

Theorem 25.1.3 Huffman codes are optimal prefix-free binary codes.

Proof: If the message has only one or two different characters, the theorem is trivial. Otherwise, let f[1..n] be the original input frequencies, where, without loss of generality, f[1] and f[2] are the two smallest. To keep things simple, let f[n+1] = f[1] + f[2]. By the previous lemma, we know that some optimal code for f[1..n] has characters 1 and 2 as siblings. Let T be this optimal tree, and consider the tree formed from it by removing 1 and 2 as leaves. We remain with a tree T' that has as leaves the characters 3, ..., n and a "special" character n+1 (the parent of 1 and 2 in T) that has frequency f[n+1]. Now, since f[n+1] = f[1] + f[2], we have

\[
\begin{aligned}
\mathrm{cost}(T) &= \sum_{i=1}^{n} f[i]\,\mathrm{depth}_T(i) \\
 &= \sum_{i=3}^{n+1} f[i]\,\mathrm{depth}_T(i) + f[1]\,\mathrm{depth}_T(1) + f[2]\,\mathrm{depth}_T(2) - f[n+1]\,\mathrm{depth}_T(n+1) \\
 &= \mathrm{cost}(T') + \bigl(f[1]+f[2]\bigr)\,\mathrm{depth}(T) - \bigl(f[1]+f[2]\bigr)\bigl(\mathrm{depth}(T)-1\bigr) \\
 &= \mathrm{cost}(T') + f[1] + f[2]. \qquad (25.2)
\end{aligned}
\]

This implies that minimizing the cost of T is equivalent to minimizing the cost of T'. In particular, T' must be an optimal coding tree for f[3..n+1]. Now, consider the Huffman tree T'_H constructed for f[3,...,n+1] and the overall Huffman tree T_H constructed for f[1,...,n]. By the way the construction algorithm works, T'_H is formed by removing the leaves 1 and 2 from T_H. Now, by induction, we know that the Huffman tree generated for f[3,...,n+1] is optimal; namely, cost(T') = cost(T'_H). As such, arguing as above, we have

\[
\mathrm{cost}(T_H) = \mathrm{cost}(T'_H) + f[1] + f[2] = \mathrm{cost}(T') + f[1] + f[2] = \mathrm{cost}(T),
\]

by Eq. (25.2). Namely, the Huffman tree has the same cost as the optimal tree.
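Huffman's construction, which the proof refers to, repeatedly merges the two lowest-frequency symbols into a new node whose frequency is their sum (the f[n+1] = f[1] + f[2] step above). Here is a minimal Python sketch of that construction; the heap-based bookkeeping, the function names, and the tree representation are my own illustrative choices, not part of the original notes.

```python
import heapq
from itertools import count

def huffman_tree(freq):
    """Build a Huffman code tree for a dict mapping characters to frequencies.

    Repeatedly merges the two lowest-frequency subtrees under a new internal
    node whose frequency is the sum of its children.  Returns the tree in the
    (left, right) / leaf representation used in the decoding sketch above.
    """
    tiebreak = count()                   # avoids comparing trees when frequencies tie
    heap = [(f, next(tiebreak), ch) for ch, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # two smallest frequencies
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (t1, t2)))
    return heap[0][2]

def codewords(tree, prefix=''):
    """Read the codewords off the tree: 0 for left, 1 for right."""
    if isinstance(tree, str):            # leaf
        return {tree: prefix or '0'}
    codes = codewords(tree[0], prefix + '0')
    codes.update(codewords(tree[1], prefix + '1'))
    return codes

# For example, codewords(huffman_tree({'E': 74809, 'T': 54024, 'Q': 667, 'Z': 215}))
# gives 'E' a 1-bit code and 'Q', 'Z' 3-bit codes for these Figure 25.1 counts.
```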
25.1.1 What do we get

The book "A tale of two cities" is made out of 779,940 bytes, and applying the above Huffman compression to it results in a file of size 439,688 bytes. This is a far cry from what gzip can do (301,295 bytes) or bzip2 can do (220,156 bytes!), but it is still very impressive when you consider that the Huffman encoder can easily be written in a few hours of work. (These numbers ignore the space required to store the code together with the file. This space is pretty small and would not change the compression numbers stated above significantly.)

25.1.2 A formula for the average size of a code word

Assume that our input is made out of n characters, where the ith character makes up a p_i fraction of the input (one can think of p_i as the probability of seeing the ith character if we were to pick a random character from the input).

Now, we can use these probabilities instead of frequencies to build a Huffman tree. The natural question is: what is the length of the codewords assigned to the characters as a function of their probabilities? In general this question does not have a trivial answer, but there is a simple and elegant answer if all the probabilities are powers of 2.

Lemma 25.1.4 Let 1, ..., n be n symbols, such that the probability of the ith symbol is p_i, and furthermore there is an integer l_i ≥ 0 such that p_i = 1/2^{l_i}. Then, in the Huffman coding for this input, the code for i is of length l_i.

Proof: The proof is by easy induction on the Huffman algorithm. Indeed, for n = 2 the claim trivially holds, since there are only two characters, each with probability 1/2. Otherwise, let i and j be the two characters with lowest probability. It must hold that p_i = p_j (otherwise \sum_k p_k cannot be equal to one). As such, Huffman's algorithm merges these two letters into a single "character" that has probability 2 p_i, which, by induction (on the remaining n − 1 symbols), is encoded by a code word of length l_i − 1. Now, the resulting tree encodes i and j by code words of length (l_i − 1) + 1 = l_i, as claimed.

In particular, we have that l_i = \lg(1/p_i). This implies that the average length of a code word is

\[
\sum_i p_i \lg \frac{1}{p_i}.
\]

If we consider X to be a random variable that takes the value i with probability p_i, then this formula becomes

\[
H(X) = \sum_i \Pr[X = i] \lg \frac{1}{\Pr[X = i]},
\]

which is the entropy of X.
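As a quick sanity check of Lemma 25.1.4 (again an illustrative sketch, reusing the hypothetical huffman_tree and codewords helpers from above), one can verify that for power-of-two probabilities the codeword lengths come out to lg(1/p_i), and that the average length equals the entropy H(X):

```python
from math import log2

# Illustrative check of Lemma 25.1.4 with probabilities 1/2, 1/4, 1/8, 1/8
# (reuses the huffman_tree / codewords sketches above).
probs = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
codes = codewords(huffman_tree(probs))

for ch, p in probs.items():
    assert len(codes[ch]) == log2(1 / p)        # l_i = lg(1/p_i)

avg = sum(p * len(codes[ch]) for ch, p in probs.items())
entropy = sum(p * log2(1 / p) for p in probs.values())
print(avg, entropy)                             # both are 1.75
```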