Improving Text Compression: From ASCII to Prefix Codes and Huffman Codes - Prof. Randy Shu
Study notes, Algorithms and Programming

Binary Code
Improving on ASCII
• ASCII uses eight bits per character, enough to store 2^8 = 256 distinct characters. If our file uses fewer characters, we might do better with a different code.
• For example, rather than store "ABRACADABRA" in eight-bit ASCII code, we store it in our own five-bit binary code (A = 00001, B = 00010, ...)*.
0000100010100100000100011000010010000001000101001000001
*This represents a 37.5% savings over the ASCII encoding (55 bits versus 88). But we can do better if we give up the notion of a fixed encoding length.

Variable-Length Encoding
• Assign the shortest bit strings to the most commonly used letters: A=0, B=1, R=01, C=10, D=11. So ABRACADABRA becomes
0 1 01 0 10 0 11 0 1 01 0
• This uses only 15 bits, compared to 55 in our previous scheme. But it requires blanks to separate the characters.

Epiphany
• Without the spaces, the encoding of ABRACADABRA is ambiguous: 010101001101010. For example, is the first bit an A, or the start of the pair 01 representing an R?*
• But delimiters aren't needed if no character code is the prefix of another.
*Recall A=0, B=1, R=01, C=10, D=11.

Huffman Codes
• Huffman encoding is a method for constructing a tree that yields a prefix code. Among prefix codes that assign one codeword per character, it gives a bit string of minimal length for a given message.
• The first step is to find the frequency counts for the message. For example, ABRACADABRA:

Frequency Table
A 5
B 2
C 1
D 1
R 2

Build Tree Bottom Up
C:1 D:1 B:2 R:2 A:5
• Maintain a priority queue Q keyed on frequency.
• Remove the two trees with the smallest root values and merge them into a new tree.
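The first steps above (count the frequencies, load a priority queue, merge the two lightest trees once) can be sketched in Python as follows. This is only an illustration, not the notes' own code: the names `freq`, `queue`, and `merged` are assumptions, and each tree is represented simply as a (weight, symbols) pair.

```python
import heapq
from collections import Counter

message = "ABRACADABRA"

# Frequency table: one count per distinct character.
freq = Counter(message)

# Priority queue of (weight, symbols) pairs; heapq always pops the smallest.
# Sorting the symbols first only makes tie-breaking reproducible.
queue = [(weight, symbol) for symbol, weight in sorted(freq.items())]
heapq.heapify(queue)

# One merge step: remove the two lightest trees and combine their weights.
first = heapq.heappop(queue)
second = heapq.heappop(queue)
merged = (first[0] + second[0], first[1] + second[1])
heapq.heappush(queue, merged)

print(dict(freq))
print(first, second, merged)
```

For this message the two lightest trees are C and D, so the merged tree has weight 2, which matches the first merge in the slides.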
Repeat
• Keep merging until a single tree remains.
[Figure: merge sequence for ABRACADABRA. C:1 + D:1 → 2; B:2 + R:2 → 4; 2 + 4 → 6; A:5 + 6 → 11 (root).]

To Obtain Huffman Code
Huffman(C)
    n ← |C|
    Q ← C
    for i ← 1 to n-1 do
        z ← Allocate-Node()
        x ← left[z] ← Extract-Min(Q)
        y ← right[z] ← Extract-Min(Q)
        f[z] ← f[x] + f[y]
        Insert(Q, z)
    return Extract-Min(Q)

Greedy-Choice Property
Let x, y ∈ C have minimum frequencies. Then there exists an optimal prefix code for C in which the codewords for x, y have the same length and differ only in the last bit.
[Figure: tree T'' with leaves a, b, x, y.]
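The Huffman(C) pseudocode above can be turned into a short runnable Python sketch. The helper names (`huffman_codes`, `is_prefix_free`, `decode`), the tuple representation of trees, and the tie-breaking counter are assumptions made for this illustration and do not come from the notes. Ties between equal frequencies may be broken differently by other implementations, which changes individual codewords but not the total encoded length.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(message):
    """Return a {symbol: bit string} prefix code built by Huffman's algorithm."""
    freq = Counter(message)
    tie = count()  # unique sequence number so the heap never compares trees
    # A tree is either a symbol (leaf) or a (left, right) pair (internal node).
    heap = [(f, next(tie), sym) for sym, f in sorted(freq.items())]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:  # a one-symbol alphabet still needs one bit per symbol
        return {heap[0][2]: "0"}
    while len(heap) > 1:  # n - 1 merges, as in the pseudocode's for loop
        fx, _, x = heapq.heappop(heap)  # Extract-Min
        fy, _, y = heapq.heappop(heap)  # Extract-Min
        heapq.heappush(heap, (fx + fy, next(tie), (x, y)))  # Insert
    codes = {}

    def walk(tree, prefix):
        # Left edges contribute a 0 bit, right edges a 1 bit.
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix

    walk(heap[0][2], "")
    return codes


def is_prefix_free(codes):
    """True if no codeword is a prefix of another codeword."""
    words = sorted(codes.values())
    # After sorting, any prefix relation shows up between neighbours.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))


def decode(bits, codes):
    """Decode a bit string without delimiters; relies on the prefix property."""
    inverse = {word: sym for sym, word in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


message = "ABRACADABRA"
codes = huffman_codes(message)
encoded = "".join(codes[ch] for ch in message)
print(codes)
print(len(encoded))  # 23 bits for ABRACADABRA
```

The encoded message takes 23 bits, more than the 15 bits of the earlier variable-length code, but it decodes without delimiters: `decode(encoded, codes)` returns the original message, while `is_prefix_free` reports `False` for the earlier code A=0, B=1, R=01, C=10, D=11.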