Huffman Codes: A Prefix Code for Data Compression

Huffman codes

- Used for data compression, typically saving 20%–90%.
- Basic idea: represent frequently occurring characters by shorter (binary) codes.

Example

- Suppose we have the following data file with a total of 100 characters:

  Char.                     a     b     c     d     e     f
  Freq.                     45    13    12    16    9     5
  3-bit fixed-length code   000   001   010   011   100   101
  Variable-length code      0     101   100   111   1101  1100

- Total number of bits required to encode the file:
  - Fixed-length code: 100 × 3 = 300
  - Variable-length code: 1·45 + 3·13 + 3·12 + 3·16 + 4·9 + 4·5 = 224
- The variable-length code saves about 25%.

Prefix(-free) codes

1. No codeword is also a prefix [1] of any other codeword.
2. A prefix code for the example:

   Char.   a     b     c     d     e     f
   Code    0     101   100   111   1101  1100

3. Encoding and decoding with a prefix code: to encode, concatenate the codewords of the characters; to decode, read bits from left to right and emit a character as soon as the bits read so far form a codeword. Decoding is unambiguous because no codeword is a prefix of another.
4. Example, cont'd.
   - Encode:
     - beef → 101110111011100
     - face → 110001001101
   - Decode:
     - 101110111011100 → beef
     - 110001001101 → face

[1] Prefix: a word, letter or number placed before another.
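The encoding and decoding steps for the example prefix code can be sketched in Python as follows. This is a minimal illustration rather than part of the original notes: the names `CODE`, `encode`, and `decode` are hypothetical, and the decoder assumes its input is a valid concatenation of codewords.

```python
# Code table taken from the example; the function names are illustrative only.
CODE = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}


def encode(text, code=CODE):
    """Concatenate the codeword of every character in text."""
    return "".join(code[ch] for ch in text)


def decode(bits, code=CODE):
    """Read bits left to right, emitting a character whenever a codeword matches.

    This works because the code is prefix-free: once the bits read so far
    equal a codeword, no longer codeword can start with the same bits.
    """
    inverse = {word: ch for ch, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)


print(encode("beef"))          # 101110111011100
print(decode("110001001101"))  # face
```

The decoder needs no separators between codewords, which is exactly what the prefix-free property buys.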
5. Representation of a prefix code:
   - A binary tree whose leaves are the characters. In a full binary tree, every nonleaf node has two children.
   - All legal codewords are at the leaves, since no codeword is a prefix of another.
6. Example, cont'd: (a) the (not full) binary tree corresponding to the fixed-length code, and (b) the full binary tree corresponding to the prefix code. [Figure: trees (a) and (b), not reproduced.]
7. Fact: an optimal code for a file is always represented by a full binary tree.

Cost and optimality

Let C = alphabet (set of characters). Then

- A code = a binary tree T.
- For each character c ∈ C, define
  - f(c) = frequency of c in the file
  - d_T(c) = length of the codeword for c = number of bits = depth of c's leaf in the tree T
- The number of bits required to encode the file (the "cost of the tree/code T") is

  B(T) = Σ_{c ∈ C} f(c) · d_T(c)

- A code T is optimal if B(T) is minimal.
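As a quick check of the cost formula, the following Python sketch evaluates B(T) for both codes of the running example. The names `FREQ`, `FIXED`, `VARIABLE`, and `cost` are hypothetical; the depth d_T(c) is taken to be the length of the codeword for c.

```python
# Frequencies and codes from the running example (100 characters in total).
FREQ = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
FIXED = {"a": "000", "b": "001", "c": "010", "d": "011", "e": "100", "f": "101"}
VARIABLE = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}


def cost(freq, code):
    """B(T): sum over all characters of frequency times codeword length."""
    return sum(freq[ch] * len(code[ch]) for ch in freq)


fixed_bits = cost(FREQ, FIXED)
variable_bits = cost(FREQ, VARIABLE)
saving = 1 - variable_bits / fixed_bits

print(fixed_bits, variable_bits, f"{saving:.1%}")  # 300 224 25.3%
```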
Basic idea

Let C = alphabet (set of characters). The basic idea of Huffman codes is to produce a prefix code for C that represents frequently occurring characters by shorter (binary) codes, by:

1. Building a full binary tree T in a bottom-up manner.
2. Beginning with |C| leaves and performing a sequence of |C| − 1 "merging" operations to create T.
3. Making each "merging" operation greedy: the two nodes with the lowest frequencies are merged.
Review: priority queue

- A priority queue is a data structure for maintaining a set S of elements, each with an associated key.
- A min-priority queue supports the following operations:
  - Insert(S, x): inserts the element x into the set S, i.e., S = S ∪ {x}.
  - Minimum(S): returns the element of S with the smallest key.
  - ExtractMin(S): removes and returns the element of S with the smallest key.
  - DecreaseKey(S, x, k): decreases the value of element x's key to the new value k, which is assumed to be at least as small as x's current key value.
- A max-priority queue supports the operations Insert(S, x), Maximum(S), ExtractMax(S), and IncreaseKey(S, x, k).
- Section 6.5 describes a binary heap implementation.
- Cost: let n = |S|. Then
  - building the initial heap = O(n)
  - each heap operation = O(lg n)
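For readers who want to experiment, the min-priority-queue operations can be tried with Python's standard `heapq` module, which maintains a binary heap inside a plain list. This is one possible realization, not the implementation the notes refer to; `heapq` provides no DecreaseKey operation, so that one is not shown. Elements are `(key, value)` tuples, an assumption made here so that the heap orders by frequency.

```python
import heapq

# (key, value) pairs: the frequency is the key, the character is the value.
items = [(45, "a"), (13, "b"), (12, "c"), (16, "d"), (9, "e"), (5, "f")]

heap = list(items)
heapq.heapify(heap)                 # build the heap in O(n)

minimum = heap[0]                   # Minimum(S): peek without removing
extracted = heapq.heappop(heap)     # ExtractMin(S): O(lg n)
heapq.heappush(heap, (7, "z"))      # Insert(S, x): O(lg n)
next_min = heap[0]

print(minimum, extracted, next_min)  # (5, 'f') (5, 'f') (7, 'z')
```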
Pseudocode

    Huffmancode(C)
        n = |C|
        Q = C                          // min-priority queue, keyed by freq attribute
        for i = 1 to n-1
            allocate a new node z
            left[z] = x = ExtractMin(Q)
            right[z] = y = ExtractMin(Q)
            freq[z] = freq[x] + freq[y]
            Insert(Q, z)
        endfor
        return ExtractMin(Q)           // the root of the tree

Running time: T(n) = heap initialization + (n − 1) loop iterations × a constant number of heap operations per iteration = O(n) + O(n lg n) = O(n lg n).
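A runnable Python sketch of the pseudocode is given below. It is an illustration under stated assumptions, not the authors' implementation: the function name `huffman_code` is hypothetical, symbols are assumed to be strings, the alphabet is assumed non-empty, internal nodes are represented as `(left, right)` tuples, and a counter is added as a tie-breaker so the heap never has to compare tree nodes. The left child is labelled 0 and the right child 1.

```python
import heapq
from itertools import count


def huffman_code(freq):
    """Return {symbol: codeword} for a non-empty dict of symbol frequencies."""
    tie = count()  # tie-breaker so equal frequencies never compare nodes
    # A leaf is a symbol (str); an internal node is a (left, right) tuple.
    heap = [(f, next(tie), sym) for sym, f in freq.items()]
    heapq.heapify(heap)                        # Q = C, built in O(n)

    for _step in range(len(freq) - 1):         # n - 1 merging operations
        fx, _, x = heapq.heappop(heap)         # x = ExtractMin(Q)
        fy, _, y = heapq.heappop(heap)         # y = ExtractMin(Q)
        heapq.heappush(heap, (fx + fy, next(tie), (x, y)))  # Insert(Q, z)

    _, _, root = heap[0]                       # the root of the tree

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):            # internal node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                  # leaf
            codes[node] = prefix or "0"        # single-symbol alphabet gets "0"

    walk(root, "")
    return codes


FREQ = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
print(huffman_code(FREQ))
# {'a': '0', 'c': '100', 'b': '101', 'f': '1100', 'e': '1101', 'd': '111'}
```

On the example frequencies this reproduces the variable-length code from the table. With tied frequencies, other equally cheap codes are possible depending on how ties are broken.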
Example
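[Figure: tree diagram for the running example, including the leaf a:45; the image is not recoverable from the extracted text.]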
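The sequence of merges that the greedy algorithm performs on the example frequencies can be traced with a short Python sketch. This shows what the algorithm yields on this input and is not a transcription of the slide's figure. The name `merge_trace` and the parenthesized labels for merged nodes are hypothetical.

```python
import heapq


def merge_trace(freq):
    """Return the list of merges as (left label, right label, merged frequency)."""
    heap = [(f, sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    steps = []
    while len(heap) > 1:
        fx, x = heapq.heappop(heap)   # lowest frequency
        fy, y = heapq.heappop(heap)   # second lowest frequency
        steps.append((f"{x}:{fx}", f"{y}:{fy}", fx + fy))
        heapq.heappush(heap, (fx + fy, f"({x}{y})"))
    return steps


for left, right, total in merge_trace(
    {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
):
    print(f"merge {left} and {right} -> {total}")
# merge f:5 and e:9 -> 14
# merge c:12 and b:13 -> 25
# merge (fe):14 and d:16 -> 30
# merge (cb):25 and ((fe)d):30 -> 55
# merge a:45 and ((cb)((fe)d)):55 -> 100
```

Six leaves require five merges, matching the |C| − 1 count, and the final node carries the total frequency 100.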
Optimality

To prove that the greedy algorithm Huffmancode produces an optimal prefix code, we show that it exhibits the following two ingredients:

1. The greedy-choice property: if x, y ∈ C have the lowest frequencies, then there exists an optimal code T such that
   - d_T(x) = d_T(y)
   - the codes for x and y differ only in the last bit
2. The optimal substructure property: let x, y ∈ C have the lowest frequencies, let T be an optimal tree in which x and y are sibling leaves, and let z be their parent, with f(z) = f(x) + f(y). Then the tree T′ = T − {x, y} represents an optimal prefix code for the alphabet C′ = (C − {x, y}) ∪ {z}.

By the above two properties, after each greedy choice is made, we are left with an optimization problem of the same form as the original. By induction, we have

Theorem. The Huffman code is an optimal prefix code.
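The optimal substructure property rests on a simple relation between the costs of T and T′, worked out below. This derivation is a sketch added for clarity; it assumes, as in the pseudocode, that f(z) = f(x) + f(y) and that x and y are sibling leaves of T, so that d_T(x) = d_T(y) = d_{T′}(z) + 1.

```latex
\begin{align*}
f(x)\,d_T(x) + f(y)\,d_T(y)
  &= \bigl(f(x) + f(y)\bigr)\bigl(d_{T'}(z) + 1\bigr) \\
  &= f(z)\,d_{T'}(z) + f(x) + f(y), \\
\intertext{and every other character has the same depth in $T$ and $T'$, so}
B(T) &= B(T') + f(x) + f(y).
\end{align*}
```

If some tree T″ for C′ had B(T″) < B(T′), then replacing the leaf z in T″ by an internal node with children x and y would give a tree for C of cost B(T″) + f(x) + f(y) < B(T), contradicting the optimality of T. In the running example, merging e and f gives B(T′) = 210 and f(e) + f(f) = 14, which matches B(T) = 224.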