Data Compression: Trie and Huffman Encoding
Prof. Sugih Jamin (jamin@eecs.umich.edu)

Outline

PA1 past due

Last time:
• Binary Search Tree (BST)
• Binary Space Partition (BSP) Tree
• MinHeap and MaxHeap
• Priority queue

Today:
• Trie
• RLE and Huffman encoding

Trie

trie: from retrieval; rhymes with "try" to differentiate it from "tree"

A trie: a tree that uses parts of the key, as opposed to the whole key, to perform search

Whereas a tree associates keys with nodes, a trie associates keys with edges

Example: for the set of strings {on, owe, owl, tip, to}

[Figure: trie T with root edges 'o' and 't'; under 'o', edge 'n' ends "on" and edge 'w' leads to edges 'e' ("owe") and 'l' ("owl"); under 't', edge 'i' leads to edge 'p' ("tip") and edge 'o' ends "to"]

Note: the handout's external node is our leaf node, not our external (null) node

(A code sketch of a trie appears at the end of these notes.)

Run Length Encoding

A very simple encoding method: for each run of a repeated element, output the length of the run followed by the element

Example: 11111111000001111000011110000011111110000
Output: 8150414041507140

(A code sketch of RLE appears at the end of these notes.)

Huffman Encoding

Example string: If a woodchuck could chuck wood!

ASCII encoding: 8 bits/char ⇒ requires 256 bits to encode (store in binary) the 32-character string

Observe: there are only 13 distinct symbols in the example string, so 4 bits/char is sufficient to encode the string ⇒ requires 128 bits

Huffman encoding's main ideas:
• variable-length encoding: use a different number of bits (code length) to represent different symbols
• entropy encoding: assign shorter codes to more frequently occurring symbols

Goal: minimize Σ_c l(c) · f(c), where c ranges over the unique symbols in the string, l(c) is the length of c's code, and f(c) is its frequency

Huffman Encoding (contd)

If a woodchuck could chuck wood!

Can be encoded using the following codes (for example):

symbol c    freq. f(c)    code C(c)
I           1             11111
f           1             11110
a           1             11101
l           1             11100
!           1             1101
w           2             1100
d           3             101
u           3             100
h           2             0111
k           2             0110
o           5             010
c           5             001
' '         5             000

Takes only 111 bits to encode the string.

Huffman Tree Construction (contd)

Characteristics of Huffman trees:
• higher frequency symbols sit at shallower depth
• since all symbols are leaf nodes, no code is a prefix of another

Construct the Huffman tree from the |Σ| elements (|Σ|: alphabet size):
• implement as a MinHeap, where the "key" is the frequency of occurrence of each element of Σ
• take the two smallest elements off the MinHeap, O(log |Σ|)
• make a tree of them, with the key of the new root node being the sum of the keys of the two children, O(1)
• put the new tree back into the MinHeap, O(log |Σ|)

Total construction time: O(|Σ| log |Σ|), since there are |Σ| − 1 merges, each costing O(log |Σ|)

(A code sketch of the construction appears at the end of these notes.)

Encoding Time Complexity

Running times, n string length, |Σ| alphabet size:
• frequency count: O(n)
• Huffman tree construction: O(|Σ| log |Σ|)
• Total time: O(n + |Σ| log |Σ|)

For binary data, treat each byte as a "character"

Compressing the Huffman Code Table

The Huffman code for any particular text is not unique

For example, the following are all acceptable:

symbol c    freq. f(c)    code C(c)    C'(c)    C''(c)
' '         5             000          001      000
c           5             001          010      001
d           3             101          100      010
o           5             010          000      011
u           3             100          101      100
!           1             1101         0110     1010
h           2             0111         1101     1011
k           2             0110         1110     1100
w           2             1100         0111     1101
a           1             11101        11100    11100
f           1             11110        11101    11101
l           1             11100        11111    11110
I           1             11111        11110    11111

The last column can be compressed into: 3' 'cdou4!hkw5aflI
(each digit is a code length, followed by the symbols that receive consecutive codes of that length)

(A code sketch of this table compression appears at the end of these notes.)
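The sketches below are editor-added illustrations, not part of the original handout. First, a minimal trie over lowercase-letter keys, matching the {on, owe, owl, tip, to} example above; the names TrieNode, insert, and contains are illustrative, not from the slides.

```cpp
#include <array>
#include <memory>
#include <string>

// Minimal trie: keys live on the edges (one edge per letter),
// and a flag marks the nodes where a whole key ends.
struct TrieNode {
    bool isKey = false;                             // a key ends at this node
    std::array<std::unique_ptr<TrieNode>, 26> next; // edges labeled 'a'..'z'
};

void insert(TrieNode &root, const std::string &key) {
    TrieNode *cur = &root;
    for (char ch : key) {
        auto &child = cur->next[ch - 'a'];
        if (!child) child = std::make_unique<TrieNode>(); // add missing edge
        cur = child.get();
    }
    cur->isKey = true;
}

bool contains(const TrieNode &root, const std::string &key) {
    const TrieNode *cur = &root;
    for (char ch : key) {                     // follow one edge per character,
        cur = cur->next[ch - 'a'].get();      // using parts of the key to search
        if (!cur) return false;               // edge missing: key not present
    }
    return cur->isKey;
}
```

After inserting the five example strings, contains("owe") returns true, while contains("ow") returns false: the search reaches a node, but no key ends there.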
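Next, a sketch of the run-length encoder described above. It assumes runs of at most 9 repeats so each count fits in a single digit, as in the slide's example; a real encoder would need a longer count field or an escape convention.

```cpp
#include <cstddef>
#include <iostream>
#include <string>

// Run-length encode: for each run, output the run length, then the element.
std::string rle(const std::string &s) {
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        std::size_t j = i;
        while (j < s.size() && s[j] == s[i]) ++j;  // extend the current run
        out += std::to_string(j - i);              // count first (assumed <= 9)
        out += s[i];                               // then the repeated element
        i = j;
    }
    return out;
}

int main() {
    // The example from the slide:
    std::cout << rle("11111111000001111000011110000011111110000") << '\n';
    // prints 8150414041507140
}
```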
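A sketch of the MinHeap-based Huffman tree construction, using std::priority_queue as the min-heap keyed on frequency. Node cleanup and the one-symbol-alphabet edge case are omitted for brevity, and the function names are illustrative.

```cpp
#include <cstddef>
#include <queue>
#include <string>
#include <unordered_map>
#include <vector>

// One node of the Huffman tree: leaves hold a symbol; an internal node's
// key is the sum of its two children's keys.
struct Node {
    std::size_t freq;
    char symbol;            // meaningful only at leaves
    Node *left = nullptr, *right = nullptr;
};

struct ByFreq {  // min-heap ordering on frequency
    bool operator()(const Node *a, const Node *b) const { return a->freq > b->freq; }
};

Node *buildHuffman(const std::string &text) {
    std::unordered_map<char, std::size_t> freq;
    for (char c : text) ++freq[c];                 // O(n) frequency count

    std::priority_queue<Node*, std::vector<Node*>, ByFreq> heap;
    for (auto [c, f] : freq) heap.push(new Node{f, c});

    while (heap.size() > 1) {
        Node *a = heap.top(); heap.pop();          // two smallest, O(log |Σ|) each
        Node *b = heap.top(); heap.pop();
        heap.push(new Node{a->freq + b->freq, '\0', a, b});  // merge, push back
    }
    return heap.top();                             // root of the Huffman tree
}

// Read codes off the tree: left edge = 0, right edge = 1.
// Huffman merges always create two children, so !left means "leaf".
void codes(const Node *n, std::string prefix,
           std::unordered_map<char, std::string> &out) {
    if (!n->left) { out[n->symbol] = prefix; return; }
    codes(n->left,  prefix + '0', out);
    codes(n->right, prefix + '1', out);
}
```

Because equal-frequency nodes can pop off the heap in any order, the resulting code need not match the table above bit for bit, but the total Σ l(c) · f(c) is the same: 111 bits for the example string.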
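Finally, a sketch of how the compressed table 3' 'cdou4!hkw5aflI expands back into the C'' column: each successive symbol gets the previous code plus one, left-shifted whenever the code length grows (the canonical-code rule implied by the slide's example). The parsing of the compressed string into (length, symbols) groups is hard-coded for clarity.

```cpp
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Render `code` as a `len`-bit binary string.
std::string toBits(unsigned code, int len) {
    std::string s(len, '0');
    for (int i = len - 1; i >= 0; --i, code >>= 1)
        s[i] = '0' + (code & 1);
    return s;
}

int main() {
    // "3' 'cdou4!hkw5aflI" decoded into (code length, symbols-in-order) groups:
    std::vector<std::pair<int, std::string>> groups =
        {{3, " cdou"}, {4, "!hkw"}, {5, "aflI"}};

    unsigned code = 0;            // first code: all zeros at the shortest length
    int len = groups[0].first;
    bool first = true;
    for (auto &[l, syms] : groups) {
        for (char c : syms) {
            if (!first)
                code = (code + 1) << (l - len);  // next code, padded when length grows
            first = false;
            len = l;
            std::cout << c << " -> " << toBits(code, len) << '\n';
        }
    }
    // Output reproduces C'': ' '->000, c->001, ..., !->1010, ..., I->11111
}
```

This is why storing only the code lengths and symbol order is enough: the receiver can regenerate every codeword deterministically.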