CPS 100 8.1 Text Compression: Examples

Symbol  ASCII     Fixed length  Var. length
a       01100001  000           000
b       01100010  001           11
c       01100011  010           01
d       01100100  011           001
e       01100101  100           10

“abcde” in the different formats
  ASCII: 01100001011000100110001101100100…
  Fixed: 000001010011100
  Var:   000110100110

[Figure: code trees for the fixed-length and variable-length encodings of a, b, c, d, e; each edge is labeled 0 or 1]

Encodings
  ASCII: 8 bits/character
  Unicode: 16 bits/character

CPS 100 8.2 Huffman coding: go go gophers

Encoding uses tree: 0 left/1 right
How many bits? 37!!
Savings? Worth it?

Char  ASCII (decimal)  ASCII (binary)  3 bits  Huffman
g     103              1100111         000     00
o     111              1101111         001     01
p     112              1110000         010     1100
h     104              1101000         011     1101
e     101              1100101         100     1110
r     114              1110010         101     1111
s     115              1110011         110     100
sp.   32               0100000         111     101

[Figure: the tree under construction. Subtrees of weight 3 (s 1, space 2), 4 (p 1 and h 1 under one node of weight 2, e 1 and r 1 under another), and 6 (g 3, o 3) are joined into a node of weight 7 (3 + 4) and then the root of weight 13 (6 + 7).]

CPS 100 8.3 Huffman Coding

D. A. Huffman, early 1950s
Before compressing data, analyze the input stream
Represent data using variable-length codes
Variable-length codes through prefix codes
  Each letter is assigned a codeword
  The codeword for a given letter is produced by traversing the Huffman tree
  Property: no codeword produced is the prefix of another
  Letters appearing frequently have short codewords, while those that appear rarely have longer ones
Huffman coding is an optimal per-character coding method

CPS 100 8.4 Building a Huffman tree

Begin with a forest of single-node trees (leaves)
  Each node/tree/leaf is weighted with a character count
  Node stores two values: character and count
  There are n nodes in the forest; n is the size of the alphabet?
Repeat until there is only one node left: the root of the tree
  Remove the two minimally weighted trees from the forest
  Create a new tree with the minimal trees as children
  • New tree root's weight: sum of children (character ignored)
Does this process terminate?
How do we get minimal trees?
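The loop on the "Building a Huffman tree" slide (remove the two minimally weighted trees, join them under a new root whose weight is their sum) can be sketched with a priority queue. This is a minimal sketch, not the course's implementation: using Python's `heapq` as the priority queue, the tuple layout for nodes, and the names `build_huffman_tree` and `codewords` are all assumptions made for illustration. Ties between equal weights may be broken differently from the slides, so individual codewords can differ, but the total number of bits stays the same.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman_tree(text):
    """Return the root of a Huffman tree for a non-empty text.

    Assumed node layout: a leaf is (weight, char),
    an internal node is (weight, left, right).
    """
    if not text:
        raise ValueError("text must not be empty")
    tiebreak = count()  # keeps heap entries comparable when weights tie
    forest = [(w, next(tiebreak), (w, ch)) for ch, w in Counter(text).items()]
    heapq.heapify(forest)
    while len(forest) > 1:
        # Remove the two minimally weighted trees from the forest.
        w1, _, t1 = heapq.heappop(forest)
        w2, _, t2 = heapq.heappop(forest)
        # New root's weight is the sum of its children; it stores no character.
        heapq.heappush(forest, (w1 + w2, next(tiebreak), (w1 + w2, t1, t2)))
    return forest[0][2]


def codewords(tree, prefix=""):
    """Map each character to its codeword: 0 for left, 1 for right."""
    if len(tree) == 2:  # leaf: (weight, char)
        return {tree[1]: prefix or "0"}  # one-symbol alphabet still gets a bit
    _, left, right = tree
    table = codewords(left, prefix + "0")
    table.update(codewords(right, prefix + "1"))
    return table


text = "go go gophers"
codes = codewords(build_huffman_tree(text))
total_bits = sum(len(codes[ch]) for ch in text)
print(total_bits)
```

Each pass through the loop shrinks the forest by one tree, so the process terminates after n - 1 merges for an alphabet of n characters.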
Remove minimal trees, hummm……

CPS 100 8.5 Building a tree

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”

Character counts (initial forest of leaves):
  space 11, I 6, E 5, N 5, S 4, M 4, A 3, B 3, O 3, T 3, G 2, D 2, L 2, R 2, U 2, P 1, F 1, C 1

CPS 100 8.6 Building a tree

[Figure: same forest, with a new internal-node weight 2]

CPS 100 8.7 Building a tree

[Figure: same forest, with new internal-node weights 2 and 3]

CPS 100 8.8 Building a tree

[Figure: same forest, with new internal-node weights 2, 3, and 4]

CPS 100 8.17 Decoding a message

[Figure: the completed Huffman tree for the string above; internal-node weights 2, 3, 4, 4, 5, 6, 6, 8, 8, 10, 11, 12, 16, 21, 23, 37, and root 60]

Bits: 100000100001001101

CPS 100 8.18 Decoding a message

Bits: 00000100001001101

CPS 100 8.19 Decoding a message

Bits: 0000100001001101
Decoded: G

CPS 100 8.20 Decoding a message

Bits: 1
Decoded: GOOD

CPS 100 8.21 Decoding a message

Bits: 01100000100001001101
Decoded: GOOD

CPS 100 8.22 Huffman coding: go go gophers

Choose two smallest weights
  Combine nodes + weights
  Repeat
  Priority queue?
Encoding uses tree: 0 left/1 right
How many bits?

Char  ASCII (decimal)  ASCII (binary)  3 bits  Huffman
g     103              1100111         000     ??
o     111              1101111         001     ??
p     112              1110000         010
h     104              1101000         011
e     101              1100101         100
r     114              1110010         101
s     115              1110011         110
sp.   32               0100000         111

[Figure: the forest for “go go gophers” (g 3, o 3, p 1, h 1, e 1, r 1, s 1, * 2, where * marks the space) being combined step by step]

CPS 100 8.23 to 8.28 Huffman Tree 2

“A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS”
E.g. “ A SIMPLE” ⇔ “10101101001000101001110011100000”
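Decoding, as walked through on the "Decoding a message" slides, follows the tree one bit at a time (0 left, 1 right) and emits a character on reaching a leaf, then restarts at the root. The sketch below illustrates this with the "go go gophers" code table rather than the larger tree. It rests on several assumptions: the nested-dict tree, the names `build_trie`, `encode`, and `decode` are hypothetical, and the codeword for `s` is taken as 100, which is what the tree on the gophers slide implies.

```python
# Code table from the "go go gophers" slide. Assumption: "s" is 100,
# since no two symbols in a prefix code can share the codeword 101.
CODES = {"g": "00", "o": "01", "p": "1100", "h": "1101",
         "e": "1110", "r": "1111", "s": "100", " ": "101"}


def build_trie(codes):
    """Turn a code table into a nested-dict tree keyed by '0' and '1'."""
    root = {}
    for symbol, bits in codes.items():
        node = root
        for bit in bits[:-1]:
            node = node.setdefault(bit, {})
        node[bits[-1]] = symbol  # a leaf stores the symbol itself
    return root


def encode(text, codes):
    """Concatenate the codeword of every character in text."""
    return "".join(codes[ch] for ch in text)


def decode(bits, root):
    """Walk the tree bit by bit; emit a symbol at each leaf, then restart."""
    out = []
    node = root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):  # reached a leaf
            out.append(node)
            node = root
    if node is not root:
        raise ValueError("bit string ends in the middle of a codeword")
    return "".join(out)


bits = encode("go go gophers", CODES)
print(len(bits))
print(decode(bits, build_trie(CODES)))
```

Because no codeword is a prefix of another, the decoder never needs separators between codewords: the first leaf it reaches is always the right one.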