More Bits and Bytes
Huffman Coding

Encoding Text: How is it done?
ASCII, UTF, Huffman algorithm

UC SANTA CRUZ

UTF is a VARIABLE LENGTH ALPHABET CODING
§ Remember ASCII can only represent 128 characters (7 bits)
§ UTF encodes over one million
§ Why would you want a variable length coding scheme?

Bits of code point   First code point   Last code point   Bytes in sequence   Byte 1     Byte 2     Byte 3     Byte 4
 7                   U+0000             U+007F            1                   0xxxxxxx
11                   U+0080             U+07FF            2                   110xxxxx   10xxxxxx
16                   U+0800             U+FFFF            3                   1110xxxx   10xxxxxx   10xxxxxx
21                   U+10000            U+1FFFFF          4                   11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

§ The 4-byte pattern has room for 21 bits (up to U+1FFFFF), but Unicode code points stop at U+10FFFF
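
The byte patterns in the table can be applied by hand with a few shifts and masks. The following is a minimal sketch in Python, not a replacement for the built-in codec; the function names `utf8_encode` and `utf8_decode_first` are made up for this illustration, and the sketch deliberately skips some checks a real decoder performs (surrogates, overlong forms).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point using the byte patterns from the table."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if code_point <= 0x7F:      # 7 bits:  0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:     # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:    # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])


def utf8_decode_first(data: bytes) -> int:
    """Return the first code point in data by reading the leading-byte pattern."""
    first = data[0]
    if first >> 7 == 0b0:            # 0xxxxxxx: one byte
        return first
    if first >> 5 == 0b110:          # two bytes, 5 payload bits here
        length, value = 2, first & 0b00011111
    elif first >> 4 == 0b1110:       # three bytes, 4 payload bits here
        length, value = 3, first & 0b00001111
    elif first >> 3 == 0b11110:      # four bytes, 3 payload bits here
        length, value = 4, first & 0b00000111
    else:
        raise ValueError("not a valid leading byte")
    if len(data) < length:
        raise ValueError("sequence is cut short")
    for byte in data[1:length]:
        if byte >> 6 != 0b10:        # continuation bytes look like 10xxxxxx
            raise ValueError("expected a continuation byte")
        value = (value << 6) | (byte & 0b00111111)
    return value


# Show the bit patterns for one code point from each row of the table.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X}", " ".join(f"{b:08b}" for b in utf8_encode(cp)))
```

Comparing `utf8_encode(cp)` with Python's own `chr(cp).encode("utf-8")` is a quick way to check the sketch against a real implementation.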

UTF-8
What is the first Unicode value represented by this sequence?
11101010 10000011 10000111 00111111 11000011 10000000
A. 0000000001101010
B. 0000000011101010
C. 0000001010000111
D. 1010000011000111

© 2010 Lawrence Snyder, CSE

A Curious Story…
The Diving Bell and the Butterfly
Jean-Dominique Bauby

Asking Yes/No Questions
§ A protocol for Yes/No questions
  § One blink == Yes
  § Two blinks == No

Asking Letters
In English: ETAOINSHRDLU…

Compare Two Orderings
§ How many questions to encode: Plus ça change, plus c'est la même chose?
§ Asking in Frequency Order: 247
  ESARINTULOMDPCFBVHGJQZYXKW
§ Asking in Alphabetical Order: 324
  ABCDEFGHIJKLMNOPQRSTUVWXYZ

An Algorithm
§ Spelling by going through the letters is an algorithm
§ Going through the letters in frequency order is a program (also an algorithm, but with the order specialized to a particular case, i.e., French (FR))
§ The nurses didn’t look this up in a book … they invented it to make their work easier; they were thinking computationally

Coding can be used to do Compression
§ What is CODING?
  § The conversion of one representation into another
§ What is COMPRESSION?
  § Change the representation (digitization) in order to reduce size of data (number of bits needed to represent data)
§ Benefits
  § Reduce storage needed
    § Consider growth of digitized data
  § Reduce transmission cost / latency / bandwidth
    § When you have a 56K dialup modem, every saving in BITS counts toward SPEED
    § Also consider telephone lines, texting

Can you lose information with Compression?
§ Lossless Compression is not guaranteed
  § Pigeonhole principle
    § Reduce size by 1 bit ⇒ only ½ as many distinct values can be stored
    § Example
      § 8 inputs (000, 001, 010, 011, 100, 101, 110, 111) ⇒ only 4 outputs (00, 01, 10, 11)
§ CONSIDER THE ALTERNATIVE
  § IF LOSSLESS COMPRESSION WERE GUARANTEED THEN
    § Compress file (reduce size by 1 bit)
    § Recompress output
    § Repeat (until we can store data with 0 bits)
  § OBVIOUS CONTRADICTION ⇒ IT IS NOT GUARANTEED

Huffman Code: A Lossless Compression
§ Use variable-length codes chosen by frequency (UTF is also variable length)
§ Approach
  § Exploit statistical frequency of symbols
    § What do I MEAN by that? WE COUNT!!!
  § HELPS when the frequency for different symbols varies widely
§ Principle
  § Use fewer bits to represent frequent symbols
  § Use more bits to represent infrequent symbols
§ Example message: A A B A A A A B

Huffman Code Example
§ “dog cat cat bird bird bird bird fish”

Symbol              Dog      Cat      Bird     Fish
Frequency           1/8      1/4      1/2      1/8
Original Encoding   00       01       10       11
                    2 bits   2 bits   2 bits   2 bits
Huffman Encoding    110      10       0        111
                    3 bits   2 bits   1 bit    3 bits

§ Expected size
  § Original ⇒ 1/8×2 + 1/4×2 + 1/2×2 + 1/8×2 = 2 bits / symbol
  § Huffman ⇒ 1/8×3 + 1/4×2 + 1/2×1 + 1/8×3 = 1.75 bits / symbol

Huffman Code Algorithm Overview
§ Order the symbols with least frequent first (will explain)
§ Build a tree piece by piece…
§ Encoding
  § Calculate frequency of symbols in the message, language
    § JUST COUNT AND DIVIDE BY TOTAL NUMBER OF SYMBOLS
  § Create binary tree representing “best” encoding
  § Use binary tree to encode compressed file
    § For each symbol, output path from root to leaf
    § Size of encoding = length of path
  § Save binary tree

Huffman Code – Creating Tree
§ Algorithm (Recipe)
  § Place each symbol in a leaf
    § Weight of leaf = symbol frequency
  § Select two trees L and R (initially leaves)
    § Such that L, R have the lowest frequencies among all trees
    § Which L, R have the lowest number of occurrences in the message?
  § Create new (internal) node
    § Left child ⇒ L
    § Right child ⇒ R
    § New frequency ⇒ frequency( L ) + frequency( R )
  § Repeat until all nodes merged into one tree

Huffman Tree Construction
§ Leaves (symbol: weight): A: 3, C: 5, E: 8, H: 2, I: 7
§ Merge H (2) + A (3) ⇒ 5
§ Merge C (5) + 5 ⇒ 10
§ Merge I (7) + E (8) ⇒ 15
§ Merge 15 + 10 ⇒ 25 (root)
§ Reading the 0/1 branch labels from the root gives
  E = 01, I = 00, C = 10, A = 111, H = 110

Huffman Coding Example
§ Huffman code
  E = 01, I = 00, C = 10, A = 111, H = 110
§ Input
  § ACE
§ Output
  § (111)(10)(01) = 1111001

Huffman Decoding
§ Input bits: 1111001
§ Start at the root and follow one branch per bit; at a leaf, output its symbol and return to the root
  § 111 ⇒ A
  § 10 ⇒ C
  § 01 ⇒ E
§ Result: ACE

Huffman Code Properties
§ Prefix code
  § No code is a prefix of another code
  § Example
    § Huffman(“dog”) ⇒ 01
    § Huffman(“cat”) ⇒ 011 // not legal prefix code
  § Can stop as soon as complete code found
  § No need for end-of-code marker
§ Nondeterministic
  § Multiple Huffman codings possible for same input
  § Happens if more than two trees have the same minimal weight

Huffman Code Properties (continued)
§ Greedy algorithm
  § Chooses best local solution at each step
  § Combines 2 trees with lowest frequency
§ Still yields overall best solution
  § Optimal prefix code
  § Based on statistical frequency
§ Better compression possible (depends on data)
  § Using other approaches (e.g., pattern dictionary)

Huffman Tree: TO BE OR NOT TO BE
§ Leaves (symbol: weight): R: 1, N: 1, B: 2, E: 2, T: 3, O: 4
§ Merge N (1) + R (1) ⇒ 2
§ Merge E (2) + 2 ⇒ 4
§ Merge B (2) + T (3) ⇒ 5
§ Merge O (4) + 4 ⇒ 8
§ Merge 5 + 8 ⇒ 13 (root)
§ Codes
  N = 1110, R = 1111, E = 110, B = 01, O = 10, T = 00
§ Encoded message
  0010.01110.101111.11101000.0010.01110
  § 32 bits

No code is prefix of another
  N = 1110, R = 1111, E = 110, B = 01, O = 10, T = 00

DECODING: Your turn
§ 111110011000 = ?
  A. ROBBER
  B. REBOOT
  C. ROBOT
  D. ROOT
  E. ROBERT
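
The tree-building recipe and the encode/decode steps from the slides can be tried out in code. The following is a minimal sketch in Python that uses a priority queue (`heapq`) to pick the two lowest-weight trees; the function names and the `SLIDE_CODES` table are introduced here for illustration only. Because ties can be broken either way, the codes it produces may differ from the ones on the slides, although the total encoded length should be the same.

```python
import heapq
from collections import Counter
from itertools import count

# Code table copied from the TO BE OR NOT TO BE slides.
SLIDE_CODES = {"N": "1110", "R": "1111", "E": "110",
               "B": "01", "O": "10", "T": "00"}


def huffman_codes(message: str) -> dict:
    """Build a Huffman code for the symbols in message."""
    order = count()  # tie-breaker so the heap never has to compare trees
    # Each heap entry is (weight, tie-breaker, tree); a tree is either
    # a single symbol (leaf) or a (left, right) pair (internal node).
    heap = [(w, next(order), sym) for sym, w in Counter(message).items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:  # a single distinct symbol still needs one bit
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        w_left, _, left = heapq.heappop(heap)    # lowest weight
        w_right, _, right = heapq.heappop(heap)  # second lowest
        heapq.heappush(heap, (w_left + w_right, next(order), (left, right)))

    codes = {}

    def walk(tree, path):
        if isinstance(tree, tuple):  # internal node: 0 = left, 1 = right
            walk(tree[0], path + "0")
            walk(tree[1], path + "1")
        else:                        # leaf: path from root is the code
            codes[tree] = path

    walk(heap[0][2], "")
    return codes


def encode(message: str, codes: dict) -> str:
    """Concatenate the code of every symbol in the message."""
    return "".join(codes[sym] for sym in message)


def decode(bits: str, codes: dict) -> str:
    """Decode a bit string; works because no code is a prefix of another."""
    lookup = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:  # complete code found, no end marker needed
            out.append(lookup[current])
            current = ""
    if current:
        raise ValueError("bits end in the middle of a code")
    return "".join(out)


message = "TOBEORNOTTOBE"  # the slide example without spaces
codes = huffman_codes(message)
bits = encode(message, codes)
print(codes)
print(len(bits), "bits")
print(decode(bits, codes))
```

Calling `decode("111110011000", SLIDE_CODES)` checks an answer to the "Your turn" question, and `len(bits)` should come out to 32, matching the slide.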