Understanding UTF-8 and Huffman Coding: A Comprehensive Guide, Lecture notes of Construction

An in-depth exploration of UTF-8 encoding and Huffman coding, two essential concepts in computer science. UTF-8 is a variable-length character encoding that can represent every character in the Unicode set. Huffman coding is a lossless data compression algorithm that uses variable-length codes based on symbol frequency. The notes cover the basics of both concepts, their applications, and practical examples.

Typology: Lecture notes

2021/2022

Uploaded on 09/07/2022

adnan_95
Partial preview of the text

More Bits and Bytes
Huffman Coding

Encoding Text: How is it done?
ASCII, UTF, Huffman algorithm

UC SANTA CRUZ

UTF is a VARIABLE LENGTH ALPHABET CODING
§  Remember ASCII can only represent 128 characters (7 bits)
§  UTF encodes over one million
§  Why would you want a variable length coding scheme?

Bits of      First        Last         Bytes in
code point   code point   code point   sequence   Byte 1     Byte 2     Byte 3     Byte 4
7            U+0000       U+007F       1          0xxxxxxx
11           U+0080       U+07FF       2          110xxxxx   10xxxxxx
16           U+0800       U+FFFF       3          1110xxxx   10xxxxxx   10xxxxxx
21           U+10000      U+10FFFF     4          11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

(The 21 payload bits of a 4-byte sequence could count up to 1FFFFF hex, but Unicode ends at U+10FFFF.)

UTF-8
What is the first Unicode value represented by this sequence?
11101010 10000011 10000111 00111111 11000011 10000000
A.  0000000001101010
B.  0000000011101010
C.  0000001010000111
D.  1010000011000111

A Curious Story…
The Diving Bell and the Butterfly
Jean-Dominique Bauby
© 2010 Lawrence Snyder, CSE

Asking Yes/No Questions
§  A protocol for Yes/No questions
§  One blink == Yes
§  Two blinks == No

Asking Letters
In English ETAOINSHRDLU…

Compare Two Orderings
§  How many questions to encode: Plus ça change, plus c'est la même chose?
§  Asking in Frequency Order: 247
   ESARINTULOMDPCFBVHGJQZYXKW
§  Asking in Alphabetical Order: 324
   ABCDEFGHIJKLMNOPQRSTUVWXYZ

An Algorithm
§  Spelling by going through the letters is an algorithm
§  Going through the letters in frequency order is a program (also an algorithm, but with the order specified to a particular case, i.e. French (FR))
§  The nurses didn't look this up in a book … they invented it to make their work easier; they were thinking computationally

Coding can be used to do Compression
§  What is CODING?
   §  The conversion of one representation into another
§  What is COMPRESSION?
   §  Change the representation (digitization) in order to reduce size of data (number of bits needed to represent data)
§  Benefits
   §  Reduce storage needed
      §  Consider growth of digitized data.
   §  Reduce transmission cost / latency / bandwidth
      §  When you have a 56K dialup modem, every savings in BITS counts, SPEED
      §  Also consider telephone lines, texting

Can you lose information with Compression?
§  Lossless Compression is not guaranteed
   §  Pigeonhole principle
   §  Reduce size 1 bit ⇒ can only store ½ of data
   §  Example
      §  000, 001, 010, 011, 100, 101, 110, 111 ⇒ 00, 01, 10, 11
§  CONSIDER THE ALTERNATIVE
   §  IF LOSSLESS COMPRESSION WERE GUARANTEED THEN
      §  Compress file (reduce size by 1 bit)
      §  Recompress output
      §  Repeat (until we can store data with 0 bits)
   §  OBVIOUS CONTRADICTION ⇒ IT IS NOT GUARANTEED.

Huffman Code: A Lossless Compression
§  Use Variable Length codes based on frequency (variable length, like UTF)
§  Approach
   §  Exploit statistical frequency of symbols
   §  What do I MEAN by that? WE COUNT!!!
   §  HELPS when the frequency for different symbols varies widely
§  Principle
   §  Use fewer bits to represent frequent symbols
   §  Use more bits to represent infrequent symbols
A A B A A A A B

Huffman Code Example
§  “dog cat cat bird bird bird bird fish”

Symbol              Dog      Cat      Bird     Fish
Frequency           1/8      1/4      1/2      1/8
Original Encoding   00       01       10       11
                    2 bits   2 bits   2 bits   2 bits
Huffman Encoding    110      10       0        111
                    3 bits   2 bits   1 bit    3 bits

§  Expected size
   §  Original ⇒ 1/8×2 + 1/4×2 + 1/2×2 + 1/8×2 = 2 bits / symbol
   §  Huffman ⇒ 1/8×3 + 1/4×2 + 1/2×1 + 1/8×3 = 1.75 bits / symbol

Huffman Code Algorithm Overview
§  Order the symbols with least frequent first (will explain)
§  Build a tree piece by piece…
§  Encoding
   §  Calculate frequency of symbols in the message, language
      §  JUST COUNT AND DIVIDE BY TOTAL NUMBER OF SYMBOLS
   §  Create binary tree representing “best” encoding
   §  Use binary tree to encode compressed file
      §  For each symbol, output path from root to leaf
      §  Size of encoding = length of path
   §  Save binary tree

Huffman Code – Creating Tree
§  Algorithm (Recipe)
   §  Place each symbol in leaf
      §  Weight of leaf = symbol frequency
   §  Select two trees L and R (initially leaves)
      §  Such that L, R have lowest frequencies among all trees
      §  Which L, R have the lowest number of occurrences in the message?
   §  Create new (internal) node
      §  Left child ⇒ L
      §  Right child ⇒ R
      §  New frequency ⇒ frequency( L ) + frequency( R )
   §  Repeat until all nodes merged into one tree

Huffman Tree Construction 1
[Figure: leaves A = 3, C = 5, E = 8, H = 2, I = 7]

Huffman Tree Construction 4
[Figure: same leaves; internal nodes 5 (H + A), 10 (5 + C), 15 (I + E)]

Huffman Tree Construction 5
[Figure: completed tree with root 25; branches labeled 0 and 1]
E = 01   I = 00   C = 10   A = 111   H = 110

Huffman Coding Example
§  Huffman code
   §  E = 01   I = 00   C = 10   A = 111   H = 110
§  Input
   §  ACE
§  Output
   §  (111)(10)(01) = 1111001

Huffman Decoding 2
[Figure: the completed tree] Input: 1111001

Huffman Decoding 3
Input: 1111001   Decoded so far: A

Huffman Decoding 4
Input: 1111001   Decoded so far: A

Huffman Decoding 7
Input: 1111001   Decoded: ACE

Huffman Code Properties
§  Prefix code
   §  No code is a prefix of another code
   §  Example
      §  Huffman(“dog”) ⇒ 01
      §  Huffman(“cat”) ⇒ 011 // not legal prefix code
   §  Can stop as soon as complete code found
   §  No need for end-of-code marker
§  Nondeterministic
   §  Multiple Huffman coding possible for same input
   §  If more than two trees with same minimal weight

Huffman Code Properties
§  Greedy algorithm
   §  Chooses best local solution at each step
   §  Combines 2 trees with lowest frequency
§  Still yields overall best solution
   §  Optimal prefix code
   §  Based on statistical frequency
§  Better compression possible (depends on data)
   §  Using other approaches (e.g., pattern dictionary)

Huffman Tree: TO BE OR NOT TO BE
[Figure: leaves R = 1, B = 2, T = 3, E = 2, O = 4, N = 1]
[Figure, following slides: internal nodes 2 (R + N) and 4 (E + 2) are added, then 5 (B + T), then 8 (O + 4) and the root 13; branches labeled 0 and 1]
N = 1110   R = 1111   E = 110   B = 01   O = 10   T = 00
0010.01110.101111.11101000.0010.01110
32 bits

No code is prefix of another
N = 1110   R = 1111   E = 110   B = 01   O = 10   T = 00

DECODING: Your turn
N = 1110   R = 1111   E = 110   B = 01   O = 10   T = 00
111110011000 = ?
A.  ROBBER
B.  REBOOT
C.  ROBOT
D.  ROOT
E.  ROBERT
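The UTF-8 byte layout in the table near the start of these notes can be applied mechanically: the leading byte says how many bytes follow, and each continuation byte contributes six payload bits. The sketch below is a minimal illustration in Python, not a full decoder. The function name `decode_first_code_point` and the constant `SEQUENCE` are made up for this example, and checks a real decoder would need (overlong encodings, surrogates, values above U+10FFFF) are deliberately left out.

```python
def decode_first_code_point(data):
    """Return (code_point, bytes_used) for the first UTF-8 sequence in data."""
    first = data[0]
    if first >> 7 == 0b0:            # 0xxxxxxx: 1 byte, 7 payload bits
        return first, 1
    if first >> 5 == 0b110:          # 110xxxxx: 2 bytes, 5 payload bits here
        length, value = 2, first & 0b00011111
    elif first >> 4 == 0b1110:       # 1110xxxx: 3 bytes, 4 payload bits here
        length, value = 3, first & 0b00001111
    elif first >> 3 == 0b11110:      # 11110xxx: 4 bytes, 3 payload bits here
        length, value = 4, first & 0b00000111
    else:
        raise ValueError("not a valid leading byte")
    if len(data) < length:
        raise ValueError("sequence is truncated")
    for byte in data[1:length]:
        if byte >> 6 != 0b10:        # every later byte must look like 10xxxxxx
            raise ValueError("expected a 10xxxxxx continuation byte")
        value = (value << 6) | (byte & 0b00111111)
    return value, length


# The six bytes from the UTF-8 question in the slides.
SEQUENCE = bytes([0b11101010, 0b10000011, 0b10000111,
                  0b00111111, 0b11000011, 0b10000000])
code_point, used = decode_first_code_point(SEQUENCE)
print(format(code_point, "016b"), used)  # prints "1010000011000111 3"
```

The first byte starts with 1110, so three bytes are consumed and the 4 + 6 + 6 payload bits are concatenated. Calling the function again on `SEQUENCE[used:]` would decode the next character.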
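The pigeonhole argument on the "Can you lose information with Compression?" slide can be written as a count. In this sketch, n is an assumed name for the length of the input in bits; it does not appear in the slides.

```latex
\[
\#\{\text{bit strings of length } n\} = 2^{n},
\qquad
\#\{\text{bit strings of length } n-1\} = 2^{n-1} = \tfrac{1}{2}\cdot 2^{n}
\]
\[
\sum_{k=0}^{n-1} 2^{k} = 2^{n} - 1 < 2^{n}
\]
```

The first line is the slide's "can only store ½ of data": with n = 3 there are 8 inputs but only 4 two-bit outputs. The second line shows that even allowing every shorter length at once, there are fewer possible outputs than n-bit inputs, so at least two inputs would have to share an output and could not both be recovered.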
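The tree-building recipe from the "Creating Tree" slide (repeatedly merge the two lowest-weight trees) maps naturally onto a priority queue. The following is a sketch under stated assumptions, not the lecturer's implementation: `huffman_codes` is a hypothetical name, symbols are assumed to be strings, raw counts are used as weights instead of divided frequencies (the tree is the same), and ties are broken by insertion order. Because Huffman coding is nondeterministic when weights tie, the individual codes may differ from the slides while the total length stays the same.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(weights):
    """Map each symbol in a {symbol: weight} dict to a bit string."""
    tiebreak = count()  # unique sequence number so tied weights never compare nodes
    heap = [(w, next(tiebreak), sym) for sym, w in weights.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:  # a lone symbol still needs one bit
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # Select the two trees L and R with the lowest weights ...
        w_left, _, left = heapq.heappop(heap)
        w_right, _, right = heapq.heappop(heap)
        # ... and merge them under a new internal node (a 2-tuple).
        heapq.heappush(heap, (w_left + w_right, next(tiebreak), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):       # internal node: 0 = left, 1 = right
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                             # leaf: the path from the root is the code
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


message = "TOBEORNOTTOBE"  # "TO BE OR NOT TO BE" without the spaces
codes = huffman_codes(Counter(message))
encoded = "".join(codes[ch] for ch in message)
print(len(encoded))  # prints 32, the bit count shown on the slide
```

A binary heap keeps each "find the two smallest" step at O(log n), which is the usual reason to prefer it over re-sorting the list of trees after every merge.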
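Because no code is a prefix of another, a decoder can read bits left to right and emit a symbol the moment the bits read so far match a code, with no end-of-code marker. The sketch below illustrates that property using the code tables printed on the slides. The names `decode`, `FIRST_TREE_CODES`, and `SLIDE_CODES` are assumptions made for this example, and it uses a dictionary lookup in place of the tree walk shown in the decoding figures; the two are equivalent for a prefix code.

```python
def decode(bits, codes):
    """Decode a string of '0'/'1' characters using a {symbol: code} prefix code."""
    by_code = {code: symbol for symbol, code in codes.items()}
    decoded = []
    current = ""
    for bit in bits:
        current += bit
        if current in by_code:   # a complete code: emit it and start over
            decoded.append(by_code[current])
            current = ""
    if current:
        raise ValueError("input ended in the middle of a code")
    return "".join(decoded)


# Code tables copied from the slides.
FIRST_TREE_CODES = {"E": "01", "I": "00", "C": "10", "A": "111", "H": "110"}
SLIDE_CODES = {"N": "1110", "R": "1111", "E": "110",
               "B": "01", "O": "10", "T": "00"}

print(decode("1111001", FIRST_TREE_CODES))  # prints "ACE"
```

Running `decode("111110011000", SLIDE_CODES)` is one way to check an answer to the "Your turn" question.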