Huffman Coding: Efficient Character Representation, Study notes of Computer Science

Huffman coding is a method for efficiently representing characters using variable-length codes based on character frequency. These notes cover the concept, its advantages, and a worked example.

Typology: Study notes

Pre 2010

Uploaded on 09/17/2009

koofers-user-10r-1
Chapter 6: Memory: Information and Secret Codes
CS105: Great Insights in Computer Science

Overview
• When we decide how to represent something in bits, there are some competing interests:
  • easily manipulated/processed
  • short
• Common to use two representations:
  • one direct to allow for easy processing
  • one terse (compressed) to save storage and communication costs

Plan
• I’m going to try to describe one neat idea, implicit in Chapter 6: Huffman coding.
• For more information, see Wikipedia:
  • http://en.wikipedia.org/wiki/Huffman_coding

Gettysburg Address
Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow -- this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced.
It is rather for us to be here dedicated to the great task remaining before us -- that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion -- that we here highly resolve that these dead shall not have died in vain -- that this nation, under God, shall have a new birth of freedom -- and that government of the people, by the people, for the people, shall not perish from the earth.

Attempt #3: Vary Length
• Some characters are much more common than others.
• Give the 4 most common characters a 3-bit code, and the remaining 28 a 6-bit code.
• How many bits do we need now?

Variable Length Patterns
000 <s>     001 e       010 t       011 a
100000 o    100001 h    100010 r    100011 n
100100 i    100101 d    100110 s    100111 l
101000 c    101001 w    101010 g    101011 f
101100 v    101101 ,    101110 u    101111 -
110000 p    110001 b    110010 m    110011 .
110100 y    110101 <b>  110110 k    110111 q
111000 ?    111001 j    111010 x    111011 z

Decodability
• Note that the code was chosen so that the first bit of each character tells you whether the code is short (0) or long (1).
• This choice ensures that a message can actually be decoded:
  • 100001100100000010100001001100010001110011
  • h i <s> t h e r e .
• 42 bits, not 45. But, harder to work with.

What Gives?
• We had assigned all 32 characters 5-bit codes.
• Now we’ve got 4 that have 3-bit codes and 28 that have 6-bit codes. So, more than half of the characters have actually gotten longer.
• How can that change help?
• Need to factor in how many of each character there are.

Adding Up the Bits
• How many bits to write down just the letter “y”? Well, there are 10 “y”s and each takes 6 bits. So, 60 bits. (It was 50, before.)
• How about “t”? There are 126 and each takes 3 bits. That’s 378 (was 630).
• So, how do we total them all up?
• Let c be a character, freq(c) the number of times it appears, and len(c) its encoding length.
• Total bits = $\sum_c \mathrm{freq}(c) \times \mathrm{len}(c)$

Summing It Up
• 282×3 + 165×3 + 126×3 + 102×3 + 93×6 + 80×6 + 79×6 + ... + 0×6 + 0×6 = 6867

282 <s>   165 e     126 t     102 a
93 o      80 h      79 r      77 n
68 i      58 d      44 s      42 l
31 c      28 w      28 g      27 f
24 v      22 ,      21 u      15 -
15 p      14 b      13 m      10 .
10 y      4 <b>     3 k       1 q
0 ?       0 j       0 x       0 z

Tree (Prefix) Code
• First, notice that a code can be drawn as a tree.
• Left = “0”, right = “1”. So, e = “001”, w = “101001”.
• Tree structure ensures code is decodable: Bits tell you unambiguously which character.

[Figure: the variable-length code drawn as a tree, with the 32 characters (<s>, e, t, a, o, h, ...) as leaves.]

Huffman Coding
• Make each character a subtree (“block”) with count equal to its frequency.
• Take two blocks with smallest counts and “merge” them into left and right branches. The count for the new block is the sum of the counts of the blocks it is made out of.
• Repeat until all blocks have been merged into one big block (single tree).
• Read the code off the branches in the tree.

Partial Example
[Figure: successive merges of the least frequent blocks: k (3) + q (1) = 4; <b> (4) + 4 = 8; y (10) + 8 = 18; m (13) + . (10) = 23; p (15) + b (14) = 29; - (15) + 18 = 33. The blocks u (21) and , (22) are not yet merged.]

Completed Code Tree
[Figure: the completed code tree. The root block has count 1482, formed by merging blocks with counts 876 and 606.]

Created Code
11 <s>       100 e        0001 t       0100 a
0101 o       1010 h       1011 r       00000 n
00001 i      00101 d      01101 s      01111 l
001001 c     001101 w     001110 g     001111 f
011000 v     011100 ,     011101 u     0010001 -
0011000 p    0011001 b    0110010 m    0110011 .
00100000 y   001000011 <b>   0010000100 k   0010000101 q

Huffman: Summary
• Total for this example:
  • 4.1 bits per character
  • 6135 total bits
  • 51.7% of the size of the ASCII representation.
• Minimal for a character-by-character code for this passage. (No other character-by-character code leads to more compression.)
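
The merge procedure from the Huffman Coding slide can be sketched in a few lines of Python. This is only an illustration, not code from the course: the names `FREQ`, `huffman_code`, and `total_bits` are made up here, the counts are copied from the Summing It Up slide, and the four characters with a count of 0 are left out, as they are in the slides' code tree. Ties between equal counts may be broken differently than in the slides, so individual codes can differ, but the total number of bits should come out the same.

```python
import heapq
from itertools import count

# Character counts copied from the "Summing It Up" slide. The characters with
# a count of 0 (?, j, x, z) are omitted (an assumption that matches the tree
# shown in the slides, which has 28 leaves).
FREQ = {
    "<s>": 282, "e": 165, "t": 126, "a": 102, "o": 93, "h": 80, "r": 79,
    "n": 77, "i": 68, "d": 58, "s": 44, "l": 42, "c": 31, "w": 28, "g": 28,
    "f": 27, "v": 24, ",": 22, "u": 21, "-": 15, "p": 15, "b": 14, "m": 13,
    ".": 10, "y": 10, "<b>": 4, "k": 3, "q": 1,
}


def huffman_code(freq):
    """Return {symbol: bit string} by repeatedly merging the two smallest blocks."""
    tiebreak = count()  # unique counter so the heap never has to compare trees
    # Heap entries are (count, tiebreak, tree); a tree is either a symbol
    # (a string) or a (left, right) pair of subtrees.
    heap = [(n, next(tiebreak), sym) for sym, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next(tiebreak), (left, right)))

    codes = {}

    def walk(tree, prefix):
        # Left branch appends "0", right branch appends "1".
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"  # lone-symbol alphabet still gets one bit

    walk(heap[0][2], "")
    return codes


def total_bits(freq, codes):
    """Total bits = sum over characters of freq(c) * len(c)."""
    return sum(n * len(codes[sym]) for sym, n in freq.items())


codes = huffman_code(FREQ)
print(total_bits(FREQ, codes))  # 6135, matching the summary slide
```

For comparison, the fixed 5-bit code needs 5 × 1482 = 7410 bits for the same passage, and the 3-bit/6-bit code needs 6867.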