Docsity

Improving Text Compression: From ASCII to Prefix Codes and Huffman Codes - Prof. Randy Shu, Study notes of Algorithms and Programming

These notes cover several methods for text compression, starting with ASCII encoding and moving on to fixed-length binary codes, variable-length encoding, prefix codes, and Huffman codes. They show how to assign shorter bit strings to the most commonly used letters and how to construct a tree that gives a minimal-bit-length encoding.

Typology: Study notes

Pre 2010

Uploaded on 08/18/2009

koofers-user-jih

Binary Code

Improving on ASCII
• Eight-bit ASCII uses eight bits per character, enough to store 2^8 = 256 distinct characters. If our file uses fewer characters, we might do better with a different code.
• For example, rather than store "ABRACADABRA" in eight-bit ASCII code, we store it in our own personal five-bit binary code (A = 00001, B = 00010, ...)*.
  0000100010100100000100011000010010000001000101001000001
*This uses 55 bits instead of 88, a 37.5% savings over the ASCII encoding. But we can do better if we give up the notion of fixed encoding length.

Variable-Length Encoding
• Assign the shortest bit strings to the most commonly used letters: A=0, B=1, R=01, C=10, D=11. So ABRACADABRA becomes
  0 1 01 0 10 0 11 0 1 01 0
• This uses only 15 bits compared to 55 in our previous scheme. But it does require blanks to separate the characters.

Epiphany
• Without the spaces, the encoding of ABRACADABRA is ambiguous: 010101001101010. For example, is the first bit an A, or the start of the pair 01 representing an R*?
• But delimiters aren't needed if no character code is the prefix of another.
*Recall A=0, B=1, R=01, C=10, D=11.

Huffman Codes
• Huffman encoding is a method for constructing a tree which leads to a bit string of minimal length for any given message, among prefix codes that give each character its own codeword.
• The first step is to find the frequency counts for the message. For example, ABRACADABRA

Frequency Table
A 5
B 2
C 1
D 1
R 2

Build Tree Bottom Up
  C:1  D:1  B:2  R:2  A:5
• Maintain a priority queue Q keyed on frequency.
• Remove the two trees with smallest root values and merge them into a new tree.
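
The bit counts, the ambiguity, and the frequency table in the notes so far can be checked with a short script. This is only a sketch: the helper names (five_bit_code, encode, is_prefix_free, count_decodings) are made up for illustration, and the five-bit code is assumed to be each letter's position in the alphabet, which agrees with the bit string shown in the notes.

```python
from collections import Counter

MESSAGE = "ABRACADABRA"

# Variable-length code from the notes. It is NOT prefix-free.
VARIABLE_CODE = {"A": "0", "B": "1", "R": "01", "C": "10", "D": "11"}


def five_bit_code(ch):
    """Assumed scheme: position in the alphabet (A=1, B=2, ...) as five bits."""
    return format(ord(ch) - ord("A") + 1, "05b")


def encode(message, code):
    """Concatenate the codeword of every character, with no delimiters."""
    return "".join(code[ch] for ch in message)


def is_prefix_free(code):
    """True if no codeword is a prefix of a different codeword."""
    words = list(code.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)


def count_decodings(bits, code):
    """Count the ways `bits` can be split into codewords (dynamic programming)."""
    ways = [0] * (len(bits) + 1)
    ways[0] = 1  # the empty string has exactly one parse
    for end in range(1, len(bits) + 1):
        for word in code.values():
            start = end - len(word)
            if start >= 0 and bits[start:end] == word:
                ways[end] += ways[start]
    return ways[len(bits)]


ascii_bits = 8 * len(MESSAGE)
fixed = "".join(five_bit_code(ch) for ch in MESSAGE)
variable = encode(MESSAGE, VARIABLE_CODE)

print(ascii_bits, len(fixed), len(variable))          # 88 55 15
print(1 - len(fixed) / ascii_bits)                    # 0.375
print(is_prefix_free(VARIABLE_CODE))                  # False
print(count_decodings(variable, VARIABLE_CODE) > 1)   # True: ambiguous
print(sorted(Counter(MESSAGE).items()))               # the frequency table
```

The last line reproduces the frequency table above, which is the input to the tree-building step.
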
[Figure: C:1 and D:1 merged into a new tree with root value 2.]

Repeat
  C:1  D:1  B:2  R:2  A:5
[Figure: the finished tree. C:1 and D:1 merge into 2, B:2 and R:2 merge into 4, 2 and 4 merge into 6, and A:5 and 6 merge into the root 11.]

To Obtain Huffman Code
Huffman(C)
  n ← |C|
  Q ← C
  for i ← 1 to n-1 do
    z ← Allocate-Node()
    x ← left[z] ← Extract-Min(Q)
    y ← right[z] ← Extract-Min(Q)
    f[z] ← f[x] + f[y]
    Insert(Q, z)
  return Extract-Min(Q)

Greedy-Choice Property
Let x, y ∈ C have minimum frequencies. Then there exists an optimal prefix code for C in which the codewords for x and y have the same length and differ only in the last bit.
[Figure: tree T'' with leaves a, b, x, y.]
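
The Huffman(C) pseudocode can be sketched in Python as follows. This is one possible rendering, not the notes' own implementation: it uses the standard heapq module as the priority queue Q, represents an internal node as a (left, right) tuple, and assumes the message has at least two distinct characters. The function names and the convention that a left edge means 0 and a right edge means 1 are choices made for this illustration.

```python
import heapq
from collections import Counter
from itertools import count


def build_huffman_tree(freqs):
    """Return the root of a Huffman tree for a {symbol: frequency} mapping.

    A leaf is a symbol (a string); an internal node is a (left, right) tuple.
    """
    tiebreak = count()  # unique counter so tied frequencies never compare nodes
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    # n - 1 merges, matching the loop "for i <- 1 to n-1" in the pseudocode.
    while len(heap) > 1:
        fx, _, x = heapq.heappop(heap)  # Extract-Min
        fy, _, y = heapq.heappop(heap)  # Extract-Min
        heapq.heappush(heap, (fx + fy, next(tiebreak), (x, y)))  # Insert
    return heap[0][2]


def code_table(node, prefix=""):
    """Walk the tree: a left edge appends '0', a right edge appends '1'."""
    if not isinstance(node, tuple):  # leaf
        return {node: prefix}
    left, right = node
    table = code_table(left, prefix + "0")
    table.update(code_table(right, prefix + "1"))
    return table


def decode(bits, root):
    """Follow the bits down from the root, emitting a symbol at each leaf."""
    out, node = [], root
    for b in bits:
        node = node[0] if b == "0" else node[1]
        if not isinstance(node, tuple):
            out.append(node)
            node = root
    return "".join(out)


message = "ABRACADABRA"
root = build_huffman_tree(Counter(message))
codes = code_table(root)
encoded = "".join(codes[ch] for ch in message)

print(codes)
print(len(encoded))                      # 23
print(decode(encoded, root) == message)  # True
```

For ABRACADABRA the encoded length is 23 bits, which equals the sum of the internal node values 2 + 4 + 6 + 11. The individual codewords depend on how ties in the queue are broken, but the total length does not. Because every character sits at a leaf, no codeword is a prefix of another, so the message decodes without delimiters.
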