Huffman Codes Project 2: Compressing Text Files - Prof. David J. Galles
CS245-2009S-P2 Huffman Codes Project 2

P2-0: Text Files
• All files are represented as binary digits, including text files
• Each character is represented by an integer code
• ASCII: American Standard Code for Information Interchange
• A text file is a sequence of binary digits that represent the codes for each character

P2-1: ASCII
• Each character can be represented as an 8-bit number
• ASCII for a = 97 = 01100001
• ASCII for b = 98 = 01100010
• A text file is a sequence of 1's and 0's that represent the ASCII codes for the characters in the file
• File "aba" is 97, 98, 97
• 011000010110001001100001

P2-2: ASCII
• Each character in ASCII is represented as 8 bits
• We need 8 bits to represent all possible characters
• (including control characters and unprintable characters)
• Breaking up a file into individual characters is easy
• Finding the kth character in a file is easy

P2-3: ASCII
• ASCII is not terribly efficient
• All characters require 8 bits
• Frequently used characters require the same number of bits as infrequently used characters
• We could be more efficient if frequently used characters required fewer than 8 bits, and less frequently used characters required more bits

P2-5: Representing Codes as Trees
• Want to encode only 4 characters: a, b, c, d (instead of 256 characters)
• How many bits are required for each code, if each code has the same length?
• 2 bits are required, since there are 4 possible options to distinguish

P2-6: Representing Codes as Trees
• Want to encode only 4 characters: a, b, c, d
• Pick the following codes:
• a: 00
• b: 01
• c: 10
• d: 11
• We can represent these codes as a tree
• Characters are stored at the leaves of the tree
• A code is represented by the path from the root to the leaf

P2-7: Representing Codes as Trees
• a: 00, b: 01, c: 10, d: 11
[Figure: code tree with leaves a, b, c, d and edges labeled 0 and 1]

P2-8: Representing Codes as Trees
• a: 01, b: 00, c: 11, d: 10
[Figure: code tree with leaves a, b, c, d and edges labeled 0 and 1]

P2-9: Prefix Codes
• If no code is a prefix of any other code, then decoding the file is unambiguous
• If all codes are the same length, then no code will be a prefix of any other code (trivially)
• We can create variable length codes, where no code is a prefix of any other code

P2-20: Decoding Files
• We can use variable length codes to encode a text file
• Given the encoded file and the tree representation of the codes, it is easy to decode the file
• Finding the kth character in the file is trickier
• We need to decode the first (k-1) characters in the file to determine where the kth character is in the file

P2-21: File Compression
• We can use variable length codes to compress files
• Select an encoding such that frequently used characters have short codes and less frequently used characters have longer codes
• Write out the file using these codes
• (If the codes depend on the contents of the file itself, we will also need to write out the codes at the beginning of the file for decoding)

P2-22: File Compression
• We need a method for building codes such that:
• Frequently used characters are represented by leaves high in the code tree
• Less frequently used characters are represented by leaves low in the code tree
• Characters of equal frequency have equal depths in the code tree

P2-23: Huffman Coding
• For each code tree, we keep track of the total number of times the characters
in that tree appear in the input file
• We start with one code tree for each character that appears in the input file
• We repeatedly combine the two trees with the lowest frequencies, until all trees have been combined into one tree

P2-24: Huffman Coding
• Example: If the letters a-e have the frequencies:
• a: 100, b: 20, c: 15, d: 30, e: 1
[Figure: five single-node trees: a:100, b:20, c:15, d:30, e:1]

P2-25: Huffman Coding
• Example: If the letters a-e have the frequencies:
• a: 100, b: 20, c: 15, d: 30, e: 1
[Figure: c:15 and e:1 combined under a new root of weight 16; a:100, b:20, d:30 unchanged]

P2-26: Huffman Coding
• Example: If the letters a-e have the frequencies:
• a: 100, b: 20, c: 15, d: 30, e: 1
[Figure: the weight-16 tree and b:20 combined under a new root of weight 36; a:100, d:30 unchanged]

P2-27: Huffman Coding
• Example: If the letters a-e have the frequencies:
• a: 100, b: 20, c: 15, d: 30, e: 1
[Figure: the weight-36 tree and d:30 combined under a new root of weight 66; a:100 unchanged]

P2-28: Huffman Coding
• Example: If the letters a-e have the frequencies:
• a: 100, b: 20, c: 15, d: 30, e: 1
[Figure: the weight-66 tree and a:100 combined under the final root of weight 166]

P2-29: Huffman Coding
• Example: If the letters a-e have the frequencies:
• a: 10, b: 10, c: 10, d: 10, e: 10
[Figure: five single-node trees: a:10, b:10, c:10, d:10, e:10]

P2-30: Huffman Coding
• Example: If the letters a-e have the frequencies:
• a: 10, b: 10, c: 10, d: 10, e: 10
[Figure: two of the trees combined under a new root of weight 20]

P2-31: Huffman Coding
• Example: If the letters a-e have the frequencies:
• a: 10, b: 10, c: 10, d: 10, e: 10
[Figure: two more of the trees combined under a second root of weight 20]

P2-32: Huffman Coding

public void writeBit(boolean bit)
public void close()

P2-40: Binary Files
• readBit: reads a single bit
• readChar: reads a single character (8 bits)

P2-41: Binary Files
• writeBit: writes out a single bit
• writeChar: writes out a single (8-bit) character
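
The combining procedure from P2-23 through P2-28 can be sketched in Java as follows. This is only an illustration, not the project's required design: the class name HuffmanSketch, the Node type, and the use of java.util.PriorityQueue to find the two lowest-frequency trees are all assumptions made for this example.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class HuffmanSketch {
    // Hypothetical code-tree node: a leaf holds a character,
    // an interior node holds two subtrees and their combined weight.
    static final class Node {
        final char symbol;   // meaningful only for leaves
        final int weight;    // total frequency of all characters in this tree
        final Node left;
        final Node right;

        Node(char symbol, int weight, Node left, Node right) {
            this.symbol = symbol;
            this.weight = weight;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    // Start with one tree per character, then repeatedly combine the two
    // lowest-weight trees until one tree remains. Returns null for an empty map.
    static Node build(Map<Character, Integer> freq) {
        PriorityQueue<Node> forest =
            new PriorityQueue<>(Comparator.comparingInt((Node n) -> n.weight));
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            forest.add(new Node(e.getKey(), e.getValue(), null, null));
        }
        while (forest.size() > 1) {
            Node first = forest.poll();
            Node second = forest.poll();
            forest.add(new Node('\0', first.weight + second.weight, first, second));
        }
        return forest.poll();
    }

    // Walk the tree; the path to each leaf (0 = left, 1 = right) is its code.
    // A tree consisting of a single leaf yields an empty code.
    static void collectCodes(Node node, String path, Map<Character, String> out) {
        if (node.isLeaf()) {
            out.put(node.symbol, path);
            return;
        }
        collectCodes(node.left, path + "0", out);
        collectCodes(node.right, path + "1", out);
    }

    public static void main(String[] args) {
        Map<Character, Integer> freq = new HashMap<>();
        freq.put('a', 100);
        freq.put('b', 20);
        freq.put('c', 15);
        freq.put('d', 30);
        freq.put('e', 1);

        Map<Character, String> codes = new HashMap<>();
        collectCodes(build(freq), "", codes);
        for (char c : "abcde".toCharArray()) {
            System.out.println(c + ": " + codes.get(c));
        }
    }
}
```

For the frequencies a: 100, b: 20, c: 15, d: 30, e: 1, the intermediate root weights are 16, 36, 66, and 166, and the resulting code lengths are 1 bit for a, 2 for d, 3 for b, and 4 each for c and e. The exact bit patterns depend on which subtree is placed on the left, which this sketch chooses arbitrarily.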
P2-43: Binary Files
• If we write to a binary file:
• bit, bit, char, bit, int
• And then read from the file:
• bit, char, bit, int, bit
• What will we get out?
• Garbage! (except for the first bit)

P2-44: Printing out Trees
• To print out Huffman trees:
• Print out nodes in pre-order traversal
• Need a way of denoting which nodes are leaves and which nodes are interior nodes
• (Huffman trees are full: every node has 0 or 2 children)
• For each interior node, print out a 0 (single bit). For each leaf, print out a 1, followed by 8 bits for the character at the leaf

P2-45: Compression?
• Is it possible that Huffman compression would not compress the file?
• Is it possible that Huffman compression could actually make the file larger?
• How?

P2-46: Compression?
• What happens if all the characters have the same frequency?
• What does the tree look like?
• What can we say about the lengths of the codes for each character?
• What does that mean for the file size?

P2-47: Compression?
• What happens if all the characters have the same frequency?
• All leaves are at the same depth in the tree (that is, 8, when all 256 possible characters appear)
• Each code will have a length of 8
• The encoded file will be the same size as the original file, plus the size required to encode the tree

P2-48: Compression!
• What to do?
• Calculate the size of the input file
• Calculate the size that the compressed file would be
• If the compressed file is larger than the input file, don't compress

P2-50: Compression!
• Given the frequency array, how large is the input file?
• ∑_c freq(c) ∗ len(c), where len(c) = 8 for every character in the uncompressed file
• That is, (# of characters in the input file) * 8 bits

P2-52: Compression!
• Given the frequency array & codes for each compressed element, how large is the compressed file?
• ∑_c freq(c) ∗ len(c)
• + size of tree
• + 4 bytes (32 bits)
• (Implementation detail of the BinaryFile class)
• Note: file sizes need to be a multiple of 8 bits ...

P2-53: Command Line Arguments
public static void main(String args[])
• The args parameter holds the input parameters
• java MyProgram arg1 arg2 arg3
• args.length = 3
• args[0] = "arg1"
• args[1] = "arg2"
• args[2] = "arg3"

P2-54: Calling Huffman
java Huffman (-c|-u) [-v] [-f] infile outfile
• (-c|-u) stands for either "-c" (for compress) or "-u" (for uncompress)
• [-v] stands for an optional "-v" flag (for verbose)
• [-f] stands for an optional "-f" flag (for force compress)
• infile is the input file
• outfile is the output file
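
One way the command line from P2-54 might be parsed is sketched below. The class name HuffmanArgs and its fields are hypothetical, and the sketch assumes the flags appear in exactly the order shown in the usage line (mode first, then an optional -v, then an optional -f); the project itself may accept other orderings.

```java
public class HuffmanArgs {
    static final String USAGE = "java Huffman (-c|-u) [-v] [-f] infile outfile";

    final boolean compress;  // true for -c, false for -u
    final boolean verbose;   // true if -v was given
    final boolean force;     // true if -f was given
    final String infile;
    final String outfile;

    HuffmanArgs(boolean compress, boolean verbose, boolean force,
                String infile, String outfile) {
        this.compress = compress;
        this.verbose = verbose;
        this.force = force;
        this.infile = infile;
        this.outfile = outfile;
    }

    // Parses args in the fixed order: mode, optional -v, optional -f, infile, outfile.
    static HuffmanArgs parse(String[] args) {
        if (args.length < 3) {
            throw new IllegalArgumentException(USAGE);
        }
        int i = 0;
        String mode = args[i++];
        if (!mode.equals("-c") && !mode.equals("-u")) {
            throw new IllegalArgumentException(USAGE);
        }
        boolean verbose = false;
        boolean force = false;
        if (i < args.length && args[i].equals("-v")) {
            verbose = true;
            i++;
        }
        if (i < args.length && args[i].equals("-f")) {
            force = true;
            i++;
        }
        // Exactly two arguments must remain: the input and output file names.
        if (args.length - i != 2) {
            throw new IllegalArgumentException(USAGE);
        }
        return new HuffmanArgs(mode.equals("-c"), verbose, force, args[i], args[i + 1]);
    }

    public static void main(String[] args) {
        HuffmanArgs parsed = parse(args);
        System.out.println((parsed.compress ? "compress " : "uncompress ")
            + parsed.infile + " -> " + parsed.outfile
            + (parsed.verbose ? " (verbose)" : "")
            + (parsed.force ? " (force)" : ""));
    }
}
```

Throwing IllegalArgumentException with the usage string keeps the sketch short; a real program would more likely print the usage message and exit.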
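
The size comparison from P2-48 through P2-52 can be worked through in code. The sketch below assumes the tree is written as described in P2-44 (1 bit per interior node, 9 bits per leaf), that a full tree with n leaves has n - 1 interior nodes, and that the total is rounded up to a multiple of 8 bits. The class and method names are hypothetical, and the real BinaryFile class may account for its 32-bit overhead differently.

```java
public class SizeEstimate {
    // Size of the uncompressed input in bits: every character takes 8 bits.
    static long inputBits(int[] freq) {
        long characters = 0;
        for (int f : freq) {
            characters += f;
        }
        return characters * 8;
    }

    // Size of the serialized tree in bits, assuming the P2-44 format:
    // 1 bit per interior node, 1 + 8 bits per leaf.
    static long treeBits(int leaves) {
        if (leaves == 0) {
            return 0;
        }
        return (leaves - 1) + 9L * leaves;
    }

    // Estimated compressed size in bits: encoded characters + tree + 32 bits,
    // rounded up to a whole number of bytes.
    static long compressedBits(int[] freq, int[] codeLength) {
        long body = 0;
        int leaves = 0;
        for (int c = 0; c < freq.length; c++) {
            if (freq[c] > 0) {
                body += (long) freq[c] * codeLength[c];
                leaves++;
            }
        }
        long total = body + treeBits(leaves) + 32;
        return (total + 7) / 8 * 8;
    }

    public static void main(String[] args) {
        int[] freq = new int[256];
        int[] len = new int[256];
        // Frequencies from the a-e example and the code lengths its tree produces.
        freq['a'] = 100; len['a'] = 1;
        freq['b'] = 20;  len['b'] = 3;
        freq['c'] = 15;  len['c'] = 4;
        freq['d'] = 30;  len['d'] = 2;
        freq['e'] = 1;   len['e'] = 4;
        System.out.println("input bits:      " + inputBits(freq));
        System.out.println("compressed bits: " + compressedBits(freq, len));
    }
}
```

For the a-e example, the encoded characters take 284 bits, the tree takes 49 bits, and the overhead adds 32, for 365 bits, which rounds up to 368. The input is 166 characters, or 1328 bits, so compressing is worthwhile in this case.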