Huffman Codes Project 2: Compressing Text Files - Prof. David J. Galles, Study Guides, Projects, Research of Data Structures and Algorithms

University of San Francisco (USF), Data Structures and Algorithms

Prof. David J. Galles

These slides describe how to use Huffman codes to compress text files. They cover the representation of text files as binary digits, the inefficiency of ASCII, representing codes as trees, prefix codes, variable length codes, decoding files, file compression, and building Huffman codes. They also include examples and instructions for creating encoding tables and using binary files.

Typology: Study Guides, Projects, Research

Pre 2010

Uploaded on 07/30/2009

koofers-user-9op 🇺🇸


CS245-2009S-P2 Huffman Codes Project 2

P2-0: Text Files
• All files are represented as binary digits – including text files
• Each character is represented by an integer code
• ASCII – American Standard Code for Information Interchange
• A text file is a sequence of binary digits that represent the codes for each character

P2-1: ASCII
• Each character can be represented as an 8-bit number
• ASCII for a = 97 = 01100001
• ASCII for b = 98 = 01100010
• A text file is a sequence of 1's and 0's that represent the ASCII codes for the characters in the file
• File "aba" is 97, 98, 97
• 011000010110001001100001

P2-2: ASCII
• Each character in ASCII is represented as 8 bits
• We need 8 bits to represent all possible characters
• (including control characters and unprintable characters)
• Breaking up a file into individual characters is easy
• Finding the kth character in a file is easy

P2-3: ASCII
• ASCII is not terribly efficient
• All characters require 8 bits
• Frequently used characters require the same number of bits as infrequently used characters
• We could be more efficient if frequently used characters required fewer than 8 bits, and less frequently used characters required more bits

P2-4: Representing Codes as Trees
• Want to encode only 4 characters: a, b, c, d (instead of 256 characters)
• How many bits are required for each code, if each code has the same length?

P2-5: Representing Codes as Trees
• Want to encode only 4 characters: a, b, c, d (instead of 256 characters)
• How many bits are required for each code, if each code has the same length?
• 2 bits are required, since there are 4 possible options to distinguish

P2-6: Representing Codes as Trees
• Want to encode only 4 characters: a, b, c, d
• Pick the following codes:
• a: 00
• b: 01
• c: 10
• d: 11
• We can represent these codes as a tree
• Characters are stored at the leaves of the tree
• Code is represented by the path to the leaf

P2-7: Representing Codes as Trees
• a: 00, b: 01, c: 10, d: 11
[Figure: binary code tree with leaves a, b, c, d and edges labeled 0 and 1]

P2-8: Representing Codes as Trees
• a: 01, b: 00, c: 11, d: 10
[Figure: binary code tree for these codes, with edges labeled 0 and 1]

P2-9: Prefix Codes
• If no code is a prefix of any other code, then decoding the file is unambiguous
• If all codes are the same length, then no code will be a prefix of any other code (trivially)
• We can create variable length codes, where no code is a prefix of any other code

P2-20: Decoding Files
• We can use variable length codes to encode a text file
• Given the encoded file and the tree representation of the codes, it is easy to decode the file
• Finding the kth character in the file is trickier
• Need to decode the first (k-1) characters in the file to determine where the kth character is in the file

P2-21: File Compression
• We can use variable length codes to compress files
• Select an encoding such that frequently used characters have short codes and less frequently used characters have longer codes
• Write out the file using these codes
• (If the codes depend on the contents of the file itself, we will also need to write out the codes at the beginning of the file for decoding)

P2-22: File Compression
• We need a method for building codes such that:
• Frequently used characters are represented by leaves high in the code tree
• Less frequently used characters are represented by leaves low in the code tree
• Characters of equal frequency have equal (or nearly equal) depths in the code tree

P2-23: Huffman Coding
• For each code tree, we keep track of the total number of times the characters in that tree appear in the input file
• We start with one code tree for each character that appears in the input file
• We combine the two trees with the lowest frequency, until all trees have been combined into one tree

P2-24: Huffman Coding
• Example: If the letters a-e have the frequencies:
• a: 100, b: 20, c: 15, d: 30, e: 1
[Figure: five single-node trees a:100, b:20, c:15, d:30, e:1]

P2-25: Huffman Coding
• Example: If the letters a-e have the frequencies:
• a: 100, b: 20, c: 15, d: 30, e: 1
[Figure: c:15 and e:1 combined under a new root :16]

P2-26: Huffman Coding
• Example: If the letters a-e have the frequencies:
• a: 100, b: 20, c: 15, d: 30, e: 1
[Figure: :16 and b:20 combined under a new root :36]

P2-27: Huffman Coding
• Example: If the letters a-e have the frequencies:
• a: 100, b: 20, c: 15, d: 30, e: 1
[Figure: :36 and d:30 combined under a new root :66]

P2-28: Huffman Coding
• Example: If the letters a-e have the frequencies:
• a: 100, b: 20, c: 15, d: 30, e: 1
[Figure: :66 and a:100 combined under the final root :166]

P2-29: Huffman Coding
• Example: If the letters a-e have the frequencies:
• a: 10, b: 10, c: 10, d: 10, e: 10
[Figure: five single-node trees a:10, b:10, c:10, d:10, e:10]

P2-30: Huffman Coding
• Example: If the letters a-e have the frequencies:
• a: 10, b: 10, c: 10, d: 10, e: 10
[Figure: two of the trees combined under a new root :20]

P2-31: Huffman Coding
• Example: If the letters a-e have the frequencies:
• a: 10, b: 10, c: 10, d: 10, e: 10
[Figure: two more of the trees combined under a second root :20]

P2-32: Huffman Coding

public void writeBit(boolean bit)
public void close()

P2-40: Binary Files
• readBit
• Read a single bit
• readChar
• Read a single character (8 bits)

P2-41: Binary Files
• writeBit
• Writes out a single bit
• writeChar
• Writes out a single (8 bit) character

P2-42: Binary Files
• If we write to a binary file:
• bit, bit, char, bit, int
• And then read from the file:
• bit, char, bit, int, bit
• What will we get out?
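
The tree-building procedure from the Huffman Coding slides (P2-23 to P2-28) can be sketched in Java roughly as follows. This is only an illustrative sketch, not the project's required design: the names `HuffmanSketch`, `Node`, `build`, and `collectCodes` are hypothetical, and using `java.util.PriorityQueue` to find the two lowest-frequency trees is an assumption (the slides do not say how to find them).

```java
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;

// Hypothetical sketch of Huffman tree construction; all names are illustrative.
class HuffmanSketch {
    // A code tree: leaves hold a character, interior nodes hold two children.
    static final class Node {
        final char ch;          // meaningful only for leaves
        final int freq;         // total frequency of all characters in this tree
        final Node left, right; // both null for a leaf

        Node(char ch, int freq) {
            this.ch = ch;
            this.freq = freq;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {
            this.ch = '\0';
            this.freq = left.freq + right.freq;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null;
        }
    }

    // Start with one tree per character, then repeatedly combine the two
    // lowest-frequency trees until a single tree remains.
    static Node build(Map<Character, Integer> freqs) {
        PriorityQueue<Node> trees =
            new PriorityQueue<>((a, b) -> Integer.compare(a.freq, b.freq));
        for (Map.Entry<Character, Integer> e : freqs.entrySet()) {
            trees.add(new Node(e.getKey(), e.getValue()));
        }
        while (trees.size() > 1) {
            Node first = trees.poll();
            Node second = trees.poll();
            trees.add(new Node(first, second));
        }
        return trees.poll(); // null when freqs is empty
    }

    // Record the path to each leaf (0 = left, 1 = right) as that character's code.
    // A tree with a single leaf yields an empty code; a real encoder must handle that case.
    static void collectCodes(Node node, String path, Map<Character, String> codes) {
        if (node == null) {
            return;
        }
        if (node.isLeaf()) {
            codes.put(node.ch, path);
            return;
        }
        collectCodes(node.left, path + "0", codes);
        collectCodes(node.right, path + "1", codes);
    }

    public static void main(String[] args) {
        // Frequencies from the example on slides P2-24 to P2-28.
        Map<Character, Integer> freqs = new TreeMap<>();
        freqs.put('a', 100);
        freqs.put('b', 20);
        freqs.put('c', 15);
        freqs.put('d', 30);
        freqs.put('e', 1);

        Node root = build(freqs);
        Map<Character, String> codes = new TreeMap<>();
        collectCodes(root, "", codes);
        System.out.println("total frequency at root: " + root.freq);
        for (Map.Entry<Character, String> e : codes.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

With the example frequencies, the root carries the total 166, the most frequent character a ends up with a 1-bit code, and the rarest characters c and e end up with 4-bit codes. Which child is labeled 0 and which 1 is a free choice, so the exact bit patterns may differ from another correct implementation.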
P2-43: Binary Files
• If we write to a binary file:
• bit, bit, char, bit, int
• And then read from the file:
• bit, char, bit, int, bit
• What will we get out?
• Garbage! (except for the first bit)

P2-44: Printing out Trees
• To print out Huffman trees:
• Print out nodes in pre-order traversal
• Need a way of denoting which nodes are leaves and which nodes are interior nodes
• (Huffman trees are full – every node has 0 or 2 children)
• For each interior node, print out a 0 (single bit). For each leaf, print out a 1, followed by 8 bits for the character at the leaf

P2-45: Compression?
• Is it possible that Huffman compression would not compress the file?
• Is it possible that Huffman compression could actually make the file larger?
• How?

P2-46: Compression?
• What happens if all the characters have the same frequency?
• What does the tree look like?
• What can we say about the lengths of the codes for each character?
• What does that mean for the file size?

P2-47: Compression?
• What happens if all the characters have the same frequency?
• All nodes are at the same depth in the tree (that is, 8, assuming all 256 possible characters appear)
• Each code will have a length of 8
• The encoded file will be the same size as the original file – plus the size required to encode the tree

P2-48: Compression!
• What to do?
• Calculate the size of the input file
• Calculate the size that the compressed file would be
• If the compressed file is larger than the input file, don't compress

P2-49: Compression!
• Given the frequency array, how large is the input file?

P2-50: Compression!
• Given the frequency array, how large is the input file?
• $\sum_c \text{freq}(c) \cdot \text{len}(c)$, with $\text{len}(c) = 8$ for every character in the uncompressed file
• (# of characters in the input file) * 8

P2-51: Compression!
• Given the frequency array & codes for each compressed element, how large is the compressed file?

P2-52: Compression!
• Given the frequency array & codes for each compressed element, how large is the compressed file?
• $\sum_c \text{freq}(c) \cdot \text{len}(c)$ +
• Size of tree
• + 4 bytes (32 bits)
• Implementation detail of BinaryFile class
• Note file sizes need to be a multiple of 8 bits ...

P2-53: Command Line Arguments
public static void main(String args[])
• The args parameter holds the input parameters
• java MyProgram arg1 arg2 arg3
• args.length = 3
• args[0] = "arg1"
• args[1] = "arg2"
• args[2] = "arg3"

P2-54: Calling Huffman
java Huffman (-c|-u) [-v] [-f] infile outfile
• (-c|-u) stands for either "-c" (for compress), or "-u" (for uncompress)
• [-v] stands for an optional "-v" flag (for verbose)
• [-f] stands for an optional "-f" flag (for force compress)
• infile is the input file
• outfile is the output file
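
The size comparison described on the Compression! slides (P2-48 to P2-52) could be sketched along these lines. The class and method names (`SizeEstimate`, `inputBits`, `compressedBits`) are hypothetical. The tree size follows the pre-order encoding from P2-44 (1 bit per interior node, 9 bits per leaf), and the sketch assumes the 32-bit header is counted before rounding up to a multiple of 8 bits; the real BinaryFile class may differ.

```java
// Hypothetical sketch of the "should we compress?" size check; names are illustrative.
class SizeEstimate {
    // Uncompressed size in bits: every character takes 8 bits.
    static long inputBits(int[] freq) {
        long chars = 0;
        for (int f : freq) {
            chars += f;
        }
        return chars * 8;
    }

    // Estimated compressed size in bits: encoded text + tree + 32-bit header,
    // rounded up to a whole number of bytes (rounding order is an assumption).
    static long compressedBits(int[] freq, int[] codeLen) {
        long body = 0;
        int leaves = 0;
        for (int c = 0; c < freq.length; c++) {
            if (freq[c] > 0) {
                body += (long) freq[c] * codeLen[c]; // freq(c) * len(c)
                leaves++;
            }
        }
        // A full binary tree with n leaves has n - 1 interior nodes.
        // Pre-order encoding: 1 bit per interior node, 1 + 8 bits per leaf.
        long tree = (leaves == 0) ? 0 : (leaves - 1) + 9L * leaves;
        long total = body + tree + 32;
        return (total + 7) / 8 * 8;
    }

    public static void main(String[] args) {
        // Frequencies for a, b, c, d, e from slide P2-24, with the code
        // lengths that the Huffman tree for those frequencies produces.
        int[] freq = {100, 20, 15, 30, 1};
        int[] codeLen = {1, 3, 4, 2, 4};
        long before = inputBits(freq);
        long after = compressedBits(freq, codeLen);
        System.out.println("input bits: " + before);
        System.out.println("estimated compressed bits: " + after);
        System.out.println(after < before ? "compress" : "don't compress");
    }
}
```

For the example frequencies the input is 166 characters, or 1328 bits. The encoded text takes 284 bits, the tree 49 bits, and the header 32 bits, giving 365 bits, which rounds up to 368. Compression is clearly worthwhile in this case.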
