Huffman Codes - Information Theory - Lecture Slides

These lecture slides cover: Huffman Codes, Probabilities, Redundancy, Guessing Games, Bits per Guess, Compression, Decompression, Arithmetic Coding, Algorithm, Example, Number of Bits, Decoding, Bayesian Derivation.

EE514a – Information Theory I, Fall Quarter 2013
Prof. Jeff Bilmes, Department of Electrical Engineering, University of Washington, Seattle
http://j.ee.washington.edu/~bilmes/classes/ee514a_fall_2013/
Lecture 11 - Oct 29th, 2013

Class Road Map - IT-I
L1 (9/26): Overview, Communications, Information, Entropy
L2 (10/1): Props. Entropy, Mutual Information
L3 (10/3): KL-Divergence, Convex, Jensen, and properties
L4 (10/8): Data Proc. Ineq., thermodynamics, Stats, Fano, M. of Conv.
L5 (10/10): AEP, Compression
L6 (10/15): Compression, Method of Types
L7 (10/17): Types, U. Coding, Stoc. Processes, Entropy rates
L8 (10/22): Entropy rates, HMMs, Coding, Kraft
L9 (10/24): Kraft, Shannon Codes, Huffman, Shannon/Fano/Elias
L10 (10/28): Huffman, Shannon/Fano/Elias
L11 (10/29): Shannon Games, Arith. Coding
L12 (10/31): Midterm, in class
L13-L20: (topics not yet listed)
Finals Week: December 12th-16th.

Announcements
Office hours every week, Tuesdays 4:30-5:30pm; you can also reach me at that time via a Canvas conference.
Midterm on Thursday, 10/31, in class. It covers everything up to and including homework 4 (today's cumulative reading). We'll have a review on 10/29.
The next lecture conflicts with Stephen Boyd's lecture (3:30-4:20pm in room EEB-105, see http://www.ee.washington.edu/news/2013/boyd_lytle_lecture.html). So that everyone can attend it, half of Tuesday's lecture will be YouTube-only (which is right now), and we'll meet in person only from 2:30-3:20. On Tuesday, Oct 29th, we will meet from 2:30-3:20 in EEB-026 and then walk over to the Boyd talk. The topic will be "games" and then midterm review.

Huffman Codes
Huffman coding is a symbol code: we code one symbol at a time. Is Huffman optimal? But what does optimal mean? In general, for a symbol code, each symbol in the source alphabet must use an integer number of codeword bits. This is fine for D-adic distributions but can cost up to one extra bit per symbol on average. Bad example: p(0) = 1 - p(1) = 0.999. Then -log p(0) ≈ 0, so we should be using close to zero bits per symbol to code this source, but Huffman uses 1. Thus, we need to code long blocks to get any benefit. In practice, this means we need to store and be able to compute p(x_{1:n}). No problem, right?

Huffman Codes
Can we easily compute p(x_{1:n})? If |A| is the alphabet size, we need a table of size |A|^n to store these probabilities. Moreover, it is hard to estimate p(x_{1:n}) accurately. Given a finite amount of "training data" (to borrow a phrase from machine learning), it is hard to estimate this distribution: many of the possible strings in any finite sample will simply not occur (sparsity). Example: how hard is it to find a short, grammatically valid English phrase never before written, using a web search engine? "dogs ate banks on the river" is not found as of Mon, Oct 28, 2013. Smoothing models are required. This is the same issue as the language-model problem in natural language processing.
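As a quick check on the two points above (the one-bit-per-symbol floor of a symbol code and the |A|^n blow-up of block coding), here is a minimal Python sketch, not from the slides and with names of my own choosing, that builds binary Huffman codes over blocks of the p(0) = 0.999 source and reports bits per source symbol:

```python
import heapq, math
from itertools import product

def huffman_expected_length(probs):
    """Expected codeword length of a binary Huffman code for the pmf `probs`.
    Uses the fact that E[length] equals the sum of all internal-node probabilities."""
    heap = list(probs)
    heapq.heapify(heap)
    expected = 0.0
    while len(heap) > 1:
        p = heapq.heappop(heap) + heapq.heappop(heap)  # merge the two least probable nodes
        expected += p                                  # each merge adds one bit below it
        heapq.heappush(heap, p)
    return expected

p0 = 0.999
H = -(p0 * math.log2(p0) + (1 - p0) * math.log2(1 - p0))   # entropy, about 0.0114 bits/symbol
for n in (1, 2, 4, 8):
    block = [math.prod(p0 if b == 0 else 1 - p0 for b in blk)
             for blk in product((0, 1), repeat=n)]          # 2^n block probabilities
    print(f"n={n}: {huffman_expected_length(block)/n:.4f} bits/symbol (entropy {H:.4f})")
```

For n = 1 the code spends exactly 1 bit per symbol; as n grows the rate shrinks roughly like 1/n toward the entropy, while the block-probability table grows as 2^n, which is exactly the storage issue raised above.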
The Probabilities They Are A-Changin'
Real sequential processes are not stationary. It might be a reasonable approximation to assume that they are "locally stationary", meaning that the statistics of the process are governed by a distribution p(x) within a given fixed-width time window. Huffman assumes one fixed p(x). If this changes, say to p'(x), the code will be less optimal by D(p'(x)||p(x)) bits per symbol, where p'(x) is the "correct" distribution. Instead we could:
1. Recompute the Huffman distribution and code each period. This is inefficient, however, as we'll need to re-transmit the codebook each time!
2. Use some sort of adaptive Huffman scheme.
But do we really want a Huffman code to begin with?
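To make the D(p'||p) penalty concrete, here is a small Python sketch (my own illustration; the two distributions are made-up numbers, not anything from the slides) of the extra bits per symbol paid by an ideal code designed for p when the source actually follows p':

```python
import math

def kl_bits(p_true, p_model):
    """D(p_true || p_model) in bits: the per-symbol redundancy of an ideal
    code built for p_model when the source actually follows p_true."""
    return sum(pt * math.log2(pt / pm) for pt, pm in zip(p_true, p_model) if pt > 0)

p_design = [0.70, 0.20, 0.10]   # distribution the code was designed for (hypothetical)
p_actual = [0.50, 0.30, 0.20]   # distribution the source has drifted to (hypothetical)
print(f"extra cost: {kl_bits(p_actual, p_design):.4f} bits/symbol")
```

The average length actually paid is the cross-entropy H(p') + D(p'||p), which is where the per-symbol penalty quoted above comes from (for ideal, non-integer codeword lengths).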
O Redundancy, Redundancy, wherefore art thou Redundancy
Consider English text. Redundancy abounds: it exists at the sentence level, the word level, and the character level. Complete the following sentence fragment: "with more than 300 dead, most of the victims choked to death." Did you really need to see that last word? We could just predict it, or alternatively use very few bits to code it. Shannon realized this early on, and he attempted to come up with an estimate of the entropy of English text. We can use humans and play a guessing game to help us do that. Assume we have a simple alphabet, the letters 'A' through 'Z' along with the space " " (so 27 symbols). The process is as follows: given the letter history, we ask a person to guess the following letter, and count how many guesses it takes to get it right. Letters are given one at a time in sequence.
Shannon Games
Guess the following next letter, and we will count how many guesses it takes. It goes on: "a friend of mine found out rather dramatically the other day". Another instance of the number of guesses is:
T H E R E - I S - N O - R E V E R S E - O N - A - M O T O R C Y C L E -
1 1 1 5 1 1 2 1 1 2 1 1 15 1 17 1 1 1 2 1 3 2 1 2 2 7 1 1 1 1 4 1 1 1 1 1

Guessing Games
The sequence of guess numbers for the letters above can be seen as a "code" for the string, i.e., a mapping from letters to integers:
C : {'A', 'B', 'C', ..., 'Z', ' '} → {1, 2, ..., 27}    (11.1)
Often the guesses are immediate, so there are many more ones (1s), twos (2s), and so on than there are large integers. Things that are more predictable have fewer guesses, or have higher probability.
Things that require more guesses are less predictable, and have lower probability. The redundancy of English is its predictability: the more low-numbered integers, the more redundant English is.

Bits per guess
Let g_t be the number of guesses at position t within the source string. Then log g_t is the number of bits needed to represent a number as large as g_t, i.e., the number of bits required to encode the number of guesses at stage t. We can therefore estimate the entropy rate of this process as
\hat{H}(X) \approx \frac{1}{n}\sum_{t=1}^{n} \log g_t \approx \frac{1}{n}\sum_{t=1}^{n} \log \frac{1}{p(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_1)}    (11.2)
Suppose that x_1, x_2, ... is a stochastic process with an entropy rate of the form H(X_t | X_{t-1}, ..., X_1), so that p(x_t | x_{t-1}, x_{t-2}, ..., x_1) is the probability of letter x_t at time t. Then we should have, approximately,
g_t \approx \frac{1}{p(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_1)}    (11.3)
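Here is a minimal Python sketch (mine, not from the slides) of the estimate in Eq. (11.2), applied to the guess counts for the motorcycle sentence above:

```python
import math

# Guess counts for "THERE IS NO REVERSE ON A MOTORCYCLE " from the slide.
guesses = [1, 1, 1, 5, 1, 1, 2, 1, 1, 2, 1, 1, 15, 1, 17, 1, 1, 1, 2, 1,
           3, 2, 1, 2, 2, 7, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1]

# Eq. (11.2): average log2(g_t) over the string.
H_hat = sum(math.log2(g) for g in guesses) / len(guesses)
print(f"estimated entropy rate: {H_hat:.2f} bits/character "
      f"(vs. log2(27) = {math.log2(27):.2f} for a uniform, memoryless model)")
```

A single short sentence gives only a very rough number, but it already shows how far below log2(27) ≈ 4.75 bits per character English sits once context is used.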
Compression
A compression algorithm could transform the source string into a string of numbers. I.e., we'd use C : {'A', 'B', 'C', ..., 'Z', ' '} → {1, 2, ..., 27} to transform from source symbols into code symbols. But the frequency of the code symbols 1, 2, 3, etc. is much higher than that of any of the source (alphabet) symbols. I.e., we'll see a "1" much more frequently than an "e", since many letters, even when they are not an "e", are easily guessable, and sometimes "e" is not guessable. So, rather than encode "there is no reverse on a motorcycle", we would encode and compress "1,1,1,5,1,1,2,1,1,2,1,1,15,1,17,1,1,1,2,1,3,2,1,2,2,7,1,1,1,1,4,1,1,1,1". This should compress well: many 1s.
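The slides use a human as the predictor. As a crude stand-in, here is a toy Python sketch (my own, purely illustrative) in which the "guesser" always tries letters in one fixed frequency order, ignoring the letter history. It maps a string to guess numbers, and because the guesser is deterministic, replaying the same ranking inverts the mapping, which previews the decoding idea on the next slide:

```python
# Toy deterministic "guesser": always guesses symbols in this fixed order
# (space plus a rough English letter-frequency ranking; illustrative only).
RANKING = " ETAOINSHRDLCUMWFGYPBVKJXQZ"

def to_guess_numbers(text):
    """Map each character to the guess number (1-based rank) the toy guesser would need."""
    return [RANKING.index(ch) + 1 for ch in text.upper()]

def from_guess_numbers(guesses):
    """The 'identical twin': replay the same ranking to invert the mapping."""
    return "".join(RANKING[g - 1] for g in guesses)

msg = "THERE IS NO REVERSE ON A MOTORCYCLE"
nums = to_guess_numbers(msg)
assert from_guess_numbers(nums) == msg
print(nums)
```

Because this guesser ignores context, its guess numbers are larger on average than a human's; a context-aware predictor is exactly what makes the real scheme (and arithmetic coding below) work well.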
Decompression
But how do we decode? We use an identical twin. I.e., at stage t, we ask that twin to guess the next letter, and tell them to stop guessing once they have made g_t guesses. Thus we'll recover the source message. Alternatively, we could compute a large table for all possible histories and memorize a human's guessing scheme, i.e., for all possible histories we would record g_t and then wait that many guesses. This is of course impractical: 27^L entries for length-L strings, so we'll need to do something smarter (as we will see). Nonetheless, this scheme, first introduced by Shannon in the early 1950s, is the basis for what is called arithmetic coding.

Arithmetic Coding
This is the method used by DjVu (adaptive image compression used for printed material, overtaken by PDF, though certain PDF formats probably use this as well). Assume we are given a probabilistic model of the source, i.e.,
p(x_{1:n}) = \prod_{i=1}^{n} p(x_i)    (11.4)
for a simple i.i.d. model, or alternatively
p(x_{1:n}) = p(x_1) \prod_{i=2}^{n} p(x_i \mid x_{i-1})    (11.5)
for a 1st-order Markov model. Higher-order Markov models are often used as well (as we'll see).

Arithmetic Coding
At each symbol, we use the conditional probability to provide the probability of the next symbol. Arithmetic coding can easily handle complex adaptive models of the source that produce context-dependent predictive distributions (so not necessarily stationary); e.g., we could use p_t(x_t | x_1, ..., x_{t-1}). It is best understood with an example. Let X = {a, e, i, o, u, !}, so |X| = 6. The source X_1, X_2, ... need not be i.i.d. Assume that p(x_n | x_1, x_2, ..., x_{n-1}) is given to both the encoder (sender, compressor) and the receiver (decoder, uncompressor).
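A small Python sketch (my own illustration; all the probability numbers are made up) of the two factorizations in Eqs. (11.4) and (11.5), computing p(x_{1:n}) for a short string over the alphabet above:

```python
import math

# Hypothetical marginals over the slides' alphabet {a, e, i, o, u, !}.
p1 = {"a": 0.3, "e": 0.3, "i": 0.15, "o": 0.1, "u": 0.1, "!": 0.05}
# Hypothetical first-order behaviour: after an "a", the next-symbol distribution shifts.
after_a = {"a": 0.1, "e": 0.4, "i": 0.2, "o": 0.1, "u": 0.1, "!": 0.1}

def p_cond(cur, prev):
    """p(x_i = cur | x_{i-1} = prev) for the toy first-order Markov model."""
    return (after_a if prev == "a" else p1)[cur]

def prob_iid(s):
    """Eq. (11.4): product of marginals."""
    return math.prod(p1[c] for c in s)

def prob_markov1(s):
    """Eq. (11.5): p(x_1) times the product of one-step conditionals."""
    return p1[s[0]] * math.prod(p_cond(b, a) for a, b in zip(s, s[1:]))

s = "aeio!"
print(f"iid: {prob_iid(s):.6f}   first-order Markov: {prob_markov1(s):.6f}")
```

Either factorization (or a higher-order one) works for arithmetic coding, as long as encoder and decoder evaluate exactly the same conditionals.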
Arithmetic Coding
As in Shannon-Fano-Elias coding, we divide the unit interval into segments whose lengths are the probabilities p(X_1 = x) for x ∈ {a, e, i, o, u, !}. [Figure: the unit interval split at the cumulative sums p(a), p(a) + p(e), p(a) + p(e) + p(i), p(a) + p(e) + p(i) + p(o), and p(a) + p(e) + p(i) + p(o) + p(u), giving segments of lengths p(x_1 = a) through p(x_1 = !).]

Arithmetic Coding
Each subinterval may be further divided into segments of relative length p(X_2 = x_2 | X_1 = x_1), i.e., of actual length p(X_2 = x_2, X_1 = x_1). The relative lengths may get longer or shorter: p(X_1 = j) can be greater or less than p(X_2 = j | X_1 = k). [Figure: each first-level segment is subdivided again into a, e, i, o, u, !, shown starting from the segment for p(X_1 = a); the nested segments for "ae" and "aei" have lengths p(a, e) and p(a, e, i).]
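A short Python sketch (mine; the probabilities are placeholders) of this first subdivision step: turning a distribution over {a, e, i, o, u, !} into cumulative [lower, upper) segments of the unit interval:

```python
# Hypothetical first-symbol distribution over the slides' alphabet.
p1 = {"a": 0.25, "e": 0.25, "i": 0.2, "o": 0.15, "u": 0.1, "!": 0.05}

def unit_interval_segments(pmf):
    """Split [0, 1) into [lower, upper) segments, in the order the pmf lists its symbols."""
    segments, lo = {}, 0.0
    for sym, p in pmf.items():
        segments[sym] = (lo, lo + p)
        lo += p
    return segments

for sym, (lo, hi) in unit_interval_segments(p1).items():
    print(f"{sym}: [{lo:.2f}, {hi:.2f})")
```

Deeper levels just reuse the same splitter on the conditional distribution, rescaled into the parent segment, which is what the nested figure depicts.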
Arithmetic Coding
The length of the interval for "ae" is p(X_1 = a, X_2 = e) = p(X_1 = a) p(X_2 = e | X_1 = a). The intervals keep getting exponentially smaller as n grows. Key point: at each stage, the relative lengths of the intervals can change depending on the history. At t = 1, the relative interval fraction for "a" is p(a); at t = 2, the relative interval fraction for "a" is p(a | X_1), which might change depending on X_1; and so on. This is different from Shannon-Fano-Elias coding, which uses the same interval lengths at each step. Thus, if a symbol becomes very probable, it gets a long relative interval (few bits to make it distinct), and if it becomes very improbable, it gets a short relative interval (more bits to make it distinct). On this last point: an interval such as [0.0110, 0.0111) is smaller than an interval such as [0.10, 0.11).
Arithmetic Coding
How to code? Let i be the current source symbol number for X_i. We maintain a lower and an upper interval position:
L_n(i \mid x_1, x_2, \ldots, x_{n-1}) = \sum_{j=1}^{i-1} p(x_n = j \mid x_1, x_2, \ldots, x_{n-1})    (11.6)
U_n(i \mid x_1, x_2, \ldots, x_{n-1}) = \sum_{j=1}^{i} p(x_n = j \mid x_1, x_2, \ldots, x_{n-1})    (11.7)
On arrival of the nth input symbol, we subdivide the (n-1)st interval according to L_n and U_n, using the half-open interval [L_n, U_n).

Interval Divisions
Example: the initial interval is [0, 1), and we divide it depending on the symbol we receive:
a ↔ [L_1(a), U_1(a)) = [0, p(X_1 = a))    (11.8)
e ↔ [L_1(e), U_1(e)) = [p(X_1 = a), p(X_1 = a) + p(X_1 = e))    (11.9)
i ↔ [L_1(i), U_1(i)) = [p(a) + p(e), p(a) + p(e) + p(i))    (11.10)
o ↔ [L_1(o), U_1(o)) = [p(a) + p(e) + p(i), p(a) + p(e) + p(i) + p(o))
u ↔ [L_1(u), U_1(u)) = [\sum_{x \in \{a,e,i,o\}} p(x), \sum_{x \in \{a,e,i,o,u\}} p(x))    (11.11)
! ↔ [L_1(!), U_1(!)) = [\sum_{x \in \{a,e,i,o,u\}} p(x), 1)    (11.12)

Algorithm
In general, for the string x_1, x_2, ..., we use an algorithm that derives the interval [ℓ, u) at each time step, where ℓ is the lower end and u is the upper end. Suppose we want to send N source symbols. Then we can follow the algorithm below.
1. ℓ ← 0
2. u ← 1
3. p ← u - ℓ
4. for n = 1 ... N do
5.   compute, for all i ∈ X, U_n and L_n as in Eqns. (11.6), (11.7)
6.   u ← ℓ + p · U_n(x_n | x_1, ..., x_{n-1})
7.   ℓ ← ℓ + p · L_n(x_n | x_1, ..., x_{n-1})
8.   p ← u - ℓ

Encoding
Once we have the final interval, to encode we simply send any binary string that lies in the final interval [ℓ, u) produced by the algorithm. Alternatively, we can make the algorithm online, so that it starts writing out bits of the interval as soon as they are known unambiguously. Analogous to Shannon-Fano-Elias coding, if the current interval is [0.100101, 0.100110), then we can already send the common prefix 1001, since that will not change.
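Here is a minimal Python translation of the loop above (my own code, not from the slides). It takes any conditional model, supplied as a function p(symbol | history), narrows [ℓ, u) once per source symbol exactly as in steps 5-8, and then picks a binary codeword inside the final interval:

```python
import math

def encode_interval(symbols, alphabet, cond_prob):
    """Narrow [lo, hi) once per source symbol, following the slide's algorithm.
    cond_prob(sym, history) must return p(x_n = sym | history)."""
    lo, hi = 0.0, 1.0
    history = []
    for x in symbols:
        width = hi - lo
        # Eqs. (11.6)/(11.7): cumulative probability strictly below x, and up to x.
        L = sum(cond_prob(s, history) for s in alphabet[:alphabet.index(x)])
        U = L + cond_prob(x, history)
        lo, hi = lo + width * L, lo + width * U
        history.append(x)
    return lo, hi

def codeword(lo, hi):
    """A binary string whose dyadic interval [v/2^k, (v+1)/2^k) lies inside [lo, hi);
    k = ceil(log2 1/(hi - lo)) + 1 bits always suffice (Shannon-Fano-Elias style)."""
    k = math.ceil(-math.log2(hi - lo)) + 1
    v = math.ceil(lo * 2**k)              # smallest k-bit dyadic number >= lo
    return format(v, f"0{k}b")

# Usage with a hypothetical i.i.d. model over "ab#" ("#" playing the termination symbol).
probs = {"a": 0.425, "b": 0.425, "#": 0.15}
lo, hi = encode_interval("bbba#", "ab#", lambda s, history: probs[s])
print(f"[{lo:.6f}, {hi:.6f}) -> {codeword(lo, hi)}")
```

The conditional model is just a callback, so an adaptive, history-dependent p(x_n | x_1, ..., x_{n-1}) plugs in without changing the loop; that is the point made above about arithmetic coding and adaptive models.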
Example
Here is an example (let □ be a termination symbol). The conditional probabilities are:
Context    p(a | context)   p(b | context)   p(□ | context)
(empty)    0.425            0.425            0.15
b          0.28             0.57             0.15
bb         0.21             0.64             0.15
bbb        0.17             0.68             0.15
bbba       0.28             0.57             0.15
With these probabilities, we will consider encoding the string bbba□, and we'll get a final interval that corresponds to the codeword 100111101. I.e., the final codeword will be 100111101. Let's look at the entire picture.

Coding
[Figure, from D.J.C. MacKay's 2001 book: the source-string intervals (a, b, ba, bb, bba, bbb, bbba, bbbb, ...) drawn alongside the binary codeword intervals (0, 1, 00, 01, ...), zooming in on the interval for bbba and the codeword 100111101.]

Coding
[Figure, from D.J.C. MacKay's 2001 book: a zoom-in on the interval for bbba□ and the nearby binary codeword intervals 10011, 100110, 100111, 1001110, 1001111, ..., 100111101, with the selected codeword marked.]
Q: Why can't we use 1001111? A: Because its interval is too large. Codeword 100111101's interval is entirely within bbba□'s interval, so we are prefix-free.
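As a self-contained check (my own code; the termination symbol □ is written "#" here, and the a-then-b-then-□ ordering of the sub-intervals is assumed from the figure), the following Python sketch reruns the interval narrowing on the table above and, under those assumptions, should end up with the codeword 100111101 from the slides:

```python
import math

# Conditional probabilities from the slide's table; "" is the empty context.
P = {"":     {"a": 0.425, "b": 0.425, "#": 0.15},
     "b":    {"a": 0.28,  "b": 0.57,  "#": 0.15},
     "bb":   {"a": 0.21,  "b": 0.64,  "#": 0.15},
     "bbb":  {"a": 0.17,  "b": 0.68,  "#": 0.15},
     "bbba": {"a": 0.28,  "b": 0.57,  "#": 0.15}}

lo, hi, ctx = 0.0, 1.0, ""
for x in "bbba#":
    width, dist, cum = hi - lo, P[ctx], 0.0
    for sym in "ab#":                      # assumed sub-interval order: a, b, then #
        if sym == x:
            lo, hi = lo + width * cum, lo + width * (cum + dist[sym])
            break
        cum += dist[sym]
    if x != "#":
        ctx += x

k = math.ceil(-math.log2(hi - lo)) + 1     # codeword length, ceil(log 1/p) + 1
v = math.ceil(lo * 2**k)                   # smallest k-bit dyadic number >= lo
print(f"final interval [{lo:.6f}, {hi:.6f}), codeword {v:0{k}b}")
```

The final interval width equals p(bbba#) ≈ 0.004, so the codeword needs ⌈log 1/p⌉ + 1 = 9 bits, which is the length rule discussed on the "Number of bits" slide below.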
Decoding
To decode a binary string, say α = 0.z_1 z_2 z_3 ..., we use the following algorithm:
1. ℓ ← 0
2. u ← 1
3. p ← u - ℓ
4. while the special termination symbol □ has not been decoded do
5.   find i such that L_n(i | x_1, ..., x_{n-1}) ≤ (α - ℓ) / (u - ℓ) < U_n(i | x_1, ..., x_{n-1}), and emit symbol i
6.   u ← ℓ + p · U_n(i | x_1, ..., x_{n-1})
7.   ℓ ← ℓ + p · L_n(i | x_1, ..., x_{n-1})
8.   p ← u - ℓ

Number of bits
The problem is that a given number in the final interval [L_n, U_n) could require arbitrarily many digits (e.g., a repeating or irrational number). We only need to send enough bits to uniquely identify the string. How do we choose the number of bits to send? Define
F_n(i \mid x_1, x_2, \ldots, x_{n-1}) = \frac{1}{2} [L_n(i) + U_n(i)]    (11.13)
and \lfloor F_n(i \mid x_1, x_2, \ldots, x_{n-1}) \rfloor_\ell, which is F_n truncated to ℓ bits. We could use per-symbol lengths ℓ(x_n | x_1, ..., x_{n-1}) = ⌈log 1/p(x_n | x_1, ..., x_{n-1})⌉ + 1. Instead, let's use the Shannon length of the entire code:
\ell(x_{1:n}) = \lceil \log 1/p(x_{1:n}) \rceil + 1    (11.14)

Estimating p(x_n | x_1, ..., x_{n-1})
We still have the problem that we need to estimate p(x_n | x_1, ..., x_{n-1}). We'd like to use adaptive models. One possibility is the Dirichlet model, having no independencies:
p(a \mid x_{1:n-1}) = \frac{N(a \mid x_{1:n-1}) + \alpha}{\sum_{a'} \left( N(a' \mid x_{1:n-1}) + \alpha \right)}    (11.19)
Small α means more responsive; large α means more sluggish. How do we derive this? We can do so in a Bayesian setting. In general, the problem of density estimation is a topic in and of itself.
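A short Python sketch (mine, not from the slides) of the add-α predictive rule in Eq. (11.19), the kind of adaptive model one would plug into the arithmetic coder as p(x_n | x_1, ..., x_{n-1}):

```python
from collections import Counter

ALPHABET = "ab#"          # "#" standing in for the termination symbol
ALPHA = 0.5               # small alpha adapts quickly, large alpha is more sluggish

def dirichlet_predict(history, alpha=ALPHA, alphabet=ALPHABET):
    """Eq. (11.19): p(a | history) = (N(a) + alpha) / sum over a' of (N(a') + alpha)."""
    counts = Counter(history)
    denom = len(history) + alpha * len(alphabet)
    return {a: (counts[a] + alpha) / denom for a in alphabet}

print(dirichlet_predict(""))      # uniform before any data has been seen
print(dirichlet_predict("bbb"))   # "b" has become much more probable
```

This particular toy model conditions only on the symbol counts in the history, not on their order; as the slide notes, the rule itself can be derived in a Bayesian setting.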