Kraft Inequality – Information Theory – Lecture Slides

These lecture slides cover the Kraft inequality, progress towards optimal codes, Shannon codes and their optimality, Huffman coding, the greedy method, and the competitive optimality of the Shannon code.

EE514a – Information Theory I
Fall Quarter 2013
Prof. Jeff Bilmes
University of Washington, Seattle, Department of Electrical Engineering
http://j.ee.washington.edu/~bilmes/classes/ee514a_fall_2013/
Lecture 10 – Oct 28th, 2013

Class Road Map – IT-I

- L1 (9/26): Overview, Communications, Information, Entropy
- L2 (10/1): Properties of Entropy, Mutual Information
- L3 (10/3): KL-Divergence, Convexity, Jensen, and properties
- L4 (10/8): Data Processing Inequality, thermodynamics, Stats, Fano, M. of Conv.
- L5 (10/10): AEP, Compression
- L6 (10/15): Compression, Method of Types
- L7 (10/17): Types, Universal Coding, Stochastic Processes, Entropy rates
- L8 (10/22): Entropy rates, HMMs, Coding, Kraft
- L9 (10/24): Kraft, Shannon Codes, Huffman, Shannon/Fano/Elias
- L10 (10/28): Huffman, Shannon/Fano/Elias
- L11 (10/29): Shannon Games, Arithmetic Coding
- L12 (10/31): Midterm, in class
- L13–L20: to be scheduled
- Finals Week: December 12th–16th

Announcements

- Office hours every week, Tuesdays 4:30–5:30pm. You can also reach me at that time via a Canvas conference.
- Midterm on Thursday, 10/31, in class. It covers everything up to and including homework 4 (today's cumulative reading). We'll have a review on 10/29.
- The next lecture conflicts with Stephen Boyd's lecture (3:30–4:20pm in room EEB-105, see http://www.ee.washington.edu/news/2013/boyd_lytle_lecture.html). So that everyone can attend it, half of Tuesday's lecture will be YouTube only (which is being recorded right now), and we'll meet in person only from 2:30–3:20. On Tuesday, Oct 29th, we will meet from 2:30–3:20 in EEB-026 and then go over to the Boyd talk. The topic will be "games", followed by midterm review.

On the Midterm

- When: Thursday (Oct 31st, 2013). Length: 1 hour 50 minutes, in class.
- Closed book. You may have one side of one 8.5 × 11 inch sheet of paper on which you can write anything you wish. It can be computer printed or handwritten, and can be used for the final as well (so save the sheet).

Kraft Inequality

Theorem 10.2.1 (Kraft inequality). For any instantaneous code (prefix code) over an alphabet of size $D$, the codeword lengths $\ell_1, \ell_2, \ldots, \ell_m$ must satisfy
$$\sum_i D^{-\ell_i} \le 1. \qquad (10.1)$$
Conversely, given a set of codeword lengths satisfying this inequality, there exists an instantaneous code with these word lengths.

- Note: the converse says there exists a code with these lengths, not that every code with these lengths satisfies the inequality.
- Key point: for lengths $\ell_i$ satisfying Kraft, no further restriction is imposed by also wanting a prefix code, so we might as well use a prefix code (assuming it is easy to find given the lengths).
- This connects code existence to a mathematical property of the lengths.
- Given lengths satisfying Kraft, we can construct an instantaneous code (as we will see).
- Given the lengths, we can compute $E[\ell]$ and compare with $H$.
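As a concrete illustration of the converse, here is a minimal Python sketch (my own, not part of the slides; `kraft_sum` and `canonical_prefix_code` are hypothetical helper names). It checks the Kraft sum for a list of lengths and, assuming the inequality holds, builds one binary prefix code with those lengths via the standard canonical assignment.

```python
def kraft_sum(lengths, D=2):
    """Left-hand side of the Kraft inequality, sum_i D**(-l_i)."""
    return sum(D ** (-l) for l in lengths)

def canonical_prefix_code(lengths):
    """Build one binary prefix code with the given lengths (assumes Kraft holds).

    Codewords are handed out in order of non-decreasing length, each time
    taking the next unused node at that depth of the binary code tree.
    """
    assert kraft_sum(lengths) <= 1 + 1e-12, "lengths violate the Kraft inequality"
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codes = [None] * len(lengths)
    code, prev_len = 0, 0
    for i in order:
        code <<= lengths[i] - prev_len          # descend to the new depth
        codes[i] = format(code, "0{}b".format(lengths[i]))
        code += 1                               # move to the next sibling
        prev_len = lengths[i]
    return codes

print(kraft_sum([1, 2, 3, 3]))              # 1.0, so the inequality is tight
print(canonical_prefix_code([1, 2, 3, 3]))  # ['0', '10', '110', '111']
```

Handing out consecutive integers at each depth is exactly why the Kraft condition suffices: as long as $\sum_i 2^{-\ell_i} \le 1$, the counter never runs past the number of available nodes at any depth, so no codeword becomes a prefix of another.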
Optimal Code Lengths

Theorem 10.2.2. Entropy is the minimum expected length. That is, the expected length $L$ of any instantaneous $D$-ary code (which thus satisfies the Kraft inequality) for a random variable $X$ satisfies
$$L \ge H_D(X), \qquad (10.6)$$
with equality iff $D^{-\ell_i} = p_i$.

Proof of Theorem 10.2.2 (conclusion). So we have $L \ge H_D(X)$. Equality $L = H$ is achieved iff $p_i = D^{-\ell_i}$ for all $i$, i.e., iff $-\log_D p_i$ is an integer for every $i$, in which case $c = \sum_i D^{-\ell_i} = 1$.

Definition 10.2.2 (D-adic). A probability distribution is called $D$-adic if each of its probabilities equals $D^{-n}$ for some integer $n$. Example: when $D = 2$, the distribution $[\tfrac12, \tfrac14, \tfrac18, \tfrac18] = [2^{-1}, 2^{-2}, 2^{-3}, 2^{-3}]$ is 2-adic. Thus, we have equality above iff the distribution is appropriately $D$-adic.

Shannon Codes

- $L - H = D(p\|r) + \log_D(1/c)$, with $c = \sum_i D^{-\ell_i}$.
- Thus, to produce a code, we find the closest (in the KL sense) $D$-adic distribution to $p$ and then construct the code as in the proof of the Kraft inequality converse.
- In general, however, unless P = NP, it is hard to find the KL-closest $D$-adic distribution (an integer programming problem).
- Shannon codes: take $\ell_i = \lceil \log_D(1/p_i) \rceil$ as the code lengths.
- Then $\sum_i D^{-\ell_i} = \sum_i D^{-\lceil \log_D(1/p_i) \rceil} \le \sum_i D^{-\log_D(1/p_i)} = \sum_i p_i = 1$.
- So the Kraft inequality holds for these lengths, and hence a prefix code with these lengths exists (if the lengths were too short there might be a problem, but we are rounding up).
- We also have a bound on the lengths in terms of real numbers:
$$\log_D \frac{1}{p_i} \le \ell_i < \log_D \frac{1}{p_i} + 1. \qquad (10.12)$$
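To make these bullets concrete, here is a small check of my own (not from the slides): compute binary Shannon lengths for an arbitrary distribution, verify the Kraft sum, and compare $E[\ell]$ with $H$.

```python
import math

def shannon_lengths(p):
    """Binary Shannon code lengths l_i = ceil(log2(1/p_i))."""
    return [math.ceil(-math.log2(q)) for q in p]

p = [0.4, 0.3, 0.2, 0.1]
lengths = shannon_lengths(p)
kraft = sum(2.0 ** (-l) for l in lengths)
H = -sum(q * math.log2(q) for q in p)
EL = sum(q * l for q, l in zip(p, lengths))
print(lengths, kraft)   # [2, 2, 3, 4], Kraft sum 0.6875 <= 1
print(H, EL)            # roughly 1.846 and 2.4, so H <= E[l] < H + 1 holds
```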
Coding with the Wrong Distribution

Theorem 10.2.4. The expected length under $p(x)$ of a code with lengths $\ell(x) = \lceil \log(1/q(x)) \rceil$ satisfies
$$H(p) + D(p\|q) \le E_p[\ell(X)] \le H(p) + D(p\|q) + 1. \qquad (10.22)$$
The left-hand side is the best we can do with the wrong distribution $q$ when the true distribution is $p$.
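A quick numerical sanity check of (10.22), with an arbitrarily chosen true distribution p and wrong distribution q (my own sketch, not from the slides):

```python
import math

def lengths_for(q):
    """Codeword lengths designed for the (possibly wrong) distribution q."""
    return [math.ceil(-math.log2(qi)) for qi in q]

p = [0.5, 0.25, 0.125, 0.125]   # true distribution
q = [0.25, 0.25, 0.25, 0.25]    # assumed (wrong) distribution
H  = -sum(pi * math.log2(pi) for pi in p)
KL =  sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))
EL =  sum(pi * l for pi, l in zip(p, lengths_for(q)))
print(H + KL, EL, H + KL + 1)   # 2.0, 2.0, 3.0: both bounds of (10.22) hold
```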
Kraft Revisited

- We proved that the Kraft inequality holds for instantaneous codes (and vice versa). Could it also hold for all uniquely decodable codes? Could this larger class of codes have shorter expected codeword lengths? Since the class is larger, we might (naively) expect that we could do better.

Theorem 10.2.4. The codeword lengths of any uniquely decodable code (not necessarily instantaneous) must satisfy the Kraft inequality $\sum_i D^{-\ell_i} \le 1$. Conversely, given a set of codeword lengths that satisfy Kraft, it is possible to construct a uniquely decodable code.

Proof. We already saw the converse (given the lengths, we can construct a prefix code, which is in particular uniquely decodable). Thus we only need to prove the first part. ...

Is the Shannon Code Optimal?

- Example: $X = \{0, 1\}$ with $p(X = 0) = 10^{-1000} = 1 - p(X = 1)$. What are the Shannon lengths?
- $\ell(0) = \lceil \log_2 10^{1000} \rceil = 3322$ bits, and $\ell(1) = \lceil \log_2 \frac{1}{1 - 10^{-1000}} \rceil = 1$ bit.
- For symbol 0, we are using 3321 too many bits.
- In general, for other distributions, one can construct cases where $\lceil \log_D(1/p_i) \rceil$ is longer than necessary.
- Shannon-length codes are not optimal integer-length prefix codes.
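Floating point cannot even represent $10^{-1000}$, so the following sketch (my own; `ceil_log2` is a hypothetical helper) reproduces these two Shannon lengths with exact integer arithmetic.

```python
def ceil_log2(num, den):
    """ceil(log2(num/den)) for positive integers num >= den, via exact integer arithmetic."""
    k = 0
    while (den << k) < num:   # find the smallest k with den * 2**k >= num
        k += 1
    return k

# p(X=0) = 10**-1000, so 1/p(0) = 10**1000 / 1:
print(ceil_log2(10 ** 1000, 1))                # 3322 bits for symbol 0
# p(X=1) = 1 - 10**-1000, so 1/p(1) = 10**1000 / (10**1000 - 1):
print(ceil_log2(10 ** 1000, 10 ** 1000 - 1))   # 1 bit for symbol 1
```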
Huffman Coding

- A procedure for finding the shortest expected-length prefix code. You have probably encountered it in computer science classes (it is a classic algorithm). Here we analyze it armed with the tools of information theory.
- Quest: given $p(x)$, find a code (bit strings and a set of lengths) that is as short as possible and is also an instantaneous (prefix-free) code.
- We could do this greedily: start at the top and split the potential codewords into sets of (roughly) even probability, i.e., ask the question with the highest entropy.
- This is similar to the game of 20 questions. We have a set of objects, w.l.o.g. the set $S = \{1, 2, 3, 4, \ldots, m\}$, that occur with frequency proportional to non-negative weights $(w_1, w_2, \ldots, w_m)$. We wish to determine an object from this class asking as few questions as possible. Supposing $X \in S$, each question can take the form "Is $X \in A$?" for some $A \subseteq S$.

20 Questions

[Figure: a yes/no question tree over $S = \{x_1, x_2, x_3, x_4, x_5\}$ with leaf probabilities 0.2, 0.2, 0.3, 0.15, 0.15, built from questions such as "Is $X \in \{x_2, x_3\}$?", "Is $X \in \{x_2\}$?", "Is $X \in \{x_1\}$?", and "Is $X \in \{x_4\}$?".]

- How do we construct such a tree?
- Charles Sanders Peirce (1901) said: "Thus twenty skillful hypotheses will ascertain what two hundred thousand stupid ones might fail to do. The secret of the business lies in the caution which breaks a hypothesis up into its smallest logical components, and only risks one of them at a time."
The Greedy Method

- This suggests a greedy method: "do next whatever currently looks best." Consider the following table:

  symbol   a      b      c      d      e      f      g
  p        0.01   0.24   0.05   0.20   0.47   0.01   0.02

- The question that looks best is the one that would tell us the most about the distribution, i.e., the one with the largest entropy.
- Since the answer $Y_1$ is a function of $X$, $H(X|Y_1) = H(X, Y_1) - H(Y_1) = H(X) - H(Y_1)$, so choosing a question $Y_1$ with large entropy leaves the least "residual" uncertainty $H(X|Y_1)$ about $X$.
- Equivalently, we choose the question $Y_1$ with the greatest mutual information with $X$, since in this case $I(Y_1; X) = H(X) - H(X|Y_1) = H(Y_1)$.
- Again, questions take the form "Is $X \in A$?" for some $A \subseteq S$, so choosing a yes/no (binary) question means choosing the set $A$.

The Greedy Method (continued)

- We proceed greedily and choose the question (set) with the greatest entropy.
- If we consider the partition $\{a, b, c, d, e, f, g\} = \{a, b, c, d\} \cup \{e, f, g\}$, the question "Is $X \in \{e, f, g\}$?" has maximum entropy, since $p(X \in \{a, b, c, d\}) = p(X \in \{e, f, g\}) = 0.5$.
- This question corresponds to the random variable $Y_1 = \mathbf{1}_{\{X \in \{e, f, g\}\}}$, so $H(Y_1) = 1$, which is as good as it gets for a binary question.
- The next question depends on the outcome of the first: either $Y_1 = 0$ (i.e., $X \in \{a, b, c, d\}$) or $Y_1 = 1$ (i.e., $X \in \{e, f, g\}$).
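A brute-force sketch of this greedy question selection (my own illustration; `best_question` and `bern_entropy` are hypothetical names). The subset search is exponential in the alphabet size, so it is only meant for tiny alphabets like this one.

```python
from itertools import combinations
import math

def bern_entropy(q):
    """Entropy in bits of a yes/no question answered 'yes' with probability q."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def best_question(p):
    """Brute-force the subset A maximizing the entropy of the question 'Is X in A?'."""
    symbols = sorted(p)
    total = sum(p.values())
    best_set, best_h = None, -1.0
    for r in range(1, len(symbols)):
        for A in combinations(symbols, r):
            h = bern_entropy(sum(p[s] for s in A) / total)
            if h > best_h:
                best_set, best_h = set(A), h
    return best_set, best_h

p = {'a': 0.01, 'b': 0.24, 'c': 0.05, 'd': 0.20, 'e': 0.47, 'f': 0.01, 'g': 0.02}
print(best_question(p))   # one of the 0.5/0.5 splits, entropy (essentially) 1.0
```

Several subsets of this alphabet achieve an exact 0.5/0.5 split (for example {e, f, g} and its complement), so which one is printed depends only on iteration order; all of them make the first question worth a full bit.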
The Greedy Tree

- If $Y_1 = 0$, we can split to maximize entropy as follows: partition $\{a, b, c, d\} = \{a, b\} \cup \{c, d\}$, since $p(\{a, b\}) = p(\{c, d\}) = 1/4$. This question corresponds to $Y_2 = \mathbf{1}_{\{X \in \{c, d\}\}}$, so $H(Y_2 | Y_1 = 0) = 1$, again as good as it gets.
- If $Y_1 = 1$, we need to partition the set $\{e, f, g\}$. We can do this in one of three ways:

  case              I               II              III
  split             ({e}, {f,g})    ({e,f}, {g})    ({e,g}, {f})
  prob              (0.47, 0.03)    (0.48, 0.02)    (0.49, 0.01)
  H(Y2 | Y1 = 1)    0.3274          0.2423          0.1414

- Thus we choose case I for $Y_2$, since it is the maximum-entropy question, and we get $H(Y_2 | Y_1 = 1) = 0.3274$.
- Also, $H(X | Y_2, Y_1) = H(X, Y_2 | Y_1) - H(Y_2 | Y_1) = H(X | Y_1) - H(Y_2 | Y_1) = H(X) - H(Y_2 | Y_1) - H(Y_1)$.
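The three conditional entropies in the table can be checked directly (my own sketch; `split_entropy` is a hypothetical helper):

```python
import math

def split_entropy(masses):
    """Entropy in bits of the renormalized two-way split with the given masses."""
    total = sum(masses)
    return -sum(m / total * math.log2(m / total) for m in masses if m > 0)

candidates = {                      # splits of {e, f, g}; conditioning mass is 0.5
    "({e}, {f,g})": (0.47, 0.03),
    "({e,f}, {g})": (0.48, 0.02),
    "({e,g}, {f})": (0.49, 0.01),
}
for name, masses in candidates.items():
    print(name, round(split_entropy(masses), 4))
# ({e}, {f,g}) 0.3274   ({e,f}, {g}) 0.2423   ({e,g}, {f}) 0.1414
```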
The Greedy Tree (summary of splits)

- Once we get down to sets of size 2, there is only one possible question.
- The greedy strategy always chooses what currently looks best, ignoring the future; later questions must live with what is left.
- Summarizing all questions/splits and their conditional entropies:

  set                 split                 probabilities    conditional entropy
  {a,b,c,d,e,f,g}     {a,b,c,d}, {e,f,g}    (0.5, 0.5)       H(Y1) = 1
  {a,b,c,d}           {a,b}, {c,d}          (0.25, 0.25)     H(Y2 | Y1=0) = 1
  {e,f,g}             {e}, {f,g}            (0.47, 0.03)     H(Y2 | Y1=1) = 0.3274
  {a,b}               {a}, {b}              (0.01, 0.24)     H(Y3 | Y2=0, Y1=0) = 0.2423
  {c,d}               {c}, {d}              (0.05, 0.20)     H(Y3 | Y2=1, Y1=0) = 0.7219
  {e}                 {e}                   (0.47)           H(Y3 | Y2=0, Y1=1) = 0.0
  {f,g}               {f}, {g}              (0.01, 0.02)     H(Y3 | Y2=1, Y1=1) = 0.9183

- Also note that $H(X) = H(Y_1, Y_2, Y_3) = 1.9323$, and recall
$$H(Y_1, Y_2, Y_3) = H(Y_1) + H(Y_2 | Y_1) + H(Y_3 | Y_1, Y_2) \qquad (10.1)$$
$$= H(Y_1) + \sum_{i \in \{0,1\}} H(Y_2 | Y_1 = i)\, p(Y_1 = i) + \sum_{i, j \in \{0,1\}} H(Y_3 | Y_1 = i, Y_2 = j)\, p(Y_1 = i, Y_2 = j). \qquad (10.2)$$
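As a check of (10.1)–(10.2), the tabulated conditional entropies, each weighted by the probability of reaching its question, do add up to $H(X)$ (my own sketch):

```python
import math

p = {'a': 0.01, 'b': 0.24, 'c': 0.05, 'd': 0.20, 'e': 0.47, 'f': 0.01, 'g': 0.02}
H_X = -sum(q * math.log2(q) for q in p.values())

# (conditional entropy of the question, probability of reaching that node),
# taken from the table above.
nodes = [
    (1.0,    1.00),   # H(Y1): the root is always reached
    (1.0,    0.50),   # H(Y2 | Y1=0), node {a,b,c,d}
    (0.3274, 0.50),   # H(Y2 | Y1=1), node {e,f,g}
    (0.2423, 0.25),   # H(Y3 | Y1=0, Y2=0), node {a,b}
    (0.7219, 0.25),   # H(Y3 | Y1=0, Y2=1), node {c,d}
    (0.0,    0.47),   # H(Y3 | Y1=1, Y2=0), node {e}: nothing left to ask
    (0.9183, 0.03),   # H(Y3 | Y1=1, Y2=1), node {f,g}
]
weighted = sum(h * w for h, w in nodes)
print(round(H_X, 4), round(weighted, 4))   # both come out to about 1.9323
```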
The Greedy Tree (resulting code)

  symbol   a      b      c      d      e      f      g
  p        0.01   0.24   0.05   0.20   0.47   0.01   0.02

- This leads to the following top-down, greedily constructed code (read off the question tree): a → 000, b → 001, c → 010, d → 011, e → 10, f → 110, g → 111.

[Figure: the greedy question tree, with internal-node probabilities 0.5, 0.5, 0.25, 0.25, 0.47, 0.03 and the leaf codewords listed above.]

- The expected length of this code is $E[\ell] = 2.5300$, while the entropy is $H = 1.9323$.
- Code efficiency: $H / E[\ell] = 1.9323 / 2.5300 = 0.7638$. Can we do better?

The Greedy Tree vs. the Huffman Tree

- For the same distribution, the Huffman tree assigns e → 1, b → 01, d → 001, c → 0001, g → 00001, a → 000000, f → 000001.

[Figure: the greedy tree (left) and the Huffman tree (right); the Huffman tree is formed by the successive merges 0.01 + 0.01 → 0.02, 0.02 + 0.02 → 0.04, 0.04 + 0.05 → 0.09, 0.09 + 0.20 → 0.29, 0.24 + 0.29 → 0.53, and 0.47 + 0.53 → 1.0.]

- The Huffman lengths give $E[\ell_{\mathrm{huffman}}] = 1.9700$.
- Efficiency of the Huffman code: $H / E[\ell_{\mathrm{huffman}}] = 1.9323 / 1.9700 = 0.9809$.
- Key problem: the greedy procedure is not optimal in this case.
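Using the codeword assignments read off the two trees above, a short check of my own reproduces the stated expected lengths and efficiencies:

```python
import math

p = {'a': 0.01, 'b': 0.24, 'c': 0.05, 'd': 0.20, 'e': 0.47, 'f': 0.01, 'g': 0.02}
greedy  = {'a': '000', 'b': '001', 'c': '010', 'd': '011',
           'e': '10', 'f': '110', 'g': '111'}
huffman = {'e': '1', 'b': '01', 'd': '001', 'c': '0001',
           'g': '00001', 'a': '000000', 'f': '000001'}

H = -sum(q * math.log2(q) for q in p.values())
for name, code in (("greedy", greedy), ("huffman", huffman)):
    EL = sum(p[s] * len(w) for s, w in code.items())
    print(name, round(EL, 2), round(H / EL, 4))
# greedy 2.53 0.7638
# huffman 1.97 0.9809
```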
Greedy

- Why is starting from the top and splitting in this way not optimal? Where can it go wrong?
- Example: there may be many ways to get an approximately 50% split (and hence high entropy); once made, the split is irrevocable, and there is no way to know whether the consequences of that split will hurt further down the line.

Huffman

The Huffman code tree procedure:

1. Take the two least probable symbols in the alphabet.
2. These two will be given the longest codewords; they will have equal length and differ only in the last digit.
3. Combine these two symbols into a joint symbol whose probability is the sum of the two; add the joint symbol, remove the two original symbols, and repeat.

- Note that this is bottom up (agglomerative clustering) rather than top down (greedy splitting). A minimal implementation sketch follows below.
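Here is a minimal sketch of the bottom-up procedure in Python (my own illustration, not the lecture's implementation; `huffman_code` is a hypothetical name). A min-heap holds the current subtrees, and the two least probable ones are merged repeatedly.

```python
import heapq
from itertools import count

def huffman_code(p):
    """Binary Huffman code for a dict mapping symbol -> probability."""
    tiebreak = count()   # keeps heap comparisons away from the codeword dicts
    heap = [(prob, next(tiebreak), {sym: ""}) for sym, prob in p.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # least probable subtree
        p1, _, c1 = heapq.heappop(heap)   # second least probable subtree
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

p = {'a': 0.01, 'b': 0.24, 'c': 0.05, 'd': 0.20, 'e': 0.47, 'f': 0.01, 'g': 0.02}
code = huffman_code(p)
print({s: code[s] for s in sorted(code)})
print(sum(p[s] * len(w) for s, w in code.items()))   # expected length 1.97
```

Ties between equal probabilities can be broken either way; different valid Huffman trees may assign different codewords, but all of them achieve the same expected length.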
Huffman (example)

- Example: $X = \{1, 2, 3, 4, 5\}$ with probabilities $\{1/4, 1/4, 1/5, 3/20, 3/20\}$, so symbols 4 and 5 should get the longest codewords.
- We build the tree from left to right, merging the two least probable entries at each step: $(0.15, 0.15) \to 0.3$; then $(0.2, 0.25) \to 0.45$; then $(0.25, 0.3) \to 0.55$; finally $(0.45, 0.55) \to 1.0$.
- The resulting code:

  X   p(x)   log2(1/p(x))   length   codeword
  1   0.25   2.0            2        00
  2   0.25   2.0            2        10
  3   0.20   2.3            2        11
  4   0.15   2.7            3        010
  5   0.15   2.7            3        011

- So we have $E[\ell] = 2.3$ bits and $H = 2.2855$ bits; as you can see, this code does quite well (it is close to the entropy).
- Some code lengths are shorter or longer than $I(x) = \log(1/p(x))$.
- The construction is similar for $D > 2$; in that case we might need to add dummy symbols to the alphabet.
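Assuming the `huffman_code` sketch above is in scope, running it on this example reproduces the lengths in the table (the particular bit strings may differ from the slide's, but the lengths, and therefore $E[\ell]$, match):

```python
import math

p5 = {1: 0.25, 2: 0.25, 3: 0.20, 4: 0.15, 5: 0.15}
code5 = huffman_code(p5)                     # from the sketch above
EL = sum(p5[x] * len(w) for x, w in code5.items())
H = -sum(q * math.log2(q) for q in p5.values())
print(code5)                                 # lengths (2, 2, 2, 3, 3)
print(round(EL, 2), round(H, 4))             # 2.3 and 2.2855
```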
More Huffman vs. Shannon

- The Shannon code lengths $\ell_i = \lceil \log(1/p_i) \rceil$, as we saw, are not optimal. A more realistic example: a binary alphabet with probabilities $p(a) = 0.9999$ and $p(b) = 1 - 0.9999$ leads to lengths $\ell_a = 1$ and $\ell_b = 14$ bits, with $E[\ell] = 1.0013 > 1$.
- Optimal code lengths are not always $\le \lceil \log(1/p_i) \rceil$. Consider $X$ with probabilities $(1/3, 1/3, 1/4, 1/12)$, for which $H = 1.8554$.
- The Huffman lengths are either $L_{h1} = (2, 2, 2, 2)$ or $L_{h2} = (1, 2, 3, 3)$ (both are optimal, with expected length 2). But $\lceil \log(1/p_3) \rceil = \lceil -\log(1/4) \rceil = 2 < 3$.
- The Shannon lengths are $L_s = (2, 2, 2, 4)$, with $E[L_s] = 2.1667 > 2$.
- In general, a particular codeword of the optimal code might be longer than the corresponding Shannon length, but of course this cannot be true on average.
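The numbers in this comparison are easy to verify (my own sketch):

```python
import math

p = [1/3, 1/3, 1/4, 1/12]
H = -sum(q * math.log2(q) for q in p)
shannon = [math.ceil(-math.log2(q)) for q in p]
E_shannon = sum(q * l for q, l in zip(p, shannon))
print(round(H, 4), shannon, round(E_shannon, 4))   # 1.8554 [2, 2, 2, 4] 2.1667

# Both optimal (Huffman) length assignments quoted above give expected length 2.
for L in [(2, 2, 2, 2), (1, 2, 3, 3)]:
    print(L, round(sum(q * l for q, l in zip(p, L)), 4))
```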
Optimality of Huffman

- Huffman coding is optimal: it minimizes $\sum_i p_i \ell_i$ over all integer-length prefix codes.
- To show this:
  1. First prove a lemma that some optimal code has certain properties (not all optimal codes, but there exists an optimal code with these properties).
  2. Given a code $C_m$ for $m$ symbols that has these properties, produce a new, simpler code satisfying the lemma that is simpler to optimize.
  3. Ultimately we get down to the simple case of two symbols, which is obvious to optimize.