Entropy and Mutual Information - Lecture Slides | ECE 534, Study notes of Electrical and Electronics Engineering

Material Type: Notes; Class: Elements of Information Theory; Subject: Electrical and Computer Engr; University: University of Illinois - Chicago; Term: Unknown 2012;

University of Illinois at Chicago ECE 534, Fall 2009, Natasha Devroye

Chapter 2: Entropy and Mutual Information

Chapter 2 outline
• Definitions
• Entropy
• Joint entropy, conditional entropy
• Relative entropy, mutual information
• Chain rules
• Jensen's inequality
• Log-sum inequality
• Data processing inequality
• Fano's inequality

Definitions
A discrete random variable X takes on values x from the discrete alphabet 𝒳. The probability mass function (pmf) is described by p_X(x) = p(x) = Pr{X = x}, for x ∈ 𝒳.

Definitions (excerpt from D. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003, Chapter 2: Probability, Entropy, and Inference; http://www.inference.phy.cam.ac.uk/mackay/itila/)

This chapter, and its sibling, Chapter 8, devote some time to notation. Just as the White Knight distinguished between the song, the name of the song, and what the name of the song was called (Carroll, 1998), we will sometimes need to be careful to distinguish between a random variable, the value of the random variable, and the proposition that asserts that the random variable has a particular value. In any particular chapter, however, I will use the most simple and friendly notation possible, at the risk of upsetting pure-minded readers. For example, if something is 'true with probability 1', I will usually simply say that it is 'true'.

2.1 Probabilities and ensembles

An ensemble X is a triple (x, A_X, P_X), where the outcome x is the value of a random variable, which takes on one of a set of possible values A_X = {a_1, a_2, ..., a_i, ..., a_I}, having probabilities P_X = {p_1, p_2, ..., p_I}, with P(x = a_i) = p_i, p_i ≥ 0 and Σ_{a_i ∈ A_X} P(x = a_i) = 1.

The name A is mnemonic for 'alphabet'. One example of an ensemble is a letter that is randomly selected from an English document. There are twenty-seven possible letters: a–z, and a space character '-'.

[Figure 2.1. Probability distribution over the 27 outcomes for a randomly selected letter in an English language document (estimated from The Frequently Asked Questions Manual for Linux). The picture shows the probabilities by the areas of white squares.]

Abbreviations. Briefer notation will sometimes be used. For example, P(x = a_i) may be written as P(a_i) or P(x).

Probability of a subset. If T is a subset of A_X then

    P(T) = P(x ∈ T) = Σ_{a_i ∈ T} P(x = a_i).   (2.1)

For example, if we define V to be the vowels from figure 2.1, V = {a, e, i, o, u}, then

    P(V) = 0.06 + 0.09 + 0.06 + 0.07 + 0.03 = 0.31.   (2.2)
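As a quick illustration of equations (2.1)–(2.2), a subset probability can be computed directly from a pmf. The snippet below is a minimal Python sketch, not part of the original slides; the dictionary holds the vowel probabilities quoted in figure 2.1, and the variable names are illustrative only.

```python
# Minimal sketch (not from the slides): an ensemble stored as a dict,
# using the vowel probabilities quoted from MacKay's figure 2.1.
vowel_probs = {"a": 0.0575, "e": 0.0913, "i": 0.0599, "o": 0.0689, "u": 0.0334}

# Probability of the subset V = {a, e, i, o, u}: P(V) = sum of P(x = a_i) over a_i in V.
p_V = sum(vowel_probs.values())
print(f"P(V) = {p_V:.2f}")  # roughly 0.31, matching equation (2.2)
```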
A joint ensemble XY is an ensemble in which each outcome is an ordered pair x, y with x ∈ A_X = {a_1, ..., a_I} and y ∈ A_Y = {b_1, ..., b_J}. We call P(x, y) the joint probability of x and y. Commas are optional when writing ordered pairs, so xy ⇔ x, y. N.B. In a joint ensemble XY the two variables are not necessarily independent.

[Figure 2.2. The probability distribution over the 27 × 27 possible bigrams xy in an English language document, The Frequently Asked Questions Manual for Linux.]

Marginal probability. We can obtain the marginal probability P(x) from the joint probability P(x, y) by summation:

    P(x = a_i) ≡ Σ_{y ∈ A_Y} P(x = a_i, y).   (2.3)

Similarly, using briefer notation, the marginal probability of y is:

    P(y) ≡ Σ_{x ∈ A_X} P(x, y).   (2.4)

Conditional probability.

    P(x = a_i | y = b_j) ≡ P(x = a_i, y = b_j) / P(y = b_j)   if P(y = b_j) ≠ 0.   (2.5)

[If P(y = b_j) = 0 then P(x = a_i | y = b_j) is undefined.] We pronounce P(x = a_i | y = b_j) as 'the probability that x equals a_i, given y equals b_j'.

Example 2.1. An example of a joint ensemble is the ordered pair XY consisting of two successive letters in an English document. The possible outcomes are ordered pairs such as aa, ab, ac, and zz; of these, we might expect ab and ac to be more probable than aa and zz. An estimate of the joint probability distribution for two neighbouring characters is shown graphically in figure 2.2. This joint ensemble has the special property that its two marginal distributions, P(x) and P(y), are identical: they are both equal to the monogram distribution shown in figure 2.1. From this joint ensemble P(x, y) we can obtain conditional distributions, P(y | x) and P(x | y), by normalizing the rows and columns, respectively (figure 2.3). The probability P(y | x = q) is the probability distribution of the second letter given that the first letter is a q. As you can see in figure 2.3a, the two most probable values for the second letter y given that the first letter is q are u and '-'.

Entropy examples 1
• What's the entropy of a uniform discrete random variable taking on K values?
• What's the entropy of a random variable taking four values with p_X = [1/2, 1/4, 1/8, 1/8]?
• What's the entropy of a deterministic random variable?
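These three questions can be checked numerically with a short helper. The sketch below (Python, not part of the original slides) computes H(X) = −Σ p(x) log2 p(x) for each case, giving log2 K bits for the uniform variable, 1.75 bits for the second distribution, and 0 bits for a deterministic variable.

```python
import math

def entropy(pmf):
    """Entropy in bits, H(X) = -sum p(x) log2 p(x), with 0 log 0 taken as 0."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

K = 8
print(entropy([1 / K] * K))           # log2(K) = 3 bits for K = 8
print(entropy([1/2, 1/4, 1/8, 1/8]))  # 1.75 bits
print(entropy([1.0, 0.0, 0.0]))       # 0 bits: a deterministic variable
```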
Entropy: example 2 (excerpt from MacKay, Chapter 2)

What do you notice about your solutions? Does each answer depend on the detailed contents of each urn? The details of the other possible outcomes and their probabilities are irrelevant. All that matters is the probability of the outcome that actually happened (here, that the ball drawn was black) given the different hypotheses. We need only to know the likelihood, i.e., how the probability of the data that happened varies with the hypothesis. This simple rule about inference is known as the likelihood principle.

The likelihood principle: given a generative model for data d given parameters θ, P(d | θ), and having observed a particular outcome d_1, all inferences and predictions should depend only on the function P(d_1 | θ).

In spite of the simplicity of this principle, many classical statistical methods violate it.

2.4 Definition of entropy and related functions

The Shannon information content of an outcome x is defined to be

    h(x) = log2 (1 / P(x)).   (2.34)

It is measured in bits. [The word 'bit' is also used to denote a variable whose value is 0 or 1; I hope context will always make clear which of the two meanings is intended.] In the next few chapters, we will establish that the Shannon information content h(a_i) is indeed a natural measure of the information content of the event x = a_i. At that point, we will shorten the name of this quantity to 'the information content'.

[Table 2.9. Shannon information contents h(p_i) = log2(1/p_i) of the outcomes a–z and '-', alongside their probabilities p_i from figure 2.1; the final row gives the average Σ_i p_i log2(1/p_i) = 4.1.]

The fourth column in table 2.9 shows the Shannon information content of the 27 possible outcomes when a random character is picked from an English document. The outcome x = z has a Shannon information content of 10.4 bits, and x = e has an information content of 3.5 bits.

The entropy of an ensemble X is defined to be the average Shannon information content of an outcome:

    H(X) ≡ Σ_{x ∈ A_X} P(x) log (1 / P(x)),   (2.35)

with the convention for P(x) = 0 that 0 × log 1/0 ≡ 0, since lim_{θ→0+} θ log 1/θ = 0. Like the information content, entropy is measured in bits. When it is convenient, we may also write H(X) as H(p), where p is the vector (p_1, p_2, ..., p_I). Another name for the entropy of X is the uncertainty of X.

Example 2.12. The entropy of a randomly selected letter in an English document is about 4.11 bits, assuming its probability is as given in table 2.9. We obtain this number by averaging log 1/p_i (shown in the fourth column) under the probability distribution p_i (shown in the third column).
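The values quoted from table 2.9 are easy to reproduce from equation (2.34). The snippet below is a small illustrative sketch (not from the slides); the probabilities for 'z' and 'e' are the rounded values listed in the table, so the computed contents agree with the tabulated 10.4 and 3.5 bits up to that rounding.

```python
import math

def information_content(p):
    """Shannon information content h(x) = log2(1/P(x)), in bits."""
    return math.log2(1.0 / p)

# Probabilities for 'z' and 'e' as listed (rounded) in table 2.9.
print(information_content(0.0007))  # close to the 10.4 bits quoted for x = z
print(information_content(0.0913))  # about 3.5 bits for x = e
```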
Entropy: example 3

• A Bernoulli random variable takes on heads (0) with probability p and tails with probability 1 − p. Its entropy is defined as

    H(p) := −p log2(p) − (1 − p) log2(1 − p)

(Excerpt from the course text, Elements of Information Theory, Chapter 2: Entropy, Relative Entropy, and Mutual Information.)

[Figure 2.1. H(p) vs. p.]

Suppose that we wish to determine the value of X with the minimum number of binary questions (here X is the four-valued random variable with pmf (1/2, 1/4, 1/8, 1/8) considered earlier). An efficient first question is "Is X = a?" This splits the probability in half. If the answer to the first question is no, the second question can be "Is X = b?" The third question can be "Is X = c?" The resulting expected number of binary questions required is 1.75. This turns out to be the minimum expected number of binary questions required to determine the value of X. In Chapter 5 we show that the minimum expected number of binary questions required to determine X lies between H(X) and H(X) + 1.

2.2 Joint entropy and conditional entropy

We defined the entropy of a single random variable in Section 2.1. We now extend the definition to a pair of random variables. There is nothing really new in this definition because (X, Y) can be considered to be a single vector-valued random variable.

Definition: The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with a joint distribution p(x, y) is defined as

    H(X, Y) = −Σ_{x ∈ 𝒳} Σ_{y ∈ 𝒴} p(x, y) log p(x, y).   (2.8)
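The binary entropy function is easy to evaluate directly. The sketch below (Python, illustrative only, not part of the slides) computes H(p) at a few points; the values trace the concave curve in the H(p) vs. p figure above, peaking at H(1/2) = 1 bit.

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1-p) log2 (1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 1.0):
    print(f"H({p}) = {binary_entropy(p):.3f} bits")
# The curve is concave in p and peaks at H(0.5) = 1 bit.
```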
Entropy

The entropy H(X) = −Σ_x p(x) log p(x) has the following properties:
• H(X) ≥ 0: entropy is always non-negative. H(X) = 0 iff X is deterministic (with the convention 0 log 0 = 0).
• H(X) ≤ log(|𝒳|), with H(X) = log(|𝒳|) iff X has a uniform distribution over 𝒳.
• Since H_b(X) = log_b(a) · H_a(X), we don't need to specify the base of the logarithm (bits vs. nats).

Moving on to multiple RVs.

Joint entropy and conditional entropy

Definition: The joint entropy of a pair of discrete random variables X and Y is

    H(X, Y) := −E_{p(x,y)}[log p(X, Y)] = −Σ_{x ∈ 𝒳} Σ_{y ∈ 𝒴} p(x, y) log p(x, y).

The conditional entropy of Y given X is H(Y|X) := −E_{p(x,y)}[log p(Y|X)] = −Σ_{x ∈ 𝒳} Σ_{y ∈ 𝒴} p(x, y) log p(y|x). Note: H(X|Y) ≠ H(Y|X) in general.

Joint entropy and conditional entropy
• Natural definitions, since...

Theorem (Chain rule): H(X, Y) = H(X) + H(Y|X).
Corollary: H(X, Y|Z) = H(X|Z) + H(Y|X, Z).

(Excerpt from MacKay, Chapter 8: Dependent Random Variables.)

[Figure 8.1. The relationship between joint information H(X,Y), marginal entropies H(X) and H(Y), conditional entropies H(X|Y) and H(Y|X), and mutual information I(X;Y).]

8.2 Exercises

Exercise 8.1. [1] Consider three independent random variables u, v, w with entropies H_u, H_v, H_w. Let X ≡ (U, V) and Y ≡ (V, W). What is H(X, Y)? What is H(X|Y)? What is I(X;Y)?

Exercise 8.2. [3, p.142] Referring to the definitions of conditional entropy (8.3–8.4), confirm (with an example) that it is possible for H(X | y = b_k) to exceed H(X), but that the average, H(X|Y), is less than H(X). So data are helpful – they do not increase uncertainty, on average.

Exercise 8.3. [2, p.143] Prove the chain rule for entropy, equation (8.7): H(X, Y) = H(X) + H(Y|X).

Exercise 8.4. [2, p.143] Prove that the mutual information I(X;Y) ≡ H(X) − H(X|Y) satisfies I(X;Y) = I(Y;X) and I(X;Y) ≥ 0. [Hint: see exercise 2.26 (p.37) and note that I(X;Y) = D_KL(P(x,y) ‖ P(x)P(y)). (8.11)]

Exercise 8.5. [4] The 'entropy distance' between two random variables can be defined to be the difference between their joint entropy and their mutual information:

    D_H(X, Y) ≡ H(X, Y) − I(X;Y).   (8.12)

Prove that the entropy distance satisfies the axioms for a distance: D_H(X,Y) ≥ 0, D_H(X,X) = 0, D_H(X,Y) = D_H(Y,X), and D_H(X,Z) ≤ D_H(X,Y) + D_H(Y,Z). [Incidentally, we are unlikely to see D_H(X,Y) again but it is a good function on which to practise inequality-proving.]

Exercise 8.6. [2] A joint ensemble XY has the following joint distribution:

    P(x, y)    x = 1    x = 2    x = 3    x = 4
    y = 1      1/8      1/16     1/32     1/32
    y = 2      1/16     1/8      1/32     1/32
    y = 3      1/16     1/16     1/16     1/16
    y = 4      1/4      0        0        0

What is the joint entropy H(X, Y)? What are the marginal entropies H(X) and H(Y)? For each value of y, what is the conditional entropy H(X | y)? What is the conditional entropy H(X|Y)? What is the conditional entropy of Y given X? What is the mutual information between X and Y?
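The quantities asked for in Exercise 8.6 can be checked numerically. The sketch below (Python, not part of the original slides; the helper H is illustrative) should come out to H(X,Y) = 27/8 bits, H(X) = 7/4 bits, H(Y) = 2 bits, H(X|Y) = 11/8 bits, H(Y|X) = 13/8 bits, and I(X;Y) = 3/8 bits.

```python
import math

def H(probs):
    """Entropy in bits of a list of probabilities (0 log 0 := 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution P(x, y) from Exercise 8.6: rows indexed by y = 1..4, columns by x = 1..4.
P = [
    [1/8,  1/16, 1/32, 1/32],  # y = 1
    [1/16, 1/8,  1/32, 1/32],  # y = 2
    [1/16, 1/16, 1/16, 1/16],  # y = 3
    [1/4,  0,    0,    0   ],  # y = 4
]

p_x = [sum(row[j] for row in P) for j in range(4)]  # marginal of X (column sums)
p_y = [sum(row) for row in P]                       # marginal of Y (row sums)

H_XY = H([p for row in P for p in row])
H_X, H_Y = H(p_x), H(p_y)
H_X_given_Y = H_XY - H_Y        # chain rule: H(X,Y) = H(Y) + H(X|Y)
H_Y_given_X = H_XY - H_X
I_XY = H_X + H_Y - H_XY         # mutual information

print(H_XY, H_X, H_Y, H_X_given_Y, H_Y_given_X, I_XY)
# Expect 3.375, 1.75, 2.0, 1.375, 1.625, 0.375 bits respectively.
```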
Exercise 8.7. [2, p.143] Consider the ensemble XYZ in which A_X = A_Y = A_Z = {0, 1}, x and y are independent with P_X = {p, 1 − p} and P_Y = {q, 1 − q}, and

    z = (x + y) mod 2.   (8.13)

(a) If q = 1/2, what is P_Z? What is I(Z;X)?
(b) For general p and q, what is P_Z? What is I(Z;X)? Notice that this ensemble is related to the binary symmetric channel, with x = input, y = noise, and z = output.

[Figure 8.2. A misleading representation of entropies (contrast with figure 8.1): a three-set Venn diagram of H(X), H(Y), H(X,Y), H(X|Y), H(Y|X), and I(X;Y).]

Three-term entropies

Exercise 8.8. [3, p.143] Many texts draw figure 8.1 in the form of a Venn diagram (figure 8.2). Discuss why this diagram is a misleading representation of entropies. Hint: consider the three-variable ensemble XYZ in which x ∈ {0, 1} and y ∈ {0, 1} are independent binary variables and z ∈ {0, 1} is defined to be z = x + y mod 2.

8.3 Further exercises

The data-processing theorem. The data processing theorem states that data processing can only destroy information.

Exercise 8.9. [3, p.144] Prove this theorem by considering an ensemble WDR in which w is the state of the world, d is data gathered, and r is the processed data, so that these three variables form a Markov chain

    w → d → r,   (8.14)

that is, the probability P(w, d, r) can be written as

    P(w, d, r) = P(w) P(d | w) P(r | d).   (8.15)

Show that the average information that R conveys about W, I(W;R), is less than or equal to the average information that D conveys about W, I(W;D). This theorem is as much a caution about our definition of 'information' as it is a caution about data processing!

Mutual information example

    p(x, y)    y = 0    y = 1
    x = 0      1/2      1/4
    x = 1      0        1/4

    X or Y     p(x)     p(y)
    0          3/4      1/2
    1          1/4      1/2

Divergence (relative entropy, K-L distance)

Definition: The relative entropy, divergence, or Kullback-Leibler distance between two distributions P and Q on the same alphabet is

    D(p ‖ q) := E_p[ log (p(x)/q(x)) ] = Σ_{x ∈ 𝒳} p(x) log (p(x)/q(x)).

(Note: we use the conventions 0 log(0/0) = 0, 0 log(0/q) = 0, and p log(p/0) = ∞.)

• D(p ‖ q) is in a sense a measure of the "distance" between the two distributions.
• If P = Q then D(p ‖ q) = 0.
• Note that D(p ‖ q) is not a true distance. [The slide shows two distributions for which the divergence is 0.2075 in one direction and 0.1887 in the other.]

K-L divergence example
• 𝒳 = {1, 2, 3, 4, 5, 6}
• P = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]
• Q = [1/10, 1/10, 1/10, 1/10, 1/10, 1/2]
• D(p ‖ q) = ? and D(q ‖ p) = ? (see the sketch below)
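The two divergences asked for above can be computed directly. The sketch below (Python, not from the original slides) evaluates them in bits; the answers, roughly 0.35 bits and 0.42 bits, also make the asymmetry of the divergence concrete.

```python
import math

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) log2(p(x)/q(x)) in bits, with 0 log(0/q) := 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [1/6] * 6             # fair die
Q = [1/10] * 5 + [1/2]    # die biased towards the outcome 6
print(kl_divergence(P, Q))  # roughly 0.35 bits
print(kl_divergence(Q, P))  # roughly 0.42 bits: D(p||q) != D(q||p) in general
```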
Mutual information as divergence

Definition: The mutual information I(X;Y) between the random variables X and Y is given by

    I(X;Y) = H(X) − H(X|Y)
           = Σ_{x ∈ 𝒳} Σ_{y ∈ 𝒴} p(x, y) log2 ( p(x, y) / (p(x)p(y)) )
           = E_{p(x,y)}[ log2 ( p(X, Y) / (p(X)p(Y)) ) ]

• Can we express mutual information in terms of the K-L divergence?

    I(X;Y) = D( p(x, y) ‖ p(x)p(y) )

Mutual information and entropy

Theorem: Relationship between mutual information and entropy.
    I(X;Y) = H(X) − H(X|Y)
    I(X;Y) = H(Y) − H(Y|X)
    I(X;Y) = H(X) + H(Y) − H(X, Y)
    I(X;Y) = I(Y;X)   (symmetry)
    I(X;X) = H(X)   ("self-information")

"Two's company, three's a crowd."

[Venn-style diagram relating H(X), H(Y), H(X|Y), H(Y|X), H(X,Y), and I(X;Y).]

Chain rule for entropy

Theorem (Chain rule for entropy): For (X_1, X_2, ..., X_n) ~ p(x_1, x_2, ..., x_n),

    H(X_1, X_2, ..., X_n) = Σ_{i=1}^{n} H(X_i | X_{i−1}, ..., X_1).

[Diagram: H(X_1, X_2, X_3) = H(X_1) + H(X_2|X_1) + H(X_3|X_1, X_2).]

Convex and concave functions

[Two slides of example plots of convex and concave functions.]

Jensen's inequality

Theorem (Jensen's inequality): If f is convex, then E[f(X)] ≥ f(E[X]). If f is strictly convex, then equality implies X = E[X] with probability 1.

Jensen's inequality consequences
• Theorem (Information inequality): D(p ‖ q) ≥ 0, with equality iff p = q.
• Corollary (Nonnegativity of mutual information): I(X;Y) ≥ 0, with equality iff X and Y are independent.
• Theorem (Conditioning reduces entropy): H(X|Y) ≤ H(X), with equality iff X and Y are independent.
• Theorem: H(X) ≤ log |𝒳|, with equality iff X has a uniform distribution over 𝒳.
• Theorem (Independence bound on entropy): H(X_1, X_2, ..., X_n) ≤ Σ_{i=1}^{n} H(X_i), with equality iff the X_i are independent.

Log-sum inequality

Theorem (Log sum inequality): For nonnegative numbers a_1, a_2, ..., a_n and b_1, b_2, ..., b_n,

    Σ_{i=1}^{n} a_i log (a_i / b_i) ≥ ( Σ_{i=1}^{n} a_i ) log ( (Σ_{i=1}^{n} a_i) / (Σ_{i=1}^{n} b_i) ),

with equality iff a_i/b_i is constant. Convention: 0 log 0 = 0, a log(a/0) = ∞ if a > 0, and 0 log(0/0) = 0.

Log-sum inequality consequences
• Theorem (Convexity of relative entropy): D(p ‖ q) is convex in the pair (p, q), so that for pmfs (p_1, q_1) and (p_2, q_2) and all 0 ≤ λ ≤ 1,

    D(λp_1 + (1 − λ)p_2 ‖ λq_1 + (1 − λ)q_2) ≤ λ D(p_1 ‖ q_1) + (1 − λ) D(p_2 ‖ q_2).

• Theorem (Concavity of entropy): For X ~ p(x), H(p) := H_p(X) is a concave function of p(x).
• Theorem (Concavity of the mutual information in p(x)): Let (X, Y) ~ p(x, y) = p(x)p(y|x). Then I(X;Y) is a concave function of p(x) for fixed p(y|x).
• Theorem (Convexity of the mutual information in p(y|x)): Let (X, Y) ~ p(x, y) = p(x)p(y|x). Then I(X;Y) is a convex function of p(y|x) for fixed p(x).

Consequences on sufficient statistics
• Consider a family of probability distributions {f_θ(x)} indexed by θ. If X ~ f(x | θ) for fixed θ and T(X) is any statistic (i.e., function of the sample X), then we have θ → X → T(X).
• The data processing inequality in turn implies I(θ;X) ≥ I(θ;T(X)) for any distribution on θ.
• Is it possible to choose a statistic that preserves all of the information in X about θ?

Definition (Sufficient statistic): A function T(X) is said to be a sufficient statistic relative to the family {f_θ(x)} if the conditional distribution of X, given T(X) = t, is independent of θ for any distribution on θ (Fisher-Neyman):

    f_θ(x) = f(x | t) f_θ(t)   ⟹   θ → T(X) → X   ⟹   I(θ;T(X)) ≥ I(θ;X).

Example of a sufficient statistic

Fano's inequality

Fano's inequality consequences
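Fano's inequality itself is not spelled out in the extracted slides, so its standard form is assumed here: H(X|Y) ≤ H(Pe) + Pe log2(|𝒳| − 1), where Pe = Pr{X̂ ≠ X} for any estimator X̂ = g(Y). The sketch below (Python, illustrative only, not the slides' derivation) checks this bound numerically on the joint distribution of Exercise 8.6 above, using the MAP estimator.

```python
import math

def H(probs):
    """Entropy in bits (0 log 0 := 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Reuse the joint distribution of Exercise 8.6 (rows y = 1..4, columns x = 1..4).
P = [
    [1/8,  1/16, 1/32, 1/32],
    [1/16, 1/8,  1/32, 1/32],
    [1/16, 1/16, 1/16, 1/16],
    [1/4,  0,    0,    0   ],
]
p_y = [sum(row) for row in P]

# H(X|Y) = sum_y p(y) H(X | Y = y)
H_X_given_Y = sum(p_y[i] * H([P[i][j] / p_y[i] for j in range(4)]) for i in range(4))

# MAP estimator x_hat(y) = argmax_x p(x|y); its error probability Pe = Pr{x_hat(Y) != X}.
Pe = sum(p_y[i] - max(P[i]) for i in range(4))

# Assumed standard Fano bound: H(X|Y) <= H(Pe) + Pe * log2(|X| - 1), here |X| = 4.
bound = H([Pe, 1 - Pe]) + Pe * math.log2(4 - 1)
print(f"H(X|Y) = {H_X_given_Y:.3f} bits, Pe = {Pe:.4f}, Fano bound = {bound:.3f} bits")
assert H_X_given_Y <= bound
```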