Probability & Statistics
Author: Jose 胡冠洲 @ ShanghaiTech & Staff of Harvard STAT-110

This document has three parts, in order: the full-version cheatsheet, the Harvard Stat-110 cheatsheet (Probability Cheatsheet v2.0), and the Harvard Stat-110 review notes.

Full-ver. Cheatsheet

Discrete distributions (PMF; mean; variance; PGF; MGF), with q = 1 − p:
- Bern(p): P(X = 1) = p, P(X = 0) = q; E = p; Var = pq; PGF = pt + q; MGF = pe^t + q
- Bin(n, p): P(X = x) = C(n, x) p^x q^(n−x), x in {0, ..., n}; E = np; Var = npq; PGF = (pt + q)^n; MGF = (pe^t + q)^n
- HGeom(w, b, n): P(X = k) = C(w, k) C(b, n−k) / C(w+b, n), 0 ≤ k ≤ w, 0 ≤ n − k ≤ b; E = nw/(w+b)
- Geom(p): P(X = x) = q^x p, x ≥ 0; E = q/p; Var = q/p²; PGF = p/(1 − qt); MGF = p/(1 − qe^t)
- FS(p): P(X = x) = q^(x−1) p, x ≥ 1; E = 1/p; Var = q/p²; PGF = pt/(1 − qt); MGF = pe^t/(1 − qe^t)
- NBin(r, p): P(X = n) = C(n+r−1, r−1) p^r q^n, n ≥ 0; E = rq/p; Var = rq/p²; PGF = (p/(1 − qt))^r; MGF = (p/(1 − qe^t))^r
- Pois(λ): P(X = k) = e^(−λ) λ^k / k!, k ≥ 0; E = λ; Var = λ; PGF = e^(λ(t−1)); MGF = e^(λ(e^t − 1))

Continuous distributions (PDF; CDF; mean; variance; MGF):
- Unif(a, b): PDF 1/(b−a) for a < x < b; CDF (x−a)/(b−a); E = (a+b)/2; Var = (b−a)²/12; MGF = (e^(tb) − e^(ta))/(t(b−a))
- N(0, 1): PDF (1/√(2π)) e^(−z²/2); E = 0; Var = 1; MGF = e^(t²/2)
- N(µ, σ²): PDF (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)); E = µ; Var = σ²; MGF = e^(µt + σ²t²/2)
- Expo(λ): PDF λe^(−λx), x > 0; CDF 1 − e^(−λx), x > 0; E = 1/λ; Var = 1/λ²; MGF = λ/(λ − t)
- Beta(a, b): PDF x^(a−1)(1 − x)^(b−1)/β(a, b) for 0 < x < 1; E = a/(a+b)
- Gamma(a, λ): PDF (1/Γ(a)) (λy)^a e^(−λy) (1/y), y > 0; E = a/λ; Var = a/λ²

Bayes' Rule, by type of X and Y:
- X discrete, Y discrete: P(Y = y | X = x) = P(X = x | Y = y) P(Y = y) / P(X = x)
- X discrete, Y continuous: P(X = x | Y = y) f_Y(y) / P(X = x)
- X continuous, Y discrete: f_X(x | Y = y) P(Y = y) / f_X(x)
- X continuous, Y continuous: f_{X|Y}(x | y) f_Y(y) / f_X(x)

LOTP, by type of X and Y:
- X discrete, Y discrete: P(X = x) = Σ_y P(X = x | Y = y) P(Y = y)
- X discrete, Y continuous: P(X = x) = ∫ P(X = x | Y = y) f_Y(y) dy
- X continuous, Y discrete: f_X(x) = Σ_y f_X(x | Y = y) P(Y = y)
- X continuous, Y continuous: f_X(x) = ∫ f_{X|Y}(x | y) f_Y(y) dy

Counting (choosing k out of n): ordered with replacement: n^k; unordered with replacement: C(n+k−1, n−1); ordered without replacement: n(n−1)···(n−k+1); unordered without replacement: C(n, k). Bose-Einstein: x_1 + ··· + x_r = n with each x_i > 0 has C(n−1, r−1) solutions. Identities: n C(n−1, k−1) = k C(n, k); C(m+n, k) = Σ_{j=0}^{k} C(m, j) C(n, k−j).

Definitions:
- PMF: P(X = x); nonnegative, sums to 1
- joint PMF: P(X = x, Y = y) = P(X = x | Y = y) P(Y = y)
- marginal PMF: P(X = x) = Σ_y P(X = x, Y = y)
- PDF: f_X(x) = F′_X(x); nonnegative, integrates to 1
- joint PDF: f_{X,Y}(x, y) = ∂²/∂x∂y F_{X,Y}(x, y)
- marginal PDF: f_X(x) = ∫ f_{X,Y}(x, y) dy
- CDF: F_X(x) = P(X ≤ x); nondecreasing, right-continuous, lim_{x→−∞} F(x) = 0, lim_{x→∞} F(x) = 1
- joint CDF: F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y)
- conditional expectation given an event: E(Y | A) = Σ_y y P(Y = y | A) or ∫ y f(y | A) dy
- conditional expectation given an r.v.: E(Y | X), defined via E(Y | X = x) = g(x)
- conditional variance: Var(Y | X) = E((Y − E(Y | X))² | X) = E(Y² | X) − (E(Y | X))²

Expectation / mean:
1. E(X) = Σ_x x P(X = x)
2. survival: E(X) = ∫_{−∞}^{0} (G(x) − 1) dx + ∫_{0}^{∞} G(x) dx, where G(x) = P(X > x)
3. LOTUS: E(g(X)) = Σ_x g(x) P(X = x)
4. indicators: if X = I_1 + I_2 + ··· + I_n (or another partition), then E(X) = p_1 + p_2 + ··· + p_n
5. by PGF / MGF: E(X) = g′_X(1) or E(X) = M′_X(0)
6. 2D LOTUS: E(g(X, Y)) = Σ_x Σ_y g(x, y) P(X = x, Y = y)
7. LOTE: E(Y) = Σ_{i=1}^{n} E(Y | A_i) P(A_i)
8. Adam's Law: E(E(Y | X)) = E(Y)
Properties: 1. linearity; 2. monotonicity: X ≥ Y with probability 1 implies E(X) ≥ E(Y).

Variance:
1. Var(X) = E((X − E(X))²) = E(X²) − (E(X))² ≥ 0
2. LOTUS: E(X²) = Σ_x x² P(X = x)
3. indicators: if X = I_1 + ··· + I_n, then E(C(X, 2)) = Σ_{i<j} E(I_i I_j)
4. by PGF / MGF: E(X² − X) = g″_X(1) or E(X²) = M″_X(0)
5. Eve's Law: Var(Y) = E(Var(Y | X)) + Var(E(Y | X))
Properties: 1. Var(X + c) = Var(X), Var(cX) = c² Var(X); 2. Var(X + Y) = Var(X) + Var(Y) when X and Y are uncorrelated (in particular, when independent).

Independence:
Proposition: if f_{X,Y}(x, y) = g(x)h(y), then X and Y are independent; and if g is a valid PDF, then g and h are the valid marginal PDFs of X and Y.
Equivalent conditions: 1. F_{X,Y}(x, y) = F_X(x) F_Y(y); 2. f_{X,Y}(x, y) = f_X(x) f_Y(y); 3. f_{X|Y}(x | y) = f_X(x).

Symmetry:
- X ∼ Bin(n, p) implies n − X ∼ Bin(n, q)
- X ∼ HGeom(w, b, n) if and only if X ∼ HGeom(n, w + b − n, w)
- Z ∼ N(0, 1) implies φ(z) = φ(−z), Φ(z) = 1 − Φ(−z), and −Z ∼ N(0, 1)

Harvard Stat-110 Cheatsheet (Probability Cheatsheet v2.0)

Cumulative Distribution Function (CDF)
Gives the probability that a random variable is less than or equal to x: F_X(x) = P(X ≤ x).
[Figure: step-function plot of a discrete CDF.]
The CDF is an increasing, right-continuous function with F_X(x) → 0 as x → −∞ and F_X(x) → 1 as x → ∞.

Independence
Intuitively, two random variables are independent if knowing the value of one gives no information about the other. Discrete r.v.s X and Y are independent if for all values of x and y, P(X = x, Y = y) = P(X = x)P(Y = y).

Expected Value and Indicators

Expected Value and Linearity
Expected Value (a.k.a. mean, expectation, or average) is a weighted average of the possible outcomes of our random variable. Mathematically, if x_1, x_2, x_3, ... are all of the distinct possible values that X can take, the expected value of X is E(X) = Σ_i x_i P(X = x_i).

  X:      3   2   6  10   1   1   5   4  ...
  Y:      4   2   8  23  −3   0   9   1  ...
  X + Y:  7   4  14  33  −2   1  14   5  ...
Averaging each row, (1/n)Σ x_i + (1/n)Σ y_i = (1/n)Σ (x_i + y_i), i.e., E(X) + E(Y) = E(X + Y).

Linearity
For any r.v.s X and Y, and constants a, b, c,
E(aX + bY + c) = aE(X) + bE(Y) + c

Same distribution implies same mean
If X and Y have the same distribution, then E(X) = E(Y) and, more generally, E(g(X)) = E(g(Y)).

Conditional Expected Value is defined like expectation, only conditioned on any event A: E(X | A) = Σ_x x P(X = x | A).

Indicator Random Variables
An indicator random variable is a random variable that takes on only the values 1 and 0. It is always an indicator of some event: if the event occurs, the indicator is 1; otherwise it is 0. Indicators are useful for many problems about counting how many events of some kind occur. Write
I_A = 1 if A occurs, 0 if A does not occur.
Note that I_A² = I_A, I_A I_B = I_{A∩B}, and I_{A∪B} = I_A + I_B − I_A I_B.
Distribution: I_A ∼ Bern(p) where p = P(A).
Fundamental Bridge: the expectation of the indicator for event A is the probability of event A: E(I_A) = P(A).

Variance and Standard Deviation
Var(X) = E((X − E(X))²) = E(X²) − (E(X))²
SD(X) = √Var(X)

Continuous RVs, LOTUS, UoU

Continuous Random Variables (CRVs)
What's the probability that a CRV is in an interval? Take the difference in CDF values (or use the PDF as described later):
P(a ≤ X ≤ b) = P(X ≤ b) − P(X ≤ a) = F_X(b) − F_X(a)
For X ∼ N(µ, σ²), this becomes
P(a ≤ X ≤ b) = Φ((b − µ)/σ) − Φ((a − µ)/σ)

What is the Probability Density Function (PDF)? The PDF f is the derivative of the CDF F: F′(x) = f(x). A PDF is nonnegative and integrates to 1. By the fundamental theorem of calculus, to get from PDF back to CDF we can integrate:
F(x) = ∫_{−∞}^{x} f(t) dt
[Figure: the standard Normal PDF and CDF.]
To find the probability that a CRV takes on a value in an interval, integrate the PDF over that interval:
F(b) − F(a) = ∫_{a}^{b} f(x) dx

How do I find the expected value of a CRV? Analogous to the discrete case, where you sum x times the PMF, for CRVs you integrate x times the PDF:
E(X) = ∫_{−∞}^{∞} x f(x) dx

LOTUS

Expected value of a function of an r.v.
The expected value of X is defined this way: E(X) = X x xP (X = x) (for discrete X) E(X) = Z 1 1 xf(x)dx (for continuous X) The Law of the Unconscious Statistician (LOTUS) states that you can find the expected value of a function of a random variable, g(X), in a similar way, by replacing the x in front of the PMF/PDF by g(x) but still working with the PMF/PDF of X: E(g(X)) = X x g(x)P (X = x) (for discrete X) E(g(X)) = Z 1 1 g(x)f(x)dx (for continuous X) What’s a function of a random variable? A function of a random variable is also a random variable. For example, if X is the number of bikes you see in an hour, then g(X) = 2X is the number of bike wheels you see in that hour and h(X) = X 2 = X(X1) 2 is the number of pairs of bikes such that you see both of those bikes in that hour. What’s the point? You don’t need to know the PMF/PDF of g(X) to find its expected value. All you need is the PMF/PDF of X. Universality of Uniform (UoU) When you plug any CRV into its own CDF, you get a Uniform(0,1) random variable. When you plug a Uniform(0,1) r.v. into an inverse CDF, you get an r.v. with that CDF. For example, let’s say that a random variable X has CDF F (x) = 1 ex, for x > 0 By UoU, if we plug X into this function then we get a uniformly distributed random variable. F (X) = 1 eX ⇠ Unif(0, 1) Similarly, if U ⇠ Unif(0, 1) then F1(U) has CDF F . The key point is that for any continuous random variable X, we can transform it into a Uniform random variable and back by using its CDF. Moments and MGFs Moments Moments describe the shape of a distribution. Let X have mean µ and standard deviation , and Z = (X µ)/ be the standardized version of X. The kth moment of X is µk = E(Xk) and the kth standardized moment of X is mk = E(Zk). The mean, variance, skewness, and kurtosis are important summaries of the shape of a distribution. Mean E(X) = µ1 Variance Var(X) = µ2 µ2 1 Skewness Skew(X) = m3 Kurtosis Kurt(X) = m4 3 Moment Generating Functions MGF For any random variable X, the function MX(t) = E(etX) is the moment generating function (MGF) of X, if it exists for all t in some open interval containing 0. The variable t could just as well have been called u or v. It’s a bookkeeping device that lets us work with the function MX rather than the sequence of moments. Why is it called the Moment Generating Function? Because the kth derivative of the moment generating function, evaluated at 0, is the kth moment of X. µk = E(Xk) = M(k) X (0) This is true by Taylor expansion of etX since MX(t) = E(etX) = 1X k=0 E(Xk)tk k! = 1X k=0 µkt k k! MGF of linear functions If we have Y = aX + b, then MY (t) = E(et(aX+b)) = ebtE(e(at)X) = ebtMX(at) Uniqueness If it exists, the MGF uniquely determines the distribution. This means that for any two random variables X and Y , they are distributed the same (their PMFs/PDFs are equal) if and only if their MGFs are equal. Summing Independent RVs by Multiplying MGFs. If X and Y are independent, then MX+Y (t) = E(et(X+Y )) = E(etX)E(etY ) = MX(t) · MY (t) The MGF of the sum of two random variables is the product of the MGFs of those two random variables. Joint PDFs and CDFs Joint Distributions The joint CDF of X and Y is F (x, y) = P (X  x, Y  y) In the discrete case, X and Y have a joint PMF pX,Y (x, y) = P (X = x, Y = y). In the continuous case, they have a joint PDF fX,Y (x, y) = @2 @x@y FX,Y (x, y). The joint PMF/PDF must be nonnegative and sum/integrate to 1. 
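As a quick illustration of the Universality of the Uniform discussion above, here is a minimal R sketch (the seed, sample sizes, and variable names are our own choices, not part of the cheatsheet). It plugs an Expo(1) r.v. into its own CDF F(x) = 1 − e^(−x), and conversely pushes a Uniform through the inverse CDF.

# Universality of the Uniform, illustrated by simulation (sketch).
set.seed(110)
x <- rexp(1e5, rate = 1)                     # X ~ Expo(1)
u <- 1 - exp(-x)                             # F(X), which should be ~ Unif(0,1)
round(quantile(u, c(0.25, 0.5, 0.75)), 2)    # close to 0.25, 0.50, 0.75
u2 <- runif(1e5)                             # U ~ Unif(0,1)
x2 <- -log(1 - u2)                           # F^{-1}(U), which should be ~ Expo(1)
c(mean(x2), var(x2))                         # both close to 1, matching Expo(1)

The second half is exactly the inverse-CDF recipe for generating an r.v. with a given CDF from a Uniform.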
Conditional Distributions Conditioning and Bayes’ rule for discrete r.v.s P (Y = y|X = x) = P (X = x, Y = y) P (X = x) = P (X = x|Y = y)P (Y = y) P (X = x) Conditioning and Bayes’ rule for continuous r.v.s fY |X(y|x) = fX,Y (x, y) fX(x) = fX|Y (x|y)fY (y) fX(x) Hybrid Bayes’ rule fX(x|A) = P (A|X = x)fX(x) P (A) Marginal Distributions To find the distribution of one (or more) random variables from a joint PMF/PDF, sum/integrate over the unwanted random variables. Marginal PMF from joint PMF P (X = x) = X y P (X = x, Y = y) Marginal PDF from joint PDF fX(x) = Z 1 1 fX,Y (x, y)dy Independence of Random Variables Random variables X and Y are independent if and only if any of the following conditions holds: • Joint CDF is the product of the marginal CDFs • Joint PMF/PDF is the product of the marginal PMFs/PDFs • Conditional distribution of Y given X is the marginal distribution of Y Write X ? Y to denote that X and Y are independent. Multivariate LOTUS LOTUS in more than one dimension is analogous to the 1D LOTUS. For discrete random variables: E(g(X,Y )) = X x X y g(x, y)P (X = x, Y = y) For continuous random variables: E(g(X,Y )) = Z 1 1 Z 1 1 g(x, y)fX,Y (x, y)dxdy Covariance and Transformations Covariance and Correlation Covariance is the analog of variance for two random variables. Cov(X,Y ) = E ((X E(X))(Y E(Y ))) = E(XY ) E(X)E(Y ) Note that Cov(X,X) = E(X2) (E(X))2 = Var(X) Correlation is a standardized version of covariance that is always between 1 and 1. Corr(X,Y ) = Cov(X,Y ) p Var(X)Var(Y ) Covariance and Independence If two random variables are independent, then they are uncorrelated. The converse is not necessarily true (e.g., consider X ⇠ N (0, 1) and Y = X2). X ? Y ! Cov(X,Y ) = 0 ! E(XY ) = E(X)E(Y ) Covariance and Variance The variance of a sum can be found by Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X,Y ) Var(X1 + X2 + · · · + Xn) = nX i=1 Var(Xi) + 2 X i<j Cov(Xi, Xj) If X and Y are independent then they have covariance 0, so X ? Y =) Var(X + Y ) = Var(X) + Var(Y ) If X1, X2, . . . , Xn are identically distributed and have the same covariance relationships (often by symmetry), then Var(X1 + X2 + · · · + Xn) = nVar(X1) + 2 ⇣n 2 ⌘ Cov(X1, X2) Covariance Properties For random variables W,X, Y, Z and constants a, b: Cov(X,Y ) = Cov(Y,X) Cov(X + a, Y + b) = Cov(X,Y ) Cov(aX, bY ) = abCov(X,Y ) Cov(W + X,Y + Z) = Cov(W,Y ) + Cov(W,Z) + Cov(X,Y ) + Cov(X,Z) Correlation is location-invariant and scale-invariant For any constants a, b, c, d with a and c nonzero, Corr(aX + b, cY + d) = Corr(X,Y ) Transformations One Variable Transformations Let’s say that we have a random variable X with PDF fX(x), but we are also interested in some function of X. We call this function Y = g(X). Also let y = g(x). If g is di↵erentiable and strictly increasing (or strictly decreasing), then the PDF of Y is fY (y) = fX(x) dx dy = fX(g1(y)) d dy g1(y) The derivative of the inverse transformation is called the Jacobian. Two Variable Transformations Similarly, let’s say we know the joint PDF of U and V but are also interested in the random vector (X,Y ) defined by (X,Y ) = g(U, V ). Let @(u, v) @(x, y) = @u @x @u @y @v @x @v @y ! be the Jacobian matrix. If the entries in this matrix exist and are continuous, and the determinant of the matrix is never 0, then fX,Y (x, y) = fU,V (u, v) @(u, v) @(x, y) The inner bars tells us to take the matrix’s determinant, and the outer bars tell us to take the absolute value. 
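Returning briefly to Covariance and Correlation above, the following R sketch (sample size, seed, and variable names are ours) checks three of the stated facts numerically: uncorrelated does not imply independent (the X and X² example), the variance-of-a-sum identity, and location/scale invariance of correlation.

# Numerical check of covariance/correlation facts (sketch).
set.seed(110)
x <- rnorm(1e5)
y <- x^2                                  # uncorrelated with x, yet completely dependent on it
cov(x, y)                                 # approximately 0
w <- x + rnorm(1e5)                       # Var(w) = 2, Cov(x, w) = 1
var(x + w)                                # approx Var(x) + Var(w) + 2*Cov(x, w) = 1 + 2 + 2 = 5
c(cor(x, w), cor(2 * x + 3, 5 * w + 1))   # equal (about 0.71): correlation is location/scale invariant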
In a 2 ⇥ 2 matrix, a b c d = |ad bc| Convolutions Convolution Integral If you want to find the PDF of the sum of two independent CRVs X and Y , you can do the following integral: fX+Y (t) = Z 1 1 fX(x)fY (t x)dx Example Let X,Y ⇠ N (0, 1) be i.i.d. Then for each fixed t, fX+Y (t) = Z 1 1 1 p 2⇡ ex2/2 1 p 2⇡ e(tx)2/2dx By completing the square and using the fact that a Normal PDF integrates to 1, this works out to fX+Y (t) being the N (0, 2) PDF. Poisson Process Definition We have a Poisson process of rate arrivals per unit time if the following conditions hold: 1. The number of arrivals in a time interval of length t is Pois(t). 2. Numbers of arrivals in disjoint time intervals are independent. For example, the numbers of arrivals in the time intervals [0, 5], (5, 12), and [13, 23) are independent with Pois(5),Pois(7),Pois(10) distributions, respectively. 0 T1 T2 T3 T4 T5 + + + + + Count-Time Duality Consider a Poisson process of emails arriving in an inbox at rate emails per hour. Let Tn be the time of arrival of the nth email (relative to some starting time 0) and Nt be the number of emails that arrive in [0, t]. Let’s find the distribution of T1. The event T1 > t, the event that you have to wait more than t hours to get the first email, is the same as the event Nt = 0, which is the event that there are no emails in the first t hours. So P (T1 > t) = P (Nt = 0) = et ! P (T1  t) = 1 et Thus we have T1 ⇠ Expo(). By the memoryless property and similar reasoning, the interarrival times between emails are i.i.d. Expo(), i.e., the di↵erences Tn Tn1 are i.i.d. Expo(). Order Statistics Definition Let’s say you have n i.i.d. r.v.s X1, X2, . . . , Xn. If you arrange them from smallest to largest, the ith element in that list is the ith order statistic, denoted X(i). So X(1) is the smallest in the list and X(n) is the largest in the list. Note that the order statistics are dependent, e.g., learning X(4) = 42 gives us the information that X(1), X(2), X(3) are  42 and X(5), X(6), . . . , X(n) are 42. Distribution Taking n i.i.d. random variables X1, X2, . . . , Xn with CDF F (x) and PDF f(x), the CDF and PDF of X(i) are: FX(i) (x) = P (X(i)  x) = nX k=i ⇣n k ⌘ F (x)k(1 F (x))nk fX(i) (x) = n ⇣n 1 i 1 ⌘ F (x)i1(1 F (x))nif(x) Uniform Order Statistics The jth order statistic of i.i.d. U1, . . . , Un ⇠ Unif(0, 1) is U(j) ⇠ Beta(j, n j + 1). Conditional Expectation Conditioning on an Event We can find E(Y |A), the expected value of Y given that event A occurred. A very important case is when A is the event X = x. Note that E(Y |A) is a number. For example: • The expected value of a fair die roll, given that it is prime, is 1 3 · 2 + 1 3 · 3 + 1 3 · 5 = 10 3 . • Let Y be the number of successes in 10 independent Bernoulli trials with probability p of success. Let A be the event that the first 3 trials are all successes. Then E(Y |A) = 3 + 7p since the number of successes among the last 7 trials is Bin(7, p). • Let T ⇠ Expo(1/10) be how long you have to wait until the shuttle comes. Given that you have already waited t minutes, the expected additional waiting time is 10 more minutes, by the memoryless property. That is, E(T |T > t) = t + 10. Discrete Y Continuous Y E(Y ) = P y yP (Y = y) E(Y ) = R1 1 yfY (y)dy E(Y |A) = P y yP (Y = y|A) E(Y |A) = R1 1 yf(y|A)dy Conditioning on a Random Variable We can also find E(Y |X), the expected value of Y given the random variable X. This is a function of the random variable X. It is not a number except in certain special cases such as if X ? Y . 
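The conditioning-on-an-event examples above (the prime die roll and the memoryless shuttle wait) can be confirmed with a short R simulation; the seed, the particular choice t = 5, and the variable names are ours, used only for illustration.

# Checking E(roll | prime) = 10/3 and E(T | T > t) = t + 10 for T ~ Expo(1/10) (sketch).
set.seed(110)
rolls <- sample(1:6, 1e6, replace = TRUE)
mean(rolls[rolls %in% c(2, 3, 5)])        # approx 3.33 = 10/3
t <- rexp(1e6, rate = 1/10)
mean(t[t > 5])                            # approx 15 = 5 + 10, by the memoryless property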
To find E(Y |X), find E(Y |X = x) and then plug in X for x. For example: • If E(Y |X = x) = x3 + 5x, then E(Y |X) = X3 + 5X. • Let Y be the number of successes in 10 independent Bernoulli trials with probability p of success and X be the number of successes among the first 3 trials. Then E(Y |X) = X + 7p. • Let X ⇠ N (0, 1) and Y = X2. Then E(Y |X = x) = x2 since if we know X = x then we know Y = x2. And E(X|Y = y) = 0 since if we know Y = y then we know X = ±p y, with equal probabilities (by symmetry). So E(Y |X) = X2, E(X|Y ) = 0. Properties of Conditional Expectation 1. E(Y |X) = E(Y ) if X ? Y 2. E(h(X)W |X) = h(X)E(W |X) (taking out what’s known) In particular, E(h(X)|X) = h(X). 3. E(E(Y |X)) = E(Y ) (Adam’s Law, a.k.a. Law of Total Expectation) Adam’s Law (a.k.a. Law of Total Expectation) can also be written in a way that looks analogous to LOTP. For any events A1, A2, . . . , An that partition the sample space, E(Y ) = E(Y |A1)P (A1) + · · · + E(Y |An)P (An) For the special case where the partition is A,Ac, this says E(Y ) = E(Y |A)P (A) + E(Y |Ac)P (Ac) Eve’s Law (a.k.a. Law of Total Variance) Var(Y ) = E(Var(Y |X)) + Var(E(Y |X)) MVN, LLN, CLT Law of Large Numbers (LLN) Let X1, X2, X3 . . . be i.i.d. with mean µ. The sample mean is X̄n = X1 + X2 + X3 + · · · + Xn n The Law of Large Numbers states that as n ! 1, X̄n ! µ with probability 1. For example, in flips of a coin with probability p of Heads, let Xj be the indicator of the jth flip being Heads. Then LLN says the proportion of Heads converges to p (with probability 1). Central Limit Theorem (CLT) Approximation using CLT We use ⇠̇ to denote is approximately distributed. We can use the Central Limit Theorem to approximate the distribution of a random variable Y = X1 + X2 + · · · + Xn that is a sum of n i.i.d. random variables Xi. Let E(Y ) = µY and Var(Y ) = 2 Y . The CLT says Y ⇠̇N (µY ,2 Y ) If the Xi are i.i.d. with mean µX and variance 2 X , then µY = nµX and 2 Y = n2 X . For the sample mean X̄n, the CLT says X̄n = 1 n (X1 + X2 + · · · + Xn) ⇠̇N (µX ,2 X/n) Asymptotic Distributions using CLT We use D! to denote converges in distribution to as n ! 1. The CLT says that if we standardize the sum X1 + · · · + Xn then the distribution of the sum converges to N (0, 1) as n ! 1: 1 p n (X1 + · · · + Xn nµX) D! N (0, 1) In other words, the CDF of the left-hand side goes to the standard Normal CDF, . In terms of the sample mean, the CLT says p n(X̄n µX) X D! N (0, 1) Markov Chains Definition 1 21 1/2 31/2 1/4 5/12 41/3 1/6 7/12 51/4 1/8 7/8 A Markov chain is a random walk in a state space, which we will assume is finite, say {1, 2, . . . ,M}. We let Xt denote which element of the state space the walk is visiting at time t. The Markov chain is the sequence of random variables tracking where the walk is at all points in time, X0, X1, X2, . . . . By definition, a Markov chain must satisfy the Markov property, which says that if you want to predict where the chain will be at a future time, if we know the present state then the entire past history is irrelevant. Given the present, the past and future are conditionally independent. In symbols, P (Xn+1 = j|X0 = i0, X1 = i1, . . . , Xn = i) = P (Xn+1 = j|Xn = i) State Properties A state is either recurrent or transient. • If you start at a recurrent state, then you will always return back to that state at some point in the future. ♪You can check-out any time you like, but you can never leave. ♪ • Otherwise you are at a transient state. 
There is some positive probability that once you leave you will never return. ♪You don’t have to go home, but you can’t stay here. ♪ A state is either periodic or aperiodic. • If you start at a periodic state of period k, then the GCD of the possible numbers of steps it would take to return back is k > 1. • Otherwise you are at an aperiodic state. The GCD of the possible numbers of steps it would take to return back is 1. Distribution Properties Important CDFs Standard Normal Exponential() F (x) = 1 ex, for x 2 (0,1) Uniform(0,1) F (x) = x, for x 2 (0, 1) Convolutions of Random Variables A convolution of n random variables is simply their sum. For the following results, let X and Y be independent. 1. X ⇠ Pois(1), Y ⇠ Pois(2) ! X + Y ⇠ Pois(1 + 2) 2. X ⇠ Bin(n1, p), Y ⇠ Bin(n2, p) ! X + Y ⇠ Bin(n1 + n2, p). Bin(n, p) can be thought of as a sum of i.i.d. Bern(p) r.v.s. 3. X ⇠ Gamma(a1,), Y ⇠ Gamma(a2,) ! X + Y ⇠ Gamma(a1 + a2,). Gamma(n,) with n an integer can be thought of as a sum of i.i.d. Expo() r.v.s. 4. X ⇠ NBin(r1, p), Y ⇠ NBin(r2, p) ! X + Y ⇠ NBin(r1 + r2, p). NBin(r, p) can be thought of as a sum of i.i.d. Geom(p) r.v.s. 5. X ⇠ N (µ1, 2 1), Y ⇠ N (µ2, 2 2) ! X + Y ⇠ N (µ1 + µ2, 2 1 + 2 2) Special Cases of Distributions 1. Bin(1, p) ⇠ Bern(p) 2. Beta(1, 1) ⇠ Unif(0, 1) 3. Gamma(1,) ⇠ Expo() 4. 2 n ⇠ Gamma n 2 , 1 2 5. NBin(1, p) ⇠ Geom(p) Inequalities 1. Cauchy-Schwarz |E(XY )|  p E(X2)E(Y 2) 2. Markov P (X a)  E|X| a for a > 0 3. Chebyshev P (|X µ| a)  2 a2 for E(X) = µ,Var(X) = 2 4. Jensen E(g(X)) g(E(X)) for g convex; reverse if g is concave Formulas Geometric Series 1 + r + r2 + · · · + rn1 = n1X k=0 rk = 1 rn 1 r 1 + r + r2 + · · · = 1 1 r if |r| < 1 Exponential Function (ex) ex = 1X n=0 xn n! = 1 + x + x2 2! + x3 3! + · · · = lim n!1 ✓ 1 + x n ◆n Gamma and Beta Integrals You can sometimes solve complicated-looking integrals by pattern-matching to a gamma or beta integral: Z 1 0 xt1ex dx = (t) Z 1 0 xa1(1 x)b1 dx = (a)(b) (a + b) Also, (a + 1) = a(a), and (n) = (n 1)! if n is a positive integer. Euler’s Approximation for Harmonic Sums 1 + 1 2 + 1 3 + · · · + 1 n ⇡ logn + 0.577 . . . Stirling’s Approximation for Factorials n! ⇡ p 2⇡n ✓ n e ◆n Miscellaneous Definitions Medians and Quantiles Let X have CDF F . Then X has median m if F (m) 0.5 and P (X m) 0.5. For X continuous, m satisfies F (m) = 1/2. In general, the ath quantile of X is min{x : F (x) a}; the median is the case a = 1/2. log Statisticians generally use log to refer to natural log (i.e., base e). i.i.d r.v.s Independent, identically-distributed random variables. Example Problems Contributions from Sebastian Chiu Calculating Probability A textbook has n typos, which are randomly scattered amongst its n pages, independently. You pick a random page. What is the probability that it has no typos? Answer: There is a 1 1 n probability that any specific typo isn’t on your page, and thus a ✓ 1 1 n ◆n probability that there are no typos on your page. For n large, this is approximately e1 = 1/e. Linearity and Indicators (1) In a group of n people, what is the expected number of distinct birthdays (month and day)? What is the expected number of birthday matches? Answer: Let X be the number of distinct birthdays and Ij be the indicator for the jth day being represented. E(Ij) = 1 P (no one born on day j) = 1 (364/365)n By linearity, E(X) = 365 (1 (364/365)n) . Now let Y be the number of birthday matches and Ji be the indicator that the ith pair of people have the same birthday. 
The probability that any two specific people share a birthday is 1/365, so E(Y ) = ⇣n 2 ⌘ /365 . Linearity and Indicators (2) This problem is commonly known as the hat-matching problem. There are n people at a party, each with hat. At the end of the party, they each leave with a random hat. What is the expected number of people who leave with the right hat? Answer: Each hat has a 1/n chance of going to the right person. By linearity, the average number of hats that go to their owners is n(1/n) = 1 . Linearity and First Success This problem is commonly known as the coupon collector problem. There are n coupon types. At each draw, you get a uniformly random coupon type. What is the expected number of coupons needed until you have a complete set? Answer: Let N be the number of coupons needed; we want E(N). Let N = N1 + · · · + Nn, where N1 is the draws to get our first new coupon, N2 is the additional draws needed to draw our second new coupon and so on. By the story of the First Success, N2 ⇠ FS((n 1)/n) (after collecting first coupon type, there’s (n 1)/n chance you’ll get something new). Similarly, N3 ⇠ FS((n 2)/n), and Nj ⇠ FS((n j + 1)/n). By linearity, E(N) = E(N1) + · · · + E(Nn) = n n + n n 1 + · · · + n 1 = n nX j=1 1 j This is approximately n(log(n) + 0.577) by Euler’s approximation. Orderings of i.i.d. random variables I call 2 UberX’s and 3 Lyfts at the same time. If the time it takes for the rides to reach me are i.i.d., what is the probability that all the Lyfts will arrive first? Answer: Since the arrival times of the five cars are i.i.d., all 5! orderings of the arrivals are equally likely. There are 3!2! orderings that involve the Lyfts arriving first, so the probability that the Lyfts arrive first is 3!2! 5! = 1/10 . Alternatively, there are 5 3 ways to choose 3 of the 5 slots for the Lyfts to occupy, where each of the choices are equally likely. One of these choices has all 3 of the Lyfts arriving first, so the probability is 1/ ⇣5 3 ⌘ = 1/10 . Expectation of Negative Hypergeometric What is the expected number of cards that you draw before you pick your first Ace in a shu✏ed deck (not counting the Ace)? Answer: Consider a non-Ace. Denote this to be card j. Let Ij be the indicator that card j will be drawn before the first Ace. Note that Ij = 1 says that j is before all 4 of the Aces in the deck. The probability that this occurs is 1/5 by symmetry. Let X be the number of cards drawn before the first Ace. Then X = I1 + I2 + ...+ I48, where each indicator corresponds to one of the 48 non-Aces. Thus, E(X) = E(I1) + E(I2) + ... + E(I48) = 48/5 = 9.6 . Minimum and Maximum of RVs What is the CDF of the maximum of n independent Unif(0,1) random variables? Answer: Note that for r.v.s X1, X2, . . . , Xn, P (min(X1, X2, . . . , Xn) a) = P (X1 a,X2 a, . . . , Xn a) Similarly, P (max(X1, X2, . . . , Xn)  a) = P (X1  a,X2  a, . . . , Xn  a) We will use this principle to find the CDF of U(n), where U(n) = max(U1, U2, . . . , Un) and Ui ⇠ Unif(0, 1) are i.i.d. P (max(U1, U2, . . . , Un)  a) = P (U1  a, U2  a, . . . , Un  a) = P (U1  a)P (U2  a) . . . P (Un  a) = an for 0 < a < 1 (and the CDF is 0 for a  0 and 1 for a 1). Pattern-matching with ex Taylor series For X ⇠ Pois(), find E ✓ 1 X + 1 ◆ . Answer: By LOTUS, E ✓ 1 X + 1 ◆ = 1X k=0 1 k + 1 ek k! = e 1X k=0 k+1 (k + 1)! = e (e 1) Adam’s Law and Eve’s Law William really likes speedsolving Rubik’s Cubes. But he’s pretty bad at it, so sometimes he fails. On any given day, William will attempt N ⇠ Geom(s) Rubik’s Cubes. 
Suppose each time, he has probability p of solving the cube, independently. Let T be the number of Rubik’s Cubes he solves during a day. Find the mean and variance of T . Answer: Note that T |N ⇠ Bin(N, p). So by Adam’s Law, E(T ) = E(E(T |N)) = E(Np) = p(1 s) s Similarly, by Eve’s Law, we have that Var(T ) = E(Var(T |N)) + Var(E(T |N)) = E(Np(1 p)) + Var(Np) = p(1 p)(1 s) s + p2(1 s) s2 = p(1 s)(p + s(1 p)) s2 MGF – Finding Moments Find E(X3) for X ⇠ Expo() using the MGF of X. Answer: The MGF of an Expo() is M(t) = t . To get the third moment, we can take the third derivative of the MGF and evaluate at t = 0: E(X3) = 6 3 But a much nicer way to use the MGF here is via pattern recognition: note that M(t) looks like it came from a geometric series: 1 1 t = 1X n=0 ✓ t ◆n = 1X n=0 n! n tn n! The coecient of tn n! here is the nth moment of X, so we have E(Xn) = n! n for all nonnegative integers n. Markov chains (1) Suppose Xn is a two-state Markov chain with transition matrix Q = ✓ 0 1 0 1 ↵ ↵ 1 1 ◆ Find the stationary distribution ~s = (s0, s1) of Xn by solving ~sQ = ~s, and show that the chain is reversible with respect to ~s. Answer: The equation ~sQ = ~s says that s0 = s0(1 ↵) + s1 and s1 = s0(↵) + s0(1 ) By solving this system of linear equations, we have ~s = ✓ ↵ + , ↵ ↵ + ◆ To show that the chain is reversible with respect to ~s, we must show siqij = sjqji for all i, j. This is done if we can show s0q01 = s1q10. And indeed, s0q01 = ↵ ↵ + = s1q10 Markov chains (2) William and Sebastian play a modified game of Settlers of Catan, where every turn they randomly move the robber (which starts on the center tile) to one of the adjacent hexagons. Robber (a) Is this Markov chain irreducible? Is it aperiodic? Answer: Yes to both. The Markov chain is irreducible because it can get from anywhere to anywhere else. The Markov chain is aperiodic because the robber can return back to a square in 2, 3, 4, 5, . . . moves, and the GCD of those numbers is 1. (b) What is the stationary distribution of this Markov chain? Answer: Since this is a random walk on an undirected graph, the stationary distribution is proportional to the degree sequence. The degree for the corner pieces is 3, the degree for the edge pieces is 4, and the degree for the center pieces is 6. To normalize this degree sequence, we divide by its sum. The sum of the degrees is 6(3) + 6(4) + 7(6) = 84. Thus the stationary probability of being on a corner is 3/84 = 1/28, on an edge is 4/84 = 1/21, and in the center is 6/84 = 1/14. (c) What fraction of the time will the robber be in the center tile in this game, in the long run? Answer: By the above, 1/14 . (d) What is the expected amount of moves it will take for the robber to return to the center tile? Answer: Since this chain is irreducible and aperiodic, to get the expected time to return we can just invert the stationary probability. Thus on average it will take 14 turns for the robber to return to the center tile. Problem-Solving Strategies Contributions from Jessy Hwang, Yuan Jiang, Yuqi Hou 1. Getting started. Start by defining relevant events and random variables. (“Let A be the event that I pick the fair coin”; “Let X be the number of successes.”) Clear notion is important for clear thinking! Then decide what it is that you’re supposed to be finding, in terms of your notation (“I want to find P (X = 3|A)”). Think about what type of object your answer should be (a number? A random variable? A PMF? A PDF?) and what it should be in terms of. Try simple and extreme cases. 
To make an abstract experiment more concrete, try drawing a picture or making up numbers that could have happened. Pattern recognition: does the structure of the problem resemble something we’ve seen before? 2. Calculating probability of an event. Use counting principles if the naive definition of probability applies. Is the probability of the complement easier to find? Look for symmetries. Look for something to condition on, then apply Bayes’ Rule or the Law of Total Probability. 3. Finding the distribution of a random variable. First make sure you need the full distribution not just the mean (see next item). Check the support of the random variable: what values can it take on? Use this to rule out distributions that don’t fit. Is there a story for one of the named distributions that fits the problem at hand? Can you write the random variable as a function of an r.v. with a known distribution, say Y = g(X)? 4. Calculating expectation. If it has a named distribution, check out the table of distributions. If it’s a function of an r.v. with a named distribution, try LOTUS. If it’s a count of something, try breaking it up into indicator r.v.s. If you can condition on something natural, consider using Adam’s law. 5. Calculating variance. Consider independence, named distributions, and LOTUS. If it’s a count of something, break it up into a sum of indicator r.v.s. If it’s a sum, use properties of covariance. If you can condition on something natural, consider using Eve’s Law. 6. Calculating E(X2). Do you already know E(X) or Var(X)? Recall that Var(X) = E(X2) (E(X))2. Otherwise try LOTUS. 7. Calculating covariance. Use the properties of covariance. If you’re trying to find the covariance between two components of a Multinomial distribution, Xi, Xj , then the covariance is npipj for i 6= j. 8. Symmetry. If X1, . . . , Xn are i.i.d., consider using symmetry. 9. Calculating probabilities of orderings. Remember that all n! ordering of i.i.d. continuous random variables X1, . . . , Xn are equally likely. 10. Determining independence. There are several equivalent definitions. Think about simple and extreme cases to see if you can find a counterexample. 11. Do a painful integral. If your integral looks painful, see if you can write your integral in terms of a known PDF (like Gamma or Beta), and use the fact that PDFs integrate to 1? 12. Before moving on. Check some simple and extreme cases, check whether the answer seems plausible, check for biohazards. Biohazards Contributions from Jessy Hwang 1. Don’t misuse the naive definition of probability. When answering “What is the probability that in a group of 3 people, no two have the same birth month?”, it is not correct to treat the people as indistinguishable balls being placed into 12 boxes, since that assumes the list of birth months {January, January, January} is just as likely as the list {January, April, June}, even though the latter is six times more likely. 2. Don’t confuse unconditional, conditional, and joint probabilities. In applying P (A|B) = P (B|A)P (A) P (B) , it is not correct to say “P (B) = 1 because we know B happened”; P (B) is the prior probability of B. Don’t confuse P (A|B) with P (A,B). 3. Don’t assume independence without justification. In the matching problem, the probability that card 1 is a match and card 2 is a match is not 1/n2. Binomial and Hypergeometric are often confused; the trials are independent in the Binomial story and dependent in the Hypergeometric story. 4. Don’t forget to do sanity checks. 
Probabilities must be between 0 and 1. Variances must be 0. Supports must make sense. PMFs must sum to 1. PDFs must integrate to 1. 5. Don’t confuse random variables, numbers, and events. Let X be an r.v. Then g(X) is an r.v. for any function g. In particular, X2, |X|, F (X), and IX>3 are r.v.s. P (X2 < X|X 0), E(X),Var(X), and g(E(X)) are numbers. X = 2 and F (X) 1 are events. It does not make sense to write R1 1 F (X)dx, because F (X) is a random variable. It does not make sense to write P (X), because X is not an event. 6. Don’t confuse a random variable with its distribution. To get the PDF of X2, you can’t just square the PDF of X. The right way is to use transformations. To get the PDF of X + Y , you can’t just add the PDF of X and the PDF of Y . The right way is to compute the convolution. 7. Don’t pull non-linear functions out of expectations. E(g(X)) does not equal g(E(X)) in general. The St. Petersburg paradox is an extreme example. See also Jensen’s inequality. The right way to find E(g(X)) is with LOTUS. Distributions in R Command What it does help(distributions) shows documentation on distributions dbinom(k,n,p) PMF P (X = k) for X ⇠ Bin(n, p) pbinom(x,n,p) CDF P (X  x) for X ⇠ Bin(n, p) qbinom(a,n,p) ath quantile for X ⇠ Bin(n, p) rbinom(r,n,p) vector of r i.i.d. Bin(n, p) r.v.s dgeom(k,p) PMF P (X = k) for X ⇠ Geom(p) dhyper(k,w,b,n) PMF P (X = k) for X ⇠ HGeom(w, b, n) dnbinom(k,r,p) PMF P (X = k) for X ⇠ NBin(r, p) dpois(k,r) PMF P (X = k) for X ⇠ Pois(r) dbeta(x,a,b) PDF f(x) for X ⇠ Beta(a, b) dchisq(x,n) PDF f(x) for X ⇠ 2 n dexp(x,b) PDF f(x) for X ⇠ Expo(b) dgamma(x,a,r) PDF f(x) for X ⇠ Gamma(a, r) dlnorm(x,m,s) PDF f(x) for X ⇠ LN (m, s2) dnorm(x,m,s) PDF f(x) for X ⇠ N (m, s2) dt(x,n) PDF f(x) for X ⇠ tn dunif(x,a,b) PDF f(x) for X ⇠ Unif(a, b) The table above gives R commands for working with various named distributions. Commands analogous to pbinom, qbinom, and rbinom work for the other distributions in the table. For example, pnorm, qnorm, and rnorm can be used to get the CDF, quantiles, and random generation for the Normal. For the Multinomial, dmultinom can be used for calculating the joint PMF and rmultinom can be used for generating random vectors. For the Multivariate Normal, after installing and loading the mvtnorm package dmvnorm can be used for calculating the joint PDF and rmvnorm can be used for generating random vectors. Recommended Resources • Introduction to Probability Book (http://bit.ly/introprobability) • Stat 110 Online (http://stat110.net) • Stat 110 Quora Blog (https://stat110.quora.com/) • Quora Probability FAQ (http://bit.ly/probabilityfaq) • R Studio (https://www.rstudio.com) • LaTeX File (github.com/wzchen/probability cheatsheet) Please share this cheatsheet with friends! http://wzchen.com/probability-cheatsheet • Expected Value: linearity, fundamental bridge, variance, standard deviation, covariance, correlation, using expectation to prove existence, LOTUS. • Conditional Expectation: definition and meaning, taking out what’s known, conditional variance, Adam’s Law (iterated expectation), Eve’s Law. • Important Discrete Distributions: Bernoulli, Binomial, Geometric, Negative Binomial, Hypergeometric, Poisson. • Important Continuous Distributions: Uniform, Normal, Exponential, Gamma, Beta, Chi-Square, Student-t. • Jointly Distributed Random Variables: joint, conditional, and marginal distri- butions, independence, Multinomial, Multivariate Normal, change of variables, order statistics. 
• Convergence: Law of Large Numbers, Central Limit Theorem. • Inequalities: Cauchy-Schwarz, Markov, Chebyshev, Jensen. • Markov chains: Markov property, transition matrix, irreducibility, stationary distributions, reversibility. • Strategies: conditioning, symmetry, linearity, indicator r.v.s, stories, checking whether answers make sense (e.g., looking at simple and extreme cases and avoiding category errors). • Some Important Examples: birthday problem, matching problem (de Mont- mort), Monty Hall, gambler’s ruin, prosecutor’s fallacy, testing for a disease, capture-recapture (elk problem), coupon (toy) collector, St. Petersburg para- dox, Simpson’s paradox, two envelope paradox, waiting time for HH vs. waiting time for HT, store with a random number of customers, bank-post oce ex- ample, Bayes’ billiards, random walk on a network, chicken and egg. 2 3 Important Distributions 3.1 Table of Distributions The table below will be provided on the final (included as the last page). This is meant to help avoid having to memorize formulas for the distributions (or having to take up a lot of space on your pages of notes). Here 0 < p < 1 and q = 1 p. The parameters for Gamma and Beta are positive real numbers; n, r, and w are positive integers, as is b for the Hypergeometric. Name Param. PMF or PDF Mean Variance Bernoulli p P (X = 1) = p, P (X = 0) = q p pq Binomial n, p n k p k q nk , for k 2 {0, 1, . . . , n} np npq Geometric p q k p, for k 2 {0, 1, 2, . . . } q/p q/p 2 NegBinom r, p r+n1 r1 p r q n , n 2 {0, 1, 2, . . . } rq/p rq/p 2 Hypergeom w, b, n (wk)( b nk) (w+b n ) , for k 2 {0, 1, . . . , n} µ = nw w+b (w+bn w+b1 )n µ n (1 µ n ) Poisson e k k! , for k 2 {0, 1, 2, . . . } Uniform a < b 1 ba , for x 2 (a, b) a+b 2 (ba)2 12 Normal µ, 2 1 p 2⇡ e (xµ)2/(22) µ 2 Exponential e x , for x > 0 1/ 1/2 Gamma a, (a)1(x)aex x 1 , for x > 0 a/ a/ 2 Beta a, b (a+b) (a)(b)x a1(1 x)b1, for 0 < x < 1 µ = a a+b µ(1µ) a+b+1 2 n 1 2n/2(n/2) x n/21 e x/2, for x > 0 n 2n Student-t n ((n+1)/2)p n⇡(n/2)(1 + x 2 /n)(n+1)/2 0 if n > 1 n n2 if n > 2 3 3.2 Connections Between Distributions The table above summarizes the PMFs/PDFs of the important distributions, and their means and variances, but it does not say where each distribution comes from (stories), or how the distributions interrelate. Some of these connections between distributions are listed below. Also note that some of the important distributions are special cases of others. Bernoulli is a special case of Binomial; Geometric is a special case of Negative Bi- nomial; Unif(0,1) is a special case of Beta; and Exponential and 2 are both special cases of Gamma. 1. Binomial: If X1, . . . , Xn are i.i.d. Bern(p), then X1 + · · ·+Xn ⇠ Bin(n, p). 2. Neg. Binom.: IfG1, . . . , Gr are i.i.d. Geom(p), thenG1+· · ·+Gr ⇠ NBin(r, p). 3. Location and Scale: If Z ⇠ N (0, 1), then µ+ Z ⇠ N (µ, 2). If U ⇠ Unif(0, 1) and a < b, then a+ (b a)U ⇠ Unif(a, b). If X ⇠ Expo(1), then 1 X ⇠ Expo(). If Y ⇠ Gamma(a,), then Y ⇠ Gamma(a, 1). 4. Symmetry: If X ⇠ Bin(n, 1/2), then nX ⇠ Bin(n, 1/2). If U ⇠ Unif(0, 1), then 1 U ⇠ Unif(0, 1). If Z ⇠ N (0, 1), then Z ⇠ N (0, 1). 5. Universality of Uniform: Let F be the CDF of a continuous r.v., such that F 1 exists. If U ⇠ Unif(0, 1), then F 1(U) has CDF F . Conversely, if X ⇠ F , then F (X) ⇠ Unif(0, 1). 6. Uniform and Beta: Unif(0, 1) is the same distribution as Beta(1, 1). The jth order statistic of n i.i.d. Unif(0, 1) r.v.s is Beta(j, n j + 1). 7. 
Beta and Binomial: Beta is the conjugate prior to Binomial, in the sense that if X|p ⇠ Bin(n, p) and the prior is p ⇠ Beta(a, b), then the posterior is p|X ⇠ Beta(a+X, b+ nX). 8. Gamma: If X1, . . . , Xn are i.i.d. Expo(), then X1+ · · ·+Xn ⇠ Gamma(n,). 9. Gamma and Poisson: In a Poisson process of rate , the number of arrivals in a time interval of length t is Pois(t), while the time of the nth arrival is Gamma(n,). 4 An analogous formula holds for conditioning on a continuous r.v. X with PDF f(x): P (A) = Z 1 1 P (A|X = x)f(x)dx. Similarly, to go from a joint PDF f(x, y) for (X, Y ) to the marginal PDF of Y , integrate over all values of x: fY (y) = Z 1 1 f(x, y)dx. 5.6 Bayes’ Rule P (A|B) = P (B|A)P (A) P (B) . Often the denominator P (B) is then expanded by the Law of Total Probability. For continuous r.v.s X and Y , Bayes’ Rule becomes fY |X(y|x) = fX|Y (x|y)fY (y) fX(x) . 5.7 Expected Value, Variance, and Covariance Expected value is linear : for any random variables X and Y and constant c, E(X + Y ) = E(X) + E(Y ), E(cX) = cE(X). Variance can be computed in two ways: Var(X) = E(X EX)2 = E(X2) (EX)2. Constants come out from variance as the constant squared: Var(cX) = c 2Var(X). For the variance of the sum, there is a covariance term: Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y ), where Cov(X, Y ) = E((X EX)(Y EY )) = E(XY ) (EX)(EY ). 7 So if X and Y are uncorrelated, then the variance of the sum is the sum of the vari- ances. Recall that independent implies uncorrelated but not vice versa. Covariance is symmetric: Cov(Y,X) = Cov(X, Y ), and covariances of sums can be expanded as Cov(X + Y, Z +W ) = Cov(X,Z) + Cov(X,W ) + Cov(Y, Z) + Cov(Y,W ). Note that for c a constant, Cov(X, c) = 0, Cov(cX, Y ) = cCov(X, Y ). The correlation of X and Y , which is between 1 and 1 , is Corr(X, Y ) = Cov(X, Y ) SD(X)SD(Y ) . This is also the covariance of the standardized versions of X and Y . 5.8 Law of the Unconscious Statistician (LOTUS) LetX be a discrete random variable and h be a real-valued function. Then Y = h(X) is a random variable. To compute EY using the definition of expected value, we would need to first find the PMF of Y and use EY = P y yP (Y = y). The Law of the Unconscious Statistician says we can use the PMF of X directly: Eh(X) = X x h(x)P (X = x). Similarly, for X a continuous r.v. with PDF fX(x), we can find the expected value of Y = h(X) by integrating h(x) times the PDF of X, without first finding fY (y): Eh(X) = Z 1 1 h(x)fX(x)dx. 5.9 Indicator Random Variables Let A and B be events. Indicator r.v.s bridge between probability and expectation: P (A) = E(IA), where IA is the indicator r.v. for A. It is often useful to think of a “counting” r.v. as a sum of indicator r.v.s. Indicator r.v.s have many pleasant 8 properties. For example, (IA)k = IA for any positive number k, so it’s easy to handle moments of indicator r.v.s. Also note that IA\B = IAIB, IA[B = IA + IB IAIB. 5.10 Symmetry There are many beautiful and useful forms of symmetry in statistics. For example: 1. If X and Y are i.i.d., then P (X < Y ) = P (Y < X). More generally, if X1, . . . , Xn are i.i.d., then P (X1 < X2 < . . .Xn) = P (Xn < Xn1 < · · · < X1), and likewise all n! orderings are equally likely (in the continuous case it follows that P (X1 < X2 < . . .Xn) = 1 n! , while in the discrete case we also have to consider ties). 2. 
If we shu✏e a deck of cards and deal the first two cards, then the probability is 1/52 that the second card is the Ace of Spades, since by symmetry it’s equally likely to be any card; it’s not necessary to do a law of total probability calculation conditioning on the first card. 3. Consider the Hypergeometric, thought of as the distribution of the number of white balls, where we draw n balls from a jar with w white balls and b black balls (without replacement). By symmetry and linearity, we can immediately get that the expected value is n w w+b , even though the trials are not independent, as the jth ball is equally likely to be any of the balls, and linearity still holds with dependent r.v.s. 4. By symmetry we can see immediately that if T is Cauchy, then 1/T is also Cauchy (since if we flip the ratio of two i.i.d. N (0, 1) r.v.s, we still have the ratio of two i.i.d. N (0, 1) r.v.s!). 5. E(X1|X1 +X2) = E(X2|X1 +X2) by symmetry if X1 and X2 are i.i.d. So by linearity, E(X1|X1+X2)+E(X2|X1+X2) = E(X1+X2|X1+X2) = X1+X2, which gives E(X1|X1 +X2) = (X1 +X2)/2. 9 5.15 Convergence Let X1, X2, . . . be i.i.d. random variables with mean µ and variance 2. The sample mean is defined as X̄n = 1 n nX i=1 Xi. The Strong Law of Large Numbers says that with probability 1, the sample mean converges to the true mean: X̄n ! µ with probability 1. The Weak Law of Large Numbers (which follows from Chebyshev’s Inequality) says that X̄n will be very close to µ with very high probability: for any ✏ > 0, P (|X̄n µ| > ✏) ! 0 as n ! 1. The Central Limit Theorem says that the sum of a large number of i.i.d. random variables is approximately Normal in distribution. More precisely, standardize the sum X1+ · · ·+Xn (by subtracting its mean and dividing by its standard deviation); then the standardized sum approaches N (0, 1) in distribution (i.e., the CDF of the standardized sum converges to ). So (X1 + · · ·+Xn) nµ p n ! N (0, 1) in distribution. In terms of the sample mean, p n (X̄n µ) ! N (0, 1) in distribution. 5.16 Inequalities When probabilities and expected values are hard to compute exactly, it is useful to have inequalities. One simple but handy inequality is Markov’s Inequality: P (X > a)  E|X| a , for any a > 0. Let X have mean µ and variance 2. Using Markov’s Inequality with (X µ)2 in place of X gives Chebyshev’s Inequality: P (|X µ| > a)  2 /a 2 . 12 For convex functions g (convexity of g is equivalent to g 00(x) 0 for all x, assuming this exists), there is Jensen’s Inequality (the reverse inequality holds for concave g): E(g(X)) g(E(X)) for g convex. The Cauchy-Schwarz inequality bounds the expected product of X and Y : |E(XY )|  p E(X2)E(Y 2). If X and Y have mean 0 and variance 1, this reduces to saying that the correlation is between -1 and 1. It follows that correlation is always between -1 and 1. 5.17 Markov Chains Consider a Markov chain X0, X1, . . . with transition matrix Q = (qij), and let v be a row vector listing the initial probabilities of being in each state. Then vQ n is the row vector listing the probabilities of being in each state after n steps, i.e., the jth component is P (Xn = j). A vector s of probabilities (adding to 1) is stationary for the chain if sQ = s; by the above, if a chain starts out with a stationary distribution then the distribution stays the same forever. Any irreducible Markov chain has a unique stationary distribution s, and the chain converges to it: P (Xn = i) ! si as n ! 1. 
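A small numerical illustration of this convergence, as a sketch: the 3-state transition matrix Q below is made up for the example (it is not from the notes). The stationary distribution is recovered both by iterating vQ^n and as the normalized left eigenvector of Q with eigenvalue 1.

# Convergence of vQ^n to the stationary distribution (illustrative sketch).
Q <- matrix(c(0.5, 0.3, 0.2,
              0.2, 0.6, 0.2,
              0.1, 0.4, 0.5), nrow = 3, byrow = TRUE)   # rows sum to 1
v <- c(1, 0, 0)                        # start in state 1 with probability 1
for (n in 1:50) v <- v %*% Q           # v becomes the distribution of X_n
print(v)
e <- eigen(t(Q))                       # left eigenvectors of Q = eigenvectors of t(Q)
i <- which.min(abs(e$values - 1))      # locate the eigenvalue-1 eigenvector
s <- Re(e$vectors[, i]); s <- s / sum(s)
print(s)                               # agrees with v: the unique stationary distribution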
If s is a vector of probabilities (adding to 1) that satisfies the reversibility condition siqij = sjqji for all states i, j, then it automatically follows that s is a stationary distribution for the chain; not all chains have this condition hold, but for those that do it is often easier to show that s is stationary using the reversibility condition than by showing sQ = s. 13 6 Common Mistakes in Probability 6.1 Category errors A category error is a mistake that not only happens to be wrong, but also it is wrong in every possible universe. If someone answers the question “How many students are in Stat 110?” with “10, since it’s one ten,” that is wrong (and a very bad approxima- tion to the truth); but there is no logical reason the enrollment couldn’t be 10, aside from the logical necessity of learning probability for reasoning about uncertainty in the world. But answering the question with “-42” or “⇡” or “pink elephants” would be a category error. To help avoid being categorically wrong, always think about what type an answer should have. Should it be an integer? A nonnegative integer? A number between 0 and 1? A random variable? A distribution? • Probabilities must be between 0 and 1. Example: When asked for an approximation to P (X > 5) for a certain r.v. X with mean 7, writing “P (X > 5) ⇡ E(X)/5.” This makes two mis- takes: Markov’s inequality gives P (X > 5)  E(X)/5, but this is an upper bound, not an approximation; and here E(X)/5 = 1.4, which is silly as an approximation to a probability since 1.4 > 1. • Variances must be nonnegative. Example: For X and Y independent r.v.s, writing that “Var(X Y ) = Var(X) Var(Y )”, which can immediately be seen to be wrong from the fact that it becomes negative if Var(Y ) > Var(X) (and 0 if X and Y are i.i.d.). The correct formula is Var(X Y ) = Var(X)+Var(Y ) 2Cov(X, Y ), which is Var(X) + Var(Y ) if X and Y are uncorrelated. • Correlations must be between 1 and 1. Example: It is common to confuse covariance and correlation; they are related by Corr(X, Y ) = Cov(X, Y )/(SD(X)SD(Y )), which is between -1 and 1. • The range of possible values must make sense. Example: Two people each have 100 friends, and we are interested in the dis- tribution of X = (number of mutual friends). Then writing “X ⇠ N (µ, 2)” doesn’t make sense since X is an integer (sometimes we use the Normal as an approximation to, say, Binomials, but exact answers should be given unless an approximation is specifically asked for); “X ⇠ Pois()” or “X ⇠ Bin(500, 1/2)” don’t make sense since X has possible values 0, 1, . . . , 100. 14 • Introduce clear notation for events and r.v.s of interest. Example: In the Calvin and Hobbes problem (from HW 3 and the final from 2010), clearly the event “Calvin wins the match” is important (so give it a name) and the r.v. “how many of the first two games Calvin wins” is important (so give it a name). Make sure that events you define really are events (they are subsets of the sample space, and it must make sense to talk about whether the event occurs) and that r.v.s you define really are r.v.s (they are functions mapping the sample space to the real line, and it must make sense to talk about their distributions and talk about them as a numerical summary of some aspect of the random experiment). • Think about location and scale when applicable. Example: If Yj ⇠ Expo(), it may be very convenient to work with Xj = Yj, which is Expo(1). 
In studying X ⇠ N (µ, 2), it may be very convenient to write X = µ+ Z where Z ⇠ N (0, 1) is the standardized version of X. 6.3 Common sense and checking answers Whenever possible (i.e., when not under severe time pressure), look for simple ways to check your answers, or at least to check that they are plausible. This can be done in various ways, such as using the following methods. 1. Miracle checks. Does your answer seem intuitively plausible? Is there a cat- egory error? Did asymmetry appear out of nowhere when there should be symmetry? 2. Checking simple and extreme cases. What is the answer to a simpler version of the problem? What happens if n = 1 or n = 2, or as n ! 1, if the problem involves showing something for all n? 3. Looking for alternative approaches and connections with other problems. Is there another natural way to think about the problem? Does the problem relate to other problems we’ve seen? • Probability is full of counterintuitive results, but not impossible results! Example: Suppose that we have P (snow Saturday) = P (snow Sunday) = 1/2. Then we can’t say “P (snow over the weekend) = 1”; clearly there is some chance of no snow, and of course the mistake is to ignore the need for disjointness. 17 Example: In finding E(eX) for X ⇠ Pois(), obtaining an answer that can be negative, or an answer that isn’t an increasing function of (intuitively, it is clear that larger should give larger average values of eX). • Check simple and extreme cases whenever possible. Example: Suppose we want to derive the mean and variance of a Hyperge- ometric, which is the distribution of the number of white balls if we draw n balls without replacement from a bag containing w white balls and b black balls. Suppose that using indicator r.v.s, we (correctly) obtain that the mean is µ = nw w+b and the variance is (w+bn w+b1 )n µ n (1 µ n ). Let’s check that this makes sense for the simple case n = 1: then the mean and variance reduce to those of a Bern(w/(w + b)), which makes sense since with only 1 draw, it doesn’t matter whether sampling is with replacement. Now let’s consider an extreme case where the total number of balls (w + b) is extremely large compared with n. Then it shouldn’t matter much whether the sampling is with or without replacement, so the mean and variance should be very close to those of a Bin(n,w/(b + w)), and indeed this is the case. If we had an answer that did not make sense in simple and extreme cases, we could then look harder for a mistake or explanation. Example: Let X1, X2, . . . , X1000 be i.i.d. with a continuous distribution, and consider the question of whether the event X1 < X2 is independent of the event X1 < X3. Many students guess intuitively that they are independent. But now consider the more extreme question of whether P (X1 < X2|X1 < X3, X1 < X4, . . . , X1 < X1000) is P (X1 < X2). Here most students guess intuitively (and correctly) that P (X1 < X2|X1 < X3, X1 < X4, . . . , X1 < X1000) > P (X1 < X2), since the evidence that X1 is less than all of X3, . . . , X1000 suggests that X1 is very small. Yet this more extreme case is the same in principle, just di↵erent in degree. Similarly, the Monty Hall problem is easier to understand with 1000 doors than with 3 doors. To show algebraically that X1 < X2 is not independent of X1 < X3, note that P (X1 < X2) = 1/2, while P (X1 < X2|X1 < X3) = P (X1 < X2, X1 < X3) P (X1 < X3) = 1/3 1/2 = 2 3 , where the numerator is 1/3 since the smallest of X1, X2, X3 is equally likely to be any of them. 
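The last calculation above is easy to confirm by simulation. In this R sketch, the seed, sample size, and the choice of Uniforms as the common continuous distribution are ours; any continuous i.i.d. choice would do.

# P(X1 < X2) vs P(X1 < X2 | X1 < X3) for i.i.d. continuous r.v.s (sketch).
set.seed(110)
x1 <- runif(1e6); x2 <- runif(1e6); x3 <- runif(1e6)
mean(x1 < x2)                          # approx 1/2
keep <- x1 < x3                        # condition on the event X1 < X3
mean(x1[keep] < x2[keep])              # approx 2/3, so the two events are dependent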
18 • Check that PMFs are nonnegative and sum to 1, and PDFs are nonnegative and integrate to 1 (or that it is at least plausible), when it is not too messy. Example: Writing that the PDF of X is “f(x) = 1 5e 5x for all x > 0 (and 0 otherwise)” is immediately seen to be wrong by integrating (the constant in front should be 5, which can also be seen by recognizing this as an Expo(5). Writing that the PDF is “f(x) = 1+e x 1+x for all x > 0 (and 0 otherwise)” doesn’t make sense since even though the integral is hard to do directly, clearly 1+e x 1+x > 1 1+x , and R1 0 1 1+x dx is infinite. Example: Consider the following problem: “You are invited to attend 6 wed- dings next year, independently with all months of the year equally likely. What is the probability that no two weddings are in the same month?” A common mistake is to treat the weddings as indistinguishable. But no matter how generic and cliched weddings can be sometimes, there must be some way to distinguish two weddings! It often helps to make up concrete names, e.g., saying “ok, we need to look at the possible schedulings of the weddings of Daenerys and Drogo, of Cersei and Robert, . . . ”. There are 126 equally likely possibilities and, for example, it is much more likely to have 1 wedding per month in January through June than to have all 6 weddings in January (whereas treating weddings as indistinguishable would suggest having these be equal). 6.4 Random variables vs. distributions A random variable is not the same thing as its distribution! We call this confusion sympathetic magic, and the consequences of this confusion are often disastrous. Every random variable has a distribution (which can always be expressed using a CDF, which can be expressed by a PMF in the discrete case, and which can be expressed by a PDF in the continuous case). Every distribution can be used as a blueprint for generating r.v.s (for example, one way to do this is using Universality of the Uniform). But that doesn’t mean that doing something to a r.v. corresponds to doing it to the distribution of the r.v. Confusing a distribution with a r.v. with that distribution is like confusing a map of a city with the city itself, or a blueprint of a house with the house itself. The word is not the thing, the map is not the territory. • A function of a r.v. is a r.v. 19 example, if X and Y are i.i.d. N (µ, 2), then X + Y ⇠ N (2µ, 22), while X +X = 2X ⇠ N (2µ, 42). Example: Is it always true that if X ⇠ Pois() and Y ⇠ Pois(), then X + Y ⇠ Pois(2)? What is an example of a sum of Bern(p)’s which is not Binomial? Example: In the two envelope paradox, it is not true that the amount of money in the first envelope is independent of the indicator of which envelope has more money. • Independence is completely di↵erent from disjointness! Example: Sometimes students try to visualize independent events A and B with two non-overlapping ovals in a Venn diagram. Such events in fact can’t be independent (unless one has probability 0), since learning that A happened gives a great deal of information about B: it implies that B did not occur. • Independence is a symmetric property: if A is independent of B, then B is independent of A. There’s no such thing as unrequited independence. Example: If it is non-obvious whether A provides information about B but obvious that B provides information about A, then A and B can’t be indepen- dent. 
• The marginal distributions can be extracted from the joint distribution, but knowing the marginal distributions does not determine the joint distribution.

Example: Calculations that are purely based on the marginal CDFs F_X and F_Y of dependent r.v.s X and Y may not shed much light on events such as {X < Y} which involve X and Y jointly.

• Keep the distinction between prior and posterior probabilities clear.

Example: Suppose that we observe evidence E. Then writing "P(E) = 1 since we know for sure that E happened" is careless; we have P(E|E) = 1, but P(E) is the prior probability (the probability before E was observed).

• Don't confuse P(A|B) with P(B|A).

Example: This mistake is also known as the prosecutor's fallacy since it is often made in legal cases (but not always by the prosecutor!). For example, the prosecutor may argue that the probability of guilt given the evidence is very high by attempting to show that the probability of the evidence given innocence is very low, but in and of itself this is insufficient since it does not use the prior probability of guilt. Bayes' rule thus becomes Bayes' ruler, measuring the weight of the evidence by relating P(A|B) to P(B|A) and showing us how to update our beliefs based on evidence.

• Don't confuse P(A|B) with P(A,B).

Example: The law of total probability is often wrongly written without the weights as "P(A) = P(A|B) + P(A|B^c)" rather than P(A) = P(A,B) + P(A,B^c) = P(A|B)P(B) + P(A|B^c)P(B^c).

• The expression Y|X does not denote a r.v.; it is notation indicating that in working with Y, we should use the conditional distribution of Y given X (i.e., treat X as a known constant). The expression E(Y|X) is a r.v., and is a function of X (we have summed or integrated over the possible values of Y).

Example: Writing "E(Y|X) = Y" is wrong, except if Y is a function of X, e.g., E(X³|X) = X³; by definition, E(Y|X) must be g(X) for some function g, so any answer for E(Y|X) that is not of this form is a category error.

7 Stat 110 Final from 2006

1. The number of fish in a certain lake is a Pois(λ) random variable. Worried that there might be no fish at all, a statistician adds one fish to the lake. Let Y be the resulting number of fish (so Y is 1 plus a Pois(λ) random variable).

(a) Find E(Y²) (simplify).

(b) Find E(1/Y) (in terms of λ; do not simplify yet).

(c) Find a simplified expression for E(1/Y). Hint: k!(k + 1) = (k + 1)!.

4. A post office has 2 clerks. Alice enters the post office while 2 other customers, Bob and Claire, are being served by the 2 clerks. She is next in line. Assume that the time a clerk spends serving a customer has the Expo(λ) distribution.

(a) What is the probability that Alice is the last of the 3 customers to be done being served? (Simplify.) Justify your answer. Hint: no integrals are needed.

(b) Let X and Y be independent Expo(λ) r.v.s. Find the CDF of min(X, Y).

(c) What is the expected total time that Alice needs to spend at the post office?

5. Bob enters a casino with X0 = 1 dollar and repeatedly plays the following game: with probability 1/3, the amount of money he has increases by a factor of 3; with probability 2/3, the amount of money he has decreases by a factor of 3. Let Xn be the amount of money he has after playing this game n times. Thus, Xn+1 is 3Xn with probability 1/3 and is Xn/3 with probability 2/3.

(a) Compute E(X1), E(X2) and, in general, E(Xn). (Simplify.)

(b) What happens to E(Xn) as n → ∞?

Let Yn be the number of times out of the first n games that Bob triples his money.
What happens to Yn/n as n → ∞?

(c) Does Xn converge to some number c as n → ∞ (with probability 1) and if so, what is c? Explain.

6. Let X and Y be independent standard Normal r.v.s and let R² = X² + Y² (where R > 0 is the distance from (X, Y) to the origin).

(a) The distribution of R² is an example of three of the "important distributions" listed on the last page. State which three of these distributions R² is an instance of, specifying the parameter values.

(b) Find the PDF of R. (Simplify.) Hint: start with the PDF f_W(w) of W = R².

(c) Find P(X > 2Y + 3) in terms of the standard Normal CDF Φ. (Simplify.)

(d) Compute Cov(R², X). Are R² and X independent?

9. An urn contains red, green, and blue balls. Balls are chosen randomly with replacement (each time, the color is noted and then the ball is put back.) Let r, g, b be the probabilities of drawing a red, green, blue ball respectively (r + g + b = 1).

(a) Find the expected number of balls chosen before obtaining the first red ball, not including the red ball itself. (Simplify.)

(b) Find the expected number of different colors of balls obtained before getting the first red ball. (Simplify.)

(c) Find the probability that at least 2 of n balls drawn are red, given that at least 1 is red. (Simplify; avoid sums of large numbers of terms, and Σ or · · · notation.)

10. Let X0, X1, X2, . . . be an irreducible Markov chain with state space {1, 2, . . . , M}, M ≥ 3, transition matrix Q = (q_ij), and stationary distribution s = (s1, . . . , sM). The initial state X0 is given the stationary distribution, i.e., P(X0 = i) = si.

(a) On average, how many of X0, X1, . . . , X9 equal 3? (In terms of s; simplify.)

(b) Let Yn = (Xn − 1)(Xn − 2). For M = 3, find an example of Q (the transition matrix for the original chain X0, X1, . . . ) where Y0, Y1, . . . is Markov, and another example of Q where Y0, Y1, . . . is not Markov. Mark which is which and briefly explain. In your examples, make q_ii > 0 for at least one i and make sure it is possible to get from any state to any other state eventually.

(c) If each column of Q sums to 1, what is s? Verify using the definition of stationary.

8 Stat 110 Final from 2007

1. Consider the birthdays of 100 people. Assume people's birthdays are independent, and the 365 days of the year (exclude the possibility of February 29) are equally likely.

(a) Find the expected number of birthdays represented among the 100 people, i.e., the expected number of days that at least 1 of the people has as his or her birthday (your answer can involve unsimplified fractions but should not involve messy sums).

(b) Find the covariance between how many of the people were born on January 1 and how many were born on January 2.

4. Consider the following conversation from an episode of The Simpsons:

Lisa: Dad, I think he's an ivory dealer! His boots are ivory, his hat is ivory, and I'm pretty sure that check is ivory.

Homer: Lisa, a guy who's got lots of ivory is less likely to hurt Stampy than a guy whose ivory supplies are low.

Here Homer and Lisa are debating the question of whether or not the man (named Blackheart) is likely to hurt Stampy the Elephant if they sell Stampy to him. They clearly disagree about how to use their observations about Blackheart to learn about the probability (conditional on the evidence) that Blackheart will hurt Stampy.

(a) Define clear notation for the various events of interest here.
(b) Express Lisa’s and Homer’s arguments (Lisa’s is partly implicit) as conditional probability statements in terms of your notation from (a). (c) Assume it is true that someone who has a lot of a commodity will have less desire to acquire more of the commodity. Explain what is wrong with Homer’s reasoning that the evidence about Blackheart makes it less likely that he will harm Stampy. 37 5. Empirically, it is known that 49% of children born in the U.S. are girls (and 51% are boys). Let N be the number of children who will be born in the U.S. in March 2009, and assume that N is a Pois() random variable, where is known. Assume that births are independent (e.g., don’t worry about identical twins). Let X be the number of girls who will be born in the U.S. in March 2009, and let Y be the number of boys who will be born then (note the importance of choosing good notation: boys have a Y chromosome). (a) Find the joint distribution of X and Y . (Give the joint PMF.) (b) Find E(N |X) and E(N2|X). 38 6. Let X1, X2, X3 be independent with Xi ⇠ Expo(i) (so with possibly di↵erent rates). A useful fact (which you may use) is that P (X1 < X2) = 1 1+2 . (a) Find E(X1 +X2 +X3|X1 > 1, X2 > 2, X3 > 3) in terms of 1,2,3. (b) Find P (X1 = min(X1, X2, X3)), the probability that the first of the three Expo- nentials is the smallest. Hint: re-state this in terms of X1 and min(X2, X3). (c) For the case 1 = 2 = 3 = 1, find the PDF of max(X1, X2, X3). Is this one of the “important distributions”? 39 9. Consider a knight randomly moving around on a 4 by 4 chessboard: ! A! ! B! ! C ! ! D 4 3 2 1 The 16 squares are labeled in a grid, e.g., the knight is currently at the square B3, and the upper left square is A4. Each move of the knight is an L-shape: two squares horizontally followed by one square vertically, or vice versa. For example, from B3 the knight can move to A1, C1, D2, or D4; from A4 it can move to B2 or C3. Note that from a white square, the knight always moves to a gray square and vice versa. At each step, the knight moves randomly, each possibility equally likely. Consider the stationary distribution of this Markov chain, where the states are the 16 squares. (a) Which squares have the highest stationary probability? Explain very briefly. (b) Compute the stationary distribution (simplify). Hint: random walk on a graph. 42 9 Stat 110 Final from 2008 1. Joe’s iPod has 500 di↵erent songs, consisting of 50 albums of 10 songs each. He listens to 11 random songs on his iPod, with all songs equally likely and chosen independently (so repetitions may occur). (a) What is the PMF of how many of the 11 songs are from his favorite album? (b) What is the probability that there are 2 (or more) songs from the same album among the 11 songs he listens to? (Do not simplify.) (c) A pair of songs is a “match” if they are from the same album. If, say, the 1st, 3rd, and 7th songs are all from the same album, this counts as 3 matches. Among the 11 songs he listens to, how many matches are there on average? (Simplify.) 43 2. Let X and Y be positive random variables, not necessarily independent. Assume that the various expressions below exist. Write the most appropriate of , , =, or ? in the blank for each part (where “?” means that no relation holds in general.) It is not necessary to justify your answers for full credit; some partial credit is available for justified answers that are flawed but on the right track. 
(a) P(X + Y > 2) ____ (EX + EY)/2

(b) P(X + Y > 3) ____ P(X > 3)

(c) E(cos(X)) ____ cos(EX)

(d) E(X^{1/3}) ____ (EX)^{1/3}

(e) E(XY) ____ (EX)(EY)

(f) E(E(X|Y) + E(Y|X)) ____ EX + EY

5. A post office has 2 clerks. Alice enters the post office while 2 other customers, Bob and Claire, are being served by the 2 clerks. She is next in line. Assume that the time a clerk spends serving a customer has the Expo(λ) distribution.

(a) What is the probability that Alice is the last of the 3 customers to be done being served? Justify your answer. Hint: no integrals are needed.

(b) Let X and Y be independent Expo(λ) r.v.s. Find the CDF of min(X, Y).

(c) What is the expected total time that Alice needs to spend at the post office?

6. You are given an amazing opportunity to bid on a mystery box containing a mystery prize! The value of the prize is completely unknown, except that it is worth at least nothing, and at most a million dollars. So the true value V of the prize is considered to be Uniform on [0,1] (measured in millions of dollars). You can choose to bid any amount b (in millions of dollars). You have the chance to get the prize for considerably less than it is worth, but you could also lose money if you bid too much. Specifically, if b < (2/3)V, then the bid is rejected and nothing is gained or lost. If b ≥ (2/3)V, then the bid is accepted and your net payoff is V − b (since you pay b to get a prize worth V). What is your optimal bid b (to maximize the expected payoff)?

7. (a) Let Y = e^X, with X ∼ Expo(3). Find the mean and variance of Y (simplify).

(b) For Y1, . . . , Yn i.i.d. with the same distribution as Y from (a), what is the approximate distribution of the sample mean Ȳn = (1/n) Σ_{j=1}^n Yj when n is large? (Simplify, and specify all parameters.)

2. Let X and Y be positive random variables, not necessarily independent. Assume that the various expected values below exist. Write the most appropriate of ≤, ≥, =, or ? in the blank for each part (where "?" means that no relation holds in general.) It is not necessary to justify your answers for full credit; some partial credit is available for justified answers that are flawed but on the right track.

(a) E(X³) ____ √(E(X²)E(X⁴))

(b) P(|X + Y| > 2) ____ (1/16)E((X + Y)⁴)

(c) E(√(X + 3)) ____ √(E(X + 3))

(d) E(sin²(X)) + E(cos²(X)) ____ 1

(e) E(Y | X + 3) ____ E(Y | X)

(f) E(E(Y²|X)) ____ (EY)²

3. Let Z ∼ N(0, 1). Find the 4th moment E(Z⁴) in the following two different ways:

(a) using what you know about how certain powers of Z are related to other distributions, along with information from the table of distributions.

(b) using the MGF M(t) = e^{t²/2}, by writing down its Taylor series and using how the coefficients relate to moments of Z, not by tediously taking derivatives of M(t). Hint: you can get this series immediately from the Taylor series for e^x.

4. A chicken lays n eggs. Each egg independently does or doesn't hatch, with probability p of hatching. For each egg that hatches, the chick does or doesn't survive (independently of the other eggs), with probability s of survival. Let N ∼ Bin(n, p) be the number of eggs which hatch, X be the number of chicks which survive, and Y be the number of chicks which hatch but don't survive (so X + Y = N).

(a) Find the distribution of X, preferably with a clear explanation in words rather than with a computation. If X has one of the "important distributions," say which (including its parameters).

(b) Find the joint PMF of X and Y (simplify).

(c) Are X and Y independent?
Give a clear explanation in words (of course it makes sense to see if your answer is consistent with your answer to (b), but you can get full credit on this part even without doing (b); conversely, it's not enough to just say "by (b), . . . " without further explanation).

7. Let X1, X2, X3 be independent with Xi ∼ Expo(λi) (so with possibly different rates). A useful fact (which you may use) is that P(X1 < X2) = λ1/(λ1 + λ2).

(a) Find E(X1 + X2 + X3 | X1 > 1, X2 > 2, X3 > 3) in terms of λ1, λ2, λ3.

(b) Find P(X1 = min(X1, X2, X3)), the probability that the first of the three Exponentials is the smallest. Hint: re-state this in terms of X1 and min(X2, X3).

(c) For the case λ1 = λ2 = λ3 = 1, find the PDF of max(X1, X2, X3). Is this one of the "important distributions"?

8. Let Xn be the price of a certain stock at the start of the nth day, and assume that X0, X1, X2, . . . follows a Markov chain with transition matrix Q (assume for simplicity that the stock price can never go below 0 or above a certain upper bound, and that it is always rounded to the nearest dollar).

(a) A lazy investor only looks at the stock once a year, observing the values on days 0, 365, 2·365, 3·365, . . . . So the investor observes Y0, Y1, . . . , where Yn is the price after n years (which is 365n days; you can ignore leap years). Is Y0, Y1, . . . also a Markov chain? Explain why or why not; if so, what is its transition matrix?

(b) The stock price is always an integer between $0 and $28. From each day to the next, the stock goes up or down by $1 or $2, all with equal probabilities (except for days when the stock is at or near a boundary, i.e., at $0, $1, $27, or $28). If the stock is at $0, it goes up to $1 or $2 on the next day (after receiving government bailout money). If the stock is at $28, it goes down to $27 or $26 the next day. If the stock is at $1, it either goes up to $2 or $3, or down to $0 (with equal probabilities); similarly, if the stock is at $27 it either goes up to $28, or down to $26 or $25. Find the stationary distribution of the chain (simplify).

11 Stat 110 Final from 2010

1. Calvin and Hobbes play a match consisting of a series of games, where Calvin has probability p of winning each game (independently). They play with a "win by two" rule: the first player to win two games more than his opponent wins the match.

(a) What is the probability that Calvin wins the match (in terms of p)? Hint: condition on the results of the first k games (for some choice of k).

(b) Find the expected number of games played. Hint: consider the first two games as a pair, then the next two as a pair, etc.

4. Let X be a discrete r.v. whose distinct possible values are x0, x1, . . . , and let pk = P(X = xk). The entropy of X is defined to be H(X) = −Σ_{k=0}^∞ pk log₂(pk).

(a) Find H(X) for X ∼ Geom(p). Hint: use properties of logs, and interpret part of the sum as an expected value.

(b) Find H(X³) for X ∼ Geom(p), in terms of H(X).

(c) Let X and Y be i.i.d. discrete r.v.s. Show that P(X = Y) ≥ 2^{−H(X)}. Hint: Consider E(log₂(W)), where W is a r.v. taking value pk with probability pk.

5. Let Z1, . . . , Zn ∼ N(0, 1) be i.i.d.

(a) As a function of Z1, create an Expo(1) r.v. X (your answer can also involve the standard Normal CDF Φ).

(b) Let Y = e^R, where R = √(Z1² + · · · + Zn²). Write down (but do not evaluate) an integral for E(Y).

(c) Let X1 = 3Z1 − 2Z2 and X2 = 4Z1 + 6Z2. Determine whether X1 and X2 are independent (being sure to mention which results you're using).

6. Let X1, X2, . . . be i.i.d. positive r.v.s. with mean µ, and let Wn = X1/(X1 + · · · + Xn).
(a) Find E(Wn). Hint: consider X1/(X1 + · · · + Xn) + X2/(X1 + · · · + Xn) + · · · + Xn/(X1 + · · · + Xn).

(b) What random variable does nWn converge to as n → ∞?

(c) For the case that Xj ∼ Expo(λ), find the distribution of Wn, preferably without using calculus. (If it is one of the "important distributions" state its name and specify the parameters; otherwise, give the PDF.)

Stat 110 Final Review Solutions, Fall 2011
Prof. Joe Blitzstein (Department of Statistics, Harvard University)

1 Solutions to Stat 110 Final from 2006

1. The number of fish in a certain lake is a Pois(λ) random variable. Worried that there might be no fish at all, a statistician adds one fish to the lake. Let Y be the resulting number of fish (so Y is 1 plus a Pois(λ) random variable).

(a) Find E(Y²) (simplify).

We have Y = X + 1 with X ∼ Pois(λ), so Y² = X² + 2X + 1. So E(Y²) = E(X² + 2X + 1) = E(X²) + 2E(X) + 1 = (λ + λ²) + 2λ + 1 = λ² + 3λ + 1, since E(X²) = Var(X) + (EX)² = λ + λ².

(b) Find E(1/Y) (in terms of λ; do not simplify yet).

By LOTUS, E(1/Y) = E(1/(X + 1)) = Σ_{k=0}^∞ (1/(k + 1)) e^{−λ} λ^k / k!

(c) Find a simplified expression for E(1/Y). Hint: k!(k + 1) = (k + 1)!.

Σ_{k=0}^∞ (1/(k + 1)) e^{−λ} λ^k / k! = e^{−λ} Σ_{k=0}^∞ λ^k / (k + 1)! = (e^{−λ}/λ) Σ_{k=0}^∞ λ^{k+1} / (k + 1)! = (e^{−λ}/λ)(e^λ − 1) = (1/λ)(1 − e^{−λ}).

2. Write the most appropriate of ≤, ≥, =, or ? in the blank for each part (where "?" means that no relation holds in general.) It is not necessary to justify your answers for full credit; some partial credit is available for justified answers that are flawed but on the right track. In (c) through (f), X and Y are i.i.d. (independent identically distributed) positive random variables. Assume that the various expected values exist.

(a) (probability that a roll of 2 fair dice totals 9) ≥ (probability that a roll of 2 fair dice totals 10)

The probability on the left is 4/36 and that on the right is 3/36, as there is only one way for both dice to show 5's.

(b) (probability that 65% of 20 children born are girls) ≥ (probability that 65% of 2000 children born are girls)

With a large number of births, by the LLN it becomes likely that the fraction that are girls is close to 1/2.

(c) E(√X) ≤ √(E(X))

By Jensen's inequality (or since Var(√X) ≥ 0).

(d) E(sin X) ? sin(EX)

The inequality can go in either direction. For example, let X be 0 or π with equal probabilities. Then E(sin X) = 0, sin(EX) = 1. But if we let X be π/2 or 5π/2 with equal probabilities, then E(sin X) = 1, sin(EX) = −1.

(e) P(X + Y > 4) ≥ P(X > 2)P(Y > 2)

The righthand side is P(X > 2, Y > 2) by independence. The ≥ then holds since the event {X > 2, Y > 2} is a subset of the event {X + Y > 4}.

(f) E((X + Y)²) = 2E(X²) + 2(EX)²

The lefthand side is E(X²) + E(Y²) + 2E(XY) = E(X²) + E(Y²) + 2E(X)E(Y) = 2E(X²) + 2(EX)², since X and Y are i.i.d.

3. A fair die is rolled twice, with outcomes X for the 1st roll and Y for the 2nd roll.

(a) Compute the covariance of X + Y and X − Y (simplify).

Cov(X + Y, X − Y) = Cov(X, X) − Cov(X, Y) + Cov(Y, X) − Cov(Y, Y) = 0.

(b) Are X + Y and X − Y independent? Justify your answer clearly.

They are not independent: information about X + Y may give information about X − Y. For example, if we know that X + Y = 12, then X = Y = 6, so X − Y = 0.

(c) Find the moment generating function M_{X+Y}(t) of X + Y (your answer should be a function of t and can contain unsimplified finite sums).

Since X and Y are i.i.d., LOTUS gives M_{X+Y}(t) = E(e^{t(X+Y)}) = E(e^{tX})E(e^{tY}) = ((1/6) Σ_{k=1}^6 e^{kt})².
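Not part of the original solutions: a brief Python simulation sketch (NumPy assumed; seed and sample size arbitrary) that double-checks problem 3 above — the covariance of X + Y and X − Y is 0, yet the two are clearly dependent.

```python
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed
n_sim = 10**6
x = rng.integers(1, 7, n_sim)            # first roll of a fair die
y = rng.integers(1, 7, n_sim)            # second roll
s, d = x + y, x - y

print("Cov(X+Y, X-Y)         :", np.cov(s, d)[0, 1])        # close to 0
print("P(X-Y = 0)            :", np.mean(d == 0))           # 1/6
print("P(X-Y = 0 | X+Y = 12) :", np.mean(d[s == 12] == 0))  # exactly 1
```

Conditioning on X + Y = 12 forces X − Y = 0, which is exactly the dependence noted in the written solution, even though the covariance vanishes.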
6. Let X and Y be independent standard Normal r.v.s and let R² = X² + Y² (where R > 0 is the distance from (X, Y) to the origin).

(a) The distribution of R² is an example of three of the "important distributions" listed on the last page. State which three of these distributions R² is an instance of, specifying the parameter values. (For example, if it were Geometric with p = 1/3, the distribution would be Geom(1/3) and also NBin(1, 1/3).)

It is χ²_2, Expo(1/2), and Gamma(1, 1/2).

(b) Find the PDF of R. (Simplify.) Hint: start with the PDF f_W(w) of W = R².

R = √W with f_W(w) = (1/2)e^{−w/2} gives f_R(r) = f_W(w)|dw/dr| = (1/2)e^{−r²/2} · 2r = r e^{−r²/2}, for r > 0. (This is known as the Rayleigh distribution.)

(c) Find P(X > 2Y + 3) in terms of the standard Normal CDF Φ. (Simplify.)

P(X > 2Y + 3) = P(X − 2Y > 3) = 1 − Φ(3/√5), since X − 2Y ∼ N(0, 5).

(d) Compute Cov(R², X). Are R² and X independent?

They are not independent, since knowing X gives information about R², e.g., X² being large implies that R² is large. But R² and X are uncorrelated:

Cov(R², X) = Cov(X² + Y², X) = Cov(X², X) + Cov(Y², X) = E(X³) − (EX²)(EX) + 0 = 0.

7. Let U1, U2, . . . , U60 be i.i.d. Unif(0,1) and X = U1 + U2 + · · · + U60.

(a) Which important distribution is the distribution of X very close to? Specify what the parameters are, and state which theorem justifies your choice.

By the Central Limit Theorem, the distribution is approximately N(30, 5) since E(X) = 30, Var(X) = 60/12 = 5.

(b) Give a simple but accurate approximation for P(X > 17). Justify briefly.

P(X > 17) = 1 − P(X ≤ 17) = 1 − P((X − 30)/√5 ≤ −13/√5) ≈ 1 − Φ(−13/√5) = Φ(13/√5). Since 13/√5 > 5, and we already have Φ(3) ≈ 0.9985 by the 68-95-99.7% rule, the value is extremely close to 1.

(c) Find the moment generating function (MGF) of X.

The MGF of U1 is E(e^{tU1}) = ∫_0^1 e^{tu} du = (e^t − 1)/t for t ≠ 0, and the MGF of U1 is 1 for t = 0. Thus, the MGF of X is 1 for t = 0, and for t ≠ 0 it is E(e^{tX}) = E(e^{t(U1 + · · · + U60)}) = (E(e^{tU1}))^{60} = (e^t − 1)^{60} / t^{60}.

8. Let X1, X2, . . . , Xn be i.i.d. random variables with E(X1) = 3, and consider the sum Sn = X1 + X2 + · · · + Xn.

(a) What is E(X1X2X3|X1)? (Simplify. Your answer should be a function of X1.)

E(X1X2X3|X1) = X1E(X2X3|X1) = X1E(X2)E(X3) = 9X1.

(b) What is E(X1|Sn) + E(X2|Sn) + · · · + E(Xn|Sn)? (Simplify.)

By linearity, it is E(Sn|Sn), which is Sn.

(c) What is E(X1|Sn)? (Simplify.) Hint: use (b) and symmetry.

By symmetry, E(Xj|Sn) = E(X1|Sn) for all j. Then by (b), nE(X1|Sn) = Sn, so E(X1|Sn) = Sn/n.

2 Solutions to Stat 110 Final from 2007

1. Consider the birthdays of 100 people. Assume people's birthdays are independent, and the 365 days of the year (exclude the possibility of February 29) are equally likely.

(a) Find the expected number of birthdays represented among the 100 people, i.e., the expected number of days that at least 1 of the people has as his or her birthday (your answer can involve unsimplified fractions but should not involve messy sums).

Define indicator r.v.s Ij where Ij = 1 if the jth day of the year appears on the list of all the birthdays. Then E(Ij) = P(Ij = 1) = 1 − (364/365)^100, so

E(Σ_{j=1}^{365} Ij) = 365(1 − (364/365)^100).

(b) Find the covariance between how many of the people were born on January 1 and how many were born on January 2.

Let Xj be the number of people born on January j. Then Cov(X1, X2) = −100/365². To see this, we can use the result about covariances in the Multinomial, or we can solve the problem directly as follows (or with various other methods). Let Aj be the indicator for the jth person having been born on January 1, and define Bj similarly for January 2. Then

E(X1X2) = E((Σ_i Ai)(Σ_j Bj)) = E(Σ_{i,j} AiBj) = 100 · 99 · (1/365)²,

since AiBi = 0, while Ai and Bj are independent for i ≠ j. So Cov(X1, X2) = 100 · 99 · (1/365)² − (100/365)² = −100/365².
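The following check is not part of the original solutions; it is a small Python sketch (NumPy assumed; seed and number of trials arbitrary) that simulates the 100 birthdays and compares the observed average number of distinct birthdays with the closed form 365(1 − (364/365)^100).

```python
import numpy as np

rng = np.random.default_rng(0)                     # arbitrary seed
n_people, n_days, n_sim = 100, 365, 100_000
bdays = rng.integers(0, n_days, size=(n_sim, n_people))

# Count the distinct birthdays in each simulated group of 100 people.
s = np.sort(bdays, axis=1)
distinct = 1 + (np.diff(s, axis=1) != 0).sum(axis=1)

print("simulated mean:", distinct.mean())
print("closed form   :", n_days * (1 - (364 / 365) ** n_people))   # about 87.6
```

The covariance in part (b), −100/365² ≈ −7.5 × 10⁻⁴, is too small to pin down reliably with a quick simulation of this size, so only the expectation is checked here.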
2. Let X and Y be positive random variables, not necessarily independent. Assume that the various expected values below exist. Write the most appropriate of ≤, ≥, =, or ? in the blank for each part (where "?" means that no relation holds in general.) It is not necessary to justify your answers for full credit; some partial credit is available for justified answers that are flawed but on the right track.

(a) (E(XY))² ≤ E(X²)E(Y²) (by Cauchy-Schwarz)

(b) P(|X + Y| > 2) ≤ (1/10)E((X + Y)⁴) (by Markov's Inequality)

(c) E(ln(X + 3)) ≤ ln(E(X + 3)) (by Jensen)

(d) E(X² e^X) ≥ E(X²)E(e^X) (since X² and e^X are positively correlated)

(e) P(X + Y = 2) ? P(X = 1)P(Y = 1) (What if X, Y are independent? What if X ∼ Bern(1/2) and Y = 1 − X?)

(f) P(X + Y = 2) ≤ P({X ≥ 1} ∪ {Y ≥ 1}) (left event is a subset of right event)

3. Let X and Y be independent Pois(λ) random variables. Recall that the moment generating function (MGF) of X is M(t) = e^{λ(e^t − 1)}.

(a) Find the MGF of X + 2Y (simplify).

E(e^{t(X+2Y)}) = E(e^{tX})E(e^{2tY}) = e^{λ(e^t − 1)} e^{λ(e^{2t} − 1)} = e^{λ(e^t + e^{2t} − 2)}.

(b) Is X + 2Y also Poisson? Show that it is, or that it isn't (whichever is true).

No, it is not Poisson. This can be seen by noting that the MGF from (a) is not of the form of a Poisson MGF, or by noting that E(X + 2Y) = 3λ and Var(X + 2Y) = 5λ are not equal, whereas any Poisson random variable has mean equal to its variance.

(c) Let g(t) = ln M(t) be the log of the MGF of X. Expanding g(t) as a Taylor series

g(t) = Σ_{j=1}^∞ (c_j / j!) t^j

(the sum starts at j = 1 because g(0) = 0), the coefficient c_j is called the jth cumulant of X. Find c_j in terms of λ, for all j ≥ 1 (simplify).

Using the Taylor series for e^t, g(t) = λ(e^t − 1) = λ Σ_{j=1}^∞ t^j / j!, so c_j = λ for all j ≥ 1.

6. Let X1, X2, X3 be independent with Xi ∼ Expo(λi) (independent Exponentials with possibly different rates). A useful fact (which you may use) is that P(X1 < X2) = λ1/(λ1 + λ2).

(a) Find E(X1 + X2 + X3 | X1 > 1, X2 > 2, X3 > 3) in terms of λ1, λ2, λ3.

By linearity, independence, and the memoryless property, we get E(X1 | X1 > 1) + E(X2 | X2 > 2) + E(X3 | X3 > 3) = 1/λ1 + 1/λ2 + 1/λ3 + 6.

(b) Find P(X1 = min(X1, X2, X3)), the probability that the first of the three Exponentials is the smallest. Hint: re-state this in terms of X1 and min(X2, X3).

The desired probability is P(X1 ≤ min(X2, X3)). Noting that min(X2, X3) ∼ Expo(λ2 + λ3) is independent of X1, we have P(X1 ≤ min(X2, X3)) = λ1/(λ1 + λ2 + λ3).

(c) For the case λ1 = λ2 = λ3 = 1, find the PDF of max(X1, X2, X3). Is this one of the "important distributions"?

Let M = max(X1, X2, X3). Using the order statistics results from class or by directly computing the CDF and taking the derivative, for x > 0 we have

f_M(x) = 3(1 − e^{−x})² e^{−x}.

This is not one of the "important distributions". (The form is reminiscent of a Beta, but a Beta takes values between 0 and 1, while M can take any positive real value; in fact, B ∼ Beta(1, 3) if we make the transformation B = e^{−M}.)
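Not part of the original solutions: a short Python sketch (NumPy assumed; seed, sample size, and the particular check points are arbitrary) comparing the simulated CDF of M = max(X1, X2, X3) with (1 − e^{−x})³, and checking that E(e^{−M}) matches the Beta(1, 3) mean of 1/4.

```python
import numpy as np

rng = np.random.default_rng(0)                          # arbitrary seed
m = rng.exponential(1.0, size=(10**6, 3)).max(axis=1)   # M = max of 3 i.i.d. Expo(1)

for x in (0.5, 1.0, 2.0):
    print(f"P(M <= {x}): simulated {np.mean(m <= x):.4f}, exact {(1 - np.exp(-x))**3:.4f}")

# The solution notes that B = e^{-M} ~ Beta(1, 3), whose mean is 1/(1 + 3) = 1/4.
print("E(e^{-M}): simulated", np.mean(np.exp(-m)), ", Beta(1,3) mean 0.25")
```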
7. Let X1, X2, . . . be i.i.d. random variables with CDF F(x). For every number x, let Rn(x) count how many of X1, . . . , Xn are less than or equal to x.

(a) Find the mean and variance of Rn(x) (in terms of n and F(x)).

Let Ij(x) be 1 if Xj ≤ x and 0 otherwise. Then Rn(x) = Σ_{j=1}^n Ij(x) ∼ Bin(n, F(x)), so E(Rn(x)) = nF(x) and Var(Rn(x)) = nF(x)(1 − F(x)).

(b) Assume (for this part only) that X1, . . . , X4 are known constants. Sketch an example showing what the graph of the function R4(x)/4 might look like. Is the function R4(x)/4 necessarily a CDF? Explain briefly.

For X1, . . . , X4 distinct, the graph of R4(x)/4 starts at 0 and then has 4 jumps, each of size 0.25 (it jumps every time one of the Xi's is reached). The function R4(x)/4 is the CDF of a discrete random variable with possible values X1, X2, X3, X4.

(c) Show that Rn(x)/n → F(x) as n → ∞ (with probability 1).

As in (a), Rn(x) is the sum of n i.i.d. Bern(p) r.v.s, where p = F(x). So by the Law of Large Numbers, Rn(x)/n → F(x) as n → ∞ (with probability 1).

8. (a) Let T be a Student-t r.v. with 1 degree of freedom, and let W = 1/T. Find the PDF of W (simplify). Is this one of the "important distributions"? Hint: no calculus is needed for this (though it can be used to check your answer).

Recall that a Student-t with 1 degree of freedom (also known as a Cauchy) can be represented as a ratio X/Y with X and Y i.i.d. N(0, 1). But then the reciprocal Y/X is of the same form! So W is also Student-t with 1 degree of freedom, with PDF f_W(w) = 1/(π(1 + w²)).

(b) Let Wn ∼ χ²_n (the Chi-squared distribution with n degrees of freedom), for each n ≥ 1. Do there exist a_n and b_n such that a_n(Wn − b_n) → N(0, 1) in distribution as n → ∞? If so, find them; if not, explain why not.

Write Wn = Σ_{i=1}^n Zi² with the Zi i.i.d. N(0, 1). By the CLT, the claim is true with b_n = E(Wn) = n and a_n = 1/√(Var(Wn)) = 1/√(2n).

(c) Let Z ∼ N(0, 1) and Y = |Z|. Find the PDF of Y, and approximate P(Y < 2).

For y ≥ 0, the CDF of Y is P(Y ≤ y) = P(|Z| ≤ y) = P(−y ≤ Z ≤ y) = Φ(y) − Φ(−y), so the PDF of Y is

f_Y(y) = (1/√(2π)) e^{−y²/2} + (1/√(2π)) e^{−y²/2} = 2 · (1/√(2π)) e^{−y²/2}.

By the 68-95-99.7% Rule, P(Y < 2) ≈ 0.95.
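As a final supplement (not part of the original solutions), here is a short Python sketch (NumPy assumed; seed and sample size arbitrary) that checks the folded-Normal answer in 8(c): P(|Z| < 2) ≈ 0.95, along with the mean E|Z| = √(2/π) that follows from integrating y · 2φ(y) over y ≥ 0.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)                 # arbitrary seed
y = np.abs(rng.standard_normal(10**6))         # Y = |Z| with Z ~ N(0, 1)

# P(Y < 2) = Phi(2) - Phi(-2) = erf(2 / sqrt(2)), about 0.9545.
print("P(Y < 2): simulated", np.mean(y < 2), ", exact", erf(2 / sqrt(2)))

# Integrating y * 2*phi(y) over y >= 0 gives E(Y) = sqrt(2/pi), about 0.798.
print("E(Y)    : simulated", y.mean(), ", exact", np.sqrt(2 / np.pi))
```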