Bayesian Estimation and Hypothesis Testing - Prof. Sridhar Mahadevan, Papers of Computer Science

An in-depth explanation of Bayesian estimation, including hierarchical Bayesian modeling, a Bayes estimation example, the Gamma and Beta distributions, binomial Bayes estimation, the Dirichlet distribution, Gaussian Bayes estimation, and beta priors. It also covers hypothesis testing, likelihood ratio tests, the Neyman-Pearson lemma, and Bayes decision theory. A set of lecture notes from a course on machine learning and statistics.

Typology: Papers

Pre 2010

Uploaded on 08/19/2009



Bayes Estimation

Sridhar Mahadevan
mahadeva@cs.umass.edu
University of Massachusetts

©Sridhar Mahadevan: CMPSCI 689 – p.1/16

Topics

- Hierarchical Bayesian modeling
- Bayes estimation example
- Gamma and Beta distributions
- Binomial Bayes estimation
- Dirichlet distribution
- Gaussian Bayes estimation

Bayesian Estimation

Generalizing from this example, suppose we were trying to estimate the probability θ from a sequence of Bernoulli experiments in which k successes were recorded in n trials. We know that the answer provided by maximum likelihood would be

$$\hat{\theta} = \arg\max_\theta P(k \mid n, \theta) = \arg\max_\theta \theta^k (1 - \theta)^{n-k} = \frac{k}{n}$$

We can try to determine whether this is an unbiased estimator, and use the Cramer-Rao theorem to determine whether it is a minimum-variance unbiased estimator. However, there is a fairly basic problem we have ignored so far: maximum likelihood tends to converge slowly, and can be inaccurate for small sample sizes. If we got 3 heads in the first 3 tosses, then θ̂ = 1, which is far from the correct value (assuming a fair coin).

Bayesian Estimation

Suppose I chose one of M coins to toss, and then tossed it n times. You don't get to see which coin I picked. How can you estimate its success probability θ_i? By Bayes' rule,

$$P(\theta_i \mid n, k) = \frac{P(k \mid n, \theta_i)\, P(\theta_i)}{\sum_{j=1}^{M} P(k \mid n, \theta_j)\, P(\theta_j)}$$

Now assume that, in the limit, the number of coins M → ∞. Then we would guess that the right solution looks like this:

$$P(\theta \mid n, k) = \frac{P(k \mid n, \theta)\, P(\theta)}{\int_0^1 P(k \mid n, \theta)\, P(\theta)\, d\theta}$$

Applying this general result to the binomial leads to the beta distribution, whose value depends on the gamma function.
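The finite-coin version of Bayes' rule above can be computed directly. The following sketch (my addition, not from the notes) places a uniform prior over M candidate biases and shows how the posterior mean tempers the small-sample MLE of θ̂ = 1 after 3 heads in 3 tosses:

```python
from math import comb

def coin_posterior(M, n, k):
    """Posterior over M candidate coin biases theta_i, uniform prior 1/M.

    The prior factor 1/M is constant, so it cancels in the normalization.
    """
    thetas = [(i + 1) / (M + 1) for i in range(M)]            # candidate biases in (0, 1)
    likes = [comb(n, k) * t**k * (1 - t)**(n - k) for t in thetas]
    total = sum(likes)
    return thetas, [l / total for l in likes]

# 3 heads in 3 tosses: MLE says theta = 1, but the posterior mean is pulled
# toward the interior of (0, 1) by the prior.
thetas, post = coin_posterior(M=99, n=3, k=3)
post_mean = sum(t * p for t, p in zip(thetas, post))
print(round(post_mean, 3))   # noticeably less than the MLE of 1.0
```

As M grows, this discrete posterior approaches the continuous beta posterior derived in the following slides.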
In the binomial case, the normalizer is

$$\int_0^1 \theta^k (1 - \theta)^{n-k}\, P(\theta)\, d\theta$$

Beta and Gamma Functions

The gamma function is defined as

$$\Gamma(\alpha) = \int_0^\infty x^{\alpha - 1} e^{-x}\, dx, \qquad \Gamma(\alpha) = (\alpha - 1)\, \Gamma(\alpha - 1)$$

The Beta function is given as

$$B(\alpha_1, \alpha_2) = \int_0^1 x^{\alpha_1 - 1} (1 - x)^{\alpha_2 - 1}\, dx = \frac{\Gamma(\alpha_1)\, \Gamma(\alpha_2)}{\Gamma(\alpha_1 + \alpha_2)}, \qquad \alpha_1 > 0,\ \alpha_2 > 0$$

Beta Distribution

To derive the beta distribution, we first start with two independent one-parameter gamma-distributed random variables x_1 and x_2, whose joint PDF is given by

$$f(x_1, x_2 \mid \alpha_1, \alpha_2) = \frac{x_1^{\alpha_1 - 1} e^{-x_1}}{\Gamma(\alpha_1)} \cdot \frac{x_2^{\alpha_2 - 1} e^{-x_2}}{\Gamma(\alpha_2)} = \frac{1}{\Gamma(\alpha_1)\, \Gamma(\alpha_2)}\, x_1^{\alpha_1 - 1} x_2^{\alpha_2 - 1} e^{-x_1 - x_2}$$

The beta distribution arises naturally when we consider ratios (or proportions) of random variables. To see that, define the following transformation from x_1, x_2 to y_1, y_2:

$$y_1 = x_1 + x_2, \qquad y_2 = \frac{x_1}{x_1 + x_2}$$

Note that since 0 < x_1, x_2 < ∞, it follows that 0 < y_1 < ∞ and 0 < y_2 < 1. We will show that y_2 is a beta-distributed random variable. First, the inverse transformation from y_1, y_2 back to x_1, x_2 can readily be seen to be

$$x_1 = y_1 y_2, \qquad x_2 = y_1 (1 - y_2)$$

Beta Distribution

We can use the change-of-variables formula for transformations of random variables to obtain the PDF g(y_1, y_2) from the PDF f(x_1, x_2).
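Before deriving this formally, the claim that y_2 = x_1/(x_1 + x_2) is beta-distributed can be checked by simulation. This sketch is my addition: it draws independent gamma variates with the stdlib's `random.gammavariate` and checks that the sample mean of y_2 matches the Beta(α_1, α_2) mean α_1/(α_1 + α_2):

```python
import random
from statistics import mean

random.seed(0)
a1, a2 = 2.0, 3.0

# Draw x1 ~ Gamma(a1), x2 ~ Gamma(a2) (shape parameter, scale 1) and form
# the proportion y2 = x1 / (x1 + x2), which should be Beta(a1, a2).
samples = []
for _ in range(100000):
    x1 = random.gammavariate(a1, 1.0)
    x2 = random.gammavariate(a2, 1.0)
    samples.append(x1 / (x1 + x2))

# The Beta(a1, a2) mean is a1 / (a1 + a2) = 0.4 here.
print(round(mean(samples), 2))
```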
The Jacobian of the transformation from y_1, y_2 to x_1, x_2 is given by

$$\begin{vmatrix} \frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} \\[4pt] \frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2} \end{vmatrix} = \begin{vmatrix} y_2 & y_1 \\ 1 - y_2 & -y_1 \end{vmatrix} = -y_1$$

so, taking its absolute value y_1,

$$g(y_1, y_2 \mid \alpha_1, \alpha_2) = \frac{y_1^{\alpha_1 + \alpha_2 - 1} e^{-y_1}}{\Gamma(\alpha_1)\, \Gamma(\alpha_2)}\, y_2^{\alpha_1 - 1} (1 - y_2)^{\alpha_2 - 1}$$

Finally, we can write the PDF g_2(y_2 | α_1, α_2) as the marginal

$$g_2(y_2 \mid \alpha_1, \alpha_2) = \frac{y_2^{\alpha_1 - 1} (1 - y_2)^{\alpha_2 - 1}}{\Gamma(\alpha_1)\, \Gamma(\alpha_2)} \int_0^\infty y_1^{\alpha_1 + \alpha_2 - 1} e^{-y_1}\, dy_1 = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\, \Gamma(\alpha_2)}\, y_2^{\alpha_1 - 1} (1 - y_2)^{\alpha_2 - 1}$$

Beta Distribution

We can write the general form of the beta distribution as (where 0 < x < 1)

$$f(x \mid \alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\, \Gamma(\alpha_2)}\, x^{\alpha_1 - 1} (1 - x)^{\alpha_2 - 1}$$

Returning to the original example of Bayesian estimation for the binomial distribution, let us assume for simplicity that our initial prior is the uniform prior P(θ) = 1. Then

$$\int_0^1 \theta^k (1 - \theta)^{n-k}\, P(\theta)\, d\theta = \frac{\Gamma(k + 1)\, \Gamma(n - k + 1)}{\Gamma(n + 2)}$$

The posterior probability P(θ | n, k) becomes

$$P(\theta \mid n, k) = \frac{P(k \mid n, \theta)\, P(\theta)}{\int_0^1 P(k \mid n, \theta)\, P(\theta)\, d\theta} = \frac{\Gamma(n + 2)}{\Gamma(k + 1)\, \Gamma(n - k + 1)}\, \theta^k (1 - \theta)^{n-k}$$

which can be used to predict the next instance:

$$P(x_{n+1} \mid X) = \int P(x_{n+1} \mid \theta)\, P(\theta \mid X)\, d\theta$$

Dirichlet Priors

For the multinomial distribution, the beta distribution can be generalized to the Dirichlet form

$$P(\theta) = C(\alpha)\, \theta_1^{\alpha_1 - 1} \cdots \theta_M^{\alpha_M - 1}, \qquad \sum_{i=1}^{M} \theta_i = 1, \qquad C(\alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{M} \alpha_i\right)}{\prod_{i=1}^{M} \Gamma(\alpha_i)}$$

Suppose N trials are performed, where the number of outcomes taking the ith value is $\sum_{n=1}^{N} x_{in}$. Then, generalizing the above argument, the posterior distribution is given by

$$P(\theta \mid x) = C(\alpha')\, \theta_1^{\sum_{n=1}^{N} x_{1n} + \alpha_1 - 1}\, \theta_2^{\sum_{n=1}^{N} x_{2n} + \alpha_2 - 1} \cdots \theta_M^{\sum_{n=1}^{N} x_{Mn} + \alpha_M - 1}$$

Exercise: Prove that the normalizer C(α) has the above form.

Bayesian Univariate Gaussian Estimation

The Gaussian forms a conjugate prior with itself, so we can simply place a Gaussian prior on the unknown mean, whose mean is µ_0 and whose variance is fixed at τ².
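Before turning to the Gaussian case, a quick numerical check of the binomial prediction integral (my addition, not from the notes): under the uniform prior, integrating θ against the Beta(k+1, n−k+1) posterior gives (k+1)/(n+2), Laplace's rule of succession. For 3 heads in 3 tosses the predicted probability of another head is 4/5, not the MLE's 1:

```python
from math import gamma

def predictive_numeric(n, k, steps=200000):
    """Midpoint-rule integration of theta * Beta(k+1, n-k+1) posterior over (0, 1)."""
    norm = gamma(n + 2) / (gamma(k + 1) * gamma(n - k + 1))   # posterior normalizer
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += t * norm * t**k * (1 - t)**(n - k) * h
    return total

n, k = 3, 3
# Closed form (k + 1) / (n + 2) = 0.8 should match the numeric integral.
print(round(predictive_numeric(n, k), 3), (k + 1) / (n + 2))
```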
That is, the prior for the mean is now described by the distribution

$$P(\mu) = \frac{1}{(2\pi\tau^2)^{1/2}}\, e^{-\frac{1}{2\tau^2}(\mu - \mu_0)^2}$$

Given N IID samples, the posterior distribution P(µ | x) is proportional to the joint P(x, µ):

$$P(x, \mu) = \frac{1}{(2\pi\sigma^2)^{N/2}}\, e^{-\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2} \cdot \frac{1}{(2\pi\tau^2)^{1/2}}\, e^{-\frac{1}{2\tau^2}(\mu - \mu_0)^2}$$

This expression can be simplified to yield (see DHS, Chapter 3, or Jordan, Chapter 4, Appendix A)

$$P(\mu \mid x) = \frac{1}{(2\pi\hat{\sigma}^2)^{1/2}}\, e^{-\frac{1}{2\hat{\sigma}^2}(\mu - \hat{\mu})^2}$$

Hypothesis Testing and Bayes Decision Theory

Sridhar Mahadevan
mahadeva@cs.umass.edu
University of Massachusetts

Hypothesis Testing

The framework of hypothesis testing is applied to machine learning in many ways:

- If you fit a parametric model using class-conditional density functions, hypothesis testing tells you what type of test can optimally discriminate among classes.
- It forms the theoretical basis behind significance tests (sometimes called p-values) often used in experimental design to confirm that the difference between two methods is indeed significant.

For simplicity, consider the set of hypotheses described by the parameters underlying a distribution P(x | θ). A hypothesis is some subset of the space of parameters, e.g., µ = 0 or λ > 3.

Null Hypothesis

Hypothesis testing is usually framed in terms of a null hypothesis H_0 (to be accepted or rejected) and an alternative hypothesis H_1. A typical test involves deciding, for a particular data set, between

H_0: θ ∈ Θ_0 against H_1: θ ∈ Θ_1

Other types of tests include goodness-of-fit tests, where given a model (density) f_0(x | θ) and the (unknown) true density f(x | θ),

H_0: f = f_0 against H_1: f ≠ f_0

Another type of test compares one family against another:

H_0: f = f_0 against H_1: f = f_1

Critical Region

A hypothesis is simple if it is of the form θ = θ_0. A composite hypothesis is formed from simple ones, e.g., θ > θ_0.
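Returning for a moment to the Bayesian Gaussian estimation slide above: the posterior parameters µ̂ and σ̂² have a standard closed form (the one derived in DHS, Chapter 3). The sketch below is my addition; it illustrates that with a broad prior (large τ²) the posterior mean approaches the sample mean:

```python
from statistics import mean

def gaussian_posterior(xs, sigma2, mu0, tau2):
    """Posterior N(mu_hat, sigma2_hat) for the mean of a Gaussian with known
    variance sigma2, under a N(mu0, tau2) prior (standard conjugate result)."""
    n, xbar = len(xs), mean(xs)
    mu_hat = (n * tau2 * xbar + sigma2 * mu0) / (n * tau2 + sigma2)
    sigma2_hat = (sigma2 * tau2) / (n * tau2 + sigma2)
    return mu_hat, sigma2_hat

# Broad prior (tau2 = 100): the data dominate, so mu_hat is close to the
# sample mean, and the posterior variance shrinks well below the prior's.
mu_hat, s2_hat = gaussian_posterior([4.8, 5.1, 5.3], sigma2=1.0, mu0=0.0, tau2=100.0)
print(round(mu_hat, 3), round(s2_hat, 3))
```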
The critical region of a test is the set of all samples X = x_1, ..., x_n for which the null hypothesis H_0 is rejected. To find optimal tests, we need to define the error made by a test. There are two types of errors, Type I and Type II:

              Accept H_0    Reject H_0
H_0 is true   Correct       Type I
H_1 is true   Type II       Correct

Since Type I errors are often more serious than Type II errors, we decide on an acceptable level of Type I error, and then define the critical region C by minimizing the Type II error at that level. The probability of a Type I error, denoted by α, is called the size or significance level of a test.

Sufficient Statistics and Likelihood Ratio

We now show that likelihood ratios are always a function of the sufficient statistic of the underlying distribution. By the factorization theorem, if T(X) is a sufficient statistic for the distribution P(X | θ), then we know that f(X | θ) = g(T(X), θ) h(X). Hence

$$\frac{L_x(H_1)}{L_x(H_0)} = \frac{\sup_{\theta \in \Theta_1} f_X(x \mid \theta)}{\sup_{\theta \in \Theta_0} f_X(x \mid \theta)} = \frac{\sup_{\theta \in \Theta_1} g(T(x), \theta)\, h(x)}{\sup_{\theta \in \Theta_0} g(T(x), \theta)\, h(x)} = \frac{\sup_{\theta \in \Theta_1} g(T(x), \theta)}{\sup_{\theta \in \Theta_0} g(T(x), \theta)} = \frac{L_{T(X)}(H_1)}{L_{T(X)}(H_0)}$$

Neyman Pearson Framework

Theorem: Assume that the null hypothesis H_0: f = f_0 is to be tested against H_1: f = f_1, where f_0 and f_1 are positive continuous PDFs defined on the same region. Then, among all tests of size ≤ α, the test with the smallest probability of Type II error is given by the critical region

$$C = \left\{ x : \frac{f_1(x)}{f_0(x)} > k \right\}$$

where k is determined by

$$\alpha = P(X \in C \mid H_0) = \int_C f_0(x \mid \theta)\, dx$$

Proof: See Weiss, Chapter 6, or Casella and Berger, Chapter 8.

Essentially, the Neyman Pearson theorem tells us that likelihood ratio tests are optimal, and in practice most if not all of modern empirical hypothesis testing is based on this theorem. Since likelihood ratios are functions of a sufficient statistic, in practice it is often much simpler to think of the test as a function of a sufficient statistic.
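For a Gaussian mean with known σ, the size-α cutoff on the sufficient statistic x̄ comes directly from the normal quantile function. A minimal sketch (my addition), using the stdlib's `statistics.NormalDist`:

```python
from statistics import NormalDist

def np_cutoff(mu0, sigma, n, alpha):
    """Cutoff c such that rejecting H0 when xbar > c gives a size-alpha test,
    assuming xbar ~ N(mu0, sigma^2 / n) under H0."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)     # upper-alpha point of N(0, 1)
    return mu0 + z_alpha * sigma / n**0.5

# With mu0 = 5, sigma = 1, n = 4, alpha = 0.05: c = 5 + 1.645 / 2.
c = np_cutoff(mu0=5.0, sigma=1.0, n=4, alpha=0.05)
print(round(c, 3))
```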
For example, if H_0: µ = µ_0 versus H_1: µ = µ_1, the Neyman Pearson likelihood ratio test is essentially to reject H_0 if the sample mean x̄ > c, where P(X̄ > c | H_0) = α.

Example

Suppose H_0: µ = µ_0 and H_1: µ = µ_1. We know that the sample mean x̄ ~ N(µ_0, σ²/n) under the null hypothesis H_0. Applying the standard transformation,

$$Z = \frac{\sqrt{n}\,(\bar{X} - \mu_0)}{\sigma} \sim N(0, 1)$$

Thus, a test of size α has to reject H_0 if z > z_α, where z_α = Φ^{-1}(1 − α), i.e., the point such that P(N(0, 1) > z_α) = α.

Suppose X = {5.1, 5.5, 4.9, 5.3} ~ N(µ, σ²), where σ² = 1 is known but µ is unknown. Assume we want a test with significance α = 0.05 of H_0: µ = 5 against H_1: µ = 6. Then, since x̄ = 5.2,

$$z = \frac{2\,(5.2 - 5)}{1} = 0.4$$

and 0.4 < 1.645, where P(N(0, 1) > 1.645) = 0.05, so H_0 is not rejected.

©Sridhar Mahadevan: CMPSCI 689 – p.11/13
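The numbers in this example can be reproduced in a few lines (code is my addition, not part of the notes):

```python
from statistics import NormalDist, mean

# The data and test settings from the example above.
data = [5.1, 5.5, 4.9, 5.3]
mu0, sigma, alpha = 5.0, 1.0, 0.05

xbar = mean(data)                                  # sample mean, 5.2
z = len(data)**0.5 * (xbar - mu0) / sigma          # sqrt(4) * 0.2 / 1 = 0.4
z_alpha = NormalDist().inv_cdf(1 - alpha)          # about 1.645
reject = z > z_alpha                               # 0.4 < 1.645, so H0 stands

print(round(z, 3), round(z_alpha, 3), reject)
```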