Bayesian Estimation and Hypothesis Testing - Prof. Sridhar Mahadevan, Papers of Computer Science

An in-depth explanation of Bayesian estimation, including hierarchical Bayesian modeling, a Bayes estimation example, the Gamma and Beta distributions, binomial Bayes estimation, the Dirichlet distribution, Gaussian Bayes estimation, and beta priors. It also covers hypothesis testing, likelihood ratio tests, the Neyman-Pearson lemma, and Bayes decision theory. A set of lecture notes from a course on machine learning and statistics.

Typology: Papers

Pre 2010

Uploaded on 08/19/2009



Bayes Estimation

Sridhar Mahadevan
mahadeva@cs.umass.edu
University of Massachusetts

©Sridhar Mahadevan: CMPSCI 689 – p.1/16

Topics

- Hierarchical Bayesian modeling
- Bayes estimation example
- Gamma and Beta distributions
- Binomial Bayes estimation
- Dirichlet distribution
- Gaussian Bayes estimation

Bayesian Estimation

Generalizing from this example, suppose we were trying to estimate the probability θ from a sequence of Bernoulli experiments in which k successes were recorded in n trials. We know that the answer provided by maximum likelihood would be

$$\hat{\theta} = \arg\max_\theta P(k \mid n, \theta) = \arg\max_\theta \theta^k (1 - \theta)^{n-k} = \frac{k}{n}$$

We can try to determine whether this is an unbiased estimator, and use the Cramer-Rao theorem to determine whether it is a minimum-variance unbiased estimator. However, there is a fairly basic problem we have ignored so far: maximum likelihood tends to converge slowly, and can be inaccurate for small sample sizes. If we got 3 heads in the first 3 tosses, then θ̂ = 1, which is far from the correct value (assuming a fair coin).

Bayesian Estimation

Suppose I chose one of M coins to toss, and then tossed it n times. You don't get to see which coin I picked. How can you estimate its success probability θ_i? By Bayes' rule,

$$P(\theta_i \mid n, k) = \frac{P(k \mid n, \theta_i)\, P(\theta_i)}{\sum_{j=1}^{M} P(k \mid n, \theta_j)\, P(\theta_j)}$$

Now assume that, in the limit, the number of coins M → ∞. Then we would guess that the right solution looks like this:

$$P(\theta \mid n, k) = \frac{P(k \mid n, \theta)\, P(\theta)}{\int_0^1 P(k \mid n, \theta)\, P(\theta)\, d\theta}$$

Applying this general result to the binomial leads to the beta distribution, whose value depends on the gamma function.
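The finite-coin version of Bayes' rule above can be computed directly. The following sketch (my addition, not from the notes) places a uniform prior over M candidate biases and shows how the posterior mean tempers the small-sample MLE of θ̂ = 1 after 3 heads in 3 tosses:

```python
from math import comb

def coin_posterior(M, n, k):
    """Posterior over M candidate coin biases theta_i, uniform prior 1/M.

    The prior factor 1/M is constant, so it cancels in the normalization.
    """
    thetas = [(i + 1) / (M + 1) for i in range(M)]            # candidate biases in (0, 1)
    likes = [comb(n, k) * t**k * (1 - t)**(n - k) for t in thetas]
    total = sum(likes)
    return thetas, [l / total for l in likes]

# 3 heads in 3 tosses: MLE says theta = 1, but the posterior mean is pulled
# toward the interior of (0, 1) by the prior.
thetas, post = coin_posterior(M=99, n=3, k=3)
post_mean = sum(t * p for t, p in zip(thetas, post))
print(round(post_mean, 3))   # noticeably less than the MLE of 1.0
```

As M grows, this discrete posterior approaches the continuous beta posterior derived in the following slides.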
In the binomial case, the normalizer is

$$\int_0^1 \theta^k (1 - \theta)^{n-k}\, P(\theta)\, d\theta$$

Beta and Gamma Functions

The gamma function is defined as

$$\Gamma(\alpha) = \int_0^\infty x^{\alpha - 1} e^{-x}\, dx, \qquad \Gamma(\alpha) = (\alpha - 1)\, \Gamma(\alpha - 1)$$

The Beta function is given as

$$B(\alpha_1, \alpha_2) = \int_0^1 x^{\alpha_1 - 1} (1 - x)^{\alpha_2 - 1}\, dx = \frac{\Gamma(\alpha_1)\, \Gamma(\alpha_2)}{\Gamma(\alpha_1 + \alpha_2)}, \qquad \alpha_1 > 0,\ \alpha_2 > 0$$

Beta Distribution

To derive the beta distribution, we first start with two independent one-parameter gamma-distributed random variables x_1 and x_2, whose joint PDF is given by

$$f(x_1, x_2 \mid \alpha_1, \alpha_2) = \frac{x_1^{\alpha_1 - 1} e^{-x_1}}{\Gamma(\alpha_1)} \cdot \frac{x_2^{\alpha_2 - 1} e^{-x_2}}{\Gamma(\alpha_2)} = \frac{1}{\Gamma(\alpha_1)\, \Gamma(\alpha_2)}\, x_1^{\alpha_1 - 1} x_2^{\alpha_2 - 1} e^{-x_1 - x_2}$$

The beta distribution arises naturally when we consider ratios (or proportions) of random variables. To see that, define the following transformation from x_1, x_2 to y_1, y_2:

$$y_1 = x_1 + x_2, \qquad y_2 = \frac{x_1}{x_1 + x_2}$$

Note that since 0 < x_1, x_2 < ∞, it follows that 0 < y_1 < ∞ and 0 < y_2 < 1. We will show that y_2 is a beta-distributed random variable. First, the inverse transformation from y_1, y_2 back to x_1, x_2 can readily be seen to be

$$x_1 = y_1 y_2, \qquad x_2 = y_1 (1 - y_2)$$

Beta Distribution

We can use the change-of-variables formula for transformations of random variables to obtain the PDF g(y_1, y_2) from the PDF f(x_1, x_2).
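Before deriving this formally, the claim that y_2 = x_1/(x_1 + x_2) is beta-distributed can be checked by simulation. This sketch is my addition: it draws independent gamma variates with the stdlib's `random.gammavariate` and checks that the sample mean of y_2 matches the Beta(α_1, α_2) mean α_1/(α_1 + α_2):

```python
import random
from statistics import mean

random.seed(0)
a1, a2 = 2.0, 3.0

# Draw x1 ~ Gamma(a1), x2 ~ Gamma(a2) (shape parameter, scale 1) and form
# the proportion y2 = x1 / (x1 + x2), which should be Beta(a1, a2).
samples = []
for _ in range(100000):
    x1 = random.gammavariate(a1, 1.0)
    x2 = random.gammavariate(a2, 1.0)
    samples.append(x1 / (x1 + x2))

# The Beta(a1, a2) mean is a1 / (a1 + a2) = 0.4 here.
print(round(mean(samples), 2))
```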
The Jacobian of the transformation from y_1, y_2 to x_1, x_2 is given by

$$\begin{vmatrix} \frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} \\[4pt] \frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2} \end{vmatrix} = \begin{vmatrix} y_2 & y_1 \\ 1 - y_2 & -y_1 \end{vmatrix} = -y_1$$

so, taking its absolute value y_1,

$$g(y_1, y_2 \mid \alpha_1, \alpha_2) = \frac{y_1^{\alpha_1 + \alpha_2 - 1} e^{-y_1}}{\Gamma(\alpha_1)\, \Gamma(\alpha_2)}\, y_2^{\alpha_1 - 1} (1 - y_2)^{\alpha_2 - 1}$$

Finally, we can write the PDF g_2(y_2 | α_1, α_2) as the marginal

$$g_2(y_2 \mid \alpha_1, \alpha_2) = \frac{y_2^{\alpha_1 - 1} (1 - y_2)^{\alpha_2 - 1}}{\Gamma(\alpha_1)\, \Gamma(\alpha_2)} \int_0^\infty y_1^{\alpha_1 + \alpha_2 - 1} e^{-y_1}\, dy_1 = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\, \Gamma(\alpha_2)}\, y_2^{\alpha_1 - 1} (1 - y_2)^{\alpha_2 - 1}$$

Beta Distribution

We can write the general form of the beta distribution as (where 0 < x < 1)

$$f(x \mid \alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\, \Gamma(\alpha_2)}\, x^{\alpha_1 - 1} (1 - x)^{\alpha_2 - 1}$$

Returning to the original example of Bayesian estimation for the binomial distribution, let us assume for simplicity that our initial prior is the uniform prior P(θ) = 1. Then

$$\int_0^1 \theta^k (1 - \theta)^{n-k}\, P(\theta)\, d\theta = \frac{\Gamma(k + 1)\, \Gamma(n - k + 1)}{\Gamma(n + 2)}$$

The posterior probability P(θ | n, k) becomes

$$P(\theta \mid n, k) = \frac{P(k \mid n, \theta)\, P(\theta)}{\int_0^1 P(k \mid n, \theta)\, P(\theta)\, d\theta} = \frac{\Gamma(n + 2)}{\Gamma(k + 1)\, \Gamma(n - k + 1)}\, \theta^k (1 - \theta)^{n-k}$$

which can be used to predict the next instance:

$$P(x_{n+1} \mid X) = \int P(x_{n+1} \mid \theta)\, P(\theta \mid X)\, d\theta$$

Dirichlet Priors

For the multinomial distribution, the beta distribution can be generalized to the Dirichlet form

$$P(\theta) = C(\alpha)\, \theta_1^{\alpha_1 - 1} \cdots \theta_M^{\alpha_M - 1}, \qquad \sum_{i=1}^{M} \theta_i = 1, \qquad C(\alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{M} \alpha_i\right)}{\prod_{i=1}^{M} \Gamma(\alpha_i)}$$

Suppose N trials are performed, where the number of outcomes taking the ith value is $\sum_{n=1}^{N} x_{in}$. Then, generalizing the above argument, the posterior distribution is given by

$$P(\theta \mid x) = C(\alpha')\, \theta_1^{\sum_{n=1}^{N} x_{1n} + \alpha_1 - 1}\, \theta_2^{\sum_{n=1}^{N} x_{2n} + \alpha_2 - 1} \cdots \theta_M^{\sum_{n=1}^{N} x_{Mn} + \alpha_M - 1}$$

Exercise: Prove that the normalizer C(α) has the above form.

Bayesian Univariate Gaussian Estimation

The Gaussian forms a conjugate prior with itself, so we can simply place a Gaussian prior on the unknown mean, whose mean is µ_0 and whose variance is fixed at τ².
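Before turning to the Gaussian case, a quick numerical check of the binomial prediction integral (my addition, not from the notes): under the uniform prior, integrating θ against the Beta(k+1, n−k+1) posterior gives (k+1)/(n+2), Laplace's rule of succession. For 3 heads in 3 tosses the predicted probability of another head is 4/5, not the MLE's 1:

```python
from math import gamma

def predictive_numeric(n, k, steps=200000):
    """Midpoint-rule integration of theta * Beta(k+1, n-k+1) posterior over (0, 1)."""
    norm = gamma(n + 2) / (gamma(k + 1) * gamma(n - k + 1))   # posterior normalizer
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += t * norm * t**k * (1 - t)**(n - k) * h
    return total

n, k = 3, 3
# Closed form (k + 1) / (n + 2) = 0.8 should match the numeric integral.
print(round(predictive_numeric(n, k), 3), (k + 1) / (n + 2))
```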
That is, the prior for the mean is now described by the distribution

$$P(\mu) = \frac{1}{(2\pi\tau^2)^{1/2}}\, e^{-\frac{1}{2\tau^2}(\mu - \mu_0)^2}$$

Given N IID samples, the posterior distribution P(µ | x) is proportional to the joint P(x, µ):

$$P(x, \mu) = \frac{1}{(2\pi\sigma^2)^{N/2}}\, e^{-\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2} \cdot \frac{1}{(2\pi\tau^2)^{1/2}}\, e^{-\frac{1}{2\tau^2}(\mu - \mu_0)^2}$$

This expression can be simplified to yield (see DHS, Chapter 3, or Jordan, Chapter 4, Appendix A)

$$P(\mu \mid x) = \frac{1}{(2\pi\hat{\sigma}^2)^{1/2}}\, e^{-\frac{1}{2\hat{\sigma}^2}(\mu - \hat{\mu})^2}$$

Hypothesis Testing and Bayes Decision Theory

Sridhar Mahadevan
mahadeva@cs.umass.edu
University of Massachusetts

Hypothesis Testing

The framework of hypothesis testing is applied to machine learning in many ways:

- If you fit a parametric model using class-conditional density functions, hypothesis testing tells you what type of test can optimally discriminate among classes.
- It forms the theoretical basis behind significance tests (sometimes called p-values) often used in experimental design to confirm that the difference between two methods is indeed significant.

For simplicity, consider the set of hypotheses described by the parameters underlying a distribution P(x | θ). A hypothesis is some subset of the space of parameters, e.g., µ = 0 or λ > 3.

Null Hypothesis

Hypothesis testing is usually framed in terms of a null hypothesis H_0 (to be accepted or rejected) and an alternative hypothesis H_1. A typical test involves deciding, for a particular data set, between

H_0: θ ∈ Θ_0 against H_1: θ ∈ Θ_1

Other types of tests include goodness-of-fit tests, where given a model (density) f_0(x | θ) and the (unknown) true density f(x | θ),

H_0: f = f_0 against H_1: f ≠ f_0

Another type of test compares one family against another:

H_0: f = f_0 against H_1: f = f_1

Critical Region

A hypothesis is simple if it is of the form θ = θ_0. A composite hypothesis is formed from simple ones, e.g., θ > θ_0.
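Returning for a moment to the Bayesian Gaussian estimation slide above: the posterior parameters µ̂ and σ̂² have a standard closed form (the one derived in DHS, Chapter 3). The sketch below is my addition; it illustrates that with a broad prior (large τ²) the posterior mean approaches the sample mean:

```python
from statistics import mean

def gaussian_posterior(xs, sigma2, mu0, tau2):
    """Posterior N(mu_hat, sigma2_hat) for the mean of a Gaussian with known
    variance sigma2, under a N(mu0, tau2) prior (standard conjugate result)."""
    n, xbar = len(xs), mean(xs)
    mu_hat = (n * tau2 * xbar + sigma2 * mu0) / (n * tau2 + sigma2)
    sigma2_hat = (sigma2 * tau2) / (n * tau2 + sigma2)
    return mu_hat, sigma2_hat

# Broad prior (tau2 = 100): the data dominate, so mu_hat is close to the
# sample mean, and the posterior variance shrinks well below the prior's.
mu_hat, s2_hat = gaussian_posterior([4.8, 5.1, 5.3], sigma2=1.0, mu0=0.0, tau2=100.0)
print(round(mu_hat, 3), round(s2_hat, 3))
```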
The critical region of a test is the set of all samples X = x_1, ..., x_n for which the null hypothesis H_0 is rejected. To find optimal tests, we need to define the error made by a test. There are two types of errors, Type I and Type II:

              Accept H_0    Reject H_0
H_0 is true   Correct       Type I
H_1 is true   Type II       Correct

Since Type I errors are often more serious than Type II errors, we decide on an acceptable level of Type I error, and then define the critical region C by minimizing the Type II error at that level. The probability of a Type I error, denoted by α, is called the size or significance level of a test.

Sufficient Statistics and Likelihood Ratio

We now show that likelihood ratios are always a function of the sufficient statistic of the underlying distribution. By the factorization theorem, if T(X) is a sufficient statistic for the distribution P(X | θ), then we know that f(X | θ) = g(T(X), θ) h(X). Hence

$$\frac{L_x(H_1)}{L_x(H_0)} = \frac{\sup_{\theta \in \Theta_1} f_X(x \mid \theta)}{\sup_{\theta \in \Theta_0} f_X(x \mid \theta)} = \frac{\sup_{\theta \in \Theta_1} g(T(x), \theta)\, h(x)}{\sup_{\theta \in \Theta_0} g(T(x), \theta)\, h(x)} = \frac{\sup_{\theta \in \Theta_1} g(T(x), \theta)}{\sup_{\theta \in \Theta_0} g(T(x), \theta)} = \frac{L_{T(X)}(H_1)}{L_{T(X)}(H_0)}$$

Neyman Pearson Framework

Theorem: Assume that the null hypothesis H_0: f = f_0 is to be tested against H_1: f = f_1, where f_0 and f_1 are positive continuous PDFs defined on the same region. Then, among all tests of size ≤ α, the test with the smallest probability of Type II error is given by the critical region

$$C = \left\{ x : \frac{f_1(x)}{f_0(x)} > k \right\}$$

where k is determined by

$$\alpha = P(X \in C \mid H_0) = \int_C f_0(x \mid \theta)\, dx$$

Proof: See Weiss, Chapter 6, or Casella and Berger, Chapter 8.

Essentially, the Neyman Pearson theorem tells us that likelihood ratio tests are optimal, and in practice most if not all of modern empirical hypothesis testing is based on this theorem. Since likelihood ratios are functions of a sufficient statistic, in practice it is often much simpler to think of the test as a function of a sufficient statistic.
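For a Gaussian mean with known σ, the size-α cutoff on the sufficient statistic x̄ comes directly from the normal quantile function. A minimal sketch (my addition), using the stdlib's `statistics.NormalDist`:

```python
from statistics import NormalDist

def np_cutoff(mu0, sigma, n, alpha):
    """Cutoff c such that rejecting H0 when xbar > c gives a size-alpha test,
    assuming xbar ~ N(mu0, sigma^2 / n) under H0."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)     # upper-alpha point of N(0, 1)
    return mu0 + z_alpha * sigma / n**0.5

# With mu0 = 5, sigma = 1, n = 4, alpha = 0.05: c = 5 + 1.645 / 2.
c = np_cutoff(mu0=5.0, sigma=1.0, n=4, alpha=0.05)
print(round(c, 3))
```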
For example, if H_0: µ = µ_0 versus H_1: µ = µ_1, the Neyman Pearson likelihood ratio test is essentially to reject H_0 if the sample mean x̄ > c, where P(X̄ > c | H_0) = α.

Example

Suppose H_0: µ = µ_0 and H_1: µ = µ_1. We know that the sample mean x̄ ~ N(µ_0, σ²/n) under the null hypothesis H_0. Applying the standard transformation,

$$Z = \frac{\sqrt{n}\,(\bar{X} - \mu_0)}{\sigma} \sim N(0, 1)$$

Thus, a test of size α has to reject H_0 if z > z_α, where z_α = Φ^{-1}(1 − α), i.e., the point such that P(N(0, 1) > z_α) = α.

Suppose X = {5.1, 5.5, 4.9, 5.3} ~ N(µ, σ²), where σ² = 1 is known but µ is unknown. Assume we want a test with significance α = 0.05 of H_0: µ = 5 against H_1: µ = 6. Then, since x̄ = 5.2,

$$z = \frac{2\,(5.2 - 5)}{1} = 0.4$$

and 0.4 < 1.645, where P(N(0, 1) > 1.645) = 0.05, so H_0 is not rejected.

©Sridhar Mahadevan: CMPSCI 689 – p.11/13
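The numbers in this example can be reproduced in a few lines (code is my addition, not part of the notes):

```python
from statistics import NormalDist, mean

# The data and test settings from the example above.
data = [5.1, 5.5, 4.9, 5.3]
mu0, sigma, alpha = 5.0, 1.0, 0.05

xbar = mean(data)                                  # sample mean, 5.2
z = len(data)**0.5 * (xbar - mu0) / sigma          # sqrt(4) * 0.2 / 1 = 0.4
z_alpha = NormalDist().inv_cdf(1 - alpha)          # about 1.645
reject = z > z_alpha                               # 0.4 < 1.645, so H0 stands

print(round(z, 3), round(z_alpha, 3), reject)
```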