Chapter 3: Estimation of p

3.1 Point and Interval Estimates of p

Suppose that we have Bernoulli Trials (BT). So far, in every example I have told you the (numerical) value of p. In science, usually the value of p is unknown to the researcher. In such cases, scientists and statisticians use data from the BT to estimate the value of p. Note that the word estimate is a technical term that has a precise definition in this course. I don't particularly like the choice of the word estimate for what we do, but I am not the tsar of the Statistics world!

It will be very convenient for your learning if we distinguish between two creatures. First is Nature, who knows everything and, in particular, knows the value of p. Second is the researcher, who is ignorant of the value of p.

Here is the idea. A researcher plans to observe n BT, but does not know the value of p. After the BT have been observed, the researcher will use the information obtained to make a statement about what p might be. After observing the BT, the researcher counts the number of successes, x, in the n BT. We define p̂ = x/n, the proportion of successes in the sample, to be the point estimate of p. For example, if I observe n = 20 BT and count x = 13 successes, then my point estimate of p is p̂ = 13/20 = 0.65.

It is trivially easy to calculate p̂ = x/n; thus, based on your experiences in previous math courses, you might expect that we will move along to the next topic. But we won't. What we do in a Statistics course is evaluate the behavior of our procedure. What does this mean? Statisticians evaluate procedures by seeing how they perform in the long run. We say that the point estimate p̂ is correct if, and only if, p̂ = p. Obviously, any honest researcher wants the point estimate to be correct.

Let's go back to the example of a researcher who observes 13 successes in 20 BT and calculates p̂ = 13/20 = 0.65. The researcher schedules a press conference and the following exchange is recorded.

• Researcher: I know that all Americans are curious about the value of p. I am here today to announce that, based on my incredible effort, wisdom and brilliance, I estimate p to be 0.65.

• Reporter: Great, but what is the actual value of p? Are you saying that p = 0.65?

• Researcher: Well, I don't actually know what p is, but I certainly hope that it equals 0.65. As I have stated many times, nobody is better than I at obtaining correct point estimates.

• Reporter: Granted, but is anybody worse than you at obtaining correct point estimates?

• Researcher: (Mumbling) Well, no. You see, the problem is that only Nature knows the actual value of p. No mere researcher will ever know it.

• Reporter: Then why are we here?

Before we follow the reporter's suggestion and give up, let's see what we can learn. Let's bring Nature into the analysis. Suppose that Nature knows that p = 0.75. Well, Nature knows that the researcher in the above press conference has an incorrect point estimate. But let's proceed beyond that one example. Consider a researcher who decides to observe n = 20 BT and use them to estimate p. What will happen? Well, we don't know what will happen. The researcher might observe x = 15 successes, giving p̂ = 15/20 = 0.75, which would be a correct point estimate. Sadly, of course, the researcher would not know it is correct; only Nature would.

Given what we were doing in Chapters 1 and 2, it occurs to us to calculate a probability. After all, we use probabilities to quantify uncertainty. So, before the researcher observes the 20 BT, Nature decides to calculate the probability that the point estimate will be correct. This probability is, of course,

$$P(X = 15) = \frac{20!}{15!\,5!}(0.75)^{15}(0.25)^{5},$$

which I find, with the help of the binomial website, to be 0.2023.

There are two rather obvious undesirable features to this answer.

1. Only Nature knows whether the point estimate is correct; indeed, before the data are collected, only Nature can calculate the probability that the point estimate will be correct.

2. The probability that the point estimate will be correct is disappointingly small. (And note that for most values of p, it is impossible for the point estimate to be correct. For one of countless possible examples, suppose that n = 20, as in the current discussion, and p = 0.43. It is impossible to obtain p̂ = 0.43.)

As we shall see repeatedly in this course, what often happens is that by collecting more data our procedure becomes 'better' in some way. Thus, suppose that the researcher plans to observe n = 100 BT, with p still equal to 0.75. The probability that the point estimate will be correct is

$$P(X = 75) = \frac{100!}{75!\,25!}(0.75)^{75}(0.25)^{25},$$

which I find, with the help of the website, to be 0.0918. This is very upsetting! More data makes the probability of a correct point estimate smaller, not larger.
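Both of these binomial probabilities are easy to reproduce by computer. Here is a minimal sketch in Python, with scipy's binomial pmf standing in for the binomial website (scipy is my choice of tool here, not something the text specifies):

```python
# Probability that the point estimate is exactly correct: P(X = k)
# for X ~ Bin(n, p). scipy stands in for "the binomial website".
from scipy.stats import binom

print(binom.pmf(15, 20, 0.75))   # n = 20:  about 0.2023
print(binom.pmf(75, 100, 0.75))  # n = 100: about 0.0918
```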
The difficulty lies in our desire to have p̂ be exactly correct. Close is good too. In fact, statisticians like to say,

[...]

No! The value of p is the rational number 1,677,211 divided by 2,939,604, which as a decimal is 0.570556782... And I apologize for not writing this decimal until it repeats, but this is the size of the display on my calculator and I have other work I must do.

Personally, and this is clearly a value judgment that you don't need to agree with, 0.571 is precise enough for me: Obama received 57.1% of the votes. If I am feeling particularly casual, I would be happy with 0.57. I would never be happy, in an election, to round to one digit, in this case 0.6, because for so many elections rounding to one digit will give 0.5 for each candidate, which is not very helpful! (Of course, sometimes we must focus on total votes, not proportions. For example, in the 2008 Minnesota election for U.S. Senator, Franken beat Coleman by a small number of votes. The last number I heard was that Franken had 312 more votes out of nearly 3 million cast. So yes, to three digits, each man received 50.0% of the votes.)

For p close to 0 (remember, we don't let it be close to 1), usually we want much more precision than simply the nearest 0.001. At the time of this writing, there is a great deal of concern about the severity with which the H1N1 virus will hit the world during 2009–10. Let p be the proportion of, say, Americans who die from it. Now, if p equals one in 3 million, about 100 Americans will die, but if it equals one in 3,000, about 100,000 Americans will die. To the nearest 0.001, both of these p's are 0.000. Clearly, more precision than the nearest 0.001 is needed if p is close to 0.

3.2 The Approximate 95% Confidence Interval for p

In this section we learn about a particular kind of interval estimate of p, which is called the confidence interval (CI) estimate. I will first give you the confidence interval formula and then derive it for you. Remember, first and foremost, a confidence interval is a closed interval.

An interval is determined by its two endpoints, which we will denote by l for the lower (smaller) endpoint and u for the upper (larger) endpoint. Thus, I need to give you the formulas for l and u. They are:

$$l = \hat{p} - 1.96\sqrt{\hat{p}\hat{q}/n} \quad \text{and} \quad u = \hat{p} + 1.96\sqrt{\hat{p}\hat{q}/n}.$$

If you note the similarity of these equations and recall the prevalence of laziness in math, you won't be surprised to learn that we usually combine these into one expression for the 95% confidence interval for p:

$$\hat{p} \pm 1.96\sqrt{\hat{p}\hat{q}/n}.$$

We often write this as p̂ ± h, with $h = 1.96\sqrt{\hat{p}\hat{q}/n}$ called the half-width of the 95% CI for p.
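Before the derivation, it may help to see the formula as computation. Here is a minimal sketch in Python; the helper name ci_95 is mine, not the text's, and the data are the running example of 13 successes in 20 BT (whether n = 20 is large enough for the approximation to be trustworthy is taken up below):

```python
import math

def ci_95(x, n):
    """Approximate 95% CI for p from x successes in n Bernoulli Trials."""
    p_hat = x / n
    q_hat = 1 - p_hat
    h = 1.96 * math.sqrt(p_hat * q_hat / n)  # half-width of the interval
    return p_hat - h, p_hat + h              # (l, u)

l, u = ci_95(13, 20)
print(round(l, 3), round(u, 3))  # about 0.441 and 0.859
```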
I will now provide a brief mathematical justification of our formula. As discussed in Chapter 2, if X ∼ Bin(n, p), then probabilities for

$$Z = \frac{X - np}{\sqrt{npq}}$$

can be well approximated by the standard normal curve (snc), provided n is reasonably large and p is not too close to 0. It turns out that for the goal of interval estimation, the unknown p (and q = 1 − p) in the denominator of Z creates a major difficulty. Thanks, however, to an important result of Eugen Slutsky (1925), called Slutsky's Theorem, probabilities for

$$Z' = \frac{X - np}{\sqrt{n\hat{p}\hat{q}}}$$

can be well approximated by the snc, provided n is reasonably large, p is not too close to 0 and 0 < p̂ < 1 (we don't want to divide by 0!). Note that Z′ is obtained by replacing the unknown p and q in the denominator of Z with the values p̂ and q̂, which will be known once the data are collected.

Here is the derivation. Suppose that we want to calculate P(−1.96 ≤ Z′ ≤ 1.96). Because of Slutsky's result, we can approximate this by the area under the snc between −1.96 and 1.96. Using the website, you can verify that this area equals 0.95. Next, dividing the numerator and denominator of Z′ by n gives

$$Z' = \frac{\hat{p} - p}{\sqrt{\hat{p}\hat{q}/n}}.$$

Thus, −1.96 ≤ Z′ ≤ 1.96 becomes

$$-1.96 \le \frac{\hat{p} - p}{\sqrt{\hat{p}\hat{q}/n}} \le 1.96;$$

rearranging terms, this last inequality becomes

$$\hat{p} - 1.96\sqrt{\hat{p}\hat{q}/n} \le p \le \hat{p} + 1.96\sqrt{\hat{p}\hat{q}/n}.$$

Examine this last expression. In terms of my definitions at the beginning of this section, it is l ≤ p ≤ u. Thus, we have shown that, before we collect data, the probability that we will obtain a correct confidence interval estimate is (approximately) 95%, and that this is true for all values of p! This is a great result. The only concern is whether the approximation is good. I will do a few examples to investigate this question.

Suppose that a researcher decides to observe n = 200 BT and plans to compute the above 95% confidence interval for p. Is the approximation any good? Well, to answer this question we must bring Nature into the argument. To investigate the quality of the approximation we need not only to specify n, which I have done, but also p. So suppose that p = 0.40. We note that the interval will be correct if, and only if, it contains p = 0.40. That is,

$$\hat{p} - 1.96\sqrt{\hat{p}\hat{q}/200} \le 0.40 \le \hat{p} + 1.96\sqrt{\hat{p}\hat{q}/200}.$$

After some algebra, it follows that l ≤ 0.400 corresponds to p̂ ≤ 0.470, and u ≥ 0.400 corresponds to p̂ ≥ 0.340. Remembering that p̂ = x/200, we conclude that the confidence interval will be correct if, and only if, 68 ≤ X ≤ 94, where probabilities for X are given by the Bin(200, 0.40). With the help of the binomial website, this probability is found to be 0.9466. Not ideal (I would prefer 0.9500), but a reasonably good approximation.

I will repeat the above example for the same n = 200, but for a p that is closer to 0, say p = 0.10. In this case, by algebra, the confidence interval is correct if, and only if, 15 ≤ X ≤ 30. The probability of this event is 0.8976, which is not very close to the desired 95%.

For one last example, suppose that n = 200 and p = 0.01. The interval is correct if, and only if, 1 ≤ X ≤ 8. The probability of this event is 0.8658, which is a really bad approximation to 0.9500.

We have seen that for n = 200, if p is close to 0, the 95% in the 95% confidence interval is not a very good approximation to the exact probability that the interval will be correct.
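These three coverage probabilities can be reproduced with a few lines of Python (again with scipy standing in for the binomial website), using the x-ranges derived above:

```python
from scipy.stats import binom

def prob_between(a, b, n, p):
    """P(a <= X <= b) for X ~ Bin(n, p)."""
    return binom.cdf(b, n, p) - binom.cdf(a - 1, n, p)

# The three cases worked out in the text, all with n = 200:
print(prob_between(68, 94, 200, 0.40))  # about 0.9466
print(prob_between(15, 30, 200, 0.10))  # about 0.8976
print(prob_between(1, 8, 200, 0.01))    # about 0.8658
```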
We will deal with that issue soon, but first I want to generalize the above result from 95% to other confidence levels.

3.2.1 Other Confidence Levels and One-sided Intervals

The 95% confidence level is very popular with statisticians and scientists, but it is not the only possibility. You could choose any level you want, provided that it is above 0% and below 100%. There are six levels that are most popular, and we will restrict attention to those in this class. They are: 80%, 90%, 95%, 98%, 99% and 99.73%.

Consider again our derivation of the 95% confidence interval. The choice of 95% for the level led to 1.96 appearing in the formula, but otherwise had absolutely no impact on the algebra or probability theory used. Thus, for any other level, we just need to determine what number to use in place of 1.96. For example, for 90% we need to find a positive number, let's call it z, so that the area under the snc between −z and +z is 90%. It can be shown that z = 1.645 is the answer. Thus, to summarize: the 90% confidence interval for p is

$$\hat{p} \pm 1.645\sqrt{\hat{p}\hat{q}/n}.$$

Extending these ideas, we get the following result. The (two-sided) confidence interval for p is given by

$$\hat{p} \pm z\sqrt{\hat{p}\hat{q}/n}.$$

In this formula, the number z is determined by the desired confidence level, as given in the following table.

Confidence Level:  80%    90%    95%    98%    99%    99.73%
z:                 1.282  1.645  1.960  2.326  2.576  3.000

Thus, for example, $\hat{p} \pm 2.576\sqrt{\hat{p}\hat{q}/n}$ is the 99% confidence interval for p.
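The z values in this table come from the standard normal quantile function: for a two-sided level, the area outside (−z, +z) is split evenly between the two tails. A minimal check in Python (scipy assumed):

```python
from scipy.stats import norm

# For a two-sided confidence level, z = norm.ppf((1 + level) / 2),
# since half of the leftover area sits in each tail of the snc.
for level in (0.80, 0.90, 0.95, 0.98, 0.99, 0.9973):
    print(f"{level:.2%}: z = {norm.ppf((1 + level) / 2):.3f}")
```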