Hazards of Using Statistical Tests: Cautions on Confidence Intervals and Hypothesis Tests

This document, from a Math 243 lecture file, discusses the hazards of using statistical tests, specifically confidence intervals and hypothesis tests. The author cautions about treating data as a simple random sample, about non-normality, and about the need to know the population standard deviation. The document also covers cautions specific to confidence intervals and hypothesis tests, such as nonresponse and undercoverage, and the interpretation of P-values.


Math 243: Lecture File 12
N. Christopher Phillips
7 May 2009

Hazards of using statistical tests

We will discuss things to watch out for when using z confidence intervals and z hypothesis tests. Most of these hazards and warnings apply to most or all of the procedures we will encounter (one and two sample t procedures and one and two proportion z procedures, for both confidence intervals and hypothesis tests). These other procedures also have their own hazards and warnings, but for now we discuss the ones which apply broadly. (Note: I have included at the appropriate place here the so far unused material from lecture file 10.)

Summary

These apply to both confidence intervals and hypothesis tests:

- The data must have been properly collected: simple random sample condition.
- The data must have been properly collected: proper design. More complicated experimental designs (such as stratified random sampling) require other procedures.
- The distribution is supposed to be normal. (This is special to the z procedures.)
- You must know σ.

Summary (continued)

This one applies only to confidence intervals:

- The margin of error only covers errors due to randomness in sampling.

These apply only to hypothesis tests:

- The P-value only covers errors due to randomness in sampling.
- You need judgement to decide how small a P-value is convincing. (In different language: how do you choose α?) Beware of hard and fast cutoffs on P-values.
- Statistically significant doesn't mean important.
- If you run many tests, some of them will improperly reject the null hypothesis merely by chance. (This is a very important point that people seem to have trouble with.)
- Formulate your hypotheses before you do your experiment.

The data must have been properly collected: simple random sample condition

The data must come from a simple random sample, or at least it must be reasonable to treat the sample as if it were a simple random sample. In some cases we really have a simple random sample, and this condition is met. In other cases, we don't have a simple random sample (often because there is no way to choose one), but we hope that our sample can be treated as one. Knowing when this is the case depends on specific knowledge of the subject area.

Examples

Example 16.1 of the book: Students from an introductory psychology class can probably be safely treated as a simple random sample of young people for the purposes of experiments on vision, but students from an introductory sociology class can certainly not be safely treated as a simple random sample of young people for the purposes of attitude surveys.

A collection of laboratory rats chosen in no particular way is often treated as a simple random sample of that strain of rat. What one can conclude about wild rats is more questionable.

More examples

Repeated measurements can often be safely treated as a simple random sample from all possible measurements.
See the book's comment on trained tasters (under "Cautions about the z procedures"). There is nothing to make them unrepresentative.

If the data were gathered from a convenience sample or a voluntary response sample, the z procedures are useless. No statistical procedure can compensate for such mistakes in data collection.

Exceptions

For example, the results of the excite.com online poll don't tell you anything about the population of US adults, or even anything about the population of frequent visitors to the excite.com website. (This is a voluntary response sample. There is no reason to think that the people who click on the poll are even representative of the people who frequently visit the site.)

Things may change if the population of interest is people who respond to online polls. (However, in this case, those who respond specifically to the excite.com online poll constitute a convenience sample.)

Knowing the standard deviation

We will remove this (usually unrealistic) condition in the next chapters, when we consider the t procedures. There are occasional situations in which σ is known. For example, you use a scientific instrument with known characteristics to make repeated measurements on a specimen for which the quantity being measured is unknown.

Specific caution for confidence intervals

The margin of error covers only errors due to the randomness in the choice of a simple random sample. It does not cover errors due to failure to obtain a simple random sample in the first place. For example, in opinion surveys, it does not cover errors due to:

- Nonresponse.
- Undercoverage (such as omitting people who don't have telephones or drivers' licenses, or omitting Alaska and Hawaii).
- Bias in the question (even if unintentional).
- Etc.

These are often more serious than the kind of error that is covered.

Specific cautions for hypothesis tests

The P-value only covers errors due to randomness in sampling. This is similar to the issues for confidence intervals above. For hypothesis tests, these issues damage the P-value instead of the margin of error.

How small a P-value is convincing?

The general rule is: the more implausible you (or your intended audience) consider the alternative hypothesis to be, the smaller P has to be to convince. This means that you should choose the significance level α to be smaller. Restated: to convince, you must persuade someone that it is more likely that the effect is real than that you chose, by bad luck, a highly unrepresentative sample. The more implausible your conclusion, the smaller the P-value you need.

Recall how it works

We test the null hypothesis, for example µ = 64, against an alternative hypothesis, say µ > 64. We choose a simple random sample and compute x̄. The P-value is the probability that, if the null hypothesis is true, a simple random sample gives a value of x̄ which is as extreme (here, as large) as the one we got.

H0: µ = 64. Ha: µ > 64.

Example: I know σ = 2.7, and a simple random sample of size 9 gives x̄ = 66.7. We have

z = (x̄ − µ0) / (σ/√9) = (66.7 − 64) / (2.7/3) = 3.

Looking up in Table A, we find that the probability of having z ≥ 3 is 1 − 0.9987 = 0.0013, so the P-value is 0.13%. This means that, if the null hypothesis is true, the probability of getting x̄ ≥ 66.7 is 0.13%.
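(Supplement, not from the original slides.) A minimal Python sketch of the same computation, using only the standard library; NormalDist carries a few more decimal places than Table A, so the result rounds to the same 0.0013.

```python
# One-sided z test of H0: mu = 64 against Ha: mu > 64, with sigma known.
from statistics import NormalDist

mu0 = 64.0    # mean under the null hypothesis
sigma = 2.7   # known population standard deviation
n = 9         # sample size
xbar = 66.7   # observed sample mean

z = (xbar - mu0) / (sigma / n ** 0.5)  # (66.7 - 64) / (2.7 / 3) = 3.0
p_value = 1 - NormalDist().cdf(z)      # P(Z >= 3), about 0.0013

print(f"z = {z:.2f}, P-value = {p_value:.4f}")  # z = 3.00, P-value = 0.0013
```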
Be careful with the interpretation!

If the null hypothesis is true, the probability of getting x̄ ≥ 66.7 is 0.13%. This does not mean that there is a 99.87% probability that the null hypothesis is wrong! How convincing this evidence is depends, among other things, on how plausible the alternative hypothesis is. If it is (subjectively) unlikely, a smaller P-value is needed to persuade people to believe it.

The logic behind hypothesis testing: an example with coins

Suppose we have a large collection of coins, some of which are ordinary fair coins and some of which have tails on both sides. I choose one of these coins, flip it 10 times, and report the result: it came up tails every time. For a fair coin, the probability that 10 tosses come out all tails is 1/1024, or a bit less than 0.1%. Since this outcome is unlikely for a fair coin, we interpret it as evidence that the coin has tails on both sides.

The coin example is "hard": if even one toss is heads, the coin can't have tails on both sides. For a "soft" example, which is in some ways more realistic, see the discussion of free throw percentage at the beginning of Chapter 14. The coin example has the advantage of allowing some simple explicit probability calculations, which allow one to give explicit examples for "how small a P-value is convincing". (I will only report the results of the calculations here.)

If half the coins have tails on both sides

Suppose we choose a coin at random from a bag in which half the coins are fair and half have tails on both sides. We flip the coin 10 times, and get tails every time. This outcome has a P-value of 1/1024 ≈ 0.000977, or a bit less than 0.1%.

The actual probability that the coin has tails on both sides is 1024/1025 ≈ 0.9990. That is, if we do this experiment a very large number of times, then in about 99.90% of all the cases in which we get 10 tails, the coin in fact has tails on both sides. The alternative hypothesis is quite plausible: there is a 50% chance that a randomly chosen coin has tails on both sides. Since the alternative hypothesis is plausible, the P-value of about 0.001 is strong evidence that the null hypothesis is false.
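(Supplement, not from the original slides.) The slides report these posterior probabilities without showing the computation, so here is a small Python sketch of the Bayes' rule calculation; changing the prior fraction of two-tailed coins reproduces the scenarios on the following slides.

```python
# Posterior probability that the coin is two-tailed, given 10 tails in a row.
from fractions import Fraction

def posterior_two_tailed(prior, n_flips=10):
    """P(two-tailed | all n_flips come up tails), for a given prior
    fraction of two-tailed coins in the bag."""
    p_tails_if_two_tailed = Fraction(1)          # a two-tailed coin is always tails
    p_tails_if_fair = Fraction(1, 2) ** n_flips  # (1/2)^10 = 1/1024
    numerator = prior * p_tails_if_two_tailed
    return numerator / (numerator + (1 - prior) * p_tails_if_fair)

print(posterior_two_tailed(Fraction(1, 2)))          # 1024/1025 (half the coins)
print(posterior_two_tailed(Fraction(1, 1025)))       # 1/2 (one coin in 1025)
print(posterior_two_tailed(Fraction(1, 1 + 2**20)))  # 1/1025 (one in a million)
```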
If one in a thousand coins has tails on both sides

Now suppose we choose a coin at random from a bag containing 1024 fair coins and one coin with tails on both sides. Again we flip the coin 10 times, and get tails every time. This outcome has the same P-value as before, a bit less than 0.1%.

The actual probability that the coin has tails on both sides is now 1/2. That is, if we do this experiment a very large number of times, then in about half of all the cases in which we get 10 tails, the coin is actually fair! The difference is that the alternative hypothesis is now rather implausible: there is a 1/1025, or less than 0.1%, chance that a randomly chosen coin has tails on both sides. Since the alternative hypothesis is not very plausible, the P-value of about 0.001 is only weak evidence that the null hypothesis is false.

If one in a million coins has tails on both sides

Next, suppose we choose a coin at random from a bag containing 2^20 (a bit over a million) fair coins and one coin with tails on both sides. Again we flip the coin 10 times, and get tails every time. This outcome still has the same P-value as before, a bit less than 0.1%.

The actual probability that the coin has tails on both sides is now only 1/1025. That is, if we do this experiment a very large number of times, then in almost all (more than 999 out of 1000) cases in which we get 10 tails, the coin is nevertheless actually fair! In this case, the alternative hypothesis is very implausible: there is a 1/(1 + 2^20) chance, less than one in a million, that a randomly chosen coin has tails on both sides. Since the alternative hypothesis is so implausible, even a P-value of less than 0.1% has little value as evidence.

If no coins have tails on both sides

Finally, suppose we choose a coin at random from a bag containing many coins, all of which are fair. Once again we flip the coin 10 times, and get tails every time. This outcome still has the same P-value, a bit less than 0.1%. The actual probability that the coin has tails on both sides is now zero, no matter what the P-value is, because there were no such coins in the bag.

If you run many tests, some of them will improperly reject the null hypothesis merely by chance

Example: I have a large bag of coins. Perhaps some of them have tails on both sides, but you have no idea how many. I choose 2000 coins at random from this bag, flip each one 10 times, and find that two or three of them come up tails on every flip. I claim (with P = 1/1024 < 0.001) that this is strong evidence that those two or three coins have tails on both sides. Do you believe me?

You shouldn't. In the long run, you expect on average nearly one out of a thousand fair coins to come up tails all 10 times, so a few all-tails runs among 2000 coins are unsurprising even if every coin is fair. If I select one of the coins that came up tails 10 times, flip it another 10 times, and it again comes up tails every time, then I have evidence that the coin has tails on both sides. This time, I formulated the hypotheses before the experiment on this coin.
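(Supplement, not from the original slides.) A rough simulation sketch of this point: even when every coin is fair, we expect about 2000/1024 ≈ 2 all-tails runs.

```python
# Flip 2000 fair coins 10 times each and count how many come up
# all tails purely by chance; the expected count is 2000/1024, about 2.
import random

random.seed(0)  # arbitrary seed, just to make one run reproducible
n_coins, n_flips = 2000, 10

all_tails = sum(
    all(random.random() < 0.5 for _ in range(n_flips))  # True if this coin is all tails
    for _ in range(n_coins)
)
print(f"{all_tails} of {n_coins} fair coins came up tails {n_flips} times in a row")
```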
Multiple analyses: The New York Times gets it wrong

Similarly, if you run a large number of tests at the (common) significance level α = 0.05 (or 5%), you expect that in about one in 20 of the cases in which the null hypothesis is true, you will reject it at significance 0.05. This is what significance 0.05 means: there is a 5% probability that a simple random sample from a population for which the null hypothesis is true will give a result as extreme as the one observed.

See Example 16.4 in the book. Somebody ran 20 tests (for association of cell phone use with 20 kinds of brain cancer), and found that one of them was significant at α = 0.05. The New York Times claimed to be puzzled that for this kind of cancer, "the risk appeared to decrease . . . with greater mobile phone use".

As the book says, "Running one test and reaching the α = 0.05 level is reasonably good evidence that you have found something; running 20 tests and reaching that level only once is not." Even if you run, say, 15 or 16 tests and reach the α = 0.05 level once, this isn't good evidence of anything. If you really think there might be an association with that form of brain cancer, you must start with a completely independent set of data and test that one hypothesis. Also see the margin item "Honest hypotheses?" on page 366 of the book. (This is also an example of why the hypotheses must be formulated ahead of time.)

Formulate your hypotheses before you do your experiment

Example: Prof. Greenbottle chooses a coin, flips it 10 times, and records the sequence of results, say TTHTTHHHTT. For a fair coin, the probability of getting exactly this sequence is 1/1024 = 1/2^10. He claims that the outcome is strong evidence that the coin is biased to produce this particular result, with P = 1/1024 < 0.001. This claim is nonsense. Generally, any claim formulated after the experiment is done is suspect. (There is some overlap with the issue of multiple analyses.)

Read Chapter 16!

Read Chapter 16 carefully.