Confidence Intervals in Statistical Sampling - Prof. Jack Morse, Study notes of Statistics

These notes cover the concept of confidence intervals in statistical sampling, including the difference between stratified and cluster sampling, the construction of confidence intervals for population means and proportions, and the use of StatCrunch to calculate confidence intervals. They also cover the t-distribution and the formulas for determining the sample size needed to estimate population means and proportions.

Typology: Study notes

2011/2012

Uploaded on 02/05/2012

tiaatuga

Chapter Four – Gathering Data

4.1 Should we experiment or observe?

There are two basic ways to gather data:
1. Observational Study
2. Experiment

Difference between experiments and observational studies:

Experiments attempt to manipulate or influence the subjects. Properly designed experiments can be used to establish causation (that one variable CAUSES the other to change). In an experiment, subjects can be randomly assigned to groups.

Observational studies simply measure characteristics of the subjects without attempting to manipulate or influence the subjects. Observational studies cannot be used to establish causation; they can only show that the variables are related to one another. In an observational study, subjects cannot be randomly assigned to groups.

The advantages of an experiment are that you can establish causation and apply randomization, while an observational study can only show an association between two variables, not causation.

Page 1 of 54

Example: Decide whether the following are experiments or observational studies:

1. Rats with cancer are divided into 2 groups. One group receives 5 mg a day of an experimental drug that is thought to fight cancer; the other group receives 10 mg a day of the same drug. After 2 years, the spread of cancer is measured in both groups.
Subjects?
Experiment or observational study?

2. A poll is conducted in which 500 people are asked whom they plan to vote for in the upcoming election.
Subjects?
Experiment or observational study?

Page 2 of 54

4.3-4.4 Experiments

An experiment is a controlled study in which one or more treatments are applied to experimental units. The experimenter then observes the effect of varying these treatments on a response variable.

A lot of new terms were used in the above definition. Let’s now define these terms.
1) experimental unit (or subject): the person, object, or some other well-defined item upon which a treatment is applied.
2) treatment: a condition applied to the experimental unit (e.g., a new drug is administered to patients).
3) response variable: a quantitative or categorical variable that represents our variable of interest.

The goal in an experiment is to determine the effect the treatment has on the response variable.

Page 5 of 54

Think back to our example with the rats. The experimental units are the rats with cancer. The explanatory variable is the amount of the drug, and there are two levels of treatment: 5 mg and 10 mg. The response variable is the spread of cancer for each rat.

Many designed experiments are double blind. An experiment is double blind if neither the experimental unit nor the experimenter knows what treatment is being given to the experimental unit.

For example, in many medical experiments that test the effects of new drugs, researchers administer to each patient either a dose of the new drug OR a placebo. A placebo is a “dummy” treatment (in this case, a fake pill), so that the patient does not know if they are getting the real medication or not. If the experiment is double blind, the researchers will not know which patients are getting the real medication, and the patients also will not know if they are getting the real medication. If the researcher does know, but the patient does not, then it is a single blind experiment.

Page 6 of 54

Experimental Designs:

Completely randomized design
The experimental units are randomly assigned to the treatments.

Matched-Pairs Design
A matched-pairs design is one in which the experimental units are somehow related or matched before the experiment takes place. For example: the same person before and after a treatment, twins, husband and wife, etc.

Example: One twin receives some treatment and the other twin receives some other treatment.
Not only can we measure the overall groups that received different treatments, we can also look at the difference in the results for each matched pair of twins.

Often, the measure of a response variable for an experimental unit is taken before a treatment is applied and then a measure is taken from the same unit after the treatment. Here, the individual is matched against itself.

Page 7 of 54

Example: A pharmaceutical company has developed an experimental drug meant to cure a deadly disease. The company randomly selects 300 males aged 25-29 years old with the disease and randomly divides them into two groups. Group 1 is given the experimental drug, while Group 2 is given a placebo. After one year of treatment, the white blood cell count for each male is recorded.
a) Identify the experimental units.
b) What is the response variable in this experiment?
c) What is the explanatory variable? What are the two levels of treatment?
d) Is the experiment design completely randomized?
e) Does this experiment use matched-pairs design?
f) If the researcher knows which patients are getting which drugs, is this a double or single blind experiment?

Page 10 of 54

Example: Researchers at UGA wished to determine whether there was a connection between listening to classical music and reasoning skills. To test the research question, 36 college students listened to one of Mozart’s sonatas for 10 minutes and then took a reasoning test using the Stanford-Binet intelligence scale. The same students were also administered the test after sitting in a room for 10 minutes in complete silence. The mean score on the test following the Mozart piece was 119, while the mean test score following the silence was 110. The researchers concluded that subjects performed better on the reasoning tests after listening to Mozart.
a) What is the response variable in the experiment?
b) What is the explanatory variable? Describe each level of treatment.
c) Does this experiment use matched-pairs design?
Page 11 of 54

d) Does this experiment use crossover design?

Let’s take a look at one more example to make sure we understand matched-pairs design:

Let’s say we administer two tests to students and want to compare grades on the two tests. We want to compare the mean of 20 Test A scores versus the mean of 20 Test B scores.

If we were to select 20 people and compare their Test A and Test B scores, then this would be an example of a matched-pairs design. We would be matching the Test A scores against the same students’ Test B scores. The experimental units in each group are related because they are the same students in each group. We could not only compare the Test A scores to the Test B scores as a group, but we could also compare them for each individual.

However, if we were to select 40 people and compare 20 students’ Test A scores versus 20 different students’ Test B scores, then this would NOT be a matched-pairs design, because there is no relation between one student’s Test A score and a different student’s Test B score.

Page 12 of 54

Identify the four who will take the herbal treatment. (List in numerical order.) ____, ____, ____, ____

What are different ways to sample?

Types of Sampling:
1. Simple Random Sampling – every subject has an equal chance of being selected for the sample. Usually, samples are chosen using a random number table.
2. Stratified Sampling – the population is divided into non-overlapping groups (called strata) and a simple random sample is then obtained from each group.
3. Cluster Sampling – the population is divided into non-overlapping groups and all individuals within a randomly selected group or groups are sampled.
4. Convenience Sampling – sampling where the individuals are easily obtained. Internet surveys are convenience samples. Studies that use convenience sampling generally have results that are suspect.
5. Systematic Sampling – selecting every kth subject from the population.
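The sampling types listed above can be sketched in a few lines of Python. This is only an illustration, not part of the course materials (which use a random number table and StatCrunch): the function names, the made-up strata and school lists, and the fixed random seed are all assumptions chosen for the example.

```python
import random

def simple_random_sample(population, n, rng):
    # Every subject has the same chance of being chosen.
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    # A simple random sample is drawn from EVERY stratum.
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    # Whole clusters are chosen at random; ALL members of each are kept.
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

def systematic_sample(population, k, start):
    # Every k-th subject, beginning at 1-based position `start`.
    return population[start - 1::k]

rng = random.Random(0)  # fixed seed (an assumption) so the run is repeatable

# Hypothetical households grouped into three income classes (strata).
strata = {
    "low": [f"low_{i}" for i in range(30)],
    "middle": [f"mid_{i}" for i in range(30)],
    "upper": [f"up_{i}" for i in range(30)],
}
by_stratum = stratified_sample(strata, 5, rng)

# Hypothetical schools (clusters), each with 20 ninth-grade students.
clusters = {f"school_{i}": [f"s{i}_{j}" for j in range(20)]
            for i in range(1, 11)}
by_cluster = cluster_sample(clusters, 2, rng)

# Every seventh senator on a list of 100, starting with the third.
senators = list(range(1, 101))
every_seventh = systematic_sample(senators, k=7, start=3)

print(simple_random_sample(senators, 4, rng))
print(by_stratum)
print(sorted(by_cluster))
print(every_seventh)
```

Note how the stratified sample contains a few members from all three income classes, while the cluster sample contains every student from only the selected schools.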
Page 15 of 54

The difference between stratified and cluster sampling is that stratified sampling samples some individuals from all groups, whereas cluster sampling samples all individuals from some groups.

Example: Identify the type of sampling used below.

In order to determine the average IQ of ninth-grade students, a school psychologist obtains a list of all high schools in the local public school system. She randomly selects five of these schools and administers an IQ test to all ninth-grade students at the selected schools.
_______________________________________________

A member of Congress wishes to determine her county’s opinion regarding estate taxes. She divides her county into three income classes: low-income households, middle-income households, and upper-income households. She then takes a random sample of households from each income class.
________________________________________________

A radio station asks its listeners to call in their opinion regarding the use of American forces in peacekeeping missions.
________________________________________________

In an effort to identify whether an advertising campaign has been effective, a marketing firm conducts a nationwide poll by randomly selecting individuals from a list of known users of the product.

Page 16 of 54

A lobby has a list of the 100 senators of the U.S. In order to determine the Senate’s position regarding farm subsidies, they decide to talk with every seventh senator on the list starting with the third.
________________________________________________

Chapter 8: Statistical Inference: Confidence Intervals

8.1 What are Point and Interval Estimates of Population Parameters?

When we first began our discussion of statistics, we mentioned that there were two branches of statistics: descriptive and inferential. The inferential branch uses sample information to draw conclusions about the population.
One of the most common uses of the inferential branch is to use sample statistics, such as x̄, to estimate population parameters, such as μ. It makes sense that if we take a large enough sample, x̄ should be pretty close to the actual value of μ. But the chances are pretty small that x̄ turns out to be exactly μ. This is why we call x̄ a point estimate of μ.

The key here is that sample statistics estimate population parameters. For example, x̄ is a point estimate of μ and p̂ is a point estimate of p.

Page 17 of 54

It is also important for us to note the level of confidence of a confidence interval. In our example before, our level of confidence would have been that we were 95% confident that the mean age of all students in the class was somewhere on our interval. So the level of confidence describes how often the method works: if we took many samples and built an interval from each one, about 95% of those intervals would contain the population parameter, in this case μ. We will see in examples that as we increase our level of confidence, we will get wider and wider intervals.

We will be constructing two different types of confidence intervals:
1. In Section 8.2, we will be calculating the confidence interval for the population proportion, p.
2. In Section 8.3, we will be calculating the confidence interval for the population mean, μ (like our classroom age example).

Before we get to these sections, let’s make sure we understand the terms in the example on the next page. In this example, the confidence interval will already be constructed for us. In Sections 8.2 and 8.3, we will actually learn how to construct these confidence intervals.

Page 20 of 54

Example: Suppose a farmer is trying to estimate the average number of peaches per tree in his orchard. He does not want to count every peach on every tree, so he takes a random sample of a few trees and calculates a 95% confidence interval based on the sample. That 95% confidence interval for the mean yield of a new variety of peaches in an orchard is 112 to 148 peaches per tree.
This means that we are 95% confident that the population mean, μ, for the number of peaches per tree is somewhere between 112 and 148 peaches per tree.

What is the lower limit?
What is the upper limit?
What is the level of confidence?
What is the width of the confidence interval?
What is the sample mean, x̄?

*Remember, the sample mean is always the middle of the confidence interval. The sample mean, x̄, will always be on the confidence interval, but the population mean, μ, may or may not be on the confidence interval.*

Page 21 of 54

What is the margin of error?

8.2 How Can We Construct a Confidence Interval to Estimate a Population Proportion?

Recall from Section 8.1 that confidence intervals can be written in the general format: point estimate +/- margin of error.

The point estimate and margin of error change depending on what parameter is being estimated. For example, we looked at an example of a Confidence Interval for μ, so our point estimate was x̄.

Now we will consider the format of the Confidence Interval for the population proportion, p. The point estimate for this type of Confidence Interval is the sample proportion, p̂ = x/n, where x is the number of individuals in the sample with the desired characteristic and n is the sample size.

So we know what goes before the +/-, the point estimate, and we can calculate that easily. Now we need to know how to calculate what goes after the +/-, the margin of error.

Page 22 of 54

We want to be 95% confident that the interval contains p, the true population proportion. So, using the same Empirical Rule logic, if we start with p̂ and add and subtract close to 2 standard errors (1.96 to be exact), it makes sense that we are going to be 95% confident that p will be within that interval.

Example 3 from Section 8.2: We asked n = 1154 Americans, “Would you be willing to pay $6 per gallon of gas?” In our random sample, 518 said they would be willing to pay $6 per gallon of gas.

a.
Find a 95% confidence interval for the population proportion of Americans willing to pay $6 per gallon of gas.

First, we need our sample proportion:
p̂ = x/n = (# who said yes)/(total number in the sample) = 518/1154 = .45

Next we need the standard error:
Standard error = √(p̂(1 − p̂)/n) = √(.45 × .55 / 1154) = 0.015

The 95% confidence interval formula: p̂ +/- 1.96 × √(p̂(1 − p̂)/n)

So the interval is .45 +/- 1.96 × 0.015 = .45 +/- .03
So the lower limit is .45 − .03 = .42
And the upper limit is .45 + .03 = .48

Page 25 of 54

So our 95% confidence interval is (.42, .48).

b) Interpret the interval.
We are 95% confident that the proportion of ALL Americans that are willing to pay $6 per gallon of gas is between .42 and .48, in other words, between 42% and 48%.

c) EXTRA QUESTION: Does it appear likely that 50% of ALL Americans are willing to pay $6 per gallon of gas?
No, 50%, or .5, is not on our interval, so it does not appear likely that 50% of all Americans are willing to pay $6 per gallon of gas. Our interval went from .42 to .48, so it appears that less than 50% of ALL Americans are willing to pay $6 per gallon of gas.

Sample Size Needed for Validity of Confidence Interval for a Proportion:

For these confidence intervals to be valid, we need to check some requirements, as we did back when we were determining the sampling distribution for the sample proportion. The following two things must be true:

n·p̂ ≥ 15 AND n·(1 − p̂) ≥ 15

Also, we must make sure we take a random sample.

Page 26 of 54

Check that this is true in the above gas example:

Example: The drug Lipitor is meant to lower cholesterol levels. In a clinical trial of 884 randomly selected patients who received 10 mg doses of Lipitor daily, 221 reported a headache as a side effect.
(a) Obtain a point estimate for the population proportion of users who will experience a headache.
(b) Verify that the requirements for constructing a confidence interval about p are satisfied.
(c) Construct a 95% CI for the population proportion of users who will have a headache.

Page 27 of 54

Then put the area of the tail in the StatCrunch Normal Calculator with mean = 0 and standard deviation = 1:

The z-score you get = 1.645. So to get the margin of error for a 90% confidence interval, you multiply the standard error by 1.645.

Let’s use this in the following example:

Example: A study of 70 randomly selected people in Atlanta was conducted to estimate the proportion of Atlantans that owned dogs. The study revealed that 42 of the 70 people were dog-owners.
a) Obtain a point estimate for the population proportion of dog-owners in Atlanta.
b) Verify that the requirements for constructing a confidence interval about p are satisfied.

Page 30 of 54

c) Construct a 90% confidence interval for the proportion of Atlantans that are dog-owners.
d) Interpret the confidence interval.

Now, using the same example as above, construct a 99% confidence interval. Let’s see how the interval changes if we increase the confidence level. We have the point estimate and the standard error, so we just need the new Z-score for this confidence interval:

First, draw the curve with .99 in the middle and find the area of both tails:

Page 31 of 54

Then put the area of the tail in the StatCrunch Normal Calculator with mean = 0 and standard deviation = 1:

The Z-score =

Now create the 99% confidence interval:

Notice that the 99% confidence interval is wider than the 90% confidence interval. In this example, we saw that:

As the level of confidence increases, the margin of error increases and the confidence interval gets wider.
ALSO, as the level of confidence decreases, the margin of error decreases and the confidence interval gets narrower.

This applies to all confidence intervals, like in the picture below:

Page 32 of 54

[Figure: diagram of a confidence interval, labeled Margin of Error, Standard Error, and Confidence Interval]

HOW CAN STATCRUNCH CALCULATE THESE CONFIDENCE INTERVALS FOR US?
Look back at our example where we wanted to get a 90% confidence interval for the population proportion of ALL Atlantans that own dogs on page 30 of our notes. We got the 90% confidence interval, which has a lower limit of .50369 and an upper limit of .69631.

Page 35 of 54

On page 31, we got the 99% confidence interval, which has a lower limit of .44917 and an upper limit of .75083.

Guess what, STATCRUNCH can get these values for us.

Go to Stat > Proportions > One Sample > With Summary

Here we can type in how many Atlantans owned dogs in our sample. In our sample, 42 of 70 Atlantans owned dogs. Put those numbers in just like this and hit Next:

On the next screen choose “Confidence Interval”, and we want a 90% confidence interval, so change the 0.95 to 0.90:

Hit Calculate, and here is what we get:

Page 36 of 54

It tells us the Sample Proportion, which is the point estimate, which is .6, the same thing we got in part (a). It also tells us the lower limit (.50369) and the upper limit (.69631), the same values we calculated! Notice, it also gives us the standard error. The only values it does not give us are the margin of error and the Z-score used in the formula, so we still would need to know how to get those by hand.

Now get the 99% Confidence Interval and check it against our answers of (0.44917, 0.75083).

Section 8.3 How Can We Construct a Confidence Interval to Estimate a Population Mean?

Recall from Section 8.1 that confidence intervals can be written in the general format: point estimate +/- margin of error.

Remember the point estimate is a single number that is our “best guess” for the parameter. What single number is the

Page 37 of 54

4. The area in the tails of the T-distribution is a little greater than the area in the tails of the normal distribution.
5. As the sample size n increases, the T curve looks more and more like the normal curve.
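Properties 4 and 5 can be checked numerically. The sketch below is not how the notes compute anything (they use the StatCrunch T Calculator); it is a rough stand-in that uses only the Python standard library, integrating the T density with Simpson's rule and searching for the critical value by bisection. The function names, the number of integration steps, and the search range of 0 to 100 are all assumptions made for this illustration.

```python
import math
from statistics import NormalDist

def t_cdf(x, df, steps=1000):
    # Area to the left of x under the T curve with `df` degrees of freedom,
    # found by Simpson's rule on [0, |x|] plus symmetry about 0.
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(v):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(v * v / df))

    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_critical(confidence, df):
    # T value that leaves (1 - confidence)/2 in the right tail.
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 100.0  # assumed search range, wide enough for these cases
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def z_critical(confidence):
    # Matching critical value from the standard normal curve.
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

for df in (5, 31, 500):
    print(f"DF = {df:3d}: T = {t_critical(0.95, df):.4f}")
print(f"normal : Z = {z_critical(0.95):.4f}")
```

For the same 95% confidence level, the T value is larger than the Z value (heavier tails) and moves toward it as DF grows, which is what properties 4 and 5 say.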
Since the T-distribution looks different for different values of n, we always have to type in what we call the “degrees of freedom” on the T calculator. The degrees of freedom we have to put in the T calculator = n – 1. The degrees of freedom on the T calculator is abbreviated as “DF”. So DF in StatCrunch = n – 1, and we always have to put that into the T Calculator.

Try some different DF values in StatCrunch and see how the T-distribution changes for different sample sizes. Again it is Stat > Calculators > T

Try DF = 5. Then try DF = 500; this one looks more like our normal curve.

So our confidence interval formula for the population mean is:

Lower limit: x̄ − T · s/√n
Upper limit: x̄ + T · s/√n

These intervals are valid when we:
1. use a random sample AND
2. either use a sample size > 30 OR sample from a normal population.

Page 40 of 54

So we can get the sample mean, sample standard deviation and n value, but we haven’t yet talked about what the T value is that we want from the T Calculator.

Getting the T value is just the same as getting the Z value when we were doing confidence intervals for the population proportion in Section 8.2. The only difference is that the T value depends on BOTH the confidence level and the sample size.

Let’s find the T value for a 95% confidence interval if the sample size we used is n = 32.

First, draw a curve with .95 in the middle and find the area of both tails:

Page 41 of 54

Next put in the right tail area = .025 in the T Calculator AND put DF = 32 – 1 = 31.

Hit Compute and you get T = 2.0395

Let’s do a few more. These are the same thing they are asking you to get on Homework 8.3-8.4.

a) Find the t-score for a 99% confidence interval for a population mean with 5 observations in our sample.

First, draw a curve with .99 in the middle and find the area of both tails:

Page 42 of 54

Next put in the right tail area = .025 in the T Calculator AND put DF = 7 – 1 = 6.
Hit Compute and you get T = 2.44691

Now that we have everything we need, we can construct the lower and upper limits of the 95% confidence interval:

Lower limit = x̄ − T · s/√n = 233.57 – 2.447 × (14.64/√7) = 220.03
Upper limit = x̄ + T · s/√n = 233.57 + 2.447 × (14.64/√7) = 247.11

Page 45 of 54

So we are 95% confident that the mean price of ALL iPods sold on eBay is somewhere between $220.03 and $247.11.

EXTRA QUESTION: According to our confidence interval, is it likely that the population mean price of ALL iPods sold on eBay = $250?
No, $250 is not on our confidence interval, so it is not a likely mean price for ALL iPods sold on eBay. We think that mean price should be somewhere between $220.03 and $247.11.

EXTRA QUESTION #2: According to our confidence interval, is it possible that the population mean price of ALL iPods sold on eBay = $225?
Yes, $225 is a possible mean price because it is on our interval.

Using STATCRUNCH to construct confidence intervals:

Whenever we have actual data like in the above eBay example, we can put this data into StatCrunch and StatCrunch will actually calculate these intervals for us.

First, put the seven eBay prices in a column on StatCrunch.

Page 46 of 54

Go to Stat > T Statistics > One Sample > with data

Choose the column you have put the data in and hit “Next”.

Choose “Confidence Interval” and type in 0.95.

Hit Calculate and here are our results:

The same amounts we got before:
Lower limit of the confidence interval = 220.03
Upper limit of the confidence interval = 247.11

Let’s do an example like this where we have to calculate the limits using the summary statistics and not the actual data.

Example: Suppose a sample of 16 test scores is taken from a normal population. If the sample mean x̄ = 78.2 and the sample standard deviation, s = 2.55, then

Page 47 of 54

Point Estimate: x̄
Margin of Error: T · s/√n
Standard Error: s/√n
Confidence Interval: x̄ +/- T · s/√n

Section 8.4: How Do We Choose the Sample Size for a Study?
Sometimes, before setting up an experiment or survey, we know that we want the margin of error to be a certain amount. For example, if we are trying to get results for an election, and we are going to be getting a sample proportion for the proportion of people who will vote for candidate A, maybe we know that whatever sample proportion we get, we want it to be within 3% of the true population proportion for ALL voters. So we know we want the margin of error = 3%.

Page 50 of 54

We can use a formula to tell us what sample size we need to take so that our margin of error will be 3%, or in other words, so we can be sure that whatever sample proportion we get will be within 3% of the true population proportion with a certain level of confidence.

Here is that formula for choosing the sample size in estimating a population proportion:

n = p̂(1 − p̂)Z²/m²

where p̂ is a guess at the value we think we might get for the sample proportion. If no guess is given, then use p̂ = 0.5.

The m is the margin of error.

The Z-score is calculated again based on the level of confidence, just like with the confidence intervals. It represents how confident we want to be that the sample proportion we get will be that close to the true population proportion.

Example 9 in Section 8.4: An election is expected to be close, and we are going to take a sample of people to obtain a sample proportion of the people who voted for candidate A. How large should the sample size be for the margin of error of a 95% confidence interval to equal 0.02?

Page 51 of 54

(What we are saying here is that we want to take a sample and get a sample proportion. We then want to create a 95% confidence interval around that sample proportion, and we want the margin of error for that confidence interval to equal 0.02.)

Sample size formula: n = p̂(1 − p̂)Z²/m²

p̂ = 0.5 because we aren’t given a guess for p̂ to use.
m = .02

Z = Z-score based on 95% level of confidence

First, draw the curve with .95 in the middle and find the area of both tails:

Then put the area of the right tail in the StatCrunch Normal Calculator with mean = 0 and standard deviation = 1:

The z-score you get = 1.96, just like before.

n = (0.5)(1 − 0.5)(1.96)²/(.02)² = 2401

We would need to take a sample of 2401 people to get a sample proportion that we can be 95% confident will be within 0.02 of the true population proportion.

What if we are not dealing with a population proportion example, but a population mean example? That is, we want to know what sample size we need so that the sample mean we get is close enough to the true population mean.

Page 52 of 54
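The proportion sample-size calculation from Example 9 can be reproduced with a short Python sketch. It is only an illustration of the formula n = p̂(1 − p̂)Z²/m² given in these notes; the function name and its parameters are made up for this example, and rounding the result up to the next whole person is a common convention assumed here rather than something the notes state.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(margin, confidence=0.95, p_guess=0.5, z=None):
    # n = p(1 - p) * Z^2 / m^2, with p = 0.5 when no guess is given.
    if z is None:
        # Z-score leaving (1 - confidence)/2 in the right tail.
        z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    raw = p_guess * (1 - p_guess) * z ** 2 / margin ** 2
    # Round off floating-point noise first, then round UP to a whole
    # person (rounding up is an assumption, not stated in the notes).
    return math.ceil(round(raw, 6))

# Example 9: 95% confidence, margin of error 0.02, rounded Z = 1.96.
print(sample_size_for_proportion(0.02, z=1.96))

# Same question using the unrounded Z-score for 95% confidence.
print(sample_size_for_proportion(0.02, confidence=0.95))

# The 3% margin of error mentioned at the start of Section 8.4.
print(sample_size_for_proportion(0.03, confidence=0.95))
```

Using p̂ = 0.5 gives the largest possible value of p̂(1 − p̂), so it is the safe choice when no guess is available: the resulting sample size is big enough whatever the true proportion turns out to be.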