Statistics: Important Terms, Sampling Methods, and Confidence Intervals - Prof. Oneal

Important Terms
• Population – Total set of subjects in which we are interested
• Sample – A subset of the population for which we have data
• Subject – Entities we measure (individuals)

Histogram Interpretation
• How many total students sampled? 60 + 82 + 60 + 41 = 243
• Which class has highest / lowest frequency? What are those frequencies?
  Highest: “100-109” with 82
  Lowest: “120-129” with 41
• How many students have an IQ between 110 and 129? 60 + 41 = 101

Stem-And-Leaf Plot
• A bar chart on its side
• “Stem” is all digits except the last one
• Last digit is the “leaf”
• Ascending order
• No commas
• If nothing in a row, write the row, but leave it blank

Example (HW 2.1-2.2)
eBay selling prices: 199 210 210 223 225 225 225 228 232 235

Sampling Methods
• Simple Random Sampling
  – Each subject in the population has an equal chance of being selected
  – Often done with a random number table
  – Example: choosing a company somewhere in the U.S.
• Systematic
  – Selecting every “k-th” subject
  – Example: surveying every 10th person we meet downtown
• Convenience
  – Individuals are easily found (e.g. internet surveys)
  – Often the “laziest” way, so less reliable answers

Sampling Methods
• Stratified Sampling – Taking some subjects from all possible groups
• Cluster Sampling – Taking all subjects from some possible groups

Skewness

Outliers
• The mean is sensitive to outliers.
• The median is resistant to outliers.
• When outliers are present, it is best to use the median as the measure of central tendency.
• Example: average selling price of homes in the U.S.

Standard Deviation
• Roughly the average distance between a data point and the mean of the data.
• Measures how much/little the data distribution is spread out.

Summary Stats Interpretation
• Mean – Average of the data set
• Median (also called Q2) – About 50% of data lie below (and above) this value.
• Range – Difference between maximum and minimum
• Max & Min – Highest and lowest points in data set
• Q1 and Q3 – 25th and 75th percentiles
• Interquartile Range (IQR) – Difference between Q3 and Q1

Box-Plot (HW 2.5-2.6)
Distribution of taxes (in cents)
• Minimum = 2.6
• Q1 = 31
• Median = 55
• Q3 = 105
• Maximum = 206
• Construct a box-plot for this data.
• What proportion of states have taxes…
  – Greater than 31 cents?
  – Greater than $1.05 (105 cents)?
• Find the range and the interquartile range (IQR).

Box-Plot (HW 2.5-2.6)
• Greater than 31 cents: .75
• Greater than $1.05: .25
• Range = max - min = 206 - 2.6 = 203.4
• IQR = Q3 – Q1 = 105 – 31 = 74
• IQR = range for the middle half of the data.

Box-Plot Outlier Test (HW 2.5-2.6)
• Any point lying above Q3 + 1.5 x IQR is an outlier.
• Any point lying below Q1 – 1.5 x IQR is also an outlier.
• Are there any outliers on this box-plot?
• Q1 - 1.5 x IQR = 256 - 1.5 x (1105 - 256) = -1017.5
• Because there are no points beneath this cutoff, we have no lower outliers.
• Q3 + 1.5 x IQR = 1105 + 1.5 x (1105 - 256) = 2378.5
• Because the max is greater than this cutoff (320,000 > 2378.5), we have an upper outlier.

Regression (HW 3.2-3.4)
• Analysis says that we can use the length of an alligator (in feet) to predict its weight (in pounds). The equation is given by $\hat{y} = 10 + 40x$.
• Find the expected weight of an alligator that’s 10 feet long.
  $\hat{y} = 10 + 40(10) = 410$ pounds
• Suppose an alligator that’s 10 feet long actually weighs 402 pounds. Calculate the residual.
  Observed - Predicted = 402 - 410 = -8 (so we overestimated)
• Interpret the slope.
  For every additional foot in length, an alligator’s weight is expected to increase by 40 pounds.
• Interpret the intercept.
  Literally: an alligator with length 0 will weigh 10 pounds, which makes no sense. So the intercept has no meaningful interpretation here.

Probability
• Probability is the likelihood of a particular outcome occurring.
• Example: probability of drawing a club from a deck of cards
• The complement of A – All possible outcomes that are not in A
• Example:
  – $A$ = it’s raining
  – $A^c$ = it’s not raining
• Complement probability: $P(A^c) = 1 - P(A)$

Probability (HW 5.1 - 5.3)
• We have an urn full of 12 blue, 10 red, and 8 black marbles. We reach in and draw a marble at random.
• What’s the probability of drawing a marble that’s one of UGA’s colors?
  30 total marbles, and 10 + 8 = 18 of them are red or black: $\frac{18}{30} = .6$
• If the marble drawn was a UGA color, what’s the probability it was not red?
  Out of 18 marbles of UGA colors (red or black), 8 of them are black (so not red): $\frac{8}{18} = .44444$

Discrete Probability Distribution
• Two requirements:
  1. Each individual $p(x)$ is between 0 and 1, inclusive
  2. All probabilities sum to 1
• The mean of a discrete distribution: $\text{MEAN} = \sum x \cdot p(x)$

Mean of a Distribution (HW 6.1-6.2)
• Here’s a table for the probability of different category hurricanes.

  Category (x)   Probability p(x)   x · p(x)
  1              0.15               0.15
  2              0.32               0.64
  3              0.36               1.08
  4              0.12               0.48
  5              0.05               0.25

• Add up the last column and get 2.6.
• The mean/expected value is the category strength of a hurricane we will expect to see on average. If we average a large number of hurricanes, the long-run average will be about 2.6.

Normal Distribution (Continuous)
Three rules:
1. Total probability / area under the normal curve is 1
2. Normal curve is symmetric
3. X value goes in left box; probability goes in right box on StatCrunch

Normal (HW 6.1-6.3)
• Study of entrance exam scores; mean is 120 with s.d. 11. Anything above 145 is considered superior.
• Find the z-score for a score of 145. Interpret it.
  $z = \frac{145 - 120}{11} = 2.27$. A score of 145 is 2.27 standard deviations above the mean.
• What score is 1 standard deviation to the left of the mean?
  $z = -1.0$ because it’s below the mean:
  $-1 = \frac{x - 120}{11} \Rightarrow x = (-1)(11) + 120 = 109$

Area Between Two Lines
• Find the probability within 1.28 standard deviations from the mean. Make a sketch of this.
• Try to think of a strategy for this.
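
One possible strategy for the "within 1.28 standard deviations" question is to take the area below the upper line and subtract the area below the lower line. The sketch below checks that idea numerically. It assumes Python's standard-library `statistics.NormalDist` as a stand-in for the StatCrunch normal calculator, and the function name `prob_within` is made up for illustration.

```python
from statistics import NormalDist


def prob_within(k: float) -> float:
    """Probability that a normal value falls within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, s.d. 1
    # Area between the two lines = (area below +k) - (area below -k).
    return z.cdf(k) - z.cdf(-k)


p = prob_within(1.28)
print(round(p, 4))
```

The result should come out close to .80, so roughly 80% of the area lies between the two lines and about 10% sits in each tail.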
Percentiles
• The Pth percentile is the x that gives P% below on the normal curve
• Always below
• Example: here the x is the 30th percentile because 30% of the data falls below x

3 Types of Distributions
1. Population – Distribution of all points in the population
2. Sample Data / Data – Distribution of one particular sample
3. Sampling Dist. of Sample Means – Distribution of the sample means of a given size n

Means Problem Distribution Shapes
• Sample Data (Data Distribution) is roughly the same shape as the population
  – If population is skewed, so is the sample data
• Sampling Distribution of Sample Means is approximately normal if…
  – Population is normal, or…
  – n > 30 (by the Central Limit Theorem)
  – Otherwise no conclusion about shape

Two Important Properties
As the sample size n increases…
• The mean of the sampling distribution of sample means does not change.
• The standard error (s.d. of the sampling distribution) decreases.
• Example: for sample sizes $n_1 < n_2$, $\frac{\sigma}{\sqrt{n_1}} > \frac{\sigma}{\sqrt{n_2}}$ (larger denominator, smaller overall fraction)

Distributions (HW 7.1-7.2)
• The lifetime of a certain type of tumble-dryer (until failure) has a distribution skewed right with mean 62 months and s.d. 11. A sample of 98 dryers is selected, and this sample has mean 57.2 and s.d. 12.1.
• What is the center and spread for the population?
  Center = 62
  Spread = 11
• What shape is the population?
  Skewed Right

Distributions (HW 7.1-7.2)
• What is the center and spread for the sample data selected?
  Center = 57.2
  Spread = 12.1
• What shape is the sample data?
  Skewed Right (same as population)
• What is the center and spread for the sampling distribution of sample means with size 98?
  Center = 62
  Spread = $\frac{11}{\sqrt{98}} = 1.11117$
• What shape is the sampling distribution of sample means?
  Normal by Central Limit Theorem (n > 30)

StatCrunch (HW 7.1-7.2)
• The average household temperature in Chattanooga is 67.6 degrees, and the s.d. is 4.2. A sample of 51 households is selected.
• What’s the probability the average of this sample will be above 68.1? Fill in the inputs on the StatCrunch box.

StatCrunch (HW 7.1-7.2)
• Sampling distribution with n = 51
• Mean = 67.6 (population’s)
• S.D. = standard error = $\frac{4.2}{\sqrt{51}} \approx 0.588$

Interpretation of a C.I.
• A 95% C.I. means that about 95% of all C.I.s constructed contain the true population proportion/mean, and about 5% do not (long-run definition)
• We are 95% confident the true proportion lies somewhere inside our C.I. (definition of an individual interval)
• A 99% C.I. means that about 99% of all C.I.s constructed contain the true population proportion/mean, and about 1% do not
• Example: 1000 intervals
  – At 95%, about 950 (maybe 940-960) contain the true proportion
  – At 99%, about 990 (maybe 985-995) contain the true proportion

Determining z
• z is set by the level of confidence
• 95% C.I.: z = 1.96
• To get these numbers…
  – At 95%, 5% is left over
  – Half of that is 2.5%
  – P(z >= ?) = .025 in StatCrunch

Proportions C.I. (HW 8.1-8.2)
• A random sample of 970 people were asked if they owned a pet hamster. 19 said yes and 951 said no.
• Find a point estimate for the proportion of people who said yes.
  $\hat{p} = \frac{19}{970} = .01959$
• If the margin of error is .00872, find the 95% confidence interval.
  $.01959 \pm .00872 = (.01087, .02831)$

Proportions C.I. (HW 8.1-8.2)
• Suppose in a new sample for owning a pet hamster, we get a 95% confidence interval of (.03, .09).
• Can we find the sample proportion? If so, find it.
  $\frac{.03 + .09}{2} = .06$, because the sample proportion is always in the center of the interval.
• Can we find the population proportion? If so, find it.
  We cannot because the population proportion is unknown. It may or may not be inside the interval.
• Can we conclude that fewer than 12% of people own a pet hamster?
  Yes because .12 lies above this interval.
• How about more than 2%?
  Yes because .02 lies below this interval.
• What is the margin of error?
  Difference between endpoint and center: .09 - .06 = .03

C.I. Properties
• Increasing level of confidence (z) widens the interval
• Decreasing level of confidence (z) shortens the interval
• Intuition: narrowing your field for the true proportion means you’re not as certain it really does fall inside the interval
  $\hat{p} \pm z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$

C.I. Properties
• Increasing the sample size shortens the C.I.
• Decreasing the sample size widens the C.I.
• This is because standard error decreases as n increases, so the margin of error (width) decreases as well.
• Intuition: a larger sample size gives a more accurate estimate and allows you to zero in on the true proportion.
  $\hat{p} \pm z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$

Summary of C.I. Width Factors
Confidence Level (z)
• As z increases, C.I. widens
• As z decreases, C.I. shortens
Sample Size (n)
• As n increases, C.I. shortens
• As n decreases, C.I. widens
• Assumptions for proportion C.I.:
  1. Sample is randomly selected
  2. $n\hat{p} \ge 15$
  3. $n(1 - \hat{p}) \ge 15$

C.I. with Means
• Same general idea: point estimate ± margin of error
• But we have a different formula: $\bar{x} \pm t \cdot \frac{s}{\sqrt{n}}$
• With proportions, use z
• With means, use t

C.I. with Means
• T-values change as degrees of freedom change (unlike normal calculator)
• Degrees of freedom = n – 1
• Assumptions for doing C.I. for means:
  – Random sample
  – One of these two should be true:
    • Sampling from a normal population
    • n > 30

Choosing Sample Size
• Idea: We have a given confidence level and a desired margin of error
• What sample size is needed to achieve that?
• Formula is different for proportions and means (see formula sheet)

Sample Size Needed Formulas
• What do we choose for the sample proportion?
  1. Proportion of a previous study
  2. If nothing is known, $\hat{p} = .50$
• For means: $n = \frac{\sigma^2 z^2}{m^2}$
  n = sample size needed
  z = z-score for confidence level
  s = sample standard deviation (used as the estimate of $\sigma$)
  m = desired margin of error

Proportions Summary
Assumptions for a Valid Confidence Interval
• Random sample
• We need $n\hat{p} \ge 15$
• We need $n(1 - \hat{p}) \ge 15$
Finding Sample Size (see formula sheet)
Confidence Interval
• Point Estimate = $\hat{p}$
• Standard Error = $\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
• Level of Confidence: use z
• Margin of Error = $z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
• Lower Limit = $\hat{p} - z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
• Upper Limit = $\hat{p} + z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$

Means Summary
Assumptions for a Valid Confidence Interval
• Random Sample
• One of these two:
  – Normal population, or...
  – n > 30
Finding Sample Size: $n = \frac{\sigma^2 z^2}{m^2}$
Confidence Intervals
• Point Estimate = $\bar{x}$
• Standard Error = $\frac{s}{\sqrt{n}}$
• Level of confidence depends on t
• Margin of Error = $t \cdot \frac{s}{\sqrt{n}}$
• Lower Limit = $\bar{x} - t \cdot \frac{s}{\sqrt{n}}$
• Upper Limit = $\bar{x} + t \cdot \frac{s}{\sqrt{n}}$

Designed Experimental Study
• Manipulates the subjects somehow
• Can be used to prove causation
• Subjects randomly divided into groups
• Examples:
  – Does a coupon attached to a catalogue make recipients more likely to order?
  – Does a new medicine reduce the frequency of headaches?

Observational Study
• Measures qualities of subjects without manipulating them
• Cannot be used to prove causation; it can only show that the variables are related.
• Subjects cannot be randomly assigned to groups
• Examples:
  – Whether or not smoking has an effect on heart disease (can’t assign groups)
  – Are higher SAT scores positively correlated with higher college GPAs?

Designed Experiments
• Experimental Unit (subject) – The person/object that receives the treatment
• Treatment – A condition/drug/etc. applied to the subject
• Response Variable – Variable we are interested in studying
• Explanatory Variable – Variable we believe to influence the response

Designed Experiment (HW 4.1-4.4)
• We are testing the effects of a new energy drink on heart rate. 50 subjects are randomly assigned to consume the energy drink, while a different 50 have a similar-tasting drink that is not an energy booster. The subjects’ heart rates are recorded, and the researchers know which drink each subject gets.
• Response: heart rate
• Explanatory: type of drink received
• Treatments: energy and generic drinks
• Experimental Units: 100 subjects
• Is this completely randomized or matched pairs?
  Completely randomized: no overlap in groups
• If the subjects don’t know which drink they get, is the study single or double blind?
  Single, since the researchers know

Experimental Designs
• Completely Randomized
  – Experimental units are randomly assigned to treatments, with no overlap in groups.
  – That is, everybody gets just one treatment; nobody gets both.
• Matched Pairs
  – Subjects are somehow matched before the experiment happens, for measuring differences between the two
  – Twins, or same person in two treatment groups

Comparing 2 Proportions or 2 Means (Independent)
• Testing 2 independent proportions
  $H_0: p_1 = p_2$ or $p_1 - p_2 = 0$
  $H_A: p_1 > p_2$ or $p_1 - p_2 > 0$
  $H_A: p_1 < p_2$ or $p_1 - p_2 < 0$
  $H_A: p_1 \ne p_2$ or $p_1 - p_2 \ne 0$
• Testing 2 independent means
  $H_0: \mu_1 = \mu_2$ or $\mu_1 - \mu_2 = 0$
  $H_A: \mu_1 > \mu_2$ or $\mu_1 - \mu_2 > 0$
  $H_A: \mu_1 < \mu_2$ or $\mu_1 - \mu_2 < 0$
  $H_A: \mu_1 \ne \mu_2$ or $\mu_1 - \mu_2 \ne 0$

2 Proportions (HW 10.1-10.4)
• Does a new medicine help lower cholesterol? People with high cholesterol were randomly assigned to receive either the new medicine, or a placebo.
• After 5 weeks, 106 of the 8499 on the new medicine had lower cholesterol, and 86 of the 8091 in the placebo group had lower cholesterol. Is this a significant difference?
• Set up the hypotheses.

2 Proportions (HW 10.1-10.4)
• The p-value is .2673. Interpret.
  – We do not have strong evidence that there are different results.
  – We don’t reject the null at .05.
  – The test is not significant.
  – The 95% confidence interval for the difference in proportions will contain 0.
  – If we make the wrong conclusion, it would be a Type II error.

2 Means (HW 10.1-10.4)
• A summary for types of sales for an iPod

               Mean      n
  Bid          231.611   96
  Buy-it-now   221.667   128

• Find the point estimate for the difference in population means.
  231.611 – 221.667 = 9.944
• The p-value is .006. Can we conclude there’s a population difference?
  Yes; the p-value is below the common $\alpha = .05$, so we reject the null
• Without finding the CI, determine whether it will contain 0 or not.
  We just concluded that the means are most likely different, so 0 is not a plausible value for the true difference. It therefore will be outside the interval.

Chi-Square Goodness of Fit
• Used for testing if category proportions/counts are equal to specified values
• Or, used for testing if category proportions/counts are all equal to each other.
• Example hypotheses:
  $H_0$: Proportions are as stated (example: .20, .30, .50)
  $H_A$: otherwise
  $H_0$: Proportions are all equal to one another
  $H_A$: otherwise

Goodness of Fit Steps
1. List hypotheses:
   • $H_0$: the proportions are as claimed
   • $H_A$: otherwise
2. Check assumptions:
   A. The sample is randomly selected
   B. Each expected cell count is at least 5
3. Compute the test statistic: $\chi^2 = \sum \frac{(\text{obs.} - \text{exp.})^2}{\text{exp.}}$
4. State degrees of freedom: c - 1 (number of cells minus 1)
5. Find the p-value on the Chi-Square distribution with df = c - 1, by finding the probability above $\chi^2$
6. State the conclusion

Computing the p-value
• Suppose we have 6 categories and $\chi^2 = 2.12164$. Then df = 6 - 1 = 5
• Look up the probability above the test statistic $\chi^2$, never below.

Conclusions for Goodness of Fit
If p-value < $\alpha$ (alpha)
  – Reject $H_0$
  – Test is significant
  – There is enough evidence to suggest the proportions are different than what’s specified
  – Possible Type I Error
If p-value > $\alpha$
  – Fail to reject $H_0$
  – Test is not significant
  – There is insufficient evidence to suggest the proportions are different than what’s specified
  – Possible Type II Error

Goodness Of Fit (HW 11.1-11.2)
• It is thought that a certain type of cookie box should contain the following percentages of three varieties:
• Chocolate Chip: 40%   Oatmeal: 30%   Sugar: 30%
• A box is selected at random and opened.
Here are the observed counts of 50 cookies:

                   Observed   Expected
  Chocolate Chip   18         50 x .40 = 20
  Oatmeal          19         50 x .30 = 15
  Sugar            13         50 x .30 = 15

• Compute the expected category counts. Will we have a valid chi-square test?
  Yes because the sample is random, and each expected count is at least 5.
• True/False: the degrees of freedom for this problem would be 50 - 1 = 49.
  False: it is 3 - 1 = 2, since we do categories - 1 for goodness-of-fit.

Goodness of Fit (HW 11.1-11.2)
• We want to see if a 20-sided die is fair (balanced). To investigate this, we roll it 80 times and record the number of times each face comes up.
• The $\chi^2$ statistic is 21.007. Fill in the boxes below to find the p-value.
  df = 20 - 1 = 19 (categories - 1)

Chi-Square Test for Independence
• Used in an r x c contingency table to test if there’s an association between the categorical variables (cf. contingency table association from Test 1)
• The null hypothesis is that the explanatory and response variables are independent.
• The alternative hypothesis is that there is an association between them.

Independence Test Steps
1. $H_0$: variables are independent. $H_A$: association
2. Assumptions: random sample, and each expected count is at least 5
3. Compute each expected count: $\text{exp.} = \frac{(\text{row total})(\text{column total})}{\text{overall total}}$
4. Compute the test statistic: $\chi^2 = \sum \frac{(\text{obs.} - \text{exp.})^2}{\text{exp.}}$
5. Compute df = (r - 1)(c - 1)
6. Find the p-value: the probability above $\chi^2$
7. State your conclusion

Conclusions for Independence Test
If p-value < $\alpha$ (alpha)
  – Reject $H_0$
  – Test is significant
  – There is enough evidence to suggest the explanatory and response variables are related somehow
  – Possible Type I Error
If p-value > $\alpha$
  – Fail to reject $H_0$
  – Test is not significant
  – There is insufficient evidence to suggest the explanatory and response variables are related
  – Possible Type II Error

Independence Test (HW 11.1 - 11.2)
• State the null and alternative hypotheses.
• $H_0$: gender and ice cream preference are independent
• $H_A$: gender and ice cream preference are related somehow
• Find the expected count for males preferring strawberry.

           Chocolate   Strawberry   Total
  Male     70          91           161
  Female   82          63           145
  Total    152         154          306

  $\text{exp.} = \frac{(161)(154)}{306} = 81.026$

Independence Test (HW 11.1 - 11.2)
• Same question: the test statistic is 5.2159367. Draw a sketch and fill in the window below.
  df = (2 - 1)*(2 - 1) = 1
• What is your conclusion, at $\alpha = .04$?
  Reject the null. There is strong evidence that gender and preference for ice cream flavor are related somehow.
• True/False: when all the observed and expected counts are far apart from each other, the chi-square statistic will be small.
  False: each $(\text{obs.} - \text{exp.})^2 / \text{exp.}$ calculation will be large, and so their sum is as well.

True/False Questions
• A 94% confidence interval will give the same conclusions as a right-tailed hypothesis test at $\alpha = .06$.
  False: $\alpha$ is correct, but confidence intervals coincide with two-tailed tests only
• Margin of error is the distance between the point estimate and one of the endpoints of a confidence interval.
  True: see earlier question in this handout
• For a hypothesis test with proportions, the p-value can be found using the normal calculator.
  True (the T goes with means problems)
• All other things being equal, a 93% confidence interval is shorter than a 96% interval.
  True: the higher the confidence, the wider the interval

StatCrunch
• Summary Stats
  – STAT > Summary Stats > Columns
• Regression
  – STAT > Regression > Simple Linear
• Calculators
  – STAT > Calculators > Normal & T & Chi-Square
• Single Proportion
  – STAT > Proportions > One Sample > With Summary
• Two Independent Proportions
  – STAT > Proportions > Two Sample > With Summary
• Single Mean
  – STAT > T-Statistics > One Sample > With Data or Summary
• Two Independent Means
  – STAT > T-Statistics > Two Sample > With Data or Summary
  – Uncheck “Pool Variances”
• Two Dependent Means
  – STAT > T-Statistics > Paired
• Chi-Square Goodness of Fit
  – STAT > Goodness-Of-Fit > Chi-Square
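
Several of the calculator lookups in these notes (STAT > Calculators) can be cross-checked outside StatCrunch. The sketch below is one rough way to do that, assuming Python's standard-library `statistics.NormalDist`. The function names are invented for illustration, and the chi-square shortcut only applies when df = 1, where a chi-square value is the square of a standard normal value.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution (mean 0, s.d. 1)


def z_for_confidence(level: float) -> float:
    """z that leaves (1 - level) / 2 of the area in each tail."""
    return Z.inv_cdf(1 - (1 - level) / 2)


def prob_sample_mean_above(cutoff: float, pop_mean: float, pop_sd: float, n: int) -> float:
    """P(sample mean > cutoff), using standard error = pop_sd / sqrt(n)."""
    sampling_dist = NormalDist(pop_mean, pop_sd / sqrt(n))
    return 1 - sampling_dist.cdf(cutoff)


def chi_square_p_value_df1(stat: float) -> float:
    """Area above a chi-square statistic. Valid for df = 1 only (assumption)."""
    return 2 * (1 - Z.cdf(sqrt(stat)))


# z for a 95% confidence interval
z95 = z_for_confidence(0.95)
# Chattanooga household temperatures: mean 67.6, s.d. 4.2, n = 51, cutoff 68.1
p_temp = prob_sample_mean_above(68.1, 67.6, 4.2, 51)
# Ice cream independence test: chi-square statistic 5.2159367 with df = 1
p_ice = chi_square_p_value_df1(5.2159367)

print(round(z95, 2), round(p_temp, 3), round(p_ice, 3))
```

The first value should land near the 1.96 quoted for a 95% C.I., and the ice cream p-value should fall below $\alpha = .04$, which agrees with the decision to reject the null. For chi-square tests with more than one degree of freedom, or for t lookups, a dedicated calculator is still needed.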