Statistics: Important Terms, Sampling Methods, and Confidence Intervals - Prof. Oneal, Study notes of Statistics

These notes cover important statistical terms and topics: population, sample, subject, histogram interpretation, sampling methods (simple random sampling, systematic sampling, convenience sampling, stratified sampling, and cluster sampling), skewness, outliers, standard deviation, summary stats interpretation, least-squares regression, and confidence intervals. They explain concepts such as mean, median, z-score, relative risk, correlation, and regression, and also discuss the sampling distribution of sample means, means problems, and distribution shapes.

Typology: Study notes

2010/2011

Uploaded on 05/08/2011

quynhvudinh

Important Terms
• Population – Total set of subjects in which we are interested
• Sample – A subset of the population for which we have data
• Subject – Entities we measure (individuals)

Histogram Interpretation
• How many total students were sampled? 60 + 82 + 60 + 41 = 243
• Which class has the highest / lowest frequency? What are those frequencies?
  – Highest: “100-109” with 82
  – Lowest: “120-129” with 41
• How many students have an IQ between 110 and 129? 60 + 41 = 101

Stem-And-Leaf Plot
• A bar chart on its side
• “Stem” is all digits except the last one
• Last digit is the “leaf”
• Leaves are written in ascending order
• No commas
• If nothing in a row, write the row, but leave it blank

Example (HW 2.1-2.2)
eBay selling prices: 199 210 210 223 225 225 225 228 232 235
  19 | 9
  20 |
  21 | 0 0
  22 | 3 5 5 5 8
  23 | 2 5

Sampling Methods
• Simple Random Sampling
  – Every subject in the population has an equal chance of being selected
  – Often done with a random number table
  – Example: choosing a company somewhere in the U.S.
• Systematic
  – Selecting every “k-th” subject
  – Example: surveying every 10th person we meet downtown
• Convenience
  – Individuals are easily found (e.g. internet surveys)
  – Often the “laziest” way, so the answers are less reliable
• Stratified Sampling
  – Taking some subjects from all possible groups
• Cluster Sampling
  – Taking all subjects from some possible groups

Skewness
[Figure]

Outliers
• The mean is sensitive to outliers.
• The median is resistant to outliers.
• When outliers are present, it is best to use the median as the measure of central tendency.
• Example: average selling price of homes in the U.S.

Standard Deviation
• Roughly the typical distance between a data point and the mean of the data.
• Measures how much or how little the data distribution is spread out.

Summary Stats Interpretation
• Mean – Average of the data set
• Median (also called Q2) – About 50% of the data lie below (and above) this value.
• Range – Difference between maximum and minimum
• Max & Min – Highest and lowest points in the data set
• Q1 and Q3 – 25th and 75th percentiles
• Interquartile Range (IQR) – Difference between Q3 and Q1

Box-Plot (HW 2.5-2.6)
Distribution of taxes (in cents)
• Minimum = 2.6
• Q1 = 31
• Median = 55
• Q3 = 105
• Maximum = 206
• Construct a box-plot for this data.
• What proportion of states have taxes…
  – Greater than 31 cents? .75
  – Greater than $1.05 (105 cents)? .25
• Find the range and the interquartile range (IQR).
  – Range = max - min = 206 - 2.6 = 203.4
  – IQR = Q3 - Q1 = 105 - 31 = 74
  – The IQR is the range of the middle half of the data.

Box-Plot Outlier Test (HW 2.5-2.6)
• Any point lying above Q3 + 1.5 x IQR is an outlier.
• Any point lying below Q1 - 1.5 x IQR is also an outlier.
• Are there any outliers on this box-plot? (The pictured box-plot has Q1 = 256, Q3 = 1105, and a maximum of 320,000.)
• Q1 - 1.5 x IQR = 256 - 1.5 x (1105 - 256) = -1017.5
• Because there are no points beneath this cutoff, we have no lower outliers.
• Q3 + 1.5 x IQR = 1105 + 1.5 x (1105 - 256) = 2378.5
• Because the max is greater than this cutoff (320,000 > 2378.5), we have an upper outlier.

Regression (HW 3.2-3.4)
• Analysis says that we can use the length of an alligator (in feet) to predict its weight (in pounds). The equation is given by $\hat{y} = 10 + 40x$.
• Find the expected weight of an alligator that’s 10 feet long.
  – $\hat{y} = 10 + 40(10) = 410$ pounds
• Suppose an alligator that’s 10 feet long actually weighs 402 pounds. Calculate the residual.
  – Observed - Predicted = 402 - 410 = -8 (so we overestimated)
• Interpret the slope.
  – For every additional foot in length, an alligator’s weight is expected to increase by 40 pounds.
• Interpret the intercept.
  – Literally: an alligator with length 0 will weigh 10 pounds, which makes no sense! So, the intercept has no interpretation here.

Probability
• Probability is the likelihood of a particular outcome occurring.
• Example: probability of drawing a club from a deck of cards
• A complement – All possible events that are not in A
• Example:
  – $A$ = it’s raining
  – $A^c$ = it’s not raining
• Complement probability: $P(A^c) = 1 - P(A)$

Probability (HW 5.1 - 5.3)
• We have an urn full of 12 blue, 10 red, and 8 black marbles. We reach in and draw a marble at random.
• What’s the probability of drawing a marble that’s one of UGA’s colors?
  – 30 total marbles, and 10 + 8 = 18 of them are red or black
  – $\frac{18}{30} = .6$
• If the marble drawn was a UGA color, what’s the probability it was not red?
  – Out of 18 marbles of UGA colors (red or black), 8 of them are black (so not red)
  – $\frac{8}{18} = .44444$

Discrete Probability Distribution
• Two requirements:
  1. Each individual p(x) is between 0 and 1, inclusive
  2. All probabilities sum to 1
• The mean of a discrete distribution: $\text{MEAN} = \sum x \cdot p(x)$

Mean of a Distribution (HW 6.1-6.2)
• Here’s a table for the probability of different category hurricanes.

  Category (x)   Probability p(x)   x · p(x)
  1              0.15               0.15
  2              0.32               0.64
  3              0.36               1.08
  4              0.12               0.48
  5              0.05               0.25

• Add up the last column and get 2.6.
• The mean/expected value is the category strength of a hurricane we will expect to see on average. If we average a large number of hurricanes, the long-run average will be about 2.6.

Normal Distribution (Continuous)
Three rules:
1. Total probability / area under the normal curve is 1
2. Normal curve is symmetric
3. X value goes in left box; probability goes in right box on StatCrunch

Normal (HW 6.1-6.3)
• Study of entrance exam scores; mean is 120 with s.d. 11. Anything above 145 is considered superior.
• Find the z-score for a score of 145. Interpret it.
  – $z = \frac{145 - 120}{11} = 2.27$
  – A score of 145 is 2.27 standard deviations above the mean.
• What score is 1 standard deviation to the left of the mean?
  – z = -1.0 because it’s below the mean
  – $-1 = \frac{x - 120}{11} \Rightarrow x = (-1)(11) + 120 = 109$

Area Between Two Lines
• Find the probability within 1.28 standard deviations from the mean. Make a sketch of this.
• Try to think of a strategy for this.
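The entrance exam calculations above can be checked outside StatCrunch. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`. The helper names (`z_score`, `value_at`, `area_within`) are illustrative and do not come from the notes. One possible strategy for the area within 1.28 standard deviations is to subtract the area below z = -1.28 from the area below z = +1.28.

```python
from statistics import NormalDist

MEAN, SD = 120, 11  # entrance exam mean and s.d. from the example


def z_score(x, mean=MEAN, sd=SD):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


def value_at(z, mean=MEAN, sd=SD):
    """The raw score that sits z standard deviations from the mean."""
    return mean + z * sd


def area_within(z):
    """P(-z < Z < z) for a standard normal Z: area below +z minus area below -z."""
    std_normal = NormalDist()  # mean 0, s.d. 1
    return std_normal.cdf(z) - std_normal.cdf(-z)


z_145 = z_score(145)              # z-score for a "superior" score
x_below = value_at(-1)            # score one s.d. to the left of the mean
central_area = area_within(1.28)  # probability within 1.28 s.d. of the mean

print(f"z for 145: {z_145:.2f}")
print(f"score at z = -1: {x_below}")
print(f"area within 1.28 s.d.: {central_area:.4f}")
```

The central area comes out to roughly 0.80, so about 80% of scores fall within 1.28 standard deviations of the mean.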
Percentiles
• The Pth percentile is the x that gives P% below it on the normal curve
• Always below
• Example: here the x is the 30th percentile because 30% of the data falls below x

3 Types of Distributions
1. Population – Distribution of all points in the population
2. Sample Data / Data – Distribution of one particular sample
3. Sampling Dist. of Sample Means – Distribution of the sample means of a given size n

Means Problem

Distribution Shapes
• Sample Data (Data Distribution) is the same shape as the population
  – If the population is skewed, so is the sample data
• Sampling Distribution of Sample Means is normal if…
  – Population is normal, or…
  – n > 30 (by the Central Limit Theorem)
  – Otherwise no conclusion about shape

Two Important Properties
As the sample size n increases…
• The mean of the sampling distribution of sample means does not change.
• The standard error (s.d. of the sampling distribution) decreases.
• Example: $\frac{\sigma}{\sqrt{n_1}} > \frac{\sigma}{\sqrt{n_2}}$ whenever $n_1 < n_2$ (larger denominator, smaller overall fraction)

Distributions (HW 7.1-7.2)
• The lifetime of a certain type of tumble-dryer (until failure) has a distribution skewed right with mean 62 months and s.d. 11. A sample of 98 dryers is selected, and this sample has mean 57.2 and s.d. 12.1.
• What are the center and spread for the population?
  – Center = 62
  – Spread = 11
• What shape is the population?
  – Skewed right
• What are the center and spread for the sample data selected?
  – Center = 57.2
  – Spread = 12.1
• What shape is the sample data?
  – Skewed right (same as population)
• What are the center and spread for the sampling distribution of sample means with size 98?
  – Center = 62
  – Spread = $11 / \sqrt{98} = 1.11117$
• What shape is the sampling distribution of sample means?
  – Normal by the Central Limit Theorem (n > 30)

StatCrunch (HW 7.1-7.2)
• The average household temperature in Chattanooga is 67.6 degrees, and the s.d. is 4.2. A sample of 51 households is selected.
• What’s the probability the average of this sample will be above 68.1? Fill in the inputs on the StatCrunch box.
  – Use the sampling distribution with n = 51
  – Mean = 67.6 (the population’s)
  – S.D. = standard error = $4.2 / \sqrt{51} \approx 0.588$

Interpretation of a C.I.
• A 95% C.I. means that about 95% of all C.I.s constructed contain the true population proportion/mean, and about 5% do not (long-run definition)
• We are 95% confident the true proportion lies somewhere inside our C.I. (definition of an individual interval)
• A 99% C.I. means that about 99% of all C.I.s constructed contain the true population proportion/mean, and about 1% do not
• Example: 1000 intervals
  – At 95%, about 950 (maybe 940-960) contain the true proportion
  – At 99%, about 990 (maybe 985-995) contain the true proportion

Determining z
• z is set by the level of confidence
• 95% C.I.: z = 1.96
• To get these numbers…
  – At 95%, 5% is left over
  – Half of that is 2.5%
  – Solve P(z >= ?) = .025 in StatCrunch

Proportions C.I. (HW 8.1-8.2)
• A random sample of 970 people were asked if they owned a pet hamster. 19 said yes and 951 said no.
• Find a point estimate for the proportion of people who said yes.
  – $\hat{p} = \frac{19}{970} = .01959$
• If the margin of error is .00872, find the 95% confidence interval.
  – $.01959 \pm .00872 = (.01087, .02831)$

Proportions C.I. (HW 8.1-8.2)
• Suppose in a new sample for owning a pet hamster, we get a 95% confidence interval of (.03, .09).
• Can we find the sample proportion? If so, find it.
  – $\frac{.03 + .09}{2} = .06$, because the sample proportion is always in the center
• Can we find the population proportion? If so, find it.
  – We cannot, because the population proportion is unknown. It may or may not be inside the interval.
• Can we conclude that fewer than 12% of people own a pet hamster?
  – Yes, because .12 lies above this interval.
• How about more than 2%?
  – Yes, because .02 lies below this interval.
• What is the margin of error?
  – Difference between endpoint and center: .09 - .06 = .03

C.I. Properties
$\hat{p} \pm z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
• Increasing the level of confidence (z) widens the interval
• Decreasing the level of confidence (z) shortens the interval
• Intuition: narrowing your field for the true proportion means you’re not as certain it really does fall inside the interval
• Increasing the sample size shortens the C.I.
• Decreasing the sample size widens the C.I.
• This is because the standard error decreases as n increases, so the margin of error (and with it the width) decreases as well.
• Intuition: a larger sample size gives a more accurate estimate and allows you to zero in on the true proportion.

Summary of C.I. Width Factors
Confidence Level (z)
• As z increases, C.I. widens
• As z decreases, C.I. shortens
Sample Size (n)
• As n increases, C.I. shortens
• As n decreases, C.I. widens
• Assumptions for proportion C.I.:
  1. Sample is randomly selected
  2. $n\hat{p} \ge 15$
  3. $n(1 - \hat{p}) \ge 15$

C.I. with Means
• Same general idea: point estimate ± margin of error
• But we have a different formula: $\bar{x} \pm t \cdot \frac{s}{\sqrt{n}}$
• With proportions, use z
• With means, use t
• T-values change as degrees of freedom change (unlike the normal calculator)
• Degrees of freedom = n - 1
• Assumptions for doing C.I. for means:
  – Random sample
  – One of these two should be true:
    • Sampling from a normal population
    • n > 30

Choosing Sample Size
• Idea: We have a given confidence level and a desired margin of error
• What sample size is needed to achieve that?
• Formula is different for proportions and means (see formula sheet)

Sample Size Needed Formulas
• Proportions: $n = \frac{\hat{p}(1 - \hat{p}) z^2}{m^2}$
• Means: $n = \frac{\sigma^2 z^2}{m^2}$ (with s used as the estimate of $\sigma$)
• What do we choose for the sample proportion?
  1. Proportion of a previous study
  2. If nothing is known, $\hat{p} = .50$
• n = sample size needed
• z = z-score for confidence level
• s = sample standard deviation
• m = desired margin of error

Proportions Summary
Assumptions for a Valid Confidence Interval
• Random sample
• We need $n\hat{p} \ge 15$
• We need $n(1 - \hat{p}) \ge 15$
Finding Sample Size
• $n = \frac{\hat{p}(1 - \hat{p}) z^2}{m^2}$
Confidence Interval
• Point Estimate = $\hat{p}$
• Standard Error = $\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
• Level of Confidence: use z
• Margin of Error = $z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
• Lower Limit = $\hat{p} - z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
• Upper Limit = $\hat{p} + z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$

Means Summary
Assumptions for a Valid Confidence Interval
• Random sample
• One of these two:
  – Normal population, or...
  – n > 30
Finding Sample Size
• $n = \frac{\sigma^2 z^2}{m^2}$
Confidence Intervals
• Point Estimate = $\bar{x}$
• Standard Error = $\frac{s}{\sqrt{n}}$
• Level of confidence depends on t
• Margin of Error = $t \cdot \frac{s}{\sqrt{n}}$
• Lower Limit = $\bar{x} - t \cdot \frac{s}{\sqrt{n}}$
• Upper Limit = $\bar{x} + t \cdot \frac{s}{\sqrt{n}}$

Designed Experimental Study
• Manipulates the subjects somehow
• Can be used to establish causation
• Subjects are randomly divided into groups
• Examples:
  – Does a coupon attached to a catalogue make recipients more likely to order?
  – Does a new medicine reduce the frequency of headaches?

Observational Study
• Measures qualities of subjects without manipulating them
• Cannot be used to establish causation, only that the variables are related.
• Subjects cannot be randomly assigned to groups
• Examples:
  – Whether or not smoking has an effect on heart disease (can’t assign groups)
  – Are higher SAT scores positively correlated with higher college GPAs?

Designed Experiments
• Experimental Unit (subject) – The person/object that receives the treatment
• Treatment – A condition/drug/etc. applied to the subject
• Response Variable – Variable we are interested in studying
• Explanatory Variable – Variable we believe to influence the response

Designed Experiment (HW 4.1-4.4)
• We are testing the effects of a new energy drink on heart rate. 50 subjects are randomly assigned to consume the energy drink, while a different 50 have a similar-tasting drink that is not an energy booster. The subjects’ heart rates are recorded, and the researchers know which drink each subject gets.
• Response: heart rate
• Explanatory: type of drink received
• Treatments: energy and generic drinks
• Experimental Units: 100 subjects
• Is this completely randomized or matched pairs?
  – Completely randomized: no overlap in groups
• If the subjects don’t know which drink they get, is the study single or double blind?
  – Single, since the researchers know

Experimental Designs
• Completely Randomized
  – Experimental units are randomly assigned to treatments, with no overlap in groups.
  – That is, everybody gets just one treatment; nobody gets both.
• Matched Pairs
  – Subjects are somehow matched before the experiment happens, for measuring differences between the two
  – Twins, or the same person in two treatment groups

Comparing 2 Proportions or 2 Means (Independent)
• Testing 2 independent proportions
  – $H_0: p_1 = p_2$ or $p_1 - p_2 = 0$
  – $H_A: p_1 > p_2$ or $p_1 - p_2 > 0$
  – $H_A: p_1 < p_2$ or $p_1 - p_2 < 0$
  – $H_A: p_1 \ne p_2$ or $p_1 - p_2 \ne 0$
• Testing 2 independent means
  – $H_0: \mu_1 = \mu_2$ or $\mu_1 - \mu_2 = 0$
  – $H_A: \mu_1 > \mu_2$ or $\mu_1 - \mu_2 > 0$
  – $H_A: \mu_1 < \mu_2$ or $\mu_1 - \mu_2 < 0$
  – $H_A: \mu_1 \ne \mu_2$ or $\mu_1 - \mu_2 \ne 0$

2 Proportions (HW 10.1-10.4)
• Does a new medicine help lower cholesterol? People with high cholesterol were randomly assigned to receive either the new medicine or a placebo.
• After 5 weeks, 106 of the 8499 on the new medicine had lower cholesterol, and 86 of the 8091 in the placebo group had lower cholesterol. Is this a significant difference?
• Set up the hypotheses.
• The p-value is .2673. Interpret.
  – We do not have strong evidence that there are different results.
  – We don’t reject the null at .05.
  – The test is not significant.
  – The 95% confidence interval for the difference in proportions will contain 0.
  – If we make the wrong conclusion, it would be a Type II error.

2 Means (HW 10.1-10.4)
• A summary for types of sales for an iPod

                Mean      n
  Bid           231.611   96
  Buy-it-now    221.667   128

• Find the point estimate for the difference in population means.
  – 231.611 - 221.667 = 9.944
• The p-value is .006. Can we conclude there’s a population difference?
  – Yes; the p-value is below the common $\alpha = .05$, so we reject the null
• Without finding the CI, determine whether it will contain 0 or not.
  – We just concluded that the means are most likely different, so 0 is not a plausible value for the true difference. It will therefore lie outside the interval.

Chi-Square Goodness of Fit
• Used for testing if category proportions/counts are equal to specified values
• Or, used for testing if category proportions/counts are all equal to each other.
• Example hypotheses:
  – $H_0$: Proportions are as stated (example: .20, .30, .50); $H_A$: otherwise
  – $H_0$: Proportions are all equal to one another; $H_A$: otherwise

Goodness of Fit Steps
1. List hypotheses:
   • $H_0$: the proportions are as claimed
   • $H_A$: otherwise
2. Check assumptions:
   A. The sample is randomly selected
   B. Each expected cell count is at least 5
3. Compute the test statistic: $\chi^2 = \sum \frac{(\text{obs.} - \text{exp.})^2}{\text{exp.}}$
4. State degrees of freedom: c - 1 (number of cells minus 1)
5. Find the p-value on the Chi-Square distribution, on df = c - 1, by finding the probability above $\chi^2$
6. State the conclusion

Computing the p-value
• Suppose we have 6 categories and $\chi^2 = 2.12164$. Then df = 6 - 1 = 5
• Look up the probability above the test statistic $\chi^2$, never below.

Conclusions for Goodness of Fit
If p-value < $\alpha$ (alpha)
  – Reject $H_0$
  – Test is significant
  – There is enough evidence to suggest the proportions are different than what’s specified
  – Possible Type I Error
If p-value > $\alpha$
  – Fail to reject $H_0$
  – Test is not significant
  – There is insufficient evidence to suggest the proportions are different than what’s specified
  – Possible Type II Error

Goodness Of Fit (HW 11.1-11.2)
• It is thought that a certain type of cookie box should contain the following percentages of three varieties:
• Chocolate Chip: 40%  Oatmeal: 30%  Sugar: 30%
• A box is selected at random and opened.
Here are the observed counts of 50 cookies:

                   Observed   Expected
  Chocolate Chip   18         50 x .40 = 20
  Oatmeal          19         50 x .30 = 15
  Sugar            13         50 x .30 = 15

• Compute the expected category counts. Will we have a valid chi-square test?
  – Yes, because the sample is random, and each expected count is at least 5.
• True/False: the degrees of freedom for this problem would be 50 - 1 = 49.
  – False: it is 3 - 1 = 2, since we do categories - 1 for goodness-of-fit.

Goodness of Fit (HW 11.1-11.2)
• We want to see if a 20-sided die is fair (balanced). To investigate this, we roll it 80 times and record the number of times each face comes up.
• The $\chi^2$ statistic is 21.007. Fill in the boxes below to find the p-value.
  – df = 20 - 1 = 19 (categories - 1)

Chi-Square Test for Independence
• Used in an r x c contingency table to test if there’s an association between the categorical variables (cf. contingency table association from Test 1)
• The null hypothesis is that the explanatory and response variables are independent.
• The alternative hypothesis is that there is an association between them.

Independence Test Steps
1. $H_0$: variables are independent. $H_A$: association
2. Assumptions: random sample, and each expected count is at least 5
3. Compute each expected count: $\text{exp.} = \frac{(\text{row total})(\text{column total})}{\text{overall total}}$
4. Compute the test statistic: $\chi^2 = \sum \frac{(\text{obs.} - \text{exp.})^2}{\text{exp.}}$
5. Compute df = (r - 1)(c - 1)
6. Find the p-value: the probability above $\chi^2$
7. State your conclusion

Conclusions for Independence Test
If p-value < $\alpha$ (alpha)
  – Reject $H_0$
  – Test is significant
  – There is enough evidence to suggest the explanatory and response variables are related somehow
  – Possible Type I Error
If p-value > $\alpha$
  – Fail to reject $H_0$
  – Test is not significant
  – There is insufficient evidence to suggest the explanatory and response variables are related
  – Possible Type II Error

Independence Test (HW 11.1 - 11.2)
• State the null and alternative hypotheses.
  – $H_0$: gender and ice cream preference are independent
  – $H_A$: gender and ice cream preference are related somehow

           Chocolate   Strawberry   Total
  Male     70          91           161
  Female   82          63           145
  Total    152         154          306

• Find the expected count for males preferring strawberry.
  – $\text{exp.} = \frac{(161)(154)}{306} = 81.026$

Independence Test (HW 11.1 - 11.2)
• Same question: the test statistic is 5.2159367. Draw a sketch and fill in the window below.
  – df = (2 - 1)(2 - 1) = 1
• What is your conclusion, at $\alpha = .04$?
  – Reject the null. There is strong evidence that gender and preference for ice cream flavor are related somehow.
• True/False: when all the observed and expected counts are far apart from each other, the chi-square statistic will be small.
  – False: each $(\text{obs.} - \text{exp.})^2 / \text{exp.}$ calculation will be large, and so their sum is as well.

True/False Questions
• A 94% confidence interval will give the same conclusions as a right-tailed hypothesis test at $\alpha = .06$.
  – False: $\alpha$ is correct, but confidence intervals coincide with two-tailed tests only
• Margin of error is the distance between the point estimate and one of the endpoints of a confidence interval.
  – True: see earlier question in this handout
• For a hypothesis test with proportions, the p-value can be found using the normal calculator.
  – True (the T goes with means problems)
• All other things being equal, a 93% confidence interval is shorter than a 96% interval.
  – True: the higher the confidence, the wider the interval

StatCrunch
• Summary Stats
  – STAT > Summary Stats > Columns
• Regression
  – STAT > Regression > Simple Linear
• Calculators
  – STAT > Calculators > Normal & T & Chi-Square
• Single Proportion
  – STAT > Proportions > One Sample > With Summary
• Two Independent Proportions
  – STAT > Proportions > Two Sample > With Summary
• Single Mean
  – STAT > T-Statistics > One Sample > With Data or Summary
• Two Independent Means
  – STAT > T-Statistics > Two Sample > With Data or Summary
  – Uncheck “Pool Variances”
• Two Dependent Means
  – STAT > T-Statistics > Paired
• Chi-Square Goodness of Fit
  – STAT > Goodness-Of-Fit > Chi-Square
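For readers who want to cross-check the chi-square homework numbers without StatCrunch, here is a minimal Python sketch using only the standard library. The function names are illustrative and not part of the notes. The p-value helpers rely on closed forms that hold only for df = 1 and df = 2, which happen to be the degrees of freedom in the ice cream and cookie examples; other df values (such as the 20-sided die, df = 19) would need a chi-square calculator or a statistics library.

```python
import math


def chi_square_stat(observed, expected):
    """Sum of (obs - exp)^2 / exp over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def expected_counts(table):
    """Expected count per cell: (row total)(column total) / overall total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def p_value_df1(stat):
    """Area above stat on a chi-square curve with df = 1 (closed form, df = 1 only)."""
    return math.erfc(math.sqrt(stat / 2))


def p_value_df2(stat):
    """Area above stat on a chi-square curve with df = 2 (closed form, df = 2 only)."""
    return math.exp(-stat / 2)


# Goodness of fit: cookie box, 50 cookies, claimed 40% / 30% / 30%
cookie_obs = [18, 19, 13]
cookie_exp = [50 * p for p in (0.40, 0.30, 0.30)]
cookie_stat = chi_square_stat(cookie_obs, cookie_exp)
cookie_p = p_value_df2(cookie_stat)  # df = 3 - 1 = 2

# Independence: gender (rows) by ice cream preference (columns)
ice_obs = [[70, 91], [82, 63]]
ice_exp = expected_counts(ice_obs)
ice_stat = chi_square_stat(
    [o for row in ice_obs for o in row],
    [e for row in ice_exp for e in row],
)
ice_p = p_value_df1(ice_stat)  # df = (2 - 1)(2 - 1) = 1

print(f"cookies: chi-square = {cookie_stat:.4f}, p = {cookie_p:.4f}")
print(f"male/strawberry expected = {ice_exp[0][1]:.3f}")
print(f"ice cream: chi-square = {ice_stat:.4f}, p = {ice_p:.4f}")
```

The ice cream p-value lands below .04, which agrees with the decision to reject the null in that problem, and the expected count for males preferring strawberry matches the 81.026 computed by hand.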