Statistics: Important Terms, Histogram Interpretation, and Confidence Intervals - Prof. P, Study notes of Statistics

These notes cover essential statistics terms, including population, sample, and subject. They also cover histogram interpretation, focusing on calculating total students, identifying classes with the highest or lowest frequencies, and understanding z-scores. Additionally, they explain the concepts of relative risk and correlation. The notes conclude with a discussion of confidence intervals, their properties, and how to determine z-scores for different levels of confidence.

Typology: Study notes

2010/2011

Uploaded on 05/08/2011

nicoletb
Important Terms
• Population – Total set of subjects in which we are interested
• Sample – A subset of the population for which we have data
• Subject – Entities we measure (individuals)

Histogram Interpretation
• How many total students were sampled?
• Which class has the highest / lowest frequency? What are those frequencies?
• How many students have an IQ between 110 and 129?

Stem-And-Leaf Plot
• A bar chart on its side
• “Stem” is all digits except the last one
• Last digit is the “leaf”
• Ascending order
• No commas
• If nothing in a row, write the row, but leave it blank

Example (HW 2.1-2.2)
eBay selling prices: 199 210 210 223 225 225 225 228 232 235

Sampling Methods
• Simple Random Sampling
  – Each subject in the population has an equal chance of being selected
  – Often done with a random number table
  – Example: choosing a company somewhere in the U.S.
• Systematic
  – Selecting every “k-th” subject
  – Example: surveying every 10th person we meet downtown
• Convenience
  – Individuals are easily found (e.g. internet surveys)
  – Often the “laziest” way, so less reliable answers

Sampling Methods
• Stratified Sampling – Taking some subjects from all possible groups
• Cluster Sampling – Taking all subjects from some possible groups

Skewness

Outliers
• The mean is sensitive to outliers.
• The median is resistant to outliers.
• When outliers are present, it is best to use the median as the measure of central tendency.
• Example: average selling price of homes in the U.S.

Standard Deviation
• Roughly the typical distance between a data point and the mean of the data.
• Measures how spread out the data distribution is.

Summary Stats Interpretation
• Mean – Average of the data set
• Median (also called Q2) – About 50% of data lie below (and above) this value.
• Range – Difference between maximum and minimum
• Max & Min – Highest and lowest points in data set
• Q1 and Q3 – 25th and 75th percentiles
• Interquartile Range (IQR) – Difference between Q3 and Q1

Box-Plot (HW 2.5-2.6)
Distribution of taxes (in cents)
• Minimum = 2.6
• Q1 = 31
• Median = 55
• Q3 = 105
• Maximum = 206
• Construct a box-plot for this data.
• What proportion of states have taxes…
  – Greater than 31 cents?
  – Greater than 105 cents (1.05 dollars)?
• Find the range and the interquartile range (IQR).

Box-Plot Outlier Test (HW 2.5-2.6)
• Any point lying above Q3 + 1.5 × IQR is an outlier.
• Any point lying below Q1 – 1.5 × IQR is also an outlier.
• Are there any outliers on this box-plot?

Mean & Median (HW 2.3-2.4)
• This chart shows the number of grams of protein in various brands of loaves of bread. Compute the mean and median of the data set. What can you say about the shape of the distribution?

Probability (HW 5.1 - 5.3)
• We have an urn containing 12 blue, 10 red, and 8 black marbles. We reach in and draw a marble at random.
• What’s the probability of drawing a marble that’s one of UGA’s colors?
• If the marble drawn was a UGA color, what’s the probability it was not red?

Discrete Probability Distribution
• Two requirements:
  1. Each individual p(x) is between 0 and 1, inclusive
  2. All probabilities sum to 1
• The mean of a discrete distribution: \( \mu = \sum x \cdot p(x) \)

Mean of a Distribution (HW 6.1-6.2)
• Here’s a table for the probability of different category hurricanes.
• Find the missing value and the mean/expected value of this data set.

Category  Probability
1         0.15
2         0.32
3
4         0.12
5         0.05

Normal Distribution (Continuous)
Three rules:
1. Total probability / area under the normal curve is 1
2. Normal curve is symmetric
3. X value goes in left box; probability goes in right box on StatCrunch

Normal (HW 6.1-6.3)
• Study of entrance exam scores; mean is 120 with s.d. 11. Anything above 145 is considered superior.
• Find the z-score for a score of 145. Interpret it.
• What score is 1 standard deviation to the left of the mean?

Area Between Two Lines
• Find the probability within 1.28 standard deviations of the mean. Make a sketch of this.
• Try to think of a strategy for this.

Percentiles
• The Pth percentile is the x value that has P% of the normal distribution below it
• Always below
• Example: if 30% of the data falls below x, then x is the 30th percentile

3 Types of Distributions
1. Population – Distribution of all points in the population
2. Sample Data / Data – Distribution of one particular sample
3. Sampling Dist. of Sample Means – Distribution of the sample means of a given size n

Means Problem

Distribution Shapes
• Sample Data (Data Distribution) is roughly the same shape as the population
  – If population is skewed, so is the sample data
• Sampling Distribution of Sample Means is approximately normal if…
  – Population is normal, or…
  – n > 30 (by the Central Limit Theorem)
  – Otherwise no conclusion about shape

Two Important Properties
As the sample size n increases…
• The mean of the sampling distribution of sample means does not change.
• The standard error (s.d. of the sampling distribution) decreases.
• Example: for \( n_1 < n_2 \), \( \sigma/\sqrt{n_1} > \sigma/\sqrt{n_2} \) (larger denominator, smaller overall fraction)

Distributions (HW 7.1-7.2)
• The lifetime of a certain type of tumble-dryer (until failure) has a distribution skewed right with mean 62 months and s.d. 11. A sample of 98 dryers is selected, and this sample has mean 57.2 and s.d. 12.1.
• What are the center and spread for the population?
• What shape is the population?

Distributions (HW 7.1-7.2)
• What are the center and spread for the sample data selected?
• What shape is the sample data?
• What are the center and spread for the sampling distribution of sample means with size 98?
• What shape is the sampling distribution of sample means?

StatCrunch (HW 7.1-7.2)
• The average household temperature in Chattanooga is 67.6 degrees, and the s.d. is 4.2. A sample of 51 households is selected.
• What’s the probability the average of this sample will be above 68.1? Fill in the inputs on the StatCrunch box.

StatCrunch (HW 7.1-7.2)
• What’s the probability the average of the sample will be within 1.5 degrees of the population mean?
• Hint: draw a sketch! Think of a strategy for answering this.

Proportions Problem

Proportions (HW 7.1-7.3)
• 57% of students at an academy are female. In a random sample of 55 students, 26 of them are female.
• Let 1 = female and 0 = male.
• Identify the population distribution of gender.

X  P(X)
1
0

Proportions (HW 7.1-7.3)
• Identify the data distribution of gender.

X  P(X)
1
0

• What are the mean & standard error of the sampling distribution of the sample proportion?
• Is the sampling distribution approximately normal?

C.I. Properties
• Increasing the sample size shortens the C.I.
• Decreasing the sample size widens the C.I.
• This is because standard error decreases as n increases, so the margin of error (half the width) decreases as well.
• Intuition: a larger sample size gives a more precise estimate and allows you to zero in on the true proportion.

\( \hat{p} \pm z \sqrt{\hat{p}(1-\hat{p})/n} \)

Summary of C.I. Width Factors
Confidence Level (z)
• As z increases, C.I. widens
• As z decreases, C.I. shortens
Sample Size (n)
• As n increases, C.I. shortens
• As n decreases, C.I. widens
• Assumptions for proportion C.I.:
  1. Sample is randomly selected
  2. \( n\hat{p} \ge 15 \)
  3. \( n(1-\hat{p}) \ge 15 \)

C.I. with Means
• Same general idea: point estimate ± margin of error
• But we have a different formula: \( \bar{x} \pm t \cdot s/\sqrt{n} \)
• With proportions, use z
• With means, use t

C.I. with Means
• T-values change as degrees of freedom change (unlike the normal calculator)
• Degrees of freedom = n – 1
• Assumptions for doing C.I. for means:
  – Random sample
  – One of these two should be true:
    • Sampling from a normal population
    • n > 30

Choosing Sample Size
• Idea: We have a given confidence level and a desired margin of error
• What sample size is needed to achieve that?
• Formula is different for proportions and means (see formula sheet)

Sample Size Needed Formulas
• What do we choose for the sample proportion?
  1. Proportion of a previous study
  2. If nothing is known, \( \hat{p} = .50 \)
• For means: \( n = \frac{s^2 z^2}{m^2} \)
  n = sample size needed
  z = z-score for confidence level
  s = sample standard deviation
  m = desired margin of error

Proportions Summary
Assumptions for a Valid Confidence Interval
• Random sample
• We need \( n\hat{p} \ge 15 \)
• We need \( n(1-\hat{p}) \ge 15 \)
Finding Sample Size
\( n = \frac{\hat{p}(1-\hat{p}) z^2}{m^2} \)
Confidence Interval
Point Estimate = \( \hat{p} \)
Standard Error = \( \sqrt{\hat{p}(1-\hat{p})/n} \)
Level of Confidence: use z
Margin of Error = \( z \sqrt{\hat{p}(1-\hat{p})/n} \)
Lower Limit = \( \hat{p} - z \sqrt{\hat{p}(1-\hat{p})/n} \)
Upper Limit = \( \hat{p} + z \sqrt{\hat{p}(1-\hat{p})/n} \)

Means Summary
Assumptions for a Valid Confidence Interval
• Random Sample
• One of these two:
  – Normal population, or...
  – n > 30
Finding Sample Size
\( n = \frac{s^2 z^2}{m^2} \)
Confidence Intervals
Point Estimate = \( \bar{x} \)
Standard Error = \( s/\sqrt{n} \)
Level of confidence depends on t
Margin of Error = \( t \cdot s/\sqrt{n} \)
Lower Limit = \( \bar{x} - t \cdot s/\sqrt{n} \)
Upper Limit = \( \bar{x} + t \cdot s/\sqrt{n} \)

Designed Experimental Study
• Manipulates the subjects somehow
• Can be used to prove causation
• Subjects randomly divided into groups
• Examples:
  – Does a coupon attached to a catalogue make recipients more likely to order?
  – Does a new medicine reduce the frequency of headaches?

Observational Study
• Measures qualities of subjects without manipulating them
• Cannot be used to prove causation, only that the variables are related.
• Subjects cannot be randomly assigned to groups
• Examples:
  – Whether or not smoking has an effect on heart disease (can’t assign groups)
  – Are higher SAT scores positively correlated with higher college GPAs?

Designed Experiments
• Experimental Unit (subject) – The person/object that receives the treatment
• Treatment – A condition/drug/etc. applied to the subject
• Response Variable – Variable we are interested in studying
• Explanatory Variable – Variable we believe to influence the response

Designed Experiment (HW 4.1-4.4)
• We are testing the effects of a new energy drink on heart rate.
50 subjects are randomly assigned to consume the energy drink, while a different 50 have a similar-tasting drink that is not an energy booster. The subjects’ heart rates are recorded, and the researchers know which drink each subject gets.
• Response:
• Explanatory:
• Treatments:
• Experimental Units:
• Is this completely randomized or matched pairs?
• If the subjects don’t know which drink they get, is the study single or double blind?

Experimental Designs
• Completely Randomized
  – Experimental units are randomly assigned to treatments, with no overlap between groups.
  – That is, everybody gets just one treatment; nobody gets both.
• Matched Pairs
  – Subjects are matched before the experiment so that differences within each pair can be measured
  – Twins, or same person in two treatment groups

Experimental Designs
• Cross-over Design
  – A matched-pairs design in which each subject receives both treatments at some point in the experiment
  – Cereal lab
• Block: a set of matched experimental units / subjects
• Randomized Block Design
  – Using blocks, but randomly assigning the order in which each block receives the treatment
  – This reduces possible bias
  – Cereal lab again (order was random)

Hypotheses
• The alternative hypothesis tests if a parameter is greater than, less than, or not equal to a suspected number.
• The null hypothesis sets the parameter equal to the suspected number stated in the alternative. (Null is always equals!)
• We test parameters p or µ
• We never test statistics \( \hat{p} \) or \( \bar{x} \)

Example Hypotheses

p = population proportion
p0 = hypothesized proportion (under H0)
\( \hat{p} \) = sample proportion
z-stat = test statistic (proportions)
µ = population mean
µ0 = hypothesized mean (under H0)
\( \bar{x} \) = sample mean
t-stat = test statistic (means)
p-value = probability that you observe this sample proportion/mean (or one further away) when H0 is true.
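
As a rough sketch of how the notation above fits together, the following Python computes a z-stat and p-value for a one-proportion test using the standard large-sample formula z = (p̂ − p0) / sqrt(p0(1 − p0)/n). The function name `one_proportion_z_test`, the left-tailed framing, and the reuse of the academy numbers (p0 = 0.57, 26 females out of 55) as a hypothesis test are assumptions made for illustration only; the notes themselves do these calculations in StatCrunch.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def one_proportion_z_test(successes: int, n: int, p0: float, tail: str = "two-sided"):
    """Return (p_hat, z_stat, p_value) for H0: p = p0.

    tail is "less", "greater", or "two-sided" and matches the direction of Ha.
    """
    p_hat = successes / n
    # The standard error under H0 uses the hypothesized proportion p0, not p_hat.
    se0 = math.sqrt(p0 * (1.0 - p0) / n)
    z_stat = (p_hat - p0) / se0
    if tail == "less":
        p_value = normal_cdf(z_stat)
    elif tail == "greater":
        p_value = 1.0 - normal_cdf(z_stat)
    elif tail == "two-sided":
        p_value = 2.0 * (1.0 - normal_cdf(abs(z_stat)))
    else:
        raise ValueError("tail must be 'less', 'greater', or 'two-sided'")
    return p_hat, z_stat, p_value


# Hypothetical test built from the academy example: H0: p = 0.57 vs Ha: p < 0.57.
p_hat, z_stat, p_value = one_proportion_z_test(26, 55, 0.57, tail="less")
print(f"p-hat = {p_hat:.4f}, z-stat = {z_stat:.3f}, p-value = {p_value:.4f}")
```

The p-value printed here is the probability of seeing a sample proportion this low or lower if H0 were true, which is the interpretation given in the notation list.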
Hypothesis Testing Notation

P-Value Interpretation
• The p-value is the probability that you could observe a given sample mean, or one further away, if in fact the null is true.
• Example: H0: µ = 40, Ha: µ > 40, \( \bar{x} = 50 \), p-value = .12
• The probability of observing a sample mean of 50 (or higher), if the true mean were 40, is .12. Approximately 12% of sample means can be expected to be at least this much higher than 40.

Goodness of Fit Steps
1. List hypotheses:
   • H0: the proportions are as claimed
   • Ha: otherwise
2. Check assumptions:
   A. The sample is randomly selected
   B. Each expected cell count is at least 5
3. Compute the test statistic: \( \chi^2 = \sum \frac{(\text{obs.} - \text{exp.})^2}{\text{exp.}} \)
4. State degrees of freedom: c - 1 (number of cells minus 1)
5. Find the p-value on the Chi-Square distribution, on df = c - 1, by finding the probability above \( \chi^2 \)
6. State the conclusion

Computing the p-value
• Suppose we have 6 categories and \( \chi^2 = 2.12164 \). Then df = 6 - 1 = 5
• Look up the probability above the test statistic \( \chi^2 \), never below.

Conclusions for Goodness of Fit
If p-value < α (alpha)
  – Reject H0
  – Test is significant
  – There is enough evidence to suggest the proportions are different than what’s specified
  – Possible Type I Error
If p-value > α
  – Fail to reject H0
  – Test is not significant
  – There is insufficient evidence to suggest the proportions are different than what’s specified
  – Possible Type II Error

Goodness Of Fit (HW 11.1-11.2)
• It is thought that a certain type of cookie box should contain the following percentages of three varieties:
• Chocolate Chip: 40%   Oatmeal: 30%   Sugar: 30%
• A box is selected at random and opened. Here are the observed counts of 50 cookies:

                Observed  Expected
Chocolate Chip  18
Oatmeal         19
Sugar           13

• Compute the expected category counts. Will we have a valid chi-square test?
• True/False: the degrees of freedom for this problem would be 50 - 1 = 49.

Goodness of Fit (HW 11.1-11.2)
• We want to see if a 20-sided die is fair (balanced).
To investigate this, we roll it 80 times and record the number of times each face comes up.
• The \( \chi^2 \) statistic is 21.007. Fill in the boxes below to find the p-value.

Chi-Square Test for Independence
• Used in an r x c contingency table to test if there’s an association between the categorical variables (cf. contingency table association from Test 1)
• The null hypothesis is that the explanatory and response variables are independent.
• The alternative hypothesis is that there is an association between them.

Independence Test Steps
1. H0: variables are independent. Ha: association
2. Assumptions: random sample, and each expected count is at least 5
3. Compute each expected count: \( \text{exp.} = \frac{(\text{row total})(\text{column total})}{\text{overall total}} \)
4. Compute the test statistic: \( \chi^2 = \sum \frac{(\text{obs.} - \text{exp.})^2}{\text{exp.}} \)
5. Compute df = (r - 1)(c - 1)
6. Find the p-value: the probability above \( \chi^2 \)
7. State your conclusion

Conclusions for Independence Test
If p-value < α (alpha)
  – Reject H0
  – Test is significant
  – There is enough evidence to suggest the explanatory and response variables are related somehow
  – Possible Type I Error
If p-value > α
  – Fail to reject H0
  – Test is not significant
  – There is insufficient evidence to suggest the explanatory and response variables are related
  – Possible Type II Error

Independence Test (HW 11.1 - 11.2)
• State the null and alternative hypotheses.
• Find the expected count for males preferring strawberry.

        Chocolate  Strawberry  Total
Male    70         91          161
Female  82         63          145
Total   152        154         306

Independence Test (HW 11.1 - 11.2)
• Same question: the test statistic is 5.2159367. Draw a sketch and fill in the window below.
• What is your conclusion, at α = .04?
• True/False: when all the observed and expected counts are far apart from each other, the chi-square statistic will be small.

True/False Questions
• A 94% confidence interval will give the same conclusions as a right-tailed hypothesis test at α = .06.
• Margin of error is the distance between the point estimate and one of the endpoints of a confidence interval.
• For a hypothesis test with proportions, the p-value can be found using the normal calculator.
• All other things being equal, a 93% confidence interval is shorter than a 96% interval.

StatCrunch
• Summary Stats
  – STAT > Summary Stats > Columns
• Regression
  – STAT > Regression > Simple Linear
• Calculators
  – STAT > Calculators > Normal & T & Chi-Square
• Single Proportion
  – STAT > Proportions > One Sample > With Summary
• Two Independent Proportions
  – STAT > Proportions > Two Sample > With Summary
• Single Mean
  – STAT > T-Statistics > One Sample > With Data or Summary
• Two Independent Means
  – STAT > T-Statistics > Two Sample > With Data or Summary
  – Uncheck “Pool Variances”
• Two Dependent Means
  – STAT > T-Statistics > Paired
• Chi-Square Goodness of Fit
  – STAT > Goodness-Of-Fit > Chi-Square
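
For readers without StatCrunch, the independence test steps from the chi-square section can be sketched in plain Python. This is a minimal illustration, not the course's method: the function names are hypothetical, the table is the male/female by chocolate/strawberry counts from the homework problem, and the p-value helper relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal, so it is only valid for 2 x 2 tables.

```python
import math

# Observed counts from the homework table (rows: Male, Female;
# columns: Chocolate, Strawberry).
observed = [
    [70, 91],
    [82, 63],
]


def chi_square_independence(table):
    """Return (expected counts, chi-square statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    overall = sum(row_totals)
    # Step 3: expected = (row total)(column total) / (overall total)
    expected = [[r * c / overall for c in col_totals] for r in row_totals]
    # Step 4: chi-square = sum of (obs - exp)^2 / exp over all cells
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # Step 5: df = (r - 1)(c - 1)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, chi2, df


def chi2_upper_tail_df1(chi2):
    """Probability above chi2 when df = 1 (assumption: 2 x 2 table only)."""
    return math.erfc(math.sqrt(chi2 / 2.0))


expected, chi2, df = chi_square_independence(observed)
p_value = chi2_upper_tail_df1(chi2)  # valid here because df == 1

print(f"expected count, males preferring strawberry: {expected[0][1]:.3f}")
print(f"chi-square = {chi2:.4f}, df = {df}, p-value = {p_value:.4f}")
print("reject H0 at alpha = .04" if p_value < 0.04 else "fail to reject H0 at alpha = .04")
```

The computed statistic should agree with the 5.2159367 quoted in the problem. For tables larger than 2 x 2, or for goodness-of-fit tests with more than two categories, the upper-tail probability would need a general chi-square calculator instead of `chi2_upper_tail_df1`.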