Analyzing Relationship: Scatterplots, Regression, and Hypothesis Testing, Exams of Probability and Statistics

This document introduces the analysis of relationships between quantitative variables using scatterplots, linear regression, and hypothesis testing. It works through an example examining the relationship between ice cream consumption and temperature, and explains linear and non-linear relationships, correlation coefficients, and hypothesis testing for a linear relationship. It also covers the importance of interpreting results with caution, avoiding data snooping, and making a Bonferroni adjustment.

Typology: Exams

Pre 2010

Uploaded on 08/08/2009

koofers-user-x6r

Stat Handout 5 Linear Regression Math 382

5.1 Introduction

Suppose we want to investigate the relationship between two continuous random variables. Note that we will look at the relationship between a qualitative random variable and a continuous random variable with ANOVA, and between two qualitative random variables with a Chi-Squared Test.

Example: Ice Cream Consumption

Suppose we are thinking of opening an ice cream business and we want to know whether there is a relationship between the amount of ice cream consumed and the temperature. We decided to collect data over a 30-week time period from March to July. For each week, we recorded the average amount of ice cream consumed (per person) as well as the mean temperature. The data are presented on the last page of the handout.

We have seen data like this before. It is paired data: for each of the 30 weeks, we have two pieces of information. However, we are not interested in the difference of the two means in this case. Our new question is whether there is a pattern or predictable relationship between the two variables. In other words, as temperature increases, what happens to ice cream consumption (does it increase, decrease, or stay the same)? If we do see a pattern that is strong enough, we can build a model from our sample data that allows us to make predictions about the population and helps our ice cream business be more successful.

To examine the relationship between two quantitative variables (we will call them X and Y) we use what is called a scatterplot, a two-dimensional grid system that contains a horizontal axis (for the X variable) and a vertical axis (for the Y variable).

[Figure: Plot of Ice Cream Consumption vs. Temperature. Horizontal axis: Mean Temp (F); vertical axis: Pints per Person.]

Looking at the data, do you see a relationship between temperature and ice cream consumption? If so, describe this relationship.

Types of Relationship between Quantitative Variables

Two quantitative variables can be related to each other in a number of ways. Some variables have a linear relationship; others have relationships that are best described as non-linear. Variables that have a non-linear relationship might be related through a simple curve (such as the number of fruit flies that multiply over time), or the relationship might be much more involved, requiring more complicated functions to describe (such as stock market prices over time).

5.2 Measuring the Strength of the Linear Relationship

A scatterplot can help give us a general idea as to whether or not there is a linear relationship between two variables, how strong the relationship appears to be in the sample, and what the direction of the relationship is.

Direction of the Linear Relationship. If the pattern goes uphill, we say the linear relationship is positive (both variables increase together or decrease together). For example, height and weight of male adults exhibit a positive linear relationship. If the pattern goes downhill, we say the linear relationship is negative (as one variable increases, the other decreases). For example, we hope that as the number of police officers increases, the number of crimes decreases.

Strength of the Linear Relationship. The strength of the linear relationship is a measure of how closely the pattern of observed values resembles a straight line. If the data points line up perfectly, we say there is a perfect linear relationship. If the points lie quite close to the line overall, we say the relationship is strong.
If the points don't show much of a pattern, yet seem to resemble a cloud going uphill, the relationship is weak. If the points are scattered everywhere (or in cases where a different type of pattern exists) we say there is no linear relationship.

To measure the strength and direction of the linear relationship between two quantitative variables, we will calculate the correlation coefficient. We draw conclusions about the population correlation coefficient, $\rho$, by calculating the sample correlation coefficient, $r$:

$$r = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{(n-1)\, s_y s_x} = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}}$$

where

$$SS_{xy} = \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y},$$

$$SS_x = (n-1)\, s_x^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2,$$

and

$$SS_y = (n-1)\, s_y^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2.$$

The Bonferroni adjustment asks the researcher to compare their usual p-value to 0.05/n, where n is the number of hypothesis tests being conducted. This adjustment should be made when at least 5 tests are conducted. For example, if 10 hypothesis tests were conducted, our new cutoff for the p-value for each test is 0.05/10 = 0.005. If your p-value for any of these 10 tests is less than 0.005, then you reject $H_0$ for that test. Otherwise, you can't reject $H_0$.

When evaluating results presented to you, be sure to check whether the authors have made an adjustment for the number of tests they conducted. Sometimes people report only the significant results, and don't tell you how many hypothesis tests they conducted before they found that result. That is what we mean by data snooping, or fishing for results.

5.3 Least Squares Regression Line

If there is a linear relationship between two continuous variables, we assume that the equation for the regression line for the population is

$$Y_{ij} = \beta_0 + \beta_1 X_i + \varepsilon_{ij}$$

where $\beta_0$ is the y-intercept, $\beta_1$ is the slope (the amount by which y changes when x increases by one unit), and $\varepsilon_{ij}$ is the error (the vertical distance from an observation to the line).

Assumptions:
1. We are investigating only linear relationships.
2. $E(Y_i) = \beta_0 + \beta_1 X_i$ for each $X_i$.
3. The $\varepsilon_{ij}$'s are normally distributed with mean 0 and variance $\sigma^2$ for each $X_i$.

If we take a random sample of n observations, we measure the explanatory variable, X, and the response variable, Y, for each individual. The equation of the least-squares regression line is

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \quad \text{for } i = 1, 2, \ldots, n.$$

Note that $\hat{y}_i$ is the point estimate of $E(Y_i) = \beta_0 + \beta_1 X_i$.

Estimated slope: $\hat{\beta}_1 = r \, \dfrac{s_y}{s_x}$

Estimated y-intercept: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

where $\bar{x}$ is the sample mean and $s_x$ is the sample standard deviation of the $x_i$, $\bar{y}$ is the sample mean and $s_y$ is the sample standard deviation of the $y_i$, and $r$ is the sample correlation.

These formulas determine the line of best fit, or the least squares line, which is the line that minimizes the sum of squares of the vertical distances from the points (observations) to the line. Notice that if the correlation coefficient, r, is positive then the slope will also be positive (likewise, if r is negative the slope will be negative).

For our example: if temperature is the explanatory variable (X) and pints per person is the response variable (Y), the summary statistics are

          Y                  X
Mean      $\bar{y} = 0.359$    $\bar{x} = 49.1$
Std dev   $s_y = 0.0658$       $s_x = 16.422$

with sample correlation $r = 0.776$.

Determine the regression line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$:

Interpret the slope and intercept in terms of the problem.

Residuals

A residual is the difference between the observed value of the response variable and the value predicted by the regression line. That is, the residual is $\hat{\varepsilon}_i = y_i - \hat{y}_i$.
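The slope, intercept, and residual formulas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `least_squares_coefficients` and `residual` are hypothetical helpers, not part of the handout, and the inputs are the summary statistics quoted for the ice cream example together with the week 15 observation (55 F, 0.381 pints) from the handout's data table.

```python
# Summary statistics quoted in the handout's ice cream example.
x_bar, s_x = 49.1, 16.422    # mean temperature (F) and its sample standard deviation
y_bar, s_y = 0.359, 0.0658   # mean pints per person and its sample standard deviation
r = 0.776                    # sample correlation between temperature and pints


def least_squares_coefficients(r, s_x, s_y, x_bar, y_bar):
    """Return (intercept, slope) of the least squares line (hypothetical helper)."""
    slope = r * s_y / s_x                # estimated slope: r * s_y / s_x
    intercept = y_bar - slope * x_bar    # estimated intercept: y_bar - slope * x_bar
    return intercept, slope


def residual(x, y_observed, intercept, slope):
    """Observed response minus the value predicted by the fitted line."""
    return y_observed - (intercept + slope * x)


b0, b1 = least_squares_coefficients(r, s_x, s_y, x_bar, y_bar)
print(f"y-hat = {b0:.3f} + {b1:.4f} x")

# Week 15 in the handout's data table: 55 F and 0.381 pints per person.
week15_residual = residual(55, 0.381, b0, b1)
print(f"week 15 residual = {week15_residual:.4f}")
```

Because the code carries unrounded coefficients, its results can differ in the third decimal place from a hand calculation that rounds the slope and intercept first.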
For our example, the regression line is $\hat{y} = 0.207 + 0.003x$. According to our data, in week 15, when the temperature was 55, the observed pints per person were 0.381. The least squares regression line predicts pints per person to be

$$\hat{y} = 0.207 + 0.003 \times 55 = 0.372$$

so

$$\text{residual} = 0.381 - 0.372 = 0.009.$$

[Figure: Mean Temp (F) Line Fit Plot. Observed and predicted Pints per person versus Mean Temp (F).]

We can use the residuals to help us assess the fit of our regression line.

[Figure: Mean Temp (F) Residual Plot. Residuals versus Mean Temp (F).]

The residual plot should look like a random scattering of the points about the line $y = 0$. If there is a pattern in the residual plot, then the line is not providing a good fit.

Notice that in both the scatterplot and the residual plot of our example there seems to be one observation that is somewhat unusual compared to the rest. Observations like these are called outliers; they are points that lie outside

Our example: Let's determine both the prediction interval for a future response and the confidence interval for the mean response when temperature is 55.

Why is the prediction interval wider than the confidence interval?

Week  Pints per person  Mean Temp (F)    Week  Pints per person  Mean Temp (F)
1     0.386             41               16    0.381             63
2     0.374             56               17    0.47              72
3     0.393             63               18    0.443             72
4     0.425             68               19    0.386             67
5     0.406             69               20    0.342             60
6     0.344             65               21    0.319             44
7     0.327             61               22    0.307             40
8     0.288             47               23    0.284             32
9     0.269             32               24    0.326             27
10    0.256             24               25    0.309             28
11    0.286             28               26    0.359             33
12    0.298             26               27    0.376             41
13    0.329             32               28    0.416             52
14    0.318             40               29    0.437             64
15    0.381             55               30    0.548             71

*Source: Koteswara Rao Kadiyala (1970). "Testing for the independence of regression disturbances." Econometrica, 38, 97-117. Appears in: A Handbook of Small Data Sets, D. J. Hand, et al., editors (1994). Chapman and Hall: London.
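The correlation coefficient, the least squares line, and the residuals can be recomputed directly from the 30 weeks of data in the table above. The sketch below uses only the Python standard library and the sums-of-squares formulas from Section 5.2; the helper name `sums_of_squares` and the variable names are assumptions made for illustration. It computes the slope as SS_xy / SS_x, which is algebraically the same as r s_y / s_x.

```python
from math import sqrt

# Data transcribed from the handout's table (weeks 1 through 30).
pints = [0.386, 0.374, 0.393, 0.425, 0.406, 0.344, 0.327, 0.288, 0.269, 0.256,
         0.286, 0.298, 0.329, 0.318, 0.381, 0.381, 0.47, 0.443, 0.386, 0.342,
         0.319, 0.307, 0.284, 0.326, 0.309, 0.359, 0.376, 0.416, 0.437, 0.548]
temps = [41, 56, 63, 68, 69, 65, 61, 47, 32, 24,
         28, 26, 32, 40, 55, 63, 72, 72, 67, 60,
         44, 40, 32, 27, 28, 33, 41, 52, 64, 71]


def sums_of_squares(x, y):
    """Return x_bar, y_bar, SS_xy, SS_x, SS_y using the shortcut formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    ss_x = sum(xi ** 2 for xi in x) - n * x_bar ** 2
    ss_y = sum(yi ** 2 for yi in y) - n * y_bar ** 2
    return x_bar, y_bar, ss_xy, ss_x, ss_y


x_bar, y_bar, ss_xy, ss_x, ss_y = sums_of_squares(temps, pints)

r = ss_xy / sqrt(ss_x * ss_y)        # sample correlation coefficient
slope = ss_xy / ss_x                 # same value as r * s_y / s_x
intercept = y_bar - slope * x_bar    # least squares y-intercept

# Residual for each week: observed pints minus predicted pints.
residuals = [y - (intercept + slope * x) for x, y in zip(temps, pints)]

# Week (1-based) with the largest absolute residual, a candidate outlier.
outlier_week = max(range(len(residuals)), key=lambda i: abs(residuals[i])) + 1

print(f"r = {r:.3f}")
print(f"y-hat = {intercept:.3f} + {slope:.4f} x")
print(f"largest absolute residual occurs in week {outlier_week}")
```

Least squares residuals always sum to zero (up to floating-point error), so `sum(residuals)` is a quick sanity check on the fit.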
Homework 17 Due May 5

1. Use the data from 1993 Consumer Reports on New Cars (http://www.nmt.edu/~lballou/93cars.xls)
(a) Is the average price of the car related to the average MPG?
(b) Is there a relationship between city and highway MPG?
Use scatterplots, calculate and interpret the correlation coefficient, and test to determine if there is a linear relationship. If there is a linear relationship, determine the least squares regression line. Look at the residual plots and determine whether you think the model does a good job fitting the data.

2. The following is an illustration of the famous Moore's Law for computer chips.
X = Year (− 1900, for ease of computation)
Y = number of transistors (in 1000)

X       Y
71.50   2.3
78.75   31
82.75   110
85.25   280
89.75   1200
93.25   3100
95.25   5500

(a) Make a scatterplot of the data. Is the growth linear?
(b) Let's try to fit the exponential growth model using a transformation: if $Y = a_1 \exp(a_2 X)$ then $\ln(Y) = \ln(a_1) + a_2 X$. Make a regression analysis of ln(Y) on X. Does this model do a good job fitting the data?
(c) Predict the number of transistors in the year 2005. Did this prediction come true?