Download Introduction to Statistics, Lecture notes for Statistics and more Lecture notes Statistics in PDF only on Docsity! Lecture 9: Multiple Linear Regression
—CSSCSCSCs
Lecture 9 Outline Lecture 9: Multiple Linear Regression Example – Birds of the High Paramo Theory – Fitting a Model to Multivariate Data Example – Codes Theory – Single parameter t tests Theory - Multiple Correlation Coefficient Theory – Confidence Intervals for Regression Parameters Theory – Prediction in Linear Models Theory – High Leverage Points Theory – Dangers of multicollinearity Categorical Variable in Linear Models 2/46 Lecture 9 Theory – Fitting a Model to Multivariate Data I Suppose we have n independent observations with k associated known (explanatory) values. A natural extension of simple linear regression is to consider the model with k predictor variables Yi = β0 + β1xi1 + . . .+ βkxik + i, i = 1, . . . , n, where i ∼ NID(0, σ2). I Note: there are p = k + 1 number of regression parameters in this model. I Again the parameters are estimated by minimising the sum of squares of the residuals S(β) = n∑ i=1 ( Yi − (β0 + β1xi1 + . . .+ βkxik) )2 . Thus, β̂ = (β̂0, β̂1, . . . , β̂k) > = argmin β S(β). 5/46 Lecture 9 Example – Reading in the Data dat = read.table(file = "../datasets/paramo.txt", header = TRUE, row.names = 1) dim(dat) ## [1] 14 5 head(dat) ## N AR EL DEc DNI ## Chiles 36 0.33 1.26 36 14 ## Las Papas-Coconuco 30 0.50 1.17 234 13 ## Sumapaz 37 2.03 1.06 543 83 ## Tolima-Quindio 35 0.99 1.90 551 23 ## Paramillo 11 0.03 0.46 773 45 ## Cocuy 21 2.17 2.00 801 14 6/46 Lecture 9 Example – Numerical Summary summary(dat) ## N AR EL DEc ## Min. : 4.00 Min. :0.0300 Min. :0.460 Min. : 36.0 ## 1st Qu.:13.00 1st Qu.:0.0875 1st Qu.:0.670 1st Qu.: 606.5 ## Median :17.50 Median :0.2750 Median :0.905 Median : 954.0 ## Mean :20.71 Mean :0.6557 Mean :1.117 Mean : 848.1 ## 3rd Qu.:29.75 3rd Qu.:0.8950 3rd Qu.:1.440 3rd Qu.:1141.5 ## Max. :37.00 Max. :2.1700 Max. :2.280 Max. :1380.0 ## DNI ## Min. : 5.00 ## 1st Qu.:14.00 ## Median :32.00 ## Mean :36.79 ## 3rd Qu.:52.50 ## Max. :83.00 7/46 Lecture 9 Example – Fitting a Multiple Linear Regression Model M1 = lm(N ~ 1 + AR + EL + DEc + DNI, data = dat) summary(M1) .... ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 27.889386 6.181843 4.511 0.00146 ** ## AR 5.153864 3.098074 1.664 0.13056 ## EL 3.075136 4.000326 0.769 0.46175 ## DEc -0.017216 0.005243 -3.284 0.00947 ** ## DNI 0.016591 0.077573 0.214 0.83541 ## --- ## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1 ## ## Residual standard error: 6.705 on 9 degrees of freedom ## Multiple R-squared: 0.7301,Adjusted R-squared: 0.6101 .... 10/46 Lecture 9 Example – Fitting a Multiple Linear Regression Model I A full stop on the right hand side of an R model formula represents all variables in the data frame except for the response, the following produce the same output: summary(lm(N ~ 1 + AR + EL + DEc + DNI, data = dat)) summary(lm(N ~ ., data = dat)) .... ## (Intercept) 27.889386 6.181843 4.511 0.00146 ** ## AR 5.153864 3.098074 1.664 0.13056 ## EL 3.075136 4.000326 0.769 0.46175 ## DEc -0.017216 0.005243 -3.284 0.00947 ** ## DNI 0.016591 0.077573 0.214 0.83541 .... 11/46 Lecture 9 Example – Closer look at summary(lm(...)) The fitted model is E(N) = 27.9 + 5.15 · AR+ 3.08 · EL− 0.017 · DEc+ 0.017 · DNI. You also can use M1$coeff for the parameter estimates as before. 12/46 Lecture 9 Example – Closer look at summary(lm(...)) The last columns corresponds to the p-values of the two-sided single parameter t tests where the degrees of freedom is given as n− p. 15/46 Lecture 9 Theory – Single parameter t tests I The summary(lm(...)) command in R provides information for testing the importance of a covariate taking into account all other variables in the model. I Specifically, given the model Yi = β0 + β1xi1 + . . .+ βkxik + i (i = 1, 2, . . . , n) the output provides statistics for performing a t test of H0 : βj = 0 vs. H1 : βj 6= 0 for any of the p = k + 1 given variables xj , j = 0, 1, . . . , k, making no assumptions about the other regression parameters. I To test H0 : βj = 0 we can use t∗ = β̂j − 0 SE(β̂j) under H0∼ tn−p ⇒ p-value = P (|tn−p| ≥ t∗). 16/46 Lecture 9 Example – Closer look at summary(lm(...)) Recall that i ∼ NID(0, σ2). An estimate of σ2 is 6.7052 = σ̂2 = RSS/(n− p) . 17/46 Lecture 9 Example – Closer look at summary(lm(...)) So R2a = 0.6101. 20/46 Lecture 9 Example – R and R2a The slow way (by hand) n = nrow(dat) p = 5 M1 = lm(N ~ 1 + AR + EL + DEc + DNI, data = dat) RSS = deviance(M1) # or sum(M1£res^2) Syy = (n - 1) * var(dat$N); (Syy - RSS) / Syy ## [1] 0.730068 1 - RSS / (n - p) / Syy * (n - 1) ## [1] 0.6100983 The fast way (with summary) summary(M1) ## Multiple R-squared: 0.7301,Adjusted R-squared: 0.6101 21/46 Lecture 9 Theory – Confidence intervals for Regression Parameters I A 100(1− α)% CI for the βj parameters, j = 0, 1, . . . , k = p− 1: β̂j ± t1−α/2,n−p · SE(β̂j) I You can get this from summary(lm(...)), e.g. qt(0.975, 9) ## [1] 2.262157 So a 95% confidence interval for β0 is 27.889± 2.262 · 6.182 or (13.91, 41.87). 22/46 Lecture 9 Example – Birds of the High Paramo par(mfrow = c(2, 2), mai = c(0.3, 0.3, 0.3, 0.3)) boxplot(M1$residuals); plot(M1, which = 1:3) − 10 0 5 10 15 20 25 30 − 10 0 10 Fitted values R es id ua ls ● ● ● ● ● ● ● ● ● ● ● ● ● ● Residuals vs Fitted Cocuy Cende Perija ● ● ●● ● ● ● ● ● ● ● ● ● ● −1 0 1 − 2 0 1 Normal Q−Q Cocuy Cende Perija 10 15 20 25 30 0. 0 1. 0 S ta nd ar di ze d re si du al s ● ● ● ●● ● ● ● ● ● ●● ● ● Scale−Location Cocuy Cende Perija 25/46 Lecture 9 Example – Birds of the High Paramo par(mfrow = c(2, 2), mai = c(0.3, 0.3, 0.3, 0.3)) plot(resid(M1) ~ AR + EL + DEc + DNI, data = dat) ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 0.5 1.0 1.5 2.0 − 10 0 5 AR ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.5 1.0 1.5 2.0 − 10 0 5 EL re si d( M 1) ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 200 400 600 800 1000 1400 − 10 0 5 ● ● ● ● ● ● ● ● ● ● ● ● ● ● 20 40 60 80 − 10 0 5 re si d( M 1) 26/46 Lecture 9 Comments – Conclusions for the model for the Paramo Data I The diagnostic plots do not provide any compelling evidence that the model assumptions are invalid (but n small). I The model that we fitted contains four predictor variables. There are a number of questions that we could ask regarding this model: I Overall, is there significant evidence that the model is useful in explaining variation in the number of bird species? I Are all the predictor variables providing useful information about the response? I What would we lose (and gain) if we tried to use a model with less predictor variables? I How would that impact confidence intervals for parameters? 27/46 Lecture 9 Theory – Outlying points in Rp I As with simple linear regression the fitted model should not be used to predict Y values for x combinations that are well away from the set of observed xi values (i.e. be wary of extrapolation). I This is not always easy to detect in multiple linear regression. I For example, when there are several predictor variables and each component of x0 is well within the observed range of corresponding values in x1,x2,x3 . . . ,xn but the point x0 is a long way from the observed points in p-space. 30/46 Lecture 9 Example – Outlying points in R2 I Here, x0 has x1 and x2 coordinates well within their respective ranges I but x0 is not close to the observed sample values in the 2D space. 0.2 0.3 0.4 0.5 0.6 0.7 0.8 1 2 3 4 5 6 7 range of x1 ra ng e of x 2 x0 I In higher dimensions this type of behaviour is even harder to detect but we need to be on guard against extrapolating to extreme values. 31/46 Lecture 9 Example – Outlying Points I There may already be extreme values of x within the data. I Recall leverage is a measure of how far away the independent variable values of the observations are from the other observations. I The detection of such points proves difficult for higher dimensions of data hence using the leverage value helps in detecting outlying values in x (i.e. the high leverage points). I Recall Cook’s distance is a measure of outliers under a particular model and can be used to detect unusual Y -values under a particular model. 32/46 Lecture 9 pairs(dat) H 20 40 60 80 ● ● ● ● ● ● ● ● ● ● ● ● 30 50 ● ● ● ● ● ● ● ● ● ● ● 20 60 ● ● ● ● ● ● ● ● ● ● ● W ● ● ● ● ● ● ● ● ● ● ● 30 40 50 60 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 20 25 30 35 40 45 50 20 30 40 50 L 35/46 Lecture 9 Example – Catheter data summary(lm(L ~ H + W, data=dat)) ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 21.0084 8.7512 2.401 0.0399 * ## H 0.1964 0.3606 0.545 0.5993 ## W 0.1908 0.1652 1.155 0.2777 ## --- ## Residual standard error: 3.943 on 9 degrees of freedom ## Multiple R-squared: 0.8053,Adjusted R-squared: 0.7621 ## F-statistic: 18.62 on 2 and 9 DF, p-value: 0.0006336 summary(lm(L ~ W, data=dat)) ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 25.63746 2.00421 12.792 1.60e-07 *** ## W 0.27727 0.04399 6.303 8.87e-05 *** ## --- ## Residual standard error: 3.802 on 10 degrees of freedom ## Multiple R-squared: 0.7989,Adjusted R-squared: 0.7788 ## F-statistic: 39.73 on 1 and 10 DF, p-value: 8.871e-05 36/46 Lecture 9 Example – Catheter data cor(dat) ## H W L ## H 1.0000000 0.9610936 0.8811691 ## W 0.9610936 1.0000000 0.8938226 ## L 0.8811691 0.8938226 1.0000000 37/46 Lecture 9 Example – Caffeine Dose I Does caffeine stimulation affect the rate at which individuals can tap their fingers? I n = 30 male students randomly allocated to t = 3 groups of 10 for students each: I Group 1: zero caffeine dose I Group 2: low caffeine dose I Group 3: high caffeine dose I Allocation was blind (subjects did not know their caffeine dosage). I Two hours after treatment, each subject tapped fingers as quickly as possible for a minute. Number of taps recorded. 40/46 Lecture 9 Example – Caffeine Dose caffeine = read.table(file="caffeine.txt",header=TRUE) str(caffeine) boxplot(Taps ~ Dose, ylab="Taps", data=caffeine) High Low Zero 24 2 24 6 25 0 41/46 Lecture 9 Theory – Categorical Variable in Linear Models The model is Yij = µ+ αi + ij , j = 1, . . . , ni, i = 1, . . . , t, where I µ is the overall mean or grand mean I t is the number of levels for the categorical variable, I ni is the number of observations in category i, I αi is the effect due to category i, and I ij ∼ NID(0, σ2). 42/46 Lecture 9 Theory – Categorical Variable in Linear Models R applies the constraint β1 = α1 = 0, i.e. sets category 1 as the reference category. So the model fitted is in fact Yi′ = β0 + β2xi′2 + . . .+ βtxi′t + i′ , i ′ = 1, . . . , n So βk is the effect of category k relative to category 1. Note: I R organises categories in alphabetical order so category 1 for levels Zero, Low, High is High I Some categories are coded as numerical variables and needs to be explicitly coded as a factor using factor. E.g. x = c(1, 2, 3) str(x) ## num [1:3] 1 2 3 x = factor(x); str(x) ## Factor w/ 3 levels "1","2","3": 1 2 3 45/46 Lecture 9 Example – Categorical Variable in Linear Models summary(lm(Taps ~ Dose, data=caffeine)) ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 248.3000 0.7047 352.326 < 2e-16 *** ## DoseLow -1.9000 0.9967 -1.906 0.06730 . ## DoseZero -3.5000 0.9967 -3.512 0.00158 ** I On average, male students who have low dosage of caffeine tap at a rate less than 1.9 compared to male students who have high caffeine dosage. I Likewise, male students who have zero dosage of caffeine tap at a rate less than 3.5 compared to male students who have high caffeine dosage. I And, male students who have low dosage of caffeine tap at a rate more than −1.9− (−3.5) = 1.6 compared to male students who have no caffeine. 46/46