
Final Exam Preparation: Regression Unit Overview and Key Concepts

An outline of the material covered in Units 9-13 of a statistics course focusing on Regression. It includes a summary of each unit, key concepts, and references to important formulas. Useful for students preparing for a take-home final exam.




STAT E139 Unit XX: Final Exam Review

Final Exam Logistics

• The take-home Final Exam will be released on Friday at 9am through Canvas and will be due Monday, Dec 21, at 5:30pm (submitted to the designated assignment on the course website on Canvas).
• It is open-book and open-notes. You will be required to use R.
• Do NOT discuss the exam with any of your classmates!
• It is cumulative, with an emphasis on Units 9-13 (Regression) and HWs 9-11.
• Lots of practice problems and past exams are available. Pay special attention to this past summer's exam (the only other exam that was a take-home): https://
• Be sure to briefly explain answers and show calculations.
• Please email Kevin or the TFs (send one email to everyone) if you have questions on the exam.

Unit 9: Simple Linear Regression

• Correlation: $r_{XY} = \dfrac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{(n-1)\,S_X S_Y}$
• Using it in regression: $\hat{\beta}_1 = r_{XY}\,\dfrac{S_Y}{S_X}$, $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$
• t-test for $H_0\!: \rho = 0$: $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$
• $R^2$: interpretation. $R^2 = r_{XY}^2 = 1 - \dfrac{\sum_i \left(Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)\right)^2}{\sum_i (Y_i - \bar{Y})^2} = \dfrac{SSR}{SST}$

Unit 10: More Linear Regression

• Regression to the mean: phenomenon vs. fallacy.
• Estimating $\mu_{Y|X_0}$ at a particular $X_0$ vs. predicting a new $Y$ at a particular $X_0$.
• Confidence interval vs. prediction interval at $X_0$:
  Confidence interval: $\hat{\beta}_0 + \hat{\beta}_1 X_0 \pm t_{n-2,\,1-\alpha/2}\,\hat{\sigma}\sqrt{\dfrac{1}{n} + \dfrac{(X_0 - \bar{X})^2}{(n-1)S_X^2}}$
  Prediction interval: $\hat{\beta}_0 + \hat{\beta}_1 X_0 \pm t_{n-2,\,1-\alpha/2}\,\hat{\sigma}\sqrt{1 + \dfrac{1}{n} + \dfrac{(X_0 - \bar{X})^2}{(n-1)S_X^2}}$
• Equivalence of the pooled t-test and regression with a single binary predictor.

Unit 10: More Linear Regression (continued)

• Checking assumptions (aka regression diagnostics): use plots of the residuals!
  (1) Independence: scatterplot of e vs. time, and the study design!
  (2) Normality: qqplot or histogram of e.
  (3) Linearity: scatterplot of e vs. x (or y vs. x).
  (4) Constant variance: scatterplot of e vs. ŷ.
• What to do when they fail? Use transformations!
• Transforming X affects (3); transforming Y affects (2), (3), and (4).
• Interpretation of log-transformations: log(Y) vs. X; Y vs. log(X); log(Y) vs. log(X).

(Two short R sketches illustrating the Unit 9 and Unit 10 formulas appear below.)
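A minimal R sketch, using simulated data (not the course data), that checks the Unit 9 formulas above against lm(); the variable names and simulation settings are illustrative assumptions.

# Simulated data; numbers are illustrative only
set.seed(139)
x <- rnorm(30, mean = 50, sd = 10)
y <- 2 + 0.5 * x + rnorm(30, sd = 5)
n <- length(y)

r     <- cor(x, y)                        # sample correlation r_XY
b1    <- r * sd(y) / sd(x)                # beta1-hat = r_XY * S_Y / S_X
b0    <- mean(y) - b1 * mean(x)           # beta0-hat = Ybar - beta1-hat * Xbar
tstat <- r * sqrt(n - 2) / sqrt(1 - r^2)  # t-test of H0: rho = 0 on n - 2 df
R2    <- r^2                              # in simple linear regression, R^2 = r_XY^2

fit <- lm(y ~ x)
rbind(byhand = c(b0, b1), lm = coef(fit))                        # coefficients agree
c(byhand = tstat, lm = summary(fit)$coefficients["x", "t value"])
c(byhand = R2,    lm = summary(fit)$r.squared)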
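A companion sketch for the Unit 10 material: predict() with interval = "confidence" versus interval = "prediction" at a particular X0, plus the residual diagnostic plots. Again the data are simulated and the names are assumptions, not the course data.

set.seed(139)
x <- rnorm(30, mean = 50, sd = 10)
y <- 2 + 0.5 * x + rnorm(30, sd = 5)
fit <- lm(y ~ x)

x0 <- data.frame(x = 55)
predict(fit, newdata = x0, interval = "confidence")   # CI for mu_{Y|X0}: narrower
predict(fit, newdata = x0, interval = "prediction")   # PI for a new Y at X0: wider (extra "1 +" term)

# Residual diagnostics for assumptions (2)-(4); independence comes from the study design
e <- resid(fit)
par(mfrow = c(1, 3))
qqnorm(e); qqline(e)                        # (2) normality
plot(x, e, main = "e vs. x")                # (3) linearity
plot(fitted(fit), e, main = "e vs. y-hat")  # (4) constant variance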
Unit 11: Multiple Regression

• Indicator predictors.
  - Interpretation of coefficients.
  - Link to ANOVA (and mathematical equivalence).
• Interaction terms: $\beta_3 (X_1 \cdot X_2)$.
  - Interpretation when one predictor is binary.
  - General suggestions.
• Modelling quadratic relationships.
  - May want to mean-center the involved X (to reduce collinearity).
• Transformation guidelines:
  1) Make Y symmetric.
  2) Look at scatterplots for non-linearity, and transform Xs based on their skewness.
  3) Check residuals at the end.

Unit 12: Math of Multiple Regression

• Matrix notation of multiple regression:
  $Y_{n\times 1} = X_{n\times(K+1)}\,\beta_{(K+1)\times 1} + \varepsilon_{n\times 1}, \qquad \varepsilon \sim MVN_n\!\left(0,\ \sigma^2 I_n\right)$
• Geometrically, $\hat{Y}$ can be interpreted as the orthogonal projection of $Y$ onto the subspace generated by the columns of the design matrix $X$.
• Coefficients: $\hat{\beta} = (X^T X)^{-1} X^T Y, \qquad \hat{\beta} \sim MVN_{K+1}\!\left(\beta,\ \sigma^2 (X^T X)^{-1}\right)$
• Residual variance: $\hat{\sigma}^2 = \dfrac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n-(K+1)}, \qquad \dfrac{\left(n-(K+1)\right)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-(K+1)}$

Unit 12: Inferences in Multiple Regression

• t-test of one coefficient: under $H_0\!: \beta_j = 0$, $\dfrac{\hat{\beta}_j}{\hat{\sigma}\sqrt{[(X^T X)^{-1}]_{jj}}} \sim t_{n-(K+1)}$, with confidence interval $\hat{\beta}_j \pm t_{n-(K+1),\,1-\alpha/2}\,\hat{\sigma}\sqrt{[(X^T X)^{-1}]_{jj}}$
• Linear combination of coefficients: for $H_0\!: C^T\beta = C_0\beta_0 + C_1\beta_1 + \dots + C_K\beta_K = 0$, $\dfrac{C^T\hat{\beta}}{\hat{\sigma}\sqrt{C^T (X^T X)^{-1} C}} \sim t_{n-(K+1)}$ under $H_0$
• Confidence and prediction intervals at $X_0$:
  Confidence interval: $X_0^T\hat{\beta} \pm t_{n-(K+1),\,1-\alpha/2}\,\hat{\sigma}\sqrt{X_0^T (X^T X)^{-1} X_0}$
  Prediction interval: $X_0^T\hat{\beta} \pm t_{n-(K+1),\,1-\alpha/2}\,\hat{\sigma}\sqrt{1 + X_0^T (X^T X)^{-1} X_0}$

Unit 12: How to Select a Model

• We want to choose the most parsimonious model that has the best predictive ability.
• We have 3 automatic ways to find a near-best model: forward, backward, and stepwise model selection.
• General guidelines (assuming n >> K):
  0) Begin by transforming all data as before. Choose a specific criterion.
  1) Start with the full regression with all main effects, and use stepwise selection to choose a best model.
  2) Of those main effects still in the model, consider their 2-way interactions and perform another stepwise model selection.
• To compare potential 'best' models, use cross-validation.
• Why do we use cross-validation? What is it attempting to do? How is it implemented? (See the sketches below.)
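A minimal R sketch of the Unit 12 matrix formulas, computing the coefficients, residual variance, and coefficient standard errors directly from the design matrix and checking them against summary(lm()). The simulated data and variable names are illustrative assumptions.

set.seed(139)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 3 * x2 + rnorm(n)

X        <- cbind(1, x1, x2)                 # n x (K+1) design matrix (intercept column of 1s)
K        <- ncol(X) - 1
XtX_inv  <- solve(t(X) %*% X)
beta_hat <- XtX_inv %*% t(X) %*% y           # (X'X)^{-1} X'Y
e        <- y - X %*% beta_hat               # residuals Y - Y-hat
sigma2   <- sum(e^2) / (n - (K + 1))         # sigma^2-hat on n - (K+1) df
se       <- sqrt(sigma2 * diag(XtX_inv))     # SE(beta_j-hat)

cbind(estimate = beta_hat, se = se)
summary(lm(y ~ x1 + x2))$coefficients[, 1:2]  # should match the two columns above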
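And a sketch of the model-selection workflow: step() for stepwise selection by AIC, followed by a hand-rolled k-fold cross-validation of candidate models. The simulated data, the 5-fold choice, and the RMSE criterion are assumptions for illustration, not the course's prescribed settings.

set.seed(139)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3, data = d)
best <- step(full, direction = "both", trace = 0)   # stepwise selection by AIC

# k-fold cross-validated RMSE for a candidate model formula
cv_rmse <- function(form, data, k = 5) {
  fold <- sample(rep(1:k, length.out = nrow(data)))       # random fold assignment
  sq_err <- lapply(1:k, function(i) {
    fit  <- lm(form, data = data[fold != i, ])            # train on k-1 folds
    pred <- predict(fit, newdata = data[fold == i, ])     # predict the held-out fold
    (data$y[fold == i] - pred)^2
  })
  sqrt(mean(unlist(sq_err)))
}

cv_rmse(formula(best), d)       # CV error of the stepwise-selected model
cv_rmse(y ~ x1 + x2 + x3, d)    # compare against the full model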
Unit 13: Ecological Fallacy, Multicollinearity, and Leverage/Influence

• Ecological fallacy:
  - What does this mean? It is an example of what paradox?
• Multicollinearity:
  - When does this arise? Measures? Why is this not necessarily a problem if it arises?
• Leverage:
  - When does this occur? How do you see it in a plot? Measures?
• Influence:
  - When does this occur? How do you see it in a plot? Measures?

Practice Problem #1

The following data consist of the midterm exam scores and final exam scores of 29 students in a given course. A scatterplot of the data is presented here.

(a) Consider a simple linear regression model: Final_i = β0 + β1·Midterm_i + ε_i, where ε_i ~ N(0, σ²) i.i.d. for i = 1, 2, …, 29. Use the summary statistics presented in the table on the next slide to compute the estimates for β0, β1, and σ², and calculate R².

Practice Problem #2

In a survey of Harvard undergrads, the following variables were measured:
  looks - the percent of Harvard students that a student thinks are better looking than him or her
  relationship - a binary variable indicating whether the student is in a significant relationship (relationship = 1) or single (relationship = 0)
  female - a binary variable indicating whether the student is female (female = 1) or male (female = 0)

(a) To the right is the histogram of the response variable, looks. Comment on the plot.
[Figure: histogram of looks, with looks on the horizontal axis (0 to 100) and frequency on the vertical axis (0 to 60).]
(b) Based on the model fit below (fit1), what is the estimated mean looks for women?
(c) Is there a significant difference in average looks between men and women? How do you know?

> summary(fit1<-lm(looks~female,data=looksdata))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   24.414      2.275  10.732   <2e-16 ***
female         4.824      3.287   1.468    0.144
---
Residual standard error: 21.22 on 165 degrees of freedom
Multiple R-squared: 0.0129, Adjusted R-squared: 0.00690
F-statistic: 2.154 on 1 and 165 DF, p-value: 0.1441

(d) What is the interpretation of the coefficient for female in this model?
(e) Kevin is in a relationship. What is the estimated value of looks for Kevin (using fit2 below)?

> summary(fit2<-lm(looks~female+relationship,data=looksdata))
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    23.413      2.612   8.963 6.79e-16 ***
female          4.769      3.291   1.449    0.149
relationship    2.640      3.372   0.783    0.435
---
Residual standard error: 21.24 on 164 degrees of freedom
Multiple R-squared: 0.01656, Adjusted R-squared: 0.004568
F-statistic: 1.381 on 2 and 164 DF, p-value: 0.2543

Practice Problem #3

Based on the same survey, we want to determine whether one's opinion of their looks depends on class year. Below is the relevant R output:

> summarystats=cbind(by(looks,class,mean,na.rm=T),
+   by(looks,class,sd,na.rm=T),by(!is.na(looks),class,sum))
> colnames(summarystats)=c("mean","sd","n")
> summarystats
              mean       sd  n
freshman  26.96429 17.61624 28
junior    26.66667 20.27115 30
senior    29.15789 20.93446 19
sophomore 26.15556 22.96158 90

(a) If a regression model were fit to predict looks with sophomore, junior, and senior dummy variables as the predictors, what would be the formula for the estimated regression model? What would R² be?
(b) Calculate and write out the ANOVA table for this dataset.

Below is the ANOVA table from R for part (b):

> anova(aov(looks~class))
Analysis of Variance Table
Response: looks
           Df Sum Sq Mean Sq F value Pr(>F)
class       3    143   47.78  0.1037 0.9578
Residuals 163  75108  460.79

(c) Is there evidence of a difference among the 4 class years? Perform a formal hypothesis test.
(d) Ignoring your results in part (c) above, perform a formal hypothesis test to determine whether freshmen have different perceived looks than the other 3 class years combined.

Practice Problem #4

Below are some summary statistics comparing the price of homes across 3 groups:

(a) Use this information to fill out the ANOVA table below (a sketch of building an ANOVA table from summary statistics follows).
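For parts like 3(b) and 4(a), a one-way ANOVA table can be built from group means, SDs, and sample sizes alone. A minimal R sketch using the looks-by-class summaries shown above; it should reproduce the anova(aov(looks~class)) table up to rounding.

means <- c(26.96429, 26.66667, 29.15789, 26.15556)   # freshman, junior, senior, sophomore
sds   <- c(17.61624, 20.27115, 20.93446, 22.96158)
ns    <- c(28, 30, 19, 90)

grand <- sum(ns * means) / sum(ns)                    # overall (grand) mean
SSB   <- sum(ns * (means - grand)^2)                  # between-group (model) sum of squares
SSW   <- sum((ns - 1) * sds^2)                        # within-group (residual) sum of squares
df1   <- length(ns) - 1
df2   <- sum(ns) - length(ns)
Fstat <- (SSB / df1) / (SSW / df2)
pval  <- pf(Fstat, df1, df2, lower.tail = FALSE)

data.frame(Df     = c(df1, df2),
           SumSq  = c(SSB, SSW),
           MeanSq = c(SSB / df1, SSW / df2),
           F      = c(Fstat, NA),
           p      = c(pval, NA),
           row.names = c("class", "Residuals"))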