Lecture 6: Linear regression and hypothesis testing
CSI 5v93: Introduction to machine learning
Baylor University Computer Science Department
Dr. Greg Hamerly
http://cs.baylor.edu/~hamerly/
CSI 5v93: Introduction to machine learning, Lecture 6 – p. 1/22

Announcements
• Homework 2 due February 8th – extension

Questions?

Chapter 3: Linear methods for regression
• 3.1 – Introduction
• 3.2 – Linear regression models and least squares
• 3.3 – Multiple regression from simple univariate regression
• 3.4 – Subset selection and coefficient shrinkage
• 3.5 – Computational considerations

The hypothesis test for βj = 0
Hypotheses:
• H0 (null hypothesis): βj = 0
• H1 (alternative hypothesis): βj ≠ 0
The test statistic is:
    zj = β̂j / (σ̂ √vj)
where vj is the jth diagonal element of (X^T X)^{-1}. Under H0, zj ∼ t_{N−d−1}. The t distribution is like the Gaussian distribution, but with fatter tails.

The t and Gaussian distributions
[Figure: tail probabilities of the t and Gaussian distributions, plotted for z between 2.0 and 3.0.]
The two are closely related, but the t distribution arises because we do not know the true σ² of the data, only an estimate of it. Note that the t distribution depends on the number of samples; the Gaussian does not.
The test: if the zj score is large enough that it falls outside the acceptance region and lies in the rejection region, we reject H0.
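The z-statistic above can be computed directly from the least-squares fit. A minimal sketch in Python with numpy/scipy (the lecture's examples use Matlab); the data, noise level σ = 5, and sample size are illustrative assumptions, using the true and assumed models from the example that follows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated data from the true model f(X) = 30 + 50X - 10X^2
N = 100
x = rng.uniform(0.0, 10.0, N)
y = 30 + 50 * x - 10 * x**2 + rng.normal(0.0, 5.0, N)

# Design matrix for the assumed model g(X) = b0 + b1*x + b2*x^2 + b3*sqrt(x)
X = np.column_stack([np.ones(N), x, x**2, np.sqrt(x)])
d = X.shape[1] - 1  # number of predictors (excluding the intercept)

# Least-squares fit: beta_hat = (X^T X)^{-1} X^T y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Unbiased estimate of sigma^2 from the residuals
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (N - d - 1)

# z_j = beta_hat_j / (sigma_hat * sqrt(v_j)), v_j = jth diagonal of (X^T X)^{-1}
v = np.diag(XtX_inv)
z = beta_hat / np.sqrt(sigma2_hat * v)

# Two-sided p-values under the t distribution with N - d - 1 degrees of freedom
p = 2 * stats.t.sf(np.abs(z), df=N - d - 1)
```

Each p[j] is then compared against α; small values reject H0 for that coefficient.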
Example: Testing the hypothesis
• True model: f(X) = 30 + 50X − 10X²
• Assumed model: g(X) = β0 + β1X + β2X² + β3√X + ε
• Generate noisy data: yi = f(xi) + εi
• Compute β̂ = (X^T X)^{-1} X^T y (under the assumed model g(X))
Question: Is β̂3 significant?
Applying the test:
• H0: β3 = 0
• H1: β3 ≠ 0
• Compute z3 = β̂3 / (σ̂ √v3)
• Under H0, z3 ∼ t_{N−d−1}
• Set α = 0.05
• If Pr(|Z| > |z3|) < α, then reject H0

Matlab example
[Figure: noisy data and the learned model, plotted for x from 0 to 10.]
• True model: f(X) = 30 + 50X − 10X²
• Assumed model: g(X) = β0 + β1X + β2X² + β3√X + ε
• β̂ = [38.5, 57.8, −10.3, −16.8]
• z = [6.4, 12.3, −47.8, −1.6]
• Critical value: Pr(|Z| > 1.9721) = 0.05, i.e. Pr(|Z| ≤ 1.9721) = 0.95
• Therefore β̂0, β̂1, and β̂2 are all significant (by themselves), and β̂3 is NOT significant (by itself).

Eliminating multiple variables
Note that this z-test for significance applies to only one variable at a time, not to multiple variables at once. To eliminate multiple variables with it, eliminate one variable at a time, re-running the test after each elimination.

Eliminating multiple variables
The F-test allows multiple variables to be tested for significance at once:
    F = [(RSS0 − RSS1) / (d1 − d0)] / [RSS1 / (N − d1 − 1)]
• RSS0 and d0 refer to the smaller model (with d0 + 1 parameters)
• RSS1 and d1 refer to the larger model (with d1 + 1 parameters)
Under the null hypothesis that the smaller model is correct, F is distributed as F_{d1−d0, N−d1−1}.

Choosing the subset
The RSS will always be smallest for the model with the most parameters (all d of them), so minimizing RSS alone is not a useful criterion for choosing a subset.
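The F-test above can be sketched in code. A Python sketch (the course examples use Matlab), comparing a smaller model against a larger nested model on synthetic data from the lecture's true model; the sample size and noise level are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic data from the true model f(X) = 30 + 50X - 10X^2
N = 200
x = rng.uniform(0.0, 10.0, N)
y = 30 + 50 * x - 10 * x**2 + rng.normal(0.0, 5.0, N)

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return r @ r

# Smaller model (d0 = 2 predictors): 1, x, x^2
X0 = np.column_stack([np.ones(N), x, x**2])
# Larger model (d1 = 3 predictors): adds the sqrt(x) term
X1 = np.column_stack([X0, np.sqrt(x)])

d0, d1 = X0.shape[1] - 1, X1.shape[1] - 1
rss0, rss1 = rss(X0, y), rss(X1, y)

# F = [(RSS0 - RSS1)/(d1 - d0)] / [RSS1/(N - d1 - 1)]
F = ((rss0 - rss1) / (d1 - d0)) / (rss1 / (N - d1 - 1))

# p-value under the F_{d1-d0, N-d1-1} distribution;
# a large p-value means the extra sqrt(x) term is not significant
p = stats.f.sf(F, d1 - d0, N - d1 - 1)
```

Since the true model contains no √X term, we expect the test to usually retain the smaller model here.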
RSS for different subsets for the cancer-prediction problem:
[Figure: residual sum-of-squares versus subset size k (k = 0 to 8), one point per possible subset; the best attainable RSS decreases as k grows.]

Stepwise selection
Rather than considering every possible subset, stepwise selection adds one variable at a time.
Forward stepwise selection:
• start with one parameter (the intercept)
• add the next best parameter (using the F statistic to determine the best)
• repeat until the change in the F statistic is not significant (at some level, e.g. 95%)
Backward stepwise selection is similar, but works in reverse: start with all parameters and remove the least useful one at each step.

Issues with stepwise selection
What are some drawbacks of stepwise selection? Is forward or backward easier? Do you expect they will produce the same results?

2-minute journal
Please write a response to the following on a piece of paper and hand it in immediately. Please make it anonymous (no names). Write about:
• major points you learned today
• areas not understood or requiring clarification
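The forward stepwise procedure described above can be sketched as follows. This is a Python sketch, not the course's Matlab code; the F-to-enter stopping rule at level α and the synthetic example are illustrative assumptions:

```python
import numpy as np
from scipy import stats

def forward_stepwise(X, y, alpha=0.05):
    """Greedy forward selection: start with just the intercept, repeatedly add
    the candidate column with the largest F-to-enter statistic, and stop when
    no remaining candidate is significant at level alpha."""
    N, d = X.shape
    selected = []                 # indices of chosen columns of X
    remaining = list(range(d))

    def rss(cols):
        Z = np.column_stack([np.ones(N)] + [X[:, j] for j in cols])
        beta = np.linalg.lstsq(Z, y, rcond=None)[0]
        r = y - Z @ beta
        return r @ r

    current_rss = rss(selected)
    while remaining:
        best = None
        for j in remaining:
            new_rss = rss(selected + [j])
            df2 = N - (len(selected) + 1) - 1   # residual degrees of freedom
            F = (current_rss - new_rss) / (new_rss / df2)
            if best is None or F > best[1]:
                best = (j, F, new_rss, df2)
        j, F, new_rss, df2 = best
        # Stop when even the best candidate's improvement is not significant
        if stats.f.sf(F, 1, df2) >= alpha:
            break
        selected.append(j)
        remaining.remove(j)
        current_rss = new_rss
    return selected

# Example: y depends on columns 0 and 1 only; columns 2-4 are pure noise
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0.0, 0.5, 300)
chosen = forward_stepwise(X, y)
```

Backward stepwise selection would run the same loop in reverse, dropping the variable with the smallest F statistic at each step.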