Understanding Minimum Error Points & Statistical Significance in Linear Regression

Lecture slides from Baylor University's CSI 5v93: Introduction to Machine Learning course, focusing on linear regression and hypothesis testing. The lecture covers the concept of minimum error points, the properties of the estimate β̂, and statistical hypothesis testing for βj = 0. It also discusses the differences between the t and Gaussian distributions and provides an example of testing the hypothesis using Matlab.

Partial preview of the text

Lecture 6: Linear regression and hypothesis testing
CSI 5v93: Introduction to machine learning
Baylor University Computer Science Department
Dr. Greg Hamerly
http://cs.baylor.edu/~hamerly/

Announcements

• Homework 2 due February 8th – extension

Questions?

Chapter 3: Linear methods for regression

• 3.1 – Introduction
• 3.2 – Linear regression models and least squares
• 3.3 – Multiple regression from simple univariate regression
• 3.4 – Subset selection and coefficient shrinkage
• 3.5 – Computational considerations

[Slides 5–8 are not included in this preview.]

The hypothesis test for βj = 0

Hypotheses:

• H0 (null hypothesis): βj = 0
• H1 (alternative hypothesis): βj ≠ 0

The test statistic is

    zj = β̂j / (σ̂ √vj)

where vj is the jth diagonal element of (XᵀX)⁻¹. Under H0, zj ∼ t_{N−d−1}. The t distribution is like the Gaussian distribution, but with fatter tails.

The t and Gaussian distributions

[Figure: tail probabilities of the t and Gaussian distributions, plotted for Z between 2.0 and 3.0.]

The two are closely related, but the t distribution arises because we don't know the true σ² of the data, only an estimate of it. Note that the t distribution depends on the number of samples; the Gaussian does not. The test is then whether the zj score is large enough that it falls outside the acceptance region and lies in the rejection region.

Example: Testing the hypothesis

• True model: f(X) = 30 + 50X − 10X²
• Assumed model: g(X) = β0 + β1X + β2X² + β3√X + ε
• Generate noisy data: yi = f(xi) + εi
• Compute β̂ = (XᵀX)⁻¹Xᵀy (under the assumed model g(X))

Question: Is β̂3 significant?

Applying the test:

• H0: β3 = 0
• H1: β3 ≠ 0
• Compute z3 = β̂3 / (σ̂ √v3)
• Under H0, z3 ∼ t_{N−d−1}
• Set α = 0.05
• If Pr(|Z| > |z3|) < α, then reject H0

Matlab example

[Figure: noisy data and the learned model, plotted over 0 ≤ x ≤ 10.]

• True model: f(X) = 30 + 50X − 10X²
• Assumed model: g(X) = β0 + β1X + β2X² + β3√X + ε
• β̂ = [38.5, 57.8, −10.3, −16.8]
• z = [6.4, 12.3, −47.8, −1.6]
• Pr(|Z| ≤ 1.9721) = 1 − 0.05 = 0.95, so 1.9721 is the two-sided critical value at α = 0.05
• Therefore, β̂0, β̂1, and β̂2 are all significant (by themselves), and β̂3 is NOT significant (by itself).

Eliminating multiple variables

Note that this z-test for significance applies to only one variable at a time, not to multiple variables at once. To eliminate multiple variables, eliminate one variable at a time, re-running the test each time.

Eliminating multiple variables

The F-test allows multiple variables to be tested for significance at once:

    F = [(RSS0 − RSS1) / (d1 − d0)] / [RSS1 / (n − d1 − 1)]

• RSS0 and d0 refer to the smaller model (with d0 + 1 parameters)
• RSS1 and d1 refer to the larger model (with d1 + 1 parameters)

Under the null hypothesis that the smaller model is correct, F follows the F_{d1−d0, n−d1−1} distribution.
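The preview does not include the Matlab code itself, so here is a minimal sketch of the z-test example from the slides above. The sample size, noise level, and x-range are assumptions (the slides do not state them), so the fitted β̂ and z values will differ somewhat from the slide's:

```matlab
% Sketch of the z-test example (N, noise level, and x-range are assumptions).
N = 200;                              % number of samples (assumption)
x = linspace(0.1, 10, N)';            % start above 0 so sqrt(x) is real
y = 30 + 50*x - 10*x.^2 + 20*randn(N,1);   % f(x) plus Gaussian noise

% Design matrix for the assumed model g(X) = b0 + b1*X + b2*X^2 + b3*sqrt(X)
X = [ones(N,1), x, x.^2, sqrt(x)];
beta_hat = (X'*X) \ (X'*y);           % least-squares estimate

d = size(X,2) - 1;                    % number of non-intercept terms
r = y - X*beta_hat;                   % residuals
sigma_hat = sqrt((r'*r) / (N-d-1));   % estimate of the noise std. dev.
v = diag(inv(X'*X));                  % v_j = jth diagonal of (X'X)^-1
z = beta_hat ./ (sigma_hat*sqrt(v))   % z-scores; under H0, z_j ~ t_{N-d-1}
% Reject H0 for coefficient j when |z(j)| exceeds the critical value,
% e.g. tinv(1 - 0.05/2, N-d-1) with the Statistics Toolbox.
```

Using the backslash solve `(X'*X) \ (X'*y)` rather than `inv(X'*X)*X'*y` is the usual Matlab idiom for least squares; `inv` is used above only to read off the diagonal entries vj.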
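And a sketch of the F-test just defined, reusing the variables from the previous sketch; dropping the √X term to form the smaller nested model is an illustrative choice, not something the slides prescribe:

```matlab
% Sketch of the F-test for nested models (continues the previous sketch).
X0 = X(:, 1:3);                       % smaller model: drop the sqrt(x) term
b0 = (X0'*X0) \ (X0'*y);              % least-squares fit of the smaller model
RSS0 = sum((y - X0*b0).^2);           % residual sum of squares, smaller model
RSS1 = sum((y - X*beta_hat).^2);      % residual sum of squares, larger model
d0 = size(X0,2) - 1;                  % smaller model has d0 + 1 parameters
d1 = size(X,2) - 1;                   % larger model has d1 + 1 parameters
F = ((RSS0 - RSS1)/(d1 - d0)) / (RSS1/(N - d1 - 1))
% Under H0 (the dropped coefficients are 0), F ~ F_{d1-d0, N-d1-1};
% compare against e.g. finv(0.95, d1-d0, N-d1-1) (Statistics Toolbox).
```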
[Slides 15–18 are not included in this preview.]

Choosing the subset

The RSS is always smallest for the model with the most parameters (all d of them), so RSS alone is not a useful criterion for choosing a subset. RSS for different subsets on the cancer-prediction problem:

[Figure: residual sum-of-squares versus subset size k (0 through 8) for all subsets on the cancer-prediction problem.]

Stepwise selection

Rather than considering every possible subset, stepwise selection adds one variable at a time.

Forward stepwise selection (a sketch appears after the final slide below):

• start with one parameter (the intercept)
• add the next best parameter (using the F statistic to determine the best)
• repeat until the change in the F statistic is not significant (at some level, e.g. 95%)

Backward stepwise selection is similar, but works in reverse.

Issues with stepwise selection

What are some drawbacks of stepwise selection? Is forward or backward easier? Do you expect they will produce the same results?

2-minute journal

Please write a response to the following on a piece of paper and hand it in immediately. Please make it anonymous (no names). Write about:

• major points you learned today
• areas not understood or requiring clarification
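As promised above, a minimal sketch of forward stepwise selection. It reuses x, y, and N from the earlier sketch; the candidate-term list and the fixed F cutoff are assumptions (a careful implementation would recompute the exact F_{1, N−d−1} critical value at each step):

```matlab
% Sketch of forward stepwise selection using the F statistic.
C = [x, x.^2, sqrt(x)];                 % candidate terms (assumption)
Xsel = ones(N,1);                       % current model: intercept only
remaining = 1:size(C,2);                % indices of terms not yet added
bsel = (Xsel'*Xsel) \ (Xsel'*y);
RSSsel = sum((y - Xsel*bsel).^2);
Fcrit = 4.0;                            % assumed cutoff, near the 95% point of F_{1, large}
while ~isempty(remaining)
    Fbest = -inf;
    for k = remaining                   % try adding each remaining term
        Xtry = [Xsel, C(:,k)];
        btry = (Xtry'*Xtry) \ (Xtry'*y);
        RSStry = sum((y - Xtry*btry).^2);
        d1 = size(Xtry,2) - 1;
        Ftry = (RSSsel - RSStry) / (RSStry/(N - d1 - 1));  % F with d1-d0 = 1
        if Ftry > Fbest
            Fbest = Ftry; kbest = k; RSSbest = RSStry;
        end
    end
    if Fbest < Fcrit, break; end        % stop: best addition is not significant
    Xsel = [Xsel, C(:,kbest)];          % accept the best term
    RSSsel = RSSbest;
    remaining(remaining == kbest) = []; % remove it from the candidate pool
end
% Columns of Xsel now hold the selected model (intercept first).
```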