Data Mining: Predictive Modeling and Evaluation - Prof. Jennifer L. Neville, Study notes of Data Analysis & Statistical Methods

This document from Purdue University covers various aspects of predictive modeling and evaluation in data mining. Topics include score functions, cost-sensitive models, ROC curves, bias-variance analysis, ensemble methods, and pathologies. Measures such as accuracy, precision, recall, and F1 score are discussed, along with concepts like overfitting, oversearching, and attribute selection errors.

Data Mining (CS57300 / STAT 59800-024)
Purdue University, February 19, 2009

Predictive modeling: evaluation (slide 2)

Score functions (slide 3)
• Zero-one loss
• Accuracy
• Sensitivity/specificity
• Precision/recall/F1
• Absolute loss
• Squared loss
• Root mean-squared error
• Likelihood/conditional likelihood
• Area under the ROC curve

Simple measures on tables
• True positive rate (TPR) = TP/(TP+FN)
• False positive rate (FPR) = FP/(FP+TN)
• Recall = TP/(TP+FN) = TPR
• Precision = TP/(TP+FP)
• Specificity = TN/(FP+TN)
• Sensitivity = TPR

Confusion matrix (rows = predicted class, columns = actual class):

                Actual +    Actual -
  Predicted +      TP          FP
  Predicted -      FN          TN

Cost-sensitive models (slide 4)
• Define a score function based on a cost matrix.
• If ŷ is the predicted class and y is the true class, then we need to define a matrix of costs C(ŷ, y).
• C(ŷ, y) reflects the severity of classifying an instance with true class y into class ŷ.

Bias/variance tradeoff (slide 9)
[Figure: expected MSE plotted against the size of the parameter space; small parameter spaces give high bias and low variance, large parameter spaces give low bias and high variance.]

Ensemble methods (slide 10)
• Motivation: it is too difficult to construct a single model that optimizes performance (why?).
• Approach: construct many models on different versions of the training set and combine them during prediction.
• Goal: reduce bias and/or variance.

General idea (slide 11)
[Figure: altered versions of the training data are each fed to the learning algorithm, producing models M1, M2, M3, M4, which are aggregated into a combined model M*.]

Bagging (slide 12)
• Bootstrap aggregating.
• Main assumption: combining many unstable predictors in an ensemble produces a stable predictor.
• Unstable predictor: small changes in the training data produce large changes in the model (e.g., trees).
• Model space: non-parametric; it can model any function if an appropriate base model is used.

Bagging (slide 13)
• Given a training data set D = {(x1, y1), ..., (xN, yN)}:
• For m = 1:M
  • Obtain a bootstrap sample Dm by drawing N instances with replacement from D.
  • Learn model Mm from Dm.
• To classify a test instance t, apply all models to t and take the majority vote.
• Models have uncorrelated errors due to differences in training sets (each bootstrap sample contains roughly 63% of the distinct instances in D, i.e., about 1 - 1/e).

Boosting (slide 14)
• Main assumption: combining many weak (but stable) predictors in an ensemble produces a strong predictor.
• Weak predictor: one that only weakly predicts the correct class of instances (e.g., tree stumps, 1-R).
• Model space: non-parametric; it can model any function if an appropriate base model is used.

Overfitting (cont.) (slide 19)
[Figure: empirical illustration of overfitting (Oates & Jensen 1999).]

Oversearching (slide 20)
[Figure: training-set vs. test-set accuracy for heuristic search vs. exhaustive search (Quinlan and Cameron-Jones 1995; Murthy and Salzberg 1995).]

Attribute selection errors (slide 21)
[Figure: example data with attribute A1 taking few possible values and attribute A2 taking many possible values, and training-set vs. test-set accuracy as a function of the number of possible values (Quinlan 1998; Liu and White 1994).]

Evaluation functions are estimators (slide 22)
• Evaluation functions are functions f(m, D) of a model m and a data sample D.
• Samples vary in their "representativeness": f(m, D1) = x1 ≠ x2 = f(m, D2).
• Each score x is an estimate of some population parameter θ.
[Figure: population → all possible samples → derived statistic values (e.g., x1, x2) → sampling distribution.]
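The point that an evaluation score is only an estimate can be made concrete with a short simulation. The following Python sketch is not from the lecture: the true accuracy of 0.75, the sample size of 100, and the binomial model for correct classifications are all illustrative assumptions. It "evaluates" a hypothetical fixed model on many independent samples and inspects the resulting sampling distribution of the accuracy score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a fixed model m whose true (population) accuracy is theta.
# Each draw below simulates evaluating m on a fresh sample D of size n, which
# yields one score x = f(m, D); repeating this traces out the sampling distribution.
theta = 0.75        # assumed population accuracy of the model (unknown in practice)
n = 100             # assumed size of each evaluation sample
num_samples = 10_000

# Each instance in a sample is classified correctly with probability theta,
# so the sample accuracy is a binomial proportion.
scores = rng.binomial(n, theta, size=num_samples) / n

print(f"mean of sample scores   : {scores.mean():.3f}  (estimates theta = {theta})")
print(f"std dev of sample scores: {scores.std():.3f}  (sampling variability)")
print(f"5th-95th percentiles    : {np.percentile(scores, [5, 95])}")
```

The spread of these scores is the sampling variability that the parameter estimates and hypothesis tests discussed next have to account for.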
How do we use statistical inference? (slide 23)
• Parameter estimates: what is the accuracy of m?
  • Evaluate accuracy on many samples; empirically estimate the sampling distribution.
  • Use the distribution mean as the estimate of the population parameter.
• Hypothesis tests: does m perform better than chance?
  • Evaluate accuracy on a sample.
  • Compare it to the sampling distribution under the null hypothesis (H0); assess the probability that the accuracy would be achieved by "chance", e.g., p_H0(X ≥ b) = 0.027 for an observed score b.
[Figure: population under H0 → all possible samples → derived statistic values → sampling distribution, with the tail probability beyond the observed score marked.]

Multiple comparison procedures (slide 24)
• Generate multiple items: generate n models.
• Estimate scores: using the training set and an evaluation function, calculate a score for each model.
• Select the max-scoring item: select the model with the maximum score.
• The sampling distribution of X_max is different from the sampling distribution of X_i.

Explaining Pathologies (slide 29)

Incorrect hypothesis tests (slide 30)
• Under H0, there is a non-zero probability that any model's score x_i will exceed some critical value x_crit.
• The probability that the maximum of n scores (x_max) exceeds x_crit is uniformly equal or higher:
  p(X_max > x_crit | H0) ≥ p(X_i > x_crit | H0)

Overfitting (slide 31)
• Many components are available to use in a given model.
• Algorithms select the component with the maximum score.
• The correct sampling distribution depends on the number of components evaluated.
• Most learning algorithms do not adjust for the number of components.

Biased parameter estimates (slide 32)
• Sample scores are routinely used as estimates of population parameters.
• Any individual score x_i is often an unbiased estimator of the population score, but x_max is almost always a biased estimator.

Oversearching (slide 33)
• Two or more search spaces contain different numbers of models.
• Maximum scores in each space are biased to differing degrees.
• Most algorithms directly compare scores.
• Attribute selection errors can be explained in an analogous way.

Adjusting for multiple comparisons (slide 34)
• Remove bias by testing on withheld data:
  • New data (e.g., Oates & Jensen 1999)
  • Cross-validation (e.g., Weiss and Kulikowski 1991)
• Estimate the sampling distribution accurately:
  • Randomization tests (e.g., Jensen 1992)
• Adjust the probability calculation:
  • Bonferroni adjustment (e.g., Jensen & Schmill 1997)
• Alter the evaluation function to incorporate a complexity penalty:
  • MDL, BIC, etc.
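As a rough illustration of the multiple-comparison pathologies above, the simulation below is a sketch under assumed numbers, not code from the lecture. It scores k chance-level models under H0, compares the sampling distribution of a single score X_i with that of the maximum X_max, and prints the Bonferroni-style per-model significance level mentioned on the final slide. The sample size, number of models, and critical value are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: under H0 every candidate model's score X_i is just noise
# (accuracy around 0.5 on a sample of size n), yet the maximum of k such scores
# has a very different sampling distribution. All numbers are illustrative.
n = 100            # assumed evaluation sample size
k = 50             # assumed number of candidate models compared
trials = 10_000    # Monte Carlo repetitions

# Scores of k chance-level models on each trial: binomial proportions around 0.5.
scores = rng.binomial(n, 0.5, size=(trials, k)) / n

x_i = scores[:, 0]           # sampling distribution of one individual score X_i
x_max = scores.max(axis=1)   # sampling distribution of the maximum X_max

x_crit = 0.60                # critical value a naive single-model test might use
print(f"E[X_i]   = {x_i.mean():.3f}   P(X_i   > {x_crit}) = {(x_i > x_crit).mean():.3f}")
print(f"E[X_max] = {x_max.mean():.3f}   P(X_max > {x_crit}) = {(x_max > x_crit).mean():.3f}")

# Bonferroni-style adjustment: to keep the overall false-positive rate near
# alpha when k models are compared, test each one at level alpha / k.
alpha = 0.05
print(f"Bonferroni per-model significance level: {alpha / k:.4f}")
```

The run shows both pathologies at once: E[X_max] sits well above the chance level (biased parameter estimate), and X_max exceeds the critical value far more often than any individual X_i does (incorrect hypothesis test unless the comparison count is adjusted for).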