Data Mining: Predictive Modeling and Evaluation
Prof. Jennifer L. Neville
Data Mining CS57300 / STAT 59800-024
Purdue University
February 19, 2009

Predictive modeling: evaluation

Score functions
• Zero-one loss
• Accuracy
• Sensitivity/specificity
• Precision/Recall/F1
• Absolute loss
• Squared loss
• Root mean-squared error
• Likelihood/conditional likelihood
• Area under the ROC curve

Simple measures on tables
• Confusion matrix (rows: predicted class; columns: actual class):

                Actual +    Actual –
  Predicted +      TP          FP
  Predicted –      FN          TN

• True positive rate (TPR) = TP/(TP+FN)
• False positive rate (FPR) = FP/(FP+TN)
• Recall = TP/(TP+FN) = TPR
• Precision = TP/(TP+FP)
• Specificity = TN/(FP+TN)
• Sensitivity = TPR

Cost-sensitive models
• Define a score function based on a cost matrix
• If ỹ is the predicted class and y is the true class, then we need to define a matrix of costs C(ỹ, y)
• C(ỹ, y) reflects the severity of classifying an instance with true class y into class ỹ

Bias/variance tradeoff
[Figure: expected MSE as a function of the size of the parameter space — small parameter spaces give high bias and low variance; large parameter spaces give low bias and high variance]

Ensemble methods
• Motivation
  • It is too difficult to construct a single model that optimizes performance (why?)
• Approach
  • Construct many models on different versions of the training set and combine them during prediction
  • Goal: reduce bias and/or variance

General idea
[Figure: the training data is altered to produce several versions; the learning algorithm is applied to each version to produce models M1, M2, M3, M4, which are aggregated into a combined model M*]

Bagging
• Bootstrap aggregating
• Main assumption
  • Combining many unstable predictors in an ensemble produces a stable predictor
  • Unstable predictor: small changes in the training data produce large changes in the model (e.g., trees)
• Model space: non-parametric; can model any function if an appropriate base model is used

Bagging (cont)
• Given a training data set D = {(x1, y1), ..., (xN, yN)}
• For m = 1:M
  • Obtain a bootstrap sample Dm by drawing N instances with replacement from D
  • Learn model Mm from Dm
• To classify test instance t, apply all models to t and take the majority vote
• Models have roughly uncorrelated errors due to differences in their training sets (each bootstrap sample contains ~63.2% of the distinct instances in D, since 1 − 1/e ≈ 0.632)

Boosting
• Main assumption
  • Combining many weak (but stable) predictors in an ensemble produces a strong predictor
  • Weak predictor: only weakly predicts the correct class of instances (e.g., tree stumps, 1-R)
• Model space: non-parametric; can model any function if an appropriate base model is used

Overfitting (cont)
[Figure: empirical overfitting results (Oates & Jensen 1999)]

Oversearching
[Figure: training-set and test-set accuracy by search method — moving from heuristic toward exhaustive search raises training-set accuracy but can lower test-set accuracy (Quinlan and Cameron-Jones 1995; Murthy and Salzberg 1995)]

Attribute selection errors
[Figure: example data with attribute A1 (few possible values) and attribute A2 (many possible values) — attributes with many possible values inflate training-set accuracy without a corresponding gain in test-set accuracy (Quinlan 1998; Liu and White 1994)]

Evaluation functions are estimators
• Evaluation functions are functions f(m, D) on models (m) and data samples (D)
• Samples vary in their “representativeness”: f(m, D1) = x1 ≠ x2 = f(m, D2)
• Each score x is an estimate of some population parameter
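The sampling variability just described can be made concrete with a short sketch. A fixed, already-learned classifier is scored with zero-one accuracy on two different samples from the same population, giving two different estimates of one population parameter. The model, thresholds, and data below are all invented for illustration.

```python
import random

# Illustrative only: a fixed "model" (a threshold rule), scored on two
# samples D1 and D2 from the same population, yields two different
# estimates x1 and x2 of the same population accuracy.
def model(x):
    return int(x > 0.5)  # a fixed, already-learned classifier

def accuracy(clf, sample):
    # Evaluation function f(m, D): fraction of instances m labels correctly.
    return sum(clf(x) == y for x, y in sample) / len(sample)

def draw_sample(rng, n=50):
    # Population: x ~ Uniform(0, 1), true label 1 iff x > 0.4, so the
    # model's threshold at 0.5 misclassifies x in (0.4, 0.5].
    xs = [rng.random() for _ in range(n)]
    return [(x, int(x > 0.4)) for x in xs]

rng = random.Random(7)
x1 = accuracy(model, draw_sample(rng))  # f(m, D1)
x2 = accuracy(model, draw_sample(rng))  # f(m, D2)
print(x1, x2)  # two noisy sample estimates of the population accuracy (0.9)
```

Each run of `draw_sample` plays the role of one sample D; repeating the scoring over many samples would trace out the sampling distribution discussed next.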
[Figure: two samples D1 and D2 yield two different scores, x1 and x2]

[Figure: from a population, all possible samples yield a sampling distribution of derived statistic values (e.g., 4.32, 3.59, 7.44, 2.06, 5.19, 4.27, ...)]

How do we use statistical inference?
• Parameter estimates — what is the accuracy of m?
  • Evaluate accuracy on many samples; empirically estimate the sampling distribution
  • Use the distribution mean as the estimate of the population parameter
• Hypothesis tests — does m perform better than chance?
  • Evaluate accuracy on a sample
  • Compare it to the sampling distribution under the null hypothesis (H0); assess the probability that the accuracy would be achieved by “chance”
  • Example: for an observed score b, compute pH0(X ≥ b) = 0.027

Multiple comparison procedures
• Generate multiple items
  • Generate n models
• Estimate scores
  • Using the training set and an evaluation function, calculate a score for each model
• Select the max-scoring item
  • Select the model with the maximum score
• The sampling distribution of Xmax is different from the sampling distribution of Xi

Explaining Pathologies

Incorrect hypothesis tests
• Under H0, there is a non-zero probability that any model’s score xi will exceed some critical value xcrit
• The probability that the maximum of n scores (xmax) will exceed xcrit is uniformly equal or higher:
  p(Xmax > xcrit | H0) ≥ p(Xi > xcrit | H0)

Overfitting
• Many components are available to use in a given model
• Algorithms select the component with the maximum score
• The correct sampling distribution depends on the number of components evaluated
• Most learning algorithms do not adjust for the number of components

Biased parameter estimates
• Sample scores are routinely used as estimates of population parameters
• Any individual score xi is often an unbiased estimator of the population score, but xmax is almost always a biased estimator

Oversearching
• Two or more search spaces contain different numbers of models
• Maximum scores in each space are biased to differing degrees
• Most algorithms directly compare scores
• Attribute selection errors can be explained in an analogous way

Adjusting for multiple comparisons
• Remove bias by testing on withheld data
  • New data (e.g., Oates & Jensen 1999)
  • Cross-validation (e.g., Weiss and Kulikowski 1991)
• Estimate the sampling distribution accurately
  • Randomization tests (e.g., Jensen 1992)
• Adjust the probability calculation
  • Bonferroni adjustment (e.g., Jensen & Schmill 1997)
• Alter the evaluation function to incorporate a complexity penalty
  • MDL, BIC, etc.
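A small simulation can tie these ideas together: it exhibits the incorrect-hypothesis-test pathology (the maximum of n null scores exceeds an unadjusted critical value far more often than a single score does) and then applies a Bonferroni-style correction. This is a sketch under invented assumptions — the null model (each "model" guesses on 100 test cases), the critical values, and the repetition counts are all chosen for illustration.

```python
import random

# Under H0 every model guesses, so each accuracy score is
# Binomial(100, 0.5) / 100.
def null_score(rng, trials=100):
    return sum(rng.random() < 0.5 for _ in range(trials)) / trials

def exceed_rate(n_models, crit, reps, rng):
    # Fraction of repetitions in which the max of n null scores exceeds crit.
    hits = 0
    for _ in range(reps):
        if max(null_score(rng) for _ in range(n_models)) > crit:
            hits += 1
    return hits / reps

rng = random.Random(3)
crit_single = 0.58  # roughly the alpha = 0.05 critical value for one score
p_one = exceed_rate(1, crit_single, 1000, rng)   # close to alpha
p_max = exceed_rate(10, crit_single, 1000, rng)  # inflated: ~1-(1-0.05)^10

# Bonferroni adjustment: with n = 10 comparisons, test each at alpha/n,
# which here means raising the critical value to roughly the 0.005 cutoff.
crit_bonf = 0.63
p_fwe = exceed_rate(10, crit_bonf, 1000, rng)    # family-wise rate back near/below alpha
```

The comparison of `p_one` and `p_max` is exactly the inequality p(Xmax > xcrit | H0) ≥ p(Xi > xcrit | H0) from the earlier slide; the adjusted run shows one way the "adjust the probability calculation" bullet can restore a valid test.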