
Understanding Decision Trees in Classification: Advantages, Disadvantages, and Overfitting (Machine Learning slides)

An in-depth analysis of decision trees, a popular machine learning algorithm used for classification tasks. It covers the basics of decision trees, including their definition, construction, and the Gini index. It discusses the advantages of decision trees, such as their ease of interpretation and robustness to noise, as well as their disadvantages, such as the exponentially large number of possible decision trees and data fragmentation. It also covers overfitting and underfitting and the methods to address them, and concludes with a discussion of model selection and performance evaluation.

Type: Slides

2018/2019

Uploaded on 20/02/2024 by laura-laid 🇮🇹 (1 document)

Partial preview of the text

Classification: Basic Concepts and Techniques

Classification problem
● What we have
– A set of objects, each of them described by some features
◆ people described by age, gender, height, etc.
◆ bank transactions described by type, amount, time, etc.
● What we want to do
– Associate the objects of a set to a class, taken from a predefined list
◆ “good customer” vs. “churner”
◆ “normal transaction” vs. “fraudulent”
◆ “low risk patient” vs. “risky”
[Figure: objects plotted by Feature 1 (e.g. Age) and Feature 2 (e.g. Income); unlabeled points such as (50y, 15k€) and (60y, 35k€) must be assigned a class]

Classify by similarity
● K-Nearest Neighbors
– Decide the label based on the K most similar examples (e.g. K = 3)

Build a model
● Example 1: linear separation line
● Example 2: Support Vector Machine (linear)

Classification: Definition
● Given a collection of records (training set)
– Each record is characterized by a tuple (x, y), where x is the attribute set and y is the class label
◆ x: attribute, predictor, independent variable, input
◆ y: class, response, dependent variable, output
● Task: learn a model that maps each attribute set x into one of the predefined class labels y
● Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set to validate it.

Supervised learning
● Supervised learning refers to problems where the value of a target attribute should be predicted based on the values of other attributes.
● Unsupervised learning (e.g. cluster analysis) is not concerned with a specific target attribute.
● Problems with a categorical target attribute are called classification; problems with a numerical target attribute are called regression.

General Approach for Building a Classification Model
[Figure: a training set is fed to a learning algorithm (induction) to learn a model; the model is then applied to the test set (deduction) to predict class labels]

Example of a Decision Tree
Consider the problem of predicting whether a loan borrower will repay the loan or default on the loan payments. The training data has two categorical attributes (Home Owner, Marital Status), one continuous attribute (Annual Income) and the class label (Defaulted).
Model: Decision Tree (splitting attributes: Home Owner, Marital Status, Annual Income)
– Home Owner = Yes → Defaulted = No
– Home Owner = No → test Marital Status:
◆ Married → Defaulted = No
◆ Single, Divorced → test Annual Income: < 80K → Defaulted = No, > 80K → Defaulted = Yes

Another Example of a Decision Tree
● A different tree over the same attributes, splitting first on Marital Status and then on Home Owner and Annual Income, also fits the training data.
● There could be more than one tree that fits the same data!

Apply Model to Test Data
● Start from the root of the tree and, at each node, follow the branch that matches the test record's attribute values.
● Repeat until a leaf is reached and assign the leaf's label to the record: for the example test record, the tree assigns Defaulted = “No”.
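To make the model-application walk above concrete, here is a minimal Python sketch of the example tree; the record layout, field names and test values are illustrative assumptions, not taken from the slides.

```python
def classify(record):
    """Walk the example tree from the root to a leaf and return the predicted class."""
    # Root node: Home Owner
    if record["home_owner"] == "Yes":
        return "No"                       # leaf: Defaulted = No
    # Internal node: Marital Status
    if record["marital_status"] == "Married":
        return "No"                       # leaf: Defaulted = No
    # Internal node: Annual Income (Single / Divorced branch)
    return "No" if record["annual_income"] < 80_000 else "Yes"

# An illustrative test record (not the exact record shown in the slides)
test_record = {"home_owner": "No", "marital_status": "Married", "annual_income": 80_000}
print(classify(test_record))              # -> "No" (Defaulted = No)
```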
Decision Tree Induction
● Many algorithms:
– Hunt's Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT

General Structure of Hunt's Algorithm
● Let Dt be the set of training records that reach a node t
● General procedure:
– If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
[Figure: Hunt's algorithm grows the loan-default tree step by step; the class counts (Defaulted = No, Defaulted = Yes) go from (7,3) at the root to (3,0) and (4,3) after splitting on Home Owner, then to (3,0) and (1,3) after splitting on Marital Status, and finally to the pure leaves (1,0) and (0,3) after splitting on Annual Income]

Design Issues of Decision Tree Induction
● Greedy strategy:
– The number of possible decision trees can be very large, so many decision tree algorithms employ a heuristic-based approach to guide their search in the vast hypothesis space.
– Split the records based on an attribute test that optimizes a certain criterion.

Tree Induction
● How should training records be split?
– Method for specifying the test condition
◆ depending on attribute types
– Measure for evaluating the goodness of a test condition
● How should the splitting procedure stop?
– Stop splitting if all the records belong to the same class or have identical attribute values
– Early termination

Test Condition for Nominal Attributes
● Multi-way split:
– Use as many partitions as distinct values.
● Binary split:
– Divides values into two subsets

Test Condition for Ordinal Attributes
● Multi-way split:
– Use as many partitions as distinct values
● Binary split:
– Divides values into two subsets
– Preserve the order property among attribute values (a grouping such as {Small, Large} vs. {Medium} violates the order property)

Test Condition for Continuous Attributes
● (i) Binary split: a single threshold, e.g. Annual Income > 80K? (Yes / No)
● (ii) Multi-way split: ranges of values, e.g. Annual Income in < 10K, [10K, 25K), [25K, 50K), [50K, 80K), ≥ 80K

How to determine the Best Split
● Greedy approach:
– Nodes with a purer / more homogeneous class distribution are preferred
● Need a measure of node impurity:
– high degree of impurity: non-homogeneous class distribution
– low degree of impurity: homogeneous class distribution

Measures of Node Impurity
● Gini Index
● Entropy
● Misclassification error

Measure of Impurity: GINI
● Gini Index for a given node t:
Gini(t) = 1 − Σj [ p(j | t) ]²
(NOTE: p(j | t) is the relative frequency of class j at node t)
– Maximum (1 − 1/nc, with nc classes) when records are equally distributed among all classes, implying least interesting information
– Minimum (0.0) when all records belong to one class, implying most interesting information
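As a companion to the definition above, a minimal sketch of the Gini computation for one node; `class_counts` (the counts per class at the node) is an assumed input, not something defined in the slides.

```python
def gini(class_counts):
    """Gini index of a node given the class counts at that node."""
    n = sum(class_counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

# These reproduce the single-node examples worked out below
print(gini([0, 6]))  # 0.0   (pure node: minimum impurity)
print(gini([1, 5]))  # 0.278
print(gini([2, 4]))  # 0.444
print(gini([3, 3]))  # 0.5   (maximum impurity for a 2-class node, 1 - 1/2)
```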
– For a 2-class problem with class probabilities (p, 1 − p):
◆ Gini = 1 − p² − (1 − p)² = 2p(1 − p)

Computing the Gini Index of a Single Node
● P(C1) = 0/6 = 0, P(C2) = 6/6 = 1 → Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0
● P(C1) = 1/6, P(C2) = 5/6 → Gini = 1 − (1/6)² − (5/6)² = 0.278
● P(C1) = 2/6, P(C2) = 4/6 → Gini = 1 − (2/6)² − (4/6)² = 0.444

Categorical Attributes: Computing the Gini Index
● For each distinct value, gather the counts for each class in the dataset
● Use the count matrix to make decisions
– Compare the multi-way split with the possible two-way splits (find the best partition of values): which of these is the best?

Continuous Attributes: Computing the Gini Index
● Use binary decisions based on one value
● Several choices for the splitting value
– Number of possible splitting values = number of distinct values
● Each splitting value v has a count matrix associated with it
– Class counts in each of the partitions, A < v and A ≥ v, e.g. for the split on Annual Income at 80K:

                 ≤ 80K   > 80K
Defaulted Yes      0       3
Defaulted No       3       4

● Simple method to choose the best v
– For each v, scan the database to gather the count matrix and compute its Gini index
– Computationally inefficient, O(N²): repetition of work.

Continuous Attributes: Computing the Gini Index (efficient computation, O(N log N))
● For each attribute:
– Sort the attribute on its values
– Linearly scan these values, each time updating the count matrix and computing the Gini index of the candidate split position
– Choose the split position that has the least Gini index

Measure of Impurity: Entropy
● Entropy at a given node t:
Entropy(t) = − Σj p(j | t) log2 p(j | t)
(NOTE: p(j | t) is the relative frequency of class j at node t)
◆ Maximum (log2(nc)) when records are equally distributed among all classes, implying least information
◆ Minimum (0.0) when all records belong to one class, implying most information
– Entropy-based computations are quite similar to the Gini index computations

Problem with a large number of partitions
● Node impurity measures tend to prefer splits that result in a large number of partitions, each being small but pure
– Customer ID has the highest information gain because the entropy of all its children is zero
– Can we use such a test condition on new test instances?

Solution
● A low impurity value alone is insufficient to find a good attribute test condition for a node
● Solution: consider the number of children produced by the splitting attribute in the identification of the best split
● A high number of child nodes implies more complexity
● Method 1: generate only binary decision trees
– This strategy is employed by decision tree classifiers such as CART
● Method 2: modify the splitting criterion to take into account the number of partitions produced by the attribute

Gain Ratio
● Gain Ratio = Gain_split / SplitInfo, where SplitInfo = − Σi (ni / n) log2 (ni / n), the parent node p is split into k partitions and ni is the number of records in partition i
– Adjusts Information Gain by the entropy of the partitioning (SplitInfo)
◆ Higher-entropy partitioning (a large number of small partitions) is penalized!
– Used in the C4.5 algorithm
– Designed to overcome the disadvantage of Information Gain
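A sketch of how the gain ratio can be computed from the class counts of a parent node and of its child partitions, assuming entropy with log base 2 as in the slides; the function names and inputs are illustrative, not from the slides.

```python
import math

def entropy(class_counts):
    """Entropy of a node given its class counts."""
    n = sum(class_counts)
    probs = [c / n for c in class_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def gain_ratio(parent_counts, children_counts):
    """Information gain divided by SplitInfo (the entropy of the partition sizes)."""
    n = sum(parent_counts)
    sizes = [sum(c) for c in children_counts]
    children_entropy = sum(s / n * entropy(c) for s, c in zip(sizes, children_counts))
    gain = entropy(parent_counts) - children_entropy
    split_info = -sum(s / n * math.log2(s / n) for s in sizes if s > 0)
    return gain / split_info if split_info > 0 else 0.0

# A perfect two-way split of a (3, 7) parent node
print(gain_ratio([3, 7], [[3, 0], [0, 7]]))  # 1.0
```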
Computing the Error of a Single Node
● Classification error at a node t: Error(t) = 1 − maxj p(j | t)
● P(C1) = 0/6 = 0, P(C2) = 6/6 = 1 → Error = 1 − max(0, 1) = 1 − 1 = 0
● P(C1) = 1/6, P(C2) = 5/6 → Error = 1 − max(1/6, 5/6) = 1 − 5/6 = 1/6
● P(C1) = 2/6, P(C2) = 4/6 → Error = 1 − max(2/6, 4/6) = 1 − 4/6 = 1/3

Weighted Misclassification Error
● Splitting a node on a test A? produces two children N1 and N2:
– M1 = Error(N1) = #minority(N1) / #N1
– M2 = Error(N2) = #minority(N2) / #N2
– M = weighted avg(M1, M2) = M1 * #N1/N + M2 * #N2/N = #minority(N1)/N + #minority(N2)/N = #minority(N1 and N2) / N
● The aggregate error M is equivalent to the fraction of errors over both nodes after the split
● Easy to compute

Comparison among Impurity Measures
[Figure: entropy, Gini index and misclassification error plotted as functions of p(yes) for a 2-class problem; all three are zero for pure nodes and maximal at p(yes) = p(no) = 0.5]
● Consistency among the impurity measures: if a node N1 has lower entropy than node N2, then the Gini index and error rate of N1 will also be lower than those of N2
● The attribute chosen as splitting criterion by the different impurity measures can still be different!

Stopping Criteria for Tree Induction
● Stop expanding a node when all the records belong to the same class
● Stop expanding a node when all the records have similar attribute values
● Early termination (to be discussed later)

Algorithms: ID3, C4.5, C5.0, CART
● ID3 uses Hunt's algorithm with the information gain criterion (the gain ratio is the refinement used by its successors)
● C4.5 improves ID3
– Needs the entire data to fit in memory
– Handles missing attributes and continuous attributes
– Performs tree post-pruning
– C5.0 is the current commercial successor of C4.5
– Unsuitable for large datasets
● CART builds multivariate decision (binary) trees

Redundant Attributes
● Decision trees can handle the presence of redundant attributes
● An attribute is redundant if it is strongly correlated with another attribute in the data
● Since redundant attributes show similar gains in purity when they are selected for splitting, only one of them will be selected as an attribute test condition by the decision tree algorithm.

Advantages of Decision Trees
● Easy to interpret for small-sized trees
● Accuracy is comparable to other classification techniques for many simple data sets
● Robust to noise (especially when methods to avoid overfitting are employed)
● Can easily handle redundant or irrelevant attributes
● Inexpensive to construct
● Extremely fast at classifying unknown records
● Handle missing values

Computational Complexity
● Finding an optimal decision tree is NP-hard
● Hunt's algorithm uses a greedy, top-down, recursive partitioning strategy for growing a decision tree
● Such techniques quickly construct a reasonably good decision tree even when the training set size is very large.
● Construction complexity: O(M N log N), where M = number of attributes and N = number of instances
● Once a decision tree has been built, classifying a test record is extremely fast, with a worst-case complexity of O(w), where w is the maximum depth of the tree.
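The construction and classification costs above can be observed with an off-the-shelf learner. This is an illustrative scikit-learn sketch (its tree learner is CART-style, with binary splits and the Gini criterion by default), not the algorithms described in the slides; the synthetic data and parameter values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 2-class data standing in for the training and test sets of the slides
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(criterion="gini", random_state=0)  # CART-style binary tree
tree.fit(X_train, y_train)                     # induction on the training set
print(tree.score(X_test, y_test))              # accuracy on the held-out test set
print(tree.get_depth(), tree.get_n_leaves())   # tree size: classification cost is O(depth)
```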
Computing the Impurity Measure (DT construction with a missing value)
● One training record has a missing value for Refund; the impurity is computed on the records with known values and the gain is weighted by their fraction (9/10).
● Before splitting: Entropy(Parent) = −0.3 log(0.3) − 0.7 log(0.7) = 0.8813
● Split on Refund:
– Entropy(Refund = Yes) = 0
– Entropy(Refund = No) = −(2/6) log(2/6) − (4/6) log(4/6) = 0.9183
– Entropy(Children) = 0.3 (0) + 0.6 (0.9183) = 0.551
– Gain = 0.9 × (0.8813 − 0.551) ≈ 0.297

Distribute Instances (DT construction)
● The record with the missing Refund value is sent to both children of the Refund node:
– Probability that Refund = Yes is 3/9; probability that Refund = No is 6/9
– Assign the record to the left child with weight 3/9 and to the right child with weight 6/9

Classify Instances
● Probabilistic split method (C4.5): a new record with a missing value for Marital Status is propagated down every branch of the MarSt node, weighted by the (fractional) training counts that reached each branch:

              Married   Single    Divorced   Total
Class=No      3         1         0          4
Class=Yes     0         1 + 6/9   1          2.67
Total         3         2.67      1          6.67

– Probability that Marital Status = Married is 3 / 6.67 = 0.45
– Probability that Marital Status = {Single, Divorced} is 3.67 / 6.67 = 0.55

Handling interactions
● The class depends on the interaction of attributes X and Y (+ : 1000 instances, o : 1000 instances); Z is a noisy attribute generated from a uniform distribution.
● Entropy(X): 0.99, Entropy(Y): 0.99, Entropy(Z): 0.98
● Attribute Z will be chosen for splitting!

Decision Boundary
• The border line between two neighbouring regions of different classes is known as the decision boundary
• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time

Limitations of single attribute-based decision boundaries
● Both the positive (+) and the negative (o) class are generated from skewed Gaussians with centers at (8,8) and (12,12) respectively: the true separation is the test condition x + y < 20, which no single-attribute split can express.

Practical Issues of Classification
● Underfitting and overfitting
● Costs of classification

Classification Errors
● Training errors (apparent errors)
– Errors committed on the training set
● Test errors
– Errors committed on the test set
● Generalization errors
– Expected error of a model over a random selection of records from the same distribution

Underfitting and Overfitting
● Underfitting: when the model is too simple, both training and test errors are large
● Overfitting: when the model is too complex, the training error keeps decreasing while the test error grows
● Example: a decision tree with 4 nodes vs. a decision tree with 50 nodes, compared by their decision boundaries on the training data. Which tree is better?
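To see the underfitting/overfitting behaviour described above, one can grow trees of increasing size on the same data and compare training and test errors; a minimal sketch, assuming scikit-learn and synthetic noisy data (both illustrative, not from the slides):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy 2-class data: flip_y adds label noise so that very large trees overfit
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for nodes in (4, 50, 500):  # a very small tree underfits, a very large tree overfits
    t = DecisionTreeClassifier(max_leaf_nodes=nodes, random_state=0).fit(X_train, y_train)
    print(nodes,
          round(1 - t.score(X_train, y_train), 3),   # training error
          round(1 - t.score(X_test, y_test), 3))     # test error
```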