CS181 Lecture 2 — Decision Trees and Classification
Avi Pfeffer; Revised by David Parkes
Jan 23, 2010

In introducing supervised learning we consider the special problem of learning a Boolean concept from training examples, some of which satisfy the concept and some of which do not. This is a classical machine learning problem. After discussing some basic ideas for supervised learning, we will turn to a particular learning algorithm, decision trees.

Optional readings for the next two lectures: Chapter 18 (through 18.4) of Russell & Norvig, and Chapters 1 & 3 of Mitchell, "Machine Learning".

1 The Task: Supervised Learning

Whenever we talk about learning, there is a hierarchy of tasks to consider. We must first talk about the ultimate task to be performed, the thing we are trying to learn to do. Then we can talk about the task of learning how to perform the ultimate task. Finally, we can consider the task of designing a learning algorithm.

For the next few classes, we will focus on classification as the ultimate task to be performed. Classification means determining what category an object falls into, based on its features (or attributes). For example, we might try to classify a plant as nutritious or poisonous, based on biological features such as color, leaf shape and so on. Or we might try to classify a pixel image as being a particular digit.

Classification. An object is described by a set of features X1, . . . , Xm. There is a set Y = {1, . . . , c} of possible classes. Given the features x = (x1, . . . , xm) ∈ X of a particular object, a classifier needs to determine its true class f(x) = y ∈ Y.[1] Thus a classifier is a function h : X → Y. The performance of a classifier h on a new instance (x, y) is measured by an error function, for example

    ∆(y, y′) = 0 if y = y′, and 1 otherwise,    (1)

where y′ = h(x). In some domains a more complex error function is appropriate. For example, in a medical domain, the cost of false negatives (missing a disease diagnosis) is likely higher than that of false positives (incorrectly diagnosing disease when there is none).

[1] We will generally use boldface to denote vectors or matrices and capital letters to denote sets, with small letters to denote particular elements of these sets.

Classification is a special case of the general problem of supervised learning.

Supervised Learning. The goal of supervised learning is to learn a classifier from a set of labeled data D. Each instance (x, y) ∈ D consists of feature values x = (x1, . . . , xm) and a target value y ∈ Y. Together, we have D = {(x1, y1), . . . , (xn, yn)}, a set of n labeled examples. A supervised learning algorithm takes this data and outputs a function h : X → Y. Thus a supervised learning algorithm can itself be considered to define a function L from labeled training data to classifiers. For a particular training set D, L(D) is a classifier.

How do we evaluate the performance of a learning algorithm L on training data D? Since the goal is to produce a classifier, we measure the performance of the learning algorithm through the performance of the classifier on future instances. A key point is that the classifier's performance is measured relative to future examples that it has not yet seen, and not the training set. The critical question is whether the learning algorithm is able to produce classifiers that generalize to data that it did not see.
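As a concrete illustration, here is a minimal sketch (an assumption for this write-up, not part of the notes) of the 0-1 error function in (1) and of the average error of a classifier on labeled data; instances are assumed to be (x, y) pairs and the function names are illustrative.

    def zero_one_error(y, y_pred):
        # Delta(y, y') = 0 if y = y', and 1 otherwise.
        return 0 if y == y_pred else 1

    def empirical_error(h, data):
        # data is a list of (x, y) pairs; h is any function from features to a class.
        return sum(zero_one_error(y, h(x)) for x, y in data) / len(data)

Measured on held-out examples rather than on the training set itself, empirical_error estimates how well h generalizes.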
When we have to choose a particular learning algorithm for a particular domain, we do not generally know exactly what the correct function will be, or exactly what the training set will look like. We need to design the learning algorithm so that it will perform well in the domain, given the particular characteristics of the domain, and given the amount of training data it will get. For example, we need to decide which features to generate to describe an object. We may also have parameters of the algorithm that we need to set. Different algorithms are good at learning different kinds of functions. Also, as we will see, some algorithms need a lot of data to work well. A key aspect of studying machine learning is not only to understand the learning algorithms themselves, but to understand what aspects of a domain make them work well.

Binary classification. Often, the possible classes in a classification problem will be true and false. In this case, the learning task is sometimes called concept learning, where the learned function provides a definition of a concept; e.g., it could represent the concept rainy, or sad, or the digit 9. For concept learning, the training data consists of positive and negative examples: instances that are, or are not, examples of the concept. If the features themselves are also Boolean, then the problem is that of learning Boolean concepts. In this case, the concept is defined by a Boolean formula over the features. We will focus here mostly on the case of learning Boolean concepts, because they are simple but already cover most of the issues that come up.

There are some useful distinctions we can make concerning learning problems. The first concerns whether the classification task is deterministic or non-deterministic. This is a distinction concerning the underlying prediction problem.

Deterministic. In a deterministic domain, if two objects have the same features, then they necessarily have the same classification: in symbols, (xi = xj) ⇒ (yi = yj).

Non-deterministic. In a non-deterministic domain, two instances may have the same features but different classifications.

A non-deterministic domain is often called noisy. Noise can arise either because of inherent non-determinism in the domain, or because of errors and mis-labeling in the data.

The second distinction is similar, but concerns the training data itself. The training data is consistent if no two instances have the same features but different classifications; otherwise it is inconsistent. Note that a deterministic domain will necessarily have a consistent training set. Still, a particular training set for a non-deterministic domain may also be consistent.
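Consistency of a training set is easy to check mechanically. The following is a minimal sketch, assuming instances are represented as a dictionary of feature values paired with a label; the representation and the function name are illustrative assumptions, not part of the notes.

    def is_consistent(data):
        # data is a list of (features, label) pairs, where features is a dict
        # mapping feature names to values.
        seen = {}
        for x, y in data:
            key = tuple(sorted(x.items()))   # hashable view of the feature values
            if key in seen and seen[key] != y:
                return False                 # same features, different labels
            seen[key] = y
        return True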
2 Is Learning Possible?

Before we go on to consider how learning is done, we need to consider whether learning is possible at all. In fact, there are strong arguments that learning is logically impossible! Consider the problem of learning Boolean concepts, and assume the target concept is deterministic. We want to learn a Boolean formula over the features X1, . . . , Xm from data consisting of positive and negative examples. For notational convenience, let's order the data so that the positive examples appear first. So the data consists of positive examples {x1, . . . , xk} and negative examples {xk+1, . . . , xn}.

Suppose we want to classify a new instance x. What should we do? There are two cases. If x appears in the data, we should provide exactly the same classification (because the domain is deterministic). If it does [...]

In describing the decision tree framework, we need to describe two things: the decision trees themselves, which are the classifiers to be learned, and an algorithm for learning a decision tree from data.

A decision tree is a representation of a function from a set of (discrete) features to a classification. As its name suggests, a decision tree can be understood as a rule that is structured as a tree. Each internal node of the tree is labeled by a feature Xk. There is a branch leaving the Xk node corresponding to each possible value of Xk. A leaf of the tree is associated with the training data that respects the branching decisions from the root to the leaf, and eventually (once the tree is built) with a prediction.

For example, let us consider the Nutritious vs. Poisonous classification problem for predicting whether or not a plant might be good to eat.[2] Suppose there are four features, each with the following values:

• Skin (smooth, rough or scaly)
• Color (pink, purple or orange)
• Thorny (true or false)
• Flowering (true or false)

A possible decision tree for this domain is shown below.

[Figure: a decision tree with Skin at the root (branches smooth, scaly and rough); one subtree tests Thorny (T/F) and another tests Flowering (T/F), giving five leaves labeled P, N, N, P, N.]

The feature branched on at the root is Skin, and if this is smooth, then the next feature branched on is Thorny, which is either true or false. On the five leaves we see the assigned labels of (P)oisonous, (N)utritious, (N)utritious, (P)oisonous and (N)utritious. Note that the decision tree does not need to branch on all possible features on a path from the root to a leaf, nor on the same features on every path.

[2] This example is based on Chapter 3 of Mitchell.

Given a decision tree, we use it to classify a new instance as follows. Beginning at the root, select the appropriate subtree based on the feature values of the instance, until reaching a leaf. The prediction is the label associated with the leaf. Formally, a tree classifier h classifies an instance x = (x1, . . . , xk, . . . , xm) as follows:

• at a feature node Xk, follow the subtree reached by the branch labeled xk;
• at a leaf with class label y′, the classification is y′.
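As a concrete (and assumed, not from the notes) representation, the sketch below encodes a tree as nested tuples: a leaf is ("leaf", label) and an internal node is ("node", feature, branches), where branches maps each feature value to a subtree. The example tree is one possible reading of the figure above; the exact arrangement of its leaf labels is illustrative.

    def classify(tree, x):
        # Walk from the root to a leaf, following the branch for x's value of the
        # feature tested at each internal node.
        if tree[0] == "leaf":
            return tree[1]                      # at a leaf, return its class label
        _, feature, branches = tree
        return classify(branches[x[feature]], x)

    # One possible tree for the plant domain (leaf labels chosen for illustration).
    plant_tree = ("node", "Skin", {
        "smooth": ("node", "Thorny",    {True: ("leaf", "P"), False: ("leaf", "N")}),
        "scaly":  ("node", "Flowering", {True: ("leaf", "P"), False: ("leaf", "N")}),
        "rough":  ("leaf", "N"),
    })

    classify(plant_tree, {"Skin": "smooth", "Thorny": False,
                          "Color": "pink", "Flowering": True})   # -> "N"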
What is the expressive power of decision trees? In fact, any function f : X → Y from discrete (and finite) features X to a discrete (and finite) set Y can be represented as a decision tree. To see this, consider the complete tree, which splits on each of the features in turn, so that a leaf is reached only after assigning values to all the features. This means that a learning algorithm that learns decision trees does not have a restriction bias. Therefore, if the algorithm is to be successful, it must have a preference bias.

What kind of preference bias is appropriate here? The preference bias of decision tree learning algorithms is to prefer shorter trees. This bias is a form of the famous Occam's razor: if there are multiple ways of explaining the same phenomenon, the simpler hypothesis is to be preferred. In the case of decision trees, simpler is taken to mean shorter (i.e., a less complex decision process). In this lecture we will see the use of information gain to achieve a simple, but generally insufficient, preference bias. In the next lecture we will see additional techniques that are introduced to increase the preference bias.

Because of this preference bias, decision trees are naturally better at expressing some kinds of functions than others. They are very good at representing functions where a small number of features are critical to the classification. They are bad at functions that depend approximately equally on all of the features. Examples of functions which decision trees have a hard time with are the majority and parity functions.

Majority. f(x) returns 1 if the number of features that are 1 is at least as large as the number that are 0, and 0 otherwise.

Parity. f(x) returns 1 if the number of features that are 1 is odd, and 0 otherwise.

Again I want to stress the key point: understanding the inductive bias of an algorithm can tell us what sort of domains it will work well in.

5 Algorithm for Learning Decision Trees: ID3

We have now described the hypothesis space of decision trees, and the basic idea of the preference bias that will be adopted. But how do we actually go about learning decision trees? One obvious answer is to consider all possible trees, and choose the shortest one that is consistent with the training data. In the case of inconsistent training data, there will be no tree that is consistent; nevertheless, one may choose the shortest tree amongst those with the fewest errors on the training data.

This is clearly an infeasible approach. The number of decision trees on m binary features is doubly exponential in m. To see this, note that a function is described by the 2^m rows of a truth table, and thus by the {0, 1} entries in its final column. There are 2^(2^m) possible final columns, and thus that many functions. Because there is no restriction bias, the number of trees is at least this many, and in fact more, because multiple trees represent the same function. So an exhaustive search is clearly not going to work.

Instead, a greedy search procedure is used that grows a decision tree one node at a time, from the root downwards. The learning algorithm, called ID3, is recursive. It takes as arguments a set of n training instances D and a set of m possible features X to split on. At each node, starting at the root, it decides whether to split again on a feature or to stop and create a leaf with a predicted label. In particular, at a node it can either:

• Stop growing the tree, and return a classification for objects that fall into this leaf.
• Split on a feature Xk that has not yet been branched on. In this case, the training data is split into subsets Dx, corresponding to the different possible values x of Xk. A subtree is recursively grown for each Dx, and the edge from the node to the subtree for Dx is labeled by the corresponding value x. For the recursive call, Xk is removed from the set of available features to split on, since there is no point in splitting on the same feature twice.

The ID3 algorithm is as follows:

    ID3(D, X)
      Let T be a new tree
      If all examples in D have the same class y
        Label(T) := y
        Return T
      If X = ∅                                  // no attributes left to branch on
        Label(T) := the most common classification in D
        Return T
      Choose the best splitting feature Xk in X // described in the "information gain" section
      Label(T) := Xk
      For each value x of Xk
        Let Dx be the subset of instances in D that have value x for Xk
        If Dx is empty                          // no data left
          Let Tx be a new tree
          Label(Tx) := the most common classification in D
        Else
          Tx := ID3(Dx, X − {Xk})
        Add a branch from T to Tx, labeled by x
      Return T

Careful: A common mistake is to forget that D in the above recursive definition of ID3 is the current data set, at that node. It is NOT the complete set of training data.
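The following is a minimal Python sketch of ID3 in the form given above, reusing the tuple representation of trees from the earlier classify sketch; the splitting criterion is passed in as a function, since it is only defined in the next section. All names here are assumptions for illustration, not part of the notes.

    from collections import Counter

    def majority_class(data):
        # The most common classification in D.
        return Counter(y for _, y in data).most_common(1)[0][0]

    def id3(data, features, values, choose_feature):
        # data: list of (features_dict, label) pairs at this node (NOT the full
        #   training set); features: names still available to split on;
        # values: dict mapping each feature name to its full set of possible
        #   values, so branches are created even for values absent from data;
        # choose_feature(data, features): the splitting criterion (next section).
        labels = {y for _, y in data}
        if len(labels) == 1:                       # all examples share one class
            return ("leaf", labels.pop())
        if not features:                           # no attributes left to branch on
            return ("leaf", majority_class(data))
        best = choose_feature(data, features)
        branches = {}
        for v in values[best]:
            subset = [(x, y) for x, y in data if x[best] == v]
            if not subset:                         # no data left for this value
                branches[v] = ("leaf", majority_class(data))
            else:
                branches[v] = id3(subset, [f for f in features if f != best],
                                  values, choose_feature)
        return ("node", best, branches)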
We already see that ID3 can stop growing the tree in any recursive call for one of three reasons:

• There is no data left, i.e., no training data that respects the branching decisions made between this node and the root. In this case, there is no need to grow the tree further, because any further elaboration would introduce complexity that is not supported by the training data. This is one place where the preference bias of the algorithm comes into play.

• All training instances that respect the branching decisions made between this node and the root have the same classification y. In this case the constant function that returns y can be returned as the perfect classifier for future data with these feature values, and there is no need to elaborate further. Again, this is an example of where the algorithm includes a preference bias.

• There are no features left to split on. Since all features have already been split on, all instances in the training data that respect the branching decisions made between this node and the root must have the same value for all features, so there is no point in splitting any further. If this case is reached, then (since the examples at the node were already checked and do not all share the same class) the training data, and thus the domain, must necessarily be inconsistent.

We left undefined in ID3 how the "best splitting feature Xk in X" is selected. There we will also see an additional reason why ID3 will stop splitting. We explain this next.

6 Information Gain: Deciding How to Branch

The one thing that remains to be defined is the criterion for choosing which feature to split on. What criterion should we choose? Consider the ultimate goal: to produce short trees. This is the most basic preference bias of ID3. ID3 is a greedy algorithm that tries to grow the tree in such a way that the result is a short tree. Therefore, the criterion for choosing which feature to split on should tend to produce short trees. In general, the more "orderly" the data is, the shorter the tree needed to represent it. At one extreme, if all instances have the same classification, then ID3 will generate a tree of depth 0.

Information theory gives us a precise concept of the "orderliness" of data. The key notion in information theory is entropy, defined as follows:

    Entropy(D) = − ∑_{y∈Y} (n_y / n) log2(n_y / n),    (4)

where n is the number of instances in D and n_y is the number of instances in D with class y.
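Below is a minimal sketch of the entropy in (4), together with the standard information-gain criterion (the expected reduction in entropy from splitting on a feature) that is commonly used to pick the "best splitting feature" in ID3; the function names, and the use of the standard gain definition, are assumptions for illustration. The resulting choose_feature can be passed to the id3 sketch above.

    import math
    from collections import Counter

    def entropy(data):
        # Entropy(D) = - sum over classes y of (n_y / n) * log2(n_y / n).
        n = len(data)
        counts = Counter(y for _, y in data)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def information_gain(data, feature):
        # Entropy of D minus the expected entropy after splitting D on the feature.
        n = len(data)
        remainder = 0.0
        for v in {x[feature] for x, _ in data}:
            subset = [(x, y) for x, y in data if x[feature] == v]
            remainder += (len(subset) / n) * entropy(subset)
        return entropy(data) - remainder

    def choose_feature(data, features):
        # Greedy choice: split on the feature with the highest information gain.
        return max(features, key=lambda f: information_gain(data, f))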