Iris Data Classification: Linear & Quadratic Discriminant Analysis, and Decision Trees

An overview of classification techniques using the example of Fisher's iris data. It covers linear and quadratic discriminant analysis and decision trees, and discusses their advantages and limitations. The document uses MATLAB Statistics Toolbox functions for demonstration.

Classification

Suppose we have a data set containing measurements on several variables for individuals in several groups. If we obtained measurements for more individuals, could we determine to which groups those individuals probably belong? This is the problem of classification. This demonstration illustrates classification by applying it to Fisher's iris data, using functions from the Statistics Toolbox.

Contents

Fisher's iris data
Linear and quadratic discriminant analysis
Decision trees
Conclusions

Fisher's iris data

Fisher's iris data consists of measurements of the sepal length, sepal width, petal length, and petal width of 150 iris specimens, 50 from each of three species. We load the data and see how the sepal measurements differ between species, using just the two columns that contain the sepal measurements.

load fisheriris
gscatter(meas(:,1), meas(:,2), species, 'rgb', 'osd');
xlabel('Sepal length');
ylabel('Sepal width');

Suppose we measure a sepal and petal from an iris, and we need to determine its species on the basis of those measurements. One approach to solving this problem is known as discriminant analysis.

Linear and quadratic discriminant analysis

The CLASSIFY function can perform classification using several types of discriminant analysis. First we classify the data using the default linear method.

linclass = classify(meas(:,1:2), meas(:,1:2), species);
bad = ~strcmp(linclass, species);
numobs = size(meas,1);
sum(bad) / numobs

ans =
    0.2000

Of the 150 specimens, 20% (30 specimens) are misclassified by the linear discriminant function. We can see which ones they are by drawing an X through each misclassified point.

hold on;
plot(meas(bad,1), meas(bad,2), 'kx');  % mark each misclassified specimen with an X
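Although the code above runs only the default linear method, CLASSIFY also accepts a discriminant type as a fourth argument. A minimal sketch of the quadratic variant named in this section's title (the variable names quadclass and badquad are ours, not from the original demo):

% Quadratic discriminant analysis: fit and compute the resubstitution error
quadclass = classify(meas(:,1:2), meas(:,1:2), species, 'quadratic');
badquad = ~strcmp(quadclass, species);
sum(badquad) / numobs

The quadratic rule estimates a separate covariance matrix for each species, so it can bend the boundary between groups, at the cost of estimating more parameters from the same data.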
Both linear and quadratic discriminant analysis are designed for situations where the measurements from each group have a multivariate normal distribution. Often that is a reasonable assumption, but sometimes you may not be willing to make it, or you may see clearly that it is not valid. In these cases a nonparametric classification procedure may be more appropriate. We look at such a procedure next.

Decision trees

Another approach to classification is based on a decision tree. A decision tree is a set of simple rules, such as "if the sepal length is less than 5.45, classify the specimen as setosa." Decision trees do not require any assumptions about the distribution of the measurements in each group, and the measurements can be categorical, discrete numeric, or continuous.

The TREEFIT function can fit a decision tree to data. We create a decision tree for the iris data and see how well it classifies the irises into species.

tree = treefit(meas(:,1:2), species);
[dtnum,dtnode,dtclass] = treeval(tree, meas(:,1:2));
bad = ~strcmp(dtclass, species);
sum(bad) / numobs

ans =
    0.1333

The decision tree misclassifies 13.3% (20) of the specimens. But how does it do that? We use the same technique as above to visualize the regions assigned to each species, evaluating the tree over a grid of points spanning the range of the sepal measurements.

[x,y] = meshgrid(4:.1:8, 2:.1:4.5);  % grid over the sepal ranges (construction not shown in this excerpt)
x = x(:);
y = y(:);
[grpnum,node,grpname] = treeval(tree, [x y]);
gscatter(x, y, grpname, 'grb', 'sod')

Another way to visualize the decision tree is to draw a diagram of its decision rules and group assignments.

treedisp(tree,'name',{'SL' 'SW'})

This cluttered-looking tree uses a series of rules of the form "SL < 5.45" to classify each specimen into one of 19 terminal nodes. To determine the species assignment for an observation, we start at the top node and apply the rule. If the observation satisfies the rule, we take the left path; if not, we take the right path. Ultimately we reach a terminal node that assigns the observation to one of the three species.

It is usually possible to find a simpler tree that performs as well as, or better than, the more complex tree. The idea is simple. We have found a tree that classifies one particular data set well, and we would like to know the "true" error rate we would incur by using it to classify new data. If we had a second data set, we could estimate that true error rate by classifying the second data set directly. We could do this for the full tree and for subsets of it. Perhaps we would find that a simpler subset gave the smallest error, because some of the decision rules in the full tree hurt rather than help.

In our case we do not have a second data set, but we can simulate one by cross-validation. We remove a subset of 10% of the data, build a tree using the other 90%, and use that tree to classify the removed 10%. We repeat this, removing each of ten subsets one at a time; for each subset we may find that smaller trees give smaller error than the full tree. Let's try it. First we compute what is called the "resubstitution error," the misclassification rate the tree achieves on the very data that was used to fit it.