Iris Data Classification: Linear & Quadratic Discriminant Analysis, and Decision Trees

An overview of classification techniques using the example of Fisher's iris data. It covers linear and quadratic discriminant analysis and decision trees, and discusses their advantages and limitations. The document uses MATLAB Statistics Toolbox functions for demonstration.

Classification

Suppose we have a data set containing measurements on several variables for individuals in several groups. If we obtained measurements for more individuals, could we determine to which groups those individuals probably belong? This is the problem of classification. This demonstration illustrates classification by applying it to Fisher's iris data, using functions from the Statistics Toolbox.

Contents

Fisher's iris data
Linear and quadratic discriminant analysis
Decision trees
Conclusions

Fisher's iris data

Fisher's iris data consists of measurements of the sepal length, sepal width, petal length, and petal width of 150 iris specimens, 50 from each of three species. We load the data and see how the sepal measurements differ between species, using just the two columns that contain the sepal measurements.

load fisheriris
gscatter(meas(:,1), meas(:,2), species, 'rgb', 'osd');
xlabel('Sepal length');
ylabel('Sepal width');

Suppose we measure a sepal and petal from an iris, and we need to determine its species on the basis of those measurements. One approach to solving this problem is known as discriminant analysis.

Linear and quadratic discriminant analysis

The CLASSIFY function can perform classification using several types of discriminant analysis. First we classify the data using the default linear method.

linclass = classify(meas(:,1:2), meas(:,1:2), species);
bad = ~strcmp(linclass, species);
numobs = size(meas,1);
sum(bad) / numobs

ans =
    0.2000

Of the 150 specimens, 20% (30 specimens) are misclassified by the linear discriminant function. We can see which ones they are by drawing an X through each misclassified point.

hold on;
plot(meas(bad,1), meas(bad,2), 'kx');  % mark each misclassified specimen with an X
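Although the code above runs only the default linear method, CLASSIFY also accepts a discriminant type as a fourth argument. A minimal sketch of the quadratic variant named in this section's title (the variable names quadclass and badquad are ours, not from the original demo):

% Quadratic discriminant analysis: fit and compute the resubstitution error
quadclass = classify(meas(:,1:2), meas(:,1:2), species, 'quadratic');
badquad = ~strcmp(quadclass, species);
sum(badquad) / numobs

The quadratic rule estimates a separate covariance matrix for each species, so it can bend the boundary between groups, at the cost of estimating more parameters from the same data.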
Both linear and quadratic discriminant analysis are designed for situations where the measurements from each group have a multivariate normal distribution. Often that is a reasonable assumption, but sometimes you may not be willing to make it, or you may see clearly that it is not valid. In these cases a nonparametric classification procedure may be more appropriate. We look at such a procedure next.

Decision trees

Another approach to classification is based on a decision tree. A decision tree is a set of simple rules, such as "if the sepal length is less than 5.45, classify the specimen as setosa." Decision trees do not require any assumptions about the distribution of the measurements in each group, and the measurements can be categorical, discrete numeric, or continuous.

The TREEFIT function can fit a decision tree to data. We create a decision tree for the iris data and see how well it classifies the irises into species.

tree = treefit(meas(:,1:2), species);
[dtnum,dtnode,dtclass] = treeval(tree, meas(:,1:2));
bad = ~strcmp(dtclass, species);
sum(bad) / numobs

ans =
    0.1333

The decision tree misclassifies 13.3% (20) of the specimens. But how does it do that? We use the same technique as above to visualize the regions assigned to each species, evaluating the tree over a grid of points spanning the range of the sepal measurements.

[x,y] = meshgrid(4:.1:8, 2:.1:4.5);  % grid over the sepal ranges (construction not shown in this excerpt)
x = x(:);
y = y(:);
[grpnum,node,grpname] = treeval(tree, [x y]);
gscatter(x, y, grpname, 'grb', 'sod')

Another way to visualize the decision tree is to draw a diagram of its decision rules and group assignments.

treedisp(tree,'name',{'SL' 'SW'})

This cluttered-looking tree uses a series of rules of the form "SL < 5.45" to classify each specimen into one of 19 terminal nodes. To determine the species assignment for an observation, we start at the top node and apply the rule. If the observation satisfies the rule, we take the left path; if not, we take the right path. Ultimately we reach a terminal node that assigns the observation to one of the three species.

It is usually possible to find a simpler tree that performs as well as, or better than, the more complex tree. The idea is simple. We have found a tree that classifies one particular data set well, and we would like to know the "true" error rate we would incur by using it to classify new data. If we had a second data set, we could estimate that true error rate by classifying the second data set directly. We could do this for the full tree and for subsets of it. Perhaps we would find that a simpler subset gave the smallest error, because some of the decision rules in the full tree hurt rather than help.

In our case we do not have a second data set, but we can simulate one by cross-validation. We remove a subset of 10% of the data, build a tree using the other 90%, and use that tree to classify the removed 10%. We repeat this, removing each of ten subsets one at a time; for each subset we may find that smaller trees give smaller error than the full tree. Let's try it. First we compute what is called the "resubstitution error," the misclassification rate the tree achieves on the very data that was used to fit it.