
Introduction to Bioinformatics: Microarray Data Analysis - Classification Techniques

A part of the "Introduction to Bioinformatics" course notes by Prof. A. Y. Zhang at the University of Kansas, focusing on classification techniques used in microarray data analysis. Topics include k-nearest neighbor, decision trees, the naïve Bayes classifier, and support vector machines, with their respective advantages and disadvantages.

Partial preview of the text

EECS 690 Introduction to Bioinformatics
Microarray Data Analysis IV
Anne Y. Zhang, Electrical Engineering and Computer Science, University of Kansas
http://people.eecs.ku.edu/~yazhang/

General model of classification
Given input X and output Y, we want to estimate the unknown function f mapping X to Y, based on a known data set (the training data).

K Nearest Neighbor
A new point is classified by majority vote among its k nearest neighbors. (In the slide's figure, the new point is labeled brown for k = 1 and green for k = 3.) k-NN requires a parameter k and a distance metric. For k = 1 the training error is zero, but the test error can be large. As k increases, the training error tends to increase, while the test error tends to decrease at first and then increase.

Decision Tree Learning
The classification model is a tree, called a decision tree: a flow-chart-like structure in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label. In the slide's example, the root tests age? with branches <=30, 30..40, and >40; the <=30 branch tests student? (no -> no, yes -> yes), the 30..40 branch is the leaf yes, and the >40 branch tests credit rating? (excellent -> no, fair -> yes).

Naïve Bayes Classifier
Assume a target function f : X -> Y, where each instance X is described by attributes <x1, x2, ..., xn>. The most probable value of f(X) is:

$$
\begin{aligned}
v &= \arg\max_{y_j \in V} P(y_j \mid x_1, \ldots, x_n)
   = \arg\max_{y_j \in V} \frac{P(x_1, \ldots, x_n \mid y_j)\, P(y_j)}{P(x_1, \ldots, x_n)} \\
  &= \arg\max_{y_j \in V} P(x_1, \ldots, x_n \mid y_j)\, P(y_j)
   = \arg\max_{y_j \in V} P(y_j) \prod_i P(x_i \mid y_j)
\end{aligned}
$$

Support Vector Machines

Linear classifiers - which line is better?
Data: <Xi, yi>, i = 1, ..., n, where Xi = {xi1, ..., xip} holds p features/attributes and yi ∈ {-1, +1} is the class. Consider a two-class, linearly separable classification problem. There are many decision boundaries! Are all decision boundaries equally good?

Intuition I
Maximize the distance between the decision boundary and the "difficult points" close to it. Intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

Another intuition
If you have to place a fat separator between the classes, you have fewer choices, and so the capacity of the model has been decreased.

Support Vector Machine (SVM)
SVMs maximize the margin around the separating hyperplane; the training points lying on the margin are the support vectors. The decision function is fully specified by this subset of the training samples. SVMs are seen by many as the most successful current classification method.

If not linearly separable
Allow some errors: let some points be moved to where they belong, at a cost. Still, try to place the decision boundary "far" from each class. Such classifiers are called large margin classifiers.

Maximum Margin: Formalization
Let w be the normal of the decision hyperplane, xi data point i, and yi the class of data point i (+1 or -1). The classification function is f(x) = sign(w^T x + b), and the decision boundary is w^T x + b = 0. The distance from an example x to the separator is

$$ r = \frac{w^T x + b}{\lVert w \rVert} $$
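To make the formalization concrete, here is a minimal NumPy sketch of the classification function and the distance formula above; the weight vector, bias, and toy points are illustrative values, not from the slides:

```python
import numpy as np

def linear_decision(w, b, X):
    """f(x) = sign(w^T x + b) for each row of X."""
    return np.sign(X @ w + b)

def distance_to_separator(w, b, X):
    """Signed distance r = (w^T x + b) / ||w|| from each point to the boundary."""
    return (X @ w + b) / np.linalg.norm(w)

# Hypothetical separator x1 + x2 = 1 and three toy points
w, b = np.array([1.0, 1.0]), -1.0
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.25, 0.25]])
print(linear_decision(w, b, X))        # [-1.  1. -1.]
print(distance_to_separator(w, b, X))  # approx [-0.707  0.707 -0.354]
```

The margin is the smallest |r| over the training points; SVM training chooses w and b to maximize it, so the "difficult points" closest to the boundary become the support vectors.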
Software
SVM applets: http://www.site.uottawa.ca/~gcaron/SVMApplet/SVMApplet.html
SVM toolkits: http://svmlight.joachims.org/ , http://www.csie.ntu.edu.tw/~cjlin/libsvm/ , http://theoval.sys.uea.ac.uk/~gcc/svm/toolbox/

SVMLight
SVMLight is an implementation of Support Vector Machines (SVM) in C. Download the source from http://svmlight.joachims.org/ , which also gives a detailed description of what the features of SVMLight are, how to install it, how to use it, and more.

Training Step
svm_learn [options] train_file model_file
• train_file contains the training data;
• the filename and extension of train_file can be chosen arbitrarily by the user;
• model_file contains the model built from the training data by the SVM.

Format of input file (training data)
For cancer classification, the training data is a collection of gene expression profiles for patients. Each line represents a person, and each feature represents one of that person's gene expression levels; the feature value is, e.g., a log-ratio:

1 101:0.2 205:4 209:0.2 304:0.2 ...
-1 202:0.1 203:0.1 208:0.1 209:0.3 ...

Testing Step
svm_classify test_file model_file predictions
• The format of test_file is exactly the same as that of train_file, and it needs to be scaled into the same range;
• we use the model built from the training data to classify the test data, and compare the predictions with the original label of each test example.

Example
In test_file, we have:
1 101:0.2 205:4 209:0.2 304:0.2 ...
-1 202:0.1 203:0.1 208:0.1 209:0.3 ...

After running svm_classify, the predictions may be:
1.045
-0.987
which means the classifier classifies both patients correctly (the sign of each prediction matches the label). Or they may be:
1.045
0.987
which means the first patient is classified correctly but the second one incorrectly.

Performance Evaluation
Confusion matrix:

                 Predicted negative   Predicted positive
Actual negative          a                    b
Actual positive          c                    d

• a is the number of correct predictions that an instance is negative;
• b is the number of incorrect predictions that an instance is positive;
• c is the number of incorrect predictions that an instance is negative;
• d is the number of correct predictions that an instance is positive.

Accuracy (AC) is the proportion of the total number of predictions that were correct: AC = (a + d) / (a + b + c + d).
Recall is the proportion of actual positive cases (c + d of them) that were correctly identified: R = d / (c + d).
Precision is the proportion of predicted positive cases (b + d of them) that were correct: P = d / (b + d).

Example
The test set has 450 actual "-" cases and 550 actual "+" cases. The classifier predicts 400 of the negatives and 20 of the positives as "-", and 50 of the negatives and 530 of the positives as "+". So a = 400, b = 50, c = 20, d = 530, and:
Accuracy = (400 + 530) / 1000 = 93%
Precision = d / (b + d) = 530 / 580 = 91.4%
Recall = d / (c + d) = 530 / 550 = 96.4%

Parameter selection with cross-validation
K-fold cross-validation: the training data X are randomly divided into K sets X_i (i = 1, ..., K) of as nearly equal size as possible. Fold i validates on X_i and trains on the union of the others, so any two training sets T_i share K - 2 of the K parts:

$$
\begin{aligned}
V_1 &= X_1, & T_1 &= X_2 \cup X_3 \cup \cdots \cup X_K \\
V_2 &= X_2, & T_2 &= X_1 \cup X_3 \cup \cdots \cup X_K \\
&\vdots \\
V_K &= X_K, & T_K &= X_1 \cup X_2 \cup \cdots \cup X_{K-1}
\end{aligned}
$$
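A minimal sketch of generating the (T_i, V_i) splits described above, assuming NumPy; the function name kfold_splits and the toy sizes are mine, not from the slides:

```python
import numpy as np

def kfold_splits(n_samples, k, seed=0):
    """Yield (T_i, V_i) index pairs for K-fold cross-validation: the data are
    randomly divided into k parts of nearly equal size; fold i validates on
    part i and trains on the union of the rest."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([parts[j] for j in range(k) if j != i])
        yield train, parts[i]

for T, V in kfold_splits(n_samples=10, k=5):
    print("train:", np.sort(T), "validate:", np.sort(V))
```

Any two of the training sets produced this way overlap in exactly k - 2 of the k parts, as noted above.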
5×2 Cross-Validation
Five times 2-fold cross-validation: in each replication j (j = 1, ..., 5), the data are split into two halves X_1^{(j)} and X_2^{(j)}, and each half serves in turn as training and validation set, giving ten folds in total (a runnable sketch of generating these folds appears at the end of this section):

$$
\begin{aligned}
T_1 &= X_1^{(1)}, & V_1 &= X_2^{(1)} \\
T_2 &= X_2^{(1)}, & V_2 &= X_1^{(1)} \\
T_3 &= X_1^{(2)}, & V_3 &= X_2^{(2)} \\
T_4 &= X_2^{(2)}, & V_4 &= X_1^{(2)} \\
&\vdots \\
T_9 &= X_1^{(5)}, & V_9 &= X_2^{(5)} \\
T_{10} &= X_2^{(5)}, & V_{10} &= X_1^{(5)}
\end{aligned}
$$

Other classification problems in Bioinformatics

Remote Homology Detection
The problem: protein homology is a key clue to structure and function prediction. High sequence similarity indicates homology; remote homology (the "twilight zone") is challenging to detect.
The data: DNA or protein sequences (homologous proteins come from a common ancestor), e.g., from the SCOP database. The inputs are non-vector and of variable lengths.

Classification of Genes and Proteins
The problem: put genes and proteins into biologically interesting categories, such as functions, subcellular localization, co-regulation, or clinical categories (benign vs. pathogenic).
The data: sequences, promoter regions, phylogenetic profiles.
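As promised above, a minimal sketch of generating the ten 5×2 cross-validation folds, again assuming NumPy; the function name and the fixed seed are illustrative:

```python
import numpy as np

def five_by_two_cv(n_samples, seed=0):
    """Yield (T, V) index pairs for 5x2 cross-validation: five replications,
    each splitting the data into two halves that serve in turn as training
    and validation sets (ten folds in total)."""
    rng = np.random.default_rng(seed)
    for _ in range(5):
        perm = rng.permutation(n_samples)
        half1, half2 = perm[: n_samples // 2], perm[n_samples // 2 :]
        yield half1, half2   # T = X1^(j), V = X2^(j)
        yield half2, half1   # T = X2^(j), V = X1^(j)

for fold, (T, V) in enumerate(five_by_two_cv(n_samples=8), start=1):
    print(fold, "train:", np.sort(T), "validate:", np.sort(V))
```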