EECS 690 Introduction to Bioinformatics
Microarray Data Analysis IV: Classification Techniques
Anne Y. Zhang
Electrical Engineering and Computer Science, University of Kansas
http://people.eecs.ku.edu/~yazhang/
2006-4-18

General model of classification
Given input X and output Y, we want to estimate the unknown function f (with Y = f(X)) based on a known data set, the training data.

K Nearest Neighbor
Classify a new point by majority vote among its k nearest neighbors (in the example: k = 1 gives brown, k = 3 gives green).
k-NN requires a parameter k and a distance metric.
For k = 1, the training error is zero, but the test error can be large.
As k increases, the training error tends to increase, while the test error tends to decrease first and then increase.

Decision Tree Learning
The classification model is a tree, called a decision tree: a flow-chart-like structure in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label. Example tree:
  age?
    <=30   -> student?  (no -> no; yes -> yes)
    30..40 -> yes
    >40    -> credit rating?  (excellent -> no; fair -> yes)

Naïve Bayes Classifier
Assume a target function f : X → Y, where each instance X_i is described by attributes <x1, x2, …, xn>. The most probable value of f(X) is:
\[
\begin{aligned}
v &= \arg\max_{y_j \in V} P(y_j \mid x_1,\ldots,x_n)
   = \arg\max_{y_j \in V} \frac{P(x_1,\ldots,x_n \mid y_j)\,P(y_j)}{P(x_1,\ldots,x_n)} \\
  &= \arg\max_{y_j \in V} P(x_1,\ldots,x_n \mid y_j)\,P(y_j)
   = \arg\max_{y_j \in V} P(y_j) \prod_i P(x_i \mid y_j)
\end{aligned}
\]

Support Vector Machines

Linear classifiers – which line is better?
Consider a two-class, linearly separable classification problem.
Data: <Xi, yi>, i = 1,…,n, where Xi = {xi1,…,xip} are p features/attributes and yi ∈ {-1,+1} is the class.
Many decision boundaries separate the classes!
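The k-NN majority-vote rule described above can be sketched in a few lines of Python; the 2-D points and class labels below are hypothetical, chosen so that k = 1 and k = 3 disagree, as in the slide's example:

```python
from collections import Counter
import math

def knn_classify(train, query, k):
    """Classify `query` by majority vote among its k nearest
    training points, using Euclidean distance."""
    # Sort training points by distance to the query point.
    by_dist = sorted(train, key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training points with two classes.
train = [((0.1, 0.0), "brown"), ((0.3, 0.0), "green"),
         ((0.4, 0.0), "green"), ((5.0, 0.0), "brown")]
print(knn_classify(train, (0.0, 0.0), 1))  # single nearest point decides
print(knn_classify(train, (0.0, 0.0), 3))  # majority of the 3 nearest
```

With k = 1 the single nearest point (brown) decides; with k = 3 the majority (two green vs. one brown) flips the decision, illustrating how the choice of k changes the classifier.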
Are all decision boundaries equally good?

Intuition I
Maximize the distance between the decision boundary and the "difficult points" close to it. Intuition: if there are no points near the decision surface, then there are no very uncertain classification decisions.

Another intuition
If you have to place a fat separator between the classes, you have fewer choices, so the capacity of the model has been decreased.

Support Vector Machine (SVM)
SVMs maximize the margin around the separating hyperplane. The decision function is fully specified by a subset of the training samples, the support vectors. SVMs are seen by many as the most successful current classification method.

If not linearly separable
Allow some errors: let some points be moved to where they belong, at a cost. Still, try to place the decision boundary "far" from each class (large-margin classifiers).

Maximum Margin: Formalization
w: decision hyperplane normal
xi: data point i
yi: class of data point i (+1 or -1)
Classification function: f(x) = sign(w^T x + b)
Decision boundary: w^T x + b = 0
Distance from an example x to the separator:
\[ r = \frac{w^T x + b}{\lVert w \rVert} \]

Software
SVM applets:
http://www.site.uottawa.ca/~gcaron/SVMApplet/SVMApplet.html
SVM toolkits:
http://svmlight.joachims.org/
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
http://theoval.sys.uea.ac.uk/~gcc/svm/toolbox/

SVMLight
SVMLight is an implementation of the Support Vector Machine (SVM) in C. Download the source from http://svmlight.joachims.org/, which also describes SVMLight's features in detail and explains how to install and use it.
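The decision function f(x) = sign(w^T x + b) and the distance formula r = (w^T x + b)/||w|| from the formalization above can be checked numerically; the weight vector, bias, and point below are made-up values for illustration:

```python
import math

def decision(w, b, x):
    """f(x) = sign(w^T x + b), returned as +1 or -1."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s >= 0 else -1

def distance_to_separator(w, b, x):
    """r = (w^T x + b) / ||w||: signed distance from x to the
    hyperplane w^T x + b = 0."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return s / math.hypot(*w)

# Hypothetical separator x1 + x2 - 1 = 0, i.e. w = (1, 1), b = -1.
w, b = (1.0, 1.0), -1.0
print(decision(w, b, (2.0, 2.0)))               # 1
print(distance_to_separator(w, b, (2.0, 2.0)))  # 3/sqrt(2), about 2.121
```

Points with larger |r| lie farther from the boundary, which is why maximizing the margin pushes the "difficult points" away from the decision surface.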
Training Step
svm_learn [-option] train_file model_file
• train_file contains the training data;
• the filename and extension of train_file can be chosen arbitrarily by the user;
• model_file receives the model that SVMLight builds from the training data.

Format of input file (training data)
For cancer classification, the training data is a collection of gene expression profiles for patients:
• each line represents a patient;
• each feature represents one gene's expression level for that patient;
• feature values are, e.g., log-ratios.

1 101:0.2 205:4 209:0.2 304:0.2 …
-1 202:0.1 203:0.1 208:0.1 209:0.3 …
…

Testing Step
svm_classify test_file model_file predictions
• the format of test_file is exactly the same as that of train_file;
• the test data needs to be scaled into the same range as the training data;
• the model built from the training data is used to classify the test data, and the predictions are compared with the original label of each test example.

Example
In test_file we have:
1 101:0.2 205:4 209:0.2 304:0.2 …
-1 202:0.1 203:0.1 208:0.1 209:0.3 …
After running svm_classify, the predictions may be:
1.045
-0.987
…
which means the classifier classified both patients correctly (the sign of each prediction matches the true label). If instead the predictions were:
1.045
0.987
…
the first patient would be classified correctly but the second one incorrectly.

Performance Evaluation: Confusion Matrix

                     Predicted
                 negative   positive
Actual negative      a          b
       positive      c          d

• a is the number of correct predictions that an instance is negative;
• b is the number of incorrect predictions that an instance is positive;
• c is the number of incorrect predictions that an instance is negative;
• d is the number of correct predictions that an instance is positive.
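Given the true labels and real-valued prediction scores (as in the svm_classify example above, where the sign of each score is the predicted class), the four confusion-matrix counts a, b, c, d can be tallied like this; the labels and scores are made-up examples:

```python
def confusion_matrix(labels, scores):
    """Count (a, b, c, d) for a two-class problem: the predicted
    class is the sign of each real-valued score, and the true
    class label is +1 or -1."""
    a = b = c = d = 0
    for y, s in zip(labels, scores):
        pred = 1 if s > 0 else -1
        if y == -1 and pred == -1:
            a += 1   # correct negative
        elif y == -1 and pred == 1:
            b += 1   # incorrect positive
        elif y == 1 and pred == -1:
            c += 1   # incorrect negative
        else:
            d += 1   # correct positive
    return a, b, c, d

# Hypothetical true labels and prediction scores.
labels = [1, -1, 1, -1]
scores = [1.045, -0.987, -0.3, 0.2]
print(confusion_matrix(labels, scores))  # (1, 1, 1, 1)
```

The returned tuple corresponds cell-by-cell to the confusion matrix above.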
Performance Evaluation
Accuracy (AC) is the proportion of the total number of predictions that were correct:
AC = (a + d) / (a + b + c + d)
Recall is the proportion of actual positive cases that were correctly identified:
R = d / (c + d)
Precision is the proportion of predicted positive cases that were correct:
P = d / (b + d)

Example
1000 test cases: 450 are actually "-" and 550 are actually "+". The classifier predicts 400 of the negatives and 530 of the positives correctly, and mispredicts the remaining 50 negatives and 20 positives. For this classifier:
a = 400, b = 50, c = 20, d = 530
Accuracy = (400 + 530) / 1000 = 93%
Precision = d / (b + d) = 530 / 580 = 91.4%
Recall = d / (c + d) = 530 / 550 = 96.4%

Parameter selection with cross-validation
K-fold cross-validation: the training data X are randomly divided into K sets Xi (i = 1,…,K) of as nearly equal size as possible. Fold i validates on Vi = Xi and trains on Ti, the union of the other K-1 sets; any two training sets Ti and Tj share K-2 of the K parts.
\[
\begin{aligned}
V_1 &= X_1, & T_1 &= X_2 \cup X_3 \cup \cdots \cup X_K \\
V_2 &= X_2, & T_2 &= X_1 \cup X_3 \cup \cdots \cup X_K \\
&\vdots \\
V_K &= X_K, & T_K &= X_1 \cup X_2 \cup \cdots \cup X_{K-1}
\end{aligned}
\]

5×2 Cross-Validation
Five repetitions of 2-fold cross-validation: the data are split into halves X_i^{(1)} and X_i^{(2)} five times (i = 1,…,5), and each split yields two folds in which the halves swap training and validation roles:
\[
\begin{aligned}
T_1 &= X_1^{(1)}, & V_1 &= X_1^{(2)}, &\quad T_2 &= X_1^{(2)}, & V_2 &= X_1^{(1)} \\
T_3 &= X_2^{(1)}, & V_3 &= X_2^{(2)}, &\quad T_4 &= X_2^{(2)}, & V_4 &= X_2^{(1)} \\
&\vdots \\
T_9 &= X_5^{(1)}, & V_9 &= X_5^{(2)}, &\quad T_{10} &= X_5^{(2)}, & V_{10} &= X_5^{(1)}
\end{aligned}
\]

Other classification problems in Bioinformatics

Remote Homology Detection
The problem:
• protein homology is a key clue to structure and function prediction;
• high sequence similarity indicates homology;
• remote homology – the "twilight zone" – is challenging to detect.
The data:
• sequences (DNA or proteins); homologous proteins come from a common ancestor;
• the SCOP database;
• non-vector inputs of variable lengths.

Classification of Genes and Proteins
Problem: put genes and proteins into biologically interesting categories:
• functions;
• subcellular localization;
• co-regulation;
• clinical categories – benign vs. pathogenic.
Data:
• sequences;
• promoter regions;
• phylogenetic profiles.
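The accuracy, precision, and recall formulas from the evaluation slides can be verified against the worked example there (a = 400, b = 50, c = 20, d = 530):

```python
def accuracy(a, b, c, d):
    """AC = (a + d) / (a + b + c + d)."""
    return (a + d) / (a + b + c + d)

def precision(b, d):
    """P = d / (b + d): fraction of predicted positives that are correct."""
    return d / (b + d)

def recall(c, d):
    """R = d / (c + d): fraction of actual positives correctly identified."""
    return d / (c + d)

# Counts from the worked example: 1000 test cases, 450 "-" and 550 "+".
a, b, c, d = 400, 50, 20, 530
print(f"AC = {accuracy(a, b, c, d):.1%}")  # 93.0%
print(f"P  = {precision(b, d):.1%}")       # 91.4%
print(f"R  = {recall(c, d):.1%}")          # 96.4%
```

The three printed values match the 93%, 91.4%, and 96.4% computed on the slide.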