Machine Learning Lab: Applying Algorithms to Biological Problems - Prof. Drena Leigh Dobbs, Lab Reports of Bioinformatics

A lab handout for a machine learning class focusing on biological applications. Students will learn about setting up machine learning experiments, cross validation, and using the Weka program to apply the Naïve Bayes, J48 decision tree, and SVM algorithms to a dataset of RNA-binding proteins. The goal is to predict whether a central amino acid binds to RNA based on the surrounding sequence. Exercises include understanding the assumptions and criteria of each algorithm and analyzing their performance through cross validation and testing on a separate dataset.

BCB 444/544 Lab 10 (11/8/07): Machine Learning
Due Monday 11/12/07 by 5 pm – email to terrible@iastate.edu

Objectives

1. Experiment with applying machine learning algorithms to biological problems.
2. Learn how to set up a machine learning experiment.

Introduction

Machine learning combines principles from computer science, statistics, psychology, and other disciplines to develop computer programs for specific tasks. The tasks machine learning programs have been developed for vary widely, from diagnosing cancer to driving a car. In biology, machine learning approaches are very popular for problems such as protein secondary structure prediction, gene prediction, and microarray data analysis. Machine learning is often quite effective, especially on problems with a lot of available data, and molecular biology certainly has lots of data.

A note about our training and test set files: The data set we are using in this lab is a set of RNA-binding proteins. Each input instance is a window of 15 amino acids from a protein sequence together with a label of 1 or 0 indicating whether the central amino acid in the window binds to RNA (1 means binding, 0 means non-binding). The training set contains an equal number of RNA-binding and non-binding residues, which is not the natural distribution: the entire data set contains only about 20% binding residues. We use a balanced set to make things a little easier. The test set is a single protein sequence, the 50S ribosomal protein L20 from E. coli. We will use this test case to see how well the classifiers we build perform on a protein sequence that is not in the training set.

Exercises

Before we get started on the exercises, we need to learn a little about machine learning experiments.
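To make the window encoding concrete, here is a minimal Python sketch of how such instances could be generated. The sequence and binding labels below are made up for illustration; the real lab data comes pre-formatted in the supplied training file.

```python
# Sketch of the windowed encoding described above: each instance is a
# window of 15 amino acids, labeled by whether the CENTRAL residue binds
# RNA. The sequence and labels here are invented, not real data.

WINDOW = 15
HALF = WINDOW // 2  # 7 residues on each side of the central position

def make_instances(sequence, binding_labels):
    """Return (window, label) pairs for every position with a full window."""
    instances = []
    for i in range(HALF, len(sequence) - HALF):
        window = sequence[i - HALF : i + HALF + 1]
        instances.append((window, binding_labels[i]))
    return instances

seq = "MKVNPDEIRSGQWKLVATPGAHS"   # toy 23-residue sequence
labels = [0] * len(seq)
labels[10] = 1                    # pretend residue 10 binds RNA

data = make_instances(seq, labels)
print(len(data))   # 9 interior positions have a full 15-residue window
print(data[3])     # ('NPDEIRSGQWKLVAT', 1) -- window centered on residue 10
```

Note that residues near the sequence ends have no full window; real encodings often pad the ends instead of dropping them, but this sketch keeps only complete windows.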
The first concept is training and testing. To estimate the performance of any classifier, we train it on some data and then measure its performance on other data. There are several ways to do this; in today's lab we will use cross validation and a separate test set. Go to http://en.wikipedia.org/wiki/Cross_validation and read about cross validation.

1. What is K-fold cross validation? What data is used for training? What data is used for testing?

The most important rule in training and testing is that the same data must never appear in both the training set and the test set. Usually we have limited data and want to use as much of it as possible for training, which is why we run cross validation experiments. In today's lab, we will also give ourselves the luxury of a test case that is not in the training set.

Algorithms

Read the sections on Naïve Bayes, J48 decision tree, and SVM here:
http://www.d.umn.edu/~padhy005/Chapter5.html

2. What assumption is used in the Naïve Bayes classifier?
3. What criterion does the decision tree classifier use to decide which attribute to put first in the decision tree?
4. What is the purpose of the kernel function in an SVM classifier?
5. Based on what you read, which method(s) can a human interpret? Which method(s) can a human not interpret, i.e., "black box" method(s)?
6. According to this web page, which algorithm tends to have the highest classification accuracy?

Experiments

In this lab, we will be using the program Weka. Weka contains implementations of many machine learning algorithms in a standard framework that makes it easy to experiment with different methods. If you are in the computer lab in 1340 MBB, Weka is already installed on the machines. If you are working from home, you will have to download and install Weka, available at:
http://www.cs.waikato.ac.nz/ml/weka/
The installation instructions should be fairly easy to follow.
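The idea behind question 1 can be sketched in a few lines of Python: the data is split into K folds, each fold serves once as the test set, and the remaining K-1 folds form the training set, so every example is tested exactly once and never used to train the model that tests it.

```python
# Minimal sketch of K-fold cross validation: partition n indices into k
# folds, then use each fold once as the test set while training on the
# rest. Contiguous, unshuffled folds, for illustration only.

def kfold_indices(n, k):
    """Partition indices 0..n-1 into k roughly equal contiguous folds."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)  # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def kfold_splits(n, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = kfold_indices(n, k)
    for f in range(k):
        test = folds[f]
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, test

for train, test in kfold_splits(10, 5):
    print(test)   # each pair of indices is the test set exactly once
```

Weka's cross validation also shuffles (and by default stratifies) the data before splitting; this sketch omits that to keep the core idea visible.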
If you have trouble, send me an email and I may be able to help you, or come to the lab and use the machines there. The lab in MBB is open most of the time; our class is currently the only one that uses the room.

Running Weka: Some final notes on what we will do with Weka before the instructions for how to do it. Each time you run a classifier, Weka prints a set of performance statistics. For this lab, all of the results you are required to fill into the tables can be read directly from this output, as long as you know where to find them.

The tables ask for accuracy, which in the Weka output is listed as "Correctly Classified Instances." The next entries in your results tables, TPRate, FPRate, Precision, and Recall, can be read from the section called "Detailed Accuracy By Class." These values are listed for both classes (1 for RNA-binding and 0 for non-RNA-binding); record only the values for class 1.

The final numbers for your results tables are TP (true positive), FP (false positive), FN (false negative), and TN (true negative). TP means we predicted RNA-binding and the residue actually is RNA-binding. FP means we predicted RNA-binding and it is not. FN means we predicted non-RNA-binding and it actually is RNA-binding. TN means we predicted non-RNA-binding and it actually is non-RNA-binding. Our correct predictions are TP and TN; our incorrect predictions are FP and FN. The counts of TP, FP, FN, and TN can be found in the section called "Confusion Matrix": the top left number is TP, the top right is FP, the bottom left is FN, and the bottom right is TN.

Finally, on to running some programs. On the lab machines, you can simply double-click the weka.jar file on the desktop. Click the button that says "Explorer" to get started. We will use the following files:

Training set
Test set

Click the Open file button and choose the training set file.
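The table entries above are all simple functions of the four confusion-matrix counts. The sketch below uses made-up counts (read your real ones from Weka's "Confusion Matrix" section) just to show the relationships:

```python
# How the results-table entries relate to the confusion matrix counts.
# These counts are hypothetical, for illustration only.

TP, FP, FN, TN = 60, 15, 20, 55

accuracy  = (TP + TN) / (TP + FP + FN + TN)  # "Correctly Classified Instances"
tp_rate   = TP / (TP + FN)                   # fraction of real binders caught
fp_rate   = FP / (FP + TN)                   # fraction of non-binders mislabeled
precision = TP / (TP + FP)                   # fraction of "binding" calls correct
recall    = tp_rate                          # TPRate and Recall are the same number

print(round(accuracy, 3), round(tp_rate, 3),
      round(fp_rate, 3), round(precision, 3))
# prints: 0.767 0.75 0.214 0.8
```

Noticing that TPRate and Recall are identical is a useful sanity check when filling in the tables: those two columns should always match.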
Click the Classify tab to get to the classification algorithms.

Naïve Bayes: To choose the algorithm, click the Choose button near the top of the Classifier section, then click bayes, then NaiveBayes. Be sure that Cross-validation is selected, and change the number of folds in the box from 10 to 5. Click Start to run the classifier, and record the performance in the table below. To run predictions on the test case, click the circle next to Supplied test set, click the Set… button, and choose the test file. Then click Start to build the classifier and make predictions on our test case. Record the performance in the table below.

J48 decision tree: Click the Choose button, then trees, then J48. Be sure that Cross-validation is selected with 5 folds, click Start, and record the performance in the table below. Then run predictions on the test case exactly as before (Supplied test set, Set…, choose the test file, Start) and record the performance in the table below.

SVM: Click the Choose button, then functions, then SMO. Again make sure Cross-validation is selected with 5 folds, click Start, and record the performance. Then run predictions on the test case as before and record the performance in the table below.

Next, we will run the SVM algorithm again using a different kernel function.
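For readers who prefer scripting to point-and-click, the session above has a rough analog in scikit-learn. This is an illustration only, not the Weka implementations: GaussianNB, DecisionTreeClassifier, and a linear-kernel SVC stand in for Weka's NaiveBayes, J48, and SMO, and the data is synthetic rather than the RNA-binding set.

```python
# Illustrative scripted analog of the Weka session above, using
# scikit-learn stand-ins on a synthetic dataset (NOT the lab data).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# 15 features, echoing the 15-residue windows in the real data
X, y = make_classification(n_samples=300, n_features=15, random_state=0)

classifiers = {
    "NB (GaussianNB)": GaussianNB(),
    "J48-like tree": DecisionTreeClassifier(random_state=0),
    "SVM (linear SVC)": SVC(kernel="linear"),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

As in Weka, each classifier is evaluated with 5-fold cross validation; a separate held-out test set would use `clf.fit(X_train, y_train)` followed by `clf.score(X_test, y_test)` instead.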
To change the kernel function, click the text box next to the Choose button at the top. This brings up a window showing the algorithm parameters. At the bottom of the window there is a line that says "useRBF"; change its value to true to use the RBF kernel function, then click OK. Be sure that Cross-validation is selected with 5 folds, click Start, and record the performance in the table below. Then run predictions on the test case as before and record the performance in the table below.

Finally, choose a different algorithm and run both 5-fold cross validation and predictions on the test case, adding the performance values to the blank line at the bottom of each table. Be sure to include the name of the algorithm you chose. You can choose any of the algorithms available in Weka. (Side note: some of the algorithms will not work on this data set and will produce an error message about incompatibility. If this happens to you, simply choose a different algorithm. Also, some algorithms run much faster than others; if the algorithm you chose is taking much longer than our SVM runs, you may want to choose a different one.)

To find out more about an algorithm, select it with the Choose button, then click the text box near the Choose button (just as we did when changing the SVM to the RBF kernel). In the window that opens, the "About" section gives a one-line description of the algorithm; click the More button in this section for more information, including a reference to a paper describing the algorithm. Another option is to do an internet search on the name of the algorithm.
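To see what toggling "useRBF" actually changes (and to make question 4 concrete), here is a sketch of the two kernel functions involved: the default linear kernel is just a dot product, while the RBF kernel scores similarity that decays with squared distance. The gamma value below is an arbitrary illustrative choice, not Weka's default setting.

```python
# The two kernel functions behind the "useRBF" switch, sketched in
# plain Python. gamma = 0.5 is an arbitrary illustrative value.
import math

def linear_kernel(x, z):
    """Default kernel: the dot product of the two vectors."""
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel: similarity decays with squared Euclidean distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = [1.0, 2.0], [2.0, 0.0]
print(linear_kernel(x, z))         # 2.0
print(round(rbf_kernel(x, z), 4))  # exp(-0.5 * 5) = 0.0821
print(rbf_kernel(x, x))            # 1.0 -- any point matches itself exactly
```

The kernel lets the SVM behave as if the data were mapped into a richer feature space without computing that mapping explicitly, which is why swapping the kernel can change performance on data that is not linearly separable.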
5-Fold Cross Validation Results

Algorithm         Accuracy   TPRate   FPRate   Precision   Recall   TP   FP   FN   TN
NB
J48
SVM
SVM (RBF)
(your algorithm)

Test Case Results

Algorithm         Accuracy   TPRate   FPRate   Precision   Recall   TP   FP   FN   TN
NB
J48
SVM
SVM (RBF)
(your algorithm)

7. Which algorithm did the best, and under what conditions?
8. Did the cross validation results accurately indicate what the performance on the test case would be?
9. Briefly describe what the algorithm you chose does.