Lecture Notes for Assignment 4 - Introduction to Bioinformatics | CS 466, Assignments of Computer Science

University of Illinois - Urbana-Champaign Computer Science

Material Type: Assignment; Class: Introduction to Bioinformatics; Subject: Computer Science; University: University of Illinois - Urbana-Champaign; Term: Unknown 1989;

Typology: Assignments

Pre 2010

Uploaded on 03/11/2009

koofers-user-6q1 🇺🇸


Assignment 4. (Due by 11:59 PM on November 18.)

Note: Create a subdirectory within /home/class/fa08/cs466/assignments/<yourloginid>/. Call it “assignment4”. Place your solutions within this subdirectory. Email the instructor once you have done this.

Note: See the course web page for the late turn-in policy. Also note that you may request a 3-day extension at most once in the semester, by informing the instructor by e-mail.

Note: Collaboration policy for this assignment: you may discuss the assignment with at most one other student. See the “Basic Information” link on the course web page for more details. (In short, you may discuss, but must do the assignment on your own.)

1. Implement a simple version of the Gibbs sampling algorithm for motif finding. (40 points)

An outline of the algorithm is the following:

a. Given a set of N sequences in a FastA format file, and a desired motif length k.
b. Compute the relative frequencies of each of the four bases in the input file, to construct a “background” distribution B, which may be treated as a PWM of length 1.
c. Assume that each sequence has one occurrence of the motif. The motif is specified by the start position of its occurrence in each sequence, i.e., the motif W = {s1, s2, …, sN}, where si is the start position of the motif occurrence in sequence Si.
d. Initialize W at random. (Randomly choose start positions in each sequence.)
e. In each “inner iteration” of the algorithm, you look at each i = 1…N one by one.
f. Suppose you are looking at a particular i. Construct motif W’ = W – {si}.
   a. For each substring s of length k in Si, compute the likelihood ratio Pr(s | W’) / Pr(s | B). The probability Pr(s | W) of sampling any substring of length k from a PWM of length k is as defined in class. For the background PWM B, this probability is the product of the probabilities of sampling each base of string s from the PWM B computed in step (b) above.
   b. Assign to each substring s in Si a “choice probability” that is proportional to the likelihood ratio computed above for that s. Choose a substring s in Si with its choice probability.
   c. Update W to be equal to W’ ∪ {s}.
   d. After every update of W, compute its information content.
g. Repeat the inner iteration (that goes over every i = 1…N) until some “stopping criterion” is reached. Report the motif with the highest information content seen during the sampling.
h. Repeat steps d-g 10 times, storing the motif reported from each repetition. Each such repetition starts with a different random initialization in step (d). This is the strategy of “random restarts”, commonly used in sampling-based algorithms. Report all 10 motifs (and their information contents).

When constructing a PWM from a motif in step f.a above, use “pseudocounts”. (See the Lawrence et al. paper covered in class, equation 1, for how to do this.) This is done to avoid any zeros in the PWM.

The program should be run as:

<programfilename> <FastaFilename> <k>

where k is the motif length.

What to turn in for Problem 1:

a. In a README.txt file to be included in the subdirectory “problem1”, provide a textual description of how step f.b was implemented, i.e., how you implemented choosing a string with probability proportional to its likelihood ratio computed in step f.a.
b. In the same file, also describe briefly what stopping criteria you chose for the algorithm.
c. Code, as is usually submitted, with instructions on compiling and running the program.
d. Include a Fasta file that you used to debug/develop your program, in the same subdirectory.

2. (40 points) Running BLAST. This assignment came with a FastA file (“mel.fa”) of regulatory sequences from D. melanogaster.

a. Take each individual sequence in the FastA file and “BLAST” it against the D. pseudoobscura genome. (Go to http://flybase.org/blast/ for this purpose.)
b. In the BLAST results, choose the top hit only if its E-value is better (lower) than 1E-20.
c. Manually extract the left and right positions of the hit in the “subject” sequence. (Note that the hit may comprise more than one local alignment with the same subject sequence. Consider all these fragments when noting down the left and right positions.)
d. In a file called “pse.coords.txt”, add a line with the format

   <FastaID> <subject-desc> <subject-left> <subject-right> [+/-] [E-value]

   where FastaID is the id of the FastA sequence that you started with; subject-desc is the “description” of the top hit from the “Blast hit summary” table; subject-left and subject-right are the coordinates you obtained in step c above; the fifth column (+/-) is “+” if the Blast hit is on the “plus” strand in the subject, and “-” otherwise; and “E-value” is the e-value provided by Blast for this hit.
e. Now, use the “Genome View” link from any one of the local alignments in the top hit to go to a “Genome Browser” view of this locus in the D. pseudoobscura genome.
f. Figure out how to extract (from the Genome Browser) the sequence corresponding to the coordinates you noted down in step (d), and add this sequence to a FastA file called “pse.fa”, with its Fasta identifier being the id of the corresponding D. melanogaster sequence.
g. Do this for every individual sequence in the attached FastA file (mel.fa).
h. Run the Gibbs sampler you implemented on this new FastA file (pse.fa), with motif lengths of 7 and 8 separately.
i. Convert the motif(s) you found into a “logo” format by going to http://weblogo.berkeley.edu/logo.cgi
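
As an illustration of steps b and f.a to f.c of Problem 1, the following is a minimal Python sketch, not a reference solution. All function names are hypothetical. It assumes sequences contain only the uppercase bases A, C, G, T, and it uses a simple uniform pseudocount of 1 per base, which is an assumption made here and is not necessarily the form given in equation 1 of the Lawrence et al. paper.

```python
import random
from collections import Counter

BASES = "ACGT"  # assumption: input sequences contain only these symbols

def background(seqs):
    """Step b: relative base frequencies over all input sequences."""
    counts = Counter(base for seq in seqs for base in seq)
    total = sum(counts[b] for b in BASES)
    return {b: counts[b] / total for b in BASES}

def build_pwm(sites, k, pseudocount=1.0):
    """Column-wise base probabilities from aligned k-mers.

    The uniform pseudocount is an illustrative choice that keeps every
    PWM entry strictly positive.
    """
    pwm = []
    total = len(sites) + 4 * pseudocount
    for j in range(k):
        counts = Counter(site[j] for site in sites)
        pwm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return pwm

def likelihood_ratio(s, pwm, bg):
    """Step f.a: Pr(s | W') / Pr(s | B) for one k-mer s."""
    ratio = 1.0
    for j, base in enumerate(s):
        ratio *= pwm[j][base] / bg[base]
    return ratio

def resample_site(seqs, starts, i, k, bg, rng=random):
    """Steps f.a to f.c: draw a new start position for sequence i."""
    # W' = W minus the current site of sequence i.
    others = [seqs[n][starts[n]:starts[n] + k]
              for n in range(len(seqs)) if n != i]
    pwm = build_pwm(others, k)
    positions = range(len(seqs[i]) - k + 1)
    weights = [likelihood_ratio(seqs[i][p:p + k], pwm, bg) for p in positions]
    # random.choices normalizes the weights internally, so each position is
    # drawn with probability proportional to its likelihood ratio.
    return rng.choices(positions, weights=weights, k=1)[0]
```

One inner iteration would call `resample_site` for each i in turn and store the result in `starts[i]`. Passing a seeded `random.Random` instance as `rng` makes runs repeatable, which can help when debugging.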

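Step f.d of Problem 1 asks for the information content of the motif after each update. The assignment does not spell out the formula, so the sketch below assumes one common definition: the relative entropy of each PWM column against the background B, in bits, summed over columns. The function name and the PWM representation (a list of per-column dictionaries) are hypothetical.

```python
import math

def information_content(pwm, bg):
    """Sum over columns and bases of p * log2(p / b), in bits.

    Assumed definition: relative entropy against the background bg.
    Terms with p == 0 contribute nothing and are skipped.
    """
    return sum(
        p * math.log2(p / bg[base])
        for column in pwm
        for base, p in column.items()
        if p > 0
    )

# Usage: one fully conserved column plus one uniform column.
uniform = {b: 0.25 for b in "ACGT"}
pwm = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}, dict(uniform)]
print(information_content(pwm, uniform))  # 2.0
```

With a uniform background, this reduces to 2 bits minus the column entropy, so a fully conserved column scores 2 bits and a uniform column scores 0.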