Lecture Notes for Assignment 4 - Introduction to Bioinformatics | CS 466, Assignments of Computer Science

Material Type: Assignment; Class: Introduction to Bioinformatics; Subject: Computer Science; University: University of Illinois - Urbana-Champaign; Term: Unknown 1989;

Uploaded on 03/11/2009

koofers-user-6q1
Assignment 4. (Due by 11:59 PM on November 18.)

Note: Create a subdirectory within /home/class/fa08/cs466/assignments/<yourloginid>/. Call it “assignment4”. Place your solutions within this subdirectory. Email the instructor once you have done this.

Note: See the course web page for the late turn-in policy. Also note that you may request a 3-day extension at most once in the semester, by informing the instructor by e-mail.

Note: Collaboration policy for this assignment: you may discuss the assignment with at most one other student. See the “Basic Information” link on the course web page for more details. (In short, you may discuss, but must do the assignment on your own.)

1. Implement a simple version of the Gibbs sampling algorithm for motif finding. (40 points)

An outline of the algorithm is the following:

a. Given a set of N sequences in a FastA format file, and a desired motif length k.
b. Compute the relative frequencies of each of the four bases in the input file, to construct a “background” distribution B, which may be treated as a PWM of length 1.
c. Assume that each sequence has one occurrence of the motif. The motif is specified by the start position of its occurrence in each sequence, i.e., the motif W = {s1, s2, …, sN}, where si is the start position of the motif occurrence in sequence Si.
d. Initialize W at random. (Randomly choose start positions in each sequence.)
e. In each “inner iteration” of the algorithm, look at each i = 1…N one by one.
f. Suppose you are looking at a particular i. Construct motif W’ = W - {si}.
   a. For each substring s of length k in Si, compute the likelihood ratio Pr(s | W’) / Pr(s | B). The probability Pr(s | W) of sampling any substring of length k from a PWM of length k is as defined in class. For the background PWM B, this probability is the product of the probabilities of sampling each base of string s from the PWM B computed in step (b) above.
   b. Assign to each substring s in Si a “choice probability” that is proportional to the likelihood ratio computed above for that s. Choose a substring s in Si according to its choice probability.
   c. Update W to be equal to W’ ∪ {s}.
   d. After every update of W, compute its information content.
g. Repeat the inner iteration (that goes over every i = 1…N) until some “stopping criterion” is reached. Report the motif with the highest information content seen during the sampling.
h. Repeat steps d-g 10 times, storing the motif reported from each repetition. Each such repetition starts with a different random initialization in step (d). This is the strategy of “random restarts”, commonly used in sampling-based algorithms. Report all 10 motifs (and their information contents).

When constructing a PWM from a motif in step f.a above, use “pseudocounts”. (See the Lawrence et al. paper covered in class, equation 1, for how to do this.) This is done to avoid any zeros in the PWM.

The program should be run as:
<programfilename> <FastaFilename> <k>
where k is the motif length.

What to turn in for Problem 1:

a. In a README.txt file to be included in the subdirectory “problem1”, provide a textual description of how step f.b was implemented, i.e., how you implemented choosing a string with probability proportional to its likelihood ratio computed in step f.a.
b. In the same file, also describe briefly what stopping criterion you chose for the algorithm.
c. Code, as is usually submitted, with instructions on compiling and running the program.
d. Include a FastA file that you used to debug/develop your program, in the same subdirectory.

2. (40 points) Running BLAST. This assignment came with a FastA file (“mel.fa”) of regulatory sequences from D. melanogaster.

a. Take each individual sequence in the FastA file and “BLAST” it against the D. pseudoobscura genome. (Go to http://flybase.org/blast/ for this purpose.)
b. In the BLAST results, choose the top hit, but only if its E-value is better (lower) than 1E-20.
c. Manually extract the left and right positions of the hit in the “subject” sequence. (Note that the hit may comprise more than one local alignment with the same subject sequence. Consider all these fragments when noting down the left and right positions.)
d. In a file called “pse.coords.txt”, add a line with the format
   <FastaID> <subject-desc> <subject-left> <subject-right> [+/-] [E-value]
   where FastaID is the id of the FastA sequence that you started with; subject-desc is the “description” of the top hit from the “Blast hit summary” table; subject-left and subject-right are the coordinates you obtained in step c above; the fifth column (+/-) is “+” if the BLAST hit is on the “plus” strand in the subject, and “-” otherwise; and “E-value” is the E-value provided by BLAST for this hit.
e. Now, use the “Genome View” link from any one of the local alignments in the top hit to go to a “Genome Browser” view of this locus in the D. pseudoobscura genome.
f. Figure out how to extract (from the Genome Browser) the sequence corresponding to the coordinates you noted down in step (d), and add this sequence to a FastA file called “pse.fa”, with its FastA identifier being the id of the corresponding D. melanogaster sequence.
g. Do this for every individual sequence in the attached FastA file (mel.fa).
h. Run the Gibbs sampler you implemented on this new FastA file (pse.fa), with motif lengths of 7 and 8, separately.
i. Convert the motif(s) you found into a “logo” format by going to http://weblogo.berkeley.edu/logo.cgi
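
For Problem 1, steps b, f.a, and f.b can be sketched roughly as follows. This is only an illustrative sketch, not a reference solution. The function names (background, build_pwm, likelihood_ratios, sample_start) are hypothetical, the sequences are assumed to contain only uppercase A, C, G, T, and the pseudocount scheme (a total weight `pseudo` spread across bases according to B) is an assumption that may not match equation 1 of the Lawrence et al. paper.

```python
import random
from collections import Counter

BASES = "ACGT"

def background(seqs):
    """Step b: relative frequency of each base over all input sequences."""
    counts = Counter(base for s in seqs for base in s)
    total = sum(counts[b] for b in BASES)
    return {b: counts[b] / total for b in BASES}

def build_pwm(seqs, starts, k, bg, skip=None, pseudo=1.0):
    """PWM of length k from the motif occurrences, leaving out sequence `skip`.

    Pseudocounts (assumed form): add pseudo * bg[b] to each count so that
    no entry of the PWM is zero.
    """
    used = [(s, p) for i, (s, p) in enumerate(zip(seqs, starts)) if i != skip]
    pwm = []
    for j in range(k):
        counts = Counter(s[p + j] for s, p in used)
        pwm.append({b: (counts[b] + pseudo * bg[b]) / (len(used) + pseudo)
                    for b in BASES})
    return pwm

def likelihood_ratios(seq, pwm, bg):
    """Step f.a: Pr(s | W') / Pr(s | B) for every length-k substring of seq."""
    k = len(pwm)
    ratios = []
    for start in range(len(seq) - k + 1):
        r = 1.0
        for j in range(k):
            base = seq[start + j]
            r *= pwm[j][base] / bg[base]
        ratios.append(r)
    return ratios

def sample_start(ratios, rng=random):
    """Step f.b: pick a start position with probability proportional to its ratio."""
    return rng.choices(range(len(ratios)), weights=ratios, k=1)[0]
```

Here random.choices normalizes the weights internally, so the likelihood ratios do not need to be divided by their sum before sampling. For long motifs, working with sums of logarithms instead of products may be safer numerically.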
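
For step f.d of Problem 1, one possible way to compute the information content of a motif's PWM is shown below. The definition used in class is not reproduced in this handout, so the formula here (relative entropy of each column with respect to the background B, in bits) is an assumption, and the function name information_content is hypothetical.

```python
import math

def information_content(pwm, bg):
    """Sum over columns and bases of p * log2(p / background), in bits.

    pwm is a list of dicts mapping base -> probability, one dict per column.
    bg maps base -> background probability.
    """
    total = 0.0
    for column in pwm:
        for base, p in column.items():
            if p > 0:  # the limit of p * log2(p) as p -> 0 is 0
                total += p * math.log2(p / bg[base])
    return total
```

Under this definition, with a uniform background, a perfectly conserved column contributes 2 bits and a uniform column contributes 0 bits.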
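
For steps c and d of Problem 2, the bookkeeping for one line of “pse.coords.txt” might look like the sketch below. It assumes that “considering all fragments” means taking the smallest and largest subject coordinate over all local alignments of the top hit, and that fields are separated by single spaces; both are interpretations, not something the handout states. The function name coords_line is hypothetical, and the values in the usage example are invented placeholders, not real BLAST output.

```python
def coords_line(fasta_id, subject_desc, fragments, strand, evalue):
    """Build one line of pse.coords.txt.

    fragments: list of (start, end) subject coordinates, one pair per local
    alignment. On the minus strand BLAST may report start > end, so every
    endpoint is pooled before taking the min and max.
    """
    positions = [p for frag in fragments for p in frag]
    left, right = min(positions), max(positions)
    return f"{fasta_id} {subject_desc} {left} {right} {strand} {evalue}"

# Usage with made-up placeholder values:
line = coords_line("seq1", "scaffold_1", [(1400, 1520), (1200, 1350)], "+", "3e-45")
print(line)
```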