Bioinformatics: Gene Assembly & Identification in Miniplasmid - Genome Sequencing Set 3 (Assignments, Algorithms and Programming)

Instructions for assembling and analyzing a small plasmid sequence using bioinformatics tools. Students load and read the sequence files, perform a first-pass assembly, and use INVERSION-OF to produce the opposite strands of the reads before aligning them. The document also covers calculating the probability that a specific nucleotide is represented in the reads and using GeneMark for gene prediction.

Typology: Assignments

Pre 2010

Uploaded on 02/12/2009

koofers-user-2qd
Introduction to Bioinformatics
Problem Set 3: Genome Sequencing

1. Assemble a sequence with your bare hands!

You are trying to determine the DNA sequence of a very (very) small plasmid, which you estimate by gel electrophoresis to be about 200 nt. You have made a shotgun library of the miniplasmid and have generated reads of about 20 nt each (you must be using a very early technology!). Your objective now is to assemble those reads into the full sequence.

a. Take a look at the sequences you will assemble. Within BioBIKE, click on the FILES menu, then click on Files. Click on the Shared-files subdirectory and then locate and click on a file called mini-plasmid-reads.txt. This file is in what is called FastA format: each read consists of a one-line label preceded by ">", followed by the DNA sequence. Approximately how many reads are there? Write down the label and the sequence of the first read.

b. Return to BioBIKE and load the sequences you will assemble. Bring down READ (used to read from files) from the INPUT-OUTPUT menu.
- The file-name is "mini-plasmid-reads.txt"
- It is in the SHARED subdirectory (choose the SHARED flag from Options)
- It is in FastA format (choose the FASTA flag from Options)
- Execute, and don't be alarmed by the format of the result.
Can you find the label and the sequence of the first read?

c. DEFINE a variable (perhaps something like reads) as the contents of the file you read. Either cut and paste the READ function into the value hole of DEFINE, or fill that hole with PREVIOUS-RESULT.

d. Make a first-pass assembly of the reads, using ALIGNMENT-OF to find overlaps (you can locate the function on the STRING/SEQUENCE menu, Bioinformatics Tools submenu).

e. ALIGNMENT-OF was not designed to assemble reads, and it isn't very good at it.
Copy the results into your favorite word processor, and continue the assembly by hand. Warning! It's easy to just sit and stare blankly at the sequences. If you find yourself making little progress, step back and ask yourself what kinds of overlaps you are trying to find. Then devise a systematic approach to finding them, making use of the search capabilities of the word processor.

f. Did you get a complete plasmid sequence from these reads? Probably not. Why so many separate pieces?

g. Investigate INVERSION-OF (found in the STRING-SEQUENCES menu, String-production submenu). Bring down the function, and click on Help (in the green action arrow menu). Then click on Full Documentation. From the examples, do you understand what INVERSION-OF does? Try it out. Put a sequence (in quotes, of course) into the argument hole, predict what it should produce, and see if you're right.

h. Why would INVERSION-OF be useful in analyzing reads? Which strand of a genome being sequenced gets read by sequencing reactions?

i. Use INVERSION-OF to produce the opposite strands of all your reads. Then, as before, align the JOINed reads and inverted reads. Copy the alignment into a word processor and join together as many reads as possible. Of course you should speed up the process by using the knowledge gained from step 1.e.

j. How many contigs do you get now, and how do you interpret them?

k. What fraction of the plasmid have you covered with your assembled reads? Use in the calculation the total number of nucleotides in your contigs and orphan reads.

l. In what ways was the process you went through to assemble the reads similar to the process used to assemble the Drosophila genome? In what ways did the latter process differ from yours?

2. Reconsider Problem 1, looking at the data as a whole.

a. How many reads are there? (You might use COUNT-OF on the variable you defined in Problem 1.c)

b. How many nucleotides are there in the reads?
(You might get a SUM-OF the LENGTHS-OF the reads)

c. What is the average read length?

d. What is the calculated coverage of the mini-plasmid?

e. What fraction of the plasmid do you expect to be represented by the reads? This is a very common type of question in bioinformatics but not at all easy to answer the first time you encounter it. So let me break it down.

i. The fraction of the plasmid you expect to be represented by the reads is equal to one minus what?

ii. The fraction of the plasmid you expect NOT to be represented by the reads is equal to the probability that a specific nucleotide is not found in any of the reads. This may be the most difficult connection to make, so let's dwell on this a bit. If the probability is 50% that the nucleotide at coordinate 29 is not found in any read, and (since there's nothing special about coordinate 29) 50% that any specific nucleotide is not found in any read, then on average half of the nucleotides will be represented by the reads and half won't be. Draw pictures, visualize, but don't just accept the words. Get the idea into your head as a picture.

iii. The probability that the nucleotide at coordinate 29 is not found in any read may be calculated from the probability that it isn't found in the first read AND that it isn't found in the second read AND … all the way to the last read. How do you combine these probabilities? Are the reads independent of one another? If I told you that the nucleotide is not found in the first read, would you know any more than before as to whether it is found in the second read? If not, then they're independent. How do you combine independent probabilities to calculate a joint probability (i.e., the probability that all the events occur)?
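
The reasoning in Problem 2 can be checked numerically. The following is a minimal Python sketch, not part of BioBIKE, under assumptions that go beyond the text: the plasmid is circular, each read lands at a uniformly random position independently of the others, and a read of length L therefore covers a given nucleotide with probability L/G, where G is the plasmid length. The function names and the three toy reads are invented for illustration and are not the contents of mini-plasmid-reads.txt.

```python
def parse_fasta(text):
    """Return a list of (label, sequence) pairs from FastA-formatted text."""
    records = []
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new label closes the previous record, if there was one
            if label is not None:
                records.append((label, "".join(chunks)))
            label, chunks = line[1:].strip(), []
        else:
            chunks.append(line.upper())
    if label is not None:
        records.append((label, "".join(chunks)))
    return records


def expected_fraction_covered(read_lengths, genome_length):
    """Expected fraction of the genome found in at least one read.

    Assumes independent, uniformly placed reads on a circular genome.
    """
    p_missed = 1.0
    for length in read_lengths:
        # One read misses a given nucleotide with probability 1 - L/G;
        # independent events combine by multiplication.
        p_missed *= 1.0 - length / genome_length
    return 1.0 - p_missed


# Toy input, invented for illustration (NOT the real read file)
toy_fasta = """>toy-read-1
ACGTACGTACGTACGTACGT
>toy-read-2
TTGACCAGTTGACCAGTTGA
>toy-read-3
GGGCATTTGGGCATTTGGGC
"""

reads = parse_fasta(toy_fasta)
lengths = [len(seq) for _, seq in reads]
genome_length = 200  # plasmid size estimated in Problem 1

n_reads = len(lengths)                   # 2.a
total_nt = sum(lengths)                  # 2.b
average_length = total_nt / n_reads      # 2.c
coverage = total_nt / genome_length      # 2.d
fraction = expected_fraction_covered(lengths, genome_length)  # 2.e

print(n_reads, total_nt, average_length, coverage, round(fraction, 3))
```

With the real reads substituted for the toy ones, comparing the expected fraction with the fraction actually covered by your contigs (Problem 1.k) gives a rough check on whether random placement is a reasonable model.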
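
For Problem 1.g through 1.i, the idea of inverting reads before looking for overlaps can also be sketched in Python. This is an illustration only: it assumes that producing the opposite strand means taking the reverse complement, and it uses a simple exact suffix-to-prefix match rather than whatever ALIGNMENT-OF does internally. The names `reverse_complement` and `overlap_length`, the `min_overlap` parameter, and the two toy reads are hypothetical.

```python
# Translation table for complementing DNA bases (upper and lower case)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the opposite strand of seq, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def overlap_length(left, right, min_overlap=5):
    """Length of the longest suffix of left that equals a prefix of right.

    Returns 0 if no exact overlap of at least min_overlap is found.
    """
    max_len = min(len(left), len(right))
    for k in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


# Toy reads, invented for illustration: read_b was taken from the
# opposite strand of the same (hypothetical) 30-nt region as read_a.
read_a = "ATGGCGTACCTTGACGGATC"
read_b = "TTCAGACTTGGATCCGTCAA"

# As given, the two reads show no usable overlap in either order
print(overlap_length(read_a, read_b), overlap_length(read_b, read_a))

# After inverting read_b, the overlap becomes visible
inverted_b = reverse_complement(read_b)
k = overlap_length(read_a, inverted_b)
print(k)

# Merge the two reads into one contig by appending the non-overlapping tail
contig = read_a + inverted_b[k:]
print(contig)
```

Because a sequencing read may come from either strand, a systematic search compares each read against both the other reads and their inverted forms, which is what step 1.i asks you to do by hand.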