Bioinformatics: Gene Assembly & Identification in Miniplasmid - Genome Sequencing Set 3

Virginia Commonwealth University (VCU), Algorithms and Programming

Prof. Jeffrey Elhai

Instructions for assembling and analyzing a small plasmid sequence using bioinformatics tools. Students load and read the sequence files, perform a first-pass assembly, and use INVERSION-OF to produce opposite strands for read alignment. The document also covers calculating the probability that a specific nucleotide is represented in the reads and using GeneMark for gene prediction.

Typology: Assignments

Pre 2010

Uploaded on 02/12/2009

koofers-user-2qd 🇺🇸

Introduction to Bioinformatics
Problem Set 3: Genome Sequencing

1. Assemble a sequence with your bare hands! You are trying to determine the DNA sequence of a very (very) small plasmid, which you estimate by gel electrophoresis to be about 200 nt. You have made a shotgun library of the miniplasmid and have generated reads of about 20 nt each (you must be using a very early technology!). Your objective now is to assemble those reads into the full sequence.

   a. Take a look at the sequences you will assemble. Within BioBIKE, click on the FILES menu, then click on Files. Click on the Shared-files subdirectory and then locate and click on a file called mini-plasmid-reads.txt. This file is in what is called FastA format, each read consisting of a one-line label preceded by ">" and then the DNA sequence. Approximately how many reads are there? Write down the label and the sequence of the first read.

   b. Return to BioBIKE and load the sequences you will assemble. Bring down READ (used to read from files) from the INPUT-OUTPUT menu.
      - The file-name is "mini-plasmid-reads.txt"
      - It is in the SHARED subdirectory (choose the SHARED flag from Options)
      - It is in FastA format (choose the FASTA flag from Options)
      - Execute, and don't be alarmed by the format of the result.
      Can you find the label and the sequence of the first read?

   c. DEFINE a variable (perhaps something like reads) as the contents of the file you read. Either cut and paste the READ function into the value hole of DEFINE, or fill that hole with PREVIOUS-RESULT.

   d. Make a first-pass assembly of the reads, using ALIGNMENT-OF to find overlaps (you can locate the function on the STRING/SEQUENCE menu, Bioinformatics Tools submenu).

   e. ALIGNMENT-OF was not designed to assemble reads, and it isn't very good at it.
      Copy the results into your favorite word processor, and continue the assembly by hand. Warning! It's easy to just sit and stare blankly at the sequences. If you find yourself making little progress, step back and ask yourself what kinds of overlaps you are trying to find. Then devise a systematic approach towards finding them, making use of the search capabilities of the word processor.

   f. Did you get a complete plasmid sequence from these reads? Probably not. Why so many separate pieces?

   g. Investigate INVERSION-OF (found in the STRING-SEQUENCES menu, String-production submenu). Bring down the function, and click on Help (in the green action arrow menu). Then click on Full Documentation. From the examples, do you understand what INVERSION-OF does? Try it out. Put a sequence (in quotes, of course) into the argument hole, predict what it should produce, and see if you're right.

   h. Why would INVERSION-OF be useful in analyzing reads? Which strand of a genome being sequenced gets read by sequencing reactions?

   i. Use INVERSION-OF to produce the opposite strands of all your reads. Then, as before, align the JOINed reads and inverted reads. Copy the alignment into a word processor and join together as many reads as possible. Of course you should speed up the process by using the knowledge gained from step 1.e.

   j. How many contigs do you get now, and how do you interpret them?

   k. What fraction of the plasmid have you covered with your assembled reads? Use in the calculation the total number of nucleotides in your contigs and orphan reads.

   l. In what ways was the process you went through to assemble the reads similar to the process used to assemble the Drosophila genome? In what ways did the latter process differ from yours?

2. Reconsider Problem 1, looking at the data as a whole.

   a. How many reads are there? (You might use COUNT-OF on the variable you defined in Problem 1.c)

   b. How many nucleotides are there in the reads?
      (You might get a SUM-OF the LENGTHS-OF the reads)

   c. What is the average read length?

   d. What is the calculated coverage of the mini-plasmid?

   e. What fraction of the plasmid do you expect is represented by the reads? This is a very common type of question in bioinformatics but not at all easy to answer the first time you encounter it. So let me break it down.

      i. The fraction of the plasmid you expect is represented by the reads is equal to one minus what?

      ii. The fraction of the plasmid you expect is NOT represented by the reads is equal to the probability that a specific nucleotide is not found in any of the reads. This may be the most difficult connection to make, so let's dwell on this a bit. If the probability is 50% that the nucleotide at coordinate 29 is not found in any read and (since there's nothing special about coordinate 29) 50% that any specific nucleotide is not found in any read, then on average, half of the nucleotides will be represented by the reads and half won't be. Draw pictures, visualize, but don't just accept the words. Get the idea into your head as a picture.

      iii. The probability that the nucleotide at coordinate 29 is not found in any read may be calculated from the probability that it isn't found in the first read AND that it isn't found in the second read AND … all the way to the last read. How do you combine these probabilities? Are the reads independent of one another? If I told you that the nucleotide is not found in the first read, would you know any more than before as to whether it is found in the second read? If not, then they're independent. How do you combine independent probabilities to calculate a joint probability (i.e., the probability that all the events occur)?
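
If you want to check your BioBIKE answers to Problem 2 independently, the bookkeeping can be sketched in a few lines of Python. Treat this as a sketch under stated assumptions, not as the BioBIKE method. The helper names (parse_fasta, reverse_complement, expected_fraction_covered) and the two toy reads are hypothetical and are not taken from mini-plasmid-reads.txt. INVERSION-OF is assumed to return the reverse complement, which is what "opposite strands" in step 1.i suggests. Every read is assumed to have the same length L, and reads are assumed to start at uniformly random positions on a circular plasmid of length G, so that a single read misses a given nucleotide with probability 1 - L/G.

```python
def parse_fasta(text):
    """Return a list of (label, sequence) pairs from FastA-formatted text."""
    records = []
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new label closes the previous record, if there was one.
            if label is not None:
                records.append((label, "".join(chunks)))
            label, chunks = line[1:].strip(), []
        else:
            chunks.append(line.upper())
    if label is not None:
        records.append((label, "".join(chunks)))
    return records


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Opposite strand, read 5' to 3' (assumed equivalent of INVERSION-OF)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def expected_fraction_covered(n_reads, read_length, genome_length):
    """Expected fraction of nucleotides found in at least one read.

    Assumes independent reads of equal length with uniformly random
    start positions on a circular molecule.
    """
    p_missed_by_one_read = 1 - read_length / genome_length
    # Independent events: the joint probability is the product.
    p_missed_by_all_reads = p_missed_by_one_read ** n_reads
    return 1 - p_missed_by_all_reads


# Hypothetical toy reads, only to exercise the functions.
toy_fasta = """>read-1
ACGTTGCA
>read-2
ttgcaagg
"""

reads = parse_fasta(toy_fasta)
n_reads = len(reads)                            # compare COUNT-OF
total_nt = sum(len(seq) for _, seq in reads)    # compare SUM-OF the LENGTHS-OF
mean_length = total_nt / n_reads
opposite_strands = [reverse_complement(seq) for _, seq in reads]

# With the problem's estimates (G about 200 nt, L about 20 nt) and a
# hypothetical count of 30 reads:
coverage = 30 * 20 / 200
fraction = expected_fraction_covered(30, 20, 200)
print(n_reads, total_nt, mean_length, coverage, fraction)
```

To use it on the real data, replace toy_fasta with the contents of the reads file and pass your own read count and average read length to expected_fraction_covered. If the real reads vary much in length, the equal-length assumption makes the result only an approximation.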
