Improving The Draft Assembly of The Horse Genome: Final Report | AMSC 663, Papers of Mathematics

Material Type: Paper; Class: ADV SCI COMPUTING I; Subject: Applied Mathematics & Scientific Computation; University: University of Maryland; Term: Spring 2008;

Uploaded on 02/13/2009

Improving the Draft Assembly of the Horse Genome: Final Report

Megan Smedinghoff, smeds@umd.edu
Advisor: James A. Yorke, yorke@umd.edu
May 19, 2008

Abstract

In February 2007, the Broad Institute of MIT released a draft assembly of the horse genome. To fulfill the requirements of AMSC 663-664, I reassembled the genome using the Celera Assembler and then used my assembly to fix gaps and compressions in the Broad assembly. I was able to increase the N50 contig size of the Broad assembly by 15% and increase the total assembly size by 0.18%. As a part of this process, I also produced a parallelized version of the University of Maryland Overlapper. Initial tests show that the parallelized overlapper runs 2-4 times faster than the serial version.

Background and Motivation

In February 2007, the Broad Institute of MIT announced that it had completed a draft assembly of the horse genome (Equus caballus). The announcement was the culmination of a $15 million project funded by the National Human Genome Research Institute and the National Institutes of Health (1). The draft genome will allow the equine research community to better understand diseases that affect horses. Additionally, the release of the horse genome has caused some excitement in the human genome research community. There are over 80 known conditions in horses that are analogous to disorders in humans, including arthritis and allergies (2). Many believe that comparative genomics methods will lead to a better understanding of these disorders and therefore to better treatments for both species.

The Broad Institute produced its draft genome with the Arachne assembler (3-4). I reassembled the genome using the Celera Assembler (5) and the University of Maryland overlapper (6-7).
The Celera Assembler and the Arachne assembler have slightly different underlying algorithms and therefore produce assemblies that are almost identical but do contain some differences. I also used the University of Maryland (UMD) overlapper as a preprocessor to Celera. The UMD overlapper has been shown to improve genome quality when used in the Celera pipeline (7). After running the UMD overlapper and then Celera, I used the resulting assembly to fix gaps and compressions in the Arachne draft.

In the process of producing a reconciled assembly, I was also able to make improvements to the UMD overlapper. Since the horse genome is large, I was initially forced to split the data into groups and run many of the overlapper commands by hand.

[...] this case there is usually little or no overlap between the pieces of sequence. Instead of seeing that two pieces fit together, the assembler relies on knowing how far apart pairs of reads must be from each other.

Figure 2: The Celera assembler builds successively bigger pieces of sequence, known as unitigs, contigs, and scaffolds, to create an assembly

To produce an assembly of the horse, I ran the Celera assembler with default parameters. The only modification I made to the Celera pipeline was to use the UMD overlapper to produce the list of read overlaps instead of using Celera’s internal routines. It took nine days to run the assembler and produce a draft horse genome (see Table 1).

Assembly                          Celera               Arachne
Number of Scaffolds               59,044               9,687
Number of Contigs in Scaffolds    126,810              55,316
Genome Size                       2.5 billion bases    2.4 billion bases
N50 Contig Size                   77,479 bases         112,381 bases

Table 1: Celera and Arachne assembly statistics

The Celera assembly contained about six times as many scaffolds as the Arachne assembly, but the two genomes were approximately the same size. The Arachne assembly had a larger N50 contig size than the Celera assembly.
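As a minimal sketch of how an N50 value like those in Table 1 could be computed from a list of contig lengths: sort the contigs from longest to shortest and report the length at which the running total first reaches half of all bases. The function name `n50` and the example lengths are illustrative assumptions and are not taken from either assembler.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bases)."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # The first contig at which the running total reaches half of all
        # bases defines the N50.
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


# Hypothetical toy assembly: 20 bases in total, so the halfway point is 10.
# The contigs of length 6 and 5 together hold 11 bases, so the N50 is 5.
print(n50([2, 3, 4, 5, 6]))
```

Under this definition, merging two contigs across a closed gap can only keep the N50 the same or raise it, which is why gap closing tends to show up as an N50 increase.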
N50 contig size is the contig length such that contigs of that length or longer contain half of all nucleotides in the assembly, and shorter contigs contain the other half. N50 contig size is frequently used to measure the quality of an assembly. A larger N50 contig size is usually considered better than a smaller one.

Comparing the Two Assemblies

After producing a Celera assembly of the horse, I compared the two assemblies at the sequence level to see how similar they were in terms of scaffold structure. The first step in this process was to run NUCmer, an open source program designed to quickly align large genomes (8-9). Once I had an alignment, I was able to map Celera contigs to Arachne contigs. Not all Celera contigs mapped to Arachne contigs, but I compiled a list of those that did. I then used this list to compare the scaffold structures of the two assemblies. In performing this comparison, I encountered three different types of disagreement between the assemblies: orientation problems, ordering problems, and gap size problems.

Orientation Problems: An orientation problem occurs when contigs from a Celera scaffold map to contigs of an Arachne scaffold but one of the contigs is oriented in the opposite direction from its match (see Figure 3).

Figure 3: Illustration of an orientation problem (contigs A, B, C, D of a Celera scaffold matched to contigs A’, B’, C’, D’ of an Arachne scaffold)

Ordering Problems: An ordering problem occurs when a set of contigs from a Celera scaffold map to a set of contigs from an Arachne scaffold but they map in a different order (see Figure 4).

Figure 4: Illustration of an ordering problem (contigs A, B, C, D of a Celera scaffold matched to an Arachne scaffold whose contigs appear in the order A’, C’, B’, D’)

[...]

Figure 6: Illustration of a gap (matching sequences in Assembly A and Assembly B flank a gap in Assembly B)

Figure 7: Illustration of a compression (between matching sequences, a compression error in Assembly A or an expansion error in Assembly B)

I treated the Arachne assembly as the official assembly and used the Celera assembly to close gaps and fix compressions (see Table 3).
The results show that reconciliation was able to fix 80% of the detected compressions (687 before, 136 after). The genome size increased by 0.18% (4.3 million bases), and the N50 contig size increased by 15%. These increases in size represent increases in quality, since the reconciliation software only changes the official assembly if there is overwhelming evidence that the draft assembly is better.

Assembly                   Celera                 Arachne                Reconciled
Number of Compressions     N/A                    687                    136
Genome Size                2.512 billion bases    2.428 billion bases    2.433 billion bases
N50 Contig Size            77,479 bases           112,729 bases          130,123 bases

Table 3: Reconciliation statistics

Parallelizing the Overlapper

In addition to producing a reconciled assembly of the horse genome, I also made improvements to the University of Maryland Overlapper. UMD Overlapper was developed by Mike Roberts and relies on the concept of “minimizers” to efficiently detect overlaps (6-7). The algorithm begins by examining each 20-mer in each read and recording an associated hex value for that 20-mer (see Figure 8). The program then uses a sliding 20-base window and finds the minimum hex value in that window. The 20-mer associated with this value (known as a “minimizer”) is then recorded along with the read name and the offset from the end of the read. This list is then sorted and filtered to keep only minimizers that have fewer than 70 matches in the genome (minimizers with more than 70 matches frequently represent repeat regions). Reads that share a minimizer in this sorted list form groups of candidate pairs for overlap. Reads in these groups are then compared in a process known as “checker”, which outputs a list of overlaps. There are some additional post-processing steps, including a Poisson test that eliminates overlaps that have a probability of less than 10^-8 of being true overlaps.
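The minimizer steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the UMD Overlapper's code: a 2-bit integer encoding stands in for the "hex value", the window is taken to be `w` consecutive k-mer start positions, offsets are measured from the start of the read rather than the end, reverse complements are ignored, and reads are assumed to contain only A, C, G, and T. The names `kmer_value`, `minimizers`, and `candidate_groups` are hypothetical, and groups with fewer than two hits are dropped because they cannot yield a pair.

```python
from collections import defaultdict

ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_value(kmer):
    """Map a k-mer to an integer (assumed stand-in for the 'hex value')."""
    value = 0
    for base in kmer:
        value = value * 4 + ENCODE[base]
    return value


def minimizers(read, k, w):
    """Return the set of (offset, kmer) minimizers found in one read."""
    values = [kmer_value(read[i:i + k]) for i in range(len(read) - k + 1)]
    picked = set()
    for start in range(len(values) - w + 1):
        window = values[start:start + w]
        best = min(range(w), key=window.__getitem__)  # leftmost minimum
        offset = start + best
        picked.add((offset, read[offset:offset + k]))
    return picked


def candidate_groups(reads, k, w, max_matches):
    """Group (read name, offset) hits by minimizer, dropping likely repeats."""
    groups = defaultdict(list)
    for name, seq in reads.items():
        for offset, kmer in minimizers(seq, k, w):
            groups[kmer].append((name, offset))
    # Keep minimizers with fewer than max_matches hits (70 in the text) and
    # at least two hits, since a single hit cannot form a candidate pair.
    return {kmer: sorted(hits) for kmer, hits in groups.items()
            if 2 <= len(hits) < max_matches}


# Toy example with small k and w; the text uses 20-mers and a 20-base window.
reads = {"r1": "ACGTAC", "r2": "TTACGT"}
print(candidate_groups(reads, k=4, w=3, max_matches=70))
```

Because two reads that overlap by at least a full window must pick the same minimizer from the shared region, grouping by minimizer finds candidate pairs without comparing every read against every other read.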
There is also an additional step that splits groups of overlaps if there are a certain number of differences within the group. After the post-processing, the final list of overlaps is output.

Figure 8: UMD Overlapper algorithm
1. Get the associated hex value for each 20-mer in the read.
2. Record the “minimizer” (the 20-mer with the minimum hex value) for each sliding 20-base window.
3. Create a Read, Offset, Minimizer file.
4. Filter by certain minimizers and then sort to get candidate pairs for overlap.
5. Run “checker” on the candidate pairs and produce edit transcripts.
6. Do a Poisson test and eliminate any overlaps that have less than a 10^-8 probability of occurring.
7. Do a multi-compare and split overlaps when they have a certain number of differences.
8. Output the overlap list.

The most computationally intensive part of the algorithm is the checker routine that compares the reads. The routine is expensive because every read in a minimizer group needs to be compared to every other read in the group, so the work grows quadratically with group size. While this process is time-consuming, each minimizer group can be processed independently on a different processor, so the routine is easily parallelizable. An additional merit is that the parallelization requires no extra partitioning step, since the reads are already in groups. I used Sun Grid Engine functions to submit checker commands to available processors on the cluster.

To make sure that my parallelization was working properly, I tested it on a bacterium, a fly, and the horse. In each case I was able to produce correct output and dramatically reduce the running time of the algorithm (see Table 4). For a bacterium-sized genome, the speedup was 2x. For a fly-sized genome, the speedup was 3.5x. In the case of the horse, I was unable to run the serial version of the overlapper, so no speedup could be measured.
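The independence of minimizer groups can be illustrated with a small sketch. It is a toy under loud assumptions: a standard-library thread pool stands in for Sun Grid Engine job submission, and an ungapped mismatch count stands in for the real checker, which produces edit transcripts. The names `overlap_mismatches`, `check_group`, and `run_checker` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from itertools import combinations


def overlap_mismatches(seq_a, off_a, seq_b, off_b):
    """Align two reads at a shared minimizer; return (overlap length, mismatches)."""
    shift = off_a - off_b  # position in seq_a where seq_b[0] lands
    start = max(0, shift)
    end = min(len(seq_a), shift + len(seq_b))
    mismatches = sum(seq_a[i] != seq_b[i - shift] for i in range(start, end))
    return end - start, mismatches


def check_group(reads, hits, max_mismatches=0):
    """Compare every read in one minimizer group with every other read."""
    found = []
    for (name_a, off_a), (name_b, off_b) in combinations(hits, 2):
        if name_a == name_b:
            continue  # two hits in the same read are not an overlap
        length, mismatches = overlap_mismatches(
            reads[name_a], off_a, reads[name_b], off_b)
        if mismatches <= max_mismatches:
            found.append((name_a, name_b, length, mismatches))
    return found


def run_checker(reads, groups, executor=None):
    """Check each group independently, serially or on the given executor."""
    mapper = executor.map if executor is not None else map
    per_group = mapper(partial(check_group, reads), groups.values())
    # The same pair can surface in several groups, so deduplicate.
    return sorted({overlap for group in per_group for overlap in group})


# Hypothetical toy input: two reads sharing the minimizer ACGT.
reads = {"r1": "ACGTAC", "r2": "TTACGT"}
groups = {"ACGT": [("r1", 0), ("r2", 2)]}
with ThreadPoolExecutor(max_workers=2) as pool:
    print(run_checker(reads, groups, executor=pool))
```

No group reads another group's results, so the serial and pooled runs return the same list. Threads in CPython would not speed up CPU-bound comparisons like these; an actual speedup would need separate processes or, as in the report, separate cluster jobs.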