A Whole-Genome Assembly of Drosophila - Project | CAP 5510, Study Guides, Projects, Research of Computer Science

University of Florida (UF), Computer Science

Material Type: Project; Class: BIOINFORMATICS; Subject: COMPUTER APPLICATIONS; University: University of Florida; Term: Unknown 1989;

Typology: Study Guides, Projects, Research

Pre 2010

Uploaded on 09/17/2009

koofers-user-qgo 🇺🇸

10 documents


Partial preview of the text

A Whole-Genome Assembly of Drosophila

Eugene W. Myers,1* Granger G. Sutton,1 Art L. Delcher,1 Ian M. Dew,1 Dan P. Fasulo,1 Michael J. Flanigan,1 Saul A. Kravitz,1 Clark M. Mobarry,1 Knut H. J. Reinert,1 Karin A. Remington,1 Eric L. Anson,1 Randall A. Bolanos,1 Hui-Hsien Chou,1 Catherine M. Jordan,1 Aaron L. Halpern,1 Stefano Lonardi,1 Ellen M. Beasley,1 Rhonda C. Brandon,1 Lin Chen,1 Patrick J. Dunn,1 Zhongwu Lai,1 Yong Liang,1 Deborah R. Nusskern,1 Ming Zhan,1 Qing Zhang,1 Xiangqun Zheng,1 Gerald M. Rubin,2 Mark D. Adams,1 J. Craig Venter1

We report on the quality of a whole-genome assembly of Drosophila melanogaster and the nature of the computer algorithms that accomplished it. Three independent external data sources essentially agree with and support the assembly’s sequence and ordering of contigs across the euchromatic portion of the genome. In addition, there are isolated contigs that we believe represent nonrepetitive pockets within the heterochromatin of the centromeres. Comparison with a previously sequenced 2.9-megabase region indicates that sequencing accuracy within nonrepetitive segments is greater than 99.99% without manual curation. As such, this initial reconstruction of the Drosophila sequence should be of substantial value to the scientific community.

The primary obstacle to determining the sequence of a very large genome is that, with current technology, one can directly determine the sequence of at most a thousand consecutive base pairs at a time. The process, dideoxy sequencing, used to produce such sequencing reads was essentially invented by Sanger circa 1980 (1), with subsequent modest gains in read length, moderate gains in data accuracy, and significant gains in throughput.
Given the limitation on read length, researchers employ a shotgun-sequencing approach, in which an effectively random sampling of sequencing reads is collected from a larger target DNA sequence. With sufficient oversampling, the sequence of the target can be inferred by piecing the sequence reads together into an assembly.

Early on, the shotgun approach was applied to small viral genomes and to 30- to 40-kbp segments of larger genomes that could be manipulated and amplified in a cosmid. For a given level of oversampling, the number of unsampled regions or gaps increases linearly with target size, as does the number of interspersed repetitive sequences that tend to confound assembly. After computer assembly, a finishing phase ensues, wherein the gaps between assembled contigs are closed experimentally, and any misassembly is resolved. Because one does not know the order of the contigs or the size of the gaps and because the assembly problem becomes harder as size increases, it was commonly believed that cosmid targets represented the limit of the shotgun approach. Whole genomes were sequenced by first developing a set of cosmids or other clones covering the genomes by a process called physical mapping, and then shotgun sequencing each clone as in (2–4).

In 1994, the sequence of Haemophilus influenzae was obtained from the assembly of a whole-genome data set obtained by shotgun sequencing (5). This bacterial genome, at 1.8 Mbp, was much larger than was previously thought possible by a direct shotgun approach, the largest previous genome so sequenced being the lambda virus in 1982 (6). Critical to this accomplishment was the construction of a computer program capable of performing the assembly and the use of pairs of reads, called mates, from the ends of 2-kbp and 16-kbp inserts randomly sampled from the genome.
Even though the pairing information was false 10 to 20% of the time owing to lane tracking problems on the slab-gel sequencing instruments available at the time, the presence of several mates with one read in one contig and the other read in another contig allowed ordering of the contigs and gave a rough estimate of the size of the gap between them, simplifying the finishing phase. Many groups have since sequenced bacterial genomes this way, and investigators have moved from using shotgun sequencing for cosmids to targets on the order of 100 to 150 kbp, that is, those clonable in P1 and bacterial artificial chromosome (BAC) vectors.

A new approach to sequencing whole genomes, proposed by Venter, Smith, and Hood in 1996 (7), was to make a 15× library of BAC-sized inserts randomly sampled from the genome and to produce end-sequence read pairs for them. One could then select and apply shotgun sequencing to several seed BACs, whereupon the end-sequences of other BACs in the library could be used to determine minimally overlapping BACs at each end to sequence next in an interactive walk across the genome. Weber and Myers then proposed the whole-genome shotgun sequencing of the human genome in 1997 (8, 9). The protocol involved collecting a 10× oversampling of the genome, with mate pairs from 0.9-kbp and 10-kbp inserts in a 1:1 ratio, and assembling this in conjunction with the long-range information provided by a genome-wide sequence-tagged site (STS) map that is a series of unique, 300- to 500-bp sites ordered across the genome with an average spacing between sites of 100 kbp. In 1998, Venter and colleagues announced the undertaking of a whole-genome shotgun sequencing of the human genome (10) with the sequencing of Drosophila serving as a pilot project.

For Drosophila, we set about collecting a 10× oversampling of a genome using a 1-to-1 ratio of 2-kbp and 10-kbp mate pairs.
In addition, enough BACs to provide 15× coverage of the genome were to be collected and end-sequenced, effectively generating a set of mate pairs that give long-range information similar to that provided by the STS maps described above. Drosophila’s euchromatic genome is estimated at 120 Mbp. Thus, the protocol would require collecting at least 2.4 million reads and 15,000 BAC end pairs, totaling 1.2 billion base pairs of data. Our Drosophila sequencing project began in May 1999, in collaboration with the Berkeley Drosophila Genome Project (BDGP), the results of which are detailed in the accompanying papers (11, 12).

1 Celera Genomics, Inc., 45 West Gude Drive, Rockville, MD 20850, USA. 2 Howard Hughes Medical Institute, Berkeley Drosophila Genome Project, University of California, Berkeley, CA 94720, USA.
* To whom correspondence should be sent: Gene.Myers@celera.com

24 MARCH 2000 VOL 287 SCIENCE www.sciencemag.org 2196
THE DROSOPHILA GENOME: REVIEW

Celera Assembler Design Principles

The primary difficulty in building an assembler for a whole-genome shotgun data set is to develop an algorithmic approach that detects and is not confused by stretches of repetitive DNA. The key to not being confused by repeats is the exploitation of mate pair information to circumnavigate and to fill them (13). Because of this, the result of assembly is a set of scaffolds of contigs, versus a set of contigs as customarily produced by other assemblers. A scaffold is a set of contigs that are ordered, oriented, and positioned with respect to each other by mate pairs whose reads are in adjacent contigs (see below). Although we demonstrate below that a whole-genome shotgun data set can be assembled in isolation, our pragmatic objective is to produce the best possible reconstruction of a genome, along with its correlation to existing data. Therefore, the assembler is capable of utilizing available external data.
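
As a rough illustration of how mate pairs can position two adjacent contigs relative to each other, the sketch below estimates the gap between them from the mates that span it. The `MateLink` type, its field names, and the plain averaging are assumptions made for this example; they are not taken from the Celera Assembler, which the text does not describe at this level of detail.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class MateLink:
    """One mate pair with a read in each of two adjacent contigs (hypothetical type)."""
    insert_len: int      # expected insert length in bp, e.g. 2_000 or 10_000
    left_overhang: int   # bp from the outer end of the left read to the end of its contig
    right_overhang: int  # bp from the start of the right contig to the outer end of the right read


def estimate_gap(links):
    """Average the gap implied by each link: insert length minus the parts lying inside contigs."""
    if not links:
        raise ValueError("at least one spanning mate pair is needed")
    return mean(l.insert_len - l.left_overhang - l.right_overhang for l in links)


# Two illustrative (made-up) links between the same pair of contigs.
links = [
    MateLink(insert_len=2_000, left_overhang=700, right_overhang=900),
    MateLink(insert_len=10_000, left_overhang=4_800, right_overhang=4_700),
]
gap = estimate_gap(links)
print(f"estimated gap: {gap} bp")
```

A negative estimate would suggest that the two contigs overlap rather than leave a gap. A more careful version would weight each link by the variance of its insert library instead of taking a plain mean.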
Our assembler places reads in a series of stages, starting with the safest “moves” and progressing toward increasingly more aggressive ones. The stage and evidence for a read’s placement are open to inspection, providing an audit trail of the assembler’s decision-making. To further optimize development time, we decided to build a batch assembler that assumes all data are available when it begins its task. For Drosophila this was feasible because assembly of a complete data set takes less than a week on an eight-processor suite of Compaq Alpha ES40s with a 32-Gb memory (14).

The Drosophila Data Sets

The scale of whole-genome assembly dictates that the quality of the input data be much higher than that required for smaller assembly problems. We determined data requirements on the basis of simulation estimates (15) and received data of the quality shown in Table 1.

In a whole-genome context, trillions of overlaps between reads are examined. In order to keep the a posteriori probability of a false overlap low, regions of low sequence quality must be trimmed much more aggressively than for other protocols (16). We produced 3.156 million reads that yielded 1.76 Gbp of sequence after trimming to the 98% accuracy level on the basis of quality values that reflect the log-odds score of the base’s being correct (17). The observed mean sequencing accuracy of these reads after trimming was 99.5% (18).

A substantial fraction of the reads must be in mate pairs if one expects to achieve long-range ordering and repeat filling. Moreover, the more accurately one knows the distance between a pair of mates, and the more reliably one knows that a given pairing is true, the more strongly one can make inferences during the assembly process. We produced 1.151 million pairs (72.8% of the reads) whose insert lengths were normally distributed with 10% variance and whose pairing reliability has been estimated at 99.66% (19).
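
The trimming step mentioned above can be pictured with a small sketch. It assumes Phred-style quality values, where Q = -10 log10(error probability), and it assumes that "98% accuracy" means the mean expected accuracy over the kept interval; the paper's exact rule (its note 17) is not reproduced here, and the function names are hypothetical.

```python
def phred_to_error(q):
    """Convert a Phred-style quality value to an error probability (assumed scale)."""
    return 10 ** (-q / 10)


def trim_to_accuracy(quals, min_accuracy=0.98):
    """Return (start, end) of the longest interval whose mean expected accuracy
    is at least `min_accuracy`; (0, 0) if no base qualifies."""
    max_err = 1.0 - min_accuracy
    # prefix[i] = sum over quals[:i] of (error - max_err); an interval is
    # acceptable when this sum over the interval is <= 0.
    prefix = [0.0]
    for q in quals:
        prefix.append(prefix[-1] + phred_to_error(q) - max_err)
    n = len(quals)
    # Try the longest windows first; O(n^2) is fine for reads of a few hundred bases.
    for length in range(n, 0, -1):
        for start in range(0, n - length + 1):
            if prefix[start + length] - prefix[start] <= 1e-12:
                return start, start + length
    return 0, 0


# Made-up read: poor quality at both ends, good quality in the middle.
quals = [2] * 5 + [40] * 50 + [2] * 5
start, end = trim_to_accuracy(quals)
print(f"keep bases [{start}, {end}) of {len(quals)}")
```

Because the criterion is a mean, a kept interval can still include an isolated low-quality base at its edge; a stricter trimmer would add a per-base or per-window floor.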
The spectrum of 2-kbp, 10-kbp, and BAC mates must be such that all of the euchromatic, nonrepetitive DNA is linked together and covered at least two deep at every point. Moreover, an insufficient number of 10-kbp and BAC mates will prevent the formation of assemblies covering each chromosome arm. To our surprise, 10-kbp inserts could be sequenced as successfully as 2-kbp inserts, so we increased production of the 10-kbp mates in the late stages to produce 654,000 of the 2-kbp mates and 497,000 of the 10-kbp mates. A total of 12,152 acceptable quality BAC mates of average separation 130.2 kbp, generated at Genoscope (11), were received from the BDGP and European Drosophila Genome Projects (EDGP).

We term the data set described above the whole-genome shotgun data set or WGS data set, as it provides the data stipulated in our pure conception of the whole-genome shotgun sequencing protocol. In addition to these data, the BDGP constructed a map of the second and third chromosomes, completely sequenced 340 BAC and P1 inserts comprising about 26 Mbp of Drosophila euchromatic sequence, and produced a 1.28× draft shotgun of each BAC and P1 clone in a tiling set chosen from a physical map covering the genome (20). The EDGP produced a map of the X chromosome and completely sequenced cosmid and BAC clones covering about 3 Mbp. The Canadian Drosophila Genome Project produced a physical map of the small fourth chromosome. For more details on these data sets, see table 1 of (11).

The joint data set is our term for the WGS data plus the draft reads and a perfect shredding (21) of 340 of the completely sequenced clones into a 3× tiling of 550-bp reads. There were a total of 337,000 draft reads constituting 153.1 Mbp of sequence and 154,000 reads shredded from the completed BACs.
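
To make the idea of shredding a finished clone into a 3× tiling of 550-bp reads concrete, here is a minimal sketch. The even spacing of read starts and the function name `shred` are assumptions for illustration; the procedure the authors actually used (their note 21) is not shown in this text.

```python
def shred(sequence, read_len=550, coverage=3):
    """Cut `sequence` into error-free reads of `read_len` bp, tiled so that
    interior bases are covered roughly `coverage` times."""
    if len(sequence) <= read_len:
        return [sequence]
    step = read_len // coverage            # consecutive reads start `step` bp apart
    last_start = len(sequence) - read_len
    # Evenly spaced starts, plus one final read so the end of the clone is covered.
    starts = list(range(0, last_start, step)) + [last_start]
    return [sequence[s:s + read_len] for s in starts]


# Made-up 5.5-kbp "clone" used only to exercise the function.
clone = "ACGT" * 1375
reads = shred(clone)
depth = sum(len(r) for r in reads) / len(clone)
print(f"{len(reads)} reads, mean depth {depth:.2f}x")
```

The first and last few hundred bases of the clone are covered less deeply than the interior, which is a property of any tiling that does not wrap around.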
We did not include the known STS markers for Drosophila in the joint data set, reserving them for independent confirmation, and no specific advantage was taken of the locality and ordering of the included external data. Thus, the net effect is that each of these reads was simply mapped to a location in the assembly, possibly filling in sequence gaps by means of the extra sampling coverage they provided. The primary use of these marker reads was to validate assembly and to provide navigation information for the finishing stage.

The finished sequence resulting from assembly of the joint data set along with current finishing efforts will be available both at Celera’s Web site and at GenBank under accession numbers AE002566–AE003403. An assembly of the data through the scaffolding phase (see below) was deposited in GenBank on 31 December 1999, accession numbers AC012691–AC020545. We are also prepared to participate in appropriate collaborative efforts to use our raw data to test future algorithms.

Celera Assembler’s Algorithmic Design

The Celera assembler consists of a pipeline of several stages as shown in Fig. 1. An illustrated primer on the assembler algorithms is on the Web (www.celera.com/genomeassembler). In preparation for the assembly computation, the electropherograms for a read were interpreted as a sequence of bases and associated quality values that reflect the log-odds score of the base’s being correct (17). The read was then trimmed to an interval of 98% accuracy according to these quality values. Any prefix sequence of the high-quality region matching the sequencing vector or linker was aggressively removed. Finally, the remaining portion of the reads were screened for matches to any contaminant DNA such as Escherichia coli or cloning or sequencing vectors, and the entire read was removed if a significant matching segment was found. After this processing, what remained was a set of high-quality reads of Drosophila se-

Fig. 1. Assembly pipeline. From an engineering perspective, sequences of messages flow from one stage to the next. Each stage performs work on its input stream, producing a stream of output messages reflecting its transformational function. The text gives the function of each stage.

Table 1. Input data requirements and characteristics. The requested column gives the minimum or maximum requirement for the item stipulated at the left of each row. The received column shows what was actually produced.

Type of data                 Requested             Received
Read length and accuracy     500 bp @ 98% (min)    551 bp @ 98%
Shotgun coverage             10× (min)             14.6×
Reads in pairs               70% (min)             72.8%
Insert length variance       ±3% (max)             ±10%
False-positive pairs         1% (max)              0.34%
BAC map coverage             15× (min)             13.18×
Ratio of 2 kbp to 10 kbp     4 to 1 (max)          1.32 to 1

... in the gaps between a scaffold’s contigs. Only 0.34% of the mated reads within contigs did not agree with the placement of their mates, which is well within the expected false-positive error rate of the pairing information. There are 70 scaffolds having spans over 30 kbp, and the 25 scaffolds with spans larger than 100 kbp contain more than 95% of the assembled sequence (114.1 Mbp). The sizes, in millions of base pairs, of the scaffolds over 1 Mbp are 24.3, 16.4, 15.1, 13.7, 10.6, 9.1, 5.1, 4.8, 4.5, 2.7, 2.1, 1.4, 1.4, and 1.3. These megascaffolds are a subset of the 50 mapped to the euchromatin by STS markers and cover the preponderance of the euchromatic portions of every chromosome arm, breaking up into smaller scaffolds as the telomeres and centromeres are approached. This can be seen in the segmentation of Fig. 5 where each segment represents a scaffold. It was simplest for us to arrange our investigation around the size of a scaffold, so in the remainder of this section we discuss the nature of these 25 scaffolds.
The qualitative features to be observed about these scaffolds are representative of the entire set.

The level of assembly of the scaffolds larger than 100 kbp for the joint and WGS data sets, and a 6.5× WGS data set are compared in Table 2. The scaffolded regions for joint and WGS data sets were identical except that one 16.34-Mbp scaffold in the joint data set split into 10.45-Mbp and 5.64-Mbp scaffolds. There were 446 fewer gaps (23%) in the joint assembly, but these gaps constituted only 163 kbp (0.13%) of additional sequence, confirming that the additional coverage of the external data had a positive but small impact. Note carefully that in the joint assembly no advantage has been taken of the known relations between the shredded reads from a finished BAC and the relative proximity of draft reads from a given clone, thus it should not be surprising that the differences are small. We have not yet made design changes to the assembler to take advantage of this information. For example, of the 1434 gaps in the large scaffolds of the joint data set, 140 are spanned by finished BAC and P1 sequences that the assembler could have potentially joined.

The 6.5× WGS data set was produced by randomly sampling a 1-to-1 mix of 2-kbp to 10-kbp Celera reads totaling 6.5× coverage, in which 70% of reads were pairs and all of the BAC mates (13×) were included. For this data set, the assembler produced 43 scaffolds that are slightly contracted and fragmented versions of the 25 large scaffolds in the bigger data sets, containing more than 95% of their sequence. This confirms our earlier claims that one has a robust picture of a genome at 6.5× coverage with a whole-genome approach.
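
For intuition about how coverage relates to unsampled sequence, the sketch below evaluates the idealized Lander-Waterman model, in which reads land uniformly at random and a fraction exp(-c) of bases is left unsampled at coverage c. This model is brought in here as an assumption; the passage above does not invoke it, and it ignores repeats, minimum overlap length, and cloning bias. The genome size (120 Mbp) and read length (about 550 bp) are taken from the text.

```python
import math


def unsampled_fraction(coverage):
    """Expected fraction of bases hit by no read under uniform random sampling."""
    return math.exp(-coverage)


def expected_gaps(genome_bp, read_len, coverage):
    """Approximate number of sequencing gaps: reads times exp(-coverage).
    Ignores the minimum overlap needed to join two reads."""
    n_reads = coverage * genome_bp / read_len
    return n_reads * math.exp(-coverage)


GENOME_BP = 120_000_000   # euchromatic genome size quoted in the text
READ_LEN = 550            # approximate trimmed read length

for c in (6.5, 10.0, 14.6):
    frac = unsampled_fraction(c)
    gaps = expected_gaps(GENOME_BP, READ_LEN, c)
    print(f"{c:>5.1f}x  unsampled fraction {frac:.2e}  expected gaps ~{gaps:,.0f}")
```

At a fixed coverage the expected gap count in this model scales linearly with genome size, which matches the earlier remark that gaps grow linearly with target size. Real gap counts are higher than the model suggests because repeats and uneven cloning also break contigs.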
To evaluate the causes of the 1434 intrascaffold gaps among the 25 large scaffolds of the joint data set, we examined the sequence adjacent to each gap to see if there were any reads in the data set overlapping into the gap and whether the end of the sequence was screened as being repetitive. A total of 927 of the gaps have no overlapping sequence at either end and are almost certainly sequencing gaps as confirmed by their generally small size. Another 244 have a matching screen item at both boundaries and are thus almost certainly unresolved repeats. Of the remainder, 164 appear to involve a sequencing gap and 99 appear to involve a repetitive element by virtue of having no overlap or a screen item at one end, respectively.

Table 2. Comparison of assembly of scaffolds larger than 100 kbp on three data sets. The length column for the scaffolds row gives the total number of base pairs spanned including gaps, whereas the length column for the total sequence row is the total number of base pairs in these scaffolds. The number column for placed pieces, for example, rocks, is the number of unitigs of that kind placed in the big scaffolds, whereas the length column gives the amount of sequence covered by that type but not by unitigs of the category above it, for example, 0.992 Mbp of the sequence for the joint data set was covered by rocks but not U-unitigs. Negative gaps are those where the assembler estimates that the two adjacent contigs should overlap but could not find one within the placement dictated by the bundles (40).

                   Joint                   WGS                     6.5× WGS
                   Number  Length (Mbp)    Number  Length (Mbp)    Number  Length (Mbp)
Scaffolds              25       116.176        26       116.306        43       114.348
Total gaps          1,434         2.030     1,887         2.322     7,111         5.790
  100–150 kbp           3         0.343         3         0.354         4         0.489
  50–100 kbp            0         0.000         1         0.097         6         0.430
  10–50 kbp             8         0.184        10         0.209        28         0.517
  2–10 kbp            237         1.283       245         1.371       649         2.636
  0–2 kbp             812         0.219     1,132         0.290     4,394         1.715
  Negative            374             -       496             -     2,030             -
Total sequence     37,225       114.146    34,184       113.983    23,890       108.557
  U-unitigs         7,446       110.581     7,164       110.604     8,007       103.933
  Rocks             2,056         0.992     1,787         0.927     1,950         2.683
  Stones              139         0.121       132         0.118        98         0.129
  Pebbles          27,584         2.450    25,101         2.332    13,835         1.809

Fig. 5. STS-content map. Celera assembly scaffolds were plotted against the STS order on the STS-content map. The color palette is used to delineate scaffolds. The 17 outliers have been investigated; those circled in red remain unresolved at the time of publication.

The assembly of a shotgun data set is not the last step in producing a genome; a finishing phase is necessary in which a certain level of gap closure by experimental means is required. One laboratory working on a BAC-by-BAC project reported that for an average BAC size of 99 kbp sequenced at 8.57× coverage, there were an average of 3.8 gaps that required some directed sequencing implying an average contig size of 26 kbp (27). For all firm scaffolds of the joint assembly, the average contig size is 50.0 kbp, implying the equivalent finishing effort of 2.0 gaps per 99 kbp of BAC. However, although the distribution of the sizes of sequencing gaps is the same in the two scenarios, the WGA assembly has several hundred repeat-induced gaps that are generally of a larger size. Nonetheless, this comparison suggests that the total finishing effort for Drosophila may well end up being commensurate with a BAC-by-BAC approach.

Validation of the Drosophila Assembly

STS-level validation. STS maps (28) for the chromosome arms were concatenated to give a whole-genome map that orders 2378 STSs, permitting comparison between this independent order and the WGS assembly. The STS sequences mapped a total of 114.8 Mbp of assembled sequence across 50 scaffolds to the Drosophila genome. There is excellent agreement between the STS order in the STS-content maps and the WGS assembly (Fig. 5).
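
One simple way to compare an independent marker order against an assembly is to walk along each scaffold and count adjacent marker pairs whose map order runs against the scaffold's prevailing direction. The sketch below does that under stated assumptions: the input layout, the function name, and the majority-direction rule are illustrative choices, not the procedure the authors used (their note 29 is not reproduced here).

```python
def order_discrepancies(scaffolds):
    """Count adjacent marker pairs that disagree with the map order.

    `scaffolds` is a list of lists; each inner list holds the map rank of every
    marker found on one scaffold, in assembly order.
    Returns (comparable_pairs, discordant_pairs).
    """
    pairs = 0
    discordant = 0
    for ranks in scaffolds:
        steps = [b - a for a, b in zip(ranks, ranks[1:])]
        ascending = sum(1 for s in steps if s > 0)
        descending = len(steps) - ascending
        pairs += len(steps)
        # A scaffold may be reversed relative to the map, so the minority
        # direction is the one counted as discordant.
        discordant += min(ascending, descending)
    return pairs, discordant


# Made-up example: one forward scaffold with a local swap, one reversed
# scaffold in perfect order, and one scaffold carrying a single marker.
example = [[1, 2, 3, 5, 4, 6], [12, 11, 10, 9], [20]]
pairs, discordant = order_discrepancies(example)
print(f"{discordant} discordant of {pairs} comparable pairs")
```

With m markers spread over s scaffolds there are m - s adjacent pairs to compare. The figures reported in the text that follows (2167 matched STSs, 50 mapped scaffolds, 2117 ordered pairs) are at least consistent with that count, since 2167 - 50 = 2117, although that reading is an inference.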
Twelve STSs were discarded from the study because they proved not to map to unique positions. Of the remaining 2366 sites, 2167 matched contigs in the assembly giving 2117 ordered pairs of STSs that could disagree between the two data sets (29). There were 17 ordering discrepancies, each of which was investigated. We have been able to localize nine of the discrepant STS sequences on the published clone-tiling path (CTP) (see below), the positions of which agree in each case to the Celera assembly position. Eight discrepancies are unresolved and remain under investigation.

Clone-level validation and coverage. The assembly of the WGS data set was compared to the finished and the 1.28× draft sequence available for the published CTP that covers most of the euchromatin of the genome (30). This allows us to identify the appropriate clone reagents for gap closure, and to verify the order and assembly of contigs in our scaffolds. As predicted from the results of the STS map comparison, the assembly is in excellent agreement with the published CTP (Fig. 6). There were only 11 discrepancies between the WGS assembly and the CTP. Each of these discrepancies was investigated and curated (31). One discrepancy is caused by a P1 clone on the tiling path that appears to be chimeric (32). The remaining 10 discrepancies were shown to be caused by placement errors in the CTP.

In an attempt to judge how much of the genome a pure whole-genome assembly captures, we compared the coverage of the 816 firm scaffolds of the WGS assembly to that of the CTP and associated sequence. This result is only indicative as it is difficult to precisely evaluate the intersection of our contigs with the light-shotgun data (33) because of repetitive sequence. The WGS assembly was estimated to miss approximately 2.99 Mbp of the sequence in clones of the CTP.
Almost all of the missed sequence was present in reads not incorporated into firm scaffolds, and these absences were uniformly distributed across chromosomes, suggesting that this number estimates the amount of sequence in the larger gaps of the WGS assembly. In the converse direction, approximately 15 Mbp of WGS data could not be matched to CTP data. Not surprisingly, most of this involved contigs mapped to chromosome X and a region of 3L where the CTP is incomplete. From these numbers one might estimate that 105 Mbp of Drosophila are in the current physical map, and the WGS assembly has 3 Mbp of that in gaps, for a total of 97.1% coverage of the current physical map. One could then carry that number forward as an estimate of the percent of the euchromatin within the WGS assembly.

Sequence-level validation. A comparison of the complete published sequence of the 2.9-Mbp Adh region (34) against the 23 Celera contigs from the WGS assembly that cover it is shown in Fig. 7. We chose the Adh region because it was the longest contiguous stretch of finished sequence available. There are two levels of discrepancy: small point variations involving one, two, or three bases, and larger block variations involving from 33 bp to 9 kbp. All of the large variations are in our solutions to repeats, and we discuss them first.

There are 15 block-level differences between the two sequences, totaling 25 kbp. Three are Hobo-elements in our sequence that are strongly supported by the assembly and thus appear to be genuine repeat-level polymorphisms. Four are variations in the copy number of tandem duplications, where three are manually correctable overcompressions by one repeat unit in our sequence. The remaining eight block discrepancies are in the interiors of retrotransposons and appear to be due to incorrect pebble walks as described earlier.
All involve on the order of 30- to 100-bp blocks with the exception of one substitution of 3500 bp and another retrotransposon that appears to be rearranged with respect to its long terminal repeats. There is thus room for improvement in the pebble repeat resolution phase, during which we did not adequately take advantage of interpebble mate pairs. All these discrepancies occur in regions covered solely by pebble-placed unitigs, and these constitute only 2.45% of the reconstruction. Altogether, we measured 9.5% of our Drosophila assembly as being repetitive sequence, so the better part of most repeat constructions should prove correct with some variations in the interior of longer elements like those just described.

Fig. 6. Clone-tiling path (CTP) map. All mapped Celera scaffolds, oriented and ordered by both the STS-content map and the CTP sequence were plotted against all BAC/P1/cosmid clones ordered as they appear on the CTP. All “mutually unique regions” (39) between a clone and a contig are aggregated and displayed. The observed chimeric region (34) is marked by a star; evident misorderings in the tiling path are marked with a square; repeat-induced “hits” are marked with a circle. [Axes: clone order on clone-tiling path versus mapped Celera scaffolds; chromosome arms 4, 3R, 3L, 2R, 2L, X.]

The concentration of individual base-pair discrepancies varies depending on whether the sequence is repetitive or not. The discrepancy in the repetitive regions is roughly 0.38%, whereas in the nonrepetitive sequence there are 140 differences for 0.0049% of the total. An examination of the differences in the nonrepetitive region indicates that 78 are in deep coverage regions of the assembly, where multiple alignments confirm our sequence. Therefore, one can bound our error rate in the nonrepetitive sequence as 62 in 2.82 Mbp or less than 0.0022%.
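
The bound quoted above can be reproduced with a few lines of arithmetic. The sketch uses only the numbers given in the text and assumes that the 140 differences and the 62 unconfirmed differences share the same 2.82-Mbp denominator; the helper name `percent` is hypothetical.

```python
def percent(count, total_bp):
    """Express `count` events in `total_bp` bases as a percentage."""
    return 100.0 * count / total_bp


NONREPETITIVE_BP = 2_820_000   # 2.82 Mbp of nonrepetitive sequence, from the text
differences = 140              # single-base differences in nonrepetitive sequence
confirmed = 78                 # differences in deep-coverage regions supporting the assembly

unconfirmed = differences - confirmed
raw_rate = percent(differences, NONREPETITIVE_BP)
upper_bound = percent(unconfirmed, NONREPETITIVE_BP)

print(f"unconfirmed differences: {unconfirmed}")
print(f"raw discrepancy rate:    {raw_rate:.5f}%")
print(f"error-rate upper bound:  {upper_bound:.5f}%")
```

An error rate below 0.0022% corresponds to an accuracy above 99.99%, which is the figure stated in the abstract for nonrepetitive segments.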
The higher discrepancy rate in the repetitive region is explained by the use of unitig pebbles that are overcollapsed. Further details of the comparison are given in the legend. We thus project that we have a very high quality, ordered and mapped reconstruction of the nonrepetitive genome, with a draft-quality facsimile of the repetitive elements interspersed therein.

To get a broader picture of sequence quality, we scrutinized the results of BLAST searches of the WGS assembly against 104 high-quality, finished P1 or BAC clones, totaling 10.2 Mbp of sequence. After curating conflicts (35), we tabulated all discrepancies in high-scoring segment pairs (HSPs) longer than 10 kbp in both repetitive and unique regions, finding 63 inserts, 142 deletions, and 177 substitutions in 182.7 kbp of known repetitive sequence (0.021%), and 244 inserts, 182 deletions, and 231 substitutions in the remaining 7.085 Mbp of unique sequence (0.0092%). Of the sequence not in large HSPs, 77 kbp is simply clone sequence that is in gaps between contigs of the WGS assembly. There then remains 48 kbp of non-HSP sequence, 31 kbp of which is in known repeats and 17 kbp of which will likely be discovered to be either repeat polymorphisms or overcollapsed, unannotated tandem

Fig. 7. Detailed comparison between the Adh region and WGS assembly. The x axis gives location (in thousands of base pairs) relative to the public Adh sequence. Peaks indicate the numbers of single-base mismatches between the two sequences in windows of length 1000 (log scale, or zero if there was perfect agreement). Purple boxes denote transposable elements in the public Adh sequence. Larger-scale discrepancies are as follows: green lozenges, gaps between Celera contigs; red inverted triangle, regions in the public Adh sequence that are absent from the Celera sequence (“deletions”); purple triangle, regions in the Celera sequence that are absent from the public sequence (“insertions”); cyan X, inversion of a region of one sequence relative to the other. A star associated with an insertion or inversion indicates the presence of transposable elements. A plus sign indicates that the associated insertion or deletion involves tandem duplications.
