Assembly of the Working Draft of the Human Genome with GigAssembler | CAP 5510, Study Guides, Projects, Research of Computer Science

University of Florida (UF), Computer Science

Material Type: Project; Class: BIOINFORMATICS; Subject: COMPUTER APPLICATIONS; University: University of Florida; Term: Summer 2000;

Typology: Study Guides, Projects, Research

Pre 2010

Uploaded on 09/17/2009

Assembly of the Working Draft of the Human Genome with GigAssembler

W. James Kent1,3 and David Haussler2

1Department of Biology, University of California at Santa Cruz, Santa Cruz, California 95064, USA; 2Howard Hughes Medical Institute, Department of Computer Science, University of California at Santa Cruz, Santa Cruz, California 95064, USA

The data for the public working draft of the human genome contains roughly 400,000 initial sequence contigs in ∼30,000 large insert clones. Many of these initial sequence contigs overlap. A program, GigAssembler, was built to merge them and to order and orient the resulting larger sequence contigs based on mRNA, paired plasmid ends, EST, BAC end pairs, and other information. This program produced the first publicly available assembly of the human genome, a working draft containing roughly 2.7 billion base pairs and covering an estimated 88% of the genome that has been used for several recent studies of the genome. Here we describe the algorithm used by GigAssembler.

On May 24, 2000, the public Human Genome Project staged the first “freeze” of all currently available sequence data, coordinated by the director, Francis Collins, Greg Schuler at the National Center for Biotechnology Information, Adam Felsenfeld at the National Human Genome Research Institute, and the twenty primary public human sequencing centers (Box 1). Public database accessions for ∼22,000 shotgun-sequenced clones were selected for this freeze, mostly bacterial artificial chromosome (BAC) clones (International Human Genome Sequencing Consortium 2001). The sequence contigs were extracted from these accessions and cleaned up as necessary by Schuler. We will refer to these contigs as “initial sequence contigs”. There were ∼375,000 such initial sequence contigs.
The complete public human genome sequence is not projected to be available until 2003. To get a useable working draft in the short term, it was necessary to order and orient these initial sequence contigs along the 20 unfinished autosomes and two sex chromosomes as best as the data permitted, detecting overlaps and building larger sequence contigs where possible. Chromosomes 21 (Hattori et al. 2000) and 22 (Dunham et al. 1999) had already been finished and nearly complete ordered and oriented euchromatic sequence was available for them.

A group led by Robert Waterston at the Washington University Genome Sequencing Center (WUGSC) created a map of the large insert clones from the May 24 freeze, based on the genome-wide physical map they had developed of ∼300,000 clones, primarily using fingerprint overlaps, but also employing information from radiation hybrid, genetic, YAC–STS, and cytogenetic maps, as well as BAC end matches (International Human Genome Mapping Consortium 2001; International Human Genome Sequencing Consortium 2001). A clone fingerprint is defined by the set of sizes of fragments from the clone created by restriction enzyme digest and measured by gel electrophoresis. Sets of clones containing statistically significant overlaps between their fingerprints are grouped into fingerprint clone contigs. Sequenced clones from a fingerprint clone contig are used for the sequence assembly. The May 24 map of sequenced clones consisted of some 1700 fingerprint clone contigs, each with an approximate chromosomal location, plus a few additional contigs that could not be reliably placed on a chromosome. The end points of the individual sequenced clones, as well as their overlaps and relative order along the chromosome, were only very roughly determined in these fingerprint clone contigs. Thus, the problem of clone order needed to be solved along with the problem of initial sequence contig order and orientation.
Initial sequence contigs from different clones within a fingerprint clone contig often showed long sequence overlaps, giving strong evidence of clone order, but not giving an entirely unambiguous signal because of the occasional presence of near exact duplicated regions (Ji et al. 2000). These two problems were somewhat intertwined: Clone order information from the fingerprint map was needed during the assembly of the initial sequence contigs within a fingerprint clone contig, and this assembly led to refined clone order.

Further evidence of initial sequence contig order and orientation was obtained from sequence matches between the initial sequence contigs and mRNA or EST sequences. These matches help to order and orient the initial sequence contigs that contain the exons of a gene, even if these exons are separated by quite long introns, improving the usefulness of the working draft for the study of genes. A greater number of matches could be found between the initial sequence contigs from the shotgun-sequenced clones from the freeze and the paired ends of ∼500,000 BAC clones that were only end-sequenced (Zhao 2000). These matches also provide order and orientation information for the initial sequence contigs, but can be misleading because a significant percentage of the BAC end sequences are mispaired (Zhao 2000).

A greedy (Cormen et al. 1990) algorithm, called GigAssembler, was developed to use the initial sequence contig, map, mRNA, EST, and BAC end data to assemble the genome sequence of the May 24 freeze (Kent and Haussler 2000). The resulting assembly, produced in mid June, consisted of 2,182,660,273 base pairs covering about 70% of the genome. This was quickly followed by an assembly covering significantly more of the genome using the data from the June 15 freeze. Since that time, further new human sequence has been added to the public databases, and new freezes have been declared periodically. The GigAssembler algorithm has developed further as well during that time. Starting with the September freeze, matches from the paired end sequences of plasmid clones were added to improve the ordering and orientation. These were taken from approximately one million genome-wide random reads from the Genome Center at the Whitehead Institute for Biomedical Research (WIBR) in the assembly of the September 5 freeze, and about one million further reads from the Sanger Centre and WUGSC in the October 7 freeze, made available through the SNP Consortium at http://snp.cshl.org. In addition, ordering and orientation information for the initial sequence contigs that is contained in some of the public database accessions was extracted and used by GigAssembler in later assemblies, along with information from assembled contigs of finished sequence (“NT contigs”). A history of the assemblies is given in Table 1.

3Corresponding author. E-MAIL kent@biology.ucsc.edu; FAX (831) 459-4829. Article published on-line before print: Genome Res., 10.1101/gr.183201. Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.183201. Methods. Genome Research 11:1541–1548 ©2001 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/01 $5.00; www.genome.org

The first release to the public was July 7, 2000, on http://genome.ucsc.edu, and included both the assembly of the May 24 freeze and the substantially larger assembly of the June 15 freeze. All subsequent assemblies have been available at that Web site as well.

The purpose of this paper is to describe the algorithm used by GigAssembler. A description of the assembly itself, and the discoveries that have been made using the assembled working draft sequence, is given in the paper on the working draft genome by the International Sequencing Consortium and related papers (Bentley et al. 2001; Bock et al. 2001; Cheung et al. 2001; Clayton et al. 2001; Fahrer et al. 2001; Futreal et al.
2001; International Human Genome Sequencing Consortium 2001; International SNP Map Working Group 2001; Li et al. 2001; Murray and Marks 2001; Nestler and Landsman 2001; Pollard 2001; Riethman et al. 2001; Tupler et al. 2001; Wolfsberg et al. 2001; Yu 2001).

There are many algorithms that assemble reads from subclones of a single BAC, PAC, cosmid, or other clone, or from a whole-genome shotgun of a relatively small and not strongly repetitive genome (Bonfield et al. 1995; Sutton et al. 1995; Huang 1996; Huang and Madan 1999; P. Green, http://www.genome.washington.edu/UWGC/analysistools/phrap.htm). The sequencing centers primarily used PHRAP (Green, unpubl. software) to assemble shotgun reads of subclones of a large insert clone into initial sequence contigs. Myers has pioneered an alternative approach to the assembly of larger genomes, working directly from paired whole-genome shotgun reads (Anson and Myers 1999). This method, embodied in the Celera assembler, was successful for the Drosophila melanogaster genome (Myers et al. 2000) and has been applied to the human genome as well. There has not yet been an opportunity for us to compare the results of the Celera assembly to GigAssembler’s assembly.

Most successful assembly algorithms, whether for individual large insert clones or for whole genomes, have been based on greedy methods, and all have certain features in common. They begin by looking for sequence overlaps among the reads to be assembled and build up sequence contigs by making the best overlaps first. All must have some heuristics to avoid being confused by repetitive sequence. Some, like CAP (Huang 1996; Huang and Madan 1999), have heuristics to detect chimeric reads in their input, that is, reads composed of sequence from two or more nonadjacent places in the genome. Some, like the Celera assembler, CAP3 (Huang 1996), and GigAssembler, build scaffolds that consist of several ordered and oriented sequence contigs separated by se-

Box 1.
The Human Genome Sequencing Consortium

The twenty institutions that form the Human Genome Sequencing Consortium include:
Baylor College of Medicine, Houston, Texas, USA (http://www.hgsc.bcm.tmc.edu/);
Beijing Human Genome Center, Institute of Genetics, Chinese Academy of Sciences, Beijing, China (http://hgc.igtp.ac.cn/);
Cold Spring Harbor Laboratory, Lita Annenberg Hazen Genome Center, Cold Spring Harbor, New York, USA (http://nucleus.cshl.org/genseq/lita annenberg hazen genome cent.htm);
Gesellschaft für Biotechnologische Forschung mbH, Braunschweig, Germany (http://genome.gbf.de/);
Genoscope, Evry, France (http://www.genoscope.fr/);
Genome Therapeutics Corporation, Waltham, Massachusetts, USA (http://www.genomecorp.com/);
Institute for Molecular Biotechnology, Jena, Germany (http://genome.imb-jena.de/);
Joint Genome Institute, U.S. Department of Energy, Walnut Creek, California, USA (http://www.jgi.doe.gov/);
Keio University, Tokyo, Japan (http://www-alis.tokyo.jst.go.jp/HGS/);
Max Planck Institute for Molecular Genetics, Berlin, Germany (http://seq.mpimg-berlin-dahlem.mpg.de/);
RIKEN Genomic Sciences Center, Saitama, Japan (http://hgp.gsc.riken.go.jp/);
The Sanger Centre, Hinxton, UK (http://www.sanger.ac.uk/HGP/);
Stanford Genome Technology Center, Palo Alto, California, USA (http://www-sequence.stanford.edu/);
Stanford Human Genome Center, Palo Alto, California, USA (http://shgc.stanford.edu/);
University of Oklahoma’s Advanced Center for Genome Technology, Oklahoma, USA (http://www.genome.ou.edu/);
University of Texas Southwestern Medical Center at Dallas, Texas, USA (http://www3.utsouthwestern.edu/index.htm);
University of Washington Genome Center, Seattle, Washington, USA (http://www.genome.washington.edu/UWGC/);
Multimegabase Sequencing Center, Institute for Systems Biology, Seattle, Washington, USA (http://www.systemsbiology.org/);
Whitehead Institute for Biomedical Research, MIT, Cambridge, Massachusetts, USA (http://www-genome.wi.mit.edu/);
and the Washington University Genome Sequencing Center, St. Louis, Missouri, USA (http://genome.wustl.edu/gsc/).

Table 1. Working Drafts Produced by GigAssembler

| Freeze | Input (gigabases) | Output (gigabases) | % of genome covered |
|---|---|---|---|
| May 24, 2000 | 3.3 | 2.2 | 70 |
| June 15, 2000 | 3.6 | 2.5 | 82 |
| July 17, 2000 | 4 | 2.7 | 87 |
| September 5, 2000 | 4.1 | 2.7 | 87 |
| October 7, 2000 | 4.2 | 2.7 | 88 |
| December 12, 2000 | 4.3 | 2.7 | 90 |
| April 1, 2001 | 4.5 | 2.8 | 92 |

The total size of all initial sequence contigs that are input to the assembly and the total size of the contigs produced by the assembly, both in gigabases, are shown for the assemblies produced by GigAssembler. Percent of the genome covered is estimated as described in the International Human Genome Sequencing Consortium (2001). As the quality of the input data improved, the amount of artifactual duplication was reduced, resulting in a higher increase in the percentage of the genome covered relative to the increase in the total size of the contigs of the assembly.

building stage. The clone overlap is the sum of all of the initial sequence contig overlaps. Note that the clone overlap gives us relative position but not relative orientation of the two clones as the orientation of initial sequence contigs within a clone is not necessarily consistent.
The overlaps are used to order the clones in the following manner.
(a) Clones that are completely enclosed by another clone are put aside.
(b) A clone is selected and the most overlapping other clone is joined with it to initialize an ordered list of clones.
(c) Given an ordered list of clones ABCD, while there are still clones in the fingerprint clone contig that significantly overlap clones in this list, the clone X that overlaps as much as possible with another element on the list is selected and inserted into the list as follows:
(i) If X overlaps A but not B, the order becomes XABCD.
(ii) If X overlaps A and B, and overlap(A,B) < overlap(A,X) and overlap(A,B) < overlap(B,X), then the order becomes AXBCD. Here, overlap(A,B) is the total number of bases in the overlaps between the initial sequence contigs of A and the initial sequence contigs of B.
(iii) Otherwise, steps i and ii are repeated, shifting the list so that C is considered in place of B and B in place of A.
(iv) If there are still clones left in the fingerprint clone contig that have not been added to the ordered list after the iterations of the above steps cease, then a new barge is started to accommodate the remaining clones.
(d) The order of the clones in each barge is compared to the fingerprint map, and if the barge looks backwards the order of its clones is reversed.
(e) Barge coordinates are given in the following manner:
(i) The first clone is given an offset of zero.
(ii) For the nth clone, $\mathrm{Offset}(n+1) = \mathrm{Offset}(n) + \mathrm{Size}(n) - \mathrm{Overlap}(n, n+1)$, where the size of a clone is defined as the total length of all its initial sequence contigs.
(f) Clones that are completely enclosed are given the coordinate $\mathrm{Offset}(\mathrm{inner}) = \mathrm{Offset}(\mathrm{outer}) + (\mathrm{Size}(\mathrm{outer}) - \mathrm{Size}(\mathrm{inner}))/2$.

(3) Assign default coordinates to initial sequence contigs. The default coordinate of an initial sequence contig is just the barge offset of the clone it is in plus its start position in the accession for the clone.
Default coordinates are then assigned to a raft based on the average of the coordinates of the initial sequence contigs making up the raft.

(4) Build a “raft-ordering” graph. This is a directed graph with two types of nodes: rafts and sequenced clone end points. An edge from one node to another implies the first node comes before the second. Associated with each edge is also a range of distances allowed between the centers of the objects represented by the two nodes. As an example, consider the sequenced clones A, B, C containing the initial sequence contigs a1, a2, b1, b2, c1, and c2 laid out in Figure 4. The rafts in this figure are a1b1a2, b2c1, and c2. The initial graph would just contain the ends of the clones. With the start and end of clone A denoted As and Ae, this graph is shown in Figure 5. Adding the rafts gives what is depicted in Figure 6.

(5) Add information from mRNAs, ESTs, paired plasmid reads, BAC end pairs, and ordering information from the sequencing centers. This information is used to connect rafts in the ordering graph in a three-step process: building a ‘bridge’ out of alignments of other data with initial sequence contigs, scoring the bridge, and then adding bridges one at a time, best scoring first, to the ordering graph. A bridge defines order and orientation of the initial sequence contigs, as well as an allowable range of distances between them. The score function for bridges is the sum of two factors. The first factor is based on the type of the information. mRNA information is given the highest weight, then paired plasmid reads, information provided by the sequencing centers, ESTs, and BAC end matches, in that order. The second factor is based on the strength of the underlying alignment and is very similar to the score used for building rafts. Bridges that would conflict with the graph as constructed so far are rejected. Conflicts are detected using the Bellman-Ford algorithm as described in the Notes below.
Figure 3 Merging into a raft. A contig (‘raft’) of three sequences: A, B, and C has already been constructed by GigAssembler. The program now examines an alignment between sequence C and a new sequence, D, to see whether D should also be added to the raft. The parts of D marked with +s are compatible with the raft because of the C/D alignment. The program must also check that the parts of D marked with ?s are compatible with the raft by examining other alignments.

Figure 4 Three overlapping draft clones: A, B, and C. Each clone has two initial sequence contigs. Note that initial sequence contigs a1, b1, and a2 overlap as do b2 and c1.

Figure 5 Ordering graph of clone starts and ends. This represents the same clones as in Fig. 4. (As) The start of clone A; (Ae) the end of clone A. Similarly Bs, Be, Cs, and Ce represent the starts and ends of clones B and C.

(6) Walk the bridge graph to get an ordering of rafts. Each bridge is walked in the order of the default coordinates assigned in step 3 subject to the constraint that if a raft has predecessors, all the predecessors must be walked before the raft is walked.

(7) A sequence path through each raft is built as follows:
(a) Find the longest, most finished initial sequence contig that passes through each section of the raft.
(b) Put the best initial sequence contig for the first part of the raft into the sequence path.
(c) Find an alignment between the best initial sequence contig for the first part of the raft and the best initial sequence contig for the second part of the raft.
(d) Search for a ‘crossover point’ in the alignment where it would be reasonable to switch the sequence path to the next initial sequence contig. This crossover point is ideally 250 bases from the end of the larger, more finished initial sequence contig, but may be adjusted depending on the exact alignment.
(e) Repeat steps c and d to extend the sequence path until the end of the raft is reached.

(8) Build the final sequence for the fingerprint clone contig by inserting the appropriate number of Ns between raft sequence paths. Currently 100 Ns are inserted between rafts that are part of the same barge, 50,000 Ns between barges that are bridged, and 100,000 Ns between unbridged barges.

Notes

The raft-ordering graph built in Steps 4 and 5 specifies a partial order on the midpoints of the rafts from a fingerprint clone contig. Additional constraints on this ordering are given by the distance ranges allowed for each bridge between rafts. The entire collection of partial order and distance constraints is represented by a conjunction of difference constraints, as defined, for example, in Cormen et al. (1990), pages 540–543. Each distance constraint has the form $x - y \le B$, where $x$ and $y$ are variables representing the midpoints of two rafts, and $B$ is a constant, representing a bound on the number of base pairs that separate $x$ and $y$. If it can be inferred from the partial order that $x$ comes before (left of) $y$, then we can specify difference constraints that represent both an upper bound $B$ and a lower bound $b$ on the distance between $x$ and $y$. The upper bound is expressed as $y - x \le B$. The lower bound is expressed with the distance constraint $x - y \le -b$. However, if the order of $x$ and $y$ is unknown, then only an upper bound $B$ on the distance that separates them can be specified as part of a conjunction of distance constraints. This is specified by the two constraints $x - y \le B$ and $y - x \le B$. A lower bound would require a disjunction of two distance constraints. Finally, if it is known that $x$ comes before $y$ but no distance bounds are known, then this fact can be represented by the distance constraint $x - y \le 0$.
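As a small worked illustration of this encoding (the three rafts and all distances here are invented for the example and are not taken from the assembly), let $x$, $y$, and $z$ be raft midpoints. Suppose $x$ precedes $y$ by between 1,000 and 5,000 bases, $y$ precedes $z$ with no distance bound, and a further candidate bridge claims that $z$ lies at most 500 bases beyond $x$:

```latex
\begin{align*}
y - x &\le 5000  && \text{upper bound, $x$ before $y$} \\
x - y &\le -1000 && \text{lower bound, $x$ before $y$} \\
y - z &\le 0     && \text{$y$ before $z$, no distance bound} \\
z - x &\le 500   && \text{candidate bridge} \\[4pt]
(x - y) + (y - z) + (z - x) &= 0 \;\le\; -1000 + 0 + 500 = -500
  && \text{sum of the last three: a contradiction}
\end{align*}
```

The first three constraints alone are satisfiable, for instance by $x = 0$, $y = 1000$, $z = 2000$, so under these assumed numbers it is the candidate bridge that would have to be rejected.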
The conjunction of distance constraints associated with a raft-ordering graph has a feasible solution if and only if there is a placement of the rafts such that their midpoints are linearly ordered in a manner consistent with the partial order in the raft-ordering graph and the distances between the midpoints of bridged pairs of rafts are in the allowed distance ranges. The Bellman-Ford algorithm for single source shortest paths can be used to determine if a conjunction of distance constraints has a feasible solution in time proportional to the number of constraints (bridges) times the number of variables (rafts), as shown in Cormen et al. (1990). A similar approach was used for a physical mapping problem by Thayer et al. (1999). The feasibility problem for conjunctions of difference constraints is a special case of the linear programming problem and can also be solved by linear programming algorithms, but these can be slower.

In the GigAssembler algorithm, bridges between rafts are added incrementally in a greedy fashion, and the Bellman-Ford procedure is only used to test feasibility of a new bridge. Because a lower bound cannot be enforced on the distance between the midpoints of two unordered rafts in a conjunction of difference constraints, the solution of the conjunction of difference constraints may give a positioning of the midpoints of the rafts that puts some pairs of rafts too close, possibly overlapping them in a way that contradicts the sequence data. That is why Step 6 is necessary. It would be more elegant to combine disjunctions and conjunctions of distance constraints in specifying the raft layout problem, but unfortunately such a combination leads to an NP-hard feasibility problem (Papadimitriou and Steiglitz 1982). Thus, our final layout step employs a simple heuristic approach, rather than an integrated optimization of all constraints.
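The feasibility test and the greedy bridge loop described in these notes can be sketched roughly as follows. This is a minimal illustration of the textbook Bellman-Ford check on difference constraints, not GigAssembler's actual code: the function names (`feasible`, `ordered_bridge`, `add_bridges_greedily`), the raft labels, the scores, and the distances are all hypothetical.

```python
from typing import Dict, List, Tuple

# A difference constraint (u, v, w) encodes: u - v <= w,
# where u and v name raft midpoints and w is a distance in bases.
Constraint = Tuple[str, str, float]


def feasible(variables: List[str], constraints: List[Constraint]) -> bool:
    """Return True if every constraint u - v <= w can hold simultaneously."""
    # Each constraint u - v <= w is an edge v -> u of weight w. Starting all
    # distances at 0 is equivalent to a virtual source joined to every
    # variable by a zero-weight edge.
    dist: Dict[str, float] = {name: 0.0 for name in variables}
    for _ in range(len(variables) + 1):
        changed = False
        for u, v, w in constraints:
            if dist[v] + w < dist[u]:
                dist[u] = dist[v] + w
                changed = True
        if not changed:
            return True  # converged: dist itself is a valid placement
    return False  # still relaxing after |V| + 1 passes: negative cycle


def ordered_bridge(first: str, second: str, lo: float, hi: float) -> List[Constraint]:
    """Constraints for 'first precedes second by between lo and hi bases'."""
    return [(second, first, hi), (first, second, -lo)]


def add_bridges_greedily(
    variables: List[str],
    scored_bridges: List[Tuple[float, List[Constraint]]],
) -> List[Constraint]:
    """Add bridges best score first, skipping any that make the system infeasible."""
    accepted: List[Constraint] = []
    for _score, bridge in sorted(scored_bridges, key=lambda b: b[0], reverse=True):
        if feasible(variables, accepted + bridge):
            accepted.extend(bridge)
    return accepted


# Hypothetical example: three rafts and three candidate bridges.
rafts = ["r1", "r2", "r3"]
bridges = [
    (10.0, ordered_bridge("r1", "r2", 1000, 5000)),  # r1 before r2, 1-5 kb apart
    (8.0, [("r2", "r3", 0)]),                        # r2 before r3, no distance bound
    (5.0, [("r3", "r1", 500)]),                      # r3 at most 500 bases past r1
]
accepted = add_bridges_greedily(rafts, bridges)
print(len(accepted))  # 3: the lowest-scoring bridge conflicts and is dropped
```

Re-running the full check for every candidate bridge keeps the sketch short; an implementation concerned with speed could presumably reuse distances from the previous pass, but that is outside what the text specifies.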
Figure 6 Ordering graph after adding in rafts. The initial sequence contigs shown in Fig. 4 are merged into rafts where they overlap. This forms three rafts: a1b1a2, b2c1, and c2. These rafts are constrained to lie between the relevant clone ends by the addition of additional ordering edges to the graph shown in Fig. 5.

Assessment of the Accuracy of the Orientation and Order

We created two test sets to assess the accuracy of the October 7 freeze working draft as assembled by GigAssembler. The first, called FinishedContigs, is a collection of 24 clone contigs with a total of 145 clones taken from chromosomes 7, 12, 14, 17, 20, 21, and 22 for which we have finished sequence spanning the entire clone contig. The number of clones per clone contig varies from 4 to 13. We obtained a draft version for 88 of these clones by looking for a previous version of the finished clone in GenBank. The second test set, called Scrambled22, was generated by Ray Wheeler at Neomorphic, Inc. by taking the 12 finished sequence contigs from chromosome 22 and randomly choosing a tiling path of 233 ‘synthetic’ BACs covering them. The sequence of each synthetic BAC was then ‘draftified’ by the introduction of gaps, indels, and substitutions in a way that the statistics on the resulting initial sequence contigs reasonably matched the statistics from initial sequence contigs of real draft BACs. Finally, the initial sequence contigs were given random order and orientation.

We ran the GigAssembler algorithm on all of the clone contigs from both of these test data sets and compared its predicted order and orientation for the initial sequence contigs to the true order and orientation of the initial sequence contigs, as can be derived from the finished sequence. We measured the orientation agreement as the fraction of initial sequence contigs that were oriented correctly.
The average orientation agreement for the FinishedContigs test set was 0.90, and varied from 0.50 (near random guessing) to 1.0 (perfect) among the 23 clone contigs. Performance degraded as the number of initial sequence contigs per draft clone increased and the size of the initial sequence contigs decreased. On the 12 contigs of the Scrambled22 test set, the average orientation agreement was about 0.87 and varied from 0.71 to 1.0.

To measure the accuracy of the predicted order of the initial sequence contigs, we counted the number of violations of monotonicity in the order of the starting positions of the initial sequence contigs. A violation of monotonicity occurs at initial sequence contig A when the initial sequence contig following A in the predicted order in fact should come before A. For example, if the correct order of the initial sequence contigs is 1,2,3,4,5,6,7 and the predicted order is 1,5,2,4,7,3,6, then there are two violations, at initial sequence contigs A = 5 and A = 7. We measure order agreement as the fraction of initial sequence contigs in the predicted order where violations do not occur (excluding the last initial sequence contig in the predicted order, which cannot have a violation). In the above example, the order agreement is 4/6. The average order agreement for the FinishedContigs test set was 0.85, and varied from 0.50 to 1.0 among the 23 clone contigs. On the 12 contigs of the Scrambled22 test set, the average order agreement was 0.83 and varied from 0.74 to 0.93.

Conclusions

GigAssembler could be improved in several ways. A mechanism to detect misassemblies or chimerism in the initial sequence contigs could be quite helpful, perhaps along the lines of the mechanism developed for CAP2 (Huang 1996). Improved use of clone end information might lead to better barge construction in Step 2 of the algorithm as well (very recent modifications to GigAssembler to include this information have borne this out).
Both of these improvements would reduce the rate of artifactual duplication in the draft assembly, which occurs when valid overlaps are not detected and thus the same region of the genome is assembled in two different parts of the assembly. It was estimated that 3% of the October 7 working draft genome sequence represented artifactual duplication, some caused by lack of complete assembly by GigAssembler and some by missed overlaps between fingerprint clone contigs (International Human Genome Sequencing Consortium 2001). More use could be made of PHRAP quality scores for individual bases during the assembly as well. Finally, it would be helpful to have some methods to impose an upper bound on the total assembled size of a large insert clone, and to eliminate initial sequence contigs from the assembly entirely when it appears that they don’t belong with the clone, or have other serious problems. Cross contamination from other clones is one source of initial sequence contigs that don’t belong. It is estimated to be rare (International Human Genome Sequencing Consortium 2001) but is a significant problem for draft-sequenced clones because it is difficult to detect and eliminate until the clones are “topped up” to higher levels of coverage.

The assembly of the working draft of the human genome, although still imperfect, has permitted significant research to go forward, rather than wait years for the finishing of the sequence. In particular, having an assembly has allowed the construction of genome-wide gene prediction sets, and the side-by-side comparison of different kinds of genome annotation, including chromosomal band locations, STS positions for genetic, radiation hybrid, YAC–STS and cytogenetic maps, GC content, density of various repeat families, CpG islands, ESTs, mRNAs, SNPs, and both known and predicted genes (International Human Genome Sequencing Consortium 2001).
These comparisons are possible through the online genome browsers at UCSC (http://genome.ucsc.edu/goldenPath/hgTracks.html) and ENSEMBL (http://www.ensembl.org), both of which use the GigAssembler assembly. Some of the discoveries that have been made using these and other tools are described elsewhere (Bentley et al. 2001; Bock et al. 2001; Cheung et al. 2001; Clayton et al. 2001; Fahrer et al. 2001; Futreal et al. 2001; International Human Genome Sequencing Consortium 2001; International SNP Map Working Group 2001; Li et al. 2001; Murray and Marks 2001; Nestler and Landsman 2001; Pollard 2001; Riethman et al. 2001; Tupler et al. 2001; Wolfsberg et al. 2001; Yu 2001). However, the Web pages at http://genome.ucsc.edu are currently averaging >40,000 page requests per day, hence we suspect that the bulk of the new discoveries that this work has enabled have yet to be reported.

ACKNOWLEDGMENTS

D.H. acknowledges support from DOE Grant F603–99ERG2849, NSF Grant DBI-9808007, Dean Patrick Mantey, and Chancellor M.R.C. Greenwood for the 100-node computational cluster. Thanks to Alan Zahler for his advice and encouragement. We thank Scot Kennedy, Terry Furey, Ray Wheeler, Aaron Tomb, Patrick Gavin, Nick Littlestone, and Paul Tatarsky for help testing the sequence and for technical support, and Ann Pace and Maria Guarino for administrative support. We thank Bob Waterston, Eric Lander, Francis Collins, Ewan Birney, Greg Schuler, John Sulston, and the rest of the genome analysis group for their advice and additional support.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.

REFERENCES

Anson, E. and Myers, G. 1999. Algorithms for whole genome shotgun sequencing. Proc. RECOMB ’99, Lyon, France. 1–9.
BAC Resource Consortium. 2001. Integration of cytogenetic landmarks into the draft sequence of the human genome. Nature 409: 953–958.
Bentley, D.R., Deloukas, P., Dunham, A., French, L., Gregory, S.G., Humphrey, S.J., Mungall, A.J., Ross, M.T., Carter, N.P., Dunham, I., et al. 2001. The physical maps for sequencing human chromosomes 1, 6, 9, 10, 13, 20 and X. Nature 409: 942–943.
Bock, J.B., Matern, H.T., Peden, A.A., and Scheller, R.H. 2001. A genomic perspective on membrane compartment organization. Nature 409: 839–841.
Bonfield, J.K., Smith, K.F., and Staden, R. 1995. A new DNA sequence assembly program. Nucleic Acids Res. 23: 4992–4999.
Cheung, V.G., et al. 2001. Integration of cytogenetic landmarks into the draft sequence of the human genome. Nature 409: 953–958.
Clayton, J.D., Kyriacou, C.P., and Reppert, S.M. 2001. Keeping time with the human genome. Nature 409: 829–831.
Cormen, T.H., Leiserson, C.E., and Rivest, R.L. 1990. Introduction to algorithms. MIT Press, Cambridge, MA.
Dunham, I., Hunt, A.R., Collins, J.E., Bruskiewich, R., Beare, D.M., Clamp, M., Smink, L.J., Ainscough, R., Almeida, J.P., Babbage, A., et al. 1999. The DNA sequence of human chromosome 22. Nature 402: 489–495.
Fahrer, A.M., Bazan, J.F., Papathanasiou, P., and Goodnow, C.C. 2001. A genomic view of immunology. Nature 409: 836–838.
