Genomic Sequencing: Accurate SNP Discovery with Template Sequences

A method for accurately identifying single-nucleotide polymorphisms (SNPs) in sequence data, using genomic sequence as a template and Bayesian statistical modelling. The authors demonstrate their approach on EST data and show that it can identify SNPs in ESTs aligned to finished and working-draft quality genomic sequences. They also discuss the importance of paralogue identification and the challenges of dealing with low-quality data.

Letter, Nature Genetics, volume 23, December 1999 (page 452)

A general approach to single-nucleotide polymorphism discovery

Gabor T. Marth¹, Ian Korf¹, Mark D. Yandell¹, Raymond T. Yeh¹, Zhijie Gu², Hamideh Zakeri², Nathan O. Stitziel¹, LaDeana Hillier¹, Pui-Yan Kwok² & Warren R. Gish¹

Washington University, ¹Department of Genetics and Genome Sequencing Center and ²Division of Dermatology, St. Louis, Missouri, USA. Correspondence should be addressed to G.T.M. (e-mail: gmarth@watson.wustl.edu) or P.-Y.K. (e-mail: kwok@im.wustl.edu).

Single-nucleotide polymorphisms (SNPs) are the most abundant form of human genetic variation and a resource for mapping complex genetic traits [1]. The large volume of data produced by high-throughput sequencing projects is a rich and largely untapped source of SNPs (refs 2–5). We present here a unified approach to the discovery of variations in genetic sequence data of arbitrary DNA sources. We propose to use the rapidly emerging genomic sequence [6,7] as a template on which to layer often unmapped, fragmentary sequence data [8–11] and to use base quality values [12] to discern true allelic variations from sequencing errors. By taking advantage of the genomic sequence we are able to use simpler yet more accurate methods for sequence organization: fragment clustering, paralogue identification and multiple alignment. We analyse these sequences with a novel, Bayesian inference engine, POLYBAYES, to calculate the probability that a given site is polymorphic. Rigorous treatment of base quality permits completely automated evaluation of the full length of all sequences, without limitations on alignment depth.
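To make the role of base quality values concrete, the following minimal sketch converts between a quality value and a per-base error probability. It assumes the quality values follow the standard Phred convention, Q = -10 log10(P_error); the function names are hypothetical and are not part of POLYBAYES.

```python
import math


def phred_to_error(q: float) -> float:
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred quality corresponding to a per-base error probability p."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Higher quality values mean exponentially fewer expected errors.
    for q in (10, 20, 30, 40):
        per_error = round(1 / phred_to_error(q))
        print(f"Q{q}: about 1 error in {per_error:,} bases")
```

Under this convention an error rate of 1 in 10,000 bases corresponds to a quality value of 40, and a quality value of 20 corresponds to 1 error in 100 bases.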
We demonstrate this approach by accurate SNP predictions in human ESTs aligned to finished and working-draft quality genomic sequences, a data set representative of the typical challenges of sequence-based SNP discovery. We started with 1,268,211 bp finished (less than 1 error per 10,000 bp) human reference sequence of 10 genomic clones, with EST content typical of gene-bearing clones. To initiate the analysis procedure (Fig. 1) to identify human ESTs that originated from these clones, we performed a database search against the public EST set (dbEST) and recovered 1,954 hits (representing potentially multiple exons of 1,365 unique ESTs) for which chromatograms were available. Sequence clusters were constructed as groups of overlapping alignments (147 clusters). Sequence traces were re-processed with the PHRED base-calling program [13,14] to obtain base quality values. Subsequent analyses used the full length of the ESTs, including low-quality portions. Cluster members were multiply aligned with an anchored alignment technique. Unlike traditional algorithms, this method rapidly produces correct multiple alignments even in the presence of abundantly expressed or alternatively spliced transcripts. In total, EST clusters represented 80,469 bp of expressed genomic sequence, 38% of this in regions of single EST coverage and 81% in regions covered by 8 or fewer ESTs (Table 1).

Fig. 1 Application of the POLYBAYES procedure to EST data. a, Regions of known human repeats in a genomic sequence are masked. b, Matching human ESTs are retrieved from dbEST and traces are re-called. c, Paralogous ESTs are identified and discarded. d, Alignments of native EST reads are screened for candidate variable sites. e, An STS is designed for the verification of a candidate SNP. f, The uniqueness of the genomic location is determined by sequencing the STS in CHM1 (homozygous DNA). g, The presence of a SNP is analysed by sequencing the STS from pooled DNA samples.

© 1999 Nature America Inc. • http://genetics.nature.com

Inclusion of sequences representing highly similar regions duplicated elsewhere in the genome may give rise to false SNP predictions, and the presence of such sequence paralogues points to difficulties during marker development. We devised a Bayesian [15] discrimination algorithm (Fig. 2a) that takes into account base quality values to calculate the probability, P_NAT, that a cluster member is native to (derived from) the given genomic region. The bimodal distribution of these probability values (Fig. 2b) indicates that we can distinguish between less accurate sequences that nevertheless originate from the same underlying genomic location, and more accurate sequences with high-quality discrepancies that are likely to be paralogous. Using a conservative threshold value, P_NAT,MIN, of 0.75, 23% of cluster members were declared paralogous and removed from further consideration, leaving 69,756 sites of native EST coverage.

Once a proper data set is organized, the key to reliable detection of SNPs is the ability to discern true allelic variation from sequencing error. To this end, we have developed a Bayesian-statistical model for the mathematically rigorous treatment of sequence differences within a multiple alignment that takes into account the depth of coverage, the base quality values of the sequences and the a priori expected rate of polymorphic sites in the region. For each site within a multiple alignment of native sequences, the POLYBAYES algorithm calculates the probability, P_SNP, that the site is polymorphic, as opposed to monomorphic.
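The paralogue discrimination step can be illustrated with the example values given in the Fig. 2a legend (250 bp, EST quality 20, anchor quality 40, P_POLY,2 = 0.001, P_PAR = 0.02). The sketch below reproduces the expected discrepancy counts quoted there (E = 2.525, D_NAT = 2.775, D_PAR = 7.525). The Poisson likelihood, the flat prior of 0.5 and all function names are assumptions made for illustration; they are not taken from the text above and should not be read as the POLYBAYES implementation.

```python
import math


def phred_to_error(q: float) -> float:
    """Per-base error probability under the standard Phred convention."""
    return 10 ** (-q / 10)


def expected_discrepancies(length, q_est, q_anchor, extra_rate=0.0):
    """Expected mismatches from sequencing error plus a model-specific rate."""
    error_term = length * (phred_to_error(q_est) + phred_to_error(q_anchor))
    return error_term + length * extra_rate


def poisson_pmf(d: int, mean: float) -> float:
    """Assumed likelihood of observing d discrepancies given the mean."""
    return math.exp(-mean) * mean ** d / math.factorial(d)


def p_native(d, d_nat, d_par, prior_nat=0.5):
    """Posterior probability of the native model (flat prior is an assumption)."""
    nat = poisson_pmf(d, d_nat) * prior_nat
    par = poisson_pmf(d, d_par) * (1.0 - prior_nat)
    return nat / (nat + par)


# Example values from the Fig. 2a legend.
LENGTH, Q_EST, Q_ANCHOR = 250, 20, 40
E = expected_discrepancies(LENGTH, Q_EST, Q_ANCHOR)
D_NAT = expected_discrepancies(LENGTH, Q_EST, Q_ANCHOR, extra_rate=0.001)
D_PAR = expected_discrepancies(LENGTH, Q_EST, Q_ANCHOR, extra_rate=0.02)

if __name__ == "__main__":
    print(f"E={E:.3f}  D_NAT={D_NAT:.3f}  D_PAR={D_PAR:.3f}")
    for d in range(0, 11):
        label = "native" if p_native(d, D_NAT, D_PAR) > 0.75 else "paralogous"
        print(d, f"{p_native(d, D_NAT, D_PAR):.3f}", label)
```

Under these assumptions, an EST with few discrepancies against the anchor scores close to 1 and one with many discrepancies scores close to 0, which is in keeping with the bimodal distribution of P_NAT described above.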
The distribution of probability scores (Fig. 3a) exhibits a high level of specificity: most sites (99.83%) produce scores below 0.1. They represent sites either with no disagreements between aligned sequences or with low-quality discrepancies that are likely the result of sequencing errors or possibly very rare SNPs. By marking a site as a candidate SNP if the corresponding SNP probability exceeded a threshold value, P_SNP,MIN, of 0.40, we extracted 97 candidates. Of these, 38 were located in adenine-rich regions of the genomic clones matching the 3′ ends of ESTs. Subsequent negative verification results are consistent with the hypothesis [16] that these sites result from internal priming events during cDNA library construction and that the adenine allele is contributed by the reverse transcription primer rather than the RNA template.

We validated candidate sites with a pooled sequencing approach [17] that allowed us to confirm true positives, provided the minor allele frequency was above 10%. We eliminated five candidates that did not fulfil this requirement. An additional 18 sites could not be analysed for lack of unique amplification (9 candidates in regions of low complexity or repetitive sequence, 4 candidates for unknown reasons and, in 5 cases, the homozygous control genome [18] indicated the presence of paralogues absent in the EST set). Of the remaining 36 sites, 20 were confirmed in at least 1 of 4 populations screened (13 transitions, 7 transversions), yielding a 56% overall confirmation rate.

The confirmation rate is somewhat lower than the average SNP score of 0.78. Some of this difference may be due to systematic base-calling errors (compressions) and reverse transcriptase errors introduced during cDNA library construction. Several of the candidate sites may be true polymorphisms specific to the donors of the cDNA samples but absent in the population pools used in verification.
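The counts in the verification passage can be checked with a few lines of arithmetic. The sketch below assumes that the 38 adenine-rich sites and the five low-frequency sites are both set aside before the 18 unanalysable sites are subtracted; this reading is inferred from the numbers (97 - 38 - 5 - 18 = 36) rather than stated explicitly, and the function name is hypothetical.

```python
def verification_funnel(candidates, adenine_rich, below_frequency,
                        not_analysable, confirmed):
    """Follow candidate SNPs through each exclusion step to the final rate."""
    after_priming = candidates - adenine_rich          # internal-priming sites set aside
    after_frequency = after_priming - below_frequency  # minor allele frequency requirement
    analysed = after_frequency - not_analysable        # unique amplification required
    return {
        "after_priming": after_priming,
        "after_frequency": after_frequency,
        "analysed": analysed,
        "confirmed": confirmed,
        "confirmation_rate": confirmed / analysed,
    }


funnel = verification_funnel(
    candidates=97,
    adenine_rich=38,
    below_frequency=5,
    not_analysable=9 + 4 + 5,   # low complexity, unknown reasons, paralogues in control
    confirmed=13 + 7,           # transitions + transversions
)

if __name__ == "__main__":
    for step, value in funnel.items():
        print(step, value)
```

With these inputs the sketch arrives at the 36 analysed sites and the 20 confirmed SNPs reported in the text, and 20 of 36 rounds to the stated 56%.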
Although precise calibration of the SNP probability values would require analysing the genomic source of

[Figure 2: a, probability plotted against the number of discrepancies (d) for the native and paralogous models, with D_NAT, D_PAR and P_NAT,MIN marked; b, number of ESTs plotted against P_NAT, with P_NAT,MIN marked.]

Fig. 2 Paralogue discrimination. a, Example probability distributions for a matching sequence with (hypothetical) uniform base quality values of 20, in pair-wise alignment with base-perfect genomic anchor sequence (quality values 40), over a length of 250 bp. P_POLY,2 = 0.001, P_PAR = 0.02, E = 2.525, D_NAT = 2.775 and D_PAR = 7.525. If the posterior probability, P_NAT, is higher than P_NAT,MIN, the EST is considered native; otherwise, it is considered paralogous. b, Distribution of the posterior probability values, P_NAT, calculated for 1,954 cluster members anchored to ten genomic clone sequences.

Table 1 • SNP discovery in EST alignments of varying coverage

Depth (a)  | Clusters before paralogue filtering (b) | Clusters after paralogue filtering (c) | Aligned sites before paralogue filtering (d) | Aligned sites after paralogue filtering (e) | Candidate SNPs (f) | Analysed SNPs (g) | Confirmed SNPs (h) | Confirmation rate (i)
1          | 47 (32.0%)  | 40 (32.0%)  | 30,828 (38.3%) | 26,275 (37.7%) | 12 (22.2%) | 6 (16.7%)  | 5 (25.0%)  | 83%
2          | 25 (17.0%)  | 24 (19.2%)  | 15,771 (19.6%) | 15,072 (21.6%) | 8 (14.8%)  | 7 (19.4%)  | 2 (10.0%)  | 29%
3,4        | 23 (15.6%)  | 21 (16.8%)  | 12,478 (15.5%) | 9,937 (14.2%)  | 17 (31.5%) | 8 (22.2%)  | 5 (25.0%)  | 63%
5–8        | 17 (11.6%)  | 14 (11.2%)  | 6,627 (8.2%)   | 5,467 (7.8%)   | 7 (13.0%)  | 7 (19.4%)  | 1 (5.0%)   | 14%
9–16       | 14 (9.5%)   | 8 (6.4%)    | 7,704 (9.6%)   | 6,383 (9.2%)   | 3 (5.5%)   | 3 (8.4%)   | 3 (15.0%)  | 100%
17 or more | 21 (14.3%)  | 18 (14.4%)  | 7,061 (8.8%)   | 6,662 (9.5%)   | 7 (13%)    | 5 (13.9%)  | 4 (20.0%)  | 80%
Total      | 147 (100%)  | 125 (100%)  | 80,469 (100%)  | 69,756 (100%)  | 54 (100%)  | 36 (100%)  | 20 (100%)  | Overall 56%

a, Depth of coverage (or cluster size), not including the genomic reference sequence.
b, Number of clusters of given cluster size before removal of paralogous ESTs.
c, Number of clusters of given cluster size after removal of paralogous ESTs.
d, Number of sites of given alignment depth in multiple alignments before removal of paralogous ESTs.
e, Number of sites of given alignment depth in multiple alignments after removal of paralogous ESTs.
f, Number of candidate SNPs found at sites of given alignment depth.
g, Number of unambiguously analysed candidate SNPs.
h, Number of SNPs confirmed in at least one of four population pools.
i, SNP confirmation rate.
b–i, Numbers in parentheses indicate percentages of relevant total.
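The confirmation rates in the last column of Table 1 follow from the analysed and confirmed counts in each row. The short sketch below recomputes them; the row tuples are transcribed from the table, while the helper name and the half-up rounding rule are assumptions chosen to match the printed percentages.

```python
# (depth, analysed candidates, confirmed SNPs), transcribed from Table 1.
ROWS = [
    ("1", 6, 5),
    ("2", 7, 2),
    ("3,4", 8, 5),
    ("5-8", 7, 1),
    ("9-16", 3, 3),
    ("17 or more", 5, 4),
]


def percent_half_up(numerator: int, denominator: int) -> int:
    """Integer percentage rounded half up.

    Python's built-in round() rounds halves to the nearest even number,
    so 62.5 would become 62, whereas the table prints 63%.
    """
    return (200 * numerator + denominator) // (2 * denominator)


rates = {depth: percent_half_up(conf, ana) for depth, ana, conf in ROWS}
total_analysed = sum(ana for _, ana, _ in ROWS)
total_confirmed = sum(conf for _, _, conf in ROWS)
overall = percent_half_up(total_confirmed, total_analysed)

if __name__ == "__main__":
    for depth, rate in rates.items():
        print(f"depth {depth}: {rate}%")
    print(f"overall: {total_confirmed}/{total_analysed} = {overall}%")
```

Because each depth class contains only a handful of analysed sites, the per-row rates vary widely, and the overall figure is the more stable summary.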