CISC636, S08, Lec4, Liao

CISC 636 Intro to Bioinformatics (Spring 2008)
Whole genome sequencing
Mapping & Assembly

Sequenced genomes:
[A] Bacteria, 1.6 Mb, ~1600 genes (Science 269: 496)
[B] 1997: Eukaryote, 13 Mb, ~6K genes (Nature 387: 1)
[C] 1998: Animal, ~100 Mb, ~20K genes (Science 282: 1945)
[D] 2001: Agrobacteria, 5.67 Mb, ~5419 genes (Science 294: 2317)
[E,F] 2001: Human, ~3 Gb, ~35K genes (Science 291: 1304; Nature 409: 860)
[G] 2005: Chimpanzee
As of May 2005, 225 completed microbial genomes (source: TIGR CMR)

Terms
• BAC
• YAC
• Cosmid
• Mapping
• Tiling path
• Read
• Gap
• Contig
• Shotgun

Why is Assembly Difficult?
The most natural notion of assembly is to order the fragments
so as to form the shortest string containing all of them.
Fragments: ABRAC, ACADA, ADABR, DABRA, RACAD
Layout:
ABRACADABRA
ABRAC
  RACAD
   ACADA
     ADABR
      DABRA
However, the problem of finding the shortest common
superstring of a set of strings is NP-complete.
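
Because the exact problem is intractable in general, a greedy heuristic is often used to illustrate the idea: repeatedly merge the two fragments with the longest suffix-prefix overlap. The following is a minimal sketch of that heuristic, not the method of any particular assembler; the function names `overlap` and `greedy_superstring` are hypothetical, and the sketch assumes error-free fragments on a single strand.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Greedy heuristic: merge the best-overlapping pair until one string is left.

    Returns a common superstring of the fragments; it is not guaranteed
    to be the shortest one.
    """
    # Drop exact duplicates and fragments contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]

    while len(frags) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Merge fragment j onto the end of fragment i, sharing best_k letters.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]

    return frags[0] if frags else ""


if __name__ == "__main__":
    reads = ["ABRAC", "ACADA", "ADABR", "DABRA", "RACAD"]
    print(greedy_superstring(reads))
```

The result contains every fragment as a substring, but it need not match the layout shown above: different merge orders can produce different superstrings, and the heuristic carries no optimality guarantee.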
Assembly is complicated by repeats

Sequence coverage (Lander-Waterman, 1988):
• Length of genome: G
• Length of fragment: L
• # of fragments: N
• Coverage: a = NL/G
Fragments are taken randomly from the original full-length genome.
Q: What is the probability that a base is not covered by any fragment?
Assumption: fragments are taken independently from the genome; in other words, the left-hand end (LHE) of any fragment is uniformly distributed in (0, G).
Then the probability for the LHE of a fragment to fall within an interval (x, x+L) is L/G. Since there are N fragments in total, any point in the genome is covered by NL/G fragments on average.

Poisson distribution:
- The rate for an event A to occur is r. (What is the rate of seeing the left-hand end of a fragment?)
- Probability of seeing an event in the time interval (t, t + dt): P(A|r) = r dt
- Let h(t) = probability of no event in (0, t).
- By independence of different time intervals:
    h(t + dt) = h(t) [1 - r dt]
    dh/dt + r h(t) = 0
    h(t) = exp(-rt)
  This is called the exponential distribution.
- Probability of n events in (0, t): P(n|r) = exp(-rt) (rt)^n / n!
  This is called the Poisson distribution.

What is the mean proportion of the genome covered by one or more fragments?
– Randomly pick a point. The probability that at least one fragment starts within L to its left is 1 - exp(-NL/G). (Left-hand ends occur at rate r = N/G per base and the interval has length t = L, so rt = NL/G.)
– Example: to have the genome 99% covered, the coverage NL/G must be 4.6; it is 99.9% covered if NL/G is 6.9.
– Is it enough to have 99.9% covered? The human genome has 3 x 10^9 bp. A 6.9x coverage will leave ~3,000,000 bp uncovered, which causes physical gaps in sequencing the human genome.
– Then, what is the number of possible gaps?

Assembly programs
• Phrap
• Cap
• TIGR assembler
• Celera assembler
• CAP3
• ARACHNE
• EULER
• AMASS
A genome sequence assembly primer is available at http://www.cbcb.umd.edu/research/assembly_primer.shtml

Sequencing by Hybridization
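
Returning briefly to the Lander-Waterman coverage slides above: the quoted figures (4.6x for 99%, 6.9x for 99.9%, about 3,000,000 uncovered bases in a 3 x 10^9 bp genome) can be checked numerically. This is a small sketch under the same idealized assumptions as the slides (independent, uniformly placed fragments); the function names are hypothetical.

```python
import math


def fraction_covered(coverage):
    """Expected fraction of bases covered by at least one fragment: 1 - exp(-NL/G)."""
    return 1.0 - math.exp(-coverage)


def coverage_needed(target_fraction):
    """Coverage NL/G required so that the expected covered fraction hits the target."""
    return -math.log(1.0 - target_fraction)


def expected_uncovered_bases(genome_length, coverage):
    """Expected number of bases not covered by any fragment: G * exp(-NL/G)."""
    return genome_length * math.exp(-coverage)


if __name__ == "__main__":
    for target in (0.99, 0.999):
        print(f"{target:.1%} covered needs coverage {coverage_needed(target):.2f}x")
    # Human genome, ~3 x 10^9 bp, at 6.9x coverage.
    print(f"{expected_uncovered_bases(3e9, 6.9):.3g} bases expected uncovered")
```

The model ignores repeats, cloning bias, and the minimum overlap needed to detect that two reads join, so real projects typically need more coverage than these numbers suggest.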
Hybridize target to array containing a spot for each possible
k-mer.
[Figure: hybridization array with one spot per 3-mer (spots shown include TGT, TGG, TGA, CTT, CTG, CTA, GAA, GAT, GAC); the target ACTGAC binds at the spots matching its 3-mers.]
The spectrum of a sequence: multi-set of all its
k-long substrings (k-mers).
Goal: reconstruct the sequence from its spectrum.
ACT
CTG
TGA  -->  ACTGAC
GAC
Pevzner 89: reconstruction is polynomial.
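
The slide does not spell out the algorithm. One standard way to obtain a polynomial-time reconstruction is to treat each k-mer as an edge from its (k-1)-letter prefix to its (k-1)-letter suffix and walk an Eulerian path through that graph. The sketch below assumes that formulation and an error-free spectrum; `spectrum` and `reconstruct` are hypothetical names, and the path is found with Hierholzer's algorithm.

```python
from collections import defaultdict


def spectrum(seq, k):
    """Multi-set (as a list) of all k-long substrings of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def reconstruct(kmers):
    """Rebuild a sequence whose spectrum equals kmers via an Eulerian path."""
    if not kmers:
        return ""
    graph = defaultdict(list)          # (k-1)-mer -> list of successor (k-1)-mers
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]     # each k-mer is one edge u -> v
        graph[u].append(v)
        outdeg[u] += 1
        indeg[v] += 1

    # Start where out-degree exceeds in-degree by one; if every node is
    # balanced (an Eulerian cycle), start at the first prefix seen.
    nodes = set(indeg) | set(outdeg)
    start = next((n for n in nodes if outdeg[n] - indeg[n] == 1),
                 kmers[0][:-1])

    # Hierholzer's algorithm, iterative form.
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    if len(path) != len(kmers) + 1:
        raise ValueError("spectrum does not admit a single Eulerian path")
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    kmers = spectrum("ACTGAC", 3)      # ACT, CTG, TGA, GAC
    print(kmers, "->", reconstruct(kmers))
```

The reconstruction is not always unique: several sequences can share one spectrum, in which case the function returns just one of them.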