CMSC423: Bioinformatic Algorithms, Databases and Tools
Lecture 14: Genome assembly

Administrivia
• CMSC423 forum on CS forums: http://forum.cs.umd.edu/
• Project questions?

Homework 3 answer

Shotgun sequencing
[Figure: original DNA → shearing → sequencing → assembly]

Lander-Waterman statistics
L = read length
T = minimum overlap
G = genome size
N = number of reads
c = coverage (NL / G)
$\sigma = 1 - T/L$
$E(\#\text{islands}) = N e^{-c\sigma}$
$E(\text{island size}) = L\left(\frac{e^{c\sigma} - 1}{c} + 1 - \sigma\right)$
– contig = island with 2 or more reads

All pairs alignment
• Needed by the assembler
• Try all pairs: must consider ~ $n^2$ pairs
• Smarter solution: only n × coverage (e.g. 8) pairs are possible
  – Build a table of k-mers contained in sequences (single pass through the genome)
  – Generate the pairs from k-mer table (single pass through k-mer table)
[Figure: k-mer table linking reads A to I]

Overlap-layout-consensus
Main entity: read
Relationship between reads: overlap
[Figure: reads 1 to 9 and their overlaps; aligned reads ACCTGA, ACCTGA, AGCTGA, ACCAGA]

Paths through graphs and assembly
• Hamiltonian circuit: visit each node (city) exactly once, returning to the start
[Figure: graph on nodes A to I; genome order A B C D H I F G E]

Sequencing by hybridization
Probes (all possible k-mers): AAAA AAAC AAAG AAAT AACA AACG AACT AAGA ...
Example sequence: AACAGTAGCTAGATG
Its 4-mers: AACA TAGC AGAT ACAG AGCT GATG CAGT GCTA AGTA CTAG GTAG TAGA

Assembling SBH data
Main entity: oligomer (overlap)
Relationship between oligomers: adjacency
ACCTGATGCCAATTGCACT...
CTGAT follows CCTGA (they share 4 nucleotides: CTGA)
Problem: given all the k-mers, find the original string
In assembly: fake the SBH experiment by breaking the reads into k-mers
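
A minimal sketch of "faking" the SBH experiment, assuming error-free reads given as plain Python strings: break the reads into k-mers, then record which k-mer can follow which (they share k-1 nucleotides). The function names `kmers`, `spectrum`, and `adjacency` and the two example reads are hypothetical, chosen for illustration; they are not from the lecture. Only the example sequence AACAGTAGCTAGATG is taken from the slides.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return every length-k substring of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def spectrum(reads, k):
    """Simulated SBH result: the set of distinct k-mers seen in any read."""
    return {km for read in reads for km in kmers(read, k)}


def adjacency(spec):
    """Map each k-mer to the k-mers that may follow it.

    b follows a when the last k-1 nucleotides of a equal the
    first k-1 nucleotides of b.
    """
    by_prefix = defaultdict(list)
    for km in spec:
        by_prefix[km[:-1]].append(km)
    return {km: sorted(by_prefix.get(km[1:], [])) for km in spec}


# Two overlapping, made-up reads covering the example sequence from the slides.
reads = ["AACAGTAGCT", "TAGCTAGATG"]
spec = spectrum(reads, 4)
follows = adjacency(spec)

print(len(spec))        # number of distinct 4-mers across both reads
print(follows["AACA"])  # successors of AACA
print(follows["CTAG"])  # a k-mer with more than one possible successor
```

A k-mer with several successors, such as CTAG here, marks a repeat: the adjacency alone does not say which branch the original string took, which is what makes the reconstruction problem nontrivial.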