
CISC636 Lec4: Whole Genome Sequencing and Bioinformatics - Prof. Li Liao, Study notes of Computer Science

An overview of whole genome sequencing and mapping techniques, focusing on the Sanger method, sequencing by hybridization, and pyrosequencing. It also discusses physical mapping using sequence-tagged sites (STSs) and the challenges of genome assembly, and includes exercises and terms related to bioinformatics.

Typology: Study notes

Pre 2010

Uploaded on 09/02/2009

koofers-user-cos

CISC636, S08, Lec4, Liao

CISC 636 Intro to Bioinformatics (Spring 2008)
Whole genome sequencing
Mapping & Assembly

[A] Bacteria, 1.6 Mb, ~1600 genes. Science 269: 496
[B] 1997 Eukaryote, 13 Mb, ~6K genes. Nature 387: 1
[C] 1998 Animal, ~100 Mb, ~20K genes. Science 282: 1945
[D] 2001 Agrobacteria, 5.67 Mb, ~5419 genes. Science 294: 2317
[E,F] 2001 Human, ~3 Gb, ~35K genes. Science 291: 1304; Nature 409: 860
[G] 2005 Chimpanzee
As of May 2005, 225 completed microbial genomes (source: TIGR CMR)

Terms
• BAC
• YAC
• Cosmid
• Mapping
• Tiling path
• Read
• Gap
• Contig
• Shotgun

Why is Assembly Difficult?
The most natural notion of assembly is to order the fragments so as to form the shortest string containing all of them.
Fragments: ABRAC, ACADA, ADABR, DABRA, RACAD
Layout:
ABRACADABRA
ABRAC
  RACAD
   ACADA
     ADABR
      DABRA
However, the problem of finding the shortest common superstring of a set of strings is NP-complete.

Assembly is complicated by repeats

Sequence coverage (Lander-Waterman, 1988):
• Length of genome: G
• Length of fragment: L
• # of fragments: N
• Coverage: a = NL/G
Fragments are taken randomly from the original full-length genome.
Q: What is the probability that a base is not covered by any fragment?
Assumption: fragments are taken independently from the genome; in other words, the left-hand end (LHE) of any fragment is "uniformly" distributed in (0, G).
Then the probability for the LHE of a fragment to fall within an interval (x, x+L) is L/G. Since there are N fragments in total, on average any point in the genome is covered by NL/G fragments.

Poisson distribution:
- The rate for an event A to occur is r. What is the rate to see a left-hand end of a fragment?
- The probability of seeing an event in the time interval (t, t + dt) is P(A|r) = r dt.
- h(t) = probability of no event in (0, t).
- By independence of different time intervals, $h(t + dt) = h(t)\,[1 - r\,dt]$, so $\partial h/\partial t + r\,h(t) = 0$ and therefore $h(t) = \exp(-rt)$. This is called the exponential distribution.
- The probability of n events in (0, t) is $P(n|r) = \exp(-rt)\,(rt)^n / n!$. This is called the Poisson distribution.

What is the mean proportion of the genome covered by one or more fragments?
– Pick a point at random. The probability that at least one fragment has its left-hand end within distance L to the left of the point, so that the point is covered, is $1 - \exp(-NL/G)$.
– Example: to have the genome 99% covered, the coverage NL/G must be 4.6; it is 99.9% covered if NL/G is 6.9.
– Is it enough to have 99.9% covered? The human genome has $3 \times 10^9$ bp. A 6.9x coverage will leave ~3,000,000 bp uncovered, which causes physical gaps in sequencing the human genome.
– Then, what is the number of possible gaps?

Assembly programs
• Phrap
• Cap
• TIGR assembler
• Celera assembler
• CAP3
• ARACHNE
• EULER
• AMASS
A genome sequence assembly primer is available at http://www.cbcb.umd.edu/research/assembly_primer.shtml

Sequencing by Hybridization
Hybridize the target to an array containing a spot for each possible k-mer.
[Figure: array of 3-mer probes with the target ACTGAC]

The spectrum of a sequence: the multi-set of all its k-long substrings (k-mers).
Goal: reconstruct the sequence from its spectrum.
{ACT, CTG, TGA, GAC} → ACTGAC
Pevzner 89: reconstruction is polynomial.
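The reconstruction goal above can be illustrated with a short sketch. It is not taken from the lecture: it assumes an error-free, non-empty spectrum and uses the usual graph view, in which each k-mer is an edge from its (k-1)-long prefix to its (k-1)-long suffix and the sequence is spelled by a path that uses every edge exactly once (an Eulerian path). The function name `reconstruct_from_spectrum` is hypothetical. When several Eulerian paths exist, the sketch returns only one of them, so the answer need not be unique.

```python
from collections import defaultdict


def reconstruct_from_spectrum(spectrum):
    """Rebuild one sequence from its k-mer multiset via an Eulerian path.

    Assumption: the spectrum is non-empty and error-free.
    """
    graph = defaultdict(list)   # (k-1)-mer prefix -> list of (k-1)-mer suffixes
    balance = defaultdict(int)  # out-degree minus in-degree of each node
    for kmer in spectrum:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1

    # Start at the node with one extra outgoing edge, if there is one;
    # otherwise the path is a cycle and the first prefix seen is used.
    start = next((n for n, b in balance.items() if b == 1), None)
    if start is None:
        start = next(iter(graph))

    # Hierholzer's algorithm, iterative form.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    # Every k-mer must be used exactly once.
    if len(path) != len(spectrum) + 1:
        raise ValueError("spectrum does not admit an Eulerian path")

    # Spell the sequence: first node, then the last letter of each later node.
    return path[0] + "".join(node[-1] for node in path[1:])


print(reconstruct_from_spectrum(["ACT", "CTG", "TGA", "GAC"]))  # ACTGAC
```

Finding an Eulerian path takes time linear in the number of k-mers, which is one way to see why this formulation is polynomial, in contrast to the shortest-common-superstring formulation of assembly.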