CMSC423: Bioinformatic Algorithms, Databases and Tools
Lecture 9
Sequence alignment: inexact alignment, dynamic programming, gapped alignment, heuristics

Play around with alignments
• USC alignment library (seqaln) http://www.mhoenicka.de/software/cygwinports/seqaln.html

Global alignment recap
[Figure: empty dynamic programming table for the sequences AGCGTAG and GTCAGAC]
Value(A,A) = 10
Value(A,G) = -5
Value(A,-) = -2
Score[i,j] is the maximum of:
1. Score[i-1, j-1] + Value[S1[i-1], S2[j-1]] (S1[i-1], S2[j-1] aligned)
2. Score[i-1, j] + Value[S1[i-1], -] (S1[i-1] aligned to gap)
3. Score[i, j-1] + Value[-, S2[j-1]] (S2[j-1] aligned to gap)
(S1 and S2 are indexed from 0; row 0 and column 0 of Score correspond to the empty prefix.)

Global alignment recap
[Figure: filled dynamic programming table for AGCGTAG and GTCAGAC; the alignment below scores 19]
AG-C-GTAG
-GTCAG-AC
Value(A,A) = 10
Value(A,G) = -5
Value(A,-) = -4
Score[i,j] is computed with the same three-case recurrence as above, now with a gap score of -4.

Running times
• All these algorithms run in O(mn) time, i.e. quadratic time
• Note: this is significantly worse than exact matching, which runs in linear time
• On Wednesday we'll talk about speed-up opportunities
• BTW, how much space is needed?
• If we only need to find the best score (not the alignment itself): O(min(m,n))
• If we need to find the best alignment: an elegant divide-and-conquer algorithm leads to a linear-space solution.

Where do the alignment scores come from?
• PAM matrices
– PAM1: based on frequency of mutations between closely related proteins (within 1 "evolutionary step")
– PAM2: ... within 2 evolutionary steps
– ...
– PAM250: commonly used
• BLOSUM matrices
– Frequency of mutations between proteins that are x% similar
– BLOSUM100: based on proteins that are exactly the same (e.g. score(A,A) is defined but not score(A,G))
– BLOSUM62: commonly used
• Gap scores are usually determined empirically

BLOSUM62
[Figure: the BLOSUM62 substitution matrix]

Heuristics
• What if we limit the number of differences allowed? E.g. we expect the sequences to be very similar.
• Compute a 'banded' alignment: stay within k differences of the diagonal.
• With at most k differences, the optimal alignment cannot stray more than k cells from the diagonal
• What if we do not know k? Do binary search to find it
• O(km) running time and space
[Figure: band extending k cells on each side of the diagonal of the dynamic programming table]

Exclusion methods
• Assume P must match T with at most k errors. Find places in T where P cannot match.
• Split P into chunks of size floor(n/(k+1)), where n is the length of P.
• If P matches T with at most k errors => at least one of the k+1 chunks matches with no errors
• Use any exact matching algorithm to find places where a chunk matches T, then run dynamic programming in that vicinity.
• Running time: O(m) on average
[Figure: an exact chunk match between Pattern and Text anchors a putative alignment]
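
The exclusion idea above can be sketched in Python as follows. This is only an illustrative sketch under several assumptions that are not in the slides: the function names `best_match_cost` and `approximate_matches` are hypothetical, errors are counted as unit-cost edits (substitution, insertion, deletion) rather than with the Value scores from the recap, Python's built-in `str.find` stands in for "any exact matching algorithm", and the result is a list of candidate windows of T rather than a full alignment.

```python
def best_match_cost(pattern, window):
    """Minimum number of edits between pattern and any substring of window.

    Dynamic programming with a free start and free end in the window:
    row 0 costs nothing, column 0 costs i.
    """
    n = len(pattern)
    prev = list(range(n + 1))      # pattern[:i] against an empty text prefix
    best = prev[n]
    for c in window:
        cur = [0] * (n + 1)        # cur[0] = 0: a match may start anywhere
        for i in range(1, n + 1):
            cur[i] = min(
                prev[i - 1] + (pattern[i - 1] != c),  # match or substitution
                prev[i] + 1,                          # extra character in text
                cur[i - 1] + 1,                       # extra character in pattern
            )
        best = min(best, cur[n])   # a match may end anywhere
        prev = cur
    return best


def approximate_matches(pattern, text, k):
    """Return (window_start, window_end, cost) for windows of text that
    contain a match of pattern with at most k edits."""
    n = len(pattern)
    chunk = n // (k + 1)           # floor(n / (k + 1))
    if chunk == 0:
        raise ValueError("pattern is too short to split into k + 1 chunks")

    windows = set()
    for c in range(k + 1):
        offset = c * chunk
        piece = pattern[offset:offset + chunk]
        pos = text.find(piece)
        while pos != -1:
            # If this chunk is matched exactly, the whole pattern lies within
            # k positions of the diagonal through (offset, pos).
            lo = max(0, pos - offset - k)
            hi = min(len(text), pos - offset + n + k)
            windows.add((lo, hi))
            pos = text.find(piece, pos + 1)

    results = []
    for lo, hi in sorted(windows):
        cost = best_match_cost(pattern, text[lo:hi])   # DP only near the hit
        if cost <= k:
            results.append((lo, hi, cost))
    return results


if __name__ == "__main__":
    # GTCAGAC occurs in the text with one substitution (A -> T).
    print(approximate_matches("GTCAGAC", "TTTTGTCTGACTTTT", 1))
```

Regions of T where no chunk occurs exactly are never passed to the dynamic programming step, which is where the savings come from when chunk hits are rare.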