CMSC423 Fall 2008

CMSC423: Bioinformatic Algorithms, Databases and Tools
Lecture 9: inexact alignment, dynamic programming, gapped alignment

Recap

Local alignment recap

[Figure: local alignment example; sequence fragments GATGC, AGCGTAG, GTCAGAC]

Value(A,A) = 10
Value(A,G) = -5
Value(A,-) = -2

Score[i,j] is the maximum of (strings indexed from 1):
0. 0
1. Score[i-1, j-1] + Value[S1[i], S2[j]] (S1[i], S2[j] aligned)
2. Score[i-1, j] + Value[S1[i], -] (S1[i] aligned to gap)
3. Score[i, j-1] + Value[-, S2[j]] (S2[j] aligned to gap)

Alignment scores

Where do the alignment scores come from?
• PAM matrices
  – PAM1 – based on the frequency of mutations between closely related proteins (within 1 "evolutionary step")
  – PAM2 – ... within 2 evolutionary steps
  – ... PAM250 – commonly used
• BLOSUM matrices
  – Frequency of mutations between proteins that are x% similar
  – BLOSUM100 – based on proteins that are exactly the same (e.g. score(A,A) is defined but not score(A,G))
  – BLOSUM62 – commonly used
• Gap scores are usually determined empirically

Heuristics
• What if we limit the number of differences allowed? E.g. we expect the sequences to be very similar.
• Compute a 'banded' alignment: stay within k cells of the diagonal, where k is the number of differences allowed.
• With at most k differences, the optimal alignment cannot stray more than k cells from the diagonal.
• What if we do not know k? Do a binary search to find it.
• O(km) running time and space

[Figure: band extending k cells on each side of the diagonal of the dynamic programming table]

Exclusion methods
• Assume P must match T with at most k errors. Find places in T where P cannot match.
• Split P into chunks of size floor(n/(k+1)), where n is the length of P.
• If P matches T with at most k errors, then at least one of the k+1 chunks matches with no errors (k errors can touch at most k chunks).
• Use any exact matching algorithm to find places where a chunk matches T, then run dynamic programming in that vicinity.
• Average running time: O(m)

Exclusion methods

[Figure: an exact match between a chunk of the pattern and the text anchors a putative alignment; labels: Text, Pattern, Exact match, Putative alignment]

Chaining approach
• Extends the FASTA idea
• Search for exact matches
• Find the longest consistent chain of exact matches
• Fill in the gaps in the chain using Smith-Waterman
• This is the approach used by MUMmer (Delcher et al.)
• MUM – maximal unique match (see mummer.sourceforge.net)

Chaining in 1-D
• Input: multiple overlapping intervals on a line
• Output: highest-weight set of non-overlapping intervals
• Weight could be the length of the interval, its Smith-Waterman score, etc.
• Sort the endpoints (starts, ends) of the intervals
• For every interval j, store V[j] – the best score of a chain ending in j
• MAX – the highest V[j] seen so far (over intervals whose right end has been processed)
• Process endpoints in increasing order of x coordinate
• If we encounter the left end (start) of interval j
  – V[j] = weight(j) + MAX
• If we encounter the right end (end) of interval j
  – MAX = max{V[j], MAX}
• Running time?
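
The 1-D chaining sweep above can be sketched in Python as follows. This is a minimal illustration, not the lecture's or MUMmer's code: the function name chain_1d and the (start, end, weight) tuple layout are hypothetical. It also assumes that two intervals sharing an endpoint count as overlapping, which is why start events are sorted before end events at the same coordinate.

```python
def chain_1d(intervals):
    """Return the best total weight of a set of non-overlapping intervals.

    intervals: list of (start, end, weight) tuples with start <= end.
    Assumption: intervals that merely touch (one ends where another
    starts) are treated as overlapping.
    """
    START, END = 0, 1
    events = []
    for j, (start, end, _weight) in enumerate(intervals):
        events.append((start, START, j))
        events.append((end, END, j))
    # Sort by x coordinate; at equal x, START (0) comes before END (1).
    events.sort()

    V = [0] * len(intervals)  # V[j]: best score of a chain ending in interval j
    best = 0                  # MAX: best V[j] over intervals already closed
    for _x, kind, j in events:
        if kind == START:
            # Interval j can extend any chain that ended strictly before it.
            V[j] = intervals[j][2] + best
        else:
            # Interval j is now closed, so later intervals may follow it.
            best = max(best, V[j])
    return best


if __name__ == "__main__":
    # Four overlapping intervals of weight 3; the best chain uses two of them.
    print(chain_1d([(0, 3, 3), (2, 5, 3), (4, 7, 3), (6, 9, 3)]))
```

In this sketch the sweep itself is linear in the number of endpoints, so sorting the 2n endpoints dominates and the total cost is O(n log n) for n intervals.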