CMSC423: Bioinformatic Algorithms, Databases and Tools
Lecture 10
Sequence alignment: inexact alignment, multiple sequence alignment

Inexact alignment recap
• Affine gaps – need 4 matrices: global score, score of alignments ending in a match, score of alignments ending in a gap in seq1, score of alignments ending in a gap in seq2.
• In the "real" world, inexact alignment is performed only where necessary – heuristics pre-compute where an alignment is possible.
• Also, inexact alignment is easier if we bound the allowed error – we only need to explore the neighborhood of the main diagonal in the DP matrix.

Chaining approach
• Extends the FASTA idea
• Search for exact matches
• Find the longest consistent chain of exact matches
• Fill in the gaps in the chain using Smith-Waterman
• This is the approach used by MUMmer (Delcher et al.)

Chaining in 1-D
• Input: multiple overlapping intervals on a line
• Output: highest-weight set of non-overlapping intervals
• Weight could be length of interval, or Smith-Waterman score, etc.
• Sort the endpoints (starts, ends) of the intervals
• For every interval j, store V[j] – best score of a chain ending in j
• MAX – store highest V[j] seen so far
• Process endpoints in increasing order of x coordinate
• If we encounter left end (start) of interval j – V[j] = weight(j) + MAX
• If we encounter right end (end) of interval j – MAX = max{V[j], MAX}
• Running time?

But... it's expensive
• 3 sequences – need to fill in the cube: $O(n^3)$
• k sequences – k-dimensional cube: $O(n^k)$ time/space
• There are tricks that can help – similar to AI techniques for reducing the search space
• Basic idea – if we can estimate the optimal score, we can prune the search space.
• Note – these are just heuristics – not guaranteed to work faster

Alternative – approximation algorithm
• Can we efficiently compute a multiple alignment with a score that's not too bad?
• The Star method:
  – build all $k^2$ pairwise alignments ($O(k^2 n^2)$)
  – pick the sequence sc that is closest to all other sequences: $\sum_{s_i} D(s_c, s_i)$ is minimal over all choices of sc
  – iteratively align each sequence to sc
• Theorem: the sum-of-pairs score of the star alignment is at most twice as big as the optimal multiple alignment score

Iterative alignment
• Take sequences si in order:
  – align s1 with sc - results in gaps being inserted in both sequences
  – align s2 with sc - if gaps must be inserted, insert them in the previously aligned sequences as well
  – and so on (note: if gaps coincide with previously introduced gaps, there is no need to change previously aligned sequences)

SC YFPHFDLSHGSAQVKAHGKKVGDALTLAVGHLDDLPGAL

SC YFPHFDLSHGSAQVKAHGKKVGDALTLAVGHLDDLPGAL
S1 YFPHFDLSHG-AQVKG--KKVADALTNAVAHVDDMPNAL

SC YFPHF-DLS-----HGSAQVKAHGKKVG-----DALTLAVAHLDDLPGAL
S1 YFPHF-DLS-----HG-AQVKG-GKKVA-----DALTNAVAHVDDMPNAL
S2 FFPKFKGLTTADQLKKSADVRWHAERII-----NAVNDAVASMDDTEKMS

SC YFPHF-DLS-----HGSAQVKAHGKKVG-----DALTLAVAHLDDLPGAL
S1 YFPHF-DLS-----HG-AQVKG-GKKVA-----DALTNAVAHVDDMPNAL
S2 FFPKFKGLTTADQLKKSADVRWHAERII-----NAVNDAVASMDDTEKMS
S3 LFSFLKGTSEVP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATL

Theorem proof
• Theorem: star alignment is 2-optimal
• Assumption: distances obey the triangle inequality
• Notation: D is the optimal pairwise alignment distance, d* the pairwise distance induced by the optimal multiple alignment, d the pairwise distance induced by the star alignment; sums run over all $k^2$ ordered pairs
• The star alignment keeps the optimal pairwise alignment of each si with sc, so $d(s_i, s_c) = D(s_i, s_c)$

$$\mathrm{OPT} = \sum_{s_i,s_j} d^*(s_i,s_j) \ge \sum_{s_i,s_j} D(s_i,s_j) \ge k \sum_{s_i} D(s_i,s_c)$$

$$\mathrm{STAR} = \sum_{s_i,s_j} d(s_i,s_j) \le \sum_{s_i,s_j} \bigl[ D(s_i,s_c) + D(s_j,s_c) \bigr] = 2k \sum_{s_i} D(s_i,s_c)$$

$$\Rightarrow \mathrm{STAR}/\mathrm{OPT} \le 2 \quad \text{Q.E.D.}$$

[Figure: sc, si, sj]

Consensus sequence
• For every column j in the alignment, pick the amino acid AA that minimizes $\sum_i d(AA, S_i[j])$ (usually becomes majority rule)
• Intuitively – this is the sequence of the ancestor of all the sequences in the multiple alignment
• We can define the multiple alignment problem as:
  – find the multiple alignment that minimizes $\sum_i D(CO, S_i)$
• Related to the "Steiner" string problem:
  – find a string S* and a multiple alignment such that $\sum_i D(S^*, S_i)$ is minimal
• Both formulations are NP-hard

CO YFPHFKDLS-----HGSAQVKAHGKKVG-----DALTLAVAHVDDTPGAL
S1 YFPHF-DLS-----HGSAQVKAHGKKVG-----DALTLAVAHLDDLPGAL
S2 YFPHF-DLS-----HG-AQVKG-GKKVA-----DALTNAVAHVDDMPNAL
S3 FFPKFKGLTTADQLKKSADVRWHAERII-----NAVNDAVASMDDTEKMS
S4 LFSFLKGTSEVP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATL

Iterative alignment revisited
• Pick a sequence (e.g. SC) as a starting point
• Align S1 to it and build a consensus for the alignment
• Take S2 and align it to the consensus (instead of SC)
• Repeat...
• Problem: the consensus (or any single sequence) ignores the other sequences being aligned.
• Solution: keep track of the % of each amino acid aligned in each column (a profile)
• Score of alignment to profile – combination of scores to each AA.

S1 YFPHF-DLS-----HGSAQVKAHGKKVG-----DALTLAVAHLDDLPGAL
S2 YFPHF-DLS-----HG-AQVKG-GKKVA-----DALTNAVAHVDDMPNAL
S3 FFPKFKGLTTADQLKKSADVRWHAERII-----NAVNDAVASMDDTEKMS
S4 LFSFLKGTSEVP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATL

Example profile columns:
50% S, 25% N, 25% -
100% F
75% A, 25% Q
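
The profile idea can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation: the function names (`build_profile`, `consensus`, `score_to_profile`), the scoring function `toy_score`, and the three-column toy alignment are all assumptions made for the example. The toy columns are chosen to reproduce the percentages listed in the example profile columns (100% F; 50% S, 25% -, 25% N; 75% A, 25% Q). The consensus here uses plain majority rule, which the slides note is what minimizing the column distance usually becomes.

```python
from collections import Counter


def build_profile(alignment):
    """Return one {symbol: fraction} dict per column of a gapped alignment."""
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    n = len(alignment)
    profile = []
    for j in range(width):
        counts = Counter(row[j] for row in alignment)
        profile.append({sym: c / n for sym, c in counts.items()})
    return profile


def consensus(profile):
    """Majority-rule consensus: the most frequent symbol in each column."""
    return "".join(max(col, key=col.get) for col in profile)


def score_to_profile(symbol, column, pair_score):
    """Score of aligning `symbol` to a profile column: the
    frequency-weighted combination of its scores against each symbol."""
    return sum(freq * pair_score(symbol, other) for other, freq in column.items())


def toy_score(a, b):
    # Hypothetical scoring scheme: +1 match, -1 mismatch, -2 against a gap.
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1


# Toy alignment (hypothetical): rows are sequences, columns are positions.
rows = ["FSA", "F-A", "FSA", "FNQ"]
profile = build_profile(rows)
print(profile[1])                                      # {'S': 0.5, '-': 0.25, 'N': 0.25}
print(consensus(profile))                              # FSA
print(score_to_profile("S", profile[1], toy_score))    # -0.25
```

Scoring a residue against the whole column, rather than against the single consensus letter, is what lets the other aligned sequences influence the result: in the middle column an S still pays for the gap and the N that the consensus letter alone would hide.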