Download Dynamic Programming for Biomolecular Sequence Analysis: Alignment and Local Alignments - P and more Study notes Computer Science in PDF only on Docsity! 1 Dynamic Programming Method for Analyzing Biomolecular Sequences Tao Jiang Department of Computer Science University of California - Riverside (Typeset by Kun-Mao Chao) E-mail: jiang@cs.ucr.edu http://www.cs.ucr.edu/~jiang 2 Outline • The paradigm of dynamic programming • Sequence alignment – a general framework for comparing sequences in bioinformatics • Dynamic programming algorithms for sequence alignment • Techniques for improving the efficiency of the algorithms • Multiple sequence alignment 3 Dynamic Programming • Dynamic programming is an algorithmic method for solving optimization problems with a compositional/recursive cost structure. • Richard Bellman was one of the principal founders of this approach. 4 Two key ingredients • Two key ingredients for an optimization problem to be suitable for a dynamic programming solution: Each substructure is optimal. (principle of optimality) 1. optimal substructures 2. overlapping subinstances Subinstances are dependent. (Otherwise, a divide-and-conquer approach is the choice.) 2 5 Three basic components • The development of a dynamic programming algorithm has three basic components: – A recurrence relation (for defining the value/cost of an optimal solution); – A tabular computation (for computing the value of an optimal solution); – A backtracing procedure (for delivering an optimal solution). 6 Fibonacci numbers .for 21 11 00 −+−= = = i>1iFiFiF F F The Fibonacci numbers are defined by the following recurrence: 7 How to compute F10? F10 F9 F8 F8 F7 F7 F6 …… 8 Tabular computation • Tabular computation can avoid redundant computation steps. 553421138532110 F10F9F8F7F6F5F4F3F2F1F0 5 17 Two fundamental problems we solved (joint work with Lin and Chao) • Given a sequence of real numbers of length n and an upper bound U, find a consecutive subsequence of length at most U with the maximum sum --- an O(n)-time algorithm. U = 3 9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9 18 Two fundamental problems we solved (joint work with Lin and Chao) • Given a sequence of real numbers of length n and a lower bound L, find a consecutive subsequence of length at least L with the maximum average --- an O(n log L)-time algorithm. This has been improved to O(n) by others later. L = 4 3 2 14 6 6 2 10 2 6 6 14 2 1 19 Another example Given a sequence as follows: 2, 6.6, 6.6, 3, 7, 6, 7, 2 and L = 2, the highest-average interval is the squared area, which has the average value 20/3. 2, 6.6, 6.6, 3, 7, 6, 7, 2 20 GC-rich regions • Our method can be used to locate a region of length at least L with the highest C+G ratio in O(n log L) time. ATGACTCGAGCTCGTCA 00101011011011010 Search for an interval of length at least L with the highest average. 6 21 Length-unconstrained version • Maximum average interval 3 2 14 6 6 2 10 2 6 6 14 2 1 The maximum element is the answer. It can be done in O(n) time. 22 A naive algorithm • A simple shift algorithm can compute the highest-average interval of a fixed length in O(n) time • Try L, L+1, L+2, ..., n. In total, O(n2). 23 A pigeonhole principle • Notice that the length of an optimal interval is bounded by 2L, we immediately have an O(nL)-time algorithm. We can bisect a region of length >= 2L into two segments, where each of them is of length >= L. 24 Future Development • Best k (nonintersecting) subsequences? • Max-average with both upper and lower length bounds • General (gapped) local alignment with length upper bound. • Measurement of goodness? 7 25 Longest increasing subsequence (LIS) • The longest increasing subsequence is to find a longest increasing subsequence of a given sequence of distinct integers a1a2…an . e.g. 9 2 5 3 7 11 8 10 13 6 2 3 7 5 7 10 13 9 7 11 3 5 11 13 are increasing subsequences. are not increasing subsequences. We want to find a longest one. 26 A naive approach for LIS • Let L[i] be the length of a longest increasing subsequence ending at position i. L[i] = 1 + max j = 0..i-1{L[j] | aj < ai} (use a dummy a0 = minimum, and L[0]=0) 9 2 5 3 7 11 8 10 13 6 L[i] 1 1 2 2 3 4 ? 27 A naive approach for LIS 9 2 5 3 7 11 8 10 13 6 L[i] 1 1 2 2 3 4 4 5 6 3 L[i] = 1 + max j = 0..i-1 {L[j] | aj < ai} The maximum length The subsequence 2, 3, 7, 8, 10, 13 is a longest increasing subsequence. This method runs in O(n2) time. 28 Binary Search • Given an ordered sequence x1x2 ... xn, where x1<x2< ... <xn, and a number y, a binary search finds the largest xi such that xi< y in O(log n) time. n ... n/2 n/4 10 37 i j 0 1 p 2 r 3 o 4 v 5 i 6 d 7 e 8 n 9 c 10 e 0 0 0 0 0 0 0 0 0 0 0 0 1 p 0 1 1 1 1 1 1 1 1 1 1 2 r 0 1 2 2 2 2 2 2 2 2 2 3 e 0 1 2 2 2 2 2 3 3 3 3 4 s 0 1 2 2 2 2 2 3 3 3 3 5 i 0 1 2 2 2 3 3 3 3 3 3 6 d 0 1 2 2 2 3 4 4 4 4 4 7 e 0 1 2 2 2 3 4 5 5 5 5 8 n 0 1 2 2 2 3 4 5 6 6 6 9 t 0 1 2 2 2 3 4 5 6 6 6 Running time and memory: O(mn) and O(mn). 38 procedure Output-LCS(A, prev, i, j) 1 if i = 0 or j = 0 then return 2 if prev(i, j)=” “ then ⎢ ⎣ ⎡ −−− ia jiprevALCSOutput print )1,1,,( 3 else if prev(i, j)=” “ then Output-LCS(A, prev, i-1, j) 4 else Output-LCS(A, prev, i, j-1) The backtracing algorithm 39 i j 0 1 p 2 r 3 o 4 v 5 i 6 d 7 e 8 n 9 c 10 e 0 0 0 0 0 0 0 0 0 0 0 0 1 p 0 1 1 1 1 1 1 1 1 1 1 2 r 0 1 2 2 2 2 2 2 2 2 2 3 e 0 1 2 2 2 2 2 3 3 3 3 4 s 0 1 2 2 2 2 2 3 3 3 3 5 i 0 1 2 2 2 3 3 3 3 3 3 6 d 0 1 2 2 2 3 4 4 4 4 4 7 e 0 1 2 2 2 3 4 5 5 5 5 8 n 0 1 2 2 2 3 4 5 6 6 6 9 t 0 1 2 2 2 3 4 5 6 6 6 Output: priden 40 Dot Matrix Sequence A:CTTAACT Sequence B:CGGATCAT C G G A T C A T C T T A A C T 11 41 C---TTAACT CGGATCA--T Pairwise Alignment Sequence A: CTTAACT Sequence B: CGGATCAT An alignment of A and B: Sequence A Sequence B 42 C---TTAACT CGGATCA--T Pairwise Alignment Sequence A: CTTAACT Sequence B: CGGATCAT An alignment of A and B: Insertion gap Match Mismatch Deletion gap 43 Alignment (or Edit) Graph Sequence A: CTTAACT Sequence B: CGGATCAT C G G A T C A T C T T A A C T C---TTAACT CGGATCA--T 44 A simple scoring scheme • Match: +8 (w(x, y) = 8, if x = y) • Mismatch: -5 (w(x, y) = -5, if x ≠ y) • Each gap symbol: -3 (w(-,x)=w(x,-)=-3) C - - - T T A A C T C G G A T C A - - T +8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12 alignment score (i.e. space) 12 45 Scoring Matrices • Amino acid substitution matrices – PAM – BLOSUM • DNA substitution matrices – DNA is less conserved than protein sequences – Less effective to compare coding regions at nucleotide level 46 PAM • Point Accepted Mutation (Dayhoff, et al.) • 1 PAM = PAM1 = 1% average change of all amino acid positions – After 100 PAMs of evolution, not every residue will have changed • some residues may have mutated several times • some residues may have returned to their original state • some residues may not changed at all 47 PAMX • PAMx = PAM1x E.g. PAM250 = PAM1250 • PAM250 is a widely used scoring matrix. Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys ... A R N D C Q E G H I L K ... Ala A 13 6 9 9 5 8 9 12 6 8 6 7 ... Arg R 3 17 4 3 2 5 3 2 6 3 2 9 Asn N 4 4 6 7 2 5 6 4 6 3 2 5 Asp D 5 4 8 11 1 7 10 5 6 3 2 5 Cys C 2 1 1 1 52 1 1 2 2 2 1 1 Gln Q 3 5 5 6 1 10 7 3 7 2 3 5 ... Trp W 0 2 0 0 0 0 0 0 1 0 1 0 Tyr Y 1 1 2 1 3 1 1 1 3 2 2 1 Val V 7 4 4 4 4 4 4 4 5 4 15 10 48 BLOSUM • Blocks Substitution Matrix • Scores derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins • Matrix name indicates evolutionary distance – BLOSUM62 was created using sequences sharing no more than 62% identity 15 57 An optimal local alignment • Si,j: the score of an optimal local alignment ending at ai and bj • With proper initializations, Si,j can be computed as follows. ⎪ ⎪ ⎪ ⎩ ⎪⎪ ⎪ ⎨ ⎧ + −+ −+ = −− − − ),( ),( ),( 0 max 1,1 1, ,1 , jiji jji iji ji baws bws aws s 58 local alignment ? 2 3 5 0 3580000 0 0 0 115800020 135800350 28002580 00000000 C G G A T C A T C T T A A C T Match: 8 Mismatch: -5 Gap symbol: -3 59 local alignment 8 10 11 13 2 3 5 0 103580000 18101320350 713352580 82580000 115800020 135800350 28002580 00000000 C G G A T C A T C T T A A C T Match: 8 Mismatch: -5 Gap symbol: -3 the best score 60 8 10 11 13 2 3 5 0 103580000 18101320350 713352580 82580000 115800020 135800350 28002580 00000000 C G G A T C A T C T T A A C T the best score A – C - T A T C A T 8-3+8-3+8 = 18 (not always at the corner) 16 61 Affine gap penalties • Match: +8 (w(x, y) = 8, if x = y) • Mismatch: -5 (w(x, y) = -5, if x ≠ y) • Each gap symbol: -3 (w(-,x)=w(x,-)=-3) • E.g. each gap is charged an extra gap-open penalty: -4. • In general, a gap of length k should have penalty g(k) C - - - T T A A C T C G G A T C A - - T +8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12 -4 -4 alignment score: 12 – 4 – 4 = 4 62 Affine gap penalties • A gap of length k is penalized x + k·y. gap-open penalty gap-symbol penalty Three cases for alignment endings: 1. ...x ...x 2. ...x ...- 3. ...- ...x an aligned pair a deletion an insertion 63 Affine gap penalties • Let D(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with a deletion. • Let I(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with an insertion. • Let S(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj. 64 Affine gap penalties ⎪⎩ ⎪ ⎨ ⎧ +−− = ⎩ ⎨ ⎧ −−− −− = ⎩ ⎨ ⎧ −−− −− = ),( ),( ),()1,1( max),( )1,( )1,( max),( ),1( ),1( max),( jiI jiD bawjiS jiS yxjiS yjiI jiI yxjiS yjiD jiD ji 17 65 Affine gap penalties (Gotoh’s algorithm) SI D SI D SI D SI D -y -x-y -x-y -y w(ai,bj) 66 k best local alignments • Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987) • FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985) • BLAST (Altschul et al., 1990; Altschul et al., 1997) BLAST and FASTA are key genomic database search tools. 67 k best local alignments • Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987) – linear-space version:sim (Huang and Miller, 1991) – linear-space variants:sim2 (Chao et al., 1995); sim3 (Chao et al., 1997) • FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985) – linear-space band alignment (Chao et al., 1992) • BLAST (Altschul et al., 1990; Altschul et al., 1997) – restricted affine gap penalties (Chao, 1999) 68 FASTA 1) Find runs of identities, and identify regions with the highest density of identities. 2) Re-score using PAM matrix, and keep top scoring segments. 3) Eliminate segments that are unlikely to be part of the alignment. 4) Optimize the alignment in a band. Its running time is O(n). 20 77 Remarks • Filtering is based on the observation that a good alignment usually includes short identical or very similar fragments. • The idea of filtration was used in both FASTA and BLAST to achieve high speed 78 Linear space ideas Hirschberg, 1975; Myers and Miller, 1988 m/2 (i) scores can be computed in O(n) space (ii) divide-and-conquer S(a1…am/2,b1…bj) + S(am…am/2+1,bn…bj+1) maximized j 79 Two subproblems ½ original problem size m/2 m/4 3m/4 80 Four subproblems ¼ original problem size m/2 m/4 3m/4 21 81 Time and Space Complexity • Space: O(m+n) • Time: O(mn)*(1+ ½ + ¼ + …) = O(mn) 2 82 Band Alignment (K. Chao, W. Pearson, and W. Miller) Sequence B Sequence A 83 Band Alignment in Linear Space The remaining subproblems are no longer only half of the original problem. In worst case, this could cause an additional log n factor in time. 84 Band Alignment in Linear Space 22 85 Multiple sequence alignment (MSA) • The multiple sequence alignment problem is to simultaneously align more than two sequences. Seq1: GCTC Seq2: AC Seq3: GATC GC-TC A---C G-ATC 86 How to score an MSA? • Sum-of-Pairs (SP-score) GC-TC A---C G-ATC GC-TC A---C GC-TC G-ATC A---C G-ATC Score = Score Score Score + + 87 MSA for three sequences • an O(n3) algorithm 88 General MSA • For k sequences of length n: O(nk) • NP-Complete (Wang and Jiang) • The exact multiple alignment algorithms for many sequences are not feasible. • Some approximation algorithms are given. (e.g., 2- l/k for any fixed l by Bafna et al.)