Docsity
Docsity

Prepare for your exams
Prepare for your exams

Study with the several resources on Docsity


Earn points to download
Earn points to download

Earn points by helping other students or get them with a premium plan


Guidelines and tips
Guidelines and tips

Dynamic Programming for Biomolecular Sequence Analysis: Alignment and Local Alignments - P, Study notes of Computer Science

The use of dynamic programming in analyzing biomolecular sequences, specifically in the context of sequence alignment. The basics of dynamic programming, its application to sequence alignment, techniques for improving algorithm efficiency, and multiple sequence alignment. It also touches on future developments in the field.

Typology: Study notes

2009/2010

Uploaded on 03/28/2010

koofers-user-ite
koofers-user-ite 🇺🇸

1

(1)

10 documents

1 / 23

Toggle sidebar

Related documents


Partial preview of the text

Download Dynamic Programming for Biomolecular Sequence Analysis: Alignment and Local Alignments - P and more Study notes Computer Science in PDF only on Docsity! 1 Dynamic Programming Method for Analyzing Biomolecular Sequences Tao Jiang Department of Computer Science University of California - Riverside (Typeset by Kun-Mao Chao) E-mail: jiang@cs.ucr.edu http://www.cs.ucr.edu/~jiang 2 Outline • The paradigm of dynamic programming • Sequence alignment – a general framework for comparing sequences in bioinformatics • Dynamic programming algorithms for sequence alignment • Techniques for improving the efficiency of the algorithms • Multiple sequence alignment 3 Dynamic Programming • Dynamic programming is an algorithmic method for solving optimization problems with a compositional/recursive cost structure. • Richard Bellman was one of the principal founders of this approach. 4 Two key ingredients • Two key ingredients for an optimization problem to be suitable for a dynamic programming solution: Each substructure is optimal. (principle of optimality) 1. optimal substructures 2. overlapping subinstances Subinstances are dependent. (Otherwise, a divide-and-conquer approach is the choice.) 2 5 Three basic components • The development of a dynamic programming algorithm has three basic components: – A recurrence relation (for defining the value/cost of an optimal solution); – A tabular computation (for computing the value of an optimal solution); – A backtracing procedure (for delivering an optimal solution). 6 Fibonacci numbers .for 21 11 00 −+−= = = i>1iFiFiF F F The Fibonacci numbers are defined by the following recurrence: 7 How to compute F10? F10 F9 F8 F8 F7 F7 F6 …… 8 Tabular computation • Tabular computation can avoid redundant computation steps. 553421138532110 F10F9F8F7F6F5F4F3F2F1F0 5 17 Two fundamental problems we solved (joint work with Lin and Chao) • Given a sequence of real numbers of length n and an upper bound U, find a consecutive subsequence of length at most U with the maximum sum --- an O(n)-time algorithm. U = 3 9 –3 1 7 –15 2 3 –4 2 –7 6 –2 8 4 -9 18 Two fundamental problems we solved (joint work with Lin and Chao) • Given a sequence of real numbers of length n and a lower bound L, find a consecutive subsequence of length at least L with the maximum average --- an O(n log L)-time algorithm. This has been improved to O(n) by others later. L = 4 3 2 14 6 6 2 10 2 6 6 14 2 1 19 Another example Given a sequence as follows: 2, 6.6, 6.6, 3, 7, 6, 7, 2 and L = 2, the highest-average interval is the squared area, which has the average value 20/3. 2, 6.6, 6.6, 3, 7, 6, 7, 2 20 GC-rich regions • Our method can be used to locate a region of length at least L with the highest C+G ratio in O(n log L) time. ATGACTCGAGCTCGTCA 00101011011011010 Search for an interval of length at least L with the highest average. 6 21 Length-unconstrained version • Maximum average interval 3 2 14 6 6 2 10 2 6 6 14 2 1 The maximum element is the answer. It can be done in O(n) time. 22 A naive algorithm • A simple shift algorithm can compute the highest-average interval of a fixed length in O(n) time • Try L, L+1, L+2, ..., n. In total, O(n2). 23 A pigeonhole principle • Notice that the length of an optimal interval is bounded by 2L, we immediately have an O(nL)-time algorithm. We can bisect a region of length >= 2L into two segments, where each of them is of length >= L. 24 Future Development • Best k (nonintersecting) subsequences? • Max-average with both upper and lower length bounds • General (gapped) local alignment with length upper bound. • Measurement of goodness? 7 25 Longest increasing subsequence (LIS) • The longest increasing subsequence is to find a longest increasing subsequence of a given sequence of distinct integers a1a2…an . e.g. 9 2 5 3 7 11 8 10 13 6 2 3 7 5 7 10 13 9 7 11 3 5 11 13 are increasing subsequences. are not increasing subsequences. We want to find a longest one. 26 A naive approach for LIS • Let L[i] be the length of a longest increasing subsequence ending at position i. L[i] = 1 + max j = 0..i-1{L[j] | aj < ai} (use a dummy a0 = minimum, and L[0]=0) 9 2 5 3 7 11 8 10 13 6 L[i] 1 1 2 2 3 4 ? 27 A naive approach for LIS 9 2 5 3 7 11 8 10 13 6 L[i] 1 1 2 2 3 4 4 5 6 3 L[i] = 1 + max j = 0..i-1 {L[j] | aj < ai} The maximum length The subsequence 2, 3, 7, 8, 10, 13 is a longest increasing subsequence. This method runs in O(n2) time. 28 Binary Search • Given an ordered sequence x1x2 ... xn, where x1<x2< ... <xn, and a number y, a binary search finds the largest xi such that xi< y in O(log n) time. n ... n/2 n/4 10 37 i j 0 1 p 2 r 3 o 4 v 5 i 6 d 7 e 8 n 9 c 10 e 0 0 0 0 0 0 0 0 0 0 0 0 1 p 0 1 1 1 1 1 1 1 1 1 1 2 r 0 1 2 2 2 2 2 2 2 2 2 3 e 0 1 2 2 2 2 2 3 3 3 3 4 s 0 1 2 2 2 2 2 3 3 3 3 5 i 0 1 2 2 2 3 3 3 3 3 3 6 d 0 1 2 2 2 3 4 4 4 4 4 7 e 0 1 2 2 2 3 4 5 5 5 5 8 n 0 1 2 2 2 3 4 5 6 6 6 9 t 0 1 2 2 2 3 4 5 6 6 6 Running time and memory: O(mn) and O(mn). 38 procedure Output-LCS(A, prev, i, j) 1 if i = 0 or j = 0 then return 2 if prev(i, j)=” “ then ⎢ ⎣ ⎡ −−− ia jiprevALCSOutput print )1,1,,( 3 else if prev(i, j)=” “ then Output-LCS(A, prev, i-1, j) 4 else Output-LCS(A, prev, i, j-1) The backtracing algorithm 39 i j 0 1 p 2 r 3 o 4 v 5 i 6 d 7 e 8 n 9 c 10 e 0 0 0 0 0 0 0 0 0 0 0 0 1 p 0 1 1 1 1 1 1 1 1 1 1 2 r 0 1 2 2 2 2 2 2 2 2 2 3 e 0 1 2 2 2 2 2 3 3 3 3 4 s 0 1 2 2 2 2 2 3 3 3 3 5 i 0 1 2 2 2 3 3 3 3 3 3 6 d 0 1 2 2 2 3 4 4 4 4 4 7 e 0 1 2 2 2 3 4 5 5 5 5 8 n 0 1 2 2 2 3 4 5 6 6 6 9 t 0 1 2 2 2 3 4 5 6 6 6 Output: priden 40 Dot Matrix Sequence A:CTTAACT Sequence B:CGGATCAT C G G A T C A T C T T A A C T 11 41 C---TTAACT CGGATCA--T Pairwise Alignment Sequence A: CTTAACT Sequence B: CGGATCAT An alignment of A and B: Sequence A Sequence B 42 C---TTAACT CGGATCA--T Pairwise Alignment Sequence A: CTTAACT Sequence B: CGGATCAT An alignment of A and B: Insertion gap Match Mismatch Deletion gap 43 Alignment (or Edit) Graph Sequence A: CTTAACT Sequence B: CGGATCAT C G G A T C A T C T T A A C T C---TTAACT CGGATCA--T 44 A simple scoring scheme • Match: +8 (w(x, y) = 8, if x = y) • Mismatch: -5 (w(x, y) = -5, if x ≠ y) • Each gap symbol: -3 (w(-,x)=w(x,-)=-3) C - - - T T A A C T C G G A T C A - - T +8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12 alignment score (i.e. space) 12 45 Scoring Matrices • Amino acid substitution matrices – PAM – BLOSUM • DNA substitution matrices – DNA is less conserved than protein sequences – Less effective to compare coding regions at nucleotide level 46 PAM • Point Accepted Mutation (Dayhoff, et al.) • 1 PAM = PAM1 = 1% average change of all amino acid positions – After 100 PAMs of evolution, not every residue will have changed • some residues may have mutated several times • some residues may have returned to their original state • some residues may not changed at all 47 PAMX • PAMx = PAM1x E.g. PAM250 = PAM1250 • PAM250 is a widely used scoring matrix. Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys ... A R N D C Q E G H I L K ... Ala A 13 6 9 9 5 8 9 12 6 8 6 7 ... Arg R 3 17 4 3 2 5 3 2 6 3 2 9 Asn N 4 4 6 7 2 5 6 4 6 3 2 5 Asp D 5 4 8 11 1 7 10 5 6 3 2 5 Cys C 2 1 1 1 52 1 1 2 2 2 1 1 Gln Q 3 5 5 6 1 10 7 3 7 2 3 5 ... Trp W 0 2 0 0 0 0 0 0 1 0 1 0 Tyr Y 1 1 2 1 3 1 1 1 3 2 2 1 Val V 7 4 4 4 4 4 4 4 5 4 15 10 48 BLOSUM • Blocks Substitution Matrix • Scores derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins • Matrix name indicates evolutionary distance – BLOSUM62 was created using sequences sharing no more than 62% identity 15 57 An optimal local alignment • Si,j: the score of an optimal local alignment ending at ai and bj • With proper initializations, Si,j can be computed as follows. ⎪ ⎪ ⎪ ⎩ ⎪⎪ ⎪ ⎨ ⎧ + −+ −+ = −− − − ),( ),( ),( 0 max 1,1 1, ,1 , jiji jji iji ji baws bws aws s 58 local alignment ? 2 3 5 0 3580000 0 0 0 115800020 135800350 28002580 00000000 C G G A T C A T C T T A A C T Match: 8 Mismatch: -5 Gap symbol: -3 59 local alignment 8 10 11 13 2 3 5 0 103580000 18101320350 713352580 82580000 115800020 135800350 28002580 00000000 C G G A T C A T C T T A A C T Match: 8 Mismatch: -5 Gap symbol: -3 the best score 60 8 10 11 13 2 3 5 0 103580000 18101320350 713352580 82580000 115800020 135800350 28002580 00000000 C G G A T C A T C T T A A C T the best score A – C - T A T C A T 8-3+8-3+8 = 18 (not always at the corner) 16 61 Affine gap penalties • Match: +8 (w(x, y) = 8, if x = y) • Mismatch: -5 (w(x, y) = -5, if x ≠ y) • Each gap symbol: -3 (w(-,x)=w(x,-)=-3) • E.g. each gap is charged an extra gap-open penalty: -4. • In general, a gap of length k should have penalty g(k) C - - - T T A A C T C G G A T C A - - T +8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12 -4 -4 alignment score: 12 – 4 – 4 = 4 62 Affine gap penalties • A gap of length k is penalized x + k·y. gap-open penalty gap-symbol penalty Three cases for alignment endings: 1. ...x ...x 2. ...x ...- 3. ...- ...x an aligned pair a deletion an insertion 63 Affine gap penalties • Let D(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with a deletion. • Let I(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj ending with an insertion. • Let S(i, j) denote the maximum score of any alignment between a1a2…ai and b1b2…bj. 64 Affine gap penalties ⎪⎩ ⎪ ⎨ ⎧ +−− = ⎩ ⎨ ⎧ −−− −− = ⎩ ⎨ ⎧ −−− −− = ),( ),( ),()1,1( max),( )1,( )1,( max),( ),1( ),1( max),( jiI jiD bawjiS jiS yxjiS yjiI jiI yxjiS yjiD jiD ji 17 65 Affine gap penalties (Gotoh’s algorithm) SI D SI D SI D SI D -y -x-y -x-y -y w(ai,bj) 66 k best local alignments • Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987) • FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985) • BLAST (Altschul et al., 1990; Altschul et al., 1997) BLAST and FASTA are key genomic database search tools. 67 k best local alignments • Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987) – linear-space version:sim (Huang and Miller, 1991) – linear-space variants:sim2 (Chao et al., 1995); sim3 (Chao et al., 1997) • FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985) – linear-space band alignment (Chao et al., 1992) • BLAST (Altschul et al., 1990; Altschul et al., 1997) – restricted affine gap penalties (Chao, 1999) 68 FASTA 1) Find runs of identities, and identify regions with the highest density of identities. 2) Re-score using PAM matrix, and keep top scoring segments. 3) Eliminate segments that are unlikely to be part of the alignment. 4) Optimize the alignment in a band. Its running time is O(n). 20 77 Remarks • Filtering is based on the observation that a good alignment usually includes short identical or very similar fragments. • The idea of filtration was used in both FASTA and BLAST to achieve high speed 78 Linear space ideas Hirschberg, 1975; Myers and Miller, 1988 m/2 (i) scores can be computed in O(n) space (ii) divide-and-conquer S(a1…am/2,b1…bj) + S(am…am/2+1,bn…bj+1) maximized j 79 Two subproblems ½ original problem size m/2 m/4 3m/4 80 Four subproblems ¼ original problem size m/2 m/4 3m/4 21 81 Time and Space Complexity • Space: O(m+n) • Time: O(mn)*(1+ ½ + ¼ + …) = O(mn) 2 82 Band Alignment (K. Chao, W. Pearson, and W. Miller) Sequence B Sequence A 83 Band Alignment in Linear Space The remaining subproblems are no longer only half of the original problem. In worst case, this could cause an additional log n factor in time. 84 Band Alignment in Linear Space 22 85 Multiple sequence alignment (MSA) • The multiple sequence alignment problem is to simultaneously align more than two sequences. Seq1: GCTC Seq2: AC Seq3: GATC GC-TC A---C G-ATC 86 How to score an MSA? • Sum-of-Pairs (SP-score) GC-TC A---C G-ATC GC-TC A---C GC-TC G-ATC A---C G-ATC Score = Score Score Score + + 87 MSA for three sequences • an O(n3) algorithm 88 General MSA • For k sequences of length n: O(nk) • NP-Complete (Wang and Jiang) • The exact multiple alignment algorithms for many sequences are not feasible. • Some approximation algorithms are given. (e.g., 2- l/k for any fixed l by Bafna et al.)
Docsity logo



Copyright © 2024 Ladybird Srl - Via Leonardo da Vinci 16, 10126, Torino, Italy - VAT 10816460017 - All rights reserved