Dynamic Programming: Sequence Alignment
CS 466
Saurabh Sinha

DNA Sequence Comparison: First Success Story
• Finding sequence similarities with genes of known function is a common approach to inferring a newly sequenced gene's function
• In 1984 Russell Doolittle and colleagues found similarities between a cancer-causing gene and the normal growth factor (PDGF) gene
• A normal growth gene switched on at the wrong time causes cancer!

Motivating Dynamic Programming

Dynamic programming example: Manhattan Tourist Problem
Imagine seeking a path from source to sink, traveling only eastward and southward, that passes the largest number of attractions (*) in the Manhattan grid.
[Figure: Manhattan street grid with a marked source and sink and attractions shown as *]

MTP: Greedy Algorithm Is Not Optimal
[Figure: weighted grid from source to sink. The greedy path makes a promising start, but leads to bad choices. The values 18 and 22 are marked in the figure.]

MTP: Simple Recursive Program
MT(n,m)
  if n = 0 and m = 0
    return 0
  if n = 0
    return MT(0,m-1) + length of the edge from (0,m-1) to (0,m)
  if m = 0
    return MT(n-1,0) + length of the edge from (n-1,0) to (n,0)
  x ← MT(n-1,m) + length of the edge from (n-1,m) to (n,m)
  y ← MT(n,m-1) + length of the edge from (n,m-1) to (n,m)
  return max{x,y}
What's wrong with this approach?

Here's what's wrong
• MT(n,m) needs MT(n,m-1) and MT(n-1,m)
• Both of these need MT(n-1,m-1)
• So MT(n-1,m-1) will be computed at least twice
• Dynamic programming: the same idea as this recursive algorithm, but keep all intermediate results in a table and reuse them

MTP: Dynamic Programming (cont'd)
[Figure: weighted grid with axes i and j and a partially filled score table: S_{3,0} = 8, S_{2,1} = 9, S_{1,2} = 13]

MTP: Dynamic Programming (cont'd)
greedy alg. fails!
[Figure: weighted grid with axes i and j and more of the score table filled in: S_{3,1} = 9, S_{2,2} = 12, S_{1,3} = 8]

MTP: Dynamic Programming (cont'd)
[Figure: weighted grid with axes i and j and more of the score table filled in: S_{3,2} = 9, S_{2,3} = 15]

Manhattan Is Not A Perfect Grid
What about diagonals?
[Figure: vertex B with three incoming edges, from A1, A2 and A3]
• The score at point B is given by:
  s_B = max { s_{A1} + weight of the edge (A1, B),
              s_{A2} + weight of the edge (A2, B),
              s_{A3} + weight of the edge (A3, B) }

Manhattan Is Not A Perfect Grid (cont'd)
The score for point x is given by the recurrence relation:
  s_x = max over y ∈ Predecessors(x) of { s_y + weight of the edge (y, x) }
• Predecessors(x): the set of vertices that have edges leading to x
• The running time for a graph G(V, E) (V is the set of all vertices and E is the set of all edges) is O(E), since each edge is evaluated once

Traveling in the Grid
• By the time the vertex x is analyzed, the values s_y for all its predecessors y should be computed – otherwise we are in trouble.
• We need to traverse the vertices in some order
• For a grid, we can traverse the vertices row by row, column by column, or diagonal by diagonal

Alignment

Aligning DNA Sequences
V = ATCTGATG (n = 8)
W = TGCATAC (m = 7)
[Figure: V and W written as the two rows of an alignment, with columns labeled match, mismatch, insertion and deletion (insertions and deletions together are indels)]
4 matches, 1 mismatch, 2 insertions, 2 deletions
Alignment: a 2 × k matrix (k ≥ m, n)

Longest Common Subsequence (LCS) – Alignment without Mismatches
• Given two sequences v = v_1 v_2 … v_m and w = w_1 w_2 … w_n
• The LCS of v and w is a sequence of positions in v: 1 ≤ i_1 < i_2 < … < i_t ≤ m and a sequence of positions in w: 1 ≤ j_1 < j_2 < … < j_t ≤ n such that the i_k-th letter of v equals the j_k-th letter of w for every k from 1 to t, and t is maximal

LCS Problem as Manhattan Tourist Problem
[Figure: grid with T G C A T A C (positions 0 to 7) along one axis and A T C T G A T C (positions 0 to 8) along the other; axes labeled i and j]

Edit Graph for LCS Problem
[Figure: the same grid with diagonal edges added]
Every path is a common subsequence.
Every diagonal edge adds an extra element to the common subsequence.
LCS Problem: Find a path with the maximum number of diagonal edges

Printing LCS: Backtracking
PrintLCS(b,v,i,j)
  if i = 0 or j = 0
    return
  if b_{i,j} = "↖"
    PrintLCS(b,v,i-1,j-1)
    print v_i
  else
    if b_{i,j} = "↑"
      PrintLCS(b,v,i-1,j)
    else
      PrintLCS(b,v,i,j-1)

From LCS to Alignment
• The Longest Common Subsequence problem, the simplest form of sequence alignment, allows only insertions and deletions (no mismatches).
• In the LCS Problem, we scored 1 for matches and 0 for indels
• Consider penalizing indels and mismatches with negative scores
• Simplest scoring scheme:
  +1 : match premium
  -µ : mismatch penalty
  -σ : indel penalty

Simple Scoring
• When mismatches are penalized by –µ, indels are penalized by –σ, and matches are rewarded with +1, the resulting score is:
  #matches – µ(#mismatches) – σ(#indels)

Making a Scoring Matrix
• Scoring matrices are created based on biological evidence.
• Alignments can be thought of as two sequences that differ due to mutations.
• Some of these mutations have little effect on the protein's function, therefore some penalties, δ(v_i, w_j), will be less harsh than others.

Scoring Matrix: Example
      A    R    N    K
  A   5   -2   -1   -1
  R   -    7   -1    3
  N   -    -    7    0
  K   -    -    -    6
• Notice that although R and K are different amino acids, they have a positive score.
• Why? They are both positively charged amino acids, so substituting one for the other will not greatly change the function of the protein.

Conservation
• Amino acid changes that tend to preserve the physico-chemical properties of the original residue
  – Polar to polar
    • aspartate to glutamate
  – Nonpolar to nonpolar
    • alanine to valine
  – Similarly behaving residues
    • leucine to isoleucine

Local Alignments: Why?
• Two genes in different species may be similar over short conserved regions and dissimilar over the remaining regions.
• Example:
  – Homeobox genes have a short region called the homeodomain that is highly conserved between species.
  – A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence

The Local Alignment Problem
• Goal: Find the best local alignment between two strings
• Input: Strings v, w and scoring matrix δ
• Output: Alignment of substrings of v and w whose alignment score is maximum among all possible alignments of all possible substrings

The Problem Is …
• Long run time O(n^4):
  - In the grid of size n × n there are ~n^2 vertices (i,j) that may serve as a source.
  - For each such vertex, computing alignments from (i,j) to (i',j') takes O(n^2) time.
• This can be remedied by allowing every point to be the starting point
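
One way to let every point be a starting point is to add 0 as an extra option in the alignment recurrence, which is the idea behind the Smith-Waterman algorithm. The sketch below is only an illustration under stated assumptions: it uses the simple scoring scheme from these notes (+1 match, -µ mismatch, -σ indel) instead of a general matrix δ, and the function name `local_alignment` and its parameters `mu` and `sigma` are hypothetical names chosen here. It fills a single table in O(nm) time rather than running a separate alignment from each of the ~n^2 sources.

```python
def local_alignment(v, w, mu=1, sigma=1):
    """Return (best score, substring of v, substring of w).

    Assumed scoring: +1 for a match, -mu for a mismatch, -sigma for an indel.
    """
    n, m = len(v), len(w)
    # s[i][j] = best score of a local alignment ending at v[:i] and w[:j].
    # Row 0 and column 0 stay 0: an empty alignment is always allowed.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = 1 if v[i - 1] == w[j - 1] else -mu
            s[i][j] = max(
                0,                        # start a fresh alignment here
                s[i - 1][j - 1] + step,   # match or mismatch (diagonal edge)
                s[i - 1][j] - sigma,      # indel (vertical edge)
                s[i][j - 1] - sigma,      # indel (horizontal edge)
            )
            if s[i][j] > best:
                best, best_cell = s[i][j], (i, j)

    # Backtrack from the best-scoring vertex until a score of 0 is reached,
    # which marks the vertex where the local alignment started.
    i, j = best_cell
    end_i, end_j = i, j
    while s[i][j] > 0:
        step = 1 if v[i - 1] == w[j - 1] else -mu
        if s[i][j] == s[i - 1][j - 1] + step:
            i, j = i - 1, j - 1
        elif s[i][j] == s[i - 1][j] - sigma:
            i -= 1
        else:
            j -= 1
    return best, v[i:end_i], w[j:end_j]


if __name__ == "__main__":
    print(local_alignment("XXXACGTYYY", "ZZACGTWW"))
```

The sink is handled the same way: instead of reading the score at the bottom-right vertex, the sketch takes the maximum over the whole table, so the alignment may also end anywhere.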