Dynamic Programming in Computer Science: Sequence Alignment and Manhattan Tourist Problem, Study notes of Computer Science

These notes cover the application of dynamic programming to finding sequence similarities between genes, using the Manhattan Tourist Problem as a motivating example. They explain how dynamic programming helps reveal similarities between genes and how it differs from greedy algorithms and simple recursive programs. They also cover the longest common subsequence (LCS) problem and its relation to the Manhattan Tourist Problem.

Typology: Study notes

Pre 2010

Uploaded on 03/16/2009

koofers-user-d8c-1

Dynamic Programming: Sequence Alignment
CS 466
Saurabh Sinha

DNA Sequence Comparison: First Success Story
• Finding sequence similarities with genes of known function is a common approach to inferring a newly sequenced gene's function.
• In 1984, Russell Doolittle and colleagues found similarities between a cancer-causing gene and the normal growth factor (PDGF) gene.
• A normal growth gene switched on at the wrong time causes cancer!

Motivating Dynamic Programming

Dynamic programming example: Manhattan Tourist Problem
Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the greatest number of attractions (*) in the Manhattan grid.
[Figure: Manhattan grid with attractions marked * and the Source and Sink labeled]

MTP: Greedy Algorithm Is Not Optimal
[Figure: weighted grid from source to sink. The greedy path makes a promising start but leads to bad choices. Path totals shown: 18 and 22.]

MTP: Simple Recursive Program
MT(n,m)
    if n = 0 or m = 0
        return the sum of the edge lengths along the border from (0,0) to (n,m)   (base case: only one path exists)
    x ← MT(n-1,m) + length of the edge from (n-1,m) to (n,m)
    y ← MT(n,m-1) + length of the edge from (n,m-1) to (n,m)
    return max{x,y}
What's wrong with this approach?

Here's what's wrong
• MT(n,m) needs MT(n,m-1) and MT(n-1,m).
• Both of these need MT(n-1,m-1).
• So MT(n-1,m-1) will be computed at least twice.
• Dynamic programming: the same idea as this recursive algorithm, but keep all intermediate results in a table and reuse them.

MTP: Dynamic Programming (cont'd)
[Figure: grid with axes i and j (0 to 3) and edge weights, partially filled with scores from the source: S_{3,0} = 8, S_{2,1} = 9, S_{1,2} = 13]

MTP: Dynamic Programming (cont'd)
The greedy algorithm fails!
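The table-based idea described above (compute each score once, store it, and reuse it) can be sketched in Python roughly as follows. This is a minimal illustration, not the lecture's own code: the function name `manhattan_tourist`, the parameter names `down` and `right`, and the edge weights in the example are all assumptions made for this sketch and are not the weights from the slide figures.

```python
def manhattan_tourist(down, right):
    """Longest source-to-sink path in a grid using only south and east moves.

    Hypothetical input layout (an assumption of this sketch):
      down[i][j]  = weight of the edge from (i, j) to (i + 1, j)
      right[i][j] = weight of the edge from (i, j) to (i, j + 1)
    """
    n = len(down)        # number of southward steps from source to sink
    m = len(right[0])    # number of eastward steps from source to sink

    # s[i][j] holds the best score of any path from (0, 0) to (i, j).
    s = [[0] * (m + 1) for _ in range(n + 1)]

    # Border vertices have exactly one incoming path.
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + right[0][j - 1]

    # Every interior vertex reuses the two already-computed neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],
                          s[i][j - 1] + right[i][j - 1])
    return s[n][m]


# Made-up 2 x 2 grid of blocks (3 x 3 vertices).
down = [[1, 0, 2],
        [4, 6, 5]]
right = [[3, 2],
         [0, 7],
         [3, 3]]
print(manhattan_tourist(down, right))  # 15
```

Each table entry is filled exactly once, so the work grows with the number of edges in the grid, in contrast to the recursive program, which recomputes the same subproblems repeatedly.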
[Figure: score table continued, with S_{3,1} = 9, S_{2,2} = 12, S_{1,3} = 8]

MTP: Dynamic Programming (cont'd)
[Figure: score table continued, with S_{3,2} = 9, S_{2,3} = 15]

Manhattan Is Not A Perfect Grid
What about diagonals?
• The score at point B is given by:
    s_B = max of
        s_{A1} + weight of the edge (A1, B)
        s_{A2} + weight of the edge (A2, B)
        s_{A3} + weight of the edge (A3, B)
[Figure: vertex B with incoming edges from A1, A2, and A3]

Manhattan Is Not A Perfect Grid (cont'd)
The score for point x is given by the recurrence relation:
    s_x = max of (s_y + weight of the edge (y, x)), where y ∈ Predecessors(x)
• Predecessors(x): the set of vertices that have edges leading to x
• The running time for a graph G(V, E) (V is the set of all vertices and E is the set of all edges) is O(E), since each edge is evaluated once.

Traveling in the Grid
• By the time the vertex x is analyzed, the values s_y for all its predecessors y must already be computed; otherwise we are in trouble.
• We need to traverse the vertices in some order.
• For a grid, we can traverse vertices row by row, column by column, or diagonal by diagonal.

Alignment

Aligning DNA Sequences
V = ATCTGATG (n = 8)
W = TGCATAC (m = 7)
[Figure: alignment of V and W with columns labeled match, mismatch, insertion, and deletion (insertions and deletions together are indels). Counts shown: 4 matches, 1 mismatch, 2 insertions, 2 deletions.]
Alignment: a 2 x k matrix (k ≥ m, n)

Longest Common Subsequence (LCS): Alignment without Mismatches
• Given two sequences v = v_1 v_2 … v_m and w = w_1 w_2 … w_n
• The LCS of v and w is a sequence of positions in v: 1 ≤ i_1 < i_2 < … < i_t ≤ m and a sequence of positions in w: 1 ≤ j_1 < j_2 < … < j_t ≤ n such that the i_k-th letter of v equals the j_k-th letter of w for every k = 1, …, t, and t is maximal.

LCS Problem as Manhattan Tourist Problem
[Figure: grid with axis i (0 to 7) labeled T G C A T A C and axis j (0 to 8) labeled A T C T G A T C]

Edit Graph for LCS Problem
[Figure: edit graph on the same grid]
Every path is a common subsequence. Every diagonal edge adds an extra element to the common subsequence.
LCS Problem: find a path with the maximum number of diagonal edges.

Printing LCS: Backtracking
PrintLCS(b,v,i,j)
    if i = 0 or j = 0
        return
    if b_{i,j} = "↖"
        PrintLCS(b,v,i-1,j-1)
        print v_i
    else
        if b_{i,j} = "↑"
            PrintLCS(b,v,i-1,j)
        else
            PrintLCS(b,v,i,j-1)

From LCS to Alignment
• The Longest Common Subsequence problem is the simplest form of sequence alignment: it allows only insertions and deletions (no mismatches).
• In the LCS Problem, we scored 1 for matches and 0 for indels.
• Consider penalizing indels and mismatches with negative scores.
• Simplest scoring scheme:
    +1 : match premium
    -µ : mismatch penalty
    -σ : indel penalty

Simple Scoring
• When mismatches are penalized by -µ, indels are penalized by -σ, and matches are rewarded with +1, the resulting score is:
    #matches - µ(#mismatches) - σ(#indels)

Making a Scoring Matrix
• Scoring matrices are created based on biological evidence.
• Alignments can be thought of as two sequences that differ due to mutations.
• Some of these mutations have little effect on the protein's function; therefore some penalties, δ(v_i, w_j), will be less harsh than others.

Scoring Matrix: Example
      A    R    N    K
A     5   -2   -1   -1
R     -    7   -1    3
N     -    -    7    0
K     -    -    -    6
• Notice that although R and K are different amino acids, they have a positive score.
• Why? They are both positively charged amino acids, so substituting one for the other will not greatly change the function of the protein.

Conservation
• Amino acid changes that tend to preserve the physico-chemical properties of the original residue
    – Polar to polar
        • aspartate to glutamate
    – Nonpolar to nonpolar
        • alanine to valine
    – Similarly behaving residues
        • leucine to isoleucine

Local Alignments: Why?
• Two genes in different species may be similar over short conserved regions and dissimilar over the remaining regions.
• Example:
    – Homeobox genes have a short region called the homeodomain that is highly conserved between species.
    – A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence.

The Local Alignment Problem
• Goal: Find the best local alignment between two strings
• Input: Strings v, w and scoring matrix δ
• Output: Alignment of substrings of v and w whose alignment score is maximum among all possible alignments of all possible substrings

The Problem Is …
• Long run time O(n^4):
    - In the grid of size n x n there are ~n^2 vertices (i,j) that may serve as a source.
    - For each such vertex, computing alignments from (i,j) to (i',j') takes O(n^2) time.
• This can be remedied by allowing every point to be the starting point.
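One common way to let every point be the starting point is to allow each cell of the score table to restart at 0, which is the recurrence usually associated with the Smith-Waterman algorithm. The preview text ends before stating a recurrence, so the Python sketch below is an assumption about the intended remedy rather than the lecture's own method. It uses the simple scoring scheme (+1 match, -µ mismatch, -σ indel); the function name `local_alignment_score` and the parameter names `mu` and `sigma` are hypothetical.

```python
def local_alignment_score(v, w, mu=1, sigma=1):
    """Best score over all alignments of a substring of v with a substring of w.

    Scoring (simple scheme): +1 per match, -mu per mismatch, -sigma per indel.
    The 0 option in the max lets an alignment start at any cell, which is
    the assumed reading of "every point may be the starting point".
    """
    n, m = len(v), len(w)
    # s[i][j] = best score of a local alignment ending at v[:i], w[:j].
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match_or_mismatch = 1 if v[i - 1] == w[j - 1] else -mu
            s[i][j] = max(
                0,                                      # start a new alignment here
                s[i - 1][j - 1] + match_or_mismatch,    # diagonal edge
                s[i - 1][j] - sigma,                    # indel: skip a letter of v
                s[i][j - 1] - sigma,                    # indel: skip a letter of w
            )
            best = max(best, s[i][j])   # the alignment may also end anywhere
    return best


# The shared region "ACGT" is surrounded by unrelated letters.
print(local_alignment_score("XXXACGTYYY", "ZZACGTWW"))  # 4
```

With this recurrence the whole table is filled once, so the running time under these assumptions is proportional to the table size (about n^2 for two strings of length n) instead of O(n^4).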