Dynamic Programming in Computer Science: Sequence Alignment and Manhattan Tourist Problem, Study notes of Computer Science

University of Illinois - Urbana-Champaign Computer Science

Prof. Saurabh Sinha

These notes cover the application of dynamic programming to finding sequence similarities between genes, using the Manhattan Tourist Problem as an example. They explain how dynamic programming helps reveal similarities between genes and how it differs from greedy algorithms and simple recursive programs. They also cover the longest common subsequence (LCS) problem and its relation to the Manhattan Tourist Problem.

Typology: Study notes

Pre 2010

Uploaded on 03/16/2009

koofers-user-d8c-1 🇺🇸

Partial preview of the text

Dynamic Programming: Sequence Alignment

CS 466
Saurabh Sinha

DNA Sequence Comparison: First Success Story

• Finding sequence similarities with genes of known function is a common approach to inferring a newly sequenced gene's function
• In 1984 Russell Doolittle and colleagues found similarities between a cancer-causing gene and the normal growth factor (PDGF) gene
• A normal growth gene switched on at the wrong time causes cancer!

Motivating Dynamic Programming

Dynamic programming example: Manhattan Tourist Problem

Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the greatest number of attractions (*) in the Manhattan grid.

[Figure: Manhattan street grid with attractions marked by *, with the source and the sink at opposite corners]

MTP: Greedy Algorithm Is Not Optimal

[Figure: grid from source to sink with a weight on every edge. Annotation: "promising start, but leads to bad choices!" Two path scores are shown: 18 and 22.]

MTP: Simple Recursive Program

MT(n, m)
    if n = 0 or m = 0
        return the total edge length along the border from (0, 0) to (n, m)
    x ← MT(n-1, m) + length of the edge from (n-1, m) to (n, m)
    y ← MT(n, m-1) + length of the edge from (n, m-1) to (n, m)
    return max{x, y}

What's wrong with this approach?

Here's what's wrong

• MT(n, m) needs MT(n, m-1) and MT(n-1, m)
• Both of these need MT(n-1, m-1)
• So MT(n-1, m-1) will be computed at least twice
• Dynamic programming: the same idea as this recursive algorithm, but keep all intermediate results in a table and reuse them

MTP: Dynamic Programming (cont'd)

[Figure: weighted grid with row index i and column index j, both running from 0 to 3, and scores filled in outward from the source]

S_{3,0} = 8, S_{2,1} = 9, S_{1,2} = 13, S_{0,3} = 8

MTP: Dynamic Programming (cont'd)

greedy alg. fails!
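The table-based idea can be sketched as follows: the same recurrence as MT(n, m), but filled in from the source outward so that every score is computed exactly once. The function name `manhattan_tourist`, the `down`/`right` layout for edge weights, and the small example grid are assumptions made for illustration; they are not taken from the slides.

```python
def manhattan_tourist(down, right):
    """Best path score from (0, 0) to (n, m), moving only south or east.

    down[i][j]  is the weight of the edge (i, j) -> (i + 1, j); n rows, m + 1 columns.
    right[i][j] is the weight of the edge (i, j) -> (i, j + 1); n + 1 rows, m columns.
    (This layout is an assumption chosen for the sketch.)
    """
    n = len(down)
    m = len(right[0])
    s = [[0] * (m + 1) for _ in range(n + 1)]

    # Border vertices have a single incoming path, so their scores are running sums.
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + right[0][j - 1]

    # Interior vertices: best of "arrive from the north" and "arrive from the west".
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],
                          s[i][j - 1] + right[i][j - 1])
    return s[n][m]


# Made-up 3 x 3 vertex grid (n = m = 2), not the grid from the slides.
down = [[1, 0, 2],
        [4, 3, 1]]
right = [[3, 2],
         [0, 5],
         [1, 2]]
print(manhattan_tourist(down, right))  # 9
```

Each table entry is written once and read at most twice, so the work grows with the number of grid vertices rather than with the number of paths, which is what the plain recursion effectively enumerates.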
[Figure: the weighted grid with the next anti-diagonal of scores filled in]

S_{3,1} = 9, S_{2,2} = 12, S_{1,3} = 8

MTP: Dynamic Programming (cont'd)

[Figure: the weighted grid with one more anti-diagonal of scores filled in]

S_{3,2} = 9, S_{2,3} = 15

Manhattan Is Not A Perfect Grid

What about diagonals?

• The score at point B is given by:

  s_B = max of
      s_{A1} + weight of the edge (A1, B)
      s_{A2} + weight of the edge (A2, B)
      s_{A3} + weight of the edge (A3, B)

[Figure: vertex B with incoming edges from A1, A2, and A3]

Manhattan Is Not A Perfect Grid (cont'd)

The score for point x is given by the recurrence relation:

  s_x = max of s_y + weight of the edge (y, x), where y ∈ Predecessors(x)

• Predecessors(x): the set of vertices that have edges leading to x
• The running time for a graph G(V, E) (V is the set of all vertices and E is the set of all edges) is O(E), since each edge is evaluated once

Traveling in the Grid

• By the time the vertex x is analyzed, the values s_y for all its predecessors y should be computed; otherwise we are in trouble.
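The predecessor recurrence can be sketched for an arbitrary directed acyclic graph. The sketch below assumes the graph is given as a list of `(y, x, weight)` edges and uses Kahn's algorithm to obtain an order in which every predecessor is scored before its successors; the function name, the input format, and the example weights are illustrative assumptions, not part of the slides.

```python
from collections import defaultdict, deque


def longest_path_scores(vertices, edges, source):
    """Score every vertex with s_x = max over predecessors y of s_y + weight(y, x).

    edges is a list of (y, x, weight) triples for directed edges y -> x.
    Vertices that cannot be reached from source keep the score -infinity.
    """
    preds = defaultdict(list)   # x -> [(y, weight), ...]
    succs = defaultdict(list)   # y -> [x, ...]
    indegree = {v: 0 for v in vertices}
    for y, x, weight in edges:
        preds[x].append((y, weight))
        succs[y].append(x)
        indegree[x] += 1

    # Kahn's algorithm: a vertex is emitted only after all its predecessors.
    order = []
    ready = deque(v for v in vertices if indegree[v] == 0)
    while ready:
        y = ready.popleft()
        order.append(y)
        for x in succs[y]:
            indegree[x] -= 1
            if indegree[x] == 0:
                ready.append(x)

    s = {v: float("-inf") for v in vertices}
    s[source] = 0
    for x in order:
        for y, weight in preds[x]:      # each edge is examined exactly once here
            s[x] = max(s[x], s[y] + weight)
    return s


# Made-up weights for a vertex B with predecessors A1 (diagonal), A2, and A3.
example_edges = [("A1", "A2", 2), ("A1", "A3", 1),
                 ("A1", "B", 4), ("A2", "B", 1), ("A3", "B", 5)]
scores = longest_path_scores(["A1", "A2", "A3", "B"], example_edges, "A1")
print(scores["B"])  # 6, reached through A3
```

Building the order costs O(V + E) and the scoring loop touches each edge once, which matches the O(E) bound when every vertex has at least one edge.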
• We need to traverse the vertices in some order
• For a grid, we can traverse vertices row by row, column by column, or diagonal by diagonal

Alignment

Aligning DNA Sequences

V = ATCTGATG (n = 8)
W = TGCATAC (m = 7)

[Figure: V and W written as the two rows of an alignment, with columns labeled match, mismatch, insertion, and deletion (insertions and deletions together are indels). Counts shown: 4 matches, 1 mismatch, 2 insertions, 2 deletions.]

Alignment: a 2 × k matrix (k ≥ m, n)

Longest Common Subsequence (LCS): Alignment without Mismatches

• Given two sequences v = v_1 v_2 … v_m and w = w_1 w_2 … w_n
• The LCS of v and w is a sequence of positions in v:
      1 ≤ i_1 < i_2 < … < i_t ≤ m
  and a sequence of positions in w:
      1 ≤ j_1 < j_2 < … < j_t ≤ n
  such that the i_k-th letter of v equals the j_k-th letter of w for every k from 1 to t, and t is maximal

LCS Problem as Manhattan Tourist Problem

[Figure: grid with the letters T G C A T A C along one axis (index i, 0 to 7) and A T C T G A T C along the other (index j, 0 to 8)]

Edit Graph for LCS Problem

[Figure: the same grid with diagonal edges added where the two letters match]

Every path is a common subsequence.
Every diagonal edge adds an extra element to the common subsequence.
LCS Problem: find a path with the maximum number of diagonal edges.

Printing LCS: Backtracking

PrintLCS(b, v, i, j)
    if i = 0 or j = 0
        return
    if b_{i,j} = "↖"
        PrintLCS(b, v, i-1, j-1)
        print v_i
    else
        if b_{i,j} = "↑"
            PrintLCS(b, v, i-1, j)
        else
            PrintLCS(b, v, i, j-1)

From LCS to Alignment

• The Longest Common Subsequence problem, the simplest form of sequence alignment, allows only insertions and deletions (no mismatches).
• In the LCS Problem, we scored 1 for matches and 0 for indels
• Consider penalizing indels and mismatches with negative scores
• Simplest scoring scheme:
      +1 : match premium
      -µ : mismatch penalty
      -σ : indel penalty

Simple Scoring

• When mismatches are penalized by -µ, indels are penalized by -σ, and matches are rewarded with +1, the resulting score is:

      #matches - µ(#mismatches) - σ(#indels)

Making a Scoring Matrix

• Scoring matrices are created based on biological evidence.
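The simple scoring scheme (+1, -µ, -σ) can be turned into a small dynamic program over the same kind of grid used for LCS. The sketch below is one possible way to compute the best global alignment score; the function name `alignment_score` and the test strings are assumptions made for illustration. Setting σ = 0 and µ = ∞ recovers the LCS scoring described above (1 for matches, 0 for indels, no mismatches).

```python
def alignment_score(v, w, mu, sigma):
    """Best global alignment score with +1 per match, -mu per mismatch, -sigma per indel."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]

    # Aligning a prefix against the empty string costs one indel per letter.
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] - sigma
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] - sigma

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            s[i][j] = max(diagonal,               # match or mismatch
                          s[i - 1][j] - sigma,    # letter of v against a gap
                          s[i][j - 1] - sigma)    # letter of w against a gap
    return s[n][m]


# Three matches and one indel: 3 - 1 = 2.
print(alignment_score("ACGT", "AGT", mu=1, sigma=1))  # 2

# LCS as a special case: indels are free and mismatches are never worth taking.
print(alignment_score("ATGTTAT", "ATCGTAC", mu=float("inf"), sigma=0))  # 5
```

Replacing the `1 if ... else -mu` expression with a lookup δ(v_i, w_j) would give the version that uses a scoring matrix instead of fixed premiums and penalties.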
• Alignments can be thought of as two sequences that differ due to mutations.
• Some of these mutations have little effect on the protein's function; therefore some penalties, δ(v_i, w_j), will be less harsh than others.

Scoring Matrix: Example

        A    R    N    K
   A    5   -2   -1   -1
   R    -    7   -1    3
   N    -    -    7    0
   K    -    -    -    6

• Notice that although R and K are different amino acids, they have a positive score.
• Why? They are both positively charged amino acids, so substituting one for the other will not greatly change the function of the protein.

Conservation

• Amino acid changes that tend to preserve the physico-chemical properties of the original residue
  - Polar to polar
    • aspartate → glutamate
  - Nonpolar to nonpolar
    • alanine → valine
  - Similarly behaving residues
    • leucine to isoleucine

Local Alignments: Why?

• Two genes in different species may be similar over short conserved regions and dissimilar over the remaining regions.
• Example:
  - Homeobox genes have a short region called the homeodomain that is highly conserved between species.
  - A global alignment would not find the homeodomain because it would try to align the ENTIRE sequence

The Local Alignment Problem

• Goal: Find the best local alignment between two strings
• Input: Strings v, w and scoring matrix δ
• Output: Alignment of substrings of v and w whose alignment score is maximum among all possible alignments of all possible substrings

The Problem Is …

• Long run time O(n^4):
  - In the grid of size n × n there are ~n^2 vertices (i, j) that may serve as a source.
  - For each such vertex, computing alignments from (i, j) to (i', j') takes O(n^2) time.
• This can be remedied by allowing every point to be the starting point
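The preview ends here, so the following is only a sketch of one standard way to let every point be a starting point: the Smith-Waterman recurrence, which adds 0 as an extra option at every vertex so that an alignment may begin anywhere at no cost. Whether the original notes continue with exactly this formulation is an assumption; the names `local_alignment_score` and `simple_delta` and the test strings are hypothetical.

```python
def simple_delta(mu):
    """Scoring function: +1 for a match, -mu for a mismatch (stand-in for a matrix δ)."""
    return lambda a, b: 1 if a == b else -mu


def local_alignment_score(v, w, delta, sigma):
    """Return (best score, (i, j)) where (i, j) is the end of a best local alignment.

    delta(a, b) scores aligning letter a with letter b; sigma is the indel penalty.
    """
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_end = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                0,                                            # start a new alignment here
                s[i - 1][j] - sigma,                          # letter of v against a gap
                s[i][j - 1] - sigma,                          # letter of w against a gap
                s[i - 1][j - 1] + delta(v[i - 1], w[j - 1]),  # match or mismatch
            )
            if s[i][j] > best:
                best, best_end = s[i][j], (i, j)
    return best, best_end


# The two strings share only the block ACGT, which ends at position 6 in both.
print(local_alignment_score("GGACGTCC", "TTACGTAA", simple_delta(1), 1))  # (4, (6, 6))
```

The table is filled once, so the running time is proportional to the number of grid vertices, O(n^2) for two strings of length n, instead of the O(n^4) obtained by running a separate alignment from every possible source.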
