CISC636, S08, Lec8, Liao

CISC 636 Intro to Bioinformatics (Spring 2008)

Multiple Sequence Alignment
• Scoring
• Dynamic Programming algorithms
• Heuristic algorithms
  – CLUSTAL W

[Figure courtesy of Jalview]

Scoring a multiple alignment
– Ideally, the score should take into account that
  • Some positions are more conserved than others: position-specific scoring (columns).
  • Sequences are not independent; they evolved as depicted by phylogenetic trees (rows).
– In practice, each position (column) is scored independently:
    S(m) = G + Σ_i S(m_i)
  where m_i stands for column i of the multiple alignment m, and G is a function for scoring the gaps.
• Note: Hidden Markov models take into account position correlation, but only locally.

Column score
– Ideally, a column with three rows should be scored as
    log( p_abc / (q_a q_b q_c) )    (1)
  where p_abc is the probability of seeing residues a, b, c aligned together and q_a is the background frequency of residue a.
– Sum of pairs (SP) scores:
    S(m_i) = Σ_{k<l} S(m_i^k, m_i^l)
  where m_i^k stands for the residue at position i of sequence k. Scores S(a, b) come from a substitution scoring matrix, e.g., PAM.
  This means that the score in eq. (1) is approximated as
    log( p_ab / (q_a q_b) ) + log( p_ac / (q_a q_c) ) + log( p_bc / (q_b q_c) )    (2)
  Note: scoring gaps
    s(a, -) = s(-, a) = -d
    s(-, -) = 0    (once a gap, always a gap)

Example of SP scoring
Column F F F I V:
  S = S(F,F) + S(F,F) + S(F,I) + S(F,V) + S(F,F) + S(F,I) + S(F,V) + S(F,I) + S(F,V) + S(I,V)
    = 8 + 8 + 0 - 1 + 8 + 0 - 1 + 0 - 1 + 4 = 25
Column F F F I N:
  S = S(F,F) + S(F,F) + S(F,I) + S(F,N) + S(F,F) + S(F,I) + S(F,N) + S(F,I) + S(F,N) + S(I,N)
    = 8 + 8 + 0 - 4 + 8 + 0 - 4 + 0 - 4 - 3 = 9
Note: BLOSUM50 is used, in which S(I,N) = -3.

Approach 2: Progressive Alignment
• Distance-based guide tree
  – Distances may be obtained from
    • Pairwise alignment
    • Hybridization
  – Tree can be built by using
    • UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
    • Neighbor joining

UPGMA
• Fast and easy
• Robust to sequence errors
• Assumes a molecular clock, i.e., a constant rate of evolution

• Add sequences to the growing alignment by following the order in the guide tree
  – Represent a multiple alignment as a profile (Position Specific Scoring Matrix)
    • Given an alignment, the profile at each column is a vector of length 20 specifying the frequencies of the 20 amino acids appearing in that column.
    • Profiles are constructed from the multiple sequence alignment.

• Align profile P to profile Q
  – The score for aligning column i of P to column j of Q:
      S(i,j) = Σ_a { P_i(a) Σ_b [ Q_j(b) S(a,b) ] }
    Note: there are different scoring schemes. One other example is to use relative entropy:
      S(i,j) = Σ_a P_i(a) log[ P_i(a) / Q_j(a) ]
  – Use DP to find the optimal alignment, i.e., the one maximizing the total score.

Algorithm: CLUSTAL W (Thompson, Higgins and Gibson 1994; based on CLUSTAL, Higgins and Sharp 1989)
i. Construct a distance matrix of all N(N-1)/2 pairs by pairwise DP alignment.
ii. Construct a guide tree by a neighbor-joining method.
iii. Progressively align at nodes in order of decreasing similarity, using sequence-sequence, sequence-profile, and profile-profile alignment.
Heuristic
– A column, once aligned, will not change later when new sequences are added.
– Can handle < 1,000 sequences.

Algorithm: T-COFFEE
– Can handle < 10,000 sequences.
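
The two column scores used in these notes, the sum-of-pairs (SP) score and the profile-to-profile score S(i,j) = Σ_a P_i(a) Σ_b Q_j(b) S(a,b), can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the course's implementation: the matrix holds only hand-copied BLOSUM50 entries for the residues F, I, V and N, the gap cost d = 8 is a hypothetical value (the notes do not fix one), and all function names are made up for this example.

```python
from collections import Counter
from itertools import combinations

# Hand-copied BLOSUM50 entries for F, I, V, N only (assumption: a real
# program would load the full 20x20 matrix from a library or file).
BLOSUM50_SUBSET = {
    ("F", "F"): 8, ("F", "I"): 0, ("F", "V"): -1, ("F", "N"): -4,
    ("I", "I"): 5, ("I", "V"): 4, ("I", "N"): -3,
    ("V", "V"): 5, ("V", "N"): -3,
    ("N", "N"): 7,
}
GAP_COST = 8  # hypothetical value for d; not specified in the notes


def pair_score(a, b, d=GAP_COST):
    """s(-,-) = 0, s(a,-) = s(-,a) = -d, otherwise a matrix lookup."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -d
    if (a, b) in BLOSUM50_SUBSET:
        return BLOSUM50_SUBSET[(a, b)]
    return BLOSUM50_SUBSET[(b, a)]  # the matrix is symmetric


def sp_column_score(column):
    """Sum-of-pairs score: add pair_score over all pairs of rows k < l."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def column_profile(column):
    """Residue frequencies in one column, ignoring gap characters."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return {}
    counts = Counter(residues)
    return {a: n / len(residues) for a, n in counts.items()}


def profile_column_score(p, q):
    """S(i,j) = sum_a P_i(a) * sum_b Q_j(b) * S(a,b) for two column profiles."""
    return sum(pa * qb * pair_score(a, b)
               for a, pa in p.items()
               for b, qb in q.items())


if __name__ == "__main__":
    print(sp_column_score("FFFIV"))  # 25
    print(profile_column_score(column_profile("FFIV"), column_profile("F")))
```

With these entries, `sp_column_score("FFFIV")` reproduces the first worked SP example (25). A full profile-profile aligner would fill a DP table whose cell scores come from `profile_column_score`, plus a gap term.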