8/20/2005 Su-Shing Chen, CISE

CAP 5510-7 Multiple Sequence Alignment
BIOINFORMATICS
Su-Shing Chen, CISE

Global Multiple Sequence Alignment (MSA)
Example: an MSA of the 4 sequences MQPILLLV, MLRLL, MKILLL, MPPVLILV:

MQPILLLV
MLR-LL--
MK-ILLL-
MPPVLILV

No column consists entirely of gaps.

Why do we do multiple alignments?
- To help predict the secondary and tertiary structures of new sequences.
- As a preliminary step in molecular evolution analysis, where phylogenetic methods are used to construct phylogenetic trees.

Problem Definition
Given strings S1, S2, …, Sk, a multiple (global) alignment maps them to strings S1’, S2’, …, Sk’ that may contain spaces, where
1. |S1’| = |S2’| = … = |Sk’|, and
2. removing the spaces from Si’ leaves Si, for 1 ≤ i ≤ k.

Scoring Methods
There are various scoring methods; a common one is the “sum-of-pairs” score. The sum-of-pairs (SP) value for a multiple global alignment A of k strings is the sum of the values of all C(k, 2) = k(k−1)/2 pairwise alignments induced by A.

Finding Optimal MSAs
An optimal MSA is an alignment with minimum overall cost or maximum overall similarity. Finding one is NP-complete, so the practical aim is to obtain an alignment that is very close to the optimal alignment.

Overview of Methods
- Dynamic programming: a complete search is too computationally expensive, so heuristics are used.
- Progressive: starts with a pairwise alignment of the most similar sequences and adds to that.
- Iterative: makes an initial alignment of groups of sequences and adds to these (e.g. genetic algorithms).
- Locally conserved patterns.
- Statistical and probabilistic methods.

Choosing sequences for alignment
The more sequences to align, the better.
Don’t include highly similar (>80%) sequences. Sub-groups should be pre-aligned separately, and one member of each sub-group should be included in the final multiple alignment.

Dynamic Programming contd.
For k sequences, we construct a k-dimensional lattice. Any point on this lattice can be represented by the k-tuple (x1, x2, x3, …, xk), where xi corresponds to the ith sequence in the alignment and ranges over the entire length of the ungapped ith sequence.

Dynamic Programming contd.
Representing an alignment in the lattice. Consider the following alignment.
[Figure: example alignment of 3 sequences]
Start from the point (0,0,0) and move column-wise along the alignment. For each column, if there is a residue in the ith sequence, increase xi by 1; otherwise leave xi unchanged.

Dynamic Programming contd.
[Figure: the alignment path for the 3 sequences]

Carrillo-Lipman method
Scores here are similarity scores (higher is better), and $d(S_i, S_j)$ is the score of the optimal pairwise alignment of $S_i$ and $S_j$. Since $A^o$ is the optimal alignment, for any other alignment $A^h$ (for example, one found heuristically) we have

$c(A^o) - c(A^h) \ge 0$   (1)

Let us define $c(A|_{ij})$ as the score of the projection of $A$ onto sequences $i$ and $j$, calculated like a normal pairwise alignment. From the basic definition of the score of the multiple alignment, we can write

$c(A) = \sum_{i<j} c(A|_{ij})$   (2)

Carrillo-Lipman method
Also, since $A^o|_{ij}$ need not be the optimal pairwise alignment of $S_i$ and $S_j$, we have

$c(A^o|_{ij}) \le d(S_i, S_j)$   (3)

Using (1), (2) and (3), we can show that, for any pair $(p, q)$,

$\sum_{i<j} \left[ c(A^h|_{ij}) - d(S_i, S_j) \right] + d(S_p, S_q) \le c(A^o|_{pq})$

which can be written as

$U - L + d(S_p, S_q) \le c(A^o|_{pq})$

Carrillo-Lipman method
In this inequality, $U$ is the sum of the projection scores of $A^h$ and $L$ is the sum of the optimal pairwise scores (both are written out below). The inequality gives a minimum score for each pairwise projection of the optimal alignment, so we can safely ignore all paths whose projection scores less than this minimum.
The quantities $U$ and $L$ in the Carrillo-Lipman bound are:

$U = \sum_{i<j} c(A^h|_{ij})$

$L = \sum_{i<j} d(S_i, S_j)$

Manual alignments
Consider the following sequences.
[Figure: example sequences]

Manual alignments
The scores for each pairwise alignment are shown below.
[Table: pairwise alignment scores]

Manual alignments
The table shows that S1 is the most similar sequence to all other sequences. Use these pairwise alignments to construct the multiple alignment.

Manual alignment
- Far less running time: $O(n^2 k^2 + nl)$ for k sequences of n residues each and a total gapped length l.
- However, these methods do not optimize the score of the MSA. There is a tradeoff between practicality and optimization.
- These methods cannot guarantee a unique MSA for a set of sequences; the result depends on the order in which sequences are added.

Progressive alignment
Compare pairwise alignments and then add sequences based on a given order of similarity.
General approach:
- Construct the pairwise alignments of the sequences.
- Construct a “rough” similarity tree.
- Combine the alignments from the most closely related groups to the most distant, following the “once a gap, always a gap” strategy: once a gap is added, we don’t change it.

Step 1: Construct the Distance Matrix.
Construct the pairwise alignments of the sequences in the set and find the distances between them.
Distance between two sequences:
- Count the mismatches, insertions and deletions between the sequences and divide the count by the total length of the sequence, OR
- Assign a distance to each mismatch and gap, and add up the distances over the entire sequence length.

Step 2: Construct “Rough” Similarity Tree.
Each leaf node represents a sequence. The length of the arms is proportional to the distance between the sequences. The distances are additive.
The distance matrix is fed into an algorithm that builds a tree relating these sequences (a neighbor-joining tree).

Neighbor Joining Tree
[Figure: neighbor-joining tree with branch lengths; the leaves are Hbb_human, Hbb_horse, Hba_human, Hba_horse, Myg_whale, and lamprey and lupin globins. Note: figure not drawn to scale.]
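
The first distance definition in Step 1 (mismatches, insertions and deletions divided by length) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `pairwise_distance` is hypothetical, the input is taken to be the two gapped rows of a pairwise alignment, and the count is divided by the aligned (gapped) length, which is one possible reading of "total length of the sequence".

```python
def pairwise_distance(row_a: str, row_b: str) -> float:
    """Fraction of aligned columns that are a mismatch or an indel.

    row_a and row_b are the two gapped rows of a pairwise alignment,
    with '-' marking a gap. Dividing by the aligned length is an
    assumption made for this sketch.
    """
    if len(row_a) != len(row_b):
        raise ValueError("rows of an alignment must have equal length")
    if not row_a:
        raise ValueError("alignment must not be empty")
    # A column counts as a difference whenever its two symbols differ.
    # A gap opposite a residue (an insertion or deletion) is caught by
    # the same comparison as a residue mismatch.
    differences = sum(1 for a, b in zip(row_a, row_b) if a != b)
    return differences / len(row_a)


# Rows borrowed from the example MSA earlier in these notes, treated
# here as pairwise alignments purely for illustration.
print(pairwise_distance("MQPILLLV", "MPPVLILV"))  # 3 of 8 columns differ
print(pairwise_distance("MQPILLLV", "MLR-LL--"))  # 5 of 8 columns differ
```

Computing this value for every pair of sequences fills the symmetric distance matrix that Step 2 turns into a tree.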
Distance between Hbb_human and Hbb_horse in the tree: .081 + .084 = .165 (close to .17 in the matrix).

Aligning alignments
To use dynamic programming on a pair of alignments:
- Treat each alignment as a single sequence of columns.
- Define a scoring scheme for the columns.
To score the pair of alignments, we score each column of one alignment against each column of the other. For example, column 1 of alignment 1 is (A C) and column 2 of alignment 2 is (C A A G). Define S(1,2) as the score of column 1 against column 2.

Aligning alignments
(A C) against (C A A G):
S(1,2) = 1/8 { score(A,C) + score(A,A) + score(A,A) + score(A,G) + score(C,C) + score(C,A) + score(C,A) + score(C,G) }
Similarly define S(1,1), S(1,3), and so on. Construct a scoring matrix S for the pair of alignments.

Weighting Sequences
While aligning the sequences, we are just averaging the scores of individual pairs. This is problematic if we have very similar sequences.
- Use the tree to weight the sequences, with highly diverged sequences getting larger weights.
- Use the length from the root to each sequence to compute its weight: more divergent sequences get increased weights.
- If 2 or more sequences share a branch, the length of that branch is split among them: related sequences get reduced weights.
- Use these weights instead of the plain average.

Progressive alignment
- It is more accurate than the manual method.
- The distances are not the true evolutionary distances. When calculating a distance, we count any substitution A->B as a single substitution, although it may be the result of multiple substitutions at that position (A->C->B).
- While aligning the sequences, we are just averaging the scores of individual pairs. This is problematic if we have very similar sequences.

ClustalW - for multiple alignment
ClustalW is a general-purpose multiple alignment program for DNA or proteins.
http://www.ebi.ac.uk/clustalw/
It can create multiple alignments, manipulate existing alignments, do profile analysis and create phylogenetic trees.
Alignment can be done by 2 methods:
- slow/accurate
- fast/approximate

ClustalW - for multiple alignment
Based on phylogenetic analysis:
- A phylogenetic tree is created using a pairwise distance matrix and the neighbor-joining algorithm.
- The most closely related pairs of sequences are aligned using dynamic programming.
- Each of the alignments is analyzed and a profile of it is created.
- Alignment profiles are aligned progressively to give a total alignment.

ClustalW Example - Run
[Screenshot: ClustalW submission form with its options left at their defaults and three hemoglobin sequences pasted in, among them HBB_HUMAN and HBA_HUMAN.]

ClustalW Example - Results
Results of search:
- Number of sequences: 3
- Alignment score: 1467
- Sequence format: Pearson
- Sequence type: aa
- ClustalW version: 1.82
- Output file: clustalw-20050817-03111083.output
- Alignment file: clustalw-20050817-03111083.aln
- Guide tree file: clustalw-20050817-03111083.dnd
- Your input file: clustalw-20050817-03111083.input

ClustalW Example - Output file

ClustalW Example - JalView
[Screenshot: the resulting alignment displayed in JalView]

Other Programs - Dialign
- A local algorithm.
- Aligns whole segments rather than single residues.
- All pairwise alignments are performed and all aligned ungapped regions are picked up. These regions appear as diagonals on a dotplot.
- A consistent set of diagonals is determined and iteratively added to the alignment.

Other Programs - Poa
- A progressive algorithm that uses partially ordered graphs, as opposed to generalized profiles, to represent aligned sequences.
- Generalized profiles are only accurate when sequences are related solely through a process of insertions, deletions and mutations.
- Partially ordered graphs can represent global cut-and-paste operations, and thus reflect the biological content of multiple alignments more accurately. Problems caused by the inherent loss of information in generalized profiles are therefore avoided.
- The two most similar sequences are determined and aligned, and all other sequences are added to this one profile in a stepwise fashion.
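
Returning to the "Aligning alignments" slides, the column-against-column score S(1,2) can be sketched in Python. This is a minimal illustration rather than the scoring used by any particular program: the names `column_score` and `simple_score` are hypothetical, and the match/mismatch scheme (1 for identical residues, 0 otherwise) is an assumption chosen only to make the arithmetic easy to follow. Gap symbols are not treated specially.

```python
from typing import Callable, Sequence


def simple_score(x: str, y: str) -> float:
    """Hypothetical residue score: 1 for a match, 0 for a mismatch."""
    return 1.0 if x == y else 0.0


def column_score(
    col_a: Sequence[str],
    col_b: Sequence[str],
    score: Callable[[str, str], float] = simple_score,
) -> float:
    """Average residue score over every pair drawn from the two columns.

    For columns of sizes m and n this averages m * n pair scores,
    which is the 1/8 factor for (A C) against (C A A G).
    """
    if not col_a or not col_b:
        raise ValueError("columns must not be empty")
    total = sum(score(x, y) for x in col_a for y in col_b)
    return total / (len(col_a) * len(col_b))


# Column 1 of alignment 1 against column 2 of alignment 2, as on the slide.
# Under simple_score, 3 of the 8 residue pairs match.
print(column_score("AC", "CAAG"))
```

Evaluating `column_score` for every column pair gives the scoring matrix S on which ordinary pairwise dynamic programming can then be run. The sequence weights described under "Weighting Sequences" would replace the plain average with a weighted one.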