Bioinformatics: Importance and Methods for Optimal Multiple Sequence Alignment, Study notes of Computer Science

These notes cover the importance of multiple sequence alignment (MSA) in bioinformatics, define the problem of finding optimal MSAs, explore various scoring methods, and outline methods for finding optimal MSAs, including dynamic programming, the Carrillo-Lipman method, and manual alignments.

Typology: Study notes

Pre 2010

Uploaded on 09/17/2009

koofers-user-5in

CAP 5510-7 Multiple Sequence Alignment
BIOINFORMATICS
Su-Shing Chen, CISE
8/20/2005

Global Multiple Sequence Alignment (MSA)
- Ex: MSA of 4 sequences MQPILLLV, MLRLL, MKILLL, MPPVLILV:
    MQPILLLV
    MLR-LL--
    MK-ILLL-
    MPPVLILV
- No column is all gaps

Why do we do multiple alignments?
- Help predict the secondary and tertiary structures of new sequences
- Preliminary step in molecular evolution analysis, using phylogenetic methods to construct phylogenetic trees

Problem Definition
- Given strings S1, S2, ..., Sk, a multiple (global) alignment maps them to strings S1', S2', ..., Sk' that may contain spaces, where
    1. |S1'| = |S2'| = ... = |Sk'|, and
    2. The removal of spaces from Si' leaves Si, for 1 ≤ i ≤ k

Scoring Methods
- There are various scoring methods; one of them is the "sum-of-pairs" score.
- The sum-of-pairs (SP) value for a multiple global alignment A of k strings is the sum of the values of all C(k,2) = k(k-1)/2 pairwise alignments induced by A.

Finding Optimal MSAs
- An optimal MSA is an alignment with minimum overall cost or maximum overall similarity.
- Finding an optimal MSA is NP-complete.
- The practical aim for multiple alignment is therefore to obtain an alignment that is very close to the optimal alignment.

Overview of Methods
- Dynamic programming: too computationally expensive to do a complete search; uses heuristics
- Progressive: starts with pairwise alignment of the most similar sequences; adds to that
- Iterative: makes an initial alignment of groups of sequences, then adds to these (e.g. genetic algorithms)
- Locally conserved patterns
- Statistical and probabilistic methods

Choosing sequences for alignment
- The more sequences to align the better.
- Don’t include similar (>80%) sequences.
- Sub-groups should be pre-aligned separately, and one member of each subgroup should be included in the final multiple alignment.

Dynamic Programming contd.
- For k sequences, we construct a k-dimensional lattice. Any point on this lattice can be represented by the k-tuple (x1, x2, x3, ..., xk), where xi corresponds to the i-th sequence in the alignment and ranges from 0 to the length of the ungapped i-th sequence.

Dynamic Programming contd.
- Representing an alignment in the lattice.
- Consider the following alignment
- Start from the point (0,0,0). Move column-wise along the alignment. If you find a residue in the i-th sequence, increase the value of xi by 1; otherwise leave it unchanged.

Dynamic Programming contd.
- The alignment path for the 3 sequences

Carrillo-Lipman method
- Since Ao is the optimal alignment,
    $c(A^o) - c(A^h) \ge 0$    ...(1)
- Let us define $c(A|_{ij})$ as the score of the projection of A onto sequences i and j, calculated like a normal pairwise alignment. From the basic definition of the score of the multiple alignment, we can write
    $c(A) = \sum_{i<j} c(A|_{ij})$    ...(2)

Carrillo-Lipman method
- Also, since $A^o|_{ij}$ need not be the optimal pairwise alignment of Si and Sj, whose score is $d(S_i, S_j)$, we have
    $c(A^o|_{ij}) \le d(S_i, S_j)$    ...(3)
- Using (1), (2) and (3), we can show that, for any pair (p, q),
    $\sum_{i<j} [c(A^h|_{ij}) - d(S_i, S_j)] + d(S_p, S_q) \le c(A^o|_{pq})$
- which can be written as
    $U - L + d(S_p, S_q) \le c(A^o|_{pq})$

Carrillo-Lipman method
- where
    $U = \sum_{i<j} c(A^h|_{ij})$ and $L = \sum_{i<j} d(S_i, S_j)$
- In the above inequality we have derived a minimum score for the projection of the optimal alignment onto Sp and Sq. Thus, we can safely ignore all paths whose projection scores less than this minimum.

Manual alignments
- Consider the following sequences

Manual alignments
- The scores for each pairwise alignment are shown below

Manual alignments
- The table shows that S1 is the most similar sequence to all other sequences.
- Use these pairwise alignments to construct the multiple alignment

Manual alignment
- Far less running time: O(n^2 k^2 + nl) for k sequences of n residues each and a total gapped length l.
- However, these methods do not optimize the score of the MSA.
- Tradeoff between practicality and optimization.
- These methods cannot guarantee a unique MSA for a set of sequences; the result depends on the order in which sequences are added.

Progressive alignment
- Compare pairwise alignments and then add sequences based on a given order of similarity
- General approach
    - construct the pairwise alignments of the sequences.
    - construct a "rough" similarity tree.
    - combine the alignments from the most closely related groups to the most distant, following the "once a gap, always a gap" strategy: once a gap is added we don’t change it.

Step 1: Construct the Distance Matrix.
- Construct the pairwise alignments of the sequences in the set and find the distances between them.
- Distance between two sequences:
    - Count the mismatches, insertions and deletions between the sequences and divide by the total length of the sequence, OR
    - Assign a distance to each mismatch and gap, and add up the distances over the entire sequence length.

Step 2: Construct "Rough" Similarity Tree.
- Every node represents a sequence
- The length of the arms is proportional to the distance between the sequences.
- The distances are additive.
- The distance matrix is fed into an algorithm that builds a tree relating these sequences (neighbor-joining tree).

[Figure: Neighbor Joining Tree with branch lengths, relating Hbb_human, Hbb_horse, Hba_human, Hba_horse, Myg_whale, Cyhg_lamprey and Lob lupus. Note: figure not drawn to scale.]

- Distance between Hbb_human and Hbb_horse is: .081 + .084 = .165 (close to .17 in the matrix)

Aligning alignments
- To use dynamic programming on this pair of alignments:
    - Convert them to single sequences
    - Define a scoring scheme for them.
- To score the pair of alignments, we score each column of one alignment against each column of the other
    - column 1 of alignment 1 is (A C),
    - column 2 of alignment 2 is (C A A G)
- Define S(1,2) as the score of column 1 against column 2.

Aligning alignments
- (A C) (C A A G)
    S(1,2) = 1/8 { score(A,C) + score(A,A) + score(A,A) + score(A,G) + score(C,C) + score(C,A) + score(C,A) + score(C,G) }
- Similarly define S(1,1), S(1,3), and so on for every pair of columns.
- Construct a scoring matrix S for the pair of alignments.

Weighting Sequences
- While aligning the sequences, we are just averaging the scores of individual pairs. This is problematic if we have very similar sequences.
- Use the tree to weight our sequences, with highly diverged sequences getting larger weights
- Use the length from the root to each sequence to compute its weight: increased weights for more divergent sequences
- If 2 or more sequences share a branch, the length of the branch is split among the sequences: reduced weight for related sequences
- Use these weights instead of the average

Progressive alignment
- It is more accurate than the manual method
- The distances are not the true evolutionary distances.
- While calculating distance we count any substitution A->B as a single substitution.
- It may be the result of multiple substitutions at that position, A->C->B.
- While aligning the sequences, we are just averaging the scores of individual pairs. This is problematic if we have very similar sequences.

ClustalW - for multiple alignment
- ClustalW is a general purpose multiple alignment program for DNA or proteins.
- http://www.ebi.ac.uk/clustalw/
- It can create multiple alignments, manipulate existing alignments, do profile analysis and create phylogenetic trees.
- Alignment can be done by 2 methods:
    - slow/accurate
    - fast/approximate

ClustalW - for multiple alignment
- Based on phylogenetic analysis
- A phylogenetic tree is created using a pairwise distance matrix and the neighbor-joining algorithm
- The most closely related pairs of sequences are aligned using dynamic programming
- Each of the alignments is analyzed and a profile of it is created
- Alignment profiles are aligned progressively for a total alignment

[Figure: ClustalW Example - Run. Screenshot of the ClustalW submission form showing the alignment, gap, phylogenetic tree and output options, with three hemoglobin sequences pasted in, including HBB_HUMAN and HBA_HUMAN.]

ClustalW Example - Results
- Number of sequences: 3
- Alignment score: 1467
- Sequence format: Pearson
- Sequence type: aa
- ClustalW version: 1.82
- Output file: clustalw-20050817-03111083.output
- Alignment file: clustalw-20050817-03111083.aln
- Guide tree file: clustalw-20050817-03111083.dnd
- Your input file: clustalw-20050817-03111083.input

[Figure: ClustalW Example - Output file]

[Figure: ClustalW Example - JalView]

Other Programs - Dialign
- A local algorithm
- Aligns whole segments rather than single residues
- All pairwise alignments are performed and all aligned ungapped regions are picked up.
- These regions appear as diagonals on a dotplot.
- A consistent set of diagonals is determined and iteratively added to the alignment

Other Programs - Poa
- Progressive algorithm, employing partially ordered graphs, as opposed to generalized profiles, to represent aligned sequences
- Generalized profiles are only accurate when sequences are related solely through a process of insertions, deletions and mutations.
- Partially ordered graphs can represent global cut-and-paste operations, and thus reflect the biological contents of multiple alignments more accurately.
- Problems caused by the inherent loss of information in generalized profiles are therefore avoided.
- The two most similar sequences are determined and aligned, and all other sequences are added to this one profile in a stepwise fashion.
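
Worked sketch: sum-of-pairs score. The SP value from the Scoring Methods slide can be illustrated on the 4-sequence example alignment. The slides do not fix a pairwise scoring scheme, so the unit cost below (0 for equal symbols, 1 otherwise) and the function names are assumptions made only for illustration.

```python
from itertools import combinations


def pair_cost(a, b):
    """Hypothetical unit cost: 0 if the symbols are equal (gap vs gap included), else 1."""
    return 0 if a == b else 1


def induced_pair_cost(row_a, row_b):
    """Cost of the pairwise alignment that the MSA induces on two of its rows."""
    return sum(pair_cost(a, b) for a, b in zip(row_a, row_b))


def sum_of_pairs(alignment):
    """Add up the induced pairwise costs over all C(k, 2) pairs of rows."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(induced_pair_cost(a, b) for a, b in combinations(alignment, 2))


# The example MSA from the Global Multiple Sequence Alignment slide.
msa = ["MQPILLLV", "MLR-LL--", "MK-ILLL-", "MPPVLILV"]
print(sum_of_pairs(msa))
```

With a similarity matrix instead of a unit cost, only `pair_cost` would change; the sum over pairs stays the same.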
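
Worked sketch: alignment as a lattice path. The rule from the Dynamic Programming slides (start at the origin, walk the columns, increase xi whenever sequence i has a residue) can be written in a few lines. The 3-sequence alignment used here is a made-up stand-in, because the alignment shown on the original slide is not part of this preview.

```python
def alignment_to_path(rows, gap="-"):
    """Return the lattice points visited by an alignment, starting at the origin."""
    point = [0] * len(rows)
    path = [tuple(point)]
    for column in zip(*rows):
        for i, symbol in enumerate(column):
            if symbol != gap:
                point[i] += 1  # a residue of sequence i was consumed in this column
        path.append(tuple(point))
    return path


# Hypothetical 3-sequence alignment, not the one from the slide.
rows = ["A-C", "AG-", "-GC"]
for lattice_point in alignment_to_path(rows):
    print(lattice_point)
```

The last point of the path always equals the tuple of ungapped sequence lengths, which is the far corner of the lattice.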
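
Worked sketch: Carrillo-Lipman bound. The quantities U, L and d(Sp, Sq) from the Carrillo-Lipman slides can be computed for a small case. This is only a sketch under assumptions: scores are similarities (higher is better), the scoring scheme (+1 match, -1 mismatch, -1 per gap) is invented for the example, and the 4-sequence example MSA stands in for the alignment Ah.

```python
from itertools import combinations

MATCH, MISMATCH, GAP_PENALTY = 1, -1, -1  # hypothetical scoring scheme


def optimal_pair_score(s, t):
    """d(Si, Sj): optimal global pairwise score (Needleman-Wunsch, linear gaps)."""
    prev = [j * GAP_PENALTY for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * GAP_PENALTY]
        for j in range(1, len(t) + 1):
            sub = MATCH if s[i - 1] == t[j - 1] else MISMATCH
            cur.append(max(prev[j - 1] + sub, prev[j] + GAP_PENALTY, cur[j - 1] + GAP_PENALTY))
        prev = cur
    return prev[-1]


def projection_score(row_a, row_b):
    """c(A|ij): score of the pairwise alignment induced by two rows of A."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # gap-gap columns vanish in the projection
        if a == "-" or b == "-":
            total += GAP_PENALTY
        else:
            total += MATCH if a == b else MISMATCH
    return total


def carrillo_lipman_bounds(heuristic_rows):
    """Return ({(p, q): U - L + d(Sp, Sq)}, {(i, j): d(Si, Sj)})."""
    seqs = [row.replace("-", "") for row in heuristic_rows]
    pairs = list(combinations(range(len(seqs)), 2))
    d = {(i, j): optimal_pair_score(seqs[i], seqs[j]) for i, j in pairs}
    U = sum(projection_score(heuristic_rows[i], heuristic_rows[j]) for i, j in pairs)
    L = sum(d.values())
    return {pq: U - L + d[pq] for pq in pairs}, d


heuristic = ["MQPILLLV", "MLR-LL--", "MK-ILLL-", "MPPVLILV"]
bounds, d = carrillo_lipman_bounds(heuristic)
for pq in bounds:
    print(pq, "lower bound:", bounds[pq], "optimal pairwise score:", d[pq])
```

Because every projection scores at most the optimal pairwise score, U is never larger than L, so each bound is at most d(Sp, Sq); the tighter the stand-in alignment, the more of the lattice can be pruned.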
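
Worked sketch: distance matrix for Step 1. The first distance definition (mismatches, insertions and deletions divided by the length) can be sketched as below. Two choices here are assumptions: the pairwise alignments are taken from the rows of the example MSA rather than computed separately, and the divisor is the length of the pairwise alignment after dropping gap-gap columns.

```python
def alignment_distance(row_a, row_b, gap="-"):
    """Fraction of columns with a mismatch or an insertion/deletion."""
    columns = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    if not columns:
        return 0.0
    differences = sum(1 for a, b in columns if a != b)
    return differences / len(columns)


def distance_matrix(rows):
    """Symmetric matrix of pairwise distances between aligned rows."""
    k = len(rows)
    return [[alignment_distance(rows[i], rows[j]) for j in range(k)] for i in range(k)]


aligned = ["MQPILLLV", "MLR-LL--", "MK-ILLL-", "MPPVLILV"]
for row in distance_matrix(aligned):
    print(["%.3f" % value for value in row])
```

A matrix like this is what would be handed to a tree-building step such as neighbor joining in Step 2.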
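
Worked sketch: scoring one column against another. The S(1,2) formula from the Aligning alignments slides averages the residue scores over all cross pairs of the two columns. The residue scoring used below (1 for identical residues, 0 otherwise) is a placeholder, since the slides leave score(a, b) unspecified, and the two small alignments are invented so that their first columns are (A C) and (C A A G).

```python
def residue_score(a, b):
    """Hypothetical scoring scheme: 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0


def column_score(col_a, col_b):
    """Average of score(a, b) over every residue a in col_a and b in col_b."""
    total = sum(residue_score(a, b) for a in col_a for b in col_b)
    return total / (len(col_a) * len(col_b))


def column_score_matrix(alignment_1, alignment_2):
    """S[i][j] = score of column i of alignment 1 against column j of alignment 2."""
    cols_1 = list(zip(*alignment_1))
    cols_2 = list(zip(*alignment_2))
    return [[column_score(c1, c2) for c2 in cols_2] for c1 in cols_1]


# The columns (A C) and (C A A G) from the slide: 8 cross pairs, 3 of them identical.
print(column_score("AC", "CAAG"))

# Hypothetical alignments whose first columns are (A C) and (C A A G).
S = column_score_matrix(["AG", "CG"], ["CT", "AT", "AT", "GT"])
print(S)
```

The matrix S can then play the role of the substitution scores in an ordinary pairwise dynamic programming alignment of the two alignments.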
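
Worked sketch: sequence weights from the tree. The rule from the Weighting Sequences slide (sum the branch lengths from the root to a sequence, splitting each shared branch equally among the sequences below it) can be sketched on a toy tree. The tree encoding as nested `(branch_length, content)` tuples, the names S1 to S3 and the branch lengths are all made up for this example.

```python
def leaf_names(node):
    """List the sequence names below a node; a leaf's content is its name."""
    _, content = node
    if isinstance(content, str):
        return [content]
    return [name for child in content for name in leaf_names(child)]


def sequence_weights(node, inherited=0.0, weights=None):
    """Weight of a leaf = sum over its root path of branch_length / leaves sharing it."""
    if weights is None:
        weights = {}
    length, content = node
    share = inherited + length / len(leaf_names(node))
    if isinstance(content, str):
        weights[content] = share
    else:
        for child in content:
            sequence_weights(child, share, weights)
    return weights


# Hypothetical tree: S1 and S2 are close relatives sharing a 0.2 branch; S3 is divergent.
tree = (0.0, [(0.2, [(0.1, "S1"), (0.1, "S2")]), (0.4, "S3")])
print(sequence_weights(tree))
```

In this toy tree S1 and S2 each receive half of their shared branch, so the divergent S3 ends up with the largest weight, as the slide intends.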