Bioinformatics: Importance and Methods for Optimal Multiple Sequence Alignment, Study notes of Computer Science

University of Florida (UF), Computer Science

These notes cover the importance of multiple sequence alignment (MSA) in bioinformatics, define the problem of finding optimal MSAs, explore various scoring methods, and outline methods for finding optimal MSAs, including dynamic programming, the Carrillo-Lipman method, and manual alignments.

Typology: Study notes

Pre 2010

Uploaded on 09/17/2009


CAP 5510-7 Multiple Sequence Alignment
BIOINFORMATICS
Su-Shing Chen, CISE
8/20/2005

Global Multiple Sequence Alignment (MSA)
Example: an MSA of the 4 sequences MQPILLLV, MLRLL, MKILLL, MPPVLILV:

    MQPILLLV
    MLR-LL--
    MK-ILLL-
    MPPVLILV

No column is all gaps.

Why do we do multiple alignments?
- To help predict the secondary and tertiary structures of new sequences.
- As a preliminary step in molecular evolution analysis, where phylogenetic methods use the alignment to construct phylogenetic trees.

Problem Definition
Given strings $S_1, S_2, \dots, S_k$, a multiple (global) alignment maps them to strings $S_1', S_2', \dots, S_k'$ that may contain spaces, where
1. $|S_1'| = |S_2'| = \dots = |S_k'|$, and
2. the removal of spaces from $S_i'$ leaves $S_i$, for $1 \le i \le k$.

Scoring Methods
There are various scoring methods, one of which is the "sum-of-pairs" score. The sum-of-pairs (SP) value for a multiple global alignment $A$ of $k$ strings is the sum of the values of all $\binom{k}{2}$ pairwise alignments induced by $A$.

Finding Optimal MSAs
An optimal MSA is an alignment with minimum overall cost or maximum overall similarity. Finding one is NP-complete, so the practical aim of multiple alignment becomes obtaining an alignment that is very close to the optimal alignment.

Overview of Methods
- Dynamic programming: too computationally expensive for a complete search; uses heuristics.
- Progressive: starts with a pairwise alignment of the most similar sequences and adds to that.
- Iterative: makes an initial alignment of groups of sequences and adds to these (e.g. genetic algorithms).
- Locally conserved patterns.
- Statistical and probabilistic methods.

Choosing sequences for alignment
- The more sequences to align the better.
- Don't include sequences that are more than 80% similar.
- Sub-groups should be pre-aligned separately, and one member of each subgroup should be included in the final multiple alignment.

Dynamic Programming (contd.)
For $k$ sequences, we construct a $k$-dimensional lattice. Any point on this lattice can be represented by the $k$-tuple $(x_1, x_2, x_3, \dots, x_k)$, where $x_i$ corresponds to the $i$th sequence in the alignment and ranges over the entire length of the ungapped $i$th sequence.

Dynamic Programming (contd.)
Representing an alignment in the lattice. Consider the following alignment.
[Figure: example alignment]
- Start from the point $(0, 0, 0)$.
- Move column-wise along the alignment.
- If you find a residue in the $i$th sequence, increase the value of $x_i$ by 1; otherwise leave it unchanged.

Dynamic Programming (contd.)
[Figure: the alignment path for the 3 sequences]

Carrillo-Lipman method
Since $A^o$ is the optimal alignment, for any other alignment $A^h$ (for example, one found by a heuristic):

    $c(A^o) - c(A^h) \ge 0$    (1)

Let us define $c(A|_{ij})$ as the score of the projection of $A$ onto sequences $i$ and $j$, calculated like a normal pairwise alignment. From the basic definition of the score of the multiple alignment, we can write

    $c(A) = \sum_{i < j} c(A|_{ij})$    (2)

Carrillo-Lipman method
Also, since $A^o|_{ij}$ need not be the optimal pairwise alignment of $S_i$ and $S_j$, whose score is $d(S_i, S_j)$, we have

    $c(A^o|_{ij}) \le d(S_i, S_j)$    (3)

Using (1), (2) and (3), we can show that, for any pair $(p, q)$,

    $\sum_{i < j} [\, c(A^h|_{ij}) - d(S_i, S_j) \,] + d(S_p, S_q) \le c(A^o|_{pq})$

which can be written as

    $U - L + d(S_p, S_q) \le c(A^o|_{pq})$

Carrillo-Lipman method
Here $U$ and $L$ are the sums over all pairs $i < j$ given at the end of this slide. The inequality above gives a minimum score for each pairwise projection of the optimal alignment. Thus, we can safely ignore all paths whose projection scores less than this minimum.
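As a rough illustration of the bound above, the sketch below computes the sum-of-pairs score of a heuristic alignment from its pairwise projections, then the lower bound $U - L + d(S_p, S_q)$ for every pair. Here $U$ is taken as the sum of the heuristic alignment's projection scores and $L$ as the sum of the optimal pairwise scores. The scoring values (match +1, mismatch -1, gap -2), the function names, and the toy sequences are assumptions made for this example, and the optimal pairwise scores `d` are supplied by the caller rather than computed here.

```python
from itertools import combinations

# Hypothetical scoring scheme (not from the slides): higher is better.
MATCH, MISMATCH, GAP = 1, -1, -2

def projection_score(row_i, row_j):
    """Score the pairwise alignment induced by two rows of an MSA."""
    score = 0
    for a, b in zip(row_i, row_j):
        if a == "-" and b == "-":
            continue  # a column that is all gaps in the projection is dropped
        if a == "-" or b == "-":
            score += GAP
        elif a == b:
            score += MATCH
        else:
            score += MISMATCH
    return score

def sum_of_pairs(alignment):
    """c(A): sum of the projection scores over all pairs i < j."""
    return sum(projection_score(alignment[i], alignment[j])
               for i, j in combinations(range(len(alignment)), 2))

def carrillo_lipman_bounds(heuristic_alignment, optimal_pair_scores):
    """Lower bound U - L + d(S_p, S_q) on c(A^o|pq) for every pair (p, q)."""
    U = sum_of_pairs(heuristic_alignment)
    L = sum(optimal_pair_scores.values())
    return {pair: U - L + d for pair, d in optimal_pair_scores.items()}

# Toy heuristic alignment of ACT, AGT and ACGT (made-up data).
heuristic = ["AC-T", "A-GT", "ACGT"]
# Optimal pairwise scores for those sequences under the scheme above.
d = {(0, 1): 1, (0, 2): 1, (1, 2): 1}
bounds = carrillo_lipman_bounds(heuristic, d)
```

In this toy case $U = 0$ and $L = 3$, so every pair gets the bound -2; a lattice path whose projection onto a pair scores below that pair's bound cannot belong to an optimal alignment.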
In the Carrillo-Lipman bound, $U = \sum_{i < j} c(A^h|_{ij})$ and $L = \sum_{i < j} d(S_i, S_j)$.

Manual alignments
Consider the following sequences.
[Figure: example sequences]

Manual alignments
The scores for each pairwise alignment are shown below.
[Table: pairwise alignment scores]

Manual alignments
The table shows that S1 is the most similar sequence to all other sequences. Use these pairwise alignments to construct the multiple alignment.

Manual alignment
- Far less running time: $O(n^2 k^2 + nl)$ for $k$ sequences of $n$ residues each and a total gapped length $l$.
- However, these methods do not optimize the score of the MSA: there is a tradeoff between practicality and optimization.
- These methods cannot guarantee a unique MSA for a set of sequences; the result depends on the order in which the sequences are added.

Progressive alignment
Compare pairwise alignments and then add sequences based on a given order of similarity.
General approach:
- Construct the pairwise alignments of the sequences.
- Construct a "rough" similarity tree.
- Combine the alignments from the most closely related groups to the most distant, following the "once a gap, always a gap" strategy: once a gap is added, we don't change it.

Step 1: Construct the Distance Matrix
Construct the pairwise alignments of the sequences in the set and find the distances between them. The distance between two sequences can be computed in either of two ways:
- Count the mismatches, insertions and deletions between the sequences and divide by the total length of the sequence, OR
- Assign a distance to each mismatch and gap, and add up the distances over the entire sequence length.

Step 2: Construct "Rough" Similarity Tree
- Every node represents a sequence.
- The length of the arms is proportional to the distance between the sequences.
- The distances are additive.
- The distance matrix is fed into an algorithm that builds a tree relating these sequences (a neighbor-joining tree).

Neighbor Joining Tree
[Figure: neighbor-joining tree with branch lengths for Hbb_human, Hbb_horse, Hba_human, Hba_horse, Myg_whale, Cyhg_lamprey and Lob lupus. Note: figure not drawn to scale.]

The distance between Hbb_human and Hbb_horse is 0.081 + 0.084 = 0.165 (close to 0.17 in the matrix).

Aligning alignments
To use dynamic programming alignment on a pair of alignments:
- Convert them to single sequences.
- Define a scoring scheme for them.
To score the pair of alignments, we score each column of one alignment against each column of the other. For example, column 1 of alignment 1 is (A C) and column 2 of alignment 2 is (C A A G). Define S(1,2) as the score of column 1 against column 2.

Aligning alignments
For the columns (A C) and (C A A G):

    $S(1,2) = \frac{1}{8} [\, \mathrm{score}(A,C) + \mathrm{score}(A,A) + \mathrm{score}(A,A) + \mathrm{score}(A,G) + \mathrm{score}(C,C) + \mathrm{score}(C,A) + \mathrm{score}(C,A) + \mathrm{score}(C,G) \,]$

that is, the average over all 2 x 4 = 8 residue pairs. Similarly define S(1,1), S(1,3), and so on, and construct a scoring matrix S for the pair of alignments.

Weighting Sequences
- While aligning the sequences, we are just averaging the scores of individual pairs. This is problematic if we have very similar sequences.
- Use the tree to weight the sequences, with highly diverged sequences getting larger weights.
- Use the length from the root to each sequence to compute its weight: increased weights for more divergent sequences.
- If 2 or more sequences share a branch, the length of that branch is split among them: reduced weight for related sequences.
- Use these weights instead of the plain average.

Progressive alignment
- It is more accurate than the manual method.
- The distances are not the true evolutionary distances. While calculating the distance, we count any substitution A->B as a single substitution, but it may be the result of multiple substitutions at that position, e.g. A->C->B.
- While aligning the sequences, we are just averaging the scores of individual pairs. This is problematic if we have very similar sequences.

ClustalW - for multiple alignment
ClustalW is a general-purpose multiple alignment program for DNA or proteins.
http://www.ebi.ac.uk/clustalw/
It can create multiple alignments, manipulate existing alignments, do profile analysis and create phylogenetic trees.
Alignment can be done by 2 methods:
- slow/accurate
- fast/approximate

ClustalW - for multiple alignment
Based on phylogenetic analysis:
- A phylogenetic tree is created using a pairwise distance matrix and the neighbor-joining algorithm.
- The most closely related pairs of sequences are aligned using dynamic programming.
- Each of the alignments is analyzed and a profile of it is created.
- Alignment profiles are aligned progressively for a total alignment.

ClustalW Example - Run
[Screenshot: ClustalW submission form with its alignment and output options and pasted hemoglobin sequences, including HBB_HUMAN and HBA_HUMAN]
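The first ClustalW steps listed above (a pairwise distance matrix, then starting from the most closely related pair) can be sketched roughly as follows. This is not ClustalW's own code. It assumes the simple distance from the earlier "Step 1" slide, taken here as the minimum number of mismatches, insertions and deletions (the edit distance) divided by the length of the longer sequence, which is only one possible reading of "total length". All function names are hypothetical.

```python
from itertools import combinations

def edit_distance(s, t):
    """Minimum number of mismatches, insertions and deletions turning s into t."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (a != b)))  # match or mismatch
        prev = curr
    return prev[-1]

def distance_matrix(seqs):
    """Symmetric matrix of edit distances, normalised by the longer length
    (the normalisation is an assumption of this sketch)."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = edit_distance(seqs[i], seqs[j]) / max(len(seqs[i]), len(seqs[j]))
        dist[i][j] = dist[j][i] = d
    return dist

def closest_pair(dist):
    """Indices (i, j), i < j, of the two most similar sequences."""
    n = len(dist)
    return min(combinations(range(n), 2), key=lambda p: dist[p[0]][p[1]])

# The four sequences from the first example slide.
seqs = ["MQPILLLV", "MLRLL", "MKILLL", "MPPVLILV"]
dist = distance_matrix(seqs)
first_pair = closest_pair(dist)  # the pair a progressive method would align first
```

A real implementation would derive the distances from scored pairwise alignments and feed the matrix to a tree-building step; the sketch stops at choosing the first pair.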
ClustalW Example - Results
Results of search:
- Number of sequences: 3
- Alignment score: 1467
- Sequence format: Pearson
- Sequence type: aa
- ClustalW version: 1.82
- Output file: clustalw-20050817-03111083.output
- Alignment file: clustalw-20050817-03111083.aln
- Guide tree file: clustalw-20050817-03111083.dnd
- Your input file: clustalw-20050817-03111083.input

ClustalW Example - Output file
[Screenshot: ClustalW output file]

ClustalW Example - JalView
[Screenshot: the alignment viewed in JalView]

Other Programs - Dialign
- A local algorithm.
- Aligns whole segments rather than single residues.
- All pairwise alignments are performed and all aligned ungapped regions are picked up. These regions appear as diagonals on a dotplot.
- A consistent set of diagonals is determined and iteratively added to the alignment.

Other Programs - Poa
- A progressive algorithm that uses partially ordered graphs, as opposed to generalized profiles, to represent aligned sequences.
- Generalized profiles are only accurate when sequences are related solely by a process of insertions, deletions and mutations.
- Partially ordered graphs can represent global cut-and-paste operations, and thus reflect the biological content of multiple alignments more accurately.
- Problems caused by the inherent loss of information in generalized profiles are therefore avoided.
- The two most similar sequences are determined and aligned, and all other sequences are added to this one profile in a stepwise fashion.
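To make Dialign's notion of diagonals concrete, here is a minimal sketch that picks up ungapped matching regions between two sequences as runs along the diagonals of a dotplot. It is a simplification and not Dialign's actual procedure: it assumes exact residue identity instead of a scored similarity, and the function name and the `min_len` threshold are hypothetical.

```python
def ungapped_diagonals(s, t, min_len=2):
    """Return (start_in_s, start_in_t, length) for each maximal run of
    identical residues along a dotplot diagonal, keeping runs >= min_len."""
    hits = []
    # A diagonal is identified by its offset d = j - i.
    for d in range(-(len(s) - 1), len(t)):
        i = max(0, -d)
        j = i + d
        run = 0
        while i < len(s) and j < len(t):
            if s[i] == t[j]:
                run += 1
            else:
                if run >= min_len:
                    hits.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        if run >= min_len:  # a run that reaches the end of the diagonal
            hits.append((i - run, j - run, run))
    return hits

# Two of the sequences from the first example slide.
segments = ungapped_diagonals("MKILLL", "MQPILLLV", min_len=3)
```

For MKILLL and MQPILLLV with `min_len=3` this yields the single segment ILLL, starting at index 2 in the first sequence and index 3 in the second. Choosing a consistent set of such segments across all sequence pairs is the harder part and is not attempted here.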
