Multiple Sequence Alignment
BME 110: CompBio Tools
Todd Lowe
April 22, 2008

Multiple Sequence Alignment
• Multiple sequence alignment is probably the single most important bioinformatics tool.
• Many applications require accurate MSAs
  • PSI-BLAST
  • Family and domain classification
  • Pattern identification
  • Structure prediction
    • secondary structure
    • fold recognition
  • Phylogeny
  • Full-genome alignments in browsers

Methods
• Full Dynamic Programming
  • Gives the optimal solution, but prohibitively slow and memory intensive for >6-8 sequences
  • MSA program
• Progressive Alignment
  • ClustalW
    • http://www.ebi.ac.uk/clustalw/index.html (most commonly used)
  • Tcoffee
    • http://igs-server.cnrs-mrs.fr/Tcoffee/ (a little better, but slower)
• Iterative
  • better than progressive methods, but slower
  • Dialign
• HMMs

Progressive Alignment
1. Calculate global pairwise alignments for all pairs
  • Needleman and Wunsch
2. Use pairwise alignment scores to calculate a guide tree describing the distance between all pairs of sequences
3. Align the sequences progressively
  • Start with the two most closely related sequences
  • Add in sequences in order of increasing distance
• ClustalW uses this method

ClustalW Example
• Input: 5 sequences detected by BLASTp using human SNAP-25 as a query
• Default parameters, output order: input

>sp_P13795
MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQGEQLERIEEGMDQINKD
MKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNNQDGVVASQPARVVDEREQMAISGGFIRRVTND
ARENEMDENLEQVSGIIGNLRHMALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG
>gi_31242623
MPAAAPPAENGAAVPKTELQELQMKQQQVVDESLDSTRRMLALCEESTEVGMRTIVMLDEQGEQLDRIEE
GMDQINADMREAEKNLSGMEKCCGICVLPCNKSASFKEDDGTWKGNDDGKVVNNQPQRVMDDRNGLGPQA
GYIGRITNDAREDEMEENMGQVNTMIGNLRNMALDMGSELENQNRQIDRINRKGDSNATRIAAANERAHD
LLK
>gi_3822409
MPTTAEPAQENGAPRSELQELQLKAGQVTDETLESTRRMLALCEESKEAGIRTLVALDDQGEQLERIEEN
MDQINADMKEAEKNLTGMEKFCGLCVLPWNKSAPFKENEDAWKGNDDGKVVNNQPQRVMDDGSGLGPQGG
YIGRITNDAREDEMEENVGQVNTMIGNLRNMAIDMGSELENQNRQIDRIKNKAEM
>gi_39593308
MSARRGAPGGQRHPRPYAVEPTVDINGLVLPADMSDELKGLNVGIDEKTIESLESTRRMLALCEESKEAG
IKTLVMLDDQGEQLERCEGALDTINQDMKEAEDHLKGMEKCCGLCVLPWNKTDDFEKNSEYAKAWKKDDD
GGVISDQPRITVGDPTMGPQGGYITKITNDAREDEMDENIQQVSTMVGNLRNMAIDMSTEVSNQNRQLDR
IHDKAQSNEVRVESANKRAKNLITK
>gi_32567202
MSGDDDIPEGLEAINLKMNATTDDSLESTRRMLALCEESKEAGIKTLVMLDDQGEQLERCEGALDTINQD
MKEAEDHLKGMEKCCGLCVLPWNKTDDFEKTEFAKAWKKDDDGGVISDQPRITVGDSSMGPQGGYITKIT
NDAREDEMDENVQQVSTMVGNLRNMAIDMSTEVSNQNRQLDRIHDKAQSNEVRVESANKRAKNLITK

ClustalW Guide Tree
• The guide tree shows the distances between sequences obtained from the initial pairwise alignments
• It gives the order in which sequences were added to the MSA
• The guide tree is not a phylogenetic tree (it’s just a rough estimate of similarity); however, a true phylogenetic tree can be generated after making an alignment

Progressive Alignment
• Greedy algorithm
  • Breaks problem up into smaller problems
  • Finds best solution to each small problem
  • Combines solutions to get answer to whole problem
• Not necessarily the globally optimal answer
  • Doesn’t use all information in solving sub-problems
  • Suboptimal answers for small problems may combine to give a better overall answer
• Gaps: once created, they stay as part of the alignment for the rest of the alignment iterations

ClustalW Alignment

CLUSTAL W (1.82) multiple sequence alignment

sp_P13795       ---MAEDAD------------------------MRNELEEMQRRADQLADESLESTRRML 33
gi_31242623     MPAAAPPAENG-------------------AAVPKTELQELQMKQQQVVDESLDSTRRML 41
gi_3822409      MPTTAEPAQE--------------------NGAPRSELQELQLKAGQVTDETLESTRRML 40
gi_39593308     MSARRGAPGGQRHPRPYAVEPTVDINGLVLPADMSDELKGLNVGIDEKTIESLESTRRML 60
gi_32567202     MSGDDDIPEG---------------------------LEAINLKMNATTDDSLESTRRML 33
                       .                             *: ::      . ::*:******

sp_P13795       QLVEESKDAGIRTLVMLDEQGEQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCN 93
gi_31242623     ALCEESTEVGMRTIVMLDEQGEQLDRIEEGMDQINADMREAEKNLSGMEKCCGICVLPCN 101
gi_3822409      ALCEESKEAGIRTLVALDDQGEQLERIEENMDQINADMKEAEKNLTGMEKFCGLCVLPWN 100
gi_39593308     ALCEESKEAGIKTLVMLDDQGEQLERCEGALDTINQDMKEAEDHLKGMEKCCGLCVLPWN 120
gi_32567202     ALCEESKEAGIKTLVMLDDQGEQLERCEGALDTINQDMKEAEDHLKGMEKCCGLCVLPWN 93
                 * ***.:.*::*:* **:*****:* *  :* ** **:***.:*..: * **:** * *

sp_P13795       KLKSSDA---YKKAWGNNQDG-VVASQPARVVDEREQMAISGGFIRRVTNDARENEMDEN 149
gi_31242623     KSASFKE---DDGTWKGNDDGKVVNNQPQRVMDDRNGLGPQAGYIGRITNDAREDEMEEN 158
gi_3822409      KSAPFKE---NEDAWKGNDDGKVVNNQPQRVMDDGSGLGPQGGYIGRITNDAREDEMEEN 157
gi_39593308     KTDDFEKNSEYAKAWKKDDDGGVISDQPRITVGDPT-MGPQGGYITKITNDAREDEMDEN 179
gi_32567202     KTDDFEK-TEFAKAWKKDDDGGVISDQPRITVGDSS-MGPQGGYITKITNDAREDEMDEN 151
                *    .       :*  ::** *: .**  .:.:   :. ..*:* ::******:**:**

sp_P13795       LEQVSGIIGNLRHMALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG 206
gi_31242623     MGQVNTMIGNLRNMALDMGSELENQNRQIDRINRKGDSNATRIAAANERAHDLLK-- 213
gi_3822409      VGQVNTMIGNLRNMAIDMGSELENQNRQIDRIKNKAEM------------------- 195
gi_39593308     IQQVSTMVGNLRNMAIDMSTEVSNQNRQLDRIHDKAQSNEVRVESANKRAKNLITK- 235
gi_32567202     VQQVSTMVGNLRNMAIDMSTEVSNQNRQLDRIHDKAQSNEVRVESANKRAKNLITK- 207
                : **. ::****:**:**..*:..****:***  *.:

Graphical - Jalview
• Postscript, PDF, HTML
  • Looks pretty and very visually informative
  • Completely useless for further computational analysis. DO NOT SAVE GRAPHICS AS YOUR ONLY OUTPUT
• Jalview: Java alignment editor (http://www.jalview.org)
  • Available as an online applet or as an application
  • Makes nice pictures and allows interactive editing

Sequence Logos
• Logos are another useful visualization of alignments that allows conserved positions to be easily picked out.
• Multiple tools are available on the web or can be downloaded:
  • http://weblogo.berkeley.edu

Tcoffee
• Makes a library of pairwise global and several local alignments
• Tries to find a multiple alignment that has the best consensus with all alignments in the library
• Still a progressive algorithm
• Slower, but usually a bit better than ClustalW

Which Sequences?
• Don’t include too many
  • Alignment programs are VERY slow with many sequences
  • Start with 10-15 or so.
• Closely related sequences are easy to align, but less informative. The converse is true for more distantly related sequences
• No identical sequences
• Each sequence should be 30-70% identical to at least half of the other sequences.

Strategies
• Visually inspect the alignment and try eliminating sequences that seem problematic.
• Avoid sequences with long insertions and/or terminal extensions
• “Orphan” sequences (highly divergent members of a family) usually don’t disrupt the alignment because they’re the last to be aligned.

Collections of MSAs
• Domain and family collection databases not only have sequences grouped by domain/family, but also have the MSAs that were used for classification.
• Example: Pfam http://pfam.janelia.org/
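
Checking the "Which Sequences?" guidelines

The guidelines under "Which Sequences?" (no identical sequences, each sequence 30-70% identical to at least half of the others) can be checked roughly from aligned rows. The Python sketch below is only an illustration: the function names `percent_identity` and `identity_report` are hypothetical, and skipping any column where either row has a gap is just one of several conventions for computing identity. The sample rows are the second block of the ClustalW alignment shown earlier, so the numbers describe that block only, not the full-length sequences.

```python
from itertools import combinations


def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percent identity between two aligned rows of equal length.

    Assumption: columns where either row has a gap are skipped.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def identity_report(alignment: dict, low: float = 30.0, high: float = 70.0):
    """Return (in_range, identical).

    in_range[name] counts how many other rows are low..high % identical
    to that row; identical lists the pairs that are 100% identical.
    """
    in_range = {name: 0 for name in alignment}
    identical = []
    for a, b in combinations(alignment, 2):
        pid = percent_identity(alignment[a], alignment[b])
        if pid == 100.0:
            identical.append((a, b))
        if low <= pid <= high:
            in_range[a] += 1
            in_range[b] += 1
    return in_range, identical


# Three rows from the second block of the ClustalW alignment (no gaps here).
block = {
    "sp_P13795": "QLVEESKDAGIRTLVMLDEQGEQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCN",
    "gi_39593308": "ALCEESKEAGIKTLVMLDDQGEQLERCEGALDTINQDMKEAEDHLKGMEKCCGLCVLPWN",
    "gi_32567202": "ALCEESKEAGIKTLVMLDDQGEQLERCEGALDTINQDMKEAEDHLKGMEKCCGLCVLPWN",
}

in_range, identical = identity_report(block)
print("others within 30-70% identity:", in_range)
print("identical pairs in this block:", identical)
```

A sequence whose count in `in_range` is less than half the number of other sequences, or which appears in `identical`, would be a candidate for removal under those guidelines. On real data the whole alignment should be passed in rather than a single block.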