Multiple Sequence Alignment
BME 110: CompBio Tools
Todd Lowe
April 28, 2009
Original Slides: Carol Rohl

Multiple Sequence Alignment
• Multiple sequence alignment (MSA) is one of the most important bioinformatics tools
• Many applications require accurate MSAs
  • PSI-BLAST
  • Family and domain classification
  • Pattern identification
  • Structure prediction
    • secondary structure
    • fold recognition
  • Phylogeny
  • Full-genome alignments in browsers

Progressive Alignment
1. Calculate global pairwise alignments for all pairs
  • Needleman and Wunsch
2. Use pairwise alignment scores to calculate a guide tree describing the distance between all pairs of sequences
3. Align the sequences progressively
  • Start with the two most closely related sequences
  • Add in sequences in order of increasing distance
• ClustalW uses this method

ClustalW Example
• Input: 5 sequences detected by BLASTp using human SNAP-25 as a query
• Default parameters, output order: input

>sp_P13795
MAEDADMRNELEEMQRRADQLADESLESTRRMLQLVEESKDAGIRTLVMLDEQGEQLERIEEGMDQINKD
MKEAEKNLTDLGKFCGLCVCPCNKLKSSDAYKKAWGNNQDGVVASQPARVVDEREQMAISGGFIRRVTND
ARENEMDENLEQVSGIIGNLRHMALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG
>gi_31242623
MPAAAPPAENGAAVPKTELQELQMKQQQVVDESLDSTRRMLALCEESTEVGMRTIVMLDEQGEQLDRIEE
GMDQINADMREAEKNLSGMEKCCGICVLPCNKSASFKEDDGTWKGNDDGKVVNNQPQRVMDDRNGLGPQA
GYIGRITNDAREDEMEENMGQVNTMIGNLRNMALDMGSELENQNRQIDRINRKGDSNATRIAAANERAHD
LLK
>gi_3822409
MPTTAEPAQENGAPRSELQELQLKAGQVTDETLESTRRMLALCEESKEAGIRTLVALDDQGEQLERIEEN
MDQINADMKEAEKNLTGMEKFCGLCVLPWNKSAPFKENEDAWKGNDDGKVVNNQPQRVMDDGSGLGPQGG
YIGRITNDAREDEMEENVGQVNTMIGNLRNMAIDMGSELENQNRQIDRIKNKAEM
>gi_39593308
MSARRGAPGGQRHPRPYAVEPTVDINGLVLPADMSDELKGLNVGIDEKTIESLESTRRMLALCEESKEAG
IKTLVMLDDQGEQLERCEGALDTINQDMKEAEDHLKGMEKCCGLCVLPWNKTDDFEKNSEYAKAWKKDDD
GGVISDQPRITVGDPTMGPQGGYITKITNDAREDEMDENIQQVSTMVGNLRNMAIDMSTEVSNQNRQLDR
IHDKAQSNEVRVESANKRAKNLITK
>gi_32567202
MSGDDDIPEGLEAINLKMNATTDDSLESTRRMLALCEESKEAGIKTLVMLDDQGEQLERCEGALDTINQD
MKEAEDHLKGMEKCCGLCVLPWNKTDDFEKTEFAKAWKKDDDGGVISDQPRITVGDSSMGPQGGYITKIT
NDAREDEMDENVQQVSTMVGNLRNMAIDMSTEVSNQNRQLDRIHDKAQSNEVRVESANKRAKNLITK

Input Formats
• FASTA format
  • Download from NCBI, ExPASy, EBI, Pfam …
• Sequence names should be
  • Unique
  • 15 characters or less
  • Comprised of only A-Z,a-z,0-9 and _ (Do not use #$%@|*!:;. or spaces)

Progressive Alignment
• Greedy algorithm
  • Breaks problem up into smaller problems
  • Finds best solution to each small problem
  • Combine solutions to get answer to whole problem
• Not necessarily the global answer
  • Doesn’t use all information in solving sub-problems
  • Suboptimal answers for small problems may combine to give a better overall answer
• Gaps: once created, they stay as part of alignment for rest of alignment iterations

ClustalW Alignment
CLUSTAL W (1.82) multiple sequence alignment

sp_P13795       ---MAEDAD------------------------MRNELEEMQRRADQLADESLESTRRML 33
gi_31242623     MPAAAPPAENG-------------------AAVPKTELQELQMKQQQVVDESLDSTRRML 41
gi_3822409      MPTTAEPAQE--------------------NGAPRSELQELQLKAGQVTDETLESTRRML 40
gi_39593308     MSARRGAPGGQRHPRPYAVEPTVDINGLVLPADMSDELKGLNVGIDEKTIESLESTRRML 60
gi_32567202     MSGDDDIPEG---------------------------LEAINLKMNATTDDSLESTRRML 33
                . *: :: . ::*:******

sp_P13795       QLVEESKDAGIRTLVMLDEQGEQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCN 93
gi_31242623     ALCEESTEVGMRTIVMLDEQGEQLDRIEEGMDQINADMREAEKNLSGMEKCCGICVLPCN 101
gi_3822409      ALCEESKEAGIRTLVALDDQGEQLERIEENMDQINADMKEAEKNLTGMEKFCGLCVLPWN 100
gi_39593308     ALCEESKEAGIKTLVMLDDQGEQLERCEGALDTINQDMKEAEDHLKGMEKCCGLCVLPWN 120
gi_32567202     ALCEESKEAGIKTLVMLDDQGEQLERCEGALDTINQDMKEAEDHLKGMEKCCGLCVLPWN 93
                * ***.:.*::*:* **:*****:* * :* ** **:***.:*..: * **:** * *

sp_P13795       KLKSSDA---YKKAWGNNQDG-VVASQPARVVDEREQMAISGGFIRRVTNDARENEMDEN 149
gi_31242623     KSASFKE---DDGTWKGNDDGKVVNNQPQRVMDDRNGLGPQAGYIGRITNDAREDEMEEN 158
gi_3822409      KSAPFKE---NEDAWKGNDDGKVVNNQPQRVMDDGSGLGPQGGYIGRITNDAREDEMEEN 157
gi_39593308     KTDDFEKNSEYAKAWKKDDDGGVISDQPRITVGDPT-MGPQGGYITKITNDAREDEMDEN 179
gi_32567202     KTDDFEK-TEFAKAWKKDDDGGVISDQPRITVGDSS-MGPQGGYITKITNDAREDEMDEN 151
                * . :* ::** *: .** .:.: :. ..*:* ::******:**:**

sp_P13795       LEQVSGIIGNLRHMALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG 206
gi_31242623     MGQVNTMIGNLRNMALDMGSELENQNRQIDRINRKGDSNATRIAAANERAHDLLK-- 213
gi_3822409      VGQVNTMIGNLRNMAIDMGSELENQNRQIDRIKNKAEM------------------- 195
gi_39593308     IQQVSTMVGNLRNMAIDMSTEVSNQNRQLDRIHDKAQSNEVRVESANKRAKNLITK- 235
gi_32567202     VQQVSTMVGNLRNMAIDMSTEVSNQNRQLDRIHDKAQSNEVRVESANKRAKNLITK- 207
                : **. ::****:**:**..*:..****:*** *.:

Interleaved Formats
• Most common output formats for MSAs are interleaved:
  • MSF, ASN, BLAST query-anchored formats
• All sequences are stacked up, and chopped into blocks of ~60 residues
• Easy for humans to read, but difficult to edit
• Tools for converting formats are available on the web
  • EMBOSS tool for conversion (squizz_convert)

Sequence Logos
• Logos are another useful visualization of alignments that allow conserved positions to be easily picked out.
• Multiple tools available on the web or can be downloaded:
  • http://weblogo.berkeley.edu

Tcoffee
• Makes a library of pairwise global and several local alignments
• Tries to find a multiple alignment that has best consensus with all alignments in the library.
• Still a progressive algorithm
• Slower, but usually a bit better than ClustalW

Other Uses of MSA Servers
• ClustalW can refine an alignment
  • If sequences are aligned when submitted, this info is used.
• Tcoffee can
  • Combine alignments
  • Evaluate alignment quality
  • Use structural information if available

Strategies
• Visually inspect alignment and try eliminating sequences that seem problematic.
• Avoid sequences with long insertions and/or terminal extensions
• “Orphan” sequences (highly divergent members of a family) usually don’t disrupt alignment because they’re the last to be aligned.

Collections of MSAs
• Domain and family collection databases not only have sequences grouped by domain/family, but also have MSAs that were used for classification.
  • Example: Pfam http://pfam.janelia.org/
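
The first steps of the progressive alignment procedure described in these slides (score all pairs with a global alignment, then start from the closest pair and add sequences in order of increasing distance) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not ClustalW's actual implementation: the function names `nw_score` and `addition_order` are hypothetical, the scoring uses a toy match/mismatch/linear-gap scheme rather than a substitution matrix, and a simple greedy ordering stands in for a real guide tree. The raw alignment score is also used directly as a similarity measure, which ignores differences in sequence length.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty.

    Toy scoring values (assumption); only the score is returned, not the alignment.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def addition_order(seqs):
    """Order in which sequences would be added to a progressive alignment.

    Simplification (assumption): instead of building a guide tree, start with
    the highest-scoring pair, then repeatedly add the remaining sequence that
    scores best against any sequence already placed.
    """
    scores = {
        (x, y): nw_score(seqs[x], seqs[y])
        for x, y in combinations(sorted(seqs), 2)
    }

    def score(x, y):
        return scores[(x, y) if x < y else (y, x)]

    order = list(max(scores, key=scores.get))
    remaining = set(seqs) - set(order)
    while remaining:
        nxt = max(sorted(remaining),
                  key=lambda r: max(score(r, p) for p in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order


if __name__ == "__main__":
    toy = {"a": "ACGT", "b": "ACGT", "c": "ACGA", "d": "TTTT"}
    print(addition_order(toy))
```

Because each sequence is added only once and earlier decisions are never revisited, the sketch shares the greedy character noted in the slides: the order is fixed by pairwise scores alone, and a poor early choice is carried through the rest of the procedure.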