Multiple Sequence Alignment: Techniques, Tools, and Applications - Prof. Dietlind Gerloff

Multiple Sequence Alignment
• Multiple sequence alignment is probably the single most important bioinformatics tool.
• Many applications require accurate MSAs:
  • PSI-BLAST
  • Family and domain classification
  • Pattern identification
  • Structure prediction
    • secondary structure
    • fold recognition
  • Phylogeny
  • Full-genome alignments in browsers

Conservation Patterns
• Cys pairs - disulfide bonds
• His, Ser - catalytic sites
• Cys, His - metal binding sites
• Gly, Pro - ends of 2° structure elements, turns
• Lys, Arg, Asp, Glu - ligand binding
• Lys/Arg-Asp/Glu pairs - salt bridges
• Leu - coiled coils, leucine zippers
• Motifs, secondary structure, indels

PSI-BLAST Alignments
• The goal of BLAST is rapid detection of similar sequences by finding high-scoring local alignments. It doesn't necessarily find the optimal global or local alignment.
• Profiles throw away information for regions that are insertions relative to the query.

Methods
• Dynamic Programming
  • Gives the optimal solution, but prohibitively slow for more than a few sequences
• Progressive
  • ClustalW
    • http://www.ebi.ac.uk/clustalw/index.html (most commonly used)
  • Tcoffee
    • http://igs-server.cnrs-mrs.fr/Tcoffee/ (a little better, but slower)
• Iterative
  • better than progressive methods, but slower
• Dialign
• HMMs

ClustalW Guide Tree
• The guide tree shows the distances between sequences obtained from the initial pairwise alignments.
• The tree gives the order in which sequences were added to the MSA.
• The guide tree is not a phylogenetic tree (it's just a rough estimate of similarity); however, a true phylogenetic tree can be generated after making an alignment.

Progressive Alignment
• Greedy algorithm
  • Breaks the problem up into smaller problems
  • Finds the best solution to each small problem
  • Combines the solutions to get an answer to the whole problem
  • Not necessarily the globally optimal answer
  • Doesn't use all information in solving sub-problems
  • Suboptimal answers for small problems may combine to give a better overall answer
• Gaps: once created, they stay in the alignment for all remaining alignment iterations

ClustalW Alignment
CLUSTAL W (1.82) multiple sequence alignment

sp_P13795       ---MAEDAD------------------------MRNELEEMQRRADQLADESLESTRRML 33
gi_31242623     MPAAAPPAENG-------------------AAVPKTELQELQMKQQQVVDESLDSTRRML 41
gi_3822409      MPTTAEPAQE--------------------NGAPRSELQELQLKAGQVTDETLESTRRML 40
gi_39593308     MSARRGAPGGQRHPRPYAVEPTVDINGLVLPADMSDELKGLNVGIDEKTIESLESTRRML 60
gi_32567202     MSGDDDIPEG---------------------------LEAINLKMNATTDDSLESTRRML 33
                       .                             *: ::      . ::*:******

sp_P13795       QLVEESKDAGIRTLVMLDEQGEQLERIEEGMDQINKDMKEAEKNLTDLGKFCGLCVCPCN 93
gi_31242623     ALCEESTEVGMRTIVMLDEQGEQLDRIEEGMDQINADMREAEKNLSGMEKCCGICVLPCN 101
gi_3822409      ALCEESKEAGIRTLVALDDQGEQLERIEENMDQINADMKEAEKNLTGMEKFCGLCVLPWN 100
gi_39593308     ALCEESKEAGIKTLVMLDDQGEQLERCEGALDTINQDMKEAEDHLKGMEKCCGLCVLPWN 120
gi_32567202     ALCEESKEAGIKTLVMLDDQGEQLERCEGALDTINQDMKEAEDHLKGMEKCCGLCVLPWN 93
                 * ***.:.*::*:* **:*****:* *  :* ** **:***.:*..: * **:** * *

sp_P13795       KLKSSDA---YKKAWGNNQDG-VVASQPARVVDEREQMAISGGFIRRVTNDARENEMDEN 149
gi_31242623     KSASFKE---DDGTWKGNDDGKVVNNQPQRVMDDRNGLGPQAGYIGRITNDAREDEMEEN 158
gi_3822409      KSAPFKE---NEDAWKGNDDGKVVNNQPQRVMDDGSGLGPQGGYIGRITNDAREDEMEEN 157
gi_39593308     KTDDFEKNSEYAKAWKKDDDGGVISDQPRITVGDPT-MGPQGGYITKITNDAREDEMDEN 179
gi_32567202     KTDDFEK-TEFAKAWKKDDDGGVISDQPRITVGDSS-MGPQGGYITKITNDAREDEMDEN 151
                *    .       :*  ::** *: .**  .:.:   :. ..*:* ::******:**:**

sp_P13795       LEQVSGIIGNLRHMALDMGNEIDTQNRQIDRIMEKADSNKTRIDEANQRATKMLGSG 206
gi_31242623     MGQVNTMIGNLRNMALDMGSELENQNRQIDRINRKGDSNATRIAAANERAHDLLK-- 213
gi_3822409      VGQVNTMIGNLRNMAIDMGSELENQNRQIDRIKNKAEM------------------- 195
gi_39593308     IQQVSTMVGNLRNMAIDMSTEVSNQNRQLDRIHDKAQSNEVRVESANKRAKNLITK- 235
gi_32567202     VQQVSTMVGNLRNMAIDMSTEVSNQNRQLDRIHDKAQSNEVRVESANKRAKNLITK- 207
                : **. ::****:**:**..*:..****:***  *.:

Interleaved Formats
• Most common output formats for MSAs are interleaved:
  • MSF, ASN, BLAST query-anchored formats
• All sequences are stacked up, and chopped into blocks of ~60 residues
• Easy for humans to read, but difficult to edit
• Tools for converting formats are available on the web

Aligned FASTA (A2M) Format
>SN29_RAT/142-196
PSSRLKEAINTSKDQESKYQASHPNLRRLHDAE---LDSVPASTV----NTEVY-----P
KNSSL---R-----A
>SN29_HUMAN/142-197
PNNRLKEAISTSKEQEAKYQASHPNLR-------KLDDTDPVPRGA---GSAMSTDA-YP
KNPHL---R-----A
>SN25_TORMA/95-148
PCNK----LKNFEAGGAYKKVWGNNQD------G-VVASQP-ARVMD-DREQMA-----M
SGGYI--RRI-TDDA
>O93578/11-59
PCNK----MKS-----GASKAWGNNQD------G-VVASQP-ARVVD-EREQMA-----I
SGGFI--RRV-TDDA
>SN25_DROME/98-149
PCNK----SQSFK---EDDGTWKGNDD------GKVVNNQP-QRVMD-DRNGM-----MA
QAGYI--GRI-TNDA
• Uppercase and '-' characters are alignment columns. There must be the same number of aligned characters in all sequences.
• Insertions that are not part of the alignment are indicated with lowercase and '.' characters. These are not read (i.e. they're for humans only).
• Benefits
  • Easily machine readable
  • Readable by most programs that read FASTA format
(Note: characters in lowercase, if there were any, would indicate that the alignment is uncertain at these positions)

Graphical - Jalview
• Postscript, PDF, HTML
• Looks pretty and is very visually informative
• Completely useless for further computational analysis. DO NOT SAVE GRAPHICS AS YOUR ONLY OUTPUT
• Jalview -- Java alignment editor (http://www.jalview.org)
  • Available as an online applet or as an application
  • Makes nice pictures and allows interactive editing
e.g. Jalview, ClustalX (or others)
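
Since graphical output cannot be fed back into other programs, it helps to keep a machine-readable copy such as aligned FASTA. The following is a minimal Python sketch of reading that format, assuming only the conventions stated in these notes: uppercase letters and '-' are alignment columns, lowercase letters and '.' are insertions to be skipped, and every sequence must have the same number of alignment columns. The function names (`read_aligned_fasta`, `alignment_columns`, `check_alignment`) and the toy input `EXAMPLE` are hypothetical illustrations, not part of ClustalW, Jalview, or any other tool mentioned here.

```python
def read_aligned_fasta(text):
    """Parse aligned FASTA text into {name: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: everything after '>' is taken as the sequence name.
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def alignment_columns(seq):
    """Keep only alignment columns: uppercase residues and '-' gaps.

    Lowercase residues and '.' are treated as insertions and dropped.
    """
    return "".join(c for c in seq if c.isupper() or c == "-")


def check_alignment(records):
    """Return {name: alignment columns}; raise if the lengths differ."""
    columns = {n: alignment_columns(s) for n, s in records.items()}
    lengths = {len(s) for s in columns.values()}
    if len(lengths) > 1:
        raise ValueError(f"unequal numbers of alignment columns: {sorted(lengths)}")
    return columns


# Hypothetical toy input (not real data): two short wrapped records.
EXAMPLE = """\
>seqA
PCNKab--LK
NF
>seqB
PC..NKSQ-K
NF
"""

if __name__ == "__main__":
    for seq_name, cols in check_alignment(read_aligned_fasta(EXAMPLE)).items():
        print(seq_name, cols)
```

Dropping the insertion characters before comparing lengths mirrors the rule that only uppercase and '-' characters count as aligned positions. A real pipeline would more likely rely on an established parser than on this sketch.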