#11 - MSAs; PSSMs & Psi-BLAST
9/17/07, BCB 444/544 Fall 07, ISU, Dobbs

BCB 444/544 Lecture 12
Multiple Sequence Alignment (MSA)
PSSMs & Psi-BLAST
#12_Sept17

Required Reading (before lecture)
√ Mon Sept 17 - Lecture 12: Position Specific Scoring Matrices & PSI-BLAST
  • Chp 6 - pp 75-78 (but not HMMs)
Wed Sept 19 - Lecture 13 (not covered on Exam 1): Hidden Markov Models
  • Chp 6 - pp 79-84
  • Eddy: What is a hidden Markov Model? 2004 Nature Biotechnol 22:1315
    http://www.nature.com/nbt/journal/v22/n10/abs/nbt1004-1315.html
Fri Sept 21 - EXAM 1

Assignments & Announcements
Sun Sept 16 - Study Guide for Exam 1 was posted
Mon Sept 17 - Answers to HW#2 will be posted ~ Noon
Thu Sept 20 - Lab = Optional Review Session for Exam
Fri Sept 21 - Exam 1 - Will cover:
  • Lectures 2-12 (thru Mon Sept 17)
  • Labs 1-4
  • HW2
  • All assigned reading: Chps 2-6 (but not HMMs); Eddy: What is Dynamic Programming?

Chp 5 - Multiple Sequence Alignment
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 5 Multiple Sequence Alignment
  • Scoring Function
  • Exhaustive Algorithms
  • Heuristic Algorithms
  • Practical Issues

Multiple Sequence Alignments
Credits for slides: Caragea & Brown, 2007; Fernandez-Baca, Heber & Hunter

Overview
  1. What is a multiple sequence alignment (MSA)?
  2. Where/why do we need MSA?
  3. What is a good MSA?
  4. Algorithms to compute an MSA

Multiple Sequence Alignment
  • Generalize pairwise alignment of sequences to include > 2 homologous sequences
  • Analyzing more than 2 sequences gives us much more information:
    • Which amino acids are required? Correlated?
    • Evolutionary/phylogenetic relationships
  • Similar to PSI-BLAST idea (not yet covered in lecture): use a set of homologous sequences to provide more "sensitivity"

Definition: MSA
Given a set of sequences, a multiple sequence alignment is an assignment of gap characters, such that
  • resulting sequences have same length
  • no column contains only gaps

  ATTTG-      ATTTG       ATTT-G-
  ATTTGC      ATTTGC      ATTT-GC
  AT-TGC      ATT-GC      AT-T-GC
  YES         NO          NO

Displaying MSAs: using CLUSTAL W
  *  entirely conserved column
  :  all residues have ~ same size AND hydropathy
  .  all residues have ~ same size OR hydropathy
  RED: AVFPMILW (small)
  BLUE: DE (acidic, negative chg)
  MAGENTA: RHK (basic, positive chg)
  GREEN: STYHCNGQ (hydroxyl + amine + basic)

What is a Consensus Sequence?
A single sequence that represents the most common residue of each column in an MSA.
Example:
  FGGHL-GF
  F-GHLPGF
  FGGHP-FG
  --------
  FGGHL-GF   (consensus)

Steiner consensus sequence: Given sequences s1,…, sk, find a sequence s* that maximizes Σi S(s*, si)

Applications of MSA
  • Building phylogenetic trees
  • Finding conserved patterns, e.g.:
    • Regulatory motifs (TF binding sites)
    • Splice sites
    • Protein domains
  • Identifying and characterizing protein families
    • Find out which protein domains have same function
  • Finding SNPs (single nucleotide polymorphisms) & mRNA isoforms (alternatively spliced forms)
  • DNA fragment assembly (in genomic sequencing)

Application: Recover Phylogenetic Tree
[Figure: tree relating the sequences NYLS, NYLS and NFLS]
What was the series of events that led to the current species?

What Happens to Computational Complexity?
Given k sequences of length n:
  • Space for matrix: O(n^k)
  • Neighbors/cell: 2^k - 1
  • Time to compute SP (sum-of-pairs) score: O(k^2)
  • Overall runtime: O(k^2 2^k n^k)
Ouch!!!

What's so bad about those exponents?
An example: Running Time of DP
  • Overall runtime: O(k^2 2^k n^k)

  # sequences   running time
  2             1 second
  3             2 minutes
  4             5 hours
  5             3 weeks
  6             9 years

Sequences: globins (≈ 150 aa)
But: There are fast heuristics.

Progressive Alignment
Heuristic procedure:
  1. Align most similar sequences first
  2. Add sequences progressively
Often: use guide tree to determine order of alignments
Examples:
  • Star alignment
  • ClustalW
[Figure: multiple alignment built by adding sequences 1, 2, 3, 4 in turn]

Guide Tree
Binary tree
  • Leaves correspond to sequences
  • Internal nodes represent alignments
  • Root corresponds to final MSA
[Figure: guide tree with leaves ATC, ATG, TCC, TCG; internal nodes hold the pairwise alignments ATC/ATG and TCC/TCG; the root holds the final MSA ATC-, ATG-, -TCC, -TCG]

Star Alignment - will skip for now, come back to this on Wed
Star alignment will NOT be covered on Exam 1

Chp 6 - Profiles & Hidden Markov Models
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 6 Profiles & HMMs
  • Position Specific Scoring Matrices (PSSMs)
  • PSI-BLAST
  • Profiles
  • Markov Model & Hidden Markov Model

PSI-BLAST
  • Position Specific Iterated BLAST
  • Intuition: substitution matrices should be specific to a particular site: penalize alanine→glycine more in a helix
  • Basic idea:
    • Use BLAST with high stringency to get a set of closely related sequences
    • Align those sequences to create a new substitution matrix for each position
    • Then use that matrix (iteratively) to find additional sequences

Psi-BLAST
[Figure: flow diagram with elements Query, BLAST, Sequence database, Multiple alignment, PSSM]

PSI-BLAST pseudocode
  Convert query to PSSM
  do {
    BLAST database with PSSM
    Stop if no new homologs are found
    Add new homologs to PSSM
  }
  Print current set of homologs
(PSSM = position-specific scoring matrix)
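The PSI-BLAST pseudocode above can be made concrete with a toy sketch. Everything here is an assumption made for illustration: the "BLAST" step is replaced by ungapped scoring of equal-length sequences, the PSSM is a log-odds matrix with a uniform background and a pseudocount of 1, and the user-defined threshold is a raw score cutoff rather than an E-value. The names build_pssm, pssm_score and toy_psi_blast are hypothetical and are not part of any BLAST package.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids


def build_pssm(seqs, pseudocount=1.0):
    """Log-odds PSSM (one dict per position) from ungapped, equal-length sequences.

    Assumption: uniform background frequencies and a flat pseudocount.
    """
    background = 1.0 / len(ALPHABET)
    total = len(seqs) + pseudocount * len(ALPHABET)
    pssm = []
    for j in range(len(seqs[0])):
        counts = Counter(s[j] for s in seqs)  # missing residues count as 0
        pssm.append({
            a: math.log2((counts[a] + pseudocount) / total / background)
            for a in ALPHABET
        })
    return pssm


def pssm_score(pssm, seq):
    """Sum the position-specific score of each residue in seq."""
    return sum(column[res] for column, res in zip(pssm, seq))


def toy_psi_blast(query, database, threshold, max_iterations=10):
    homologs = [query]
    pssm = build_pssm(homologs)                 # Convert query to PSSM
    for _ in range(max_iterations):             # do {
        new = []
        for s in database:                      #   "BLAST" database with PSSM
            if (len(s) == len(query) and s not in homologs and s not in new
                    and pssm_score(pssm, s) >= threshold):
                new.append(s)
        if not new:                             #   Stop if no new homologs are found
            break
        homologs.extend(new)                    #   Add new homologs to PSSM
        pssm = build_pssm(homologs)
    return homologs                             # current set of homologs


database = ["ACDF", "ACGF", "WWWW"]
print(toy_psi_blast("ACDE", database, threshold=2.0))
```

With this toy database, ACGF scores below the cutoff against the PSSM built from the query alone and is picked up only after ACDF has been added to the PSSM, which is the gain in sensitivity that the iteration is meant to provide.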
PSI-BLAST pseudocode, note: the step that decides which hits count as homologs requires a user-defined threshold.

Position-specific scoring matrix - PSSM
  • A PSSM is an n by m matrix, where n is the size of the alphabet, and m is the length of the sequence.
  • The entry at (i, j) is the score assigned by the PSSM to letter i at the jth position.

  Position    1   2   3   4   5   6   7   8
  A          -1  -2  -1   0  -1  -2   0  -2
  R           5   0   5  -2   1  -3  -2   0
  N           0   6   0   0   0  -3   0   1
  D          -2   1  -2  -1   0  -3  -1  -1
  C          -3  -3  -3  -3  -3  -2  -3  -3
  Q           1   0   1  -2   5  -3  -2   0
  E           0   0   0  -2   2  -3  -2   0
  G          -2   0  -2   6  -2  -3   6  -2
  H           0   1   0  -2   0  -1  -2   8
  I          -3  -3  -3  -4  -3   0  -4  -3
  L          -2  -3  -2  -4  -2   0  -4  -3
  K           2   0   2  -2   1  -3  -2  -1
  M          -1  -2  -1  -3   0   0  -3  -2
  F          -3  -3  -3  -3  -3   6  -3  -1
  P          -2  -2  -2  -2  -1  -4  -2  -2
  S          -1   1  -1   0   0  -2   0  -1
  T          -1   0  -1  -2  -1  -2  -2  -2
  W          -3  -4  -3  -2  -2   1  -2  -2
  Y          -2  -2  -2  -3  -1   3  -3   2
  V          -3  -3  -3  -3  -2  -1  -3  -3
Example: “K” at position 3 gets a score of 2 (row K, column 3 of the PSSM above).

Position-specific scoring matrix: scoring a sequence
  • This PSSM assigns sequence NMFWAFGH a score of:
    0 + (-2) + (-3) + (-2) + (-1) + 6 + 6 + 8 = 12
  • What score does this PSSM assign to KRPGHFLA?
    2 + 0 + (-2) + 6 + 0 + 6 + (-4) + (-2) = 6

Position-specific iterated BLAST
[Figure: flow diagram with elements Query, BLAST, Sequence database, Multiple alignment, PSSM; one step is marked "?"]
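The two worked scores above can be checked with a short sketch. The matrix values are the ones shown on the PSSM slides, stored as one row of eight position scores per residue; the names PSSM and pssm_score are hypothetical, and the function assumes an ungapped sequence of exactly the PSSM's length.

```python
# Rows of the slide's PSSM: residue -> scores at positions 1..8.
PSSM = {
    "A": [-1, -2, -1,  0, -1, -2,  0, -2],
    "R": [ 5,  0,  5, -2,  1, -3, -2,  0],
    "N": [ 0,  6,  0,  0,  0, -3,  0,  1],
    "D": [-2,  1, -2, -1,  0, -3, -1, -1],
    "C": [-3, -3, -3, -3, -3, -2, -3, -3],
    "Q": [ 1,  0,  1, -2,  5, -3, -2,  0],
    "E": [ 0,  0,  0, -2,  2, -3, -2,  0],
    "G": [-2,  0, -2,  6, -2, -3,  6, -2],
    "H": [ 0,  1,  0, -2,  0, -1, -2,  8],
    "I": [-3, -3, -3, -4, -3,  0, -4, -3],
    "L": [-2, -3, -2, -4, -2,  0, -4, -3],
    "K": [ 2,  0,  2, -2,  1, -3, -2, -1],
    "M": [-1, -2, -1, -3,  0,  0, -3, -2],
    "F": [-3, -3, -3, -3, -3,  6, -3, -1],
    "P": [-2, -2, -2, -2, -1, -4, -2, -2],
    "S": [-1,  1, -1,  0,  0, -2,  0, -1],
    "T": [-1,  0, -1, -2, -1, -2, -2, -2],
    "W": [-3, -4, -3, -2, -2,  1, -2, -2],
    "Y": [-2, -2, -2, -3, -1,  3, -3,  2],
    "V": [-3, -3, -3, -3, -2, -1, -3, -3],
}


def pssm_score(pssm, seq):
    """Add up the entry for letter seq[j] at position j (0-based here)."""
    length = len(next(iter(pssm.values())))
    if len(seq) != length:
        raise ValueError(f"expected a sequence of length {length}")
    return sum(pssm[res][j] for j, res in enumerate(seq))


print(PSSM["K"][2])                  # "K" at position 3 (index 2)
print(pssm_score(PSSM, "NMFWAFGH"))
print(pssm_score(PSSM, "KRPGHFLA"))
```

Storing the matrix by residue keeps the lookup for "letter i at position j" a direct PSSM[i][j - 1]; a real search would also slide the PSSM along longer database sequences and allow gaps, which this sketch leaves out.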
Creating a PSSM from 1 sequence
[Figure: the query RNRGQFGH (length L) next to the 20 by 20 BLOSUM62 matrix. For each query position, e.g. R, the BLOSUM62 column for that residue (R) becomes the PSSM column at that position, giving a 20 by L PSSM. The PSSM pictured is the same matrix as on the preceding PSSM slides.]

Position-specific iterated BLAST
[Figure: flow diagram with elements Query, BLAST, Sequence database, Multiple alignment, PSSM; one step is marked "?"]
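The construction on the "Creating a PSSM from 1 sequence" slide can be sketched as a column copy. To stay with values that appear in the lecture, the sketch below carries only the BLOSUM62 columns needed for the slide's query RNRGQFGH, read off the slide's PSSM; a full implementation would need the whole 20 by 20 matrix. The names RESIDUES, BLOSUM62_COLUMNS and pssm_from_query are hypothetical.

```python
RESIDUES = "ARNDCQEGHILKMFPSTWYV"  # row order used on the slides

# Partial BLOSUM62: only the columns for R, N, G, Q, F, H,
# each listed in RESIDUES order (A first, V last).
BLOSUM62_COLUMNS = {
    "R": [-1, 5, 0, -2, -3, 1, 0, -2, 0, -3, -2, 2, -1, -3, -2, -1, -1, -3, -2, -3],
    "N": [-2, 0, 6, 1, -3, 0, 0, 0, 1, -3, -3, 0, -2, -3, -2, 1, 0, -4, -2, -3],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "F": [-2, -3, -3, -3, -2, -3, -3, -3, -1, 0, 0, -3, 0, 6, -4, -2, -2, 1, 3, -1],
    "H": [-2, 0, 1, -1, -3, 0, 0, -2, 8, -3, -3, -1, -2, -1, -2, -1, -2, -2, 2, -3],
}


def pssm_from_query(query, columns=BLOSUM62_COLUMNS):
    """Build a 20 by L PSSM: position j gets the matrix column of query[j]."""
    pssm = {res: [] for res in RESIDUES}
    for q in query:
        column = columns[q]  # KeyError if the column is not in the partial table
        for res, score in zip(RESIDUES, column):
            pssm[res].append(score)
    return pssm


pssm = pssm_from_query("RNRGQFGH")
print(pssm["K"])  # row K of the resulting PSSM
```

Because positions 1 and 3 of the query are both R, and positions 4 and 7 are both G, those PSSM columns come out identical; position-specific differences only appear once additional homologs are folded into the matrix in later PSI-BLAST iterations.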