Star Alignment and Cluster in Bioinformatics - Study Guide | BCB 444, Exams of Bioinformatics

Material Type: Exam; Professor: Dobbs; Class: INTRO BIOINFORMATCS; Subject: BIOINFORMATICS AND COMPUTATIONAL BIOL; University: Iowa State University; Term: Fall 2007;


Uploaded on 09/02/2009

koofers-user-92k-1

Partial preview of the text

#13 - Star Alignment; HMMs
9/19/07, BCB 444/544 Fall 07, ISU, Dobbs

BCB 444/544 Lecture 13
Star Alignment & Clustal (for MSA)
Perhaps: Profiles & Hidden Markov Models (HMMs)
#13_Sept19

Required Reading (before lecture)
√ Mon Sept 17 - Lecture 12: Position Specific Scoring Matrices & PSI-BLAST
  • Chp 6 - pp 75-78 (but not HMMs)
Wed Sept 19 - Lecture 13 (not covered on Exam 1): Profiles & Hidden Markov Models
  • Chp 6 - pp 79-84
  • Eddy: What is a hidden Markov Model? 2004 Nature Biotechnol 22:1315
    http://www.nature.com/nbt/journal/v22/n10/abs/nbt1004-1315.html
Fri Sept 21 - EXAM 1

Assignments & Announcements
√ Sun Sept 16 - Study Guide for Exam 1 was posted
√ Mon Sept 17 - Answers to HW#2 were posted
Thu Sept 20 - Lab = Optional Review Session for Exam
Fri Sept 21 - Exam 1 - Will cover:
  • Lectures 2-12 (thru Mon Sept 17)
  • Labs 1-4
  • HW2
  • All assigned reading:
    Chps 2-6 (but not HMMs)
    Eddy: What is Dynamic Programming?

Chp 5 - Multiple Sequence Alignment
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 5 Multiple Sequence Alignment
• √ Scoring Function
• √ Exhaustive Algorithms
• Heuristic Algorithms
  • Star Alignment
  • Clustal
• √ Practical Issues
• First, review MSA scoring briefly, then back to Star Alignment & ClustalW

Scoring an Alignment (in Lecture 12, so will be covered on Exam 1)
In practice, simple scoring functions are used.
Usually, columns are scored independently:

  $S(m) = G + \sum_i S(m_i)$

where $m_i$ is the ith column of alignment $m$ and $G$ is the gap penalty.

  A F P G Q I K F F F I Y Y Y
  G G Q G Q G K F F F I D D D
  A F P G Q I K F F F I D D D
  W W W W W W W F F F I I - -
  A F P G Q I K - - - I D D D
  G G G G G G G - F F I Y Y Y

Sum of Pairs (SP) Score
• SP = sum of pairs = sum of scores of all possible pairs of sequences in an MSA, based on a particular scoring matrix
• Compute for each column $m_i$:

  $S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$

  where $s$ is the PAM or BLOSUM score and $m_i^l$ is the residue of sequence $l$ in column $i$.

  A F P G F F F I
  G G Q G F F F I
  A F P G F F F I
  W W W W F F I -
  A F P G - - D D
  G G G G - F F Y
  (one column, $m_i$, is highlighted on the slide)

Example: Calculating SP Score

Alignment M (columns m1, m2, m3):

  m1 m2 m3
  F  -  G
  F  -  G
  F  Y  D

Scoring matrix (labeled "BLOSUM 60" on the slide):

       D    G    Y    F
  D    5
  G   -3    4
  Y   -5    1    7
  F   -1   -2   -2    5

Gap penalty = -8; s(-,-) = 0

  S(m) = S(m1) + S(m2) + S(m3)
       = 3s(F,F) + 2s(-,Y) + s(-,-) + s(G,G) + 2s(G,D)
       = 15 - 16 + 0 + 4 - 6
       = -3

Algorithms & Software for MSA? #1
Exhaustive Methods
• √ Multidimensional dynamic programming (DP)
• Divide-and-Conquer Alignment (DCA) - "semi-exhaustive"; web-based version available - see textbook
• Full DP Optimal Global Alignment? Prohibitive in both time & space requirements for more than 10 sequences!!
Heuristic Methods
• Progressive (Star Alignment, Clustal)
• Iterative
• Block-based

Dynamic Programming for MSA
• As with pairwise alignments, MSAs can be computed by dynamic programming*
  *(if you're not in a rush!)
[Figure: panels labeled 2D and 3D]
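The SP score example above can be checked with a few lines of Python. This is a minimal sketch, not the course's own code: the function names (`pair_score`, `column_score`, `sp_score`) are made up for illustration, and the score table is simply the small four-residue matrix read off the slide, with gap penalty -8 and s(-,-) = 0.

```python
from itertools import combinations

GAP = "-"
GAP_PENALTY = -8  # residue-vs-gap score used in the slide example

# Pair scores read from the slide's small example matrix (assumed symmetric,
# so each unordered pair is stored once).
PAIR_SCORES = {
    ("D", "D"): 5,
    ("G", "D"): -3, ("G", "G"): 4,
    ("Y", "D"): -5, ("Y", "G"): 1, ("Y", "Y"): 7,
    ("F", "D"): -1, ("F", "G"): -2, ("F", "Y"): -2, ("F", "F"): 5,
}


def pair_score(a, b):
    """Score one pair of aligned symbols; gap-gap pairs score 0."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return GAP_PENALTY
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def column_score(column):
    """S(m_i): sum of pair scores over all pairs k < l in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_score(alignment):
    """S(m): sum of column scores over all columns of the alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(column_score(col) for col in zip(*alignment))


example = ["F-G", "F-G", "FYD"]
print(sp_score(example))  # -3, matching the slide
```

Because each column is scored independently, the three column scores (15, -16, -2) can be inspected separately with `column_score`.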
Generalized Needleman-Wunsch Algorithm
Given 3 sequences x, y, and z:
Main iteration loop:

  S(i,j,k) = max (
      S(i-1, j-1, k-1) + σ(x_i, y_j, z_k),
      S(i-1, j-1, k  ) + σ(x_i, y_j, - ),
      S(i-1, j  , k-1) + σ(x_i, - , z_k),
      S(i-1, j  , k  ) + σ(x_i, - , - ),
      S(i  , j-1, k-1) + σ( - , y_j, z_k),
      S(i  , j-1, k  ) + σ( - , y_j, - ),
      S(i  , j  , k-1) + σ( - , - , z_k) )

What Happens to Computational Complexity?
Given k sequences of length n:
• Space for matrix: $O(n^k)$
• Neighbors/cell: $2^k - 1$
• Time to compute SP score: $O(k^2)$
• Overall runtime: $O(k^2 2^k n^k)$  Wow!!!

What's so bad about those exponents?
Example: Running Time of DP for MSA
• Overall runtime: $O(k^2 2^k n^k)$

  # Sequences   Running Time
  2             1 second
  3             2 minutes
  4             5 hours
  5             3 weeks
  6             9 years

Sequences? Globins, only ~150 aa!!
But: There are fast heuristics

Algorithms & Software for MSA? #2
√ Exhaustive Methods
• Multidimensional dynamic programming (DP)
• Divide-and-Conquer Alignment (DCA) - "semi-exhaustive"; web-based version available - see textbook
• Full DP Optimal Global Alignment? Prohibitive in both time & space requirements for more than 10 sequences!!
Heuristic Methods
• √ Progressive (Star Alignment, Clustal)
• Iterative
• Block-based

Algorithms & Software for MSA? #3 (will NOT be covered on Exam 1)
Heuristic Methods - continued
• Progressive alignments (Star Alignment, Clustal)
  • Others: T-Coffee, DbClustal - see text: can be better than Clustal
  • Match closely-related sequences first using a guide tree
• Partial order alignments (POA)
  • Doesn't rely on guide tree; adds sequences in order given
• PRALINE
  • Preprocesses input sequences by building profiles for each
• Iterative methods
  • Idea: optimal solution can be found by repeatedly modifying existing suboptimal solutions (eg: PRRN)
• Block-based Alignment
  • Multiple re-building attempts to find best alignment (eg: DIALIGN2 & Match-Box)
• Local alignments
  • Profiles, Blocks, Patterns - more on these soon!

Chp 6 - Profiles & Hidden Markov Models
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 6 Profiles & HMMs
• √ Position Specific Scoring Matrices (PSSMs)
• √ PSI-BLAST
First, review above briefly, then:
• Profiles
• Markov Models & Hidden Markov Models

PSI-BLAST (covered in Lecture 12, so will be covered on Exam 1)
• Position Specific Iterated BLAST
• Intuition: substitution matrices should be "sensitive" to protein context
  • e.g., larger penalty for Ala→Gly substitution if in a helix rather than in a loop
• Basic idea:
  • Use BLAST with high stringency to generate a set of closely related sequences
  • Align those sequences to create a new substitution matrix for each position
  • Use this matrix (iteratively) to find additional sequences

PSI-BLAST Pseudocode

  Convert query to PSSM (or a Profile)
  do {
      BLAST database with PSSM
      Stop if no new homologs are found
      Add new homologs to PSSM
  }
  Print current set of homologs

Callout on the slide: one step of the loop requires a user-defined threshold.
PSSM = Position-Specific Scoring Matrix
Note: Xiong textbook distinguishes between PSSMs (which have no gaps) & Profiles (can include gaps).
Thus, based on these definitions, PSI-BLAST uses a Profile to iteratively add new homologs; other authors refer to the pattern used by PSI-BLAST as a PSSM.

What is a PSSM? Position-Specific Scoring Matrix
A PSSM is:
• a representation of a motif
• an n by m matrix, where n is size of alphabet & m is length of sequence
• a matrix of scores in which entry at (i, j) is score assigned by PSSM to letter i at the jth position

Rows: 20-letter alphabet. Columns: positions 1-8 of an 8-residue sequence.

        1    2    3    4    5    6    7    8
  A    -1   -2   -1    0   -1   -2    0   -2
  R     5    0    5   -2    1   -3   -2    0
  N     0    6    0    0    0   -3    0    1
  D    -2    1   -2   -1    0   -3   -1   -1
  C    -3   -3   -3   -3   -3   -2   -3   -3
  Q     1    0    1   -2    5   -3   -2    0
  E     0    0    0   -2    2   -3   -2    0
  G    -2    0   -2    6   -2   -3    6   -2
  H     0    1    0   -2    0   -1   -2    8
  I    -3   -3   -3   -4   -3    0   -4   -3
  L    -2   -3   -2   -4   -2    0   -4   -3
  K     2    0    2   -2    1   -3   -2   -1
  M    -1   -2   -1   -3    0    0   -3   -2
  F    -3   -3   -3   -3   -3    6   -3   -1
  P    -2   -2   -2   -2   -1   -4   -2   -2
  S    -1    1   -1    0    0   -2    0   -1
  T    -1    0   -1   -2   -1   -2   -2   -2
  W    -3   -4   -3   -2   -2    1   -2   -2
  Y    -2   -2   -2   -3   -1    3   -3    2
  V    -3   -3   -3   -3   -2   -1   -3   -3

"K" at position 3 gets a score of 2.
Also sometimes called: Position Weight Matrix (PWM)
Note: Assumes positions are independent
Xiong: PSSM = table that contains probability information re: residues at each position of an ungapped MSA

Assigning a "Match" Score with a PSSM
Using the PSSM above, the PSSM assigns sequence NMFWAFGH a score of:
  0 + (-2) + (-3) + (-2) + (-1) + 6 + 6 + 8 = 12

Creating a PSSM from 1 Sequence
Query: RNRGQFGH (length L). For each query position, the BLOSUM62 matrix (20 by 20) column for that residue (e.g. R) becomes the corresponding PSSM column, giving a 20 by L PSSM. The PSSM shown above is the result for this query.

Creating a PSSM from Multiple Sequences
1. Discard columns that contain gaps in query sequence
2. Compute relative sequence weights
3. Compute PSSM entries, taking into account:
   • Observed residues in column
   • Sequence weights
   • Substitution matrix

1 - Discard Columns with Gaps in Query

Before:
  EEFG----SVDGLVNNA
  QKYG----RLDVMINNA
  RRLG----TLNVLVNNA
  GGIG----PVD-LVNNA
  KALG----GFNVIVNNA
  ARFG----KID-LIPNA
  FEPEGPEKGMWGLVNNA
  AQLK----TVDVLINGA

After:
  EEFGSVDGLVNNA
  QKYGRLDVMINNA
  RRLGTLNVLVNNA
  GGIGPVD-LVNNA
  KALGGFNVIVNNA
  ARFGKID-LIPNA
  FEPEGMWGLVNNA
  AQLKTVDVLINGA

2 - Compute Sequence Weights
• Smaller weights are assigned to redundant sequences
• Larger weights are assigned to unique sequences

  Sequence        Weight
  EEFGSVDGLVNNA   1.2
  QKYGRLDVMINNA   1.2
  RRLGTLNVLVNNA   0.8
  GGIGPVDLLVNNA   0.8
  KALGGFNVIVNNA   1.1
  ARFGKIDTLIPNA   0.9
  FEPEGMWGLVNNA   1.1
  AQLKTVDVLINGA   1.3

How are weights determined? Based on branch lengths in guide tree: the value for each sequence is then used to multiply raw alignment scores.
Goal of weighting? To decrease matching scores of frequent characters in MSA & increase scores of infrequent characters.

3 - Compute PSSM Entries (simplified version)
Observed residues in one column: E Q R G K A F A
Background frequencies (usually derived from large sequence database):

  A 0.085   C 0.019   D 0.054   E 0.065   F 0.040
  G 0.072   H 0.023   I 0.058   K 0.056   L 0.096
  M 0.024   P 0.053   Q 0.042   R 0.054   S 0.072
  T 0.063   V 0.073   W 0.016   Y 0.034

PSSM column = observed residues / background frequencies

PSSM Entries = Log-Odds Scores

  $\log_2 \left( \frac{\Pr(A \mid M)}{\Pr(A \mid B)} \right)$

where $\Pr(A \mid M)$ is the observed frequency of residue "A" under the foreground model M (i.e., the PSSM) and $\Pr(A \mid B)$ is its frequency under the background model B.
1. Estimate probability of observing each residue (probability of A given M, where M is PSSM model)
2. Divide by background probability of observing each residue (probability of A given B, where B is background model)
3. Take log so that scores can be added (rather than multiplied)

Why (not) PSI-BLAST?
• PSI-BLAST weights sequences according to observed diversity specific to family under investigation
• Advantage: If sequences used to construct PSSMs are all homologous, sensitivity for a given level of specificity improves significantly
• Disadvantage: However, if any non-homologous sequences are included in PSSMs, they become "corrupted" and "pull in" additional non-homologous sequences, resulting in false positive hits

How to Use PSI-BLAST Effectively
• Set initial thresholds high
• Inspect each iteration's result for suspicious sequences (When in doubt, leave it out!)
• Do several iterations (~5), or until no new sequences are found
• Make initial search very broad
  • First, use NR (large, inclusive database) with up to 5 iterations to set PSSM
  • Then use that PSSM to search in a more restricted domain, if possible
• Be particularly cautious about matches to sequences with highly biased amino acid content

Summary: DP, BLAST & PSI-BLAST
• Dynamic programming is O(NM) for pairwise alignment
• BLAST is O(M)
  • BLAST produces an index of words in query sequence that allows fast matching to the database
  • At NCBI, target databases are also pre-indexed to indicate positions in all database sequences that match each possible search word above some score threshold
• PSI-BLAST iterates BLAST, adding new homologs at each iteration

Applications of MSA
• Building phylogenetic trees
• Finding conserved patterns:
  • Regulatory motifs (TF binding sites)
  • Splice sites
  • Protein domains
• Identifying and characterizing protein families
  • Find out which protein domains have same function
• Finding SNPs (single nucleotide polymorphisms) & mRNA isoforms (alternatively spliced forms)
• DNA fragment assembly (in genomic sequencing)

Application: Discover Conserved Patterns
Rationale: if sequences are homologous (derived from a common ancestor), they may be structurally/functionally equivalent
Is there a conserved cis-acting regulatory sequence?
TATA box = transcriptional promoter element
[Figure: Sequence Logo]
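Returning to the PSSM slides earlier in the lecture, the match score for NMFWAFGH and the simplified log-odds column can be reproduced with a short Python sketch. The table below is the 20 by 8 PSSM from the slides, with positions ordered left to right. The names `pssm_score` and `log_odds_column` are hypothetical. The log-odds function is an assumption-laden simplification: it uses unweighted residue counts, no pseudocounts, and only the background frequencies listed on the slide, so it is not how PSI-BLAST itself computes entries.

```python
import math
from collections import Counter

# PSSM from the slides: one row per residue, one value per position 1-8.
PSSM = {
    "A": [-1, -2, -1, 0, -1, -2, 0, -2],
    "R": [5, 0, 5, -2, 1, -3, -2, 0],
    "N": [0, 6, 0, 0, 0, -3, 0, 1],
    "D": [-2, 1, -2, -1, 0, -3, -1, -1],
    "C": [-3, -3, -3, -3, -3, -2, -3, -3],
    "Q": [1, 0, 1, -2, 5, -3, -2, 0],
    "E": [0, 0, 0, -2, 2, -3, -2, 0],
    "G": [-2, 0, -2, 6, -2, -3, 6, -2],
    "H": [0, 1, 0, -2, 0, -1, -2, 8],
    "I": [-3, -3, -3, -4, -3, 0, -4, -3],
    "L": [-2, -3, -2, -4, -2, 0, -4, -3],
    "K": [2, 0, 2, -2, 1, -3, -2, -1],
    "M": [-1, -2, -1, -3, 0, 0, -3, -2],
    "F": [-3, -3, -3, -3, -3, 6, -3, -1],
    "P": [-2, -2, -2, -2, -1, -4, -2, -2],
    "S": [-1, 1, -1, 0, 0, -2, 0, -1],
    "T": [-1, 0, -1, -2, -1, -2, -2, -2],
    "W": [-3, -4, -3, -2, -2, 1, -2, -2],
    "Y": [-2, -2, -2, -3, -1, 3, -3, 2],
    "V": [-3, -3, -3, -3, -2, -1, -3, -3],
}

# Background frequencies from the slide, restricted to the residues
# observed in the example column (E Q R G K A F A).
BACKGROUND = {"A": 0.085, "E": 0.065, "F": 0.040, "G": 0.072,
              "K": 0.056, "Q": 0.042, "R": 0.054}


def pssm_score(sequence, pssm):
    """Sum the per-position scores; positions are treated as independent."""
    return sum(pssm[residue][pos] for pos, residue in enumerate(sequence))


def log_odds_column(observed, background):
    """log2(observed frequency / background frequency) for each observed residue.

    Residues that never occur in the column get no entry here; a real PSSM
    would use pseudocounts and sequence weights instead.
    """
    counts = Counter(observed)
    total = len(observed)
    return {res: math.log2((n / total) / background[res])
            for res, n in counts.items()}


print(pssm_score("NMFWAFGH", PSSM))  # 12, matching the slide
print(PSSM["K"][2])                  # 2: "K" at position 3
column = log_odds_column("EQRGKAFA", BACKGROUND)
print(round(column["A"], 2))         # A occurs 2/8 times vs 0.085 background
```

Positive log-odds values mark residues seen more often in the column than the background predicts, which is why adding them along a sequence gives a meaningful match score.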