Profiles and Hidden Markov Models (HMMs) in BCB 444/544 Fall 07 - Prof. Drena Leigh Dobbs, Study notes of Bioinformatics

Lecture notes from a bioinformatics course (BCB 444/544) held in Fall 2007 at Iowa State University (ISU), focusing on profiles and hidden Markov models (HMMs). The notes cover transcription factors and their binding sites, enhancers, and repressors, as well as RNA sequences, structures, and functions. They also discuss the importance of HMMs in finding conserved patterns, building phylogenetic trees, and identifying protein domains, and list references for further study.

Typology: Study notes

Pre 2010

Uploaded on 09/02/2009

koofers-user-7rf

#16 - Profiles & HMMs 9/28/07 BCB 444/544 Fall 07 Dobbs

BCB 444/544 Lecture 16
Profiles & Hidden Markov Models (HMMs)
#16_Sept28

Required Reading (before lecture)
√ Mon & Wed Sept 24 & 26 - Lectures 14 & 15
  Review: Nucleus, Chromosomes, Genes, RNAs, Proteins
  Surprise lecture: No assigned reading
√ Fri Sept 28 - Lecture 16
  Profiles & Hidden Markov Models
  • Chp 6 - pp 79-84
  • Eddy: What is a hidden Markov Model? 2004 Nature Biotechnol 22:1315
    http://www.nature.com/nbt/journal/v22/n10/abs/nbt1004-1315.html
Thurs Sept 27 - Lab 4 & Mon Oct 1 - Lecture 17
  Protein Families, Domains, and Motifs
  • Chp 7 - pp 85-96

Assignments & Announcements
Fri Sept 28
• Exam 1 - Graded & returned in class - Really!
• HW#2 - Graded & returned in class - Really!
  • Answer KEYs posted on website
  • Grades posted on WebCT
• HomeWork #3 - posted online
  Due: Mon Oct 8 by 5 PM
• HW544Extra #1 - posted online
  Due: Task 1.1 - Mon Oct 1 by noon
       Task 1.2 & Task 2 - Mon Oct 8 by 5 PM

BCB 544 - Extra Required Reading
Mon Sept 24
BCB 544 Extra Required Reading Assignment:
• Pollard KS, Salama SR, Lambert N, Lambot MA, Coppens S, Pedersen JS, Katzman S, King B, Onodera C, Siepel A, Kern AD, Dehay C, Igel H, Ares M Jr, Vanderhaeghen P, Haussler D. (2006) An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443: 167-172.
• http://www.nature.com/nature/journal/v443/n7108/abs/nature05113.html
  doi:10.1038/nature05113
• PDF available on class website - under Required Reading Link

Extra Credit Questions #2-6:
2.
What is the size of the dystrophin gene (in kb)? Is it still the largest known human protein?
3. What is the largest protein encoded in human genome (i.e., longest single polypeptide chain)?
4. What is the largest protein complex for which a structure is known (for any organism)?
5. What is the most abundant protein (naturally occurring) on earth?
6. Which state in the US has the largest number of mobile genetic elements (transposons) in its living population?
For 1 pt total (0.2 pt each): Answer all questions correctly & submit by email to terrible@iastate.edu
For 2 pts total: Prepare a PPT slide with all correct answers & submit to ddobbs@iastate.edu before 9 AM on Mon Oct 1
• Choose one option - you can't earn 3 pts!
• Partial credit for incorrect answers? Only if they are truly amusing!

Extra Credit Questions #7 & #8:
Given that each male attending our BCB 444/544 class on a typical day is healthy (let's assume MH=7), and is generating sperm at a rate equal to the average normal rate for reproductively competent males (dSp/dT = ? per minute):
7a. How many rounds of meiosis will occur during our 50 minute class period?
7b. How many total sperm will be produced by our BCB 444/544 class during that class period?
8. How many rounds of meiosis will occur in the reproductively competent females in our class? (assume FH=5)
For 0.6 pts total (0.2 pt each): Answer all questions correctly & submit by email to terrible@iastate.edu
For 1 pt total: Prepare a PPT slide with all correct answers & submit to ddobbs@iastate.edu before 9 AM on Mon Oct 1
• Choose one option - you can't earn more than 1 pt for this!
• Partial credit for incorrect answers? Only if they are truly amusing!

Information flow in the cell?
• DNA -> RNA -> protein:
  • Replication = DNA to DNA - by DNA polymerase
  • Transcription = DNA to RNA - by RNA polymerase
  • Translation = RNA to protein - by ribosomes
• Exceptions/Complications:
  • DNA rearrangements (by mobile genetic elements, recombination)
  • Reverse transcription (RNA -> DNA, by reverse transcriptase)
  • Post-transcriptional modifications:
    • RNA splicing (removal of introns, by spliceosome)
    • RNA editing (addition/removal of nucleotides - usually U's)
  • Post-translational modifications:
    • Protein processing

Modeling Metabolic Pathways?
see MetNet http://metnet.vrac.iastate.edu/MetNet_overview.htm

Chromosomes & Genes
Genes in chromatin are not just “beads on a string”; they are packaged in complex structures that we don't yet fully understand

Gene regulation
• In eukaryotes, genes are often regulated at other levels:
  • Post-transcriptional (RNA transport, splicing, stability)
  • Post-translational (protein localization, folding, stability)
• Transcriptional regulation is primarily mediated by proteins that bind cis-acting elements or DNA sequence signals associated with genes:
  • DNA level (sequence-specific) regulatory signals
    • Promoters, terminators
    • Enhancers, repressors, silencers
  • Chromatin level (global) regulation
    • Heterochromatin (inactive)
    • e.g., X-inactivation in female mammals

Promoter = DNA sequences required for initiation of transcription; contain TF binding sites, usually "close" to start site
• Transcription factors (TFs) - proteins that regulate transcription
• (In eukaryotes) RNA polymerase binds by recognizing a complex of TFs bound at promoter
First, TFs must bind TF binding sites (TFBSs) within promoters; then RNA polymerase can bind and initiate transcription of RNA
[Figure labels: ~200 bp; Pre-mRNA]

Enhancers & repressors = DNA sequences that regulate initiation of transcription; contain TF binding sites, can be far from start site!
• Enhancers "enhance" transcription
• Repressors or silencers "repress" transcription
• Enhancer binding proteins (TFs) interact with RNAP
• Repressor binding proteins (TFs) block transcription
• RNAP = RNA polymerase II
[Figure labels: Enhancer, Repressor, Promoter, Gene; 10-50,000 bp]

Sequence Logos - for RNA Splicing Sites
http://www-lmmb.ncifcrf.gov/~toms/gallery/SequenceLogoSculpture.gif
Human intron donor and acceptor sites

PSSM vs Profile
• Position-Specific Scoring Matrix: from ungapped MSA
• Profile: from MSA, including gaps

PSI-BLAST Pseudocode
  Convert query to PSSM (or a Profile)
  do {
      BLAST database with PSSM
      Stop if no new homologs are found
      Add new homologs to PSSM
  }
  Print current set of homologs

Note: Xiong textbook distinguishes between PSSMs (which have no gaps) & Profiles (which can include gaps). Thus, based on these definitions, PSI-BLAST uses a Profile to iteratively add new homologs; other authors refer to the pattern used by PSI-BLAST as a PSSM.

What is a PSSM?
Position-Specific Scoring Matrix
A PSSM is:
• a representation of a motif
• an n by m matrix, where n is size of alphabet & m is length of sequence
• a matrix of scores in which entry at (i, j) is score assigned by PSSM to letter i at the jth position
Also sometimes called: Position Weight Matrix (PWM)
Note: Assumes positions are independent

Rows: 20-letter alphabet; columns: positions of the 8-residue sequence

      1   2   3   4   5   6   7   8
A    -1  -2  -1   0  -1  -2   0  -2
R     5   0   5  -2   1  -3  -2   0
N     0   6   0   0   0  -3   0   1
D    -2   1  -2  -1   0  -3  -1  -1
C    -3  -3  -3  -3  -3  -2  -3  -3
Q     1   0   1  -2   5  -3  -2   0
E     0   0   0  -2   2  -3  -2   0
G    -2   0  -2   6  -2  -3   6  -2
H     0   1   0  -2   0  -1  -2   8
I    -3  -3  -3  -4  -3   0  -4  -3
L    -2  -3  -2  -4  -2   0  -4  -3
K     2   0   2  -2   1  -3  -2  -1
M    -1  -2  -1  -3   0   0  -3  -2
F    -3  -3  -3  -3  -3   6  -3  -1
P    -2  -2  -2  -2  -1  -4  -2  -2
S    -1   1  -1   0   0  -2   0  -1
T    -1   0  -1  -2  -1  -2  -2  -2
W    -3  -4  -3  -2  -2   1  -2  -2
Y    -2  -2  -2  -3  -1   3  -3   2
V    -3  -3  -3  -3  -2  -1  -3  -3

“K” at position 3 gets a score of 2

I added more text to this slide
Xiong: PSSM = table that contains probability information re: residues at each position of an ungapped MSA

PSSM Entries = Log-Odds Scores
\[ \log_2 \left( \frac{\Pr(A \mid M)}{\Pr(A \mid B)} \right) \]
• \(\Pr(A \mid M)\): observed frequency of residue “A” under the foreground model (i.e., the PSSM)
• \(\Pr(A \mid B)\): frequency of residue “A” under the background model
1. Estimate probability of observing each residue (probability of A given M, where M is PSSM model)
2. Divide by background probability of observing each residue (probability of A given B, where B is background model)
3. Take log so that we can add (rather than multiply) scores
This slide was modified

Statistics References
• Statistical Inference (Hardcover), George Casella, Roger L. Berger
• StatWeb: A Guide to Basic Statistics for Biologists
  http://www.dur.ac.uk/stat.web/
• Basic Statistics: http://www.statsoft.com/textbook/stbasic.html
  (correlations, tests, frequencies, etc.)
• Electronic Statistics Textbook: StatSoft
  http://www.statsoft.com/textbook/stathome.html
  (from basic statistics to ANOVA to discriminant analysis, clustering, regression, data mining, machine learning, etc.)
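
The three-step log-odds recipe for PSSM entries (estimate the residue probability, divide by the background probability, take the log) can be sketched in a few lines of Python. This is only an illustrative sketch, not the method used by PSI-BLAST or any particular tool: the DNA alignment `toy_msa`, the uniform background, the add-one pseudocount, and the function names `log_odds`, `build_pssm`, and `score` are all assumptions made up for this example.

```python
import math


def log_odds(p_model, p_background):
    """Log-odds score in bits: log2( Pr(A|M) / Pr(A|B) )."""
    return math.log2(p_model / p_background)


def build_pssm(aligned, alphabet="ACGT", pseudocount=1.0):
    """Build a PSSM (one dict of scores per column) from an ungapped alignment.

    Assumptions for this sketch: a uniform background model and a simple
    pseudocount so that unseen letters do not give log2(0).
    """
    background = 1.0 / len(alphabet)
    pssm = []
    for j in range(len(aligned[0])):
        column = [seq[j] for seq in aligned]
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            letter: log_odds((column.count(letter) + pseudocount) / total,
                             background)
            for letter in alphabet
        })
    return pssm


def score(pssm, seq):
    """Score a sequence by adding column scores (positions assumed independent)."""
    return sum(column[letter] for column, letter in zip(pssm, seq))


# Hypothetical ungapped alignment of four short DNA sequences.
toy_msa = ["TATA", "TATA", "TACA", "TATT"]
toy_pssm = build_pssm(toy_msa)

print(score(toy_pssm, "TATA"))  # positive: looks like the motif
print(score(toy_pssm, "GCGC"))  # negative: looks like background
```

Because the scores are logarithms, summing them across columns corresponds to multiplying the per-position probability ratios, which is exactly why step 3 takes the log.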

Sequence Profiles
Goal: to characterize sequences belonging to a class (structural or functional) & determine whether a query sequence also belongs to that class
• DNA or RNA sequences
• Protein sequences
• Idea is to provide a "model" of the class against which we can test the new sequence

Protein Sequence Profiles & PSSMs
• Profile - a table that lists frequencies of each amino acid in each position of a protein sequence
• PSSM - a special type of Profile - with no gaps
• Frequencies are calculated from an MSA containing a domain of interest
• Can be used to generate a consensus sequence
• Derived scoring scheme can be used to align a new sequence to the profile
• Profile can be used in database searches (PSI-BLAST) to find new sequences that match the profile
• Profiles can also be used to compute MSAs heuristically (e.g., progressive alignment)

PSI-BLAST Limitations for generating patterns or "motifs"
• With PSSMs, can't have insertions and deletions
• With Profiles, essentially 'add extra columns' to PSSM to allow for gaps
• Better approach (for defining domains)?
  • Profile HMM: elaborated version of a profile
  • Intuitively, a profile that models gaps

Sequence Motifs (Patterns)
Types of representations?
• √ Consensus Sequence
• √ Sequence Logo - "enhanced" consensus sequence, in which symbol size ∝ information entropy
  • Information entropy???
    In information theory, the Shannon entropy or information entropy is a measure of the [decrease in] uncertainty associated with a random variable. Entropy quantifies information in a piece of data.
    - Wikipedia
  • Check out this interesting website: Tom Schneider, NCIF
  • http://www.ccrnp.ncifcrf.gov/~toms/glossary.html#sequence_logo
• √ PSSM - Position-Specific Scoring Matrix
• √ Profile HMMs - Hidden Markov Models

HMMs: an example
Nucleotide frequencies in human genome (%)
[Table residue, order as extracted: values 29.6, 20.5, 29.5, 20.4; labels G, T, C, A]

CpG Islands
• CpG dinucleotides are rarer than would be expected from independent probabilities of C and G (given the background frequencies in human genome)
• High CpG frequency is sometimes biologically significant; e.g., sometimes associated with promoter regions (“start sites” for genes)
• CpG island - a region where CpG dinucleotides are much more abundant than elsewhere
(Written CpG to distinguish from a C≡G base pair)

Hidden Markov Models - HMMs
Goal: Find most likely explanation for observed variables
Components:
• Observed variables
• Hidden variables
• Emitted symbols
• Emission probabilities
• Transition probabilities
• Graphical representation to illustrate relationships among these

The Occasionally Dishonest Casino
A casino uses a fair die most of the time, but occasionally switches to a "loaded" one
• Fair die: Prob(1) = Prob(2) = . . . = Prob(6) = 1/6
• Loaded die: Prob(1) = Prob(2) = . . . = Prob(5) = 1/10, Prob(6) = ½
• These are emission probabilities
Transition probabilities
• Prob(Fair → Loaded) = 0.01
• Prob(Loaded → Fair) = 0.2
• Transitions between states obey a Markov process
• (more on Markov chains/models/processes a bit later)

An HMM for Occasionally Dishonest Casino
Transition probabilities
• Prob(Fair → Loaded) = 0.01
• Prob(Loaded → Fair) = 0.2
Emission probabilities
• Fair die: Prob(1) = Prob(2) = . . .
= Prob(6) = 1/6
• Loaded die: Prob(1) = Prob(2) = . . . = Prob(5) = 1/10, Prob(6) = ½

The Occasionally Dishonest Casino
• Known:
  • Structure of the model
  • Transition probabilities
• Hidden: What casino actually did
  • FFFFFLLLLLLLFFFF...
• Observable: Series of die tosses
  • 3415256664666153...
• What we must infer:
  • When was a fair die used?
  • When was a loaded one used?
  • Answer is a sequence FFFFFFFLLLLLLFFF...

HMM: Making the Inference
• Model assigns a probability to each explanation for the observation, e.g.:
  P(326|FFL) = P(3|F) · P(F→F) · P(2|F) · P(F→L) · P(6|L)
             = 1/6 · 0.99 · 1/6 · 0.01 · ½
• Maximum Likelihood: Determine which explanation is most likely
  • Find path most likely to have produced observed sequence
• Total Probability: Determine probability that observed sequence was produced by HMM
  • Consider all paths that could have produced the observed sequence

HMM Notation
• x = sequence of symbols emitted by model
  • \(x_i\) = symbol emitted at time i
• π = path, a sequence of states
  • i-th state in π is \(\pi_i\)
• \(a_{kr}\) = probability of making a transition from state k to state r
  \[ a_{kr} = \Pr(\pi_i = r \mid \pi_{i-1} = k) \]
• \(e_k(b)\) = probability that symbol b is emitted when in state k
  \[ e_k(b) = \Pr(x_i = b \mid \pi_i = k) \]

Calculating Different Paths to an Observed Sequence
Observed sequence: \(x = x_1, x_2, x_3 = 6, 2, 6\)
Paths: \(\pi^{(1)} = FFF\), \(\pi^{(2)} = LLL\), \(\pi^{(3)} = LFL\)
\[ \Pr(x, \pi^{(1)}) = a_{0F}\, e_F(6)\, a_{FF}\, e_F(2)\, a_{FF}\, e_F(6) = 0.5 \times \frac{1}{6} \times 0.99 \times \frac{1}{6} \times 0.99 \times \frac{1}{6} \approx 0.00227 \]
\[ \Pr(x, \pi^{(2)}) = a_{0L}\, e_L(6)\, a_{LL}\, e_L(2)\, a_{LL}\, e_L(6) = 0.5 \times 0.5 \times 0.8 \times 0.1 \times 0.8 \times 0.5 = 0.008 \]
\[ \Pr(x, \pi^{(3)}) = a_{0L}\, e_L(6)\, a_{LF}\, e_F(2)\, a_{FL}\, e_L(6) = 0.5 \times 0.5 \times 0.2 \times \frac{1}{6} \times 0.01 \times 0.5 \approx 0.0000417 \]
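
The three path calculations for the rolls 6, 2, 6 can be checked with a short Python sketch of the occasionally dishonest casino. The transition and emission probabilities are the ones given on the slides, and the start probabilities of 0.5 for each die are taken from the first factor of each calculation. The dictionary and function names (`START`, `TRANS`, `EMIT`, `joint_probability`, and so on) are made up for this illustration. Enumerating every path is only practical for very short sequences; it is used here for clarity, whereas the Viterbi and forward algorithms do the same two jobs efficiently.

```python
from itertools import product

# Start probabilities a_0k (0.5 each, as used in the path calculations).
START = {"F": 0.5, "L": 0.5}

# Transition probabilities a_kr between Fair (F) and Loaded (L) dice.
TRANS = {
    "F": {"F": 0.99, "L": 0.01},
    "L": {"F": 0.2, "L": 0.8},
}

# Emission probabilities e_k(b) for die faces "1".."6".
EMIT = {
    "F": {face: 1 / 6 for face in "123456"},
    "L": {**{face: 0.1 for face in "12345"}, "6": 0.5},
}


def joint_probability(rolls, path):
    """Pr(x, pi): probability of the rolls together with one state path."""
    prob = START[path[0]] * EMIT[path[0]][rolls[0]]
    for i in range(1, len(rolls)):
        prob *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][rolls[i]]
    return prob


def all_paths(length):
    """Every possible F/L state path of the given length."""
    return ("".join(states) for states in product("FL", repeat=length))


def most_likely_path(rolls):
    """Maximum likelihood explanation, found by brute-force enumeration."""
    return max(all_paths(len(rolls)), key=lambda p: joint_probability(rolls, p))


def total_probability(rolls):
    """Pr(x): sum of Pr(x, pi) over all paths that could produce the rolls."""
    return sum(joint_probability(rolls, p) for p in all_paths(len(rolls)))


for path in ("FFF", "LLL", "LFL"):
    print(path, joint_probability("626", path))
print("most likely path:", most_likely_path("626"))
print("total probability:", total_probability("626"))
```

For the rolls 6, 2, 6 the all-loaded path LLL has the highest joint probability of the eight possible paths, which matches the intuition that two sixes in three rolls favor the loaded die.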