BCB 444/544 Fall 07 Lecture 10: BLAST Details and Gene Jargon - Prof. Drena Leigh Dobbs, Lab Reports of Bioinformatics

A lecture note from a biochemistry course at Iowa State University (ISU) in fall 2007, covering the details of BLAST (Basic Local Alignment Search Tool) and related gene jargon. It includes information on BLAST algorithms, database searching, sequence comparison, and statistical significance. It also discusses the differences between FASTA and BLAST and provides resources for further study.

Typology: Lab Reports

Pre 2010

Uploaded on 09/02/2009

koofers-user-vbx
Partial preview of the text

#10 BLAST Details + some Gene Jargon
9/12/07 BCB 444/544 Fall 07 Dobbs
9/12/07 BCB 444/544 F07 ISU Dobbs #10 - BLAST details + some Gene Jargon

BCB 444/544
Lecture 10
BLAST Details
Plus some Gene Jargon
#10_Sept12


Required Reading (before lecture)

√ Mon Sept 10 - for Lecture 9
BLAST variations; BLAST vs FASTA, SW
• Chp 4 - pp 51-62

√ Wed Sept 12 - for Lecture 10 & Lab 4
Multiple Sequence Alignment (MSA)
• Chp 5 - pp 63-74

Fri Sept 14 - for Lecture 11
Position Specific Scoring Matrices & Profiles
• Chp 6 - pp 75-78 (but not HMMs)

• Good Additional Resource re: Sequence Alignment?
• Wikipedia: http://en.wikipedia.org/wiki/Sequence_alignment


Assignments & Announcements - #1

Revised Grading Policy has been sent via email. Please review!

√ Mon Sept 10 - Lab 3 Exercise due 5 PM: to: terrible@iastate.edu
Thu Sept 13 - Graded Labs 2 & 3 will be returned at beginning of Lab 4
Fri Sept 14 - HW#2 due by 5 PM (106 MBB)
Study Guide for Exam 1 will be posted by 5 PM


Review: Gene Jargon #1 (for HW2, 1c)

Exons = "protein-encoding" (or "kept") parts of eukaryotic genes
vs
Introns = "intervening sequences" = segments of eukaryotic genes that "interrupt" exons
• Introns are transcribed into pre-RNA
• but are later removed by RNA processing
• & do not appear in mature mRNA
• so are not translated into protein


Assignments & Announcements - #2

Mon Sept 17 - Answers to HW#2 will be posted by 5 PM
Thu Sept 20 - Lab = Optional Review Session for Exam
Fri Sept 21 - Exam 1 - Will cover:
• Lectures 2-12 (thru Mon Sept 17)
• Labs 1-4
• HW2
• All assigned reading: Chps 2-6 (but not HMMs); Eddy: What is Dynamic Programming


Chp 3 - Sequence Alignment

SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 3 Pairwise Sequence Alignment
• √ Evolutionary Basis
• √ Sequence Homology versus Sequence Similarity
• √ Sequence Similarity versus Sequence Identity
• √ Methods - (Dot Plots, DP; Global vs Local Alignment)
• √ Scoring Matrices (PAM vs BLOSUM)
• √ Statistical Significance of Sequence Alignment

Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.


Local Alignment: Algorithm (This slide has been changed!)

1) Initialize top row & leftmost column of matrix with "0"
2) Fill in DP matrix: in local alignment, there are no negative scores. Assign "0" to cells with negative scores
3) Optimal score? In highest scoring cell(s)
4) Optimal alignment(s)? Traceback from each cell containing the optimal score, until a cell with "0" is reached (not just from lower right corner)


Local Alignment DP: Initialization & Recursion (New Slide)

Here σ(x_i, y_j) is the substitution score for aligning x_i with y_j, and γ is the gap penalty.

$$S(0,0) = 0, \qquad S(i,0) = 0, \qquad S(0,j) = 0$$

$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \\ 0 \end{cases}$$


A Few Words about Parameter Selection in Sequence Alignment

Optimal alignment between a pair of sequences depends critically on the selection of substitution matrix & gap penalty function:

$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \end{cases}$$

In using BLAST or similar software, it is important to understand and, sometimes, to adjust these parameters (default is NOT always best!)

How do we pick parameters that give the most biologically meaningful alignments and alignment scores?


Calculating an Alignment Score using a Substitution Matrix & an Affine Gap Penalty

• Alignment score is the sum of all match/mismatch scores (from the substitution matrix) with an affine penalty subtracted for each gap

Alignment:
a b c - - d
a c c e f d

Values from substitution matrix: 9 (a/a), 2 (b/c), 7 (c/c), 6 (d/d)
Match score: 9 + 2 + 7 + 6 = 24
Gap opening + extension: 10 + 2 = 12
Alignment score: 24 - (10 + 2) = 12


Chp 4 - Database Similarity Searching

SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 4 Database Similarity Searching
• Unique Requirements of Database Searching
• Heuristic Database Searching
• Basic Local Alignment Search Tool (BLAST)
• FASTA
• Comparison of FASTA and BLAST
• Database Searching with Smith-Waterman Method


Database searching

[Figure: Query sequence + Sequence database -> Sequence comparison algorithm -> Target sequences ranked by score]


Review: Gene Jargon #2.1

6-Frame translated DNA Sequence? Remember GeneBoy exercise?


Review: Gene Jargon #2.2

6-Frame translated DNA Sequence?
Try NCBI tools:
http://www.ncbi.nlm.nih.gov/gorf/orfig.cgi
http://www.ncbi.nlm.nih.gov/
Or - for some Biology review re: DNA/RNA & ORFs, see next 3 slides borrowed from EMBL-EBI: http://www.ebi.ac.uk/


Review: Gene Jargon #2.3

DNA Strands (http://www.ebi.ac.uk/)


Review: Gene Jargon #2.4

RNA Strands - copied from DNA (http://www.ebi.ac.uk/)


Review: Gene Jargon #2.5

Reading Frames (http://www.ebi.ac.uk/)


BLAST - How does it work?

Main idea - based on dot plots!

[Figure: dot plot; axis sequences: G A T C A A C T G A C G T A G T T C A G C T G C G T A C]


Dot Plots - apply in BLAST:

[Figure: dot plot; axis sequences: G A T C A A C T G A C G T A G T T C A G C T G C G T A C]

Perform fast, approximate local alignments to find sequences in database that are related to query sequence.
Here, use a 4-base "window" with 75% identity (allow mismatches).


Detailed Steps in BLAST algorithm

1. Remove low-complexity regions (LCRs)
2. Make a list (dictionary): all words of length 3 aa or 11 nt
3. Augment list to include similar words
4. Store list in a search tree (data structure)
5. Scan database for occurrences of words in search tree
6. Connect nearby occurrences
7. Extend matches (words) in both directions
8. Prune list of matches using a score threshold
9. Evaluate significance of each remaining match
10. Perform Smith-Waterman to get alignment


1: Filter low-complexity regions (LCRs) (This slide has been changed!)

$$K = \frac{1}{L} \log_N \left( \frac{L!}{\prod_i n_i!} \right)$$

L = window length (usually 12)
N = alphabet size (4 or 20)
n_i = frequency of the i-th letter in the window

• Low complexity regions, transmembrane regions and coiled-coil regions often display significant similarity without homology.
• Low complexity sequences can yield false positives.
• Screen them out of your query sequences! When appropriate!

K = compositional complexity; varies from 0 (very low complexity) to 1 (high complexity)

e.g., for GGGG:
L! = 4! = 4 x 3 x 2 x 1 = 24
n_G = 4, n_T = n_A = n_C = 0
Π n_i! = 4! x 0! x 0! x 0! = 24
K = 1/4 log_4 (24/24) = 0

For CGTA:
K = 1/4 log_4 (24/1) = 0.57


2: List all words in query

YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
YGG
GGF
GFM
FMT
MTS
TSE
SEK
…


3: Augment word list

YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
Query words: YGG GGF GFM FMT MTS TSE SEK …
Candidate words: AAA AAB AAC … YYY
20^3 = 8000 possible matches


3: Augment word list

BLOSUM62 scores:

G G F
A A A
0 + 0 + -2 = -2   Non-match

G G F
G G Y
6 + 6 + 3 = 15   Match

A user-specified threshold, T, determines which 3-letter words are considered matches and non-matches


3: Augment word list

YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
Query words: YGG GGF GFM FMT MTS TSE SEK …
Words similar to GGF: GGI GGL GGM GGF GGW GGY …


3: Augment word list

Observation: Selecting only words with score > T greatly reduces the number of possible matches; otherwise, there are 20^3 for 3-letter words from amino acid sequences!


Example

    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R  -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N  -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D  -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C   0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q  -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H  -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

Find all words that match EAM with a score greater than or equal to 11:

EAM  5 + 4 + 5 = 14
DAM  2 + 4 + 5 = 11
QAM  2 + 4 + 5 = 11
ESM  5 + 1 + 5 = 11
EAL  5 + 4 + 2 = 11


4: Store words in search tree

[Figure: Augmented list of query words -> Search tree]
"Does this query contain GGF?"
"Yes, at position 2."


Search tree

G
  G
    F  (GGF)
    L  (GGL)
    M  (GGM)
    W  (GGW)
    Y  (GGY)


Example

Put this word list into a search tree:
DAM QAM EAM KAM ECM EGM ESM ETM EVM EAI EAL EAV

D
  A
    M  (DAM)
Q
  A
    M  (QAM)
E
  A
    M  (EAM)
    I  (EAI)
    L  (EAL)
    V  (EAV)
  C
    M  (ECM)
  G
    M  (EGM)
  S
    M  (ESM)
  T
    M  (ETM)
  V
    M  (EVM)
K
  A
    M  (KAM)
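
The local alignment algorithm and its recursion (initialize the first row and column with 0, clamp negative cells to 0, trace back from the highest-scoring cell until a 0 is reached) can be sketched in Python. This is a minimal illustration, not the course's implementation: the function name `local_alignment` and the simple match/mismatch scores standing in for a substitution matrix are assumptions, and only one optimal alignment is traced back.

```python
def local_alignment(x, y, match=2, mismatch=-1, gap=2):
    """Local alignment by dynamic programming with a linear gap penalty.

    match/mismatch are a stand-in for a substitution matrix (assumption);
    gap is the penalty subtracted for each gap position.
    Returns (best score, aligned part of x, aligned part of y).
    """
    n, m = len(x), len(y)
    # Row 0 and column 0 stay at 0 (initialization).
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            # The extra 0 keeps every cell non-negative.
            S[i][j] = max(S[i - 1][j - 1] + sub,
                          S[i - 1][j] - gap,
                          S[i][j - 1] - gap,
                          0)
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)

    # Traceback from the highest-scoring cell until a 0 cell is reached.
    i, j = best_pos
    ax, ay = [], []
    while i > 0 and j > 0 and S[i][j] > 0:
        sub = match if x[i - 1] == y[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sub:
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] - gap:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(local_alignment("xxxACGTyyy", "zzACGTww"))
```

The traceback starts wherever the maximum is found rather than at the lower right corner, which is the point the slide stresses.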
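
The affine-gap score calculation from the "Calculating an Alignment Score" slide can be checked with a few lines of Python. This is a sketch under stated assumptions: the function name `alignment_score` is hypothetical, the substitution values are the four numbers shown on the slide, and the affine convention assumed here charges 10 for the first position of a gap and 2 for each further position (the slide only shows the total 10 + 2).

```python
def alignment_score(row1, row2, sub, gap_open=10, gap_extend=2):
    """Score a given pairwise alignment with an affine gap penalty.

    Assumed convention: the first position of a gap costs gap_open and
    every following position of the same gap costs gap_extend.
    Simplification: adjacent gaps in different rows count as one gap.
    """
    score = 0
    in_gap = False
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            score -= gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += sub[(a, b)]
            in_gap = False
    return score


# Substitution values taken from the slide: 9, 2, 7, 6.
slide_scores = {("a", "a"): 9, ("b", "c"): 2, ("c", "c"): 7, ("d", "d"): 6}

if __name__ == "__main__":
    print(alignment_score("abc--d", "accefd", slide_scores))
```

With the slide's values this gives 24 - (10 + 2) = 12.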
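
The windowed dot plot idea from the "Dot Plots - apply in BLAST" slide (a 4-base window, 75% identity) can be sketched as follows. The function name `window_dotplot` and the 0-based coordinates are assumptions made for illustration; BLAST itself does not build a full dot plot.

```python
def window_dotplot(seq1, seq2, window=4, min_identity=0.75):
    """Return (i, j) start positions (0-based) where the windows
    seq1[i:i+window] and seq2[j:j+window] reach min_identity."""
    dots = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(a == b for a, b in
                          zip(seq1[i:i + window], seq2[j:j + window]))
            if matches / window >= min_identity:
                dots.append((i, j))
    return dots


if __name__ == "__main__":
    # GATC vs GTTC: 3 of 4 positions agree, so a dot is placed at (0, 0).
    print(window_dotplot("GATC", "GTTC"))
```

Allowing one mismatch per 4-base window is what lets diagonals survive point substitutions between related sequences.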
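
The complexity measure K from step 1 (filtering low-complexity regions) is easy to compute directly, which makes the two worked examples (GGGG and CGTA) checkable. A minimal sketch, assuming a hypothetical function name `complexity` and requiring Python 3.8+ for `math.prod`:

```python
import math
from collections import Counter


def complexity(window, alphabet_size=4):
    """K = (1/L) * log_N( L! / prod(n_i!) ) for one sequence window.

    L is the window length, N the alphabet size (4 for nucleotides,
    20 for amino acids), n_i the count of each letter in the window.
    """
    L = len(window)
    counts = Counter(window)
    ratio = math.factorial(L) // math.prod(
        math.factorial(n) for n in counts.values())
    return math.log(ratio, alphabet_size) / L


if __name__ == "__main__":
    print(round(complexity("GGGG"), 2))  # single repeated letter
    print(round(complexity("CGTA"), 2))  # four different letters
```

Letters that do not occur contribute 0! = 1 to the product, so skipping them, as `Counter` does, leaves K unchanged.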
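
Steps 2 and 3 (listing all query words, then augmenting the list with similar words above a threshold T) can be sketched in Python. To stay short, this example embeds only the E, A and M rows of the BLOSUM62 table above, so `neighborhood` works only for words built from those three letters; the function names are hypothetical, and brute-force enumeration of all 20^3 candidates is chosen for clarity, not speed.

```python
from itertools import product

AMINO = "ARNDCQEGHILKMFPSTWYV"  # column order of the BLOSUM62 table

# Only the rows needed for the word EAM, copied from the table above.
BLOSUM62_ROWS = {
    "E": [-1, 0, 0, 2, -4, 2, 5, -2, 0, -3, -3, 1, -2, -3, -1, 0, -1, -3, -2, -2],
    "A": [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
    "M": [-1, -1, -2, -3, -1, 0, -2, -3, -2, 1, 2, -1, 5, 0, -2, -1, -1, -1, -1, 1],
}


def words(query, k=3):
    """Step 2: all overlapping words of length k in the query."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


def neighborhood(word, threshold):
    """Step 3: every candidate word scoring >= threshold against word."""
    hits = []
    for cand in product(AMINO, repeat=len(word)):
        score = sum(BLOSUM62_ROWS[a][AMINO.index(b)]
                    for a, b in zip(word, cand))
        if score >= threshold:
            hits.append(("".join(cand), score))
    # Highest score first, ties in alphabetical order.
    return sorted(hits, key=lambda h: (-h[1], h[0]))


if __name__ == "__main__":
    print(words("YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ")[:7])
    for word, score in neighborhood("EAM", 11):
        print(word, score)
```

With a threshold of 11 this yields the five words listed in the example (EAM, DAM, QAM, ESM, EAL); lowering the threshold to 10 yields the twelve words used in the search-tree example.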
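
Steps 4 and 5 (storing words in a search tree, then scanning a database sequence for occurrences) can be illustrated with a small trie built from nested dictionaries. This is a sketch, not BLAST's actual data structure: the names `build_trie`, `lookup` and `scan`, the `"$"` end-of-word marker, and the 1-based positions (chosen to match "Yes, at position 2.") are all assumptions. For brevity the tree holds only the exact query words, not the augmented list.

```python
def build_trie(query, k=3):
    """Store every length-k query word with its 1-based start position."""
    root = {}
    for i in range(len(query) - k + 1):
        node = root
        for ch in query[i:i + k]:
            node = node.setdefault(ch, {})
        # "$" marks the end of a word and holds its query positions.
        node.setdefault("$", []).append(i + 1)
    return root


def lookup(trie, word):
    """Return the query positions of word, or [] if it is absent."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return []
    return node.get("$", [])


def scan(trie, subject, k=3):
    """Step 5: report (query position, subject position) for each hit."""
    hits = []
    for j in range(len(subject) - k + 1):
        for pos in lookup(trie, subject[j:j + k]):
            hits.append((pos, j + 1))
    return hits


if __name__ == "__main__":
    trie = build_trie("YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ")
    print(lookup(trie, "GGF"))   # "Does this query contain GGF?"
    print(scan(trie, "MGGFM"))   # hits in a made-up subject sequence
```

Each lookup walks at most k nodes regardless of how many words are stored, which is why a tree suits scanning a large database.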