Lecture Slides on Scoring Matrices and Alignment Statistics | BCB 444, Study notes of Bioinformatics

Material Type: Notes; Professor: Dobbs; Class: INTRO BIOINFORMATCS; Subject: BIOINFORMATICS AND COMPUTATIONAL BIOL; University: Iowa State University; Term: Fall 2007;

Uploaded on 09/02/2009

koofers-user-bdi-2

#6 - Scoring Matrices & Alignment Statistics
8/31/07 - BCB 444/544 Fall 07 - ISU - Dobbs

BCB 444/544 Lecture 6
• Finish Dynamic Programming
• Scoring Matrices
• Alignment Statistics

Required Reading (before lecture)
Mon Aug 27 - for Lecture #4 Pairwise Sequence Alignment
• Chp 3 - pp 31-41
Wed Aug 29 - for Lecture #5 Dynamic Programming
• Eddy: What is Dynamic Programming? 2004 Nature Biotechnol 22:909
  http://www.nature.com/nbt/journal/v22/n7/abs/nbt0704-909.html
Thurs Aug 30 - Lab #2: Databases, ISU Resources & Pairwise Sequence Alignment
Fri Aug 31 - for Lecture #6 Scoring Matrices & Alignment Statistics
• Chp 3 - pp 41-49

Announcements
Fri Aug 31 - Revised notes for Lecture 5 posted online
  Changes? Mainly re-ordering, symbols, color "coding"
Mon Sept 3 - NO CLASSES AT ISU (Labor Day)!! - Enjoy!!
Tues Sept 4 - Lab #2 Exercise Writeup due by 5 PM (or sooner!)
  Send via email to Pete Zaback petez@iastate.edu
  (HW#2 assignment will be posted online)
Fri Sept 14 - HW#2 due by 5 PM (or sooner!)
Fri Sept 21 - Exam #1

Chp 3 - Sequence Alignment
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 3 Pairwise Sequence Alignment
• √ Evolutionary Basis
• √ Sequence Homology versus Sequence Similarity
• √ Sequence Similarity versus Sequence Identity
• Methods - cont
• Scoring Matrices
• Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.

Methods
• √ Global and Local Alignment
• √ Alignment Algorithms
  • √ Dot Matrix Method
  • Dynamic Programming Method - cont
    • Gap penalties
    • DP for Global Alignment
    • DP for Local Alignment
• Scoring Matrices
  • Amino acid scoring matrices
  • PAM
  • BLOSUM
  • Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment

Sequence Homology vs Similarity
• Homologous sequences - sequences that share a common evolutionary ancestry
• Similar sequences - sequences that have a high percentage of aligned residues with similar physicochemical properties (e.g., size, hydrophobicity, charge)
IMPORTANT:
• Sequence homology:
  • An inference about a common ancestral relationship, drawn when two sequences share a high enough degree of sequence similarity
  • Homology is qualitative
• Sequence similarity:
  • The direct result of observation from a sequence alignment
  • Similarity is quantitative; can be described using percentages

Goal of Sequence Alignment
Find the best pairing of 2 sequences, such that there is maximum correspondence between residues
• DNA: 4-letter alphabet (+ gap)
    TTGACAC
    TTTACAC
• Proteins: 20-letter alphabet (+ gap)
    RKVA-GMA
    RKIAVAMA

Statement of Problem
Given:
• 2 sequences
• Scoring system for evaluating match (or mismatch) of two characters
• Penalty function for gaps in sequences
Find: Optimal pairing of sequences that:
• Retains the order of characters
• Introduces gaps where needed
• Maximizes total score

Avoiding Random Alignments with a Scoring Function
• Introducing too many gaps generates nonsense alignments:
    s--e-----qu---en--ce
    sometimesquipsentice
• Need to distinguish between alignments that occur due to homology and those that occur by chance
• Define a scoring function that rewards matches (+) and penalizes mismatches (-) and gaps (-)
Scoring Function (S):
  e.g. Match: α = 1, Mismatch: β = 1, Gap: γ = 0
  S = α(#matches) - β(#mismatches) - γ(#gaps)
Note: I changed symbols & colors on this slide!

Not All Mismatches are the Same
• Some amino acids are more "exchangeable" than others (physicochemical properties are similar), e.g., Ser & Thr are more similar than Trp & Ala
• Substitution matrix can be used to introduce "mismatch costs" for handling different types of substitutions
• Mismatch costs are not usually used in aligning DNA or RNA sequences, because no substitution is "better" than any other (in general)

Substitution Matrix
• s(a,b) corresponds to score of aligning character a with character b
• Match scores are often calculated based on frequency of mutations in very similar sequences (more details later)

Global vs Local Alignment
Global alignment
• Finds best possible alignment across entire length of 2 sequences
• Aligned sequences assumed to be generally similar over entire length
Local alignment
• Finds local regions with highest similarity between 2 sequences
• Aligns these without regard for rest of sequence
• Sequences are not assumed to be similar over entire length

1- Define Score of Optimum Alignment using Recursion
Define:
  x_{1..i} = Prefix of length i of x
  y_{1..j} = Prefix of length j of y
  S(i,j) = Score of optimum alignment of x_{1..i} and y_{1..j}
Initial conditions:
  $$S(i,0) = -i \cdot \gamma, \qquad S(0,j) = -j \cdot \gamma$$
Recursive definition: For 1 ≤ i ≤ N, 1 ≤ j ≤ M:
  $$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \end{cases}$$

2- Initialize & Fill in DP Matrix for Storing Optimal Scores of Subproblems
• Construct sequence vs sequence matrix, using the initialization and recursion from step 1
[Figure: DP matrix with rows 0..N and columns 0..M; S(0,0) = 0 in the upper left corner, S(N,M) in the lower right corner; cell S(i,j) is computed from S(i-1,j-1), S(i-1,j) and S(i,j-1)]

2- cont Fill in DP Matrix
• Fill in from [0,0] to [N,M] (row by row), calculating best possible score for each alignment including residues at [i,j]
• Keep track of dependencies of scores (in a pointer matrix).

3- Calculate Score S(N,M) of Optimum Alignment - for Global Alignment
What happens in last step in alignment of x[1..i] to y[1..j]? 1 of 3 cases applies:
• xi aligns to yj: score S(i-1,j-1) + σ(xi,yj)
    x1 x2 . . . xi-1 xi
    y1 y2 . . . yj-1 yj
• xi aligns to a gap: score S(i-1,j) - γ
    x1 x2 . . . xi-1 xi
    y1 y2 . . . yj   -
• yj aligns to a gap: score S(i,j-1) - γ
    x1 x2 . . . xi   -
    y1 y2 . . . yj-1 yj

Example
Case 1: Line up xi with yj
    x: C A T T C A C
    y: C - T T C A G
Case 2: Line up xi with space
    x: C A T T C A - C
    y: C - T T C A G -
Case 3: Line up yj with space
    x: C A T T C A C -
    y: C - T T C A - G

Fill in the matrix
+10 for match, -2 for mismatch, -5 for space
Row 0 and column 0 are initialized with multiples of the space penalty (0, -5, -10, ...); the first two computed cells are S(1,1) = 10 and S(1,2) = 5.

Calculate score of optimum alignment
+10 for match, -2 for mismatch, -5 for space

         λ    C    T    C    G    C    A    G    C
    λ    0   -5  -10  -15  -20  -25  -30  -35  -40
    C   -5   10    5    0   -5  -10  -15  -20  -25
    A  -10    5    8    3   -2   -7    0   -5  -10
    T  -15    0   15   10    5    0   -5   -2   -7
    T  -20   -5   10   13    8    3   -2   -7   -4
    C  -25  -10    5   20   15   18   13    8    3
    A  -30  -15    0   15   18   13   28   23   18
    C  -35  -20   -5   10   13   28   23   26   33

The score of the optimum alignment is the lower right entry, S(N,M) = 33.

4- Trace back through matrix to recover optimum alignment(s) that generated the optimal score
How? "Repeat" alignment calculations in reverse order, starting from the position with the highest score and following the path, position by position, back through the matrix
Result? Optimal alignment(s) of sequences

Traceback - for Global Alignment
Start in lower right corner & trace back to upper left
Each arrow introduces one character at end of sequence alignment:
• A horizontal move puts a gap in left sequence
• A vertical move puts a gap in top sequence
• A diagonal move uses one character from each sequence

Traceback to Recover Alignment
[Figure: the filled matrix from "Calculate score of optimum alignment", with the traceback paths marked]
Can have >1 optimum alignment; this example has 2

Local Alignment: Motivation
• To "ignore" stretches of non-coding DNA:
  • Non-coding regions (if "non-functional") are more likely to contain mutations than coding regions
  • Local alignment between two protein-encoding sequences is likely to be between two exons
• To locate protein domains or motifs:
  • Proteins with similar structures and/or similar functions but from different species (for example), often exhibit local sequence similarities
  • Local sequence similarities may indicate "functional modules"
Non-coding - "not encoding protein"
Exons - "protein-encoding" parts of genes
vs Introns = "intervening sequences" - segments of eukaryotic genes that "interrupt" exons
Introns are transcribed into RNA, but are later removed by RNA processing & are not translated into protein

Local Alignment: Example
Match: +2   Mismatch or space: -1
Sequences:
    g g t c t g a g
    a a a c g a
Best local alignment: Score = 5
    g g t c t g a g
    a a a c - g a -
(The aligned region is c t g a over c - g a: 3 matches and 1 space give 3 × 2 - 1 = 5.)
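
The four dynamic programming steps above (define the recursion, initialize and fill the matrix, read off the score, trace back) can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `align` and its parameters are made up for this sketch, σ is passed as a signed match/mismatch score, γ as a positive gap penalty, and the two sequences in the worked example are assumed to be CATTCAC and CTCGCAGC. The `local=True` branch uses the standard local alignment variant (zero initialization, scores floored at 0, traceback from the highest-scoring cell until a 0 is reached).

```python
def align(x, y, match, mismatch, gap, local=False):
    """Return (score, aligned_x, aligned_y) for one optimal alignment.

    match/mismatch are signed scores sigma(a, b); gap is the positive
    penalty gamma, so every space contributes -gap.
    """
    n, m = len(x), len(y)
    # S[i][j]: best score for prefixes x[:i] and y[:j] (global), or for
    # an alignment ending at x[i-1], y[j-1] (local).
    S = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Initial conditions: S(i,0) = -i*gamma, S(0,j) = -j*gamma.
        for i in range(1, n + 1):
            S[i][0] = -i * gap
        for j in range(1, m + 1):
            S[0][j] = -j * gap

    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):          # fill row by row
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma,   # x_i aligns to y_j
                          S[i - 1][j] - gap,         # x_i aligns to a gap
                          S[i][j - 1] - gap)         # y_j aligns to a gap
            if local:
                S[i][j] = max(S[i][j], 0)            # never go below 0
                if S[i][j] > best:
                    best, best_pos = S[i][j], (i, j)

    # Traceback: from the lower right corner (global) or the best cell (local).
    i, j = best_pos if local else (n, m)
    score = S[i][j]
    ax, ay = [], []
    while (i > 0 or j > 0) and not (local and S[i][j] == 0):
        diag_ok = i > 0 and j > 0
        sigma = match if diag_ok and x[i - 1] == y[j - 1] else mismatch
        if diag_ok and S[i][j] == S[i - 1][j - 1] + sigma:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] - gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return score, "".join(reversed(ax)), "".join(reversed(ay))


# Global example: +10 match, -2 mismatch, -5 space.
print(align("CATTCAC", "CTCGCAGC", match=10, mismatch=-2, gap=5)[0])  # 33
# Local example: +2 match, -1 mismatch or space.
print(align("ggtctgag", "aaacga", match=2, mismatch=-1, gap=1, local=True))
```

The traceback here returns only one optimal alignment (it prefers the diagonal move when several moves are tied); recovering all optimal alignments would require following every tied pointer.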

Local Alignment: Algorithm
• S[i, j] = Score for optimally aligning a suffix of X[1..i] with a suffix of Y[1..j]
• Initialize top row & leftmost column of matrix with "0"
Recall: for Global Alignment,
• S[i, j] = Score for optimally aligning the prefix X[1..i] with the prefix Y[1..j]
• Initialize top row & leftmost column of matrix with gap penalties

Traceback - for Local Alignment
+1 for a match, -1 for a mismatch, -5 for a space

         λ   C   T   C   G   C   A   G   C
    λ    0   0   0   0   0   0   0   0   0
    C    0   1   0   1   0   1   0   0   1
    A    0   0   0   0   0   0   2   0   0
    T    0   0   1   0   0   0   0   1   0
    T    0   0   1   0   0   0   0   0   0
    C    0   1   0   2   0   1   0   0   1
    A    0   0   0   0   1   0   2   0   0
    C    0   1   0   1   0   2   0   1   1

Some Results re: Alignment Algorithms (for ComS, CprE & Math types!)
• Most pairwise sequence alignment problems can be solved in O(mn) time
• Space requirement can be reduced to O(m+n), while keeping run-time fixed [Myers88]
• Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau86]

"Scoring" or "Substitution" Matrices
2 Major types for Amino Acids: PAM & BLOSUM
• PAM = Point Accepted Mutation: relies on "evolutionary model" based on observed differences in alignments of closely related proteins
• BLOSUM = BLOck SUbstitution Matrix: based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins

PAM Matrix
PAM = Point Accepted Mutation: relies on "evolutionary model" based on observed differences in closely related proteins
• Model includes defined rate for each type of sequence change
• Suffix number (n) reflects amount of "time" passed: rate of expected mutation if n% of amino acids had changed
• PAM1 - for less divergent sequences (shorter time)
• PAM250 - for more divergent sequences (longer time)

BLOSUM Matrix
BLOSUM = BLOck SUbstitution Matrix: based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins
• Doesn't rely on a specific evolutionary model
• Suffix number (n) reflects expected similarity: average % aa identity in the MSA from which the matrix was generated
• BLOSUM45 - for more divergent sequences
• BLOSUM62 - for less divergent sequences
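
To make the idea of a substitution matrix s(a,b) concrete, here is a small sketch that scores an ungapped alignment by table lookup. The six entries are the BLOSUM62 values for those residue pairs (so this is deliberately not a full 20 × 20 matrix), and the names `BLOSUM62_SUBSET`, `s` and `score_ungapped` are invented for this illustration. It reflects the earlier point that Ser & Thr are more "exchangeable" than Trp & Ala: the S/T pair scores positively while the A/W pair scores negatively.

```python
# A handful of BLOSUM62 entries; the matrix is symmetric, so each
# unordered pair is stored once. This is only a subset for illustration.
BLOSUM62_SUBSET = {
    ("S", "S"): 4, ("T", "T"): 5, ("S", "T"): 1,
    ("A", "A"): 4, ("W", "W"): 11, ("A", "W"): -3,
}


def s(a, b, matrix=BLOSUM62_SUBSET):
    """Substitution score s(a, b), looked up in either order."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]  # raises KeyError for pairs outside the subset


def score_ungapped(seq1, seq2, matrix=BLOSUM62_SUBSET):
    """Sum s(a, b) over the columns of an alignment without gaps."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(s(a, b, matrix) for a, b in zip(seq1, seq2))


print(s("S", "T"), s("W", "A"))       # 1 -3
print(score_ungapped("SAW", "TAW"))   # 1 + 4 + 11 = 16
print(score_ungapped("SAW", "SWA"))   # 4 - 3 - 3 = -2
```

In a full alignment program, the lookup `s(a, b)` would take the place of the fixed match/mismatch score σ(xi,yj) in the dynamic programming recursion.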