Sequence Alignment in Bioinformatics - Prof. Drena Leigh Dobbs
#4 - Sequence Alignment 8/27/07 BCB 444/544 Fall 07 Dobbs

BCB 444/544
Finish: Lecture 2 - Biological Databases
Lecture 4: Sequence Alignment
#4_Aug27

Required Reading (before lecture)
Mon Aug 27 - for Lecture #4: Pairwise Sequence Alignment
• Chp 3 - pp 31-41, Xiong Textbook
Wed Aug 29 - for Lecture #5: Dynamic Programming
• Eddy: What is Dynamic Programming?
Thurs Aug 30 - Lab #2: Databases, ISU Resources, & Pairwise Sequence Alignment
Fri Aug 31 - for Lecture #6: Scoring Matrices and Alignment Statistics
• Chp 3 - pp 41-49

HW#2:

Back to: Chp 2 - Biological Databases
• Xiong: Chp 2 Introduction to Biological Databases
• What is a Database?
• Types of Databases
• Biological Databases
• Pitfalls of Biological Databases
• Information Retrieval from Biological Databases

What is a Database?
Duh!! OK: we'll skip that!

Types of Databases
3 major types of electronic databases:
1. Flat files - simple text files
• no organization to facilitate retrieval
2. Relational - data organized as tables ("relations")
• shared features among tables allow rapid search
3. Object-oriented - data organized as "objects"
• objects associated hierarchically

Biological Databases
Currently - all 3 types, but MANY flat files
What are the goals of biological databases?
1. Information retrieval
2. Knowledge discovery
Important issue: Interconnectivity

Types of Biological Databases
1- Primary
• "simple" archives of sequences, structures, images, etc.
• raw data, minimal annotations, not always well curated!
2- Secondary
• enhanced with more complete annotation of sequences, structures, images, etc.
• usually curated!
3- Specialized
• focused on a particular research interest or organism
• usually - not always - highly curated

Examples of Biological Databases
1- Primary
• DNA sequences
  • GenBank - US
  • European Molecular Biology Lab - EMBL
  • DNA Data Bank of Japan - DDBJ
• Structures (Protein, DNA, RNA)
  • PDB - Protein Data Bank
  • NDB - Nucleic Acid Database
2- Secondary
• Protein sequences
  • Swiss-Prot, TrEMBL, PIR
  • these recently combined into UniProt
3- Specialized
• Species-specific (or "taxonomic" specific)
  • FlyBase, WormBase, AceDB, PlantDB
• Molecule-specific, disease-specific

Pitfalls of Biological Databases
• Errors!
• Lack of documentation re: quality or reliability of data
• Limited mechanisms for "data checking" or preventing propagation of errors (esp. annotation errors!!)
• Redundancy
• Inconsistency
• Incompatibility (format, terminology, data types, etc.)

Information Retrieval from Biological Databases
2 most popular retrieval systems:
• ENTREZ - NCBI
  • will use a LOT - was introduced in Lab 1
• SRS - Sequence Retrieval System - EBI
  • will use less, similar to ENTREZ
Both:
• Provide access to multiple databases
• Allow complex queries

What is Sequence Alignment?
Given 2 sequences of letters and a scoring scheme for evaluating matching letters, find an optimal pairing of letters in one sequence to letters of the other sequence.

Align:
1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2: THIS IS A SHORT SENTENCE.

1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2: THIS IS A ######SHORT## SENTENCE##############.
OR
1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2: THIS IS A ##SHORT###SENT#EN###CE##############.

Is one of these alignments "optimal"? Which is better?

Goal of Sequence Alignment
Find the best pairing of 2 sequences, such that there is maximum correspondence between residues
• DNA: 4-letter alphabet (+ gap)
  TTGACAC
  TTTACAC
• Proteins: 20-letter alphabet (+ gap)
  RKVA-GMA
  RKIAVAMA

Statement of Problem
Given:
• 2 sequences
• Scoring system for evaluating match (or mismatch) of two characters
• Penalty function for gaps in sequences
Find: Optimal pairing of sequences that
• Retains the order of characters
• Introduces gaps where needed
• Maximizes total score

Types of Sequence Variation
• Sequences can diverge from a common ancestor through various types of mutations:
  • Substitutions: ACGA → AGGA
  • Insertions: ACGA → ACCGA
  • Deletions: ACGA → AGA
• Insertions or deletions ("indels") result in gaps in alignments
• Substitutions result in mismatches
• No change? Match

Gaps
Indels of various sizes can occur in one sequence relative to the other, e.g., corresponding to a shortening of the polypeptide chain in a protein

Avoiding Random Alignments with a Scoring Function
• Introducing too many gaps generates nonsense alignments:
  s--e-----qu---en--ce
  sometimesquipsentice
• Need to distinguish between alignments that occur due to homology and those that occur by chance
• Define a scoring function that accounts for mismatches and gaps
Scoring Function (F):
  Match:    +m   e.g. +1
  Mismatch: -s   e.g. -1
  Gap:      -d   e.g. -2
F = m(#matches) - s(#mismatches) - d(#gaps), with m, s, and d taken as positive values

Not All Mismatches are the Same
• Some amino acids are more "exchangeable" than others; e.g., Ser and Thr are more similar than Trp and Ala
• A substitution matrix can be used to introduce "mismatch costs" for handling different types of substitutions
• Mismatch costs are not usually used in aligning DNA or RNA sequences, because no substitution is "better" than any other (in general)

Substitution Matrix
s(a,b) corresponds to the score of aligning character a with character b
Match scores are often calculated based on the frequency of mutations in very similar sequences (more details later)

Methods
• Global and Local Alignment
• Alignment Algorithms
  • Dot Matrix Method
  • Dynamic Programming Method
    • Gap penalties
    • DP for Global Alignment
    • DP for Local Alignment
• Scoring Matrices
  • Amino acid scoring matrices
  • PAM
  • BLOSUM
  • Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment

Global vs Local Alignment
Local alignment
• Finds local regions with highest similarity between 2 sequences
• Aligns these without regard for rest of sequence
• Sequences are not assumed to be similar over entire length
Global alignment
• Finds best possible alignment across entire length of 2 sequences
• Aligned sequences assumed to be generally similar over entire length

Global vs Local Alignment - example
S = CTGTCGCTGCACG
T = TGCCGTG

Global alignment:
CTGTCG-CTGCACG
-TGCCG--TG----

CTGTCG-CTGCACG
-TGC-CG-TG----

Local alignment:
CTGTCGCTGCACG--
-------TGC-CGTG

Which is better?

Global vs Local Alignment
When to use which? Both are important, but it is critical to use the right method for a given task!
Global alignment:
• Good for: aligning closely related sequences of approx. same length
• Not good for: divergent sequences or sequences with different lengths
Local alignment:
• Good for: searching for conserved patterns (domains or motifs) in DNA or protein sequences
• Not good for: generating alignment of closely related sequences
Global and local alignments are fundamentally similar and differ only in the optimization strategy used in aligning similar residues

Alignment Algorithms
3 major methods for alignment:
1. Dot matrix analysis
2. Dynamic Programming
3. Word or k-tuple methods (later, in Chp 4)

Dot Matrix Method (Dot Plots)
• Place 1 sequence along top row of matrix
• Place 2nd sequence along left column of matrix
• Plot a dot each time there is a match between an element of row sequence and an element of column sequence
• For proteins, usually use more sophisticated scoring schemes than "identical match"
• Diagonal lines indicate areas of match
• Reverse diagonals (perpendicular to diagonal) indicate inversions
[Figure: example dot plot; axis residue labels A C A C G and A CC G G]
Exploring Dot Plots
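
Two ideas from this lecture are small enough to sketch in code: the dot matrix method and the simple scoring function F. The Python below is only an illustrative sketch, not part of the course materials. The function names (`dot_plot`, `render`, `score_alignment`) are hypothetical, a dot is plotted only for identical residues, and the example values +1, -1, and -2 are passed as signed per-column scores, with every gap column charged once.

```python
def dot_plot(top, left):
    """Return a matrix with one row per residue of `left` and one column per
    residue of `top`; a cell is True where the two residues are identical."""
    return [[a == b for a in top] for b in left]


def render(top, left):
    """Format the dot matrix as text, with `top` as the header row."""
    lines = ["  " + " ".join(top)]
    for residue, row in zip(left, dot_plot(top, left)):
        cells = " ".join("*" if hit else "." for hit in row)
        lines.append(residue + " " + cells)
    return "\n".join(lines)


def score_alignment(row1, row2, match=1, mismatch=-1, gap=-2):
    """Score two equal-length aligned rows column by column.

    Assumption: `mismatch` and `gap` are signed scores (already negative),
    and each column containing '-' counts as one gap.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


if __name__ == "__main__":
    # DNA and protein pairs taken from the "Goal of Sequence Alignment" slide.
    print(render("TTGACAC", "TTTACAC"))
    print(score_alignment("TTGACAC", "TTTACAC"))    # 6 matches, 1 mismatch -> 5
    print(score_alignment("RKVA-GMA", "RKIAVAMA"))  # 5 matches, 2 mismatches, 1 gap -> 1
```

In the rendered plot, the run of `*` along the main diagonal is the "area of match" the slide describes. The single break in that run is the G/T substitution, which is also the one mismatch that lowers the DNA score from 7 to 5.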