Sequence Alignment in Bioinformatics - Prof. Drena Leigh Dobbs, Lab Reports of Bioinformatics

Part of the lecture notes for a bioinformatics course (BCB 444/544) at Iowa State University (ISU) in fall 2007. The notes cover sequence alignment, a fundamental concept in bioinformatics, along with the goals and importance of biological databases, web resources, and ISU resources related to bioinformatics. They introduce the concept of sequence alignment, its motivation, and its applications in various bioinformatics tasks, and they distinguish between homology, orthologs, paralogs, sequence homology, sequence similarity, and sequence identity.

Typology: Lab Reports

Pre 2010

Uploaded on 09/02/2009

#4 - Sequence Alignment, 8/27/07, BCB 444/544 Fall 07, ISU, Dobbs

BCB 444/544
Finish: Lecture 2 - Biological Databases
Lecture 4 - Sequence Alignment
#4_Aug27

Required Reading (before lecture)
Mon Aug 27 - for Lecture #4: Pairwise Sequence Alignment
• Chp 3 - pp 31-41, Xiong Textbook
Wed Aug 29 - for Lecture #5: Dynamic Programming
• Eddy: What is Dynamic Programming?
Thurs Aug 30 - Lab #2: Databases, ISU Resources, & Pairwise Sequence Alignment
Fri Aug 31 - for Lecture #6: Scoring Matrices and Alignment Statistics
• Chp 3 - pp 41-49

HW#2:

Back to: Chp 2 - Biological Databases
• Xiong: Chp 2 Introduction to Biological Databases
  • What is a Database?
  • Types of Databases
  • Biological Databases
  • Pitfalls of Biological Databases
  • Information Retrieval from Biological Databases

What is a Database?
Duh!! OK, we'll skip that!

Types of Databases
3 major types of electronic databases:
1. Flat files - simple text files
  • no organization to facilitate retrieval
2. Relational - data organized as tables ("relations")
  • shared features among tables allow rapid search
3. Object-oriented - data organized as "objects"
  • objects associated hierarchically

Biological Databases
Currently - all 3 types, but MANY flat files
What are the goals of biological databases?
1. Information retrieval
2. Knowledge discovery
Important issue: Interconnectivity

Types of Biological Databases
1- Primary
  • "simple" archives of sequences, structures, images, etc.
  • raw data, minimal annotations, not always well curated!
2- Secondary
  • enhanced with more complete annotation of sequences, structures, images, etc.
  • usually curated!
3- Specialized
  • focused on a particular research interest or organism
  • usually - not always - highly curated

Examples of Biological Databases
1- Primary
  • DNA sequences
    • GenBank - US
    • European Molecular Biology Lab - EMBL
    • DNA Data Bank of Japan - DDBJ
  • Structures (Protein, DNA, RNA)
    • PDB - Protein Data Bank
    • NDB - Nucleic Acid Data Bank
2- Secondary
  • Protein sequences
    • Swiss-Prot, TrEMBL, PIR
    • these recently combined into UniProt
3- Specialized
  • Species-specific (or "taxonomic" specific)
    • FlyBase, WormBase, AceDB, PlantDB
  • Molecule-specific, disease-specific

Pitfalls of Biological Databases
• Errors!
• Lack of documentation re: quality or reliability of data
• Limited mechanisms for "data checking" or preventing propagation of errors (esp. annotation errors!!)
• Redundancy
• Inconsistency
• Incompatibility (format, terminology, data types, etc.)

Information Retrieval from Biological Databases
2 most popular retrieval systems:
• ENTREZ - NCBI
  • will use a LOT - was introduced in Lab 1
• SRS - Sequence Retrieval System - EBI
  • will use less, similar to ENTREZ
Both:
• Provide access to multiple databases
• Allow complex queries

What is Sequence Alignment?
Given 2 sequences of letters, and a scoring scheme for evaluating matching letters, find an optimal pairing of letters in one sequence to letters of the other sequence.

Align:
1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2: THIS IS A SHORT SENTENCE.

1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2: THIS IS A ######SHORT## SENTENCE##############.
OR
1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2: THIS IS A ##SHORT###SENT#EN###CE##############.

Is one of these alignments "optimal"? Which is better?

Goal of Sequence Alignment
Find the best pairing of 2 sequences, such that there is maximum correspondence between residues
• DNA: 4 letter alphabet (+ gap)
    TTGACAC
    TTTACAC
• Proteins: 20 letter alphabet (+ gap)
    RKVA-GMA
    RKIAVAMA

Statement of Problem
Given:
• 2 sequences
• Scoring system for evaluating match (or mismatch) of two characters
• Penalty function for gaps in sequences
Find: Optimal pairing of sequences that
• Retains the order of characters
• Introduces gaps where needed
• Maximizes total score

Types of Sequence Variation
• Sequences can diverge from a common ancestor through various types of mutations:
  • Substitutions: ACGA → AGGA
  • Insertions: ACGA → ACCGA
  • Deletions: ACGA → AGA
• Insertions or deletions ("indels") result in gaps in alignments
• Substitutions result in mismatches
• No change? Match

Gaps
Indels of various sizes can occur in one sequence relative to the other, e.g., corresponding to a shortening of the polypeptide chain in a protein

Avoiding Random Alignments with a Scoring Function
• Introducing too many gaps generates nonsense alignments:
    s--e-----qu---en--ce
    sometimesquipsentice
• Need to distinguish between alignments that occur due to homology and those that occur by chance
• Define a scoring function that accounts for mismatches and gaps
Scoring Function (F), e.g.:
    Match:    +m   (+1)
    Mismatch: -s   (-1)
    Gap:      -d   (-2)
F = m(#matches) - s(#mismatches) - d(#gaps), with m = 1, s = 1, d = 2 in this example

Not All Mismatches are the Same
• Some amino acids are more "exchangeable" than others; e.g., Ser and Thr are more similar than Trp and Ala
• A substitution matrix can be used to introduce "mismatch costs" for handling different types of substitutions
• Mismatch costs are not usually used in aligning DNA or RNA sequences, because no substitution is "better" than any other (in general)

Substitution Matrix
s(a,b) corresponds to the score of aligning character a with character b
Match scores are often calculated based on the frequency of mutations in very similar sequences (more details later)

Methods
• Global and Local Alignment
• Alignment Algorithms
  • Dot Matrix Method
  • Dynamic Programming Method
    • Gap penalties
    • DP for Global Alignment
    • DP for Local Alignment
• Scoring Matrices
  • Amino acid scoring matrices
  • PAM
  • BLOSUM
  • Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment

Global vs Local Alignment
Local alignment
• Finds local regions with highest similarity between 2 sequences
• Aligns these without regard for rest of sequence
• Sequences are not assumed to be similar over entire length
Global alignment
• Finds best possible alignment across entire length of 2 sequences
• Aligned sequences assumed to be generally similar over entire length

Global vs Local Alignment - example
S = CTGTCGCTGCACG
T = TGCCGTG

CTGTCG-CTGCACG
-TGCCG--TG----
Global alignment

CTGTCG-CTGCACG
-TGC-CG-TG----

CTGTCGCTGCACG--
-------TGC-CGTG
Local alignment

Which is better?

Global vs Local Alignment
When to use which? Both are important, but it is critical to use the right method for a given task!
Global alignment:
• Good for: aligning closely related sequences of approx. the same length
• Not good for: divergent sequences or sequences with different lengths
Local alignment:
• Good for: searching for conserved patterns (domains or motifs) in DNA or protein sequences
• Not good for: generating an alignment of closely related sequences
Global and local alignments are fundamentally similar and differ only in the optimization strategy used in aligning similar residues

Alignment Algorithms
3 major methods for alignment:
1. Dot matrix analysis
2. Dynamic Programming
3. Word or k-tuple methods (later, in Chp 4)

Dot Matrix Method (Dot Plots)
• Place 1 sequence along top row of matrix
• Place 2nd sequence along left column of matrix
• Plot a dot each time there is a match between an element of the row sequence and an element of the column sequence
• For proteins, usually use more sophisticated scoring schemes than "identical match"
• Diagonal lines indicate areas of match
• Reverse diagonals (perpendicular to diagonal) indicate inversions
[Figure: example dot plot of two short DNA sequences]

Exploring Dot Plots
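The dot matrix steps listed above can be sketched in a few lines of Python. This is only an illustration: the function names `dot_plot` and `print_dot_plot` are made up for this example, and it uses the plain "identical match" rule, which the slide notes is usually replaced by a more sophisticated scoring scheme for proteins.

```python
def dot_plot(top: str, left: str) -> list[str]:
    """Build a text dot plot: one row per character of `left`,
    one column per character of `top`; '*' marks an identical match."""
    rows = []
    for row_char in left:
        # Compare this row character against every column character.
        row = "".join("*" if row_char == col_char else "." for col_char in top)
        rows.append(row)
    return rows


def print_dot_plot(top: str, left: str) -> None:
    """Print the plot with `top` along the top row and `left` down the left column."""
    print("  " + top)
    for row_char, row in zip(left, dot_plot(top, left)):
        print(row_char + " " + row)


if __name__ == "__main__":
    # S and T from the "Global vs Local Alignment - example" slide.
    print_dot_plot("CTGTCGCTGCACG", "TGCCGTG")
```

Runs of `*` along a diagonal correspond to the "areas of match" described in the slide; isolated `*` characters are chance matches, which are common with a 4-letter alphabet.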
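The scoring function F from the "Avoiding Random Alignments with a Scoring Function" slide can also be tried out directly. The sketch below is an illustration under stated assumptions: the name `score_alignment` is hypothetical, m, s, and d are taken as positive magnitudes (match +1, mismatch -1, gap -2), and every column containing a gap character is charged d once, which is one simple reading of "#gaps".

```python
def score_alignment(row_a: str, row_b: str,
                    m: int = 1, s: int = 1, d: int = 2,
                    gap: str = "-") -> int:
    """Score two already-aligned rows with F = m*matches - s*mismatches - d*gaps.

    Assumption: each column that contains a gap character counts as one gap.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return m * matches - s * mismatches - d * gaps


if __name__ == "__main__":
    # Examples taken from the slides.
    print(score_alignment("TTGACAC", "TTTACAC"))      # 6 matches, 1 mismatch -> 5
    print(score_alignment("RKVA-GMA", "RKIAVAMA"))    # 5 matches, 2 mismatches, 1 gap -> 1
    print(score_alignment("s--e-----qu---en--ce",
                          "sometimesquipsentice"))    # 8 matches, 12 gaps -> -16
```

The heavily gapped "nonsense" alignment scores well below the two short examples, which is the point of penalizing gaps: an alignment cannot reach a high score simply by inserting gaps until letters line up.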