Sequence Alignments and Database Searches
Introduction to Bioinformatics

Proteins are amino acid polymers

Messenger RNA
• Carries the instructions for a protein out of the nucleus to the ribosome
• The ribosome is a complex of RNA and protein that synthesizes new proteins

The Central Dogma
DNA → (transcription) → RNA → (translation) → Proteins

DNA Replication
• Prior to cell division, all the genetic instructions must be “copied” so that each new cell will have a complete set
• DNA polymerase is the enzyme that copies DNA
• It reads the old strand in the 3´ to 5´ direction

Over time, genes accumulate mutations
• Environmental factors
• Radiation
• Oxidation
• Mistakes in replication or repair
• Deletions, duplications
• Insertions
• Inversions
• Point mutations

Deletions
• Codon deletion:
    ACG ATA GCG TAT GTA TAG CCG…
• Effect depends on the protein, position, etc.
• Almost always deleterious
• Sometimes lethal
• Frame shift mutation:
    ACG ATA GCG TAT GTA TAG CCG…
    ACG ATA GCG ATG TAT AGC CG?…
• Almost always lethal

Indels
• When comparing two genes, it is generally impossible to tell whether an indel is an insertion in one gene or a deletion in the other, unless ancestry is known:
    ACGTCTGATACGCCGTATCGTCTATCT
    ACGTCTGAT---CCGTATCGTCTATCT

The Genetic Code
Substitutions are mutations accepted by natural selection.
Synonymous: CGC → CGA (both code for arginine)
Non-synonymous: GAU → GAA (aspartate becomes glutamate)

Comparing two sequences
• Point mutations are easy:
    ACGTCTGATACGCCGTATAGTCTATCT
    ACGTCTGATTCGCCCTATCGTCTATCT
• Indels are difficult; the sequences must be aligned:
    ACGTCTGATACGCCGTATAGTCTATCT
    CTGATTCGCATCGTCTATCT

    ACGTCTGATACGCCGTATAGTCTATCT
    ----CTGATTCGC---ATCGTCTATCT

Why align sequences?
• The human genome, and others, are available
• Automated gene finding is possible
• Gene: AGTACGTATCGTATAGCGTAA
• What does it do?
• One approach: is there a similar gene in another species?
• Align sequences with known genes
• Find the gene with the “best” match

Scoring a sequence alignment
• Match score: +1
• Mismatch score: 0
• Gap penalty: –1
    ACGTCTGATACGCCGTATAGTCTATCT
        ||||| |||   || ||||||||
    ----CTGATTCGC---ATCGTCTATCT
• Matches: 18 × (+1)
• Mismatches: 2 × 0
• Gaps: 7 × (–1)
Score = 18 + 0 – 7 = +11

What are all these numbers, anyway?
• Suppose we are aligning: a with a…

             a
         0  -1
     a  -1

The dynamic programming concept
• Suppose we are aligning:
    ACTCG
    ACAGTAG
• Last position choices:
    G   +1   ACTC
    G        ACAGTA

    G   -1   ACTC
    -        ACAGTAG

    -   -1   ACTCG
    G        ACAGTA

Semi-global alignment
• Suppose we are aligning:
    GCG
    GGCG
• Which do you prefer?
    G-CG      -GCG
    GGCG      GGCG
• Table with a gap-penalized first row and column (global alignment):

             g   c   g
         0  -1  -2  -3
     g  -1   1   0  -1
     g  -2   0   1   1
     c  -3  -1   1   1
     g  -4  -2   0   2

• Semi-global alignment allows gaps at the ends for free.

Semi-global alignment

             g   c   g
         0   0   0   0
     g   0   1   0   1
     g   0   1   1   1
     c   0   0   2   1
     g   0   1   1   3

• Semi-global alignment allows gaps at the ends for free.
• Initialize the first row and column to all 0’s
• Allow free horizontal/vertical moves in the last row and column

Local alignment
• Global alignment – scores the entire alignment
• Semi-global alignment – allows unscored gaps at the beginning or end of either sequence
• Local alignment – finds the best matching subsequence (the “Smith-Waterman Algorithm”)
• Example:
    CGATG
    AAATGGA
• This is achieved by allowing a 4th alternative at each position in the table: zero.

Local alignment
• Use zero if all other alternatives are negative
• Mismatch = –1 this time
    CGATG
    AAATGGA

             c   g   a   t   g
         0  -1  -2  -3  -4  -5
     a  -1   0   0   0   0   0
     a  -2   0   0   1   0   0
     a  -3   0   0   1   0   0
     t  -4   0   0   0   2   1
     g  -5   0   1   0   1   3
     g  -6   0   1   0   0   2
     a  -7   0   0   2   1   1

Database Searching
• How can we find a particular short sequence in a database of sequences, or in one HUGE sequence?
• The problem is identical to local sequence alignment, but on a much larger scale.
• We must also have some idea of the significance of a database hit.
• Databases always return some kind of hit; how much attention should be paid to the result?
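
Since database searching is local alignment on a larger scale, it may help to see the table-filling variants from the preceding slides side by side. The following is a minimal Python sketch, not the course's own code: the function name `alignment_score`, its parameters, and the `mode` strings are assumptions made for illustration. It defaults to the slides' scores (match +1, mismatch 0, gap –1). For local alignment it starts from zero borders, as in the usual Smith-Waterman formulation, so an individual cell can differ from a table drawn with negative borders while the best score stays the same.

```python
def alignment_score(a, b, mode="global", match=1, mismatch=0, gap=-1):
    """Fill the dynamic programming table for sequences a (columns) and b (rows).

    mode is one of "global", "semiglobal" or "local" (hypothetical names).
    Returns (score, table).
    """
    rows, cols = len(b) + 1, len(a) + 1
    table = [[0] * cols for _ in range(rows)]

    # Global alignment charges for leading gaps; the other modes start at zero.
    if mode == "global":
        for j in range(cols):
            table[0][j] = j * gap
        for i in range(rows):
            table[i][0] = i * gap

    for i in range(1, rows):
        for j in range(1, cols):
            same = a[j - 1] == b[i - 1]
            diag = table[i - 1][j - 1] + (match if same else mismatch)
            up = table[i - 1][j] + gap
            left = table[i][j - 1] + gap

            if mode == "semiglobal":
                # Trailing gaps are free: moves along the last column
                # or the last row cost nothing.
                if j == cols - 1:
                    up = table[i - 1][j]
                if i == rows - 1:
                    left = table[i][j - 1]

            best = max(diag, up, left)
            if mode == "local":
                # The 4th alternative: restart the alignment at zero.
                best = max(best, 0)
            table[i][j] = best

    if mode == "local":
        # The best local alignment can end anywhere in the table.
        score = max(max(row) for row in table)
    else:
        score = table[-1][-1]
    return score, table


if __name__ == "__main__":
    print(alignment_score("GCG", "GGCG", mode="global")[0])
    print(alignment_score("GCG", "GGCG", mode="semiglobal")[0])
    print(alignment_score("CGATG", "AAATGGA", mode="local", mismatch=-1)[0])
```

With the slide examples, the global score for GCG against GGCG is 2, the semi-global score is 3, and the best local score for CGATG against AAATGGA (mismatch –1) is 3. This sketch returns only scores; recovering the alignment itself would need a traceback through the table.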

BLAST
• BLAST – Basic Local Alignment Search Tool
• A heuristic approximation of the Smith-Waterman local alignment algorithm
• Sacrifices some search sensitivity for speed

The BLAST algorithm
• Break the search sequence into words
• W = 4 for proteins, W = 12 for DNA
• Example query (shown here with three-letter words):
    MCGPFILGTYC
    MCG, CGP, GPF, PFI, FIL, ILG, LGT, GTY, TYC
• Include in the search all words that score above a certain value (T) against any search word:
    MCG   CGP
    MCT   MGP   …
    MCN   CTP   …
    …
• This list can be computed in linear time

The BLAST Algorithm (2)
• Search for the words in the database
• Word locations can be precomputed and indexed
• Searching for a short string in a long string
• Regular expression matching: FSA (finite state automaton)
• HSP (High Scoring Pair) = a match between a query word and the database
• Find a “hit”: two non-overlapping HSPs on a diagonal within distance A
• Extend the hit until the score falls below a threshold value, X

Results from a BLAST search

Search Significance Scores
• A search will always return some hits.
• P score: the probability that one or more sequences of score >= S would have been found randomly
• E score: the expected number of sequences of score >= S that would be found by random chance
• Let’s look at some examples...
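
The word-splitting and lookup steps of the BLAST algorithm described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not BLAST itself: it seeds only on exact word matches (it does not generate the neighborhood words scoring above T, which would need a substitution matrix), it does not extend hits, and the function names and the toy database string are hypothetical.

```python
from collections import defaultdict


def words(seq, w):
    """Return every overlapping word of length w in seq, in order."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def index_words(database, w):
    """Precompute where each word of length w occurs in the database."""
    index = defaultdict(list)
    for pos, word in enumerate(words(database, w)):
        index[word].append(pos)
    return index


def seed_hits(query, database, w):
    """Find exact word matches between query and database.

    Each hit is (query_pos, db_pos, diagonal), where
    diagonal = db_pos - query_pos. Hits that share a diagonal
    are candidates for being joined and extended.
    """
    index = index_words(database, w)
    hits = []
    for qpos, word in enumerate(words(query, w)):
        for dpos in index.get(word, []):
            hits.append((qpos, dpos, dpos - qpos))
    return hits


if __name__ == "__main__":
    query = "MCGPFILGTYC"          # query from the slide
    toy_db = "AAAMCGPFAAA"         # hypothetical database sequence
    print(words(query, 3))
    print(seed_hits(query, toy_db, 3))
```

Building the index once and looking up each query word is what makes the word search fast: the database is scanned a single time, no matter how many query words there are. In the toy example all three seed matches fall on the same diagonal, which is the pattern the hit-finding step looks for.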