Sequence Alignments and Database Searches in Bioinformatics, Study notes of Computer Science

An introduction to sequence alignments and database searches in bioinformatics. Topics covered include the role of proteins and messenger RNA, DNA replication and mutations, sequence alignment methods, and the use of the Needleman-Wunsch algorithm and the BLAST tool for finding optimal alignments and searching databases. The document also discusses the importance of scoring alignments and the concepts of semi-global and local alignment.

Typology: Study notes

Pre 2010

Uploaded on 08/01/2009

koofers-user-arx

Sequence Alignments and Database Searches
Introduction to Bioinformatics

Intro to Bioinformatics – Sequence Alignment

Proteins are amino acid polymers

Messenger RNA
• Carries the instructions for a protein out of the nucleus to the ribosome
• The ribosome is an RNA-protein complex that synthesizes new proteins
[Figure: Transcription]

The Central Dogma
DNA → (transcription) → RNA → (translation) → Proteins

DNA Replication
• Prior to cell division, all the genetic instructions must be “copied” so that each new cell will have a complete set
• DNA polymerase is the enzyme that copies DNA
• It reads the old strand in the 3′ to 5′ direction and builds the new strand in the 5′ to 3′ direction

Over time, genes accumulate mutations
• Environmental factors
• Radiation
• Oxidation
• Mistakes in replication or repair
• Deletions, duplications
• Insertions
• Inversions
• Point mutations

Deletions
• Codon deletion:
  ACG ATA GCG TAT GTA TAG CCG…
• Effect depends on the protein, position, etc.
• Almost always deleterious
• Sometimes lethal
• Frame shift mutation:
  ACG ATA GCG TAT GTA TAG CCG…
  ACG ATA GCG ATG TAT AGC CG?…
• Almost always lethal

Indels
• When comparing two genes, it is generally impossible to tell whether an indel is an insertion in one gene or a deletion in the other, unless the ancestry is known:
  ACGTCTGATACGCCGTATCGTCTATCT
  ACGTCTGAT---CCGTATCGTCTATCT

The Genetic Code
Substitutions are mutations accepted by natural selection.
Synonymous: CGC → CGA
Non-synonymous: GAU → GAA

Comparing two sequences
• Point mutations, easy:
  ACGTCTGATACGCCGTATAGTCTATCT
  ACGTCTGATTCGCCCTATCGTCTATCT
• Indels are difficult, must align sequences:
  ACGTCTGATACGCCGTATAGTCTATCT
  CTGATTCGCATCGTCTATCT

  ACGTCTGATACGCCGTATAGTCTATCT
  ----CTGATTCGC---ATCGTCTATCT

Why align sequences?
• The human genome, and others, are available
• Automated gene finding is possible
• Gene: AGTACGTATCGTATAGCGTAA
• What does it do?
• One approach: Is there a similar gene in another species?
• Align sequences with known genes
• Find the gene with the “best” match

Scoring a sequence alignment
• Match score: +1
• Mismatch score: 0
• Gap penalty: –1

  ACGTCTGATACGCCGTATAGTCTATCT
      ||||| |||   || ||||||||
  ----CTGATTCGC---ATCGTCTATCT

• Matches: 18 × (+1)
• Mismatches: 2 × 0
• Gaps: 7 × (–1)
Score = +11

What are all these numbers, anyway?
• Suppose we are aligning: a with a…

          a
      0  -1
  a  -1

The dynamic programming concept
• Suppose we are aligning:
  ACTCG
  ACAGTAG
• Last position choices:

  G        ACTC
  G   +1   ACAGTA

  G        ACTC
  -   -1   ACAGTAG

  -        ACTCG
  G   -1   ACAGTA

Semi-global alignment
• Suppose we are aligning:
  GCG
  GGCG

          g   c   g
      0  -1  -2  -3
  g  -1   1   0  -1
  g  -2   0   1   1
  c  -3  -1   1   1
  g  -4  -2   0   2

• Which do you prefer?

  G-CG      -GCG
  GGCG      GGCG

• Semi-global alignment allows gaps at the ends for free.

Semi-global alignment

          g   c   g
      0   0   0   0
  g   0   1   0   1
  g   0   1   1   1
  c   0   0   2   1
  g   0   1   1   3

• Semi-global alignment allows gaps at the ends for free.
• Initialize the first row and column to all 0’s
• Allow free horizontal/vertical moves in the last row and column

Local alignment
• Global alignment – scores the entire alignment
• Semi-global alignment – allows unscored gaps at the beginning or end of either sequence
• Local alignment – finds the best-matching subsequence (the “Smith-Waterman Algorithm”)
• Example:
  CGATG
  AAATGGA
• This is achieved by allowing a 4th alternative at each position in the table: zero.

          c   g   a   t   g
      0  -1  -2  -3  -4  -5
  a  -1   0   0   0   0   0
  a  -2   0   0   1   0   0
  a  -3   0   0   1   0   0
  t  -4   0   0   0   2   1
  g  -5   0   1   0   1   3
  g  -6   0   1   0   0   2
  a  -7   0   0   2   1   1

Local alignment
• Use zero if all other options are negative
• Mismatch = –1 this time
  CGATG
  AAATGGA

Database Searching
• How can we find a particular short sequence in a database of sequences, or in one HUGE sequence?
• The problem is identical to local sequence alignment, but on a much larger scale.
• We must also have some idea of the significance of a database hit.
• Databases always return some kind of hit; how much attention should be paid to the result?
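
The scoring rule and the three table-filling variants described above (global, semi-global, local) can be sketched in a few lines of Python. This is an illustrative sketch, not the lecture's own code: the function names `score_alignment` and `best_score` and the `mode` parameter are invented here. For local alignment it assumes the usual Smith-Waterman convention of a zero first row and column, which differs from the border of the local table shown above but gives the same best score for that example.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Score an already-gapped alignment column by column ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def best_score(a, b, mode="global", match=1, mismatch=0, gap=-1):
    """Fill the dynamic programming table for a (columns) against b (rows).

    mode is a hypothetical switch: "global", "semiglobal", or "local".
    """
    if mode not in ("global", "semiglobal", "local"):
        raise ValueError("unknown mode: " + mode)
    rows, cols = len(b) + 1, len(a) + 1
    table = [[0] * cols for _ in range(rows)]
    if mode == "global":
        # Leading gaps are charged: the border counts gap penalties.
        for j in range(cols):
            table[0][j] = j * gap
        for i in range(rows):
            table[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            same = a[j - 1] == b[i - 1]
            diag = table[i - 1][j - 1] + (match if same else mismatch)
            up = table[i - 1][j] + gap
            left = table[i][j - 1] + gap
            best = max(diag, up, left)
            if mode == "local":
                best = max(best, 0)  # the "4th alternative": zero
            table[i][j] = best
    if mode == "global":
        return table[-1][-1]
    if mode == "semiglobal":
        # Free moves along the last row and column: take the best cell there.
        return max(max(table[-1]), max(row[-1] for row in table))
    return max(max(row) for row in table)


print(score_alignment("ACGTCTGATACGCCGTATAGTCTATCT",
                      "----CTGATTCGC---ATCGTCTATCT"))
print(best_score("GCG", "GGCG", mode="global"))
print(best_score("GCG", "GGCG", mode="semiglobal"))
print(best_score("CGATG", "AAATGGA", mode="local", mismatch=-1))
```

The only differences between the three modes are how the border is initialized, whether zero is allowed as an extra choice in each cell, and which cell is read off as the final score.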
BLAST
• BLAST – Basic Local Alignment Search Tool
• A heuristic approximation of the Smith-Waterman local alignment algorithm
• Sacrifices some search sensitivity for speed

The BLAST algorithm
• Break the search sequence into words
• W = 4 for proteins, W = 12 for DNA (the example below uses 3-letter words)
• Include in the search all words that score above a certain value (T) for any search word

  Query: MCGPFILGTYC
  Words: MCG, CGP, GPF, PFI, FIL, ILG, LGT, GTY, TYC

  MCG   CGP
  MCT   MGP
  …
  MCN   CTP
  …     …

• This list can be computed in linear time

The BLAST Algorithm (2)
• Search for the words in the database
• Word locations can be precomputed and indexed
• Searching for a short string in a long string → regular expression matching with a finite state automaton (FSA)
• HSP (High-scoring Segment Pair) = a match between a query word and the database
• Find a “hit”: two non-overlapping HSPs on a diagonal within distance A
• Extend the hit until the score falls below a threshold value, X

Results from a BLAST search
[Figure: results from a BLAST search]

Search Significance Scores
• A search will always return some hits.
• P score: the probability that one or more sequences of score ≥ S would have been found by random chance
• E score: the expected number of sequences of score ≥ S that would be found by random chance
• Let’s look at some examples...
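
The word-splitting and indexed-lookup steps of the BLAST algorithm above can be sketched as follows. This is a simplified illustration under stated assumptions, not BLAST itself: it finds exact word matches only, leaving out both the neighborhood words that score above T and the extension step. The function names `split_words`, `build_index`, and `seed_hits`, and the short database string, are hypothetical.

```python
from collections import defaultdict


def split_words(query, w=3):
    """Return every overlapping word of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def build_index(database, w=3):
    """Precompute the positions of every word of length w in the database."""
    index = defaultdict(list)
    for pos in range(len(database) - w + 1):
        index[database[pos:pos + w]].append(pos)
    return index


def seed_hits(query, database, w=3):
    """List exact word matches as (query_pos, db_pos, diagonal) triples.

    Matches that share a diagonal (db_pos - query_pos) lie on the same
    ungapped alignment, which is what the two-hit rule looks for.
    """
    index = build_index(database, w)
    hits = []
    for q_pos, word in enumerate(split_words(query, w)):
        for d_pos in index.get(word, []):
            hits.append((q_pos, d_pos, d_pos - q_pos))
    return hits


query = "MCGPFILGTYC"
print(split_words(query))
# ['MCG', 'CGP', 'GPF', 'PFI', 'FIL', 'ILG', 'LGT', 'GTY', 'TYC']

# Hypothetical database sequence containing the fragment PFILG.
print(seed_hits(query, "AAPFILGAA"))
```

Building the index once and looking words up in it is what lets the word positions be "precomputed and indexed"; a real search would then extend promising matches on a shared diagonal.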