Two Sequence Alignment & Scoring Models
BME 110: CompBio Tools
Todd Lowe
April 13, 2006

Admin
• Reading:
  – Finish Claverie, Chapters 7, 8
  – NCBI BLAST Guide / info page: http://www.ncbi.nlm.nih.gov/BLAST/Why.shtml
• Homework #1 now online, due in 1 week (Thursday, April 20)
• Today:
  – Finish BLAST overview
  – Two sequence alignments
  – In-class BLAST exercises

What is Reliable?
• In biology, a P-value of 0.05 would normally be “good enough” (5 chances in 100 that the result arose by chance)
• Because BLAST only estimates significance, you shouldn’t trust P or E values > 1x10^-4
• Note: hits below that threshold may still be paralogs with different functions! Examine the alignment!
• For good measure, I don’t have great confidence unless the value is < 1x10^-8

Beware Hit Transitivity!
• “BLAST hits are not transitive, unless alignments are overlapping”

  Seq1: AAAAABBBB
  Seq2: AAAAA
  Seq3:      BBBB

• Seq2 and Seq3 both hit Seq1, but Seq2 and Seq3 are not necessarily homologous!

Example
• Fibrillarin-like protein
  – DNA: XM_293903, Protein: XP_293903
• How “far” can we go in the tree of life using nucleotide v. protein searches?

Limit Search Space
• If you only want hits to a specific genome or domain, it is **much** faster to search only that species

A Related Note: Homology
• Based on the inference that two sequences are ancestrally derived from the same molecule
• If two sequences have high similarity, they may be inferred to be homologous
• It is WRONG to say two sequences or genes are 80% homologous (they either are related, or they are not)

Homology: Same Function?
• Even if two sequences are ancestrally derived from the same molecule, they may or may not still have the same function
  – Orthologs: homologous genes created by speciation
    • Generally implies function remains the same
  – Paralogs: homologous genes created by a gene duplication event (in the same species)
    • Implies function may have changed

Full-genome Comparisons
Here, each dot is a gene match, not a nucleotide match
From Zivanovic et al., NAR 30: 1902-10

Pair-wise Sequence Comparison
• Basis for relating biological information from a well-studied gene to a new sequence
• Many programs exist for pairwise comparison
• Some do fast database searches and return “good” alignments
  – One sequence v. many thousands:
    • BLAST or FASTA
• Some are much slower, but guarantee the “optimal alignment”
  – Smith-Waterman is the de facto standard

What is Optimal??
• How do we get an “optimal” alignment?
• Optimal to whom?
• Optimal based on a scoring model:
  – Substitution scoring matrix
  – Insertion / deletion scoring (penalties)
• Caution: just because an alignment is optimal for a given scoring scheme doesn’t mean it is biologically correct!!

Which is better?
Match +1, Mismatch -1, Gap -2

  G A T C      +1-1-1+1
  |     |                   (Score = 0)
  G T G C

OR

  G A T - C    +1-2+1-2+1
  |   |   |                 (Score = -1)
  G - T G C

Which is better?
Match +1, Mismatch -1, Gap -1

  G A T C      +1-1-1+1
  |     |                   (Score = 0)
  G T G C

OR

  G A T - C    +1-1+1-1+1
  |   |   |                 (Score = 1)
  G - T G C

Moral: Scoring Model Matters!!
• For DNA, the model can be very simple:
  – +1 match, -1 mismatch
• However, not all mutations are equally likely:
  – Transition (A<->G or C<->T): more likely
  – Transversion (e.g., A<->C or G<->T): less likely

Protein Matrices, Same Idea
• Original: Dayhoff matrix, aka PAM
• PAM = point accepted mutation (also expanded as “percent accepted mutations”)
• Based on a small number of correctly aligned proteins
• Simply count how often each amino acid is substituted for another
• Frequency of substitutions depends on the properties of amino acids relative to each other

[Figure: amino acids arranged in four regions: small non-polar, small polar, large non-polar, large polar]
• The closer two amino acids are, the more similar their properties
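
The arithmetic on the two “Which is better?” slides can be checked with a short Python sketch. The function names (`score_alignment`, `best_global_score`) are made up for this illustration and are not part of any course tool. The second function uses the Needleman-Wunsch global alignment recurrence with a simple linear gap penalty, which is an assumption here: the slides name Smith-Waterman (a local alignment method) and do not say how the example alignments were produced.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Score an existing alignment column by column ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap          # insertion / deletion penalty
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


def best_global_score(s, t, match=1, mismatch=-1, gap=-2):
    """Best global alignment score (Needleman-Wunsch, linear gap penalty).

    Assumption for illustration: only the score is returned, not the alignment.
    """
    prev = [j * gap for j in range(len(t) + 1)]   # row for the empty prefix of s
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


for gap in (-2, -1):
    ungapped = score_alignment("GATC", "GTGC", gap=gap)
    gapped = score_alignment("GAT-C", "G-TGC", gap=gap)
    best = best_global_score("GATC", "GTGC", gap=gap)
    print(f"gap={gap}: ungapped={ungapped}, gapped={gapped}, best={best}")
```

With a gap penalty of -2 the ungapped alignment wins (0 v. -1), and with -1 the gapped alignment wins (1 v. 0). In both cases the best score found by the recurrence equals the better of the two alignments on the slide, so changing one parameter of the scoring model changes which alignment counts as “optimal”.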

Newer “Version” of Protein Matrices: BLOSUM
• By Henikoff & Henikoff (1992), based on a much larger group of aligned protein sequences in the Blocks database
• BLOSUM = Blocks substitution matrix
• Most commonly used today

Similarity v. Homology
• Similarity is strictly a measure based on a sequence alignment observation
  – Two sequences are 80% identical

  C T A G C G A C T T
  |   | | |   | | | |
  C G A G C C A C T T

In-class BLAST practice
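
As a warm-up for the practice session, the 80% figure from the Similarity v. Homology slide can be reproduced with a few lines of Python. This is a minimal sketch: `percent_identity` is a hypothetical helper, and dividing by the full alignment length is an assumption (tools differ on whether gap columns count in the denominator).

```python
def percent_identity(top, bottom):
    """Percent of alignment columns where both sequences have the same residue."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    if not top:
        raise ValueError("alignment is empty")
    # A column counts as identical only if the residues match and it is not a gap.
    matches = sum(1 for a, b in zip(top, bottom) if a == b and a != "-")
    # Assumption: the denominator is the alignment length, gaps included.
    return 100.0 * matches / len(top)


print(percent_identity("CTAGCGACTT", "CGAGCCACTT"))  # 8 of 10 columns match
```

The result is a statement about similarity only. Whether the two sequences are homologous is a separate inference drawn from that observation.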