Two Sequence Alignment and Scoring Models for BME 110: CompBio Tools, Assignments of Chemistry

An overview of sequence alignment and scoring models used in bioinformatics, specifically for BME 110: CompBio Tools. Topics include understanding E-values and P-values, homology and its types, dynamic programming, and scoring matrices. The document also covers the importance of selecting the appropriate scoring model and the limitations of BLAST hit transitivity.

Typology: Assignments

Pre 2010

Uploaded on 09/17/2009

koofers-user-chb


Two Sequence Alignment & Scoring Models
BME 110: CompBio Tools
Todd Lowe
April 13, 2006

Admin
• Reading:
  – Finish Claverie, Chapters 7, 8
  – NCBI Blast Guide / info page http://www.ncbi.nlm.nih.gov/BLAST/Why.shtml
• Homework #1 now online, due in 1 week (Thursday, April 20)
• Today
  – Finish BLAST overview
  – Two sequence alignments
  – In-class BLAST exercises

What is Reliable?
• In biology, a P-value of 0.05 would usually be “good enough” (5 chances in 100 that the match is due to chance)
• Due to BLAST’s estimation of significance, shouldn’t trust P or E values > 1 x 10^-4
• Note: they may still be paralogs with different function! Examine alignment!
• For good measure, I don’t have great confidence unless the value is < 1 x 10^-8

Beware Hit Transitivity!
• “BLAST hits are not transitive, unless alignments are overlapping”
    Seq1: AAAAABBBB
    Seq2: AAAAA
    Seq3:      BBBB
• Seq2 and Seq3 not necessarily homologous!

Example
• Fibrillarin-like protein
  – DNA: XM_293903, Protein: XP_293903
• How “far” can we go in tree of life using nucleotide v. protein searches?

Limit Search Space
• If you only want hits to a specific genome or domain, it is **much** faster to restrict the search to it

A Related Note: Homology
• Based on inference that two sequences are ancestrally derived from the same molecule
• If two sequences have high similarity, they may be inferred to be homologous
• It is WRONG to say two sequences or genes are 80% homologous (they either are related, or they are not)

Homology: Same Function?
• Even if two sequences are ancestrally derived from the same molecule, they may or may not still have the same function
  – Orthologs: homologous genes created by speciation
    • Generally implies function remains the same
  – Paralogs: homologous genes created by a gene duplication event (in same species)
    • Implies function may have changed

Full-genome Comparisons
[Figure: full-genome comparison plot]
• Here, each dot is a gene match, not a nucleotide match
• From Zivanovic et al., NAR 30: 1902-10

Pair-wise Sequence Comparison
• Basis for relating biological information from a well-studied gene to a new sequence
• Many programs exist for pairwise comparison
• Some do fast database searches and get “good” alignments
  – One sequence v. many thousands:
    • BLAST or FASTA
• Some are much slower, but guarantee the “optimal alignment”
  – Smith-Waterman is the de facto standard

What is Optimal??
• How do we get an “optimal” alignment?
• Optimal to whom?
• Optimal based on scoring model:
  – Substitution scoring matrix
  – Insertion / deletion scoring (penalties)
• Caution: Just because it is optimal for a given scoring scheme doesn’t mean it is biologically correct!!

Which is better?
Match +1, Mismatch -1, Gap -2

    G A T C        +1 -1 -1 +1
    |     |                         (Score = 0)
    G T G C

  OR

    G A T - C      +1 -2 +1 -2 +1
    |   |   |                       (Score = -1)
    G - T G C

Which is better?
Match +1, Mismatch -1, Gap -1

    G A T C        +1 -1 -1 +1
    |     |                         (Score = 0)
    G T G C

  OR

    G A T - C      +1 -1 +1 -1 +1
    |   |   |                       (Score = 1)
    G - T G C

Moral: Scoring Model Matters!!
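The two “Which is better?” comparisons can be re-checked with a few lines of Python. This is only a minimal sketch: the function name `score_alignment` and its parameters are made up for illustration, and it assumes a simple linear gap penalty in which any column containing "-" is charged the gap score.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Sum per-column scores for two aligned strings of equal length.

    Assumption: '-' marks a gap, and every gap column costs `gap`
    (linear penalty, no separate gap-open / gap-extend terms).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # insertion / deletion column
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # substitution
    return total


ungapped = ("GATC", "GTGC")
gapped = ("GAT-C", "G-TGC")

for gap_penalty in (-2, -1):
    s_ungapped = score_alignment(*ungapped, gap=gap_penalty)
    s_gapped = score_alignment(*gapped, gap=gap_penalty)
    print(f"gap={gap_penalty}: ungapped={s_ungapped}, gapped={s_gapped}")
```

With Gap -2 the ungapped alignment scores higher (0 v. -1); with Gap -1 the gapped alignment scores higher (1 v. 0). Same sequences, different “best” alignment.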
• For DNA, model can be very simple:
  – +1 match, -1 mismatch
• However, not all mutations have equal likelihood:
  – Transition (purine <-> purine or pyrimidine <-> pyrimidine): A <-> G or C <-> T (more likely)
  – Transversion (purine <-> pyrimidine): e.g. A <-> C or G <-> T (less likely)

Protein Matrices, Same Idea
• Original: Dayhoff matrix aka PAM
• PAM = Percent accepted mutations
• Based on small number of correctly aligned proteins
• Simply count how often each amino acid is substituted for another
• Frequency of substitutions based on properties of amino acids relative to each other

[Figure: amino acids grouped by property: small non-polar, small polar, large non-polar, large polar]
• The closer two amino acids are, the more similar in properties

Newer “Version” of Protein Matrices: BLOSUM
• By Henikoff & Henikoff (1992), based on a much larger group of aligned protein sequences in the Blocks database
• BLOSUM = Blocks substitution matrix
• Used most commonly today

Similarity v. Homology
• Similarity is strictly a measure based on a sequence alignment observation
  – Two sequences are 80% identical

    C T A G C G A C T T
    |   | | |   | | | |
    C G A G C C A C T T

In-class BLAST practice
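As a quick check of the 80% figure on the Similarity v. Homology slide, percent identity can be computed directly from the aligned strings. This is a sketch under one stated assumption: identity is taken as matching columns divided by total alignment length. Other conventions (for example dividing by the shorter sequence length) exist, and the helper name `percent_identity` is hypothetical.

```python
def percent_identity(top, bottom):
    """Percent of alignment columns holding the same residue in both rows.

    Assumption: the denominator is the full alignment length, and a
    column where both rows show '-' is not counted as a match.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    if not top:
        raise ValueError("alignment is empty")
    matches = sum(1 for a, b in zip(top, bottom) if a == b and a != "-")
    return 100.0 * matches / len(top)


seq_a = "CTAGCGACTT"
seq_b = "CGAGCCACTT"
print(percent_identity(seq_a, seq_b))  # 8 matching columns out of 10
```

The number describes similarity only. Whether the two sequences are homologous is a separate yes/no inference, as the slides stress.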