An Introduction to Bioinformatics
Special Topics BSC 4933/5936
Florida State University, Department of Biological Science
http://www.bio.fsu.edu
Sept. 23, 2003

How can you search the databases for similar sequences if pair-wise alignments take N² time?! Significance and heuristics . . .

Database Similarity Searching
Steven M. Thompson
Florida State University School of Computational Science and Information Technology (CSIT)

But, why even do database searches?

We can imagine screening databases for sequences similar to ours using the concepts of dynamic programming and log-odds scoring matrices and some yet-to-be-described tricks. But what do database searches tell us; what can we gain from them? Why even bother?

Inference through homology is a fundamental principle of biology. When a sequence is found to fall into a preexisting family, we may be able to infer function, mechanism, evolution, perhaps even structure, based on homology with its neighbors. If no significant similarity can be found, the very fact that your sequence is new and different could be very important. Granted, its characterization may prove difficult, but it could be well worth it.

Homology and similarity —

Don't confuse homology with similarity: there is a huge difference! Similarity is a statistic that describes how much two (sub)sequences are alike according to some set scoring criteria. It can be normalized to ascertain statistical significance, but it's still just a number. Homology, in contrast and by definition, implies an evolutionary relationship — more than just the fact that we've all evolved from the same old primordial 'ooze.' To demonstrate homology, reconstruct the phylogeny of the organisms or genes of interest. Better yet, show experimental evidence — structural, morphological, genetic, or fossil — that corroborates your assertion.

There is no such thing as percent homology; something is either homologous or it is not. Walter Fitch is credited with the quip that "homology is like pregnancy — you can't be 45% pregnant, just like something can't be 45% homologous. You either are or you are not." Highly significant similarity can argue for homology, but never the other way around.

Rules of thumb for a protein search —

The Z score represents the number of standard deviations some particular alignment is from a distribution of random alignments (the normal distribution). The Z scores in the table below very roughly correspond to the listed E values (based on the Extreme Value distribution) for a typical protein sequence similarity search through a database with ~125,000 protein entries.

  Z score   E value   Inference
  ≤ 3       ≥ 0.1     little, if any, evidence for homology, but cannot disprove it!
  ≈ 5       ≈ 10⁻²    probably homologous, but may be due to convergent evolution
  ≥ 10      ≤ 10⁻³    strong evidence for homology

On to the searches —

But N² is way too slow! How can it be done? Database searching programs use the two concepts of dynamic programming and log-odds scoring matrices; however, dynamic programming takes far too long when used against most sequence databases with a 'normal' computer. Remember how big the databases are! Therefore, the programs use tricks to make things happen faster. These tricks fall into two main categories: that of hashing, and that of approximation.

Corn beef hash? Huh . . .

Hashing is the process of breaking your sequence into small 'words' or 'k-tuples' (think all chopped up, just like corned beef hash) of a set size and creating a 'look-up' table with those words keyed to position numbers. Computers can deal with numbers far faster than they can deal with strings of letters, and this preprocessing step happens very quickly. Then, whenever a word position matches part of an entry in the database, that match, the 'offset,' is saved. In general, hashing reduces the complexity of the search problem from N² for dynamic programming to N, the length of all the sequences in the database.

A simple hash table — (this example is from the Krane and Raymer text, p. 50)

The sequence FAMLGFIKYLPGCM, with a word size of one, would produce this query lookup hash table:

  word       A   C   F     G     I   K   L     M     P   Y
  position   2   13  1,6   5,12  7   8   4,10  3,14  11  9

Comparing it to the database sequence TGFIKYLPGACT would yield the following offset table (a dash marks database words, here the T's, that do not occur in the query):

  position   1   2     3     4   5   6   7     8   9     10   11   12
  word       T   G     F     I   K   Y   L     P   G     A    C    T
  offset(s)  -   3,10  -2,3  3   3   3   -3,3  3   -4,3  -8   2    -

Hmmm & some interpretation —

The offset numbers come from the difference between the positions of the words in the query sequence and the positions of the occurrences of those words in the target sequence. Then . . . look at all of the offsets equal to three in the previous table. Therefore, offset the alignment by three:

  FAMLGFIKYLPGCM
      ||||||||
     TGFIKYLPGACT

Quick and easy. Computers can compare these sorts of tables very fast. The trick is to 'know' how far to attempt to extend the alignment out.
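To make the hashing idea concrete, here is a minimal Python sketch that rebuilds the Krane and Raymer lookup and offset tables above, assuming a word size of one and 1-based positions. It is only an illustration of the bookkeeping, not how FASTA or BLAST is actually implemented.

```python
from collections import defaultdict

def build_lookup(query, word_size=1):
    """Hash the query into words keyed to their (1-based) start positions."""
    table = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        table[query[i:i + word_size]].append(i + 1)
    return table

def offset_table(query, target, word_size=1):
    """For each target position, list offsets (query position - target position)
    for every occurrence of that word in the query lookup table."""
    lookup = build_lookup(query, word_size)
    offsets = {}
    for j in range(len(target) - word_size + 1):
        hits = lookup.get(target[j:j + word_size], [])
        offsets[j + 1] = [i - (j + 1) for i in hits]
    return offsets

# The Krane and Raymer example from the slide above
query, target = "FAMLGFIKYLPGCM", "TGFIKYLPGACT"
for pos, offs in offset_table(query, target).items():
    print(pos, target[pos - 1], offs)
# Offset 3 shows up again and again, so shifting the target by three
# columns lines up the shared GFIKYLPG stretch, just as on the slide.
```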
OK. Heuristics . . . What's that?

Approximation techniques are collectively known as 'heuristics.' Webster's defines heuristic as "serving to guide, discover, or reveal; . . . but unproved or incapable of proof." In database similarity searching techniques, the heuristic usually restricts the necessary search space by calculating some sort of statistic that allows the program to decide whether further scrutiny of a particular match should be pursued. This statistic may miss things depending on the parameters set — that's what makes it heuristic. 'Worthwhile' results at the end are compiled and the longest alignment within the program's restrictions is created. The exact implementation varies between the different programs, but the basic idea follows in almost all of them.

The BLAST algorithm, continued

The math can be generalized thus: for any two sequences of length m and n, local, best alignments are identified as HSPs. HSPs are stretches of sequence pairs that cannot be further improved by extension or trimming, as described above. For ungapped alignments, the number of expected HSPs with a score of at least S is given by the formula:

  E = K·m·n·e^(−λS)

This is called the E-value for the score S. In a database search, n is the size of the database in residues, so N = m·n is the search space size. K and λ are supplied by statistical theory and, as mentioned above, can be calculated by comparison to pre-computed, simulated distributions. These two parameters define the statistical significance of an E-value. The E-value defines the significance of the search. As mentioned above, the smaller an E-value is, the more likely it is significant. A value of 0.01 to 0.001 is a good starting point for significance in most typical searches. In other words, in order to assess whether a given alignment constitutes evidence for homology, it helps to know how strong an alignment can be expected from chance alone.
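As a rough illustration of the formula above, the sketch below plugs a raw score and a search-space size into the Karlin-Altschul relationship. The K and lambda defaults are commonly quoted values for ungapped BLOSUM62 comparisons and are placeholders only; a real BLAST run reports the parameters it actually used, so treat these numbers as assumptions for the sake of the example.

```python
import math

def expected_hsps(raw_score, query_len, db_len, K=0.134, lam=0.3176):
    """Karlin-Altschul expectation for ungapped HSPs: E = K * m * n * exp(-lambda * S).

    The K and lam defaults are commonly quoted ungapped BLOSUM62 values,
    used here as placeholders; they are not taken from any particular
    BLAST run.
    """
    return K * query_len * db_len * math.exp(-lam * raw_score)

# A raw score of 100 for a 400-residue query against a 5e7-residue database:
print(expected_hsps(100, 400, 5e7))   # about 4e-5, well below the 0.01-0.001 rule of thumb
```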
The Fast algorithm — in more detail

Fast is an older algorithm than BLAST. The original Fast paper came out in 1988, based on David Lipman's work in a 1983 paper; the original BLAST paper was published in 1990. Both algorithms have been upgraded substantially since they were originally released. Fast was the first widely used, powerful sequence database searching algorithm. Bill Pearson continually refines the programs such that they remain a viable alternative to BLAST, especially if one is restricted to searching DNA against DNA without translation. They are also very helpful in situations where BLAST finds no significant alignments — arguably, Fast may be more sensitive than BLAST in these situations.

Fast is also a hashing-style algorithm and builds words of a set k-tuple size, by default two for peptides. It then identifies all exact word matches between the sequence and the database members. Note that the word matches must be exact for Fast and only similar, above some threshold, for BLAST.

The Fast algorithm, continued

From these exact word matches:

1) Scores are assigned to each continuous, ungapped diagonal by adding all of the exact-match BLOSUM values.

2) The ten highest-scoring diagonals for each query-database pair are then rescored using BLOSUM similarities as well as identities, and ends are trimmed to maximize the score. The best of each of these is called the Init1 score.

3) Next the program 'looks' around to see if nearby off-diagonal Init1 alignments can be combined by incorporating gaps. If so, a new score, Initn, is calculated by summing up all the contributing Init1 scores, applying a penalty for each gap.

4) The program then constructs an optimal local alignment for all Initn pairs with scores better than some set threshold, using a variation of dynamic programming "in a band." A sixteen-residue band centered on the highest Init1 region is used by default with peptides. The score generated from this step is called opt.

The Fast algorithm, still continued

5) Next, Fast uses a simple linear regression against the natural log of the search-set sequence length to calculate a normalized z-score for the sequence pair. Note that this is not the same Monte Carlo style Z score described earlier, and cannot be directly compared to one.

6) Finally, it compares the distribution of these z-scores to the actual extreme-value distribution of the search. Using this distribution, the program estimates the number of sequences that would be expected to have, purely by chance, a z-score greater than or equal to the z-score obtained in the search. This is reported as the Expectation value.

7) If the user requests pair-wise alignments in the output, then the program uses full Smith-Waterman local dynamic programming, not 'restricted to a band,' to produce its final alignments.
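Steps 1 and 2 are the easiest to picture in code. The toy sketch below, using the same two sequences as the earlier hash-table example, finds exact two-letter word matches (the peptide default mentioned above), groups them by diagonal, and ranks the ten best diagonals. For brevity each matched word scores one point rather than its BLOSUM value, so this is a simplified stand-in for Fast's real scoring, not the program itself.

```python
from collections import defaultdict

def top_diagonals(query, target, k=2, n_best=10):
    """Group exact k-tuple matches by diagonal (query_pos - target_pos) and
    score each diagonal by its number of matching words.  Real FASTA adds
    BLOSUM values for the matched residues instead of simple counts."""
    words = defaultdict(list)
    for i in range(len(query) - k + 1):
        words[query[i:i + k]].append(i)
    diagonals = defaultdict(int)
    for j in range(len(target) - k + 1):
        for i in words.get(target[j:j + k], []):
            diagonals[i - j] += 1
    return sorted(diagonals.items(), key=lambda d: d[1], reverse=True)[:n_best]

query  = "FAMLGFIKYLPGCM"
target = "TGFIKYLPGACT"
for diag, score in top_diagonals(query, target):
    print(f"diagonal {diag:+d}: {score} word matches")
# The best diagonal here (+3) covers the shared GFIKYLPG region, the kind
# of Init1 stretch that banded dynamic programming would then refine.
```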
Let's see 'em in action

To begin we'll go to the most widely used (and abused!) biocomputing program on earth: NCBI's BLAST. Connect to NCBI's BLAST page with any Web browser: http://www.ncbi.nlm.nih.gov/BLAST/. There is a wealth of information there, including a wonderful tutorial and several very good essays for teaching yourself far more about BLAST than this lecture can ever hope to cover. For now I'll demonstrate with a simple example, one of my favorites: the elongation factor 1α protein from Giardia lamblia, named EF1A_Giala in the Swiss-Prot database, although we have to use the accession code, Q08046, for NCBI's BLAST server to find the sequence. Let's see how it works and how quickly we get results back.

Let's contrast that with GCG's BLAST version. I'll illustrate with the same molecule, and I'll use GCG's SeqLab GUI to show the difference between the two implementations of the program.

And finally, let's see how GCG's FastA version compares to either BLAST implementation. Again, I'll launch the program from SeqLab with the same example, but this time I'll take advantage of FastA's flexible database search syntax, which can use any valid GCG sequence specification. Here I'll search against a precompiled LookUp list file of all of the so-called 'primitive' eukaryotes in Swiss-Prot.
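The demonstrations above are interactive, through NCBI's web page and GCG's SeqLab. As a purely optional aside, the same NCBI search can also be scripted; the sketch below uses Biopython's qblast wrapper around NCBI's online BLAST service. It assumes the biopython package, network access, and today's NCBI service rather than the 2003-era page shown in the lecture.

```python
# A scripted alternative to the interactive demo: submit the Giardia EF-1alpha
# query (accession Q08046) to NCBI's online BLAST service via Biopython.
# Requires the biopython package and a network connection; NCBI throttles
# submissions, so the search may take a few minutes to return.
from Bio.Blast import NCBIWWW, NCBIXML

result_handle = NCBIWWW.qblast("blastp", "nr", "Q08046")
record = NCBIXML.read(result_handle)

# Report the five best hits with the E-value of their top-scoring HSP.
for alignment in record.alignments[:5]:
    best_hsp = alignment.hsps[0]
    print(f"{alignment.title[:60]}  E={best_hsp.expect:.2g}")
```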