Lecture 9
Database Searching

Database Searching for Similar Sequences
• Database searching for similar sequences is ubiquitous in bioinformatics.
• Databases are large and getting larger.
• Fast methods are needed.

Types of Searches
• Sequence similarity search with a query sequence
• Alignment search with a profile (scoring matrix with gap penalties)
• Search with a position-specific scoring matrix representing an ungapped sequence alignment
• Iterative alignment search for similar sequences that starts with a query sequence, builds a multiple alignment, and then uses the alignment to augment the search
• Search of a query sequence for patterns representative of protein families
From Bioinformatics by Mount

DNA vs Protein Searches
• DNA sequences consist of 4 characters (nucleotides).
• Protein sequences consist of 20 characters (amino acids).
• Hence, it is easier to detect patterns in protein sequences than in DNA sequences.
• It is better to translate DNA sequences into protein sequences for searches.

Database Searching Efficacy
• To evaluate searching methods, selectivity and sensitivity need to be considered.
• Selectivity is the ability of the method not to find members known to be of another group (i.e., to avoid false positives).
• Sensitivity is the ability of the method to find members of the same protein family as the query sequence.

Protein Searches
• It is easier to identify protein families by sequence similarity than by structural similarity (the same structure does not mean the same sequence).
• Use appropriate gap penalties.
• Evaluate results for statistical significance.

History
• Historically, dynamic programming was used for database sequence similarity searching.
• Computer memory, disk space, and CPU speed were limiting factors.
• Speed is still a factor because of larger databases and the increased number of searches.
• FASTA and BLAST allow fast searching.

History
• The PAM250 matrix was used for a long time. It corresponds to a period of evolutionary time after which only 20% of the amino acids remain unchanged.
• BLOSUM has replaced PAM250 in most applications. BLAST uses the BLOSUM62 matrix. FASTA uses the BLOSUM50 matrix.

Dynamic Programming
• Use the Smith-Waterman algorithm, or an improvement thereof, for local alignment.
• Compares individual characters across the full-length sequences
• Slower but more sensitive than FASTA or BLAST

FASTA
• Fast alignment of pairs of protein or DNA sequences
• Searches for matching sequence patterns, or words, called k-tuples, which correspond to k consecutive matches in both sequences
• Local alignments are built from these matches.
• Better for DNA searches than BLAST (the k-tuple can be smaller than BLAST's minimum word size of 7)

FASTA Algorithm
• FASTA searches for regions of similarity by hashing.
• In hashing, a lookup table showing the positions of each k-tuple is made for each sequence.
• The relative position (offset) of a k-tuple in the two sequences is calculated by subtracting the positions of its first characters.
• K-tuples having the same offset are considered to be aligned.

FASTA Algorithm
• The number of comparisons increases linearly with the average sequence length.
• In dot plots, the number of comparisons increases as the square of the length. In dynamic programming, it increases as the square with linear or affine gap penalties and as the cube with general gap penalties, as in the original formulations.
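
The hashing step can be sketched in a few lines of Python. This is a minimal illustration of the lookup-table and offset idea described in the bullets, not the FASTA program itself: the function names `ktuple_positions` and `offset_counts`, the two example sequences, and the choice of k = 2 are all assumptions made for this example, and the later FASTA stages (rescoring and joining regions) are left out.

```python
from collections import Counter, defaultdict


def ktuple_positions(seq, k):
    """Build a lookup table: k-tuple -> list of 0-based start positions in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def offset_counts(query, subject, k=2):
    """Count shared k-tuples per offset (query position minus subject position)."""
    q_table = ktuple_positions(query, k)
    s_table = ktuple_positions(subject, k)
    counts = Counter()
    for word, q_positions in q_table.items():
        # Only words present in both tables contribute; no all-vs-all scan.
        for qi in q_positions:
            for si in s_table.get(word, ()):
                counts[qi - si] += 1
    return counts


# Hypothetical example sequences sharing the stretch "AAQI".
counts = offset_counts("HARFYAAQIVL", "VDMAAQIA", k=2)
best_offset, hits = counts.most_common(1)[0]
print(best_offset, hits)  # prints "2 3"
```

Here the words AA, AQ, and QI occur in both sequences, each at offset 2, so they fall on the same diagonal and are treated as aligned. Each table is built in a single pass over its sequence, which is where the roughly linear growth in comparisons comes from; the remaining work depends on how many k-tuples the two sequences share.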