Lecture 9: Database Searching

Database Searching for Similar Sequences
• Database searching for similar sequences is ubiquitous in bioinformatics.
• Databases are large and getting larger.
• Fast search methods are needed.

Types of Searches
• Sequence similarity search with a query sequence
• Alignment search with a profile (scoring matrix with gap penalties)
• Search with a position-specific scoring matrix representing an ungapped sequence alignment
• Iterative alignment search for similar sequences that starts with a query sequence, builds a multiple alignment, and then uses the alignment to augment the search
• Search of a query sequence for patterns representative of protein families
From Bioinformatics by Mount

DNA vs Protein Searches
• DNA sequences consist of 4 characters (nucleotides).
• Protein sequences consist of 20 characters (amino acids).
• Hence, it is easier to detect patterns in protein sequences than in DNA sequences.
• It is better to translate DNA sequences into protein sequences for searches.

Search Tools
• Similarity Search Tools
  – Smith-Waterman Searching
• Heuristic Search Tools
  – FASTA
  – BLAST

Dynamic Programming
• Uses the Smith-Waterman algorithm, or an improvement thereof, for local alignment
• Compares individual characters across the full-length sequences
• Slower but more sensitive than FASTA or BLAST
• Finds the optimal alignment

FASTA
• Fast alignment of pairs of protein or DNA sequences
• Searches for matching sequence patterns or words, called k-tuples, corresponding to k consecutive matches in both sequences
• Local alignments are built based on these matches.
• Better for DNA searches than BLAST (the k-tuple can be smaller than the minimum of 7 for BLAST)
• No guarantee of finding the exactly optimal alignment

FASTA Algorithm
• FASTA searches for regions of similarity by hashing.
• In hashing, a lookup table showing the positions of each k-tuple is made for each sequence.
• The relative positions of a k-tuple in the two sequences are calculated by subtracting the positions of its first characters.
• K-tuples having the same offset are considered to be aligned.
• Adjacent regions are joined, if possible, by inserting gaps.
• The highest-scoring regions are aligned by dynamic programming.
• The number of comparisons increases linearly with the average sequence length.
• By contrast, the number of comparisons grows as the square of the length for dot plots, and for dynamic programming as the cube in the original formulations (the square with linear or affine gap penalties).

Significance of FASTA Searches
• The average score is plotted against the log of the average sequence length in each length range.
• A line is fit by linear regression, and the z-score is the number of standard deviations from the fitted line.
• A statistical distribution of alignment scores can be used to determine probabilities.

BLAST Algorithm
• The alignments are extended as long as the similarity score increases; alignments that overlap are combined.
• These high-scoring segment pairs are matched across the entire database and listed.
• The statistical significance of each is calculated.

Database Searching with a Scoring Matrix or Profile
• A combination of dynamic programming, genetic algorithms, or hidden Markov models can be used to extract patterns from a multiple sequence alignment.
• Pattern-finding and statistical methods (expectation maximization and Gibbs sampling) can also be used.
• Example: PROFILE HMM

Database Searching with a Position-Specific Scoring Matrix
• The previous method can be used to make a position-specific scoring matrix.
• The position-specific scoring matrix is moved along the sequence to score every possible position in the query sequence.
• The highest-scoring positions are typically the best matches for the corresponding set of sequences in the database.
• Examples: EMOTIF, MOTIF, PHI-BLAST, BLOCKS, Profilesearch
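
The sliding-window scoring described in the last slide can be sketched in Python. This is a minimal illustration under stated assumptions, not the method used by any of the tools listed: the function names (build_pssm, scan), the DNA alphabet, the pseudocount of 1, the uniform background frequency, and the example sequences are all hypothetical choices made for the example.

```python
import math

ALPHABET = "ACGT"  # assumed alphabet; a protein matrix would use the 20 amino acids


def build_pssm(aligned, pseudocount=1.0, background=0.25):
    """Build a log-odds matrix from equal-length, ungapped aligned sequences."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    pssm = []
    for col in range(length):
        column = [s[col] for s in aligned]
        total = len(column) + pseudocount * len(ALPHABET)
        scores = {}
        for ch in ALPHABET:
            # Pseudocounts keep unseen characters from producing log(0).
            freq = (column.count(ch) + pseudocount) / total
            scores[ch] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def scan(pssm, sequence):
    """Slide the matrix along the sequence; return (start, score) per window."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pssm[i][ch] for i, ch in enumerate(window))
        hits.append((start, score))
    return hits


# Made-up ungapped alignment and query sequence, for illustration only.
motif_alignment = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
pssm = build_pssm(motif_alignment)
hits = scan(pssm, "GGCGTATAATGCCG")
best_start, best_score = max(hits, key=lambda h: h[1])
print(best_start, round(best_score, 2))
```

With this made-up data, the highest-scoring window starts at index 4, where the query contains TATAAT. The sketch assumes every character of the query is in the alphabet; a real search would also need a rule for ambiguous characters and a significance threshold for reporting hits.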