Docsity
Bioinformatics: Techniques & Tools for Sequence Similarity Searches in Databases, Study notes of Bioinformatics

These notes cover various techniques and tools for database searching in bioinformatics, with a focus on sequence similarity searches. Topics include the different types of searches, the efficacy of protein searches, historical background, and specific search tools such as FASTA and BLAST. The notes also cover the significance of search results and the various versions of FASTA and BLAST.

Typology: Study notes

Pre 2010

Uploaded on 02/12/2009

koofers-user-r2v


Partial preview of the text

Lecture 9 Database Searching

Database Searching for Similar Sequences
• Database searching for similar sequences is ubiquitous in bioinformatics.
• Databases are large and getting larger.
• Fast methods are needed.

Types of Searches
• Sequence similarity search with a query sequence
• Alignment search with a profile (scoring matrix with gap penalties)
• Search with a position-specific scoring matrix representing an ungapped sequence alignment
• Iterative alignment search for similar sequences that starts with a query sequence, builds a multiple alignment, and then uses the alignment to augment the search
• Search of the query sequence for patterns representative of protein families
From Bioinformatics by Mount

DNA vs Protein Searches
• DNA sequences consist of 4 characters (nucleotides).
• Protein sequences consist of 20 characters (amino acids).
• Hence, it is easier to detect patterns in protein sequences than in DNA sequences.
• It is better to convert DNA sequences to protein sequences for searches.

Search Tools
• Similarity Search Tools
  – Smith-Waterman Searching
• Heuristic Search Tools
  – FASTA
  – BLAST

Dynamic Programming
• Uses the Smith-Waterman algorithm, or an improvement thereof, for local alignment.
• Compares individual characters in the full-length sequences.
• Slower but more sensitive than FASTA or BLAST.
• Finds the optimal alignment.

FASTA
• Fast alignment of pairs of protein or DNA sequences.
• Searches for matching sequence patterns or words, called k-tuples, corresponding to k consecutive matches in both sequences.
• Local alignments are built based on these matches.
• Better for DNA searches than BLAST (the k-tuple can be smaller than the minimum of 7 for BLAST).
• No guarantee of finding the exactly optimal alignment.

FASTA Algorithm
• FASTA searches for regions of similarity by hashing.
• In hashing, a lookup table showing the positions of each k-tuple is made for each sequence.
• The relative positions of a k-tuple in the two sequences are calculated by subtracting the positions of the first characters.
• K-tuples having the same offset are considered to be aligned.
• Adjacent regions are joined, if possible, by inserting gaps.
• The highest-scoring regions are aligned by dynamic programming.
• The number of comparisons increases linearly with the average sequence length.
• In dynamic programming and dot plots, the number of comparisons increases as the cube or square of the length, respectively.

Significance of FASTA Searches
• The average score is plotted against the log of the average sequence length in each length range.
• A line is fit with linear regression, and the z-score is the number of standard deviations from the fitted line.
• A statistical distribution of alignment scores can be used to determine probabilities.

BLAST Algorithm
• The alignments are extended as long as the similarity score increases and, if they overlap, they are combined.
• These high-scoring segment pairs are matched in the entire database and listed.
• The statistical significance of these is calculated.

Database Searching with a Scoring Matrix or Profile
• A combination of dynamic programming, genetic algorithms, or hidden Markov models can be used to extract patterns from a multiple sequence alignment.
• Pattern-finding and statistical methods (expectation maximization and Gibbs sampling) can also be used.
• Example: PROFILE HMM

Database Searching with a Position-Specific Scoring Matrix
• The previous method can be used to make a position-specific scoring matrix.
• The position-specific scoring matrix is moved along the sequence to score every possible sequence position in the query sequence.
• The highest-scoring positions are typically the best matches for the corresponding set of sequences in the database.
• Examples: EMOTIF, MOTIF, PHI-BLAST, BLOCKS, Profilesearch
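The sliding of a position-specific scoring matrix along a sequence, described in the last set of bullets, can be illustrated with a minimal Python sketch. It is not the method used by any of the tools listed; the function names (`build_pssm`, `scan`), the toy DNA alignment, the uniform background frequencies, and the pseudocount of 1 are all assumptions made for illustration.

```python
import math

def build_pssm(aligned, alphabet="ACGT", pseudocount=1.0):
    """Build a log-odds matrix from equal-length, ungapped sequences.

    Assumes a uniform background frequency for every symbol.
    """
    width = len(aligned[0])
    background = 1.0 / len(alphabet)
    pssm = []
    for col in range(width):
        # Start every count at the pseudocount so unseen symbols score finitely.
        counts = {ch: pseudocount for ch in alphabet}
        for seq in aligned:
            counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append(
            {ch: math.log2((counts[ch] / total) / background) for ch in alphabet}
        )
    return pssm

def scan(pssm, sequence):
    """Score every window of len(pssm) symbols; return (start, score) pairs."""
    width = len(pssm)
    return [
        (start, sum(pssm[i][sequence[start + i]] for i in range(width)))
        for start in range(len(sequence) - width + 1)
    ]

# Toy ungapped alignment (hypothetical data, not from the lecture).
motif_alignment = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
pssm = build_pssm(motif_alignment)
hits = scan(pssm, "GGCGTATAATGCC")
best_start, best_score = max(hits, key=lambda hit: hit[1])
print(best_start, round(best_score, 2))
```

The highest-scoring window here is the one that matches the alignment's consensus at every column. A real search would also need a significance estimate for each score, which this sketch leaves out.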
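The hashing step from the FASTA Algorithm bullets earlier in these notes (a lookup table of k-tuple positions, with offsets obtained by subtracting positions) can be sketched in the same way. This is a simplified illustration under stated assumptions, not the FASTA program's implementation: the names `ktuple_positions` and `diagonal_counts` are hypothetical, the two short sequences are toy data, and the later steps (joining regions with gaps, dynamic programming) are omitted.

```python
from collections import Counter, defaultdict

def ktuple_positions(seq, k):
    """Lookup table: each k-tuple maps to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table

def diagonal_counts(query, target, k=2):
    """Count shared k-tuples per offset (query position minus target position)."""
    lookup = ktuple_positions(query, k)
    diagonals = Counter()
    for j in range(len(target) - k + 1):
        # .get avoids inserting empty entries into the defaultdict.
        for i in lookup.get(target[j:j + k], ()):
            diagonals[i - j] += 1
    return diagonals

# Toy protein sequences chosen for illustration.
query = "HARFYAAQIVL"
target = "VDMAAQIA"
offsets = diagonal_counts(query, target, k=2)
print(offsets.most_common(1))
```

K-tuples that share an offset lie on the same diagonal of a dot plot, so the offset with the most hits marks the most promising region to align. Each target k-tuple needs only a table lookup, which is why the work grows roughly linearly with sequence length.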



Copyright © 2024 Ladybird Srl - Via Leonardo da Vinci 16, 10126, Torino, Italy - VAT 10816460017 - All rights reserved