Database Searching for Similar Sequences: Techniques, Algorithms, and Tools, Study notes of Bioinformatics

These study notes cover techniques for database searching for similar sequences in bioinformatics: the different types of searches, the differences between DNA and protein searches, and the efficacy of protein searches. The history of database sequence similarity searching is also explored, focusing on dynamic programming, FASTA, and BLAST. The significance of FASTA searches and the versions of FASTA are also discussed.

Typology: Study notes

Pre 2010

Uploaded on 02/12/2009

koofers-user-xty-1

Lecture 9: Database Searching

Database Searching for Similar Sequences
• Database searching for similar sequences is ubiquitous in bioinformatics.
• Databases are large and getting larger.
• Fast methods are needed.

Types of Searches
• Sequence similarity search with a query sequence
• Alignment search with a profile (scoring matrix with gap penalties)
• Search with a position-specific scoring matrix representing an ungapped sequence alignment
• Iterative alignment search for similar sequences that starts with a query sequence, builds a multiple alignment, and then uses the alignment to augment the search
• Search of the query sequence for patterns representative of protein families
From Bioinformatics by Mount

DNA vs Protein Searches
• DNA sequences consist of 4 characters (nucleotides).
• Protein sequences consist of 20 characters (amino acids).
• Hence, it is easier to detect patterns in protein sequences than in DNA sequences.
• When the DNA encodes a protein, it is better to translate it into a protein sequence for searches.

Database Searching Efficacy
• To evaluate searching methods, selectivity and sensitivity need to be considered.
• Selectivity is the ability of the method to avoid reporting sequences known to belong to another group (i.e., false positives).
• Sensitivity is the ability of the method to find members of the same protein family as the query sequence.

Protein Searches
• It is easier to identify protein families by sequence similarity than by structural similarity (the same structure does not imply the same sequence).
• Use appropriate gap penalties.
• Evaluate results for statistical significance.

History
• Historically, dynamic programming was used for database sequence similarity searching.
• Computer memory, disk space, and CPU speed were limiting factors.
• Speed is still a factor because of larger databases and the increased number of searches.
• FASTA and BLAST allow fast searching.

History
• The PAM250 matrix was used for a long time. It corresponds to a period of evolutionary time after which only about 20% of the amino acids remain unchanged.
• BLOSUM has replaced PAM250 in most applications. BLAST uses the BLOSUM62 matrix by default; FASTA uses the BLOSUM50 matrix by default.

Dynamic Programming
• Uses the Smith-Waterman algorithm, or an improvement thereof, for local alignment.
• Compares individual characters across the full length of the sequences.
• Slower but more sensitive than FASTA or BLAST.

FASTA
• Fast alignment of pairs of protein or DNA sequences
• Searches for matching sequence patterns or words, called k-tuples, corresponding to k consecutive matches in both sequences
• Local alignments are built from these matches.
• Better for DNA searches than BLAST (the k-tuple can be smaller than BLAST's minimum of 7)

FASTA Algorithm
• FASTA searches for regions of similarity by hashing.
• In hashing, a lookup table showing the positions of each k-tuple is made for each sequence.
• The relative positions of a k-tuple in the two sequences are calculated by subtracting the position of its first character in one sequence from its position in the other.
• K-tuples having the same offset are considered to be aligned.

FASTA Algorithm
• The number of comparisons increases linearly with the average sequence length.
• In dynamic programming and dot plots, the number of comparisons increases as the cube or square of the length, respectively (the cubic figure applies to dynamic programming with arbitrary gap penalties; with linear or affine gap penalties it is also quadratic).
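
The hashing step described on the FASTA Algorithm slides can be sketched in a few lines of Python. This is only a minimal illustration of the k-tuple lookup and offset idea, not the FASTA program itself: the function names `ktuple_positions` and `offset_counts`, the choice k = 2, and the two short example sequences are assumptions made for the example. For brevity the sketch builds the lookup table for one sequence and scans the other against it, which gives the same offsets as building a table for each.

```python
from collections import Counter, defaultdict


def ktuple_positions(seq, k):
    """Lookup table: each k-tuple -> list of 0-based start positions in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def offset_counts(query, subject, k=2):
    """Count shared k-tuples per offset (query position minus subject position)."""
    subject_table = ktuple_positions(subject, k)
    counts = Counter()
    for i in range(len(query) - k + 1):
        # .get avoids adding empty entries to the lookup table
        for j in subject_table.get(query[i:i + k], ()):
            counts[i - j] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical short protein fragments, chosen only for illustration.
    counts = offset_counts("HARFYAAQIVL", "VDMAAQIA", k=2)
    best_offset, hits = counts.most_common(1)[0]
    print(best_offset, hits)  # prints: 2 3
```

Here the 2-tuples AA, AQ, and QI occur in both sequences, each starting two positions later in the first sequence than in the second, so all three share offset 2 and would be treated as lying on the same diagonal. Each position in the scanned sequence costs one table lookup, which is why the work grows roughly linearly with sequence length.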