What is Entrez?
• An integrated information retrieval system
• A system of 31 linked databases
• A text search engine
• A tool for finding biologically linked data
• A retrieval engine
• A virtual workspace for manipulating large datasets
• NOT a database!

Lecture 2: Information Retrieval. Entrez
Book Chapter 3 (pg 55-70)

GenBank-Entrez: Integrated Information Retrieval System

Hard links
Cross-references between elements in different databases.
• CDS -> Protein -> Structure
• Gene -> SNP -> Disease

Neighboring
Relationships between elements of the same database.
• Weighted key terms: relevance pairs model of retrieval, based on word proximity and frequency in the database.
• BLAST: DNA and protein sequence similarities.
• VAST: Vector Alignment Search Tool. Protein structure similarities.
[Figure: Entrez neighboring for text, sequence (BLAST), and structure (VAST)]

VAST: Structure Neighbors
Vector Alignment Search Tool
For each 3D domain, locate the SSEs (secondary structure elements) and represent them as individual vectors.
VAST uses 3D domains only! Whole polypeptides are assigned 3D domain 0 (zero).
[Figure: human IL-4 with SSEs numbered 1 to 6]
http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml

Sequence alignment is the cornerstone of bioinformatics
• Sequence alignments search for matches between sequences
• Two broad classes of sequence alignments
– Global: covers the entire length. For highly similar sequences of approximately equal length (GeneTool).
– Local: finds the most similar regions (BLAST 2 Sequences).
• Alignment can be performed between two or more sequences
– Pairwise (Chp. 11)
– Multiple Seq. Alignment (Chp. 12)
[Figure: global alignment of VQQESGLVRTTC with QKESGPSSSYC; local alignment of the shared region ESG]
Book Chapter 11 (complete)

Sequence similarities
The biological importance of sequence alignment
• Sequence alignments assess the degree of similarity between sequences
• Most common measurement: percent identity
• Similar sequences suggest:
– Similar structure: proteins with similar sequences are likely to have similar structures
– Similar function: proteins with similar structures are likely to have similar functions and play similar biochemical roles. Conserved amino acids likely have an important function
– Common evolutionary history
• Fewer differences mean more recent divergence
• Similarity ≠ homology. Homology covers two types of relationship:
– Orthologous: direct descent from a common ancestor by speciation (similarity and colinearity).
– Paralogous: separated by a gene duplication event.

LPGPSKQMTRIW
|||| | ||| |
LPGPCK-MTRRW

Dot matrix analysis
• A graphical method
• Shows all possible alignments
• Caveats
– Some guesswork in picking parameters (window size, stringency)
– Not as rigorous or quantitative as other methods
[Figure: dot plot examples with sequences CTSRVPGSEQQ, CTSRVPEQQR, QQESGPVRSTC, and RQQEPVRSTC]

Dot matrix analysis: a real example
[Figure: self-comparison dot plots, MITE2.txt vs MITE2.txt and WIS.txt vs WIS.txt; panel settings: window size 23, stringency 15, and window size 1, stringency 1]

Detecting repeats with dotter
[Figure: dotter screenshots showing the Greyramp tool and the Alignment tool]

Devising a scoring system
• The problem with dot plots: no statistical measure of the quality of the alignment
• Measuring sequence similarity implies a metric: a quantitative statement of how similar the two sequences are

Match = 1, Mismatch = -1
AGCGCATCGGA
ATCGCTTTACA
Score = 6 - 5 = 1

Match = 1, Mismatch = -1, Open gap = -2, Extend gap = -1
AGCGC--TTCGGA
ATCGCGGTTTGCA
Score = 8 - 3 - 2 - 1 = 2

AGCGC-TT-CGGA
ATCGCGGTTTGCA
Score = 7 - 4 - 2 - 2 = -1

Affine gap penalties: a fixed deduction G is made for introducing the gap, and then an additional smaller deduction L is made that is proportional to the length (n) of the gap.
Gap penalty = G + (L × n)
(In the examples above, the opening penalty of 2 already covers the first gap position, so a gap of length n costs 2 + 1 × (n - 1).)

Side chains:
• C: sulfhydryl (Cys)
• STPAG: small hydrophilic
• NDEQ: acid, acid amide, and hydrophilic
• HRK: basic (His, Arg, Lys)
• MILV: small hydrophobic
• FYW: aromatic (F = Phe, Y = Tyr, W = Trp)

PAM Scoring Matrices or Mutation Data Matrix (MDM)

BLOSUM matrices
• Created by Henikoff & Henikoff (1992) from local multiple alignments of more distantly related sequences (based on 2000 conserved motifs called blocks, from 500 groups of proteins).
• Multiple alignments of short regions (without gaps) of related sequences were gathered.
• In each alignment, sequences at or above a threshold percent identity were clustered into groups and averaged.
• Calculations are made across different evolutionary distances, with no extrapolation.
• Substitution frequencies for all pairs of amino acids were calculated between the groups and used to create the log-odds BLOSUM (Block Substitution Matrix).
• Thus, BLOSUM62 means that sequences at least 62% identical were clustered together, so the sequences actually compared between clusters are no more than 62% identical.
• This allows detection of more distantly related sequences, as it downplays the role of the more closely related sequences in the block when building the matrix.
• BLOSUM matrices perform better than PAM matrices.

Score = log[ q_ij / (p_i p_j) ]
q_ij = how often amino acids i and j align
p_i = probability of amino acid i
p_j = probability of amino acid j

BLOSUM62: an example of scoring

     A   R   N   D   C   Q   E
A    4
R   -1   5
N   -2   0   6
D   -2  -2   1   6
C    0  -3  -3  -3   9
Q   -1   1   0   0  -3   5
E   -1   0   0   2  -4   2   5

A sequence comparison

Seq 1:   A   D   D   R   Q   C   E   R   A   D
Seq 2:   A   Q   E   R   Q   E   C   Q   A   Q
Score:   4   0   2   5   5  -4  -4   1   4   0
Total score: 13

Comparison PAM vs. BLOSUM

Matrix     Best use                                               Similarity %
PAM140     Short alignment, high similarity                       70-90
PAM160     Members of a protein family                            50-60
PAM250     Longer alignments, more divergent sequence             ≅30
BLOSUM90   Short alignment, high similarity                       70-90
BLOSUM80   Members of a protein family                            50-60
BLOSUM62   Most effective in finding all potential similarities   30-40
BLOSUM30   Longer alignments, more divergent sequence             <30
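
The BLOSUM62 scoring example above can be reproduced with a short Python sketch. It is only an illustration: the names `RESIDUES`, `LOWER_TRIANGLE`, `build_matrix`, and `score_ungapped` are made up for this example, only the seven residues shown on the slide are included, and which residue of each pair belongs to which sequence is an assumption (the total does not depend on it, because the matrix is symmetric).

```python
# Lower triangle of BLOSUM62 for the seven residues shown on the slide
# (rows and columns in the order A, R, N, D, C, Q, E).
RESIDUES = "ARNDCQE"
LOWER_TRIANGLE = [
    [4],
    [-1, 5],
    [-2, 0, 6],
    [-2, -2, 1, 6],
    [0, -3, -3, -3, 9],
    [-1, 1, 0, 0, -3, 5],
    [-1, 0, 0, 2, -4, 2, 5],
]


def build_matrix(residues, triangle):
    """Expand a lower-triangular table into a symmetric lookup dict."""
    matrix = {}
    for i, row in enumerate(triangle):
        for j, score in enumerate(row):
            matrix[(residues[i], residues[j])] = score
            matrix[(residues[j], residues[i])] = score
    return matrix


def score_ungapped(seq_a, seq_b, matrix):
    """Sum the substitution scores of an ungapped pairwise alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped alignment needs sequences of equal length")
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))


BLOSUM62_SUBSET = build_matrix(RESIDUES, LOWER_TRIANGLE)

# Residue pairs from the slide, read as two aligned sequences
# (assumption: first letter of each pair is sequence 1).
SEQ_1 = "ADDRQCERAD"
SEQ_2 = "AQERQECQAQ"

total = score_ungapped(SEQ_1, SEQ_2, BLOSUM62_SUBSET)
print("Total score:", total)  # Total score: 13
```

Storing only the lower triangle mirrors how the matrix is printed on the slide; expanding it into a dict keyed by residue pairs makes each lookup independent of the order of the two residues.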