Entrez: An Integrated Information Retrieval System for Biologically Linked Data, Study notes of Bioinformatics

Entrez is an information retrieval system consisting of 31 linked databases, a text search engine, and a retrieval engine. It provides hard links between elements in different databases (such as CDS, protein, and structure, or gene, SNP, and disease) and neighboring relationships between elements of the same database. Text neighbors are found with a relevance pairs model of retrieval based on word proximity and frequency in the database, and text searches support Boolean operators and truncation. Users can save searches and retrieve previous ones. VAST, the Vector Alignment Search Tool, is used to find structure neighbors.

Typology: Study notes

Pre 2010

Uploaded on 09/17/2009

koofers-user-0w6

What is Entrez?
• An integrated information retrieval system
• A system of 31 linked databases
• A text search engine
• A tool for finding biologically linked data
• A retrieval engine
• A virtual workspace for manipulating large datasets
• NOT a database!

Lecture 2: Information Retrieval. Entrez
Book Chapter 3 (pg 55-70)

GenBank-Entrez: Integrated Information Retrieval System
Hard links: cross-references between elements in different databases.
• CDS -> Protein -> Structure
• Gene -> SNP -> Disease
Neighboring: relationships between elements of the same database.
• Weighted key terms: relevance pairs model of retrieval, based on word proximity and frequency in the database.
• BLAST: DNA and protein sequence similarities.
• VAST: Vector Alignment Search Tool. Protein structure similarities.
[Figure: Entrez neighboring for text, sequence (BLAST), and structure (VAST)]

VAST: Structure Neighbors
Vector Alignment Search Tool
• For each 3D domain, locate the SSEs (secondary structure elements) and represent them as individual vectors.
• VAST uses 3D domains only! Whole polypeptides are assigned 3D domain 0 (zero).
[Figure: Human IL-4, with elements labeled 1 to 6]
http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml

Sequence alignment is the cornerstone of bioinformatics
• Sequence alignments search for matches between sequences
• Two broad classes of sequence alignments
– Global: covers the entire length. For highly similar sequences of approximately equal length (GeneTool).
– Local: finds the most similar regions (BLAST 2 sequences).
• Alignment can be performed between two or more sequences
– Pairwise (Chp. 11)
– Multiple sequence alignment (Chp. 12)
[Figure: global alignment of VQQESGLVRTTC with QKESGPSSSYC, and local alignment of the shared ESG region]
Book Chapter 11 complete

Sequence similarities: the biological importance of sequence alignment
• Sequence alignments assess the degree of similarity between sequences
• Most common measurement: percent identity
• Similar sequences suggest:
– Similar structure: proteins with similar sequences are likely to have similar structures
– Similar function: proteins with similar structures are likely to have similar functions and play similar biochemical roles. Conserved amino acids likely have an important function
– Common evolutionary history: fewer differences mean more recent divergence
• Similarity ≠ homology. Homology covers two types of relationship:
– Orthologous: descended directly from a common ancestor by speciation (similarity and colinearity).
– Paralogous: separated by a gene duplication event

LPGPSKQMTRIW
|||| | ||| |
LPGPCK-MTRRW

Dot matrix analysis
• A graphical method
• Shows all possible alignments
• Caveats
– Some guesswork in picking parameters: window size and stringency
– Not as rigorous or quantitative as other methods
[Figure: dot matrix plots comparing the sequences CTSRVPGSEQQ, CTSRVPEQQR, QQESGPVRSTC, and RQQEPVRSTC]

Dot matrix analysis: a real example
[Figure: MITE2.txt vs MITE2.txt and WIS.txt vs WIS.txt, plotted with window size 23, stringency 15 and with window size 1, stringency 1]

Detecting repeats with dotter
[Figure: dotter screenshots showing the Greyramp tool and the Alignment tool]

Devising a scoring system
• The problem with dot plots: no statistical measure of the quality of the alignment
• Measuring sequence similarity implies a metric: a quantitative statement of how similar the two sequences are

Match = 1, Mismatch = -1
AGCGCATCGGA
ATCGCTTTACA
Score = 6 - 5 = 1

Match = 1, Mismatch = -1, Open gap = -2, Extend gap = -1
AGCGC--TTCGGA
ATCGCGGTTTGCA
Score = 8 - 3 - 2 - 1 = 2

AGCGC-TT-CGGA
ATCGCGGTTTGCA
Score = 7 - 4 - 2 - 2 = -1

Affine gap penalties: a fixed deduction G is made for introducing the gap, and then an additional smaller deduction L is made that is proportional to the length (n) of the gap.
Gap penalty = G + (L * n)

Side chains:
• C: sulfhydryl (Cys)
• STPAG: small hydrophilic
• NDEQ: acid, acid amide, and hydrophilic
• HRK: basic (His, Arg, Lys)
• MILV: small hydrophobic
• FYW: aromatic (F=Phe, Y=Tyr, W=Trp)

PAM Scoring Matrices or Mutation Data Matrix (MDM)

BLOSUM matrices
• Created by Henikoff & Henikoff (1992) from local multiple alignments of more distantly related sequences (based on 2000 conserved motifs, called blocks, from 500 groups of proteins).
• Multiple alignments of short regions (without gaps) of related sequences were gathered.
• In each alignment, sequences at or above a threshold value of percent identity were clustered into groups and averaged.
• Calculated directly across different evolutionary distances, with no extrapolation.
• Substitution frequencies for all pairs of amino acids were calculated between the groups and used to create the log-odds BLOSUM (Block Substitution Matrix).
• Thus, for BLOSUM62, sequences at least 62% identical were clustered together, so the substitutions that are counted come from sequences less than 62% identical.
• This allows detection of more distantly related sequences, as it downplays the role of the more closely related sequences in the block when building the matrix.
• Perform better than PAM matrices.

Score = log[qij / (pi pj)]
qij = how often amino acids i and j align
pi = probability of amino acid i
pj = probability of amino acid j

BLOSUM62: an example of scoring

     A   R   N   D   C   Q   E
A    4
R   -1   5
N   -2   0   6
D   -2  -2   1   6
C    0  -3  -3  -3   9
Q   -1   1   0   0  -3   5
E   -1   0   0   2  -4   2   5

A sequence comparison

Sequence 1:   A   D   D   R   Q   C   E   R   A   D
Sequence 2:   A   Q   E   R   Q   E   C   Q   A   Q
Score:        4   0   2   5   5  -4  -4   1   4   0
Total score: 13

Comparison PAM vs. BLOSUM

Matrix     Best use                                               Similarity %
PAM140     Short alignment, high similarity                       70-90
PAM160     Members of a protein family                            50-60
PAM250     Longer alignments, more divergent sequence             ≅30
BLOSUM90   Short alignment, high similarity                       70-90
BLOSUM80   Members of a protein family                            50-60
BLOSUM62   Most effective in finding all potential similarities   30-40
BLOSUM30   Longer alignments, more divergent sequence             <30
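
The match, mismatch, and gap arithmetic in the "Devising a scoring system" examples can be checked with a short script. This is only a sketch: the function name `score_alignment` and its parameter names are hypothetical, not from the lecture. It assumes the convention the worked examples appear to use, where the first position of a gap costs the opening penalty and each further position of the same gap costs the extension penalty.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Score two already-aligned strings of equal length; '-' marks a gap.

    Assumed convention (matches the worked examples): the first column of a
    gap costs gap_open, every following column of the same gap costs gap_extend.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")

    score = 0
    prev_gap_row = None  # row that held a gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            # Extend only if the previous column had a gap in the same row.
            score += gap_extend if gap_row == prev_gap_row else gap_open
            prev_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            prev_gap_row = None
    return score


# The three alignments from the slide.
print(score_alignment("AGCGCATCGGA", "ATCGCTTTACA"))      # 1
print(score_alignment("AGCGC--TTCGGA", "ATCGCGGTTTGCA"))  # 2
print(score_alignment("AGCGC-TT-CGGA", "ATCGCGGTTTGCA"))  # -1
```

One two-position gap costs 3 here, while two separate one-position gaps cost 4, which is why the second alignment outscores the third.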
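
The BLOSUM62 sequence comparison above (total score 13) can be reproduced by looking up each aligned pair in the matrix excerpt and summing. A minimal sketch follows: the names `BLOSUM62_EXCERPT` and `score_ungapped` are hypothetical, and the table only covers the seven residues shown on the slide (A, R, N, D, C, Q, E), so it is not a full substitution matrix.

```python
# Lower triangle of the BLOSUM62 excerpt from the slide.
# Rows and columns are both in the order A, R, N, D, C, Q, E.
ORDER = "ARNDCQE"
LOWER_TRIANGLE = [
    [4],
    [-1, 5],
    [-2, 0, 6],
    [-2, -2, 1, 6],
    [0, -3, -3, -3, 9],
    [-1, 1, 0, 0, -3, 5],
    [-1, 0, 0, 2, -4, 2, 5],
]

# Expand into a symmetric lookup table keyed by (residue, residue).
BLOSUM62_EXCERPT = {}
for i, row in enumerate(LOWER_TRIANGLE):
    for j, value in enumerate(row):
        BLOSUM62_EXCERPT[(ORDER[i], ORDER[j])] = value
        BLOSUM62_EXCERPT[(ORDER[j], ORDER[i])] = value


def score_ungapped(seq1, seq2, matrix=BLOSUM62_EXCERPT):
    """Sum the substitution scores of an ungapped, equal-length alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


# The ten aligned pairs from the slide: A/A, D/Q, D/E, R/R, Q/Q, C/E, E/C, R/Q, A/A, D/Q.
print(score_ungapped("ADDRQCERAD", "AQERQECQAQ"))  # 13
```

Residues outside the excerpt raise a `KeyError`; a complete matrix would be needed to score arbitrary protein sequences.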
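
The window size and stringency parameters from the dot matrix slides can be illustrated with a small sketch. This is one common way to define them, assumed here rather than taken from the lecture: a dot is placed at position (i, j) when at least `stringency` of the `window` residues starting at i and j match along the diagonal. The function names are hypothetical, and this is not how dotter itself is implemented.

```python
def dot_matrix(seq_a, seq_b, window=1, stringency=1):
    """Return a grid of booleans; True marks a dot at (row i, column j).

    A dot is placed when at least `stringency` of the `window` residue pairs
    along the diagonal starting at (i, j) are identical.
    """
    if not 1 <= stringency <= window:
        raise ValueError("stringency must be between 1 and window")
    grid = []
    for i in range(len(seq_a) - window + 1):
        row = []
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row.append(matches >= stringency)
        grid.append(row)
    return grid


def render(grid):
    """Draw the grid as text, one line per row, '*' for a dot."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


# Two of the short sequences from the dot matrix slide.
print(render(dot_matrix("QQESGPVRSTC", "RQQEPVRSTC", window=1, stringency=1)))
print()
# A larger window with a stricter threshold filters out isolated chance matches.
print(render(dot_matrix("QQESGPVRSTC", "RQQEPVRSTC", window=3, stringency=3)))
```

With window 1 and stringency 1 every identical residue pair produces a dot, which is noisy; raising both leaves only runs of consecutive matches, visible as diagonals.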