Entrez: An Integrated Information Retrieval System for Biologically Linked Data, Study notes of Bioinformatics

University of California - Davis Bioinformatics

Entrez is an information retrieval system consisting of 31 linked databases, a text search engine, and a retrieval engine. It provides hard links between elements in different databases (such as CDS, protein, and structure, or gene, SNP, and disease) and neighboring relationships between elements of the same database. Entrez uses a relevance pairs model of retrieval based on word proximity and frequency in the database, and offers text searches with Boolean operators and truncation. Users can save searches and retrieve previous ones. VAST, the Vector Alignment Search Tool, is used to find structure neighbors.

Typology: Study notes

Pre 2010

Uploaded on 09/17/2009

koofers-user-0w6 🇺🇸


Partial preview of the text

What is Entrez?
• An integrated information retrieval system
• A system of 31 linked databases
• A text search engine
• A tool for finding biologically linked data
• A retrieval engine
• A virtual workspace for manipulating large datasets
• NOT a database!

Lecture 2: Information Retrieval. Entrez
Book Chapter 3 (pg 55-70)

GenBank-Entrez: Integrated Information Retrieval System

Hard links
Cross-references between elements in different databases.
• CDS -> Protein -> Structure
• Gene -> SNP -> Disease

Neighboring
Relationships between elements of the same database.
• Weighted key terms: relevance pairs model of retrieval, based on word proximity and frequency in the database.
• BLAST: DNA and protein sequence similarities.
• VAST: Vector Alignment Search Tool. Protein structure similarities.
[Figure: Entrez neighboring for text, sequence (BLAST), and structure (VAST)]

VAST: Structure Neighbors
Vector Alignment Search Tool
For each 3D domain, locate the SSEs (secondary structure elements) and represent them as individual vectors.
[Figure: Human IL-4 with SSEs numbered 1 to 6]
VAST uses 3D domains only! Whole polypeptides are assigned 3D domain 0 (zero).
http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml

Sequence alignment is the cornerstone of bioinformatics
• Sequence alignments search for matches between sequences
• Two broad classes of sequence alignments
  – Global: over the entire length. For highly similar sequences of approximately equal length (GeneTool).
  – Local: find the most similar regions (BLAST 2 sequences)
• Alignment can be performed between two or more sequences
  – Pairwise (Chp. 11)
  – Multiple Seq. Alignment (Chp. 12)
[Figure: global alignment of VQQESGLVRTTC and QKESGPSSSYC compared with the local alignment ESG / ESG]
Book Chapter 11 complete

Sequence similarities
The biological importance of sequence alignment
• Sequence alignments assess the degree of similarity between sequences
• Most common measurement: percent identity
• Similar sequences suggest:
  – Similar structure: proteins with similar sequences are likely to have similar structures
  – Similar function: proteins with similar structures are likely to have similar functions and play similar biochemical roles. Conserved amino acids likely have an important function
  – Common evolutionary history
• Fewer differences mean more recent divergence
• Similarity ≠ homology. Homology applies to 2 types of relationships:
  – Orthologous: descended directly from a common ancestor by speciation (similarity and colinearity).
  – Paralogous: separated by a gene duplication event

    LPGPSKQMTRIW
    |||| | ||| |
    LPGPCK-MTRRW

Dot matrix analysis
• A graphical method
• Shows all possible alignments
• Caveats
  – Some guesswork in picking parameters
    • Window size
    • Stringency
  – Not as rigorous or quantitative as other methods
[Figure: example dot plots; sequences shown: CTSRVPGSEQQ, CTSRVPEQQR, QQESGPVRSTC, RQQEPVRSTC]

Dot matrix analysis: a real example
[Figure: self-comparison dot plots, MITE2.txt vs MITE2.txt and WIS.txt vs WIS.txt; panel settings shown: window size 23, stringency 15; window size 1, stringency 1]

Detecting repeats with dotter
[Figure: dotter screenshot showing the Greyramp tool and the Alignment tool]

Devising a scoring system
• Problem with dot plots: no statistical measure of the quality of the alignment
• Measuring sequence similarity implies a metric: a quantitative statement of how similar the two sequences are

Match = 1, Mismatch = -1
    AGCGCATCGGA
    ATCGCTTTACA
Score = 6 - 5 = 1

Match = 1, Mismatch = -1, Open gap = -2, Extend gap = -1
    AGCGC--TTCGGA
    ATCGCGGTTTGCA
Score = 8 - 3 - 2 - 1 = 2

    AGCGC-TT-CGGA
    ATCGCGGTTTGCA
Score = 7 - 4 - 2 - 2 = -1

Affine gap penalties: a fixed deduction G is made for introducing the gap, and then an additional smaller deduction L is made that is proportional to the length (n) of the gap.
Gap penalty = G + (L * n)

Side chains:
• C: sulfhydryl (Cys)
• STPAG: small hydrophilic
• NDEQ: acid, acid amide, and hydrophilic
• HRK: basic (His, Arg, Lys)
• MILV: small hydrophobic
• FYW: aromatic (F = Phe, Y = Tyr, W = Trp)

PAM Scoring Matrices or Mutation Data Matrix (MDM)
[Figure: PAM scoring matrix; the aromatic residues F, Y, and W are labeled]

BLOSUM matrices
Created by Henikoff & Henikoff (1992) based on local multiple alignments of more distantly related sequences (based on 2000 conserved motifs called blocks, from 500 groups of proteins).
• Multiple alignments of short regions (without gaps) of related sequences were gathered.
• In each alignment, sequences at or above a threshold value of percent identity were clustered into groups and averaged.
• Calculations are made directly across different evolutionary distances, with no extrapolation.
• Substitution frequencies for all pairs of amino acids were calculated between the groups; these were used to create the log-odds BLOSUM (Block Substitution Matrix).
• Thus, BLOSUM62 means that sequences at least 62% identical were clustered together, so the sequences compared across groups in a block are no more than 62% identical.
• This allows detection of more distantly related sequences, as it downplays the role of the more closely related sequences in the block when building the matrix.
• BLOSUM matrices perform better than PAM matrices.

Score = log[q_ij / (p_i p_j)]
  q_ij = how often amino acids i and j align
  p_i = probability of amino acid i
  p_j = probability of amino acid j

BLOSUM62 (partial)

         A    R    N    D    C    Q    E
    A    4
    R   -1    5
    N   -2    0    6
    D   -2   -2    1    6
    C    0   -3   -3   -3    9
    Q   -1    1    0    0   -3    5
    E   -1    0    0    2   -4    2    5

An example of scoring
A sequence comparison with BLOSUM62

    Sequence 1:  A   D   D   R   Q   C   E   R   A   D
    Sequence 2:  A   Q   E   R   Q   E   C   Q   A   Q
    Score:       4   0   2   5   5  -4  -4   1   4   0

Total score: 13

Comparison PAM vs. BLOSUM

    Matrix      Best use                                          Similarity %
    PAM140      Short alignment, high similarity                  70-90
    PAM160      Members of a protein family                       50-60
    PAM250      Longer alignments, more divergent sequence        ≅30
    BLOSUM90    Short alignment, high similarity                  70-90
    BLOSUM80    Members of a protein family                       50-60
    BLOSUM62    Most effective in finding all potential similarities   30-40
    BLOSUM30    Longer alignments, more divergent sequence        <30
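
The worked scores in these notes (the match/mismatch and gap examples, and the BLOSUM62 comparison totalling 13) can be checked with a short script. The following is a minimal Python sketch, not part of the original lecture. The function names `score_alignment` and `matrix_score` and the dictionary `BLOSUM62_SUBSET` are hypothetical names chosen for illustration. One assumption is inferred from the worked examples rather than stated in the notes: the first position of a gap is charged the open penalty and each further position of the same gap is charged the extend penalty, which is how 8 - 3 - 2 - 1 = 2 is obtained for a gap of length 2. The script only scores alignments that are already given; it does not search for the best alignment.

```python
# Sketch: score pairwise alignments that are already given (no alignment search).

def score_alignment(top, bottom, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Sum the column scores of two equal-length aligned strings ('-' marks a gap).

    Assumption (inferred from the worked examples): the first position of a gap
    costs gap_open; each further position of the same gap costs gap_extend.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    prev_gap_row = None  # row that held a gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            score += gap_extend if gap_row == prev_gap_row else gap_open
            prev_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            prev_gap_row = None
    return score


# BLOSUM62 entries for A, R, N, D, C, Q, E only, copied from the partial
# matrix in the notes (lower triangle, keyed as (row, column)).
BLOSUM62_SUBSET = {
    ("A", "A"): 4,
    ("R", "A"): -1, ("R", "R"): 5,
    ("N", "A"): -2, ("N", "R"): 0, ("N", "N"): 6,
    ("D", "A"): -2, ("D", "R"): -2, ("D", "N"): 1, ("D", "D"): 6,
    ("C", "A"): 0, ("C", "R"): -3, ("C", "N"): -3, ("C", "D"): -3, ("C", "C"): 9,
    ("Q", "A"): -1, ("Q", "R"): 1, ("Q", "N"): 0, ("Q", "D"): 0,
    ("Q", "C"): -3, ("Q", "Q"): 5,
    ("E", "A"): -1, ("E", "R"): 0, ("E", "N"): 0, ("E", "D"): 2,
    ("E", "C"): -4, ("E", "Q"): 2, ("E", "E"): 5,
}


def matrix_score(seq1, seq2, matrix=BLOSUM62_SUBSET):
    """Score an ungapped alignment by summing substitution-matrix entries."""
    total = 0
    for a, b in zip(seq1, seq2):
        # The matrix is symmetric, so accept either key orientation.
        total += matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]
    return total


if __name__ == "__main__":
    print(score_alignment("AGCGCATCGGA", "ATCGCTTTACA"))      # 1
    print(score_alignment("AGCGC--TTCGGA", "ATCGCGGTTTGCA"))  # 2
    print(score_alignment("AGCGC-TT-CGGA", "ATCGCGGTTTGCA"))  # -1
    print(matrix_score("ADDRQCERAD", "AQERQECQAQ"))           # 13
```

Note that the formula Gap penalty = G + (L * n) charges L for every position of the gap, including the first. With G = 2 and L = 1 it would give 4 for the two-position gap, whereas the worked example deducts 2 + 1 = 3. The sketch follows the worked example.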
