Bioinformatics Databases: Organization, Classification, and Searching, Study notes of Computer Science

University of Maryland Computer Science

This lecture from CMSC 838T covers the basics of bioinformatics databases, including their definition, organization, and classification. It surveys several bioinformatic databases, such as GenBank, the Protein Data Bank, and PubMed, and describes their uses. It also covers database models, database identifiers, database formats, and the processing of data in bioinformatic databases.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

koofers-user-pq2 🇺🇸

CMSC 838T – Lecture 9

Bioinformatics databases
  - Organization & classification of bioinformatic data
  - Identification, format, & retrieval of bioinformatic data
Entrez search & retrieval of linked databases
Mapviewer search & retrieval by chromosome position

What Is a Database?
Computerized storehouse of data (records)
Allows
  - User-defined queries
  - Extraction of specified records
  - Adding, changing, removing, & merging records
Uses standardized formats

Database Models
A database model defines data organization (schema)
Relational
  - Entities and relationships stored in tables
  - Predefined schema
  - Examples: Oracle, DB2, MySQL, PostgreSQL
Object-oriented
  - Stores data as objects (i.e., structures with predefined type)
  - Examples: Versant, Jasmine, Objectivity
Semi-structured
  - Schema dynamically defined within data (self-describing)
  - Flexible description of data with complex relationships
  - Example: XML databases

Bioinformatic Databases
Outline
  - Issues
  - Databases
  - Identifiers & formats
  - Searching databases

Major Bioinformatic Databases
DNA sequences
  - GenBank, RefSeq, UniGene
Protein sequences
  - Swiss-Prot, PIR-PSD, GenPept, TrEMBL, NR, RefSeq
Protein structure
  - Protein Data Bank (PDB)
Gene expression
  - Gene Expression Omnibus (GEO)
Biomedical publications
  - PubMed / MEDLINE

Bioinformatic Data Sources
Primary databases
  - Original submissions by researchers
  - Staff organizes information only
  - Generally sequence oriented
  - Examples: GenBank, PDB
Derived databases
  - Compiled from data in primary databases
  - Manually curated (human selection & correction)
      Advantages – high quality
      Disadvantages – high expense, low volume
      Examples: Swiss-Prot, PIR-PSD, RefSeq
  - Computational derivation (automatically generated)
      Advantages – inexpensive, up-to-date
      Disadvantages – lower quality
      Examples: GenPept, TrEMBL, UniGene, COGs

Bioinformatic Databases – GenBank
Database type
  - Nucleotide sequences
  - Primary database
Data combined from additional sources
  - European Molecular Biology Laboratory (EMBL)
  - DNA DataBank of Japan (DDBJ)
Current size
  - Release 134, Feb 2003
  - 23,035,823 sequences
  - 29,358,082,791 nucleotides
Types of submissions to database
  - Genomic DNA
      High quality complete DNA sequence
  - mRNA / cDNA
      Partial or complete mRNA (or reverse-transcribed cDNA)
  - Expressed sequence tag (EST)
      Single-pass partial cDNA sequences from mRNA
  - Sequence tagged sites (STS)
      Short DNA sequences unique in genome
  - Genomic survey sequence (GSS)
      Single-pass genomic DNA
  - Third-party annotations of GenBank sequences

Bioinformatic Databases – Connections
[Diagram; labels: DNA sequences; GenBank; EMBL/EBI; automatically translated; GenPept; TrEMBL; Swiss-Prot; PIR-PSD; manual curation & annotation; protein sequences from labs; genome projects; Sequin & BankIt]

Bioinformatic Databases – Protein Data Bank
Database type
  - Protein 3D structures
  - Primary database
Statistics
  - March 2003
  - 20,473 proteins
[Figure: folds & new folds per year]

Bioinformatic Databases – Pfam
Database type
  - Protein families
      Multiple alignments of protein domains, conserved regions
  - Derived database (from Swiss-Prot & TrEMBL)
      Pfam-A – manually curated (hand-edited MSA)
      Pfam-B – computationally derived
        Non-overlapping families from PRODOM database
Statistics
  - Release 8.0, February 2003
  - 5193 families in Pfam-A
  - Protein sequence coverage
      73% at least one match in Pfam-A
      20% at least one match in Pfam-B

Bioinformatic Databases – RefSeq
Database type
  - Nucleotide & protein sequences
  - Derived database
      Human curated (non-redundant, cross-linked)
Data in RefSeq
  - Genomic DNA contigs
  - mRNAs & proteins for known genes, gene models
  - Entire chromosomes
  - Multiple organisms
Statistics
  - March 2003
  - 17,268 human loci, ~52,000 for all species

Bioinformatic Databases – UniGene
Database type
  - Nucleotide sequences
  - Computationally derived database
      Partitioned into non-redundant gene-oriented clusters
  - Gene-oriented view
Data in UniGene
  - Clusters of genomic DNA & ESTs
  - Multiple organisms
Statistics
  - March 2003
  - 111,064 human loci, ~500,000 for all species

Bioinformatic Databases – Relative Sizes
[Figure: database size (# sequences, logarithmic axis from 1 to 100,000,000) for GenBank, GenPept, TrEMBL, UniGene, PIR-PSD, Swiss-Prot, RefSeq, PDB, and Pfam, grouped into computationally derived and manually curated]

Bioinformatic Database Identifiers
Common identifiers for bioinformatic data
  - Locus name
  - Accession numbers
  - GenInfo ID
  - PubMed ID

Database Identifiers – Locus Names
Original identifiers of GenBank records
  - LOCUS line in GenBank entries
Originally
  - First 3 letters of organism followed by code for gene
Example
  - HUMBB for human β-globin region
Problems
  - Unmaintainable due to growth of data
  - Homologous genes not named the same

Database Identifiers – Accession Numbers
No biological meaning
Originally
  - Uppercase letter followed by five digits: U00002
Currently
  - Two uppercase letters followed by six digits: BC037153
  - May include version number for entry: BC037153.1
Stable way of identifying GenBank entries
Now being used for both DNA and proteins

Database Identifiers – GenInfo (gi) IDs
Identifier for a particular sequence only
  - Each entry gets a unique gi number
Example
  - GI:22477487
Not subject to versioning
  - Entry always remains the same
Different / new versions of the same sequence
  - Manage using accession numbers

Database Identifiers – PubMed IDs (PMID)
Identifies articles managed by NCBI
Reliable, stable link to citation
Example
  - PMID: 12205585

Database Format – ASN.1
International standard
  - Semi-structured format
  - Base format for NCBI data
Example

    Seq-entry ::= set {
      level 1 ,
      class nuc-prot ,
      descr {
        title "Mus musculus Brca1 mRNA, and translated products" ,
        source {
          org {
            taxname "Mus musculus" ,
            db {
              {
                db "taxon" ,
                tag id 10090 } } ,
            orgname {
              name binomial {
                genus "Mus" ,
                species "musculus" } ,
    …

Database Format – XML
eXtensible Markup Language
  - Open standard for semi-structured data, uses tags like HTML
  - Document split into content (XML), style (XSL), linking (XLL)
Example

    <?xml version="1.0"?>
    <!DOCTYPE GBSeq PUBLIC "-//NCBI//NCBI GBSeq/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_GBSeq.dtd">
    <GBSet>
    <GBSeq>
      <GBSeq_locus>MMU35641</GBSeq_locus>
      <GBSeq_length>5538</GBSeq_length>
      <GBSeq_strandedness value="not-set">0</GBSeq_strandedness>
      <GBSeq_moltype value="mrna">5</GBSeq_moltype>
      <GBSeq_topology value="linear">1</GBSeq_topology>
      <GBSeq_division>ROD</GBSeq_division>
      <GBSeq_update-date>18-OCT-1996</GBSeq_update-date>
      <GBSeq_create-date>25-OCT-1995</GBSeq_create-date>
      <GBSeq_definition>Mus musculus Brca1 mRNA, complete cds</GBSeq_definition>
      <GBSeq_primary-accession>U35641</GBSeq_primary-accession>
      <GBSeq_accession-version>U35641.1</GBSeq_accession-version>

Processing Data in Bioinformatic Databases
Format conversion
  - Tools frequently handle only one of the data formats
  - Use software to transform between formats
      ReadSeq, SeqIO
Perl (Practical Extraction and Report Language)
  - Portable C-like interpreted scripting language
  - Powerful pattern matching, string processing operations
  - Frequently used to extract / process bioinformatic data
BioPerl
  - Collection of Perl classes designed for bioinformatic tools
  - Sequence analysis, alignment, format conversion, I/O, automate bioinformatic analyses, parse results, create GUIs, manage persistent storage in RDBMS…

Bioinformatic Databases – Usage
[Figure: NCBI protein information usage survey]

Using Bioinformatic Databases
Primary use of bioinformatics
  - Finding similar sequences
  - BLAST!
      1) insert sequence
      2) click button!
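The accession number layouts described in the lecture (one uppercase letter plus five digits, or two uppercase letters plus six digits, optionally followed by a version suffix) can be checked with a short pattern match. The following is a minimal sketch in Python rather than the Perl the lecture mentions; the names `ACCESSION_RE`, `Accession`, and `parse_accession` are hypothetical, and the pattern assumes only the two layouts named in the slides, so other real-world accession formats would be rejected.

```python
import re
from typing import NamedTuple, Optional

# Assumption: only the two layouts named in the lecture are accepted
# (1 letter + 5 digits, or 2 letters + 6 digits), plus an optional ".version".
ACCESSION_RE = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.(\d+))?$")


class Accession(NamedTuple):
    accession: str          # stable identifier, e.g. "BC037153"
    version: Optional[int]  # version suffix if present, else None


def parse_accession(text: str) -> Optional[Accession]:
    """Split an accession string into its base and version, or return None."""
    match = ACCESSION_RE.match(text.strip())
    if match is None:
        return None
    version = int(match.group(2)) if match.group(2) else None
    return Accession(match.group(1), version)


if __name__ == "__main__":
    for candidate in ("U00002", "BC037153", "BC037153.1", "HUMBB"):
        print(candidate, "->", parse_accession(candidate))
```

A locus name such as HUMBB does not fit either layout, so the function returns `None` for it, which mirrors the lecture's distinction between locus names and accession numbers.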
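Because XML is self-describing, fields of a GBSeq record like the one in the lecture's XML example can be pulled out with a generic XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `summarize_records` is hypothetical, and the sample string is an abridged copy of the lecture's record with closing tags added (an assumption, since the slide shows only the opening part of the document) and the DOCTYPE line omitted.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the lecture's GBSeq example. The closing tags are an
# assumption added here so that the fragment is well-formed XML.
GBSEQ_XML = """<GBSet>
  <GBSeq>
    <GBSeq_locus>MMU35641</GBSeq_locus>
    <GBSeq_length>5538</GBSeq_length>
    <GBSeq_moltype value="mrna">5</GBSeq_moltype>
    <GBSeq_definition>Mus musculus Brca1 mRNA, complete cds</GBSeq_definition>
    <GBSeq_primary-accession>U35641</GBSeq_primary-accession>
    <GBSeq_accession-version>U35641.1</GBSeq_accession-version>
  </GBSeq>
</GBSet>"""


def summarize_records(xml_text):
    """Return one small dict per GBSeq element found under the root."""
    root = ET.fromstring(xml_text)
    records = []
    for seq in root.findall("GBSeq"):
        moltype = seq.find("GBSeq_moltype")
        records.append({
            "locus": seq.findtext("GBSeq_locus"),
            "length": int(seq.findtext("GBSeq_length")),
            # The molecule type is carried in the "value" attribute.
            "moltype": moltype.get("value") if moltype is not None else None,
            "accession_version": seq.findtext("GBSeq_accession-version"),
        })
    return records


if __name__ == "__main__":
    for record in summarize_records(GBSEQ_XML):
        print(record)
```

The element names come straight from the lecture's example; a full record would carry many more fields, which this sketch simply ignores.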
