CMSC 838T – Lecture 9

- Bioinformatics databases
  - Organization & classification of bioinformatic data
  - Identification, format, & retrieval of bioinformatic data
- Entrez search & retrieval of linked databases
- Mapviewer search & retrieval by chromosome position

What Is a Database?

- Computerized storehouse of data (records)
- Allows
  - User-defined queries
  - Extraction of specified records
  - Adding, changing, removing, & merging records
- Uses standardized formats

Database Models

- Define data organization (schema)
- Relational
  - Entities and relationships stored in tables
  - Predefined schema
  - Examples: Oracle, DB2, MySQL, PostgreSQL
- Object-oriented
  - Stores data as objects (i.e., structures with predefined type)
  - Examples: Versant, Jasmine, Objectivity
- Semi-structured
  - Schema dynamically defined within data (self-describing)
  - Flexible description of data with complex relationships
  - Example: XML databases

Bioinformatic Databases Outline

- Issues
- Databases
- Identifiers & formats
- Searching databases

Major Bioinformatic Databases

- DNA sequences
  - GenBank, RefSeq, UniGene
- Protein sequences
  - Swiss-Prot, PIR-PSD, GenPept, TrEMBL, NR, RefSeq
- Protein structure
  - Protein Data Bank (PDB)
- Gene expression
  - Gene Expression Omnibus (GEO)
- Biomedical publications
  - PubMed / MEDLINE

Bioinformatic Data Sources

- Primary databases
  - Original submissions by researchers
  - Staff organize information only
  - Generally sequence oriented
  - Examples: GenBank, PDB
- Derived databases
  - Compiled from data in primary databases
  - Manually curated (human selection & correction)
    - Advantages: high quality
    - Disadvantages: high expense, low volume
    - Examples: Swiss-Prot, PIR-PSD, RefSeq
  - Computationally derived (automatically generated)
    - Advantages: inexpensive, up-to-date
    - Disadvantages: lower quality
    - Examples: GenPept, TrEMBL, UniGene, COGs

Bioinformatic Databases – GenBank

- Database type
  - Nucleotide sequences
  - Primary database
- Data combined from additional sources
  - European Molecular Biology Laboratory (EMBL)
  - DNA Data Bank of Japan (DDBJ)
- Current size
  - Release 134, Feb 2003
  - 23,035,823 sequences
  - 29,358,082,791 nucleotides
- Types of submissions to database
  - Genomic DNA: high-quality complete DNA sequence
  - mRNA / cDNA: partial or complete mRNA (or reverse-transcribed cDNA)
  - Expressed sequence tag (EST): single-pass partial cDNA sequences from mRNA
  - Sequence tagged sites (STS): short DNA sequences unique in genome
  - Genomic survey sequence (GSS): single-pass genomic DNA
  - Third-party annotations of GenBank sequences

Bioinformatic Databases – Connections

[Diagram. Labels: DNA sequences (GenBank, EMBL/EBI); automatically translated (GenPept, TrEMBL); manual curation & annotation (Swiss-Prot, PIR-PSD); protein sequences; sources: labs, genome projects, Sequin & BankIt]

Bioinformatic Databases – Protein Data Bank

- Database type
  - Protein 3D structures
  - Primary database
- Statistics
  - March 2003
  - 20,473 proteins

[Chart: folds & new folds per year]

Bioinformatic Databases – Pfam

- Database type
  - Protein families
    - Multiple alignments of protein domains, conserved regions
  - Derived database (from Swiss-Prot & TrEMBL)
    - Pfam-A: manually curated (hand-edited MSA)
    - Pfam-B: computationally derived
      - Non-overlapping families from ProDom database
- Statistics
  - Release 8.0, February 2003
  - 5193 families in Pfam-A
  - Protein sequence coverage
    - 73% have at least one match in Pfam-A
    - 20% have at least one match in Pfam-B

Bioinformatic Databases – RefSeq

- Database type
  - Nucleotide & protein sequences
  - Derived database
    - Human curated (non-redundant, cross-linked)
- Data in RefSeq
  - Genomic DNA contigs
  - mRNAs & proteins for known genes, gene models
  - Entire chromosomes
  - Multiple organisms
- Statistics
  - March 2003
  - 17,268 human loci, ~52,000 for all species

Bioinformatic Databases – UniGene

- Database type
  - Nucleotide sequences
  - Computationally derived database
    - Partitioned into non-redundant gene-oriented clusters
  - Gene-oriented view
- Data in UniGene
  - Clusters of genomic DNA & ESTs
  - Multiple organisms
- Statistics
  - March 2003
  - 111,064 human loci, ~500,000 for all species

Bioinformatic Databases – Relative Sizes

[Bar chart, logarithmic axis from 1 to 100,000,000: DB size (# sequences) for GenBank, GenPept, TrEMBL, UniGene, PIR-PSD, Swiss-Prot, RefSeq, PDB, Pfam; series: computationally derived vs. manually curated]

Bioinformatic Database Identifiers

- Common identifiers for bioinformatic data
  - Locus name
  - Accession numbers
  - GenInfo ID
  - PubMed ID

Database Identifiers – Locus Names

- Original identifiers of GenBank records
  - LOCUS line in GenBank entries
- Originally
  - First 3 letters of organism followed by code for gene
- Example
  - HUMBB for human β-globin region
- Problems
  - Unmaintainable due to growth of data
  - Homologous genes not named the same

Database Identifiers – Accession Numbers

- No biological meaning
- Originally
  - Uppercase letter followed by five digits: U00002
- Currently
  - Two uppercase letters followed by six digits: BC037153
  - May include version number for entry: BC037153.1
- Stable way of identifying GenBank entries
- Now being used for both DNA and proteins

Database Identifiers – GenInfo (gi) IDs

- Identifier for a particular sequence only
  - Each entry gets a unique gi number
- Example
  - GI:22477487
- Not subject to versioning
  - Entry always remains the same
- Different / new versions of the same sequence
  - Manage using accession numbers

Database Identifiers – PubMed IDs (PMID)

- Identifies articles managed by NCBI
- Reliable, stable link to citation
- Example
  - PMID: 12205585

Database Format – ASN.1

- International standard
  - Semi-structured format
  - Base format for NCBI data
- Example

    Seq-entry ::= set {
      level 1 ,
      class nuc-prot ,
      descr {
        title "Mus musculus Brca1 mRNA, and translated products" ,
        source {
          org {
            taxname "Mus musculus" ,
            db {
              {
                db "taxon" ,
                tag id 10090 } } ,
            orgname {
              name binomial {
                genus "Mus" ,
                species "musculus" } ,
              …

Database Format – XML

- eXtensible Markup Language
  - Open standard for semi-structured data, uses tags like HTML
  - Document split into content (XML), style (XSL), linking (XLL)
- Example

    <?xml version="1.0"?>
    <!DOCTYPE GBSeq PUBLIC "-//NCBI//NCBI GBSeq/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_GBSeq.dtd">
    <GBSet>
      <GBSeq>
        <GBSeq_locus>MMU35641</GBSeq_locus>
        <GBSeq_length>5538</GBSeq_length>
        <GBSeq_strandedness value="not-set">0</GBSeq_strandedness>
        <GBSeq_moltype value="mrna">5</GBSeq_moltype>
        <GBSeq_topology value="linear">1</GBSeq_topology>
        <GBSeq_division>ROD</GBSeq_division>
        <GBSeq_update-date>18-OCT-1996</GBSeq_update-date>
        <GBSeq_create-date>25-OCT-1995</GBSeq_create-date>
        <GBSeq_definition>Mus musculus Brca1 mRNA, complete cds</GBSeq_definition>
        <GBSeq_primary-accession>U35641</GBSeq_primary-accession>
        <GBSeq_accession-version>U35641.1</GBSeq_accession-version>
        …

Processing Data in Bioinformatic Databases

- Format conversion
  - Tools frequently handle only one of the data formats
  - Use software to transform between formats
    - ReadSeq, SeqIO
- Perl (Practical Extraction and Report Language)
  - Portable C-like interpreted scripting language
  - Powerful pattern matching, string processing operations
  - Frequently used to extract / process bioinformatic data
- BioPerl
  - Collection of Perl classes designed for bioinformatic tools
  - Sequence analysis, alignment, format conversion, I/O, automating bioinformatic analyses, parsing results, creating GUIs, managing persistent storage in an RDBMS…

Bioinformatic Databases – Usage

- NCBI Protein information usage survey

Using Bioinformatic Databases

- Primary use of bioinformatics
  - Finding similar sequences
  - BLAST!
    1) Insert sequence
    2) Click button!
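
The identifier conventions from the "Database Identifiers" slides (accession numbers with an optional version suffix, gi numbers, PMIDs) lend themselves to the kind of pattern matching the lecture attributes to Perl. The following is a minimal Python sketch of that idea, not an NCBI tool: the `Identifier` type and `classify_identifier` function are hypothetical names introduced here, and the patterns cover only the forms shown on the slides. Real databases accept additional accession formats that this sketch would reject.

```python
import re
from typing import NamedTuple, Optional


class Identifier(NamedTuple):
    kind: str               # "accession", "gi", or "pmid"
    value: str              # identifier without prefix or version suffix
    version: Optional[int]  # set only for versioned accession numbers


# Assumption: only the forms described on the slides are recognised.
# One uppercase letter + 5 digits, or two uppercase letters + 6 digits,
# optionally followed by ".<version>".
_ACCESSION = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.(\d+))?$")
_GI = re.compile(r"^GI:\s*(\d+)$", re.IGNORECASE)
_PMID = re.compile(r"^PMID:\s*(\d+)$", re.IGNORECASE)


def classify_identifier(text: str) -> Optional[Identifier]:
    """Return the identifier kind and value, or None if no pattern matches."""
    text = text.strip()

    match = _ACCESSION.match(text)
    if match:
        version = int(match.group(2)) if match.group(2) else None
        return Identifier("accession", match.group(1), version)

    match = _GI.match(text)
    if match:
        return Identifier("gi", match.group(1), None)

    match = _PMID.match(text)
    if match:
        return Identifier("pmid", match.group(1), None)

    return None


if __name__ == "__main__":
    # Examples taken from the slides; HUMBB is a locus name, so it is not matched.
    for raw in ["U00002", "BC037153.1", "GI:22477487", "PMID: 12205585", "HUMBB"]:
        print(raw, "->", classify_identifier(raw))
```

Separating the version from the accession mirrors the slides' point that versions are managed through accession numbers, while a gi number identifies one particular sequence and carries no version.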