Bioinformatics Databases: Organization, Classification, and Searching, Study notes of Computer Science

This lecture from CMSC 838T covers the basics of bioinformatics databases, including their definition, organization, and classification. It also discusses several bioinformatic databases, such as GenBank, the Protein Data Bank, and PubMed, and their uses. The lecture touches on database models, identifiers, and formats, as well as processing data in bioinformatic databases.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

koofers-user-pq2
Partial preview of the text

CMSC 838T – Lecture 9

Bioinformatics databases
  - Organization & classification of bioinformatic data
  - Identification, format, & retrieval of bioinformatic data
Entrez search & retrieval of linked databases
Mapviewer search & retrieval by chromosome position

What Is a Database?

Computerized storehouse of data (records)
Allows
  - User-defined queries
  - Extraction of specified records
  - Adding, changing, removing, & merging records
Uses standardized formats

Database Models

Defines data organization (schema)
Relational
  - Entities and relationships stored in tables
  - Predefined schema
  - Examples: Oracle, DB2, MySQL, PostgreSQL
Object-oriented
  - Stores data as objects (i.e., structures with predefined type)
  - Examples: Versant, Jasmine, Objectivity
Semi-structured
  - Schema dynamically defined within data (self-describing)
  - Flexible description of data with complex relationships
  - Example: XML databases

Bioinformatic Databases Outline

  - Issues
  - Databases
  - Identifiers & formats
  - Searching databases

Major Bioinformatic Databases

DNA sequences
  - GenBank, RefSeq, UniGene
Protein sequences
  - Swiss-Prot, PIR-PSD, GenPept, TrEMBL, NR, RefSeq
Protein structure
  - Protein Data Bank (PDB)
Gene expression
  - Gene Expression Omnibus (GEO)
Biomedical publications
  - PubMed / MedLine

Bioinformatic Data Sources

Primary databases
  - Original submissions by researchers
  - Staff organizes information only
  - Generally sequence oriented
  - Examples: GenBank, PDB
Derived databases
  - Compiled from data in primary databases
  - Manually curated (human selection & correction)
      Advantages – high quality
      Disadvantages – high expense, low volume
      Examples: Swiss-Prot, PIR-PSD, RefSeq
  - Computational derivation (automatically generated)
      Advantages – inexpensive, up-to-date
      Disadvantages – lower quality
      Examples: GenPept, TrEMBL, UniGene, COGs

Bioinformatic Databases – GenBank

Database type
  - Nucleotide sequences
  - Primary database
Data combined from additional sources
  - European Molecular Biology Laboratory (EMBL)
  - DNA DataBank of Japan (DDBJ)
Current size
  - Release 134, Feb 2003
  - 23,035,823 sequences
  - 29,358,082,791 nucleotides
Types of submissions to database
  - Genomic DNA
      High quality complete DNA sequence
  - mRNA / cDNA
      Partial or complete mRNA (or reverse-transcribed cDNA)
  - Expressed sequence tag (EST)
      Single-pass partial cDNA sequences from mRNA
  - Sequence tagged sites (STS)
      Short DNA sequences unique in genome
  - Genomic survey sequence (GSS)
      Single-pass genomic DNA
  - Third-party annotations of GenBank sequences

Bioinformatic Databases – Connections

[Diagram. Labels: DNA sequences; GenBank; EMBL/EBI; Automatically translated; GenPept; TrEMBL; Swiss-Prot; PIR-PSD; Manual curation & annotation; Protein sequences from labs; Genome projects; Sequin & BankIt]

Bioinformatic Databases – Protein Data Bank

Database type
  - Protein 3D structures
  - Primary database
Statistics
  - March 2003
  - 20,473 proteins
[Figure: Folds & New Folds / Year]

Bioinformatic Databases – Pfam

Database type
  - Protein families
      Multiple alignments of protein domains, conserved regions
  - Derived database (from Swiss-Prot & TrEMBL)
      Pfam-A – manually curated (hand-edited MSA)
      Pfam-B – computationally derived
        Non-overlapping families from PRODOM database
Statistics
  - Release 8.0, February 2003
  - 5193 families in Pfam-A
  - Protein sequence coverage
      73% at least one match in Pfam-A
      20% at least one match in Pfam-B

Bioinformatic Databases – RefSeq

Database type
  - Nucleotide & protein sequences
  - Derived database
      Human curated (non-redundant, cross-linked)
Data in RefSeq
  - Genomic DNA contigs
  - mRNAs & proteins for known genes, gene models
  - Entire chromosomes
  - Multiple organisms
Statistics
  - March 2003
  - 17,268 human loci, ~52,000 for all species

Bioinformatic Databases – UniGene

Database type
  - Nucleotide sequences
  - Computationally derived database
      Partitioned into non-redundant gene-oriented clusters
  - Gene-oriented view
Data in UniGene
  - Clusters of genomic DNA & ESTs
  - Multiple organisms
Statistics
  - March 2003
  - 111,064 human loci, ~500,000 for all species

Bioinformatic Databases – Relative Sizes

[Figure: bar chart of DB size (# sequences) on a logarithmic axis from 1 to 100,000,000 for GenBank, GenPept, TrEMBL, UniGene, PIR-PSD, Swiss-Prot, RefSeq, PDB, and Pfam, grouped as Computationally Derived and Manually Curated]

Bioinformatic Database Identifiers

Common identifiers for bioinformatic data
  - Locus name
  - Accession numbers
  - GenInfo ID
  - PubMed ID

Database Identifiers – Locus Names

Original identifiers of GenBank records
  - LOCUS line in GenBank entries
Originally
  - First 3 letters of organism followed by code for gene
Example
  - HUMBB for human β-globin region
Problems
  - Unmaintainable due to growth of data
  - Homologous genes not named the same

Database Identifiers – Accession Numbers

No biological meaning
Originally
  - Uppercase letter followed by 5 digits: U00002
Currently
  - Two uppercase letters followed by six digits: BC037153
  - May include version number for entry: BC037153.1
Stable way of identifying GenBank entries
Now being used for both DNA and proteins

Database Identifiers – GenInfo (gi) IDs

Identifier for a particular sequence only
  - Each entry gets a unique gi number
Example
  - GI:22477487
Not subject to versioning
  - Entry always remains the same
Different / new versions of the same sequence
  - Manage using accession numbers

Database Identifiers – PubMed IDs (PMID)

Identifies articles managed by NCBI
Reliable, stable link to citation
Example
  - PMID: 12205585

Database Format – ASN.1

International standard
  - Semi-structured format
  - Base format for NCBI data
Example

    Seq-entry ::= set {
      level 1 ,
      class nuc-prot ,
      descr {
        title "Mus musculus Brca1 mRNA, and translated products" ,
        source {
          org {
            taxname "Mus musculus" ,
            db {
              {
                db "taxon" ,
                tag id 10090 } } ,
            orgname {
              name binomial {
                genus "Mus" ,
                species "musculus" } ,
    …

Database Format – XML

eXtensible Markup Language
  - Open standard for semi-structured data, uses tags like HTML
  - Document split into content (XML), style (XSL), linking (XLL)
Example

    <?xml version="1.0"?>
    <!DOCTYPE GBSeq PUBLIC "-//NCBI//NCBI GBSeq/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_GBSeq.dtd">
    <GBSet>
    <GBSeq>
      <GBSeq_locus>MMU35641</GBSeq_locus>
      <GBSeq_length>5538</GBSeq_length>
      <GBSeq_strandedness value="not-set">0</GBSeq_strandedness>
      <GBSeq_moltype value="mrna">5</GBSeq_moltype>
      <GBSeq_topology value="linear">1</GBSeq_topology>
      <GBSeq_division>ROD</GBSeq_division>
      <GBSeq_update-date>18-OCT-1996</GBSeq_update-date>
      <GBSeq_create-date>25-OCT-1995</GBSeq_create-date>
      <GBSeq_definition>Mus musculus Brca1 mRNA, complete cds</GBSeq_definition>
      <GBSeq_primary-accession>U35641</GBSeq_primary-accession>
      <GBSeq_accession-version>U35641.1</GBSeq_accession-version>

Processing Data in Bioinformatic Databases

Format conversion
  - Tools frequently handle only one of the data formats
  - Use software to transform between formats
      ReadSeq, SeqIO
Perl (Practical Extraction and Report Language)
  - Portable C-like interpreted scripting language
  - Powerful pattern matching, string processing operations
  - Frequently used to extract / process bioinformatic data
BioPerl
  - Collection of Perl classes designed for bioinformatic tools
  - Sequence analysis, alignment, format conversion, I/O, automate bioinformatic analyses, parse results, create GUIs, manage persistent storage in RDBMS…

Bioinformatic Databases – Usage

NCBI Protein information usage survey

Using Bioinformatic Databases

Primary use of bioinformatics
  - Finding similar sequences
  - BLAST!
      1) insert sequence
      2) click button!
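
The relational model described under Database Models (entities stored in tables with a predefined schema, queried and modified by the user) can be illustrated with Python's built-in sqlite3 module. This is only a minimal sketch: the table name, column names, and helper functions are assumptions made for illustration, not the schema of any real bioinformatic database. The first row reuses values from the GBSeq example in the lecture; the second row is a made-up placeholder.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a predefined (hypothetical) schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE sequence (
               accession  TEXT PRIMARY KEY,
               version    INTEGER NOT NULL,
               locus      TEXT NOT NULL,
               length     INTEGER NOT NULL,
               definition TEXT NOT NULL)"""
    )
    rows = [
        # Values taken from the GBSeq XML example in the lecture.
        ("U35641", 1, "MMU35641", 5538, "Mus musculus Brca1 mRNA, complete cds"),
        # Made-up placeholder record, present only so queries have a second row.
        ("ZZ000001", 1, "PLACEHOLDER1", 120, "made-up placeholder record"),
    ]
    conn.executemany("INSERT INTO sequence VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def longer_than(conn, min_length):
    """User-defined query: accessions of records longer than min_length."""
    cursor = conn.execute(
        "SELECT accession FROM sequence WHERE length > ? ORDER BY accession",
        (min_length,),
    )
    return [row[0] for row in cursor]


if __name__ == "__main__":
    db = build_demo_db()
    print(longer_than(db, 1000))
    # Removing a record, one of the operations a database allows.
    db.execute("DELETE FROM sequence WHERE accession = ?", ("ZZ000001",))
    print(longer_than(db, 0))
```

Because the schema is fixed up front, every row must supply all five columns; a semi-structured format such as XML would instead carry its field names inside each record.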
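
The accession number formats listed under Database Identifiers (one uppercase letter followed by five digits, or two uppercase letters followed by six digits, optionally with a version suffix) lend themselves to a regular expression check. The sketch below covers only those two patterns from the lecture; accession formats in use today are more varied, and `parse_accession` is a hypothetical helper name.

```python
import re

# Covers only the two formats named in the lecture:
#   1 uppercase letter + 5 digits, or 2 uppercase letters + 6 digits,
# optionally followed by ".<version>". Other real formats are not handled.
ACCESSION_RE = re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.(\d+))?$")


def parse_accession(text):
    """Return (accession, version) or None if text matches neither format.

    version is an int, or None when no version suffix is present.
    """
    match = ACCESSION_RE.match(text.strip())
    if match is None:
        return None
    accession, version = match.groups()
    return (accession, int(version) if version is not None else None)


if __name__ == "__main__":
    for candidate in ["U00002", "BC037153", "BC037153.1", "HUMBB"]:
        print(candidate, "->", parse_accession(candidate))
```

The locus name HUMBB is rejected because it follows the older organism-plus-gene naming scheme rather than either accession pattern.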
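
The lecture names Perl and BioPerl for extracting fields from database records; the same idea can be sketched in Python with the standard library's xml.etree.ElementTree, applied to the GBSeq XML example. The slide excerpt is truncated, so the fragment below is an assumption: it omits the XML declaration and DOCTYPE and adds closing tags so that it parses. `summarize_gbseq` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Fragment adapted from the lecture's GBSeq example. The closing tags are
# added here (the slide excerpt is cut off) and the DOCTYPE is omitted.
GBSEQ_XML = """<GBSet>
  <GBSeq>
    <GBSeq_locus>MMU35641</GBSeq_locus>
    <GBSeq_length>5538</GBSeq_length>
    <GBSeq_moltype value="mrna">5</GBSeq_moltype>
    <GBSeq_definition>Mus musculus Brca1 mRNA, complete cds</GBSeq_definition>
    <GBSeq_primary-accession>U35641</GBSeq_primary-accession>
    <GBSeq_accession-version>U35641.1</GBSeq_accession-version>
  </GBSeq>
</GBSet>"""


def summarize_gbseq(xml_text):
    """Extract a few fields from every GBSeq record in a GBSet document."""
    root = ET.fromstring(xml_text)
    records = []
    for seq in root.iter("GBSeq"):
        records.append({
            "locus": seq.findtext("GBSeq_locus"),
            "length": int(seq.findtext("GBSeq_length")),
            # The molecule type is read from the "value" attribute.
            "moltype": seq.find("GBSeq_moltype").get("value"),
            "accession_version": seq.findtext("GBSeq_accession-version"),
            "definition": seq.findtext("GBSeq_definition"),
        })
    return records


if __name__ == "__main__":
    for record in summarize_gbseq(GBSEQ_XML):
        print(record)
```

Because the tags describe their own content, the extraction code needs no separate schema, which is the self-describing property the lecture attributes to semi-structured formats.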