Principles of data organization

Database
• database: a collection of related structured information about entities
• file: a collection of records
• record: a set of fields
• field: a single characteristic of an entity
• character: a symbol used in a data field

Example of a GenBank entry

LOCUS       VIBHALUXA    3141 bp    DNA             BCT       15-FEB-1996
DEFINITION  V.harveyi luciferase alpha and beta subunit (luxA and luxB)
            genes, complete cds.
ACCESSION   M10961 M13494
NID         g155174
KEYWORDS    luciferase.
SOURCE      Vibrio harveyi DNA.
  ORGANISM  Vibrio harveyi
            Eubacteria; Proteobacteria; gamma subdivision; Vibrionaceae;
            Vibrio.
REFERENCE   1  (bases 1 to 1838)
  AUTHORS   Cohn,D.H., Mileham,A.J., Simon,M.I., Nealson,K.H., Rausch,S.K.,
            Bonam,D. and Baldwin,T.O.
  TITLE     Nucleotide sequence of the luxA gene of Vibrio harveyi and the
            complete amino acid sequence of the alpha subunit of bacterial
            luciferase
  JOURNAL   J. Biol. Chem. 260 (10), 6139-6146 (1985)
  MEDLINE   85207595
REFERENCE   2  (bases 1745 to 3141)
  AUTHORS   Johnston,T.C., Thompson,R.B. and Baldwin,T.O.
  TITLE     Nucleotide sequence of the luxB gene of Vibrio harveyi and the
            complete amino acid sequence of the beta subunit of bacterial
            luciferase
  JOURNAL   J. Biol. Chem. 261 (11), 4805-4811 (1986)
  MEDLINE   86168191
FEATURES             Location/Qualifiers
     gene            707..1774
                     /gene="luxA"
     CDS             707..1774
                     /gene="luxA"
                     /codon_start=1
                     /product="luciferase alpha subunit"
                     /db_xref="PID:g155175"
                     /transl_table=11
                     /translation="MKFGNFLLTYQPPELSQTEVMKRLVNLGKASEGCGFDTVWLLEH
                     HFTEFGLLGNPYVAAAHLLGATETLNVGTAAIVLPTAHPVRQAEDVNLLDQMSKGRFR
                     FGICRGLYDKDFRVFGTDMDNSRALMDCWYDLMKEGFNEGYIAADNEHIKFPKIQLNP
                     SAYTQGGAPVYVVAESASTTEWAAERGLPMILSWIINTHEKKAQLDLYNEVATEHGYD
                     VTKIDHCLSYITSVDHDSNRAKDICRNFLGHWYDSYVNATKIFDDSDQTKGYDFNKGQ
                     WRDFVLKGHKDTNRRIDYSYEINPVGTPEECIAIIQQDIDATGIDNICCGFEANGSEE
                     EIIASMKLFQSDVMPYLKEKQ"
BASE COUNT      883 a    665 c    741 g    852 t
ORIGIN      1 bp upstream of EcoRI site.
        1 gaattcacca tgacgacggg caaaaatagt ttgtgcactg tttatcactg gctgcagacc
       61 aagggcacac aaaacattgg cttgattgcg gcaagtctct cagctcgtgt cgcctatgaa
      121 gttatctctg atctggagct gtcttttctg attactgcgg ttggtgtggt gaacttgcgt
      181 gacacactag aaaaagcgct tggttttgat tacctcagtt tgcctatcga tgagctacca
....

Database Organization

Four major components of a Database Management System (DBMS):
• Data
• Hardware
• Software
• Users

Data Model
• A named logical unit (record type, data item)
• Relationships among logical units

Relationships among logical units
• one to one
• one to many
• many to one

DNA vs. Protein searches

[Diagram labels: DNA sequence, Protein sequence, DNA database, Protein database]

DNA sequence - DNA database
• larger databases
• more random hits
• simpler scoring functions
• missing hits (similar proteins encoded by different DNAs)

Database administration
• Redundancy eliminated
• Inconsistency avoided
• Data shared
• Standards enforced
• Security applied
• Integrity maintained
• Requirements balanced

Data Warehouse
• Operational data
• Data fusion
• Data cleansing
• Metadata

Data Mining
• Data mining is the exploration and analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns and rules
• Common data mining tasks
  – Classification
  – Estimation
  – Prediction
  – Affinity Grouping
  – Clustering
  – Description
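
Worked example: fields, records and a one-to-many relationship

Returning to the GenBank entry and the field / record hierarchy defined at the start of these notes, the following is a minimal Python sketch of how the header of such an entry could be read into fields, with each REFERENCE block stored as its own record (one entry to many references). The names `Entry`, `Reference`, `parse_header`, `SAMPLE` and `BASE_COUNT` are hypothetical and introduced only for illustration. The sketch assumes the usual flat-file layout (keyword at the start of the line, continuation lines indented 12 spaces) and handles only the few keywords shown; it is not a full GenBank parser.

```python
from dataclasses import dataclass, field

# Header lines copied (abridged) from the GenBank example in these notes.
SAMPLE = """\
LOCUS       VIBHALUXA    3141 bp    DNA    BCT    15-FEB-1996
DEFINITION  V.harveyi luciferase alpha and beta subunit (luxA and luxB)
            genes, complete cds.
ACCESSION   M10961 M13494
REFERENCE   1  (bases 1 to 1838)
  JOURNAL   J. Biol. Chem. 260 (10), 6139-6146 (1985)
REFERENCE   2  (bases 1745 to 3141)
  JOURNAL   J. Biol. Chem. 261 (11), 4805-4811 (1986)
"""

# Values from the BASE COUNT line of the same entry.
BASE_COUNT = {"a": 883, "c": 665, "g": 741, "t": 852}

CONTINUATION_INDENT = 12  # assumed width of the keyword column


@dataclass
class Reference:
    """One record on the 'many' side of the relationship."""
    number: int
    span: str
    journal: str = ""


@dataclass
class Entry:
    """One record: a set of fields describing a single sequence entry."""
    locus: str = ""
    length_bp: int = 0
    definition: str = ""
    accessions: list = field(default_factory=list)
    references: list = field(default_factory=list)  # one entry -> many references


def parse_header(text):
    # Pass 1: join continuation lines so each keyword has one value string.
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        indent = len(line) - len(line.lstrip(" "))
        if indent >= CONTINUATION_INDENT and pairs:
            key, value = pairs[-1]
            pairs[-1] = (key, value + " " + line.strip())
        else:
            key, _, value = line.strip().partition(" ")
            pairs.append((key, value.strip()))

    # Pass 2: copy each keyword's value into the matching field.
    entry = Entry()
    for key, value in pairs:
        if key == "LOCUS":
            tokens = value.split()
            entry.locus = tokens[0]
            entry.length_bp = int(tokens[1])
        elif key == "DEFINITION":
            entry.definition = value
        elif key == "ACCESSION":
            entry.accessions = value.split()
        elif key == "REFERENCE":
            number, _, span = value.partition(" ")
            entry.references.append(Reference(int(number), span.strip()))
        elif key == "JOURNAL" and entry.references:
            # Sub-keywords belong to the most recent REFERENCE record.
            entry.references[-1].journal = value
    return entry


entry = parse_header(SAMPLE)
print(entry.locus, entry.length_bp)
for ref in entry.references:
    print(ref.number, ref.span, ref.journal)

# Simple integrity check: the base counts should add up to the LOCUS length.
print(sum(BASE_COUNT.values()) == entry.length_bp)
```

The final check is a small instance of "Integrity maintained" from the database administration list: the four base counts of this entry (883 + 665 + 741 + 852) add up to the 3141 bp stated on the LOCUS line.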