Sequence Analysis and Databases: A Look into Molecular Biology and Bioinformatics, Study notes of Biology

An overview of sequence analysis in molecular biology and bioinformatics. It discusses the exponential growth of molecular sequence databases and the importance of primary sequences. The document also covers various sequence database installations, formats, and specialized databases.

Steve Thompson

Special Topics BSC4933/5936: An Introduction to Bioinformatics.
Florida State University, The Department of Biological Science
www.bio.fsu.edu
Sept. 2, 2003

BioInformatics Databases, the “T” in van Engelen’s talk . . . just a glimpse.

Steven M. Thompson
Florida State University School of Computational Science and Information Technology (CSIT)

But first some comments

If van Engelen’s talk scared you, don’t worry. His lecture is designed to provide an overview of the “way” that computers “think.” Don’t get hung up on the details.

If you were bored, just wait. Lots and lots more ‘exciting’ algorithms are coming up. Something is guaranteed to challenge you!

To begin: some terminology

Just what the heck is an algorithm!?

Merriam-Webster’s says: “A rule of procedure for solving a problem [often mathematical] that frequently involves repetition of an operation.”

So, you could write an algorithm for tying your shoe! It’s just a set of explicit instructions for doing some routine task.

And what about bioinformatics, genomics, proteomics, sequence analysis, computational molecular biology . . . ?

Some neat stuff from the papers

We, Homo sapiens, aren’t nearly as special as we had hoped we were. Of the 3.2 billion base pairs in our DNA:

Traditional textbook estimates of the number of genes were often in the 100,000 range; it turns out we’ve only got about twice as many as a fruit fly, between 25,000 and 35,000!

The protein coding region of the genome is only about 1% or so; a bunch of the remainder is ‘jumping’ ‘selfish DNA’, of which much may be involved in regulation and control.

Some 100-200 genes were transferred from an ancestral bacterial genome to an ancestral vertebrate genome!
(More extensive analyses later showed this to be untrue; the pattern is due to gene loss rather than transfer.)

What are sequence databases?

These databases are an organized way to store the tremendous amount of sequence information accumulating worldwide. Most have their own specific format. An ‘alphabet soup’ of three major database organizations around the world is responsible for maintaining most of this data. They largely ‘mirror’ one another and share accession codes, but NOT proper identifier names:

North America: the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), at the National Institutes of Health (NIH), has GenBank & GenPept. Also Georgetown University’s National Biomedical Research Foundation (NBRF) Protein Identification Resource (PIR) & NRL_3D (Naval Research Lab sequences of known three-dimensional structure).

Europe: the European Molecular Biology Laboratory (EMBL), the European Bioinformatics Institute (EBI), and the Swiss Institute of Bioinformatics’ (SIB) Expert Protein Analysis System (ExPASy) all help maintain the EMBL Nucleotide Sequence Database, and the SWISS-PROT & TrEMBL amino acid sequence databases.

Asia: the National Institute of Genetics (NIG) supports the Center for Information Biology’s (CIB) DNA Data Bank of Japan (DDBJ).

A little history

The first well recognized sequence database was Dr. Margaret Dayhoff’s hardbound Atlas of Protein Sequence and Structure, begun in the mid-sixties. DDBJ began in 1984, GenBank in 1982, and EMBL in 1980. They are all attempts at establishing an organized, reliable, comprehensive and openly available library of genetic sequences.

Databases have long since outgrown a hardbound atlas. They have become huge and have evolved through many changes, with many more yet to come.

Developments that affect software and the end user

Changes in format over the years are a major source of grief for software designers and program users.
Each program needs to be able to recognize particular aspects of the sequence files; whenever they change it throws a wrench in the works. NCBI’s ASN.1 format and its Entrez interface attempt to circumvent some of these frustrations. However, database format is much debated, as many bioinformaticians argue for relational or object-oriented standards. Unfortunately, until all biologists and computer scientists worldwide agree on one standard and all software is (re)written to that standard, neither of which is likely to happen very quickly, format issues will probably remain the most confusing and troubling aspect of working with primary sequence data.

Just what are primary sequences?

(Central Dogma: DNA -> RNA -> protein)

Primary refers to one dimension: all of the ‘symbol’ information written in sequential order necessary to specify a particular biological molecular entity, be it polypeptide or nucleotide (van Engelen’s “P”).

The symbols are the one letter codes for all of the biological nitrogenous bases and amino acid residues and their ambiguity codes (van Engelen’s “Alphabet”). Biological carbohydrates, lipids, and structural and functional information are not sequence data. Not even DNA translations in a DNA database!

However, much of this feature and bibliographic type information is available in the reference documentation sections associated with primary sequences in the databases.

So what are these databases like?

Sequence database installations are commonly a complex ASCII/binary mix, usually not relational or object-oriented (but proprietary ones often are). They’ll contain several very long text files, each containing different types of information all related to particular sequences, such as all of the sequences themselves, versus all of the title lines, or all of the reference sections. Binary files often help ‘glue together’ all of these other files by providing indexing functions.
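The indexing idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any real database installation or the GCG package stores its indices: the flat file is assumed to use ">" title lines, the second entry (DEMO2) is made up, and the function names build_offset_index and fetch_record are hypothetical. The index simply maps each entry name to the byte offset of its title line, so an entry can be pulled out of a very long text file without scanning it from the top.

```python
import io


def build_offset_index(handle):
    """Map each entry name to the byte offset where its title line starts."""
    index = {}
    while True:
        offset = handle.tell()          # position before reading the next line
        line = handle.readline()
        if not line:                    # end of file
            break
        if line.startswith(b">"):
            name = line[1:].split()[0].decode("ascii")
            index[name] = offset
    return index


def fetch_record(handle, index, name):
    """Jump straight to one entry and return (title line, sequence)."""
    handle.seek(index[name])
    title = handle.readline().decode("ascii").rstrip()
    residues = []
    for line in handle:
        if line.startswith(b">"):       # next entry begins, stop reading
            break
        residues.append(line.decode("ascii").strip())
    return title, "".join(residues)


# A tiny in-memory stand-in for a long flat text file (assumed layout).
flat_file = io.BytesIO(
    b">EFHU1 first 40 residues only\n"
    b"MGKEKTHINIVVIGHVDSGK\n"
    b"STTTGHLIYKCGGIDKRTIE\n"
    b">DEMO2 hypothetical second entry\n"
    b"ACDEFGHIKL\n"
)

index = build_offset_index(flat_file)
header, sequence = fetch_record(flat_file, index, "DEMO2")
print(header)    # >DEMO2 hypothetical second entry
print(sequence)  # ACDEFGHIKL
```

A real installation would keep such an index in its own (often binary) file so that it does not have to be rebuilt on every lookup; the dictionary here stands in for that file.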
Software is usually required to successfully interact with these databases, and access is most easily handled through various software packages and interfaces, either on the World Wide Web or otherwise, although systems level commands can be used if one understands the data's structure well enough.

Content & Organization

Nucleic Acid DB’s
GenBank/EMBL/DDBJ: all Taxonomic categories + HTC’s, HTG’s, & STS’s
“Tags”: EST’s, GSS’s

Amino Acid DB’s
SWISS-PROT
TrEMBL
PIR: PIR1, PIR2, PIR3, PIR4
NRL_3D
GenPept

More organization stuff

Nucleic acid sequence databases (and TrEMBL) are split into subdivisions based on taxonomy (historical rankings; the Fungi warning!). PIR is split into subdivisions based on level of annotation. TrEMBL sequences are merged into SWISS-PROT as they receive increased levels of annotation.

Pearson FastA format

>EFHU1 PIR1 release 71.01
MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEKEAAEMG
KGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIK
NMITGTSQADCAVLIVAAGVGEFEAGISKNGQTREHALLAYTLGVKQLIV
GVNKMDSTEPPYSQKRYEEIVKEVSTYIKKIGYNPDTVAFVPISGWNGDN
MLEPSANMPWFKGWKVTRKDGNASGTTLLEALDCILPPTRPTDKPLRLPL
QDVYKIGGIGTVPVGRVETGVLKPGMVVTFAPVNVTTEVKSVEMHHEALS
EALPGDNVGFNVKNVSVKDVRRGNVAGDSKNDPPMEAAGFTAQVIILNHP
GQISAGYAPVLDCHTAHIACKFAELKEKIDRRSGKKLEDGPKFLKSGDAA
IVDMVPGKPMCVESFSDYPPLGRFAVRDMRQTVAVGVIKAVDKKAAGAGK
VTKSAQKAQKAK

Only one line of documentation allowed!

GCG single sequence format

!!AA_SEQUENCE 1.0
P1;EFHU1 - translation elongation factor eEF-1 alpha-1 chain - human
N;Alternate names: translation elongation factor Tu……
F;1-223/Domain: eEF-1 alpha domain I, GTP-binding #status predicted <EF1>
F;8-156/Domain: translation elongation factor Tu homology <ETU>
F;14-21/Region: nucleotide-binding motif A (P-loop)
F;153-156/Region: GTP-binding NKXD motif
F;245-330/Domain: eEF-1 alpha domain II, tRNA-binding #status predicted <EF2>
F;332-462/Domain: eEF-1 alpha domain III, tRNA-binding #status predicted <EF3>
F;36,55,79,165,318/Modified site: N6,N6,N6-trimethyllysine (Lys) #status predicted
F;301,374/Binding site: glycerylphosphorylethanolamine (Glu) (covalent) #status predicted
EFHU1 Length: 462 January 14, 2002 19:49 Type: P Check: 5308 ..
1 MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKE……
351 GQISAGYAPV LDCHTAHIAC KFAELKEKID RRSGKKLEDG PKFLKSGDAA
401 IVDMVPGKPM CVESFSDYPP LGRFAVRDMR QTVAVGVIKA VDKKAAGAGK
451 VTKSAQKAQK AK

GCG MSF & RSF format

!!RICH_SEQUENCE 1.0
..
{
name ef1a_giala
descrip PileUp of: @/users1/thompson/.seqlab-mendel/pileup_28.list
type PROTEIN
longname /users1/thompson/seqlab/EF1A_primitive.orig.msf{ef1a_giala}
sequence-ID Q08046
checksum 7342
offset 23
creation-date 07/11/2001 16:51:19
strand 1
comments …………….

This is SeqLab’s native format.

!!AA_MULTIPLE_ALIGNMENT 1.0
small.pfs.msf MSF: 735 Type: P July 20, 2001 14:53 Check: 6619 ..
Name: a49171 Len: 425 Check: 537 Weight: 1.00
Name: e70827 Len: 577 Check: 21 Weight: 1.00
Name: g83052 Len: 718 Check: 9535 Weight: 1.00
Name: f70556 Len: 534 Check: 3494 Weight: 1.00
Name: t17237 Len: 229 Check: 9552 Weight: 1.00
Name: s65758 Len: 735 Check: 111 Weight: 1.00
Name: a46241 Len: 274 Check: 3514 Weight: 1.00
//
……………

Specialized ‘sequence’-type DB’s

Databases that contain special types of sequence information, such as patterns, motifs, and profiles. These include: REBASE, EPD, PROSITE, BLOCKS, ProDom, Pfam . . . .

Databases that contain multiple sequence entries aligned, e.g. RDP and ALN.

Databases that contain families of sequences ordered functionally, structurally, or phylogenetically, e.g. iProClass and HOVERGEN.

Databases of species specific sequences, e.g. the HIV Database and the Giardia lamblia Genome Project.

And on and on . . . . See Amos Bairoch’s excellent links page, http://us.expasy.org/alinks.html, and the wonderful Human Genome Ensembl Project at http://www.ensembl.org/ that tries to tie it all together.

What about other types of biological databases?
Three-dimensional structure databases: the Protein Data Bank and the Rutgers Nucleic Acid Database.

These databases contain all of the 3D atomic coordinate data necessary to define the tertiary shape of a particular biological molecule. The data is usually experimentally derived, either by X-ray crystallography or with NMR, but sometimes it is a hypothetical model. In all cases the source of the structure and its resolution are clearly indicated.

Secondary structure boundaries, sequence data, and reference information are often associated with the coordinate data, but it is the 3D data that really matters, not the annotation.

Molecular visualization or modeling software is required to interact with the data. It has little meaning on its own. See Molecules R Us at http://molbio.info.nih.gov/cgi-bin/pdb/ .

Other types of Biological DB’s

Still more; these can be considered ‘non-molecular’:

Genomic linkage mapping databases for most large genome projects (w/ pointers to sequences): H. sapiens, Mus, Drosophila, C. elegans, Saccharomyces, Arabidopsis, E. coli, . . . .

Reference Databases (also w/ pointers to sequences), e.g. OMIM (Online Mendelian Inheritance in Man) and PubMed/MedLine (over 11 million citations from more than 4 thousand bio/medical scientific journals).

Phylogenetic Tree Databases: e.g. the Tree of Life.

Metabolic Pathway Databases: e.g. WIT (What Is There) and Japan’s GenomeNet KEGG (the Kyoto Encyclopedia of Genes and Genomes).

Population studies data: which strains, where, etc.

And then databases that many biocomputing people don’t even usually consider: e.g. GIS/GPS/remote sensing data, medical records, census counts, mortality and birth rates . . . .

So how do you access and manipulate all this data?
Often on the InterNet over the World Wide Web:

Site | URL (Uniform Resource Locator) | Content
Nat’l Center Biotech' Info' | http://www.ncbi.nlm.nih.gov/ | databases/analysis/software
PIR/NBRF | http://www-nbrf.georgetown.edu/ | protein sequence database
IUBIO Biology Archive | http://iubio.bio.indiana.edu/ | database/software archive
Univ. of Montreal | http://megasun.bch.umontreal.ca/ | database/software archive
Japan's GenomeNet | http://www.genome.ad.jp/ | databases/analysis/software
European Mol' Bio' Lab' | http://www.embl-heidelberg.de/ | databases/analysis/software
European Bioinformatics | http://www.ebi.ac.uk/ | databases/analysis/software
The Sanger Institute | http://www.sanger.ac.uk/ | databases/analysis/software
Univ. of Geneva BioWeb | http://www.expasy.ch/ | databases/analysis/software
ProteinDataBank | http://www.rcsb.org/pdb/ | 3D mol' structure database
Molecules R Us | http://molbio.info.nih.gov/cgi-bin/pdb/ | 3D protein/nuc' visualization
The Genome DataBase | http://www.gdb.org/ | The Human Genome Project
Stanford Genomics | http://genome-www.stanford.edu/ | various genome projects
Inst. for Genomic Res’rch | http://www.tigr.org/ | esp. microbial genome projects
HIV Sequence Database | http://hiv-web.lanl.gov/ | HIV epidemiology seq' DB
The Tree of Life | http://tolweb.org/tree/phylogeny.html | overview of all phylogeny
Ribosomal Database Proj’ | http://rdp.cme.msu.edu/html/ | databases/analysis/software
WIT Metabolism | http://wit.mcs.anl.gov/WIT2/ | metabolic reconstruction
Harvard Bio' Laboratories | http://golgi.harvard.edu/ | nice bioinformatics links list

With tools like NCBI’s Entrez & EMBL’s SRS . . .

To answer the always perplexing GCG question: “What sequence(s)? . . . .”

The sequence is in a local GCG format single sequence file in your UNIX account. (GCG Reformat and all From & To programs)

The sequence is in a local GCG database, in which case you ‘point’ to it by using any of the GCG database logical names.
A colon, “:”, always sets the logical name apart from an accession number, a proper identifier name, or a wildcard expression, and they are case insensitive.

The sequence is in a GCG format multiple sequence file, either an MSF (multiple sequence format) file or an RSF (rich sequence format) file. To specify sequences contained in a GCG multiple sequence file, supply the file name followed by a pair of braces, “{}”, containing the sequence specification, e.g. a wildcard: {*}.

Finally, the most powerful method of specifying sequences is in a GCG “list” file. It is merely a list of other sequence specifications and can even contain other list files within it. The convention to use a GCG list file in a program is to precede it with an at sign, “@”. Furthermore, one can supply attribute information within list files to specify something special about the sequence.

(Specifying sequences, GCG style; the methods above are listed in order of increasing power and complexity.)

Logical terms for the Wisconsin Package

Sequence databases, nucleic acids:

GENBANKPLUS, GBP | all of GenBank plus EST and GSS subdivisions
GENBANK, GB | all of GenBank except EST and GSS subdivisions
BACTERIAL, BA | GenBank bacterial subdivision
EST | GenBank EST (Expressed Sequence Tags) subdivision
GSS | GenBank GSS (Genome Survey Sequences) subdivision
HTC | GenBank High Throughput cDNA
HTG | GenBank High Throughput Genomic
INVERTEBRATE, IN | GenBank invertebrate subdivision
OTHERMAMM, OM | GenBank other mammalian subdivision
OTHERVERT, OV | GenBank other vertebrate subdivision
PATENT, PAT | GenBank patent subdivision
PHAGE, PH | GenBank phage subdivision
PLANT, PL | GenBank plant subdivision
PRIMATE, PR | GenBank primate subdivision
RODENT, RO | GenBank rodent subdivision
STS | GenBank (sequence tagged sites) subdivision
SYNTHETIC, SY | GenBank synthetic subdivision
TAGS | GenBank EST and GSS subdivisions
UNANNOTATED, UN | GenBank unannotated subdivision
VIRAL, VI | GenBank viral subdivision

Sequence databases, amino acids:

GENPEPT, GP | GenBank CDS translations
SWISSPROTPLUS, SWP | all of Swiss-Prot and all of SPTrEMBL
SWISSPROT, SW | all of Swiss-Prot (fully annotated)
SPTREMBL, SPT | Swiss-Prot preliminary EMBL translations
PIR, P | all of PIR Protein
PROTEIN, PIR1 | PIR fully annotated subdivision
PIR2 | PIR preliminary subdivision
PIR3 | PIR unverified subdivision
PIR4 | PIR unencoded subdivision
NRL_3D, NRL | PDB 3D protein sequences

General data files:

GENMOREDATA | path to GCG optional data files
GENRUNDATA | path to GCG default data files

These are easy: they make sense and you’ll have a vested interest.

The List File Format

An example GCG list file of many elongation 1a and Tu factors follows. As with all GCG data files, two periods separate documentation from data.
..
my-special.pep begin:24 end:134
SwissProt:EfTu_Ecoli
Ef1a-Tu.msf{*}
/usr/accounts/test/another.rsf{ef1a_*}
@another.list

The ‘way’ SeqLab works!

To learn more

See the listed references and WWW sites. Many fine texts are also starting to become available in the field.

FOR EVEN MORE INFO... http://bio.fsu.edu/~stevet/workshop.html

Contact me (stevet@bio.fsu.edu) for specific bioinformatics assistance and/or collaboration.

Conclusions

There’s a bewildering assortment of different databases and ways to access and manipulate the information within them. The key is to learn how to use that information in the most efficient manner.

Next, a special treat: a colleague of mine, Misha Taylor, will further discuss the nature of databases, with particular emphasis on relational and object oriented data structures.

General Bioinformatics References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) Basic Local Alignment Search Tool. Journal of Molecular Biology 215, 403-410.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs. Nucleic Acids Research 25, 3389-3402.

Bairoch, A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013-2018.

Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Dept. of Genetics, University of Washington, Seattle, Washington, U.S.A.

Genetics Computer Group (GCG), Inc. (Copyright 1982-2001) Program Manual for the Wisconsin Package, Version 10.2, Madison, Wisconsin, USA 53711.

Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York, N.Y., U.S.A.

Gribskov, M., McLachlan, M., and Eisenberg, D. (1987) Profile analysis: detection of distantly related proteins. Proceedings of the National Academy of Sciences U.S.A. 84, 4355-4358.

Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the National Academy of Sciences U.S.A. 89, 10915-10919.

Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology 48, 443-453.

Pearson, P., Francomano, C., Foster, P., Bocchini, C., Li, P., and McKusick, V. (1994) The Status of Online Mendelian Inheritance in Man (OMIM) medio 1994. Nucleic Acids Research 22, 3470-3473.

Pearson, W.R. and Lipman, D.J. (1988) Improved Tools for Biological Sequence Analysis. Proceedings of the National Academy of Sciences U.S.A. 85, 2444-2448.

Rost, B. and Sander, C. (1993) Prediction of Protein Secondary Structure at Better than 70% Accuracy. Journal of Molecular Biology 232, 584-599.

Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In Atlas of Protein Sequences and Structure, (M.O. Dayhoff editor) 5, Suppl. 3, 353-358, National Biomedical Research Foundation, Washington D.C., U.S.A.

Smith, S.W., Overbeek, R., Woese, C.R., Gilbert, W., and Gillevet, P.M. (1994) The Genetic Data Environment, an Expandable GUI for Multiple Sequence Analysis. CABIOS 10, 671-675.

Smith, T.F. and Waterman, M.S. (1981) Comparison of Bio-Sequences. Advances in Applied Mathematics 2, 482-489.

Sundaralingam, M., Mizuno, H., Stout, C.D., Rao, S.T., Liedman, M., and Yathindra, N. (1976) Mechanisms of Chain Folding in Nucleic Acids. The Omega Plot and its Correlation to the Nucleotide Geometry in Yeast tRNAPhe1. Nucleic Acids Research 10, 2471-2484.

Swofford, D.L., PAUP (Phylogenetic Analysis Using Parsimony) (1989-1993) Illinois Natural History Survey, (1994) personal copyright, and (1997) Smithsonian Institution, Washington D.C., U.S.A.

Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22, 4673-4680.

von Heijne, G. (1987) Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit. Academic Press, Inc., San Diego, California, U.S.A.

Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks. Proceedings of the National Academy of Sciences U.S.A. 80, 726-730.

Zuker, M. (1989) On Finding All Suboptimal Foldings of an RNA Molecule. Science 244, 48-52.