Molecular Databases for Biology: Organization, Access, and Important Databases

An overview of molecular databases for biology, focusing on publicly available sequence and structural databases. It discusses primary sequences, the role of databases in storing sequence information, and notable databases such as GenBank, Swiss-Prot, and the Protein Data Bank. It also mentions specialized databases and ways to access them using various software and tools.


Uploaded on 08/30/2009

BSC4933/5936 Intro’ to BioInfo’ Lab #2

BSC4933/5936: Introduction to Bioinformatics
Laboratory Section: Tuesdays from 3:45 to 5:45 PM.

Biological Molecular Databases
Week 2, Tuesday, September 2, 2003
Author and Instructor: Steven M. Thompson

Molecular databases for biology and how they are organized and accessed: Internet sequence and structural databases as well as the on-site GCG sequence databases are reviewed in this tutorial. Access methods including those available on the World Wide Web, the National Center for Biotechnology Information’s Network Entrez, and GCG*'s LookUp are emphasized, but data entry and format conversion are also covered.

Steve Thompson
BioInfo 4U
2538 Winnwood Circle
Valdosta, GA, USA 31601-7953
stevet@bio.fsu.edu
229-249-9751

* GCG® is the Genetics Computer Group, part of Accelrys Inc., a subsidiary of Pharmacopeia Inc., producer of the Wisconsin Package® for sequence analysis.

© 2003 BioInfo 4U

... usually discovered with some sort of text searching program, either on the World Wide Web or not. This brings up a point: locus names versus accession numbers. The LOCUS, ID, and ENTRY name categories in the various databases are different from the Accession number category. Each sequence is given a unique accession number upon submission to the database. This number allows tracking of the data when entries are merged or split; it will always be associated with its particular data. Entry names may change; accession numbers are forever; they just pile up, primary becomes secondary, ad infinitum.

What changes have occurred in the databases (history and development)?

The first well-recognized sequence database was Margaret Dayhoff’s Atlas of Protein Sequence and Structure, begun in the mid-sixties (1965-1978). GenBank began in 1982 (1986), EMBL in 1980 (1986).
They have all been attempts at establishing an organized, reliable, comprehensive and openly available library of genetic sequence. Databases have long since outgrown a hardbound atlas. They have become huge and have evolved through many changes. Changes in format over the years are a major source of grief for software designers and program users. Each program needs to be able to recognize particular aspects of the sequence files; whenever they change, it's liable to throw a wrench in the works. People have argued for particular standards such as XML, but a standard is almost impossible to enforce. NCBI’s ASN.1 format and its Entrez interface, available on the Web or as Network Entrez, a dedicated client-server program, attempt to circumvent these frustrations somewhat. EMBL’s SRS (Sequence Retrieval System), found on the World Wide Web at all EMBL OutStations, and the Wisconsin Package’s LookUp derivative of SRS also help people perform text searches in, interact with, and browse in the sequence databases. Both SRS and Entrez provide ‘links’ to associated databases so that you can jump from, for instance, a chromosomal map location, to a DNA sequence, to its translated protein sequence, to a corresponding structure, and then to a MedLine reference, and so on. They are very helpful!

What other types of bioinformatics databases are used?

Specialized versions of sequence databases include sequence pattern databases such as restriction and protease sites, promoter regions, and protein motifs and profiles; and organism or system specific databases such as the sequence portions of ACeDb (A C. elegans Database), FlyBase (Drosophila dataBase), SGD (Saccharomyces Genome Database), and the Ribosomal Database Project. In general two other types of databases are also accessed in bioinformatics: reference and three-dimensional structure.
Reference databases run the gamut from OMIM (Online Mendelian Inheritance In Man) to PubMed access to MedLine (NCBI’s access to NLM’s bibliographic database of over 4,000 biomedical journals). Other databases that could be put in this class include things like proprietary medical records databases and population studies databases. Finally, the Research Collaboratory for Structural Bioinformatics (RCSB, a consortium consisting of three institutions: Rutgers University, San Diego Supercomputer Center at University of California, San Diego, and the National Institute of Standards and Technology) supports the three-dimensional structural Protein Data Bank (PDB). Other three-dimensional structure databases include the Nucleic Acid Databank at Rutgers (NDB) and the Cambridge small molecule crystallographic Structural Database (CSD).

The Human Genome Project and numerous smaller genome projects have kept the data coming at alarming rates. GenBank has staggering growth statistics (http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html):

Year    BasePairs      Sequences
1982    680338         606
1983    2274029        2427
1984    3368765        4175
1985    5204420        5700
1986    9615371        9978
1987    15514776       14584
1988    23800000       20579
1989    34762585       28791
1990    49179285       39533
1991    71947426       55627
1992    101008486      78608
1993    157152442      143492
1994    217102462      215273
1995    384939485      555694
1996    651972984      1021211
1997    1160300687     1765847
1998    2008761784     2837897
1999    3841163011     4864570
2000    11101066288    10106023
2001    15849921438    14976310
2002    28507990166    22318883

It doubles in size almost every year! GenBank version 135, April 2003, has 31,099,264,455 bases, from 24,027,936 sequences. As of that release (50 years after the publication of the Watson-Crick double-helix!), 16 Archaea[bacteria] and 128 [Eu]Bacteria completely finished genomes are publicly available at NCBI, not counting all the virus, phage, and viroid genomes (1364, 175, and 35 respectively).
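The growth claim for the GenBank table above can be checked with a little arithmetic. The following is a minimal Python sketch, not part of the original tutorial; it uses three rows copied from the table and assumes steady exponential growth between the chosen years. The function name `doubling_time_years` is made up for illustration.

```python
import math

# (year -> base pairs), copied from three rows of the GenBank growth table
BASE_PAIRS = {
    1982: 680_338,
    1992: 101_008_486,
    2002: 28_507_990_166,
}

def doubling_time_years(year_start, year_end, table=BASE_PAIRS):
    """Average doubling time, assuming steady exponential growth."""
    growth_factor = table[year_end] / table[year_start]
    doublings = math.log2(growth_factor)
    return (year_end - year_start) / doublings

overall = doubling_time_years(1982, 2002)
first_decade = doubling_time_years(1982, 1992)
second_decade = doubling_time_years(1992, 2002)

print(f"1982-2002: one doubling every {overall:.2f} years")
print(f"1982-1992: one doubling every {first_decade:.2f} years")
print(f"1992-2002: one doubling every {second_decade:.2f} years")
```

Over the full twenty years this works out to one doubling roughly every 1.3 years, with the second decade somewhat faster than the first, which is consistent with "almost every year."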
One complete nucleomorph genome, from the cryptomonad Guillardia theta, and nine complete Eukaryote genomes, Anopheles gambiae, Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Encephalitozoon cuniculi, Plasmodium falciparum, Plasmodium yoelii, Saccharomyces cerevisiae, and Schizosaccharomyces pombe, are finished and publicly available. Four vertebrate and six plant essentially complete genome maps are publicly available for analysis: the animals Danio rerio, Homo sapiens, Mus musculus, and Rattus norvegicus, and the plants Avena sativa, Glycine max, Hordeum vulgare, Oryza sativa, Triticum aestivum, and Zea mays.

Your Project Molecular System Choices

As mentioned previously in the course, you need to decide on a particular molecule with which to perform this and the remaining eight directed computer exercise tutorials. So that I can provide the necessary data, and to provide a diverse, yet level, playing field, this choice must be made from a list that I provide of four different ‘hot’ interest molecules. My list will contain molecular systems for which at least one experimentally solved protein structure and that protein’s genomic DNA sequence are known. They will all be from organisms possessing exons and introns, to make the gene finding tutorial fair, regardless of choice. My apologies are offered to prokaryote biologists; sorry, but I think that you should know about splice-site recognition also.

You will gain experience in all aspects of biocomputing covered in the course in a project-oriented fashion using the same natural progression as would be used in an actual experimental setting. An advantage of this approach, besides its attempt to appeal to a wide cross-section of individuals working in diverse areas, is that the resultant predictive data derived from sequence analysis will no doubt conflict with aspects of the known structural data, but elements of truth will also be found.
In this way the strengths and weaknesses of each approach are better understood, and a greater empathy is found for the tremendous problems encountered in the all-too-common case of a newly discovered gene product with no structural homologues. With this approach to computational molecular biology, you will “come full swing,” gaining an appreciation for the full biocomputing spectrum available. The directed exercise tutorial sequence lasts for the first two thirds of the semester, ten weeks. Scheduled lab sessions are devoted after that to working on and conferring about your semester final research project.

Select the molecule that interests you the most, or that most closely fits the general type of work that you plan on doing in your academic or professional career, from the list below. That way you should be more interested in using it for biocomputing practice in the tutorials. Take your pick from the following project molecular systems list:

1) Higher plant ribulose bisphosphate carboxylase/oxygenase (RuBisCO), the nuclear encoded, small subunit only. This is a crucial enzyme in the Calvin cycle of photosynthesis, and, some would claim, the most abundant enzyme on earth.

2) Vertebrate c-H-Ras, also known as P21 Harvey ras proto-oncogene transforming protein. This incredibly ‘hot’ molecule is critically important in many cancer ontogenies.

3) Vertebrate basic fibroblast growth factor, also known as heparin-binding growth factor 2 or prostatotropin. This is another popularly investigated cytokine relevant to cancer research.

4) Fungal Cu/Zn superoxide dismutase (gene name sod). This is a cytoplasmic, oxidoreductase-type, free radical scavenger. Aren’t free radicals implicated in both cell aging and cancer? Isn’t that the deal with antioxidant vitamins?
Leave the “Search” database set to “Nucleotide.” Press “Go.” Remember from last week that my examples in this tutorial series will all use elongation factor 1 alpha, but that you are to use the same project molecule from the previous list throughout the tutorial series. The next screen will list all of the entries from NCBI’s Entrez nucleotide database that contain the words that you typed into the “for” box anywhere in the entries’ annotation. Because the molecular systems that we are using as examples are very well studied, the list will likely be huge; mine includes 27096 entries! This is often the problem with initial Entrez searches: you get way too many sequences. My results follow in the next screen snapshot:

Because of this problem NCBI provides a way to restrict your search beyond just restricting it to a particular database. Press the “Limits” button just below the “for” text box. This will allow you to restrict the search in several different ways, the “Field” “Limit” and “exclude” buttons perhaps being the most helpful. Notice the default is “All Fields”; in other words, your desired text can appear anywhere in the entry. Switch the “Field” “Limit” to “Title,” check “exclude all of the above,” and then press “Go” again. Now the search is restricted to only those entries whose “Title” line contains your text, and it also excludes several problematic subsets of the Entrez nucleotide database. The “Title” line is also known as the “Definition” line and is a one-line description of the entry that usually contains the type of English words that would normally be used to name the entry. Both the “Protein Name” and “Gene Name” Fields are much more restrictive and often cause problems. In my case no entries were found at all if I restricted the “Field” to either “Protein” or “Gene” “Name” using the text “elongation factor alpha,” yet the “Title” search found almost 4000 entries:

Go ahead and pick one of the entries that you found on this search.
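The Title-restricted search described above can also be written as a query string. The sketch below is an illustration only and is not part of the original tutorial: it assumes NCBI's present-day E-utilities `esearch` endpoint and the `[Title]` field tag as the programmatic counterpart of the “Field” “Limit”; it does not attempt to reproduce the “exclude all of the above” check box, and the helper name `build_title_search` is made up. It only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities search endpoint (not mentioned in the tutorial)
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_title_search(words, db="nucleotide", retmax=20):
    """Return (term, url) for a search where every word must be in the Title."""
    # Each word is tagged with [Title] and the words are joined with AND,
    # mirroring the "Field: Title" limit on the Web form.
    term = " AND ".join(f"{word}[Title]" for word in words)
    params = {"db": db, "term": term, "retmax": retmax}
    return term, f"{EUTILS_ESEARCH}?{urlencode(params)}"

term, url = build_title_search(["elongation", "factor", "alpha"])
print(term)
print(url)
```

Swapping in your own project molecule's descriptive words gives the equivalent restricted query for your system.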
Choose an entry that has the words “mRNA, complete cds” to reduce complications in subsequent steps. I’ll choose my first one, EF1-a from the red alga Porphyra purpurea. Clicking on the entry name displays its complete GenBank/GenPept format record:

The entry can be saved in its default GenBank/GenPept format (or any other desired format from the drop-down menu) to your computer by clicking on the “Send to” “File” button, or it can be copied and pasted into other applications. For now click on the button in the upper right-hand corner entitled “Links” and pick “Protein” from the drop-down menu that appears. This will link over to the corresponding protein sequence:

Next click on “Links” from this entry and notice all the additional databases listed. Rather than choosing one of those links, pick the “BLink” button instead. The resulting page has a wealth of information, all based on precompiled BLAST database searches against NCBI’s nonredundant protein database:

Taxonomic groupings are color-coded and alignments are graphically represented. Explore some of these links for a while, but not for too long. There’s lots more to see today. As with most Web stuff, it is very easy to get sidetracked and end up spending way more time than you intended! In particular it may not be worth pursuing any structural links as the NCBI structural viewer Cn3D may not be installed on the computer that ...

... elongation factor Tu with the PDB access code 1DG1. As described last week, EF-Tu is the bacterial homologue of the eukaryote EF-1a protein and, therefore, is still appropriate for my example. The “Query Result Browser” window follows in the screen snapshot below:

Press the “EXPLORE” button to get PDB’s “Summary Information” page on the molecule. The PDB “Summary Information” page provides a “View Structure” link. Press “View Structure” to get the “Structure Explorer” page for the molecule.
This page provides several alternative interactive three-dimensional representations for download as well as a few still image downloads. As mentioned previously, we will not be bothering with the interactive downloads at this point, as the necessary molecular viewer may not be installed on the computer you are using. Therefore, pick one of the “Still Images” to download, as my example shows here:

That’s probably enough time to spend on Web access to the databases for now. Please explore further on your own time. There are still two more important access routes that we need to investigate today.

NCBI’s Network Entrez: Nentrez

NCBI formerly offered another implementation of Entrez independent of the World Wide Web called Network Entrez, or Nentrez for short. It is a client/server program where you install the client on your machine and it accesses the server databases at NCBI independent of a Web browser. I happen to prefer it over the Web version; its use seems more intuitive to me and it is definitely faster and more flexible than the Web version. Unfortunately, NCBI stopped developmental support on Nentrez and it is no longer available. Too bad. Regardless, I downloaded and installed the client on Mendel while it was still available, so that even if it is not installed on whatever local machine you are using, you will still be able to use it as long as you have active X windowing on that machine. Let’s take a brief look at how it works on Mendel.

Go ahead and log onto Mendel with an X-tunneled ssh session. Remember from last week that regardless of what local machine you are using, you need the X-tunneled ssh session to display X windows on that local machine.
On the Conradi iMacs click on the OroboroX icon in the “dock”; on Linux machines launch a terminal window. Either way, type the following command in the xterm window that results (the X has to be capitalized; replace “user” with your account name):

> ssh -X user@mendel.csit.fsu.edu

As seen last week, MS Windows machines use a graphical version of ssh with X tunneling capability in combination with Xwin32 for X windowing access to Mendel. After you’ve logged onto Mendel launch Network Entrez with the following command:

> entrez &

Remember that the ampersand, “&,” isn’t essential but is very helpful as it leaves your xterm window available for other chores such as directory listings; otherwise the xterm window is unusable and another one would have to be launched if you wanted to run any system level commands.

One of my favorite aspects of Nentrez is its ability to easily screen selections based on taxonomy. Change the “Database:” to “Nucleotide,” the “Field:” to “Organism,” and the “Mode:” to “Taxonomy” and then either press <return> or the “Accept” button. The screen display shown opposite should appear; the numbers indicate how many nucleotide entries of each taxonomic category are in the database:

<Double-click> “cellular organisms” to see the three ‘urkingdoms’ of cellular life, Archaea, Bacteria, and Eukaryota. Repeat the <double-click> procedure ‘stepping’ through all of cellular life until you’ve gotten to your desired taxon. You can also go directly to your desired taxon category with “Mode:” “Selection,” but I often find it helpful to see the complete path. Once you’ve gotten to the taxon category that you want, press “Accept.” This will take the term and move it to the “Query Refinement” section of the window. Now change the “Field:” to “Title” and type your project molecule’s descriptive name into the “Term:” text box and press <return> or the “Accept” button.
This should result in a screen similar to the one shown below:

Notice the ampersands between the terms. These are logical Boolean connectors between them. Selecting a term and dragging it onto another will change the AND logic to OR logic and the ampersand into a square bracket. Press the “Retrieve” button to display the next window (obviously yours will have a different number of entries and use different terms):

Now comes another part that I really like versus NCBI’s Web implementation. Press the “Select:” “All” button and then switch the “Target:” button from “Nucleotide” to “Protein” and then press “Evaluate.” This will find all of the protein entries that correspond to the selected nucleotide entries. After you’ve found protein entries, it is often helpful to switch to the structural database, allowing you to find and visualize any 3D structures for your entry. Selecting entries and then pressing the “Neighbor” button accesses the precompiled BLAST reports for those entries and thus allows you to discover all of the entries similar to your entry.

As with the Web version, don’t spend too much time playing with Nentrez; we have one more major lab section to complete today. Furthermore, since development on Nentrez has stalled, the program may crash on you; it is a bit ‘buggy.’ If this happens, don’t worry about it and just proceed with the next section. If the program didn’t crash on you, use the “File” menu to “Quit.”

You need to use the Boolean operator symbols to connect the individual query strings because the databases are indexed using individual words for most fields. The “Organism” field is an exception; it will accept ‘Genus species’ designations as well as any single word supported level of taxonomy, e.g.
“fungi.” The Boolean operators supported by LookUp are the ampersand, “&,” meaning “AND,” the pipe symbol, “|,” to denote the logical “OR,” and the exclamation point, “!,” to specify “BUT NOT.” Other LookUp query construction rules are case insensitivity, parenthesis nesting, “*” and “?” wildcard support, and automatic wildcard extension (e.g. “transcript” will find “transcriptional” and “transcript”). This query should find most of the elongation factor alphas from the ‘lower’ eukaryotes in the SwissProt database and will provide a good dataset example. The “LookUp” window should look similar to the adjacent display:

Next press the “Run” button. The program will display the results of the search; scroll through the output and then “Close” the window. The beginning of the LookUp output file from my example follows below. I found fifteen entries in Swiss-Prot that met my restrictions:

As mentioned previously, be careful that all of the sequences included in the output from any text searching program are appropriate. In this case the search only found the correct elongation factors, but improper nomenclature and other database inconsistencies can always cause problems. If you find inappropriate sequences upon reading the LookUp output, you can either edit the output file to remove them, or “CUT” them from the SeqLab Editor display after loading the list. Another option, if you use an editor, is to comment out the undesired sequences by placing an exclamation point, “!,” in front of the unwanted lines.

Select the LookUp output file in the “SeqLab Output Manager.” This is a very important window and will contain all of the output from your current SeqLab session. Files may be displayed, printed, saved in other locations or with other names, and deleted from this window. Press the “Save As...” button and give the LookUp output file a more appropriate name. Be sure not to change the directory specification; only change the portion after the slash.
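The LookUp query rules described earlier in this section (word-based indexing, “&” for AND, “|” for OR, “!” for BUT NOT, case insensitivity, “*” and “?” wildcards, and automatic wildcard extension) can be illustrated with ordinary set operations. This is a toy Python sketch, not how LookUp or SRS is actually implemented; the entry names and index words are made up, and the helper `hits` is hypothetical.

```python
from fnmatch import fnmatchcase

# Made-up index: entry name -> individual words from its annotation
INDEX = {
    "ENTRY_A": ["Elongation", "factor", "alpha", "Fungi"],
    "ENTRY_B": ["elongation", "factor", "Tu", "Bacteria"],
    "ENTRY_C": ["Transcriptional", "initiation", "factor", "Fungi"],
}

def hits(term, index=INDEX):
    """Entries with at least one word matching the term.

    Matching is case-insensitive, honors "*" and "?" wildcards, and
    appends "*" to mimic automatic wildcard extension.
    """
    pattern = term.lower() + "*"
    return {
        name
        for name, words in index.items()
        if any(fnmatchcase(word.lower(), pattern) for word in words)
    }

# "elongation & fungi": both words must be present (AND)
elongation_and_fungi = hits("elongation") & hits("fungi")

# "elongation | transcript": either word may be present (OR)
elongation_or_transcript = hits("elongation") | hits("transcript")

# "factor ! bacteria": factor entries, BUT NOT the bacterial ones
factor_not_bacteria = hits("factor") - hits("bacteria")

print(sorted(elongation_and_fungi))
print(sorted(elongation_or_transcript))
print(sorted(factor_not_bacteria))
```

Python's set operators `&`, `|`, and `-` stand in for LookUp's “&,” “|,” and “!” here; parenthesis nesting in a LookUp query corresponds to grouping the set expressions with parentheses.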
Next press the “Add to Main List” button in the “SeqLab Output Manager” and “Close” the window afterwards. This will add the results of the LookUp search to the previous three entries from last week in your “sample.list.” Go to the “File” menu next and press “Save List.” Next, be sure that the LookUp output file is selected in the “SeqLab Main Window” and then switch “Mode:” to “Editor.” This will load the file into the SeqLab Editor where we could perform further analyses on the entries if so desired. Notice that all of the sequences now appear in the Editor window with the amino acid residues color-coded. As mentioned last week, the nine color groups are based on a UPGMA clustering of the BLOSUM62 amino acid scoring matrix, and approximate physical property categories for the different amino acids. Expand the window to an appropriate size by ‘grabbing’ the bottom-left corner of its ‘frame’ and ‘pulling’ it out as far as desired. Use the vertical scroll bar to see them all. Any portion of, or the entire alignment loaded, is now available for analysis by any of the GCG programs. The display should look similar to the following graphic after loading the dataset (but with your project molecular system, not my example):

Also mentioned last week, but worth repeating, are other ways to get sequences into SeqLab. Use the “Add sequences from” “Sequence Files...” choice under the “File” menu to get GCG format compatible sequences or list files into SeqLab. The “Add Sequences” window’s “Filter” box is very important! By default files are filtered such that only those that end with the extension “.seq” are displayed. This often won’t do you any good as the sequences that you may want to add may have other extensions. Therefore, delete the “.seq” extension in the “Filter” box (including the period) if necessary, but be sure to leave the “*” wild card. Press the “Filter” button to display all of the files in your working directory.
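The effect of the “Filter” box can be mimicked with a shell-style wildcard match. The short Python sketch below is only an illustration of the wildcard idea, not of SeqLab's internals; apart from “sample.list,” the file names are made up, and `apply_filter` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Hypothetical working-directory contents
FILES = ["ef1a.seq", "ef1a.gb", "sample.list", "lookup_results.list"]

def apply_filter(pattern, names=FILES):
    """Keep only the file names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]

default_view = apply_filter("*.seq")   # the default filter hides most files
full_view = apply_filter("*")          # the bare wild card shows everything

print(default_view)
print(full_view)
```

With the default “*.seq” pattern only one of the four files is listed, which is why the list files from a LookUp search would not appear until the extension is removed from the filter.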
Select the file that you want from the “Files” box, and then click the “Add” and then “Close” buttons at the bottom of the window to put the desired file into your current list, if you’re in List Mode, or directly into the Editor, if you’re in “Editor Mode.” Use SeqLab’s Editor “File” menu “Import” function to directly load GenBank format sequences or ABI binary trace files without the need to reformat. And, as seen above, you can also directly load sequences from the online GCG databases with the “Databases...” choice under the “Add sequences” menu if you know their proper identifier name or accession code.

While you have sequences loaded in the Editor, explore the interface for a bit. You also did this last week, but it’ll only take a few moments and you need to get comfortable with SeqLab. Each protein sequence is listed by its official SwissProt entry name (ID identifier). Use both scroll bars to move around within the sequences. The scroll bar at the bottom allows you to move through the sequences linearly; the one at the side allows you to scroll through all of your entries vertically. Quickly <double-click> on various entries’ names (or <single-click> the “INFO” icon with the sequence entry name selected) to see the database reference documentation on them. (This is the same information that you can get with the GCG command “typedata -ref” at the command line.) “Close” the “Sequence Information” windows after reading them. You can also change the sequences’ names and add any documentation that you want in this window.

Change the “Display:” box from “Residue Coloring” to “Feature Coloring” and then “Graphic Features.” Now the display shows a schematic of the feature information from each entry with colors based on the information from the database Feature Table for the entry. “Graphic Features” represents features using the same colors but in a ‘cartoon’ fashion.
Quickly <double-click> on one of the various colored regions of the sequences (or use the “Features” choice under the “Windows” menu). This will produce a new window that describes the features located at the cursor. Select the feature to show more details and to select that feature in its entirety. All the features are fully editable through the “Edit” check box in this panel, and new features can be added with several desired shapes and colors through the “Add” check box.

Nearly all GCG programs are accessible through the “Functions” menu. Select various entries’ names and then go to the “Functions” menu to perform different analyses on them. You can select sequences in their entirety by clicking on their names or you can select any position(s) within sequences by ‘capturing’ them with the mouse. You can select a range of sequence names by <shift><clicking> the top-most and bottom-most name desired, or <control><click> sequence entry names to select noncontiguous entries. (But, as explained last week, this doesn’t work in the present Linux version of GCG.) The “pos:” and “col:” indicators show you where the cursor is located on a sequence, without counting gaps and with counting gaps respectively. The “1:1” scroll bar near the upper right-hand corner allows you to ‘zoom’ in or out on the sequences; move it to 2:1 and beyond and notice the difference in the display.

It’s probably a good idea to save the sequences in the display at this point and multiple times down the road as you work on a dataset. Do this occasionally the whole time you’re in SeqLab just in case there’s an interruption of service for any reason. Go to the “File” menu and choose “Save As.” Accept the default “.rsf”