What Is Bioinformatics? - Lecture Notes | CS 5263, Study notes of Computer Science

University of Texas - San Antonio Computer Science

Material Type: Notes; Class: Bioinformatics; Subject: Computer Science; University: University of Texas - San Antonio; Term: Unknown 2001;

Typology: Study notes

Pre 2010

Uploaded on 08/18/2009

koofers-user-7su-1 🇺🇸


What is Bioinformatics? A Proposed Definition and Overview of the Field

N. M. Luscombe, D. Greenbaum, M. Gerstein
Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, USA

© 2001 Schattauer GmbH, Method Inform Med 4/2001

Summary
Background: The recent flood of data from genome sequences and functional genomics has given rise to a new field, bioinformatics, which combines elements of biology and computer science.
Objectives: Here we propose a definition for this new field and review some of the research that is being pursued, particularly in relation to transcriptional regulatory systems.
Methods: Our definition is as follows: Bioinformatics is conceptualizing biology in terms of macromolecules (in the sense of physical chemistry) and then applying “informatics” techniques (derived from disciplines such as applied maths, computer science, and statistics) to understand and organize the information associated with these molecules, on a large scale.
Results and Conclusions: Analyses in bioinformatics predominantly focus on three types of large datasets available in molecular biology: macromolecular structures, genome sequences, and the results of functional genomics experiments (e.g. expression data). Additional information includes the text of scientific papers and “relationship data” from metabolic pathways, taxonomy trees, and protein-protein interaction networks. Bioinformatics employs a wide range of computational techniques including sequence and structural alignment, database design and data mining, macromolecular geometry, phylogenetic tree construction, prediction of protein structure and function, gene finding, and expression data clustering. The emphasis is on approaches integrating a variety of computational methods and heterogeneous data sources. Finally, bioinformatics is a practical discipline. We survey some representative applications, such as finding homologues, designing drugs, and performing large-scale censuses. Additional information pertinent to the review is available over the web at http://bioinfo.mbb.yale.edu/what-is-it.

Keywords
Bioinformatics, Genomics, Introduction, Transcription Regulation

Method Inform Med 2001; 40: 346–58

Updated version of an invited review paper that appeared in Haux, R., Kulikowski, C. (eds.) (2001). IMIA Yearbook of Medical Informatics 2001: Digital Libraries and Medicine, pp. 83–99. Stuttgart: Schattauer.

1. Introduction

Biological data are being produced at a phenomenal rate [1]. For example, as of April 2001, the GenBank repository of nucleic acid sequences contained 11,546,000 entries [2] and the SWISS-PROT database of protein sequences contained 95,320 [3]. On average, these databases are doubling in size every 15 months [2]. In addition, since the publication of the H. influenzae genome [4], complete sequences for nearly 300 organisms have been released, ranging from 450 genes to over 100,000. Add to this the data from the myriad of related projects that study gene expression, determine the protein structures encoded by the genes, and detail how these products interact with one another, and we can begin to imagine the enormous quantity and variety of information that is being produced.

As a result of this surge in data, computers have become indispensable to biological research. Such an approach is ideal because of the ease with which computers can handle large quantities of data and probe the complex dynamics observed in nature. Bioinformatics, the subject of the current review, is often defined as the application of computational techniques to understand and organise the information associated with biological macromolecules. Fig. 1 shows that the number of papers related to this new field has been surging, and now comprises almost 2% of the annual total of papers in PubMed.

This unexpected union between the two subjects is attributed to the fact that life itself is an information technology; an organism’s physiology is largely determined by its genes, which at their most basic can be viewed as digital information. At the same time, there have been major advances in the technologies that supply the initial data; Anthony Kerlavage of Celera recently cited that an experimental laboratory can produce over 100 gigabytes of data a day with ease [5]. This incredible processing power has been matched by developments in computer technology; the most important areas of improvement have been in the CPU, disk storage and the Internet, allowing faster computations and better data storage, and revolutionising the methods for accessing and exchanging data.

1.1 Aims of Bioinformatics

In general, the aims of bioinformatics are three-fold. First, at its simplest, bioinformatics organises data in a way that allows researchers to access existing information and to submit new entries as they are produced, e.g. the Protein Data Bank for 3D macromolecular structures [6, 7]. While data-curation is an essential task, the information stored in these databases is essentially useless until analysed. Thus the purpose of bioinformatics extends much further. The second aim is to develop tools and resources that aid in the analysis of data. For example, having sequenced a particular protein, it is of interest to compare it with previously characterised sequences.
This needs more than just a simple text-based search, and programs such as FASTA [8] and PSI-BLAST [9] must consider what constitutes a biologically significant match. Development of such resources dictates expertise in computational theory, as well as a thorough understanding of biology.

The third aim is to use these tools to analyse the data and interpret the results in a biologically meaningful manner. Traditionally, biological studies examined individual systems in detail, and frequently compared them with a few that are related. In bioinformatics, we can now conduct global analyses of all the available data with the aim of uncovering common principles that apply across many systems and highlight novel features.

In this review, we provide a systematic definition of bioinformatics as shown in Box 1. We focus on the first and third aims just described, with particular reference to the keywords: information, informatics, organisation, understanding, large-scale and practical applications. Specifically, we discuss the range of data that are currently being examined, the databases into which they are organised, the types of analyses that are being conducted using transcription regulatory systems as an example, and finally some of the major practical applications of bioinformatics.

2. “…the INFORMATION associated with these Molecules…”

Table 1 lists the types of data that are analysed in bioinformatics and the range of topics that we consider to fall within the field. Here we take a broad view and include subjects that may not normally be listed. We also give approximate values describing the sizes of data being discussed.

We start with an overview of the sources of information. Most bioinformatics analyses focus on three primary sources of data: DNA or protein sequences, macromolecular structures and the results of functional genomics experiments.
Raw DNA sequences are strings of the four base-letters comprising genes, each typically 1,000 bases long. The GenBank [2] repository of nucleic acid sequences currently holds a total of 12.5 billion bases in 11.5 million entries (all database figures as of April 2001). At the next level are protein sequences comprising strings of 20 amino acid-letters. At present there are about 400,000 known protein sequences [3], with a typical bacterial protein containing approximately 300 amino acids. Macromolecular structural data represent a more complex form of information. There are currently 15,000 entries in the Protein Data Bank, PDB [6, 7], containing atomic structures of proteins, DNA and RNA solved by x-ray crystallography and NMR. A typical PDB file for a medium-sized protein contains the xyz-coordinates of approximately 2,000 atoms.

Scientific euphoria has recently centred on whole genome sequencing. As with the raw DNA sequences, genomes consist of strings of base-letters, ranging from 1.6 million bases in Haemophilus influenzae [10] to 3 billion in humans [11, 12]. The Entrez database [13] currently has complete sequences for nearly 300 archaeal, bacterial and eukaryotic organisms. In addition to producing the raw nucleotide sequence, a lot of work is involved in processing this data. An important aspect of complete genomes is the distinction between coding regions and non-coding regions – ‘junk’ repetitive sequences making up the bulk of base sequences, especially in eukaryotes. Within the coding regions, genes are annotated with their translated protein sequence, and often with their cellular function.

Fig. 1 Plot showing the growth of scientific publications in bioinformatics between 1973 and 2000. The histogram bars (left vertical axis) count the total number of scientific articles relating to bioinformatics, and the black line (right vertical axis) gives the percentage of the annual total of articles relating to bioinformatics. The data are taken from PubMed.

Box 1 Bioinformatics – a Definition¹
(Molecular) bio – informatics: bioinformatics is conceptualising biology in terms of molecules (in the sense of physical chemistry) and applying “informatics techniques” (derived from disciplines such as applied maths, computer science and statistics) to understand and organise the information associated with these molecules, on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications.
¹ As submitted to the Oxford English Dictionary.

… different secondary structural elements. Another example is the use of structural data to understand a protein’s function; here studies have investigated the relationship between different protein folds and their functions [52, 53] and analysed similarities between different binding sites in the absence of homology [54]. Combined with similarity measurements, these studies provide us with an understanding of how much biological information can be accurately transferred between homologous proteins [55].

4.1 The Bioinformatics Spectrum

Fig. 2 summarises the main points we raised in our discussions of organising and understanding biological data – the development of bioinformatics techniques has allowed an expansion of biological analysis in two dimensions, depth and breadth. The first is represented by the vertical axis in the figure and outlines a possible approach to the rational drug design process. The aim is to take a single gene and follow through an analysis that maximises our understanding of the protein it encodes. Starting with a gene sequence, we can determine the protein sequence with strong certainty.
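The first step along the depth axis, going from a gene sequence to a protein sequence, can be sketched in a few lines. This is a minimal illustration only: it assumes the standard genetic code and an uninterrupted coding sequence read in frame from its first base. The function name `translate` and the example sequence are hypothetical and do not come from the review.

```python
from itertools import product

# Standard genetic code, written with bases in the order T, C, A, G.
# "*" marks a stop codon. The compact layout is an illustrative choice.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    # Step through complete codons only; trailing bases are ignored.
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical 12-base sequence: ATG GCC AAA TAA
print(translate("ATGGCCAAATAA"))  # prints MAK
```

The sketch raises a `KeyError` on any character other than A, C, G or T, and it ignores introns and alternative genetic codes, which real translation software has to handle.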
From there, prediction algorithms can be used to calculate the structure adopted by the protein. Geometry calculations can define the shape of the protein’s surface and molecular simulations can determine the force fields surrounding the molecule. Finally, using docking algorithms, one could identify or design ligands that may bind the protein, paving the way for designing a drug that specifically alters the protein’s function. In practice, the intermediate steps are still difficult to achieve accurately, and they are best combined with experimental methods to obtain some of the data, for example characterising the structure of the protein of interest.

The aim of the second dimension, the breadth in biological analysis, is to compare a gene or gene product with others. Initially, simple algorithms can be used to compare the sequences and structures of a pair of related proteins. With a larger number of proteins, improved algorithms can be used to produce multiple alignments, and extract sequence patterns or structural templates that define a family of proteins. Using this data, it is also possible to construct phylogenetic trees to trace the evolutionary path of proteins. Finally, with even more data, the information must be stored in large-scale databases. Comparisons become more complex, requiring multiple scoring schemes, and we are able to conduct genomic scale censuses that provide comprehensive statistical accounts of protein features, such as the abundance of particular structures or functions in different genomes. It also allows us to build phylogenetic trees that trace the evolution of whole organisms.

Fig. 2 Organizing and understanding biological data. Paradigm shifts during the past couple of decades have taken much of biology away from the laboratory bench and have allowed the integration of other scientific disciplines, specifically computing. The result is an expansion of biological research in breadth and depth. The vertical axis demonstrates how bioinformatics can aid rational drug design with minimal work in the wet lab. Starting with a single gene sequence, we can determine the protein sequence with strong certainty. From there, we can determine the structure using structure prediction techniques. With geometry calculations, we can further resolve the protein’s surface and through molecular simulation determine the force fields surrounding the molecule. Finally, docking algorithms can provide predictions of the ligands that will bind on the protein surface, thus paving the way for the design of a drug specific to that molecule. The horizontal axis shows how the influx of biological data and advances in computer technology have broadened the scope of biology. Initially, with a pair of proteins, we can make comparisons between the sequences and structures of evolutionarily related proteins. With more data, algorithms for multiple alignments of several proteins become necessary. Using multiple sequences, we can also create phylogenetic trees to trace the evolutionary development of the proteins in question. Finally, with the deluge of data we currently face, we need to construct large databases to store, view and deconstruct the information. Alignments now become more complex, requiring sophisticated scoring schemes, and there is enough data to compile a genome census – a genomic equivalent of a population census – providing comprehensive statistical accounting of protein features in genomes.

5. “… applying INFORMATICS TECHNIQUES…”

The distinct subject areas we mention require different types of informatics techniques. Briefly, for data organisation, the first biological databases were simple flat files. However, with the increasing amount of information, relational database methods with Web-page interfaces have become increasingly popular.
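To make the contrast with flat files concrete, the sketch below loads a few sequence records into an in-memory SQLite database and summarises them with a single query. The table layout, column names, accession codes and sequences are all hypothetical, chosen only for illustration; they do not reflect the schema of any real biological database.

```python
import sqlite3

# In-memory database; a real resource would use a persistent server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein ("
    " accession TEXT PRIMARY KEY,"
    " organism TEXT NOT NULL,"
    " sequence TEXT NOT NULL)"
)

# Hypothetical records (accession, organism, amino acid sequence).
records = [
    ("X0001", "organism A", "MAKLV"),
    ("X0002", "organism A", "MSTNPKPQR"),
    ("X0003", "organism B", "MKV"),
]
conn.executemany("INSERT INTO protein VALUES (?, ?, ?)", records)


def proteins_per_organism(connection):
    """Return (organism, protein count, mean sequence length) rows."""
    rows = connection.execute(
        "SELECT organism, COUNT(*), AVG(LENGTH(sequence))"
        " FROM protein GROUP BY organism ORDER BY organism"
    )
    return rows.fetchall()


print(proteins_per_organism(conn))
```

With a flat file, the same summary would require parsing every record by hand; here the grouping and averaging are expressed declaratively and left to the database engine.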
In sequence analysis, techniques include string comparison methods such as text search and one-dimensional alignment algorithms. Motif and pattern identification for multiple sequences depend on machine learning, clustering and data-mining techniques. 3D structural analysis techniques include Euclidean geometry calculations combined with basic application of physical chemistry, graphical representations of surfaces and volumes, and structural comparison and 3D matching methods. For molecular simulations, Newtonian mechanics, quantum mechanics, molecular mechanics and electrostatic calculations are applied. In many of these areas, the computational methods must be combined with good statistical analyses in order to provide an objective measure for the significance of the results.

6. Transcription Regulation – a Case Study in Bioinformatics

DNA-binding proteins have a central role in all aspects of genetic activity within an organism, participating in processes such as transcription, packaging, rearrangement, replication and repair. In this section, we focus on the studies that have contributed to our understanding of transcription regulation in different organisms. Through this example, we demonstrate how bioinformatics has been used to increase our knowledge of biological systems and also illustrate the practical applications of the different subject areas that were briefly outlined earlier. We start by considering structural analyses of how DNA-binding proteins recognise particular base sequences. Later, we review several genomic studies that have characterised the nature of transcription factors in different organisms, and the methods that have been used to identify regulatory binding sites in the upstream regions. Finally, we provide an overview of gene expression analyses that have been recently conducted and suggest future uses of transcription regulatory analyses to rationalise the observations made in gene expression experiments.
All the results that we describe have been found through computational studies.

6.1 Structural Studies

As of April 2001, there were 379 structures of protein-DNA complexes in the PDB. Analyses of these structures have provided valuable insight into the stereochemical principles of binding, including how particular base sequences are recognized and how the DNA structure is quite often modified on binding.

A structural taxonomy of DNA-binding proteins, similar to that presented in SCOP and CATH, was first proposed by Harrison [56] and periodically updated to accommodate new structures as they are solved [57]. The classification consists of a two-tier system: the first level collects proteins into eight groups that share gross structural features for DNA-binding, and the second comprises 54 families of proteins that are structurally homologous to each other. Assembly of such a system simplifies the comparison of different binding methods; it highlights the diversity of protein-DNA complex geometries found in nature, but also underlines the importance of interactions between α-helices and the DNA major groove, the main mode of binding in over half the protein families. While the number of structures represented in the PDB does not necessarily reflect the relative importance of the different proteins in the cell, it is clear that helix-turn-helix, zinc-coordinating and leucine zipper motifs are used repeatedly. These provide compact frameworks to present the α-helix on the surfaces of structurally diverse proteins. At a gross level, it is possible to highlight the differences between transcription factor domains that “just” bind DNA and those involved in catalysis [58]. Although there are exceptions, the former typically approach the DNA from a single face and slot into the grooves to interact with base edges. The latter commonly envelop the substrate, using complex networks of secondary structures and loops.
Focusing on proteins with α-helices, the structures show many variations, both in amino acid sequences and detailed geometry. They have clearly evolved independently in accordance with the requirements of the context in which they are found. While achieving a close fit between the α-helix and major groove, there is enough flexibility to allow both the protein and DNA to adopt distinct conformations. However, several studies that analysed the binding geometries of α-helices demonstrated that most adopt fairly uniform conformations regardless of protein family. They are commonly inserted in the major groove sideways, with their lengthwise axis roughly parallel to the slope outlined by the DNA backbone. Most start with the N-terminus in the groove and extend out, completing two to three turns within contacting distance of the nucleic acid [59, 60].

Given the similar binding orientations, it is surprising to find that the interactions between each amino acid position along the α-helices and nucleotides on the DNA vary considerably between different protein families. However, by classifying the amino acids according to the sizes of their side chains, we are able to rationalise the different interaction patterns. The rules of interactions are based on the simple premise that for a given residue position on α-helices in similar conformations, small amino acids interact with nucleotides that are close in distance and large amino acids with those that are further away [60, 61]. Equivalent studies for binding by other structural motifs, like β-hairpins, have also been conducted [62]. When considering these interactions, it is important to remember that different regions of the protein surface also provide interfaces with the DNA.

This brings us to look at the atomic level interactions between individual amino acid-base pairs.
Such analyses are based on the premise that a significant proportion of specific DNA-binding could be rationalised by a universal code of recognition between amino acids and bases, i.e. whether certain protein residues preferably interact with particular nucleotides regardless of the type of protein-DNA complex [63]. Studies have considered hydrogen bonds, van der Waals contacts and water-mediated bonds [64-66]. Results showed that about two-thirds of all interactions are with the DNA backbone and that their main role is one of sequence-independent stabilisation. In contrast, interactions with bases display some strong preferences, including the interactions of arginine or lysine with guanine, asparagine or glutamine with adenine and threonine with thymine. Such preferences were explained through examination of the stereochemistry of the amino acid side chains and base edges. Also highlighted were more complex types of interactions where single amino acids contact more than one base-step simultaneously, thus recognising a short DNA sequence. These results suggested that universal specificity, one that is observed across all protein-DNA complexes, indeed exists. However, many interactions that are normally considered to be non-specific, such as those with the DNA backbone, can also provide specificity depending on the context in which they are made.

Armed with an understanding of protein structure, DNA-binding motifs and side chain stereochemistry, a major application has been the prediction of binding either by proteins known to contain a particular motif, or those with structures solved in the uncomplexed form. Most common are predictions for α-helix-major groove interactions – given the amino acid sequence, what DNA sequence would it recognise [61, 67]. In a different approach, molecular simulation techniques have been used to dock whole proteins and DNAs on the basis of force-field calculations around the two molecules [68, 69].
Both methods have met with limited success because, even for apparently simple cases like α-helix-binding, there are many other factors that must be considered. Comparisons between bound and unbound nucleic acid structures show that DNA-bending is a common feature of complexes formed with transcription factors [58, 70]. This and other factors such as electrostatic and cation-mediated interactions assist indirect recognition of the nucleotide sequence, although they are not well understood yet. Therefore, it is now clear that detailed rules for specific DNA-binding will be family specific, but with underlying trends such as the arginine-guanine interactions.

6.2 Genomic Studies

Due to the wealth of biochemical data that are available, genomic studies in bioinformatics have concentrated on model organisms, and the analysis of regulatory systems has been no exception. Identification of transcription factors in genomes invariably depends on similarity search strategies, which assume a functional and evolutionary relationship between homologous proteins. In E. coli, studies have so far estimated a total of 300 to 500 transcription regulators [71], and PEDANT [72], a database of automatically assigned gene functions, shows that typically 2-3% of prokaryotic and 6-7% of eukaryotic genomes comprise DNA-binding proteins. As assignments were only complete for 40-60% of genomes as of August 2000, these figures most likely underestimate the actual number. Nonetheless, they already represent a large quantity of proteins and it is clear that there are more transcription regulators in eukaryotes than other species. This is unsurprising, considering the organisms have developed a relatively sophisticated transcription mechanism.
From the conclusions of the structural studies, the best strategy for characterising DNA-binding of the putative transcription factors in each genome is to group them by homology and to analyse the individual families. Such classifications are provided in the secondary sequence databases described earlier and also those that specialise in regulatory proteins such as RegulonDB [73] and TRANSFAC [74]. Of even greater use is the provision of structural assignments to the proteins; given a transcription factor, it is helpful to know the structural motif that it uses for binding, therefore providing us with a better understanding of how it recognises the target sequence. Structural genomics through bioinformatics assigns structures to the protein products of genomes by demonstrating similarity to proteins of known structure [75]. These studies have shown that prokaryotic transcription factors most frequently contain helix-turn-helix motifs [71, 76] and eukaryotic factors contain homeodomain type helix-turn-helix, zinc finger or leucine zipper motifs.

… could be made in low-level organisms like yeast and the results applied to homologues in higher-level organisms such as humans, where experiments are more demanding.

An equivalent approach is also employed in genomics. Homologue-finding is extensively used to confirm coding regions in newly sequenced genomes and functional data is frequently transferred to annotate individual genes. On a larger scale, it also simplifies the problem of understanding complex genomes by analysing simple organisms first and then applying the same principles to more complicated ones – this is one reason why early structural genomics projects focused on Mycoplasma genitalium [75].

Ironically, the same idea can be applied in reverse. Potential drug targets are quickly discovered by checking whether homologues of essential microbial proteins are missing in humans.
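The reverse use of homology described above reduces, at its simplest, to a set difference. The sketch below assumes that proteins have already been grouped into homologous families by some upstream similarity search, which is not shown. The family labels and the function name `candidate_targets` are hypothetical.

```python
def candidate_targets(essential_microbial, human_families):
    """Return essential microbial protein families with no human homologue."""
    return sorted(set(essential_microbial) - set(human_families))


# Hypothetical family labels, assumed to come from a prior homology search.
essential = {"family_1", "family_2", "family_3"}
human = {"family_2", "family_4"}

print(candidate_targets(essential, human))  # prints ['family_1', 'family_3']
```

In practice the difficult part is the homology assignment itself; a family wrongly judged absent in humans would yield a misleading candidate.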
On a smaller scale, structural differences between similar proteins may be harnessed to design drug molecules that specifically bind to one structure but not another.

7.2 Rational Drug Design

One of the earliest medical applications of bioinformatics has been in aiding rational drug design. Fig. 3 outlines the commonly cited approach, taking the MLH1 gene product as an example drug target. MLH1 is a human gene, situated on the short arm of chromosome 3, that encodes a mismatch repair (mmr) protein [110]. Through linkage analysis and its similarity to mmr genes in mice, the gene has been implicated in nonpolyposis colorectal cancer [111]. Given the nucleotide sequence, the probable amino acid sequence of the encoded protein can be determined using translation software. Sequence search techniques can then be used to find homologues in model organisms, and based on sequence similarity, it is possible to model the structure of the human protein on experimentally characterized structures. Finally, docking algorithms could design molecules that could bind the model structure, leading the way for biochemical assays to test their biological activity on the actual protein.

Fig. 3 Schematic outlining how scientists can use bioinformatics to aid rational drug discovery. MLH1 is a human gene, situated on the short arm of chromosome 3, that encodes a mismatch repair (mmr) protein. Through linkage analysis and its similarity to mmr genes in mice, the gene has been implicated in nonpolyposis colorectal cancer. Given the nucleotide sequence, the probable amino acid sequence of the encoded protein can be determined using translation software. Sequence search techniques can be used to find homologues in model organisms, and based on sequence similarity, it is possible to model the structure of the human protein on experimentally characterised structures. Finally, docking algorithms could design molecules that could bind the model structure, leading the way for biochemical assays to test their biological activity on the actual protein.

7.3 Large-scale Censuses

Although databases can efficiently store all the information related to genomes, structures and expression datasets, it is useful to condense all this information into trends and facts that users can readily understand. Broad generalisations help identify interesting subject areas for further detailed analysis, and place new observations in a proper context. This enables one to see whether they are unusual in any way.

Through these large-scale censuses, one can address a number of evolutionary, biochemical and biophysical questions. For example, are specific protein folds associated with certain phylogenetic groups? How common are different folds within particular organisms? And to what degree are folds shared between related organisms? Does this extent of sharing parallel measures of relatedness derived from traditional evolutionary trees? Initial studies show that the frequency of folds differs greatly between organisms and that the sharing of folds between organisms does in fact follow traditional phylogenetic classifications [37, 112, 113]. We can also integrate data on protein functions; given that particular protein folds are often related to specific biochemical functions [52, 53], these findings highlight the diversity of metabolic pathways in different organisms [36, 89].

As we discussed earlier, one of the most exciting new sources of genomic information is the expression data. Combining expression information with structural and functional classifications of proteins, we can ask whether the high occurrence of a protein fold in a genome is indicative of high expression levels [97].
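The census questions posed in this section (how common is each fold within an organism, and how many folds do two organisms share?) amount to counting and set comparison once fold assignments exist. The sketch below is illustrative only: the genome and fold labels are hypothetical, and the Jaccard index is used as one simple, assumed measure of fold sharing rather than the measure used in the cited studies.

```python
from collections import Counter


def fold_census(assignments):
    """Count how often each fold occurs in each genome."""
    census = {}
    for genome, fold in assignments:
        census.setdefault(genome, Counter())[fold] += 1
    return census


def shared_fraction(census, genome_a, genome_b):
    """Jaccard index of the distinct folds found in two genomes."""
    folds_a, folds_b = set(census[genome_a]), set(census[genome_b])
    return len(folds_a & folds_b) / len(folds_a | folds_b)


# Hypothetical (genome, fold) assignments, one entry per protein domain.
assignments = [
    ("genome_1", "fold_A"), ("genome_1", "fold_A"), ("genome_1", "fold_B"),
    ("genome_2", "fold_A"), ("genome_2", "fold_C"),
]

census = fold_census(assignments)
print(census["genome_1"].most_common(1))  # prints [('fold_A', 2)]
print(shared_fraction(census, "genome_1", "genome_2"))
```

Comparing such sharing scores across many genome pairs with distances from a conventional evolutionary tree is one way to ask whether fold sharing parallels phylogeny.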
Further genomic-scale data that we can consider in large-scale surveys include the subcellular localisations of proteins and their interactions with each other [114-116]. In conjunction with structural data, we can then begin to compile a map of all protein-protein interactions in an organism.

8. Conclusions

With the current deluge of data, computational methods have become indispensable to biological investigations. Originally developed for the analysis of biological sequences, bioinformatics now encompasses a wide range of subject areas including structural biology, genomics and gene expression studies. In this review, we provided an introduction and overview of the current state of the field. In particular, we discussed the types of biological information and databases that are commonly used, examined some of the studies that are being conducted, with reference to transcription regulatory systems, and finally looked at several practical applications of the field.

Two principal approaches underpin all studies in bioinformatics. The first is that of comparing and grouping the data according to biologically meaningful similarities, and the second is that of analysing one type of data to infer and understand the observations for another type of data. These approaches are reflected in the main aims of the field, which are to understand and organise the information associated with biological molecules on a large scale. As a result, bioinformatics has not only provided greater depth to biological investigations, but added the dimension of breadth as well. In this way, we are able to examine individual systems in detail and also compare them with those that are related in order to uncover common principles that apply across many systems and highlight unusual features that are unique to some.

Acknowledgments

We thank Patrick McGarvey for comments on the manuscript.

References

1. Reichhardt T. It’s sink or swim as a tidal wave of data approaches.
Nature 1999; 399 (6736): 517-20.
2. Benson DA, et al. GenBank. Nucleic Acids Res 2000; 28 (1): 15-8.
3. Bairoch A, Apweiler R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res 2000; 28 (1): 45-8.
4. Fleischmann RD, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 1995; 269 (5223): 496-512.
5. Drowning in data. The Economist 1999 (26 June 1999).
6. Bernstein FC, et al. The Protein Data Bank. A computer-based archival file for macromolecular structures. Eur J Biochem 1977; 80 (2): 319-24.
7. Berman HM, et al. The Protein Data Bank. Nucleic Acids Res 2000; 28 (1): 235-42.
8. Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 1988; 85 (8): 2444-8.
9. Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997; 25 (17): 3389-402.
10. Fleischmann RD, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 1995; 269 (5223): 496-512.
11. Lander ES, et al. Initial sequencing and analysis of the human genome. Nature 2001; 409: 860-921.
12. Venter JC, et al. The sequence of the human genome. Science 2001; 291 (5507): 1304-51.
13. Tatusova TA, Karsch-Mizrachi I, Ostell JA. Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics 1999; 15 (7-8): 536-43.
14. Eisen MB, Brown PO. DNA arrays for analysis of gene expression. Methods Enzymol 1999; 303: 179-205.
15. Cheung VG, et al. Making and reading microarrays. Nat Genet 1999; 21 (1 Suppl): 15-9.
16. Duggan DJ, et al. Expression profiling using cDNA microarrays. Nat Genet 1999; 21 (1 Suppl): 10-4.
17. Lipshutz RJ, et al. High density synthetic oligonucleotide arrays. Nat Genet 1999; 21 (1): 20-4.
18. Velculescu VE, et al. Serial Analysis of Gene Expression. Detailed Protocol 1999.
19. Holstege FC, et al.
Dissecting the regulatory circuitry of a eukaryotic genome. Cell 1998; 95 (5): 717-28.
20. Roth FP, Estep PW, Church GM. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotech 1998; 16 (10): 939-45.
21. Jelinsky SA, Samson LD. Global response of Saccharomyces cerevisiae to an alkylating agent. Proc Natl Acad Sci USA 1999; 96 (4): 1486-91.
22. Cho RJ, et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell 1998; 2 (1): 65-73.
23. DeRisi JL, Iyer VR, Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997; 278 (5338): 680-6.
24. Winzeler EA, et al. Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 1999; 285 (5429): 901-6.
25. Perou CM, et al. Molecular portraits of human breast tumours. Nature 2000; 406 (6797): 747-52.
26. Golub TR, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999; 286 (5439): 531-7.
27. Pedersen AG, et al. A DNA structural atlas for Escherichia coli. J Mol Biol 2000; 299 (4): 907-30.
28. Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000; 28 (1): 27-30.
29. Jeffery CJ. Moonlighting proteins. TIBS 1999; 24 (1): 8-11.
30. Chothia C. Proteins. One thousand families for the molecular biologist. Nature 1992; 357 (6379): 543-4.
31. Orengo CA, Jones DT, Thornton JM. Protein superfamilies and domain superfolds. Nature 1994; 372 (6507): 631-4.
32. Lesk AM, Chothia C. How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins. J Mol Biol 1980; 136 (3): 225-70.
33. Russell RB, et al. Recognition of analogous and homologous protein folds: analysis of sequence and structure conservation. J Mol Biol 1997; 269 (3): 423-39.
34. Russell RB, et al.
Recognition of analogous and homologous protein folds – assessment of prediction success and associated alignment accuracy using empirical substitution matrices. Protein Eng 1998; 11 (1): 1-9.
35. Fitch WM. Distinguishing homologous from analogous proteins. Syst Zool 1970; 19: 99-110.
36. Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein families. Science 1997; 278 (5338): 631-7.
37. Gerstein M, Hegyi H. Comparing genomes in terms of protein structure: surveys of a finite parts list. FEMS Microbiol Rev 1998; 22 (4): 277-304.
38. Skolnick J, Fetrow JS. From genes to protein structure and function: novel applications of computational approaches in the genomic era. Trends Biotech 2000; 18: 34-9.
39. Qian J, et al. PartsList: a web-based system for dynamically ranking protein folds based on disparate attributes, including whole-genome expression and interaction information. Nucleic Acids Res 2001; 29 (8): 1750-64.
40. Gerstein M. Integrative database analysis in structural genomics. Nat Struct Biol 2000; 7 Suppl: 960-3.
41. Etzold T, Ulyanov A, Argos P. SRS: information retrieval system for molecular biology data banks. Methods Enzymol 1996; 266: 114-28.
42. Schuler GD, et al. Entrez: molecular biology database and retrieval system. Methods Enzymol 1996; 266: 141-62.
43. Wade K. Searching Entrez PubMed and uncover on the internet. Aviat Space Environ Med 2000; 71 (5): 559.
44. Bertone P, et al. SPINE: An integrated tracking database and datamining approach for high-throughput structural proteomics, enabling the determination of the properties of readily characterized proteins. Nucleic Acids Res. In Press.
45. Zhang MQ. Promoter analysis of co-regulated genes in the yeast genome. Comput Chem 1999; 23 (3-4): 233-50.
46. Boguski MS. Biosequence exegesis. Science 1999; 286 (5439): 453-5.
47. Miller C, Gurd J, Brass A. A RAPID algorithm for sequence database comparisons: application to the identification of vector contamination in the EMBL databases.
Bioinformatics 1999; 15 (2): 111-21.
48. Gonnet GH, Korostensky C, Brenner S. Evaluation measures of multiple sequence alignments. J Comput Biol 2000; 7 (1-2): 261-76.
49. Orengo CA, Taylor WR. SSAP: sequential structure alignment program for protein structure comparison. Methods Enzymol 1996; 266: 617-35.
50. Orengo CA. CORA – topological fingerprints for protein structural families. Protein Sci 1999; 8 (4): 699-715.
51. Russell RB, Sternberg MJ. Structure prediction. How good are we? Curr Biol 1995; 5 (5): 488-90.
52. Martin AC, et al. Protein folds and functions. Structure 1998; 6 (7): 875-84.
53. Hegyi H, Gerstein M. The relationship between protein structure and function: a comprehensive survey with application to the yeast genome. J Mol Biol 1999; 288 (1): 147-64.
54. Russell RB, Sasieni PD, Sternberg MJE. Supersites within superfolds. Binding site similarity in the absence of homology. J Mol Biol 1998; 282 (4): 903-18.
55. Wilson CA, Kreychman J, Gerstein M. Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J Mol Biol 2000; 297 (1): 233-49.
56. Harrison SC. A structural taxonomy of DNA-binding domains. Nature 1991; 353 (6346): 715-9.
57. Luscombe NM, et al. An overview of the structures of protein-DNA complexes. Genome Biology 2000; 1 (1): 1-37.
58. Jones S, et al. Protein-DNA interactions: A structural analysis. J Mol Biol 1999; 287 (5): 877-96.
59. Suzuki M, Gerstein M. Binding geometry of alpha-helices that recognize DNA. Proteins 1995; 23 (4): 525-35.
60. Luscombe NM, Thornton JM. Protein-DNA interactions: a 3D analysis of alpha-helix-binding in the major groove. Manuscript in preparation.
61. Suzuki M, et al. DNA recognition code of transcription factors. Protein Eng 1995; 8 (4): 319-28.
62. Suzuki M. DNA recognition by a β-sheet. Protein Eng 1995; 8 (1): 1-4.
63. Seeman NC, Rosenberg JM, Rich A.
Sequence specific recognition of double helical nucleic acids by proteins. Proc Natl Acad Sci USA 1976; 73: 804-8.
64. Suzuki M. A framework for the DNA-protein recognition code of the probe helix in transcription factors: the chemical and stereochemical rules. Structure 1994; 2 (4): 317-26.
65. Mandel-Gutfreund Y, Schueler O, Margalit H. Comprehensive analysis of hydrogen bonds in regulatory protein-DNA complexes: in search of common principles. J Mol Biol 1995; 253 (2): 370-82.
66. Luscombe NM, Laskowski RA, Thornton JM. Protein-DNA interactions: a 3D analysis of amino acid-base interactions. Nucleic Acids Res. In Press.
67. Mandel-Gutfreund Y, Margalit H. Quantitative parameters for amino acid-base interaction: implications for prediction of protein-DNA binding sites. Nucleic Acids Res 1998; 26: 2306-12.
68. Sternberg MJ, Gabb HA, Jackson RM. Predictive docking of protein-protein and protein-DNA complexes. Curr Opin Struct Biol 1998; 8 (2): 250-6.
69. Aloy P, et al. Modelling repressor proteins docking to DNA. Proteins 1998; 33 (4): 535-49.
70. Dickerson RE. DNA-binding: the prevalence of kinkiness and the virtues of normality. Nucleic Acids Res 1998; 26 (8): 1906-26.
71. Perez-Rueda E, Collado-Vides J. The repertoire of DNA-binding transcriptional regulators in Escherichia coli K-12. Nucleic Acids Res 2000; 28 (8): 1838-47.
72. Mewes HW, et al. MIPS: a database for genomes and protein sequences. Nucleic Acids Res 2000; 28 (1): 37-40.
73. Salgado H, et al. RegulonDB (version 3.0): transcriptional regulation and operon organization in Escherichia coli K-12. Nucleic Acids Res 2000; 28 (1): 65-7.
