Bioinformatics: Understanding Genomics Data and Solving Information Problems - Prof. C.J. , Study notes of Chemistry

These notes introduce bioinformatics, a field that focuses on the analysis and interpretation of genomics data using computational methods. They cover the conceptual foundations of bioinformatics, its role in analyzing genomics data, and the importance of large-scale data in answering biological questions. They also discuss the challenges of dealing with the complexity and scale of genome data and the need for algorithms to help make sense of it.

Typology: Study notes

Pre 2010

Uploaded on 08/30/2009

koofers-user-unh

Introduction to Bioinformatics
Chem. CM160A / 260A, CS CM121 / 221
Christopher Lee, Dept. of Chemistry & Biochemistry
TEL 5-7374, EMAIL leec@chem.ucla.edu

Course Goals
– Understand conceptual foundations of bioinformatics, via examples of what has been done;
– Apply these principles towards inventing new kinds of bioinformatics;
– Understand the role of bioinformatics in analyzing and interpreting genomics data, and how biology questions map to computational problems;
– Get acquainted with existing data and tools.

Course Website
• http://c260a.bioinformatics.ucla.edu/
• Lectures, homework, additional readings will be posted.
• You can ask questions directly on the website by adding comments to any page or post. You’ll also be able to see others’ comments and our responses.

Why Bioinformatics?
The way to answer biological questions in the genomics era.

Example: What are the Origins of Genome Complexity?
• yeast: 13 Mb, 6000 genes, simple gene structure
• Drosophila: 180 Mb, 13,600 genes
• C. elegans: 100 Mb, 19,000 genes
• human: 3,000 Mb
• Complex gene structure... Why so complex?

[Figure: a gene, with introns and exons marked (exon labels I to XXVIII). WNK1: 156 kb total, but only 10 kb is actually used (28 exons).]

[Excerpt from a Nature Genetics letter, © 2000 Nature America Inc., http://genetics.nature.com]

Gene Index analysis of the human genome estimates approximately 120,000 genes
Feng Liang, Ingeborg Holt, Geo Pertea, Svetlana Karamycheva, Steven L. Salzberg & John Quackenbush

Although sequencing of the human genome will soon be completed, gene identification and annotation remains a challenge. Early estimates suggested that there might be 60,000-100,000 (ref. 1) human genes, but recent analyses of the available data from EST sequencing projects have estimated as few as 45,000 (ref.
2) or as many as 140,000 (ref. 3) distinct genes. The Chromosome 22 Sequencing Consortium estimated a minimum of 45,000 genes based on their annotation of the complete chromosome, although their data suggests there may be additional genes. The nearly 2,000,000 human ESTs in dbEST provide an important resource for gene identification and genome annotation, but these single-pass sequences must be carefully analysed to remove contaminating sequences, including those from genomic DNA, spurious transcription, and vector and bacterial sequences. We have developed a highly refined and rigorously tested protocol for cleaning, clustering and assembling EST sequences to produce high-fidelity consensus sequences for the represented genes (F.L. et al., manuscript submitted) and used this to create the TIGR Gene Indices: databases of expressed genes for human, mouse, rat and other species (http://www.tigr.org/tdb/tgi.html). Using highly refined and tested algorithms for EST analysis, we have arrived at two independent estimates indicating the human genome contains [...]

[...] 340,127 singleton ESTs were identified and eliminated from further analysis. (The statistics for the latest build are available, see Table 1, http://genetics.nature.com/supplementary_int[...]) Our first estimate of the number of human genes is based on the observation that many of the annotated genes in GenBank are not represented in the EST data. Of the 54,506 NP and [...] sequences, 39,798 appear in 10,224 THCs containing ESTs, 8,036 appear in 1,769 THCs containing only [...] and 6,672 remain as singletons (see Table 2, http://genetics.nature.com/supplementary_info/). This suggests that the gene sequence data represent 18,665 (10,224+1,769+6,672) unique genes, of which only 10,224 (54.8%) have been sampled by EST sequencing projects.
If this trend holds true for the remainder of the genes, it implies that the 73,655 THCs that contain ESTs represent only 54.8% of the total number of genes, suggesting an upper limit of approximately 134,000 human genes. A number of factors may influence this estimate by splitting ESTs from the same gene into multiple THCs. One potential source of splitting is the relatively low quality of the data. CAP3 outperforms other assemblers in its ability to produce a single, high-fidelity consensus sequence from ESTs with 1-8% error rates at various depths of coverage (F.L. et al., manuscript submitted). Consequently, we do not believe misassembly to be a major source of error. Alternative splicing, however, may generate multi- [...]

Puzzle #2
• Given the same experimental data (ESTs), and the same access to the latest computers and analysis methods, why is it so hard to count the human genes?

[Excerpt from a journal article]

Initial sequencing and analysis of the human genome
International Human Genome Sequencing Consortium*
* A partial list of authors appears on the opposite page. Affiliations are listed at the end of the paper.

The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.
• There appear to be about 30,000–40,000 protein-coding genes in the human genome, only about twice as many as in worm or fly. However, the genes are more complex, with more alternative splicing generating a larger number of protein products.

Introduction: Genomics
• What is genomics?
• What is bioinformatics?
• How does research in this new field differ from previous research (e.g. molecular biology)?
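
The upper-limit estimate in the Liang et al. excerpt above is a simple proportional extrapolation, and it may help to see the arithmetic spelled out. The sketch below only re-derives the numbers quoted in the excerpt; the variable names are illustrative and are not taken from the paper.

```python
# Counts quoted in the Liang et al. excerpt (variable names are illustrative).
known_genes_in_est_thcs = 10_224   # THCs containing known gene sequences AND ESTs
known_genes_in_other_thcs = 1_769  # THCs containing known gene sequences but no ESTs
known_gene_singletons = 6_672      # known gene sequences that stayed unclustered
thcs_with_ests = 73_655            # all THCs that contain ESTs

# Unique genes represented by the annotated gene sequences.
known_genes = (known_genes_in_est_thcs
               + known_genes_in_other_thcs
               + known_gene_singletons)

# Fraction of those known genes that EST sequencing has sampled.
sampled_fraction = known_genes_in_est_thcs / known_genes

# If EST-containing THCs cover the same fraction of ALL genes,
# the total gene count is the THC count divided by that fraction.
upper_limit = thcs_with_ests / sampled_fraction

print(f"known genes:      {known_genes}")
print(f"sampled fraction: {sampled_fraction:.1%}")
print(f"upper limit:      {upper_limit:,.0f} genes")
```

The extrapolation assumes every EST-containing THC corresponds to exactly one gene. As the excerpt itself notes, anything that splits one gene into several THCs inflates the estimate, which bears on Puzzle #2.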
Definitions
• Genomics: genome-scale experimental analysis of biological systems
  – systematic, comprehensive
  – automated, high-throughput
• Bioinformatics: the analysis and interpretation of genomics data
  – computational by necessity
  – find the information content in the data
  – probabilistic, data-driven

What is Genomics?
G = (MB)^n
Genomics is any (molecular) biology experiment taken to the whole genome scale.
• Ideally in a single experiment.
• E.g. genome sequencing.
• E.g. DNA microarray analysis of gene expression.
• E.g. mass spectrometry protein mixture analyses: quantity, phosphorylation, etc.

How Sequencing went from Gene to Genome
• Conventional “manual” sequencing
  – all you needed was P32, a micropipet, and a gel, and you could sequence a gene (several thousand bases, i.e. a few kb) in a few weeks. A lot like cooking...
• “Automated” sequencing technology (1996)
  – 4-color fluorescence in a single lane
  – 144 kb/day raw reads per machine
  – a lot of engineering and high-tech instrumentation

Genomics Foundation: High Throughput Technology
• Automation: any human step is a bottleneck.
• Multiplexing & parallelization.
• Miniaturization.
• Read-out speed, sensitivity.
• “GMP” Q/A, reproducibility, “production line” mindset.

Laser Dye Based Sequencing
[Figure: laser and CCD camera]

Automated Trace Analysis
[Figure]

Automated Base Calling
[Figure]

The New Face of Biology
[Figure]

Sequencing the Human Genome
• the hard problem: covering, assembling the genome from many pieces; repetitive regions.
• minimum tiling, library requirements
• YACs, BACs, maps
• mapping, not sequencing, was rate-limiting
• “walking” via map markers, vs. BAC-end sequence tagged connectors.
• 150,000+ human genes vs 25,000

Human Genome Sequencing
• The experimental part (the actual sequencing) was easy. It was the information problem that was hard.
• Assembly: the high frequency of repeats in the human genome can fool you into joining the wrong fragments.

[Figure: Hierarchical shotgun sequencing. Labels: Genomic DNA; BAC library; Organized mapped large clone contigs; BAC to be sequenced; Shotgun clones; Shotgun sequence; Assembly. Example sequence shown: ACCGTAAATGGGCTGATCATGCTTAAACCCTC CATCCTACTG...]

Genome Annotation
• Genes are what biologists really want, not just the genome sequence.
• Unfortunately, much of the initial 32,000 gene annotation was based on gene prediction, not measured experimental evidence.
• It is likely that 50% of the reported genes were wrong in details (individual exons, boundaries) or entirely.
• The Drosophila annotation has recently been shown to be deeply flawed.
• Resolved through mapping of experimental cDNAs.

Anti-climax: How to turn Data into Discoveries?!?
atcgtacgtacgtagctatgcatgctagctagtcattctctactcaccacagtgctacgtactgtt ggacatcgtatagtatttatcgatctatgtcagtactttaggtagaacgatgtgattctacctat gttggtatatcgat...
When you get massive data, what do you do with it? What does it mean?
Problem: no human being could ever look at all this data. So the patterns must be discovered computationally.

Definition of Bioinformatics
Bioinformatics is the study of the inherent structure of biological information.
• Data-driven: let the data speak for themselves.
• Look for non-random patterns in the data.
• Measure significance of patterns as evidence for competing hypotheses.

Completeness Changes Everything
• In molecular biology, cleverness is finding a way to answer a question definitively by ignoring 99.99% of genes. You can’t see them, so the experiment must exclude them.
• In genomics, cleverness is discovering what becomes possible when you can see everything. We have to switch our deepest assumptions.

Example: Ortholog Prediction
• Orthologs: two genes related by speciation events alone; “the same gene in two species”, typically with the same function.
• Paralogs: two genes related by at least one gene-duplication & divergence event.
Homology (sequence similarity): is it an ortholog or a paralog?
• Experimentally very hard to answer.

Gene Evolution through Speciation vs. Duplication
[Figure: gene tree descending from an ancestral gene, with speciation and gene duplication events marked and orthologs and paralogs labeled.]
In this diagram, all genes drawn with the same symbol are orthologous; genes with different symbols are paralogous. Genes with the same color are in the same species.

Clusters of Orthologous Genes Analysis
[Figure: Phylogenetic Tree; Ortholog Graph]

Slightly More Complex Example
[Figure: Phylogenetic Tree; Ortholog Graph]

The Ortholog Information Problem
• Ortholog identification is not solved well by standard experimental thinking (i.e. think of more experiments to do), but really is an information problem: finding believable patterns in the data. If you can solve this information problem, you can find orthologs reliably en masse, across entire genomes.
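
To make "finding believable patterns in the data" concrete, here is a minimal Python sketch of one widely used heuristic, reciprocal best hits: gene a in species A and gene b in species B are called putative orthologs only if each is the other's highest-scoring match. This is an assumption chosen for illustration; the slides do not say which method the course uses. The gene names and similarity scores below are invented toy data, and in practice the scores would come from a sequence comparison tool.

```python
def best_hits(scores):
    """Map each query gene to its single highest-scoring target gene.

    `scores` maps (query, target) pairs to a similarity score.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


# Hypothetical toy scores: A1/A2/A3 are genes in species A, B1/B2 in species B.
# A3 stands in for a paralog of A1: it resembles B1, but less than A1 does.
a_vs_b = {
    ("A1", "B1"): 950, ("A1", "B2"): 400,
    ("A2", "B1"): 420, ("A2", "B2"): 900,
    ("A3", "B1"): 600, ("A3", "B2"): 350,
}
# Same scores seen from species B's side.
b_vs_a = {(b, a): s for (a, b), s in a_vs_b.items()}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Here A3 is excluded: its best hit is B1, but B1's best hit is A1. The heuristic is only a first approximation. It can miss true orthologs after lineage-specific duplications, which is one reason the notes turn to phylogenetic trees and ortholog graphs.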