Bioinformatics and Genome Sequencing: Databases and Tools, Study notes of Chemistry

University of California-Santa Cruz Chemistry

This document from BME 110: CompBio Tools covers the basics of bioinformatics on the web, types of databases, and genome databases. It includes information on sequence repositories, literature databases, specialized sequence databases, macromolecule structure databases, metabolism databases, and experimental data databases. The document also discusses the importance of using published databases and methods, assessing the reliability of results, and interpreting the significance of results.

Typology: Study notes

Pre 2010

Uploaded on 08/19/2009

koofers-user-il0 🇺🇸


Genome and DNA Sequence Databases
BME 110: CompBio Tools
Todd Lowe
April 6, 2006

Admin
• Reading:
  – Claverie, Chapters 2 & 3
• Who did not receive test email announcement on Wednesday from me?
• Two open slots in class

First: What’s the difference between Bioinformatics and Computational Biology?

Bioinformatics on the Web
• Golden Rules:
  – Use published databases and methods
    • Supported and maintained
    • Trusted by community
  – Document what you’ve done
    • Sequence identification numbers
    • Server, database, program versions
    • Program parameters
  – Assess reliability of results
    • Understand and use reported confidence measures
    • Compare results of multiple servers
    • Do results support/conflict with other available data?

Most Basic Database: Sequence Repositories
• Three major sequence repositories
  – NCBI
    • National Center for Biotechnology Information
    • www.ncbi.nlm.nih.gov
  – EBI
    • European Bioinformatics Institute
    • www.ebi.ac.uk
  – DDBJ
    • DNA Data Bank of Japan
    • www.ddbj.nig.ac.jp
• Same sequence information in all three
• Different tools for searching and retrieval

NAR Database Index
• Great collation of biological databases, Nucleic Acids Research Database Issue
  http://www3.oup.co.uk/nar/database/c/
• 858 Databases (!!)
• Sorted alphabetically & by category
• Books date quickly, use on-line collections like this (or Google) to get most current information

Literature: PubMed @ NCBI
• What: access to Medline
  – Primarily biomedical, molecular biology & biochemistry journals
• Searching
  – Entrez search engine
  – Logical operators
  – Field Delimiters
  – What’s related
  – PMIDs
  – Returns article title, authors, & abstract
• PubMed Central – free access to many full-text scientific papers
• UCSC Electronic Journal access
• Proxy Server for off-campus access

Genes & Genomes

Why Sequence a Genome?
• We wish to understand how the entire cell / organism works
  – thousands of complex gene interactions
  – complete “parts list” is first step to understanding how parts work together as a whole
• Economy of scale - faster, cheaper to sequence all genes at once, than one at a time by many different researchers

First Fully Sequenced Free-Living Organism: H. influenzae
The Institute for Genomic Research (TIGR) - 1995

Human Genome Project
• In the 1980s, initial discussion to sequence human genome (meeting here at UCSC)
  – Began: 1990
  – Planned finish: 2005
• Original Estimates:
  – ~100,000 genes
  – 3 billion nucleotides
  – less than 5% of genome codes for genes
• Early opposition to sequencing 95% “junk”, taking money away from basic research

Sequencing Genomes: Strategy #1
• Original “Top-Down” Strategy
• Deliberate, small chance for major errors

Two New Technologies
1. Capillary sequencing: ABI 3700 and MegaBACE machines allowed gel-free automated sequencing
2. Whole-genome shotgun approach (bottom-up sequencing) - sequence all the little pieces first (~600 bp each), then put them together afterwards
  • likely to be less thorough, more gaps, but much faster for “gene-skimming”

Sequencing Genomes: Strategy #2
WGS (Whole Genome Shotgun) – “Bottom Up” Strategy
• No ordering of a clone library - straight to sequencing to build “scaffolds”
• Mapping data (STSs) used to place scaffolds

One sequence read from the human genome:

>ctg14072
CATGGAAACCCCANAAAAAACATGAAATGCATACCGAACTACAAAAAAGG
AAAATAAATATAAACACATTCCAAAACTTAAAAATGAAGGAGATTTCAGA
CAGTCCCTCCTGGTAAAATGTGAAATTGCACCCCAGCTGCAGCAGCTACT
GTAAATATCCAAGGAATCAGTTTTAAGTGTTTGGGGATCCCAGGGATCCC
TGCAAAGCACTCAGGATTTTAACATTAAGCTCACAAATTACAGCAGCTGG
CCGGGCACAGTGGCTCACGCCCGTAATCCCAGCACTTTTGGAGGCCGAGG
CAGGTGGATCACCTGAGGTCTCCACTAAAAATACAAAAAACTAGCCAGGG
TGTGTGGCGGACATCTGTAATCCCAGCCACTTCGGGGGCTGAGGCAGGAG
AATCACTTGAACCCGGGAGGTGGAGGTTGCATTGAGCTGACGTTATGCCA
TTGCACTCCGGCCTGNGCAACAGAGAGAAACTTCATCTCTAACTACTAAT
TACAGCAACCAACAGGCCTCTAGGTTAGTTACCACCCTAACCTTTTCGTT
CGAGATTTTCAAACCACCTTGAACGTGGGTATTTTTTGTGGGTCCTTTAT
CTTCATTCATTAATCACATTATCAGACATTCCCTGAGTGGCCTGGTTCTG
TATACATGCTGAAGCTTCCAAATCAACCGTCCGTTTGGCTTCCCACAAC

• Where are the genes?

Annotating a Genome
Goals:
1. Note the positions of any known or predicted genes
2. Give as much information about function of genes as possible (and certainty of information)
Purpose is to help biologists make connections between their work and the sequence you are annotating

First Step: Similarity Searching
• Is it like anything we’ve seen before?
• Database search - rapidly compare a new sequence to all previously catalogued sequences
• Currently, >100 billion nucleotides in non-redundant nucleotide database at NCBI
• Need fast methods, like BLAST
• A “significant” hit often allows inference of function to the newly sequenced gene

Two Types of Genes
1) Protein-coding
  • Purpose: make proteins
  • Common Pattern: [Start codon] [codon 1] [codon 2] […] [Stop codon]
2) Non-protein Coding RNAs (ncRNAs):
  • Purpose: make a functional RNA
  • No common pattern shared by all ncRNAs

Protein Gene Finders
• Based on statistical analysis of DNA for
  – Start codon [ATG] (sometimes TTG/GTG)
  – Stop codon [TGA], [TAG], [TAA]
  – Codon “frequencies”
  – Splice junctions - sequence motifs (patterns) at borders between exons and introns

Differences Between Eukaryotic and Prokaryotic Genes/Genomes
• Prokaryotes (Bacteria + Archaea)
  – Usually no introns, so no need to detect introns; ORFs ~ genes
  – Multiple genes are often organized into operons: functionally related genes that are co-transcribed in one long mRNA
  – Usually a single large circular chromosome (0.5-5 Mbp); often with some small circular DNA elements called plasmids
  – 70-95% of genome codes for genes
• Eukaryotes
  – Genes broken up into “exons” and spliced out “introns”; complicates accurate gene prediction
  – Generally, minority of DNA in genome codes for genes
  – Multicellular euks have much more complex regulation

Getting Sequences: NCBI
• Go to http://www.ncbi.nlm.nih.gov/
• Choose “Nucleotide” or “Protein”
• Type in query, same rules as for making PubMed queries (logical operators, limits, etc.)
• Or, to get a genome, choose “Genome Project” and type species name

GenBank File Format
• Completely annotated sequence

LOCUS       NC_000854  1669695 bp  DNA  circular  BCT  07-APR-2003
DEFINITION  Aeropyrum pernix, complete genome.
ACCESSION   NC_000854
SOURCE      Aeropyrum pernix
  ORGANISM  Aeropyrum pernix
            Archaea; Crenarchaeota; Thermoprotei; Desulfurococcales;
            Desulfurococcaceae; Aeropyrum.
FEATURES             Location/Qualifiers
     source          1..1669695
                     /organism="Aeropyrum pernix"
                     /db_xref="taxon:56636"
     gene            complement(213..938)
                     /gene="APE0001"
     CDS             complement(213..938)
                     /gene="APE0001"
                     /codon_start=1
                     /product="hypothetical protein"
BASE COUNT   360022 a 473378 c 466849 g 369446 t
ORIGIN
        1 aaataataat aaaaattaag tgactcatgc attatcctac gaggtaaaaa tatgttataa
       61 attgtcccag actaccatca atttagggac aatagtgttt aagggatggc cttcggagct
      121 ggcagctcgc gggttcaaac tcgcgtaggg cccgagttct agttatagtt gcgtggattt
. . .

FASTA File Format
• Mostly sequence, little description
• General format often used for web server analysis tools

>Seq_name Description (all on first line)
AGTACGGACCAGACAGGCCGATAGGACG
AGGCCGATAGGACGAGGCCGATAGGACG
CGTTA
>Next_seq_name Description
ACCGATTACCGA

UCSC Archaeal Genome Browser
[Figure: P. furiosus genome browser view, base positions 234500 to 234700. Tracks shown: GC percent in 20-base windows; GenBank protein-coding genes; operon predictions by TIGR; ORF annotation from TIGR; P. furiosus microarray experiments; log-odds scans for promoters on the plus and minus strands (15-base window); conservation of proteins with tblastn against the Sargasso Sea; conservation of proteins with BLASTP; Pyrococcus 4-way multiz alignments.]

Yeast Genome Database (SGD)
[Figure: SGD map of Chromosome VII features between coordinates 120000 - 220000 bp.]

Yeast Gene Entry (ROK1)
[Figure: SGD gene page for ROK1, with links to resources such as literature, sequence retrieval, sequence analysis tools, maps and displays, comparison resources, and functional analysis. Basic information shown:]
• Standard Name: ROK1
• Systematic Name: YGL171W
• Feature Type: ORF
• GO Annotations
  – Molecular Function: ATP dependent RNA helicase
  – Biological Process: 35S primary transcript processing; mRNA splicing
  – Cellular Component: nucleolus
• Description: contains domains found in the DEAD protein family of ATP-dependent RNA helicases; high-copy suppressor of kem1 null mutant
• Phenotype
  – Old format: Null mutant is inviable
  – Systematic deletion: inviable
• Position: Chr VII
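
As a rough illustration of two ideas from these notes, the FASTA layout (a ">" header line followed by sequence lines) and the protein-coding pattern [Start codon] […] [Stop codon], here is a minimal Python sketch. The function names `parse_fasta` and `find_orfs` and the `min_codons` parameter are hypothetical, introduced only for this example. The scan is deliberately naive: it checks only the three forward reading frames, accepts only ATG as a start codon (not TTG/GTG), and uses none of the codon-frequency statistics that real gene finders rely on.

```python
from typing import Dict, List, Tuple

# Start and stop codons as listed in the "Protein Gene Finders" notes.
START_CODON = "ATG"
STOP_CODONS = {"TGA", "TAG", "TAA"}


def parse_fasta(text: str) -> Dict[str, str]:
    """Return {sequence name: sequence} from FASTA-formatted text."""
    records: Dict[str, List[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: first word is the name, the rest is description.
            fields = line[1:].split(None, 1)
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            # Sequence lines are concatenated, upper-cased for comparison.
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[int, int, str]]:
    """Naive ORF scan over the three forward reading frames.

    Returns (start, end, orf) tuples with a 0-based start and an
    exclusive end; each ORF runs from ATG through its stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None
    return orfs


if __name__ == "__main__":
    # Made-up toy input, not real data.
    example = ">Seq_name Description\nCCATGAAA\nTTTTAGGG\n"
    for seq_name, sequence in parse_fasta(example).items():
        print(seq_name, find_orfs(sequence))
```

For a prokaryotic genome, where ORFs roughly correspond to genes, a scan like this gives a crude first pass; a fuller version would also scan the reverse complement, since genes such as APE0001 in the GenBank example sit on the complementary strand.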
