Genome and DNA Sequence Databases
BME 110: CompBio Tools
Todd Lowe
April 6, 2006

Admin
• Reading:
  – Claverie, Chapters 2 & 3
• Who did not receive test email announcement on Wednesday from me?
• Two open slots in class

First: What’s the difference between Bioinformatics and Computational Biology?

Bioinformatics on the Web
• Golden Rules:
  – Use published databases and methods
    • Supported and maintained
    • Trusted by community
  – Document what you’ve done
    • Sequence identification numbers
    • Server, database, program versions
    • Program parameters
  – Assess reliability of results
    • Understand and use reported confidence measures
    • Compare results of multiple servers
    • Do results support/conflict with other available data?

Most Basic Database: Sequence Repositories
• Three major sequence repositories
  – NCBI
    • National Center for Biotechnology Information
    • www.ncbi.nlm.nih.gov
  – EBI
    • European Bioinformatics Institute
    • www.ebi.ac.uk
  – DDBJ
    • DNA Data Bank of Japan
    • www.ddbj.nig.ac.jp
• Same sequence information in all three
• Different tools for searching and retrieval

NAR Database Index
• Great collation of biological databases, Nucleic Acids Research Database Issue
  http://www3.oup.co.uk/nar/database/c/
• 858 Databases (!!)
• Sorted alphabetically & by category
• Books date quickly, use on-line collections like this (or Google) to get most current information

Literature: PubMed @ NCBI
• What -- access to Medline
  – Primarily biomedical, molecular biology & biochemistry journals
• Searching
  – Entrez search engine
  – Logical operators
  – Field Delimiters
  – What’s related
  – PMIDs
  – Returns article title, authors, & abstract
• PubMed Central
  – free access to many full-text scientific papers
• UCSC Electronic Journal access
• Proxy Server for off-campus access

Genes & Genomes

Why Sequence a Genome?
• We wish to understand how the entire cell / organism works
  – thousands of complex gene interactions
  – complete “parts list” is first step to understanding how parts work together as a whole
• Economy of scale - faster, cheaper to sequence all genes at once than one at a time by many different researchers

First Fully Sequenced Organism: H. influenzae
The Institute for Genomic Research (TIGR) - 1995

Human Genome Project
• In the 1980s, initial discussion to sequence human genome (meeting here at UCSC)
  – Began: 1990
  – Planned finish: 2005
• Original Estimates:
  – ~100,000 genes
  – 3 billion nucleotides
  – less than 5% of genome codes for genes
• Early opposition to sequencing 95% “junk”, taking money away from basic research

Sequencing Genomes: Strategy #1
• Original “Top-Down” Strategy
• Deliberate, small chance for major errors

Two New Technologies
1. Capillary sequencing: ABI 3700 and MegaBACE machines allowed gel-free automated sequencing
2. Whole-genome shotgun approach (bottom-up sequencing) - sequence all the little pieces first (~600 bp each), then put them together afterwards
   • likely to be less thorough, more gaps, but much faster for “gene-skimming”

Sequencing Genomes: Strategy #2
WGS (Whole Genome Shotgun) – “Bottom Up” Strategy
• No ordering of a clone library - straight to sequencing to build “scaffolds”
• Mapping data (STSs) used to place scaffolds

One sequence read from the human genome:
>ctg14072
CATGGAAACCCCANAAAAAACATGAAATGCATACCGAACTACAAAAAAGG
AAAATAAATATAAACACATTCCAAAACTTAAAAATGAAGGAGATTTCAGA
CAGTCCCTCCTGGTAAAATGTGAAATTGCACCCCAGCTGCAGCAGCTACT
GTAAATATCCAAGGAATCAGTTTTAAGTGTTTGGGGATCCCAGGGATCCC
TGCAAAGCACTCAGGATTTTAACATTAAGCTCACAAATTACAGCAGCTGG
CCGGGCACAGTGGCTCACGCCCGTAATCCCAGCACTTTTGGAGGCCGAGG
CAGGTGGATCACCTGAGGTCTCCACTAAAAATACAAAAAACTAGCCAGGG
TGTGTGGCGGACATCTGTAATCCCAGCCACTTCGGGGGCTGAGGCAGGAG
AATCACTTGAACCCGGGAGGTGGAGGTTGCATTGAGCTGACGTTATGCCA
TTGCACTCCGGCCTGNGCAACAGAGAGAAACTTCATCTCTAACTACTAAT
TACAGCAACCAACAGGCCTCTAGGTTAGTTACCACCCTAACCTTTTCGTT
CGAGATTTTCAAACCACCTTGAACGTGGGTATTTTTTGTGGGTCCTTTAT
CTTCATTCATTAATCACATTATCAGACATTCCCTGAGTGGCCTGGTTCTG
TATACATGCTGAAGCTTCCAAATCAACCGTCCGTTTGGCTTCCCACAAC
• Where are the genes?

Annotating a Genome
Goals:
1. Note the positions of any known or predicted genes
2. Give as much information about function of genes as possible (and certainty of information)
Purpose is to help biologists make connections between their work and the sequence you are annotating

First Step: Similarity Searching
• Is it like anything we’ve seen before?
• database search - rapidly compare a new sequence to all previously catalogued sequences
• Currently, >100 Billion nucleotides in non-redundant nucleotide database at NCBI
• Need fast methods, like BLAST
• A “significant” hit often allows inference of function to the newly sequenced gene

Two Types of Genes
1) Protein-coding
   • Purpose: make proteins
   • Common Pattern: [Start codon] [codon 1] [codon 2] […] [Stop codon]
2) Non-protein Coding RNAs (ncRNAs):
   • Purpose: make a functional RNA
   • No common pattern shared by all ncRNAs

Protein Gene Finders
• Based on statistical analysis of DNA for
  – Start codon [ATG] (sometimes TTG/GTG)
  – Stop codon [TGA], [TAG], [TAA]
  – Codon “frequencies”
  – Splice junctions - sequence motifs (patterns) at borders between exons and introns

Differences Between Eukaryotic and Prokaryotic Genes/Genomes
• Prokaryotes (Bacteria + Archaea)
  – Usually no introns, so no need to detect introns; ORFs ~ genes
  – Multiple genes are often organized into operons – functionally related genes that are co-transcribed in one long mRNA
  – Usually a single large circular chromosome (0.5-5 Mbp); often with some small circular DNA elements called plasmids
  – 70-95% of genome codes for genes
• Eukaryotes
  – Genes broken up into “exons” and spliced out “introns”; complicates accurate gene prediction
  – Generally, minority of DNA in genome codes for genes
  – Multicellular euks have much more complex regulation

Getting Sequences: NCBI
• Go to http://www.ncbi.nlm.nih.gov/
• Choose “Nucleotide” or “Protein”
• Type in query, same rules as for making PubMed queries (logical operators, limits, etc)
• Or, to get a genome, choose “Genome Project” and type species name

Genbank File Format
• Completely annotated sequence

LOCUS       NC_000854   1669695 bp   DNA   circular   BCT   07-APR-2003
DEFINITION  Aeropyrum pernix, complete genome.
ACCESSION   NC_000854
SOURCE      Aeropyrum pernix
  ORGANISM  Aeropyrum pernix
            Archaea; Crenarchaeota; Thermoprotei; Desulfurococcales;
            Desulfurococcaceae; Aeropyrum.
FEATURES             Location/Qualifiers
     source          1..1669695
                     /organism="Aeropyrum pernix"
                     /db_xref="taxon:56636"
     gene            complement(213..938)
                     /gene="APE0001"
     CDS             complement(213..938)
                     /gene="APE0001"
                     /codon_start=1
                     /product="hypothetical protein"
BASE COUNT   360022 a 473378 c 466849 g 369446 t
ORIGIN
        1 aaataataat aaaaattaag tgactcatgc attatcctac gaggtaaaaa tatgttataa
       61 attgtcccag actaccatca atttagggac aatagtgttt aagggatggc cttcggagct
      121 ggcagctcgc gggttcaaac tcgcgtaggg cccgagttct agttatagtt gcgtggattt
. . .

FASTA File Format
- Mostly sequence, little description
- General format often used for web server analysis tools

>Seq_name Description (all on first line)
AGTACGGACCAGACAGGCCGATAGGACG
AGGCCGATAGGACGAGGCCGATAGGACG
CGTTA
>Next_seq_name Description
ACCGATTACCGA

UCSC Archaeal Genome Browser
[Figure: P. furiosus Genome Browser, base positions 234500 to 234700. Tracks shown:
  - GC Percent in 20 base windows
  - GenBank protein-coding genes
  - Operon predictions by TIGR
  - ORF annotation from TIGR
  - P. furiosus microarray experiments
  - Log-odds scan for promoters on plus strand (15 base window)
  - Log-odds scan for promoters on minus strand (15 base window)
  - Conservation of proteins with tblastn against the Sargasso Sea
  - Conservation of proteins with BLASTP
  - Pyrococcus 4-way multiz alignments (Conservation)]
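
The "GC Percent in 20 base windows" track in the browser can be approximated with a few lines of Python. This is only a minimal sketch: the function name `gc_percent_windows` is hypothetical, and it assumes consecutive, non-overlapping windows and that ambiguous bases such as N count as non-GC. The browser's own calculation may differ on both points.

```python
def gc_percent_windows(seq, window=20):
    """Return (start_position, gc_percent) for each full window.

    Assumptions (not taken from the browser itself): windows are
    consecutive and non-overlapping, positions are 1-based, a trailing
    partial window is dropped, and any base other than G or C
    (including N) counts as non-GC.
    """
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc_count = sum(1 for base in chunk if base in "GC")
        results.append((start + 1, 100.0 * gc_count / window))
    return results


if __name__ == "__main__":
    # First 50 bases of the example human read shown earlier in these notes.
    read = "CATGGAAACCCCANAAAAAACATGAAATGCATACCGAACTACAAAAAAGG"
    for position, gc in gc_percent_windows(read, window=20):
        print(position, gc)
```

A small window such as 20 bases gives a noisy but fine-grained profile; larger windows smooth the track at the cost of resolution.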
Yeast Genome Database (SGD)
[Figure: SGD chromosome features map. Chromosome VII features between coordinates 120000 - 220000 bp]
Yeast Gene Entry (ROK1)
ROK1 BASIC INFORMATION
• Standard Name: ROK1
• Systematic Name: YGL171W
• Feature Type: ORF
• GO Annotations
  – Molecular Function: ATP dependent RNA helicase
  – Biological Process: 35S primary transcript processing; mRNA splicing
  – Cellular Component: nucleolus
• Description: contains domains found in the DEAD protein family of ATP-dependent RNA helicases; high-copy suppressor of kem1 null mutant
• Phenotype
  – Old format: Null mutant is inviable
  – Systematic deletion: inviable
• Position: ChrVII: coordinates 122394 to 184088

ROK1 RESOURCES
• Click on map for expanded view
• Literature
• Retrieve Sequences
• Sequence Analysis Tools (e.g. BLASTP)
• Maps and Displays (Chr. Features Map)
• Comparison Resources (Worm Homologs)
• Functional Analysis