Bioinformatics 1: Repeat Sequences - Detection & Evolution - Prof. Christopher Bystroff, Study notes of Biology

Lecture notes from a Bioinformatics 1 course on repeats, satellites, and transposons. They cover the detection and evolution of repeat sequences, including motif finding, hidden Markov models (HMMs), and tools such as MEME, Gibbs sampling, K-means, and DotPlot. The notes also discuss types of repeat sequences, such as satellites, simple sequence repeats (SSRs), minisatellites, and microsatellites, and their roles in heterochromatin and euchromatin.

Typology: Study notes

Pre 2010

Uploaded on 08/09/2009

koofers-user-dpl

Bioinformatics 1 -- lecture 19
Repeats, Satellites and Transposons -- Evolution and detection

From DNA to HMM
1. Find motifs.
2. Determine function of motifs, if possible.
3. Assemble motifs into a HMM.
4. Train and validate the HMM.
5. Use the HMM to predict function, annotate DNA sequences.

Annotating DNA
MEME, Gibbs sampling, K-means, DotPlot
Splice sites, binding sites, etc.

Microsatellite
541 gagccactag tgcttcattc tctcgctcct actagaatga acccaagatt gcccaggccc
601 aggtgtgtgt gtgtgtgtgt gtgtgtgtgt gtgtgtgtgt gtatagcaga gatggtttcc
661 taaagtaggc agtcagtcaa cagtaagaac ttggtgccgg aggtttgggg tcctggccct
721 gccactggtt ggagagctga tccgcaagct gcaagacctc tctatgcttt ggttctctaa
781 ccgatcaaat aagcataagg tcttccaacc actagcattt ctgtcataaa atgagcactg
841 tcctatttcc aagctgtggg gtcttgagga gatcatttca ctggccggac cccatttcac
A microsatellite in a dog (Canis familiaris) gene.

Minisatellite
1 tgattggtct ctctgccacc gggagatttc cttatttgga ggtgatggag gatttcagga
61 tttgggggat tttaggatta taggattacg ggattttagg gttctaggat tttaggatta
121 tggtatttta ggatttactt gattttggga ttttaggatt gagggatttt agggtttcag
181 gatttcggga tttcaggatt ttaagttttc ttgattttat gattttaaga ttttaggatt
241 tacttgattt tgggatttta ggattacggg attttagggt ttcaggattt cgggatttca
301 ggattttaag ttttcttgat tttatgattt taagatttta ggatttactt gattttggga
361 ttttaggatt acgggatttt agggtgctca ctatttatag aactttcatg gtttaacata
421 ctgaatataa atgctctgct gctctcgctg atgtcattgt tctcataata cgttcctttg
This 8 bp tandem repeat has a consensus sequence AGGATTTT, but is almost never a perfect match to the consensus.
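Because copies of the minisatellite rarely match the consensus exactly, a search for it has to tolerate mismatches. The following is a minimal sketch, assuming substitutions only (no insertions or deletions) and a fixed mismatch budget; the function names `hamming` and `find_approx` are illustrative and not part of the lecture.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def find_approx(seq, consensus, max_mismatches):
    """Return (start, window, mismatches) for every window of seq that is
    within max_mismatches substitutions of the consensus."""
    seq = seq.upper()
    consensus = consensus.upper()
    k = len(consensus)
    hits = []
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        d = hamming(window, consensus)
        if d <= max_mismatches:
            hits.append((start, window, d))
    return hits


# Row 61 of the minisatellite listing above, with the spaces removed.
excerpt = "tttgggggattttaggattataggattacgggattttagggttctaggattttaggatta"
for start, window, d in find_approx(excerpt, "AGGATTTT", 1):
    print(start, window, d)
```

Raising `max_mismatches` finds more of the degenerate copies at the cost of more chance hits, which is the reason the significance question in the next slides matters.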
ACRONYMS for satellites and transposons
SSR    Short Sequence Repeat
STR    Short Tandem Repeat
VNTR   Variable Number Tandem Repeat
LTR    Long Terminal Repeat
LINE   Long Interspersed Nuclear Element
SINE   Short Interspersed Nuclear Element
MITE   Miniature Inverted repeat Transposable Element (class III TE)
TE     Transposable Element
IS     Insertion Sequence
IR     Inverted Repeat
RT     Reverse Transcriptase
TPase  Transposase
Alu    11% of primate genome (SINE)
LINE1  14.6% of human genome
Tn7, Tn3, Tn10, Mu, IS50   transposons or transposable bacteriophage
retroposon = retrotransposon
Class I TE, uses RT.
Class II TE, uses TPase.
Class III TE, MITEs*
*Class III TEs are now merged with Class II TEs.

Expectation values for low complexity/repeat sequences.
What is a good model for random alignments of low-complexity/repeat sequences?
REMINDER: Significance is what matters! [ What is the likelihood of getting a score at “random”? ]
Getting e-values requires a model for random scores. These scores are fit to an EVD (extreme value distribution). Using the EVD equation, we can convert a score to an e-value.

Simplest option (1): Composition-biased model. Generate random sequences based on composition. Align them. Get scores. Fit the scores to the EVD.
[Figure: a single state emitting A, C, G, T]

Option (2): Dinucleotide composition-biased model. Generate random sequences based on a dinucleotide model, such as a 4-state Markov chain. Align them. Get scores. Fit the scores to the EVD.
[Figure: four connected states A, C, G, T]

Option (3): Trinucleotide composition-biased model. Generate random sequences based on a trinucleotide model, such as a 16-state HMM. Align them. Get scores. Fit the scores to the EVD.
[Figure: 16 states, A, C, G, T in each of four groups "after A", "after C", "after G", "after T". Only the arrows into the 4 “after A” states are shown.]

In-class exercise: create a HMM for a microsatellite.
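As a possible starting point for the exercise, and as an illustration of Option (2), here is a minimal sketch of a 4-state Markov chain whose transition probabilities are estimated from dinucleotide counts and then used to generate random sequences. The function names, the pseudocount of 1.0 and the choice of training sequence (row 601 of the dog microsatellite) are assumptions made for illustration; the alignment and EVD-fitting steps are not shown.

```python
import random

BASES = "ACGT"


def estimate_transitions(training_seq, pseudocount=1.0):
    """Estimate P(next base | previous base) from dinucleotide counts.

    Characters other than A, C, G, T are dropped before counting, and the
    pseudocount keeps every transition probability above zero.
    """
    counts = {a: {b: pseudocount for b in BASES} for a in BASES}
    seq = [c for c in training_seq.upper() if c in BASES]
    for prev, nxt in zip(seq, seq[1:]):
        counts[prev][nxt] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in BASES}
        for a in BASES
    }


def sample_markov(length, initial, transitions, rng):
    """Generate one sequence of the given length from the Markov chain."""
    if length <= 0:
        return ""
    seq = rng.choices(BASES, weights=[initial[b] for b in BASES])
    while len(seq) < length:
        row = transitions[seq[-1]]
        seq += rng.choices(BASES, weights=[row[b] for b in BASES])
    return "".join(seq)


# Train on a repeat-rich region, then sample a "random" sequence that
# shares its dinucleotide composition.
training = "aggtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtgtatagcagagatggtttcc"
transitions = estimate_transitions(training)
uniform = {b: 0.25 for b in BASES}
print(sample_markov(60, uniform, transitions, random.Random(0)))
```

Setting the G-to-T and T-to-G probabilities to 1 turns the same chain into a perfect (GT)n generator, which is one way to think about a microsatellite model.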
•In SeqLab: File-->Add sequence-->Databases
•Select GenEMBL-->Other mammalian
•Search for gb_om:*sat*
•Select gb_om:bbsat6
•Add to main window
•Where does the repeat start/end? Create a HMM motif model like the one on the previous slide. Use ProSite syntax.
•Use your model to generate a random microsatellite sequence.

RepeatMasker
Tool to compare a curated library of known repeats to a query sequence.
Returns: (1) Location and type of each repeat, or (2) Query sequence with repeats masked (=“N”)
Uses: Modified Dynamic Programming algorithm
Authors: Arian Smit, Phil Green
www.repeatmasker.org

Annotation Results
SW     perc  perc  perc  query     position in query                 matching  repeat          position in repeat
score  div.  del.  ins.  sequence    begin      end       (left)     repeat    class/family    begin  end  (left)   ID  Overlap
  194  10.5   2.6   0.0  chr1      1031265  1031302  (244491545)  +  C-rich    Low_complexity      3   41     (0)  624  0
  238  26.4   0.7   0.7  chr1      1031638  1031782  (244491065)  +  (TG)n     Simple_repeat       1  145     (0)  625  0
  298  29.0   2.1   0.0  chr1      1031794  1031886  (244490961)  +  (CGTG)n   Simple_repeat       3   97     (0)  626  0
  255  23.1   1.8   1.8  chr1      1031900  1032062  (244490785)  +  (TG)n     Simple_repeat       1  163     (0)  627  0
 1864  13.8   0.0   0.7  chr1      1032330  1032614  (244490233)  +  AluJo     SINE/Alu            5  287    (25)  628  0

Types of TEs
Class I: replicated/transposed through an RNA intermediate.
Requires RT.
Class II: replicated when the chromosome replicates. Transposed by cut & paste.
[Figure: DNA -> RNA (RNA polymerase II) -> DNA (reverse transcriptase)]

Tn7 cut & paste transposition
Mechanism of Tn7 transposition: TPase combines with TnsA, TnsB and TnsC to form a “transposasome”. Double-stranded cuts are made in the transposon, and a staggered double-stranded cut is made on the target. The free 3’ ends of the target ligate to the transposon. Later the gaps are filled in by DNA repair enzymes.
Nancy L. Craig, Johns Hopkins Univ.

Cut & paste leaves direct repeats
[Figure: a transposon bounded by inverted repeats is inserted at a staggered cut in the host DNA target site (....TACATGCACAG...). After the gaps are filled in, the transposon is flanked by direct repeats.]

What happens over millions of years?
Some genomes contain a large accumulation of transposon scars.
[Figure: share of each genome made up of transposable elements (TEs) vs. other sequences. Species shown: H. sapiens, Z. mays, Drosophila, Arabidopsis, C. elegans, S. cerevisiae. TE fractions shown: 35%, >50%, 2%, 15%, 1.8%, 3.1%.]

Selfish DNA
Like any good parasite, TEs have co-evolved with the host.
Too much transposition kills the host.
Some transposition may have a selective advantage -- TEs stimulate greater genetic variation, faster evolution.
TEs have adopted numerous strategies for survival, leading to selective, inefficient or intermittent transposition.

Exercise: Finding TEs
Write an algorithm for finding TEs. Find the TE in this sequence:
NNAGTGGTNNNNACCAGTNNN
Tandem direct repeats: AGT
Inverted repeat: GGT/ACC
(NNNN is TPase gene.)
(Step 1) Look for length L direct repeats.
(Step 2) Look for inverted repeats within direct repeats.

TE-finding HMM?
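Before turning to an HMM, steps 1 and 2 of the exercise can be tried directly as a string search. The sketch below is one possible answer under simplifying assumptions that are not stated in the lecture: repeats match exactly, the repeat lengths are known in advance, the inverted repeats sit immediately inside the direct repeats, and N is treated as a placeholder that never takes part in a repeat. The names `revcomp` and `find_te_candidates` are illustrative.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(s):
    """Reverse complement of a DNA string (N stays N)."""
    return s.translate(COMPLEMENT)[::-1]


def find_te_candidates(seq, dr_len, ir_len):
    """Return (start, end, direct_repeat, left_ir, right_ir) tuples.

    Step 1: find pairs of identical length-dr_len direct repeats.
    Step 2: keep a pair only if the ir_len bases just inside the left copy
            are the reverse complement of the ir_len bases just inside the
            right copy.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for i in range(n - dr_len + 1):
        dr = seq[i:i + dr_len]
        left_ir = seq[i + dr_len:i + dr_len + ir_len]
        if "N" in dr or "N" in left_ir or len(left_ir) < ir_len:
            continue
        # The right direct repeat must leave room for both inverted repeats.
        for j in range(i + dr_len + 2 * ir_len, n - dr_len + 1):
            if seq[j:j + dr_len] != dr:
                continue
            right_ir = seq[j - ir_len:j]
            if right_ir == revcomp(left_ir):
                hits.append((i, j + dr_len, dr, left_ir, right_ir))
    return hits


for hit in find_te_candidates("NNAGTGGTNNNNACCAGTNNN", dr_len=3, ir_len=3):
    print(hit)
```

Real elements have degenerate repeats of unknown length, which is where a probabilistic model becomes more attractive than exact matching.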
[Figure: HMM with begin and end states. Labeled parts: complementary base states, inverted repeat, tandem repeat, transposase gene.]

Hierarchical and Constrained HMMs
[Figure: the inverted repeat, tandem repeat and transposase gene drawn as separate HMMs, each with its own begin and end states, together with the complementary base states.]
A hierarchical HMM is made by connecting the end and begin states of HMMs.

Constructing an HMM from words
apt tap sap tap sat spat step pats stop taps pits pots apps stat pass toss past saps pap asp sass
(1) Align words.
(2) Assign emissions to columns.
[Figure: HMM from begin to end with one state emitting a, e, o and three states each emitting p, s, t.]

Constructing a hierarchical HMM
[Figure: the same HMM, with the p, s, t states marked "We think these are the same state." It is split into two sub-HMMs, each with its own begin and end: 1. a state emitting p, s, t and 2. a state emitting a, e, o.]

Hierarchical constrained HMM
[Figure: sub-HMMs 1 and 2 connected between begin and end.]
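Steps (1) and (2) of "Constructing an HMM from words" can be sketched as follows. The gapped alignment is a hand-made assumption covering only some of the words from the slide (the lecture does not give its alignment), and `column_emissions` is an illustrative name. Each column becomes a candidate state with an occupancy (the fraction of words that use it) and an emission distribution.

```python
from collections import Counter

# Assumed alignment: optional leading s, consonant, vowel, consonant,
# optional trailing consonant. "-" marks a gap.
ALIGNED_WORDS = [
    "-tap-", "-sap-", "-sat-", "spat-", "step-", "-pats",
    "stop-", "-taps", "-pits", "-pots", "stat-", "-pass",
    "-toss", "-past", "-saps", "-pap-", "-sass",
]


def column_emissions(aligned, gap="-"):
    """Return one (occupancy, {letter: probability}) pair per column."""
    width = len(aligned[0])
    if any(len(word) != width for word in aligned):
        raise ValueError("aligned words must all have the same length")
    columns = []
    for c in range(width):
        counts = Counter(word[c] for word in aligned if word[c] != gap)
        total = sum(counts.values())
        probs = {ch: n / total for ch, n in sorted(counts.items())} if total else {}
        columns.append((total / len(aligned), probs))
    return columns


for index, (occupancy, probs) in enumerate(column_emissions(ALIGNED_WORDS)):
    letters = ", ".join(f"{ch}={p:.2f}" for ch, p in probs.items())
    print(f"column {index}: occupancy {occupancy:.2f}; {letters}")
```

Columns whose emission distributions look alike, such as the consonant columns here, are the candidates for being treated as the same state when building the hierarchical model.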


