Notes on Phylogenetic Analysis - Applied Bioinformatics | BIT 150, Study notes of Bioinformatics

Material Type: Notes; Class: Applied Bioinformatics; Subject: Biotechnology; University: University of California - Davis; Term: Winter 2006;

Uploaded on 07/31/2009

koofers-user-92k-1
Phylogenetic Analysis

1. Introduction
2. Construction of Phylogenetic Trees
   1. Construction and editing of a MSA
   2. Selection of a substitution model
   3. Tree building
      1. Distance based methods
         1. UPGMA
         2. Neighbor-joining
      2. Character based methods
         1. Maximum Parsimony
         2. Maximum Likelihood
   4. Tree evaluation
3. Software
   1. MEGA
   2. PHYLIP
   3. PAUP

Chapter 14

[Figure: events that do not fit a simple tree: polyploidy (A, B, C, D; 4x AB; 6x ABD), horizontal gene transfer, symbiosis]

[Figure: A) Rooted tree of Seq. A to Seq. D with nodes, branches, and a clade labeled; B) Unrooted tree of Seq. A to Seq. D]

Assumptions of Phylogenetic Analysis

1. Any group of organisms is related by descent
2. There is a bifurcating pattern of cladogenesis
3. Additional “default” assumptions
   1. Sequences are correct and homologous
   2. Each position is homologous
   3. All sequences share a common history
   4. The selected taxa are sufficient to solve the problem of interest and include representative variation of the group
   5. There is sufficient “phylogenetic signal”

Taxa     1 2 3 4 5 6 7 8 9
Seq. A   A A G A G T G C A
Seq. B   A G C C G T G C G
Seq. C   A G A T A T C C A
Seq. D   A G A G A T C C G

[Figure: example trees for Sequences 1 to 4 and Seq. A to Seq. D, scale bar 0.5]

Evolutionary changes of DNA Sequence

Sixteen nucleotide pairs:
• 4 identical (AA CC GG TT)
• 4 transitions (AG GA CT TC)
• 8 transversions (rest)

Expected: transversion rate β > transition rate α
Observed: transition rate α > transversion rate β

Jukes-Cantor 1-parameter model: all 12 substitution types have weight 1, so the two example paths cost the same: 1+1+1=3 = 1+1+1=3

Kimura 2-parameter model: transitions are “cheaper” than transversions: 1+1+2=4 < 1+2+2=5

Sequences shown in the model figure: ATATACAAAAA, ATATAAAAAAA, ATATACAAAAA, ATATATAAAAA, ATATAGAAAAA, ATATAGAAAAA, ATATAAAAAAA, ATATATAAAAA

[Figure: example tree for Sequences 1 to 4, scale bar 0.5]
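The pair counts and step costs above can be checked with a short sketch. This is only an illustration: the function and variable names are hypothetical, the weights (1 for every change in the one-parameter scheme, 1 per transition and 2 per transversion in the two-parameter scheme) are read off the slide arithmetic, and the list of three changes is made up for the example.

```python
from itertools import product

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify(x, y):
    """Label an ordered nucleotide pair as identical, transition, or transversion."""
    if x == y:
        return "identical"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"    # purine <-> purine or pyrimidine <-> pyrimidine
    return "transversion"      # purine <-> pyrimidine

# Count all 16 ordered pairs of A, C, G, T.
pair_counts = {"identical": 0, "transition": 0, "transversion": 0}
for x, y in product("ACGT", repeat=2):
    pair_counts[classify(x, y)] += 1

# Assumed step costs (taken from the slide arithmetic, not a fitted model).
ONE_PARAMETER = {"identical": 0, "transition": 1, "transversion": 1}
TWO_PARAMETER = {"identical": 0, "transition": 1, "transversion": 2}

def path_cost(changes, weights):
    """Sum the cost of a list of (from, to) nucleotide changes."""
    return sum(weights[classify(x, y)] for x, y in changes)

# Hypothetical set of three changes: two transitions and one transversion.
changes = [("A", "G"), ("C", "T"), ("A", "C")]
print(pair_counts)
print(path_cost(changes, ONE_PARAMETER), path_cost(changes, TWO_PARAMETER))
```

With one transition and two transversions instead, the two-parameter cost becomes 1+2+2=5, which is the other sum quoted in the notes.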
Tree building

Distance based methods
• Compute pair-wise distances
• Discard the actual data
• Use the distance matrix to construct the tree

Character based methods
• Derive trees to optimize the distribution of the actual data for each character
• Pair-wise distances are not fixed: they are determined by the tree topology.

Problems:
• Divergence encounters an upper limit as sequences become mutationally saturated. We cannot see multiple mutations at one site.
• A second mutation can reverse a site to its original state. A correction is needed for this problem.
• Homoplasy: two independent mutations to the same alternative state

Distance-based methods
• Based on genetic distances between sequence pairs in a MSA.
• Less computationally intensive than character-based methods.
• Can handle a large number of sequences

UPGMA (Unweighted Pair Group Method using Arithmetic averages)

UPGMA works well only when the rate of gene substitution is relatively constant across taxa (molecular clock). We are dividing by 2! Each distance is split equally between the two branches that lead to a node.

A. Sequences
B. Distances table: number of steps required to change one sequence into the other

Distance table after merging 1 & 2:

        1-2    3      4
1-2     -      4.5    10
3       -      -      7
4       -      -      -

1. Select closest: 1 & 2
2. Branch length 1 & 2 = 3 / 2 = 1.5
3. Merge 1 & 2
4. Calculate average distance 3 to 1-2: (4+5)/2 = 4.5
5. Branch length 3: 4.5/2 = 2.25
6. Branch group 1-2: 2.25 - 1.5 = 0.75
7. Calculate average distance 4 to 1-2-3: (10+10+7)/3 = 9
8. Branch length 4: 9/2 = 4.5
9. Branch group 1-2-3: 4.5 - 2.25 = 2.25

Distance table after merging 1-2 & 3:

        1-2-3   4
1-2-3   -       9
4       -       -

C. UPGMA phylogenetic tree

[Figure: UPGMA tree. Branch lengths: Sequence 1 = 1.5, Sequence 2 = 1.5, Sequence 3 = 2.25, Sequence 4 = 4.5; internal branches: group 1-2 = 0.75, group 1-2-3 = 2.25]

NJ (Neighbor Joining) is a simplified ME (minimum evolution) method. It is the fastest method.
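The UPGMA worked example above (closest pair 1 & 2 at distance 3, then (4+5)/2 = 4.5 and (10+10+7)/3 = 9) can be reproduced with a minimal sketch. The function name `upgma` and its return format are hypothetical. The full pairwise table is not legible in these notes, so the distances below are inferred from the arithmetic in the steps; which of sequences 1 and 2 is at distance 4 versus 5 from sequence 3 is an assumption and does not change the UPGMA result.

```python
from itertools import combinations

def upgma(names, pairwise):
    """Cluster by UPGMA.

    names: leaf labels; pairwise: dict mapping frozenset({a, b}) -> distance.
    Returns one record per merge: (cluster_a, cluster_b, node_height,
    branch_to_a, branch_to_b), where clusters are tuples of leaf labels.
    """
    clusters = [(n,) for n in names]
    height = {c: 0.0 for c in clusters}          # leaves sit at height 0
    dist = {frozenset([(a,), (b,)]): pairwise[frozenset([a, b])]
            for a, b in combinations(names, 2)}
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        h = dist[frozenset((a, b))] / 2           # "we are dividing by 2"
        merged = a + b
        merges.append((a, b, h, h - height[a], h - height[b]))
        # Distance to the merged cluster is the size-weighted average, which
        # equals the plain average over all leaf pairs.
        for c in clusters:
            if c != a and c != b:
                dist[frozenset((merged, c))] = (
                    len(a) * dist[frozenset((a, c))]
                    + len(b) * dist[frozenset((b, c))]
                ) / len(merged)
        clusters = [c for c in clusters if c != a and c != b] + [merged]
        height[merged] = h
    return merges

# Pairwise distances inferred from the worked steps (1-3 = 4, 2-3 = 5 assumed).
raw = {("1", "2"): 3, ("1", "3"): 4, ("2", "3"): 5,
       ("1", "4"): 10, ("2", "4"): 10, ("3", "4"): 7}
pairwise = {frozenset(k): v for k, v in raw.items()}
merges = upgma(["1", "2", "3", "4"], pairwise)
for a, b, h, branch_a, branch_b in merges:
    print(a, b, "height", h, "branches", branch_a, branch_b)
```

The node heights come out as 1.5, 2.25 and 4.5, and the internal branches as 0.75 and 2.25, matching steps 2 to 9.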
• Iterative algorithm that minimizes branch length at each step of the clustering
• The S value (the sum of branch lengths) is not computed for all topologies, but the examination of different topologies is embedded in the algorithm, so that only one final tree is produced
• The tree is “decomposed” from an unresolved star-tree. The most isolated neighbors are joined together. Calculate the distance of all taxa to the new node
• Neighbors are consolidated and the process is repeated, considering the joined neighbors as a single taxon and using the distances calculated in the previous step
• Produces an un-rooted tree
• Better for a large number of taxa or short sequences

Neighbor Joining

Also appropriate when rates of nucleotide substitution vary among taxa

[Figure: NJ tree for Sequences 1 to 4 with branch lengths 1.25, 1.75, 0.75, 5.125, 1.125, 2.25. NJ = ME]

Character-based methods: Maximum Parsimony

MP philosophy: “The best explanation is the simplest”
• MP searches for a tree that requires the minimum number of changes to explain the differences among the taxa studied.
• For a nucleotide site to be informative for constructing an MP tree (parsimony-informative), there must be at least two different kinds of nucleotides, each represented at least two times
• To accommodate substitution bias, MP can use substitution weight matrices (weighted MP)

[Figure: MP tree for Sequences 1 to 4]

MP performs poorly if
• there is abundant among-site rate heterogeneity
• there is too much homoplasy (backward & parallel substitutions)
• there is a small n or an insufficient number of informative sites
• there are long terminal branches and short internal ones

[Figure: MP relationship between whales as inferred by 21 SINE element insertions]

Parsimony analysis is useful for
• Large n with a low level of sequence divergence and constant rate
• Irreversible shared derived characters: synapomorphies
• duplications, inversions, and deletions of DNA segments
• insertion of retroelements, SINEs and LINEs, new introns

MP tree from previous 4 sequences
http://en.wikipedia.org/wiki/Maximum_parsimony

Character-based methods: Maximum Likelihood

ML philosophy: “search for the evolutionary model and tree that has the highest likelihood of producing the observed data”
• ML is a well-established statistical method of parameter estimation, and can be used to estimate branch lengths.
• ML evaluates the probability that the chosen evolutionary model has generated the observed data.
• ML is derived from each base position in a MSA. The likelihoods for all the sites are multiplied to give an overall likelihood of a tree
• The substitution model is optimized to fit the observed data.

ML problems
• ML uses great amounts of computation time (heuristic methods available)
• It is usually impractical to perform a complete search that simultaneously optimizes the substitution model and the tree.
• The likelihood function includes no parameters for topologies

ML is useful for
• Good estimates of branch lengths
• When data analysis proceeds according to the same model that generated the data, ML can outperform ME and MP

Bootstrap

Felsenstein’s bootstrap test (1985) is a test for the reliability of an inferred tree
• Based on the bootstrap re-sampling technique
• Construct your tree
• Select n nucleotide sites randomly (with replacement)
• Construct a new tree and compare topology
• Give a 1 to consistent interior branches (0 to others)
• Repeat several hundred times
• Compute the % of times a branch receives a 1
• This is the bootstrap confidence value PB
• In NJ, PB is the probability that a branch is > 0
• PB > 95%: significant
• In MP it is usual to produce a bootstrap consensus tree

Comparison between methods:

Only for equal rates of nucleotide substitution: UPGMA

Fastest: UPGMA and NJ (ME with a preliminary NJ is next)

Consistency (correct topology): NJ, ME, LS, ML (if unbiased estimates of nucleotide substitutions are used). MP is sometimes inconsistent (long branch attraction)

Statistical tests of consistency: well established only for NJ, ME (and generalized Least Squares methods)

Reliability of branch length estimates: theoretically ML, LS, NJ, and ME are more reliable than MP.

Probability of true topology: difficult to establish. The bootstrap consensus trees constructed for NJ, MP and ML are usually very similar. Some particular tests have shown NJ>MP>FM and ME>LS

At the end of the day
• NJ with bootstrapping is usually a good choice, well accepted by most journals (we used it in our recent PNAS and Science papers)
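The bootstrap steps listed above (build a tree, resample n sites with replacement, rebuild, score the interior branch 1 or 0, repeat, report the percentage) can be sketched for the smallest possible case. Everything here is an assumption made for illustration: the function names, the toy alignment, and the tree-building rule. With four taxa an unrooted tree has a single interior branch, and this sketch picks the split {i,j}|{k,l} with the smallest d(i,j) + d(k,l) on raw mismatch counts as a stand-in for a real distance method; it is not the procedure used by MEGA, PHYLIP, or PAUP.

```python
import random

def mismatches(s, t):
    """Number of differing sites between two aligned sequences."""
    return sum(a != b for a, b in zip(s, t))

def best_split(seqs):
    """For exactly four taxa, return the pairing with the smallest summed distance.

    Ties are broken by taking the first pairing in the list below.
    """
    a, b, c, d = sorted(seqs)
    splits = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    return min(
        splits,
        key=lambda s: mismatches(seqs[s[0][0]], seqs[s[0][1]])
        + mismatches(seqs[s[1][0]], seqs[s[1][1]]),
    )

def bootstrap_support(seqs, n_reps=500, seed=0):
    """Return (reference split, % of replicates that recover it)."""
    rng = random.Random(seed)                     # fixed seed for repeatability
    length = len(next(iter(seqs.values())))
    reference = best_split(seqs)                  # tree from the original data
    hits = 0
    for _ in range(n_reps):
        # Draw n site indices with replacement and rebuild the alignment.
        cols = [rng.randrange(length) for _ in range(length)]
        resampled = {name: "".join(s[i] for i in cols) for name, s in seqs.items()}
        if best_split(resampled) == reference:    # interior branch gets a 1
            hits += 1
    return reference, 100.0 * hits / n_reps

# Hypothetical toy alignment, not taken from the notes.
toy = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTTC",
    "C": "TCGAACCTAG",
    "D": "TCGAACCTTG",
}
split, support = bootstrap_support(toy)
print(split, support)
```

For real data the same loop would wrap a full tree-building method and score every interior branch separately, which is what produces the per-branch PB values described in the notes.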