CMSC423: Bioinformatic Algorithms, Databases and Tools
Lecture 14: phylogenetic trees
CMSC423 Fall 2008

Phylogeny questions
• Given several organisms and a set of features (usually sequence, but also morphological: wing shape, color, ...)
• A. Given a phylogenetic tree, figure out what the ancestors looked like (what are the features of the internal nodes)
• B. Find the phylogenetic tree that best describes the common evolutionary heritage of the organisms
[Figure: organisms annotated with features (wings, feathers, teeth claws, no wings, fur) and an unknown ancestor marked "?"]
[Figure: node labels A, B, C, AB]

Example
[Figure: example with binary (0/1) character states]

Sankoff's algorithm
• At each node v in the tree, store s(v,t): the best parsimony score for the subtree rooted at v if the character stored at v is t
 – for a leaf, s(v,t) = 0 if t is the observed character and infinity otherwise
• Traverse the tree in post-order and update s(v,t) as follows
 – assume node v has children u and w
 – $s(v,t) = \min_i \{s(u,i) + score(i,t)\} + \min_j \{s(w,j) + score(j,t)\}$
• The character at the root is the one that minimizes s(root, t)
• Note: this solves the weighted version. For the unweighted version, set score(i,i) = 0 and score(i,j) = 1 for any i ≠ j

Trees as clustering
• Start with a distance matrix: the distance (e.g. alignment distance) between any two sequences (leaves)
• Intuitively, we want to cluster together the most similar sequences
• UPGMA: Unweighted Pair Group Method using Arithmetic averages
 – Build a pairwise distance matrix (e.g. from a multiple alignment)
 – Pick the pair of sequences that are closest to each other and cluster them: create an internal node that has the sequences as children
 – Repeat, including newly created internal nodes in the distance matrix
 – Key element: must be able to quickly compute the distance between clusters (internal nodes), the weighted distance

$$D(cl_1, cl_2) = \frac{1}{|cl_1|\,|cl_2|} \sum_{p \in cl_1,\; q \in cl_2} D(p, q)$$

Trees as clustering
• Note that both UPGMA and neighbor joining (NJ) assume the distance matrix is additive, i.e. distances are path lengths in the tree: D(i,j) + D(j,k) = D(i,k) whenever node j lies on the tree path between i and k. This is usually not exactly true for real data, but close
• Also, when the distance matrix is additive, NJ can be proven to build the correct tree
• But simple alignment distance is not a good metric

Maximum likelihood
• For every branch S->T of length t, compute P(T|S,t): the likelihood that sequence S could have evolved in time t into sequence T
• Find the tree that maximizes the likelihood
• Note that the likelihood of a given tree can be computed with an algorithm similar to Sankoff's
• However, there is no simple way to find the best tree given the sequences; most approaches use heuristic search techniques
• Often, start with the NJ tree, then "tweak" it to improve the likelihood

Tree analysis & display

Drawing trees
• Trees are easy to draw: we just need to figure out how much space the leaves will take
• Step 1: calculate how much space each node will take (how many leaves lie below the current node)
• Step 2: spread out the nodes according to their number of leaves
• Many ways of optimizing: e.g. width, area
• For large trees
 – 3D displays (there's more room in 3D)
 – interactive displays (expand/contract nodes as needed)

Analysis example
• Build multiple alignment (e.g. Muscle, ClustalW)
• Clean up alignment
 – manual editing
 – filters (pre-defined structure information)
• Build tree
 – PAUP: parsimony & others
 – Phylip: maximum likelihood
 – Tree-Puzzle: maximum likelihood
 – etc. (many packages)
• Integrated system
 – ARB: www.arb-home.de

Antibiotic resistance in Staphylococcus aureus
[Figure: phylogenetic tree of S. aureus strains]
• Green boxes: individual strains in a phylogenetic tree
• Red diamonds, yellow triangle: acquisition of resistance
• Hexagon: loss of resistance
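
Worked sketch: Sankoff's algorithm
The post-order recurrence from the Sankoff's algorithm slide can be written out in a few lines. This is only a minimal sketch, not code from the lecture: the function name `sankoff`, the dictionary-based tree representation, the four-leaf example and the leaf base case (0 for the observed character, infinity otherwise) are all assumptions made for illustration.

```python
from math import inf


def sankoff(tree, root, leaf_states, alphabet, score):
    """Fill s[v][t] for a rooted binary tree (hypothetical representation).

    tree:        dict mapping each internal node to its two children (u, w)
    leaf_states: dict mapping each leaf to its observed character
    score:       function score(i, t) = cost of child state i under parent state t
    """
    s = {}

    def visit(v):
        if v in leaf_states:
            # Assumed base case: a leaf can only carry its observed character.
            s[v] = {t: 0 if t == leaf_states[v] else inf for t in alphabet}
            return
        u, w = tree[v]
        visit(u)  # post-order: children are scored before the parent
        visit(w)
        s[v] = {
            t: min(s[u][i] + score(i, t) for i in alphabet)
            + min(s[w][j] + score(j, t) for j in alphabet)
            for t in alphabet
        }

    visit(root)
    # The root character is the one with the smallest score.
    best = min(alphabet, key=lambda t: s[root][t])
    return s, best, s[root][best]


def unit_cost(i, t):
    # Unweighted parsimony: 0 for no change, 1 for any substitution.
    return 0 if i == t else 1


# Illustrative four-leaf tree ((a, b), (c, d)) with made-up characters.
example_tree = {"root": ("x", "y"), "x": ("a", "b"), "y": ("c", "d")}
example_leaves = {"a": "A", "b": "C", "c": "A", "d": "A"}
table, root_char, root_score = sankoff(
    example_tree, "root", example_leaves, "ACGT", unit_cost
)
print(root_char, root_score)
```

In this example a single substitution (on the branch leading to leaf b) explains the data, so the root is assigned A with score 1. Passing a different `score` function gives the weighted version without changing the traversal.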
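
Worked sketch: UPGMA
The UPGMA steps from the Trees as clustering slide (pick the closest pair, merge, repeat) can be sketched as below. The names `average_distance` and `upgma`, the nested-tuple tree output and the example distances are assumptions for illustration. Note also that this naive version recomputes each cluster distance directly from the definition of D(cl1, cl2); it does not implement the fast incremental update that the slide calls the key element, and it does not compute branch lengths.

```python
from itertools import combinations


def average_distance(members1, members2, dist):
    """D(cl1, cl2): mean of D(p, q) over all p in cl1 and q in cl2."""
    total = sum(dist[frozenset((p, q))] for p in members1 for q in members2)
    return total / (len(members1) * len(members2))


def upgma(names, dist):
    """Return the tree topology as nested tuples.

    names: list of leaf labels
    dist:  dict mapping frozenset({a, b}) to the distance between leaves a and b
    """
    # Each cluster is keyed by its subtree and stores the leaves it contains.
    clusters = {name: [name] for name in names}
    while len(clusters) > 1:
        # Pick the pair of current clusters with the smallest average distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(
                clusters[pair[0]], clusters[pair[1]], dist
            ),
        )
        # Create an internal node (a, b) that has both clusters as children.
        clusters[(a, b)] = clusters.pop(a) + clusters.pop(b)
    (tree,) = clusters
    return tree


# Made-up distances for four sequences.
example_dist = {
    frozenset("AB"): 2, frozenset("AC"): 6, frozenset("AD"): 10,
    frozenset("BC"): 6, frozenset("BD"): 10, frozenset("CD"): 10,
}
example_upgma_tree = upgma(["A", "B", "C", "D"], example_dist)
print(example_upgma_tree)
```

Here A and B are merged first (distance 2), then C joins them (average distance 6), and D is attached last.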
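
Worked sketch: drawing trees
The two steps from the Drawing trees slide (count the leaves under each node, then spread the nodes out accordingly) might look like the following. This is one possible layout among many, assumed for illustration: each leaf gets one vertical slot and each internal node is centered over the slots of its leaves. The names `count_leaves` and `layout` and the example tree are hypothetical.

```python
def count_leaves(tree, node, counts):
    """Step 1: record how many leaves lie below each node."""
    children = tree.get(node, ())
    if not children:
        counts[node] = 1
    else:
        counts[node] = sum(count_leaves(tree, c, counts) for c in children)
    return counts[node]


def layout(tree, root):
    """Step 2: give every node a vertical position based on its leaf count.

    tree: dict mapping each internal node to a tuple of its children
    """
    counts = {}
    count_leaves(tree, root, counts)
    y = {}

    def place(node, first_slot):
        # The node owns slots first_slot .. first_slot + counts[node] - 1
        # and is drawn at the middle of that range.
        y[node] = first_slot + (counts[node] - 1) / 2
        for child in tree.get(node, ()):
            place(child, first_slot)
            first_slot += counts[child]

    place(root, 0)
    return counts, y


# Illustrative tree ((a, b), c).
leaf_counts, y_pos = layout({"root": ("x", "c"), "x": ("a", "b")}, "root")
print(leaf_counts, y_pos)
```

The horizontal coordinate is left out here; it could be taken from node depth or from branch lengths, depending on what the drawing should emphasize.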