Review of Phylogenetic Tree Construction | CECS 660, Lab Reports of Computer Science

Material Type: Lab; Class: INTRO TO BIOINFORMATICS; Subject: Computer Engr & Computer Sci; University: University of Louisville; Term: Unknown 2007;

Typology: Lab Reports

Pre 2010

Uploaded on 09/17/2009

Review of Phylogenetic Tree Construction

Jeffrey Rizzo¹ and Eric C. Rouchka¹,*

TR-ULBL-2007-01
November 19, 2007

¹University of Louisville, Speed School of Engineering, Department of Computer Engineering and Computer Science, 123 JB Speed Building, Louisville, Kentucky, USA 40292
jvrizzo@gmail.com; eric.rouchka@louisville.edu

University of Louisville Bioinformatics Laboratory Technical Report Series

ABSTRACT

Motivation: Phylogenetic tree construction is a complex yet important problem in the field of bioinformatics. Once constructed, a phylogenetic or evolutionary tree can lend insight into the evolution of different species. The difficulty is that, for a large number of species, the computational complexity of the problem grows beyond what can easily be solved. For this reason, new methods are being researched and applied to phylogenetic tree construction, and they have provided some promising results. Two topics of interest for this paper are Ant Colony Optimization and Particle Swarm Optimization, both of which are based on algorithms discovered by studying patterns in nature.

1 INTRODUCTION

Phylogenetics is the science that studies the evolutionary relationships between species. In order to make predictions about these relationships, phylogenetic trees are constructed that link the species.
The problem of phylogenetic tree construction is widely accepted as an area in need of more research in bioinformatics, as it is considered to be an NP-complete problem [1], meaning that no efficient (polynomial-time) exact algorithm for it is known.

A relationship between two species is classified as a phylogeny. A phylogenetic tree is a binary tree representation of the resulting relationship. There are two main types of trees: 1) rooted trees, which have a single node from which all other nodes are derived, and 2) unrooted trees, which do not originate from one clear node. The tree follows standard graph theory notation, where each species is represented as a node or leaf and the relationship between species is referred to as an edge or branch. The lengths of the branches represent the estimated time between the species. A simple representation of this is illustrated in Figure 1.

Fig. 1. (a) Unrooted tree, (b) rooted tree.

*To whom correspondence should be addressed.

Table 1 illustrates the complexity of enumerating possible tree configurations by showing the number of rooted and unrooted trees as a function of the number of operational taxonomic units (OTUs). An OTU is an extant taxon located at an external node, or leaf, of the tree. The table is formulated using Equation 1 for the number of unrooted trees and Equation 2 for the number of rooted trees, where n is the number of OTUs. Although the number of possible trees is large, it is important to note that there is only one “true” or correct tree along which the species have evolved. Finding that one correct tree can therefore become a computational nightmare without an efficient algorithm or strategy.

    N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}    (Eq. 1)

    N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}    (Eq. 2)

Table 1. Number of phylogenetic trees. (Adapted from [2].)
# of OTUs    Number of rooted trees     Number of unrooted trees
3            3                          1
4            15                         3
…
10           34,459,425                 2,027,025
15           213,458,046,676,875        8 x 10^12
20           8 x 10^21                  2 x 10^20
50           2.8 x 10^76                3 x 10^74

2 METHODS

Phylogenetic tree construction methods are widely accepted to fall into one of two categories: distance based and character based. Both categories offer a wide variety of options, but they approach tree construction from different directions. The most common distance-based methods are the unweighted pair group method using arithmetic averages (UPGMA) [3], Neighbor Joining [4], and the Fitch and Margoliash [5] algorithms, all of which start from the creation of a distance matrix. The alternative is the character-based methods, such as maximum parsimony [6] and maximum likelihood [7], which work directly on the sequence characters; maximum likelihood in particular takes a probabilistic approach to tree construction. An attempt to introduce these topics [...]

[...] their approach creates a fully connected graph of the species and constructs a distance matrix that supplies the distance between each pair of nodes. From there the problem is solved much like the TSP: the ants begin at random nodes and explore the search space of the graph, updating the edges based on the heuristics. The transition rule (Eq. 3) gives the probability that the k-th ant, beginning at node i, goes to node j in its next step. Pk(i,j) is the probability of transition from node i to j, τ is the pheromone trail between two nodes, d(i,j) is the evolutionary distance, and α and β are constants. The rule also depends on the set of nodes connected to node i and already visited by the k-th ant.

(Eq. 3)

The equation has two main components, which allow it to apply the evolutionary distance as well as the experience of the ants: first, the distance between the two nodes, and second, the accumulated pheromone levels. Since τ is dynamically updated, it represents the “attractiveness” of node j when the ant is at node i.
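As a rough illustration of a transition rule of this kind, the following Python sketch uses the standard Ant System weighting, in which a move from i to j is weighted by τ(i,j)^α · (1/d(i,j))^β and the weights are normalized over the candidate nodes. This is an assumption: the exact form of Eq. 3 is not reproduced in this text, and the function names, the candidate list, and the example values are hypothetical.

```python
import random


def transition_probabilities(i, candidates, tau, dist, alpha=1.0, beta=2.0):
    """Return {j: probability} of moving from node i to each candidate node j.

    Assumed (standard Ant System) weighting, not necessarily Eq. 3:
        weight(i, j) = tau[i][j]**alpha * (1 / dist[i][j])**beta
    """
    weights = {
        j: (tau[i][j] ** alpha) * ((1.0 / dist[i][j]) ** beta)
        for j in candidates
    }
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


def choose_next(i, candidates, tau, dist, rng=random, **params):
    """Sample the next node according to the transition probabilities."""
    probs = transition_probabilities(i, candidates, tau, dist, **params)
    nodes = list(probs)
    return rng.choices(nodes, weights=[probs[j] for j in nodes], k=1)[0]


# Hypothetical three-species example: uniform pheromone, symmetric distances.
dist = [[0, 2, 4],
        [2, 0, 3],
        [4, 3, 0]]
tau = [[1.0] * 3 for _ in range(3)]
probs = transition_probabilities(0, [1, 2], tau, dist)
next_node = choose_next(0, [1, 2], tau, dist)
```

With uniform pheromone, the closer species (node 1) receives the larger probability; as pheromone accumulates on an edge, τ shifts the balance toward paths that earlier ants found useful.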
The ant thus picks a path that minimizes the transition function and finds the shortest evolutionary distance between two species. To differ from traditional ACO, the system presented uses an intermediary node created between the two previously visited nodes. This node, denoted u, represents the ancestor of the two nodes and is not in the list of nodes to be set in the tree. It is used to help recalculate the distances to the remaining nodes by Eq. 4.

(Eq. 4)

The above steps repeat until all nodes are added to the list of visited nodes, at which point a final path has been found. The final distance, or score, of the path is calculated by summing the transition probabilities along the path of adjacent nodes. When the path is constructed, the pheromone level is incremented for all nodes belonging to the path and decremented for the evaporation effect. The pheromone trail matrix is updated according to Eq. 5, where ρ represents the evaporation rate of the pheromone and Δτ(i,j) is the rate of increment of pheromone obtained from Eq. 6.

(Eq. 5)

(Eq. 6)

These equations allow for the determination of the best path up to the point of the calculation. Using the procedure, each cycle moves closer toward constructing a tree. Once the specified number of cycles is completed, the best path found can be used to create the tree. ACO has thus provided a linear list representing the best path found, and Algorithm 3 must now be used to create the tree.

Algorithm 3. Tree construction algorithm (from [13]).

    Tree Construction Algorithm
    WHILE NOT (all species grouped)
        FIND the i, j pair that has the largest value in the pheromone matrix
        IF (i OR j) already grouped, change index by group index
        GROUP the i, j pair into a new species k
        COMPUTE the distance between current species and ancestor
        DELETE the value of the i, j pair
    END

Perretto and Lopes found that their method was successful when compared with other, more common methods. They completed tests using mtDNA from 20 different species of mammals. The resulting phylogenetic trees were compared to those from the common PHYLIP software package, using two of its common algorithms, Neighbor Joining and Fitch and Margoliash. To compare the different trees, they used the tree structure and the total distance between nodes (Eq. 7) that was proposed by Kumnorkaew et al. [14].

(Eq. 7)

This distance measure is similar to the quadratic error computation. The trees obtained with the mtDNA are shown in Fig. 3 for ACO (A) and Neighbor Joining (B). As shown, the two trees are similar, but there are small discrepancies between them that create differences in the distances between species. Table 2 shows the computed distance totals between branches using the different methods. The differences in the distances are due to the differences in the tree groupings seen in Fig. 3.

Fig. 3. Trees produced using ACO (A) and Neighbor Joining (B) methods [12].

Table 2. Distance comparisons between methods. (Taken from [13].)

Algorithm               Distance
ACO                     351.56
Fitch and Margoliash    352.27
Neighbor Joining        354.23

These preliminary results provide evidence that ACO can successfully be used to create phylogenetic trees and, more importantly, that ACO may provide better results than traditional methods. The method has a number of parameters used throughout the above equations, including α and β, which were experimented with but not tuned to optimal values. For this reason the evidence is only preliminary and cannot be considered proof that ACO is always more accurate. These parameters also leave room for improvement of the ACO algorithm presented: if they can be optimized, the overall efficiency and accuracy of the method could improve. The results are thus considered promising.

ACO has also been used for other experiments [15].
Here the approach to ACO is slightly different from that discussed above, but the result is the same. For their experiments, the team generated sequences using Seq-Gen, which allowed them to know the correct tree for the set of sequences used in their simulation. They then used the distance matrix between the sequences as computed by PHYLIP, a software package. This input was fed into the FITCH and Neighbor Joining algorithms in PHYLIP for comparison with their proposed ACO method. They used two trees of different sizes, with 8 and 16 leaves, respectively. For the smaller 8-leaf tree, the ant algorithm gave the best general result [16]. However, for 16 leaves FITCH gave the best result. More simulation should be completed on larger trees, but the conclusion is made that ACO has proved comparable to traditional methods.

The same group also tested with an example protein family, the Cytochrome P450 CYP050A. The sequences of 15 species were taken, with lengths varying from 414 to 561 (Fig. 4). The multiple sequence alignment was obtained using ClustalW and the distance matrix was computed by PHYLIP. Here all three methods (FITCH, Neighbor Joining, and ACO) agreed on the tree; however, the ant algorithm gave the best score of 4023.8 [16].

Fig. 4. The 15 species and the lengths of their CYP sequences [16].

Another approach to ACO was presented in [17]. Here the process of constructing the tree uses a pheromone graph that is adaptively updated according to the pheromone level left by the ants. The presented algorithm dynamically adjusts each ant's influence on the trail information update, as well as the selection probability of each path, according to the equilibrium of the ant distribution.

The three main steps of the phylogenetic tree construction are initialization, construction based on the optimal path found by the ants, and lastly optimization.
The method designates the input species as cities in the TSP and, in the first step, constructs a fully connected graph using the distance matrix among the species. Nodes represent the species, and the edges, or distances between the cities, represent the calculated distances between the species.

In constructing the tree, ants are sent out to find the optimal path, with each ant starting at a randomly selected node or city. ACO theory has each ant pick its next destination based on the probability function used to describe the graph. This algorithm dynamically adjusts the equation that determines the probability, based on the equilibrium of the path distribution. Every ant must pass through each city, and all nodes must be stored in the open list, or the list of already visited nodes.

It is also necessary to obtain the evolutionary distance between two species based on one of several methods. In this algorithm, the cosine distance is introduced: the two species in the data set are represented as vectors, and the cosine of the angle between them defines the evolutionary distance. It is also necessary to measure the distribution of the solutions and the trail updating information. The “gathered degree” method was used to measure this, and it allowed the algorithm to determine the probability of each path dynamically according to the ant distribution. The gathered degree (Eq. 8) thus determines the paths that the ants can take from each node. Ants thus consider the paths with the highest trail information (pheromone level). The number of paths to consider must be limited, so an equation is introduced that helps limit local optimization and ensures that the paths with the highest trail information are considered with the highest probability (Eq. 9). This equation for w paths also allows ants to consider only the paths with the highest trail information in their selection process.
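The cosine-based distance described above can be illustrated with a short Python sketch. How species are encoded as vectors is not specified in this text, so the sketch assumes each species is already represented by a numeric feature vector (the example vectors are hypothetical). It also assumes the common convention of taking the distance as one minus the cosine of the angle, so that identical directions give a distance of zero; the text itself says only that the cosine of the angle defines the distance.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Assumed convention: distance = 1 - cos(angle)."""
    return 1.0 - cosine_similarity(u, v)


def distance_matrix(vectors):
    """Pairwise cosine distances, usable as the input graph for the ants."""
    n = len(vectors)
    return [
        [0.0 if i == j else cosine_distance(vectors[i], vectors[j])
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical feature vectors for three species.
species_vectors = [[1, 0, 2], [2, 0, 4], [0, 3, 0]]
D = distance_matrix(species_vectors)
```

In this example the first two vectors point in the same direction, so their distance is (numerically) zero, while the third is orthogonal to both and sits at distance 1.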
The general principle is that paths with more trail information and smaller local distances are selected with higher probability.

(Eq. 8)

(Eq. 9)

Once the ants have completed their “work,” the construction of the tree is simple. The pheromone matrix produced by the ants through the TSP optimization is used to construct the tree, with the similarity between objects given by the pheromone levels on the edges of the graph. Essentially, the next node is selected by finding the edge with the largest amount of pheromone. The full algorithm used is shown in Algorithm 4.

Algorithm 4. Tree Construction Algorithm. From [12].

Experiments were completed using 14 species on the proteins hemoglobin alpha-I and cytochrome c. The results were compared to the neighbor joining method (Clustal X) and to the TSP approach. There were significant differences in the tree configurations, but quantitatively the differences are small. The proposed method using ACO provided shorter branch lengths in all of the experiments run. The authors state that, with the basic parameters set, their method gave better results than the TSP approach in 94.11% of cases. The conclusion is that the experiments show higher-quality results than other algorithms, and the authors suggest that more studies should be done in this area of phylogenetic tree construction using Ant Colony theory.

From the three ACO experiments presented above, it is evident that ACO has a role in phylogenetic tree construction. However, more work needs to be completed. All three groups introduced the problem and gave viable solutions that proved successful in their tests, yet each group also noted that its tests were not completely conclusive. More studies and more experimentation are needed, but these initial results are promising.

2.2.2. Particle Swarm Optimization

Particle Swarm Optimization (PSO) is a method for solving complex optimization problems that draws on a different phenomenon from nature.
PSO was originally developed in 1995 by Kennedy and Eberhart [18], and it belongs to the category of swarm intelligence methods. The idea is rooted in the study and simulation of simplified social models of bird flocks and fish schools. PSO is a population-based algorithm that exploits a population of individuals to probe promising regions of a search space. Unlike ACO, where the ant colonies were found to communicate through a concept known as stigmergy, birds were found to use a communication strategy similar to a broadcast. Each agent is a particle-like structure with the following parts: the coordinates of its current location in the optimization landscape, the best solution point it has visited so far, and the subset of other agents seen as neighbors [11].

As shown in the algorithm, during each iteration each particle accelerates in the direction of its own personal best solution found so far, as well as in the direction of the global best position discovered so far [18]. This means that when one particle discovers something in the right direction, it moves in that direction and all the others follow it. Essentially, one bird moves in a certain direction and the rest of the flock adjust their positions relative to it in order to maintain their formation. An effective algorithm thus has the daunting task of balancing the individual influence with the swarm influence and effectively leading the general movement of the swarm.

The equations used to update a particle's position and speed consist of three parts. The first part is the previous speed of the swarm; the second is the cognition model (the thought of the swarm); the third is the social model [1]. The first gives the swarm balance, the second allows the swarm to search the whole space and avoid becoming trapped in local optima, and the third reflects the information that is broadcast or communicated within the swarm.
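The three-part update described above can be sketched in Python as follows. This is the widely used continuous form of PSO with an inertia weight, shown only to make the three terms concrete; it is not the discrete PSO variant applied to tree construction in [1]. The parameter names (w, c1, c2), their default values, and the helper `minimize` are assumptions made for this illustration.

```python
import random


def pso_step(positions, velocities, pbest, gbest,
             w=0.7, c1=1.5, c2=1.5, rng=random):
    """Update every particle's velocity and position in place."""
    for p in range(len(positions)):
        for d in range(len(positions[p])):
            r1, r2 = rng.random(), rng.random()
            inertia = w * velocities[p][d]                           # previous speed
            cognition = c1 * r1 * (pbest[p][d] - positions[p][d])    # own best
            social = c2 * r2 * (gbest[d] - positions[p][d])          # swarm best
            velocities[p][d] = inertia + cognition + social
            positions[p][d] += velocities[p][d]


def minimize(f, dim=2, n_particles=10, iters=100, seed=0):
    """Run the update loop on objective f and return the best point found."""
    rng = random.Random(seed)
    positions = [[rng.uniform(-5, 5) for _ in range(dim)]
                 for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in positions]
    gbest = min(pbest, key=f)[:]
    for _ in range(iters):
        pso_step(positions, velocities, pbest, gbest, rng=rng)
        for p in range(n_particles):
            if f(positions[p]) < f(pbest[p]):
                pbest[p] = positions[p][:]   # remember the personal best
        gbest = min(pbest, key=f)[:]         # "broadcast" the global best
    return gbest


def sphere(x):
    """Simple test objective with its minimum at the origin."""
    return sum(v * v for v in x)


best = minimize(sphere)
```

The inertia term carries the particle along its previous course, the cognition term pulls it toward its own best point, and the social term pulls it toward the best point known to the swarm; the relative sizes of w, c1, and c2 set the balance between individual and swarm influence.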
The experimentation with and development of “Discrete Particle Swarm Optimization” for phylogenetic tree construction gave results showing that PSO is well suited to cases in which the number of sequences is less than 40. The analysis found that PSO converges on the best tree more quickly than a genetic algorithm, but that its efficiency decreases if the number of sequences is too large [1].

3 RESULTS

The results of the preceding sections give some evidence that phylogenetic tree construction is a complex problem that is difficult to solve. By providing new methods of optimizing and searching for the best tree, some of the traditional methods are challenged and reviewed closely for their relevance and acceptance. ACO and PSO have both been shown, though only in small studies, to be more efficient and/or accurate than traditional methods such as Neighbor Joining and Maximum Likelihood, but these small studies are only a step toward the research that will be done in the future. As more and more sequences become readily available, the need to construct evolutionary trees will continue to grow, and new methods for doing so will be sought.

Ant Colony Optimization seems to be the more promising of the two methods explored within this paper, mainly because more research has been conducted in that direction. It appears that more researchers are using Ant Colony as their method of choice within Swarm Intelligence, the branch in which ACO and PSO are the two forerunners.
Perhaps the development of ACO about five years prior to PSO has provided the algorithm with more accep-

Algorithm: Tree construction algorithm

    Step 1: Initialization
        1.1 Initialize parameters and the pheromone graph;
        1.2 For each ant do
                choose an initial node to visit randomly;
            End for
    Step 2: Iterative process // finding the optimal path
        While (not terminal conditions) do
            2.1 For i = 1 to n do // n is the number of nodes in the graph
                    Compute the gathered degree of node i and the distribution extent w of the ants in this iteration according to equations (8) and (9);
                    For k = 1 to m do // m is the number of ants
                        2.1.1 Select the next node j to visit;
                        2.1.2 Update the trail information on path (i, j) locally, according to equation (11);
                    End for k
                    If, among the m ants, the total length of the path that some ant has traveled so far exceeds the length of the optimal path obtained in the last iteration, then stop the iteration of this ant;
            2.2 Update the trail information globally for each path of all sequences;
        End while
    Step 3: Tree construction
        While not (all species grouped)
            Find a pair of species i, j that have the largest value in the pheromone matrix;
            If (i or j) already grouped, change index by group index;
            Group the i, j pair into a new species k;
            Compute the distance between current species and ancestor;
            Delete the value of the i, j pair;
        End
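The grouping loop in Step 3 can be sketched in Python as below. The sketch repeatedly joins the pair of groups with the largest pheromone value and returns the tree as nested tuples. The pseudocode does not say how values between a newly formed group and the remaining groups are obtained, so the sketch averages the values of the two merged members; that rule, the function name `build_tree`, and the example matrix are assumptions made for illustration only.

```python
def build_tree(labels, pheromone):
    """Greedily group species by largest pheromone value; return nested tuples."""
    # Active groups: id -> subtree (a label or a nested tuple of labels).
    clusters = {i: labels[i] for i in range(len(labels))}
    # Pheromone value for every unordered pair of active groups.
    value = {
        frozenset((i, j)): pheromone[i][j]
        for i in clusters for j in clusters if i < j
    }
    next_id = len(labels)
    while len(clusters) > 1:
        # Find the pair with the largest value in the pheromone matrix.
        i, j = sorted(max(value, key=value.get))
        # Group the pair into a new "species" with a fresh id.
        merged = (clusters.pop(i), clusters.pop(j))
        # Assumed rule: the new group's value to every remaining group is
        # the average of the two merged members' values.
        new_values = {
            frozenset((next_id, k)):
                (value[frozenset((i, k))] + value[frozenset((j, k))]) / 2.0
            for k in clusters
        }
        # Delete all entries involving the merged pair, then add the new ones.
        value = {p: v for p, v in value.items() if i not in p and j not in p}
        value.update(new_values)
        clusters[next_id] = merged
        next_id += 1
    return next(iter(clusters.values()))


# Hypothetical pheromone matrix for four species.
P = [[0, 9, 1, 1],
     [9, 0, 2, 1],
     [1, 2, 0, 8],
     [1, 1, 8, 0]]
tree = build_tree(["A", "B", "C", "D"], P)
```

In this example A and B are joined first, then C and D, and finally the two groups are joined, giving `(("A", "B"), ("C", "D"))`.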