Conceived by Mikael Thollesson (c) The authors(s) and SweLL, 2001

Finding the tree - Search methods

Having a criterion for how good a tree is, is one thing. Another story is to find the tree that is best according to this criterion. This problem is the same regardless of which criterion is used and, as it turns out, a very hard one as well. It belongs to a class of problems known as NP-complete. For these problems, there is no faster method known – that is also guaranteed to find the correct (best) solution – than to evaluate all possible solutions. For data sets with few taxa (8-15 or so, depending on criterion used) exact methods - methods that are guaranteed to find the optimal tree - can be used. For larger data sets computing time will be prohibitively large and we have to use methods that are not guaranteed to find the optimal tree, heuristic approaches.

Exact methods

Exhaustive enumeration

If we evaluate each and every possible tree in turn, we will of course find the best tree. However, as we have seen previously, the number of possible trees will become very large even for a modest number of terminal nodes and this approach may only be feasible up to 11 taxa. (How many possible trees corresponds to 11 taxa?)

Branch and bound

We might do a little better if we can exclude some of the trees and just evaluate a subset of all possible trees. This can be achieved by first getting a fairly good (but not necessarily optimal) tree by some quick methods (below) and than assembling a tree by adding one taxon at a time (in a sequence determined by some algorithm). If we are lucky, the score of the tree will exceed our first score when we try to add taxa to some of the branches. As the score can never decrease when adding taxa to the tree, all trees that can be built by adding remaining taxa must have a worse score and consequently those trees need not to be considered further.


The success of such a method, called branch-and-bound, will depend on the data. “Messy” data will decrease the efficiency and in the worst case, we will have to evaluate all possible trees and thus perform an exhaustive search. It may be worth trying branch-and-bound for up to 15-20 taxa or so.

The outline above is an oversimplification of the algorithms implemented in current software. Several enhancements are used to enhance the efficiency of the branch-and-bound algorithm for finding trees, such as

More details on branch-and-bound algorithms in phylogeny can be found in Hendy & Penny (1982).

Heuristic methods

For most data sets we are forced to turn to heuristic or “quick-and-dirty” search methods, methods that try to find the best tree by reducing the set of trees examined and just calculating the score for some “likely” trees. However, these methods are not guaranteed to find the best tree(s) and one frequently will need to vary some of the parameters and try several times to get an adequate result.

Greedy algorithms or “hill-climbing”

Heuristic methods usually proceed in two steps; first a single (or sometimes a few) tree(s) is (are) built by adding one taxon at a time, placing the added taxon on the branch which gives the best tree for the subset. When a tree containing all taxa is at hand, the second step tries to find a better tree by moving subtrees to other branches, keeping the new tree if it is better than the previous.

The type of algorithms used is frequently of a kind called greedy algorithms, or hill climbing. This comes from the analogy of a method to find ones way to the top of a mountain (a peak of optimal score in tree search) when visibility is zero. It is quite simple: take one step in an arbitrary direction; if the ground is lower at the new position, go back and try another direction; if it is higher proceed with a new step. Eventually, one will end up in a spot where the ground is lower one step away in all directions – i.e., at a peak. However, we do not know if this is the summit of the mountain – it might be just a local peak. Of course we have the analogous problem when trying to find the best tree; the remedy is to start this “hill-climbing” from different points in the “tree landscape” (Fig.).

There are variations on the procedures in both steps in the heuristic algorithms, and their performance will depend at the data; it takes some skills to make optimal use of these search methods. Remember – a tree with a better score is a better hypothesis according to the chosen criterion irrespective of how that tree is found.

Step 1 - Obtaining the initial tree(s)

There are several options to obtain the initial tree(s), and all are Òcorrect Ó in some way but some are more efficient depending on the data at hand.

Step 2 - Improving the initial tree(s)

This step is commonly referred to as branch swapping. The procedure is to move subtrees to other parts of the tree in order to increase the tree score. Depending on how extensive the swapping is we have (in order of increasing extensiveness):

Alternatives to hill-climbing

Even if greedy algorithms like stepwise addition or star decomposition are followed by branch swapping, the possibility of getting trapped in local optima is present. Some methods have adopted another basis for acceptable moves (compared to hill climbing). One such method is called the great deluge (Dueck & Scheuer, 1990). To continue the hill-climbing analogy, we now not only make the move if the step leads to a higher spot, but accept lower grounds as long as we do not get our feet wet, i.e. step into water surrounding the hills. While performing this trek, it is also raining heavily so the water level is constantly rising.

This can be expressed in a more formal way; z is the score for a tree (be it any of the optimality criteria used). The new step (ti+1) is accepted if the score z is higher than the current water level, w

If step ti+1 is accepted, then the water level is raised

This procedure will explore a wider range of the tree-scape compared to hill-climbing, and thus let the us find trees in regions outside the local optima where we would get trapped using hill-climbing. There can be many variants on this theme.

Algorithmic and other methods

There is another class of methods to find a phylogenetic hypothesis, commonly referred to as algorithmic methods. I treat two methods under this heading since they are in widespread (ab)use. However, they can also be seen as a way to obtain an initial tree in heuristic search method (above).

These methods do not use an optimality criterion, but simply follows a number of steps to build a tree (there may be optimality criteria involved in each of these steps, but there is no overall goodness criterion for the tree). The advantage of these methods is that we can build a tree in a fraction of the time needed to evaluate trees to find the best according to a particular criterion. The disadvantage is that we have no idea how much better the tree we get is compared to alternative trees, or if there may be other equally good trees. –We simply have no concept of what “good” means in this context.

Most algorithmic methods must be applied on pair-wise distances (as with minimum evolution), and the performance of the algorithmic methods are thus greatly influenced by the distance measure used.

The algorithms usually proceed by grouping the two most similar terminal nodes; the pair-wise distances from this new group to all other nodes are calculated. In the next step the two closest nodes/groups are grouped and replaced by a new group and so on and so forth. At the end, all original nodes are grouped hierarchically as a tree. The differences between these methods lie mainly in how these new distances are calculated and how the two closest groups are determined.

Clustering analysis

UPGMA

UPGMA, or Unweighted Paired Group Method with Arithmetic means, is what is known as a clustering method – it will group the terminal nodes based on pure similarity. Another way to put it is that it will produce an ultrametric tree based on the pair-wise distances. If the data are not ultrametric, for example if we do not have a molecular clock, the tree will be wrong. Due to these requirements, UPGMA should only be applied in some special cases (e.g., testing for a molecular clock).

Neighbor-joining (NJ)

Neighbor-joining, proposed by Saitou & Nei (1987), is a better method than UPGMA in the sense that it does not require the data to be ultrametric, just additive. It is a good approximation of the best tree using minimum evolution (above) as criterion. NJ is a star decomposition that uses the minimum evolution criterion in each step (but not as an overall criterion; one could say that it corresponds to using minimum evolution as criterion and doing a sloppy job finding the best tree) and hence is a good way to produce a tree in the first step of an heuristic search strategy when using ME as optimality criterion.

The algorithm is described briefly in AA-NJ below, and there is a more detailed presentation with a worked example available.

Algorithms

Neighbor-joining

  1. Given a matrix of pair-wise distances, for every terminal node i calculate its net divergence from all other terminal nodes using

    where N is the number of terminal nodes in the current matrix.
  2. Create a rate-corrected distance matrix (M), where the elements are defined by

    for all i and j>i (the matrix is symmetric and i=j is irrelevant).
  3. Define a new node, u, whose three branches join nodes i, j, and the rest of the tree. Define the length of the branches from u to i and u to j as

    and


  4. Define the distance from u to each other terminal node (k-i or j)

  5. Remove distances to nodes i and j from the data matrix and decrease N by 1.
  6. If more than two nodes remain, go to step 1. Otherwise, the tree is fully resolved with the exception of the branch length of the branch joining the two remaining nodes (i and j). Let this length be

    vij=dij

Exercises

  1. Join thy neighbour
  2. Roaming the tree-scape