Conceived by Mikael Thollesson (c) The authors(s) and SweLL, 2001

Tree selection criteria

Theoretical foundations

Measuring evolutionary change

In practice, the optimality criterion for a tree is a score (a number) calculated based on transformations of the characters to explain the observed data at the terminal nodes given that tree. To calculate such a score we need some measure of (evolutionary) change.

The simplest measure of change would be the observed difference. This may be a good measure, but it will be an underestimate of the actual change. If there are five differences, there must have been at least five changes, no less. However, one character may have changed once and then back to its original state again. In this case, we will observe no difference, but there have been two changes. Such “multiple hits” at a certain site in a sequence can greatly obscure the actual history. It is also clear that the probability for a specific site to change more than once will increase with time (evolutionary distance), so the error will be larger for long branches than for short branches. Thus, we will get a function similar to Fig. Differences vs. Changes. To convince yourself that this is the case, you can try Paul Lewis' applet using a lighht bulb analogy.

Fig. Differences vs. Changes
 
If only we knew this function, we could calculate the actual number of changes from the observed differences. We do not know that, but we can model the evolution and estimate that function and thus the actual change if the assumptions of these models hold.

Models

Exactly what the models look like will depend on the kind of data we are evaluating, or to phrase it differently what the characters and character states are. If we have protein sequences the characters may be sites and the states the different amino acids. If we have DNA sequences the characters may be sites and the states A, C, G, or T. Or the characters may be codons and the states will then be one of the 61 possible codons (excluding the stop codons).
Example. For nucleotide sequences a simple model is the Jukes-Cantor (Jukes & Cantor, 1969). The assumptions are that all four bases (A, C, G, T) occur in equal frequencies (=0.25) and that transformations from one base to another occurs at the same rate between all bases. If we modify this model and let the bases have different and arbitrary frequencies (although they have to sum to 1), and assume that all transitions occur at one rate and all transversions at another (i.e., two rates instead of one) we have the Hasegawa-Kishino-Yano model (Hasegawa et al., 1985). Both of these are special cases of the general time-reversible model discussed in more detail below.

Models are implemented mainly in two different ways in phylogenetic inference. In one approach the parameters are estimated as part of the model, as is done when using the maximum likelihood optimality criterion (and in Bayesian methods) where the model is part of the criterion itself. In the other pair-wise distances are calculated between the observed sequences based on the observed differences and these distances are then feed into a criterion such as minimum evolution.

Criteria for selecting a best hypothesis

Given a data set for a number of taxa (or genes; terminal units) and that the model of the history we will use is a tree, there is a specific number of trees relating these taxa and phylogenetic inference boils down to a simple task – to choose one of these trees as the best hypothesis of the history. This means that we will evaluate the data for each tree, and evaluate how good the tree is for this particular data using some kind of criterion. There are a number of criteria proposed that may be used, but three of the most widely use are (maximum) parsimony, maximum likelihood, and minimum evolution, that will be discussed below. We can also note that there are other approaches to phylogenetic inference such as algorithmic methods (see Algorithmic methods) and methods that explore a number of possible hypotheses rather than selecting one.

Parsimony (MP)

Parsimony, or maximum parsimony, scores the number of changes, between different character states that at minimum are necessary to explain the observed data given the tree. The best hypothesis is the tree requiring the fewest changes. The changes may be restricted in what kind of changes that are allowed. This score, often referred to as the tree's length, is the minimum number of changes for the tree. Thus, parsimony uses the observations without any attempt to use an explicit model to estimate the total number of changes.

If we have k characters (sites) that each require l changes, the length of the tree, L, is the sum


Parsimony can be justified in several different ways; one is that it is a general principle in science known as Occam's razor – there is no need to make more assumptions than necessary to explain the observations. Hypotheses of homoplasy, that is when we have more steps (changes) than is minimally needed for the data, may be judged ad hoc in that they are attempts to explain the data that do not fit a particular hypothesis. The most parsimonious tree is the solution that requires the smallest number of such ad hoc hypothesis and is thus preferred. Parsimony is also appealing in that it does not require us to make any explicit assumptions about the process, and thus can be applied when we have data that can not easily be modeled.

There are some variations on parsimony. The two most commonly used variations are

Less frequently one may encounter

All the above variants can be regarded as special cases of generalized parsimony. Generalized parsimony assigns a cost for the transformation of any character state into each of all other character states. The cost for m different character states can be represented as a mxm matrix S, where Sij is the increase in tree length for a transformation from state i to state j. If all costs are not equal, this is commonly referred to as transformational weighting or character state weighting. Below are examples of cost matrices for Wagner, Fitch and Dollo parsimony as well as an example where transversions is weighted twice the cost of transitions.

  0 1 2 3
0 - 1 2 3
1 1 - 1 2
2 2 1 - 1
3 3 2 1 -
0 1 2 3
0 - 1 1 1
1 1 - 1 1
2 1 1 - 1
3 1 1 1 -
0 1 2 3
0 - 1M 2M 3M
1 1 - 1M 2M
2 2   - 1M
3 3 2 1 -
A C G T
A - 2 1 2
C 2 - 2 1
G 1   - 2
T 2 1 2 -
Wagner parsimony Fitch parsimony Dollo parsimony; M is an arbitrary, large number Transformational weighting - transversions twice as costly as transitions


Parsimony can be further elaborated by using another class of differential weighting. Instead of simply counting the minimum number of changes on a tree, certain characters may be given a higher weight (cost), implementing a positional weighting or character weighting. For example can the three positions in a codon be given different weights. This is implemented by multiplying each characters length with a weight, w. The tree length in this case is

Data:Undervisning:Sommarkurs, Bioinformatik:HTML file:Summercourse06.pict

The motivation for using differential weighting is to get a better approximation of the actual changes from the observed differences (Fig. Differences vs Change) by giving less weight to characters or changes that are considered less informative (usually characters that seems to be changing a lot or changes that seem to occur more frequently). The actually numeric values of the weights applied are to some extent arbitrary, but the common practice is to let the data determine these weights according to some objective function.

Maximum likelihood (ML)

Maximum likelihood is a kind of estimate that is very common in statistics. For example, estimating the population mean with the average of a sample is a maximum likelihood (or ML) estimate. (Other common techniques in statistics are the methods of moments, which are used for pair-wise distances, and least squares).

ML is different from parsimony in that an explicit model is used to calculate the score. The model in phylogenetic contexts consists of two parts:

The score used is the likelihood of a model (which includes the tree we want to evaluate), which is the conditional probability of the data (D) given the model (H). Or, phrased differently, it is the probability of getting the data we actually have got if the model (the tree T and the parameters Θ) were true;

Data:Undervisning:Sommarkurs, Bioinformatik:HTML file:Summercourse07.pict

Note that the likelihood is the (conditional) probability of the data not a probability of the model (the tree), since the sum of the likelihood over all trees does not equal one.

Normally, independence between characters is assumed and the likelihood of the tree is the product of the likelihoods of all characters. Since the likelihoods are very small numbers, the logarithm of the likelihood is normally used (with the log likelihood of the tree being the sum of the log likelihoods for the characters).


Calculating the likelihood for a tree is computational very intense and takes considerably more time than calculating for example the tree length in parsimony, and increases for more complex models. Extra computational load is generated since all branch lengths are optimized numerically for each tree that is examined.

The models that are most commonly used for DNA sequences are sub models of the GTR class of models, commonly modified to incorporate rate heterogeneity among the different sites. Some assumptions explicit or implicit when using these models in ML estimates are

One advantage of maximum likelihood is that it will give a correct result in some cases where other methods fail (i.e., it is consistent in those cases, see below) – provided that the models used are correct... The need for explicit models are sometimes viewed as a weakness, but may also be a strength as the values for different parameters are visible and thus their validity can be assessed.

Minimum evolution (ME)

Minimum evolution criterion differs from the previous two criteria in that the observations are not used directly to calculate the tree score (which is called length also for ME). Instead the data are transformed to pair-wise distances, and the score is calculated from those. Using pair-wise distances is an advantage for some kinds of data (e.g., DNA-DNA hybridization) where the data from the experiments are pair-wise differences.

For an unrooted tree with n terminal nodes we have (2n-3) branches, each with the length ei. The sum of these branch lengths, L, is the length of the tree and the score used for the minimum evolution criterion.
Data:Undervisning:Sommarkurs, Bioinformatik:HTML file:Summercourse08.pict

The best tree is the tree with the lowest score (i.e. shortest length) under the auxiliary criteria that all ei>0 (i.e., all branches have a positive length) and pij >= dij for all pairs of terminal nodes i, j (i.e., the pair-wise distances on the tree - p - can never be smaller than the directly observed pair-wise differences, d).

There are some drawbacks to use pair-wise distances compared to using the full data set. Some information is lost when characters are transformed to pair-wise distances; you can always calculate the distance from the characters, but not vice versa. Furthermore, the branch lengths may sometimes be uninterpretable (e.g., the length of a branch is 123.5 changes, but what is 1/2 a change?) or nonsensical (e.g., the total length of a ME tree may be 231 changes whereas a minimum of 233 changes are needed to explain the differences in the characters). Another disadvantage is that pair-wise distance methods will only give a tree, but no information of the changes in the sequences along this tree.

The pair-wise distances can be calculated in different ways, from just using the percentage of observed differences to applying a model to estimate the actual distances (changes) from the observed differences as discussed above. The models that can be applied are the same as the transformation models used in maximum likelihood, but the distances are calculated a priori and the models are not evaluated under each tree. The performance of ME and other methods that use pair-wise distances are of course greatly influenced by the choice of the model used to calculate the distances. There are also other distances available that does not correspond to a particular model in maximum likelihood, but that have other desirable properties such as the LogDet or paralinear distances.

Rooting

All of the methods discussed (with the exception of some ML models that we have not encountered) will select a best unrooted tree; the methods are time-reversible. Too root the tree, i.e. introduce a time axis, we need some auxiliary criterion or method.

The most common method of rooting a tree is to include outgroups in the analyses. An outgroup is simply a set (at least one) of taxa or sequences that one “knows” are less related to any of the study taxa (known as the ingroup) than any pair of ingroup taxa are related to each other. Alas, this is not something we can really know, but must infer from other studies. An example of outgroups would be to include a fish, a frog, and a bird when one is analyzing the phylogeny of mammals.

The procedure in itself is simple; the outgroup taxa are treated exactly as the ingroup (alignment, analysis). When the analysis is done and a resulting tree is achieved, the root from the perspective of the ingroup will be on the edge between the ingroup and the outgroup. If outgroup taxa are found scattered among ingroup taxa – well, than the outgroup was not really an outgroup and other outgroup taxa must be selected.


Algorithms

Character optimization in parsimony, unordered characters (Fitch parsimony)

    1. To each terminal node i assign a state set si containing the character states in the character matrix, xti. Set tree length=0.
    2. Visit an internal node k for which sk has not yet been defined, but which the state sets si and sj of its two immediate descendants have been defined.
      If Data:Undervisning:SweLL:Slask:Parsimony00.pict
      then Data:Undervisning:SweLL:Slask:Parsimony01.pict.
      Otherwise, Data:Undervisning:SweLL:Slask:Parsimony02.pict and the tree length is increased with 1.
    3. If node k is at the basal fork of the tree (i.e., immediate descendant to the arbitrarily designated terminal root,), proceed to step 4 otherwise return to 2.
    4. If xr (the state assigned to the terminal node at the root) is not contained in SK, increase tree length with 1.
    1. Visit an internal node k for which an optimal xk has not yet been assigned, but for which this assignment has been done for its immediate ancestor, m.
      (The first iteration will have m=r, the terminal taxon at the root)
    2. If xm is contained in SK(the set assigned to k in the first pass), assign xm to xk. Otherwise, arbitrarily assign any state from SK to xk.
    3. If all internal nodes have been visited, stop. Otherwise return to step 1.

The first pass is sufficient to find the minimum length of the tree, whereas the second step will also assign optimal reconstructions (character sets) to the internal nodes.

Calculating maximum likelihood scores

The likelihood of a one branch tree with two characters

Let's start with a simple tree with just one branch and two taxa, T1 and T2.

For these we have sequences two bases:

T1
AC
T2
CC

Interpreted on the tree this in turn means

To calculate the likelihoods for each site and for the whole tree, we need a model. I will illustrate using the simplest model, Jukes-Cantor (JC; substitution probabilities with two cases). The site likelihoods are then the probability of having a particular base (which is 0.25 for all four bases in the JC model) multiplied with the conditional probability of having the two bases at each side of a branch given the branch length µt. This can be expressed as

for the first site (where i<>j) and

for the second site (i=j) and µt is the length of the branch connecting the two taxa. The overall likelihood, assuming independence of the two sites, is

L = L1·L2 or, expressed as logarithms,

ln L = ln L1 + ln L2

The likelihood of a one branch tree with a set of characters

To complicate things a little (i.e., make things more realistic) we examine a case where we have two nudibranch species and a short sequence of 31 bases from the 16S rRNA gene.

Polycera quadrilineata
AAATGGAGTCAGGAATGAATGGAATAACCGG
Dendronotus frondosus
AATTGGAGCCAGGTATGAATGGAGTTACGTG

We have seven differences (in red) and 24 similarities. Thus, using the Jukes-Cantor model, we get a likelihood that is

As you can see, this depends on the branch length; if L has an optimum for some value of µt, this is the maximum likelihood and the value of µt is the maximum likelihood estimate of the branch length. If we plot the likelihood function of µt we get the following graph.

Fig. Example likelihood function for a one branch tree

So, the ML estimate of µt is 0.358. Note that we are not able to separate the rate parameter µ from the time, t, but have to treat them together.

The likelihood of four taxon tree

Let us finally examine a four-taxon tree, where we have the taxa T1 to T4 and with the first site

T1
T
T2
T
T3
A
T4
G

One of the three possible trees is

Here we encounter a new problem – we do not know the state at the two internal nodes, x and y. There are several different solutions a maximum likelihood method may apply to this, for example try all combinations and use the ones that maximizes the overall likelihood. This is not the way ML is normally implemented in phylogenetic inference though. Instead we sum over all possible states at the internal nodes and get a compound likelihood for observing the data.
Since we have a time-reversible model (JC is member in the GTR class), we can arbitrarily place the root at an internal node for calculation. Placing the root at the node x, we get the following expression:

Also note that we have now five branches with lengths d1 to d5, each corresponding to (µt)1 to (µt)5 respectively, that we have to optimize simultaneously to obtain the maximum likelihood.

GTR class substitution models

The mathematical expression of a a substitution model is a table of rates at which each state is replaced with each alternative state. The rates are substitutions per characters (site) per unit evolutionary distance. For nucleotide sequences these can be expressed as a 4x4 instantaneous rate matrix, Q. Here we will limit our discussion to cases where the overall rate of changing from state i to state j is the same as the rate from i to j, a constraint to models that are said to be time-reversible. The models used in phylogenetic inference to date are almost exclusively subsets of this class.

The most general form of time-reversible nucleotide models, which is called General Time Reversible model (GTR or REV), is

The rows (and columns) correspond to the bases A, C, G, and T respectively. The factor µ represents the mean instantaneous rate. This rate is modified with the relative rate parameters a, b, c, ..., l, which correspond to each possible transformation between two bases. The remaining parameters () are frequency parameters that correspond to the frequencies of the four bases.

The diagonal elements of Q are always chosen so that the row sums are zero (i.e., stationarity). Sometimes Q is decomposed to a rate matrix, R, and a frequency matrix – the off diagonal elements of Q are the product of R.

The GTR class of models contains a number of models designated by cryptic combinations of letters and numbers; these are usually the initial letters of the authors last names and the year of the publication. Their relationship can be illustrated as in Fig. GTR relationships

Fig. GTR models relationships

In terms of parameters, this can be described as in the table below, where the rate parameters a to f corresponds to the rates indicated in the figure.

Designation Rate params Base frequencies

Number of
free params

Reference
JC a=b=c=d=e=f 1 Jukes & Cantor, 1969
K80, K2P a=c=d=f, b=e 2 Kimura, 1980
TrNef a=c=d=f, b, e 3  
K81, K3ST a=f, b=e, c=d 3  
TVMef a, c, d, f, b=e 5  
TiMef a=f, c=d, b, e 4  
SYM a, b, c, d, e, f 6  
F81 a=b=c=d=e 4 Felsenstein, 1981
HKY a=c=d=f, b=e 5 Hasegawa et al., 1985
TrN a=c=d=f, b, e 6 Tamura & Nei, 1993
K81uf a=f, b=e, c=d 6  
TVM a, c, d, f, b=e 8  
TiM a=f, c=d, b, e 7  
GTR, REV a, b, c, d, e, f 9  

So, for example, the Q matrix for a Jukes-Cantor (JC) model is

The instantaneous rate matrix Q specifies the rates of change between pairs of nucleotides per instant of time, but we need the probabilities of change between states for a given time, t. This is the substitution probability matrix, P, which can be calculated as

This can be evaluated numerically by decomposing Q into eigenvalues and eigenvectors. For several of the sub(models) simple analytical solutions exists for P, whereas other cases (like GTR) needs to be solved numerically.

For the JC model, the P matrix is

or expressed in another way,

These are the probability for a change from i to j over time t when the states differ (i<>j) and the probability that the state does not change (i=j). This means that given the time along branches of a tree, and an estimated (optimized) value of µ, one can calculate the probabilities for the data observed.

Accommodating rate heterogeneity across sites in the models

The basic GTR class of models all assume that all sites evolve at the same rate. If this assumption is violated, maximum likelihood estimates may become inconsistent. Rate heterogeneity can be incorporated into ML analyses by adding a relative rate component, r, to the substitution probability expressions. For example, the JC model above will then become

If the relative rates r are scaled so that the mean rate remains 1, branch lengths will still reflect the number of substitutions per site.

Assigning values to r usually falls into one of these categories

Fig. Gamma distribution

Calculating pair-wise distances

A general framework for describing pair-wise distances is to use a divergence matrix Fxy to represent the relative frequency of each nucleotide pair in a given comparison of the sequences X and Y

where nij is the number of times sequence X has state i aligned with state j in sequence Y, and N is the sum of all nij.

The uncorrected distance, frequently referred to as the dissimilarity (D) or p-distance is simply the total number of differences divided by the total number of available sites:

This distance will suffer from the “multiple hits” problem mentioned in the beginning of this module. To get a better estimate of the actual change we apply a transformation based on some model, for example one of the GTR class models.

When we do this we end up with a differential equation to solve. Using the simplest model, the Jukes-Cantor model, we get the following:

There are two ways for a site in the two sequences to become the same if they are different (one possible change in either of the two sequences) and six different ways to become different if they are the same (three possible changes in either sequence). If the sequence length compared is Ntot and Ndiff of these are different, the differential equation becomes

Since we have the border condition Ndiff(0) = 0 (the sequences are identical immediately after separation) the solution to the equation is

Rewriting this using the fraction of different sites we get

and finally, solving for µt we get

However, what we wanted to calculate is the total number of changes based on the differences. At any site in the two (2) sequences there are three (3) changes that can occur, each with the rate µ/4, so the average number of changes during the divergence time t is

where we get the final expression after substituting µt from the previous expression. Thus, to summarize we have the JC-distance as

Several of the other models can be calculated in a similar way, but the expressions gets a bit more complicated. For example the Kimura 2-parameter model (K80/K2P) treats transitions and transversions differently and the distance based on this model can be calculated as

Not all of the models mentioned under maximum likelihood have a simple formula for the distance transformation; the very common HKY model for example does not. The most general model for which a simple formula exists is the Tamura-Nei model (TrN).

The LogDet distances

The models described above for ML and pair-wise distances assume stationarity, i.e. that the substitution probability matrices remain constant over the entire tree and that they are time-reversible. These assumptions jointly imply that base composition is constant. However, for some data sets it is obvious that the terminals have very different base composition and thus the base frequencies must change over the tree to explain the observation.

To cope with this situation the LogDet or paralinear distances were proposed. This is a transformation that gives additive distances under a wider range of assumptions, especially is it robust to change in base composition. LogDet stands for logarithm of the determinant, and the distance is calculated as

where x andy are the diagonal matrix of the character state frequencies in the two sequences respectively and r is the number of character states (r=4 for DNA/RNA).

Although LogDet distances can cope with differences in base frequencies between taxa, it also has its weakness – when there is site-to-site rate variation LogDet can give quite undesirable results. One remedy is to selectively remove some sites, for example a proportion of those that are constant over all sequences.

Exercises

  1. The parsimony optimalilty criterion