Conceived by Mikael Thollesson (c) The authors(s) and SweLL, 2001
The simplest measure of change would be the observed difference. This may be a good measure, but it will be an underestimate of the actual change. If there are five differences, there must have been at least five changes, no less. However, one character may have changed once and then back to its original state again. In this case, we will observe no difference, but there have been two changes. Such multiple hits at a certain site in a sequence can greatly obscure the actual history. It is also clear that the probability for a specific site to change more than once will increase with time (evolutionary distance), so the error will be larger for long branches than for short branches. Thus, we will get a function similar to Fig. Differences vs. Changes. To convince yourself that this is the case, you can try Paul Lewis' applet using a lighht bulb analogy.
Fig. Differences vs. ChangesIf only we knew this function, we could calculate the actual number of changes from the observed differences. We do not know that, but we can model the evolution and estimate that function and thus the actual change if the assumptions of these models hold.
Exactly what the models look like will depend on the kind of data we are evaluating,
or to phrase it differently what the characters and character states are. If
we have protein sequences the characters may be sites and the states the different
amino acids. If we have DNA sequences the characters may be sites and the states
A, C, G, or T. Or the characters may be codons and the states will then be one
of the 61 possible codons (excluding the stop codons).
Example. For nucleotide sequences a simple model is the Jukes-Cantor
(Jukes &
Cantor, 1969). The assumptions are that all four bases (A, C, G, T) occur
in equal frequencies (=0.25) and that transformations from one base to another
occurs at the same rate between all bases. If we modify this model and let the
bases have different and arbitrary frequencies (although they have to sum to
1), and assume that all transitions occur at one rate and all transversions
at another (i.e., two rates instead of one) we have the Hasegawa-Kishino-Yano
model (Hasegawa
et al., 1985). Both of these are special cases of the
general time-reversible model discussed in more detail below.
Models are implemented mainly in two different ways in phylogenetic inference.
In one approach the parameters are estimated as part of the model, as is done
when using the maximum likelihood optimality criterion
(and in Bayesian methods) where the model is part of the criterion itself. In
the other pair-wise distances are calculated between
the observed sequences based on the observed differences and these distances
are then feed into a criterion such as minimum evolution.
Parsimony, or maximum parsimony, scores the number of changes, between different character states that at minimum are necessary to explain the observed data given the tree. The best hypothesis is the tree requiring the fewest changes. The changes may be restricted in what kind of changes that are allowed. This score, often referred to as the tree's length, is the minimum number of changes for the tree. Thus, parsimony uses the observations without any attempt to use an explicit model to estimate the total number of changes.
If we have k characters (sites) that each require l changes, the length of the tree, L, is the sum
![]()
Parsimony can be justified in several different ways; one is that it is a general principle in science known as Occam's razor there is no need to make more assumptions than necessary to explain the observations. Hypotheses of homoplasy, that is when we have more steps (changes) than is minimally needed for the data, may be judged ad hoc in that they are attempts to explain the data that do not fit a particular hypothesis. The most parsimonious tree is the solution that requires the smallest number of such ad hoc hypothesis and is thus preferred. Parsimony is also appealing in that it does not require us to make any explicit assumptions about the process, and thus can be applied when we have data that can not easily be modeled.
There are some variations on parsimony. The two most commonly used variations are
Less frequently one may encounter
All the above variants can be regarded as special cases of generalized parsimony. Generalized parsimony assigns a cost for the transformation of any character state into each of all other character states. The cost for m different character states can be represented as a mxm matrix S, where Sij is the increase in tree length for a transformation from state i to state j. If all costs are not equal, this is commonly referred to as transformational weighting or character state weighting. Below are examples of cost matrices for Wagner, Fitch and Dollo parsimony as well as an example where transversions is weighted twice the cost of transitions.
|
|
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Wagner parsimony | Fitch parsimony | Dollo parsimony; M is an arbitrary, large number | Transformational weighting - transversions twice as costly as transitions |

ML is different from parsimony in that an explicit model is used to calculate the score. The model in phylogenetic contexts consists of two parts:
The score used is the likelihood of a model (which includes the tree
we want to evaluate), which is the conditional probability of the data
(D) given the model (H). Or, phrased differently, it is the probability
of getting the data we actually have got if the model (the tree T and
the parameters Θ) were true;

Note that the likelihood is the (conditional) probability of the data not a probability of the model (the tree), since the sum of the likelihood over all trees does not equal one.
Normally, independence between characters is assumed and the likelihood of the tree is the product of the likelihoods of all characters. Since the likelihoods are very small numbers, the logarithm of the likelihood is normally used (with the log likelihood of the tree being the sum of the log likelihoods for the characters).


Calculating the likelihood for a tree is computational very intense and takes considerably more time than calculating for example the tree length in parsimony, and increases for more complex models. Extra computational load is generated since all branch lengths are optimized numerically for each tree that is examined.
The models that are most commonly used for DNA sequences are sub models of the GTR class of models, commonly modified to incorporate rate heterogeneity among the different sites. Some assumptions explicit or implicit when using these models in ML estimates are
The conditional probability the same for all sites and not changing
over time. Different sites may evolve at different rates (violating
the first part), e.g., positions in a codon; this can be handled by introducing
the rate heterogeneity models.

The best tree is the tree with the lowest score (i.e. shortest length) under
the auxiliary criteria that all ei>0 (i.e., all branches have
a positive length) and pij >= dij for
all pairs of terminal nodes i, j (i.e., the pair-wise distances on the
tree - p - can never be smaller than the directly observed pair-wise
differences, d).
There are some drawbacks to use pair-wise distances compared to using the full data set. Some information is lost when characters are transformed to pair-wise distances; you can always calculate the distance from the characters, but not vice versa. Furthermore, the branch lengths may sometimes be uninterpretable (e.g., the length of a branch is 123.5 changes, but what is 1/2 a change?) or nonsensical (e.g., the total length of a ME tree may be 231 changes whereas a minimum of 233 changes are needed to explain the differences in the characters). Another disadvantage is that pair-wise distance methods will only give a tree, but no information of the changes in the sequences along this tree.
The pair-wise distances can be calculated in different ways, from just using the percentage of observed differences to applying a model to estimate the actual distances (changes) from the observed differences as discussed above. The models that can be applied are the same as the transformation models used in maximum likelihood, but the distances are calculated a priori and the models are not evaluated under each tree. The performance of ME and other methods that use pair-wise distances are of course greatly influenced by the choice of the model used to calculate the distances. There are also other distances available that does not correspond to a particular model in maximum likelihood, but that have other desirable properties such as the LogDet or paralinear distances.All of the methods discussed (with the exception of some ML models that we have not encountered) will select a best unrooted tree; the methods are time-reversible. Too root the tree, i.e. introduce a time axis, we need some auxiliary criterion or method.
The most common method of rooting a tree is to include outgroups in
the analyses. An outgroup is simply a set (at least one) of taxa or sequences
that one knows are less related to any of the study taxa (known
as the ingroup) than any pair of ingroup taxa are related to each other.
Alas, this is not something we can really know, but must infer from other
studies. An example of outgroups would be to include a fish, a frog, and a
bird when one is analyzing the phylogeny of mammals.
The procedure in itself is simple; the outgroup taxa are treated exactly as the ingroup (alignment, analysis). When the analysis is done and a resulting tree is achieved, the root from the perspective of the ingroup will be on the edge between the ingroup and the outgroup. If outgroup taxa are found scattered among ingroup taxa well, than the outgroup was not really an outgroup and other outgroup taxa must be selected.
.
and the tree length is increased with 1.The first pass is sufficient to find the minimum length of the tree, whereas the second step will also assign optimal reconstructions (character sets) to the internal nodes.
Let's start with a simple tree with just one branch and two taxa, T1 and T2.
For these we have sequences two bases:
| T1 |
AC |
| T2 |
CC |
Interpreted on the tree this in turn means
To calculate the likelihoods for each site and for the whole tree, we need a model. I will illustrate using the simplest model, Jukes-Cantor (JC; substitution probabilities with two cases). The site likelihoods are then the probability of having a particular base (which is 0.25 for all four bases in the JC model) multiplied with the conditional probability of having the two bases at each side of a branch given the branch length µt. This can be expressed as
for the first site (where i<>j) and
for the second site (i=j) and µt is the length of the branch connecting the two taxa. The overall likelihood, assuming independence of the two sites, is
L = L1·L2 or, expressed as logarithms,
ln L = ln L1 + ln L2
To complicate things a little (i.e., make things more realistic) we examine a case where we have two nudibranch species and a short sequence of 31 bases from the 16S rRNA gene.
| Polycera quadrilineata |
AAATGGAGTCAGGAATGAATGGAATAACCGG |
| Dendronotus frondosus |
AATTGGAGCCAGGTATGAATGGAGTTACGTG |
We have seven differences (in red) and 24 similarities. Thus, using the Jukes-Cantor model, we get a likelihood that is
![]()
As you can see, this depends on the branch length; if L has an optimum for some value of µt, this is the maximum likelihood and the value of µt is the maximum likelihood estimate of the branch length. If we plot the likelihood function of µt we get the following graph.

Fig. Example likelihood function for a one branch tree
So, the ML estimate of µt is 0.358. Note that we are not able to separate the rate parameter µ from the time, t, but have to treat them together.
Let us finally examine a four-taxon tree, where we have the taxa T1 to T4 and with the first site
| T1 |
T |
| T2 |
T |
| T3 |
A |
| T4 |
G |
One of the three possible trees is
Here we encounter a new problem we do not know the state at the two
internal nodes, x and y. There are several different solutions
a maximum likelihood method may apply to this, for example try all combinations
and use the ones that maximizes the overall likelihood. This is not the
way ML is normally implemented in phylogenetic inference though. Instead we
sum over all possible states at the internal nodes and get a compound likelihood
for observing the data.
Since we have a time-reversible model (JC is member in the GTR class), we can
arbitrarily place the root at an internal node for calculation. Placing the
root at the node x, we get the following expression:

Also note that we have now five branches with lengths d1 to d5, each corresponding to (µt)1 to (µt)5 respectively, that we have to optimize simultaneously to obtain the maximum likelihood.
The mathematical expression of a a substitution model is a table of rates at which each state is replaced with each alternative state. The rates are substitutions per characters (site) per unit evolutionary distance. For nucleotide sequences these can be expressed as a 4x4 instantaneous rate matrix, Q. Here we will limit our discussion to cases where the overall rate of changing from state i to state j is the same as the rate from i to j, a constraint to models that are said to be time-reversible. The models used in phylogenetic inference to date are almost exclusively subsets of this class.
The most general form of time-reversible nucleotide models, which is called General Time Reversible model (GTR or REV), is
The rows (and columns) correspond to the bases A, C, G, and T respectively.
The factor µ represents the mean instantaneous rate. This rate is modified
with the relative rate parameters a, b, c, ..., l,
which correspond to each possible transformation between two bases. The remaining
parameters (
) are frequency parameters
that correspond to the frequencies of the four bases.
The diagonal elements of Q are always chosen so that the row sums are
zero (i.e., stationarity). Sometimes Q is decomposed to a rate matrix,
R, and a frequency matrix
the off diagonal elements of Q are the product of R
.
The GTR class of models contains a number of models designated by cryptic combinations of letters and numbers; these are usually the initial letters of the authors last names and the year of the publication. Their relationship can be illustrated as in Fig. GTR relationships

Fig. GTR models relationships
In terms of parameters, this can be described as in the table below, where the rate parameters a to f corresponds to the rates indicated in the figure.
| Designation | Rate params | Base frequencies |
Number of |
Reference |
|---|---|---|---|---|
| JC | a=b=c=d=e=f | ![]() |
1 | Jukes & Cantor, 1969 |
| K80, K2P | a=c=d=f, b=e | ![]() |
2 | Kimura, 1980 |
| TrNef | a=c=d=f, b, e | ![]() |
3 | |
| K81, K3ST | a=f, b=e, c=d | ![]() |
3 | |
| TVMef | a, c, d, f, b=e | ![]() |
5 | |
| TiMef | a=f, c=d, b, e | ![]() |
4 | |
| SYM | a, b, c, d, e, f | ![]() |
6 | |
| F81 | a=b=c=d=e | 4 | Felsenstein, 1981 | |
| HKY | a=c=d=f, b=e | 5 | Hasegawa et al., 1985 | |
| TrN | a=c=d=f, b, e | 6 | Tamura & Nei, 1993 | |
| K81uf | a=f, b=e, c=d | 6 | ||
| TVM | a, c, d, f, b=e | 8 | ||
| TiM | a=f, c=d, b, e | 7 | ||
| GTR, REV | a, b, c, d, e, f | 9 |

So, for example, the Q matrix for a Jukes-Cantor (JC) model is

The instantaneous rate matrix Q specifies the rates of change between pairs of nucleotides per instant of time, but we need the probabilities of change between states for a given time, t. This is the substitution probability matrix, P, which can be calculated as

This can be evaluated numerically by decomposing Q into eigenvalues and eigenvectors. For several of the sub(models) simple analytical solutions exists for P, whereas other cases (like GTR) needs to be solved numerically.
For the JC model, the P matrix is

or expressed in another way,
These are the probability for a change from i to j over time t when the states differ (i<>j) and the probability that the state does not change (i=j). This means that given the time along branches of a tree, and an estimated (optimized) value of µ, one can calculate the probabilities for the data observed.
The basic GTR class of models all assume that all sites evolve at the same rate. If this assumption is violated, maximum likelihood estimates may become inconsistent. Rate heterogeneity can be incorporated into ML analyses by adding a relative rate component, r, to the substitution probability expressions. For example, the JC model above will then become
![]()
If the relative rates r are scaled so that the mean rate remains 1, branch lengths will still reflect the number of substitutions per site.
Assigning values to r usually falls into one of these categories
Fig. Gamma distribution
A general framework for describing pair-wise distances is to use a divergence matrix Fxy to represent the relative frequency of each nucleotide pair in a given comparison of the sequences X and Y

where nij is the number of times sequence X has state i aligned with state j in sequence Y, and N is the sum of all nij.
The uncorrected distance, frequently referred to as the dissimilarity (D) or p-distance is simply the total number of differences divided by the total number of available sites:
![]()
This distance will suffer from the multiple hits problem mentioned in the beginning of this module. To get a better estimate of the actual change we apply a transformation based on some model, for example one of the GTR class models.
When we do this we end up with a differential equation to solve. Using the simplest model, the Jukes-Cantor model, we get the following:
There are two ways for a site in the two sequences to become the same if they are different (one possible change in either of the two sequences) and six different ways to become different if they are the same (three possible changes in either sequence). If the sequence length compared is Ntot and Ndiff of these are different, the differential equation becomes

Since we have the border condition Ndiff(0) = 0 (the sequences are identical immediately after separation) the solution to the equation is

Rewriting this using the fraction of different sites we get

and finally, solving for µt we get

However, what we wanted to calculate is the total number of changes based on the differences. At any site in the two (2) sequences there are three (3) changes that can occur, each with the rate µ/4, so the average number of changes during the divergence time t is
![]()
where we get the final expression after substituting µt from the previous expression. Thus, to summarize we have the JC-distance as

Several of the other models can be calculated in a similar way, but the expressions gets a bit more complicated. For example the Kimura 2-parameter model (K80/K2P) treats transitions and transversions differently and the distance based on this model can be calculated as
Not all of the models mentioned under maximum likelihood have a simple formula for the distance transformation; the very common HKY model for example does not. The most general model for which a simple formula exists is the Tamura-Nei model (TrN).
The models described above for ML and pair-wise distances assume stationarity, i.e. that the substitution probability matrices remain constant over the entire tree and that they are time-reversible. These assumptions jointly imply that base composition is constant. However, for some data sets it is obvious that the terminals have very different base composition and thus the base frequencies must change over the tree to explain the observation.
To cope with this situation the LogDet or paralinear distances were proposed. This is a transformation that gives additive distances under a wider range of assumptions, especially is it robust to change in base composition. LogDet stands for logarithm of the determinant, and the distance is calculated as

where
x and
y
are the diagonal matrix of the character state frequencies
in the two sequences respectively and r is the number of character states
(r=4 for DNA/RNA).
Although LogDet distances can cope with differences in base frequencies between taxa, it also has its weakness when there is site-to-site rate variation LogDet can give quite undesirable results. One remedy is to selectively remove some sites, for example a proportion of those that are constant over all sequences.