
4.5. Phylogenetic Distances 159
clock hypothesis is valid, the distance computed here is proportional to the
amount of elapsed time, with the constant of proportionality being the mutation
rate. Thus, the distance can be thought of as a measure of how much time
was required for one sequence to mutate into the other. If the molecular clock
hypothesis does not hold, the distance is still an estimate of the average number
of substitutions that occurred per site. The larger it is, the greater the
evolutionary change.
Although we were unable to recover either the mutation rate α or the number
of elapsed time periods t by themselves, we could at least recover the
product of the two by comparing sequences. If other data (such as a
geological record) suggest the time involved, then the mutation rate can be
found from $d_{JC}$ by dividing by that time. This is one way that real DNA
mutation rates are estimated.
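As a minimal sketch of that last step, assuming the molecular clock hypothesis holds so that the distance equals the product of rate and elapsed time, the rate is simply the distance divided by the time. The function name `mutation_rate` and the numbers in the usage lines are hypothetical and chosen only for illustration; they are not data from the text.

```python
def mutation_rate(distance: float, elapsed_time: float) -> float:
    """Estimate a mutation rate (substitutions per site per unit time).

    Assumes a molecular clock, so that distance = rate * elapsed_time.
    """
    if elapsed_time <= 0:
        raise ValueError("elapsed_time must be positive")
    return distance / elapsed_time


# Hypothetical values for illustration only: a distance of 0.3
# substitutions per site and an independently dated 1.5 million years.
rate = mutation_rate(0.3, 1.5e6)
print(f"estimated rate: {rate:.2e} substitutions per site per year")
```

The units of the result follow the units of the supplied time: years here, but generations or model time steps would work the same way.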
Example. Consider the two 40-base sequences at the end of Section 4.3.
From Table 4.1, we find that 11 of the sites have undergone a substitution, so
p = 11/40 = .2750. Thus,
$$d_{JC}(S_0, S_1) = -\frac{3}{4} \ln\left(1 - \frac{4}{3} \cdot \frac{11}{40}\right) \approx .3426.$$
Therefore, while we observed .2750 substitutions per site on average, we
estimate that in the course of evolution .3426 substitutions per site occurred.
Hidden mutations account for the difference.
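The arithmetic in this example can be checked with a short script. This is only a sketch: the function name `jukes_cantor_distance` is an assumption made here, and the input is the observed fraction p = 11/40 of differing sites quoted above.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor distance from the observed fraction p of differing sites."""
    # The logarithm is defined only when 1 - (4/3) p > 0, i.e. p < 3/4.
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 3/4")
    return -0.75 * math.log(1 - 4 * p / 3)


p = 11 / 40                    # observed substitutions per site
d = jukes_cantor_distance(p)   # estimated substitutions per site
print(f"p = {p:.4f}, d_JC = {d:.4f}")  # p = 0.2750, d_JC = 0.3426
```

The gap between the two printed values is the estimated contribution of hidden mutations.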
The Kimura distances. Given any Markov model of base substitution, we
could hope to imitate the steps above to derive an appropriate formula
reconstructing the amount of mutation that has occurred. For the Kimura models,
you will find an exercise that steps you through the procedure. The final
formula for the Kimura 3-parameter model is
$$d_{K3} = -\frac{1}{4}\left(\ln(1 - 2\beta - 2\gamma) + \ln(1 - 2\beta - 2\delta) + \ln(1 - 2\gamma - 2\delta)\right),$$
where β, γ, and δ are estimates of parameters for a Kimura 3-parameter
matrix describing the mutation of the initial sequence to the final.
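A minimal sketch of the Kimura 3-parameter formula as a function follows. The name `kimura3_distance` and the parameter values in the second usage line are hypothetical. The first usage line is a sanity check: setting β = γ = δ = p/3 makes all three logarithms equal to ln(1 − 4p/3), so the result should agree with the Jukes-Cantor distance for the same p.

```python
import math


def kimura3_distance(beta: float, gamma: float, delta: float) -> float:
    """Kimura 3-parameter distance from estimated parameters beta, gamma, delta."""
    terms = (
        1 - 2 * beta - 2 * gamma,
        1 - 2 * beta - 2 * delta,
        1 - 2 * gamma - 2 * delta,
    )
    # Each logarithm needs a positive argument; otherwise the distance is undefined.
    if min(terms) <= 0:
        raise ValueError("estimates too large: a logarithm argument is not positive")
    return -0.25 * sum(math.log(t) for t in terms)


# Sanity check: equal parameters p/3 reproduce the Jukes-Cantor value for p = 11/40.
p = 11 / 40
print(f"{kimura3_distance(p / 3, p / 3, p / 3):.4f}")  # 0.3426

# Hypothetical unequal estimates, for illustration only.
print(kimura3_distance(0.10, 0.05, 0.03))
```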
Of course, if γ = δ, this also gives a distance for the Kimura 2-parameter
model. In that case, β is the probability of a transition, while γ + δ = 2γ
is the probability of a transversion. Thus, if from sequence data we estimate
the probability of a transition $p_1$ by counting all transitions and dividing by
the length of the sequence, and the probability of a transversion $p_2$ similarly,