
The last observation in this example justifies a quick discussion about ties for time-to-event
data. Even though there might be situations where the design of a study (or outcome variable)
is such that we could have pure jumps in the hazard, this is very rare. With continuous time we
therefore have that $\Delta\Lambda(t) = 0$. However, we only measure time to a certain precision, such
as days. This means that two events that occurred at different time points may be recorded as
having occurred at the same time point. Had we measured to the precision of hours, many of
these ties would probably have been broken, and with more detailed timings even more so.
The proper way to handle ties that occur this way is to modify the Nelson–Aalen estimator to
account for what is really happening. In fact, if we have no censored observations at a time t
but d events, and two events cannot occur simultaneously, the estimate of $d\Lambda(t)$ should be
\[
d\Lambda_n(t) = \frac{1}{Y_n(t)} + \frac{1}{Y_n(t) - 1} + \cdots + \frac{1}{Y_n(t) - d + 1},
\]
because first one event occurs, then the next, and so on. If there were censored observations at
the same time, this would not be exactly true, but it would still be better than the crude estimate.
If there is a mixture of events and censorings, $d\Lambda_n(t)$ should be the average over all
possible ways in which what we observe can occur, which makes this a combinatorial problem.
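As a small illustration of the tie correction above, the following Python sketch compares the corrected increment with the crude one. The function names are ours and purely illustrative; `at_risk` plays the role of $Y_n(t)$ and `events` the role of d. We assume that the crude estimate is $d/Y_n(t)$ and that there are no censored observations at time t.

```python
from fractions import Fraction


def tie_corrected_increment(at_risk, events):
    """Sum 1/Y + 1/(Y-1) + ... + 1/(Y-d+1) for d tied events.

    Assumes no censoring at this time point and that the tied
    events really occurred one after another.
    """
    if not 0 <= events <= at_risk:
        raise ValueError("need 0 <= events <= at_risk")
    return sum(Fraction(1, at_risk - j) for j in range(events))


def crude_increment(at_risk, events):
    """Crude increment d/Y (assumed form), treating ties as simultaneous."""
    return Fraction(events, at_risk)


# Hypothetical numbers: 10 subjects at risk, 3 events recorded on the same day.
corrected = tie_corrected_increment(10, 3)
crude = crude_increment(10, 3)
print(float(corrected), float(crude))
```

With a single event the two increments coincide. With several tied events the corrected increment is the larger one, because the risk set shrinks after each event.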
11.8 The Kaplan–Meier estimator of the CDF
We conclude this chapter with a more mathematical derivation of the Kaplan–Meier estimator
for the CDF from the Nelson–Aalen estimate of the hazard rate. Our main objective is to find
its variance in the presence of right-censored data, which allows us to investigate the CDF
$F(t)$, using the methods discussed in Chapter 6.
For a continuous CDF we have that $F^c(t) = e^{-\Lambda(t)}$, and it is therefore tempting to use the
function $e^{-\Lambda_n(t)}$ as an estimator for $F^c(t)$. This has indeed been suggested, and is called the
Breslow estimator of the CDF. However, it is not the natural estimator of $F(t)$ based on $\Lambda_n(t)$,
a role that is taken by the Kaplan–Meier estimator. Before we look closer into why this is, we
compare these two survival function estimators in a numerical example.
Example 11.6 Suppose that the logarithm of the sputum count in Section 6.2 instead describes
the survival time (in years) of 20 subjects after they have had a particular cancer diagnosis.
There are no censored data. Figure 11.7(a) shows the Nelson–Aalen estimate (solid gray
curve) of the true cumulative hazard (dashed curve) as well as pointwise confidence limits
(solid black curves) for the (true) cumulative hazard. The jumps in $\Lambda_n(t)$ become larger as t
increases, for the obvious reason that the size of a jump is inversely proportional to the number
at risk, of which there are fewer later than early on.
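The growth of the jumps is easy to see numerically. The sketch below is our own illustration, not the computation behind Figure 11.7. It assumes n subjects with distinct event times and no censoring, so that the number at risk falls by one at each event and the jumps are 1/n, 1/(n − 1), ..., 1.

```python
from itertools import accumulate


def nelson_aalen_jumps(n):
    """Jump sizes 1/Y at the ordered event times for n uncensored,
    untied observations, where Y = n, n-1, ..., 1 is the number at risk."""
    return [1.0 / (n - i) for i in range(n)]


def nelson_aalen_cumulative(n):
    """Nelson-Aalen estimate just after each ordered event time."""
    return list(accumulate(nelson_aalen_jumps(n)))


jumps = nelson_aalen_jumps(20)
cumulative = nelson_aalen_cumulative(20)
print(jumps[0], jumps[-1])   # smallest and largest jump
print(cumulative[-1])        # estimate after the last event
```

The first jump is 1/20 and the last is 1, so under these assumptions the later steps of the estimated cumulative hazard are the tallest.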
Figure 11.7(b) shows the true $F^c(t)$ (dashed line) as well as the two estimates of the
CDF discussed above, the Kaplan–Meier and Breslow respectively. We see that although
they are similar, they differ in details. In particular, we see that the Breslow estimator lies
above the Kaplan–Meier estimator everywhere. This is always true because of the inequality
$e^{-x} \ge 1 - x$, which holds for all x, since it implies that
\[
F^c_n(t) = \prod_{s \le t} \bigl(1 - d\Lambda_n(s)\bigr) \le \prod_{s \le t} e^{-d\Lambda_n(s)} = e^{-\Lambda_n(t)}.
\]
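The inequality can be checked numerically. The following is a minimal sketch under our own assumptions: the function name and the data are hypothetical, the increments are the crude ones (number of events divided by number at risk), and the product and the exponential are evaluated at each distinct event time.

```python
import math


def km_and_breslow(times, events):
    """Return (t, Kaplan-Meier, Breslow) survival estimates at each
    distinct event time.

    times  : observed times (event or right-censoring)
    events : 1 if the observation is an event, 0 if it is censored
    """
    rows = []
    km, cum_hazard = 1.0, 0.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)             # Y_n(t)
        d = sum(1 for u, e in zip(times, events) if u == t and e)
        increment = d / at_risk                               # crude increment
        km *= 1.0 - increment                                 # product of (1 - increment)
        cum_hazard += increment                               # Nelson-Aalen estimate
        rows.append((t, km, math.exp(-cum_hazard)))           # Breslow = exp(-cum_hazard)
    return rows


# Hypothetical right-censored sample (times in years), for illustration only.
times = [0.5, 1.2, 1.2, 2.0, 2.7, 3.1, 4.4, 5.0]
events = [1, 1, 0, 1, 1, 0, 1, 1]
for t, km, breslow in km_and_breslow(times, events):
    print(f"t={t:.1f}  KM={km:.4f}  Breslow={breslow:.4f}")
```

In this sketch the Breslow value is never below the Kaplan–Meier value. When the largest observation is an event, the Kaplan–Meier estimate drops to zero there, while the Breslow estimate stays strictly positive.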