
HOW TO DESCRIBE THE DISTRIBUTION OF A SAMPLE 151
We have already noted that there are continuous and discrete stochastic variables. For
the former the CDF is a continuous function, whereas for the latter it is a step function.
Some data are hybrids of discrete and continuous data. For example, in order to assess a
person’s disability we may use a Visual Analogue Scale (VAS) which means that the disability
is described by a mark on a line going from 0 to 1, where 0 means no disability and 1 complete
disability. The corresponding CDF may have jumps at both the end points, but be a continuous
and increasing function in between.
The value of the e-CDF at the point $x$ is the proportion of elements in the sample $\{x_1, \ldots, x_n\}$ that are at most $x$ in magnitude. It therefore has the analytical expression
$$F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I(x_i \le x), \qquad (6.1)$$
where I(C) denotes the indicator function which is 1 if the condition C is true and 0 if it is false.
If precisely $k$ of these $x_i$ are less than or equal to a particular value $x$, then $F_n(x) = k/n$. If so, and $x$ is one of the observations, we call $k$ the rank of $x$ and denote it by $R_n(x)$ (which therefore equals $nF_n(x)$). (If there are ties, a modification is needed.) There is a fundamental theorem in probability theory, the Glivenko–Cantelli theorem, which says that the e-CDF $F_n(x)$ converges to $F(x)$, uniformly in $x$, as the sample size $n$ increases.
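The definition in equation (6.1) can be sketched directly in code. This is a minimal illustration; the function name `ecdf` and the sample values are ours, not from the text.

```python
# Sketch of the e-CDF, equation (6.1): F_n(x) is the proportion of
# observations x_i in the sample that satisfy x_i <= x.
def ecdf(sample, x):
    """Value of the empirical CDF of `sample` at the point x."""
    n = len(sample)
    return sum(1 for xi in sample if xi <= x) / n

sample = [2.0, 3.5, 3.5, 5.0]
print(ecdf(sample, 3.5))  # 3 of the 4 observations are <= 3.5, so 0.75
```

At an observed value $x$ with no ties above it, $n\,F_n(x)$ recovers the rank $R_n(x)$ described in the text.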
The e-CDF is defined as a (right-continuous) step function. For many purposes it would have been more convenient, at least for a continuous CDF, to define it as a piecewise linear function such that the point $(x_k, F(x_k))$ is connected to $(x_{k+1}, F(x_{k+1}))$ by a straight line, instead of via the point $(x_{k+1}, F(x_k))$ (the staircase). We will, however, mostly stick with the convention, except for a few occasions when we point out some benefits of the linearly interpolated version, when this clarifies a particular statistical method.
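The two conventions can be contrasted in a short sketch. We assume distinct sorted observations for simplicity; the function names are illustrative, not from the text.

```python
import bisect

def ecdf_step(xs, x):
    """Right-continuous staircase: proportion of sorted xs that are <= x."""
    return bisect.bisect_right(xs, x) / len(xs)

def ecdf_linear(xs, x):
    """Linearly interpolated version: connects (x_k, F(x_k)) to
    (x_{k+1}, F(x_{k+1})) by a straight line. Assumes xs sorted, distinct."""
    n = len(xs)
    if x <= xs[0]:
        return 1 / n if x == xs[0] else 0.0
    if x >= xs[-1]:
        return 1.0
    k = bisect.bisect_right(xs, x) - 1       # xs[k] <= x < xs[k+1]
    fk, fk1 = (k + 1) / n, (k + 2) / n       # F at xs[k] and xs[k+1]
    t = (x - xs[k]) / (xs[k + 1] - xs[k])    # interpolation weight
    return fk + t * (fk1 - fk)

xs = [1.0, 2.0, 3.0, 4.0]
print(ecdf_step(xs, 2.5))    # staircase stays at 0.5 until x = 3
print(ecdf_linear(xs, 2.5))  # straight line gives 0.625
```

The two versions agree at the observed points and differ only between them.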
There is an alternative formula for the e-CDF. Its derivation is based on the observation that for any monotone function we have $F(x) = F(x-) + \Delta F(x)$, where $\Delta F(x)$ is the jump at the point $x$ and $F(x-)$ is the left-hand limit of $F(x)$. If we rewrite this for the complementary function $F^c(x) = 1 - F(x)$, we get
$$F^c(x) = F^c(x-) - \Delta F(x) = F^c(x-)\left(1 - \frac{\Delta F(x)}{F^c(x-)}\right).$$
This observation is only of interest at jump points, and since the e-CDF consists exclusively of jump points, it is particularly useful for that function. In fact, if we apply the observation repeatedly to the e-CDF, we get the alternative formula for $F_n(x)$ referred to above. It is computed as follows. Order the different values in the sample into a strictly increasing sequence $x_1 < x_2 < \ldots$ and let $d_j$ denote the number of observations with value $x_j$. Let $r_j = nF_n^c(x_j-)$ be the number of observations that are at least $x_j$ in size. The formula then reads
$$F_n^c(x) = \prod_{j:\, x_j \le x}\left(1 - \frac{d_j}{r_j}\right),$$
where $\prod_C a_j$ means that we should multiply all the $a_j$ that fulfill the criterion defined in $C$.
This way of writing the e-CDF is called the Kaplan–Meier form of the e-CDF, or the Kaplan–
Meier estimate of the CDF. It has the important property that it can be generalized to some
situations where there is incomplete knowledge in the data. If we study the time until some