162 EMPIRICAL DISTRIBUTION FUNCTIONS
6.6 Analysis of paired data
We next return to the COPD patients and their sputum data, but now we assume we have
two observations on each of the 20 subjects. The second set of data consists of the following
numbers (paired with the previous sequence): 18, 19, 2, 2, 5, 515, 30, 10, 3, 55, 127, 2, 260,
16, 8, 301, 443, 1, 26, 24. The original observations (X) were taken after some drug was
given, whereas the new data (Y ) were obtained under the same conditions, but without this
drug. Precisely as with the Cushny and Peebles data in the previous section, we wish to analyze
the difference Z = Y − X. Our intention here is to have a more general discussion about what
we can do by exploring the CDF for Z, noting that if X and Y have the same distribution,
the distribution for Z would be symmetric around zero (see equation (4.1)). We wish to find
ways to explore this symmetry in order to obtain a test for the null hypothesis that there is no
effect of the drug.
The immediate consequence of the symmetry is that both the median and the mean of Z are
zero. One way to test the null hypothesis is therefore to test whether either of these two parameters
is zero. For the mean we get the estimate −659 with 95% confidence limits (−1318, 1), and
for the median we get the estimate −87 with 95% confidence limits (−235, −39). For the
mean there is not sufficient evidence to reject the null hypothesis at the conventional two-sided
5% level, since the interval contains zero, whereas based on the median we have sufficient
evidence. However, because of the skewness of the data, the proposed confidence interval for
the mean may be quite inaccurate (its actual coverage may differ from the nominal 95%). We will
see in the next section that this is indeed the case.
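One distribution-free way to obtain a confidence interval for a median uses order statistics together with the Binomial(n, 1/2) distribution. The sketch below illustrates that idea under the assumption of continuous data; it is not necessarily the method behind the limits quoted above, and the function name `median_ci` is hypothetical.

```python
from math import comb


def median_ci(values, conf=0.95):
    """Order-statistic confidence interval for the median.

    Returns (lower, upper, achieved_coverage). The interval runs from the
    r-th smallest to the r-th largest observation, where r is the largest
    rank for which each tail probability P(B <= r - 1), B ~ Binomial(n, 1/2),
    stays at or below (1 - conf) / 2.
    """
    z = sorted(values)
    n = len(z)

    def binom_cdf(k):
        # P(B <= k) for B ~ Binomial(n, 1/2)
        return sum(comb(n, i) for i in range(k + 1)) / 2 ** n

    alpha = 1 - conf
    r = 0
    # Increase r while the next rank still keeps each tail within alpha / 2.
    while r + 1 <= n // 2 and binom_cdf(r) <= alpha / 2:
        r += 1
    if r == 0:
        raise ValueError("sample too small for the requested confidence level")

    coverage = 1 - 2 * binom_cdf(r - 1)
    return z[r - 1], z[n - r], coverage
```

With n = 20 observations this rule selects the 6th and 15th ordered values, with an achieved coverage of about 95.9%. Applied to the differences Z, an interval that excludes zero corresponds to rejecting the hypothesis of a zero median by a two-sided sign test at level one minus the achieved coverage.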
The mean and median estimates are quite different, and from our previous discussion we
have a fairly good idea of why this is: we should probably log-transform our data before analyzing them. The
analysis should therefore be on the stochastic variable Z = ln Y − ln X = ln(Y/X) instead.
The mean and median of this distribution should then be back-transformed to the original
measurement scale by exponentiation. As discussed earlier, when we exponentiate the (arithmetic)
mean of the logged data, we get the geometric mean of the original data, whereas when
we exponentiate the median of the logged data, we get the median of the original data. For
our data the geometric mean estimate is 0.14 with 95% confidence limits (0.09, 0.21), and
for the median the estimate is 0.13 with 95% confidence limits (0.06, 0.22). We see that we
get more consistent results by analyzing logged data instead of the original data, but the two
estimates describe different quantities: the first is the ratio of the geometric means obtained with
and without drug treatment, and the second is the median of the individual ratios.
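The back-transformation described above can be sketched in a few lines. The function names `log_ratio_summary` and `geometric_mean` are hypothetical. The code assumes strictly positive measurements, since a zero count has no logarithm.

```python
from math import exp, log
from statistics import mean, median


def geometric_mean(values):
    """Geometric mean, computed as exp of the arithmetic mean of the logs."""
    return exp(mean([log(v) for v in values]))


def log_ratio_summary(x, y):
    """Analyze Z = ln(Y) - ln(X) and back-transform by exponentiation.

    Returns (geometric mean of the ratios Y/X, back-transformed median).
    Requires strictly positive x and y of equal length.
    """
    z = [log(yi) - log(xi) for xi, yi in zip(x, y)]
    return exp(mean(z)), exp(median(z))
```

As a usage example with made-up values (not the sputum data), take x = [10, 40, 5, 200, 8] and y = [2, 4, 1, 10, 4]. The first returned value then agrees with `geometric_mean(y) / geometric_mean(x)`, and the second equals the median of the individual ratios, 0.2. With an even number of pairs, the usual midpoint median of the logs back-transforms to the geometric mean of the two middle ratios, so the agreement with the median of the ratios is then only approximate.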
What if we analyze the ratio Y/X directly? The median is the same, but now the mean is
the arithmetic mean of the ratio, estimated as 0.19 with 95% confidence limits (0.12, 0.27).
Unlike the geometric mean, for which the mean of the ratios equals the ratio of the means, this
arithmetic mean of the ratios is not the ratio of the two arithmetic means, and it may therefore
not be a natural measure of the location of the data. If we want to discuss a mean
ratio we should analyze differences of logged data instead (see Box 6.3).
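A small numerical check makes the distinction concrete. The two helper functions below are hypothetical names, and the two pairs of values are invented purely for illustration.

```python
from statistics import mean


def mean_of_ratios(x, y):
    """Arithmetic mean of the individual ratios y_i / x_i."""
    return mean([yi / xi for xi, yi in zip(x, y)])


def ratio_of_means(x, y):
    """Ratio of the two arithmetic means, mean(y) / mean(x)."""
    return mean(y) / mean(x)


# Invented toy pairs: the individual ratios are 1 and 0.1.
x = [1.0, 100.0]
y = [1.0, 10.0]
print(mean_of_ratios(x, y))   # (1 + 0.1) / 2
print(ratio_of_means(x, y))   # 5.5 / 50.5
```

Here the mean of the ratios is 0.55, while the ratio of the means is about 0.11; the second is dominated by the pair with the largest values. On the log scale no such discrepancy arises, because the mean of ln Y − ln X is the difference of the two means of logs.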
In Figure 6.9 we have plotted the various e-CDFs discussed above. The largest, outermost graph
shows the e-CDF for the difference, which is highly skewed to the left, making measures like
the mean more or less meaningless. The middle graph shows the e-CDF for the ratios on the
original (linear) scale. This is also slightly skewed, but now to the right. Finally, the innermost
graph shows the same e-CDF but now on a logarithmic scale. This is the most symmetric of
the e-CDFs, and the scale we should work on for these data (supported by the fact that when
we consider cell counts in sputum, it is the relative effect that has clinical meaning).
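The three e-CDFs compared above can be computed along the following lines. This is a minimal sketch with hypothetical function names (`ecdf`, `paired_scales`); it only produces the step coordinates and leaves the plotting to whatever graphics tool is at hand. The log scale again assumes strictly positive measurements.

```python
from math import log


def ecdf(values):
    """Return the e-CDF as a list of (value, proportion of data <= value)."""
    z = sorted(values)
    n = len(z)
    steps = []
    for i, v in enumerate(z, start=1):
        if steps and steps[-1][0] == v:
            steps[-1] = (v, i / n)  # tied value: keep only the top of the jump
        else:
            steps.append((v, i / n))
    return steps


def paired_scales(x, y):
    """The three scales discussed in the text for paired data (X, Y)."""
    return {
        "difference": [yi - xi for xi, yi in zip(x, y)],
        "ratio": [yi / xi for xi, yi in zip(x, y)],
        "log ratio": [log(yi) - log(xi) for xi, yi in zip(x, y)],
    }
```

Calling `ecdf` on each list returned by `paired_scales(x, y)` gives the coordinates of the three step functions, whose symmetry around their medians can then be compared by eye.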