8.9 Comments and further reading
This chapter has discussed two-group comparisons using what are perhaps the two most often
used and best-known tests in medical statistics, Student’s t-test and Wilcoxon’s rank sum test.
The former is an extension of the one-sample t-test we discussed in the previous chapter. The
two-group test was actually introduced by R. A. Fisher, who discovered William Gosset’s
original paper on the one-sample case fifteen years after its publication and extended it,
not only to this two-group situation, but all the way to what became known as analysis of
variance (Senn, 2008). The t-test addresses the difference in means and requires complete
data, whereas the way to obtain confidence limits for a percentile difference, discussed in
Box 8.2, can be extended to situations with censored data (Su and Wei, 1993), by using the
appropriate variance estimate.
Lehmann (1998) gives a general introduction to nonparametric tests. The Wilcoxon test
was developed as a rank test by Frank Wilcoxon in 1945 and then, later and independently, by
Mann and Whitney, using the scores that are named for them. The test is therefore often referred
to as the Wilcoxon–Mann–Whitney test, with the score version leading to the Wilcoxon
probability (Halperin et al., 1987). The Hodges–Lehmann shift estimator was introduced
later (Hodges and Lehmann, 1963) as the median of all pairwise differences between the two groups. The characterization
in Box 8.4 was given in the paper by Fine (1966), though the discussion there is in terms of
e-CDFs instead of CDFs. The relation between the t-test and the Wilcoxon test discussed in
Box 8.3 was introduced (Conover and Iman, 1981) as a pedagogical technique to use rank
transformations as a bridge between parametric and nonparametric statistics (and also as a
method to carry out nonparametric statistics in statistical software that may not include such
methods). The initial enthusiasm for applying this philosophy (ranking the
data before analysis) to more general problems soon waned when it was realized that the
nonlinear transformation involved did not always produce sensible tests (Thompson, 1991).
The generalized odds ratio mentioned in the text was discussed by Agresti (1980), but seems
not to have gained much interest in the biostatistics community.
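
The quantities named in this paragraph can be illustrated with a minimal sketch. It assumes that the Wilcoxon probability refers to P(X < Y), with ties counted as one half, and that the shift is taken as the second group minus the first. The function names and the data are hypothetical and not taken from the book.

```python
from itertools import product
from statistics import median


def mann_whitney_u(x, y):
    """Mann-Whitney score: number of pairs with x_i < y_j, ties counted as 1/2."""
    return sum(1.0 if a < b else 0.5 if a == b else 0.0 for a, b in product(x, y))


def wilcoxon_probability(x, y):
    """Estimate of P(X < Y) + 0.5 * P(X = Y), computed as U / (m * n)."""
    return mann_whitney_u(x, y) / (len(x) * len(y))


def hodges_lehmann_shift(x, y):
    """Median of all m * n pairwise differences y_j - x_i."""
    return median(b - a for a, b in product(x, y))


# Made-up illustrative data (assumption, not from the book).
control = [11, 23, 29, 40]
treated = [25, 37, 48, 52, 61]

print("U:", mann_whitney_u(control, treated))
print("Wilcoxon probability:", wilcoxon_probability(control, treated))
print("Hodges-Lehmann shift:", hodges_lehmann_shift(control, treated))
```

Adding a constant to every observation in the second group moves the shift estimate by exactly that constant, which is one way to check such an implementation.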
I have found no reference for the various aspects of the Wilcoxon test
described in this chapter, in particular for parameter estimation in different
models using a simple estimating equation. The particular relation between the proportional
odds model and the Wilcoxon test is, however, well known and has been used to determine
the power of a study for categorical data (Whitehead, 1993).
The fact that p-values can be computed in different ways, and why the way they are
computed matters, was illustrated by Bergmann et al. (2000), who compared the outcomes of a
Wilcoxon test on a particular data set across 11 PC-based statistical software
programs. The p-values varied, depending on whether a large-sample approximation or an
exact permutation form of the test was used and, in the former case, whether or not a continuity
correction (see page 98) was used. The key message is that you need to understand precisely
what your particular software does, before you use it.
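
To see how much the choice of method can matter, the following sketch computes a two-sided p-value for the rank sum test in three ways: by exact enumeration of all rank assignments, and by the normal approximation with and without a continuity correction. It is only an illustration under stated assumptions: the data are made up, there are no ties, and the function names are hypothetical rather than those of any particular software package.

```python
from itertools import combinations
from math import erfc, sqrt


def rank_sum(x, y):
    """Sum of the ranks of x in the pooled sample (assumes no tied values)."""
    pooled = sorted(list(x) + list(y))
    return sum(pooled.index(v) + 1 for v in x)


def exact_p(x, y):
    """Two-sided exact permutation p-value for the rank sum of x."""
    m, n = len(x), len(y)
    expected = m * (m + n + 1) / 2
    deviation = abs(rank_sum(x, y) - expected)
    # Under the null hypothesis every choice of m ranks out of m + n is equally likely.
    sums = [sum(c) for c in combinations(range(1, m + n + 1), m)]
    extreme = sum(1 for s in sums if abs(s - expected) >= deviation)
    return extreme / len(sums)


def normal_p(x, y, continuity=True):
    """Two-sided large-sample p-value, optionally with continuity correction."""
    m, n = len(x), len(y)
    expected = m * (m + n + 1) / 2
    sd = sqrt(m * n * (m + n + 1) / 12)
    deviation = abs(rank_sum(x, y) - expected)
    if continuity:
        deviation = max(deviation - 0.5, 0.0)
    # erfc(z / sqrt(2)) equals the two-sided tail area 2 * (1 - Phi(z)).
    return erfc(deviation / sd / sqrt(2))


# Made-up illustrative data (assumption, not from the book).
control = [11, 23, 29, 40]
treated = [25, 37, 48, 52, 61]

print("exact:", exact_p(control, treated))
print("normal, corrected:", normal_p(control, treated, continuity=True))
print("normal, uncorrected:", normal_p(control, treated, continuity=False))
```

With samples this small the three numbers differ noticeably, and the uncorrected approximation gives the smallest p-value of the three.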
The discussion of crossover studies is short, and focused on the design issues and on the
analysis of a two-period crossover study as a parallel group study. Similar discussions, as well
as guidance on how to analyze crossover studies with more than two periods, can be found
in either Jones and Kenward (1989) or Senn (2002).
For mathematical details on the unconditional test in Section 8.8, see Rao (1967) and
references therein, as well as Marden and Perlman (1980). It should be noted that we do not