SOME WAYS TO COMPUTE p-VALUES 225
observed data points to the two groups with the group sizes kept constant. For each such
permutation of data we compute the test statistic, and from that we can compute the CDF
for the rank sums. By picking out tails, we can compute the p-value. This is however not a
convenient method in our two-group case, with 40 data points equally divided between two
groups, because there are (40 choose 20) ≈ 1.4 · 10^11 such permutations. Because of this
sheer number we do not compute the exact p-value here, but algorithms have been developed
that make exact p-values computable even for moderately large data sets.
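The count of group assignments is easy to verify directly, and for a small sample the exact permutation distribution of the rank sum can simply be enumerated. The following sketch uses a hypothetical toy sample of eight values split into two groups of four; the data are illustrative and not from the text:

```python
from itertools import combinations
from math import comb

# Number of ways to split 40 observations into two groups of 20:
print(comb(40, 20))  # 137846528820, i.e. about 1.4e11

# Exact permutation distribution of the rank sum for a toy sample.
data = [1.2, 3.4, 2.2, 5.1, 4.0, 0.7, 2.9, 6.3]   # combined sample, n = 8
m = 4                                              # size of group 1
ranks = {v: r for r, v in enumerate(sorted(data), start=1)}
observed = sum(ranks[v] for v in data[:m])         # rank sum of group 1

# Enumerate all C(8, 4) = 70 assignments of group labels and count
# how often the rank sum is at least as large as the observed one.
count = 0
total = 0
for idx in combinations(range(len(data)), m):
    w = sum(ranks[data[i]] for i in idx)
    total += 1
    if w >= observed:
        count += 1
one_sided_p = count / total
```

For 40 observations the same loop would have to visit all 1.4 · 10^11 assignments, which is why the direct enumeration is only convenient for small samples.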
Unless the sample sizes are small, the CLT, described above, provides us with a method to
compute approximations to the p-value for the Wilcoxon test. To apply it, we
need to compute the expected rank sum and its variance under the assumption of G(x) = F (x),
which is a pure combinatorial problem, the result of which was given in Box 8.3. Note that
the accuracy depends only on sample sizes, and not on the actual distribution F (x), as was
the case for the t-test.
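The normal approximation is immediate once the moments from Box 8.3 are in hand. A minimal sketch, assuming the standard rank-sum moments E[W] = m(m + n + 1)/2 and Var[W] = mn(m + n + 1)/12 under G(x) = F(x), applied to the rank sum 523 with m = n = 20 from the two-group setting in the text:

```python
from math import sqrt
from statistics import NormalDist

def wilcoxon_normal_p(w, m, n):
    """Two-sided normal (CLT) approximation to the rank-sum p-value."""
    mean = m * (m + n + 1) / 2          # E[W] under G(x) = F(x)
    sd = sqrt(m * n * (m + n + 1) / 12)  # sqrt(Var[W])
    z = (w - mean) / sd
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Rank sum 523 with 40 observations equally divided between the groups.
p = wilcoxon_normal_p(523, 20, 20)
```

The resulting value is close to 0.002, in the same range as the exact and resampling-based computations, and its accuracy depends only on the sample sizes.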
There is a version of the exact computation that can be applied to the t-test as well, which
we describe in the next example.
Example 8.2 For the t-test, the distribution of the test statistic was deduced from mathemat-
ical operations based on distributional assumptions. An alternative approach is to deduce the
distribution of the test statistic by estimating the combined CDF F(x) with its e-CDF F̂mn(x),
and enumerate all possible assignments of groups to this data. For each such assignment we
have one value of the test statistic, and consequently the process gives us the CDF for the test
statistic under these conditions, from which a p-value can be computed as a simple fraction.
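The enumeration in Example 8.2 can be sketched as follows. The two small data vectors are hypothetical stand-ins, since the example's data are not reproduced here; every assignment of the pooled values to the two groups is visited and the t statistic recomputed:

```python
from itertools import combinations
from statistics import mean, stdev

def t_stat(x, y):
    """Two-sample t statistic with pooled variance."""
    m, n = len(x), len(y)
    sp2 = ((m - 1) * stdev(x) ** 2 + (n - 1) * stdev(y) ** 2) / (m + n - 2)
    return (mean(x) - mean(y)) / (sp2 * (1 / m + 1 / n)) ** 0.5

# Hypothetical small groups (placeholders for real measurements).
x = [4.1, 5.3, 6.0, 5.5]
y = [3.2, 4.4, 3.9, 2.8]
observed = t_stat(x, y)

pooled = x + y
m = len(x)
# Enumerate every assignment of m of the pooled values to "group 1".
stats = []
for idx in combinations(range(len(pooled)), m):
    g1 = [pooled[i] for i in idx]
    g2 = [pooled[i] for i in range(len(pooled)) if i not in idx]
    stats.append(t_stat(g1, g2))

# Two-sided p-value as a simple fraction of the enumerated distribution.
p = sum(abs(t) >= abs(observed) for t in stats) / len(stats)
```

The enumerated values of the statistic define its CDF under the condition that these particular observations are the ones assigned to groups, which is exactly the conditional viewpoint discussed next.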
A test of the kind described in Example 8.2 is called a randomization test (or permutation
test). The p-values obtained from such tests are often referred to as exact, which they are
if the (combined) sample obtained is the only possible sample. In other words, the p-value
obtained in this way is from a conditional test, where we take the values of the observations
in that particular experiment as given. As an unconditional test, however, it is not exact – how
accurate the p-value is depends on how well the e-CDF of the combined sample approximates
the true CDF. If we apply this procedure to the Wilcoxon rank sum statistic, we recover the
combinatorial p-value described above, since it coincides with the exact computation of the
Wilcoxon p-value. This also implies that the computational hurdles of that computation apply
to randomization tests as well.
A related approach, which is also feasible for large samples, is to use bootstrapping. This
was described in Section 6.7 as a way to estimate the CDF of the test statistic by repeated
resampling: we resample a large number of times and compute the test statistic for each
resample. This gives us an estimate of the distribution
of the test statistic for the actual data, which puts us in a position to compute the appropriate
p-value. This method will be approximate both for the t-test and for the Wilcoxon test. How
good the approximation is depends on how well the two e-CDFs describe their respective
CDFs, and also on how many samples we take.
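The bootstrap procedure can be sketched as follows. The data below are synthetic placeholders, and the t statistic serves as the test statistic; the same scheme works for the rank sum. Each resample draws with replacement from the combined e-CDF and is then split into two groups:

```python
import random
from statistics import mean, stdev

random.seed(0)

def t_stat(x, y):
    """Two-sample t statistic with pooled variance."""
    m, n = len(x), len(y)
    sp2 = ((m - 1) * stdev(x) ** 2 + (n - 1) * stdev(y) ** 2) / (m + n - 2)
    return (mean(x) - mean(y)) / (sp2 * (1 / m + 1 / n)) ** 0.5

# Synthetic two-group data, 20 observations per group.
x = [random.gauss(10, 2) for _ in range(20)]
y = [random.gauss(11, 2) for _ in range(20)]
observed = t_stat(x, y)

pooled = x + y
B = 10_000
count = 0
for _ in range(B):
    # Resample the combined e-CDF with replacement, then split in two.
    resample = [random.choice(pooled) for _ in range(len(pooled))]
    t = t_stat(resample[:20], resample[20:])
    if abs(t) >= abs(observed):
        count += 1
p = count / B   # two-sided bootstrap p-value
```

Because the resampling is random, the p-value fluctuates between runs, and the fluctuation is largest in the tails; taking more resamples B reduces it.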
Example 8.3 For the sputum data we computed the Wilcoxon test statistic as 523. The
two-sided p-value 2(1 − F(523)) can then be computed from a bootstrap estimate of the
distribution F(x) of the rank sum statistic, based on 10 000 resamples, as 0.0021. This value
varies slightly between runs, since tail probabilities require many resamplings to estimate
accurately. In this case the CDF