THE PURPOSE OF MATHEMATICAL MODELS 247
have in the outcome, and it is a predictor if we use its value to get a more refined prediction
of the outcome than not knowing it would provide. The modeling is the same whichever
word we use, but the prediction terminology helps us better understand what we do. With
no predictors available, and forced to guess the outcome, our best guess would be the
mean. This is a sensible guess because it minimizes the expected value of the squared
residuals (we could instead use the median, which minimizes the expected value of the
absolute residuals, but that is mathematically more complicated). If we have a model
that relates the outcome to a particular predictor, and we have measured that predictor, we
would instead predict with the conditional mean of the outcome, given the observed value of
the covariate. The model is good if this increases the precision of our prediction. Another term we
have used in earlier chapters is ‘confounder’, used primarily in epidemiology. A confounder
is an explanatory variable, other than the one that is under investigation, which is a predictor
of the response. In a non-epidemiology context the corresponding term is often ‘covariate’.
Just as the exposure we study is not itself a confounder, the explanatory variable we
have designed an experiment to investigate is often not included among the covariates,
though on other occasions it is. We will mostly use the term ‘covariate’ in our discussion for any of
these concepts, and in a wide sense.
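The claim above, that the mean minimizes the expected squared loss while the median minimizes the expected absolute loss, can be checked numerically. The following sketch simulates Gaussian data and searches a grid of constant guesses; the data and all names are illustrative, not taken from any example in the text:

```python
import random
import statistics

random.seed(0)
# Simulated outcomes with no predictors available.
y = [random.gauss(5.0, 2.0) for _ in range(1000)]

def mean_sq_loss(c):
    """Average squared residual if we always guess c."""
    return sum((v - c) ** 2 for v in y) / len(y)

def mean_abs_loss(c):
    """Average absolute residual if we always guess c."""
    return sum(abs(v - c) for v in y) / len(y)

# Grid of candidate constant guesses spanning the data.
lo, hi = min(y), max(y)
grid = [lo + (hi - lo) * i / 2000 for i in range(2001)]

best_sq = min(grid, key=mean_sq_loss)
best_abs = min(grid, key=mean_abs_loss)

print(f"mean   = {statistics.mean(y):.3f}, squared-loss minimizer  = {best_sq:.3f}")
print(f"median = {statistics.median(y):.3f}, absolute-loss minimizer = {best_abs:.3f}")
```

Up to the grid resolution, the squared-loss minimizer coincides with the sample mean and the absolute-loss minimizer with the sample median.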
We can divide covariates into those that are fixed, such as gender, which stay the same
if we repeat the experiment on the same subject, and those that are random,
which means that if we take a new measurement of such a covariate, we will probably get
a new value. Examples of random covariates include baseline measurements of outcome
variables such as blood pressure or lung function. However, in this chapter all analysis will
be conditional analysis on observed covariate values, which means that we also consider the
random covariates to be fixed. As a consequence we can, strictly speaking, only generalize
the results to situations with the same set of covariate values, and must use other means to
extend them to the population at large. This may be an important point philosophically, but is
mostly ignored in practice.
Suppose we are comparing some treatments, or exposures, on an outcome variable with
a Gaussian distribution and that we have a number of covariates we may wish to include in a
linear statistical model. Why should we contemplate doing so? Here are some reasons.
1. To adjust for inherent differences among comparison groups in order to reduce bias.
This is of particular importance in studies where groups cannot be balanced by the use
of randomization or matching, as is the case in many observational studies.
2. To generate more powerful statistical tests through variance reduction, which will take
place if an appropriate covariate explains some of the variation present. Adjustment
for baseline measurements in a randomized experiment is an example of this.
3. To induce equivalence of comparison groups that are generated by randomization. Ran-
domization guarantees approximate equivalence, but statistical adjustment can offset
minor imbalances in important predictors of the outcome variable.
4. To clarify to what extent treatment effects are explained by other factors, poten-
tially leading to a change in the interpretation of treatment effects. Conversely, the
lack of explanatory factors would help to substantiate the independent existence of
treatment effects.
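Reason 2 can be illustrated with a small Monte Carlo sketch. In a hypothetical randomized experiment whose outcome depends on a baseline covariate, we compare two estimators of the treatment effect: the plain difference in group means, and a baseline-adjusted estimator obtained by residualizing both outcome and treatment on the baseline (the Frisch–Waugh device) and regressing residual on residual. The simulation design, effect sizes, and names are illustrative assumptions, not an example from the text:

```python
import random
import statistics

def slope(u, v):
    """Least-squares slope of v on u (with intercept)."""
    mu, mv = statistics.mean(u), statistics.mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = sum((a - mu) ** 2 for a in u)
    return num / den

def residuals(u, v):
    """Residuals of v after regressing it on u (with intercept)."""
    b = slope(u, v)
    a = statistics.mean(v) - b * statistics.mean(u)
    return [vi - (a + b * ui) for ui, vi in zip(u, v)]

random.seed(1)
n, reps = 200, 400            # subjects per experiment, simulated experiments
unadj, adj = [], []
for _ in range(reps):
    x = [random.gauss(0, 1) for _ in range(n)]      # baseline covariate
    z = [random.randint(0, 1) for _ in range(n)]    # randomized treatment
    # True treatment effect 1.0; baseline explains much of the variation.
    y = [1.0 * zi + 2.0 * xi + random.gauss(0, 1) for zi, xi in zip(z, x)]
    # Unadjusted estimate: difference in group means.
    y1 = [yi for yi, zi in zip(y, z) if zi == 1]
    y0 = [yi for yi, zi in zip(y, z) if zi == 0]
    unadj.append(statistics.mean(y1) - statistics.mean(y0))
    # Adjusted estimate: residualize y and z on x, regress residual on residual.
    adj.append(slope(residuals(x, z), residuals(x, y)))

print(f"sd of unadjusted estimator: {statistics.stdev(unadj):.3f}")
print(f"sd of adjusted estimator:   {statistics.stdev(adj):.3f}")
```

Both estimators are centered on the true effect, but the adjusted one has a markedly smaller sampling variance, because the baseline covariate explains part of the outcome variation, which is exactly the variance-reduction argument of reason 2.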