
188 CORRELATION AND REGRESSION IN BIVARIATE DISTRIBUTIONS
This expression confirms what we can see in Figure 7.3, namely that if we increase ρ from
0 toward 1, the ellipse becomes more elongated in the x = y direction and narrower in the
x = −y direction. In other words, the closer the correlation is to one, the more concentrated the
distribution is around the line x = y. Similarly, if the correlation is close to −1, the distribution
is concentrated around the line x = −y.
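The elongation can also be checked numerically. The sketch below is not from the text: it assumes the standardized covariance matrix [[1, ρ], [ρ, 1]] and computes the variance of the distribution along the unit vectors in the x = y and x = −y directions. The function name `axis_variances` is hypothetical.

```python
import math

def axis_variances(rho):
    """Variances of a standardized bivariate Gaussian along x = y and x = -y.

    For a unit vector u, Var(u . (X, Y)) = u' S u with S = [[1, rho], [rho, 1]].
    """
    u = 1.0 / math.sqrt(2.0)

    def quad(u1, u2):
        # Quadratic form u' S u written out for the 2 x 2 case.
        return u1 * u1 + 2.0 * rho * u1 * u2 + u2 * u2

    return quad(u, u), quad(u, -u)  # equal to 1 + rho and 1 - rho

for rho in (0.0, 0.5, 0.9):
    along, across = axis_variances(rho)
    # Last column: ratio of standard deviations along the two directions.
    print(rho, along, across, math.sqrt(along / across))
```

The ratio of the two standard deviations, sqrt((1 + ρ)/(1 − ρ)), is the ratio of the semi-axes of the ellipse, and it grows without bound as ρ approaches 1.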
On occasion we may have a 2 × 2 table in which both factors actually have continuous
distributions, but data collection has been dichotomized into yes/no categories only. We can
then analyze the table in such a way that we actually get a correlation measure between the
two factors. This is illustrated in the following example.
Example 7.3 The table below describes a cross-classification according to whether or not a
group of workers were exposed to a particular environmental hazard, and whether or not they
showed symptoms of bronchitis:
                 Not exposed    Exposed
Bronchitis            89          123
No bronchitis        453          318
Each of these two factors can be considered to vary in strength: there may be different degrees
of exposure and the bronchitis symptom may vary in severity. However, based on a cut-off
point for each, as in a medical test, each subject has been assigned to one cell of this table. In this way
we may have a dichotomization of factors that are really continuous variables, but we have no
further recorded information on these. We therefore assume that the underlying continuous
data have a standardized bivariate Gaussian distribution, and that the table is obtained using
cut-offs a and b for exposure and bronchitis, respectively. As noted earlier, in order to fit the
data to such a distribution we need to invoke the correlation coefficient. In all there are three
parameters to fit, and three conditions:
453/983 = Φ₂(a, b; ρ),   (89 + 453)/983 = Φ(a),   (453 + 318)/983 = Φ(b).
The solution to this system is a = 0.129, b = 0.787, ρ = 0.241, where the last parameter is a
description of association. The correlation coefficient in this hypothetical model is called the
tetrachoric correlation coefficient.
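The three conditions can be solved numerically. What follows is a minimal sketch, not the author's procedure: it assumes that the two marginal conditions are inverted with the standard Gaussian quantile function, that the joint probability P(X ≤ a, Y ≤ b) is computed by one-dimensional Simpson integration, and that ρ is found by bisection (the joint probability increases with ρ). The names `bvn_cdf` and `tetrachoric` are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard Gaussian: provides pdf, cdf and inv_cdf

def bvn_cdf(a, b, rho, n=2000, lower=-8.0):
    """Approximate P(X <= a, Y <= b) for a standardized bivariate Gaussian.

    Uses P = integral over x <= a of phi(x) * Phi((b - rho*x) / sqrt(1 - rho^2)),
    evaluated with Simpson's rule on [lower, a] (n must be even).
    """
    s = math.sqrt(1.0 - rho * rho)
    h = (a - lower) / n
    total = 0.0
    for i in range(n + 1):
        x = lower + i * h
        weight = 1 if i in (0, n) else (4 if i % 2 else 2)
        total += weight * STD.pdf(x) * STD.cdf((b - rho * x) / s)
    return total * h / 3.0

def tetrachoric(p_both, p_x, p_y):
    """Solve the two marginal conditions and the joint condition for a, b, rho."""
    a = STD.inv_cdf(p_x)   # cut-off for the first factor
    b = STD.inv_cdf(p_y)   # cut-off for the second factor
    lo, hi = -0.99, 0.99   # assumed search bracket for rho
    for _ in range(40):    # bisection on rho
        mid = 0.5 * (lo + hi)
        if bvn_cdf(a, b, mid) < p_both:
            lo = mid
        else:
            hi = mid
    return a, b, 0.5 * (lo + hi)

total = 89 + 123 + 453 + 318  # 983 workers in the table
a, b, rho = tetrachoric(453 / total, (89 + 453) / total, (453 + 318) / total)
print(round(a, 3), round(b, 3), round(rho, 3))
```

With the counts from the table, the printed values should come out close to those quoted in the example.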
The bivariate Gaussian is a distribution for which the best linear predictor is also the
best of all predictors. To see this, note that equation (7.3) implies that the conditional
distribution of Y given that X = x (whose density is ϕ₂(x, y; ρ)/ϕ(x)) is also Gaussian, namely
N(ρx, 1 − ρ²). The best predictor of Y from X in the mean-square sense is the conditional mean,
which here is the linear function ρx. If we base our understanding of the bivariate Gaussian on
the geometric description in Figure 7.3, it may be a little surprising to find that the mean is ρx
and not x, since
this graph shows that the distribution of the standardized bivariate normal (when ρ>0) is
concentrated along the line y = x. To see why the mean is ρx, look at the vertical dashed line
over the point x = 1.5 in Figure 7.3. It intersects the axes of the ellipse in an oblique manner
(actually 45°), so more of it is below the line y = x than is above it. The mean is the mid-point
on this line within the ellipse, and the geometrically inclined reader can convince himself that
the line of means is precisely the line y = ρx.
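The geometric claim can be made concrete. A contour of the standardized bivariate Gaussian density has the form x² − 2ρxy + y² = c, and solving for y at a fixed x gives y = ρx ± sqrt(c − (1 − ρ²)x²), so the chord is centered at ρx. The sketch below checks this, and also integrates the conditional density numerically. The values ρ = 0.5 and c = 4 are illustrative assumptions (the text does not state the values used in Figure 7.3), and the function names are hypothetical.

```python
import math

def chord(x, rho, c):
    """Endpoints of the vertical chord at x inside x^2 - 2*rho*x*y + y^2 = c."""
    half = math.sqrt(c - (1.0 - rho * rho) * x * x)
    return rho * x - half, rho * x + half

def conditional_moments(x, rho, n=4000, span=10.0):
    """Mean and variance of Y given X = x, integrating phi2(x, y; rho) / phi(x)."""
    s2 = 1.0 - rho * rho
    phi_x = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    h = 2.0 * span / n
    m0 = m1 = m2 = 0.0
    for i in range(n):
        y = -span + (i + 0.5) * h  # midpoint rule on [-span, span]
        q = (x * x - 2.0 * rho * x * y + y * y) / s2
        dens = math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(s2)) / phi_x
        m0 += dens * h
        m1 += y * dens * h
        m2 += y * y * dens * h
    mean = m1 / m0
    return mean, m2 / m0 - mean * mean

rho, x = 0.5, 1.5  # rho is an illustrative choice, x matches the dashed line
low, high = chord(x, rho, c=4.0)
print("chord:", low, high, "mid-point:", 0.5 * (low + high))
print("below y = x:", x - low, "above y = x:", high - x)
print("conditional mean and variance:", conditional_moments(x, rho))
```

Under these assumptions the mid-point of the chord and the numerically integrated conditional mean both agree with ρx, and the longer part of the chord lies below the line y = x.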