Kallen A. Understanding Biostatistics


THE EFFECT OF MISSING COVARIATES 261

Specifically, let Y be the outcome variable, X the measured covariates and Z the missing one (we combine any set of omitted covariates into a single one for this discussion). Assume that E(Y|X = x, Z = z) = h(xβ + z) for some response function h(u) and that the population CDF of Z is given by G(z). If there is an intercept parameter $\beta_0$ in the linear model xβ, we assume that G(z) has mean zero, since any non-zero mean can be incorporated into the intercept. Since we have not measured Z, the mean we observe is not h(xβ + z), but instead

$$E(Y \mid X = x) = \int h(x\beta + z)\, dG(z).$$

This new response function may differ considerably in shape from the original. A graphical

illustration of this, in a slightly different (but equivalent) setting, is given in Figure 10.4 on

page 270 in the next chapter. The model has therefore changed and this poses the following

immediate question: if we ﬁt the reduced data to the original model (with the response function

h(u); remember that we do not know G(z), so there is not much else we can do), how much do

the regression coefficients from that model differ from the original one? We want to compare β to the $\beta^*$ which makes the function $h(x\beta^*)$ the best approximation to E(Y|X = x), and we want to understand what the difference between β and $\beta^*$ is. The answer depends on the choice of response function and on the distribution G(z).

Before we look at a few examples, let us link this up with the discussion on individual risks

and population risks. We never know what the true individual risk is; the best we can do is to obtain predictive covariates and use the conditional mean in the appropriate subpopulation as the prediction for this. But this varies with how many, and which, covariates we include. There

is one prediction E(Y |X = x, Z = z) if we know both X and Z, and another, E(Y|X = x)

if we only know the former. The purpose of estimating the regression coefﬁcients β is to

understand how sensitive the outcome is to X, and we most often think of that as individual

sensitivity. If we use the same model (i.e., response function) in the two cases, we have two

inconsistent models, and it should not be a surprise that the regression coefﬁcients shift their

interpretation, and therefore their value. They do not do so if the response function is the

identity (h(u) = u), so in that case, as for ANOVA, the meaning is independent of how many

covariates we include in the model.

Example 9.5 Consider the case of an exponential response function, $h(u) = e^{u}$, and assume that Z has a Gaussian distribution, so that

$$E(Y \mid X = x) = e^{x\beta}\int_{-\infty}^{\infty} e^{z}\, d\Phi(z/\sigma) = e^{x\beta + \sigma^2/2}.$$

This means that patient heterogeneity inflates the conditional mean of Y compared with a homogeneous population. Compared to the equation $E(Y \mid X = x) = e^{x\beta^*}$, we find that β and $\beta^*$ coincide except for the constant, for which we have the relation $\beta_0 + \sigma^2/2 = \beta_0^*$. This is the mildest effect we get on the mean when we omit covariates, except for the identity function, when there is no effect.
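The closed form in Example 9.5 can be checked numerically. The following is a minimal sketch, not part of the original text: the function names, the trapezoid rule and the values of `xb` and `sigma` are all illustrative assumptions.

```python
import math

def normal_pdf(z, sigma):
    """Density of N(0, sigma^2) at z."""
    return math.exp(-0.5 * (z / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def marginal_mean_exp(xb, sigma, n=4000, width=12.0):
    """Trapezoid approximation of E[exp(xb + Z)] for Z ~ N(0, sigma^2)."""
    lo, hi = -width * sigma, width * sigma
    step = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        z = lo + k * step
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * math.exp(xb + z) * normal_pdf(z, sigma)
    return total * step

# Hypothetical linear predictor and heterogeneity, chosen only for illustration.
xb, sigma = 0.3, 0.8
numeric = marginal_mean_exp(xb, sigma)
closed_form = math.exp(xb + sigma ** 2 / 2.0)  # only the intercept shifts
print(numeric, closed_form)
```

The two printed numbers should agree to many decimals, which illustrates that omitting Z changes only the intercept under the exponential response function.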

Other response functions have other effects. The following example gives the effect of

omitting covariates for the important case of the logistic regression when the omitted variable

is Gaussian.

262 LEAST SQUARES, LINEAR MODELS AND BEYOND

Example 9.6 Consider first the probit model for which $h(u) = \Phi(u)$ and assume that Z has the Gaussian distribution with mean zero and variance $\sigma^2$. Under that assumption,

$$E(Y \mid X = x) = \int_{-\infty}^{\infty} \Phi(x\beta + \sigma z)\, d\Phi(z) = \Phi\!\left(\frac{x\beta}{\sqrt{1 + \sigma^2}}\right).$$

(To evaluate the integral in the middle, note that it is the CDF for the difference between independent $N(x\beta, \sigma^2)$ and $N(0, 1)$ variables.) It follows that $\beta^* = \beta/\sqrt{1 + \sigma^2}$, so all coefficients are affected and regressed toward zero. An approximation for the logistic regression follows from this by appealing to Box 9.4; it gives us $\beta^* \approx \beta/\sqrt{1 + a^2\sigma^2}$, where a = 0.59.
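As a rough numerical illustration of Example 9.6 (a sketch under assumed values, not taken from the book), the code below averages the probit and logistic response functions over a standard Gaussian and compares the results with the attenuated formulas. The helper names and the values `xb = 1`, `sigma = 1` are hypothetical.

```python
import math

def std_normal_cdf(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def expit(u):
    return 1.0 / (1.0 + math.exp(-u))

def marginal_mean(h, xb, sigma, n=4000, width=10.0):
    """Trapezoid approximation of the integral of h(xb + sigma*z) dPhi(z)."""
    step = 2.0 * width / n
    total = 0.0
    for k in range(n + 1):
        z = -width + k * step
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * h(xb + sigma * z) * std_normal_pdf(z)
    return total * step

xb, sigma, a = 1.0, 1.0, 0.59  # illustrative values; a is the constant from Box 9.4

# Probit: the attenuation formula is exact.
probit_marginal = marginal_mean(std_normal_cdf, xb, sigma)
probit_formula = std_normal_cdf(xb / math.sqrt(1.0 + sigma ** 2))

# Logistic: the formula is only an approximation.
logit_marginal = marginal_mean(expit, xb, sigma)
logit_formula = expit(xb / math.sqrt(1.0 + (a * sigma) ** 2))

print(probit_marginal, probit_formula)
print(logit_marginal, logit_formula)
```

The probit pair should agree to numerical precision, whereas the logistic pair should agree only approximately, since it relies on replacing the logistic function by a Gaussian CDF.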

Note that the Z we have discussed above is actually γZ, where γ is the true regression

coefﬁcient for the unknown covariate, and it is really γ

that enters the expressions above.

The effect is therefore the combined effect of the heterogeneity in Z, as measured by the

variance, and the predictive power, as measured by γ, of the covariate.

Example 9.7 Consider a 2 × 2 table describing an exposure-response relationship. The model we have is logit(p(x, ξ)) = ξ + βx, where x = 1 for an exposed individual and x = 0 for a control, and $\theta = e^{\beta}$ is the odds ratio we require. The ξs are allowed to vary between individuals with a CDF P(ξ). The odds ratio calculated from the population data is then

$$\psi = \frac{P_1(1 - P_0)}{P_0(1 - P_1)}, \quad\text{where}\quad P_x = \int_{-\infty}^{\infty} \frac{e^{\xi+\beta x}}{1 + e^{\xi+\beta x}}\, dP(\xi).$$

If we approximate the logistic distribution with the Gaussian CDF and assume that the distribution for ξ is $N(\alpha, \sigma^2)$, we see that (approximately)

$$\ln\frac{P_x}{1 - P_x} \approx \frac{\alpha + \beta x}{\sqrt{1 + a^2\sigma^2}},$$

from which it follows that $\psi \approx \theta^{\nu}$, where $\nu = 1/\sqrt{1 + a^2\sigma^2} < 1$. This means that, because of the heterogeneity, the population odds ratio we calculate from data (ψ) is expected to be biased toward one, compared with the odds ratio that is relevant for the individual (θ).
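The attenuation of the population odds ratio in Example 9.7 can be sketched numerically. The parameter values below (`alpha`, `beta`, `sigma`) and all function names are hypothetical choices for illustration; the integral is approximated with a simple trapezoid rule.

```python
import math

def expit(u):
    return 1.0 / (1.0 + math.exp(-u))

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def population_risk(x, alpha, beta, sigma, n=4000, width=10.0):
    """P_x: individual risk expit(xi + beta*x) averaged over xi ~ N(alpha, sigma^2)."""
    step = 2.0 * width / n
    total = 0.0
    for k in range(n + 1):
        z = -width + k * step
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * expit(alpha + sigma * z + beta * x) * std_normal_pdf(z)
    return total * step

alpha, beta, sigma, a = -0.5, 1.0, 1.5, 0.59  # assumed values

p0 = population_risk(0, alpha, beta, sigma)
p1 = population_risk(1, alpha, beta, sigma)
psi = p1 * (1.0 - p0) / (p0 * (1.0 - p1))      # population odds ratio
theta = math.exp(beta)                          # individual odds ratio
nu = 1.0 / math.sqrt(1.0 + (a * sigma) ** 2)    # attenuation exponent

print(psi, theta ** nu, theta)
```

Under these assumptions the population odds ratio lies between one and θ, and is reasonably close to the approximation $\theta^{\nu}$.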

This is of particular relevance to a one-to-one matched study, for which we have the probability table (see page 98)

                      Controls
                      E          E^c
    Cases    E        P_11       P_10
             E^c      P_01       P_00

where

$$P_{xy} = e^{\beta x}\int_{-\infty}^{\infty} \frac{e^{(x+y)\xi}}{(1 + e^{\xi+\beta})(1 + e^{\xi})}\, dP(\xi).$$

The assumption here is that the value ξ is common to the case and the control (that is what the matching tries to achieve). We see that $P_{10}/P_{01} = \theta$, so this estimate of the odds ratio is not influenced by the heterogeneity. This explains why, for the second Hodgkin’s lymphoma study in Section 4.5.1, we obtained a smaller estimate from the first analysis than from the second. The relation is $1.47 = 2.14^{\nu}$, from which we get an indication of the heterogeneity by solving for ν. However, there is no concomitant increase in precision: the 90% confidence interval for ψ is (0.88, 2.46), whereas that for θ from the matched analysis is (1.03, 4.47) (derived from the Wilson interval for a single binomial parameter by transforming to the odds). This is consistent with the discussion above, and also reflects the fact that fewer observations are used in the second analysis.
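Solving the relation between the two odds ratio estimates for ν is a one-line calculation. The sketch below also converts ν into an implied standard deviation using $\nu = 1/\sqrt{1 + a^2\sigma^2}$ with a = 0.59; this second step is an illustration that rests on the Gaussian approximation and on the two point estimates being taken at face value.

```python
import math

a = 0.59                         # constant from the logistic/Gaussian approximation
psi_hat, theta_hat = 1.47, 2.14  # unmatched and matched odds ratio estimates

# Solve psi = theta**nu for nu.
nu = math.log(psi_hat) / math.log(theta_hat)

# Invert nu = 1/sqrt(1 + a^2 sigma^2) for the implied heterogeneity.
sigma = math.sqrt(1.0 / nu ** 2 - 1.0) / a

print(nu, sigma)
```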

The difference between β and $\beta^*$ may be considered some kind of misspecification bias, because we analyze the wrong model. This may not always be a good use of the word ‘bias’, because it really only reflects the fact that the overall effect seen in a population depends on how much heterogeneity is left unaccounted for. The more predictive covariates we include, the smaller is this residual heterogeneity. The general observation is that the larger the heterogeneity, the smaller is the effect we see in the population (we have $|\beta^*| \le |\beta|$ in the notation above, with the difference increasing with the heterogeneity). The exceptions here are the identity and exponential response functions. A simple regression model in a heterogeneous environment provides estimates, for example treatment estimates, that may be reasonably accurate from a population perspective, but wrong when interpreted as individual effects. This distinction between the population perspective and the individual perspective will play a major role in our discussion on survival data and the Cox model later. It is also further discussed in the next chapter, where we discuss the difference between a subject-specific and a population averaged approach to the description of dose response.

We may also note that because of the general observation that the variance can be decomposed as V(Y) = E(V(Y|Z)) + V(E(Y|Z)) (applied to the conditional distribution of Y given that X = x), the (conditional) variance of Y always increases with omitted covariates, including the otherwise harmless case with the identity response function. However, this does not imply that the precision in the estimated regression coefficients has to increase when we include more predictive covariates, if they get a new meaning.

9.7 The exponential family of distributions

The rest of this chapter is more mathematical in character. It is about a particular drive in

mathematics: the wish to generalize and systematize, to see what is common to a number of particular cases and find a general formulation which treats these as special cases.

We seek a general theory, including all the proofs necessary, which we can apply to the

particular cases, without the need for individual proofs. This is something that appeals to

mathematicians, and statistics is a subdiscipline of mathematics. We therefore wish to take

this opportunity to formulate as part of a general framework the regression theories so far

encountered. This will give us the tools to ﬁnd variations of these, applicable to speciﬁc

problems (not that we will make much use of these tools, but at least we will be able to if

we wish).

We will use the following definition. A distribution is said to belong to the exponential family of distributions if its CDF can be written in the form

$$dF(x, \theta) = e^{(x\theta - \kappa(\theta))/\phi}\, dF_0(x) \qquad (9.5)$$

for a parameter vector θ and a positive scalar φ. Here κ(θ) is a function of θ alone and the reference function $F_0(x)$ (which does not have to be a CDF) does not depend on θ (but is allowed to depend on φ). The parameter φ is called the dispersion parameter of the distribution. The first and obvious example is the case where we take the dispersion parameter to be one, $dF_0(x)$ to be one when x > 0 and zero otherwise, and θ = −λ. The result is the probability density function $dF(x, \lambda) = \lambda e^{-\lambda x}$, x > 0, which is the exponential distribution.

The deﬁnition above is not the only deﬁnition possible for the exponential family. The

most common definition is probably to write its density as

$$a(\theta)\, e^{Q(\theta)T(x)}\, dF(x)$$

for some reference function F(x). The form in equation (9.5) is the special case when we (1)

use Q(θ) as parameter instead of θ, (2) consider the distribution of T (X) instead of that of the

original variable X, and (3) introduce the additional extra dispersion parameter φ. The form

in equation (9.5) is called the canonical form for the family and the parameter θ is called the

natural (or canonical) parameter for the distribution. If we understand the canonical form for

the family, we understand the general exponential family. Some key examples of distribution

families found within the exponential family are listed in Box 9.5. Of particular interest here

is that for the binomial distribution the natural parameter θ is not the proportion p, but the

log-odds ln(p/(1 − p)).

The examples in Box 9.5 constitute only a sample, but not all distributions encountered so

far belong to the exponential family. The t distribution is one exception, another is the logistic

distribution, but the logistic function plays a fundamental role for the binomial distribution,

since it maps the natural parameter (the log-odds) to the binomial proportion.

All these examples are univariate distributions. For any of them it is the case that if we have a sample of n independent observations $x_i$ of the same distribution, the multivariate distribution is proportional to

$$e^{n(\bar{x}\theta - \kappa(\theta))/\phi}$$

and the proportionality factor does not depend on θ (but may depend on φ). This means that the CDF of a sample of n independent observations is summarized by the arithmetic average $\bar{x}$, which has (essentially) the same distribution as the components, with the dispersion parameter φ replaced by φ/n.

The expression for the CDF that defines the exponential family leads to a simple and explicit form for the mean and variance of such a distribution, determined by the function κ(θ). This also provides us with a method to estimate the natural parameter (though not φ, which we consider known in this discussion). These formulas are obtained by differentiating the relation

$$e^{\kappa(\theta)/\phi} = \int e^{x\theta/\phi}\, dF_0(x)$$

with respect to θ. The first formula we obtain is a formula for the mean:

$$E_\theta(X) = \kappa'(\theta).$$

Another differentiation and we find a similar equation for the variance:

$$V_\theta(X) = \phi\,\kappa''(\theta).$$
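The mean and variance formulas can be checked against familiar results by differentiating κ numerically. This is a small sketch, assuming the standard forms $\kappa(\theta) = n\ln(1 + e^{\theta})$ for the binomial and $\kappa(\theta) = e^{\theta}$ for the Poisson distribution (both with φ = 1); the finite-difference helpers are hypothetical.

```python
import math

def derivative(f, t, h=1e-3):
    """Central difference approximation of f'(t)."""
    return (f(t + h) - f(t - h)) / (2.0 * h)

def second_derivative(f, t, h=1e-3):
    """Central difference approximation of f''(t)."""
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / (h * h)

# Binomial(n, p): natural parameter is the log-odds.
n, p = 10, 0.3
theta_bin = math.log(p / (1.0 - p))
kappa_bin = lambda t: n * math.log(1.0 + math.exp(t))
mean_bin = derivative(kappa_bin, theta_bin)         # should be close to n*p
var_bin = second_derivative(kappa_bin, theta_bin)   # should be close to n*p*(1-p)

# Poisson(m): natural parameter is ln(m).
m = 4.0
theta_pois = math.log(m)
mean_pois = derivative(math.exp, theta_pois)        # should be close to m
var_pois = second_derivative(math.exp, theta_pois)  # should be close to m

print(mean_bin, var_bin, mean_pois, var_pois)
```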


Box 9.5 Some important subfamilies of the exponential family

Many of the important distributions we have encountered so far belong to the

exponential family.

The probability function for the binomial distribution can be written

$$\binom{n}{x} p^{x} (1 - p)^{n-x} = \binom{n}{x} (1 - p)^{n} e^{\theta x},$$

for which we have θ = ln(p/(1 − p)) and φ = 1. Since $1 - p = (1 + e^{\theta})^{-1}$ we see that $\kappa(\theta) = -n\ln(1 - p) = n\ln(1 + e^{\theta})$ and $dF_0(x) = \binom{n}{x}$.

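The probability function for the Poisson distribution can be written

$$\frac{m^{x}}{x!}\, e^{-m} = \frac{1}{x!}\, e^{x\ln(m) - m},$$

for which we have that φ = 1, θ = ln(m), $\kappa(\theta) = e^{\theta}$ and $dF_0(x) = 1/x!$.

The probability function for the non-central hypergeometric distribution can be written (cf. Appendix 5.A.1)

$$F(\psi)^{-1} \binom{n_1}{x}\binom{n_2}{r - x}\, \psi^{x},$$

for which we have that θ = ln ψ, φ = 1, $dF_0(x) = \binom{n_1}{x}\binom{n_2}{r - x}$ and $\kappa(\theta) = \ln F(e^{\theta})$.

The density function for the Gaussian distribution can be written

$$\frac{1}{\sigma}\,\varphi\!\left(\frac{x - m}{\sigma}\right) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{(mx - m^2/2)/\sigma^2}\, e^{-x^2/2\sigma^2},$$

so θ = m, $\phi = \sigma^2$, $\kappa(\theta) = \theta^2/2$ and $dF_0(x) = (2\pi\phi)^{-1/2} e^{-x^2/2\phi}\, dx$.

The density function for the gamma distribution can be written

$$\frac{a^{p}}{\Gamma(p)}\, x^{p-1} e^{-ax} = e^{\left(-\frac{a}{p}x + \ln\frac{a}{p}\right)/p^{-1}}\, \frac{p^{p} x^{p-1}}{\Gamma(p)},$$

so we take φ = 1/p, θ = −a/p, κ(θ) = −ln(−θ) and $dF_0(x) = p^{p} x^{p-1}/\Gamma(p)\, dx$.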

In order to estimate θ from data we can use likelihood theory, and for this we note that the part of the log-likelihood that contains information about the parameter θ is given by (xθ − κ(θ))/φ, where x is the observation of X. (Information about φ may be present in the ignored part, so this discussion does not apply to that parameter.) This means that the maximum likelihood estimate of θ is given by the solution to the equation

$$E_\theta(X) = x.$$

In words: the maximum likelihood estimate of θ is the parameter value for which the expected value of X equals the observed value. If we have a sample of size n, it should equal the average of the observations. If we therefore denote the difference between the two sides by $U(\theta) = x - E_\theta(X)$, we have that θ is the solution of the estimating equation U(θ) = 0, for which we have that the variance of U(θ) is the same as the variance of X. This observation, together with the CLT, gives us a method to compute an approximate (two-sided) confidence function for θ, namely

$$C(\theta) = \chi_s\big((x - E_\theta(X))^{t}\, V_\theta(X)^{-1}\, (x - E_\theta(X))\big),$$

where s is the number of components of θ. Note the fundamental difference between the

parameters θ and φ. In a regression problem our primary focus will be on θ (or some function of

its components), whereas φ is a measure of dispersion which will not inﬂuence the estimation

of θ. Its inclusion is, however, of the greatest importance when we wish to make conﬁdence

statements about θ, and for that purpose we need to ﬁnd an estimate of φ.
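The estimating equation U(θ) = 0 can be solved numerically when no closed form is at hand. The following sketch uses Newton's method for a scalar natural parameter, with a single binomial observation as a hypothetical example where the answer is known to be the observed log-odds. The function names and the data (7 events out of 10) are assumptions made for illustration.

```python
import math

def expit(u):
    return 1.0 / (1.0 + math.exp(-u))

def solve_natural_parameter(x, kappa_prime, kappa_double_prime,
                            theta=0.0, tol=1e-12, max_iter=50):
    """Newton iterations for U(theta) = x - kappa'(theta) = 0."""
    for _ in range(max_iter):
        step = (x - kappa_prime(theta)) / kappa_double_prime(theta)
        theta += step
        if abs(step) < tol:
            break
    return theta

# Binomial with n trials: kappa(theta) = n*ln(1 + exp(theta)).
n, x = 10, 7
kappa_prime = lambda t: n * expit(t)                            # E_theta(X)
kappa_double_prime = lambda t: n * expit(t) * (1.0 - expit(t))  # V_theta(X), phi = 1

theta_hat = solve_natural_parameter(x, kappa_prime, kappa_double_prime)
print(theta_hat, math.log(x / (n - x)))
```

The iteration uses the variance of X as the slope of the expected value, which mirrors the remark that U(θ) and X have the same variance.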

Sometimes only one part of θ is of interest, with the rest being nuisance parameters.

When that is the case, one way to obtain knowledge about the interesting components is by

ﬁnding a conditional distribution which is independent of the nuisance parameters (another is

proﬁling). This was exempliﬁed in Section 9.4, when we introduced the conditional logistic

regression models, but can be done in some generality in the exponential family. To be more

specific, assume that we have a stochastic variable X with a distribution in the exponential family for which the dispersion parameter is one and which is decomposed into $X = (X_1, X_2)$ (both components can be vectors), with a corresponding decomposition of the canonical parameter $\theta = (\theta_1, \theta_2)$. Then the conditional distribution of $X_1$ given $X_2$ also belongs to the exponential family. For an outline of the computations we start with the probability density for X, which is $e^{x_1\theta_1 + x_2\theta_2 - \kappa(\theta_1,\theta_2)} f(x_1, x_2)$. The marginal probability density for $X_2$ is then given by

$$e^{x_2\theta_2 - \kappa(\theta_1,\theta_2)} \int e^{x_1\theta_1} f(x_1, x_2)\, dx_1 = e^{x_2\theta_2 - \kappa(\theta_1,\theta_2)}\, g(x_2, \theta_1).$$

It follows that the density for $X_1$ given that $X_2 = x_2$, which is the ratio of these two densities, can be written as

$$e^{x_1\theta_1 - \ln(g(x_2,\theta_1))}\, f(x_1, x_2).$$

This density does not contain the parameter $\theta_2$, so if we want to make inference about $\theta_1$ we can use the confidence function based on the conditional distribution $X_1 \mid X_2 = x_2$. A few examples are listed in Box 9.6, which points out how event data that appear according to Poisson processes are turned into a multinomial distribution if we do the analysis conditional on the total count.

There is one important type of calculation remaining on distributions in the exponential

family. It is about allowing the natural parameter to be subject-specific, to vary in the population to account for heterogeneity (e.g., an omitted covariate). We know that this leads to a new distribution, the determination of which involves the computation of an integral. For members of the exponential family there are special complementary parameter distributions for which this computation is easily carried out: for the distribution defined in equation (9.5) we define the (family of) conjugate distributions by

$$dQ(\theta) = c(\gamma, \chi)^{-1}\, e^{\chi\theta - \gamma\kappa(\theta)}\, d\theta \qquad (9.6)$$


Box 9.6 Some important conditional distributions from the exponential family

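The probability function for two independent Poisson processes with rates $\lambda_1, \lambda_2$, and observed for times $T_1, T_2$, is

$$\frac{T_1^{x_1} T_2^{x_2}}{x_1!\, x_2!}\, e^{x_1\theta_1 + x_2\theta_2 - (T_1 e^{\theta_1} + T_2 e^{\theta_2})},$$

where $\theta_i = \ln(\lambda_i)$, i = 1, 2. The distribution for $x_\cdot = x_1 + x_2$ is Poisson with the natural parameter given by $\theta = \ln(T_1 e^{\theta_1} + T_2 e^{\theta_2})$. Division gives the conditional distribution

$$\binom{x_\cdot}{x_1}\, T_1^{x_1} T_2^{x_\cdot - x_1}\, e^{x_1\theta_1 + (x_\cdot - x_1)\theta_2 - x_\cdot\theta} = \binom{x_\cdot}{x_1}\, p^{x_1} (1 - p)^{x_\cdot - x_1},$$

where $p = T_1/(T_1 + \chi T_2)$ is a function of $\chi = \lambda_2/\lambda_1$ only.

The bivariate distribution of two independent $\mathrm{Bin}(n_1, p_1)$ and $\mathrm{Bin}(n_2, p_2)$ distributions can be written as

$$\frac{e^{x_1\theta_1 + x_2\theta_2}}{(1 + e^{\theta_1})^{n_1} (1 + e^{\theta_2})^{n_2}}\, b(x_1, x_2) = a(\theta_1, \theta_2)\, b(x_1, x_2)\, e^{x_1\psi + (x_1 + x_2)\theta_2},$$

where $\theta_i$ is the log-odds, $\psi = \theta_1 - \theta_2$ is the log-OR, $a(\theta_1, \theta_2)$ the inverse of the denominator, and $b(x_1, x_2) = \binom{n_1}{x_1}\binom{n_2}{x_2}$. The probability that $x_1 + x_2 = r$ is given by

$$\sum_k a(\theta_1, \theta_2)\, b(k, r - k)\, e^{k\psi + r\theta_2} = a(\theta_1, \theta_2)\, e^{r\theta_2} \sum_k b(k, r - k)\, e^{k\psi},$$

so the conditional distribution for $x_1$ given that $x_1 + x_2 = r$ is given by

$$b(x_1, r - x_1)\, e^{x_1\psi}/F(\psi, r), \quad\text{where}\quad F(\psi, r) = \sum_k b(k, r - k)\, e^{k\psi},$$

which is the non-central hypergeometric distribution, defined in Box 4.4.

Related to the first example above is the case where we observe k independent Poisson distributions $x_i \in \mathrm{Po}(m_i)$, i = 1, ..., k. The joint probability function is then

$$p(x) = \prod_{i=1}^{k} \frac{m_i^{x_i}}{x_i!}\, e^{-m_i} = \frac{e^{\sum_i (x_i \ln m_i - m_i)}}{\prod_i x_i!}.$$

The probability function for $x_\cdot = \sum_i x_i$ is given by $m_\cdot^{x_\cdot} e^{-m_\cdot}/x_\cdot!$, where $m_\cdot = \sum_i m_i$, so the distribution of x conditional on $x_\cdot = n$ will be

$$\binom{n}{x_1, \ldots, x_k}\, p_1^{x_1} \cdots p_k^{x_k}, \quad p_i = m_i/m_\cdot,$$

which is the probability function for a multinomial distribution.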
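The first example in the box can be verified directly: dividing the joint Poisson probability by the probability of the total should reproduce a binomial probability whose parameter depends on the rates only through their ratio. The sketch below uses hypothetical rates and observation times.

```python
import math

def poisson_pmf(x, mean):
    return math.exp(x * math.log(mean) - mean - math.lgamma(x + 1))

def binomial_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)

# Assumed rates and observation times, for illustration only.
lam1, lam2, T1, T2 = 0.8, 1.6, 5.0, 3.0
chi = lam2 / lam1
p = T1 / (T1 + chi * T2)

total = 6  # condition on x_1 + x_2 = 6
marginal = poisson_pmf(total, lam1 * T1 + lam2 * T2)

max_diff = 0.0
for x1 in range(total + 1):
    joint = poisson_pmf(x1, lam1 * T1) * poisson_pmf(total - x1, lam2 * T2)
    conditional = joint / marginal
    max_diff = max(max_diff, abs(conditional - binomial_pmf(x1, total, p)))

print(max_diff)
```

Doubling both rates leaves `chi`, and hence the conditional distribution, unchanged, which is why conditioning removes the nuisance parameter.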


Box 9.7 Some mixed distributions from the exponential family

The CDF for the Poisson distribution is given by $dF_\theta(x) = e^{x\theta - e^{\theta}}/x!$, so according to formula (9.6) the conjugate distribution is

$$dQ(\theta) = \frac{a^{p}\, e^{p\theta - a e^{\theta}}}{\Gamma(p)}\, d\theta.$$

This means that it is the distribution for θ = ln X when X has a gamma distribution (i.e., the log-gamma distribution), and we identify that $c(a, p) = \Gamma(p)/a^{p}$. It follows that the mixed distribution is

$$\frac{\Gamma(p + x)/(a + 1)^{p+x}}{\Gamma(p)/a^{p}}\, \frac{1}{x!} = \binom{p + x - 1}{x} \left(\frac{a}{a + 1}\right)^{p} \left(\frac{1}{a + 1}\right)^{x},$$

which is the negative binomial distribution.

Consider the Bernoulli distribution (which is the binomial with n = 1) for which we have $dF_\theta(x) = e^{x\theta - \ln(1 + e^{\theta})}$, where θ is the log-odds. Its conjugate distribution is

$$dQ(\theta) = c(\gamma, \chi)^{-1}\, e^{\chi\theta - \gamma\ln(1 + e^{\theta})}\, d\theta = c(\gamma, \chi)^{-1}\, p^{\chi - 1} (1 - p)^{\gamma - \chi - 1}\, dp,$$

where $p = e^{\theta}/(1 + e^{\theta})$, so that dθ = dp/(p(1 − p)). The coefficient is identified by comparison with the beta distribution as c(γ, χ) = B(χ, γ − χ) (B(a, b) is defined in Appendix 6.A.1), and it follows that the mixed distribution for a sample of n is

$$\frac{c(\gamma + n, \chi + x)}{c(\gamma, \chi)} \binom{n}{x} = \frac{B(a + x, b + n - x)}{B(a, b)} \binom{n}{x},$$

with a = χ and b = γ − χ.

For the Gaussian distribution with mean θ we have $dF_{\theta,\phi}(x) = e^{(x\theta - \theta^2/2)/\phi}\, dF_0(x)$, as we have seen, so the conjugate distribution takes the form

$$dQ(\theta) = c(\gamma, \chi)^{-1}\, e^{\chi\theta - \gamma\theta^2/2}\, d\theta,$$

which means that it is a Gaussian distribution with variance 1/γ and mean χ/γ. It follows that $c(\gamma, \chi)^{-1} = \sqrt{\gamma/2\pi}\, e^{-\chi^2/2\gamma}$, and the mixed probability density is therefore

$$\frac{\sqrt{\gamma}\, e^{-\chi^2/2\gamma}}{\sqrt{\gamma + 1/\phi}\; e^{-(\chi + x/\phi)^2/2(\gamma + 1/\phi)}}\, dF_0(x) = \frac{1}{\sqrt{2\pi(\phi + 1/\gamma)}}\, e^{-K(x)}\, dx$$

for a quadratic form K(x) with coefficients that are functions of the parameters χ, γ and φ. This means that it is a Gaussian distribution with variance φ + 1/γ, and we know that its mean is the mean of Q(θ). It follows that the mixed distribution is N(χ/γ, φ + 1/γ).

In equation (9.6), c(γ, χ) is the appropriate normalizing coefficient. A short calculation then shows that the population averaged density is

$$\int dF(x, \theta)\, dQ(\theta) = \frac{c(\gamma + 1/\phi,\ \chi + x/\phi)}{c(\gamma, \chi)}\, dF_0(x).$$

If we instead have n observations $x_1, \ldots, x_n$, the distribution is the same if we replace φ by φ/n and x by the arithmetic average $\bar{x}$ of the observations (we also change $F_0(x)$, but it still does not contain any parameter other than φ). Box 9.7 contains three important examples. Of

particular importance is the last case, that the mixture of two Gaussian distributions is another

Gaussian distribution, which again shows that in this case it does not really matter whether

or not the individual mean response is heterogeneous in the population. We can still analyze

the model under the assumption of ﬁxed effects; the heterogeneity will only show up in the

residual variance.
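The first example in Box 9.7 lends itself to a numerical check: integrating the Poisson probability against the log-gamma conjugate distribution should reproduce the negative binomial probabilities. The sketch below is an illustration under assumed parameter values (`p = 2.5`, `a = 1.5`); the integration limits and the trapezoid rule are ad hoc choices.

```python
import math

def mixed_pmf_numeric(x, p, a, lo=-40.0, hi=5.0, n=6000):
    """Integrate the Poisson pmf over theta against the log-gamma density.

    The integrand is exp(x*theta - e^theta)/x! * a^p exp(p*theta - a e^theta)/Gamma(p).
    """
    step = (hi - lo) / n
    log_norm = p * math.log(a) - math.lgamma(p) - math.lgamma(x + 1)
    total = 0.0
    for k in range(n + 1):
        theta = lo + k * step
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * math.exp((x + p) * theta - (a + 1.0) * math.exp(theta) + log_norm)
    return total * step

def negative_binomial_pmf(x, p, a):
    """Gamma(p+x)/(Gamma(p) x!) * (a/(a+1))^p * (1/(a+1))^x."""
    log_coef = math.lgamma(p + x) - math.lgamma(p) - math.lgamma(x + 1)
    return math.exp(log_coef + p * math.log(a / (a + 1.0)) - x * math.log(a + 1.0))

p, a = 2.5, 1.5  # hypothetical shape and rate of the gamma mixing distribution
max_diff = max(abs(mixed_pmf_numeric(x, p, a) - negative_binomial_pmf(x, p, a))
               for x in range(8))
print(max_diff)
```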

The conjugate distributions for distributions from the exponential family are also useful

to Bayesian statisticians. In fact, if we take dQ(θ) as the a priori distribution, the a posteriori distribution is given by

$$dF(\theta \mid x) = c(\gamma + 1/\phi,\ \chi + x/\phi)^{-1}\, e^{(\chi + x/\phi)\theta - (\gamma + 1/\phi)\kappa(\theta)}\, d\theta,$$

which is another member of the same family of distributions as the a priori distribution. To

use such distributions is therefore a way out of the general complexity of Bayesian statistics,

and explains why we used beta distributions when we discussed the distribution of a binomial

parameter in Section 4.6.2.

9.8 Generalized linear models

A general discussion on regression analysis in the exponential family will include not

only standard Gaussian regression and logistic regression, but also such matters as Poisson

regression, about which we will have more to say later. Denote the outcome variable by Y ,

and the covariate vector by X, with corresponding lower case letters denoting observations.

In a regression model we specify a function f (β, x) such that E(Y |X = x) = f (β, x), and

the purpose of the regression analysis is to estimate the coefﬁcients β.

Let us ﬁrst look at the unconditional problem, where we have the mean E(Y) expressed as

a function μ(β). We know from the general theory for the exponential family how to estimate

the natural parameter. To estimate β requires that we identify the relation between the natural

parameter and the mean of the distribution. The estimating equation for β can be shown to be

$$\mu'(\beta)^{t}\, V_\beta(Y)^{-1}\, (y - \mu(\beta)) = 0. \qquad (9.7)$$

(In order to derive this, we express the natural parameter θ as a function Θ(β) of β, which is done using the equation κ′(θ) = μ(β). Insert Θ(β) into the expression for the log-likelihood to obtain yΘ(β) − κ(Θ(β)) and differentiate. From this we derive the estimating equation $\Theta'(\beta)^{t} (y - E_\beta(Y)) = 0$, where $E_\beta(Y)$ is shorthand for $E_{\Theta(\beta)}(Y)$. The final observation is that $\mu'(\beta) = \kappa''(\theta)\, \Theta'(\beta) = V_\beta(Y)\, \Theta'(\beta)$.)

Once we have equation (9.7), we can apply this to the regression problem with n observations $(x_i, y_i)$ of (covariate, outcome) pairs. Notation-wise this means replacing μ(β) with f(β, x), and we obtain the equation for the maximum likelihood estimate for the regression coefficients β as

$$\sum_{i=1}^{n} \sigma(\beta, x_i)^{-2}\, f'(\beta, x_i)^{t}\, (y_i - f(\beta, x_i)) = 0,$$

where $\sigma^2(\beta, x) = V(Y \mid X = x)$ is the conditional variance of Y, provided the model is correct. We recognize here equation (9.1), which means that the maximum likelihood estimate is the same as the GLS estimate for distributions in the exponential family.

A regression model for distributions in the exponential family which is such that the

regression function f(β, x) takes the form f(β, x) = h(xβ) is called a generalized linear model, and the function h(u) is called the response function for the model. This is often expressed in terms of the inverse g(u) of h(u) instead, a function which is called the link function. Depending on the nature of the problem, different link functions apply to different problems, as was discussed for the binomial distribution in Section 9.4. Of special interest are those link functions for which we have that g(μ) actually defines the natural parameter θ. The logistic regression model is such an example, since it is the generalized linear model for binomial distributions with the link function g(p) = ln(p/(1 − p)). Such links are called natural links and they have the property that the GLS equation simplifies to

$$\sum_{i=1}^{n} x_i^{t}\, (y_i - h(x_i\beta)) = 0.$$

This is what the estimating equation looked like for the logistic regression model.
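To make the natural-link estimating equation concrete, the sketch below solves it for a logistic regression with an intercept and one covariate, using Newton's method on a small made-up data set. Everything here (the data, the function names, the fixed number of iterations) is a hypothetical illustration rather than a recommended implementation; in practice one would use an established GLM routine.

```python
import math

def expit(u):
    return 1.0 / (1.0 + math.exp(-u))

def score_and_information(b0, b1, xs, ys):
    """Score sum x_i^t (y_i - h(x_i beta)) and its negative derivative, for x_i = (1, x)."""
    s0 = s1 = i00 = i01 = i11 = 0.0
    for x, y in zip(xs, ys):
        mu = expit(b0 + b1 * x)
        w = mu * (1.0 - mu)      # Bernoulli variance at the fitted mean
        s0 += y - mu
        s1 += x * (y - mu)
        i00 += w
        i01 += w * x
        i11 += w * x * x
    return (s0, s1), (i00, i01, i11)

def fit_logistic(xs, ys, iterations=25):
    """Newton iterations for the two-parameter logistic model."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        (s0, s1), (i00, i01, i11) = score_and_information(b0, b1, xs, ys)
        det = i00 * i11 - i01 * i01
        # Solve the 2x2 system information * delta = score explicitly.
        b0 += (i11 * s0 - i01 * s1) / det
        b1 += (i00 * s1 - i01 * s0) / det
    return b0, b1

# Made-up data with overlapping outcomes, so that the estimate is finite.
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
ys = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

b0_hat, b1_hat = fit_logistic(xs, ys)
final_score, _ = score_and_information(b0_hat, b1_hat, xs, ys)
print(b0_hat, b1_hat, final_score)
```

At the solution both components of the score are numerically zero, which is exactly the statement that the observed and fitted values balance against each covariate.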

9.9 Comments and further reading

See Lehmann (1990) and Cox (1990) for general discussions about modeling in statistics, the

former with some historical comments. The list of reasons for covariate modeling given in

Section 9.2 was adapted from Koch et al. (1982). For an account of the history of least squares

and Gauss’s justification for changing his method of proof, see Hald (1998, Chapter 13).

Chapter 6 in the same book gives some historical remarks on the general problem of ﬁtting

data, including the problems with absolute residuals.

For a further discussion on the problem of estimating, and interpreting, effects in a heterogeneous world, and how the meanings of parameters change as we change the model,

see Neuhaus et al. (1991). This problem is important, but not for Gaussian data, which

we have seen can accommodate misspeciﬁcation by increasing the dispersion parameter

(Ford et al., 1995).

For details on the theory and practical use of GLMs in biostatistics and other ﬁelds,

the original ‘bible’ is McCullagh and Nelder (1989), whereas the book by Fahrmeir and

Tutz (2001) is a modern treatise covering a wider area. See Morgan (1992) for more on

binomial regression, including an example of the multivariate probit model mentioned in

Section 3.8. For a general discussion on the roles of conditional tests in statistical inference,

see Reid (1995).

References

Cox, D.R. (1990) Role of models in statistical analysis. Statistical Science, 5(2), 169–174.

Fahrmeir, L. and Tutz, G. (2001) Multivariate Statistical Modelling Based on Generalized Linear Models, second edn. Springer Series in Statistics. New York: Springer.

Ford, I., Norrie, J. and Ahmadi, S. (1995) Model inconsistency, illustrated by the Cox proportional

hazards model. Statistics in Medicine, 14(8), 735–746.