THE PURPOSE OF MATHEMATICAL MODELS 247
have in the outcome, and it is a predictor if we use its value to get a more refined prediction
of the outcome than not knowing it would provide. The modeling is the same whichever
word we use, but the prediction terminology helps us better understand what we do. With
no predictors available, and forced to guess the outcome, our best guess would be the
mean. This is a sensible guess because it minimizes the expected value of the squared
residuals (we could instead use the median, which minimizes the expected value of the
absolute residuals, but that is mathematically more complicated). If we have a model
that relates the outcome to a particular predictor, and we have measured that predictor, we
would instead predict with the conditional mean of the outcome, given the observed value of
the covariate. The model is good if this increases the precision of our prediction. Another term we
have used in earlier chapters is ‘confounder’, used primarily in epidemiology. A confounder
is an explanatory variable, other than the one that is under investigation, which is a predictor
of the response. In a non-epidemiology context the corresponding term is often ‘covariate’.
Just as the exposure we study is not itself a confounder, the explanatory variable we
have designed an experiment to investigate is often not included among the covariates,
though on other occasions it is. We will mostly use the term ‘covariate’ in our discussion for any of
these concepts, and in a wide sense.
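The claim above, that the mean minimizes the expected squared loss while the median minimizes the expected absolute loss, can be checked numerically. The following sketch simulates Gaussian data and searches a grid of constant guesses; the data and all names are illustrative, not taken from any example in the text:

```python
import random
import statistics

random.seed(0)
# Simulated outcomes with no predictors available.
y = [random.gauss(5.0, 2.0) for _ in range(1000)]

def mean_sq_loss(c):
    """Average squared residual if we always guess c."""
    return sum((v - c) ** 2 for v in y) / len(y)

def mean_abs_loss(c):
    """Average absolute residual if we always guess c."""
    return sum(abs(v - c) for v in y) / len(y)

# Grid of candidate constant guesses spanning the data.
lo, hi = min(y), max(y)
grid = [lo + (hi - lo) * i / 2000 for i in range(2001)]

best_sq = min(grid, key=mean_sq_loss)
best_abs = min(grid, key=mean_abs_loss)

print(f"mean   = {statistics.mean(y):.3f}, squared-loss minimizer  = {best_sq:.3f}")
print(f"median = {statistics.median(y):.3f}, absolute-loss minimizer = {best_abs:.3f}")
```

Up to the grid resolution, the squared-loss minimizer coincides with the sample mean and the absolute-loss minimizer with the sample median.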
We can divide covariates into those that are fixed, such as gender, which stay the same
if we repeat the experiment on the same subject, and those that are random,
which means that if we take a new measurement of such a covariate, we will probably get
a new value. Examples of random covariates include baseline measurements of outcome
variables such as blood pressure or lung function. However, in this chapter all analysis will
be conditional analysis on observed covariate values, which means that we also consider the
random covariates to be fixed. As a consequence we can, strictly speaking, only generalize
the results to situations with the same set of covariate values, and must use other means to
extend them to the population at large. This may be an important point philosophically, but is
mostly ignored in practice.
Suppose we are comparing some treatments, or exposures, on an outcome variable with
a Gaussian distribution and that we have a number of covariates we may wish to include in a
linear statistical model. Why should we contemplate doing so? Here are some reasons.
1. To adjust for inherent differences among comparison groups in order to reduce bias.
This is of particular importance in studies where groups cannot be balanced by the use
of randomization or matching, as is the case in many observational studies.
2. To generate more powerful statistical tests through variance reduction, which will take
place if an appropriate covariate explains some of the variation present. Adjustment
for baseline measurements in a randomized experiment is an example of this.
3. To induce equivalence of comparison groups that are generated by randomization. Ran-
domization guarantees approximate equivalence, but statistical adjustment can offset
minor imbalances in important predictors of the outcome variable.
4. To clarify to what extent treatment effects are explained by other factors, poten-
tially leading to a change in the interpretation of treatment effects. Conversely, the
lack of explanatory factors would help to substantiate the independent existence of
treatment effects.
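Reason 2 can be illustrated with a small Monte Carlo sketch. In a hypothetical randomized experiment whose outcome depends on a baseline covariate, we compare two estimators of the treatment effect: the plain difference in group means, and a baseline-adjusted estimator obtained by residualizing both outcome and treatment on the baseline (the Frisch–Waugh device) and regressing residual on residual. The simulation design, effect sizes, and names are illustrative assumptions, not an example from the text:

```python
import random
import statistics

def slope(u, v):
    """Least-squares slope of v on u (with intercept)."""
    mu, mv = statistics.mean(u), statistics.mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = sum((a - mu) ** 2 for a in u)
    return num / den

def residuals(u, v):
    """Residuals of v after regressing it on u (with intercept)."""
    b = slope(u, v)
    a = statistics.mean(v) - b * statistics.mean(u)
    return [vi - (a + b * ui) for ui, vi in zip(u, v)]

random.seed(1)
n, reps = 200, 400            # subjects per experiment, simulated experiments
unadj, adj = [], []
for _ in range(reps):
    x = [random.gauss(0, 1) for _ in range(n)]      # baseline covariate
    z = [random.randint(0, 1) for _ in range(n)]    # randomized treatment
    # True treatment effect 1.0; baseline explains much of the variation.
    y = [1.0 * zi + 2.0 * xi + random.gauss(0, 1) for zi, xi in zip(z, x)]
    # Unadjusted estimate: difference in group means.
    y1 = [yi for yi, zi in zip(y, z) if zi == 1]
    y0 = [yi for yi, zi in zip(y, z) if zi == 0]
    unadj.append(statistics.mean(y1) - statistics.mean(y0))
    # Adjusted estimate: residualize y and z on x, regress residual on residual.
    adj.append(slope(residuals(x, z), residuals(x, y)))

print(f"sd of unadjusted estimator: {statistics.stdev(unadj):.3f}")
print(f"sd of adjusted estimator:   {statistics.stdev(adj):.3f}")
```

Both estimators are centered on the true effect, but the adjusted one has a markedly smaller sampling variance, because the baseline covariate explains part of the outcome variation, which is exactly the variance-reduction argument of reason 2.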