particularly with larger sample sizes, that haphazard departures from linearity may
result in the rejection of empirical consistency. Yet without a clear nonlinear trend
that could be modeled, preserving the linear regression approach may be the best
strategy, from the standpoint of parsimony. In sum, I accept the linear regression
model as appropriate for the data in this example, despite having to reject the hypothesis of empirical consistency via a formal test.
Discriminatory Power vs. Empirical Consistency. As another illustration of the distinction between discriminatory power and empirical consistency, I conducted a regression simulation. First, I drew a sample of 200 observations from a true linear regression model in which Y was generated by the linear equation 3.2 + 1X + ε, where ε was normally distributed with mean zero and variance equal to 22. Estimation of the simple linear regression equation with the 200 sample observations produced an r² of .21. The F statistic for testing empirical consistency was .2437. With 8 and 190 degrees of freedom, its p-value was .982. Obviously, the F test for lack of fit here suggests a good-fitting model. Note, however, that discriminatory power is modest, at best. Next
I drew a sample of 200 observations from a model in which Y was generated as a
nonlinear—in particular, a quadratic—function of X. Specifically, the equation generating Y was 1.2 + 1X + .5X² + ε. Again, ε was a random observation from a normal distribution with mean zero. This time, however, the variance of ε was only 1.0. I then used the 200 sample observations to estimate a simple linear regression equation. That is, I estimated Y = β₀ + β₁X + ε, a clearly misspecified model. The test for lack of fit in this case resulted in an F value of 339.389, a highly significant result (at p < .00001).
Clearly, by a formal test, the linear model is rejected as empirically inconsistent. The r² for the linear regression, however, was a whopping .96! The point of this exercise is that, contrary to popular conception, r² is not a measure of “fit” of the model to the data. It is a measure of discriminatory power. It’s possible, as shown in these examples, for good-fitting models to have only modest r² values and for bad-fitting models to have very high discriminatory power. [See also Korn and Simon (1991) for another illustration of the distinction between these two components of model evaluation.]
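To make the mechanics of the test concrete, recall the standard pure-error decomposition behind the lack-of-fit F statistic (a recap in general notation, not a new result): with c distinct values of X and replicated observations of Y at each, the residual sum of squares from the fitted line splits into a pure-error part and a lack-of-fit part, and

F = \frac{SS_{\text{lack of fit}}/(c - 2)}{SS_{\text{pure error}}/(n - c)}.

The reported degrees of freedom are consistent with this form: with n = 200, an F on 8 and 190 degrees of freedom implies c = 10 distinct X values, since c − 2 = 8 and n − c = 190.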
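A simulation along these lines can be sketched in Python. This is a hypothetical reconstruction, not the author's code: the design of X (10 distinct values with 20 replicates each, chosen to reproduce the 8 and 190 degrees of freedom), the random seed, and the reliance on numpy are all assumptions, so the statistics will only roughly resemble those reported above.

import numpy as np

def lack_of_fit_F(x, y):
    # Lack-of-fit F test for a simple linear regression; requires
    # replicated Y observations at each distinct value of x.
    n = len(x)
    b1, b0 = np.polyfit(x, y, 1)             # least-squares slope, intercept
    sse = np.sum((y - (b0 + b1 * x)) ** 2)   # residual SS around the line
    levels = np.unique(x)
    # Pure-error SS: variation of y around its own group mean at each x
    ss_pe = sum(np.sum((y[x == v] - y[x == v].mean()) ** 2) for v in levels)
    ss_lof = sse - ss_pe                     # variation the line fails to capture
    c = len(levels)
    df_lof, df_pe = c - 2, n - c
    return (ss_lof / df_lof) / (ss_pe / df_pe), df_lof, df_pe

rng = np.random.default_rng(42)              # seed is arbitrary
# Ten distinct X values, 20 replicates each: n = 200, so the test has
# c - 2 = 8 and n - c = 190 degrees of freedom, matching the text.
x = np.repeat(np.arange(1.0, 11.0), 20)

# Scenario 1: Y truly linear in X, error variance 22 as in the text
y1 = 3.2 + 1 * x + rng.normal(0.0, np.sqrt(22.0), size=200)

# Scenario 2: Y truly quadratic in X, error variance 1.0
y2 = 1.2 + 1 * x + 0.5 * x**2 + rng.normal(0.0, 1.0, size=200)

for label, y in (("true linear", y1), ("true quadratic", y2)):
    r2 = np.corrcoef(x, y)[0, 1] ** 2        # r-squared of the straight-line fit
    F, df1, df2 = lack_of_fit_F(x, y)
    print(f"{label}: r2 = {r2:.2f}, lack-of-fit F({df1}, {df2}) = {F:.3f}")

Run as-is, the first scenario should produce a small lack-of-fit F alongside a modest r², and the second a very large F alongside a very high r², reproducing the qualitative pattern of the two simulations.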
Authenticity. The authenticity of a model is much more difficult to assess than is
either discriminatory power or empirical consistency. Here we ask: Does the model
truly reflect the real-world process that generated the data? This question usually does
not have a statistical answer. We must rely on theoretical reasoning and/or evidence
from experimental studies to buttress the veracity of our proposed causal link between
X and Y. On the other hand, we can evaluate whether additional variables are responsible for the observed X–Y association, rendering the original causal model inauthentic. For example, I have attempted above to argue, theoretically, for the reasonableness
of math diagnostic score as a cause of exam performance, for years of schooling as a
cause of couple modernism, and for frequenting bars as a cause of sexual frequency.
Objections to the authenticity of all of these models can be tendered. With respect to
exam performance, it is certainly possible that academic ability per se is the driving
force that affects performance on both the math diagnostic and the exam. In this case,
the relationship between diagnostic score and exam performance, being due to a third,