
Part III: Drawing Conclusions from Data
That last term is the Greek letter epsilon. It represents “error” in the
population. In a way, “error” is an unfortunate term. It’s a catchall for
“things you don’t know or things you have no control over.” Error is reflected
in the residuals, which are the deviations from the predictions. The more you
understand about what you’re measuring, the more you decrease the error.
You can’t measure the error in the relationship between SAT and GPA, but
it’s lurking there. Someone might score low on the SAT, for example, and
then go on to have a wonderful college career with a higher-than-predicted
GPA. On a scatterplot, this person’s SAT-GPA point looks like an error in
prediction. As you find out more about that person, you might discover that he
or she was sick on the day of the SAT, and that explains the “error.”
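Here’s a minimal sketch of how residuals fall out of a fitted line. The SAT
scores and GPAs are invented purely for illustration (they are not data from
this chapter), and the line is fit with the ordinary least-squares formulas.
The first student has the lowest SAT but a higher-than-predicted GPA, so that
residual comes out positive.

```python
# Hypothetical SAT scores and GPAs, invented purely for illustration.
sat = [1000, 1100, 1200, 1300, 1400]
gpa = [2.9, 2.6, 3.1, 3.4, 3.5]

n = len(sat)
mean_sat = sum(sat) / n
mean_gpa = sum(gpa) / n

# Least-squares slope and intercept for the line gpa = intercept + slope * sat.
sxy = sum((x - mean_sat) * (y - mean_gpa) for x, y in zip(sat, gpa))
sxx = sum((x - mean_sat) ** 2 for x in sat)
slope = sxy / sxx
intercept = mean_gpa - slope * mean_sat

# Residual = observed GPA minus the GPA the line predicts.
predicted = [intercept + slope * x for x in sat]
residuals = [y - p for y, p in zip(gpa, predicted)]

for x, y, r in zip(sat, gpa, residuals):
    print(f"SAT {x}: GPA {y:.2f}, residual {r:+.3f}")
```

A positive residual means the person did better than the line predicts, and a
negative residual means worse. With a least-squares line, the residuals sum to
zero (apart from floating-point rounding).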
You can test hypotheses about α, β, and ε, and that’s what I do in the
upcoming subsections.
Testing the fit
I begin with a test of how well the regression line fits the scatterplot. This is a
test of ε, the error in the relationship.
The objective is to decide whether the line really does represent a
relationship between the variables. It’s possible that what looks like a
relationship is just due to chance and the equation of the regression line
doesn’t mean anything (because the amount of error is overwhelming). It’s also
possible that the variables are strongly related.
These possibilities are testable, and you set up hypotheses to test them:
H₀: No real relationship
H₁: Not H₀
Although those hypotheses make nice light reading, they don’t set up a
statistical test. To set up the test, you have to consider the variances. To
consider the variances, you start with the deviations. Figure 14-3 focuses on
one point in a scatterplot and its deviation from the regression line (the
residual) and from the mean of the y-variable. It also shows the deviation
between the regression line and the mean.
As the figure shows, the distance between the point and the regression line
and the distance between the regression line and the mean add up to the
distance between the point and the mean:
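In symbols, the relationship can be sketched as follows. The notation is an
assumption made here for illustration: y is the observed value of the point,
\hat{y} is the value the regression line predicts for that point, and \bar{y}
is the mean of the y-variable.

```latex
(y - \hat{y}) + (\hat{y} - \bar{y}) = (y - \bar{y})
```

The first term is the residual, the second is the deviation of the regression
line from the mean, and the right-hand side is the deviation of the point from
the mean.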
This sets the stage for some other important relationships.
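Here’s a quick numerical check of how a point’s deviation from the mean splits
into those two pieces. It’s a sketch only: the x and y values are invented, and
the helper name split_deviation is hypothetical.

```python
# Hypothetical (x, y) data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.5, 4.0, 6.5, 8.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares regression line: y_hat = intercept + slope * x.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x


def split_deviation(x, y):
    """Return (residual, line-to-mean deviation, point-to-mean deviation)."""
    y_hat = intercept + slope * x
    return y - y_hat, y_hat - mean_y, y - mean_y


for x, y in zip(xs, ys):
    residual, line_to_mean, total = split_deviation(x, y)
    # The two pieces add up to the total deviation for every point.
    assert abs((residual + line_to_mean) - total) < 1e-9
    print(f"x={x}: {residual:+.3f} + {line_to_mean:+.3f} = {total:+.3f}")
```

The split holds for every point, whether the point sits above or below the
line, because the predicted value simply cancels out of the sum.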