
this equation may perform with $R^2 = 1$ on the regression dataset, it will probably
be completely useless on another dataset obtained for crops growing under a
constant temperature of 10 °C. To get at least a first idea as to how a particular
regression model performs on unknown data, a procedure called cross-validation
can be used. Cross-validation approaches mimic "new data" in various ways, for
example, based on a partitioning of the dataset into a training dataset which is used
to obtain the regression equation, and a test dataset which is used to assess the
regression equation’s predictive capability (a so-called holdout validation approach,
see [48]).
The R program LinRegEx3.r in the book software performs such a cross-validation
for the rose wilting data, Volz.csv. This program is very similar to LinRegEx2.r,
except for the following lines of code that implement the partitioning of the data
into training and test datasets:
1: Dataset=read.table(FileName, ...)
2: TrainInd=sample(1:47,37)
3: TrainData=Dataset[TrainInd,]
4: TestData=Dataset[-TrainInd,]
5: RegModel=lm(eq,data=TrainData)
6: DegWiltTrain=predict(RegModel,TrainData)
7: DegWiltTest=predict(RegModel,TestData)
(2.52)
After the data have been stored in the variable Dataset in line 1 of program
2.52, 37 random indices between 1 and 47 (referring to the 47 lines of data in
Volz.csv) are chosen in line 2. See [45] and the R help pages for more details on
the sample command that is used in line 2. The 37 random indices are stored in
the variable TrainInd which is then used in line 3 to assemble the training dataset
TrainData based on those lines of Dataset which correspond to the indices in
TrainInd. The remaining lines of Dataset are then reassembled into the test
dataset TestData in line 4. The regression model RegModel is then computed
using the training dataset in line 5 (note the difference from line 4 of program 2.37
where the regression model is computed based on the entire dataset). Then, the
predict command is used again to apply the regression equation separately to the
training and test datasets in lines 6 and 7 of program 2.52.
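
The holdout steps described above can be tried out without the book software. The following is only a minimal sketch under stated assumptions: it uses synthetic data instead of Volz.csv, the column names x1, x2, y and the data-generating model are invented for illustration, and the helper R2 implements the common definition 1 - SSres/SStot, which may differ in detail from the definition used elsewhere in the book. The call to set.seed is likewise an addition that is not part of program 2.52; it fixes the state of the pseudorandom number generator so that repeated runs select the same training and test datasets.

```r
# Minimal holdout-validation sketch on synthetic data (NOT the Volz.csv data).
# All names and the data-generating model are illustrative assumptions.
set.seed(1)                                # assumption: fix RNG state for a repeatable split

n <- 47                                    # same number of data lines as in the text
Dataset <- data.frame(x1 = runif(n), x2 = runif(n))
Dataset$y <- 1 + 2 * Dataset$x1 - 3 * Dataset$x2 + rnorm(n, sd = 0.3)

TrainInd  <- sample(seq_len(n), 37)        # 37 random row indices, drawn without replacement
TrainData <- Dataset[TrainInd, ]           # rows used to fit the regression model
TestData  <- Dataset[-TrainInd, ]          # the remaining 10 rows

RegModel  <- lm(y ~ x1 + x2, data = TrainData)   # fit on the training data only
PredTrain <- predict(RegModel, TrainData)
PredTest  <- predict(RegModel, TestData)

# Hypothetical helper: coefficient of determination, 1 - SSres/SStot
R2 <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

R2Train <- R2(TrainData$y, PredTrain)
R2Test  <- R2(TestData$y, PredTest)
cat("R2 (training):", R2Train, "\n")
cat("R2 (test):    ", R2Test, "\n")
```

Comparing the two printed values gives a first impression of how much predictive capability is lost when the model is applied to data that were not used to fit it. Removing the set.seed line lets the split, and hence both values, vary from run to run.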
Figure 2.5 shows an example result of LinRegEx3.r. You should note that if you
run LinRegEx3.r on your machine, the result will probably be different from the
plot shown in Figure 2.5 since the sample command may select different training
and test datasets if it is performed on different computers. As explained in [45],
the sample command is based on an algorithm generating pseudorandom numbers,
and the actual state of this algorithm is controlled by a set of integers stored in the
R object .Random.seed. As a result of this procedure, LinRegEx3.r may generate
different results on different computers depending on the state of the algorithm
on each particular computer. Figure 2.5 compares the measured and predicted
values of DegWilt similar to Figure 2.4b above. As could be expected, there are
larger deviations between the line $\hat{y} = y$ and the data for the test dataset which