86 4 BIVARIATE STATISTICS
tcalc = r(2,1) * ((length(x)-2)/(1-r(2,1)^2))^0.5
tcrit = tinv(0.95,length(x)-2)
tcalc =
13.3594
tcrit =
1.6991
This result indeed indicates that we can reject the null hypothesis and the
correlation coefficient is significant. As an alternative to detecting outliers,
resampling schemes or surrogates such as the bootstrap or jackknife
methods represent powerful tools for assessing the statistical significance
of the results. These techniques are particularly useful when scanning large
multivariate data sets for outliers (see Chapter 9). Resampling schemes
repeatedly resample the original data set of n data points, either by drawing
n subsamples of n–1 data points each, leaving out a different data point each
time (the jackknife), or by picking an arbitrary number of subsamples, each of
n data points drawn with replacement (the bootstrap). The statistics of these
subsamples provide better information on the characteristics of the population
than the statistical parameters (mean, standard deviation, correlation
coefficients) computed from the full data set. The function bootstrp allows
resampling of our bivariate data set including the outlier (x,y)=(20,20).
rhos1000 = bootstrp(1000,'corrcoef',x,y);
This command first resamples the data a thousand times, calculates the
correlation coefficient for each new subsample and stores the result in the
variable rhos1000. Since corrcoef delivers a 2 × 2 matrix as mentioned
above, rhos1000 has the dimension 1000 × 4, i.e., 1000 values for each
element of the 2 × 2 matrix. Plotting the histogram of the 1000 values for the
second element, i.e., the correlation coefficient of (x,y), illustrates the
dispersion of this parameter with respect to the presence or absence of the
outlier. Since the distribution of rhos1000 contains many empty classes,
we use a large number of bins.
hist(rhos1000(:,2),30)
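The two resampling schemes described above can also be written out by hand,
which makes it easier to see what each subsample contains. The following is
only a minimal sketch and not the implementation used by bootstrp. The data
are a synthetic stand-in (30 uncorrelated points plus an outlier at (20,20)),
and the names nboot, rboot, rjack and hasoutlier are hypothetical, chosen for
illustration only.

```matlab
% Synthetic stand-in for the data set (an assumption, not the original data):
% 30 uncorrelated Gaussian points plus one outlier at (20,20).
rng(0)                            % fix the random seed for reproducibility
x = [randn(30,1); 20];
y = [randn(30,1); 20];
n = length(x);

% Bootstrap by hand: draw n indices with replacement, nboot times.
nboot = 1000;
rboot = zeros(nboot,1);
hasoutlier = false(nboot,1);
for i = 1:nboot
    idx = randi(n,n,1);           % n random indices between 1 and n
    c = corrcoef(x(idx),y(idx));  % 2-by-2 correlation matrix
    rboot(i) = c(2,1);            % off-diagonal element = r of (x,y)
    hasoutlier(i) = any(idx == n);% was the outlier (last point) drawn?
end

% Jackknife by hand: leave out one data point at a time, n times.
rjack = zeros(n,1);
for i = 1:n
    idx = [1:i-1, i+1:n];         % all indices except i
    c = corrcoef(x(idx),y(idx));
    rjack(i) = c(2,1);
end

% Compare subsamples with and without the outlier.
median(rboot(hasoutlier))
median(rboot(~hasoutlier))
rjack(n)                          % jackknife subsample omitting the outlier
```

The logical vector hasoutlier splits the bootstrap values into subsamples
that include the outlier and subsamples that do not, so the two groups can
be examined separately.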
The histogram shows a cluster of correlation coefficients at around r=0.1
that follow the normal distribution, and a strong peak close to r=1 (Fig. 4.3).
The interpretation of this histogram is relatively straightforward. When the
subsample contains the outlier, the correlation coefficient is close to one, but
subsamples without the outlier yield a very low (close to zero) correlation