Trauth M.H., MATLAB® Recipes for Earth Sciences, Third edition

4 Bivariate Statistics

4.1 Introduction

Bivariate analysis aims to understand the relationship between two variables x and y. Examples are the length and the width of a fossil, the sodium and potassium content of volcanic glass, or the organic matter content along a sediment core. When the two variables are measured on the same object, x is usually identified as the independent variable, and y as the dependent variable. If both variables have been generated in an experiment, the variable manipulated by the experimenter is described as the independent variable. In some cases, neither variable is manipulated and neither is independent. The methods of bivariate statistics aim to describe the strength of the relationship between the two variables, either by a single parameter such as Pearson’s correlation coefficient for linear relationships or by an equation obtained by regression analysis (Fig. 4.1). The equation describing the relationship between x and y can be used to predict the y-response from any arbitrary x within the range of the original data values used for the regression analysis. This is of particular importance if one of the two parameters is difficult to measure. In such a case, the relationship between the two variables is first determined by regression analysis on a small training set of data. The regression equation can then be used to calculate the second parameter.
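The idea of calibrating on a small training set and then predicting within its range can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the values in x_train and y_train are invented, all variable names are hypothetical, and a first-order polyfit stands in for the regression analysis.

```matlab
% Hypothetical training set: x is easy to measure, y is difficult to measure.
% The numbers are made up for illustration only.
x_train = [1 3 5 7 9 11 13 15]';
y_train = [26 37 48 59 70 82 93 104]';

% First-order polynomial fit: p(1) is the slope, p(2) the y-intercept
p = polyfit(x_train,y_train,1);

% Predict y only for x values inside the range of the training data
x_new = [4 8 12]';
inrange = x_new >= min(x_train) & x_new <= max(x_train);
y_pred = polyval(p,x_new(inrange));
```

Restricting x_new to the range of x_train reflects the statement above that the equation should only be used within the range of the original data values.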

This chapter first introduces Pearson’s correlation coefficient (Section 4.2), and then explains the widely-used methods of linear and curvilinear regression analysis (Sections 4.3, 4.9 and 4.10). A selection of other methods that are used to assess the uncertainties in regression analysis is also explained (Sections 4.4 to 4.8). All methods are illustrated by means of synthetic examples since these provide an excellent means of assessing the final outcome.

M.H. Trauth, MATLAB Recipes for Earth Sciences, 3rd ed., DOI 10.1007/978-3-642-12762-5_4, © Springer-Verlag Berlin Heidelberg 2010

[Figure 4.1: scatter plot titled "Bivariate Scatter" with x-axis "Depth in sediment (meters)" and y-axis "Age of sediment (kyrs)". Annotations: i-th data point (x_i, y_i); regression line: age = 21.2 + 5.6 depth; r = 0.96; y-intercept = 21.2; slope = 5.6.]

Fig. 4.1 Display of a bivariate data set. The thirty data points represent the age of a sediment (in kiloyears before present) at a certain depth (in meters) below the sediment-water interface. The combined distribution of the two variables suggests a linear relationship between age and depth, i.e., the rate of increase in the sediment age with depth is constant. Pearson’s correlation coefficient (explained in the text) of r=0.96 supports a strong linear interdependency between the two variables. Linear regression yields the equation age=21.2+5.6 depth, indicating an increase in sediment age of 5.6 kyrs per meter of sediment depth (the slope of the regression line). The inverse of the slope is the sedimentation rate of ca. 0.2 meters/kyr. Furthermore, the equation defines an age for the sediment surface of 21.2 kyrs (the intercept of the regression line with the y-axis). The deviation of the surface age from zero can be attributed either to the statistical uncertainty of regression or to a natural process such as erosion or bioturbation. The assessment of the statistical uncertainty of regression is discussed in this chapter, but a careful evaluation of the possible effects of the various natural processes at the sediment-water interface will be required.

4.2 Pearson’s Correlation Coefficient

Correlation coefficients are often used in the early stages of bivariate statistics. They provide only a very rough estimate of a rectilinear trend in a bivariate data set. Unfortunately, the literature is full of examples where the importance of correlation coefficients is overestimated, or where outliers in the data set lead to an extremely biased estimation of the population correlation coefficient.

The most popular correlation coefficient is Pearson’s linear product-moment correlation coefficient ρ (Fig. 4.2). We estimate the population’s correlation coefficient ρ from the sample data, i.e., we compute the sample correlation coefficient r, which is defined as

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}
$$

[Figure 4.2: six scatter plots (panels a to f), each titled "Bivariate Scatter", annotated with the correlation coefficients r = 0.96, r = –0.97, r = 0.36, r = 0.96, r = 0.38 and r = 0.95. Further annotations: "Outlier" and "Random bivariate data cluster".]

Fig. 4.2 Pearson’s correlation coefficient r for various sample data sets. a–b Positive and negative linear correlation, c random scatter with no linear correlation, d an outlier causing a misleading value of r, e curvilinear relationship causing a high r since the curve is close to a straight line, f curvilinear relationship clearly not described by r.

where n is the number of pairs xy of data points, and s_x and s_y are the univariate standard deviations. The numerator of Pearson’s correlation coefficient is known as the corrected sum of products of the bivariate data set. Dividing the numerator by (n–1) yields the covariance

$$
\mathrm{cov}_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
$$

which is the summed products of deviations of the data from the sample means, divided by (n–1). The covariance is a widely-used measure in bivariate statistics, although it has the disadvantage of being dependent on the dimension of the data. Dividing the covariance by the univariate standard deviations removes this effect and leads to Pearson’s correlation coefficient.
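The chain from corrected sum of products to covariance to correlation coefficient can be followed step by step in MATLAB. The sketch below is a minimal illustration: the data values are invented, and the variable names sp, covxy and r are hypothetical. The built-in functions cov and corrcoef are called only to compare the results.

```matlab
% Invented data for illustration (not the data set used in this chapter)
x = (1:10)';
y = [2.1 3.9 6.2 7.8 10.1 12.2 13.8 16.1 18.0 19.9]';
n = length(x);

% Corrected sum of products (numerator of Pearson's r)
sp = sum((x - mean(x)).*(y - mean(y)));

% Covariance: corrected sum of products divided by n-1
covxy = sp/(n-1);

% Pearson's r: covariance divided by both univariate standard deviations
% (std normalizes by n-1 by default)
r = covxy/(std(x)*std(y));

% Built-in functions for comparison
C = cov([x y]);      % 2-by-2 covariance matrix
R = corrcoef(x,y);   % 2-by-2 correlation matrix
```

The off-diagonal elements C(1,2) and R(1,2) should agree with covxy and r to within rounding error.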

A popular way to test the significance of Pearson’s correlation coefficient is to determine the probability of an r value for a random sample from a population with a ρ=0. The significance of the correlation coefficient can be estimated using a t statistic

$$
t = r \sqrt{\frac{n-2}{1-r^2}}
$$

The correlation coefficient is significant if the calculated t is higher than the critical t (n–2 degrees of freedom, α=0.05). This test, however, is only valid if the data are Gaussian distributed with respect to both variables.
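As a worked illustration of this t statistic, we can insert the values quoted for the data in Fig. 4.1, r=0.96 and n=30. Both values are rounded, so the result is only approximate.

```latex
t = 0.96 \sqrt{\frac{30-2}{1-0.96^{2}}}
  = 0.96 \sqrt{\frac{28}{0.0784}}
  \approx 0.96 \times 18.9
  \approx 18.1
```

With 28 degrees of freedom, the one-sided critical t at α=0.05 is about 1.70, so a correlation of this strength would be judged significant, provided that the assumptions of the test hold.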

Pearson’s correlation coefficient is very sensitive to various disturbances in the bivariate data set. The following example illustrates the use of the correlation coefficients and highlights the potential pitfalls when using these measures of linear trends. It also describes the resampling methods that can be used to explore the confidence level of the estimate for ρ. The synthetic data consist of two variables, the age of a sediment in kiloyears before present and the depth below the sediment-water interface in meters. The use of synthetic data sets has the advantage that we fully understand the linear model behind the data.

The data are represented as two columns contained in file agedepth_1.txt. These data have been generated using a series of thirty random levels (in meters) below the sediment surface. The linear relationship age=5.6 meters+20 was used to compute noise-free values for the variable age. This is the equation of a straight line with a slope of 5.6 and an intercept with the y-axis of 20. Some Gaussian noise with a zero mean and a standard deviation of 10 has been added to the age data.

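clear
rand('seed',40), randn('seed',0)          % reset both random number generators
meters = 20 * rand(30,1);                 % thirty random depths between 0 and 20 m
age = 5.6 * meters + 20;                  % noise-free linear model
age = age + 10.* randn(length(meters),1); % add Gaussian noise (zero mean, std 10)
plot(meters,age,'o')
axis([0 20 0 140])
agedepth(:,1) = meters;
agedepth(:,2) = age;
agedepth = sortrows(agedepth,1)           % sort rows by depth and display
save agedepth_1.txt agedepth -ascii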
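The 'seed' syntax used above is a legacy way of resetting the random number generators and is discouraged in current MATLAB releases in favor of the function rng. The following sketch shows how a comparable data set could be generated with rng. The seed value is an arbitrary assumption, and the random numbers differ from those in agedepth_1.txt, so the correlation coefficients quoted in this section will not be reproduced exactly.

```matlab
rng(40)                               % arbitrary seed, assumed for this sketch
meters = 20*rand(30,1);               % thirty random depths between 0 and 20 m
age = 5.6*meters + 20;                % noise-free linear model
age = age + 10*randn(size(meters));   % Gaussian noise, zero mean, std of 10
agedepth = sortrows([meters age],1);  % combine columns and sort rows by depth
```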

The synthetic bivariate data set can be loaded from the file agedepth_1.txt.

clear
agedepth = load('agedepth_1.txt');

We then define two new variables, meters and age, and generate a scatter plot of the data.

meters = agedepth(:,1);
age = agedepth(:,2);
plot(meters,age,'o')
axis([0 20 0 140])

In the plot, we can observe a strong linear trend suggesting some dependency between the variables, meters and age. This trend can be described by Pearson’s correlation coefficient r, where r=1 indicates a perfect positive correlation, i.e., age increases with meters, r=0 suggests no correlation, and r=–1 indicates a perfect negative correlation. We use the function corrcoef to compute Pearson’s correlation coefficient.

corrcoef(meters,age)

which results in the output

ans =
    1.0000    0.9567
    0.9567    1.0000

The function corrcoef calculates a matrix of correlation coefficients for all possible combinations of the two variables age and meters. The value of r=0.9567 suggests that the two variables age and meters are dependent on each other.

Pearson’s correlation coefficient is, however, highly sensitive to outliers, as can be illustrated by the following example. Let us generate a normally-distributed cluster of thirty data points with a mean of zero and a standard deviation of one. To obtain identical data values, we reset the random number generator by using the integer 5 as seed.

clear
randn('seed',5);
x = randn(30,1); y = randn(30,1);
plot(x,y,'o'), axis([-1 20 -1 20]);

As expected, the correlation coefficient for these random data is very low.

corrcoef(x,y)

ans =
    1.0000    0.1021
    0.1021    1.0000

Now we introduce a single outlier to the data set in the form of an exceptionally high (x,y) value, in which x=y. The correlation coefficient for the bivariate data set including the outlier (x,y)=(5,5) is much higher than before.

x(31,1) = 5; y(31,1) = 5;
plot(x,y,'o'), axis([-1 20 -1 20]);
corrcoef(x,y)

ans =
    1.0000    0.4641
    0.4641    1.0000

Increasing the absolute (x,y) values for this outlier results in a dramatic increase in the correlation coefficient.

x(31,1) = 10; y(31,1) = 10;
plot(x,y,'o'), axis([-1 20 -1 20]);
corrcoef(x,y)

ans =
    1.0000    0.7636
    0.7636    1.0000

The correlation coefficient reaches a value close to r=1 if the outlier has a value of (x,y)=(20,20).

x(31,1) = 20; y(31,1) = 20;
plot(x,y,'o'), axis([-1 20 -1 20]);
corrcoef(x,y)

ans =
    1.0000    0.9275
    0.9275    1.0000

The bivariate data set still does not provide much evidence for a strong interdependency between the variables. As we have seen, however, the combination of the random bivariate data with a single outlier results in a dramatic increase in the correlation coefficient. Although outliers are easy to identify in a bivariate scatter, erroneous values can easily be overlooked in large multivariate data sets.
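The three outlier experiments above can be condensed into a single loop. The following is a minimal sketch that assumes the rng function for seeding instead of the legacy 'seed' syntax, so the random cluster and the resulting values of r differ from those printed in the text. The variable names x0, y0 and outlier are hypothetical.

```matlab
rng(5)                           % arbitrary seed; values differ from the text
x0 = randn(30,1); y0 = randn(30,1);

outlier = [5 10 20];             % outlier positions used in the text
r = zeros(size(outlier));
for i = 1:length(outlier)
    x = [x0; outlier(i)];        % append a single outlier with x = y
    y = [y0; outlier(i)];
    R = corrcoef(x,y);
    r(i) = R(1,2);               % off-diagonal element is Pearson's r
end
disp([outlier' r'])              % outlier position and corresponding r
```

Each row of the displayed matrix pairs an outlier position with the correlation coefficient of the thirty random points plus that one outlier.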

Various methods exist to calculate the significance of Pearson’s correlation coefficient. The function corrcoef also includes the possibility of evaluating the quality of the result. The p-value is the probability of obtaining a correlation as large as the observed value by random chance, when the true correlation is zero. If the p-value is small, then the correlation coefficient r is significant.

[r,p] = corrcoef(x,y)

r =
    1.0000    0.9275
    0.9275    1.0000

p =
    1.0000    0.0000
    0.0000    1.0000

In our example, the p-value is close to zero (0.0000 at the displayed precision), suggesting that the correlation coefficient is significant. We conclude from this experiment that this particular significance test fails to detect correlations attributed to an outlier. We therefore try an alternative t-test statistic to determine the significance of the correlation between x and y. According to this test, we can reject the null hypothesis that there is no correlation if the calculated t is larger than the critical t (n–2 degrees of freedom, α=0.05).

tcalc = r(2,1) * ((length(x)-2)/(1-r(2,1)^2))^0.5
tcrit = tinv(0.95,length(x)-2)

tcalc =
   13.3594

tcrit =
    1.6991
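The t statistic can also be converted into a p-value, which makes it easier to compare the two tests. To our understanding of the MATLAB documentation, the p-value returned by corrcoef is itself based on a t statistic with n–2 degrees of freedom, so the two tests would be expected to agree rather than to provide independent checks. The sketch below tests this under stated assumptions: the data are regenerated with an arbitrary rng seed (so the numbers differ from the text), and tcdf and tinv require the Statistics Toolbox.

```matlab
rng(5)                               % arbitrary seed; values differ from the text
x = [randn(30,1); 20];               % random cluster plus outlier (20,20)
y = [randn(30,1); 20];
n = length(x);

[R,P] = corrcoef(x,y);               % correlation matrix and p-values
r = R(2,1);

% t statistic and two-sided p-value with n-2 degrees of freedom
tcalc = r*sqrt((n-2)/(1-r^2));
pcalc = 2*tcdf(-abs(tcalc),n-2);

% Two-sided critical value, for comparison with the one-sided tinv(0.95,n-2)
tcrit2 = tinv(0.975,n-2);
```

If the assumption holds, pcalc and P(2,1) are practically identical, which explains why neither test flags the correlation caused by the outlier.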

This result indeed indicates that we can reject the null hypothesis and the correlation coefficient is significant. As an alternative to detecting outliers, resampling schemes or surrogates such as the bootstrap or jackknife methods represent powerful tools for assessing the statistical significance of the results. These techniques are particularly useful when scanning large multivariate data sets for outliers (see Chapter 9). Resampling schemes repeatedly resample the original data set of n data points, either by choosing subsamples of n–1 data points n times (the jackknife), or by picking an arbitrary number of subsamples of n data points each, drawn with replacement (the bootstrap). The statistics of these subsamples provide better information on the characteristics of the population than the statistical parameters (mean, standard deviation, correlation coefficients) computed from the full data set. The function bootstrp allows resampling of our bivariate data set including the outlier (x,y)=(20,20).

rhos1000 = bootstrp(1000,'corrcoef',x,y);

This command first resamples the data a thousand times, calculates the correlation coefficient for each new subsample and stores the result in the variable rhos1000. Since corrcoef delivers a 2 × 2 matrix as mentioned above, rhos1000 has the dimension 1000 × 4, i.e., 1000 values for each element of the 2 × 2 matrix. Plotting the histogram of the 1000 values for the second element, i.e., the correlation coefficient of (x,y), illustrates the dispersion of this parameter with respect to the presence or absence of the outlier. Since the distribution of rhos1000 contains many empty classes, we use a large number of bins.

hist(rhos1000(:,2),30)

The histogram shows a cluster of correlation coefficients at around r=0.1 that follow the normal distribution, and a strong peak close to r=1 (Fig. 4.3). The interpretation of this histogram is relatively straightforward. When the subsample contains the outlier, the correlation coefficient is close to one, but subsamples without the outlier yield a very low (close to zero) correlation coefficient suggesting no strong dependence between the two variables x and y.

[Figure 4.3: histogram titled "Histogram of Bootstrap Results" with x-axis "Correlation Coefficient r" and y-axis "Bootstrap Samples". Annotations: "Low correlation coefficients of samples not containing the outlier" and "High correlation coefficients of samples including the outlier".]

Fig. 4.3 Bootstrap result for Pearson’s correlation coefficient r from 1000 subsamples. The histogram shows a roughly normally-distributed cluster of correlation coefficients at around r=0.1 suggesting that these subsamples do not include the outlier. The strong peak close to r=1, however, suggests that an outlier with high values of the two variables x and y is present in the corresponding subsamples.
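The interpretation of the two groups of correlation coefficients can be checked by recording, for each bootstrap subsample, whether the outlier was drawn. The sketch below does the resampling by hand with randi rather than with bootstrp, purely so that the drawn indices are available. The rng seed is an arbitrary assumption and the variable names rboot and hasout are hypothetical. For n=31, a subsample drawn with replacement misses one particular data point with probability (30/31)^31, i.e., roughly 36% of the time.

```matlab
rng(0)                                   % arbitrary seed for this sketch
x = [randn(30,1); 20];                   % random cluster plus outlier (20,20)
y = [randn(30,1); 20];
n = length(x);
nboot = 1000;

rboot = zeros(nboot,1);                  % correlation coefficient per subsample
hasout = false(nboot,1);                 % true if the subsample has the outlier
for i = 1:nboot
    idx = randi(n,n,1);                  % draw n indices with replacement
    R = corrcoef(x(idx),y(idx));
    rboot(i) = R(1,2);
    hasout(i) = any(idx == n);           % the outlier is stored in row n
end

subplot(2,1,1), hist(rboot(hasout),30)
title('Subsamples containing the outlier')
subplot(2,1,2), hist(rboot(~hasout),30)
title('Subsamples without the outlier')
```

Splitting the histogram in this way separates the peak close to r=1 from the cluster of low correlation coefficients.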

Bootstrapping therefore provides a simple but powerful tool for either accepting or rejecting our first estimate of the correlation coefficient for the population. The application of the above procedure to the synthetic sediment data yields a clear unimodal Gaussian distribution for the correlation coefficients of the subsamples.

clear
agedepth = load('agedepth_1.txt');
meters = agedepth(:,1);
age = agedepth(:,2);
corrcoef(meters,age)

ans =
    1.0000    0.9567
    0.9567    1.0000

rhos1000 = bootstrp(1000,'corrcoef',meters,age);
hist(rhos1000(:,2),30)
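The jackknife mentioned earlier in this section is not demonstrated above. Returning to the outlier example, a minimal leave-one-out sketch could look as follows. The rng seed is an arbitrary assumption, the variable names rjack and keep are hypothetical, and the loop is written out by hand rather than relying on a toolbox function.

```matlab
rng(0)                               % arbitrary seed for this sketch
x = [randn(30,1); 20];               % random cluster plus outlier in row 31
y = [randn(30,1); 20];
n = length(x);

rjack = zeros(n,1);
for i = 1:n
    keep = true(n,1);
    keep(i) = false;                 % leave out the i-th data point
    R = corrcoef(x(keep),y(keep));
    rjack(i) = R(1,2);               % r of the subsample with n-1 points
end

% The subsample with the lowest r omits the most influential data point
[rmin,imin] = min(rjack);
plot(1:n,rjack,'o'), xlabel('Omitted data point'), ylabel('r')
```

In this sketch, omitting the outlier lowers the correlation coefficient to the level of the random cluster, whereas omitting any other data point leaves it high, so imin points to the outlier.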