CHAPTER 5
Correlation and linear regression
5.1 Correlation
Let (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ) represent n points on a scatterplot. Then the
correlation coefficient, denoted by r, is the average of the products of the z-scores,
except that we divide by n − 1 instead of n. That is
r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)    (5.1)
where s_x and s_y are the standard deviations of x and y respectively.
Equivalently, the correlation coefficient can be given as
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
which simplifies to
r = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}\, \sqrt{\sum_{i=1}^{n} y_i^2 - n\bar{y}^2}}
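None of these formulas needs to be evaluated by hand in practice. The following sketch in Python (an illustrative choice, not part of these notes; it assumes numpy is available) computes r from the z-score definition (5.1) and cross-checks it against a library routine, using the reaction-time data of Example 5.1.1 below.

```python
import numpy as np

# Reaction-time data from Example 5.1.1 below (visual x, auditory y, in ms).
x = np.array([161, 203, 235, 176, 201, 188, 228, 211, 191, 178])
y = np.array([159, 206, 241, 163, 197, 193, 209, 189, 169, 201])

n = len(x)
# Equation (5.1): average product of the z-scores, dividing by n - 1.
zx = (x - x.mean()) / x.std(ddof=1)   # ddof=1 gives the sample standard deviation
zy = (y - y.mean()) / y.std(ddof=1)
r_manual = np.sum(zx * zy) / (n - 1)

# Cross-check with numpy's built-in correlation matrix.
r_builtin = np.corrcoef(x, y)[0, 1]
print(r_manual, r_builtin)            # both should be about 0.816
```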
The correlation coefficient measures the strength of a linear relationship between
two continuous variables. It is a mathematical fact that −1 ≤ r ≤ 1. Positive values of
r indicate that greater values of one variable are associated with greater values of the
other; negative values of r indicate that greater values of one variable are associated
with lesser values of the other.
If r = 0, x and y are said to be uncorrelated and values closer to r = 0 indicate
a weak linear relationship whereas values closer to 1 or -1 indicate a strong linear
relationship.
A few important notes to take home:
• The correlation coefficient measures only linear association.
• The correlation coefficient can be misleading when outliers are present.
• Correlation is not causation.
Inference on the population correlation
Let X and Y be random variables with a bivariate normal distribution and let ρ denote
the population correlation between the two variables. Suppose (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )
is a random sample from the joint distribution of X and Y . Let r be the sample cor-
relation of the n pairs of points. Then the quantity
W = \frac{1}{2} \ln\left( \frac{1+r}{1-r} \right)    (5.2)
is approximately normally distributed with mean
\mu_W = \frac{1}{2} \ln\left( \frac{1+\rho}{1-\rho} \right)    (5.3)
and variance
\sigma_W^2 = \frac{1}{n-3}    (5.4)
To construct the confidence interval, we need to make ρ from equation 5.3 the subject
of the formula. We obtain
\rho = \frac{e^{2\mu_W} - 1}{e^{2\mu_W} + 1}    (5.5)
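Equations (5.2)–(5.5) translate directly into a few lines of code. Below is a minimal Python sketch (illustrative only; the function name fisher_ci is ours, and numpy and scipy are assumed) that returns an approximate confidence interval for ρ given a sample correlation r and sample size n.

```python
import numpy as np
from scipy.stats import norm

def fisher_ci(r, n, conf=0.95):
    """Approximate CI for the population correlation, via equations (5.2)-(5.5)."""
    w = 0.5 * np.log((1 + r) / (1 - r))       # equation (5.2)
    se = np.sqrt(1.0 / (n - 3))               # square root of (5.4)
    z = norm.ppf(1 - (1 - conf) / 2)          # e.g. 1.96 for a 95% interval
    lo_w, hi_w = w - z * se, w + z * se       # interval for mu_W
    back = lambda m: (np.exp(2 * m) - 1) / (np.exp(2 * m) + 1)   # equation (5.5)
    return back(lo_w), back(hi_w)

# For instance, fisher_ci(0.8159, 10) should give roughly (0.383, 0.955),
# matching Example 5.1.1 below.
```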
Example 5.1.1. In a study of reaction times, in milliseconds, the time to respond to a
visual stimulus (x) and the time to respond to an auditory stimulus (y) were recorded
for each of 10 subjects. The results are presented below
x 161 203 235 176 201 188 228 211 191 178
y 159 206 241 163 197 193 209 189 169 201
Find a 95% confidence interval for the correlation between the two reaction times.
Solution 5.1.1. With a few keystrokes on your calculator, you should find that
r = 0.8159 so that we have
W = \frac{1}{2} \ln\left( \frac{1 + 0.8159}{1 - 0.8159} \right) = 1.1444
Since W is approximately normally distributed with standard deviation
\sigma_W = \sqrt{\frac{1}{10-3}} = 0.3780,
it follows that a 95% confidence interval for µW is
1.1444 − 1.96(0.3780) < µW < 1.1444 + 1.96(0.3780)
That is,
0.4036 < µW < 1.8852
Applying the transformation (5.5) to each term, we obtain the 95% CI for ρ as
\frac{e^{2(0.4036)} - 1}{e^{2(0.4036)} + 1} < \frac{e^{2\mu_W} - 1}{e^{2\mu_W} + 1} < \frac{e^{2(1.8852)} - 1}{e^{2(1.8852)} + 1}
which simplifies to
0.383 < ρ < 0.955
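As a check, the same interval can be reproduced in software. The short Python sketch below (illustrative; it simply repeats the arithmetic of the solution with r = 0.8159 and n = 10 from the example) should return roughly the same limits.

```python
import numpy as np

r, n = 0.8159, 10                              # from the example
w = 0.5 * np.log((1 + r) / (1 - r))            # about 1.1444
se = np.sqrt(1 / (n - 3))                      # about 0.3780
lo_w, hi_w = w - 1.96 * se, w + 1.96 * se      # interval for mu_W
lo = (np.exp(2 * lo_w) - 1) / (np.exp(2 * lo_w) + 1)
hi = (np.exp(2 * hi_w) - 1) / (np.exp(2 * hi_w) + 1)
print(round(lo, 3), round(hi, 3))              # roughly 0.383 and 0.955
```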
Example 5.1.2. From the previous example, find the p-value for testing H0 : ρ ≤ 0.3
versus H1 : ρ > 0.3.
Solution 5.1.2. Under H0 , we take ρ = 0.3 so that
\mu_W = \frac{1}{2} \ln\left( \frac{1 + 0.3}{1 - 0.3} \right) = 0.3095
and
\sigma_W = \sqrt{\frac{1}{10 - 3}} = 0.3780
Thus, under H0 , we have that W ∼ N(0.3095, 0.3780²). Since we found W = 1.1444,
the z-score is
Z = \frac{1.1444 - 0.3095}{0.3780} = 2.21
Hence the p-value is given by
p = P(Z > 2.21) = 0.0136
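The same p-value can be verified in software. The sketch below (illustrative Python, assuming scipy is available for the normal CDF) repeats the calculation under H0 with ρ = 0.3.

```python
import numpy as np
from scipy.stats import norm

r, n, rho0 = 0.8159, 10, 0.3                   # values from the example
w = 0.5 * np.log((1 + r) / (1 - r))            # observed W, about 1.1444
mu0 = 0.5 * np.log((1 + rho0) / (1 - rho0))    # mean of W under H0, about 0.3095
se = np.sqrt(1 / (n - 3))                      # about 0.3780
z = (w - mu0) / se                             # about 2.21
print(1 - norm.cdf(z))                         # upper-tail p-value, about 0.0136
```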
It becomes much simpler when testing the hypothesis H0 : ρ = 0. In that case we use
the test statistic
U = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}} \sim t_{n-2}    (5.6)
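As a minimal illustration (our own sketch, assuming numpy and scipy; the function name corr_zero_test is hypothetical), the statistic in (5.6) and its two-sided p-value can be computed as follows.

```python
import numpy as np
from scipy.stats import t

def corr_zero_test(r, n):
    """Test statistic (5.6) and two-sided p-value for H0: rho = 0."""
    u = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
    p = 2 * (1 - t.cdf(abs(u), df=n - 2))
    return u, p
```

For reference, scipy.stats.pearsonr reports a two-sided p-value based on the same t statistic, so it should serve as a cross-check.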
5.2 Simple linear regression
Sometimes in practice, one is called upon to investigate problems involving sets of
variables when it is known that there exists some inherent relationship among the
variables. For example,
• It may be known that the tar content in the outlet stream in a chemical process
is related to the inlet temperature. Therefore it may be of interest to develop a
method for estimating the tar content for various levels of the inlet temperature
from experimental information. Now, of course, it is highly likely that for many
example runs in which the inlet temperature is the same, say 130°C, the outlet
tar content will not be the same.
• This is much like what happens when we study several automobiles with the same
engine volume. They will not all have the same gas mileage.
• Houses in the same part of the country that have the same square footage of
living space will not all be sold for the same price.
Tar content, gas mileage (mpg), and the price of houses (in thousands of Pula) are
natural dependent variables, or responses, in these three scenarios. Inlet tem-
perature, engine volume (cubic feet), and square feet of living space are, respectively,
natural independent variables, or regressors. The concept of regression analysis
deals with finding the best relationship between Y and x, quantifying the strength of
that relationship, and using methods that allow for prediction of the response values
given values of the regressor x.
A reasonable form of a relationship between the response variable, Y and the re-
gressor variable, x is the linear relationship
Y = β0 + β1 x + ϵ (5.7)
where β0 and β1 are the unknown intercept and slope parameters respectively, and ϵ is a
random variable assumed to be distributed with E(ϵ) = 0 and Var(ϵ) = σ².
The quantity σ² is often called the error variance or residual variance.
An important aspect of regression analysis is to estimate the parameters β0 and β1 .
Suppose we denote the estimates b0 for β0 and b1 for β1 . Then the estimated or fitted
regression line is given by
ŷ = b0 + b1 x,
where ŷ is the predicted or fitted value. Obviously, the fitted line is an estimate of
the true regression line. We expect that the fitted line should be closer to the true
regression line when a large amount of data are available. The method of estimation
is discussed in the next section.
5.2.1 The Method of Least Squares
We aim to find the estimates of β0 and β1 , b0 and b1 , so that the sum of the squares
of the residuals is a minimum. The residual sum of squares is often called the sum of
squares of the errors about the regression line and is denoted by SSE. To do this, we
first express the errors, ei in terms of b0 and b1 :
ei = yi − ŷi = yi − b0 − b1 xi
Therefore b0 and b1 are quantities that minimize the sum
SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2
These quantities are
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}    (5.8)
b_0 = \bar{y} - b_1 \bar{x}    (5.9)
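Equations (5.8) and (5.9) are straightforward to code. A minimal Python sketch follows (illustrative; the function name least_squares is ours, and numpy is assumed).

```python
import numpy as np

def least_squares(x, y):
    """Least-squares slope and intercept, equations (5.8) and (5.9)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1
```

As a cross-check, numpy.polyfit(x, y, 1) or scipy.stats.linregress(x, y) should return the same slope and intercept.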
Example 5.2.1. A blood pressure measurement consists of two numbers: the systolic
pressure, which is the maximum pressure taken when the heart is contracting, and
the diastolic pressure, which is the minimum pressure taken at the beginning of the
heartbeat. Blood pressures were measured, in mmHg, for a sample of 16 adults. The
following table presents the results.
Systolic Diastolic Systolic Diastolic Systolic Diastolic Systolic Diastolic
134 87 133 91 119 69 108 69
115 83 112 75 118 88 105 66
113 77 107 71 130 76 157 103
123 77 110 74 116 70 154 94
a) Construct a scatterplot of diastolic pressure (y) vs systolic pressure (x). Verify
that a linear model is appropriate.
b) Estimate the regression line for predicting the diastolic pressure from the systolic
pressure.
c) Predict the diastolic pressure for a patient whose systolic pressure is 125 mmHg.
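A possible software sketch for parts b) and c) is given below (illustrative Python; using matplotlib for the plot is our choice, and the data are taken from the table above). Part a) is addressed by the scatterplot: if the points cluster around a straight line, a linear model is reasonable.

```python
import numpy as np
import matplotlib.pyplot as plt

# Blood-pressure data from the table (systolic x, diastolic y), in mmHg.
x = np.array([134, 133, 119, 108, 115, 112, 118, 105,
              113, 107, 130, 157, 123, 110, 116, 154])
y = np.array([ 87,  91,  69,  69,  83,  75,  88,  66,
               77,  71,  76, 103,  77,  74,  70,  94])

# a) scatterplot, to judge whether a linear model looks appropriate
plt.scatter(x, y)
plt.xlabel("Systolic pressure (mmHg)")
plt.ylabel("Diastolic pressure (mmHg)")

# b) least-squares estimates, equations (5.8) and (5.9)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# c) predicted diastolic pressure at a systolic pressure of 125 mmHg
print(b0, b1, b0 + b1 * 125)
plt.show()
```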
5.2.2 Measuring goodness-of-fit
A goodness-of-fit statistic is a quantity that measures how well a model explains a given
set of data. A linear model fits the data well if there is a strong linear
relationship between x and y. Therefore, the correlation coefficient r can also be
thought of as a goodness-of-fit statistic for the linear model, as we explain below.
Suppose we have no knowledge of the predictor variable. Then the best we can do
to predict the value of the response variable for a new subject is to use the average
value, ȳ. In this case, the sum of squared prediction errors will be given by
SST = \sum_{i=1}^{n} (y_i - \bar{y})^2    (5.10)
This quantity is usually referred to as total sum of squares, denoted by SST .
If on the other hand, we have the knowledge of the predictor variable, then we can
fit the least-squares line and predict the response variable with the fitted value, ŷi . The
sum of squared prediction errors in this case is given by
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (5.11)
and usually referred to as the error sum of squares, denoted by SSE.
Thus, the strength of the linear relationship can be measured by computing the
reduction of the sum of squared prediction errors obtained by using ŷi instead of the
naive guess of ȳ. That is, we use the regression sum of squares given by
SSR = SST - SSE = \sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
There is, however, a problem with using SSR as it is, since it is in the squared units of
the response variable. This makes it difficult to compare two models fitted to
two different sets of data. We therefore scale this quantity by SST and
obtain
r^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}    (5.12)
The quantity r2 is called the coefficient of determination. Since the SST is just the
sample variance of the response variable without dividing by n − 1, r2 is often referred
to by statisticians as the proportion of the variance in the response variable explained
by the regression model.
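As an illustration, the coefficient of determination is easy to compute once the fitted values are known. The Python sketch below (our own; the function name r_squared is hypothetical, and numpy is assumed) implements equation (5.12).

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination, equation (5.12)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    sse = np.sum((y - y_hat) ** 2)       # error sum of squares
    sst = np.sum((y - y.mean()) ** 2)    # total sum of squares
    return 1 - sse / sst
```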
Example 5.2.2. A least-squares line is fit to a set of points. If the total sum of squares
is 181.2 and the error sum of squares is 33.9, compute the coefficient of determination
R2 .
Example 5.2.3. In a study relating the degree of warping, in mm, of a copper plate
(y) to temperature in °C (x), the following summary statistics were calculated:
n = 40, \sum_{i=1}^{n} (x_i - \bar{x})^2 = 98775, \sum_{i=1}^{n} (y_i - \bar{y})^2 = 19.10, \bar{x} = 26.36, \bar{y} = 0.5188 and
\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = 826.94.
a) Compute the least-squares line for predicting warping from temperature.
b) Compute the coefficient of determination of the fitted least-squares line above.
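A brief Python sketch for this example is given below (illustrative; the names Sxx, Syy and Sxy are our shorthand for the given sums). It uses the identity, valid for simple linear regression, that the coefficient of determination equals Sxy²/(Sxx · Syy).

```python
# Summary statistics given in Example 5.2.3.
n = 40
Sxx = 98775.0          # sum of (x_i - xbar)^2
Syy = 19.10            # sum of (y_i - ybar)^2
Sxy = 826.94           # sum of (x_i - xbar)(y_i - ybar)
xbar, ybar = 26.36, 0.5188

# a) least-squares line, equations (5.8) and (5.9)
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

# b) coefficient of determination: SSR/SST = Sxy^2 / (Sxx * Syy)
r2 = Sxy**2 / (Sxx * Syy)
print(b0, b1, r2)
```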
5.2.3 Inferences about the least-square coefficients
The simple linear model has been presented as
yi = β0 + β1 xi + εi
where εi is the error in the ith observation for i = 1, 2, . . . , n.
Without the error terms, the points (xi , yi ), i = 1, 2, . . . , n, would lie exactly on the
least-squares line and the estimates b0 and b1 would be the true values of β0 and β1 .
But because of the error terms, εi ’s, the points are scattered around the least-squares
line and the quantities b0 and b1 do not equal the true values of β0 and β1 .
It should be noted that if we were to re-sample and observe the pairs (x1 , y1 ), . . . , (xn , yn ),
the values of εi , and thus b0 and b1 , would be different. That is, εi , b0 and b1 are random
variables. Specifically, the error terms, εi , create uncertainty in the estimates b0 and
b1 . If εi ’s are small in magnitude then the uncertainties of b0 and b1 would be small as
well. On the other hand, if εi ’s are large in magnitude then the uncertainties of b0 and
b1 would also be large.
Therefore, for b0 and b1 to be useful, we need to estimate just how large their uncertainties
are. We need the following assumptions about the error terms to be able
to establish these uncertainties.
Assumptions for the error terms
(i) ε1 , ε2 , . . . , εn are random and independent of each other.
(ii) E(εi ) = 0 for all i = 1, 2, . . . , n
(iii) V(εi ) = σ 2 for all i = 1, 2, . . . , n. That is, the error terms have constant variance.
(iv) ε1 , ε2 , . . . , εn are normally distributed.
Under these assumptions, the effect of the error terms is governed by their variance,
σ 2 . Therefore to estimate the uncertainties in b0 and b1 , we should first estimate the
error variance, σ 2 . The estimate of the error variance is given by
s^2 = \frac{1}{n-2} \sum_{i=1}^{n} e_i^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (5.13)
We can now state the means and variances of the least-squares coefficients b0
and b1 . Notice that both b0 and b1 are linear combinations of the values of the response
variable, yi , and therefore simple algebraic manipulations show that
b0 and b1 are unbiased estimators of β0 and β1 with the following variances
V(b_0) = s^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)    (5.14)
and
V(b_1) = \frac{s^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}    (5.15)
It can then be shown that, under the above assumptions, the quantities
\frac{b_0 - \beta_0}{\sqrt{V(b_0)}} \quad \text{and} \quad \frac{b_1 - \beta_1}{\sqrt{V(b_1)}}
have a t-distribution with n − 2 degrees of freedom. Thus the confidence intervals and
hypothesis testing for β0 and β1 are derived exactly the same way as for the population
mean.
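Putting equations (5.13)–(5.15) together, confidence intervals for the coefficients can be computed as in the following Python sketch (illustrative; the function name coef_inference is ours, and numpy and scipy are assumed).

```python
import numpy as np
from scipy.stats import t

def coef_inference(x, y, conf=0.95):
    """Confidence intervals for b0 and b1, using equations (5.13)-(5.15)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    Sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    s2 = np.sum(resid ** 2) / (n - 2)                     # equation (5.13)
    se_b0 = np.sqrt(s2 * (1 / n + x.mean() ** 2 / Sxx))   # from (5.14)
    se_b1 = np.sqrt(s2 / Sxx)                             # from (5.15)
    tcrit = t.ppf(1 - (1 - conf) / 2, df=n - 2)
    return ((b0 - tcrit * se_b0, b0 + tcrit * se_b0),
            (b1 - tcrit * se_b1, b1 + tcrit * se_b1))
```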
5.2.4 Prediction
Suppose that the experimenter wishes to construct a confidence interval for the mean
response, µY |x0 . We shall use the point estimator ŷ0 = b0 + b1 x0 to estimate µY |x0 =
β0 + β1 x0 . It can be shown that the sampling distribution of ŷ0 is normal with mean
\mu_{Y|x_0} = E(\hat{y}_0) = \beta_0 + \beta_1 x_0    (5.16)
and variance
\sigma_{\hat{y}_0}^2 = s^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)    (5.17)
Thus, a 100(1 − α)% confidence interval on the mean response µY |x0 can now be con-
structed from the statistic
T = \frac{\hat{y}_0 - \mu_{Y|x_0}}{\sigma_{\hat{y}_0}}    (5.18)
which has a t-distribution with n − 2 degrees of freedom.
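The following Python sketch (illustrative; the function name mean_response_ci is ours) computes a confidence interval for the mean response at a new value x0, using equations (5.16)–(5.18) together with the error-variance estimate (5.13).

```python
import numpy as np
from scipy.stats import t

def mean_response_ci(x, y, x0, conf=0.95):
    """CI for the mean response at x0, based on equations (5.16)-(5.18)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    Sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * x.mean()
    s2 = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)          # equation (5.13)
    y0_hat = b0 + b1 * x0                                    # point estimate (5.16)
    se = np.sqrt(s2 * (1 / n + (x0 - x.mean()) ** 2 / Sxx))  # square root of (5.17)
    tcrit = t.ppf(1 - (1 - conf) / 2, df=n - 2)
    return y0_hat - tcrit * se, y0_hat + tcrit * se
```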
Example 5.2.4. Use the data set
y 7 50 100 40 70
x 2 15 30 10 20
(a) Plot the data.
(b) Fit a regression line to the data and plot the regression line on the graph with
the data.
(c) Plot 95% confidence limits for the mean response on the graph around the re-
gression line.
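A possible software sketch for this exercise is shown below (illustrative Python using matplotlib; the grid of x values and the plotting choices are ours). It fits the line with (5.8) and (5.9) and overlays 95% confidence limits for the mean response computed from (5.17).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t

x = np.array([2, 15, 30, 10, 20], dtype=float)
y = np.array([7, 50, 100, 40, 70], dtype=float)
n = len(x)

# (b) fitted line via equations (5.8) and (5.9)
Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s2 = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)

# (c) 95% confidence limits for the mean response along a grid of x values
grid = np.linspace(x.min(), x.max(), 100)
fit = b0 + b1 * grid
se = np.sqrt(s2 * (1 / n + (grid - x.mean()) ** 2 / Sxx))
tcrit = t.ppf(0.975, df=n - 2)

# (a) the data, with the fitted line and confidence limits overlaid
plt.scatter(x, y, label="data")
plt.plot(grid, fit, label="fitted line")
plt.plot(grid, fit - tcrit * se, linestyle="--", label="95% CI for mean response")
plt.plot(grid, fit + tcrit * se, linestyle="--")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```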