CHAPTER 5
Correlation and linear regression
5.1 Correlation
Let (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ) represent n points on a scatterplot. Then the
correlation coefficient, denoted by r, is the average of the products of the z-scores,
except that we divide by n − 1 instead of n. That is
r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)    (5.1)
where s_x and s_y are the standard deviations of x and y respectively.
Equivalently, the correlation coefficient can be given as
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
which simplifies to
r = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}\, \sqrt{\sum_{i=1}^{n} y_i^2 - n\bar{y}^2}}
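None of these formulas needs to be evaluated by hand in practice. The following sketch in Python (an illustrative choice, not part of these notes; it assumes numpy is available) computes r from the z-score definition (5.1) and cross-checks it against a library routine, using the reaction-time data of Example 5.1.1 below.

```python
import numpy as np

# Reaction-time data from Example 5.1.1 below (visual x, auditory y, in ms).
x = np.array([161, 203, 235, 176, 201, 188, 228, 211, 191, 178])
y = np.array([159, 206, 241, 163, 197, 193, 209, 189, 169, 201])

n = len(x)
# Equation (5.1): average product of the z-scores, dividing by n - 1.
zx = (x - x.mean()) / x.std(ddof=1)   # ddof=1 gives the sample standard deviation
zy = (y - y.mean()) / y.std(ddof=1)
r_manual = np.sum(zx * zy) / (n - 1)

# Cross-check with numpy's built-in correlation matrix.
r_builtin = np.corrcoef(x, y)[0, 1]
print(r_manual, r_builtin)            # both should be about 0.816
```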
The correlation coefficient measures the strength of a linear relationship between
two continuous variables. It is a mathematical fact that −1 ≤ r ≤ 1. Positive values of
r indicate that greater values of one variable are associated with greater values of the
other; negative values of r indicate that greater values of one variable are associated
with lesser values of the other.
If r = 0, x and y are said to be uncorrelated and values closer to r = 0 indicate
a weak linear relationship whereas values closer to 1 or -1 indicate a strong linear
relationship.
A few important notes to take home:
• The correlation coefficient measures only linear association.
• The correlation coefficient can be misleading when outliers are present.
• Correlation is not causation.
Inference on the population correlation
Let X and Y be random variables with a bivariate normal distribution and let ρ denote
the population correlation between the two variables. Suppose (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )
is a random sample from the joint distribution of X and Y . Let r be the sample cor-
relation of the n pairs of points. Then the quantity
W = \frac{1}{2} \ln\left( \frac{1+r}{1-r} \right)    (5.2)
is approximately normally distributed with mean
\mu_W = \frac{1}{2} \ln\left( \frac{1+\rho}{1-\rho} \right)    (5.3)
and variance
\sigma_W^2 = \frac{1}{n-3}    (5.4)
To construct the confidence interval, we need to make ρ from equation 5.3 the subject
of the formula. We obtain
\rho = \frac{e^{2\mu_W} - 1}{e^{2\mu_W} + 1}    (5.5)
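Equations (5.2)–(5.5) translate directly into a few lines of code. Below is a minimal Python sketch (illustrative only; the function name fisher_ci is ours, and numpy and scipy are assumed) that returns an approximate confidence interval for ρ given a sample correlation r and sample size n.

```python
import numpy as np
from scipy.stats import norm

def fisher_ci(r, n, conf=0.95):
    """Approximate CI for the population correlation, via equations (5.2)-(5.5)."""
    w = 0.5 * np.log((1 + r) / (1 - r))       # equation (5.2)
    se = np.sqrt(1.0 / (n - 3))               # square root of (5.4)
    z = norm.ppf(1 - (1 - conf) / 2)          # e.g. 1.96 for a 95% interval
    lo_w, hi_w = w - z * se, w + z * se       # interval for mu_W
    back = lambda m: (np.exp(2 * m) - 1) / (np.exp(2 * m) + 1)   # equation (5.5)
    return back(lo_w), back(hi_w)

# For instance, fisher_ci(0.8159, 10) should give roughly (0.383, 0.955),
# matching Example 5.1.1 below.
```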
Example 5.1.1. In a study of reaction times, in milliseconds, the time to respond to a
visual stimulus (x) and the time to respond to an auditory stimulus (y) were recorded
for each of 10 subjects. The results are presented below
x 161 203 235 176 201 188 228 211 191 178
y 159 206 241 163 197 193 209 189 169 201
Find a 95% confidence interval for the correlation between the two reaction times.
Solution 5.1.1. With a few keystrokes on your calculator, you should find that
r = 0.8159 so that we have
W = \frac{1}{2} \ln\left( \frac{1 + 0.8159}{1 - 0.8159} \right) = 1.1444
Since W is approximately normally distributed with standard deviation
\sigma_W = \sqrt{\frac{1}{10-3}} = 0.3780,
it follows that a 95% confidence interval for µW is
1.1444 − 1.96(0.3780) < µW < 1.1444 + 1.96(0.3780)
That is,
0.4036 < µW < 1.8852
Applying the transformation (5.5) to each term, we obtain the 95% CI for ρ as
\frac{e^{2(0.4036)} - 1}{e^{2(0.4036)} + 1} < \frac{e^{2\mu_W} - 1}{e^{2\mu_W} + 1} < \frac{e^{2(1.8852)} - 1}{e^{2(1.8852)} + 1}
which simplifies to
0.383 < ρ < 0.955
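As a check, the same interval can be reproduced in software. The short Python sketch below (illustrative; it simply repeats the arithmetic of the solution with r = 0.8159 and n = 10 from the example) should return roughly the same limits.

```python
import numpy as np

r, n = 0.8159, 10                              # from the example
w = 0.5 * np.log((1 + r) / (1 - r))            # about 1.1444
se = np.sqrt(1 / (n - 3))                      # about 0.3780
lo_w, hi_w = w - 1.96 * se, w + 1.96 * se      # interval for mu_W
lo = (np.exp(2 * lo_w) - 1) / (np.exp(2 * lo_w) + 1)
hi = (np.exp(2 * hi_w) - 1) / (np.exp(2 * hi_w) + 1)
print(round(lo, 3), round(hi, 3))              # roughly 0.383 and 0.955
```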
Example 5.1.2. From the previous example, find the p-value for testing H0 : ρ ≤ 0.3
versus H1 : ρ > 0.3.
Solution 5.1.2. Under H0 , we take ρ = 0.3 so that
\mu_W = \frac{1}{2} \ln\left( \frac{1 + 0.3}{1 - 0.3} \right) = 0.3095
and
\sigma_W = \sqrt{\frac{1}{10 - 3}} = 0.3780
Thus, under H0 , we have that W ∼ N(0.3095, 0.3780²). Since we found W = 1.1444,
the z-score is
Z = \frac{1.1444 - 0.3095}{0.3780} = 2.21
Hence the p-value is given by
p = P(Z > 2.21) = 0.0136
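The same p-value can be verified in software. The sketch below (illustrative Python, assuming scipy is available for the normal CDF) repeats the calculation under H0 with ρ = 0.3.

```python
import numpy as np
from scipy.stats import norm

r, n, rho0 = 0.8159, 10, 0.3                   # values from the example
w = 0.5 * np.log((1 + r) / (1 - r))            # observed W, about 1.1444
mu0 = 0.5 * np.log((1 + rho0) / (1 - rho0))    # mean of W under H0, about 0.3095
se = np.sqrt(1 / (n - 3))                      # about 0.3780
z = (w - mu0) / se                             # about 2.21
print(1 - norm.cdf(z))                         # upper-tail p-value, about 0.0136
```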
It becomes much simpler when testing the hypothesis H0 : ρ = 0. In that case we use
the test statistic
U = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}} \sim t_{n-2}    (5.6)
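As a minimal illustration (our own sketch, assuming numpy and scipy; the function name corr_zero_test is hypothetical), the statistic in (5.6) and its two-sided p-value can be computed as follows.

```python
import numpy as np
from scipy.stats import t

def corr_zero_test(r, n):
    """Test statistic (5.6) and two-sided p-value for H0: rho = 0."""
    u = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
    p = 2 * (1 - t.cdf(abs(u), df=n - 2))
    return u, p
```

For reference, scipy.stats.pearsonr reports a two-sided p-value based on the same t statistic, so it should serve as a cross-check.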
5.2 Simple linear regression
Sometimes in practice, one is called upon to investigate problems involving sets of
variables when it is known that there exists some inherent relationship among the
variables. For example,
• It may be known that the tar content in the outlet stream in a chemical process
is related to the inlet temperature. Therefore it may be of interest to develop a
method for estimating the tar content for various levels of the inlet temperature
from experimental information. Now, of course, it is highly likely that for many
example runs in which the inlet temperature is the same, say 130°C, the outlet
tar content will not be the same.
• This is much like what happens when we study several automobiles with the same
engine volume. They will not all have the same gas mileage.
• Houses in the same part of the country that have the same square footage of
living space will not all be sold for the same price.
Tar content, gas mileage (mpg), and the price of houses (in thousands of Pula) are
natural dependent variables, or responses, in these three scenarios. Inlet tem-
perature, engine volume (cubic feet), and square feet of living space are, respectively,
natural independent variables, or regressors. The concept of regression analysis
deals with finding the best relationship between Y and x, quantifying the strength of
that relationship, and using methods that allow for prediction of the response values
given values of the regressor x.
A reasonable form of a relationship between the response variable, Y and the re-
gressor variable, x is the linear relationship
Y = β0 + β1 x + ϵ (5.7)
where β0 and β1 are the unknown intercept and slope parameters respectively, and ϵ is a
random variable assumed to be distributed with E(ϵ) = 0 and Var(ϵ) = σ².
The quantity σ² is often called the error variance or residual variance.
An important aspect of regression analysis is to estimate the parameters β0 and β1 .
Suppose we denote the estimates b0 for β0 and b1 for β1 . Then the estimated or fitted
regression line is given by
ŷ = b0 + b1 x,
where ŷ is the predicted or fitted value. Obviously, the fitted line is an estimate of
the true regression line. We expect that the fitted line should be closer to the true
regression line when a large amount of data are available. The method of estimation
is discussed in the next section.
5.2.1 The Method of Least Squares
We aim to find the estimates of β0 and β1 , b0 and b1 , so that the sum of the squares
of the residuals is a minimum. The residual sum of squares is often called the sum of
squares of the errors about the regression line and is denoted by SSE. To do this, we
first express the errors, ei in terms of b0 and b1 :
ei = yi − ŷi = yi − b0 − b1 xi
Therefore b0 and b1 are quantities that minimize the sum
SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2
These quantities are
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}    (5.8)
b_0 = \bar{y} - b_1 \bar{x}    (5.9)
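Equations (5.8) and (5.9) are straightforward to code. A minimal Python sketch follows (illustrative; the function name least_squares is ours, and numpy is assumed).

```python
import numpy as np

def least_squares(x, y):
    """Least-squares slope and intercept, equations (5.8) and (5.9)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1
```

As a cross-check, numpy.polyfit(x, y, 1) or scipy.stats.linregress(x, y) should return the same slope and intercept.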
Example 5.2.1. A blood pressure measurement consists of two numbers: the systolic
pressure, which is the maximum pressure taken when the heart is contracting, and
the diastolic pressure, which is the minimum pressure taken at the beginning of the
heartbeat. Blood pressures were measured, in mmHg, for a sample of 16 adults. The
following table presents the results.
Systolic Diastolic Systolic Diastolic Systolic Diastolic Systolic Diastolic
134 87 133 91 119 69 108 69
115 83 112 75 118 88 105 66
113 77 107 71 130 76 157 103
123 77 110 74 116 70 154 94
a) Construct a scatterplot of diastolic pressure (y) vs systolic pressure (x). Verify
that a linear model is appropriate.
b) Estimate the regression line for predicting the diastolic pressure from the systolic
pressure.
c) Predict the diastolic pressure for a patient whose systolic pressure is 125 mmHg.
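A possible software sketch for parts b) and c) is given below (illustrative Python; using matplotlib for the plot is our choice, and the data are taken from the table above). Part a) is addressed by the scatterplot: if the points cluster around a straight line, a linear model is reasonable.

```python
import numpy as np
import matplotlib.pyplot as plt

# Blood-pressure data from the table (systolic x, diastolic y), in mmHg.
x = np.array([134, 133, 119, 108, 115, 112, 118, 105,
              113, 107, 130, 157, 123, 110, 116, 154])
y = np.array([ 87,  91,  69,  69,  83,  75,  88,  66,
               77,  71,  76, 103,  77,  74,  70,  94])

# a) scatterplot, to judge whether a linear model looks appropriate
plt.scatter(x, y)
plt.xlabel("Systolic pressure (mmHg)")
plt.ylabel("Diastolic pressure (mmHg)")

# b) least-squares estimates, equations (5.8) and (5.9)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# c) predicted diastolic pressure at a systolic pressure of 125 mmHg
print(b0, b1, b0 + b1 * 125)
plt.show()
```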
5.2.2 Measuring goodness-of-fit
A goodness-of-fit statistic is a quantity that measures how well a model explains a given
set of data. A linear model fits the data well if there is a strong linear
relationship between x and y. Therefore, the correlation coefficient r can also be
thought of as a goodness-of-fit statistic for the linear model, as we explain below.
Suppose we have no knowledge of the predictor variable. Then the best we can do
to predict the value of the response variable for a new subject is to use the average
value, ȳ. In this case, the sum of squared prediction errors will be given by
SST = \sum_{i=1}^{n} (y_i - \bar{y})^2    (5.10)
This quantity is usually referred to as total sum of squares, denoted by SST .
If on the other hand, we have the knowledge of the predictor variable, then we can
fit the least-squares line and predict the response variable with the fitted value, ŷi . The
sum of squared prediction errors in this case is given by
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (5.11)
and usually referred to as the error sum of squares, denoted by SSE.
Thus, the strength of the linear relationship can be measured by computing the
reduction of the sum of squared prediction errors obtained by using ŷi instead of the
naive guess of ȳ. That is, we use the regression sum of squares given by
SSR = SST - SSE = \sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
There is, however, a problem with using SSR as it is, since it is in the squared units of
the response variable. This makes it difficult to compare two models fitted to
two different sets of data. We therefore scale this quantity by SST and
obtain
r^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}    (5.12)
The quantity r2 is called the coefficient of determination. Since the SST is just the
sample variance of the response variable without dividing by n − 1, r2 is often referred
to by statisticians as the proportion of the variance in the response variable explained
by the regression model.
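As an illustration, the coefficient of determination is easy to compute once the fitted values are known. The Python sketch below (our own; the function name r_squared is hypothetical, and numpy is assumed) implements equation (5.12).

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination, equation (5.12)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    sse = np.sum((y - y_hat) ** 2)       # error sum of squares
    sst = np.sum((y - y.mean()) ** 2)    # total sum of squares
    return 1 - sse / sst
```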
Example 5.2.2. A least-squares line is fit to a set of points. If the total sum of squares
is 181.2 and the error sum of squares is 33.9, compute the coefficient of determination
R2 .
Example 5.2.3. In a study relating the degree of warping, in mm, of a copper plate
(y) to temperature in °C (x), the following summary statistics were calculated:
n = 40, \sum_{i=1}^{n} (x_i - \bar{x})^2 = 98775, \sum_{i=1}^{n} (y_i - \bar{y})^2 = 19.10, \bar{x} = 26.36, \bar{y} = 0.5188 and
\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = 826.94.
a) Compute the least-squares line for predicting warping from temperature.
b) Compute the coefficient of determination of the fitted least-squares line above.
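A brief Python sketch for this example is given below (illustrative; the names Sxx, Syy and Sxy are our shorthand for the given sums). It uses the identity, valid for simple linear regression, that the coefficient of determination equals Sxy²/(Sxx · Syy).

```python
# Summary statistics given in Example 5.2.3.
n = 40
Sxx = 98775.0          # sum of (x_i - xbar)^2
Syy = 19.10            # sum of (y_i - ybar)^2
Sxy = 826.94           # sum of (x_i - xbar)(y_i - ybar)
xbar, ybar = 26.36, 0.5188

# a) least-squares line, equations (5.8) and (5.9)
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

# b) coefficient of determination: SSR/SST = Sxy^2 / (Sxx * Syy)
r2 = Sxy**2 / (Sxx * Syy)
print(b0, b1, r2)
```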
5.2.3 Inferences about the least-square coefficients
The simple linear model has been presented as
yi = β0 + β1 xi + εi
where εi is the error in the ith observation for i = 1, 2, . . . , n.
Without the error terms, the points (xi , yi ), i = 1, 2, . . . , n, would lie exactly on the
least-squares line and the estimates b0 and b1 would be the true values of β0 and β1 .
But because of the error terms, εi ’s, the points are scattered around the least-squares
line and the quantities b0 and b1 do not equal the true values of β0 and β1 .
It should be noted that if we were to re-sample and observe the pairs (x1 , y1 ), . . . , (xn , yn ),
the values of εi , and thus b0 and b1 , would be different. That is, εi , b0 and b1 are random
variables. Specifically, the error terms, εi , create uncertainty in the estimates b0 and
b1 . If εi ’s are small in magnitude then the uncertainties of b0 and b1 would be small as
well. On the other hand, if εi ’s are large in magnitude then the uncertainties of b0 and
b1 would also be large.
Therefore, for b0 and b1 to be useful, we need to estimate just how large their uncertainties
are. We need the following assumptions about the error terms to be able
to establish these uncertainties.
Assumptions for the error terms
(i) ε1 , ε2 , . . . , εn are random and independent of each other.
(ii) E(εi ) = 0 for all i = 1, 2, . . . , n
(iii) V(εi ) = σ 2 for all i = 1, 2, . . . , n. That is, the error terms have constant variance.
(iv) ε1 , ε2 , . . . , εn are normally distributed.
Under these assumptions, the effect of the error terms is governed by their variance,
σ 2 . Therefore to estimate the uncertainties in b0 and b1 , we should first estimate the
error variance, σ 2 . The estimate of the error variance is given by
s^2 = \frac{1}{n-2} \sum_{i=1}^{n} e_i^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2    (5.13)
We can now state the means and variances of the least-squares coefficients b0
and b1 . Notice that both b0 and b1 are linear combinations of the values of the response
variable, yi , and therefore simple algebraic manipulations show that
b0 and b1 are unbiased estimators of β0 and β1 with the following variances
V(b_0) = s^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)    (5.14)
and
V(b_1) = \frac{s^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}    (5.15)
It can then be shown that, under the above assumptions, the quantities
\frac{b_0 - \beta_0}{\sqrt{V(b_0)}} \quad \text{and} \quad \frac{b_1 - \beta_1}{\sqrt{V(b_1)}}
have a t-distribution with n − 2 degrees of freedom. Thus the confidence intervals and
hypothesis testing for β0 and β1 are derived exactly the same way as for the population
mean.
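Putting equations (5.13)–(5.15) together, confidence intervals for the coefficients can be computed as in the following Python sketch (illustrative; the function name coef_inference is ours, and numpy and scipy are assumed).

```python
import numpy as np
from scipy.stats import t

def coef_inference(x, y, conf=0.95):
    """Confidence intervals for b0 and b1, using equations (5.13)-(5.15)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    Sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    s2 = np.sum(resid ** 2) / (n - 2)                     # equation (5.13)
    se_b0 = np.sqrt(s2 * (1 / n + x.mean() ** 2 / Sxx))   # from (5.14)
    se_b1 = np.sqrt(s2 / Sxx)                             # from (5.15)
    tcrit = t.ppf(1 - (1 - conf) / 2, df=n - 2)
    return ((b0 - tcrit * se_b0, b0 + tcrit * se_b0),
            (b1 - tcrit * se_b1, b1 + tcrit * se_b1))
```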
5.2.4 Prediction
Suppose that the experimenter wishes to construct a confidence interval for the mean
response, µY |x0 . We shall use the point estimator ŷ0 = b0 + b1 x0 to estimate µY |x0 =
β0 + β1 x0 . It can be shown that the sampling distribution of ŷ0 is normal with mean
\mu_{Y|x_0} = E(\hat{y}_0) = \beta_0 + \beta_1 x_0    (5.16)
and variance
\sigma_{\hat{y}_0}^2 = s^2 \left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)    (5.17)
Thus, a 100(1 − α)% confidence interval on the mean response µY |x0 can now be con-
structed from the statistic
T = \frac{\hat{y}_0 - \mu_{Y|x_0}}{\sigma_{\hat{y}_0}}    (5.18)
which has a t-distribution with n − 2 degrees of freedom.
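The following Python sketch (illustrative; the function name mean_response_ci is ours) computes a confidence interval for the mean response at a new value x0, using equations (5.16)–(5.18) together with the error-variance estimate (5.13).

```python
import numpy as np
from scipy.stats import t

def mean_response_ci(x, y, x0, conf=0.95):
    """CI for the mean response at x0, based on equations (5.16)-(5.18)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    Sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * x.mean()
    s2 = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)          # equation (5.13)
    y0_hat = b0 + b1 * x0                                    # point estimate (5.16)
    se = np.sqrt(s2 * (1 / n + (x0 - x.mean()) ** 2 / Sxx))  # square root of (5.17)
    tcrit = t.ppf(1 - (1 - conf) / 2, df=n - 2)
    return y0_hat - tcrit * se, y0_hat + tcrit * se
```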
Example 5.2.4. Use the data set
y 7 50 100 40 70
x 2 15 30 10 20
(a) Plot the data.
(b) Fit a regression line to the data and plot the regression line on the graph with
the data.
(c) Plot 95% confidence limits for the mean response on the graph around the re-
gression line.
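A possible software sketch for this exercise is shown below (illustrative Python using matplotlib; the grid of x values and the plotting choices are ours). It fits the line with (5.8) and (5.9) and overlays 95% confidence limits for the mean response computed from (5.17).

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t

x = np.array([2, 15, 30, 10, 20], dtype=float)
y = np.array([7, 50, 100, 40, 70], dtype=float)
n = len(x)

# (b) fitted line via equations (5.8) and (5.9)
Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
s2 = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)

# (c) 95% confidence limits for the mean response along a grid of x values
grid = np.linspace(x.min(), x.max(), 100)
fit = b0 + b1 * grid
se = np.sqrt(s2 * (1 / n + (grid - x.mean()) ** 2 / Sxx))
tcrit = t.ppf(0.975, df=n - 2)

# (a) the data, with the fitted line and confidence limits overlaid
plt.scatter(x, y, label="data")
plt.plot(grid, fit, label="fitted line")
plt.plot(grid, fit - tcrit * se, linestyle="--", label="95% CI for mean response")
plt.plot(grid, fit + tcrit * se, linestyle="--")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```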