
Chapter 17
Simple linear regression and correlation
Chapter outline
17.1 Model
17.2 Estimating the coefficients
17.3 Error variable: Required conditions
17.4 Assessing the model
17.5 Using the regression equation
17.6 Coefficients of correlation
17.7 Regression diagnostics – I
Learning objectives
LO1 Identify the dependent and the independent variables
LO2 Use the least squares method to derive estimators of simple linear regression model parameters
LO3 Understand the required conditions to perform statistical inferences about a linear regression model
LO4 Test the significance of the regression model parameters
LO5 Calculate measures used to assess the performance of a regression model
LO6 Use the regression equation for prediction
LO7 Calculate the prediction interval of the dependent variable
LO8 Calculate the coefficient of correlation between two variables and assess the strength of the relationship
LO9 Detect violations of required conditions using diagnostic checks on the regression model results.
Introduction
When the problem objective is to analyse the relationship between numerical variables, correlation and regression analysis are the first tools we will study. We briefly covered correlation and regression analysis when we discussed descriptive graphical and numerical techniques in Chapters 4 and 5. We now extend that analysis in this and the next few chapters.
Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables).
Dependent variable: denoted Y
Independent variables: denoted X1, X2, …, Xk
17.7

Correlation analysis
If we are interested only in determining whether a relationship exists, we employ correlation analysis, a technique introduced earlier.
This chapter examines the linear relationship between two variables, an analysis called simple linear regression. We learn how to estimate such a relationship, measure its strength and make inferences about it.
Mathematical equations describing these relationships are also called models, and they fall into two types: deterministic or probabilistic.

Model types
Deterministic model: an equation or set of equations that allows us to fully determine the value of the dependent variable from the values of the independent variables.
Contrast this with…
Probabilistic model: a method used to capture the randomness that is part of a real-life process.
E.g. do all houses of the same size (measured in square metres) sell for exactly the same price?

A model
To create a probabilistic model, we start with a
deterministic model that approximates the relationship
we want to model and add a random term that measures
the error of the deterministic component.

Deterministic Model:
The cost of building a new house is about $800 per square
metre and most lots sell for about $300 000. Hence the
approximate selling price (y) would be:
y = $300 000 + $800(x)
(where x is the size of the house in square metres)
A model…
A model of the relationship between house size (independent variable) and house price (dependent variable) would be:
[Figure: a straight line of house price against house size, with intercept at $300 000 ('Most lots sell for $300 000'). In this model, the price of the house is completely determined by the size.]

A model…
In real life, however, the price will vary even among houses of the same size:
[Figure: scatter of house prices around the line House price = 300 000 + 800(Size) + ε. Same house size, but different price points (e.g. décor options, portico upgrades, lot location…); lower vs. higher variability.]

Random term
We now represent the price of a house as a function of its size in this probabilistic model:

y = 300 000 + 800x + ε

where ε (the Greek letter epsilon) is the random term (also known as the error variable). It is the difference between the actual selling price and the estimated price based on the size of the house. Its value will vary from house sale to house sale, even if the area of the house (i.e. x) remains the same, due to other factors such as the location, age and décor of the house.
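As a concrete illustration, here is a minimal Python sketch that simulates this probabilistic model (the error standard deviation of $30 000 and the range of house sizes are assumed values, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

beta0, beta1 = 300_000, 800      # deterministic component: lot cost + cost per square metre
sigma = 30_000                   # assumed std dev of the random term (illustrative)

size = rng.uniform(150, 350, size=100)   # house sizes in square metres
eps = rng.normal(0, sigma, size=100)     # random term epsilon
price = beta0 + beta1 * size + eps       # y = 300 000 + 800x + eps

print(price[:3])  # houses of similar size still sell at different prices
```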

17.1 Model
A straight-line model with one independent variable is called a first-order linear model or a simple linear regression model. It is written as:

y = β₀ + β₁x + ε

where y is the dependent variable, x is the independent variable, β₀ is the y-intercept, β₁ is the slope of the line, and ε is the error variable.

Simple linear regression model…
[Figure: the line y = β₀ + β₁x, with β₀ = y-intercept and β₁ = slope (= rise/run).]
Note that both β₀ and β₁ are population parameters which are usually unknown, and hence are estimated from the data.

17.2 Estimating the coefficients
In much the same way as we base estimates of μ on x̄, we estimate β₀ using β̂₀ and β₁ using β̂₁, the y-intercept and slope (respectively) of the least squares or regression line given by:

ŷ = β̂₀ + β̂₁x

(Recall: this is an application of the least squares method, and it produces the straight line that minimises the sum of the squared differences between the points and the line.)

Least squares method
The question is: which straight line fits best?
The least squares line minimises the sum of squared differences between the points and the line.
[Figure: scatter of data points with the fitted least squares line.]

Example 1
The annual bonuses ($1000s) of six employees with different years of experience were recorded as follows. We wish to determine the straight-line relationship between annual bonus and years of experience.

Years of experience (x):  1   2   3   4   5   6
Annual bonus (y):         6   1   9   5   17  12

Least squares line – Example 1
[Figure: scatter of annual bonus against years of experience with the fitted line; the vertical differences between the points and the line are called residuals.]

Example 2: Which line fits best?
The best line is the one that minimises the sum of squared vertical differences between the points and the line. Let us compare two lines for the points (1, 2), (2, 4), (3, 1.5) and (4, 3.2); the second line is horizontal at y = 2.5.
Line 1: sum of squared differences = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
Line 2: sum of squared differences = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99
The smaller the sum of squared differences, the better the fit of the line to the data.
[Figure: the four data points plotted with the two candidate lines.]

Least squares estimates
To calculate the estimates of the coefficients that minimise the differences between the data points and the line, use the formulas:

β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²
   = [Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n] / [Σxᵢ² − (Σxᵢ)²/n]
   = [Σxᵢyᵢ − n·x̄·ȳ] / [Σxᵢ² − n·x̄²]

β̂₀ = ȳ − β̂₁x̄

Least squares estimates…
Alternative formulas
We have already developed formulas for the sample variances and covariance of two variables, X and Y:

s_x² = Σ(xᵢ − x̄)²/(n − 1) = [Σxᵢ² − (Σxᵢ)²/n]/(n − 1) = [Σxᵢ² − n·x̄²]/(n − 1)

s_y² = Σ(yᵢ − ȳ)²/(n − 1) = [Σyᵢ² − (Σyᵢ)²/n]/(n − 1) = [Σyᵢ² − n·ȳ²]/(n − 1)

s_xy = Σ(xᵢ − x̄)(yᵢ − ȳ)/(n − 1) = [Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n]/(n − 1) = [Σxᵢyᵢ − n·x̄·ȳ]/(n − 1)

Least squares estimates…
Then
β̂₁ = s_xy/s_x²
β̂₀ = ȳ − β̂₁x̄
The estimated simple linear regression equation that estimates the equation of the first-order linear model is:
ŷ = β̂₀ + β̂₁x
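As a check on these formulas, here is a minimal Python sketch applying them to the Example 1 data (six employees' years of experience and bonuses); it reproduces the least squares line directly from the sample covariance and variance:

```python
import numpy as np

# Data from Example 1: years of experience (x), annual bonus in $1000s (y)
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([6, 1, 9, 5, 17, 12], dtype=float)

s_xy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance s_xy
s_xx = np.var(x, ddof=1)            # sample variance s_x^2

b1 = s_xy / s_xx                    # slope estimate
b0 = y.mean() - b1 * x.mean()       # intercept estimate

print(f"y-hat = {b0:.3f} + {b1:.3f}x")   # y-hat = 0.933 + 2.114x
```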

Example: recall the example in Chapter 5
Compute the covariance and the coefficient of correlation between advertising expenditure and sales level, and discuss the strength and direction of the relationship between them. Base your calculation on the data (in millions) provided below.

Advert (x):  1   3   5   4   2   5   3   2
Sales (y):   30  40  40  50  35  50  35  25

Example 3 – Odometer readings and prices of used cars
(Example 17.3, p. 717)
A car dealer wants to find the relationship between the odometer reading and the selling price of used cars. A random sample of 100 cars is selected and the data are recorded. Estimate a linear relationship between price (dependent variable y, in $'000) and odometer reading (independent variable x, in '000 km).

Car  Odometer  Price
1    37.4      16.0
2    44.8      15.2
3    45.8      15.0
4    30.9      17.4
5    31.7      17.4
6    34.0      16.1
…    …         …


Example 3 – Solution
To calculate β̂₀ and β̂₁, we need to calculate several statistics first:

n = 100;  Σx = 3601.1;  Σy = 1623.7;  x̄ = 36.01;  ȳ = 16.24;
Σx² = 133 986.6;  Σy² = 26 421.9;  Σxy = 58 067.4

s_x² = [Σxᵢ² − (Σxᵢ)²/n]/(n − 1) = 4307.378/99 = 43.509
s_xy = [Σxᵢyᵢ − (Σxᵢ)(Σyᵢ)/n]/(n − 1) = −403.6207/99 = −4.077

Example 3 – Solution…
Therefore,
β̂₁ = s_xy/s_x² = −4.077/43.509 = −0.0937
β̂₀ = ȳ − β̂₁x̄ = 16.24 − (−0.0937)(36.01) = 19.611

The estimated least squares regression line is
ŷ = β̂₀ + β̂₁x = 19.611 − 0.0937x
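In place of Excel's Data Analysis tool, the same estimates can be computed with a short Python sketch. The full 100-car dataset is not reproduced on the slides, so the arrays below are assumed to hold it:

```python
import numpy as np

def fit_simple_ols(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    """Return (b0, b1) for the least squares line y-hat = b0 + b1*x."""
    s_xy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance
    s_xx = np.var(x, ddof=1)            # sample variance of x
    b1 = s_xy / s_xx
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# odometer ('000 km) and price ($'000) are assumed to hold the 100 observations
# of Example 17.3; with the full data this gives b0 = 19.611, b1 = -0.0937.
```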

Example 3 – Solution…
Using Excel (Data Analysis)
We can use Excel's Data Analysis tool to arrive at the same results. In the Data Analysis dialogue box, enter the input range; the regression output reports the same fitted line:

ŷ = 19.611 − 0.094x

[Figure: scatter of price ($'000, y) against odometer reading ('000 km, x) with the fitted line ŷ = 19.611 − 0.094x; there are no data below about 15 ('000 km).]

The intercept is β̂₀ = 19.611. Do not interpret the intercept as the 'price of cars that have not been driven', since the sample contains no cars with odometer readings near zero.
The slope is β̂₁ = −0.0937: for each additional kilometre on the odometer, the price decreases by an average of $0.094 (9.4 cents).


17.3 Error variable: Required conditions
The error ε is a critical part of the regression model:
y = β₀ + β₁x + ε
Five requirements involving the distribution of ε must be satisfied in order for the estimated coefficients to have desirable properties:
(1) The mean of ε is zero: E(ε) = 0.
(2) The standard deviation of ε is a constant (σ_ε) for all values of x.
(3) The errors are independent.
(4) The errors are independent of the independent variable x.
(5) The probability distribution of ε is normal.


17.4 Assessing the model
The least squares method will produce a regression line whether or not there is a linear relationship between x and y. Consequently, it is important to assess how well the linear model fits the data.
Several methods are used to assess the model:
• testing and/or estimating the regression model coefficients individually/jointly
• using descriptive measurements such as the sum of squares for errors (SSE), the standard error of estimate (s_ε) and the coefficient of determination (R²).


Sum of squares for errors (SSE)
• SSE is the sum of squared differences between the points and the regression line.
• SSE can serve as a measure of how well the line fits the data.
• The sum of squares for errors is calculated as

SSE = Σ(yᵢ − ŷᵢ)²  =  (n − 1)[s_y² − s_xy²/s_x²]

• SSE plays a role in every statistical technique we employ to assess the model.


Standard error of estimate
• If σ_ε is small (i.e. the errors tend to be close to zero), the model fits the data well.
• Therefore we can use σ_ε as a measure of the suitability of using a linear model.
• An estimator of σ_ε is given by the standard error of estimate, s_ε, defined as

s_ε = √(SSE/(n − 2))

• If s_ε = 0 (which is equivalent to saying SSE = 0), all the data points fall on the estimated regression line.
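A minimal Python sketch of this measure (assuming arrays x and y hold the sample data):

```python
import numpy as np

def standard_error_of_estimate(x: np.ndarray, y: np.ndarray) -> float:
    """s_e = sqrt(SSE / (n - 2)) for the simple linear regression of y on x."""
    n = len(x)
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    sse = np.sum((y - (b0 + b1 * x)) ** 2)   # sum of squares for errors
    return np.sqrt(sse / (n - 2))
```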

Example 3…
Calculate the standard error of estimate for Example 3 and describe what it tells you about the model fit.
Solution: using the statistics calculated before,

s_y² = [Σyᵢ² − (Σyᵢ)²/n]/(n − 1) = 0.5848

SSE = (n − 1)[s_y² − s_xy²/s_x²] = 99[0.5848 − (−4.077)²/43.509] = 20.072

Thus,
s_ε = √(SSE/(n − 2)) = √(20.072/(100 − 2)) = 0.4526


Example 3 – Solution…
If s_ε is small, the fit is excellent and the linear model should be used for forecasting. If s_ε is large, the model is poor.
But what is small and what is large?


Example 3 – What is small?
• A small s_ε indicates a good fit. How small is small?
• We judge the value of the standard error of estimate, s_ε (= 0.4526), by comparing it to the values of the dependent variable y, or more specifically to the mean value of y, ȳ (= 16.24).
• In this example, s_ε is small relative to the sample mean of y: only 2.8% = (0.4526/16.24) × 100%. Therefore, we can conclude that the standard error of estimate is reasonably small.
• s_ε cannot be used alone as an absolute measure of the model's utility, but it can be used to compare models.


Testing the slope
If no linear relationship exists between the two variables, we would expect the regression line to be horizontal, that is, to have a slope of zero.
We want to see if there is a linear relationship, i.e. we want to see if the slope (β₁) is something other than zero. Our research hypothesis becomes:
H_A: β₁ ≠ 0 (a linear relationship exists)
Thus the null hypothesis becomes:
H₀: β₁ = 0 (no linear relationship exists)


Testing the slope…
We can use the following test statistic to test our hypotheses:

t = (β̂₁ − β₁)/s_β̂₁

where s_β̂₁ is the standard deviation of β̂₁, defined as:

s_β̂₁ = s_ε/√((n − 1)s_x²)

• If the error variable (ε) is normally distributed, the test statistic has a Student t-distribution with n − 2 degrees of freedom.
• The rejection region depends on whether we are doing a one-tail or two-tail test (a two-tail test is most typical).


Testing the slope…
If we wish to test for a positive or negative linear relationship, we conduct a one-tail test, i.e. our research (alternative) hypothesis becomes:
H_A: β₁ < 0 (testing for a negative slope)
or
H_A: β₁ > 0 (testing for a positive slope)
Of course, the null hypothesis remains H₀: β₁ = 0.
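A sketch of the two-tail slope test in Python (assuming arrays x and y hold the sample; scipy is used only for the t-distribution):

```python
import numpy as np
from scipy import stats

def slope_t_test(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    """Two-tail t-test of H0: beta1 = 0; returns (t statistic, p-value)."""
    n = len(x)
    s_xx = np.var(x, ddof=1)
    b1 = np.cov(x, y, ddof=1)[0, 1] / s_xx
    b0 = y.mean() - b1 * x.mean()
    sse = np.sum((y - (b0 + b1 * x)) ** 2)
    s_e = np.sqrt(sse / (n - 2))              # standard error of estimate
    s_b1 = s_e / np.sqrt((n - 1) * s_xx)      # standard deviation of b1-hat
    t = (b1 - 0) / s_b1
    p = 2 * stats.t.sf(abs(t), df=n - 2)      # two-tail p-value
    return t, p
```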

Example 3…
Test to determine whether there is enough evidence to
infer that a linear relationship exists between the price
and the odometer reading at the 5% significance level.

Example 3 – Solution…
We want to test the hypotheses:
H₀: β₁ = 0 (no linear relationship)
H_A: β₁ ≠ 0 (a linear relationship exists)
If the null hypothesis is rejected, we conclude that there is a significant linear relationship between price and odometer reading.

Test statistic: t = (β̂₁ − β₁)/s_β̂₁, which has a t-distribution with 98 (= 100 − 2) degrees of freedom.
Level of significance: α = 0.05.


Example 3 – Solution…
Decision rule: Reject H₀ if |t| > t₀.₀₂₅,₉₈ = 1.984, or reject H₀ if p-value < α.
Value of the test statistic:
To compute t we need the values of β̂₁ and s_β̂₁:
β̂₁ = −0.0937
s_β̂₁ = s_ε/√((n − 1)s_x²) = 0.4526/√(99 × 43.509) = 0.0069
t = (β̂₁ − β₁)/s_β̂₁ = (−0.0937 − 0)/0.0069 = −13.59

Conclusion: Comparing the decision rule with the calculated t-value (−13.59), we reject H₀ and conclude that the odometer reading does affect the sale price.


Example 3 – Solution…
Using Excel: regression output

              Coefficients   Standard Error   t Stat     P-value
Intercept     19.61139281    0.252410094      77.69655   7.53E-90
Odometer (x)  −0.093704502   0.006895663      −13.5889   2.84E-24

Looking at the p-value of the slope coefficient, we conclude that there is overwhelming evidence to infer that the odometer reading affects the auction selling price.


Coefficient of determination, R²
The tests thus far are used to conclude whether a linear (positive or negative) relationship exists. When we want to measure the strength of the linear relationship, we use the coefficient of determination, R², defined as follows:

R² = s_xy²/(s_x² s_y²)   or   R² = 1 − SSE/((n − 1)s_y²)

For a simple linear regression model, the coefficient of determination is the squared value of the sample coefficient of correlation (r), i.e. R² = r².


Coefficient of determination…
As we did with analysis of variance, we can partition the total variation in y (SST) into two parts:

SST = SSE + SSR

SST = sum of squares total: the total variation in y, Σ(yᵢ − ȳ)² = (n − 1)s_y².
SSE = sum of squares error: measures the amount of variation in y that remains unexplained, i.e. due to error.
SSR = sum of squares regression: measures the amount of variation in y explained by the variation in the independent variable x.


Coefficient of determination…
SST = SSE + SSR  ⟹  1 = SSE/SST + SSR/SST  ⟹  SSR/SST = 1 − SSE/SST
• R² (= SSR/SST) measures the proportion of the variation in y that is explained by the variation in x:

R² = SSR/SST = (SST − SSE)/SST = 1 − SSE/SST = 1 − SSE/((n − 1)s_y²)

• R² takes on any value between zero and one:
  R² = 1: perfect match between the line and the data points.
  R² = 0: there is no linear relationship between x and y.
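A short Python sketch of this decomposition (assuming arrays x and y hold the sample data):

```python
import numpy as np

def r_squared(x: np.ndarray, y: np.ndarray) -> float:
    """R^2 = 1 - SSE/SST for the simple linear regression of y on x."""
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    sse = np.sum((y - (b0 + b1 * x)) ** 2)   # unexplained variation
    sst = np.sum((y - y.mean()) ** 2)        # total variation, (n - 1) * s_y^2
    return 1 - sse / sst
```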

Coefficient of determination…
In general, the higher the value of R², the better the model fits the data.
Unlike the value of a test statistic, the coefficient of determination does not have a critical value that enables us to test hypotheses about R² and draw conclusions.


Example 3…
Find the coefficient of determination for Example 3. What does this statistic tell you about the model?
Solution

R² = 1 − SSE/((n − 1)s_y²) = 1 − 20.07/(99 × 0.5848) = 0.6533

Therefore, 65% of the variation in the selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model, i.e. is due to error.


Example 3 – Solution…
Using the computer
From the regression output we have R² = 0.6533:

Regression Statistics
Multiple R          0.8083
R Square            0.6533
Adjusted R Square   0.6498
Standard Error      0.4526
Observations        100


More on Excel's output
An analysis of variance (ANOVA) table for the simple linear regression model can be given by:

Source      Degrees of freedom   Sums of squares   Mean squares        F-statistic
Regression  1                    SSR               MSR = SSR/1         F = MSR/MSE
Error       n − 2                SSE               MSE = SSE/(n − 2)
Total       n − 1                SST

ANOVA output for Example 3:

            df   SS        MS        F          Significance F
Regression  1    37.8211   37.8211   184.6583   0.0000
Residual    98   20.0720   0.2048
Total       99   57.8931


17.5 Using the regression equation
Before using the regression model, we need to assess how well it fits the data. If we are satisfied with how well the model fits the data, we can use it to make predictions for y.


Example 3…
Predict the selling price of a three-year-old Ford Laser with 40 000 km on the odometer (refer to Example 3).
Solution
We could use our regression equation
ŷ = 19.61 − 0.0937x
to predict the selling price of a car with 40 ('000) km on the odometer:
ŷ = 19.61 − 0.0937(40) = 15.862 ($'000)
We call this value ($15 862) a point prediction. Chances are, though, that the actual selling price will be different, hence we can estimate the selling price in terms of an interval.


Prediction interval and confidence interval
Two intervals can be used to discover how closely the predicted value will match the true value of y:
• prediction interval – for a particular value of y
• confidence interval – for the expected value of y.

The prediction interval for y:
ŷ ± t_{α/2, n−2} · s_ε · √(1 + 1/n + (x_g − x̄)²/((n − 1)s_x²))

The confidence interval for E(y):
ŷ ± t_{α/2, n−2} · s_ε · √(1/n + (x_g − x̄)²/((n − 1)s_x²))

The prediction interval is wider than the confidence interval.
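A Python sketch of both intervals (assuming arrays x and y hold the sample and x_g is the given value of x):

```python
import numpy as np
from scipy import stats

def intervals(x, y, x_g, alpha=0.05):
    """Return (prediction interval, confidence interval) at x = x_g."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    s_xx = np.var(x, ddof=1)
    b1 = np.cov(x, y, ddof=1)[0, 1] / s_xx
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x_g
    s_e = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))
    t = stats.t.ppf(1 - alpha / 2, df=n - 2)
    lev = 1 / n + (x_g - x.mean()) ** 2 / ((n - 1) * s_xx)
    half_pi = t * s_e * np.sqrt(1 + lev)   # prediction interval half-width
    half_ci = t * s_e * np.sqrt(lev)       # confidence interval half-width
    return (y_hat - half_pi, y_hat + half_pi), (y_hat - half_ci, y_hat + half_ci)
```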

Example 3…
Provide an interval estimate for the bidding price on a Ford Laser with 40 000 km on the odometer.
Solution
The dealer would like to predict the price of a single car, so we use the 95% prediction interval (t₀.₀₂₅,₉₈ = 1.984):

ŷ ± t_{α/2, n−2} · s_ε · √(1 + 1/n + (x_g − x̄)²/((n − 1)s_x²))
= 15.862 ± 1.984 × 0.4526 × √(1 + 1/100 + (40 − 36.01)²/(99 × 43.509))
= 15.862 ± 0.904

We predict a selling price between $14 958 and $16 766.


Example 3…
The car dealer wants to bid on a lot of 250 Ford Lasers, where each car has been driven for about 40 000 km.
Solution
The dealer needs to estimate the mean price per car, so we use the 95% confidence interval:

ŷ ± t_{α/2, n−2} · s_ε · √(1/n + (x_g − x̄)²/((n − 1)s_x²))
= 15.862 ± 1.984 × 0.4526 × √(1/100 + (40 − 36.01)²/(99 × 43.509))
= 15.862 ± 0.105

The lower and upper limits of the confidence interval estimate of the expected selling price are $15 758 and $15 968.


What’s the Difference?


Prediction interval Confidence interval

1 no 1
Used to estimate the value of Used to estimate the mean
one value of y (at given x) value of y (at given x)
The confidence interval estimate of the expected value of y will
be narrower than the prediction interval for the same given value
of x and confidence level. This is because there is less error in
estimating a mean value as opposed to predicting an individual
value.
17.59

Example 3 – Solution…
Using Excel (Data Analysis Plus™)
We can use Data Analysis Plus to obtain the prediction and confidence interval estimates. In the Data Analysis Plus dialogue box, enter the input; the output reports the point prediction, the prediction interval and the confidence interval estimator of the mean price.

17.6 Coefficient of correlation
The coefficient of correlation is used to measure the strength of a linear association between two variables.
The population coefficient of correlation is denoted ρ (rho). Its values range between −1 and +1.
• If ρ = −1 (perfect negative linear association) or ρ = +1 (perfect positive linear association), every point falls on the regression line.
• If ρ = 0, there is no linear association.
The coefficient can be used to test for a linear relationship between two variables.


Coefficient of correlation…
We estimate its value from sample data with the sample coefficient of correlation:

r = s_xy/(s_x s_y)

We can conduct a t-test of the coefficient of correlation (ρ) to determine whether Y and X are linearly related.


Testing the coefficient of correlation
• When there is no linear relationship between two variables, ρ = 0.
• The hypotheses are:
H₀: ρ = 0 (no linear relationship)
H_A: ρ ≠ 0 (a linear relationship exists)
• The test statistic is:

t = r√((n − 2)/(1 − r²))

The statistic is Student t-distributed with d.f. = n − 2, provided the variables are bivariate normally distributed.
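A Python sketch of this test (assuming arrays x and y hold the sample data):

```python
import numpy as np
from scipy import stats

def correlation_t_test(x, y):
    """t = r * sqrt((n - 2)/(1 - r^2)); two-tail test of H0: rho = 0."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]               # sample coefficient of correlation
    t = r * np.sqrt((n - 2) / (1 - r ** 2))
    p = 2 * stats.t.sf(abs(t), df=n - 2)      # two-tail p-value
    return r, t, p
```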

Example 3…
Test the coefficient of correlation to determine if a linear relationship exists in the data of Example 3 between the price and the odometer reading (use α = 0.05).
Solution
We test H₀: ρ = 0 against H_A: ρ ≠ 0.
Test statistic: t = r√((n − 2)/(1 − r²)) ~ t_{n−2}
Level of significance: α = 0.05.


Example 3 – Solution… COMPUTE
Decision rule: Reject H₀ if |t| > t_{α/2, n−2} = t₀.₀₂₅,₉₈ = 1.984.
Value of the test statistic:
Since R² = 0.6533 (from the Excel output) and the slope is negative, we have r = −√0.6533 = −0.8083 (or use the formula r = s_xy/(s_x s_y)).

t = r√((n − 2)/(1 − r²)) = (−0.8083)√((100 − 2)/(1 − 0.6533)) = −13.59


Example 3 – Solution… INTERPRET
Conclusion: Since t = −13.59 < −1.984, we reject H₀. Therefore, there is sufficient evidence at the 5% level to infer that there is a linear relationship between the price and the odometer reading.


Example 3 – Solution…
Using Excel (Data Analysis Plus™)
We can use Data Analysis Plus, which is based on the large-sample test. In the Data Analysis Plus dialogue box, enter the input; the output is presented on the next slide.


Example 3 – Using the computer… COMPUTE
We can also use Excel > Add-Ins > Data Analysis Plus and the Correlation (Pearson) tool. Comparing the reported p-value with α, we again reject the null hypothesis (that there is no linear correlation) in favour of the alternative hypothesis (that our two variables are in fact related in a linear fashion). We can also perform a one-tail test for a positive or negative linear relationship.


Spearman rank correlation coefficient
The Spearman rank test is used to test whether a relationship exists between variables in cases where:
• at least one variable is ranked, or
• both variables are numerical but the normality requirement is not satisfied.


Spearman rank correlation coefficient…
• The null and alternative hypotheses are:
H₀: ρ_s = 0
H_A: ρ_s ≠ 0
• The test statistic is

r_s = s_ab/(s_a s_b)

where a and b are the ranks of the data.
• For a large sample (n ≥ 30), r_s is approximately normally distributed, with

z = r_s√(n − 1)
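A sketch of the large-sample version in Python; scipy's spearmanr ranks the data (averaging tied ranks, as below) and computes the same r_s:

```python
import numpy as np
from scipy import stats

def spearman_large_sample(a, b):
    """r_s and the large-sample test statistic z = r_s * sqrt(n - 1)."""
    r_s, _ = stats.spearmanr(a, b)   # correlation of the ranks
    z = r_s * np.sqrt(len(a) - 1)
    return r_s, z
```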
Example 4: Performance vs aptitude test scores
(Example 17.10, p. 751)
A production manager wants to examine the relationship between:
• the aptitude test score given prior to hiring, and
• the performance rating three months after starting work.
A random sample of 20 production workers was selected, and their test scores and performance ratings were recorded. Analyse the relationship between the two variables.


Example 4 – Solution
• The problem objective is to analyse the relationship between two variables.
• Performance rating is ranked.
• The hypotheses are:
H₀: ρ_s = 0
H_A: ρ_s ≠ 0
• The test statistic is r_s, and the rejection region is |r_s| > r_critical (taken from the Spearman rank correlation table).

Employee  Aptitude test (0–100)  Performance rating (1–5)
1         59                     3
2         47                     2
3         58                     4
4         66                     3
5         77                     2
…         …                      …


Example 4 – Solution…
Ties are broken by averaging the ranks.

Employee  Aptitude test  Rank (a)  Performance rating  Rank (b)
1         59             9         3                   10.5
2         47             3         2                   3.5
3         58             8         4                   17
4         66             14        3                   10.5
5         77             20        2                   3.5
…         …              …         …                   …


Example 4 – Solution…
Solving manually:
Rank each variable separately, then calculate s_a = 5.92, s_b = 5.50 and s_ab = 12.34.
Thus r_s = s_ab/(s_a s_b) = 12.34/(5.92 × 5.50) = 0.379.
The critical value for α = 0.05 and n = 20 is 0.450.
Conclusion: Since r_s = 0.379 < 0.450, we do not reject H₀. At the 5% level of significance there is insufficient evidence to infer that the two variables are related to one another.


Example 4 – Solution…
Using Excel (Data Analysis Plus™)
We can use Data Analysis Plus, which is based on the large-sample test. In the dialogue box, enter the input; from the output, z = r_s√(n − 1) = 0.379 × √19 = 1.65. Since z = 1.65 < z₀.₀₂₅ = 1.96, we do not reject H₀.


17.7 Regression diagnostics – I
The three important conditions required for the validity of the regression analysis are:
• The error variable is normally distributed.
• The error variance is constant for all values of x.
• The errors are independent of each other.
How can we diagnose violations of these conditions?
→ Residual analysis: examine the differences between the actual data points and those predicted by the linear equation.


Residual analysis
Recall that the deviations between the actual data points and the regression line are called residuals. Excel calculates residuals as part of its regression analysis. Residual output for Example 3:

Observation  Predicted Price (y)  Residuals  Standard Residuals
1            16.10684             −0.10684   −0.23729
2            15.41343             −0.21343   −0.47400
3            15.31973             −0.31973   −0.71007
4            16.71592             0.68408    1.51924
5            16.64096             0.75904    1.68572

We can use these residuals to determine whether the error variable is non-normal, whether the error variance is constant, and whether the errors are independent.


Residual analysis…
For each residual we calculate the standard deviation as follows:

s_rᵢ = s_ε√(1 − hᵢ)   where   hᵢ = 1/n + (xᵢ − x̄)²/Σ(xⱼ − x̄)²

Standardised residual i = residual i / standard deviation of residual i.
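A Python sketch of the calculation (assuming arrays x and y hold the sample data):

```python
import numpy as np

def standardised_residuals(x, y):
    """Residuals divided by their standard deviations s_e * sqrt(1 - h_i)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    s_e = np.sqrt(np.sum(resid ** 2) / (n - 2))
    h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)  # leverages
    return resid / (s_e * np.sqrt(1 - h))
```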

Example 3…
Non-normality
• Use Excel to obtain the standardised residual histogram.
• Examine the histogram and look for a bell shape with mean close to zero.
• As can be seen, the standardised residual histogram appears to be bell-shaped.
• We can also apply the Lilliefors test or the χ² test of normality.


Heteroscedasticity
When the requirement of a constant variance is violated, we have heteroscedasticity.
[Figure: plot of the residuals against the predicted values ŷ; the spread of the residuals increases with ŷ.]


Homoscedasticity
When the requirement of a constant variance is not violated, we have homoscedasticity.
[Figure: plot of the residuals against the predicted values ŷ; the spread of the data points does not change much.]


Homoscedasticity…
As far as an even spread is concerned, this is a much better situation. We can diagnose heteroscedasticity by plotting the residuals against the predicted values of y.
[Figure: residuals against ŷ with an even, unchanging spread.]


Heteroscedasticity…
If the variance of the error variable (σ_ε²) is not constant, then we have heteroscedasticity. Here is the plot of the residuals against the predicted values of y for Example 3. There does not appear to be a change in the spread of the plotted points, therefore there is no heteroscedasticity.


Non-independence of the error variable
If we were to observe the auction price of cars every week for, say, a year, that would constitute a time series. When the data are a time series, the errors often are correlated. Error terms that are correlated over time are said to be autocorrelated or serially correlated.
We can often detect autocorrelation by graphing the residuals against the time periods. If a pattern emerges, it is likely that the independence requirement is violated.


Non-independence of the error variable
Patterns in the appearance of the residuals over time indicate that autocorrelation exists.
[Figure: positive autocorrelation – runs of positive residuals, replaced by runs of negative residuals. Negative autocorrelation – oscillating behaviour of the residuals around zero.]


Outliers
• An outlier is an observation that is unusually small or
large.
• Several possibilities need to be investigated when an
outlier is observed:
 There was an error in recording the value.
 The point does not belong in the sample.
 The observation is valid.
• Identify outliers from the scatter diagram.
• It is customary to suspect an observation is an outlier if
the absolute value of the standardised residual is > 2.
• They need to be dealt with since they can easily
influence the least squares line…

Procedure for regression diagnostics
An outlier vs an influential observation:
[Figure: left, the outlier causes a shift in the regression line; right, some outliers may be very influential.]


Procedure for regression diagnostics…
1. Develop a model that has a theoretical basis.
2. Gather data for the two variables in the model.
3. Draw the scatter diagram to determine whether a linear model appears to be appropriate. Identify possible outliers.
4. Determine the regression equation.
5. Calculate the residuals and check the required conditions.
6. Assess the model's fit.
7. If the model fits the data, use the regression equation to predict a particular value of the dependent variable and/or estimate its mean.
(Steps 4–6 are illustrated in the sketch below.)
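A minimal end-to-end sketch of steps 4–6 in Python (assuming arrays x and y hold the gathered data):

```python
import numpy as np

def assess_model(x, y):
    """Step 4: fit the line; step 5: residuals; step 6: s_e and R^2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # slope
    b0 = y.mean() - b1 * x.mean()                         # intercept
    resid = y - (b0 + b1 * x)                             # residuals
    sse = np.sum(resid ** 2)
    s_e = np.sqrt(sse / (n - 2))                          # standard error of estimate
    r2 = 1 - sse / np.sum((y - y.mean()) ** 2)            # coefficient of determination
    return b0, b1, resid, s_e, r2
```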

Summary of techniques – Linear relationship between two variables
