Selvanathan 7e - 17
Introduction
When the problem objective is to analyse the
relationship between numerical variables, correlation
and regression analysis are the first tools we will study.
We briefly covered correlation and regression analysis
when we discussed descriptive graphical and numerical
techniques in Chapters 4 and 5. We now extend that
analysis in this and the next few chapters.
Regression analysis is used to predict the value of one
variable (the dependent variable) on the basis of other
variables (the independent variables).
Dependent variable: denoted Y
Independent variables: denoted X1, X2, …, Xk
Correlation analysis
If we are interested only in determining whether a
relationship exists, we employ correlation analysis, a
technique introduced earlier.
Model types
Deterministic Model: an equation or set of equations that
allow us to fully determine the value of the dependent
variable from the values of the independent variables.
A model
To create a probabilistic model, we start with a
deterministic model that approximates the relationship
we want to model and add a random term that measures
the error of the deterministic component.
Deterministic Model:
The cost of building a new house is about $800 per square
metre and most lots sell for about $300 000. Hence the
approximate selling price (y) would be:
y = $300 000 + $800(x)
(where x is the size of the house in square metres)
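As a quick illustration, the deterministic model can be written as a one-line function. This is a minimal sketch; the function name is ours, not the textbook's:

```python
def selling_price(square_metres: float) -> float:
    """Deterministic model: lot (~$300 000) plus ~$800 per square metre."""
    return 300_000 + 800 * square_metres

# A 250 m^2 house is priced at exactly one value -- no random variation.
print(selling_price(250))  # 500000
```

The point of the example is that the same input always yields the same output; the probabilistic model below adds a random term to this.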
A model…
A model of the relationship between house size
(independent variable) and house price (dependent
variable) would be:
[Graph: house price (y) plotted against house size (x) as a straight line.]
A model…
In real life, however, the house cost will vary even
among the same size of house:
[Graph: house price vs house size. At the same house size, houses sell at different price points (e.g. décor options, portico upgrades, lot location), and the variability around the line (y-intercept about $300K) may be lower or higher.]
Random term
We now represent the price of a house as a function of its
size in this probabilistic model:

y = 300 000 + 800x + ε

where ε is the random term (the error variable).
17.1 Model
A straight-line model with one independent variable is
called a first-order linear model or a simple linear
regression model. It is written as:

y = β0 + β1x + ε

where y is the dependent variable, x is the independent
variable, β0 is the y-intercept, β1 is the slope of the line
and ε is the error variable.
[Graph: the line y = β0 + β1x, with β0 = y-intercept and β1 = slope (= rise/run); the data points scatter around the estimated line ŷ = β̂0 + β̂1x.]
Example 1

Years of experience x:   1   2   3   4   5   6
Annual bonus y:          6   1   9   5  17  12
[Scatter plot of annual bonus against years of experience with a fitted line; the differences between the data points and the line are called residuals.]
The least squares estimates are calculated from the data points (xi, yi):

β̂1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²   or   β̂1 = [Σxiyi - n x̄ ȳ] / [Σxi² - n x̄²]

(where Σ(xi - x̄)² = Σxi² - (Σxi)²/n)

β̂0 = ȳ - β̂1 x̄
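To make the formulas concrete, here is a short Python sketch that applies them to the Example 1 data (x = years of experience, y = annual bonus); the variable names are ours:

```python
xs = [1, 2, 3, 4, 5, 6]      # years of experience
ys = [6, 1, 9, 5, 17, 12]    # annual bonus
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# beta1_hat = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
beta1_hat = sxy / sxx                    # slope
beta0_hat = y_bar - beta1_hat * x_bar    # intercept

print(round(beta1_hat, 4), round(beta0_hat, 4))  # 2.1143 0.9333
```

The same estimates come out of any regression routine; the formulas above are what those routines compute.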
The shortcut formulas use the sample variances and covariance:

sX² = (1/(n-1)) Σ(xi - x̄)² = (1/(n-1)) [Σxi² - (Σxi)²/n] = (1/(n-1)) [Σxi² - n x̄²]

sY² = (1/(n-1)) Σ(yi - ȳ)² = (1/(n-1)) [Σyi² - (Σyi)²/n] = (1/(n-1)) [Σyi² - n ȳ²]

sXY = (1/(n-1)) Σ(xi - x̄)(yi - ȳ) = (1/(n-1)) [Σxiyi - (Σxi)(Σyi)/n] = (1/(n-1)) [Σxiyi - n x̄ ȳ]
β̂1 = sXY / sX²

β̂0 = ȳ - β̂1 x̄

The estimated simple linear regression equation that
estimates the equation of the first-order linear model
is:

ŷ = β̂0 + β̂1x
Example 3 - Solution
To calculate β̂0 and β̂1, we need to
calculate several statistics first:

n = 100;  Σxi = 3601.1;  Σyi = 1623.7;  x̄ = 36.01;  ȳ = 16.24

sX² = (1/(n-1)) [Σxi² - (Σxi)²/n] = 4307.378/99 = 43.509

sxy = (1/(n-1)) [Σxiyi - (Σxi)(Σyi)/n] = -403.6207/99 = -4.077
Example 3 – Solution…
Therefore,

β̂1 = sxy / sX² = -4.077/43.509 = -0.0937

β̂0 = ȳ - β̂1 x̄ = 16.24 - (-0.0937)(36.01) = 19.611
Example 3 – Solution…
Using Excel (Data Analysis)
We can use Data Analysis to arrive at the same
results.
Example 3 – Solution…
Using Excel (Data Analysis)
In the Data Analysis dialogue box (shown below),
enter the input and the output is presented in the next
slide.
Example 3 – Solution…
Using Excel
ŷ = 19.611 - 0.094x
Example 3 – Solution…
Using Excel (Data Analysis): output
ŷ = 19.611 - 0.094x
[Scatter plot of Price (y) against Odometer (x) with the fitted line ŷ = 19.611 - 0.094x; the intercept 19.611 lies in a region (x near 0) with no data.]
SSE = Σ(yi - ŷi)²   or   SSE = (n-1)[sY² - sXY²/sX²]

• SSE plays a role in every statistical technique we
employ to assess the model.
The standard error of estimate is

sε = √(SSE/(n-2))

• If sε = 0 (which is equivalent to saying SSE = 0), all the
data points fall on the estimated regression line.
Example 3…
Calculate the standard error of estimate for Example 3
and describe what it tells you about the model fit.
Solution: from the statistics calculated before,

sy² = (1/(n-1)) [Σyi² - (Σyi)²/n] = 0.5848

SSE = (n-1)[sy² - sxy²/sx²] = 99 [0.5848 - (-4.077)²/43.509] = 20.072

Thus,

sε = √(SSE/(n-2)) = √(20.072/98) = 0.4526
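The shortcut SSE formula can be checked directly from the summary statistics of Example 3. A sketch with our own variable names (small rounding differences from the slides are expected):

```python
from math import sqrt

n = 100
s_y2 = 0.5848    # sample variance of y
s_xy = -4.077    # sample covariance of x and y
s_x2 = 43.509    # sample variance of x

# SSE = (n - 1) * (s_y^2 - s_xy^2 / s_x^2)
sse = (n - 1) * (s_y2 - s_xy ** 2 / s_x2)
s_eps = sqrt(sse / (n - 2))   # standard error of estimate

print(round(sse, 2), round(s_eps, 4))  # 20.07 0.4526
```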
The test statistic for the slope is

t = (β̂1 - β1) / s_β̂1

which is Student t distributed with n - 2 degrees of freedom.
Example 3…
Test to determine whether there is enough evidence to
infer that a linear relationship exists between the price
and the odometer reading at the 5% significance level.
Example 3 – Solution…
We want to test the hypothesis
H0: β1 = 0 (no linear relationship)
HA: β1 ≠ 0 (a linear relationship exists)
If the null hypothesis is rejected, we conclude that there
is a significant linear relationship between price and
odometer reading.
Example 3 – Solution…
Decision rule: Reject H0 if |t| > t0.025,98 = 1.984,
or reject H0 if p-value < α.
Value of the test statistic:
To compute t we need the values of β̂1 and s_β̂1:

β̂1 = -0.0937

s_β̂1 = sε / √((n-1)sx²) = 0.4526 / √(99(43.509)) = 0.0069

t = (β̂1 - β1) / s_β̂1 = (-0.0937 - 0) / 0.0069 = -13.59
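The slope test can be reproduced from the same summary statistics; a sketch (names ours, values from the slides):

```python
from math import sqrt

n = 100
beta1_hat = -0.0937
s_eps = 0.4526   # standard error of estimate
s_x2 = 43.509    # sample variance of x

s_beta1 = s_eps / sqrt((n - 1) * s_x2)   # standard error of the slope
t = (beta1_hat - 0) / s_beta1            # test statistic under H0: beta1 = 0

print(round(s_beta1, 4), round(t, 2))  # 0.0069 -13.59
```

Since |t| = 13.59 > 1.984, H0 is rejected, matching the conclusion above.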
Example 3 – Solution…
Using Excel: the regression output gives the same results.
Coefficient of determination, R2
The tests thus far are used to conclude whether a linear
(positive or negative) relationship exists.
When we want to measure the strength of the linear
relationship, we use the coefficient of determination, R2,
defined as follows.
R² = sXY² / (sX² sY²)   or   R² = 1 - SSE/((n-1)sY²)
Coefficient of determination…
As we did with analysis of variance, we can partition the total
variation in y (SST) into two parts:
SST = SSE + SSR
SST = Sum of Squares Total = total variation in y:

SST = Σ(yi - ȳ)² = (n-1)sY²
Coefficient of determination…
SST = SSE + SSR  ⇒  1 = SSE/SST + SSR/SST  ⇒  SSR/SST = 1 - SSE/SST

• R² (= SSR/SST) measures the proportion of the variation in
y that is explained by the variation in x.

R² = SSR/SST = (SST - SSE)/SST = 1 - SSE/SST = 1 - SSE/((n-1)sY²)

• R² takes on any value between zero and one.
R² = 1: perfect match between the line and the
data points.
R² = 0: there is no linear relationship between x
and y.
Coefficient of determination…
In general, the higher the value of R2, the better the
model fits the data.
Example 3…
Solution
R² = 1 - SSE/((n-1)sY²) = 1 - 20.07/(99 × 0.5848) = 0.6533
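This one-line calculation, written out as a sketch (names ours):

```python
n = 100
sse = 20.07
s_y2 = 0.5848   # sample variance of y

# Proportion of the variation in y explained by the variation in x.
r2 = 1 - sse / ((n - 1) * s_y2)
print(round(r2, 4))  # 0.6533
```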
Example 3 – Solution…
Using the computer
Regression Statistics
Multiple R 0.8083
R Square 0.6533
Adjusted R Square 0.6498
Standard Error 0.4526
Observations 100
The ANOVA table for simple regression:

Source      Degrees of freedom   Sums of squares   Mean squares        F-statistic
Regression  1                    SSR               MSR = SSR/1         F = MSR/MSE
Residual    n - 2                SSE               MSE = SSE/(n - 2)
Total       n - 1                SST

Excel output:

ANOVA
            df   SS        MS        F          Significance F
Regression  1    37.8211   37.8211   184.6583   0.0000
Residual    98   20.0720   0.2048
Total       99   57.8931
Example 3…
Predict the selling price of a three-year-old Ford Laser with
40 000 km on the odometer (refer to Example 3).
Solution
We could use our regression equation

ŷ = 19.61 - 0.0937x

to predict the selling price of a car with 40 ('000) km on the
odometer:

ŷ = 19.61 - 0.0937(40) = 15.862 ('000), i.e. $15 862.

We call this value a point prediction. Chances are, though, that
the actual selling price will be different, so we can instead
estimate the selling price in terms of an interval.
Example 3…
Provide an interval estimate for the bidding price on a
Ford Laser with 40 000 km on the odometer.
Solution
The dealer would like to predict the price of a single car.
The 95% prediction interval (with t0.025,98 = 1.984):

ŷ ± t_{α/2,n-2} sε √(1 + 1/n + (xg - x̄)²/((n-1)sx²))
= 15.862 ± 1.984 × 0.4526 × √(1 + 1/100 + (40 - 36.01)²/(99 × 43.509))
= 15.862 ± 0.904

We predict a selling price between $14 958 and $16 766.
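The prediction interval can be checked numerically; a sketch with our own names, using the summary statistics from the slides:

```python
from math import sqrt

n = 100
y_hat = 15.862           # point prediction at x_g = 40
t_crit = 1.984           # t_{0.025, 98}
s_eps = 0.4526           # standard error of estimate
x_g, x_bar, s_x2 = 40, 36.01, 43.509

# Prediction interval for ONE value of y: note the leading "1 +".
half_width = t_crit * s_eps * sqrt(1 + 1 / n + (x_g - x_bar) ** 2 / ((n - 1) * s_x2))
lo, hi = y_hat - half_width, y_hat + half_width
print(round(half_width, 3), round(lo, 3), round(hi, 3))  # 0.904 14.958 16.766
```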
Example 3…
The car dealer wants to bid on a lot of 250 Ford Lasers, where
each car has been driven for about 40 000 km.
Solution
The dealer needs to estimate the mean price per car. The 95%
confidence interval:

ŷ ± t_{α/2,n-2} sε √(1/n + (xg - x̄)²/((n-1)sx²))
= 15.862 ± 1.984 × 0.4526 × √(1/100 + (40 - 36.01)²/(99 × 43.509))
= 15.862 ± 0.105

The lower and upper limits of the confidence interval estimate
of the expected value of the selling price are $15 758 and
$15 968.
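The same calculation without the leading "1 +" gives the narrower interval for the mean; a sketch (names ours):

```python
from math import sqrt

n = 100
t_crit = 1.984           # t_{0.025, 98}
s_eps = 0.4526           # standard error of estimate
x_g, x_bar, s_x2 = 40, 36.01, 43.509

# Confidence interval for the MEAN value of y: no "1 +" term.
half_width = t_crit * s_eps * sqrt(1 / n + (x_g - x_bar) ** 2 / ((n - 1) * s_x2))
print(round(half_width, 3))  # 0.105, i.e. about 15.86 +/- 0.105
```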
The two intervals differ only in the leading '1 +' inside the square
root: the prediction interval (with the 1) is used to estimate one
value of y (at a given x); the confidence interval (without it) is
used to estimate the mean value of y (at a given x).
The confidence interval estimate of the expected value of y will
be narrower than the prediction interval for the same given value
of x and confidence level. This is because there is less error in
estimating a mean value as opposed to predicting an individual
value.
Example 3 – Solution…
Using Excel (Data Analysis Plus)
We can use Data Analysis Plus to obtain the prediction
and confidence interval estimates.
Example 3 – Solution…
Using Excel (Data Analysis Plus™)
In the Data Analysis Plus dialogue box (shown below),
enter the input and the output is presented in the next
slide.
[Output: the point prediction and the prediction interval.]
Coefficient of correlation…
We estimate its value from sample data with the sample
coefficient of correlation:

r = sXY / (sX sY)
Example 3…
Test the coefficient of correlation to determine if a
linear relationship exists in the data of Example 3
between the price and odometer reading (use α = 0.05).
Solution
We test H0: ρ = 0
against HA: ρ ≠ 0.
Test statistic:

t = r √((n - 2)/(1 - r²)) ~ t_{n-2}

Decision rule:
Reject H0 if |t| > t_{α/2,n-2} = t0.025,98 = 1.984.

t = r √((n - 2)/(1 - r²)) = (0.8083) √((100 - 2)/(1 - 0.6533)) = 13.59
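A sketch of this test statistic (names ours; r is the Multiple R value from the Excel output):

```python
from math import sqrt

n = 100
r = 0.8083   # sample coefficient of correlation

# t = r * sqrt((n - 2) / (1 - r^2)), compared against t_{0.025, 98} = 1.984
t = r * sqrt((n - 2) / (1 - r ** 2))
print(round(t, 2))  # 13.59
```

Note this equals (up to sign) the t-statistic from the slope test, as it must in simple regression.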
Example 3 - Solution…
Using Excel (Data Analysis Plus)
We can use Data Analysis Plus based on the large
sample test.
Example 3 - Solution…
Using Excel (Data Analysis Plus™)
In the Data Analysis Plus dialogue box (shown below),
enter the input and the output is presented in the next
slide.
Comparing the p-value with α, we again reject the null
hypothesis (that there is no linear correlation) in favour of the
alternative hypothesis (that our two variables are in fact related
in a linear fashion).
Example 4 - Solution

Employee   Aptitude test   Performance rating
1          59              3
2          47              2
3          58              4
4          66              3
5          77              2
.          .               .

(Aptitude test scores range from 0 to 100; performance ratings range from 1 to 5.)

• The problem objective is to analyse the relationship between
two variables.
• Performance rating is ranked.
• The hypotheses are:
H0: ρs = 0
HA: ρs ≠ 0
Example 4 - Solution…

Employee   Aptitude test   Rank(a)   Performance rating   Rank(b)
1          59              9         3                    10.5
2          47              3         2                    3.5
3          58              8         4                    17
4          66              14        3                    10.5
5          77              20        2                    3.5
.          .               .         .                    .

Ties are broken by averaging the ranks.
Example 4 - Solution…
Solving manually
Rank each variable separately.
Calculate sa = 5.92; sb = 5.50; sab = 12.34.
Thus rs = sab/(sa·sb) = 12.34/(5.92 × 5.50) = 0.379.
The critical value for α = 0.05 and n = 20 is 0.450. Since
|rs| = 0.379 < 0.450, we cannot reject H0: there is not enough
evidence to conclude the two variables are related.
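The manual Spearman calculation, as a sketch (names ours, statistics from the slides):

```python
s_a, s_b = 5.92, 5.50   # standard deviations of the two rank series
s_ab = 12.34            # covariance of the ranks

# Spearman rank correlation: the ordinary correlation of the ranks.
r_s = s_ab / (s_a * s_b)
print(round(r_s, 3))  # 0.379
```

Because 0.379 is below the critical value 0.450, the test does not reject H0.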
Example 4 - Solution…
Using Excel (Data Analysis Plus)
We can use Data Analysis Plus based on the large
sample test.
Example 4 - Solution…
Using Excel (Data Analysis Plus™)
In the Data Analysis Plus dialogue box (shown below),
enter the input and the output is presented in the next
slide.
Residual analysis
Recall that the deviations between the actual data points and
the regression line are called residuals. Excel calculates
residuals as part of its regression analysis:

Residual Output for Example 3
Observation   Predicted Price (y)   Residuals   Standard Residuals
1             16.10684              -0.10684    -0.23729
2             15.41343              -0.21343    -0.47400
3             15.31973              -0.31973    -0.71007
4             16.71592              0.68408     1.51924
5             16.64096              0.75904     1.68572
Residual analysis…
For each residual we calculate the standard deviation
as follows:

s_ri = sε √(1 - hi), where

hi = 1/n + (xi - x̄)² / Σ(xj - x̄)²
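A sketch of this leverage calculation on hypothetical data (names and x values ours). A useful check: in simple regression with an intercept, the hi always sum to 2 (one degree of freedom each for intercept and slope):

```python
from math import sqrt

xs = [15, 20, 25, 30, 35, 40, 45, 50]   # hypothetical odometer readings ('000 km)
n = len(xs)
x_bar = sum(xs) / n
sxx = sum((x - x_bar) ** 2 for x in xs)

# hi = 1/n + (xi - x_bar)^2 / sum_j (xj - x_bar)^2  (leverage of observation i)
h = [1 / n + (x - x_bar) ** 2 / sxx for x in xs]

s_eps = 0.4526                            # standard error of estimate (from Example 3)
s_r = [s_eps * sqrt(1 - hi) for hi in h]  # std deviation of each residual

print(round(sum(h), 6))  # 2.0
```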
Example 3…
Non-normality
• Use Excel to obtain the standardised residual
histogram.
• Examine the histogram and look for a bell shape with
mean close to zero.
Heteroscedasticity
When the requirement of a constant variance is violated,
we have heteroscedasticity.
[Plot of residuals against the predicted value ŷ: the spread of the residuals changes with ŷ.]
Homoscedasticity
When the requirement of a constant variance is not
violated, we have homoscedasticity.
[Plot of residuals against the predicted value ŷ: the spread of the data points does not change much.]
Heteroscedasticity…
If the variance of the error variable (𝜎𝜀 2 ) is not constant,
then we have ‘heteroscedasticity’. Here is the plot of the
residual against the predicted value of y for Example 3.
When the data are time series, the errors often are
correlated. Error terms that are correlated over time are
said to be autocorrelated or serially correlated.
Note the runs of positive residuals, Note the oscillating behaviour of the
replaced by runs of negative residuals. residuals around zero.
Outliers
• An outlier is an observation that is unusually small or
large.
• Several possibilities need to be investigated when an
outlier is observed:
There was an error in recording the value.
The point does not belong in the sample.
The observation is valid.
• Identify outliers from the scatter diagram.
• It is customary to suspect an observation is an outlier if
the absolute value of the standardised residual is > 2.
• They need to be dealt with since they can easily
influence the least squares line…
[Scatter plot: most data points follow the line closely, but some outliers may be very influential on the least squares line.]