community project
encouraging academics to share statistics support resources
All stcp resources are released under a Creative Commons licence
Statistical Methods
12. Multiple Linear Regression and Analysis of Covariance
Based on materials provided by Coventry University and Loughborough University under a National HE STEM Programme Practice Transfer Adopters grant
Peter Samuels, Birmingham City University
Reviewer: Ellen Marshall, University of Sheffield
www.statstutor.ac.uk
Workshop outline
- Multiple Linear Regression:
  - Two independent variables
  - Multicollinearity: VIF and tolerance
  - More than two independent variables:
    - Direct variable entry method
    - Backwards regression method
  - Robustness
- Analysis of Covariance
Please note
- This workshop assumes knowledge of simple linear regression (see Workshop 11)
- Some disciplines have a different culture of applying multiple linear regression without assumption checking; please seek guidance from your faculty
- Most people use significance to decide which variables to include in the model, rather than fitting the model for the purpose of prediction
Multiple Linear Regression with two independent variables
Model: $y = b_0 + b_1 x + b_2 z$
Where:
- $y$ is the dependent variable
- $x$ and $z$ are the independent variables
- $b_0$ is the intercept coefficient
- $b_1$ and $b_2$ are the slope coefficients
Goal: To minimise the sum of the squares of the errors
http://cast.massey.ac.nz/core/index.html?book=general, Section 6.3.7
Least squares estimation of y against x and z
$y_i = b_0 + b_1 x_i + b_2 z_i + e_i$
Choose $b_0$, $b_1$ and $b_2$ such that $\sum_{i=1}^{n} e_i^2$ is minimised
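The least squares criterion can be illustrated outside SPSS with a minimal Python sketch. It assumes the standard normal-equations approach, (XᵀX)b = Xᵀy, solved by Gaussian elimination; the function names and the small data set are hypothetical and are not part of the workshop files.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations act on both sides at once
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_two_predictors(x, z, y):
    """Return (b0, b1, b2) minimising sum((y - b0 - b1*x - b2*z)**2)."""
    rows = [(1.0, xi, zi) for xi, zi in zip(x, z)]
    # Normal equations: (X^T X) b = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return tuple(solve_linear_system(xtx, xty))


# Hypothetical data generated exactly from y = 1 + 2x + 3z (no error term)
x = [0.0, 1.0, 2.0, 3.0, 4.0]
z = [1.0, 0.0, 2.0, 1.0, 3.0]
y = [1 + 2 * xi + 3 * zi for xi, zi in zip(x, z)]
b0, b1, b2 = fit_two_predictors(x, z, y)
```

Because the hypothetical data contain no error term, the fitted coefficients recover the generating values 1, 2 and 3 up to floating-point error. With real data the residuals are non-zero and the same equations give the coefficients with the smallest sum of squared errors.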
Example 1: Monthly sales figures for women’s clothing
- 120 monthly sales figures for a catalogue-based mail ordering company from January 1989 to December 1998
- Independent variables:
  - Number of phone lines open for ordering
  - Amount spent on print advertising
- Open Sheet1 of the Excel file CatalogueData.xlsx associated with this workshop
- Turn it into an SPSS data file: for the Date field, use the data type “Date” and the format “dd-mmm-yy”
The regression analysis process
- Step 1: Get to know your data
- Step 2: Formulate a model and check the assumptions
- Step 3: Fit the model to the data
- Step 4: Report, interpret, and use the model
Step 1A: Scatter plot of Sales against NoPhoneLines
- Relationship appears to be linear
- Variance in Sales appears to be constant for different values of NoPhoneLines
- ‘Cigar shaped’ data set
[Figure: scatter plot of Sales against NoPhoneLines. Chart edits noted on the slide: decimal places removed; font size increased to 10; font size increased to 12; minimum value changed to 0]
Step 1B: Scatter plot of Sales against PrintAdvertising
- Relationship appears to be linear
- Variance in Sales appears to be constant for different values of PrintAdvertising
- ‘Cigar shaped’ data set
Step 2: Formulate a model
Sales = b0 + b1×NoPhoneLines + b2×PrintAdvertising
Note: A model is always an approximation to the
data
Step 2: Assumptions of Multiple Linear Regression
- The observations of the dependent variable are independent, e.g. they are not time or sequence dependent
- The independent variables are normally distributed or binary
- The dependent variable is normally distributed for each value of each predictor (independent) variable
- The variability of the outcome variable is the same for each value of the predictor variable
- The dependent variable varies linearly as the independent variables vary
- All seem to be OK for our data set (as indicated by the ‘cigar shaped’ scatter plots)
Step 3: Fit the model to the data
- Analyze > Regression > Linear
- Add Sales as the dependent variable
- Add NoPhoneLines and PrintAdvertising as the independent variables
- Select Statistics… and choose Collinearity diagnostics
[SPSS regression output, annotated as follows]
- Adjusted R Square = 0.383 ⇒ model explains 38.3% of the variation in Sales
- b0 = -21366, b1 = 653.55, b2 = 1.374
- b0, b1 and b2 are all significantly different from 0
- Tolerance coefficients are slightly less than 1 and VIF slightly more than 1 (see later)
Step 4: Fitted model
Sales = -21366 + 653.55×NoPhoneLines
+ 1.374×PrintAdvertising
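As a minimal sketch of how the fitted model might be used, the Python function below evaluates the equation above for given predictor values. The coefficients are the ones reported in the SPSS output; the function name and the input values (40 phone lines, 25,000 spent on print advertising) are hypothetical and chosen only for illustration.

```python
# Coefficients reported in the SPSS output for Example 1
B0 = -21366.0   # intercept
B1 = 653.55     # slope for NoPhoneLines
B2 = 1.374      # slope for PrintAdvertising


def predict_sales(no_phone_lines, print_advertising):
    """Evaluate Sales = b0 + b1*NoPhoneLines + b2*PrintAdvertising."""
    return B0 + B1 * no_phone_lines + B2 * print_advertising


# Hypothetical inputs, not taken from the data file
example = predict_sales(40, 25000)  # -21366 + 26142 + 34350 = 39126
```

Since the model explains only 38.3% of the variation in Sales, a value such as this is better read as an approximate expected level than as a precise forecast.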
Multicollinearity
- Multicollinearity means there are high correlations between the predictor (independent) variables
- Thus two or more predictor variables carry similar information about the outcome variable
- With multicollinearity there is very high uncertainty for the regression coefficients
- Multicollinearity can be assessed informally by looking at the bivariate correlations between the independent variables (any r > 0.8 indicates a possible problem)
- There are two formal measures of multicollinearity:
  - Variance Inflation Factor (VIF)
  - Tolerance
Variance Inflation Factor
- Based on fitting predictor variables to other predictor variables (i.e. NoPhoneLines to PrintAdvertising for our example) and calculating $R^2$
- Values of VIF (Variance Inflation Factor):
  - VIF < 5: don’t worry
  - 5 < VIF < 10: multicollinearity may be a problem, be cautious
  - VIF > 10: multicollinearity is definitely a problem and will adversely affect results
Source: Myers, R. (1990) Classical and modern regression with applications. 2nd ed. Boston, MA: Duxbury
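The calculation and the rule of thumb can be sketched in Python as follows. The sketch assumes the standard definition VIF = 1 / (1 − R²), where R² comes from regressing one predictor on the others. The function names are hypothetical, and the handling of the boundary values 5 and 10 is an assumption, since the guideline above does not say which band they fall in.

```python
def vif_from_r_squared(r_squared):
    """VIF for a predictor, given R^2 from regressing it on the other predictors."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


def interpret_vif(vif):
    """Apply the rule of thumb attributed to Myers (1990) on the slide."""
    if vif < 5:
        return "don't worry"
    if vif <= 10:  # assumption: the boundaries 5 and 10 are treated as 'be cautious'
        return "may be a problem, be cautious"
    return "definitely a problem"


# With R^2 = 0.5, half of a predictor's variance is shared with the others
vif_half = vif_from_r_squared(0.5)  # 1 / 0.5 = 2.0
```

With only two predictors, R² is simply the square of their bivariate correlation, so the informal check (r > 0.8) corresponds to a VIF of about 2.8.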
Tolerance
- The proportion of the variance in a predictor variable that cannot be explained by the other predictors
- In our example the tolerance for each variable was 0.992, meaning 99.2% of the variance in each predictor variable cannot be explained by the other predictor variable
- Tolerance is the reciprocal of VIF, so you only need to look at VIF
- For our example, VIF was 1.009 for both variables so multicollinearity was not a problem
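The reciprocal relationship can be written out as a short worked equation. This uses the standard definitions, with $R_j^2$ denoting the $R^2$ obtained when predictor $j$ is regressed on the other predictors (the subscript notation is an assumption, not taken from the slides).

```latex
\text{Tolerance}_j = 1 - R_j^2,
\qquad
\text{VIF}_j = \frac{1}{\text{Tolerance}_j} = \frac{1}{1 - R_j^2}

% Worked check for Example 1, using the rounded tolerance from the SPSS output:
\frac{1}{0.992} \approx 1.008
```

The value 1.008 agrees with the reported VIF of 1.009 up to the rounding of the tolerance to three decimal places.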
Multiple Linear Regression with >2 independent variables
Model: $y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_m x_m$
Where:
- $y$ is the dependent variable
- $x_1, x_2, \dots, x_m$ are the independent variables
- $b_0$ is the intercept coefficient
- $b_1, b_2, \dots, b_m$ are the slope coefficients
Goal: To minimise the sum of the squares of the errors
Rule of thumb: For a sample size of $n$, use no more than $\sqrt{n}$ independent variables
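The rule of thumb is easy to check numerically. The sketch below, with a hypothetical function name, takes the integer part of √n as the largest permitted number of independent variables.

```python
import math


def max_predictors(n):
    """Largest whole number of independent variables not exceeding sqrt(n)."""
    return math.isqrt(n)


# The catalogue data in this workshop has n = 120 monthly observations
limit = max_predictors(120)  # sqrt(120) is about 10.95, so the limit is 10
```

For the 120 monthly figures used in the examples, up to 10 independent variables would therefore be acceptable under this rule, so the five-variable models considered in this workshop are well within the limit.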
Issues with >2 variables
- Much more likely to have multicollinearity, or variables with non-significant coefficients
- Impossible to know in advance which variables are best removed
  ⇒ Use a systematic variable selection method in SPSS to determine an adequate model
  ⇒ Justify any decision you make
- We recommend backwards removal of predictor variables:
  - All predictor variables initially included
  - Least significant variable removed
  - Repeat until the probability value of the least significant remaining variable is below a threshold (default is 0.1)
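The backwards removal loop can be sketched in Python as follows. This is an illustration of the procedure described above, not SPSS's implementation: `fit_p_values` is a hypothetical callback standing in for the regression fit (it returns one probability value per variable currently in the model), and the probability values in the stub are invented for demonstration.

```python
def backward_elimination(variables, fit_p_values, threshold=0.1):
    """Drop the least significant variable until all p-values are below threshold."""
    remaining = list(variables)
    while remaining:
        # Refit the model with the variables that are still included
        p_values = fit_p_values(remaining)
        worst = max(remaining, key=lambda v: p_values[v])
        if p_values[worst] < threshold:
            break  # every remaining variable is significant enough to keep
        remaining.remove(worst)
    return remaining


# Hypothetical probability values; a real fit would recompute them at each step
HYPOTHETICAL_P = {"NoPhoneLines": 0.45, "PrintAdvertising": 0.001, "NoMailed": 0.0005}


def stub_fit(current_variables):
    """Stand-in for a regression fit: looks up fixed, made-up p-values."""
    return {v: HYPOTHETICAL_P[v] for v in current_variables}


selected = backward_elimination(list(HYPOTHETICAL_P), stub_fit)
```

In this made-up run the first variable is removed because its probability value exceeds 0.1, and the loop then stops because the largest remaining value is below the threshold. In a real analysis the probability values change after each removal, which is why the model is refitted inside the loop.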
Example 2: Monthly sales figures for women’s clothing
- Independent variables:
  - Number of phone lines open for ordering
  - Amount spent on print advertising
  - Number of catalogues mailed
  - Number of pages in catalogue
  - Number of customer service representatives
- Open Sheet2 of the Excel file Catalogue2Data.xlsx associated with this workshop
- Turn it into an SPSS data file
Step 1A: Sales v. NoMailed
- Clear linear relationship
- May be heteroscedastic: variance in errors seems to depend on NoMailed
- One point (marked on the plot) looks like an outlier: reason?
Activity
- Create a new variable called NoMailedGroup by recoding NoMailed into a new variable
- Choose a suitable cut-off value
- NoMailedGroup = 1 below the cut-off value
- NoMailedGroup = 2 above the cut-off value
- Run a linear regression of Sales against NoMailed and choose Unstandardised residuals under Save…
- Run an independent samples t-test of the unstandardised residuals against NoMailedGroup
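The recoding step and the idea behind the comparison can be sketched in Python. This is a descriptive illustration only: it compares the sample variance of the residuals in the two groups and does not reproduce the Levene's test that SPSS reports with the t-test. The function names, the data values and the choice to place a value equal to the cut-off in group 2 are all assumptions.

```python
from statistics import variance


def recode_group(value, cutoff):
    """Return 1 below the cut-off and 2 otherwise (assumption: ties go to group 2)."""
    return 1 if value < cutoff else 2


def residual_variance_by_group(values, residuals, cutoff):
    """Sample variance of the residuals within each recoded group."""
    groups = {1: [], 2: []}
    for value, residual in zip(values, residuals):
        groups[recode_group(value, cutoff)].append(residual)
    # A sample variance needs at least two observations per group
    return {g: variance(rs) for g, rs in groups.items() if len(rs) >= 2}


# Hypothetical NoMailed values and regression residuals, for illustration only
no_mailed = [9000, 10000, 11000, 13000, 14000, 15000]
residuals = [-1.0, 0.0, 1.0, -2.0, 0.0, 2.0]
variances = residual_variance_by_group(no_mailed, residuals, cutoff=12500)
```

A much larger residual variance in one group than in the other would point towards heteroscedasticity; a formal test is still needed before drawing a conclusion.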
NoMailed cut-off = 12,500
- Levene’s test for equality of variances returns a non-significant result
- Only 8 data values in NoMailedGroup 2
- Probably OK to assume homoscedasticity
- There is no specific test for heteroscedasticity in SPSS
Step 1B: Sales v. NoPages
- Seems to be a weak linear relationship
- Variance seems to be OK
Step 1C: Sales v. NoServiceReps
- Stronger linear relationship
- Data seems to be in two distinct groups: month dependent?
Activity
- Create a new scale variable called Month
- Enter 1 for a date in January, 2 for a date in February, etc.
- Create a scatter plot of Sales against Month
- Sales are clearly lower in January and February
- Recode Month into a new variable called MonthGroup with 1 = Jan/Feb and 2 = otherwise
- Add values to MonthGroup to represent these groups
- Create a grouped scatter plot of Sales v. NoServiceReps with MonthGroup as the grouping variable
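The two recoding steps (date to Month, then Month to MonthGroup) can be sketched in Python as below. The function names are hypothetical, and the example dates are illustrative values within the January 1989 to December 1998 range of the catalogue data rather than rows copied from the file.

```python
from datetime import date


def month_number(d):
    """1 for a date in January, 2 for a date in February, and so on."""
    return d.month


def month_group(month):
    """1 = Jan/Feb, 2 = otherwise, matching the MonthGroup coding above."""
    return 1 if month in (1, 2) else 2


# Illustrative dates only; the day of the month is an assumption
dates = [date(1989, 1, 1), date(1989, 2, 1), date(1989, 3, 1), date(1998, 12, 1)]
groups = [month_group(month_number(d)) for d in dates]  # [1, 1, 2, 2]
```

Keeping the two steps separate mirrors the SPSS workflow, where Month is also used on its own for the scatter plot of Sales against Month.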
- Most of the lower Sales were from Jan/Feb
- We could add an additional variable to the model which is 1 for Jan/Feb and 0 for other months
- For the moment the months Jan/Feb are excluded
- Also see ANCOVA analysis later
Step 2: Formulate a model
For the months March – December:
Sales = b0 + b1×NoPhoneLines + b2×PrintAdvertising
+ b3×NoMailed + b4×NoPages + b5×NoServiceReps
Step 3: Fit the model
- Select only the cases with MonthGroup = 2
- Analyze > Regression > Linear
- Select Sales as the dependent variable and the other 5 variables as the independent variables
- Under Statistics… select Collinearity diagnostics
- Select the method as Backward
- Only one model required
- $R^2_{adj}$ = 0.795 ⇒ model accounts for 79.5% of the variation in Sales
- All variables included in initial model (backwards method)
- No variable removed because all the probability values < 0.1
Activity
- Repeat the analysis with all the months included
- What effect does this have on the models?
- Analysis now has 2 models
- $R^2_{adj}$ slightly higher in second model
- Both markedly lower than in the previous analysis
- NoPhoneLines was removed from Model 2 because its probability value in Model 1 > 0.1 (the absolute value of its standardised coefficient was low)
- Model 2 has very low probability values for all the variables
Robustness exceptions
- Homoscedasticity is mandatory; otherwise use a nonparametric technique
- Linearity is mandatory; otherwise transform the independent variable or use another model, e.g. quadratic or polynomial
- Normality is “not necessary for the least-squares fitting of the regression model, but it is required in general for inference making.” (e.g. calculating the p-values and confidence intervals of the coefficients) “…only extreme departures of the distribution of Y from normality yield spurious results.”
Source: (Kleinbaum et al., 2008: 120)
Application of robustness exceptions to activity
- Sales v. NoServiceReps was not normally distributed
  ⇒ The probability values of the coefficients may not be reliable
- However, the probability value was not borderline so the model can still be used
- Including January & February data just increases the ‘noise’
Example 3: Monthly catalogue sales figures for jewellery
- Independent variables:
  - Number of phone lines open for ordering
  - Amount spent on print advertising
  - Number of catalogues mailed
  - Number of pages in catalogue
  - Number of customer service representatives
- Open Sheet3 of the Excel file CatalogueData.xlsx associated with this workshop
- Turn it into an SPSS data file
Activity
- Use JewellerySales as the dependent variable
- Fit the model with all 5 independent or predictor variables (direct or ‘enter’ method)
- Is multicollinearity important?
- Just re-fitting the model with only the variables with significant coefficients may not be sufficient
- Try also the backward variable selection method
Five independent variable direct model
Three independent variable direct model
Results of backwards regression method
- $R^2_{adj}$ does not necessarily reduce when variables are removed from the model
- Final model removes NoServiceReps
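Why the adjusted value can rise when a variable is dropped follows from its standard definition, $R^2_{adj} = 1 - (1 - R^2)(n - 1)/(n - p - 1)$, where p is the number of predictors. The Python sketch below uses hypothetical R² values (0.500 with five predictors, 0.499 with four) and assumes n = 120; none of these figures come from the SPSS output for Example 3.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R^2 for a model with p predictors fitted to n observations."""
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - p - 1)


# Hypothetical values: dropping a weak predictor lowers R^2 only slightly
full = adjusted_r_squared(0.500, n=120, p=5)     # about 0.4781
reduced = adjusted_r_squared(0.499, n=120, p=4)  # about 0.4816
```

Removing a variable can never increase the unadjusted R², but the penalty for the number of predictors shrinks, so the adjusted value goes up whenever the removed variable contributed very little.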
Analysis of Covariance (ANCOVA)
- Combines ANOVA and Linear Regression:
  - ANOVA: Explains outcome for different groups of the data
  - Linear Regression: Explains outcome with explanatory variables
  - ANCOVA: Does both simultaneously
- Increases the precision of the analysis
- Compares the means at the average values of the predictor variables
- The predictor (independent) variables must be correlated with the outcome (dependent) variable
[Figure: outcome variable plotted against covariate variable for Group 1 and Group 2, marking the difference of means, the average covariate, and the difference of means adjusted to the average covariate]
Assumptions for ANCOVA
- Same as ANOVA and Linear Regression:
  - Independence of observations
  - Equality of variances
  - Normality of distribution
- In addition:
  - Equal regression slopes: if this assumption is in doubt, add an extra binary variable to the model to represent the two groups
- ANCOVA models can be very complex
- Seek advice if you feel ANCOVA is appropriate for your research
Example of ANCOVA
- Open the SPSS file Catalogue2 you created earlier
- We want to compare Sales in January and February against the rest of the year
- The independent variable is NoServiceReps
- Use Analyze > General Linear Model > Univariate:
  - Dependent variable: Sales
  - Fixed Factor: MonthGroup
  - Covariate: NoServiceReps
  - Under Options… choose Parameter estimates
- Model accounts for 71.4% of the total variance
- The MonthGroup = 1 coefficient should be added to the intercept coefficient for this group
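How the parameter estimates combine into one fitted line per group can be sketched in Python. The coefficient values below are hypothetical placeholders, since the SPSS output table is not reproduced in the text; only the structure (a common slope, with the MonthGroup = 1 coefficient added to the intercept for that group) follows the note above.

```python
def ancova_prediction(intercept, group1_coef, slope, month_group, covariate):
    """Predicted outcome: the MonthGroup = 1 coefficient shifts the intercept only."""
    offset = group1_coef if month_group == 1 else 0.0
    return intercept + offset + slope * covariate


# Hypothetical parameter estimates, for illustration only
INTERCEPT = 1000.0
GROUP1_COEF = -300.0   # Jan/Feb shift relative to the other months
SLOPE = 50.0           # common slope for NoServiceReps

jan_feb = ancova_prediction(INTERCEPT, GROUP1_COEF, SLOPE, 1, 20)      # 1700.0
other_months = ancova_prediction(INTERCEPT, GROUP1_COEF, SLOPE, 2, 20)  # 2000.0
```

Because both groups share the same slope, the two fitted lines are parallel and the gap between them equals the MonthGroup = 1 coefficient at every value of the covariate.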
Add a fitted line to a grouped scatter plot
- Open the chart editor
- Select the Add a reference line from Equation tool
- Enter the Custom Equation from the ANCOVA output
- Change the line colour using the Lines tab
- Repeat for the other month group by subtracting the ANCOVA coefficient from the constant
Scatter plot with fitted lines
Clearly an improvement on Simple Linear Regression, but the best fit slope for the two groups is slightly different
Activity
- Open the file AlligatorData.xlsx associated with this workshop
- The data file contains pelvic canal width, snout-vent length and the gender of 35 alligators
- Create a scatterplot for PelvicWidth against SnoutLength for the different Gender groups
- Run an ANCOVA model for PelvicWidth against SnoutLength for the different Gender groups
- Plot the ANCOVA model on the scatterplot
Scatterplot of alligator data
ANCOVA for alligator data
Plot of ANCOVA on scatterplot
Recap
- We have introduced Multiple Linear Regression by adding one predictor variable then several predictor variables to Simple Linear Regression
- New diagnostics were required in order to address the possibility of multicollinearity, namely VIF or Tolerance
- Better to use a systematic method for introducing or removing variables, such as Backwards
- Robustness arguments are similar to those in Simple Linear Regression
- Analysis of Covariance (ANCOVA) is a combination of ANOVA and Linear Regression
Bibliography
Abraham, B. and Ledolter, J. (2006) Introduction to Regression Modelling. Belmont, CA: Thomson Brooks/Cole.
Field, A. (2013) Discovering Statistics using SPSS: (And sex and drugs and rock 'n' roll). 4th ed. London: SAGE, Sections 8.5 - 8.7 and Chapter 12.
Kleinbaum, D., Kupper, L., Muller, K. and Nizam, A. (2008) Applied Regression Analysis and Other Multivariable Methods. 4th ed. Belmont, CA: Thomson Brooks/Cole.
Kutner, M., Nachtsheim, C. and Neter, J. (2004) Applied Linear Regression Models. 4th ed. Irwin: McGraw-Hill.
Myers, R. (1990) Classical and Modern Regression with Applications. 2nd ed. Boston, MA: Duxbury.
statstutor (n. d.) Multiple Regression resources. Available at: http://www.statstutor.ac.uk/topics/regression-and-model-building/multiple-regression/ [Accessed 8/01/14].
Stirling, W. D. (2013) Welcome to the General CAST e-book. Available at: http://cast.massey.ac.nz/core/index.html?book=general [Accessed 8/01/14], Section 6.3.7.