Multiple Regression Analysis
A Report Presentation in EDUC 303 Data Management and Statistical Analysis
Dr. Giena L. Odicta
Course Facilitator
Mary Vincentia P. Olilang-Beldia
Presenter
Introduction
The world as we know and experience it is a complex place. When we want to predict the value of a
variable, we can often get better predictions by using more than one other variable to make that
prediction. That leads us to multiple regression, which builds directly on a familiarity with simple
linear regression.
Please consider the following example.
John is a small business owner who runs ABC Delivery Service, Inc. (ABC DS), which offers same-day
delivery for letters, packages, and other small cargo. John uses Google Maps to group individual
deliveries into one trip to reduce time and fuel costs. Therefore, some trips will have more than one
delivery.
John would like to be able to estimate how long a delivery will take based on two factors: (1) the total
distance of the trip in miles and (2) the number of deliveries that must be made during the trip.
To conduct the analysis, take a random sample of 10 past trips and record three pieces of information
for each trip: (1) total miles traveled, (2) number of deliveries, and (3) total travel time in hours.
milesTraveled (x1)   numDeliveries (x2)   travelTime in hrs (y)
89                   4                    7
66                   1                    5.4
78                   3                    6.6
111                  6                    7.4
44                   1                    4.8
77                   3                    6.4
80                   3                    7
66                   2                    5.6
109                  5                    7.3
76                   3                    6.4
In this case, remember you would like to be able to predict the total travel time using both the miles traveled
and number of deliveries on each trip.
In what way does travel time DEPEND on the first two measures?
Travel time is the dependent variable and miles traveled and number of deliveries are independent variables.
A. Multiple Regression
• Multiple regression is an extension of simple linear regression. It is used to predict the value of a
variable based on the values of two or more other variables.
• It investigates the relationship between two or more independent variables and a single dependent
variable.
Dependent variable - the variable to predict (the outcome, target, or criterion variable)
Independent variables - the predictor, explanatory, or regressor variables
• It allows you to determine the overall fit of the model and the relative contribution of each of the
predictors to the total variance explained.
B. Relationships between the Variables
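Simple Linear Regression:
IV -> DV (one independent variable predicts one dependent variable)
C. Multiple Linear Regression:
IV, IV, IV or more ... -> DV (two or more independent variables predict one dependent variable)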
D. Assumptions
1. The dependent variable should be measured on a continuous scale (interval or ratio variable).
2. There should be two or more independent variables, each measured on a continuous or categorical
scale.
3. Observations should be independent, which can be easily checked using the Durbin-Watson statistic.
4. There must be a linear relationship between (a) the dependent variable and each of the independent
variables and (b) the dependent variable and the independent variables collectively.
5. The data should show homoscedasticity, which is where the variances along the line of best fit remain
similar as you move along the line.
6. There should be no significant outliers, high leverage points, or highly influential points.
7. The residuals (errors) should be approximately normally distributed.
8. Adding more independent variables to a multiple regression procedure does not mean the regression
will be better or offer better predictions; in fact, it can make things worse. This is called overfitting.
9. The addition of more independent variables creates more relationships among them. So not only are
the independent variables potentially related to the dependent variable, they are also potentially related
to each other. When this happens, it is called multicollinearity.
10. The ideal is for all the independent variables to be correlated with the dependent variable but not with
each other.
Because of multicollinearity and overfitting, there is a fair amount of prep work to do before
conducting multiple regression analysis if one is to do it properly:
o Correlation
o Scatter Plots
o Simple Regression
E. In the previously mentioned example about ABC Delivery Service, Inc.,
milesTraveled (x1) and numDeliveries (x2) -> travelTime (y)
This is a many-to-one multiple regression: two independent variables predict one dependent variable.
F. Multiple Regression Model
Multiple Regression Model: y = β0 + β1x1 + β2x2 + … + βpxp + ε
(linear in the parameters β0, β1, …, βp; ε is the error term)
Multiple Regression Equation: E(y) = β0 + β1x1 + β2x2 + … + βpxp
Estimated Multiple Regression Equation: ŷ = b0 + b1x1 + b2x2 + … + bpxp
where b0, b1, b2, …, bp are the estimates of β0, β1, β2, …, βp
ŷ = predicted value of the dependent variable
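As an illustration of how the estimates b0, b1, and b2 can be obtained, here is a minimal sketch that applies ordinary least squares to the two-predictor delivery data from the introduction. It uses the closed-form solution for exactly two predictors; the function names (cross_dev, fit_two_predictors) are hypothetical and chosen only for this example, and statistical software would normally do this step.

```python
# Data from the 10 sampled trips in the introduction.
miles = [89, 66, 78, 111, 44, 77, 80, 66, 109, 76]       # x1
deliveries = [4, 1, 3, 6, 1, 3, 3, 2, 5, 3]              # x2
hours = [7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4]   # y


def mean(values):
    return sum(values) / len(values)


def cross_dev(a, b):
    """Sum of products of deviations from the means of a and b."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))


def fit_two_predictors(x1, x2, y):
    """Return (b0, b1, b2) that minimize the sum of squared residuals."""
    s11, s22, s12 = cross_dev(x1, x1), cross_dev(x2, x2), cross_dev(x1, x2)
    s1y, s2y = cross_dev(x1, y), cross_dev(x2, y)
    det = s11 * s22 - s12 ** 2  # close to zero when x1 and x2 are collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return b0, b1, b2


b0, b1, b2 = fit_two_predictors(miles, deliveries, hours)
print(f"y-hat = {b0:.3f} + {b1:.4f}*x1 + {b2:.4f}*x2")
```

The denominator det shrinks as the two predictors become more strongly correlated, which is one way to see why multicollinearity makes the individual coefficients unstable.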
G. Estimated Multiple Regression Equation
Example: ŷ = 6.211 + 0.014x1 + 0.383x2 - 0.607x3
Estimated Multiple Regression Equation: ŷ = b0 + b1x1 + b2x2 + b3x3
where b0, b1, b2, b3 are the estimates of β0, β1, β2, β3
ŷ = predicted value of the dependent variable
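To show how such an equation is used, the sketch below plugs values into the example equation. The coefficients are the ones given above; the function name and the input values (an 89-mile trip with 4 deliveries and gas at $3.84) are assumptions made for illustration.

```python
def predict_travel_time(miles, deliveries, gas_price):
    """Evaluate y-hat = 6.211 + 0.014*x1 + 0.383*x2 - 0.607*x3."""
    return 6.211 + 0.014 * miles + 0.383 * deliveries - 0.607 * gas_price


# Hypothetical trip: 89 miles, 4 deliveries, gas at $3.84 per gallon.
estimate = predict_travel_time(89, 4, 3.84)
print(f"Estimated travel time: {estimate:.2f} hours")
```

Each coefficient is read as the change in predicted travel time for a one-unit change in that variable while the other variables are held constant.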
H. Example:
John is a small business owner who runs ABC Delivery Service, Inc., which offers same-day delivery for
letters, packages, and other small cargo. He uses Google Maps to group individual deliveries into one trip
to reduce time and fuel costs. Therefore, some trips will have more than one delivery. John would like to be
able to estimate how long a delivery will take based on three factors: (1) the total distance of the trip in miles,
(2) the number of deliveries that must be made during the trip, and (3) the daily price of gas/petrol in US
dollars.
Research question:
Are the three factors: (1) the total distance of the trip in miles, (2) the number of deliveries that must be
made during the trip, and (3) the daily price of gas/petrol in US dollars predictive of how long a delivery
will take?
I. Steps to consider before running the regression:
1. Generate a list of potential variables; independent(s) and dependent.
2. Collect data on the variables.
3. Check the relationships between each independent variable and the dependent variable using scatter
plots and correlations
4. Check the relationships among the independent variables using scatter plots and correlations.
5. (Optional) Conduct simple linear regression for each IV/DV pair.
6. Use the non-redundant independent variables in the analysis to find the best fitting model.
7. Use the best fitting model to make predictions about the dependent variable.
To conduct the analysis, take a random sample of 10 past trips and record four pieces of information
for each trip: (1) total miles traveled, (2) number of deliveries, (3) daily price of gas, and (4) total
travel time in hours.
milesTraveled (x1)   numDeliveries (x2)   gasPrice (x3)   travelTime in hrs (y)
89                   4                    3.84            7
66                   1                    3.19            5.4
78                   3                    3.78            6.6
111                  6                    3.89            7.4
44                   1                    3.57            4.8
77                   3                    3.57            6.4
80                   3                    3.03            7
66                   2                    3.51            5.6
109                  5                    3.54            7.3
76                   3                    3.25            6.4
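J. Sketching out relationships:
milesTraveled (x1), numDeliveries (x2), and gasPrice (x3) -> travelTime (y)
This is a many-to-one multiple regression.
6 relationships to analyze: 3 between each independent variable and the dependent variable, and 3
among the independent variables themselves.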
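The count of six comes from pairing every variable with every other variable. A short sketch, using the variable names from the table above, lists the pairs and separates the relevancy checks (IV vs. DV) from the multicollinearity checks (IV vs. IV):

```python
from itertools import combinations

variables = ["milesTraveled", "numDeliveries", "gasPrice", "travelTime"]
dependent = "travelTime"

# Every unordered pair of the four variables.
pairs = list(combinations(variables, 2))

# Pairs that involve the dependent variable are relevancy checks;
# the rest are multicollinearity checks.
iv_dv_pairs = [pair for pair in pairs if dependent in pair]
iv_iv_pairs = [pair for pair in pairs if dependent not in pair]

print(len(pairs), "relationships in total")
for pair in iv_dv_pairs:
    print("IV vs. DV:", pair)
for pair in iv_iv_pairs:
    print("IV vs. IV:", pair)
```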
K. IV to DV Scatterplots for Relevancy Check
[Figure: scatterplot of travelTime(y) in hours vs. milesTraveled(x1)]
[Figure: scatterplot of travelTime(y) in hours vs. numDeliveries(x2)]
[Figure: scatterplot of travelTime(y) in hours vs. gasPrice(x3)]
L. Scatterplot Summary
Dependent variable vs. independent variables
• travelTime(y) appears highly correlated with milesTraveled(x1)
• travelTime(y) appears highly correlated with numDeliveries(x2)
• travelTime(y) DOES NOT appear highly correlated with gasPrice(x3)
Since gasPrice(x3) DOES NOT APPEAR CORRELATED with the dependent variable, we would NOT
use that variable in the multiple regression.
M. IV to IV Scatterplots for Multicollinearity Check
[Figure: scatterplot of numDeliveries(x2) vs. milesTraveled(x1) = strong correlation]
[Figure: scatterplot of gasPrice(x3) vs. milesTraveled(x1) = no correlation]
[Figure: scatterplot of gasPrice(x3) vs. numDeliveries(x2) = no correlation]
N. IV Scatterplot Summary
Independent variable vs. independent variable
• numDeliveries(x2) APPEARS highly correlated with milesTraveled(x1): this is
multicollinearity
• milesTraveled(x1) does not appear highly correlated with gasPrice(x3)
• gasPrice(x3) does not appear correlated with numDeliveries(x2)
Since numDeliveries(x2) is HIGHLY CORRELATED with milesTraveled(x1), we would NOT use BOTH
in the multiple regression; they are redundant.
O. Correlations
                     milesTraveled(x1)   numDeliveries(x2)   gasPrice(x3)   Note
numDeliveries(x2)    r = 0.956                                              strong correlation
                     p value = 0.000
gasPrice(x3)         r = 0.356           r = 0.498                          no correlation
                     p value = 0.313     p value = 0.143
travelTime(y)        r = 0.928           r = 0.916           r = 0.267      no correlation with gasPrice
                     p value = 0.000     p value = 0.000     p value = 0.455
A p value below 0.05 is treated as statistically significant.
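The r values in this table can be reproduced from the sample data. The following is a minimal sketch using only the Python standard library; the helper name pearson_r is an assumption for this example, and the p values are not computed here because they need a t distribution that the standard library does not provide.

```python
from math import sqrt

# Sample of 10 trips from the four-column data table.
data = {
    "milesTraveled": [89, 66, 78, 111, 44, 77, 80, 66, 109, 76],
    "numDeliveries": [4, 1, 3, 6, 1, 3, 3, 2, 5, 3],
    "gasPrice": [3.84, 3.19, 3.78, 3.89, 3.57, 3.57, 3.03, 3.51, 3.54, 3.25],
    "travelTime": [7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4],
}


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Correlation for each of the six variable pairs.
names = list(data)
correlations = {}
for i, first in enumerate(names):
    for second in names[i + 1:]:
        correlations[(first, second)] = pearson_r(data[first], data[second])

for (first, second), r in correlations.items():
    print(f"{first} vs {second}: r = {r:.3f}")
```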
P. Correlation Summary
Correlation analysis confirms the conclusions reached by visual examination of the scatterplots.
Redundant multicollinear variables
milesTraveled and numDeliveries are highly correlated with each other and are therefore
redundant; only one should be used in the multiple regression analysis.
Non-contributing variables
gasPrice is NOT correlated with the dependent variable and should be excluded.
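With gasPrice excluded and only one of the two redundant predictors kept, the model reduces to a simple linear regression. The sketch below assumes milesTraveled is the predictor that is kept (the report does not say which one to choose), and the 84-mile trip is a hypothetical input used only to show a prediction.

```python
miles = [89, 66, 78, 111, 44, 77, 80, 66, 109, 76]       # x1
hours = [7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4]   # y


def fit_simple(x, y):
    """Least-squares intercept and slope for y-hat = b0 + b1*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_simple(miles, hours)


def predict(trip_miles):
    """Predicted travel time in hours for a trip of the given length."""
    return intercept + slope * trip_miles


print(f"y-hat = {intercept:.3f} + {slope:.4f}*x1")
print(f"Hypothetical 84-mile trip: about {predict(84):.2f} hours")
```

The same function could be run with numDeliveries in place of milesTraveled to compare the two candidate models, which corresponds to the optional step of running a simple regression for each IV/DV pair.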
In conclusion:
In multiple regression, a lot of preparation work must be done.
Techniques used: scatterplots, correlation analysis, individual/group regressions