University of Information Technology – Vietnam National University, Ho Chi Minh City
Dr. Tran Van Hai Trieu
Faculty of Information Systems
Chapter 3
Analysis of Regression
Learning objectives
• Models of Regression.
• Simple Linear Regression.
• Multiple Linear Regression.
1. Models of Regression
Correlation Relationship
“The interconnected relationship among the indicators
or criteria of a phenomenon, in which the fluctuation
of one indicator (the result indicator) is affected by
the others (the cause indicators), is called correlation.”
Method of Correlation Analysis
• The correlation analysis process includes the
following specific tasks.
1. Qualitative analysis of the nature of the relationship.
2. Use the method of clustering or graphing to
determine the nature and trend of that relationship.
3. Specifically, express the correlation relationship
using linear or nonlinear regression equations and
compute the parameters of the equations.
4. Evaluate the strength (tightness) of the correlation relationship.
Linear Correlation Coefficient
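• The correlation coefficient (r) is a statistical quantity
used to measure the linear relationship between
two variables. Its value ranges from -1 to 1.
• Formula:
$$r = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\sigma_x \sigma_y} \quad \text{or} \quad r = b\,\frac{\sigma_x}{\sigma_y}$$
Where: $\bar{x}$, $\bar{y}$, and $\overline{xy}$ are the means of x, y, and xy;
$\sigma_x$ and $\sigma_y$ are the standard deviations of x and y;
b is the slope of the regression line of y on x.
➢ When r is closer to 0, the linear relationship is weaker.
In particular, if r = 0, there is no linear relationship.
➢ Conversely, when r is closer to 1 or -1, the relationship is
stronger (r > 0 indicates a positive relationship, and r < 0
a negative relationship).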
Linear Correlation Coefficient (Cont.)
• Example of computing the linear correlation coefficient
➢ Suppose that we have Table 1 below, which records years of
experience and labor productivity for ten workers.
Table 1. x = years of experience; y = labor productivity (million VND).
Worker     x      y     xy     x²     y²
A          1      3      3      1      9
B          3     12     36      9    144
C          4      9     36     16     81
D          5     16     80     25    256
E          7     12     84     49    144
F          8     21    168     64    441
G          9     21    189     81    441
H         10     24    240    100    576
I         11     19    209    121    361
K         12     27    324    144    729
Sum       70    164   1369    610   3182
Mean       7   16.4  136.9      -      -
Linear Correlation Coefficient (Cont.)
• Based on the data in Table 1, we can compute the
correlation coefficient as follows:
$$\sigma_x = \sqrt{\frac{610}{10} - \left(\frac{70}{10}\right)^2} = 3.464$$
$$\sigma_y = \sqrt{\frac{3182}{10} - \left(\frac{164}{10}\right)^2} = 7.017$$
$$r = \frac{136.9 - (7 \times 16.4)}{3.464 \times 7.017} = 0.909$$
➢ From the value of the correlation coefficient, we can
conclude that there is a strong positive relationship between
years of experience and labor productivity.
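The hand calculation can be cross-checked with a minimal Python sketch. The function name `linear_correlation` is illustrative and not part of the slides; it divides by n (population standard deviations), matching the formula used here.

```python
from math import sqrt

def linear_correlation(x, y):
    """r = (mean(xy) - mean(x) * mean(y)) / (sigma_x * sigma_y), dividing by n."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    mean_xy = sum(a * b for a, b in zip(x, y)) / n
    sigma_x = sqrt(sum(a * a for a in x) / n - mean_x ** 2)
    sigma_y = sqrt(sum(b * b for b in y) / n - mean_y ** 2)
    return (mean_xy - mean_x * mean_y) / (sigma_x * sigma_y)

# Table 1: years of experience (x) and labor productivity (y, million VND)
experience = [1, 3, 4, 5, 7, 8, 9, 10, 11, 12]
productivity = [3, 12, 9, 16, 12, 21, 21, 24, 19, 27]

r = linear_correlation(experience, productivity)
print(round(r, 3))  # 0.909
```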
What is Regression Model?
• A regression model determines a relationship
between an independent variable and a
dependent variable, by providing a function.
• Formulating a regression analysis helps you
predict the effects of the independent variable
on the dependent one.
• For example
➢ Age and height can be described using a linear
regression model: while a person is growing, height
increases roughly steadily with age, so the relationship
is approximately linear.
Source: https://www.voxco.com
Linear Models
• Regression analysis is a tool for building
statistical models that characterize relationships
among a dependent variable and one or more
independent variables, all of which are
numerical.
• Simple linear regression involves a single
independent variable.
Y = b0 + b1X
• Multiple linear regression involves two or more
independent variables.
Y = b0 + b1X1 + b2X2 +…..+ bkXk
Nonlinear Models
• Nonlinear models can be transformed into linear
models as follows:
1. Logarithm – Logarithm model
❖ Consider the power (multiplicative) regression model
$Y = b_0 X^{b_1} e^{u}$
❖ Convert the above equation to a linear model by taking
the natural logarithm of both sides:
Ln(Y) = Ln(b0) + b1.Ln(X) + u; Set Ln(b0) = α.
Ln(Y) = α + b1.Ln(X) + u
❖ This model is linear in the parameters α and b1, and
linear in the variables Ln(X) and Ln(Y).
Nonlinear Models (Cont.)
• Nonlinear models can be transformed into linear
models as follows:
1. Logarithm – Logarithm model
❖ Consider marginal effects
❖ Meaning: (% change of Y) = b1*(% change of X)
➔When X changes 1%, then Y changes b1%.
❖ Generalized logarithm - logarithm model
Ln(Yi) = Ln(b0) + b1.Ln(X1i) + b2.Ln(X2i) +…+ bn.Ln(Xni) + Ui
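The marginal effect in the logarithm - logarithm model can be written out as a short derivation. This is a sketch in the slide's own symbols, with the error term held fixed; it shows why b1 is read as an elasticity.

```latex
\ln Y = \alpha + b_1 \ln X + u
\quad\Longrightarrow\quad
b_1 = \frac{d \ln Y}{d \ln X}
    = \frac{dY / Y}{dX / X}
    \approx \frac{\%\,\Delta Y}{\%\,\Delta X}
```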
Nonlinear Models (Cont.)
• Nonlinear models can be transformed into linear
models as follows:
1. Logarithm – Logarithm model
❖ Application of the Cobb-Douglas production function
$Y = b_0 X_1^{b_1} X_2^{b_2} e^{U}$ (1)
Where: Y: Output; X1: Labor; X2: Capital.
❖ From formula (1), we obtain formula (2):
Ln(Y) = Ln(b0) + b1.Ln(X1) + b2.Ln(X2) + U (2)
❖ Meaning of b1 and b2:
✓ When X1 increases or decreases by 1% and X2 does not change,
Y changes by b1%.
✓ When X2 increases or decreases by 1% and X1 does not change,
Y changes by b2%.
Nonlinear Models (Cont.)
• Nonlinear models can be transformed into linear
models as follows:
1. Logarithm – Logarithm model
❖ An example of the linear equation is as follows:
LnY = -3.3386 + 1.4988 LnX1 + 0.4899 LnX2
Where: Y: Agricultural output (million $); X1: Labor
(million days); X2: Total capital (million $).
❖ The meaning of the regression coefficients:
✓ If total capital is kept constant and days of labor increase
by 1%, the average output increases by about 1.5%.
✓ If days of labor are kept constant and total capital
increases by 1%, the average output increases by about 0.5%.
✓ b1 + b2 > 1: increasing returns to scale (scaling up is
effective). Here b1 + b2 = 1.4988 + 0.4899 = 1.9887 > 1.
✓ b1 + b2 = 1: constant returns to scale.
✓ b1 + b2 < 1: decreasing returns to scale (scaling up is
ineffective).
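The percentage interpretation of the coefficients can be checked numerically. The sketch below takes the two slope coefficients from the fitted equation on this slide; the helper name `pct_change_in_output` is an assumption made for illustration. It computes the exact change in Y implied by the model when one input rises by 1% and the other is held constant.

```python
# Coefficients from the fitted equation on the slide:
# LnY = -3.3386 + 1.4988 LnX1 + 0.4899 LnX2
B1_LABOR = 1.4988
B2_CAPITAL = 0.4899

def pct_change_in_output(elasticity, pct_change_in_input):
    """Exact % change in Y when one input changes and the other stays constant."""
    return ((1 + pct_change_in_input / 100) ** elasticity - 1) * 100

labor_effect = pct_change_in_output(B1_LABOR, 1.0)
capital_effect = pct_change_in_output(B2_CAPITAL, 1.0)
print(round(labor_effect, 2), round(capital_effect, 2))
```

The exact values are very close to b1% and b2%, which is why the coefficients are read directly as percentage effects for small changes.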
Nonlinear Models (Cont.)
• Nonlinear models can be transformed into linear
models as follows:
2. Semi-logarithm model
2.1. Logarithm – Linear model
✓ The compound growth formula (for example, for gross profit): $Y_t = Y_0 (1+r)^t$
✓ Convert the above equation to a linear model by taking
the natural logarithm of both sides:
❖ Ln(Yt) = Ln(Y0) + t*Ln(1+r) (1)
✓ Set b0 = Ln(Y0), b1 = Ln(1+r)
❖ Ln(Yt) = b0 + b1t (2)
✓ Add a random error (ut) to formula (2) to obtain
formula (3):
❖ Ln(Yt) = b0 + b1t + ut (3)
Nonlinear Models (Cont.)
• Nonlinear models can be transformed into linear
models as follows:
2. Semi-logarithm model
2.1. Logarithm – Linear model
✓ Consider marginal effects: b1 = d(LnY) / dt = (dY / Y) / dt
✓ Meaning: relative change of Y = b1*(change of t)
➔ When t increases by one year, Y changes by approximately b1*100%.
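To see how b1 relates to the growth rate r, the following sketch fits Ln(Yt) on t for a hypothetical, noise-free series that grows at exactly 5% per period. The data and the helper `fit_line` are assumptions for illustration, not part of the slides.

```python
from math import exp, log

def fit_line(t, z):
    """Least-squares intercept and slope of z on t."""
    n = len(t)
    mean_t = sum(t) / n
    mean_z = sum(z) / n
    sxx = sum((a - mean_t) ** 2 for a in t)
    sxy = sum((a - mean_t) * (b - mean_z) for a, b in zip(t, z))
    slope = sxy / sxx
    return mean_z - slope * mean_t, slope

# Hypothetical series: Y0 = 100 growing at exactly 5% per period (no noise).
years = list(range(10))
profit = [100 * 1.05 ** t for t in years]

b0, b1 = fit_line(years, [log(y) for y in profit])
growth_rate = exp(b1) - 1   # because b1 = Ln(1 + r)
initial_value = exp(b0)     # because b0 = Ln(Y0)
print(round(b1, 4), round(growth_rate, 4), round(initial_value, 2))
```

Here b1*100% is slightly below the exact growth rate of 5%, because b1 = Ln(1 + r) is only approximately equal to r when r is small.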
Nonlinear Models (Cont.)
• Nonlinear models can be transformed into linear
models as follows:
2. Semi-logarithm model
2.1. Logarithm – Linear model
✓ Example of wage regression and years of
education
➔ Please, consider the meaning of the regression
coefficient.
Nonlinear Models (Cont.)
• Nonlinear models can be transformed into linear
models as follows:
2. Semi-logarithm model
2.2. Linear - Logarithm model
Y = b0 + b1.Ln(X) + U (1)
✓ Consider marginal effects: b1 = dY / d(LnX) = dY / (dX / X)
✓ Meaning: change of Y = b1*(relative change of X)
➔ When X changes by 1%, Y changes by b1 / 100 units.
Nonlinear Models (Cont.)
• Table of summarized models
Model                    Dependent   Independent   Marginal effect     Explanation of meaning
                         variable    variable      (b1)
Normal Linear            Y           X             dY / dX             X changes 1 unit, then Y changes b1 units.
Linear - Logarithm       Y           Ln(X)         dY / d(LnX)         X changes 1%, then Y changes (b1 / 100) units.
Logarithm - Linear       Ln(Y)       X             d(LnY) / dX         X changes 1 unit, then Y changes (b1 * 100)%.
Logarithm - Logarithm    Ln(Y)       Ln(X)         d(LnY) / d(LnX)     X changes 1%, then Y changes b1%.
Regression Equation Testing for
Simple Linear Regression
• R-Square Coefficient (R2) measures linear model fit.
• Adjusted (R2) reflects the fit level of the overall model.
• Given the observed points used to fit the regression line:
Ai(xi, yi), i = 1,…,n
• Suppose that we find the regression equation as follows:
$\tilde{y} = a + bx$
• Set
$y_i = a + b x_i + e_i = \tilde{y}_i + e_i$
• ei represents the portion of the variation in Y that cannot be
explained by the linear relationship between X and Y.
Regression Equation Testing for
Simple Linear Regression (Cont.)
• SSR: Sum of Square for Regression
• SSE: Sum of Square for Error (Residual)
• SST: Sum of Square for Total
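$$SSR = \sum_{i=1}^{n} (\tilde{y}_i - \bar{y})^2, \quad SSE = \sum_{i=1}^{n} (\tilde{y}_i - y_i)^2, \quad SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
$$\text{Adjusted } R^2 = 1 - \frac{SSE / (n - (k+1))}{SST / (n - 1)}$$
Where: n is the number of observations and k is the number of independent variables.
• SST = SSR + SSE
• Meaning: the total variation of Y = the variation of Y
explained by the Xi + the part of the variation of Y due to
other factors.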
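As an illustration of these quantities, the sketch below fits a least-squares line to the Table 1 data (years of experience and labor productivity) and computes SSR, SSE, SST, R² and adjusted R². The function name `sums_of_squares` is an assumption for this example.

```python
def sums_of_squares(x, y, k=1):
    """Return SSR, SSE, SST, R^2 and adjusted R^2 for a least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                # slope
    a = mean_y - b * mean_x      # intercept
    fitted = [a + b * xi for xi in x]
    ssr = sum((fi - mean_y) ** 2 for fi in fitted)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    r2 = ssr / sst
    adj_r2 = 1 - (sse / (n - (k + 1))) / (sst / (n - 1))
    return ssr, sse, sst, r2, adj_r2

# Table 1 data
experience = [1, 3, 4, 5, 7, 8, 9, 10, 11, 12]
productivity = [3, 12, 9, 16, 12, 21, 21, 24, 19, 27]

ssr, sse, sst, r2, adj_r2 = sums_of_squares(experience, productivity)
print(round(ssr, 2), round(sse, 2), round(sst, 2))
print(round(r2, 4), round(adj_r2, 4))
```

For a simple linear regression, R² equals the square of the correlation coefficient r, so the result can be compared with r = 0.909 from the earlier example.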
Regression Equation Testing for
Simple Linear Regression (Cont.)
• F-Test
• We state hypotheses
➢ H0 : Regression equation is not appropriate.
Ha : Regression equation is appropriate.
• With the number of independent variables k = 1:
$$MSR = \frac{SSR}{k} = \frac{SSR}{1}$$
$$MSE = \frac{SSE}{n - (k+1)} = \frac{SSE}{n - 2}$$
$$F = \frac{MSR}{MSE} \sim \text{Fisher}(1,\, n - 2)$$
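A minimal sketch of the F statistic for the Table 1 data follows. The function name is illustrative, and the critical value quoted in the comment is taken from a standard F table rather than from the slides.

```python
def regression_f_statistic(x, y):
    """F = MSR / MSE for simple linear regression (k = 1)."""
    n, k = len(x), 1
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ssr = sum((a + b * xi - mean_y) ** 2 for xi in x)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    msr = ssr / k
    mse = sse / (n - (k + 1))
    return msr / mse, k, n - (k + 1)

# Table 1 data
experience = [1, 3, 4, 5, 7, 8, 9, 10, 11, 12]
productivity = [3, 12, 9, 16, 12, 21, 21, 24, 19, 27]

f_value, df1, df2 = regression_f_statistic(experience, productivity)
print(round(f_value, 2), df1, df2)
# Compare with the table value F(0.05; 1, 8) = 5.32: if F is larger, reject H0.
```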
Regression Equation Testing for
Multiple Linear Regression
• Suppose that we have a multiple linear regression as
follows:
Y = b0 + b1X1 + b2X2 +…..+ bkXk
• We continue to apply the formulas (SSR, SSE, SST)
with the same meaning as in the case of simple linear regression.
• We state hypotheses
➢ H0 : Regression equation is not appropriate.
Ha : Regression equation is appropriate.
$$MSR = \frac{SSR}{k}, \quad MSE = \frac{SSE}{n - (k+1)}$$
$$F = \frac{MSR}{MSE} \sim \text{Fisher}(k,\, n - (k+1))$$
Regression Coefficient Testing
for Simple Linear Regression
• State hypotheses:
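H0: b = 0
Ha: b ≠ 0
$$S_b = \sqrt{\frac{S_e^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} = \sqrt{\frac{MSE}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}}$$
$$S_e^2 = \frac{\sum_{i=1}^{n} e_i^2}{n - 2} = \frac{SSE}{n - 2} = MSE$$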
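The standard error of the slope can be turned into a test statistic t = b / Sb, which is compared with a Student t distribution with n − 2 degrees of freedom. The sketch below applies this to the Table 1 data; the function name `slope_t_test` is an assumption for illustration.

```python
from math import sqrt

def slope_t_test(x, y):
    """Return slope b, its standard error S_b, and t = b / S_b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)          # S_e^2
    s_b = sqrt(mse / sxx)
    return b, s_b, b / s_b

# Table 1 data
experience = [1, 3, 4, 5, 7, 8, 9, 10, 11, 12]
productivity = [3, 12, 9, 16, 12, 21, 21, 24, 19, 27]

b, s_b, t_value = slope_t_test(experience, productivity)
print(round(b, 4), round(s_b, 4), round(t_value, 3))
```

In simple linear regression the square of this t value equals the F statistic of the regression, so the two tests lead to the same conclusion.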
Regression Coefficient Testing
for Multiple Linear Regression
• State hypotheses:
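H0: bj = 0
Ha: bj ≠ 0
Gj: the set of independent variables except for Xj.
$$S_{b_j} = \sqrt{\frac{S_e^2}{(1 - R^2_{X_j G_j}) \cdot S^2_{X_j} \cdot (n - 1)}}$$
Where:
Se²: MSE.
S²Xj: sample variance of variable Xj.
R²XjGj: squared correlation between Xj and the variables in Gj
(the R² obtained by regressing Xj on Gj). For two variables X and Y:
$$R_{XY} = \frac{\overline{XY} - \bar{X}\,\bar{Y}}{S_X S_Y}$$
Confidence interval for bj: $b_j \pm t_{\alpha/2} \cdot S_{b_j}$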
In conclusion
• Correlation Relationship
• Method of Correlation Analysis
• Linear Correlation Coefficient
• Linear Models
• Nonlinear Models
• Regression Equation Testing for
Simple and Multiple Linear Regression
• Regression Coefficient Testing
for Simple and Multiple Linear Regression
2. Simple Linear Regression
Simple Linear Regression
• Finds a linear relationship between:
➢ one independent variable X and
➢ one dependent variable Y.
• First, prepare a scatter plot to verify the data has
a linear trend.
• Use alternative approaches if the data is not
linear.
Simple Linear Regression (Cont.)
• Example of Home Market Value Data
➢ Size of a house is
typically related to its
market value.
✓ X = square footage
✓ Y = market value ($)
➢ The scatter plot of the
full data set (42 homes)
indicates a linear trend.
Simple Linear Regression (Cont.)
• Finding the Best-Fitting Regression Line
➢ Two possible lines are shown below.
➢ Line A is clearly a better fit to the data.
• We want to determine the best regression line.
Ŷ = b0 + b1X
where:
b0 is the intercept
b1 is the slope
Simple Linear Regression (Cont.)
• Using Excel to Find the Best Regression Line
• Market value = 32673 + 35.036(square feet)
➢ The regression
model explains
variation in market
value due to size
of the home.
➢ It provides better
estimates of
market value than
simply using the
average.
Simple Linear Regression (Cont.)
• Least-Squares Regression
• Regression analysis finds the equation of the best-fitting
line, that is, the line that minimizes the sum of the squares
of the observed errors (residuals).
• Using calculus we can solve for the slope and
intercept of the least-squares regression line.
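The calculus step can be sketched as follows, using the slide's symbols b0 and b1. Setting both partial derivatives of the sum of squared errors to zero gives the standard least-squares formulas.

```latex
\begin{align*}
S(b_0, b_1) &= \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \\
\frac{\partial S}{\partial b_0} &= -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0 \\
\frac{\partial S}{\partial b_1} &= -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0 \\
b_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad b_0 = \bar{y} - b_1 \bar{x}
\end{align*}
```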
Simple Linear Regression (Cont.)
• Least-Squares Regression Equations
• Slope
➢ b1 = SLOPE(known y’s, known x’s)
• Intercept
➢ b0 = INTERCEPT(known y’s, known x’s)
• Predict Y for specified X values: Ŷ = b0 + b1X
➢ Ŷ = TREND(known y’s, known x’s, new x’s)
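For readers working outside Excel, the three functions can be mirrored in a few lines of Python. This is a sketch only: the function names imitate the Excel ones but are defined here, and the four data points are hypothetical values placed exactly on the line y = 30000 + 35x, not the home market value data.

```python
def slope(known_ys, known_xs):
    """Least-squares slope b1, analogous to Excel's SLOPE."""
    n = len(known_xs)
    mean_x = sum(known_xs) / n
    mean_y = sum(known_ys) / n
    sxx = sum((x - mean_x) ** 2 for x in known_xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(known_xs, known_ys))
    return sxy / sxx

def intercept(known_ys, known_xs):
    """Least-squares intercept b0, analogous to Excel's INTERCEPT."""
    n = len(known_xs)
    return sum(known_ys) / n - slope(known_ys, known_xs) * sum(known_xs) / n

def trend(known_ys, known_xs, new_x):
    """Predicted Y for a new X, analogous to Excel's TREND with one new x."""
    return intercept(known_ys, known_xs) + slope(known_ys, known_xs) * new_x

# Hypothetical data lying exactly on y = 30000 + 35x
square_feet = [1500, 1800, 2000, 2400]
market_value = [82500, 93000, 100000, 114000]

b1 = slope(market_value, square_feet)
b0 = intercept(market_value, square_feet)
estimate = trend(market_value, square_feet, 2200)
print(b1, b0, estimate)
```

The argument order (y values first, then x values) follows the Excel functions, which is easy to get backwards.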
Simple Linear Regression (Cont.)
• Using Excel Functions to Find Least-Squares
Coefficients
➢ Slope = b1 = 35.036
= SLOPE(C4:C45, B4:B45)
➢ Intercept = b0 = 32,673
= INTERCEPT(C4:C45, B4:B45)
➢ Estimate Y when X = 1800 square feet
Ŷ = 32,673 + 35.036(1800) = $95,737.80
= TREND(C4:C45, B4:B45, 1800)
Simple Linear Regression (Cont.)
• Excel Regression tool
➢ Data
➢ Data Analysis
➢ Regression
❖ Input Y Range
❖ Input X Range
❖ Labels
• Excel outputs a table
with many useful
regression statistics.
Simple Linear Regression (Cont.)
• Regression Statistics in Excel’s Output
➢ Multiple R
❖ | r | where r is the sample correlation
coefficient.
❖ r varies from -1 to +1 (r is negative if slope
is negative).
➢ R Square
❖ Coefficient of determination, R2 varies from
0 (no fit) to 1 (perfect fit).
Simple Linear Regression (Cont.)
• Regression Statistics in Excel’s Output
➢ Adjusted R Square
❖ Adjusts R2 for sample size and number of X
variables.
➢ Standard Error
❖ Variability between the observed and predicted Y
values.
Simple Linear Regression (Cont.)
• Example of Interpreting Regression Statistics for
Simple Linear Regression (Home Market Value)
53% of the variation in home market values
can be explained by home size.
The standard error of $7287 is less than
standard deviation (not shown) of $10,553.
Simple Linear Regression (Cont.)
• Regression Analysis of Variance
➢ ANOVA conducts an F-test to determine whether
variation in Y is due to varying levels of X.
➢ ANOVA is used to test for significance of regression:
❖ H0: population slope coefficient = 0
❖ H1: population slope coefficient ≠ 0
➢ Excel reports the p-value (Significance F).
➢ Rejecting H0 indicates that X explains variation in Y.
Simple Linear Regression (Cont.)
• Example of Interpreting Significance of Regression
H0: Home size is not a significant variable
H1: Home size is a significant variable
➢ p-value = 3.798 × 10⁻⁸
❖ Reject H0.
❖ The slope is not equal to zero.
• Using a linear relationship, home size is a significant
variable in explaining variation in market value.
Simple Linear Regression (Cont.)
• Testing Hypotheses for Regression Coefficients
➢ An alternate method for testing whether a coefficient is zero
is to use a t-test: t = (coefficient − 0) / (standard error of the coefficient)
➢ Excel provides the p-values for tests on the slope
and intercept.
Simple Linear Regression (Cont.)
• Example of Interpreting Hypothesis Tests for
Regression Coefficients (Home Market Value)
➢ p-value for test on the intercept = 0.000649
➢ p-value for test on the slope = 3.798 × 10⁻⁸
➢ Both tests reject their null hypotheses.
➢ Both the intercept and slope coefficients are
significantly different from zero.
Simple Linear Regression (Cont.)
• Example of Interpreting Hypothesis Tests for
Regression Coefficients (Home Market Value)
➢ 95% confidence interval estimates
➢ Intercept is between $14,823 and $50,523
➢ Slope is between $24.59 and $45.48 per sq. ft.
➢ Lower extreme: Ŷ = 14,823 + 24.59X
➢ Upper extreme: Ŷ = 50,523 + 45.48X
In conclusion
• Simple Linear Regression.
3. Multiple Linear Regression
Multiple Linear Regression
• Multiple Regression has more than one
independent variable.
• The multiple linear regression equation is:
Y = b0 + b1X1 + b2X2 + … + bkXk
• The ANOVA test for significance of the entire
model is:
H0: b1 = b2 = … = bk = 0
H1: at least one bj is not 0
• One can also test for significance of individual
regression coefficients.
Multiple Linear Regression (Cont.)
• Example of Interpreting Regression Results for
the Colleges and Universities Data
➢ Colleges try to predict student graduation rates
using a variety of characteristics, such as:
1. Median SAT
2. Expenditures/student
3. Acceptance rate
4. Top 10% of HS class
Multiple Linear Regression (Cont.)
• Example of Interpreting Regression Results
for the Colleges and Universities Data
All of the slope coefficient p-values are < 0.05.
The residual plots (only one shown here) show random patterns about 0.
Normal probability plots (not shown) also validate assumptions.
Building Good Regression Models
• Not all of the independent variables in a linear
regression model are always significant.
• We will learn how to build good regression
models that include the “best” set of variables.
• Banking Data includes demographic information
on customers in the bank’s current market.
Building Good Regression Models (Cont.)
• Predicting Average Bank Balance using Regression
Home Value and Education
are not significant.
Building Good Regression Models (Cont.)
• Systematic Approach to Building Good Multiple
Regression Models
1. Construct a model with all available independent
variables and check for significance of each.
2. Identify the largest p-value that is greater than α.
3. Remove that variable and evaluate adjusted R2.
4. Continue until all variables are significant.
➔ Find the model with the highest adjusted R2.
(Do not use unadjusted R2 since it never decreases
when variables are added).
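The reason for preferring adjusted R2 can be illustrated with a short NumPy sketch. The data are synthetic (an assumption for illustration): y depends on x1 only, and x2 is pure noise. The helper name `r2_and_adjusted_r2` is also hypothetical.

```python
import numpy as np

def r2_and_adjusted_r2(X, y):
    """Fit y on the columns of X plus an intercept; return (R^2, adjusted R^2)."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    sse = float(residuals @ residuals)
    sst = float(((y - y.mean()) ** 2).sum())
    r2 = 1 - sse / sst
    adj_r2 = 1 - (sse / (n - (k + 1))) / (sst / (n - 1))
    return r2, adj_r2

# Synthetic data: only x1 drives y; x2 is an irrelevant variable.
rng = np.random.default_rng(0)
n = 40
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 + rng.normal(scale=0.5, size=n)

r2_small, adj_small = r2_and_adjusted_r2(x1.reshape(-1, 1), y)
r2_full, adj_full = r2_and_adjusted_r2(np.column_stack([x1, x2]), y)
print("R2:", r2_small, "->", r2_full)
print("Adjusted R2:", adj_small, "->", adj_full)
```

Adding the irrelevant x2 cannot lower R2, whereas adjusted R2 applies a penalty for the extra variable and may go down.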
Building Good Regression Models (Cont.)
• Example of Identifying the Best Regression Model
➢ Bank regression after removing Home Value
Adjusted R2 improves slightly.
All X variables are significant.
Building Good Regression Models (Cont.)
• Multicollinearity
➢ It occurs when there are strong correlations among
the independent variables.
➢ Makes it difficult to isolate the effects of independent
variables.
➢ Signs of slope coefficients may be the opposite of their
true values, and p-values can be inflated.
• Correlations larger than 0.7 in absolute value are an
indication that multicollinearity might exist.
• Variance Inflation Factors are a better indicator.
• Parsimony is an age-old principle that applies here.
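A variance inflation factor can be computed as VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing Xj on the other independent variables. The sketch below uses NumPy and a small hand-made data set (an assumption for illustration) in which x2 is almost a copy of x1 and x3 is unrelated to both.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        sse = float(residuals @ residuals)
        sst = float(((target - target.mean()) ** 2).sum())
        vifs.append(sst / sse)  # equals 1 / (1 - R_j^2)
    return vifs

# Hand-made data: x2 is nearly identical to x1; x3 is uncorrelated with both.
x1 = np.arange(1.0, 9.0)
x2 = x1 + 0.1 * np.array([1, -1, 1, -1, 1, -1, 1, -1])
x3 = np.array([1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0])

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print([round(v, 2) for v in vifs])
```

The two nearly collinear variables get very large factors while the unrelated one stays at 1. A common rule of thumb (not stated on the slide) treats values above roughly 5 to 10 as a warning sign.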
Building Good Regression Models (Cont.)
• Example of Identifying Potential Multicollinearity
➢ Colleges and Universities (full model)
Full model
Adjusted R2 = 0.4921
Building Good Regression Models (Cont.)
• Example of Identifying Potential Multicollinearity
➢ Correlation Matrix (Colleges and Universities data)
➢ All of the correlations are within ±0.7
➢ Signs of the coefficients are questionable for
Expenditures and Top 10%.
Building Good Regression Models (Cont.)
• Example of Identifying Potential Multicollinearity
➢ Colleges and Universities (reduced model)
Dropping Top 10%
Adjusted R2 drops to 0.4559
Building Good Regression Models (Cont.)
• Example of Identifying Potential Multicollinearity
➢ Colleges and Universities (reduced model)
Dropping Expenditures
Adjusted R2 drops to 0.4556
Building Good Regression Models (Cont.)
• Example of Identifying Potential Multicollinearity
➢ Colleges and Universities (reduced model)
Dropping Expenditures and Top 10%
Adjusted R2 drops to 0.3613
Which of the 4 models would you choose?
Building Good Regression Models (Cont.)
• Example of Identifying the Best Regression Model
➢ Banking Data (full model)
Full Model
Adjusted R2 = 0.9441
Education and Home Value
are not significant.
Building Good Regression Models (Cont.)
• Example of Identifying Potential Multicollinearity
➢ Correlation matrix for the Banking data
➢ Some of the correlations exceed 0.7 for Home
Value and Wealth.
➢ Signs of the coefficients for predicting bank
balance are as expected (positive).
Building Good Regression Models (Cont.)
• Example of Identifying the Best Regression Model
➢ Banking Data (reduced model)
Dropping Wealth and Home Value
Adjusted R2 drops to 0.9201
Education is not significant.
Building Good Regression Models (Cont.)
• Example of Identifying the Best Regression Model
➢ Re-ordered Correlation matrix for Banking data
➢ By re-ordering the variables, we can see the
correlations for Age, Education, and Wealth are all
within ± 0.7.
➢ Let’s try a reduced model with the Age, Education,
and Wealth variables.
Building Good Regression Models (Cont.)
• Example of Identifying the Best Regression Model
➢ Banking Data (reduced model) ** best model
Dropping Income and Home Value.
Adjusted R2 = 0.9345.
All variables are significant.
Multicollinearity is not a problem.
Regression with Categorical Variables
• Dealing with Categorical Variables
➢ Categorical variables must be coded numerically using dummy variables.
➢ For variables with 2 categories, code as 0 and 1.
➢ For variables with k ≥ 3 categories, create k−1
binary (0,1) variables.
• Interaction Terms
➢ A dependence between two variables is called
interaction.
➢ Test for interaction by adding a new term to the
model, such as X3 = X1X2.
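The coding rules above can be sketched in Python. The helper names `dummy_code` and `interaction` are hypothetical; the first category in the list is treated as the baseline and receives all zeros.

```python
def dummy_code(value, categories):
    """Return k-1 binary indicators; the first category is the baseline (all zeros)."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories[1:]]

def interaction(x1, x2):
    """Interaction term X3 = X1 * X2."""
    return x1 * x2

tool_types = ["A", "B", "C", "D"]      # k = 4 categories -> 3 dummy variables
print(dummy_code("A", tool_types))     # [0, 0, 0]
print(dummy_code("C", tool_types))     # [0, 1, 0]
print(interaction(30, 1))              # 30
```

Using k−1 rather than k indicators avoids making the dummy columns sum to the intercept column, which would leave the model without a unique solution.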
Regression with Categorical Variables (Cont.)
• Example of A Model with Categorical Variables
➢ Employee Salaries provides data for 35
employees.
➢ Predict Salary using Age and MBA (yes=1, no=0).
Regression with Categorical Variables (Cont.)
• Example of A Model with Categorical Variables
➢ Salary = 893.59 + 1044(Age) for those without MBA
➢ Salary =15,660.82 + 1044(Age) for those with MBA
Adjusted R2 = 0.949858
• Example of Incorporating Interaction Terms in a
Regression Model
➢ Define an interaction between Age and MBA and
include in the regression model.
➢ Interaction = (Age)(MBA)
• Example of Incorporating Interaction Terms in a
Regression Model
MBA is now insignificant so we
will drop it from the model.
Adjusted R2 = 0.976701
• Example of Incorporating Interaction Terms in a
Regression Model
➢ Salary = 3,323 + 984(Age) for those without MBA
➢ Salary = 3,323 + 1410(Age) for those with MBA
Adjusted R2 = 0.976727
(a slight improvement)
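The two salary equations are one model evaluated at MBA = 0 and MBA = 1. In the sketch below, the interaction coefficient 426 is not printed on the slide; it is derived here as the difference between the two rounded slopes (1410 − 984), so treat it as an approximation.

```python
INTERCEPT = 3323
AGE_SLOPE = 984
INTERACTION_SLOPE = 1410 - 984  # derived from the two rounded slopes on the slide

def predict_salary(age, has_mba):
    """Salary = b0 + b1*Age + b2*(Age * MBA), with MBA coded 1 (yes) or 0 (no)."""
    mba = 1 if has_mba else 0
    return INTERCEPT + AGE_SLOPE * age + INTERACTION_SLOPE * (age * mba)

print(predict_salary(30, False))  # 32843
print(predict_salary(30, True))   # 45623
```

Because the MBA dummy enters only through the interaction term, both groups share the same intercept and differ only in the slope on Age.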
• Example of A Regression Model with Multiple
Levels of Categorical Variables
➢ Surface Finish data provides measurements for 35
parts produced on a lathe.
• Example of A Regression Model with Multiple
Levels of Categorical Variables
➢ Tool Type (A,B,C,D) is now
coded as 3 dummy variables.
• Example of A Regression Model with Multiple
Levels of Categorical Variables
Tool A: Surf. Finish = 24.5 + 0.098 RPM
Tool B: Surf. Finish = 11.2 + 0.098 RPM
Tool C: Surf. Finish = 4.0 + 0.098 RPM
Tool D: Surf. Finish = -1.6 + 0.098 RPM
Regression Models
with Nonlinear Terms
• Curvilinear Regression
➢ Curvilinear models may be appropriate when
scatter charts or residual plots show nonlinear
relationships.
➢ A second order polynomial might be used:
Y = β0 + β1X + β2X² + ε
➢ Here β1 represents the linear effect of X on Y
and β2 represents the curvilinear effect.
➢ This model is linear in the β parameters so we
can use linear regression methods.
Regression Models
with Nonlinear Terms (Cont.)
• Example of Modeling Beverage Sales Using
Curvilinear Regression
➢ Sales of cold beverages increase when it is
hotter outside.
Regression Models
with Nonlinear Terms (Cont.)
• Example of Modeling Beverage Sales Using
Curvilinear Regression
U-shape residual plot
Regression Models
with Nonlinear Terms (Cont.)
• Example of Modeling Beverage Sales Using
Curvilinear Regression
Residual pattern is more random.
Sales = 142,850 − 3643(temperature) + 23.3(temperature)²
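Because the model is linear in its coefficients, it can be fitted with ordinary least squares once temperature² is added as an extra column. The NumPy sketch below demonstrates this on hypothetical temperatures, with sales generated exactly from the equation on this slide (no noise), so the fit should simply recover the three coefficients.

```python
import numpy as np

# Hypothetical temperatures; sales are generated from the slide's fitted equation.
temperature = np.arange(60.0, 101.0, 5.0)
sales = 142850 - 3643 * temperature + 23.3 * temperature ** 2

# Design matrix with columns 1, X, X^2: the model is linear in the coefficients.
design = np.column_stack([np.ones_like(temperature), temperature, temperature ** 2])
coef, *_ = np.linalg.lstsq(design, sales, rcond=None)
fitted = design @ coef

print(np.round(coef, 2))
```

With real, noisy data the recovered coefficients would of course differ from the generating values; the point here is only that no special nonlinear solver is needed.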
Regression Models
with Nonlinear Terms (Cont.)
• Example of Modeling Beverage Sales Using
Curvilinear Regression
Second Order Polynomial Trendline
In conclusion
• Multiple Linear Regression.
• Building Good Regression Models.
• Regression with Categorical Variables.
• Regression Models with Nonlinear Terms.
THANK YOU
FOR YOUR ATTENTION
Q&A