Department of Finance & Banking, University of Malaya
Multiple Regression
Analysis: Further Issues
Dr. Aidil Rizal Shahrin
[email protected]
October 10, 2020
Contents
1 Effects of Data Scaling on OLS Statistics
1.1 Standardized Coefficients
2 More on Functional Form
2.1 Logarithmic Functional Forms
2.1.1 Log-Linear Model
2.2 Model with Quadratics
2.3 Model with Interaction Terms
2.4 Computing Average Partial Effects
3 Goodness-of-Fit and Selection of Regressors
3.1 Adjusted R²
3.2 Choosing Nonnested Models
Effects of Data Scaling on OLS Statistics
i. Below is an equation relating infant birth weight in ounces
(bwght) to the number of cigarettes smoked per day (cigs) and
annual family income in thousands of dollars (faminc):

\widehat{bwght} = β̂0 + β̂1 cigs + β̂2 faminc    (1)
Figure 1: Effects of Data Scaling
ii. The first column of Fig.1 reports the estimates of Eq.1.
Remember that bwght is measured in ounces, cigs in cigarettes
per day, and faminc in thousands of dollars.
iii. Now change the unit of measurement of birth weight to
pounds (1 oz = 1/16 lb). Eq.1 with bwght converted to pounds
becomes:

\widehat{bwght}/16 = β̂0/16 + (β̂1/16) cigs + (β̂2/16) faminc    (2)

as reported in column 2 of Fig.1.
iv. The difference between Eq.1 and Eq.2 in terms of
interpretation (one example only):

∆\widehat{bwght} = −.4634 ∆cigs  vs.  ∆\widehat{bwghtlbs} = −.0289 ∆cigs

A one-cigarette increase in cigs reduces \widehat{bwght} by about .464
ounces in the first equation, while in the second it reduces
\widehat{bwghtlbs} by about .0289 pounds (.0289 lb × 16 ≈ .464 oz, so we
get the same answer).
v. How about statistical significance? It is not affected. The
standard error of β̂1 in Eq.2 is also divided by 16, so the t
statistic for cigs is the same in Eq.1 and Eq.2: t = −5.058.
vi. The same goes for the CI of cigs in Eq.2: its lower and upper
bounds are 16 times smaller than those in Eq.1 (just divide by 16).
vii. Notice that R² is the same for Eq.1 and Eq.2.
viii. Why do SSR and SER differ? Focus on SSR first. Let ûi be the
residual for observation i in Eq.1. The residual for the same
observation in Eq.2 is then ûi/16, so the squared residual in Eq.2
is (ûi/16)² = ûi²/256. That is why the SSR in Eq.2 equals the SSR
in Eq.1 divided by 256.
ix. Since SER = σ̂ = √[SSR/(n − k − 1)] = √(SSR/1,385), the SER in
Eq.2 is the SER in Eq.1 divided by 16.
x. Now we change the unit of the independent variable cigs to
packs, where packs = cigs/20 (1 pack = 20 cigarettes). Eq.1 under
this transformation is:

\widehat{bwght} = β̂0 + (20β̂1)(cigs/20) + β̂2 faminc
             = β̂0 + (20β̂1) packs + β̂2 faminc    (3)

xi. The intercept and the slope coefficient on faminc are
unchanged. The result is in column 3 of Fig.1.
xii. Why do we drop cigs in column 3? To avoid perfect
multicollinearity.
xiii. se(20β̂1) is 20 × se(β̂1) from Eq.1. Thus the t statistics for
cigs (Eq.1) and packs (Eq.3) are both −5.059, so there is no effect
on statistical significance.
xiv. Do you notice that the SSR and SER for Eq.1 and Eq.3 are the
same? Why? Remember ûi = yi − ŷi: rescaling an independent variable
leaves the fitted values, and hence the residuals, unchanged.
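To see these scaling facts numerically, here is a minimal sketch using
simulated data with numpy and statsmodels. The numbers and variable names
are made up for illustration; this is not the BWGHT data behind Fig.1.

# Minimal sketch: rescaling y divides coefficients and standard errors by the
# same factor, rescaling an x rescales only its own coefficient, and the
# t statistics and R² never change. All data here are simulated.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
cigs = rng.poisson(3, n).astype(float)
faminc = rng.gamma(5, 6, n)                      # "thousands of dollars"
bwght = 117 - 0.46 * cigs + 0.09 * faminc + rng.normal(0, 20, n)

X = sm.add_constant(np.column_stack([cigs, faminc]))
m1 = sm.OLS(bwght, X).fit()                      # bwght in ounces
m2 = sm.OLS(bwght / 16, X).fit()                 # bwght in pounds
X3 = sm.add_constant(np.column_stack([cigs / 20, faminc]))
m3 = sm.OLS(bwght, X3).fit()                     # cigs in packs

print(m1.params[1], m2.params[1] * 16, m3.params[1] / 20)  # same β̂1 in ounces/cig
print(m1.tvalues[1], m2.tvalues[1], m3.tvalues[1])          # identical t statistics
print(m1.rsquared, m2.rsquared, m3.rsquared)                 # identical R²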
Standardized Coefficients
i. When the variables are measured on arbitrary scales, the best way
to interpret the results is often through standardized coefficients.
ii. This means that every variable is standardized in the sample by
subtracting off its mean and dividing by its standard deviation.
iii. Let k = 3. Then for each observation i we have

yi = β̂0 + β̂1 xi1 + β̂2 xi2 + β̂3 xi3 + ûi    (4)

Taking the average of Eq.4 over the sample, we have

ȳ = β̂0 + β̂1 x̄1 + β̂2 x̄2 + β̂3 x̄3    (5)
iv. Eq.5 holds because the difference between ȳ and the terms on the
RHS is the average residual, (Σⁿᵢ₌₁ ûi)/n = 0. Now subtract Eq.5 from
Eq.4 to get:

yi − ȳ = β̂1 (xi1 − x̄1) + β̂2 (xi2 − x̄2) + β̂3 (xi3 − x̄3) + ûi    (6)

v. Let σ̂y be the sample s.d. of y, σ̂1 the sample s.d. of x1, and so
on. Then, with a little algebra, we have

(yi − ȳ)/σ̂y = (σ̂1/σ̂y) β̂1 [(xi1 − x̄1)/σ̂1] + (σ̂2/σ̂y) β̂2 [(xi2 − x̄2)/σ̂2]
              + (σ̂3/σ̂y) β̂3 [(xi3 − x̄3)/σ̂3] + ûi/σ̂y    (7)

Each variable in Eq.7 has been standardized by replacing it with its
z-score.
vi. Rewriting Eq.7 and dropping the i subscript, we have:

zy = b̂1 z1 + b̂2 z2 + b̂3 z3 + error    (8)

where zy denotes the z-score of y, z1 denotes the z-score of x1, and
so on.
vii. The new coefficients in Eq.8,

b̂j = (σ̂j/σ̂y) β̂j  for j = 1, 2, 3    (9)

are called standardized coefficients or beta coefficients.
viii. Interpretation of Eq.8: if x1 increases by one standard
deviation, then ŷ changes by b̂1 standard deviations.
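As a rough illustration, beta coefficients can be obtained either by
rescaling the usual OLS slopes as in Eq.9 or by regressing z-scores on
z-scores. A minimal sketch with simulated, hypothetical data (not any
dataset from these slides):

# Two equivalent routes to the standardized (beta) coefficients.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=(n, 3))
y = 1.0 + x @ np.array([0.5, -2.0, 0.1]) + rng.normal(size=n)

# (a) Usual OLS, then rescale the slopes by sd(xj)/sd(y) as in Eq.9.
ols = sm.OLS(y, sm.add_constant(x)).fit()
beta_hat = ols.params[1:] * x.std(axis=0, ddof=1) / y.std(ddof=1)

# (b) Standardize every variable first, then run OLS without an intercept.
zx = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
beta_hat_z = sm.OLS(zy, zx).fit().params

print(beta_hat)     # the two routes agree
print(beta_hat_z)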
Log-Linear Model
i. The log-linear equation is (all logs are natural logs):

log(y) = β0 + β1 x    (10)

ii. Both its slope and its elasticity change at each point and have
the same sign as β1.
iii. Taking the antilogarithm of Eq.10, we have

exp[log(y)] = y = exp(β0 + β1 x)    (11)

which is an exponential function. The function requires y > 0. Fig.2
plots Eq.11.
Figure 2: A log linear function: y = exp(β0 + β1 x).
iv. The slope at any point is (taking the derivative of Eq.11):

∆y/∆x = exp(β0 + β1 x) × β1 = yβ1    (12)

For β1 > 0, the marginal effect increases for larger values of y
(the function increases at an increasing rate).
v. Remember the elasticity formula? (Elasticity measures the
percentage change in y given a 1% increase in x.)

ε = [(∆y/y) × 100] / [(∆x/x) × 100] = (∆y/∆x) × (x/y)    (13)
Thus, for Eq.11, the elasticity is:

ε = yβ1 × (x/y) = β1 x    (14)

vi. Besides the elasticity, we can also determine the semi-elasticity
(it measures the percentage change in y given a 1-unit increase in
x). The formula is

εsemi = [(∆y/y) × 100] / ∆x = (∆y/∆x) × (1/y) × 100    (15)
From Eq.12, the semi-elasticity of Eq.11 is:

εsemi = yβ1 × (1/y) × 100 = 100β1    (16)

Thus:

%∆y = (100β1)∆x    (17)

vii. Notice that all the calculations above refer to Eq.11, where y
is the dependent variable. But our original model is Eq.10, where the
dependent variable is log(y).
viii. However, we can rely on the log approximation (multiplying by
100 to express both sides as percentages):

%∆y ≈ 100 × ∆ log(y)    (18)

This approximation only works well when ∆ log(y) is small; by Eq.17,
that depends on ∆x and β1. When it is not small, compute the exact
percentage change with:

%∆y = 100 × [exp(β1 ∆x) − 1]    (19)

So when x changes by 1, we have

%∆y = 100 × [exp(β1) − 1]    (20)
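For a quick sense of when the approximation matters (hypothetical values
of β1, not from any estimated model): if β1 = .05 and ∆x = 1, Eq.18 gives
%∆y ≈ 5%, while Eq.20 gives 100 × [exp(.05) − 1] ≈ 5.13%, essentially the
same; but if β1 = .5, the approximation says 50%, while the exact change
is 100 × [exp(.5) − 1] ≈ 64.9%.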
ix. Some guidance on logarithms: when to use them, and more:
a. When y > 0, models using log(y) as the dependent variable often
satisfy the CLM assumptions more closely than models using the level
of y. Strictly positive variables often have conditional
distributions that are heteroskedastic or skewed; taking the log
mitigates, if not eliminates, both problems.
b. Taking the log of a variable often narrows its range. This is
particularly true of variables that can take large monetary values,
such as a firm's annual sales or a baseball player's salary.
Population variables also tend to vary widely. Narrowing the range of
the dependent and independent variables can make OLS estimates less
sensitive to outliers.
c. But a logarithmic transformation may create extreme values when a
variable y is between zero and one (such as a proportion) and takes
on values close to zero. In this case, log(y) (which is necessarily
negative) can be very large in magnitude, whereas the original
variable y is bounded between zero and one.
d. A rule of thumb: when a variable is a positive dollar amount, the
log is often taken; examples are wages, salaries, firm sales, and
firm market value. Variables that take large integer values, such as
population, total number of employees, and school enrollment, are
also often logged.
e. Another rule of thumb: variables measured in years, such as
education, experience, tenure, and age, usually appear in their
original form.
f. Yet another: variables that are proportions or percentages, such
as the unemployment rate, the participation rate in a pension plan,
or the percentage of students passing, can appear in either level or
logarithmic form, but the tendency is more toward levels because the
coefficients then have a percentage-point-change interpretation.
g. A limitation: the log cannot be used when the variable takes zero
or negative values. If y is nonnegative but can equal zero,
log(1 + y) is sometimes used. Percentage-change interpretations still
approximately apply, except for changes beginning at y = 0. However,
log(1 + y) cannot be normally distributed.
h. Never compare the R² of a model using y as the dependent variable
with that of a model using log(y), even when everything else is the
same except the transformation of y.
Model with Quadratics
i. Quadratic terms are used in economics to capture decreasing or
increasing marginal effects. In the simplest case, y depends on a
single observed factor x (notice: a single factor):

y = β0 + β1 x + β2 x² + u    (21)

ii. β1 does not measure the partial effect of x on y, since we cannot
hold x² fixed while changing x; they move together.
iii. If we estimate Eq.21, we have:

ŷ = β̂0 + β̂1 x + β̂2 x²    (22)

with the approximation (taking the derivative of Eq.22 with respect
to x)

∆ŷ/∆x ≈ β̂1 + 2β̂2 x    (23)

In many applications β̂1 is positive and β̂2 is negative.
iv. Using Eq.23 needs some explanation:
a. If x = 0, β̂1 approximates the slope going from x = 0 to x = 1.
b. If x = 1, β̂1 + 2β̂2 approximates the slope going from x = 1 to
x = 2.
c. If x = 10, β̂1 + 20β̂2 approximates the slope going from x = 10 to
x = 11.
Example....
Using the wage data in WAGE1, we obtain:

\widehat{wage} = 3.73 + .298 exper − .0061 exper²
               (.35)  (.041)        (.0009)
n = 526, R² = .093

This implies that exper has a diminishing effect on wage. The first
year of experience is worth roughly 30 cents per hour. The second
year of experience is worth about 28.6 cents. Going from 10 to 11
years is worth about 17.6 cents per hour. Notice that it is
diminishing.
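A quick check of these numbers, using only the coefficients reported
above (not the WAGE1 data itself):

# Marginal effects and turning point implied by the fitted quadratic above.
b1, b2 = 0.298, -0.0061          # coefficients on exper and exper² from the slide

def slope(exper):
    """Approximate effect of one more year of experience, Eq.23."""
    return b1 + 2 * b2 * exper

print(slope(0))                   # ~0.298: first year worth about 30 cents/hour
print(slope(1))                   # ~0.286: second year worth about 28.6 cents
print(slope(10))                  # ~0.176: from 10 to 11 years, about 17.6 cents
print(-b1 / (2 * b2))             # ~24.4: turning point -b1/(2*b2), discussed below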
v. When β̂1 > 0 and β̂2 < 0, the quadratic has a parabolic
(inverted-U) shape, as in our example.
vi. The turning point is:

x* = −β̂1 / (2β̂2)    (24)

vii. In our example, the turning point is x* = 24.4 years. It tells
us that the return to experience becomes zero at about 24.4 years.
Refer to Fig.3.
Figure 3: Quadratic relationship between \widehat{wage} and exper.
viii. Summary of quadratic model:
a. β̂1 > 0 and β̂2 < 0, inverted U-shape with maximum point.
b. β̂1 < 0 and β̂2 > 0, U-shape, with minimum point.
c. β̂1 > 0 and β̂2 > 0, no turning point for values x > 0. Smallest
expected value of y is at x = 0 (x is nonnegative).
d. β̂1 < 0 and β̂2 < 0, no turning point for values x > 0. Largest
expected value of y is at x = 0 (x is nonnegative).
Model with Interaction Terms
i. The purpose of this model is to allow the effect of one
explanatory variable on the dependent variable to depend on the
magnitude of yet another explanatory variable.
ii. For example:

price = β0 + β1 sqrft + β2 bdrms + β3 sqrft · bdrms + u    (25)

The partial effect of bdrms on price (holding all other variables
fixed) is

∆price/∆bdrms = β2 + β3 sqrft    (26)

iii. If β3 > 0 in Eq.26, an additional bedroom (∆bdrms = 1) yields a
larger increase in housing price (∆price) for larger houses (bigger
sqrft). There is an interaction effect between square footage
(sqrft) and the number of bedrooms (bdrms).
iv. What value of sqrft should we plug into Eq.26? Perhaps its mean
value, or the lower and upper quartiles in the sample (any
interesting value of sqrft). Of course, whether β3 is significant is
easily tested with a t test on Eq.25. If it is significant, we have
an interaction effect.
v. Most of the time it is better to re-parameterize the model in
Eq.25. Why? Remember that β2 is the effect of bdrms on price for a
home with zero square feet! (The house does not exist, but it has
bedrooms?)
vi. We can re-parameterize Eq.25 as:

price = α0 + δ1 sqrft + δ2 bdrms
        + β3 (sqrft − µsqrft)(bdrms − µbdrms) + u    (27)

where now δ2 is the partial effect of bdrms on price at the mean of
sqrft, that is, δ2 = β2 + β3 µsqrft (see the Interaction Term Proof
for a detailed discussion; a brief sketch follows below).
vii. Now the coefficients have a useful interpretation. In fact, you
can center at values other than the means of the variables.
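As a quick sketch of why δ2 = β2 + β3 µsqrft (this just expands the
centered interaction in Eq.27 and matches coefficients with Eq.25):

(sqrft − µsqrft)(bdrms − µbdrms)
    = sqrft · bdrms − µbdrms sqrft − µsqrft bdrms + µsqrft µbdrms

Substituting this into Eq.27 and collecting terms, the coefficient on
bdrms is δ2 − β3 µsqrft, which must equal β2 in Eq.25; hence
δ2 = β2 + β3 µsqrft (and, by the same argument, δ1 = β1 + β3 µbdrms).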
Computing Average Partial Effects
i. Often, we want a single value to describe the relationship between
the dependent variable y and each explanatory variable.
ii. One popular measure is the average partial effect (APE), also
called the average marginal effect.
iii. Say we have a model that explains the standardized outcome on a
final exam (stndfnl) in terms of the percentage of classes attended
(atndrte), prior college grade point average (priGPA), and ACT score:

stndfnl = β0 + β1 atndrte + β2 priGPA + β3 ACT
          + β4 priGPA² + β5 ACT² + β6 priGPA · atndrte + u

The partial effect of priGPA is:

∆stndfnl/∆priGPA = β2 + 2β4 priGPA + β6 atndrte    (28)

and the partial effect of atndrte is:

∆stndfnl/∆atndrte = β1 + β6 priGPA    (29)
The APE for Eq.28 is:

APEpriGPA = β̂2 + 2β̂4 \overline{priGPA} + β̂6 \overline{atndrte}    (30)

and for Eq.29:

APEatndrte = β̂1 + β̂6 \overline{priGPA}    (31)

where \overline{priGPA} and \overline{atndrte} are the sample
averages of priGPA and atndrte, respectively.
iv. Why do this? For example, in Eq.31, we do not need to report the
partial effect for each student in the sample; instead we average
these partial effects into a single number.
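As an illustration, here is a minimal sketch of computing these APEs
after estimation. It assumes hypothetical numpy arrays stndfnl, atndrte,
priGPA, and ACT of equal length; this is not the actual dataset behind
the equations above.

# Estimate the model above and compute the APEs of Eq.30 and Eq.31.
import numpy as np
import statsmodels.api as sm

X = np.column_stack([atndrte, priGPA, ACT, priGPA**2, ACT**2, priGPA * atndrte])
res = sm.OLS(stndfnl, sm.add_constant(X)).fit()
b0, b1, b2, b3, b4, b5, b6 = res.params   # matches the β ordering in the model

# Because the partial effects are linear in the variables, averaging them
# over the sample is the same as plugging in the sample means (Eq.30, Eq.31).
ape_priGPA = b2 + 2 * b4 * priGPA.mean() + b6 * atndrte.mean()
ape_atndrte = b1 + b6 * priGPA.mean()
print(ape_priGPA, ape_atndrte)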
Goodness-of-Fit and Selection of Regressors
i. Beginning students of econometrics tend to put too much weight on
R².
ii. Choosing a set of explanatory variables based on the size of R²
can lead to nonsensical models.
iii. Furthermore, in time series regressions we might obtain an
artificially high R², and the result can be misleading.
iv. Nothing in the CLM assumptions requires a minimum value of R².
v. What a low R² does tell us is that the error variance is large
relative to the variance of y, which means we may have a hard time
estimating the βj precisely. But a larger sample size can offset
this.
Adjusted R²
i. We can rewrite R² as:

R² = 1 − (SSR/n)/(SST/n)    (32)

The only difference is that we introduce n (it cancels, giving back
the standard R²).
ii. The population R-squared is defined as (remember the relationship
between R² and correlation?)

ρ² = 1 − σu²/σy²    (33)
iii. The estimate of σu² from Eq.33 used in Eq.32 is SSR/n, which is
biased; SSR/(n − k − 1) is the unbiased estimator. Likewise, SST/n is
a biased estimator of σy²; the unbiased estimator is SST/(n − 1).
Using the unbiased estimators, we obtain the adjusted R-squared
(sometimes called the corrected R-squared):

R̄² = 1 − [SSR/(n − k − 1)] / [SST/(n − 1)]
   = 1 − [(n − 1)/(n − k − 1)] × (SSR/SST)    (34)

iv. It is tempting to say that R̄² corrects the bias in R² for
estimating the population R-squared, ρ², but it does not: the ratio
of two unbiased estimators, which is what we have in Eq.34, is not
itself unbiased.
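Eq.34 also gives a convenient way to get R̄² directly from R², since
SSR/SST = 1 − R². A small helper, not tied to any particular dataset:

# Adjusted R-squared from Eq.34, using R̄² = 1 − (1 − R²)(n − 1)/(n − k − 1).
def adjusted_r2(r2: float, n: int, k: int) -> float:
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Example with the wage quadratic reported earlier (R² = .093, n = 526,
# k = 2 regressors): gives roughly .0895.
print(adjusted_r2(0.093, 526, 2))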
v. However, the primary advantage of R̄² is that it imposes a penalty
for adding additional independent variables to the model.
vi. By contrast, R² never falls when you add more x's, and most of
the time it increases, because SSR never rises (remember?).
vii. The penalty for adding more x's is the k in Eq.34. When an x is
added to the model, SSR falls, which tends to increase R̄² in Eq.34;
however, the factor (n − 1)/(n − k − 1) increases too. So whether R̄²
increases or decreases depends on these offsetting effects.
viii. Another interesting fact: since [(n − 1)/(n − k − 1)] > 1, we
have R̄² < R², always.
ix. R̄² can even be negative if the regressors, taken together,
reduce SSR too little, so that the reduction fails to offset the
factor (n − 1)/(n − k − 1) in Eq.34.
x. The adjusted R² is also related to the t and F statistics. If the
|t|-ratio of an explanatory variable is less than 1, dropping that
variable will increase the adjusted R² (see Adjusted R Squared for
the proof).
Choosing Nonnested Models
i. The adjusted R², in some cases, allows us to choose a model
without redundant independent variables.
ii. Using the major league baseball salary example:

log(salary) = β0 + β1 years + β2 gamesyr + β3 bavg + β4 hrunsyr + u    (35)

log(salary) = β0 + β1 years + β2 gamesyr + β3 bavg + β4 rbisyr + u    (36)

These two equations are nonnested models because neither equation is
a special case of the other.
iii. By contrast, the F statistic allows us to test nested models:
one model (the restricted model) is a special case of the other model
(the unrestricted model).
iv. In your textbook, R̄² = .6211 for Eq.35 and R̄² = .6226 for Eq.36.
Based on this, there is a slight preference for the model in Eq.36
over the model in Eq.35. However, the difference is practically
small.
v. Using R̄² to compare nonnested sets of independent variables with
different functional forms is also valuable.
vi. For example, consider two models relating R&D intensity to firm
sales:

rdintens = β0 + β1 log(sales) + u    (37)

rdintens = β0 + β1 sales + β2 sales² + u    (38)

Both models capture a diminishing return. Based on R², Eq.37 gives
.061 while Eq.38 gives .148. However, this comparison is unfair since
Eq.37 has fewer parameters.
vii. However, based on R̄², Eq.37 gives .030 while Eq.38 gives .090.
Thus the quadratic model, Eq.38, is preferable for capturing the
diminishing return in this example (see the sketch below).
viii. Caution: we cannot use R̄² or R² to choose between nonnested
models that differ in the functional form of the dependent variable,
for example y versus log(y).
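A minimal sketch of this kind of R̄² comparison for the two rdintens
models. It assumes a hypothetical pandas DataFrame df with columns
rdintens and sales; it is not the actual dataset behind the numbers
reported above.

# Fit Eq.37 and Eq.38 on hypothetical data and compare adjusted R-squared.
import numpy as np
import statsmodels.api as sm

y = df["rdintens"]
X_log = sm.add_constant(np.log(df["sales"]))                                 # Eq.37
X_quad = sm.add_constant(np.column_stack([df["sales"], df["sales"] ** 2]))   # Eq.38

fit_log = sm.OLS(y, X_log).fit()
fit_quad = sm.OLS(y, X_quad).fit()

# The quadratic model uses one more parameter, so compare adjusted R².
print(fit_log.rsquared_adj, fit_quad.rsquared_adj)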