ECONOMETRICS
Data Issues in Regression Analysis:
Specification, Outliers and Multicollinearity
Professor: Antonio Di Paolo
Degree in International Business
Academic Year 2018/2019
Universitat de Barcelona
Data Issues in Regression Analysis (1)
The validity of OLS as a reliable statistical tool to describe reality depends crucially on
whether the underlying assumptions are satisfied.
In this topic we will consider some of the issues that have to be taken into account when estimating a
regression model, which are mostly related to the data used in the empirical analysis.
Specifically, we will examine:
- The implications of working with a misspecified model (in terms of the variables to be included) and statistical tools that enable us to understand whether the inclusion of additional variables is justified.
- A statistical test to understand whether the model is misspecified due to the functional form chosen to model the dependency of Y on the Xs.
- The implications of using explanatory variables that are excessively correlated with one another, as well as a tool to detect this possible issue (i.e. multicollinearity).
- The implications of having extreme/atypical observations in the data used for the empirical analysis, as well as tools to detect whether this occurs and whether it really damages the regression model (i.e. outliers and influential observations).
Model Specification and Selection (1)
As a first step, we should be able to understand what happens if we a) do NOT include variables that are
relevant to explain the outcome and b) do include variables that are irrelevant to explain the outcome.
- Let’s consider that the only observable variables that explain the outcome “y” are “x1” and “x2”, which means that the model that SHOULD be estimated would be:
True model: 𝑦𝑖 = 𝛼 + 𝛽1 𝑥1𝑖 + 𝛽2 𝑥2𝑖 + 𝑢𝑖
- However, for some reason it is possible that one or more variables that should appear in the right-hand-side
of the equation are not included, either because these variables are not available in the dataset or because
they are intrinsically unobservable (i.e. cannot be quantified in a statistical variable).
Omission of relevant covariates:
Estimated model 1: $y_i = \hat{\alpha}^* + \hat{\beta}_1^* x_{1i} + \hat{\varepsilon}_i$, with $\hat{\varepsilon}_i = \beta_2 x_{2i} + u_i$
- What are the implications of omitting a variable that appears in the true model (i.e. is a relevant explanatory
factor of the dependent variable) from our estimated regression model?
- Can we trust the estimation of a model that is incomplete due to the omission of a relevant variable?
Model Specification and Selection (2)
Omission of relevant covariates:
- True model: 𝑦𝑖 = 𝛼 + 𝛽1 𝑥1𝑖 + 𝛽2 𝑥2𝑖 + 𝑢𝑖
- Estimated model 1: $y_i = \hat{\alpha}^* + \hat{\beta}_1^* x_{1i} + \hat{\varepsilon}_i \;\Rightarrow\; \hat{\varepsilon}_i = \beta_2 x_{2i} + u_i$
By “true” we mean that the estimated coefficients are a correct representation of the real phenomenon under investigation, which can be summarized in the result that the expected value of the estimated beta coefficient(s) is equal to the true population value(s).
- Taking the expected value of the beta coefficient associated with the “included” variable (x1), we can see that when the variable x2 is excluded (while it should be in the model), the expected value of the corresponding coefficient ($\hat{\beta}_1^*$) is not equal to its true (population) value $\beta_1$, that is:
$E[\hat{\beta}_1^*] = \beta_1 + E[(X'X)^{-1}X'\varepsilon] = \beta_1 + \beta_2\,\dfrac{\sum_{i=1}^{n}(x_{1i}-\bar{x}_1)\,x_{2i}}{\sum_{i=1}^{n}(x_{1i}-\bar{x}_1)^2} \neq \beta_1$
The omission of a relevant variable in an OLS regression introduces bias in the estimations, due to the failure of the first assumption (i.e. the estimated equation is not a correct representation of the true model).
This kind of situation is usually called “Omitted Variable Bias” and generates correlation between the included variable and the error term of the estimated equation (i.e. E[x1·ε] ≠ 0), unless x1 and x2 are actually uncorrelated (which is usually not the case).
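A minimal simulation sketch (not part of the slides; the data-generating process, coefficient values and variable names are illustrative assumptions) showing how omitting a relevant regressor that is correlated with x1 biases the OLS estimate of β1:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 5000
alpha, beta1, beta2 = 1.0, 2.0, 3.0

# x1 and x2 are deliberately correlated, so omitting x2 biases the estimate of beta1
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)
u = rng.normal(size=n)
y = alpha + beta1 * x1 + beta2 * x2 + u

# True model: includes both regressors
full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
# Misspecified model: omits the relevant variable x2
short = sm.OLS(y, sm.add_constant(x1)).fit()

print("beta1 with x2 included:", round(full.params[1], 3))   # close to 2
print("beta1 with x2 omitted: ", round(short.params[1], 3))  # biased upwards (about 2 + 3*0.5/1.25)
```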
Model Specification and Selection (3)
Inclusion of irrelevant covariates:
- Let’s now consider the opposite situation, in which we do include a variable that is not relevant to explain the
outcome under investigation (e.g. x3).
True model: 𝑦𝑖 = 𝛼 + 𝛽1 𝑥1𝑖 + 𝛽2 𝑥2𝑖 + 𝑢𝑖
Estimated model 2: $y_i = \hat{\alpha}^{**} + \hat{\beta}_1^{**} x_{1i} + \hat{\beta}_2^{**} x_{2i} + \hat{\beta}_3 x_{3i} + \hat{u}_i$
- What are the implications of including in the regression model a variable that does not appear in the true model (i.e. is an irrelevant explanatory factor of the dependent variable)?
- Can we trust the estimation of a model that is “overparametrized” due to the inclusion of an irrelevant variable?
It is possible to show that in this case we have: $E[\hat{\beta}_1^{**}] = \beta_1$.
- This means that if we include a redundant variable the estimation obtained by OLS is still unbiased: the estimated beta for the variable x1 is still, on average, a good representation of its true value.
- However, the variance of the estimated coefficients is higher due to the inclusion of an irrelevant covariate, which means that the OLS estimator is no longer efficient (i.e. there is a loss of precision of the estimates).
Therefore, if we include “too many” variables that are not really important to explain the outcome, the OLS coefficients become imprecise (i.e. higher standard errors), the confidence intervals become wider and it becomes more likely that we fail to reject a null hypothesis regarding single coefficients.
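A small sketch (assumed setup, not from the slides) showing that adding an irrelevant regressor x3 leaves the coefficient on x1 roughly unbiased but inflates its standard error, especially when x3 is correlated with x1:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x3 = 0.9 * x1 + 0.3 * rng.normal(size=n)   # irrelevant but highly correlated with x1
y = 1.0 + 2.0 * x1 + rng.normal(size=n)    # x3 plays no role in the true model

correct = sm.OLS(y, sm.add_constant(x1)).fit()
overfit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x3]))).fit()

print("SE of beta1 without x3:", round(correct.bse[1], 3))
print("SE of beta1 with x3:   ", round(overfit.bse[1], 3))   # larger standard error
```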
Model Specification and Selection (4)
- The previous results highlight the relevance of correctly specifying the regression model, in terms of the variables to be included.
- The relevance of model selection can also be appreciated by considering the formula of the estimated variance of the OLS coefficients, that is:
$Var(\hat{\beta}\,|\,X) = \hat{\sigma}^2(X'X)^{-1} = \dfrac{\hat{u}'\hat{u}}{n-(k+1)}\,(X'X)^{-1}$
For a single coefficient, it is possible to show that its variance is equal to:
$Var(\hat{\beta}_j) = \dfrac{\hat{\sigma}^2}{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2\,(1-R_j^2)} = \dfrac{\hat{u}'\hat{u}/(n-(k+1))}{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2\,(1-R_j^2)}$
This formula indicates that the precision with which we estimate the coefficient for the variable xj:
- May increase or decrease with the number of variables included in the model:
The inclusion of a new variable reduces the variance of the regression ($\hat{\sigma}^2$), but also reduces the degrees of freedom with which we estimate the model (n − (k+1)) for a given sample size.
- Increases with the number of observations (n).
- Decreases with the strength of the relationship between xj and the other explanatory variables included in the model ($R_j^2$).
- Increases with the amount of variability of xj ($\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2$), as illustrated in the sketch below.
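As an illustration (a minimal sketch with simulated data and assumed variable names, not taken from the slides), the single-coefficient variance formula above can be verified numerically against the standard errors reported by statsmodels:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n, k = 300, 2
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 1 + 2 * x1 - x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

# Rebuild Var(beta_1_hat) from the formula: sigma2_hat / (SST_1 * (1 - R_1^2))
sigma2_hat = np.sum(res.resid ** 2) / (n - (k + 1))
SST_1 = np.sum((x1 - x1.mean()) ** 2)
R2_1 = sm.OLS(x1, sm.add_constant(x2)).fit().rsquared   # auxiliary regression of x1 on the other xs
se_manual = np.sqrt(sigma2_hat / (SST_1 * (1 - R2_1)))

print(se_manual, res.bse[1])   # the two standard errors should coincide
```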
Model Specification and Selection (5)
This discussion suggests that the selection of variables to be included in the model is a crucial issue that should
be taken into account, as it can affect both the reliability and the precision of our estimations.
- One may consider that maximizing the R-squared could be useful to understand how many explanatory variables we should include in our model.
$R^2 = 1 - \dfrac{SSR}{SST} = 1 - \dfrac{\sum_{i=1}^{n}\hat{u}_i^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$
However, maximizing the R-squared leads to the overparametrization of the model, since the R-squared never
decreases when we include an additional explanatory variable (even if it is irrelevant).
- Solution: Adjusted R-squared:
$\bar{R}^2 = 1 - \dfrac{SSR/(n-(k+1))}{SST/(n-1)} = 1 - \dfrac{\sum_{i=1}^{n}\hat{u}_i^2/(n-(k+1))}{\sum_{i=1}^{n}(y_i-\bar{y})^2/(n-1)} = 1 - \dfrac{\hat{\sigma}^2}{SST/(n-1)} = 1 - \dfrac{(1-R^2)(n-1)}{n-(k+1)}$
Using the adjusted (or corrected) R-squared introduces a penalty for each additional variable (k).
- This enables taking into account the trade-off between reducing the residuals’ variance and reducing the
degrees of freedom of the model.
- In practice, it is possible that the adjusted R-squared decreases when we add an irrelevant variable to our model.
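As a check (a minimal sketch with simulated data and hypothetical variable names), the R-squared and adjusted R-squared formulas above can be reproduced and compared with the values reported by statsmodels:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, k = 100, 2
X = rng.normal(size=(n, k))
y = 1 + X @ np.array([1.5, -0.8]) + rng.normal(size=n)

res = sm.OLS(y, sm.add_constant(X)).fit()
SSR = np.sum(res.resid ** 2)
SST = np.sum((y - y.mean()) ** 2)

r2 = 1 - SSR / SST
adj_r2 = 1 - (SSR / (n - (k + 1))) / (SST / (n - 1))

print(r2, res.rsquared)          # the two should match
print(adj_r2, res.rsquared_adj)  # and so should these
```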
Model Specification and Selection (6)
There are also alternatives (which are usually preferred for reasons that are beyond the scope of this course) to
the adjusted R-squared to select the model’s specification in terms of the variables to be included:
Information Criteria:
- Akaike information criterion: $AIC = \ln(\hat{\sigma}_u^2) + \dfrac{2k}{n}$
- Schwarz (or Bayesian) information criterion: $BIC = \ln(\hat{\sigma}_u^2) + \dfrac{k\ln(n)}{n}$
- Where $\hat{\sigma}_u^2$ is the estimated variance of the error term, k is the number of explanatory variables included in the model and n is the number of observations.
These statistics are automatically displayed by GRETL after any estimation.
Decision rule: select the model with the minimum value of a given information criterion (in case of inconsistency between the two, use the BIC); notice that the sign must be considered, since they can take negative values (e.g. BIC of model A = −325, BIC of model B = −356 → model B is preferable).
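A sketch (illustrative data, assumed names) of the information criteria as written on the slide. Note that software such as GRETL or statsmodels reports likelihood-based versions, which differ in scale and constants but lead to the same model ranking:

```python
import numpy as np
import statsmodels.api as sm

def slide_aic_bic(res):
    n = int(res.nobs)
    k = res.df_model                     # number of explanatory variables (excluding the constant)
    sigma2 = np.sum(res.resid ** 2) / n  # one common convention for the error-variance estimate
    aic = np.log(sigma2) + 2 * k / n
    bic = np.log(sigma2) + k * np.log(n) / n
    return aic, bic

# Example: a model with one regressor versus one with an extra (irrelevant) regressor
rng = np.random.default_rng(2)
n = 150
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2 + 1.0 * x1 + rng.normal(size=n)

m1 = sm.OLS(y, sm.add_constant(x1)).fit()
m2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print("model 1 (AIC, BIC):", slide_aic_bic(m1))
print("model 2 (AIC, BIC):", slide_aic_bic(m2))   # pick the model with the smaller values
```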
Model Specification and Selection (7)
Does this mean that we have to include “all” the available variables in the model as long as we observe an
increase in the adjusted R-squared?
The general answer is NO. This is because:
- Often additional variables end up being highly correlated, which means that with more variables there is an increasing risk of having excessive multicollinearity in the model (see later).
- In applied regression analysis, you usually focus on one or two main variables of interest and include other
explanatory variables as “controls” (i.e. to keep fixed other factors that may co-vary with the outcome).
Therefore, you should avoid including other variables that are already “captured” by the controls that are in
the model.
- When you include additional variables, you should always try to avoid variables that are:
i) possibly correlated with the error term of the model (since we assume that there is no correlation between the explanatory variables and the error), or
ii) possibly affected by the dependent variable of our model; that is, you should take care of potential reverse causality issues.
Model Specification and Selection (8)
- The latter case of reverse causality happens when the direction of the causal relationship between variables is
not clear.
- Reverse causality means that although there exists an effect of X on Y, it is also true that Y affects X, that is, Y
and X are simultaneously determined.
For example, in case we want to estimate the demand for a given product, prices affect quantity but quantity
affects prices as well.
- If we specify an equation for the “demanded quantity” of a given product, we cannot avoid including its price in the equation.
Otherwise, the price would be a relevant omitted variable, generating bias in the estimated coefficients.
- However, there exists reverse causality between prices and quantity: if we estimated the equation for the “price of the product”, we could not avoid including the quantity (since the supplied quantity of a given product is a relevant determinant of its price).
The existence of reverse causality “automatically” generates correlation between the explanatory variables and
the error term.
Model Specification and Selection (9)
- The existence of reverse causality “automatically” generates correlation between the explanatory variables and
the error term.
To see why, consider the relationship between the demanded quantity of a given product (q) and its price (p):
$q_i = \alpha + \beta p_i + \varepsilon_i \qquad\qquad p_i = \gamma + \delta q_i + u_i$
$p_i = \gamma + \delta(\alpha + \beta p_i + \varepsilon_i) + u_i \;\Rightarrow\; p_i = \dfrac{\gamma + \delta\alpha}{1-\delta\beta} + \dfrac{\delta}{1-\delta\beta}\,\varepsilon_i + \dfrac{1}{1-\delta\beta}\,u_i = \alpha^* + \delta^*\varepsilon_i + u_i^*$
Hence, E[p·ε] ≠ 0, since ε appears in the equation that explains p (after substituting the expression for q), which invalidates the OLS estimator.
- Take-home message of this discussion: take care with the variables that you include in your model; try to follow the theory (if it exists), existing studies and, especially, “common sense”.
- In case the issue that you want to investigate inevitably involves a) reverse causality issues, b) the inclusion of variables that are possibly related with the unobserved determinants of your outcome of interest and/or c) crucial variables that are unobservable (e.g. managerial ability), OLS is probably not an adequate method and you should use other regression methods, such as the Instrumental Variables estimator (not explained in this course…).
Functional Form Specification (1)
Another important issue to be considered in applied regression analysis is the functional form that is assumed
to describe the relationship between the dependent variable and the explanatory variables of our regression
model.
- Considering the standard regression model of the form,
𝑦𝑖 = 𝛼 + 𝛽1 𝑥1𝑖 + 𝛽2 𝑥2𝑖 + 𝑢𝑖
We can see that it implicitly assumes a linear functional form to model the behavior of y.
- One simple way of introducing non-linearities in the dependency of y on the xs is by applying logs to the
variables, for example,
ln(𝑦𝑖 ) = 𝛼 + 𝛽1 ln(𝑥1𝑖 ) + 𝛽2 𝑥2𝑖 + 𝑢𝑖
From the previous topics, we know that on top of a different interpretation of the beta coefficients (i.e. elasticities or % changes), taking logs of the variables has several advantages (and some drawbacks).
- On top of that, if the real model that explains y as a function of the xs is expressed in logs (e.g. a Cobb-Douglas function), the logs should be there, and not considering logarithms would imply a specification error due to neglected non-linearity.
How can we know whether a log-log model is preferable to a linear model?
Functional Form Specification (2)
Moreover, in some cases the “marginal effect” of x on y may not be constant for different values of x.
- It is possible to allow for increasing or decreasing marginal effect of a variable by specifying quadratic variables,
that is:
$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{2i}^2 + u_i$
In this case the marginal effect of x1 on y is constant for each value of x1 (i.e. $\partial y_i/\partial x_{1i} = \hat{\beta}_1$), whereas the effect of x2 on y is allowed to be different for different values of x2 (i.e. $\partial y_i/\partial x_{2i} = \hat{\beta}_2 + 2\hat{\beta}_3 x_{2i}$).
- Provided that $\hat{\beta}_3$ is statistically different from zero (otherwise the effect would be linear), the relationship between x2 and y would be:
Concave if $\hat{\beta}_2 > 0$ and $\hat{\beta}_3 < 0$ → the effect of x2 increases, up to a certain point, and then decreases.
Convex if $\hat{\beta}_2 < 0$ and $\hat{\beta}_3 > 0$ → the effect of x2 first decreases, up to a certain point, and then increases.
What is the value of x2 at which its effect on y reaches the maximum/minimum? ⟹ $x_2^* = \dfrac{\hat{\beta}_2}{-2\hat{\beta}_3}$
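As a quick illustration (a sketch with simulated data and assumed coefficient values, not from the slides), the turning point can be recovered from an estimated quadratic specification:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)
x2 = rng.uniform(0, 10, size=n)
y = 1 + 0.5 * x1 + 2.0 * x2 - 0.2 * x2 ** 2 + rng.normal(size=n)  # concave in x2, maximum at x2 = 5

X = sm.add_constant(np.column_stack([x1, x2, x2 ** 2]))
res = sm.OLS(y, X).fit()
b2, b3 = res.params[2], res.params[3]
print("estimated turning point:", -b2 / (2 * b3))   # should be close to 5
```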
- Notice that in case the relationship of y with a given x is non-linear and we neglect including quadratic effects, this would be similar to the omission of relevant variables in the regression model (→ biased coefficients).
- Again, how can we test whether the model suffers from misspecification due to neglected non-linearities?
The RESET Test (1)
One possible way of testing for misspecification problems due to neglected non-linearities in our model is by means of the RESET test.
- Considering, for simplicity, the model with two explanatory variables:
𝑦𝑖 = 𝛼 + 𝛽1 𝑥1𝑖 + 𝛽2 𝑥2𝑖 + 𝑢𝑖
The RESET test for General Functional Form Misspecification Errors can be implemented following these steps:
1) Predict the dependent variable of the model you want to test for misspecification:
$\hat{y}_i = \hat{\alpha} + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i}$
2) Then run an auxiliary equation, in which you regress your original dependent variable on the same set of
control variables plus the squared and cubic terms of the prediction obtained in step 1, that is:
$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \delta_2 \hat{y}_i^2 + \delta_3 \hat{y}_i^3 + v_i$
3) Test the following null and alternative hypotheses by means of an F-statistic:
$H_0: \delta_2 = \delta_3 = 0$
$H_1: \delta_2 \neq 0$ and/or $\delta_3 \neq 0$
The RESET Test (2)
3) Test the joint null hypothesis $H_0: \delta_2 = \delta_3 = 0$
- If H0 is not rejected, there is no problem of misspecification in the model due to neglected non-linearities.
Intuition: if the quadratic and cubic terms of the prediction are significant variables in the equation for y, some non-linearity has been neglected, because $\hat{y}_i^2$ and $\hat{y}_i^3$ are non-linear transformations of the explanatory variables.
- The RESET test makes it possible to discriminate between specifications, but it does not explicitly indicate how to proceed if the null hypothesis (of no functional form errors) is rejected.
In practice, one may start with a linear model and check whether the test rejects H0 or not.
If this is the case, it is possible to try considering logs of the variables (when possible) and apply the RESET test again.
If H0 is rejected again, it is possible to include quadratic effects (or even higher-order polynomial terms, when justified by some theoretical argument).
If H0 is rejected again, maybe there are other problems in the model, like the omission of relevant variables (i.e. the RESET test sometimes confounds non-linearities with omitted variables).
Overall, theoretical guidance and the analysis of previous research in the literature are the best criteria for model specification (the test should only support this process).
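The steps above can be implemented directly. The following is a minimal sketch with simulated data (the variable names and data-generating process are illustrative assumptions; recent versions of statsmodels also provide a ready-made RESET test, linear_reset in statsmodels.stats.diagnostic, which could be used instead):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 300
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + x1 + 0.5 * x2 + 0.7 * x2 ** 2 + rng.normal(size=n)   # true model is non-linear in x2

# Step 1: estimate the (possibly misspecified) linear model and get the fitted values
X = sm.add_constant(np.column_stack([x1, x2]))
base = sm.OLS(y, X).fit()
yhat = base.fittedvalues

# Step 2: auxiliary regression adding the squared and cubic fitted values
X_aux = np.column_stack([X, yhat ** 2, yhat ** 3])
aux = sm.OLS(y, X_aux).fit()

# Step 3: F-test of H0: delta2 = delta3 = 0 (the coefficients on yhat^2 and yhat^3)
R = np.array([[0, 0, 0, 1, 0],
              [0, 0, 0, 0, 1]])
print(aux.f_test(R))   # a small p-value signals neglected non-linearity
```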
Multicollinearity (1)
Apart from the model’s specification in terms of functional form and of which variables have to be included or excluded, another relevant issue concerns the possible relationship between explanatory variables.
- One of the assumptions of the multiple linear regression states that the rank of the matrix X should be full,
which means that none of the variables included in the model should be a perfect linear combination of the
other variables.
- In practice, it is possible that the variables that we may want to include in our regression present a strong
correlation between them, because they are related elements or simply because they are capturing similar
things.
The presence of a strong relationship between different explanatory variables is called multicollinearity and
can affect the precision (i.e. the standard errors) of estimated coefficients obtained by OLS.
- In fact, remember that the variance of a single estimated coefficient can be expressed as:
$Var(\hat{\beta}_j) = \dfrac{\hat{\sigma}^2}{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2\,(1-R_j^2)} = \dfrac{\hat{u}'\hat{u}/(n-(k+1))}{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2\,(1-R_j^2)}$
If the variable xj is strongly correlated with the others → $R_j^2$ increases → $Var(\hat{\beta}_j)$ increases.
When there is perfect collinearity (the Xs are linear combinations of each other) → $R_j^2 \to 1 \Rightarrow Var(\hat{\beta}_j) \to \infty$.
Multicollinearity (2)
If the variable xj is strongly correlated with the others → $R_j^2$ increases → $Var(\hat{\beta}_j)$ increases.
When there is perfect collinearity (the Xs are linear combinations of each other) → $R_j^2 \to 1 \Rightarrow Var(\hat{\beta}_j) \to \infty$.
Multicollinearity (3)
- Since there is always some degree of correlation between different explanatory variables, we would like to know
the “threshold” at which the existing multicollinearity is “excessive”.
This would imply that our estimates are too imprecise, which invalidates statistical inference from our model.
- The Variance Inflation Factor is a tool to detect excessive multicollinearity:
$VIF_j = \dfrac{1}{1-R_j^2} \;\Rightarrow\; Var(\hat{\beta}_j) = \dfrac{\hat{\sigma}_u^2}{SST_j}\cdot VIF_j$
The value of $R_j^2$ is obtained from the auxiliary regression in which the dependent variable is xj and the explanatory variables are all the other Xs included in the model (i.e. $x_j = \mu + \gamma_1 x_1 + \gamma_2 x_2 + \dots + \gamma_{j-1} x_{j-1} + \gamma_{j+1} x_{j+1} + \dots + \gamma_k x_k + \epsilon$).
A VIF > 10 means that the degree of collinearity between xj and the other explanatory variables is “excessive”; this is equivalent to having $R_j^2 > 0.9$.
A VIF > 5 (but < 10) means that the degree of collinearity between xj and the other explanatory variables is “moderate”.
- The VIF indicates the extent to which the variance of the coefficient is “inflated” by the presence of multicollinearity:
When xj is uncorrelated with the other explanatory variables, then VIFj = 1.
If VIFj = 15, it means that the variance of $\hat{\beta}_j$ is 15 times higher than it would have been in the hypothetical case in which xj was uncorrelated with the other regressors.
- What to do if the VIF is above the value 10? If possible, increase the sample size or impose restrictions on the involved variables (or remove one of them and check what happens to the other coefficients).
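A sketch of this check in Python (illustrative data and variable names; GRETL offers an analogous collinearity/VIF diagnostic after OLS estimation):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF for each explanatory variable (skip the constant in column 0)
for j in range(1, X.shape[1]):
    print(f"VIF for x{j}: {variance_inflation_factor(X, j):.1f}")
```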
Outliers (1)
Another issue to be considered in our OLS regression model refers to the presence of “atypical observations”,
which are generally called “outliers”.
- Outliers are observations with variables’ values that are very far away from the commonly observed values in the
sample.
Outliers might be due to:
- Coding errors in some of the variables (e.g. adding an extra “0” by mistake could generate an outlier).
- Sampling from a relatively small population in which some of the sampled units are very different from all the rest.
- The presence of outliers per se is not a problem; what we should be concerned about is whether removing (or including) the outlier(s) in our regression significantly changes the results.
If dropping possible outliers from the sample changes the regression’s results in a substantial way, these will be
called “influential observations”.
An observation is influential if its exclusion from the estimation sample changes the OLS results by a large
amount (what is large ends up being an arbitrary concept).
- OLS is susceptible to outlying observations, because it minimizes the Sum of Squared Residuals: large residuals
receive a lot of weight in the least squares minimization process.
If the estimates change substantially when we slightly modify our sample by removing one or a few outliers, we should be concerned (however, there is no consensus about the solution).
Outliers (2)
How can we analyze whether we do have outliers in our data and these represent an issue for the OLS
estimation?
- First, we need to determine whether a given observation can be considered to be atypical.
A graphical analysis is often useful (on top of standard descriptive statistics, e.g. max-min values).
[Figure: two scatter plots of y versus x with least-squares fits, based on n = 37 observations with an atypical value in y. Left panel (full sample): fitted line Y = 0.782 + 1.18X, with the possible outliers marked. Right panel (same data after removing just one observation, n* = 37 − 1): fitted line Y = 3.71 + 0.851X. Removing a single observation significantly changes the estimated slope coefficient.]
Outliers (3)
How can we analyze whether we do have outliers in our data and these represent an issue for the OLS
estimation?
- Second, it is also possible to use statistical tools that reveal whether an observation 1) is an outlier and 2) has a
real influence on the estimated slope coefficient (i.e. its deletion significantly changes the estimates).
A measure that indicates the contribution of observation “i” to the predicted value of the outcome ($\hat{y}$) is the so-called “leverage” of observation “i”, $h_{ii}$:
- If $h_{ii} \geq 2\,(k+1)/n$, observation “i” has a potential influence on the slope.
- If $h_{ii} < 2\,(k+1)/n$, observation “i” does not have a potential influence on the slope.
Having computed the leverage, it is possible to obtain the studentized residuals:
$r_i = \dfrac{\hat{u}_i}{\sqrt{\hat{\sigma}_u^2\,(1-h_{ii})}}$
- Where $r_i$ follows a Student's t distribution with n−(k+1) degrees of freedom:
If $|r_i| \geq t_{n-(k+1),\,\alpha/2}$ → observation “i” can be considered an outlier.
If $|r_i| < t_{n-(k+1),\,\alpha/2}$ → observation “i” cannot be considered an outlier.
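The sketch below (simulated data, assumed names) computes the leverages and studentized residuals with statsmodels and applies the two rules of thumb above:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence
from scipy import stats

rng = np.random.default_rng(6)
n, k = 50, 1
x = rng.uniform(0, 10, size=n)
y = 1 + 0.8 * x + rng.normal(size=n)
y[0] += 15                      # plant one atypical observation

res = sm.OLS(y, sm.add_constant(x)).fit()
infl = OLSInfluence(res)

leverage = infl.hat_matrix_diag                    # h_ii
studentized = infl.resid_studentized_internal      # r_i, as defined on the slide

h_threshold = 2 * (k + 1) / n
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - (k + 1))

flagged = np.where((leverage >= h_threshold) | (np.abs(studentized) >= t_crit))[0]
print("potentially influential / outlying observations:", flagged)
```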
Outliers (4)
How can we analyze whether we do have outliers in our data and these represent an issue for the OLS
estimation?
- Second, it is also possible to use statistical tools that reveal whether an observation 1) is an outlier and 2) has a
real influence on the estimated slope coefficient (i.e. its deletion significantly changes the estimates).
In order to understand whether a given observation that has a potential influence and can be considered an outlier actually distorts the estimated values of the betas, it is possible to compute the DFFITS:
$DFFITS_i = \dfrac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{\hat{\sigma}^2_{u(i)}\,h_{ii}}}$
- Where $\hat{y}_{(i)}$ is the predicted value of “y” obtained after removing observation “i” and $\hat{\sigma}^2_{u(i)}$ is the estimated variance of the error term without observation “i” as well.
$DFFITS_i$ measures the real contribution of observation “i” to the prediction of the outcome:
- If $|DFFITS_i| > 2\sqrt{(k+1)/n}$ (or 1 if the sample is small) → observation “i” has a real influence on the slopes.
Obtaining a significant DFFITS indicates that the observed values for observation “i” should be carefully checked and, if no error in the data is detected, one should check what happens when that observation is removed from the estimation sample.
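A sketch of the DFFITS check with statsmodels (illustrative data and names; the dffits attribute returns the statistic together with the 2√((k+1)/n) threshold discussed above):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

rng = np.random.default_rng(7)
n = 40
x = rng.uniform(0, 10, size=n)
y = 2 + 1.2 * x + rng.normal(size=n)
y[5] += 20                              # plant an influential observation

res = sm.OLS(y, sm.add_constant(x)).fit()
dffits, threshold = OLSInfluence(res).dffits

influential = np.where(np.abs(dffits) > threshold)[0]
print(f"threshold = {threshold:.3f}; influential observations: {influential}")
```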