ECONOMETRICS
Data Issues in Regression Analysis:
Specification, Outliers and Multicollinearity
Professor: Antonio Di Paolo
Degree in International Business
Academic Year 2018/2019
Universitat de Barcelona
Data Issues in Regression Analysis (1)
The validity of OLS as a reliable statistical tool to describe reality depends crucially on
whether the underlying assumptions are satisfied.
In this topic we will consider some of the issues that have to be taken into account when estimating a
regression model, which are mostly related to the data used in the empirical analysis.
Specifically, we will examine:
- The implications of working with a misspecified model (in terms of the variables to be included) and statistical tools that enable us to understand whether the inclusion of additional variables is justified.
- A statistical test to understand whether the model is misspecified due to the functional form chosen to model the dependency of Y on the Xs.
- The implications of using explanatory variables that are excessively correlated with one another, as well as a tool to detect this possible issue (i.e. multicollinearity).
- The implications of having extreme/atypical observations in the data used for the empirical analysis, as well as tools to detect whether this occurs and whether it really damages the regression model (i.e. outliers and influential observations).
Model Specification and Selection (1)
As a first step, we should be able to understand what happens if we a) do NOT include variables that are
relevant to explain the outcome and b) do include variables that are irrelevant to explain the outcome.
- Let’s consider that the only observable variables that explain the outcome “y” are “x1” and “x2”, which means that the model that SHOULD be estimated would be:
True model: 𝑦𝑖 = 𝛼 + 𝛽1 𝑥1𝑖 + 𝛽2 𝑥2𝑖 + 𝑢𝑖
- However, for some reason it is possible that one or more variables that should appear in the right-hand-side
of the equation are not included, either because these variables are not available in the dataset or because
they are intrinsically unobservable (i.e. cannot be quantified in a statistical variable).
Omission of relevant covariates:
Estimated model 1: $y_i = \hat{\alpha}^* + \hat{\beta}_1^* x_{1i} + \hat{\varepsilon}_i$, with $\hat{\varepsilon}_i = \beta_2 x_{2i} + u_i$
- What are the implications of omitting a variable that appears in the true model (i.e. is a relevant explanatory
factor of the dependent variable) from our estimated regression model?
- Can we trust the estimation of a model that is incomplete due to the omission of a relevant variable?
Model Specification and Selection (2)
Omission of relevant covariates:
- True model: 𝑦𝑖 = 𝛼 + 𝛽1 𝑥1𝑖 + 𝛽2 𝑥2𝑖 + 𝑢𝑖
- Estimated model 1: $y_i = \hat{\alpha}^* + \hat{\beta}_1^* x_{1i} + \hat{\varepsilon}_i \;\Rightarrow\; \hat{\varepsilon}_i = \beta_2 x_{2i} + u_i$
By “true” we mean that the estimated coefficients are a correct representation of the real phenomenon under investigation, which can be summarized in the result that the expected value of the estimated beta coefficient(s) is equal to the true population value(s).
- Taking the expected value of the beta coefficient associated with the “included” variable (x1), we can see that when the variable x2 is excluded (while it should be in the model), the expected value of the corresponding coefficient ($\hat{\beta}_1^*$) is not equal to its true (population) value $\beta_1$, that is:
$E[\hat{\beta}_1^*] = \beta_1 + E[(X'X)^{-1}X'\varepsilon] = \beta_1 + \beta_2\,\dfrac{\sum_{i=1}^{n}(x_{1i}-\bar{x}_1)\,x_{2i}}{\sum_{i=1}^{n}(x_{1i}-\bar{x}_1)^2} \neq \beta_1$
The omission of a relevant variable in an OLS regression introduces bias in the estimations, due to the failure of the first assumption (i.e. the estimated equation is not a correct representation of the true model).
This kind of situation is usually called “Omitted Variable Bias” and generates correlation between the included variable and the error term of the estimated equation (i.e. E[x1·ε] ≠ 0), unless x1 and x2 are actually uncorrelated (which is usually not the case).
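A minimal simulation sketch (not part of the slides; the data-generating process, coefficient values and variable names are illustrative assumptions) showing how omitting a relevant regressor that is correlated with x1 biases the OLS estimate of β1:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 5000
alpha, beta1, beta2 = 1.0, 2.0, 3.0

# x1 and x2 are deliberately correlated, so omitting x2 biases the estimate of beta1
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)
u = rng.normal(size=n)
y = alpha + beta1 * x1 + beta2 * x2 + u

# True model: includes both regressors
full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
# Misspecified model: omits the relevant variable x2
short = sm.OLS(y, sm.add_constant(x1)).fit()

print("beta1 with x2 included:", round(full.params[1], 3))   # close to 2
print("beta1 with x2 omitted: ", round(short.params[1], 3))  # biased upwards (about 2 + 3*0.5/1.25)
```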
Model Specification and Selection (3)
Inclusion of irrelevant covariates:
- Let’s now consider the opposite situation, in which we do include a variable that is not relevant to explain the
outcome under investigation (e.g. x3).
True model: 𝑦𝑖 = 𝛼 + 𝛽1 𝑥1𝑖 + 𝛽2 𝑥2𝑖 + 𝑢𝑖
Estimated model 2: $y_i = \hat{\alpha}^{**} + \hat{\beta}_1^{**} x_{1i} + \hat{\beta}_2^{**} x_{2i} + \hat{\beta}_3 x_{3i} + \hat{u}_i$
- What are the implications of including in the regression model a variable that does not appear in the true model (i.e. is an irrelevant explanatory factor of the dependent variable)?
- Can we trust the estimation of a model that is “overparametrized” due to the inclusion of an irrelevant variable?
It is possible to show that in this case we have: $E[\hat{\beta}_1^{**}] = \beta_1$.
- This means that if we include a redundant variable the estimation obtained by OLS is still unbiased: the estimated beta for the variable x1 is still, on average, a good representation of its true value.
- However, the variance of the estimated coefficients is higher due to the inclusion of an irrelevant covariate, which means that the OLS estimator is no longer efficient (i.e. there is a loss of precision of the estimates).
Therefore, if we include “too many” variables that are not really important to explain the outcome, the OLS coefficients become imprecise (i.e. higher standard errors), the confidence intervals become wider and it becomes more likely that we fail to reject a null hypothesis regarding single coefficients.
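A small sketch (assumed setup, not from the slides) showing that adding an irrelevant regressor x3 leaves the coefficient on x1 roughly unbiased but inflates its standard error, especially when x3 is correlated with x1:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x3 = 0.9 * x1 + 0.3 * rng.normal(size=n)   # irrelevant but highly correlated with x1
y = 1.0 + 2.0 * x1 + rng.normal(size=n)    # x3 plays no role in the true model

correct = sm.OLS(y, sm.add_constant(x1)).fit()
overfit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x3]))).fit()

print("SE of beta1 without x3:", round(correct.bse[1], 3))
print("SE of beta1 with x3:   ", round(overfit.bse[1], 3))   # larger standard error
```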
Model Specification and Selection (4)
- The previous results highlight the relevance of correctly specifying the regression model, in terms of the variables to be included.
- The relevance of model selection can also be appreciated by considering the formula of the estimated variance of the OLS coefficients, that is:
$Var(\hat{\beta}\,|\,X) = \hat{\sigma}^2(X'X)^{-1} = \dfrac{\hat{u}'\hat{u}}{n-(k+1)}\,(X'X)^{-1}$
For a single coefficient, it is possible to show that its variance is equal to:
$Var(\hat{\beta}_j) = \dfrac{\hat{\sigma}^2}{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2\,(1-R_j^2)} = \dfrac{\hat{u}'\hat{u}/(n-(k+1))}{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2\,(1-R_j^2)}$
This formula indicates that the precision with which we estimate the coefficient for the variable xj:
- May increase or decrease with the number of variables included in the model:
The inclusion of a new variable reduces the variance of the regression ($\hat{\sigma}^2$), but also reduces the degrees of freedom with which we estimate the model (n − (k+1)) for a given sample size.
- Increases with the number of observations (n).
- Decreases with the strength of the relationship between xj and the other explanatory variables included in the model ($R_j^2$).
- Increases with the amount of variability of xj ($\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2$), as illustrated in the sketch below.
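As an illustration (a minimal sketch with simulated data and assumed variable names, not taken from the slides), the single-coefficient variance formula above can be verified numerically against the standard errors reported by statsmodels:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n, k = 300, 2
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 1 + 2 * x1 - x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
res = sm.OLS(y, X).fit()

# Rebuild Var(beta_1_hat) from the formula: sigma2_hat / (SST_1 * (1 - R_1^2))
sigma2_hat = np.sum(res.resid ** 2) / (n - (k + 1))
SST_1 = np.sum((x1 - x1.mean()) ** 2)
R2_1 = sm.OLS(x1, sm.add_constant(x2)).fit().rsquared   # auxiliary regression of x1 on the other xs
se_manual = np.sqrt(sigma2_hat / (SST_1 * (1 - R2_1)))

print(se_manual, res.bse[1])   # the two standard errors should coincide
```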
Model Specification and Selection (5)
This discussion suggests that the selection of variables to be included in the model is a crucial issue that should
be taken into account, as it can affect both the reliability and the precision of our estimations.
- One may consider that maximizing the R-squared could be useful to understand how many explanatory variables we should include in our model.
$R^2 = 1 - \dfrac{SSR}{SST} = 1 - \dfrac{\sum_{i=1}^{n}\hat{u}_i^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$
However, maximizing the R-squared leads to the overparametrization of the model, since the R-squared never
decreases when we include an additional explanatory variable (even if it is irrelevant).
- Solution: Adjusted R-squared:
$\bar{R}^2 = 1 - \dfrac{SSR/(n-(k+1))}{SST/(n-1)} = 1 - \dfrac{\sum_{i=1}^{n}\hat{u}_i^2/(n-(k+1))}{\sum_{i=1}^{n}(y_i-\bar{y})^2/(n-1)} = 1 - \dfrac{\hat{\sigma}^2}{SST/(n-1)} = 1 - \dfrac{(1-R^2)(n-1)}{n-(k+1)}$
Using the adjusted (or corrected) R-squared introduces a penalty for each additional variable (k).
- This enables taking into account the trade-off between reducing the residuals’ variance and reducing the
degrees of freedom of the model.
- In practice, it is possible that the adjusted R-squared decreases when we add an irrelevant variable to our model.
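As a check (a minimal sketch with simulated data and hypothetical variable names), the R-squared and adjusted R-squared formulas above can be reproduced and compared with the values reported by statsmodels:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, k = 100, 2
X = rng.normal(size=(n, k))
y = 1 + X @ np.array([1.5, -0.8]) + rng.normal(size=n)

res = sm.OLS(y, sm.add_constant(X)).fit()
SSR = np.sum(res.resid ** 2)
SST = np.sum((y - y.mean()) ** 2)

r2 = 1 - SSR / SST
adj_r2 = 1 - (SSR / (n - (k + 1))) / (SST / (n - 1))

print(r2, res.rsquared)          # the two should match
print(adj_r2, res.rsquared_adj)  # and so should these
```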
Model Specification and Selection (6)
There are also alternatives (which are usually preferred for reasons that are beyond the scope of this course) to
the adjusted R-squared to select the model’s specification in terms of the variables to be included:
Information Criteria:
- Akaike information criterion: $AIC = \ln(\hat{\sigma}_u^2) + \dfrac{2k}{n}$
- Schwarz (or Bayesian) information criterion: $BIC = \ln(\hat{\sigma}_u^2) + \dfrac{k\ln(n)}{n}$
- Where $\hat{\sigma}_u^2$ is the estimated variance of the error term, k is the number of explanatory variables included in the model and n is the number of observations.
These statistics are automatically displayed by GRETL after any estimation.
Decision rule: select the model with the minimum value of a given information criterion (in case of inconsistency between the two, use the BIC); notice that the sign must be considered, since they can take negative values (e.g. BIC of model A = −325, BIC of model B = −356 → model B is preferable).
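A sketch (illustrative data, assumed names) of the information criteria as written on the slide. Note that software such as GRETL or statsmodels reports likelihood-based versions, which differ in scale and constants but lead to the same model ranking:

```python
import numpy as np
import statsmodels.api as sm

def slide_aic_bic(res):
    n = int(res.nobs)
    k = res.df_model                     # number of explanatory variables (excluding the constant)
    sigma2 = np.sum(res.resid ** 2) / n  # one common convention for the error-variance estimate
    aic = np.log(sigma2) + 2 * k / n
    bic = np.log(sigma2) + k * np.log(n) / n
    return aic, bic

# Example: a model with one regressor versus one with an extra (irrelevant) regressor
rng = np.random.default_rng(2)
n = 150
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 2 + 1.0 * x1 + rng.normal(size=n)

m1 = sm.OLS(y, sm.add_constant(x1)).fit()
m2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print("model 1 (AIC, BIC):", slide_aic_bic(m1))
print("model 2 (AIC, BIC):", slide_aic_bic(m2))   # pick the model with the smaller values
```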
Model Specification and Selection (7)
Does this mean that we have to include “all” the available variables in the model as long as we observe an
increase in the adjusted R-squared?
The general answer is NO. This is because:
- Often additional variables end up being highly correlated, which means that with more variables there is an increasing risk of having excessive multicollinearity in the model (see later).
- In applied regression analysis, you usually focus on one or two main variables of interest and include other
explanatory variables as “controls” (i.e. to keep fixed other factors that may co-vary with the outcome).
Therefore, you should avoid including other variables that are already “captured” by the controls that are in
the model.
- When you include additional variables, you should always try to avoid variables that are:
i) possibly correlated with the error term of the model (since we assume that there is no correlation between the explanatory variables and the error), or
ii) possibly affected by the dependent variable of our model; that is, you should take care of potential reverse causality issues.
Model Specification and Selection (8)
- The latter case of reverse causality happens when the direction of the causal relationship between variables is
not clear.
- Reverse causality means that although there exists an effect of X on Y, it is also true that Y affects X, that is, Y
and X are simultaneously determined.
For example, in case we want to estimate the demand for a given product, prices affect quantity but quantity
affects prices as well.
- If we specify an equation for the “demanded quantity” of a given product, we cannot avoid including its price in the equation.
Otherwise, the price would be a relevant omitted variable, generating bias in the estimated coefficients.
- However, there exists reverse causality between prices and quantity: if we estimated the equation for the “price of the product”, we could not avoid including the quantity (since the supplied quantity of a given product is a relevant determinant of its price).
The existence of reverse causality “automatically” generates correlation between the explanatory variables and
the error term.
Model Specification and Selection (9)
- The existence of reverse causality “automatically” generates correlation between the explanatory variables and
the error term.
To see why, consider the relationship between the demanded quantity of a given product (q) and its price (p):
$q_i = \alpha + \beta p_i + \varepsilon_i \qquad\qquad p_i = \gamma + \delta q_i + u_i$
$p_i = \gamma + \delta(\alpha + \beta p_i + \varepsilon_i) + u_i \;\Rightarrow\; p_i = \dfrac{\gamma + \delta\alpha}{1-\delta\beta} + \dfrac{\delta}{1-\delta\beta}\,\varepsilon_i + \dfrac{1}{1-\delta\beta}\,u_i = \alpha^* + \delta^*\varepsilon_i + u_i^*$
Hence, E[p·ε] ≠ 0, since ε appears in the equation that explains p (after substituting the expression for q), which invalidates the OLS estimator.
- Take-home message of this discussion: take care with the variables that you include in your model; try to follow the theory (if it exists), existing studies and, especially, “common sense”.
- In case the issue that you want to investigate inevitably involves a) reverse causality issues, b) the inclusion of variables that are possibly related with the unobserved determinants of your outcome of interest and/or c) crucial variables that are unobservable (e.g. managerial ability), OLS is probably not an adequate method and you should use other regression methods, such as the Instrumental Variables estimator (not explained in this course…).
Functional Form Specification (1)
Another important issue to be considered in applied regression analysis is the functional form that is assumed
to describe the relationship between the dependent variable and the explanatory variables of our regression
model.
- Considering the standard regression model of the form,
𝑦𝑖 = 𝛼 + 𝛽1 𝑥1𝑖 + 𝛽2 𝑥2𝑖 + 𝑢𝑖
We can see that it implicitly assumes a linear functional form to model the behavior of y.
- One simple way of introducing non-linearities in the dependency of y on the xs is by applying logs to the
variables, for example,
ln(𝑦𝑖 ) = 𝛼 + 𝛽1 ln(𝑥1𝑖 ) + 𝛽2 𝑥2𝑖 + 𝑢𝑖
From the previous topics, we know that on top of a different interpretation of the beta coefficients (i.e. elasticities or % changes), taking logs of the variables has several advantages (and some drawbacks).
- On top of that, if the real model that explains y as a function of the xs is expressed in logs (e.g. a Cobb-Douglas function), the logs should be there, and not considering logarithms would imply a specification error due to neglected non-linearity.
How can we know whether a log-log model is preferable to a linear model?
Functional Form Specification (2)
Moreover, in some cases the “marginal effect” of x on y may not be constant for different values of x.
- It is possible to allow for increasing or decreasing marginal effect of a variable by specifying quadratic variables,
that is:
$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{2i}^2 + u_i$
In this case the marginal effect of x1 on y is constant for each value of x1 (i.e. $\partial y_i/\partial x_{1i} = \hat{\beta}_1$), whereas the effect of x2 on y is allowed to be different for different values of x2 (i.e. $\partial y_i/\partial x_{2i} = \hat{\beta}_2 + 2\hat{\beta}_3 x_{2i}$).
- Provided that $\hat{\beta}_3$ is statistically different from zero (otherwise the effect would be linear), the relationship between x2 and y would be:
Concave if $\hat{\beta}_2 > 0$ and $\hat{\beta}_3 < 0$ → the effect of x2 increases, up to a certain point, and then decreases.
Convex if $\hat{\beta}_2 < 0$ and $\hat{\beta}_3 > 0$ → the effect of x2 first decreases, up to a certain point, and then increases.
What is the value of x2 at which its effect on y reaches the maximum/minimum? ⟹ $x_2^* = \dfrac{\hat{\beta}_2}{-2\hat{\beta}_3}$
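As a quick illustration (a sketch with simulated data and assumed coefficient values, not from the slides), the turning point can be recovered from an estimated quadratic specification:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)
x2 = rng.uniform(0, 10, size=n)
y = 1 + 0.5 * x1 + 2.0 * x2 - 0.2 * x2 ** 2 + rng.normal(size=n)  # concave in x2, maximum at x2 = 5

X = sm.add_constant(np.column_stack([x1, x2, x2 ** 2]))
res = sm.OLS(y, X).fit()
b2, b3 = res.params[2], res.params[3]
print("estimated turning point:", -b2 / (2 * b3))   # should be close to 5
```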
- Notice that in case the relationship of y with a given x is non-linear and we neglect including quadratic effects, this would be similar to the omission of relevant variables in the regression model (→ biased coefficients).
- Again, how can we test whether the model suffers from misspecification due to neglected non-linearities?
The RESET Test (1)
One possible way of testing for misspecification problems due to neglected non-linearities in our model is by means of the RESET test.
- Considering, for simplicity, the model with two explanatory variables:
𝑦𝑖 = 𝛼 + 𝛽1 𝑥1𝑖 + 𝛽2 𝑥2𝑖 + 𝑢𝑖
The RESET test for General Functional Form Misspecification Errors can be implemented following these steps:
1) Predict the dependent variable of the model you want to test for misspecification:
$\hat{y}_i = \hat{\alpha} + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i}$
2) Then run an auxiliary equation, in which you regress your original dependent variable on the same set of
control variables plus the squared and cubic terms of the prediction obtained in step 1, that is:
$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \delta_2 \hat{y}_i^2 + \delta_3 \hat{y}_i^3 + v_i$
3) Test the following null and alternative hypotheses by means of an F-statistic:
$H_0: \delta_2 = \delta_3 = 0$
$H_1: \delta_2 \neq 0$ and/or $\delta_3 \neq 0$
The RESET Test (2)
3) Test the joint null hypothesis $H_0: \delta_2 = \delta_3 = 0$
- If H0 is not rejected, there is no problem of misspecification in the model due to neglected non-linearities.
Intuition: if the quadratic and cubic terms of the prediction are significant variables in the equation for y, some non-linearity has been neglected, because $\hat{y}_i^2$ and $\hat{y}_i^3$ are non-linear transformations of the explanatory variables.
- The RESET test makes it possible to discriminate between specifications, but it does not explicitly indicate how to proceed if the null hypothesis (of no functional form errors) is rejected.
In practice, one may start with a linear model and check whether the test rejects H0 or not.
If this is the case, it is possible to try considering logs of the variables (when possible) and apply the RESET test again.
If H0 is rejected again, it is possible to include quadratic effects (or even higher-order polynomial terms, when justified by some theoretical argument).
If H0 is rejected again, maybe there are other problems in the model, like the omission of relevant variables (i.e. the RESET test sometimes confounds non-linearities with omitted variables).
Overall, theoretical guidance and the analysis of previous research in the literature are the best criteria for model specification (the test should only support this process).
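The steps above can be implemented directly. The following is a minimal sketch with simulated data (the variable names and data-generating process are illustrative assumptions; recent versions of statsmodels also provide a ready-made RESET test, linear_reset in statsmodels.stats.diagnostic, which could be used instead):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 300
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1 + x1 + 0.5 * x2 + 0.7 * x2 ** 2 + rng.normal(size=n)   # true model is non-linear in x2

# Step 1: estimate the (possibly misspecified) linear model and get the fitted values
X = sm.add_constant(np.column_stack([x1, x2]))
base = sm.OLS(y, X).fit()
yhat = base.fittedvalues

# Step 2: auxiliary regression adding the squared and cubic fitted values
X_aux = np.column_stack([X, yhat ** 2, yhat ** 3])
aux = sm.OLS(y, X_aux).fit()

# Step 3: F-test of H0: delta2 = delta3 = 0 (the coefficients on yhat^2 and yhat^3)
R = np.array([[0, 0, 0, 1, 0],
              [0, 0, 0, 0, 1]])
print(aux.f_test(R))   # a small p-value signals neglected non-linearity
```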
Multicollinearity (1)
Apart from the model’s specification in terms of functional form and of which variables have to be included or excluded, another relevant issue concerns the possible relationship between explanatory variables.
- One of the assumptions of the multiple linear regression states that the rank of the matrix X should be full,
which means that none of the variables included in the model should be a perfect linear combination of the
other variables.
- In practice, it is possible that the variables that we may want to include in our regression present a strong
correlation between them, because they are related elements or simply because they are capturing similar
things.
The presence of a strong relationship between different explanatory variables is called multicollinearity and
can affect the precision (i.e. the standard errors) of estimated coefficients obtained by OLS.
- In fact, remember that the variance of a single estimated coefficient can be expressed as:
$Var(\hat{\beta}_j) = \dfrac{\hat{\sigma}^2}{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2\,(1-R_j^2)} = \dfrac{\hat{u}'\hat{u}/(n-(k+1))}{\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)^2\,(1-R_j^2)}$
If the variable xj is strongly correlated with the others → $R_j^2$ increases → $Var(\hat{\beta}_j)$ increases.
When there is perfect collinearity (the Xs are linear combinations of each other) → $R_j^2 \to 1 \Rightarrow Var(\hat{\beta}_j) \to \infty$.
Multicollinearity (2)
If the variable xj is strongly correlated with the others → $R_j^2$ increases → $Var(\hat{\beta}_j)$ increases.
When there is perfect collinearity (the Xs are linear combinations of each other) → $R_j^2 \to 1 \Rightarrow Var(\hat{\beta}_j) \to \infty$.
Multicollinearity (3)
- Since there is always some degree of correlation between different explanatory variables, we would like to know
the “threshold” at which the existing multicollinearity is “excessive”.
This would imply that our estimates are too imprecise, which invalidates statistical inference from our model.
- The Variance Inflation Factor is a tool to detect excessive multicollinearity:
$VIF_j = \dfrac{1}{1-R_j^2} \;\Rightarrow\; Var(\hat{\beta}_j) = \dfrac{\hat{\sigma}_u^2}{SST_j}\cdot VIF_j$
The value of $R_j^2$ is obtained from the auxiliary regression in which the dependent variable is xj and the explanatory variables are all the other Xs included in the model (i.e. $x_j = \mu + \gamma_1 x_1 + \gamma_2 x_2 + \dots + \gamma_{j-1} x_{j-1} + \gamma_{j+1} x_{j+1} + \dots + \gamma_k x_k + \epsilon$).
A VIF > 10 means that the degree of collinearity between xj and the other explanatory variables is “excessive”; this is equivalent to having $R_j^2 > 0.9$.
A VIF > 5 (but < 10) means that the degree of collinearity between xj and the other explanatory variables is “moderate”.
- The VIF indicates the extent to which the variance of the coefficient is “inflated” by the presence of multicollinearity:
When xj is uncorrelated with the other explanatory variables, then VIFj = 1.
If VIFj = 15, it means that the variance of $\hat{\beta}_j$ is 15 times higher than it would have been in the hypothetical case in which xj was uncorrelated with the other regressors.
- What to do if the VIF is above the value 10? If possible, increase the sample size or impose restrictions on the involved variables (or remove one of them and check what happens to the other coefficients).
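A sketch of this check in Python (illustrative data and variable names; GRETL offers an analogous collinearity/VIF diagnostic after OLS estimation):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF for each explanatory variable (skip the constant in column 0)
for j in range(1, X.shape[1]):
    print(f"VIF for x{j}: {variance_inflation_factor(X, j):.1f}")
```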
Outliers (1)
Another issue to be considered in our OLS regression model refers to the presence of “atypical observations”,
which are generally called “outliers”.
- Outliers are observations with variables’ values that are very far away from the commonly observed values in the
sample.
Outliers might be due to:
- Coding errors in some of the variables (e.g. adding an extra “0” by mistake could generate an outlier).
- Sampling from a relatively small population in which some of the sampled units are very different from all the rest.
- The presence of outliers per se is not a problem; what we should be concerned about is whether removing (or including) the outlier(s) in our regression significantly changes the results.
If dropping possible outliers from the sample changes the regression’s results in a substantial way, these will be
called “influential observations”.
An observation is influential if its exclusion from the estimation sample changes the OLS results by a large
amount (what is large ends up being an arbitrary concept).
- OLS is susceptible to outlying observations, because it minimizes the Sum of Squared Residuals: large residuals
receive a lot of weight in the least squares minimization process.
If the estimates change substantially when we slightly modify our sample by removing one or a few outliers, we should be concerned (however, there is no consensus about the solution).
Outliers (2)
How can we analyze whether we do have outliers in our data and these represent an issue for the OLS
estimation?
- First, we need to determine whether a given observation can be considered to be atypical.
A graphical analysis is often useful (on top of standard descriptive statistics, e.g. max-min values).
[Figure: two scatter plots of y versus x with least-squares fits, based on n = 37 observations with an atypical value in y. Left panel (full sample): fitted line Y = 0.782 + 1.18X, with the possible outliers marked. Right panel (same data after removing just one observation, n* = 37 − 1): fitted line Y = 3.71 + 0.851X. Removing a single observation significantly changes the estimated slope coefficient.]
Outliers (3)
How can we analyze whether we do have outliers in our data and these represent an issue for the OLS
estimation?
- Second, it is also possible to use statistical tools that reveal whether an observation 1) is an outlier and 2) has a
real influence on the estimated slope coefficient (i.e. its deletion significantly changes the estimates).
A measure that indicates the contribution of observation “i” to the predicted value of the outcome ($\hat{y}$) is the so-called “leverage” of observation “i”, $h_{ii}$:
- If $h_{ii} \geq 2\,(k+1)/n$, observation “i” has a potential influence on the slope.
- If $h_{ii} < 2\,(k+1)/n$, observation “i” does not have a potential influence on the slope.
Having computed the leverage, it is possible to obtain the studentized residuals:
$r_i = \dfrac{\hat{u}_i}{\sqrt{\hat{\sigma}_u^2\,(1-h_{ii})}}$
- Where $r_i$ follows a Student's t distribution with n−(k+1) degrees of freedom:
If $|r_i| \geq t_{n-(k+1),\,\alpha/2}$ → observation “i” can be considered an outlier.
If $|r_i| < t_{n-(k+1),\,\alpha/2}$ → observation “i” cannot be considered an outlier.
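The sketch below (simulated data, assumed names) computes the leverages and studentized residuals with statsmodels and applies the two rules of thumb above:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence
from scipy import stats

rng = np.random.default_rng(6)
n, k = 50, 1
x = rng.uniform(0, 10, size=n)
y = 1 + 0.8 * x + rng.normal(size=n)
y[0] += 15                      # plant one atypical observation

res = sm.OLS(y, sm.add_constant(x)).fit()
infl = OLSInfluence(res)

leverage = infl.hat_matrix_diag                    # h_ii
studentized = infl.resid_studentized_internal      # r_i, as defined on the slide

h_threshold = 2 * (k + 1) / n
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - (k + 1))

flagged = np.where((leverage >= h_threshold) | (np.abs(studentized) >= t_crit))[0]
print("potentially influential / outlying observations:", flagged)
```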
Outliers (4)
How can we analyze whether we do have outliers in our data and these represent an issue for the OLS
estimation?
- Second, it is also possible to use statistical tools that reveal whether an observation 1) is an outlier and 2) has a
real influence on the estimated slope coefficient (i.e. its deletion significantly changes the estimates).
In order to understand whether a given observation that has a potential influence and can be considered an outlier actually distorts the estimated values of the betas, it is possible to compute the DFFITS:
$DFFITS_i = \dfrac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{\hat{\sigma}^2_{u(i)}\,h_{ii}}}$
- Where $\hat{y}_{(i)}$ is the predicted value of “y” obtained after removing observation “i” and $\hat{\sigma}^2_{u(i)}$ is the estimated variance of the error term without observation “i” as well.
$DFFITS_i$ measures the real contribution of observation “i” to the prediction of the outcome:
- If $|DFFITS_i| > 2\sqrt{(k+1)/n}$ (or 1 if the sample is small) → observation “i” has a real influence on the slopes.
Obtaining a significant DFFITS indicates that the observed values for observation “i” should be carefully checked and, if no error in the data is detected, one should check what happens when that observation is removed from the estimation sample.
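A sketch of the DFFITS check with statsmodels (illustrative data and names; the dffits attribute returns the statistic together with the 2√((k+1)/n) threshold discussed above):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

rng = np.random.default_rng(7)
n = 40
x = rng.uniform(0, 10, size=n)
y = 2 + 1.2 * x + rng.normal(size=n)
y[5] += 20                              # plant an influential observation

res = sm.OLS(y, sm.add_constant(x)).fit()
dffits, threshold = OLSInfluence(res).dffits

influential = np.where(np.abs(dffits) > threshold)[0]
print(f"threshold = {threshold:.3f}; influential observations: {influential}")
```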