Key expressions & concepts
Bad control problem
o Bias in the OLS estimator that arises when a control variable Z, added because it is correlated with X, is itself an outcome of X
o Z (X2): regressors that could themselves be outcomes
    E.g. earnings = α + β college degree + ui
    We add a seemingly relevant regressor Z that is
        Correlated with X (so omitting it would make the OLS estimator biased & inconsistent)
        Able to explain Y
    Problem: Y as well as Z can be outcomes of X -> controlling for Z then biases the estimated causal effect of X
Causal inference
o Usually we cannot give the OLS estimate a causal interpretation
o Why not?
    Omitted variables / omitted individual characteristics that could cause Y
    Reverse causality – Y causes X
o Both problems violate the zero conditional mean assumption E(ui|Xi) = 0, which a causal interpretation requires
Central limit theorem
o When n is large, the distribution of averages of i.i.d. random variables is approximately normal (see the simulation sketch below)
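A minimal simulation sketch (not part of the original notes; all numbers are made up): sample averages of draws from a skewed distribution behave approximately like a normal with standard deviation σ/√n.

```python
# Minimal sketch: sample means of i.i.d. draws are approximately normal for large n.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 10_000          # observations per sample, number of samples

# Draw from a heavily skewed (exponential) population with mean = 1, sd = 1
samples = rng.exponential(scale=1.0, size=(reps, n))
means = samples.mean(axis=1)   # one sample average per replication

# The CLT says: means ~ approx. N(mu, sigma^2 / n)
print("mean of sample averages:", means.mean())   # close to 1
print("sd of sample averages:  ", means.std())    # close to 1/sqrt(n) ~= 0.0447
print("theoretical sd:         ", 1 / np.sqrt(n))
```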
Cross-sectional data
o Data on different entities (e.g. workers, consumers, firms, etc.) for a single
time period
o E.g. data on test scores in California -> data for 420 entities (school districts) for a single time period (1999)
Errors-in-variables bias in the OLS estimator
o When an independent variable (X) is measured imprecisely
o This bias persists even in large samples (the OLS estimator is inconsistent); see the formula below
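For reference, the standard classical measurement error result (not derived in these notes): if the observed regressor is X plus noise w that is uncorrelated with X and with u, the OLS slope is attenuated toward zero even as n grows:

```latex
\hat{\beta}_1 \;\xrightarrow{\;p\;}\; \frac{\sigma_X^2}{\sigma_X^2 + \sigma_w^2}\,\beta_1
```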
Error term ui
o All factors, other than Xi, that are determinants of Yi
Endogeneity
o X is correlated with the error term (e.g. Y = income, X = education, u contains skill -> X is correlated with u)
Exogeneity
o X is not correlated with the error term
o X is determined by other factors outside the model
Hausman Test
o Test for endogeneity of regressor X
Why? Using an instrument is only necessary if Xi is endogenous
(correlated with u)
o Test: H0: E[u|X] = 0
o Under H0, both the TSLS (FE) & OLS (RE) estimators are consistent, but OLS (RE) is more efficient
o Under H1, only the TSLS (FE) estimator is consistent
o If H-statistic > critical value (e.g. at the 5% level), then H0 is rejected -> X is endogenous (correlated with u) – see the sketch below
o If we reject H0: we prefer the FE model
o Panel-data version of the test:
    H0: RE is appropriate
    H1: FE is appropriate
    Result: Prob>chi2 = 0 -> reject H0 -> use FE
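A minimal sketch of the endogeneity-test idea, using the regression-based (Durbin–Wu–Hausman) variant on simulated data; all variable names and numbers are made up. (The RE-vs-FE comparison quoted above is usually run with canned routines, e.g. Stata's hausman command.)

```python
# Sketch of a regression-based Hausman (Durbin-Wu-Hausman) endogeneity test
# on simulated data; variable names are made up for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2_000
z = rng.normal(size=n)                      # instrument
e = rng.normal(size=n)                      # common shock -> makes x endogenous
x = 0.8 * z + e + rng.normal(size=n)        # endogenous regressor
y = 1.0 + 2.0 * x + e + rng.normal(size=n)  # structural equation, u = e + noise

# 1st stage: regress x on the instrument, keep the residuals v_hat
first = sm.OLS(x, sm.add_constant(z)).fit()
v_hat = first.resid

# Augmented regression: y on x and v_hat.
# H0 (x exogenous): coefficient on v_hat = 0.  A small p-value rejects H0.
aug = sm.OLS(y, sm.add_constant(np.column_stack([x, v_hat]))).fit()
print("t-stat on v_hat:", aug.tvalues[2], " p-value:", aug.pvalues[2])
```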
Homoskedasticity
o The error variance measures how far the observations scatter around the regression line
o Var(ui|Xi) = constant (the error variance does not vary systematically with X)
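In symbols (standard notation, not spelled out in the notes):

```latex
\operatorname{Var}(u_i \mid X_i) = \sigma_u^2 \ \text{ for all } i
\qquad \text{vs. heteroskedasticity: } \operatorname{Var}(u_i \mid X_i) \text{ depends on } X_i
```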
i.i.d. independently & identically distributed
o sample is randomly drawn from the population (independent)
o all observations in sample are drawn from same distribution (identically
distributed)
Multicollinearity
o high intercorrelations among two or more independent variables in a
multiple regression model
o one of the 4 assumptions in multiple regression: “no perfect multicollinearity”
    if one of the regressors is a perfect linear function of the other regressors
        e.g. you want to estimate the coefficient on STR in a regression of TestScorei on STRi and PctELi, but you make a typo and accidentally type in STRi a second time instead of PctELi -> now you regress TestScorei on STRi and STRi -> perfect multicollinearity (see the sketch below)
    then: it is impossible to compute the OLS estimator
    solution: usually just modify the regressors to eliminate the problem
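A minimal sketch (simulated data, made-up numbers) of why the typo example above breaks OLS: with a duplicated regressor, X'X is rank deficient, so the inverse in the OLS formula does not exist.

```python
# Sketch: duplicating a regressor makes X'X singular, so (X'X)^(-1) does not
# exist and the OLS formula cannot be computed.  Data are simulated.
import numpy as np

rng = np.random.default_rng(2)
n = 100
str_ = rng.normal(20, 2, size=n)                   # stand-in for STR
X = np.column_stack([np.ones(n), str_, str_])      # constant, STR, STR again (the "typo")

XtX = X.T @ X
print("rank of X'X:", np.linalg.matrix_rank(XtX), "of", XtX.shape[0])  # rank 2 < 3

try:
    np.linalg.inv(XtX)                             # exactly singular: duplicated column
except np.linalg.LinAlgError as err:
    print("cannot invert X'X:", err)               # perfect multicollinearity
```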
Multiple regression
o A method that can eliminate omitted variable bias
How? If we have data on the omitted variables, then we can include
them as additional regressors and thereby estimate the causal effect of
one regressor while holding constant the other variables
o Also a method for making better predictions than a single-regressor model, by using multiple variables as predictors
OLS estimator
o A method to estimate the unknown parameters in a linear regression model
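For reference, the standard single-regressor formulas (not written out in the notes): the OLS estimators minimize the sum of squared residuals Σ(Yi − b0 − b1Xi)², giving

```latex
\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n}(X_i-\bar{X})^2},
\qquad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
```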
Omitted variables
o Variables that are left out of the regression
Omitted variables bias
o If the regressor (X) is correlated with a variable that has been omitted from
the analysis (variable in u) and that determines, in part, the dependent
variable (Y), then the OLS estimator will have omitted variable bias
o (1) the omitted variable is correlated with the regressor X (so X and u are correlated)
o (2) the omitted variable is also a determinant of Y
o Then the first least squares assumption E(ui|Xi) = 0 does not hold -> OLS estimator is biased & inconsistent (see the simulation sketch below)
o Solutions
use of instrumental variables regressions (IV)
panel data estimation
use of randomized controlled experiments
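A minimal simulation sketch (not from the notes; data and coefficients are made up) showing the bias, and how including the omitted variable as an additional regressor (multiple regression) removes it.

```python
# Sketch: omitted variable bias on simulated data, and the multiple-regression fix.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 5_000
z = rng.normal(size=n)                             # omitted variable (e.g. "ability")
x = 0.7 * z + rng.normal(size=n)                   # x is correlated with z
y = 1.0 + 2.0 * x + 1.5 * z + rng.normal(size=n)   # z also determines y; true beta_x = 2

short = sm.OLS(y, sm.add_constant(x)).fit()                        # z omitted -> biased
long_ = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()  # z included -> unbiased

print("beta_x, z omitted:  ", round(short.params[1], 3))   # noticeably above 2
print("beta_x, z included: ", round(long_.params[1], 3))   # close to 2
```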
Overfitting
o When a model is too complex, it begins to describe the random error in the data rather than the relationships between the variables
o Result: misleading R2 values, regression coefficients and p-values
Panel Structure
o Allows us to control for unobserved heterogeneity
o Mitigates omitted variable bias
Randomized controlled experiment
o Controlled: there are both a control group that receives no treatment and a
treatment group that receives treatment
o Randomized: the treatment is assigned randomly ; randomly pick who gets
the treatment
Sargan Test (J-Test)
o Tests exogeneity of instruments
o If we have more instruments (m) than endogenous regressors (k), i.e. m > k, the coefficients are overidentified
o In case of overidentification (m > k): we can test instrument exogeneity with a J-Test
o H0: all instruments are exogenous
o Results of J-Test (see the sketch below)
    J-statistic > 5% critical value: reject H0 -> at least one instrument is endogenous
    If the TSLS estimates obtained with the different instruments are close to each other -> all tested instruments are plausibly exogenous
    If one instrument produces very different estimates -> one or both instruments are probably not exogenous
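A minimal sketch of the J-test logic on simulated data with two instruments for one endogenous regressor; everything here (names, numbers, the manual 2SLS) is illustrative, and the nR² (Sargan) form of the statistic is used.

```python
# Sketch of an overidentification (Sargan/J) test on simulated data with
# m = 2 instruments for k = 1 endogenous regressor.  Names and data are made up.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(4)
n = 5_000
z1, z2 = rng.normal(size=n), rng.normal(size=n)   # two (valid) instruments
e = rng.normal(size=n)                            # shock that makes x endogenous
x = 0.6 * z1 + 0.4 * z2 + e + rng.normal(size=n)
y = 1.0 + 2.0 * x + e + rng.normal(size=n)

Z = sm.add_constant(np.column_stack([z1, z2]))    # instruments (+ constant)

# Manual 2SLS: 1st-stage fitted values, then 2nd stage
x_hat = sm.OLS(x, Z).fit().fittedvalues
b = sm.OLS(y, sm.add_constant(x_hat)).fit().params

# 2SLS residuals must be computed with the ORIGINAL x, not x_hat
u_hat = y - (b[0] + b[1] * x)

# Sargan statistic: n * R^2 from regressing the 2SLS residuals on the instruments;
# under H0 (all instruments exogenous) it is approx. chi2 with m - k = 1 d.o.f.
aux = sm.OLS(u_hat, Z).fit()
J = n * aux.rsquared
print("J =", round(J, 3), " p-value =", round(1 - stats.chi2.cdf(J, df=1), 3))
```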
Two stage least squares (TSLS)
o If the instrument Z satisfies the conditions of instrument relevance and
exogeneity, the coefficient β1 can be estimated using an IV estimator (TSLS)
o 1st stage
    Decompose X into two components: a problematic component that may be correlated with the regression error and another, problem-free component that is uncorrelated with the error
o 2nd stage
    Use the problem-free component to estimate β1 (see the sketch below)
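A minimal "by hand" sketch of the two stages on simulated data (names and numbers are made up). Running the stages manually reproduces the TSLS coefficient but not the correct standard errors; dedicated TSLS routines adjust them.

```python
# Sketch of TSLS "by hand" on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 5_000
z = rng.normal(size=n)                      # instrument: relevant & exogenous
e = rng.normal(size=n)                      # shock in u, correlated with x
x = 0.8 * z + e + rng.normal(size=n)        # endogenous regressor
y = 1.0 + 2.0 * x + e + rng.normal(size=n)  # true beta_1 = 2

ols = sm.OLS(y, sm.add_constant(x)).fit()                 # biased (x is endogenous)

# 1st stage: isolate the "problem-free" part of x predicted by z
x_hat = sm.OLS(x, sm.add_constant(z)).fit().fittedvalues
# 2nd stage: regress y on the problem-free component
tsls = sm.OLS(y, sm.add_constant(x_hat)).fit()

print("OLS  beta_1:", round(ols.params[1], 3))    # away from 2
print("TSLS beta_1:", round(tsls.params[1], 3))   # close to 2
```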
Validity – internal
o Statistical inferences about causal effects are valid for the population being
studied
o Conditions
(1) OLS estimator needs to be unbiased and consistent
(2) Hypothesis tests should have the desired significance level, and confidence intervals should have the desired confidence level (computed from the standard error – SEs should be consistent)
Validity – external
o Statistical inferences about causal effects can be generalized from the
population and setting studied to other populations and settings
Weak instruments
o Instrumental variables that have low predictive power for the endogenous regressor X
o Valid instruments (Z) should be
(1) relevant – Z highly correlated with X
(2) exogenous – Z is correlated with Y solely through its correlation with
X; so Z is uncorrelated with the error term u
o Test for instrument relevance
    Investigate the first-stage F-statistic (we want: at least one Z has a coefficient ≠ 0 in the 1st stage – then the instrument is not weak)
    If F > 10, then the instruments are not weak, i.e. relevant (rule of thumb) – see the sketch below
o Test for exogeneity
    Difficult to test directly; with overidentification, use the J-Test (see above)
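A minimal sketch of the relevance check on simulated data (made-up names and numbers): the first-stage F-statistic tests whether all instrument coefficients are zero in the first stage.

```python
# Sketch: first-stage F-statistic for instrument relevance, on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 2_000
z1, z2 = rng.normal(size=n), rng.normal(size=n)   # instruments
x = 0.5 * z1 + 0.3 * z2 + rng.normal(size=n)      # endogenous regressor

# 1st stage: regress x on the instruments (plus a constant).  With no other
# exogenous regressors, the overall regression F is the first-stage F.
X1 = sm.add_constant(np.column_stack([z1, z2]))
first = sm.OLS(x, X1).fit()

print("first-stage F:", round(first.fvalue, 2))   # rule of thumb: want F > 10
```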
Within estimator
o Exploits the within-individual variation (over time), i.e. deviations from each individual's mean (see the sketch below)
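A minimal sketch (simulated panel, made-up names) of the within transformation: demean y and x by individual, then run OLS on the demeaned data.

```python
# Sketch of the within (fixed effects) transformation on a simulated panel.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
n_id, n_t = 300, 5
ids = np.repeat(np.arange(n_id), n_t)
alpha = rng.normal(size=n_id)[ids]               # unobserved individual effect
x = 0.5 * alpha + rng.normal(size=n_id * n_t)    # x correlated with the effect
y = 2.0 * x + alpha + rng.normal(size=n_id * n_t)

df = pd.DataFrame({"id": ids, "x": x, "y": y})

# Within transformation: subtract each individual's mean from y and x
demeaned = df[["x", "y"]] - df.groupby("id")[["x", "y"]].transform("mean")

fe = sm.OLS(demeaned["y"], demeaned["x"]).fit()  # within estimator (no constant needed)
print("within estimate of beta:", round(fe.params.iloc[0], 3))   # close to 2
```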