12.4 Panel Data - A Guide On Data Analysis
Characteristics
Information both across individuals and over time (cross-sectional and time-series)
Types
Nonlinear
Seasonality
Discontinuous shocks
Regressors
Time-invariant regressors $x_{it} = x_i$ for all $t$ (e.g., gender, race, education) have zero within variation
Individual-invariant regressors $x_{it} = x_t$ for all $i$ (e.g., time trend, economy trends) have zero between variation
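| Estimate | Formula |
|---|---|
| Individual mean | $\bar{x}_i = \frac{1}{T} \sum_t x_{it}$ |
| Overall mean | $\bar{x} = \frac{1}{NT} \sum_i \sum_t x_{it}$ |
| Overall variance | $s_O^2 = \frac{1}{NT-1} \sum_i \sum_t (x_{it} - \bar{x})^2$ |
| Between variance | $s_B^2 = \frac{1}{N-1} \sum_i (\bar{x}_i - \bar{x})^2$ |
| Within variance | $s_W^2 = \frac{1}{NT-1} \sum_i \sum_t (x_{it} - \bar{x}_i)^2$, i.e., the variance of the recentered values $x_{it} - \bar{x}_i + \bar{x}$ around $\bar{x}$ |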
Note: $s_O^2 \approx s_B^2 + s_W^2$
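As a rough illustration of the table and the note above, the following base-R sketch computes the three variances on a simulated balanced panel. The data, the seed, and all variable names are hypothetical. In a balanced panel the sums of squares decompose exactly, which is why the variances only add up approximately once each is divided by its own denominator.

```r
# Simulated balanced panel: N individuals, each observed T_periods times (illustrative values)
set.seed(1)
N <- 50
T_periods <- 6
id <- rep(seq_len(N), each = T_periods)
x <- rnorm(N)[id] + rnorm(N * T_periods)   # individual component + idiosyncratic noise

x_bar   <- mean(x)                          # overall mean
x_bar_i <- as.numeric(tapply(x, id, mean))  # one mean per individual

s2_O <- sum((x - x_bar)^2) / (N * T_periods - 1)        # overall variance
s2_B <- sum((x_bar_i - x_bar)^2) / (N - 1)              # between variance
s2_W <- sum((x - x_bar_i[id])^2) / (N * T_periods - 1)  # within variance

c(overall = s2_O, between = s2_B, within = s2_W,
  between_plus_within = s2_B + s2_W)
```

A time-invariant regressor would give `s2_W` equal to zero in this sketch, and an individual-invariant regressor would give `s2_B` equal to zero.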
Since we have $n$ observations for each time period $t$, we can control for each time effect separately by including time dummies (time effects).
Note: we cannot use this many time dummies in time-series data because there $n = 1$. Hence, there is no variation within a period, and often too few observations relative to the number of variables to estimate the coefficients.
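A minimal sketch of time dummies in base R, assuming a simulated panel in which a common period shock affects both the regressor and the outcome. The shock values, the true slope of 1.5, and all names are made up for illustration.

```r
set.seed(2)
N <- 200
T_periods <- 5
time  <- rep(seq_len(T_periods), times = N)
shock <- c(0, 2, -1, 3, 1)[time]           # common shock in each period (hypothetical values)
x <- 0.5 * shock + rnorm(N * T_periods)    # regressor correlated with the period shock
y <- 1.5 * x + shock + rnorm(N * T_periods)

b_no_td <- coef(lm(y ~ x))["x"]            # ignores time effects, so the shock biases the slope
fit_td  <- lm(y ~ x + factor(time))        # one dummy per period (first period is the baseline)
b_td    <- coef(fit_td)["x"]

c(without_time_dummies = unname(b_no_td), with_time_dummies = unname(b_td))
```

With many individuals per period, each period dummy is estimated from $n$ observations; with a single time series there would be one observation per dummy and nothing left to estimate the slope.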
Unobserved Effects Model: similar to group clustering, assume that there is a random effect that captures differences across individuals but is constant over time.
$$y_{it} = x_{it}\beta + c_i + u_{it}$$
where
$$c_i + u_{it} = \epsilon_{it}$$
https://bookdown.org/mike/data_analysis/panel-data.html 2/38
5/18/23, 11:31 AM 12.4 Panel Data | A Guide on Data Analysis
$$E(x_{it}'(c_i + u_{it})) = 0$$
If A4 does not hold, OLS is still consistent but not efficient, and we need cluster robust SE.
Pooled OLS gives consistent coefficient estimates under A1, A2, A3a (for both $u_{it}$ and the RE assumption), and A5 (random sampling across $i$).
The Random Effects estimator is the Feasible GLS estimator that assumes $u_{it}$ is serially uncorrelated and homoskedastic.
Under A1, A2, A3a (for both $u_{it}$ and the RE assumption), and A5 (random sampling across $i$), the RE estimator is consistent.
If the RE assumption does not hold ($E(x_{it}' c_i) \neq 0$), then A3a does not hold ($E(x_{it}' \epsilon_{it}) \neq 0$).
Hence, the OLS and RE are inconsistent/biased (because of omitted variable bias)
However, FE can only fix bias due to time-invariant factors (both observable and unobservable) correlated with the treatment, not time-varying factors that are correlated with the treatment.
The traditional FE technique is flawed when lagged dependent variables are included in the
model. (Nickell 1981) (Narayanan and Nair 2013)
Variables that are time constant will be absorbed into ci . Hence we cannot make
inference on time constant independent variables.
If you are interested in the effects of time-invariant variables, you could consider the
OLS or between estimator
It’s recommended that you should still use cluster robust standard errors.
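The omitted variable bias described above, and the way the within transformation removes $c_i$, can be illustrated with a small base-R simulation. This is only a sketch under assumed values (true slope of 2, an unobserved effect correlated with the regressor); it does not use plm, and all names are hypothetical.

```r
set.seed(3)
N <- 300
T_periods <- 4
id  <- rep(seq_len(N), each = T_periods)
c_i <- rnorm(N)[id]                         # unobserved, time-constant effect
x   <- 0.8 * c_i + rnorm(N * T_periods)     # regressor correlated with c_i (RE assumption fails)
y   <- 2 * x + c_i + rnorm(N * T_periods)   # true beta = 2

b_pooled <- coef(lm(y ~ x))["x"]            # pooled OLS: c_i is an omitted variable

# Within transformation: subtracting individual means removes c_i
x_w  <- x - ave(x, id)
y_w  <- y - ave(y, id)
b_fe <- coef(lm(y_w ~ x_w - 1))["x_w"]

c(pooled = unname(b_pooled), within = unname(b_fe))
```

Any regressor that is constant within an individual would be identically zero after the same demeaning step, which is why its coefficient cannot be estimated by FE.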
where
$$c_i = \begin{cases} 1 & \text{if observation is } i \\ 0 & \text{otherwise} \end{cases}$$
Since fixed effects is a within estimator, only status changes contribute to the variation that identifies $\beta$.
Hence, with a small number of changes, the standard error for $\beta$ will explode.
Status changes mean subjects change from (1) control to treatment group or (2) treatment to control group. We call those who have a status change switchers.
(more on this in
Issues:
There could be fundamental differences between switchers and non-switchers. Even though we cannot definitively test this, descriptive statistics on switchers and non-switchers can give us confidence in our conclusion.
Because fixed effects focus on bias reduction, you might have larger variance (typically, with fixed effects you will have fewer degrees of freedom).
If the true model is random effects, economists typically don’t care, especially when $c_i$ is the random effect and $c_i \perp x_{it}$ (because the RE assumption is that $c_i$ is unrelated to $x_{it}$). The reason economists don’t care is that RE would not correct bias; it only improves efficiency.
FE removes bias from time-invariant factors, but not without cost: because it uses within variation, it imposes a strict exogeneity assumption on $u_{it}$:
$$E[(x_{it} - \bar{x}_i)(u_{it} - \bar{u}_i)] = 0$$
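As a sketch of where this condition comes from, assume the unobserved effects model $y_{it} = x_{it}\beta + c_i + u_{it}$ and let $\bar{y}_i$, $\bar{x}_i$, $\bar{u}_i$ denote individual averages over time. Averaging over $t$ and subtracting gives:

```latex
\begin{aligned}
y_{it} &= x_{it}\beta + c_i + u_{it} \\
\bar{y}_i &= \bar{x}_i\beta + c_i + \bar{u}_i \\
y_{it} - \bar{y}_i &= (x_{it} - \bar{x}_i)\beta + (u_{it} - \bar{u}_i)
\end{aligned}
```

OLS on the last line needs the demeaned regressor to be uncorrelated with the demeaned error. Because $\bar{u}_i$ contains the errors from every period, this requires $x_{it}$ to be uncorrelated with $u_{is}$ for all $s$, not only for $s = t$.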
Recall
$$\hat{\sigma}_\epsilon^2 = \frac{SSR_{OLS}}{NT - K}$$
$$\hat{\sigma}_u^2 = \frac{SSR_{FE}}{NT - (N + K)} = \frac{SSR_{FE}}{N(T-1) - K}$$
It is ambiguous whether the estimated error variance goes up or down, because adding the individual effects cannot increase SSR while the denominator also decreases.
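The two variance estimates can be compared numerically. The sketch below uses simulated data and the LSDV form of FE in base R. Treating $K$ as the number of slope coefficients (intercept not counted) is an assumption made here so that the formulas above can be applied directly.

```r
set.seed(4)
N <- 100
T_periods <- 5
K <- 1                                     # number of slope coefficients (assumption: intercept not counted)
id  <- rep(seq_len(N), each = T_periods)
c_i <- rnorm(N)[id]                        # individual effect
x   <- rnorm(N * T_periods)
y   <- 1 + 0.5 * x + c_i + rnorm(N * T_periods)

ssr_ols <- sum(resid(lm(y ~ x))^2)               # pooled OLS residual sum of squares
ssr_fe  <- sum(resid(lm(y ~ x + factor(id)))^2)  # FE via individual dummies (LSDV)

sigma2_eps <- ssr_ols / (N * T_periods - K)
sigma2_u   <- ssr_fe / (N * (T_periods - 1) - K)

c(ssr_ols = ssr_ols, ssr_fe = ssr_fe, sigma2_eps = sigma2_eps, sigma2_u = sigma2_u)
```

In this simulation the individual effect is large, so the drop in SSR dominates the loss of degrees of freedom; with a negligible individual effect the comparison could go the other way.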
FE can be unbiased, but not consistent (i.e., not converging to the true effect)
12.4.2.2.5 FE Examples
Intergenerational mobility
If we transfer resources to low income family, can we generate upward mobility (increase
ability)?
$$\frac{\%\Delta \text{Human capital}}{\%\Delta \text{income}}$$
4. Financial transfer
Income measures:
2. Wage income
3. Non-wage income
4. Annual versus permanent income
Uncontrolled by mothers:
mother race
location of birth
education of parents
where
i = test
j = individual (child)
t = time
Grandmother’s model
Since child is nested within mother and mother nested within grandmother, the fixed effect of
child is included in the fixed effect of mother, which is included in the fixed-effect of
grandmother
where
Grandma fixed-effect
Pros:
control for some genetics + fixed characteristics of how mothers are raised
Con:
Error rate on survey can help you fix this (plug in the number only , but not the uncertainty
associated with that number).
where
i student
j instructor
c course
t time
where $\mu_{jc}$ is the instructor-by-course fixed effect (unique id), which is different from $(\theta_j + \delta_c)$
Identification strategy is
The least we can say about $\theta_j$ is that it is the teacher effect conditional on student test score.
$$e_{ijt} = T_{it}\delta_j + \tilde{e}_{ijt}$$
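where
$$u_{ijkt} = W_i + P_k + \epsilon_{ijkt}$$
$$\frac{1}{N_j}\sum_{i=1}^{N_j}(W_i + P_k + \epsilon_{ijkt})$$
then we can
$$P_k + \frac{1}{N_j}\sum_{i=1}^{N_j}(W_i + \epsilon_{ijkt})$$
$$\frac{1}{N_j}\sum_{i=1}^{N_j}\epsilon_{ijkt} \neq 0$$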
Even if we create a random teacher fixed effect and put it in the model, it still contains the bias mentioned above, which can still affect $\tau$ (but we do not know in which direction, whether more positive or more negative).
If teachers switch schools, then we can estimate both teacher and school fixed effect
(mobility web thin vs. thick)
Mobility web refers to the web of switchers (i.e., from one status to another).
If we want to examine teacher fixed effects, we have to include them in the model.
Controlling for school, the article argues that there is no selection bias.
For $\frac{1}{N_j}\sum_{i=1}^{N_j}\epsilon_{ijkt}$ (teacher-level average residuals), $var(\tau)$ does not change with $N_j$ (Figure 2 in the paper). In words, the quality of teachers is not a function of the number of students.
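$$\hat{\tau}_j = \tau_j + \lambda_j$$
$$var(\hat{\tau}) = var(\tau + \lambda)$$
Assume $cov(\tau_j, \lambda_j) = 0$ (reasonable). In words, the randomness in which children a teacher gets is not correlated with teacher quality.
Hence,
$$var(\hat{\tau}) = var(\tau) + var(\lambda)$$
$$var(\tau) = var(\hat{\tau}) - var(\lambda)$$
We have $var(\hat{\tau})$ and we need to estimate $var(\lambda)$:
$$var(\lambda) = \frac{1}{J}\sum_{j=1}^{J}\hat{\sigma}_j^2$$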
Hence,
$$\frac{var(\tau)}{var(\hat{\tau})} = \text{reliability} = \text{true variance signal}$$
$$1 - \frac{var(\tau)}{var(\hat{\tau})} = \text{noise}$$
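The signal and noise split above can be sketched in base R on simulated data. Everything here is assumed for illustration: the number of teachers, the class size, the true effect variance of 0.09, and the use of the squared standard error of each teacher mean as $\hat{\sigma}_j^2$.

```r
set.seed(5)
J   <- 400                                  # number of teachers (hypothetical)
n_j <- 25                                   # students per teacher (hypothetical)
tau <- rnorm(J, sd = 0.3)                   # true teacher effects, variance 0.09
teacher <- rep(seq_len(J), each = n_j)
resid_student <- tau[teacher] + rnorm(J * n_j, sd = 1)   # student-level residuals

tau_hat  <- as.numeric(tapply(resid_student, teacher, mean))      # tau_j + lambda_j
sigma2_j <- as.numeric(tapply(resid_student, teacher, var)) / n_j # squared SE of each teacher mean

var_tau_hat <- var(tau_hat)
var_lambda  <- mean(sigma2_j)               # (1/J) * sum of sigma_hat_j^2
var_tau     <- var_tau_hat - var_lambda     # recovered signal variance
reliability <- var_tau / var_tau_hat

c(var_tau_hat = var_tau_hat, var_lambda = var_lambda,
  var_tau = var_tau, reliability = reliability, noise = 1 - reliability)
```

The recovered `var_tau` should land near the simulated value of 0.09, and `reliability` is the share of the variance in estimated effects that reflects true differences across teachers.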
Even in cases where the true relationship is that $\tau$ is a function of $N_j$, our recovery method for $\lambda$ is still not affected.
$$\hat{\tau}_j = \beta_0 + X_j\beta_1 + \epsilon_j$$
We typically don’t test heteroskedasticity because we will use robust covariance matrix
estimation anyway.
Dataset
library("plm")
data("EmplUK", package="plm")
data("Produc", package="plm")
data("Grunfeld", package="plm")
data("Wages", package="plm")
12.4.3.1 Poolability
also known as an F test of stability (or Chow test) for the coefficients
H0 : All individuals have the same coefficients (i.e., equal coefficients for all individuals).
Notes:
Under a within (i.e., fixed) model, different intercepts for each individual are assumed
library(plm)
Hence, we reject the null hypothesis that coefficients are stable. Then, we should use the
random model.
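The idea behind a poolability test can be sketched in base R as an F test between a model with common slopes and one with individual-specific slopes. This is an analogue written for illustration on simulated data, not the plm implementation, and the slope values are hypothetical.

```r
set.seed(6)
N <- 10
T_periods <- 20
id <- factor(rep(seq_len(N), each = T_periods))
slope_i <- seq(0.5, 1.5, length.out = N)[as.integer(id)]   # slopes differ across individuals
x <- rnorm(N * T_periods)
y <- slope_i * x + rnorm(N * T_periods, sd = 0.5)

restricted   <- lm(y ~ x + id)   # common slope, individual intercepts (within model)
unrestricted <- lm(y ~ x * id)   # individual-specific intercepts and slopes

ftest <- anova(restricted, unrestricted)   # H0: all individuals share the same slope
p_val <- ftest[2, "Pr(>F)"]
ftest
```

Because the simulated slopes really do differ, the test rejects the null of equal coefficients here.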
Use the Lagrange multiplier test to test for the presence of individual effects, time effects, or both.
Types:
usually seen in macro panels with long time series (large N and T), not seen in micro
panels (small T and large N)
Serial correlation can arise from individual effects (i.e., a time-invariant error component) or from idiosyncratic error terms (e.g., in the case of an AR(1) process). Typically, when we refer to serial correlation, we mean the second one.
Can be
marginal test: tests only one of the two dependencies above (but can be biased towards rejection)
joint test: tests both dependencies (but does not tell us which one is causing the problem)
conditional test: assuming one dependence structure is correctly specified, tests whether the other departure is present.
Under the null, covariance matrix of the residuals = its diagonal (off-diagonal = 0)
It is robust against both unobserved effects that are constant within every group, and
any kind of serial correlation.
pwtest(log(gsp)~log(pcap)+log(pc)+log(emp)+unemp, data=Produc)
#>
#> Wooldridge's test for unobserved individual effects
#>
#> data: formula
Here, we reject the null hypothesis that there are no unobserved effects in the residuals. Hence, we rule out pooled OLS.
12.4.3.4.2 Locally robust tests for random effects and serial correlation
A joint LM test for random effects and serial correlation assuming normality and homoskedasticity of the idiosyncratic errors (Baltagi and Li 1991; Baltagi and Li 1995)
Here, we reject the null hypothesis of no serial correlation and no random effects. But we still do not know whether the rejection is due to serial correlation, random effects, or both.
To identify which departure from the null is present, we can use the test of Bera, Sosa-Escudero, and Yoon (2001) for first-order serial correlation or random effects (both under normality and homoskedasticity of the errors).
pbsytest(log(gsp)~log(pcap)+log(pc)+log(emp)+unemp, data=Produc)
#>
#> Bera, Sosa-Escudero and Yoon locally robust test - balanced panel
#>
#> data: formula
Since BSY is only locally robust, if you “know” there is no serial correlation, then a test based on the LM statistic is superior:
#>
#> Lagrange Multiplier Test - (Honda) for balanced panels
#>
#> data: inv ~ value + capital
#> normal = 28.252, p-value < 2.2e-16
#> alternative hypothesis: significant effects
On the other hand, if you know there are no random effects, use the Breusch-Godfrey test (Breusch 1978; Godfrey 1978) to test for serial correlation
lmtest::bgtest()
If you “know” there are random effects, use the test of Baltagi and Li (1995) to test for serial correlation in both AR(1) and MA(1) processes.
H0 : Uncorrelated errors.
Note:
pbltest(log(gsp)~log(pcap)+log(pc)+log(emp)+unemp,
data=Produc, alternative="onesided")
#>
applicable to random effects model, OLS, and FE (with large T, also known as long
panel).
can also test higher-order serial correlation
#>
#> data: inv ~ value + capital
#> chisq = 42.587, df = 2, p-value = 5.655e-10
#> alternative hypothesis: serial correlation in idiosyncratic errors
in the case of short panels (small T and large n), we can use
12.4.3.6 Heteroskedasticity
Breusch-Pagan test
There is a continuum between RE (which uses FGLS and imposes more assumptions) and POLS; see the section on FGLS.
Breusch-Pagan LM test
Test for the random effects model based on the OLS residuals
Null hypothesis: the variance across entities is zero. In other words, there is no panel effect.
If the test is significant, RE is preferable to POLS
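For a balanced panel, the Breusch-Pagan LM statistic can be computed by hand from pooled OLS residuals as $LM = \frac{NT}{2(T-1)}\left[\frac{\sum_i (\sum_t e_{it})^2}{\sum_i \sum_t e_{it}^2} - 1\right]^2$, compared against a $\chi^2_1$ distribution. The base-R sketch below applies this to simulated data that contain an individual effect; the data and names are hypothetical.

```r
set.seed(7)
N <- 100
T_periods <- 5
id <- rep(seq_len(N), each = T_periods)
x  <- rnorm(N * T_periods)
y  <- 1 + 0.5 * x + rnorm(N)[id] + rnorm(N * T_periods)   # includes an individual effect

e       <- resid(lm(y ~ x))                  # pooled OLS residuals
sum_e_i <- as.numeric(tapply(e, id, sum))    # residuals summed within each individual

lm_stat <- (N * T_periods) / (2 * (T_periods - 1)) *
  (sum(sum_e_i^2) / sum(e^2) - 1)^2
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

c(LM = lm_stat, p_value = p_value)
```

Without an individual effect, the within-individual sums of residuals would be no larger than chance predicts and the statistic would stay small.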
12.4.4.2 FE vs. RE
RE does not require strict exogeneity for consistency (feedback effect between residual
and covariates)
| Hypothesis | If true |
|---|---|
| $H_0: Cov(c_i, x_{it}) = 0$ | $\hat{\beta}_{RE}$ is consistent and efficient, while $\hat{\beta}_{FE}$ is consistent |
| $H_1: Cov(c_i, x_{it}) \neq 0$ | $\hat{\beta}_{RE}$ is inconsistent, while $\hat{\beta}_{FE}$ is consistent |
Hausman Test
Then,
n(X)
Violation | Basic Estimator | Instrumental variable Estimator | Variable Coefficients estimator | Generalized Method of Moments estimator | General FGLS estimator | Means groups estimator
12.4.5 Summary
All three estimators (POLS, RE, FE) require A1, A2, A5 (for individuals) to be consistent.
Additionally,
If A4 does not hold, use cluster robust SE but POLS is not efficient
RE is consistent under A3a (for $u_{it}$): $E(x_{it}' u_{it}) = 0$, and the RE assumption $E(x_{it}' c_i) = 0$
If A4 (for $u_{it}$) holds, then the usual SE are valid and RE is most efficient
If A4 (for $u_{it}$) does not hold, use cluster robust SE, and RE is no longer most efficient (but still more efficient than POLS)
FE is consistent under A3: $E((x_{it} - \bar{x}_i)'(u_{it} - \bar{u}_i)) = 0$
Note: A5 for individual (not for time dimension) implies that you have A5a for the entire data
set.
12.4.6 Application
Recommended application of plm can be found here and here by Yves Croissant
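#install.packages("plm")
library("plm")
library(foreign)
Panel <- read.dta("http://dss.princeton.edu/training/Panel101.dta")
attach(Panel)
Y <- cbind(y)
X <- cbind(x1, x2, x3)
# Declare the panel structure (index columns assumed to be country and year)
pdata <- pdata.frame(Panel, index = c("country", "year"))
# Between estimator
between <- plm(Y ~ X, data = pdata, model = "between")
summary(between)
# Fixed effects (within) estimator, used by the tests that follow
fixed <- plm(Y ~ X, data = pdata, model = "within")
summary(fixed)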
# Serial Correlation
pbgtest(fixed)
# stationary
library("tseries")
adf.test(pdata$y, k = 2)
# Breusch-Pagan heteroskedasticity
library(lmtest)
bptest(y ~ x1 + factor(country), data = pdata)
## For FE model
coeftest(fixed) # Original coefficients
Advanced
Other effects:
Check for the unbalancedness. Closer to 1 indicates balanced data (Ahrens and Pincus 1981)
punbalancedness(random)
Instrumental variable
instr <- plm(Y ~ X | X_ins, data = pdata, random.method = "ht", model = "random", inst.m
# Random Effects
zz <- pggls(log(emp)~log(wage)+log(capital), data=EmplUK, model="pooling")
# Fixed
zz <- pggls(log(emp)~log(wage)+log(capital), data=EmplUK, model="within")
Available functions
Notes
library(fixest)
data(airquality)
# Setting a dictionary
setFixest_dict(
c(
Ozone = "Ozone (ppb)",
Solar.R = "Solar Radiation (Langleys)",
Wind = "Wind Speed (mph)",
Temp = "Temperature"
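)
)
# Model call reconstructed from the surviving fragment and the output below (assumed)
est = feols(
Ozone ~ Solar.R + sw0(Wind + Temp) | csw(Month, Day),
data = airquality,
cluster = ~ Day)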
etable(est)
#> model 1 model 2
#> Dependent Var.: Ozone (ppb) Ozone (ppb)
#>
#> Solar Radiation (Langleys) 0.1148*** (0.0234) 0.0522* (0.0202)
#> Wind Speed (mph) -3.109*** (0.7986)
#> Temperature 1.875*** (0.3671)
#> Fixed-Effects: ------------------ ------------------
#> Month Yes Yes
#> Day No No
#> __________________________ __________________ __________________
#> S.E.: Clustered by: Day by: Day
#> Observations 111 111
#> R2 0.31974 0.63686
# in latex
etable(est, tex = T)
#> \begingroup
#> \centering
#> \begin{tabular}{lcccc}
#> Solar Radiation (Langleys) & 0.1148$^{***}$ & 0.0522$^{**}$ & 0.1078$^{***}$ & 0.
#> & (0.0234) & (0.0202) & (0.0329) & (0
#> Wind Speed (mph) & & -3.109$^{***}$ & & -3
#> & & (0.7986) & & (0
#> Temperature & & 1.875$^{***}$ & & 2.
#> & & (0.3671) & & (0
#> \midrule
#> \emph{Fixed-effects}\\
#> Month & Yes & Yes & Yes & Ye
#> Day & & & Yes & Ye
#> \midrule
#> Within R$^2$ & 0.12245 & 0.53154 & 0.12074 & 0.
#> \midrule \midrule
#> \multicolumn{5}{l}{\emph{Clustered (Day) standard-errors in parentheses}}\\
#>
#> COEFFICIENTS:
#> Month: 5 6 7 8 9
#> 3.219 8.288 34.26 40.12 12.13
plot(fixedEffects)
# set up
library(fixest)
# let R know the base dataset (the biggest/ultimate dataset that includes everything in
base = iris
# rename variables
names(base) = c("y1","y2", "x1","x2", "species")
res_multi = feols(
c(y1, y2) ~ x1 + csw(x2, x2 ^ 2) |
sw0(species),
data = base,
fsplit = ~ species,
lean = TRUE,
vcov = "hc1" # can also clustered at the fixed effect level
)
# it's recommended to use vcov at estimation stage, not summary stage
summary(res_multi, "compact")
#> sample lhs fixef rhs (Intercept)
})
etable(res)
#> model 1 model 2
#> Dependent Var.: y1 y2
#>
#> (Intercept) 4.191*** (0.0970) 3.587*** (0.0937)
set first
mvsw (multiverse stepwise): all possible combination of the elements in the set (it will get
mvsw(x1, x2, x3) will be sw0(x1, x2, x3, x1 + x2, x1 + x3, x2 + x3, x1 + x2 +
x3)
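To see what the expansion above enumerates, here is a base-R sketch that lists every combination of the three variables as an additive right-hand side. It does not call fixest; the placeholder label for the model without any of the variables is a choice made here for display.

```r
vars <- c("x1", "x2", "x3")

# All non-empty subsets, ordered by size, written as additive terms
combos <- unlist(lapply(seq_along(vars), function(k) {
  combn(vars, k, FUN = paste, collapse = " + ")
}))

# sw0-style list: the first model includes none of the variables
rhs <- c("(none)", combos)
rhs
```

With $p$ variables this yields $2^p$ models, so the number of estimations grows quickly as variables are added.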
#>
#> model 4
#> Sample (species) virginica
#> Dependent Var.: y1
#>
#> (Intercept) 1.052* (0.5139)
#> R2 0.74689
#> Adj. R2 0.73612
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
newey_west : (Newey and West 1987) use for time series or panel data. Errors are
driscoll_kraay (Driscoll and Kraay 1998) use for panel data. Errors are cross-
function.
To let R know which SE estimation you want to use, insert vcov = vcov_type ~ variables
References
Ahrens, H., and R. Pincus. 1981. “On Two Measures of Unbalancedness in a One-Way Model
and Their Relation to Efficiency.” Biometrical Journal 23 (3): 227–35.
https://doi.org/10.1002/bimj.4710230302.
Amemiya, Takeshi, and Thomas E. MaCurdy. 1986. “Instrumental-Variable Estimation of an
Error-Components Model.” Econometrica 54 (4): 869. https://doi.org/10.2307/1912840.
Gourieroux, Christian, Alberto Holly, and Alain Monfort. 1982. “Likelihood Ratio Test, Wald
Test, and Kuhn-Tucker Test in Linear Models with Inequality Constraints on the Regression
Parameters.” Econometrica 50 (1): 63. https://doi.org/10.2307/1912529.
Honda, Yuzo. 1985. “Testing the Error Components Model with Non-Normal Disturbances.”
The Review of Economic Studies 52 (4): 681. https://doi.org/10.2307/2297739.
King, Maxwell L., and Ping X. Wu. 1997. “Locally Optimal One-Sided Tests for Multiparameter
Hypotheses.” Econometric Reviews 16 (2): 131–56.
https://doi.org/10.1080/07474939708800379.
Narayanan, Sridhar, and Harikesh S Nair. 2013. “Estimating Causal Installed-Base Effects: A
Bias-Correction Approach.” Journal of Marketing Research 50 (1): 70–94.
Nerlove, Marc. 1971. “A Note on Error Components Models.” Econometrica 39 (2): 383.
https://doi.org/10.2307/1913351.
Newey, Whitney K., and Kenneth D. West. 1987. “A Simple, Positive Semi-Definite,
Heteroskedasticity and Autocorrelation Consistent Covariance Matrix.” Econometrica 55 (3):
703. https://doi.org/10.2307/1913610.
Nickell, Stephen. 1981. “Biases in Dynamic Models with Fixed Effects.” Econometrica: Journal
of the Econometric Society, 1417–26.
Swamy, P. A. V. B., and S. S. Arora. 1972. “The Exact Finite Sample Properties of the
Estimators of Coefficients in the Error Components Regression Models.” Econometrica 40 (2):
261. https://doi.org/10.2307/1909405.
Wallace, T. D., and Ashiq Hussain. 1969. “The Use of Error Components Models in Combining
Cross Section with Time Series Data.” Econometrica 37 (1): 55.
https://doi.org/10.2307/1909205.