Chapter 6: Panel Data
Peter Hull
Mathematical Econometrics I
Brown University
Spring 2024
Motivation
We’ve seen how to estimate causal effects when a treatment is as
good as randomly assigned conditional on observable characteristics
But often we’re worried that there are unobservable characteristics we
haven’t properly accounted for (i.e. confounding variables)
Next we’ll think about how we can deal with certain types of
unobserved confounding variables when we have panel data
What is Panel Data?
Panel data refers to a situation where we observe each unit i (say, a
person or state) across multiple periods t
Why is this useful? It allows us to look at differences in outcomes
between treated/untreated units before the treatment occurred
If treated/control outcomes are different before the treatment, this
must be the result of confounding factors.
So we can potentially use pre-treatment differences to learn about the
confounds and adjust for them
Let’s see how this works in an example of difference-in-differences,
which is the most common panel data method used in applied
microeconomic research
Outline
1. Diff-in-Diff Basics
2. DiD Meets Regression
3. The DiD Frontier
Hastings (2004)
In 2004, Justine Hastings (a former Brown prof!) wrote a study
analyzing how mergers in the gas industry affect gas prices
In particular, she studied an episode in California where a refiner,
ARCO, bought one of the largest gas station chains, Thrifty
How do you think such a merger might affect prices?
On the one hand, it could reduce competition and increase prices
On the other, a merger could reduce costs of providing gas and
decrease prices (synergies)
Hastings attempted to answer this question empirically using data on
gas prices by neighborhood in CA
Data contains info on neighborhoods both with/without Thrifty stations
Suppose first that we only had data on gas prices from after the
merger occurred.
We could compare prices in areas that had a Thrifty beforehand
(Di = 1) and places that didn’t have a Thrifty beforehand (Di = 0) to
estimate the causal effect of a Thrifty conversion
Why might this not give us the causal effect of converting Thrifties?
Omitted variables!
In particular, places that already had a Thrifty beforehand likely had
more competition than places without a Thrifty. We thus might
expect them to have lower prices.
With panel data, we can test this empirically by looking at prices
before the merger!
Before the merger, stations in markets competing with Thrifty had gas
prices about 3 cents lower in every period
Is it reasonable to assume unconfoundedness after the merger? No!
A better assumption might be that the gap would have remained 3c
if not for the merger! This is the idea of difference-in-differences
After the merger, stations in areas with a Thrifty had higher prices by
about 2c
If we assume that they would have had lower prices by 3c (as before
the merger), then this implies a treatment effect of 2 − (−3) = 5c
This is the post-treatment difference (2) between treatment & control
minus the pre-treatment difference (-3), i.e. a difference-in-differences
Formalizing the Assumptions of DiD
Assume there are 2 periods, t = 1, 2. Treated units (Di = 1) are
treated in period 2; control units (Di = 0) are never treated.
Let Yit be the observed outcome for unit i in period t.
Assume Yit = Di Yit(1) + (1 − Di) Yit(0), where Yit(1) and Yit(0) are
unit i's potential outcomes in period t if it is in the treated group
or the never-treated group, respectively.
No anticipation assumption: Yi1(0) = Yi1(1)
Your treatment in period 2 doesn’t affect your outcome in period 1
Parallel trends assumption:
$$\underbrace{E[Y_{i2}(0) - Y_{i1}(0) \mid D_i = 1]}_{\text{Change in } Y(0) \text{ for treated}} = \underbrace{E[Y_{i2}(0) - Y_{i1}(0) \mid D_i = 0]}_{\text{Change in } Y(0) \text{ for control}}$$
Equivalently,
$$\underbrace{E[Y_{i2}(0) \mid D_i = 1] - E[Y_{i2}(0) \mid D_i = 0]}_{\text{Selection bias in period 2}} = \underbrace{E[Y_{i1}(0) \mid D_i = 1] - E[Y_{i1}(0) \mid D_i = 0]}_{\text{Selection bias in period 1}}$$
Under these assumptions, we have
\begin{align*}
&\underbrace{E[Y_{i2} - Y_{i1} \mid D_i = 1]}_{\text{Observed change for treated}} - \underbrace{E[Y_{i2} - Y_{i1} \mid D_i = 0]}_{\text{Observed change for control}} \\
&= E[Y_{i2}(1) - Y_{i1}(1) \mid D_i = 1] - E[Y_{i2}(0) - Y_{i1}(0) \mid D_i = 0] && \text{(Observed data rule)} \\
&= E[Y_{i2}(1) - Y_{i1}(0) \mid D_i = 1] - E[Y_{i2}(0) - Y_{i1}(0) \mid D_i = 0] && \text{(No anticipation)} \\
&= E[Y_{i2}(1) - Y_{i2}(0) \mid D_i = 1] \\
&\quad + E[Y_{i2}(0) - Y_{i1}(0) \mid D_i = 1] - E[Y_{i2}(0) - Y_{i1}(0) \mid D_i = 0] && \text{(Adding and subtracting)} \\
&= E[Y_{i2}(1) - Y_{i2}(0) \mid D_i = 1] && \text{(Parallel trends)}
\end{align*}
Thus, the difference-in-differences of population means identifies
τATT = E[Yi2(1) − Yi2(0) | Di = 1].
This is called the average treatment effect on the treated (ATT).
It is the average effect in period 2 for treated units.
Estimating the ATT
We’ve shown that under the DiD assumptions (parallel trends and no
anticipation), the ATT is identified as
$$\tau_{ATT} = \underbrace{E[Y_{i2} - Y_{i1} \mid D_i = 1]}_{\text{Change in pop mean for treated}} - \underbrace{E[Y_{i2} - Y_{i1} \mid D_i = 0]}_{\text{Change in pop mean for control}}$$
How can we estimate this? Plug in sample means!
Our estimate is:
$$\hat\tau_{ATT} = \underbrace{(\bar Y_{12} - \bar Y_{11})}_{\text{Change in sample mean for treated}} - \underbrace{(\bar Y_{02} - \bar Y_{01})}_{\text{Change in sample mean for control}},$$
where Ȳdt is the sample mean for units with Di = d in period t.
Example
Consider Hastings’s example, comparing June (period 1) to October (period 2):
$$\hat\tau_{ATT} = \underbrace{(\bar Y_{12} - \bar Y_{11})}_{\text{Change in sample mean for treated}} - \underbrace{(\bar Y_{02} - \bar Y_{01})}_{\text{Change in sample mean for control}} = (1.43 - 1.25) - (1.41 - 1.28) = 0.05$$
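The plug-in estimate is just arithmetic on four sample means, which can be checked in a few lines. Below is a minimal sketch in Python; the function name `did_estimate` and the dictionary layout are illustrative assumptions, while the four means are the ones from the example.

```python
def did_estimate(means):
    """Difference-in-differences of sample means.

    `means[(d, t)]` is the sample mean outcome for group d
    (1 = treated, 0 = control) in period t (1 = pre, 2 = post).
    """
    change_treated = means[(1, 2)] - means[(1, 1)]
    change_control = means[(0, 2)] - means[(0, 1)]
    return change_treated - change_control


# Sample means from the Hastings example (June = period 1, October = period 2).
hastings_means = {
    (1, 1): 1.25, (1, 2): 1.43,  # areas with a Thrifty beforehand
    (0, 1): 1.28, (0, 2): 1.41,  # areas without a Thrifty beforehand
}

tau_hat = did_estimate(hastings_means)
print(round(tau_hat, 2))
```

The treated group's change (0.18) exceeds the control group's change (0.13) by 0.05, which matches the estimate in the example.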
Outline
1. Diff-in-Diff Basics✓
2. DiD Meets Regression
3. The DiD Frontier
DiD as Regression
Consider the regression
Yit = β0 + β1 × Postt + β2 Di + β3 Di × Postt + εit ,
where Postt = 1[t = 2].
Claim: the population regression coefficient β3 is equal to τATT under
the DiD assumptions.
Why? The regression above models the CEF as:
E [Yit |Di = 0, Postt = 0] = β0
E [Yit |Di = 0, Postt = 1] = β0 + β1
E [Yit |Di = 1, Postt = 0] = β0 + β2
E [Yit |Di = 1, Postt = 1] = β0 + β1 + β2 + β3
Thus,
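$$\beta_3 = \big(E[Y_{it} \mid D_i = 1, Post_t = 1] - E[Y_{it} \mid D_i = 1, Post_t = 0]\big) - \big(E[Y_{it} \mid D_i = 0, Post_t = 1] - E[Y_{it} \mid D_i = 0, Post_t = 0]\big) = \tau_{ATT}$$
Analogously, $\hat\beta_3 = (\bar Y_{12} - \bar Y_{11}) - (\bar Y_{02} - \bar Y_{01}) = \hat\tau_{ATT}$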
Example
Suppose we take the Hastings data from June/October and estimate
Yit = β0 + β1 × Postt + β2 Di + β3 Di × Postt + εit ,
via OLS, where Postt is 1 for October and 0 for June.
We get the regression coefficients:
Constant (β̂0)           1.28
Post (β̂1)               0.13
Treated (β̂2)           −0.03
Treated × Post (β̂3)     0.05
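To see the regression reproduce the difference in sample means, here is a minimal sketch using NumPy. The station-level data are hypothetical: the original micro data are not shown here, so the sketch builds two made-up observations per cell, placed symmetrically around the cell means implied by the example.

```python
import numpy as np

# Cell means implied by the example: (treated, post) -> mean price.
cell_means = {(0, 0): 1.28, (0, 1): 1.41, (1, 0): 1.25, (1, 1): 1.43}

# Hypothetical micro data: two observations per cell, symmetric around
# the cell mean so that each cell's sample mean is unchanged.
rows = []
for (d_val, post_val), mean in cell_means.items():
    for noise in (-0.01, 0.01):
        rows.append((d_val, post_val, mean + noise))

d = np.array([r[0] for r in rows], dtype=float)
post = np.array([r[1] for r in rows], dtype=float)
y = np.array([r[2] for r in rows], dtype=float)

# Design matrix: constant, Post, Treated, Treated x Post.
X = np.column_stack([np.ones_like(y), post, d, d * post])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

names = ["Constant", "Post", "Treated", "Treated x Post"]
for name, b in zip(names, beta_hat):
    print(f"{name:>15}: {b: .2f}")
```

Because the regression is saturated (one parameter per group-by-period cell), the OLS coefficients depend only on the four cell means, so any micro data with these means would give the same point estimates.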
DiD with Multiple Periods
Often we have more than 2 periods for a DiD analysis
This is useful for two reasons:
1. We can test whether parallel trends appears to hold prior to treatment
2. We can analyze how the ATT changes over time
How do we do this?
DiD with Multiple Periods
Suppose that we have periods $t = -\underline{T}, \ldots, \bar{T}$. Treated units begin
getting treatment at period 1.
For each period s ≠ 0, we can estimate a 2-period DiD between period
s and period 0:
$$\hat\beta_s = \underbrace{(\bar Y_{1s} - \bar Y_{0s})}_{\text{Diff in period } s} - \underbrace{(\bar Y_{10} - \bar Y_{00})}_{\text{Diff in period 0}}$$
where Ȳdt is the average for treatment group d in period t.
Conveniently, the β̂s are equal to the OLS estimates of the regression
$$Y_{it} = \phi_t + D_i \gamma + \sum_{s \neq 0} D_i \times 1[t = s] \times \beta_s + \varepsilon_{it}$$
You can also replace Di γ with a unit fixed effect λi and (in a balanced
panel) you get the exact same β̂s.
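The equivalence between the period-by-period DiDs and the regression coefficients can be checked numerically. The sketch below uses simulated data (the panel dimensions, effect size, and noise level are all made-up assumptions) and computes the β̂s both ways.

```python
import numpy as np

rng = np.random.default_rng(0)
periods = np.arange(-2, 3)            # t = -2, ..., 2; treatment starts at t = 1
n_units = 40
treated = np.arange(n_units) < 20     # first 20 units have D_i = 1

# Simulated outcomes: unit effect + common time effect
# + an effect of 1.0 once treated + noise.
unit_effect = rng.normal(size=n_units)
time_effect = 0.5 * periods
is_treated_now = treated[:, None] & (periods[None, :] >= 1)
y = (unit_effect[:, None] + time_effect[None, :]
     + 1.0 * is_treated_now
     + rng.normal(scale=0.1, size=(n_units, periods.size)))

# (a) beta_s from cell means: gap in period s minus gap in period 0.
gap = y[treated].mean(axis=0) - y[~treated].mean(axis=0)
base = int(np.where(periods == 0)[0][0])
event_periods = [int(s) for s in periods if s != 0]
beta_means = {int(s): gap[k] - gap[base]
              for k, s in enumerate(periods) if s != 0}

# (b) beta_s from OLS: period dummies, D_i, and D_i x 1[t = s] for s != 0.
D = np.repeat(treated.astype(float), periods.size)   # matches y.ravel() order
T = np.tile(periods, n_units)
cols = [(T == s).astype(float) for s in periods]     # period effects phi_t
cols.append(D)                                       # D_i * gamma
cols += [D * (T == s) for s in event_periods]        # event-study terms
X = np.column_stack(cols)
coef, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
beta_ols = dict(zip(event_periods, coef[-len(event_periods):]))

for s in event_periods:
    print(s, round(beta_means[s], 4), round(beta_ols[s], 4))
```

In this simulation the pre-period estimates (s < 0) should be close to zero and the post-period estimates close to the assumed effect of 1.0, which is the pattern an event-study plot is meant to reveal.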
Example - Medicaid Expansion
The Affordable Care Act (ACA, aka Obamacare) expanded Medicaid
coverage to people with income up to 138% of the federal poverty line
Medicaid expansion went into effect in 2014. However, some
Republican-leaning states opted out of expanded coverage.
By 2015, 24 states had expanded Medicaid (more have done so since)
Carey, Miller, and Wherry (2020) study the impacts of Medicaid
expansion using a DiD design comparing early-adopting states to
non-adopters.
Example - Medicaid Expansion
A slightly simplified version of their regression specification is
$$Y_{its} = \phi_t + \lambda_s + \sum_{r \neq -1} D_i \times 1[t = 2014 + r] \times \beta_r + \varepsilon_{it}$$
where Yits is the outcome for person i in year t in state s, and Di = 1 if in
an expansion state. Let’s plot the β̂r estimates and 95% CIs:
Results show similar “pre-trends” but negative effects after treatment
In a related paper, some of the same authors used a similar research design
to estimate the impacts on mortality
Some Caution about Parallel Trends
DiD relies on the parallel trends assumption, which allows for selection
bias but requires it to be stable over time. This rules out time-varying
confounding factors.
Often we will be worried about time-varying confounds — e.g.,
macro-economic factors might differentially affect Democratic versus
Republican states
Testing for pre-treatment differences (“pre-trends”) can help increase
our confidence in the research design. But these tests aren’t perfect. Why?
1. Just because trends were parallel beforehand doesn’t mean that they
would have continued to be afterwards
2. Often our estimates of pre-trends are noisy, so we’re not sure whether
they’re actually zero or not.
In addition to looking at the point estimates of pre-trends, it’s
important to consider what the CIs rule out
A good rule of thumb: a plot is not convincing if you can draw a
smooth line (e.g., a continuing trend) through all the confidence intervals
Are you convinced there’s an effect here? Maybe not!
What about here?
And here?
Standard Errors for Panel Regressions
We know how to get standard errors for OLS estimates of
Yi = Xi′ β + ei
when (Yi , Xi ) are drawn iid.
Now, we have
Yit = Xit′ β + eit
Is it reasonable to assume that (Yit , Xit ) are iid across i and t? No
1. We expect Yi1 to be correlated with Yi2, e.g., people with high earnings
in 2010 also tend to have high earnings in 2011. This is called serial
correlation (or autocorrelation)
2. More subtly, if treatment is assigned at the state level, all people in a
given state will have the same value of Dit (which is included in Xit),
so observations within a state are not independent
Clustered Standard Errors
Clustered standard errors extend the OLS variance formula to allow
(Yit , Xit ) to be correlated across observations in the same “cluster”
The assumption is that each cluster is sampled independently.
For example, if we cluster at the individual level (i), then we allow for
Yi1 and Yi2 to be dependent, but assume (Yi1 , Yi2 ) is independent of
(Yj1 , Yj2 ) for j ̸= i
In panel analyses, you should at minimum cluster at the individual
level to allow for autocorrelation.
If treatment is assigned at a more aggregate level, it is best to cluster
at the level where treatment is assigned.
Keep in mind: the number of “effective observations” (used for CLT)
is the number of clusters
Clustered SEs will not be reliable when the number of clusters is very
small (e.g. < 20)
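To make the idea concrete, here is a minimal sketch of a cluster-robust "sandwich" variance estimator in NumPy. The function name `ols_cluster_se` and the simulated state-level data are assumptions for illustration. The sketch applies no finite-sample correction, so its numbers will differ slightly from statistical packages that rescale the variance.

```python
import numpy as np


def ols_cluster_se(X, y, clusters):
    """OLS coefficients with cluster-robust standard errors.

    Uses the basic sandwich formula with no finite-sample correction.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    clusters = np.asarray(clusters)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # "Meat" of the sandwich: sum over clusters of (X_g' e_g)(X_g' e_g)'.
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(clusters):
        in_g = clusters == g
        score = X[in_g].T @ resid[in_g]
        meat += np.outer(score, score)
    V = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(V))


# Simulated example: 50 states, 20 people each, treatment assigned by state,
# with a common state-level shock that makes outcomes correlated within state.
rng = np.random.default_rng(1)
n_states, n_per = 50, 20
state = np.repeat(np.arange(n_states), n_per)
d = (state % 2 == 0).astype(float)
state_shock = rng.normal(size=n_states)[state]
y = 1.0 + 0.5 * d + state_shock + rng.normal(size=state.size)
X = np.column_stack([np.ones_like(y), d])

beta, se_cluster = ols_cluster_se(X, y, state)
# Treating every observation as its own cluster gives the usual
# heteroskedasticity-robust (HC0) standard errors.
_, se_robust = ols_cluster_se(X, y, np.arange(y.size))

print("coefficient on d:", round(beta[1], 3))
print("clustered SE:", round(se_cluster[1], 3))
print("robust SE:   ", round(se_robust[1], 3))
```

In this simulation the clustered standard error on `d` is noticeably larger than the robust one, reflecting that the 1,000 observations carry closer to 50 clusters' worth of independent information about a state-level treatment.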
Implementing Clustered SEs
Implementing clustered SEs in Stata is very easy
Just replace
reg y x, robust
with
reg y x, cluster(clustervar)
Outline
1. Diff-in-Diff Basics✓
2. DiD Meets Regression✓
3. The DiD Frontier
A Very Famous DiD
Card and Krueger (1994) ask: how does the minimum wage affect
employment?
How would you expect the MW to affect employment, based on what
you learned in micro-economic theory?
In a competitive market, a floor on wages (i.e. the price of labor)
should reduce the quantity of labor demanded
To study this, CK examine an episode in 1992 where NJ raised its
minimum wage from $4.25 to $5.05
They use a DiD comparing the change in employment at fast food
restaurants in NJ to that in neighboring PA, where the MW was flat
at $4.25
Point estimates suggest an increase in employment of 2.76 FTEs, but
the estimate is not statistically significant.
Why?!
The result that an increase in the MW does not seem to decrease
employment was very surprising (and controversial) at the time
One explanation for this finding is that labor markets are not perfectly
competitive. Rather, firms may have monopsony power
Consider a firm that employs 100 workers at $7/hour.
Suppose hiring another worker would bring in an extra $10/hour of revenue,
but would require raising the wage to $8/hour.
Should the firm raise the wage to $8/hour? Not if it means they have to
pay all 100 existing workers an extra $1/hour!
However, if the MW is raised to $8/hour, then the firm has to pay the
first 100 workers $8 anyway, and would gladly hire the 101st worker at
$8/hour since this brings in $10 of revenue for $8 of cost.
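The monopsony logic is a short calculation. The sketch below works through it in Python, under the assumptions that the $10 is extra hourly revenue from the 101st worker and that the firm must pay all workers the same wage; the function name and arguments are illustrative.

```python
def marginal_profit_of_hiring(n_workers, current_wage, wage_needed,
                              extra_revenue, min_wage=None):
    """Hourly change in profit from hiring one more worker.

    Assumes the firm pays every worker the same wage. If a minimum wage
    is in force, existing workers are already paid at least `min_wage`.
    """
    if min_wage is None:
        baseline_wage = current_wage
    else:
        baseline_wage = max(current_wage, min_wage)
    new_wage = max(wage_needed, baseline_wage)
    raise_for_existing = n_workers * (new_wage - baseline_wage)
    return extra_revenue - new_wage - raise_for_existing


# No minimum wage: the new hire costs $8 plus a $1 raise for 100 workers.
no_mw = marginal_profit_of_hiring(100, 7, 8, 10)
# Minimum wage of $8: the raise is already paid, so only the $8 wage matters.
with_mw = marginal_profit_of_hiring(100, 7, 8, 10, min_wage=8)

print(no_mw, with_mw)
```

Without the minimum wage, hiring the extra worker lowers hourly profit by $98 (10 − 8 − 100), so the firm does not hire. With the minimum wage at $8, the same hire raises hourly profit by $2, so employment can rise.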
By modern standards, the CK analysis is perhaps not the most
convincing
The two states do not move exactly in parallel even before the policy
change in April 1992. We also only have 2 states!
Staggered Timing
Next I’ll show you some more modern evidence on the MW.
But first we need to discuss DiD when treatment timing is staggered –
e.g., states pass minimum wages in different years
Until about 5 years ago, people extended DiD to the staggered setting
by running OLS regressions like:
Yit = φi + λt + Dit β + eit
where Dit = 1 if unit i is treated in period t.
In the two-period model, this corresponds to the diff-in-diff in sample
means between treatment and control
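Unfortunately, it turns out that this estimator is not an average of
DiDs between treated and untreated units in the staggered case: it
also uses already-treated units as controls, which can distort the
estimate when treatment effects vary over time or across units.
See Borusyak and Jaravel (2016), de Chaisemartin and D’Haultfoeuille
(2020), Goodman-Bacon (2021)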
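A tiny noise-free example shows how badly this regression can behave. The sketch below is a made-up illustration (two units, three periods, hypothetical effect sizes), not taken from any of the cited papers: every treatment effect is positive, yet the OLS coefficient on Dit is negative.

```python
import numpy as np

# Two units, three periods, no noise, untreated outcome Y(0) = 0 everywhere.
# Unit A is treated from period 2, unit B from period 3.
# Hypothetical dynamic effects: 1 in the first treated period, 4 in the second.
units = np.array([0, 0, 0, 1, 1, 1])           # 0 = A, 1 = B
periods = np.array([1, 2, 3, 1, 2, 3])
D = np.array([0, 1, 1, 0, 0, 1], dtype=float)
effect = np.array([0, 1, 4, 0, 0, 1], dtype=float)
y = effect.copy()                               # Y = Y(0) + effect = effect

# Two-way fixed effects design: constant, unit dummy for B,
# period dummies for t = 2 and t = 3, and the treatment indicator D_it.
X = np.column_stack([
    np.ones(6),
    (units == 1).astype(float),
    (periods == 2).astype(float),
    (periods == 3).astype(float),
    D,
])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_twfe = coef[-1]
avg_effect_treated = effect[D == 1].mean()

print(f"TWFE coefficient: {beta_twfe:.2f}")                          # -0.50
print(f"Average effect in treated cells: {avg_effect_treated:.2f}")  # 2.00
```

The problem comes from period 3, where the already-treated unit A serves as the "control" for unit B. A's outcome rises by 3 between periods 2 and 3 because its effect is growing, while B's rises by only 1, so that comparison contributes 1 − 3 = −2.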
Over the last few years, there has been a lot of research about “fixing”
the issues with these regressions
The solutions typically involve making “clean comparisons” by hand
1. For units first treated in year g, compare outcome change between
g − 1 and g + k to that of units who weren’t treated over that period
2. This is an estimate of the effect k years after treatment for cohort g
3. Do this for every g, and then aggregate them to get an average effect
There are many implementations of this and related approaches,
including Callaway and Sant’Anna (2020), Sun and Abraham (2020),
Borusyak, Jaravel & Spiess (2021)
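The three steps can be sketched in a few lines. The code below is a simplified illustration with made-up data and a hypothetical helper `clean_did`; it is not the implementation used by any of the cited papers or their software, which differ in the choice of controls, weighting, and inference.

```python
import numpy as np


def clean_did(y, first_treated, g, k):
    """DiD for cohort g at horizon k, using units untreated through g + k.

    y             : array (n_units, n_periods); column j holds period j + 1
    first_treated : first treated period per unit (np.inf = never treated)
    """
    pre, post = g - 1, g + k
    cohort = first_treated == g
    controls = first_treated > post
    if not cohort.any() or not controls.any():
        return np.nan
    change = y[:, post - 1] - y[:, pre - 1]
    return change[cohort].mean() - change[controls].mean()


# Made-up panel: A treated from period 2, B from period 3, C never treated.
first_treated = np.array([2.0, 3.0, np.inf])
alpha = np.array([1.0, 2.0, 3.0])               # unit effects
periods = np.arange(1, 5)
event_time = periods[None, :] - first_treated[:, None]
y0 = alpha[:, None] + 0.5 * periods[None, :]    # parallel untreated trends
# Hypothetical effects by years since treatment: 1, 4, 7, ...
effect = np.where(event_time >= 0, 1.0 + 3.0 * np.maximum(event_time, 0), 0.0)
y = y0 + effect

# Steps 1-2: one estimate per cohort g and horizon k.
estimates = {(g, k): clean_did(y, first_treated, g, k)
             for g in (2, 3) for k in (0, 1) if g + k <= periods.max()}

# Step 3: aggregate across cohorts, here by a simple average per horizon.
by_horizon = {k: np.mean([v for (g, kk), v in estimates.items() if kk == k])
              for k in (0, 1)}
print(estimates)
print(by_horizon)
```

Because every comparison uses only units that are still untreated at g + k, each estimate recovers the assumed effect for its horizon (1 at k = 0 and 4 at k = 1) even though effects grow over time.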
Cengiz et al. (2019) do a modern version of CK using 138 MW
changes between 1976 and 2016
For each state that changes its MW, they take a “control group” of
states that didn’t change their MW in the 4 years before/after
They compute a DiD between the treated state and the matched
control states
They then take a weighted average of these DiDs to get an overall
average effect
Important Considerations/Caveats
Historical changes in the MW have been fairly modest
Not clear that changes in MW from $4.25 to $5.05 are informative
about raises from $7.25 to $15!
Historical analyses of MW are typically relatively short-run
Over long-run, MW increases may induce shifts in technology that
replace workers
There is still some debate among economists over whether MWs
reduce employment!
Other Panel Data Methods
We’ve focused on DiD, which is the most commonly-used panel data
method in applied micro-economics
But there are many others:
Controls for lagged dependent variables
Synthetic control
Matrix completion
We won’t have time to cover these, but if you’re interested, I suggest
taking more econometrics classes :)