Panel Data
Analysis
Lecture 1
Introduction
Spring 2025
TS109
Mohamed Abdallah
Lecturer of Applied
Statistics & Econometrics
01222596520
Overview
Panel Data Econometrics
This is an intermediate-level diploma course in
Applied Econometrics dealing with panel data. The
topics covered span a large part of econometrics
generally, though we are particularly interested in
those techniques as they are adapted to the analysis
of panel data sets.
Why a Course on ‘Panel Data?’
Microeconometrics and applications –
contemporary broad field in
economics/econometrics
Behavioral modeling
Individual choice and response
A platform for surveying econometric models
and methods – most of the field
Various types
Recent developments
Prerequisites
Introduction to econometrics
Applied statistics
We will examine many empirical applications.
You will apply the tools developed in the course.
Text Books
Main text: Baltagi (2008);
read chapters 1,2
Recommended: Greene
(2008); read chapters
1,2,9
Suggested: Wooldridge
(2002); read chapters
1,2,10
Course Applications
Problem sets
Software: R, EViews
Questions and review as requested
Course Outline
Statistics and Regression
Fixed and Random Effects
Instrumental Variables, MDE, GMM
The One-way Error Component Regression Model
The Two-way Error Component Regression Model
Hypothesis testing
Heteroskedasticity
Serial Correlation
Term project: Application of method(s) developed in class
to a ‘live’ data set. Details to be given in class. (25%)
Attendance (10%).
Midterm, in class, (25%)
Final exam (40%)
1. Methodology
Econometrics: Modeling
Theoretical foundations
Microeconometrics and macroeconometrics
Behavioral modeling
Statistical foundations: Econometric
methods
Mathematical elements: the usual
‘Model’ building – the econometric model
Mathematical elements
The underlying truth – is there one?
Model Building in Econometrics
Role of the assumptions
Inference
Parametric analysis
Estimation Platforms
Model based
Kernels and smoothing methods (nonparametric)
Moments and quantiles (semiparametric)
Likelihood and M-estimators (parametric)
Methodology based (?)
Classical – parametric
The Sample and Measurement
Population Measurement
Theory
Characteristics
Behavior Patterns
Choices
Classical Inference
Population Measurement
Econometrics
Characteristics
Behavior Patterns
Choices
Imprecise inference about the entire population –
sampling theory and asymptotics
Data Structures
Observation mechanisms
Non-experimental
Active, experimental
The ‘natural experiment’
Data types
Cross section
Time series
Panel or longitudinal data
Econometric Models
Linear
Static
Dynamic
Vector autoregressive (VAR)
Structural models and demand systems
Estimation Methods and Applications
Least squares etc. – OLS, GLS
Maximum likelihood
Instrumental variables and GMM
Simulation based estimation
Monte Carlo methods
Panel data
These are models that combine cross-section
and time-series data.
In panel data the same cross-sectional unit
(industry, firm, country) is surveyed over time,
so we have data which is pooled over space as
well as time.
Reasons for using Panel Data
1. Panel data can take explicit account of individual-specific
heterogeneity (“individual” here means related to the
microunit)
2. By combining data in two dimensions, panel data gives
more data variation, less collinearity and more degrees
of freedom.
3. Panel data is better suited than cross-sectional data for
studying the dynamics of change. It is well suited to
understanding transition behaviour, for example
company bankruptcy or merger.
4. Panel data is better at detecting and measuring
effects that cannot be observed in either cross-section
or time-series data.
5. Panel data enables the study of more complex
behavioural models – for example the effects of
technological change, or economic cycles.
6. Panel data can minimise the effects of aggregation
bias, from aggregating firms into broad groups.
Benefits of Panel Data
Time and individual variation in behavior unobservable
in cross sections or aggregate time series
Observable and unobservable individual heterogeneity
Rich hierarchical structures
More complicated models
Features that cannot be modeled with cross
section or aggregate time series data alone
Dynamics in economic behavior
Panel data regression models are based on panel data,
which are observations on the same cross-sectional, or
individual, units over several time periods.
A balanced panel has the same number of time observations
for each cross-sectional unit.
Panel data have several advantages over purely cross-
sectional or purely time series data. These include:
(a) Increase in the sample size
(b) Study of dynamic changes in cross-sectional units over time
(c) Study of more complicated behavioral models, including
study of time-invariant variables
Where Do We Go From Here?
Review of familiar classical procedures
Fundamental, familiar regression extensions; common
effects models
Endogeneity, instrumental variables, GMM estimation
Dynamic models
Models of heterogeneity
Features of the linear, static and dynamic common
effects models
2. Econometric Methods
A Statistical Relationship
A relationship of interest:
Number of hospital visits: H = 0,1,2,…
Covariates: x1=Age, x2=Sex, x3=Income, x4=Health
…
Causality and covariation
Theoretical implications of ‘causation’
Comovement and association
Models
Conditional mean function: E[y | x]
Other conditional characteristics – what is ‘the model?’
Conditional variance function: Var[y | x]
Other conditional moments
Conditional probabilities: P(y|x)
Using the Model
Understanding the relationship:
Estimation of quantities of interest such as
elasticities
Prediction of the outcome of interest
Control of the path of the outcome of interest
Panel Data Sets*
Longitudinal data – ‘short panels’
National Longitudinal Survey of Youth (NLS)
British Household Panel Survey (BHPS)
Panel Study of Income Dynamics (PSID)
Cross section time series – ‘long panels’
Grunfeld’s investment data
Penn World Tables
Financial data by firm, year – ‘huge panels’
$r_{it} - r_{ft} = \beta_i (r_{mt} - r_{ft}) + \varepsilon_{it}$, i = 1,…,many; t = 1,…,many
Exchange rate data, essentially infinite T, large N
Effects: $\beta_i = \beta + v_i$
* See Baltagi, Chapter 1
Notation
Fixed Effects – the ‘dummy variable model’
$y_{it} = \alpha_i + x_{it}'\beta + \varepsilon_{it}$
Random Effects – the ‘error components model’
$y_{it} = \alpha + x_{it}'\beta + \varepsilon_{it} + u_i$
Compound (“composed”) disturbance
Exogeneity
Contemporaneous exogeneity
E[εit|xit,ci]=0
Not sufficient for regression
Doesn’t imply how to estimate β
Strict Exogeneity
E[εit|xi1, xi2,…,xiT,ci]=0
Can use first difference or fixed effects
Cannot hold if xit contains lagged values of yit
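Why strict exogeneity permits first differencing can be sketched as follows. This is an illustrative derivation, assuming the linear model with an unobserved effect ci that appears elsewhere in these notes; the Δ notation is introduced here and is not part of the original slides.

```latex
\begin{aligned}
y_{it} &= x_{it}'\beta + c_i + \varepsilon_{it} \\
y_{it} - y_{i,t-1} &= (x_{it} - x_{i,t-1})'\beta + (\varepsilon_{it} - \varepsilon_{i,t-1}) \\
\Delta y_{it} &= \Delta x_{it}'\beta + \Delta\varepsilon_{it}
\end{aligned}
```

The effect ci drops out, but the differenced disturbance contains εi,t-1 and the differenced regressor contains xit, so least squares on the differenced equation needs x in one period to be uncorrelated with ε in adjacent periods. Contemporaneous exogeneity does not deliver that; strict exogeneity does.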
Suppose y is investment and x is a measure of profit.
We have i = 1…n companies and t = 1…T time
periods. Suppose we specify a simple econometric
model which says that investment depends on profit:
$y_{it} = a_0 + a_1 x_{it} + u_{it}$ (1)
uit is a random error term: $u_{it} \sim N(0, \sigma^2)$
Estimation of (1) depends on the assumptions that we
make about the intercept (a0), the slope coefficient (a1)
and the error term (uit ).
Pooled regression by OLS
This is estimation option 1 on the list. But pooled regression
may result in heterogeneity bias :
[Figure: y plotted against x, showing the pooled regression line yit=a0+a1xit+uit together with a separate true model line for each of Firms 1, 2, 3 and 4.]
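Heterogeneity bias can be illustrated with a small numerical sketch. The data below are hypothetical and noise-free: four firms share a true slope of 1, but their intercepts differ and are negatively related to the level of x. The function names and all numbers are assumptions made for this example, not taken from the course data.

```python
def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x (with intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def demean_within(groups, values):
    """Subtract each group's own mean from its values."""
    totals, counts = {}, {}
    for g, v in zip(groups, values):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - totals[g] / counts[g] for g, v in zip(groups, values)]

TRUE_SLOPE = 1.0
firm_intercepts = [12.0, 8.0, 4.0, 0.0]  # hypothetical firm-specific intercepts

firm, x, y = [], [], []
for i, a_i in enumerate(firm_intercepts):
    for t in range(3):
        x_it = 3.0 * i + t  # firms with lower intercepts have higher x
        firm.append(i)
        x.append(x_it)
        y.append(a_i + TRUE_SLOPE * x_it)  # no noise, to isolate the bias

# Pooled OLS ignores the firm-specific intercepts.
pooled_slope = ols_slope(x, y)
# Removing firm means first eliminates the intercepts.
within_slope = ols_slope(demean_within(firm, x), demean_within(firm, y))
print(round(pooled_slope, 3), round(within_slope, 3))
```

In this constructed example the pooled slope comes out negative even though every firm's own slope is 1, while the regression on firm-demeaned data recovers the slope of 1.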
Fixed and Random Effects
Unobserved individual effects in regression: E[yit | xit, ci]
Notation:
$y_{it} = x_{it}'\beta + c_i + \varepsilon_{it}$
Linear specification:
Fixed Effects: E[ci | Xi ] = g(Xi); effects are correlated with
included variables. Common: Cov[xit,ci] ≠0
Random Effects: E[ci | Xi ] = μ; effects are uncorrelated with
included variables. If Xi contains a constant term, μ=0 WLOG.
Common: Cov[xit,ci] =0, but E[ci | Xi ] = μ is needed for the
full model
However, panel models pose several estimation and
inference problems, such as heteroscedasticity,
autocorrelation, and cross-correlation in cross-sectional
units at the same point in time.
The fixed effects model (FEM) and the random effects
model (REM), also known as the error components
model (ECM), are commonly used methods to deal with
one or more of these problems.
In FEM, the intercept in the regression model is allowed
to differ among individuals to reflect the unique feature
of individual units.
This is done by using dummy variables, provided we take care
of the dummy variable trap.
The FEM using dummy variables is known as the least-
squares dummy variable model (LSDV).
FEM is appropriate in situations where the individual-
specific intercept may be correlated with one or more
regressors, but consumes a lot of degrees of freedom
when N (the number of cross-sectional units) is very
large.
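The degrees-of-freedom cost of the dummy variables does not require actually building N dummies. A standard result, sketched here with the individual-specific intercept written as αi and group means written with bars (notation assumed for this illustration), is that the LSDV slope estimates equal those from least squares on data in deviations from individual means:

```latex
\begin{aligned}
y_{it} &= \alpha_i + x_{it}'\beta + \varepsilon_{it} \\
\bar{y}_i &= \alpha_i + \bar{x}_i'\beta + \bar{\varepsilon}_i,
  \qquad \bar{y}_i = \frac{1}{T_i}\sum_{t=1}^{T_i} y_{it} \\
y_{it} - \bar{y}_i &= (x_{it} - \bar{x}_i)'\beta + (\varepsilon_{it} - \bar{\varepsilon}_i)
\end{aligned}
```

Subtracting the second line from the first removes αi, so β can be estimated from the last line alone. Any regressor that is constant over time for each individual is also removed by this transformation, so its coefficient cannot be estimated in the FEM.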
Assumptions for Asymptotics
Convergence of moments involving cross section Xi.
N increasing, T or Ti assumed fixed.
“Fixed T” asymptotics.
Time series characteristics are not relevant (may be
nonstationary)
If T is also growing, need to treat as multivariate time series.
Strict exogeneity and dynamics. If xit contains yi,t-1 then
xit cannot be strictly exogenous: xit will be correlated
with the unobservables in period t-1. (To be revisited
later.)
Empirical characteristics of microeconomic data
Estimating β
β is the partial effect of interest
Can it be estimated (consistently) in the
presence of (unmeasured) ci?
Does pooled least squares “work?”
Strategies for “controlling for ci” using the sample
data
Balanced and Unbalanced Panels
Distinction
A notation to help with mechanics
zi,t, i = 1,…,N; t = 1,…,Ti
The role of the assumption
If all the cross-sectional units have the same
number of time series observations the panel is
balanced, if not it is unbalanced.
Balanced: n = NT
Unbalanced: $n = \sum_{i=1}^{N} T_i$
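The counting rule for balanced and unbalanced panels can be checked mechanically. The sketch below uses a made-up long-format panel (one unit-period pair per observation); the firm names, years and the helper panel_summary are hypothetical.

```python
from collections import Counter

# Hypothetical long-format panel: one (unit, period) pair per observation.
observations = [
    ("firm_A", 2001), ("firm_A", 2002), ("firm_A", 2003),
    ("firm_B", 2001), ("firm_B", 2002), ("firm_B", 2003),
    ("firm_C", 2002), ("firm_C", 2003),  # firm_C is observed only twice
]

def panel_summary(obs):
    """Return (N, per-unit counts T_i, total n, balanced flag)."""
    t_counts = Counter(unit for unit, _ in obs)
    n_units = len(t_counts)                 # N: number of cross-sectional units
    n_total = sum(t_counts.values())        # n: sum of T_i over all units
    balanced = len(set(t_counts.values())) == 1  # same T_i for every unit
    return n_units, dict(t_counts), n_total, balanced

N, T_i, n, balanced = panel_summary(observations)
print(N, T_i, n, balanced)
```

Dropping firm_C leaves two firms with three observations each, which panel_summary reports as balanced with n = NT = 6.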
Short Term Agenda for Simple Effects Models
Models with individual effects
Interpretation of models
Computation (practice) and estimation (theory)
Extensions
Nonstandard panels: Rotating, Pseudo-, Nested
Generalizing the regression model
Alternative estimators
Methods
Least squares: OLS, GLS, FGLS
MLE and Maximum Simulated Likelihood
The Pooled Regression
Presence of omitted effects
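$y_{it} = x_{it}'\beta + c_i + \varepsilon_{it}$, observation for person i at time t
$y_i = X_i\beta + c_i\mathbf{i} + \varepsilon_i$, $T_i$ observations in group i ($\mathbf{i}$ is a column of ones)
$= X_i\beta + \mathbf{c}_i + \varepsilon_i$, note $\mathbf{c}_i = (c_i, c_i, \ldots, c_i)'$
$y = X\beta + c + \varepsilon$, $\sum_{i=1}^{N} T_i$ observations in the sample
Potential bias/inconsistency of OLS – depends
on ‘fixed’ or ‘random’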
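The dependence on the ‘fixed’ or ‘random’ assumption can be made explicit with a short sketch. It assumes the stacked model y = Xβ + c + ε, that X'ε/n converges in probability to zero, and that the moment limits exist; n denotes the total number of observations.

```latex
\begin{aligned}
b &= (X'X)^{-1}X'y
   = \beta + (X'X)^{-1}X'c + (X'X)^{-1}X'\varepsilon \\
\operatorname{plim} b &= \beta
   + \left[\operatorname{plim} \tfrac{1}{n} X'X\right]^{-1}
     \operatorname{plim} \tfrac{1}{n} X'c
\end{aligned}
```

If the effects are uncorrelated with the regressors (the random effects case, with a constant term in X absorbing the mean of ci), the second term vanishes and pooled OLS is consistent. If the effects are correlated with the regressors (the fixed effects case), the second term does not vanish and pooled OLS is inconsistent.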
Endogeneity
Definition: E[ε|x]≠0
Why not?
Omitted variables
Unobserved heterogeneity (equivalent to omitted
variables)
Measurement error on the RHS (equivalent to
omitted variables)
How do panel data fit into this?
We can use the usual models.
We can use far more elaborate models
We can study effects through time
Observations are surely correlated.
The same individual is observed more than once
Unobserved heterogeneity that appears in the disturbance in a
cross section remains persistent across observations (on the
same ‘unit’).
Procedures must be adjusted.
Dynamic effects are likely to be present.
Model Selection
Regression models: Fit measure = R²
Nested models: log likelihood, GMM criterion
function (distance function)
Akaike information criterion: AIC = (-2logL + 2K)/N
Bayes (Schwarz) information criterion: BIC = (-2logL + K(logN))/N
(smaller values are preferred)
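As a rough illustration of how the information criteria trade fit against the number of parameters, the sketch below uses the common per-observation forms (-2logL + 2K)/N and (-2logL + K logN)/N, for which smaller is better. This sign and scaling convention is an assumption of the example, and the log likelihood values are invented.

```python
import math

def aic(log_l, k, n):
    """Akaike criterion per observation: (-2 logL + 2K) / N."""
    return (-2.0 * log_l + 2.0 * k) / n

def bic(log_l, k, n):
    """Bayes (Schwarz) criterion per observation: (-2 logL + K log N) / N."""
    return (-2.0 * log_l + k * math.log(n)) / n

N_OBS = 100
# Hypothetical fitted models: (log likelihood, number of parameters K).
model_a = (-120.0, 3)
model_b = (-118.5, 5)  # fits slightly better but uses two more parameters

aic_a, aic_b = aic(*model_a, N_OBS), aic(*model_b, N_OBS)
bic_a, bic_b = bic(*model_a, N_OBS), bic(*model_b, N_OBS)
print(aic_a, aic_b, bic_a, bic_b)
```

With these invented numbers both criteria prefer the smaller model, and the BIC penalises the two extra parameters more heavily than the AIC because log(100) exceeds 2.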
Estimation of the Parameters
Least squares, LAD, other estimators – we will
focus on least squares
$b = (X'X)^{-1} X'y$
$s^2 = e'e/N$ or $e'e/(N-K)$
Classical estimation of β
Properties
Statistical inference: Hypothesis tests
Prediction
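The least squares formulas can be traced by hand for a regression with a constant and one regressor (K = 2), where X'X is a 2×2 matrix that can be inverted explicitly. The data and the function name below are made up for illustration.

```python
def ols_with_intercept(x, y):
    """Compute b = (X'X)^{-1} X'y for X = [1, x], plus both variance estimators."""
    n = len(x)
    sx, sxx = sum(x), sum(v * v for v in x)
    sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))

    # X'X = [[n, sx], [sx, sxx]]; invert the 2x2 matrix explicitly.
    det = n * sxx - sx * sx
    inv = [[sxx / det, -sx / det],
           [-sx / det, n / det]]

    # b = (X'X)^{-1} X'y with X'y = [sy, sxy].
    b0 = inv[0][0] * sy + inv[0][1] * sxy
    b1 = inv[1][0] * sy + inv[1][1] * sxy

    # Residual sum of squares e'e and the two versions of s^2.
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    ee = sum(e * e for e in residuals)
    k = 2
    return (b0, b1), ee / n, ee / (n - k)

# Hypothetical data.
x_data = [1.0, 2.0, 3.0, 4.0, 5.0]
y_data = [2.1, 3.9, 6.2, 7.8, 10.1]

b, s2_n, s2_nk = ols_with_intercept(x_data, y_data)
print(b, s2_n, s2_nk)
```

Dividing e'e by N-K rather than N gives the larger of the two variance estimates; the difference shrinks as N grows relative to K.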
Properties of Least Squares
Finite sample properties: Unbiased, etc. No
longer interested in these.
Asymptotic properties
Consistent? Under what assumptions?
Efficient?
Contemporary work: Often not important
Efficiency within a class: GMM
Asymptotically normal: How is this established?
Robust estimation: To be considered later
Remaining to Consider for the
Linear Regression Model
Failures of standard assumptions
Heteroscedasticity
Autocorrelation
Robust estimation
Omitted variables
Measurement error
Thank
you