Short Guides to Microeconometrics
Kurt Schmidheiny, University of Basel
Fall 2023

Clustering in the Linear Model
(matrix-free version)
1 Introduction
This handout extends the handout on “The Multiple Linear Regression
model” and refers to its definitions and assumptions in section 2. It relaxes
the homoscedasticity assumption (OLS4a) and allows the error terms to
be heteroscedastic and correlated within groups or so-called clusters. It
shows in what situations the parameters of the linear model can be consis-
tently estimated by OLS and how the standard errors need to be corrected.
The canonical example (Moulton 1986, 1990) for clustering is a regres-
sion of individual outcomes (e.g. wages) on explanatory variables of which
some are observed on a more aggregate level (e.g. employment growth on
the state level).
Clustering also arises when the sampling mechanism first draws a ran-
dom sample of groups (e.g. schools, households, towns) and then surveys
all (or a random sample of) observations within that group. Stratified
sampling, where some observations are intentionally under- or oversam-
pled, calls for more sophisticated techniques.
2 The Econometric Model
Consider the multiple linear regression model
ygi = β0 + β1 xgi1 + ... + βK xgiK + ugi
where observations belong to a cluster g = 1, ..., G and observations are
indexed by i = 1, ..., M within their cluster. G is the number of clusters,
Version: 3-1-2024, 14:10
M is the number of observations per cluster, and N = Σg M = GM is
the total number of observations. For notational simplicity, M is assumed
constant in this handout. It is easily generalized to a cluster specific
number Mg . ygi is the dependent variable, xgi1 , ..., xgiK are K explanatory
variables, β0 , ..., βK are K + 1 parameters, and ugi is the error term.
The data generation process (dgp) is fully described by:
CL1: Linearity
ygi = β0 + β1 xgi1 + ... + βK xgiK + ugi and E[ugi ] = 0
CL2: Independence
{xg11, ..., xgMK, yg1, ..., ygM} for g = 1, ..., G
i.i.d. (independent and identically distributed)
CL2 assumes that the observations in one cluster are independent from
the observations in all other clusters. It does not assume independence of
the observations within clusters.
CL3: Strict Exogeneity
a) ugi | xg11, ..., xgMK ∼ N(0, σ²gi)
b) ∀ j, k: ugi ⊥ xgjk (independent)
c) E[ugi | xg11, ..., xgMK] = 0 (mean independent)
d) ∀ k, j: Cov[xgjk, ugi] = 0 (uncorrelated)
CL3 assumes that the error term ugi is unrelated to all explanatory vari-
ables of all observations within its cluster.
CL4: Clustered Errors
V[ugi | xg11, ..., xgMK] = σ²gi, with 0 < σ²gi < ∞
Cov[ugi, ugj | xg11, ..., xgMK] = ρgij σgi σgj < ∞, for all i ≠ j
CL4 means that the error terms are allowed to have different variances and
to be correlated within clusters conditional on all explanatory variables
of all observations within the cluster.
Under CL2, CL3c and CL4, the conditional variances and covariances
across all error terms are
V(ugi | xg11, ..., xgMK) = σ²gi
Cov(ugi, ugj | xg11, ..., xgMK) = ρgij σgi σgj, i ≠ j
Cov(ugi, uhj | xg11, ..., xgMK, xh11, ..., xhMK) = 0, g ≠ h
CL5: Identifiability
(1, xgi1, ..., xgiK) are not linearly dependent
0 < V[xgik] < ∞ and 0 < V̂[xgik]
CL5 assumes that the regressors have identifying variation (non-zero vari-
ance) and are not perfectly collinear.
3 A Special Case: Random Cluster-specific Effects
Suppose, as in Moulton (1986), that the error term ugi consists of a
cluster-specific random effect cg and an individual effect vgi

ugi = cg + vgi
Assume that the individual error term is strictly exogenous, homoscedastic
and independent across all observations
E[vgi | xg11, ..., xgMK] = 0
V[vgi | xg11, ..., xgMK] = σ²v
Cov[vgi, vgj | xg11, ..., xgMK] = 0, i ≠ j
and that the cluster specific effect is exogenous, homoscedastic and un-
correlated with the individual effect
E[cg | xg11, ..., xgMK] = 0
V[cg | xg11, ..., xgMK] = σ²c
Cov[cg, vgi | xg11, ..., xgMK] = 0
The resulting variances and covariances of the combined error term
ugi = cg + vgi are then within each cluster g
V[ugi | xg11, ..., xgMK] = σ²u
Cov[ugi, ugj | xg11, ..., xgMK] = ρu σ²u, i ≠ j

where σ²u = σ²c + σ²v and ρu = σ²c/(σ²c + σ²v). This structure is called
equicorrelated errors. In a less restrictive version, σ²u and ρu are allowed
to be cluster specific as functions of xg11, ..., xgMK.
Note: this structure is formally identical to a random effects model for
panel data with many “individuals” g observed over few “time periods” i.
The cluster-specific random effect is also called an unrelated effect.
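To make the equicorrelated structure concrete, here is a minimal R simulation sketch (my own illustration, not part of the handout; all parameter values are made up). It draws ugi = cg + vgi with σc = 1 and σv = 2 and checks that two observations from the same cluster have correlation close to ρu = σ²c/(σ²c + σ²v) = 0.2.

# simulate equicorrelated errors u_gi = c_g + v_gi (illustrative values)
set.seed(123)
G <- 1000; M <- 10                  # clusters and observations per cluster
sigma_c <- 1; sigma_v <- 2          # standard deviations of c_g and v_gi
c_g  <- rep(rnorm(G, 0, sigma_c), each = M)
v_gi <- rnorm(G * M, 0, sigma_v)
u    <- c_g + v_gi
u_mat <- matrix(u, nrow = M)        # one column per cluster
cor(u_mat[1, ], u_mat[2, ])         # close to sigma_c^2/(sigma_c^2+sigma_v^2) = 0.2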
4 Estimation with OLS
The parameter β can be estimated with OLS by regressing ygi on a con-
stant and on xgi1 , · · · , xgiK . In the special case with one regressor xgi ,
the resulting OLS estimators of β0 and β1 are:
β̂1 = [ Σg Σi (xgi − x̄)(ygi − ȳ) ] / [ Σg Σi (xgi − x̄)² ]

β̂0 = ȳ − β̂1 x̄

where the sums run over g = 1, ..., G and i = 1, ..., M, ȳ = (1/GM) Σg Σi ygi, and x̄ = (1/GM) Σg Σi xgi.
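As a quick sanity check of these formulas, the following R sketch (simulated data; my own illustration, not from the handout) computes β̂0 and β̂1 by hand and compares them with lm(). Clustering affects the standard errors, not the point estimates.

# OLS point estimates by hand vs. lm() on simulated clustered data
set.seed(42)
G <- 50; M <- 20
x <- rnorm(G * M) + rep(rnorm(G), each = M)   # regressor with a cluster component
u <- rep(rnorm(G), each = M) + rnorm(G * M)   # error u_gi = c_g + v_gi
y <- 1 + 2 * x + u                            # true beta0 = 1, beta1 = 2
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)
coef(lm(y ~ x))                               # identical point estimates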
The OLS estimator of β remains unbiased in small samples under
CL1, CL2, CL3c, CL4, and CL5, and it is normally distributed if CL3a
is additionally assumed. It is consistent and approximately normally distributed
under CL1, CL2, CL3d, CL4, and CL5 in samples with a large number
of clusters. However, the OLS estimator is no longer efficient. More
importantly, the usual standard errors of the OLS estimator, and tests (t-,
F-, z-, Wald-) based on them, are no longer valid.
5 Estimating Correct Standard Errors
Under CL3c and CL4, the small-sample variance V(β̂k | x111, ..., xGMK) of β̂k
differs from the usual OLS one. It cannot be expressed easily
without matrix notation, even for the bivariate regression model. Con-
sequently, the usual estimator V̂(β̂k | x111, ..., xGMK) is incorrect, and the
usual small-sample test procedures, such as the F- or t-test, based on it
are therefore not valid.
With the number of clusters G → ∞ and fixed cluster size M =
N/G, the OLS estimator is asymptotically normally distributed under
CL1, CL2, CL3d, CL4, and CL5
√G (β̂k − βk) →d N(0, ς²)
where ς² is not easily expressed without matrix notation. The OLS esti-
mator is therefore approximately normally distributed in samples with a
large number of clusters
β̂k ∼ N(βk, Avar(β̂k))   (approximately)
where Avar(β̂k) = ς²/N can be consistently estimated with some addi-
tional assumptions on higher-order moments of xg11, ..., xgMK. For the
bivariate regression, the robust variance estimator is calculated as
Avar̂(β̂1) = [ Σg Σi Σj ûgi ûgj (xgi − x̄)(xgj − x̄) ] / [ Σg Σi (xgi − x̄)² ]²

with sums over g = 1, ..., G and i, j = 1, ..., M.
This so-called cluster-robust covariance matrix estimator is a gener-
alization of Huber (1967) and White (1980).1 It does not impose any re-
strictions on the form of either the heteroscedasticity or the correlation within
clusters (though we assumed independence of the error terms across clus-
ters). We can perform the usual z- and Wald-tests in large samples using
the cluster-robust covariance estimator.

1 Note: the cluster-robust estimator is not clearly attributed to a specific author.
Note: the cluster-robust covariance matrix estimator is consistent as the num-
ber of clusters G → ∞. In practice, we should have at least 50 clusters.
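The following R sketch (simulated data; my own illustration, not part of the handout) evaluates the robust variance estimator above “by hand” and compares it with sandwich::vcovCL, switching off the small-sample adjustments of vcovCL (type "HC0", cadjust = FALSE) so that the two numbers coincide.

# cluster-robust s.e. of the slope "by hand" vs. sandwich::vcovCL
library(sandwich)
set.seed(7)
G <- 60; M <- 10
g <- rep(1:G, each = M)                        # cluster index
x <- rnorm(G * M) + rep(rnorm(G), each = M)
y <- 1 + 2 * x + rep(rnorm(G), each = M) + rnorm(G * M)
ols  <- lm(y ~ x)
uhat <- resid(ols)
xd   <- x - mean(x)
num  <- sum(tapply(uhat * xd, g, sum)^2)       # equals the triple sum in the numerator
sqrt(num / sum(xd^2)^2)                        # by-hand cluster-robust s.e.
sqrt(vcovCL(ols, cluster = g, type = "HC0", cadjust = FALSE)[2, 2])   # same value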
Bootstrapping is an alternative method to estimate a cluster-robust
covariance matrix under the same assumptions. See the handout on “The
Bootstrap”. Clustering is addressed in the bootstrap by randomly draw-
ing clusters g (rather than individual observations gi) and taking all M
observations of each drawn cluster. This so-called block bootstrap pre-
serves all within-cluster correlation. With 20 to 50 clusters, a wild block
residual bootstrap-t should be used instead (Cameron and Miller, 2015).
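As an illustration of the resampling scheme, here is a hand-coded pairs (block) cluster bootstrap in R on simulated data (my own sketch; the wild block residual bootstrap-t recommended for few clusters is a different procedure and is shown with boottest in sections 8 and 9).

# pairs (block) cluster bootstrap by hand: draw whole clusters with replacement
set.seed(7)
G <- 60; M <- 10
dat <- data.frame(g = rep(1:G, each = M))
dat$x <- rnorm(G * M) + rep(rnorm(G), each = M)
dat$y <- 1 + 2 * dat$x + rep(rnorm(G), each = M) + rnorm(G * M)
B  <- 999
b1 <- numeric(B)
for (b in 1:B) {
  drawn   <- sample(unique(dat$g), replace = TRUE)                 # resample clusters
  bootdat <- do.call(rbind, lapply(drawn, function(k) dat[dat$g == k, ]))
  b1[b]   <- coef(lm(y ~ x, data = bootdat))[2]
}
sd(b1)   # block-bootstrap standard error of the slope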
6 Efficient Estimation with GLS
In some cases, for example with cluster specific random effects, we can es-
timate β efficiently using feasible GLS (see the handout on “Heteroscedas-
ticity in the Linear Model” and the handout on “Panel Data”). In prac-
tice, we can rarely rule out additional serial correlation beyond the one
induced by the random effect. It is therefore advisable to always use
cluster-robust standard errors in combination with FGLS estimation of
the random effects model.
7 Special Case: Estimating Correct Standard Errors
with Random Cluster-specific Effects
Moulton (1986, 1990) studies the bias of the usual OLS standard errors
for the special case with random cluster-specific effects. Assume cluster-
specific random effects in a bivariate regression:
ygi = β0 + β1 xgi + ugi
where ugi = cg + vgi with σ²u = σ²c + σ²v and ρu = σ²c/(σ²c + σ²v). Then the
(cluster-robust) asymptotic variance can be estimated as

Avar_cluster[β̂1] = [ σ̂²u / Σg Σi (xgi − x̄)² ] · [1 + (M − 1) ρ̂x ρ̂u]

where σ̂²u is the usual OLS estimator of the error variance, ρx is the within-cluster correlation of
x, and σ̂²u, ρ̂u and ρ̂x are consistent estimators of σ²u, ρu and ρx, respectively.
The robust standard error for the slope coefficient is accordingly

ŝe_cluster(β̂1) = ŝe_ols(β̂1) · √(1 + (M − 1) ρ̂x ρ̂u)

where ŝe_ols(β̂1) is the usual OLS standard error.
The factor √(1 + (M − 1) ρx ρu) ≥ 1 is called the Moulton factor and measures
how much the usual OLS standard errors understate the correct standard
errors. For example, with cluster size M = 500 and intracluster correla-
tions ρu = 0.1 and ρx = 0.1, the correct standard errors are 2.45 times
the usual OLS ones.
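A short R check of this arithmetic (my own illustration, not part of the handout):

# Moulton factor for the example in the text
moulton <- function(M, rho_x, rho_u) sqrt(1 + (M - 1) * rho_x * rho_u)
moulton(M = 500, rho_x = 0.1, rho_u = 0.1)    # approx. 2.45
# a corrected standard error would be se_ols * moulton(M, rho_x, rho_u)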
Lessons from the Moulton factor
1. If either the within cluster correlation of the combined error term u
is zero (ρu = 0) or the within cluster correlation of x is zero (ρx = 0),
then the Moulton factor is 1 and the usual OLS standard errors are
correct. Both situations generalize to K explanatory variables.
2. If the variable of interest is an aggregate variable on the level of the
cluster (hence ρx = 1), the Moulton factor is maximal. This case
generalizes to K aggregate explanatory variables:
ŝe_cluster(β̂k) = ŝe_ols(β̂k) · √(1 + (M − 1) ρ̂u)
In this situation, we need to correct the standard errors. Alterna-
tively, we could aggregate (average) all variables and run the regres-
sion on the collapsed data (see the sketch after this list).
3. If only control variables are aggregated, it is better to include cluster
fixed effects (i.e. dummy variables for the groups), which take
care of the cluster-specific effect. See also the handout on “Panel
Data: Fixed and Random Effects”.
4. If the variable of interest is not aggregated but has an important
cluster-specific component (large ρx), then including cluster fixed
effects may destroy valuable information, and it is better not to
include them. However, we still need to correct the standard
errors.
5. If only control variables have an important cluster-specific compo-
nent, it is better to include cluster fixed effects.
6. If the variable of interest has only a small cluster specific component
(i.e. a lot of within-cluster variation and very little between-cluster
variation), it is better to include cluster fixed effects.
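For lesson 2, collapsing the data to cluster means can be sketched as follows in R (simulated data; my own illustration, not part of the handout):

# with a purely cluster-level regressor, run OLS on one observation per cluster
set.seed(1)
G <- 50; M <- 20
dat <- data.frame(g = rep(1:G, each = M))
dat$xg <- rep(rnorm(G), each = M)              # aggregate regressor, constant within cluster
dat$y  <- 1 + 2 * dat$xg + rep(rnorm(G), each = M) + rnorm(G * M)
agg <- aggregate(cbind(y, xg) ~ g, data = dat, FUN = mean)   # one row per cluster
summary(lm(y ~ xg, data = agg))                # inference based on G cluster-level observations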
Standard errors are in practice most easily corrected using the Eicker-
Huber-White cluster-robust covariance from section 5 and not via the
Moulton factor. Note that we should have at least G = 50 clusters to
justify the asymptotic approximation.
In the context of panel and time series data, serial correlation beyond
that induced by a random effect becomes very important. See the handout
on “Panel Data: Fixed and Random Effects”. In this case, standard errors
need to be corrected even when fixed effects are included.
8 Implementation in Stata 17
Load example data
webuse auto7.dta
Stata reports the cluster-robust covariance estimator clustered by
manufacturer with the vce(cluster) option, e.g.2
regress price weight, vce(cluster manufacturer)
matrix list e(V)
Note: Stata multiplies V̂ by (N − 1)/(N − K − 1) · G/(G − 1) to “cor-
rect” for degrees of freedom in small samples. This practice is not based
on asymptotic theory but often produces better small-sample properties.
Stata reports p-values for the t- and F-statistics with G − 1 degrees of
freedom.
We can also estimate a cluster robust covariance using a nonparametric
block bootstrap. For example with either of the following,
regress price weight, vce(bootstrap, reps(999) cluster(manufacturer))
bootstrap, reps(999) cluster(manufacturer): regress price weight
The cluster specific random effects model is efficiently estimated by
FGLS. For example,
xtset manufacturer_grp
xtreg price weight, re
In addition, cluster-robust standard errors are reported with
xtreg price weight, re vce(cluster manufacturer)
The wild block residual bootstrap-t for the slope coefficient of the
variable weight is reported by David Roodman’s command boottest
ssc install boottest
regress price weight, vce(cluster manufacturer)
boottest weight=0, reps(99999)
2 There are only 23 clusters in this example dataset used by the Stata manual. This
is not enough to justify using large-sample approximations.
9 Implementation in R 4.3.1
Load example data
library(haven)
auto <- read_dta("http://www.stata-press.com/data/r17/auto7.dta")
First, we estimate the regression with the usual command
ols <- lm(price~weight, data=auto)
summary(ols)
The cluster-robust covariance estimator clustered by manufacturer is
calculated and reported with the packages sandwich and lmtest3
library(sandwich)
library(lmtest)
coeftest(ols, vcov = vcovCL, cluster = ~manufacturer)
The following commands are equivalent
coeftest(ols, vcov = vcovCL, cluster = ~manufacturer, cadjust=TRUE)
coeftest(ols, vcov = vcovCL(ols, cluster = ~manufacturer))
coeftest(ols, vcov = vcovCL(ols, type="HC1", cluster = ~manufacturer))
Note: The above commands multiply V̂ by (N − 1)/(N − K − 1) · G/(G −
1) to “correct” for degrees of freedom in small samples. R reports p-values
for the t- and F-statistics with N − K − 1 degrees of freedom.
We can also estimate a cluster robust covariance using a nonparametric
block bootstrap
coeftest(ols, vcov = vcovBS, cluster = ~manufacturer, R=999)
The wild block residual bootstrap-t for the slope coefficient of the
variable weight is calculated by David Roodman’s algorithm in boottest
library(fwildclusterboot)
wild <- boottest(ols, param="weight", clustid=c("manufacturer"),
B=99999, type="rademacher", impose_null=TRUE,
p_val_type="two-tailed")
summary(wild)
3 There are only 23 clusters in this example dataset used by the Stata manual. This
is not enough to justify using large-sample approximations.
References
Advanced textbooks
Cameron, A. Colin and Pravin K. Trivedi (2005), Microeconometrics:
Methods and Applications, Cambridge University Press. Section 24.5.
Wooldridge, Jeffrey M. (2002), Econometric Analysis of Cross Section and
Panel Data, MIT Press. Sections 7.8 and 11.5.
Companion textbooks
Angrist, Joshua D. and Jörn-Steffen Pischke (2009), Mostly Harmless
Econometrics: An Empiricist’s Companion, Princeton University Press.
Chapter 8.
Articles
Cameron, A. Colin and Douglas L. Miller (2015), A Practitioner’s Guide
to Cluster-Robust Inference, Journal of Human Resources, 50(2), 317-372.
Moulton, B. R. (1986), Random Group Effects and the Precision of Re-
gression Estimates, Journal of Econometrics, 32(3), 385-397.
Moulton, B. R. (1990), An Illustration of a Pitfall in Estimating the Ef-
fects of Aggregate Variables on Micro Units, The Review of Economics
and Statistics, 72, 334-338.