Real Stats
Econometrics for Political Science, Public Policy, and Economics

Michael A. Bailey

© 2014 Oxford University Press
CONTENTS

Foreword for Instructors: How to Help Your Students Learn Statistics xix

Foreword for Students: How This Book Can Help You Learn Statistics xxviii

1 The Quest for Causality 1


1.1 The Core Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Two Challenges: Randomness and Endogeneity . . . . . . . . . . . . . . . . 13
Case Study: Flu Shots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Case Study: Country Music and Suicide . . . . . . . . . . . . . . . . . . . . . . . 25
1.3 Randomized Experiments as the Gold Standard . . . . . . . . . . . . . . . . 31
1.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

2 Stats in the Wild: Good Data Practices 40


2.1 Know Our Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.2 Replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Case Study: Violent Crime in United States . . . . . . . . . . . . . . . . . . . . . 51
2.3 Statistical Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Computing Corner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

I The OLS Framework 64

3 Bivariate OLS: The Foundation of Statistical Analysis 65


3.1 Bivariate Regression Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
The bivariate model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Bivariate OLS and presidential elections . . . . . . . . . . . . . . . . . . . 74
3.2 Random Variation in Coefficient Estimates . . . . . . . . . . . . . . . . . . . 78
β̂ estimates are random . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Distributions of β̂ estimates . . . . . . . . . . . . . . . . . . . . . . . . 80
β̂ estimates are normally distributed . . . . . . . . . . . . . . . . . . . 84
3.3 Exogeneity and Unbiasedness . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Conditions for unbiased estimates . . . . . . . . . . . . . . . . . . . . . . . 87
Bias in crime and ice cream example . . . . . . . . . . . . . . . . . . . . . 90
Characterizing bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.4 Precision of Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.5 Probability Limits and Consistency . . . . . . . . . . . . . . . . . . . . . . . 99
3.6 Solvable Problems: Heteroscedasticity and Correlated Errors . . . . . . . . . 102
Homoscedasticity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Errors uncorrelated with each other . . . . . . . . . . . . . . . . . . . . . . 104
3.7 Goodness of Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Standard error of the regression (σ̂) . . . . . . . . . . . . . . . . . . . . 107
Plot of the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
R2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Case Study: Height and Wages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
3.8 Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
3.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Computing Corner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

4 Hypothesis Testing and Interval Estimation: Answering Research Questions 135
4.1 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
OLS coefficients under the null hypothesis for the presidential election example 141
Significance level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
4.2 t tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
β̂1 and standard errors . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

The t distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Critical values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
t statistics for the height and wage example . . . . . . . . . . . . . . . . . 158
Other types of null hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . 159
4.3 p-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
4.4 Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Incorrectly failing to reject the null hypothesis . . . . . . . . . . . . . . . . 166
Calculating power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Power curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
When to care about power . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
4.5 Straight Talk about Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . 175
4.6 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
Computing Corner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

5 Multivariate OLS: Where the Action Is 192


5.1 Using Multivariate OLS to Fight Endogeneity . . . . . . . . . . . . . . . . . 195
Multivariate OLS in action: retail sales . . . . . . . . . . . . . . . . . . . . 195
Multivariate OLS in action: height and wages . . . . . . . . . . . . . . . . 199
Estimation process for multivariate OLS . . . . . . . . . . . . . . . . . . . 203
5.2 Omitted Variable Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
Omitted variable bias in more complicated models . . . . . . . . . . . . . . 214
Case Study: Does Education Support Economic Growth? . . . . . . . . . . . . . . 215
5.3 Measurement Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
Measurement error in the dependent variable . . . . . . . . . . . . . . . . 220
Measurement error in the independent variable . . . . . . . . . . . . . . . 221
5.4 Precision and Goodness of Fit . . . . . . . . . . . . . . . . . . . . . . . . . . 224
Variance of coefficient estimates . . . . . . . . . . . . . . . . . . . . . . . . 224
Multicollinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
Goodness of fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
Inclusion of irrelevant variables . . . . . . . . . . . . . . . . . . . . . . . . 231
Case Study: Institutions and Human Rights . . . . . . . . . . . . . . . . . . . . . 235
5.5 Model Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
Three model specification challenges . . . . . . . . . . . . . . . . . . . . . 241
Creating and reporting credible results . . . . . . . . . . . . . . . . . . . . 243
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244

Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
Computing Corner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248

6 Dummy Variables: Smarter Than You Think 254


6.1 Using Bivariate OLS to Assess Difference of Means . . . . . . . . . . . . . . 257
Regression model for difference of means tests . . . . . . . . . . . . . . . . 257
Difference of means and views about President Obama . . . . . . . . . . . 262
Case Study: Sex Differences in Heights . . . . . . . . . . . . . . . . . . . . . . . . 267
6.2 Dummy Independent Variables in Multivariate OLS . . . . . . . . . . . . . . 272
6.3 Transforming Categorical Variables to Multiple Dummy Variables . . . . . . 276
Categorical variables in regression models . . . . . . . . . . . . . . . . . . 277
Categorical variables and regional wage differences . . . . . . . . . . . . . 278
Case Study: Did Republicans Move to the Right in 2010? . . . . . . . . . . . . . . 283
6.4 Interaction Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
Case Study: Energy Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
Computing Corner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309

7 Transforming Variables, Comparing Variables 315


7.1 Quadratic and Polynomial Models . . . . . . . . . . . . . . . . . . . . . . . . 317
Linear versus non-linear models . . . . . . . . . . . . . . . . . . . . . . . . 317
Polynomial models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
Case Study: Global Warming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
7.2 Logged Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
Logs in regression models . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
Logs in height and wages example . . . . . . . . . . . . . . . . . . . . . . . 333
7.3 Standardized Coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
Challenge of comparing coefficients . . . . . . . . . . . . . . . . . . . . . . 337
Standardizing coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
7.4 Hypothesis Testing about Multiple Coefficients . . . . . . . . . . . . . . . . . 342
F tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
Case 1: Multiple coefficients equal zero under the null hypothesis . . . . . 343
Case 2: One or more coefficients equal each other under the null hypothesis 344
F tests using R2 values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345

F tests and baseball salaries . . . . . . . . . . . . . . . . . . . . . . . . . . 348
Case Study: Comparing Effects of Height Measures . . . . . . . . . . . . . . . . . 352
7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
Computing Corner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360

II The Contemporary Statistical Toolkit 365

8 Using Fixed Effects Models to Fight Endogeneity in Panel Data and Difference-in-Difference Models 366
8.1 The Problem with Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
Test score example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
8.2 Fixed Effects Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
Least squares dummy variable approach . . . . . . . . . . . . . . . . . . . 378
De-meaned approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
8.3 Working with Fixed Effects Models . . . . . . . . . . . . . . . . . . . . . . . 387
8.4 Two-way Fixed Effects Model . . . . . . . . . . . . . . . . . . . . . . . . . . 393
Case Study: Trade and Alliances . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
8.5 Difference-in-difference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
Difference-in-difference logic . . . . . . . . . . . . . . . . . . . . . . . . . . 401
Using OLS to estimate difference-in-difference models . . . . . . . . . . . . 402
Difference-in-difference models for panel data . . . . . . . . . . . . . . . . 407
8.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
Computing Corner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417

9 Instrumental Variables: Using Exogenous Variation to Fight Endogeneity 424


9.1 2SLS Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
9.2 Two-Stage Least Squares (2SLS) . . . . . . . . . . . . . . . . . . . . . . . . . 430
Endogenous and instrumental variables . . . . . . . . . . . . . . . . . . . . 430
First stage regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
Second stage regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
Two characteristics of good instruments . . . . . . . . . . . . . . . . . . . 434
Finding a good instrument is hard . . . . . . . . . . . . . . . . . . . . . . 435
Case Study: Emergency Care for Newborns . . . . . . . . . . . . . . . . . . . . . 439

9.3 Multiple Instruments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
2SLS with multiple instruments . . . . . . . . . . . . . . . . . . . . . . . . 445
Overidentification tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
9.4 Weak Instruments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
Quasi-instrumental variables are not strictly exogenous . . . . . . . . . . . 448
Weak instruments do a poor job predicting X . . . . . . . . . . . . . . . . 450
9.5 Precision of 2SLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452
9.6 Simultaneous Equation Models . . . . . . . . . . . . . . . . . . . . . . . . . 455
Endogeneity in simultaneous equation models . . . . . . . . . . . . . . . . 456
Using 2SLS for simultaneous equation models . . . . . . . . . . . . . . . . 458
Identification in simultaneous equation models . . . . . . . . . . . . . . . . 458
Case Study: Support for President Bush and the Iraq War . . . . . . . . . . . . . 461
9.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
Computing Corner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472

10 Experiments: Dealing with Real-World Challenges 478


10.1 Randomization and Balance . . . . . . . . . . . . . . . . . . . . . . . . . . . 481
Blocking to ensure similar treatment and control groups . . . . . . . . . . 482
Reasons why treatment and control groups may differ . . . . . . . . . . . . 482
Checking for balance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484
What to do if treatment and control groups differ . . . . . . . . . . . . . . 486
Case Study: Development Aid and Balancing . . . . . . . . . . . . . . . . . . . . 488
10.2 Compliance and Intention-to-treat Models . . . . . . . . . . . . . . . . . . . 491
Non-compliance and endogeneity . . . . . . . . . . . . . . . . . . . . . . . 492
Schematic representation of the non-compliance problem . . . . . . . . . . 493
Intention-to-treat models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496
10.3 Using 2SLS to Deal with Non-compliance . . . . . . . . . . . . . . . . . . . . 500
Example of using 2SLS to deal with non-compliance . . . . . . . . . . . . . 501
Understanding variables in 2SLS models of non-compliance . . . . . . . . . 505
Case Study: Minneapolis Domestic Violence Experiment . . . . . . . . . . . . . . 508
10.4 Attrition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
Attrition and endogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
Detecting problematic attrition . . . . . . . . . . . . . . . . . . . . . . . . 516
Dealing with attrition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
Case Study: Health Insurance and Attrition . . . . . . . . . . . . . . . . . . . . . 520
10.5 Natural Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524

Case Study: Crime and Terror Alerts . . . . . . . . . . . . . . . . . . . . . . . . . 528
10.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534
Computing Corner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535

11 Regression Discontinuity: Looking for Jumps in Data 541


11.1 Basic Regression Discontinuity Model . . . . . . . . . . . . . . . . . . . . . . 545
The assignment variable in regression discontinuity models . . . . . . . . . 545
Graphical representation of regression discontinuity models . . . . . . . . . 547
The key assumption in regression discontinuity models . . . . . . . . . . . 549
11.2 More Flexible RD Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553
Varying slopes model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 554
Polynomial model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
11.3 Windows and Bins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 561
Adjusting the window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 561
Binned graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
Case Study: Universal Pre-kindergarten . . . . . . . . . . . . . . . . . . . . . . . 566
11.4 Limitations and Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
Imperfect assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 570
Discontinuous error distribution at threshold . . . . . . . . . . . . . . . . . 571
Diagnostic tests for RD models . . . . . . . . . . . . . . . . . . . . . . . . 572
Generalizability of RD results . . . . . . . . . . . . . . . . . . . . . . . . . 575
Case Study: Alcohol and Grades . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
11.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583
Computing Corner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584

III Limited Dependent Variables 590

12 Dummy Dependent Variables 591


12.1 Linear Probability Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 592
LPM and the expected value of Y . . . . . . . . . . . . . . . . . . . . . . . 593
Limits to LPM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
12.2 Using Latent Variables to Explain Observed Variables . . . . . . . . . . . . . 600
S-curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 600

Latent variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
12.3 Probit and Logit Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
Probit model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606
Logit model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 610
12.4 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
Properties of MLE estimates . . . . . . . . . . . . . . . . . . . . . . . . . . 613
Fitted values from the probit model . . . . . . . . . . . . . . . . . . . . . . 613
Fitted values from the logit model . . . . . . . . . . . . . . . . . . . . . . 616
Goodness of fit for MLE models . . . . . . . . . . . . . . . . . . . . . . . . 616
12.5 Interpreting Probit and Logit Coefficients . . . . . . . . . . . . . . . . . . . . 619
The effect of X1 depends on the value of X1 . . . . . . . . . . . . . . . . . 619
The effect of X1 depends on the values of the other independent variables . 621
Case Study: Dog Politics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626
12.6 Hypothesis Testing about Multiple Coefficients . . . . . . . . . . . . . . . . . 631
Case Study: Civil Wars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638
12.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 642
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644
Computing Corner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 649

IV Advanced Material 656

13 Time Series: Dealing with Stickiness over Time 657


13.1 Modeling Autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 659
Model with autoregressive error . . . . . . . . . . . . . . . . . . . . . . . . 660
Examples of autocorrelated errors . . . . . . . . . . . . . . . . . . . . . . . 661
13.2 Detecting Autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664
Using graphical methods to detect autocorrelation . . . . . . . . . . . . . . 665
Using an auxiliary regression to detect autocorrelation . . . . . . . . . . . 666
13.3 Fixing Autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
ρ-transforming the data . . . . . . . . . . . . . . . . . . . . . . . . . . 670
Estimating a ρ-transformed model . . . . . . . . . . . . . . . . . . . . 672
Case Study: Global Temperature Changes Using an AR(1) Model . . . . . . . . . 675
13.4 Dynamic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 678
Dynamic models include the lagged dependent variable . . . . . . . . . . . 679
Three ways dynamic models differ from OLS models . . . . . . . . . . . . 679
13.5 Stationarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683

Nonstationarity as a unit root process . . . . . . . . . . . . . . . . . . . . 684
Nonstationarity and spurious results . . . . . . . . . . . . . . . . . . . . . 686
Spurious results are less likely with stationary data . . . . . . . . . . . . . 689
Detecting unit roots and nonstationarity . . . . . . . . . . . . . . . . . . . 691
How to handle nonstationarity . . . . . . . . . . . . . . . . . . . . . . . . . 693
Case Study: Dynamic Model of Global Temperature . . . . . . . . . . . . . . . . 695
13.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 700
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 703
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 704
Computing Corner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 705
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 706

14 Advanced OLS 709


14.1 How to Derive the OLS Estimator and Prove Unbiasedness . . . . . . . . . . 710
Deriving the OLS estimator . . . . . . . . . . . . . . . . . . . . . . . . . . 710
Properties of OLS estimates . . . . . . . . . . . . . . . . . . . . . . . . . . 713
14.2 How to Derive the Equation for the Variance of β̂1 . . . . . . . . . . . . . . 718
14.3 How to Derive the Omitted Variable Bias Conditions . . . . . . . . . . . . . 721
14.4 Anticipating the Sign of Omitted Variable Bias . . . . . . . . . . . . . . . . 725
14.5 Omitted Variable Bias with Multiple Variables . . . . . . . . . . . . . . . . . 729
14.6 Omitted Variable Bias Due to Measurement Error . . . . . . . . . . . . . . . 732
Model with one independent variable . . . . . . . . . . . . . . . . . . . . . 732
Measurement error with multiple independent variables . . . . . . . . . . . 734
14.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 736

15 Advanced Panel Data 737


15.1 Panel Data Models with Serially Correlated Errors . . . . . . . . . . . . . . 738
Autocorrelation without fixed effects . . . . . . . . . . . . . . . . . . . . . 739
Autocorrelation with fixed effects . . . . . . . . . . . . . . . . . . . . . . . 740
15.2 Temporal Dependence with a Lagged Dependent Variable . . . . . . . . . . . 741
Lagged dependent variable without fixed effects . . . . . . . . . . . . . . . 742
Lagged dependent variable with fixed effects . . . . . . . . . . . . . . . . . 744
Two ways to estimate dynamic panel data models with fixed effects . . . . 746
15.3 Random Effects Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 748
15.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 751
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 752
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 753

16 Conclusion: How to Be a Statistical Realist 754
16.1 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 761

Acknowledgements 763

Appendices 765

Math and Probability Background 766


A Summation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 766
B Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 767
C Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 768
D Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 770
E Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 771
F Probability Density Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 771
G Normal Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 774
H Other Useful Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 783
The χ2 distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 783
The t distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 785
The F distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 786
I Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 789
J Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 792
K Computing Corner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 793

Citations and Additional Notes 795

Guide to Selected Discussion Questions 813

Bibliography 827

Index 839

Glossary 844

LIST OF TABLES

1.1 Donut Consumption and Weight . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1 Descriptive Statistics for Donut and Weight Data . . . . . . . . . . . . . . . 44


2.2 Frequency Table for Male Variable in Donut Data Set . . . . . . . . . . . . . 44
2.3 Frequency Table for Male Variable in Second Donut Data Set . . . . . . . . 44
2.4 Codebook for Height and Wage Data . . . . . . . . . . . . . . . . . . . . . . 49
2.5 Descriptive Statistics for State Crime Data . . . . . . . . . . . . . . . . . . . 52
2.6 Variables for Winter Olympics Questions . . . . . . . . . . . . . . . . . . . . 62
2.7 Variables for Height and Weight Data in the United States . . . . . . . . . . 62
3.1 Selected Observations from Election and Income Data . . . . . . . . . . . . 77
3.2 Effect of Height on Wages . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
3.3 OLS Models of Crime in U.S. States . . . . . . . . . . . . . . . . . . . . . . 119
3.4 Variables for Questions on Presidential Elections and the Economy . . . . . 131
3.5 Variables for Height and Weight Data in Britain . . . . . . . . . . . . . . . . 133
3.6 Variables for Divorce Rate and Hours Worked . . . . . . . . . . . . . . . . . 134
4.1 Type I and Type II Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
4.2 Effect of Income Changes on Presidential Elections . . . . . . . . . . . . . . 142
4.3 Decision Rules for Various Alternative Hypotheses . . . . . . . . . . . . . . . 154
4.4 Critical Values for t distribution . . . . . . . . . . . . . . . . . . . . . . . . . 157
4.5 Effect of Height on Wages with t Statistics . . . . . . . . . . . . . . . . . . . 158
4.6 Calculating Confidence Intervals for Large Samples . . . . . . . . . . . . . . 181
4.7 Variables for Height and Weight Data in United States . . . . . . . . . . . . 187


5.1 Bivariate and Multivariate Results for Retail Sales Data . . . . . . . . . . . 198
5.2 Bivariate and Multiple Multivariate Results for Height and Wages Data . . . 201
5.3 Economic Growth and Education Using Multiple Measures of Education . . 216
5.4 Effects of Judicial Independence on Human Rights - Including Democracy
Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
5.5 Variables for Height and Weight Data in the United States . . . . . . . . . . 249
5.6 Variables for Cell Phones and Traffic Deaths Questions . . . . . . . . . . . . 250
5.7 Variables for Speeding Ticket Data . . . . . . . . . . . . . . . . . . . . . . . 251
5.8 Variables for Height and Weight Data in Britain . . . . . . . . . . . . . . . . 252

6.1 Feeling Thermometer Toward Barack Obama . . . . . . . . . . . . . . . . . . 263


6.2 Difference of Means Test for Height and Gender . . . . . . . . . . . . . . . . 269
6.3 Difference of Means Test for Height and Gender . . . . . . . . . . . . . . . . 272
6.4 Manchester City Example with Dummy and Continuous Independent Variables 273
6.5 Wages and Region Using Different Excluded Categories . . . . . . . . . . . . 279
6.6 Hypothetical Results for Wages and Region Using Different Excluded Categories 282
6.7 Difference of Means of Conservatism of Republicans Elected in 2010 . . . . . 284
6.8 Multivariate OLS Analysis of Republicans Elected in 2010 . . . . . . . . . . 289
6.9 Interpreting Coefficients in Dummy Interaction Model: Yi = β0 + β1 Xi + β2 Di + β3 Xi × Di . . . 296
6.10 Programmable Thermostat and Home Heating Bill . . . . . . . . . . . . . . 302
6.11 Variables for Monetary Policy Questions . . . . . . . . . . . . . . . . . . . . 310
6.12 Variables for Speeding Ticket Data . . . . . . . . . . . . . . . . . . . . . . . 312
7.1 Global Temperature from 1879 to 2012 . . . . . . . . . . . . . . . . . . . . . 328
7.2 Different Logged Models of Relationship Between Height and Wages . . . . . 334
7.3 Determinants of Major League Baseball Salaries, 1985 - 2005 . . . . . . . . . 338
7.4 Means and Standard Deviations of Baseball Variables . . . . . . . . . . . . . 339
7.5 Means and Standard Deviations of Baseball Variables . . . . . . . . . . . . . 340
7.6 Standardized Determinants of Major League Baseball Salaries, 1985 - 2005 . 341
7.7 Unrestricted and Restricted Models for F tests . . . . . . . . . . . . . . . . . 353
7.8 Variables for Political Instability Questions . . . . . . . . . . . . . . . . . . . 361
7.9 Variables for Global Education Questions . . . . . . . . . . . . . . . . . . . . 362
7.10 Variables for Height and Weight Data in Britain . . . . . . . . . . . . . . . . 363
7.11 Variables for Speeding Ticket Data . . . . . . . . . . . . . . . . . . . . . . . 364

8.1 Basic OLS Analysis of Burglary and Police Officers, 1951-1992 . . . . . . . . 370
8.2 Example of Robbery and Police Data for Cities in California . . . . . . . . . 379
8.3 Robberies and Police Data for Hypothetical Cities in California . . . . . . . 384
8.4 Burglary and Police Officers, Pooled versus Fixed Effects Models, 1951-1992 385


8.5 Burglary and Police Officers, Pooled versus Fixed Effect Models, 1951-1992 . 396
8.6 Bilateral Trade, Pooled versus Fixed Effect Models, 1951-1992 . . . . . . . . 399
8.7 Effect of Stand Your Ground Laws on Homicide Rate Per 100,000 Residents 409
8.8 Variables for Presidential Approval Question . . . . . . . . . . . . . . . . . . 418
8.9 Variables for Peace Corps Question . . . . . . . . . . . . . . . . . . . . . . . 418
8.10 Variables for Teaching Evaluation Questions . . . . . . . . . . . . . . . . . . 419
8.11 Variables for the HOPE Scholarship Question . . . . . . . . . . . . . . . . . 420
8.12 Variables for the Texas School Board Data . . . . . . . . . . . . . . . . . . . 421
8.13 Variables in the Cell Phones and Traffic Deaths Panel Data Set . . . . . . . 422

9.1 Levitt (2002) Results on Effect of Police Officers on Violent Crime . . . . . . 427
9.2 Influence of Distance on NICU Utilization (First Stage Results) . . . . . . . 442
9.3 Influence of NICU Utilization on Baby Mortality . . . . . . . . . . . . . . . . 444
9.4 First Stage Reduced Form Regressions for Bush/Iraq War Simultaneous Equation Model . . . 465
9.5 Second Stage Results for Bush/Iraq War Simultaneous Equation Model . . . 466
9.6 Variables for Rainfall and Economic Growth Question . . . . . . . . . . . . . 472
9.7 Variables for News Program Question . . . . . . . . . . . . . . . . . . . . . . 474
9.8 Variables for Fish Market Question . . . . . . . . . . . . . . . . . . . . . . . 474
9.9 Variables for Education and Crime Questions . . . . . . . . . . . . . . . . . 475
9.10 Variables for Income and Democracy Questions . . . . . . . . . . . . . . . . 476
10.1 Balancing Tests for Progresa Experiment: Differences of Means Tests Using
OLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
10.2 First Stage Regression in Campaign Experiment: Explaining Contact . . . . 503
10.3 Second Stage Regression in Campaign Experiment: Explaining Turnout . . . 505
10.4 Various Measures of Campaign Contact in 2SLS Model for Selected Observations 506
10.5 First Stage Regression in Domestic Violence Experiment: Explaining Arrests 511
10.6 Selected Observations for Minneapolis Domestic Violence Experiment . . . . 512
10.7 Analyzing Domestic Violence Experiment Using Different Estimators . . . . 513
10.8 Effect of Terror Alerts on Crime . . . . . . . . . . . . . . . . . . . . . . . . 530
10.9 Variables for Get-out-the-vote Experiment from Gerber and Green (2005) . . 535
10.10 Variables for Resume Experiment . . . . . . . . . . . . . . . . . . . . . . 537
10.11 Variables for Afghan School Experiment . . . . . . . . . . . . . . . . . . . 539
11.1 RD Analysis of Pre-K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
11.2 RD Analysis of Drinking Age and Test Scores (from Carrell, Hoekstra, and
West 2010) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 578
11.3 RD Diagnostics for Drinking Age and Test Scores (from Carrell, Hoekstra,
and West 2010) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580


11.4 Variables for Pre-kindergarten Question . . . . . . . . . . . . . . . . . . . . . 585


11.5 Variables for Congressional Ideology Question . . . . . . . . . . . . . . . . . 587
11.6 Variables for Head Start Question . . . . . . . . . . . . . . . . . . . . . . . . 588
12.1 LPM Model of the Probability of Admission to Law School . . . . . . . . . . 594
12.2 Sample Probit Results for Discussion Questions . . . . . . . . . . . . . . . . 618
12.3 Dog Ownership and Probability of Supporting Obama in 2008 Election . . . 627
12.4 Estimated Effect of Dog Ownership and Ideology on Probability of Supporting
Obama in 2008 Election . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629
12.5 Unrestricted and Restricted Probit Results for Likelihood Ratio Test . . . . 635
12.6 Probit Models of the Determinants of Civil Wars . . . . . . . . . . . . . . . 640
12.7 Variables for Iraq War Questions . . . . . . . . . . . . . . . . . . . . . . . . 650
12.8 Variables for Global Warming Questions . . . . . . . . . . . . . . . . . . . . 651
12.9 Variables for Football Coach Questions . . . . . . . . . . . . . . . . . . . . . 653
12.10 Variables for Donor Experiment . . . . . . . . . . . . . . . . . . . . . . . 654
12.11 Balance Tests for Donor Experiment . . . . . . . . . . . . . . . . . . . . . 655
13.1 Detecting Autocorrelation Using OLS and Lagged Error Model . . . . . . . . 669
13.2 Example of ρ-transformed Data (for ρ̂ = 0.5) . . . . . . . . . . . . . . . . 673
13.3 Global Temperature Model Estimated Using OLS and Via ρ-transformed Data 677
13.4 Dickey-Fuller Tests for Stationarity . . . . . . . . . . . . . . . . . . . . . . . 699
13.5 Change in Temperature as a Function of Change in Carbon Dioxide and Other
Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 701
13.6 Variables for James Bond Movie Questions . . . . . . . . . . . . . . . . . . . 707
14.1 Effect of Omitting X2 on Coefficient Estimate for X1 . . . . . . . . . . . . . 728
A.1 Examples of Standardized Values . . . . . . . . . . . . . . . . . . . . . . . . 781

A.2 Values of β0, β1, β2 and β3 in Figure 8.6 . . . . . . . . . . . . . . . . . . . 820

LIST OF FIGURES

1.1 Inference and Its Discontents . . . . . . . . . . . . . . . . . . . . . . . . . . . 3


1.2 Weight and Donuts in Springfield . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Regression Line for Weight and Donuts in Springfield . . . . . . . . . . . . . 9
1.4 Examples of Lines Generated by Core Statistical Model . . . . . . . . . . . . 12
1.5 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.6 Two Scenarios for the Relationship Between Flu Shots and Health . . . . . . 23
2.1 Two Versions of Debt and Growth Data . . . . . . . . . . . . . . . . . . . . 41
2.2 Weight and Donuts in Springfield . . . . . . . . . . . . . . . . . . . . . . . . 45
2.3 Scatterplots of Violent Crime Against Percent Urban, Single Parent, and Poverty 53
3.1 Relationship Between Income Growth and Vote for the Incumbent President’s
Party, 1948-2012 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.2 Elections and Income Growth with Model Parameters Indicated . . . . . . . 75
3.3 Fitted Values and Residuals for Observations in Table 3.1 . . . . . . . . . . . 76
3.4 Four Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.5 Distribution of β̂1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.6 Two Distributions with Different Variances of β̂1 . . . . . . . . . . . . . . . . 94
3.7 Four Scatterplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.8 Distributions of β̂1 for different sample sizes . . . . . . . . . . . . . . . . . . 100
3.9 Plots with Different Goodness of Fit . . . . . . . . . . . . . . . . . . . . . . 110
3.10 Height and Wages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
3.11 Scatterplot of Violent Crime and Percent Urban . . . . . . . . . . . . . . . . 118
3.12 Scatterplots of Crime Against Percent Urban, Single Parent, and Poverty with
OLS Fitted Lines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120


4.1 Distribution of β̂1 Under the Null Hypothesis for Presidential Election Example 143
4.2 Distribution of β̂1 Under the Null Hypothesis with Larger Standard Error for Presidential-Election Example . . . 149
4.3 Three t distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
4.4 Critical Values for Large Sample t tests . . . . . . . . . . . . . . . . . . . . . 155
4.5 Two Examples of p-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
4.6 Statistical Power for Three Values of β1, α = 0.01, and a One-Sided Alternative Hypothesis . . . 167
4.7 Power Curves for Two Values of se(β̂1) . . . . . . . . . . . . . . . . . . . . . 172
4.8 Meaning of Confidence Interval for Example of 0.41 ±0.196 . . . . . . . . . . 180
5.1 Monthly Retail Sales and Temperature in New Jersey from 1992 to 2013 . . 193
5.2 Monthly Retail Sales and Temperature in New Jersey with December Indicated 195
5.3 95% Confidence Intervals for Coefficients in Adult Height, Adolescent Height,
and Wage Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
5.4 Economic Growth, Years of School, and Test Scores . . . . . . . . . . . . . . 218
6.1 Goal Differentials for Home and Away Games for Manchester City and Manchester United . . . 256
6.2 Bivariate OLS with a Dummy Independent Variable . . . . . . . . . . . . . . 260
6.3 Scatterplot of Obama Feeling Thermometers and Party Identification . . . . 265
6.4 Three Difference of Means Tests . . . . . . . . . . . . . . . . . . . . . . . . . 266
6.5 Scatterplot of Height and Gender . . . . . . . . . . . . . . . . . . . . . . . . 268
6.6 Scatterplot of Height and Gender . . . . . . . . . . . . . . . . . . . . . . . . 271
6.7 Fitted Values for Model with Dummy Variable and Control Variable: Manchester City Example . . . 274
6.8 Relation Between Omitted Variable (Obama Vote) and Other Variables . . . 287
6.9 Confidence Intervals for Newly Elected Variable in Table 6.8 . . . . . . . . . 291
6.10 Fitted Values for Yi = β0 + β1 Xi + β2 Dummyi + β3 Dummyi × Xi + εi . . . 294
6.11 Various Fitted Lines from Dummy Interaction Models . . . . . . . . . . . . . 298
6.12 Heating Used and Heating Degree Days for Homeowner Who Installed a Programmable Thermostat . . . 300
6.13 Heating Used and Heating Degree Days with Fitted Values for Different Models 305
6.14 Marginal Effect of Text Ban as Total Miles Changes . . . . . . . . . . . . . . 311

7.1 Average Life Satisfaction by Age in the United States . . . . . . . . . . . . . 316


7.2 Life Expectancy and Per Capita GDP in 2011 for All Countries in the World 319
7.3 Linear and Quadratic Fitted Lines for Life Expectancy Data . . . . . . . . . 320
7.4 Examples of Quadratic Fitted Curves . . . . . . . . . . . . . . . . . . . . . . 324
7.5 Global Temperature Over Time . . . . . . . . . . . . . . . . . . . . . . . . . 327


8.1 Robberies and Police for Large Cities in California, 1971-1992 . . . . . . . . 371
8.2 Robberies and Police for Specified Cities in California, 1971-1992 . . . . . . . 372
8.3 Robberies and Police for Specified Cities in California with City-specific Regression Lines, 1971-1992 . . . 373
8.4 Robberies and Police for Hypothetical Cities in California . . . . . . . . . . . 383
8.5 Difference-in-difference Examples . . . . . . . . . . . . . . . . . . . . . . . . 406
8.6 More Difference-in-difference Examples . . . . . . . . . . . . . . . . . . . . . 411
10.1 Compliance and Non-compliance in Experiments . . . . . . . . . . . . . . . . 495
11.1 Drinking Age and Test Scores . . . . . . . . . . . . . . . . . . . . . . . . . . 543
11.2 Basic Regression Discontinuity Model, Yi = β0 + β1 Ti + β2 (X1i − C) . . . . 548
11.3 Possible Results with Basic RD Model . . . . . . . . . . . . . . . . . . . . . 550
11.4 Possible Results with Differing-Slopes RD Model . . . . . . . . . . . . . . . . 555
11.5 Fitted Lines for Examples of Polynomial RD Models . . . . . . . . . . . . . 558
11.6 Various Fitted Lines for RD Model of Form Yi = β0 + β1 Ti + β2 (X1i − C) + β3 (X1i − C)Ti . . . 560
11.7 Smaller Windows for Fitted Lines for Polynomial RD Model in Figure 11.5 . 563
11.8 Bin Plots for RD Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 565
11.9 Binned Graph of Test Scores and Pre-K Attendance . . . . . . . . . . . . . . 568
11.10 Histograms of Assignment Variable for RD Analysis . . . . . . . . . . . . . 573
11.11 Histogram of Age Observations for Drinking Age Case Study . . . . . . . . 579
12.1 Scatterplot of Law School Admissions Data and LPM Fitted Line . . . . . . 595
12.2 Misspecification Problem in Linear Probability Model . . . . . . . . . . . . . 598
12.3 Scatterplot of Law School Admissions Data and LPM and Probit Fitted Lines 601
12.4 Symmetry of Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . 607
12.5 PDFs and CDFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
12.6 Examples of Data and Fitted Lines Estimated by Probit . . . . . . . . . . . 614
12.7 Varying Effect of X in Probit Model . . . . . . . . . . . . . . . . . . . . . . 620
12.8 Fitted Lines from LPM, Probit, and Logit Models . . . . . . . . . . . . . . . 631
12.9 Fitted Lines from LPM and Probit Models for Civil War Data (Holding Ethnic and Religious Variables at Their Means) . . . 641
12.10 Figure Included for Some Respondents in Global Warming Survey Experiment 652

13.1 Examples of Autocorrelation . . . . . . . . . . . . . . . . . . . . . . . . . . . 663


13.2 Global Average Temperature Since 1880 . . . . . . . . . . . . . . . . . . . . 667
13.3 Global Temperature Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676
13.4 Data with Unit Roots and Spurious Regression . . . . . . . . . . . . . . . . 688
13.5 Data without Unit Roots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690


13.6 Global Temperature Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 696


A.1 An Example of a Probability Density Function (PDF) . . . . . . . . . . . . . 773
A.2 Probabilities that a Standard Normal Random Variable is Less Than Some
Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 775
A.3 Probabilities that a Standard Normal Random Variable is Greater Than Some
Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 777
A.4 Standard Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 778
A.5 Two χ2 Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 784
A.6 Four F Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 788
A.7 Identifying β0 from a Scatterplot . . . . . . . . . . . . . . . . . . . . . . . . 814

FOREWORD FOR INSTRUCTORS: HOW TO HELP YOUR STUDENTS LEARN STATISTICS

We statistics teachers have high hopes for our students. We want them to understand how statistics can shed light on important policy and political questions. Sometimes they humor us with incredible insight. The heavens part, angels sing. We want that to happen daily. Sadly, a more common experience is seeing a furrowed brow of confusion and frustration. It’s cloudy and rainy in that place.

It doesn’t have to be this way. If we distill the material down to the most critical concepts we can inspire more insight and less brow-furrowing. Unfortunately, conventional statistics books all too often manage to be too simple and too confusing at the same time. They are simple in that they hardly get past rudimentary ordinary least squares. They are too confusing in that they get there by way of covering material ranging from probability distributions to χ2 to basic OLS. These concepts do not naturally fit together and can overwhelm many students. All that to get to naive OLS? And what of the poor students up against a book that really piles it on with ANOVA, kurtosis, Kruskal’s (or is it Goodman’s?) gamma, Wilcoxon tests, and goodness knows what else?

This book is predicated on the belief that we are most effective when we teach the tools we use. What we use are regression-based tools with an increasing focus on experiments and causal inference. If students can understand these fundamental concepts, they can legitimately participate in analytically sound conversations. They can produce interesting – and believable! – analysis. They can understand experiments and the sometimes subtle analysis required when experimental methods meet social scientific reality. They can appreciate that causal effects are hard to tease out with observational data and that standard errors estimated on crap coefficients, however complex, do no one any good. They can sniff out when others are being naive or cynical. It is only when we muck around too long in the weeds of less useful material that statistics becomes the quagmire students fear.

Hence this book seeks to be analytically sophisticated in a simple and relevant way. It focuses on statistics actually used by real analysts. Nothing useless. No clutter. To do so, the book is guided by three principles: relevance, opportunity costs, and pedagogical efficiency.


Relevance

Relevance is a crucial first principle for successfully teaching statistics in the social sciences. Every experienced instructor knows that most economics, politics, and policy students care more about the real world than math. How do we get such students to engage with statistics? One option is to cajole them to care more and work harder. We all know how well that works. A better option is to show them how a sophisticated understanding of statistical concepts helps them learn more about the topics they are in class to learn about. Think of a mother trying to get a child to commit to the training necessary to play competitive sports. She could start with a semester of theory ... No, that would be cruel. And counterproductive. Much better is to let the child play and experience the joy of the sport. Then there will be time (and motivation!) to understand nuances.

Learning statistics is not that different from learning anything else. We need to care to truly learn. Therefore this book takes advantage of a careful selection of material to spend more time on the real examples that students care about.

Opportunity costs

Opportunity costs are, as we all tell our students, what we have to give up to do something. So, while some topic might be a perfectly respectable part of a statistical tool kit, we should include it only if it does not knock out something more important. The important stuff all too often gets shunted aside as we fill up the early part of students’ analytical training with statistical knick-knacks, material “some people still use” or that students “might see.”

Therefore this book goes quickly through descriptive statistics and doesn’t cover χ2 tests for two-way tables, weighted least squares, and other denizens of conventional statistics books. These concepts – and many, many more – are all perfectly legitimate. Some are covered elsewhere (descriptive statistics are covered in elementary schools these days!). Others are valuable enough that I include them in an “advanced material” section for students and instructors who want to pursue these topics further. And others simply don’t make the cut. Only by focusing the material can we get to the tools used by researchers today, tools such as panel data analysis, instrumental variables, and regression discontinuity. The core ideas behind these tools are not that difficult, but we need to make time to cover them.

Pedagogical efficiency

Pedagogical efficiency refers to streamlining the learning process by using a single unified framework. Everything in this book builds from the standard regression model. Hypothesis testing, difference of means, and experiments can be, and often are, taught independently of regression. Causal inference is sometimes taught with potential outcomes notation. There is nothing intellectually wrong with these approaches. But is doing so pedagogically efficient? If we teach these concepts as stand-alone concepts we have to take time and, more importantly, student brain space, to set up each separate approach. For students, this is really hard. Remember the furrowed brows? Students work incredibly hard to get their heads around difference of means and where to put degrees of freedom corrections and how to know if the means come from correlated groups or independent groups and what the equation is for each of these cases. Then BAM! Suddenly their professor is talking about residuals and squared deviations. It’s old hat for us, but can overwhelm students first learning the material. It is more efficient to teach the OLS framework and use that to cover difference of means, experiments, and the contemporary canon of statistical analysis, including panel data, instrumental variables, and regression discontinuity. Each tool builds from the same regression model. Students start from a comfortable place and can see the continuity that exists.
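
As a quick, concrete illustration of the equivalence this approach leans on, here is a minimal sketch using simulated data and Python/NumPy (not code from the book’s Computing Corners; variable names are hypothetical): regressing an outcome on a 0/1 group dummy reproduces the difference of means exactly, with the intercept equal to the mean of the group coded 0 and the slope equal to the gap between the two group means.

```python
# Minimal sketch, assuming simulated data: a difference of means recovered
# two ways -- directly, and as the slope from a bivariate OLS on a group dummy.
import numpy as np

rng = np.random.default_rng(0)

y0 = rng.normal(loc=50.0, scale=10.0, size=200)    # outcomes for group coded 0
y1 = rng.normal(loc=55.0, scale=10.0, size=200)    # outcomes for group coded 1

y = np.concatenate([y0, y1])
d = np.concatenate([np.zeros(200), np.ones(200)])  # 0/1 group dummy

diff_in_means = y1.mean() - y0.mean()              # difference of means, computed directly

# Bivariate OLS of y on an intercept and the dummy
X = np.column_stack([np.ones_like(d), d])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"difference of means: {diff_in_means:.3f}")
print(f"OLS slope on dummy:  {beta_hat[1]:.3f}")   # matches the difference of means
print(f"OLS intercept:       {beta_hat[0]:.3f}")   # matches y0.mean()
```

The usual OLS t statistic on the dummy is likewise the familiar pooled-variance two-sample t statistic, which is exactly the continuity the single-framework approach exploits once students reach hypothesis testing.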

An important benefit of working with a single framework is that it allows students to revisit the core model repeatedly throughout the term. Despite the brilliance of our teaching, students rarely can put it all together with one pass through the material. I know I didn’t when I was beginning. Students need to see the material a few times, work with it a bit, and then it will finally click. Can you imagine if sports were coached the way we do statistics? A tennis coach who said, “This week we’ll cover forehands (and only forehands), next week backhands (and only backhands) and the week after that serves (and only serves)” would not be a tennis coach for long. Instead, coaches introduce material, practice, and then keep working on the fundamentals. Working with a common framework throughout makes it easier to build in mini-drills about fundamentals as new material is introduced.


Course adoption

This book is organized to work well in two different kinds of courses. First, it can be used in a second semester course following a semester of probability and statistics. In such a course, students will likely be able to move quickly through the early material and then pick up where they left off, typically with multivariate OLS.

Second, this book can be used in a first semester applied course. Using this book for the first course avoids the “warehouse problem.” The warehouse problem occurs when we treat students’ statistical education as a warehouse that we first fill up with tools that we then access later. One challenge is that things rot in a warehouse. Another challenge is that instructors tend to hoard a bit, putting things in the warehouse “just in case” and creating clutter. And students undeniably find warehouse work achingly dull. Using this book in a first semester course avoids the warehouse problem by going directly to interesting and useful statistical material, providing students with a more just-in-time approach. For example, they see statistical distributions, but in the context of trying to solve a specific problem, rather than as an abstract concept that will become useful later.

The book also is designed to encourage two particularly useful pedagogical techniques.

One is interweaving, the process of weaving in material from previous lessons into later

lessons. Each section ends with a “Remember This” box that summarizes key points. Con-

necting back to these points in later lessons is remarkably effective at getting the material

into the active part of students’ brains. The more we ask students about omitted variable


bias or multicollinearity or properties of instruments (and in sometimes surprising contexts),

the more they become able to actively apply the material on their own. A second useful

teaching technique is to use frequent low-stakes quizzes. These quizzes can be based on the

summary questions and exercises at the end of each chapter. However done, such quizzes

convert students to active learners without the stress of midterm and final exams. We need

exams, too, of course, but the low-stakes quizzes do a lot of work in preparing students for

them. Brown, Roediger, and McDaniel (2014) provide an excellent discussion of these and

other teaching techniques.

Overview

The first two chapters of the book serve as introductory material. Chapter 1 lays out the

theme of how important – and hard – it is to generate unbiased estimates. This is a good time

to let students offer hypotheses about questions they care about, because these questions

can help bring to life the subsequent material. Chapter 2 introduces computer programs and

good practices. This chapter is a confidence builder that gets students over the hurdle of

using statistical software, if they are not already acclimated.

Part One covers core OLS material. Chapter 3 introduces bivariate OLS. Chapter 4

covers hypothesis testing and Chapter 5 moves to multivariate OLS. Chapters 6 and 7 move

to practical tasks such as use of dummy variables, logged variables, interactions, F-tests, and

the like.


Part Two covers essential elements of the contemporary statistical tool kit, including

panel data, instrumental variables, analysis of experiments, and regression discontinuity.

The experiments chapter uses instrumental variables, but other than that these chapters can

be covered in any order, so instructors can pick and choose among these chapters as needed.

Part Three is a single chapter on dichotomous dependent variables. It develops the linear

probability model in context of OLS and uses the probit and logit models to introduce

students to maximum likelihood. Instructors can cover this chapter before Part Two if

dichotomous dependent variables play a major role in the course.

Part Four covers advanced material. Chapter 13 covers time series models, introducing

techniques to estimate autocorrelation and dynamic time series models; this chapter can be

covered immediately after the end of Part One. Chapter 14 offers derivations of the OLS

model and additional material on omitted variable bias; this chapter can be considered as an

auxiliary to Chapters 3 and 5 for instructors seeking to expose students to derivations and

extensions of the core OLS material. Chapter 15 introduces more advanced topics in panel

data. This chapter can be considered as an auxiliary to Chapter 8 for instructors who seek

to expose students to time series aspects of panel data. It should be covered after Chapter

13.

Several appendices provide supporting material. An appendix on math and probability

covers background material ranging from mathematical functions to important concepts in

probability. The appendix on citations and additional notes is linked to the text by page


numbers and elaborates on some finer points in the text. A separate appendix provides

answers to selected discussion questions.

Teaching statistics is difficult. When the going gets tough it is tempting to blame students,

to say they are unwilling to do the work. Before we do that, we should recognize that many

students find the material quite foreign and (unfortunately) irrelevant. If we can streamline

what we teach and connect it to things students care about, we can improve our chances of

getting students to understand the material, material that is not only intrinsically interesting,

but also forms the foundation for all empirical work. When students understand, teaching

becomes easier. And better. The goal of this book is to help get us there.

FOREWORD FOR STUDENTS: HOW THIS BOOK CAN HELP YOU

LEARN STATISTICS

“Less dull than traditional texts.” – Student A.H.


“It would have been immensely helpful for me to have a textbook like this in
my classes throughout my college and graduate experience. It feels more like an
interactive learning experience than simply reading equations and facts out of a
book and being expected to absorb them.” – Student S.A.

“I wish I had had this book when I was first exposed to the material – it would
have saved a lot of time and hair-pulling...”– Student J.H.

“Material is easy to understand, hard to forget.” – Student M.H.


“Riddled with grammatical errors.” – Student C.Y.

This book introduces the statistical tools necessary to answer important questions. Do

anti-poverty programs work? Does unemployment affect inflation? Does campaign spending

affect election outcomes? These and many more questions are not only interesting, but

also important to answer correctly if we want to support policies that are good for people,

countries, and the world.


When using statistics to answer such questions, we need always to remember a single big

idea: Correlation is not causation. Just because variable Y rises when variable X rises does

not mean that variable X causes variable Y to rise. The essential goal of statistics is to

figure out when we can say that changes in variable X will lead to changes in variable Y .

This book helps us learn how to identify causal relationships with three features seldom

found in other statistics textbooks. First, it focuses on the tools that researchers use most.

These are the real stats that help us make reasonable claims about whether X causes Y .

Using these tools, we can produce analyses that others can respect. We’ll get the most out

of our data while recognizing the limits in what we can say or how confident we can be.

Our emphasis on real stats means that we skip obscure statistical tools that could come

up under certain conditions: They are not here. Statistics is too often complicated by books

and teachers trying to do too much. This book shows that we can have a sophisticated

understanding of statistical inference without being able to catalog every method that our

instructor had to learn when he or she was a student.

Second, this book works with a single unifying framework. We don’t start over with each

new concept; instead we build around a core model. That means there is a single equation

and a unifying set of assumptions that we poke, probe, and expand throughout the book.

This approach reduces the learning costs of moving through the material and allows us to

go back and revisit material. Like any skill, it is unlikely that we will fully understand any

given technique the first time we see it. We have to work at it, we have to work with it.


We’ll get comfortable, we’ll see connections. Then it will click. Whether it is jumping rope,

typing, throwing a baseball, or analyzing data, we have to do things many times to master

the skill. By sticking to a unifying framework, we have more chances to revisit what we have

already learned. You’ll also notice that I’m not afraid to repeat myself on the important

stuff. Really, I’m not afraid to repeat myself.

Third, this book uses many examples from the policy, political, and economic worlds. So

even if you do not care about “two stage least squares” or “maximum likelihood” in and

of themselves, you will see how understanding these techniques will affect what you think

about education policy, trade policy, the determinants of election outcomes and many other

interesting issues. The examples make it clear that the statistical tools developed in this

book are being used by contemporary social scientists who are actually making a difference

with their empirical work.

This book is for people who care about policy, politics, economics, and law. Many will

come to it as a course textbook. Others will find it useful as a supplement for a course that

would benefit from more intuition and context. Others will come to it outside of school, as

more and more public policy and corporate decisions are based on statistical analysis. Even

sports are discussed in statistical terms as a matter of course. (I no longer spit out my coffee

when I come across an article on regression analysis of National Hockey League players.)

The preparation necessary to use this book successfully is modest. We use basic algebra

a fair bit, being careful to explain every step. You do not need calculus to use this book. We


refer to calculus when necessary and the book certainly could be used by a course that works

through some of the concepts using calculus, but you can understand everything without

knowing calculus.

We start with two introductory chapters. Chapter 1 opens the book by laying out the

challenge of statistical inference. This is the challenge of making probabilistic yet accurate

claims about causal relations between variables. We present experiments as an ideal way

to conduct research, but also show how experiments in the real world are tricky and can’t

answer every question we care about. This chapter provides the “big picture” context for

statistical analysis that is every bit as important as the specifics that follow.

Chapter 2 provides a practical foundation related to good statistical practices. In every

statistical analysis, data meets software and if we’re not careful we lose control. This chapter

therefore seeks to inculcate good habits about documenting analysis and understanding data.

Part One consists of five chapters that constitute the heart of the book. They introduce

ordinary least squares (OLS), also known as regression analysis. Chapter 3 introduces the

most basic regression model, the bivariate OLS model. Chapter 4 shows how to use statistical

results to test hypotheses. Chapters 5 through 7 introduce the multivariate OLS model and

applications. By the end of Part One, you will understand regression and be able to control

for anything you can measure. You’ll also be able to fit curves to data and assess whether

the effects of some variables differ across groups, among other very practical and cool skills.

Part Two introduces techniques that constitute the modern statistical tool kit. These


are the techniques people use when they want to get published – or paid. These techniques

build on multivariate OLS to give us a better chance of identifying causal relations between

two variables. Chapter 8 covers a simple yet powerful way to control for many factors we

can’t measure directly. Chapter 9 covers instrumental variable techniques, which work if we

can find a variable that affects our independent variable, but not our dependent variable.

Instrumental variable techniques are a bit funky, but they can be very useful for isolating

causal effects. Chapter 10 covers randomized experiments. Such experiments are ideal in

theory, but in practice they often raise a number of statistical challenges we need to address.

Chapter 11 covers regression-discontinuity tools that can be used when we’re studying the

effect of variables that were allocated based on some fixed rule. For example, Medicare is

available to people in the United States only when they turn 65; admission to certain private

schools depends on a test score exceeding some threshold. Focusing on policies that depend

on such thresholds turns out to be a great context for conducting credible statistical analysis.

Part Three covers dichotomous dependent variable models. These are simply models

where the outcome we care about takes on two possible values. Examples include high

school graduate (someone graduates or not), unemployment (someone has a job or not), and

alliances (two countries sign an alliance treaty or not). We show how to apply OLS to such

models and then provide more elaborate models that address the deficiencies of OLS in this

context.

Part Four supplements the book with additional useful material. Chapter 14 derives im-


portant OLS results and extends discussion on specific topics that are quite useful. Chapter

13 covers time series data. The first part is a variation on OLS; the second part introduces

dynamic models that differ from OLS models in important ways. Chapter 15 goes into

greater detail on the vast literature on panel data. This chapter makes sense of how the

various strands fit together.

The book is designed to help you master the material. Each section ends with a “Re-

member This” box that highlights the key points. If you do as the box says and remember

what’s in it, you’ll have a great foundation in statistics. The glossary at the end of the book

defines key terms.

There are also discussion questions at the end of selected sections. I recommend using

these. There are two ways to learn: asking questions and answering questions. Asking

questions helps keep us engaged and on track. Answering questions helps us be realistic

about whether we’re truly on track. What we’re fighting is something cognitive psychologists

call the “illusion of explanatory depth.” That’s a fancy way of saying we don’t always know

as much as we think we do. By answering the discussion questions we can see where we

are. Answers for selected discussion questions are at the end of the book. Many of the

discussion questions also allow us to see how the concepts apply to issues we care about and,

once invested in this way, we’re no longer doing statistics for the sake of doing statistics, but

rather doing statistics to help us learn about something we care about.

Finally, you may have noticed that this book is opinionated and a bit chatty. This is not


the usual tone of statistics books, but being chatty is not the same as being dumb. You’ll see

real material, with real equations and real research – just with a few more smart-ass asides

than you may see in other stats books. This approach makes the material more accessible and

also reinforces the right mindset: Statistics is not simply a set of mathematical equations.

Instead statistics provides a set of practical tools that curious people use to learn from the

world. But don’t let the tone fool you. This book is not Statistics for Dummies; it’s Real

Stats. Learn the material and you will be well on your way to using statistics to answer

important questions.

CHAPTER 1

THE QUEST FOR CAUSALITY

How do we know what we know? Or, at least,

why do we think what we think? The modern an-

swer is evidence. In order to convince others – in

order to convince ourselves – we need to provide

information that others can verify. Something

that is a hunch or something that we simply “know” may be important, but it is not the

kind of evidence that drives the modern scientific process.

What is the basis of our evidence? In some cases, we can see cause and effect. We see a

burning candle tip over and start a fire. Now we know what caused the fire. This is perfectly

good knowledge. Sometimes in politics and policy we trace back a chain of causality in a


similar way. This process can get complicated, though. Why did Barack Obama win the

presidential election in 2008? Why did some economies handle the most recent recession

better than others? Why did crime go down in the United States in the 1990s? For these

types of questions, we are looking not only at a single candle; there are lightning strikes,

faulty wires, arsonists, and who knows what else to worry about. Clearly, it will be much

harder to trace cause and effect.

When there is no way of directly observing cause and effect, we naturally turn to data.

And data holds great promise. A building collapses during an earthquake. What about the

building led it – and not others in the same city – to collapse? Was it the building material?

The height? The design? Age? Location near a fault? While we might not be able to see

the cause directly, we can gather information on buildings that did and did not collapse. If

the older buildings were more likely to collapse, we might reasonably suspect that building

age mattered. If buildings built without steel reinforcement collapsed no matter what their

age, we might reasonably suspect that buildings with certain designs were more likely to

collapse.

And yet, we should not get overconfident. Even if old buildings were more likely to

collapse we do not know for certain that age of the building is the main explanation for

building collapse. It could be that more buildings from a certain era were designed a certain

way; it could be that there were more old buildings in a neighborhood where the seismic

activity was most severe. Or it could have been a massive coincidence. The statistics we learn


http://xkcd.com/552/

FIGURE 1.1: Inference and Its Discontents

in this book will help us identify causes and make claims about what really mattered –

and what didn’t.

As Figure 1.1 makes clear, correlation is not causation. This statement is old news. Our

task is to go to the next step – “Well, then, what does imply causation?” It will take

the whole book to fully flesh out the answer, but here’s the short version: If we can find

exogenous variation, then correlation is probably causation. Our task then will be to figure

out what exogenous variation means and how to distinguish randomness from causality as

best we can.

In this chapter we introduce three concepts at the heart of the book. Section 1.1 explains

the core model we use throughout the book. Section 1.2 introduces two things that make

statistics difficult. Neither is math (really!). One is randomness: Sometimes the luck of the


draw will lead us to observe relationships which aren’t real or fail to observe relationships

that are real. The second is endogeneity, a phenomenon that can cause us to wrongly think a

variable causes some effect when it doesn’t. Section 1.3 introduces randomized experiments

as the ideal way to overcome endogeneity. Usually, these experiments aren’t possible and even

when they are, things can go wrong. Hence, the rest of the book is about developing a tool

kit that helps us meet (or approximate) the idealized standard of randomized experiments.

1.1 The Core Model

When we talk about cause and effect we’ll refer to the outcome of interest as the dependent

variable. We refer to a possible cause as an independent variable. The dependent

variable is usually denoted as Y , called that because its value depends on the independent

variable. The independent variable, usually denoted by X, is called that because it does

whatever the hell it wants. It is the presumed cause of some change in the dependent

variable.

At root, social scientific theories posit that a change in one thing (the independent vari-

able) will lead to a change in another (the dependent variable). We’ll formalize this relation-

ship in a bit, but let’s start with an example. Suppose we’re interested in the U.S. obesity

epidemic and want to analyze the influence of snack food on health. We may wonder, for

example, if donuts cause health problems. Our model is that eating donuts (variable X, our


independent variable) causes some change in weight (variable Y , our dependent variable). If

we can find data on how many donuts people ate and how much they weighed, we might be

on the verge of a scientific breakthrough.

Let’s conjure up a small Midwestern town and do a little research. Figure 1.2 plots donuts

eaten and weights for 13 individuals from a randomly chosen town, Springfield, U.S.A. Our

raw data is displayed in Table 1.1. Each person has a line in the table. Homer is observation

1. Since he ate 14 donuts per week, Donuts₁ = 14. We’ll often refer to Xᵢ or Yᵢ, which are

the values of X and Y for person i in the dataset. The weight of the seventh person in the

data set, Smithers, is 160, meaning Weight₇ = 160 and so forth.

Table 1.1: Donut Consumption and Weight

Observation   Name                Donuts     Weight
number                            per week   (pounds)
1             Homer               14         275
2             Marge               0          141
3             Lisa                0          70
4             Bart                5          75
5             Comic Book Guy      20         310
6             Mr. Burns           0.75       80
7             Smithers            0.25       160
8             Chief Wiggum        16         263
9             Principal Skinner   3          205
10            Rev. Lovejoy        2          185
11            Ned Flanders        0.8        170
12            Patty               5          155
13            Selma               4          145


[Scatterplot: donuts per week (horizontal axis) against weight in pounds (vertical axis) for the 13 Springfield residents.]

FIGURE 1.2: Weight and Donuts in Springfield


Figure 1.2 is a scatterplot of data, with each observation located at the coordinates

defined by the independent and dependent variable. The value of donuts per week is on

the X-axis and weights are on the Y-axis. Just by looking at this plot, we sense there is a

positive relationship between donuts and weight because the more donuts eaten, the higher

the weight tends to be.

We use a simple equation to characterize the relationship between the two variables:

Weightᵢ = β₀ + β₁ Donutsᵢ + εᵢ    (1.1)

• The dependent variable is Weightᵢ, which is the weight of person i.

• The independent variable, Donutsᵢ, is how many donuts person i eats per week.

• β₁ is the slope coefficient on donuts indicating how much more1 a person weighs for

each donut eaten. (For those whose Greek is a bit rusty, β is the Greek letter beta.)

• β₀ is the constant or intercept indicating the expected weight of people who eat zero

donuts.

• εᵢ (the Greek letter epsilon) is the error term that captures anything else that affects

weight.

This equation will help us estimate the two parameters necessary to characterize a line.

Remember Y = mX + b from junior high? This is the equation for a line where Y is the
1 Or less - be optimistic!


value of the line on the vertical axis, X is the value on the horizontal axis, m is the slope

and b is the intercept, the value of Y when X is zero. Equation 1.1 is essentially the same,

only we refer to the “b” term as β₀ and we call the “m” term β₁.

Figure 1.3 shows an example of an estimated line from this model for our Springfield

data. The intercept (β₀) is the value of weight when donut consumption is zero (X = 0).

The slope (β₁) is the amount that weight increases for each donut eaten. In this case, the

intercept is about 122, which means that the average weight for those who eat zero donuts

is around 122 pounds. The slope is around 9.1, which means that for each donut eaten per

week, weight is about 9.1 pounds higher.
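To make that line concrete, here is a minimal sketch that computes the intercept and slope from the Table 1.1 data. Python is used purely for illustration (Chapter 2 introduces statistical software, and any package returns the same numbers), and the formulas are the standard least squares ones whose derivation comes later in the book: the slope is the covariance of donuts and weight divided by the variance of donuts, and the intercept is the mean weight minus the slope times the mean donut consumption.

import numpy as np

# Donuts per week and weight (in pounds) for the 13 Springfield residents in Table 1.1
donuts = np.array([14, 0, 0, 5, 20, 0.75, 0.25, 16, 3, 2, 0.8, 5, 4])
weight = np.array([275, 141, 70, 75, 310, 80, 160, 263, 205, 185, 170, 155, 145])

# Slope: cov(Donuts, Weight) / var(Donuts); intercept: mean(Weight) - slope * mean(Donuts)
beta1 = np.cov(donuts, weight, ddof=1)[0, 1] / np.var(donuts, ddof=1)
beta0 = weight.mean() - beta1 * donuts.mean()
print(f"intercept (beta_0): {beta0:.1f}   slope (beta_1): {beta1:.1f}")

Running this should land close to the numbers reported above: an intercept around 122 pounds and a slope of roughly nine pounds per weekly donut.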

More generally, our core model can be written as

Yᵢ = β₀ + β₁ Xᵢ + εᵢ    (1.2)

where β₀ is the intercept that indicates the value of Y when X = 0 and β₁ is the slope that

indicates how much change in Y is expected if X increases by one unit. We almost always

care a lot about β₁, which characterizes the relationship between X and Y. We usually don’t

care a whole lot about β₀. It plays an important role in helping us get the line in the right

place, but it is seldom the case that our core research interest is to determine the value of

Y when X is zero.

We see that the actual observations do not fall neatly on the line that we’re using to

characterize the relationship between donuts and weight. The implication is that our model


[Scatterplot with fitted line: donuts (horizontal axis) against weight (vertical axis); the line crosses the vertical axis at β₀ = 123 and rises with slope β₁ for each additional donut.]

FIGURE 1.3: Regression Line for Weight and Donuts in Springfield


does not perfectly explain the data. Of course it doesn’t! Springfield residents are much too

complicated for donuts to explain them completely (except, apparently, Comic Book Guy).

The error term, εᵢ, comes to the rescue by giving us some wiggle room. It is what is

left over after the variables have done their work in explaining variation in the dependent

variable. In doing this service, the error term plays an incredibly important role for the entire

statistical enterprise. As this book proceeds, we will keep coming back to the importance of

getting to know our error term.

The error term, ε, is not simply a Greek letter. It is something real. What it covers

depends on the model. In our simple model — in which weight is a function only of how

many donuts a person eats — oodles of factors are contained in the error term. Basically,

anything else that affects weight will be in the error term: sex, height, other eating habits,

exercise patterns, genetics, and on and on. The error term includes everything we haven’t

measured in our model.

We’ll often see ε referred to as random error, but be careful about that phrase. Yes, for

the purposes of the model we are treating the error term as something random, but it is not

simply random in the sense of a roll of the dice. It is random more in the sense that we

don’t know what it will be for any individual. But every error term reflects, at least in part,

some relationship to real things that we have not measured or included in the model. We

come back to this point often.


Remember This
Our core statistical model is

Yᵢ = β₀ + β₁ Xᵢ + εᵢ
1. β₁, the slope, indicates how much change in Y (the dependent variable) is expected
if X (the independent variable) increases by one unit.
2. β₀, the intercept, indicates where the regression line crosses the Y-axis. It is the
value of Y when X is zero.
3. β₁ is almost always more interesting than β₀.


[Four panels, (a) through (d), each plotting Y (vertical axis) against X (horizontal axis) with a different line.]

FIGURE 1.4: Examples of Lines Generated by Core Statistical Model

Discussion Questions
For each of the panels in Figure 1.4, determine whether β₀ and β₁ are greater
than, equal to, or less than zero. (Be careful with β₀ in panel (d)!)


1.2 Two Challenges: Randomness and Endogeneity

Understanding that there are real factors in the error term helps us be smart about making

causal claims. Our data seems to suggest that the more donuts people ate, the more they

packed on the pounds. It’s not crazy to think that donuts cause weight gain.

But can we be certain that donuts, and not some other factor, cause weight gain? Two

fundamental challenges in statistical analysis should make us cautious. The first challenge

is randomness. Any time we observe a relationship in data, we need to keep in mind that

some coincidence could explain it. Perhaps we happened to pick some unusual people for

our data set. Or perhaps we picked perfectly representative people, but they had happened

to have unusual measurements on the day we actually measured them.

In the donut example, the possibility of such randomness should worry us, at least a little.

Perhaps the people in Figure 1.3 are a bit odd. Perhaps if we had more people, we might

get more heavy folks who don’t eat donuts and skinny people who scarf them down. Adding

those folks to the figure would change the figure and our conclusions. Or perhaps even with

the set of folks we observe, we might have gotten some of them on a bad (or good) day and

if we looked at them another day, we might observe a different relationship.

Therefore every legitimate statistical analysis will account for randomness in an effort to

distinguish results that could happen by chance from those that would be unlikely to happen

by chance. The bad news is that we will never escape the possibility that the results we


observe are due to randomness rather than some causal effect. The good news, though, is

that we can often do a pretty good job characterizing how confident we are the results are

not simply due to randomness.

The other major statistical challenge arises from the possibility that an observed relation-

ship between X and Y is actually due to some other variable that causes Y and is associated

with X. In the donuts example, worry about scenarios where we could wrongly attribute

changes in weight caused by other factors to our key independent variable (in this case,

donut consumption). What if tall people eat more donuts? Height is in the error term as a

contributing factor to weight, and if tall people eat more donuts we may wrongly attribute

to donuts the effect of height.

There are loads of other possibilities. What if men eat more donuts? What if exercise

addicts don’t eat donuts? What if people who eat donuts are also more likely to down a

tub of Ben and Jerry’s ice cream every night? What if thin people can’t get donuts down

their throats? Being male, exercising, binging on ice cream, having itty-bitty throats – all

these things are probably in the error term (meaning they affect weight) and all could be

correlated with donut eating.

Speaking statistically, we highlight this major statistical challenge by saying that

the donut variable is endogenous. An independent variable is endogenous if changes in it

are related to factors in the error term. The prefix “endo” refers to something internal, and

endogenous variables are “in the model” in the sense that they are related to other things


that also determine Y .

In the donuts example, donut consumption is likely endogenous because how many donuts

a person eats is not independent of other factors that influence weight gain. Factors that

cause weight gain (such as eating Ben and Jerry’s ice cream) might be associated with

donut eating; in other words, factors that influence the dependent variable Y might also be

associated with the independent variable X, muddying the connection between correlation

and causation. If we can’t be sure that our variation in X is not associated with factors

that influence Y , we worry about wrongly attributing to X the causal effect of some other

variable. We might wrongly conclude donuts cause weight gain when really donut eaters are

more likely to eat tubs of Ben and Jerry’s, which is the real culprit.

In all these examples, something in the error term that really causes weight gain is related

to donut consumption. When this situation arises, we risk spuriously attributing to donut

consumption the causal effect of some other factor. Remember, anything not measured in

the model is in the error term and here, at least, we have a wildly simple model in which

only donut consumption is measured. So Ben and Jerry’s, genetics, and everything else are

in the error term.

Endogeneity is everywhere; it’s endemic. Suppose we want to know if raising teacher

salaries increases test scores. It’s an important and timely question. Answering it may seem

easy enough: We could simply see if test scores (a dependent variable) are higher in places

where teacher salaries (an independent variable) are higher. It’s not that easy, though, is it?


Endogeneity lurks. Test scores might be determined by unmeasured factors that also affect

teacher salaries. Maybe school districts with lots of really poor families don’t have very

good test scores and don’t have enough money to pay teachers high salaries. Or perhaps

the relationship is the opposite, with poor school districts getting extra federal funds to

pay teachers more. Either way, teacher salaries are endogenous because their levels depend

in part on factors in the error term (like family income) that affect educational outcomes.

Simply looking at test scores’ relationship to teacher salaries risks confusing the effect of

family income and teacher salaries.2

The opposite of endogeneity is exogeneity. An independent variable is exogenous if

changes in it are not related to factors in the error term. The prefix “exo” refers to something

external, and exogenous variables are “outside the model” in the sense that their values are

unrelated to other things that also determine Y . For example, if we use an experiment to

randomly set the value of X, then changes in X are not associated with factors that also

determine Y . This gives us a clean view of the relationship between X and Y , unmuddied

by associations between X and other factors that affect Y .

Our central challenge is to avoid endogeneity and thereby achieve exogeneity. If we

succeed, we can be more confident that we have moved beyond correlation and closer to

understanding if X causes Y – our fundamental goal. This process is not automatic or easy.

Often we won’t be able to find purely exogenous variation, so we’ll have to think through
2A good idea is to measure these things and put them in the model so that they are no longer in the error term.
That’s what we do in Chapter 5.


how close we can get. Nonetheless, the bottom line is this: If we can find exogenous variation

in X we can use data to make a reasonable inference about what will happen to variable Y

if we change variable X.

To formalize these ideas we’ll use the concept of correlation. It is a concept most people

know, at least informally. Two variables are correlated (“co-related”) if they move together.

A positive correlation means that high values of one variable are associated with high values

of the other; a negative correlation indicates that high values of one variable are associated

with low values of the other.

Figure 1.5 shows examples of variables that have positive correlation (panel (a)), no

correlation (panel (b)) and negative correlation (panel (c)). Correlations range from 1 to -1.

A correlation of 1 means that the variables move perfectly together. A positive correlation

means that high values of one variable are associated with high values of the other variable.

A negative correlation means that high values of one variable are associated with low values

of the other.

Correlations close to zero indicate weak relationships between variables. When the cor-

relation is zero, there is no linear relationship between two variables.3
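To make those numbers concrete, here is a tiny illustration with made-up values, written in Python simply because it is compact (any statistics package has a correlation function). The last line shows why the qualifier “linear” matters: the correlation is essentially zero even though X and Y are clearly related.

import numpy as np

x = np.array([1, 2, 3, 4, 5])
print(np.corrcoef(x, 2 * x + 1)[0, 1])        # moves perfectly together: correlation of 1
print(np.corrcoef(x, 10 - 3 * x)[0, 1])       # moves perfectly opposite: correlation of -1
print(np.corrcoef(x, [1, 3, 5, 3, 1])[0, 1])  # an inverted V: essentially zero correlation, yet related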

We use correlation in our definitions of endogeneity and exogeneity. If our independent


3 In the appendix on page 771 we provide an equation for correlation and discuss how it relates to our ordinary
least squares estimates from Chapter 3. Correlation measures linear relationships between variables; we’ll discuss
non-linear relationships in OLS on page 317 in Chapter 7. An example of two variables that are not correlated
(meaning their correlation is zero) is panel (b) of Figure 1.5.


[Three panels plotting the error term (vertical axis) against the independent variable (horizontal axis): (a) positive correlation, (b) no correlation, (c) negative correlation.]

FIGURE 1.5: Correlation


variable has a relationship to the error term like the one in panel (a) (which shows positive

correlation) or panel (c) (which shows negative correlation), then we have endogeneity. In

other words, when the unmeasured stuff that constitutes the error term is correlated with our

independent variable we have endogeneity which will make it difficult to tell if our variable

or the error term is causing changes in the dependent variable.

On the other hand, if our independent variable has no relationship to the error term as

in panel (b), then we have exogeneity. In this case, if we observe Y rising with X, we can

feel confident that X is causing Y .

The challenge is that the true error term is not observable. Hence much of what we do in

statistics attempts to get around the possibility that something unobserved in the error term

may be correlated with the independent variable. This quest makes statistics challenging

and interesting.

As a practical matter, we should begin every analysis by assessing endogeneity. First, look

away from the model for a moment and list all the things that could determine the dependent

variable. Second, ask if anything on the list correlates with the independent variable in the

model and explain why it might. That’s it. Do that and we are on our way to identifying

endogeneity.


Remember This
1. There are two fundamental challenges in statistics: randomness and endogeneity.
2. Randomness can produce data in which it looks like there is a relationship between
X and Y even when there is not, or it looks like there is no relationship even when
there is one.
3. An independent variable is endogenous if it is correlated with the error term in
the model.
(a) An independent variable is exogenous if it is not correlated with the error
term in the model.
(b) The error term is not observable, making it a challenge to know if an inde-
pendent variable is endogenous or exogenous.


Case Study: Flu Shots

A great way to appreciate the challenges raised

by endogeneity is to look at real examples. Here

is one we all can relate to: Do flu shots work? No

one likes the flu. It kills about 36,000 people in

the United States each year, mostly among the

elderly. At the same time, no one enjoys schlepping down to some hospital basement or

drugstore lobby, rolling up a shirt sleeve, and getting a flu shot. Nonetheless, every year

100,000,000 Americans dutifully go through this ritual.

The evidence that flu shots prevent people from dying from the flu must be overwhelming.

Right? Suppose we start by considering a study using data on whether people died (the

dependent variable) and whether they got a flu shot (the independent variable).

Deathᵢ = β₀ + β₁ Flu shotᵢ + εᵢ    (1.3)

where Deathᵢ is a (creepy) variable that is 1 if person i died in the time frame of the study

and 0 if he or she did not. Flu shotᵢ is 1 if person i got a flu shot and 0 if not.4

A number of studies have done essentially this analysis and have found that people who

get flu shots are less likely to die. According to some estimates, those who receive flu shots

are as much as 50 percent less likely to die. This effect is enormous. Going home with a
4 We discuss dependent variables that equal only zero or one in Chapter 12 and independent variables that equal
zero or one in Chapter 6.


Band-Aid with the little blood stain is worth it after all.

But are we convinced? Is there any chance of endogeneity? If there exists some factor in

the error term that affected whether or not someone died and whether he or she got a flu

shot, we would worry about endogeneity.

What is in the error term? Goodness, lots of things affect the probability of dying: age,

health status, wealth, cautiousness - the list is immense. All of these factors and more are

in the error term.

How could these factors cause endogeneity? Let’s focus on overall health. Clearly, health-

ier people die at a lower rate than unhealthy people. If healthy people are also more likely

to get flu shots, we might erroneously attribute life-saving power to flu shots when perhaps

all that is going on is that people who are healthy in the first place tend to get flu shots.

It’s hard, of course, to get measures of health for people, so let’s suppose we don’t have

them. We can, however, speculate on the relationship between health and flu shots. Figure

1.6 shows two possible states of the world. In each figure we plot flu-shot status on the

X-axis. If someone did not get a flu shot, he’s in the 0 group; if someone got a flu shot,

she’s in the 1 group. On the Y-axis we plot health related to everything but flu (supposing

we could get some index that factors in age, heart health, absence of disease, etc.). In panel

(a), health and flu shots don’t seem to go together; in other words the correlation is zero. If

panel (a) represents the state of the world, then our results that flu shots are associated with

lower death rates are looking pretty good because flu shots are not reflecting overall health.


[Two panels plotting health (vertical axis) against flu-shot status, 0 or 1 (horizontal axis): in panel (a) the two groups look similar; in panel (b) the flu-shot group is healthier.]

FIGURE 1.6: Two Scenarios for the Relationship Between Flu Shots and Health


In panel (b), health and flu shots do seem to go together, with the flu shot population being

healthier. In this case, we have correlation between our main variable (flu shots) and something

in the error term (health).
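A quick simulation makes the panel (b) worry concrete. Everything below is invented purely for illustration (the health index, the strength of the health-to-flu-shot link, and the death probabilities are assumptions, not estimates from any study), and the code is Python only because it is compact. The key feature is that deaths depend on health alone; the shot does nothing.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Unobserved overall health: it lives in the error term of Equation 1.3
health = rng.normal(0, 1, n)

# Panel (b) world: healthier people are more likely to get a flu shot
flu_shot = (health + rng.normal(0, 1, n) > 0).astype(int)

# Deaths depend on health only; in this simulation the shot has no effect at all
death = (rng.random(n) < 1 / (1 + np.exp(2 + 2 * health))).astype(int)

print("death rate without flu shot:", death[flu_shot == 0].mean())
print("death rate with flu shot:   ", death[flu_shot == 1].mean())

The flu-shot group dies at a noticeably lower rate even though the shot does nothing, which is exactly the kind of spurious result endogeneity can produce.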

Brownlee and Lenzer (2009) discuss some indirect evidence suggesting that flu shots and

health are actually correlated. A clever approach to assessing this matter is to look at death

rates of people in the summer. The flu rarely kills people in the summer which means that if

people who get flu shots also die at lower rates in the summer, it is because they are healthier

overall; if people who get flu shots die at the same rates as others during the summer, it

would be reasonable to suggest that the flu-shot and non-flu-shot populations have similar

health. It turns out that people who get flu shots have an approximately 60 percent lower

probability of dying outside the flu season.

Other evidence backs up the idea that healthier people get flu shots. As it happened,

vaccine production faltered in 2004 and 40 percent fewer people got vaccinated. What

happened? Flu deaths did not increase. And in some years, the flu vaccine was designed

to attack a different set of viruses than actually spread in those years; again, there was no

clear change in mortality. This data suggests that the reason people who get flu shots live

longer might be because getting flu shots is associated with other healthy behavior, such as

seeking medical care, eating better and so forth.

The point is not to put us off flu shots. We’ve discussed only mortality—whether people

die from the flu—not whether they’re more likely to contract the virus or stay home from work


because they are sick.5 The point is to highlight how hard it is to really know if something

(in this case, a vaccine) works. If something as widespread and seemingly straightforward

as flu shots is hard to assess definitively, think about the care we must take when trying

to analyze policies that affect fewer people and have more complicated effects.

Case Study: Country Music and Suicide

Does music affect our behavior? Are we more se-

rious when we listen to classical music? Does

bubblegum pop make us bounce through the

halls? Both ideas seem plausible, but how can

we know for sure?

Stack and Gundlach (1992) decided to look

to data to assess one particular question: does country music depress us? They argued

that country music, with all its lyrics about broken relationships and bad choices, may be

so depressing that it increases suicide rates.6 We can test this claim with the following

statistical model:

Suicide ratesᵢ = β₀ + β₁ Country musicᵢ + εᵢ    (1.4)

where Suicide ratesᵢ is the suicide rate in metropolitan area i and Country musicᵢ is the
5 See, for example, DiazGranados, Denis, and Plotkin (2012) and Osterholm, Kelley, Sommer, and Belongia (2012)
for evidence on the flu vaccine based on randomized experiments.
6 Really, this is an actual published paper; see the endnotes at the end of the book for details.


proportion of radio airtime devoted to country music in metropolitan area i.7

It turns out that suicides are indeed higher in metro areas where radio stations play more

country music. Do we believe this is a causal relationship? (In other words, is country music

exogenous?) If radio stations play more country music, should we expect more suicides?

Let’s work through this example.

What does β₀ mean? What does β₁ mean? In this model, β₀ is the expected level of suicide

rates in metropolitan areas that play no country music. β₁ is the amount by which

suicide rates change for each one-unit increase in the proportion of country music played

in a metropolitan area. We don’t know what β₁ is; it could be positive (suicides go up),

zero (no relation to suicides) or negative (suicides decrease). For the record, we don’t know

what β₀ is either, but we are less interested in it because it does not directly characterize

the relationship between music and suicides the way β₁ does.

What is in the error term? The error term contains factors that are associated with higher

suicide rates, such as alcohol and drugs use, availability of guns, divorce and poverty rates,

lack of sunshine, lack of access to mental health care, and probably much more.

What are the conditions for X to be endogenous? An independent variable is endogenous if

it is correlated with factors in the error term. Therefore, we need to ask whether the amount
7 Their analysis is based on a more complicated model, but this is the general idea.


of country music played on radio stations in metropolitan areas is correlated with drinking,

drug use, and all the other stuff in the error term.

Is the independent variable likely to be endogenous? Are booze, divorce and guns likely to be

correlated with the amount of country music? Have you listened to any country music? Drinking

and divorce come up now and again. Could this music appeal more in areas where people

drink too much and get divorced more frequently? (To complicate matters, country-music

lyrics also feature more about family and religion than most other types of music.) Or,

could it simply be the case that people in rural areas who like country music also have a lot

of guns? And all of these factors – alcohol, divorce and guns – are plausible influences on

suicide rates. To the extent that country music is correlated with any of them, it would be

endogenous.

Explain how endogeneity could lead to incorrect inferences. This endogeneity could lead to

incorrect inferences in the following way. Suppose for a moment that country music has

no effect whatsoever on suicide rates, but that regions with lots of guns and drinking also

have more suicides and that people in these regions also listen to more country music. If we

look only at the relationship between country music and suicide rates, we will see a positive

relationship: places with lots of country music will have higher suicide rates and places

with little country music will have lower suicide rates. The explanation could be that the

country-music areas have lots of drinking and guns and the areas with little country music


have less drinking and fewer guns. Therefore while it may be correct to say that there are

more suicides in places where there is more country music, it would be incorrect to conclude

that country music causes suicides. Or, to put it in another way, it would be incorrect to

conclude that we would save lives by banning country music.

As it turns out, Snipes and Maguire (1995) account for the amount of guns and divorce in

metropolitan areas and find no relationship between country music and metropolitan suicide

rates. So there’s no reason to turn off the radio and put away those cowboy boots.


Discussion Questions
1. Labor economists often study the returns to education (see, e.g., Card
(1999)). Suppose we have data on salaries of a set of people, some of
whom went to college and some who did not. A simple model linking
education to salary is

Salaryᵢ = β₀ + β₁ College graduateᵢ + εᵢ


where the value of Salaryᵢ is the salary of person i and the value of
College graduateᵢ is 1 if person i graduated from college and is 0 if
person i did not.
(a) What does β₀ mean? What does β₁ mean?
(b) What is in the error term?
(c) What are the conditions for the independent variable X to be en-
dogenous?
(d) Is the independent variable likely to be endogenous? Why or why
not?
(e) Explain how endogeneity could lead to incorrect inferences.
2. Donuts aren’t the only food that people worry about. Consider the
following model based on Solnick and Hemenway (2011):
Violenceᵢ = β₀ + β₁ Soft drinksᵢ + εᵢ

where Violenceᵢ is the number of physical confrontations student i was


in during a school year and Soft drinksᵢ is the average number of cans
of soda student i drinks per week.
(a) What does β₀ mean? What does β₁ mean?
(b) What is in the error term?
(c) What are the conditions for the independent variable X to be en-
dogenous?
(d) Is the independent variable likely to be endogenous? Why or why
not?
(e) Explain how endogeneity could lead to incorrect inferences.


Discussion questions - continued


3. We know U.S. political candidates spend an awful lot of time raising
money. And we know they use the money to inflict mind-numbing ads
on us. Do we know if the money and the ads it buys actually work?
That is, does campaign spending increase vote share? Jacobson (1978),
Erikson and Palfrey (2000) and others grapple at length with this issue.
Consider the following model.

Vote shareₛ = β₀ + β₁ Campaign spendingₛ + εₛ


where Vote shareₛ is the vote share of a candidate in state s and
Campaign spendingₛ is the spending by candidate s.
(a) What does β₀ mean? What does β₁ mean?
(b) What is in the error term?
(c) What are the conditions for the independent variable X to be en-
dogenous?
(d) Is the independent variable likely to be endogenous? Why or why
not?
(e) Explain how endogeneity could lead to incorrect inferences.
4. Researchers identified every outdoor advertisement in 228 census tracts
in Los Angeles and New Orleans and then interviewed 2,881 residents of
the cities about weight. Their results suggested that a 10% increase in
outdoor food ads in a neighborhood was associated with a 5% increase
in obesity.
a) Do you think there could be endogeneity?
b) How would you test for a relationship between food ads and obesity?
c) Read the article “Does the Ad Make Me Fat?” by Christopher
Chabris and Daniel Simmons from the March 10, 2013 New York
Times and see how your answers compare to theirs.


1.3 Randomized Experiments as the Gold Standard

The best way to fight endogeneity is to have exogenous variation. A good way to have

exogenous variation is to create it. If we do so, we know that our independent variable is

unrelated to the other variables that affect the dependent variable.

In theory, it is easy to create exogenous variation with a randomized experiment. In our

donut example, we could randomly pick people and force them to eat donuts while forbidding

everyone else from eating donuts. If we can pull this experiment off, the amount of donuts

a person eats will be unrelated to other unmeasured variables that affect weight. The only

thing that would determine donut eating would be the luck of the draw. The donut-eating

group would have some ice cream bingers, some health-food nuts, some runners, some round-

the-clock video-game players, and so on. So, too, would the non-donut-eating group. There

wouldn’t be systematic differences in these unmeasured factors across groups. Both treated

and untreated groups would be virtually identical and would resemble the composition of

the population.

In an experiment like this, the variation in our independent variable X is exogenous. We

have won. If we observe that donut eaters weigh more or have other health differences from

non-donut eaters, we can reasonably attribute these effects to donut consumption.

Simply put, the goal of such a randomized experiment is to make sure the independent

variable, which we also call the treatment, is exogenous. The key element of such experiments


is randomization, a process whereby the value of the independent variable is determined by

a random process; its value will depend on nothing but chance. In this case, the independent

variable will be uncorrelated with everything, including any factor in the error term affecting

the dependent variable. In other words, a randomized independent variable is exogenous;

analyzing the relationship between an exogenous independent variable and the dependent

variable allows us to make inferences about a causal relationship between the two variables.

This is one of those key moments when a concept may not be that complicated, but the

implications are enormous. By randomly picking some people to get the treatment, we rule

out the possibility that there is some other way for the independent variable to be associated

with the dependent variable. If the randomization is successful, the treated subjects are not

systematically more athletic, taller, or food conscious – or more left-handed or stinkier, for

that matter.

The basic structure of a randomized experiment is simple. Based on our research question,

we identify a relevant population that we randomly split into two groups: a treatment

group that receives the policy intervention and a control group that does not. After the

treatment, we compare the behavior of the treatment and control groups on the outcome we

care about. If the treatment group differs substantially from the control group, we believe

the treatment had an effect; if not, then we’re inclined to think the treatment had no effect.8

For example, suppose we want to know if an ad campaign increases enrollment in Oba-


8 We provide standards for making such judgments in Chapter 3 and beyond.


macare. We would identify a sample of uninsured people and split them into a treatment

group that is exposed to the ad and a control group that is not. After the treatment, we

compare the enrollment in Obamacare of the treatment and control groups. If the treated

group enrolled at a substantially higher rate, that outcome would suggest the ad works.
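
To make this concrete, here is a minimal R sketch of the logic, using simulated (entirely made-up) data rather than results from any actual study; the variable names (ad, enrolled) and the assumed effect sizes are purely illustrative.

  ## Illustrative simulation of a randomized experiment (hypothetical data)
  set.seed(123)                              ## make the simulation reproducible
  n = 1000                                   ## number of uninsured people in the sample

  ## Randomization: each person sees the ad (1) or not (0) by chance alone
  ad = sample(c(0, 1), size = n, replace = TRUE)

  ## Hypothetical outcome: 20 percent baseline enrollment, plus 10 points if shown the ad
  enrolled = rbinom(n, size = 1, prob = 0.20 + 0.10 * ad)

  ## Compare enrollment rates for the treatment and control groups
  mean(enrolled[ad == 1])                    ## treatment group enrollment rate
  mean(enrolled[ad == 0])                    ## control group enrollment rate

Because ad is determined by a random draw alone, it cannot be systematically related to anything else that affects enrollment, so any sizable gap between the two rates is attributable to the ad.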

Because they build exogeneity into the research, randomized experiments are often re-

ferred to as the gold standard for causal inference. The phrase “gold standard” usually means

the best of the best. But experiments also merit the gold standard moniker in another sense.

No country in the world is actually on a gold standard. The gold standard doesn’t work

well in practice and, for many research questions, neither do experiments. Simply put, ex-

periments are great, but they can be tricky when applied to real people going about their

business.

The human element of social-scientific experiments makes them very different from exper-

iments in the physical sciences. My third grader’s science-fair project compared cucumber

seeds planted in peanut butter to those planted in dirt. She did not have to worry that the

cucumber seeds would get up and say “There is NO way you are planting me in that.” In

the social sciences, though, people can object, not only to being planted in peanut butter,

but also to things like watching TV commercials or attending a charter school or changing

health care plans or pretty much anything else we want to study with an experiment.

Therefore an appreciation of the virtues of experiments should also come with recognition

of their limits. We devote Chapter 10 to discussing the analytical challenges that accompany


experiments. No experiment should be designed without thinking through these issues and

every experimental result should be judged by how well it deals with them.

There are other reasons social-scientific experiments can’t answer all social-scientific re-

search questions. The first is that experiments aren’t always feasible. The financial costs of

many experiments are beyond what most major research organizations can fund, let alone

what a student doing a term paper can afford. And for many important questions, it’s not a

matter of money. Do we want to know if corruption promotes civil unrest? Good luck with

our proposal to randomly end corruption in some countries and not others. Do we want to

know if birth rates affect crime? Are we really going to randomly assign some regions to

have more babies? While the randomizing process could get interesting, we’re unlikely to

pull it off. Or do we want to know something historical? Forget about an experiment.

And even if an experiment is feasible, it might not be ethical. We see this dilemma most

clearly in medicine – if we believe some treatment is better, but we are not sure, how ethical

is it to randomly subject some people to a procedure that might not work? The medical

community has evolved standards relating to level of risk and informed consent by patients,

but such questions will never be easy to answer.

Consider flu shots. We may think that assessing the efficacy of this public-health measure

is a great situation for a randomized experiment. It would be expensive, but it is conceptually

simple. Get a bunch of people who want a flu shot, tell them they are participating in a

random experiment and randomly give some a flu shot and the others a placebo shot. Wait


and see how the two groups do.

But would such a randomized trial of flu vaccine be ethical? When we say “Wait and

see how the two groups do,” we actually mean “Wait and see who dies.” That changes

the stakes a bit, doesn’t it? The public-health community strongly believes in the efficacy

of the flu vaccine and, given that belief, considers it unethical to deny people the vaccine.

Brownlee and Lenzer recount in Atlantic Monthly how one doctor first told interviewers that

a randomized-experiment trial might be acceptable, then got cold feet and called back to

say that such an experiment would be unethical.9

Finally, experimental results may not be generalizable. That is, a specific experiment

may provide great insight into the effect of a given policy intervention at a given time

and place, but how sure can we be that the same policy intervention will work somewhere

else? Jim Manzi, the author of Uncontrolled, argues that the most honest way to describe

experimental results is that treatment X was effective in a certain time and place in which

the subjects had the characteristics they did and the policy was implemented by people with

the characteristics they had. Perhaps people in different communities respond to treatments

differently. Or perhaps the scale of an experiment could matter: A treatment might work

when implemented on a small scale, but could fail if implemented more broadly.

Statisticians make this point by distinguishing between internal validity and external
9 Another flu researcher came to the opposite conclusion, saying, “What do you do when you have uncertainty?
You test ... We have built huge, population-based policies on the flimsiest of scientific evidence. The most unethical
thing to do is to carry on business as usual.”


validity. Internal validity refers to whether the inference is biased; external validity

refers to whether an inference applies more generally. A well-executed experiment will be

internally valid, meaning the results will on average lead us to make the correct inferences

about the treatment and outcome in the context of the experiment. In other words, with

internal validity, we can say confidently that variable X is what causes the change in variable

Y . Even with internal validity, however, an experiment may not be externally valid, because

the causal relationship between the treatment and outcome could differ in other contexts.

That is, even if we have internally valid evidence from an experiment that aardvarks in

Alabama procreate more if they listen to Mozart, we can’t really be sure aardvarks in Alaska

will respond in the same way.

Hence, even as experiments offer a conceptually clear approach to defeating endogeneity,

they cannot always offer the final word for economic, policy, and political research. As a result,

most scholars in most fields need to grapple with non-experimental data. Observational

studies use data that has been generated by non-experimental processes. In contrast to

randomized experiments in which a researcher controls at least one of the variables, in

observational studies the data is what it is and we do the best we can to analyze it in a

sensible way. Endogeneity will be a chronic problem, but we are not totally defenseless in

the fight against it. The techniques we learn in this book help us to achieve, or at least

approximate, the exogeneity promised by randomized experiments even when we have only

observational data.


Remember This
1. Experiments create exogeneity via randomization.
2. Social-science experiments are complicated by practical challenges associated with
the difficulty of achieving randomization and full participation.
3. Experiments are not always feasible, ethical, or generalizable.
4. Observational studies use non-experimental data. They are necessary to answer
many questions.

Discussion Questions

1. Is it possible to have a non-random exogenous independent variable?


2. Think of a policy question of interest. Discuss how an experiment might
work to address the question.
3. Does foreign aid work? How should we create an experiment to assess
whether aid to very poor communities works? What might some of the
challenges be?
4. Do political campaigns matter? How should we create an experiment
to assess whether phone calls, mailings, and visits by campaign workers
matter? What might some of the challenges be?
5. How are health and medical spending affected when people have to pay
each time they see a doctor? How should we create an experiment to
assess whether the amount of co-payments (payments tendered at every
visit to a doctor) affects health costs and quality? What might some of
the challenges be?


1.4 Conclusion

The point of statistical research is almost always to learn if X (the independent variable)

causes Y (the dependent variable). If we see high values of Y when X is high and low values

of Y when X is low, we might be tempted to think X causes Y . We need always to be aware

that the observed relationship could have arisen by chance alone. Or, if X is endogenous,

we need to remember that interpreting the relationship between X and Y as causal could

be wrong, possibly completely wrong. When there is some other factor that causes Y and

is correlated with X, any relationship we see between X and Y may actually be due to the

effect of this other factor.

We spend the rest of this book accounting for uncertainty and battling endogeneity. Some

approaches, like randomized experiments, seek to create exogenous change. Other statistical

approaches, like multivariate regression, winnow down the number of other factors lurking

in the background that can cause endogeneity. These and other approaches have strengths,

weaknesses, tricks, and pitfalls. However, they all are united by a fundamental concern

with counteracting endogeneity. Therefore if we understand the concepts in this chapter, we

understand the essential challenges of using statistics to better understand policy, economics,

and politics.

Based on this chapter, we are on the right track if we can

• Section 1.1: Explain the terms in our core statistical model: Yi = β0 + β1 Xi + εi


• Section 1.2: Explain two major statistical challenges. Define endogeneity. Explain how

endogeneity can undermine causal inference. Define exogeneity. Explain how exogeneity

can enable causal inference.

• Section 1.3: Explain how experiments achieve exogeneity. Explain challenges and limi-

tations of experiments.

Key Terms
• Constant or intercept (7)
• Control group (257)
• Correlation (17)
• Dependent variable (4)
• Endogenous (14)
• Error term (7)
• Exogenous (16)
• External validity (36)
• Generalizable (35)
• Independent variable (4)
• Internal validity (36)
• Observational studies (36)
• Randomization (32)
• Scatterplot (7)
• Slope coefficient (7)
• Treatment group (257)

CHAPTER 2

STATS IN THE WILD: GOOD DATA PRACTICES

Endogeneity makes it difficult to use data to learn

about the world. Ideally, we overcome endogene-

ity by conducting experiments in clean-rooms

staffed by stainless-steel robots. That’s not re-

ally how the world works, though. Social-science

experiments can produce some pretty messy data, if they are even possible at all. Observa-

tional data is even messier.

One example of data messiness occurred in 2009. Prominent economists Carmen Reinhart

and Ken Rogoff (2010) analyzed more than 3,700 annual observations of economic growth

from a large sample of countries. Panel (a) of Figure 2.1 depicts one of their key results.


[Figure 2.1 shows average real GDP growth (percent) for countries grouped by public debt/GDP category (0−30%, 30−60%, 60−90%, above 90%); panel (a) uses the original data and panel (b) the corrected data.]

FIGURE 2.1: Two Versions of Debt and Growth Data

It shows average GDP growth for countries grouped into four categories depending on the

ratio of public debt to GDP. The shocking finding was the dramatic way average economic

growth dropped off a cliff for countries with government debt above 90 percent of GDP. The

implication was obvious: Governments should be very cautious when using deficit spending

to fight unemployment.

There was one problem with their story. The data didn’t quite say what they said it did.

Herndon, Ash and Pollin (2014) did some digging and found that some observations had

been dropped, others were typos and, most ignominiously, some calculations in the Excel

spreadsheet containing the data weren’t what Reinhart and Rogoff intended. With the data

corrected, the figure changed to panel (b) of Figure 2.1. Not quite the same story. Economic

growth didn’t plummet once government debt passed 90% of GDP. While people can debate


whether the slope in the right panel is a bunny hill or an intermediate hill, it clearly is

nothing like the cliff in the data originally reported.1

Reinhart and Rogoff’s discomfort can be our gain when we realize that even top scholars

can make data mistakes. Hence we need to be vigilant in making sure that we create habits

and structures to minimize mistakes and maximize the chance that others can find them if

we do.

Therefore, this chapter focuses on the crucial first steps for any statistical analysis. First,

we need to understand our data. Section 2.1 introduces tools for describing data and sniffing

out possible errors or anomalies. Second, we need to be prepared to convince others. If

others can’t re-create our results, they shouldn’t believe them. Therefore Section 2.2 helps

us establish good habits so that our code is understandable to ourselves and others. Finally,

we sure as heck aren’t going to do all this work by hand. Therefore, Section 2.3 introduces

us to two major statistical software programs, Stata and R.

2.1 Know Our Data

Experienced researchers know that data is seldom pristine. Something, somewhere is often

off, even in data sets that are well traveled in academic circles. This is especially true for
1A deeper question is whether we should treat this observational data as having any causal force. Government
debt levels are likely related to other factors that affect economic growth, like institutional quality and wars. In other
words, government debt is likely endogenous, meaning we probably can’t draw any conclusions about the effects of
debt on growth without implementing techniques we cover later in this book.


data from real-world sources.2

Therefore the first rule of data analysis is to know our data. This rule sounds obvious and

simple but not everyone follows it, sometimes to their embarrassment. For each variable we

should know the number of observations, the mean, standard deviation and the minimum

and maximum values. Knowing this information gives us a feel for data, helping us know if

we have missing data and what the scales and ranges of the variables are. Table 2.1 shows an

example for the donut and weight data we discussed on page 5. The number of observations,

frequently referred to as “N” (for number), is the same for all variables in this example, but

it varies across variables in many data sets. We all know the mean (also known as average).

The standard deviation measures how widely dispersed the values of the observation are.3

The minimum and maximum tell us the range of the data and often point to screwy values

of a variable when the minimum or maximum doesn’t make sense.

It is also helpful to look at the distribution of variables that take on only a few possible

values. Table 2.2 shows a frequency table for the male variable, a variable that equals 1

for men and 0 for women. The table indicates there are 9 men and 4 women. Fair enough.
2 Chris Achen (1982, 53) memorably notes “If the information has been coded by nonprofessionals and not cleaned
at all, as often happens in policy analysis projects, it is probably filthy.”
3 The appendix contains more details on page 768. Here’s a quick refresher. The standard deviation of X is a

measure of the dispersion of X. The larger the standard deviation, the more spread out the values. It is calculated
as √((1/N) Σ(Xi − X̄)²), where X̄ is the mean of X. For each observation, we see how far it is from the mean. We then
square that value because for the purposes of calculating dispersion we don’t distinguish whether a value is below
the mean or above it; when squared, all these values become positive numbers. We then take the average of these
squared values. Finally, since they’re squared values, taking the square root of the whole thing brings the final value
back to the scale of the original variable.


Table 2.1: Descriptive Statistics for Donut and Weight Data

Variable   Observations (N)   Mean     Standard deviation   Minimum   Maximum
Weight     13                 171.85   76.16                70        310
Donuts     13                 5.41     6.85                 0         20.5

Table 2.2: Frequency Table for Male Variable in Donut Data Set

Male   Observations
0      4
1      9

Suppose that our frequency table looked like Table 2.3 instead. Either we have a very manly

man in the sample or (more likely) we have a mistake in our data. The statistical tools we

use later in this book will not necessarily flag such issues, so it’s our responsibility to be

alert.

Table 2.3: Frequency Table for Male Variable in Second Donut Data Set

Male   Observations
0      4
1      8
100    1
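
As a quick preview of the Computing Corner at the end of this chapter, here is a minimal R sketch of these checks. The handful of values is entered by hand purely for illustration (the male variable deliberately includes the suspicious 100 from Table 2.3); it is not the full donut data set.

  ## Illustrative data entered by hand
  weight = c(275, 141, 70, 310, 205)
  donuts = c(14, 0, 0, 20.5, 3)
  male   = c(1, 0, 0, 1, 100)          ## note the impossible value of 100

  ## Descriptive statistics: N, mean, standard deviation, minimum, maximum
  sum(!is.na(weight))
  mean(weight)
  sd(weight)
  min(weight)
  max(weight)

  ## Frequency table for a variable with only a few values; the 100 jumps out immediately
  table(male)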

Graphing data is useful because it allows us to see relationships and unusual observa-

tions. The statistical tools we develop later quantify these relationships, but seeing them


[Figure 2.2 is a scatterplot of weight (in pounds) against donuts eaten, with each Springfield resident labeled by name (Homer, Marge, Bart, Lisa, Comic Book Guy, and so on).]

FIGURE 2.2: Weight and Donuts in Springfield


for ourselves is an excellent and necessary first step. For example, Figure 2.2 shows the

scatterplot we saw earlier of the weight and donut data. We can see that there does seem to

be a relationship between the two variables.

We also see some relationships that we might have missed without graphing. Lisa and

Bart are children and their weight is much lower; we’ll probably want to account for that in

our analysis. Women also seem to weigh less.

Effective figures are clean, clear, and attractive. We point to some resources for effective

visualization in the Further Reading section at the end of the chapter, but here’s the bottom

line. Get rid of clutter. Don’t overdo axis labels. Avoid abbreviations and jargon. Pick

colors that go together well. And 3D? Don’t. Ever.

Remember This
1. A useful first step toward understanding data is to review sample size, mean,
standard deviation, and minimum and maximum for each variable.
2. Plotting data is useful for identifying patterns and anomalies in data.

2.2 Replication

At the heart of scientific knowledge is the replication standard. Research that meets a

replication standard can be duplicated based on the information provided at the time of

publication. In other words, an outsider could use information provided at the time of

publication to produce identical results.


We need replication files to satisfy this standard. Replication files document exactly

how data is gathered and organized. When done properly, these files allow others to check

our work by following our steps and seeing if they get identical results.

Replication files also enable others to probe our analysis. Sometimes – often, in fact –

statistical results hinge on seemingly small decisions about what data to include, how to

deal with missing data, and so forth. People who really care about getting the answer right

will want to see what we’ve done to our data and, realistically, will be wary until they see

for themselves that other reasonable ways of doing the analysis produce similar results. If a

certain coding or statistical choice substantially changes results, then we need to pay a lot

of attention to that choice.

Committing to a replication standard keeps our work honest. We need to make sure that

we make choices based on the statistical merits, not based on whether they produce the

answer we want. If we give others the means to check our work, we’re less likely to fall to

the temptation of reporting only the results we like.

Therefore every statistical project, whether a homework assignment, a thesis, or a multi-

million-dollar consulting project, should start with replication files. One file is a data code-

book that documents the data. Sometimes this file simply notes the website and date the

data was downloaded. Often, though, the codebook will include information about variables

that come from multiple sources. The codebook should note the source of the data, the type

of data, who collected it, and any adjustments the researcher made. For example, is the


data measured in nominal or real dollars? If it is in real dollars, which inflation deflator has

been used? Is the data measured in fiscal year or calendar year? Have missing observations

been imputed? If so, how? Losing track of this information can lead to frustrating and

unproductive backtracking later.

Table 2.4 contains a sample of a codebook for a data set on height and wages.4 The

data set was used to assess whether tall people get paid more. It is pretty straightforward,

covering how much money people earned, how tall they were, and their activities in high

school. We see, though, that details matter. The wages are stated in dollars per hour, which

itself is calculated based on information from an entire year of work. We could imagine data

on wages in other data sets being expressed in terms of dollars per month or year or in terms

of wages at the time the question was asked. There are two height variables, one measured

in 1981 and the other measured in 1985. The athletics variable indicates whether the person

participated in athletics or not. Given the coding, a person who played multiple sports will

have the same value for this variable as a person who played one sport. Such details are

important in the analysis and we have to be careful to document them thoroughly.

A second replication file should document the statistical analysis, usually by providing

the exact code used to generate the results. Which commands were used to produce the

analysis? Sometimes the file contains a few simple lines of software code. Often, however,

we need to explain complicated steps needed to merge or clean the data. Or we need to
4 We analyze this data on page 112.


Table 2.4: Codebook for Height and Wage Data

Variable name   Description
wage96          Adult hourly wages (dollars) reported in 1996 (salary and
                wages divided by hours worked in past calendar year)
height85        Adult height (inches), self-reported in 1985
height81        Adolescent height (inches), self-reported in 1981
athletics       Participation in high-school athletics (1=yes, 0=no)
clubnum         Number of club memberships in high school, excluding athletics,
                academic/honor society clubs, and vocational clubs
male            Male (1=yes, 0=no)

detail how we conducted customized statistical analysis. These steps are seldom obvious

from the description of data and methods that makes its way into the final paper or report.

It is a great idea to include commentary in the replication material explaining the code and

reasons behind decisions. Sometimes statistical code will be pretty impenetrable (even to

the person who wrote it!) and detailed commentary helps keep things clear for everyone.

We show examples of well-documented code in the Computing Corner beginning on page 57.
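
For instance, the top of a replication file might look something like the following R sketch. The author, file paths, and dates are placeholders invented for illustration; the point is the commentary, not the particular commands.

  ## Replication file for the donut analysis (illustrative sketch)
  ## Author: Jane Doe                 Last updated: January 15, 2014
  ## Data: DonutData.RData, downloaded from the book website on January 10, 2014
  ## Purpose: descriptive statistics and scatterplot reported in Chapter 2

  ## Load the data (this path is specific to this computer)
  load("C:\\Users\\JaneDoe\\Documents\\DonutData.RData")

  ## Weight is in pounds; donuts is donuts eaten per week
  ## No observations are dropped here; any exclusions would be documented below
  mean(weight)
  mean(donuts)

  ## Scatterplot used in the write-up (donuts on the x-axis, weight on the y-axis)
  plot(donuts, weight)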

Having well-documented data and analysis is a huge blessing. Even a modestly complex

project can produce a head-spinning number of variables and choices. And because the work

often extends over days, weeks, or months, we learn quickly that what seems obvious when

fresh can fade into oblivion. How exactly did we create our wonderful new variable at 3 am

three weeks ago? If we can’t recreate the analysis from scratch, it is useless. We may as

well have gone to bed. If we have a good replication file, on the other hand, we can simply

re-run the code and be up to full speed in minutes.

A replication file is also crucial to analyze the robustness of our results. A result is robust


if it does not change when we change the model. For example, we might believe that a certain

observation was mis-measured and we might therefore exclude it from the data we analyze.

A reader might be nervous about this exclusion. It will therefore be useful to conduct a

robustness check in which we estimate the model including the contested observation. If the

statistical significance and magnitude of the coefficient of interest are essentially the same,

then we can assure others that the results are robust to inclusion of that observation. If

the coefficient of interest changes, then the results are not robust

and we have some explaining to do. Experienced researchers know that many results are not

robust and therefore demand extensive robustness checks before they will believe results.
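
A bare-bones version of such a check, in R, might look like the sketch below. The numbers and the choice of which observation is contested are hypothetical; the idea is simply to report the quantity of interest with and without the disputed case.

  ## Illustrative robustness check with hypothetical data
  crime = c(320, 410, 285, 505, 1349)   ## suppose the fifth value is the contested observation

  mean(crime)                            ## result using all observations
  mean(crime[-5])                        ## result excluding the contested observation

  ## If the two results are essentially the same, the finding is robust to this choice;
  ## if they differ substantially, we have some explaining to do.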

Remember This
1. Analysis that cannot be replicated cannot be trusted.
2. Replication files document data sources and statistical methods so that someone
could exactly re-create the analysis in question from scratch.
3. Replication files also allow others to explore the robustness of results by enabling
them to assess alternative approaches to the analysis.


Case Study: Violent Crime in United States

Violent crime is one of our worst fears. The more

we can understand its causes, the more we can

design public policies to address it. Many wonder

if crime is a result of the breakdown of the family,

poverty, or even dense urban living.

To get a preliminary picture of how violent

crime and such demographic features are related, Table 2.5 presents data drawn from U.S.

Census data from 2009 for the 50 states and Washington DC. We can see there is no missing

data (because each variable has 51 observations). We also see that the violent-crime rate

has a broad range, from 119.9 per 100,000 population all the way to 1,348.9 per 100,000

people. The single parent percent variable is on a 0 to 1 scale, also with considerable range,

from 0.18 to 0.61. The percent urban (which is the percent of people in the state living in

a metropolitan area) is measured on a 0 to 100 scale. These scales mean that 50 percent is

indicated in the single-parent variable as 0.5 and as 50 in the urban variable. Getting the

scales mixed up could screw up the way we interpret statistical results about the relationships

among these variables.

Scatterplots provide excellent additional information about our data. Figure 2.3 shows

scatterplots of state-level violent-crime rate and percent urban, percent of children with


Table 2.5: Descriptive Statistics for State Crime Data

Variable                   Observations (N)   Mean     Standard deviation   Minimum   Maximum
Violent crime rate         51                 406.53   205.61               119.90    1,348.90
  (per 100,000 people)
Percent single parents     51                 0.33     0.07                 0.18      0.61
Percent urban              51                 73.92    14.92                38.83     100.00
Percent poverty            51                 13.85    3.11                 8.50      21.92

a single parent, and percent in poverty. Suddenly, the character of the data is revealed.

Washington, DC is a clear outlier because it is far higher than any of the 50 states in its level

of violent crime. Perhaps it should be dropped.5

We can also use scatterplots to appreciate non-obvious things about our data. We may

think of highly urbanized states as being the densely populated states in the northeast like

Massachusetts and New Jersey. Actually, though, the scatterplot helps us see that Nevada,

Utah and Florida are among the most urbanized according to the Census Bureau measure.

Understanding the reality of the urbanization variable helps us better appreciate what the

data is telling us.

Being aware of the data can help us detect possible endogeneity. Many of the high single-

parent and high poverty states are in the South; we may suspect that Southern states are

distinctive in other social and political characteristics, so we should be on high alert
5 Despite the fact that more people live in Washington, DC than in Vermont or Wyoming! Or, so says the
Washington, DC resident ...


[Figure 2.3 shows three scatterplots of the state violent-crime rate (per 100,000 people) against percent urban (0 to 100 scale), percent single parent (0 to 1 scale), and poverty percent (0 to 100 scale), with points labeled by state; DC sits far above every state in all three panels.]

FIGURE 2.3: Scatterplots of Violent Crime Against Percent Urban, Single Parent, and Poverty

about potential endogeneity in any analysis that uses poverty or single-parenthood variables.

These variables capture not only poverty and single parenthood, but also “Southernness.”

2.3 Statistical Software

We need software to do statistics. We have many choices and it’s worthwhile to learn at least

two different software packages. Different packages are good at different things and many


researchers use one program for some tasks and another program for other tasks. Also,

knowing multiple programs helps us think in terms of statistical concepts rather than in

terms of the software commands, which reinforces clear statistical thinking.

We refer to two major statistical packages throughout this book: Stata and R. (Yes, R is

a statistical package referred to by a single letter; the folks behind it are a bit minimalist.)

Stata provides simple commands to do many complex statistical analyses; the cost of this

simplicity is that we sometimes need to do a lot of digging to figure out what exactly

Stata is up to. And it is expensive. R can be a bit harder to get the hang of, but the

coding is often more direct so that less is hidden to the user. Oh yes, it’s also free at

http://www.r-project.org/. Free does not mean cheap or basic, though. In fact, R is so

powerful that it is the program of choice for many sophisticated econometricians.

In this book we learn by doing, showing specific examples of code in each chapter’s

Computing Corner. The best way to learn code is to get working; after a while the command

names become second nature. Replication files are also a great learning tool. Even if we

forget a specific command, it’s not so hard to remember “I want to do something like I did

for the homework on about education and wages.” All we have to do, then, is track down

the replication file and build from that.6

6 In the references we indicate some good sources for learning Stata and R.


Remember This
1. Stata is a powerful statistical software program. It is relatively user-friendly, but
it can be expensive.
2. R is another powerful statistical software program. It is less user-friendly, but it
is free.

2.4 Conclusion

This chapter prepares us for analyzing real data. The first step is understanding our data.

This vital first step makes sure that we know what we’re dealing with. We should use

descriptive statistics to get an initial feel for how much data we have and the scales of the

variables. Then we should graph our data. It’s a great way to appreciate what we’re dealing

with and spot interesting patterns or anomalies.

The second step of working with data is documenting our data and analysis. Social

science depends crucially on replication. Analyses that cannot be replicated cannot be

trusted. Therefore all statistical projects should document data and methods, ensuring that

anyone (including the author!) can re-create all results.

We are on track with the key concepts in this chapter when we can

• Section 2.1: Explain descriptive statistics and what to look for.

• Section 2.2: Explain the importance of replication and the two elements of a replication

file.


• Section 2.3 (and Computing Corner below): Do basic data description in Stata and R.

Further Reading

King (1995) provides an excellent discussion of the replication standard.

Data visualization is a growing field, with good reason as analysts increasingly commu-

nicate primarily via figures. Tufte (2001) is a landmark book. Schwabish (2004) and Yau

(2011) are nice guides to graphics. Failing to get figures right can even impact your personal

life: http://xkcd.com/833/.

Chen, Ender, Mitchell, and Wells (2003) is an excellent online resource for learning Stata.

Gaubatz (2015) is an accessible and comprehensive introduction to R. Other resources include

Verzani (2004) and online tools such as Swirl (2014). Venables and Ripley (2002) is a classic

reference book on S, the language that preceded R. Virtually all of it applies to R as well.

Other programs are widely used as well. EViews is a powerful program often used by

those doing forecasting models (see eviews.com). Some people use Excel for basic statistical

analysis. It’s definitely useful to have good Excel skills, but most people will need a more

specialized program to do serious statistical analysis.

Key Terms
• Codebook (47)
• Replication (46)


• Replication files (47)


• Robust (50)

Computing Corner

Stata
• The first thing to know is what to do when we get stuck (when, not if ). In Stata, type
help commandname if you have questions about a certain command. For example, if we
have questions about the summarize command, we can type help summarize to get a
description of the command. Probably the most useful information comes in the form
of the examples included at the end of these files. Often the best approach is to find an
example that seems closest to what we’re trying to do and apply that example to our
problem. Googling usually helps, too.
• A comment line is a line in the code that provides notes for the user. A comment line
does not actually tell Stata to do anything, but it can be incredibly useful to clarify
what is going on in the code. Comment lines in Stata begin with an asterisk (*). Using
(**) makes it easier to visually identify these crucial lines.
• To open a “syntax file” to document our analysis, click on Window - Do file editor - new
Do-file editor. It’s helpful to re-size this window so that we can see both the commands
and the results. Save our syntax file as “SomethingSomething.do”; the more informative
the name, the better. Including the date in the file name aids version control. To run
any command in the syntax file, highlight the whole line and then press ctrl-d. The
results of the command will be displayed in the Stata results window.
• One of the hardest parts of learning new statistical software is loading data into a
program. While some data sets are pre-packaged and easy, many are not, especially
those we create ourselves. We have to be prepared for the process of loading data to
take longer than expected. And because data sets can sometimes misbehave (columns
shifting in odd ways, for example) it is very important to use the descriptive statistics
diagnostics described in this chapter to make sure the data is exactly what we think it
is.
– To load Stata data files (which have .dta at the end of file name), there are two
options.


1. Use syntax
** For data located on the internet
use "http://www9.georgetown.edu//faculty//baileyma//RealStats/DonutData.dta"
** For data located on a computer
use "C:\Users\SallyDoe\Documents\DonutData.dta"
The “path” tells the computer where to find the file. In this example the path
is C:\Users\SallyDoe\Documents. The exact path depends on a computer’s file
structure.
2. Point-and-click: Go to the File - Open menu option in Stata and browse to
the file. Stata will then produce and display the command for opening that
particular file. It is a good idea to save this command in the syntax file so that
you document exactly the data being used.
– Loading non-Stata data files (files that are in tab-delimited, comma-delimited, or
other such format) depends on the exact format of the data. For example, use the
following to read in data that has tabs between variables on each line.
1. Use syntax
** For data located on the internet
insheet using "http://www9.georgetown.edu//faculty//baileyma//RealStats/DonutData.raw"
** For data located on a computer
insheet using "C:\Users\SallyDoe\Documents\DonutsData.raw"
2. Point-and-click: Go to File - Import and then select the file where the data
is stored. Stata will then produce and display the command for opening that
particular file. It is a good idea to save this command in the syntax file so that
you document exactly the data being used. Often it is easiest to use point-and-
click the first time and syntax after that.
• To see a list of variables loaded into Stata, look at the variable window that lists all
variables. We can also click on Data - Data editor to see variables.
• To make sure the data loaded correctly, display it with the list command. To display
the first 10 observations of all variables, type list in 1/10. To display the first 8
observations of only the weight variable, type list weight in 1/8. We can also look
at the data in Stata’s “Data Browser” by going to Data/Data editor in the toolbar.
• To see descriptive statistics on the weight and donut data as in Table 2.1 on page 44
use summarize weight donuts.
• To produce a frequency table such as Table 2.2 on page 44, type tabulate male. Use
this command only for variables that take on a limited number of possible values.


• To plot the weight and donut data as in Figure 2.2, type scatter weight donuts.
There are many options for creating figures. For example, to plot the weight and donut
data for males only with labels from a variable called “name,” type scatter weight
donuts if male==1, mlabel(name).

R
• To get help in R, type ?commandname for questions about a certain command. For
example, if we have questions about the mean command, we can type ?mean to get a
description of the command, options and, most importantly, examples. Often the best
approach is to find an example that seems closest to what we’re trying to do and apply
that example to our problem. Googling usually helps, too.
• Comment lines in R begin with a pound sign (#). Using ## makes it easier to visually
identify these crucial lines.
• To open a syntax file where we document our analysis, click on File - New script. It’s
helpful to re-size this window so that we can see both the commands and the results.
Save our syntax file as “SomethingSomething.R”; the more informative the name, the
better. Including the date in the file name aids version control. To run any command
in the syntax file, highlight the whole line and then press ctrl-r. The results of the
command will be displayed in the R Console window.
• To load R data files (which have .RData at the end of file name), there are two options.
1. Use syntax. The most reliable way to work with data from the internet is to down-
load it and then access it as a file on the computer. To do so, use the download
command, which needs to know the location of the data (which we name URL in
this example) and where on the computer to put the data (which we name Dest in
this example). Then use the load command. The following four commands load
the donut data into R’s memory.
URL = "http://www9.georgetown.edu//faculty//baileyma//RealStats//DonutData.RData"
Dest = "C:\\Users\\SallyDoe\\Documents\\DonutData.RData"
download.file(URL, Dest)
load("C:\\Users\\SallyDoe\\Documents\\DonutData.RData")
## We need the double backslashes in file name.
## Yes, they’re different than double forward slashes in the URL.
2. Point-and-click: Click on the R console (where we see results) and go to the File -
Load Workspace menu option and browse to the file. This method is easier, but it
does not leave a record in our .R file of exactly which data set is being used.


• To load non-R data files (files which are in .txt or other such format) requires more care.
We can download data using the same commands as for .RData. To read the data, we
use read.table. For example, to read in data that has commas between variables on
each line:
RawData = read.table("C:\\Users\\SallyDoe\\Documents\\DonutData.raw", header=TRUE, sep=",")
This command saves variables as Data$VariableName (e.g., RawData$weight, Raw-
Data$donuts). It is also possible to install special commands that load in various types
of data. For example, search the internet for “read.dta” to see more information on how
to install a special command that reads Stata files directly into R.
• It is also possible to manually load data into R. For example, weight = c(275, 141,
70, 75, 310, 80, 160, 263, 205, 185, 170, 155, 145)
donuts = c(14, 0, 0, 5, 20.5, 0.75, 0.25, 16, 3, 2, 0.8, 4.5, 3.5)
name = c("Homer", "Marge", "Lisa", "Bart", "Comic Book Guy", "Mr. Burns",
"Smithers", "Chief Wiggum", "Principal Skinner", "Rev. Lovejoy", "Ned Flanders",
"Patty", "Selma")
• To make sure the data loaded correctly, we can display our data in R with the following
tools:
1. Use the objects() command to show the variables and objects loaded into R.
2. For a single variable, enter the variable’s name in the R Console or highlight it in
the syntax file and press ctrl-r.7
3. To display only some observations for a single variable, use brackets. For example,
to see the first 10 observations of the donuts variable use donuts[1:10]
• To see the average of the weight variable, type mean(weight). One tricky thing with
R is that it chokes on variables that having missing data, meaning that if a single
observation is missing, then the simple version of the mean command will produce a
result of “NA.” Therefore we need to tell R what to do with missing data by modifying
the command to mean(weight, na.rm=TRUE). R refers to missing observations with a
“NA.” The “.rm” is shorthand for remove. A way to interpret the command, then, is us
telling R, “Yes, it is true that we will remove missing data from our calculations.” This
7R can load variables directly such that each variable has its own variable name. Or, it can load variables as part
of data frames such that the variables are loaded together. For example, our commands to load the .RData file loaded
each variable separately while our commands to load data from a text file created an object called “RawData” that
contains all the variables. To display a variable in the “RawData” object called “donuts”, type RawData$donuts in
the .R file, highlight it, and press ctrl-r. This process may take some getting used to, but experiment freely with any
data set you load in and it should become second nature.


syntax works for other descriptive statistics commands as well. Working with the na.rm
command is a bit of an acquired taste, but it becomes second nature soon enough.
To see the standard deviation of the weight variable, type sqrt(var(weight)) where
the sqrt part is referring to the square root function. The minimum and maximum
of the weight variable are displayed with min(weight) and max(weight). To see the
number of observations for a variable, use sum(is.finite(weight)). This command is
a bit clumsy: The is.finite function creates a variable that equals 1 for each non-missing
observation and the sum function sums this variable, creating a count of non-missing
observations.
• To produce a frequency table such as Table 2.2 on page 44, type table(male). Use
this command only for variables that take on a limited number of possible values.
• To plot the weight and donut data as in Figure 2.2, type plot(donuts, weight). There
are many options for creating figures. For example, to plot the weight and donut data
for males only with labels from a variable called “name,” type
plot(donuts[male == 1], weight[male == 1])
text(donuts[male == 1], weight[male == 1], name[male == 1]).
The syntax donuts[male == 1] tells R to use only values of donuts for which male
equals 1.8

Exercises
1. The data set DonutDataX.dta contains data from our donuts example on page 44.
There is one catch: Each of the variables has an error. Use the tools discussed in this
chapter to identify the errors.
2. What determines success at the Winter Olympics? Does population matter? Income?
Or is it simply a matter of being in a cold place with lots of mountains? Table 2.6
describes variables in olympics HW.dta related to the Winter Olympic Games from
1980 until 2014.
a. Summarize the medals, athletes, and GDP data.
b. List the first five observations for the country, year, medals, athletes, and GDP data.
8R plots are very customizable. To get a flavor, use text(donuts[male == 1], weight[male == 1], name[male
== 1], cex=0.6, pos=4) as the second line of the plot sequence of code. The “cex” command controls the size of
the label and the “pos=4” puts the labels to the right of the plotted point. Refer to the help menus in R or Google
around for more ideas.


Table 2.6: Variables for Winter Olympics Questions

Variable name   Description
ID              Unique number for each country in the data set
country         Name of country
year            Year
medals          Total number of combined medals won
athletes        Number of athletes in Olympic delegation
GDP             Gross domestic product (GDP) of country (in $10,000 US dollars)
temp            Average high temperature (in Fahrenheit) in January if
                country is in Northern Hemisphere or July if Southern
                Hemisphere (for largest city)
population      Population of country (in 100,000)
host            Equals 1 if host nation and equals 0 otherwise

c. How many observations are there for each year?


d. Produce a scatterplot of medals and the number of athletes. Describe the relationship
depicted.
e. Explain if you suspect there could be other factors that explain the observed rela-
tionship between the number of athletes and medals.
f. Create a scatterplot of medals and GDP. Briefly describe any clear patterns.
g. Create a scatterplot of medals and population. Briefly describe any clear patterns.
h. Create a scatterplot of medals and temperature. Briefly describe any clear patterns.
3. Persico, Postlewaite, and Silverman (2004) analyzed data from the National Longitu-
dinal Survey of Youth (NLSY) 1979 cohort to assess the relationship between height
and wages for white men who were between 14 and 22 years old in 1979. This data set
consists of answers from individuals who were asked questions in various years between
1979 and 1996. Here we explore the relationship between height and wages for the full
sample that includes men and women and all races. Table 2.7 describes the variables
we use for this question.
Table 2.7: Variables for Height and Wage Data in the United States

Variable name   Description
wage96          Hourly wages (in dollars) in 1996
height85        Adult height: height (in inches) measured in 1985
height81        Adolescent height: height (in inches) measured in 1981
siblings        Number of siblings


a. Summarize the wage, height (both height85 and height81), and sibling variables.
Discuss briefly.
b. Create a scatterplot of wages and adult height (height85). Discuss any distinctive
observations.
c. Create a scatterplot of wages and adult height that excludes the observations with
wages above $500 per hour.
d. Create a scatterplot of adult height against adolescent height. Identify the set of
observations where people’s adolescent height is less than their adult height. Do you
think we should use these observations in any future statistical analysis we conduct
with this data? Why or why not?
4. Anscombe (1973) created four data sets that had interesting properties. Let’s use
tools from this chapter to describe and understand these data sets. The data is in
a Stata data file called AnscombesQuartet.dta. There are four possible independent
variables (x1 through x4) and four possible dependent variables (y1 through y4). Create
a replication file that reads in the data and implements the analysis necessary to answer
the following questions. Include comment lines that explain the code.
a. Briefly note the mean and variance for each of the four X variables. Briefly note
the mean and variance for each of the four Y variables. Based on these, would you
characterize the 4 sets of variables as similar or different?
b. Create 4 scatter plots: one with X1 and Y1, one with X2 and Y2, one with X3 and
Y3, and one with X4 and Y4.
c. Briefly explain any differences and similarities across the four graphs.

Part I

The OLS Framework

CHAPTER 3

BIVARIATE OLS: THE FOUNDATION OF STATISTICAL ANALYSIS

Every four years Americans elect a president.

Each campaign has drama: controversies, gaffes,

over-the-top commercials. And yet, the results

are actually quite predictable based on how the

economy grew the previous four years. Panel A

of Figure 3.1 shows a scatterplot of the vote share of the incumbent president’s party and

changes in income for each election between 1948 and 2012. The relationship jumps right

out at us: Higher income growth is indeed associated with larger presidential vote shares.1
1 The figure is based on Noel (2010). The figure plots vote share as a percent of the total votes given to Democrats
and Republicans only. We use these data to avoid the complication that in some years third-party candidates such
as Ross Perot (in 1992, 1996) or George Wallace (in 1968) garnered non-trivial vote share.


[Figure 3.1 plots the incumbent party’s vote percent against the percent change in income for each presidential election from 1948 to 2012; panel (a) shows the scatterplot alone and panel (b) adds the fitted line.]

FIGURE 3.1: Relationship Between Income Growth and Vote for the Incumbent President’s Party, 1948-2012


In this chapter we introduce the foundational statistical model for analyzing such data.

The model allows us to quantify the relationship between two variables and to assess whether

the relationship occurred by chance or due to some real cause. We build on these methods

in the rest of the book in ways that help us differentiate, as best we can, true causes from

simple associations.

Basically, what we do is take data like that found in panel (a) of Figure 3.1 and

we estimate a line that best characterizes the relationship between the two variables. We

include this line in panel (b) of Figure 3.1. Look for the year 2012. It’s almost right on

the line. That’s a bit lucky (other years aren’t so close to the line), but the figure shows

that we can get a remarkably good start on understanding presidential elections with a

not-particularly-big data set and the tools we develop in this chapter.

The specific tool we introduce in this chapter is OLS, which stands for ordinary least

squares; we’ll explain why later. It’s not the best name. Regression and linear regression are

other commonly used names for the method - also lame names.2

The goal of this chapter is to introduce OLS. In Section 3.1 we show how to estimate

coefficients that produce a fitted line using OLS. The following sections then show that

these coefficient estimates have many useful properties. Section 3.2 demonstrates that the
2 In the late nineteenth century Francis Galton used the term regression to refer to the phenomenon that children
of very tall parents tended to be less tall than their parents. He called this phenomenon “regression to the mean” in
heights of children because children of tall parents tend to “regress” (move back) to average heights. Somehow the
term “regression” bled over to cover statistical methods to analyze relationships between dependent and independent
variables. Go figure.


OLS coefficient estimates are themselves random variables. Section 3.3 explains one of the

most important concepts in statistics: The OLS estimates of β̂1 will not be biased if X is

exogenous. That is, the estimates won’t be systematically higher or lower than the true

values as long as the independent variable is not correlated with the error term. Section

3.4 shows how to characterize the precision of the OLS estimates. Section 3.5 shows how

the distribution of the OLS estimates converges to a point as the sample size gets very, very

large. Section 3.6 discusses issues that complicate the calculation of the precision of our

estimates. These issues get intimidating names like heteroscedasticity and autocorrelation,

but their bark is worse than their bite and most statistical software can easily address them.

Finally, Sections 3.7 and 3.8 discuss tools for assessing how well the model fits the data and

whether any crazy observations could distort our conclusions.

3.1 Bivariate Regression Model

Bivariate OLS is a technique we use to estimate a model with two variables - a dependent

variable and an independent variable. In this section, we explain the model, estimate it,

and try it out on our presidential-election example. We extend the model in later chapters

when we discuss multivariate OLS, a technique we use to estimate models with multiple

independent variables.


The bivariate model

Bivariate OLS allows us to quantify the degree to which X and Y move together. We work

with our core statistical model that we introduced on page 8:

Yi = β0 + β1 Xi + εi                                   (3.1)

where Yi is the dependent variable and X is the independent variable. The parameter β0

is the intercept (or constant). It indicates the expected value of Y when Xi is zero. The

parameter β1 is the slope. It indicates how much Y changes as X changes. The random

error term εi captures everything else other than X that affects Y.

Adapting the generic bivariate equation to the presidential-election example produces:

Incumbent party vote sharei = —0 + —1 Income changei + ‘i (3.2)

where Incumbent party vote sharei is the dependent variable and Income changei is the

independent variable. The parameter —0 indicates the expected value of vote percentage for

the incumbent when income change equals zero. The parameter —1 indicates how much more

we expect vote share to rise as income change increases by one unit.

This model is an incredibly simplified version of the world. The data will not fall on a

completely straight line because many other factors affect elections, ranging from wars to

scandals to social issues and so forth. These factors comprise our error term, ‘i .

For any given data set, OLS produces estimates of the — parameters that best explain

the data. We indicate estimates as —ˆ0 and —ˆ1 , where the “hats” indicate that these are our


estimates. Estimates are different from the true values, β0 and β1, which don't get hats in our notation.3

How can these parameters best explain the data? The β̂s define a line with an intercept (β̂0) and a slope (β̂1). The task boils down to picking a β̂0 and β̂1 that define the line that minimizes the aggregate distance of the observations from the line. To do so we use two concepts: the fitted value and the residual.

The fitted value is the value of Y predicted by our estimated equation. The fitted value Ŷ (which we call "Y hat") from our bivariate OLS model is

\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i    (3.3)

Note the differences from Equation 3.1 - there are lots of hats and no εi. This is the equation for the regression line defined by the estimated β̂0 and β̂1 parameters and Xi.

A fitted value tells us what we would expect the value of Y to be given the value of the

X variable for that observation. To calculate a fitted value for any value of X, use Equation

3.3. Or, if we plot the line, we can simply look for the value of the regression line at that

value of X. All observations with the same value of Xi will have the same Ŷi , which is the

fitted value of Y for observation i. Fitted values are also called predicted values.

A residual measures the distance between the fitted value and an actual observation. In

the true model the error, εi, is that part of Yi not explained by β0 + β1Xi. The residual is the
3 Another common notation is to refer to estimates with regular letters rather than Greek letters (e.g. b0 and b1 ).
That’s perfectly fine, too, of course, but we stick with the hat notation for consistency throughout this book.


estimated counterpart to the error. It is the portion of Yi not explained by β̂0 + β̂1Xi (notice the hats). If our coefficient estimates exactly equaled the true values, then the residual would be the error; in reality, of course, our estimates β̂0 and β̂1 will not equal the true values β0 and β1, meaning that our residuals will differ from the error in the true model.

The residual for observation i is ε̂i = Yi − Ŷi. Equivalently, we can say a residual is ε̂i = Yi − β̂0 − β̂1Xi. We indicate residuals with ε̂ ("epsilon-hat"). As with the βs, a Greek letter with a hat is an estimate of the true value. The residual ε̂i is distinct from εi, which is how we denote the true, but not directly observed, error.

Estimation

The OLS estimation strategy is to identify values of β̂0 and β̂1 that define a line that minimizes the sum of the squared residuals. We square the residuals because we want to treat

a residual of +7 (as when an observed Yi is 7 units above the fitted line) as equally undesir-

able as a residual of -7 (as when an observed Yi is 7 units below the fitted line). Squaring

the residuals converts all residuals to positive numbers. Our +7 residual and -7 residual

observations will both register as +49 in the sum of squared residuals.

Specifically, the expression for the sum of squared residuals for any given estimates of β̂0 and β̂1 is

\sum_{i=1}^{N} \hat{\epsilon}_i^2 = \sum_{i=1}^{N} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2

The OLS process finds the β̂1 and β̂0 that minimize the sum of squared residuals. The fact


that we’re squaring the residuals is where the “squares” in “ordinary least squares” comes

from. The “least” bit is from minimizing the sum of squares. The “ordinary” refers to the

fact that we haven’t gotten to anything fancy yet.

As a practical matter, we don’t need to carry out the minimization ourselves - we can leave

that to the software. The steps are not that hard, though, and we step through a simplified

version of the minimization task in Chapter 14 on page 711. This process produces specific

equations for the OLS estimates of β̂0 and β̂1. These equations provide estimates of the slope (β̂1) and intercept (β̂0) combination that characterizes the line that best fits the data.

The OLS estimate of β̂1 is

\hat{\beta}_1 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{N} (X_i - \bar{X})^2}    (3.4)

where X̄ (read as "X bar") is the average value of X and Ȳ is the average value of Y.

Equation 3.4 shows that β̂1 captures how much X and Y move together. The numerator has Σ(Xi − X̄)(Yi − Ȳ). The first bit inside the sum is the difference of X from its mean for the ith observation; the second bit is the difference of Y from its mean for the ith observation. The product of these bits is summed over observations. So if Y tends to be above its mean (meaning (Yi − Ȳ) is positive) when X is above its mean (meaning (Xi − X̄) is positive), then there will be a bunch of positive elements in the sum in the numerator. If Y tends to be below its mean (meaning (Yi − Ȳ) is negative) when X is below its mean (meaning (Xi − X̄) is negative), we'll also get positive elements in the sum because a negative number times a


negative number is positive. Such observations will also push β̂1 to be positive.

On the other hand, β̂1 will be negative when the signs of Xi − X̄ and Yi − Ȳ are mostly opposite. For example, if X is above its mean (meaning (Xi − X̄) is positive) when Y is below its mean (meaning (Yi − Ȳ) is negative), we'll get negative elements in the sum and β̂1 will tend to be negative.4

The OLS equation for β̂0 is easy once we have β̂1. It is

\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}    (3.5)

We focus on the equation for β̂1 because this is the parameter that defines the relationship between X and Y, which is what we usually care most about.


4 There is a close affinity between the regression coefficient in bivariate OLS and covariance and correlation. Using the equations for variance and covariance from the appendix on pages 768 and 770 we can see that Equation 3.4 can be rewritten as cov(X, Y)/var(X). Using the relationship between covariance and correlation, this equation can equivalently be written as corr(X, Y) × (σ_Y/σ_X), which indicates that the bivariate regression coefficient is simply a rescaled correlation coefficient. The correlation coefficient indicates the strength of the association, while the bivariate regression coefficient indicates the effect of a 1 unit increase in X on Y. It's a good lesson to remember. We all know "correlation does not imply causation"; this little nugget tells us that "bivariate regression (also!) does not imply causation." The appendix provides additional details on page 771.
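To make Equations 3.4 and 3.5 concrete, here is a minimal R sketch (the numbers are made up for illustration; this is not the election data) that computes the slope and intercept by hand and checks them against R's built-in lm() function:

# Hypothetical data: 100 observations of an independent and a dependent variable
set.seed(123)
x <- rnorm(100, mean = 2, sd = 1.5)
y <- 45 + 2 * x + rnorm(100, sd = 5)

# Equation 3.4: slope estimate
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Equation 3.5: intercept estimate
b0 <- mean(y) - b1 * mean(x)

c(b0, b1)        # hand-rolled OLS estimates
coef(lm(y ~ x))  # R's built-in OLS routine gives the same numbers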


Bivariate OLS and presidential elections

For the election and income data plotted in Figure 3.2, the equations for β̂0 and β̂1 produce the following estimates:

\widehat{\text{Incumbent party vote share}}_i = \hat{\beta}_0 + \hat{\beta}_1 \, \text{Income change}_i = 45.9 + 2.3 \times \text{Income change}_i

Figure 3.2 shows what these coefficient estimates mean. The β̂1 estimate implies that the incumbent party's vote percentage went up by 2.3 percentage points for each one-percent increase in income. The β̂0 estimate implies that the expected election vote share for the incumbent president's party for a year with zero income growth was 45.9 percent.

Table 3.1 and Figure 3.3 show predicted values and residuals for specific presidential

elections. In 1960, income growth was rather low (at 0.58 percent). The vote percent for the

Republicans (who controlled the presidency at the time of the election) was 49.9 (Republican

Richard Nixon lost a squeaker to Democrat John F. Kennedy). The fitted value, denoted

by a triangle in the figure, is 45.9 + 2.3 × 0.58 = 47.2. The residual, which is the difference between the actual and fitted, is 49.9 − 47.2 = 2.7 percent. In other words, in 1960 the

incumbent president’s party did 2.7 percentage points better than would be expected based

on the regression line.

In 1964, income growth was high (at 5.58 percent). The Democrats controlled the pres-

idency at the time of the election and they received 61.3 percent of the vote (Democrat


Incumbent party’s
vote percent
1972
1964

60

1984

1956

55 1996

1988

1948

2012

2004
2000
1960
e)
op
50 1968
sl 1976
e
(th
3
2.
^β 1=
1992
2008
^
β0 = 45.9
45 1952
1980

−1 0 1 2 3 4 5 6
Percent change in income

FIGURE 3.2: Elections and Income Growth with Model Parameters Indicated


Incumbent party’s
vote percent

1964

60 Residual for 1964

Fitted
value
for 1964

55 Fitted value for 2000

Residual for 2000

1960
50 2000

Residual for
1960

Fitted value for 1960

45

−1 0 1 2 3 4 5 6
Percent change in income

FIGURE 3.3: Fitted Values and Residuals for Observations in Table 3.1


Table 3.1: Selected Observations from Election and Income Data

Year   Income change (X)   Incumbent party vote share (Y)   Fitted value (Ŷ)   Residual (ε̂)
1960         0.58                      49.9                       47.2              2.7
1964         5.58                      61.3                       58.7              2.6
2000         3.85                      50.2                       54.7             -4.5

Lyndon Johnson trounced Republican Barry Goldwater). The fitted value based on the re-

gression line was 45.90 + 2.30 × 5.58 = 58.7. The residual, which is the difference between the actual and fitted, is 61.3 − 58.7 = 2.6 percent. In other words, in 1964 the incumbent

president’s party did 2.6 percentage points better than would be expected based on the re-

gression line. In 2000, the residual was negative, meaning that the incumbent president’s

party (the Democrats at that time) did 4.5 percentage points worse than would be expected

based on the regression line.
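As a quick check on the arithmetic, the following R snippet reproduces the fitted values and residuals in Table 3.1 from the rounded coefficient estimates reported above (because the table itself was built from unrounded estimates, the last digit can differ slightly):

income_change <- c(0.58, 5.58, 3.85)       # X for 1960, 1964, 2000
vote_share    <- c(49.9, 61.3, 50.2)       # Y for 1960, 1964, 2000

fitted_vals <- 45.9 + 2.3 * income_change  # Y-hat = beta0-hat + beta1-hat * X
resids      <- vote_share - fitted_vals    # residual = actual minus fitted

round(cbind(income_change, vote_share, fitted_vals, resids), 1)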


Remember This
1. The bivariate regression model is

   Y_i = \beta_0 + \beta_1 X_i + \epsilon_i

   • The slope parameter is β1. It indicates the change in Y associated with an increase of X by one unit.
   • The intercept parameter is β0. It indicates the expected value of Y when X is zero.
   • β1 is almost always more interesting than β0.
2. OLS estimates β̂1 and β̂0 by minimizing the sum of squared residuals:

   \sum_{i=1}^{N} \hat{\epsilon}_i^2 = \sum_{i=1}^{N} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2

   • A fitted value for observation i is Ŷi = β̂0 + β̂1Xi.
   • The residual for observation i is the difference between the actual and fitted value for observation i: ε̂i = Yi − Ŷi.

3.2 Random Variation in Coefficient Estimates

The goal of bivariate OLS is to get the most accurate idea of β0 and β1 that the data can provide. The challenge is that we don't observe the values of the βs. We are only able to

estimate the true values based on the data we observe. And because the data we observe is

random, at least in the sense of having a random error term in it, our estimates will have a

random element too.

In this section we explain where the randomness of our coefficient estimates comes from,

introduce the concept of probability distributions, and show that our coefficient estimates


come from a normal probability distribution.

β̂ estimates are random

There are two different ways to think about the source of randomness in our coefficient

estimates. First, our estimates may have sampling randomness. This variation is due

to the fact that we may be observing only a subset of an entire population. Think of some

population, say the population of ferrets in Florida. Suppose we want to know whether old

ferrets sleep more. There is some relationship between ferret age and sleep in the overall

population, but we are able to get a random sample of, say, only 1,000 ferrets. We estimate

the following bivariate OLS model:

\text{Sleep}_i = \beta_0 + \beta_1 \, \text{Age}_i + \epsilon_i    (3.6)

Based on the sample we have selected, we generate a coefficient β̂1. We're sensible enough to know that if we had selected a different 1,000 ferrets in our random sample we would have gotten a different value of β̂1 because the specific values of sleep and age for the selected ferrets would differ. Every time we select a different 1,000 ferrets we get a different estimate β̂1 even though the underlying population relationship is fixed at the true value, β1. Such variation is called random variation in β̂1 due to sampling. Opinion surveys typically involve a random sample of people and are often considered through the sampling variation perspective.

Second, our estimates will have modeled randomness. Think again of the population


of ferrets. Even if we were to get data on every last one of them, our model has random

elements. The ferret sleep patterns (the dependent variable) are subject to randomness that

goes into the error term. Maybe one ferret had a little too much celery, another got stuck

in a drawer, and yet another broke up with his girlferret. Having data on every single ferret

would not change the fact that unmeasured factors denoted by ε affect ferret sleep.

In other words, there is inherent randomness in the data-generation process even when

data is measured for an entire population. So even if we observe a complete population at

any given time such that we do not have sampling variation, we will have randomness due to

the data-generation process. Put differently, the modeled randomness perspective highlights

the fact that virtually every model has some unmeasured component that explains some of

the variation in our dependent variable.

An OLS estimate of β̂1 inherits randomness whether from sampling or modeled randomness. The estimate β̂1 is therefore a random variable, which means it is a variable that takes on a set of possible different values, each with some probability. An easy way to see why β̂1 is random is to note that it depends on the values of the Yis, which in turn depend on the εi values, which themselves are random.

Distributions of β̂ estimates

To understand these random β̂1s, it is best to think of the distribution of β̂1. That is, we want to think about the various values we expect β̂1 to take and the relative likelihood of


these values.

Let’s start with random variables more generally. A random variable with discrete out-

comes can take on one of a finite set of specific outcomes. The flip of a coin or roll of a

die yields a random variable with discrete outcomes. These random variables have prob-

ability distributions. A probability distribution is a graph or formula that identifies the

probability for each possible value of a random variable.

Many probability distributions of random variables are intuitive. We all know the distri-

bution of a coin toss: heads with 50 percent probability and tails with 50 percent probability.

Panel (a) of Figure 3.4 plots this data with the outcome on the horizontal axis and the prob-

ability on the vertical axis. We also know the distribution of the roll of a six-sided die. There is a 1/6 probability of seeing any of the six numbers on it, as panel (b) of Figure 3.4 shows.

These are examples of random variables with a specific number of possible outcomes: two

(as with a coin toss) or six (as with a roll of a die).

This logic of distributions extends to continuous variables, which are variables that can

take on any value in some range. Weight in our donut example from Chapter 1 is essentially

a continuous variable; it can be measured to a very fine degree of precision so that we

can’t simply say there is some specific number of possible outcomes. We don’t identify a

probability for each possible outcome for continuous variables because there is an unlimited

number of possible outcomes. Instead we identify a probability density, which is a graph

or formula that describes the relative probability a random variable is near a specified value


[Figure 3.4 here: panel (a) shows the probability distribution of a coin flip, panel (b) the distribution of a roll of a die, panel (c) a normal (bell-curve) probability density, and panel (d) a probability density with two bumps.]

FIGURE 3.4: Four Distributions


for the range of possible outcomes for the random variable.

Probability densities run the gamut from familiar to weird. On the familiar end of things

is a normal distribution, which is the classic bell curve in panel (c) of Figure 3.4. This

plot indicates the probability of observing realizations of the random variable in any given

range. For example, since half of the area of the density is less than 0, we know that there

is a 50 percent chance that this particular normally distributed random variable will be less

than zero. Because the probability density is high in the middle and low on the ends we can

say, for example, that the normal random variable plotted in panel (c) is more likely to take

on values around zero than values around -4. The odds of observing values around +1 or -1

are still reasonably high, but the odds of observing values near +3 or -3 are small.

Probability densities for random variables can have odd shapes, as in panel (d) of Figure

3.4, which shows a probability density for a random variable that has its most likely outcomes

near 64 and 69.⁵ The point of panel (d) is to make it clear that not all continuous random

variables follow the bell-shaped distribution. We could draw a squiggly line and, if it satisfied

a few conditions, it too would be a valid probability distribution. We discuss probability

densities in more detail in the appendix starting on page 771.


5 The distribution of adult heights measured in inches looks something like this. What explains the two bumps in
the distribution?

Central Limit Theorem: The average of a sufficiently large number of independent draws from any distribution will be normally distributed. Because OLS estimates are weighted averages, the Central Limit Theorem implies that the distribution of β̂1 will be normally distributed.


β̂ estimates are normally distributed

The cool thing about OLS is that for large samples the β̂s will be normally distributed random variables. While we can't know exactly what the value of β̂1 will be for any given true β1, we know that the distribution of β̂1 will follow a normal bell curve. We discuss how to calculate the width of the bell curve in Section 3.4, but knowing the shape of the probability density for β̂1 is a huge advantage. The normal distribution has well-known properties and is relatively easy to deal with, making our lives much easier in what is to come.

The normality of our OLS coefficient estimates is amazing. If we have enough data, the distribution of β̂1 will follow a bell-shaped distribution even if the errors follow a weird distribution like the one in panel (d) of Figure 3.4. In other words, just pour εi values from any crazy random distribution into our OLS machine and it will spit out β̂1 estimates that are normally distributed as long as we have a sufficiently large sample.6

Why is β̂1 normally distributed for large samples? The reason is a theorem at the heart of all statistics: the Central Limit Theorem. This theorem states that the average of a large number of independent draws from any
6 If the errors in the model (the εs) are normally distributed, then β̂1 will be normally distributed no matter what the sample size is. Therefore in small samples, if we could make ourselves believe the errors are normally distributed, then that belief would be a basis for treating β̂1 as coming from a normal distribution. Unfortunately, many people are skeptical that errors are normally distributed in most empirical models. Some statisticians therefore pour a great deal of energy into assessing whether errors are normally distributed (just Google "normality of errors"), but we don't need to worry about this debate as long as we have a large sample.


random variable follows a normal distribution.7 In other words, get a sample of data from

some distribution and calculate the average. For example, roll a six-sided die 50 times and

calculate the average across the 50 rolls. Then roll the die another 50 times and take the

average again. Go through this routine again and again and again and plot a histogram of

the averages. If we get a large number of averages, the histogram will look like a normal

distribution. The most common averages will be around the true average of 3.5 (the average

of the 6 numbers on a die). In some of our sets of 50 rolls we’ll see more sixes than usual and

those averages will tend to be closer to 4. In other sets of 50 rolls we’ll see more ones than

usual and those averages will tend to be closer to 3. Crucially, the shape of the distribution

will look more and more like a normal distribution the larger our sample of averages gets.
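This die-rolling thought experiment is easy to simulate. Here is a small R sketch (not from the book's Computing Corner) that takes the average of 50 rolls many times and plots the resulting averages:

set.seed(123)
averages <- replicate(10000, mean(sample(1:6, size = 50, replace = TRUE)))

hist(averages, breaks = 50, main = "Averages of 50 die rolls")  # roughly bell shaped
mean(averages)   # close to 3.5, the true average for a fair die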

Even though the Central Limit Theorem is about averages, it is relevant for OLS. Econometricians deriving the distribution of β̂1 invoke the Central Limit Theorem to prove that β̂1 will be normally distributed for a sufficiently large sample size.8

What sample size is big enough for the Central Limit Theorem and, therefore, normality
7 There are some technical assumptions necessary. For example, the “distribution” of the values of the error term
cannot consist solely of a single number.
8 One way to see why is to think of the OLS equation for β̂1 as a weighted average of the dependent variable. That's not super obvious, but if we squint our eyes and look at Equation 3.4, we see that we could re-write it as \hat{\beta}_1 = \sum_{i=1}^{N} w_i (Y_i - \bar{Y}) where w_i = \frac{X_i - \bar{X}}{\sum_{i=1}^{N} (X_i - \bar{X})^2}. (We have to squint really hard!) In other words, we can think of β̂1 as a weighted sum of the Yis where wi is the weight (and we happen to subtract the mean of Y from each Yi). It's not too hard to get from a weighted sum to an average (rewrite the denominator of wi as N var(X)). Doing so opens the door for the Central Limit Theorem (which is, after all, about averages) to work its magic and establish that β̂1 will be normally distributed for large samples.


to kick in? There is no hard-and-fast rule, but the general expectation is that around

100 observations is enough. If we have data with some really extreme outliers or other

pathological cases, then we may need a larger sample size. Happily, though, the normality

of the β̂1 distribution generally applies even for data with as few as 100 observations.

Remember This
1. Randomness in coefficient estimates can be the result of
   • sampling variation, which arises due to variation in the observations selected into the sample. Each time a different random sample is analyzed, a different estimate of β̂1 will be produced even though the population (or "true") relationship is fixed.
   • modeled variation, which arises because of inherent uncertainty in outcomes. Virtually any data set has unmeasured randomness, whether the data set covers all observations in a population or some subsample (random or not).
2. The Central Limit Theorem implies the β̂0 and β̂1 coefficients will be normally distributed random variables if the sample size is sufficiently large.

3.3 Exogeneity and Unbiasedness

We know that β̂1 is not simply the true value β1; it is an estimate, after all. But how does β̂1 relate to β1? In this section we introduce the concept of unbiasedness, explain the condition

under which our estimates are unbiased, and characterize the nature of the bias when this

condition is not satisfied.


Conditions for unbiased estimates

Perhaps the central concept of this whole book is that β̂1 is an unbiased estimator of the true value β1 when X is uncorrelated with ε. This concept is important; go slowly if it is new to you.

In ordinary conversation we say a source of information is biased if it slants things against the truth. The statistical concept of bias is rather close. For example, our estimate β̂1 would be biased if the β̂1s we observe are usually around -12 when the true value of β1 is 16. In other words, if our system of generating a β̂1 estimate was likely to produce a negative value when the true value was 16, we'd say the estimating procedure was biased. As we discuss below, such bias happens a lot (and the villain is almost always endogeneity).

Our estimate β̂1 is unbiased if the average value of the distribution of the β̂1 is equal to the true value. An unbiased distribution will look like Figure 3.5, which shows a distribution of β̂1s centered around the true value of β1. The good news about an unbiased estimator is that on average, our β̂1 should be pretty good. The bad news is that any given β̂1 could be far from the true value depending on how wide the distribution is and on luck – just by chance we could get a value at the low or high end of the distribution.

In other words, unbiased does not mean perfect. It just means that, in general, there is no systematic tendency to be too high or too low. The distribution of β̂1 can be quite wide so that even as the average is the true value, we could still observe values of β̂1 that are far from the true value, β1.

[Figure 3.5 here: a bell-shaped probability density of β̂1 centered at the true value β1.]

FIGURE 3.5: Distribution of β̂1

Think of the people who judge figure skating at the Olympics. Some are biased – perhaps

blinded by nationalism or wads of cash – and they systematically give certain skaters higher

or lower scores than the skaters deserve. Other judges (most?) are not biased. These judges

do not get the right answer every time.9 Sometimes an unbiased judge will give a score that

is higher than it should be, sometimes lower. Similarly, an OLS regression coefficient β̂1 can be an unbiased estimate of β1 but be too high or too low in a given application.

Here are two thought experiments that shed light on unbiasedness. First, let’s approach

the issue from the sampling-randomness framework from Section 3.2. Suppose we select a

sample of people, measure some dependent variable Yi and independent variable Xi for each,

and use those to estimate the OLS β̂1. We write that down and then select another sample
9 We’ll set aside the debate about whether a right answer even exists for now. Let’s imagine there is a score that
judges would on average give to a performance if the skater’s identity were unknown.


of people, get the data, estimate the OLS model again and write down the new estimate of β̂1. The new estimate will be different because we'll have different people in our data set. Repeat the process again and again and write down all the different β̂1s and then calculate the average of the estimated β̂1s. While any given realization of β̂1 could be far from the true value, we will call the estimates unbiased if the average of the β̂1s is the true value, β1.

We can also approach the issue from the modeled-randomness framework from Section 3.2. Suppose we generate our data. We set the true β1 and β0 values as some specific values. We also fix the value of Xi for each observation. Then we draw the εi for each observation from some random distribution. These values will come together in our standard equation to produce values of Y that we then use in the OLS equation for β̂1. Then we repeat the process of generating random error terms (while keeping the true β and X values the same). Doing so produces another set of Yi values and a different OLS estimate of β̂1. We re-run this process a bunch of times and write down the β̂1 estimates. If the average of the β̂1s we have recorded is equal to the true value β1, then we say that β̂1 is an unbiased estimator of β1.
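The modeled-randomness thought experiment can also be run on a computer. The R sketch below (with arbitrary made-up values for β0, β1, and X) holds the true parameters and the X values fixed, draws new errors each time, and then averages the resulting slope estimates:

set.seed(123)
beta0 <- 45.9
beta1 <- 2.3
X <- runif(100, min = -1, max = 6)        # X values held fixed across repetitions

one_estimate <- function() {
  epsilon <- rnorm(100, mean = 0, sd = 5) # error drawn fresh, uncorrelated with X
  Y <- beta0 + beta1 * X + epsilon
  coef(lm(Y ~ X))[2]                      # the estimated slope
}

slope_estimates <- replicate(5000, one_estimate())
mean(slope_estimates)                     # very close to the true beta1 of 2.3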

OLS does not automatically produce unbiased coefficient estimates. A crucial condition

must be satisfied for OLS estimates to be unbiased. It’s a condition we discussed earlier

in Chapter 2. The error term cannot be correlated with the independent variable. This

exogeneity condition is at the heart of everything. If this condition is violated, then there is

something in the error term that is correlated with our independent variable and there is a


chance that it will contaminate the observed relationship between X and Y . In other words,

while observing large values of Y associated with large values of X naturally inclines us to

think X pushes Y higher, we worry that something in the error term that is big when X is

big is actually what is causing Y to be high. In that case, the relationship between X and

Y is spurious and the real causal influence is that something in the error term.

Bias in crime and ice cream example

Almost every interesting relationship between two variables in the policy and economic

worlds has some potential for correlation between X and the error term. Let’s start with a

classic example. Suppose we wonder whether ice cream makes people violent.10 We estimate

the following bivariate OLS model

\text{Violent crime}_t = \beta_0 + \beta_1 \, \text{Ice cream sales}_t + \epsilon_t    (3.7)

where violent crime in period t is the dependent variable and ice cream sales in period t is the independent variable. We'd find that β̂1 is greater than zero, suggesting crime is indeed

higher when ice cream sales go up.

Does this relationship mean that ice cream is causing crime? Maybe. Probably not. OK,

no, it doesn’t. What’s going on? There are a lot of factors in the error term and one of them

is probably truly associated with crime and correlated with ice cream sales. Any guesses?

Heat. Heat makes people want ice cream and, it turns out, makes them cranky (or gets
10 Why would we ever wonder that? Work with me here ...


them out of doors) such that crime goes up. Hence a bivariate OLS model with just ice

cream sales will show a relationship, but due to endogeneity, this relationship is really just

correlation and not causation.

Characterizing bias

As a general matter, we can say that as the sample size gets large, the estimated coefficient

will on average be off by some function of the correlation between the included variable and

the error term. We show in Chapter 14 on page 713 that the expected value of our bivariate

OLS estimate is

E[\hat{\beta}_1] = \beta_1 + \text{corr}(X, \epsilon) \frac{\sigma_\epsilon}{\sigma_X}    (3.8)

where E[β̂1] is short for the expectation of β̂1, corr(X, ε) is the correlation of X and ε, σε (the lower case Greek letter sigma) is the standard deviation of ε and σX is the standard deviation of X. The fraction at the end of the equation is more of a normalizing factor, so we don't need to worry too much about it.11

The key thing is the correlation of X and ε. The bigger this correlation, the further the expected value of β̂1 will be from the true value. Or, in other words, the more the independent variable and the error are correlated, the more biased OLS will be.
11 If we use the fact that corr(X, ε) = cov(X, ε)/(σε σX), we can write Equation 3.8 as E[β̂1] = β1 + cov(X, ε)/σX², where cov is short for covariance.


The rest of this book mostly revolves around what to do if the correlation of X and ε is not zero. The ideal solution is to use randomized experiments for which corr(X1, ε) is zero by design. But in the real world, experiments often fall prey to challenges discussed in Chapter 10. For non-experimental data, which is more common than experimental data, we'll discuss lots of tricks in the rest of this book that help us generate unbiased estimates even when corr(X1, ε) is non-zero.
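A simulation along the lines of the ice cream example makes Equation 3.8 tangible. In the hypothetical R sketch below the true effect of ice cream on crime is zero, but heat sits in the error term and also drives ice cream sales, so the average slope estimate lands well away from zero:

set.seed(123)
one_estimate <- function() {
  heat      <- rnorm(500)
  ice_cream <- 2 * heat + rnorm(500)   # X is correlated with heat
  epsilon   <- 3 * heat + rnorm(500)   # heat is part of the error term
  crime     <- 10 + 0 * ice_cream + epsilon
  coef(lm(crime ~ ice_cream))[2]
}

slope_estimates <- replicate(2000, one_estimate())
mean(slope_estimates)  # about 1.2, not 0: roughly corr(X, epsilon) * sd(epsilon) / sd(X)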

Remember This
1. The distribution of an unbiased estimator is centered at the true value, β1.
2. The OLS estimator β̂1 is an unbiased estimator of β1 if X and ε are not correlated.
3. If X and ε are correlated, the expected value of β̂1 is β1 + corr(X, ε) σε/σX.

3.4 Precision of Estimates

There are two ways we can get a β̂1 estimate that is not close to the true value. One is bias, as discussed above. The other is random chance. Our OLS estimates are random and with the luck of the draw we might get an estimate that's not very good. Therefore it is very useful to characterize the variance of our random β̂1 estimates as this will help us appreciate

when we should expect estimates near the true value and when we shouldn’t. In this section

we explain what we mean by the precision of our estimates and provide an equation for the

variance of our coefficient estimates.


Estimating coefficients is a bit like trick-or-treating. We show up at a house and reach

into a bowl. We’re not quite sure what we’re going to get. We might get a Snickers (yum!),

a Milky Way (not bad), a Mounds bar (trade-bait) or a severed human pinkie (run away!).

When we estimate OLS coefficients, it's like we're reaching into a bowl of possible β̂1s and

pulling out an estimate. Anytime we reach into the unknown, we don’t quite know what

we’re going to get.

But we do know certain properties of the β̂1s that went into the bowl. If the exogeneity condition holds, the average of the β̂1s that went into the bowl is β1. It also turns out that we can say a lot about the range of β̂1s in the bowl. We do this by characterizing the width of the β̂1 distribution.

To get a sense of what’s at stake, Figure 3.6 shows two distributions for a hypothetical

—ˆ1 . The darker, lower curve is much wider than the lighter, higher curve. The lighter curve

is more precise because more of the distribution is near the true value.

The primary measure of precision is the variance of —ˆ1 . The variance is – you guessed

it – a measure of how much something varies. The wider the distribution, the larger its

variance. The square root of the variance is the standard error of —ˆ1 . The standard error

is a measure of how much —ˆ1 will vary. A large standard error indicates that the distribution

of —ˆ1 is very wide; if the standard error is small, the distribution of —ˆ1 is narrower.

We prefer —ˆ1 to have a smaller variance. With smaller variance, values close to the true

value are more likely, meaning we’re less likely to be far off when we generate the —ˆ1 . In


[Figure 3.6 here: two probability densities of β̂1, one with a smaller variance (taller and narrower) and one with a larger variance (shorter and wider).]

FIGURE 3.6: Two Distributions with Different Variances of β̂1


other words, our bowl of estimates will be less likely to have wacky stuff in it.

Under the right conditions, we can characterize the variance (and, by extension, the standard error) of β̂1 with a simple equation. We discuss the conditions on page 102. If they are satisfied, the estimated variance of β̂1 for a bivariate regression is

\text{var}(\hat{\beta}_1) = \frac{\hat{\sigma}^2}{N \times \text{var}(X)}    (3.9)

This equation tells us how wide our distribution of β̂1 is.12 We don't need to calculate the variance of β̂1 by hand. That is, after all, why we have computers. We can, however, understand what causes precise or imprecise β̂1 estimates by looking at each part of this equation.

First, note that the variance of β̂1 depends directly on the variance of the regression, σ̂². The variance of the regression measures how well the model explains variation in Y. (And, just to be clear, the variance of the regression is different from the variance of β̂1.) That is, do the actual observations cluster fairly closely to the line implied by β̂0 and β̂1? If so, the fit is pretty good and σ̂² will be low. If the observations are not particularly close to the line implied by the β̂s, the fit is pretty poor and σ̂² will be high.

We calculate σ̂² based on how far the fitted values are from the actual observed values.
12 We derive a simplified version of the equation in the advanced OLS chapter on page 718.


The equation is

\hat{\sigma}^2 = \frac{\sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2}{N - k} = \frac{\sum_{i=1}^{N} \hat{\epsilon}_i^2}{N - k}    (3.10)

which is (essentially) the average squared deviation of fitted values of Y from the actual values. It's not quite an average because the denominator has N − k rather than N. The N − k in the denominator is the degrees of freedom where k is the number of variables (including the constant) in the model.13

The more individual observations deviate from their fitted values, the higher σ̂² will be. This is also an estimate of the variance of ε in our core model, Equation 3.1.14

Next, look at the denominator of the variance of β̂1 (Equation 3.9). It is N × var(X). Yawn. There are, however, two important substantive facts in there. First, the bigger the sample size (all else equal), the smaller the variance of β̂1. In other words, more data means

lower variance. More data is a good thing.

Second, we see that the variance of X reduces the variance of β̂1. The variance of X is
13 For bivariate regression, k = 2 because we estimate two parameters (β̂0 and β̂1). We can think of the degrees of freedom correction as a penalty for each parameter we estimate; it's as if we use up some information in the data with each parameter we estimate and we cannot, for example, estimate more parameters than the number of observations we have. If N is large enough, the k in the denominator will have only a small effect on the estimate of σ̂². For small samples, the degrees of freedom issue can matter more. Every statistical package will get this right and the core intuition is that σ̂² measures the average squared distance between actual and fitted values.
14 Recall that the variance of ε̂ will be \frac{\sum_i (\hat{\epsilon}_i - \bar{\hat{\epsilon}})^2}{N}. The OLS minimization process automatically creates residuals with an average of zero (meaning \bar{\hat{\epsilon}} = 0). Hence, the variance of the residuals reduces to Equation 3.10.


calculated as \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}. In other words, the more our X variable varies, the more precisely we will be able to learn about β1.15
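Equation 3.9 is easy to verify against what statistical software reports. A minimal R sketch with simulated data:

set.seed(123)
N <- 200
x <- rnorm(N)
y <- 1 + 2 * x + rnorm(N, sd = 3)
fit <- lm(y ~ x)

sigma2_hat <- sum(resid(fit)^2) / (N - 2)     # variance of the regression (k = 2)
var_x      <- mean((x - mean(x))^2)           # variance of X (dividing by N)

sqrt(sigma2_hat / (N * var_x))                # Equation 3.9, expressed as a standard error
summary(fit)$coefficients["x", "Std. Error"]  # the same number reported by lm()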

Remember This
1. The variance of β̂1 measures the width of the β̂1 distribution. If the conditions discussed in Section 3.6 are satisfied, then the estimated variance of β̂1 is

   \text{var}(\hat{\beta}_1) = \frac{\hat{\sigma}^2}{N \times \text{var}(X)}

2. Three factors influence the estimated variance of β̂1:
   (a) Model fit. The variance of the regression, σ̂², is a measure of how well the model explains variation in Y. It is calculated as

       \hat{\sigma}^2 = \frac{\sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2}{N - k}

       The lower σ̂², the lower will be var(β̂1).
   (b) Sample size. The more observations, the lower var(β̂1).
   (c) Variation in X. The more the X variable varies, the lower will be var(β̂1).

15 Here, we’re assuming a large sample. If we had


q a small sample, we would calculate the variance of X with a
(Xi ≠X)2
N

degrees of freedom correction such that it would be i=1


N ≠1
. Doing so would mean we would have N ≠ 1 instead
of N in the denominator of Equation 3.10, but it would not change the intuition that more data lowers the variance
for coefficient estimates.


[Figure 3.7 here: four scatterplots, panels (a) through (d), each plotting a dependent variable (vertical axis) against an independent variable (horizontal axis).]

FIGURE 3.7: Four Scatterplots

Discussion Questions

1. Will the variance of β̂1 be smaller in panel (a) or panel (b) of Figure 3.7? Why?
2. Will the variance of β̂1 be smaller in panel (c) or panel (d) of Figure 3.7? Why?


3.5 Probability Limits and Consistency

The fact that the variance of β̂1 shrinks as the sample size increases means that eventually the variance approaches zero. This section discusses the implications of this fact by introducing the statistical concepts of probability limit and consistency.

The probability limit of an estimator is the value to which the estimator converges as the sample size gets very large. Figure 3.8 illustrates the intuition behind probability limit by showing the probability density of β̂1 for hypothetical experiments in which the true value of β1 is zero. The flatter dark curve is the probability density for β̂1 for an experiment with N = 10 people. The most likely value of β̂1 is 0 because this is the place where the density is highest, but there's still a pretty good chance of observing a β̂1 near 1.0 and even a reasonable chance of observing a β̂1 near 4. For a sample size of 100, the variance shrinks, which means we're less likely to see β̂1 values near 4 compared to when the sample size was 10. For a sample size of 1,000, the variance shrinks even more, producing the tall thin distribution. Under this distribution, we're not only unlikely to see β̂1 near 4, we're also very unlikely to see β̂1 near 2.

If we were to keep plotting distributions for larger sample sizes, we would see them getting

taller and thinner. Eventually the distribution would converge to a vertical line at the true

value. If we had an infinite number of observations we would get the right answer every

time. That may be cold comfort if we’re stuck with a sad little data set of 37 observations,


[Figure 3.8 here: probability densities of β̂1 for sample sizes N = 10, N = 100, and N = 1,000; the density gets taller and narrower as the sample size grows.]

FIGURE 3.8: Distributions of β̂1 for different sample sizes

but it’s awesome when we have 100, 000 observations.

Consistency is an important property of OLS estimates. An estimator, such as OLS, is a consistent estimator if the distribution of β̂1 estimates shrinks to be closer and closer to the true value β1 as we get more data. If the exogeneity condition is true, then β̂1 is a consistent estimator of β1.16 Formally, we say

\text{plim} \; \hat{\beta}_1 = \beta_1    (3.11)

where plim is short for probability limit.

Consistency is quite intuitive. If we have only a couple of people in our sample, it is


16 There are some more technical conditions necessary for OLS to be consistent. For example, the values of the
independent variable have to be meaningful enough that the variance will actually get smaller as the sample increases.
If we simply add X values which always equal 0, this condition might not be satisfied.


unreasonable to expect OLS to provide a precise sense of the true value of β1. If we have a bajillion observations in our sample, our β̂1 estimate should be very close to the true value. Suppose, for example, that we wanted to assess the relationship between height and wages in a given classroom. If we base our estimate on information from only one student, we're not very likely to get an accurate estimate. If we ask 10 students, our answer will likely be closer to the true relationship in the classroom and if we ask 20 students we're even more likely to be closer to the true relationship.

Under some circumstances an OLS or other estimator will be inconsistent, meaning it will converge to some value other than the true value. Even though the details can get pretty technical, the probability limit of an estimator is often easier to work with than the expectation, and therefore statisticians routinely characterize problems in terms of probability limits that deviate from the true value. We will see an example of probability limits that go awry when we assess the influence of measurement error in Section 5.3.17

17 The two best things you can say about an estimator are that it is unbiased and consistent. OLS estimators have both of these properties when the error is uncorrelated with the independent variable. These properties seem pretty similar, but they are rather different. The differences are typically only relevant in advanced statistical work. For reference, though, we discuss in the appendix examples of estimators that are unbiased but not consistent, and vice versa.
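Consistency, too, is easy to see in a simulation. The R sketch below (simulated data with true β1 = 2) shows the spread of the slope estimates shrinking as the sample size grows:

set.seed(123)
sim_slope <- function(N) {
  x <- rnorm(N)
  y <- 2 * x + rnorm(N, sd = 5)
  coef(lm(y ~ x))[2]
}

# Standard deviation of the slope estimates for N = 10, 100, and 1,000
sapply(c(10, 100, 1000), function(N) sd(replicate(1000, sim_slope(N))))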


Remember This
1. The probability limit of an estimator is the value to which the estimator converges as the sample size gets very, very large.
2. When the error term and X are uncorrelated, OLS estimates of β are consistent, meaning that plim β̂ = β.

3.6 Solvable Problems: Heteroscedasticity and Correlated Errors

Equation 3.9 on page 95 accurately characterizes the variance of β̂1 when certain conditions about the error term are true. In this section, we explain those conditions. If these conditions do not hold, the calculation of the variance of β̂1 will be more involved, but the intuition we have introduced about σ̂², sample size, and variation in X will carry through. We discuss how to calculate var(β̂1) under these circumstances in this section and in Chapter 13.

Homoscedasticity

The first condition for Equation 3.9 to be appropriate is that the variance of εi is the same

for every observation. That is, once we have taken into account the effect of our measured

variable (X), the expected degree of uncertainty in the model is the same for all observations.

If this condition holds, the variance of the error term is the same for low values of X as for

high values of X. This condition gets a fancy name, homoscedasticity. “Homo” means same.

“Scedastic” (yes, that’s a word) means variance. Hence, errors are homoscedastic when


they all have the same variance.

When errors violate this condition, they are heteroscedastic, meaning that the variance of εi is different for at least some observations. That is, some observations are on average

closer to the predicted value than others. Imagine, for example, that we have data on

how much people weigh from two sources: some people weighed themselves with a state-of-

the-art scale and others had one of those guys at a state fair guess their weight. Definite

heteroscedasticity there, as the weight estimates on the scale would be very close to the

truth (small errors) and the weight estimates from the fair dude will be farther from the

truth (large errors).

Violating the homoscedasticity condition doesn't cause OLS β̂1 estimates to be biased. It

simply means we shouldn’t use Equation 3.9 to calculate the variance of —ˆ1 . Happily for us,

the intuitions we have discussed so far about what causes var(—ˆ1 ) to be big or small are not

nullified and there are relatively simple ways to implement procedures for this case. We show

how to generate these heteroscedasticity-consistent standard errors in Stata and R in

the Computing Corner of this chapter on pages 127 and 130. This approach to accounting

for heteroscedasticity does not affect the values of the —ˆ estimates.18


18 The equation for heteroscedasticity-consistent standard errors is ugly. If you must know, it is

\text{var}[\hat{\beta}_1] = \left( \frac{1}{\sum (X_i - \bar{X})^2} \right)^2 \sum (X_i - \bar{X})^2 \hat{\epsilon}_i^2    (3.12)

This is less intuitive than Equation 3.9 on page 95 so we do not emphasize it. As it turns out, we derive heteroscedasticity-consistent standard errors in the course of deriving the standard errors that assume homoscedasticity (see page 720). Heteroscedasticity-consistent standard errors are also referred to as robust standard errors (because they are robust to heteroscedasticity) or as Huber-White standard errors. Another approach is to use "weighted least squares" to deal with heteroscedasticity. This approach is more statistically efficient, meaning the variance of the estimate will theoretically be lower. The technique produces β̂1 estimates that differ from the OLS β̂1 estimates. We point out references with more details on weighted least squares in the Further Reading section at the end of this chapter.
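For reference, one common way to get heteroscedasticity-consistent standard errors in R is through the sandwich and lmtest packages; the sketch below uses simulated data and may differ from the exact commands shown in the Computing Corner. Note that only the standard errors change, not the coefficient estimates.

# install.packages(c("sandwich", "lmtest"))   # if needed
library(sandwich)
library(lmtest)

set.seed(123)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200, sd = 1 + abs(x))  # error variance grows with |x|: heteroscedastic

fit <- lm(y ~ x)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))  # same beta-hats, robust (Huber-White) standard errors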


Errors uncorrelated with each other

The second condition for Equation 3.9 to provide an appropriate estimate of the variance of

β̂1 is that the errors are not correlated with each other. If errors are correlated with each

other, knowing the value of the error for one observation provides information about the

value of the error for another observation.

There are two fairly common situations in which errors are correlated. The first is when

we have clustered errors. Suppose, for example, we’re looking at test scores of all 8th graders

in California. It is possible that the unmeasured factors in the error term cluster by school.

Maybe one school attracts science nerds and another attracts jocks. If such patterns exist,

then knowing the error term for a kid in a school gives some information about error terms

of other kids in the same school, which means errors are correlated. In this case, the school

is the “cluster” and errors are correlated within the cluster. Equation 3.9 is inappropriate

when errors are correlated.

This sounds worrisome. It is, but not terribly so. As with heteroscedasticity, violating the condition that errors are not correlated doesn't cause OLS β̂1 estimates to be biased. It only renders Equation 3.9 inappropriate. So what should we do if errors are correlated? Get a better equation for the variance of β̂1! It's a bit more complicated than that, but


the upshot is that it is possible to derive the variance of β̂1 when errors are correlated within a cluster. We simply note the issue here and use the computational procedures discussed in the Computing Corner to deal with clustered standard errors.
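For reference, one way to get cluster-robust standard errors in R is the sandwich package's vcovCL() function; the sketch below uses simulated school-clustered data and may differ from the exact commands in the Computing Corner.

library(sandwich)
library(lmtest)

set.seed(123)
school       <- rep(1:50, each = 20)       # 50 schools with 20 students each
school_shock <- rnorm(50)[school]          # error component shared within each school
x <- rnorm(1000)
y <- 1 + 2 * x + school_shock + rnorm(1000)

fit <- lm(y ~ x)
coeftest(fit, vcov = vcovCL(fit, cluster = school))  # standard errors clustered by school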

Correlated errors are also common in time series data. Time series data is data on a

specific unit over time. Examples include U.S. growth rates since 1945 or data on annual

attendance at New York Yankee games since 1913. Errors in time series data are frequently

correlated in a pattern we call autocorrelation. Autocorrelation occurs when the error in

one time period is correlated with the error in the previous time period.

One way that correlated errors can occur in time series is that an unmeasured variable

in the error term may be sticky such that a high value in one year implies a high value in

the next year. Suppose, for example, we are modeling annual U.S. economic growth since

1945 and we lack a variable for technological innovation (which is very hard to measure). If

technological innovation was in the error term boosting the economy in one year, it probably

had some boosting to do in the error term the next year. Similar autocorrelation is likely

in many time series data sets, ranging from average temperature in Tampa over time to

monthly Frisbee sales in Frankfurt.

As with the other issues raised in this section, autocorrelation does not cause bias. Auto-

correlation only renders Equation 3.9 inappropriate. Chapter 13 discusses how to generate

appropriate estimates of the variance of β̂1 when there is autocorrelation.

It is important to keep these conditions in perspective. Unlike the exogeneity condition


(that X and the errors are uncorrelated), we do not need the homoscedasticity and

uncorrelated-errors conditions for unbiased estimates. When these conditions fail, we simply

need to do some additional steps to get back to a correct equation for the variance of β̂1.

The fact that violations of these conditions get fancy statistical labels like “heteroscedastic-

ity” and “autocorrelation” can make them seem especially important. They are not. The

exogeneity condition matters much more.

Remember This
1. The standard equation for the variance of β̂1 (Equation 3.9 on page 95) requires
errors to be homoscedastic and uncorrelated with each other.
• Errors are homoscedastic if their variance is constant. When errors are het-
eroscedastic, the variance of errors is different across observations.
• Correlated errors commonly occur in clustered data where the error for one
observation is correlated with the error of another observation from the same
cluster (such as a school).
• Correlated errors are also common in time series data where errors are au-
tocorrelated, meaning the error in one period is correlated with the error in
the previous period.
2. Violating the homoscedasticity or uncorrelated-errors conditions does not bias
OLS coefficients.

3.7 Goodness of Fit

Goodness of fit is a statistical concept that refers to how well a model fits the data. If

a model fits well, knowing X gives us a pretty good idea of what Y will be. If the model

fits poorly, knowing X doesn’t give as good an idea of what Y will be. In this section we


present three ways to characterize the goodness of fit. We should not worry too much about

goodness of fit, however, as we can have useful, interesting results from models with poor fit

and biased, useless results from models with great fit.

Standard error of the regression (σ̂)

We’ve already seen one goodness of fit measure, the variance of the regression (denoted as

ˆ 2 ). One limitation with this measure is that the scale is not intuitive. For example, if

our dependent variable is salary, the variance of the regression will be measured in dollars

squared (which is odd).

Therefore the standard error of the regression is commonly used as a measure of

goodness of fit. It is simply the square root of the variance of the regression and is denoted

as σ̂. It corresponds, roughly, to the average distance of observations from fitted values. The

scale of this measure will be the same units as the dependent variable, making it much easier

to relate to.

The trickiest thing about the standard error of the regression may be that it goes by so

many different names. Stata refers to σ̂ as the root mean squared error (or root MSE for

short); root refers to the square root and MSE refers to mean squared error, which is how

we calculate σ̂² (which is, simply, the mean of the squared residuals). R refers to it as the

residual standard error because it is the estimated standard error for the errors in the model

based on the residuals.
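As a small illustration of where these labels show up, the following R sketch uses made-up data (standing in only loosely for the donut example from the Computing Corner) to pull out the residual standard error and confirm it matches a by-hand calculation from the residuals.

# Minimal sketch with made-up data: extracting the standard error of the regression
set.seed(1)
donuts <- runif(13, 0, 20)
weight <- 123 + 9 * donuts + rnorm(13, sd = 45)

OLSResults <- lm(weight ~ donuts)
summary(OLSResults)$sigma                # what R calls the residual standard error
sqrt(sum(residuals(OLSResults)^2) /
     df.residual(OLSResults))            # same number, computed from the residuals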


Plot of the data

Another way to assess goodness of fit is to plot the data and see for ourselves how close the

observations are to the fitted line. Plotting also allows us to see outliers or other surprises in

the data. Assessing goodness of fit based on looking at a plot is pretty subjective, though,

and hard to communicate to others.

R2

Finally, a very common measure of goodness of fit is R2 . The name comes from the fact that

it is a measure of the squared correlation of the fitted values and actual values.19 Correlation

is often indicated with an “r”, so R2 is simply the square of this value. (Why one is lower

case and the other is upper case is one of life’s little mysteries.)

If the model explains the data well, the fitted values will be highly correlated with the

actual values and R2 will be high. If the model does not explain the data well, the fitted

values will not correlate very highly with the actual values and R2 will be near zero. Possible

values of R2 range from 0 to 1.20

R2 can help us understand how well our model predicts the dependent

variable, but the measure may be less useful than it seems. A high R2 is neither necessary

nor sufficient for an analysis to be useful. A high R2 means the predicted values are close to
19 This interpretation works only if an intercept is included in the model, which it usually is.
20 The value of R2 also represents the ratio of the variance of the fitted values to the variance of the actual values
of Y. It is therefore also referred to as a measure of the proportion of the variance explained.


the actual values. It says nothing more. We can have a model loaded with endogeneity that

generates a high R2 . The high R2 in this case means nothing; the model is junk, the high R2

notwithstanding. And to make matters worse, some people have the intuition that a good

fit is necessary for believing regression results. This intuition isn’t correct, either. There is

no minimum value we need for a good regression. In fact, it is very common for experiments

(the gold standard of statistical analyses) to have low R2 s. There can be all kinds of reasons

for low R2 – the world could be messy such that σ² is high, for example – but the model

could nonetheless yield valuable insight.
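A quick R sketch (again with simulated data rather than the book's) confirms the definition above: the R2 that the regression reports is exactly the squared correlation between the fitted and actual values of the dependent variable. (The equality relies on the model including an intercept, as footnote 19 notes.)

# Minimal sketch: R-squared equals the squared correlation of fitted and actual values
set.seed(2)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100, sd = 3)

fit <- lm(y ~ x)
summary(fit)$r.squared        # R-squared reported by the regression
cor(fitted(fit), y)^2         # squared correlation of fitted and actual values: same number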

Figure 3.9 shows various goodness of fit measures for OLS estimates of two different

hypothetical data sets of salary at age 30 (measured in thousands of dollars) and years of

education. In panel (a), the observations are pretty closely clustered around the regression

line. That’s a good fit. The variance of the regression is 91.62; it’s not really clear what to

make of that, until we look at its square root, σ̂ (also known as the standard error of the

regression, among other terms), which is 9.57. Roughly speaking, this value of the standard

error of the regression means that the observations are on average within 9.57 units of their

fitted values, meaning that on average the fitted values are within $9,570 of actual salary.21

The R2 is 0.89. That’s pretty high. Is that value high enough? We can’t answer that

question, because it is not a sensible question for R2 values.

In panel (b) of Figure 3.9, the observations are more widely dispersed. Not as good a fit.
21 We say “roughly speaking” because this value is actually the square root of the average of the squared residuals.
The intuition for that value is the same, but it’s quite a mouthful.


[FIGURE 3.9: Plots with Different Goodness of Fit. Two scatterplots of salary (in $1,000) against years
of education with fitted regression lines. Panel (a): σ̂² = 91.62, σ̂ = 9.57, R2 = 0.89. Panel (b):
σ̂² = 444.2, σ̂ = 21.1, R2 = 0.6.]

The variance of the regression is 444.2. As with panel (a), it’s not really clear what to make

of that variance, until we look at its square root, σ̂, which is 21.1. This value means that

the observations are on average within $21,100 of actual salary. The R2 is 0.6. Is that good

enough? Silly question.


Remember This
There are four ways to assess goodness of fit.
1. The variance of the regression (σ̂²). This value is used in the equation for var(β̂₁).
It is hard to interpret directly.
2. The standard error of the regression (σ̂). It is measured on the same scale as
the dependent variable and roughly corresponds to the average distance between
fitted values and actual values.
3. Scatterplots can be quite informative about not only goodness of fit but also
possible anomalies and outliers.
4. R2 is a widely used measure of goodness of fit.
• It is the square of the correlation between the fitted and observed values of
the dependent variable.
• R2 ranges from 0 to 1.
• A high R2 is neither necessary nor sufficient for an analysis to be useful.


Case Study: Height and Wages

You may have heard that tall people get paid

more – and not just in the NBA. If true, that

makes us worry about what exactly our economy

and society are rewarding.

Persico, Postlewaite and Silverman (2004)

tested this idea by analyzing data on height and wages from a nationally representative

sample. Much of their analysis used multivariate techniques discussed in Chapter 5, but

we’ll use bivariate OLS to start thinking about the issue. They limited their data set to

white males in order to avoid potentially important (and unfair) influences of race and gen-

der on wages. (We look at other groups in the homework for Chapter 5.)

Figure 3.10 shows the data. On the X-axis is the adult height of each guy and on the

Y-axis is his wage in 1996. The relationship is messy, but that’s not unusual. Data is at

least as messy as life.22


22 The data is adjusted in two ways for the figure. First, we jitter the data. The problem is that many observations
have the same value of X and Y such that they overlap perfectly. For a plot with such data, we won’t be able to tell
if a given circle reflects a single observation or many observations. The trick to jittering is to add a small random
number to the height, meaning that each observation will be at a slightly different point. If there are only two
observations with the same specific combination of X and Y values, the jittered data will show two circles, probably
overlapping a bit. If there are many observations with some specific combination of X and Y values, the jittered data
will show many circles, overlapping a bit, but creating a cloud of data that indicates lots of data near that point.
We don’t use jittered data in the OLS (or other statistical analysis); we use jittered data only when plotting data.
Second, six outliers who made a ton of money ($750 per hour for one of them!) are excluded. If they are included,
the scatterplot would be so tall that most observations get scrunched up at the bottom.


[FIGURE 3.10: Height and Wages. Scatterplot of hourly wages (in $) against height in inches.]


The figure includes a fitted regression line based on the following regression model:

Wageᵢ = β₀ + β₁Adult heightᵢ + εᵢ

The results reported in Table 3.2 look pretty much like the results that any statistical

software will burp out. The estimated coefficient on adult height (—ˆ1 ) is 0.412. The standard

error estimate will vary depending on whether we assume errors are homoscedastic or not.

The column on the left shows that if we assume homoscedasticity (and therefore use Equation

3.9), the estimated standard error of —ˆ1 is 0.0976. The column on the right shows that

if we allow for heteroscedasticity, the estimated standard error for —ˆ1 is 0.0953. It’s not

much of a difference in this case, but the two approaches to estimating standard errors can

differ more substantially for other examples. The estimated constant (—ˆ0 ) is -13.093 with

estimated standard error estimates of 6.897 and 6.681, depending on whether or not we use

heteroscedasticity-consistent standard errors.

Notice that the β̂₀ and β̂₁ coefficients are identical across the columns, as the heteroscedasticity-

consistent standard error estimation has no effect on the coefficients.
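As a hedged sketch of how the two columns of Table 3.2 could be produced, the code below uses the same library(AER) and coeftest approach described in this chapter's Computing Corner. The data here are simulated stand-ins, not the actual height-wage sample.

# Sketch: identical OLS coefficients, two ways of estimating the standard errors
library(AER)    # provides coeftest() and vcovHC()

set.seed(3)
height <- rnorm(500, mean = 70, sd = 3)            # placeholder data, not the real sample
wage   <- -13 + 0.4 * height + rnorm(500, sd = 12)

fit <- lm(wage ~ height)
summary(fit)                                      # standard errors assuming homoscedasticity
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))   # heteroscedasticity-consistent standard errors
# The coefficient estimates are the same in both outputs; only the standard errors differ.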

What, exactly, do these numbers mean? First, let’s interpret the slope coefficient, —ˆ1 . A

coefficient of 0.41 on height implies that a one inch increase in height is associated with a

41 cent increase in wages per hour. That’s a lot!23


23To put that estimate in perspective, we can calculate how much being one inch taller is worth per year for someone
who works 40 hours a week for 50 weeks per year: it is 0.41 × 1 × 40 × 50 = $820 per year. Being three inches taller
is associated with earning 0.41 × 3 × 40 × 50 = $2,460 more per year. Being tall has its costs, though: tall people
live shorter lives (Palmer 2013).


Table 3.2: Effect of Height on Wages

                      Assuming               Allowing
 Variable             homoscedasticity       heteroscedasticity
 Adult height         0.412                  0.412
                      (0.0976)               (0.0953)
 Constant             -13.093                -13.093
                      (6.897)                (6.691)
 N                    1,910                  1,910
 σ̂²                   142.4                  142.4
 σ̂                    11.93                  11.93
 R2                   0.009                  0.009
 Standard errors in parentheses

The interpretation of the constant, —ˆ0 , is that someone who is zero inches tall would get

negative $13.09 an hour. Hmm. Not the most helpful piece of information. What's

going on is that most observations of height (the X variable) are far from zero (they are

mostly between 60 and 75 inches). To get the regression line to go through this data it needs

to cross the Y axis at -13.09 for people who are zero inches tall. This example explains why

we don’t spend a lot of time on —ˆ0 . It’s hard to imagine what kind of sicko would want to

know – or believe – the extrapolation of our results to such little tiny people.

If we don't care about β̂₀, why do we have it in the model? It plays a very important

role. Remember that we’re fitting a line and the value of —ˆ0 pins down where the line starts

when X is zero. If we do not estimate the parameter, that’s the same as setting —ˆ0 to be

zero (because the fitted value would be Ŷi = —ˆ1 Xi , which is zero when Xi = 0). Forcing —ˆ0

to be zero will typically lead to a much worse model fit than letting the data tell us where


the line should cross the Y-axis when X is zero.
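To see the role of the intercept concretely, here is a small R sketch with simulated education and salary data (the numbers are chosen only for illustration); adding "- 1" to an R formula forces β̂₀ to be zero.

# Sketch: what forcing the intercept to zero does (simulated education/salary data)
set.seed(4)
education <- runif(200, 0, 16)                          # years of education
salary    <- 20 + 3 * education + rnorm(200, sd = 10)   # salary in $1,000 (made-up model)

with_intercept    <- lm(salary ~ education)        # beta0-hat estimated from the data
without_intercept <- lm(salary ~ education - 1)    # beta0-hat forced to be zero

coef(with_intercept)                 # slope near the true value of 3
coef(without_intercept)              # slope pulled upward to compensate for the missing intercept
summary(with_intercept)$sigma        # standard error of the regression
summary(without_intercept)$sigma     # noticeably larger: a worse fit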

The results are not only about the estimated coefficients. They also include standard

errors. These are quite important as they give us a sense of how accurate our estimates are.

The standard error estimates come from the data and tell us how wide the distribution of

—ˆ1 is. If the standard error of —ˆ1 is huge, then we should not have much confidence that our

—ˆ1 is necessarily close to the true value. If the standard error of —ˆ1 is small, then we should

have more confidence that our —ˆ1 is close to the true value.

Are these results the final word on the relationship between height and wages? (Hint:

NO!) Like most observational data, a bivariate analysis may not be sufficient. We should

worry about endogeneity. In other words, there could be elements in the error term (factors

that influence wages but have not been included in the model) that could be correlated with

adult height and, if so, then the result that height causes wages to go up may be incorrect.

Can you think of anything in the error term that is correlated with height? We come back

to this question in Chapter 5 on page 199, where we revisit this data set.

Table 3.2 also shows several goodness of fit measures. The σ̂² is 142.4; this number is

pretty hard to get our heads around. Much more useful is the standard error of the regression,

σ̂, which is 11.9, meaning roughly that the average distance between fitted and actual wages

is almost $12 per hour. In other words, the fitted values really aren’t particularly accurate.

The R2 is 0.01. This value is low, but as we said earlier, there is no set standard for R2 .

One reasonable concern might be that we should be wary of the OLS results because the


model fit seems pretty poor. That’s not how it works. The coefficients give us the best

estimates given the data. The standard errors of the coefficients incorporate the poor fit (via

the σ̂²). So, yes, the poor fit matters, but it's something that is incorporated into the OLS

estimation process.

3.8 Outliers

One practical concern we have in statistics is dealing with outliers, observations that are

extremely different from the rest of the sample. The concern is that a single goofy observation

can skew the analysis.

We saw on page 53 that Washington, DC is quite an outlier in a plot of crime data for

the U.S. states and DC. Figure 3.11 shows a scatterplot of violent crime and percent urban.

Imagine drawing an OLS line by hand when Washington, DC is included. Then imagine

drawing an OLS line by hand when Washington, DC is excluded. The line with Washington

DC will be steeper as it will need to get close to the Washington, DC observation; the

line without Washington, DC will be flatter because it can stay in the mass of the data

without worrying about Washington, DC. Hence a reasonable person may worry that the

Washington, DC data point could substantially influence the estimate. On the other hand, if

we were to remove an observation in the middle of the mass of the data, such as Oklahoma,

the estimated line would move little.


[FIGURE 3.11: Scatterplot of Violent Crime and Percent Urban. Violent crime rate (per 100,000 people)
plotted against percent urban for the states and DC; the Washington, DC observation sits far above the
rest of the data.]


We can see the effect of including and excluding DC in Table 3.3 which shows bivariate

OLS results in which violent crime rate is the dependent variable. In the first column, percent

urban is the independent variable and all states plus DC are included (therefore the N is 51).

The coefficient is 5.61 with a standard error of 1.8. The results in the second column and

are based on data without Washington, DC (dropping the N to 50). The coefficient is quite

a bit smaller, coming in at 3.58, which is consistent with our intuition from our imaginary

line drawing.

The table also shows bivariate OLS coefficients for a model with single-parent percent as

the independent variable. The coefficient when including DC is 23.17. When we exclude

DC, the estimated relationship weakens to 16.91. We see a similar pattern with crime and

poverty percent in the last two columns.


Table 3.3: OLS Models of Crime in U.S. States

                  With DC    Without DC   With DC    Without DC   With DC    Without DC
 Urban            5.61       3.58
                  (1.80)     (1.47)
 Single parent                            23.17      16.91
                                          (3.03)     (3.55)
 Poverty                                                           23.13      14.73
                                                                   (8.85)     (7.06)
 Constant         -8.37      124.67       -362.74    -164.57       86.12      184.94
                  (135.57)   (109.56)     (102.58)   (117.59)      (125.55)   (99.55)
 N                51         50           51         50            51         50
 R2               0.17       0.11         0.54       0.32          0.12       0.08
 Standard errors in parentheses

Figure 3.12 shows scatterplots of the data with the fitted lines included. The fitted lines
based on all data are the solid lines and the fitted lines when DC is excluded are the dashed
lines. In every case, the fitted lines including DC are steeper than the fitted lines when DC
is excluded.

[FIGURE 3.12: Scatterplots of Crime Against Percent Urban, Single Parent, and Poverty with OLS Fitted
Lines. Each panel plots the violent crime rate (per 100,000 people) against one of the three independent
variables, with solid fitted lines based on all observations and dashed fitted lines excluding DC.]

So what are we to conclude here? Which results are correct? There may be no clear

answer. The important thing is to appreciate that the results in these cases depend on a

single observation. In such cases, we need to let the world know. We should show results

with and without the outlying observation and justify substantively why an observation

might merit exclusion. In the case of the crime data, for example, we could exclude DC on

the grounds that it is not (yet!) a state.


Outlier observations are more likely to influence OLS results when there is only a small

number of observations. Given that OLS will minimize the sum of squared residuals from

the fitted line, a single observation is more likely to play a big role when there are only a few

residuals to be summed. When data sets are very large, a single observation is less likely to

move the fitted line substantially.

An excellent way to identify potentially influential observations is to plot the data and

look for unusual observations. If an observation looks out-of-whack, it’s a good idea to

run the analysis without it to see if the results change. If they do, we need to explain the

situation to readers and justify including or excluding the outlier.24

24 Most statistical packages can automatically assess the influence of each observation. For a sample size N , these
commands essentially run N separate OLS models, each one excluding a different observation. For each of these N
regressions, the command stores a value indicating how much the coefficient changes when that particular observation
is excluded. The resulting output reflects how much the coefficients change with the deletion of each observation.
In Stata, the command is dfbeta, where the "df" refers to difference and "beta" refers to β̂. In other words, the
command will tell us for each observation the difference in estimated β̂s when that observation is deleted. In R, the
command is also called dfbeta.
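A minimal R sketch of both checks described above, using simulated crime-style data with one manufactured outlier rather than the actual state data set: re-run the regression without the suspect observation and let dfbeta() report how much each observation moves the coefficients.

# Sketch: does a single observation drive the OLS slope?
set.seed(5)
dta <- data.frame(urban = runif(50, 40, 95))
dta$crime <- 100 + 3 * dta$urban + rnorm(50, sd = 80)
dta <- rbind(dta, data.frame(urban = 100, crime = 1300))    # add one DC-like outlier

fit_all     <- lm(crime ~ urban, data = dta)
fit_dropped <- lm(crime ~ urban, data = dta[-nrow(dta), ])  # same model without the outlier
coef(fit_all)["urban"]        # slope with the outlier included
coef(fit_dropped)["urban"]    # noticeably flatter slope without it

# dfbeta() reports, for each observation, how the coefficients change when
# that observation is deleted; the outlier stands out in the slope column (column 2)
influence <- dfbeta(fit_all)
influence[which.max(abs(influence[, 2])), ]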


Remember This
Outliers are observations that are very different from other observations.
1. When sample sizes are small, a single outlier can exert considerable influence on
OLS coefficient estimates.
2. Scatterplots are useful to identify outliers.
3. When a single observation substantially influences coefficient estimates, we should
• Inform readers of the issue.
• Report results with and without the influential observation.
• Justify including or excluding the observation.

3.9 Conclusion

Ordinary least squares is an odd name that refers to the way in which the —ˆ estimates are

produced. That’s fine to know, but the real key to understanding OLS is appreciating the

properties of the estimates produced.

The most important property of OLS estimates is that they are unbiased if X is uncor-

related with the error. We’ve all heard “correlation does not imply causation.” “Regression

does not imply causation” is every bit as true. If there is endogeneity we may observe a big

regression coefficient even in the absence of causation.

OLS estimates have many other useful properties. With a large sample size, —ˆ1 is a

normally distributed random variable. The variance of —ˆ1 reflects the width of the —ˆ1 distri-

bution and is determined by the fit of the model (the better the fit, the thinner), the sample


size (the more data, the thinner) and the variance of X (the more variance, the thinner). If

the errors satisfy the homoscedasticity and no-correlation conditions, the variance of —ˆ1 is

defined by Equation 3.9 on page 95. If the errors are heteroscedastic or correlated with each

other, OLS still produces unbiased coefficients but we will need other tools covered here and

in Chapter 13 to get appropriate standard errors for our —ˆ1 estimates.

We’ll have mastered bivariate OLS when we can

• Section 3.1: Write out the bivariate regression equation and explain all its elements

(dependent variable, independent variable, slope, intercept, error term). Draw a hypo-

thetical scatterplot with a small number of observations and show how bivariate OLS

is estimated, identifying residuals, fitted values, and what it means to be a best-fit line.

Sketch an appropriate best-fit line and identify —ˆ0 and —ˆ1 on the sketch. Write out the

equation for —ˆ1 and explain the intuition in it.

• Section 3.2: Explain why —ˆ1 is a random variable and sketch its distribution. Explain

two ways to think about randomness in coefficient estimates.

• Section 3.3: Explain what it means for the OLS estimate —ˆ1 to be an unbiased estimator.

Explain the exogeneity condition and why it is so important.

• Section 3.4: Write out the standard equation for the variance of —ˆ1 in bivariate OLS

and explain three factors that affect this variance.

• Section 3.5: Define probability limit and consistency.


• Section 3.6: Identify the conditions required for the standard variance equation of —ˆ1

equation to be accurate. Explain why these two conditions are less important than the

exogeneity condition.

• Section 3.7: Explain four ways to assess goodness of fit. Explain why R2 alone does not

measure whether or not a regression was successful.

• Section 3.8: Explain what outliers are, how they can affect results, and what to do

about them.

Further Reading

Beck (2010) provides an excellent discussion of what to report from a regression analysis.

Weighted least squares is a type of generalized least squares that can be used when dealing

with heteroscedastic data. Chapter 8 of Kennedy (2008) discusses weighted least squares and

other issues when dealing with errors that are heteroscedastic or correlated with each other.

These issues are often referred to as violations of a “spherical errors” condition. The term

spherical errors is pretentious statistical jargon that means errors are both homoscedastic

and not correlated with each other.

Murray (2006, 500) provides a good discussion of probability limits and consistency for

OLS estimates.

We discuss what to do with autocorrelated errors in Chapter 13. The Further Reading


section at the end of that chapter provides links to the very large literature on time series

data analysis.

Key Terms
• Autocorrelation (105)
• Bias (87)
• Central Limit Theorem (84)
• Consistency (100)
• Continuous variable (81)
• Degrees of freedom (151)
• Distribution (80)
• Fitted value (70)
• Heteroscedastic (103)
• Heteroscedasticity-consistent standard errors (103)
• Homoscedastic (103)
• Modeled randomness (79)
• Normal distribution (83)
• Outlier (117)
• Predicted values (70)
• Probability density (83)
• Probability distribution (81)
• Probability limit (99)
• Random variable (80)
• Regression line (70)


• Residual (70)
• Sampling randomness (79)
• Standard error (93)
• Standard error of the regression (107)
• Time series data (105)
• Unbiased estimator (87)
• Variance (93)
• Variance of the regression (95)

Computing Corner

Stata
1. Using the donut and weight data described in Chapter 1 on page 57, estimate a bi-
variate OLS regression by typing reg weight donuts. The command “reg” stands for
“regression.” The general format is reg Y X for a dependent variable Y and independent
variable X.
Stata’s regression output looks like this:
Source | SS df MS Number of obs = 13
-------------+------------------------------ F( 1, 11) = 22.48
Model | 46731.7593 1 46731.7593 Prob > F = 0.0006
Residual | 22863.933 11 2078.53936 R-squared = 0.6715
-------------+------------------------------ Adj R-squared = 0.6416
Total | 69595.6923 12 5799.64103 Root MSE = 45.591

------------------------------------------------------------------------------
weight | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
donuts | 9.103799 1.919976 4.74 0.001 4.877961 13.32964
_cons | 122.6156 16.36114 7.49 0.000 86.60499 158.6262
------------------------------------------------------------------------------

There is a lot of information here, not all of which is useful. The vital information is
in the bottom table that shows that —ˆ1 is 9.10 with a standard error of 1.92 and —ˆ0 is
122.62 with a standard error of 16.36. We cover t, P>|t| and 95% confidence intervals
in Chapter 4.


The column on the upper right has some useful information too, indicating the number
of observations, R2 and Root MSE. (As we noted in the chapter, Stata refers to the
standard error of the regression, σ̂, as root MSE, which is Stata's shorthand for the
square root of the mean squared error.) We discuss the adjusted R2 in Chapter 5. The F
and Prob > F on the right side of the output relate to information that we cover on page
343; it’s generally not particularly useful.
The table in the upper left is pretty useless. Contemporary researchers seldom use the
information in the Source, SS, df, and MS columns.
2. In Stata, commands often have subcommands that are invoked after a comma. To
estimate the model with heteroscedasticity-consistent standard errors (as discussed on
page 103) simply add the , robust subcommand to Stata’s regression command. For
example: reg weight donuts, robust.
3. To generate predicted values, type predict YourNameHere after running an OLS model.
This command will create a new variable named “YourNameHere.” In our example, we
name the variable Fitted: predict Fitted. A variable containing the residuals is cre-
ated by adding a “, residuals” subcommand to the predict command: predict
Residuals, residuals.

We can display the actual values, fitted values, and residuals with the list command:
list weight Fitted Residuals.

| weight Fitted Residua |


|-------------------------------|
1. | 275 250.0688 24.93121 |
2. | 141 122.6156 18.38439 |
3. | 70 122.6156 -52.61561 |
...

4. In Chapter 2 we plotted simple scatterplots. To produce more elaborate plots work


with Stata’s twoway command (yes, it’s an odd command name). For example, to
add a regression line to a scatterplot, use twoway (scatter weight donuts) (lfit
weight donuts). The lfit command name stands for linear fit.25
5. To exclude an observation from a regression, use the if subcommand. The syntax “!=”
means “not equal.” For example, to run a regression on data that excludes observations
25 We jittered the data in Figure 3.10 to make it a bit easier to see more data points. Stata’s jitter subcommand
jitters data (e.g., scatter weight donuts, jitter(3)). The bigger the number in parentheses, the more the data
will be jittered.


for which name is not Homer, run reg weight donuts if name !="Homer". In this
example, we use quotes because the name variable is a string variable, meaning it is not
a number. If we want to include only observations where weight is greater than 100 we
can type reg weight donuts if weight > 100.

R
1. The following commands use the donut data from Chapter 1 on page 59. R is an
object-oriented language, which means that our regression commands create objects
containing information which we ask R to display as needed. To estimate an OLS
regression, we create an object called OLSResults (we could choose a different name)
by typing OLSResults = lm(weight ~ donuts). This command stores information
about the regression results in an object called OLSResults. The lm command stands
for “linear model” and is the R command for OLS. The general format is lm(Y ~ X)
for a dependent variable Y and independent variable X. To display these regression
results, type summary(OLSResults), which produces
lm(formula = weight ˜ donuts)
Residuals:
Min 1Q Median 3Q Max
-93.135 -9.479 0.757 35.108 55.073
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.616 16.361 7.494 0.0000121
donuts 9.104 1.920 4.742 0.000608
Residual standard error: 45.59 on 11 degrees of freedom
Multiple R-squared: 0.6715, Adjusted R-squared: 0.6416
F-statistic: 22.48 on 1 and 11 DF, p-value: 0.0006078
The vital information is in the bottom table that shows that —ˆ1 is 9.104 with a standard
error of 1.920 and —ˆ0 is 122.616 with a standard error of 16.361. We cover t value and
Pr(>|t|) in Chapter 4.
R refers to the standard error of the regression (σ̂) as the residual standard error and
lists it below the regression results. Next to that is the degrees of freedom. To calculate
the number of observations in the data set analyzed, recall that degrees of freedom
equals N − k. Since we know k (the number of estimated coefficients) is 2 for this
model, we can infer the sample size is 13. (Yes, this is probably more work than it
should be to display sample size.)
The multiple R2 (which is just the R2 ) is below the residual standard error. We dis-
cuss the adjusted R2 later on page 232. At the bottom is an F-statistic and related


information; this refers to a test we cover on page 343. It’s usually not a center of
attention.
The information on residuals at the top is pretty useless. Contemporary researchers
seldom use that information.
2. The regression object created by R contains lots of other information as well. The
information can be listed by typing the object name, a dollar sign, and the appropriate
syntax. For example, the fitted values for a regression model are stored in the format of
Object$fitted.values. In our case, they are OLSResults$fitted.values. For more
details, type help(lm) in R and look for the list of components associated with “objects
of class lm,” which is R’s way of referring to the regression results like we created above.
To see the fitted values, type OLSResults$fitted.values, which produces
1 2 3 4 5 6 ...
250.0688 122.6156 122.6156 168.1346 309.2435 129.4435 . . .
To see the residuals, type OLSResults$residuals, which produces
1 2 3 4 5 6 ...
24.9312070 18.3843881 -52.6156119 -93.1346052 0.7565158 -49.4434609 . . .
3. To create a scatterplot with a regression line included we can type26
plot(donuts, weight)
abline(OLSResults)
4. One way to exclude an observation from a regression is to use brackets to limit the vari-
able to only those observations for which the condition in the brackets is true; to indicate
a “not equal” condition use “!=”. In other words, weight[name != "Homer"] refers to
values of the weight variable for which the name variable is not equal to “Homer.”
To run a regression on data that excludes observations for which name is Homer, run
OLSResultsNoHomer = lm(weight[name != "Homer"] ~ donuts[name != "Homer"]).
Here we use quotes because the name variable is a string variable, meaning it is not a
number.27 If we want to include only observations where weight is greater than 100 we
can type OLSResultsNoLow = lm(weight[weight>100] ~ donuts[weight>100]).
5. There are a number of ways to estimate the model with heteroscedasticity-consistent
standard errors (as discussed on page 103). The easiest may be to use an R package,
26 Figure 3.10 jittered the data to make it a bit easier to see more data points. To jitter data in a R plot, type
plot(jitter(donuts), jitter(weight)).
27 There are more efficient ways to exclude data when using data frames. For example, if the variables are all

included in a data frame called dta, we could type OLSResultsNoHomer = lm(weight ~ donuts, data = dta[name
!= "Homer", ]).


which is a set of R commands that we install for specific tasks. For heteroscedasticity-
consistent standard errors, the AER package is useful. There are two steps to using
this package.
(a) We need to install the package. Type install.packages("AER"). R will ask us
to pick a location – this is the source where we get the package. It doesn’t matter
where we pick. We can also install a package manually from the packages command
in the toolbar. We need to do the installation only once on each computer. The
package will be saved and available for use by R.
(b) Every time we open R and want to use the commands in the AER (or other)
package, we need to tell R to load the package. We do this with the library
command. We have to use the library command in every session we use a package.
Assuming the AER package has been installed, we can run OLS with heteroscedasticity-
consistent standard errors via the following code:
library(AER)
OLSResults = lm(weight ~ donuts)
coeftest(OLSResults, vcov = vcovHC(OLSResults, type = "HC1"))
The last line is elaborate. The command coeftest is asking for information on the
variance of the estimates (among other things) and the vcov=vcovHC part of the com-
mand is asking for heteroscedasticity-consistent standard errors. There are multiple
ways to estimate such standard errors and the HC1 asks for the most commonly used
form of these standard errors.28

Exercises
1. Use the data in PresVote.dta to answer the following questions about the relationship
between changes in real disposable income and presidential election results. Table 3.4
describes the variables.
a. Create a scatter plot like Figure 3.1.
b. Estimate an OLS regression in which the vote share of the incumbent party is re-
gressed on change in real disposable income. Report the estimated regression equa-
tion and interpret the coefficients.
c. What is the fitted value for 1996? For 1972?
28 The vcov terminology is short for variance-covariance and the vcovHC terminology is short for heteroscedasticity-
consistent standard errors.


Table 3.4: Variables for Questions on Presidential Elections and the Economy

Variable name Description


year Year of election
rdi4 Change in real disposable income since previous election
vote Percent of two-party vote received by the incumbent president’s party
demcand Name of the Democratic candidate
repcand Name of the Republican candidate
reelection Equals 1 if incumbent is running for reelection and 0 if not

d. What is the residual for 1996? For 1972?


e. Estimate an OLS regression only on years in which the variable Reelection equals 1
- that is, years in which an incumbent president is running for re-election. Interpret
the coefficients.
f. Estimate an OLS regression only on years in which the variable Reelection equals
0 - that is, years in which an incumbent president is not running for re-election.
Interpret the coefficients and discuss the substantive implications of differences from
the model with incumbents only.
2. Suppose we are interested in the effect of education on salary as expressed in the fol-
lowing model:

Salaryᵢ = β₀ + β₁Educationᵢ + εᵢ

For this problem, we are going to assume that the true model is

Salaryᵢ = 10,000 + 1,000Educationᵢ + εᵢ

The model indicates that the salary for each person is $10,000 plus $1,000 times the
number of years of education plus the error term for the individual. Our goal is to
explore how much our estimate of —ˆ1 varies.
Enter the following code into a Stata .do file. It will simulate a data set with 100
observations (as determined with the set obs command). Values of education for each
observation are between 0 and 16 years. The error term is normally distributed
with a standard deviation of 10,000 (as determined with the scalar SD
command).
program OLS_Sim
clear
set obs 100 /* Set sample size */
gen Ed = 16*runiform() /* Generate education (ind. variable) */
scalar SD = 10000 /* Set value of standard deviation of error term */


gen Salary = 20000 + 1000* Ed + SD*rnormal() /* Generate salary (dep. variable) */


regress Salary Ed /* Run regression */
end
simulate _b _se, reps(50): OLS_Sim /* Run simulation 50 times */

The simulate line runs the code 50 times (as determined in the reps(50) command)
and will save the β̂ coefficients and standard errors for each simulation. The values
of β̂Education for each simulation are listed in a variable called _b_Ed; the values of β̂₀
for each simulation are listed in a variable called _b_cons. The values of se(β̂Education)
for each simulation are listed in a variable called _se_Ed; the values of se(β̂₀) for each
simulation are listed in a variable called _se_cons.
We can look at the estimated coefficients (via the list command) and summarize them
(via the summarize command):
list _b_* _se* /* Coefficient estimates & std. errors for each simulation */
summarize _b* /* Summarize coefficient estimates for each simulation */

a. Explain why the means of the estimated coefficients across the multiple simulations
are what they are.
b. What are the minimum and maximum values of the estimated coefficients on ed-
ucation? Explain whether these values are inconsistent with our statement in the
chapter that OLS estimates are unbiased.
c. Re-run the simulation with a larger sample size in each simulation. Specifically, set
the sample size to 1,000 in each simulation. (Do this by changing the set obs line of
the code.) Compare the mean, minimum, and maximum of the estimated coefficients
on education to the original results above.
d. Re-run the simulation with a smaller sample size in each simulation. Specifically,
set the sample size to 20 in each simulation. Compare the mean, minimum, and
maximum of the estimated coefficients on education to the original results above.
e. Re-set the sample size to 100 for each simulation and re-run the simulation with a
smaller standard deviation for each simulation. Specifically, set SD to 500 for
each simulation. (Do this by changing the scalar SD line of the code.) Compare
the mean, minimum, and maximum of the estimated coefficients on education to the
original results above.
f. Keeping the sample size at 100 for each simulation, re-run the simulation with a
larger standard deviation for each simulation. Specifically, set SD to 50,000
for each simulation. Compare the mean, minimum, and maximum of the estimated
coefficients on education to the original results above.
g. Revert to original model (sample size at 100 and SD at 10,000). Now run 500
simulations. (Do this by changing the simulate _b _se, reps(50) line of the code


so that it has reps(500).) Summarize the distribution of the —ˆEducation estimates as


you’ve done so far, but now also plot the distribution of these coefficients using
kdensity _b_Ed /* Density plot of estimated coefficients */

Describe the density plot in your own words.


3. In the chapter we discussed the relationship between height and wages in the United
States. Does this pattern occur elsewhere? The data set heightwage_british_males.dta
contains data on males in Britain from Persico, Postlewaite, and Silverman (2004). This
data is from the British National Child Development Survey (NCDS), which began as a
study of children born in Britain during the week of March 3, 1958. Information about
these children was gathered when they were 7, 11, 16, 23, and 33 years old. For this
question, we use just the information from when respondents were 33. Table 3.5 shows
the variables we use.
Table 3.5: Variables for Height and Wage Data in Britain

Variable name Description


gwage33 Hourly wages (in British Pounds) at age 33
height33 Height (in inches) measured at age 33

a. Estimate a model where height at age 33 explains income at age 33. Explain —ˆ1 and
—ˆ0 .
b. Create a scatterplot of height and income at age 33. Identify outliers.
c. Create a scatterplot of height and income at age 33 but exclude observations with
wages per hour more than 400 British pounds and height less than 40 inches. Describe
difference from the earlier plot. Which plot seems the more reasonable basis for
statistical analysis? Why?
d. Re-estimate the bivariate OLS model from part (a) but exclude four outliers with
very high wages and outliers with height below 40 inches. Briefly compare results to
earlier results.
e. What happens when the sample size is smaller? To answer this question, re-estimate
the bivariate OLS model from above (that excludes outliers) but limit the analysis
to the first 800 observations.29 Which changes more from the results with the full
29 To do this in Stata include if _n < 800 at the end of the Stata regress command. Because some observations
have missing data and others are omitted as outliers, the actual sample size in the regression will fall a bit lower
than 800. The “_n” notation is Stata's way of indicating the observation number, which is the row number of the
observation in the data set.


sample: the estimated coefficient on height or the estimated standard error of the
coefficient on height? Explain.
4. Table 3.6 lists the variables in the WorkWomen.dta and WorkMen.dta data sets, which
are based on Chakraborty, Holter, and Stepanchuk (2012). Answer the following ques-
tions about the relationship between hours worked and divorce rates.
Table 3.6: Variables for Divorce Rate and Hours Worked

Variable name Description


ID Unique number for each country in the data set
country Name of the country
hours Average yearly labor (in hours) for gender specified in data set
divorcerate Divorce rate per 1000
taxrate Average effective tax rate

a. For each data set (for women and for men), create a scatterplot of hours worked on
the Y-axis and divorce rates on the X-axis.
b. For each data set estimate an OLS regression in which hours worked is regressed on
divorce rates. Report the estimated regression equation and interpret the coefficients.
Explain differences in coefficients, if any.
c. What are the fitted value and residual for men in Germany?
d. What are the fitted value and residual for women in Spain?
5. Use the data sets described in Table 3.6 to answer the following questions about the relationship
between hours worked and tax rates.
a. For each data set (for women and for men), create a scatterplot of hours worked on
the Y-axis and tax rates on the X-axis.
b. For each data set estimate an OLS regression in which hours worked is regressed on
tax rates. Report the estimated regression equation and interpret the coefficients.
Explain differences in coefficients, if any.
c. What are the fitted value and residual for men in the United States?
d. What are the fitted value and residual for women in Italy?

CHAPTER 4

HYPOTHESIS TESTING AND INTERVAL ESTIMATION:

ANSWERING RESEARCH QUESTIONS

Sometimes the results of an experiment are ob-

vious. In 1881, Louis Pasteur gave an anthrax

vaccine to 24 sheep and selected 24 other sheep

to be a control group. He exposed all 48 sheep

to a deadly dose of anthrax and asked visitors to

come back in two days. By then, 21 of the un-

vaccinated sheep had died. Two more unvaccinated sheep died in front of the visitors’ eyes

and the last unvaccinated sheep died the next day. Of the vaccinated sheep, only one died

and that was from symptoms inconsistent with anthrax. Nobody needed fancy statistics to


conclude the vaccine worked; they only needed masks to cover the smell.

Mostly, though, the conclusions from an experiment are not so obvious. What if the

death toll had been two unvaccinated sheep and one vaccinated sheep? That well could have

happened by chance. What if five unvaccinated sheep died and no vaccinated sheep died?

That outcome would seem less likely to have happened simply by chance. But would it be

enough for us to believe that the vaccine treatment prevented anthrax?

These kinds of questions pervade all statistical analysis. We’re trying to answer questions

and while it’s pretty easy to see if some policy is associated with more of some outcome, it’s

much harder to know at what point we should become convinced the relationship is real,

rather than the result of the hurly-burly randomness of real life.
Statistics provides an infrastructure for answering these questions via hypothesis test-

ing. Hypothesis testing allows us to assess whether the observed data is consistent or not

with a claim of interest. The process does not yield 100 percent definitive answers, but

rather translates our statistical estimates into statements like “We are quite confident that

the vote share of the incumbent president’s party goes up in the United States when the

economy is good” or “We are quite confident that tall people get paid more.”

The standard statistical way to talk about hypotheses is a bit of an acquired taste.

Suppose there is no effect (that is, that —1 = 0). What is the probability that when we run

OLS on the data we actually have, we see a coefficient as large as we actually observe? That

is, suppose we want to test the claim that —1 = 0. If this claim were true (meaning —1 = 0),


what is the probability of observing a —ˆ1 = 0.4 or 7.2 or whatever result our OLS produced?

If this probability of observing the —ˆ1 we actually observe is very small if —1 = 0, then we

can reasonably infer that the hypothesis that —1 = 0 is probably not true.

Intuitively we know that if a treatment has no effect, the probability of seeing a huge

difference is low and the chance of seeing a small difference is large. The magic of stats –

and it is quite remarkable – is that we can quantify the probabilities of seeing any observed

effect given that the effect really is zero.

In this chapter we discuss the tools of hypothesis testing. Section 4.1 lays out the core logic

and terminology. Section 4.2 covers the workhorse of hypothesis testing, the t test. Section

4.3 introduces p-values, which are a useful byproduct of the hypothesis testing enterprise.

Section 4.4 discusses statistical power, a concept that sometimes goes underappreciated de-

spite its cool name. Power helps us appreciate the difference between finding no relationship

because there is no relationship or because we don’t have enough data. Section 4.5 discusses

some of the very real limitations to the hypothesis testing approach and Section 4.6 then

introduces the confidence interval approach to estimation, which avoids some of the problems

of hypothesis testing.

Much of the material in this chapter will be familiar to those who have had a probability

and statistics course. Learning or tuning up our understanding of this material will put us

in great position to understand OLS as it is practiced.


4.1 Hypothesis Testing

We want to use statistics to answer questions and the main way to do so is to use OLS to

assess hypotheses. In this section, we introduce the null and alternative hypotheses, apply

the concepts to our presidential election example, and then develop the important concept

of significance level.

Hypothesis testing begins with a null hypothesis, which is typically a hypothesis of no

effect. Consider the height and wage example from page 112:

Wageᵢ = β₀ + β₁Adult heightᵢ + εᵢ    (4.1)

The standard null hypothesis is that height has no effect on wages. Or, more formally,

H0: β₁ = 0,

with the subscript zero after the H indicating this is the null hypothesis.

Statistical tools do not allow us to prove or disprove a null hypothesis. Instead, we

“reject” or “fail to reject” the null hypothesis. When we reject a null hypothesis, we are

actually saying that the probability of seeing the —ˆ1 that we estimated is very low if the null

hypothesis were true. For example, it is unlikely we will observe a large —ˆ1 with a small

standard error if the truth were —1 = 0. If we do nonetheless observe a large —ˆ1 with a small

standard error, we will reject the null hypothesis and refer to the coefficient as statistically

significant.


When we fail to reject a null hypothesis, we are saying that the —ˆ1 we observe would not

be particularly unlikely if the null hypothesis were true. For example, we typically fail to reject the

null hypothesis when we observe a small β̂₁. That outcome would not be surprising at all if

—1 = 0. We can also fail to reject null hypotheses when uncertainty is high. That is, a large

—ˆ1 may not be too surprising even when —1 = 0 if the variance of —ˆ1 is large relative to the

value of —ˆ1 . We formalize this logic when we discuss t statistics in the next section.

The heart of proper statistical analysis is that we recognize that we might be making a

mistake. When we reject a null hypothesis we are concluding that it is unlikely that —1 = 0

given the —ˆ1 we observe. We are not saying it is impossible.

When we fail to reject a null hypothesis we are saying it would not surprise us if —1 = 0

given the —ˆ1 we observe. We are definitely not saying that we know that —1 = 0 when we fail

to reject the null. Instead, the situation is like when a jury says “not guilty”; the accused

may be guilty, but the evidence is not overwhelming enough to convict.

We characterize possible mistakes in two ways. Type I errors occur when we reject

a null hypothesis even when it is true. If we say height increases wages, but actually it

doesn’t, we’re committing a Type I error. Type II errors occur when we fail to reject a

null hypothesis even when it is false. If we say that there is no relationship between height

and wages, but there actually is one, we’re committing a Type II error. Table 4.1 summarizes

this terminology.

Standard hypothesis testing focuses heavily on Type I error. That is, the approach is


Table 4.1: Type I and Type II Errors

                            β₁ ≠ 0                            β₁ = 0
 Reject H0                  Correct inference                 Type I error:
                                                              wrongly reject null
 Fail to reject H0          Type II error:                    Correct inference
                            wrongly fail to reject null

built around specifying an acceptable level of Type I error and proceeding from there. We

should not forget Type II error, though. There are many situations in which we have to take

the threat of Type II error seriously; we discuss these when we discuss statistical power in

Section 4.4.

If we reject the null hypothesis, we accept the alternative hypothesis. We do not prove

the alternative hypothesis is true. Rather, the alternative hypothesis is the idea we hang

onto when we have evidence that is inconsistent with the null hypothesis.

An alternative hypothesis is either one-sided or two-sided. A one-sided alternative

hypothesis has a direction. For example, if we have theoretical reasons to believe that

being taller increases wages, then the alternative hypothesis for the following model

Wageᵢ = β₀ + β₁Adult heightᵢ + εᵢ    (4.2)

would be written as HA: β₁ > 0.

A two-sided alternative hypothesis has no direction. For example, if we think height

affects wages but we’re not sure whether tall people get paid more or less, then the alternative

hypothesis would be HA: β₁ ≠ 0. If we've done enough thinking to run a statistical model,

it seems reasonable to believe that we should have at least an idea of the direction of the


coefficient on our variable of interest, implying that two-sided alternatives might be rare.

They are not, however, in part because they are more statistically cautious in the manner

we discuss below.

Formulating appropriate null and alternative hypotheses allows us to translate substantive

ideas into statistical tests. For published work, it is generally a breeze to identify null

hypotheses: Just find the —ˆ that the authors jabber on most about. The main null hypothesis

is almost certainly that that coefficient is zero.

OLS coefficients under the null hypothesis for the presidential election example

With a null hypothesis in hand, we can move toward serious statistical analysis. Let's consider the presidential-election example that opened Chapter 3. To identify a null hypothesis we first need a model, such as

$$\text{Vote share}_t = \beta_0 + \beta_1 \text{Change in income}_t + \epsilon_t \qquad (4.3)$$

where Vote share$_t$ is the percent of the vote received by the incumbent president's party in year $t$ and the independent variable, Change in income$_t$, is the percent change in real disposable income in the United States in the year before the presidential election. The null hypothesis is that there is no effect: $H_0: \beta_1 = 0$.

What is the distribution of $\hat{\beta}_1$ under the null hypothesis? Pretty simple: It is a normally distributed random variable centered on zero because OLS produces unbiased estimates and, if the true value of $\beta_1$ is zero, then an unbiased distribution of $\hat{\beta}_1$ will be centered on zero.


How wide is the distribution of $\hat{\beta}_1$ under the null hypothesis? In contrast to the mean of the distribution, which we know under the null, the width depends on the data and the standard error implied by the data. In other words, we allow the data to tell us the standard error of the $\hat{\beta}_1$ estimate under the null hypothesis.

Table 4.2 shows the results for the model. Of particular interest for us at this point is that the standard error of the $\hat{\beta}_1$ estimate is 0.52. This number tells us how wide the distribution of the $\hat{\beta}_1$ will be under the null.

Table 4.2: Effect of Income Changes on Presidential Elections

Variable            Coefficient    Standard error
Change in income    2.29           0.52
Constant            45.91          1.69
N = 17

With this information we can picture the distribution of $\hat{\beta}_1$ under the null. Specifically, Figure 4.1 shows the probability density function of $\hat{\beta}_1$ under the null hypothesis, which is a normal probability density centered at zero with a standard deviation of 0.52. We also refer to this as the distribution of $\hat{\beta}_1$ under the null hypothesis. We introduced probability density functions in Section 3.2 of Chapter 3 and discuss them in further detail in the appendix starting on page 771.
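Estimates and standard errors like those in Table 4.2 come straight out of statistical software. As a rough sketch only, here is what the workflow looks like in Python with statsmodels; the data below are invented to stand in for the actual election data, so the numbers it prints will not match Table 4.2.

    # Illustrative sketch: fit a bivariate OLS model and report the coefficient and
    # standard error that feed into a hypothesis test. The data here are made up.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    income_change = rng.normal(1.5, 2.0, size=17)                        # hypothetical X
    vote_share = 46 + 2.3 * income_change + rng.normal(0, 4, size=17)    # hypothetical Y

    X = sm.add_constant(income_change)        # adds the intercept term
    results = sm.OLS(vote_share, X).fit()

    print(results.params)    # intercept and slope estimates
    print(results.bse)       # their standard errors (analogous to the 0.52 in Table 4.2)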

Figure 4.1 illustrates the key idea of hypothesis testing. The actual value of $\hat{\beta}_1$ that we estimated is 2.29. That number seems pretty unlikely, doesn't it? Most of the distribution of $\hat{\beta}_1$ under the null hypothesis is to the left of the $\hat{\beta}_1$ observed. We formalize things in the next section, but intuitively it's reasonable to think that the observed value of $\hat{\beta}_1$ is so unlikely if the null were true that, well, the null hypothesis is probably not true.

[Figure 4.1: Distribution of $\hat{\beta}_1$ Under the Null Hypothesis for Presidential Election Example. The density is centered at zero with a standard error of 0.52; the actual value of $\hat{\beta}_1$, 2.29, lies far in the right tail, while a value such as $-0.3$ is an example of a $\hat{\beta}_1$ for which we would fail to reject the null.]

Now name a value of $\hat{\beta}_1$ that would lead us not to reject the null hypothesis. In other words, name a value of $\hat{\beta}_1$ that is perfectly likely under the null hypothesis. We show one such example in Figure 4.1 by putting a line at $\hat{\beta}_1 = -0.3$. A value like this would be completely unsurprising if the null hypothesis were true. Hence if we observed such a value for $\hat{\beta}_1$ we would deem it to be consistent with the null hypothesis and we would not reject the null hypothesis.

Significance level

Given that our strategy is to reject the null hypothesis when we observe a $\hat{\beta}_1$ that is quite unlikely under the null hypothesis, the natural question is: Just how unlikely does $\hat{\beta}_1$ have to be? We get to choose the answer to this question. In other words, we get to decide our standard for what we deem to be sufficiently unlikely to reject the null hypothesis. We'll call this probability the significance level and denote it with $\alpha$ (the Greek letter alpha). A significance level determines how unlikely a result has to be under the null hypothesis for us to reject the null hypothesis. A very common significance level is 5 percent (meaning $\alpha = 0.05$).

If we set $\alpha = 0.05$, then we reject the null when we observe a $\hat{\beta}_1$ that is so large that we would expect only a 5 percent chance of seeing the observed value or higher under the null hypothesis. Setting $\alpha = 0.05$ means that there is a 5 percent chance that we would see a value high enough to reject the null hypothesis even when the null hypothesis is true, meaning that $\alpha$ is the probability of making a Type I error.

If we want to be more cautious (in the sense of requiring a more extreme result to reject the null hypothesis) we can choose $\alpha = 0.01$, in which case we will reject the null if we have a 1 percent or lower chance of observing a $\hat{\beta}_1$ as large as we actually did.

Reducing $\alpha$ is not completely costless, however. As the probability of making a Type I error decreases, the probability of making a Type II error increases. In other words, the more we say we're going to need really strong evidence to reject the null hypothesis (which is what we say when we make $\alpha$ small), the more likely we are going to fail to reject the null hypothesis even when the null hypothesis is wrong (which is a Type II error).


Remember This
1. A null hypothesis is typically a hypothesis of no effect that we write as $H_0: \beta_1 = 0$.
   • We reject a null hypothesis when the statistical evidence is inconsistent with the null hypothesis. A coefficient estimate is statistically significant if we reject the null hypothesis that the coefficient is zero.
   • We fail to reject a null hypothesis when the statistical evidence is consistent with the null hypothesis.
   • Type I error occurs when we wrongly reject a null hypothesis.
   • Type II error occurs when we wrongly fail to reject a null hypothesis.
2. An alternative hypothesis is the hypothesis we accept if we reject the null hypothesis.
   • We choose a one-sided alternative hypothesis if theory suggests $\beta_1 > 0$ or theory suggests $\beta_1 < 0$.
   • We choose a two-sided alternative hypothesis if theory does not provide guidance as to whether $\beta_1$ is greater than or less than zero.
3. The significance level ($\alpha$) refers to the probability of Type I error for our hypothesis test. We choose the value of the significance level, typically 0.01 or 0.05.
4. There is a trade-off between Type I and Type II error. If we lower $\alpha$ we decrease the probability of making a Type I error, but increase the probability of making a Type II error.


Discussion Questions

1. Translate each of the following questions into a bivariate model with a null hypothesis that could be tested. There is no single answer for each.
   (a) "What causes test scores to rise?"
   (b) "How can Republicans increase support among young voters?"
   (c) "Why did unemployment spike in 2008?"
2. For each of the following, identify the null hypothesis, draw a picture of the distribution of $\hat{\beta}_1$ under the null, identify values of $\hat{\beta}_1$ that would lead you to reject or fail to reject the null, and explain what it would mean to commit Type I and Type II errors in each case.
   (a) We want to know if height increases wages.
   (b) We want to know if gasoline prices affect the sales of SUVs.
   (c) We want to know if handgun sales affect murder rates.

4.2 t tests

The most common tool we use for hypothesis testing in OLS is the t test. There's a quick rule of thumb for t tests: If the absolute value of $\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$ is bigger than 2, reject the null hypothesis (recall that $se(\hat{\beta}_1)$ is the standard error of our coefficient estimate). If not, don't. In this section we provide the logic and tools of t testing so that we can be more precise, but this rule of thumb is pretty much all there is to it.


$\hat{\beta}_1$ and standard errors

To put our t tests in context, let's begin with the fact that we have calculated $\hat{\beta}_1$ and are trying to figure out if $\hat{\beta}_1$ would be highly surprising if the null hypothesis were true. A challenge is that the scale of our $\hat{\beta}_1$ could be anything. In our presidential-election model above, we estimated $\hat{\beta}_1$ to be 2.29. Is that estimate surprising under the null? As we saw in Figure 4.1 on page 143, it is unlikely to observe a $\hat{\beta}_1$ that big when the standard error of $\hat{\beta}_1$ is only 0.52. What if the standard error of $\hat{\beta}_1$ were 2.0? The distribution of $\hat{\beta}_1$ under the null hypothesis would still be centered at zero, but it would be really wide, as in Figure 4.2. In this case, it really wouldn't be so surprising to see a $\hat{\beta}_1$ of 2.29 if the null hypothesis that $\beta_1 = 0$ were true.

What we really care about is not the $\hat{\beta}_1$ coefficient estimate by itself, but rather how large the $\hat{\beta}_1$ coefficient is relative to its standard error. In other words, it is unlikely to observe a $\hat{\beta}_1$ coefficient that is much bigger than its standard error, which will place it outside the range of the most likely outcomes for a normal distribution.

Therefore we use a test statistic that consists of the estimated coefficient divided by its estimated standard error: $\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$. Thus our test statistic reflects how many standard errors above or below zero the estimated coefficient is. If $\hat{\beta}_1$ is 6 and $se(\hat{\beta}_1)$ is 2, then our test statistic will be 3 because the estimated coefficient is 3 standard errors above zero. If the standard error had been 12 instead, then the value of our test statistic would be 0.5.

[Figure 4.2: Distribution of $\hat{\beta}_1$ Under the Null Hypothesis with Larger Standard Error for Presidential-Election Example. The density is centered at zero with a standard error of 2.0; the actual value of $\hat{\beta}_1$, 2.29, is no longer far out in the tail.]
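To make that arithmetic concrete, here is a minimal sketch in Python using the same illustrative numbers as this paragraph; it does nothing more than divide an estimate by its standard error.

    # The t statistic is the coefficient estimate divided by its standard error.
    def t_statistic(beta_hat, se_beta_hat):
        return beta_hat / se_beta_hat

    print(t_statistic(6, 2))     # 3.0: the estimate is 3 standard errors above zero
    print(t_statistic(6, 12))    # 0.5: same estimate, larger standard error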

The t distribution

Dividing $\hat{\beta}_1$ by its standard error solves the scale problem, but introduces another challenge. We know $\hat{\beta}_1$ is normally distributed, but what is the distribution of $\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$? The $se(\hat{\beta}_1)$ is also a random variable because it depends on the estimated standard deviation of the errors, which varies from sample to sample. It's a tricky question and now is a good time to turn to our friends at Guinness Brewing for help. Really. Not for what you might think, but for work they did in the early twentieth century demonstrating that the distribution of $\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$ follows a distribution we call the t distribution.¹ The t distribution is bell-shaped like a normal distribution but has "fatter tails."² We say it has fat tails because the values on the far left and far right have higher probabilities than for the normal distribution. The extent of these chubby tails depends on the sample size; as the sample size gets bigger, the tails melt down to become the same as the normal distribution. What's going on is that we need to be more cautious about rejecting the null because it is possible that by chance our estimate of $se(\hat{\beta}_1)$ will be too small, which will make $\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$ look like it's really big. When we have small amounts of data, the issue is serious because we will be quite uncertain about $se(\hat{\beta}_1)$; when we have lots of data, we'll be more confident about our estimate of $se(\hat{\beta}_1)$ and, as we'll see, the fat tails of the t distribution fade away and the t distribution and normal distribution become virtually indistinguishable.³

¹ Like many statistical terms, the t distribution and t test have quirky origins. William Sealy Gosset devised the test in 1908 while working for Guinness Brewery in Dublin. His pen name was "Student." There already was an s test (now long forgotten), so Gosset named his test and distribution after the second letter of his pen name. The estimate of the standard error of $\hat{\beta}_1$ is built from a statistical distribution called a $\chi^2$ distribution, and the ratio of a normally distributed random variable to the square root of a scaled $\chi^2$ random variable follows a t distribution. More details are in the appendix section on other distributions.

² That's a statistical term. Seriously.

The specific shape of a t distribution depends on the degrees of freedom, which is the sample size minus the number of parameters. A bivariate OLS model estimates two parameters ($\hat{\beta}_0$ and $\hat{\beta}_1$), which means, for example, that the degrees of freedom for a bivariate OLS model with a sample of 50 is $50 - 2 = 48$.

Figure 4.3 displays three different t distributions with a normal distribution plotted in the background of each panel as a dotted line. Panel (a) shows a t distribution with degrees of freedom equal to 2. Check out those fat tails. The probability of observing a value as high as 3 is higher for the t distribution than for the normal distribution. The same thing goes for the probability of observing a value as low as -3. Panel (b) of Figure 4.3 shows a t distribution with degrees of freedom equal to 5. If we look closely, we can see some chubbiness in the tails as the t distribution has higher probabilities at, for example, values greater than two. We have to look pretty closely to see that, though. Panel (c) shows a t distribution with degrees of freedom equal to 50. It is visually indistinguishable from a normal distribution and, in fact, covers up the normal distribution so we cannot see it.

³ More technically, the estimate of the standard error is built from a statistical distribution called a $\chi^2$ distribution (the Greek letter is chi and is pronounced "kai"). The t distribution characterizes the distribution of a normally distributed random variable ($\hat{\beta}_1$) divided by the square root of a scaled $\chi^2$ distributed random variable.


[Figure 4.3: Three t distributions. Panels (a), (b), and (c) show t distributions with 2, 5, and 50 degrees of freedom, each plotted against a normal distribution; with 50 degrees of freedom the t distribution is visually indistinguishable from the normal.]
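The "fat tails" are easy to check numerically. This is a small sketch in Python with scipy (the cutoff of 3 is just an example) comparing the probability of seeing a value above 3 under t distributions with different degrees of freedom and under the normal distribution.

    # Upper-tail probabilities P(value > 3): fatter tails mean bigger probabilities.
    from scipy import stats

    for df in (2, 5, 50):
        print(df, stats.t.sf(3, df))    # sf() is the upper-tail probability, 1 - cdf
    print("normal", stats.norm.sf(3))

The tail probability shrinks toward the normal value as the degrees of freedom grow, which is the visual story of Figure 4.3 in numbers.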


Critical values

Once we know the distribution of $\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$ we can come up with a critical value. A critical value is the threshold for our test statistic. Loosely speaking, we reject the null hypothesis if $\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$ (the test statistic) is greater than the critical value; if $\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$ is below the critical value, we fail to reject the null hypothesis.

More precisely, our specific decision rule depends on the nature of the alternative hypothesis. Table 4.3 displays the specific rules. Rather than trying to memorize these rules, it is better to concentrate on understanding the logic behind them. If the alternative hypothesis is two-sided, then big values of $\hat{\beta}_1$ relative to the standard error incline us to reject the null. We don't particularly care if they are very positive or very negative. If the alternative hypothesis is that $\beta_1 > 0$, then only large, positive values of $\hat{\beta}_1$ will incline us to reject the null hypothesis in favor of the alternative hypothesis. Observing a very negative $\hat{\beta}_1$ would be odd, but certainly would not incline us to believe that the true value of $\beta_1$ is greater than zero. Similarly, if the alternative hypothesis is that $\beta_1 < 0$, then only very negative values of $\hat{\beta}_1$ will incline us to reject the null hypothesis in favor of the alternative hypothesis. We refer to the appropriate critical value in the table because the actual value of the critical value will depend on whether the test is one- or two-sided, as we discuss below.

The critical value for t tests depends on the t distribution and identifies the point at which we decide the observed $\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$ is sufficiently unlikely under the null hypothesis that we reject the null hypothesis.


Table 4.3: Decision Rules for Various Alternative Hypotheses

Alternative hypothesis                            Decision rule
$H_A: \beta_1 \neq 0$ (two-sided alternative)     Reject $H_0$ if $|\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}|$ > appropriate critical value
$H_A: \beta_1 > 0$ (one-sided alternative)        Reject $H_0$ if $\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$ > appropriate critical value
$H_A: \beta_1 < 0$ (one-sided alternative)        Reject $H_0$ if $\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$ < -1 times appropriate critical value

Critical values depend on $\alpha$, the significance level we choose, our degrees of freedom, and whether the alternative is one-sided or two-sided. Figure 4.4 depicts critical values for various scenarios. We assume the sample size is large in each, allowing us to use the normal approximation to the t distribution.

Panel (a) of Figure 4.4 shows critical values for $\alpha = 0.05$ and a two-sided alternative hypothesis. The distribution of the t statistic is centered at zero under the null hypothesis that $\beta_1 = 0$. For a two-sided alternative hypothesis, we want to identify ranges that are far from zero and unlikely under the null hypothesis. For $\alpha = 0.05$ we want to find the range that constitutes the least-likely 5 percent of the distribution under the null. This 5 percent is the sum of the 2.5 percent on the far left and the 2.5 percent on the far right. Values in these ranges are not impossible, but they are unlikely. For a large sample size, the critical values that mark off the least-likely 2.5 percent regions of the distribution are -1.96 and 1.96.

Panel (b) of Figure 4.4 depicts the situation if we choose $\alpha = 0.01$. In this case, we're saying we're going to need to observe an even more unlikely $\hat{\beta}_1$ under the null hypothesis in order to reject the null hypothesis. The critical value for a large sample size is 2.58. This number defines the point at which there is a 0.005 probability (which is half of $\alpha$) of being higher than the critical value and a 0.005 probability of being less than the negative of it.

The picture and critical values differ a bit for a one-tailed test in which we look only at one side of the distribution. Panel (c) of Figure 4.4 depicts the situation when $\alpha = 0.05$ and $H_A: \beta_1 > 0$. Here, 5 percent of the distribution is to the right of 1.64, meaning that we will reject the null hypothesis in favor of the alternative that $\beta_1 > 0$ if $\frac{\hat{\beta}_1}{se(\hat{\beta}_1)} > 1.64$.

[Figure 4.4: Critical Values for Large Sample t tests. Panel (a): two-sided alternative, $\alpha = 0.05$, with 2.5 percent of the normal distribution to the left of -1.96 and 2.5 percent to the right of 1.96. Panel (b): two-sided alternative, $\alpha = 0.01$, with 0.5 percent to the left of -2.58 and 0.5 percent to the right of 2.58. Panel (c): one-sided alternative, $\alpha = 0.05$, with 5 percent to the right of 1.64.]

Note that the one-sided critical value for $\alpha = 0.05$ is lower than the two-sided critical value. One-sided critical values will always be lower for any given value of $\alpha$, meaning that it is easier to reject the null hypothesis for a one-sided alternative hypothesis than for a two-sided alternative hypothesis. Hence, using critical values based on a two-sided alternative is statistically cautious in the sense that we are less likely to look like we're over-eager to reject the null if we use a two-sided alternative.

Table 4.4 displays critical values of the t distribution for one-sided and two-sided alternative hypotheses for common values of $\alpha$. When the degrees of freedom are very small (typically due to a small sample size), the critical values are relatively large. For example, with 2 degrees of freedom and $\alpha = 0.05$, we need to see a t stat above 2.92 to reject the null.⁴ With 10 degrees of freedom, we need to see a t stat above 1.81 to reject the null. With 100 degrees of freedom, we need a t stat above 1.66 to reject the null. As the degrees of freedom get higher, the t distribution looks more and more like a normal distribution, and for infinite degrees of freedom it is exactly like a normal distribution, producing identical critical values. For degrees of freedom above 100, it is reasonable to use critical values from the normal distribution as a good approximation.

⁴ It's unlikely that we would seriously estimate a model with 2 degrees of freedom. For a bivariate OLS model, that would mean estimating a model with 4 observations.


Table 4.4: Critical Values for t distribution

Degrees of     $\alpha$ (1-sided):    0.05    0.025   0.01    0.005
freedom        $\alpha$ (2-sided):    0.10    0.050   0.02    0.01
2                                     2.92    4.30    6.97    9.92
5                                     2.01    2.57    3.37    4.03
10                                    1.81    2.23    2.76    3.17
15                                    1.75    2.13    2.60    2.95
20                                    1.73    2.09    2.53    2.85
50                                    1.68    2.01    2.40    2.68
100                                   1.66    1.98    2.37    2.63
$\infty$                              1.64    1.96    2.32    2.58
A t distribution with $\infty$ degrees of freedom is the same as a normal distribution.
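Critical values like those in Table 4.4 can be reproduced from the t distribution's quantile function. Here is a brief sketch in Python with scipy; the significance level and degrees of freedom below are just examples.

    # Critical values from the t distribution for one- and two-sided tests.
    from scipy import stats

    alpha = 0.05
    for df in (2, 10, 100):
        one_sided = stats.t.ppf(1 - alpha, df)       # e.g., about 2.92 when df = 2
        two_sided = stats.t.ppf(1 - alpha / 2, df)   # e.g., about 4.30 when df = 2
        print(df, round(one_sided, 2), round(two_sided, 2))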

We compare $\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$ to our critical value and reject if the magnitude is larger than the critical value. We refer to the ratio $\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$ as the t statistic (or "t stat" as the kids say). The t statistic is so named because that ratio will be compared to a critical value that depends on the t distribution in the manner we have just outlined. Tests based on two-sided alternatives with $\alpha = 0.05$ are very common. When the sample size is large, the critical value for such a test is 1.96. Hence the rule of thumb is that a t statistic bigger than 2 is statistically significant at conventional levels.


t statistics for the height and wage example

To show t testing in action, Table 4.5 provides the results of the height and wages models from Chapter 3 but now adds t statistics. As before, we show results using standard errors estimated by the equation that requires errors to be homoscedastic and standard errors estimated via an equation that allows errors to be heteroscedastic. The coefficients across models are identical.

The column on the left shows that the t statistic from the homoscedastic model for the coefficient on adult height is 4.225, meaning that $\hat{\beta}_1$ is 4.225 standard deviations away from zero. The t statistic from the heteroscedastic model for the coefficient on adult height is 4.325, which is essentially the same as in the homoscedastic model. For simplicity, we'll focus on the homoscedastic model results.

Table 4.5: Effect of Height on Wages with t Statistics

Variable            Assuming homoscedasticity    Allowing heteroscedasticity
Adult height        0.412                        0.412
                    (0.0976)                     (0.0953)
                    [t = 4.225]                  [t = 4.325]
Constant            -13.093                      -13.093
                    (6.897)                      (6.691)
                    [t = 1.898]                  [t = 1.957]
N                   1,910                        1,910
$\hat{\sigma}^2$    142.4                        142.4
$\hat{\sigma}$      11.93                        11.93
$R^2$               0.009                        0.009
Standard errors in parentheses


Is this coefficient on adult height statistically significant? To answer that question, we'll need a critical value. To pick a critical value, we need to choose a one-sided or two-sided alternative hypothesis and a significance level. Let's start with a two-sided test and $\alpha = 0.05$. For a t distribution we also need to know the degrees of freedom. Recall that the degrees of freedom are the sample size minus the number of parameters estimated. The smaller the sample size, the more uncertainty we have about our standard error estimate and hence the larger we make our critical value. Here, the sample size is 1,910 and we estimate two parameters, so the degrees of freedom are 1,908. For a sample this large, we can reasonably use the critical values from the last row of Table 4.4. The critical value for a two-sided test with $\alpha = 0.05$ and a high number of degrees of freedom is 1.96. Because our t statistic of 4.225 is higher than 1.96 we reject the null hypothesis. It's that easy.
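Putting those steps together in code is straightforward. This is a sketch in Python with scipy that reuses the numbers reported in Table 4.5 rather than re-estimating the model.

    # Two-sided t test for the adult height coefficient, using Table 4.5's numbers.
    from scipy import stats

    beta_hat, se = 0.412, 0.0976
    df = 1910 - 2                                     # sample size minus parameters
    t_stat = beta_hat / se                            # roughly 4.2

    alpha = 0.05
    critical_value = stats.t.ppf(1 - alpha / 2, df)   # about 1.96 for large df
    print(abs(t_stat) > critical_value)               # True, so we reject the null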

Other types of null hypotheses

Finally, it's worth noting that we can extend the t test logic to cases in which the null hypothesis refers to some value other than zero. Such cases are not super common, but not unheard of. Suppose, for example, that our null hypothesis is $H_0: \beta_1 = 7$ versus $H_A: \beta_1 \neq 7$. In this case, we simply need to check how many standard deviations $\hat{\beta}_1$ is away from 7. We do so by comparing $\frac{\hat{\beta}_1 - 7}{se(\hat{\beta}_1)}$ against the standard critical values we developed above. More generally, to test a null hypothesis that $H_0: \beta_1 = \beta^{Null}$ we look at $\frac{\hat{\beta}_1 - \beta^{Null}}{se(\hat{\beta}_1)}$ where $\beta^{Null}$ is the value of $\beta_1$ indicated in the null hypothesis.
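The only change to the calculation is subtracting the hypothesized value before dividing by the standard error; a quick sketch (the null value of 7 and the other numbers are purely illustrative):

    # t statistic for H0: beta_1 = beta_null rather than H0: beta_1 = 0.
    def t_statistic_vs_null(beta_hat, se_beta_hat, beta_null):
        return (beta_hat - beta_null) / se_beta_hat

    print(t_statistic_vs_null(6.0, 2.0, 7.0))   # -0.5: within half a standard error of 7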


Remember This
1. We use a t test to test the null hypothesis $H_0: \beta_1 = 0$. The steps are as follows:
   (a) Choose a one-sided or two-sided alternative hypothesis.
   (b) Set a significance level, $\alpha$, usually equal to 0.01 or 0.05.
   (c) Find a critical value based on the t distribution. This value depends on $\alpha$, whether the alternative hypothesis is one-sided or two-sided, and the degrees of freedom (equal to sample size minus number of parameters estimated).
   (d) Use OLS to estimate parameters.
      • For a two-sided alternative hypothesis, we reject the null hypothesis if $|\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}|$ > the critical value. Otherwise, we fail to reject the null hypothesis.
      • For a one-sided alternative hypothesis that $\beta_1 > 0$, we reject the null hypothesis if $\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$ > the critical value.
      • For a one-sided alternative hypothesis that $\beta_1 < 0$, we reject the null hypothesis if $\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$ < -1 times the critical value.
2. We can test any hypothesis of the form $H_0: \beta_1 = \beta^{Null}$ using $|\frac{\hat{\beta}_1 - \beta^{Null}}{se(\hat{\beta}_1)}|$ as the test statistic for a t test.


Discussion Questions

1. Refer to the results in Table 4.2 on page 142.
   (a) What is the t statistic for the coefficient on change in income?
   (b) What are the degrees of freedom?
   (c) What is the critical value for a two-sided alternative hypothesis and $\alpha = 0.01$? Do we reject or fail to reject the null?
   (d) What is the critical value for a one-sided alternative hypothesis and $\alpha = 0.05$? Do we reject or fail to reject the null?
2. Which is bigger: the critical value from a one-sided test or a two-sided test? Why?
3. Which is bigger: the critical value from a large sample or a small sample? Why?

4.3 p-values

The p-value is a useful byproduct of the hypothesis testing framework. It indicates the probability of observing a coefficient as high as we actually did if the null hypothesis were true. In this section we explain how to calculate p-values and why they're useful.

As a practical matter, the thing to remember is that we reject the null if the p-value is less than $\alpha$. Our rule of thumb here is "small p-value means reject": Low p-values are associated with rejecting the null and high p-values are associated with failing to reject the null hypothesis.


P-values can be calculated for any null hypothesis; we focus on the most common null hypothesis, in which $\beta_1 = 0$. Most statistical software reports a two-sided p-value, which indicates the probability, if the null hypothesis were true, of observing a coefficient larger in magnitude (either positively or negatively) than the coefficient we observe.

Panel (a) of Figure 4.5 shows the p-value calculation for the $\hat{\beta}_1$ estimate in the wage and height example we discussed on page 158. The t statistic is 4.23. The p-value is calculated by finding the likelihood of getting a t statistic larger in magnitude than observed under the null hypothesis. There is a 0.0000122 probability that the t statistic will be larger than 4.23. (In other words, there is a tiny probability we would observe a t statistic as high as 4.23 if the null hypothesis were true.) Because the normal distribution is symmetric, there is also a 0.0000122 probability that the t statistic will be less than -4.23. Hence the p-value will be twice the probability of being above the observed t statistic and equals 0.0000244.⁵ We see a very small p-value, meaning that the observed $\hat{\beta}_1$ is really, really unlikely if $\beta_1$ actually equals zero.

Suppose, however, that our $\hat{\beta}_1$ were 0.09 (instead of the 0.41 it actually was). The t statistic would be $\frac{0.09}{0.052} = 1.73$. Panel (b) of Figure 4.5 shows the p-value in this case. There is a 0.042 probability of observing a t statistic greater than 1.73 under the null hypothesis (and a 0.042 probability of observing a t statistic less than -1.73 under the null), so the p-value in this case would be 0.084. In this case, just by looking at the p-value we could say that we would reject the null for $\alpha = 0.10$ but fail to reject the null for $\alpha = 0.05$.

⁵ Here we are calculating two-sided p-values, which are the output most commonly reported by statistical software. If $\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$ is greater than zero, then the two-sided p-value is twice the probability of being greater than that value. If $\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$ is less than zero, then the two-sided p-value is twice the probability of being less than that value. A one-sided p-value is simply half the two-sided p-value.

[Figure 4.5: Two Examples of p-values. Panel (a), Case 1: the t statistic is 4.23 and the p-value is 0.0000244, with 0.00122 percent of the distribution to the left of -4.23 and 0.00122 percent to the right of 4.23. Panel (b), Case 2: the t statistic is 1.73 and the p-value is 0.084, with 4.2 percent of the distribution to the left of -1.73 and 4.2 percent to the right of 1.73.]
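These p-values are easy to recompute. A sketch in Python with scipy, using the large-sample normal approximation and the two t statistics from this example:

    # Two-sided p-values: twice the upper-tail probability beyond |t|.
    from scipy import stats

    for t_stat in (4.23, 1.73):
        p_two_sided = 2 * stats.norm.sf(abs(t_stat))
        print(t_stat, p_two_sided)
    # about 0.084 for t = 1.73; a tiny number (on the order of 0.00002) for t = 4.23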

P-values are helpful not only because they show us whether we reject the null hypothesis, but also whether we really reject the null or just barely reject the null. For example, a p-value of 0.0001 indicates there is only a 0.0001 probability of observing a $\hat{\beta}_1$ as large as what we observed if $\beta_1 = 0$. In this case, we are not only rejecting, we are decisively rejecting. Seeing a coefficient large enough to produce such a p-value is highly, highly unlikely if $\beta_1 = 0$. On the other hand, if the p-value is 0.049, we are just barely rejecting the null for $\alpha = 0.05$ and would, relatively speaking, have less confidence that the null is false. For $\alpha = 0.05$, we just barely fail to reject the null hypothesis with a p-value of 0.051.

We typically don't need to calculate p-values ourselves because any statistical package that conducts OLS will provide p-values. Our job is to know what they mean. Calculating p-values is straightforward, though, especially for large sample sizes. The Computing Corner provides details.⁶

⁶ For a two-sided p-value we want to know the probability of observing a t statistic higher than the absolute value of the t statistic we actually observe under the null hypothesis. This is $2 \times (1 - \Phi(|\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}|))$ where $\Phi$ is the Greek letter phi (pronounced like the first bit of the word final) and $\Phi()$ indicates the normal cumulative density function (CDF). (We see the normal CDF in our discussion of statistical power below; see page 774 for more details.) If the alternative hypothesis is $H_A: \beta_1 > 0$ the p-value is the probability of observing a t statistic higher than the observed t statistic under the null hypothesis: $1 - \Phi(\frac{\hat{\beta}_1}{se(\hat{\beta}_1)})$. If the alternative hypothesis is $H_A: \beta_1 < 0$ the p-value is the probability of observing a t statistic less than the observed t statistic under the null hypothesis: $\Phi(\frac{\hat{\beta}_1}{se(\hat{\beta}_1)})$.


Remember This
The p-value is the probability of observing a coefficient as large in magnitude as actually observed if the null hypothesis is true.
1. The lower the p-value, the less consistent the estimated $\hat{\beta}_1$ is with the null hypothesis.
2. We reject the null hypothesis if the p-value is less than $\alpha$.
3. A p-value can be useful to indicate the weight of evidence against a null hypothesis.

4.4 Power

The hypothesis testing infrastructure we've discussed so far is designed to deal with the possibility of Type I error, which occurs when we reject the null hypothesis when it is actually true. When we set the significance level, we are setting the probability of making a Type I error. Obviously, we'd really rather not believe the null is false when it is true.

Type II errors aren't so hot either, though. We make a Type II error when $\beta_1$ is really something other than zero, but we fail to reject the null hypothesis that $\beta_1$ is zero. In this section we explain statistical power, the statistical concept associated with Type II errors. We open by discussing the importance and meaning of Type II error, then show how to calculate power and how to create power curves. We finish the section by discussing when we should care about power.


Incorrectly failing to reject the null hypothesis

Type II error can be serious. For example, suppose a new medicine really saves lives, but the U.S. Food and Drug Administration is given an analysis in which the $\hat{\beta}_1$ estimate of its efficacy is not statistically significant. If the FDA fails to approve the drug, people will die unnecessarily. That's not "oops"; that's horrific. Even when the stakes are lower, we can fully imagine how stupid we'd feel if we conclude a policy doesn't work when in fact it does work, but we just happened to get a random realization of $\hat{\beta}_1$ that was not high enough to be statistically significant.

Type II error happens because it is possible to observe values of $\hat{\beta}_1$ that are less than the critical value even if $\beta_1$ (the true value of the parameter) is greater than zero. This is more likely when the standard error of $\hat{\beta}_1$ is high.

Figure 4.6 shows the probability of Type II error for three different values of $\beta_1$. In these figures, we assume a large sample (allowing us to use the normal distribution for critical values) and test $H_0: \beta_1 = 0$ against a one-sided alternative hypothesis $H_A: \beta_1 > 0$ with $\alpha = 0.01$. In this case, the critical value is 2.32, which means that we reject the null hypothesis if we observe $\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$ greater than 2.32. For simplicity, we'll suppose $se(\hat{\beta}_1)$ is 1.

Panel (a) of Figure 4.6 displays the probability of Type II error if the true value of $\beta_1$ equals 1. In this case, the distribution of $\hat{\beta}_1$ will be centered at 1. Only 9.3 percent of this distribution is to the right of 2.32, meaning that we have only a 9.3 percent chance of rejecting the null hypothesis. In other words, even though the null hypothesis actually is false (we're assuming $\beta_1$ equals one, not zero), we have only a roughly one in ten chance of rejecting the null. Put differently, our test is not particularly able to provide statistically significant results when the true value of $\beta_1$ is one.

Panel (b) of Figure 4.6 displays the probability of Type II error if the true value of $\beta_1$ equals 2. In this case, the distribution of $\hat{\beta}_1$ will be centered at 2. Here 37.4 percent of the distribution is to the right of 2.32. Better, but hardly: Even though $\beta_1 > 0$ we have not much more than a 1 in 3 chance of rejecting the null hypothesis that $\beta_1 = 0$.

Panel (c) of Figure 4.6 displays the probability of Type II error if the true value of $\beta_1$ equals 3. In this case, the distribution of $\hat{\beta}_1$ will be centered at 3. Here 75.1 percent of the distribution is to the right of 2.32. We're making progress, but still are far from perfection. In other words, the true value of $\beta_1$ has to be near or above 3 before we have a 75.1 percent chance of rejecting the null when we should.

[Figure 4.6: Statistical Power for Three Values of $\beta_1$, $\alpha = 0.01$, and a One-Sided Alternative Hypothesis. The $\hat{\beta}_1$ distribution is centered on 1, 2, and 3 in panels (a), (b), and (c); the probability of rejecting the null (the area to the right of the critical value 2.32) is 0.093, 0.374, and 0.751, respectively.]

These examples illustrate why we use the somewhat convoluted "fail to reject the null" terminology; when we observe a $\hat{\beta}_1$ less than the critical value, it is still quite possible that the true value is not zero. Failure to find an effect is not the same as finding no effect.

Calculating power

The main tool for thinking about whether we are making Type II errors is power. The statistical definition of power differs from how we use the word in ordinary conversation. Power in the statistical sense refers to the ability of our data to reject the null. A high-powered statistical test will reject the null with a very high probability when the null is false; a low-powered statistical test will reject the null with a low probability when the null is false. Think of statistical power like the power of a microscope. Using a powerful microscope, we can distinguish small differences in an object, differences that we cannot see if we look through a low-powered microscope even though they are there.

To calculate power we begin by noting that the probability we reject the null for any true value $\beta_1^{True}$ is the probability that the t statistic is greater than the critical value. We can write this condition as follows (where the condition following the vertical line is what we're assuming to be true):

$$Pr(\text{Reject null given } \beta_1 = \beta_1^{True}) = Pr\left(\frac{\hat{\beta}_1}{se(\hat{\beta}_1)} > \text{Critical value} \;\Big|\; \beta_1 = \beta_1^{True}\right) \qquad (4.4)$$

In other words, the power is the probability the t statistic is higher than the critical value. This probability will depend on the actual value of $\beta_1$, as we know that the distribution of $\hat{\beta}_1$ will depend on the true value of $\beta_1$.

To make these calculations easier, we need to do a couple more steps. First, note that the probability that the t statistic is bigger than the critical value (as described in Equation 4.4) is equal to 1 minus the probability that the t statistic is less than the critical value, yielding

$$Pr(\text{Reject null given } \beta_1 = \beta_1^{True}) = 1 - Pr\left(\frac{\hat{\beta}_1}{se(\hat{\beta}_1)} < \text{Critical value} \;\Big|\; \beta_1 = \beta_1^{True}\right) \qquad (4.5)$$

The key element of this equation is $Pr(\frac{\hat{\beta}_1}{se(\hat{\beta}_1)} < \text{Critical value} \mid \beta_1 = \beta_1^{True})$. This mathematical term seems complicated, but we actually know a fair bit about it. For a large sample size, the t statistic (which is $\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$) will be normally distributed with a variance of one around the true value divided by the standard error of the estimated coefficient. And, from the properties of the normal distribution (see page 782 for a review), the probability that the t statistic is less than the critical value is $\Phi(\text{Critical value} - \frac{\beta_1^{True}}{se(\hat{\beta}_1)})$, so the power is

$$Pr(\text{Reject null given } \beta_1 = \beta_1^{True}) = 1 - \Phi\left(\text{Critical value} - \frac{\beta_1^{True}}{se(\hat{\beta}_1)}\right)$$

where $\Phi$ is the Greek letter phi (pronounced like the first bit of the word final) and $\Phi()$ indicates the normal cumulative density function (see page 774 for more details). This quantity will vary depending on the true value of $\beta_1$ we wish to use in our power calculations.⁷

One thing that can be puzzling is how to decide what true value to use when calculating power. There really is no specific value that we should look at; instead, the point is that we can pick any value and calculate the power. We might pick a value of $\beta_1$ that indicates a substantial real-world effect and see what the probability of rejecting the null is for that value. If the probability is low (meaning power is low), then we should be a bit skeptical that we have enough data to reject the null for such a true value. If the probability is high (meaning power is high), then we can be confident that if the true $\beta_1$ is that value then we'd likely be able to reject the null hypothesis.

⁷ And we can make the calculation a bit easier by using the fact that $1 - \Phi(-Z) = \Phi(Z)$ to write the power as $\Phi\left(\frac{\beta_1^{True}}{se(\hat{\beta}_1)} - \text{Critical value}\right)$.
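A sketch of that power calculation in Python with scipy, approximately reproducing the rejection probabilities shown in Figure 4.6 for $se(\hat{\beta}_1) = 1$ and a one-sided test with $\alpha = 0.01$:

    # Power for a large-sample, one-sided test: Phi(beta_true/se - critical value).
    from scipy import stats

    alpha = 0.01
    critical_value = stats.norm.ppf(1 - alpha)    # about 2.33

    def power(beta_true, se):
        return stats.norm.cdf(beta_true / se - critical_value)

    for beta_true in (1, 2, 3):
        print(beta_true, round(power(beta_true, se=1.0), 3))
    # roughly 0.09, 0.37, and 0.75, as in Figure 4.6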


Power curves

Even better than picking a single value to use to calculate power, we can look at power over a range of possible true values of $\beta_1$. A power curve characterizes the probability of rejecting the null for each possible value of the parameter. Figure 4.7 displays two power curves. The solid line on top is the power curve for when $se(\hat{\beta}_1) = 1.0$ and $\alpha = 0.01$. On the horizontal axis are hypothetical values of $\beta_1$. The line shows the probability of rejecting the null for a one-tailed test of $H_0: \beta_1 = 0$ versus $H_A: \beta_1 > 0$ for $\alpha = 0.01$ and a sample large enough to use the normal approximation to the t distribution. To reject the null under these conditions requires a t stat greater than 2.32 (see Table 4.4). This power curve plots, for each possible value of $\beta_1$, the probability that $\frac{\hat{\beta}_1}{se(\hat{\beta}_1)}$ (which in this case is $\frac{\hat{\beta}_1}{1.0}$) is greater than 2.32. This curve includes the values we calculated in Figure 4.6 but now also covers all values of $\beta_1$ between 0 and 10.

Look first at the values of $\beta_1$ that are above zero, but small. For these values the probability of rejecting the null is quite small. In other words, even though the null hypothesis is false for these values (since $\beta_1 > 0$) we're unlikely to reject the null that $\beta_1 = 0$. As $\beta_1$ increases, this probability increases and by around $\beta_1 = 4$, the probability of rejecting the null approaches 1.0. That is, if the true value of $\beta_1$ is 4 or bigger, then we will reject the null with almost certainty.

The dashed line in Figure 4.7 displays a second power curve for which the standard error is bigger, here equal to 2.0. The significance level is the same as for the first power curve, $\alpha = 0.01$. We immediately see that the statistical power is lower. For every possible value of $\beta_1$, the probability of rejecting the null hypothesis is lower than when $se(\hat{\beta}_1) = 1.0$ because there is more uncertainty with the higher standard error for the estimate. For this standard error, the probability of rejecting the null when $\beta_1$ equals 2 is 0.09. So even though the null is false, we will have a very low probability of rejecting it.⁸

[Figure 4.7: Power Curves for Two Values of $se(\hat{\beta}_1)$. The vertical axis is the probability of rejecting the null hypothesis for $\alpha = 0.01$; the horizontal axis is $\beta_1$. The power curve for $se(\hat{\beta}_1) = 1.0$ lies above the power curve for $se(\hat{\beta}_1) = 2.0$; at $\beta_1 = 2$ the two curves give probabilities of about 0.37 and 0.09.]
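The power curves in Figure 4.7 can be traced out the same way. Here is a brief sketch in Python with numpy and scipy that evaluates the rejection probability over a grid of candidate true values for the two standard errors shown:

    # Power curves: rejection probability across candidate true values of beta_1.
    import numpy as np
    from scipy import stats

    alpha = 0.01
    critical_value = stats.norm.ppf(1 - alpha)
    beta_grid = np.linspace(0, 10, 101)       # candidate true values of beta_1

    for se in (1.0, 2.0):
        curve = stats.norm.cdf(beta_grid / se - critical_value)
        # curve[20] corresponds to beta_1 = 2: about 0.37 for se = 1.0, 0.09 for se = 2.0
        print(se, round(float(curve[20]), 2))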

When to care about power

One of the main determinants of $se(\hat{\beta}_1)$ is sample size (see page 97). Hence, a useful rule of thumb is that hypothesis tests based on large samples are usually high-powered and hypothesis tests based on small samples are usually low-powered. In Figure 4.7, we can think of the solid line as the power curve for a large sample and the dashed line as the power curve for a smaller sample. More generally, though, statistical power is a function of the variance of $\hat{\beta}_1$ and all the factors that affect it.

Power is particularly relevant when someone presents a null result, a finding in which the null hypothesis is not rejected. For example, someone may say class size is not related to test scores or that an experimental treatment does not work. In this case, we need to ask what the power of the test was. It could be, for example, that the sample size is very small such that the probability of rejecting the null is small even for substantively large values of $\beta_1$.

⁸ What happens when $\beta_1$ actually is zero? In this case, the null hypothesis is true and power isn't the right concept. Instead, the probability of rejecting the null in this case is the probability of rejecting the null when it is true. In other words, the probability of rejecting the null when $\beta_1 = 0$ is the probability of committing a Type I error, the $\alpha$ level we set.

Remember This
Statistical power refers to the probability of rejecting a null hypothesis for a given value of $\beta_1$.
1. A power curve shows the probability of rejecting the null for a range of possible values of $\beta_1$.
2. Large samples typically produce high-power statistical tests. Small samples typically produce low-power statistical tests.
3. It is particularly important to discuss power when presenting null results, which fail to reject the null hypothesis.

Discussion Questions

1. For each of the following, indicate the power of the test of the null hypothesis $H_0: \beta_1 = 0$ against the alternative hypothesis $H_A: \beta_1 > 0$ for a large sample size and $\alpha = 0.01$ for the given true value of $\beta_1$. We'll assume $se(\hat{\beta}_1) = 0.75$. Draw a sketch to help explain your numbers.
   (a) $\beta_1^{True} = 1$
   (b) $\beta_1^{True} = 2$
2. Suppose the estimated $se(\hat{\beta}_1)$ doubled. What will happen to the power of the test for the above two cases? First answer in general terms and then calculate specific answers.
3. Suppose $se(\hat{\beta}_1) = 2.5$. What is the probability of committing a Type II error for each of the above true values of $\beta_1$?


4.5 Straight Talk about Hypothesis Testing

The ins and outs of hypothesis testing can be confusing. There are t distributions, degrees of freedom, one-sided tests, two-sided tests, lions, tigers, and bears. Such confusion is unfortunate for two reasons. First, the essence is simple: High t statistics indicate that the $\hat{\beta}_1$ we observe would be quite unlikely if $\beta_1 = 0$. Second, as a practical matter, computers make hypothesis testing super easy. They crank out t stats and p-values lickety-split.

Sometimes these details distract us from the big picture: Hypothesis testing is not the whole story. In this section we discuss four important limits to the hypothesis testing framework.

First, and most importantly, all the hypothesis testing tools we develop (all of them!) are predicated on the assumption that there is no endogeneity. If there is endogeneity, then the hypothesis testing tools are useless. If the input is junk, then even a fancy triple-backflip-somersault hypothesis test still produces junk.

Second, hypothesis tests offer only a blunt tool for analyzing data. Sometimes they are misleadingly decisive. Suppose we have a sample of 1,000 and we are interested in a two-sided hypothesis test for $\alpha = 0.05$. If we observe a t statistic of 1.95 we will fail to reject the null. If we observe a t statistic of 1.97 we will reject the null. The world is telling us essentially the same thing in both cases, but the hypothesis testing approach gives us dramatically different answers.


Third, a hypothesis test can mask important information. Suppose the t statistic on one variable is 2 and the t statistic for another is 25. In both cases, we reject the null. But there's a big difference. We're kinda-sorta confident the null is not correct when the t stat is 2. We're damn sure the null sucks when the t stat is 25. Hypothesis testing alone does not make such a distinction. We should. The p-values we discussed earlier are helpful, as are the confidence intervals we discuss shortly.

Fourth, hypothesis tests and their focus on statistical significance can distract us from substantive significance. A substantively significant coefficient is one that, well, matters; it indicates that the independent variable has a meaningful effect on the dependent variable. While it can be a bit subjective as to how big a coefficient has to be for us to believe it matters, this is a conversation we need to have. And statistical significance is not always a good guide. Remember that t stats depend a lot on $se(\hat{\beta}_1)$ and that $se(\hat{\beta}_1)$ in turn depends on sample size and other factors (see page 97). If we have a really big sample (and these days it is increasingly common to have sample sizes in the millions), then the standard error will be tiny and our t stat might be huge even for a substantively trivial $\hat{\beta}_1$ estimate. In these cases, we may reject the null even when the $\hat{\beta}_1$ coefficient suggests a minor effect.

For example, suppose we look at average test scores for every elementary classroom in the country as a function of the salary of the teachers. We could conceivably get a statistically significant coefficient that implied, say, an increase of 0.01 points out of 100 for every hundred thousand dollars we pay teachers. Statistically significant, yes; substantively significant, not so much.

Or, conversely, we could have a small sample size that would lead to a large standard error on $\hat{\beta}_1$ and, say, to a failure to reject the null. But the coefficient could be quite big, suggesting perhaps a meaningful relationship. Of course, we wouldn't want to rush to conclude that the effect is really big, but it's worth appreciating that the data in such a case is indicating the possibility of a substantively significant relationship. In this instance, getting more data would be particularly valuable.

Remember This
Statistical significance is not the same as substantive significance.
1. A coefficient is statistically significant if we reject the null hypothesis.
2. A coefficient is substantively significant if the variable has a meaningful effect on
the dependent variable.
3. With large data sets, substantively small effects can sometimes be statistically
significant.
4. With small data sets, substantively large effects can sometimes be statistically
insignificant.

4.6 Confidence Intervals

One way to get many of the advantages of hypothesis testing without its stark black/white, reject/fail-to-reject dichotomies is to use confidence intervals. A confidence interval defines the range of true values that are most consistent with the observed coefficient estimate. A confidence interval contrasts with a point estimate, which is a single number (such as $\hat{\beta}_1$).

This section explains how confidence intervals are calculated and why they are useful. The intuitive way to think about confidence intervals is that they give us a range in which we're confident the true parameter lies. An approximate rule of thumb is that the confidence interval for a $\hat{\beta}_1$ estimate goes from two standard errors below $\hat{\beta}_1$ to two standard errors above $\hat{\beta}_1$. That is, the confidence interval for an estimate $\hat{\beta}_1$ will approximately cover $\hat{\beta}_1 - 2 \times se(\hat{\beta}_1)$ to $\hat{\beta}_1 + 2 \times se(\hat{\beta}_1)$.

The full explanation of confidence intervals involves similar statistical logic as t stats. The starting point is the realization that we can assess the probability of observing the $\hat{\beta}_1$ for any "true" $\beta_1$. For some values of $\beta_1$, our observed $\hat{\beta}_1$ wouldn't be surprising. Suppose, for example, we observe a coefficient of 0.41 with a standard error of 0.1 as we did in Table 3.2. If the true value were 0.41, a $\hat{\beta}_1$ near 0.41 wouldn't be too surprising. If the true value were 0.5, we'd be a wee bit surprised, perhaps, to observe $\hat{\beta}_1 = 0.41$ but not shocked. For some values of $\beta_1$, though, the observed $\hat{\beta}_1$ would be surprising. If the true value were 10, for example, we'd be gobsmacked to observe $\hat{\beta}_1 = 0.41$ with a standard error of 0.1. Hence, if we see $\hat{\beta}_1 = 0.41$ with a standard error of 0.1, we're pretty darn sure the true value of $\beta_1$ isn't 10.

Confidence intervals generalize this logic to identify the range of true values that would be reasonably likely to produce the $\hat{\beta}_1$ that we observe. They identify that range of true values for which the observed $\hat{\beta}_1$ and $se(\hat{\beta}_1)$ would not be too unlikely. We get to choose what we mean by unlikely by choosing our significance level, which is typically $\alpha = 0.05$ or $\alpha = 0.01$. We'll often refer to confidence levels, which are $1 - \alpha$. The lower bound of a 95% confidence interval will be a value of $\beta_1$ such that there is less than a $\frac{\alpha}{2} = 0.025$ probability of observing a $\hat{\beta}_1$ as high as the $\hat{\beta}_1$ actually observed. The upper bound of a 95% confidence interval will be a value of $\beta_1$ such that there is less than a 2.5% probability of observing a $\hat{\beta}_1$ as low as the $\hat{\beta}_1$ actually observed.

Figure 4.8 illustrates the meaning of a confidence interval. Suppose $\hat{\beta}_1 = 0.41$ and $se(\hat{\beta}_1) = 0.1$. For any given true value of $\beta_1$ we can calculate the probability of observing the $\hat{\beta}_1$ we actually did observe. Panel (a) shows that if $\beta_1$ really were 0.606, the distribution of $\hat{\beta}_1$ would be centered at 0.606 and we would see a value as low as 0.41 (what we actually observe for $\hat{\beta}_1$) only 2.5 percent of the time. Panel (b) shows that if $\beta_1$ really were 0.214, the distribution of $\hat{\beta}_1$ would be centered at 0.214 and we would see a value as high as 0.41 (what we actually observe for $\hat{\beta}_1$) only 2.5 percent of the time. In other words, our 95% confidence interval ranges from 0.214 to 0.606 and includes the values of $\beta_1$ such that it wouldn't be too surprising to observe what we actually observed.⁹

⁹ Confidence intervals are often defined with reference to random sampling. If we have a fixed population and draw a large number of samples from it, then 95 percent of the confidence intervals we calculate will cover the true relationship in the population. Sadly, truly random samples are incredibly rare; for example, so-called random samples for public opinion surveys are lucky to get 20 percent of the people they call to respond. Therefore it's useful not to put all our confidence interval eggs in a sampling basket.


[Figure 4.8: Meaning of Confidence Interval for Example of 0.41 ± 0.196. Panel (a): the upper bound of a 95% confidence interval is the value of $\beta_1$ (here 0.606) such that we would see the observed $\hat{\beta}_1$ of 0.41 or lower 2.5 percent of the time. Panel (b): the lower bound is the value of $\beta_1$ (here 0.214) such that we would see the observed $\hat{\beta}_1$ of 0.41 or higher 2.5 percent of the time.]


Figure 4.8 does not tell us how to calculate the upper and lower bounds of a confidence interval. We could use trial-and-error. Much better is an equation based on the properties of the distributions of β̂1. A confidence interval is β̂1 − critical value × se(β̂1) to β̂1 + critical value × se(β̂1). For large samples and α = 0.05, the critical value is 1.96, giving rise to the rule of thumb that a 95% confidence interval is approximately β̂1 ± 2 × the standard error of β̂1. In our example, where β̂1 = 0.41 and se(β̂1) = 0.1, we can be 95% confident that the true value is between 0.214 and 0.606.

Table 4.6 shows some commonly used confidence intervals for large sample sizes. The large sample size allows us to use the normal distribution to calculate critical values. A 90% confidence interval for our example is 0.246 to 0.574. The 99% confidence interval for a β̂1 = 0.41 and se(β̂1) = 0.1 is from 0.152 to 0.668. Notice that the higher the confidence level, the wider the confidence interval.

Table 4.6: Calculating Confidence Intervals for Large Samples

Confidence level   Critical value   Confidence interval      Example: β̂1 = 0.41 and se(β̂1) = 0.1
90%                1.64             β̂1 ± 1.64 × se(β̂1)       0.41 ± 1.64 × 0.1 = 0.246 to 0.574
95%                1.96             β̂1 ± 1.96 × se(β̂1)       0.41 ± 1.96 × 0.1 = 0.214 to 0.606
99%                2.58             β̂1 ± 2.58 × se(β̂1)       0.41 ± 2.58 × 0.1 = 0.152 to 0.668
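To see where the numbers in Table 4.6 come from, here is a minimal R sketch (R appears in the Computing Corner below) that reproduces the large-sample confidence intervals for the example of β̂1 = 0.41 and se(β̂1) = 0.1. The object names are ours, chosen just for illustration.

beta.hat <- 0.41                           # estimated coefficient from the example
se.beta  <- 0.1                            # its standard error
for (conf.level in c(0.90, 0.95, 0.99)) {
  crit  <- qnorm(1 - (1 - conf.level)/2)   # large-sample (normal) critical value
  lower <- beta.hat - crit * se.beta
  upper <- beta.hat + crit * se.beta
  cat(sprintf("%.0f%% CI: %.3f to %.3f (critical value %.2f)\n",
              100 * conf.level, lower, upper, crit))
}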

Confidence intervals are closely related to hypothesis tests. Because confidence intervals tell us the range of possible true values that are consistent with what we've seen, we simply need to see if the confidence interval on our estimate includes zero. If it does not, zero was not a value that would likely produce the data and estimates we observe and we can therefore reject H0: β1 = 0.

Confidence intervals do more than hypothesis tests, though, because they provide information on the likely location of the true value. If the confidence interval is mostly positive, but just barely covers zero, we would fail to reject the null hypothesis, but we would also recognize that the evidence suggests the true value is likely positive. If the confidence interval does not cover zero, but is restricted to a region of substantively unimpressive values of β1, then we can conclude that while the coefficient is statistically different from zero, it seems unlikely that the true value is substantively important.

Remember This
1. A confidence interval indicates a range of values in which the true value is likely to be, given the data.
• The lower bound of a 95% confidence interval will be a value of β1 such that there is less than a 2.5% probability of observing a β̂1 as high as the β̂1 actually observed.
• The upper bound of a 95% confidence interval will be a value of β1 such that there is less than a 2.5% probability of observing a β̂1 as low as the β̂1 actually observed.
2. A confidence interval is calculated as β̂1 ± t-critical value × se(β̂1), where the t-critical value is the critical value from the t table. It depends on the sample size and α, the significance level. For large samples and α = 0.05, the t-critical value is 1.96.


4.7 Conclusion

Statistical inference refers to the process of reaching conclusions based on the data. Hy-

pothesis tests are central to inference, particularly t tests. They’re pretty easy. Honestly,

a well-trained parrot could probably do simple t tests. Look at the damn t statistic! Is it

bigger than 2? Then squawk “reject”; if not, squawk “fail to reject.”

We can do much more. With p-values and confidence intervals we can characterize our

findings with some nuance. With power tests we can recognize the likelihood of failing to see

effects even when they’re there. Taken as a whole, then, these tools help us make inferences

from our data in a sensible way.

After reading and discussing this chapter, we should be able to

• Section 4.1: Explain the conceptual building blocks of hypothesis testing, including null

and alternative hypotheses and Type I and Type II errors.

• Section 4.2: Explain the steps in using t tests to test hypotheses.

• Section 4.3: Explain p-values.

• Section 4.4: Explain statistical power. Describe when it is particularly relevant.

• Section 4.5: Explain the limitations of hypothesis testing.

• Section 4.6: Explain confidence intervals. Explain the rule of thumb for approximating

a 95% confidence interval.


Further Reading

Ziliak and McCloskey (2008) provide a book-length attack on the hypothesis testing framework. Theirs is hardly the first such critique, but it may be the most fun.

An important, and growing, school of thought in statistics called "Bayesian statistics" produces estimates of the form "there is an 8.2 percent probability that β is less than zero." Happily, there are huge commonalities across Bayesian statistics and the approach used in this (and most other) introductory books. Simon Jackman's Bayesian Analysis for the Social Sciences is an excellent guide to Bayesian statistics.

Key Terms
• Alternative hypothesis (140)
• Confidence interval (201)
• Confidence levels (179)
• Critical value (153)
• Degrees of freedom (151)
• Hypothesis testing (136)
• Null hypothesis (138)
• One-sided alternative hypothesis (140)
• Null result (173)
• p-value (161)
• Point estimate (178)
• Power (168)


• Power curve (171)


• Significance level (144)
• Statistically significant (138)
• Substantive significance (176)
• t distribution (150)
• t statistic (157)
• t test (147)
• Two-sided alternative hypothesis (140)
• Type I error (139)
• Type II error (139)

Computing Corner

Stata
1. To find the critical value from a t distribution for a given α and N − k degrees of freedom
use the inverse t tail function in Stata: display invttail(n-k, α).10
• To calculate the critical value for a one-tailed t test with n − k = 100 and α = 0.05
type display invttail(100, 0.05)
• To calculate the critical value for a two-tailed t test with n − k = 100 and α = 0.05
type display invttail(100, 0.05/2)
2. To find the critical value from a normal distribution for a given α use the inverse normal
function in Stata. The display command tells Stata to print the results on the screen.
For a two-sided test with α = 0.05, type display invnormal(1-0.05/2). For a one-sided
test with α = 0.01, type display invnormal(1-0.01).
10 This is referred to as an inverse t function because we provide a percent (the α) and it returns a value of the t
distribution for which α percent of the distribution is larger in magnitude. For a non-inverse t function we typically
provide some value for t and the function tells us how much of the distribution is larger in magnitude. The tail part
of the function command refers to the fact that we're dealing with the far ends of the distribution.


3. The regression command in Stata (e.g. reg Y X1 X2) reports two-sided p-values and
confidence intervals. To generate the p-values from the t statistic only, use display
2*ttail(DF, abs(TSTAT)) where DF is the degrees of freedom and TSTAT is the
observed value of the t statistic.11 For a two-sided p-value for a t statistic of 4.23 based
on 1,908 degrees of freedom, type display 2*ttail(1908, 4.23).
4. Use the following code to create a power curve for α = 0.01 and a one-sided alternative
hypothesis covering 71 possible values of the true β1 from 0 to 7:
set obs 71
gen BetaRange = (_n-1)/10 /* Sequence of possible betas from 0 to 7 */
scalar stderrorBeta = 1.0 /* Standard error of beta-hat */
gen PowerCurve = normal(BetaRange/stderrorBeta - 2.32)
/* Probability t statistic is greater than critical value */
/* for each value in BetaRange/stderrorBeta */
graph twoway (line PowerCurve BetaRange)

R
1. In R, inverse probability distribution functions start with q (no reason why, really; it's
just a convention). To calculate the critical value for a two-tailed t test with n − k =
100 and α = 0.05 use the inverse t distribution command. For the inverse t-function
type qt(1-0.05/2, 100). To find the one-tailed critical value for a t distribution for
α = 0.01 and 100 degrees of freedom: qt(1-0.01, 100).
2. To find the critical value from a normal distribution for a given α use the inverse
normal function in R. For a two-sided test: qnorm(1-α/2). For a one-sided test:
qnorm(1-α).
3. The p-value reported in summary(lm(Y ~ X1)) is a two-sided p-value. To generate
the p-values from the t statistic only, use 2*(1-pt(abs(TSTAT), DF)) where TSTAT
is the observed value of the t statistic and DF is the degrees of freedom. For example,
for a two-sided p-value for a t statistic of 4.23 based on 1,908 degrees of freedom, type
2*(1-pt(abs(4.23), 1908)).
4. To calculate confidence intervals using the regression results from the Simpsons data
on page 128, use the confint command. For example, the 95% confidence intervals for
the coefficient estimates in the donut regression model from the Chapter 3 Computing
Corner on page 128 are
confint(OLSResults, level = 0.95)
                2.5 %   97.5 %
(Intercept)    86.605  158.626
donuts          4.878   13.329
11 The ttail function in Stata reports the probability of a t distributed random variable being higher than a t
statistic we provide (which we denote here as TSTAT). This syntax contrasts to the convention for normal distribution
functions, which typically report the probability of being less than the t statistic we provide.
5. Use the following code to create a power curve for – = 0.01 and a one-sided alternative
hypothesis covering 71 possible values of the true —1 from 0 to 7:
BetaRange = seq(0, 7, 0.1)
# Sequence of possible betas from 0 to 7
# separated by 0.1 (e.g. 0, 0.1, 0.2, ...)
stderrorBeta = 1
# Standard error of beta-hat
PowerCurve = pnorm(BetaRange/stderrorBeta -2.32)
# Probability t statistic is greater than critical value
# for each value in BetaRange/stderrorBeta
plot(BetaRange, PowerCurve, xlab="Beta", ylab="Probability reject null",
type="l")

Exercises
1. Persico, Postlewaite, and Silverman (2004) analyzed data from the National Longitudi-
nal Survey of Youth (NLSY) 1979 cohort to assess the relationship between height and
wages for white men. Here we explore the relationship between height and wages for
the full sample that includes men and women and all races. The NLSY is a nationally
representative sample of 12,686 young men and women who were 14-22 years old when
they were first surveyed in 1979. These individuals were interviewed annually through
1994 and biannually since then. Table 4.7 describes the variables from heightwage.dta
we’ll use for this question.
Table 4.7: Variables for Height and Wage Data in United States

Variable name   Description
wage96          Hourly wages (in dollars) in 1996
height85        Adult height: height (in inches) measured in 1985

a. Create a scatterplot of adult wages against adult height. What does this plot suggest
about the relationship between height and wages?


b. Estimate an OLS regression in which adult wages is regressed on adult height for
all respondents. Report the estimated regression equation and interpret the results,
explaining in particular what the p-value means.
c. Assess whether the null hypothesis that the coefficient on height85 equals zero is rejected at the 0.05 significance level for one-sided and for two-sided hypothesis tests.
2. In this problem, we will conduct statistical analysis on the sheep experiment discussed
at the beginning of the chapter. We will create variables and use OLS to analyze
their relationships. Death is the dependent variable and treatment is the independent
variable. For all models, the treatment variable will equal 1 for the first 24 observations
and will equal zero for the last 24 observations.
a. Suppose, as in the example, that only one sheep in the treatment group died and all
sheep in the control group died. Is the treatment coefficient statistically significant?
What is the (two-sided) p-value? What is the confidence interval?
b. Suppose now that only one sheep in the treatment group died and only 10 sheep in
the control group died. Is the treatment coefficient statistically significant? What is
the (two-sided) p-value? What is the confidence interval?
c. Continue supposing that only one sheep in the treatment group died. What is the
minimal number of sheep in the control group that needed to die for the treatment
effect to be statistically significant? (Solve by trial and error.)
3. Voters care about the economy, often more than any other issue. It is not surprising,
then, that politicians invariably argue that their party is best for the economy. Who
is right? In this exercise we’ll look at the U.S. economic and presidential party data
in PresPartyEconGrowth.dta to test if there is any difference in economic performance
between Republican and Democratic presidents. We will use two different dependent
variables:
• ChangeGDPpc is the change in real per capita GDP in each year from 1962 to 2013
(in inflation-adjusted U.S. dollars, available from the World Bank)
• Unemployment is the unemployment rate each year from 1947 to 2013 (available
from the Bureau of Labor Statistics).
Our independent variable is LagDemPres. This variable equals 1 if the president in the
previous year was a Democrat and equals 0 if the president in the previous year was
a Republican. The idea is that the president’s policies take some time to take effect
so that the economic growth in a given year depended on who was president the year
before.12
12 Other ways of considering the question are addressed in the large academic literature on presidents and the economy. See, among others, Bartels (2008), Campbell (2011), Comiskey and Marsh (2012), and Blinder and Watson (2013).

a. Estimate a model with unemployment as the dependent variable and LagDemPres as the independent variable. Interpret the coefficients.
b. Estimate a model with GDP per capita as the dependent variable and LagDemPres as the independent variable. Interpret the coefficients. Explain why the sample size differs from the first model.
c. Choose an α level and alternative hypothesis and indicate for each model above whether you accept or reject the null hypothesis.
d. Explain in your own words what the p-value means for the LagDemPres variable in each model.
e. Create a power curve for the model with GDP per capita as the dependent variable for α = 0.01 and a one-sided alternative hypothesis. Explain what the power curve means by indicating what the curve means for true β1 = 200, 400, and 800. Use the code in the Computing Corner, but with the actual standard error of β̂1 from the regression output.13
f. Discuss the implications of the power curve for the interpretation of the results for the model in which change in GDP per capita was the dependent variable.

13 In Stata, start with the following lines to create a list of possible true values of β1 and then set the "stderrorBeta" variable to be equal to the actual standard error of β̂1. Note: The first line clears all data; you will need to re-load the data set if you wish to run additional analyses. If you have created a syntax file it will be easy to re-load and re-run what you have done so far.
clear
set obs 201
gen BetaRange = 4*(_n-1) /* Sequence of true beta values from 0 to 800 */

4. Run the simulation code in the initial part of the education and salary question from Chapter 3 on page 131.
a. Generate t statistics for the coefficient on education for each simulation. What are the minimal and maximal values of these t statistics?
b. Generate two-sided p-values for the coefficient on education for each simulation. What are the minimal and maximal values of these p-values?
c. In what percent of the simulations do we reject the null hypothesis that βEducation = 0 at the α = 0.05 level with a two-sided alternative hypothesis?
d. Re-run the simulations, but set the true value of βEducation to zero. (Do this in the gen Salary = 20000 + 1000* Ed + StdDev*rnormal() line of code by changing 1000 to 0.) Do this for 500 simulations and report what percent of time we reject the null at the α = 0.05 level with a two-sided alternative hypothesis.
5. We will continue the analysis of height and wages in Britain from the homework problem
in Chapter 3 on page 133.
a. Estimate the model with income at age 33 as the dependent variable and height at
age 33 as the independent variable. (Exclude observations with wages above 400
British pounds per hour and height less than 40 inches.) Interpret the t statistics on
the coefficients.
b. Explain the p-values for the two estimated coefficients.
c. Show how to calculate the 95% confidence interval for the coefficient on height.
d. Do we accept or reject the null hypothesis that β1 = 0 for α = 0.01 and a two-sided alternative? Explain why.
e. Do we accept or reject the null hypothesis that β0 = 0 (the constant) for α = 0.01 and a two-sided alternative? Explain why.
f. Limit the sample size to the first 800 observations.14 Do we accept or reject the null hypothesis that β1 = 0 for α = 0.01 and a two-sided alternative? Explain if/how/why this answer differs from the earlier hypothesis test about β1.
6. The dataset MLBattend.dta contains Major League Baseball attendance records for 32
teams from the 1970s through 2000. This problem uses the power calculation described
on page 169.
a. Estimate a regression in which home attendance rate is the dependent variable and runs scored is the independent variable. Report your results and interpret all coefficients.
b. Use the standard error from your results to calculate the statistical power of a test of H0: βruns scored = 0 vs HA: βruns scored > 0 with α = 0.05 (assuming a large sample for simplicity) for three cases:
i. βruns scored = 100
ii. βruns scored = 400
iii. βruns scored = 1,000
c. Suppose we had much less data than we actually do such that the standard error on the coefficient on βruns scored were 900 (which is much larger than what we estimated). Using standard error of βruns scored = 900, calculate the statistical power of a test of H0: βruns scored = 0 vs HA: βruns scored > 0 with α = 0.05 (assuming a large sample for simplicity) for three cases:

14 In Stata, do this by adding & _n < 800 to the end of the if statement at the end of the regress command.


i. βruns scored = 100
ii. βruns scored = 400
iii. βruns scored = 1,000
d. Suppose we had much more data than we actually do such that the standard error on the coefficient on βruns scored were 200 (which is much smaller than what we estimated). Using standard error of βruns scored = 200, calculate the statistical power of a test of H0: βruns scored = 0 vs HA: βruns scored > 0 with α = 0.05 (assuming a large sample for simplicity) for three cases:
i. βruns scored = 100
ii. βruns scored = 400
iii. βruns scored = 1,000
e. Discuss the differences across the power calculations for the different standard errors.

CHAPTER 5

MULTIVARIATE OLS: WHERE THE ACTION IS

It's pretty easy to understand why we need to go beyond bivariate OLS: Observational data is lousy with endogeneity. It's almost always the case that X is correlated with ε in observational data and if we ignore that reality, we could come up with some pretty silly results.

For example, suppose we've been tasked to figure out how sales respond to temperature. Easy, right? We can run a bivariate model such as

Salest = β0 + β1 Temperaturet + εt

where Salest is sales in billions of dollars during month t and Temperaturet is the average temperature in the month.


FIGURE 5.1: Monthly Retail Sales and Temperature in New Jersey from 1992 to 2013. [Scatterplot of monthly retail sales (billions of $) against average monthly temperature (in degrees Fahrenheit), with the fitted line from a bivariate regression.]

Figure 5.1 shows monthly data for New Jersey for about 20 years. We've also added the fitted line from a bivariate regression. It's negative, implying that people shop less as temperatures rise.

Is that the full story? Could there be endogeneity, meaning there is something that is correlated with temperature and associated with more shopping? Think about shopping in the United States. When is it at its most frenzied? Right before Christmas. Something that happens in December ... when it's cold. In other words, we think there is something in the error term (Christmas shopping season) that is correlated with temperature. That's a recipe for endogeneity.

In this chapter, we learn how to control for other variables so that we can avoid (or at

least reduce) endogeneity and thereby see causal associations more clearly. Multivariate OLS

is the tool that allows us to do so. In our shopping example, multivariate OLS helps us see

that once we account for the December effect, higher temperatures are associated with higher

sales.

Multivariate OLS refers to OLS with multiple independent variables. We’re simply

going to add variables to the OLS model developed in the previous chapters. What do we

gain from doing so? Two things: bias reduction and precision. When we reduce bias, we

get more accurate parameter estimates because the coefficient estimates are on average less

skewed away from the true value. When we increase precision, we reduce uncertainty because

the distribution of coefficient estimates is more closely clustered toward the true value.

In this chapter we explain how to use multivariate OLS to fight endogeneity. Section 5.1

introduces the model and shows how controlling for multiple variables can lead to better

estimates. Section 5.2 discusses omitted variable bias, which occurs when we fail to control

for variables that affect Y and are correlated with included variables. Section 5.3 shows how

the omitted variable bias framework can be used to understand what happens when we use

poorly measured variables. Section 5.4 explains the precision of our estimates in multivariate

OLS. Section 5.5 concludes the chapter with a more big-think discussion of how to decide

which variables should go in a model.


FIGURE 5.2: Monthly Retail Sales and Temperature in New Jersey with December Indicated. [Panel (a) plots monthly retail sales (billions of $) against average monthly temperature (in degrees Fahrenheit), with December sales marked separately from other months. Panel (b) shows the same plot with December sales reduced by $5 billion.]

5.1 Using Multivariate OLS to Fight Endogeneity

Multivariate OLS allows us to control for multiple independent variables at once. In this

section, we explore two examples in which controlling for additional variables has a huge

effect on the results. Then we discuss the mechanics of the multivariate estimation process.

Multivariate OLS in action: retail sales

The sales and temperature example is useful for getting the hang of multivariate analysis.

Panel (a) of Figure 5.2 has the same data as Figure 5.1, but we’ve indicated the December

observations with triangles. Clearly New Jerseyites shop more in December; it looks like the


average sales are around $11 billion in December versus average sales of around $6 billion

per month in other months. We want to learn whether there is a temperature effect after

taking into account that December sales run about $5 billion higher than other months.

The idea behind multivariate OLS is to net out this December effect and then see what

the relationship between sales and temperature is. That is, suppose we subtracted the $5

billion bump from all the December observations and then looked at the relationship between

temperature and sales. That is what we’ve done in Panel (b) Figure 5.2 where each December

observation is now $5 billion lower than before. When we look at the data this way, the

negative relationship between temperature and sales seems to go away and it may even be

that the relationship is now positive.

In essence, multivariate OLS nets out the effects of other variables when it controls for additional variables. When we actually implement multivariate OLS, we (or, really, computers) do everything at once, controlling for the December effect while estimating the effect of

temperature even as we are simultaneously controlling for temperature while estimating the

December effect.

Table 5.1 shows the results for both a bivariate and multivariate model for our sales data.

In the bivariate model, the coefficient on temperature is negative and statistically significant,

implying that folks like to shop in the cold. When we use multivariate OLS to control for

December (by including the December variable that equals one for observations from the

month of December and zero for all other observations), the coefficient on temperature


becomes positive and statistically significant. Our conclusion has flipped! Heat brings out

the cash. Whether this relationship exists because people like shopping when it’s warm

or are going out to buy swimsuits and sunscreen, we can’t say. We can say, though, that

there’s pretty strong evidence that our initial bivariate finding that people shop less as the

temperature rises is not robust to controlling for holiday shopping in December.
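To make the mechanics concrete, here is a minimal R sketch of the two specifications compared in Table 5.1. It assumes a data frame called retail with columns sales, temp, and december (a 0/1 indicator for December observations); these names are placeholders, not the names in the book's data files.

bivariate    <- lm(sales ~ temp, data = retail)              # bivariate specification
multivariate <- lm(sales ~ temp + december, data = retail)   # adds the December control
summary(bivariate)      # temperature coefficient is negative here, as in Table 5.1
summary(multivariate)   # controlling for December, the temperature coefficient turns positive
confint(multivariate)   # 95% confidence intervals, as in the Chapter 4 Computing Corner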

The way we interpret multivariate OLS regression coefficients is slightly different from how we interpret bivariate OLS regression coefficients. We still say that a one unit increase in X is associated with a β̂1 increase in Y, but now we need to add, "Holding constant the other factors in the model." We therefore interpret our multivariate results as "Controlling for the December shopping boost, increases in temperature are associated with more shopping." In particular, the multivariate estimate implies that controlling for the surge in shopping in December, a one degree increase in average monthly temperature is associated with an increase in retail sales of $0.014 billion (also known as $14 million).

We don’t have to say the full long version every time we talk about multivariate OLS

results – unless we’re stalling for time – as people who understand multivariate OLS will

understand the longer, technically correct interpretation. We can also use the fancy-pants

phrase ceteris paribus which means all else equal, as in “Ceteris paribus, the effect of a

one degree increase in temperature on retail shopping in New Jersey is $14 million”

The way statisticians talk about multivariate results takes some getting used to. When statisticians say things like holding all else constant or holding all else equal they are simply referring to the fact that other variables are in the model and have been statistically controlled for. What they really mean is more like netting out the effect of other variables in the model. The logic behind saying that other factors are constant is that once we have netted out the effects of these other variables it is as if the values of these variables are equal for every observation. The language doesn't exactly sparkle with clarity, but the idea is not particularly subtle. Hence, when someone says something like "holding X2 constant, the estimated effect of a one-unit change in X1 is β̂1," we need simply to remember they mean that accounting for the effect of X2, the effect of X1 is estimated to be β̂1.

Table 5.1: Bivariate and Multivariate Results for Retail Sales Data

              Bivariate      Multivariate
Temperature   -0.019*        0.014*
              (0.007)        (0.005)
              [t = 2.59]     [t = 3.02]
December                     5.63*
                             (0.26)
                             [t = 21.76]
Constant      7.16*          4.94*
              (0.41)         (0.26)
              [t = 17.54]    [t = 18.86]
N             256            256
σ̂             1.82           1.07
R2            0.026          0.661

Standard errors in parentheses
* indicates significance at p < 0.05


Multivariate OLS in action: height and wages

Here’s another example that shows what happens when we add variables to a model. We

use the data on height and wages introduced in Chapter 3 on page 113. The bivariate model

was

Wagesi = β0 + β1 Adult heighti + εi    (5.1)

where Wagesi was the wages of men in the sample in 1996 and the adult height measured in 1985.

This is observational data and the reality with such data is that the bivariate model is

suspect. There are many ways something in the error term could be correlated with the

independent variable.

The authors of the height and wage study identified several additional variables to

include in the model, focusing in particular on one: adolescent height. They reasoned that

people who were tall as teenagers could have developed more confidence and participated

in more high school activities, and that this experience could have laid the groundwork for

higher wages later.

If teen height is actually boosting adult wages in the way that the researchers suspected,

then it is possible that the bivariate model with only adult height (Equation 5.1) will suggest

a relationship even though the real action is between adolescent height and wages. How can

we tell what the real story is?


Multivariate OLS comes to the rescue. It allows us to simply “pull” adolescent height out

of the error term and into the model by including it as an additional variable in the model.

The model becomes

Wagesi = β0 + β1 Adult heighti + β2 Adolescent heighti + εi    (5.2)

where β1 reflects the effect on wages of being one inch taller as an adult when including adolescent height in the model and β2 reflects the effect on wages of being one inch taller as an adolescent when including adult height in the model.
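For readers following along in R, here is a sketch of how Equations 5.1 and 5.2 could be estimated. It assumes the NLSY data have been read into a data frame called nlsy with columns wage96 (hourly wages in 1996), height85 (adult height), and a column for adolescent height that we call height81 here; treat these names as assumptions that may differ from the actual data file.

eq51 <- lm(wage96 ~ height85, data = nlsy)              # bivariate model, Equation 5.1
eq52 <- lm(wage96 ~ height85 + height81, data = nlsy)   # multivariate model, Equation 5.2
summary(eq51)
summary(eq52)   # compare the adult and adolescent height coefficients, as in Table 5.2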

The coefficients are estimated using similar logic as for bivariate OLS. We'll discuss estimation momentarily. For now, though, let's concentrate on the differences between bivariate and multivariate results. Both are presented in Table 5.2. The first column shows the coefficient and standard error on β̂1 for the bivariate model with only adult height in the model; these are identical to the results presented in Chapter 3 on page 115. The coefficient of 0.41 implies that each inch of height is associated with an additional 41 cents per hour in wages. The second column shows results from the multivariate analysis; they tell quite a different story. The coefficient on adult height is, at 0.003, essentially zero. The coefficient on adolescent height, in contrast, is 0.48, implying that, controlling for adult height, adult wages were 48 cents higher per hour for each inch taller someone was when younger. The standard error on this coefficient is 0.19 with a t statistic that is higher than 2, implying a statistically significant effect.


Table 5.2: Bivariate and Multiple Multivariate Results for Height and Wages Data

                    Bivariate      Multivariate (a)   Multivariate (b)
Adult height        0.41*          0.003              0.03
                    (0.10)         (0.20)             (0.20)
                    [t = 4.23]     [t = 0.02]         [t = 0.17]
Adolescent height                  0.48*              0.35
                                   (0.19)             (0.19)
                                   [t = 2.49]         [t = 1.82]
Athletics                                             3.02*
                                                      (0.56)
                                                      [t = 5.36]
Clubs                                                 1.88*
                                                      (0.28)
                                                      [t = 6.69]
Constant            -13.09         -18.14*            -13.57*
                    (6.90)         (7.14)             (7.05)
                    [t = 1.90]     [t = 2.54]         [t = 1.92]
N                   1,910          1,870              1,851
σ̂                   11.9           12.0               11.7
R2                  0.01           0.01               0.06

Standard errors in parentheses, * indicates significance at p < 0.05

Figure 5.3 displays the confidence intervals implied by the coefficients and their standard errors. The dots in the figures are placed at the coefficient estimate (e.g., 0.41 for the coefficient on adult height in the bivariate model and 0.003 for the coefficient on adult height in the multivariate model). The lines indicate the range of the 95% confidence interval. As discussed in Chapter 4 on page 181, confidence intervals indicate the range of true values of β most consistent with the observed estimate; they are calculated as β̂ ± 1.96 × se(β̂).


FIGURE 5.3: 95% Confidence Intervals for Coefficients in Adult Height, Adolescent Height, and Wage Models. [The left panel shows the estimated coefficients and 95% confidence intervals from the bivariate model; the right panel shows those from the multivariate model, for adult height and adolescent height.]

The confidence interval for the coefficient on adult height in the bivariate model is clearly positive and relatively narrow and it does not include zero. However, the confidence interval for the coefficient on adult height becomes wide and includes zero in the multivariate model. In other words, the multivariate model suggests that the effect of adult height on wages is small or even zero when controlling for adolescent height. In contrast, the confidence interval for adolescent height is positive, reasonably wide, and far from zero when controlling for adult height. These results suggest that the effect of adolescent height on wages is large and the relationship we see is unlikely to have arisen simply by chance.

In this head-to-head battle of the two height variables, adolescent height wins: The

coefficient on it is large and its confidence interval is far from zero. The coefficient on

adult height, however, is puny and has a confidence interval that clearly covers zero. In

other words, the multivariate model we have estimated is telling us that being tall as a kid

matters more than being tall as a grown-up. This conclusion is quite thought-provoking.

It appears that the height premium in wages does not reflect a height fetish by bosses, but


instead reflects the human capital developed in youth extracurricular activities. Eat your

veggies, make the volleyball team, get rich.

Estimation process for multivariate OLS

Multivariate OLS allows us to keep adding independent variables; that’s where the “multi”

comes from. Whenever we think of another variable that could plausibly be in the error

term and be correlated with the independent variable of interest, we simply add it to the

model (thereby removing it from the error term and eliminating it as a possible source of

endogeneity). Lather. Rinse. Repeat. Do this long enough and we may be able to wash

away sources of endogeneity lurking in the error term. The model will look something like

Yi = β0 + β1X1i + β2X2i + ... + βkXki + εi    (5.3)

where each X is another variable and k is the total number of independent variables. Often a single variable or perhaps a subset of variables are of primary interest. We refer to the other independent variables as control variables, as these are included to control for factors that could affect the dependent variable and be correlated with the independent variables of primary interest. Control variables and control groups are different: A control variable is an additional variable we include in a model, while a control group is the group to which we compare the treatment group in an experiment.1


1 The control variable and control group concepts are related. In an experiment, a control variable is something that is set to be the same for all subjects of the experiment, so that the only difference between treated and untreated groups is the experimental treatment. If we were experimenting on samples in petri dishes, for example, we could treat temperature as a control variable. We would make sure that the temperature is the same for all petri dishes used in the experiment. Hence, the control group has everything similar to the treatment group, except the treatment. In observational studies, we cannot determine the values of other factors, but we can try to net out these other factors such that once we have taken into account these factors, the treated and untreated groups should be the same. In the Christmas shopping example, the dummy variable for December is our control variable. The idea is that once we net out the effect of Christmas on shopping patterns in the United States, retail sales should only differ based on differences in the temperature. If we worry (as we should) that additional factors other than temperature still matter, we should include other control variables until we feel confident that the only remaining difference is due to the variable of interest.

The authors of the height and wage study argue that adolescent height in and of itself was not causing increased wages. Their view is that adolescent height translated into opportunities that provide skills and experience that led to better ability to get high wages later. They view increased participation in clubs and sports activities as a channel for adolescent height to improve wage-increasing human capital. In statistical terms, the claim is that participation in clubs and athletics was a factor in the error term of a model with only adult height and adolescent height. If either height variable is correlated with any of the factors in the error term, we could have endogeneity.

With the right data, we can check the claim that the effect of adolescent height on adult wages is due, at least in part, to the effect of adolescent height on participation in developmentally helpful activities. In this case, the researchers had measures of the number of clubs each person participated in (excluding athletics and academic/honor society clubs) and a dummy variable that indicated whether or not each person participated in high school athletics.


The right-most column of Table 5.2 therefore presents “multivariate (b)” results from a

model that also includes measures of participation in activities as a young person. If the

way for adolescent height to translate into higher wages is truly that tall adolescents have

more opportunities to develop leadership and other skills, then we would expect part of the

adolescent height effect to be absorbed by the additional variables. As we see in the right-

most column, this is part of the story. The coefficient on adolescent height in the multivariate

(b) column goes down to 0.35 with a standard error of 0.19, which is statistically insignificant.

The coefficients on the clubs and athletics variables are 1.88 and 3.02 respectively with

standard errors of 0.28 and 0.56, implying highly statistically significant effects.

By the way, notice the R2s at the bottom of the table. They are 0.01, 0.01, and 0.06. Terrible, right? Recall that R2 is the square of the correlation of observed and fitted observations. (Or, equivalently, these R2 numbers indicate the proportion of the variance of wages explained by the independent variables.) These values mean that even in the best-fitting model the correlation of observed and fitted values of wages is about 0.245 (because √0.06 = 0.245). That's not so hot, but we shouldn't care. That's not how we evaluate models. As discussed on page 108 in Chapter 3, we evaluate the strength of estimated relationships based on coefficient estimates and standard errors, not based on directly looking at R2.

As practical people, we recognize that measuring every possible source of endogeneity in

the error term is unlikely. But if we can measure more variables and pull more factors out


of the error term, our estimates will typically become less biased and be distributed more

closely to the true value. We provide more details when we discuss omitted variable bias in

the next section.

Given how important it is to control for additional variables, we may reasonably wonder about how exactly multivariate OLS controls for multiple variables. Basically, the estimation of the multivariate model follows the same OLS principles used in the bivariate OLS model. Understanding the estimation process is not essential for good analysis per se, but understanding it helps us get comfortable with the model and its fitted values.

First, write out the equation for the residual, which is the difference between actual and fitted values:

ε̂i = Yi − Ŷi
   = Yi − (β̂0 + β̂1X1i + β̂2X2i + ... + β̂kXki)

Second, square the residuals (for the same reasons as on page 71):

ε̂i² = (Yi − (β̂0 + β̂1X1i + β̂2X2i + ... + β̂kXki))²

Multivariate OLS then finds the β̂s that minimize the sum of the squared residuals over all observations. We let computers do that work for us.
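To make the idea of minimizing the sum of squared residuals concrete, here is a small R sketch using made-up data; the variable names and numbers are purely illustrative and not part of the book's examples.

set.seed(123)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 2 + 1.5*x1 - 0.5*x2 + rnorm(100)     # simulated data with known coefficients

ssr <- function(b) sum((y - (b[1] + b[2]*x1 + b[3]*x2))^2)   # sum of squared residuals

optim(c(0, 0, 0), ssr)$par    # numerical search for the coefficients that minimize the SSR
coef(lm(y ~ x1 + x2))         # lm() does the same minimization (up to numerical tolerance)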

The name “ordinary least squares” (OLS) describes the process: ordinary because we

haven’t gotten to the fancy stuff yet, least because we’re minimizing the deviations between

fitted and actual values, and squares because there was a squared thing going on in there.


Like I said earlier, it’s an absurd name. It’s like calling a hamburger a “kill-with-stun-gun-

then-grill-and-put-on-a-bun.” OLS is what people call it, though, so we have to get used to

it.

Remember This
1. Multivariate OLS is used to estimate a model with multiple independent variables.
2. Multivariate OLS fights endogeneity by pulling variables from the error term into the estimated equation.
3. As with bivariate OLS, the multivariate OLS estimation process selects β̂s in a way that minimizes the sum of squared residuals.


Discussion Questions
1. Mother Jones magazine blogger Kevin Drum (2013a, b, c) offers the
following scenario: Suppose we gathered records of a thousand school
children aged 7 to 12 and used a bivariate model and found that heavier
kids scored better on standardized math tests.
a) Based on these results, should we recommend that kids eat lots of potato chips and french fries if they want to grow up to be scientists?
b) Write down a model that embodies Drum’s scenario.
c) Propose additional variables for this model.
d) Would inclusion of additional controls bolster the evidence? Would
doing so provide definitive proof?
2. Researchers from the National Center for Addiction and Substance
Abuse at Columbia University (2011) suggest that time spent on Face-
book and Twitter increases risks of smoking, drinking, and drug use.
They found that compared to kids who spent no time on social net-
working sites, kids who spent time on the sites each day were five times
likelier to smoke cigarettes, three times more likely to drink alcohol, and
twice as likely to smoke pot. The researchers argue that kids who use
social media regularly see others engaged in such behaviors and then
emulate them.
a) Write down the model implied by the above discussion and discuss factors that are in the error term.
b) What specifically has to be true about these factors for their omission to cause bias? Discuss whether these factors will be true for the factors you identify.
c) Discuss which factors could be measured and controlled for and
which would be difficult to measure and control for.


Discussion Questions
3. Suppose we are interested in knowing the relationship between hours
studied and scores on a Spanish exam.
a) Suppose some kids don’t study at all but ace the exam, leading to
a bivariate OLS result that studying has little or no effect on the
score. Would you be convinced by these results?
b) Write down a model and discuss your answer to (a) above in terms
of the error term.
c) What if some kids speak Spanish at home? Discuss implications for a bivariate model that does not include this factor and a multivariate model that controls for this factor.

5.2 Omitted Variable Bias

Another way to think about how multivariate OLS fights bias is by looking at what happens

when we fail to soak up one of the error term variables. That is, what happens if we omit a

variable that should be in the model? In this section we show that omitting a variable that

affects Y and is correlated with X1 will lead to a biased estimate of β1.

Let’s start with a case in which the true model has two independent variables, X1 and

X2 :

Yi = β0 + β1X1i + β2X2i + νi    (5.4)

We assume (for now) the error in this true model, νi, is uncorrelated with X1i and X2i.


(The Greek letter ν is pronounced "new" - even though it looks like a v.) As usual with multivariate OLS, the β1 parameter reflects how much higher Yi would be if we increased X1i by one; β2 reflects how much higher Yi would be if we increased X2i by one.

What happens if we omit X2 and estimate the following model?

Yi = β0^OmitX2 + β1^OmitX2 X1i + εi    (5.5)

where β1^OmitX2 indicates the coefficient on X1i we get when we omit variable X2 from the model. How close will β̂1^OmitX2 be to β1 in Equation 5.4? In other words, will β̂1^OmitX2 be an unbiased estimator of β1? Or, in English: Will our estimate of the effect of X1 suck if we omit X2? We ask questions like this every time we analyze observational data.

It’s useful to first characterize the relationship between the two independent variables,

X1 and X2 . We use an auxiliary regression equation to do so. An auxiliary regression is a

regression that is not directly the one of interest, but yields information helpful in analyzing

the equation we really care about. In this case, we use the following equation to assess how

strongly X1 and X2 are related.

X2i = δ0 + δ1X1i + τi    (5.6)

where δ0 ("delta") and δ1 are coefficients for this auxiliary regression and τi ("tau," rhymes with what you say when you stub your toe) is how we denote the error term (which acts just like the error term in our other equations, but we're trying to make it clear that we're dealing with a different equation). We assume τi is uncorrelated with νi and X1.


This equation for X2i is not based on a causal model. Instead, we are using a regression model to indicate the relationship between the included variable (X1) and the excluded variable (X2). If δ1 = 0, then X1 and X2 are not related. If δ1 is large in magnitude, then X1 and X2 are strongly related.

If we substitute the equation for X2i (Equation 5.6) into the main equation (Equation 5.4), do some re-arranging and a bit of re-labeling, we get

Yi = β0 + β1X1i + β2(δ0 + δ1X1i + τi) + νi
   = (β0 + β2δ0) + (β1 + β2δ1)X1i + (β2τi + νi)
   = β0^OmitX2 + β1^OmitX2 X1i + εi

This means that

β1^OmitX2 = β1 + β2δ1    (5.7)

where β1 and β2 come from the main equation (Equation 5.4) and δ1 comes from the equation for X2i (Equation 5.6).2

Given our assumption that τ and ν are not correlated with any independent variable, we can use our bivariate OLS results to know that β̂1^OmitX2 will be distributed normally with mean of β1 + β2δ1. In other words, when we omit X2, the distribution of the estimated coefficient on X1 will be skewed away from β1 by a factor β2δ1. This is omitted variable bias.
2 In the last line we replace β2τi + νi with εi. If, as we're assuming here, τi and νi are uncorrelated with each other and uncorrelated with X1, then the sum of them has the same properties (even when τi is multiplied by β2).


In other words, when we omit X2, the coefficient on X1 (which is β1^OmitX2) will pick up not only β1, which is the effect of X1 on Y, but also β2, which is the effect of the omitted variable X2 on Y. The extent to which β1^OmitX2 picks up the effect of X2 depends on δ1, which characterizes how strongly X2 and X1 are related.

This result is consistent with our intuition about endogeneity: When X2 is omitted and thereby relegated to the error, we won't be able to understand the true relationship between X1 and Y to the extent that X2 is correlated with X1.3
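A quick simulation can make the bias formula in Equation 5.7 tangible. The R sketch below uses made-up data with β1 = 2, β2 = 3, and δ1 = 0.5, so the coefficient on X1 in the model that omits X2 should land near β1 + β2δ1 = 3.5; all names and values here are illustrative, not from the book's data.

set.seed(42)
n      <- 10000
beta1  <- 2      # true effect of X1 on Y
beta2  <- 3      # true effect of X2 on Y
delta1 <- 0.5    # relationship between X2 and X1 (Equation 5.6, with delta0 = 0)
x1 <- rnorm(n)
x2 <- delta1 * x1 + rnorm(n)
y  <- 1 + beta1 * x1 + beta2 * x2 + rnorm(n)   # Equation 5.4 with beta0 = 1
coef(lm(y ~ x1 + x2))["x1"]   # close to beta1 = 2
coef(lm(y ~ x1))["x1"]        # close to beta1 + beta2*delta1 = 3.5 (omitted variable bias)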

There are two ways to kill omitted variable bias. First, if β2 = 0 then β2δ1 = 0 and there is no omitted variable bias. This is easy to explain: if the omitted variable X2 has no effect on Yi (which is the implication of β2 = 0), then there will be no omitted variable bias. It's kind of cheating because we're saying if you omit a variable that really shouldn't have been in the model, then you will not have omitted variable bias. It's like saying that you won't gain weight from eating ice cream if the name of your next door neighbor is Neil. Nonetheless, it's a helpful starting point because it clarifies that a variable has to matter for its omission to cause bias.

The more interesting way to kill omitted variable bias is if δ1 = 0. The parameter δ1 from Equation 5.6 tells us how strongly X1 and X2 are related. If X1 and X2 are not related, then δ1 is zero. This, in turn, means β̂1^OmitX2 will be an unbiased estimate of β1 from Equation 5.4, the true effect of X1 on Y, even though we omitted X2 from the model.
3 We derive this result more formally on page 721.


In other words, if the omitted variable is not correlated with the included variable, then no harm and no foul.

This discussion relates perfectly to our theme of endogeneity. If a variable is omitted, it ends up in the error term. If the omitted variable hanging out in the error term is correlated with the included variable (which means δ1 ≠ 0), then we have endogeneity and we have bias. We now have an equation that tells us the extent of the bias. If, on the other hand, the omitted variable hanging out in the error term is not correlated with the included variable (which means δ1 = 0), then we do not have endogeneity and we do not have bias. Happy, happy, happy.

If either of these two conditions holds, there is no omitted variable bias. In most cases, though, we can't be sure that at least one condition holds because we don't actually have a measure of the omitted variable. In that case, we can use omitted variable bias concepts to speculate on the magnitude of the bias. The magnitude of bias depends on how much the omitted variable explains Y (which is determined by β2) and how much the omitted variable is related to the included variable (which is reflected in δ1). Sometimes we can come up with possible bias but believe that β2 or δ1 is small, meaning that we shouldn't lose too much sleep over bias. On the other hand, in other cases, we might think β2 and δ1 are huge. Hello, insomnia.


Omitted variable bias in more complicated models

In Chapter 14 we cover additional topics related to omitted variable bias. On page 728 we

discuss how to use the bias equation to anticipate whether omission of a variable will cause

the estimated coefficient to be higher or lower than it should be. On page 729 we discuss the

more complicated case in which the true model and estimated model have more variables.

In these situations, things get a little harder to predict than in the case we have discussed.

As a general matter, bias usually (but not always) goes down when we add variables that

explain the dependent variable.

Remember This
Two conditions must both be true for omitted variable bias to occur:
1. The omitted variable affects the dependent variable.
• Mathematically: β2 ≠ 0 in Equation 5.4 on page 209.
• An equivalent way to state this condition is that X2i really should have been in Equation 5.4 in the first place.
2. The omitted variable is correlated with the included independent variable.
• Mathematically: δ1 ≠ 0 in Equation 5.6 on page 210.
3. Omitted variable bias is more complicated in models with more independent variables, but the main intuition applies.


Case Study: Does Education Support Economic Growth?

Does more education lead to more economic growth? A standard way to look at this question is via so-called growth equations in which the average growth of countries over some time period is the dependent variable. Hanushek and Woessmann (2009) put together a data set on economic growth of 50 countries from 1960 to 2000. The basic model is

Growth from 1960 to 2000i = β0 + β1 Average years of educationi + β2 GDP per capita in 1960i + εi

The data is structured such that even though data exists on the economic growth in

these countries for each year, we are looking only at the average growth rate across the forty

years from 1960 to 2000. Thus each country gets only a single observation. We control for

GDP per capita in 1960 because of a well-established phenomenon that countries that were

wealthier in 1960 have a slower growth rate. The poor countries simply have more economic

capacity to grow. The main independent variable of interest at this point is average years of

education; it measures education across countries.

The results in the left-hand column of Table 5.3 suggest that additional years of schooling promote economic growth. The β̂1 estimate implies that each additional average year of schooling within a country is associated with 0.44 percentage points higher annual economic growth. With a t statistic of 4.22, this is a highly statistically significant result. Using the standard error and techniques from page 181 we can calculate the confidence interval to be from 0.24 to 0.65.


Table 5.3: Economic Growth and Education Using Multiple Measures of Education

                           Without math/science   With math/science
                           test scores            test scores
Avg. years of school       0.44*                  0.02
                           (0.10)                 (0.08)
                           [t = 4.22]             [t = 0.28]
Math/science test scores                          1.97*
                                                  (0.24)
                                                  [t = 8.28]
GDP in 1960                -0.39*                 -0.30*
                           (0.08)                 (0.05)
                           [t = 5.19]             [t = 6.02]
Constant                   1.59*                  -4.76*
                           (0.54)                 (0.84)
                           [t = 2.93]             [t = 5.66]
N                          50                     50
σ̂                          1.13                   0.72
R2                         0.36                   0.74

Standard errors in parentheses, * indicates significance at p < 0.05

Sounds good: More education, more growth. Nothing more to see here, right? Not to

Hanushek and Woessmann. Their intuition was that not all schooling is equal. They were

skeptical that simply sitting in class and racking up the years improves economically useful

skills and argued that we should assess whether quality of education made a difference, not

simply the quantity of it. As their measure of quality, they used average math and science

test scores.


Before getting to their updated model, it’s useful to get a feel for the data. Panel (a) of

Figure 5.4 shows a scatterplot of economic growth and average years of schooling. There’s

not an obvious relationship. (The strong positive coefficient we observe in the first column

of Table 5.3 is due to the fact that GDP in 1960 was also controlled for.) Panel (b) of Figure

5.4 shows a scatterplot of economic growth and average test scores. The observations with

high test scores also tended to have high economic growth, suggesting a relationship between

the two.

Could it be that the real story is that test scores explain growth, not years in school? If

so, why is there a significant coefficient on years of schooling in the first column of Table

5.3? We know the answer: Omitted variable bias. As discussed on page 209, if a variable

that matters (and we suspect test scores matter) is omitted, the estimate of the effect of the

variable that is included will be biased if the omitted variable is correlated with the included

variable. To address this issue, panel (c) of Figure 5.4 shows a scatterplot of average test

scores and average years of schooling. Yes, indeed, these variables look quite correlated as

observations with high years of schooling also tend to have high test scores. Hence, the

omission of test scores could be problematic.

Therefore it makes sense to add test scores to the model, as shown in the right-hand column of Table

5.3. The coefficient on years of schooling differs markedly from before. It is now very close

to zero. The coefficient on average test scores, on the other hand, is 1.97 and statistically

significant with a t statistic of 8.28.


[Figure 5.4: Economic Growth, Years of School, and Test Scores. Scatterplots of (a) average economic growth (in %) against average years of school, (b) average economic growth (in %) against average test scores, and (c) average test scores against average years of school.]


Because the scale of the test score variable is not immediately obvious, we need to do

a bit of work to interpret the substantive significance of the coefficient estimate. Based on

descriptive statistics (not reported), the standard deviation of the test score variable is 0.61.

The results therefore imply that increasing average test scores by a standard deviation is

associated with a 0.61 × 1.97 = 1.20 percentage point increase in the average annual growth rate over these forty years. This increase is large when we are talking about growth

compounding over forty years.4

Notice the very different story we have across the two columns. In the first one, years of

schooling is enough for economic growth. In the second specification, quality of education

as measured with math and science test scores matters more. The second specification is

better because it shows that a theoretically sensible variable matters a lot. Excluding this

variable, as the first specification does, risks omitted variable bias. In short, these results

suggest education is about quality, not quantity. High test scores explain economic growth

better than years in school. Crappy schools do little; good ones do a lot. These results don’t

end the conversation about education and economic growth, but they do move it ahead a

few more steps.


4The scale of the test score variable is different from the years in school variable so that we cannot directly compare
the two coefficients. Section 7.3 of Chapter 7 shows how to make such comparisons.


5.3 Measurement Error

We can apply omitted variable concepts to understand the effects of measurement error on

our estimates. Measurement error is pretty common; it occurs when a variable is measured

inaccurately.

In this section we define the problem, show how to think of it as an omitted variables

problem, and then characterize the nature of the bias caused when independent variables

are measured with error.

Quick: How much money is in your bank account? It's pretty hard to recall the exact

amount (unless it’s zero!). So a survey of wealth relying on people to recall their savings

is probably going to have at least a little error and maybe a lot (especially as people start

getting squirrely about talking about money and some overreport and some underreport).

And many, perhaps even most, variables could have error. Just think how hard it would be

to accurately measure spending on education or life expectancy or attitudes toward Justin

Bieber in an entire country.

Measurement error in the dependent variable

OLS will do just fine if the measurement error is only in the dependent variable. In this case, the measurement error is simply part of the overall error term. The bigger the error, the bigger the variance of the error term. We know that in bivariate OLS, a larger variance of the error term leads to a larger σ̂², which increases the variance of β̂ (see page 97). This intuition carries over to multivariate OLS, as we see in Section 5.4.
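A small simulation can make this concrete. In the sketch below (not from the text; invented numbers), adding noise to Y leaves the slope estimate centered on its true value of 2 but inflates its standard error:

    # Sketch: measurement error in the dependent variable
    set.seed(5)
    n       <- 5000
    x       <- rnorm(n)
    y       <- 1 + 2 * x + rnorm(n)
    y.noisy <- y + rnorm(n, sd = 2)   # Y measured with error

    summary(lm(y ~ x))$coefficients["x", ]        # estimate near 2, smaller standard error
    summary(lm(y.noisy ~ x))$coefficients["x", ]  # still near 2, larger standard error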

Measurement error in the independent variable

OLS will not do so well if the measurement error is in an independent variable. In this case, the OLS estimate will systematically under-estimate the magnitude of the coefficient. To see why, suppose the true model is

Yi = β0 + β1 X1i* + εi

where we use the asterisk in X1i* to indicate that we do not observe this variable directly. For this section, we assume that εi is uncorrelated with X1i* so that we can concentrate on measurement error.

Instead we observe our independent variable with error; that is, we observe some X1 that is a function of the true value X1* and some error. For example, suppose we observe reported savings rather than actual savings:

X1i = X1i* + νi

We keep things simple here by assuming that the measurement error (νi) has a mean of zero and is uncorrelated with the true value.

Notice that we can re-write X1i* as the observed value (X1i) minus the measurement error:

X1i* = X1i − νi

Substitute for X1i* in the true model, do a bit of re-arranging, and we get

Yi = β0 + β1(X1i − νi) + εi
   = β0 + β1 X1i − β1 νi + εi                     (5.8)

The trick here is to think of this example as an omitted variable problem where νi is the omitted variable. We don't observe the measurement error directly, right? If we could observe it, we would fix our darn measure of X1. So what we do is treat the measurement error as an unobserved variable that by definition we must omit and see how this particular form of omitted variable bias plays out. Compared to a generic omitted variable bias problem, we know two things that allow us to be more specific than in the general omitted variable case: the coefficient on the omitted term (νi) is β1, and νi relates to X1 as in Equation 5.8.

We go step by step through the logic and math on page 732 in Chapter 14. The upshot is that as the sample size gets very large, the estimated coefficient when the independent variable is measured with error is

plim β̂1 = β1 · σ²_{X1*} / (σ²_ν + σ²_{X1*})

where plim is the probability limit as discussed in Section 3.5.

Notice that β̂1 converges to the true coefficient times a quantity that can be no bigger than one.

The equation becomes quite intuitive if we look at two extreme scenarios. If σ²_ν is zero, the measurement error has no variance and must always equal zero (given that we assumed it is a mean-zero random variable). In this case σ²_{X1*}/(σ²_ν + σ²_{X1*}) will equal one (assuming σ²_{X1*} is not zero, which is simply assuming X1* varies). In other words, if there is no error in the measured value of X1 (which is what σ²_ν = 0 means), then plim β̂1 = β1 and our estimate of β1 will converge to the true value as the sample gets larger. This conclusion makes sense: No measurement error, no problem. OLS will happily produce an unbiased estimate.

On the other hand, if σ²_ν is huge relative to σ²_{X1*}, the measurement error varies a lot compared to X1*. In this case, σ²_{X1*}/(σ²_ν + σ²_{X1*}) will be less than one and could be near zero, which means that the probability limit of β̂1 will be smaller than the true value. This result also makes sense: If the measurement of the independent variable is junky, we'll not be able to see the true effect of that variable on Y.

We refer to this particular example of omitted variable bias as attenuation bias because when we omit the measurement error term from the model, our estimate β̂1 deviates from the true value by a multiplicative factor between zero and one. This means that β̂1 will be closer to zero than it should be when X1 is measured with error. If the true value of β1 is some positive number, we see values of β̂1 less than they should be. If the true value of β1 is negative, we see values of β̂1 larger (meaning closer to zero) than they should be.
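To see attenuation bias in action, here is a minimal R sketch with invented numbers. The true slope is 3; with measurement error whose variance equals the variance of X1*, the estimate should converge to roughly 3 × 1/(1 + 1) = 1.5:

    # Sketch: attenuation bias from measurement error in an independent variable
    set.seed(42)
    n      <- 10000
    x.true <- rnorm(n)                    # the unobserved X1*, variance 1
    y      <- 2 + 3 * x.true + rnorm(n)   # true beta1 = 3
    x.obs  <- x.true + rnorm(n)           # observed X1 = X1* + noise, noise variance 1

    coef(lm(y ~ x.true))["x.true"]  # close to 3
    coef(lm(y ~ x.obs))["x.obs"]    # close to 1.5, the attenuated value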


Remember This
1. Measurement error in the dependent variable does not bias β̂ coefficients, but does
   increase the variance of the estimates.
2. Measurement error in an independent variable causes attenuation bias. That is,
   when X1 is measured with error, β̂1 will be closer to zero than it should be.
   • The attenuation bias is a consequence of the omission of the measurement
     error from the estimated model.
   • The larger the measurement error, the larger the attenuation bias.

5.4 Precision and Goodness of Fit

Precision is crucial for hypothesis tests and confidence intervals. In this section we show that var(β̂) in multivariate OLS inherits the intuitions of var(β̂) in bivariate OLS but is also influenced by the extent to which the multiple independent variables co-vary together. We also discuss goodness of fit in the multivariate model and, in particular, what happens when we include independent variables that don't explain the dependent variable at all.

Variance of coefficient estimates

The variance of coefficient estimates for the multivariate model is similar to the variance of β̂1 for the bivariate model. As with the variance of β̂1 in bivariate OLS, the equation we present applies when errors are homoscedastic and not correlated with each other. Things get more complicated when errors are heteroscedastic or correlated with each other, but the intuitions developed below still apply.

We denote the coefficient of interest as β̂j to indicate it is the coefficient associated with the jth independent variable. The variance of the coefficient on the jth independent variable is

var(β̂j) = σ̂² / (N × var(Xj)(1 − R²j))                     (5.9)

This equation is similar to the equation for variance of β̂1 in bivariate OLS (Equation 3.9 on page 95). The new bit relates to the (1 − R²j) in the denominator. Before elaborating on R²j, let's note the parts from the bivariate variance equation that carry through to the multivariate context.

• In the numerator we see σ̂², which means that the higher the variance of the regression, the higher the variance of the coefficient estimate. Because σ̂² measures the average squared deviation of the fitted value from the actual value (σ̂² = Σ_{i=1}^{N}(Yi − Ŷi)²/(N − k)), all else equal β̂j will be more precise the better our variables are able to explain the dependent variable. This point is particularly relevant for experiments. As long as the experiment was not plagued by attrition, balance, or compliance problems, we are not worried about endogeneity and hence do not need to add control variables to avoid bias. Multivariate OLS does help in experiments, however, by improving the fit of the model, thus reducing σ̂² and therefore giving us more precise estimates of the coefficients of interest.

• In the denominator we see the sample size, N. As for the bivariate model, as we get more data this term in the denominator gets bigger, making var(β̂j) smaller. In other words, more data means more precise estimates.

• The greater the variation of Xj (as measured by Σ_{i=1}^{N}(Xij − X̄j)²/N for large samples), the bigger the denominator will be. The bigger the denominator, the smaller var(β̂j) will be.

Multicollinearity

The new element in Equation 5.9 compared to the earlier variance equation is the (1 − R²j). Notice the j subscript. We use the subscript to indicate that R²j is the R² from an auxiliary regression in which Xj is the dependent variable and all the other independent variables in the full model are the independent variables in the auxiliary model. The R² without the j is still the R² for the main equation as discussed on page 111.

There is a different R²j for each independent variable. For example, if our model is

Yi = β0 + β1 X1i + β2 X2i + β3 X3i + εi                     (5.10)

there will be three different R²j's:

• R²1 is the R² from X1i = γ0 + γ1 X2i + γ2 X3i + εi, where the γ (the Greek letter gamma) parameters are estimated coefficients from OLS. We're not really interested in the value of these parameters. We're not making any causal claims about this model and are just using them to understand the correlation of independent variables (which is measured by the R²j). (We're being a bit loose notationally and re-using the γ notation in each equation.)

• R²2 is the R² from X2i = γ0 + γ1 X1i + γ2 X3i + εi

• R²3 is the R² from X3i = γ0 + γ1 X1i + γ2 X2i + εi

These R²j tell us how much the other variables explain Xj. If the other variables explain Xj very well, the R²j will be high and – here's the key insight – the denominator will be smaller. Notice that the denominator of the equation for var(β̂j) has (1 − R²j). Remember any R² is between 0 and 1, so as R²j gets bigger, 1 − R²j gets smaller, which in turn makes var(β̂j) bigger. The intuition is that if variable Xj is virtually indistinguishable from the other independent variables, it makes sense that it is hard to tell how much that variable affects Y and we will, therefore, have a larger var(β̂j).

In other words, when an independent variable is highly related to other independent

variables, the variance of the coefficient we estimate for that variable will be high. We

use a fancy statistical term, multicollinearity, to refer to situations in which independent

variables have strong linear relationships. The term comes from “multi” for multiple variables

and “co-linear” because they vary together in a linear fashion. The polysyllabic jargon should

not hide a simple fact: The variance of our estimates increases when an independent variable

is closely related to other independent variables.


The term 1/(1 − R²j) is referred to as the variance inflation factor (VIF). It measures how much variance is inflated due to multicollinearity relative to a case in which there is no multicollinearity.
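For readers who want to see the pieces, here is a short R sketch (simulated data, not the book's) that computes R²j from an auxiliary regression and converts it into a VIF by hand:

    # Sketch: R2_j and the variance inflation factor for x1
    set.seed(2)
    n  <- 1000
    x2 <- rnorm(n)
    x3 <- rnorm(n)
    x1 <- 0.8 * x2 + 0.5 * x3 + rnorm(n)        # x1 is partly explained by x2 and x3

    r2j <- summary(lm(x1 ~ x2 + x3))$r.squared  # R2 from the auxiliary regression
    1 / (1 - r2j)                               # variance inflation factor for x1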

It's really important to understand what multicollinearity does. It does not cause bias. It doesn't even cause the standard errors of β̂1 to be incorrect. It simply causes the standard errors to be bigger than they would be if there were no multicollinearity. In other words, OLS is on top of the whole multicollinearity thing, producing estimates that are unbiased with appropriately calculated uncertainty. It's just that when variables are strongly related to each other we're going to have more uncertainty – the distributions of β̂1 will be wider, meaning that it will be harder to learn from the data.

What, then, should we do about multicollinearity? If we have a lot of data, our standard

errors may be small enough to make reasonable inferences about the coefficients on the

collinear variables. In that case, we do not have to do anything. OLS is fine and we’re

perfectly happy. Both of our empirical examples in this chapter are consistent with this

scenario. In the height and wages analysis in Table 5.2, adult height and adolescent height

are highly correlated (at 0.86, actually) and yet the actual effects of these two variables are

so different that we can parse out their differential effects with the amount of data we have.

In the education and economic growth analysis in Table 5.3, the years of school and test

score variables are correlated at 0.81 and yet the effects are different enough that we can

parse out differential effect of these two variables with the data we have.

However, if we have substantial multicollinearity we may get very large standard errors on the collinear variables, making us unable to say much about any one of the variables. Some are tempted in such cases to drop one or more of the highly multicollinear variables and to focus only on the results for the remaining variables. This isn't quite fair, as we may not have solid evidence to know which variables we should drop and which we should keep.

A better approach is to be honest: We should just say that the collinear variables taken as

a group seem to matter or not and that we can’t parse out the individual effects of these

variables.

For example, suppose we are interested in predicting undergraduate grades as a function of two variables: scores from a standardized math test and scores from a standardized verbal reasoning test. Suppose also that these test score variables are highly correlated and that when we run a model with both variables as independent variables, they are both statistically insignificant, in part because the standard errors will be very high due to the high R²j values. If we drop one of the test scores, the remaining test score variable may be statistically significant, but it would be poor form to believe, then, that only that test score affected undergraduate grades. Instead, we should use the tools we present later in Section 7.4 on page 342 that allow us to assess whether both variables taken together explain grades. At that point, we may be able to say that we know standardized test scores matter, but that we cannot say much about the relative effect of math versus verbal test scores. So even though it'd be more fun to say which test score matters, the statistical evidence may simply not be there for us to make that distinction.
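A hedged sketch of what this looks like in R, using simulated data in the spirit of the math/verbal example (the joint test itself is covered properly in Section 7.4):

    # Sketch: two collinear test scores that matter jointly but not individually
    set.seed(3)
    n      <- 200
    math   <- rnorm(n)
    verbal <- 0.9 * math + 0.3 * rnorm(n)              # highly correlated with math
    gpa    <- 2 + 0.3 * math + 0.3 * verbal + rnorm(n)

    full       <- lm(gpa ~ math + verbal)
    restricted <- lm(gpa ~ 1)
    summary(full)            # individual t statistics may look weak
    anova(restricted, full)  # joint F test: do the scores matter as a group?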


A lethal dose of multicollinearity is called perfect multicollinearity, which occurs when an independent variable is completely explained by other independent variables. If this happens, R²j = 1 and var(β̂j) blows up due to having (1 − R²j) in the denominator (in the sense that the denominator becomes zero, which is a big no-no). In this case, statistical software will either refuse to estimate the model or will automatically delete enough independent variables such that there is no perfect multicollinearity. A silly example of perfect multicollinearity is when someone includes the same variable twice in a model.

Goodness of fit

Let's talk about the regular old R², the one without a j subscript. As with the R² for a bivariate OLS model, the R² for a multivariate OLS model measures goodness of fit and is the square of the correlation of the fitted values and actual values (see Section 3.7).5 As before, it can be interesting to know how well the model explains the dependent variable, but this information is often not particularly useful. A good model can have a low R² and a biased model can have a high R².

There is one additional wrinkle for R² in the multivariate context. Adding a variable to a model necessarily makes the R² go up, at least by a tiny bit. To see why, notice that OLS minimizes the sum of squared errors. If we add a new variable, the fit cannot be worse than before because we can simply set the coefficient on this new variable to be zero, which is equivalent to not having the variable in the model in the first place. In other words, every time we add a variable to a model we do no worse and, as a practical matter, we do at least a little better even if the variable doesn't truly affect the dependent variable. Just by chance, estimating a non-zero coefficient on this variable will typically improve the fit for a couple of observations. Hence, R² is always the same or larger as we add variables.

5 The model needs to have a constant term for this interpretation to work – and for R² to be sensible.

Devious people therefore think "Aha, I can boost my R² by adding variables." First of all, who cares? R² isn't directly useful for much. Second of all, that's cheating. Therefore, most statistical software programs report so-called adjusted R² results. This measure is based on the R² but lowers the value depending on how many variables are in the model. The adjustment is ad hoc, and different people do it in different ways. The idea behind the adjustment is perfectly reasonable, but it's seldom worth getting too worked up about it. It's like electronic cigarettes. Yes, smoking them is less bad than smoking regular cigarettes, but, really, why do it at all?
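A short R sketch (invented data) illustrates both points: R² cannot fall when a pure-noise variable is added, while adjusted R² typically does fall:

    # Sketch: R-squared versus adjusted R-squared with an irrelevant variable
    set.seed(4)
    n     <- 100
    x     <- rnorm(n)
    y     <- 1 + 2 * x + rnorm(n)
    noise <- rnorm(n)                          # an irrelevant variable

    summary(lm(y ~ x))$r.squared               # baseline R-squared
    summary(lm(y ~ x + noise))$r.squared       # never smaller, usually a bit larger
    summary(lm(y ~ x + noise))$adj.r.squared   # penalized for the extra variable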

Inclusion of irrelevant variables

The equation for the variance of β̂j is also helpful for understanding what happens when we include an irrelevant variable, meaning we add a variable to the model that shouldn't be there. Whereas our omitted variable discussion was about what happens when we exclude a variable that should be in the model, here we want to know what happens when we include a variable that should not be in the model.


Including an irrelevant variable does not cause bias. We can think of the situation as if we wrote down a model and the correct coefficient on the irrelevant variable happens to be zero. That doesn't cause bias; it's just another variable, we should get an unbiased estimate of that coefficient, and including this variable will not create endogeneity.

It might seem therefore that the goal is simply to add as many variables as we can get our hands on. That is, the more we control for, the less likely there are to be factors in the error term that are correlated with the independent variable of interest. The reality is different. Including an irrelevant variable is not harmless. Doing so makes our estimates less precise because including it will necessarily increase R²j, since R² always goes up when variables are added.6 This conclusion makes sense: The more we clutter up our analysis with variables that don't really matter, the harder it is to see a clear relationship between a given variable and the dependent variable.

6 Our discussion just above was about the regular R², but it also applies to any R² (from the main equation or an auxiliary equation). R² goes up as the number of variables increases.


Remember This
1. If errors are not correlated with each other and are homoscedastic, the variance of the β̂j estimate is

   var(β̂j) = σ̂² / (N × var(Xj)(1 − R²j))

2. Four factors influence the variance of multivariate β̂j estimates.
   (a) Model fit: The better the model fits, the lower will be σ̂² and var(β̂j).
   (b) Sample size: The more observations, the lower will be var(β̂j).
   (c) Variation in X: The more the Xj variable varies, the lower will be var(β̂j).
   (d) Multicollinearity: The less the other independent variables explain Xj, the
       lower will be R²j and var(β̂j).
3. Independent variables are multicollinear if they are correlated.
   (a) The variance of β̂1 is higher when there is multicollinearity than when there
       is no multicollinearity.
   (b) Multicollinearity does not bias β̂1 estimates.
   (c) The se(β̂1) produced by OLS accounts for multicollinearity.
   (d) An OLS model cannot be estimated when there is perfect multicollinearity,
       meaning that an independent variable is perfectly explained by one or more
       of the other independent variables.
4. Inclusion of irrelevant variables occurs when variables that do not affect Y are
   included in a model.
   (a) Inclusion of irrelevant variables causes the variance of β̂1 to be higher than if
       the variables were not included.
   (b) Inclusion of irrelevant variables does not cause bias.
5. The variance of β̂j is more complicated when errors are correlated or heteroscedastic,
   but the intuitions about model fit, sample size, variance of X, and multicollinearity
   still apply.


Discussion Questions
1. How much will other variables explain Xj when Xj is a randomly assigned
   treatment? Approximately what will R²j be?
2. Suppose we are designing an experiment in which we can determine
   the value of all independent variables for all observations. Do we want
   the independent variables to be highly correlated or not? Why or why
   not?


Case Study: Institutions and Human Rights

Governments around the world all too often violate basic human rights. What deters them from doing so? Many believe an independent judiciary constrains governments from such bad behavior.

This hypothesis offers a promising opportunity for statistical analysis. Our dependent variable can be Human rights_st, a measure of human rights for country s at time t based on rights enumerated in United Nations treaties. Our independent variable can be Judicial independence_st, which measures judicial independence for country s at time t based on the tenure of judges and the scope of judicial authority.7

We pretty quickly see that a bivariate model will be insufficient. What factors are in the

error term? Could they be correlated with judicial independence? Experience seems to show

that human rights violations occur less in wealthy countries. Wealthy countries also tend

to have more independent judiciaries. In other words, omission of country wealth plausibly

satisfies conditions for omitted variable bias to occur: The variable influences the dependent
7 This example is based on La Porta et al (2004). Measurement of abstract concepts like human rights and judicial
independence is not simple. See Harvey (2011) for more details.


variable and is correlated with the independent variable in question.

Therefore it is a good idea to control for wealth when looking at the effect of judicial

independence on human rights. The left column of Table 5.4 presents results from such a

model. Wealth is measured by per capita GDP. The coefficient on judicial independence

is 11.37, suggesting that judicial independence does indeed improve human rights. The t

statistic is 2.53 so we reject the null hypothesis that the effect of judicial independence is

zero.

Is this the full story? Is there some omitted variable that affects human rights (the depen-

dent variable) that is correlated with judicial independence (the key independent variable)?

If there is, there could be a spurious association between judicial independence and human

rights protection when this other variable is omitted.

New York University professor Anna Harvey (2011) proposes exactly such a critique. She

argues that democracy might protect human rights and that the degree of democracy in a

country could be correlated with judicial independence.

Before we discuss what Harvey found, let’s think about what would have to be true if

omitting a measure of democracy is indeed causing bias using our conditions on page 214.

First, the level of democracy in a country actually needs to affect the dependent variable,

human rights (this is the β2 ≠ 0 condition). Is that true here? Very plausibly. We don't

know beforehand, of course, but it certainly seems possible that torture tends not to be a

great vote-getter. Second, democracy needs to be correlated with the independent variable


Table 5.4: Effects of Judicial Independence on Human Rights - Including Democracy Variable

                          without democracy     with democracy
                          variable              variable
Judicial independence     11.37*                1.03
                          (4.49)                (3.15)
                          [t = 2.53]            [t = 0.33]
Log GDP per capita        9.77*                 1.07
                          (1.36)                (4.49)
                          [t = 7.20]            [t = 0.82]
Democracy                                       24.93*
                                                (2.77)
                                                [t = 9.01]
Constant                  -22.68                30.97*
                          (12.57)               (10.15)
                          [t = 1.80]            [t = 3.05]
N                         63                    63
σ̂                         17.6                  11.5
R²                        0.47                  0.78
R²_Judicial ind.          0.153
R²_Log GDP                0.553
R²_Democracy              0.552
Standard errors in parentheses, * indicates significance at p < 0.05

of interest, which in this case is judicial independence. This we know is almost certainly

true: democracy and judicial independence definitely seem to go together in the modern

world. In Harvey’s data, democracy and judicial independence correlate at 0.26; not huge,

but not nuthin’. Therefore we have a legitimate candidate for omitted variable bias.

The right-hand column of Table 5.4 shows that Harvey’s intuition was right. When the

democracy measure is added, the coefficients on both judicial independence and GDP per


capita fall precipitously. The coefficient on democracy, however, is 24.93 with a t statistic of

9.01, a highly statistically significant estimate.

Statistical significance is not the same as substantive significance. Let’s try to interpret

our results in a more meaningful way. If we generate descriptive statistics for our human

rights dependent variable, we see that it ranges from 17 to 99, with a mean of 67 and a

standard deviation of 24. Doing the same for the democracy variable indicates that it ranges

from 0 to 2 with a mean of 1.07 and a standard deviation of 0.79. A coefficient of 24.93

implies that a one standard deviation change in the democracy measure is associated with a 24.93 × 0.79 = 19.7 increase on the human rights scale. Given that the standard deviation of the dependent variable is 24, this is a pretty sizable association between democracy and human rights.8

This is a textbook example of omitted variable bias.9 When democracy is not accounted

for, judicial independence is strongly associated with human rights. When democracy is

accounted for, however, the effect of judicial independence fades to virtually nothing. And,

this is not just about statistics. How we view the world is at stake, too. The conclusion from
8 Determining exactly what is a substantively large effect can be subjective. There’s no rule book on what is large.
Those who have worked in a substantive area for a long time often get a good sense of what are large effects. An
effect might be considered large if it is larger than the effect of other variables that people think are important. Or
an effect might be considered large if we know that the benefit is estimated to be much higher than the cost. In the
human rights case, we can get a sense of what a 19.7 change in human rights scale means by looking at countries
that were around 20 points different on that scale. Pakistan was 22 points higher than North Korea. Decide if it
would make a difference to vacation in North Korea or Pakistan. If it would make a difference, then 19.7 is a large
difference; if not, then it’s not.
9 Or, it is now...


the initial model was that courts protect human rights. The additional statistical analysis

suggests that democracy protects human rights.

The example also highlights the somewhat provisional nature of social scientific conclu-

sions. Someone may come along with a variable to add or another way to analyze our data

that will change our conclusions. That is the nature of the social scientific process. We do

the best we can, but we leave room (sometimes a little, sometimes a lot) for a better way to

understand what is going on.

Table 5.4 also includes some diagnostics to help us think about multicollinearity, for surely

things like judicial independence, democracy, and wealth are correlated. Before looking at

specific diagnostics, we should note that collinearity of independent variables does not cause

bias. It doesn’t even cause the variance equation to be wrong. Instead, multicollinearity sim-

ply causes the variance to be higher than if there were no collinearity among the independent

variables.

Toward the bottom of the table we see that R²_Judicial ind. is 0.153. This value is the R² from an auxiliary regression in which judicial independence is the dependent variable and the GDP and democracy variables are the independent variables. This value isn't particularly high, and if we plug it into the equation for the variance inflation factor (VIF) (which is just the part of the variance of β̂j associated with multicollinearity) we see that the VIF for the judicial independence variable is 1/(1 − R²j) = 1/(1 − 0.153) = 1.18. In other words, the variance of the coefficient on the judicial independence variable is 1.18 times larger than it would


have been if the judicial independence variable were completely uncorrelated with the other independent variables in the model. That's pretty small. The R²_Log GDP is 0.553. This value corresponds to a VIF of 2.24, which is higher but still not in a range people get too worried about. And, just to reiterate, this is not a problem to be corrected. Rather, it is simply noting that one source of variance of the coefficient estimate on GDP is multicollinearity. Another source is the sample size and another is the fit of the model (indicated by σ̂, which indicates that the fitted values are, on average, roughly 11.5 units away from the actual values).

5.5 Model Specification

Multivariate OLS allows us to include multiple independent variables. Common sense, boosted by our irrelevant variable results in the previous section, suggests that we cannot include every variable we can find.

That means we have to choose. We call the process model specification because it

is the process of specifying which variables we include in the model.10 This process is

tricky. Political scientist Phil Schrodt (2010) has noted that most experienced statistical
10Model specification also includes deciding on the functional form of the model. We discuss these issues in Chapter
7.


analysts have witnessed cases in which "even minor changes in model specification can lead

to coefficient estimates that bounce around like a box full of gerbils on methamphetamines.”

This is an exaggeration – perhaps a box of caffeinated chinchillas is more like it – but there

is a certain truth behind his claim.

In this section we discuss three dangers in model specification and how to conduct and

report results in a way that minimizes these dangers.

Three model specification challenges

One of the most invidious problems in model specification is model-fishing. Model-fishing

occurs when researchers add and subtract variables until they get just the answers they

were looking for. Sometimes a given result may emerge under just the right conditions –

perhaps when variables X1 and X4 are included and X2 and X3 are excluded – and this is

the only result that the model-fisher gives us.

Model fishing is possible, first, because the coefficient on any given variable can change depending on what other variables are in the model. We have already discussed how omitted variable bias can affect coefficients. We have also discussed how multicollinearity drives up the variance of our estimates, meaning that the β̂1 estimates will tend to bounce around more when the independent variables are highly correlated with each other.

A second challenge in model specification is that sample size can change as we include more variables. Sometimes we're missing observations for some variables. For example, in survey data it is pretty common that a good chunk of people do not answer questions about their annual income. If we include a variable like income, OLS will include only the observations for which that variable is observed. Including a variable that is missing for half the people in a sample will cut our sample size in half. This change in the sample can cause coefficient estimates to jump around because, as we talked about with regard to sampling distributions (on page ??), coefficients will differ for each sample. In some instances, the effects on a coefficient estimate can be large.

A third model specification challenge is that we may be tempted to include so-called

post-treatment variables in the model. These are variables that are themselves affected by

the independent variable of interest. For example, Harvard Professor Gary King discusses a

study of the effect of oil prices on perceptions of oil shortages based on surveys over several

years. We may be tempted to include a measure of media stories on oil because media stories

very plausibly affect public perceptions. On the other hand, the media stories themselves

may be a consequence of the oil price increase, meaning that if we include a media variable

in the model we may be underestimating the effect of oil prices on public opinion.

Ideally, we avoid post-treatment variables. In an experimental context, for example, we

should control only for variables measured before the experiment or variables that do not

change (such as sex and race). In an observational context, it can be tricky to irrefutably

categorize variables as post-treatment, but it’s worthwhile to try to focus on pre-treatment

controls and to report results with and without variables that themselves may be affected


by the independent variable we’re most interested in.11

Creating and reporting credible results

There are certain good practices that mitigate some of the dangers inherent in model speci-

fication. The first is to adhere to the replication standard. Some people see how coefficient

estimates can change dramatically depending on specification and become statistical cyn-

ics. They believe that statistics can be manipulated to give any answer. Such thinking lies

behind the aphorism “There are three kinds of lies: lies, damned lies, and statistics.” A

better response is skepticism, a belief that statistical analysis should be transparent to be

believed. In this view, the saying should be “There are three kinds of lies: lies, damned lies,

and statistics that can’t be replicated.”

A second good practice is to present results from multiple specifications in a way that

allows readers to understand which steps of the specification are the crucial ones for the

conclusion being offered. Coefficients will change when variables are added or excluded; that

is, after all, the point of multivariate analysis. For the analysis to be credible, though, it

needs to be clear about which specification decisions drive the results. Readers need to know

whether the results are robust to a number of reasonable specification choices or depend

narrowly on one set of choices about which variables to include.


11 See King (1991) for more on post-treatment variables (and a critique of R2 ). Even pre-treatment variables can
be problematic; Elwert and Winship (2014) discuss recent results in causal inference suggesting that controlling for
pre-treatment variables with certain characteristics can cause bias as well.


All statistical analysis should, as a matter of course, report multiple specifications, typically from a simple model to more complicated models. We saw an example with the height and wage data in Table 5.2 on page 201 and will see more examples throughout the book.

Remember This
1. An important part of model specification is choosing what variables to include in
the model. Challenges in this process include
(a) Model fishing, which occurs when a researcher searches for a subset of possible
independent variables that provides a desired result.
(b) Changes in sample size (and potentially in results) due to inclusion of inde-
pendent variables with missing observations.
(c) Distortions caused by including post-treatment variables in a model.
2. Researchers should adhere to the replication standard and report multiple specifi-
cations in order to demonstrate the robustness of results and to highlight variables
associated with changes in coefficients.

5.6 Conclusion

Multivariate OLS is a huge help in our fight against endogeneity because it allows us to

add variables to our models. Doing so cuts off at least part of the correlation between an

independent variable and the error term because the included variables are no longer in the

error term. For observational data, multivariate OLS is very necessary, although we seldom

can wholly defeat endogeneity simply by including variables. For experimental data not

suffering from attrition, balance, or compliance problems, we can beat endogeneity without

multivariate OLS, but multivariate OLS makes our estimates more precise.


A useful way to think about multivariate OLS is as an effort to avoid omitted variable

bias. Omitting a variable causes problems when both of the following are true: the omitted

variable affects the dependent variable and it is correlated with the included independent

variable.

While we are most concerned with the factors that bias estimates, we have also identified

four factors that make our estimates less precise. Three were the same as with bivariate

OLS: poor model fit, limited variation in the independent variable, and small data sets.

A precision-killing factor new to multivariate OLS is multicollinearity. When independent

variables are highly correlated, they get in the way of each other and make it hard for us to

know which one has which effect. The result is not bias, but imprecision.

We’re well on our way to understanding multivariate OLS when we can:

• Section 5.1: Write down the multivariate regression equation and explain all its elements

(dependent variable, independent variables, coefficients, intercept and error term). Ex-

plain how adding a variable to a multivariate OLS model can help fight endogeneity.

• Section 5.2: Explain omitted variable bias, including the two conditions necessary for

omitted variable bias to exist.

• Section 5.3: Explain what measurement error in dependent and independent variables

does to our coefficient estimates.

• Section 5.4: Produce the equation for the variance of β̂1 and explain the elements of it, including σ̂², Σ_{i=1}^{N}(Xij − X̄j)², and R²j. Use this equation to explain the consequences of multicollinearity and inclusion of irrelevant variables.

• Section 5.5: Explain good practices regarding model specification.

Further Reading

King, Keohane, and Verba (1994) provide an intuitive and useful discussion of omitted

variable bias.

Goldberger (1991) has a terrific discussion of multicollinearity as “micronumerosity.” His

point is that the real problem with multicollinear data is that the estimates will be imprecise.

We defeat imprecise data with more data, hence the problem of multicollinearity is not having

enough data, a state of affairs he calls “micronumerosity.”

Morgan and Winship (2014) provide a fascinating alternative way of thinking about con-

trolling for multiple variables. They spend a fair bit of time discussing the strengths and

weaknesses of multivariate OLS and providing alternatives.

Statistical results can often be more effectively presented as figures instead of tables.

Kastellec and Leoni (2007) provide a nice overview of the advantages and options for such

an approach.


Key Terms
• Adjusted R2 (231)
• Attenuation bias (223)
• Auxiliary regression (210)
• Ceteris paribus (197)
• Confidence interval (201)
• Irrelevant variable (231)
• Measurement error (220)
• Model-fishing (241)
• Model specification (240)
• Multicollinearity (227)
• Multivariate OLS (194)
• Omitted variable bias (211)
• Perfect multicollinearity (230)
• Variance inflation factor (227)

Computing Corner

Stata
1. To estimate a multivariate OLS model, we simply extend the syntax from bivariate OLS
(described on page 126). The syntax is
reg Y X1 X2 X3
For heteroscedasticity-consistent standard errors, simply add the robust subcommand (as discussed on page 103):
reg Y X1 X2 X3, robust
2. There are two ways to assess multicollinearity.


• Calculating the R²j for each variable. For example, calculate the R²1 via
reg X1 X2 X3
and calculate the R²2 via
reg X2 X1 X3
• Stata also provides a variance inflation factor command that estimates 1/(1 − R²j) for
each variable. This command needs to be run immediately after the main model of
interest. For example,
reg Y X1 X2 X3
vif
would provide the variance inflation factor for all variables from the main model.
A VIF of 5, for example, indicates that the variance is five times higher than it
would be if there were no multicollinearity.

R
1. To estimate a multivariate OLS model, we simply extend the syntax described on page
128. The syntax is
OLSResults = lm(Y ~ X1 + X2 + X3)
For heteroscedasticity-consistent standard errors, install and load the AER package and
use the coeftest and vcov commands as follows, as discussed on page 130:
coeftest(OLSResults, vcov = vcovHC(OLSResults, type = "HC1"))
2. To assess multicollinearity, calculate the R²j for each variable. For example, calculate
the R²1 via
AuxReg1 = lm(X1 ~ X2 + X3)
and calculate the R²2 via
AuxReg2 = lm(X2 ~ X1 + X3)
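For a self-contained illustration of these commands, here is a sketch using R's built-in mtcars data rather than the book's data sets (the variable names come from mtcars, so the substantive model is not meaningful):

    library(AER)   # loads lmtest and sandwich, which supply coeftest() and vcovHC()

    OLSResults <- lm(mpg ~ wt + hp + disp, data = mtcars)
    summary(OLSResults)                                            # classical standard errors
    coeftest(OLSResults, vcov = vcovHC(OLSResults, type = "HC1"))  # robust standard errors

    # Multicollinearity check for wt via an auxiliary regression
    1 / (1 - summary(lm(wt ~ hp + disp, data = mtcars))$r.squared)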

Exercises
1. Table 5.5 describes variables from heightwage.dta we will use in this problem. We
previously saw this data in Chapter 3 on page 112 and in Chapter 4 on page 187.
a. Estimate two OLS regression models: one in which adult wages is regressed on adult
height for all respondents, the other in which adult wages is regressed on adult height
and adolescent height for all respondents. Discuss differences across the two models.
Explain why the coefficient on adult height changed.
b. Assess the multicollinearity of the two height variables using (i) a plot (ii) the variance
inflation factor command, and (iii) an auxiliary regression. For the plot run once


Table 5.5: Variables for Height and Wage Data in the United States

Variable name Description


wage96 Hourly wages (in dollars) in 1996
height85 Adult height: height (in inches) measured in 1985
height81 Adolescent height: height (in inches) measured in 1981
athlets Participation in high school athletics (1=yes, 0=no)
clubnum Number of club memberships in high school, excluding
athletics and academic/vocational clubs
siblings Number of siblings
age Age in 1996
male Male (1=yes, 0=no)

without the jitter subcommand (e.g., scatter X1 X2) and once with it (e.g., scatter
X1 X2, jitter(3)), and choose the more informative of the two plots. (Note that in
the auxiliary regression it's useful to limit the sample to observations where wage96
is not missing so that the R² from the auxiliary regression will be based on the same
number of observations as the regression used for the vif command. The syntax is if
wage96 !=. where the exclamation means "not" and the period is how Stata marks
missing values.)
c. Notice that IQ is omitted from the model. Is this a problem? Why or why not?
d. Notice that eye color is omitted from the model. Is this a problem? Why or why
not?
e. You’re the boss! Estimate a model that you think sheds light on an interesting
relationship, using the data in the file. The specification decisions include deciding
whether to limit the sample and what variables to include. Report only a single
additional specification. Describe in two paragraphs or less why this is an interesting
way to assess the data.
2. Use the MLBattend.dta data on Major League Baseball attendance records for 32 teams
from the 1970s through 2000 that we used in Chapter 4 on page 190. We are interested
in the factors that impact baseball game attendance.
a. Estimate a regression in which home attendance rate is the dependent variable and
wins, runs scored, and runs allowed are the independent variables. Report your
results, identify variables that are statistically significant, and interpret all significant
coefficients.
b. Suppose someone argues that we need to take into account the fact that the U.S.
population grew from 1970 through 2000. This particular data set does not have a
population variable, but it does have a variable called season, which indicates what


season the data is from (e.g., season equals 1969 for observations from 1969 and
season equals 1981 for observations from 1981, etc.). What are the conditions that
need to be true for omission of the season variable to bias other coefficients? Do you
think they hold in this case?
c. Estimate a second regression using the dependent and independent variables from
part (a), but including season as an additional independent variable to control for
trends on overall attendance over time. Report your results and discuss the differ-
ences between these results and those observed in part (a).
d. What is the relationship between season and runs scored? Assess with an auxiliary
regression and a scatterplot. Discuss the implications for the results in part (c).
3. Do cell phones distract drivers and cause accidents? Worried that they do, many states
over the last ten years have passed legislation to reduce distracted driving. Fourteen
states have passed legislation making handheld cell phone use while driving illegal and
44 states have banned texting while driving. This problem looks more closely at the
relationship between cell phones and traffic fatalities. Table 5.6 describes the variables
in the data set Cellphone 2012 homework.dta.
Table 5.6: Variables for Cell Phones and Traffic Deaths Questions

Variable name Description


year Year
State State name
state numeric State name (numeric representation of state)
numberofdeaths Number of traffic deaths
cell subscription Number of cell phone subscriptions (in thousands)
population Population within a state
total miles driven Total miles driven within a state for that year (in millions of miles)

a. While we don’t have the number of people who are using the phone while driving,
we do have the number of cell phones subscriptions within a state (in thousands).
Estimate a bivariate model with traffic deaths as the dependent variable and number
of cell phone subscriptions as the independent variable. Briefly discuss the results
and explain if you suspect endogeneity and why.
b. Add population to the model. What happens to the coefficient on cell phone sub-
scriptions? Why?
c. Add total miles driven to the model. What happens to the coefficient on cell phone
subscriptions? Why?


d. Based on the model in part (c), calculate the variance inflation factor for population
and total miles driven. Why are they different? Discuss implications of this level
of multicollinearity for the coefficient estimates and the precision of the coefficient
estimates.
4. What determines how much drivers are fined if they are stopped for speeding? Do
demographics like age, gender, and race matter? To answer this question, we'll in-
vestigate traffic stops and citations in Massachusetts using data from Makowsky and
Stratmann (2009). Even though state law sets a formula for tickets based on how fast
the driver was driving, police officers in practice often deviate from the formula. The
data in speeding tickets text.dta includes information on all traffic stops. It contains
an amount for the fine for only those observations for which the police officer decided
to assess a fine.
Table 5.7: Variables for Speeding Ticket Data

Variable name Description


MPHover Miles per hour over the speed limit
Amount Assessed fine for the ticket
Age Age of driver

a. Estimate a bivariate OLS model in which ticket amount is a function of age. Is age
statistically significant? Is endogeneity possible?
b. Estimate the model from part (a) also controlling for miles per hour over the speed
limit. Explain what happens to the coefficient on age and why.
c. Suppose we had only the first 1,000 observations in the data set. Estimate the model
from part (b) and report on what happens to the standard errors and t statistics when
we have fewer observations. (In Stata, use if _n <= 1000 at the end of the regression
command to limit the sample to the first 1,000 observations. Because the amount is
missing for drivers who were not fined, the sample size will be much smaller than
1,000.)
5. We will continue the analysis of height and wages in Britain from the homework prob-
lem in Chapter 3 on page 133. We want to know if the relationship between height
and wages in the United States also occurs among British men. The data set height-
wage british males multivariate.dta contains data on males in the Britain from Persico,
Postlewaite, and Silverman (2004). Table 5.8 lists the variables.12
12For the reasons discussed in the homework problem in Chapter 3 on page 133 we limit the data set to observations
with height greater than 40 inches and self-reported income less than 400 British pounds per hour. We also exclude


Table 5.8: Variables for Height and Wage Data in Britain

Variable name Description


gwage33 Hourly wages (in British Pounds) at age 33
height33 Height (in inches) measured at age 33
height16 Height (in inches) measured at age 16
height07 Height (in inches) measured at age 7
momed Education of mother, measured in years
daded Education of father, measured in years
siblings Number of siblings
Ht16Noisy Height (in inches) measured at age 16 with measurement error added in

a. Persico, Postlewaite, and Silverman (2004) argue that adolescent height is most rel-
evant because it is height at these ages that affects the self-confidence to develop
interpersonal skills at a young age. Estimate a model with wages at age 33 as the
dependent variable and both height at age 33 and age 16 as independent variables.
What happens to the coefficient on height at age 33? Explain what is going on here.
b. Let’s keep going. Add height at age 7 to the above model and discuss the results.
Be sure to note changes in sample size (and it’s possible effects) and to discuss the
implications of adding a variable with the statistical significance observed for the
height at age 7 variable.
c. Is there multicollinearity in the model from part (b)? Diagnose it and indicate its
consequences. Be specific as to whether the multicollinearity will bias coefficients or
have some other effect.
d. Perhaps characteristics of parents affect height (some force kids to eat veggies, while
others give them only french fries and Fanta). Add the two parental education
variables to the model and discuss results. Include only height at age 16 (meaning
we do not include the height at ages 33 and 7 for this question – although feel free
to include them too on your own; the results are interesting).
e. Perhaps kids had their food stolen by greedy siblings. Add the number of siblings to the model and discuss the results.
f. We have included a variable, Ht16Noisy, which is the height measured at age 16 with some random error included. In other words, it does not equal the actual measured height at age 16, but is a "noisy" measure of height at age 16. Estimate the model using Ht16Noisy instead of height16 and discuss any changes in the coefficient on the height variable. Relate the changes to theoretical expectations about measurement error discussed in the chapter.

CHAPTER 6

DUMMY VARIABLES: SMARTER THAN YOU THINK

Picture, if you will, a frenzied home crowd at a

sporting event. That has to help the home team,

right? The fans sure act like it will. But does

it really? This is a question begging for data

analysis.

Let’s look at Manchester City in English Pre-

mier League soccer in 2012-13. Panel (a) of Figure 6.1 shows the goal differential for Manch-

ester City’s 38 games, distinguishing between home and away games. The average goal

differential for away games is about 0.32 (meaning the team scored on average 0.32 more

goals than their opponents when playing away from home). The average goal differential


for home games is about 1.37, meaning that the goal differential is more than 1 goal higher

at home. Well done, obnoxious drunk fans! Panel (b) of Figure 6.1 shows the goal differen-

tial for Manchester United. The average goal differential for away games is about 0.90 and

the average goal differential for home games is about 1.37 (coincidentally the same value as

for Manchester City). These numbers mean that the home field advantage for Manchester

United is only about 0.47. C’mon Manchester United fans – yell louder!

We can use OLS to easily generate such estimates and conduct hypothesis tests. And we

can do much more. We can estimate such difference of means while controlling for other

variables and we can see whether covariates have different effects at home and away. The key

step is using a dummy variable, a variable that equals either zero or one, as the independent

variable.

In this chapter we show the many powerful uses of dummy variables in OLS models.

Section 6.1 shows how to use a bivariate OLS model for difference of means. Section 6.2

shows how to use multivariate OLS to control for other variables when conducting a difference

of means test. In Section 6.3 we use dummy variables to control for categorical variables, which indicate membership in one of multiple categories. Religion and race are

classic categorical variables. Section 6.4 discusses how dummy variable interactions allow

us to estimate different slopes for different groups. This chapter covers dummy independent

variables; Chapter 12 covers dummy dependent variables.


[Figure: goal differential for away (0) and home (1) games in two panels; panel (a) Manchester City, with averages of 0.32 away and 1.37 at home; panel (b) Manchester United, with averages of 0.90 away and 1.37 at home.]

FIGURE 6.1: Goal Differentials for Home and Away Games for Manchester City and Manchester United


6.1 Using Bivariate OLS to Assess Difference of Means

Researchers frequently want to know whether two groups differ. In experiments, researchers

are curious whether the treatment group differed from the control group. In observational

studies, researchers want to compare outcomes between categories: men versus women,

college grads versus high school grads, Ohio State versus Michigan. These comparisons are

often referred to as difference of means tests because they involve comparing the mean

of Y for one group (e.g., the treatment group) against the mean of Y for another group (e.g.,

the control group). In this section, we show how to use the bivariate regression model and

OLS to make such comparisons. We also work through an example about opinions of President Obama.

Regression model for difference of means tests

Consider a typical experiment. There is a treatment group that is a randomly selected group

of individuals who were given a treatment. There is also a control group that received no

treatment. We use a dummy variable to represent whether or not someone was in the

treatment group. A dummy variable equals either zero or one for each observation. Dummy

variables are also referred to as dichotomous variables. Typically, the dummy variable

equals one for those in the treatment group and zero for those in the control group.


A bivariate OLS model that assesses the effect of an experimental treatment is

Yi = β0 + β1Treatmenti + εi (6.1)

where Yi is the dependent variable, β0 is the intercept, β1 is the effect of being treated, and Treatmenti is our independent variable ("Xi"). This variable equals 1 if person i received the experimental treatment and 0 otherwise. As usual, εi is the error term. Because this is an experiment (one that we assume does not suffer from attrition, balancing, or compliance problems), εi will be uncorrelated with Treatmenti, thereby satisfying the exogeneity condition.

The standard interpretation of β̂1 from bivariate OLS applies here: A one-unit increase in the independent variable is associated with a β̂1 increase in Yi. (See page 78 on the standard OLS interpretation.) Equation 6.1 implies that getting the treatment (going from 0 to 1 on Treatmenti) is associated with a β̂1 increase in Yi.

When our independent variable is a dummy variable, as with our Treatmenti variable, we can also treat β̂1 as an estimate of the difference of means of our dependent variable Y across the two groups. To see why, note first that the fitted value for the control group (for whom Treatmenti = 0) is

Ŷi = β̂0 + β̂1Treatmenti
   = β̂0 + β̂1 × 0
   = β̂0

In other words, β̂0 is the predicted value of Y for individuals in the control group. It is not surprising that the value of β̂0 that best fits the data is simply the average of Yi for individuals in the control group.1

The fitted value for the treatment group (for whom Treatmenti = 1) is

Ŷi = β̂0 + β̂1Treatmenti
   = β̂0 + β̂1 × 1
   = β̂0 + β̂1

In other words, β̂0 + β̂1 is the predicted value of Y for individuals in the treatment group. The best predictor of this value is simply the average of Y for individuals in the treatment group. Because β̂0 is the average of individuals in the control group, β̂1 is the difference in averages between the treatment and control groups. If β̂1 > 0, then the average Y for those in the treatment group is higher than for those in the control group. If β̂1 < 0, then the average Y for those in the treatment group is lower than for those in the control group. If β̂1 = 0, then the average Y for those in the treatment group is no different than the average of Y for those in the control group.

In other words, our slope coefficient (β̂1) is, in the case of a bivariate OLS model with a dummy independent variable, a measure of the difference in means across the two groups. The standard error on this coefficient tells us how much uncertainty we have and determines the confidence interval for our estimate of β̂1.

Figure 6.2 graphically displays the difference of means test in bivariate OLS with a scat-
1 The proof is a bit laborious. We show it in the appendix on page 798.


[Figure: scatterplot of a dependent variable against a 0/1 treatment variable; β̂0 marks the average for the control group, β̂0 + β̂1 the average for the treatment group, and β̂1 is the slope between them.]

FIGURE 6.2: Bivariate OLS with a Dummy Independent Variable


terplot of data. It looks a bit different than our previous scatterplots (e.g., Figure 3.1 on page 66) because here the independent variable takes on only two values: 0 or 1. Hence the observations are stacked at 0 and 1. In our example the values of Y when X = 0 are generally lower than the values of Y when X = 1. The parameter β̂0 corresponds to the average of Y for all observations for which X = 0. The average for the treatment group (for whom X = 1) is β̂0 + β̂1. The difference in averages across the groups is β̂1. A key point is that the standard interpretation of coefficients in bivariate OLS still applies: A one-unit change in X (e.g., going from X = 0 to X = 1) is associated with a β̂1 change in Y.

This is excellent news. Whenever our independent variable is a dummy variable – as it typically is for experiments and often is for observational data – we can simply run bivariate OLS and the β̂1 coefficient tells us the difference of means. The standard error on this coefficient tells us how precisely we have measured this difference and allows us to conduct a hypothesis test and determine a confidence interval.

OLS produces difference of means tests for observational data, as well. The model and

interpretation are the same; the difference is how much we worry about whether the exogene-

ity assumption is satisfied. Typically, exogeneity will be seriously in doubt for observational

data. And sometimes it is useful to use OLS to estimate the difference of means as a de-

scriptive statistic without a causal interpretation.

Difference of means tests can be conducted without using OLS. Doing so is totally fine, of course; in fact, OLS and non-OLS difference of means tests that assume the same variance across groups produce identical estimates and standard errors. The advantage of the OLS approach is that we can do it within a framework that also does all the other things OLS does, such as adding multiple variables to the model.
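To make this equivalence concrete, here is a minimal Stata sketch on simulated data (the variable names and numbers are ours, not from any of the chapter's data sets). The coefficient on the treatment dummy from regress equals the gap in group means, and the equal-variance t test reported by ttest gives the same estimate and standard error.

  * Simulate a simple experiment with a 0/1 treatment dummy
  clear
  set obs 500
  set seed 12345
  generate treatment = (runiform() < 0.5)
  generate y = 10 + 5*treatment + rnormal(0, 3)

  * OLS with a dummy independent variable: the coefficient on treatment
  * is the estimated difference of means
  regress y treatment

  * Classic difference of means test assuming equal variances: same estimate and SE
  ttest y, by(treatment)

  * Heteroscedasticity-consistent version (see footnote 2 below)
  regress y treatment, vce(robust)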

Difference of means and views about President Obama

Table 6.1 provides an example. The left-hand column presents results from a model of feel-

ings toward President Obama from a December 2011 public opinion survey. The dependent

variable is respondents’ answers to a request to rate Obama on a 0 to 100 “feeling thermome-

ter” scale where 0 is feeling very cold toward him and 100 is feeling very warm toward him.

The independent variable is a dummy variable called Democrat that is 1 for respondents

who identify themselves as Democrats and 0 for those who do not. The Democrat variable

equals 0 for all non-Democrats (a group that includes Republicans, independents, supporters of

other parties, and non-partisans). The results indicate that Democrats rate Obama 41.82

points higher than non-Democrats, an effect that is highly statistically significant.2

Difference of means tests convey the same essential information when the coding of the

dummy variable is flipped. The column on the right in Table 6.1 shows results from a

model in which N otDemocrat was the independent variable. This variable is the opposite
2A standard OLS regression model produces a standard error and a t statistic that are equivalent to the standard
error and t statistic produced by a difference of means test in which variance is assumed to be the same across both
groups. An OLS model with heteroscedasticity-consistent standard errors (as discussed in Section 3.6) produces a
standard error and t statistic that are equivalent to a difference of means test in which variance differs across groups.
The Computing Corner shows how to estimate these models.


of the Democrat variable, equaling 1 for non-Democrats and zero for Democrats. The numerical results are different, but they nonetheless contain the same information. The constant in the second specification is the mean evaluation of Obama by Democrats. In the first specification this mean is β̂0 + β̂1 = 23.38 + 41.82 = 65.20; in the second specification it is simply β̂0 because Democrats are the excluded category. In the first equation the coefficient on Democrat is 41.82, indicating that Democrats evaluated Obama 41.82 points higher than non-Democrats. In the second equation the coefficient on NotDemocrat is negative 41.82, indicating non-Democrats evaluated Obama 41.82 points lower than Democrats.


Table 6.1: Feeling Thermometer Toward Barack Obama

                   Treatment = Democrat    Treatment = Not Democrat
Constant           23.38*                  65.20*
                   (0.78)                  (0.76)
                   [t = 30.17]             [t = 85.72]
Democrat           41.82*
                   (1.09)
                   [t = 38.51]
Not Democrat                               -41.82*
                                           (1.09)
                                           [t = 38.51]
N                  2,183                   2,183
R2                 0.40                    0.40
Standard errors in parentheses

Figure 6.3 scatterplots the data and highlights the estimated differences in means between

non-Democrats and Democrats. Dummy variables can be a bit tricky to plot because the

values of the independent variable are only 0 or 1, causing the data to overlap such that we

can’t tell if each dot in the scatterplot indicates 2 or 200 observations. A trick of the trade


is to jitter each observation by adding a small random number to each observation for the

independent and dependent variables. The jittered data gives the cloud-like images in the

figure that help us get a decent sense of the data. We jitter only the data that is plotted;

we do not jitter the data when running the statistical analysis. The Computing Corner at

the end of this chapter shows how to jitter data for plots.3
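As a rough sketch of the idea (not the chapter's actual plotting code), Stata's jitter() option perturbs points only in the plot; the simulated party and thermometer variables below are hypothetical stand-ins for the survey data.

  clear
  set obs 1000
  set seed 2024
  generate democrat = (runiform() < 0.4)
  generate therm = 23 + 42*democrat + rnormal(0, 18)
  replace therm = max(0, min(100, therm))   // keep the simulated ratings on the 0-100 scale

  * jitter() nudges only the plotted points; the data used for estimation are unchanged
  twoway scatter therm democrat, jitter(7)
  regress therm democrat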

Non-Democrats’ feelings toward Obama clearly run lower because there are many more

observations at the low end of the feeling thermometer scale for non-Democrats. Their

average feeling thermometer rating is 23.38. Feelings toward Obama among Democrats are

higher, with an average of 65.20. Both of the specifications in Table 6.1 tell this same story

when interpreted correctly.

Remember This
A difference of means test assesses whether the average value of the dependent variable differs between two groups.
1. We often are interested in the difference of means between treatment and control groups, between women and men, or between other groupings.
2. Difference of means tests can be implemented in bivariate OLS by using a dummy independent variable:
   Yi = β0 + β1Treatmenti + εi
   (a) The estimate of the mean for the control group is β̂0.
   (b) The estimate of the mean for the treatment group is β̂0 + β̂1.
   (c) The estimate of the difference in means between groups is β̂1.

3 We also discussed jittering data on page 112.


[Figure: jittered scatterplot of Obama feeling thermometer ratings against partisan identification (0 = non-Democrats, 1 = Democrats); β̂0 marks the non-Democrat average and β̂0 + β̂1 the Democrat average, with β̂1 the slope.]

FIGURE 6.3: Scatterplot of Obama Feeling Thermometers and Party Identification


[Figure: three panels, (a), (b), and (c), each a scatterplot of a dependent variable for a control group (0) and a treatment group (1), on different scales; used in the discussion questions below.]

FIGURE 6.4: Three Difference of Means Tests

Discussion Questions

1. Approximately what are the averages of Y for the treatment and control groups in each panel of Figure 6.4? Approximately what is the estimated difference of means in each panel?
2. Approximately what are the values of β̂0 and β̂1 in each panel of Figure 6.4?


Case Study: Sex Differences in Heights


As an example of OLS difference in means, let’s

look at the difference in heights between men and

women. We already know men are, on average,

taller, but it is interesting to know just how much

taller and how confident we can be of the es-

timate. In this case, the dependent variable is

height and the independent variable is gender.

We can code the “treated” value as either being

male or female; for now, we’ll use a male dummy variable that is 1 if the person is male

and 0 if the person is female.4 Later, we’ll come back and re-do the analysis with a female

dummy variable.

Figure 6.5 displays a scatterplot of height and gender. The figure shows that men are, as

expected, taller on average than women as the man-blob is clearly higher than the woman-

blob.

That’s not very precise, though, so we’ll use an OLS model to provide a specific estimate

of the difference in heights between men and women. The model is

Heighti = β0 + β1Malei + εi
4 Sometimes people will name a variable like this “gender.” That’s annoying! Readers will have to dig through the
paper to figure out whether 1 indicates males or females.


[Figure: scatterplot of height in inches against gender (0 = women, 1 = men).]

FIGURE 6.5: Scatterplot of Height and Gender


The estimated coefficient β̂0 tells us the average height for the group for which the dummy variable is zero, which in this case is women. The estimated coefficients β̂0 + β̂1 tell us the average height for the group for which the dummy variable is 1, which in this case is men. The difference between the two groups is estimated as β̂1.

The results are reported in Table 6.2. The average height of women is β̂0, which is 64.23 inches. The average height for men is β̂0 + β̂1, which is 64.23 + 5.79 = 70.02 inches. The difference between the two groups is estimated as β̂1, which is 5.79 inches.
Table 6.2: Difference of Means Test for Height and Gender

Constant      64.23
              (0.04)
              [t = 1,633.6]
Male          5.79
              (0.06)
              [t = 103.4]
N             10,863
Standard errors in parentheses

This estimate is quite precise. The t statistic for male is 103.4, which allows us to reject the null hypothesis. We can also use our confidence interval algorithm from page 181 to produce a 95% confidence interval for β̂1 of 5.68 to 5.90 inches. In other words, we are 95% confident that the difference of means of height between men and women is between 5.68 and 5.90 inches.

Figure 6.6 adds the information from Table 6.2 to the scatterplot. We can see that β̂0 is estimating the middle of the women-blob, β̂0 + β̂1 is estimating the middle of the men-blob, and the difference between the two is β̂1. The estimated effect of going from zero to one on the independent variable (which is equivalent to going from female to male) is to add 5.79 inches on average.

We noted earlier that it is reasonable to code the treatment as being female. If we replace the male dummy variable with a female dummy variable the model becomes

Heighti = β0 + β1Femalei + εi

Now the estimated coefficient β̂0 will tell us the average height for men (the group for which Female = 0). The estimated coefficients β̂0 + β̂1 will tell us the average height for women and the difference between the two groups is estimated as β̂1.

The results with the female dummy variable are in the right-hand column of Table 6.3. The numbers should look familiar, because we are learning the same information from the data. It is just that the accounting is a bit different. What is the estimate of the average height for men? It is β̂0 in the right-hand column, which is 70.02. Sound familiar? That was the number we got from our initial results (reported again in the left-hand column of Table 6.3); in that case we had to add β̂0 + β̂1 because when the dummy variable indicated men, we needed both coefficients to get the average height for men. What is the difference between males and females estimated in the right-hand column? It is -5.79, which is the same as before, only negative. The underlying fact is that women are estimated to be 5.79 inches shorter on average. If we have coded our dummy variable as Female = 1, then going from zero to one on the independent variable is associated with a decline of 5.79 inches on average. If we have coded our dummy variable as Male = 1, then going from zero to one on the independent variable is associated with an increase of 5.79 inches on average.

[Figure: scatterplot of height in inches against gender (0 = women, 1 = men), with β̂0 marking the average height for women and β̂0 + β̂1 the average height for men.]

FIGURE 6.6: Scatterplot of Height and Gender


Table 6.3: Difference of Means Test for Height and Gender

              Treatment = male    Treatment = female
Constant      64.23               70.02
              (0.04)              (0.04)
              [t = 1,633.6]       [t = 1,755.9]
Male          5.79
              (0.06)
              [t = 103.4]
Female                            -5.79
                                  (0.06)
                                  [t = 103.4]
N             10,863              10,863
Standard errors in parentheses
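A minimal Stata sketch of the recoding point, using simulated heights chosen to roughly mimic Table 6.3 (the data and variable names are ours): flipping the dummy flips the sign of the coefficient and swaps which group the constant describes, but the substantive answer is identical.

  clear
  set obs 2000
  set seed 99
  generate male = (runiform() < 0.5)
  generate height = 64.2 + 5.8*male + rnormal(0, 2.5)

  * Male = 1: constant estimates the average for women, coefficient the gap
  regress height male

  * Flip the coding: constant estimates the average for men, coefficient is the negative gap
  generate female = 1 - male
  regress height female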

6.2 Dummy Independent Variables in Multivariate OLS

We can easily extend difference of means tests to multivariate OLS. Doing so is useful because it allows us to control for other variables when assessing whether two groups are different. For example, earlier in this chapter we assessed the home field advantage of Manchester City; we can now do so while controlling for the quality of the opponent. Using multivariate OLS we can estimate

Goal differentiali = β0 + β1Homei + β2Opponent qualityi + εi (6.2)

where Opponent qualityi measures the opponent's overall goal differential in all other games.


The β̂1 estimate will tell us, controlling for opponent quality, whether the goal differential was higher for Manchester City for home games. The results are in Table 6.4.
Table 6.4: Manchester City Example with Dummy and Continuous Independent Variables

Home field            1.026
                      (0.437)
                      [t = 2.35]
Opponent quality      -0.025
                      (0.009)
                      [t = 2.69]
Constant              0.266
                      (0.309)
                      [t = 0.86]
N                     38
R2                    0.271
σ̂                     1.346
Standard errors in parentheses

The generic form of such a model is

Yi = β0 + β1Dummyi + β2Xi + εi (6.3)

It is useful to think graphically about the fitted lines from this kind of model. Figure 6.7 shows the data for Manchester City's results in 2012-13. The observations for home games (for which the home dummy variable equals one) are black dots; the observations for away games (for which the home dummy variable equals zero) are grey dots.

As discussed on page 281, the intercept for the Homei = 0 observations (the away games) will be β̂0 and the intercept for the Homei = 1 observations (the home games) will be β̂0 + β̂1, which equals the intercept for away games plus the bump (up or down) for home games.


[Figure: goal differential against opponent quality with two parallel fitted lines, each with slope β̂2; the line for home games (Home = 1) has intercept β̂0 + β̂1 and the line for away games (Home = 0) has intercept β̂0.]

FIGURE 6.7: Fitted Values for Model with Dummy Variable and Control Variable: Manchester City Example


Note that the coefficient indicating the difference of means is the coefficient on the dummy variable. (Which β we should look at depends on how we write the model: for this model, β1 indicates the difference of means controlling for the other variable, but it would be β2 if we wrote the model with β2 multiplied by the dummy variable.)

The innovation is that our difference of means test here also controls for another variable, in this case, opponent quality. Here the effect of a one-unit increase in opponent quality is β̂2; this effect is the same for the Homei = 1 and Homei = 0 groups. Hence the fitted lines are two parallel lines, one for each group, separated by β̂1, the differential bump associated with being in the Homei = 1 group. In the figure, β̂1 is greater than zero, but it could be less than zero (in which case the dark line for the Homei = 1 group would be below the grey line) or equal to zero (in which case the dark line and grey lines would overlap exactly).

We can add additional independent variables to our heart's content, allowing us to assess the difference of means between the Homei = 1 and Homei = 0 groups in a manner that controls for the additional variables. Such models are incredibly common.
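A sketch in Stata of this kind of model, with simulated data loosely patterned on the Manchester City example (the variable names and numbers are invented): the coefficient on the dummy is the vertical gap between the two parallel fitted lines.

  clear
  set obs 38
  set seed 7
  generate home = (mod(_n, 2) == 0)
  generate opp_quality = rnormal(0, 15)
  generate goal_diff = 0.3 + 1.0*home - 0.03*opp_quality + rnormal(0, 1.3)

  * Difference of means controlling for opponent quality
  regress goal_diff home opp_quality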

Remember This
1. Including a dummy variable in a multivariate regression allows us to conduct a difference of means test while controlling for other factors with a model such as
   Yi = β0 + β1Xi + β2Dummyi + εi (6.4)
2. The fitted values from this model will be two parallel lines, each with a slope of β̂1 and separated by β̂2 for all values of X.


6.3 Transforming Categorical Variables to Multiple Dummy Vari-

ables

Categorical variables (also known as nominal variables) are common in data analysis.

They have two or more categories, but the categories have no intrinsic ordering. Information

on religion is often contained in a categorical variable: 1 for Buddhist, 2 for Christian, 3 for

Hindu, and so forth. Race, industry, and many more attributes also appear as categorical

variables. Categorical variables differ from dummy variables in that categorical variables

have multiple categories. Categorical variables differ from ordinal variables in that ordinal

variables express rank but not necessarily relative size. An example of an ordinal variable

is one indicating answers to a survey question that is coded 1 = strongly disagree, 2 =

disagree, 3 = agree, 4 = strongly agree.5

In this section, we show how to use dummy variables to analyze categorical variables.

We illustrate the technique with an example about wage differentials across regions in the

United States.
5 It is possible to treat ordinal independent variables in the same way as categorical variables in the manner we
describe here. Or it is common to simply include ordinal independent variables directly in a regression model and
interpret a one-unit increase as moving from one category to another.


Categorical variables in regression models

We might suspect that wages in the United States are different in different regions. Are they

higher in the northeast? Or are they higher in the south? Suppose we have data on wages

and on region. It should be easy to figure this out, right? Well, yes, as long as we appreciate

how to analyze categorical variables. Categorical variables indicate membership in some

category. They are common in policy analysis. For example, suppose our region variable is

coded such that 1 indicates a person is from the northeast, 2 indicates a person is from the

midwest, 3 indicates a person is from the south, and 4 indicates a person is from the west.

How should we incorporate categorical variables into OLS models? Should we estimate the following model of wages?

Wagei = β0 + β1X1i + β2Regioni + εi (6.5)

where Wagei is the wages of person i and Regioni is the region person i lived in, as defined above.

No, no, and no. Though the categorical variable may be coded numerically, it has no

inherent order, which means the units are not meaningful. The midwest is not “1” more

than the northeast; the south is not “1” more than the midwest.

So what do we do with categorical variables? Dummy variables save the day. We simply

convert categorical variables into a series of dummy variables, a different one for each cate-

gory. If region is the nominal variable, we simply create a northeast dummy variable (1 for


people from the northeast, 0 otherwise), a midwest dummy variable (1 for midwesterners, 0

otherwise) and so on.

The catch is that we cannot include dummy variables for every category because if we did,

we would have perfect multicollinearity (as we discussed on page 230). Hence we exclude

one of the dummy variables and treat that category as the reference category (also re-

ferred to as the excluded category), which means that coefficients on the included dummy

variables indicate the difference between the category indicated by the dummy variable and

the reference category.

We’ve already been doing something like this with dichotomous dummy variables. When

we used the male dummy variable in our height and wages example on page 267, we did

not include a female dummy variable, meaning that females were the reference category and

the coefficient on the male dummy variable indicated how much taller men were. When we

used the female dummy variable, men were the reference category and the coefficient on the

female dummy variable indicated how much shorter females were on average.

Categorical variables and regional wage differences

To see how categorical variables work in practice, we will analyze women’s wage data in 1996

across the northeast, midwest, south, and west in the United States. We won’t, of course,

include a single region variable. Instead, we create dummy variables for each region and

include all but one of them in the OLS regression. For example, if we treat West as the


excluded category, we estimate

Wagesi = β0 + β1Northeasti + β2Midwesti + β3Southi + εi

The results for this regression are in column (a) of Table 6.5. The β̂0 result tells us that the average wage per hour for women in the west (the excluded category) was $12.50. Women in the northeast are estimated to receive $2.02 more per hour than those in the west, or $14.52 per hour. Women in the midwest earn $1.59 less than women in the west, which works out to $10.91 per hour. And women in the south receive $2.13 less than women in the west, or $10.37 per hour.


Table 6.5: Wages and Region Using Different Excluded Categories

              (a)           (b)           (c)           (d)
              Exclude       Exclude       Exclude       Exclude
              West          South         Midwest       Northeast
Constant      12.50*        10.37*        10.91*        14.52*
              (0.40)        (0.26)        (0.36)        (0.43)
              [t = 31.34]   [t = 39.50]   [t = 30.69]   [t = 33.53]
Northeast     2.02*         4.15*         3.61*
              (0.59)        (0.506)       (0.56)
              [t = 3.42]    [t = 8.19]    [t = 6.44]
Midwest       -1.59*        0.54                        -3.61*
              (0.534)       (0.44)                      (0.56)
              [t = 2.97]    [t = 1.23]                  [t = 6.44]
South         -2.13*                      -0.54         -4.15*
              (0.48)                      (0.44)        (0.51)
              [t = 4.47]                  [t = 1.23]    [t = 8.19]
West                        2.13*         1.59*         -2.02*
                            (0.48)        (0.53)        (0.59)
                            [t = 4.47]    [t = 2.97]    [t = 3.42]
N             3,223         3,223         3,223         3,223
R2            0.023         0.023         0.023         0.023
Standard errors in parentheses

Column (b) of Table 6.5 shows the results from the same data, but with south as the


excluded category instead of west. The β̂0 result tells us that the average wage per hour for women in the south (the excluded category) was $10.37. Women in the northeast get $4.15 per hour more than women in the south, or $14.52 per hour. Women in the midwest receive $0.54 per hour more than women in the south (which works out to $10.91 per hour) and women in the west get $2.13 per hour more than women in the south (which works out to $12.50 per hour). The key pattern is that the estimated average wage for women in each region is the same in columns (a) and (b). Columns (c) and (d) have midwest and northeast as the excluded categories respectively and, with calculations like those we just did, we can see that the estimated average wages for each region are the same in all specifications.

Hence it is important to always remember that the coefficient estimates themselves are

only meaningful with reference to the excluded category. Even though the coefficients on

each dummy variable change across the specifications, the underlying estimates for wages in

each region do not. Think of the difference between Fahrenheit and Celsius – the temperature

is the same, but the number on the thermometer is different.

Thus we don’t need to stress about which category should be excluded because it simply

doesn’t matter. The difference is simply due to the reference category we are using. In the

first specification, we are comparing wages in the northeast, midwest, and south to the west;

in the second specification, we are comparing wages in the northeast, midwest, and west to

the south. The reason that the coefficient on midwest is negative in the first specification

and positive in the second is that women in the midwest earn less than women in the west


(the reference category in specification (a)) and earn more than women in the south (the

reference category in specification (b)). In both specifications (and the subsequent two),

women in the midwest are estimated to earn $10.91 per hour.
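In practice we rarely type out each regional dummy by hand. Here is a sketch of how this might look in Stata with simulated data (the region coding and wage numbers are ours), using factor-variable notation to create the dummies and to change the reference category:

  clear
  set obs 3000
  set seed 11
  generate region = ceil(4*runiform())   // 1 = northeast, 2 = midwest, 3 = south, 4 = west
  generate wage = 14.5*(region == 1) + 10.9*(region == 2) + 10.4*(region == 3) ///
      + 12.5*(region == 4) + rnormal(0, 8)

  * Do not enter region directly; i.region creates dummies and omits one category
  regress wage i.region       // northeast (category 1) is the reference by default
  regress wage ib4.region     // same fit, with west as the reference category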

Remember This
To use dummy variables to control for categorical variables, we include dummy vari-
ables for every category except one.
1. The excluded category is the reference point and all the coefficients on the in-
cluded dummy variables indicate how much higher or lower each group is than
the excluded category.
2. Coefficients differ depending on which excluded category is used, but when inter-
preted appropriately the results do not change across specifications.


Discussion Questions
1. Suppose we wanted to conduct a cross-national study of opinion in North
America and have a variable named “country” that is coded 1 for re-
spondents from the United States, 2 for respondents from Mexico, and
3 for respondents from Canada. Write a model and explain how to
interpret the coefficients.
2. For the results in Table 6.6, indicate what the coefficients are in boxes
(a) through (j).

Table 6.6: Hypothetical Results for Wages and Region Using Different Excluded Categories

              Exclude      Exclude      Exclude      Exclude
              West         South        Midwest      Northeast
Constant      125.0        95.0         (d)          (g)
              (0.9)        (1.1)        (1.0)        (0.9)
Northeast     -5.0         (a)          (e)
              (1.3)        (1.4)        (1.3)
Midwest       -10.0        (b)                       (h)
              (1.4)        (1.5)                     (1.3)
South         -30.0                     (f)          (i)
              (1.4)                     (1.5)        (1.4)
West                       (c)          10.0         (j)
                           (1.4)        (1.4)        (1.3)
N             1000         1000         1000         1000
R2            0.3          0.3          0.3          0.3
Standard errors in parentheses


Case Study: Did Republicans Move to the Right in 2010?

The U.S. Congress matters. A lot. What hap-

pens there affects budgets, laws, and appoint-

ments, and these have implications not only for

the daily life of people in the United States, but

often for people across the globe. And no one

doubts that elections matter greatly for what

happens in Congress.

The 2010 congressional election was a particularly interesting election. Republicans lost

seats in 2006 and 2008, but came storming back with a big win in 2010. No one doubted that

the newly elected Republicans were more conservative than the Democrats they replaced, but

many thought that the Tea Party movement also succeeded in its efforts to elect Republicans

who were more conservative than previously elected Republicans.

Is this view correct? To answer this question we use data on all Republicans elected to

Congress in 2010 (including both those newly elected in 2010 and those who were reelected

in 2010). We can begin with a bivariate OLS model that estimates the difference in means

of Republicans newly elected to Congress in 2010 and other Republicans.

Conservatismi = β0 + β1Newly elected 2010i + εi (6.6)

where Conservatismi is a measure of how conservative Republican representative i is based


on his or her voting record in Congress (ranging from 0.18 for the most moderate Republican to +1 for the most conservative Republican) and Newly elected 2010i is 1 if representative i was newly elected in 2010 and 0 otherwise.6 A positive value for β̂1 would indicate newly elected Republicans were more conservative than Republicans in Congress who had been elected before 2010.

The results from a bivariate model are in Table 6.7. Republicans elected before 2010 had a 0.67 mean level of conservatism (β̂0) while Republicans newly elected in 2010 had a 0.70 mean level of conservatism (β̂0 + β̂1). The difference (β̂1) is not statistically significant.
Table 6.7: Difference of Means of Conservatism of Republicans Elected in 2010

Newly elected 2010 0.027


(0.023)
[t = 1.16]
Constant 0.670
(0.014)
[t = 47.91]
N 241
R2 0.006
The dependent variable is conservatism for Republican members
of Congress. Standard errors in parentheses

This is a bivariate OLS analysis of observational data. As such, we’re suspicious that

there are unmeasured factors lurking in the error term that could be correlated with the

newly elected dummy variable, which would thereby induce endogeneity.

To explore this situation further, let’s consider where newly elected Republicans in 2010
6 The citations in the appendix on page 800 show where to get data measuring the conservatism of every member
of the U.S. Congress ever.


came from. Some replaced GOP representatives who had retired, but most defeated incum-

bent Democrats. What kind of districts had incumbent Democrats? Districts that elected

Democrats in 2008. These districts were probably more liberal than the districts that had

long elected Republicans.

Hence there might be something in the error term (district ideology) that is correlated with the variable of interest (the dummy variable for newly elected Republicans). This could cause bias: district ideology might push these members a bit to the left (because they come from more moderate districts than typical Republican incumbents), which might mask any extra conservatism that the newly elected might manifest. In other words, the newly elected Republicans might come from districts that are different from other Republican districts, and this fact, not that they were newly elected, could affect how conservative they are in Congress. In statistical terms, we are concerned that the estimate of β̂1 from Equation 6.6 will be biased downward.

Figure 6.8 corroborates these suspicions. We measure district liberalism with the percent

of the presidential vote in district i that Barack Obama received in the 2008 presidential

election because the percentage Obama received was higher in more liberal districts. Panel

(a) shows district Obama vote on the X-axis and conservatism of Republican representa-

tives on the Y-axis. A bivariate OLS line is included in the figure, showing that conservatism

is indeed related to Obama vote share: The more votes Obama got, the less conservative

was the Republican representing that district.


Panel (b) of Figure 6.8 shows whether or not a member was newly elected on the X-axis

and Obama vote on the Y-axis. The data have been jittered for ease of viewing and a

bivariate fitted line (that assesses difference of means) is included. It appears that newly

elected members did indeed come from more liberal districts: The average Obama vote for

Republicans elected before 2010 was 0.42 and the average Obama vote for Republicans newly

elected in 2010 was 0.46.

Hence district liberalism appears to satisfy the two conditions for omitted variable bias:

It looks like district liberalism affected the dependent variable (as seen in panel (a) of Figure

6.8) and also was correlated with the independent variable of interest (as seen in panel (b)

of Figure 6.8).

Multivariate OLS allows us to account for district liberalism. We’ll say, OK, the newly

elected Republicans represented districts that were a bit more liberal, so let’s factor in how

much less conservative members of Congress are when their districts are relatively liberal

and see, from that baseline, whether the newly elected members of Congress in 2010 were

more conservative. To do this we estimate the following multivariate equation

Conservatismi = β0 + β1Newly elected 2010i + β2District Obama percent 2008i + εi (6.7)

where District Obama percent 2008i is the percent of the presidential vote in district i that

Barack Obama received in the 2008 presidential election.

Table 6.8 shows the results. The bivariate column is the same as in Table 6.7. In the


[Figure: panel (a) plots ideological conservatism of Republican representatives against district Obama vote in 2008, with a downward-sloping fitted line; panel (b) plots district Obama vote in 2008 by election status in 2010 (first elected before 2010 versus newly elected in 2010).]

FIGURE 6.8: Relation Between Omitted Variable (Obama Vote) and Other Variables


multivariate (a) column we add the District Obama percent 2008 variable and see that the

coefficient on Newly elected 2010 has doubled in size. In terms of difference of means testing,

we can say that controlling for district ideology, the mean conservatism of Republicans who

were newly elected is 0.059 higher on the ideological scale we are using. The standard error

is 0.022 and the t statistic is 2.68, suggesting that we can reject the null hypothesis that the

effect is zero.

The magnitude isn't humongous; being 0.059 more conservative on an ideological scale that ranges from 0.18 to 1.0 for Republicans isn't that big of a deal. But the estimated effect is much larger than the bivariate estimate, consistent with our suspicion that the omission of Obama vote in 2008 was causing omitted variable bias. Without controlling for Obama vote, the fact that more newly elected Republicans came from relatively Obama-friendly districts masked some of the effect of being newly elected in the bivariate model.
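The masking logic can be seen in a stylized Stata simulation (ours, not the chapter's data): the confounder is positively correlated with the dummy and pushes the outcome down, so omitting it biases the dummy's coefficient toward zero.

  clear
  set obs 1000
  set seed 42
  generate new = (runiform() < 0.35)                  // newly elected dummy
  generate obama08 = 0.42 + 0.04*new + rnormal(0, 0.08)
  * True effect of new is +0.06; higher Obama vote lowers conservatism
  generate conservatism = 1.0 + 0.06*new - 0.9*obama08 + rnormal(0, 0.1)

  regress conservatism new            // omits the confounder: estimate is biased toward zero
  regress conservatism new obama08    // controls for it: estimate is close to +0.06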

In the multivariate (b) column we also control for district median income and dummy

variables indicating whether the district was in the south, midwest, or west (northeast is the

excluded category). These variables all seem to matter as their t stats all imply statistically

significant effects. The estimated effect of being newly elected in 2010 in the multivariate (b)

specification is 0.087, which is bigger than the estimate in the multivariate (a) column, imply-

ing that the multivariate (a) specification suffered from omitted variable bias. The estimate

implies that newly elected Republicans were 0.087 units more conservative, controlling for

percent Obama vote, income, and region.


Table 6.8: Multivariate OLS Analysis of Republicans Elected in 2010

                        Bivariate      Multivariate
                                       (a)            (b)            (c)
Newly elected 2010      0.027          0.059*         0.087*         0.087*
                        (0.023)        (0.022)        (0.022)        (0.022)
                        [t = 1.16]     [t = 2.68]     [t = 3.88]     [t = 3.88]
District Obama 2008                    -0.879*        -0.936*        -0.936*
                                       (0.138)        (0.160)        (0.160)
                                       [t = 6.39]     [t = 5.85]     [t = 5.85]
Income                                                0.002*         0.002*
                                                      (0.001)        (0.001)
                                                      [t = 2.42]     [t = 2.42]
South                                                 0.151*
                                                      (0.035)
                                                      [t = 4.27]
Midwest                                               0.174*         0.023
                                                      (0.035)        (0.027)
                                                      [t = 5.01]     [t = 0.87]
West                                                  0.180*         0.030
                                                      (0.037)        (0.029)
                                                      [t = 4.89]     [t = 1.00]
Northeast                                                            -0.151*
                                                                     (0.035)
                                                                     [t = 4.27]
Constant                0.670*         1.043*         0.797*         0.948*
                        (0.014)        (0.060)        (0.084)        (0.065)
                        [t = 47.91]    [t = 17.43]    [t = 9.50]     [t = 14.61]
N                       241            241            241            241
R2                      0.006          0.151          0.254          0.254
Standard errors in parentheses. The dependent variable is conservatism for Republican members of Congress.

Multivariate column (c) shows what happens when we use the south as our excluded

category instead of the northeast. The coefficients on the newly elected dummy variable,

the district Obama 2008 percent, and the income variables are unchanged. Remember:


Changing the excluded category only affects how we interpret the coefficients on the dummy

variables associated with the categorical variable in question; doing so does not affect the

other coefficients. In multivariate column (c) we see that the coefficient on midwest is 0.023

and not statistically significant. Wait a minute! Wasn’t it significant in multivariate column

(b)? Yes, but in that column the coefficient on Midwest was comparing conservatism of midwestern Republican members of Congress to conservatism of northeastern Republican members of Congress and, yes, the Midwesterners are more conservative. In column (c) the

comparison is to Southerners and no, Midwesterners are not significantly more conservative

than Southerners. In fact, we can see in column (b) that Midwesterners were 0.023 more

conservative than Southerners (by comparing the south and midwest coefficients) and this is

exactly the value we get for the midwest coefficient in column (c) when south is the excluded

reference point. We can go through such a thought process for each of the coefficients and see

that the bottom line is that as long as we know how to use dummy variables for categorical

variables, the substantive results are exactly the same in multivariate columns (b) and (c).

Figure 6.9 shows the 95% confidence intervals for the bivariate and multivariate models.

The 95% confidence interval based on the bivariate OLS model ranges from -0.018 to 0.072.7

This confidence interval covers zero, which is another way of saying that the coefficient is not

statistically significant. When we add variables in multivariate specifications (a) and (b),
7 Following the confidence interval guide on page 181, we calculate the confidence interval as β̂1 ± 1.96 × se(β̂1) = 0.027 ± 1.96 × 0.023, which is a range from -0.018 to 0.072.


[Figure: estimated coefficient on the newly elected variable with 95% confidence intervals for the bivariate model, multivariate model (a), and multivariate model (b).]

FIGURE 6.9: Confidence Intervals for Newly Elected Variable in Table 6.8

the 95% confidence interval shifts because controlling for district liberalism, income, and

regional differences leads to larger estimates with confidence intervals not covering zero.8

We don’t need to plot the results for multivariate column (c) because the results for the

non-regional coefficients are identical to column (b).
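For reference, these intervals come straight from the coefficient and standard error pairs in Table 6.8; for example, the multivariate (a) interval can be computed in Stata as:

  * 95% confidence interval = estimate +/- 1.96 x standard error
  display "lower bound: " 0.059 - 1.96*0.022
  display "upper bound: " 0.059 + 1.96*0.022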


8 We use the guide to confidence intervals on page 181 to calculate these intervals. For example, the 95% confidence interval for the multivariate (a) specification is β̂1 ± 1.96 × se(β̂1) = 0.059 ± 1.96 × 0.022, which is a range from 0.016 to 0.102.


6.4 Interaction Variables

Dummy variables can do even more work for us. We may face a situation in which being in

the Dummyi = 1 group does not simply give each individual a bump up or down. It could

be that group membership could interact with another independent variable, changing the

way the independent variable affects Y . For example, it could be that discrimination does

not simply mean that all men get paid more by the same amount. It could be that work

experience for men is more highly rewarded than work experience for women. We address

this possibility with models in which a dummy independent variable interacts with (meaning

is multiplied by) a continuous independent variable.9

The following OLS model allows the effect of X to differ across groups:

Yi = β0 + β1Xi + β2Dummyi + β3Dummyi × Xi + εi (6.8)

The third variable is produced by multiplying the Dummyi variable times the Xi variable.

In a spreadsheet, we would simply create a new column that is the product of the Dummy

and X columns. In statistical software, we simply generate a new variable as described in

the Computing Corner of this chapter.


9 Interactions between continuous variables are created by multiplying two continuous variables together. The general logic is the same. Kam and Franzese (2007) provide an in-depth discussion of all kinds of interactions.


For the Dummyi = 0 group, the fitted value equation simplifies to

Ŷi = β̂0 + β̂1Xi + β̂2Dummyi + β̂3Dummyi × Xi
   = β̂0 + β̂1Xi + β̂2(0) + β̂3(0) × Xi
   = β̂0 + β̂1Xi

In other words, the estimated intercept for the Dummyi = 0 group is β̂0 and the estimated slope is β̂1.

For the Dummyi = 1 group, the fitted value equation simplifies to

Ŷi = β̂0 + β̂1Xi + β̂2Dummyi + β̂3Dummyi × Xi
   = β̂0 + β̂1Xi + β̂2(1) + β̂3(1) × Xi
   = (β̂0 + β̂2) + (β̂1 + β̂3)Xi

In other words, the estimated intercept for the Dummyi = 1 group is β̂0 + β̂2 and the estimated slope is β̂1 + β̂3.

An example of what the fitted lines will look like is in Figure 6.10. As before, the intercept for the Dummyi = 0 group will be β̂0 and the intercept for the Dummyi = 1 group will be β̂0 + β̂2, which is the intercept for everybody plus the bump (up or down) for being in the Dummyi = 1 group.

What's new is that this model allows for the slope to differ by groups such that the fitted lines are no longer parallel. The slope of the line for the Dummyi = 0 group will be β̂1. The slope of the line for the Dummyi = 1 group will be β̂1 + β̂3.


[Figure: salary (in $1,000s) against years of experience with two fitted lines; the line for women (Dummyi = 0) has intercept β̂0 and slope β̂1, and the line for men (Dummyi = 1) has intercept β̂0 + β̂2 and steeper slope β̂1 + β̂3.]

FIGURE 6.10: Fitted Values for Yi = β0 + β1Xi + β2Dummyi + β3Dummyi × Xi + εi


The coefficient on Dummyi × Xi is β̂3. We have to be careful when interpreting it. It is the differential slope for the Dummyi = 1 group, meaning that it tells us how different the effect of X is for the Dummyi = 1 group compared to the Dummyi = 0 group. β̂3 is positive in Figure 6.10, meaning that the slope of the fitted line for the Dummyi = 1 group is steeper than the slope of the line for the Dummyi = 0 group.

If β̂3 were zero, the slope of the fitted line for the Dummyi = 1 group would be the same as the slope of the line for the Dummyi = 0 group. If β̂3 were negative, the slope of the fitted line for the Dummyi = 1 group would be less steep than the slope of the line for the Dummyi = 0 group (or even negative).

Interpreting interaction variables can be a bit tricky sometimes, as β̂3 can be negative but the effect of X on Y for the Dummyi = 1 group could still be positive. For example, if β̂1 = 10 and β̂3 = −3, the slope for the Dummyi = 1 group would be positive because the slope is the sum of the coefficients and therefore equals 7. The negative β̂3 indicates that the slope for the Dummyi = 1 group is less than the slope for the other group; it does not tell us whether the effect of X is positive or negative, though. We have to look at the sum of the coefficients to know that.

Table 6.9 summarizes how to interpret coefficients when dummy-interaction variables are

included.

The standard error of β̂3 is useful for calculating confidence intervals for the difference in


Table 6.9: Interpreting Coefficients in Dummy Interaction Model: Yi = β0 + β1Xi + β2Di + β3Xi × Di

         | β̂3 < 0                          | β̂3 = 0                       | β̂3 > 0
β̂1 < 0   | Slope for Di = 0 group is       | Slope for Di = 0 group is    | Slope for Di = 0 group is
         | negative. Slope for Di = 1      | negative. Slope for Di = 1   | negative. Slope for Di = 1 group
         | group is more negative.         | group is the same.           | is less negative and will be
         |                                 |                              | positive if β̂1 + β̂3 > 0.
β̂1 = 0   | Slope for Di = 0 group is       | Slope for both groups is     | Slope for Di = 0 group is
         | zero and slope for Di = 1       | zero.                        | zero and slope for Di = 1
         | group is negative.              |                              | group is positive.
β̂1 > 0   | Slope for Di = 0 group is       | Slope for Di = 0 group is    | Slope for Di = 0 group is
         | positive. Slope for Di = 1      | positive. Slope for Di = 1   | positive. Slope for Di = 1 group
         | group is less positive and will | group is the same.           | is more positive.
         | be negative if β̂1 + β̂3 < 0.     |                              |

slope coefficients across the two groups. Standard errors for some quantities of interest are tricky, though. To generate confidence intervals for the effect of X on Y we need to be alert. For the Dummyi = 0 group, the effect is simply β̂1 and we can simply use the standard error of β̂1. For the Dummyi = 1 group, the effect is β̂1 + β̂3; the standard error of the effect is more complicated because we have to take into account the standard error of both β̂1 and β̂3 in addition to any correlation between β̂1 and β̂3 (which is associated with the correlation of X1 and X3). The appendix provides more details on how to do this on page 801.
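A sketch of the mechanics in Stata with simulated salary data (the names and numbers are ours): the interaction is just the product of the dummy and X, and lincom adds the two slope coefficients and reports the standard error of the sum, which handles the correlation issue noted above.

  clear
  set obs 800
  set seed 8
  generate male = (runiform() < 0.5)
  generate exper = 10*runiform()
  generate salary = 33 + 2*exper + 3*male + 1.5*male*exper + rnormal(0, 4)

  * Build the interaction by hand and include all three terms
  generate male_exper = male*exper
  regress salary exper male male_exper

  * Slope for the male group: the sum of the two coefficients, with its standard error
  lincom exper + male_exper

  * (Stata's factor-variable syntax c.exper##i.male builds the same interaction automatically)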


Remember This
Interaction variables allow us to estimate effects that depend on more than one variable.
1. A dummy interaction is created by multiplying a dummy variable times another variable.
2. Including a dummy interaction in a multivariate regression allows us to conduct a difference of means test while controlling for other factors with a model such as
   Yi = β0 + β1Xi + β2Dummyi + β3Dummyi × Xi + εi (6.9)
3. The fitted values from this model will be two lines. For the model as written, the slope for the group for which Dummyi = 0 will be β1. The slope for the group for which Dummyi = 1 will be β1 + β3.
4. The coefficient on a dummy interaction variable indicates the estimated difference in slope between two groups.


[Figure: six panels, (a) through (f), each showing a fitted line for the Dummyi = 1 group and a fitted line for the Dummyi = 0 group, with X and Y running from 0 to 10; used in the discussion questions below.]

FIGURE 6.11: Various Fitted Lines from Dummy Interaction Models

Discussion Questions
1. For each panel in Figure 6.11, indicate whether each of β0, β1, β2, and β3 is less than, equal to, or greater than zero for the following model:
   Yi = β0 + β1Xi + β2Dummyi + β3Dummyi × Xi + εi

2. Express the value of β̂3 in panel (d) in terms of other coefficients.

3. True or false: If β̂3 < 0, an increase in X for the treatment group is associated with a decline in Y.


Case Study: Energy Efficiency

Energy efficiency promises a double whammy of

benefits: Reduce the amount of energy used and

we can save the world and save money. What’s

not to love?

But do energy saving devices really deliver?

The skeptic in us should worry that energy sav-

ings may be overpromised. In this case study, we analyze the energy used to heat a house

before and after the homeowner installed a programmable thermostat. The attraction of a

programmable thermostat is that it allows the user to pre-set temperatures at energy efficient

levels, especially for the middle of the night when the house doesn’t need to be as warm (or

as cool for hot summer nights).

Figure 6.12 shows a scatterplot of monthly observations of the gas used in the house

(measured in therms) and heating degree days, a measure of how cold it was in the month.10

We’ve marked the months without a programmable thermostat as squares and the months

with the programmable thermostat with circles.

Visually, we immediately see that heating goes up as it gets colder. Not a huge surprise.
10 For each day, the heating degree day is measured as the number of degrees that a day's average temperature is below
65 degrees Fahrenheit, the temperature below which buildings may need to be heated. The monthly measure adds
up the daily measures and provides a rough measure of how much need for heating there was in the month. If the
temperature is above 65 degrees, the heating degree day measure will be zero.


[Figure 6.12 is a scatterplot of therms (amount of gas used in the home, roughly 0 to 300) against heating degree days (0 to 1,000), with months without a programmable thermostat shown as squares and months with a programmable thermostat shown as circles.]

FIGURE 6.12: Heating Used and Heating Degree Days for Homeowner Who Installed a Programmable Thermostat


We also can see the possibility that the programmable thermostat lowered gas usage because

the observations with the programmable thermostat seem lower. However, it is not clear how

large the effect is and whether it is statistically significant.

We need a model to get a more precise answer. What model is best? Let’s start with a

very basic difference-of-means model:

Thermsi = β0 + β1 Programmable thermostati + εi

The results for this model are in column (a) of Table 6.10 and indicate that the homeowner

used 13.02 fewer therms of energy in months when he had the programmable thermostat than

without it. Therms cost about $1.59 at this time, so the homeowner saved roughly $20.70

per month on average. That’s not bad. However, the effect is not statistically significant

(not even close, really, as the t statistic is only 0.54), so based on this result we should be

skeptical that the thermostat saved money.

The difference of means model does not control for anything else and we know that the

coefficient on the programmable thermostat variable will be biased if there is some other

variable that matters and is correlated with the programmable thermostat variable. In this

case, we know unambiguously that heating degree days matters and it is plausible that the

heating degree days differed in the months with and without the programmable thermostat.

Hence a better model is clearly

Thermsi = β0 + β1 Programmable thermostati + β2 Heating degree daysi + εi


Table 6.10: Programmable Thermostat and Home Heating Bill

                                          (a)           (b)           (c)
Programmable thermostat                -13.02        -20.05*        -0.48
                                       (23.94)        (4.49)        (4.15)
                                     [t = 0.54]     [t = 4.46]    [t = 0.11]
Heating degree days (HDD)                              0.22*         0.26*
                                                      (0.006)       (0.007)
                                                   [t = 34.42]   [t = 38.68]
Programmable thermostat × HDD                                      -0.062*
                                                                    (0.009)
                                                                  [t = 7.00]
Constant                                81.52*        14.70*         4.24
                                       (17.49)        (3.81)        (3.00)
                                     [t = 4.66]     [t = 3.86]    [t = 1.41]
N                                          45            45            45
σ̂                                       80.12         15.00         10.25
R2                                      0.007         0.966         0.985
Standard errors in parentheses. * indicates significance at p < 0.05

The results for this model are in column (b) of Table 6.10. The heating degree day variable is hugely (massively, superlatively) statistically significant. Including it also leads to a larger coefficient (in absolute value) on the programmable thermostat variable, which is now -20.05, an estimated savings of about 20 therms per month. The standard error on the programmable thermostat variable also goes down a ton because of the much smaller σ̂, which reflects the much better fit we get by including the heating degree day variable. The effect of the programmable thermostat variable is statistically significant and, given a cost of $1.59 per therm, the savings is about $31.87 per month. Because a programmable thermostat costs about $60 plus installation, the programmable thermostat should pay for itself pretty quickly.

However, something about these results should nag at us. This is about gas usage only, which in this house goes overwhelmingly to heating (with the rest going to heat water and


for the stove). Does it make sense that the programmable thermostat should save $30 in the

middle of the summer? The furnace is never on and, well, that’s a lot of scrambled eggs on

the stove to give up to save that much money.

If we think about it, the effect of the thermostat must be interactive. That is, the

thermostat can save more money in cold months, when turning the thermostat down at

night will save money.

Thermsi = β0 + β1 Programmable thermostati + β2 Heating degree daysi
                + β3 Programmable thermostati × HDDi + εi

The results for this model are in column (c) of Table 6.10. The coefficient on programmable thermostat indicates the difference in therms when the other variables are zero. Because both of the other variables have heating degree days in them, the coefficient on programmable thermostat indicates the effect of the thermostat when heating degree days are zero (meaning the weather is warm for the whole month). The coefficient of -0.479 with a t statistic of 0.11 indicates no effect at all. This might seem to be bad news, but it is actually good news for us, given that we have figured out that the programmable thermostat can't reduce heating costs when the furnace isn't running.

The main effect of the thermostat is captured by the coefficient on the interactive term, Programmable thermostat × HDD. This coefficient is -0.062 and is highly statistically significant with a t statistic of 7.00. For every one-unit increase in HDD, the programmable thermostat


lowered the therms used by 0.062. In a month with the heating degree day variable equal to 500, the homeowner is estimated to reduce therms used by 500 × 0.062 = 31 after the programmable thermostat was installed (which lowers the bill by $49.29 at $1.59 per therm). In a month with the heating degree day variable equal to 1,000, the homeowner is estimated to reduce therms used by 1,000 × 0.062 = 62 (which lowers the bill by $98.58 at $1.59 per therm). Suddenly we're talking real money. And we're doing so with a model that makes intuitive sense because the savings should indeed differ depending on how cold it is.11
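To see where these numbers come from, here is a small sketch of the calculation in R, plugging in the interaction coefficient from column (c) of Table 6.10 (the object names are ours):

b3 <- -0.062                               # coefficient on Programmable thermostat x HDD
price_per_therm <- 1.59                    # dollars per therm
hdd <- c(500, 1000)                        # heating degree days in a month

therm_savings <- -b3 * hdd                 # 31 and 62 therms
dollar_savings <- therm_savings * price_per_therm   # about $49.29 and $98.58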

This case provides an excellent example of how useful – and distinctive – the dummy

variable models we’ve presented in this chapter can be. In panel (a) of Figure 6.13 we show

the fitted values based on model (b) in Table 6.10, which controls for heating degree days

but models the effect of the thermostat as a constant difference across all values of heating

degree day. The effect is statistically significant and rather substantial, but it doesn’t ring

true because it suggests the savings from reduced use of gas for the furnace are the same

in a warm summer month as in a frigid winter month. Panel (b) of Figure 6.13 shows the

fitted values based on model (c) in Table 6.10, which allows the effect of the thermostat to

vary depending on the heating degree days. This is an interactive model that yields fitted

lines with different slopes. Just by looking at the lines, we can see the fitted lines for model
11 We might be worried about correlated errors given that this is time series data. As discussed on page 104,
the coefficient estimates are not biased if the errors are correlated, but standard OLS standard errors might not be
appropriate. In Chapter 13 we show how to estimate models with correlated errors. Spoiler alert: The results get a
bit stronger.


[Figure 6.13 shows two panels plotting therms against heating degree days, with months with and without the programmable thermostat marked separately: panel (a) adds the fitted lines from model (b) of Table 6.10, and panel (b) adds the fitted lines from the interactive model (c).]

FIGURE 6.13: Heating Used and Heating Degree Days with Fitted Values for Different Models


(c) fit the data better. The effects are statistically significant and substantial and, perhaps

most importantly, make more sense because the effect of the programmable thermostat on

heating gas used increases as the month gets colder.

6.5 Conclusion

Dummy variables are incredibly useful. Despite a less-than-flattering name, they do some of

the most important work in all of statistics. Experiments almost always are analyzed with

treatment group dummy variables. A huge proportion of observational studies care about or

control for dummy variables such as gender or race. And, when interacted with continuous

variables, dummy variables allow us to investigate whether the effects of certain variables

differ by group.

We have mastered the core points of this chapter when we can do the following.

• Section 6.1: Write down a model for a difference of means test using bivariate OLS.

Which parameter measures the estimated difference? Sketch a diagram that illustrates

the meaning of this parameter.

• Section 6.2: Write down a model for a difference of means test using multivariate OLS.

Which parameter measures the estimated difference? Sketch a diagram that illustrates

the meaning of this parameter.

• Section 6.3: Explain how to incorporate categorical variables in OLS models. What


is the excluded category? Explain why coefficient estimates change when the excluded

category changes.

• Section 6.4: Write down a model that has a dummy variable (D) interaction with

a continuous variable (X). How do we explain the effect of X on Y ? Sketch the

relationship for Di = 0 observations and Di = 1 observations.

Further Reading

Brambor, Clark, and Golder (2006) and Kam and Franzese (2007) both provide excellent

discussions of interactions, including the appropriate interpretation of models with two con-

tinuous variables interacted. Braumoeller (2004) does a good job injecting caution into the

interpretation of coefficients on lower order terms when interaction variables are included in

the model.

Key Terms
• Categorical variable (277)
• Control group (257)
• Dichotomous variable (257)
• Difference of means test (257)
• Dummy variable (257)
• Excluded category (278)
• Jitter (264)


• Ordinal variable (276)


• Nominal variable (277)
• Reference category (278)
• Treatment group (257)

Computing Corner

Stata
1. A difference of means test in OLS is simply reg Y Dum. This command will produce an
identical estimate, standard error, and t statistic as ttest Y, by(Dum). To allow the
variance to differ across the two groups, the OLS model is reg Y Dum, robust and the
stand-alone t test is ttest Y, by(Dum) unequal.
2. To create an interaction variable named “DumInteract”, simply type gen DumInteract
= Dum * X where Dum is the name of the dummy variable and X is the name of the
continuous variable.
3. Page 801 discusses how to generate a standard error in Stata for the effect of X on Y
for the Dummyi = 1 group.

R
1. A difference of means test in OLS is simply lm(Y ~ Dum). This command will produce
an identical estimate, standard error, and t statistic as t.test(Y[Dum==1], Y[Dum==0],
var.equal = TRUE). To allow the variance to differ across the two groups, the stand-
alone t test is t.test(Y[Dum==1], Y[Dum==0], var.equal = FALSE). The OLS ver-
sion of this model takes a bit more work, as it involves estimating the heteroscedasticity-
consistent standard error model described on page 130. With the lmtest and sandwich
packages loaded, it is
OLSResults = lm(Y ~ Dum)
coeftest(OLSResults, vcov = vcovHC(OLSResults, type = "HC1"))
2. To create an interaction variable named “DumInteract”, simply type DumInteract =
Dum * X where Dum is the name of the dummy variable and X is the name of the
continuous variable.
3. Page 801 discusses how to generate a standard error in R for the effect of X on Y for
the Dummyi = 1 group.
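A sketch of one way to compute that standard error by hand (the approach on page 801 may differ in details), assuming variables Y, X, and a numeric 0/1 dummy Dum, with the interaction variable created as above:

DumInteract = Dum * X
OLSResults = lm(Y ~ X + Dum + DumInteract)
V = vcov(OLSResults)                      # variance-covariance matrix of the coefficient estimates
effect = coef(OLSResults)["X"] + coef(OLSResults)["DumInteract"]
se = sqrt(V["X", "X"] + V["DumInteract", "DumInteract"] + 2 * V["X", "DumInteract"])
effect / se                               # t statistic for the effect of X in the Dum = 1 group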


Exercises
1. Use data from heightwage.dta that we used in Chapter 5 on page 248.
a. Estimate an OLS regression model with adult wages as the dependent variable and
adult height, adolescent height, and a dummy variable for males as the independent
variables. Does controlling for gender affect the results?
b. Generate a female dummy variable. Estimate a model with both a male dummy
variable and a female dummy variable. What happens? Why?
c. Re-estimate the model from part (a) separately for males and females. Do these
results differ from the model in which male was included as a dummy variable? Why
or why not?
d. Estimate a model in which adult wages is the dependent variable and in which
there are controls for adult and adolescent height in addition to dummy variable
interactions of male times each of the two height variables. Compare the results to
the results from part (c).
e. Estimate a model in which adult wages is the dependent variable and in which
there are controls for male, adult height, adolescent height and two dummy variable
interactions of male times each of the two height variables. Compare the results to
the results from part (c).
f. Every observation is categorized into one of four regions based on where they lived
in 1996. The four regions are northeast (norest96), midwest (norcen96), south
(south96), and west (west96). Add dummy variables for regions to a model ex-
plaining wages in 1996 as a function of height in 1981, male, and male times height
in 1981. First exclude west, then exclude south and explain the changes to the
coefficients on the height variables and the regional dummy variables.
2. These questions are based on a paper “The Fed may be politically independent, but it
is not politically indifferent” by William Clark and Vincent Arel-Bundock (2013). The
paper explores the relationship between elections and the federal funds rate (FFR).
The FFR is the average interest rate at which federal funds trade in a day and is
often a benchmark for financial markets. Table 6.11 describes the variables from
fed 2012 kkedits.dta that we use in this problem.
a. Create two scatterplots, one when a Democrat is in office and one when a Republican
is in office, showing the relationship between the federal funds rate and the distance
to election. Comment on the differences in the relationships. The variable election
is coded 0 to 15, representing each quarter from one election to the next. For each
presidential term, the value of election is zero in the first quarter containing the
election and 15 in the quarter before the next election.


Table 6.11: Variables for Monetary Policy Questions

Variable Description
FEDFUNDS Effective federal funds rate (in percent)
lag FEDFUNDS Lagged effective federal funds rate (in percent)
Democrat Democrat = 1, Republican = 0
Election Quarter since previous election (0 to 15)
Inflation Annualized inflation rate (1% inflation = 1.00)
DATE Date

b. Create an interaction variable between election and democrat to test whether or not
closeness to elections has the same effect on Democrats and Republicans. Run a
model with the federal funds rate as the dependent variable, allowing the effect of
the election variable to vary by party of the president.
i. What change in federal fund rates is associated with a one unit increase in the
election variable when the president is a Republican?
ii. What change in federal fund rates is associated with a one unit increase in the
election variable when the president is a Democrat?
c. Is the effect of election statistically significant under Republicans? (Easy.) Is the
effect of election statistically significant under Democrats? (Not so easy.) How can
the answer be determined? Run any additional tests if necessary.
d. Graph two fitted lines for relationship between elections and interest rates, one for
Republicans and one for Democrats. (Use the twoway and lfit commands with ap-
propriate if statements; label by hand.) Briefly describe the relationship.
e. Re-run the model controlling for both the interest rate in the previous quarter
(lag FEDFUND) and inflation and discuss results, focusing on (i) effect of election
for Republicans, (ii) the differential effect of election for Democrats, (iii) impact of
lagged federal funds rate, and (iv) inflation. (Simply report the statistical signifi-
cance of the coefficient estimates; don’t go through the entire analysis from part (c)
above.)
3. This problem uses the cell phone and traffic dataset described in Chapter 5 on page 250
to analyze the relationship between cell and texting bans and traffic fatalities. We add
two variables: cell ban is coded 1 if it is illegal to operate a hand-held cell phone while
driving and 0 otherwise; text ban is coded 1 if it is illegal to text while driving and
0 otherwise.
a. Add the dummy variables for cell phone bans and texting bans to the model from
Chapter 5 on page 250. Interpret the coefficients on these dummy variables.


[Figure 6.14 plots the marginal effect of a text ban on traffic deaths (roughly 500 to -1,500) against total miles driven (0 to 300,000), with the marginal effect shown as a solid line and its 95% confidence interval shown as dashed lines.]

FIGURE 6.14: Marginal Effect of Text Ban as Total Miles Changes

b. Based on the results from part (a), how many lives would be saved if California
had a cell phone ban? How many lives would be saved if Wyoming had a cell phone
ban? Discuss implications for the proper specification of the model.
c. Estimate a model in which total miles is interacted with both the cell phone ban and
texting ban variables. What is the estimated effect of a cell phone ban for California?
For Wyoming? What is the effect of a texting ban for California? For Wyoming?
What is the effect of total miles?
d. This question uses material from page 801. Figure 6.14 displays the effect of the cell
phone ban as a function of total miles. The confidence intervals are depicted with
the dashed lines. Identify the points on the fitted lines for the estimated effects for
California and Wyoming from the results in part (c). Explain the conditions under
which the cell phone ban has a statistically significant effect.12
4. In this problem we continue analyzing the speeding ticket data first introduced in Chap-
ter 5 on page 251. The variables we use are in Table 6.12.
a. Implement a simple difference of means test using OLS to assess whether the fines
for men and women are different. Do we have any reason to expect endogeneity?
Explain.
b. Implement a difference of means for men and women that controls for age and miles
per hour. Do we have any reason to expect endogeneity? Explain.
12 Brambor, Clark, and Golder (2006) provide Stata code to create plots like this for models with interaction variables.


Table 6.12: Variables for Speeding Ticket Data

Variable name Description


MPHover Miles per hour over the speed limit
Amount Assessed fine for the ticket
Age Age of driver
Female Equals 1 for women and 0 for men
Black Equals 1 for African-Americans and 0 otherwise
Hispanic Equals 1 for Hispanics and 0 otherwise
StatePol Equals 1 if ticketing officer was state patrol officer
OutTown Equals 1 if driver from out of town and 0 otherwise
OutState Equals 1 if driver from out of state and 0 otherwise

c. Building from the above model, also assess whether there are differences in fines for
African Americans and Hispanics. Explain what the coefficients on these variables
mean.
d. Look at standard errors on coefficients for the Female, Black, and Hispanic variables.
Why are they different?
e. Within a single OLS model, assess whether miles over the speed limit has a differential
effect on the fines for women, African Americans, and Hispanics.
5. There is a consensus among economists that increasing government spending and cut-
ting taxes boost economic growth during recessions. Do regular citizens share in this
consensus? We care because political leaders often feel pressure to do what voters want
whether or not it would be effective.
To get at this issue, a 2012 YouGov survey asked people questions about what would
happen to unemployment if the government raised taxes or increased government spend-
ing. Answers were coded into three categories based on how consistent they were with
the economic consensus. On the tax question, people who said raising taxes would
raise unemployment were coded as “3” (the correct answer), people who said raising
taxes would have no effect on unemployment were coded as “2” and people who said
raising taxes would lower unemployment were coded as “1”. On the spending question,
people who said raising government spending would lower unemployment were coded
as “3” (the correct answer), people who said raising spending would have no effect on
unemployment were coded as “2,” and people who said raising spending would raise
unemployment were coded as “1.”
a. Estimate two bivariate OLS models in which political knowledge predicts the answers.
In one model, use the tax dependent variable; in the other model, use the spending


dependent variable. The model will be


Answeri = β0 + β1 Political knowledgei + εi
where Answeri is the correctness of answers, coded as described above. We measure
political knowledge based on how many of nine factual questions about government
each person answered correctly. (Respondents were asked to identify the Vice Presi-
dent, the Chief Justice of the Supreme Court, and so forth.) Interpret the results.
b. Add partisan affiliation to the model by estimating the following model for each of
the two dependent variables (the tax and spending variables):
Answeri = β0 + β1 Political knowledgei + β2 Republicani + εi
where Republicani is 1 for people who self-identify with the Republican party and 0
for everyone else.13 Explain your results.
c. The effect of party may go beyond simply giving all Republicans a bump up or down
in their answers. It could be that political knowledge interacts with being Republican
such that knowledge has different effects on Republicans than non-Republicans. To
test this, estimate a model that includes a dummy interaction term:
Answeri = β0 + β1 Political knowledgei + β2 Republicani
                + β3 Political knowledgei × Republicani + εi
Explain the results and compare/contrast to the initial bivariate results.

13 We could separate non-Republicans into Democrats and Independents using tools for categorical variables dis-
cussed in Section 6.3. Our conclusions would be generally similar in this particular example.

CHAPTER 7

TRANSFORMING VARIABLES, COMPARING VARIABLES

What makes people happy? Relationships? Wis-

dom? Money? Chocolate? Figure 7.1 provides

an initial look at this question by displaying the

self-reported life satisfaction of U.S. citizens from

the World Values Survey (2008). Each data point

is the average value reported by survey respon-

dents in a two-year age group. The scores range from 1 (“dissatisfied”) to 10 (“satisfied”).1

There is a pretty clear pattern: people start off reasonably satisfied at age 18 and then reality

hits, making them less satisfied until their mid-40s. Happily, things brighten from that point
1 We have used multivariate OLS to net out the effect of income, religiosity, and children from the life satisfaction
scores.


[Figure 7.1 plots average life satisfaction (roughly 4.5 to 8.0 on the 1-to-10 scale) against age (about 20 to 70).]

FIGURE 7.1: Average Life Satisfaction by Age in the United States

onward, such that old folks are generally the happiest bunch. (Who knew?) This pattern is

not an anomaly: Other surveys at other times and in other countries reveal similar patterns.

The relationship is U-shaped (or smile shaped, if you will).2 Given what we’ve done so far,

it may not be obvious how to make OLS estimate such a model. However, OLS is actually

quite flexible and the goal of this chapter is to show off some of the cool tricks OLS can do,

including estimating non-linear relationships like the one we see in the life satisfaction data.

The unifying theme is that each of these tricks involves a transformation of the data or the
2To my knowledge there is no study of chocolate and happiness, but I’m pretty sure it would be an upside down U;
people might get happier the more they eat for a while, but at some point, more chocolate has to lead to unhappiness,
like the kid in Willy Wonka.


model in order to do useful things.

The particular tasks we tackle in this chapter are estimating non-linear models and com-

paring coefficients. Section 7.1 shows how to estimate non-linear effects with polynomial

models. Section 7.2 shows how to produce a different kind of non-linear model using logged

variables, which are particularly useful to characterize effects in percentage terms. Section

7.3 shows how to make OLS coefficients more comparable by standardizing variables. Section

7.4 shows how to formally test whether coefficients differ from each other. The technique

can be used for any hypothesis involving multiple coefficients.

7.1 Quadratic and Polynomial Models

The world doesn’t always move in straight lines and, happily, neither do OLS estimates. In

this section we explain the difference between linear and non-linear models in the regression

context and then introduce quadratic and polynomial models as flexible tools to deal with

non-linear models.

Linear versus non-linear models

The standard OLS model is remarkably flexible. It can, for example, estimate non-linear

effects. This idea might seem a little weird at first. Didn’t we say that OLS is also known

as linear regression (page 67)? How can we estimate non-linear effects with a linear

regression model? The reason is a bit pedantic, but here goes: when we refer to a “linear”


model we mean linear in parameters, which means that the βs aren't squared or cubed or

logged or subject to some other non-linearity.

This means OLS can’t handle models like the following:3

Yi = β0 + β1² X1i + εi
Yi = β0 + β1 X1i + √(β2) X1i + εi

The X’s, though, are fair game and hence we can square, cube, log, or otherwise transform

the X’s to produce fitted curves instead of fitted lines. Therefore both of the following models

are okay in OLS because each β simply multiplies some independent variable that

may or may not be non-linear:

Yi = β0 + β1 X1i + β2 X1i² + εi
Yi = β0 + β1 X1i + β2 √(X1i⁷) + εi

Non-linear relationships are common in the real world. Figure 7.2 shows data on life

expectancy and GDP per capita for countries across the world. We immediately sense that

there is a positive relationship: The wealthier countries definitely have higher life expectancy.

But we also see that the relationship is a curve, rather than a line, because life expectancy

rises rapidly at the lower levels of GDP per capita, but then flattens out. Based on this data,

it’s pretty reasonable to expect an increase of $1,000 in per capita GDP per year could have
3 The world doesn't end if we really want to estimate a model that is non-linear in the βs. We just need something other than OLS to estimate the model. In Chapter 12 we discuss probit models, which are non-linear in the βs.


[Figure 7.2 is a scatterplot of life expectancy (in years, roughly 50 to 80) against GDP per capita ($0 to $120,000) for countries across the world.]

FIGURE 7.2: Life Expectancy and Per Capita GDP in 2011 for All Countries in the World

a fairly substantial effect on life expectancy in a country with low GDP per capita, while an

increase of $1,000 in per capita GDP for a very wealthy country could have only a negligible

effect on life expectancy. Therefore we want to get beyond estimating only straight lines.

Figure 7.3 shows the life expectancy data with two different kinds of fitted lines. Panel

(a) shows a fitted line from a standard OLS model

Life expectancyi = β0 + β1 GDPi + εi                (7.1)

The fit isn’t great. The fitted line is lower than the data for many of the observations

with low GDP values. For observations with high GDP levels, the fitted line dramatically

overestimates life expectancy. As bad as it is, this is the best possible straight line in terms


[Figure 7.3 shows the life expectancy data (in years) plotted against GDP per capita (in thousands of dollars, 0 to 100) in two panels: panel (a) with the linear fitted line and panel (b) with the quadratic fitted curve.]

FIGURE 7.3: Linear and Quadratic Fitted Lines for Life Expectancy Data


of minimizing squared error.

Polynomial models

We can generate a better fit using a polynomial model. Polynomial models include not

only an independent variable, but also the independent variable raised to some power. By

using a polynomial model, we can produce fitted value lines that curve.

The simplest example of a polynomial model is a quadratic model that includes X and

X². The model looks like this:

Yi = β0 + β1 X1i + β2 X1i² + εi                (7.2)

For our life expectancy example, a quadratic model is

Life expectancyi = β0 + β1 GDPi + β2 GDPi² + εi                (7.3)

Panel (b) of Figure 7.3 plots this fitted curve. It better captures the non-linearity in the

data as life expectancy rises rapidly at low levels of GDP and then levels off. The fitted

curve is not perfect. The predicted life expectancy is still a bit low for low values of GDP

and the turn to negative effects seems more dramatic than warranted by the data. We’ll see

how to fix this problem when we cover logged models shortly.

Interpreting coefficients in a polynomial model is different than in a standard OLS model.

Note that the effect of X changes depending on the value of X. In panel (b) of Figure 7.3,

the effect of GDP on life expectancy is large for low values of GDP. That is, when GDP


goes from 0 to $20,000, the fitted value for life expectancy increases relatively rapidly. The

effect of GDP on life expectancy is smaller as GDP gets higher, as the change in fitted life

expectancy when GDP goes from $40,000 to $60,000 is much smaller than the change in

fitted life expectancy when GDP goes from 0 to $20,000. The predicted effect of GDP even

turns negative when GDP goes above $60,000.

We need some calculus to get the specific equation for the effect of X on Y . We refer to

the effect of X on Y as ∂Y/∂X1:

∂Y/∂X1 = β1 + 2β2 X1                (7.4)

This equation means that when interpreting results from a polynomial regression we can’t

look at individual coefficients in isolation, but instead need to know how the coefficients on

X1 and X1² come together to produce the estimated curve.4

Figure 7.4 illustrates more generally the kinds of relationships that a quadratic model

can account for. Each panel illustrates a different quadratic function. Panel (a) shows an

example in which the effect of X is getting bigger as X gets bigger. Panel (b) shows an

example in which the effect of X on Y is getting smaller. In both of the top panels, Y gets

bigger as X gets bigger, but the relationships have a quite different feel.

In panels (c) and (d) there are negative relationships between X and Y : the more X,
4 Equation 7.4 is the result of using standard calculus tools to take the derivative of Y in Equation 7.2 with respect to X1. The derivative is the slope evaluated at a given value of X1. For a linear model, the slope is always the same and is β̂1. The ∂Y in the numerator refers to the change in Y; the ∂X1 in the denominator refers to the change in X1. The fraction ∂Y/∂X1 therefore refers to the change in Y divided by the change in X1, which is the slope.


the less Y . Again, though, we see very different types of relationships. In panel (c) there

is a leveling out, while in panel (d) the negative effect of X on Y is accelerating as X gets

bigger.

A quadratic OLS model can even estimate relationships that change directions. In panel

(e), Y initially gets bigger as X increases, but then it levels out and eventually increases in

X decrease Y . In panel (f), we see the opposite pattern, with Y getting smaller as X rises

for small values of X and eventually Y rising with X.

One of the nice things about using a quadratic specification in OLS is that we don’t have

to know ahead of time whether the relationship is curving down or up, flattening out or

getting steeper. The data will tell us. We can simply estimate a quadratic model and if the

relationship is like the panel (a), the estimated OLS coefficients will yield a curve like in the

panel; if the relationship is like panel (f), OLS will produce coefficients that best fit the data.

So if we have data that looks like any of the patterns in Figure 7.4 we need simply estimate

a quadratic OLS model and we will get fitted lines that reflect the data.

Polynomial models with cubed or higher order terms can account for patterns that wiggle

and bounce even more than the quadratic model. It’s relatively rare to use higher order

polynomial models. Often the data simply doesn’t support such a model. In addition,

using higher order terms without strong theoretical reasons can be a bit fishy – as in raising

the specter of model fishing. A control variable with a high order can be more defensible,

but ideally our main results do not depend on untheorized high order polynomial control


[Figure 7.4 shows six example quadratic curves plotted over X from 0 to 100: (a) Y = -0.1X + 0.02X², (b) Y = 20X - 0.1X², (c) Y = -20X + 0.1X², (d) Y = 1000 + 2X - 0.1X², (e) Y = 10X - 0.1X², and (f) Y = -10X + 0.1X².]

FIGURE 7.4: Examples of Quadratic Fitted Curves


variables.

Remember This
OLS can estimate non-linear effects via polynomial models.
1. A polynomial model includes X raised to powers greater than 1. The general
form is
Yi = β0 + β1 Xi + β2 Xi² + β3 Xi³ + . . . + βk Xiᵏ + εi
2. The most commonly used polynomial model is the quadratic model

Yi = β0 + β1 Xi + β2 Xi² + εi
• The effect of Xi in a quadratic model varies depending on the value of X.
• The estimated effect of a one unit increase in Xi in a quadratic model is
β1 + 2β2 X.
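A minimal sketch of estimating a quadratic model in R and computing the estimated effect at a particular value of X (the data frame and variable names are hypothetical):

# I() tells lm() to square x rather than treat ^ as formula syntax
fit <- lm(y ~ x + I(x^2), data = dat)
b1 <- coef(fit)["x"]
b2 <- coef(fit)["I(x^2)"]

# Estimated effect of a one-unit increase in x, evaluated at x = x0
x0 <- 10
effect_at_x0 <- b1 + 2 * b2 * x0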


Case Study: Global Warming

Climate change may be one of the most im-

portant long-term challenges facing humankind.

We’d really like to know if temperatures have

been increasing and, if so, at what rate.

Figure 7.5 shows global temperatures since

1880. Panel (a) plots global average tempera-

tures by year over time. Temperature is mea-

sured in deviation from average pre-industrial

temperature. The more positive the value, the

more temperature has increased. Clearly there is

an upward trend. But how should we characterize this trend?

Panel (b) of Figure 7.5 includes the fitted line from a bivariate OLS model with Year as

the independent variable.

Temperaturei = β0 + β1 Yeari + εi

The linear model fits reasonably well, although it seems to be underestimating recent tem-

peratures and overestimating temperatures in the 1970s.

Column (a) of Table 7.1 shows the coefficient estimates for the linear model. The esti-

mated β̂1 is 0.006 with a standard error of 0.0003. The t statistic of 18.74 indicates a highly


[Figure 7.5 plots global temperature, measured as the deviation from average pre-industrial temperature in Fahrenheit, against year from roughly 1890 to 2010: panel (a) shows the series itself, panel (b) adds the linear fitted line, and panel (c) adds the quadratic fitted curve.]

FIGURE 7.5: Global Temperature Over Time


statistically significant coefficient. The result suggests that the earth has been getting 0.006

degrees warmer each year since 1879 (when the data series begins).
Table 7.1: Global Temperature from 1879 to 2012

                     (a)              (b)
Constant           -10.46           155.68
                    (0.57)          (30.27)
                 [t = 18.31]      [t = 5.14]
Year                0.006           -0.166
                   (0.0003)         (0.031)
                 [t = 18.74]      [t = 5.31]
Year²                              0.000044
                                  (0.000008)
                                  [t = 5.49]
N                     128              128
R2                   0.73             0.78
Standard errors in parentheses

The data looks pretty non-linear, so we also estimate the following quadratic OLS model:

Temperaturei = β0 + β1 Yeari + β2 Yeari² + εi

in which Year and Year² are independent variables. This model allows us to assess whether

the temperature change has been speeding up or slowing down by allowing us to estimate

a curve in which the change per year in recent years is, depending on the data, larger or

smaller than the change per year in earlier years. We have plotted the fitted line in panel

(c) of Figure 7.5; notice that it is a curve that is getting steeper over time. It fits the data

even better with less underestimation in recent years and less overestimation in the 1970s.

Column (b) of Table 7.1 reports results from the quadratic model. The coefficients on

Year and Year² have t stats over 5, indicating clear statistical significance.


The coefficient on Year is -0.166 and the coefficient on Year² is 0.000044. What the heck do those numbers mean? Not much at a glance. Recall, however, that in a quadratic model an increase in Year by one will be associated with a β̂1 + 2β̂2 Yeari increase in estimated average global temperature. This means the predicted change from an increase in Year by one in 1900 is -0.166 + 2 × 0.000044 × 1900 = 0.0012 degrees. The predicted change in temperature from an increase in Year by one in 2000 is -0.166 + 2 × 0.000044 × 2000 = 0.01 degrees.

In other words, the predicted effect of Year changes over time in the quadratic model.

In particular, the estimated rate of warming in 2000 (0.01 degrees per year) is around eight

times the estimated rate of warming in 1900 (0.0012 degrees per year).
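For readers who want to verify these marginal effects, a quick sketch of the arithmetic in R, using the coefficients reported in column (b) of Table 7.1:

b1 <- -0.166        # coefficient on Year
b2 <- 0.000044      # coefficient on Year squared

b1 + 2 * b2 * 1900  # about 0.0012 degrees per year in 1900
b1 + 2 * b2 * 2000  # about 0.010 degrees per year in 2000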

We won’t pay much attention at this point to the standard errors because errors are almost

surely autocorrelated, which would make the standard errors reported by OLS incorrect and

probably too small. We address autocorrelation and other time-series aspects of this data in

Chapter 13.

7.2 Logged Variables

Empirical analysts, especially in economics, often use logged variables. Logged variables

allow for non-linear relationships but have cool properties that allow us to interpret estimated

effects in percentage terms. In this section we discuss logs and how they work in OLS models


and show how they work in our height and wages example. While we show several different

ways to use logged variables, the key thing to remember is that if there’s a log, there’s a

percentage interpretation of some sort going on.

Logs in regression models

We’ll work with so-called natural logs, which revolve around the constant e which equals

approximately 2.71828 and is, like π ≈ 3.14, one of those numbers that pops up all over in

math. Recall that if e² = 7.38, then ln(7.38) = 2. (We use the notation “ln” to refer to

natural log.) In other words, the natural log of some number k is the exponent to which we

have to raise e to obtain k. The fact that ln(3) = 1.10 means that e^1.10 = 3 (with rounding).

For our purposes, we won’t be using the mathematical properties of logs too much.5 We

instead note that using logged variables in OLS equations can allow us to characterize non-

linear relationships that are broadly similar to panels (a) and (c) of Figure 7.4. In that sense,

these models are not dramatically different than polynomial models.

Models with logged variables have an additional, very attractive feature. The estimated

coefficients can be interpreted directly in percentage terms. That is, with the right logged

model we can produce results that tell us how much a one percent increase in X affects Y .

Often this is a good way to think about empirical questions. For example, suppose we have

wage data for a large number of people across many years and we want to know how inflation
5 We derive the marginal effects in log models in the appendix on page 801.


affects wages. We might begin with an OLS model something like:

Wagesit = β0 + β1 Inflationt + εit                (7.5)

where wages per hour for person i in year t is the dependent variable and the inflation rate in

year t is the independent variable. The estimated β̂1 in this model would tell us the increase

in wages per hour that would be associated with a one unit increase in inflation. At first

glance, this might seem like an OK model, but it is actually absurd. Suppose the model

produces β̂1 = 1.25; that result would say that everyone – whatever their wage level – would

get another $1.25 for every one point increase in inflation. That conclusion is actually kind

of weird: Such a model is, by design, saying that for every one percent increase in inflation,

the richest CEO gets another buck and a quarter per hour, as does the lowliest temp.

What we really want is a model that allows us to estimate by what percentage people's salaries

change with inflation. Using logged variables allows us to do so. For example, we could

estimate a log-linear model in which the dependent variable is transformed by taking the

natural log of it. Such a model would look like

ln(wagesit) = β0 + β1 inflationt + εit                (7.6)

It turns out, through the magic of calculus (presented on page 801), that the β̂1 in this model can be interpreted as the percentage change in Y associated with a one unit increase in X. In other words, if β̂1 = 0.82 we would say that a one unit increase in inflation is associated with a 0.82 percent increase in wages. The CEO would get a 0.82 percent increase of her


high wages; the temp would get a 0.82 percent increase of his low wages. These are very

different dollar amounts.

We can also estimate a model in which the dependent variable is not logged, but the

independent variable is. In a linear-log model the independent variable is transformed by

taking the natural log of it. Such a model would look like

Yi = β0 + β1 lnXi + εi                (7.7)

Here, β1 indicates the effect of a one percent increase in X on Y. We need to divide the

estimated coefficient by 100 to convert it to units of Y . This is one of the odd hiccups in

models with logged variables: The units can be a bit tricky. While we can memorize the

way units work in these various models, the safe course of action here is to simply remember

that we’ll probably have to look up how units in logged models work when we use logged

models after being away from them for a bit.

At the pinnacle of loggy-ness is the so-called log-log model. This model allows us to

estimate elasticity in economic models. Elasticity is the percent change in Y associated

with a percent change in X. For example, if we want to know the elasticity of airline tickets,

we could get data on sales and prices and estimate the following model:

ln(ticket salesit) = β0 + β1 ln(priceit) + εit                (7.8)

where the dependent variable is the natural log of monthly ticket sales on routes (e.g., New

York to Tokyo) and the independent variable is the natural log of the monthly average price


of the tickets on those routes. β̂1 estimates the percentage change in sales when ticket prices

go up by one percent.

Another hiccup with logged models is that the values of the variable being logged must be

greater than zero. The reason is that the mathematical log function is undefined for values

less than or equal to zero.6 Hence, logged models work best with economic variables such as

sales, quantities, and prices. Even there, though, it is not uncommon to see an observation

with zero sales or zero wages and we will be forced to omit such observations.7

Logged models are super easy to estimate; we’ll see how in the Computing Corner. The

key is interpretation. If the model has a logged variable or variables, we know the coefficients

reflect a percentage of some sort, with the exact interpretation depending on which variables

are logged.

Logs in height and wages example

Table 7.2 takes us back to the height and wage data we discussed on page ??. It reports

results from four regressions. In the first specification nothing is logged. Interpretation of

β̂1 is old hat: a one inch increase in adolescent height is associated with a $0.412 increase in
6 Recall that the (natural) log of k is the exponent to which we have to raise e to obtain k. There is no number that we can raise e to and get 0. We can get close by raising e to minus a huge number; for example e^(-100) = 1/e^100, which is very close to zero, but not quite zero.
7 Some people re-code these numbers as something very close to zero (such as 0.0000001) on the reasoning that the log

function is defined for low positive values and the essential information (that the variable is near zero) in such observations is

not lost. However, it’s always a bit sketchy to be changing values (even from zero to a small number), so tread carefully.


predicted hourly wages.


Table 7.2: Different Logged Models of Relationship Between Height and Wages

                            No log     Linear-log    Log-linear     Log-log
Adolescent height           0.412*                     0.033*
                           (0.098)                    (0.015)
                           [t=4.23]                   [t=2.23]
Log adolescent height                   29.316*                      2.362*
                                        (6.834)                     (1.021)
                                        [t=4.29]                    [t=2.31]
Constant                   -13.093    -108.778*        0.001        -7.754
                            (6.897)    (29.092)       (1.031)       (4.348)
                           [t=1.90]    [t=3.74]      [t=0.01]      [t=1.78]
N                            1,910       1,910         1,910         1,910
R2                           0.009       0.010         0.003         0.003
Standard errors in parentheses
* indicates significance at p < 0.05

The second column reports results from a linear-log model in which the dependent variable is not logged and the independent variable is logged. The interpretation of β̂1 is that a one percent increase in X (which is adolescent height in this case) is associated with a 29.316/100 = $0.293 increase in hourly wages. The dividing by 100 is a bit unusual, but no big deal once we get used to it.

The third column reports results from a model in which the dependent variable has been

logged but the independent variable has not been logged. The interpretation of β̂1 here is

that a one inch increase in height is associated with a 3.3% increase in wages.

The fourth column reports a log-log model in which both the dependent variable and

independent variable have been logged. The interpretation of β̂1 here is that a one percent

increase in height is associated with a 2.362 percent increase in wages. Note that in the log-


linear column, the percentage change is expressed on a 0 to 1 scale and in the log-log column it is on a 0 to 100 scale. Yeah, that's a pain; it's just how the math works out.

So which model is best? Sadly, there is no magic bullet in selecting models here, another

hiccup when working with logged models. We can’t simply look at the R2 because they

are not comparable: In the first two models the dependent variable is Y and in the last

two the dependent variable is ln(Y ). As is often the case in statistics, some judgment will

be necessary. If we’re dealing with an economic problem of estimating price elasticity, a

log-log model is natural. In other contexts, we have to decide whether we think the causal

mechanism makes more sense in percentage terms and whether it applies to the dependent

and/or independent variables.


Remember This
1. How to interpret logged models:

Log-linear:  lnYi = β0 + β1 Xi + εi      A one unit increase in X is associated
                                         with a β1 percent change in Y
                                         (on 0 to 1 scale).
Linear-log:  Yi = β0 + β1 lnXi + εi      A one percent increase in X is associated
                                         with a β1/100 change in Y.
Log-log:     lnYi = β0 + β1 lnXi + εi    A one percent increase in X is associated
                                         with a β1 percent change in Y
                                         (on 0 to 100 scale).

2. Logged models have some challenges not found in other models (the three hiccups):
(a) The scale of the β̂ coefficients varies depending on whether the model is
log-linear, linear-log, or log-log.
(b) We cannot log variables that have values less than or equal to zero.
(c) There is no simple test for choosing among log-linear, linear-log, and log-log
models.
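A sketch of how the three logged specifications look in R, assuming a data frame with strictly positive variables y and x (names hypothetical):

loglin <- lm(log(y) ~ x, data = dat)        # log-linear: coefficient on x is the proportional change in y per unit of x
linlog <- lm(y ~ log(x), data = dat)        # linear-log: coefficient/100 is the change in y per one percent increase in x
loglog <- lm(log(y) ~ log(x), data = dat)   # log-log: coefficient is the elasticity (percent change in y per percent change in x)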

7.3 Standardized Coefficients

We frequently want to compare coefficients. That is, we want to say whether X1 or X2 has

a bigger effect on Y . If the variables are on the same scale, this task is pretty easy. For

example, in the height and wages model, both adolescent and adult height are measured in

inches, so we can naturally compare the estimated effects of an inch of adult height versus

an inch of adolescent height.


Challenge of comparing coefficients

When the variables are not on the same scale, we have a tougher time making a direct

comparison. Suppose we want to understand the economics of professional baseball players’

salaries. Players with high batting averages get on base a lot, keeping the offense going

and increasing the odds of scoring. Players who hit home runs score right away, sometimes

in bunches. Which group of players earns more? We might first address this question by

estimating

Salaryi = β0 + β1 Batting averagei + β2 Home runsi + εi

The results are in Table 7.3. The coefficient on batting average is 12,417,629.72. That’s

huge! The coefficient on home runs is 129,627.36. Also big. But nothing like the coefficient

on batting average. Batting average must have a much bigger effect on salaries than home

runs, right?

Umm, no. These variables aren’t comparable. Batting averages typically range from 0.200

to 0.350 (meaning most players get a hit between 20% and 35% of the time). Home runs

per season range from 0 to 73 (with a lot more zeros than 73s!). Each OLS coefficient in the

model tells us what happens if we increase the variable by “1”. For batting average, that’s an

impossibly large increase (going from probability of getting a hit of 0 to a probability of 1.0).

For home runs, that’s just another day at the ballpark. In other words, when “1” means

something very different for two variables, we’d be nuts to directly compare the regression


coefficients on the variables.


Table 7.3: Determinants of Major League Baseball Salaries, 1985 - 2005

Batting average       12,417,629.72
                        (940,985.99)
                         [t = 13.20]
Home runs                 129,627.36
                           (2,889.77)
                         [t = 44.86]
Constant              -2,869,439.40
                        (244,241.12)
                         [t = 11.75]
N                              6,762
R2                              0.30
Standard errors in parentheses

Standardizing coefficients

A convenient trick is to standardize the variables. To do so, we convert variables to standard

deviations from their means. That is, instead of having a variable that indicates a baseball

player’s batting average, we have a variable that indicates how many standard deviations

above or below the average batting average a player was. Instead of having a variable that

indicates home runs, we have a variable that indicates how many standard deviations above

or below the average number of home runs a player hit. The attraction of standardizing

variables is that a one unit increase for both standardized independent variables will be a

standard deviation.

Typically we standardize the dependent variable as well, so that the coefficient on a

standardized independent variable can be interpreted as “Controlling for the other variables


in the model, a one standard deviation increase in X is associated with a β̂1 standard
deviation increase in the dependent variable.”

We standardize variables using the following equation:

VariableStandardized = (Variable − mean(Variable)) / sd(Variable)                (7.9)

where mean(Variable) is the average of the variable for all units in the sample and sd(Variable) is the standard deviation of the variable.
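One way to create standardized variables in R before running the regression (scale() subtracts the mean and divides by the standard deviation; the data frame and variable names are hypothetical):

dat$salary_std <- as.numeric(scale(dat$salary))
dat$ba_std     <- as.numeric(scale(dat$batting_avg))
dat$hr_std     <- as.numeric(scale(dat$home_runs))

# Coefficients from this model are in standard-deviation units
fit_std <- lm(salary_std ~ ba_std + hr_std, data = dat)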

Table 7.4 reports the means and standard deviations of the variables for our baseball

salary example. Table 7.5 then uses these means and standard deviations to report the un-

standardized and standardized values of salary, batting average, and home runs for three

selected players. Player 1 earned $5.85 million. Given that the standard deviation of

salaries in the data set was $2,764,512, the standardized value of this player’s salary is
(5,850,000 − 2,024,616)/2,764,512 = 1.38. In other words, player 1 earned 1.38 standard deviations more than

the average salary. This player’s batting average was 0.267, which is exactly the average.

Hence, his standardized batting average is zero. He hit 43 home runs, which is 2.99 standard

deviations above the mean number of home runs.


Table 7.4: Means and Standard Deviations of Baseball Variables

Variable               Mean        Standard deviation
Salary              2,024,616          2,764,512
Batting average         0.267              0.031
Home runs               12.11              10.31

Table 7.6 displays standardized OLS results along with the unstandardized results from


Table 7.5: Unstandardized and Standardized Values for Three Selected Players

                      Unstandardized                            Standardized
Player ID     Salary    Batting average  Home runs     Salary   Batting average  Home runs
1          5,850,000         0.267           43          1.38        0.00            2.99
2          2,000,000         0.200            4         -0.01       -2.11           -0.79
3            870,000         0.317           33         -0.42        1.56            2.03

Table 7.3. The standardized results allow us to reasonably compare the effects of batting

average and home runs on salary. We see in Table 7.4 that a standard deviation of batting

average is 0.031. The standardized coefficient column tells us that an increase of one standard deviation in batting average is associated with an increase in salary of 0.14 standard deviations. So, for example, a player raising his batting average by 0.031 from 0.267 to 0.298 can expect an increase in salary of 0.14 × $2,764,512 = $387,032. A player who increases his home runs by one standard deviation (which Table 7.4 tells us is 10.31 home runs) can expect a 0.48 standard deviation increase in salary (which is 0.48 × $2,764,512 = $1,326,966).

In other words, home runs have a bigger bang for the buck. Eat your steroid-laced Wheaties,

kids.8

While results from OLS models with standardized variables seem quite different, they are

really only re-scaling the original results. The model fit is the same whether standardized

or unstandardized variables are used. Notice that the R2 is identical. Also, the conclu-

sions about statistical significance are the same in the unstandardized and standardized

regressions; we can see that by comparing the t statistics. Think of the standardization as
8 That’s a joke! Wheaties are gross.


Table 7.6: Standardized Determinants of Major League Baseball Salaries, 1985 - 2005

Unstandardized Standardized
Batting average 12,417,629.72 0.14
(940,985.99) (0.01)
[t = 13.20] [t = 13.20]
Home runs 129,627.36 0.48
(2,889.77) (0.01)
[t = 44.86] [t = 44.86]
Constant -2,869,439.40 0.00
(244,241.12) (0.01)
[t = 11.75] [t = 0.00]
N 6,762 6,762
R2 0.30 0.30
Standard errors in parentheses

something like international currency conversion. In unstandardized form, the coefficients

are reported in different currencies, but in standardized form, the coefficients are reported

in a common currency. The underlying real prices are the same whether they are reported

in dollars, euros, or baht, though.
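In fact, the rescaling can be verified directly: when both the dependent and independent variables are standardized, each standardized coefficient equals the unstandardized coefficient multiplied by sd(X)/sd(Y). This is a standard OLS identity, shown here only as a check on Tables 7.4 and 7.6:

$$\hat{\beta}^{std}_{HR} = \hat{\beta}_{HR} \times \frac{sd(\text{Home runs})}{sd(\text{Salary})} = 129{,}627.36 \times \frac{10.31}{2{,}764{,}512} \approx 0.48$$

The batting average coefficient works the same way: $12{,}417{,}629.72 \times 0.031 / 2{,}764{,}512 \approx 0.14$.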

Remember This
Standardized coefficients allow the effects of two independent variables to be compared.
1. When the independent variable, Xk , and dependent variable are standardized, a
one standard deviation increase in Xk is associated with a β̂k standard deviation increase in the dependent variable.
2. Statistical significance and model fit are the same for unstandardized and stan-
dardized results.


7.4 Hypothesis Testing about Multiple Coefficients

The standardized coefficients on batting average and home runs look quite different. But are

they statistically significantly different from each other? The t statistics in Table 7.6 tell us

that each is statistically significantly different from zero, but they tell us nothing about whether

they are different from each other.

Answering this kind of question is trickier than the t tests we’ve seen because we’re dealing

with more than one estimated coefficient. Both estimates have uncertainty associated with

them and, to make things worse, they may co-vary in ways that we want to take into account.

In this section, we discuss F tests as a solution to this challenge, explain two different types of

commonly used hypotheses about multiple coefficients, and then show how to use R2 results

to implement these tests, including an example for our baseball data.9

F tests

There are several ways to test hypotheses involving multiple coefficients. We focus on an F

test. This test shares features with hypothesis tests discussed earlier (on page 146). When

using an F test, we define null and alternative hypotheses, set a significance level, and compare

a test statistic to a critical value. The new elements are that we use a funky test statistic and

we compare that to a critical value derived from an F distribution rather than a t distribution

or normal distribution. We provide more information on the F distribution in the appendix


9 It is also possible to use t tests to compare multiple coefficients, but F tests are generally easier for this purpose.


on page 783.

The funky test statistic is an F statistic. It is based on R² values from two separate OLS specifications.

The first specification is the unrestricted model, which is simply the full model. For

example, if we have three independent variables, our full model might be

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \epsilon_i \qquad (7.10)$$

The model is called unrestricted because we are imposing no restrictions on what the values of β̂1, β̂2, and β̂3 will be.

The second specification is the so-called restricted model in which we force the computer

to give us results that comport with the null hypothesis. It’s called restricted because we are

restricting the estimated values of β̂1, β̂2, and β̂3 to be consistent with the null hypothesis.

How do we do that? Sounds hard. Actually, it isn’t. We simply take the relationship

implied by the null hypothesis and impose it on the unrestricted model. We can divide

hypotheses involving multiple coefficients into two general cases.

Case 1: Multiple coefficients equal zero under the null hypothesis

It is fairly common to see researchers test a null like H0: β1 = β2 = 0. This is a null that

both coefficients are zero; we reject it if we observe evidence that one or both coefficients are

not equal to zero. This type of hypothesis is particularly useful when we have multicollinear

variables. In such circumstances, the multicollinearity may drive up the standard errors


of the β̂ estimates such that we have very imprecise (and likely statistically insignificant)

estimates for the individual coefficients. By testing the null that both of the multicollinear

variables equal zero, we can at least learn if one (or both) of them is non-zero, even as we

can’t say which one it is because they are so closely related.

In this case, imposing the null hypothesis means making sure that our estimates of β1 and β2 are both zero. The process is actually easy-schmeasy: Just set the coefficients to zero and

see that the resulting model is simply a model without variables X1 and X2 . Specifically,

$$\begin{aligned} Y_i &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \epsilon_i \\ &= \beta_0 + 0 \times X_{1i} + 0 \times X_{2i} + \beta_3 X_{3i} + \epsilon_i \\ &= \beta_0 + \beta_3 X_{3i} + \epsilon_i \end{aligned}$$

Statistical programs such as Stata and R automatically report results for “the” F test

which is a test that the coefficients on all the independent variables equal zero.

Case 2: Two or more coefficients equal each other under the null hypothesis

A more complicated – and interesting – case occurs when we want to test whether the effect

of one variable is larger than the effect of another. In this case, the null hypothesis will be

that both coefficients are the same. For example, if we want to know if the effect of X1

is bigger than the effect of X2, the null hypothesis will be H0: β1 = β2. Note that such a

hypothesis test makes sense only if the scales of X1 and X2 are the same or the two variables


have been standardized.

In this case, imposing the null hypothesis to create the restricted equation involves re-

writing the unrestricted equation so that the two coefficients are the same. We can do so,

for example, by replacing β2 with β1 (because they are equal under the null). After some clean-up, we end up with a model in which β1 = β2. In our case,

$$\begin{aligned} Y_i &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \epsilon_i \\ &= \beta_0 + \beta_1 X_{1i} + \beta_1 X_{2i} + \beta_3 X_{3i} + \epsilon_i \\ &= \beta_0 + \beta_1 (X_{1i} + X_{2i}) + \beta_3 X_{3i} + \epsilon_i \end{aligned}$$

In this restricted equation, increasing X1 or X2 by one unit is associated with a β1 increase in Yi. To estimate this model, we need only to create a new variable X1 + X2 and include that as an independent variable instead of X1 and X2 separately.

The cool thing is that if we increase X1 by one unit, X1i + X2i goes up by one and we expect a β1 increase in Y. At the same time, if we increase X2 by one unit, X1i + X2i goes up by one and we expect a β1 increase in Y. Presto! We have an equation in which the effect of X1 and X2 will necessarily be the same.
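As a concrete sketch of this step in R (the variable names Y, X1, X2, X3 and the data frame dat are generic placeholders, not a data set from this chapter):

unrestricted = lm(Y ~ X1 + X2 + X3, data = dat)
dat$X1plusX2 = dat$X1 + dat$X2   # new variable embodying the null that beta1 = beta2
restricted = lm(Y ~ X1plusX2 + X3, data = dat)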

F tests using R² values

The statistical fits of the unrestricted and restricted models are measured with R²_Unrestricted and R²_Restricted. These are simply the R²s from each separate model. The R²_Unrestricted will


always be higher because the model without restrictions can generate a better model fit than

the same model subject to some restrictions. This conclusion is a little counterintuitive at

first, but note that R²_Unrestricted will be higher than R²_Restricted even when the null hypothesis is

true because, when estimating the unrestricted equation, the software has the option of setting both coefficients to their values under the null (thereby assuring the same fit as in the restricted model) or to any other values, large or small, that improve the fit.

The extent of difference between R²_Unrestricted and R²_Restricted depends on whether or not the null hypothesis is true. If we are testing H0: β1 = β2 = 0 and β1 and β2 really are zero, then restricting them to be zero won't cause the R²_Restricted to be too far from R²_Unrestricted because the optimal values of β̂1 and β̂2 really are around zero. If the null is false and β1 and β2 are much different than zero, then there will be a huge difference between R²_Unrestricted and R²_Restricted because setting them to non-zero values, as happens only in the unrestricted model, improves fit substantially.

Hence, the heart of an F test is the difference between R²_Unrestricted and R²_Restricted. When

the difference is small, imposing the null doesn’t do too much damage to the model fit. When

the difference is large, imposing the null does a lot of damage to model fit.

An F test is based on the F statistic:

$$F_{q,N-k} = \frac{(R^2_{Unrestricted} - R^2_{Restricted})/q}{(1 - R^2_{Unrestricted})/(N - k)} \qquad (7.11)$$


The q term refers to how many constraints are in the null hypothesis. That’s just a fancy

way of saying how many equal signs are in the null hypothesis. So for H0: β1 = β2 the value of q is 1. For H0: β1 = β2 = 0 the value of q is 2. The N − k term is a degrees of freedom term, like what we saw with the t distribution. This is the sample size minus the number of parameters estimated in the unrestricted model. (For example, k for the baseball salary model below will be three because we estimate β̂0, β̂1, and β̂2.) We need to know these terms because the shape

of the F distribution depends on the sample size and the number of constraints in the null,

just as the t distribution shifted based on the number of observations.
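To make Equation 7.11 concrete, here is a small R helper function that computes the F statistic from the two R² values. It is only a sketch of the formula, not a substitute for the automated tests shown in the Computing Corner.

# F statistic from Equation 7.11
# r2_u, r2_r: R-squared from the unrestricted and restricted models
# q: number of restrictions; df_resid: N - k from the unrestricted model
f_stat = function(r2_u, r2_r, q, df_resid) {
  ((r2_u - r2_r)/q) / ((1 - r2_u)/df_resid)
}
f_stat(0.2992, 0.00, q = 2, df_resid = 6759)   # about 1442.8 (the baseball example below)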

The F statistic has the difference of R²_Unrestricted and R²_Restricted in it and also includes some

other bits to ensure that the F statistic is distributed according to an F distribution. The F

distribution describes the relative probability of observing different values of the F statistic

under the null hypothesis. It allows us to know the probability the F statistic will be bigger

than any given number when the null is true. We can use this fact to identify critical values

for our hypothesis tests; we’ll describe how shortly.

How we approach the alternative hypotheses depends on the type of null hypothesis. For

case 1 null hypotheses (in which multiple coefficients are zero under the null hypothesis),

the alternative hypothesis is that at least one of them is not zero. In other words, the

null hypothesis is that they all are zero and the alternative is the negation of that, which is

that one or more of the coefficients is not zero.

For case 2 null hypotheses (in which two or more coefficients are equal under the null


hypothesis), it is possible to have a directional alternative hypothesis that one coefficient

is larger than the other. The critical value remains the same, but we add a requirement

that the coefficients actually go in the direction of the specified alternative hypothesis. For

example, if we are testing H0: β1 = β2 versus HA: β1 > β2 we reject the null in favor of the

alternative hypothesis if the F statistic is bigger than the critical value and β̂1 is actually bigger than β̂2.

This all may sound complicated, but the process isn’t that hard, really. (And, as we

show in the Computing Corner, statistical software makes it really easy.) The crucial step is formulating a null hypothesis and using it to create a restricted equation. This process

is actually pretty easy. If we’re dealing with a case 1 null hypothesis that multiple coefficients

are zero, we simply drop the variables listed in the null in the restricted equation. If we’re

dealing with a case 2 null hypothesis that two or more coefficients are equal to each other,

we simply create a new variable that is the sum of the variables and use that in the restricted

equation instead of the individual variables.

F tests and baseball salaries

To see F testing in action, let’s return to our standardized baseball salary model and first

test the following case 1 null hypothesis: H0: β1 = β2 = 0. The unrestricted equation is

$$Salary_i = \beta_0 + \beta_1 \text{Std. batting average}_i + \beta_2 \text{Std. home runs}_i + \epsilon_i$$

The R²_Unrestricted is 0.2992 (it's usually necessary to be more precise than the 0.30 reported


in Table 7.6 on page 341).

For the restricted model, we simply drop the variables listed in the null hypothesis, yielding

$$Salary_i = \beta_0 + \epsilon_i$$

This is a bit of a silly model, producing an R²_Restricted = 0.00 (because R² is always zero when there are no independent variables to explain the dependent variable). We calculate the F statistic by substituting these values, along with q (which is 2 because there are 2 equal signs in the null hypothesis) and N − k, which is the sample size (6,762) minus 3 (because there are three coefficients estimated in the unrestricted model), which is 6,759. The result is

$$\begin{aligned} F_{q,N-k} &= \frac{(R^2_{Unrestricted} - R^2_{Restricted})/q}{(1 - R^2_{Unrestricted})/(N - k)} \\ &= \frac{(0.2992 - 0.00)/2}{(1 - 0.2992)/6{,}759} \\ &= 1442.846 \end{aligned}$$

The critical value (which we show how to identify in the Computing Corner on pages 359

and 360) is 3.00. Since the F statistic is (way!) higher than the critical value, we reject the

null handily.
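For readers who prefer to let software do the arithmetic, R's anova() function compares nested models and reports the same kind of F test. The data frame and standardized variable names below are assumptions for illustration, not the book's actual file.

unrestricted = lm(salary_std ~ batavg_std + hr_std, data = bb)
restricted = lm(salary_std ~ 1, data = bb)   # restricted model: both coefficients set to zero
anova(restricted, unrestricted)              # reports the F statistic and its p-value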

Or we can easily test which effect is bigger by testing the following case 2 null hypothesis:


H0: β1 = β2. The unrestricted equation is, as before,

$$Salary_i = \beta_0 + \beta_1 \text{Std. batting average}_i + \beta_2 \text{Std. home runs}_i + \epsilon_i$$

The R²_Unrestricted continues to be 0.2992. For the restricted model, we simply replace the individual batting average and home run variables with a variable that is the sum of the two variables:

$$\begin{aligned} Salary_i &= \beta_0 + \beta_1 \text{Std. batting average}_i + \beta_2 \text{Std. home runs}_i + \epsilon_i \\ &= \beta_0 + \beta_1 \text{Std. batting average}_i + \beta_1 \text{Std. home runs}_i + \epsilon_i \\ &= \beta_0 + \beta_1 (\text{Std. batting average}_i + \text{Std. home runs}_i) + \epsilon_i \end{aligned}$$

The R²_Restricted turns out to be 0.2602. We calculate the F statistic by substituting these values, along with q (which is 1 because there is 1 equal sign in the null hypothesis) and N − k, which continues to be 6,759. The result is

$$\begin{aligned} F_{q,N-k} &= \frac{(R^2_{Unrestricted} - R^2_{Restricted})/q}{(1 - R^2_{Unrestricted})/(N - k)} \\ &= \frac{(0.2992 - 0.2602)/1}{(1 - 0.2992)/6{,}759} \\ &= 376.14 \end{aligned}$$

The critical value (which we show how to identify in the Computing Corner on pages 359

and 360) is 3.84. Here too the F statistic is vastly higher than the critical value and we

reject this null hypothesis handily as well.
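Alternatively, the linearHypothesis() function in the car package runs the same kind of test directly from the unrestricted model object; as above, the model object and variable names are assumptions for illustration.

library(car)                                           # install.packages("car") if needed
linearHypothesis(unrestricted, "batavg_std = hr_std")  # tests H0: beta1 = beta2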


Remember This
F tests are useful to test hypotheses involving multiple coefficients. To implement an F test for the following model
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \epsilon_i$$

1. Estimate an unrestricted model that is the full model.


2. Write down the null hypothesis.
3. Estimate a restricted model by using the conditions in the null hypothesis to
restrict the full model,
• Case 1: When the null hypothesis is that multiple coefficients equal zero, we
create a restricted model by simply dropping the variables listed in the null
hypothesis.
• Case 2: When the null hypothesis is that two or more coefficients are equal,
we create a restricted model by replacing the variables listed in the null
hypothesis with a single variable that is the sum of the listed variables.
4. Use the R² values from the unrestricted and restricted models in Equation 7.11 on page 346 and compare to the critical value from the F distribution.
5. The bigger the difference between R²_Unrestricted and R²_Restricted, the more the null hypothesis is reducing fit and, therefore, the more likely we are to reject the null.


Case Study: Comparing Effects of Height Measures

On page 200 in Chapter 5 we assessed the effect of height on income. The final specification had independent variables measuring adult height, adolescent height, and participation in clubs and athletics:

$$Wages_i = \beta_0 + \beta_1 \text{Adult height}_i + \beta_2 \text{Adolescent height}_i + \beta_3 \text{Clubs}_i + \beta_4 \text{Athletics}_i + \epsilon_i$$

Let’s test two different null hypotheses with multiple coefficients. First, let’s test a case

1 null that neither height variable has an effect on wages. This null is H0: β1 = β2 = 0. The restricted equation for this null will be

$$Wages_i = \beta_0 + \beta_3 \text{Clubs}_i + \beta_4 \text{Athletics}_i + \epsilon_i \qquad (7.12)$$

Table 7.7 presents results necessary to test this null. We use an F test that requires R² values from two specifications. Column (a) presents the unrestricted model; at the bottom is the R²_Unrestricted, which is 0.06086. Column (b) presents the restricted model; at the bottom is the R²_Restricted, which is 0.05295. There are two restrictions in this null, meaning q = 2. The sample size is 1,851 and the number of parameters in the unrestricted model is 5, meaning N − k = 1,846.


Table 7.7: Unrestricted and Restricted Models for F tests

                                      Unrestricted   Restricted model for   Restricted model for
                                      model          H0: β1 = β2 = 0        H0: β1 = β2
Adult height                          0.03
                                      (0.20)
                                      [t = 0.17]
Adolescent height                     0.35
                                      (0.19)
                                      [t = 1.82]
Number of clubs                       1.88*          1.91*                  1.89*
                                      (0.28)         (0.28)                 (0.28)
                                      [t = 6.87]     [t = 6.77]             [t = 6.71]
Athletics                             3.02*          3.28*                  3.03*
                                      (0.56)         (0.56)                 (0.56)
                                      [t = 5.36]     [t = 5.85]             [t = 5.39]
Adult height plus adolescent height                                         0.19*
                                                                            (0.05)
                                                                            [t = 3.85]
Constant                              -13.57         13.17*                 -13.91*
                                      (7.05)         (0.41)                 (7.04)
                                      [t = 1.92]     [t = 32.11]            [t = 1.98]
N                                     1,851          1,851                  1,851
R²                                    0.06086        0.05295                0.06050
Standard errors in parentheses. * indicates significance at p < 0.05

Hence, for H0: β1 = β2 = 0,

$$\begin{aligned} F_{q,N-k} &= \frac{(R^2_{Unrestricted} - R^2_{Restricted})/q}{(1 - R^2_{Unrestricted})/(N - k)} \\ F_{2,1846} &= \frac{(0.06086 - 0.05295)/2}{(1 - 0.06086)/1846} \\ &= 7.77 \end{aligned}$$

We have to use software (or tables) to find the critical value. We’ll discuss that process

below on page 359. For q = 2 and N − k = 1,846, the critical value for α = 0.05 is 3.00.

Because our F statistic as calculated above is bigger than that, we can reject the null. In

other words, the data is telling us that if the null were true, it would be very unlikely to see


such a big difference in fit between the unrestricted and restricted models.10

Second, let's test the following case 2 null, H0: β1 = β2. Column (a) still presents the unrestricted model; at the bottom is the R²_Unrestricted, which is 0.06086. The restricted model is different for this null. Following the logic discussed on page 345, it is

$$Wages_i = \beta_0 + \beta_1 (\text{Adult height}_i + \text{Adolescent height}_i) + \beta_3 \text{Clubs}_i + \beta_4 \text{Athletics}_i + \epsilon_i \qquad (7.13)$$

Column (c) presents the results for this restricted model; at the bottom is the R²_Restricted, which is 0.06050. There is one restriction in this null, meaning q = 1. The sample size is still 1,851 and the number of parameters in the unrestricted model is still 5, meaning N − k = 1,846.

Hence, for H0: β1 = β2,

$$\begin{aligned} F_{q,N-k} &= \frac{(R^2_{Unrestricted} - R^2_{Restricted})/q}{(1 - R^2_{Unrestricted})/(N - k)} \\ F_{1,1846} &= \frac{(0.06086 - 0.06050)/1}{(1 - 0.06086)/1846} \\ &= 0.71 \end{aligned}$$

We again have to use software (or tables) to find the critical value. For q = 1 and

N − k = 1,846, the critical value for α = 0.05 is 3.85. Because our F statistic as calculated

above is less than the critical value, we fail to reject the null that the two coefficients are

equal. The coefficients are quite different in the unrestricted model (0.03 and 0.35) but notice
10 The specific value of the F statistic provided by automated software F tests will differ from the above because
they do not round to 3 digits as we have done.


that the standard errors are large enough that we cannot reject the null that either one is

zero. In other words, we have a lot of uncertainty in our estimates. The F test formalizes

this uncertainty by forcing OLS to give us the same coefficient on both height variables and,

when we do this, the overall model fit is pretty close to the model fit when the coefficients

are allowed to vary across the two variables. If the null is true, this result is what we would

expect because imposing the null would not lower R² by very much. If the null were false, then imposing the null would have likely caused a more substantial reduction in R²_Restricted.

7.5 Conclusion

The multivariate OLS model is very powerful and this chapter has worked through some

of its practical capabilities. First, the world is not necessarily linear and the multivariate

model can accommodate a vast array of non-linear relationships. Polynomial models, of

which quadratic models are the most common, can produce fitted lines with increasing

returns, diminishing returns, U-shaped, and upside-down U-shaped relationships. Logged

models allow for effects to be interpreted in percentage terms.

Often we care not only about individual variables, but also about how variables relate to

each other. Which variable has a bigger effect? As a first cut, we can standardize variables

to make them plausibly comparable. If and when the variables are comparable, we can test

for which effect is larger using F tests, a class of hypothesis tests that also allows us to test


other hypotheses about multiple coefficients, such as whether a group of coefficients is all

zero.

Once we have mastered the core points of this chapter, we can do the following.

• Section 7.1: Explain polynomial models and quadratic models. Sketch the various kinds

of relationships that a quadratic model can estimate. Show how to interpret coefficients

from a quadratic model.

• Section 7.2: Explain logged models. How do we interpret coefficients in them?

• Section 7.3: Compare coefficients using standardized variables. How do we standard-

ize an independent variable? How do we interpret the coefficient on a standardized

independent variable?

• Section 7.4: Explain how to test a hypothesis about multiple coefficients. Use an F test to test the following null hypotheses for the model Yi = β0 + β1 X1i + β2 X2i + εi.

– H0: β1 = β2 = 0

– H0: β1 = β2

Further Reading

Empirical papers using logged variables are very common; see, for example, Card (1990).

Zakir Hossain (2011) discusses the use of Box-Cox tests to help decide which functional form


(linear, log-linear, linear-log, or log-log) is best. Achen (1982, 77) critiques standardized

variables, in part because they depend on the standard errors of independent variables in

the sample.

Key Terms
• Elasticity (332)
• F test (342)
• F statistic (343)
• Linear-log model (332)
• Log-linear model (331)
• Log-log model (332)
• Polynomial model (321)
• Quadratic model (321)
• Restricted model (343)
• Specification (343)
• Standardize (338)
• Standardized coefficient (341)
• Unrestricted model (343)

Computing Corner

Stata
1. To estimate a quadratic model in Stata, simply generate a new variable equal to the square of the variable (e.g., gen X1Squared = X1^2) and then include it in a regression (e.g., reg Y X1 X1Squared X2).


2. To estimate a linear-logged model in Stata, simply generate a new variable equal to the log of the independent variable (e.g., gen X1Log = log(X1)) and then include it in a regression (e.g., reg Y X1Log X2). Log-linear and log-log models proceed similarly.
3. In Stata, there is an easy way and a hard way to generate standardized regression coef-
ficients. Here’s the easy way: Simply type , beta at the end of a regression command.
For example, reg salary BattingAverage Homeruns, beta.

salary | Coef. Std. Err. t P>|t| Beta


-------------+---------------------------------------------------
BattingAverage | 1.24e+07 940986 13.20 0.000 .1422752
Homeruns | 129627.4 2889.771 44.86 0.000 .4836231
_cons | -2869439 244241.1 -11.75 0.000 .
-----------------------------------------------------------------

The standardized coefficients are listed on the right under “Beta.”


The hard way isn't very hard. Use Stata's egen command to create standardized versions of every variable in the model:
egen BattingAverage_std = std(BattingAverage)
egen Homeruns_std = std(Homeruns)
egen Salary_std = std(Salary)
Then run a regression with these standardized variables:
reg Salary_std BattingAverage_std Homeruns_std

Salary_std | Coef. Std. Err. t P>|t|
-------------------+---------------------------------------
BattingAverage_std | .1422752 .0107814 13.20 0.000
Homeruns_std | .4836231 .0107814 44.86 0.000
_cons | -2.82e-09 .0101802 -0.00 1.000
----------------------------------------------------

The standardized coefficients are listed, as usual, under “Coef.” Notice that they are
identical to the results from using the , beta command.
4. Stata has a very convenient way to conduct F tests for hypotheses involving multiple
coefficients. Simply estimate the unrestricted model and then type test followed by the restriction implied by the null. For example, to test the null
hypothesis that the coefficients on Height81 and Height85 are both equal to zero, type
the following:
reg Wage Height81 Height85 Clubs Athletics


test Height81 = Height85 = 0


To test the null hypothesis that the coefficients on Height81 and Height85 are equal to
each other, type the following:
reg Wage Height81 Height85 Clubs Athletics
test Height81 = Height85
This code will produce slightly different F statistics than on page 353 due to rounding.
5. To find the critical value from an F distribution for a given α, q, and N − k, use the "inverse F" function in Stata. The "display" function will print this on the screen.
display invF(q, N-k, 1-α)
For example, to calculate the critical value on page 353 for H0: β1 = β2 = 0 type
display invF(2, 1846, 0.95)

R
1. To estimate a quadratic model in R, simply generate a new variable equal to the square of the variable (e.g., X1Squared = X1^2) and then include it in a regression (e.g., lm(Y ~ X1 + X1Squared + X2)).
2. To estimate a linear-logged model in R, simply generate a new variable equal to the log of the independent variable (e.g., X1Log = log(X1)) and then include it in a regression (e.g., lm(Y ~ X1Log + X2)). Log-linear and log-log models proceed similarly.
3. In R, there is an easy way and a hard way to generate standardized regression coefficients. Here's the easy way: Use the scale command in R. This command will automatically create standardized variables on the fly:
summary(lm(scale(Sal) ~ scale(BatAvg) + scale(HR)))
A harder but perhaps more transparent approach is to simply create standardized variables and then use them to estimate a regression model. Standardized variables can be created manually (e.g., Sal_std = (bb$salary - mean(bb$salary))/sqrt(var(bb$salary))). After standardizing all variables, simply run an OLS model using the standardized variables:
summary(lm(Sal_std ~ BatAvg_std + HR_std))

Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.000 0.010 0.00 1.00
BatAvg_std 0.142 0.011 13.20 0.00
HR_std 0.483 0.011 44.86 0.00


4. There are automated functions available on the web to do F tests for hypotheses involv-
ing multiple coefficients, but they require a fair amount of work to get them working at
first. Here we present a manual approach for the tests on page 352.

Unrestricted = lm(Wage ~ Height81 + Height85 + Clubs + Athletics)
# Unrestricted model with all variables
Restricted1 = lm(Wage ~ Clubs + Athletics)
# Restricted model for null that height coefficients are both zero
HeightsAdded = Height81 + Height85
# Creates a new variable that is the sum of the two height variables
Restricted2 = lm(Wage ~ HeightsAdded + Clubs + Athletics)
# Restricted model for null that height coefficients equal each other

R stores R² values and degrees of freedom information for each model and we can access this information by using the "summary" command followed by a dollar sign and the appropriate name. To see the various values of R² for the unrestricted and restricted models type
summary(Unrestricted)$r.squared
summary(Restricted1)$r.squared
summary(Restricted2)$r.squared
To see the degrees of freedom for the unrestricted model, type
summary(Unrestricted)$df[2]
We'll have to keep track of q on our own.
To calculate the F statistic for H0: β1 = β2 = 0 as described on page 353, type
((summary(Unrestricted)$r.squared - summary(Restricted1)$r.squared)/2) /
((1-summary(Unrestricted)$r.squared)/summary(Unrestricted)$df[2])
This code will produce slightly different F statistics than on page 353 due to rounding.
5. To find the critical value from an F distribution for a given α, q, and N − k, type
qf(1-α, df1=q, df2=N-k)
For example, to calculate the critical value on page 353 for H0: β1 = β2 = 0 type
qf(.95, df1=2, df2=1846)

Exercises
1. The relationship between political instability and democracy is important and likely
quite complicated. Do democracies manage conflict in a way that reduces instability
or do they stir up conflict? Using the Instability PS data.dta data set from Zaryab
Iqbal and Christopher Zorn, answer the following questions. The data set covers 157


countries between 1946 and 1997. The unit of observation is the country-year. The
variables are listed in Table 7.8.
Table 7.8: Variables for Political Instability Questions

Variable Description
Ccode Country code
Year Year
Instab Index of instability (revolutions, crises, coups etc). Ranges
from -4.65 to + 10.07
Coldwar Cold war year (1=yes, 0=no)
Gdplag GDP in previous year
Democracy Democracy score in previous year, ranges from 0 (most
autocratic) to 100 (most democratic)

a. Estimate a bivariate model with instability as the dependent variable and democracy
as the independent variable. Because the units of the variables are not intuitive, use
standardized coefficients to interpret. Briefly discuss the estimated relationship and
whether you expect endogeneity.
b. To combat endogeneity, include a variable for lagged GDP. Discuss changes in results,
if any.
c. Perhaps GDP is better conceived of in log terms. Estimate a model with logged
lagged GDP and interpret the coefficient on this GDP variable.
d. Suppose we are interested in whether instability was higher or lower during the Cold
War. Run two models. In the first, add a Cold War dummy variable to the above
specification. In the second model add a logged Cold War dummy variable to the
above specification. Discuss what happens.
e. It is possible that the positive relationship between democracy and political instabil-
ity is due to the fact that in more democratic countries, people feel freer to engage
in confrontational political activities such as demonstrations. It may be, however,
that this relationship is only positive up to a point or that more democracy increases
political instability more at lower levels of political freedom. Estimate a quadratic
model, building off the specification above. Use a figure to depict the estimated
relationship and use calculus to indicate the point at which the sign on democracy
changes.
2. Use the globaled.dta data on education and growth from Hanushek and Woessmann for
this question. The variables are listed in Table 7.9.
a. Use standardized variables to assess whether the effect of test scores is larger than
the effect of years in school on economic growth. At this point, simply compare the


Table 7.9: Variables for Global Education Questions

Variable Description
name Country name
code Country code
ypcgr Average annual growth rate (GDP per capita),
1960-2000
testavg Average combined math and science standardized
test scores, 1964-2003
edavg Average years of schooling over 1960-2000
ypc60 GDP per Capita in 1960
region Region
open Openness of the economy scale
proprts Security of property rights scale

different effects in a meaningful way. We’ll do statistical tests next. The dependent
variable is GDP growth per year. For this part, control for average test scores,
average years of schooling over 1960-2000, and GDP per capita in 1960.
b. Now conduct a statistical test of whether the (appropriately comparable) effects of
test scores and years in school on economic growth are different. Do this test in two
ways: (i) use the test command in Stata and (ii) generate values necessary to use
an F test equation. Report differences/similarities in results.
c. Now control for openness of economy and security of property rights. Which matters
more: test scores or property rights? Use appropriate statistical evidence in your
answer.
3. We will continue the analysis of height and wages in Britain from the homework problem
in Chapter 5 on page 252. We’ll use the data set heightwage british all multivariate.dta
which includes men and women and the variables listed in Table 7.10.11
a. Estimate a model explaining wages at age 33 as a function of female, height at age
16, mother’s education, father’s education, and number of siblings. Use standardized
coefficients to assess whether height or siblings have a larger effect on wages.
b. Implement a difference of means test across males and females using bivariate OLS.
Do this twice: once with female as the dummy variable and the second time with
11For the reasons discussed in the homework problem in Chapter 3 on page 133 we limit the data set to observations
with height greater than 40 inches and self-reported income less than 400 British pounds per hour. We also exclude
observations of individuals who grew shorter from age 16 to age 33. Excluding these observations doesn’t really affect
the results, but these observations are just odd enough to make us think that in these cases there is some kind of
non-trivial measurement error.


Table 7.10: Variables for Height and Wage Data in Britain

Variable name Description


gwage33 Hourly wages (in British pounds) at age 33
height33 Height (in inches) measured at age 33
height16 Height (in inches) measured at age 16
momed Education of mother, measured in years
daded Education of father, measured in years
siblings Number of siblings
female Female indicator variable (1 for women, 0 for men)
LogWage33 Log of hourly wages at age 33

male as the dummy variable (the male variable needs to be generated). Interpret the
coefficient on the gender variable in each model and compare results across models.
c. Now do the same test, but with log of wages at age 33 as the dependent variable.
Use female as the dummy variable. Interpret the coefficient on the female dummy
variable.
d. How much does height explain salary differences across genders? Estimate a differ-
ence of means test across genders controlling for height at age 33 and at age 16.
Explain the results.
e. Does the effect of height vary across genders? Use tools of this chapter to test
for differential effects of height across genders. Use logged wages at age 33 as the
dependent variable and control for height at age 16 and the number of siblings.
Explain the estimated effect of height at age 16 for men and for women.
4. Use the MLBattend.dta we used in Chapter 4 on page 190 and Chapter 5 on page 249.
Which matters more for attendance: winning or runs scored? [To keep us on the same
page, use home attend as the dependent variable and control for wins, runs scored,
runs allowed and season.]
5. In this problem we continue analyzing the speeding ticket data first introduced in Chap-
ter 5 on page 251. The variables we will use are in Table 7.11.
a. Is the effect of age on fines non-linear? Assess this question by estimating a model
with a quadratic age term, controlling for MPHover, Female, Black, and Hispanic.
Interpret the coefficients on the age variables.
b. Sketch the relationship between age and ticket amount from the above quadratic
model. Do so by calculating the fitted value for a white male with 0 MPHover
(probably not a lot of people going zero miles over speed limit got a ticket, but this
simplifies calculations a lot) for ages equal to 20, 25, 30, 35, 40, and 70. (Note:


Table 7.11: Variables for Speeding Ticket Data

Variable name Description


MPHover Miles per hour over the speed limit
Amount Assessed fine for the ticket
Age Age of driver
Female Equals 1 for women and 0 for men
Black Equals 1 for African-Americans and 0 otherwise
Hispanic Equals 1 for Hispanics and 0 otherwise
StatePol Equals 1 if ticketing officer was state patrol officer
OutTown Equals 1 if driver from out of town and 0 otherwise
OutState Equals 1 if driver from out of state and 0 otherwise

Either calculate these by hand (or in Excel) from the estimated coefficients or use Stata's display function. To display the fitted value for a zero-year-old white male in Stata, use display _b[_cons] + _b[Age]*0 + _b[AgeSq]*0^2.)
c. Use Equation 7.4 to calculate the marginal effect of age at ages 20, 35, and 70.
Describe how these marginal effects relate to your sketch.
d. Calculate the age that is associated with the lowest predicted fine based on the
quadratic OLS model results above.
e. Do drivers from out of town and out of state get treated differently? Do state police
treat these non-locals differently than local police? Estimate a model that allows us to
assess whether out-of-towners and out-of-staters are treated differently and whether
state police respond differently to out-of-towners and out-of-staters. Interpret the
coefficients on the relevant variables.
f. Test whether the two state police interaction terms are jointly significant. Briefly
explain the results.

Part II

The Contemporary Statistical Toolkit

CHAPTER 8

USING FIXED EFFECTS MODELS TO FIGHT ENDOGENEITY IN

PANEL DATA AND DIFFERENCE-IN-DIFFERENCE MODELS

Do police reduce crime? It certainly seems plausible that they deter bad guys or get them off the street. It is, however, hardly a foregone conclusion. Maybe cops don't get out of their squad cars enough to do any good. Maybe police officers do some good, but not as much as universal pre-kindergarten does.

It is natural to try to answer the question by using standard OLS to analyze data on crime

and police in cities over time. The problem is that we’d risk getting things wrong, possibly

very wrong, because of endogeneity: factors that cause cities to have lots of police and also


cause lots of crime. These factors include gangs, drugs, and subcultures of hopelessness or

lawlessness. Using standard OLS techniques to estimate a model that does not control for

these factors (and most won’t because these factors are very hard to measure) may produce

estimates suggesting that police cause crime because the places with lots of crime also

have large numbers of police.

In this chapter we introduce fixed-effects models as a simple yet powerful way to fight

endogeneity. As we explain in more detail in this chapter, fixed effects models boil down

to models that have dummy variables that control for otherwise unexplained unit-level dif-

ferences in outcomes across units. The fixed effect approach is broadly applicable and sta-

tistically important. Depending on the nature of the data set, the approach allows us to

control for attributes of individuals, cities, states, countries, and many other units of obser-

vation. The theoretical appeal of fixed effects models is that they reduce the set of possible

causes of endogeneity. The practical appeal of fixed effect models is that they often produce

profoundly different – and more credible – results than basic OLS models.

There are two contexts in which the fixed effect logic is particularly useful. The first

is when we have panel data, which consists of multiple observations for a specific set of

units. Observing annual crime rates in a set of cities over 20 years is an example. So too

is observing national unemployment rates for every year from 1946 to the present for all

advanced economies. Anyone analyzing such data needs to use fixed effects models to be

taken seriously.


The logic behind the fixed effect approach also is important when we conduct difference-

in-difference analysis, which is particularly helpful when evaluating policy changes. In it, we

compare changes in units affected by some policy change to changes in units not affected by

the policy. We show how difference-in-difference methods rely on the logic of fixed effects models

and, in some cases, use the same tools as panel data analysis.

In this chapter, we show the power and ease of implementing fixed effects models. Section

8.1 uses a panel data example to illustrate how basic OLS can fail when the error term is

correlated with the independent variable. Section 8.2 shows how fixed effects can come to

the rescue in this case (and others). In so doing, the section describes how to estimate fixed

effects models using dummy variables or so-called de-meaned data. Section 8.3 explains the

mildly miraculous ability of fixed effects models to control for variables even as the models

are unable to estimate coefficients associated with these variables. This ability is a blessing in

that we control for these variables; it is a curse in that we sometimes are curious about such

coefficients. Section 8.4 extends fixed effect logic to so-called two-way fixed effects models

that control for both unit and time related fixed effects. Section 8.5 discusses difference-in-

difference methods that rely on fixed effect type logic and are widely used in policy analysis.


8.1 The Problem with Pooling

In this section we show how using basic OLS to analyze crime data in U.S. cities over time

can lead us dangerously astray. Understanding the problem helps us understand the merits

of the fixed effects approach we present in the next section.

We explore a data set that covers robberies per capita and police officers per capita in 59

large cities in the United States from 1951 to 1992.1 Table 8.1 presents OLS results from

estimation of the following simple model

$$Crime_{it} = \beta_0 + \beta_1 Police_{i,t-1} + \epsilon_{it} \qquad (8.1)$$

where Crime_it is crime in city i at time t and Police_i,t−1 is a measure of the number of police

on duty in city i in the previous year. It’s common to use lagged police in an effort to avoid

the problem that the number of police in a given year could be simultaneously determined

by the number of crimes in that year. We re-visit this point in Section 8.4. For now, let’s

take it as a fairly conventional modeling choice when analyzing the effect of police on crime.

Notice also that the subscripts have both i’s and t’s in them. This is new and will become

important later.

We’ll refer to this model as a pooled model. In a pooled model, an observation is com-

pletely described by its X variables and nothing is made of the fact that some observations

came from one city and others from another city. For all the computer knew when running
1 This data is from Marvell and Moody (1996). Their paper discusses a more comprehensive analysis of this data.


that model, there were N separate cities producing the data.

Table 8.1 shows the results. The coefficient on the police variable is positive and very

statistically significant. Yikes. More cops, more crime. Weird. In fact, for every additional

police officer per capita, there were 2.37 more robberies per capita. Were we to take these

results at face value, we would believe that cities could eliminate more than two robberies

per capita for every police officer per capita they fired.
Table 8.1: Basic OLS Analysis of Robberies and Police Officers, 1951-1992

Pooled OLS
Lag police, per capita 2.37
(0.07)
[t = 32.59]
N 1,232
Standard errors in parentheses

Of course, we don’t believe the pooled results. We worry that there are unmeasured

factors lurking in the error term that could be correlated with the number of police, thereby

causing bias. The error term in Equation 8.1 contains gangs, drugs, economic hopelessness,

broken families, and much more. If any of those factors is correlated with the number of

police in a given city, we have endogeneity. Given that police are more likely to be deployed

when and where there are gangs, drugs, and economic desolation, it seems inevitable that

our model suffers from endogeneity.

In this chapter we try to eliminate some of this endogeneity by focusing on aspects of

the error associated with each city. To keep our discussion relatively simple, we’ll turn


[Figure: scatterplot of robberies per 1,000 people (vertical axis) against police per 1,000 people (horizontal axis).]
FIGURE 8.1: Robberies and Police for Large Cities in California, 1971-1992

our attention to five California cities: Los Angeles, San Francisco, Oakland, Fresno, and

Sacramento. Figure 8.1 plots their per capita robbery and police data from 1971 to 1992.

Consistent with the OLS results on all cities, the message seems clear that robberies are

more common when there are more police. However, we actually have more information than

displayed in Figure 8.1. We know which city each observation comes from. Figure 8.2 re-

plots the data from Figure 8.1, but in a way that differentiates by city. The underlying data is

exactly the same, but the observations for each city have different shapes. The observations

for Fresno are the circles in the lower left, the observations for Oakland are the triangles in

the top middle, and so forth. What does the relationship between police and crime look like


[Figure: the same scatterplot of robberies per 1,000 people against police per 1,000 people, with the observations labeled by city (Oakland, San Francisco, Sacramento, Los Angeles, and Fresno).]
FIGURE 8.2: Robberies and Police for Specified Cities in California, 1971-1992

now?

It's still a bit hard to see so Figure 8.3 adds a fitted line for each city. These are OLS

regression lines estimated on a city-by-city basis. All are negative and some are dramatically

so (Los Angeles and San Francisco). The claim that police reduce crime is looking much

better. Within each individual city, robberies tend to decline as police increase.

The difference between the pooled OLS results and these city-specific regression lines

presents a puzzle. How can the pooled OLS estimates suggest such a radically different

conclusion than Figure 8.3? The reason is the villain of this book – endogeneity.

Here's how it happens. Think about what's in the error term ε_it in Equation 8.1: gangs,


[Figure: robberies per 1,000 people against police per 1,000 people for Oakland, San Francisco, Sacramento, Los Angeles, and Fresno, with a separate OLS fitted line for each city.]
FIGURE 8.3: Robberies and Police for Specified Cities in California with City-specific Regression Lines, 1971-1992


drugs, and all that. These factors almost certainly affect crime across cities and are

plausibly correlated with the number of police, because cities with bigger gang or drug

problems hire more police officers. Many of these elements in the error term are also stable

within each city, at least in the twenty-year time frame we are looking at. A city that has

a culture or history of crime in year 1 probably has a culture or history of crime in year 20

as well. This is the case in our selected cities: San Francisco has lots of police and many

robberies while Fresno has not so many police and not so many robberies.

And here’s what creates endogeneity: These city-specific baseline levels of crime are

correlated with the independent variable, the number of police. The cities with the most robberies (Oakland, Los

Angeles, and San Francisco) have the most police. The cities with the fewest robberies (Fresno

and Sacramento) have the fewest police. If we are not able to find another variable to control

for whatever is causing these differential levels of baselines – and, if it is something hard to

measure like history or culture or gangs or drugs, we may not be able to – then standard OLS

will have endogeneity-induced bias and will lead us to the spurious inference we discussed

at the start of the chapter.

Test score example

The problem we have identified here occurs in many contexts. Let’s look at another example

to get comfortable with identifying factors that can cause endogeneity. Suppose we want to

assess whether private schools produce better test scores than public schools and we begin


with the following pooled model:

$$\text{Test scores}_{it} = \beta_0 + \beta_1 \text{Private school}_{it} + \epsilon_{it} \qquad (8.2)$$

where Test scores it is test scores of student i at time t and Private school it is a dummy variable

that is 1 if student i is in a private school at time t. This model is for a (hypothetical) data

set in which we observe test scores for specific children over a number of years.

The following three simple questions help us identify possibly troublesome endogeneity.

What is in the error term? Test performance depends potentially not only on whether a

child went to a private school (a variable in the model) but also on his or her intelligence,

diligence, teacher’s ability, family support, and many other factors in the error term. While

we can hope to measure some of these factors, it is a virtual certainty that we will not be

able to measure all of them.

Are there any stable unit-specific elements in the error term? Intelligence, diligence, and

family support are likely to be quite stable for individual students across time.

Are the stable unit-specific elements in the error term likely to be correlated with the indepen-

dent variable? It is quite likely that family support, at least, is correlated with attendance

at private schools because families with the wealth and/or interest in private schools are

likely to provide other kinds of educational support to their children. This tendency is by no

means set in stone because countless kids with good family support go to public schools and


there are certainly kids with no family support who end up in private schools. On average,

though, it is reasonable to suspect that kids in private schools have more family support. If

this is the case, then what may look to be a causal effect of private schools on test scores

may be little more than an indirect effect of family support on test scores.

Remember This
1. A pooled model with panel data ignores the panel nature of the data. The
equation is
Yit = β0 + β1 Xit + εit
2. A common source of endogeneity when using a pooled model to analyze panel
data is that the specific units have different baseline levels of Y and these levels
are correlated with X. For example, cities with higher crime (meaning they have
high unit-specific error terms) also tend to have more police, creating a correlation
in a pooled model between the error term and the police independent variable.

8.2 Fixed Effects Models

In this section, we introduce fixed effects as a way to deal with at least part of the endogeneity

described in the previous section. We define the term and then show two ways to estimate

basic fixed effects models.

Starting with Equation 8.1, we divide the error term, ε_it, into a fixed effect, α_i, and a random error term ν_it (the Greek letter nu, pronounced "new" even though it looks like a "v"). Our focus here will be on α_i; we'll assume the ν_it part of the error term is well-behaved,

meaning that it is homoscedastic and not correlated with any independent variable. We


rewrite our model as

$$\begin{aligned} Crime_{it} &= \beta_0 + \beta_1 Police_{i,t-1} + \epsilon_{it} \\ &= \beta_0 + \beta_1 Police_{i,t-1} + \alpha_i + \nu_{it} \end{aligned}$$

More generally, a fixed effect model looks like

$$Y_{it} = \beta_0 + \beta_1 X_{1it} + \alpha_i + \nu_{it} \qquad (8.3)$$

A fixed effects model is simply a model that contains a parameter like α_i that captures

differences in the dependent variable associated with each unit and/or period.

The fixed effect α_i is that part of the unobserved error that is the same value for every

observation for unit i. It basically reflects the average value of the dependent variable for

unit i, after controlling for the independent variables. The unit is the unit of observation.

In our city crime example, the unit of observation is the city.

Even though we write down only a single parameter (α_i), we're actually representing a different value for each unit. That is, this parameter takes on a potentially different value for each unit. In the city crime model, the value of α_i will be different for each city. If Pittsburgh had a higher average number of robberies than Portland, the α_i for Pittsburgh will be higher than the α_i for Portland.

The amazing thing about the fixed effects parameter is that it allows us to control for

a vast array of unmeasured attributes of units in the data set. These could correspond to

historical, geographical, or institutional factors. Or these attributes could relate to things


we haven’t even thought of. The key is that the fixed effect term allows different units to

have different baseline levels of the dependent variable.

Why is it useful to model fixed effects in this way? When fixed effects are in the error

term, as in the pooled OLS model, they can cause endogeneity and bias. But if we can pull

them out of the error term we will have overcome this source of endogeneity. We do so by

controlling for the fixed effects, which will take them out of the error term so that they no

longer can be a source for the correlation of the error term and an independent variable.

This strategy is similar to the one we pursued with multivariate OLS: We identified some

factor in the error term that could cause endogeneity and pulled it out of the error term by

controlling for the variable in the regression.

How do we pull the fixed effects out of the error term? Easy! We simply estimate a

different intercept for each unit. We can do so as long as we have multiple observations for

each unit. In other words, we can do so when we have panel data.

Least squares dummy variable approach

Concretely, we simply create a dummy variable for each unit and estimate an OLS model,

but now controlling for the fixed effects directly. This approach is called the least squares

dummy variable (LSDV) approach. In the LSDV approach, we create dummy variables

for each unit and include these dummy variables in the model:

Y_it = β0 + β1 X1_it + β2 D1_i + β3 D2_i + ... + β_P D_(P-1),i + ν_it        (8.4)


where D1_i is a dummy variable that equals 1 if the observation is from the first unit (which in our crime example is a city), D2_i is a dummy variable that equals 1 if the observation is from the second unit, and so on to the (P - 1)th unit. We exclude the dummy for one unit because we can't have a dummy variable for every unit if we include β0 (for reasons we discussed on page 276 in Chapter 6).2 The data will look like the data in Table 8.2, which includes the city, year, the dependent and independent variables, and the first three dummy variables. In the Computing Corner, we show how to quickly create these dummy variables.

Table 8.2: Example of Robbery and Police Data for Cities in California

City            Year   Robberies     Police per        D1              D2               D3
                       per 1,000     1,000 (lagged)    (Fresno dummy)  (Oakland dummy)  (San Francisco dummy)
Fresno          1991   6.03          1.83              1               0                0
Fresno          1992   8.42          1.78              1               0                0
Oakland         1991   10.35         2.57              0               1                0
Oakland         1992   11.94         2.82              0               1                0
San Francisco   1991   9.50          3.14              0               0                1
San Francisco   1992   11.02         3.14              0               0                1

With this simple step we have just soaked up anything (anything) that is in the error

term that is fixed within unit over the time period of the panel.

We are really just running OLS with loads of dummy variables. In other words, we’ve seen

this before. Specifically, in Chapter 6 on page 276 we showed how to use multiple dummy
2 It doesn’t really matter which unit we exclude, we exclude the P th unit for convenience; plus it is fun to try to
pronounce (P ≠ 1)t h.


variables to account for categorical variables. Here, the categorical variable is whatever the

unit of observation denotes (such as city in our city crime data).
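To see the LSDV logic in action, here is a small R sketch on simulated data; the city identifiers, effect sizes, and variable names are all invented for illustration and are not taken from the chapter's data set. Because the made-up city effects are correlated with the police variable, pooled OLS misses the true coefficient while the LSDV regression with city dummies recovers it.

    # Simulated panel: city fixed effects correlated with police (all numbers invented)
    set.seed(123)
    n_cities <- 50; n_years <- 20
    city   <- rep(1:n_cities, each = n_years)
    alpha  <- rep(rnorm(n_cities, sd = 3), each = n_years)           # unit fixed effects
    police <- 2 + 0.5 * alpha + rnorm(n_cities * n_years)            # correlated with alpha
    robberies <- 1 - 1 * police + alpha + rnorm(n_cities * n_years)  # true slope is -1

    pooled <- lm(robberies ~ police)                  # alpha stays in the error term
    lsdv   <- lm(robberies ~ police + factor(city))   # city dummies soak up alpha
    coef(pooled)["police"]    # biased, far from -1
    coef(lsdv)["police"]      # close to the true value of -1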

De-meaned approach

We shouldn’t let the old-news feel of the LSDV approach lead us to underestimate fixed

effects models. They’re actually doing a lot of work, work that we can better see when

we consider a second way to estimate fixed models called the de-meaned approach. It’s

an odd term – it sounds like we’re trying to humiliate data – but it describes well what

we’re doing. (Data is pretty shameless anyway.) When using the de-meaned approach,

we subtract off the unit-specific averages from both independent and dependent variables.

The approach allows us to control for the fixed effects (the –i terms) without estimating

coefficients associated with dummy variables for each unit.

Why might we want to do this? Two reasons. First, it can be a bit of a hassle creating

dummy variables for every unit and then wading through results with so many variables. A

model of voting in the United Nations, for example, could need roughly 200 dummy variables

to use the LSDV approach to estimate a country-specific fixed effects model.

Second, the inner workings of the de-meaned estimator reveal the intuition behind fixed

effects models. This reason is more important. The de-meaned model looks like

Y_it - Ȳ_i· = β1 (X_it - X̄_i·) + ν̃_it        (8.5)

where Ȳ_i· is the average of Y for unit i over all time periods in the data set and X̄_i· is the


average of X for unit i over all time periods in the data set. In our crime data, Ȳ_Fresno· is the average crime in Fresno over the time frame of our data and X̄_Fresno· is the average police per capita in Fresno over the time frame of our data.3

Estimating a model using this transformed data will produce exactly the same coefficient and standard error estimates for β̂1 as produced by the LSDV approach.

The de-meaned approach also allows us to see that fixed effects models convert data to

deviations from mean levels for each unit and variable. In other words, fixed effects models

are about differences within units, not differences across units. In the pooled model for our

city crime data, the variables reflect differences in police and robberies in Los Angeles relative

to police and robberies in Fresno. In the fixed effects model, the variables are transformed

to reflect how much robberies in Los Angeles differ from average levels in Los Angeles as a

function of how much police in Los Angeles differ from average levels of police in Los Angeles.

An example shows how this works. Recall the crime data from earlier, where we saw that the pooled model produced very different coefficients than the fixed effects model. The reason for the difference was, of course, that the pooled model was
3 The de-meaned equation is derived by subtracting the same thing from both sides of Equation 8.3. Specifically, note that the average dependent variable for unit i over time is Ȳ_i· = β0 + β1 X̄_i· + α_i + ν̄_i·. If we subtract the left-hand side of this equation from the left-hand side of Equation 8.3 and the right-hand side of this equation from the right-hand side of Equation 8.3, we get Y_it - Ȳ_i· = β0 + β1 X_it + α_i + ν_it - β0 - β1 X̄_i· - ᾱ_i· - ν̄_i·. The α terms cancel because ᾱ_i· equals α_i (the average of a unit's fixed effect over time is, by definition, just the fixed effect itself). Rearranging terms yields something that is almost Equation 8.5. For simplicity, we let ν̃_it = ν_it - ν̄_i·; this new error term will inherit the properties of ν_it, such as being uncorrelated with the independent variable and having a mean of zero.


plagued by endogeneity and the fixed effects model was not. How does the fixed effects model

fix things? Figure 8.4 presents illustrative data for two made-up cities, Fresnomento and Los

Frangelese. In panel (a) the pooled data is plotted as in Figure 8.1, with each observation

number indicated. The relationship between police and robberies looks positive and, indeed,

the OLS β̂1 estimate is positive.

In panel (b) of Figure 8.4 we plot the same data after it has been de-meaned. Table 8.3

shows how we generated the de-meaned data. Notice, for example, that observation 1 is from

Los Frangelese in 2010. The number of police (the value of X_it) was 4, which is one of the bigger numbers in the X_it column. When we compare this number to the average number of police per 1,000 people in Los Frangelese (which was 5.33), though, it is low. In fact, the de-meaned value of the police variable for Los Frangelese in 2010 is -1.33, indicating that the police per 1,000 people was actually 1.33 lower than the average for Los Frangelese in

the time period of the data.

Although the raw values of Y get bigger as the raw values of X get bigger, the relationship between Y_it - Ȳ_i· and X_it - X̄_i· is quite different. Panel (b) of Figure 8.4 shows a clear negative relationship between the de-meaned X and the de-meaned Y.4


4 One issue that can seem confusing at first (but really isn't) is how to interpret the coefficients. Because the LSDV and de-meaned approaches produce identical estimates, we can stick with the relatively straightforward way we explain LSDV results even when describing results from a de-meaned model. Specifically, we can simply say a one-unit change in X1 is associated with a β̂1 increase in Y when controlling for unit fixed effects. This interpretation is similar to how we interpret multivariate OLS coefficients, which makes sense because the fixed effects model is really just an OLS model with lots of dummy variables.


[Figure 8.4 appears here. Panel (a) plots the pooled data for the two hypothetical cities with the pooled regression line, which slopes upward. Panel (b) plots the same data de-meaned by city, with the regression line for the de-meaned (fixed effects) model sloping downward. Axes: police per 1,000 people (horizontal) and robberies per 1,000 people (vertical).]

FIGURE 8.4: Robberies and Police for Hypothetical Cities in California


Table 8.3: Robberies and Police Data for Hypothetical Cities in California

Obs   City             Year   X_it    X̄_i·    X_it - X̄_i·    Y_it    Ȳ_i·    Y_it - Ȳ_i·
1     Los Frangelese   2010   4       5.33    -1.33          12      10      2
2     Los Frangelese   2011   5.5     5.33    0.17           10      10      0
3     Los Frangelese   2012   6.5     5.33    1.17           8       10      -2
4     Fresnomento      2010   1       2       -1             4       3       1
5     Fresnomento      2011   2       2       0              3       3       0
6     Fresnomento      2012   3       2       1              2       3       -1

In practice, we seldom calculate the de-meaned variables ourselves. There are easy ways

to implement the model in Stata and R. We describe these techniques in the Computing

Corner at the end of the chapter.
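For readers who want to verify the equivalence claim, the following R sketch types in the toy numbers from Table 8.3, de-means them with ave(), and compares the resulting slope to the LSDV slope; the object names are ours and purely illustrative.

    # Toy data from Table 8.3 (two hypothetical cities, three years each)
    city      <- c(rep("Los Frangelese", 3), rep("Fresnomento", 3))
    police    <- c(4, 5.5, 6.5, 1, 2, 3)     # X_it
    robberies <- c(12, 10, 8, 4, 3, 2)       # Y_it

    # De-meaned approach: subtract each city's own averages
    police_dm    <- police    - ave(police, city)
    robberies_dm <- robberies - ave(robberies, city)
    demeaned <- lm(robberies_dm ~ police_dm)

    # LSDV approach: a dummy variable for one of the two cities
    lsdv <- lm(robberies ~ police + factor(city))

    coef(demeaned)["police_dm"]   # the slope from the de-meaned regression...
    coef(lsdv)["police"]          # ...matches the LSDV coefficient on police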

Table 8.4 shows the results for a basic fixed effects model for our city crime data. We

include the pooled results from Table 8.1 for reference. The coefficient on police per capita

has fallen from 2.37 to 1.49 once we include fixed effects, suggesting that there were indeed

more police officers in cities with higher baseline levels of crime. In other words, the fixed effects were real (meaning some cities had higher average robberies per capita even when controlling for the number of police), and these effects were correlated with the number of police officers. The fixed effects model controls for these city-specific averages and leads

to a smaller coefficient on police officers.

However, the coefficient still suggests that each additional police officer per capita is associated with 1.49 more robberies. This estimate seems quite large and is highly statistically significant. We'll


revisit this data once again in Section 8.4 with models that account for additional important

factors.

We should note that we do not indicate whether results in Table 8.4 were estimated with

LSDV or the de-meaned approach. Why? Because it doesn’t matter. Either one would

produce identical coefficients and standard errors on the police variable.


Table 8.4: Burglary and Police Officers, Pooled versus Fixed Effects Models, 1951-1992

                     Pooled OLS     Fixed effects
                                    (one-way)
Lag police           2.37           1.49
(per capita)         (0.07)         (0.17)
                     [t = 32.59]    [t = 8.67]
N                    1,232          1,232
Number of cities     59             59

Standard errors in parentheses

Remember This
1. A fixed effects model includes an α_i term for every unit:

   Y_it = β0 + β1 X1_it + α_i + ν_it

2. The fixed effects approach allows us to control for any factor that is fixed within unit for the entire panel, whether or not we observe this factor.
3. There are two ways to produce identical fixed effects coefficient estimates for the model:
   (a) In the least squares dummy variable (LSDV) approach, we simply include dummy variables for each unit except the excluded reference category.
   (b) In the de-meaned approach, we transform the data so that the dependent and independent variables indicate deviations from the unit mean.


Discussion Question
1. What factors influence student evaluations of professors in college courses? Do instructors teaching large classes get evaluated less favorably? Consider using the following model to assess the question based on a data set of evaluations of instructors across multiple classes and multiple years:

   Evaluation_it = β0 + β1 Number of students_it + ε_it

   where Evaluation_it is the average evaluation by students of instructor i in class t, and Number of students_it is the number of students in instructor i's class t.
   (a) What is in the error term?
   (b) Are there any stable unit-specific elements in the error term?
   (c) Are the stable unit-specific elements in the error term likely to be correlated with the independent variable?


8.3 Working with Fixed Effects Models

Fixed effects models are relatively easy to implement. In practice, though, there are several

elements that take a bit of experience to get used to. In this section we explore the consequences of using fixed effects models when they're necessary and when they're not. We also

explain why fixed effects models cannot estimate some relationships even as they control for

them.

It’s useful to consider possible downsides of using fixed effects models. What if we control

for fixed effects when α_i = 0 for all units? In this case, a pooled model that ignores fixed

effects cannot be biased. After all, if the fixed effects are zero, they don’t exist and they

cannot therefore cause bias. Could including fixed effects in this case cause bias? The answer

is no, for the same reasons we discussed earlier (in Chapter 5 on page 231) that controlling

for irrelevant variables does not cause bias. Bias occurs when errors are correlated with

independent variables and as a general matter including extra variables does not cause errors

to be correlated with independent variables.5

If the fixed effects are non-zero, we of course want to control for them. We should note,
5 Controlling for fixed effects when all the α_i = 0 will lead to larger standard errors, though, so if we can establish that there is no sign of a non-zero α_i for any unit, we may wish to also estimate a model without fixed effects. To test for unit-specific fixed effects we can implement an F test following the process discussed in Chapter 4 on page 351. The null hypothesis is H0: α_1 = α_2 = α_3 = ... = 0. The alternative hypothesis is that at least one of the fixed effects is non-zero. The unrestricted model is a model with fixed effects (most easily thought of as the LSDV model that has dummy variables for each specific unit). The restricted model is a model without any fixed effects, which is simply the pooled OLS model. We provide computer code on pages 415 and 416.


however, that just because some (or many!) α_i are non-zero does not necessarily mean that

our fixed effects model will produce different results from our pooled model. Recall that

bias occurs when errors are correlated with an independent variable. The fixed effects could

exist, but they are not necessarily correlated with the independent variables. In other words,

fixed effects must not only exist to cause bias; they must be correlated with the independent

variables to cause bias. It’s not at all impossible to observe instances in real data where

fixed effects exist but don’t cause bias. In such cases, the coefficients from the pooled and

fixed effects models are similar.6

The prudent approach to analyzing panel data is therefore to control for fixed effects. If

the fixed effects are zero, we’ll get unbiased results even with the controls for fixed effects.

If the fixed effects are non-zero, we'll get unbiased results that may or may not differ from pooled results, depending on whether the fixed effects are correlated with the independent variable.

A downside to fixed effects models is that they make it impossible to estimate effects for certain

variables that we might be interested in. As is often the case, there is no free lunch (although

it’s a pretty cheap lunch).

Specifically, fixed effects models cannot estimate coefficients on any variables that are

fixed for all individuals over the entire time frame. Suppose, for example, that in the process

of analyzing our city crime data we wonder if cities in the north are more crime prone. We
6A so-called Hausman Test can be used to test whether fixed effects are causing bias. If the results indicate no sign
of bias when fixed effects are not controlled for, then we could use the random effects model discussed in Chapter 15
on page 748.


studiously create a dummy variable North_i that equals 1 if a city is in a northern state and 0 otherwise, and set about estimating the following model:

Crime_it = β0 + β1 Police_i,t-1 + β2 North_i + α_i + ν_it

Sadly, this approach won’t work. The reason is easiest to see by considering the fixed

effects model in de-meaned terms. The north variable will be converted to N orthit ≠N orthi· .

What is the value of this de-meaned variable for a city in the north? The N orthit part will

equal one for all time periods for such a city. But, wait, this means that N orthi· will also

be one because that is the average of this variable for this northern city. That means the

value of the de-meaned north variable will equal zero for any city in the north. What is the

value for the de-meaned north variable for a non-northern city? Similar logic applies: The

N orthit part will equal zero for all time periods and so too will N orthi· for a non-nothern

city. The de-meaned north variable will therefore also equal zero for non-northern cities. In

other words, the de-meaned variable will be zero for all cities in all years. The first job of a

variable is to vary. If it doesn’t, well, that ain’t no variable! Hence it will not be possible to

estimate a coefficient on this variable.7

More generally, a fixed effects model (estimated with either the LSDV or the de-meaned approach) cannot estimate a coefficient on a variable if the variable does not change within units for all units. So even though the variable varies across cities (e.g., the North_i variable

7 Because we know that the LSDV and de-meaned approaches produce identical results, we know that we will not be able to estimate a coefficient on the north variable in an LSDV model either. This is the result of perfect multicollinearity: the north variable is perfectly explained as the sum of the dummy variables for the northern cities.


is 1 for some cities and 0 for other cities), we can’t estimate a coefficient on it because within

cities it does not vary. This issue arises in many other contexts. In panel data where individuals are the unit of observation, fixed effects models cannot estimate coefficients on variables

such as gender or race that do not vary within individuals. In panel data on countries, the

effect of variables such as area or being landlocked cannot be estimated when they do not

vary within country for any country in the data set.

Not being able to include such a variable does not mean fixed effects models do not control

for it. The unit-specific fixed effect controls for all factors that are fixed within a unit

for the span of the data set. The model cannot parse out which of these unchanging factors

have which effect, but it does control for them via the fixed effects parameters.

Some variables might be fixed within some units but vary within other units. Coefficients on those variables can be estimated. For example, a dummy variable that indicates whether a city has more

than a million people will not vary for many cities that have been above or below one million

in population for the entire span of the panel data. However, if at least some cities have

risen above one million or declined below one million during the period covered in the panel

data, then the variable can be used in a fixed effects model.

Panel data models need not be completely silent with regard to variables that do not

vary. We can investigate how unchanging variables interact with variables that do change.

For example, we can estimate β2 in the following model:

Crime_it = β0 + β1 Police_i,t-1 + β2 (North_i × Police_i,t-1) + α_i + ν_it


The β̂2 estimate will tell us how different the coefficient on the police variable is for northern cities.
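A hedged R sketch of this point, with invented data and names: the time-invariant north dummy is perfectly collinear with the city dummies and is dropped, while its interaction with the time-varying police variable can still be estimated.

    # Simulated panel; all names and numbers are illustrative only
    set.seed(1)
    n_cities <- 30; n_years <- 10
    city   <- rep(1:n_cities, each = n_years)
    north  <- rep(rbinom(n_cities, 1, 0.5), each = n_years)   # fixed within each city
    police <- rnorm(n_cities * n_years, mean = 2)
    crime  <- 3 - police + 0.5 * north * police +
              rep(rnorm(n_cities, sd = 2), each = n_years) +  # city fixed effects
              rnorm(n_cities * n_years)
    north_police <- north * police                            # interaction term

    fe_mod <- lm(crime ~ factor(city) + police + north + north_police)
    coef(fe_mod)[c("police", "north", "north_police")]
    # 'north' comes back NA: it is collinear with the city dummies and gets dropped.
    # 'north_police' is estimated (near the 0.5 used to build the data) because the
    # interaction varies within cities even though north itself does not.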

Sometimes people are tempted to abandon fixed effects because they care about variables

that do not vary within unit. That’s cheating. The point of fixed effects is that if there is

something fixed within individuals across the panel that is correlated with an independent

variable, we risk bias. Bias is bad and we can’t just close our eyes to it to get to a coefficient

we want to estimate. In this case, the best-case scenario is that we run a fixed effects model, test whether we need the fixed effects, find that we do not, and then proceed guilt-free. But let's not get our hopes up. We almost always need the fixed effects.

Remember This
1. Fixed effects models do not cause bias when implemented in situations in which α_i = 0 for all units.
2. Pooled OLS models are biased only when fixed effects are correlated with the independent variable.
3. Fixed effects models cannot estimate coefficients on variables that do not vary within at least some units. Fixed effects models do control for these factors, though, as they are subsumed within the unit-specific fixed effect.


Discussion Questions
1. Suppose we have panel data on voter opinions toward government spend-
ing in 1990, 1992, and 1994. Explain why we can or cannot estimate
the effect of each of the following in a fixed effects model:
(a) Gender
(b) Income
(c) Race
(d) Party identification
2. Suppose we have panel data on the annual economic performance of 100
countries from 1960 to 2010. Explain why we can or cannot estimate
the effect of each of the following in a fixed effects model:
(a) Average education
(b) Democracy, which is coded 1 if political control is determined by
competitive elections
(c) Country size
(d) Proximity to equator
3. Suppose we have panel data on the annual economic performance of 50
U.S. states from 1960 to 2010. Explain why we can or cannot estimate
the effect of each of the following in a fixed effects model:
(a) Average education
(b) Democracy, which is coded 1 if political control is determined by
competitive elections
(c) State size
(d) Proximity to Canada


8.4 Two-way Fixed Effects Model

So far we have presented models in which there is a fixed effect for the unit of observation.

We refer to such models as one-way fixed effect models. We can generalize the approach

to a two-way fixed effects model in which we allow for fixed effects not only at the unit

level but also at the time level. That is, just as some cities might have more crime than

others (due to unmeasured history of violence or culture), some years might have more crime

than others due to unmeasured factors. Therefore we add a time fixed effect to our model,

making it

Y_it = β0 + β1 X1_it + α_i + τ_t + ν_it        (8.6)

where we’ve taken Equation 8.3 from page 377 and added ·t (the Greek letter tau), which

accounts for differences in crime for all units in year t. This notation provides a short-hand

way to indicate that each separate time period gets its own ·t effect on the dependent variable

(in addition to the –i effect on the dependent variable for each individual unit of observation

in the data set).

Similar to our one-way fixed effects model, the single parameter for a time fixed effect indicates the average difference for all observations in a given year, after having controlled for the other variables in the model. A positive fixed effect for the year 2008 (τ_2008) would indicate that, controlling for all other factors, the dependent variable was higher for all units in the data set in 2008. A negative fixed effect for the year 2014 (τ_2014) would indicate that,


controlling for all other factors, the dependent variable was lower for all units in the data

set in 2014.

There are lots of examples where we suspect a time fixed effect may be appropriate:

• The whole world suffered an economic downturn in 2008 due to a financial crisis starting

in the United States. Hence any model with economic dependent variables could merit

a time fixed effect to soak up this distinctive characteristic of the economy in 2008.

• Approval of political institutions went way up in the United States after the September

11, 2001 terrorist attacks. This was clearly a time specific factor that affected the entire

country.

We can estimate a two-way fixed effects model in several different ways. The simplest approach is to extend the LSDV approach to include dummy variables both for units and for time periods. We can also use a two-way de-meaned approach,8 or a hybrid LSDV/de-meaned approach; we show how in the Computing Corner.
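As a rough sketch of the LSDV route to a two-way model, the R code below adds both city and year dummies to simulated data; the data frame and variable names (crime_df, robberies, police_lag) are invented stand-ins rather than the chapter's actual data.

    # Simulated stand-in for a city-year crime panel (all numbers invented)
    set.seed(8)
    crime_df <- expand.grid(city = 1:40, year = 1991:2010)
    city_fx  <- rnorm(40, sd = 2)
    year_fx  <- rnorm(20)
    crime_df$police_lag <- 2 + 0.3 * city_fx[crime_df$city] + rnorm(nrow(crime_df))
    crime_df$robberies  <- 5 + 0.5 * crime_df$police_lag + city_fx[crime_df$city] +
                           year_fx[crime_df$year - 1990] + rnorm(nrow(crime_df))

    # Two-way LSDV: dummy variables for every city and every year
    twoway_lsdv <- lm(robberies ~ police_lag + factor(city) + factor(year), data = crime_df)
    coef(twoway_lsdv)["police_lag"]

    # The plm package (if installed) gives the same estimate via de-meaning:
    # library(plm)
    # plm(robberies ~ police_lag, data = crime_df, index = c("city", "year"),
    #     model = "within", effect = "twoways")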

Table 8.5 shows the huge effect that using a two-way fixed effects model has on our

analysis of the city crime data. For reference, it shows the pooled OLS and one-way fixed
8 The algebra is a bit more involved than for a one-way model, but the result has a similar feel:

Y_it - Ȳ_i· - Ȳ_·t + Ȳ_·· = β1 (X_it - X̄_i· - X̄_·t + X̄_··) + ν̃_it        (8.7)

where the dot notation indicates what is averaged over, such that Ȳ_i· is the average value of Y for unit i over time, Ȳ_·t is the average value of Y for all units at time t, and Ȳ_·· is the average over all units and all time periods. Don't worry, we almost certainly won't have to create these variables ourselves; we're including this just to provide a sense of how a one-way fixed effects model extends to a two-way fixed effects model.


effects results. The third column displays the results for a two-way fixed effects model

controlling only for police per capita. In contrast to the pooled and one-way models, the

coefficient is small (0.14) and statistically insignificant, suggesting that both police levels and crime were high in certain years. Once we controlled for the fact that robberies were common in some years throughout the country (possibly due, for example, to the crack epidemic that

was more serious in some years than others), we were able to net out a source of substantial

bias.

The fourth and final column reports two-way fixed effects results from a model that also

controls for the lagged per capita robbery rate in each city in order to control for city-specific

trends in crime. The estimate from this model implies that an increase of one police officer per 100,000 people is associated with a decrease of 0.202 robberies per capita. The effect is

marginally statistically significant.9

It is useful to take a moment to appreciate that not all models are created equal. A

cynic might look at the results in Table 8.5 and conclude that statistics can be made to say

anything. But this is not the right way to think about the results. The models do indeed

produce different results, but there are reasons for the differences. One of the models is

better. A good statistical analyst will know this. Using statistical logic we can explain why

the pooled results are suspect. We know pretty much what is going on: There are fixed
9 The additional control variable is called a lagged dependent variable. Inclusion of such a variable is common in analysis of panel data. These variables are often highly statistically significant, as is the case here. This type of control variable raises some complications, which we address in Chapter 15 on advanced panel data models.

Table 8.5: Burglary and Police Officers, Pooled versus Fixed Effects Models, 1951-1992

                     Pooled OLS     Fixed effects   Fixed effects   Fixed effects
                                    (one-way)       (two-way)       (two-way)
Lag police           2.37***        1.49***         0.14            -0.212†
(per capita)         (0.07)         (0.17)          (0.17)          (0.11)
                     [t = 32.59]    [t = 8.67]      [t = 0.86]      [t = 1.88]
Lag robberies        -              -               -               0.79***
(per capita)         -              -               -               (0.02)
                     -              -               -               [t = 41.63]
N                    1,232          1,232           1,232           1,232
Number of cities     59             59              59              59

Standard errors in parentheses
† significant at p < 0.10; *** p < 0.001

effects in the error term of the pooled model that are correlated with the police variable,

thereby biasing the pooled OLS coefficients. So while there is indeed output from statistical

software that could be taken to imply that police cause crime, we know better. Treating

all results as equivalent is not serious statistics; that’s just pressing buttons on a computer.

Instead of supporting statistical cynicism, this example testifies to the benefits of appropriate

analysis.

Remember This
1. A two-way fixed effects model accounts for both unit-specific and time-specific errors.
2. A two-way fixed effects model is written as

   Y_it = β0 + β1 X1_it + β2 X2_it + α_i + τ_t + ν_it

3. Estimation of a two-way fixed effects model can be done with an LSDV approach (which has dummy variables for each unit and each period in the data set), with a de-meaned approach, or with a combination of the two.


Case Study: Trade and Alliances

Does trade follow the flag? That is, does international trade flow more between countries that are allies? Or do economic factors alone determine trade? On the one hand, it seems reasonable that national security alliances boost trade by fostering good relations and stability. On the other hand, is there anything in the United States not made in China?

A basic panel model to test for the effect of alliances on trade is

Bilateral trade_it = β0 + β1 Alliance_it + α_i + ε_it        (8.8)

where Bilateral trade_it is total trade volume between countries in dyad i at time t. A dyad

is something that consists of two elements. Here, a dyad indicates a pair of countries and

the data indicates how much trade flows between them. For example, the United States

and Canada form one dyad, the United States and Japan form another dyad, and so on.

Alliance_it is a dummy variable that is 1 if the countries in the dyad are in a security alliance at time t. The α_i term captures the amount by which trade in a certain dyad is higher or lower over the entire course of the panel.
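In applied work, the dyad index itself typically has to be constructed from the two country identifiers so that, say, the United States-Canada pair always maps to the same unit. A minimal R sketch of one way to do this, using invented country codes:

    # Toy example: build a dyad identifier that ignores the ordering of the pair
    country1 <- c("USA", "CAN", "JPN")
    country2 <- c("CAN", "USA", "USA")
    dyad <- ifelse(country1 < country2,
                   paste(country1, country2, sep = "-"),
                   paste(country2, country1, sep = "-"))
    dyad   # "CAN-USA" "CAN-USA" "JPN-USA": rows 1 and 2 are the same dyad

With a dyad variable like this in hand, factor(dyad) (or plm with dyad as the unit index) supplies the α_i fixed effects in Equation 8.8.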

Because the unit of observation is the country-pair dyad, the fixed effects here relate to factors specific to a pair of countries. For example, the fixed effect for the United States-New


Zealand dyad in the trade model may be higher because of shared language. The fixed effect

for the China-India dyad might be negative because they are divided by mountains (which

they happen to fight over, too).

As we consider whether a fixed effects model is necessary, we need to think about whether the dyad-specific fixed effects could be correlated with the independent variables. Dyad-specific fixed effects could exist because of a history of commerce between two countries, a favorable trading geography (not divided by mountains, for example), economic complementarities of some sort, and so on. These factors could also make it easier or harder to form alliances.

Table 8.6 reports results from Green, Kim, and Yoon (2001) based on data covering trade

and alliances from 1951 to 1992. The dependent variable is the amount of trade between

the two countries in a given dyad in a given year. In addition to the alliance measure, the

independent variables are GDP (total gross domestic product of the two countries in the

dyad), Population (total population of the two countries in the dyad), Distance (distance

between the capitals of the two countries in the dyad), and Democracy (the minimum value

of a democracy ranking for the two countries in the dyad; the higher the value the more

democracy).

The dependent and continuous independent variables are logged. Logging variables is a

common practice in this literature; the interpretation is that a one percent increase in any

independent variable is associated with a β̂ percent increase in trade volume. (We discussed


logged variables on page 336.)

The results are remarkable. In the pooled model, Alliance is associated with a 0.745

percentage point decline in trade. In the one-way fixed effects model, the estimate completely

flips and is associated with a 0.777 increase in trade. In the two-way fixed effects model,

the estimated effect remains positive and significant but drops to 0.459. The coefficients on

Population and Democracy also flip while being statistically significant across the board.

Table 8.6: Bilateral Trade, Pooled versus Fixed Effects Models, 1951-1992

               Pooled OLS      Fixed effects    Fixed effects
                               (one-way)        (two-way)
Alliance       -0.745***       0.777***         0.459***
               (0.042)         (0.136)          (0.134)
               [t = 17.67]     [t = 5.72]       [t = 3.43]
GDP            1.182***        0.810***         1.688***
               (0.008)         (0.015)          (0.042)
               [t = 156.74]    [t = 52.28]      [t = 39.93]
Population     -0.386***       0.752***         1.281***
               (0.010)         (0.082)          (0.083)
               [t = 39.70]     [t = 9.19]       [t = 15.47]
Distance       -1.342***
               (0.018)
               [t = 76.09]
Democracy      0.075***        -0.039***        -0.015***
               (0.002)         (0.003)          (0.003)
               [t = 35.98]     [t = 13.42]      [t = 5.07]
Observations   93,924          93,924           93,924
Dyads          3,079           3,079            3,079

Standard errors are in parentheses; *** p < 0.01, two-tailed test

These results are shocking. If someone told us that they were going to estimate an “OLS


model of bilateral trade relations" we'd be pretty impressed, right? But actually, that model produces results that are almost completely opposite to those from the more appropriate fixed effects models.

There are other interesting things going on as well. The coefficient on Distance disappears in the fixed effects models. Yikes! What's going on? The reason, of course, is that

the distance between two countries does not change. Fixed effects models cannot estimate

coefficients on such a variable because it does not vary within unit over the course of the

panel. Does that mean that the effect of distance is not controlled for? That would seem to

be a problem because distance certainly affects trade. It’s not a problem, though, because

even though fixed effects models cannot estimate coefficients on variables that do not vary

within the unit of observation (which is the country-pair dyad in this data set), the effects of

these variables are controlled for via the fixed effect. And, even better, not only is the effect

of distance controlled for, so too are hard-to-measure factors such as being on a trade route

or having cultural affinities. That's what the fixed effect is: a big ball of all the effects that

are the same within units for the period of the panel.

Not all coefficients flip. The coefficient on GDP is relatively stable, indicating that unlike

the variables that do flip signs from the pooled to fixed effects specifications, GDP does

not seem to be correlated with the unmeasured fixed effects that influence trade between

countries.


8.5 Difference-in-difference

The logic of fixed effects plays a major role in difference-in-difference models, models that compare changes in treated units to changes in untreated units and are

particularly useful in policy evaluation. In this section, we explain the logic of this approach,

show how to use OLS to estimate these models, and then link the approach to the two-way

fixed effects models we developed for panel data.

Difference-in-difference logic

To understand difference-in-difference logic, let’s consider a policy evaluation of “stand your

ground” laws that allow individuals to use lethal force when they reasonably believe they are

threatened.10 Does such a law prevent homicides by making criminals fearful of resistance?

Or do these laws increase homicides by escalating violence?

Naturally, we would start by looking at the change in homicides in a state that passed

the law. This approach is what every policy-maker in the history of time uses to assess the

impact of a policy change. Suppose we find homicides went up in the states that passed the

law. Is that fact enough to lead us to conclude that the law increases crime?

It doesn’t take a ton of thinking to realize that such evidence is pretty weak. Homicides

could rise or fall for a lot of reasons, many of them completely unrelated to stand your

ground laws. If homicides went up not only in the state that passed the law, but in all states
10 See McClellan and Tekin (2012) and Cheng and Hoekstra (2013).


– even where there was no policy change – then we can’t seriously blame the law for the rise

in homicides. Or, if homicides declined everywhere, we shouldn’t attribute the decline in a

particular state to the law.

What we really want to do is to look at differences in the state that passed the policy

compared to differences in other similar states that did not pass the law. To use experimental

language, we want to look at the difference in treated states compared to the difference in

control states. We can write this difference of differences as

Y_T - Y_C        (8.9)

where Y_T is the change in the dependent variable in treated states (those that passed the policy) and Y_C is the change in the dependent variable in the untreated states that did not

pass the policy. We call this approach the difference-in-difference approach because we look

at the difference between differences in treated and control states.
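As a bare-bones numerical illustration (the four averages are invented, not taken from any study), the difference-in-difference estimate is simply the change in the treated group minus the change in the control group:

    # Hypothetical average homicide rates before and after a law change
    treated_before <- 6.0;  treated_after <- 5.0
    control_before <- 5.5;  control_after <- 5.3

    (treated_after - treated_before) - (control_after - control_before)
    # -0.8: homicides fell by 0.8 more in treated states than in control states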

Using OLS to estimate difference-in-difference models

While it is perfectly reasonable to generate a difference-in-difference estimate by calculating

the changes in treated and untreated states and then taking the difference, we’ll use OLS to

produce the same result. The advantage is that OLS will also spit out standard errors on

our estimate. We can also easily add additional control variables when we use OLS.


Specifically, we’ll use the following OLS model:

Y_it = β0 + β1 Treated_i + β2 After_t + β3 (Treated_i × After_t) + ε_it        (8.10)

where Treated_i equals 1 for a treated state and 0 for a control state, After_t equals 1 for all after-period observations (from both control and treated units), and Treated_i × After_t is an interaction of Treated_i and After_t. This interaction variable will equal 1 for treated states in the post-treatment period and 0 for all other observations.

The control states have some mean level of homicides that we denote with β0; the treated states have some mean level of homicides that we denote with β0 + β1 Treated_i. If β1 is positive, the mean level for the treated states is higher than in control states. If β1 is negative, the mean level for the treated states is lower. If β1 is zero, the mean level for the treated states is the same as in control states. This pre-existing difference in mean levels was there before the law was even passed, so the law can't be the cause of these differences. Instead, the differences represented by β1 are simply the pre-existing differences between the treated and untreated states. This parameter is analogous to a unit fixed effect, although here it is for the entire group of treated states rather than for individual units.

The model captures national trends with the β2 After_t term. The dependent variable for all states, treated and not, changes by β2 in the after period. This parameter is analogous to a time fixed effect, although for the entire post-treatment period rather than for individual time periods.


The key coefficient is β3. This is the coefficient on the interaction between Treated_i and After_t. This variable equals 1 only for treated units in the after period. The coefficient tells us how much additional change occurred in the treated states after the policy went into effect, controlling for pre-existing differences between the treated and control states (β1) and differences between the before and after periods for all states (β2).

If we work out the fitted values for changes in treated and control states, we can see how this regression model produces a difference-in-difference estimate. First, note that the fitted value for treated states in the after period is β0 + β1 + β2 + β3 (because Treated_i, After_t, and Treated_i × After_t all equal 1 for treated states in the after period). Second, note that the fitted value for treated states in the before period is β0 + β1, so the change for treated states is β2 + β3. The fitted value for control states in the after period is β0 + β2 (because Treated_i and Treated_i × After_t equal 0 for control states). The fitted value for control states in the before period is β0, so the change for control states is β2. The difference in differences of treated and control states will therefore be (β2 + β3) - β2 = β3. Presto!

Figure 8.5 displays two examples that illustrate the logic of difference-in-difference models.

In panel (a) there is no treatment effect. The dependent variables for the treated and control

states differ in the before period by β1. Then the dependent variable for both the treated and control units rose by β2 in the after period. In other words, Y was bigger for the treated

than the control before and after the treatment by the same amount. The implication is that

the treatment had no effect even though Y went up in treatment states after they passed


the law.

Panel (b) shows an example with a treatment effect. The dependent variables for the

treated and control states differ in the before period by β1. Then the dependent variable for both the treated and control units rose by β2 in the after period, but the value of Y for the treated unit rose yet another β3. In other words, the treated group was β1 bigger than the control before the treatment, and β1 + β3 bigger than the control after the treatment. The implication is that the treatment caused a β3 bump over and above the across-unit and time

differences accounted for in the model.

Consider how the difference-in-difference approach would assess outcomes in our gun law

example. If homicides declined in states with stand your ground laws more than in states

without such laws, the evidence supports the claim that the law prevented homicides. Such

an outcome could happen if homicides went down by 10 in states with the law and went

down only by 2 in other states. Such an outcome could also happen if homicides actually

went up by 2 in states with stand your ground laws but went up by 10 in other states. In

both instances, the difference-in-difference estimate is -8.

One great thing about using OLS to estimate difference-in-difference models is that it is

easy to control for other variables in the OLS model. Simply include them as covariates

and do what we’ve been doing. In other words, simply add a —4 Xit term (and additional

variables, if appropriate), yielding the following difference-in-difference model:

Yit = —0 + —1 T reatedi + —2 Af tert + —3 (T reatedi ◊ Af tert ) + —4 Xit + ‘it (8.11)
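Here is a hedged R sketch of this regression approach on simulated data; the variable names and the -0.8 treatment effect are invented. The coefficient on the interaction term is the difference-in-difference estimate, and OLS supplies its standard error.

    # Simulated state-year data with a true treatment effect of -0.8
    set.seed(2)
    dd <- expand.grid(state = 1:40, year = 1:10)
    dd$treated <- as.numeric(dd$state <= 20)      # states 1-20 eventually adopt the policy
    dd$after   <- as.numeric(dd$year > 5)         # policy takes effect after year 5
    dd$y <- 5 + 1.0 * dd$treated + 0.5 * dd$after -
            0.8 * dd$treated * dd$after + rnorm(nrow(dd), sd = 0.5)

    dd_mod <- lm(y ~ treated * after, data = dd)   # expands to treated + after + treated:after
    summary(dd_mod)$coefficients["treated:after", ]   # estimate near -0.8, with its standard error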


[Figure 8.5 appears here. Both panels plot Y against time (before and after the policy) for treated and control groups, with β0, β1, β2, and β3 labeled. Panel (a): no treatment effect, the treated and control lines rise in parallel. Panel (b): treatment effect, the treated line rises by an additional β3 in the after period.]

FIGURE 8.5: Difference-in-difference Examples


Difference-in-difference models for panel data

A difference-in-difference model works not only with panel data but also with rolling cross-section data. Rolling cross-section data consists of data from each treated and untreated region in which the individual observations come from different individuals across time periods. An example of a rolling cross section of data is a repeated national survey of people about their health insurance over multiple years. We could look to see if state-level decisions about Medicaid coverage in 2014 led to different changes in treated states relative to untreated states. For such data we can easily create dummy variables indicating whether the observation came from a treated state or not and whether the observation was in the before or after period. The model can take things from there.

If we have panel data, we can estimate a more general form of a difference-in-difference

model that looks like a two-way fixed effects model:

Y_it = α_i + τ_t + β3 (Treated_i × After_t) + β4 X_it + ε_it        (8.12)

where

• the α_i terms (the unit-specific fixed effects) capture differences that exist across units both before and after the treatment.

• the τ_t terms (the time-specific fixed effects) capture differences that exist across all units in every period. If homicide rates are higher in 2007 than in 2003, then the τ_t for 2007 will be higher than the τ_t for 2003.


• Treated_i × After_t is an interaction of a variable indicating whether or not a unit is a treatment unit (meaning in our case that Treated_i = 1 for states that passed stand your ground laws) and After_t, which indicates whether or not the observation is post-treatment (meaning in our case that the observation is after the state passed a stand your ground law). This interaction variable will equal 1 for treated states in the post-treatment period and 0 for all other observations.

Our primary interest is the coefficient on Treated_i × After_t (which we call β3 to be consistent with earlier equations). As in the difference-in-difference model without fixed effects, this parameter indicates the effect of the treatment.
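A hedged sketch of this panel version in R, again on invented data: the state and year dummies play the role of α_i and τ_t, and the coefficient on the interaction variable is the treatment effect.

    # Simulated state-year panel; true treatment effect set to -0.8 (numbers invented)
    set.seed(3)
    panel <- expand.grid(state = 1:40, year = 2001:2010)
    panel$treated <- as.numeric(panel$state <= 20)
    panel$after   <- as.numeric(panel$year >= 2006)
    panel$treat_after <- panel$treated * panel$after
    state_fx <- rnorm(40); year_fx <- rnorm(10, sd = 0.3)
    panel$y <- state_fx[panel$state] + year_fx[panel$year - 2000] -
               0.8 * panel$treat_after + rnorm(nrow(panel), sd = 0.5)

    # State dummies absorb the Treated main effect; year dummies absorb the After main effect
    dd_fe <- lm(y ~ treat_after + factor(state) + factor(year), data = panel)
    coef(dd_fe)["treat_after"]   # estimate of the treatment effect, near -0.8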

Table 8.7 shows an analysis of stand your ground laws by Georgia State University

economists Chandler McClellan and Erdal Tekin. They implemented a state and time fixed

effect version of a difference-in-difference model and found that the homicide rate per 100,000

residents went up by 0.033 after the passage of the stand your ground laws. In other words,

controlling for the pre-existing differences in state homicide rates (via state fixed effects) and

national trends in homicide rates (via time fixed effects) and additional controls related to

race, age, and percent of residents living in urban areas, they found that the homicide rates

went up by 0.033 after states implemented these laws.11

11 Cheng and Hoekstra (forthcoming) found similar results.


Table 8.7: Effect of Stand Your Ground Laws on Homicide Rate Per 100,000 Residents

Variable                   Coefficient
Stand your ground laws     0.033**
                           (0.013)
                           [t = 2.54]
State fixed effects        Included
Period fixed effects       Included

Adapted from Appendix Table 1 of McClellan and Tekin (2012).
Standard errors are in parentheses; ** p < 0.05, two-tailed test.
Includes controls for racial, age, and urban demographics.

Remember This
A difference-in-difference model estimates the effect of a change in policy by comparing changes in treated units to changes in control units.
1. A basic difference-in-difference estimator is Y_T - Y_C, where Y_T is the change in the dependent variable for the treated unit and Y_C is the change in the dependent variable for a control unit.
2. Difference-in-difference estimates can be generated from the following OLS model:

   Y_it = β0 + β1 Treated_i + β2 After_t + β3 (Treated_i × After_t) + β4 X_it + ε_it

3. For panel data, we can use a two-way fixed effects model to estimate difference-in-difference effects:

   Y_it = α_i + τ_t + β3 (Treated_i × After_t) + β4 X_it + ε_it

   where the α_i fixed effects capture differences in units that existed both before and after treatment and the τ_t fixed effects capture differences common to all units in each time period.


Discussion Questions

1. For each of the four panels in Figure 8.6, indicate the values of β0, β1, β2, and β3 for the basic difference-in-difference OLS model:

   Y_it = β0 + β1 Treated_i + β2 After_t + β3 (Treated_i × After_t) + ε_it

2. For each of the following examples, explain how to create (i) a simple

difference-in-difference estimate of policy effects and (ii) a fixed effects

difference-in-difference model.

(a) California implemented a first-in-the-nation paid family leave program in 2004. Did this policy increase use of maternity leave?a

(b) Fourteen countries engaged in “expansionary austerity” policies in

response to the 2008 financial crisis. Did these austerity policies

work? (For simplicity, treat austerity as a dummy variable equal to

1 for countries that engaged in it and 0 for others.)

(c) Some neighborhoods in Los Angeles had zoning changes that made

it easier to mix commercial and residential buildings. Did these

changes to zoning laws reduce crime?b

a See Rossin-Slater, Ruhm and Waldfogel (2013).


b See Anderson, Macdonald, Bluthenthal and Ashwood (2013).


[Figure 8.6 appears here. Four panels, (a) through (d), each plot Y against time (before and after) for treated and control groups, with differing intercepts and treatment-period changes across the panels.]

FIGURE 8.6: More Difference-in-difference Examples


8.6 Conclusion

Again and again we’ve emphasized the importance of exogeneity. If X is uncorrelated with

‘ we get unbiased estimates and we are happy. Experiments are sought after because the

randomization in them ensures – or at least aids – exogeneity. With OLS we can – sometimes,

maybe, almost, sort of, kind of – approximate endogeneity by soaking up enough of the error

term with measured variables such that what remains correlates little or not at all with X.

Realistically, though, we know that we will not be able to measure everything. Real

variables with real causal force will almost certainly lurk in the error term. Are we stuck?

Turns out, no (or, at least, not yet). We’ve got a few more tricks up our sleeve. One of the

best tricks is to use fixed effects tools. Although uncomplicated, the fixed effects approach

can knock out a whole class of unmeasured (and even unknown) variables that lurk in the

error term. Simply put, any factor that is fixed across time periods for each unit or fixed

across units for each time period can be knocked out of the error term. Fixed effects tools

are powerful and, as we have seen in real examples, they can produce results that differ

dramatically from basic OLS models.

When we’re done with this chapter, we will be able to

• Section 8.1: Explain how a pooled model can be problematic when analyzing

panel data.

• Section 8.2: Write down a fixed effects model and explain the fixed effect. Give examples


of the kinds of factors subsumed in a fixed effect. Explain how to estimate a fixed effects model with the LSDV and de-meaned approaches.

• Section 8.3: Explain why coefficients on variables that do not vary within unit cannot be

estimated in fixed effects models. Explain how these variables are nonetheless controlled

for in fixed effects models.

• Section 8.4: Explain a two-way fixed effects model.

• Section 8.5: Explain the logic behind a difference-in-difference estimator. Provide and

explain an OLS model that generates a difference-in-difference estimate.

Further Reading

Chapter 15 discusses advanced panel data models. Baltagi (2005) is a more technical survey

of panel data methods.

Green, Kim, and Yoon (2001) provide a nice discussion of panel data methods in international relations. Wilson and Butler (2007) re-analyze articles that did not use fixed effects and find that results changed, sometimes dramatically.

If we use pooled OLS to analyze panel data sets we are quite likely to have errors that

are correlated within unit in the manner discussed on page 104. This correlation of errors

will not cause OLS β̂1 estimates to be biased, but it will make the standard OLS equation for the variance of β̂1 inappropriate. While fixed effects models typically account for a


substantial portion of the correlation of errors, there is also a large literature on techniques

to deal with the correlation of errors in panel data and difference-in-difference models. We

discuss one portion of this literature when we cover random effects models in Chapter 15.

Bertrand, Duflo, and Mullainathan (2004) show that standard error estimates for difference-

in-difference estimators can be problematic in the presence of autocorrelated errors if there

are multiple periods both before and after the treatment.

Hausman and Taylor (1981) discuss an approach for estimating parameters on time-invariant covariates.

Key Terms
• De-meaned approach (380)
• Difference-in-difference model (401)
• Dyad (397)
• Fixed effect (376)
• Fixed effects model (377)
• Least squares dummy variable (LSDV) approach (378)
• One-way fixed effect model (??)
• Panel data (367)
• Pooled model (369)
• Rolling cross-section data (407)
• Two-way fixed effect model (??)


Computing Corner

Stata

1. To estimate a panel data model using the LSDV approach, we run an OLS model with
dummy variables for each unit.
(a) Generate dummy variables for each unit:
tabulate City, generate(CityDum)
This command generates a variable called “CityDum1” that is 1 for observations
from the first city listed in “City” and 0 otherwise, a variable called “CityDum2”
that is 1 for observations from the second city listed in “City,” and so on.
(b) Estimate the model with the regress command: regress Y X1 X2 X3 CityDum2-CityDum50. The notation CityDum2-CityDum50 tells Stata to include each of the city dummies from CityDum2 to CityDum50. As we discussed on page 818 of Chapter 7, we need an excluded category. By starting at CityDum2 in our list of dummy variables, we are setting the first city as the excluded reference category.
(c) To use an F test to test whether fixed effects are all zero, the unrestricted model
is the model with the dummy variables we just estimated. The restricted model is
a regression model without the dummy variables (also known as the pooled model):
regress Y X1 X2 X3.
2. To estimate a one-way fixed effects model using the de-meaned approach:
xtreg y x1 x2 x3, fe i(City)
The subcommand of , fe tells Stata to estimate a fixed effects model. The i(City)
subcommand tells Stata to use the City variable to identify the city for each observation.
3. To estimate a two-way fixed effects model:
(a) Create dummy variables for years:
tabulate Year, gen(Yr)
This command generates a variable called “Yr1” that is 1 for observations in the
first year and 0 otherwise, a variable called “Yr2” that is 1 for observations in the
second year and 0 otherwise and so on.
(b) Run Stata’s built-in one-way fixed effects model and also include the dummies for
the years:
xtreg Y X1 X2 X3 Yr2-Yr10, fe i(City)
where Yr2-Yr10 is a shortcut way of including every Yr variable from Yr2 to Yr10.
4. There are several ways to implement difference-in-difference models:


(a) To implement a basic difference-in-difference model, type reg Y Treat After TreatAfter
X2 where Treat indicates membership in treatment group, After indicates the pe-
riod is the after period, TreatAfter is the interaction of the two variables, and X2
is one (or more) control variables.
(b) To implement a panel data version of a difference-in-difference model, type xtreg
Y TreatAfter X2 Yr2-Yr10, fe i(City).
(c) To plot the basic difference-in-difference results, plot separate fitted lines for the
treated and untreated groups:
graph twoway (lfit Y After if Treat ==0) (lfit Y After if Treat ==1).

R
1. To estimate a panel data model using the LSDV approach, we run an OLS model with
dummy variables for each unit.
(a) While it is possible to name and include dummy variables for every unit, doing so
can be a colossal pain when we have lots of units. It is usually easiest to use the
factor command, which will automatically include dummy variables for each unit.
The code is: lm(Y ~ X1 + factor(unit)). This command will estimate a model
in which there is a dummy variable for every unique value of the
unit variable. For example, if our data looked like Table 8.2 on page 379, including
a factor(city) term in the regression code would lead to dummy variables being
included for each city.
(b) To implement an F test on the hypothesis that all fixed effects (both unit and time)
are zero, the unrestricted equation is the full model and the restricted equation is
the model with no fixed effects.
Unrestricted = lm(Y ~ X1 + factor(unit) + factor(time))
Restricted = lm(Y ~ X1)
See page 360 for more details on how to implement an F test in R.
2. To estimate a one-way fixed effect model using the de-meaned approach, use one of
several add-on packages that automate the steps in panel data analysis. We discussed
how to install an R package in Chapter 3 on page 130. For fixed effects models we can
use the “plm” command from the “plm” package.
(a) Install the package by typing install.packages("plm"). Once installed on a
computer, the package can be brought into R’s memory with the library(plm)
command.
(b) The “plm” command works like the “lm” command. We indicate the dependent
variable and the independent variables for the main equation. We need to indicate


what the units are with the index = c("city", "year") command. These are the
variable names that indicate your units and time variables, which will vary depend-
ing on the data set. For a one-way fixed effects model, include model="within".
library(plm)
plm(Y ~ X1 + X2, index=c("city"), model="within")
3. To estimate a two-way fixed effects model, we have two options.
(a) We can simply include time dummies as covariates in a one-way fixed effects model
plm(Y ~ X1 + X2 + factor(year), index=c("city"), model="within")
(b) We can use the plm command and indicate the unit and time variables with the
index = c("city", "year") command. These are the variable names that indi-
cate your units and time variables, which will vary depending on your data set. We
also need to include the subcommand effect="twoways".
plm(Y ~ X1 + X2, index=c("city", "year"), model="within",
effect="twoways")
4. There are several ways to implement difference-in-difference models:
(a) To implement a basic difference-in-difference model, type lm(Y ~ Treat + After
+ TreatAfter + X2), where Treat indicates membership in treatment group, After
indicates the period is the after period, TreatAfter is the interaction of the two
variables, and X2 is one (or more) control variables.
(b) To implement a panel data version of a difference-in-difference model, type lm(Y
~ TreatAfter + factor(Unit) + factor(Year) + X2).
(c) To plot the basic difference-in-difference results, plot separate fitted lines for the
treated and untreated groups:
plot(After, Y, type = "n")
abline(lm(Y[Treat==0] ~ After[Treat==0]))
abline(lm(Y[Treat==1] ~ After[Treat==1]))
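
5. The Further Reading notes that errors in panel data are often correlated within unit,
which makes the default OLS standard errors from a pooled or LSDV regression suspect.
One common response is to report cluster-robust standard errors. A minimal sketch using
the sandwich and lmtest packages (the data frame panel_df and its variables Y, X1, and
city are hypothetical stand-ins for your own data):
library(sandwich)   # vcovCL() computes cluster-robust variance estimates
library(lmtest)     # coeftest() re-tests coefficients with a supplied variance matrix
# LSDV model: OLS with a dummy variable for each city
lsdv <- lm(Y ~ X1 + factor(city), data = panel_df)
# Cluster the standard errors by city, then redo the coefficient tests
vc <- vcovCL(lsdv, cluster = ~ city)
coeftest(lsdv, vcov. = vc)
The coefficient estimates are unchanged; only the standard errors (and hence the t
statistics) are adjusted for within-city correlation.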

Exercises
1. Researchers have long been interested in the relationship between economic factors and
presidential elections. The PresApproval.dta data set includes data on presidential
approval polls and unemployment rates by state over a number of years. Table 8.8 lists
the variables.
a. Using pooled data for all years, estimate a pooled OLS regression explaining pres-
idential approval as a function of state unemployment rate. Report the estimated
regression equation and interpret the results.

Table 8.8: Variables for Presidential Approval Question

Variable name Description


State State name
StCode State numeric ID
Year Year
PresApprov Percent positive presidential approval
UnemPct State unemployment rate
South Southern state (1=yes, 0=no)

b. Many political observers believe politics in the South are different. Add South as
an additional independent variable and re-estimate the model from part (a). Report
the estimated regression equation. Do the results change?
c. Re-estimate the model from part (b) controlling for state fixed effects using the de-
meaned approach. How does this approach affect the results? What happens to the
South variable in this model? Why? Does this model control for differences between
Southern and other states?
d. Re-estimate the model from part (c) controlling for state fixed effects using the LSDV
approach. (Do not include a south dummy variable). Compare the coefficients and
standard errors for the unemployment variable.
e. Estimate a two-way fixed effects model. How does this model affect the results?
2. How do young people respond to economic conditions? Are they more likely to pursue
public service when jobs are scarce? To get at this question, we’ll analyze data in
PeaceCorps.dta, which contains variables on state economies and applications to the
Peace Corps. Table 8.9 lists the variables.
Table 8.9: Variables for Peace Corps Question

Variable name Description


state State name
year Year
stateshort First three letters of state name (for labeling scatterplot)
appspc Applications to the Peace Corps from each state per capita
unemployrate State unemployment rate

a. Before looking at the data, what relationship do you hypothesize between these two
variables? Explain your hypothesis.
b. Run a pooled regression of Peace Corps applicants per capita on the state unemploy-
ment rate and year dummies. Describe and critique the results.


c. Plot the relationship between the state economy and Peace Corps applications. Does
any single state stick out? How may this outlier affect the estimate on unemployment
rate in the pooled regression in part (b) above? Create a scatterplot without the
unusual state and comment briefly on the difference from the scatterplot with all
observations.
d. Run the pooled model from part (b) without the outlier. Comment briefly on the
results.
e. Run a two-way fixed effect model without the outlier using the LSDV approach. Do
your results change from the pooled analysis? Which results are preferable?
f. Run a two-way fixed effects model without the outlier using the fixed effects command
in Stata or R. Compare to LSDV results.
3. We wish to better understand the factors that contribute to a student’s favorable overall
evaluation of an instructor. The data set TeachingEval HW.dta contains average faculty
evaluation scores, class size, a dummy variable indicating required courses, and the
percent of grades that were A- and above. Table 8.10 lists the variables.
Table 8.10: Variables for Teaching Evaluation Questions

Variable name Description


Eval Average course evaluation on a 5 point scale
Apct Percent of students receiving A and A- in class
Enrollment Number of students in the course
Required A dummy variable indicating if the course is required
InstrID A unique identifying number for each instructor
CourseID A unique identifying number for each course

a. Estimate a model ignoring the panel structure of the data. Use overall evaluation
of the instructor as the dependent variable and the class size, required, and grades
variables as independent variables. Report and briefly describe the results.
b. Explain what a fixed effect for each of the following would control for: instructor,
course, and year.
c. Using the equation from part (a), estimate a model that includes a fixed effect for
instructor. Report your results and explain any differences from part (a).
d. Now estimate a two-way fixed effects model with year as an additional fixed effect.
Report and briefly describe your results.
4. In 1993, Georgia initiated a HOPE scholarship program to let state residents who had at
least a B average in high school attend public college in Georgia for free. The program is


not need based. Did the program increase college enrollment? Or did it simply transfer
funds to families who would have sent their children to college anyway? Dynarski (2000)
assessed this question using data on young people in Georgia and neighboring states.12
Table 8.11 lists the variables.
Table 8.11: Variables for the HOPE Scholarship Question

Variable Name Description


InCollege A dummy variable equal to 1 if the individual is in college
AfterGeorgia A dummy variable equal to 1 for Georgia residents after 1992
Georgia A dummy variable equal to 1 if the individual is a Georgia resident
After A dummy variable equal to 1 for observations after 1992
Age Age
Age18 A dummy variable equal to 1 if the individual is 18 years old
Black A dummy variable equal to 1 if the individual is African-American
StateCode State codes
Year Year of observation
Weight Weight used in Dynarski (2000)

a. Run a basic difference-in-difference model. What is the effect of the program?


b. Calculate the percentage of people in the sample in college from the following four
groups: (i) Before 1993/non Georgia, (ii) Before 1993/Georgia, (iii) After 1992/non
Georgia, (iv) After 1992/Georgia. First do so using the mean function (e.g., in Stata
use mean Y if X1==0 & X2==0). Second, do so using the coefficients from the OLS
output in part (a).
c. Graph the fitted lines for the Georgia group and non-Georgia samples.
d. Use panel data formulation for a difference-in-difference model to control for all year
and state effects.
e. Add covariates for 18 year olds and African-Americans to the above panel data
formulation. What is the effect of the HOPE program?
f. The way the program was designed, Georgia high school graduates with a B or higher
average and annual family income over $50,000 qualified for HOPE by filling out a
simple, one-page form. Those with lower income were required to apply for federal
aid with a complex, four-page form and had federal aid deducted from their HOPE
scholarship. Run separate basic difference-in-difference models for these two groups
and comment on the substantive implication of the results.
12 For simplicity we will not use the sample weights used by Dynarski. The results are stronger when these sample
weights are used.


5. Table 8.12 describes variables in TexasSchools.dta, a data set covering 1,020 Texas
school board districts and teachers’ salaries in them from 2003 to 2009. Anzia (2012)
used this data to estimate the effect of election timing on teachers’ salaries in Texas.
Some believe that teachers will be paid more when school board members are elected in
“off-cycle” elections where only school board members are up for election. The idea is
that teachers and their allies will mobilize for these elections while many other citizens
will tune out. In this view, teachers’ salaries will be relatively lower when school boards
are elected in “on-cycle” elections that also have elections for state and national offices;
in these on-cycle elections, turnout will be higher and teachers and teachers unions will
have relatively less influence.
From 2003 to 2006 all districts in the sample elected their school board members off-
cycle. A change in state policies in 2006 led some, but not all, districts to elect their
school board members on-cycle from 2007 onward. The districts that switched then
stayed switched for the period 2007 to 2009 and no other district switched.
Table 8.12: Variables for the Texas School Board Data

Variable name Description


LnAvgSalary The average salary of teachers in the district, adjusted for inflation and logged
OnCycle A dummy variable indicating school boards in this district were elected “on-cycle”
meaning they were elected at same time voters were voting on other offices. A
zero indicates the school board was elected “off-cycle,” meaning the school boards
were elected in a separate election
CycleSwitch A dummy variable indicating that the district switched from off-cycle to on-cycle
elections starting in 2007
AfterSwitch A dummy variable indicating year > 2006
AfterCycleSwitch CycleSwitch * AfterSwitch
DistNumber District ID number
Year Year

a. Estimate the pooled model of LnAvgSalaryit = β0 + β1 OnCycleit + εit . Discuss


whether there is potential bias here. Consider in particular the possibility that
teachers unions are most able to get off-cycle elections in districts where they are
strongest. Could such a situation create bias? Explain why or why not.
b. Estimate a standard difference-in-difference model using the fact that a subset of
districts switched their school board elections to “on-cycle” in 2007 and all subsequent
elections in the data set. No one else switched at any other time. Before 2007 all
districts used “off-cycle” elections. Explain the results. What is the effect of election
time on teachers’ salaries? Can we say anything about the types of districts that
switched? Can we say anything about salaries in all districts in the years after the


switch?
c. Run a one-way fixed effects model where the fixed effect relates to individual school
districts. Interpret the results and explain whether this model accounts for time
trends that could affect all districts.
d. Now use a two-way fixed effects model to estimate a difference-in-difference approach.
Interpret the results and explain whether this model accounts for (i) differences in
pre-existing attributes of the switcher districts and non-switcher districts and (ii)
differences in the post switch years that affected all districts whether or not they
switched.
e. Suppose that we tried to estimate the above two-way fixed effects model on only the
last three years of the data (2007, 2008, and 2009). Would we be able to estimate
the effect of oncycle for this subset of the data? Why or why not?
6. This problem uses a panel version of the dataset described in Chapter 5 on page 250
to analyze the effect of cell phone and texting bans on traffic fatalities. Use deaths
per mile as the dependent variable because this variable accounts for the pattern we
saw earlier that miles driven is a strong predictor of the number of fatalities. Table
8.13 describes the variables in the data set Cellphone panel homework.dta; it covers all
states plus Washington, DC from 2006 to 2012.
Table 8.13: Variables in the Cell Phones and Traffic Deaths Panel Data Set

Variable name Description


year Year
State State name
state_numeric State name (numeric representation of state)
population Population within a state
DeathsPerBillionMiles Deaths per billion miles driven in state
cell_ban Coded 1 if hand-held cell phone while driving ban is in effect; 0 otherwise
text_ban Coded 1 if texting while driving ban is in effect; 0 otherwise
cell_per10thous_pop Number of cell phone subscriptions per 10,000 people in state
urban_percent Percent of state living in urban areas

a. Estimate a pooled OLS model with deaths per mile as the dependent variable and
cell phone ban and text ban as the two independent variables. Briefly interpret the
results.
b. Describe a possible state fixed effect that could cause endogeneity and bias in the
model from part (a).
c. Estimate a one-way fixed effects model that controls for state level fixed effects.
Include deaths per mile as the dependent variable and cell phone ban and text ban


as the two independent variables. Does the coefficient on cell phone ban change in
the manner you would expect based on your answer from part (a)?
d. Describe a possible year fixed effect that could cause endogeneity and bias in the
fixed effects model in part (c).
e. Estimate a two-way fixed effects model using the hybrid de-meaned approach dis-
cussed in the chapter. Include deaths per mile as the dependent variable and cell
phone ban and text ban as the two independent variables. Does the coefficient on
cell phone ban change in the manner you would expect based on your answer in part
(d)?
f. The model in part (e) is somewhat sparse with regard to control variables. Estimate a
two-way fixed effects model that includes control variables for cell phones per 10,000
people and percent urban. Briefly describe changes in inference about the effect of
cell phone and text bans.
g. Estimate the same two-way fixed effects model using the least squares dummy variable
(LSDV) approach. Compare the coefficient and t statistic on the cell phone variable
to the results from part (f).
h. Based on the LSDV results, identify states with large positive and negative fixed
effects. Explain what these mean (being sure to note the excluded category) and
speculate about what is different about the positive and negative fixed effect states.
(It is helpful to connect the state number to state name; in Stata, do this with the
command list state state_numeric if year == 2012.)

CHAPTER 9

INSTRUMENTAL VARIABLES: USING EXOGENOUS VARIATION TO

FIGHT ENDOGENEITY

Medicaid is the U.S. government health insurance program for low income people. Does it

save lives? If so, how many? These are important but challenging questions. The challenge is,

you guessed it, endogeneity. People enrolled in

Medicaid differ from those not enrolled, not only in terms of income but also many other

factors. Some factors such as age, race, and gender are fairly easy to measure. Other factors

such as health, lifestyle, wealth, and medical knowledge are difficult to measure, however.

The danger is that these unmeasured factors may be correlated with enrollment in

Medicaid. Who is more likely to enroll: a poor sick person or a poor healthy person? Probably

the sick people are more likely to be enrolled, which means that comparing health outcomes

of enrollees and non-enrollees could show differences not only due to Medicaid, but also due

to underlying sickness that preceded the decision to enroll in Medicaid.

We must therefore be cautious – or clever – when analyzing Medicaid. This chapter goes

with clever. We show how we can navigate around endogeneity using instrumental variables.

This approach is relatively advanced, but its logic is pretty simple. The idea is to find

exogenous variation in X and use only that variation to estimate the effect of X on Y . For

the Medicaid question, we want to look for some variation in enrollment in the program

that is unrelated to the health outcomes of individuals. One way is to find some factor that

changed enrollment but was unrelated to health or lifestyle or any other factor that affects the

health outcome variable. In this chapter we show how to incorporate instrumental variables

using a technique called two-stage least squares (2SLS). In Chapter 10 we revisit 2SLS

techniques to analyze randomized experiments in which not everyone complies with their

assigned treatment.

Like many powerful tools, 2SLS can be a bit dangerous. We won’t cut off a finger using

it, but if we aren’t careful we could end up with worse estimates than we would with OLS.

And, also like many powerful tools, the approach is not cheap. In this case, the cost is that

the estimates produced by 2SLS are typically quite a bit less precise than OLS estimates.

In this chapter we provide the instruction manual for this tool. Section 9.1 provides an


example in which an instrumental variables approach proves useful. Section 9.2 presents

the basics for the 2SLS model. Section 9.3 discusses what to do when we have multiple

instruments. Section 9.4 discusses what happens to 2SLS estimates when the instruments

are flawed. Section 9.5 discusses why 2SLS estimates tend to be less precise than OLS

estimates. Section 9.6 then applies 2SLS tools to so-called simultaneous equation models in

which X causes Y , but Y also causes X.

9.1 2SLS Example

Before we work through the steps of the 2SLS approach, this section will introduce the logic

of the approach with an example about police and crime by Freakonomics author Steve

Levitt. We’ve seen the question of whether police reduce crime before (on page 371) and

know full well that an observational study almost certainly suffers from endogeneity. Why?

It is highly likely that things in the error term that cause crime – factors such as drug use,

gang warfare, demographic changes, and so forth – also are related to how many police

officers a city has. After all, it is just common sense that communities that expect more

crime hire more police. Equation 9.1 shows the basic model:

Crimeit = β0 + β1 Policei,t-1 + εit (9.1)

Levitt’s (2002) idea is that while some police are hired for endogenous reasons (city leaders

expect more crime, so hire more police), other police are hired for exogenous reasons (the


city simply had more money, so spent it). In particular, Levitt argues that the number

of firefighters in a city reflects voters’ tastes for public services, union power, and perhaps

political patronage. These factors also partially predict the size of the police force and are

not directly related to crime. In other words, to the extent that changes in the number of

firefighters predict changes in police numbers, these changes in police are exogenous because

they have nothing to do with crime. The idea, then, is to isolate the portion of changes in

the police force associated with changes in the number of firefighters and see if crime went

down (or up) in relation to those changes.

We’ll work through the exact steps of the process below. For now we can get a sense

of how instrumental variables can matter by looking at Levitt’s results. The left column of

results in Table 9.1 shows the coefficient on police estimated via a standard OLS estimation

of Equation 9.1 based on an OLS analysis with covariates and year dummy variables but

no city fixed effects. The coefficient is positive and significant, implying police cause crime.

Yikes!

Table 9.1: Levitt (2002) Results on Effect of Police Officers on Violent Crime

OLS with only OLS with year 2SLS


year dummies and city dummies
Lagged police officers 0.562 -0.076 -0.435
per capita (logged) (0.056) (0.061) (0.231)
[t = 10.04] [t = 1.25] [t = 1.88]
Standard errors in parentheses. Results are from Levitt (2002). All models include controls
for prison population, per capita income, abortion, city size, and racial demographics.


However, we’re pretty sure that endogeneity distorts simple OLS results in this context.

The second column shows that the results change dramatically when city fixed effects are

included. As discussed in Chapter 8, fixed effects account for the fact that cities with

chronically high crime also tend to have larger police forces. The estimated effect of police

is negative, but small and statistically insignificant at usual levels.

The third column shows the results when the instrumental variables technique is used.

The sign on the police variable is negative and marginally statistically significant. This result

is a huge difference from the OLS without city fixed effects and a non-trivial difference from

the OLS with city fixed effects.

Levitt’s analysis essentially treats changes in firefighters as a kind of experiment. He

estimates the number of police that cities add when they add firefighters and assesses whether

crime changed in conjunction with these particular changes in police. Levitt is using the

firefighter variable as an instrumental variable, a variable that explains the endogenous

independent variable of interest (which in this case is the log of the number of police per

capita) but does not directly explain the dependent variable (which in this case is violent

crimes per capita).

The example also highlights some limits to instrumental variables methods. First, the

increase in police associated with changes in firefighters may not really be exogenous. That is,

is the firefighter variable truly independent of the error term in Equation 9.1? It is possible,

for example, that reelection-minded political leaders provide other public services when they


boost the number of firefighters – goodies such as tax cuts, roads, and new stadiums – and

that these policy choices may affect crime (perhaps by improving economic growth). In that

case, we worry that our exogenous bump in police is actually associated with factors that

also affect crime, and that those factors may be in the error term. Therefore as we develop

the logic of instrumental variables we also spend a lot of time worrying about the exogeneity

of our instruments.

A second concern is that we may reasonably worry that changes in firefighters do not

account for much of the variation in police forces. In that case, the exogenous change we

are measuring will be modest and may lead to imprecise estimates. We see this in Table 9.1

where the instrumental variable standard errors are more than four times larger than the

OLS standard errors.

Remember This
1. An instrumental variable is a variable that explains the endogenous independent
variable of interest but does not directly explain the dependent variable.
2. When we use the instrumental variable approach, we focus on changes in Y due
to the changes in X that are attributable to changes in the instrumental variable.
3. Two major challenges associated with using instrumental variables are
(a) It is often hard to find an appropriate instrumental variable that is exogenous.
(b) Estimates based on instrumental variables are often imprecise.


9.2 Two-Stage Least Squares (2SLS)

We implement the instrumental variables approach with the two-stage least squares (2SLS)

approach. As you can see from the name, it’s still a least squares approach, meaning that

the underlying calculations are still based on minimizing the sum of squared residuals as in

OLS. The new element is that it has – you guessed it – two stages, unlike standard OLS,

which only has one stage.

In this section we distinguish endogenous and instrumental variables, explain the two

stages of 2SLS, discuss the characteristics of good instrumental variables, and describe the

challenges of finding good instrumental variables.

Endogenous and instrumental variables

The core equation in 2SLS is the same as in OLS:

Yi = β0 + β1 X1i + β2 X2i + εi (9.2)

where Yi is our dependent variable, X1i is our main variable of interest, and X2i is a control

variable (and we could easily add additional control variables).

The difference is that X1i is an endogenous variable which means that it is correlated

with the error term. Our goal with 2SLS is to replace the endogenous X1i with a different

variable that measures only that portion of X1i that is not related to the error term in the

main equation (Equation 9.2).


We model X1i as

X1i = γ0 + γ1 Zi + γ2 X2i + νi (9.3)

where Zi is a new variable we are adding to the analysis, X2i is a control variable in Equation

9.2, the γ's are coefficients that determine how well Zi and X2i explain X1i , and νi is the

error term. (Recall that γ is the Greek letter gamma and ν is the Greek letter nu.) We call

Z our instrumental variable; this variable is the star of this chapter, hands down. It is the

variable Z that is the source of our exogenous variation in X1i .

In Levitt's police and crime example, the police officers per capita variable is the endogenous

variable (X1 in our notation) and the firefighters variable is the instrumental variable

(Z in our notation). The instrumental variable is the variable that causes the endogenous

variable to change for reasons unrelated to the error term. In other words, in Levitt's model,

Z (firefighters) explains X1i (police per capita) but is not correlated with the error term in

the equation explaining Y (crime).

2SLS proceeds in two steps.

First stage regression

First, we estimate γ̂ values based on Equation 9.3 in order to generate fitted values of X,

which we call (as is our habit) X̂1i .

X̂1i = γ̂0 + γ̂1 Zi + γ̂2 X2i


Notice that X̂1i is a function only of Z, X2 , and the γs. That fact has important impli-

cations for what we are trying to do. The error term when X1i is the dependent variable is

νi ; it is almost certainly correlated with the error term in the Yi equation (which is ε). That

is, drug use and criminal history likely affect both the number of police (X1 ) and crime (Y ).

This means the actual value of X1 is correlated with ε; the fitted value X̂1i , on the other

hand, is only a function of Z, X2 , and the γs. So even though police forces in reality may

be ebbing and flowing as related to drug use and other factors in the error term of Equation

9.2, the fitted value X̂1i will not. Our X̂1i will ebb and flow only with changes in Z and X2 ,

which means our fitted value of X has been purged of the association between X and ε.

All control variables from the second stage model must be included in the first stage. We

want our instrument to explain variation in X1 over and above any variation that can be

explained by the other independent variables.

Second stage regression

In the second stage, we estimate our outcome equation, but (key point here) we use X̂1i –

the fitted value of X1i – rather than the actual value of X1i . In other words, instead of using

X1i , which we suspect to be endogenous (correlated with ε), we use the measure of X̂1i which

has been purged of X1i ’s association with error. Specifically, the second stage of the 2SLS

model is

Yi = β0 + β1 X̂1i + β2 X2i + εi (9.4)


The little hat on X̂1i is a big deal. Once we appreciate why we’re using it and how to generate

it, then 2SLS becomes easy. We are now estimating how much the exogenous variation in

X1i affects Y . Notice also that there is no Z in Equation 9.4. By the logic of 2SLS, Z only

affects Y indirectly by affecting X.

Control variables play an important role, just as in OLS. If there is some factor that affects

Y and is correlated with Z, we need to include it in the second stage regression. Otherwise,

the instrument will possibly soak up some of the effect of this omitted factor rather than

merely exogenous variation in X1 . For example, suppose that cities in the South started

facing more forest fires and hence hired more firefighters. In that case, Levitt’s firefighter

instrument for police officers will also contain variation due to region. If we do not control

for region in the second stage regression, then it possible that some of the region effect will

work its way through the instrument, creating a potential bias.

Actual estimation using 2SLS is a bit more involved than simply running OLS with X̂1

because the standard errors need to be adjusted to account for the fact that X̂1i is itself an

estimate. In practice, though, statistical packages do this adjustment automatically with

their 2SLS commands.1


1 When there is a single endogenous independent variable and a single instrument, the 2SLS estimator reduces
to β̂1 = cov(Z,Y)/cov(Z,X). Murnane and Willett (2011, 229) show that this estimator is equivalent to a 2SLS estimator.
While this estimator is computationally simpler than 2SLS, the reliance on exogeneity is not as clear as with the
2SLS estimator. In addition, the 2SLS estimator is more general as it allows for multiple independent variables and
instruments.
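
To make the two stages concrete, here is a minimal R sketch of 2SLS (the data frame df and
the variable names Y, X1, X2, and Z are hypothetical, not from any particular study). Running
the second stage by hand on the fitted values reproduces the 2SLS coefficients but not the
corrected standard errors; the ivreg function in the AER package does both stages and the
standard error adjustment in one call.
# First stage: regress the endogenous X1 on the instrument Z and the control X2
first <- lm(X1 ~ Z + X2, data = df)
df$X1hat <- fitted(first)
# Second stage by hand: replace X1 with its fitted value
# (coefficients match 2SLS, but these standard errors are not adjusted)
second <- lm(Y ~ X1hat + X2, data = df)
# Packaged 2SLS with corrected standard errors; instruments and controls go after the |
library(AER)
summary(ivreg(Y ~ X1 + X2 | Z + X2, data = df))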


Two characteristics of good instruments

The success of 2SLS hinges on the instrument. Good instruments satisfy two conditions.

These conditions are conceptually simple, but in practice they are hard to satisfy.

The first condition is that an instrument must actually explain the endogenous variable

of interest. That is, our endogenous variable, X1 , must vary in relation to our instrument, Z.

This is the inclusion condition: a condition that Z needs to exert a meaningful effect in the

first stage equation that explains X1i . In Levitt’s police example, police forces must actually

rise and fall as firefighter numbers change. It is a plausible claim, but is not guaranteed. We

can easily check this condition for any potential instrument, Z, by estimating the first stage

model of the form of Equation 9.3. If the coefficient on Z is statistically significant, we have

satisfied this condition. For reasons we explain later in Sections 9.4 and 9.5, the more Zi

explains X1i , the better.

The second condition is that an instrument must be uncorrelated with the error term in the

main equation, Equation 9.2. This condition is the exclusion condition because it implies

we can exclude the instrument from the second stage equation because the instrument exerts

no direct effect on Y . In other words, by saying that an instrument is uncorrelated with ε

we are saying that it reflects no part of the error term in the main equation and hence can

be excluded from it. Recall the kinds of things in the error term in a crime model: drug use,

gang warfare, demographic changes, and so forth. Levitt’s use of firefighters as an instrument

was based on an argument that the number of firefighters in a city was uncorrelated with


these elements of the error term.

Unfortunately, there is no direct test of whether Z is uncorrelated with ε. The whole point

of the error term is that it covers unmeasured factors. We simply cannot directly observe

whether Z is correlated with these unmeasured factors.

A natural instinct is to try to test the exclusion condition by including Z directly in the

second stage, but this method won’t work. If Z is a good instrument it will explain X1i ,

which in turn will affect Y . We will observe some effect of Z on Y , which will be the effect

of Z on X1i , which in turn can have an effect on Y . Instead the discussion of the exclusion

condition will need to be primarily conceptual rather than statistical. We will need to justify

why Z does not affect Y directly without statistical analysis. Yes, that’s a bummer and,

frankly, a pretty weird position to be in for a statistical analyst. Life is like that sometimes.2

Finding a good instrument is hard

Finding an instrument that satisfies the exclusion condition is really hard with observational

data. Economists Josh Angrist and Alan Krueger provided a famous example in a 1991

study of the effect of education on wages. Because the personal traits that lead a person to
2A test called the Hausman test (or the Durbin-Wu-Hausman test) is sometimes referred to as a test of endogeneity.
We should be careful to recognize that this is not a test of the exclusion restriction. Instead, the Hausman test tests
whether X is endogenous. It is not a test of whether Z is exogenous. Hausman derived the test by noting that if Z is
exogenous and X is endogenous, then OLS and 2SLS should produce very different β̂ estimates. If Z is exogenous and
X is exogenous, then OLS and 2SLS should produce similar β̂ estimates. The test involves assessing how different
the β̂ estimates are from OLS and 2SLS. Crucially, we need to assume Z is exogenous for this test. That's the claim
we usually want to test, so this test is often of limited value.


get more education (smarts, diligence, family wealth) are often the traits that lead someone

to financial success, education is very likely to be endogenous when explaining wages. They

therefore sought an instrument for education, a variable that would explain years of schooling

but not have anything to do with wages. They identified a very clever possibility: quarter

of birth.

While this idea seems crazy at first, it actually makes sense. Quarter of birth satisfies

the inclusion condition because how much schooling a person gets depends, in part, on what

month a person is born in. Birth month matters because of laws that say that young people

have to stay in school until they are 16. For a school district that starts kids in school based

on their age on September 1, kids born in July would be in 11th grade when they turn 16

while kids born in October (who started a year later) would only be in 10th grade when they

turn 16. Hence kids born in July can’t legally drop out until they are in the 11th grade,

while kids born in October can drop out in the 10th grade. The effect is not huge, but with a

lot of data (and Angrist and Krueger had a lot of data), this effect is statistically significant.

Quarter of birth also seems to satisfy the exclusion condition because birth month seems

unrelated to unmeasured factors that affect salary, such as smarts, diligence, and family

wealth. (Astrologers disagree, by the way.) However, Bound, Jaeger, and Baker (1995)

showed that quarter of birth has been associated with school attendance rates, behavioral

difficulties, mental health, performance on tests, schizophrenia, autism, dyslexia, multiple

sclerosis, region, and income. (Wealthy families, for example, have fewer babies in the winter


(Buckles and Hungerman 2013). Go figure.) This outcome is disappointing: If quarter of

birth doesn’t satisfy the exclusion condition, it’s fair to say a lot of less clever instruments

may be in trouble as well. Hence, we should be duly cautious when using instruments,

being sure to implement the diagnostics discussed below and being sure to test theories with

multiple instruments or analytical strategies.

Remember This
Two-stage least squares uses exogenous variation in X to estimate the effect of X on
Y.
1. In the first stage, the endogenous independent variable is the dependent variable
and the instrument, Z, is an independent variable:
X1i = “0 + “1 Zi + “2 X2i + ‹i

2. In the second stage, X̂1i (the fitted values from the first stage) is an independent
variable:
Yi = β0 + β1 X̂1i + β2 X2i + εi

3. A good instrument, Z, satisfies two conditions:


• Z must be a statistically significant determinant of X1 . In other words, it
needs to be included in the first stage of the 2SLS estimation process.
• Z must be uncorrelated with the error term in the main equation, which
means Z must not directly influence Y . In other words, an instrument must
be properly excluded from the second stage of the 2SLS estimation process.
This condition cannot be directly assessed statistically.
4. It is difficult to find an instrument that incontrovertibly satisfies the exclusion
condition when using observational data.


Discussion Questions
1. Some people believe cell phones and related technology like Twitter have
increased social unrest by making it easier to organize protest or vio-
lence. Pierskalla and Hollenbach (2013) tested this view using African
data. In its most basic form, the model was
Violencei = β0 + β1 Cell phone coveragei + εi
where Violencei is data on organized violence in city i and Cell phone
coveragei measures availability of cell phone coverage in city i.
a) Explain why endogeneity may be a concern.
b) Consider a measure of regulatory quality as an instrument for cell
phone coverage. This variable is proposed based on a separate study
of telecommunications policy in African countries that found that
regulatory quality increased cell phone availability. Explain how to
test whether this variable satisfies the inclusion condition.
c) Does the regulatory quality variable satisfy the exclusion condition?
Can we test whether this condition holds?
2. Do political protests affect election results? Consider the follow-
ing model, which is a simplified version of the analysis presented in
Madestam, Shoag, Veuger, and Yanagizawa-Drott (2013):
Republican votei = β0 + β1 Tea Party protest turnouti + εi

where Republican votes i is the vote for the Republican candidate for
Congress in district i in 2010 and Tea Party protest turnout i measures
the number of people who showed up at Tea Party protests in district i
on April 15, 2009, a day when protests were planned across the United
States.
a) Explain why endogeneity may be a concern.
b) Consider local rainfall on April 15, 2009 as an instrument for Tea
Party protest turnout. Explain how to test whether the rain variable
satisfies the inclusion condition.
c) Does the local rainfall variable satisfy the exclusion condition? Can
we test whether this condition holds?


Discussion questions - continued


3. Do economies grow more when their political institutions are better?
Consider the following simple model

Economic growthi = β0 + β1 Institutional qualityi + εi


where Economic growth i is the growth of country i and Institutional
quality i is a measure of the quality of governance of country i.
a) Explain why endogeneity may be a concern.
b) Acemoglu, Johnson, and Robinson (2001) proposed country-specific
mortality rates faced by soldiers, bishops, and sailors in the colonies
in the 17th, 18th, and 19th centuries as an instrument for current
institutions. The logic is that European powers were more likely
to set up worse institutions when the people they sent abroad kept
dying. In these places, the institutions were oriented more toward ex-
tracting resources than toward creating a stable, prosperous society.
Explain how to test whether the settler mortality variable satisfies
the inclusion condition.
c) Does the settler mortality variable satisfy the exclusion condition?
Can we test whether this condition holds?

Case Study: Emergency Care for Newborns

Are neonatal intensive care unit (NICU) facilities

effective? These are the high tech medical facilities that deal with the hardest pregnancies

and work to keep premature babies alive and healthy. It seems highly likely they help because

they attract some of the best people in medicine and

have access to the most advanced technology.

A naive analyst using observational data might not think so, however. Suppose we analyze

birth outcomes with the following simple model

Deathi = β0 + β1 NICUi + εi (9.5)

where Death equals 1 if the baby passed away and NICU equals 1 if the delivery occurred

in a high level NICU facility.

It is highly likely that the coefficient in this case would be positive. It is beyond doubt

that the hardest births go to the NICU, meaning the key independent variable (NICU) will

be correlated with factors associated with a higher risk of death. In other words, we are quite

certain endogeneity would bias the coefficient upward. We could, of course, add covariates

that indicate risk factors in the pregnancy. Doing so would reduce the endogeneity by taking

factors correlated with NICU out of the error term and putting them in the equation. We

would, nonetheless, still worry that cases that are harder than usual in reality, but perhaps

in ways that are difficult to measure, would still be more likely to end up in NICUs, meaning

endogeneity would be hard to fully purge with multivariate OLS.

Perhaps experiments could be helpful. They are, after all, designed to ensure exogeneity.

They are also completely out of bounds in this context. It is shocking to even consider

randomly assigning mothers to NICU and non-NICU facilities. It won't and shouldn't happen.


So are we done? Do we have to accept multivariate OLS as the best we can do? Not quite.

Instrumental variables, and 2SLS in particular, give us hope for producing more accurate

estimates. What we need is something that explains exogenous variation in use of NICU.

That is, can we identify some variable that explains usage of NICUs that is not correlated

with pregnancy risk factors?

Lorch, Baiocchi, Ahlberg, and Small (2012) identified a good prospect: distance to a

NICU. Specifically, they created a dummy variable that we’ll call Near NICU which equals

one for mothers for whom there was less than 10 minutes difference in travel time to a NICU

compared to another delivery hospital. The idea is that mothers who lived closer to a NICU

hospital would be more likely to deliver at the hospital that had the NICU. At the same

time, distance to a NICU should not directly affect birth outcomes; it should affect birth

outcomes only to the extent that it affects utilization of NICUs.

Does this variable satisfy the conditions necessary for an instrument? The first condition is

that the instrumental variable explains the endogenous variable which in this case is whether

the mother delivered at a NICU. Table 9.2 shows the results from a multivariate analysis in

which the dependent variable was a dummy variable indicating delivery at a NICU and the

main independent variable was the variable indicating that the mother lived near a NICU.

Clearly, mothers who live close to a NICU are more likely to deliver at a hospital with

a NICU. The estimated coefficient on Near NICU is highly statistically significant with a t

statistic over 178. Distance does a very good job explaining NICU usage. The table shows


coefficients for two other variables as well (the actual analysis has 60 control variables).

Gestational age indicates how long the baby had been gestating as of the time of delivery.

Zip code poverty indicates the percent of people in a zip code living below the poverty line.

Both of these control variables are significant, with babies that are gestationally older less

likely to be delivered in NICU hospitals and women from high poverty zip codes more likely

to deliver in NICU hospitals.

Table 9.2: Influence of Distance on NICU Utilization (First Stage Results)

Near NICU 0.040


(0.0002)
[t = 178.05]
Gestational age -0.021
(0.0006)
[t = 34.30]
Zip code poverty 0.623
(0.026)
[t = 23.83]
N 192, 077
Standard errors in parentheses. Results are based on
Lorch, Baiocchi, Ahlberg, and Small (2012). The
model includes a total of 60 controls for pregnancy
risk factors and demographics.

The second condition is that the instrumental variable is not correlated with the error

term in the second stage. This is the exclusion condition that holds that we can justifiably

exclude the instrument from the second stage. Certainly it seems highly unlikely that the

mere fact of living near a NICU would help a baby other than by the mother going to a


NICU. It is, however, possible that living near a NICU could be correlated with a risk factor.

What if NICU’s tended to be in large urban hospitals in poor areas? In that case, living

near a NICU could be correlated with poverty, which in turn could be something that is a

pregnancy risk factor. Hence it is crucial in this analysis that poverty is a control variable

in both the first and second stages. In the first stage, controlling for poverty allows us to

identify how much more mothers are likely to go to a NICU while taking neighborhood

poverty into account. In the second stage, controlling for poverty allows us to control for

the effect of this variable so as not to conflate it with the effect of actually going to a NICU

hospital.
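
A hedged sketch of how a 2SLS analysis along these lines could be set up in R with the AER
package (the data frame births and the variable names death, nicu, near_nicu, gest_age, and
zip_poverty are hypothetical stand-ins, not the authors' actual code or data):
library(AER)
# Everything after the | is the first stage: the instrument plus every control,
# so gestational age and poverty appear in both the first and second stages
nicu_2sls <- ivreg(death ~ nicu + gest_age + zip_poverty |
                   near_nicu + gest_age + zip_poverty,
                   data = births)
summary(nicu_2sls)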

Table 9.3 presents results for assessing the effect of giving birth in a NICU hospital. The

first column shows results from a bivariate model predicting whether the baby passes away

as a function of whether the delivery was in a NICU hospital. The coefficient is positive

and highly significant, meaning babies delivered in NICU hospitals are more likely to die.

For the reasons discussed earlier, we would never believe this conclusion due to obvious

endogeneity, but it provides a useful baseline to appreciate the pitfalls of failing to account

for endogeneity.

The second column shows that adding covariates changes the results considerably because

the effect of giving birth in a NICU is now associated with lower chance of death. The

effect is statistically significant with a t statistic of 6.72. The table reports results for two

covariates, gestational age and zip code poverty. The highly statistically significant coefficient


on gestational age indicates that babies that have been gestating longer are less likely to die.

The effect of zip code poverty is marginally statistically significant. The full analysis included

many more variables on risk factors and demographics as well.

Table 9.3: Influence of NICU Utilization on Baby Mortality

Bivariate OLS Multivariate OLS 2SLS


NICU utilization 0.0109 -0.0042 -0.0058
(0.0006) (0.0006) (0.0016)
[t = 17.68] [t = 6.72] [t = 3.58]
Gestational age -0.0141 -0.0141
(0.0002) (0.0002)
[t = 79.87] [t = 78.81]
Zip code poverty 0.0113 0.0129
(0.0076) (0.0078)
[t = 1.48] [t = 1.66]
N 192, 077 192, 077 192, 077
Standard errors in parentheses. Results are based on Lorch, Baiocchi, Ahlberg, and Small (2012). The
multivariate and 2SLS models include many controls for pregnancy risk factors and demographics.

We’re still worried that the multivariate OLS result could be biased upward (meaning less

negative than it should be) if unmeasured pregnancy risk factors sent women to the NICU

hospitals and, in turn, raised the chances the babies would die. The results in the 2SLS

address this concern by focusing on the exogenous change in utilization of NICU hospitals

associated with living near them. The 2SLS coefficient on NICU utilization continues to be

negative and, at -0.0058, is almost 50 percent larger in magnitude than the multivariate OLS

results (in this case, almost 50 percent more negative). This is the coefficient on the fitted


value of NICU utilization that is generated using the coefficients estimated in Table 9.2.

The estimated coefficient on NICU utilization is statistically significant, but with a smaller t

statistic than multivariate OLS, consistent with the fact that 2SLS results are typically less

precise than OLS results.

9.3 Multiple Instruments

Sometimes we have multiple potential instrumental variables that we think predict X but

not Y . In this section we explain how to handle multiple instruments and the additional

diagnostic tests we can do when we have more than one instrument.

2SLS with multiple instruments

When we have multiple instruments, we proceed more or less as we have been doing but

simply include all instruments in the first stage. So if we had three instruments (Z1 , Z2 , and

Z3 ), the first stage would be:

X1i = γ0 + γ1 Z1i + γ2 Z2i + γ3 Z3i + γ4 X2i + νi (9.6)

If these are all valid instruments, we have multiple sources of exogeneity that could improve

the fit in the first stage.

When we have multiple instruments the best way to assess whether the instruments

adequately predict the endogenous variable is to use an F test for the null hypothesis that


the coefficients on all instruments in the first stage are equal to zero. For our example, the

F test would test H0 : γ1 = γ2 = γ3 = 0. We presented the F test on page 346 of Chapter 4.

In this case rejecting the null would lead us to accept that at least one of the instruments

helps explain X1i . We discuss a rule of thumb for this test on page 451.
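
A minimal sketch of this F test in R, comparing first stage regressions with and without the
instruments via anova (the data frame df and the variable names are hypothetical):
# Unrestricted first stage: all three instruments plus the control variable
first_u <- lm(X1 ~ Z1 + Z2 + Z3 + X2, data = df)
# Restricted first stage: the instruments are excluded
first_r <- lm(X1 ~ X2, data = df)
# F test of H0: gamma1 = gamma2 = gamma3 = 0
anova(first_r, first_u)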

Overidentification tests

Having multiple instruments also allows us to implement an overidentification test. The name of the test comes from the fact that we say that an instrumental variable model is identified if we have an instrument that can explain X without directly influencing Y. When we have more than one instrument, the equation is overidentified; that sounds a bit ominous, like something will explode.³ It is actually a good thing. Having multiple instruments means we can do some additional analysis that will shed light on the performance of the instruments.

The references in the Further Reading and appendix point to a number of formal tests regarding multiple instruments. They can get a bit involved, but the core intuition is rather simple. If each instrument is valid (meaning each satisfies the two conditions for instruments), then using each one of them alone should produce an unbiased estimate of β1. Therefore, as an overidentification test, we can simply estimate the 2SLS model with each individual instrument alone. The coefficient estimates should look pretty much the same, given that each instrument alone under these circumstances produces an unbiased estimator. Hence, if each of these models produces coefficients that are similar, we can feel pretty confident that each is a decent instrument (or that they are all equally bad, which is the skunk at the garden party for overidentification tests).

If the instruments produce vastly different β̂1 coefficient estimates, then we have to rethink our instruments. This can happen if one of the instruments violates the exclusion condition. The catch is that we don't know which instrument is the bad one. Suppose β̂1 using Z1 as an instrument is very different from β̂1 using Z2 as an instrument. Is Z1 a bad instrument? Or is the problem with Z2? Overidentification tests can't say.

An overidentification test is like having two clocks. If the clocks show different times, we know at least one, and possibly both, are wrong. If both clocks show the same time, we know they're either both right or both wrong in exactly the same way.

Overidentification tests are relatively uncommon, not because they aren't useful, but because it's hard to find one good instrument, let alone two or more.

³ Everyone out now! The model is going to blow any minute ... it's way overidentified!
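An informal overidentification check of this sort is easy to run by hand in R with ivreg from the AER package. The data frame and variable names below are hypothetical; the point is simply to compare the second stage coefficient on the endogenous variable across instruments.

library(AER)
fit.z1 <- ivreg(Y ~ X1 + X2 | Z1 + X2, data = dat)   # Z1 alone as the instrument
fit.z2 <- ivreg(Y ~ X1 + X2 | Z2 + X2, data = dat)   # Z2 alone as the instrument
cbind(Z1.only = coef(fit.z1)["X1"], Z2.only = coef(fit.z2)["X1"])   # compare estimates

If the two estimates are close, that is consistent with (though not proof of) both instruments being valid.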

Remember This
An instrumental variable is overidentified when there are multiple instruments for a
single endogenous variable.
1. To estimate a 2SLS model with multiple valid instruments, simply include all of
them in the first stage.
2. To use overidentification tests to assess instruments, run 2SLS models separately
with each instrumental variable. If the second stage coefficients on the endogenous
variable in question are similar across models, the result is evidence consistent with all the instruments being valid.
9.4 Weak Instruments

2SLS estimates are fragile. In this section, we show how they can go bad if Z is correlated with ε or if Z performs poorly in the first stage.

Quasi-instrumental variables are not strictly exogenous

As discussed earlier, observational data seldom provide instruments for which we can be sure that the correlation of Z and ε is literally zero. Sometimes we have potential instruments that we believe correlate with ε just a little bit or, at least, a lot less than X correlates with ε. Such instruments are called quasi-instruments.

It can sometimes be useful to estimate a 2SLS model with a quasi-instrument because a bit of correlation between Z and ε does not necessarily render 2SLS useless. To see why, let's consider the simple case where there is one independent variable and one instrument. We examine the probability limit of β̂1 because the properties of probability limits are easier to work with than expectations in this context.⁴ For reference, we first note that the probability limit for the OLS estimate of β̂1 is

\text{plim } \hat{\beta}_1^{OLS} = \beta_1 + \text{corr}(X, \epsilon)\, \frac{\sigma_\epsilon}{\sigma_X} \qquad (9.7)

where plim refers to the probability limit and corr indicates the correlation of the two variables in parentheses. If corr(X, ε) is zero, then the probability limit of β̂1 from OLS is β1. That's a good thing! If corr(X, ε) is non-zero, the OLS estimate of β̂1 will converge to something other than β1 as the sample size gets very large. That's not good.

If we use a quasi-instrument to estimate a 2SLS model, the probability limit for the 2SLS estimate of β̂1 is

\text{plim } \hat{\beta}_1^{2SLS} = \beta_1 + \frac{\text{corr}(Z, \epsilon)}{\text{corr}(Z, X_1)}\, \frac{\sigma_\epsilon}{\sigma_{X_1}} \qquad (9.8)

If corr(Z, ε) is zero, then the probability limit of β̂1 from 2SLS is β1.⁵ Another good thing! Otherwise the 2SLS estimate of β̂1 will converge to something other than β1 as the sample size gets very large.

Equation 9.8 has two very different implications. On the one hand, the equation can be grounds for optimism about 2SLS. If we compare the probability limits from the OLS and 2SLS models, we see that if there is only a small correlation between Z and ε and a high correlation between Z and X, then 2SLS will perform better than OLS when the correlation of X and ε is large. This can happen when an instrument does a great job predicting X but has a wee bit of correlation with the error in the main equation. In other words, quasi-instruments may help us get estimates that are closer to the true value.

On the other hand, the correlation of Z and X1 in the denominator of Equation 9.8 implies that when the instrument does a poor job of explaining X1, even a small amount of correlation between Z and ε can get magnified by virtue of being divided by a very small number. In the education and wages example, the month of birth explained so little of the variation in education that the danger was that even a dash of correlation between month of birth and ε would substantially distort the 2SLS estimate.

⁴ See Section 3.5 for a refresher on probability limits, if necessary.
⁵ The form of this equation is from Wooldridge (2009), based on Bound, Jaeger, and Baker (1995).
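The following small simulation, which is not from the book but follows the logic of Equation 9.8, illustrates the optimistic case: Z has only a tiny correlation with the error but is strongly related to X, so 2SLS lands much closer to the true coefficient than OLS does. All names and numbers are made up for illustration.

library(AER)
set.seed(42)
n <- 10000
z <- rnorm(n)
e <- rnorm(n) + 0.05 * z        # quasi-instrument: z has a small correlation with the error
x <- 0.8 * z + e + rnorm(n)     # x is endogenous (depends on e) but strongly related to z
y <- 2 + 3 * x + e              # true beta1 = 3
coef(lm(y ~ x))["x"]            # OLS estimate: well above 3 because corr(x, e) > 0
coef(ivreg(y ~ x | z))["x"]     # 2SLS estimate: much closer to 3 despite the small corr(z, e)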

Weak instruments do a poor job predicting X

The possibility that our instrument may have some correlation with ε means that we have to be on guard against problems associated with weak instruments when using 2SLS. A weak instrument is an instrument that adds little explanatory power to the first stage regression. Equation 9.8 showed that when we have a weak instrument, a small amount of correlation between the instrument and the error term can lead 2SLS to produce β̂1 estimates that diverge from the true value.

Weak instruments create additional problems as well. Technically, 2SLS produces consistent but biased estimates of β1. This means that even though the 2SLS estimate converges toward the true value β1 as the sample gets large, for any given sample the expected value of the estimate will not be β1. In particular, the expected value of β̂1 from 2SLS will be skewed toward the β̂1 from OLS. In large samples this is not a big problem, but in small samples it may be more troubling. In short, it means that 2SLS has a tendency to look more like OLS than we would like in small samples. This problem worsens as the fit in the first stage model worsens.

We might be tempted, therefore, to try to pump up the fit of our first stage model by including additional instruments. Unfortunately, it's not that simple. The bias of 2SLS associated with small samples also worsens as the number of instruments increases, creating a trade-off between the number of instruments and the explanatory power of the instruments in the first stage. Each additional instrument brings at least a bit more explanatory power, but it also brings a bit more small sample bias. The details are rather involved; see Angrist and Pischke (2009, 205) for more.

It is therefore important to diagnose weak instruments by looking at how well Z explains X1 in the first stage regression. When we use multivariate regression, we'll want to know how much more Z explains X1 than the other variables in the model. We'll look for large t statistics on the Z variable in the first stage. The typical rule of thumb is that the t statistic should be greater than 3. When we have multiple instruments, a rule of thumb is that the F statistic should be at least 10 for the test of the null hypothesis that the coefficients on all instruments are zero in the first stage regression. This rule of thumb is not a statistical test, but rather a guideline for what to aim for when we say that the first stage model should fit well.⁶

⁶ The rule of thumb is from Staiger and Stock (1997). We can, of course, run an F test even when we have only a single instrument. A cool curiosity is that the F statistic in this case will be the square of the t statistic. This means that when we have only a single instrument, we can simply look for a t statistic that is bigger than √10, which we approximate (roughly!) by saying the t statistic should be bigger than 3. The appendix provides more information on the F distribution on page 783.
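In practice, these first stage diagnostics are easy to request. For instance, the summary method for ivreg objects in the AER package (used in the Computing Corner) accepts a diagnostics option that reports a "weak instruments" first stage F statistic; the model and data frame below are hypothetical.

library(AER)
iv.fit <- ivreg(Y ~ X1 + X2 | Z1 + Z2 + X2, data = dat)
summary(iv.fit, diagnostics = TRUE)   # look for a weak-instruments F statistic of at least 10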

Remember This
1. A quasi-instrument is an instrument that is correlated with the error term in the main equation. If the correlation of the quasi-instrument (Z) and the error term (ε) is small relative to the correlation of the quasi-instrument and the endogenous variable (X), the 2SLS estimate based on Z will converge to something closer to the true value than the OLS estimate will as the sample size gets very large.
2. A weak instrument does a poor job explaining the endogenous variable (X). Weak instruments magnify the problems associated with quasi-instruments and can also cause bias in small samples.
3. All 2SLS analyses should report tests of the independent explanatory power of the instrumental variable or variables in the first stage regression. A rule of thumb is that the F statistic should be at least 10 for the hypothesis that the coefficients on all instruments in the first stage regression equal zero.
9.5 Precision of 2SLS

To calculate proper standard errors for 2SLS we need to account for the fact that the fitted X̂1 values are themselves estimates. Any statistical program worth its salt does this automatically, so we typically will not have to worry about the nitty-gritty of calculating precision for 2SLS.

What we should appreciate, however, is that standard errors for 2SLS estimates differ in interesting ways from OLS standard errors. In this section we show why they tend to be bigger and how this result is largely related to the fit of the first stage regression.

The variance of 2SLS estimates is similar to the variance of OLS estimates. Recall from page 225 that the variance of a coefficient estimate in OLS is

\text{var}(\hat{\beta}_j) = \frac{\hat{\sigma}^2}{N\, \text{var}(X_j)\,(1 - R_j^2)} \qquad (9.9)

where σ̂² is the variance of ε (which is estimated as σ̂² = Σ(Yi − Ŷi)²/(N − k)) and Rj² is the R² from a regression of Xj on all the other independent variables (Xj = γ0 + γ1 X1 + γ2 X2 + ...).

For a 2SLS estimate, the variance of the coefficient on the instrumented variable is

\text{var}(\hat{\beta}_1^{2SLS}) = \frac{\hat{\sigma}^2}{N\, \text{var}(\hat{X}_1)\,(1 - R^2_{\hat{X}_1, NoZ})} \qquad (9.10)

where σ̂² = Σ(Yi − Ŷi)²/(N − k) using fitted values from 2SLS estimation and R²_X̂1,NoZ is the R² from a regression of X̂1 on all the other independent variables (X̂1 = γ0 + γ2 X2 + ...) but not the instrumental variable (we'll discuss R²_X̂1,NoZ more below).

As with OLS, variance is lower when there is a good model fit (meaning a low σ̂²) and a large sample size (meaning a large sum in the denominator).

The new points for the 2SLS variance equation relate to the fact that we use X̂1i instead of X1i in the equation. There are two important implications.

• The denominator of Equation 9.10 contains var(X̂1), which is the variance of the fitted value, X̂1 (notice the hat). If the fitted values do not vary much, then var(X̂1) will be relatively small. That's a problem because we want this quantity to be big in order to produce a small variance. In other words, we want the fitted values for our endogenous variable to vary a lot. A poor fit in the first stage regression can lead the fitted values to vary little; a good fit will lead the fitted values to vary more.

• The R²_X̂1,NoZ term in Equation 9.10 is the R² from

\hat{X}_{1i} = \pi_0 + \pi_1 X_{2i} + \eta_i \qquad (9.11)

where we use π, the Greek letter pi, for the coefficients and η, the Greek letter eta, to highlight the fact that this is a new model, different from earlier models. Notice that Z is not in this regression, meaning that the R² from it measures the extent to which X̂1 is a function of the other independent variables. If this R² is high, X̂1 is explained by X2 but not by Z, which will push up var(β̂1 from 2SLS).

The point here is not to learn how to calculate standard error estimates by hand. Computer programs will do that perfectly well. The point is to understand the sources of variance in 2SLS. In particular, it is useful to see that the ability of Z to add additional explanatory power to explain X1 is important. If it does not, our 2SLS estimates of β1 will be imprecise.
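A quick way to look at these two quantities in a given application is to compute them directly from the first stage, as in this hypothetical R sketch (the data frame dat and variable names are stand-ins).

first <- lm(X1 ~ Z + X2, data = dat)               # first stage regression
dat$X1.hat <- fitted(first)                        # fitted values of the endogenous variable
var(dat$X1.hat)                                    # var(X1-hat): bigger is better for precision
summary(lm(X1.hat ~ X2, data = dat))$r.squared     # R^2 of X1-hat on the other X's (without Z): lower is better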

As for goodness of fit, the conventional R² for 2SLS is basically broken. It is possible for it to be negative. If we really need a measure of goodness of fit, the square of the correlation of the fitted values and actual values will do. However, as we discussed when we introduced R² on page 106, the validity of the results does not depend on the overall goodness of fit.
Remember This
1. Four factors influence the variance of 2SLS β̂j estimates.
   (a) Model fit: The better the model fits, the lower will be σ̂² and var(β̂j from 2SLS).
   (b) Sample size: The more observations, the lower will be var(β̂j from 2SLS).
   (c) The overall fit of the first stage regression: The better the fit of the first stage model, the higher will be var(X̂1) and the lower var(β̂1 from 2SLS) will be.
   (d) The explanatory power of the instrument in explaining X.
       • If Z is a weak instrument (meaning it does a poor job explaining X1 when controlling for the other X variables), then R²_X̂1,NoZ will be high because X̂1 will depend almost completely on the other independent variables. The result will be a high var(β̂1 from 2SLS).
       • If Z explains X1 when controlling for the other X variables, then R²_X̂1,NoZ will be low, which will lower var(β̂1 from 2SLS).
2. R² is not meaningful for 2SLS models.
9.6 Simultaneous Equation Models

One particular source of endogeneity occurs in simultaneous equation models. In these models, X causes Y and Y also causes X. In this section we explain these models, why endogeneity is inherent in them, and how to use 2SLS to estimate them.
Endogeneity in simultaneous equation models

We characterize a simultaneous equation model with the following two equations:

Y_{1i} = \beta_0 + \beta_1 Y_{2i} + \beta_2 W_i + \beta_3 Z_{1i} + \epsilon_{1i} \qquad (9.12)
Y_{2i} = \gamma_0 + \gamma_1 Y_{1i} + \gamma_2 W_i + \gamma_3 Z_{2i} + \epsilon_{2i} \qquad (9.13)

The first dependent variable, Y1, is a function of Y2 (the other dependent variable), W (a variable that affects both dependent variables), and Z1 (a variable that affects only Y1). The second dependent variable, Y2, is a function of Y1 (the other dependent variable), W (a variable that affects both dependent variables), and Z2 (a variable that affects only Y2).

This is funky, but not crazy. Examples abound:

• In equilibrium, price in a competitive market is a function of quantity supplied. Quantity supplied is also a function of price.

• Effective government institutions may spur economic growth. At the same time, strong economic growth may produce effective government institutions.

• Individual views toward the Affordable Care Act ("ObamaCare") may be influenced by what a person thinks of President Obama. At the same time, views of President Obama may be influenced by what a person thinks of the Affordable Care Act.

With simultaneity comes endogeneity. Let's consider Y2i, which is an independent variable in Equation 9.12. We know from Equation 9.13 that Y2i is a function of Y1i, which in turn is a function of ε1i. Thus Y2i must be correlated with ε1i, which therefore means we have endogeneity. The same reasoning applies for Y1i in Equation 9.13.
Simultaneous equations are a bit mind-twisting at first. It really helps to work through the logic for ourselves. Consider the classic market equilibrium case in which price depends on quantity supplied and vice versa. Suppose we look only at price as a function of quantity supplied. Because quantity supplied depends on price, such a model is really looking at price as a function of something (quantity supplied) that is itself a function of price. Of course quantity supplied will explain price, because it is determined in part by price.

As a practical matter, the approach to estimating simultaneous equation models is quite similar to what we did for instrumental variable models. Only now we have two equations, so we'll do 2SLS twice. We just need to make sure that our first stage regression does not include the other endogenous variable, for reasons we describe shortly.

Let's focus on the case where we are more interested in the Y1 equation; the logic goes through in the same way for both equations, of course.

In this case, we want to estimate Y1 as a function of Y2, W, and Z1. Because Y2 is the endogenous variable, we'll want to find an instrument for it with a variable that predicts Y2 but does not predict Y1. We have such a variable in this case: Z2, which is in the Y2 equation but not the Y1 equation.
Using 2SLS for simultaneous equation models

The tricky thing is that Y2 is a function of Y1. If we were to run a first stage model for Y2 that included Y1 and then put the fitted value into the equation for Y1, we would have a variable that is a function of Y1 explaining Y1. Not cool. Instead we work with a reduced form equation for Y2. In a reduced form equation, Y2 is written only as a function of the non-endogenous variables (which are the W and Z variables, not the Y variables). For this reason, the first stage regression will be

Y_{2i} = \pi_0 + \pi_1 W_i + \pi_2 Z_{1i} + \pi_3 Z_{2i} + \epsilon_i \qquad (9.14)

We use the Greek letter π (pronounced "pie") for the coefficients because they will differ from the coefficients in Equation 9.13, since Equation 9.14 does not include Y1. We show in the appendix on page 802 how the reduced form relates to Equations 9.12 and 9.13.

The second stage regression will be

Y_{1i} = \beta_0 + \beta_1 \hat{Y}_{2i} + \beta_2 W_i + \beta_3 Z_{1i} + \epsilon_{1i} \qquad (9.15)

where Ŷ2i is the fitted value from the first stage regression (Equation 9.14).
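To see the mechanics, here is a hypothetical R sketch of the two steps for the first equation. The by-hand second stage is shown only for intuition; in practice we let a canned 2SLS routine (such as ivreg, used in the Computing Corner) do both steps so the standard errors are computed correctly. The data frame dat and variable names are stand-ins.

# First stage (reduced form): Y2 on the W and Z variables only, not on Y1
reduced <- lm(Y2 ~ W + Z1 + Z2, data = dat)
dat$Y2.hat <- fitted(reduced)

# Second stage by hand: replace Y2 with its fitted value
second <- lm(Y1 ~ Y2.hat + W + Z1, data = dat)   # coefficients fine, but these SEs are not correct

# Preferred: one command that does both stages and gets the standard errors right
library(AER)
iv.fit <- ivreg(Y1 ~ Y2 + W + Z1 | W + Z1 + Z2, data = dat)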

Identification in simultaneous equation models

For simultaneous equation models to work, they need to be identified, which is to say that we need to have the right number of instruments. For 2SLS with one equation, we need at least one instrument that satisfies both the inclusion and exclusion conditions. When we have two equations, we need at least one instrument for each equation. That is, to estimate both equations we need one variable that belongs in Equation 9.12 but not in Equation 9.13 (which is Z1 in our notation) and one variable that belongs in Equation 9.13 but not in Equation 9.12 (which is Z2 in our notation).

Happily, we can identify equations separately. So even if we don't have an instrument for each equation, we can nonetheless plow ahead with the equation for which we do have an instrument. If we have only a variable that works in the second equation, but not the first equation, then we can estimate the first equation (because the instrument allows us to estimate a fitted value for the endogenous variable in the first equation). If we have only a variable that works in the first equation, but not the second equation, then we can estimate the second equation (because the instrument allows us to estimate a fitted value for the endogenous variable in the second equation).

In fact, we can view the police and crime example discussed in Section 9.1 as a simultaneous equation model with police and crime determining each other simultaneously. To estimate the effect of police on crime, Levitt needed an instrument that predicted police but not crime. He argued that his firefighter variable fit the bill and then used that instrument in a first stage model predicting police forces, generating a fitted value of police that he used in the model predicting crime. We discussed this model as a single equation, but the analysis would be unchanged if we viewed it as a single equation of a simultaneous equation system.
Remember This
We can use instrumental variables to estimate coefficients for the following simultaneous equation model:

Y_{1i} = \beta_0 + \beta_1 Y_{2i} + \beta_2 W_i + \beta_3 Z_{1i} + \epsilon_{1i}
Y_{2i} = \gamma_0 + \gamma_1 Y_{1i} + \gamma_2 W_i + \gamma_3 Z_{2i} + \epsilon_{2i}

1. To estimate the coefficients in the first equation:
   • In the first stage, we estimate a model in which the endogenous variable is the dependent variable and all W and Z variables are the independent variables. Importantly, the other endogenous variable (Y1) is not included in this first stage:

     Y_{2i} = \pi_0 + \pi_1 W_i + \pi_2 Z_{1i} + \pi_3 Z_{2i} + \nu_i

   • In the second stage, we estimate a model in which the fitted value from the first stage, Ŷ2i, is an independent variable:

     Y_{1i} = \beta_0 + \beta_1 \hat{Y}_{2i} + \beta_2 W_i + \beta_3 Z_{1i} + \epsilon_{1i}

2. We proceed in a similar way to estimate coefficients for the second equation in the model:
   • First, estimate a model with Y1i as the dependent variable and the W and Z variables (but not Y2!) as independent variables.
   • Estimate the final model using Ŷ1i instead of Y1i as an independent variable.
Case Study: Support for President Bush and the Iraq War

In a democracy, decisions about war cannot be separated from politics. When the United States and its allies moved toward war with Iraq in 2002, President Bush was able to use his popularity to help build political support for the war. At the same time, what people thought about the war affected their views of the president.

If causality really runs both ways and war support influences Bush support while Bush support also influences war support, then it is hard to estimate how much these factors affected each other. For example, when predicting Bush support in terms of war support we worry that war support reflects Bush support.

Our simultaneous equation framework can help us deal with this problem. The two equations will look like this:

\text{Bush support 2002}_i = \beta_0 + \beta_1 \text{War support 2002}_i + \beta_2 W_i + \beta_3 Z_{1i} + \epsilon_{1i}
\text{War support 2002}_i = \gamma_0 + \gamma_1 \text{Bush support 2002}_i + \gamma_2 W_i + \gamma_3 Z_{2i} + \epsilon_{2i}

We typically have one or more W variables that affect both dependent variables. In this case, we probably want to control for political party because it is pretty reasonable to assume that people who supported Bush's Republican Party were more likely to support both the war and President Bush. These variables are not necessary; as a practical matter with observational data, variables that appear in both equations are simply highly likely to exist.
More importantly, we need to figure out the Z variables. These are crucial because they are our instruments and we need one for each equation. That is, we need one variable that explains war support but not Bush support and one variable that explains Bush support but not war support.

In particular, we focus on two proposed instruments. For the first equation, our instrument is the Bush feeling thermometer in 2000, which indicates how individuals thought of Bush in 2000 before the war was an issue.⁷ Because the Iraq War was not an issue in 2000, it is not possible for that issue to affect what people thought of Bush. But what people thought of Bush in 2000 almost certainly predicted what they thought of Bush in 2002; after all, it's pretty common for people to take a liking (or disliking) to politicians and to carry that feeling through over time. Hence, Bush support in 2000 can reasonably be included in the first equation and excluded from the second equation.

An instrument for the second equation needs to explain support for the war in 2002, but not support for Bush in 2002. One such variable is support for defense spending in 2000, before the Iraq War was an issue. Defending this instrument requires a bit of subtlety. Surely, support for defense spending in 2000 is correlated with Bush support in 2002, because political conservatives tended to like defense spending and George Bush, even spanning different years. Such a correlation would violate the exclusion condition that the instrument is not correlated with the error term. However, once we control for support for the Republican Party in 2002 and support for President Bush in 2000, it is less clear that support for defense spending should still directly affect Bush support in 2002. In other words, conditional on party affiliation and earlier views of Bush, the views of defense spending may affect views of Bush only inasmuch as they affected views of the Iraq War. As is typical with instrumental variables for observational data, this claim is hardly beyond question. The most reasonable defense of it as an instrument is that it is less correlated with the error term in the equation for Iraq War support in 2002. Bailey and Wilcox (2015) provide a more detailed discussion.

⁷ The American National Election Study from the University of Michigan has panel data from both 2000 and 2002. This means the same people were surveyed in 2000 and 2002, so we know what people thought of President Bush before the Iraq War was an issue.
With these two instruments, our simultaneous equation model becomes

\text{Bush support 2002}_i = \beta_0 + \beta_1 \text{War support 2002}_i + \beta_2 \text{Republican Party}_i + \beta_3 \text{Bush support 2000}_i + \epsilon_{1i}
\text{War support 2002}_i = \gamma_0 + \gamma_1 \text{Bush support 2002}_i + \gamma_2 \text{Republican Party}_i + \gamma_3 \text{Defense spending support 2000}_i + \epsilon_{2i}

We can estimate each equation of this model using 2SLS. In the first stage, we need to show that the instruments explain the endogenous variables, meaning that Bush support in 2000 explains Bush support in 2002 and that defense spending support in 2000 explains support for the Iraq War in 2002, when controlling for the other variables.
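For readers who want to see roughly what this looks like in code, here is a hypothetical R sketch of the two 2SLS models using ivreg from the AER package. The data frame nes and the variable names (bush2002, war2002, repub, bush2000, defense2000) are made-up stand-ins for the survey measures, not the actual dataset.

library(AER)
# Equation 1: Bush support 2002, with War support 2002 instrumented by defense spending support 2000
bush.eq <- ivreg(bush2002 ~ war2002 + repub + bush2000 | repub + bush2000 + defense2000, data = nes)
# Equation 2: Iraq War support 2002, with Bush support 2002 instrumented by Bush support 2000
war.eq <- ivreg(war2002 ~ bush2002 + repub + defense2000 | repub + bush2000 + defense2000, data = nes)
summary(bush.eq)
summary(war.eq)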

Table 9.4 shows that these expectations about the first stage models are borne out in the data. In the column on the left we see that Bush support in 2000 is highly predictive of Bush support in 2002. The t statistic is immense at 12.29, telling us there is little doubt these variables are related. Notice what is not in this model: support for the Iraq War in 2002. That is the other dependent variable. As discussed on page 458, the other endogenous variable is excluded from the first stage equation in simultaneous equation analysis. Notice also what is in this first stage model: defense spending support in 2000, the instrument for Iraq War support in 2002. This seems odd, as only a minute ago we were claiming that defense spending support in 2000 does not directly affect Bush support in 2002. We're still sticking by this claim, but recall that it is a conditional claim: Conditional on support for the Iraq War in 2002, defense spending support in 2000 does not directly affect Bush support in 2002. In this reduced form first-stage regression, though, we do not include support for the Iraq War in 2002. It's not surprising that the defense spending support variable affects Bush support in 2002 in this equation, as it is soaking up the effect of the Iraq War support variable, which is omitted at this point.

The column on the right shows that support for defense spending in 2000 predicts support for the Iraq War in 2002 when controlling for the other variables. The t statistic is quite healthy, at 4.44. For this first stage regression we do not include Bush support in 2002 (the other dependent variable) but do include Bush support in 2000 (the other instrument).
Table 9.4: First Stage Reduced Form Regressions for Bush/Iraq War Simultaneous Equation Model

                                     Bush support 2002 (Y1)   Iraq War support 2002 (Y2)
Bush support 2000 (Z1)                       0.40*                     0.010*
                                            (0.03)                    (0.002)
                                          [t = 12.29]                [t = 5.10]
Republican Party support (W)                 4.37*                      0.15*
                                            (0.37)                     (0.02)
                                          [t = 11.66]                [t = 6.70]
Defense spending support 2000 (Z2)           4.31*                      0.18*
                                            (0.67)                     (0.04)
                                          [t = 6.43]                 [t = 4.44]
Constant                                    15.54*                      0.32*
                                            (2.42)                     (0.14)
N                                             785                        785
Standard errors in parentheses. * indicates significance at p < 0.05.
The second-stage results are the results of most interest. The left column of Table 9.5 shows that people who liked the Iraq War in 2002 were more likely to like Bush in 2002. The key thing is that in this model we are not using actual support for the Iraq War in 2002, which would almost certainly be endogenous given that a person who supported Bush in 2002 was highly likely to support the Iraq War. Instead, via the magic of 2SLS, the Iraq War support 2002 variable is the fitted value from the first-stage results we discussed just a minute ago. The variation in this Iraq War support variable therefore does not come from anything directly related to support for Bush in 2002, but instead comes from variation in what people thought of defense spending in 2000.

The column on the right shows that Bush support in 2002 also explains support for the Iraq War in 2002. The key thing is that the Bush support 2002 variable is a fitted value from the first stage reduced form equation. The variation in this variable therefore does not come from anything directly related to support for the Iraq War in 2002, but instead comes from variation in what people thought of Bush in 2000 before the war was an issue.
Table 9.5: Second Stage Results for Bush/Iraq War Simultaneous Equation Model

                                     Bush support 2002 (Y1)   Iraq War support 2002 (Y2)
Iraq War support 2002 (Ŷ2)                  24.51*
                                            (5.13)
                                          [t = 4.77]
Bush support 2002 (Ŷ1)                                                  0.02*
                                                                       (0.004)
                                                                     [t = 5.68]
Republican Party support (W)                 0.73                       0.04
                                            (0.97)                     (0.03)
                                          [t = 0.75]                 [t = 1.23]
Bush support 2000 (Z1)                       0.16*
                                            (0.07)
                                          [t = 2.17]
Defense spending support 2000 (Z2)                                      0.07
                                                                       (0.04)
                                                                     [t = 1.61]
Constant                                     7.64                      -0.06
                                            (4.62)                     (0.16)
N                                             785                        785
R²                                           0.14                       0.38
Standard errors in parentheses. * indicates significance at p < 0.05.
Because this approach to simultaneous equations breaks the analysis into two separate 2SLS models, we can also be picky if necessary. Suppose we believe that the exclusion condition for the instrument for Bush support in 2002 (which is Bush support in 2000) is pretty good, but we do not believe the exclusion restriction for the instrument for Iraq War support in 2002 (which is support for defense spending in 2000). In that case, the 2SLS results from the model in which Iraq War support 2002 is the dependent variable still stand, as the key independent variable will be a fitted value from a first stage regression that we believe is sensible. The other results may be less sensible, but the weakness of one instrument does not kill both models, only the model where the instrument is needed. In such a case, it may make sense to try different instruments, as we discussed on page 446.
9.7 Conclusion

2SLS is a great tool for fighting endogeneity. It provides us a means to use exogenous changes in an endogenous independent variable to isolate causal effects. It's easy to implement, both conceptually (two simple regressions) and practically (let the computer do it).

The problem is that a fully convincing 2SLS can be pretty elusive. In observational data, in particular, it is very difficult to come up with a perfect or even serviceable instrument because the assumption that the instrument is uncorrelated with the error term is unverifiable statistically and often arguable in practice. The method also often produces imprecise estimates, which means that even if we have a good instrument, it might not tell us much about the relationship we are studying. Even imperfect instruments, though, can be useful because they can be less prone to bias than OLS, especially if the instrument performs well in the first stage model.

When we can answer the following, we can be confident we understand instrumental variables and 2SLS.

• Section 9.1: Explain the logic of instrumental variables models.

• Section 9.2: Explain the first stage and second stage regressions in 2SLS. What two conditions are necessary for an instrument to be valid?

• Section 9.3: Explain how to use multiple instruments in 2SLS.

• Section 9.4: Explain quasi-instruments and weak instruments and their implications for 2SLS analysis. What results from the first stage must be reported and why?

• Section 9.5: Explain how the first stage results affect the precision of the second stage results.

• Section 9.6: Explain what simultaneity is and why it causes endogeneity. Describe how to use 2SLS to estimate simultaneous equations, noting the difference from non-simultaneous models.
Further Reading

Murray (2006a) summarizes the instrumental variable approach and is particularly good at discussing finite sample bias and many statistical tests that are useful when diagnosing whether instrumental variables conditions are met. Baiocchi, Cheng, and Small (2014) provide an intuitive discussion of instrumental variables in health research.

One topic that has generated considerable academic interest is the possibility that the effect of X differs within a population. In this case, 2SLS estimates the local average treatment effect (LATE), which is the causal effect only for those people affected by the instrument. This effect is considered "local" in the sense of describing the effect for the specific class of individuals for whom the endogenous X1 variable was influenced by the exogenous Z variable.⁸

In addition, scholars who study instrumental variables methods discuss the importance of monotonicity, which is a condition that the effect of the instrument on the endogenous variable goes in the same direction for everyone in a population. This condition rules out the possibility that an increase in Z causes some units to increase X and other units to decrease X. Finally, scholars also discuss the stable unit treatment value assumption (SUTVA), a condition that the treatment doesn't vary in unmeasured ways across individuals and that there are no spillover effects – for example, if untreated neighbors of someone in the treatment group get some of the benefit of treatment via their neighbor. Imbens (2014) and Chapter 4 of Angrist and Pischke (2009) discuss these points in detail and provide mathematical derivations. Sovey and Green (2011) discuss these and related points with a focus on instrumental variables in political science.

⁸ Suppose, for example, that the effect of education on future wages differs for students who like school (they learn a lot in school, so more school leads to higher wages) and students who hate school (they learn little in school, so more school does not lead to higher wages for them). If we use month of birth as an instrument, the variation in years of schooling we are looking at is only the variation among people who would or would not drop out of high school after their sophomore year depending on when they turned 16. The effect of schooling for those folks might be pretty small, but that's what the 2SLS approach will estimate.
Key Terms
• 2SLS (425)
• Exclusion condition (434)
• Identified (458)
• Inclusion condition (434)
• Instrumental variable (428)
• Local average treatment effect (from Further Reading section, 468)
• Monotonicity (from Further Reading section, 469)
• Overidentification test (446)
• Probability limit (448)
• Quasi-instrument (448)
• Reduced form equation (458)
• Simultaneous equation model (455)
• Stable unit treatment value assumption (from Further Reading section, 469)
• Two-stage least squares (425)
• Weak instrument (450)

Computing Corner

Stata
1. To estimate a 2SLS model in Stata, use the ivregress 2sls command (which stands for instrumental variable regression). It works like the reg command in Stata, but now the endogenous variable (X1 in the example below) is indicated along with the instrument (Z in our notation in this chapter) in parentheses. The , first option tells Stata to also display the first stage regression, something we should always do:
ivregress 2sls Y X2 X3 (X1 = Z), first
2. It is important to assess the explanatory power of the instruments in the first stage
regression.
• The rule of thumb when there is only one instrument is that the t statistic on the
instrument in the first stage should be above 3. The higher, the better.
• When there are multiple instruments, run an F test. The rule of thumb is that the
F statistic should be larger than 10.
reg X1 Z1 Z2 X2 X3 /* Regress endogenous variable on instruments */
/* and all other independent variables */
test Z1=Z2=0 /* F test that instruments are both zero */
3. To estimate a simultaneous equation model, we simply use the ivregress 2sls command for each equation:
ivregress 2sls Y1 W1 Z1 (Y2 = Z2), first
ivregress 2sls Y2 W1 Z2 (Y1 = Z1), first

R
1. To estimate a 2SLS model in R, we can use the ivreg command from the AER package.
• See page 130 on how to install the AER package. Recall that we need to tell R to
use the package with the library command below for each R session in which we
use the package.
• Other packages provide similar commands to estimate 2SLS models; they’re gen-
erally pretty similar, especially for standard 2SLS models.
• The ivreg command operates like the lm command. We indicate the dependent
variable and the independent variables for the main equation. The new bit is that
we include a vertical line, after which we note the independent variables in the first
stage. R figures out that whatever is in the first part but not the second is an
endogenous variable. In this case, X1 is in the first part but not the second and
therefore is the endogenous variable:
library(AER)
ivreg(Y ~ X1 + X2 + X3 | Z1 + Z2 + X2 + X3)
2. It is important to assess the explanatory power of the instruments in the first stage
regression.
• If there is only one instrument, the rule of thumb is that the t statistic on the
instrument in the first stage should be above 3. The higher, the better.
lm(X1 ~ Z1 + X2 + X3)
• When there are multiple instruments, run an F test comparing an unrestricted equation that includes the instruments to a restricted equation that does not. The rule of thumb is that the F statistic should be larger than 10. See page 360 on how to implement an F test in R.
Unrestricted = lm(X1 ~ Z1 + Z2 + X2 + X3)
Restricted = lm(X1 ~ X2 + X3)
anova(Restricted, Unrestricted)  # F test that the coefficients on the instruments are zero
3. We can also use the ivreg command to estimate a simultaneous equation model. Indi-
cate the full model and then, after the vertical line, indicate the reduced form variables
that will be included (which is all variables but the other dependent variable):
library(AER)
ivreg(Y1 ~ Y2 + W1 + Z1 | Z1 + W1 + Z2)
ivreg(Y2 ~ Y1 + W1 + Z2 | Z1 + W1 + Z2)

Exercises
1. Does economic growth reduce the odds of civil conflict? Miguel, Satyanath, and Ser-
genti (2004) use an instrumental variable approach to assess the relationship between
economic growth and civil war. They provide data (available in RainIV.dta) on 41
African countries from 1981 to 1999, including the variables listed in Table 9.6.
Table 9.6: Variables for Rainfall and Economic Growth Question

Variable name Description


InternalConflict Coded 1 if civil war with greater than 25 deaths; 0 otherwise
LaggedGDPGrowth Lagged GDP growth
InitialGDPpercap GDP per capita at the beginning of the period of analysis, 1979.
Democracy A measure of democracy (called a “polity” score). Values range
from -10 to 10
Mountains Percent of country that is mountainous terrain
EthnicFrac Ethnic-linguistic fractionalization
ReligiousFrac Religious fractionalization
LaggedRainfallGrowth Lagged estimate of change in precipitation in millimeters from previous year

a. Estimate a bivariate OLS model in which the occurrence of civil conflict is the de-
pendent variable and lagged GDP growth is the independent variable. Comment on
the results.
b. Add control variables for initial GDP, democracy, mountains, and ethnic and reli-
gious fractionalization to the model in part (a). Do these results establish a causal
relationship between the economy and civil conflict?

c. Consider lagged rainfall growth as an instrument for lagged GDP growth. What are
the two conditions needed for a good instrument? Describe if and how we test the
two conditions. Provide appropriate statistical results.
d. Explain in your own words how instrumenting for GDP with rain could help us
identify causal effect of the economy on civil conflict.
e. Use the dependent and independent variables from part (b), but now instrument for
lagged GDP growth with lagged rainfall growth. Comment on the results.
f. Re-do the 2SLS model in part (e), but this time add country fixed effects using
dummy variables. Comment on the quality of the instrument in the first stage and
the results for the effect of lagged economic growth in the second stage.
g. (funky) Estimate the first stage from the 2SLS model in part (f) and save the residu-
als. Then estimate a regular OLS model that includes the same independent variables
from part (f) and country dummies. Use lagged GDP growth (do not use fitted val-
ues) and now include the residuals from the first stage that you just saved. Compare
the coefficient on lagged GDP growth you get here to the coefficient on that variable
in the 2SLS. Discuss how endogeneity is being handled in this specification.
2. Can television inform people about public affairs? It’s a tricky question because the
kind of nerds (like us) who watch public affairs oriented TV are already pretty well
informed to begin with. Therefore political scientists Albertson and Lawrence (2009)
conducted a field experiment in which they randomly assigned people to treatment
and control conditions. Those assigned to the treatment condition were told to watch
a specific television broadcast about affirmative action and that they would be later
interviewed on it. Those in the control group were not told about the program but
were told that they would be re-interviewed later. The program they studied aired in
California prior to the vote on Proposition 209, a controversial proposition relating to
affirmative action. Their data (available in NewsStudy.dta) includes the variables listed
in Table 9.7.
a. Estimate a bivariate OLS model in which the information the respondent has about
Proposition 209 is the dependent variable and whether or not they watched the
program is the independent variable. Comment on the results, especially if and how
they might be biased.
b. Estimate the model in part (a), but now include measures of political interest, news-
paper reading and education. Are the results different? Have we defeated endogene-
ity?
c. Why might the assignment variable be a good instrument for watching the program?
What test or tests can we run?

Table 9.7: Variables for News Program Question

Variable name Description


ReadNews Political news reading habits (never = 1 to everyday = 7)
PoliticalInterest Interest in political affairs (not interested = 1 to very interested = 4)
Education Education level (eighth grade or less = 1 to advanced graduate
degree = 13)
TreatmentGroup Assigned to watch program (treatment = 1; control = 0)
WatchProgram Actually watched program (watched = 1, did not watch = 0)
InformationLevel Information about Proposition 209 prior to election (none = 1 to
great deal = 4)

d. Estimate a 2SLS model from using assignment to the treatment group as an instru-
ment for whether the respondent watched the program. Use the additional indepen-
dent variables from part b. Compare the first-stage results to results in part (c). Are
they similar? Are they identical? (Hint: Compare sample sizes.)
e. What do the 2SLS results suggest about the effect of watching the program on
information levels? Compare the results to those in part (b). Have we defeated
endogeneity?
3. Suppose we want to understand the demand curve for a particular commodity. We'll use the following demand curve equation:

   \text{Quantity}^D_t = \beta_0 + \beta_1 \text{Price}_t + \epsilon^D_t

   Economic theory suggests β1 < 0. Following standard practice, we estimate log-log models so as to estimate the elasticity of demand with respect to price.
Table 9.8: Variables for Fish Market Question

Variable name Description


Price Daily price of fish
Supply Daily supply of fish
Stormy Dummy variable indicating a storm at sea based on height
of waves and wind speed at sea
Day1 Dummy variable indicating Monday
Day2 Dummy variable indicating Tuesday
Day3 Dummy variable indicating Wednesday
Day4 Dummy variable indicating Thursday
Rainy Dummy variable indicating rainy day at the fish market
Cold Dummy variable indicating cold day at the fish market

a. To see that prices and quantities are endogenous, draw supply and demand curves
and discuss what happens when the demand curve shifts out (which corresponds to
some change in the error term of demand function). Note also what happens to price
in equilibrium and discuss how this creates endogeneity.
b. The data set fishdata.dta (from Angrist, Graddy, and Imbens (2000)) provides data
on prices and quantities of a certain kind of fish (called Whiting)9 at the Fulton Fish
Market in New York over 111 days. The variables are indicated in Table 9.8. The
price and quantity variables are logged. Estimate a naive OLS model of demand
in which quantity is the dependent variable and price is the independent variable.
Briefly interpret results and then discuss whether this analysis is useful.
c. Angrist, Graddy, and Imbens suggest that a dummy variable indicating a storm at
sea is a good instrumental variable that should affect the supply equation but not
the demand equation. Stormy is a dummy variable that indicates wave height was
greater than 4.5 feet and wind speed was greater than 18 knots. Use 2SLS to estimate
a demand function in which stormy is an instrument for price. Discuss first stage
and second stage results, interpreting the most relevant portions.
d. Re-estimate the demand equation but with additional controls. Continue to use
stormy as an instrument for price, but now also include covariates that account for
the days of the week and weather on shore. Discuss first stage and second stage
results, interpreting the most relevant portions.
4. Does education reduce crime? If so, spending more on education could be a long-term
tool in the fight against crime. The file inmates.dta contains data used by Lochner
and Moretti in their 2004 article in The American Economic Review on the effects of
education on crime. Table 9.9 describes the variables.
Table 9.9: Variables for Education and Crime Questions

Variable name Description


prison Dummy variable equals 1 if the respondent is in prison
educ Years of schooling
age Age
AfAm Dummy variable for African Americans
state State of residence (fips codes)
year Census year
ca9 Dummy equals 1 if state compulsory schooling equals 9 years
ca10 Dummy equals 1 if state compulsory schooling equals 10 years
ca11 Dummy equals 1 if state compulsory schooling is 11 or more years

9 I’ve never heard of Whiting either.

a. Run a linear probability model with prison as the dependent variable and education,
age, and African-American as independent variables. Make this a fixed effects model
by including dummies for state of residence (state) and year of census data (year).
Report and briefly describe the results.
b. Based on the OLS results, can we causally conclude that increasing education will
reduce crime? Why is it difficult to estimate the effect of education on criminal
activity?
c. Lochner and Moretti use 2SLS to improve upon their OLS estimates. They use
changes in compulsory attendance laws (set by each state) as an instrument. The
variable ca9 indicates that compulsory schooling is equal to 9 years, ca10 indicates
that compulsory schooling is equal to 10 years, and ca11 is 11 or more years. The con-
trol group is 8 or fewer years. Does this set of instruments satisfy the two conditions
for good instruments?
d. Estimate a 2SLS model using the instruments described above and the control vari-
ables from the OLS model above (including state and year dummy variables). Briefly
explain the results.
e. 2SLS is known for being less precise than OLS. Is that true here? Is this a problem
for the analysis in this case? Why or why not?
5. Does economic growth lead to democracy? This question is at the heart of our un-
derstanding of how politics and the economy interact. The answer also exerts huge
influence on policy: If we believe economic growth leads to democracy, then we may be
more willing to pursue economic growth first and let democracy come later. If economic
growth does not lead to democracy, then perhaps economic sanctions or other tools may
make sense if we wish to promote democracy. Acemoglu, Johnson, Robinson, and Yared
(2008) analyzed this question using data on democracy and GDP growth from 1960 to
2000. The data is in the form of five year panels, meaning there is one observation for
each country every five years. Table 9.10 describes the variables.
Table 9.10: Variables for Income and Democracy Questions

Variable name Description


democracy_fh Freedom House measure of democracy (range from 0 to 1)
log_gdp Log real GDP per capita
worldincome Trade-weighted log GDP
year Year
YearCode Order of years of data set (1955 = 1, 1960 =2, 1965 = 3, etc.)
CountryCode Numeric code for each country

a. Are countries with higher income per capita more democratic? Run a pooled regression model with democracy (democracy_fh) as the dependent variable and logged GDP per capita (log_gdp) as the independent variable. Lag log_gdp so the model reflects that income at time t − 1 predicts democracy at time t. Describe the results. What are the concerns with this model?
b. Re-run the model from part (a), but now include fixed effects for year and country.
Describe the model. How does including these fixed effects change the results?
c. To better establish causality, the authors use two-stage least squares. One of the in-
struments that they use is changes in the income of trading partners (worldincome).
They theorize that the income of the countries that a country trades with should
predict its own GDP but not directly affect the level of democracy in the country.
Discuss the viability of this instrument with specific reference to the conditions that
instruments need to satisfy. Provide evidence as appropriate.
d. Run a 2SLS model that uses worldincome as an instrument for logged GDP. Re-
member to lag both. Compare the coefficient and standard error to the OLS and
panel data results.

CHAPTER 10

EXPERIMENTS: DEALING WITH REAL-WORLD CHALLENGES

In the 2012 presidential election, the Obama campaign team was famously teched up. Not just in iPhones and laptops, but also in analytics. They knew how to do all the things we're talking about in this book: how to appreciate the challenges of endogeneity, how to analyze data effectively, and perhaps most importantly of all, how to design randomized experiments to answer the questions they were interested in.

One thing they did was work their email list almost to exhaustion with a slew of fundraising pitches over the course of the campaign. These pitches were not random – or, wait, actually they were random in the sense that the campaign tested them ruthlessly using experimental methods. On June 26, 2012, for example, they sent email messages with randomly selected subject lines, ranging from the minimalist "Change" to the sincere "Thankful every day" to the politically scary "I will be outspent." The campaign then tracked which subject lines generated the most donations. On that day the "I will be outspent" message kicked butt, producing almost five times the donations the "Thankful every day" subject line did. As a result, the campaign sent millions of people emails with the "I will be outspent" subject line and, according to the campaign, raised millions more than they would have if they had used another subject line.
Of course, campaigns are not the only organizations that use randomized experiments. Governments and researchers interested in health care, economic development, and many other public policy issues use them all the time. And experiments are important in the private sector as well. Capital One, one of the largest credit card companies in the United States, grew from virtually nothing largely on the strength of a commitment to experiment-driven decision-making. Google, Amazon, Facebook, and eBay also experiment relentlessly.

Randomized experiments pose an alluring solution to our quest for exogeneity. Let's create it! That is, exogeneity requires that our independent variable of interest be uncorrelated with the error term. As we discussed in Section 1.3, if our independent variable is uncorrelated with everything, it is uncorrelated with the error term. Hence if the independent variable is random, then it is exogenous. Randomization produces exogeneity, which enables unbiased inference.
In theory, analysis of randomized experiments should be easy. We randomly pick a group of subjects to be the treatment group, treat them, and then look for differences compared to an untreated control group.¹ As discussed in Section 6.1, we can use OLS to estimate a difference of means model with an equation of the form

Y_i = \beta_0 + \beta_1 \text{Treatment}_i + \epsilon_i \qquad (10.1)

where Y_i is the outcome we care about and Treatment_i equals one for subjects in the treatment group.

¹ Often the control group is given a placebo treatment of some sort. In medicine, this is the well-known sugar pill instead of medicine. In social science, a placebo treatment may be an experience that shares the form, but not the content, of the treatment. For example, in a study of advertising efficacy, a placebo group might be shown a public service ad. The idea is that the mere act of viewing an ad, any ad, could affect respondents and that ad designers want their ad to cause changes over and above that.
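In R, this difference of means model is just a bivariate regression; the data frame experiment and the variable names here are hypothetical.

fit <- lm(Y ~ Treatment, data = experiment)
summary(fit)   # the coefficient on Treatment is the treatment-control difference in mean Y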

In reality, randomized experiments face a host of challenges. Not only are they costly,

potentially infeasible, and sometimes unethical as discussed in Section 1.3, they run into

several challenges that can undo the desired exogeneity of randomized experiments. This

chapter focuses on these challenges. Section 10.1 discusses the challenges raised by possible

dissimilarity of the treatment and control groups. If the treatment group differs from the

control group in ways other than the treatment, then we can’t be sure if the treatment or

other differences explain differences across these groups. Section 10.2 discusses the challenges

raised by non-compliance with assignment to experimental groups. Section 10.3 shows how to use the 2SLS tools from Chapter 9 to deal with non-compliance. Section 10.4 discusses the challenge posed to experiments by attrition, a common problem that arises when people leave the experiment. This chapter concludes in Section 10.5 by changing gears to discuss natural experiments, which occur without intervention by researchers.

1 Often the control group is given a placebo treatment of some sort. In medicine, this is the well-known sugar pill instead of medicine. In social science, a placebo treatment may be an experience that shares the form, but not the content, of the treatment. For example, in a study of advertising efficacy, a placebo group might be shown a public service ad. The idea is that the mere act of viewing an ad, any ad, could affect respondents and that ad designers want their ad to cause changes over and above that.

We refer to the attrition, balance, and compliance challenges facing experiments as ABC

issues.2 Every analysis of experiments should discuss these ABC issues explicitly.

10.1 Randomization and Balance

When we run experiments we worry that randomization may fail to produce comparable

treatment and control groups, in which case the treatment and control groups might differ

in more ways than just the experimental treatment. If the treatment group is older, for

example, then we worry that the differences between the treatment and control groups could

be due to age rather than the treatment.

In this section we discuss how to try to ensure that treatment and control groups are

equivalent, explain how treatment and control groups can differ, show how to detect such

differences, and explain what to do if there are differences.


2We actually discuss balance first, followed by compliance and then attrition because this order follows the standard
sequence of experimental analysis. But we’ll stick with calling them ABC issues because BCA doesn’t sound as cool
as ABC.


Blocking to ensure similar treatment and control groups

Ideally, researchers will be able to ensure that their treatment and control groups are similar.

They do so by blocking, which involves picking treatment and control groups in a way that

ensures they will be the same for selected covariates. A simple form of blocking is to separate

the sample into men and women and then randomly pick treatment and control subjects

within those blocks. Doing so ensures that the treatment and control groups will not differ

by sex. Unfortunately, there are limits to blocking. Sometimes it just won’t work in the

context of an experiment being carried out in the real world. Or, more pervasively, there

are practical concerns because it gets harder and harder to make blocking work the more

variables we wish to block for. For example, if we want to ensure treatment and control

groups are the same for each age and sex, we would have to pick subsets of women in each age group and men in each age group. If we add race to our wish list, then we'll have even smaller groups of individuals in targeted blocks to randomize within. Eventually, things get very complicated

and our sample size can’t provide people in every block. The Further Reading section at the

end of the chapter points to articles with more guidance on blocking.

Reasons why treatment and control groups may differ

In situations where no blocking is possible or blocking is not able to account for all variables,

differences in treatment and control groups can arise in two ways. First, the randomization

procedures may have failed. Some experimental treatments are quite valuable, such as free


health care, access to a new cancer drug, or admission to a good school. A researcher may

desire this treatment to be randomly allocated, but the family of a sick person or ambitious

school child may be able to get that person into the treatment group. Or perhaps the

people implementing the program aren’t quite on board with randomization and put some

people in or out of the treatment group for their own reasons. Or maybe the folks doing the

randomization just screwed up.

Second, even if there is no explicit violation of randomization, the treatment and control

groups may differ substantially simply by chance. Suppose we want to conduct a random

experiment on a four person family of mom, dad, big sister, and little brother. Even if we

pick the two-person treatment and control groups randomly, we’ll likely get groups that differ

in important ways. Maybe the treatment group will be dad and little brother; too many guys

there. Or maybe the treatment group will be mom and dad; too many middle-aged people

there. In these cases, any outcome differences between the treatment and control groups

would be due not only to the treatment but also possibly to the sex or age differences. Of

course, the odds that the treatment and control groups differ substantially fall rapidly as the

sample size increases (a good reason to have a big sample!). The chance that such differences

occur never completely disappears, however.


Checking for balance

Therefore an important first step in analyzing an experiment is to check for balance. Balance

exists when the treatment and control groups are similar in all measurable ways. The

core diagnostic for balance involves simply comparing difference of means for all possible

independent variables between those assigned to the treatment and control groups. To do

so we use our OLS difference of means test (as discussed on page 257) to assess for each X

variable whether the treatment and control groups are different.

X_i = γ_0 + γ_1 TreatmentAssigned_i + ν_i        (10.2)

where TreatmentAssigned_i is 1 for those assigned to the treatment group and 0 for those

assigned to the control group. We use the Greek letter γ (gamma) to indicate the coefficients

and the Greek letter ν (nu) to indicate the error term. We do not use β and ε here so as

to emphasize that the model differs from the main model (Equation 10.1 on page 480). We

estimate Equation 10.2 for each potential independent variable; each equation will produce

a different γ̂_1 estimate. A statistically significant γ̂_1 estimate indicates that the X variable

differed across those assigned to the treatment and control groups.3

Ideally, we won't see any statistically significant γ̂_1 estimates; this outcome would indicate

the treatment and control groups are balanced. If the γ̂_1 estimates are statistically significant

for many X variables, we do not have balance in our experimentally assigned groups, which suggests systematic interference with the planned random assignments.

3 More advanced balance tests also allow us to assess whether the variance of a variable is the same across treatment and control groups. See, for example, Imai (2005).
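To make the mechanics concrete, here is a minimal sketch in R of this kind of balance check. The data frame, variable names, and simulated values are all hypothetical; the point is simply that Equation 10.2 is estimated once for each potential independent variable, with treatment assignment as the lone regressor.

set.seed(42)
n   <- 1000
dat <- data.frame(
  treat_assigned = rbinom(n, 1, 0.5),   # randomized assignment (the Z variable)
  age            = rnorm(n, 30, 8),     # hypothetical pre-treatment covariates
  male           = rbinom(n, 1, 0.5),
  education      = rnorm(n, 12, 3)
)
covariates <- c("age", "male", "education")

# Estimate X_i = gamma_0 + gamma_1 TreatmentAssigned_i + nu_i for each X
balance <- t(sapply(covariates, function(v) {
  fit <- lm(as.formula(paste(v, "~ treat_assigned")), data = dat)
  summary(fit)$coefficients["treat_assigned", ]   # gamma_1 hat, SE, t stat, p-value
}))
print(balance)   # one row per covariate, analogous to Table 10.1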

We should keep statistical power in mind when evaluating balance tests. As we discussed

on page 165, statistical power relates to the probability of rejecting the null hypothesis when

we should. Power is low in small data sets, meaning that when there are few observations we

are unlikely to find statistically significant differences in treatment and control groups even

when there are differences. In contrast, power is high for large data sets, meaning we may

observe statistically significant differences even when the actual differences are substantively

small. Hence, balance tests are sensitive not only to whether there are differences across

treatment and control groups, but also to the factors that affect power. We should therefore

be cautious in believing we have achieved balance when we have small samples and we should

be sure to assess the substantive importance of any differences we see in large samples.

What if the treatment and control groups differ for only one or two variables? This

situation is not enough to indicate that randomization failed. Recall that even when there

is no difference between treatment and control groups, we will reject the null hypothesis of

no difference five percent of the time when α = 0.05. Thus if we look at twenty variables,

for example, it would be perfectly natural for the means of the treatment and control groups

to differ statistically significantly for one of those variables.

Good results on balancing tests also suggest (without proving) that balance has been

achieved even on the variables we can’t measure. Remember, the key to experiments is

that no unmeasured factor in the error term is correlated with the independent variable.


Given that we cannot see the darn things in the error term, it seems a bit unfair to expect

us to have any confidence about what’s going on in there. However, if balance has been

achieved for everything we can observe, we can reasonably (albeit cautiously) speculate that

the treatment and control groups are also balanced for factors we cannot observe.

What to do if treatment and control groups differ

If we do find imbalances, we should not ignore them. First, we should assess the magnitude

of the difference. Even if only one X variable differs across treatment and control groups,

it could be a sign of a deeper problem if the difference is huge. Second, we should control

for even smallish differences in treatment and control groups in our analysis, lest we conflate outcome differences in Y that are due to the treatment with outcome differences that are due to some X on which the treatment and control groups differ.

In other words, when we have imbalances it is a good idea to use multivariate OLS even

though in theory we need only bivariate OLS due to the random assignment of treatment. For

example, if we find that the treatment and control groups differ in age, we should estimate

Y_i = β_0 + β_1 Treatment_i + β_2 Age_i + ε_i

In adding control variables, we should be careful to control only for variables that are

measured before the treatment or that do not vary over time. If we control for a variable

measured after the treatment, it is possible that it will be affected by the treatment itself,

thereby making it hard to figure out the actual effect of treatment. For example, suppose


we are analyzing an experiment where job training was randomly assigned within a certain

population. In assessing whether the job training helped people get jobs, we would not want

to control for test scores measured after the treatment because the scores could have been

affected by the training. Including such a post-treatment variable will muddy the analysis

because part of the effect of treatment may be captured by this post-treatment variable.
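As a small sketch of this advice in R (simulated data, hypothetical names): the pre-treatment covariate goes into the model alongside the treatment dummy, while the post-treatment variable is deliberately left out.

set.seed(7)
n         <- 500
age       <- rnorm(n, 35, 10)                        # measured before treatment
treat     <- rbinom(n, 1, 0.5)
y         <- 2 + 1.5 * treat + 0.05 * age + rnorm(n) # outcome measured after treatment
post_test <- y + rnorm(n)                            # post-treatment variable: do not control for it

summary(lm(y ~ treat))         # difference of means (Equation 10.1)
summary(lm(y ~ treat + age))   # controls for an imbalanced pre-treatment covariate
# Avoid lm(y ~ treat + post_test): it would absorb part of the treatment effect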

Remember This
1. Experimental treatment and control groups are balanced if the average values
of independent variables are not substantially different for people assigned to
treatment and control groups.
2. We check for balance by conducting difference of means tests for all possible
independent variables.
3. It is a good idea to control for imbalanced variables when assessing the effect of
a treatment.


Case Study: Development Aid and Balancing

One of the most important challenges in modern

times is figuring out how best to fight the grind-

ing poverty that bedevils much of the world’s

population. Some think that alleviating poverty

is simply a question of money: Spend enough and

poverty goes away. Others are skeptical, wondering if the money spent by governmental and

non-governmental organizations actually does any good.

Using observational studies to settle this debate is dicey. Such studies estimate something

like the following equation:

Health_it = β_0 + β_1 Aid_it + β_2 X_it + ε_it        (10.3)

where Health_it is the health of person i at time t, Aid_it is the amount of foreign aid going to

person i’s village, and Xit represents one or more variables that affect health. The problem

is that the error may be correlated with aid. Aid may flow to places where people are truly

needy, with economic and social problems that go beyond any of the simple measures of

poverty we may have. Or resources may flow to places that are actually better off and better

able to attract attention than simple poverty statistics would suggest.

In other words, aid is probably endogenous. And because we cannot know if aid is

positively or negatively correlated with the error term, we have to admit that we don’t know


whether the actual effects are larger or smaller than what we observe with the observational

analysis. That’s not a particularly satisfying study.

If the government resources flowed exogenously, however, we could analyze health and

other outcomes and be much more confident that we are measuring the effect of the aid. The

Progresa experiment in Mexico is an excellent example; it is described in Gertler (2004). In

the late 1990s the Mexican government wanted to run a village-based health care program,

but realized it did not have enough resources to cover all villages at once. They decided

the fairest way to pick villages was to pick them randomly and voila! an experiment was

born. They randomly selected 320 villages as treatment cases and implemented the program

there. They also monitored 185 control villages where no new program was implemented.

In the program, eligible families received a cash transfer worth about 20 to 30 percent of

household income if they participated in health screening and education activities including

immunizations, prenatal visits, annual health check-ups, and more.

Before assessing whether the treatment worked, analysts needed to assess whether randomization worked. We want to know if villages were indeed selected randomly and, if so, whether they were similar with regard to factors that could influence health. Table 10.1 provides

results for balancing tests for the Progresa program. The first column has the γ̂_0 estimates

from Equation 10.2 for various X variables. These are the averages of the variable in question

for the young children in the control villages. The second column displays the γ̂_1 estimates,

which indicate how much higher or lower the average of the variable in question is for children in the treatment villages.


Table 10.1: Balancing Tests for Progresa Experiment: Differences of Means Tests Using OLS

Dependent variable                        γ̂_0      γ̂_1     t stat (γ̂_1)   p-value (γ̂_1)
1. Age (in years)                         1.61     0.01        0.11            0.91
2. Male                                   0.49     0.02        1.69            0.09
3. Child was ill in last 4 weeks          0.32     0.01        0.29            0.77
4. Father's years of education            3.84    -0.04        0.03            0.98
5. Mother's years of education            3.83    -0.33        1.87            0.06
6. Father speaks Spanish                  0.93     0.01        1.09            0.28
7. Mother speaks Spanish                  0.92     0.02        0.77            0.44
8. Own house                              0.91     0.01        0.73            0.47
9. House has electricity                  0.71    -0.07        1.69            0.09
10. Hectares of land owned                0.79     0.02        0.59            0.55
11. Male daily wage rate (pesos)         31.22    -0.74        0.90            0.37
12. Female daily wage rate (pesos)       27.84    -0.58        0.69            0.49
Sample size: 7,825
Results from 12 different OLS regressions in which the dependent variable is as listed at left.
The coefficients are from the model X_i = γ_0 + γ_1 Treatment_i + ν_i (see Equation 10.2 on page 484).

For example, the first line indicates that the children in the

treatment village were 0.01 years older than the children in the control village. The t statis-

tic is very small for this coefficient and the p-value is high, indicating that this difference is

not at all statistically significant. For the second row, the male variable equals one for boys

and zero for girls. The average of this variable indicates the percent of the sample that were

boys. In the control villages, 49 percent of the children were males; 51 percent (γ̂_0 + γ̂_1) of

the children in the treatment villages were male. This two percent difference is statistically

significant at the 0.10 level (given that the p-value is less than 0.10). The most statistically

significant difference we see is in mother’s years of education, for which the p-value is 0.06.


In addition, houses in the treatment group were less likely to have electricity, with a p-value

of 0.09.

These results were taken by the study author to indicate that balance was achieved.

We see, though, that achieving balance is an art, rather than a science because for twelve

variables, only one or perhaps two would be expected to be statistically significant at the

α = 0.10 level if there were, in fact, no differences across the groups. These imbalances

should not be forgotten; in this case, the analysts controlled for all of the listed variables

when estimating treatment effects.

And, by the way, did the Progresa program work? In a word, yes. Using difference of

means tests, analysts found that kids in the treatment villages were sick less often, taller,

and less likely to be anemic.

10.2 Compliance and Intention-to-treat Models

Many social science experiments also have to deal with compliance problems, which arise

when some people assigned to a treatment do not experience the treatment to which they

were assigned. A compliance problem can happen, for example, when someone is randomly

assigned to receive a phone call asking him to donate to charity. If the person does not answer

the phone, we say (perhaps a bit harshly) that he failed to comply with the experimental

treatment.


In this section we show how non-compliance can create endogeneity, present a schematic

for thinking about the problem, and introduce so-called intention-to-treat models as one way

to deal with the problem.

Non-compliance and endogeneity

Non-compliance is often non-random, opening a back door for endogeneity to weasel its way

into the experiments because the people who comply with a treatment may systematically

differ from the people who do not. This is precisely the problem we use experiments to avoid.

Educational voucher experiments illustrate how endogeneity can sneak in with non-

compliance. These experiments typically start when someone ponies up a ton of money

to send poor kids to private schools. Because there are more poor kids than money, appli-

cants are randomly chosen in a lottery to receive vouchers to attend private schools. These

are the kids in the treatment group. The kids who aren’t selected in the lottery are the

control group.4 After a year of schooling (or more), the test scores of the treatment and

control groups are compared to see if kids in private schools did better. Because being in

the voucher schools is a function of a random lottery, we can hope that the only systematic

difference between the treatment and control groups is whether they attended the private

school and that factor therefore caused any differences in outcomes we observe.
4 Researchers in this area are careful to analyze only students who actually applied for the vouchers because the
kinds of students (and parents) who apply for vouchers for private schools almost certainly differ from students (and
parents) who do not.


Non-compliance complicates matters. Not everyone who receives the voucher uses it to

attend private school. In a late 1990s New York City voucher experiment discussed by

Howell and Peterson (2004), for example, 74 percent of families who were offered vouchers

used them in the first year. That number fell to 62 percent and 53 percent after two and

three years of the program, respectively. There are lots of reasons kids with vouchers might

end up not attending the private school. They might find the private school unwelcoming

or too demanding. Their parents may move. Some of these causes are plausibly related

to academic performance: A child who finds private school too demanding is likely less

academically ambitious than one who does not have that reaction. In that case, the kids

who actually use vouchers to attend private school (the “compliers” in our terminology) are

not a randomly selected group, but rather are a group that could systematically differ from

kids who decline to use the vouchers. The result can be endogeneity because the variable of

interest (attending private school) could be correlated with factors in the error term (such

as academic ambition) that explain test performance.

Schematic representation of the non-compliance problem

Figure 10.1 provides a schematic of the non-compliance problem (Imai 2005). At the top

level, a researcher randomly assigns subjects to receive the treatment or not. If a subject is

assigned to receive a treatment, Zi = 1; if a subject is not assigned to receive a treatment,

Zi = 0. Subjects assigned treatment who actually receive it are the compliers and for them


Ti = 1, where T indicates whether the person actually received the treatment. The people

who are assigned to treatment (Zi = 1) but who do not actually receive it (Ti = 0) are the

non-compliers.

For everyone in the control group Zi equals zero, indicating they were not assigned to

receive the treatment. We don’t observe compliance for people in the control group because

they’re not given a chance to comply. Hence the dashed lines in the figure indicate that

we can’t know who among the control group are would-be compliers and would-be non-

compliers.5

We can see the mischief caused by non-compliance when we think about how to compare

treatment and control groups in this context. We could compare the students who actually

went to the private school (Ti = 1) to those who didn’t (Ti = 0). Note, however, that the

Ti = 1 group includes only compliers – the kind of students who, when given the chance to

go to a private school, took it. These students are likely to be more academically ambitious

than the non-compliers. The Ti = 0 group includes non-compliers (for whom Zi = 1 and

Ti = 0) and those not assigned to treatment (for whom Zi = 0). This comparison likely

stacks the deck in favor of finding that the private schools improve test scores because this

Ti = 1 group has a disproportionately high proportion of educationally ambitious students.

5 An additional wrinkle in the real world is that people from the control group may find a way to receive the treatment even without being assigned to treatment. For example, in the New York voucher experiment discussed above, five percent of the control group ended up in private schools even though they did not receive a voucher.

[FIGURE 10.1: Compliance and Non-compliance in Experiments. Treatment assignment (random) splits subjects into Z_i = 1 and Z_i = 0. For those assigned to treatment (Z_i = 1), compliance is non-random: compliers receive the treatment (T_i = 1) and non-compliers do not (T_i = 0). For the control group (Z_i = 0, T_i = 0), compliance is unobserved, so would-be compliers cannot be distinguished from would-be non-compliers.]

Another option is to compare the compliers (the Z_i = 1 and T_i = 1 students) to the

whole control group (the Zi = 0 students). This method too has problems. The control

group has two types of students – would-be-compliers and would-be-non-compliers – while

the treatment group in this approach only has compliers. Any differences found with this

comparison could either be due to the effect of the private school or due to the fact that the

complier group has no non-compliers while the control group includes both complier types

and non-complier types.

Intention-to-treat models

A better approach is to conduct an intention-to-treat analysis (ITT). When conducting

an ITT analysis, we compare the means of those assigned treatment (the whole Zi = 1 group

that includes those who received and did not receive the treatment) to those not assigned

treatment (the Zi = 0 group that includes would-be compliers and would-be non-compliers).

The ITT approach sidesteps non-compliance endogeneity at the cost of producing estimates

that are statistically conservative (meaning we expect the estimated coefficients to be smaller

than the actual effect of the treatment).

To understand ITT, let’s start with the non-ITT model that we really care about:

Y_i = β_0 + β_1 Treatment_i + ε_i        (10.4)

For individuals who receive no treatment (Treatment_i = 0), we expect Y_i to equal some baseline value, β_0. For individuals who have received the treatment (Treatment_i = 1), we expect Y_i to be β_0 + β_1. This simple bivariate OLS model allows us to test for the


difference of means between treatment and control groups.

The problem, as we have discussed, is that non-compliance creates correlation between

treatment and the error term because the type of people who comply with the treatment

may systematically differ from non-compliers.

The idea behind the ITT approach is to look for differences between the whole treatment

group (whether they complied or not) and the whole control group. The model is

Y_i = δ_0 + δ_1 Z_i + ν_i        (10.5)

In this model, Z is 1 for individuals assigned to the treatment group. We use δ to highlight

the fact that we are using treatment-assignment as the independent variable rather than

actual treatment. In this model, δ_1 is an intention-to-treat estimator because it estimates the

difference between all the people we intended to treat and all the people we did not intend

to treat.

Note that Z is uncorrelated with the error term. It reflects assignment to treatment

(rather than actual compliance with treatment) and hence none of the compliance issues are

able to sneak in correlation with anything, including the error term. Therefore the coefficient

estimate associated with the treatment assignment variable will not be clouded by other

factors that could explain both the dependent variable and compliance. For example, if

we use ITT analysis to explain the relationship between test scores and attending private

schools, we do not have to worry that our key independent variable is also capturing the


fact that the more academically ambitious kids may have been more likely to use the private

school vouchers. ITT avoids this problem by comparing all kids given a chance to use the

vouchers to all kids not given that chance.

ITT is not costless, however. When there is non-compliance, ITT will underestimate

the treatment effect. This means the ITT estimate, δ̂_1, is a lower bound estimate of β_1, the effect of the treatment itself from Equation 10.4 on page 496. In other words, we expect the magnitude of the δ̂_1 parameter estimated from Equation 10.5 to be smaller than or equal to the β_1 parameter in Equation 10.4.

To see why, consider the two extreme possibilities for compliance: zero compliance and full

compliance. If there is zero compliance such that no one assigned treatment complied (T_i = 0 for all Z_i = 1), then δ_1 has to be 0 because there is no difference between the treatment and control groups. (No one took the treatment!) At the other extreme, if everyone assigned treatment (Z_i = 1) also complied (T_i = 1), then the Treatment_i variable in Equation 10.4 will be identical to Z_i (treatment assignment) in Equation 10.5. In this instance, β̂_1 will be an unbiased estimator of β_1 because there is no non-compliance problem messing up the exogeneity of the random experiment. In this case, β̂_1 = δ̂_1 because the variables in the models would be identical.

Hence we know that the ITT estimate δ̂_1 is going to be somewhere between zero and

an unbiased estimator of the true treatment effect. The lower the compliance, the more the

ITT estimate will be biased toward zero. The ITT estimator is still preferable to β̂_1 from


a model with treatment received when there are non-compliance problems because that β̂_1

can be biased due to the endogeneity that enters the model when compliers differ from

non-compliers.
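The attenuation logic can be seen in a small simulation, a sketch with made-up numbers rather than anything from the chapter's examples. The true treatment effect is 2, only about half of those assigned to treatment comply, and compliance depends on an unobserved trait that also raises the outcome. The regression on treatment delivered is biased upward by that trait; the ITT regression on assignment comes in near the compliance rate times the true effect.

set.seed(123)
n        <- 10000
ambition <- rnorm(n)                                    # unobserved trait in the error term
assigned <- rbinom(n, 1, 0.5)                           # randomized treatment assignment (Z)
treated  <- ifelse(assigned == 1 & ambition > 0, 1, 0)  # roughly half of those assigned comply
y        <- 1 + 2 * treated + ambition + rnorm(n)       # true treatment effect = 2

coef(lm(y ~ treated))["treated"]     # treatment delivered: biased upward (compliers have high ambition)
coef(lm(y ~ assigned))["assigned"]   # ITT: close to 0.5 x 2 = 1, the compliance rate times the true effect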

The ITT approach is a cop-out, but in a good way. When we use it, we’re being con-

servative in the sense that the estimate will be prone to underestimate the magnitude of

the treatment effect. If we find an effect using the ITT approach, it will not be due to

endogenous non-compliance issues, but due to treatment.

Researchers regularly estimate ITT effects. Sometimes whether or not someone complied

with a treatment is not known. For example, if the experimenter mailed advertisements to

randomly selected households it will be very hard, if not impossible, to know who actually

read the ads (Bailey, Hopkins and Rogers 2015).

Or sometimes the ITT effect is the most relevant quantity of interest, such as when

we know that compliance will be spotty and we want to build non-compliance into our

estimate of a program’s effectiveness. Miguel and Kremer (2004) analyzed an experiment in

Kenya that provided medical treatment for intestinal worms to children at randomly selected

schools. Some children in the treated schools were not treated because they missed school the

day the medicine was administered. An ITT analysis in this case compares kids assigned to

treatment (whether or not they were in school) to kids not assigned to treatment. Because

some kids will always miss school for a treatment like this, policymakers may care more

about the ITT estimated effect of the program because ITT takes into account both the treatment effect


and the less than perfect compliance.

Remember This
1. In an experimental context, a person assigned to receive a treatment who actually
receives the treatment is said to comply with the treatment.
2. Non-compliance creates endogeneity when compliers differ from non-compliers.
3. Intention-to-treat (ITT) analysis compares people assigned to treatment (whether
they complied or not) to people in the control group.
• ITT is not vulnerable to endogeneity caused by non-compliance.
• ITT estimates will be smaller in magnitude than the true treatment effect.
The more the non-compliance, the closer to zero will be the ITT estimates.

10.3 Using 2SLS to Deal with Non-compliance

An even better way to deal with non-compliance is to use 2SLS to directly estimate the effect

of treatment. The key insight is that randomized treatment assignment is a great instrument.

Randomized assignment satisfies the exclusion condition (that Z is uncorrelated with ε)

because it is uncorrelated with everything, including the error term. Random assignment

also usually satisfies the inclusion condition because being randomly assigned to treatment

typically predicts whether a person got the treatment.

In this section we build on material from Section 9.2 to show how to use 2SLS to deal with

non-compliance. We do so by working through an example and by showing the sometimes

counterintuitive way we use variables in this model.


Example of using 2SLS to deal with non-compliance

To see how to use 2SLS to analyze an experiment with non-compliance, let’s look at an

experimental study of get-out-the-vote efforts. Political consultants often joke that they

know half of what they do works, they just don’t know which half. An experiment might

help figure out which half (or third or quarter!) works.

We begin by laying out what an observational study of campaign effectiveness looks like.

A simple model is

Turnout_i = β_0 + β_1 Campaign contact_i + ε_i        (10.6)

where Turnout_i equals 1 for people who voted and equals 0 for those who did not.6 The

independent variable is whether or not someone was contacted by a campaign.

What is in the error term? Certainly political interest will be, because more politically

attuned people are more likely to vote. We’ll have endogeneity if political interest (something

in the error term) is correlated with contact by a campaign (the independent variable). We

will probably have endogeneity because campaigns do not want to waste their time contacting

people who won’t vote. Hence, we’ll have endogeneity unless the campaign is incompetent

(or, ironically, run by experimentalists).

Such endogeneity could corrupt the results easily. Suppose we find a positive association

between campaign contact and turnout. We should worry that the relationship is due not to the campaign contact, but due to the fact that the kind of people who were contacted were more likely to vote even before they were contacted. Such concerns make it very hard to analyze campaign effects with observational data.

6 The dependent variable is a dichotomous variable. We discuss how to deal with such dependent variables in Chapter 12.

Professors Alan Gerber and Don Green (2000, 2005) were struck by these problems with

observational studies and have almost single handedly built an empire of experimental

studies in American politics.7 As part of their signature study, they randomly assigned

citizens to receive in-person visits from a get-out-the-vote campaign. In their study all the

factors that affect turnout would be uncorrelated with assignment to receive the treatment.8

Compliance is a challenge in such studies. When campaign volunteers visited, not ev-

eryone answered the door. Some people weren’t home. Some were in the middle of dinner.

Maybe a few ran out the back door screaming when they saw the hippie volunteer ringing

their doorbell.

Non-compliance, of course, could affect the results. If the more socially outgoing types

answered the door (hence receiving the treatment) and the more reclusive types did not

answer the door (hence not receiving the treatment even though they were assigned to it),

the treatment variable as delivered would depend not only on the random assignment, but

also on how outgoing a person was. If more outgoing people are more likely to vote, then

we have endogeneity because treatment as delivered will be correlated with the sociability
7 Or should we say double handedly? Or, really, quadruple handedly?
8 The study also looked at other campaign tactics, such as phone calls and mailing postcards. They didn’t work
as well as the personal visits; for simplicity, we focus only on the in-person visits.

c
•2014 Oxford University Press 502
Chapter 10. Experiments: Dealing with Real-World Challenges

of the experimental subject.

To get around this problem, Gerber and Green used treatment assignment as an instru-

ment. This variable, which we’ve been calling Zi , indicates whether a person was randomly

selected to receive a treatment. This variable is well suited to satisfy the conditions neces-

sary for a good instrument we discussed in Section 9.2. First, it should be included in the

first stage because being randomly assigned to be contacted by the campaign does indeed

increase campaign contact. Table 10.2 shows the results from the first stage of Gerber and

Green’s turnout experiment. The dependent variable is treatment-delivered, meaning it is 1

if the person actually talked to the volunteer canvasser. The independent variable is whether

or not the person was assigned to treatment.

Table 10.2: First Stage Regression in Campaign Experiment: Explaining Contact

Personal visit assigned (Z=1)    0.279*
                                 (0.003)
                                 [t = 95.47]
Constant                         0.000
                                 (0.000)
                                 [t = 0.00]
N                                29,380
Standard errors in parentheses
* indicates significance at p < 0.05

These results suggest that 27.9 percent of those assigned to be visited were actually

visited. In other words, 27.9 percent of the treatment group complied with the treatment.


This estimate is hugely statistically significant, in part due to the large sample size. The

intercept is 0.0, implying no one in the non-contact assigned group was contacted by this

particular get-out-the-vote campaign.

The treatment assignment variable Zi also is highly likely to satisfy the 2SLS exclusion

condition because the randomized treatment-assignment variable Zi only affects Y through

people actually getting campaign contact. Being assigned to be contacted by the campaign

in and of itself does not affect turnout. Note we are not saying that the people who actually

complied (received a campaign contact) are random, for all the reasons above related to

concerns about compliance. We are simply saying that when we put a check next to some

randomly selected names indicating they should be visited, these folks were indeed randomly

selected. That means Z is uncorrelated with ‘ and can, therefore, be excluded from the main

equation.

In the second stage regression, we use the fitted values from the first stage regression as

the independent variable. Table 10.3 shows that the effect of a personal visit is to increase

probability of turning out to vote by 8.7 percentage points. This estimate is statistically

significant as we can see from the t stat of 3.34. We could improve the precision of the

estimates by adding covariates but do not need to do so to avoid bias.


Table 10.3: Second Stage Regression in Campaign Experiment: Explaining Turnout

Personal visit (T̂)    0.087*
                       (0.026)
                       [t = 3.34]
Constant               0.448*
                       (0.003)
                       [t = 138.38]
N                      29,380
Dependent variable is 1 for individuals who voted. The independent variable is the fitted
value from the first stage. Standard errors in parentheses. * indicates significance at p < 0.05

Understanding variables in 2SLS models of non-compliance

Understanding the way the fitted values work is useful for understanding how 2SLS works

here. Table 10.4 shows the three different ways we are measuring campaign contact for three

hypothetical observations. In the first column is treatment assignment. Volunteers were to

visit Laura and Bryce and not to visit Gio. This selection was randomly determined. In

the second column is actual contact, which is observed contact by the campaign. Laura

answered the door when the campaign volunteer knocked, but Bryce did not. (No one

went to poor Gio’s door.) The third column displays the fitted value from the first stage

equation for the treatment variable. These fitted values depend only on contact assignment.

Laura and Bryce were randomly assigned to be contacted (Z = 1), so both their fitted values

were T̂ = 0.0 + 0.279 × 1 = 0.279 even though Laura was actually contacted and Bryce


wasn't. Gio was not assigned to be visited (Z = 0), so his fitted contact value was T̂ = 0.0 + 0.279 × 0 = 0.0.

Table 10.4: Various Measures of Campaign Contact in 2SLS Model for Selected Observations

Name     Contact-assigned (Z)   Contact-delivered (T)   Contact-fitted (T̂)
Laura            1                       1                     0.279
Bryce            1                       0                     0.279
Gio              0                       0                     0.000

2SLS uses the “Contact-fitted” (T̂ ) variable. This variable might be the weirdest thing

in the whole book9 , so it is worth taking the time to really understand it. Even though

Bryce was not contacted, his T̂_i is 0.279, just the same as Laura, who was successfully

visited. Clearly, this variable looks very different than actual observed campaign contact.

Yes, this is odd, but it is a feature, not a bug. The core inferential problem, as we have noted, is

endogeneity in actual observed contact. Bryce might be avoiding contact because he loathes

politics. That’s why we don’t want to use observed contact as a variable – it would capture

not only the effect of contact, but also the fact that the type of people who get contact in

observational data are different. The fitted value, however, only varies according to Z –

something that is exogenous. In other words, by looking at the bump up in expected contact

associated with being in the randomly assigned contact-assigned group, we have isolated the

exogenous bump up in contact associated with the exogenous factor and can assess whether it is associated with a corresponding bump up in voting turnout.

9 Other than the ferret thing - also weird.
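Here is a minimal sketch in R of this two-step logic, using simulated data with hypothetical names (the Gerber and Green data are not reproduced here). The first stage regresses contact delivered on contact assigned, the fitted values stand in for actual contact in the second stage, and the same coefficient comes from a one-step instrumental variables routine such as ivreg() in the AER package, which also reports appropriate standard errors.

library(AER)                                         # provides ivreg()

set.seed(99)
n        <- 20000
sociable <- rnorm(n)                                 # unobserved; affects both contact and turnout
z        <- rbinom(n, 1, 0.5)                        # randomly assigned to be visited
contact  <- ifelse(z == 1 & sociable > 0.6, 1, 0)    # roughly 27 percent of the assigned comply
turnout  <- rbinom(n, 1, plogis(-0.2 + 0.4 * contact + 0.5 * sociable))

first <- lm(contact ~ z)                             # first stage, as in Table 10.2
t_hat <- fitted(first)                               # "contact-fitted," as in Table 10.4

summary(lm(turnout ~ t_hat))            # second stage by hand: right coefficient, wrong standard errors
summary(ivreg(turnout ~ contact | z))   # 2SLS in one step, with correct standard errors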

Remember This
1. 2SLS is useful to analyze experiments when there is imperfect compliance with
the experimental treatment.
2. Assignment to treatment typically satisfies the inclusion and exclusion conditions
necessary for instruments in 2SLS analysis.


Case Study: Minneapolis Domestic Violence Experiment

Instrumental variables can be used to analyze an

ambitious and, at first glance, very unlikely ex-

periment. It deals with the problem of domestic

violence, a social ill that has long challenged po-

lice and others trying to reduce it. This may

sound like a crazy place for an experiment, but

stay tuned because it turns out not to be.

The goal is to figure out what police should do when they come upon a domestic violence

incident. Police can either take a hard line and arrest suspects whenever possible or they

can take a conciliatory line and decline to make an arrest as long as no one is in immediate

danger. Either approach could potentially be more effective: Arresting suspects creates clear

consequences for offenders, while not arresting them may possibly defuse the situation.

So what should police do? This is a great question to answer empirically. A model based

on observational data would look like

Arrested later_i = β_0 + β_1 Arrested initially_i + β_2 X_i + ε_i        (10.7)

where Arrested later is 1 if the person is arrested at some later date for domestic violence,

Arrested initially is 1 if the suspect was arrested at the time of the initial domestic violence

report, and X refers to other variables, such as whether a weapon or drugs were involved in


the first incident.

Why might there be endogeneity? (That is, why might we suspect Arrested initially

to be correlated with the error term?) Elements in the error term include person-specific

characteristics. Some people who have police called on them are indeed nasty; let’s call them

the bad eggs. Other people who have the police called on them are involved in a once-in-

a-lifetime incident; compared to the overall population of people who have police called on

them, they are the (relative) good eggs. Such personality traits are in the error term of the

equation predicting domestic violence in the future.

We could also easily imagine that people’s good or bad eggness will affect whether they

were arrested initially. Police who arrive at the scene of a domestic violence incident involving

a bad egg will, on average, find more threat; police who arrive at the scene of an incident

involving a (relative) good egg will likely see a less threatening environment. We would

expect police to arrest the bad egg types more often and we would expect these folks to have

more problems in the future. Observational data could therefore suggest that arrests make

things worse because those arrested are more likely to be bad eggs and therefore more likely

to be re-arrested.

The problem is endogeneity. The correlation of the arrested initially variable and the

personal characteristics in the error term makes it impossible for observational data to isolate

the effect of the policy (arrest) from the fact that this policy may be differentially applied

to different types of people.


An experiment is promising here, at least in theory. If police randomly choose to arrest

people when on domestic violence calls, then our arrest variable would no longer be correlated

with the personal traits of the perpetrators. Of course, this idea is insane, right? Police can’t

randomly arrest people (can they?). Believe it or not, researchers in Minneapolis created an

experiment to do just that. We’ll simplify the experiment a bit; more details are in Angrist

(2006). The Minneapolis researchers gave police a note pad to document incidents. The

note pad had randomly colored pages; the police officer was supposed to arrest or not arrest

the perpetrator based on the color of the page.

Clearly, perfect compliance is impossible and undesirable. No police department could

make such important decisions as to arrest or not based simply on the color of pages in a

notebook. Some circumstances are so dangerous that an arrest must be made, notebook

be damned. Endogeneity concerns arise because the type of people arrested under these

circumstances (the bad eggs) are different than those not arrested.

2SLS can rescue the situation. First we’ll show how randomization in experiments satisfies

the 2SLS conditions and then we’ll show how 2SLS works and compares to other approaches.

The inclusion condition is that Z explains X. In this case, the condition requires that

assignment to the arrest treatment actually predicts being arrested. Table 10.5 shows that

those assigned to be arrested were 77.3 percentage points more likely to be arrested, even

when controlling for whether a weapon or drugs were recorded as being present at the scene.

The effect is massively statistically significant with a t statistic of 17.98. The intercept was


not directly reported in the original paper, but from other information in the paper we can

determine that γ̂_0 = 0.216 in our first stage regression.

Table 10.5: First Stage Regression in Domestic Violence Experiment: Explaining Arrests

Arrest assigned (Z=1)    0.773*
                         (0.043)
                         [t = 17.98]
Weapon                   0.064
                         (0.045)
                         [t = 1.42]
Drugs                    0.088*
                         (0.040)
                         [t = 2.20]
Constant                 0.216
N                        314
Standard errors in parentheses
* indicates significance at p < 0.05

Assignment to the arrest treatment is very plausibly uncorrelated with the error term.

This condition is not testable and must instead be argued based on non-statistical evidence.

Here, the argument is pretty simple: The instrument was randomly generated and therefore

not correlated with anything, in the error term or otherwise.

Before we present the 2SLS results, let’s be clear about the variable used in the 2SLS

model as compared to the variables used in other approaches. Table 10.6 shows the three

different ways to measure arrest. The first (Z) is whether an individual was assigned to

the arrest treatment. The second (T ) is whether a person was in fact arrested. The third


Table 10.6: Selected Observations for Minneapolis Domestic Violence Experiment

Observation   Arrest-assigned (Z)   Arrest-delivered (T)   Arrest-fitted (T̂)
1                     1                     1                   0.989
2                     1                     0                   0.989
3                     0                     1                   0.216
4                     0                     0                   0.216

(T̂ ) is the fitted value of arrest based on Z. We report 4 examples, assuming none of

them had weapons or drugs in their initial incident. Person 1 was assigned to be arrested

and in fact arrested. His fitted value is γ̂_0 + γ̂_1 × 1 = 0.216 + 0.773 = 0.989. Person 2 was assigned to be arrested and not arrested. His fitted value is the same as person 1's: γ̂_0 + γ̂_1 × 1 = 0.216 + 0.773 = 0.989. Person 3 was not assigned to be arrested but was in fact arrested. He was probably pretty nasty when the police showed up. His fitted value is γ̂_0 + γ̂_1 × 0 = 0.216 + 0 = 0.216. Person 4 was not assigned to be arrested and was not arrested. He was probably relatively calm when the police showed up. His fitted value is γ̂_0 + γ̂_1 × 0 = 0.216 + 0 = 0.216. Even though we suspect persons 3 and 4 are very different

types of people, the fitted values are the same, which is actually a good thing because factors

associated with actually being arrested (the bad egg-ness) that are correlated with the error

term in the equation predicting future arrests are purged from the T̂ variable.

Table 10.7 shows the results from three different ways to estimate a model in which

Arrested later is the dependent variable. The models also control for whether a weapon or


drugs were involved in the initial incident. The OLS model uses treatment delivered (T ) as

the independent variable. The ITT model uses treatment assigned (Z) as the independent

variable. The 2SLS model uses the fitted value of treatment (T̂ ) as the independent variable.

Table 10.7: Analyzing Domestic Violence Experiment Using Different Estimators

              OLS          ITT          2SLS
Arrest       -0.070       -0.108*      -0.140*
             (0.038)      (0.041)      (0.053)
             [t = 1.84]   [t = 2.63]   [t = 2.64]
Weapon        0.010        0.004        0.005
             (0.043)      (0.042)      (0.043)
             [t = 0.23]   [t = 0.10]   [t = 0.12]
Drugs         0.057        0.052        0.064
             (0.039)      (0.038)      (0.039)
             [t = 1.46]   [t = 1.37]   [t = 1.64]
N             314          314          314
Dependent variable is a dummy variable indicating re-arrest.
Standard errors in parentheses. * indicates significance at p < 0.05

The first column shows that OLS estimates a 7 percentage point decrease in probability

of a re-arrest later. The independent variable was whether or not someone was actually

arrested. This group includes people who were randomly assigned to be arrested and people

who were in the no-arrest assigned treatment group but were arrested anyway. We worry

about bias using this variable because we suspect that the bad eggs were more likely to get

arrested.10
10 The OLS model reported here is still based on partially randomized data, because many who were arrested were
arrested due to the randomization in the police protocol. If we had purely observational data with no randomization,
the bias of OLS would be worse as only those who were bad eggs would likely be arrested.


The second column shows that ITT estimates that being assigned to the arrest treatment

lowers the probability of being arrested later by 10.8 percentage points. This result is

more negative than the OLS estimate and is statistically significant. The ITT model avoids

endogeneity because treatment assignment cannot be correlated with the error term. The

approach will understate the true effect when there was non-compliance, either because some people not assigned to the treatment got it or because not everyone who was assigned to the treatment actually received it.

The third column shows the 2SLS results. In this model, the independent variable is

the fitted value of the treatment. The estimated coefficient on arrest is even more negative

than the ITT estimate, indicating that the probability of re-arrest for individuals who were

arrested is 14 percentage points lower than for individuals who were not initially arrested.

The magnitude is double the effect estimated by OLS. This result implies that the city of

Minneapolis can, on average, reduce the probability of another incident by 14 percentage

points by arresting individuals on the initial call. 2SLS is the best model because it accounts

for non-compliance and provides an unbiased estimate of the effect that arresting someone

initially had on likelihood of a future arrest.
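As a sketch of how the three columns of Table 10.7 could be produced in R: the data below are simulated stand-ins (the Minneapolis data are not reproduced here), and all variable names are hypothetical. OLS uses arrest delivered, ITT uses arrest assigned, and 2SLS instruments arrest delivered with arrest assigned while keeping the weapon and drugs controls in both parts of the formula.

library(AER)                                                # for ivreg()

set.seed(2014)
n          <- 314
badegg     <- rnorm(n)                                      # unobserved "bad egg" trait
arrest_asg <- rbinom(n, 1, 0.5)                             # arrest assigned by the colored note pad (Z)
arrest_del <- ifelse(arrest_asg == 1 | badegg > 1.5, 1, 0)  # some no-arrest-assigned cases arrested anyway
weapon     <- rbinom(n, 1, 0.3)
drugs      <- rbinom(n, 1, 0.3)
rearrest   <- rbinom(n, 1, plogis(-1 - 0.5 * arrest_del + 0.8 * badegg))
dv <- data.frame(rearrest, arrest_asg, arrest_del, weapon, drugs)

ols  <- lm(rearrest ~ arrest_del + weapon + drugs, data = dv)   # arrest delivered (T)
itt  <- lm(rearrest ~ arrest_asg + weapon + drugs, data = dv)   # arrest assigned (Z)
tsls <- ivreg(rearrest ~ arrest_del + weapon + drugs | arrest_asg + weapon + drugs, data = dv)
summary(ols); summary(itt); summary(tsls)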

This study was quite influential and spawned other similar studies elsewhere; see Berk,

Campbell, Klap, and Western (1992) for more details.


10.4 Attrition

Another challenge for experiments is attrition. Attrition occurs when people drop out of

an experiment altogether such that we do not observe the dependent variable for them.

Attrition can happen when experimental subjects become frustrated with the experiment

and discontinue participation or when they are too busy to respond, move away, or even

pass away. Attrition can occur for both treatment and control groups.

In this section we explain how attrition can infect even randomized experiments with

endogeneity, show how to detect problematic attrition, and describe three ways to counteract

the effects of attrition.

Attrition and endogeneity

Attrition opens a back door for endogeneity to enter our experiments when it is non-random.

Suppose we randomly give some people free donuts. If some of them eat so many donuts

that they can’t rise from the couch to answer the experimenter’s phone calls, we no longer

have data for these folks. This is a problem, because these observations would be of people

who got lots of donuts and had a pretty bad health outcome. Losing these observations will

make donuts look less bad and thereby bias our conclusions.

Attrition is real. In the New York City school choice experiment discussed earlier, re-

searchers intended to track test scores of students in the treatment and control groups over

time. For a surprising number of students, such tracking was not possible. Some moved


away, some were absent on test days, and some probably got lost in the computer system.

And attrition can be non-random. In the New York school choice experiment, 67 percent

of African-American students in the treatment group took the test in year two of the experi-

ment, while only 55 percent of African-American students in the control group took the test

in year two. We should wonder if these groups are comparable and should worry that any

test differentials discovered could be due to differential attrition rather than to the effects of

private schooling.

Detecting problematic attrition

Detecting problematic attrition is therefore an important part of any experimental analysis.

First, we should assess whether attrition was related to treatment. Commonsensically, we

can simply look at attrition rates in treatment and control groups. Statistically, we could

estimate the following model:

Attrition_i = δ_0 + δ_1 Treatment_i + ν_i        (10.8)

where Attrition_i equals 1 for observations for which we do not observe the dependent variable

and equals 0 for observations for which we observe the dependent variable. A statistically

significant δ̂_1 would indicate differential attrition across treatment and control groups.

We can add some nuance to our evaluation of attrition by looking for differential attrition

patterns in the treatment and control groups. For example, we can interact the treatment variable with one or more covariates in a model explaining attrition. For


example, when analyzing a randomized charter school experiment we could explore whether

high test scores in earlier years were associated with differential attrition in the treatment

group. Using the tools we discussed in Section 6.4 for interaction variables, the model would

be

Attritioni = δ0 + δ1 Treatmenti + δ2 EarlyTestScoresi + δ3 Treatmenti × EarlyTestScoresi + νi

where EarlyTestScoresi is the pre-experimental test score of student i. If δ3 is not equal to

zero, then the treatment would appear to have had a differential effect on kids who were good

students in the pre-experimental period. Perhaps kids with high test scores were really likely

to stick around in the treated group (which means they attended charter schools), while the

good students in the control group (who did not attend a charter school) were less likely

to stick around (perhaps moving to a different school district and thereby making their test

scores for the period after the experiment had run unavailable). In this situation, treated

and control groups would differ on the early test score measure, something that should show

up in a balance test limited to those who remained in the sample.
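
A sketch of this interaction check in R, again with simulated data and hypothetical variable names (early_scores stands in for EarlyTestScores):

  set.seed(123)
  n <- 1000
  treatment    <- rbinom(n, 1, 0.5)
  early_scores <- rnorm(n)
  # Simulate attrition that depends on treatment interacted with early scores
  attrition <- rbinom(n, 1, plogis(-2 + 0.5 * treatment - 0.8 * treatment * early_scores))

  # treatment * early_scores expands to both main effects plus the interaction;
  # the coefficient on the interaction term is the delta_3 of interest
  summary(lm(attrition ~ treatment * early_scores))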

Dealing with attrition

There is no magic bullet to zap attrition, but there are three strategies that can prove

useful. The first is simply to control for variables associated with attrition in the final

analysis. Suppose we found that kids with higher pre-treatment test scores were more likely

to stay in the experiment. We would be wise to control for pre-treatment test scores with

multivariate OLS. However, this strategy cannot counter unmeasured sources of attrition

that could be correlated with treatment status and post-treatment test scores.

A second approach to dealing with attrition is to use a trimmed data set to make the

groups more plausibly comparable. A trimmed data set is one for which certain observations

are removed in order to offset potential bias due to attrition. Suppose we observe 10%

attrition in the treated group and 5% attrition in the control group. We should worry that

weak students were dropping out of the treatment group, making the comparison between

treated and untreated unfair because the treated group may have shed some of its weakest

students. A statistically conservative approach here would be to trim the control group by

removing another 5% of the weakest students before doing our analysis so that now both

groups in the data have 10% attrition rates. This practice is statistically conservative in the

sense that it makes it harder to observe a statistically significant treatment effect because

it is unlikely that literally all of those who dropped out from the treatment group were the

worst students.
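
The trimming logic can be sketched in R as follows. The data are simulated and the variable names are hypothetical; the key step is dropping the weakest control-group observations until the two groups have matching effective attrition rates. The sketch assumes, as in the example above, that attrition is higher in the treated group.

  set.seed(123)
  n  <- 1000
  df <- data.frame(treatment = rbinom(n, 1, 0.5), score = rnorm(n))
  df$attrited <- rbinom(n, 1, ifelse(df$treatment == 1, 0.10, 0.05))
  observed <- subset(df, attrited == 0)   # the observations we actually see

  # Extra share of the control group to drop so attrition rates match
  extra   <- mean(df$attrited[df$treatment == 1]) - mean(df$attrited[df$treatment == 0])
  control <- subset(observed, treatment == 0)
  cutoff  <- quantile(control$score, probs = extra)   # threshold below which the weakest 'extra' share falls

  trimmed <- rbind(subset(observed, treatment == 1),
                   subset(control, score > cutoff))
  summary(lm(score ~ treatment, data = trimmed))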

A third approach to dealing with attrition is to use selection models. The most famous

selection model is the Heckman selection model (Heckman 1979). In this approach, we would

model both the process of being observed (which is a dichotomous variable equaling 1 for

those for whom we observe the dependent variable and 0 for others) and the outcome (the

model with the dependent variable of interest, such as test scores). These models build on

the probit model we discuss in Chapter 12. More details are in the Further Reading section

at the end of this chapter.
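
Dedicated R packages implement full selection models, but the two-step logic behind Heckman's estimator can be sketched by hand in base R. The code below uses simulated data and hypothetical variable names; it is meant only to illustrate the mechanics of modeling who is observed and then correcting the outcome equation, not to substitute for a carefully specified selection model (which, among other things, benefits from a variable that affects selection but not the outcome).

  set.seed(123)
  n <- 1000
  treatment <- rbinom(n, 1, 0.5)
  ability   <- rnorm(n)
  observed  <- rbinom(n, 1, pnorm(0.5 + 0.3 * treatment + 0.7 * ability))  # stayed in the sample?
  score     <- 1 + 0.3 * treatment + 0.8 * ability + rnorm(n)              # outcome of interest

  # Step 1: probit model of being observed
  probit <- glm(observed ~ treatment, family = binomial(link = "probit"))
  xb  <- predict(probit, type = "link")
  imr <- dnorm(xb) / pnorm(xb)   # inverse Mills ratio

  # Step 2: outcome model on the observed cases, adding the inverse Mills ratio
  summary(lm(score ~ treatment + imr, subset = observed == 1))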

Remember This
1. Attrition occurs when individuals drop out of an experiment, causing us not to
have outcome data for them.
2. Non-random attrition can cause endogeneity even when treatment is randomly
assigned.
3. We can detect problematic attrition by looking for differences in attrition rates
across treated and control groups.
4. Attrition can be addressed by using multivariate OLS, trimmed data sets, or
selection models.

Case Study: Health Insurance and Attrition

In the United States, health care consumes about one sixth of the entire economy and is the single biggest determinant of future budget deficits.

Figuring out how to deliver high quality care

more efficiently is therefore one of the biggest policy questions we face. One option that

gets a lot of interest is to change the way we pay for health care. We could, for example,

make consumers pay more for medical care so that they use only what they really need.

In such an approach, health insurance would cover the really big catastrophic items (think

heart-replacement), but would cover less of the more mundane potentially avoidable items

(think flu visits).

To know if such an approach will work we need to answer two questions. First, are health

care outcomes the same or better under such a system? It’s not much of a reform if it simply

saves money by making us sicker. Second, do medical expenditures go down when people

have to pay more for medical services?

Because there are many health insurance plans on the private market, we could imagine

using observational data to answer these questions. We could see if people on relatively

stingy health insurance plans (that pay only for very major costs) are as healthy as others

while spending less on health care.

Such an approach really wouldn’t be very useful though. Why? You guessed it:

Insurance is endogenous; those who expect to demand more services have a clear

incentive to obtain more complete insurance, either by selecting a more generous

option at the place of employment, by working for an employer with a generous

insurance plan, or by purchasing privately more generous coverage (Manning et

al., 1987, 252).

In other words, because sick people probably seek out better health care coverage, a non-

experimental analysis of health coverage and costs would likely show that health care costs

more for those with better coverage. That wouldn’t mean the generous coverage caused costs

to go up; such a relationship could simply be the endogeneity talking.

Suppose we don’t have a good measure of whether someone has diabetes. We would

expect that people with diabetes seek out generous coverage because they expect to use the

doctor a lot. The result would be a correlation between the error term and type of health

plan such that we would see people in the generous health plan having lower health outcomes

(because of all those people with diabetes who signed up for the generous plan). Or maybe

insurance companies figure out a way to measure whether people have diabetes and not let

them into generous insurance plans, which would mean the people in the generous plans

would be healthier than others. Here too the diabetes in the error term would be correlated

with the type of health plan, although in the other direction.

Thus we have a good candidate for a randomized experiment, which is exactly what

ambitious researchers at RAND did in the 1970s. They randomly assigned people to various

health plans including a free plan that covered medical care at no cost and various cost-

sharing plans that had different levels of co-payments. With randomization, the type of

people assigned to a free plan should be expected to be the same as the type of people assigned

to a cost-sharing plan. The only expected difference between the groups is their health plans

and hence to the extent that the groups differed in utilization or health outcomes, the

differences could be attributed to differences in the health plans.

The RAND researchers found that medical expenses were 45 percent higher for people in

plans with no out-of-pocket medical expenses compared to those who had stingy insurance

plans (which required people to pay 95 percent of costs, up to a $1,000 yearly maximum).

In general, health outcomes were no worse for those in the stingy plans.11 This experiment has been incredibly influential – it is the reason we pay $10 or whatever when we check out of the doctor’s office.

11 Outcomes for people in the stingy plans were worse for some subgroups and some conditions, leading the researchers to suggest programs targeted at specific conditions rather than providing fee-free service for all health care.

Attrition is a crucial issue in evaluating the RAND experiment. Not everyone stayed in the experiment. Some people may have moved, some may have died, and others may have been unhappy with the plan they were randomly placed in and opted out of the experiment. The threat to the validity of this experiment is that this attrition may have been non-random. If the type of people who stayed with one plan were systematically different than the type of

people who stayed with another plan, comparing health outcomes or utilization rates across

these groups may be inappropriate because the groups differ both in their health plans and

in the type of people left in them.

Aron-Dine, Einav, and Finkelstein (2013) reexamined the data in light of attrition and

other concerns. They show that the free plan had 1,894 people randomly assigned to it. Of

those, 114 (5 percent) were non-compliers who declined to participate. Of the remainder who

participated, 89 (5 percent) attrited by leaving the experiment. These low numbers are not

very surprising. The free plan was gold-plated, covering everything. The cost-sharing plan

requiring the highest out-of-pocket expenditures had 1,121 people assigned to it. Of these,

269 (24 percent) declined to participate and another 145 (17 percent) left the experiment.

These patterns contrast markedly with the free plan non-compliance and attrition patterns.

What kind of people would we expect to leave a cost-sharing plan? Probably the kind

of people who ended up paying a lot of money under the plan. And what kind of people

would end up paying a lot of money under a cost-sharing plan? Sick people, most likely. So

that means we have reason to worry that the free plan had all kinds of people, but that the

cost-sharing plans had a sizeable hunk of sick people who pulled out. So any finding that

the cost-sharing plans yielded the same health outcomes could be due either to the plans

not having different health impacts or to the free plan being better, but having a sicker

population.

Aron-Dine, Einav, and Finkelstein (2013) therefore conducted an analysis on a trimmed

data set based on techniques from Lee (2009). They dropped the highest spenders in the

free care plan until they had a data set with the same proportion of observations from those

assigned to the free plan and to the costly plan. Comparing these two groups is equivalent

to assuming that those who left the costly plan were the most expensive patients; since

this is unlikely to be completely true, the results from such a comparison are considered

a lower bound as actual differences between the groups would be larger if some of the

people who dropped out from the costly plan were not among the most expensive patients.

The results indicated that the effect of the cost-sharing plan was still negative, meaning it

lowered expenditures. However, the magnitude of the effect was substantially less than the

magnitude reported in the initial study that did little to account for differential attrition

across the various types of plans.

10.5 Natural Experiments

As a practical matter, experiments cannot cover every research question. As we discussed

in Section 1.3, experiments are often infeasible, unethical, or unaffordable.

Sometimes, however, an experiment may fall into our laps. That is, we might find that

the world has essentially already run a natural experiment that pretty much looks like

a randomized experiment, but one that we didn’t have to muck about with actually im-

plementing. A natural experiment occurs when a researcher identifies a situation in which

the values of the independent variable have been determined by a random, or at least an

exogenous, process.

In this section we discuss some of the clever ways researchers have been able to use natural

experiments to answer interesting research questions.

In an ideal natural experiment an independent variable is exogenously determined, leaving

us with treatment and control groups that look pretty much like they would look if we had

intentionally designed a random experiment. One example is in elections. In 2010, a hapless

candidate named Alvin Greene won the South Carolina primary election to run for the

U.S. Senate. Greene had done no campaigning and was not exactly an obvious Senatorial

candidate: He had been involuntarily discharged from both the Army and the Air Force and

had been unemployed since leaving the military. Oh yes, he was also under indictment for

showing pornographic pictures to a college student. Yet he won 59 percent in a statewide

primary against a former state legislator. While some wondered if something nefarious was

going on, many pointed to a more mundane possibility: When voters don’t know much

about candidates, they might pick the first name they see. Greene was first on the ballot

and perhaps that’s why he did so well.12

12 Greene went on to get only 28 percent of the vote in the general election (a dismal outcome, although if you are a glass-half-full type you would note that 364,598 South Carolinians voted for him). After the defeat Greene turned his sights to the presidency, saying “I’m the next president. I’ll be 35 just before November, so I was born to be president. I’m the man. I’m the man. I’m the man. Greene’s the man. I’m the man. I’m the greatest person ever. I was born to be president. I’m the man, I’m the greatest individual ever” (Shiner 2010; see also Khimm 2010).

An experimental test of this proposition would involve randomly rotating the ballot order of candidates and seeing if candidates do better when they appear first on the ballot. Con-

ceptually that’s not too hard, but practically it is a lot to ask given that election officials

are pretty protective of how they run elections. In the 1998 Democratic primary in New

York City, however, election officials decided on their own to rotate the order of candidates’

names by precinct. Political scientists Jonathan Koppell and Jennifer Steen got wind of this

decision and analyzed the election as a natural experiment. Their 2004 paper found that

candidates received more votes in precincts where they were listed first in 71 of 79 races. In

seven of those races the differences were enough to determine the election outcome. That’s

pretty good work for an experiment they didn’t even set up.

Researchers have found other clever opportunities for natural experiments. An impor-

tant question is whether economic stimulus packages of tax cuts and government spending

increases that were implemented in response to the 2008 recession boosted growth. At a

first glance, such analysis should be easy. We know how much the federal government cut

taxes and increased spending. We also know how the economy performed. Of course, things

are not so simple because, as former Chair of the Council of Economic Advisers Christina

Romer (2011) noted, “Fiscal actions are often taken in response to other things happening

in the economy.” When looking at the relationship between two variables, like consumer

spending and the tax rebate, we “need to worry that a third variable, like the fall in wealth,

is influencing both of them. Failing to take account of this omitted variable leads to a biased

estimate of the relationship of interest.”

One way to deal with this challenge is to find exogenous variation in stimulus spending that

is not correlated with any of the omitted variables we worry about. It is typically very hard

to do so, but sometimes natural experiments pop up. For example, Parker et al. (2011) noted

that the 2008 stimulus consisted of tax rebate checks that were sent out in stages according to

the last two digits of people’s Social Security numbers. That means the timing was effectively

random for each family. After all, the last two digits are essentially randomly assigned to

people when they are born. This means, in turn, that the timing of the government spending

by family was exogenous. An analyst’s dream come true! The researchers found that family

spending among those that got a check was almost $500 higher than among those who did not,

bolstering the case that the fiscal stimulus boosted consumer spending.

Remember This
1. A natural experiment occurs when the values of the independent variable have
been determined by a random, or at least an exogenous, process.
2. Natural experiments are widely used and can be analyzed with OLS, 2SLS, or
other tools.

Case Study: Crime and Terror Alerts

One need not have true randomness for a natural

experiment. One only needs exogeneity, some-

thing quite different from randomness, as the fol-

lowing example makes clear. It is about the effect

of police on crime. As discussed earlier on page

371, using observational data to estimate the following model

Crime_st = β0 + β1 Police_st + ε_st        (10.9)

will likely suffer from endogeneity and risks statistical catastrophe.

Could we use experiments to test the relationship? Sure, all we need to do is head down

to the police station and tell them to assign officers randomly to different places. The idea is

not completely crazy and, frankly, it is the kind of thing police should consider doing. This

idea is not an easy sell, though. Can you imagine the outrage if a crime occurred in an area

that randomly had a low number of officers?

Economists Jonathan Klick and Alexander Tabarrok identified in 2005 a clever natural

experiment that looks much like the randomized experiment we proposed. They noticed that

Washington DC deployed more police when the terror alert level was high. The high terror

alert was not random; presumably there was some cause somewhere prompting the terror

alert. It was exogenous, though. Whatever leads terrorists to threaten carnage, it was not

something that was associated with factors that lead local criminals in Washington DC to

stick up a liquor store. In other words, it was highly unlikely that terror alerts are correlated

with the things in the error term causing endogeneity that we identified above. It was as

if someone had designed a study in which extra police would be deployed at random times,

only in this case the “random” times were essentially selected by terrorist suspects with no

information about crime in Washington, DC rather than by a computerized random number

generator as they typically would be in an academic experiment.

Klick and Tabarrok therefore assessed whether crime declined when the terror alert level

was high. Table 10.8 reports their main results. They found that crimes decreased when

the terror alert level went up. They also controlled for subway ridership in order to account

for the possibility that more people (and tourists in particular) around could make for more

targets for crime. The effect of the high terror alerts was still negative. Because this variable

was exogenous to crime in Washington and could, they argued, affect crime only by means

of the increased police presence, they argued their result provided pretty good evidence that

police can reduce crime. They used ordinary least squares, but the tools of analysis were

really less important than the vision of finding something that caused exogenous changes to

police deployment and then tracking changes in crime. Again, this is a pretty good day’s

work for an experiment they didn’t run.

Table 10.8: Effect of Terror Alerts on Crime

High terror alert       -7.32*          -6.05*
                        (2.88)          (2.54)
                        [t = 2.54]      [t = 2.38]
Subway ridership                        17.34*
                                        (5.31)
                                        [t = 3.27]
N                        506             506

Dependent variable is total number of crimes in Washington, DC from March 12 to July 30, 2003. Standard errors in parentheses.
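
A sketch of how a specification like the one in Table 10.8 could be estimated in R. The data below are simulated stand-ins, not Klick and Tabarrok's actual data, and the variable names are hypothetical.

  set.seed(123)
  days      <- 506
  alert     <- rbinom(days, 1, 0.25)               # high terror alert indicator
  ridership <- rnorm(days, mean = 100, sd = 10)    # subway ridership control
  crime     <- 30 - 6 * alert + 0.2 * ridership + rnorm(days, sd = 5)

  summary(lm(crime ~ alert))              # first column: alert indicator only
  summary(lm(crime ~ alert + ridership))  # second column: adding the ridership control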

10.6 Conclusion

Experiments are incredibly promising for statistical inference. If we want to know if X causes

Y , do an experiment. Change X for a random subset of people. Compare what happens to

Y for the treatment and control groups. The approach is simple, elegant, and has been used

productively countless times.

For all their promise, though, experiments are like a movie star. Even though many

people idealize them, they lose a bit of luster in real life. Movie stars’ teeth are a bit yellow

and they aren’t as witty without a script. Experiments don’t always achieve balance, can suffer

from non-compliance and attrition, and in many circumstances are not feasible, ethical, or

generalizable.

For these reasons we need to take particular care when examining experiments. We need to

diagnose and, if necessary, respond to ABC issues. Every experiment needs to assess balance

to ensure that the treatment and control groups do not differ systematically except for the

treatment. Many social science experiments also have potential non-compliance problems if

people can choose not to experience the randomly assigned treatment. Non-compliance can

induce endogeneity if we use treatment-delivered as the independent variable, but we can

get back to unbiased inference if we use ITT or 2SLS to analyze the experiment. Finally,

attrition occurs when people leave the experiment, which can be a problem if the attrition

is related to the treatment. Attrition is hard to solve, but it must be diagnosed and, if it

is a problem, we should at least use multivariate OLS or trimmed data to ameliorate the

problem.

The following steps provide a general guide to implementing and analyzing a randomized

experiment.

1. Identify a target population.

2. Randomly pick a subset of the population and give them the treatment. The rest are

the control group.

3. Diagnose possible threats to internal validity.

(a) Assess balance with difference of means tests for all possible independent variables.

(b) Assess compliance by looking at what percent of those assigned to treatment actu-

ally experienced it.

(c) Assess non-random attrition by looking for differences in observation patterns across

treatment and control groups.

4. Gather data on the outcome variable Y and assess differences between treated and

control groups.

(a) If there is perfect balance and compliance and no attrition, use bivariate OLS.

Multivariate OLS also will be appropriate and will provide more precise estimates.

(b) If there are imbalances, use multivariate OLS, controlling for variables that are

unbalanced across treatment and control groups.

(c) If there is imperfect compliance, use intention-to-treat analysis and 2SLS.

(d) If there is attrition, use multivariate OLS, trim the data, or use a selection model.

We are on track to understand social science experiments if we can

• Section 10.1: Explain how to assess whether randomization was successful with balanc-

ing tests.

• Section 10.2: Explain how imperfect compliance can create endogeneity. What is the

ITT approach and how does it avoid conflating treatment effects and non-compliance

effects? How do ITT estimates relate to the actual treatment effects?

• Section 10.3: Explain how 2SLS can be useful for experiments with imperfect compli-

ance.

• Section 10.4: Explain how attrition can create endogeneity. What are some steps we

can take to diagnose and deal with attrition?

• Section 10.5: Explain natural experiments.

Further Reading

Experiments are booming in the social sciences. Gerber and Green (2012) provide a compre-

hensive guide to field experiments. Banerjee and Duflo (2011) is an excellent introduction to

experiments in the developing world and Duflo, Glennerster, and Kremer (2008) provides an

experimental toolkit useful for experiments in the developing world and beyond. Dunning

(2012) is a detailed guide to natural experiments. Manzi (2012) is a readable guide to and

critique of randomized experiments in social science and business. On page 190 he refers to

a report to Congress in 2008 that identified policies that demonstrated significant results in

randomized field trials.

Attrition is one of the harder things to deal with and different analysts take different

approaches. Gerber and Green (2012, 214) discuss their approaches to dealing with at-

trition. There is a large literature on selection models; see, for example, Das, Newey, and

Vella (2003). Some experimentalists resist using selection models because those models rely

heavily on assumptions about the distributions of error terms and functional form.

Imai, King, and Stuart (2008) discuss how to use blocking to get more efficiency and less

potential for bias in randomized experiments.

Key Terms
• ABC issues (481)
• Attrition (515)
• Balance (484)
• Blocking (482)
• Compliance (491)
• Intention-to-treat analysis (496)
• Natural experiment (524)
• Selection model (518)
• Trimmed data set (518)

Computing Corner

Stata

• To assess balance, estimate a series of bivariate regression models with all “X” variables
as dependent variables and treatment assignment as independent variables:
reg X1 TreatmentAssignment
reg X2 TreatmentAssignment
• To estimate an ITT model, estimate a model with the outcome of interest as the de-
pendent variable and treatment assignment as the main independent variable. Other
variables can be included, especially if there are balance problems.
reg Y TreatmentAssignment X1 X2
• To estimate a 2SLS model, estimate a model with the outcome of interest as the de-
pendent variable and treatment assignment as an instrument for treatment-delivered.
Other variables can be included, especially if there are balance problems.
ivregress 2sls Y X1 X2 (Treatment = TreatmentAssignment)

R

• To assess balance, estimate a series of bivariate regression models with all “X” variables
as dependent variables and treatment assignment as independent variables:
lm(X1 ~ TreatmentAssignment)
lm(X2 ~ TreatmentAssignment)
• To estimate an ITT model, estimate a model with the outcome of interest as the de-
pendent variable and treatment assignment as the main independent variable. Other
variables can be included, especially if there are balance problems.
lm(Y ~ TreatmentAssignment + X1 + X2)
• To estimate a 2SLS model, estimate a model with the outcome of interest as the de-
pendent variable and treatment assignment as an instrument for treatment-delivered.
Other variables can be included, especially if there are balance problems. As discussed
on page 471 we’ll use the ivreg command from the AER library:
library(AER)
ivreg(Y ~ Treatment + X2 + X3 | TreatmentAssignment + X2 + X3)

Exercises
1. In an effort to better understand the effects of get-out-the-vote messages on voter
turnout, Gerber and Green (2005) conducted a randomized field experiment involv-
ing approximately 30,000 individuals in New Haven, Connecticut in 1998. One of the
experimental treatments was in-person visits. These were randomly assigned and in-
volved a volunteer visiting the person’s home and encouraging him or her to vote. The
file GerberGreenData.dta contains the variables described in Table 10.9.
Table 10.9: Variables for Get-out-the-vote Experiment from Gerber and Green (2005)

Variable Description
Voted Voted in the 1998 election (voted = 1)
ContactAssigned Assigned to in-person contact (assigned = 1)
ContactObserved Actually contacted via in-person visit (treated = 1)
Ward Ward number
PeopleHH Household size

a. Estimate a bivariate model of the effect of actual contact on voting. Is the model
biased? Why or why not?

b. Estimate compliance by estimating what percentage of treatment assigned people


actually were contacted.
c. Use ITT to estimate the effect of being assigned treatment on turning out to vote.
Is this estimate likely higher or lower than the actual effect of being contacted? Is it
subject to endogeneity?
d. Use 2SLS to estimate the effect of contact on voting. Compare the results to the
ITT results. Justify your choice of instrument.
e. We can use ITT results and compliance rates to generate a Wald estimator which is
an estimate of the treatment effects. It is calculated by dividing the ITT effect by the
coefficient on the treatment-assignment variable in the first stage model of the 2SLS
model. (If no one in the non-treatment-assignment group gets the treatment, this
coefficient will indicate the compliance rate; more generally, this coefficient indicates
the net effect that treatment assignment has on probability of treatment observed.)
Calculate this quantity using the results in part (b) and (c) (it helps to be as precise
as possible) and compare to the 2SLS results. Are they different? Discuss.
f. Create dummy variables indicating whether the respondent lived in ward 2 and ward
3. Assess balance for wards 2 and 3 and also for the people per household variables.
Is imbalance a problem? Why or why not? Is there anything we should do about it?
g. Estimate a 2SLS model including controls for ward 2 and ward 3 residence and the
number of people in the household. Do you expect the results to differ substantially?
Why or why not? Explain how the first-stage results differ from the balance tests
above.
2. In Chapter 9 on page 473 we considered an experiment in which people were assigned
to a treatment group that was encouraged to watch a television program on affirmative
action. We will revisit that analysis, paying attention to experimental challenges.
a. Check balance in treatment versus control for all possible independent variables.
b. What percent of those assigned to the treatment group actually watched the pro-
gram? How is your answer relevant for the analysis?
c. Are the compliers different from the non-compliers? Provide evidence to support
your answer.
d. In the first round of the experiment, 805 participants were interviewed and assigned to
either the treatment or control condition. After the program aired, 507 participants
were re-interviewed about the program. With only 63 percent of the participants
re-interviewed, what problems are created for the experiment?
e. In this case data (even pre-treatment data) is only available for the 507 people who
did not leave the sample. Is there anything we can do?

f. We estimated a 2SLS model earlier on page 473. Calculate a Wald estimator. It is


calculated by dividing the ITT effect by the coefficient on the treatment-assignment
variable in the first stage model of the 2SLS model. (If no one in the non-treatment-
assignment group gets the treatment, this coefficient will indicate the compliance
rate; more generally, this coefficient indicates the net effect that treatment assignment
has on probability of treatment observed.) Control in all models for measures of
political interest, newspaper reading, and education. Compare the results for the
effect of watching the program to OLS (using actual treatment) and 2SLS estimates.
3. In their paper “Are Emily and Greg More Employable than Lakisha and Jamal? A
Field Experiment on Labor Market Discrimination,” Marianne Bertrand and Sendhil
Mullainathan discuss the results of their field experiment on randomizing names on
job resumes. They wanted to assess whether employers treated African-American and
White applicants similarly. They created fictitious resumes and randomly assigned
White-sounding names (e.g., Emily and Greg) to half of the resumes and African-
American-sounding names (e.g., Lakisha and Jamal) to the other half of the resumes.
They then sent these resumes in response to help-wanted ads in Chicago and Boston
and collected data on the number of callbacks received.
Table 10.10 describes the variables in the data set resume HW.dta.
Table 10.10: Variables for Resume Experiment

Variable Description
education 0 = not reported; 1 = some high school; 2 = high school graduate;
3 = some college; 4 = college graduate or more
yearsexp Number of years of work experience
honors 1 = Resume mentions some honors
volunteer 1 = Resume mentions some volunteering experience
military 1 = Applicant has some military experience
computerskills 1 = Resume mentions computer skills
afn american 1 = African-American sounding name; 0 = White sounding name
call 1 = applicant was called back; 0 = applicant not called back
female 1 = female; 0 = male
h quality 1 = High quality resume; 0 = low quality resume

a. What would be the concern of looking at the number of callbacks by race from an
observational study?
b. Check balance between the two groups (resumes with African-American-sounding
names and resumes with White sounding names) on the following variables: edu-
cation, years of experience, volunteering experience, honors, computer skills, and

gender. The treatment is whether the resume had an African-American name or not
as indicated by the variable afn american.
c. What would compliance be in the context of this experiment? Is there a potential
non-compliance problem?
d. What variables do we need to use 2SLS to deal with non-compliance?
e. Calculate the intention-to-treat (ITT) for receiving a callback from the resumes. The
variable call is coded 1 if a person received a callback and 0 otherwise. Use OLS
with call as the dependent variable.
f. We’re going to add covariates shortly. Discuss the implications of adding covariates
to this analysis of a randomized experiment.
g. Re-run the analysis from part (e) with controls for education, years of experience,
volunteering experience, honors, computer skills and gender. Report the results and
briefly describe the effect of having an African-American sounding name and if/how
the estimated effect changed from the earlier results.
h. The authors were also interested to see if race had a differential effect for high quality
resumes and low quality resumes. They created a variable h quality that indicated
a high quality resume based on labor market experience, career profile, existence of
gaps in employment, and skills. Using the controls from part (g) plus the high quality
indicator variable, estimate the effect of having an African-American sounding name
for high quality and low quality resumes.
4. Improving education in Afghanistan may be key to bringing development and stability to
the country. Only 37 percent of primary school-age children in Afghanistan attend
school, and there is a large gender gap in enrollment (with girls 17 percentage points
less likely to attend school). Traditional schools in Afghanistan serve children from
numerous villages. Some believe that creating more village-based schools can increase
enrollment and students' performance by bringing education closer to home. To assess
this belief, researchers Dana Burde and Leigh Linden (2013) conducted a randomized
experiment to test the effects of adding village-based schools. For a sample of 12 equally
sized village groups, they randomly selected 5 groups to receive a village-based school.
One of the original village groups could not be surveyed and was dropped, resulting
in 11 village-groups with 5 treatment villages in which a new school was built and 6
control villages in which no new school was built.
This question focuses on the treatment effects for the fall 2007 semester, which was after
the schools had been provided. There were 1,490 children across the treatment and con-
trol villages. Table 10.11 displays the variables in the data set schools experiment HW.dta.

Table 10.11: Variables for Afghan School Experiment

Variable Description
formal school Enrolled in school
testscores Fall test scores (normalized); tests were to be given
to all children whether in school or not
treatment Assigned to village-based school=1; otherwise=0
age Age of child
girl Girl = 1; Boy = 0
sheep Number of sheep owned
duration village Duration family has lived in village
farmer Farmer = 1
education head Years of education of head of household
number ppl hh Number of people living in household
distance nearest school Distance to nearest school
f07 test observed Equals 1 if test was observed for fall 2007
Clustercode Village code
f07 hh id Household ID

a. What are the issues with studying the effects of new schools in Afghanistan that are
not randomly assigned?
b. Why is checking balance an important first step in analyzing a randomized experi-
ment?
c. Did randomization work? Check the balance of the following variables: age of child,
girl, number of sheep family owns, length of time family lived in village, farmer,
years of education for household head, number of people in household, and distance
to nearest school.
d. On page 104 we discussed the fact that if errors are correlated, then the standard
OLS estimates for the standard error of β̂ are incorrect. In this case, we might expect
errors to be correlated within village. That is, knowing the error for one child in a
given village may provide us some information about the error for another child in
the same village. The way to generate standard errors that account for correlated
errors within some unit is to use the , cluster(ClusterName) command at the end
of Stata’s regression command. In this case, the cluster is the village, as indicated
with the variable clustercode. Re-do the balance tests from part (c) with clustered
standard errors. Do the coefficients change? Do the standard errors change? Do our
conclusions change?
e. Calculate the effect of being in a treatment village on fall enrollment. Use OLS and
report the fitted value of the school attendance variable for control and treatment
villages, respectively.

f. Calculate the effect of being in a treatment village on fall enrollment while controlling
for age of child, girl, number of sheep family owns, length of time family lived in
village, farmer, years of education for household head, number of people in household,
and distance to nearest school. Use the standard errors that account for within-village
correlation of errors. Is the coefficient on treatment substantially different from the
bivariate OLS results? Why or why not? Briefly note any control variables that are
significantly associated with attending school.
g. Calculate the effect of being in a treatment village on fall test scores. Use the model
that calculates standard errors that account for within-village correlation of errors.
Interpret the results.
h. Calculate the effect of being in a treatment village on test scores while controlling for
age of child, girl, number of sheep family owns, length of time family lived in village,
farmer, years of education for household head, number of people in household, and
distance to nearest school. Use the standard errors that account for within-village
correlation of errors. Is the coefficient on treatment substantially different from the
bivariate OLS results? Why or why not? Briefly note any control variables that are
significantly associated with higher test scores.
i. Compare the sample size for the enrollment and test score data. What concern does
this comparison raise?
j. Assess whether attrition was associated with treatment. Use the standard errors that
account for within-village correlation of errors.

CHAPTER 11

REGRESSION DISCONTINUITY: LOOKING FOR JUMPS IN DATA

So far, we’ve been fighting endogeneity with two

strategies. One is to soak up as much endogeneity

as we can by including control variables or fixed

effects, as we have done with OLS and panel data

models. The other is to create or find exogenous

variation via randomization or instrumental vari-

ables.

In this chapter we offer a third way to fight endogeneity: Looking for discontinuities. A

discontinuity is a point at which a graph suddenly jumps up or down. Potential discontinuities

arise when a treatment is given in a mechanical way to observations above some cutoff. These

jumps indicate the causal effects of treatments under reasonably general conditions.

Suppose, for example, that we want to know whether drinking alcohol causes grades to go

down. An observational study might be fun, but worthless: It’s a pretty good bet that the

kind of people who drink a lot also have other things in their error term that account

for low grades (e.g., lack of interest in school). An experimental study might even be more

fun, but is pretty unlikely to get approved (or finished ...).

But we still have some tricks to get at the effect of drinking. Consider the Air Force

Academy where the drinking age is strictly enforced. Students over 21 are allowed to drink;

students under 21 are not allowed to drink and face expulsion if caught. If we can compare

the performance on final exams of those students who had just turned 21 to those who had

not, we might be able to identify the causal effect of drinking.

Carrell, Hoekstra, and West (2010) did this and Figure 11.1 summarizes their results.

Each circle shows average test score for students grouped by age. The circle on the far left

shows the average test score for students who were 270 days before their 21st birthday when

they took their test. The circle on the far right shows the average test score for students

who turned 21 270 days before their test. In the middle are those who had just turned 21.

We’ve included fit lines to help make the pattern clear. Those who had not yet turned

21 scored higher. There is a discontinuity at the zero point in the figure (corresponding to

students taking a test on their 21st birthday). If we can’t come up with another explanation

for test scores to change at this point, we have pretty good evidence that drinking hurts

grades.

FIGURE 11.1: Drinking Age and Test Scores (vertical axis: normalized grade; horizontal axis: age at final exam, measured in days from 21st birthday)

Regression discontinuity (RD) analysis formalizes this logic. It uses regression anal-

ysis to identify possible discontinuities at the point the treatment applies. For the drinking

age case, RD analysis involves fitting an OLS model that allows us to see if there is a

discontinuity at the point students become legally able to drink.

Regression discontinuity analysis has been used in a variety of contexts where some treat-

ment of interest is determined by a strict cutoff. Card, Dobkin, and Maestas (2009) used

RD to analyze the effect of Medicare on health because Medicare eligibility kicks in the day

someone turns 65. Lee (2008) used RD to study the effect of incumbency on reelection to

Congress because incumbents are decided by whoever gets 50 percent plus one or more of

the vote. Lerman (2009) used RD to assess the effect of being in a high security prison on

inmate aggression because the security level of the prison to which convicts are sent depends

directly on a classification score determined by the state.

When designing research studies, RD can be an excellent option. Standard observational

data may not provide exogeneity. Good instruments are hard to come by. Experiments can

be expensive or infeasible. And even when experiments can work, they can seem unfair or

capricious to policymakers, who may not like allocating some treatment randomly. In RD, the

treatment is assigned according to a rule, which to many people seems more reasonable and

fair than random assignment.

RD models can work when analyzing individuals, states, counties, and other units. We

keep things simple and in this chapter mostly discuss RD as applied to individuals, but

the technique works perfectly well to analyze other units that have treatment assigned by a

cutoff rule of some sort.

In this chapter, we show how to use RD models to estimate causal effects. Section 11.1

presents the core RD model. Section 11.2 then presents ways to more flexibly estimate RD

models. Section 11.3 shows how to limit the data sets and create graphs that are particularly

useful in the RD context. The RD approach is not bullet-proof, though, and Section 11.4

discusses the vulnerabilities of the approach and how to diagnose them.

11.1 Basic Regression Discontinuity Model

In this section we introduce regression discontinuity models by explaining the important

role of the assignment variable in the model. We then translate the regression discontinuity

model into a convenient graphical form and explain the key condition necessary for the model

to produce unbiased results.

The assignment variable in regression discontinuity models

The necessary ingredient in a regression discontinuity model is an assignment variable

that determines whether or not someone receives some treatment. People with values of

the assignment variable above some cutoff receive the treatment; people with values of the

assignment variable less than the cutoff do not receive the treatment.

As long as the only thing that changes at the cutoff is that the person gets the treatment,

then any bump up or down in the dependent variable at the cutoff will reflect the causal

effect of the treatment.

One way to understand why is to look at observations very, very close to the cutoff. The

only difference between those just above and just below the cutoff is the treatment. For

example, Medicare kicks in when someone turns 65. If we compare the health of people one

minute before their 65th birthday to the health of people who turned 65 one minute ago,

we could reasonably believe that the only difference between those two groups is that the

federal government provides health care for some but not others.

That’s a pretty extreme example, though. As a practical matter, we typically don’t have

data on very many people very close to our cutoff. Because statistical precision depends on

sample size (as we discussed on page 233), we typically can’t expect very useful estimates

unless we expand our data set to include observations some degree above and below the

cutoff. For Medicare, for example, perhaps we’ll need to look at people days, weeks, or

months from their 65th birthday to get a reasonable sample size. Thus the treated and

untreated will differ not only in whether they got the treatment but also in the assignment

variable. People 65 years and two months old not only get Medicare, but they are also older

than people two months shy of their 65th birthday. While four months doesn’t seem like

a lot for an individual, health declines with age in the whole population and there will be

people who experience some bad turn during those four months.

Regression discontinuity models therefore control for treatment and the assignment vari-

able. In its most basic form, a regression discontinuity model looks like

Yi = β0 + β1 Ti + β2 (X1i − C) + εi        (11.1)

Ti = 1 if X1i ≥ C
Ti = 0 if X1i < C

where Ti is a dummy variable indicating whether or not person i received the treatment and

X1i − C is our assignment variable which indicates how much above or below the cutoff an

observation is. For reasons we explain below, it is useful to convert our assignment variable

into a variable that indicates how much above or below the cutoff a person was.
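
In practice, the basic RD model is just an OLS regression of the outcome on the treatment dummy and the recentered assignment variable. A minimal sketch in R, with simulated data and a made-up cutoff:

  set.seed(123)
  n  <- 500
  C  <- 0                            # the cutoff
  X1 <- runif(n, -500, 1000)         # assignment variable
  treat <- as.numeric(X1 >= C)       # treatment kicks in at the cutoff
  Y  <- 1500 + 800 * treat + 1.5 * (X1 - C) + rnorm(n, sd = 200)

  # beta_1, the coefficient on treat, estimates the jump at the cutoff
  summary(lm(Y ~ treat + I(X1 - C)))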

Graphical representation of regression discontinuity models

Figure 11.2 displays a scatterplot of data and fitted lines for a typical RD model. This

picture captures the essence of RD. If we understand it, we understand RD models. The

distance to the cutoff variable, X1i − C, is along the horizontal axis. In this particular

example, C = 0, meaning that the eligibility for the treatment kicked in when X1 equalled

zero. Those with X1 above zero got the treatment; those with X1 below zero did not get the

treatment. Starting from the left we see that the dependent variable rises as X1i − C gets

bigger and, whoa, jumps up at the cutoff point (when X1 = 0). This jump at the cutoff,

then, is the estimated causal effect of the treatment.

FIGURE 11.2: Basic Regression Discontinuity Model, Yi = β0 + β1 Ti + β2 (X1i − C) (the line plots the dependent variable Y against the assignment variable X1; the intercept is β̂0, the slope is β̂2, and the jump of β̂1 occurs at the cutoff C)

The parameters in the model are easy to locate in the figure. The most important

parameter is β1, which is the effect of being in the treatment group. This is the bump at the heart of RD analysis. The slope parameter, β2, captures the relationship between the

distance to the cutoff variable and the dependent variable. In this basic version of the RD

model, this slope is the same above and below the cutoff.

Figure 11.3 displays more examples of results from RD models. In panel (a), β1 is positive, just like in Figure 11.2, but β2 is negative, creating a downward slope for the assignment variable. In panel (b), the treatment has no effect, meaning that β1 = 0. Even though everyone above the cutoff received the treatment, there is no discernible discontinuity in the dependent variable at the cutoff point. In panel (c), β1 is negative because there is a bump downward at the cutoff, implying that the treatment lowered the dependent variable.

FIGURE 11.3: Possible Results with Basic RD Model (each panel plots the dependent variable Y against the assignment variable X1, with the cutoff marked)

The key assumption in regression discontinuity models

The key assumption for RD to work is that the error term itself does not jump at the point

of the discontinuity. In other words, we’re assuming that the error term, whatever is in it, is

continuous without any jumps up or down when the assignment variable crosses the cutoff.

We discuss in Section 11.4 how this condition can be violated.

One of the cool things about RD is that even if the error term is correlated with the

assignment variable, the estimated effect of the treatment is still valid. To see why, sup-

pose C = 0 and the error and assignment variable are correlated and we characterize the


correlation as follows:

εi = ρX1i + νi        (11.2)

where the Greek letter ρ (pronounced rho) captures how strongly the error and X1i are related and νi is a random term that is uncorrelated with X1i. For example, in the Medicare

example, mortality is the dependent variable, the treatment T is Medicare (which kicks in

the second someone turns 65), age is the assignment variable and health is in the error term.

It is totally reasonable to believe that health is related to age and we use Equation 11.2 to

characterize such a relationship.

If we estimate a model without controlling for the assignment variable (X1i ) with the

following model

Yi = β0 + β1 Ti + εi

there will be endogeneity because the treatment, T , depends on X1i , which is correlated with

the error. In the Medicare example, if we predict mortality as a function of Medicare only,

the Medicare variable will pick up not only the effect of the program, but also the effect of

health, which is in the error term, which is correlated with age, which is, in turn, correlated

with Medicare.

If we control for X1i, however, the correlation between T and ε disappears. To see why,

we begin with the basic RD model (Equation 11.1 on page 547). For simplicity, we assume

C = 0. Substituting for ε using Equation 11.2 on page 550, yields

Yi = β0 + β1 Ti + β2 X1i + εi
   = β0 + β1 Ti + β2 X1i + ρX1i + νi

We can re-arrange and re-label β2 + ρ as β̃2, producing

Yi = β0 + β1 Ti + (β2 + ρ)X1i + νi
   = β0 + β1 Ti + β̃2 X1i + νi

Notice that we have an equation in which the error term is now νi (the part of Equation 11.2

that is uncorrelated with anything). Hence, the treatment variable, T , in the RD model is

uncorrelated with the error term even though the assignment variable is correlated with the

error term. This means that OLS will provide an unbiased estimate of β1, the coefficient on Ti.

Meanwhile, the coefficient we estimate on the X1i assignment variable is β̃2 (notice the squiggly on top), a combination of β2 (with no squiggly on top; the actual effect of X1i on Y) and ρ (the degree of correlation between X1i and the error term in the original model, ε). Thus we do not put a lot of stock in the estimate of the coefficient on the assignment variable because it combines the actual effect of the assignment variable and the correlation of the assignment variable and the error. That's okay, though, because our main interest is in the effect of the treatment, β1.
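
A quick simulation (not from the book) illustrates this algebra. The error is built to be correlated with the assignment variable, yet once the assignment variable is included, the coefficient on the treatment dummy stays centered on the true effect, while the coefficient on the assignment variable absorbs β2 + ρ.

  set.seed(123)
  n     <- 10000
  X1    <- runif(n, -1, 1)                 # assignment variable, cutoff at zero
  treat <- as.numeric(X1 >= 0)
  eps   <- 2 * X1 + rnorm(n)               # error correlated with X1 (rho = 2)
  Y     <- 1 + 0.5 * treat + 1 * X1 + eps  # true treatment effect is 0.5

  coef(lm(Y ~ treat))        # omits X1: treatment-effect estimate is badly biased
  coef(lm(Y ~ treat + X1))   # RD model: close to 0.5; slope estimates beta_2 + rho = 3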

Remember This
A regression discontinuity (RD) analysis can be used when treatment depends on an
assignment variable being above some cutoff C.
1. The basic model is
Yi = β0 + β1 Ti + β2 (X1i − C) + εi
Ti = 1 if X1i ≥ C
Ti = 0 if X1i < C

2. RD models require that the error term is continuous at the cutoff. That is, the
value of the error term does not jump up or down at the cutoff.
3. RD identifies a causal effect of treatment because the assignment variable soaks
up the correlation of error and treatment.

Discussion Questions
1. Many school districts pay for new school buildings with bond issues
that need to be approved by voters. Supporters of these bond issues
typically argue that new buildings improve schools and thereby boost
housing values. Cellini, Ferreira, and Rothstein (2010) used RD to test
if passage of school bonds caused housing values to rise.
a) What is the assignment variable?
b) Explain how to use a basic regression discontinuity approach to esti-
mate the effect of school bond passage on housing values.
c) Provide a specific equation for the model.
2. Medicare benefits kick in automatically in the United States the day
a person turns 65 years old. Many believe that people with health in-
surance are less likely to die because they will be more likely to seek
treatment and doctors will be more willing to conduct tests and proce-
dures for them. Card, Dobkin, and Maestas (2009) used RD to address
this question.
a) What is the assignment variable?
b) Explain how to use a basic regression discontinuity approach to esti-
mate the effect of Medicare coverage on the probability of dying.
c) Provide a specific equation for the model. For simplicity use a linear
probability model (as discussed on page 592).

11.2 More Flexible RD Models

In the basic RD model, the slope of the line is the same on both sides of the cutoff for

treatment. This might not be the case in reality. In this section we show how to implement

more flexible regression discontinuity models that allow the slope to vary or allow for a


non-linear relationship between the assignment variable and outcomes.

Varying slopes model

Most RD applications allow the slope to vary above and below the threshold. By incorpo-

rating tools we discussed in Section 6.4, the following will produce estimates in which the

slope is different below the threshold than above the threshold:

Yi = β0 + β1 Ti + β2 (X1i − C) + β3 (X1i − C)Ti + εi

Ti = 1 if X1i ≥ C

Ti = 0 if X1i < C

The new term at the end of the equation is an interaction between T and X1 − C. The coefficient on that interaction, β3, captures how different the slope is for observations where X1 is greater than C. The slope for untreated observations (for whom Ti = 0) will simply be β2, which is the slope for observations to the left of the cutoff. The slope for the treated observations (for whom Ti = 1) will be β2 + β3, which is the slope for observations to the right of the cutoff. (Recall our discussion on page 297 in Chapter 7 regarding the proper interpretation of coefficients on interactions.)

[FIGURE 11.4: Possible Results with Differing-Slopes RD Model. Three panels, (a) through (c), plot the dependent variable (Y) against the assignment variable (X1) around the cutoff.]

Figure 11.4 displays examples in which the slopes differ above and below the cutoff. In panel (a), β2 = 1 and β3 = 2. Because β3 is greater than zero, the slope is steeper for observations to the right of the cutoff. The slope for observations to the left of the cutoff is 1 (the value of β2) and the slope for observations to the right of the cutoff is β2 + β3 = 3.

In panel (b) of Figure 11.4, β3 is zero, meaning that the slope is the same (and equal to β2) on both sides of the cutoff. In panel (c), β3 is less than zero, meaning that the slope is less steep for observations for which X1 is greater than C. Note that just because β3 is negative does not mean that the slope for observations to the right of the cutoff will be negative (although it could be). A negative value of β3 simply means that the slope is less steep for observations to the right of the cutoff. In panel (c), β3 = −β2, which is why the slope is zero to the right of the cutoff.

It is important to use X1i − C instead of X1i as the assignment variable when estimating an RD model with varying slopes. In this model, we're estimating two separate lines. The intercept for the line for the untreated group is β̂0 and the intercept for the line for the treated group is β̂0 + β̂1. If we used X1i as the assignment variable, the β̂1 estimate would indicate the difference between treated and control when X1i is zero; we care about the difference between treated and control when X1i equals the cutoff. By using X1i − C instead of X1i as the assignment variable, β̂1 will indicate the difference between treated and control when X1i − C is zero, which occurs, of course, when X1i = C.
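As a quick illustration (a sketch using hypothetical variable names Y and X1 and a made-up cutoff of 10), the centered varying-slopes model can be estimated in R as follows:

Cutoff = 10                             # hypothetical cutoff
T = as.numeric(X1 >= Cutoff)            # treatment indicator
X1minusC = X1 - Cutoff                  # centered assignment variable

vs = lm(Y ~ T + X1minusC + I(X1minusC * T))   # interaction lets the slope vary
coef(vs)                                # coefficient on T is the jump at X1 = C

Because the assignment variable is centered, the coefficient on T measures the treated-versus-untreated gap right at the cutoff rather than at X1 = 0.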

Polynomial model

Once we start thinking about how the slope could vary across different values of X1 , it is

easy to start thinking about other possibilities. Hence more technical RD analyses spend

a lot of effort estimating relationships that are even more flexible than the varying slopes

model. One way to estimate more flexible relationships between the assignment variable

and outcome is to use our polynomial regression model from page 317 in Chapter 7 to allow

the relationship between X1 and Y to wiggle and curve. The RD insight is that however

wiggly that line gets, we’re still looking for a bump (a discontinuity) at the point where the


treatment kicks in.

For example, we can use polynomial models to allow the estimated lines to curve differently

above and below the treatment threshold with a model like the following:

Yi = β0 + β1 Ti + β2 (X1i − C) + β3 (X1i − C)² + β4 (X1i − C)³

     + β5 (X1i − C)Ti + β6 (X1i − C)² Ti + β7 (X1i − C)³ Ti + εi

Ti = 1 if X1i ≥ C

Ti = 0 if X1i < C

Figure 11.5 shows two examples that can be estimated with such a polynomial model.

In panel (a), the value of Y accelerates as X1 approaches the cutoff, dips at the point of

treatment, and then accelerates again from that lower point. In panel (b), the relationship

appears relatively flat for values of X1 below the cutoff. There is a fairly substantial bump

up in Y at the cutoff. After that, Y rises sharply with X1 and then falls sharply.

It is virtually impossible to predict funky non-linear relationships like these ahead of time.

The goal is to find a functional form for the relationship between X1 − C and the outcome that soaks up any relation between the two, so that any bump at the cutoff reflects only the causal effect of the treatment. This means we can estimate the polynomial models and see what happens even without a full theory about how the line should wiggle.
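A cubic model like the one above can be estimated directly with lm(). Here is a sketch assuming Y, T, and X1minusC have already been created as in the earlier varying-slopes sketch:

# Cubic polynomial on each side of the cutoff (beta2 through beta7 in the equation above)
poly_rd = lm(Y ~ T + X1minusC + I(X1minusC^2) + I(X1minusC^3) +
                 I(X1minusC * T) + I(X1minusC^2 * T) + I(X1minusC^3 * T))
coef(poly_rd)["T"]                      # the estimated jump at the cutoff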

[FIGURE 11.5: Fitted Lines for Examples of Polynomial RD Models. Two panels plot the dependent variable (Y) against the assignment variable (X1) around the cutoff.]

With this flexibility comes danger, though. Polynomial models are quite sensitive and sometimes can produce bumps at the cutoff that are bigger than they should be. Therefore we should always report simple linear models too so as not to look like we are fishing around for a non-linear model that gives us the answer we like.

Remember This
1. When conducting regression discontinuity analysis it is useful to allow for a more
flexible relationship between the assignment variable and the outcome.
• A varying slopes model allows the slope to vary on different sides of the
treatment cutoff:
Yi = β0 + β1 Ti + β2 (X1i − C) + β3 (X1i − C)Ti + εi

• We can also use polynomial models to allow for non-linear relationships be-
tween the assignment and outcome variables.


[FIGURE 11.6: Various Fitted Lines for RD Model of Form Yi = β0 + β1 Ti + β2 (X1i − C) + β3 (X1i − C)Ti. Six panels plot Y against X around the cutoff.]

Discussion Question

For each panel in Figure 11.6 indicate whether each of β1, β2, and β3 is less
than, equal to, or greater than zero for the varying slopes RD model:
Yi = β0 + β1 Ti + β2 (X1i − C) + β3 (X1i − C)Ti + εi


11.3 Windows and Bins

There are other ways to make RD models flexible. An intuitive approach is to simply focus

on a subset of the data near the threshold. In this section we show the benefits and costs of

doing so and introduce binned graphs as a useful tool for all RD analysis.

Adjusting the window

As we discussed earlier, polynomial models can be a bit hard to work with. An easier

alternative (or at least supplement) to polynomial models is to narrow the window in which

we look. The window is the range of the assignment variable to which we limit our analysis;

we only look at observations with values of the assignment variable in this range. Ideally,

we’d make the window very, very small near the cutoff. For such a small window, we’d be

looking only at those observations just below and just above the cutoff. These observations

would be very similar and hence the treatment effect would be the difference in Y for the

untreated (those just below the cutoff) and the treated (those just above the cutoff).

A smaller window allows us to worry less about the functional form on both sides of the

cutoff. Figure 11.7 provides some examples. In panels (a) and (b), we show the same figures

as in Figure 11.5, but highlight a small window. Below each of these panels we show just

the line in the highlighted smaller window. While the relationships are quite non-linear for

the full window, we can see that they are approximately linear in the smaller windows. For

example, when we look only at observations where X1 is between -1 and 1 for panel (a) we


see two more or less linear lines on each side of the cutoff. When we look only at observations

where X1 is between -1 and 1 for panel (b) we see a more or less flat line below the cutoff and

a positively sloped line above the cutoff. So even though the actual relationships between

the assignment variable and Y are non-linear in both panels, a reasonably simple varying

slopes model should be more than sufficient when we focus on the smaller window. A smaller

window for these cases allows us to feel more comfortable that our results do not depend on

sensitive polynomial models, but instead reflect differences between treated and untreated

observations near the cutoff.

As a practical matter, we usually don't have very many observations in a small window near the cutoff, so to have any hope of reasonable statistical power we'll need to make the window large enough to cover a reasonable number of observations.
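In practice, narrowing the window just means re-running the same model on a subset of the data. Here is a sketch in R (variable names as in the earlier sketches, with an illustrative half-width of 1):

halfwidth = 1                               # illustrative choice of window size
in_window = abs(X1minusC) <= halfwidth      # keep only observations near the cutoff

narrow_rd = lm(Y ~ T + X1minusC + I(X1minusC * T), subset = in_window)
summary(narrow_rd)                          # fewer observations, so expect larger standard errors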

Binned graphs

A convenient trick that helps us understand non-linearities and discontinuities in our RD

data is to create binned graphs. Binned graphs look like scatterplots but are a bit different.

To construct a bin plot, we divide the X1 variable into multiple regions (or “bins”) above

and below the cutoff and then calculate the average value of Y within each of those regions.

When we plot the data we get something that looks like panel (a) of Figure 11.8. Notice

there is a single observation for each bin, producing a cleaner graph than a scatterplot of all

observations.


[FIGURE 11.7: Smaller Windows for Fitted Lines for Polynomial RD Model in Figure 11.5. Two panels plot the dependent variable (Y) against the assignment variable (X1 − C), each with a zoomed-in view of a small window around the cutoff.]


The bin plot provides guidance for selecting the right RD model. If the relationship is

highly non-linear or seems dramatically different above and below the cutpoint, the bin plot

will let us know. In panel (a) of Figure 11.8 we see a bit of non-linearity in the relationship

because there is a U-shaped relationship between X1 and Y for values of X1 below the cutoff.

This relationship suggests a quadratic could be appropriate or, even simpler, we could narrow

the window to focus only on the range of X1 where the relationship is more linear. Panel (b)

of Figure 11.8 shows the fitted lines based on an analysis where only observations for which

X1 is between 900 and 2200 are used. The implied treatment effect is the bump in the data

indicated by β1 in the figure. We do not actually use the binned data to estimate the model;

we still use the original data by running simple regressions.
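A binned graph is easy to build by hand. Here is a sketch in R (variable names as in the earlier sketches, with an illustrative bin width of 50):

binwidth = 50                                             # illustrative bin width
bin = binwidth * (floor(X1minusC / binwidth) + 0.5)       # label each observation with its bin midpoint
bin_mean = tapply(Y, bin, mean)                           # average Y within each bin

plot(as.numeric(names(bin_mean)), bin_mean,
     xlab = "Assignment variable (X1 - C)", ylab = "Mean of Y in bin")
abline(v = 0, lty = 2)                                    # mark the cutoff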

Remember This
1. When possible, it is useful to narrow the window by looking only at data close to
the treatment cutoff.
2. Binned graphs help us visualize the discontinuity and the possibly non-linear
relationship between the assignment variable and the outcome.


[FIGURE 11.8: Bin Plots for RD Model. Two panels plot binned means of the dependent variable (Y) against the assignment variable (X1) around the cutoff C; panel (b) adds fitted lines and marks the jump β1 at the cutoff.]


Case Study: Universal Pre-kindergarten

Universal pre-kindergarten is a policy of provid-

ing high-quality, free school to four-year-old chil-

dren. If it works as advocates say, universal pre-K

will counteract socioeconomic disparities, boost

productivity, and decrease crime.

But does it work? Gormley, Phillips, and Gayer (2008) used RD to evaluate one piece

of the puzzle by looking at the impact of universal pre-K on test scores in Tulsa, Oklahoma.

They could do so because children born on or before September 1, 2001, were eligible to

enroll in the program in 2005-06, while children born after this date had to wait to enroll

until the next school year.

Figure 11.9 shows a bin plot for this analysis. The dependent variable is test scores from

a letter-word identification test that measures early writing skills. The children took the test

a year after the older kids started pre-K. The kids born before September 1 spent the year

in pre-K; the kids born after September 1 spent the year doing whatever it is four year olds

do when not in pre-K.

The horizontal axis shows age measured in days from the pre-K cutoff date. The data is

binned in groups of 14 days so that each data point shows the average test scores for children

with ages in a 14-day range. While the actual statistical analysis uses all observations, the


binned graph helps us better see the relationship between the cutoff and test scores than

would a scatterplot of all observations.

One of the nice features of RD is that the plot often tells the story. We’ll do formal

statistical analysis in a second, but in this case, as in many RD examples, we know how the

story is going to end just from the bin plot.

There’s no mistaking the data: there is a jump in test scores precisely at the point of

discontinuity. There’s a clear relationship of kids scoring higher as they get older (as we can

see from the positive slope on age) but right at the age-related cutoff for the pre-K program

there is a substantial bump up. The kids above the cutoff went to pre-K. The kids who

were below the cutoff did not. If the program had no effect, the kids who didn’t go to pre-K

would score lower than the kids who did, simply because they were younger. But there is no

obvious reason why there should be a discontinuity right at the cutoff except if the program

boosted test scores.

Table 11.1 shows statistical results for the basic and varying slopes RD models. For the

basic model, the coefficient on the Pre-K variable is positive and highly significant, with a

t statistic of 10.31. The coefficient indicates the bump that we see in Figure 11.9. The age

variable is also highly significant. No surprise there, as older children did better on the test.

In the varying slopes model, the coefficient on the treatment is virtually unchanged from

the basic model, indicating a bump of 3.479 in test scores for the kids who went to pre-K.

The effect is highly statistically significant with a t statistic of 10.23. The coefficient on the


interaction is insignificant, indicating that the slope on age is the same for kids who had Pre-K and those who didn't.

[FIGURE 11.9: Binned Graph of Test Scores and Pre-K Attendance. Test score is plotted against age in days from the cutoff.]

Table 11.1: RD Analysis of Pre-K

                     Basic          Varying slopes
PreK                 3.492*         3.479*
                     (0.339)        (0.340)
                     [t = 10.31]    [t = 10.23]
Age - C              0.007*         0.007*
                     (0.001)        (0.001)
                     [t = 8.64]     [t = 6.07]
PreK x (Age - C)                    0.001
                                    (0.002)
                                    [t = 0.42]
Constant             5.692*         5.637*
                     (0.183)        (0.226)
                     [t = 31.07]    [t = 24.97]
N                    2,785          2,785
R²                   0.323          0.323
Standard errors in parentheses.
* indicates significance at p < 0.05.

The conclusion? Universal pre-K increased school readiness in Tulsa.

11.4 Limitations and Diagnostics

The regression discontinuity approach is a powerful tool. It allows us to generate unbiased

treatment effects as long as treatment depends on some threshold and the error term is

continuous at the treatment threshold. However, RD can go wrong and in this section we


discuss situations in which RD doesn’t work and how to detect these situations. We also

discuss limitations on how broadly we can generalize RD results.

Imperfect assignment

One drawback to the RD approach is that it’s pretty rare to have an assignment variable

that decisively determines treatment. If we’re looking at the effect of going to a certain

college, for example, we probably cannot use RD because admission was based on multiple

factors, none of which was cut and dried. Or if we’re trying to assess the effectiveness of

a political advertising campaign, it's probably the case that the campaign didn't simply advertise in cities where its poll results were below some threshold; more likely it used some criteria to identify where it might run ads and then relied on a number of factors (including gut feel) to decide exactly where to run them.

In the Further Reading section we point to readings on so-called fuzzy RD models that

can be used when the assignment variable imperfectly predicts treatment. Fuzzy RD models

can be useful when there is a point at which treatment becomes much more likely, but not necessarily guaranteed. For example, a college might only look at people with a test score of 160 or higher. Being above 160 may not guarantee admission, but there is a huge jump in the probability of admission for those who score 160 instead of 159.


Discontinuous error distribution at threshold

A bigger problem for RD models occurs when the error can be discontinuous at the treatment

threshold. Real people living their lives may do things that create a bump in the error term

at the discontinuity. For example, suppose that a GPA in high school above 3.0 makes

students eligible for a tuition discount at a state university. This seems like a promising RD

design: Use high school GPA as the assignment variable and set a threshold at 3.0. We can

then see, for example, if graduation rates (Y ) are higher for students who got the tuition

discount.

The problem is that the high school students (and teachers) know the threshold and how

close they are to it. Students who plan ahead and really want to go to college will make

damn sure that their high school GPA is north of 3.0. Students who are drifting through

life and haven’t gotten around to thinking about college won’t be so careful. Therefore we

could expect that when we are looking at students with GPAs near 3.0, the more ambitious

students pile up on one side and the slackers pile up on the other. If we think ambition

influences graduation (it does!), then ambition (something in the error term) jumps at the

discontinuity, messing up the RD design.

Therefore any RD analysis should discuss whether the only thing happening at the dis-

continuity is the treatment. Do the individuals know about the cutoff? Sometimes they

don’t. Perhaps a worker training program enrolls people who score over some number on

a screening test. The folks taking the test probably don’t know what the number is so it’s


unlikely they would be able to game the system. Or even if people know the score they need,

we can often reasonably assume they’ll do their best because they presumably won’t be able

to precisely know how much effort will be enough to exceed the cutoff. If the test can be

re-taken, though, the more ambitious folks might keep taking it until they pass while the

less ambitious will head home to watch Breaking Bad. In such a situation, something in the

error term (ambitiousness) would jump at the cutoff because the ambitious people would

tend to be above the cutoff and the less ambitious people would be below it.

Diagnostic tests for RD models

Given the vulnerabilities of the RD model, two diagnostic tests are important to assess the

appropriateness of the RD approach. First, we want to know if the assignment variable itself

acts peculiar at the cutoff. If the values of the assignment variable cluster just above the

cutoff, we should worry that people know about the cutoff and are able to manipulate things

in order to get over it. In such a situation, it’s quite plausible that the people who are able to

just get over the cutoff are different than those who do not, perhaps because they have more

ambition (as in our example above) or because they have better contacts or information or

other advantages. To the extent that these factors also affect the dependent variable, we’ll

violate the assumption that the error term does not have a discrete jump at the cutoff.

The best way to assess whether there is clustering on one side of the cutoff is to create a

histogram of the assignment variable, looking for unusual activity in the assignment variable


at the cutoff point.

[FIGURE 11.10: Histograms of Assignment Variable for RD Analysis. Two panels show the frequency of the assignment variable (X1 − C) around the cutoff.]

Panel (a) in Figure 11.10 shows a histogram of assignment values in

a case where there is no obvious clustering. The frequency of values in each bin for the

assignment variable bounces around a bit here and there, but is mostly smooth and there is

no clear bump up or down at the cutoff. In contrast, the histogram in panel (b) shows clear

clustering just above the cutoff. When faced with data like panel (b), it’s pretty reasonable

to suspect that the word is out about what the cutoff is and that people are able to do

something to get over the threshold.1

The second diagnostic test involves assessing whether other variables act weird at the

discontinuity. For RD to be valid, we want only Y to jump at the point where T equals one,

nothing else. If some other variable jumps at the discontinuity, we may wonder if the people

involved are somehow self-selecting (or being selected) based on some additional factors. If

so, these other factors that jump at the discontinuity, rather than the treatment, may be causing the jump in Y. A basic diagnostic test of this sort looks like

X2i = γ0 + γ1 Ti + γ2 (X1i − C) + νi

Ti = 1 if X1i ≥ C

Ti = 0 if X1i < C

A statistically significant γ̂1 coefficient from this model means that X2 jumps at the treatment
1 Formally testing for discontinuity of the assignment variable at the cutoff is a bit tricky. McCrary (2008) has
more. Usually, a visual assessment provides a good sense of what is going on, although it’s a good idea to try different
bin sizes to make sure what you’re seeing is not an artifact of one particular choice for bin size.


discontinuity, which casts doubt on the main assumption of the RD model that the only thing

happening at the discontinuity is movement from the untreated to the treated category.

A significant γ̂1 from this diagnostic test doesn't necessarily kill the RD, but we would

need to control for X2 in the RD model and explain why this additional variable jumps at

the discontinuity. It also makes sense to conduct balance tests using varying slopes models,

polynomial models, and smaller window sizes.
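Both diagnostics are straightforward to run. Here is a sketch in R (variable names as in the earlier sketches, with X2 standing in for any observed covariate):

hist(X1minusC, breaks = 40)          # look for clustering just above the cutoff; try a few bin counts
abline(v = 0, lwd = 2)               # the cutoff

balance = lm(X2 ~ T + X1minusC)      # does the covariate jump at the cutoff?
summary(balance)                     # the coefficient on T should be insignificant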

Including any variable that jumps at the discontinuity is only a partial fix, though, because

if we observe a difference at the cutoff in some variable we can measure, it’s plausible that

there is also a difference at the cutoff in some variable we can’t measure. We can measure

education reasonably well; it’s a lot harder to measure intelligence and it’s extremely hard

to measure conscientiousness. If we see that people are more educated at the cutoff, we’ll

worry that they are also more intelligent and conscientious, meaning we’ll worry that at

the discontinuity our treated group may differ from the untreated group in ways we can’t

measure.

Generalizability of RD results

An additional limitation of RD is that it estimates a very specific treatment effect, also

known as the local average treatment effect (LATE). This concept comes up for instrumental

variables models as well (as discussed on page 469). The idea is that the effects of the

treatment may differ within the population: A training program might work great for some


types of people but do nothing for others. The treatment effect estimated by regression

discontinuity is the effect of the treatment on those folks who have X1 equal to the threshold.

Perhaps the treatment would have no effect on people with very low values of the assignment

variable. Or perhaps the treatment effect grows as the assignment variable grows. RD will

not be able to speak to these possibilities because we observe only the treatment happening at

one cutoff. Hence it is possible that the RD results do not generalize to the whole population.

Remember This
To assess the appropriateness of RD:
1. Qualitatively assess whether people have control over the assignment variable.
2. Conduct diagnostic tests.
• Assess the distribution of the assignment variable using a histogram to see if
there is clustering on one side of the cutoff.
• Run RD models using other covariates as dependent variables. The treatment
should not be associated with any discontinuity in any covariate.


Case Study: Alcohol and Grades

The Air Force Academy alcohol and test score

example that began the chapter shows nicely

how RD and RD diagnostics work.

Figure 11.1 on page 543 showed the binned graph

for test scores and age. Table 11.2 shows the

actual RD results. The first column results are

based on a varying slopes model in which the key variable is the dummy variable indicating

someone was older than 21 when he or she took the exam. This model also controlled for the

assignment variable, age, allowing the effect of age to vary before and after people turned

21. The dependent variable is standardized test scores, meaning that the results in the first

column indicate that turning 21 decreased test scores by 0.092 standard deviations. This

effect is highly statistically significant with a t statistic of 30.67. Adding controls strengthens

the results, as reported in the second column. The results are quite similar when we allow

the age variable to affect test scores non-linearly by including a quadratic function of age in

the model.

Are we confident that the only thing that happens at the discontinuity is that students

become eligible to drink? That is, are we confident that there is no discontinuity in the error

term at the point people turn 21? First, we want to think about the issue qualitatively.


Table 11.2: RD Analysis of Drinking Age and Test Scores (from Carrell, Hoekstra, and West 2010)

                      Varying slopes   Varying slopes with   Quadratic with
                                       control variables     control variables
Discontinuity at 21   -0.092*          -0.114*               -0.106*
                      (0.003)          (0.002)               (0.003)
                      [t = 30.67]      [t = 57.00]           [t = 35.33]
N                     38,782           38,782                38,782
Standard errors in parentheses. * indicates significance at p < 0.05. All three specifications control
for age, allowing the slope to vary on either side of the cutoff. The second and third specifications
control for semester, SAT scores, and other demographics.

Obviously, people can’t affect their age, so there’s little worry that people are manipulating

the assignment variable. And while it is possible, for example, that good students decide to

drop out just after their 21st birthday, which would mean that the students we observe who

just turned 21 are more likely to be bad students, this possibility doesn’t seem particularly

likely.

We can also run diagnostic tests. Figure 11.11 shows the frequency of observations for

students above and below the age cutoff. There is no sign of people manipulating the

assignment variable because the distribution of ages is mostly constant, with some apparently

random bumps up and down.

We can also assess whether other covariates showed discontinuities at the 21st birthday.

As discussed above, the defining RD assumption is that the only discontinuity at the cutoff is

in the dependent variable, so we hope to see no statistically significant discontinuities when


other variables are used as dependent variables.

[FIGURE 11.11: Histogram of Age Observations for Drinking Age Case Study. Frequency of the assignment variable (X1 − C), measured in days from the 21st-birthday cutoff.]

The model we're testing is

Covariatei = γ0 + γ1 Ti + γ2 (Agei − C) + νi

Ti = 1 if Agei ≥ C

Ti = 0 if Agei < C

Table 11.3 shows results for three covariates: SAT math scores, SAT verbal scores, and

physical fitness. For none of these covariates is γ̂1 statistically significant, suggesting that

there is no bump in covariates at the point of the discontinuity, something that is consistent

with the idea that the only thing changing at the discontinuity is the treatment.

Table 11.3: RD Diagnostics for Drinking Age and Test Scores (from Carrell, Hoekstra, and West 2010)

                      SAT math     SAT verbal   Physical fitness score
Discontinuity at 21   2.371        1.932        0.025
                      (2.81)       (2.79)       (0.04)
                      [t = 0.84]   [t = 0.69]   [t = 0.63]
N                     38,782       38,782       38,782
Standard errors in parentheses. * indicates significance at p < 0.05.
All three specifications control for age, allowing the slope to vary on either side of the cutoff.

11.5 Conclusion

Regression discontinuity is a powerful statistical tool. It works even when the treatment we

are trying to analyze is correlated with the error. It works because the assignment variable –


a variable that determines whether or not a unit gets the treatment – soaks up endogeneity.

The only assumption we need is that there is no discontinuity in the error term at the cutoff

in the assignment variable X1 .

If we have such a situation, the basic RD model is super simple. It is just an OLS

model with a dummy variable (indicating treatment) and a variable indicating distance to

the cutoff. More complicated RD models allow for more complicated relationships between

the assignment variable and the dependent variable, but no matter the model, the heart of

the RD remains looking for a bump in the value of Y at the cutoff point for assignment

to treatment. As long as there is no discontinuity in the error term at the cutoff, we can attribute any bump in the dependent variable to the effect of the treatment.

RD is an essential part of any econometric toolkit. Regression discontinuity can fill in a

hole where panel, instrumental variable, or experimental techniques aren’t up to the task.

RD analysis is quite clean. Anybody can pretty much see the answer by looking at a binned graph, and the statistical models are relatively simple to implement and explain.

RD is not without pitfalls, however. If people can manipulate their score on the assign-

ment variable, then the RD estimate no longer simply captures the effect of treatment, but

also captures the effects of whatever qualities are overrepresented among the folks who were

able to get their assignment score above the threshold. For this reason it is very important

to report diagnostics that help us sniff out possible discontinuities in the error term at the


cutoff.

We are on the right track when we can do the following.

• Section 11.1: Write down a basic regression discontinuity model and explain all terms,

including treatment variable, assignment variable, and cutoff. Explain how RD models

overcome endogeneity.

• Section 11.2: Write down and explain RD models with varying slopes and non-linear

relationships.

• Section 11.3: Explain why it is useful to look at a smaller window. Explain a binned

graph and why it is different from a conventional scatterplot.

• Section 11.4: Explain conditions under which RD might not be appropriate. Explain

qualitative and statistical diagnostics for RD models.

Further Reading

Imbens and Lemieux (2008) and Lee and Lemieux (2010) go into additional detail on re-

gression discontinuity designs in a way that is useful for practitioners, including discussions

of fuzzy RD models. Bloom (2012) is another useful overview of RD methods. Cook (2008)

provides a history of RD applications. Buddelmeyer and Skoufias (2003) compare the performance of regression discontinuity and experiments and find that regression discontinuity works well as long as the discontinuity is rigorously enforced.


See Grimmer, Hersh, Feinstein, and Carpenter (2010) for an example of using diagnostics

to critique RD studies using election outcomes as an RD assignment variable.

Key Terms
• Assignment variable (545)
• Binned graphs (562)
• Discontinuity (541)
• Fuzzy RD models (570)
• Local average treatment effect (LATE) (575)
• Regression discontinuity (RD) (544)
• Window (561)

Computing Corner

Stata
To estimate an RD model in Stata, create a dummy treatment variable and an X1 − C
variable and use the syntax for multivariate OLS.

1. The following commands create variables needed for RD. Note that a scalar variable
is simply a variable with a single value (in contrast to a typical variable that has a list
of values).
scalar cutoff = 10                /* Create scalar equal to the cutoff */
gen T = 0                         /* Initially create T with all zeros */
replace T = 1 if X1 >= cutoff     /* Set T to one when X1 is at or above the cutoff */
gen X1minusC = X1 - cutoff        /* Creates the X1-C variable */
2. Basic RD is a simple OLS model:
reg Y T X1minusC


3. To estimate a model with varying slopes, first create an interaction variable and then
run OLS:
gen X1minusCxT = X1minusC * T
reg Y T X1minusC X1minusCxT
4. To create a scatterplot with the fitted lines from a varying slopes RD model, do the
following:
graph twoway (scatter Y X1minusC) (lfit Y X1minusC if T == 0) /*
*/ (lfit Y X1minusC if T == 1)
R
To estimate an RD model in R, we create a dummy treatment variable and an X1 − C variable
and use the syntax for multivariate OLS.

1. The following commands create variables needed for RD. Note that a scalar variable is
simply a variable with a single value (in contrast to a typical variable that has a list of
values).
Cutoff = 10                     # Create a scalar equal to the cutoff
T = rep(0, length(X1))          # Initially create T as a vector of zeros
T[X1 >= Cutoff] = 1             # Set T to one when X1 is at or above the cutoff
X1minusC = X1 - Cutoff          # Creates the X1-C variable
2. Basic RD is a simple OLS model:
RDResults = lm(Y ~ T + X1minusC)
3. To estimate a model with varying slopes, first create an interaction variable and then
run OLS:
X1minusCxT = X1minusC * T
RDResults = lm(Y ~ T + X1minusC + X1minusCxT)
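4. To plot the data with the fitted lines from a varying slopes RD model, one approach (a sketch using base R graphics, mirroring the Stata plotting command above) is:
plot(X1minusC, Y)                             # scatterplot of the raw data
abline(v = 0, lty = 2)                        # mark the cutoff
below = lm(Y ~ X1minusC, subset = (T == 0))   # fitted line below the cutoff
above = lm(Y ~ X1minusC, subset = (T == 1))   # fitted line above the cutoff
lines(sort(X1minusC[T == 0]), fitted(below)[order(X1minusC[T == 0])])
lines(sort(X1minusC[T == 1]), fitted(above)[order(X1minusC[T == 1])])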

Exercises
1. As discussed on page 566, Gormley, Phillips, and Gayer (2008) used RD to evaluate the
impact of pre-K on test scores in Tulsa. Children born on or before September 1, 2001,
were eligible to enroll in the program in 2005-06, while children born after this date
had to wait to enroll until the 2006-07 school year. Table 11.4 lists the variables. The
pre-K data set covers 1,943 children just beginning the program in 2006-07 (preschool
entrants) and 1,568 children who just finished the program and began kindergarten in
2006-07 (preschool alumni).


Table 11.4: Variables for Pre-kindergarten Question

Variable Description
age Age, days from the birthday cutoff. The cutoff value is coded as 0,
negative values indicate days born after the cutoff; positive values
indicate days born before the cutoff
cutoff Treatment indicator (1 = born before cutoff, 0 = born after cutoff)
wjtest01 Woodcock-Johnson letter-word identification test score
female Female (1 = yes, 0 = no)
black Black (1 = yes, 0 = no)
white White (1 = yes, 0 = no)
hispanic Hispanic (1 = yes, 0 = no)
freelunch Eligible for free lunch based on low income in 2006-07 (1 = yes, 0 = no)

a. Why should there be a bump in the dependent variable right at the point where
a child’s birthday renders him or her eligible to have participated in preschool the
previous year (2005-06) rather than the current year (2006-07)? Should we see jumps
at other points as well?
b. Assess whether there is a discontinuity at the cutoff for the free-lunch status, gender,
and race/ethnicity covariates.
c. Repeat the tests for covariate discontinuities restricting the sample to a one-month
(30-day) window on either side of the cutoff. Do the results change? Why or why
not?
d. Using letter-word identification test score as the dependent variable, estimate a basic
regression discontinuity model controlling for treatment status (born before the cut-
off) and the assignment variable (age measured as days from the cutoff). What is the
estimated effect of the preschool program on letter-word identification test scores?
e. Estimate the effect of pre-K using an RD specification that allows the relationship
to vary on either side of the cutoff. Do the results change? Should we prefer this
model? Why or why not?
f. Add controls for lunch status, gender, and race/ethnicity to the model. Does adding
these controls change the results? Why or why not?
g. Re-estimate the model from part (f) limiting the window to one month (30 days)
on either side of the cutoff. Do the results change? How do the standard errors in
this model compare to those from the model using the full data set?
2. Gormley, Phillips, and Gayer (2008) also used RD to evaluate the impact of Head Start
on test scores in Tulsa. Children born on or before September 1, 2001, were eligible
to enroll in the program in 2005-06, while children born after this date had to wait to


enroll until the 2006-07 school year. The variable names and definitions are the same
as in Table 11.4, although in this case the data refers to 732 children just beginning
the program in 2006-07 (Head Start entrants) and 470 children who just finished the
program and were beginning kindergarten in 2006-07 (Head Start alumni).
a. Assess whether there is a discontinuity at the cutoff for the free-lunch status, gender,
and race/ethnicity covariates.
b. Repeat the tests for covariate discontinuities restricting the sample to a one-month
(30 day) window on either side of the cutoff. Do the results change? Why or why
not?
c. Using letter-word identification test score as the dependent variable, estimate a basic
regression discontinuity model. What is the estimated effect of the preschool program
on letter-word identification test scores?
d. Estimate the effect of Head Start using an RD specification that allows the relation-
ship to vary on either side of the cutoff. Do the results change? Should we prefer
this model? Why or why not?
e. Add controls for lunch status, gender, and race/ethnicity to the model. Do the results
change? Why or why not?
f. Re-estimate the model from part (e) limiting the window to one month (30 days) on
either side of the cutoff. Do the results change? How do the standard errors in this
model compare to those from the model using the full data set?
3. Congressional elections are decided by a clear rule: whoever gets the most votes in
November wins. Because virtually every congressional race in the United States is
between two parties, that means whoever gets more than 50 percent of the vote wins.2
We can use this fact to estimate the effect of political party on ideology. Some argue that
Republicans and Democrats are very distinctive; others argue that members of Congress
have strong incentives to respond to the median voter in their districts, regardless of
party. We can assess how much party matters by looking at the ideology of members of
Congress in the 112th Congress (from 2011 to 2012). Table 11.5 lists the variables.
a. Suppose we try to explain congressional ideology as a function of political party only.
Explain how endogeneity might be a problem.
b. How can an RD model fight endogeneity when trying to assess if and how party
affects congressional ideology?
c. Generate a scatterplot of congressional ideology against GOP2party and based on
this plot discuss what you think the RD will indicate.
2 We'll look only at votes going to the two major parties, Democrats and Republicans, in order to have a nice 50
percent cutoff.


Table 11.5: Variables for Congressional Ideology Question

Variable Description
GOP2party2010 The percent of the vote received by the Republican congressional candidate
in the district in 2010. Ranges from 0 to 1.
GOPwin2010 Dummy variable indicating Republican won; equals 1 if GOP2party2010 > 0.5.
Ideology The conservativism of the member of Congress as measured by Carroll, Lewis,
Lo, Poole, and Rosenthal (2009, 2014). Ranges from -0.779 to 1.293. Higher
values indicate more conservative voting in Congress.
ChildPoverty Percentage of district children living in poverty. Ranges from 0.03 to 0.49.
MedianIncome Median income in the district. Ranges from $23,291 to $103,664.
Obama2008 Percent of vote for Barack Obama in the district in 2008 presidential election.
Ranges from 0.23 to 0.95.
WhitePct Percent of the district that is non-Hispanic White ranges from 0.03 to 0.97.

d. Write down a basic RD model for this question and explain the terms.
e. Estimate a basic RD model and interpret coefficients.
f. Create an adjusted assignment variable (equal to GOP2party2010 - 0.50) and use it
to estimate a varying slopes RD model and interpret coefficients. Create a plot that
has a scatterplot of the data and fitted lines from the model. Calculate the fitted
values for four observations: a Democrat with GOP2party2010 = 0, a Democrat with
GOP2party2010 = 0.5, a Republican with GOP2party2010 = 0.5, and a Republican
with GOP2party2010 = 1.0.
g. Re-estimate the varying slopes model but use the unadjusted variable (and unad-
justed interaction). Compare coefficient estimates to your results in part (f). Calcu-
late the fitted values for four observations: a Democrat with GOP2party2010 = 0,
a Democrat with GOP2party2010 = 0.5, a Republican with GOP2party2010 = 0.5
and a Republican with GOP2party2010 = 1.0. Compare to the fitted values in part
(f).
h. Assess whether there is clustering of the assignment variable just above the cutoff.
i. Assess whether there are discontinuities at GOP2party2010 = 0.50 for ChildPoverty,
MedianIncome, Obama2008, and WhitePct. Discuss the implications of your findings.
j. Estimate a varying slopes model controlling for ChildPoverty, MedianIncome, Obama2008,
and WhitePct. Discuss these results in light of your findings from part (i).
k. Estimate a quadratic RD model and interpret results.
l. Estimate a varying slopes model with a window of GOP vote share from 0.4 to 0.6.
Discuss any meaningful differences in coefficients and standard errors from the earlier
varying slopes model.


m. Which estimate is the most credible?


4. Ludwig and Miller (2007) use a discontinuity in program funding for Head Start to test
the impact on child mortality rates. In the 1960s, the federal government helped
300 of the poorest counties in the United States write grants for Head Start programs.
Only counties where poverty was greater than 59.2 percent received this assistance.
This problem explores the effects of Head Start on child mortality rates. Table 11.6
lists the variables.
Table 11.6: Variables for Head Start Question

Variable Description
County County indicator
Mortality County mortality rate for children aged 5 to 9 from 1973 to 1983, limited
to causes plausibly affected by Head Start
Poverty Poverty rate in 1960. Transformed by subtracting off cutoff; also divided
by 10 for easier interpretation
HeadStart Dummy variable indicating counties that received Head Start assistance.
Counties with poverty greater than 59.2 are coded as 1; counties with
poverty less than 59.2 are coded as 0
Bin The “bin” label for each observation based on dividing the poverty
into 50 bins

a. Write out an equation for a basic RD design to assess the effect of Head Start assis-
tance on child mortality rates. Draw a picture of what you expect the relationship
to look like. Note that in this example, treatment occurs for low values of the as-
signment variable.
b. Explain how RD can identify a causal effect of Head Start assistance on mortality.
c. Estimate the effect of Head Start on mortality rate using a basic RD design.
d. Estimate the effect of Head Start on mortality rate using a varying slopes RD design.
e. Estimate a basic RD model with (adjusted) poverty values that are between -0.8 and
0.8. Comment on your findings.
f. Implement a quadratic RD design. Comment on the results.
g. Create a scatterplot of the mortality and poverty data. What do you see?
h. Use the following code to create a binned graph of the mortality and poverty data.
What do you see?3


egen BinMean = mean(Mortality), by(Bin)


graph twoway (scatter BinMean Bin, ytitle("Mortality") xtitle("Poverty") /*
*/ msymbol(O) msize(large) xline(0.0) ) /*
*/ (lfit BinMean Bin if HeadStart == 0, clwidth(thick) clcolor(blue)) /*
*/ (lfit BinMean Bin if HeadStart == 1, clwidth(thick) clcolor(red))
i. Re-run the quadratic model and save predicted values as FittedQuadratic. Include
the fitted values in the graph from part (h) by adding (scatter FittedQuadratic
Poverty) to the above code. Explain the results.

3 The trick to creating a binned graph is associating each observation with a bin label that is in the middle of
the bin. Stata code that does this is
scalar BinNum = 50
scalar BinMin = -6
scalar BinMax = 3
scalar BinLength = (BinMax-BinMin)/BinNum
gen Bin = BinMin + BinLength*(0.5 + floor((Poverty-BinMin)/BinLength))
The Bin variable here sets the value for each observation to the middle of the bin; there are likely other ways to
do it.

Part III

Limited Dependent Variables

CHAPTER 12

DUMMY DEPENDENT VARIABLES

Think of a baby born just ... now. Somewhere

in the world it has just happened. This child’s

life will be punctuated by a series of dichotomous

events. Was she born prematurely? Will she go

to pre-K? Will she choose a private school? Will

she graduate from high school? Will she get a

job? Get married? Buy a car? Have a child?

Vote Republican? Have health care? Live past 80 years old?

When we use data to analyze such phenomena – and many others – we need to confront

the fact that the outcomes are dichotomous. They either happened or didn’t, meaning


that our dependent variable is either 1 (happened) or 0 (didn’t happen). Although we can

continue to use OLS for dichotomous dependent variables, the probit and logit models we

introduce in this chapter often fit the data better. Probit and logit models come with a

price, though, as they are more complicated to interpret.

This chapter explains how to deal with dichotomous dependent variables. Section 12.1

shows how to use OLS to estimate these models. OLS does fine, but there are some things

that aren’t quite right. Hence Section 12.2 introduces a new model, called a latent variable

model, to model dichotomous outcomes. Section 12.3 then presents the workhorse probit and

logit models. These models differ from OLS and Section 12.4 explains how. Section 12.5 then

presents the somewhat laborious process of interpreting coefficients from these models. Probit

and logit models have several cool properties, but ease of interpretation is not one of them.

Section 12.6 concludes by showing how to test hypotheses involving multiple coefficients

when working with probit and logit models.

12.1 Linear Probability Model

The easiest way to analyze a dichotomous dependent variable is to use the linear prob-

ability model (LPM). This is just a fancy way of saying just run your darn OLS model

already.1 The LPM has witnessed a bit of a renaissance lately as people have realized that

despite some clear defects, it often conveniently and effectively characterizes the relation-
1 We discussed dichotomous independent variables in Chapter 7.


ships between independent variables and outcomes. If there is no endogeneity (a big if, as

we know all too well), then the coefficients will be the right sign and will generally imply a

substantive relationship similar to that estimated by the more complicated probit and logit

models we discuss later in this chapter.

In this section we show how the LPM model works and describe its limitations.

LPM and the expected value of Y

One nice feature of OLS is that it generates the best estimate of the expected value of Y

as a linear function of the independent variables. In other words, we can think of OLS as

providing us

E[Yi|X1, X2] = β0 + β1 X1i + β2 X2i

where E[Yi|X1, X2] is the expected value of Yi given the values of X1i and X2i. This term is also referred to as the conditional value of Y.2

When the dependent variable is dichotomous, the expected value of Y is equal to the
2 The terms linear and non-linear can sometimes get confusing in statistics. A linear model is one of the form
Yi = β0 + β1 X1i + β2 X2i + ... where none of the parameters to be estimated are multiplied, divided, or raised to
powers of other parameters. In other words, all the parameters enter in their own little plus term. A non-linear model
is one where some of the parameters are multiplied, divided, or raised to powers of other parameters. Linear models
can estimate some non-linear relationships (by creating terms that are functions of the independent variables, not
the parameters). We described this process in Section 7.1 of Chapter 7. Such polynomial models will not, however,
solve the deficiencies of OLS for dichotomous dependent variables. The models that do address the problems, the
probit and logit models we cover later in this chapter, are complex functions of other parameters and are therefore
necessarily non-linear models.


probability the variable equals one. For example, consider a dependent variable that is 1 if it

rains and 0 if it doesn’t. If there is a 40% chance of rain, the expected value of this variable

is 0.40. If there is an 85% chance of rain, the expected value of this variable is 0.85. In other words, because E[Y|X] = Probability(Y = 1|X), OLS with a dichotomous dependent variable provides

Pr(Y = 1|X1, X2) = β0 + β1 X1i + β2 X2i

The interpretation of β̂1 from this model is that a one-unit increase in X1 is associated with a β̂1 increase in the probability of observing Y = 1.
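As a sketch of what this looks like in practice (using hypothetical variable names admit, a 0/1 admission indicator, and gpa), the LPM is just OLS applied to the dummy outcome:

lpm = lm(admit ~ gpa)          # linear probability model: OLS with a 0/1 dependent variable
coef(lpm)["gpa"]               # change in Pr(admit = 1) for a one-unit increase in gpa
range(fitted(lpm))             # fitted "probabilities" are not constrained to lie between 0 and 1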

Table 12.1 displays the results from an LPM model of the probability of admission into

a competitive Canadian law school. The independent variable is college grade point average

(GPA), measured on a 100-point scale, as is common in Canada. The coefficient on GPA is 0.032, meaning that a one-point increase on the 100-point GPA scale is associated with a 3.2 percentage point increase in the probability of admission into this law school.


Table 12.1: LPM Model of the Probability of Admission to Law School

GPA             0.032*
                (0.003)
                [t = 12.29]
Constant        -2.28*
                (0.206)
                [t = 11.10]
N               514
R²              0.23
Minimum Ŷi      -0.995
Maximum Ŷi      0.682
Standard errors in parentheses.
* indicates significance at p < 0.05.


[FIGURE 12.1: Scatterplot of Law School Admissions Data and LPM Fitted Line. Probability of admission (0 or 1) is plotted against GPA (on a 100-point scale), with the fitted values from the linear probability model overlaid.]


Figure 12.1 shows a scatterplot of the law school admissions data with the fitted line from

the LPM model included. The scatterplot looks different than a typical regression model

scatterplot because the dependent variable is either 0 or 1, creating two horizontal lines of

observations. Each point is a light vertical line and the scatterplot looks like a dark bar

where there are many observations. We can see that folks with GPAs under 80 mostly do

not get admitted while people with GPAs above 85 mostly do get admitted.

The expected value of Y based on the LPM model is a straight line with a slope of 0.032.

Clearly, as GPAs rise, the probability of admission rises as well. The difference from OLS with a continuous dependent variable is that instead of interpreting β̂1 as the increase in the value of Y associated with a one-unit increase in X, we now interpret β̂1 as the increase in the probability that Y equals one associated with a one-unit increase in X.

Limits to LPM

While Figure 12.1 is generally sensible, it also has a glaring flaw. The fitted line goes below

zero. In fact, the fitted line goes far below zero. The poor soul with a GPA of 40 has a fitted

value of -0.995. This is nonsensical (and a bit sad). Probabilities must be between 0 and 1.

For a low enough value of X, the predicted value falls below zero; for a high enough value

of X, the predicted value exceeds one.3


3 In this particular figure, the fitted probabilities do not exceed one because GPAs can’t go higher than 100. In
other cases, though, the independent variable may not have such a clear upper bound. And, regardless, it is extremely
common for LPM fitted values to be less than zero for some observations and greater than one for other observations.


The problem with LPM isn’t only that it sometimes provides fitted values that make no

sense. We could, after all, simply say that any time we see a fitted value below 0, we’ll call

that a 0 and anytime we see a fitted value above 1 we’ll call that a 1. The deeper problem

is that fitting a straight line to data with a dichotomous dependent variable runs the risk

of misspecifying the relationship between the independent variables and the dichotomous

dependent variable.

Figure 12.2 illustrates an example of LPM’s problem. Panel (a) depicts a fitted line

from an LPM model of law school admissions data based on the six hypothetical

observations indicated. The line is reasonably steep, implying a clear relationship. Now

suppose that we add three observations from applicants with very high GPAs, all of whom

were admitted. These observations are the triangles in the upper right of panel (b). Common

sense suggests these observations should strengthen our belief that GPAs predict admission

into law school. Sadly, LPM lacks common sense. The figure shows that the LPM fitted line

with the new observations (the dashed line) is flatter than the original estimate, implying

that the estimated relationship is weaker than the relationship we estimated in the original

model with less data.

What’s that all about? It’s pretty easy to understand once we appreciate that the LPM

needs to fit a linear relationship. Because these three new applicants have higher GPAs, from an LPM perspective we should expect them to have a higher probability of admission than the

applicants in the initial sample. But the dependent variable can’t get higher than one, so


[Figure 12.2 about here: two panels plotting probability of admission against GPA (on a 100 point scale). Panel (a) shows the LPM fitted line through six hypothetical observations; panel (b) adds three admitted applicants with very high GPAs (triangles) and shows that the new LPM fitted line (dashed) is flatter.]

FIGURE 12.2: Misspecification Problem in Linear Probability Model


the LPM therefore interprets the new data as suggesting a weaker relationship. In other

words, because these applicants had higher independent variables but not higher dependent

variables, the LPM model infers that the independent variable is not driving the dependent

variable higher.

What really is going on is that once GPAs are high enough, students are pretty much

certain to be admitted. In other words, we expect a non-linear relationship – the probability

of admission rises with GPAs up to a certain level, but then levels off as applicants are pretty

much all admitted when their GPAs are above that level. The probit and logit models we

develop next allow us to capture precisely this possibility.4

In LPM’s defense, it won’t systematically estimate positive slopes when the actual slope is

negative. And we should not underestimate its convenience and practicality. Nonetheless, we

should worry that LPM may sometimes leave us with an incomplete view of the relationship

between the independent and dichotomous dependent variables.

4 LPM also has a heteroscedasticity problem. As discussed earlier, heteroscedasticity seldom is a more serious
problem than endogeneity, but the heteroscedasticity means that we have to cast a skeptical eye toward standard
errors estimated by LPM. There is a fix for this problem, but the process is complicated enough that we
might as well run the probit or logit models described below. For more details, see Long (1997, 39).


Remember This
The linear probability model (LPM) uses OLS to estimate a model with a dichotomous
dependent variable.
1. The coefficients are easy to interpret: a one-unit increase in Xj is associated with
a βj increase in the probability that Y equals one.
2. Limitations of the LPM include:
• Fitted values of Ŷi may be greater than one or less than zero.
• Coefficients from an LPM model may mischaracterize the nature of the relationship between X and Y.

12.2 Using Latent Variables to Explain Observed Variables

Given these limits to the LPM model, our goal is to develop a model that will produce fitted

values between zero and one. In this section, we describe the S-curves that achieve this goal

and introduce latent variables as a tool that will help us estimate S-curves.

S-curves

Figure 12.3 shows the law school admissions data. The LPM fitted line, in all its negative-probability glory, is there, but we have also added a fitted curve from a probit model. The

probit fitted line looks like a tilted letter “S” such that the relationship between X and the

dichotomous dependent variable is non-linear. We explain how to generate such a curve over

the course of this chapter, but for now let’s note some of its nice features.

For applicants with GPAs below 70 or so, the probit fitted line has flattened out. This


[Figure 12.3 about here: scatterplot of the law school admissions data with fitted values from the probit model (S-shaped curve) and from the linear probability model (straight line). The vertical axis is the probability of admission; the horizontal axis is GPA on a 100 point scale.]

FIGURE 12.3: Scatterplot of Law School Admissions Data and LPM and Probit Fitted Lines


means no matter how low their GPAs go, their fitted probability of admission does not go

below zero. For applicants with very high GPAs, increasing GPA leads to only small increases

in the probability of admission. Even if GPAs were to go very, very high, the probit fitted

line flattens out so that no one will have a predicted probability of admission greater than

1.

Not only does the S-shaped curve of the probit fitted line avoid nonsensical probability

estimates, it also reflects the data better in several respects. First, there is a range of GPAs

where the effect on admissions is quite high. Look in the range from around 80 to around

90. As GPA rises in this range, the effect on probability of admission is quite high, much

higher than implied by the LPM fitted line. Second, even though the LPM fitted values for

the high GPAs are logically possible (because they are between 0 and 1), they don’t reflect

the data particularly well. The person with the highest GPA in the entire sample (a GPA of 92) is predicted by the LPM model to have only a 68% probability of admission. The

probit model, in contrast, predicts a 96% probability of admission for this GPA star.

Latent variables

To generate such non-linear fitted lines, we’re going to think in terms of a latent variable.

Something is latent if you don’t see it. A latent variable is something we don’t see, at

least not directly. We’ll think of the observed dummy dependent variable (which is zero or

one) as reflecting an underlying continuous latent variable. If the value of an observation’s


latent variable is high, then the dependent variable for that observation is likely to be one;

if the value of an observation’s latent variable is low, then the dependent variable for that

observation is likely to be zero. In short, we’re interested in a latent variable that is an

unobserved continuous variable reflecting the propensity of an individual observation of Yi

to equal 1.

Here’s an example. Pundits and politicians obsess over presidential approval. They know

that the president’s re-election and policy choices are often tied to the state of his approval.

Presidential approval is typically measured with a yes or no question: Do you approve of

the way the President is handling his job? That’s our dichotomous dependent variable, but

we know full well that the range of responses to the president covers a lot more than two

choices. Some people froth at the mouth in anger at the mention of the president. Others

think “meh.” Others giddily support the president.

It’s useful to think of these different attitudes as different latent attitudes toward the

president. We can think of the people who hate the president as having very negative values

of a latent presidential approval variable. People who are so-so about the president have

values of a latent presidential approval variable near zero. People who love the president

have very positive values of a latent presidential approval variable.

We think in terms of a latent variable because it is easy to write down a model for a

continuous latent propensity to approve of the president. It looks like an OLS model. Specifically, Yi* (pronounced “Y-star”) is the latent propensity to be a 1 (an ugly


phrase, but that’s really what it is). It depends on some independent variable X and the βs.

Yi* = β0 + β1X1i + εi                                    (12.1)

We’ll model the observed dichotomous dependent variable as a function of this unobserved

latent variable. We observe Y = 1 (notice the lack of a star) for people whose latent feelings

are above zero.5 If the latent variable is less than zero, we observe Y = 0. (We ignore

non-answers to keep things simple.)

This latent variable approach is consistent with how the world works. There are folks

who approve of the president but differ in the degree to which they approve; they are all

ones in the observed variable (Y) but vary in the latent variable (Y*). There are folks who

disapprove of the president but differ in the degree to which they disapprove of the president;

they are all zeros in the observed variable (Y) but vary in the latent variable (Y*).

Formally, we connect the latent and observed variables as follows. The observed variable

is
Yi = 0 if Yi* < 0
Yi = 1 if Yi* ≥ 0

Plugging in Equation 12.1 for Yi*, we observe Yi = 1 if

β0 + β1X1i + εi ≥ 0
εi ≥ −β0 − β1X1i
5 Because the latent variable is unobserved, we have the luxury of labeling the point in the latent variable space at
which folks become 1’s as zero.


In other words, if the random error term is greater than or equal to −β0 − β1X1i, we’ll observe Yi = 1. This implies

Pr(Yi = 1|X1) = Pr(εi ≥ −β0 − β1X1i)

With this characterization, the probability that the dependent variable is one is necessarily

bounded between 0 and 1 because it is expressed in terms of the probability that the error

term is greater or less than some number. Our task in the next section is to characterize the

distribution of the error term as a function of the β parameters.

Remember This
Latent variable models are helpful to analyze dichotomous dependent variables.
1. The latent (unobserved) variable is

Yi* = β0 + β1Xi + εi

2. The observed variable is

Yi = 0 if Yi* < 0
Yi = 1 if Yi* ≥ 0
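A small simulation helps fix the idea. The Python sketch below uses made-up parameter values (they are assumptions for illustration only) to generate a continuous latent Yi* and then record only the 0/1 variable that the zero threshold produces:

    import numpy as np

    rng = np.random.default_rng(0)
    beta0, beta1 = -1.0, 2.0            # hypothetical "true" parameters
    x = rng.uniform(0, 1, size=1000)
    eps = rng.normal(size=1000)         # normal errors, as the probit model will assume

    y_star = beta0 + beta1 * x + eps    # latent and continuous; unobserved in practice
    y = (y_star >= 0).astype(int)       # what we actually observe

    # The share of observations with Y = 1 rises with X
    print(y[x < 0.3].mean(), y[x > 0.7].mean())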

12.3 Probit and Logit Models

Probit model and logit model both allow us to estimate the relationship between X and Y

in a way that the fitted values are necessarily between 0 and 1, thereby producing estimates

that more accurately capture the full relationship between X and Y than do LPM models.

The probit and logit models are effectively very similar, but they differ in the equations


they use to characterize the error term distributions. In this section we explain the equations

behind each of these two models.

Probit model

The key assumption in a probit model is that the error term (εi) is itself normally dis-

tributed. We’ve worked with the normal distribution a lot because the Central Limit The-

orem (from page 85) implies that with enough data, OLS coefficient estimates are normally

distributed no matter how ε is distributed. For the probit model we’re saying that ε itself is normally distributed. So while normality of β̂1 is a proven result for OLS, normality of ε is

an assumption in the probit model.

Before we explain the equation for the probit model, it is useful to do a bit of bookkeeping.

We have shown that Pr(Yi = 1|X1) = Pr(εi ≥ −β0 − β1X1i), but this equation is a bit hard

to work with given the widespread convention in probability to characterize the distribution

of a random variable in terms of the probability that it is less than some value. Therefore,

we’re going to do a quick trick based on the symmetry of the normal distribution, a property

that means the distribution has the same shape on each side of its mean. This means that

the probability of seeing something larger than some number is the same as the probability of

seeing something less than the negative of that number. Figure 12.4 illustrates this property.

In panel (a), we shade the probability of being greater than -1.5. In panel (b), we shade the

probability of being less than 1.5. The symmetry of the normal distribution backs up what


[Figure 12.4 about here: two standard normal probability density plots over β0 + β1Xi. In panel (a) the shaded area is the probability that εi is greater than −1.5; in panel (b) the shaded area is the probability that εi is less than 1.5, which equals the probability that εi is greater than −1.5.]

FIGURE 12.4: Symmetry of Normal Distribution


our eyes suggest: The shaded areas are equally sized, indicating equal probabilities. In other

words, Prob(εi > −1.5) = Prob(εi < 1.5). This fact allows us to rewrite Pr(Yi = 1|X1) = Pr(εi ≥ −β0 − β1X1i) as

Pr(Yi = 1|X1) = Pr(εi ≤ β0 + β1X1i)

This is not a big conceptual change; it simply makes it much easier to characterize the model with conventional tools for working with normal distributions. In particular, stating the condition in this way simplifies our use of the cumulative distribution function (CDF) of a standard normal distribution. The CDF tells us how much of the normal distribution is to the left of any given point. Feed it a number and the CDF function will

tell us the probability a standard normal random variable is less than that number.

Figure 12.5 shows examples for several values of β0 + β1X1i. Panel (a) shows a standard

normal PDF with the portion to the left of -0.7 shaded. Below that in panel (d) we show a

CDF function with the value of the CDF at -0.7 highlighted. The value is roughly equal to

0.25 which is the area of the normal curve that is to the left of -0.7 in panel (a).

Panel (b) shows a standard normal density curve with the portion to the left of +0.7

shaded. Clearly this is more than half of the distribution. The CDF function below it in

panel (e) shows that, in fact, roughly 0.75 of a standard normal density is to the left of +0.7.

Panel (c) shows a standard normal PDF with the portion to the left of 2.3 shaded. Panel

(f) below that shows a CDF function with the value of the CDF at 2.3 highlighted, which is


[Figure 12.5 about here: six panels. The top row, panels (a)–(c), shows standard normal PDFs with the area to the left of −0.7, 0.7, and 2.3 shaded. The bottom row, panels (d)–(f), shows the standard normal CDF, the probability that ε < β0 + β1Xi, evaluated at −0.7, 0.7, and 2.3.]

FIGURE 12.5: PDFs and CDFs


about 0.99. Notice that the CDF function can’t be less than zero or more than one because

it is impossible to have less than zero percent or more than 100 percent of the area of the

normal density to the left of any number.
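These CDF values are easy to check numerically. Here is a quick Python sketch (using scipy, which is our assumption here rather than the software covered in the Computing Corner):

    from scipy.stats import norm

    print(norm.cdf(-0.7))   # about 0.24: the area of a standard normal to the left of -0.7
    print(norm.cdf(0.7))    # about 0.76
    print(norm.cdf(2.3))    # about 0.99
    # Symmetry: Prob(eps > -1.5) equals Prob(eps < 1.5)
    print(1 - norm.cdf(-1.5), norm.cdf(1.5))   # both about 0.93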

Since we know Yi = 1 if εi ≤ β0 + β1X1i, the probability that Yi = 1 will be the CDF evaluated at the point β0 + β1X1i.

The notation we’ll use for the normal CDF is Φ() (the Greek letter Φ is pronounced “fi,” as in wi-fi), which indicates the probability that a normally distributed random variable (ε in this case) is less than the number in parentheses. In other words

Prob(Yi = 1) = Prob(εi ≤ β0 + β1X1i)
             = Φ(β0 + β1X1i)

The probit model produces estimates of β that best fit the data. That is, to the extent possible, probit estimates will produce β̂s that lead to high predicted probabilities for observations that actually were 1s. Likewise, to the extent possible, probit estimates will produce β̂s that lead to low predicted probabilities for observations that actually were 0s. We discuss estimation after we introduce the logit model.

Logit model

Logit models also allow us to estimate parameters for a model with a dichotomous de-

pendent variable in a way that the fitted values are necessarily between 0 and 1. They are


functionally very similar to probit models. The difference from a probit model is the equa-

tion that characterizes the error term. The equation differs dramatically from the probit

equation, but it turns out this difference has little practical import.

In a logit model

Prob(Yi = 1) = Prob(εi ≤ β0 + β1X1i)
             = e^(β0 + β1X1i) / (1 + e^(β0 + β1X1i))                  (12.2)

To get a feel for the logit equation, consider when β0 + β1X1i is humongous. In the numerator, e is raised to that big number, which leads to a super big number. In the denominator will be that same number plus 1, which is pretty much the same number. Hence the probability will be very, very close to one. But no matter how big β0 + β1X1i gets, the probability will never exceed one.

If β0 + β1X1i is super negative, then the numerator of the logit function will have e raised to a huge negative number, which is the same as one over e raised to a big number, which is essentially zero. The denominator will have that number plus one, meaning the fraction is very close to 0/1, which means that the probability that Yi = 1 will be very, very close to zero. No matter how negative β0 + β1X1i gets, the probability will never go below zero.6
6 If β0 + β1X1i is zero, then Prob(Yi = 1) = 0.5. It’s a good exercise to work out why. The logit function can also be written as

Prob(Yi = 1) = 1 / (1 + e^(−(β0 + β1X1i)))
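A quick numeric check of this behavior (a Python sketch; the function below is simply the logit equation written out, with z standing in for β0 + β1X1i):

    import numpy as np

    def logit_prob(z):
        # Prob(Y = 1) for a logit model, where z stands in for beta0 + beta1*X1
        return np.exp(z) / (1 + np.exp(z))

    print(logit_prob(0))     # 0.5 when beta0 + beta1*X1 is zero
    print(logit_prob(10))    # about 0.99995: close to, but never above, one
    print(logit_prob(-10))   # about 0.00005: close to, but never below, zero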


The probit and logit models are rivals, but friendly rivals. When properly interpreted,

they yield virtually identical results. Do not sweat the difference. Simply pick probit or logit

and get on with life. Back in the early days of computers, the logit model was often preferred

because it is computationally easier than the probit model. Now powerful computers make

the issue moot.
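To see how little the choice matters in practice, here is a hedged Python sketch that fits both models to the same simulated data (the data, the parameter values, and the use of the statsmodels library are all assumptions for illustration; the Computing Corner gives the commands for the book’s software):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    df = pd.DataFrame({"gpa": rng.uniform(40, 95, size=500)})
    # Simulate admissions from a probit-style process with assumed parameters
    df["admit"] = ((-14 + 0.17 * df["gpa"] + rng.normal(size=500)) >= 0).astype(int)

    X = sm.add_constant(df["gpa"])
    probit_fit = sm.Probit(df["admit"], X).fit()
    logit_fit = sm.Logit(df["admit"], X).fit()

    # The coefficients are on different scales, but the fitted probabilities
    # are nearly identical and always stay between 0 and 1
    print(np.corrcoef(probit_fit.predict(X), logit_fit.predict(X))[0, 1])
    print(probit_fit.predict(X).min(), probit_fit.predict(X).max())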

Remember This
The probit and logit models are very similar. Both estimate S-shaped fitted lines that
are always above zero and below one.
1. In a probit model

Prob(Yi = 1) = Φ(β0 + β1X1i)

where Φ() is the standard normal CDF that indicates the probability that a
standard normal random variable is less than the number in parentheses.
2. In a logit model

Prob(Yi = 1) = e^(β0 + β1X1i) / (1 + e^(β0 + β1X1i))

12.4 Estimation

So how do we select the best β̂ given the data? The estimation process for the probit and

logit models is called maximum likelihood estimation (MLE). This process is more com-

plicated than estimating coefficients using OLS. Understanding the inner workings of MLE

is not necessary to implement or understand probit and logit models. Such an understanding

can be helpful, however, for more advanced work and we discuss the technique in more detail


in the appendix starting on page 804.

In this section we explain the properties of MLE estimates, describe the fitted values

produced by probit and logit models, and show how goodness of fit is measured in MLE

models.

Properties of MLE estimates

Happily, many major statistical properties of OLS estimates carry over to MLE estimates.

For large samples, the parameter estimates are normally distributed and consistent if there

is no endogeneity. That means we can interpret statistical significance and create confidence

intervals and p-values much as we have done with OLS models. One modest difference is

that we use z tests rather than t tests for MLE models. Z tests compare test statistics to

critical values based on the normal distribution. Because the t distribution approximates the

normal distribution in large samples, z tests and t tests are very similar practically speaking.

The critical values will continue to be the familiar values we used in OLS. In particular, we

can continue to rely on the rule of thumb that a coefficient is statistically significant if it is

more than twice as large as its standard error.

Fitted values from the probit model

The estimated β̂s from a probit model will produce fitted lines that best fit the data. Figure 12.6 shows examples. Panel (a) shows a classic probit fitted line. The observed data are


[Figure 12.6 about here: four panels plotting the probability that Y = 1 against X, each showing observed data (tick marks at 0 and 1) and the probit fitted line. Panel (a): β̂0 = −3, β̂1 = 2. Panel (b): β̂0 = −4, β̂1 = 6. Panel (c): β̂0 = −1, β̂1 = 1. Panel (d): β̂0 = 3, β̂1 = −2.]

FIGURE 12.6: Examples of Data and Fitted Lines Estimated by Probit

indicated with small vertical lines. For low values of X, Y is mostly zero, with a few

exceptions. There is a range of X where there’s a pretty even mix of Y = 0 and Y = 1

observations, and then for high values of X all Ys are one. The estimated β̂0 coefficient is −3, indicating that low values of X are associated with low probabilities that Y = 1. The estimated β̂1 coefficient is positive because higher values of X are associated with a high

probability of observing Y = 1.

To calculate fitted values for the model depicted in panel (a) of Figure 12.6 we need to

supply a value of X and use the coefficient estimates in the probit equation. Using the fact


that β̂0 = −3 and β̂1 = 2, the fitted probability of observing Y = 1 when X = 0 is

Ŷi = Prob(Yi = 1)
   = Φ(β̂0 + β̂1Xi)
   = Φ(−3 + 2 × 0)
   = Φ(−3)
   = 0.001

Based on these same coefficient estimates, the fitted probability of observing Y = 1 when X = 1.5 is

Ŷi = Φ(−3 + 2 × 1.5)
   = Φ(0)
   = 0.5
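We can verify these hand calculations with the standard normal CDF in any statistical package; a short Python check (again using scipy as our assumption):

    from scipy.stats import norm

    b0_hat, b1_hat = -3, 2                   # the panel (a) estimates
    print(norm.cdf(b0_hat + b1_hat * 0))     # Phi(-3), about 0.001
    print(norm.cdf(b0_hat + b1_hat * 1.5))   # Phi(0) = 0.5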

Panel (b) of Figure 12.6 shows a somewhat similar relationship, but here there is a starker

transition between the Y = 0 and Y = 1 observations. When X is less than about 0.5, the

Y s are all zero; when X is greater than about 1.0, the Y s are all one. This pattern of data

indicates a strong relationship between X and Y, and β̂1 is, not surprisingly, larger in panel

(b) than in panel (a). The fitted line is quite steep.

Panel (c) of Figure 12.6 shows a common situation in which the relationship between X

and Y is rather weak. The estimated coefficients produce a fitted line that is pretty flat and


we don’t even see the full S-shape emblematic of probit models. If we were to display the

fitted line for a much broader range of X values, we would see the S-shape because the fitted

probabilities would flatten out at zero for sufficiently negative values of X and the fitted

probabilities would flatten out at one for sufficiently positive values of X. Sometimes, as in

this case, the flattening of a probit fitted line occurs outside the range of observed values of

X.

Panel (d) shows a case in which the β̂0 coefficient is positive and β̂1 is negative. This case

best fits the pattern of the data in which Y = 1 for low values of X and Y = 0 for high

values of X.

Fitted values from the logit model

For logit, the fitted values are calculated as


Ŷi = e^(β̂0 + β̂1X1i + β̂2X2i + ...) / (1 + e^(β̂0 + β̂1X1i + β̂2X2i + ...))

Yes, that’s pretty ugly. Usually (but not always) there is a convenient way to get statistical

software to generate fitted values if we ask nicely. We’ll discuss how in the Computing Corner

starting on page 645.

Goodness of fit for MLE models

The overall fit of a probit or logit model is reported with a log likelihood statistic, often

written as “log L.” This statistic is a byproduct of the MLE estimation process. The log


likelihood is the log of the probability of observing the Y outcomes we did given the X data and the β̂s. It is an odd way to report how well the model fits because, well, it is

incomprehensible. The upside of the incomprehensibility of this fit statistic is that we are

less likely to put too much emphasis on it, in contrast to the more accessible R² for OLS

models which sometimes gets overemphasized (as we discussed in Section 3.7).

The log likelihood is useful in hypothesis tests involving multiple coefficients. Just as R²

feeds into the F statistic (as discussed on page 351), the log likelihood feeds into the test

statistic used when we are interested in hypotheses involving multiple coefficients in probit

or logit models as we discuss in Section 12.6 below.

Remember This
1. Probit and logit models are estimated via maximum likelihood estimation (MLE)
instead of OLS.
2. We can assess the statistical significance of MLE estimates of β̂ using z tests,
which closely resemble t tests in large samples for OLS models.


Discussion Questions
1. For each panel in Figure 12.6 on page 614 identify the value of X that
produces Ŷi = 0.5. Use the probit equation.
2. Based on Table 12.2, indicate whether the following statements are true,
false, or indeterminate.
(a) The coefficient on X1 in column (a) is statistically significant.
(b) The coefficient on X1 in column (b) is statistically significant.
(c) The results in column (a) imply a one unit increase in X1 is asso-
ciated with a 50 percentage point increase in the probability that
Y = 1.
(d) The fitted probability using the estimate in column (a) for X1i = 0
and X2i = 0 is 0.
(e) The fitted probability using the estimate in column (b) for X1i = 0
and X2i = 0 is approximately 1.
3. Based on Table 12.2, indicate the fitted probability for the following:
(a) Column (a) and X1i = 4 and X2i = 0.
(b) Column (a) and X1i = 0 and X2i = 4.
(c) Column (b) and X1i = 0 and X2i = 1.

Table 12.2: Sample Probit Results for Discussion Questions

(a) (b)
X1 0.5 1.0
(0.1) (1.0)
X2 -0.5 -3.0
(0.1) (1.0)
Constant 0.00 3.0
(0.1) (0.0)
N 500 500
log L -1000 -1200
Standard errors in parentheses


12.5 Interpreting Probit and Logit Coefficients

The LPM model may have its problems, but it is definitely easy to interpret: A one unit

increase in X is associated with a β̂1 increase in the probability that Y = 1.

Probit and logit models have their strengths, but being easy to interpret is not one of

them because the β̂s feed into the complicated equations defining the probability of observing Y = 1. These complicated equations keep the predicted values above zero and less than one,

but they can only do so by having the effect of X vary across values of X.

In this section we explain how the estimated effect of X1 on Y in probit and logit models

depends not only on the value of X1 , but also on the value of the other independent variables.

We then describe approaches to interpreting the coefficient estimates from these models.

The effect of X1 depends on the value of X1

Figure 12.7 displays the fitted line from the probit model of law school admission. Increasing

GPA from 70 to 75 leads to a small change in predicted probability (about 3 percentage

points). Increasing GPA from 85 to 90 is associated with a substantial increase in predicted probability (about 30 percentage points). The change in predicted probability then gets small – really small – when we increase GPA from 95 to 100 (about 1 percentage point).

This is certainly a more complicated story than in OLS, but it is perfectly sensible. For

someone with a very low GPA, increasing it really doesn’t get them seriously considered for

admission. For a middle range of GPAs, increases in GPAs are indeed associated with real


[Figure 12.7 about here: probit fitted line for the probability of admission against GPA (on 100 point scale). The probability rises by about 0.03 when GPA goes from 70 to 75, by about 0.30 when GPA goes from 85 to 90, and by about 0.01 when GPA goes from 95 to 100.]

FIGURE 12.7: Varying Effect of X in Probit Model


increases in probability of being admitted. After a certain point, however, higher GPAs have

little effect on the probability of being admitted because pretty much everyone with such

high GPAs is getting admitted.

The effect of X1 depends on the values of the other independent variables

There’s another wrinkle: the other variables. In the non-linear world, the effect of increasing

X1 varies not only over values of X1 , but also over values of the other variables in the model.

Suppose, for example, that we’re analyzing law school admission in terms of college GPAs

and standardized Law School Admission Test (LSAT) scores. The effect of GPA actually

depends on the value of the LSAT test score. If an applicant’s LSAT test score is very high,

then the predicted probability will be near one based on that alone and there will be very

little room for an increased GPA to affect the predicted probability of being admitted to law

school. If an applicant’s LSAT test score is low, then there will be a lot more room for an

increase in GPA to affect the predicted probability of admission.
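A small numeric illustration makes the point. The coefficients below are made up for illustration only (they are not estimates from the admissions data); the sketch compares the same five-point GPA increase for an applicant with a middling LSAT score and one with a very high LSAT score:

    from scipy.stats import norm

    def p_admit(gpa, lsat, b0=-27.0, b_gpa=0.15, b_lsat=0.1):
        # Pr(admit) = Phi(b0 + b_gpa*GPA + b_lsat*LSAT), with made-up coefficients
        return norm.cdf(b0 + b_gpa * gpa + b_lsat * lsat)

    # Same 5-point GPA increase, two different LSAT scores
    for lsat in (150, 175):
        change = p_admit(85, lsat) - p_admit(80, lsat)
        print(lsat, round(change, 3))   # about 0.27 at LSAT = 150, about 0.006 at LSAT = 175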

The fact that the estimated effect of X1 on the probability Y = 1 depends on the values

of X1 and the other independent variables creates a knotty problem: How do we convey the

magnitude of the estimated effect? In other words, how do we substantively interpret probit

and logit coefficients?

There are several reasonable ways to approach this issue. Here we focus on simulations.

If X1 is a continuous variable, we summarize the effect of X1 on the probability Y = 1 by


calculating the average increase in fitted probabilities if we were to increase X1 by one standard deviation for every observation. First we calculate the fitted values for all observations using the estimated β̂s. Then, we add one standard deviation to the value of X1 for each observation and calculate new fitted values for all observations. The average difference in these two fitted values across all observations is the simulated effect of increasing X1 by one standard deviation. The bigger β̂1, the bigger this simulated effect will be.

It is not set in stone that we add one standard deviation. Sometimes it may make sense

to calculate these quantities by simply using an increase of one or some other amount.

These simulations make the coefficients interpretable in a common sense way. We can

say things like, “The estimates imply that increasing GPA by one standard deviation is associated with an average increase of 15 percentage points in the predicted probability of being admitted to law school.” That’s still a mouthful, but much more meaningful than the β̂ itself.

If X1 is a dummy variable, we summarize the effect of X1 slightly differently. We calculate

the average increase in fitted probabilities if the value of X1 for every observation were to

go from zero to one. We first calculate the fitted values for all observations using the estimated β̂s, setting X1 = 0 for all observations and using the observed values for all other independent variables. Then we calculate the fitted values for all observations setting X1 = 1

for all observations while still using the observed values for all other independent variables.

The average difference in these two fitted values across all observations is the estimated effect


of the dummy variable X1 going from zero to one.

Our approach is called the observed-value, discrete-differences approach to estimating the

effect of an independent variable on the probability Y = 1. The “observed-value” bit comes

from the fact that we use observed values when calculating simulated probabilities. The

alternative to the observed-value approach is the average-case approach, which creates a

single composite observation whose independent variables equal sample averages. We discuss

the average-case approach in the appendix on page 807.

The “discrete-difference” part of our approach involves our use of specific differences in

the value of X1 when simulating probabilities. The alternative to the discrete-differences

approach is the marginal effects approach that calculates the effect of changing X1 by a

minuscule amount. This calculus-based approach is a bit more involved (but easy with a

simple trick) and produces results that are generally similar to the approach we present.

We discuss the marginal effects approach in the appendix on page 808 and show how to

implement the approach in the Computing Corner on pages 646 and 648.

Interpreting logit coefficients proceeds in the same way, only we use the logit equation

(Equation 12.2) instead of the probit equation. For example, for an observed-value, discrete

differences simulation of the effect of a continuous variable, we calculate logit fitted values

for all observations and then calculate logit fitted values when the variable is increased by

a standard deviation. The average difference in fitted values is the simulated effect of a one

standard deviation increase in the variable.


Remember This
1. To interpret probit coefficients using the observed-value, discrete-differences
method, use the following guide.
• If X1 is continuous:
(a) For each observation, calculate P1i as the standard fitted probability from
the probit results.

P1i = Φ(β̂0 + β̂1X1i + β̂2X2i + β̂3X3i + ...)

(b) For each observation, calculate P2i as the fitted probability when the value
of X1i is increased by one standard deviation (σX1) for each observation

P2i = Φ(β̂0 + β̂1(X1i + σX1) + β̂2X2i + β̂3X3i + ...)

(c) The simulated effect of increasing X1 by one standard deviation is the
average difference P2i − P1i across all observations.
• If X1 is a dummy variable:
(a) For each observation, calculate P0i as the fitted probability but with X1i
set to 0 for all observations.

P0i = Φ(β̂0 + β̂1 × 0 + β̂2X2i + β̂3X3i + ...)

(b) For each observation, calculate P1i as the fitted probability but with X1i
set to 1 for all observations.

P1i = Φ(β̂0 + β̂1 × 1 + β̂2X2i + β̂3X3i + ...)

(c) The simulated effect of going from 0 to 1 for the dummy variable X1 is
the average difference P1i − P0i across all observations.
2. To interpret logit coefficients using the observed-value, discrete-differences
method, proceed as with the probit model, but use the logit equation to gen-
erate fitted values.


Discussion Questions
1. Suppose X1 is a dummy variable. Explain how to calculate the effect of
X1 on the probability Y = 1.
2. Suppose X2 is a continuous variable. Explain how to calculate the effect of X2 on the probability Y = 1.


Case Study: Dog Politics

Going into the 2008 election, then-Senator

Obama was – gasp! – petless. This was virtually

unheard of in presidential politics. Did Obama’s

petlessness damage him politically? A tongue-

in-jowl debate ensued, including analysis of real

data on people’s political views and pet owner-

ship.

Table 12.3 shows results for the following

model from Mutz (2010)7 :

Prob(Obama approvali = 1) = β0 + β1Dogi + β3Ideologyi + εi

where Obamai is 1 for individuals who said they voted for Obama and 0 for everyone else.

The Dogi variable is 1 for people who own a dog and 0 for everyone else. Ideologyi measures

ideology on a 7 point scale where 1 is very liberal and 7 is very conservative.

LPM results are in the left column. The coefficient on dog ownership is highly statistically

significant; the estimate β̂Dog = −0.05 implies that dog owners were 5 percentage points less likely to support

Obama. Ideology is also highly significant with the LPM estimate, implying that people

were 20.7 percentage points less likely to support Obama for every one unit increase in their
7 Yes, the author’s name really is Mutz.


conservatism on a seven point ideology scale. The t statistics indicate that both variables

are highly statistically significant.

The fitted probabilities of voting for Obama from the LPM model range from minus eight

percent to plus 121 percent. In this case, the minimum fitted value will be for dog-owning

strong conservatives (for whom the ideology variable equals 7). The fitted value from the

LPM model for such a person is 1.421 − 1 × 0.05 − 7 × 0.207 = −0.08. The maximum fitted value will be for non-dog-owning strong liberals (for whom the ideology variable equals 1). The fitted value from the LPM model for such a person is 1.421 − 0 × 0.05 − 1 × 0.207 = 1.21.

Yeah, these values are weird; probabilities below zero and above one do not make sense.
Table 12.3: Dog Ownership and Probability of Supporting Obama in 2008 Election

                 LPM          Probit        Logit
Dog owner        -0.050*      -0.213*       -0.368*
                 (0.006)      (0.025)       (0.044)
                 [t = 8.14]   [z = 8.53]    [z = 8.44]
Ideology         -0.207*      -0.753*       -1.326*
                 (0.002)      (0.011)       (0.021)
                 [t = 106.0]  [z = 71.68]   [z = 63.98]
Constant         1.421*       3.386*        5.977*
                 (0.009)      (0.049)       (0.095)
                 [t = 159.9]  [z = 69.74]   [z = 62.84]
N                15,596       15,596        15,596
R²               0.42
log L                         -6,720.87     -6,693.98
Minimum Ŷi       -0.080       0.018         0.024
Maximum Ŷi       1.213        0.995         0.990
Standard errors in parentheses
* indicates significance at p < 0.05

The second and third columns of Table 12.3 display probit and logit results. These

models are, as we know, designed not to produce such odd fitted values and, in so doing, to


better capture the relationship between the independent and dependent variables.

Interpreting statistical significance in these models is very familiar given our work with

OLS. For large samples, MLE coefficients divided by their standard errors will come from

normal distributions with means of zero. Hence, we can ascertain statistical significance

easily and quickly simply by looking at the z statistics, where the critical values are based

on the normal distribution. Given that the t statistics we used for OLS are approximately

normally distributed in large samples, we use essentially the same critical values and generate

essentially similar p-values given the ratio of our coefficients to their standard errors. The

estimated coefficient on dog ownership in the probit model is highly statistically significant

with a z statistic of 8.53. The z statistic for the dog owner coefficient in the logit model

is 8.44, meaning the coefficient is also statistically significant. The coefficient on ideology

is very significant in both the probit and logit models with z statistics of 71.68 and 63.98,

respectively.

Interpreting the coefficients is not so straightforward. What exactly do they mean? Does

the fact that β̂Dog = −0.213 in the probit model imply that dog owners are 21.3 percentage points less likely to vote for Obama? Does the fact that β̂Ideology = −0.753 in the probit

model imply that people get 75.3 percentage points less supportive of Obama for every one

unit more conservative they get?

No. No. (No!) We’ll focus on the probit model, but the logic is analogous for the logit

model. The coefficient estimates from probit feed into the complicated probit equation on


page 610. We must use our simulation technique to understand the substantive implications

of our probit estimates. Table 12.4 interprets the probit coefficients in a substantively useful

way. Because the Dog variable is a dummy variable, the estimated effect of β̂Dog is calculated

by comparing the fitted probabilities for all individuals when the value of Dogi is set to 0

for all people and when the value of Dogi is set to 1 for all people. The average difference

in probabilities is -0.052. (This effect is eerily similar to the LPM estimate, a common

occurrence.)
Table 12.4: Estimated Effect of Dog Ownership and Ideology on Probability of Supporting Obama in 2008
Election

Variable Simulated change Probit Logit


Dog owner from 0 to 1, ideology at actual value -0.052 -0.051
Ideology increase by 1, dog owner at actual value -0.180 -0.181

In effect, what we’re doing is simulating the change in support for Obama if no one owned a dog compared to if everyone owned a dog. If β̂1 is big, there will be big differences in these probabilities because the first set of probabilities will not have β̂1 in the equation (because we multiply β̂1 by zero) and the second set of probabilities will have β̂1 (because we multiply β̂1 by 1). If β̂1 is very small, then the two probabilities will differ very little.

Table 12.4 also shows the estimated effect of making everyone one unit more conservative

on the ideology measure. First we calculated fitted values from the probit model for everyone

using the observed values of all independent variables. Then we calculated fitted values for

everyone using their actual ideology score plus one. The average difference in these two


fitted probabilities across the whole population is the estimated effect of a one unit change

in ideology. The average difference in probabilities for the probit model is -0.180. In other

words, our probit coefficient on ideology implies that increasing conservative ideology by one

unit is associated with an 18 percentage point decline in support for Obama.

The logit estimated effects in Table 12.4 are generated via the same process, but plugging

the logit-estimated coefficients into the logit equation instead of the probit equation. The

logit estimated effects for each variable are virtually identical to the probit estimated effects.

This is almost always the case because the two models are doing the same work, just with

slightly different assumptions about the error term.

Figure 12.8 helps us visualize the results by displaying the fitted values from the LPM,

probit and logit estimates. The solid line in each panel is the fitted line for non-dog owners.

The dashed line in each panel is the fitted line for dog owners. In all panels we see that fitted

probabilities of supporting Obama decline dramatically as an individual’s ideology becomes

more conservative. We also see that dog owners are less likely to support Obama, although

this effect doesn’t seem to have as much impact as ideology does. The LPM lines do not

dramatically differ from the probit and logit lines, although they go above one and below

zero. The probit and logit fitted lines look a bit different than the probit and logit fitted lines

we have seen so far because in this case the probabilities are declining as the independent

variable increases, making the lines look more like a backward S than the S shape we’ve seen

so far. Regardless, the probit and logit fitted lines are visually indistinguishable. In fact,


[Figure 12.8 about here: three panels plotting the probability of voting for Obama against conservative ideology (1 to 7) for the linear probability, probit, and logit models. In each panel the solid line is the fitted line for non-dog owners and the dashed line is the fitted line for dog owners.]

FIGURE 12.8: Fitted Lines from LPM, Probit, and Logit Models

the fitted values from the probit and logit models correlate at 0.9996; such high correlations

are not unusual when comparing fitted probit and logit values.

These results constitute only an initial cut on the analysis. We are concerned, as always,

about possible bias. Is there any source of endogeneity missing in the model? In particular,

could there be something not currently in the model that is correlated with dog ownership?

If so, the anti-Obama effect of dogs could be spurious.

12.6 Hypothesis Testing about Multiple Coefficients

Sometimes we are interested in hypotheses about multiple coefficients. That is, we might

not simply want to know if β1 is different from zero, but whether it is bigger than β2. In

this section we show how to conduct such tests when using MLE models such as probit and

logit.


In the OLS context we used F tests to test hypotheses involving multiple coefficients; we

discussed these tests in Section 7.4 of Chapter 7. The key idea was to compare the fit of a

model that imposed no restrictions to the fit of a model that imposed the restriction implicit

in the null hypothesis. If the null hypothesis is true, then forcing the computer to spit back

results consistent with the null will not reduce the fit very much. If the null hypothesis is

false, though, forcing the computer to spit back results consistent with it will reduce the fit

substantially.

We’ll continue to use the same logic here. The difference is that we do not measure fit

with R² as with OLS, but with the log-likelihood as described in Section 12.5. We will

look at the difference in log likelihoods from the restricted and unrestricted estimates. The

statistical test is called a likelihood ratio test (LR test) and the test statistic is

LR = 2(logL_UR − logL_R)

If the null hypothesis is true, the log-likelihood should be pretty much the same for the

restricted and unrestricted versions of the model. Hence a big difference in the likelihoods

indicates that the null is false. Statistical theory implies that if the null hypothesis is true,

the difference in log-likelihoods will follow a specific distribution and hence we can use that

distribution to calculate critical values for hypothesis testing. The distribution is a χ² with degrees of freedom equal to the number of equal signs in the null hypothesis (χ is the

Greek letter chi, pronounced “ky” as in Kyle). We show in the Computing Corner how to


generate critical values and p-values based on this distribution. The appendix provides more

information on the χ² distribution starting on page 783.8

An example makes this process clear. It’s not hard. Suppose we want to know more

about pet politics. Perhaps our pets reveal or even cause some deep political feelings. As

Mutz (2010) noted, dog owners “might have been drawn more to the emotionally effusive

McCain ... If one of the candidates were to jump on you at the door and lick your ear, it

would surely be McCain.” Obama, on the other hand, was more cat-like and emotionally

cool.

So let’s assess whether dog owners and cat owners differed politically. The unrestricted

version of the model is

Prob(Obama approvali = 1) = β0 + β1Dogi + β2Cati + β3Ideologyi + εi

This is the unrestricted equation because we are letting the coefficients on Dog and Cat be

whatever best fits the data.

The null hypothesis is a hypothesis that the effect of owning dogs and cats is the same:

H0: β1 = β2. We impose this null hypothesis on the model by forcing the computer to give us results where the coefficients on Dog and Cat are equal. We do so by replacing β2 with β1
8 It may seem odd that this is called a likelihood ratio test when the statistic is the difference in log likelihoods. The test can also be considered as the log of the ratio of the two likelihoods. Because log(L_UR/L_R) = logL_UR − logL_R, we can use the form we do. Most software reports the log likelihood, not the (unlogged) likelihood, so it’s more convenient to use the difference of log likelihoods rather than the ratio of likelihoods. The 2 is there just to make things work; don’t ask.


in the model (which we can do because under the null hypothesis they are equal), yielding

a restricted model of

Prob(Obama approvali = 1) = β0 + β1Dogi + β1Cati + β3Ideologyi + εi
                          = β0 + β1(Dogi + Cati) + β3Ideologyi + εi

Therefore, we need simply to estimate these two models, calculate the difference in log

likelihoods, and then compare this difference to a critical value from the appropriate distri-

bution. We estimate the restricted model by creating a new variable, which is Dogi + Cati .

Table 12.5 shows the results. In the unrestricted column are results from the model in

which the dog owner and cat owner variables are entered separately. At the bottom is the

unrestricted log likelihood that will feed into the LR test.

Before we do anything more, this is a good time to do a bit of common sense approxi-

mating. The coefficients on dog and cat in the unrestricted model in Table 12.5 are both

negative and statistically significant, but the coefficient on dog is almost three times the

size of the cat coefficient. Both coefficients have relatively small standard errors, so it is

reasonable to expect there is a difference, suggesting H0 is false.

In the restricted model column are coefficients from the model in which the two separate

dog and cat variables have been replaced by a single dog plus cat variable. At the bottom

is the restricted log likelihood that we will feed into the LR test.


Table 12.5: Unrestricted and Restricted Probit Results for Likelihood Ratio Test

                        Unrestricted    Restricted model for
                        model           H0: βDog = βCat
Dog owner               -0.207*
                        (0.025)
                        [z = 8.28]
Cat owner               -0.065*
                        (0.026)
                        [z = 2.48]
Ideology                -0.754*         -0.755*
                        (0.011)         (0.011)
                        [z = 72.52]     [z = 71.85]
Dog owner + Cat owner                   -0.139*
                                        (0.017)
                                        [z = 8.04]
Constant                3.410*          3.410*
                        (0.050)         (0.050)
                        [z = 69.67]     [z = 68.812]
N                       15,596          15,596
log L                   -6726.79        -6733.941
Standard errors in parentheses; * indicates significance at p < 0.05

From this table, it is easy to calculate the LR test statistic:

LR = 2(logL_UR − logL_R)
   = 2(−6726.79 + 6733.941)
   = 14.302

Using the tools described in the Computing Corner, we can determine the p-value and if

it is less than the significance level we have set, we can reject the null hypothesis. In this

case, the p-value associated with a LR value of 14.302 is 0.0002, far below a conventional

significance level of 0.05.
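As a hedged illustration of the arithmetic (the Computing Corner gives the commands for the book’s software), the following Python lines reproduce the LR statistic from Table 12.5 and look up its χ² p-value and critical value:

    from scipy.stats import chi2

    logL_unrestricted = -6726.79     # from Table 12.5
    logL_restricted = -6733.941

    LR = 2 * (logL_unrestricted - logL_restricted)
    print(LR)                         # about 14.302
    print(chi2.sf(LR, df=1))          # p-value, about 0.0002 (one equal sign, so 1 degree of freedom)
    print(chi2.ppf(0.95, df=1))       # critical value at the 0.05 level, about 3.84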

Or, equivalently, we can reject the null hypothesis if LR statistics are greater than the

critical value for our significance level. The critical value for a significance level of 0.05 is


3.84, and our LR test statistic of 14.302 far exceeds that. This means we can easily reject the null hypothesis that dog ownership and cat ownership have the same effect.9 Dogs are more

political. (Not that cats care!)

9 Of course, the sensible interpretation here is that the kinds of people who own dogs and cats are more likely to
have certain political views.


Remember This
Use a likelihood ratio (LR) test to test hypotheses involving multiple coefficients for
probit and logit models.
1. Estimate an unrestricted model that is the full model:
Prob(Yi = 1) = β0 + β1X1i + β2X2i + β3X3i + εi

2. Write down the null hypothesis.


3. Estimate a restricted model by using the conditions in the null hypothesis to
restrict the full model:
• For H0: β1 = β2, the restricted model is
Prob(Yi = 1) = β0 + β1X1i + β1X2i + β3X3i + εi
             = β0 + β1(X1i + X2i) + β3X3i + εi

• For H0: β1 = β2 = 0, the restricted model is

Prob(Yi = 1) = β0 + 0 × X1i + 0 × X2i + β3X3i + εi
             = β0 + β3X3i + εi

4. Use the log-likelihood values from the unrestricted and restricted models to cal-
culate the LR test statistic:
LR = 2(logL_UR − logL_R)

5. The larger the difference between the log likelihoods, the more the null hypothesis
is reducing fit and, therefore, the more likely we are to reject the null.
• The test statistic is distributed according to a χ² distribution with degrees
of freedom equal to the number of equal signs in the null hypothesis.
• Code for generating critical values and p-values for this distribution is in the
Computing Corner on pages 646 and 648.


Case Study: Civil Wars

Civil wars produce horrific human misery. They

are all too often accompanied by atrocities and a

collapse of civilization.

What causes civil wars? Obviously it’s a com-

plicated subject, but is it the case that civil wars

are much more likely in countries that are divided along ethnic or religious lines? Many think

so, arguing that these pre-existing divisions can explode into armed conflict. Stanford pro-

fessors James Fearon and David Laitin (2003) aren’t so sure. They suspect that instability

from poverty is more important.

In this case study we explore these possible determinants of civil war. We’ll see that while

omitted variable bias plays out in a broadly similar fashion across LPM and probit models,

the two approaches nonetheless provide rather different pictures about what is going on.

The dependent variable is civil war onset, coded from 1945 to 1999 for 161 countries that had a population of at least half a million in 1990. It is 1 for country years in which a civil

war began and 0 in all other country years. We’ll look at three independent variables:

• Ethnic fractionalization measures ethnic divisions within each country; it ranges from

0.001 to 0.93 with a mean of 0.39 and a standard deviation of 0.29. The higher the value

of this variable, the more divided a country is ethnically.


• Religious fractionalization measures religious divisions within each country; it ranges

from 0 to 0.78 with a mean of 0.37 and a standard deviation of 0.22. The higher the

value of this variable, the more divided a country is religiously.

• GDP is lagged GDP per capita. The GDP measure is lagged so as not to be tainted by

the civil war itself, which almost surely had an effect on the economy. It is measured in

thousands of US dollars that are adjusted for inflation. The variable ranges from 0.05

to 66.7 with a mean of 3.65 and a standard deviation of 4.53.

Table 12.6 shows results for LPM and probit models. For each method we present results

with and without GDP. We see a similar pattern when GDP is omitted. In the LPM (a)

specification, ethnic fractionalization is statistically significant and religious fractionalization

is not. The same thing is true for the probit (a) specification that does not have GDP.

However, Fearon and Laitin’s suspicion was supported by both LPM and probit analyses.

When GDP is included, the ethnic fractionalization variable becomes insignificant in both

LPM and probit (although it is close to significant in the LPM model). The GDP variable

is highly statistically significant in both LPM and probit models. So the general conclusion

that GDP seems to matter more than ethnic fractionalization does not depend on which model we use for this dichotomous dependent variable.

However, the two models do tell slightly different stories. Figure 12.9 shows the fitted

lines from the LPM and probit models for the specifications that include the GDP variable.


Table 12.6: Probit Models of the Determinants of Civil Wars

                             LPM                        Probit
                       (a)          (b)          (a)          (b)
Ethnic                0.019*       0.012        0.451*       0.154
fractionalization    (0.006)      (0.006)      (0.141)      (0.149)
                    [t = 3.30]   [z = 1.84]   [z = 3.20]   [z = 1.03]
Religious            -0.002        0.002       -0.051        0.033
fractionalization    (0.008)      (0.008)      (0.185)      (0.198)
                    [t = 0.33]   [z = 0.27]   [z = 0.28]   [z = 0.17]
GDP per capita                   -0.0015*                  -0.108*
(in $1000 US)                    (0.0004)                  (0.024)
                                [z = 3.97]                [z = 4.58]
Constant              0.010*       0.017*      -2.297*      -1.945*
                     (0.003)      (0.004)      (0.086)      (0.108)
                    [z = 3.05]   [z = 4.49]   [z = 26.67]  [z = 18.01]
N                     6610         6373         6610         6373
R2                    0.002        0.004
σ̂                     0.128        0.128
log L                                         -549.092     -508.545
Standard errors in parentheses; * indicates significance at p < 0.05

When calculating these lines, we held the ethnic and religious variables at their mean values.

The LPM model has its characteristic, brutally straight fitted line. It implies that, whatever its starting wealth, a country's probability of civil war declines at the same rate as it gets wealthier. It does this

to the point of not making sense because the fitted probabilities are negative (and hence

meaningless) for countries with per capita GDP above about $20,000 per year. The probit

model has a curve. We’re seeing only a hint of the S-curve because even the poorest countries

have less than a 4% probability of having a civil war. But we do see that the effect of GDP

is concentrated among the poorest countries. For them, the effect of income is relatively

higher, certainly higher than the LPM model suggests. But for countries with about $10,000

per capita GDP per year, there is basically no effect of income on the probability of a civil war.

[Figure 12.9 plots the probability of civil war (vertical axis, roughly −0.08 to 0.04) against GDP per capita in thousands of US dollars (horizontal axis, 0 to 70), with one fitted line from the probit model and one from the linear probability model.]

FIGURE 12.9: Fitted Lines from LPM and Probit Models for Civil War Data (Holding Ethnic and Religious Variables at Their Means)

So even as the broad conclusion that GDP matters is similar in the LPM and probit

models, the way in which GDP matters is quite different across the models.
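A rough R sketch of the fitted lines in Figure 12.9, using the rounded coefficients from the (b) columns of Table 12.6 and holding ethnic and religious fractionalization at their means (0.39 and 0.37); because the coefficients are rounded, the exact values differ slightly from the published figure:

## GDP per capita in thousands of US dollars
GDP = seq(0, 70, by = 1)
## Probit (b): constant -1.945, ethnic 0.154, religious 0.033, GDP -0.108
ProbitFit = pnorm(-1.945 + 0.154*0.39 + 0.033*0.37 - 0.108*GDP)
## LPM (b): constant 0.017, ethnic 0.012, religious 0.002, GDP -0.0015
LPMFit = 0.017 + 0.012*0.39 + 0.002*0.37 - 0.0015*GDP
plot(GDP, ProbitFit, type = "l")
lines(GDP, LPMFit, lty = 2)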


12.7 Conclusion

Things we care about are often dichotomous, whether it is unemployment, vote choice,

graduation, war, or countless other phenomena. We can use OLS to analyze such data via

the linear probability model, but we risk producing models that do not fully reflect the

relationships in the data.

The solution is to fit an S-shaped relationship via probit or logit models. Probit and

logit models are, as a practical matter, interchangeable as long as sufficient care is taken

when interpreting coefficients. The cost of these models is that they are more complicated,

especially with regard to interpreting the coefficients.

We’re in good shape when we can:

• Section 12.1: Explain the linear probability model. How do we estimate it? How do we

interpret the coefficient estimates? What are two drawbacks to it?

• Section 12.2: Describe what a latent variable is and how it relates to the observed

dichotomous variable.

• Section 12.3: Describe the probit and logit models. What is the equation for the

probability that Yi = 1 for a probit model? What is the equation for the probability

that Yi = 1 for a logit model?

• Section 12.4: Discuss the estimation procedure used for probit and logit and how to generate

fitted values.


• Section 12.5: Explain how to interpret probit coefficients using the observed-value,

discrete-differences approach.

• Section 12.6: Explain how to test hypotheses about multiple coefficients using probit

or logit models.

Further Reading

There is no settled consensus on the best way to interpret probit and logit coefficients.

Substantive conclusions rarely depend on the mode of presentation, so any of the methods

is legitimate. Hanmer and Kalkan (2013) argue for the observed-value approach and against

the average-value approach.

MLE models do not inherit all properties of OLS models. In OLS, heteroscedasticity

does not bias coefficient estimates; it only makes the conventional equation for the standard

error of β̂1 inappropriate. In probit and logit models, heteroscedasticity can induce bias

(Alvarez and Brehm 1995), but correcting for heteroscedasticity may not always be feasible

or desirable (Keele and Park 2006).

King and Zeng (2001) discuss small sample properties of logistic models, noting in partic-

ular that small-sample bias can be large when the dependent variable is a rare event, with

only a few observations falling in the less frequent category.

Probit and logit models are examples of limited dependent variable models. In these


models, the dependent variable is restricted in some way. As we have seen, the dependent

variable in probit models is limited to two values, 1 and 0. MLE can be used for many other

types of limited dependent variable models. If the dependent variable is ordinal with more

than two categories (e.g. answers to a survey question where answers are very satisfied,

satisfied, dissatisfied, very dissatisfied), an ordered probit model is useful. It is based on

MLE methods and is a modest extension of the probit model. Some dependent variables

are categorical. For example, we may be analyzing the mode of transportation to work

(with walking, biking, driving, and taking public transportation as options). In such a

case, multinomial logit is useful, another MLE technique. Other dependent variables are

counts (number of people on a bus) or lengths of time (how long between buses or how long

someone survives after a disease diagnosis). Models with these dependent variables also can

be estimated with MLE methods, such as count models and duration models. Long (1997)

introduces maximum likelihood and covers a broad variety of MLE techniques. King (1989)

explains the general approach. Box-Steffensmeier and Jones (2004) is an excellent guide to

duration models.

Key Terms
• Cumulative distribution function (CDF) (608)
• Dichotomous (591)
• Linear probability model (LPM) (592)
• Latent variable (602)


• Likelihood ratio (LR) test (632)


• Logit model (610)
• Log likelihood (616)
• Maximum likelihood estimation (MLE) (612)
• Probit model (606)

Computing Corner

Stata

To implement the observed-value, discrete-differences approach to interpreting estimated


effects for probit in Stata, do the following.
• If X1 is continuous:
** Estimate probit model
probit Y X1 X2

** Generate predicted probabilities for all observations


gen P1 = normal(_b[_cons] + _b[X1]*X1 + _b[X2]*X2) if e(sample)
** "normal" refers to the normal CDF function
** _b[_cons] is β̂0, _b[X1] is β̂1, and so on
** "e(sample)" tells Stata to only use observations used in the probit analysis

** Or, equivalently, generate predicted values via the "predict" command


probit Y X1 X2
predict P1 if e(sample)

** Create variable with X1 + standard deviation of X1 (which here equals 1)


gen X1Plus = X1 + 1

** Generate predicted probabilities for all observations using X1Plus


gen P2 = normal(_b[_cons] + _b[X1]*X1Plus + _b[X2]*X2) if e(sample)

** Calculate difference in probabilities for all observations


gen PDiff = P2 - P1


** Display results
sum PDiff if e(sample)

• If X1 is a dummy variable:
** Estimate probit model
probit Y X1 X2

** Generate predicted probabilities for all observations with X1=0


gen P0 = normal(_b[_cons] + _b[X1]*0 + _b[X2]*X2) if e(sample)

** Generate predicted probabilities for all observations with X1=1
gen P1 = normal(_b[_cons] + _b[X1]*1 + _b[X2]*X2) if e(sample)

** Calculate difference in probabilities for all observations


gen PDiff = P1 - P0

** Display results
sum PDiff if e(sample)
• The margins command produces average marginal effects, which are the average of the
slopes with respect to each independent variable evaluated at observed values of the
independent variables. See page 808 for more details. These are easy to implement in
Stata, with similar syntax for both probit and logit models.
probit Y X1 X2
margins, dydx(X1)
• To conduct an LR test in Stata, use the lrtest command. For example, to test the
null hypothesis that the coefficients on both X2 and X3 equal zero we can first run the
constrained model and save the results using the estimates store command:
probit Y X1
estimates store RESTRICTED
Then run the unconstrained command followed by the lrtest command and the name
of the constrained model.
probit Y X1 X2 X3
lrtest RESTRICTED
Stata will produce a value of the likelihood ratio statistic and a p-value. We can imple-
ment an LR test manually by simply running the restricted and unrestricted models and
plugging the log likelihoods into the likelihood ratio test equation of 2(logL_UR − logL_R) (as explained on page 635). To ascertain the critical value for an LR test with d.f. = 1 and


0.95 confidence level, type


display invchi2(1, 0.95)

To ascertain the p-value for likelihood ratio test with d.f. = 1 and substituting log-
likelihood values in for logLunrestricted and logLrestricted, type
display 1-chi2(1, 2*(logLunrestricted - logLrestricted))
Even easier, we can use Stata’s test command to conduct a Wald test, which is a test
that is asymptotically equivalent to the likelihood ratio test (which is a fancy way of
saying the test statistics get really close to each other as the sample size goes to infin-
ity). For example,
probit Y X1 X2 X3
test X2 = X3 =0
• To estimate a logit model in Stata, use similar logic and structure as for a probit model.
Here are the key differences for the continuous variable example:
logit Y X1 X2
gen LogitP1 = exp(_b[_cons] + _b[X1]*X1 + _b[X2]*X2) / ///
    (1+exp(_b[_cons] + _b[X1]*X1 + _b[X2]*X2))
gen LogitP2 = exp(_b[_cons] + _b[X1]*X1Plus + _b[X2]*X2) / ///
    (1+exp(_b[_cons] + _b[X1]*X1Plus + _b[X2]*X2))

• To graph fitted lines from a probit or logit model that has only one independent variable,
first estimate the model and save the fitted values. Then use the following command:
graph twoway (line ProbitFit X, connect(l))

R
To implement a probit or logit analysis in R, we use the glm function, which stands for
“generalized linear model” (as opposed to the lm function, which stands for “linear model”).

• If X1 is continuous:
## Estimate probit model and name it Result (or anything we choose)
Result = glm(Y ~ X1 + X2, family = binomial(link = "probit"))

## Create variable named P1 with fitted values from probit model
P1 = pnorm(Result$coef[1] + Result$coef[2]*X1 + Result$coef[3]*X2)
## pnorm is the normal CDF function in R
## Result$coef[1] is β̂0, Result$coef[2] is β̂1, and so on
## Create variable named X1Plus which is X1 + standard deviation of X1 (which


## here equals 1)
X1Plus = X1 + 1

## Create P2: fitted value using X1Plus instead of X1


P2 = pnorm(Result$coef[1] + Result$coef[2]*X1Plus + Result$coef[3]*X2)

## Calculate average difference in two fitted probabilities


mean(P2-P1, na.rm=TRUE)
## ‘‘na.rm=TRUE’’ removes missing data when calculating the mean

• If X1 is a dummy variable:
## Estimate probit model and name it "Result" (or anything we choose)
Result = glm(Y ~ X1 + X2, family = binomial(link = "probit"))

## Create: P0 fitted values with X1 set to zero


P0 = pnorm(Result$coef[1] + Result$coef[2]*0 + Result$coef[3]*X2)

## Create P1: fitted values with X1 set to one


P1 = pnorm(Result$coef[1] + Result$coef[2]*1 + Result$coef[3]*X2)

## Calculate average difference in two fitted probabilities


mean(P1-P0, na.rm=TRUE)
• To produce average marginal effects (as discussed on page 808) for continuous X1 , use
the following.
MargEffects = Result$coef[2]*dnorm(Result$coef[1] + Result$coef[2]*X1
              + Result$coef[3]*X2)
## dnorm is the normal PDF function in R
mean(MargEffects, na.rm=TRUE)
• To estimate an LR test of H0: β1 = β2 in R, do the following:
## Unrestricted probit model
URModel = glm(Y ~ X1 + X2 + X3, family = binomial(link = "probit"))

## Restricted probit model
X1plusX2 = X1 + X2
RModel = glm(Y ~ X1plusX2 + X3, family = binomial(link = "probit"))

## Calculate the LR test statistic using the logLik function


LR = 2*(logLik(URModel)[1] - logLik(RModel)[1])


## Critical value for LR test with d.f. =1 and 0.95 confidence level
qchisq(0.95, 1)

## p-value for likelihood ratio test with d.f. =1


1-pchisq(LR, 1)
– If we wanted to test H0: β1 = β2 = 0, we would use a different restricted equation:
## Restricted probit model
RModel = glm(Y ~ X3, family = binomial(link = "probit"))
and proceed with the rest of the test.
• To estimate a logit model in R, use similar logic and structure as for a probit model.
Here are the key differences for the continuous variable example:
Result = glm(Y ~ X1 + X2, family = binomial(link = "logit"))
P1 = exp(Result$coef[1] + Result$coef[2]*X1 + Result$coef[3]*X2) /
(1+ exp(Result$coef[1] + Result$coef[2]*X1 + Result$coef[3]*X2))
P2 = exp(Result$coef[1] + Result$coef[2]*X1Plus + Result$coef[3]*X2) /
(1+ exp(Result$coef[1] + Result$coef[2]*X1Plus + Result$coef[3]*X2))
• To graph fitted lines from a probit or logit model that has only one independent vari-
able, first estimate the model and save the model. In this case, we’ll save a probit
model as ProbResults. Create a new variable that spans the range of the independent
variable. In this case, we create a variable called Xsequence that ranges from 1 to 7 in
steps of 0.05 (e.g., the first value is 1, the next is 1.05, and so on). We then plot fitted
lines using the coefficients from the ProbResults model and this Xsequence variable:
Xsequence = seq(1, 7, 0.05)
plot(Xsequence, pnorm(ProbResults$coef[1] + ProbResults$coef[2]*Xsequence),
type = "l")

Exercises
1. In this question, we explore the effect of opinion about the Iraq War on the presidential
election in 2004 using the dataset BushIraq.dta. The variables we will focus on are
listed in Table 12.7.
a. Estimate two probit models: one with only ProIraqWar02 as the independent vari-
able and the other with all the independent variables listed in the table. Which is
better? Why? Comment briefly on statistical significance.


Table 12.7: Variables for Iraq War Questions

Variable Description
Bushvote04 Dummy variable =1 if person voted for President Bush in 2004
ProIraqWar02 Position on Iraq War, ranges from 0 (opposed war) to 3 (favored war)
Party02 Partisan affiliation, ranges from 0 for strong Democrats to 6 for strong Republicans
BushVote00 Dummy variable =1 if person voted for President Bush in 2000
CutRichTaxes02 Views on cutting taxes for wealthy, ranges from 0 (oppose) to 2 (favor)
Abortion00 Views on abortion, ranges from 1 (strongly oppose) to 4 (strongly support)

b. Use the model with all the independent variables and the observed-value, discrete-
differences approach to calculate the effect of a one standard deviation increase in
ProIraqWar02 on support for Bush.
c. Use the model with all the independent variables listed in the table and the observed-
value, discrete-differences approach to calculate the effect of a one standard deviation
increase in Party02 on support for Bush. Compare to the effect of ProIraqWar02.
d. Use Stata’s marginal effects command to calculate the marginal effects of all inde-
pendent variables. Briefly comment on differences from calculations in parts (a) and
(c).
e. Run the same model using logit and
i Briefly comment on patterns of statistical significance compared to probit results.
ii Briefly comment on coefficient values compared to probit results.
iii Use Stata’s margins commands to calculate marginal effects of variables and
briefly comment on differences or similarities from probit results.
f. Calculate the correlation of the fitted values from the probit and logit models.
g. Test the null hypothesis that the coefficients on the three policy opinion variables
(ProIraqWar02, CutRichTaxes02, Abortion00) all equal zero using a likelihood ratio test. Do this work manually (showing your work) and using the Stata commands
for a likelihood ratio test.
2. Public attitudes toward global warming influence the policy response to the issue. The
dataset EnvSurvey.dta provides data from a nationally representative survey of the
American public that asked multiple questions about the environment and energy. Table
12.8 lists the variables.
a. Use a linear probability model to estimate the probability of saying that global
warming is real and caused by humans (the dependent variable is HumanCause2).
Control for sex, being white, education, income, age, and partisan identification.


Table 12.8: Variables for Global Warming Questions

Variable Description
Male Dummy variable = 1 for men
White Dummy variable = 1 for whites
Education Education, ranging from 1 for no formal education to 14 for
professional/doctorate degree (treat as a continuous variable)
Income Income, ranging from 1 for household income < $5000 to 19 for
household income > $175000 (treat as a continuous variable)
Age Age
Party7 Partisan identification, ranging from 1 for strong Republicans, 2 for
not-so-strong Republican, 3 leans Republican, 4 undecided/independent,
5 leans Democrat, 6 not-so-strong Democrat, 7 strong Democrat

i. Which variable has the most important influence on this opinion? Why?
ii. What are the minimum and maximum fitted values from this model? Discuss
implications briefly.
iii. Add age-squared to the model. What is the effect of age? Use a simple sketch if
necessary, with key point(s) identified.
b. Use a probit model to estimate the probability of saying that global warming is
real and caused by humans (the dependent variable is HumanCause2). Use the
independent variables from part (a), including the age-squared variable.
i. Compare statistical significance with LPM results.
ii. What are the minimum and maximum fitted values from this model? Discuss
implications briefly.
iii. Use the observed-value, discrete-differences approach to indicate the effect of
partisan identification on the probability of saying global warming is real and
caused by humans. For simplicity, simulate the effect of an increase of 1 unit
on this 7 point scale (as opposed to the effect of one standard deviation, as we
have done for continuous variables in other cases). Compare to LPM and Stata’s
“marginal effects” interpretations.
iv. Use the observed-value, discrete-differences approach to indicate the effect of
being male on the probability of saying global warming is real and caused by
humans. Compare to LPM and Stata’s “marginal effects” interpretations.
c. The survey also included a survey experiment in which respondents were randomly assigned to different wordings of an additional question about global warming. The idea was to see which frames were most likely to lead people to agree that the earth is getting warmer. The variable we analyze here is called WarmAgree. It records whether or not respondents agreed that the earth's average temperature is rising.

FIGURE 12.10: Figure Included for Some Respondents in Global Warming Survey Experiment

The experimental treatment consisted of four different ways to phrase the
question.
• The variable Treatment equals 1 for people who were asked “Based on your
personal experiences and observations, do you agree or disagree with the following
statement: The average temperature on earth is getting warmer.”
• The variable Treatment equals 2 for people who were given the following infor-
mation and Figure 12.10 before asking them if they agreed or not that average
temperature of the earth is getting warmer: “The following figure shows the
average global temperature compared to the average temperature from 1951-
1980. The temperature analysis comes from weather data from more than 1,000
meteorological stations around the world, satellite observations of sea surface
temperature, and Antarctic research station measurements.”
• The variable Treatment equals 3 for people who were given the following in-
formation before asking them if they agreed or not that average temperature
of the earth is getting warmer: “Scientists working at the National Aeronautics
and Space Administration (NASA) have concluded that the average global tem-
perature has increased by about a half degree Celsius compared to the average
temperature from 1951-1980. The temperature analysis comes from weather data
from more than 1,000 meteorological stations around the world, satellite observa-
tions of sea surface temperature, and Antarctic research station measurements.”
• The variable Treatment equals 4 for people who were simply asked “Do you agree
or disagree with the following statement: The average temperature on earth is
getting warmer.” This is the control group.
Which frame was most effective in affecting opinion about global warming?
3. What determines whether organizations fire their leaders? It’s often hard for outsiders


to observe performance, but in sports many facets of performance (particularly winning


percentage) are easily observed. Michael Roach (2013) provides data on NFL football
coaches' performance and firing. Table 12.9 lists the variables.
Table 12.9: Variables for Football Coach Questions

Variable name Description


FiredCoach A dummy variable if the football coach was fired during or after the season
WinPct The winning percentage of the team
LagWinPct The winning percentage of the team in the previous year
ScheduleStrength A measure of schedule difficulty based on records of opposing teams
NewCoach A dummy variable indicating if the coach was new
Tenure The number of years the coach has coached the team

a. Run a probit model explaining whether the coach was fired as a function of winning
percentage. Graph the fitted values from this model on the same graph as the fitted values from a bivariate linear probability model (use the lfit command to plot LPM results). Explain the differences in the plots.
b. Estimate LPM, probit, and logit models of coach firings using winning percent, lagged
winning percent, a new coach dummy, strength of schedule, and coach tenure as
independent variables. Are the coefficients substantially different? How about the z
statistics?
c. Indicate the minimum, mean, and maximum of the fitted values for each model and
briefly discuss.
d. What are the correlations of the three fitted values?
e. It’s kind of odd to say that lag winning percentage affects the probability that new
coaches got fired because they were not coaching for the year associated with the
lagged winning percentage. Include an interaction for the fired last year dummy and
lagged winning percentage. The effect of lagged winning percentage on probability
of being fired is the sum of the coefficients on lagged winning percentage and the
interaction. Test the null hypothesis that lagged winning percentage has no effect
on coaches who are new (meaning coaches for whom firedlastyear = 1). Use a Wald
test (which is most convenient) and likelihood ratio test.
4. Are members of Congress more likely to meet with donors? To answer this question,
Kalla and Broockman (2014) conducted a field experiment in which they had political
activists attempt to schedule meetings with 191 congressional offices regarding efforts to
ban a potentially harmful chemical. The messages the activists sent to the congressional
offices were randomized. Some messages described the people requesting the meeting


as “local constituents” and others described the people requesting the meeting as “local
campaign donors.” Table 12.10 describes two key variables from the experiment.
Table 12.10: Variables for Donor Experiment

Variable Description
donor treat Dummy variable indicating activists seeking meeting were identified as donors.
staffrank Highest ranking person attending the meeting: 0 for no meeting, 1 for non-policy
staff, 2 for legislative assistant, 3 for legislative director, 4 for chief of
staff, and 5 for member of Congress.

a. Before we analyze the experimental data, let’s first suppose we were to conduct
an observational study of access based on a sample of Americans where we ran a
regression in which the dependent variable indicates having met with a member of
Congress and the independent variable was whether or not the individual donated
money to a member of Congress. Would there be concerns about endogeneity? If so,
why?
b. Use a probit model to estimate the effect of the donor treatment condition on prob-
ability of meeting with a member of Congress. Interpret the results. Table 12.10
describes the variables.
c. What factors are missing from the model? What does this omission mean for our
results?
d. Use a linear probability model (LPM) to estimate the same model. Interpret results.
Assess the correlation of the fitted values from the probit and LPM models.
e. Use an LPM model to assess the probability of meeting with a senior staffer (defined as staffrank > 2).
f. Use an LPM model to assess the probability of meeting with a low-level staffer (defined as staffrank = 1).
g. Table 12.11 shows results for balance tests for two variables: Obama vote share in the
congressional district and the overall campaign contributions received by the member
of Congress contacted. Discuss the implication of these results for balance.


Table 12.11: Balance Tests for Donor Experiment

             Obama percent    Total contributions
Treated          -0.71             -104,569
                 (1.85)           (153,085)
               [z = 0.38]         [z = 0.68]
Constant         65.59            1,642,801
                 (1.07)            (88,615)
              [z = 61.20]         [z = 18.54]
N                 191                 191
Standard errors in parentheses.

Part IV

Advanced Material

CHAPTER 13

TIME SERIES: DEALING WITH STICKINESS OVER TIME

Global warming is a policy nightmare. Truly ad-

dressing it requires complex international cooper-

ation involving potentially serious costs. Within

each country, policies to address global warming

require people to make costly changes today to

possibly prevent climate-related damage for fu-

ture generations.

Empirically, global warming is no picnic ei-

ther. A hot day or a major storm comes and

invariably someone says global warming is accel-

erating. The end is near! If it gets cold or snows,


someone says global warming is a fraud. Kids, put some more coal on the campfire!

If we use global temperature data to try to pin down trends and associated variables we

are using time series data, data for a particular unit (such as a country or planet) over time.

Time series data is distinct from cross-sectional data, which is data for many units at a

given point in time (such as data on the GDP per capita in all countries in 2012).

Analyzing time series data is deceptively tricky because the data in one year almost

certainly depends on the data in the year before. This seemingly innocuous fact creates

complications, some of which are relatively easy to deal with and others of which are a

major pain in the tuckus.

In this chapter we introduce two approaches to time series data. The first treats the year-

to-year interdependence as the result of autocorrelated errors. As discussed earlier on page

106, autocorrelation doesn’t cause our OLS coefficients to be biased, but it will typically

cause standard OLS estimates of the variance of β̂1 to be incorrect. It's pretty easy to

purge the data of this autocorrelation; our estimates will continue to be unbiased, but now

will have appropriate standard errors.

The second approach to time series data treats the dependent variable in one period as

directly depending on what the value of the dependent variable was in the previous period.

In this approach, the data remembers: a bump up in year one will affect year two; and because the value in year two will affect year three, and so on, the bump in year one will percolate through the entire data series. This is called a dynamic model; such


a model includes a lagged dependent variable as an independent variable. Dynamic models

might seem pretty similar to other OLS models, but they actually differ in important and

funky ways.

This chapter covers both approaches to dealing with time series data. Section 13.1 in-

troduces a model for autocorrelation. Section 13.2 shows how to use this model to detect

autocorrelation and Section 13.3 shows how to purge autocorrelation from the model. Sec-

tion 13.4 introduces dynamic models and Section 13.5 discusses a central but complicated

aspect of dynamic models called stationarity.

13.1 Modeling Autocorrelation

One reasonable approach to time series data is to think of the errors as being correlated over

time. If errors are correlated, β̂1 is unbiased, but the standard equation from page 225 for the variance of β̂1 (Equation 5.9 on page 225) is not accurate.1 Often, the variance estimated

by OLS will be too low and will cause our confidence intervals to be too small and lead us

to reject the null hypothesis sometimes when we shouldn’t.

In this section we lay the groundwork for dealing with autocorrelation by developing a

model with autoregressive errors. Autoregressive errors are one type of possibly autocorre-

lated errors; they are the most widely used and quite intuitive. We also provide examples of

various patterns associated with autoregressive errors.


1 We show how the OLS equation for the variance of β̂1 depends on the errors being uncorrelated on page 718.


Model with autoregressive error

We start with a familiar regression model:

Yt = β0 + β1Xt + εt   (13.1)

This model has slightly different notation than our earlier OLS model. Instead of using “i”

to indicate each individual observation, we use “t” to indicate each time period. Yt therefore

indicates the dependent variable at time t; Xt indicates the independent variable at time t.

We’ll focus on an autoregressive model for error terms. This is the most common model

for addressing autocorrelation. In an autoregressive model, a variable depends directly on its own value in the previous period. Here we model the error as depending on the error in the previous period. In Section 13.4 we use an

autoregressive model for the dependent variable (as opposed to for the error as we’re doing

here).2

The equation for error in this autoregressive model is

εt = ρεt−1 + νt   (13.2)

This equation says that the error term for time period t equals ρ (the Greek letter “rho”) times the error in the previous period plus a random error, νt (the Greek letter nu, pronounced
2 The terms here can get a bit confusing. Autocorrelation refers to errors being correlated with each other. An
autoregressive model is one way (the most common way) to model autocorrelation. It is possible to model correlated
errors differently. For example, errors can be the average of errors from some number of previous periods, an error
process referred to as a moving average error process.


“new”). We assume νt is uncorrelated with the independent variable and other error terms. The εt−1 term is referred to as the lagged error because it is the error from the previous period. We'll lag other variables as well, which means using the value from the previous period. We indicate lagged variables with the subscript t − 1 instead of t. A lagged variable is a variable with the values from the previous period.

Suppose we’re looking at global temperature data from 1880 to 2012 as a dependent

variable and carbon emissions as an independent variable. Suppose we lack a good measure

of sunspots, a solar phenomenon that may affect temperature. Because sunspots strengthen

and weaken over a roughly 11 year cycle, they will be correlated from period to period. This

factor could be in the error term of our global temperature model and could therefore cause

the error to be correlated from year to year.

Examples of autocorrelated errors

The ρ term indicates the extent to which the errors are correlated over time. If ρ is zero, then the errors are not correlated and the autoregressive model reduces to a simple OLS model (because Equation 13.2 becomes εt = νt when ρ = 0). If ρ is greater than zero, then a high value of ε in period t − 1 will likely lead to a high value of ε in period t. Think of the errors in this case as being a bit sticky. Instead of bouncing around like independent random values, they tend to run high for a while, then low for a while.

If ρ is less than zero, we have negative autocorrelation. With negative autocorrelation, a


positive value of ε in period t − 1 is more likely to lead to a negative value of ε in the next

period. In other words, the errors bounce violently back and forth over time.

Figure 13.1 shows examples of errors with varying degrees and types of autocorrelation.

Panel (a) shows an example in which ρ is 0.8. This positive autocorrelation produces a

relatively smooth graph, with values tending to be above zero for a few periods and then

below zero for a few periods and so on. This graph is telling us that if we know the error in

one period, we then have some sense of what it will be in the following period. If the error

is positive in period t, then it’s likely (but not certain) the error will be positive in period

t + 1.

Panel (b) of Figure 13.1 shows a case when there is no autocorrelation. The error in

time t is not a function of the error in the previous period. The tell-tale signature of no

autocorrelation is randomness: the series is generally spiky, and although here and there the error might linger above or below zero, there is no strong pattern.

Panel (c) of Figure 13.1 shows negative serial correlation with ρ = −0.8. The signature

of negative serial correlation is extreme spikiness because a positive error is more likely to

be followed by a negative error and vice versa.

The absolute value of ρ has to be less than one in autoregressive models. If ρ were greater

than one, the errors would tend to grow larger in each time period and would spiral out of

control.

We often refer to autoregressive models as AR models. In AR models, the error is a


[Figure 13.1 plots simulated errors against year (1950 to 2000) in three panels: (a) positive autocorrelation with ρ = 0.8, (b) no autocorrelation with ρ = 0.0, and (c) negative autocorrelation with ρ = −0.8. In each panel the vertical axis is the error, running from −3 to 3.]

FIGURE 13.1: Examples of Autocorrelation


function of error in previous periods. If the error is a function of only the error from the

previous period, the model is referred to as an AR(1) model (pronounced A-R-1). If the

error is a function of the error from two previous periods, the model is referred to as an

AR(2) model and so on. We’ll focus on AR(1) models.
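A minimal R sketch of the kind of simulation behind Figure 13.1 (the seed, sample size, and function name are arbitrary choices for illustration):

## Simulate AR(1) errors for rho = 0.8, 0, and -0.8
set.seed(1234)
nYears = 51
SimAR1 = function(rho, nYears) {
  e = rep(0, nYears)
  nu = rnorm(nYears)                        ## well-behaved random error
  for (t in 2:nYears) e[t] = rho*e[t-1] + nu[t]
  e
}
par(mfrow = c(1, 3))
for (rho in c(0.8, 0, -0.8)) {
  plot(1950:2000, SimAR1(rho, nYears), type = "l", xlab = "Year", ylab = "Error")
}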

Remember This
1. Autocorrelation refers to errors being correlated with each other.
2. One type of autocorrelated error occurs when errors come from an autoregressive
model in which the error term in period t is a function of the error in previous
periods.
3. The equation for error in an AR(1) model is
εt = ρεt−1 + νt

13.2 Detecting Autocorrelation

We know what autocorrelation is. We know what it causes. But just because data is time

series data does not necessarily mean the errors will be correlated. We need to assess whether

it exists in our data and model. If it does, then we need to correct for it. If it does not, then

we can go on our merry way with OLS. In this section we show how to detect autocorrelation

graphically and with auxiliary regressions.


Using graphical methods to detect autocorrelation

The first way to detect autocorrelation is simply to graph the error terms over time. Au-

tocorrelated data has a distinctive pattern and will typically jump out pretty clearly from

a graph. As is typical with graphical methods, looking at a picture doesn't yield a cut-and-dried answer. The advantage, though, is that it allows us to understand the data, perhaps

helping us catch a mistake or identify an unappreciated pattern.

To detect autocorrelation graphically, we first run a standard OLS model ignoring the

autocorrelation and generate residuals. They are calculated as ε̂t = Yt − β̂0 − β̂1Xt. (If our

model has more independent variables, we would include them in the calculation as we do

in multivariate OLS.) We simply graph these residuals over time and describe what we see.

If the errors move slowly as in panel (a) of Figure 13.1, they’re positively correlated. If

errors bounce violently as in panel (c) of Figure 13.1, they’re negatively correlated. If we

can’t really tell, then the errors are probably not strongly correlated. Wait a minute! Why

are we looking at residuals from an OLS equation that does not correct for autocorrelation?

Isn’t the whole point of this chapter that we need to take into account autocorrelation?

Busted, right?

Actually, no. And here's where understanding the consequences of autocorrelation is so valuable. Autocorrelation does not cause bias. The β̂s from an OLS model that ignores autocorrelation are unbiased even when there is autocorrelation. Because the residuals are a function of these β̂s, they are unbiased too. The OLS standard errors are flawed, but we're


not using them to create the residuals in the graph.

Positive autocorrelation is common in time series data. Panel (a) of Figure 13.2 shows

global climate data over time with a fitted line from the following model:

Temperaturet = β0 + β1Yeart + εt

The temperature hovers above the trend line for periods (such as around World War 2

and now) and below the line for other periods (such as 1950 to 1980). This hovering is a

sign that the error in one period is correlated with the error in the next period. Panel (b)

of Figure 13.2 shows the residuals from this regression. For each observation, the residual

is the distance from the fitted line; so the residual plot is essentially panel (a) tilted so that

the fitted line in panel (a) is now the horizontal line in panel (b).

Using an auxiliary regression to detect autocorrelation

A more formal way to detect autocorrelation is to estimate the degree of autocorrelation using

an auxiliary regression. We have seen auxiliary regressions before (in the multicollinearity

discussion on page 210, for example); they are additional regressions that are related to, but

not the same as, the regression of interest. When detecting autocorrelation, we estimate the

following model:

ε̂t = ρε̂t−1 + νt   (13.3)


[Figure 13.2 has two panels plotted against year (1890 to 2010): panel (a) shows global average temperature with the fitted line from the regression, and panel (b) shows the residuals from that regression.]

FIGURE 13.2: Global Average Temperature Since 1880


where ε̂t and ε̂t−1 are simply the residuals and lagged residuals from the initial OLS estimation of Yt = β0 + β1Xt + εt.3 Details on how to implement this model in Stata and R are in the Computing Corner starting on page 705. If ρ̂ is statistically significantly different from zero, we have evidence of autocorrelation.4

Table 13.1 shows the results of such a lagged error model for the climate data in Figure

13.2. The dependent variable in this model is the error from the model and the independent

variable is the lagged value of the error. We're using this model to estimate how closely ε̂t and ε̂t−1 are related. The answer? They are strongly related. The coefficient on ε̂t−1 is 0.608, meaning that our ρ̂ estimate is 0.608, which is quite a strong relation. The standard error is 0.072, implying a t statistic of 8.39, which is well beyond any conventional critical value. We can therefore handily reject the null that ρ = 0 and conclude that the errors are autocorrelated.

3 If we believe that the independent variables might be correlated with the error term, we can also include them in the auxiliary regression such that we estimate ε̂t = ρε̂t−1 + γXt + νt. With this model we continue to look for a statistically significant ρ̂ estimate.
4 This approach is closely related to the so-called Durbin-Watson test for autocorrelation. That test statistic is widely reported, but it has a much more complicated distribution than a t distribution and requires the use of specific tables. In general, it produces results very similar to those from the auxiliary regression process we explained.


Table 13.1: Detecting Autocorrelation Using OLS and Lagged Error Model

Lagged error      0.608*
                 (0.072)
                [t = 8.39]
Constant          0.000
                 (0.009)
                [t = 0.04]
N                  127
R2                 0.36
Standard errors in parentheses
* indicates significance at p < 0.05

Remember This
To detect autocorrelation in time series:
1. Graph the residuals from a standard OLS model over time. If the plot is relatively
smooth, positive autocorrelation likely exists.
2. Estimate the following OLS model:

ε̂t = ρε̂t−1 + νt

A statistically significant estimate of ρ indicates autocorrelation.
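A minimal R sketch of both steps, assuming Y, X, and year are vectors already in the workspace (the names are placeholders):

## Step 1: estimate the OLS model and plot the residuals over time
OLSResult = lm(Y ~ X)
Resid = resid(OLSResult)
plot(year, Resid, type = "l")             ## smooth swings suggest positive autocorrelation

## Step 2: auxiliary regression of residuals on lagged residuals
LagResid = c(NA, Resid[-length(Resid)])   ## residual from the previous period
summary(lm(Resid ~ LagResid))             ## a significant coefficient indicates autocorrelation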

13.3 Fixing Autocorrelation

The way to deal with autocorrelation is to get rid of it. That doesn’t really sound like

something we’re supposed to do, but we can do it via a few steps. In this section we derive

a model that purges the autocorrelation and then we explain how to estimate it.

Our goal is to purge autocorrelation from the data by transforming the dependent and

independent variables before estimating our model. Once we purge the autocorrelation,


OLS using the transformed data will produce an unbiased estimate of β1 and an appropriate estimate of var(β̂1). In contrast, OLS on the untransformed data will produce an unbiased estimate of β1 but an inappropriate estimate of var(β̂1).

ρ-transforming the data

The process is called ρ-transforming (“rho-transforming”) the data. Because these steps are automated in many software packages, we typically will not do them manually. If we understand the steps, though, we can use the results more confidently and effectively.

We begin by replacing the εt in the main equation (Equation 13.1) with ρεt−1 + νt from Equation 13.2:

Yt = β0 + β1Xt + ρεt−1 + νt   (13.4)

This equation looks like a standard OLS equation except for a pesky ρεt−1 term. Our

goal is to zap that term. Here’s how.

1. Write an equation for the lagged value of Yt, which simply requires replacing the t subscripts with t − 1 subscripts in the original model:

   Yt−1 = β0 + β1Xt−1 + εt−1   (13.5)

2. Multiply both sides of Equation 13.5 by ρ:

   ρYt−1 = ρβ0 + ρβ1Xt−1 + ρεt−1   (13.6)


3. Subtract the equation for ρYt−1 (Equation 13.6) from Equation 13.4. That is, subtract the left side of Equation 13.6 from the left side of Equation 13.4 and subtract the right side of Equation 13.6 from the right side of Equation 13.4.

   Yt − ρYt−1 = β0 − ρβ0 + β1Xt − ρβ1Xt−1 + εt − ρεt−1

4. Notice in Equation 13.2 that εt − ρεt−1 = νt and rewrite:

   Yt − ρYt−1 = β0 − ρβ0 + β1Xt − ρβ1Xt−1 + νt

5. Rearrange things a bit:

   Yt − ρYt−1 = β0(1 − ρ) + β1(Xt − ρXt−1) + νt

6. Use squiggles to indicate the transformed variables (where Ỹt = Yt − ρYt−1, β̃0 = β0(1 − ρ), and X̃t = Xt − ρXt−1):

   Ỹt = β̃0 + β1X̃t + νt

The key thing is to look at the error term in this new equation. It is νt, which we said at the outset is the well-behaved part of the error term that is not autocorrelated. Where is εt, the naughty autocorrelated part of the error term? Gone! That's the thing. That's what we accomplished with these equations: We end up with an equation that looks pretty similar to our OLS equation with a dependent variable (Ỹt), parameters to estimate (β̃0 and β1), an independent variable (X̃t), and an error term, νt. The difference is that, unlike our


original model (based on Equations 13.1 and 13.2), this model has no autocorrelation. By

using Ỹt and X̃t we have transformed the model from one that suffers from autocorrelation

to one that does not.

Estimating a ρ-transformed model

What we have to do, then, is estimate a model with the Ỹ and X̃ (note the squiggles over

the variable names) instead of Y and X. Table 13.2 shows the transformed variables for

several observations. The columns labeled Y and X show the original data. The columns

labeled Ỹ and X̃ show the transformed data. We assume for this example that, based on results from an initial OLS model, we have estimated ρ̂ = 0.5. In this case, the Ỹ observation for 2001 will be the actual value in 2001 (which is 110) minus ρ̂ times the value of Y in 2000: Ỹ2001 = 110 − 0.5 × 100 = 60. Notice that the first observation in the ρ-transformed data will be missing because we don't know the lagged value for that observation.

Once we’ve created these transformed variables, things are easy. If we think in terms of

a spreadsheet, we'll simply use the columns Ỹ and X̃ when we estimate the ρ-transformed model. The standard errors produced by this ρ-transformed model will not be corrupted by

autocorrelation, unlike the standard errors from a model with untransformed data.
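For readers who want to see the mechanics, a minimal R sketch of the ρ-transformation done by hand (Y and X are placeholder vectors ordered by time; the automated routines discussed in the Computing Corner handle the details for us):

## Estimate rho-hat from the residuals of the untransformed OLS model
OLSResult = lm(Y ~ X)
Resid = resid(OLSResult)
RhoHat = coef(lm(Resid[-1] ~ Resid[-length(Resid)]))[2]

## Create the rho-transformed variables (the first observation is lost)
n = length(Y)
Ytilde = Y[-1] - RhoHat*Y[-n]
Xtilde = X[-1] - RhoHat*X[-n]

## Estimate the transformed model; beta1-hat now comes with appropriate standard errors
summary(lm(Ytilde ~ Xtilde))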

The ρ-transformed model is also referred to as a Cochrane-Orcutt model or a Prais-Winsten model.5 These names are useful to remember when using Stata to analyze time series data.

5 The Prais-Winsten model approximates the values for the missing first observation in the ρ-transformed data.


Table 13.2: Example of ρ-transformed Data (for ρ̂ = 0.5)

              Original data              ρ-transformed data
Year         Y        X         Ỹ = Y − ρ̂Yt−1         X̃ = X − ρ̂Xt−1
2000        100      50               -                       -
2001        110      70      110 − 0.5*100 = 60       70 − 0.5*50 = 45
2002        130      60      130 − 0.5*110 = 75       60 − 0.5*70 = 25
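To verify the arithmetic in Table 13.2 in R:

## Values from Table 13.2, with rho-hat = 0.5
Y = c(100, 110, 130)
X = c(50, 70, 60)
RhoHat = 0.5
Y[-1] - RhoHat*Y[-3]   ## Y-tilde for 2001 and 2002: 60 and 75
X[-1] - RhoHat*X[-3]   ## X-tilde for 2001 and 2002: 45 and 25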


Running the ρ-transformed model produces coefficient estimates that are unbiased and consistent (as were the simple OLS coefficients) and also produces accurate standard errors. Usually (but not always), analysis of ρ-transformed data will produce larger standard errors

than in the simple OLS model. That means our estimates are less precise (but more honest!).

Confidence intervals will be larger and it will be harder to reject null hypotheses.

It is worth emphasizing that the β̂1 coefficient we estimate in the ρ-transformed model is an estimate of β1. Throughout all the rigmarole of the transformation process, the value of β1 doesn't change. The value of β1 in the original equation is the same as the value of β1 in the transformed equation. Hence when we get results from ρ-transformed models we still speak of them in the same terms as β1 estimates from standard OLS. That is, a one unit increase in X is associated with a β̂1 increase in Y.

One thing that is a bit unintuitive is that we get different coefficient estimates than with the simple OLS model. Are ρ-transformed results “better”? No and yes, actually. No, in the


sense that both OLS and ρ-transformed estimates are unbiased and consistent, which means that in expectation the estimates equal the true value and as we get more data they converge to the true value. These things can be true and the models can still yield different estimates. It is just as when we flip a coin 100 times: we will likely get a different result every time we do this, even though the expected number of heads is 50. That's pretty much what is going on here, as the two approaches are different realizations of random processes that are correct on average but still have random noise. The ρ-transformed estimates are better in the sense that they come with correct standard errors. The estimates from OLS do not when there is autocorrelation.6

Remember This
We correct for autocorrelation by ρ-transforming the data, a process that purges the autocorrelation from the data.
1. We estimate the following model

   Ỹt = β̃0 + β1X̃t + νt

   where Ỹt = Yt − ρYt−1, β̃0 = β0(1 − ρ), and X̃t = Xt − ρXt−1.
2. We interpret β̂1 from a ρ-transformed model in the same way as we do for standard OLS.
3. This process is automated in many statistical packages.

6 The intercept estimated in a ρ-transformed model is actually β0(1 − ρ̂). If we want to know the fitted value for when Xt is zero (which is the meaning of the intercept in a standard OLS model), we need to divide β̃0 by (1 − ρ̂). The appendix discusses an additional assumption implicit in the ρ-transformed model.


Case Study: Global Temperature Changes Using an AR(1) Model

Figure 13.3 shows the global average tempera-

ture data we worked with in Chapter 7 on page

328. Temperature appears to rise over time and

we want to assess whether this increase is mean-

ingful.

We noted in our discussion of Table 7.1 that

the OLS standard errors were likely incorrect due

to autocorrelation. Here we revisit the example

and use ρ-transformation to provide more accu-

rate standard errors. We work with the quadratic

model that allows the rate of temperature change

to change over time:

Temperaturet = β0 + β1Yeart + β2Yeart² + εt

The first column of Table 13.3 shows results from a standard OLS analysis of the model.

The β̂ coefficients are precisely estimated, with t statistics above 5.

However, we suspect the errors in this model are autocorrelated. If so, we cannot believe

the standard errors from OLS, which in turn means the t statistics are wrong because t

statistics depend on standard error estimates.


[Figure 13.3 plots global temperature, measured as the deviation from the pre-industrial average, against year (1880 to 2000).]

FIGURE 13.3: Global Temperature Data


Table 13.3: Global Temperature Model Estimated Using OLS and Via ρ-transformed Data

                           OLS           ρ-transformed
Year                     -0.165*            -0.174*
                         (0.031)            (0.057)
                       [t = 5.31]         [t = 3.09]
Year squared            0.000044*          0.000046*
                       (0.000008)         (0.000015)
                       [t = 5.48]         [t = 3.20]
Constant                 155.68*            79.97*
                         (30.27)            (26.67)
                       [t = 5.14]         [t = 2.99]
From auxiliary regression
ρ̂                        0.514*             -0.021
                         (0.077)            (0.090)
                       [t = 6.65]         [t = 0.28]
N                          128                127
R2                        0.79               0.55
Standard errors in parentheses
* indicates significance at p < 0.05

Table 13.3 reports that ρ̂ = 0.514, which is generated by estimating an auxiliary regression with the errors as the dependent variable and the lagged error as the independent variable. The autocorrelation is lower than in the model that did not include squared year as an independent variable (as reported on page 669), but it is nonetheless highly statistically significant, suggesting we need to correct for autocorrelation to get proper standard errors.

The second column of Table 13.3 shows results from a ρ-transformed model. The β̂1 and β̂2 haven't changed much from the first column. This outcome isn't too surprising given that both OLS and ρ-transformed models produce unbiased estimates of β1 and β2. The difference is in the standard errors. The standard errors on the Year and Year² variables have almost doubled, which has almost halved the t statistics for β̂1 and β̂2 to near three.


In this particular instance, the relationship between year and temperature is so strong that

even with these larger standard errors we will reject the null hypotheses of no relationship at

any conventional significance level (such as α = 0.05 or α = 0.01). What we see, though, is

the large effect addressing autocorrelation has on the standard errors. The standard errors

produced by OLS were too small due to autocorrelation. In other words, we overestimated

the amount of information we have when we ignored autocorrelation.

Several aspects of the results from the ρ-transformed model are worth noting. First, the ρ̂ from the auxiliary regression is now very small (-0.021) and statistically insignificant, indicating that we have indeed purged the model of first-order autocorrelation. Well done! Second, the R2 is lower in the ρ-transformed model; it's reporting the traditional goodness of fit statistic for the transformed model, but it is not directly meaningful or comparable to the R2 in the original OLS model. Third, the constant changes quite a bit, from 155.68 to 79.97. Recall, however, that the constant in the ρ-transformed model is actually β0(1 − ρ) (where ρ is the estimate of autocorrelation in the untransformed model), which means the estimate of β0 is 79.97/(1 − 0.514) = 164.5, which is close to the estimate of β̂0 in the OLS model.
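Checking that arithmetic in R:

## Recover the implied beta0 from the rho-transformed constant in Table 13.3
79.97 / (1 - 0.514)    ## approximately 164.5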

13.4 Dynamic Models

Another way to deal with time series data is to use a dynamic model. In a dynamic model,

the value of the dependent variable directly depends on the value of the dependent variable


in the previous period. In this section we explain the dynamic model and discuss three ways

the model differs from OLS models.

Dynamic models include the lagged dependent variable

In mathematical terms, a basic dynamic model is

Yt = γYt−1 + β0 + β1Xt + εt   (13.7)

where the new term is γ (the Greek letter gamma) times the value of the lagged dependent variable, Yt−1. The coefficient γ indicates the extent to which the dependent variable depends on its lagged value. The higher it is, the more the dependence across time. If the data is really generated according to a dynamic process, omitting the lagged dependent variable would risk omitted variable bias; and given that the coefficient on the lagged dependent variable is often very large, that means we risk large omitted variable bias by omitting the lagged dependent variable when γ ≠ 0.

As a practical matter, a dynamic model with a lagged dependent variable is super easy

to implement: just add the lagged dependent variable as an independent variable.
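In R, for example, a minimal sketch (with illustrative variable names Y and X, assumed to be sorted chronologically) looks like this:

LagY = c(NA, Y[1:(length(Y)-1)])            # Lagged dependent variable
DynamicModel = lm(Y ~ LagY + X)             # Coefficient on LagY is gamma
summary(DynamicModel)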

Three ways dynamic models differ from OLS models

Be alert, though. This seemingly modest change in the model shakes up a lot of our statistical

intuition. Some things that seemed simple in OLS become weird.


First, the interpretation of the coefficients changes. In non-dynamic OLS models (which simply means OLS models that do not have a lagged dependent variable as an independent variable), a one unit increase in X is associated with a β̂1 increase in Y. In a dynamic model, it's not so simple. Suppose X increases by 1 in period 1. Y1 will go up by β1; we're used to seeing that kind of effect. Y2 will also go up because Y2 depends on Y1. In other words, an increase in X has not only immediate effects, but also long-term effects because the boost to Y will carry forward via the lagged dependent variable.

In fact, if −1 < γ < 1, a one unit increase in X will cause a β1/(1 − γ) increase in Y over the long term.7 If γ is big (near 1), then the dependent variable has a lot of memory. A change in one period strongly affects the value of the dependent variable in the next period. In this case, the long-term effect of X will be much bigger than β̂1 because the estimated long-term effect will be β̂1 divided by a small number. If γ is near zero, on the other hand, then the dependent variable has little memory, meaning that the dependent variable depends little on its value in the previous period. In this case, the long-term effect of X will be pretty much β̂1 because the estimated long-term effect will be β̂1 divided by a number close to one.
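As a small numerical illustration (the coefficient values here are made up), the long-term effect is just the short-term effect divided by one minus the coefficient on the lagged dependent variable:

Beta1Hat = 0.5                              # Illustrative short-term effect of X
GammaHat = 0.8                              # Illustrative coefficient on lagged Y
Beta1Hat / (1 - GammaHat)                   # Long-term effect: 2.5, five times the short-term effect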

A second distinctive characteristic of dynamic models is that correlated errors cause a lot more trouble in dynamic models than in non-dynamic models. Recall that in OLS, correlated errors mess up the standard OLS estimates of the variance of β̂1, but they do not bias the estimates of β̂1. In dynamic models, correlated errors cause bias. It's not too hard to see why. If εt is correlated with εt−1, then it also has to be correlated with Yt−1 because Yt−1 is obviously a function of εt−1. In such a situation, one of the independent variables (Yt−1) is correlated with the error, which is a bias-causing no-no in OLS. The bias is worse for the estimate of the coefficient on the lagged dependent variable than for β̂1. If the autocorrelation in the errors is modest or weak, this bias is relatively small.

7 The condition that the absolute value of γ is less than one rules out certain kinds of explosive processes where Y gets increasingly bigger or smaller every period. This condition is related to a requirement that data be "stationary" as discussed below on page 683.

A third distinctive characteristic of dynamic models is that including a lagged dependent variable when it is irrelevant (meaning γ = 0) can lead to biased estimates of β̂1. Recall from page 233 that in OLS, including an irrelevant variable (a variable whose true coefficient is zero) will increase standard errors but will not cause bias. In a dynamic model, though, including the lagged dependent variable when γ = 0 leads β̂1 to be biased if the error is autocorrelated and the independent variable itself follows an autoregressive process (such that its value depends on its lagged value). When these two conditions hold, including a lagged dependent variable when γ = 0 can cause the estimated coefficient on X to be vastly understated because the lagged dependent variable will wrongly soak up much of the explanatory power of the independent variable.
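A rough simulation sketch in R makes the point concrete (all of the parameter values here are made up for illustration): X follows an AR(1) process, the errors are autocorrelated, and the true γ is zero, yet including the lagged dependent variable pulls the estimated coefficient on X below its true value of one.

set.seed(1234)
NumPeriods = 200
x = rep(0, NumPeriods)
e = rep(0, NumPeriods)
for (t in 2:NumPeriods) {
  x[t] = 0.9*x[t-1] + rnorm(1)              # Autoregressive independent variable
  e[t] = 0.9*e[t-1] + rnorm(1)              # Autocorrelated errors
}
y = 1*x + e                                 # True model: beta1 = 1, gamma = 0
LagY = c(NA, y[1:(NumPeriods-1)])
coef(lm(y ~ x))["x"]                        # Unbiased: centered on the true value of 1
coef(lm(y ~ LagY + x))["x"]                 # Typically well below 1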

Should we include a lagged dependent variable in our time series model? On the one hand, if we exclude the lagged dependent variable when it should be there (when γ ≠ 0), we risk omitted variable bias. On the other hand, if we include it when it should not be there (when γ = 0), we risk bias if the errors are autocorrelated. It's quite a conundrum.


There is no firm answer, but we’re not helpless. The best place to start is with theory

about the nature of the dependent variable being modeled. If we have good reasons to

suspect that it truly is a dynamic process then including the lagged dependent variable is

the best course. For example, many people suspect that political affiliation is a dynamic

process. What party a person identifies with depends not only on external factors like the

state of the economy, but also on what party he or she identified with last period. It’s a

well-known fact of life that many people interpret facts through partisan lenses. Democrats

will see economic conditions in a way that is most favorable to Democrats; Republicans

will see economic conditions in a way that is most favorable to Republicans. This means

that party identification will be sticky in a manner implied by the dynamic model and it is

sensible to include a lagged dependent variable in the model.

In addition, when we include a lagged dependent variable we should test for autocorrelated errors. If we find that the errors are autocorrelated, then we should worry about possible bias in the estimate of β̂1; the higher the autocorrelation of errors, the more we should worry. We discussed how to test for autocorrelation on page 669. If we find autocorrelation, we can ρ-transform the data to purge the autocorrelation; we'll see an example in our case study on page 695.


Remember This
1. A dynamic time series model includes a lagged dependent variable as a control variable. For example,

   Yt = γYt−1 + β0 + β1Xt + εt

2. Dynamic models differ from standard OLS models.

   (a) Independent variables have short-term effects (β1) and long-term effects (β1/(1 − γ)). The long-term effects occur because a short-term effect on Y will affect subsequent values of the dependent variable through the influence of the lagged dependent variable.
   (b) Autocorrelation causes bias in models with a lagged dependent variable.
   (c) Including a lagged dependent variable when the true value of γ is zero can cause severe bias if the errors are correlated and the independent variable follows some kind of autoregressive process.

13.5 Stationarity

We also need to think about stationarity when we analyze time series data. A stationary

variable has the same distribution throughout the entire time series. This is a complicated

topic and we’ll only scratch the surface. The upshot is that stationarity is good and its

opposite, nonstationarity, is a pain in the tuckus. When working with time series data, we

want to make sure our data is stationary.

In this section we define nonstationarity as a so-called unit root problem and then explain

how spurious regression results are a huge danger with nonstationary data. Spurious regres-

sion results are less likely with stationary data. We also show how to detect nonstationarity


and what to do if we find it.

Nonstationarity as a unit root process

A variable is stationary if it has the same distribution for the entire time series. A variable

is nonstationary if its distribution depends on time. A variable for which the mean is getting

constantly bigger, for example, is a nonstationary variable. Nonstationary variables come in

multiple flavors, but we’ll focus on a case in which data is prone to persistent trends in a

way we define more precisely below. In this case, the mean of the distribution of the variable

changes over time, making the variable nonstationary.

To help us understand nonstationarity, we begin with a very simple dynamic model in

which Y is a function of the previous value of the variable:

Yt = γYt−1 + εt    (13.8)

We consider three cases for γ, the coefficient on the lagged dependent variable: when it is less than one, equal to one, or greater than one.

If the absolute value of γ is less than one, life is relatively easy. The lagged dependent variable affects the dependent variable, but the effect diminishes over time. To see why, note that we can write the value of Y in the third time period as a function of the previous values of Y simply by substituting for the previous values of Y (for example, Y2 = γY1 + ε2):


Y3 = γY2 + ε3
   = γ(γY1 + ε2) + ε3
   = γ(γ(γY0 + ε1) + ε2) + ε3
   = γ³Y0 + γ²ε1 + γε2 + ε3

When γ < 1, the effect of any given value of Y will decay over time. In this case, the effect of Y0 on Y3 is γ³Y0; because γ < 1, γ³ will be less than one. We could extend the above logic to show that the effect of Y0 on Y4 will be γ⁴, which is less than γ³ when γ < 1. The effect of the error terms in a given period will also have a similar pattern.

This case presents some differences from standard OLS, but it turns out that the property that the effects of previous values of Y and of the errors fade away means that we will not face a fundamental problem when estimating coefficients.

What if γ were greater than one? In this case, we'd see an explosive process because the value of Y would grow by an increasing amount. Time series analysts rule out such a possibility on theoretical grounds. Variables just don't explode like this, certainly not indefinitely as implied by a model with γ > 1.

The tricky case occurs when γ is exactly equal to one. In this case the variable is said to have a unit root. In a model with a single lag of the dependent variable, a unit root simply means that the coefficient on the lagged dependent variable (γ for the model as we've written it) is equal to one. The terminology is a bit quirky: Unit refers to the number one and root refers to the source of something, in this case the lagged dependent variable that


is a source for the value of the dependent variable.

Nonstationarity and spurious results

A variable with a unit root is nonstationary and causes several problems. The most serious

is that spurious regression results are highly probable when regressing a variable with a

unit root on another variable with a unit root.8 A spurious regression is one in which the

regression results suggest that X affects Y when in fact X has no effect on Y ; spurious

results might be simply thought of as bogus results.

It’s reasonably easy to come up with possible spurious results in time series data. Think

about the U.S. population from 1900 to 2010. It rose pretty steadily, right? Now think about

the price of butter from 1900 to 2010. Also rose steadily. If we were to run a regression

predicting the price of butter as a function of population we would see a significant coefficient

on population because low values of population went with low butter prices and high values

of population went with high butter prices. Maybe that’s true, but here’s why we should be

skeptical: It’s quite possible these are just two variables that both happen to be trending

up. We could replace the population of the United States with the population of Yemen

(also trending up) and the price of butter with the number of deer in the United States

(also trending up). We’d again have two variables trending together and if we put them in a

simple OLS model we would observe a spurious positive relationship between the population
8 Other problems are that the coefficient on the lagged dependent variable will be biased downward so that the
coefficient divided by its standard error will not follow a t distribution.


of Yemen and deer in the United States. Silly, right?

The reason that a nonstationary variable is prone to spurious results is that a variable

with a unit root is trendy. Not in a fashionable sense, but in a streaky sense. A variable

with a unit root might go up for a while, then down for even longer, blip up, and then continue

down. These unit root variables look like Zorro slashed out their pattern with his sword: A

zig up, a long zag down, another zig up, and so on.9

Figure 13.4 shows examples of two simulated variables with unit roots. In panel (a) Y is simulated according to Yt = Yt−1 + εt. In this particular simulation, Y mostly goes up, but with periods in which it goes down for a bit. In panel (b), X is simulated according to Xt = Xt−1 + νt. In this particular simulation, X trends mostly down, with a flat period early on and some mini-peaks later in the time series. Importantly, X and Y have absolutely nothing to do with each other in the way they were generated. For example, when we generated values of Y, the values of X played no role.

Panel (c) of Figure 13.4 scatterplots X and Y and includes a fitted OLS regression line.

The regression line has a negative slope that is highly statistically significant. And com-

pletely spurious. The variables are completely unrelated to each other. The reason we see a

significant relationship is that Y was working its way up while X was working its way down

for most of the first part of the series. These movements create a pattern in which a negative

OLS coefficient occurs, but does not indicate an actual relationship. In other words, panel
9 Zorro’s slashes would probably go more side-to-side, so maybe think of unit root variables as slashed by an inebriated Zorro.


FIGURE 13.4: Data with Unit Roots and Spurious Regression
[Panel (a): the simulated unit root variable Y plotted over time. Panel (b): the simulated unit root variable X plotted over time. Panel (c): scatterplot of Y against X with the fitted OLS line; β̂1 = −0.81, t statistic for β̂1 = −36.1.]


(c) of Figure 13.4 is a classic example of a spurious regression.

Of course, this is a single example. It is, however, quite representative because unit root

variables are so prone to trends. When Y goes up, there is a pretty good chance that X will

be on a trend too: If X is going up, too, then the OLS coefficient on X would be positive. If

X is trending down when Y is trending up, then the OLS coefficient on X would be negative.

Hence, the signs of coefficients in these spurious regression results are not predictable. What

is predictable is that two such variables will often exhibit (spurious) statistically significant

relationships.10
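A rough sketch of such a simulation (this is an illustration, not the appendix code itself) looks like the following; with two independent random walks, "significant" slopes show up far more often than the nominal 5 percent of the time.

set.seed(321)
NumSims = 1000
NumPeriods = 200
SigCount = 0
for (s in 1:NumSims) {
  y = cumsum(rnorm(NumPeriods))             # Random walk (unit root) for Y
  x = cumsum(rnorm(NumPeriods))             # Independent random walk for X
  PVal = summary(lm(y ~ x))$coefficients["x", 4]
  if (PVal < 0.05) SigCount = SigCount + 1
}
SigCount/NumSims                            # Share of spurious "significant" results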

Spurious results are less likely with stationary data

Variables without unit roots behave differently. Panels (a) and (b) of Figure 13.5 show a

simulation of two time series variables where the coefficient on the lagged dependent variable

is 0.5 (as opposed to 1.0 in the unit root simulations). They certainly don’t look like Zorro

sword slashes. They look more like Zorro sneezed them out. And OLS finds no relationship

between the two variables, as is clear in panel (c), which shows a scatterplot of X and Y .

Again, this is a single simulation, but it is a highly representative one because variables

without unit roots typically don’t exhibit the trendiness that causes unit root variables to

produce spurious regressions.


10 The appendix has R code to simulate variables with unit roots and run regressions using those variables. Using
the code, it is easy to see that the proportion of simulations with statistically significant (spurious) results is very
high.


FIGURE 13.5: Data without Unit Roots
[Panel (a): the simulated stationary variable Y plotted over time. Panel (b): the simulated stationary variable X plotted over time. Panel (c): scatterplot of Y against X with the fitted OLS line; β̂1 = −0.08, t statistic for β̂1 = −0.997.]


Unit roots are surprisingly common in theory and practice. Unit roots are also known as

random walks because the series starts at Yt−1 and takes a random step (the error term) from

there. Random walks are important in finance; the efficient market hypothesis holds that

stock market prices account for all information such that there will be no systematic pattern

going forward. A classic book about investing is A Random Walk Down Wall Street (Malkiel

2003); the title is not, ahem, random, but connects unit roots to finance via the random walk

terminology. In practice, many variables show signs of having unit roots, including GDP,

inflation, and other economic variables.

Detecting unit roots and nonstationarity

To test for a unit root (which means the variable is nonstationary), we test whether γ is equal to one for the dependent variable and the independent variables. If γ is equal to one for a variable or variables, we have nonstationarity and worry about spurious regression and other problems associated with nonstationary data.

The main test for unit roots has a cool name: the Dickey-Fuller test. This is a hypothesis test in which the null hypothesis is γ = 1 and the alternative hypothesis is γ < 1. The standard way to implement the Dickey-Fuller test is to transform the model by


subtracting Yt−1 from both sides of Equation 13.8:

Yt − Yt−1 = γYt−1 − Yt−1 + εt
      ΔYt = (γ − 1)Yt−1 + εt
      ΔYt = αYt−1 + εt

where the dependent variable (ΔYt, pronounced "delta Y"; this is a capital Greek delta, which is a different symbol than the lowercase delta, δ) is now the change in Y in period t and the independent variable is the lagged value of Y. Here we're using notation suggesting a unit root test for the dependent variable. We also run unit root tests with the same approach for independent variables.

This transformation allows us to re-formulate the model in terms of a new coefficient we label as α = γ − 1. Under the null hypothesis that γ = 1, our new parameter α equals zero. Under the alternative hypothesis that γ < 1, our new parameter α is less than zero.

It's standard to estimate a so-called augmented Dickey-Fuller test that includes a time trend and a lagged value of the change of Y (ΔYt−1):

ΔYt = αYt−1 + β0 + β1Timet + β2ΔYt−1 + εt    (13.9)

where Timet is a variable indicating which time period observation t is. Time is equal to 1 in the first period, 2 in the second period, and so forth.

The focus of the Dickey-Fuller approach is the estimate of α. What we do with our estimate of α takes some getting used to. The null hypothesis is that Y is nonstationary. That's bad. We want to reject the null hypothesis. The alternative is that Y is stationary. That's good. If we reject the null hypothesis in favor of the alternative hypothesis that α < 0, then we are rejecting that Y is nonstationary in favor of inferring that Y is stationary.

The catch is that if the variable actually is nonstationary, the estimated coefficient is

not normally distributed, which means the coefficient divided by its standard error will not

have a t distribution. Hence we have to use so-called Dickey-Fuller critical values, which are

bigger than standard critical values, making it hard to reject the null hypothesis that the

variable is nonstationary. We show how to implement Dickey-Fuller tests in the Computing

Corner; more details are in the references indicated in the Further Reading section and the

appendix.

How to handle nonstationarity

If the Dickey-Fuller test indicates that a variable is nonstationary, the standard approach is to move to a differenced model in which all variables are converted from levels (e.g., Yt, Xt) to differences (e.g., ΔYt, ΔXt, where Δ indicates the difference between the variable at time t and time t − 1). We'll see an example on page 700.
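In R, for example, differencing is a one-line transformation for each variable (a sketch, with Y and X standing in for variables sorted chronologically):

ChangeY = c(NA, diff(Y))                    # Y_t minus Y_(t-1); NA for the first observation
ChangeX = c(NA, diff(X))                    # X_t minus X_(t-1)
DiffModel = lm(ChangeY ~ ChangeX)           # Differenced model
summary(DiffModel)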


Remember This
A variable is stationary if its distribution is the same for the entire data set. A common violation of stationarity occurs when data has a persistent trend.
1. Nonstationary data can lead to statistically significant regression results that are spurious when two variables have similar trends.
2. The test for stationarity is a Dickey-Fuller test. The most widely used format of this test is an augmented Dickey-Fuller test:

   ΔYt = αYt−1 + β0 + β1Timet + β2ΔYt−1 + εt

   If we reject the null hypothesis that α = 0, we conclude that the data is stationary and can use untransformed data. If we fail to reject the null hypothesis that α = 0, we conclude the data is nonstationary and should use a model with differenced data.


Case Study: Dynamic Model of Global Temperature

One of the central elements in discussions about global warming is the role of carbon dioxide. Figure 13.6 plots carbon dioxide output and global temperature since 1880. The solid line is temperature, measured in deviation in degrees Fahrenheit from pre-industrial average temperature. The values for temperature are on the left side of the figure. The dotted line is carbon dioxide concentration, measured in parts per million, with values indicated on the right side of the figure. Clearly, these variables seem to move together. The question is how confident we are that this relationship is in any way causal.

We’ll analyze this question with a dynamic model. We begin with a model that allows

for the non-linear time trend from page 675; this model has Year and Year² as independent

variables.11

We’ll also include temperature from the previous time period. This is the lagged depen-
11 Including these variables is not a no-brainer; one might argue that the independent variables are causing the
non-linear time trend and we don’t want the time trend in there to soak up variance. Welcome to time series analysis.
Without definitively resolving the question, we’ll start from there because including time trends is an analytically
conservative approach in the sense that it will typically make it harder, not easier, to find statistical significance for
independent variables.


FIGURE 13.6: Global Temperature Data
[Solid line: temperature, measured as the deviation from the pre-industrial average in degrees Fahrenheit (left-hand scale). Dotted line: carbon dioxide, measured in parts per million (right-hand scale). The horizontal axis runs from 1880 to 2000.]


dent variable and by including it, the model becomes a dynamic model. The independent

variable of interest here is carbon dioxide. We want to know if increases in carbon dioxide

are associated with increases in global temperature. The model is

Temperaturet = γTemperaturet−1 + β0 + β1Yeart + β2Yeart² + β3CO2t + εt    (13.10)

where CO2t is a measure of the concentration of carbon dioxide in the atmosphere. This

is a much (much!) simpler model than climate scientists use; our model simply gives us a

broad-brush picture as to whether the relationship between carbon dioxide and temperature

can be ascertained in macro level data.

Our first worry is that the data might not be stationary. If that is the case, there is a

risk of spurious regression. Therefore the first two columns of Table 13.4 show Dickey-Fuller

results for the substantive variables, temperature and carbon dioxide. We use an augmented

Dickey-Fuller test of the following form:

ΔTemperaturet = αTemperaturet−1 + β1Yeart + β2ΔTemperaturet−1 + εt

Recall that the null hypothesis in a Dickey-Fuller test is that the data is nonstationary.

The alternative hypothesis in a Dickey-Fuller test is that the data is stationary; we will accept

this alternative only if the coefficient is sufficiently negative. (Yes, this way of thinking takes

a bit of getting used to.)

To show that data is stationary (which is a good thing!), we need a sufficiently negative

t statistic on the estimate of α. For the temperature variable, the t statistic in the Dickey-


Fuller test is -4.22. As we discussed earlier, the critical values for the Dickey-Fuller test are

not the same as for standard t tests. They are listed in the bottom of the table. Because

the t statistic on the lagged value of temperature is more negative than the critical value,

even at the 1 percent level, we can reject the null hypothesis of nonstationarity. In other

words, the temperature data is stationary. We get a different answer for carbon dioxide.

The t statistic is positive. That immediately dooms a Dickey-Fuller test because we need to

see t statistics more negative than the critical values in order to reject the null. In this case,

we do not reject the null hypothesis and therefore conclude that the carbon dioxide data

is nonstationary. This means that we should be wary of using the carbon dioxide variable

directly in a time series model.

A good way to begin to deal with nonstationarity is to use differenced data, which we

generate by creating a variable that is the change of a variable in period t, as opposed to the

level of the variable.

We still need to check for stationarity with the differenced data and hence, back we go

to Table 13.4 for the Dickey-Fuller tests; this time the last two columns test for stationarity

using the change in temperature and change in carbon dioxide variables. The t statistic

on the lagged value of the change in temperature is -12.04, allowing us to easily reject the

null hypothesis of nonstationarity for temperature. For carbon dioxide, the t statistic on

the lagged value of the change in carbon dioxide is -3.31, which is more negative than the

critical value at the 10 percent level. We conclude carbon dioxide is stationary. However,


Table 13.4: Dickey-Fuller Tests for Stationarity

                    Temperature     Carbon dioxide   Change in        Change in
                                                     temperature      carbon dioxide
Lag value            -0.353           0.004           -1.669           -0.133
                     (0.084)         (0.002)          (0.139)          (0.040)
                     [t = -4.22]     [t = 0.23]       [t = -12.04]     [t = -3.31]
Time trend            0.002           0.000            0.000            0.002
                     (0.001)         (0.001)          (0.000)          (0.001)
Lag change           -0.093           0.832            0.304            0.270
                     (0.093)         (0.054)          (0.087)          (0.088)
(Intercept)          -3.943          -1.648           -0.487           -4.057
                     (0.974)         (1.575)          (0.490)          (1.248)
N                   126             126              125              125
R²                    0.198           0.934            0.673            0.132
Dickey-Fuller        1 percent: -3.99
critical values      5 percent: -3.43
                    10 percent: -3.13
Decision            stationary      nonstationary    stationary       stationary

Standard errors in parentheses

because CO2 is stationary only at the 10 percent level, a thorough analysis would also

explore additional time series techniques, such as the error correction model discussed in the

appendix.12

Because of the nonstationarity of the carbon dioxide variable, we’ll work with a differ-
12 Dickey-Fuller tests tend to be low powered (see, e.g., Kennedy 2008, 302). This means that these tests often fail to reject the null even when the null is false. For this reason, some people are willing to use relatively high significance levels (e.g., α = 0.10). On the other hand, the costs of failing to account for nonstationarity when we should are high, while the costs of accounting for nonstationarity even when data is stationary are modest. In other words, the consequences of failing to address nonstationarity are serious when data is nonstationary, but the consequences of addressing nonstationarity when data is stationary are not such a big deal. Thus many researchers are inclined to use differenced data when there are any hints of nonstationarity (Kennedy 2008, 309).


enced model in which the variables are changes. The dependent variable is the change in temperature. The independent variables reflect change in each of the variables from Equation 13.10. Because the change in Year is 1 every year, this variable disappears (a variable that doesn't vary is no variable!). The intercept will now capture this information on the rise or fall in the dependent variable each year. The other variables are simply the changes in the variables in each year.

ΔTemperaturet = γΔTemperaturet−1 + β0 + β1ΔYeart² + β2ΔCO2t + εt

Table 13.5 displays the results. The change in carbon dioxide is, indeed, statistically

significant, with a coefficient of 0.052 and a t statistic of 2.004. In this instance, then,

the visual relationship we see between temperature and carbon dioxide holds up even after

accounting for apparent nonstationarity in the carbon dioxide data.

13.6 Conclusion

Time series data is all over: prices, jobs, elections, weather, migration, and much more. To

analyze such data correctly we need to address several statistical challenges.

One is autocorrelation. Autocorrelation does not cause coefficient estimates from OLS to

be biased and is therefore not as problematic as endogeneity. Autocorrelation does, how-

ever, render the standard equation for the variance of β̂ (from page 225) inaccurate. Often

standard OLS will produce standard errors that are too small when there is autocorrelation,


Table 13.5: Change in Temperature as a Function of Change in Carbon Dioxide and Other Factors

Change of carbon dioxide      0.052*
                             (0.026)
                             [t = 2.004]
Lag temperature change       -0.308*
                             (0.087)
                             [t = -3.548]
Change year squared          -0.0003
                             (0.0002)
                             [t = -1.208]
(Intercept)                   0.992
                             (0.830)
                             [t = 1.202]
N                           126
R²                            0.110

Standard errors in parentheses
* indicates significance at p < 0.05

giving us false confidence about how precise our understanding of the relationship is.

We can correct for autocorrelation by ρ-transforming the data. This approach produces not only unbiased estimates of β1 (just like OLS) but also correct standard errors of β̂1 (an improvement over OLS). In the ρ-transformation approach, we model the error at time t as a function of ρ times the error in the previous period.

Another, more complicated, challenge in time series data is the possibility that the de-

pendent variable is dynamic, which means that the value of the dependent variable in one

period depends directly on its value in the previous period. Dynamic models include the

lagged dependent variable as an independent variable.


Dynamic models exist in an alternative statistical universe. Coefficient interpretation

has short-term and long-term elements. Autocorrelation creates bias. Including a lagged

dependent variable when we shouldn’t creates bias, too.

As a practical matter, time series analysis can be hard. Very hard. This chapter lays the

foundations, but there is a much larger literature that gets funky fast. In fact, sometimes

the many options can feel overwhelming. Here are some considerations to keep in mind when

dealing with time series data.

• Deal with stationarity. It's often an advanced topic, but it can be a serious problem. If either a dependent or independent variable is nonstationary, one relatively easy fix is to estimate the model using changes (commonly referred to as differenced data).

• It’s probably a good idea to use a lagged dependent variable – and if we do, check for

autocorrelation. Autocorrelation does not cause bias in standard OLS, but when we

include a lagged dependent variable, it can cause bias. If we don’t check for autocorre-

lation ourselves, eventually someone will check it for us. We want to know the answer

before someone else does.

We may reasonably end up estimating a ρ-transformed model, a model with a lagged dependent variable and perhaps a differenced model. How do we know which model is correct? Ideally, all models provide more or less the same result. Whew. All too often, though, they do not. Then we need to conduct diagnostics and also think carefully about the data generating process. Is the data dynamic such that this year's dependent variable depends directly on last year's? If so, we should probably lean toward the results from the model with the lagged dependent variable. If not, then we might lean toward the ρ-transformed result. Sometimes we may simply have to report both and give our honest best sense of which seems more consistent with theory and the data.

After reading and discussing this chapter, we should be able to

• Section 13.1: Define autocorrelation and describe its consequences for OLS.

• Section 13.2: Describe two ways to detect autocorrelation in time series data.

• Section 13.3: Explain the process of ρ-transforming data to address autocorrelation in

time series data.

• Section 13.4: Explain what a dynamic model is and three ways dynamic models differ

from OLS models.

• Section 13.5: Explain stationarity and how nonstationary data can produce spurious

results. Explain how to test for stationarity.

Further Reading

Researchers do not always agree on whether lagged dependent variables should be included

in models. Achen (2000) discusses bias that can occur when lagged dependent variables


are included. Keele and Kelly (2006) present simulation evidence that the bias that occurs

when including a lagged dependent variable is small unless the autocorrelation of errors is

quite large. Wilson and Butler (2007) discuss how the bias is worse for the coefficient on the

lagged dependent variable.

De Boef and Keele (2008) provide a nice discussion of the error correction model, a model

which can accommodate a broad range of time series dynamics into a single model.

Box-Steffensmeier, Freeman, Hitt, and Pevehouse (2014) provide an accessible discussion

of the latest in time series modeling techniques. Wooldridge (2009, chapters 11 and 18)

discusses advanced topics in time series analysis, including stationarity and cointegration.

Stock and Watson (2011) provide an extensive introduction to using time series models to

forecast economic variables.

Key Terms
• AR(1) model (664)
• Augmented Dickey-Fuller test (692)
• Autoregressive model (660)
• Cross-sectional data (658)
• Dickey-Fuller test (691)
• Dynamic model (678)
• Lagged variable (661)
• Spurious regression (686)
• Stationarity (683)


• Time series data (658)


• Unit root (685)

Computing Corner

Stata
1. To detect autocorrelation, proceed in the following steps:
regress Temp Year /* Estimate basic regression model */
predict Err, resid /* Save residuals using resid */
/* subcommand of predict command */
scatter Err Year /* Plot residuals over time */
tsset Year /* Tells Stata which variable indicates time */
reg Err L.Err /* Auxiliary regression; ``L.'' indicates */
/* lagged values (requires tsset command) */
2. To correct for autocorrelation, proceed in two steps:
tsset Year /* Identify time series */
prais Temp Year, corc twostep /* rho-transformed model */
The tsset command informs Stata which variable orders the data chronologically. The prais command (pronounced "price" and named after the originator of the technique) is the main command for estimating rho-transformed models. The subcommands after the comma (corc twostep) tell Stata to handle the first observation as we have here. There are other options described in the Stata help, which can be accessed by typing help prais.
3. Running a dynamic model is simple: Just include a lagged dependent variable. If we have already told Stata which variable indicates time using the tsset command described in part 1 above, then we can simply run reg Y L.Y X1 X2. Or, we can create the lagged dependent variable manually before running the model:
gen LagY = Y[_n-1] /* Generate lagged dependent variable */
reg Y LagY X1 X2 X3
4. To implement an augmented Dickey-Fuller test, use Stata's "dfuller" command, using the "trend" subcommand to include the trend variable and the "lags(1)" subcommand to include the lagged change. The "regress" subcommand displays the regression results underlying the Dickey-Fuller test. Stata automatically displays the relevant critical values for this test.
dfuller Y, trend lags(1) regress


R
1. To detect autocorrelation in R, proceed in the following steps:
ClimateOLS = lm(Temp ~ Year) # Estimate basic regression model
Err = resid(ClimateOLS) # Save residuals
plot(Year, Err) # Plot residuals over time
LagErr = c(NA, Err[1:(length(Err)-1)]) # Generate lagged error variable
LagErrOLS = lm(Err ~ LagErr) # Auxiliary regression
summary(LagErrOLS) # Display results
2. To correct for autocorrelation, proceed in the following steps:
Rho = summary(LagErrOLS)$coefficients[2] # Rho is the estimate of rho-hat
N = length(Temp) # Length of Temp variable
LagTemp = c(NA, Temp[1:(N-1)]) # Create lagged temperature
LagYear = c(NA, Year[1:(N-1)]) # Create lagged year
TempRho = Temp - Rho*LagTemp # Create rho-transformed temperature
YearRho = Year - Rho*LagYear # Create rho-transformed year
ClimateRho = lm(TempRho ~ YearRho) # Estimate rho-transformed model
summary(ClimateRho) # Display results
3. Running a dynamic model is simple: Just include a lagged dependent variable.
ClimateLDV = lm(Temp ~ LagTemp + Year) # Estimate dynamic model with lagged DV
4. We can implement an augmented Dickey-Fuller test by creating the variables in the model and running the appropriate regression. For example,
ChangeTemp = Temp - LagTemp # Create change in temperature
LagChangeTemp = c(NA, ChangeTemp[1:(N-1)]) # Create lag of change in temperature
AugDickeyF = lm(ChangeTemp ~ LagTemp + Year + LagChangeTemp)
summary(AugDickeyF) # Display results

Exercises
1. The Washington Post published data on bike share ridership (measured in trips per
day) over the month of January 2014. Bike share ridership is what we want to explain.
The Post also provided data on daily low temperature (a variable we call lowtemp) and
a dummy variable for weekends. We’ll use these as our explanatory variables. The data
is available in BikeShare.dta.
a. Use an auxiliary regression to assess whether the errors are autocorrelated.


b. Run a model that corrects for AR(1) autocorrelation. Are these results different from
a model in which we do not correct for AR(1) autocorrelation? So that everyone is
on the same page, use the , corc twostep subcommands.
2. These questions revisit the monetary policy data we worked with in Chapter 6 on page
309.
a. Estimate a model of the federal funds rate, controlling for whether the president
was a Democrat, the number of quarters from the last election, an interaction of
the Democrat dummy variable and the number of quarters from the last election,
and inflation. Assess whether there is first order autocorrelation using a plot and an
auxiliary regression.
b. Estimate the model from part (a) using the ρ-transformation approach and interpret
the coefficients.
c. Estimate the model from part (a), but add a variable for the lagged value of the
federal funds rate. Interpret the results and assess whether there is first order auto-
correlation using a plot and an auxiliary regression.
d. Estimate the dynamic model (with a lagged dependent variable) using the ρ-transformation
approach and interpret the coefficients.
3. The file BondUpdate.dta contains data on James Bond films from 1962 to 2012. We
want to know how budget and ratings mattered for how well the movies did at the box
office. Table 13.6 describes the variables.
Table 13.6: Variables for James Bond Movie Questions

Variable name    Description
GrossRev         Gross revenue, measured in millions and adjusted for inflation
Rating           Average rating by viewers on online review sites (IMDB and Rotten Tomatoes) as of April 2013
Budget           Production budget, measured in millions of dollars and adjusted for inflation
Actor            Name of main actor
Order            A variable indicating the order of the movies; we use this variable as our "time" indicator even though movies are not evenly spaced in time

a. Estimate an OLS model in which the amount each film grossed is the dependent
variable and ratings and budgets are the independent variables. Assess whether
there is autocorrelation.
b. Correct for autocorrelation. Did the results change? Did the autocorrelation go
away?


c. Now estimate a dynamic model. What is the short-term and (approximate) long-term
effect of a 1-point increase in rating?
d. Assess the stationarity of the revenue, rating, and budget variables.
e. Estimate a differenced model and explain results.
f. Build from the above models to assess the worth (in terms of revenue) of specific
actors.

CHAPTER 14

ADVANCED OLS

In Chapters 3 through 5 we worked through the OLS model from the basic bivariate model

to a variety of multivariate models. We focused on the practical and substantive issues that

researchers deal with on a daily basis.

It can also be useful to look under the hood to see exactly how things work. That’s what

we do in this chapter. We also go into more detail about omitted variable bias by deriving

the conditions for it to exist in a particular case and discussing how these results generalize.

We derive the OLS estimate of β̂1 in a simplified model and show it is unbiased in Section 14.1. Section 14.2 derives the variance of β̂1, showing how the conditions that errors are homoscedastic and not correlated with each other are necessary for the basic equation for variance of β̂1. Section 14.3 derives the omitted variable bias conditions explained in Chapter


5. Section 14.4 shows how to anticipate the sign of omitted variable bias, a useful tool when

faced with an omitted variable problem. Section 14.5 extends the omitted variable bias

framework to models with multiple independent variables. Things get complicated fast.

However, we can see how the core intuition carries on. Section 14.6 derives the equation for

attenuation bias due to measurement error.

14.1 How to Derive the OLS Estimator and Prove Unbiasedness

The best way to appreciate how the OLS assumptions come together to produce coefficient estimates that are unbiased, consistent, normally distributed, and with a specific standard error equation is to derive the equations for the β̂ estimates. The good news is that the process is really quite cool. The other good news is that it's not that hard. The bad news is, well, math. Two good newses beat one bad news, so off we go.

In this section we derive the equation for β̂1 for a simplified regression model and then show how β̂1 is unbiased if X and ε are not correlated.

Deriving the OLS estimator

We work here with a simplified model that has a variable and coefficient, but no intercept.

This model builds from King, Keohane, and Verba (1994, 98).

Yi = β1Xi + εi    (14.1)


Not having β0 in the model simplifies the derivation considerably while retaining the essential

intuition about how the assumptions matter.1

Our goal is to find the value of β̂1 that minimizes the sum of the squared residuals; this value will produce a line that best fits the scatterplot. The residual for a given observation is

ε̂i = Yi − β̂1Xi

The sum of squared residuals for all observations is

Σ ε̂i² = Σ (Yi − β̂1Xi)²    (14.2)

We want to figure out what value of β̂1 minimizes this sum. Some simple calculus does the trick. A function reaches a minimum or maximum at a point where its slope is flat – that is, where the slope is zero. The derivative is the slope, so we simply have to find the point at which the derivative is zero.2 The process is the following.

1. Take the derivative of Equation 14.2:

   d Σ ε̂i² / dβ̂1 = −Σ 2(Yi − β̂1Xi)Xi

1 We're actually just forcing β0 to be zero, which means that the fitted line goes through the origin. In real life we would virtually never do this; in real life we likely would be working with a multivariate model, too.
2 For any given "flat" spot, we have to figure out if we are at a peak or valley. It is very easy to do this. Simply put, if we are at a peak, our slope should get more negative as X gets bigger (we go downhill); if we are at a minimum, our slope should get bigger as X goes higher. The second derivative measures changes in the derivative, so it has to be negative for a flat spot to be a maximum (and we have to be aware of things like "saddle points" - topics covered in any calculus book).

2. Set the derivative to zero:

   −Σ 2(Yi − β̂1Xi)Xi = 0

3. Divide both sides by −2:

   Σ (Yi − β̂1Xi)Xi = 0

4. Separate the sum into its two additive pieces:

   Σ YiXi − Σ β̂1Xi² = 0

5. Move terms to opposite sides of the equal sign:

   Σ YiXi = Σ β̂1Xi²

6. β̂1 is a constant, so we can pull it out of the summation:

   Σ YiXi = β̂1 Σ Xi²

7. Divide both sides by Σ Xi²:

   Σ YiXi / Σ Xi² = β̂1    (14.3)

Equation 14.3, then, is the OLS estimate for β̂1 in a model with no β0. It looks quite similar to the equation for the OLS estimate of β̂1 in the bivariate model with β0 (which is Equation 3.4 on page 72). The only difference is that here we do not subtract X̄ from X and Ȳ from Y. To derive Equation 3.4 we would do the above steps using Σ ε̂i² = Σ (Yi − β̂0 − β̂1Xi)², where we take the derivative with respect to β̂0 and with respect to β̂1 to produce two equations that we then solve simultaneously.
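A quick numerical check of Equation 14.3 in R (using simulated data; the - 1 in the lm formula suppresses the intercept to match the simplified model):

set.seed(42)
x = rnorm(100)
y = 2*x + rnorm(100)                        # True beta1 = 2, no intercept
sum(y*x)/sum(x^2)                           # Equation 14.3 computed by hand
coef(lm(y ~ x - 1))                         # Same number from lm with no intercept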

Properties of OLS estimates

The estimate β̂1 is a random variable because its equation includes Yi, which we know depends on εi, which is a random variable. Hence β̂1 will bounce around as the values of εi bounce around.

We can use Equation 14.3 to explain the relationship of β̂1 to the true value of β1 by substituting for Yi in the β̂1 equation.

1. Begin with the equation for β̂1:

   β̂1 = Σ YiXi / Σ Xi²

2. Substitute for Yi using Equation 14.1 (which is the simplified model we're using here, in which β0 = 0):

   β̂1 = Σ (β1Xi + εi)Xi / Σ Xi²

3. Distribute Xi in the numerator:

   β̂1 = Σ (β1Xi² + εiXi) / Σ Xi²

4. Separate the sum into additive pieces:

   β̂1 = Σ β1Xi² / Σ Xi² + Σ εiXi / Σ Xi²

5. β1 is constant so we can pull it out of the first sum:

   β̂1 = β1 Σ Xi² / Σ Xi² + Σ εiXi / Σ Xi²

6. This equation characterizes the estimate in terms of the unobserved "true" values of β1 and ε:

   β̂1 = β1 + Σ εiXi / Σ Xi²    (14.4)

In other words, β̂1 is β1 (the true value) plus an ugly fraction with sums of ε and X in it.

From this point, we can show that β̂1 is unbiased. Here we need to show the conditions under which the expected value of β̂1 = β1. In other words, the expected value of β̂1 is the value of β̂1 we would get if we repeatedly regenerated data sets from the original model and calculated the average of all the β̂1s estimated from these multiple data sets. It's not that we would ever do this - in fact, with observational data it is impossible to do so. Instead, thinking of estimating β̂1 from multiple realizations from the true model is a conceptual way for us to think about whether the coefficient estimates on average skew too high, too low, or are just right.

It helps the intuition to note that we could, in principle, generate the expected value of the β̂1s for an experiment if we re-ran the experiment over and over again and calculated the average of the β̂1s estimated. Or, more plausibly, we could run a computer simulation in which we repeatedly regenerated data (which would involve simulating a new εi for each observation for each iteration) and calculated the average of the β̂1s estimated.
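A sketch of such a simulation in R (the true β1, the sample size, and the number of iterations are all made-up illustrative values) looks like this; the average of the β̂1s lands very close to the true β1.

set.seed(7)
NumSims = 5000
TrueBeta1 = 2
x = rnorm(50)                               # X values held fixed across iterations
Beta1Hats = rep(NA, NumSims)
for (s in 1:NumSims) {
  e = rnorm(50)                             # New errors each iteration
  y = TrueBeta1*x + e
  Beta1Hats[s] = sum(y*x)/sum(x^2)          # Equation 14.3
}
mean(Beta1Hats)                             # Very close to the true value of 2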

To show that β̂1 is unbiased we use the formal statistical concept of expected value.

The expected value of a random variable is the value we expect the random variable to be,

on average. (For more discussion, see page 767.)

1. Take expectations of both sides of Equation 14.4:

   E[β̂1] = E[β1] + E[Σ εiXi / Σ Xi²]

2. The expectation of a fixed number is that number, meaning that E[β1] = β1. Recall that in our model, β1 (without the hat) is some number. We don't know it, but it is some number, maybe 2, maybe 0, maybe -0.341. Hence the expectation of β1 is simply whatever number it is. It's like asking what the expectation of the number 2 is. It's 2!

   E[β̂1] = β1 + E[Σ εiXi / Σ Xi²]

3. Use the fact that E[k × g(ε)] = k × E[g(ε)] for constant k and random function g(ε). Here 1/Σ Xi² is a constant (equaling one over whatever the sum of Xi² is) and Σ εiXi is a function of random variables (the εis).

   E[β̂1] = β1 + (1/Σ Xi²) E[Σ εiXi]

4. We can move the expectation operator inside the summation because the expectation of a sum is the sum of expectations:

   E[β̂1] = β1 + (1/Σ Xi²) Σ E[εiXi]    (14.5)

Equation 14.5 means that the expectation of β̂1 is the true value (β1) plus some number, 1/Σ Xi², times the sum of the E[εiXi]s. At this point we use our Very Important Condition, which is the exogeneity condition that εi and Xi are uncorrelated. We show next that this condition is equivalent to saying that E[εiXi] = 0, which means Σ E[εiXi] = 0, which will imply that E[β̂1] = β1, which is what we're trying to show.
E[—ˆ1 ] = —1 , which is what we’re trying to show.

1. If εi and Xi are uncorrelated, then the covariance of εi and Xi is zero because correlation is simply a re-scaled version of covariance:

   correlation(Xi, εi) = covariance(Xi, εi) / √(var(Xi)var(εi))

2. Using the definition of covariance and setting it to zero yields the following, where we refer to the mean of Xi as μX and the mean of the εi distribution as με (the Greek letter μ is pronounced "mew," which rhymes with dew):

   covariance(Xi, εi) = E[(Xi − μX)(εi − με)] = 0

3. Multiplying out the covariance equation yields

   E[Xiεi − Xiμε − μXεi + μXμε] = 0

4. Using the fact that the expectation of a sum is the sum of expectations, we can rewrite the equation as

   E[Xiεi] − E[Xiμε] − E[μXεi] + E[μXμε] = 0

5. Using the fact that με and μX are fixed numbers, we can pull them out of the expectations:

   E[Xiεi] − μεE[Xi] − μXE[εi] + μXμε = 0

6. Here we add an additional assumption that is necessary, but not particularly substantively interesting. We assume that the mean of the error distribution is zero. In other words, we assume με = 0, which is another way of saying that the error term in our model is simply the random noise around whatever the constant is.3 This assumption allows us to cancel any term with με or with E[εi]. In other words, if the exogeneity condition is satisfied and the mean of the error term is zero, then

   E[Xiεi] = 0

If E[Xiεi] = 0, Equation 14.5 tells us that the expected value of β̂1 will be β1. In other words, if the error term and independent variable are uncorrelated, then the OLS estimate
3 In a model that has a non-zero β0, the estimated constant coefficient would absorb any non-zero mean in the error term. For example, if the mean of the error term was actually 5, then the estimated constant would simply be five bigger than what it would be otherwise. Because we so seldom care about the constant term, it's reasonable to think of it simply as including the mean value of any error term.


β̂1 is an unbiased estimator of β1. This same logic carries through in the bivariate model that includes β0 and in multivariate OLS models as well.

Showing that β̂1 is unbiased does not say much about whether any given estimate will be near β1. The estimate β̂1 is a random variable after all and it is possible that some β̂1 will be way too low and that some will be way too high. All that unbiasedness says is that, on average, the β̂1 will not run higher or lower than the true value.

Remember This
1. We derive the β̂1 equation by setting the derivative of the sum of squared residuals equation to zero and solving for β̂1.
2. The key step in showing that β̂1 is unbiased depends on the condition that X and ε are uncorrelated.

14.2 How to Derive the Equation for the Variance of β̂1

In this section we show how to derive an equation for the standard error of β̂. In so doing we see how we use the conditions that errors are homoscedastic and uncorrelated with each other. Importantly, these assumptions are not necessary for unbiasedness of OLS estimates. If these assumptions do not hold, we can still use OLS, but we'll have to do something different (as discussed in Chapter 13, for example) to get the right standard error estimates.

We'll combine two assumptions and some statistical properties of the variance operator to produce a specific equation for the variance of β̂1. We assume that the Xi are fixed numbers


and the εs are random variables.

1. We start with the β̂1 equation (Equation 14.4) and take the variance of both sides:

   var[β̂1] = var[β1 + Σ εiXi / Σ Xi²]

2. Use the fact that the variance of a sum of a constant (the true value β1) and a function of a random variable is simply the variance of the function of the random variable (see variance fact #1 on page 769).

   var[β̂1] = var[Σ εiXi / Σ Xi²]

3. Note that 1/Σ Xi² is a constant (as we noted on page 715, too) and use variance fact #2 from page 769 that the variance of k times a random variable is k² times the variance of that random variable.

   var[β̂1] = (1/Σ Xi²)² var[Σ εiXi]

4. The no-autocorrelation condition (as discussed in Section 3.6 of Chapter 3) means that corr(εi, εj) = 0 for all i ≠ j. If this condition is satisfied, we can treat the variance of a sum as the sum of the variances (using variance fact #4 on page 769 that says that the variance of a sum of uncorrelated random variables equals the sum of the variances of these random variables).

   var[β̂1] = (1/Σ Xi²)² Σ var[Xiεi]

5. Within the summation, re-use variance fact #2 (from page 769).

   var[β̂1] = (1/Σ Xi²)² Σ Xi² var[εi]

6. If we assume homoscedasticity (as discussed in Section 3.6 of Chapter 3), we can make additional simplifications. If the error term is homoscedastic, the variance for each εi is σ², which we can pull out of the summation and cancel.

   var[β̂1] = (1/Σ Xi²)² Σ Xi² σ²
            = σ² Σ Xi² / (Σ Xi²)²
            = σ² / Σ Xi²    (14.6)

7. If we don't assume homoscedasticity, we can use ε̂i² as the estimate for the variance of each observation, yielding a heteroscedasticity-consistent variance estimate.

   var[β̂1] = (1/Σ Xi²)² Σ Xi² ε̂i²    (14.7)

   Equation 14.7 is great in that it provides an appropriate estimate for the variance of β̂1 even when errors are heteroscedastic. However, it is quite unwieldy, making it harder for us to see the intuition about variance as we can with the variance of β̂1 when errors are homoscedastic.
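For concreteness, here is a small R sketch that computes Equation 14.7 for the simplified no-intercept model with simulated heteroscedastic data (an illustration, not a general-purpose routine):

set.seed(99)
x = runif(200, 1, 5)
e = rnorm(200, sd = x)                      # Error spread grows with x (heteroscedastic)
y = 2*x + e
b1 = sum(y*x)/sum(x^2)                      # Equation 14.3
ehat = y - b1*x                             # Residuals
VarHomosk = (sum(ehat^2)/(length(y)-1))/sum(x^2)   # Equation 14.6, with sigma^2 estimated
VarHC = (1/sum(x^2))^2 * sum(x^2 * ehat^2)  # Equation 14.7
sqrt(c(VarHomosk, VarHC))                   # Compare the two standard error estimates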

In this section we have derived the variance of $\hat{\beta}_1$ in our simplified model with no constant (for both homoscedastic and heteroscedastic cases). Equation 14.6 looks quite similar to the variance of the homoscedastic bivariate model with a constant, which we saw on page 95 in Chapter 3. The only difference is that when $\beta_0$ is included in the model, the sum in the denominator is $\sum (X_i - \bar{X})^2$ instead of $\sum X_i^2$. The derivation process is essentially the same and uses the assumptions for the same purposes.

Let’s take a moment to appreciate how amazing it is that we have been able to derive an

equation for the variance of —ˆ1 . With just a few assumptions, we are able to characterize

how precise our estimate of —ˆ1 will be as a function of the variance of ‘ and the Xi values.

The equation for the variance of —ˆ1 in the multivariate model is similar (see Equation 5.9

on page 225), and the intuition discussed here applies for that model as well.

Remember This
1. We derive the equation for the variance of $\hat{\beta}_1$ by calculating the variance of the $\hat{\beta}_1$ equation.
2. If the errors are homoscedastic and not correlated with each other, the variance equation takes a convenient form.
3. If the errors are heteroscedastic or correlated with each other, OLS estimates are still unbiased, but the easy-to-use standard OLS equation for the variance of $\hat{\beta}_1$ is no longer appropriate.
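As a quick numerical check on Equation 14.6, the following R sketch (again with made-up values for the sample size, error standard deviation, and slope) simulates the no-constant model with homoscedastic errors and compares the variance of the simulated $\hat{\beta}_1$ values to $\sigma^2/\sum X_i^2$:

# Simulation sketch: verifying the variance formula in Equation 14.6
set.seed(60)
n     <- 100
sigma <- 3                       # true error standard deviation (illustrative)
beta1 <- 1.5                     # true slope (illustrative)
x     <- runif(n, -2, 2)         # fixed X values

b1.hat <- replicate(10000, {
  y <- beta1 * x + rnorm(n, sd = sigma)
  sum(x * y) / sum(x^2)          # no-constant OLS slope
})

var(b1.hat)           # simulated variance of the estimates
sigma^2 / sum(x^2)    # analytic variance from Equation 14.6; should match closely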

14.3 How to Derive the Omitted Variable Bias Conditions

On page 214 in Chapter 5 we discussed omitted variable bias, an absolutely central concept

in understanding multivariate OLS. In this section we derive the conditions for omitted


variable bias to occur.

Suppose the true model is
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \nu_i \qquad (14.8)$$
where $Y_i$ is the dependent variable, there are two independent variables, $X_{1i}$ and $X_{2i}$, and $\nu_i$ (the Greek letter nu, pronounced "new") is an error term that is not correlated with any of the independent variables. For example, suppose the dependent variable is test scores and the independent variables are class size and family wealth. We assume (for this discussion) that $\nu_i$ is uncorrelated with $X_{1i}$ and $X_{2i}$.

What happens if we omit $X_2$ and estimate the following model?
$$Y_i = \beta_0^{OmitX_2} + \beta_1^{OmitX_2} X_{1i} + \epsilon_i \qquad (14.9)$$
where we use $\beta_1^{OmitX_2}$ to indicate the estimate we get from the model that omits variable $X_2$. How close will $\hat{\beta}_1^{OmitX_2}$ (the coefficient on $X_{1i}$ in Equation 14.9) be to the true value ($\beta_1$ in Equation 14.8)? In other words, will $\hat{\beta}_1^{OmitX_2}$ be an unbiased estimator of $\beta_1$? This situation is common for observational data because we will almost always suspect that we are missing some variables that explain our dependent variable.

The equation for $\hat{\beta}_1^{OmitX_2}$ is the equation for a bivariate slope coefficient (see Equation 3.4 in Chapter 3). It is
$$\hat{\beta}_1^{OmitX_2} = \frac{\sum_{i=1}^N (X_{1i} - \bar{X}_1)(Y_i - \bar{Y})}{\sum_{i=1}^N (X_{1i} - \bar{X}_1)^2} \qquad (14.10)$$


Will $\hat{\beta}_1^{OmitX_2}$ be an unbiased estimator of $\beta_1$? With a simple substitution and a bit of rearranging we can answer this question. We know from Equation 14.8 that the true value of $Y_i$ is $\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \nu_i$. Because the $\beta$s are fixed values, the average of each is simply its value. That is, $\bar{\beta}_0 = \beta_0$ and so forth. Therefore $\bar{Y}$ will be $\beta_0 + \beta_1 \bar{X}_1 + \beta_2 \bar{X}_2 + \bar{\nu}$. Substituting for $Y_i$ and $\bar{Y}$ in Equation 14.10 and doing some rearranging yields
$$\hat{\beta}_1^{OmitX_2} = \frac{\sum (X_{1i} - \bar{X}_1)(\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \nu_i - \beta_0 - \beta_1 \bar{X}_1 - \beta_2 \bar{X}_2 - \bar{\nu})}{\sum (X_{1i} - \bar{X}_1)^2}$$
$$= \frac{\sum (X_{1i} - \bar{X}_1)\bigl(\beta_1 (X_{1i} - \bar{X}_1) + \beta_2 (X_{2i} - \bar{X}_2) + \nu_i - \bar{\nu}\bigr)}{\sum (X_{1i} - \bar{X}_1)^2}$$
Gathering terms and using the fact that $\sum \beta_1 (X_{1i} - \bar{X}_1)^2 = \beta_1 \sum (X_{1i} - \bar{X}_1)^2$ yields
$$\hat{\beta}_1^{OmitX_2} = \beta_1 \frac{\sum (X_{1i} - \bar{X}_1)^2}{\sum (X_{1i} - \bar{X}_1)^2} + \beta_2 \frac{\sum (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{\sum (X_{1i} - \bar{X}_1)^2} + \frac{\sum (X_{1i} - \bar{X}_1)(\nu_i - \bar{\nu})}{\sum (X_{1i} - \bar{X}_1)^2}$$

We then take the expected value of both sides. Our assumption that $\nu$ is uncorrelated with $X_1$ means that the expected value of $\sum (X_{1i} - \bar{X}_1)(\nu_i - \bar{\nu})$ is zero, which causes the last term with the $\nu$s to drop from the equation.4 This leaves us with
$$E[\hat{\beta}_1^{OmitX_2}] = \beta_1 + \beta_2 \frac{\sum (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{\sum (X_{1i} - \bar{X}_1)^2} \qquad (14.11)$$
meaning that the expected value of $\hat{\beta}_1^{OmitX_2}$ is $\beta_1$ plus $\beta_2$ times a messy fraction. In other words, the estimate $\hat{\beta}_1^{OmitX_2}$ will deviate, on average, from the true value, $\beta_1$, by $\beta_2 \frac{\sum (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{\sum (X_{1i} - \bar{X}_1)^2}$.
4 The logic is similar to how we showed on page 717 that if $X$ and $\epsilon$ are uncorrelated, then $E[\sum X_i \epsilon_i] = 0$; in this case, $(X_{1i} - \bar{X}_1)$ is analogous to $X_i$ in the earlier proof and $(\nu_i - \bar{\nu})$ is analogous to $\epsilon_i$ in the earlier proof.


Note that $\frac{\sum (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{\sum (X_{1i} - \bar{X}_1)^2}$ is simply the equation for the estimate $\hat{\delta}_1$ from the following model:
$$X_{2i} = \delta_0 + \delta_1 X_{1i} + \tau_i$$
See, for example, page 72 and note that we have $X_2$ and $\bar{X}_2$ where we had $Y$ and $\bar{Y}$ in the standard bivariate OLS equation.

Therefore, we can conclude that our coefficient estimate $\hat{\beta}_1^{OmitX_2}$ from the model that omitted $X_2$ will be an unbiased estimator of $\beta_1$ if $\beta_2 \hat{\delta}_1 = 0$. This condition is most easily satisfied if $\beta_2 = 0$. In other words, if $X_2$ has no effect on $Y$ (meaning $\beta_2 = 0$), then omitting $X_2$ does not cause our coefficient estimate to be biased. This is excellent news. If it were not true, we would have to include variables that had nothing to do with $Y$ in our model. That would be a horrible way to live.

The other way for $\beta_2 \hat{\delta}_1$ to be zero is for $\hat{\delta}_1$ to be zero, which happens whenever $X_1$ would have a coefficient of zero in a regression in which $X_2$ is the dependent variable and $X_1$ is the independent variable. In short, if $X_1$ and $X_2$ are independent (such that regressing $X_2$ on $X_1$ yields a slope coefficient of zero), then even though we omitted $X_2$ from the model, $\hat{\beta}_1^{OmitX_2}$ will be an unbiased estimate of $\beta_1$, the true effect of $X_1$ on $Y$ (from Equation 14.8). No harm, no foul.

The flip side of these conditions is that when we estimate a model that omits a variable that affects $Y$ (meaning that $\beta_2$ does not equal zero) and is correlated with the included variable, OLS will be biased. The extent of the bias depends on how much the omitted variable explains $Y$ (which is determined by $\beta_2$) and how much the omitted variable is related to the included variable (which is reflected in $\hat{\delta}_1$).

What is the take-away here? Omitted variable bias is a problem if both conditions are met: (1) the omitted variable actually matters ($\beta_2 \neq 0$), and (2) $X_2$ (the omitted variable) is correlated with $X_1$ (the included variable). This shorthand is remarkably useful in evaluating OLS models.

Remember This
The conditions for omitted variable bias can be derived by substituting the true value of $Y$ into the $\hat{\beta}_1$ equation for the model with $X_2$ omitted.
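To see Equation 14.11 at work, here is an illustrative R sketch with simulated data (the coefficients and correlation below are assumptions chosen for the example). It compares the coefficient from the short regression that omits $X_2$ to $\beta_1$ plus $\beta_2$ times the auxiliary slope $\hat{\delta}_1$:

# Simulation sketch: the omitted variable bias formula
set.seed(14)
n  <- 10000
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)              # X2 is correlated with X1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)   # true model: beta1 = 2, beta2 = 3

short  <- coef(lm(y ~ x1))["x1"]       # coefficient when X2 is omitted
delta1 <- coef(lm(x2 ~ x1))["x1"]      # auxiliary regression of X2 on X1

short          # well above 2 because of the omitted variable
2 + 3 * delta1 # beta1 + beta2 * delta1, approximately what we got above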

14.4 Anticipating the Sign of Omitted Variable Bias

It is fairly common that the omitted variable must remain omitted because we simply do

not have a measure of it. In these situations, all is not lost. (A lot is lost, but not all.) We

can use the concepts we have developed so far to work through the implication of omitting

the variable in question. In this section we show how to anticipate the effects of omitting a

variable.

Suppose we are interested in explaining the effect of education on wages. We estimate the model
$$\text{Income}_i = \beta_0 + \beta_1 \text{Education}_i + \epsilon_i \qquad (14.12)$$


where $\text{Income}_i$ is the monthly salary or wages of individual $i$ and $\text{Education}_i$ is the number of years of schooling individual $i$ completed. We are worried, as usual, that there are factors in the error term that are correlated with education.

We worry, for example, that some people are more productive than others (a factor in the error term that affects income) and that productive folks are more likely to get more schooling (school may be easier for them). In other words, we fear the true equation is
$$\text{Income}_i = \beta_0 + \beta_1 \text{Education}_i + \beta_2 \text{Productivity}_i + \epsilon_i \qquad (14.13)$$
where $\text{Productivity}_i$ taps the combination of intelligence, diligence, and maturity that leads person $i$ to add a lot of value to his or her organization. Most data sets will not have a good measure of it. What can we do?

Without the variable, we're stuck, but at least we can figure out whether omitting productivity will push our estimates of the effect of education higher or lower.5 Our omitted variable bias results (such as Equation 14.11) indicate that the bias from omitting productivity depends on the effect that productivity has on the dependent variable ($\beta_2$) and on the relationship between productivity and education, the included variable.

In our example, we believe productivity boosts income ($\beta_2 > 0$). We also believe that there is a positive relationship between education and productivity. Hence, the bias will be positive because it is $\beta_2 > 0$ times the positive relationship between productivity and education. A positive bias implies that omitting productivity induces a positive bias for the education coefficient. In other words, the effect of education on income in a model that does not control for productivity will be overstated. The magnitude of the bias will be related to how strong these two components are. If we think productivity has a huge effect on income and is strongly related to education levels, then the size of the bias is large.

5 Another option is to use panel data that allows us to control for certain unmeasured factors; we do that in Chapter 8. Or we can try to find exogenous variation in education (variation in education that is not due to differences in productivity); that's what we do in Chapter 9.

In this example, this bias would lead us to be skeptical of a result from a model like Equation 14.12 that omits productivity. In particular, if we were to find that $\hat{\beta}_1$ is greater than zero, we would worry that the omitted variable bias has inflated the estimate. On the other hand, if the results showed that education did not matter or had a negative coefficient, we would be more confident in our results because the bias would on average make the results larger than the true value, not smaller.

This line of reasoning is called "signing the bias" and would lead us to treat the estimated effects based on Equation 14.12 as an upper bound on the likely effects of education on income.

Table 14.1 summarizes the relationship for the simple case of one omitted variable. If $X_2$, the omitted variable, has a positive effect on $Y$ (meaning $\beta_2 > 0$) and $X_2$ and $X_1$ are positively correlated, then a model with only $X_1$ will produce a coefficient on $X_1$ that is biased upward: The estimate will be too big because some of the effect of unmeasured $X_2$ will be absorbed by the variable $X_1$.


Table 14.1: Effect of Omitting X2 on Coefficient Estimate for X1

Correlation of              Effect of omitted variable on Y ($\beta_2$)
X1 and X2                   $\beta_2 > 0$              $\beta_2 = 0$    $\beta_2 < 0$
> 0                         Overstate coefficient      No bias          Understate coefficient
0                           No bias                    No bias          No bias
< 0                         Understate coefficient     No bias          Overstate coefficient

Cell entries show the sign of the bias for an omitted variable bias problem in which a single variable ($X_2$) is omitted. The true equation is Equation 14.8 and the estimated model is Equation 14.9. If $\beta_2 > 0$ and $X_1$ and $X_2$ are positively correlated, $E[\hat{\beta}_1^{OmitX_2}]$ (the expected value of the coefficient on $X_1$ from a model that omits $X_2$) will be larger than the actual value of $\beta_1$.

Remember This
We can use the equation for omitted variable bias to anticipate the effect of omitting
a variable on the coefficient estimate for an included variable.
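A small simulation can make the signing-the-bias logic concrete. In the sketch below (simulated, illustrative data; the variable names and coefficients are assumptions, not estimates from real wage data), productivity boosts income and is positively related to education, so the regression that omits productivity overstates the education effect:

# Simulation sketch: signing the bias in the education and income example
set.seed(144)
n            <- 5000
productivity <- rnorm(n)
education    <- 12 + 2 * productivity + rnorm(n)      # productive people get more schooling
income       <- 100 * education + 300 * productivity + rnorm(n, sd = 500)

coef(lm(income ~ education))["education"]                  # overstated: well above the true 100
coef(lm(income ~ education + productivity))["education"]   # close to 100 once productivity is controlled for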


Discussion Questions
1. Suppose we are interested in knowing how much social media affect people's income. Suppose also that Facebook provided us data on how much time each individual spent on Facebook during work hours. The model is
$$\text{Income}_i = \beta_0 + \beta_1 \text{Facebook hours}_i + \epsilon_i$$
What is the implication of not being able to measure innate productivity for our estimate of $\beta_1$?
2. Suppose we are interested in knowing the effect of campaign spending on election outcomes.
$$\text{Vote share}_i = \beta_0 + \beta_1 \text{Campaign spending}_i + \epsilon_i$$
We believe that the personal qualities of a candidate also matter. Some are more charming and/or hard-working than others, which may lead them to better election results. What is the implication of not being able to measure "candidate quality" (which captures how charming and hard-working candidates are) for our estimate of $\beta_1$?

14.5 Omitted Variable Bias with Multiple Variables

Our omitted variable discussion in Section 5.2 was based on a case in which the true model

had two variables and a single variable was omitted. In this section we show how things are

more complicated when there are additional variables.

Suppose the true model has three independent variables
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \nu_i \qquad (14.14)$$


and that we estimate a model that omits variable $X_3$:
$$Y_i = \beta_0^{OmitX_3} + \beta_1^{OmitX_3} X_{1i} + \beta_2^{OmitX_3} X_{2i} + \epsilon_i \qquad (14.15)$$
Assuming that the error in the true model ($\nu$) is not correlated with any of the independent variables, the expected value for $\beta_1^{OmitX_3}$ is
$$E[\hat{\beta}_1^{OmitX_3}] = \beta_1 + \beta_3 \frac{r_{31} - r_{21} r_{32}}{1 - r_{21}^2} \sqrt{\frac{V_3}{V_1}} \qquad (14.16)$$
where $r_{31}$ is the correlation of $X_3$ and $X_1$, $r_{21}$ is the correlation of $X_2$ and $X_1$, $r_{32}$ is the correlation of $X_3$ and $X_2$, and $V_3$ and $V_1$ are the variances of $X_3$ and $X_1$, respectively.

Clearly, there are more moving parts in this case than the case we discussed earlier.

Equation 14.16 contains commonalities with the simpler omitted variable bias example we discussed in Section 5.2. The effect of the omitted variable in the true model looms large. Here $\beta_3$ is the effect of the omitted variable $X_3$ on $Y$ and it plays a central role in the bias term. If $\beta_3$ is zero, there is no omitted variable bias because the crazy fraction will be multiplied by zero and thereby disappear. As with the simpler omitted variable bias case, omitting a variable only causes bias if that variable actually affects $Y$.

The bias term has more factors, however. The $r_{31}$ term is the correlation of the excluded variable ($X_3$) and the first variable ($X_1$). It is the first term in the numerator of the bias term, playing a similar role as the correlation of the excluded and included variables in the simpler model. The complication now is that the correlation of the two included variables ($r_{21}$) and the correlation of the omitted variable and the other included variable ($r_{32}$) also matter.


We can take away some simple principles. If the included independent variables are not correlated (which would mean that $r_{21} = 0$), then the equation simplifies to essentially what we were dealing with in the simple case. If the excluded variable is not correlated with the other included variable ($r_{32} = 0$), we again can go back to the intuition from the simple omitted variable bias model. If, however, both of these correlations are non-zero (and, to be practical, relatively large), then the simple case intuition may not travel well and we should tread carefully. We'll still be worried about omitted variable bias, but our ability to sign the bias will be weakened.

Remember This
When there are multiple variables in the true equation, the effect of omitting one of
them depends in a complicated way on the interrelations of all variables.
1. As in the simpler model, if the omitted variable does not affect Y , then there is
no omitted variable bias.
2. The equation for omitted variable bias when the true equation has only two
variables often provides a reasonable approximation of the effects.
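As a check on Equation 14.16, the following R sketch (simulated data with made-up coefficients) compares the coefficient on $X_1$ from a regression that omits $X_3$ to the value the formula predicts from the correlations and variances:

# Simulation sketch: omitted variable bias with multiple included variables
set.seed(145)
n  <- 100000
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)
x3 <- 0.4 * x1 + 0.3 * x2 + rnorm(n)
y  <- 1 + 2 * x1 + 1.5 * x2 + 3 * x3 + rnorm(n)   # beta1 = 2, beta3 = 3

omitted <- coef(lm(y ~ x1 + x2))["x1"]            # model that omits X3
r31 <- cor(x3, x1); r21 <- cor(x2, x1); r32 <- cor(x3, x2)
predicted <- 2 + 3 * (r31 - r21 * r32) / (1 - r21^2) * sqrt(var(x3) / var(x1))

omitted      # estimate from the model omitting X3
predicted    # value implied by Equation 14.16; the two should be very close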


14.6 Omitted Variable Bias Due to Measurement Error

We discussed measurement error in Section 5.3 of Chapter 5. Here we derive the equation

for attenuation bias due to measurement error in an independent variable for the case where

there is one independent variable. We also discuss implications of measurement error when

there are multiple variables.

Model with one independent variable

We start with a true model based on the actual value of the independent variable, which we denote with $X_{1i}^*$:
$$Y_i = \beta_0 + \beta_1 X_{1i}^* + \epsilon_i \qquad (14.17)$$

The independent variable we observe has some error:
$$X_{1i} = X_{1i}^* + \nu_i \qquad (14.18)$$
where we assume that $\nu_i$ is uncorrelated with $X_{1i}^*$. This little equation will do a lot of work for us in helping us understand the effect of measurement error.

Substituting for $X_{1i}^*$ in the true model yields
$$Y_i = \beta_0 + \beta_1 (X_{1i} - \nu_i) + \epsilon_i = \beta_0 + \beta_1 X_{1i} - \beta_1 \nu_i + \epsilon_i \qquad (14.19)$$

Let's treat $\nu$ as the omitted variable and $-\beta_1$ as the coefficient on the omitted variable.


(Compare these to $X_2$ and $\beta_2$ in Equation 5.7 in Section 5.2.) Doing so allows us to write the omitted variable bias equation as
$$\beta_1^{OmitX_2} = \beta_1 - \beta_1 \frac{\mathrm{cov}(X_1, \nu)}{\mathrm{var}(X_1)} \qquad (14.20)$$
where we use the covariance-based equation from page 91 to calculate $\delta_1$ in the standard omitted variable equation.

Using the fact that $X_{1i} = X_{1i}^* + \nu_i$ and the rules for covariance from page 770, we can show that $\mathrm{cov}(X_1, \nu) = \sigma_\nu^2$.6 Also, because $X_1 = X_1^* + \nu$, $\mathrm{var}(X_1)$ equals $\sigma_{X_1^*}^2 + \sigma_\nu^2$.

We can therefore rewrite Equation 14.20 as
$$\beta_1^{OmitX_2} = \beta_1 - \beta_1 \frac{\sigma_\nu^2}{\sigma_{X_1^*}^2 + \sigma_\nu^2} \qquad (14.21)$$

Collecting terms yields
$$\mathrm{plim}\, \hat{\beta}_1 = \beta_1 \left(1 - \frac{\sigma_\nu^2}{\sigma_\nu^2 + \sigma_{X_1^*}^2}\right)$$
Finally, we use the fact that $1 - \frac{\sigma_\nu^2}{\sigma_\nu^2 + \sigma_{X_1^*}^2} = \frac{\sigma_{X_1^*}^2}{\sigma_\nu^2 + \sigma_{X_1^*}^2}$ to produce
$$\mathrm{plim}\, \hat{\beta}_1 = \beta_1 \frac{\sigma_{X_1^*}^2}{\sigma_\nu^2 + \sigma_{X_1^*}^2}$$
which is the equation we discuss in detail in Section 5.3.


6 First, note that $\mathrm{cov}(X_1, \nu) = \mathrm{cov}(X_1^* + \nu, \nu) = \mathrm{cov}(X_1^*, \nu) + \mathrm{cov}(\nu, \nu) = \mathrm{cov}(\nu, \nu)$ because $\nu$ is not correlated with $X_1^*$. Finally, note that $\mathrm{cov}(\nu, \nu) = \sigma_\nu^2$ by standard rules of covariance.


Measurement error with multiple independent variables

We have so far dealt with a bivariate regression with a single, poorly measured independent

variable for which the error is a mean-zero random variable uncorrelated with anything

else. If we have multiple independent variables and a single badly measured variable, it is

still the case that the coefficient on the poorly measured independent variable will suffer

from attenuation bias. The other coefficients will also suffer, although in a way that is

hard to anticipate. As a general practice, this source of measurement-related bias is seldom

emphasized in real applications.

Remember This
1. We can derive the effect of a poorly measured independent variable using omitted
variable logic.
2. A single poorly measured independent variable can cause other coefficients to be
biased.
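The attenuation result is easy to see in a simulation. This R sketch (all values are illustrative assumptions) adds mean-zero noise to the true regressor and compares the resulting slope estimate to the attenuation formula $\beta_1 \sigma^2_{X_1^*}/(\sigma^2_{X_1^*} + \sigma^2_\nu)$:

# Simulation sketch: attenuation bias from measurement error in X
set.seed(146)
n      <- 100000
x.star <- rnorm(n, sd = 2)              # true variable, variance 4
nu     <- rnorm(n, sd = 1)              # measurement error, variance 1
x.obs  <- x.star + nu                   # the mismeasured variable we actually observe
y      <- 3 + 1.5 * x.star + rnorm(n)   # true slope is 1.5

coef(lm(y ~ x.obs))["x.obs"]     # attenuated: roughly 1.5 * 4 / (4 + 1) = 1.2
coef(lm(y ~ x.star))["x.star"]   # the correctly measured variable recovers 1.5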

14.7 Conclusion

OLS goes a long way with just a few assumptions about the model and the error terms.

Exogeneity gets us unbiased estimates. Homoscedasticity and non-correlated errors get us

an equation for the variance of our estimates.

How important is it to be able to know exactly how these assumptions come together to

provide all this good stuff? On a practical level, not very. We can go about most of our


statistical business without knowing how to derive these results.

On a deeper level, though, it is useful to know how the assumptions matter. The statistical properties of OLS are not magic. They're not even that hard, once we break the

derivations down step-by-step. The assumptions we rely on play specific roles in figuring out

the properties of our estimates, as we have seen in the derivations in this chapter. We also

formalized our understanding of omitted variable bias, helping us know where conditions

come from and if and how they apply to various circumstances.

We don’t need to be able to produce all the derivations from scratch. If we know the

following, we will have a better understanding of the statistical foundations of OLS.

• Section 14.1: Explain the steps in deriving the equation for the OLS estimate $\hat{\beta}_1$. What assumption is crucial for $\hat{\beta}_1$ to be an unbiased estimator of $\beta_1$?

• Section 14.2: What assumptions are crucial to derive the standard equation for the variance of $\hat{\beta}_1$?

• Section 14.3: Show how to derive the omitted variable bias equation.

• Section 14.4: Show how to use the omitted variable bias equation to “sign the bias.”

• Section 14.5: Explain how omitted variable bias works when there are multiple variables

in the true model.

• Section 14.6: Show how to use omitted variable bias tools to characterize the effect of

measurement error.


Further Reading

See Clarke (2005) for further details on omitted variables. Greene (2003, 148) offers a

simple generalization using matrix notation.

Greene (2003, 86) discusses the implications of measurement error when there are multiple

independent variables in the model. Cragg (1994) provides an accessible overview of problems

raised by measurement error and strategies for dealing with them.

CHAPTER 15

ADVANCED PANEL DATA

In Chapter 8 we used fixed effects in panel data models to control for unmeasured factors

that are fixed within units. We did so by including dummy variables for the units or by

re-scaling the data. We can also control for many time factors by including fixed effects for

time periods.

The models get more complicated when we start thinking about more elaborate dependence across time. We face a major choice of whether we want to treat serial dependence in terms of serially correlated errors or in terms of dynamic models in which the value of $Y_t$ depends directly on the value of $Y$ in the previous period. These two approaches lead to

different modeling choices and, in some cases, different results.

In this chapter, we introduce these approaches and discuss how they connect to the panel


data analysis we covered in Chapter 8. Section 15.1 shows how to deal with autocorrelation

in panel data models. Section 15.2 introduces dynamic models for panel data analysis.

Section 15.3 presents an alternative to fixed effects models called random effects models.

Random effects models treat unit-specific error as something that complicates standard error

calculations but does not cause bias. They’re not as useful as fixed effects models, but it

can be helpful to understand how they work.

15.1 Panel Data Models with Serially Correlated Errors

In panel data, it would make sense to worry about autocorrelation for the same reasons it

would make sense to worry about autocorrelation in time series data. Remember all the

stuff in the error term? Lots of that will stick around for a while. Unmeasured factors in

year 1 may linger to affect what is going on in year 2 and so on. In this section we explain

how to deal with autocorrelation in panel models, first without fixed effects and then with

fixed effects.

Before we get into diagnosing and addressing the problem, we need to remind ourselves of

the stakes: Autocorrelation does not cause bias in the standard OLS framework, but it does

cause OLS estimates of standard errors to be incorrect. Often, it causes the OLS estimates

of standard errors to be too small because we don’t really have the number of independent

observations that OLS thinks we do.


Autocorrelation without fixed effects

We start with a model without fixed effects. The model is
$$Y_{it} = \beta_0 + \beta_1 X_{1it} + \beta_2 X_{2it} + \epsilon_{it}$$
$$\epsilon_{it} = \rho \epsilon_{i,t-1} + \nu_{it}$$
where $\nu_{it}$ (the Greek letter nu) is a mean-zero, random error term that is not correlated with the independent variables. There are $N$ units and $T$ time periods in the panel data set. We limit ourselves to first-order autocorrelation (the error this period is a function of the error last period). The tools we discuss generalize pretty easily to higher orders of autocorrelation.1

Estimation is relatively simple. First, we estimate the model using standard OLS. We then use the residuals from the OLS model to test for signs of autocorrelated errors. We can do so because OLS $\hat{\beta}$ estimates are unbiased even if errors are autocorrelated, which means that the residuals (which are functions of the data and $\hat{\beta}$) are unbiased estimates, too.

We test for autocorrelated errors in this context using something called a Lagrange Multiplier (LM) test. The LM test is similar to our test for autocorrelation in Chapter 13 on page 666. It involves estimating the following
$$\hat{\epsilon}_{it} = \rho \hat{\epsilon}_{i,t-1} + \gamma_1 X_{1it} + \gamma_2 X_{2it} + \eta_{it}$$


1 A second order autocorrelated process would have the error in period $t$ correlated with the error in period $t-2$ and so on.


where $\eta_{it}$ (the Greek letter eta) is a mean-zero, random error term. We use the fact that $N \times R^2$ from this auxiliary regression is distributed $\chi^2_1$ (the Greek letter chi, pronounced "kai") under the null hypothesis of no autocorrelation.

If the LM test indicates that there is autocorrelation, we will estimate an AR(1) model using the $\rho$-transformation techniques we discussed in Section 13.3 of Chapter 13.

Autocorrelation with fixed effects

To test for autocorrelation in a panel data model that has fixed effects we must deal with a slight wrinkle. The fixed effects induce correlation in the de-meaned errors even when there is no correlation in the actual errors. The error term in the de-meaned model is $(\epsilon_{it} - \bar{\epsilon}_i)$, which means that the de-meaned error for unit $i$ will include the mean of the error terms for unit $i$ ($\bar{\epsilon}_i$), which in turn means that $\frac{1}{T}$ of any given error term will appear in all error terms. So, for example, $\epsilon_{i1}$ (the raw error in the first period) is in the first de-meaned error term, the second de-meaned error term, and so on via the $\bar{\epsilon}_i$ term. The result will be at least a little autocorrelation because the de-meaned error terms in the first and second periods, for example, will move together at least a little bit because they both have some of the same terms.

To test for AR(1) errors, run a model with the residuals from the fixed effects model, $\hat{\epsilon}_{it} = \rho \hat{\epsilon}_{i,t-1} + \gamma_1 X_{1it} + \gamma_2 X_{2it} + \eta_{it}$, and use robust standard errors.


Remember This
To estimate panel models that account for autocorrelated errors, proceed in the following steps:
1. Estimate an initial model that does not address autocorrelation. This model can be either an OLS model or a fixed effects model.
2. Use residuals from the initial model to test for autocorrelation using a Lagrange Multiplier test that is based on the $R^2$ from the following model:
$$\hat{\epsilon}_{it} = \rho \hat{\epsilon}_{i,t-1} + \gamma_1 X_{1it} + \gamma_2 X_{2it} + \eta_{it}$$
If the model includes fixed effects, the coefficient and residual estimates are biased, although the bias decreases as $T$ increases.
3. If we reject the null hypothesis of no autocorrelation (which will happen when the $R^2$ in the above equation is high), then we should remove the autocorrelation by $\rho$-transforming the data as discussed in Chapter 13.
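Here is an illustrative R sketch of these steps (the data frame dat and its variables y, x1, x2, unit, and year are placeholders, not data from the text), with the fixed effects entered as unit dummies as in Chapter 8:

# Sketch of the LM test for AR(1) errors in a panel model (placeholder data frame dat)
dat <- dat[order(dat$unit, dat$year), ]                   # sort so within-unit lags line up

initial <- lm(y ~ x1 + x2 + factor(unit), data = dat)     # initial fixed effects model
dat$res     <- resid(initial)
dat$res.lag <- ave(dat$res, dat$unit,                     # lag residuals within each unit
                   FUN = function(r) c(NA, head(r, -1)))

aux <- lm(res ~ res.lag + x1 + x2, data = dat)            # auxiliary regression
lm.stat <- length(unique(dat$unit)) * summary(aux)$r.squared  # N x R-squared, as in the text
pchisq(lm.stat, df = 1, lower.tail = FALSE)               # compare to a chi-squared with 1 df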

15.2 Temporal Dependence with a Lagged Dependent Variable

We can also model temporal dependence with the dynamic models we discussed in Section 13.4. In these models the current value of $Y_{it}$ could depend directly on $Y_{i,t-1}$, the value of $Y$ in the previous period.

These models are sneakily complex. They seem easy because they simply require us to

include a lagged dependent variable in an OLS model. They actually have many knotty

aspects that differ from standard OLS models. In this section we discuss dynamic models

for panel data, first without fixed effects and then with fixed effects.


Lagged dependent variable without fixed effects

We begin with a panel model without fixed effects. Specifically,
$$Y_{it} = \gamma Y_{i,t-1} + \beta_0 + \beta_1 X_{1it} + \beta_2 X_{2it} + \epsilon_{it} \qquad (15.1)$$
where $\gamma$ (the Greek letter gamma) is the effect of the lagged dependent variable, the $\beta$s are the immediate effects of the independent variables, and $\epsilon_{it}$ is uncorrelated with the independent variables and homoscedastic.

We see how tricky this model is once we try to characterize the effect of $X_{1it}$ on $Y_{it}$. Obviously, if $X_{1it}$ increases by one unit, there will be a $\beta_1$ increase in $Y_{it}$ that period. Notice, though, that an increase in $Y_{it}$ in one period affects $Y_{it}$ in future periods via the $\gamma Y_{i,t-1}$ term in the model. Hence increasing $X_{1it}$ in the first period, for example, will affect the value of $Y_{it}$ in the first period, which will then affect $Y$ in the next period. In other words, if we change $X_{1it}$ we get not only $\beta_1$ more $Y_{it}$ but also $\gamma \times \beta_1$ more $Y$ in the next period, and so on. A change in $X_{1it}$ today thus dribbles on to affect $Y$ forever through the lagged dependent variable in Equation 15.1.
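To put numbers on this, suppose (purely for illustration) that $\gamma = 0.5$ and $\beta_1 = 2$. A one-unit increase in $X_{1it}$ in a single period raises $Y$ by 2 immediately, by $0.5 \times 2 = 1$ the next period, by $0.25 \times 2 = 0.5$ the period after that, and so on; when $|\gamma| < 1$ these effects sum to $\beta_1/(1-\gamma) = 4$. A two-line R check of that arithmetic:

gamma <- 0.5; beta1 <- 2           # illustrative values
sum(beta1 * gamma^(0:100))         # cumulative effect, essentially beta1 / (1 - gamma) = 4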

As a practical matter, including a lagged dependent variable is a double-edged sword. On the one hand, it is typically highly significant, which is good news if we have a control variable that soaks up variance unexplained by other variables. On the other hand, the lagged dependent variable can be too good: so highly significant that it sucks the significance out of the other independent variables. In fact, if there is serial autocorrelation and trending in the independent variable, including a lagged dependent variable causes bias. In such a

in the independent variable, including a lagged dependent variable causes bias. In such a

case, Princeton political scientist Chris Achen (2000) has noted that the lagged dependent

variable

does not conduct itself like a decent, well-behaved proxy. Instead it is a kleptoma-

niac, picking up the effect, not only of excluded variables, but also of the included

variables if they are sufficiently trended. As a result, the impact of the included

substantive variables is reduced, sometimes to insignificance.

This conclusion does not mean that lagged dependent variables are evil, but rather that

we should tread carefully when including them. In particular, we should estimate models

with and without them. If results differ substantially, we should be sure to go through all

the tests and logic described below when deciding to place more weight on the model with

or without the lagged dependent variable.

The good news is that if the errors are not autocorrelated, using OLS for a model with

lagged dependent variables works fine. Given that the lagged dependent variable commonly

soaks up any serial dependence in the data, this approach is reasonable and widely used.2

If the errors are autocorrelated, however, OLS will produce biased $\hat{\beta}$ estimates when a lagged dependent variable is included. In this case, autocorrelation does more than render conventional OLS standard error estimates inappropriate. Autocorrelation in models with lagged dependent variables actually messes up the estimates. This bias is worth mulling
2 See Beck and Katz (2011).


over a bit. It happens because models with lagged dependent variables are outside of the

conventional OLS framework. Hence even though autocorrelation does not cause bias in

OLS models, autocorrelation can cause bias in dynamic models.

Why does autocorrelation cause bias in a model when we include a lagged dependent variable? It's pretty easy to see: $Y_{i,t-1}$ of course contains $\epsilon_{i,t-1}$, and if $\epsilon_{i,t-1}$ is correlated with $\epsilon_{it}$ (which is exactly what first-order autocorrelation implies), then one of the independent variables in Equation 15.1 ($Y_{i,t-1}$) will be correlated with the error.

This problem is not particularly hard to deal with. Suppose there is no autocorrelation.

In that case, OLS estimates are unbiased, meaning that the residuals from the OLS model

are consistent too. We can therefore use these residuals in a Lagrange Multiplier test like

the one we described earlier on page 739. If we fail to reject the null hypothesis (which is

quite common, because lagged dependent variables often zap autocorrelation), then OLS it

is. If we reject the null hypothesis of no autocorrelation, then we can use an AR(1) model

like the one discussed in Chapter 13 to rid the data of autocorrelation and thereby get us

back to unbiased and consistent estimates.

Lagged dependent variable with fixed effects

The lagged dependent variable often captures the unit-specific variance that fixed effects

capture. Hence it is not uncommon to see lagged dependent variables used in place of fixed

effects. Sometimes we may want both in our model, so we therefore move on to consider


dynamic models with fixed effects.

Beware! Things get complicated when we include a lagged dependent variable and fixed

effects in the same model.

Here's the model:
$$Y_{it} = \gamma Y_{i,t-1} + \beta_0 + \beta_1 X_{1it} + \beta_2 X_{2it} + \alpha_i + \epsilon_{it}$$
where $\epsilon_{it}$ is uncorrelated with the independent variables.

OLS is biased in this situation. Bummer. Recall from Section 8.2 that fixed effects models are equivalent to de-meaned estimates. That means a fixed effects model with a lagged dependent variable will include a variable $(Y_{i,t-1} - \bar{Y}_{i,t-1})$. The $\bar{Y}_{i,t-1}$ part of this variable is the average of the lagged dependent variable over all periods. This average will therefore include the value of $Y_{it}$ which, in turn, contains $\epsilon_{it}$. Hence, the de-meaned lagged dependent variable will be correlated with $\epsilon_{it}$. The extent of this bias depends on the magnitude of this correlation, which is proportional to $\frac{1}{T}$, where $T$ is the length of the time series for each observation (often the number of years of data). For a small panel with just 2 or 3 periods, the bias can be serious. For a panel with 20 or more periods, the problem is less serious. One piece of good news here is that the bias in a model with a lagged dependent variable and fixed effects is worse for the coefficient on the lagged dependent variable; simulation studies indicate that the bias is modest for the coefficients on the $X_{it}$ variables, the variables we usually care most about.


Two ways to estimate dynamic panel data models with fixed effects

What to do? One option is to follow instrumental variable (IV) logic. We cover instrumental

variables in Chapter 9. In this context the IV approach relies on finding some variable

that is correlated with the independent variable in question and not correlated with the

error. Most IV approaches rely on using lagged values of the independent variables, which

are typically correlated with the independent variable in question but not correlated with

the error, because the error is something that happens later. The Arellano and Bond (1991) approach, for example, uses all available lags as instruments. These models are quite complicated and, like many IV models, imprecise.

Another option is to use OLS, accepting some bias in exchange for better accuracy and less complexity. While we have talked a lot about bias, we have not yet discussed the trade-off between bias and accuracy, largely because in basic models such as OLS, unbiased models are also the most accurate, so we don't have to worry about the trade-off. But in more complicated models, it is possible to have an estimator that produces coefficients that are biased but still pretty close to the true value. It is also possible to have an estimator that is unbiased, but very imprecise. IV estimators are in the latter category: they are, on average, going to get us the true value, but they have higher variance.

Here's a goofy example of the trade-off between bias and accuracy. Consider two estimators of average height in the United States. The first is the height of a single person randomly sampled. This estimator is unbiased; after all, the average of this estimator will


have to be the average of the whole population. But clearly this estimator isn’t very precise

because it is based on a single person.

The second estimator of average height in the United States is the average height of

500 randomly selected people, but measured with a measuring stick that is inaccurate by a

quarter of an inch (making every measurement a quarter inch too big).3 Which estimate of

average height would we rather have? The second one may well make up what it loses in

bias by being more precise. That’s the situation here because the OLS estimate is biased,

but more precise than the IV estimates.

Neal Beck and Jonathan Katz (2011) have run a series of simulations of several options

for estimating models with lagged dependent variables and fixed effects. They find that OLS

performs better in terms of actually being more likely to produce estimates close to the true

value than the IV approach, even though OLS estimates are a bit biased. The performance

of OLS models improves relative to the IV approach as T increases.

H.L. Mencken said that for every problem there is a solution that is simple, neat, and

wrong. Usually that’s a devastating critique. Here it is a compliment. OLS is simple. It is

neat. And, yet, it is wrong in the sense of being biased when we have a lagged dependent

variable and fixed effects. But OLS is more accurate (meaning the variance of $\hat{\beta}_1$ is smaller) than the alternatives, which nets out to a pretty good approach.

3 Yes, yes, we could subtract the quarter of an inch from all the height measurements. Work with me here. We're trying to make a point!


Remember This
1. Researchers often include lagged dependent variables to account for serial dependence. A model with a lagged dependent variable is called a dynamic model.
(a) Dynamic models differ from conventional OLS models in many respects.
(b) In a dynamic model, a change in $X$ has an immediate effect on $Y$, but also has an ongoing effect on future $Y$s because any change in $Y$ associated with a change in $X$ will affect future values of $Y$ via the lagged dependent variable.
(c) If there are no fixed effects in the model and there is no autocorrelation, then using OLS for a model with a lagged dependent variable produces unbiased coefficient estimates.
(d) If there are no fixed effects in the model and there is autocorrelation, the autocorrelation must be purged from the data in order to generate unbiased estimates.
2. OLS estimates from models with both a lagged dependent variable and fixed effects are biased.
(a) One alternative to OLS is to use an instrumental variables approach. This approach produces unbiased estimates, but is complicated and produces imprecise estimates.
(b) OLS is useful to estimate a model with a lagged dependent variable and fixed effects.
• The bias is not severe and decreases as $T$, the number of observations for each unit, increases.
• OLS in this context produces relatively precise parameter estimates.
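The following R sketch (an illustrative simulation with made-up parameter values, not results from the text) shows the flavor of these conclusions: with a lagged dependent variable and unit fixed effects, OLS noticeably understates $\gamma$ when $T$ is small, the coefficient on $X$ stays close to its true value, and the bias shrinks as $T$ grows:

# Simulation sketch: bias from combining a lagged dependent variable with fixed effects
set.seed(15)
simulate_panel <- function(N = 50, T = 5, gamma = 0.5, beta1 = 2) {
  dat <- expand.grid(unit = 1:N, time = 1:(T + 1))
  dat <- dat[order(dat$unit, dat$time), ]
  alpha <- rnorm(N)[dat$unit]                    # unit-specific effects
  dat$x <- rnorm(nrow(dat))
  dat$y <- NA
  for (i in seq_len(nrow(dat))) {                # build Y recursively within each unit
    y.lag    <- if (dat$time[i] == 1) 0 else dat$y[i - 1]
    dat$y[i] <- gamma * y.lag + beta1 * dat$x[i] + alpha[i] + rnorm(1)
  }
  dat$y.lag <- ave(dat$y, dat$unit, FUN = function(y) c(NA, head(y, -1)))
  dat <- subset(dat, time > 1)                   # drop the start-up period
  coef(lm(y ~ y.lag + x + factor(unit), data = dat))[c("y.lag", "x")]
}

# Averages across simulations: the estimate of gamma falls well below 0.5 when T = 5
# and gets much closer when T = 25; the estimate on x stays near 2 in both cases.
rowMeans(replicate(200, simulate_panel(T = 5)))
rowMeans(replicate(200, simulate_panel(T = 25)))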

15.3 Random Effects Models

The term “fixed effects” is used to distinguish from “random effects.” In this section we

present an overview of random effects models and discuss when they can be used.

In a random effects model, the unit-specific error term is itself considered a random


variable. Instead of eliminating or estimating the $\alpha_i$ as is done in fixed effects models, random effects models leave the $\alpha_i$s in the error term and account for them when calculating standard errors. We won't cover the calculations here other than to note that they can get a bit tricky.

An advantage of random effects models is that they estimate coefficients on variables that

do not vary within unit (the kind of variables that get dropped in fixed effects models). This

possibility contrasts with fixed effect models, which cannot estimate coefficients on variables

that do not vary within unit (as discussed on page 391).

The disadvantage of random effects models is that the random effects estimates are unbiased only if the random effects (the $\alpha_i$) are uncorrelated with the $X$. The core challenge in OLS (which we discussed at length earlier) is that the error term is correlated with the independent variable; this problem continues with random effects models, which address correlation of errors across observations, but not correlation of errors and independent variables. Hence, random effects models fail to take advantage of a major attraction of panel data, which is that we can deal with the possible correlation between unit-specific effects and independent variables that might cause spurious inferences regarding the independent variables.

A statistical test called a Hausman test compares random and fixed effects models. Once we understand this test, we can see why the bang-to-buck payoff for random effects models is generally pretty low. In a Hausman test we estimate both a fixed effects model and a random effects model using the same data. Under the null hypothesis that the $\alpha_i$ are uncorrelated with the $X$, the estimates should be similar. Under the alternative, the estimates should be different because the random effects estimates should be corrupted by the correlation of the $\alpha_i$ with the $X$ and the fixed effects estimates should not.

The decision rule for a Hausman test is the following: If fixed effects and random effects give us pretty much the same answer, we fail to reject the null hypothesis and can use random effects. If the two approaches provide different answers, we reject the null and should use fixed effects. Ultimately, we believe either the fixed effects estimate (when we reject the null hypothesis of no correlation between $\alpha_i$ and $X_i$) or pretty much the fixed effects answer (when we fail to reject the null hypothesis of no correlation between $\alpha_i$ and $X_i$).4

If used appropriately, random effects have some advantages. When the $\alpha_i$ are uncorrelated with the $X_i$, random effects models will generally produce smaller standard errors on coefficients than fixed effects models. In addition, as $T$ gets large the differences between fixed and random effects decline; in practice, however, the differences can be substantial in many real world data sets.

Remember This
Random effects models do not estimate fixed effects for each unit, but rather adjust standard errors and estimates to account for unit-specific elements of the error term.
1. Random effects models produce unbiased estimates of $\beta_1$ only when the $\alpha_i$ are uncorrelated with the $X$ variables.
2. Fixed effects models work whether the $\alpha_i$ are uncorrelated with the $X$ variables or not, making fixed effects a more generally useful approach.

4 For more details on the Hausman test, see Wooldridge (2002, 288).
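For reference, the Hausman test is easy to run with the plm package in R. In this sketch the data frame dat and the variables y, x1, x2, unit, and year are placeholders rather than data from the text:

# Sketch: fixed effects vs. random effects and a Hausman test with the plm package
library(plm)
pdat <- pdata.frame(dat, index = c("unit", "year"))
fe <- plm(y ~ x1 + x2, data = pdat, model = "within")   # fixed effects
re <- plm(y ~ x1 + x2, data = pdat, model = "random")   # random effects
phtest(fe, re)   # small p-value: the estimates differ, so rely on fixed effects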


15.4 Conclusion

Serial dependence in panel data models is an important and complicated challenge. There are two major approaches to dealing with it. One is to treat the serial dependence as autocorrelated errors. In this case we can test for autocorrelation and, if necessary, purge the data of such autocorrelation by $\rho$-transforming the data.

The other approach is to estimate a dynamic model that includes a lagged dependent variable. Dynamic models are quite different from standard OLS models. Among other things, each independent variable has both a short-term and a long-term effect on $Y$.

Our approach to estimating a model with a lagged dependent variable depends on whether there is autocorrelation and whether we included fixed effects or not. If there is no autocorrelation and we do not include fixed effects, the model is easy to estimate via OLS and produces unbiased parameter estimates. If there is autocorrelation, the correlation of the errors needs to be purged via standard $\rho$-transformation techniques.

If we include fixed effects in a model with a lagged dependent variable, OLS will produce

biased results. However, scholars have found that the bias is relatively small and that OLS is

likely better than alternatives such as instrumental variables or bias-correction approaches.

We will have a good mastery of the material when we can answer the following questions:

• Section 15.1: How do we diagnose and correct for autocorrelation in panel data models?

• Section 15.2: What are the consequences of including lagged dependent variables in models with and without fixed effects? Under what conditions is it reasonable to use lagged dependent variables and fixed effects, despite the bias?

• Section 15.3: What are random effects models? When are they appropriate?

Further Reading

There is a large and complicated literature on accounting for time dependence in panel data

models. Beck and Katz (2011) is an excellent guide. Among other things, they discuss how

to conduct an LM test for AR(1) errors in a model without fixed effects, the bias in models

with autocorrelation and lagged dependent variables, and the bias of fixed effects models

with lagged dependent variables.

There are many other excellent resources. Wooldridge (2002) is a valuable reference for

more advanced issues in analysis of panel data. Achen (2000) is an important article, pushing

for caution in use of lagged dependent variables. Wawro (2002) provides a nice overview of

Arellano and Bond methods.

Another approach to dealing with bias in dynamic models with fixed effects is to correct

for bias directly as suggested by Kiviet (1995). This procedure works reasonably well in

simulations, but is also quite complicated.


Key Terms
• Random effects models (748)

CHAPTER 16

CONCLUSION: HOW TO BE A STATISTICAL REALIST

After World War II, George Orwell (1946) famously wrote

... we are all capable of believing things which we know to be untrue, and then, when we are finally proved wrong, impudently twisting the facts so as to show that we were right. Intellectually, it is possible to carry on this process for an indefinite time: the only check on it is that sooner or later a false belief bumps up against solid reality, usually on a battlefield.


The goal of statistics is to provide a less violent empirical battlefield where theories bump

up against cold, hard data.

Unfortunately, statistical analysis is no stranger to the twisting rationalizations that allow

us to distort reality to satisfy our preconceptions or interests. We therefore sometimes end up

on an emotional roller coaster. We careen from elation after figuring out a new double-tongue-twister statistical model to depression when multiple seemingly valid statistical analyses

support wildly disparate conclusions.

Some deal with the situation by fetishizing technical complexity. They pick the most

complicated statistical approach possible and treat the results as the truth. If others don’t

understand the analysis, it is because their puny brains cannot keep up with the mathematical geniuses in the computer lab. Their overconfidence is annoying and intellectually

dangerous.

Others deal with the situation by becoming statistical skeptics. For them, statistics

provide no answers. They avoid statistics or, worse, they manipulate them. Their nihilism,

too, is annoying and intellectually dangerous.

What are we to do? It might seem that avoiding statistics may limit harm. Statistics are

a bit like a chainsaw: If used recklessly, the damage can be terrible. So maybe it’s best to

put down the laptop and back slowly away.

The problem with this approach is that there really is no alternative to statistics. As

baseball analyst Bill James says, the alternative to statistics is not “no statistics.” The


alternative to statistics is bad statistics. Anyone who makes any empirical argument about

the world is making a statistical argument. It might be based on vague data that is not

systematically analyzed, but that’s what people who judge from experience or intuition

are doing. Hence, despite the inability of statistics to answer all questions or to be above

manipulation, a serious effort to understand the world will involve some statistical reasoning.

A better approach is realism about statistics. In the right hands, chainsaws are awesome.

If we learn how to use the tool properly, what it can and can’t do, we can make a lot of

progress.

A statistical realist is committed to robust and thoughtful evaluation of theories. Five

behaviors characterize this approach.

First, a statistical realist prioritizes. A model that explains everything is impossible. We

must simplify. And if we’re going to simplify the world, let’s do it usefully. Statistician

George Box (1976, 792) made this point wonderfully:

Since all models are wrong the scientist must be alert to what is importantly wrong.

It is inappropriate to be concerned about mice when there are tigers abroad.

The tiger abroad is almost always endogeneity. So we must prioritize fighting this tiger,

using our core statistical tool kit: experiments, OLS, fixed effects models, instrumental

variables, and regression discontinuity. There will be many challenges in any statistical

project, but we must not let them distract us from focusing on the fight against endogeneity.


The second characteristic of a statistical realist is that he or she values robustness. Serious

analysts do not believe assertions based on a single significant coefficient in a single statistical

specification. For even well-designed studies with good data, we worry that the results could

depend on a very specific model specification. A statistical realist will show that the results

are robust by assessing a reasonable range of specifications, perhaps with and without certain

variables or with alternative measures of important concepts.

Third, a statistical realist adheres to the replication standard. Others must see our work

and be able to re-create, modify, correct, and build off our analysis. Results cannot be

scientifically credible otherwise. Replications can be direct, whereby they do exactly the

same procedures on the same data. Or they can be indirect, where a similar research design

is applied to new data or context. We need both to truly believe results.

Fourth, a statistical realist is wary of complexity. Sometimes complex models are inevitable. However, just because a model is more complicated does not mean it is more likely to be true. It is more likely to have mistakes. Sometimes complexity becomes a

shield behind which analysts hide, intentionally or not, moving their conclusions effectively

beyond the realm of reasonable replicability and, therefore, beyond credibility.

Remember, statistical analysis is hard, but not because of the math. Statistics is hard

because the world is a complicated place. If anything, the math makes things easier by

providing tools to simplify the world. A certain amount of jargon among specialists in the

field is inevitable and helps experts communicate efficiently. If a result only holds underneath


layers of impenetrable math, however, be wary. Check your wallet. Count your silverware.

Investor Peter Lynch often remarked that he wouldn’t invest in any business idea that

couldn’t be illustrated with a crayon. If the story isn’t simple, it’s probably wrong. This

attitude is useful for statistical analysts as well. There will almost certainly have to be

background work that is not broadly accessible, but to be most persuasive the results should

include a figure or story that simply summarizes the basis for the finding. Perhaps we’ll

have to use a sharp crayon, but if we can’t explain our results with a crayon we should keep

working.

Fifth, a statistical realist thinks holistically. We should step back from any given statistical

result and consider the totality of the evidence. The following indicators of causality provide

a useful framework. None is necessary; none is sufficient. Taken together, though, the more

these conditions are satisfied, the more confident we can be that a given causal claim is true.

• Strength: This is the simplest criterion. Is there a strong relationship between the

independent variable and the dependent variable?

– A strong observed relationship is less likely due simply to random chance. Even

if the null hypothesis of no relationship is true, we know that random variation

can lead to the occasional “significant” result. The random noise producing such a

result is more likely to produce a weak, rather than a strong, observed relationship.

A very strong relationship is highly unlikely to simply be the result of random noise.


– A strong observed relationship is less likely to be spurious for non-obvious reasons.

A strong relationship is not immune to endogeneity, of course, but it is more likely

that a strong result due only to endogeneity will be due to some relatively obvious

source of endogeneity. For a weak relationship, the endogeneity could be subtle,

but enough to account for what we observe.

– A strong observed relationship is more likely to be important. A weak relationship

might not be random or spurious; it might simply be uninteresting. Life is short.

Explain things that matter. Our goal is not to intone the words “statistically

significant” but rather to produce useful knowledge.

• Consistency: Do different analysts consistently find the relationship in different contexts?

– All too often, a given theoretical claim is tested with the very data that suggested

the result. That’s not much to go on; a random or spurious relationship in one data

set does not a full-blown theory make. Hence we should be cautious about claims

until they are observed across multiple contexts. In that case, it is less likely that

the result is due to chance or to an analyst leaning on the data to get a result he

or she wanted.

– If results are not observed across multiple contexts, are there contextual differences?

Perhaps the real finding is explaining why a relationship exists in one context and


not others.

– Or, if other results are different, can we explain why the other results are wrong?

It is emphatically not the case that we should interpret two competing statistical

results as a draw. One result could be based on a mistake. If that's the case,

explain why (nicely, of course). If we can’t explain why one approach is better,

though, then we are left with conflicting results and we need to be cautious about

believing we have identified a real relationship.

• Specificity: Are the patterns in the data consistent with the specific claim? Each theory

should be mined for as many specific claims as possible, not only about direct effects,

but also about indirect effects and mechanisms. As importantly, the theory should be

mined for claims about when we won’t see the relationship. This line of thinking allows

us to conduct placebo tests in which we should see null results. In other words, the

relationship should be observable everywhere we expect it and nowhere we don’t.

• Plausibility: Given what we know about the world, does the result make sense? Sometimes results are implausible on their face: If someone found that eating french fries led

to weight loss, we should probably ask some probing questions before Supersizing. That

doesn’t mean we should treat implausible results as wrong. After all, the idea that the

earth revolves around the sun was pretty implausible at first. Implausible results just

need more evidence to overcome their implausibility.


These criteria are not as cut and dried as looking at confidence intervals or hypothesis

tests. They are more important because they determine not “statistical significance” but

what we conclude about empirical relationships. They should never be far from the mind of

a statistical realist who wants to use data to learn about how the world really works.

So we have done a lot in this book. We’ve covered a vast array of statistical tools. We’ve

just now described a productive mindset, that of a statistical realist. There is one more

element: creativity. Think of statistics as the grammar for good analysis. It is not the story.

No one reads a book and says “Great grammar!” A terrible book might have bad grammar,

but a good book needs more than good grammar. The material we covered in this book

provides the grammar for making convincing claims about the way the world works. The

rest is up to you. Think hard, be creative, take chances. Good luck.

16.1 Further Reading

Achen (1982) is an 80-page paean to statistical realism. As he puts it, “The uninitiated

are often tempted to trust every statistical study or none. It is the task of empirical social

scientists to be wiser.” Achen followed this publication with a 2002 article arguing for keeping

models simple.

The criteria for evaluating research discussed here are strongly influenced by the Bradford-Hill criteria from Bradford-Hill (1965). Nevin (2013) assesses the Bradford-Hill criteria for the theory that lead in gasoline was responsible for the crime surge in the United States in the

1980s (and elsewhere).

ACKNOWLEDGEMENTS

This book has benefited from close reading and probing questions from a large number

of people, including students at the McCourt School of Public Policy at Georgetown, and

my current and former colleagues and students at Georgetown University, including Shirley

Adelstein, Rachel Blum, Ian Gale, Ariya Hagh, Carolyn Hill, Mark Hines, Dan Hopkins,

Jeremy Horowitz, Huade Huo, Wes Joe, Karin Kitchens, Jon Ladd, Jens Ludwig, Paul

Musgrave, Sheeva Nesva, Hans Noel, Ji Yeon Park, Betsy Pearl, Lindsay Pettingill, Barbara

Schone, Dennis Quinn, Chris Schorr, and Erik Voeten.

Credit (and/or blame) for the Simpson’s figure goes to Paul Musgrave.

Participants at a seminar on the book at the University of Maryland gave excellent early

feedback, especially Antoine Banks, Brandon Bartels, Kanisha Bond, Ernesto Calvo, Sarah

Croco, Michael Hanmer, Danny Hayes, Eric Lawrence, Irwin Morris, and John Sides.

In addition, colleagues across the country have been incredibly helpful, especially Allison

Carnegie, Daniel Henderson, Luke Keele, David Peterson, Wendy Tam-Cho, Craig Volden


and Chris Way. Anonymous reviewers for Oxford University Press provided supportive yet

probing reviews that were very useful.

I also appreciate the generosity of colleagues who shared data, including Bill Clark, Anna

Harvey, Dan Hopkins, and Hans Noel.

Appendices

MATH AND PROBABILITY BACKGROUND

A Summation

• $\sum_{i=1}^{N} X_i = X_1 + X_2 + X_3 + ... + X_N$

• If a variable in the summation does not have a subscript, it can be “pulled out” of the

summation. For example


$$\sum_{i=1}^{N} \beta X_i = \beta X_1 + \beta X_2 + \beta X_3 + ... + \beta X_N = \beta(X_1 + X_2 + X_3 + ... + X_N) = \beta \sum_{i=1}^{N} X_i$$

• If a variable in the summation has a subscript, it cannot be “pulled out” of the summation. For example, $\sum_{i=1}^{N} X_i Y_i = X_1 Y_1 + X_2 Y_2 + X_3 Y_3 + ... + X_N Y_N$ cannot as a general matter be simplified.

• As a general matter, a non-linear function in a sum is not the same as the non-linear function of the sum. For example, as a general matter $\sum_{i=1}^{N} X_i^2$ will not equal $\left(\sum_{i=1}^{N} X_i\right)^2$ except in very particular circumstances (such as $X_i = 1$ for all observations).

B Expectation

• Expectation is the value we expect a random variable to be. The expectation is basically

the average of the random variable if we could sample from the variable’s distribution

a huge (infinite, really) number of times.

• For example, the expected value of a six-sided die is 3.5. If we roll a die a huge number of times, we'd expect each side to come up an equal proportion of times, so the expected average will equal the average of 1, 2, 3, 4, 5, and 6. More formally, the expected value will be $\sum_{i=1}^{6} p(X_i)X_i$ where $X$ is 1, 2, 3, 4, 5, and 6 and $p(X_i)$ is the probability of each outcome, which in this example is $\frac{1}{6}$ for each value.

• The expectation of some number k times a function is equal to k times the expectation

of the function. That is, E[kg(X)] = kE[g(X)] for constant k where g(X) is some

function of X. Suppose we want to know what the expectation of 10 times the number

on a die is. We can say that the expectation of that is simply 10 times the expectation.

Not rocket science, but useful.
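As a quick, hedged illustration of these expectation facts, the R sketch below computes the die's expected value directly and by simulation (the simulation settings are arbitrary):

    # Expected value of a fair six-sided die: sum of p(x) * x
    sum((1/6) * 1:6)                           # 3.5
    # Simulation version: the average of many rolls approaches 3.5
    set.seed(123)
    rolls <- sample(1:6, 100000, replace = TRUE)
    mean(rolls)                                # approximately 3.5
    mean(10 * rolls)                           # approximately 35, since E[10X] = 10 E[X]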


C Variance

• The variance of a random variable is a measure of how spread out the distribution is.

In a large sample, the variance can be estimated as

$$\widehat{var}(X) = \frac{1}{N}\sum_{i=1}^{N}(X_i - \bar{X})^2$$

In small samples, a degrees-of-freedom correction means we divide by $N - 1$ instead of $N$. For large $N$ it hardly matters whether we use $N$ or $N - 1$; as a practical matter, computer programs take care of this for us.

It is useful to de-construct exactly what the variance equation does. The math is pretty

simple:

1. Take the deviation from the mean for each observation.

2. Square it to keep it positive.

3. Take the average.

Here are some useful properties of variance.

1. The variance of a constant plus a random variable is the variance of the random variable. That is, let $k$ be a fixed number and $\epsilon$ be a random variable with variance $\sigma^2$; then
   $$var(k + \epsilon) = var(k) + var(\epsilon) = 0 + var(\epsilon) = \sigma^2$$

2. The variance of a random variable times a constant is the constant squared times the variance of the random variable. That is, let $k$ be some constant and $\epsilon$ be a random variable with variance $\sigma^2$; then
   $$var(k\epsilon) = k^2 var(\epsilon) = k^2\sigma^2$$

3. When random variables are correlated, the variance of a sum (or difference) of random variables depends on the variances and covariance of the variables. Let $\epsilon$ and $\tau$ be random variables.
   – $var(\epsilon + \tau) = var(\epsilon) + var(\tau) + 2cov(\epsilon, \tau)$ where $cov(\epsilon, \tau)$ refers to the covariance of $\epsilon$ and $\tau$.
   – $var(\epsilon - \tau) = var(\epsilon) + var(\tau) - 2cov(\epsilon, \tau)$ where $cov(\epsilon, \tau)$ refers to the covariance of $\epsilon$ and $\tau$.

4. When random variables are uncorrelated, the variance of a sum (or difference) of random variables equals the sum of the variances. This outcome follows directly from the above, which we can see by noting that if two random variables are uncorrelated, then their covariance equals zero and the covariance term drops out of the above equations.
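A simulation sketch in R illustrates these variance rules; the constant k and the distributions below are arbitrary choices, not anything from the text:

    # Simulated check of the variance rules
    set.seed(42)
    eps <- rnorm(100000, mean = 0, sd = 2)   # variance 4
    tau <- rnorm(100000, mean = 0, sd = 3)   # variance 9, drawn independently of eps
    k <- 5
    var(k + eps)      # about 4: adding a constant leaves the variance unchanged
    var(k * eps)      # about 100 = k^2 * 4
    var(eps + tau)    # about 13 = 4 + 9, since eps and tau are uncorrelated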

D Covariance

• Covariance measures how much two random variables vary together. In large samples,

the covariance of two variables is


$$cov(X_1, X_2) = \frac{\sum_{i=1}^{N}(X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{N} \qquad \text{(A-1)}$$

• As with variance, there are several useful properties when dealing with covariance.

1. The covariance of a random variable, $\epsilon$, and some constant, $k$, is zero. Formally, $cov(\epsilon, k) = 0$.
2. The covariance of a random variable, $\epsilon$, with itself is the variance of that variable. Formally, $cov(\epsilon, \epsilon) = \sigma_\epsilon^2$.
3. The covariance of $k_1\epsilon$ and $k_2\tau$, where $k_1$ and $k_2$ are constants and $\epsilon$ and $\tau$ are random variables, is $k_1 k_2 cov(\epsilon, \tau)$.
4. The covariance of a random variable with the sum of another random variable and a constant is the covariance of the two random variables. Formally, let $\epsilon$ and $\tau$ be random variables; then $cov(\epsilon, \tau + k) = cov(\epsilon, \tau)$.


E Correlation

The equation for correlation is
$$corr(X, Y) = \frac{cov(X, Y)}{\sigma_X \sigma_Y}$$
where $\sigma_X$ is the standard deviation of $X$ and $\sigma_Y$ is the standard deviation of $Y$. If $X = Y$ for all observations, $cov(X, Y) = cov(X, X) = var(X)$ and $\sigma_X = \sigma_Y$, implying that the denominator will be $\sigma_X^2$, which is the variance of $X$. These calculations therefore imply that the correlation when $X = Y$ will be +1, which is the upper bound for correlations.1 For perfect negative correlation $X = -Y$ and the correlation equals -1.

The equation for correlation looks a bit like the equation for the slope coefficient in

bivariate regression on page 72 in Chapter 3. The bivariate regression coefficient is simply a

re-standardized correlation:

$$\hat\beta_1^{Bivariate\,OLS} = corr(X, Y) \times \frac{\sigma_Y}{\sigma_X}$$
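This relationship between the correlation and the bivariate OLS slope can be confirmed with simulated data in R (the data-generating process below is invented purely for illustration):

    # The bivariate OLS slope equals the correlation rescaled by sd(Y)/sd(X)
    set.seed(7)
    X <- rnorm(500)
    Y <- 2 + 0.5 * X + rnorm(500)
    coef(lm(Y ~ X))["X"]          # OLS slope estimate
    cor(X, Y) * sd(Y) / sd(X)     # same number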

F Probability Density Functions

A probability density function (PDF) is a mathematical function that describes the relative probability for a continuous random variable to take on a given value. Panels (c) and (d) of Figure 3.4 from Section 3.2 provide examples of two PDFs.
1 We also get perfect correlation if the variables are identical once normalized. That is, $X$ and $Y$ are perfectly correlated if $X = 10Y$ or if $X = 5 + 3Y$ and so forth. In these cases $\frac{(X_i - \bar{X})}{\sigma_X} = \frac{(Y_i - \bar{Y})}{\sigma_Y}$ for all observations.


While the shapes of PDFs can vary considerably, all of them share certain fundamental

features. The values of a PDF are greater than or equal to zero for all possible values of the

random variable. The total area under the curve defined by the PDF equals one.

One tricky thing about PDFs is that they are continuous functions, meaning that we

cannot say the probability a random variable equals 2.2 is equal to the value of the function

evaluated at 2.2 because the value of the function is pretty much the same at 2.2000001 and

2.2000002 and pretty soon the total probability would exceed one because there are always

more possible values very near to any given value. Instead, we need to think in terms of

probabilities the random variable is in some (possibly small) region of values. Hence we need

the tools from calculus to calculate probabilities from a PDF.

Figure A.1 shows the PDF for an example of a random variable. While we cannot use the

PDF to simply calculate the probability the random variable equals, say, 1.5, it is possible to

calculate the probability that the random variable is between 1.5 and any other value. The

figure highlights the area under the PDF curve between 1.5 and 1.8. This area corresponds

to the probability this random variable is between 1.5 and 1.8. In the next section we show

examples of how to calculate such probabilities based on PDFs from the normal distribution.2
2 More formally, we can indicate a PDF as a function, $f(x)$, that is greater than or equal to zero for all values of $x$. The fact that the total area under the curve equals one means that $\int_{-\infty}^{\infty} f(x)dx = 1$. The probability that the random variable $x$ is between $a$ and $b$ is $\int_a^b f(x)dx = F(b) - F(a)$, where $F()$ is the integral of $f()$.


[Figure A.1 plots a probability density against the value of x, with the area under the curve between 1.5 and 1.8 shaded.]
FIGURE A.1: An Example of a Probability Density Function (PDF)
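Such areas are straightforward to compute numerically. The R sketch below uses a normal PDF purely as a stand-in, since the exact density plotted in Figure A.1 is not reproduced here:

    # Area under a PDF between 1.5 and 1.8, using a normal density as an illustration
    integrate(dnorm, lower = 1.5, upper = 1.8, mean = 2, sd = 1)$value
    # Equivalently, F(b) - F(a) using the CDF
    pnorm(1.8, mean = 2, sd = 1) - pnorm(1.5, mean = 2, sd = 1)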


G Normal Distributions

We work a lot with the standard normal distribution. (Only to us stats geeks does

“standard normal” not seem repetitive.) A normal distribution is a specific (and famous)

type of PDF and a standard normal distribution is a normal distribution with mean zero

and a variance of one. The standard deviation of a standard normal distribution is also one,

because the standard deviation is the square root of the variance.

One important use of the standard normal distribution is to calculate probabilities of

observing standard normal random variables that are less than or equal to some number.

We denote the function $\Phi(Z) = Prob(X < Z)$ as the probability that a standard normal

random variable X is less than Z. This is known as the cumulative distribution function

(CDF) because it indicates the probability of seeing a random variable less than some value.

It simply expresses the area under a PDF curve to the left of some value.

Figure A.2 shows four examples using the CDF for standard normal PDFs. Panel (a) shows $\Phi(0)$, which is the probability that a standard normal random variable will be less than 0. It is the area under the PDF to the left of 0. We can see that it is half of the total area, meaning that the area to the left of 0 is 0.50 and, therefore, the probability of observing a value of a standard normal random variable that is less than 0 is 0.50. Panel (b) shows $\Phi(-2)$, which is the probability that a standard normal random variable will be less than -2. It is the proportion of the total area that is to the left of -2, which is 0.023.


[Figure A.2 plots four standard normal PDFs with the area to the left of a cutoff shaded: panel (a) Φ(0) = Prob(X < 0) = 0.500, panel (b) Φ(−2) = Prob(X < −2) = 0.023, panel (c) Φ(1.96) = Prob(X < 1.96) = 0.975, panel (d) Φ(1.0) = Prob(X < 1.0) = 0.841.]
FIGURE A.2: Probabilities that a Standard Normal Random Variable is Less Than Some Value

Panel (c) shows $\Phi(1.96)$, which is the probability that a standard normal random variable will be less than 1.96. It is 0.975. Panel (d) shows $\Phi(1)$, which is the probability that a standard normal random variable will be less than 1. It is 0.841.

We can also use our knowledge of the standard normal distribution to calculate the probability that $\hat\beta_1$ is greater than some value. The trick here is to recall that if the probability of something happening is $P$, then the probability of it not happening is $1 - P$. This property tells us that if there is a 15% chance of rain, then there is an 85% probability of no rain.


To calculate the probability that a standard normal variable is greater than some value, $Z$, use $1 - \Phi(Z)$. Figure A.3 shows four examples. Panel (a) shows $1 - \Phi(0)$, which is the probability that a standard normal random variable will be greater than 0. This probability is 0.50. Panel (b) highlights $1 - \Phi(-2)$, which is the probability that a standard normal random variable will be greater than -2. It is 0.98. Panel (c) shows $1 - \Phi(1.96)$, which is the probability that a standard normal random variable will be greater than 1.96. It is 0.025. Panel (d) shows $1 - \Phi(1)$, which is the probability that a standard normal random variable will be greater than 1. It is 0.16.

Figure A.4 shows some key information about the standard normal distribution. In the

table’s left-hand column is some number and in the right hand column is the probability that

a standard normal random variable will be less than that number. There is, for example, a

0.01 probability that a standard normal random variable will be less than -2.32. We can see

this graphically in panel (a). In the top bell-shaped curve, the portion that is to the left of

-2.32 is shaded. It is about 1 percent.

Because the standard deviation of a standard normal is one, all the numbers in the left

hand column can be considered as the number of standard deviations above or below the

mean. That is, the number -1 refers to a point that is one standard deviation below the

mean and the number +3 refers to a point that is 3 standard deviations above the mean.

The third row of the table shows that there is a probability of 0.01 that we’ll observe a

value less than -2.32 standard deviations below the mean. Going down to the shaded row


[Figure A.3 plots four standard normal PDFs with the area to the right of a cutoff shaded: panel (a) 1 − Φ(0) = Prob(X > 0) = 0.500, panel (b) 1 − Φ(−2) = Prob(X > −2) = 0.977, panel (c) 1 − Φ(1.96) = Prob(X > 1.96) = 0.025, panel (d) 1 − Φ(1.0) = Prob(X > 1.0) = 0.159.]
FIGURE A.3: Probabilities that a Standard Normal Random Variable is Greater Than Some Value


SD (number of standard deviations above or below the mean, $\beta_1$)   Probability $\hat\beta_1 \leq$ SD
-3.00   0.0001
-2.58   0.005
-2.32   0.010
-2.00   0.023
-1.96   0.025
-1.64   0.050
-1.28   0.100
-1.00   0.160
0.00    0.500
1.00    0.840
1.28    0.900
1.64    0.950
1.96    0.975
2.00    0.977
2.32    0.990
2.58    0.995
3.00    0.999

[Three accompanying panels plot the standard normal density with the shaded areas corresponding to Prob($\hat\beta_1 \leq -2.33$), Prob($\hat\beta_1 \leq 0$), and Prob($\hat\beta_1 \leq 1.96$).]

Suppose $\hat\beta_1$ is distributed as standard normal. The values in the right-hand column are the probabilities that $\hat\beta_1$ is less than the value in the left-hand column. For example, the probability $\hat\beta_1 < -2.33$ is 0.010.

FIGURE A.4: Standard Normal Distribution


SD = 0.00, we see that if $\hat\beta_1$ is standard normally distributed, there is a 0.50 probability of

being below 0. This probability is intuitive – the normal distribution is symmetric and we

have the same chance of seeing something above its mean as below it. Panel (b) shows this

graphically.

Going down to the shaded row where SD = 1.96, we see that there is a 0.975 probability that a standard normal random variable will be less than 1.96. Panel (c) shows this

graphically, with 97.5% of the standard normal distribution shaded. We see this value a

lot in statistics because twice the probability of being greater than 1.96 is 0.05, which is a

commonly used significance level for hypothesis testing.

We can convert any normally distributed random variable to a standard normally distributed random variable. This process is known as standardizing values and is pretty easy.

This trick is valuable because it allows us to use the intuition and content of Figure A.4 to

work with any normal distribution, whatever its mean and standard deviation.

For example, suppose we have a normal random variable with a mean of 10 and a standard

deviation of 1 and we want to know the probability of observing a value less than 8. From

common sense, we can figure out that in this case 8 is 2 standard deviations below the mean.

Hence we can use Figure A.4 to see that the probability of observing a value less than 8 from

a normal distribution with mean 10 and standard deviation of one is 0.023 (see the fourth

row of the table, which shows that the probability a standard normal random variable is less

than -2 is 0.023).


How did we get there? First, subtract the mean from the value in question to see how far

it is from the mean. Then divide this quantity by the standard deviation to calculate how

many standard deviations away from the mean it is. More generally, for any given number

$B$ drawn from a distribution with mean $\beta_1$ and standard deviation $se(\hat\beta_1)$, we can calculate the number of standard deviations $B$ is away from the mean via the following equation:
$$\text{Standard deviations from mean} = \frac{B - \beta_1}{se(\hat\beta_1)} \qquad \text{(A-2)}$$

Notice that the $\beta_1$ has no hat but $se(\hat\beta_1)$ does. Seems odd, doesn't it? There is a logic to it. We'll be working a lot with hypothetical values of $\beta_1$, asking, for example, what the probability $\hat\beta_1$ is greater than some number would be if the “true” $\beta_1$ were zero. But we'll want to work with the precision implied by our actual data, so we'll use $se(\hat\beta_1)$.

To help us get comfortable with converting the distribution of $\hat\beta_1$ to the standard normal distribution, Table A.1 shows several examples. In the first example (the first two rows), $\beta_1$ is 0 and the standard error of $\hat\beta_1$ is 3. Recall that the standard error of $\hat\beta_1$ measures the width of the $\hat\beta_1$ distribution. In this case, 3 is 1 standard deviation above the mean and 1 is 0.33 standard deviations above the mean.

The third and fourth rows of Table A.1 show an example when $\beta_1$ is 4 and the standard deviation is 3. In this case, 7 is 1 standard deviation above the mean and 1 is 1 standard deviation below the mean. In the bottom portion of the table (the last two rows), $\beta_1$ is 8 and the standard deviation of $\hat\beta_1$ is 2. In this case, 6 is 1 standard deviation below the mean and 1 is 3.5 standard deviations below the mean.


Table A.1: Examples of Standardized Values
(The $\beta_1$ and $se(\hat\beta_1)$ columns describe the hypothetical distribution.)

Number B   $\beta_1$   $se(\hat\beta_1)$   Standardized   Description
3          0           3                   (3 − 0)/3 = 1      3 is 1 standard deviation above the mean of 0 when $se(\hat\beta_1) = 3$
1          0           3                   (1 − 0)/3 = 0.33   1 is 0.33 standard deviations above the mean of 0 when $se(\hat\beta_1) = 3$
7          4           3                   (7 − 4)/3 = 1      7 is 1 standard deviation above the mean of 4 when $se(\hat\beta_1) = 3$
1          4           3                   (1 − 4)/3 = −1     1 is 1 standard deviation below the mean of 4 when $se(\hat\beta_1) = 3$
6          8           2                   (6 − 8)/2 = −1     6 is 1 standard deviation below the mean of 8 when $se(\hat\beta_1) = 2$
1          8           2                   (1 − 8)/2 = −3.5   1 is 3.5 standard deviations below the mean of 8 when $se(\hat\beta_1) = 2$

To calculate $\Phi(Z)$ we use a table such as in Figure A.4 or, more likely, computer software as discussed in the Computing Corner of this appendix.


Remember This
A standard normal distribution is a normal distribution with a mean of zero and a standard deviation of one.
• Any normal distribution can be converted to a standard normal distribution.
• If $\hat\beta_1$ is distributed normally with mean $\beta_1$ and standard deviation $se(\hat\beta_1)$, then $\frac{\hat\beta_1 - \beta_1}{se(\hat\beta_1)}$ will be distributed as a standard normal random variable.
• Converting random variables to standard normal random variables allows us to use standard normal tables to discuss any normal distribution.
• To calculate the probability $\hat\beta_1 \leq B$, where $B$ is any number of interest, do the following:
  1. Convert $B$ to the number of standard deviations above or below the mean using $\frac{B - \beta_1}{se(\hat\beta_1)}$.
  2. Use the table in Figure A.4 or software to calculate the probability that $\hat\beta_1$ is less than $B$ in standardized terms.
• To calculate the probability that $\hat\beta_1 > B$, use the fact that the probability $\hat\beta_1$ is greater than $B$ is 1 minus the probability that $\hat\beta_1$ is less than or equal to $B$.


Discussion Questions
1. What is the probability a standard normal random variable is less than
or equal to 1.64?
2. What is the probability a standard normal random variable is less than
or equal to -1.28?
3. What is the probability a standard normal random variable is greater
than 1.28?
4. What is the probability a normal random variable with a mean of zero
and a standard deviation of 2 is less than -4?
5. What is the probability a normal random variable with a mean of zero
and a variance of 9 is less than -3?
6. Approximately what is the probability a normal random variable with
a mean of 7.2 and a variance of 4 is less than 9?
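One way to check answers to questions like these is with R's pnorm function, standardizing by hand where the variable is not already standard normal (an illustrative sketch, not part of the original questions):

    # Standard normal probabilities, standardizing where needed
    pnorm(1.64)              # P(Z <= 1.64)
    pnorm(-1.28)             # P(Z <= -1.28)
    1 - pnorm(1.28)          # P(Z > 1.28)
    pnorm((-4 - 0) / 2)      # mean 0, sd 2
    pnorm((-3 - 0) / 3)      # mean 0, variance 9, so sd = 3
    pnorm((9 - 7.2) / 2)     # mean 7.2, variance 4, so sd = 2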

H Other Useful Distributions

The normal distribution may be the most famous distribution, but it is far from the only

workhorse distribution in statistical analysis. In this section we briefly discuss three other distributions that are particularly common in econometric practice: the $\chi^2$, t, and F distributions. Each of these distributions is derived from the normal distribution.

The $\chi^2$ distribution

The $\chi^2$ distribution describes the distribution of squared normal variables. The distribution of a squared standard normal random variable is a $\chi^2$ distribution


[Figure A.5 plots two probability densities against the value of x: panel (a) a χ²(2) distribution and panel (b) a χ²(4) distribution.]
FIGURE A.5: Two χ² Distributions

with one degree of freedom. The sum of $n$ independent squared standard normal random variables is distributed according to a $\chi^2$ distribution with $n$ degrees of freedom.

The $\chi^2$ distribution arises in many different statistical contexts. We'll see below that it is a component of the all-important t distribution. The $\chi^2$ distribution also arises when we conduct likelihood ratio tests for MLE models.

The shape of the $\chi^2$ distribution varies according to the degrees of freedom. Figure A.5


shows two examples of $\chi^2$ distributions. Panel (a) shows a $\chi^2$ distribution with 2 degrees of freedom. We have highlighted the most extreme 5 percent of the distribution, which demonstrates that the critical value from a $\chi^2(2)$ distribution is roughly 6. Panel (b) shows a $\chi^2$ distribution with 4 degrees of freedom. The critical value from a $\chi^2(4)$ distribution is around 9.5.

The Computing Corner in Chapter 12 on pages 646 and 648 shows how to identify critical

values from a $\chi^2$ distribution. Software will often, but not always, automatically provide

critical values for us.
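When software does not hand us the critical value, we can look it up directly. For example, in R (a short sketch, separate from the Computing Corner code the book references):

    # Upper 5 percent critical values for chi-squared distributions
    qchisq(0.95, df = 2)   # roughly 6
    qchisq(0.95, df = 4)   # roughly 9.5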

The t distribution

The t distribution characterizes the distribution of the ratio of a normal random variable and the square root of a $\chi^2$ random variable divided by its degrees of freedom. While such a ratio may seem to be a pretty obscure combination of things to worry about, we've seen in Section 4.2 that the t distribution is an incredibly useful distribution. We know that our OLS coefficients (among other estimators) are normally distributed. We also know (although we talk about this less) that the estimates of the standard errors are distributed according to a $\chi^2$ distribution. Since we need to standardize our OLS coefficients by dividing by our standard error estimates, we want to know the distribution of the ratio of the coefficient divided by the standard error.

Formally, if $z$ is a standard normal random variable and $x$ is a $\chi^2$ variable with $n$ degrees of freedom, then the following is distributed according to a t distribution with $n$ degrees of freedom:
$$t(n) = \frac{z}{\sqrt{x/n}}$$

Virtually every statistical software package automatically produces t statistics for every

coefficient estimated. We can also use t tests to test hypotheses about multiple coefficients,

although in Section 7.4 we focused on F tests for this purpose on the grounds of convenience.

The shape of the t distribution is quite similar to the normal distribution. As shown

in Figure 4.3 in Chapter 4, the t distribution is a bit wider than the normal distribution.

This means that extreme values are more likely from a t distribution than from a normal

distribution. However, the difference is modest for small sample sizes and disappears as the

sample size gets large.
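A quick way to see the t distribution's slightly fatter tails, and how the gap with the normal distribution closes as the degrees of freedom grow, is to compare critical values in R (illustrative values only):

    # t critical values shrink toward the normal's 1.96 as degrees of freedom grow
    qt(0.975, df = 10)     # about 2.23
    qt(0.975, df = 100)    # about 1.98
    qnorm(0.975)           # 1.96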

The F distribution

The F distribution characterizes the distribution of a ratio of two $\chi^2$ random variables divided by their degrees of freedom. The distribution is named in honor of legendary statistician R. A. Fisher.

Formally, if $x_1$ and $x_2$ are independent $\chi^2$ random variables with $n_1$ and $n_2$ degrees of freedom, then the following is distributed according to an F distribution with degrees of freedom $n_1$ and $n_2$:
$$F(n_1, n_2) = \frac{x_1/n_1}{x_2/n_2}$$

Since $\chi^2$ variables are positive, a ratio of two of them must be positive as well, meaning that random variables following an F distribution are greater than or equal to zero.

An interesting feature of the F distribution is that the square of a t distributed variable with $n$ degrees of freedom follows an $F(1, n)$ distribution. To see this, note that a t distributed variable is a normal random variable divided by the square root of a $\chi^2$ random variable. Squaring the t distributed variable gives us a squared normal in the numerator, which is $\chi^2$, and a $\chi^2$ in the denominator. In other words, this gives us the ratio of two $\chi^2$ random variables, which will be distributed according to an F distribution. We used this fact when noting on page 451 that in certain cases we can square a t statistic to produce an F statistic that can be compared to a rule of thumb about F statistics in the first stage of 2SLS analyses.

We use the F distribution when doing F tests which, among other things, allow us to test

hypotheses involving multiple parameters. We discuss F tests in Section 7.4.

The F distribution depends on two degrees of freedom parameters. In the F test examples,

the degrees of freedom for the test statistic depend on the number of restrictions on the

parameters and the sample size. The order of the degrees of freedom is important and is

explained in our discussion of F tests.

The F distribution does not have an easily identifiable shape like the normal and t distributions.


[Figure A.6 plots four F distributions with the upper 5 percent of each shaded: panel (a) F(3, 2000) with critical value 2.61, panel (b) F(18, 300) with critical value 1.64, panel (c) F(2, 100) with critical value 3.09, panel (d) F(9, 10) with critical value 3.02.]
FIGURE A.6: Four F Distributions


Instead, its shape changes rather dramatically depending on the degrees of freedom. Figure A.6 plots four examples of F distributions, each with different degrees of freedom. For each figure we highlight the extreme 5 percent of the distribution, providing a sense of the values necessary to reject the null hypotheses for each case. Panel (a) shows an F distribution with degrees of freedom equal to 3 and 2,000. This would be the distribution of an F statistic when testing a null hypothesis that $\beta_1 = \beta_2 = \beta_3 = 0$ based on a data set with 2,010 observations and 10 parameters to be estimated. The critical value is 2.61, meaning that an F test statistic greater than 2.61 would lead us to reject the null hypothesis. Panel (b) displays an F distribution with degrees of freedom equal to 18 and 300, and so on.

The Computing Corner in Chapter 7 on pages 359 and 360 shows how to identify critical

values from an F distribution. Software will often, but not always, automatically provide

critical values for us.
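As with the other distributions, statistical software can supply these critical values directly; the R sketch below reproduces the values highlighted in Figure A.6:

    # Upper 5 percent critical values for the F distributions shown in Figure A.6
    qf(0.95, df1 = 3, df2 = 2000)    # about 2.61
    qf(0.95, df1 = 18, df2 = 300)    # about 1.64
    qf(0.95, df1 = 2, df2 = 100)     # about 3.09
    qf(0.95, df1 = 9, df2 = 10)      # about 3.02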

I Sampling

Section 3.2 of Chapter 3 discussed two sources of variation in our estimates: sampling randomness and modeled randomness. This section elaborates on sampling randomness.

Imagine that we are trying to figure out some feature of a given population. For example,

suppose we are trying to ascertain the average age of everyone in the world at a given time. If

we had (accurate) data from every single person, we’re done. Obviously, that’s not going to


happen, so we take a random sample. Since this random sample will not contain every single

person, the average age of people from it probably will not exactly match the population

average. And if we were to take another random sample we would likely get a different

average because we would have different people in our sample. Maybe the first time we got

more babies than usual and the second time we got the world's oldest living person.

The genius of the sampling perspective is that we characterize the degree of randomness

we should observe in our random sample. The variation will depend on the sample size we

observe and on the underlying variation in the population.

A useful exercise is to take some population, say the students in a statistics class, and

gather information about every person in the population for some variable. Then if we

draw random samples from this population we will see that the mean of the variable in the

sampled group will bounce around for each random sample we draw. The amazing thing

about statistics is that we will be able to say certain things about the mean of the averages

we get across the random samples and the variance of the averages. If the sample size is large,

we will be able to approximate the distribution of these averages with a normal distribution

with a variance we can calculate based on the sample size and the underlying variance in

the overall population.
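The exercise described above is easy to simulate in R. The sketch below invents a population rather than using an actual class, so the specific numbers are assumptions for illustration only:

    # Repeated sample means bounce around the population mean and look roughly normal
    set.seed(1)
    population <- rnorm(5000, mean = 21, sd = 3)   # made-up "class" of ages
    sample_means <- replicate(2000, mean(sample(population, size = 50)))
    mean(sample_means)            # close to mean(population)
    sd(sample_means)              # close to sd(population) / sqrt(50)
    sd(population) / sqrt(50)
    hist(sample_means)            # approximately bell-shaped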

This logic applies to regression coefficients as well. Hence, if we want to know the

relationship between age and wealth in the whole world, we can draw a random sample

and know that we will have variation related to the fact that we observe only a subset of


the target population. And, recall from Section 6.1 that OLS easily estimates means and

difference of means, so even our average age example works in an OLS context.

While it may be tempting to think of statistical analysis only in terms of sampling

variation, this is not very practical. First, it is not uncommon to observe an entire population.

For example, if we want to know the relationship between education and wages in European

countries from 2000 to 2014, we could probably come up with data for each country and year

in our target population. And yet, we would be naive to believe that there is no uncertainty

in our estimates. Hence, there is almost always another source of randomness, something we

referred to as modeled randomness in Section 3.2.

Second, the sampling paradigm requires that the samples from the underlying target

population be random. If the sampling is not random, we run the risk of introducing endogeneity, as the type of observations that make their way into our analysis may systematically

differ from the people or units that we do not observe. A classic example is that we may

observe the wages of women who work, but this subsample is unlikely to be a random sample

from all women. The women who work are likely more ambitious and/or more financially

dependent on working.

Even public opinion polling data, a presumed model of random sampling, seldom provides

random samples from underlying populations. Commercial polls often have response rates

less than 20 percent and even academic surveys struggle to get response rates near 50 percent.

It is reasonable to believe that the people who respond differ in economic, social, and personality


traits, meaning that simply attributing variation to sampling variation may be problematic.

So even though sampling variation is incredibly useful as an idealized source of randomness in our coefficient estimates, we should not limit ourselves to thinking of variation in

coefficients solely in terms of sampling variation. Instead, it is useful to step back and write

down a model that simply includes an error term representing uncertainty in our model.

If the observations are drawn from a truly random sample of the target population (Hint:

they never are), then we can proceed with thinking of uncertainty reflecting only sampling

variation. However, if there is no random sampling, either because we have data on the full population or because the sample is not random, then we can model the selection process and assess whether or not the non-random sampling process induced correlation between the independent variables and the error term. The Heckman selection model referenced in Chapter 10 on page 518 provides a framework for considering such issues. Such selection is very

tricky to assess, however, and researchers continue to struggle with the best way to address

the issue.

J Further Reading

Rice (2007) is an excellent guide to probability theory as used in statistical analysis.


K Computing Corner

Excel

Sometimes Excel is the quickest way to calculate quantities of interest related to the normal

distribution.

• There are several ways to find the probability a standard normal is less than some value,

1. Use the NORM.S.DIST function, which calculates the normal distribution. Use a 1

after the comma to produce the cumulative probability, which is the percent of the

distribution to the left of the number indicated: =NORM.S.DIST(2, 1).

2. Use the NORMDIST function and indicate the mean and the standard deviation which,

for a standard normal, are 0 and 1 respectively. Use a 1 after the last comma to

produce the cumulative probability, which is the percent of the distribution to the

left of the number indicated: =NORMDIST(2, 0, 1, 1).

• For a non-standard normal variable, use the NORMDIST function and indicate the mean

and the standard deviation. For example, if the mean is 9 and the standard devi-

ation is 3.2, the probability this distribution yields a random variable less than 7 is

=NORMDIST(7,9, 3.2, 1).

Stata


• To calculate the probability a standard normal is less than some value in Stata, use the

normal command. For example, display normal(2) will return the probability that a

standard normal variable is less than 2.

• To calculate probabilities related to a normally distributed random variable with any

mean and standard deviation, we can also standardize the variable manually. For

example, display normal((7-9)/3.2) returns the probability that a normal variable

with mean 9 and standard deviation 3.2 is less than 7.

R

• To calculate the probability a standard normal is less than some value in R, use the pnorm command. For example, pnorm(2) will return the probability that a standard normal variable is less than 2.

• To calculate probabilities related to a normally distributed random variable with any

mean and standard deviation, we can also standardize the variable manually. For

example, pnorm((7-9)/3.2) returns the probability that a normal variable with a mean

of 9 and a standard deviation of 3.2 is less than 7.

CITATIONS AND ADDITIONAL NOTES

Student preface

• Page xxxiii On the illusion of explanatory depth, see http://scienceblogs.com/mixingmemory/2006/11/the_illusion_of_explanatory_de.php.

Chapter 1

• Page 4 Gary Burtless (1995, 65) provides the initial motivation for this example – he

used Twinkies.

• Page 35 See Burtless (1995, 77).

Chapter 3

• Page 65 Sides and Vavreck (2013) provide a great look at how theory can help cut

through some of the overly dramatic pundit-speak on elections.


• Page 85 For a discussion of the Central Limit Theorem and its connection to the normality of OLS coefficient estimates see, for example, Lumley et al. (2002). They note

that for errors that are themselves nearly normal or do not have severe outliers, 80 or

so observations is usually enough.

• Page 91 See, for example, biojournalism.com/2012/08/correlation-vs-causation/.

• Page 101 Stock and Watson (2011, 674) present examples of estimators that highlight

the differences between bias and inconsistency. The estimators are silly, but they make

the point.

– Suppose we tried to estimate the mean of a variable with the first observation in

a sample. This will be unbiased because in expectation this will be equal to the

average of the population. Recall that expectation can be thought of as the average

value of an estimator we would get if we ran an experiment over and over again. This

estimator will not be consistent, though, because no matter how many observations

we have, we’re only using the first observation, meaning that the variance of the

estimator will not get smaller as the sample size gets very large. So, yes, no one in

their right mind would use this estimator, but it nonetheless shows an example of

an estimator that is unbiased, but inconsistent.

– Suppose we tried to estimate the mean of a variable with the sample mean plus $\frac{1}{N}$. This will be biased because the expectation of this estimator will be the population average plus $\frac{1}{N}$. However, this estimator will be consistent because the variance of a sample mean goes down as the sample size increases and the $\frac{1}{N}$ bit will go to zero as the sample size goes to infinity. Again, this is a nutty estimator that no one would use in practice, but it shows how it is possible for an estimator to be biased, but consistent.

Chapter 4

• Page 136 For a report on the Pasteur example, see Manzi (2012, 73) and http://pyramid.spd.louisville.edu/~eri/fos/Pasteur_Pouilly-le-fort.pdf.

• Page 150 The distribution of the standard error of $\hat\beta_1$ follows a $\chi^2$ distribution. A normal random variable divided by the square root of a $\chi^2$ random variable (divided by its degrees of freedom) is distributed according to a t distribution.

• Page 166 The medical example is from Wilson and Butler (2007, 105).

Chapter 5

• Page 211 In Chapter 14 we show on page 715 that the bias term in a simplified example for a model with no constant is $E\left[\frac{\sum \epsilon_i X_i}{\sum X_i^2}\right]$. For the more standard case that includes a constant in the model, the bias term is $E\left[\frac{\sum \epsilon_i (X_i - \bar{X})}{\sum (X_i - \bar{X})^2}\right]$, which is the covariance of $X$ and $\epsilon$ divided by the variance of $X$. See Greene (2003, 148) for a generalization of the omitted variable bias formula for any number of included and excluded variables.


• Page 237 Harvey's analysis includes other variables, including a measure of how ethnically and linguistically divided countries are and a measure of distance from the equator (which is often used in the literature to capture a historical pattern that countries close to the equator have tended to have weaker political institutions).

Chapter 6

• Page 259 To formally show that the OLS $\hat\beta_1$ and $\hat\beta_0$ estimates are functions of the means of the treated and untreated groups requires a bit of a slog through some algebra. From page 72, we know that the bivariate OLS equation for the slope is $\hat\beta_1 = \frac{\sum_{i=1}^{N}(T_i - \bar{T})(Y_i - \bar{Y})}{\sum_{i=1}^{N}(T_i - \bar{T})^2}$, where we use $T_i$ to indicate that our independent variable is a dummy variable (where $T_i = 1$ indicates a treated observation). We can break the sum into two parts, one part for $T_i = 1$ observations and the other for $T_i = 0$ observations. We'll also refer to $\bar{T}$ as $p$, where $p$ indicates the percent of observations that were treated, which is the average of the dummy independent variable. (This is not strictly necessary, but helpful to highlight the intuition that the average of our independent variable is the percent who were treated.)

  $$\hat\beta_1 = \frac{\sum_{T_i=1}(T_i - p)(Y_i - \bar{Y})}{\sum_{i=1}^{N}(T_i - p)^2} + \frac{\sum_{T_i=0}(T_i - p)(Y_i - \bar{Y})}{\sum_{i=1}^{N}(T_i - p)^2}$$

  For the $T_i = 1$ observations, $(T_i - p) = (1 - p)$ because, by definition, the value of $T_i$ in this group is 1. For the $T_i = 0$ observations, $(T_i - p) = (-p)$ because, by definition, the value of $T_i$ in this group is 0. We can pull these terms out of the summation because they do not vary across observations within each summation.

  $$\hat\beta_1 = \frac{(1 - p)\sum_{T_i=1}(Y_i - \bar{Y})}{\sum_{i=1}^{N}(T_i - p)^2} - \frac{p\sum_{T_i=0}(Y_i - \bar{Y})}{\sum_{i=1}^{N}(T_i - p)^2}$$

  We can re-write the denominator as $N_T(1 - p)$ where $N_T$ is the number of individuals who were treated (and therefore have $T_i = 1$).3 We also break the equation into three parts, producing

  $$\hat\beta_1 = \frac{(1 - p)\sum_{T_i=1}Y_i}{N_T(1 - p)} - \frac{(1 - p)\sum_{T_i=1}\bar{Y}}{N_T(1 - p)} - \frac{p\sum_{T_i=0}(Y_i - \bar{Y})}{N_T(1 - p)}$$

  The $(1 - p)$ in the numerator and denominator of the first and second terms cancels out. Note also that the sum of $\bar{Y}$ for the observations where $T_i = 1$ equals $N_T\bar{Y}$, allowing us to express the OLS estimate of $\hat\beta_1$ as

  $$\hat\beta_1 = \frac{\sum_{T_i=1}Y_i}{N_T} - \bar{Y} - \frac{p\sum_{T_i=0}(Y_i - \bar{Y})}{N_T(1 - p)}$$

  We're almost there. Now note that $\frac{p}{N_T(1 - p)}$ in the third term can be written as $\frac{1}{N_C}$ where $N_C$ is the number of observations in the control group (for whom $T_i = 0$).4 We denote the average of the treated group ($\frac{\sum_{T_i=1}Y_i}{N_T}$) as $\bar{Y}_T$ and the average of the control group ($\frac{\sum_{T_i=0}Y_i}{N_C}$) as $\bar{Y}_C$. We can re-write our equation as

  $$\hat\beta_1 = \bar{Y}_T - \bar{Y} - \frac{\sum_{T_i=0}Y_i}{N_C} + \frac{\sum_{T_i=0}\bar{Y}}{N_C}$$

  Using the fact that $\sum_{T_i=0}\bar{Y}$ equals $N_C\bar{Y}$, we can cancel some terms and (finally!) get our result:

  $$\hat\beta_1 = \bar{Y}_T - \bar{Y}_C$$

  To show that $\hat\beta_0$ is $\bar{Y}_C$, use Equation 3.5 from page 73, noting that $\bar{Y} = \frac{\bar{Y}_T N_T + \bar{Y}_C N_C}{N}$.

  3 To see this, re-write $\sum_{i=1}^{N}(T_i - p)^2$ as $\sum_{i=1}^{N}T_i^2 - 2p\sum_{i=1}^{N}T_i + \sum_{i=1}^{N}p^2$. Note that both $\sum_{i=1}^{N}T_i^2$ and $\sum_{i=1}^{N}T_i$ equal $N_T$ because the squared value of a dummy variable is equal to itself and because the sum of a dummy variable is equal to the number of observations for which $T_i = 1$. We also use the facts that $\sum_{i=1}^{N}p^2$ equals $Np^2$ and $p = \frac{N_T}{N}$, which allows us to write the denominator as $N_T - 2\frac{N_T^2}{N} + \frac{N_T^2}{N}$. Simplifying yields $N_T(1 - p)$.

  4 To see this, substitute $\frac{N_T}{N}$ for $p$ and simplify, noting that $N_C = N - N_T$.
N
.

• Page 262 Discussions of non-OLS difference of means tests sometimes gets bogged down

into whether the variance is the same across the treatment and control groups. If the

variance varies across treatment and control groups we would have heteroscedasticity

and should adjust our analysis accordingly.

• Page 283 Poole and Rosenthal (1997) have measured ideology of members of Congress

from 1787 to today. For recent updates, see voteview.com.

• Page 283 For more on the ideological shifts in the Republican Party, see Bailey, Mum-

molo and Noel (2012).

• Page 296 See Kam and Franceze (2007, 48) for the derivation of the variance of estimated effects. The variance of $\hat\beta_1 + D_i\hat\beta_3$ is $var(\hat\beta_1) + D_i^2 var(\hat\beta_3) + 2D_i\,covar(\hat\beta_1, \hat\beta_3)$ where covar is the covariance of $\hat\beta_1$ and $\hat\beta_3$ (see fact #3 on page 769).

– In Stata, we can display covar(—ˆ1 , —ˆ3 ) with the following commands:

regress Y X1 D X1D

matrix V = get(VCE)


disp V[3,1]

For more details, see Kam and Franceze (2007, 136-146).

– In R, generate a regression result object (e.g., OLSResults = lm(Y ~ X1 + D + X1D)) and use the vcov(OLSResults) command to display the variance-covariance matrix for the coefficient estimates. The covariance of $\hat\beta_1$ and $\hat\beta_3$ is the entry in the column labeled X1 and the row labeled X1D.

Chapter 7

• Page 279 This data is from Persico, Postlewaite, and Silverman (2004). Results

are broadly similar even if we exclude outliers with very high salaries.

• Page 319 The data on life expectancy and GDP per capita are from the World Bank's World Development Indicators database available at http://data.worldbank.org/indicator/.

• Page 327 Temperature data is from NASA (2012).

• Page 336 In log-linear models, a one unit increase in $X$ is associated with a $\beta_1$ percent increase in $Y$. The underlying model is funky; it is a multiplicative model of $e$'s raised to the elements of the log-linear model. It is
  $$Y = e^{\beta_0} e^{\beta_1 X} e^{\epsilon}$$
  If we use the fact that $\log(e^A e^B e^C) = A + B + C$ and log both sides, we get the log-linear formulation:
  $$\ln Y = \beta_0 + \beta_1 X + \epsilon$$
  If we take the derivative of $Y$ with respect to $X$ in the original model, we get
  $$\frac{dY}{dX} = e^{\beta_0} \beta_1 e^{\beta_1 X} e^{\epsilon}$$
  Divide both sides by $Y$ so that the change in $Y$ is expressed as a percentage change in $Y$ and then cancel, yielding
  $$\frac{dY/Y}{dX} = \frac{e^{\beta_0} \beta_1 e^{\beta_1 X} e^{\epsilon}}{e^{\beta_0} e^{\beta_1 X} e^{\epsilon}} = \beta_1$$

Chapter 8

• Page 380 See Bailey, Strezhnev, and Voeten (2015) for U.N. voting data.

Chapter 9

• Page 424 Endogeneity is a central concern of Medicaid literature. See, for example,

Currie and Gruber (1996), Finkelstein et al. (2012) and Baicker et al. (2013).

• Page 458 The reduced form is simply the model rewritten to be only a function of the non-endogenous variables (which are the $X$ and $Z$ variables, not the $Y$ variables). This equation isn't anything fancy, although it takes a bit of math to see where it comes from. Here goes:

  1. Insert Equation 9.12 into Equation 9.13:
     $$Y_{2i} = \gamma_0 + \gamma_1(\beta_0 + \beta_1 Y_{2i} + \beta_2 X_{1i} + \beta_3 Z_{1i} + \epsilon_{1i}) + \gamma_2 X_{1i} + \gamma_3 Z_{2i} + \epsilon_{2i}$$
  2. Rearrange by multiplying by the $\gamma_1$ term as appropriate and combining terms for $X_1$:
     $$Y_{2i} = \gamma_0 + \gamma_1\beta_0 + \gamma_1\beta_1 Y_{2i} + (\gamma_1\beta_2 + \gamma_2)X_{1i} + \gamma_1\beta_3 Z_{1i} + \gamma_1\epsilon_{1i} + \gamma_3 Z_{2i} + \epsilon_{2i}$$
  3. Rearrange some more by moving all $Y_2$ terms to the left side of the equation:
     $$Y_{2i} - \gamma_1\beta_1 Y_{2i} = \gamma_0 + \gamma_1\beta_0 + (\gamma_1\beta_2 + \gamma_2)X_{1i} + \gamma_1\beta_3 Z_{1i} + \gamma_1\epsilon_{1i} + \gamma_3 Z_{2i} + \epsilon_{2i}$$
     $$Y_{2i}(1 - \gamma_1\beta_1) = \gamma_0 + \gamma_1\beta_0 + (\gamma_1\beta_2 + \gamma_2)X_{1i} + \gamma_1\beta_3 Z_{1i} + \gamma_1\epsilon_{1i} + \gamma_3 Z_{2i} + \epsilon_{2i}$$
  4. Divide both sides by $(1 - \gamma_1\beta_1)$:
     $$Y_{2i} = \frac{\gamma_0 + \gamma_1\beta_0 + (\gamma_1\beta_2 + \gamma_2)X_{1i} + \gamma_1\beta_3 Z_{1i} + \gamma_1\epsilon_{1i} + \gamma_3 Z_{2i} + \epsilon_{2i}}{(1 - \gamma_1\beta_1)}$$
  5. Re-label $\frac{\gamma_0 + \gamma_1\beta_0}{(1 - \gamma_1\beta_1)}$ as $\pi_0$, $\frac{(\gamma_1\beta_2 + \gamma_2)}{(1 - \gamma_1\beta_1)}$ as $\pi_1$, $\frac{\gamma_1\beta_3}{(1 - \gamma_1\beta_1)}$ as $\pi_2$, $\frac{\gamma_3}{(1 - \gamma_1\beta_1)}$ as $\pi_3$, and combine the $\epsilon$ terms into $\tilde\epsilon$:
     $$Y_{2i} = \pi_0 + \pi_1 X_{1i} + \pi_2 Z_{1i} + \pi_3 Z_{2i} + \tilde\epsilon_i$$

  This "reduced form" equation isn't a causal model in any way. The $\pi$ coefficients are crazy mixtures of the coefficients in Equations 9.12 and 9.13, which are the equations that embody the story we are trying to evaluate. The reduced form equation is simply a useful way to write down the first stage model.

Chapter 10

• Page ?? See Newhouse (1993) and Gerber and Green (2012, 212-214) for more on the

RAND experiment.

Chapter 12

• Page 612 A good place to start when considering MLE is with the name. Maximum is,

well, maximum; likelihood refers to the probability of observing the data we observe,

and estimation is, well, estimation.

For most people, the new bit is the likelihood. The concept is actually quite close to ordinary usage. Roughly 20 percent of the U.S. population is under 15. What is the likelihood that when we pick three people randomly we get two people under 15 and one over 15? The likelihood (which we'll label "L") is $L = 0.2 \times 0.2 \times 0.8 = 0.03$. In other words, if we pick three people at random in the United States, there is a 3 percent chance (or, "likelihood") we will observe two people under 15 and one over 15.

We can apply this concept when we do not know the underlying probability. Suppose that we want to figure out what proportion of the population has health insurance. Let's call "$p_{insured}$" the probability someone is insured (which is simply the proportion of insured in the United States). Suppose we randomly select three people, ask them if they are insured, and find out that two are insured and one is not. The probability (or "likelihood") of observing that combination is
$$L = p_{insured} \times p_{insured} \times (1 - p_{insured}) = p_{insured}^2 - p_{insured}^3$$
MLE finds an estimate of $p_{insured}$ that maximizes the likelihood of observing the data we actually observed.

We can get a feel for what values lead to high or low likelihoods by trying out a few possibilities. If our estimate were $p_{insured} = 0$, the likelihood, $L$, would be 0. That's a silly guess. If our estimate were $p_{insured} = 0.5$ then $L = 0.5 \times 0.5 \times (1 - 0.5) = 0.125$, which is better. If we chose $p_{insured} = 0.7$ then $L = 0.7 \times 0.7 \times 0.3 = 0.147$, which is even better. But if we chose $p_{insured} = 0.9$ then $L = 0.9 \times 0.9 \times 0.1 = 0.081$, which is not as high as some of our other guesses.

Conceivably we could keep plugging different values of $p_{insured}$ into the likelihood equation until we found the best value. Or, calculus gives us tools to quickly find maxima.5 When we observe two people with insurance and one without, the value of $p_{insured}$ that maximizes the likelihood is $\frac{2}{3}$ which, by the way, is the common sense estimate when two of three observed people are insured.
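The same maximum can also be found numerically. As a hedged illustration (not from the book's Computing Corner), the short R sketch below maximizes the likelihood $p^2(1 - p)$ and returns roughly 2/3:

    # Numerically maximize the likelihood for two insured and one uninsured person
    likelihood <- function(p) p^2 * (1 - p)
    optimize(likelihood, interval = c(0, 1), maximum = TRUE)$maximum   # about 0.667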

5 Here's the formal way to do this using calculus. First, calculate the derivative of the likelihood with respect to $p$: $\frac{\partial L}{\partial p} = 2p_{insured} - 3p_{insured}^2$. Second, setting the derivative to zero and solving for $p_{insured}$ yields $p_{insured} = \frac{2}{3}$.

To use MLE to estimate a probit model we extend this logic. Instead of estimating a


single probability parameter ($p_{insured}$ in our example above) we estimate the probability $Y_i = 1$ as a function of independent variables. In other words, we substitute $\Phi(\beta_0 + \beta_1 X_i)$ for $p_{insured}$ in the likelihood equation above. In this case, the thing we are trying to learn about is no longer $p_{insured}$, but is now the $\beta$s, which determine the probability for each individual based on their $X_i$ value.

The likelihood if we observe two people who are insured and one who is not is
$$L = \Phi(\beta_0 + \beta_1 X_1) \times \Phi(\beta_0 + \beta_1 X_2) \times (1 - \Phi(\beta_0 + \beta_1 X_3))$$
where $\Phi(\beta_0 + \beta_1 X_1)$ is the probability person 1 is insured (where here $X_1$ refers to the value of $X$ for the first person rather than a separate variable $X_1$ as we typically use the notation elsewhere), $\Phi(\beta_0 + \beta_1 X_2)$ is the probability person 2 is insured, and $(1 - \Phi(\beta_0 + \beta_1 X_3))$ is the probability person 3 is not insured.

MLE finds the $\hat\beta$ that maximizes the likelihood, $L$. The actual estimation process is complicated; again, that's why computers are our friends.

• Page 623 To use the average-case approach, create a single “average” person for whom

the value of each independent variable is the average of that independent variable. We

calculate a fitted probability for this person. Then we add one to the value of X1 for this

average person and calculate how much the fitted probability goes up. The downside

of the average-case approach is that in the real data there might not be anyone who is

average across all variables as the variables might typically cluster together. It’s also


kind of weird because dummy variables for the "average" person will be between 0 and 1

even though no single observation will have any value other than 0 and 1. This means,

for example, that the “average” person will be 0.52 female and 0.85 right-handed and

so forth.

To interpret probit coefficients using the average-case approach, use the following guide.

  – If $X_1$ is continuous:
    1. Calculate $P_1$ as the fitted probability using $\hat\beta$ given all variables are at their average values. This is
       $\Phi(\hat\beta_0 + \hat\beta_1 \bar{X}_1 + \hat\beta_2 \bar{X}_2 + \hat\beta_3 \bar{X}_3 + ...)$
    2. Calculate $P_2$ as the fitted probability using $\hat\beta$ given $X_1 = \bar{X}_1 + 1$ and all other variables are at their average values. This is
       $\Phi(\hat\beta_0 + \hat\beta_1 (\bar{X}_1 + 1) + \hat\beta_2 \bar{X}_2 + \hat\beta_3 \bar{X}_3 + ...)$
       Sometimes it makes more sense to increase $X_1$ by a standard deviation of $X_1$ rather than simply by one. For example, if the scale of $X_1$ is in the millions of dollars, increasing it by 1 will produce the tiniest of changes in fitted probability even when the effect of $X_1$ is large.
    3. The difference $P_2 - P_1$ is the estimated effect of a one unit (or, if used in step 2, one standard deviation) increase in $X_1$ holding all other variables constant.
  – If $X_1$ is a dummy variable:
    1. Calculate $P_1$ as the fitted probability using $\hat\beta$ given $X_1 = 0$ and all other variables are at their average values. This is
       $\Phi(\hat\beta_0 + \hat\beta_1 \times 0 + \hat\beta_2 \bar{X}_2 + \hat\beta_3 \bar{X}_3 + ...)$
    2. Calculate $P_2$ as the fitted probability using $\hat\beta$ given $X_1 = 1$ and all other variables are at their average values. This is
       $\Phi(\hat\beta_0 + \hat\beta_1 \times 1 + \hat\beta_2 \bar{X}_2 + \hat\beta_3 \bar{X}_3 + ...)$
    3. The difference $P_2 - P_1$ is the estimated effect of a one unit increase in $X_1$ holding all other variables constant.

If $X_1$ is a dummy variable, the command margins, dydx(X1) atmeans will produce an average-case method estimate of the effect of a change in the dummy variable. If $X_1$ is continuous, the command margins, dydx(X1) atmeans will produce an average-case method estimate of the marginal effect of a change in the variable.

• Page 623 The marginal effects approach uses calculus to determine the slope of the

fitted line. Obviously the slope of the probit fitted line varies, so we have to determine

a reasonable point to calculate this slope. In the observed-value approach, we find the

slope at the point defined by actual values of all the independent variables. This will be
ˆP rob(Yi =1)
ˆX1
. We know that the P rob(Yi = 1) is a CDF and one of the nice properties of


a CDF is that its derivative is simply the PDF. (We can see this graphically in Figure 12.5 by noting that if we increase the number on the horizontal axis by a small amount, the CDF will increase by the value of the PDF at that point.) Applying that property plus the chain rule, we get ∂Φ(β̂0 + β̂1X1i + β̂2X2i)/∂X1 = φ(β̂0 + β̂1X1i + β̂2X2i)β̂1, where φ() is the normal PDF. Hence the marginal effect of increasing X1 at the observed values is φ(β̂0 + β̂1X1i + β̂2X2i)β̂1.

The discrete differences approach is an approximation to the marginal effects approach.

If the scale of X1 is large such that an increase of 1 unit is small, then the marginal

effects and discrete differences approach will yield similar results. If the scale of X1 is

small such that an increase of 1 unit is a relatively large increase, then the marginal

effects and discrete differences approach may differ noticeably.

We show how to calculate marginal effects in Stata on page 646 and in R on page 648.
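As a compressed illustration of that logic (full details are on the pages just cited), the observed-value marginal effect is the normal PDF evaluated at each observation's fitted index times β̂1, averaged over the sample. The sketch below reuses the hypothetical probit fit and variable names from the earlier sketch; everything here is illustrative, not from the text.

# Hypothetical probit fit (same hypothetical data frame and names as above)
fit <- glm(insured ~ income + age + female,
           family = binomial(link = "probit"), data = dat)

xb <- predict(fit, type = "link")       # beta0-hat + beta1-hat*X1i + ... for each observation
me <- dnorm(xb) * coef(fit)["income"]   # phi(index) * beta-hat, observation by observation
mean(me)                                # average marginal effect of income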

• Page 626 Jacobsmeier and Lewis (2013) respond to Mutz (2010).

Chapter 14

• Page 733 We can also derive the attenuation bias result using the general form of endogeneity from page 91, which is plim β̂1 = β1 + corr(X1, ε)(σε/σX1) = β1 + cov(X1, ε)/σ²X1. Note that "ε" in Equation 14.19 actually contains −β1νi + εi. Solving for cov(X1, −β1νi + εi) yields −β1σ²ν.
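A quick simulation (not from the text; all values invented) illustrates the attenuation result: with a true β1 of 1 and measurement error whose variance equals the variance of the true X, the OLS estimate using the mismeasured X should converge to about 0.5.

set.seed(42)
n      <- 10000
x_true <- rnorm(n)                    # true independent variable
y      <- 2 + 1 * x_true + rnorm(n)   # true beta1 = 1
x_obs  <- x_true + rnorm(n)           # observed X = true X plus measurement error (nu)

coef(lm(y ~ x_true))["x_true"]   # approximately 1
coef(lm(y ~ x_obs))["x_obs"]     # attenuated toward zero: about
                                 # var(x_true) / (var(x_true) + var(nu)) = 0.5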


Chapter 13

• Page 661 Another form of correlated errors is spatial autocorrelation, which occurs when

the error for one observation is correlated with the error for another observation that is

spatially close to it. Our polling example is predicated on the idea that there may be

spatial autocorrelation because those who live close to each other (and sleep in the same

bed!) may have correlated errors. This kind of situation can arise with geographically based data, such as state- or county-level data, because there may be unmeasured similarities (meaning stuff in the error term) that are common within regions. The consequences of

spatial autocorrelation are similar to the consequences of serial autocorrelation. Spatial

autocorrelation does not cause bias. Spatial autocorrelation does cause the conventional

standard error equation for OLS coefficients to be incorrect. The easiest first step for

dealing with this situation is simply to include a dummy variable for region. Often

this step will capture any regional correlations not captured by the other independent

variables. A more technically complex way of dealing with this situation is via spatial

regression statistical models. The intuition underlying these models is similar to that

for serial correlation, but the math is typically harder. See, for example, Tam Cho and

Gimpel (2012).
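In R, the region-dummy first step is just a matter of adding a factor to the model. The sketch below uses a hypothetical county-level data frame and variable names; the second option shown, cluster-robust standard errors by region, is a common related remedy that is not discussed in the note above and assumes the sandwich and lmtest packages are installed.

# Hypothetical county-level data frame 'counties' with a 'region' identifier
summary(lm(crime ~ police + factor(region), data = counties))

# Alternative: keep the simple model but cluster the standard errors by region
library(sandwich)
library(lmtest)
m <- lm(crime ~ police, data = counties)
coeftest(m, vcov = vcovCL(m, cluster = ~ region))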

• Page 666 Wooldridge (2009, 416) discusses inclusion of X variables in this test.

• Page 673 The difference between the two approaches is that the Cochrane-Orcutt method


loses the first observation (because there are no lagged values for it), while the Prais-Winsten method fills in the transformed first observation with a reasonable transformation.

• Page 674 Wooldridge (2009, 424) notes that the ρ-transformed approach also requires that εt not be correlated with Xt−1 or Xt+1. In a ρ-transformed model, the independent variable is Xt − ρXt−1 and the error is εt − ρεt−1. If the lagged error term (εt−1) is correlated with Xt, then the independent variable in the ρ-transformed model will be correlated with the error term in the ρ-transformed model. We are assuming that the distribution of the error term isn't shifting over time (see our discussion of stationarity on page 683 for more on this topic). In other words, if εt is not correlated with Xt−1, then it is also not correlated with Xt+1.
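A hand-rolled R sketch of the ρ-transform (not from the text; all data simulated) may help fix ideas: it simulates AR(1) errors, estimates ρ from the OLS residuals, and then quasi-differences the data Cochrane-Orcutt style. The Prais-Winsten variant would additionally rescale and keep the first observation.

set.seed(7)
nT <- 200
x  <- rnorm(nT)
e  <- as.numeric(arima.sim(list(ar = 0.6), n = nT))   # AR(1) errors with rho = 0.6
y  <- 1 + 0.5 * x + e

ols <- lm(y ~ x)
u   <- residuals(ols)
rho <- coef(lm(u[-1] ~ 0 + u[-nT]))                   # estimate rho from the residuals

# Cochrane-Orcutt-style quasi-differencing (drops the first observation)
y_star <- y[-1] - rho * y[-nT]
x_star <- x[-1] - rho * x[-nT]
summary(lm(y_star ~ x_star))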

• Page 682 The so-called Breusch-Godfrey test is a more general test for autocorrelation.

See, for example, Greene (2003, 269).
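For reference, the Breusch-Godfrey test is implemented in R's lmtest package (this sketch assumes the package is installed; the data are simulated for illustration only).

library(lmtest)
set.seed(7)
x <- rnorm(200)
e <- as.numeric(arima.sim(list(ar = 0.6), n = 200))   # serially correlated errors
y <- 1 + 0.5 * x + e
bgtest(lm(y ~ x), order = 2)    # Breusch-Godfrey test for autocorrelation up to order 2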

• Page 689 R code to generate multiple simulations with unit root (or other) time series

variables:


Nsim     <- 200               # Number of obs. in each simulation
SimCount <- 100               # Number of simulations
SimT     <- rep(NA, SimCount) # Stores t stat for each simulation
Gamma    <- 1.0               # 1 for unit root; < 1 otherwise
for (s in 1:SimCount) {       # Loop thru SimCount simulations
  Y <- 0                      # Start value for Y
  X <- 0                      # Start value for X
  for (ii in 1:Nsim) {        # Loop to create Y and X values
    Y <- c(Y, Gamma * Y[ii] + rnorm(1))      # Generate Y as dynamic process
    X <- c(X, Gamma * X[ii] + rnorm(1))      # Generate X as dynamic process
  }                           # End ii loop
  SimT[s] <- summary(lm(Y ~ X))$coef[2, 3]   # Store t stat on X
}                             # End s loop
sum(abs(SimT) > 2) / SimCount # Share of simulations with |t stat| > 2
• Page 691 For more on the Dickey-Fuller test and its critical values, see Greene (2003,

638).
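In R, one common implementation of the (augmented) Dickey-Fuller test is adf.test() in the tseries package (assuming it is installed). On a simulated random walk it should typically fail to reject the null of a unit root; the example below is illustrative only.

library(tseries)
set.seed(1)
rw <- cumsum(rnorm(200))   # a random walk (unit root process)
adf.test(rw)               # null hypothesis: unit root (non-stationary)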

Chapter 16

• Page 756 Columbia Professor Andrew Gelman directed me to this saying of Bill James.

GUIDE TO SELECTED DISCUSSION QUESTIONS

Chapter 1

Discussion question on page 12:

1. In panel (b) of Figure 1.4 we can see that β0 > 0 (it's around 40) and β1 > 0 as well.

Panel (a): β0 > 0 (it's around 0.4) and β1 > 0
Panel (b): β0 > 0 (it's around 0.8) and β1 < 0
Panel (c): β0 > 0 (it's around 0.4) and β1 = 0
Panel (d): Note that the X-axis ranges from about −6 to +6. β0 is the value of Y when X is zero and is therefore 2, which can be seen in Figure A.7. β0 is not the value of Y at the left-most point in the figure like it was for the other panels in Figure 1.4.

Chapter 4

Discussion questions on page 161:


[Figure A.7, a scatterplot of the dependent variable Y against the independent variable X (with X running from about −6 to 6), appears here.]

FIGURE A.7: Identifying β0 from a Scatterplot

1. Based on the results in Table 4.2 on page 142:

(a) The t statistic for the coefficient on change in income is 2.29/0.52 = 4.40.

(b) The degrees of freedom is the sample size minus the number of parameters estimated, so it is 17 − 2 = 15.

(c) The critical value for a two-sided alternative hypothesis and α = 0.01 is 2.95. We reject the null hypothesis.

(d) The critical value for a one-sided alternative hypothesis and α = 0.05 is 1.75. We reject the null hypothesis.
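These numbers can be checked in R with simple arithmetic and qt() for the t critical values:

2.29 / 0.52            # t statistic, about 4.40
qt(0.995, df = 15)     # two-sided critical value for alpha = 0.01, about 2.95
qt(0.95,  df = 15)     # one-sided critical value for alpha = 0.05, about 1.75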

2. The critical value from a two-sided test is bigger because it marks the point beyond which only α/2 (rather than α) of the distribution lies. As Table 4.4 on page 157 shows, the two-sided critical values are larger than the one-sided critical values for all values of α.


3. The critical values from a small sample are larger because there is additional uncertainty about our estimate of the standard error of β̂1 that the t distribution accounts for. In other words, even when the null hypothesis is true, the data could work out such that we get an unusually small estimate of se(β̂1), which would push up our t statistic; the more uncertainty there is about se(β̂1), the more often we can expect to see large values of the t statistic even when the null hypothesis is true. As the sample size increases, uncertainty about se(β̂1) decreases, and so this source of large t statistics under a true null hypothesis diminishes.

Discussion questions on page 174:

1. The power of a test is the probability of observing a t statistic higher than the critical value given the true value of β1, se(β̂1), α, and the alternative hypothesis posited in the question. This will be 1 − Φ(critical value − β1True/se(β̂1)). The critical value will be 2.32 for α = 0.01 and a one-sided alternative hypothesis. The sketches will be normal distributions centered at β1True/se(β̂1) with the portion of the normal distribution greater than the critical value shaded.

(a) The power when β1True = 1 is 1 − Φ(2.32 − 1/0.75) = 0.162.

(b) The power when β1True = 2 is 1 − Φ(2.32 − 2/0.75) = 0.636.

2. If the estimated se(β̂1) doubled, the power would go down because the center of the t statistic distribution would shift toward zero (because β1True/se(β̂1) gets smaller as the standard


error increases). For this higher standard error, the power when β1True = 1 is 1 − Φ(2.32 − 1/1.5) = 0.049 and the power when β1True = 2 is 1 − Φ(2.32 − 2/1.5) = 0.161.

3. The probability of committing a Type II error is simply one minus the power. Hence when se(β̂1) = 2.5, the probability of committing a Type II error is 0.838 for β1True = 1 and 0.364 for β1True = 2.
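A small R function reproduces these power calculations with pnorm(); the critical value of 2.32 matches the one-sided, α = 0.01 case used above.

power <- function(beta_true, se, crit = 2.32) 1 - pnorm(crit - beta_true / se)
power(1, 0.75)       # 0.162
power(2, 0.75)       # 0.636
power(1, 1.5)        # 0.049
power(2, 1.5)        # 0.161
1 - power(1, 0.75)   # Type II error probability, 0.838
1 - power(2, 0.75)   # Type II error probability, 0.364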

Chapter 5

Discussion questions on page 208:

1. a) Do you accept this recommendation? Kevin Drum’s response to this scenario: “If

you’re smart, you’d think I’m an idiot. As kids get older, they weigh more. They

also do better on math tests. I haven’t discovered a link between weight and math

ability. All I’ve discovered is the obvious fact that older kids know more math.”

b) Write down a model that embodies Drum’s scenario.

Test scorei = β0 + β1Weighti + εi

c) Propose additional variables for this model. Age is an obvious factor to control for.

There could be others: Family income, class size, instructional techniques, and so

forth.

d) Does inclusion of additional controls provide definitive proof? Kevin Drum writes:

“The usual way to handle this is to control for age. That is, I need to find out


if kids of the same age show the same relationship, namely that heavier ones are

better at math. Suppose I did that, and it turned out they are. Am I vindicated?

Not quite. It’s possible, for example, that kids who like math are more sedentary

than kids who don’t. That makes them heavier. The chain of causation doesn’t go

from weight to math, it goes from fondness for math to scores on math tests. But

fondness for math also makes you heavier.”

Discussion questions on page 234:

1. Not at all. R²j will be zero. In a random experiment, the treatment is uncorrelated with anything. Most importantly this buys us exogeneity, but it also buys us increased precision.

2. We'd like to have a low variance for estimates, and to get that we want the R²j to be small. In other words, we want the other variables to explain as little of Xj as possible.

Chapter 6

Discussion questions on page 266:

1. (a) Control group: 0. Treatment group: 2. Difference is 2.

(b) Control group: 4. Treatment group: -6. Difference is -10.

(c) Control group: 100. Treatment group: 100. Difference is 0.

2. (a) β̂0: 0. β̂1: 2


(b) β̂0: 4. β̂1: -10

(c) β̂0: 100. β̂1: 0

Discussion questions on page 282:

1. A model in which a three-category categorical country variable has been converted into

multiple dummy variables with the United States as the excluded category looks like

the following.

Yi = β0 + β1X1i + β2Canadai + β3Mexicoi + εi

The estimated constant (β̂0) is the average value of Yi for units in the excluded category (in this case, U.S. citizens) after taking into account the effect of X1. The coefficient on the Canada dummy variable (β̂2) estimates how much more or less Canadians feel about Y compared to Americans, the excluded reference category. The coefficient on the Mexico dummy variable (β̂3) estimates how much more or less Mexicans feel about Y compared to Americans. Using Mexico or Canada as the excluded category is equally valid and would produce substantively identical results, although the coefficients on the dummy variables will differ because they will refer to a different reference category than when the United States is the excluded category.
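In R, the choice of excluded category can be controlled with relevel(); the sketch below uses a hypothetical data frame dat whose country variable contains the three country names.

# Make the United States the excluded (reference) category
dat$country <- relevel(factor(dat$country), ref = "United States")
summary(lm(y ~ x1 + country, data = dat))   # R creates the Canada and Mexico dummies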

2. (a) 25

(b) 20

(c) 30


(d) 115

(e) 5

(f) -20

(g) 120

(h) -5

(i) -25

(j) 5

Discussion questions on page 298:

1. (a) β0 = 0, β1 > 0, β2 > 0, β3 = 0

(b) β0 > 0, β1 < 0, β2 > 0, β3 = 0

(c) β0 > 0, β1 = 0, β2 = 0, β3 > 0

(d) β0 > 0, β1 > 0, β2 = 0, β3 < 0 (actually β3 = −β1)

(e) β0 > 0, β1 > 0, β2 < 0, β3 > 0

(f) β0 > 0, β1 > 0, β2 > 0, β3 < 0

2. Bonus question: β3 in panel (d) is −β1.

3. False. The effect of X for the treatment group depends on β1 + β3. If β1 is sufficiently positive, then the effect of X is still positive for the treatment group even when β3 is


negative.

Chapter 8

Discussion questions on page 386

1. (a) The error term includes the ability of the students, the quality of the teacher, the

time the class meets, the room (Does it have a window? Is it loud?) and other

factors.

(b) There is likely a teacher specific fixed effect that differs across teachers. There may

also be a course-specific error term (e.g., students always love stats! and hate, hate,

hate Introduction to the Movies).

(c) It is plausible that more students take courses from popular teachers (who have

high fixed effects), which would induce correlation between the error term and the

number of students (unless a teacher fixed effect or some other measure of teacher quality is included).
quality is included).
Discussion questions on page 410 - see Table A.2:

Table A.2: Values of β0, β1, β2 and β3 in Figure 8.6

        (a)   (b)   (c)   (d)
β0       2     3     2     3
β1      -1    -1     0     0
β2       0    -2     2    -2
β3       2     2    -1     1

Chapter 9
Discussion questions on page 438:


1. Cell phones and violence


a) There are a number of sources of potential endogeneity. It could be simply that
the more people there are, the more likely telecommunication firms will be to build
cell phone infrastructure and the more likely there are to be violent incidents. It
could also be that wealthy regions have more cell phones and less violence. It may
be that violence itself discourages investment in cell towers and other infrastructure.
A number of these factors could be controlled for with additional variables such as
population and wealth in the equation, hence pulling them from the error term.
b) Consider a measure of regulatory quality as an instrument for cell phone coverage.
Run a first stage regression in which cell phone coverage is the dependent variable
and the regulatory quality variable is an independent variable. We could also control
for any other variables we plan on including in the equation of interest (in which
violence is the dependent variable). As it turns out, it does satisfy the inclusion
condition.
c) We cannot test whether regulatory quality is in fact uncorrelated with violence. The
authors of the study argue that after controlling for other factors like wealth and
population, the regulatory quality variable should have no direct effect on violence,
which is another way of saying that they do not believe it is correlated with the error
term in the violence equation.
2. Do political protests matter?
a) We are concerned that factors that influence turnout at the protests (such as ideol-
ogy) could also influence the vote share of the Republican candidate in that district.
If a district has a lot of conservatives, we expect more turnout at the protests, but
we also expect more voting for Republicans. If we fail to control for this possibility,
the Tea Party protest variable could capture the effect of conservatism rather than
the effect of the protests themselves.
b) Consider local rainfall as an instrument for Tea Party protest turnout. To assess
whether the proposed instrument satisfies the inclusion condition, simply run an
OLS model in which turnout at the protests is the dependent variable and rainfall is
the independent variable. This variable does indeed satisfy the inclusion condition.
c) We cannot directly test whether rainfall is in fact uncorrelated with the error term in
the Republican vote share equation. The authors of the study can show some indirect
evidence. They show that rainfall is not correlated with Republican vote share in the
previous election, suggesting at least that it’s not the case that the places with more
rainfall on April 15, 2009 were somehow more Republican leaning. The authors also
show that rainfall is not correlated with other variables such as race. They do find


that rainfall on April 15, 2009 is modestly correlated with unemployment, meaning
they need to control for that factor in their full model.
3. Do institutions matter?
a) It could be that countries with high economic growth have better institutions. That
is, rich countries can pay for people and other things necessary to make government
work more effectively. This situation is analogous to the crime and police example.
The police could be going to where the crime is; in this case, the good institutions
could be going to where the economic growth is.
b) To test whether the settler mortality variable satisfies the inclusion condition we run
a model in which institutional quality is the dependent variable and settler mortality
is the independent variable.
c) We cannot directly test whether settler mortality in the 18th century has a direct
effect on modern economic growth. The authors of the study argue that this variable
is so far in the past and relates to a threat to mortality that modern technology may
well have changed. They argue that the only reasonable effect that long-ago settler
mortality has on modern growth is due to the government institutions created by
colonial powers.
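For any of these examples, the mechanics in R look roughly like the sketch below, using the cell phone study as the template; the data frame africa and all variable names are hypothetical, and the 2SLS line assumes the AER package is installed.

# First stage: does the instrument predict the endogenous variable?
summary(lm(cell_coverage ~ reg_quality + wealth + population, data = africa))

# 2SLS for the equation of interest (instruments and exogenous controls after the "|")
library(AER)
summary(ivreg(violence ~ cell_coverage + wealth + population |
                reg_quality + wealth + population, data = africa))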

Chapter 11
Discussion questions on page 553:
1. See Cellini, Ferreira, and Rothstein (2009) for a study of school bond passage and
housing values.
a) The assignment variable is the election results. The threshold is 50 percent. (The
model in the actual paper is a bit more involved. Some bond measures needed more
than 50 percent to pass so their assignment variable is actually percent above or
below the threshold needed to win.)
b) Estimate a model of house values using election results as an assignment variable
and a dummy variable for passage of the bond issue as the treatment variable.
c) The basic version of their model is
House values i = —0 + —1 Ti + —2 (Election support i ≠ 50) + ‹i
Ti = 1 if Election support i Ø 50
Ti = 0 if Election support i < 50

where House values i is the average housing value in city i three years after the bond
election and Election support i is the percent of voters who supported the education


bond. (The model in the actual paper is a bit more involved. Among other things,
the authors used logged values of home prices; see page 329 for details on how to use
logged models.)
2. See Card, Dobkin, and Maestas (2009) for an influential study that used RD to study
Medicare. They looked at mortality of all patients admitted to the hospital. They used
individual data, but for the purposes of this question, we’ll work with grouped data so
that we can use a continuous variable (percent of people in a given group who died)
instead of a dichotomous variable (whether an individual died or not). The data are
grouped by birth months so that everyone in the group is the same age.
a) The assignment variable is age. The threshold is 65 years of age.
b) Estimate a mortality model using age as the assignment variable and a dummy variable for
being over 65 as the treatment variable. As people get older we expect mortality to
increase, but we do not expect mortality to “jump” at 65 once we have accounted
for the effect of age on mortality.
c) A basic equation for this model is
Mortalityg = β0 + β1Tg + β2(Ageg − 65) + νg
Tg = 1 if Ageg ≥ 65
Tg = 0 if Ageg < 65
where Mortalityg is the percent of people in group g who died within a week of being
admitted, Tg is whether people in the group were eligible for Medicare, and Ageg
is the age of the people in the group (which will be essentially the same for all in
the group because they share the same birth month). Card, Dobkin, and Maestas
looked at mortality across a wide range of time frames – within a day, week, year,
and so on.
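In R, either RD specification above can be estimated with lm(); a minimal sketch for the school bond example, with a hypothetical data frame bonds containing house_value and support (percent voting yes), is:

bonds$passed <- as.numeric(bonds$support >= 50)                    # treatment dummy
summary(lm(house_value ~ passed + I(support - 50), data = bonds))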
Discussion questions on page 560:
(a) β1 = 0, β2 = 0, β3 < 0
(b) β1 < 0, β2 = 0, β3 > 0
(c) β1 > 0, β2 < 0, β3 = 0
(d) β1 < 0, β2 > 0, β3 < 0
(e) β1 > 0, β2 > 0, β3 < 0 (actually β3 = −β2)
(f) β1 < 0, β2 < 0, β3 > 0 (here too β3 = −β2, which means β3 is positive because β2 is negative)


Chapter 12
Discussion questions on page 618:
1. Solve for Yi* = 0.
• Panel (a): X = 1.5
• Panel (b): X = 23
• Panel (c): X = 1.0
• Panel (d): X = 1.5

2. True, false, or indeterminate based on Table 12.2:
(a) True. The t statistic is 5, which is statistically significant for any reasonable significance level.
(b) False. The t statistic is 1, which is not statistically significant for any reasonable significance level.
(c) False! Probit coefficients cannot be directly interpreted.
(d) False. The fitted probability is Φ(0), which is 0.50.
(e) True. The fitted probability is Φ(3), which is approximately 1 because virtually all of the area under a standard normal curve is to the left of 3.

3. Fitted values based on Table 12.2
(a) The fitted probability is Φ(0 + 0.5 × 4 − 0.5 × 0) = Φ(2), which is 0.978.
(b) The fitted probability is Φ(0 + 0.5 × 0 − 0.5 × 4) = Φ(−2), which is 0.022.
(c) The fitted probability is Φ(3 + 1.0 × 0 − 3.0 × 1) = Φ(0), which is 0.5.
Discussion questions on page 625:
1. Use the observed-variable, discrete-differences approach to interpreting the coefficient.
Calculate the fitted probability for all observations using X1i = 0 and the actual value
of X2i . Then calculate the fitted probability for all observations using X1i = 1 and the
actual value of X2i . The average difference in these fitted probabilities is the average
effect of X1 on the probability Y = 1.
2. Use the observed-variable, discrete-differences approach to interpreting the coefficient. Calculate the fitted probability for all observations using the actual values of X1i and X2i. Then calculate the fitted probability for all observations using X1i + 1 in place of the actual X1i and the actual value of X2i. The average difference in these fitted probabilities is the average effect of a one unit increase in X1 on the probability Y = 1.


Chapter 14
Discussion questions on page 729:
1. The full model that includes the unmeasured productivity is

Incomei = β0 + β1Facebook hoursi + β2Productivityi + εi

We can expect that more productive people spend less time on Facebook (meaning the correlation of X1 and X2 is negative). We can also expect that more productive people earn more (meaning that β2 > 0). Hence we would expect β̂1 from a model excluding productivity to understate the effect of Facebook, meaning that the estimated effect will be more negative than it really is. Be careful to note that understate in this context does not mean simply that the coefficient will be small (i.e., close to zero); it means that the coefficient will be either less positive or more negative than it should be. A reasonable expectation is that β̂1 in the model without productivity will be less than zero. If so, we should worry that some portion of the negative coefficient comes from the fact that we have not measured productivity. Note also that we are speculating about the relationships of productivity with the other variables. We could be wrong.
2. The full model that includes the candidate quality variable is

Vote sharei = β0 + β1Campaign spendingi + β2Candidate qualityi + εi

We can expect that candidate quality is associated with raising more money (meaning the correlation of X1 and X2 is positive). We can also expect that higher quality candidates get higher vote shares (meaning that β2 > 0). Hence we would expect that β̂1 from a model that excludes candidate quality would overstate the effect of campaign spending, meaning that the effect will be more positive than it really is. Suppose we observe a positive β̂1 in the model without candidate quality. We should worry that some portion of that positive coefficient is due to the omission of candidate quality from the model.

Appendix

Discussion questions on page 783:


1. Using Table A.4, we see that the probability a standard normal random variable is less
than or equal to 1.64 is 0.95, meaning there is a 95% chance that a normal random
variable will be less than or equal to whatever value is 1.64 standard deviations above
its mean.
2. Using Table A.4, we see that the probability a standard normal random variable is less
than or equal to -1.28 is 0.10, meaning there is a 10% chance that a normal random


variable will be less than or equal to whatever value is 1.28 standard deviations below
its mean.
3. Using Table A.4, we see that the probability a standard normal random variable is less than or equal to 1.28 is 0.90. Because the probability of being above some value is one minus the probability of being below it, there is a 10% chance that a normal random variable will be greater than or equal to whatever number is 1.28 standard deviations above its mean.
4. We need to convert the number −4 to something in terms of standard deviations from the mean. The value −4 is 2 standard deviations below the mean of 0 when the standard deviation is 2. Using Table A.4 we see that the probability a normal random variable with a mean of zero is less (more negative) than 2 standard deviations below its mean is 0.023. In other words, the probability of being less than (−4 − 0)/2 = −2 is 0.023.
5. First, convert −3 to standard deviations above or below the mean. In this case, if the variance is 9, then the standard deviation (the square root of the variance) is 3. Therefore −3 is the same as 1 standard deviation below the mean. From the table in Figure A.4, we see that there is a 0.16 probability a normal variable will be more than 1 standard deviation below its mean. In other words, the probability of being less than (−3 − 0)/√9 = −1 is 0.16.
6. First convert 9 to standard deviations above or below the mean. The standard deviation (the square root of the variance) is 2. The value 9 is (9 − 7.2)/2 = 1.8/2 = 0.9 standard deviations above the mean. The value 0.9 does not appear in Figure A.4, but it is close to 1, and the probability of being less than 1 is 0.84. Therefore a reasonable approximation is in the vicinity of 0.8. The actual value is 0.82 and can be calculated as discussed in the Computing Corner on page 793.
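All of these normal-probability answers can be verified in R with pnorm():

pnorm(1.64)                        # question 1: 0.95
pnorm(-1.28)                       # question 2: 0.10
1 - pnorm(1.28)                    # question 3: 0.10
pnorm(-4, mean = 0, sd = 2)        # question 4: 0.023
pnorm(-3, mean = 0, sd = sqrt(9))  # question 5: 0.16
pnorm(9, mean = 7.2, sd = 2)       # question 6: 0.82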

BIBLIOGRAPHY

Acemoglu, Daron, Simon Johnson, and James A. Robinson. 2001. The Colonial Origins
of Comparative Development: An Empirical Investigation. American Economic Review
91(5): 1369-1401.
Acemoglu, Daron, Simon Johnson, James A. Robinson, and Pierre Yared. 2008. Income and
Democracy. American Economic Review 98(3): 808-842.
Achen, Christopher H. 2002. Toward a new political methodology: Microfoundations and
ART. Annual Review of Political Science 5: 423-450.
Achen, Christopher H. 2000. Why Lagged Dependent Variables Can Suppress the Explana-
tory Power of Other Independent Variables. Manuscript, University of Michigan.
Achen, Christopher H. 1982. Interpreting and Using Regression. Newbury Park, NJ: Sage
Publications.
Albertson, Bethany and Adria Lawrence. 2009. After the Credits Roll: The Long-Term
Effects of Educational Television on Public Knowledge and Attitudes. American Politics
Research 37(2): 275-300.
Alvarez, R. Michael and John Brehm. 1995. American Ambivalence Towards Abortion
Policy: Development of a Heteroskedastic Probit Model of Competing Values. American
Journal of Political Science 39(4): 1055-1082.
Anderson, James M., John M. Macdonald, Ricky Bluthenthal, and J. Scott Ashwood. 2013.
Reducing Crime By Shaping the Built Environment With Zoning: An Empirical Study
Of Los Angeles. University of Pennsylvania Law Review 161: 699-756.
Angrist, Joshua and Alan Krueger. 1991. Does Compulsory School Attendance Affect
Schooling and Earnings? Quarterly Journal of Economics. 106(4): 979-1014.
Angrist, Joshua and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An Em-
piricist’s Companion. Princeton, NJ: Princeton University Press.
Angrist, Joshua and Jörn-Steffen Pischke. 2010. The Credibility Revolution in Empirical
Economics: How Better Research Design is Taking the Con out of Econometrics. Working
Paper 15794 http://www.nber.org/papers/w15794.
Angrist, Joshua. 2006. Instrumental Variables Methods in Experimental Criminological


Research: What, Why and How. Journal of Experimental Criminology 2(1): 23-44.
Angrist, Joshua, Kathryn Graddy, and Guido Imbens. 2000. The Interpretation of Instru-
mental Variables Estimators in Simultaneous Equations Models with an Application to
the Demand for Fish. Review of Economic Studies 67(3): 499-527.
Anscombe, Francis J. 1973. Graphs in Statistical Analysis. American Statistician 27(1):
17-21.
Anzia, Sarah. 2012. The Election Timing Effect: Evidence from a Policy Intervention in
Texas. Quarterly Journal of Political Science 7(3): 209-248.
Arellano, Manuel and Stephen Bond. 1991. Some Tests of Specification for Panel Data.
Review of Economic Studies 58(2): 277-297.
Aron-Dine, Aviva, Liran Einav, and Amy Finkelstein. 2013. The RAND Health Insurance
Experiment, Three Decades Later. Journal of Economic Perspectives 27(1): 197-222
Baicker, Katherine, Sarah Taubman, Heidi Allen, Mira Bernstein, Jonathan Gruber, Joseph
P. Newhouse, Eric Schneider, Bill Wright, Alan Zaslavsky, Amy Finkelstein, and the
Oregon Health Study Group. 2013. The Oregon Experiment - Medicaid’s Effects on
Clinical Outcomes. New England Journal of Medicine 368(18): 1713-1722.
Bailey, Michael A., Jon Mummolo and Hans Noel. 2012. Tea Party Influence: A Story of
Activists and Elites. American Politics Research 40(5): 769-804.
Bailey, Michael A. and Elliott Fullmer. 2011. Balancing in the States, 1978-2009. State
Politics and Policy Quarterly 11(2): 149-167.
Bailey, Michael A. and Clyde Wilcox. 2015. A Two Way Street on Iraq: On the Inter-
actions of Voter Policy Preferences and Presidential Approval. Manuscript, Georgetown
University.
Bailey, Michael A., Daniel J. Hopkins, and Todd Rogers. 2015. Unresponsive and Unper-
suaded: The Unintended Consequences of Voter Persuasion Efforts. Manuscript, George-
town University.
Bailey, Michael A., Anton Strezhnev, and Erik Voeten. 2015. Estimating Dynamic State
Preferences from United Nations Voting Data. Manuscript, Georgetown University.
Baiocchia, Michael, Jing Cheng, and Dylan S. Small. 2014. Tutorial in Biostatistics: Instru-
mental Variable Methods for Causal Inference. Statistics in Medicine 33(13): 2297-2340.
Baltagi, Badi H. 2005. Econometric Analysis of Panel Data, 3rd edition. New York: Wiley.
Banerjee, Abhijit Vinayak and Esther Duflo. 2011. Poor Economics: A Radical Rethinking
of the Way to Fight Global Poverty. Public Affairs.
Bartels, Larry M. 2008. Unequal Democracy: The Political Economy of the New Gilded Age.
Princeton, NJ: Princeton University Press.
Beck, Nathaniel and Jonathan N. Katz. 1996. Nuisance vs. Substance: Specifying and
Estimating Time-Series-Cross-Section Models. Political Analysis 6: 1-36.
Beck, Nathaniel and Jonathan N. Katz. 2011. Modeling Dynamics in Time-Series Cross-
Section Political Economy Data. Annual Review of Political Science 14: 331-352.


Beck, Nathaniel. 2010. Making Regression and Related Output More Helpful to Users. The
Political Methodologist 18(1): 4-9.
Berk, Richard A., Alec Campbell, Ruth Klap, and Bruce Western. 1992. The Deterrent
Effect of Arrest in Incidents of Domestic Violence: A Bayesian Analysis of Four Field
Experiments. American Sociological Review 57(5): 698-708.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. 2004. How Much Should We
Trust Differences-In-Differences Estimates? Quarterly Journal of Economics 119(1): 249-
275.
Bertrand, Marianne and Sendhil Mullainathan. 2004. Are Emily and Greg More Employable
than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. American
Economic Review 94(4): 991-1013.
Blinder, Alan S. and Mark W. Watson. 2013. Presidents and the Economy: A Forensic
Investigation. Manuscript, Princeton University.
Bloom, Howard S. 2012. Modern Regression Discontinuity Analysis. Journal of Research on
Educational Effectiveness 5(1): 43-82.
Bound, John, David Jaeger, and Regina Baker. 1995. Problems with Instrumental Vari-
ables Estimation When the Correlation Between the Instruments and the Endogenous
Explanatory Variable is Weak. Journal of the American Statistical Association 90(430):
443-450.
Box, George E.P. 1976. Science and Statistics. Journal of the American Statistical Associa-
tion 71(356): 791-799.
Box-Steffensmeier, Janet M. and Bradford S. Jones. 2004. Event History Modeling: A Guide
for Social Scientists. Cambridge, England: Cambridge University Press.
Box-Steffensmeier, Janet M., John R. Freeman, Matthew P. Hitt and Jon C. W. Pevehouse.
2014. Time Series Analysis for the Social Sciences. Cambridge, England: Cambridge
University Press.
Bradford-Hill, Austin. 1965. The Environment and Disease: Association or Causation?
Proceedings of the Royal Society of Medicine. 58(5): 295-300.
Brambor, Thomas, William Roberts Clark, and Matt Golder. 2006. Understanding Interac-
tion Models: Improving Empirical Analyses. Political Analysis 14: 63-82.
Braumoeller, Bear F. 2004. Hypothesis Testing and Multiplicative Interaction Terms. In-
ternational Organization 58(4): 807-820.
Brown, Peter C., Henry L. Roediger III, and Mark A. McDaniel. 2014. Making it Stick: the
Science of Successful Learning. Cambridge, MA: Harvard University Press.
Brownlee, Shannon and Jeanne Lenzer. 2009. Does the Vaccine Matter? The Atlantic
November. www.theatlantic.com/doc/200911/brownlee-h1n1/2
Buckles, Kasey and Dan Hungerman. 2013. Season of Birth and Later Outcomes: Old
Questions, New Answers. The Review of Economics and Statistics 95(3): 711-724.
Buddlemeyer, Hielke and Emmanuel. Skofias. 2003. An Evaluation on the Performance of


Regression Discontinuity Design on PROGRESA. Institute for Study of Labor, Discussion


Paper No. 827.
Burde, Dana and Leigh L. Linden. 2013. Bringing Education to Afghan Girls: A Ran-
domized Controlled Trial of Village-Based Schools. American Economic Journal: Applied
Economics 5(3): 27-40.
Burtless, Gary, 1995. The Case for Randomized Field Trials in Economic and Policy Re-
search. Journal of Economic Perspectives 9(2): 63-84.
Campbell, James E. 2011. The Economic Records of the Presidents: Party Differences and
Inherited Economic Conditions. Forum 9(1): 1-29.
Card, David. 1990. The Impact of the Mariel Boatlift on the Miami Labor Market. Industrial
and Labor Relations Review 43(2): 245-257.
Card, David. 1999. The Causal Effect of Education on Earnings. in Handbook of Labor
Economics Volume 3, Edited by O. Ashenfelter and D. Card. Amsterdam: Elsevier
Science.
Card, David, Carlos Dobkin, and Nicole Maestas. 2009. Does Medicare Save Lives? The
Quarterly Journal of Economics 124(2): 597-636.
Carrell, Scott E., Mark Hoekstra, and James E. West. 2010. Does Drinking Impair College
Performance? Evidence from a Regression Discontinuity Approach. NBER Working
Paper No. 16330.
Carroll, Royce, Jeffrey B. Lewis, James Lo, Keith T. Poole, and Howard Rosenthal. 2009.
Measuring Bias and Uncertainty in DW-NOMINATE Ideal Point Estimates via the Para-
metric Bootstrap. Political Analysis 17: 261-27. Updated at http://voteview.com/
dwnominate.asp.
Carroll, Royce, Jeffrey B. Lewis, James Lo, Keith T. Poole, and Howard Rosenthal. 2014.
DW-NOMINATE Scores With Bootstrapped Standard Errors. Updated 17 February 2013
at http://voteview.com/dwnominate.asp.
Cellini, Stephanie Riegg, Fernando Ferreira, and Jesse Rothstein. 2010. The Value of School
Facility Investments: Evidence from a Dynamic Regression Discontinuity Design. Quar-
terly Journal of Economics 125(1): 215-261.
Chakraborty, Indraneel, Hans A. Holter, and Serhiy Stepanchuk. 2012. Marriage Stability,
Taxation, and Aggregate Labor Supply in the U.S. vs. Europe. Uppsala University
Working Paper 2012:10.
Chen, Xiao, Philip B. Ender, Michael Mitchell, and Christine Wells. 2003. Regression with
Stata, from http://www.ats.ucla.edu/stat/stata/webbooks/reg/default.htm.
Cheng, Cheng and Mark Hoekstra. 2013. Does Strengthening Self-Defense Law Deter Crime
or Escalate Violence? Evidence from Castle Doctrine. Journal of Human Resources.
48(3): 821-854.
Clark, William Roberts and Arel-Bundock, Vincent. 2013. Independent but Not Indifferent:
Partisan Bias in Monetary Policy at the Fed. Economics and Politics 25(1): 1-26.


Clarke, Kevin A. 2005. The Phantom Menace: Omitted Variable Bias in Econometric Re-
search. Conflict Management and Peace Science 22(4): 341-352. [http://www.rochester.
edu/college/psc/clarke/CMPSOmit.pdf]
Comiskey, Michael and Lawrence C. Marsh. 2012. Presidents, Parties, and the Business
Cycle, 1949-2009. Presidential Studies Quarterly 42(1): 40-59.
Cook, Thomas. 2008. Waiting for Life to Arrive: A history of the Regression Discontinuity
Design in Psychology, Statistics and Economics. Journal of Econometrics 142(2): 636-
654.
Currie, Janet and Jonathan Gruber. 1996. Saving Babies: The Efficacy and Cost of Recent
Changes in the Medicaid Eligibility of Pregnant Women. Journal of Political Economy
104(6): 1263-1296.
Cragg, John G. 1994. Making Good inferences from Bad Data. Canadian Journal of Eco-
nomics 27(4): 776-800.
Das, Mitali, Whitney K. Newey, and Francis Vella. 2003. Nonparametric Estimation of
Sample Selection Models. The Review of Economic Studies 70(1): 33-58.
De Boef, Suzanna and Luke Keele. 2008. Taking Time Seriously. American Journal of
Political Science 52(1): 184-200.
DiazGranados, Carlos A., Martine Denis, Stanley Plotkin. 2012. Seasonal Influenza Vaccine
Efficacy and Its Determinants in Children and Non-elderly Adults: A Systematic Review
with Meta-analyses of Controlled Trials. Vaccine 31(1): 49-57.
Duflo, Esther, Rachel Glennerster, and Michael Kremer. 2008. Using Randomization in
Development Economics Research: A Toolkit. In T. Schultz and John Strauss, eds.,
Handbook of Development Economics Vol. 4. Amsterdam and New York: North Holland.
Dunning, Thad. 2012. Natural Experiments in the Social Sciences: A Design-Based Ap-
proach. Cambridge, England: Cambridge University Press.
Drum, Kevin. 2013a. America’s Real Criminal Element: Lead - New Research Finds Pb
is the Hidden Villain Behind Violent Crime, Lower IQs, and Even the ADHD Epidemic.
Mother Jones January/February.
Drum, Kevin. 2013b. Crime Is at its Lowest Level in 50 Years. A Simple Molecule May Be
the Reason Why. At Mother Jones.com blog, January 3 at http://www.motherjones.
com/kevin-drum/2013/01/lead-crime-connection.
Drum, Kevin. 2013c. Lead and Crime: A Response to Jim Manzi. At Mother Jones.com
blog, January 12 at http://www.motherjones.com/kevin-drum/2013/01
/lead-and-crime-response-jim-manzi.
Dynarski, Susan. 2000. Hope for Whom? Financial Aid for the Middle Class and Its Impact
on College Attendance. National Tax Journal 53 (3, part 2): 629- 662.
Elwert, Felix and Christopher Winship. 2014. Endogenous Selection Bias: The Problem of
Conditioning on a Collider Variable. Annual Review of Sociology. 40(1): 31-53.
Erikson, Robert S. and Thomas R. Palfrey. 2000. Equilibrium in Campaign Spending


Games: Theory and Data. American Political Science Review 94(3): 595-610.
Fearon, James D. and David D. Laitin. 2003. Ethnicity, Insurgency, and Civil War. Ameri-
can Political Science Review 97(1): 75-90.
Finkelstein, Amy, Sarah Taubman, Bill Wright, Mira Bernstein, Jonathan Gruber, Joseph P.
Newhouse, Heidi Allen, Katherine Baicker, and the Oregon Health Study Group. 2012.
The Oregon Health Insurance Experiment: Evidence from the First Year. Quarterly
Journal of Economics 127(3): 1057-1106.
Gaubatz, Kurt Taylor. 2015. A Survivor’s Guide to R: An Introduction for the Uninitiated
and the Unnerved. Los Angeles: Sage.
Gerber, Alan S. and Donald P. Green. 2012. Field Experiments: Design, Analysis, and
Interpretation. New York: W.W. Norton & Company.
Gerber, Alan S. and Donald P. Green. 2000. The Effects of Canvassing, Telephone Calls
and Direct Mail on Voter Turnout: A Field Experiment. The American Political Science
Review 94(3): 653-663.
Gerber, Alan S., and Donald P. Green. 2005. Correction to Gerber and Green (2000),
Replication of Disputed Findings, and Reply to Imai (2005). American Political Science
Review 99(2): 301-13.
Gertler, Paul. 2004. Do Conditional Cash Transfers Improve Child Health? Evidence from
PROGRESA’s Control Randomized Experiment. American Economic Review 94(2): 336-
41.
Gimpel, James G, Francis E. Lee, and Rebecca U. Thorpe. 2010. The Distributive Politics
of the Federal Stimulus: The Geography of the American Recovery and Reinvestment Act
of 2009. Paper presented at American Political Science Association Meetings.
Goldberger, Arthur S. 1991. A Course in Econometrics. Cambridge, Massachusetts: Har-
vard University Press.
Gormley, William T., Jr., Deborah Phillips, and Ted Gayer. 2008. Preschool Programs Can
Boost School Readiness. Science 320 (5884): 1723-24.
Green, Donald P., Soo Yeon Kim, and David H. Yoon. 2001. Dirty Pool. International
Organization 55(2): 441-468.
Green, Joshua. 2012. The Science Behind Those Obama Campaign E-Mails. Business Week
(November 29). Accessed from http://www.businessweek.com/articles/2012-11-29/
the-science-behind-those-obama-campaign-e-mails.
Greene, William. 2003. Econometric Analysis. New York: Prentice Hall.
Greene, William. 2008. Econometric Analysis. New York: Prentice Hall.
Grimmer, Justin, Eitan Hersh, Brian Feinstein, and Daniel Carpenter. 2010. Are Close
Elections Randomly Determined? Manuscript, Stanford University.
Hanmer, Michael J. and Kerem Ozan Kalkan. 2013. Behind the Curve: Clarifying the
Best Approach to Calculating Predicted Probabilities and Marginal Effects from Limited
Dependent Variable Models. American Journal of Political Science 57(1): 263-277.


Hanushek, Eric A. and Ludger Woessmann. 2009. Do Better Schools Lead to More Growth?
Cognitive Skills, Economic Outcomes, and Causation. NBER Working Paper 14633.
Harvey, Anna. 2011. What’s So Great About Independent Courts? Rethinking Crossna-
tional Studies of Judicial Independence. Manuscript, New York University. Available at
http://politics.as.nyu.edu/docs/IO/2787/HarveyJI.pdf.
Hausman, Jerry A. and William E. Taylor. 1981. Panel Data and Unobservable Individual
Effects. Econometrica 49(6): 1377-1398.
Heckman, James J. 1979. Sample Selection Bias as a Specification Error. Econometrica
47(1): 153-161.
Herndon, Thomas, Michael Ash, and Robert Pollin. 2014. Does high public debt consis-
tently stifle economic growth? A critique of Reinhart and Rogoff. Cambridge Journal of
Economics 38(2): 257-279.
Howell, William G. and Paul E. Peterson. 2004. The Use of Theory in Randomized Field
Trials: Lessons from School Voucher Research on Disaggregation, Missing Data, and the
Generalization of Findings. The American Behavioral Scientist 47(5): 634-657.
Imai, Kosuke. 2005. Do Get-Out-The-Vote Calls Reduce Turnout? The Importance of
Statistical Methods for Field Experiments. American Political Science Review 99(2):
283-300.
Imai, Kosuke, Gary King, and Elizabeth A. Stuart. 2008. Misunderstandings among Exper-
imentalists and Observationalists about Causal Inference. Journal of the Royal Statistical
Society, Series A (Statistics in Society) 171(2): 481-502.
Imbens, Guido W. 2014. Instrumental Variables: An Econometrician’s Perspective. IZA
Discussion Paper No. 8048.
Imbens, Guido W. and Thomas Lemieux. 2008. Regression Discontinuity Designs: A Guide
to Practice. Journal of Econometrics 142(2): 615-635.
Iqbal, Zaryab and Christopher Zorn. 2008. The Political Consequences of Assassination,
Journal of Conflict Resolution 52(3): 385-400.
Jackman, Simon. 2009. Bayesian Analysis for the Social Sciences. New York: Wiley.
Jacobsmeier, Matthew L. and Daniel G. Lewis. 2013. Barking up the Wrong Tree: Why Bo
Didn’t Fetch Many Votes for Barack Obama in 2012. PS 46(1): 49-59.
Jacobson, Gary C. 1978. Effects of Campaign Spending in Congressional Elections. Ameri-
can Political Science Review 72(2): 469-491.
Kalla, Joshua L. and David E. Broockman. 2014. Congressional Officials Grant Access
Due To Campaign Contributions: A Randomized Field Experiment. Manuscript, Yale
University.
Kam, Cindy D. and Robert J. Franceze, Jr. 2007. Modeling and Interpreting Interactive
Hypotheses in Regression Analysis. Ann Arbor, MI: University of Michigan Press.
Kastellec, Jonathan P. and Eduardo L. Leoni. 2007. Using Graphs Instead of Tables in
Political Science. Perspectives on Politics 5(4): 755-771.


Keele, Luke and David Park. 2006. Difficult Choices: An Evaluation of Heterogenous Choice
Models. Manuscript, Ohio State University.
Keele, Luke and Nathan J. Kelly. 2006. Dynamic Models for Dynamic Theories: The Ins
and Outs of Lagged Dependent Variables. Political Analysis 14: 186-205.
Kennedy, Peter. 2008. A Guide to Econometrics, 6th edition. Malden, MA: Blackwell
Publishing.
Khimm, Suzy. 2010. Who Is Alvin Greene? Mother Jones Jun. 8 accessed at http:
//motherjones.com/mojo/2010/06/alvin-greene-south-carolina.
King, Gary. 1991. Truth is Stranger than Prediction, More Questionable than Causal
Inference. American Journal of Political Science 35(4): 1047-1053.
King, Gary. 1995. Replication, Replication. PS: Political Science and Politics 28(3): 444-
452.
King, Gary and Langche Zeng. 2001. Logistic Regression in Rare Events Data. Political
Analysis 9: 137-163.
King, Gary, Robert Keohane, and Sidney Verba. 1994. Designing Social Inquiry: Scientific
Inference in Qualitative Research Princeton, NJ: Princeton University Press.
Kiviet, Jan F. 1995. On bias, inconsistency, and efficiency of various estimators in dynamic
panel data models. Journal of Econometrics 68(1): 53-78.
Klick, Jonathan and Alexander Tabarrok. 2005. Using Terror Alert Levels to Estimate the
Effect of Police on Crime. Journal of Law and Economics 48(1): 267-79.
Koppell, Jonathan G. S. and Jennifer A. Steen. 2004. The Effects of Ballot Position on
Election Outcomes. The Journal of Politics 66(1): 267-281.
La Porta, Rafael, F. Lopez-de-Silanes, C. Pop-Eleches, and A. Schliefer. 2004. Judicial
Checks and Balances. Journal of Political Economy 112(2): 445-470.
Lee, David S. 2008. Randomized Experiments from Non-random Selection in U.S. House
Elections. Journal of Econometrics 142(2): 675-697.
Lee, David S. 2009. Training, Wages, and Sample Selection: Estimating Sharp Bounds on
Treatment Effects. Review of Economic Studies 76(3): 1071-1102.
Lee, David S. and Thomas Lemieux. 2010. Regression Discontinuity Designs in Economics
Journal of Economic Literature 48(2): 281-355.
Lerman, Amy E. 2009. The People Prisons Make: Effects of Incarceration on Criminal
Psychology. In Do Prisons Make Us Safer, ed. Steve Raphael and Michael Stoll. New
York: Russell Sage Foundation.
Levitt, Steven D. 1997. Using Electoral Cycles in Police Hiring to Estimate the Effect of
Police on Crime. American Economic Review 87(3): 270-290.
Levitt, Steven D. 2002. Using Electoral Cycles in Police Hiring to Estimate the Effect of
Police on Crime: A Reply. American Economic Review 92(4): 1244-250.
Lochner, Lance, and Enrico Moretti. 2004. The Effect of Education on Crime: Evidence
from Prison Inmates, Arrests, and Self-Reports. American Economic Review 94(1): 155-189.
Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables.
London: Sage Publications.
Lorch, Scott A., Michael Baiocchi, Corinne S. Ahlberg, and Dylan E. Small. 2012. The
Differential Impact of Delivery Hospital on the Outcomes of Premature Infants. Pediatrics
130(2): 270-278.
Ludwig, Jens and Douglass L. Miller. 2007. Does Head Start Improve Children’s Life
Chances? Evidence from a Regression Discontinuity Design. The Quarterly Journal of
Economics 122(1): 159-208.
Lumley, Thomas, Paula Diehr, Scott Emerson, and Lu Chen. 2002. The Importance of the
Normality Assumption in Large Public Health Data Sets. Annual Review of Public Health
23: 151-69.
Madestam, Andreas, Daniel Shoag, Stan Veuger, and David Yanagizawa-Drott. 2013. Do
Political Protests Matter? Evidence from the Tea Party Movement. June 29 version from
http://www.hks.harvard.edu/fs/dyanagi/Research/TeaParty_Protests.pdf.
Makowsky, Michael and Thomas Stratmann. 2009. Political Economy at Any Speed: What
Determines Traffic Citations? The American Economic Review 99(1): 509-527.
Malkiel, Burton G. 2003. A Random Walk Down Wall Street: The Time-Tested Strategy for
Successful Investing. New York: W.W. Norton.
Manning, Willard G., Joseph P. Newhouse, Naihua Duan, Emmett B. Keeler, and Arleen
Leibowitz. 1987. Health Insurance and the Demand for Medical Care: Evidence from a
Randomized Experiment. American Economic Review 77(3): 251-277.
Manzi, Jim. 2012. Uncontrolled: The Surprising Payoff of Trial-and-Error for Business,
Politics and Society. Basic Books.
Marvell, Thomas B and Carlisle E. Moody. 1996. Specification Problems, Police Levels and
Crime Rates. Criminology 34(4): 609-646.
McClellan, Chandler B. and Erdal Tekin. 2012. Stand Your Ground Laws and Homicides.
NBER Working Paper 18187.
McCrary, Justin. 2002. Using Electoral Cycles in Police Hiring to Estimate the Effect of
Policeon Crime: Comment. The American Economic Review 92(4): 1236-1243.
McCrary, Justin. 2008. Manipulation of the Running Variable in the Regression Disconti-
nuity Design: A Density Test. Journal of Econometrics 142(2): 698-714.
Miguel, Edward and Michael Kremer. 2004. Worms: Identifying Impacts on Education and
Health in the Presence of Treatment Externalities. Econometrica 72(1): 159-217.
Miguel, Edward, Shanker Satyanath, and Ernest Sergenti. 2004. Economic Shocks and Civil
Conflict: An Instrumental Variables Approach. Journal of Political Economy 112(4):
725-753.
Morgan, Stephen L. and Christopher Winship. 2014. Counterfactuals and Causal Infer-
ence: Methods and Principles for Social Research. Second edition. Cambridge, England:


Cambridge University Press.


Murnane, Richard J. and John B. Willett. 2011. Methods Matter: Improving Causal In-
ference in Educational and Social Science Research Oxford, England: Oxford University
Press.
Murray, Michael P. 2006a. Avoiding Invalid Instruments and Coping with Weak Instruments.
Journal of Economic Perspectives 20(4): 111-132.
Murray, Michael P. 2006b. Econometrics: A Modern Introduction. Boston: Pearson Addison
Wesley.
Mutz, Diana C. 2010. The Dog that Didn’t Bark: The Role of Canines in the 2008 Campaign.
PS 43(4): 707-712.
NASA. 2012. Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies
(Land-Ocean Temperature Index, LOTI) Global-mean monthly, seasonal, and annual
means, 1880-present, updated through most recent month. Accessed at http://data.
giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts+dSST.txt
National Center for Addiction and Substance Abuse at Columbia University. 2011. National
Survey of American Attitudes on Substance Abuse XVI: Teens and Parents (August). Ac-
cessed at www.casacolumbia.org/download.aspx?path=/UploadedFiles/ooc3hqnl.pdf
on November 10, 2011.
Newhouse, Joseph. 1993. Free for All? Lessons from the RAND Health Insurance Experi-
ment Cambridge, MA: Harvard University Press.
Nevin, Rick. 2013. Lead and Crime: Why this correlation does mean causation. Pub-
lished online on January 26 at http://ricknevin.com/uploads/Lead_and_Crime_-_
Why_This_Correlation_Does_Mean_Causation.pdf
Noel, Hans. 2010. Ten Things Political Scientists Know that You Don’t. The Forum 8(3):
article 12.
Orwell, George. 1946. In Front of Your Nose. Tribune. London (March 22).
Osterholm, Michael T., Nicholas S. Kelley, Alfred Sommer, and Edward A. Belongia. 2012.
Efficacy and Effectiveness of Influenza Vaccines: A Systematic Review and Meta-analysis.
The Lancet Infectious Diseases 12(1): 36-44.
Palmer, Brian. 2013. I Wish I Was a Little Bit Shorter. Slate. Posted July 30. http://www.
slate.com/articles/health_and_science/science/2013/07/height_and_longevity_
the_research_is_clear_being_tall_is_hazardous_to_your.html
Parker, Jonathan A., Nicholas S. Souleles, David S. Johnson, and Robert McClelland. 2011.
Consumer Spending and the Economic Stimulus Payments of 2008. NBER Working Paper
Series, Vol. w16684. Available at http://ssrn.com/abstract=1740313.
Persico, Nicola, Andrew Postlewaite, and Dan Silverman. 2004. The Effect of Adolescent
Experience on Labor Market Outcomes: The Case of Height. Journal of Political Economy
112(5): 1019-53.
Pierskalla, Jan H. and Florian M. Hollenbach. 2013. Technology and Collective Action: The


Effect of Cell Phone Coverage on Political Violence in Africa. American Political Science
Review. 107(2): 207-224
Poole, Keith and Howard Rosenthal. 1997. Congress: A Political-Economic History of Roll
Call Voting. Oxford: Oxford University Press.
Reinhart, Carmen M. and Kenneth S. Rogoff. 2010. Growth in a Time of Debt. American
Economic Review: Papers & Proceedings 100(2): 573-578.
Reyes, Jessica Wolpaw. 2007. Environmental Policy as Social Policy? The Impact of
Childhood Lead Exposure on Crime. NBER Working Paper 13097
Rice, John A. 2007. Mathematical Statistics and Data Analysis. Cengage Learning.
Roach, Michael A. 2013. Mean Reversion or a Breath of Fresh Air? The Effect of NFL
Coaching Changes on Team Performance in the Salary Cap Era. Applied Economics
Letters 20(17): 1553-1556.
Romer, Christina D. 2011. What Do We Know about the Effects of Fiscal Policy? Separating
Evidence from Ideology. Talk at Hamilton College (November 7) available at http:
//elsa.berkeley.edu/˜cromer/WrittenVersionofEffectsofFiscalPolicy.pdf.
Rossin-Slater, Maya, Christopher J. Ruhm, and Jane Waldfogel. 2013. The Effects of
California's Paid Family Leave Program on Mothers' Leave-Taking and Subsequent Labor
Market Outcomes. Journal of Policy Analysis and Management 32(2): 224-245.
Schrodt, Phil. 2010. Seven Deadly Sins of Contemporary Quantitative Political Science.
Paper presented at American Political Science Association Meetings.
Shiner, Meredith 2010. Alvin Greene: Born to Be President. POLITICO (November 17) ac-
cessed at http://www.politico.com/news/stories/1110/45268.html#ixzz1jLDAM0uo
Sides, John and Lynn Vavreck. 2013. The Gamble: Choice and Chance in the 2012 Presi-
dential Election. Princeton, NJ: Princeton University Press.
Snipes, Jeffrey B. and Edward R. Maguire. 1995. Country Music, Suicide, and Spuriousness.
Social Forces 74(1): 327-329.
Solnick, Sara J. and David Hemenway. 2011. The ‘Twinkie Defense’: the relationship
between carbonated non-diet soft drinks and violence perpetration among Boston high
school students. Injury Prevention 2011-040117.
Sovey, Allison J. and Donald P. Green. 2011. Instrumental Variables Estimation in Political
Science: A Reader’s Guide. American Journal of Political Science 55(1): 188-200.
Stack, Steven and Jim Gundlach. 1992. The Effect of Country Music on Suicide. Social
Forces 71(1): 211-218.
Staiger, Douglas and James H. Stock. 1997. Instrumental Variables Regressions with Weak
Instruments. Econometrica. 65(3): 557-86.
Stock, James H and Mark W. Watson. 2011. Introduction to Econometrics. Third edition.
Boston: Addison-Wesley.
Swirl. 2014. Swirl: statistics with interactive R learning. [computer software package]
available at http://swirlstats.com/index.html.


Schwabish, Jonathan A. 2004. An Economist’s Guide to Visualizing Data. Journal of


Economic Perspectives 28(1): 209-234.
Tam Cho, Wendy K. and James G. Gimpel. 2012. Geographic Information Systems and
the Spatial Dimensions of American Politics. Annual Review of Political Science. 15:
443-460.
Tufte, Edward R. 2001. The Visual Display of Quantitative Information. 2nd edition.
Graphics Press.
University of Michigan. Center for Political Studies. National Election Studies. American
National Election Studies, 2000, 2002, and 2004: Full Panel Study. ICPSR21500-v1. Ann
Arbor, MI: Inter-university Consortium for Political and Social Research [distributor],
2009-01-30. http://doi.org/10.3886/ICPSR21500.v1.
Venables, W. N. and B.D. Ripley. 2002. Modern Applied Statistics with S. Fourth edition.
New York: Springer.
Verzani, John. 2004. Using R for Introductory Statistics. Chapman and Hall.
Wawro, Greg. 2002. Estimating Dynamic Models in Political Science. Political Analysis 10:
25-48.
Wilson, Sven E. and Daniel M. Butler. 2007. A Lot More to Do: The Sensitivity of Time-
Series Cross Section Analyses to Simple Alternative Specifications. Political Analysis 15:
101-123.
Wooldridge, Jeffrey M. 2002. Econometric Analysis of Cross Section and Panel Data. Cam-
bridge MA: MIT Press.
Wooldridge, Jeffrey M. 2009. Introductory Econometrics. Fourth edition. South-Western
Cengage Learning.
World Values Survey. 2008. Integrated EVS/WVS 1981-2008 Data File.
http://www.worldvaluessurvey.org/.
Yau, Nathan. 2011. Visualize This: The Flowing Data Guide to Design, Visualization, and
Statistics. New York: Wiley.
Zakir Hossain, Mohammad. 2011. The Use of Box-Cox Transformation Technique in Eco-
nomic and Statistical Analyses. Journal of Emerging Trends in Economics and Manage-
ment Sciences 2(1): 32-39.
Ziliak, Stephen and Deirdre N. McCloskey. 2008. The Cult of Statistical Significance: How
the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor, MI: University of
Michigan Press.

INDEX

R², 106 Bond, Kanisha, 764


adjusted, 231 Box-Steffensmeier, Janet, 704
χ² distribution, 783 Bradford-Hill criteria, 761
2SLS, 437 Burtless, Gary, 795
weak instruments, 450
Calvo, Ernesto, 764
Achen, Chris, 43, 743, 761 Campaign spending, 30
Adelstein, Shirley, 764 Carnegie, Allison, 764
Albertson, Bethany, 473 Chinchilla
Angrist, Josh, 435 on caffeine, 241
Attrition, 519 Clark, Bill, 764
Autocorrelation Confidence intervals, 177, 201
ρ transformed, 670
Cochrane-Orcutt, 673 Correlated errors, 104
detection of, 669 Correlation, 17, 771
Prais-Winsten, 673 Country music, 25
Autoregressive model, 664 Covariance, 770
ρ-transform, 674 Croco, Sarah, 764
Auxiliary regression, 226 Cumulative distribution function (CDF), 608
Balance, 487 Data
Banerjee, Abhijit, 533 baseball, 336, 348
Banks, Antoine, 764 civil wars, 638
Bartels, Brandon, 764 drinking age and grades, 577
Beck, Neal, 124, 747 education and economic growth, 215
Blocking, 482 emergency care for newborns, 439
Bloom, Howard, 829 global temperature, 326
Blum, Rachel, 764 height and wages, 112, 199, 334, 352


human rights, 235 Experiments, 37


Iraq War, 461 attrition, 515
judicial independence, 235 balancing tests, 481
life expectancy, 317 ethics, 34
NICUs, 439 feasibility, 34
Obama, 262 flu shots, 34
pets and politics, 626, 633 health insurance, 520
police and crime, 369, 393, 426 instrumental variables, 500
presidential elections, 65 intention-to-treat, 496
retail sales, 195 Mexican health care, 489
sex and height, 267 non-compliance, 493
Soccer, 255, 272 school voucher, 493, 515
stand your ground laws, 401 Experiments, 2SLS, 507
Tea Party, 283 Experiments, natural, 527
trade and alliances, 397
universal pre-kindergarten, 566 F distribution, 786
wages and region, 276 F test, 351
De-meaned approach, 380 Ferrets
Difference-in-difference, 409 sleeping habits of, 79
Donuts, 4 Fitted value, 71
Drum, Kevin, 208 Fixed effects, 385, 391
Duflo, Esther, 533 two-way, 396
Dunning, Thad, 533 Flu shots, 21
Durbin-Wu-Hausman test, 435 Franzese, Robert, 292, 801
Durbin-Watson test, 668 Freeman, John, 704
Dynamic time series models, 683
Gale, Ian, 764
Economic growth, and education, 215 Gaubatz, Kurt, 56
Education, returns to, 29 Gayer, Ted, 566
Education, test scores, 215 Gerber, Alan, 502, 533
Education, years of schooling, 215 Gerbils
Elections, 283 on methamphetamines, 241
Equation
β̂0 in bivariate OLS, 73 Gimpel, James, 810
β̂1 in bivariate OLS, 72 Gormley, William, 566
σ², 96 Green, Don, 469, 502, 533
var(β̂) in 2SLS, 453 Hanmer, Michael, 643, 764
var(β̂) in multivariate OLS, 225, 453 Hanushek, Eric, 215
var(β̂1) in bivariate OLS, 95 Harvey, Anna, 764


Hausman test, 435 Klick, Jonathan, 528


Hayes, Danny, 764 Koppell, Jonathan , 526
Henderson, Daniel, 764 Krueger, Alan, 435
Heteroscedasticity, 103
Heteroscedasticity-consistent standard errors, Ladd, Jon, 764
103 LATE, local average treatment effect, 468,
Hill, Carolyn, 764 575
Hines, Mark, 764 Latent variables, 605
Hitt, Matthew, 704 Lawrence, Adria, 473
Hopkins, Dan, 764 Lawrence, Eric, 764
Horowitz, Jeremy, 764 Least squares dummy variable approach, 378
Hypothesis testing, 146 Lemieux, Thomas, 829
Leoni, Eduardo L., 246
Imai, Kosuke, 484, 534 Likelihood ratio test, 637
Imbens, Guido, 469, 829 Linear probability model, 600
Instrumental variables, 429, 437 Logged variables, 336
bias of estimates, 452 Logit model, 612
conditions for, 437 Ludwig, Jens, 588, 764
use in experiments, 507
Manchester City Football Club, 255, 272
variance, 455
Manchester United, 255, 272
Intention-to-treat models, 496
Interpreting coefficients Manzi, Jim, 533
Maximum likelihood estimation (MLE), 612,
dummy variables, 259
617, 804
interaction variables, 295
nominal variables, 279 McCloskey, Deirdre, 184
panel models, 382 Methamphetamines, 241
Miller, Douglass, 588
ITT, 500
ITT models, 496 Model-fishing, 241
Morris, Irwin, 764
Jackman, Simon, 184 Murray, Michael, 468
Jittered data, 112, 127, 129, 249, 264 Musgrave, Paul, 764
Joe, Wesley, 764
Natural experiments, 524
Kalkan, Kerem Ozan, 643 fiscal policy, 527
Kam, Cindy, 292, 801 Nesva, Sheeva, 764
Karin Kitchens, 764 New York City school voucher study, 493, 515
Kastellec, Jonathan P., 246 Noel, Hans, 65, 764
Katz, Jonathan, 747 Null hypothesis, 146
Keele, Luke, 764 Null results, 173
King, Gary, 56, 242, 246, 534 OLS


Congress election example, 283 Probit model, 612


consistency, 99 Probit model, interpreting coefficients, 624
control variables, 203 Probit, estimation, 612
deriving estimator, 718 Progresa program, 489
deriving omitted variable bias conditions, balance tests, 489
725
deriving variance, 721 Quinn, Dennis, 764
estimation, 206
R, installing packages, 130
fitted value, 71
RAND study, 520
fitted values, 95
Random effects, 750
measurement error, 220, 732 Random variables, 80
modeled randomness, 80
Regression discontinuity, 552
multivariate, 203
binned graphs, 562
omitted variable bias, 209, 721 flexible models, 559, 564
omitted variable bias with multiple vari-
steps in analysis, 576
ables, 729
window size, 561, 564
precision of estimates, 93, 224
Replication, 46, 50, 55, 243
randomness of coefficient estimates, 79
Residual, 71
residual, 71
sampling randomness, 79 Schone, Barbara, 764
sign of omitted variable bias, 725 Schrodt, Phil, 240
Omitted variable bias Sheep, 136
signing the bias, 727 Sides, John, 764
One-sided hypothesis, 146 Significance, substantive, 177
Outliers, 117, 122 Simpsons, 5
Overidentification tests, 447 Simultaneous equation models, 455
Simultaneous equations, 460
p-value, 165
Soft drinks, 29
Panel data
AR(1) models, 738 Sovey, Allison, 469
random effects, 748 Stand your ground laws, 408
Standard error of regression, 107
Peterson, David, 764
Pettingill, Lindsay, 764 Standardized coefficients, 341
Pevehouse, Jon, 704 Stationarity, 694
Steen, Jennifer, 526
Phillips, Deborah, 566
Placebo tests, 760 Stuart, Elizabeth, 534
Pooled data model, 376 Suicide, 25
Power, 168, 174, 485 t distribution, 785
Power curves, 171 t statistic, 160
Probability distribution, 81


Tabarrok, Alexander, 528


Tam-Cho, Wendy, 764, 810
Tea Party, 283
Two-sided hypothesis, 146
Type II errors, 166

Vella, Frank, 533


Voeten, Erik, 764
Voteview, 800
Way, Chris, 764
Woessmann, Ludger, 215
Ziliak, Stephen, 184

GLOSSARY

χ² distribution A probability distribution that characterizes the distribution of a squared
standard normal random variable. Standard errors are distributed according to this
distribution, which means that the χ² distribution plays a role in the t distribution. Also relevant
for many statistical tests, including likelihood ratio tests for MLE. 783
2SLS See two-stage least squares. 425

ABC issues Three issues that every experiment needs to address: attrition, balance, and
compliance. 481
adjusted R² The R² with a penalty for the number of variables included in the model.
Widely reported, but rarely useful. 231
alternative hypothesis An alternative hypothesis is what we accept if we reject the null.
It’s not something that we are proving (given inherent statistical uncertainty) but it is
the idea we hang onto if we reject the null. 140
AR(1) model An autoregressive model in which the dependent variable depends on its
value in the previous period. Contrasted to, for example, an AR(2) model, which
includes the value from the previous period and the value from the period before that.
AR(1) models are often used to model correlated errors in time series data. 664
assignment variable An assignment variable is relevant in regression discontinuity analy-
sis. Such a variable determines whether or not someone receives some treatment. People
with values of the assignment variable above some cutoff receive the treatment; people
with values of the assignment variable less than the cutoff do not receive the treatment.
545


attenuation bias A form of bias in which the estimated coefficient is closer to zero than it
should be. Measurement error in the independent variable causes attenuation bias. 223
attrition Attrition occurs when people drop out of an experiment altogether such that we
do not observe the dependent variable for them. 515
augmented Dickey-Fuller test A test for a unit root in time series data that includes a
time trend and lagged values of the change in the variable as independent variables.
692
autocorrelation Errors are autocorrelated if the error from one observation is correlated
with the error of another. One of the assumptions necessary to use the standard equation
for variance of OLS estimates is that errors are not autocorrelated. Autocorrelation is
common in time series data. 105
autoregressive model A time series model in which the dependent variable is a function
of previous values of the dependent variable. Autocorrelation is often modeled with
an autoregressive model such that the error term is a function of previous error terms.
Dynamic models are also autoregressive models in that the dependent variable depends
on lagged values of the dependent variable. 660
auxiliary regression An auxiliary regression is a regression that is not directly the one
of interest, but is related and yields information helpful in analyzing the equation we
really care about. 210

balance In experiments, treatment and control groups are balanced if the distributions of
variables are the same for the treatment and control groups. 484
bias A biased coefficient estimate will systematically be higher or lower than the true value.
87
binned graphs Binned graphs are used in regression discontinuity analysis. The assign-
ment variable is divided into bins and for each bin, the average value of the dependent
variable is plotted. These are useful to visualize a discontinuity at the treatment cut-
off. Binned graphs also are useful to identify possible non-linearities in the relationship
between the assignment variable and the dependent variable. 562
blocking Blocking involves picking treatment and control groups so that they are equal in
covariates. 482

categorical variable A variable that has two or more categories, but which does not have
an intrinsic ordering. Also known as a nominal variable. 277
CDF See cumulative distribution function. 608


Central Limit Theorem The mean of a sufficiently large number of independent draws
from any distribution will be normally distributed. Because OLS estimates are weighted
averages, the Central Limit Theorem implies the distribution of β̂1 will be normally
distributed. 84
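
For intuition, a small simulation in R (simulated data only; not an example from the text) shows sample means from a skewed distribution piling up in a bell shape:

    # Means of many samples from a skewed (exponential) distribution
    set.seed(123)
    sample_means <- replicate(5000, mean(rexp(n = 50, rate = 1)))
    hist(sample_means, breaks = 40, main = "Sample means are approximately normal")
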
ceteris paribus All else equal. A phrase used when describing multivariate regression re-
sults as a coefficient is said to account for change in dependent variable with all other
independent variables held constant. 197
codebook A file that describes sources for variables and any adjustments made. A codebook
is necessary element of replication file. 47
compliance A compliance problem occurs when subjects assigned to an experimental treat-
ment do not actually experience the treatment, often because they opt out in some way.
491
confidence interval A confidence interval defines the range of true values that are consis-
tent with the observed coefficient estimate. Confidence intervals depend on the point
estimate, β̂1, and the measure of uncertainty, se(β̂1). 177
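
In R, confint() reports confidence intervals for estimated coefficients; a minimal sketch with simulated data (variable names are illustrative only):

    set.seed(1)
    X <- rnorm(100)
    Y <- 2 + 0.5 * X + rnorm(100)      # true slope is 0.5
    fit <- lm(Y ~ X)
    confint(fit, level = 0.95)         # 95 percent confidence intervals for the intercept and slope
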
confidence levels A term used when referring to confidence intervals, based on 1 − α. 179
consistency A consistent estimator is one for which the distribution of the estimate gets
closer and closer to the true value as the sample size increases. For example, the
bivariate OLS estimate β̂1 consistently estimates β1 if X is uncorrelated with ε. 100
constant The parameter β0 in a regression model. It is the point at which a regression line
crosses the Y-axis. It is the expected value of the dependent variable when all independent
variables equal zero. Also referred to as the intercept. 7
continuous variable A variable that takes on any possible value over some range. Con-
tinuous variables are distinct from discrete variables, which can take on only a limited
number of possible values. 81
control group In an experiment, the group that does not receive the treatment of interest.
32
control variable An independent variable included in a statistical model to control for
some factor that is not the primary factor of interest. 203
correlation Correlation measures the extent to which two variables are linearly related to
each other. A correlation of 1 indicates the variables move together in a straight line.
A correlation of 0 indicates the variables are not linearly related to each other. A
correlation of -1 indicates the variables move in opposite directions. 17


critical value In hypothesis testing, a value above which a β̂1 would be so unlikely as to
lead us to reject the null. 153
cross-sectional data Cross-sectional data has observations for multiple units for one time
period. Each observation indicates the value of a variable for a given unit for the same
point in time. Cross-sectional data is typically contrasted to panel and time series data.
658
cumulative distribution function The cumulative distribution function, or CDF, indi-
cates how much of normal distribution is to the left of any given point. 608

de-meaned approach An approach to estimating fixed effects models for panel data. The
one-way version involves subtracting off average values within units from all variables.
This approach saves us from having to include dummy variables for every unit and
highlights the fact that fixed effects models estimate parameters based on variation
within units, not between them. 380
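
A minimal base-R sketch of the de-meaning idea, using a made-up panel (unit identifiers and coefficients are illustrative): subtracting unit means from Y and X and running OLS on the de-meaned variables gives the same slope as the least squares dummy variable approach.

    set.seed(2)
    unit <- rep(1:10, each = 5)                 # 10 units observed for 5 periods each
    X <- rnorm(50) + unit                       # X varies across and within units
    Y <- 1 + 0.8 * X + unit + rnorm(50)         # unit-specific effects sit in the error term
    Y_dm <- Y - ave(Y, unit)                    # subtract each unit's mean
    X_dm <- X - ave(X, unit)
    coef(lm(Y_dm ~ X_dm))["X_dm"]               # fixed effects estimate of the slope (about 0.8)
    coef(lm(Y ~ X + factor(unit)))["X"]         # LSDV version gives the same slope
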
degrees of freedom The degrees of freedom is the sample size minus the number of param-
eters. It refers to the amount of information we have available to use in the estimation
process. As a practical matter, degrees of freedom corrections produce more uncer-
tainty for smaller sample sizes. The shape of a t distribution depends on the degrees of
freedom. The higher the degrees of freedom, the more a t distribution looks like a normal
distribution. 96
dependent variable The outcome of interest, usually denoted as Y . It is called the de-
pendent variable because its value depends on the values of the independent variables,
parameters and error term. 4
dichotomous Divided into two parts. 591
dichotomous variables A dichotomous variable takes on one of two values, almost always
zero or one, for all observations. Also known as a dummy variable. 257
Dickey-Fuller test A test for unit roots, used in dynamic models. 691
difference of means test Tests that involve comparing the mean of Y for one group (e.g.,
the treatment group) against mean of Y for another group (e.g., the control group).
They can be conducted with bivariate and multivariate OLS and other statistical pro-
cedures. 257
difference-in-difference model A model that looks at differences in changes in treated
units compared to untreated units. These models are particularly useful in policy
evaluation. 401
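
A minimal sketch of the standard two-group, two-period setup in R (simulated data, not from the book's examples); the coefficient on the interaction of treated and post is the difference-in-difference estimate:

    set.seed(3)
    treated <- rep(c(0, 1), each = 100)         # treated vs. untreated units
    post    <- rep(c(0, 1), times = 100)        # before vs. after the policy
    Y <- 1 + 0.5 * treated + 0.3 * post + 2 * treated * post + rnorm(200)
    summary(lm(Y ~ treated * post))             # coefficient on treated:post is the DiD estimate (about 2)
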


discontinuity A discontinuity occurs when a graph of a line has a sudden jump up or down.
541
distribution The range of possible values for a random variable and the associated relative
probabilities for each value. Examples of four distributions are displayed in Figure 3.4.
80
dummy variable A dummy variable equals either zero or one for all observations. Dummy
variables are sometimes referred to as dichotomous variables. 257
dyad A dyad is something that consists of two elements. For some data sets such as a trade
data set, a dyad indicates a pair of countries and the data indicates how much trade
flows between them. 397
dynamic model A dynamic model is a time series model that includes a lagged dependent
variable as an independent variable. Among other differences, the interpretation of
coefficients differs in dynamic models from that in standard OLS models. Sometimes
referred to as an autoregressive model. 678

elasticity The percent change in Y associated with a percent change in X. Elasticity is
estimated with log-log models. 332
endogenous An independent variable is endogenous if changes in it are related to other
factors that influence the dependent variable. 14
error term The term associated with unmeasured factors in a regression model, typically
denoted as ε. 7
excluded category When including dummy variables indicating the multiple categories of
a nominal variable, we need to exclude a dummy variable for one of the groups, which we
refer to as the excluded category. The coefficients on all the included dummy variables
indicate how much higher or lower the dependent variable is for each group relative to
the excluded category. Also referred to as the reference category. 278
exclusion condition For two-stage least squares, a condition that the instrument exert no
direct effect in the second stage equation. This condition cannot be tested empirically.
434
exogenous An independent variable is exogenous if changes in it are unrelated to other
factors that influence the dependent variable. 16
expected value The average value of a large number of realizations of a random variable.
715


external validity A research finding is externally valid when it applies beyond the context
in which the analysis was conducted. 36

F distribution A probability distribution that characterizes the distribution of a ratio of
χ² random variables. Used in tests involving multiple parameters, among other applications. 786
F statistic The test statistic used when conducting an F test. Used in testing hypotheses
about multiple coefficients, among other applications. 343
F test A type of hypothesis test in which the test statistic is compared to a critical value
drawn from the F distribution. Commonly used in testing hypotheses involving multiple
coefficients such as when assessing if one coefficient is larger than another. 342
fitted value A fitted value, Ŷi , is the value of Y predicted by our estimated equation. For
a bivariate OLS model it is Ŷi = β̂0 + β̂1 Xi. Also called predicted values. 70
fixed effect A parameter associated with a specific unit in a panel data model. For a model
Yit = β0 + β1 X1it + αi + νit, the αi parameter is the fixed effect for unit i. 376
fixed effect model A model that controls for unit and/or period specific effects. These
fixed effects capture differences in the dependent variable associated with each unit
and/or period. Fixed effects models are used to analyze panel data and can control for
both measurable and unmeasurable elements of the error term that are stable within
unit. 377
fuzzy RD models Regression discontinuity models in which the assignment variable im-
perfectly predicts treatment. 570

generalizable A statistical result is generalizable if it applies to populations beyond the
sample in the analysis. 35
goodness of fit How well a model fits the data, typically measured with R2 . 106

heteroscedastic A random variable is heteroscedastic if the variance differs for some ob-
servations. For example, observations from one part of the country may be measured
with little error while observations from another part of the country may be measured
with considerable error. Heteroscedasticity violates one of the assumptions necessary
to use the standard equation for variance of OLS estimates. 103
heteroscedasticity-consistent standard errors Standard errors for the coefficients in
OLS that are appropriate even when errors are heteroscedastic. 103


homoscedastic A random variable is homoscedastic if the variance is the same for all ob-
servations. One of the assumptions necessary to use the standard equation for variance
of OLS estimates is that errors are homoscedastic. 102
hypothesis testing A process assessing whether the observed data is consistent or not with
a claim of interest. t tests and F tests are widely used tools in hypothesis testing. 136

identified A statistical model is identified on the basis of some assumption or information
about the data generating process that allows us to estimate causal effects. Instrumental
variables models are identified by the assumption that the exclusion and inclusion conditions
hold for the instrumental variable. In simultaneous equation models, we can identify
a model if we have instruments for each equation. 458
inclusion condition For two-stage least squares, a condition that the instrument exert a
meaningful effect in the first stage equation in which the endogenous variable is the
dependent variable. 434
independent variable A variable that possibly influences the value of the dependent vari-
able. It is usually denoted as X. It is called independent because its value is typically
treated as independent of the value of the dependent variable. 4
instrumental variable An instrumental variable is a variable that explains the endogenous
independent variable of interest but does not directly explain the dependent variable.
Two-stage least squares uses instrumental variables to produce unbiased estimates. 428
intention-to-treat analysis Intention-to-treat (ITT) analysis addresses potential endo-
geneity that arises in experiments due to non-compliance by comparing the means of
those assigned treatment and those not assigned treatment, irrespective of whether or
not they actually received the treatment. 496
intercept The parameter β0 in a regression model. It is the point at which a regression line
crosses the Y-axis. It is the expected value of the dependent variable when all independent
variables equal zero. Also referred to as the constant. 7
internal validity A research finding is internally valid when it is based on a process that is
free from systematic error. Experimental results are often considered internally valid,
but with debatable external validity. 36
irrelevant variable A variable in a regression model that should not be in the model,
meaning that its coefficient is zero. Including an irrelevant variable does not cause bias,
but increases the variance of the estimates. 231


jitter A process used when scatterplotting data. A small random number is added to each
observation only for the purposes of plotting. This procedure produces cloud-like images
which overlap less than the unjittered data and hence provide a better sense of the data.
112
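
Base R's jitter() function implements this; for example, with simulated discrete data:

    set.seed(5)
    X <- sample(1:5, 200, replace = TRUE)       # discrete data with many overlapping points
    Y <- X + sample(0:3, 200, replace = TRUE)
    plot(jitter(X), jitter(Y))                  # jitter only for display; use the raw X and Y for analysis
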

lagged variable A lagged variable is a variable with the values from the previous period.
661
latent variable A latent variable for a probit or logit model is an unobserved continuous
variable reflecting the propensity of an individual observation of Yi to equal 1. 602
least squares dummy variable (LSDV) approach An approach to estimating fixed ef-
fects models when analyzing panel data. 378
likelihood ratio test A statistical test for maximum likelihood models that is useful in
testing hypotheses involving multiple coefficients. 632
linear probability model A model used when the dependent variable is dichotomous.
This is an OLS model in which the coefficients are interpreted as the change in proba-
bility of observing Yi = 1 for a one unit change in X. 592
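
A minimal sketch in R with a simulated binary outcome (names illustrative); lm() applied to a 0/1 dependent variable is the linear probability model:

    set.seed(4)
    X <- rnorm(500)
    Y <- as.numeric(runif(500) < 0.5 + 0.1 * X)   # Pr(Y = 1) rises with X
    lpm <- lm(Y ~ X)
    coef(lpm)                                     # slope is roughly the change in Pr(Y = 1) for a one-unit change in X
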
linear-log model A model in which the dependent variable is not transformed by taking a
natural log and the independent variable is transformed by taking a natural log. In
such a model, a one percent increase in X is associated with a β1/100 change in Y. 332
local average treatment effect For instrumental variables models, the local average treat-
ment effect (LATE) is the causal effect only for those people affected by the instrument.
Relevant if the effect of X on Y varies within the population. 468
log likelihood The log likelihood is the log of the probability of observing the Y outcomes
we did, given the X data and the β̂s. It is a byproduct of the MLE estimation process. 616
log-linear model A model in which the dependent variable is transformed by taking the
natural log of it. A one unit change in X in a log-linear model is associated with a β1
percent change in Y (on 0 to 1 scale). 331
log-log model A model in which the dependent variable and independent variables are
transformed by taking natural log. In these models, a one percent change in X is
associated with a β1 percent change in Y (on 0 to 1 scale). 332
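
For example, with simulated data in R (a sketch, not an example from the text), the slope from regressing log(Y) on log(X) is the estimated elasticity:

    set.seed(11)
    X <- exp(rnorm(200))                                   # strictly positive X
    Y <- exp(0.5 + 1.2 * log(X) + rnorm(200, sd = 0.2))    # true elasticity is 1.2
    coef(lm(log(Y) ~ log(X)))                              # slope is approximately 1.2
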
logit model A way to analyze data with a dichotomous dependent variable. The error term
in a logit model is logistically distributed. 610


LPM The common short-hand used to describe linear probability models, a type of model
used to estimate models with a dichotomous dependent variable. 592
LR test See Likelihood ratio test. 632

maximum likelihood estimation The estimation process used to generate coefficient es-
timates for probit and logit models, among others. 612
measurement error Measurement error occurs when a variable is measured inaccurately.
If the dependent variable has measurement error, OLS coefficient estimates are unbiased,
but less precise. If an independent variable has measurement error, OLS coefficient
estimates suffer from attenuation bias where the magnitude of the attenuation depends
on how large the measurement error variance is relative to the variance of the variable.
220
MLE The common short-hand used to describe maximum likelihood estimation models.
612
model specification The process of deciding which variables should go in a statistical
model. 240
model-fishing Model-fishing occurs when a researchers add and subtract variables until
they get just the answers they were looking for. 241
modeled randomness Variation that occurs due to inherent variation in the data genera-
tion process. This source of randomness exists even when we observe data for an entire
population. 79
monotonicity A condition invoked when discussing instrumental variables models. It re-
quires that the effect of the instrument on the endogenous variable goes in the same
direction for everyone in a population. 469
multicollinearity Variables are multicollinear if they are correlated. The consequence of
multicollinearity is that the variance of β̂1 will be higher than if there were no multi-
collinearity. Multicollinearity does not cause bias. 227
multivariate OLS OLS with multiple independent variables. 194

natural experiment A natural experiment occurs when a researcher identifies a situation
in which the values of the independent variable have been determined by a random, or
at least an exogenous, process. 524


nominal variable A variable that has two or more categories, but which does not have
an intrinsic ordering. Also known as a categorical variable. Typical examples include
“region” (north, south, east, west) or “religion” (Catholic, Protestant, Jewish, Muslim,
Other, Secular). 276
normal distribution A normal distribution is a bell-shaped probability density that char-
acterizes the probability of observing outcomes for normally distributed random vari-
ables. Because of the Central Limit Theorem, many statistical quantities are distributed
normally. 83
null hypothesis A hypothesis of no effect. Statistical tests will reject or fail to reject such
hypotheses. The most common null hypothesis is β1 = 0, written as H0: β1 = 0. 138
null result A finding in which the null hypothesis is not rejected. 173

observational study Observational studies use data generated in an environment not con-
trolled by a researcher. They are distinguished from experimental studies and are
sometimes referred to as non-experimental studies. 36
omitted variable bias Bias that results from omitting a variable that affects the dependent
variable and is correlated with the independent variable. 211
one-sided alternative hypothesis An alternative to the null hypothesis that indicates
whether the coefficient (or function of coefficients) is higher or lower than the value
indicated in the null hypothesis. Typically written as HA: β1 > 0 or HA: β1 < 0. 140
one-way fixed effects model A panel data model that allows for fixed effects at the unit
level. 393
ordinal variable A variable that expresses rank but not necessarily relative size. An ex-
ample of an ordinal variable is one indicating answers to a survey question that is coded
1 = strongly disagree, 2 = disagree, 3 = agree, 4 = strongly agree. 276
outlier An observation that is extremely different from the rest of sample. 117
overidentification test A test used for 2SLS models when we have more than one in-
strument. The logic of the test is that the estimated coefficient on the endogenous
variable in the second stage equation should be roughly the same when each individual
instrument is used alone. 446

p-value The probability of observing a coefficient as high as we actually did if the null
hypothesis were true. 161


panel data Panel data has observations for multiple units over time. Each observation
indicates the value of a variable for a given unit at a given point in time. Panel data is
typically contrasted to cross-sectional and time series data. 367
perfect multicollinearity Perfect multicollinearity occurs when an independent variable
is completely explained by other independent variables. 230
plim A widely used abbreviation for probability limit, the value to which an estimator
converges as the sample size gets very, very large. 100
point estimate Point estimates describe our best guess as to what the true values are. 178
polynomial model Models that include values of X raised to powers more than one. A
polynomial model is an example of a non-linear model in which the effect of X on Y
varies depending on the value of X. The fitted values will be defined by a curve. A
quadratic model is an example of a polynomial model. 321, 325
pooled model A pooled model treats all observations as independent observations. Pooled
models contrast with fixed effect models that control for unit-specific or time-specific
fixed effects. 369
power Power refers to the ability of our data to reject the null. A high-powered statistical
test will reject the null with a very high probability when the null is false; a low-powered
statistical test will reject the null with a low probability when the null is false. 168
power curve A curve that characterizes the probability of rejecting the null for each pos-
sible value of the parameter. 171
predicted values The value of Y predicted by our estimated equation. For a bivariate OLS
model it is Ŷi = β̂0 + β̂1 Xi. Also called fitted values. 70
probability density A probability density is a graph or formula that describes the relative
probability a random variable is near a specified value. 81
probability density function A mathematical function that describes the relative prob-
ability for a continuous random variable to take on a given value. 771
probability distribution A probability distribution is a graph or formula that gives the
probability for each possible value of a random variable. 81
probability limit The value to which a distribution converges as the sample size gets very
large. When the error is uncorrelated with the independent variables, the probability
limit of β̂1 is β1. The probability limit of a consistent estimator is the true value of the
parameter. 99


probit model A way to analyze data with a dichotomous dependent variable. The key
assumption is that the error term is normally distributed. 606
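
In R, glm() with a binomial family estimates probit (and logit) models by maximum likelihood; a minimal sketch with simulated data generated from a latent-variable process (all names and numbers are illustrative):

    set.seed(6)
    X <- rnorm(500)
    Y <- as.numeric(0.3 + 0.8 * X + rnorm(500) > 0)          # latent variable crosses zero
    probit_fit <- glm(Y ~ X, family = binomial(link = "probit"))
    logit_fit  <- glm(Y ~ X, family = binomial(link = "logit"))
    coef(probit_fit)                                          # close to the true values 0.3 and 0.8
    predict(probit_fit, type = "response")[1:5]               # fitted probabilities that Y = 1
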

quadratic model Models that include X and X² as independent variables. The fitted
values will be defined by a curve. A quadratic model is an example of a polynomial
model. 321
quasi-instrument An instrumental variable that is not strictly exogenous, meaning that
there is a non-zero correlation between it and the error term in the equation of interest. 2SLS
using a quasi-instrument may produce a better estimate than OLS if the correlation
of the quasi-instrument and the error in the main equation is small relative to the
correlation of the quasi-instrument and the endogenous variable. 448

random effects model Random effects models treat unit-specific error as a random vari-
able that is uncorrelated with the independent variable. 748
random variable A variable that takes on values in a range and with the probabilities
defined by a distribution. 80
randomization Randomization is the process of determining the experimental value of the
key independent variable based on a random process. If successful, randomization will
ensure that the independent variable is uncorrelated with all variables, including factors
in the error term. 32
RD See regression discontinuity. 544
reduced form equation In a reduced form equation Y1 is only a function of the non-
endogenous variables (which are the X and Z variables, not the Y variables). Used in
simultaneous equation models. 458
reference category When including dummy variables indicating the multiple categories
of a nominal variable, we need to exclude a dummy variable for one of the groups,
which we refer to as the reference category. The coefficients on all the included dummy
variables indicate how much higher or lower the dependent variable is for each group
relative to the reference category. Also referred to as the excluded category. 278
regression discontinuity Regression discontinuity techniques use regression analysis to
identify possible discontinuities at the point some treatment applies. 544
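
A minimal sketch of the basic estimating equation in R, with simulated data and an assumed cutoff of zero (a simple linear specification inside a window; real applications should also explore flexible functional forms and window sizes, as discussed in the text):

    set.seed(10)
    score <- runif(1000, -1, 1)                         # assignment variable, cutoff at 0
    treat <- as.numeric(score >= 0)
    Y <- 1 + 0.5 * score + 2 * treat + rnorm(1000)      # true jump at the cutoff is 2
    in_window <- abs(score) < 0.5                       # keep observations near the cutoff
    summary(lm(Y ~ treat + score, subset = in_window))  # coefficient on treat estimates the discontinuity
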
regression line A regression line is the fitted line from a bivariate regression. 70
replication Research that meets a replication standard can be duplicated based on the
information provided at the time of publication. 46


replication file Replication files document how exactly data is gathered and organized.
When done properly, these files allow others to check our work by following our steps
and seeing if they get identical results. 47
residual A residual is the difference between the fitted value and observed value. Graphi-
cally, it is the distance between an estimated line and an observation. Mathematically,
a residual is ε̂i = Yi − β̂0 − β̂1 Xi. An equivalent way to calculate a residual is ε̂i = Yi − Ŷi.
70
restricted model A restricted model is the model in an F test that imposes the restriction
that the null hypothesis is true. If the fit of the restricted model is much worse than
the fit of the unrestricted model, we infer that the null hypothesis is not true. 343
robust Statistical results are robust if they do not change when the model changes. 49
rolling cross-section data Repeated cross-sections of data from different individuals at
different points in time. An example would be a survey of U.S. citizens each year in
which different citizens are chosen each year. 407

sampling randomness Variation in estimates when we observe a subset of an entire population.
If our sample had a different selection of people, we would observe a different
estimated coefficient. 79
scalar variable A scalar variable is simply a variable with a single value (in contrast to a
typical variable that has a list of values). 583
scatterplot A plot of data with each observation located at the coordinates defined by the
independent and dependent variable. 7
selection model A selection model accounts simultaneously for whether we observe the
dependent variable and what the dependent variable is. Often used to deal with attrition
problems in experiments. The most famous selection model is the Heckman selection
model. 518
significance level For each hypothesis test, we set a significance level that determines how
unlikely a result has to be under the null hypothesis for us to reject the null hypothesis.
The significance level is the probability of committing a Type I error for a hypothesis
test. 144
simultaneous equation model A simultaneous equation model is one in which two vari-
ables simultaneously cause each other. 455
slope coefficient The coefficient on an independent variable. It reflects how much the
dependent variable increases when the independent variable increases by one. In a plot
of fitted values, the slope coefficient characterizes the slope of the fitted line. 7


specification See model specification. 343


spurious regression A regression that wrongly suggests that X has an effect on Y . A
spurious regression can be caused by, among other sources, omitted variable bias and
nonstationary data. 686
stable unit treatment value assumption The stable unit treatment value assumption
(SUTVA) is a condition that there is no spillover effect of an instrument. This condition
rules out the possibility that the value of an instrument going up by one unit causes a
neighbor to become more likely to change X as well. 469
standard deviation For descriptive data, the standard deviation describes the spread of
the data. For large samples, it is calculated as √(Σ(Xi − X̄)²/N). For probability distributions,
the standard deviation refers to the width of the distribution. For example, we often
refer to the standard deviation of the ε distribution as σ; it is the square root of the
variance (which is σ²). To convert a normally distributed random variable into a standard
normal variable, we subtract the mean and divide by the standard deviation of
the distribution of the random variable. 43
standard error Standard error refers to the accuracy of a parameter estimate, which is
determined by the width of the distribution of the parameter estimate. For example,
the standard error of β̂1 from a bivariate OLS model is the square root of the variance of
the estimate. It is √(σ̂²/(N × var(X))) (see page 95). Sometimes the standard error and standard
deviation do similar work. For example, the standard deviation of the distribution of
β̂1 is estimated by the standard error of β̂1. A good rule of thumb is to
associate standard errors with parameter estimates and standard deviations with the
spread of a variable or distribution, which may or may not be a distribution associated
with a parameter estimate. 93
standard error of the regression A measure of how well the model fits the data. It is
the square root of the variance of the regression. 107
standard normal distribution A normal distribution with a mean of zero and a variance
(and standard deviation) of one. 774
standardize Standardizing a variable converts it to a measure of standard deviations from
its mean. This is done by subtracting the mean of the variable from each observation
and dividing the result by the standard deviation of the variable. 338
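
In R this can be done by hand or with scale(); a tiny illustration with made-up numbers:

    X <- c(62, 66, 70, 74)                 # for example, heights in inches
    (X - mean(X)) / sd(X)                  # standardize by hand
    as.numeric(scale(X))                   # scale() gives the same result
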
standardized coefficient A standardized coefficient is the coefficient on an independent
variable that has been standardized according to X1Standardized = (X1 − X̄1)/sd(X1). Because a one-unit
change in a standardized variable is a one standard deviation change no matter
what the unit of X is (be it inches, dollars, or years), effects across variables can be
compared because each β̂ represents the effect of a one standard deviation change in X
on Y. 340
stationarity A time series term indicating that a variable has the same distribution through-
out the entire time series. Variables that have persistent trends are nonstationary.
Statistical analysis of nonstationary variables can yield spurious regression results. 683
statistically significant A coefficient is statistically significant when we reject the null
hypothesis that it is zero. In this case, the observed value of the coefficient is a sufficient
number of standard deviations from the value posited in the null hypothesis that we
reject the null hypothesis. 138
substantive significance If a reasonable change in the independent variable is associated
with a meaningful change in the dependent variable, the effect is substantively signif-
icant. Some statistically significant effects are not substantively significant, especially
for large data sets. 176

t distribution A distribution that looks like a normal distribution, but with fatter tails.
The exact shape of the distribution depends on the degrees of freedom. This distribution
converges to a normal distribution for large sample sizes. 150
t statistic The test statistic used in a t test. It is equal to (β̂1 − βNull)/se(β̂1). If the t statistic is
greater than our critical value, we reject the null hypothesis. 157
t test A hypothesis test for hypotheses about a normal random variable with an estimated
standard error. It involves comparing |β̂1/se(β̂1)| to a critical value from a t distribution
determined by the chosen significance level (α). For large sample sizes, a t test is closely
approximated by a z test. 147
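
The pieces of a t test can be computed directly in R; a sketch with simulated data (a two-sided test of the null that the slope is zero):

    set.seed(7)
    X <- rnorm(100)
    Y <- 1 + 0.4 * X + rnorm(100)
    fit <- lm(Y ~ X)
    b1  <- coef(summary(fit))["X", "Estimate"]
    se1 <- coef(summary(fit))["X", "Std. Error"]
    t_stat <- (b1 - 0) / se1                               # null hypothesis: beta1 = 0
    p_val  <- 2 * pt(-abs(t_stat), df = fit$df.residual)   # two-sided p-value
    c(t_stat = t_stat, p_value = p_val)                    # matches the summary(fit) output for X
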
time series data Time series data has observations for a single unit over time. Each ob-
servation indicates the value of a variable at a given point in time. The data proceed
in order, indicating, for example, annual, monthly, or daily data. Time series data is
typically contrasted to cross-sectional and panel data. 105
treatment group In an experiment, the group that receives the treatment of interest. 32
trimmed data set A trimmed data set is one for which observations are removed in a way
to offset potential bias due to attrition. 518
two-sided alternative hypothesis An alternative to the null hypothesis that indicates
the coefficient (or function of coefficients) is higher or lower than the value indicated in
the null hypothesis. Typically written as HA: β1 ≠ 0. 140


two-stage least squares Two-stage least squares uses exogenous variation in X to estimate
the effect of X on Y . In the first stage, we estimate a model in which the endogenous
independent variable is the dependent variable and the instrument, Z, is an independent
variable. In the second stage, we estimate a model in which the fitted value from the
first stage, X̂1i, is an independent variable. 425
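
The logic can be seen in a manual two-stage sketch in R with simulated data (illustrative only; the standard errors from a by-hand second stage are not correct, so in practice use a dedicated 2SLS routine such as ivreg() in the AER package):

    set.seed(8)
    Z <- rnorm(500)                          # instrument
    u <- rnorm(500)                          # unobserved factor in the error term
    X <- 0.7 * Z + u + rnorm(500)            # endogenous independent variable
    Y <- 2 + 1.5 * X + 2 * u + rnorm(500)    # true effect of X is 1.5
    coef(lm(Y ~ X))["X"]                     # OLS is biased upward here
    first_stage <- lm(X ~ Z)
    X_hat <- fitted(first_stage)             # fitted values from the first stage
    coef(lm(Y ~ X_hat))["X_hat"]             # second stage recovers roughly 1.5
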
two-way fixed effects model A panel data model that allows for fixed effects at the unit
and time levels. 393
Type I error A hypothesis testing error that occurs when we reject a null hypothesis even
when it is true. 139
Type II error A hypothesis testing error that occurs when we fail to reject a null hypothesis
even when it is false. 139

unbiased estimator An unbiased coefficient estimate will on average equal the true value of
the parameter. An unbiased estimator can produce individual estimates that are quite
incorrect; on average, though, the too low estimates are probabilistically balanced by
too high estimates for unbiased estimators. OLS produces unbiased parameter estimates
if the independent variables are uncorrelated with the error term. 87
unit root A variable with a unit root has a coefficient equal to one on the lagged variable.
A variable with a unit root is nonstationary and must be modeled differently than a
stationary variable. 685
unrestricted model An unrestricted model is the model in an F test that imposes no
restrictions on the coefficients. If the fit of the restricted model is much worse than the
fit of the unrestricted model, we infer that the null hypothesis is not true. 343

variance Variance is a measure of how much a random variable varies. In graphical terms,
the variance of a random variable characterizes how wide the distribution is. 93
variance inflation factor A measure of how much variance is inflated due to multicollinearity.
It can be estimated for each variable and is equal to 1/(1 − R²j), where R²j is from an
auxiliary regression in which Xj is the dependent variable and all other independent
variables from the main equation are included as independent variables. 227
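
The auxiliary-regression calculation is easy to do directly in R; a sketch with two simulated, correlated independent variables:

    set.seed(9)
    X2 <- rnorm(200)
    X1 <- 0.8 * X2 + rnorm(200, sd = 0.5)    # X1 and X2 are correlated
    aux <- lm(X1 ~ X2)                       # auxiliary regression for X1
    r2_j <- summary(aux)$r.squared
    1 / (1 - r2_j)                           # variance inflation factor for X1
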
variance of the regression The variance of the regression measures how well the model
explains variation in the dependent variable. For large samples, it is estimated as
σ̂² = Σ(Yi − Ŷi)²/N, where the sum runs from i = 1 to N. 95


weak instrument A weak instrument is an instrumental variable that adds little explana-
tory power to the first stage regression in a 2SLS analysis. 450
window A window is the range of observations we analyze in a regression discontinuity
analysis. The smaller the window, the less we need to worry about non-linear functional
forms. 561

z test A hypothesis test involving comparison of a test statistic to a critical value based
on a normal distribution. Examples include hypothesis tests for maximum likelihood
estimation models and tests of hypotheses about a normal random variable with a
known standard error. 613
