Real Stats
Econometrics for Political Science, Public Policy, and Economics
Michael A. Bailey
© 2014 Oxford University Press
CONTENTS
Foreword for Instructors: How to Help Your Students Learn Statistics xix
Foreword for Students: How This Book Can Help You Learn Statistics xxviii
I The OLS Framework 64
The t distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Critical values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
t statistics for the height and wage example . . . . . . . . . . . . . . . . . 158
Other types of null hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . 159
4.3 p-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
4.4 Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Incorrectly failing to reject the null hypothesis . . . . . . . . . . . . . . . . 166
Calculating power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
Power curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
When to care about power . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
4.5 Straight Talk about Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . 175
4.6 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
Computing Corner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
Computing Corner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
F tests and baseball salaries . . . . . . . . . . . . . . . . . . . . . . . . . . 348
Case Study: Comparing Effects of Height Measures . . . . . . . . . . . . . . . . . 352
7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
Computing Corner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
8 Using Fixed Effects Models to Fight Endogeneity in Panel Data and Difference-
in-Difference Models 366
8.1 The Problem with Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
Test score example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
8.2 Fixed Effects Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
Least squares dummy variable approach . . . . . . . . . . . . . . . . . . . 378
De-meaned approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
8.3 Working with Fixed Effects Models . . . . . . . . . . . . . . . . . . . . . . . 387
8.4 Two-way Fixed Effects Model . . . . . . . . . . . . . . . . . . . . . . . . . . 393
Case Study: Trade and Alliances . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
8.5 Difference-in-difference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
Difference-in-difference logic . . . . . . . . . . . . . . . . . . . . . . . . . . 401
Using OLS to estimate difference-in-difference models . . . . . . . . . . . . 402
Difference-in-difference models for panel data . . . . . . . . . . . . . . . . 407
8.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
Computing Corner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
9.3 Multiple Instruments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
2SLS with multiple instruments . . . . . . . . . . . . . . . . . . . . . . . . 445
Overidentification tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
9.4 Weak Instruments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
Quasi-instrumental variables are not strictly exogenous . . . . . . . . . . . 448
Weak instruments do a poor job predicting X . . . . . . . . . . . . . . . . 450
9.5 Precision of 2SLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 452
9.6 Simultaneous Equation Models . . . . . . . . . . . . . . . . . . . . . . . . . 455
Endogeneity in simultaneous equation models . . . . . . . . . . . . . . . . 456
Using 2SLS for simultaneous equation models . . . . . . . . . . . . . . . . 458
Identification in simultaneous equation models . . . . . . . . . . . . . . . . 458
Case Study: Support for President Bush and the Iraq War . . . . . . . . . . . . . 461
9.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
Computing Corner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472
Case Study: Crime and Terror Alerts . . . . . . . . . . . . . . . . . . . . . . . . . 528
10.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534
Computing Corner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
Latent variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
12.3 Probit and Logit Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
Probit model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606
Logit model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 610
12.4 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
Properties of MLE estimates . . . . . . . . . . . . . . . . . . . . . . . . . . 613
Fitted values from the probit model . . . . . . . . . . . . . . . . . . . . . . 613
Fitted values from the logit model . . . . . . . . . . . . . . . . . . . . . . 616
Goodness of fit for MLE models . . . . . . . . . . . . . . . . . . . . . . . . 616
12.5 Interpreting Probit and Logit Coefficients . . . . . . . . . . . . . . . . . . . . 619
The effect of X1 depends on the value of X1 . . . . . . . . . . . . . . . . . 619
The effect of X1 depends on the values of the other independent variables . 621
Case Study: Dog Politics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626
12.6 Hypothesis Testing about Multiple Coefficients . . . . . . . . . . . . . . . . . 631
Case Study: Civil Wars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638
12.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 642
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644
Computing Corner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 649
Nonstationarity as a unit root process . . . . . . . . . . . . . . . . . . . . 684
Nonstationarity and spurious results . . . . . . . . . . . . . . . . . . . . . 686
Spurious results are less likely with stationary data . . . . . . . . . . . . . 689
Detecting unit roots and nonstationarity . . . . . . . . . . . . . . . . . . . 691
How to handle nonstationarity . . . . . . . . . . . . . . . . . . . . . . . . . 693
Case Study: Dynamic Model of Global Temperature . . . . . . . . . . . . . . . . 695
13.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 700
Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 703
Key Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 704
Computing Corner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 705
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 706
16 Conclusion: How to Be a Statistical Realist 754
16.1 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 761
Acknowledgements 763
Appendices 765
Bibliography 827
Index 839
Glossary 844
LIST OF TABLES
5.1 Bivariate and Multivariate Results for Retail Sales Data . . . . . . . . . . . 198
5.2 Bivariate and Multiple Multivariate Results for Height and Wages Data . . . 201
5.3 Economic Growth and Education Using Multiple Measures of Education . . 216
5.4 Effects of Judicial Independence on Human Rights - Including Democracy
Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
5.5 Variables for Height and Weight Data in the United States . . . . . . . . . . 249
5.6 Variables for Cell Phones and Traffic Deaths Questions . . . . . . . . . . . . 250
5.7 Variables for Speeding Ticket Data . . . . . . . . . . . . . . . . . . . . . . . 251
5.8 Variables for Height and Weight Data in Britain . . . . . . . . . . . . . . . . 252
8.1 Basic OLS Analysis of Burglary and Police Officers, 1951-1992 . . . . . . . . 370
8.2 Example of Robbery and Police Data for Cities in California . . . . . . . . . 379
8.3 Robberies and Police Data for Hypothetical Cities in California . . . . . . . 384
8.4 Burglary and Police Officers, Pooled versus Fixed Effects Models, 1951-1992 385
8.5 Burglary and Police Officers, Pooled versus Fixed Effect Models, 1951-1992 . 396
8.6 Bilateral Trade, Pooled versus Fixed Effect Models, 1951-1992 . . . . . . . . 399
8.7 Effect of Stand Your Ground Laws on Homicide Rate Per 100,000 Residents 409
8.8 Variables for Presidential Approval Question . . . . . . . . . . . . . . . . . . 418
8.9 Variables for Peace Corps Question . . . . . . . . . . . . . . . . . . . . . . . 418
8.10 Variables for Teaching Evaluation Questions . . . . . . . . . . . . . . . . . . 419
8.11 Variables for the HOPE Scholarship Question . . . . . . . . . . . . . . . . . 420
8.12 Variables for the Texas School Board Data . . . . . . . . . . . . . . . . . . . 421
8.13 Variables in the Cell Phones and Traffic Deaths Panel Data Set . . . . . . . 422
9.1 Levitt (2002) Results on Effect of Police Officers on Violent Crime . . . . . . 427
9.2 Influence of Distance on NICU Utilization (First Stage Results) . . . . . . . 442
9.3 Influence of NICU Utilization on Baby Mortality . . . . . . . . . . . . . . . . 444
9.4 First Stage Reduced Form Regressions for Bush/Iraq War Simultaneous Equa-
tion Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
9.5 Second Stage Results for Bush/Iraq War Simultaneous Equation Model . . . 466
9.6 Variables for Rainfall and Economic Growth Question . . . . . . . . . . . . . 472
9.7 Variables for News Program Question . . . . . . . . . . . . . . . . . . . . . . 474
9.8 Variables for Fish Market Question . . . . . . . . . . . . . . . . . . . . . . . 474
9.9 Variables for Education and Crime Questions . . . . . . . . . . . . . . . . . 475
9.10 Variables for Income and Democracy Questions . . . . . . . . . . . . . . . . 476
10.1 Balancing Tests for Progresa Experiment: Differences of Means Tests Using
OLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 490
10.2 First Stage Regression in Campaign Experiment: Explaining Contact . . . . 503
10.3 Second Stage Regression in Campaign Experiment: Explaining Turnout . . . 505
10.4 Various Measures of Campaign Contact in 2SLS Model for Selected Observations 506
10.5 First Stage Regression in Domestic Violence Experiment: Explaining Arrests 511
10.6 Selected Observations for Minneapolis Domestic Violence Experiment . . . . 512
10.7 Analyzing Domestic Violence Experiment Using Different Estimators . . . . 513
10.8 Effect of Terror Alerts on Crime . . . . . . . . . . . . . . . . . . . . . . . . 530
10.9 Variables for Get-out-the-vote Experiment from Gerber and Green (2005) . . 535
10.10 Variables for Resume Experiment . . . . . . . . . . . . . . . . . . . . . . . . 537
10.11 Variables for Afghan School Experiment . . . . . . . . . . . . . . . . . . . . 539
11.1 RD Analysis of Pre-K . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569
11.2 RD Analysis of Drinking Age and Test Scores (from Carrell, Hoekstra, and
West 2010) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 578
11.3 RD Diagnostics for Drinking Age and Test Scores (from Carrell, Hoekstra,
and West 2010) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
LIST OF FIGURES
4.1 Distribution of β̂1 Under the Null Hypothesis for Presidential Election Example 143
4.2 Distribution of β̂1 Under the Null Hypothesis with Larger Standard Error for
Presidential-Election Example . . . . . . . . . . . . . . . . . . . . . . . . . . 149
4.3 Three t distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
4.4 Critical Values for Large Sample t tests . . . . . . . . . . . . . . . . . . . . . 155
4.5 Two Examples of p-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
4.6 Statistical Power for Three Values of β1, α = 0.01, and a One-Sided Alterna-
tive Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
4.7 Power Curves for Two Values of se(β̂1) . . . . . . . . . . . . . . . . . . . . . 172
4.8 Meaning of Confidence Interval for Example of 0.41 ±0.196 . . . . . . . . . . 180
5.1 Monthly Retail Sales and Temperature in New Jersey from 1992 to 2013 . . 193
5.2 Monthly Retail Sales and Temperature in New Jersey with December Indicated 195
5.3 95% Confidence Intervals for Coefficients in Adult Height, Adolescent Height,
and Wage Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
5.4 Economic Growth, Years of School, and Test Scores . . . . . . . . . . . . . . 218
6.1 Goal Differentials for Home and Away Games for Manchester City and Manch-
ester United . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
6.2 Bivariate OLS with a Dummy Independent Variable . . . . . . . . . . . . . . 260
6.3 Scatterplot of Obama Feeling Thermometers and Party Identification . . . . 265
6.4 Three Difference of Means Tests . . . . . . . . . . . . . . . . . . . . . . . . . 266
6.5 Scatterplot of Height and Gender . . . . . . . . . . . . . . . . . . . . . . . . 268
6.6 Scatterplot of Height and Gender . . . . . . . . . . . . . . . . . . . . . . . . 271
6.7 Fitted Values for Model with Dummy Variable and Control Variable: Manch-
ester City Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
6.8 Relation Between Omitted Variable (Obama Vote) and Other Variables . . . 287
6.9 Confidence Intervals for Newly Elected Variable in Table 6.8 . . . . . . . . . 291
6.10 Fitted Values for Yi = β0 + β1 Xi + β2 Dummyi + β3 Dummyi × Xi + εi . . . . 294
6.11 Various Fitted Lines from Dummy Interaction Models . . . . . . . . . . . . . 298
6.12 Heating Used and Heating Degree Days for Homeowner Who Installed a Pro-
grammable Thermostat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
6.13 Heating Used and Heating Degree Days with Fitted Values for Different Models 305
6.14 Marginal Effect of Text Ban as Total Miles Changes . . . . . . . . . . . . . . 311
8.1 Robberies and Police for Large Cities in California, 1971-1992 . . . . . . . . 371
8.2 Robberies and Police for Specified Cities in California, 1971-1992 . . . . . . . 372
8.3 Robberies and Police for Specified Cities in California with City-specific Re-
gression Lines, 1971-1992 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
8.4 Robberies and Police for Hypothetical Cities in California . . . . . . . . . . . 383
8.5 Difference-in-difference Examples . . . . . . . . . . . . . . . . . . . . . . . . 406
8.6 More Difference-in-difference Examples . . . . . . . . . . . . . . . . . . . . . 411
10.1 Compliance and Non-compliance in Experiments . . . . . . . . . . . . . . . . 495
11.1 Drinking Age and Test Scores . . . . . . . . . . . . . . . . . . . . . . . . . . 543
11.2 Basic Regression Discontinuity Model, Yi = β0 + β1 Ti + β2 (X1i − C) . . . . . 548
11.3 Possible Results with Basic RD Model . . . . . . . . . . . . . . . . . . . . . 550
11.4 Possible Results with Differing-Slopes RD Model . . . . . . . . . . . . . . . . 555
11.5 Fitted Lines for Examples of Polynomial RD Models . . . . . . . . . . . . . 558
11.6 Various Fitted Lines for RD Model of Form Yi = β0 + β1 Ti + β2 (X1i − C) +
β3 (X1i − C)Ti . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
11.7 Smaller Windows for Fitted Lines for Polynomial RD Model in Figure 11.5 . 563
11.8 Bin Plots for RD Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 565
11.9 Binned Graph of Test Scores and Pre-K Attendance . . . . . . . . . . . . . . 568
11.10 Histograms of Assignment Variable for RD Analysis . . . . . . . . . . . . . . 573
11.11 Histogram of Age Observations for Drinking Age Case Study . . . . . . . . . 579
12.1 Scatterplot of Law School Admissions Data and LPM Fitted Line . . . . . . 595
12.2 Misspecification Problem in Linear Probability Model . . . . . . . . . . . . . 598
12.3 Scatterplot of Law School Admissions Data and LPM and Probit Fitted Lines 601
12.4 Symmetry of Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . 607
12.5 PDFs and CDFs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
12.6 Examples of Data and Fitted Lines Estimated by Probit . . . . . . . . . . . 614
12.7 Varying Effect of X in Probit Model . . . . . . . . . . . . . . . . . . . . . . 620
12.8 Fitted Lines from LPM, Probit, and Logit Models . . . . . . . . . . . . . . . 631
12.9 Fitted Lines from LPM and Probit Models for Civil War Data (Holding Ethnic
and Religious Variables at Their Means) . . . . . . . . . . . . . . . . . . . . 641
12.10 Figure Included for Some Respondents in Global Warming Survey Experiment 652
FOREWORD FOR INSTRUCTORS: HOW TO HELP YOUR STUDENTS LEARN STATISTICS
We statistics teachers have high hopes for our students. We want them to understand how
statistics can shed light on important policy and political questions. Sometimes they humor
us with incredible insight. The heavens part, angels sing. We want that to happen daily.
Sadly, a more common experience is seeing a furrowed brow of confusion and frustration.
It doesn’t have to be this way. If we distill the material down to the most critical concepts
we can inspire more insight and less brow-furrowing. Unfortunately, conventional statistics
books all too often manage to be too simple and too confusing at the same time. They
are simple in that they hardly get past rudimentary ordinary least squares. They are too
confusing in that they get there by way of covering material ranging from probability distri-
butions to χ² to basic OLS. These concepts do not naturally fit together and can overwhelm
many students. All that to get to naive OLS? And what of the poor students up against a
book that really piles it on with ANOVA, kurtosis, Kruskal’s (or is it Goodman’s?) gamma, and who knows what else?
This book is predicated on the belief that we are most effective when we teach the tools
we use. What we use are regression-based tools with an increasing focus on experiments and
causal inference. If students can understand these fundamental concepts, they can legiti-
mately participate in analytically sound conversations. They can produce interesting – and
believable! – analysis. They can understand experiments and the sometimes subtle analysis
required when experimental methods meet social scientific reality. They can appreciate that
causal effects are hard to tease out with observational data and that standard errors esti-
mated on crap coefficients, however complex, do no one any good. They can sniff out when
others are being naive or cynical. It is only when we muck around too long in the weeds of
less useful material that statistics becomes the quagmire students fear.
Hence this book seeks to be analytically sophisticated in a simple and relevant way. In doing so, the book is guided by three principles: relevance, opportunity costs, and pedagogical
efficiency.
Relevance
Relevance is a crucial first principle for successfully teaching statistics in the social sciences.
Every experienced instructor knows that most economics, politics, and policy students care
more about the real world than math. How do we get such students to engage with statistics?
One option is to cajole them to care more and work harder. We all know how well that works.
A better option is to show students that statistics helps them learn more about the topics they are in class to learn about.
trying to get a child to commit to the training necessary to play competitive sports. She
could start with a semester of theory ... No, that would be cruel. And counterproductive.
Much better is to let the child play and experience the joy of the sport. Then there will be time for drills and technique.
Learning statistics is not that different from learning anything else. We need to care to
truly learn. Therefore this book takes advantage of a careful selection of material to spend more time on questions and examples that students care about.
Opportunity costs
Opportunity costs are, as we all tell our students, what we have to give up to do something.
So, while some topic might be a perfectly respectable part of a statistical tool kit, we should
include it only if it does not knock out something more important. The important stuff all
too often gets shunted aside as we fill up the early part of students’ analytical training with
statistical knick-knacks, material “some people still use” or that students “might see.”
Therefore this book goes quickly through descriptive statistics and doesn’t cover χ² tests
for two-way tables, weighted least squares and other denizens of conventional statistics books.
These concepts – and many many more – are all perfectly legitimate. Some are covered
elsewhere (descriptive statistics are covered in elementary schools these days!). Others are
valuable enough that I include them in an “advanced material” section for students and
instructors who want to pursue these topics further. And others simply don’t make the cut.
Only by focusing the material can we get to the tools used by researchers today, tools such
as panel data analysis, instrumental variables, and regression discontinuity. The core ideas
behind these tools are not that difficult, but we need to make time to cover them.
Pedagogical efficiency
Pedagogical efficiency refers to streamlining the learning process by using a single unified
framework. Everything in this book builds from the standard regression model. Hypothesis
testing, difference of means, and experiments can be, and often are, taught independently of
regression. Causal inference is sometimes taught with potential outcomes notation. There is
nothing intellectually wrong with these approaches. But is doing so pedagogically efficient? If
we teach these concepts as stand-alone concepts we have to take time and, more importantly,
student brain space, to set up each separate approach. For students, this is really hard.
Remember the furrowed brows? Students work incredibly hard to get their heads around
difference of means and where to put degrees of freedom corrections and how to know if
the means come from correlated groups or independent groups and what the equation is
for each of these cases. Then BAM! Suddenly their professor is talking about residuals
and squared deviations. It’s old hat for us, but can overwhelm students first learning the
material. It is more efficient to teach the OLS framework and use that to cover difference
of means, experiments, and the contemporary canon of statistical analysis, including panel
data, instrumental variables, and regression discontinuity. Each tool builds from the same
regression model. Students start from a comfortable place and can see the continuity that
exists.
An important benefit of working with a single framework is that it allows students to re-
visit the core model repeatedly throughout the term. Despite the brilliance of our teaching,
students rarely can put it all together with one pass through the material. I know I didn’t
when I was beginning. Students need to see the material a few times, work with it a bit, and
then it will finally click. Can you imagine if sports were coached the way we do statistics?
A tennis coach who said, “This week we’ll cover forehands (and only forehands), next week
backhands (and only backhands) and the week after that serves (and only serves)” would
not be a tennis coach for long. Instead, coaches introduce material, practice, and then keep returning to earlier skills even as they add new ones.
Course adoption
This book is organized to work well in two different kinds of courses. First, it can be used in
a second semester course following a semester of probability and statistics. In such a course,
students will likely be able to move quickly through the early material and then pick up the more advanced material.
Second, this book can be used in a first semester applied course. Using this book for
the first course avoids the “warehouse problem.” The warehouse problem occurs when we
treat students’ statistical education as a warehouse that we first fill up with tools that we
then access later. One challenge is that things rot in a warehouse. Another challenge is
that instructors tend to hoard a bit, putting things in the warehouse “just in case” and
creating clutter. And students undeniably find warehouse work achingly dull. Using this
book in a first semester course avoids the warehouse problem by going directly to interesting
and useful statistical material, providing students with a more just-in-time approach. For
example, they see statistical distributions, but in the context of trying to solve a specific
problem, rather than as an abstract concept that will become useful later.
The book also is designed to encourage two particularly useful pedagogical techniques.
One is interweaving, the process of weaving in material from previous lessons into later
lessons. Each section ends with a “Remember This” box that summarizes key points. Con-
necting back to these points in later lessons is remarkably effective at getting the material
into the active part of students’ brains. The more we ask students about omitted variable bias, for example, as the course proceeds,
the more they become able to actively apply the material on their own. A second useful
teaching technique is to use frequent low-stakes quizzes. These quizzes can be based on the
summary questions and exercises at the end of each chapter. However done, such quizzes
convert students to active learners without the stress of midterm and final exams. We need
exams, too, of course, but the low-stakes quizzes do a lot of work in preparing students for
them. Brown, Roediger, and McDaniel (2014) provide an excellent discussion of these and other learning techniques.
Overview
The first two chapters of the book serve as introductory material. Chapter 1 lays out the
theme of how important – and hard – it is to generate unbiased estimates. This is a good time
to let students offer hypotheses about questions they care about, because these questions
can help bring to life the subsequent material. Chapter 2 introduces computer programs and
good practices. This chapter is a confidence builder that gets students over the hurdle of getting started with statistical software.
Part One covers core OLS material. Chapter 3 introduces bivariate OLS. Chapter 4
covers hypothesis testing and Chapter 5 moves to multivariate OLS. Chapters 6 and 7 move
to practical tasks such as use of dummy variables, logged variables, interactions, F-tests, and
the like.
Part Two covers essential elements of the contemporary statistical tool kit, including panel data models, instrumental variables, experiments, and regression discontinuity.
The experiments chapter uses instrumental variables, but other than that these chapters can
be covered in any order, so instructors can pick and choose among these chapters as needed.
Part Three is a single chapter on dichotomous dependent variables. It develops the linear
probability model in the context of OLS and uses the probit and logit models to introduce
students to maximum likelihood. Instructors can cover this chapter before Part Two if they wish.
Part Four covers advanced material. Chapter 13 covers time series models, introducing
techniques to estimate autocorrelation and dynamic time series models; this chapter can be
covered immediately after the end of Part One. Chapter 14 offers derivations of the OLS
model and additional material on omitted variable bias; this chapter can be considered as an
auxiliary to Chapters 3 and 5 for instructors seeking to expose students to derivations and
extensions of the core OLS material. Chapter 15 introduces more advanced topics in panel
data. This chapter can be considered as an auxiliary to Chapter 8 for instructors who seek
to expose students to time series aspects of panel data. It should be covered after Chapter
13.
Appendices at the end of the book provide background material, including a review of probability. The appendix on citations and additional notes is linked to the text by page
numbers and elaborates on some finer points in the text. A separate appendix provides answers to selected discussion questions.
Teaching statistics is difficult. When the going gets tough it is tempting to blame students,
to say they are unwilling to do the work. Before we do that, we should recognize that many
students find the material quite foreign and (unfortunately) irrelevant. If we can streamline
what we teach and connect it to things students care about, we can improve our chances of
getting students to understand the material, material that is not only intrinsically interesting,
but also forms the foundation for all empirical work. When students understand, teaching
becomes easier. And better. The goal of this book is to help get us there.
FOREWORD FOR STUDENTS: HOW THIS BOOK CAN HELP YOU
LEARN STATISTICS
“I wish I had had this book when I was first exposed to the material – it would
have saved a lot of time and hair-pulling...”– Student J.H.
This book introduces the statistical tools necessary to answer important questions. Do
anti-poverty programs work? Does unemployment affect inflation? Does campaign spending
affect election outcomes? These and many more questions are not only interesting, but
also important to answer correctly if we want to support policies that are good for people.
When using statistics to answer such questions, we need always to remember a single big
idea: Correlation is not causation. Just because variable Y rises when variable X rises does
not mean that variable X causes variable Y to rise. The essential goal of statistics is to
figure out when we can say that changes in variable X will lead to changes in variable Y .
This book helps us learn how to identify causal relationships with three features seldom
found in other statistics textbooks. First, it focuses on the tools that researchers use most.
These are the real stats that help us make reasonable claims about whether X causes Y .
Using these tools, we can produce analyses that others can respect. We’ll get the most out
of our data while recognizing the limits in what we can say or how confident we can be.
Our emphasis on real stats means that we skip obscure statistical tools that could come
up under certain conditions: They are not here. Statistics is too often complicated by books
and teachers trying to do too much. This book shows that we can have a sophisticated
understanding of statistical inference without being able to catalog every method that our discipline has ever produced.
Second, this book works with a single unifying framework. We don’t start over with each
new concept; instead we build around a core model. That means there is a single equation
and a unifying set of assumptions that we poke, probe, and expand throughout the book.
This approach reduces the learning costs of moving through the material and allows us to
go back and revisit material. Like any skill, it is unlikely that we will fully understand any
given technique the first time we see it. We have to work at it, we have to work with it.
We’ll get comfortable, we’ll see connections. Then it will click. Whether it is jumping rope,
typing, throwing a baseball, or analyzing data, we have to do things many times to master
the skill. By sticking to a unifying framework, we have more chances to revisit what we have
already learned. You’ll also notice that I’m not afraid to repeat myself on the important points.
Third, this book uses many examples from the policy, political, and economic worlds. So
even if you do not care about “two stage least squares” or “maximum likelihood” in and
of themselves, you will see how understanding these techniques will affect what you think
about education policy, trade policy, the determinants of election outcomes and many other
interesting issues. The examples make it clear that the statistical tools developed in this
book are being used by contemporary social scientists who are actually making a difference in the world.
This book is for people who care about policy, politics, economics, and law. Many will
come to it as a course textbook. Others will find it useful as a supplement for a course that
would benefit from more intuition and context. Others will come to it outside of school, as
more and more public policy and corporate decisions are based on statistical analysis. Even
sports are discussed in statistical terms as a matter of course. (I no longer spit out my coffee
when I come across an article on regression analysis of National Hockey League players.)
The preparation necessary to use this book successfully is modest. We use basic algebra
a fair bit, being careful to explain every step. You do not need calculus to use this book. We
refer to calculus when necessary and the book certainly could be used by a course that works
through some of the concepts using calculus, but you can understand everything without
knowing calculus.
We start with two introductory chapters. Chapter 1 opens the book by laying out the
challenge of statistical inference. This is the challenge of making probabilistic yet accurate
claims about causal relations between variables. We present experiments as an ideal way
to conduct research, but also show how experiments in the real world are tricky and can’t
answer every question we care about. This chapter provides the “big picture” context for
statistical analysis that is every bit as important as the specifics that follow.
Chapter 2 covers the practical side of statistical analysis. In statistical analysis, data meets software and if we’re not careful we lose control. This chapter
therefore seeks to inculcate good habits about documenting analysis and understanding data.
Part One consists of five chapters that constitute the heart of the book. They introduce
ordinary least squares (OLS), also known as regression analysis. Chapter 3 introduces the
most basic regression model, the bivariate OLS model. Chapter 4 shows how to use statistical
results to test hypotheses. Chapters 5 through 7 introduce the multivariate OLS model and
applications. By the end of Part One, you will understand regression and be able to control
for anything you can measure. You’ll also be able to fit curves to data and assess whether
the effects of some variables differ across groups, among other very practical and cool skills.
Part Two introduces techniques that constitute the modern statistical tool kit. These
are the techniques people use when they want to get published – or paid. These techniques
build on multivariate OLS to give us a better chance of identifying causal relations between
two variables. Chapter 8 covers a simple yet powerful way to control for many factors we
can’t measure directly. Chapter 9 covers instrumental variable techniques, which work if we
can find a variable that affects our independent variable, but not our dependent variable.
Instrumental variable techniques are a bit funky, but they can be very useful for isolating
causal effects. Chapter 10 covers randomized experiments. Such experiments are ideal in
theory, but in practice they often raise a number of statistical challenges we need to address.
Chapter 11 covers regression-discontinuity tools that can be used when we’re studying the
effect of variables that were allocated based on some fixed rule. For example, Medicare is
available to people in the United States only when they turn 65; admission to certain private
schools depends on a test score exceeding some threshold. Focusing on policies that depend
on such thresholds turns out to be a great context for conducting credible statistical analysis.
Part Three covers dichotomous dependent variable models. These are simply models
where the outcome we care about takes on two possible values. Examples include high
school graduate (someone graduates or not), unemployment (someone has a job or not), and
alliances (two countries sign an alliance treaty or not). We show how to apply OLS to such
models and then provide more elaborate models that address the deficiencies of OLS in this
context.
Part Four supplements the book with additional useful material. Chapter 14 derives im-
portant OLS results and extends discussion on specific topics that are quite useful. Chapter
13 covers time series data. The first part is a variation on OLS; the second part introduces
dynamic models that differ from OLS models in important ways. Chapter 15 goes into
greater detail on the vast literature on panel data. This chapter makes sense of how the many approaches to panel data fit together.
The book is designed to help you master the material. Each section ends with a “Re-
member This” box that highlights the key points. If you do as the box says and remember
what’s in it, you’ll have a great foundation in statistics. The glossary at the end of the book defines the key terms.
There are also discussion questions at the end of selected sections. I recommend using
these. There are two ways to learn: asking questions and answering questions. Asking
questions helps keep us engaged and on track. Answering questions helps us be realistic
about whether we’re truly on track. What we’re fighting is something cognitive psychologists
call the “illusion of explanatory depth.” That’s a fancy way of saying we don’t always know
as much as we think we do. By answering the discussion questions we can see where we
are. Answers for selected discussion questions are at the end of the book. Many of the
discussion questions also allow us to see how the concepts apply to issues we care about and,
once invested in this way, we’re no longer doing statistics for the sake of doing statistics, but learning about the issues themselves.
Finally, you may have noticed that this book is opinionated and a bit chatty. This is not
the usual tone of statistics books, but being chatty is not the same as being dumb. You’ll see
real material, with real equations and real research – just with a few more smart-ass asides
than you may see in other stats books. This approach makes the material more accessible and
also reinforces the right mindset: Statistics is not simply a set of mathematical equations.
Instead statistics provides a set of practical tools that curious people use to learn from the
world. But don’t let the tone fool you. This book is not Statistics for Dummies; it’s Real
Stats. Learn the material and you will be well on your way to using statistics to answer
important questions.
CHAPTER 1
THE QUEST FOR CAUSALITY

A hunch or something that we simply “know” may be important, but it is not the same as evidence.
What is the basis of our evidence? In some cases, we can see cause and effect. We see a
burning candle tip over and start a fire. Now we know what caused the fire. This is perfectly
good knowledge. Sometimes in politics and policy we trace back a chain of causality in a
similar way. This process can get complicated, though. Why did Barack Obama win the
presidential election in 2008? Why did some economies handle the most recent recession
better than others? Why did crime go down in the United States in the 1990s? For these
types of questions, we are looking not only at a single candle; there are lightning strikes,
faulty wires, arsonists, and who knows what else to worry about. Clearly, it will be much harder to sort out what caused what.
When there is no way of directly observing cause and effect, we naturally turn to data.
And data holds great promise. A building collapses during an earthquake. What about the
building led it – and not others in the same city – to collapse? Was it the building material?
The height? The design? Age? Location near a fault? While we might not be able to see
the cause directly, we can gather information on buildings that did and did not collapse. If
the older buildings were more likely to collapse, we might reasonably suspect that building
age mattered. If buildings built without steel reinforcement collapsed no matter what their
age, we might reasonably suspect that buildings with certain designs were more likely to
collapse.
And yet, we should not get overconfident. Even if old buildings were more likely to
collapse we do not know for certain that age of the building is the main explanation for
building collapse. It could be that more buildings from a certain era were designed a certain
way; it could be that there were more old buildings in a neighborhood where the seismic
activity was most severe. Or it could have been a massive coincidence. The statistics we learn
in this book will help us identify causes and make claims about what really mattered.

[Figure 1.1: xkcd comic number 552 on correlation and causation, http://xkcd.com/552/]
As Figure 1.1 makes clear, correlation is not causation. This statement is old news. Our
task is to go to the next step – “Well, then, what does imply causation?” It will take
the whole book to fully flesh out the answer, but here’s the short version: If we can find
exogenous variation, then correlation is probably causation. Our task then will be to figure
out what exogenous variation means and how to distinguish randomness from causality as
best we can.
In this chapter we introduce three concepts at the heart of the book. Section 1.1 explains
the core model we use throughout the book. Section 1.2 introduces two things that make
statistics difficult. Neither is math (really!). One is randomness: Sometimes the luck of the
draw will lead us to observe relationships which aren’t real or fail to observe relationships
that are real. The second is endogeneity, a phenomenon that can cause us to wrongly think a
variable causes some effect when it doesn’t. Section 1.3 introduces randomized experiments
as the ideal way to overcome endogeneity. Usually, these experiments aren’t possible and even
when they are, things can go wrong. Hence, the rest of the book is about developing a tool
kit that helps us meet (or approximate) the idealized standard of randomized experiments.
When we talk about cause and effect we’ll refer to the outcome of interest as the dependent variable. The dependent
variable is usually denoted as Y , called that because its value depends on the independent
variable. The independent variable, usually denoted by X, is called that because it does
whatever the hell it wants. It is the presumed cause of some change in the dependent
variable.
At root, social scientific theories posit that a change in one thing (the independent vari-
able) will lead to a change in another (the dependent variable). We’ll formalize this relation-
ship in a bit, but let’s start with an example. Suppose we’re interested in the U.S. obesity
epidemic and want to analyze the influence of snack food on health. We may wonder, for
example, if donuts cause health problems. Our model is that eating donuts (variable X, our
independent variable) causes some change in weight (variable Y , our dependent variable). If
we can find data on how many donuts people ate and how much they weighed, we might be able to learn something about the relationship.
Let’s conjure up a small Midwestern town and do a little research. Figure 1.2 plots donuts
eaten and weights for 13 individuals from a randomly chosen town, Springfield, U.S.A. Our
raw data is displayed in Table 1.1. Each person has a line in the table. Homer is observation
1. Since he ate 14 donuts per week, Donuts1 = 14. We’ll often refer to Xi or Yi , which are
the values of X and Y for person i in the dataset. The weight of the seventh person in the dataset, for example, is denoted Weight7.
[Figure 1.2: Scatterplot of donuts eaten per week (horizontal axis, 0 to 20) and weight in pounds (vertical axis) for the 13 Springfield residents, each point labeled with the person’s name]
Figure 1.2 is a scatterplot of data, with each observation located at the coordinates
defined by the independent and dependent variable. The value of donuts per week is on
the X-axis and weights are on the Y-axis. Just by looking at this plot, we sense there is a
positive relationship between donuts and weight because the more donuts eaten, the higher the weight tends to be.
We use a simple equation to characterize the relationship between the two variables:

Weighti = β0 + β1 Donutsi + εi (1.1)

where
• The independent variable, Donutsi , is how many donuts person i eats per week.
• β1 is the slope coefficient on donuts indicating how much more (or less; be optimistic!) a person weighs for each donut eaten. (For those whose Greek is a bit rusty, β is the Greek letter beta.)
• β0 is the constant or intercept indicating the expected weight of people who eat zero donuts.
• εi (the Greek letter epsilon) is the error term that captures anything else that affects weight.
This equation will help us estimate the two parameters necessary to characterize a line.
Remember Y = mX + b from junior high? This is the equation for a line where Y is the
value of the line on the vertical axis, X is the value on the horizontal axis, m is the slope
and b is the intercept, the value of Y when X is zero. Equation 1.1 is essentially the same,
only we refer to the “b” term as β0 and we call the “m” term β1.
Figure 1.3 shows an example of an estimated line from this model for our Springfield
data. The intercept (β0) is the value of weight when donut consumption is zero (X = 0). The slope (β1) is the amount that weight increases for each donut eaten. In this case, the intercept is about 122, which means that the average weight for those who eat zero donuts is around 122 pounds. The slope is around 9.1, which means that for each donut eaten per week, expected weight is about 9.1 pounds higher.

[Figure 1.3: The Springfield donut and weight data with a fitted line, marking the intercept (β0) and the slope (β1)]
More generally, the model is

Yi = β0 + β1 Xi + εi (1.2)
where β0 is the intercept that indicates the value of Y when X = 0 and β1 is the slope that indicates how much change in Y is expected if X increases by one unit. We almost always care a lot about β1, which characterizes the relationship between X and Y. We usually don’t care a whole lot about β0. It plays an important role in helping us get the line in the right
place, but it is seldom the case that our core research interest is to determine the value of
Y when X is zero.
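To make the notation concrete, here is a minimal computational sketch. It is not from the book’s Computing Corner; the data are simulated and the parameter values (122 and 9.1) are simply chosen to echo the Springfield example. It generates data from a model of the form in Equation 1.2 and then estimates β0 and β1 by fitting a line with ordinary least squares.

```python
import numpy as np

# Simulate data from Y_i = beta_0 + beta_1 * X_i + epsilon_i
rng = np.random.default_rng(0)
n = 13                                # same number of people as the Springfield example
beta_0, beta_1 = 122.0, 9.1           # "true" intercept and slope, chosen for illustration
donuts = rng.uniform(0, 20, size=n)   # X: donuts eaten per week
epsilon = rng.normal(0, 25, size=n)   # error term: everything else that affects weight
weight = beta_0 + beta_1 * donuts + epsilon

# Fit a line by ordinary least squares; np.polyfit returns [slope, intercept]
beta_1_hat, beta_0_hat = np.polyfit(donuts, weight, deg=1)
print(f"estimated intercept: {beta_0_hat:.1f}, estimated slope: {beta_1_hat:.1f}")
```

With only 13 observations the estimates bounce around the true values from sample to sample, which previews the discussion of randomness below.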
We see that the actual observations do not fall neatly on the line that we’re using to
characterize the relationship between donuts and weight. The implication is that our model
does not perfectly explain the data. Of course it doesn’t! Springfield residents are much too
complicated for donuts to explain them completely (except, apparently, Comic Book Guy).
The error term, εi, comes to the rescue by giving us some wiggle room. It is what is
left over after the variables have done their work in explaining variation in the dependent
variable. In doing this service, the error term plays an incredibly important role for the entire
statistical enterprise. As this book proceeds, we will keep coming back to the importance of the error term.

The error term, ε, is not simply a Greek letter. It is something real. What it covers
depends on the model. In our simple model — in which weight is a function only of how
many donuts a person eats — oodles of factors are contained in the error term. Basically,
anything else that affects weight will be in the error term: sex, height, other eating habits,
exercise patterns, genetics, and on and on. The error term includes everything we haven’t explicitly included in the model.

We’ll often see ε referred to as random error, but be careful about that phrase. Yes, for
the purposes of the model we are treating the error term as something random, but it is not
simply random in the sense of a roll of the dice. It is random more in the sense that we
don’t know what it will be for any individual. But every error term reflects, at least in part,
some relationship to real things that we have not measured or included in the model. We will come back to this idea repeatedly.
Remember This
Our core statistical model is

Yi = β0 + β1 Xi + εi

1. β1, the slope, indicates how much change in Y (the dependent variable) is expected if X (the independent variable) increases by one unit.
2. β0, the intercept, indicates where the regression line crosses the Y-axis. It is the value of Y when X is zero.
3. β1 is almost always more interesting than β0.
[Figure 1.4: Four scatterplots, panels (a) through (d), each showing a fitted line with a different intercept and slope]
Discussion Questions
For each of the panels in Figure 1.4, determine whether β0 and β1 are greater
than, equal to, or less than zero. (Be careful with β0 in panel (d)!)
Understanding that there are real factors in the error term helps us be smart about making
causal claims. Our data seems to suggest that the more donuts people ate, the more they
packed on the pounds. It’s not crazy to think that donuts cause weight gain.
But can we be certain that donuts, and not some other factor, cause weight gain? Two
fundamental challenges in statistical analysis should make us cautious. The first challenge
is randomness. Any time we observe a relationship in data, we need to keep in mind that
some coincidence could explain it. Perhaps we happened to pick some unusual people for
our data set. Or perhaps we picked perfectly representative people, but they had happened to have unusual weights or eating habits when we observed them.
In the donut example, the possibility of such randomness should worry us, at least a little.
Perhaps the people in Figure 1.3 are a bit odd. Perhaps if we had more people, we might
get more heavy folks who don’t eat donuts and skinny people who scarf them down. Adding
those folks to the figure would change the figure and our conclusions. Or perhaps even with
the set of folks we observe, we might have gotten some of them on a bad (or good) day and measured values that do not reflect their usual habits.
Therefore every legitimate statistical analysis will account for randomness in an effort to
distinguish results that could happen by chance from those that would be unlikely to happen
by chance. The bad news is that we will never escape the possibility that the results we
observe are due to randomness rather than some causal effect. The good news, though, is
that we can often do a pretty good job characterizing how confident we are that the results are not simply due to randomness.
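A small simulation (mine, not the book’s, with made-up numbers) illustrates the point: even when donuts have no effect on weight at all, a sample of 13 people can produce slopes that look meaningful purely by the luck of the draw.

```python
import numpy as np

# With a truly zero slope, chance alone can still produce sizable estimated slopes.
rng = np.random.default_rng(3)
n_people, n_samples = 13, 5

for s in range(n_samples):
    donuts = rng.uniform(0, 20, size=n_people)
    weight = 180 + rng.normal(0, 40, size=n_people)   # weight is unrelated to donuts
    slope, intercept = np.polyfit(donuts, weight, deg=1)
    print(f"sample {s + 1}: estimated slope = {slope:+.1f} (true slope is 0)")
```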
The other major statistical challenge arises from the possibility that an observed relation-
ship between X and Y is actually due to some other variable that causes Y and is associated
with X. In the donuts example, worry about scenarios where we could wrongly attribute
changes in weight caused by other factors to our key independent variable (in this case,
donut consumption). What if tall people eat more donuts? Height is in the error term as a
contributing factor to weight, and if tall people eat more donuts we may wrongly attribute the effect of height to donuts.
There are loads of other possibilities. What if men eat more donuts? What if exercise
addicts don’t eat donuts? What if people who eat donuts are also more likely to down a
tub of Ben and Jerry’s ice cream every night? What if thin people can’t get donuts down
their throats? Being male, exercising, binging on ice cream, having itty-bitty throats – all
these things are probably in the error term (meaning they affect weight) and all could be associated with donut consumption.
Speaking statistically, we highlight this major statistical challenge by saying that donut consumption is endogenous. Endogenous independent variables
are related to factors in the error term. The prefix “endo” refers to something internal, and
endogenous variables are “in the model” in the sense that they are related to other things that also determine Y.
In the donuts example, donut consumption is likely endogenous because how many donuts
a person eats is not independent of other factors that influence weight gain. Factors that
cause weight gain (such as eating Ben and Jerry’s ice cream) might be associated with
donut eating; in other words, factors that influence the dependent variable Y might also be
associated with the independent variable X, muddying the connection between correlation
and causation. If we can’t be sure that our variation in X is not associated with factors
that influence Y , we worry about wrongly attributing to X the causal effect of some other
variable. We might wrongly conclude donuts cause weight gain when really donut eaters are
more likely to eat tubs of Ben and Jerry’s, which is the real culprit.
In all these examples, something in the error term that really causes weight gain is related
to donut consumption. When this situation arises, we risk spuriously attributing to donut
consumption the causal effect of some other factor. Remember, anything not measured in
the model is in the error term and here, at least, we have a wildly simple model in which
only donut consumption is measured. So Ben and Jerry's, genetics, and everything else are in the error term.
Or consider a different question: does raising teacher salaries increase test scores? It's an important and timely question. Answering it may seem
easy enough: We could simply see if test scores (a dependent variable) are higher in places
where teacher salaries (an independent variable) are higher. It’s not that easy, though, is it?
Endogeneity lurks. Test scores might be determined by unmeasured factors that also affect
teacher salaries. Maybe school districts with lots of really poor families don’t have very
good test scores and don’t have enough money to pay teachers high salaries. Or perhaps
the relationship is the opposite, with poor school districts getting extra federal funds to
pay teachers more. Either way, teacher salaries are endogenous because their levels depend
in part on factors in the error term (like family income) that affect educational outcomes.
Simply looking at test scores' relationship to teacher salaries risks confusing the effect of teacher pay with the effect of factors like family income that sit in the error term.
In contrast, an independent variable is exogenous if changes in it are not related to factors in the error term. The prefix "exo" refers to something
external, and exogenous variables are “outside the model” in the sense that their values are
unrelated to other things that also determine Y . For example, if we use an experiment to
randomly set the value of X, then changes in X are not associated with factors that also
determine Y. This gives us a clean view of the relationship between X and Y, unmuddied by endogeneity. If we
succeed, we can be more confident that we have moved beyond correlation and closer to
understanding if X causes Y – our fundamental goal. This process is not automatic or easy.
Often we won’t be able to find purely exogenous variation, so we’ll have to think through
2 A good idea is to measure these things and put them in the model so that they are no longer in the error term.
That’s what we do in Chapter 5.
how close we can get. Nonetheless, the bottom line is this: If we can find exogenous variation
in X we can use data to make a reasonable inference about what will happen to variable Y
if we change variable X.
To formalize these ideas we’ll use the concept of correlation. It is a concept most people
know, at least informally. Two variables are correlated (“co-related”) if they move together.
A positive correlation means that high values of one variable are associated with high values
of the other; a negative correlation indicates that high values of one variable are associated with low values of the other.
Figure 1.5 shows examples of variables that have positive correlation (panel (a)), no
correlation (panel (b)), and negative correlation (panel (c)). Correlations range from -1 to 1. A correlation of 1 means that the variables move perfectly together; a correlation of -1 means that they move perfectly in opposite directions.
Correlations close to zero indicate weak relationships between variables; a correlation of exactly zero means there is no linear relationship at all.
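To see these numbers in action, here is a minimal R sketch (using simulated data invented for illustration, not any data set from the book) that builds three pairs of variables and computes their correlations with R's cor() function:

    # Simulate three scenarios: positive, zero, and negative correlation
    set.seed(123)                       # make the simulation reproducible
    x = runif(200)                      # values between 0 and 1
    y.pos = x + rnorm(200, sd = 0.2)    # moves with x: positive correlation
    y.none = rnorm(200)                 # unrelated to x: correlation near zero
    y.neg = -x + rnorm(200, sd = 0.2)   # moves opposite x: negative correlation
    cor(x, y.pos)    # close to +1
    cor(x, y.none)   # close to 0
    cor(x, y.neg)    # close to -1

With only 200 observations the computed correlations will not be exactly +1, 0, and -1; that wiggle is precisely the randomness discussed above.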
[Figure 1.5: Scatterplots illustrating positive correlation (panel (a)), no correlation (panel (b)), and negative correlation (panel (c)).]
Why does correlation matter here? If our independent variable has a relationship to the error term like the one in panel (a) (which shows positive
correlation) or panel (c) (which shows negative correlation), then we have endogeneity. In
other words, when the unmeasured stuff that constitutes the error term is correlated with our
independent variable, we have endogeneity, which will make it difficult to tell whether our variable or something lurking in the error term is driving changes in the dependent variable.
On the other hand, if our independent variable has no relationship to the error term as
in panel (b), then we have exogeneity. In this case, if we observe Y rising with X, we can be more confident that X is actually causing Y.
The challenge is that the true error term is not observable. Hence much of what we do in
statistics attempts to get around the possibility that something unobserved in the error term
may be correlated with the independent variable. This quest makes statistics challenging
and interesting.
As a practical matter, we should begin every analysis by assessing endogeneity. First, look
away from the model for a moment and list all the things that could determine the dependent
variable. Second, ask if anything on the list correlates with the independent variable in the
model and explain why it might. That’s it. Do that and we are on our way to identifying
endogeneity.
Remember This
1. There are two fundamental challenges in statistics: randomness and endogeneity.
2. Randomness can produce data in which it looks like there is a relationship between
X and Y even when there is not, or in which it looks like there is no relationship even when
there is one.
3. An independent variable is endogenous if it is correlated with the error term in
the model.
(a) An independent variable is exogenous if it is not correlated with the error
term in the model.
(b) The error term is not observable, making it a challenge to know if an inde-
pendent variable is endogenous or exogenous.
The flu kills many people every year, especially the elderly. At the same time, no one enjoys schlepping down to some hospital basement or drugstore lobby, rolling up a shirt sleeve, and getting a flu shot. Nonetheless, every year millions of people do it.
The evidence that flu shots prevent people from dying from the flu must be overwhelming.
Right? Suppose we start by considering a study using data on whether people died (the
dependent variable) and whether they got a flu shot (the independent variable). A simple model is

Death_i = β0 + β1 Flu shot_i + ε_i

where Death_i is a (creepy) variable that is 1 if person i died in the time frame of the study and 0 if he or she did not, and Flu shot_i is 1 if person i got a flu shot and 0 if not.4
A number of studies have done essentially this analysis and have found that people who
get flu shots are less likely to die. According to some estimates, those who receive flu shots
are as much as 50 percent less likely to die. This effect is enormous. Going home with a sore arm seems a small price to pay for such a benefit.
4 We discuss dependent variables that equal only zero or one in Chapter 12 and independent variables that equal
zero or one in Chapter 6.
But are we convinced? Is there any chance of endogeneity? If there exists some factor in
the error term that affected whether or not someone died and whether he or she got a flu shot, then we have endogeneity.
What is in the error term? Goodness, lots of things affect the probability of dying: age,
health status, wealth, cautiousness - the list is immense. All of these factors and more are lumped into the error term.
How could these factors cause endogeneity? Let’s focus on overall health. Clearly, health-
ier people die at a lower rate than unhealthy people. If healthy people are also more likely
to get flu shots, we might erroneously attribute life-saving power to flu shots when perhaps
all that is going on is that people who are healthy in the first place tend to get flu shots.
It’s hard, of course, to get measures of health for people, so let’s suppose we don’t have
them. We can, however, speculate on the relationship between health and flu shots. Figure
1.6 shows two possible states of the world. In each figure we plot flu-shot status on the
X-axis. If someone did not get a flu shot, he’s in the 0 group; if someone got a flu shot,
she’s in the 1 group. On the Y-axis we plot health related to everything but flu (supposing
we could get some index that factors in age, heart health, absence of disease, etc.). In panel
(a), health and flu shots don’t seem to go together; in other words the correlation is zero. If
panel (a) represents the state of the world, then our results that flu shots are associated with
lower death rates are looking pretty good because flu shots are not reflecting overall health.
[Each panel plots a health index (0 to 10) on the Y-axis against flu-shot status (0 = no shot, 1 = shot) on the X-axis.]
FIGURE 1.6: Two Scenarios for the Relationship Between Flu Shots and Health
In panel (b), health and flu shots do seem to go together, with the flu shot population being
healthier. In this case, we have correlation between our main variable (flu shots) and something in the error term (overall health); in other words, we have endogeneity.
Brownlee and Lenzer (2009) discuss some indirect evidence suggesting that flu shots and
health are actually correlated. A clever approach to assessing this matter is to look at death
rates of people in the summer. The flu rarely kills people in the summer which means that if
people who get flu shots also die at lower rates in the summer, it is because they are healthier
overall; if people who get flu shots die at the same rates as others during the summer, it
would be reasonable to suggest that the flu-shot and non-flu-shot populations have similar
health. It turns out that people who get flu shots have an approximately 60 percent lower death rate even in the summer, which suggests that they are healthier to begin with.
Other evidence backs up the idea that healthier people get flu shots. As it happened,
vaccine production faltered in 2004 and 40 percent fewer people got vaccinated. What
happened? Flu deaths did not increase. And in some years, the flu vaccine was designed
to attack a different set of viruses than actually spread in those years; again, there was no
clear change in mortality. This data suggests that the reason people who get flu shots live
longer might be that getting flu shots is associated with other healthy behavior, such as eating well, exercising, and seeing a doctor regularly.
The point is not to put us off flu shots. We’ve discussed only mortality—whether people
die from the flu—not whether they’re more likely to contract the virus or stay home from work
because they are sick.5 The point is to highlight how hard it is to really know if something
(in this case, a vaccine) works. If something as widespread and seemingly straightforward
as flu shots is hard to assess definitively, think about the care we must take when trying
to analyze policies that affect fewer people and have more complicated effects.
In another example, researchers Snipes and Maguire (1995) turned to data to assess one particular question: does country music depress us? They argued
that country music, with all its lyrics about broken relationships and bad choices, may be
so depressing that it increases suicide rates.6 We can test this claim with the following
statistical model:

Suicide rate_i = β0 + β1 Country music_i + ε_i

where Suicide rate_i is the suicide rate in metropolitan area i and Country music_i is the proportion of radio airtime devoted to country music in area i.
5 See, for example, DiazGranados, Denis, and Plotkin (2012) and Osterholm, Kelley, Sommer, and Belongia (2012)
for evidence on the flu vaccine based on randomized experiments.
6 Really, this is an actual published paper; see the endnotes at the end of the book for details.
It turns out that suicides are indeed higher in metro areas where radio stations play more
country music. Do we believe this is a causal relationship? (In other words, is country music
exogenous?) If radio stations play more country music, should we expect more suicides?
What does β0 mean? What does β1 mean? In this model, β0 is the expected level of suicide rates in metropolitan areas that play no country music. β1 is the amount by which suicide rates change for each one-unit increase in the proportion of country music played in a metropolitan area. We don't know what β1 is; it could be positive (suicides go up), zero (no relation to suicides), or negative (suicides decrease). For the record, we don't know what β0 is either, but we are less interested in it because it does not directly characterize the relationship between country music and suicides.
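To make the roles of β0 and β1 concrete, here is a rough R sketch that fits a bivariate model of this form to entirely made-up numbers (the values and variable names below are invented for illustration and are not the Snipes and Maguire data). In the lm() output, the "(Intercept)" row estimates β0 and the country.music row estimates β1.

    # Hypothetical data: proportion of country-music airtime and suicide rates
    # (all numbers invented for illustration only)
    country.music = c(0.05, 0.10, 0.20, 0.25, 0.30, 0.40, 0.50)   # proportion of airtime
    suicide.rate = c(10.2, 10.8, 11.5, 11.1, 12.3, 12.9, 13.4)    # per 100,000 people

    fit = lm(suicide.rate ~ country.music)   # bivariate OLS
    summary(fit)   # "(Intercept)" estimates beta0; "country.music" estimates beta1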
What is in the error term? The error term contains factors that are associated with higher
suicide rates, such as alcohol and drug use, availability of guns, divorce and poverty rates,
lack of sunshine, lack of access to mental health care, and probably much more.
Recall that an independent variable is endogenous if it is correlated with factors in the error term. Therefore, we need to ask whether the amount
7 Their analysis is based on a more complicated model, but this is the general idea.
of country music played on radio stations in metropolitan areas is correlated with drinking,
drug use, and all the other stuff in the error term.
Is the independent variable likely to be endogenous? Are booze, divorce, and guns likely to be
correlated with the amount of country music? Have you listened to any country music? Drinking
and divorce come up now and again. Could this music appeal more in areas where people
drink too much and get divorced more frequently? (To complicate matters, country-music
lyrics also feature more about family and religion than most other types of music.) Or,
could it simply be the case that people in rural areas who like country music also have a lot
of guns? And all of these factors – alcohol, divorce and guns – are plausible influences on
suicide rates. To the extent that country music is correlated with any of them, it would be
endogenous.
Explain how endogeneity could lead to incorrect inferences. This endogeneity could lead to
incorrect inferences in the following way. Suppose for a moment that country music has
no effect whatsoever on suicide rates, but that regions with lots of guns and drinking also
have more suicides and that people in these regions also listen to more country music. If we
look only at the relationship between country music and suicide rates, we will see a positive
relationship: places with lots of country music will have higher suicide rates and places
with little country music will have lower suicide rates. The explanation could be that the
country-music areas have lots of drinking and guns and the areas with little country music
have less drinking and fewer guns. Therefore while it may be correct to say that there are
more suicides in places where there is more country music, it would be incorrect to conclude
that country music causes suicides. Or, to put it another way, it would be incorrect to predict that suicides would fall if radio stations played less country music.
As it turns out, Snipes and Maguire (1995) account for the amount of guns and divorce in
metropolitan areas and find no relationship between country music and metropolitan suicide
rates. So there’s no reason to turn off the radio and put away those cowboy boots.
Discussion Questions
1. Labor economists often study the returns to education (see, e.g., Card
(1999)). Suppose we have data on salaries of a set of people, some of
whom went to college and some who did not. A simple model linking
education to salary is

Salary_i = β0 + β1 College_i + ε_i
The best way to fight endogeneity is to have exogenous variation. A good way to have
exogenous variation is to create it. If we do so, we know that our independent variable is exogenous. In the
donut example, we could randomly pick people and force them to eat donuts while forbidding
everyone else from eating donuts. If we can pull this experiment off, the amount of donuts
a person eats will be unrelated to other unmeasured variables that affect weight. The only
thing that would determine donut eating would be the luck of the draw. The donut-eating
group would have some ice cream bingers, some health-food nuts, some runners, some round-
the-clock video-game players, and so on. So, too, would the non-donut-eating group. There
wouldn’t be systematic differences in these unmeasured factors across groups. Both treated
and untreated groups would be virtually identical and would resemble the composition of
the population.
If we observe that donut eaters weigh more or have other health differences from the non-donut eaters, we can reasonably attribute those differences to the donuts.
Simply put, the goal of such a randomized experiment is to make sure the independent
variable, which we also call the treatment, is exogenous. The key element of such experiments is that the value of the treatment is determined by
a random process; its value will depend on nothing but chance. In this case, the independent
variable will be uncorrelated with everything, including any factor in the error term affecting the dependent variable. Analyzing the relationship between an exogenous independent variable and the dependent
variable allows us to make inferences about a causal relationship between the two variables.
This is one of those key moments when a concept may not be that complicated, but the
implications are enormous. By randomly picking some people to get the treatment, we rule
out the possibility that there is some other way for the independent variable to be associated
with the dependent variable. If the randomization is successful, the treated subjects are not
systematically more athletic, taller, or food conscious – or more left-handed or stinkier, for
that matter.
The basic structure of a randomized experiment is simple. Based on our research question,
we identify a relevant population that we randomly split into two groups: a treatment
group that receives the policy intervention and a control group that does not. After the
treatment, we compare the behavior of the treatment and control groups on the outcome we
care about. If the treatment group differs substantially from the control group, we believe
the treatment had an effect; if not, then we’re inclined to think the treatment had no effect.8
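A small simulation helps show why this works. The R sketch below (with a made-up treatment effect and made-up outcome numbers, not data from any real experiment) flips a coin to assign each subject to treatment or control and then compares average outcomes; because assignment is random, the difference in means recovers the treatment effect on average.

    set.seed(42)
    n = 1000
    treated = rbinom(n, 1, 0.5)                     # coin flip: 1 = treatment group, 0 = control group
    outcome = 60 + 5*treated + rnorm(n, sd = 10)    # assumed true effect of 5, plus random noise

    # Difference in means between treated and control; should be close to 5
    mean(outcome[treated == 1]) - mean(outcome[treated == 0])
    t.test(outcome ~ treated)                       # formal comparison of the two groups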
Suppose, for example, that we want to know whether a television ad increases enrollment in Obamacare. We would identify a sample of uninsured people and split them into a treatment
group that is exposed to the ad and a control group that is not. After the treatment, we
compare the enrollment in Obamacare of the treatment and control groups. If the treated
group enrolled at a substantially higher rate, that outcome would suggest the ad works.
Because they build exogeneity into the research, randomized experiments are often re-
ferred to as the gold standard for causal inference. The phrase “gold standard” usually means
the best of the best. But experiments also merit the gold standard moniker in another sense.
No country in the world is actually on a gold standard. The gold standard doesn’t work
well in practice and, for many research questions, neither do experiments. Simply put, ex-
periments are great, but they can be tricky when applied to real people going about their
business.
The human element of social-scientific experiments makes them very different from exper-
iments in the physical sciences. My third grader’s science-fair project compared cucumber
seeds planted in peanut butter to those planted in dirt. She did not have to worry that the
cucumber seeds would get up and say “There is NO way you are planting me in that.” In
the social sciences, though, people can object, not only to being planted in peanut butter,
but also to things like watching TV commercials or attending a charter school or changing
health care plans or pretty much anything else we want to study with an experiment.
Therefore an appreciation of the virtues of experiments should come also with recognition
of their limits. We devote Chapter 10 to discussing the analytical challenges that accompany
experiments. No experiment should be designed without thinking through these issues and
every experimental result should be judged by how well it deals with them.
There are other reasons social-scientific experiments can’t answer all social-scientific re-
search questions. The first is that experiments aren’t always feasible. The financial costs of
many experiments are beyond what most major research organizations can fund, let alone
what a student doing a term paper can afford. And for many important questions, it’s not a
matter of money. Do we want to know if corruption promotes civil unrest? Good luck with
our proposal to randomly end corruption in some countries and not others. Do we want to
know if birth rates affect crime? Are we really going to randomly assign some regions to
have more babies? While the randomizing process could get interesting, we're unlikely to get such an experiment off the ground.
And even if an experiment is feasible, it might not be ethical. We see this dilemma most
clearly in medicine – if we believe some treatment is better, but we are not sure, how ethical
is it to randomly subject some people to a procedure that might not work? The medical
community has evolved standards relating to level of risk and informed consent by patients, but the ethical dilemmas do not disappear.
Consider flu shots. We may think that assessing the efficacy of this public-health measure
is simple. Get a bunch of people who want a flu shot, tell them they are participating in a
random experiment, and randomly give some a flu shot and the others a placebo shot. Wait and see how the two groups do.
But would such a randomized trial of flu vaccine be ethical? When we say “Wait and
see how the two groups do,” we actually mean “Wait and see who dies.” That changes
the stakes a bit, doesn’t it? The public-health community strongly believes in the efficacy
of the flu vaccine and, given that belief, considers it unethical to deny people the vaccine.
Brownlee and Lenzer recount in Atlantic Monthly how one doctor first told interviewers that
a randomized-experiment trial might be acceptable, then got cold feet and called back to say that he considered such a trial unethical.9
Finally, experimental results may not be generalizable. That is, a specific experiment
may provide great insight into the effect of a given policy intervention at a given time
and place, but how sure can we be that the same policy intervention will work somewhere
else? Jim Manzi, the author of Uncontrolled, argues that the most honest way to describe
experimental results is that treatment X was effective in a certain time and place in which
the subjects had the characteristics they did and the policy was implemented by people with
the characteristics they had. Perhaps people in different communities respond to treatments
differently. Or perhaps the scale of an experiment could matter: A treatment might work
when implemented on a small scale, but could fail if implemented more broadly.
Statisticians make this point by distinguishing between internal validity and external
9 Another flu researcher came to the opposite conclusion, saying, “What do you do when you have uncertainty?
You test ... We have built huge, population-based policies on the flimsiest of scientific evidence. The most unethical
thing to do is to carry on business as usual.”
validity. Internal validity refers to whether the inference is biased within the study itself; external validity refers to whether the results generalize to other settings. A well-run experiment is
internally valid, meaning the results will on average lead us to make the correct inferences
about the treatment and outcome in the context of the experiment. In other words, with
internal validity, we can say confidently that variable X is what causes the change in variable
Y . Even with internal validity, however, an experiment may not be externally valid, because
the causal relationship between the treatment and outcome could differ in other contexts.
That is, even if we have internally valid evidence from an experiment that aardvarks in
Alabama procreate more if they listen to Mozart, we can't really be sure aardvarks in Alaska will respond the same way.
Because experiments are not always feasible, ethical, or generalizable, they cannot always offer the final word for economic, policy, and political research. Hence
most scholars in most fields need to grapple with non-experimental data. Observational
studies use data that has been generated by non-experimental processes. In contrast to experiments, in
observational studies the data is what it is and we do the best we can to analyze it in a
sensible way. Endogeneity will be a chronic problem, but we are not totally defenseless in
the fight against it. The techniques we learn in this book help us to achieve, or at least
approximate, the exogeneity promised by randomized experiments even when we have only
observational data.
Remember This
1. Experiments create exogeneity via randomization.
2. Social-science experiments are complicated by practical challenges associated with
the difficulty of achieving randomization and full participation.
3. Experiments are not always feasible, ethical, or generalizable.
4. Observational studies use non-experimental data. They are necessary to answer
many questions.
Discussion Questions
1.4 Conclusion
The point of statistical research is almost always to learn if X (the independent variable)
causes Y (the dependent variable). If we see high values of Y when X is high and low values of Y when X is low, we may be tempted to conclude that X causes Y. Before doing so, we must account for the possibility
that the observed relationship could have arisen only by chance. Or, if X is endogenous,
we need to remember that interpreting the relationship between X and Y as causal could
be wrong, possibly completely wrong. When there is some other factor that causes Y and
is correlated with X, any relationship we see between X and Y may actually be due to that other factor rather than to X.
We spend the rest of this book accounting for uncertainty and battling endogeneity. Some
approaches, like randomized experiments, seek to create exogenous change. Other statistical
approaches, like multivariate regression, winnow down the number of other factors lurking
in the background that can cause endogeneity. These and other approaches have strengths,
weaknesses, tricks, and pitfalls. However, they are all united by a fundamental concern with randomness and endogeneity. Keeping those two problems in view will help us
understand the essential challenges of using statistics to better understand policy, economics,
and politics.
We are on track with the key concepts in this chapter when we can
• Section 1.2: Explain two major statistical challenges. Define endogeneity. Explain how endogeneity can undermine causal inference. Define exogeneity. Explain how exogeneity helps us identify causal effects.
• Section 1.3: Explain how experiments achieve exogeneity. Explain challenges and limi-
tations of experiments.
Key Terms
• Constant or intercept (7)
• Control group (257)
• Correlation (17)
• Dependent variable (4)
• Endogenous (14)
• Error term (7)
• Exogenous (16)
• External validity (36)
• Generalizable (35)
• Independent variable (4)
• Internal validity (36)
• Observational studies (36)
• Randomization (32)
• Scatterplot (7)
• Slope coefficient (7)
• Treatment group (257)
CHAPTER 2
Stats in the Wild: Good Data Practices
Experiments can produce some pretty messy data, if they are even possible at all. Observational data, which most researchers must rely on, is often messier still.
One example of data messiness occurred in 2009. Prominent economists Carmen Reinhart
and Ken Rogoff (2010) analyzed more than 3,700 annual observations of economic growth
from a large sample of countries. Panel (a) of Figure 2.1 depicts one of their key results.
[Figure 2.1: Average real GDP growth (percent) by public debt as a share of GDP (0-30%, 30-60%, 60-90%, above 90%); panel (a) shows the original Reinhart and Rogoff results, panel (b) the corrected data.]
It shows average GDP growth for countries grouped into four categories depending on the
ratio of public debt to GDP. The shocking finding was the dramatic way average economic
growth dropped off a cliff for countries with government debt above 90 percent of GDP. The
implication was obvious: Governments should be very cautious when using deficit spending
to fight unemployment.
There was one problem with their story. The data didn’t quite say what they said it did.
Herndon, Ash and Pollin (2014) did some digging and found that some observations had
been dropped, others were typos and, most ignominiously, some calculations in the Excel
spreadsheet containing the data weren’t what Reinhart and Rogoff intended. With the data
corrected, the figure changed to panel (b) of Figure 2.1. Not quite the same story. Economic
growth didn’t plummet once government debt passed 90% of GDP. While people can debate
whether the slope in the right panel is a bunny hill or an intermediate hill, it clearly is not the cliff that appeared in the original figure.
Reinhart and Rogoff’s discomfort can be our gain when we realize that even top scholars
can make data mistakes. Hence we need to be vigilant in making sure that we create habits
and structures to minimize mistakes and maximize the chance that others can find them if
we do.
Therefore, this chapter focuses on the crucial first steps for any statistical analysis. First,
we need to understand our data. Section 2.1 introduces tools for describing data and sniffing out problems. Second, we need to document what we do with our data: if
others can't re-create our results, they shouldn't believe them. Therefore Section 2.2 helps
us establish good habits so that our code is understandable to ourselves and others. Finally,
we sure as heck aren’t going to do all this work by hand. Therefore, Section 2.3 introduces
Experienced researchers know that data is seldom pristine. Something, somewhere is often
off, even in data sets that are well traveled in academic circles. This is especially true for data that has not been extensively cleaned and checked.
1 A deeper question is whether we should treat this observational data as having any causal force. Government
debt levels are likely related to other factors that affect economic growth, like institutional quality and wars. In other
words, government debt is likely endogenous, meaning we probably can’t draw any conclusions about the effects of
debt on growth without implementing techniques we cover later in this book.
Therefore the first rule of data analysis is to know our data. This rule sounds obvious and
simple but not everyone follows it, sometimes to their embarrassment. For each variable we
should know the number of observations, the mean, standard deviation and the minimum
and maximum values. Knowing this information gives us a feel for data, helping us know if
we have missing data and what the scales and ranges of the variables are. Table 2.1 shows an
example for the donut and weight data we discussed on page 5. The number of observations,
frequently referred to as “N” (for number), is the same for all variables in this example, but
it varies across variables in many data sets. We all know the mean (also known as average).
The standard deviation measures how widely dispersed the values of a variable are.3
The minimum and maximum tell us the range of the data and often point to screwy values that signal possible errors.
It is also helpful to look at the distribution of variables that take on only a few possible
values. Table 2.2 shows a frequency table for the male variable, a variable that equals 1
for men and 0 for women. The table indicates there are 9 men and 4 women. Fair enough.
2 Chris Achen (1982, 53) memorably notes “If the information has been coded by nonprofessionals and not cleaned
at all, as often happens in policy analysis projects, it is probably filthy.”
3 The appendix contains more details on page 768. Here's a quick refresher. The standard deviation of X is a
measure of the dispersion of X. The larger the standard deviation, the more spread out the values. It is calculated
as $\sqrt{\frac{1}{N}\sum_i (X_i - \bar{X})^2}$, where $\bar{X}$ is the mean of X. For each observation, we see how far it is from the mean. We then
square that value because for the purposes of calculating dispersion we don’t distinguish whether a value is below
the mean or above it; when squared, all these values become positive numbers. We then take the average of these
squared values. Finally, since they’re squared values, taking the square root of the whole thing brings the final value
back to the scale of the original variable.
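For readers who want to check the formula by hand, here is a quick R sketch with a few made-up numbers; note that R's built-in sd() function divides by N - 1 rather than N, so it differs slightly from the footnote's formula in small samples.

    x = c(2, 4, 4, 4, 5, 5, 7, 9)            # made-up values for illustration
    N = length(x)
    sqrt(sum((x - mean(x))^2) / N)           # the footnote's formula: divides by N
    sd(x)                                    # R's default: divides by N - 1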
Table 2.2: Frequency Table for Male Variable in Donut Data Set
Value   Observations
0       4
1       9
Suppose that our frequency table looked like Table 2.3 instead. Either we have a very manly
man in the sample or (more likely) we have a mistake in our data. The statistical tools we
use later in this book will not necessarily flag such issues, so it’s our responsibility to be
alert.
Table 2.3: Frequency Table for Male Variable in Second Donut Data Set
Value   Observations
0       4
1       8
100     1
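A frequency table like this takes one line in R; the sketch below uses a made-up male variable with a stray value of 100 to show how quickly the mistake jumps out.

    # Hypothetical gender indicator with one data-entry error
    male = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 100)
    table(male)           # the stray value of 100 appears immediately
    which(male == 100)    # locate the suspicious observation for inspection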
Graphing data is useful because it allows us to see relationships and unusual observa-
tions. The statistical tools we develop later quantify these relationships, but seeing them
[Figure 2.2: Scatterplot of weight (in pounds) against donuts eaten, with each observation labeled by character name.]
for ourselves is an excellent and necessary first step. For example, Figure 2.2 shows the
scatterplot we saw earlier of the weight and donut data. We can see that there does seem to be a positive relationship between donut eating and weight.
We also see some relationships that we might have missed without graphing. Lisa and
Bart are children and their weight is much lower; we'll probably want to account for that in our analysis.
Effective figures are clean, clear, and attractive. We point to some resources for effective
visualization in the Further Reading section at the end of the chapter, but here’s the bottom
line. Get rid of clutter. Don't overdo axis labels. Avoid abbreviations and jargon. Pick clear, informative titles and labels.
Remember This
1. A useful first step toward understanding data is to review sample size, mean,
standard deviation, and minimum and maximum for each variable.
2. Plotting data is useful for identifying patterns and anomalies in data.
2.2 Replication
At the heart of scientific knowledge is the replication standard. Research that meets a
replication standard can be duplicated based on the information provided at the time of
publication. In other words, an outsider could use information provided at the time of publication to re-create the results from scratch.
We need replication files to satisfy this standard. Replication files document exactly
how data is gathered and organized. When done properly, these files allow others to check
our work by following our steps and seeing if they get identical results.
Replication files also enable others to probe our analysis. Sometimes – often, in fact –
statistical results hinge on seemingly small decisions about what data to include, how to
deal with missing data, and so forth. People who really care about getting the answer right
will want to see if what we’ve done to our data and, realistically, will be wary until they see
for themselves that other reasonable ways of doing the analysis produce similar results. If a
certain coding or statistical choice substantially changes results, then we need to pay a lot more attention to that choice and be prepared to justify it.
Committing to a replication standard keeps our work honest. We need to make sure that
we make choices based on the statistical merits, not based on whether they produce the
answer we want. If we give others the means to check our work, we're less likely to fall to such temptations.
Every statistical project, whether a class assignment or a million-dollar consulting project, should start with replication files. One file is a data code-
book that documents the data. Sometimes this file simply notes the website and date the
data was downloaded. Often, though, the codebook will include information about variables
that come from multiple sources. The codebook should note the source of the data, the type
of data, who collected it, and any adjustments the researcher made. For example, is the
data measured in nominal or real dollars? If it is in real dollars, which inflation deflator has
been used? Is the data measured in fiscal year or calendar year? Have missing observations
been imputed? If so, how? Losing track of this information can lead to frustrating and time-consuming mistakes.
Table 2.4 contains a sample of a codebook for a data set on height and wages.4 The
data set was used to assess whether tall people get paid more. It is pretty straightforward,
covering how much money people earned, how tall they were, and their activities in high
school. We see, though, that details matter. The wages are stated in dollars per hour, which
itself is calculated based on information from an entire year of work. We could imagine data
on wages in other data sets being expressed in terms of dollars per month or year or in terms
of wages at the time the question was asked. There are two height variables, one measured
in 1981 and the other measured in 1985. The athletics variable indicates whether the person
participated in athletics or not. Given the coding, a person who played multiple sports will
have the same value for this variable as a person who played one sport. Such details are exactly the kind of information a codebook should record.
A second replication file should document the statistical analysis, usually by providing
the exact code used to generate the results. Which commands were used to produce the
analysis? Sometimes the file contains a few simple lines of software code. Often, however,
we need to explain complicated steps needed to merge or clean the data. Or we need to
4 We analyze this data on page 112.
detail how we conducted customized statistical analysis. These steps are seldom obvious
from the description of data and methods that makes its way into the final paper or report.
It is a great idea to include commentary in the replication material explaining the code and
reasons behind decisions. Sometimes statistical code will be pretty impenetrable (even to
the person who wrote it!) and detailed commentary helps keep things clear for everyone.
We show examples of well-documented code in the Computing Corner beginning on page 57.
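As a flavor of what such commentary can look like, here is a minimal sketch of a commented R replication script; the file name, variable names, and analysis choices are hypothetical, not taken from an actual replication archive.

    ## Replication script: donut analysis (hypothetical example)
    ## Data: DonutData.csv, downloaded from the course website on 2014-04-28

    donut.data = read.csv("DonutData.csv")    # load the raw data

    ## Drop children: weight dynamics differ for kids, so we analyze adults only
    adults = donut.data[donut.data$age >= 18, ]

    ## Descriptive statistics reported in the write-up
    summary(adults$weight)
    summary(adults$donuts)

    ## Main bivariate model
    fit = lm(weight ~ donuts, data = adults)
    summary(fit)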
Having well-documented data and analysis is a huge blessing. Even a modestly complex
project can produce a head-spinning number of variables and choices. And because the work
often extends over days, weeks, or months, we learn quickly that what seems obvious when
fresh can fade into oblivion. How exactly did we create our wonderful new variable at 3 am
three weeks ago? If we can’t recreate the analysis from scratch, it is useless. We may as
well have gone to bed. If we have a good replication file, on the other hand, we can simply rerun the code and pick up where we left off.
A replication file is also crucial to analyze the robustness of our results. A result is robust
if it does not change when we change the model. For example, we might believe that a certain
observation was mis-measured and we might therefore exclude it from the data we analyze.
A reader might be nervous about this exclusion. It will therefore be useful to conduct a robustness check in which we estimate the model including the contested observation. If the statistical significance and magnitude of the coefficient of interest are essentially the same, then we can assure others that the results are robust to inclusion of that observation. If the coefficient of interest changes appreciably, then the results are not robust and we have some explaining to do. Experienced researchers know that many results are not robust and therefore demand extensive robustness checks before they will believe results.
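To illustrate, here is a rough R sketch (with made-up numbers and a hypothetical contested observation) that estimates the same bivariate model with and without the suspect data point and compares the slopes; if the two slopes are similar, the result is robust to that observation.

    # Made-up data in which the 10th observation is the contested one
    x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 40)
    y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.2, 15.0)

    fit.all = lm(y ~ x)                 # model including the contested observation
    fit.trimmed = lm(y[-10] ~ x[-10])   # same model excluding it

    coef(fit.all)[2]        # slope with the observation included
    coef(fit.trimmed)[2]    # slope with it excluded; a big change means the result is not robust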
Remember This
1. Analysis that cannot be replicated cannot be trusted.
2. Replication files document data sources and statistical methods so that someone
could exactly re-create the analysis in question from scratch.
3. Replication files also allow others to explore the robustness of results by enabling
them to assess alternative approaches to the analysis.
To see whether violent crime and such demographic features are related, Table 2.5 presents data drawn from U.S.
Census data from 2009 for the 50 states and Washington DC. We can see there is no missing
data (because each variable has 51 observations). We also see that the violent-crime rate
has a broad range, from 119.9 per 100,000 population all the way to 1,348.9 per 100,000
people. The single parent percent variable is on a 0 to 1 scale, also with considerable range,
from 0.18 to 0.61. The percent urban (which is the percent of people in the state living in
a metropolitan area) is measured on a 0 to 100 scale. These scales mean that 50 percent is
indicated in the single-parent variable as 0.5 and as 50 in the urban variable. Getting the
scales mixed up could screw up the way we interpret statistical results about the relationships among these variables.
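A one-line rescaling in R can head off this kind of confusion; the sketch below uses hypothetical variable names and values to put both measures on the same 0-to-100 scale before any analysis.

    # Hypothetical state-level variables measured on different scales
    single.parent = c(0.18, 0.35, 0.61)       # proportion, 0-1 scale
    percent.urban = c(38, 74, 100)            # percent, 0-100 scale

    single.parent.pct = 100 * single.parent   # now both variables run from 0 to 100
    single.parent.pct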
Scatterplots provide excellent additional information about our data. Figure 2.3 shows
scatterplots of state-level violent-crime rate and percent urban, percent of children with
a single parent, and percent in poverty. Suddenly, the character of the data is revealed.
Washington, DC is a clear outlier because it is very much higher than the 50 states in its level of violent crime.
We can also use scatterplots to appreciate non-obvious things about our data. We may
think of highly urbanized states as being the densely populated states in the northeast like
Massachusetts and New Jersey. Actually, though, the scatterplot helps us see that Nevada,
Utah and Florida are among the most urbanized according to the Census Bureau measure.
Understanding the reality of the urbanization variable helps us better appreciate what the measure is and is not capturing.
Being aware of the data can help us detect possible endogeneity. Many of the high single-
parent and high poverty states are in the South; we may suspect that Southern states are
distinctive in other social and political characteristics, so we should be on high alert
5 Despite the fact that more people live in Washington, DC than in Vermont or Wyoming! Or, so says the
Washington, DC resident ...
[Each panel plots the state violent-crime rate (per 100,000 people) with states labeled by postal abbreviation; DC sits far above all 50 states in every panel.]
FIGURE 2.3: Scatterplots of Violent Crime Against Percent Urban, Single Parent, and Poverty
about potential endogeneity in any analysis that uses poverty or single-parenthood variables.
These variables capture not only poverty and single parenthood, but also “Southerness.”
We need software to do statistics. We have many choices and it’s worthwhile to learn at least
two different software packages. Different packages are good at different things and many
researchers use one program for some tasks and another program for other tasks. Also,
knowing multiple programs helps us think in terms of statistical concepts rather than in terms of the quirks of any particular program.
We refer to two major statistical packages throughout this book: Stata and R. (Yes, R is
a statistical package referred to by a single letter; the folks behind it are a bit minimalist.)
Stata provides simple commands to do many complex statistical analyses; the cost of this
simplicity is that we sometimes need to do a lot of digging to figure out what exactly
Stata is up to. And it is expensive. R can be a bit harder to get the hang of, but the
coding is often more direct so that less is hidden from the user. Oh yes, it's also free, available at www.r-project.org.
In this book we learn by doing, showing specific examples of code in each chapter’s
Computing Corner. The best way to learn code is to get working; after a while the command
names become second nature. Replication files are also a great learning tool. Even if we
forget a specific command, it's not so hard to remember "I want to do something like I did
for the homework about education and wages." All we have to do, then, is track down that replication file and reuse the code.
6 In the references we indicate some good sources for learning Stata and R.
Remember This
1. Stata is a powerful statistical software program. It is relatively user-friendly, but
it can be expensive.
2. R is another powerful statistical software program. It is less user-friendly, but it
is free.
2.4 Conclusion
This chapter prepares us for analyzing real data. The first step is understanding our data.
This vital first step makes sure that we know what we’re dealing with. We should use
descriptive statistics to get an initial feel for how much data we have and the scales of the
variables. Then we should graph our data. It's a great way to appreciate what we're dealing with.
The second step of working with data is documenting our data and analysis. Social science depends on replication: analysis that cannot be replicated cannot be trusted. Therefore all statistical projects should document data and methods, ensuring that others can re-create every result from scratch.
We are on track with the key concepts in this chapter when we can
• Section 2.2: Explain the importance of replication and the two elements of a replication
file.
• Section 2.3 (and Computing Corner below): Do basic data description in Stata and R.
Further Reading
Data visualization is a growing field, with good reason as analysts increasingly commu-
nicate primarily via figures. Tufte (2001) is a landmark book. Schwabish (2004) and Yau
(2011) are nice guides to graphics. Failing to get figures right can even impact your personal
life: http://xkcd.com/833/.
Chen, Ender, Mitchell, and Wells (2003) is an excellent online resource for learning Stata.
Good starting points for R include Verzani (2004) and online tools such as Swirl (2014). Venables and Ripley (2002) is a classic
reference book on S, the language that preceded R. Virtually all of it applies to R as well.
Other programs are widely used as well. EViews is a powerful program often used by
those doing forecasting models (see eviews.com). Some people use Excel for basic statistical
analysis. It’s definitely useful to have good Excel skills, but most people will need a more
Key Terms
• Codebook (47)
• Replication (46)
Computing Corner
Stata
• The first thing to know is what to do when we get stuck (when, not if ). In Stata, type
help commandname if you have questions about a certain command. For example, if we
have questions about the summarize command, we can type help summarize to get a
description of the command. Probably the most useful information comes in the form
of the examples included at the end of these files. Often the best approach is to find an
example that seems closest to what we’re trying to do and apply that example to our
problem. Googling usually helps, too.
• A comment line is a line in the code that provides notes for the user. A comment line
does not actually tell Stata to do anything, but it can be incredibly useful to clarify
what is going on in the code. Comment lines in Stata begin with an asterisk (*). Using
(**) makes it easier to visually identify these crucial lines.
• To open a “syntax file” to document our analysis, click on Window - Do file editor - new
Do-file editor. It’s helpful to re-size this window so that we can see both the commands
and the results. Save our syntax file as “SomethingSomething.do”; the more informative
the name, the better. Including the date in the file name aids version control. To run
any command in the syntax file, highlight the whole line and then press ctrl-d. The
results of the command will be displayed in the Stata results window.
• One of the hardest parts of learning new statistical software is loading data into a
program. While some data sets are pre-packaged and easy, many are not, especially
those we create ourselves. We have to be prepared for the process of loading data to
take longer than expected. And because data sets can sometimes misbehave (columns
shifting in odd ways, for example) it is very important to use the descriptive statistics
diagnostics described in this chapter to make sure the data is exactly what we think it
is.
– To load Stata data files (which have .dta at the end of file name), there are two
options.
1. Use syntax
** For data located on the internet
use "http://www9.georgetown.edu//faculty//baileyma//RealStats/DonutData.dta’’
** For data located on a computer
use "C:\Users\SallyDoe\Documents\DonutData.dta"
The “path” tells the computer where to find the file. In this example the path
is C:\Users\SallyDoe\Documents. The exact path depends on a computer’s file
structure.
2. Point-and-click: Go to the File - Open menu option in Stata and browse to
the file. Stata will then produce and display the command for opening that
particular file. It is a good idea to save this command in the syntax file so that
you document exactly the data being used.
– Loading non-Stata data files (files that are in tab-delimited, comma-delimited, or
other such format) depends on the exact format of the data. For example, use the
following to read in data that has tabs between variables on each line.
1. Use syntax
** For data located on the internet
insheet using "http://www9.georgetown.edu//faculty//baileyma//RealStats/DonutData.raw"
** For data located on a computer
insheet using "C:\Users\SallyDoe\Documents\DonutsData.raw"
2. Point-and-click: Go to File - Import and then select the file where the data
is stored. Stata will then produce and display the command for opening that
particular file. It is a good idea to save this command in the syntax file so that
you document exactly the data being used. Often it is easiest to use point-and-
click the first time and syntax after that.
• To see a list of variables loaded into Stata, look at the variable window that lists all
variables. We can also click on Data - Data editor to see variables.
• To make sure the data loaded correctly, display it with the list command. To display
the first 10 observations of all variables, type list in 1/10. To display the first 8
observations of only the weight variable, type list weight in 1/8. We can also look
at the data in Stata’s “Data Browser” by going to Data/Data editor in the toolbar.
• To see descriptive statistics on the weight and donut data as in Table 2.1 on page 44
use summarize weight donuts.
• To produce a frequency table such as Table 2.2 on page 44, type tabulate male. Use
this command only for variables that take on a limited number of possible values.
• To plot the weight and donut data as in Figure 2.2, type scatter weight donuts.
There are many options for creating figures. For example, to plot the weight and donut
data for males only with labels from a variable called “name,” type scatter weight
donuts if male==1, mlabel(name).
R
• To get help in R, type ?commandname for questions about a certain command. For
example, if we have questions about the mean command, we can type ?mean to get a
description of the command, options and, most importantly, examples. Often the best
approach is to find an example that seems closest to what we’re trying to do and apply
that example to our problem. Googling usually helps, too.
• Comment lines in R begin with a pound sign (#). Using ## makes it easier to visually
identify these crucial lines.
• To open a syntax file where we document our analysis, click on File - New script. It’s
helpful to re-size this window so that we can see both the commands and the results.
Save our syntax file as “SomethingSomething.R”; the more informative the name, the
better. Including the date in the file name aids version control. To run any command
in the syntax file, highlight the whole line and then press ctrl-r. The results of the
command will be displayed in the R Console window.
• To load R data files (which have .RData at the end of file name), there are two options.
1. Use syntax. The most reliable way to work with data from the internet is to down-
load it and then access it as a file on the computer. To do so, use the download
command, which needs to know the location of the data (which we name URL in
this example) and where on the computer to put the data (which we name Dest in
this example). Then use the load command. The following four commands load
the donut data into R’s memory.
URL = "http://www9.georgetown.edu//faculty//baileyma//RealStats//DonutData.RData"
Dest = "C:\\Users\\SallyDoe\\Documents\\DonutData.RData"
download.file(URL, Dest)
load("C:\\Users\\SallyDoe\\Documents\\DonutData.RData")
## We need the double backslashes in file name.
## Yes, they’re different than double forward slashes in the URL.
2. Point-and-click: Click on the R console (where we see results) and go to the File -
Load Workspace menu option and browse to the file. This method is easier, but it
does not leave a record in our .R file of exactly which data set is being used.
• To load non-R data files (files which are in .txt or other such format) requires more care.
We can download data using the same commands as for .RData. To read the data, we
use read.table. For example, to read in data that has commas between variables on
each line:
RawData = read.table("C:\\Users\\SallyDoe\\Documents\\DonutData.raw", header=TRUE, sep=",")
This command saves variables as DataFrameName$VariableName (e.g., RawData$weight, Raw-
Data$donuts). It is also possible to install special commands that load in various types
of data. For example, search the internet for “read.dta” to see more information on how
to install a special command that reads Stata files directly into R.
• It is also possible to enter data into R manually. For example:
weight = c(275, 141, 70, 75, 310, 80, 160, 263, 205, 185, 170, 155, 145)
donuts = c(14, 0, 0, 5, 20.5, 0.75, 0.25, 16, 3, 2, 0.8, 4.5, 3.5)
name = c("Homer", "Marge", "Lisa", "Bart", "Comic Book Guy", "Mr. Burns", "Smithers", "Chief Wiggum", "Principal Skinner", "Rev. Lovejoy", "Ned Flanders", "Patty", "Selma")
• To make sure the data loaded correctly, we can display our data in R with the following
tools:
1. Use the objects() command to show the variables and objects loaded into R.
2. For a single variable, enter the variable’s name in the R Console or highlight it in
the syntax file and press ctrl-r.7
3. To display only some observations for a single variable, use brackets. For example,
to see the first 10 observations of the donuts variable use donuts[1:10]
• To see the average of the weight variable, type mean(weight). One tricky thing with
R is that it chokes on variables that have missing data, meaning that if a single
observation is missing, then the simple version of the mean command will produce a
result of “NA.” Therefore we need to tell R what to do with missing data by modifying
the command to mean(weight, na.rm=TRUE). R refers to missing observations with a
“NA.” The “.rm” is shorthand for remove. A way to interpret the command, then, is us
telling R, “Yes, it is true that we will remove missing data from our calculations.” This
7 R can load variables directly such that each variable has its own variable name. Or, it can load variables as part
of data frames such that the variables are loaded together. For example, our commands to load the .RData file loaded
each variable separately while our commands to load data from a text file created an object called “RawData” that
contains all the variables. To display a variable in the “RawData” object called “donuts”, type RawData$donuts in
the .R file, highlight it, and press ctrl-r. This process may take some getting used to, but experiment freely with any
data set you load in and it should become second nature.
syntax works for other descriptive statistics commands as well. Working with the na.rm
command is a bit of an acquired taste, but it becomes second nature soon enough.
To see the standard deviation of the weight variable, type sqrt(var(weight)) or, equivalently, sd(weight);
the sqrt part refers to the square root function. The minimum and maximum
of the weight variable are displayed with min(weight) and max(weight). To see the
number of observations for a variable, use sum(is.finite(weight)). This command is
a bit clumsy: The is.finite function creates a variable that equals 1 for each non-missing
observation and the sum function sums this variable, creating a count of non-missing
observations. (A consolidated sketch combining several of these commands appears at the end of this Computing Corner.)
• To produce a frequency table such as Table 2.2 on page 44, type table(male). Use
this command only for variables that take on a limited number of possible values.
• To plot the weight and donut data as in Figure 2.2, type plot(weight, donuts). There
are many options for creating figures. For example, to plot the weight and donut data
for males only with labels from a variable called “name,” type
plot(weight[male == 1], donuts[male == 1])
text(weight[male == 1], donuts[male == 1], name[male == 1]).
The syntax donuts[male == 1] tells R to use only values of donuts for which male
equals 1.8
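A consolidated sketch pulling several of these commands together may help. The male variable below is a hypothetical 0/1 indicator created for illustration (the book's actual coding may differ), and the commented-out lines show one way to read a Stata file, assuming the foreign package is installed:
# Consolidated sketch of the Computing Corner commands (R)
weight <- c(275, 141, 70, 75, 310, 80, 160, 263, 205, 185, 170, 155, 145)
donuts <- c(14, 0, 0, 5, 20.5, 0.75, 0.25, 16, 3, 2, 0.8, 4.5, 3.5)
name   <- c("Homer", "Marge", "Lisa", "Bart", "Comic Book Guy", "Mr. Burns",
            "Smithers", "Chief Wiggum", "Principal Skinner", "Rev. Lovejoy",
            "Ned Flanders", "Patty", "Selma")
male   <- c(1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0)   # hypothetical indicator

# Descriptive statistics; na.rm = TRUE tells R to drop missing values
mean(weight, na.rm = TRUE)
sqrt(var(weight, na.rm = TRUE))    # standard deviation
min(weight, na.rm = TRUE)
max(weight, na.rm = TRUE)
sum(is.finite(weight))             # number of non-missing observations

# Frequency table and a labeled scatterplot for males only
table(male)
plot(weight[male == 1], donuts[male == 1], xlab = "Weight", ylab = "Donuts")
text(weight[male == 1], donuts[male == 1], name[male == 1], cex = 0.6, pos = 4)

# One way to read a Stata file directly (assumes the foreign package is installed):
# library(foreign)
# DonutData <- read.dta("DonutDataX.dta")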
Exercises
1. The data set DonutDataX.dta contains data from our donuts example on page 44.
There is one catch: Each of the variables has an error. Use the tools discussed in this
chapter to identify the errors.
2. What determines success at the Winter Olympics? Does population matter? Income?
Or is it simply a matter of being in a cold place with lots of mountains? Table 2.6
describes variables in olympics HW.dta related to the Winter Olympic Games from
1980 until 2014.
a. Summarize the medals, athletes, and GDP data.
b. List the first five observations for the country, year, medals, athletes, and GDP data.
8 R plots are very customizable. To get a flavor, use text(weight[male == 1], donuts[male == 1], name[male == 1], cex=0.6, pos=4) as the second line of the plot sequence of code. The "cex" argument controls the size of the label and "pos=4" puts the labels to the right of the plotted point. Refer to the help menus in R or Google around for more ideas.
a. Summarize the wage, height (both height85 and height81), and sibling variables.
Discuss briefly.
b. Create a scatterplot of wages and adult height (height85). Discuss any distinctive
observations.
c. Create a scatterplot of wages and adult height that excludes the observations with
wages above $500 per hour.
d. Create a scatterplot of adult height against adolescent height. Identify the set of
observations where people’s adolescent height is less than their adult height. Do you
think we should use these observations in any future statistical analysis we conduct
with this data? Why or why not?
4. Anscombe (1973) created four data sets that had interesting properties. Let’s use
tools from this chapter to describe and understand these data sets. The data is in
a Stata data file called AnscombesQuartet.dta. There are four possible independent
variables (x1 through x4) and four possible dependent variables (y1 through y4). Create
a replication file that reads in the data and implements the analysis necessary to answer
the following questions. Include comment lines that explain the code.
a. Briefly note the mean and variance for each of the four X variables. Briefly note
the mean and variance for each of the four Y variables. Based on these, would you
characterize the 4 sets of variables as similar or different?
b. Create 4 scatter plots: one with X1 and Y 1, one with X2 and Y 2, one with X3 and
Y 3 and one with X4 and Y 4.
c. Briefly explain any differences and similarities across the four graphs.
Part I
The OLS Framework
CHAPTER 3
Bivariate OLS: The Foundation of Statistical Analysis
Panel (a) of Figure 3.1 shows a scatterplot of the vote share of the incumbent president's party and
changes in income for each election between 1948 and 2012. The relationship jumps right
out at us: Higher income growth is indeed associated with larger presidential vote shares.1
1 The figure is based on Noel (2010). The figure plots vote share as a percent of the total votes given to Democrats
and Republicans only. We use these data to avoid the complication that in some years third-party candidates such
as Ross Perot (in 1992, 1996) or George Wallace (in 1968) garnered non-trivial vote share.
[Two scatterplots of the incumbent party's vote percent (vertical axis) against percent change in income (horizontal axis), with election years labeled; panel (b) adds the fitted line.]
FIGURE 3.1: Relationship Between Income Growth and Vote for the Incumbent President's Party, 1948-2012
In this chapter we introduce the foundational statistical model for analyzing such data.
The model allows us to quantify the relationship between two variables and to assess whether
the relationship occurred by chance or due to some real cause. We build on these methods
in the rest of the book in ways that help us differentiate, as best we can, true causes from
simple associations.
Basically what we do is take data like the data found in panel (a) of Figure 3.1 and
we estimate a line that best characterizes the relationship between the two variables. We
include this line in panel (b) of Figure 3.1. Look for the year 2012. It’s almost right on
the line. That’s a bit lucky (other years aren’t so close to the line), but the figure shows
that we can get a remarkably good start on understanding presidential elections with a simple fitted line.
The specific tool we introduce in this chapter is OLS, which stands for ordinary least
squares; we’ll explain why later. It’s not the best name. Regression and linear regression are
other commonly used names for the method - also lame names.2
The goal of this chapter is to introduce OLS. In Section 3.1 we show how to estimate
coefficients that produce a fitted line using OLS. The following sections then show that
these coefficient estimates have many useful properties. Section 3.2 demonstrates that the
2 In the late nineteenth century Francis Galton used the term regression to refer to the phenomenon that children
of very tall parents tended to be less tall than their parents. He called this phenomenon “regression to the mean” in
heights of children because children of tall parents tend to “regress” (move back) to average heights. Somehow the
term “regression” bled over to cover statistical methods to analyze relationships between dependent and independent
variables. Go figure.
OLS coefficient estimates are themselves random variables. Section 3.3 explains one of the
most important concepts in statistics: The OLS estimates of β̂1 will not be biased if X is
exogenous. That is, the estimates won’t be systematically higher or lower than the true
values as long as the independent variable is not correlated with the error term. Section
3.4 shows how to characterize the precision of the OLS estimates. Section 3.5 shows how
the distribution of the OLS estimates converges to a point as the sample size gets very, very
large. Section 3.6 discusses issues that complicate the calculation of the precision of our
estimates. These issues get intimidating names like heteroscedasticity and autocorrelation,
but their bark is worse than their bite and most statistical software can easily address them.
Finally, Sections 3.7 and 3.8 discuss tools for assessing how well the model fits the data.
Bivariate OLS is a technique we use to estimate a model with two variables - a dependent
variable and an independent variable. In this section, we explain the model, estimate it,
and try it out on our presidential-election example. We extend the model in later chapters
when we discuss multivariate OLS, a technique we use to estimate models with multiple
independent variables.
Bivariate OLS allows us to quantify the degree to which X and Y move together. We work with the model
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i   (3.1)
where Y_i is the dependent variable and X_i is the independent variable. The parameter β0 is the intercept (or constant). It indicates the expected value of Y when X_i is zero. The parameter β1 is the slope. It indicates how much Y changes as X changes. The random error term, ε_i, captures everything else that affects Y. For the presidential election example, the model is
\text{Incumbent party vote share}_i = \beta_0 + \beta_1 \text{Income change}_i + \varepsilon_i   (3.2)
where Incumbent party vote share_i is the dependent variable and Income change_i is the independent variable. The parameter β0 indicates the expected value of vote percentage for the incumbent when income change equals zero. The parameter β1 indicates how much more vote share the incumbent party is expected to receive for each additional percentage point of income growth.
This model is an incredibly simplified version of the world. The data will not fall on a
completely straight line because many other factors affect elections, ranging from wars to
scandals to social issues and so forth. These factors comprise our error term, ε_i.
For any given data set, OLS produces estimates of the β parameters that best explain the data. We indicate estimates as β̂0 and β̂1, where the "hats" indicate that these are our
estimates. Estimates are different from the true values, β0 and β1, which don't get hats in our notation.3
Estimating the model amounts to choosing an intercept (β̂0) and a slope (β̂1). The task boils down to picking a β̂0 and β̂1 that define the line that minimizes the aggregate distance of the observations from the line. To do so we use two concepts: fitted values and residuals.
The fitted value is the value of Y predicted by our estimated equation. The fitted value is
\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i   (3.3)
Note the differences from Equation 3.1 - there are lots of hats and no ε_i. This is the equation for the regression line defined by the estimated β̂0 and β̂1 parameters and X_i.
A fitted value tells us what we would expect the value of Y to be given the value of the
X variable for that observation. To calculate a fitted value for any value of X, use Equation
3.3. Or, if we plot the line, we can simply look for the value of the regression line at that
value of X. All observations with the same value of Xi will have the same Ŷi , which is the
fitted value of Y for observation i. Fitted values are also called predicted values.
A residual measures the distance between the fitted value and an actual observation. In
the true model the error, ε_i, is that part of Y_i not explained by β_0 + β_1 X_i. The residual is the
3 Another common notation is to refer to estimates with regular letters rather than Greek letters (e.g. b0 and b1 ).
That’s perfectly fine, too, of course, but we stick with the hat notation for consistency throughout this book.
estimated counterpart to the error. It is the portion of Y_i not explained by β̂0 + β̂1X_i (notice the hats). If our coefficient estimates exactly equaled the true values, then the residual would be the error; in reality, of course, our estimates β̂0 and β̂1 will not equal the true values β0 and β1, meaning that our residuals will differ from the errors in the true model.
The residual for observation i is ε̂_i = Y_i − Ŷ_i. Equivalently, we can say a residual is ε̂_i = Y_i − β̂0 − β̂1X_i. We indicate residuals with ε̂ ("epsilon-hat"). As with the βs, a Greek letter with a hat is an estimate of the true value. The residual ε̂_i is distinct from ε_i, which is the error in the true model.
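For readers following along in R, here is a minimal sketch of fitted values and residuals. The data frame ElectionData and the variable names VoteShare and IncomeChange are hypothetical stand-ins for the election data:
# Minimal sketch: fitted values and residuals from a bivariate OLS model.
# ElectionData, VoteShare, and IncomeChange are hypothetical names.
OLSResults <- lm(VoteShare ~ IncomeChange, data = ElectionData)
Yhat      <- fitted(OLSResults)    # fitted values, one per observation
Residuals <- resid(OLSResults)     # residuals: actual minus fitted
# The two pieces add back up to the observed dependent variable
all.equal(ElectionData$VoteShare, as.numeric(Yhat + Residuals))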
Estimation
The OLS estimation strategy is to identify values of β̂0 and β̂1 that define a line that minimizes the sum of the squared residuals. We square the residuals because we want to treat a residual of +7 (as when an observed Y_i is 7 units above the fitted line) as equally undesirable as a residual of -7 (as when an observed Y_i is 7 units below the fitted line). Squaring the residuals converts all residuals to positive numbers. Our +7 residual and -7 residual both become +49 once squared.
Specifically, the expression for the sum of squared residuals for any given estimates of β̂0 and β̂1 is
\sum_{i=1}^{N} \hat{\varepsilon}_i^2 = \sum_{i=1}^{N} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2
The OLS process finds the β̂1 and β̂0 that minimize the sum of squared residuals. The fact
that we're squaring the residuals is where the "squares" in "ordinary least squares" comes from. The "least" bit is from minimizing the sum of squares. The "ordinary" refers to the fact that this is the most basic version of least squares.
As a practical matter, we don't need to carry out the minimization ourselves - we can leave that to the software. The steps are not that hard, though, and we step through a simplified version of the minimization task in Chapter 14 on page 711. This process produces specific equations for the OLS estimates of β̂0 and β̂1. These equations provide estimates of the slope (β̂1) and intercept (β̂0) combination that characterizes the line that best fits the data. The equation for the slope is
\hat{\beta}_1 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{N} (X_i - \bar{X})^2}   (3.4)
where X̄ (read as "X bar") is the average value of X and Ȳ is the average value of Y.
Equation 3.4 shows that β̂1 captures how much X and Y move together. The numerator has Σ(X_i − X̄)(Y_i − Ȳ). The first bit inside the sum is the difference of X from its mean for the ith observation; the second bit is the difference of Y from its mean for the ith observation. The product of these bits is summed over observations. So if Y tends to be above its mean (meaning (Y_i − Ȳ) is positive) when X is above its mean (meaning (X_i − X̄) is positive), then there will be a bunch of positive elements in the sum in the numerator. If Y tends to be below its mean (meaning (Y_i − Ȳ) is negative) when X is below its mean (meaning (X_i − X̄) is negative), we'll also get positive elements in the sum because a negative number times a
negative number is positive. Such observations will also push β̂1 to be positive.
On the other hand, β̂1 will be negative when the signs of X_i − X̄ and Y_i − Ȳ are mostly opposite. For example, if X is above its mean (meaning (X_i − X̄) is positive) when Y is below its mean (meaning (Y_i − Ȳ) is negative), we'll get negative elements in the sum, pushing β̂1 toward a negative value.
We focus on the equation for β̂1 because this is the parameter that defines the relationship between X and Y.
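To see Equation 3.4 at work, the following sketch computes β̂1 and β̂0 by hand from two simulated placeholder vectors and checks the result against R's lm() function:
# Compute the OLS slope and intercept directly from Equation 3.4 (simulated data)
set.seed(123)
x <- rnorm(50, mean = 2, sd = 1.5)
y <- 46 + 2.3 * x + rnorm(50, sd = 3)
beta1.hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0.hat <- mean(y) - beta1.hat * mean(x)   # intercept from the sample means
c(beta0.hat, beta1.hat)    # by-hand estimates
coef(lm(y ~ x))            # the same numbers from lm()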
For the election and income data plotted in Figure 3.2, the equations for β̂0 and β̂1 produce β̂0 = 45.9 and β̂1 = 2.3.
Figure 3.2 shows what these coefficient estimates mean. The β̂1 estimate implies that the incumbent party's vote percentage went up by 2.3 percentage points for each one-percent increase in income. The β̂0 estimate implies that the expected election vote share for the incumbent president's party for a year with zero income growth was 45.9 percent.
Table 3.1 and Figure 3.3 show predicted values and residuals for specific presidential elections. In 1960, income growth was rather low (at 0.58 percent). The vote percent for the Republicans (who controlled the presidency at the time of the election) was 49.9 (Republican Richard Nixon lost a squeaker to Democrat John F. Kennedy). The fitted value, denoted by a triangle in the figure, is 45.9 + 2.3 × 0.58 = 47.2. The residual, which is the difference between the actual and fitted, is 49.9 − 47.2 = 2.7 percent. In other words, in 1960 the incumbent president's party did 2.7 percentage points better than would be expected based on the regression line.
In 1964, income growth was high (at 5.58 percent). The Democrats controlled the pres-
idency at the time of the election and they received 61.3 percent of the vote (Democrat
[Scatterplot of the incumbent party's vote percent against percent change in income, with the fitted line; annotations mark the slope (β̂1 = 2.3) and the intercept (β̂0 = 45.9).]
FIGURE 3.2: Elections and Income Growth with Model Parameters Indicated
[Scatterplot of the incumbent party's vote percent against percent change in income, highlighting the fitted value for 1964 and the residual for 1960.]
FIGURE 3.3: Fitted Values and Residuals for Observations in Table 3.1
Lyndon Johnson trounced Republican Barry Goldwater). The fitted value based on the re-
gression line was 45.90 + 2.30 × 5.58 = 58.7. The residual, which is the difference between the actual and fitted, is 61.3 − 58.7 = 2.6 percent. In other words, in 1964 the incumbent president's party did 2.6 percentage points better than would be expected based on the regression line. In 2000, the residual was negative, meaning that the incumbent president's party (the Democrats at that time) did 4.5 percentage points worse than would be expected based on the regression line.
Remember This
1. The bivariate regression model is
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
• The slope parameter is β1. It indicates the change in Y associated with an increase of X by one unit.
• The intercept parameter is β0. It indicates the expected value of Y when X is zero.
• β1 is almost always more interesting than β0.
2. OLS estimates β̂1 and β̂0 by minimizing the sum of squared residuals:
\sum_{i=1}^{N} \hat{\varepsilon}_i^2 = \sum_{i=1}^{N} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2
The goal of bivariate OLS is to get the most accurate idea of β0 and β1 that the data can provide. The challenge is that we don't observe the values of the βs. We are only able to estimate the true values based on the data we observe. And because the data we observe is random, at least in the sense of having a random error term in it, our estimates will have a random element as well.
In this section we explain where the randomness of our coefficient estimates comes from, introduce the concept of probability distributions, and show that our coefficient estimates follow a normal distribution when the sample is large.
There are two different ways to think about the source of randomness in our coefficient
estimates. First, our estimates may have sampling randomness. This variation is due
to the fact that we may be observing only a subset of an entire population. Think of some
population, say the population of ferrets in Florida. Suppose we want to know whether old
ferrets sleep more. There is some relationship between ferret age and sleep in the overall
population, but we are able to get a random sample of, say, only 1,000 ferrets. We estimate a bivariate OLS model in which sleep is the dependent variable and age is the independent variable.
Based on the sample we have selected, we generate a coefficient β̂1. We're sensible enough to know that if we had selected a different 1,000 ferrets in our random sample we would have gotten a different value of β̂1 because the specific values of sleep and age for the selected ferrets would differ. Every time we select a different 1,000 ferrets we get a different estimate β̂1 even though the underlying population relationship is fixed at the true value, β1.
Such variation is called random variation in β̂1 due to sampling. Opinion surveys typically
involve a random sample of people and are often considered through the sampling variation
perspective.
Second, our estimates will have modeled randomness. Think again of the population
of ferrets. Even if we were to get data on every last one of them, our model has random
elements. The ferret sleep patterns (the dependent variable) are subject to randomness that
goes into the error term. Maybe one ferret had a little too much celery, another got stuck
in a drawer, and yet another broke up with his girlferret. Having data on every single ferret
would not change the fact that unmeasured factors denoted by ‘ affect ferret sleep.
In other words, there is inherent randomness in the data-generation process even when we observe every member of the population. Even if we have data on every observation at any given time such that we do not have sampling variation, we will have randomness due to the data-generation process. In other words, the modeled randomness perspective highlights the fact that virtually every model has some unmeasured component that explains some of the variation in the dependent variable.
An OLS estimate of β̂1 inherits randomness whether from sampling or modeled randomness. The estimate β̂1 is therefore a random variable, which means it is a variable that takes on a set of possible different values, each with some probability. An easy way to see why β̂1 is random is to note that it depends on the values of the Y_i s, which in turn depend on the random errors, ε_i.
Distributions of β̂ estimates
To understand these random β̂1 s, it is best to think of the distribution of β̂1. That is, we want to think about the various values we expect β̂1 to take and the relative likelihood of
these values.
Let’s start with random variables more generally. A random variable with discrete out-
comes can take on one of a finite set of specific outcomes. The flip of a coin or roll of a
die yields a random variable with discrete outcomes. These random variables have probability distributions that give the probability of observing each possible outcome.
Many probability distributions of random variables are intuitive. We all know the distri-
bution of a coin toss: heads with 50 percent probability and tails with 50 percent probability.
Panel (a) of Figure 3.4 plots this data with the outcome on the horizontal axis and the prob-
ability on the vertical axis. We also know the distribution of the roll of a six-sided die. There is a 1/6 probability of seeing any of the six numbers on it, as panel (b) of Figure 3.4 shows. These are examples of random variables with a specific number of possible outcomes: two for the coin and six for the die.
This logic of distributions extends to continuous variables, which are variables that can take on any value in some range. Weight in our donut example from Chapter 1 is essentially continuous. For a continuous variable we can't simply say there is some specific number of possible outcomes. We don't identify a probability for each possible outcome for continuous variables because there is an unlimited number of possible outcomes. Instead, we use a probability density, a curve or formula that describes the relative probability that a random variable is near a specified value.
[Panels (a) and (b): probabilities for the outcomes of a coin flip and a die roll; panels (c) and (d): probability densities for two continuous random variables.]
FIGURE 3.4: Four Distributions
Probability densities run the gamut from familiar to weird. On the familiar end of things
is a normal distribution, which is the classic bell curve in panel (c) of Figure 3.4. This
plot indicates the probability of observing realizations of the random variable in any given
range. For example, since half of the area of the density is less than 0, we know that there
is a 50 percent chance that this particular normally distributed random variable will be less
than zero. Because the probability density is high in the middle and low on the ends we can
say, for example, that the normal random variable plotted in panel (c) is more likely to take
on values around zero than values around -4. The odds of observing values around +1 or -1
are still reasonably high, but the odds of observing values near +3 or -3 are small.
Probability densities for random variables can have odd shapes, as in panel (d) of Figure
3.4, which shows a probability density for a random variable that has its most likely outcomes
near 64 and 69.5 The point of panel (d) is to make it clear that not all continuous random variables follow the bell-shaped distribution. We could draw a squiggly line and, if it satisfied the basic requirements of a probability density, it too could describe a random variable.
Central Limit Theorem: The average of a sufficiently large number of independent draws from any distribution will be normally distributed. Because OLS estimates are weighted averages, the Central Limit Theorem implies that the distribution of β̂1 will be normally distributed.
The OLS estimates β̂1 are random variables. While we can't know exactly what the value of β̂1 will be for any given true β1, we know that the distribution of β̂1 will follow a normal bell curve. We discuss how to calculate the width of the bell curve in Section 3.4, but knowing the shape of the probability density for β̂1 is a huge advantage. The normal distribution has well-known properties and is relatively easy to deal with, making our lives much easier in what is to come.
The normality of our OLS coefficient estimates is amazing. If we have enough data, the distribution of β̂1 will follow a bell-shaped distribution even if the errors follow a weird distribution like the one in panel (d) of Figure 3.4. In other words, just pour ε_i values from any crazy random distribution into our OLS machine and it will spit out β̂1 estimates that follow a normal distribution.
Why is β̂1 normally distributed for large samples? The reason is a theorem at the heart
of all statistics: the Central Limit Theorem. This theorem states that the average of any
6 If the errors in the model (the εs) are normally distributed, then the β̂1 will be normally distributed no matter what the sample size is. Therefore in small samples, if we could make ourselves believe the errors are normally distributed, then that belief would be a basis for treating the β̂1 as coming from a normal distribution. Unfortunately, many people are skeptical that errors are normally distributed in most empirical models. Some statisticians therefore pour a great deal of energy into assessing whether errors are normally distributed (just Google "normality of errors"), but we don't need to worry about this debate as long as we have a large sample.
random variable follows a normal distribution.7 In other words, get a sample of data from
some distribution and calculate the average. For example, roll a six-sided die 50 times and
calculate the average across the 50 rolls. Then roll the die another 50 times and take the
average again. Go through this routine again and again and again and plot a histogram of
the averages. If we get a large number of averages, the histogram will look like a normal
distribution. The most common averages will be around the true average of 3.5 (the average
of the 6 numbers on a die). In some of our sets of 50 rolls we’ll see more sixes than usual and
those averages will tend to be closer to 4. In other sets of 50 rolls we’ll see more ones than
usual and those averages will tend to be closer to 3. Crucially, the shape of the distribution
will look more and more like a normal distribution the larger our sample of averages gets.
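A quick simulation makes the dice example concrete. The sketch below repeatedly averages 50 die rolls and plots a histogram of those averages, which piles up around 3.5 in a bell shape (the number of repetitions is an arbitrary choice):
# Central Limit Theorem demonstration: averages of 50 die rolls
set.seed(42)
averages <- replicate(10000, mean(sample(1:6, 50, replace = TRUE)))
hist(averages, breaks = 40, main = "Averages of 50 die rolls", xlab = "Average roll")
mean(averages)    # close to 3.5, the true average for a fair die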
Even though the Central Limit Theorem is about averages, it is relevant for OLS. Econo-
metricians deriving the distribution of β̂1 invoke the Central Limit Theorem to prove that β̂1 is normally distributed in large samples.8
What sample size is big enough for the Central Limit Theorem and, therefore, normality
7 There are some technical assumptions necessary. For example, the “distribution” of the values of the error term
cannot consist solely of a single number.
8 One way to see why is to think of the OLS equation for β̂1 as a weighted average of the dependent variable. That's not super obvious, but if we squint our eyes and look at Equation 3.4, we see that we could re-write it as
\hat{\beta}_1 = \sum_{i=1}^{N} w_i (Y_i - \bar{Y}) \quad \text{where} \quad w_i = \frac{X_i - \bar{X}}{\sum_{i=1}^{N} (X_i - \bar{X})^2}
(We have to squint really hard!) In other words, we can think of β̂1 as a weighted sum of the Y_i s where w_i is the weight (and we happen to subtract the mean of Y from each Y_i). It's not too hard to get from a weighted sum to an average (rewrite the denominator of w_i as N var(X)). Doing so opens the door for the Central Limit Theorem (which is, after all, about averages) to work its magic and establish that β̂1 will be normally distributed for large samples.
to kick in? There is no hard-and-fast rule, but the general expectation is that around
100 observations is enough. If we have data with some really extreme outliers or other
pathological cases, then we may need a larger sample size. Happily, though, the normality
of the β̂1 distribution generally applies even for data with as few as 100 observations.
Remember This
1. Randomness in coefficient estimates can be the result of
• sampling variation, which arises due to variation in the observations selected
into the sample. Each time a different random sample is analyzed, a differ-
ent estimate of β̂1 will be produced even though the population (or "true")
relationship is fixed.
• modeled variation, which arises because of inherent uncertainty in outcomes.
Virtually any data set has unmeasured randomness, whether the data set
covers all observations in a population or some subsample (random or not).
2. The Central Limit Theorem implies the β̂0 and β̂1 coefficients will be normally
distributed random variables if the sample size is sufficiently large.
We know that β̂1 is not simply the true value β1; it is an estimate, after all. But how does β̂1 relate to β1? In this section we introduce the concept of unbiasedness, explain the condition under which our estimates are unbiased, and characterize the nature of the bias when this condition fails.
Perhaps the central concept of this whole book is that β̂1 is an unbiased estimator of the true value, β1, when the exogeneity condition holds. The statistical meaning of bias may be new to you. In everyday language, bias refers to a systematic tendency to deviate from the truth. The statistical concept of bias is rather close. For example, our estimate β̂1 would be biased if the β̂1 s we observe are usually around -12 when the true value of β1 is 16. In other words, if our system of generating a β̂1 estimate was likely to produce a negative value when the true value was 16, we'd say the estimating procedure was biased. As we discuss below, such bias happens a lot (and the villain is almost always endogeneity).
Our estimate β̂1 is unbiased if the average value of the distribution of the β̂1 is equal to
the true value. An unbiased distribution will look like Figure 3.5, which shows a distribution
of β̂1 s centered around the true value of β1. The good news about an unbiased estimator is that on average, our β̂1 should be pretty good. The bad news is that any given β̂1 could be far from the true value depending on how wide the distribution is and on luck – just by chance we could draw an estimate out in one of the tails.
In other words, unbiased does not mean perfect. It just means that, in general, there is no systematic tendency to be too high or too low. The distribution of β̂1 can be quite wide so that even as the average is the true value, we could still observe values of β̂1 that are far from it.
[Figure 3.5: probability density showing the distribution of β̂1 centered at the true value, β1.]
Think of the people who judge figure skating at the Olympics. Some are biased – perhaps
blinded by nationalism or wads of cash – and they systematically give certain skaters higher
or lower scores than the skaters deserve. Other judges (most?) are not biased. These judges
do not get the right answer every time.9 Sometimes an unbiased judge will give a score that
is higher than it should be, sometimes lower. Similarly, an OLS regression coefficient β̂1 can miss the true value in either direction even when the estimator is unbiased.
Here are two thought experiments that shed light on unbiasedness. First, let’s approach
the issue from the sampling-randomness framework from Section 3.2. Suppose we select a
sample of people, measure some dependent variable Yi and independent variable Xi for each,
and use those to estimate the OLS β̂1. We write that down and then select another sample
9 We’ll set aside the debate about whether a right answer even exists for now. Let’s imagine there is a score that
judges would on average give to a performance if the skater’s identity were unknown.
of people, get the data, estimate the OLS model again and write down the new estimate of
β̂1. The new estimate will be different because we'll have different people in our data set. Repeat the process again and again and write down all the different β̂1 s and then calculate the average of the estimated β̂1 s. While any given realization of β̂1 could be far from the true value, we will call the estimates unbiased if the average of the β̂1 s is the true value, β1.
We can also approach the issue from the modeled-randomness framework from Section 3.2. Suppose we generate our data. We set the true β1 and β0 values as some specific values. We also fix the value of X_i for each observation. Then we draw the ε_i for each observation from some random distribution. These values come together in our standard equation to produce values of Y that we then use in the OLS equation for β̂1. Then we repeat the process of generating random error terms (while keeping the true β and X values the same). Doing so produces another set of Y_i values and a different OLS estimate of β̂1. We re-run this process a bunch of times and write down the β̂1 estimates. If the average of the β̂1 s we have recorded is equal to the true value β1, then we say that β̂1 is an unbiased estimator of β1.
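The modeled-randomness thought experiment is easy to simulate. The sketch below fixes β0, β1, and X at illustrative values, repeatedly draws fresh errors, re-estimates β̂1 each time, and shows that the average of the β̂1 estimates sits near the true β1:
# Simulating the modeled-randomness thought experiment (illustrative values)
set.seed(7)
N      <- 100
beta0  <- 2                 # true intercept
beta1  <- 0.5               # true slope
X      <- runif(N, 0, 10)   # X values held fixed across simulations
beta1.hats <- replicate(5000, {
  epsilon <- rnorm(N, mean = 0, sd = 2)   # fresh errors each round
  Y       <- beta0 + beta1 * X + epsilon
  coef(lm(Y ~ X))[2]
})
mean(beta1.hats)    # close to the true value of 0.5
hist(beta1.hats, breaks = 40, main = "Distribution of estimated slopes")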
OLS does not automatically produce unbiased coefficient estimates. A crucial condition
must be satisfied for OLS estimates to be unbiased. It’s a condition we discussed earlier
in Chapter 2. The error term cannot be correlated with the independent variable. This
exogeneity condition is at the heart of everything. If this condition is violated, then there is
something in the error term that is correlated with our independent variable and there is a
chance that it will contaminate the observed relationship between X and Y . In other words,
while observing large values of Y associated with large values of X naturally inclines us to
think X pushes Y higher, we worry that something in the error term that is big when X is
big is actually what is causing Y to be high. In that case, the relationship between X and
Y is spurious and the real causal influence is that something in the error term.
Almost every interesting relationship between two variables in the policy and economic
worlds has some potential for correlation between X and the error term. Let’s start with a
classic example. Suppose we wonder whether ice cream makes people violent.10 We estimate
\text{Violent crime}_t = \beta_0 + \beta_1 \text{Ice cream sales}_t + \varepsilon_t
where violent crime in period t is the dependent variable and ice cream sales in period t is the independent variable. We'd find that β̂1 is greater than zero, suggesting crime is indeed higher when ice cream sales are higher.
Does this relationship mean that ice cream is causing crime? Maybe. Probably not. OK,
no, it doesn’t. What’s going on? There are a lot of factors in the error term and one of them
is probably truly associated with crime and correlated with ice cream sales. Any guesses?
Heat. Heat makes people want ice cream and, it turns out, makes them cranky (or gets
10 Why would we ever wonder that? Work with me here ...
them out of doors) such that crime goes up. Hence a bivariate OLS model with just ice
cream sales will show a relationship, but due to endogeneity, this relationship is really just spurious.
Characterizing bias
As a general matter, we can say that as the sample size gets large, the estimated coefficient
will on average be off by some function of the correlation between the included variable and
the error term. We show in Chapter 14 on page 713 that the expected value of our bivariate
OLS estimate is
E[\hat{\beta}_1] = \beta_1 + \mathrm{corr}(X, \varepsilon) \frac{\sigma_\varepsilon}{\sigma_X}   (3.8)
where E[β̂1] is short for the expectation of β̂1, corr(X, ε) is the correlation of X and ε, σ_ε (the lower case Greek letter sigma) is the standard deviation of ε, and σ_X is the standard deviation of X. The fraction at the end of the equation is more of a normalizing factor, so we focus on the correlation term.
The key thing is the correlation of X and ε. The bigger this correlation, the further the expected value of β̂1 will be from the true value. Or, in other words, the more the independent variable and the error are correlated, the more biased OLS will be.
11 If we use the fact that corr(X, ε) = cov(X, ε)/(σ_ε σ_X), we can write Equation 3.8 as E[\hat{\beta}_1] = \beta_1 + \frac{\mathrm{cov}(X, \varepsilon)}{\sigma_X^2}, where cov is short for covariance.
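Equation 3.8 can also be illustrated by simulation. In the sketch below, X and ε share a common component, so they are positively correlated, and the average β̂1 ends up above the true value by roughly corr(X, ε) σ_ε / σ_X, as the formula predicts (all values here are illustrative):
# Simulating bias from correlation between X and the error term
set.seed(99)
N   <- 1000
est <- replicate(2000, {
  common  <- rnorm(N)             # shared component creates the correlation
  X       <- common + rnorm(N)
  epsilon <- common + rnorm(N)    # corr(X, epsilon) is about 0.5
  Y       <- 1 * X + epsilon      # true slope is 1
  coef(lm(Y ~ X))[2]
})
mean(est)    # around 1.5: biased upward, as Equation 3.8 predicts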
The rest of this book mostly revolves around what to do if the correlation of X and ε is not zero. The ideal solution is to use randomized experiments for which corr(X, ε) is zero by design. But in the real world, experiments often fall prey to challenges discussed in Chapter 10. For non-experimental data, which is more common than experimental data, we'll discuss lots of tricks in the rest of this book that help us generate unbiased estimates despite the threat of endogeneity.
Remember This
1. The distribution of an unbiased estimator is centered at the true value, β1.
2. The OLS estimator β̂1 is an unbiased estimator of β1 if X and ε are not correlated.
3. If X and ε are correlated, the expected value of β̂1 is \beta_1 + \mathrm{corr}(X, \varepsilon) \, \sigma_\varepsilon / \sigma_X.
There are two ways we can get a β̂1 estimate that is not close to the true value. One is bias, as discussed above. The other is random chance. Our OLS estimates are random and with the luck of the draw we might get an estimate that's not very good. Therefore it is very useful to characterize the variance of our random β̂1 estimates as this will help us appreciate when we should expect estimates near the true value and when we shouldn't. In this section we explain what we mean by the precision of our estimates and provide an equation for the variance of β̂1. Estimating a coefficient is a bit like reaching
into a bowl. We’re not quite sure what we’re going to get. We might get a Snickers (yum!),
a Milky Way (not bad), a Mounds bar (trade-bait) or a severed human pinkie (run away!).
When we estimate OLS coefficients, it's like we're reaching into a bowl of possible β̂1 s and pulling out an estimate. Anytime we reach into the unknown, we don't quite know what we'll pull out.
But we do know certain properties of the β̂1 s that went into the bowl. If the exogeneity condition holds, the average of the β̂1 s that went into the bowl is β1. It also turns out that we can say a lot about the range of β̂1 s in the bowl. We do this by characterizing the width of the distribution of β̂1.
To get a sense of what's at stake, Figure 3.6 shows two distributions for a hypothetical β̂1. The darker, lower curve is much wider than the lighter, higher curve. The lighter curve is more precise because more of the distribution is near the true value.
The primary measure of precision is the variance of β̂1. The variance is – you guessed it – a measure of how much something varies. The wider the distribution, the larger its variance. The square root of the variance is the standard error of β̂1. The standard error is a measure of how much β̂1 will vary. A large standard error indicates that the distribution of β̂1 is very wide; if the standard error is small, the distribution of β̂1 is narrower.
We prefer β̂1 to have a smaller variance. With smaller variance, values close to the true value are more likely, meaning we're less likely to be far off when we generate the β̂1. In
[Figure 3.6: two probability densities for β̂1; the taller, narrower curve has smaller variance and the flatter, wider curve has larger variance.]
other words, our bowl of estimates will be less likely to have wacky stuff in it.
Under the right conditions, we can characterize the variance (and, by extension, the
standard error) of β̂1 with a simple equation. We discuss the conditions on page 102. If they hold, the variance of β̂1 is
\mathrm{var}(\hat{\beta}_1) = \frac{\hat{\sigma}^2}{N \times \mathrm{var}(X)}   (3.9)
This equation tells us how wide our distribution of β̂1 is.12 We don't need to calculate the variance of β̂1 by hand. That is, after all, why we have computers. We can, however, understand what causes precise or imprecise β̂1 estimates by looking at each part of this equation.
First, note that the variance of β̂1 depends directly on the variance of the regression, σ̂². The variance of the regression measures how well the model explains variation in Y. (And, just to be clear, the variance of the regression is different from the variance of β̂1.) That is, do the actual observations cluster fairly closely to the line implied by β̂0 and β̂1? If they do, σ̂² is small; if not, σ̂² is large.
We calculate σ̂² based on how far the fitted values are from the actual observed values.
12 We derive a simplified version of the equation in the advanced OLS chapter on page 718.
The equation is
\hat{\sigma}^2 = \frac{\sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2}{N - k} = \frac{\sum_{i=1}^{N} \hat{\varepsilon}_i^2}{N - k}   (3.10)
which is (essentially) the average squared deviation of fitted values of Y from the actual values. It's not quite an average because the denominator has N − k rather than N. The term N − k is referred to as the degrees of freedom, where k is the number of parameters we estimate.13
The more individual observations deviate from their fitted values, the higher σ̂² will be. This is also an estimate of the variance of ε in our core model, Equation 3.1.14
Next, look at the denominator of the variance of β̂1 (Equation 3.9). It is N × var(X). Yawn. There are, however, two important substantive facts in there. First, the bigger the sample size (all else equal), the smaller the variance of β̂1. In other words, more data means more precise estimates.
Second, we see that the variance of X reduces the variance of β̂1. The variance of X is
13 For bivariate regression, k = 2 because we estimate two parameters (β̂0 and β̂1). We can think of the degrees of freedom correction as a penalty for each parameter we estimate; it's as if we use up some information in the data with each parameter we estimate and we cannot, for example, estimate more parameters than the number of observations we have. If N is large enough, the k in the denominator will have only a small effect on the estimate of σ̂². For small samples, the degrees of freedom issue can matter more. Every statistical package will get this right and the core intuition is that σ̂² measures the average squared distance between actual and fitted values.
14 Recall that the variance of ε̂ will be \frac{\sum (\hat{\varepsilon}_i - \bar{\hat{\varepsilon}})^2}{N}. The OLS minimization process automatically creates residuals with an average of zero (meaning \bar{\hat{\varepsilon}} = 0). Hence, the variance of the residuals reduces to Equation 3.10.
calculated as \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}. In other words, the more our X variable varies, the more precisely we can estimate β̂1.
Remember This
1. The variance of β̂1 measures the width of the β̂1 distribution. If the conditions discussed in the next section (Section 3.6) are satisfied, then the estimated variance of β̂1 is
\mathrm{var}(\hat{\beta}_1) = \frac{\hat{\sigma}^2}{N \times \mathrm{var}(X)}
[Figure 3.7: four scatterplots, panels (a) through (d), of a dependent variable against an independent variable; the panels differ in how tightly the observations cluster around the fitted line and in how much the independent variable varies.]
Discussion Questions
1. Will the variance of β̂1 be smaller in panel (a) or panel (b) of Figure
3.7? Why?
2. Will the variance of β̂1 be smaller in panel (c) or panel (d) of Figure
3.7? Why?
The fact that the variance of β̂1 shrinks as the sample size increases means that eventually the variance approaches zero. This section discusses the implications of this fact by introducing the concepts of probability limits and consistency.
The probability limit of an estimator is the value to which the estimator converges as
the sample size gets very large. Figure 3.8 illustrates the intuition behind probability limits by showing the probability density of β̂1 for hypothetical experiments in which the true value of β1 is zero. The flatter dark curve is the probability density for β̂1 for an experiment with N = 10 people. The most likely value of β̂1 is 0 because this is the place where the density is highest, but there's still a pretty good chance of observing a β̂1 near 1.0 and even a reasonable chance of observing a β̂1 near 4. For a sample size of 100, the variance shrinks, which means we're less likely to see β̂1 values near 4 compared to when the sample size was 10. For a sample size of 1,000, the variance shrinks even more, producing the tall thin distribution. Under this distribution, we're not only unlikely to see β̂1 near 4, we're also very unlikely to see β̂1 values near 1.0.
If we were to keep plotting distributions for larger sample sizes, we would see them getting taller and thinner. Eventually the distribution would converge to a vertical line at the true value. If we had an infinite number of observations we would get the right answer every time. That may be cold comfort if we're stuck with a sad little data set of 37 observations, but it tells us what to expect as our samples grow.
[Figure 3.8: probability densities of β̂1 for sample sizes N = 10, N = 100, and N = 1,000; the distribution becomes taller and narrower as N grows.]
An estimator is a consistent estimator if the distribution of β̂1 estimates shrinks to be closer and closer to the true value β1 as we get more data. If the exogeneity condition is true, then β̂1 is a consistent estimator of β1.
With only a handful of observations, it is unreasonable to expect OLS to provide a precise sense of the true value of β1. If we have a bajillion observations in our sample, our β̂1 estimate should be very close to the true value.
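A small simulation shows consistency at work: as N grows, the spread of the β̂1 estimates collapses toward the true value (all numbers below are illustrative):
# Consistency demonstration: the spread of the slope estimates shrinks as N grows
set.seed(2024)
sim.beta1 <- function(N, reps = 500) {
  replicate(reps, {
    X <- rnorm(N)
    Y <- 0 * X + rnorm(N)      # true slope is zero
    coef(lm(Y ~ X))[2]
  })
}
sapply(c(10, 100, 1000), function(N) sd(sim.beta1(N)))
# The standard deviation of the estimates falls as the sample size rises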
Suppose, for example, that we wanted to assess the relationship between height and wages in a given classroom. If we base our estimate on information from only one student, we're not very likely to get an accurate estimate. If we ask 10 students, our answer will likely be closer to the true relationship in the classroom, and if we ask 20 students we're even more likely to be close to the truth.
If the exogeneity condition fails, however, the β̂1 estimates will converge to some other value than the true value. Even though the details can get
pretty technical, the probability limit of an estimator is often easier to work with than
the expectation, and therefore statisticians routinely characterize problems in terms of probability limits that deviate from the true value. We see an example of probability limits that go awry when we assess the influence of measurement error in Section
5.3.17
17 The two best things you can say about an estimator are that it is unbiased and consistent. OLS estimators have
both of these properties when the error is uncorrelated with the independent variable. These properties seem pretty
similar, but they are rather different. These differences are typically only relevant in advanced statistical work. For
reference, though, we discuss in the appendix examples of estimators that are unbiased, but not consistent and vice
versa.
Remember This
1. The probability limit of an estimator is the value to which the estimator converges
as the sample size gets very, very large.
2. When the error term and X are uncorrelated, OLS estimates of β are consistent, meaning that plim β̂ = β.
Equation 3.9 on page 95 accurately characterizes the variance of β̂1 when certain conditions about the error term are true. In this section, we explain those conditions. If these conditions do not hold, the calculation of the variance of β̂1 will be more involved, but the intuition we have developed still applies. We point to procedures for calculating var(β̂1) under these circumstances in this section and in Chapter 13.
Homoscedasticity
The first condition for Equation 3.9 to be appropriate is that the variance of ‘i is the same
for every observation. That is, once we have taken into account the effect of our measured
variable (X), the expected degree of uncertainty in the model is the same for all observations.
If this condition holds, the variance of the error term is the same for low values of X as for
high values of X. This condition gets a fancy name, homoscedasticity. "Homo" means same. "Scedastic" (yes, that's a word) means variance. Hence, errors are homoscedastic when they all have the same variance.
When errors violate this condition, they are heteroscedastic, meaning that the variance
of ε_i is different for at least some observations. That is, some observations are on average
closer to the predicted value than others. Imagine, for example, that we have data on
how much people weigh from two sources: some people weighed themselves with a state-of-
the-art scale and others had one of those guys at a state fair guess their weight. Definite
heteroscedasticity there, as the weight estimates on the scale would be very close to the
truth (small errors) and the weight estimates from the fair dude would be farther from the truth (bigger errors).
Violating the homoscedasticity condition doesn't cause OLS β̂1 estimates to be biased. It simply means we shouldn't use Equation 3.9 to calculate the variance of β̂1. Happily for us, the intuitions we have discussed so far about what causes var(β̂1) to be big or small are not nullified and there are relatively simple ways to implement procedures for this case. We show how to implement them in the Computing Corner of this chapter on pages 127 and 130. This approach to accounting for heteroscedasticity uses
\mathrm{var}[\hat{\beta}_1] = \left(\frac{1}{\sum (X_i - \bar{X})^2}\right)^2 \sum (X_i - \bar{X})^2 \hat{\varepsilon}_i^2   (3.12)
This is less intuitive than Equation 3.9 on page 95 so we do not emphasize it. As it turns out, we derive heteroscedasticity-consistent standard errors in the course of deriving the standard errors that assume homoscedasticity (see page 720). Heteroscedasticity-consistent standard errors are also referred to as robust standard errors (because they are robust to heteroscedasticity) or as Huber-White standard errors. Another approach is to use "weighted least squares" to deal with heteroscedasticity. This approach is more statistically efficient, meaning the variance of the estimate will theoretically be lower. The technique produces β̂1 estimates that differ from the OLS β̂1 estimates. We point out references with more details on weighted least squares in the Further Reading section at the end of this chapter.
The second condition for Equation 3.9 to provide an appropriate estimate of the variance of
β̂1 is that the errors are not correlated with each other. If errors are correlated with each other, knowing the value of the error for one observation provides information about the value of the error for other observations.
There are two fairly common situations in which errors are correlated. The first is when
we have clustered errors. Suppose, for example, we’re looking at test scores of all 8th graders
in California. It is possible that the unmeasured factors in the error term cluster by school.
Maybe one school attracts science nerds and another attracts jocks. If such patterns exist,
then knowing the error term for a kid in a school gives some information about error terms
of other kids in the same school, which means errors are correlated. In this case, the school
is the "cluster" and errors are correlated within the cluster. Equation 3.9 is inappropriate for clustered errors of this sort.
This sounds worrisome. It is, but not terribly so. As with heteroscedasticity, violating the
condition that errors are not correlated doesn't cause the OLS β̂1 estimate to be biased. It only renders Equation 3.9 inappropriate. What's the solution? Get a better equation for the variance of β̂1! It's a bit more complicated than that, but
the upshot is that it is possible to derive the variance of β̂1 when errors are correlated within clusters. We simply note the issue here and use the computational procedures discussed in the Computing Corner.
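In R, one common way to obtain heteroscedasticity-consistent or clustered standard errors after lm() is with the sandwich and lmtest packages. The sketch below assumes those packages are installed and uses hypothetical names (Dataset, Y, X, SchoolID):
# Robust and clustered standard errors after lm() (hypothetical names)
library(sandwich)
library(lmtest)
fit <- lm(Y ~ X, data = Dataset)
coeftest(fit)                                             # classical standard errors
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))           # robust (Huber-White) errors
coeftest(fit, vcov = vcovCL(fit, cluster = ~ SchoolID))   # errors clustered by school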
Correlated errors are also common in time series data. Time series data is data on a
specific unit over time. Examples include U.S. growth rates since 1945 or data on annual
attendance at New York Yankee games since 1913. Errors in time series data are frequently autocorrelated, meaning that the error in one time period is correlated with the error in the previous time period.
One way that correlated errors can occur in time series is that an unmeasured variable
in the error term may be sticky such that a high value in one year implies a high value in
the next year. Suppose, for example, we are modeling annual U.S. economic growth since
1945 and we lack a variable for technological innovation (which is very hard to measure). If
technological innovation was in the error term boosting the economy in one year, it probably
had some boosting to do in the error term the next year. Similar autocorrelation is likely
in many time series data sets, ranging from average temperature in Tampa over time to many other variables tracked year after year.
As with the other issues raised in this section, autocorrelation does not cause bias. Auto-
correlation only renders Equation 3.9 inappropriate. Chapter 13 discusses how to generate appropriate estimates of var(β̂1) for autocorrelated errors.
As long as we satisfy the exogeneity condition (that X and the errors are uncorrelated), we do not need the homoscedasticity and
uncorrelated-errors conditions for unbiased estimates. When these conditions fail, we simply
need to do some additional steps to get back to a correct equation for the variance of —ˆ1 .
The fact that violations of these conditions get fancy statistical labels like “heteroscedastic-
ity" and "autocorrelation" can make them seem especially important. They are not. The exogeneity condition is what matters for unbiased estimates; these conditions affect only how we calculate the variance of β̂1.
Remember This
1. The standard equation for the variance of β̂1 (Equation 3.9 on page 95) requires
errors to be homoscedastic and uncorrelated with each other.
• Errors are homoscedastic if their variance is constant. When errors are het-
eroscedastic, the variance of errors is different across observations.
• Correlated errors commonly occur in clustered data where the error for one
observation is correlated with the error of another observation from the same
cluster (such as a school).
• Correlated errors are also common in time series data where errors are au-
tocorrelated, meaning the error in one period is correlated with the error in
the previous period.
2. Violating the homoscedasticity or uncorrelated-errors conditions does not bias
OLS coefficients.
Goodness of fit is a statistical concept that refers to how well a model fits the data. If
a model fits well, knowing X gives us a pretty good idea of what Y will be. If the model
fits poorly, knowing X doesn’t give as good an idea of what Y will be. In this section we
present three ways to characterize the goodness of fit. We should not worry too much about
goodness of fit, however, as we can have useful, interesting results from models with poor fit and misleading results from models that fit well.
We've already seen one goodness of fit measure, the variance of the regression (denoted as σ̂²). One limitation with this measure is that the scale is not intuitive. For example, if our dependent variable is salary, the variance of the regression will be measured in dollars squared. The standard error of the regression is a more intuitive measure of goodness of fit. It is simply the square root of the variance of the regression and is denoted as σ̂. It corresponds, roughly, to the average distance of observations from fitted values. The
scale of this measure will be the same units as the dependent variable, making it much easier
to relate to.
The trickiest thing about the standard error of the regression may be that it goes by so many names. Stata refers to it as the root mean squared error (or root MSE for short); root refers to the square root and MSE refers to mean squared error, which is how we calculate σ̂² (which is, simply, the mean of the squared residuals). R refers to it as the residual standard error because it is the estimated standard error for the errors in the model.
Another way to assess goodness of fit is to plot the data and see for ourselves how closely the
observations are to the fitted line. Plotting also allows us to see outliers or other surprises in
the data. Assessing goodness of fit based on looking at a plot is pretty subjective, though.
R2
Finally, a very common measure of goodness of fit is R2 . The name comes from the fact that
it is a measure of the squared correlation of the fitted values and actual values.19 Correlation
is often indicated with an “r”, so R2 is simply the square of this value. (Why one is lower
case and the other is upper case is one of life’s little mysteries.)
If the model explains the data well, the fitted values will be highly correlated with the
actual values and R2 will be high. If the model does not explain the data well, the fitted
values will not correlate very highly with the actual values and R2 will be near zero. Possible values of R2 range from 0 to 1.20
Values of R2 are often interesting to help us understand how well our model predicts the dependent variable, but the measure may be less useful than it seems. A high R2 is neither necessary nor sufficient for an analysis to be useful. A high R2 means the predicted values are close to
19 This interpretation works only if an intercept is included in the model, which it usually is.
20 The value of R2 also represents the ratio of the variance of the fitted values to the variance of the actual values of Y. It is therefore also referred to as a measure of the proportion of the variance explained.
the actual values. It says nothing more. We can have a model loaded with endogeneity that
generates a high R2 . The high R2 in this case means nothing; the model is junk, the high R2
notwithstanding. And to make matters worse, some people have the intuition that a good
fit is necessary for believing regression results. This intuition isn’t correct, either. There is
no minimum value we need for a good regression. In fact, it is very common for experiments
(the gold standard of statistical analyses) to have low R2s. There can be all kinds of reasons for low R2 – the world could be messy such that σ² is high, for example – but the model can still be useful.
Figure 3.9 shows various goodness of fit measures for OLS estimates of two different
hypothetical data sets of salary at age 30 (measured in thousands of dollars) and years of
education. In panel (a), the observations are pretty closely clustered around the regression
line. That's a good fit. The variance of the regression is 91.62; it's not really clear what to make of that number. Easier to interpret is the standard error of the regression (also referred to as root MSE or residual standard error, among other terms), which is 9.57. Roughly speaking, this value of the standard error of the regression means that the observations are on average within 9.57 units of their fitted values, meaning that on average the fitted values are within $9,570 of actual salary.21 The R2 is 0.89. That's pretty high. Is that value high enough? We can't answer that in the abstract.
In panel (b) of Figure 3.9, the observations are more widely dispersed. Not as good a fit.
21 We say “roughly speaking” because this value is actually the square root of the average of the squared residuals.
The intuition for that value is the same, but it’s quite a mouthful.
FIGURE 3.9: Salary at age 30 (in $1,000) plotted against years of education for two hypothetical data sets. Panel (a): σ̂² = 91.62, σ̂ = 9.57, R2 = 0.89. Panel (b): σ̂² = 444.2, σ̂ = 21.1, R2 = 0.6.
The variance of the regression is 444.2. As with panel (a), it's not really clear what to make of that number. The standard error of the regression is 21.1, meaning that the observations are on average within $21,100 of actual salary. The R2 is 0.6. Is that good enough? Again, we can't say in the abstract.
Remember This
There are four ways to assess goodness of fit.
1. The variance of the regression (σ̂²). This value is used in the equation for var(β̂1). It is hard to interpret directly.
2. The standard error of the regression (σ̂). It is measured on the same scale as the dependent variable and roughly corresponds to the average distance between fitted values and actual values.
3. Scatterplots can be quite informative about not only goodness of fit but also
possible anomalies and outliers.
4. R2 is a widely used measure of goodness of fit.
• It is the square of the correlation between the fitted and observed values of
the dependent variable.
• R2 ranges from 0 to 1.
• A high R2 is neither necessary nor sufficient for an analysis to be useful.
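To tie these measures together, here is a small R sketch (not from the text; the salary and education numbers are made up) showing how to pull the variance of the regression, the standard error of the regression, and R2 out of a fitted model, and checking the squared-correlation interpretation of R2:

set.seed(1)
education <- runif(100, 0, 16)                        # hypothetical years of education
salary <- 10 + 4 * education + rnorm(100, sd = 10)    # hypothetical salary in $1,000
fit <- lm(salary ~ education)
summary(fit)$sigma^2                                  # variance of the regression
summary(fit)$sigma                                    # standard error of the regression
summary(fit)$r.squared                                # R2 reported by R
cor(fitted(fit), salary)^2                            # same value: squared correlation of fitted and actual values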
Researchers tested the idea that taller people earn more by analyzing data on height and wages from a nationally representative
sample. Much of their analysis used multivariate techniques discussed in Chapter 5, but
we’ll use bivariate OLS to start thinking about the issue. They limited their data set to
white males in order to avoid potentially important (and unfair) influences of race and gen-
der on wages. (We look at other groups in the homework for Chapter 5.)
Figure 3.10 shows the data. On the X-axis is the adult height of each guy and on the
Y-axis is his wage in 1996. The relationship is messy, but that's not unusual.
FIGURE 3.10: Scatterplot of hourly wages (in $) against height in inches.
The figure includes a fitted regression line based on the following regression model:
Wagesi = β0 + β1 Heighti + εi
The results reported in Table 3.2 look pretty much like the results that any statistical software will burp out. The estimated coefficient on adult height (β̂1) is 0.412. The standard error estimate will vary depending on whether we assume errors are homoscedastic or not. The column on the left shows that if we assume homoscedasticity (and therefore use Equation 3.9), the estimated standard error of β̂1 is 0.0976. The column on the right shows that if we allow for heteroscedasticity, the estimated standard error for β̂1 is 0.0953. It's not much of a difference in this case, but the two approaches to estimating standard errors can differ more substantially for other examples. The estimated constant (β̂0) is -13.093, with standard error estimates of 6.897 and 6.681, depending on whether or not we allow for heteroscedasticity. Notice that the β̂0 and β̂1 coefficients are identical across the columns, as the heteroscedasticity-consistent approach changes only the standard errors, not the coefficient estimates.
What, exactly, do these numbers mean? First, let's interpret the slope coefficient, β̂1. A coefficient of 0.41 on height implies that a one-inch increase in height is associated with a wage that is about 41 cents per hour higher.
The interpretation of the constant, β̂0, is that someone who is zero inches tall would get negative $13.09 an hour. Hmmm. Not the most helpful piece of information. What's going on is that most observations of height (the X variable) are far from zero (they are mostly between 60 and 75 inches). To get the regression line to go through this data it needs to cross the Y axis at -13.09 for people who are zero inches tall. This example explains why we don't spend a lot of time on β̂0. It's hard to imagine what kind of sicko would want to know – or believe – the extrapolation of our results to such little tiny people.
If we don't care about β̂0, why do we have it in the model? It plays a very important role. Remember that we're fitting a line, and the value of β̂0 pins down where the line starts when X is zero. If we do not estimate the parameter, that's the same as setting β̂0 to be zero (because the fitted value would be Ŷi = β̂1 Xi, which is zero when Xi = 0). Forcing β̂0 to be zero will typically lead to a much worse model fit than letting the data tell us where the line crosses the Y axis.
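As a quick illustration of how the two estimates work together, here is a tiny R calculation (not from the text) of the fitted wage for someone of roughly average height, say 70 inches, using the coefficients reported in Table 3.2:

b0 <- -13.093        # estimated constant
b1 <- 0.412          # estimated coefficient on height
b0 + b1 * 70         # fitted hourly wage of about $15.75 for a 70-inch-tall person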
The results are not only about the estimated coefficients. They also include standard
errors. These are quite important as they give us a sense of how accurate our estimates are.
The standard error estimates come from the data and tell us how wide the distribution of β̂1 is. If the standard error of β̂1 is huge, then we should not have much confidence that our β̂1 is necessarily close to the true value. If the standard error of β̂1 is small, then we should have more confidence that our β̂1 is close to the true value.
Are these results the final word on the relationship between height and wages? (Hint: NO!) As with most observational data, a bivariate analysis may not be sufficient. We should
worry about endogeneity. In other words, there could be elements in the error term (factors
that influence wages but have not been included in the model) that could be correlated with
adult height and, if so, then the result that height causes wages to go up may be incorrect.
Can you think of anything in the error term that is correlated with height? We come back
to this question in Chapter 5 on page 199, where we revisit this data set.
The variance of the regression is, as usual, pretty hard to get our heads around. Much more useful is the standard error of the regression, σ̂, which is 11.9, meaning roughly that the average distance between fitted and actual wages is almost $12 per hour. In other words, the fitted values really aren't particularly accurate. The R2 is 0.01. This value is low, but as we said earlier, there is no set standard for R2.
One reasonable concern might be that we should be wary of the OLS results because the
model fit seems pretty poor. That’s not how it works. The coefficients give us the best
estimates given the data. The standard errors of the coefficients incorporate the poor fit (via σ̂²). So, yes, the poor fit matters, but it's something that is incorporated into the OLS estimation process.
3.8 Outliers
One practical concern we have in statistics is dealing with outliers, observations that are extremely different from the rest of the sample. The concern is that a single goofy observation can unduly influence our results.
We saw on page 53 that Washington, DC is quite an outlier in a plot of crime data for
the U.S. states and DC. Figure 3.11 shows a scatterplot of violent crime and percent urban.
Imagine drawing an OLS line by hand when Washington, DC is included. Then imagine
drawing an OLS line by hand when Washington, DC is excluded. The line with Washington
DC will be steeper as it will need to get close to the Washington, DC observation; the
line without Washington, DC will be flatter because it can stay in the mass of the data
without worrying about Washington, DC. Hence a reasonable person may worry that the
Washington, DC data point could substantially influence the estimate. On the other hand, if
we were to remove an observation in the middle of the mass of the data, such as Oklahoma, the fitted line would barely move.
FIGURE 3.11: Scatterplot of Violent Crime and Percent Urban (violent crime rate per 100,000 people on the Y-axis, percent urban on the X-axis; Washington, DC stands out as an outlier).
We can see the effect of including and excluding DC in Table 3.3 which shows bivariate
OLS results in which violent crime rate is the dependent variable. In the first column, percent
urban is the independent variable and all states plus DC are included (therefore the N is 51).
The coefficient is 5.61 with a standard error of 1.8. The results in the second column are based on data without Washington, DC (dropping the N to 50). The coefficient is quite
a bit smaller, coming in at 3.58, which is consistent with our intuition from our imaginary
line drawing.
The table also shows bivariate OLS coefficients for a model with single-parent percent as
the independent variable. The coefficient when including DC is 23.17. When we exclude
DC, the estimated relationship weakens to 16.91. We see a similar pattern with crime and poverty.
Figure 3.12 shows scatterplots of the data with the fitted lines included. The fitted lines
based on all data are the solid lines and the fitted lines when DC is excluded are the dashed
FIGURE 3.12: Scatterplots of Crime Against Percent Urban, Single Parent, and Poverty with OLS Fitted Lines (solid lines are fitted with DC included; dashed lines are fitted with DC excluded).
lines. In every case, the fitted lines including DC are steeper than the fitted lines when DC
is excluded.
So what are we to conclude here? Which results are correct? There may be no clear
answer. The important thing is to appreciate that the results in these cases depend on a
single observation. In such cases, we need to let the world know. We should show results
with and without the excluded observation and justify substantively why an observation
might merit exclusion. In the case of the crime data, for example, we could exclude DC on the grounds that it is a city rather than a state.
Outlier observations are more likely to influence OLS results when there is only a small
number of observations. Given that OLS will minimize the sum of squared residuals from
the fitted line, a single observation is more likely to play a big role when there are only a few
residuals to be summed. When data sets are very large, a single observation is less likely to exert much influence.
An excellent way to identify potentially influential observations is to plot the data and
look for unusual observations. If an observation looks out-of-whack, it’s a good idea to
run the analysis without it to see if the results change. If they do, we need to explain the difference and report results both ways.24
24 Most statistical packages can automatically assess the influence of each observation. For a sample size N, these commands essentially run N separate OLS models, each one excluding a different observation. For each of these N regressions, the command stores a value indicating how much the coefficient changes when that particular observation is excluded. The resulting output reflects how much the coefficients change with the deletion of each observation. In Stata, the command is dfbeta, where the "df" refers to difference and "beta" refers to β̂. In other words, the command will tell us for each observation the difference in estimated β̂s when that observation is deleted. In R, the command is also called dfbeta.
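Here is a minimal R sketch of that idea (the data are made up for illustration); dfbeta() reports, for each observation, how much each coefficient changes when that observation is dropped:

set.seed(7)
urban <- c(runif(50, 40, 95), 100)                            # hypothetical percent urban; last value mimics DC
crime <- c(200 + 4 * urban[1:50] + rnorm(50, sd = 80), 1400)  # hypothetical crime rates with one outlier
fit <- lm(crime ~ urban)
delta <- dfbeta(fit)                # change in each coefficient from deleting each observation
head(delta)
which.max(abs(delta[, "urban"]))    # observation whose removal moves the slope coefficient the most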
Remember This
Outliers are observations that are very different from other observations.
1. When sample sizes are small, a single outlier can exert considerable influence on
OLS coefficient estimates.
2. Scatterplots are useful to identify outliers.
3. When a single observation substantially influences coefficient estimates, we should
• Inform readers of the issue.
• Report results with and without the influential observation.
• Justify including or excluding the observation.
3.9 Conclusion
Ordinary least squares is an odd name that refers to the way in which the β̂ estimates are produced. That's fine to know, but the real key to understanding OLS is appreciating the properties of those estimates.
The most important property of OLS estimates is that they are unbiased if X is uncorrelated with the error. We've all heard "correlation does not imply causation." "Regression does not imply causation" is every bit as true. If there is endogeneity, we may observe a big coefficient even when X has no causal effect on Y.
OLS estimates have many other useful properties. With a large sample size, β̂1 is a normally distributed random variable. The variance of β̂1 reflects the width of the β̂1 distribution and is determined by the fit of the model (the better the fit, the thinner), the sample
size (the more data, the thinner), and the variance of X (the more variance, the thinner). If the errors satisfy the homoscedasticity and no-correlation conditions, the variance of β̂1 is defined by Equation 3.9 on page 95. If the errors are heteroscedastic or correlated with each other, OLS still produces unbiased coefficients, but we will need other tools, covered here and in Chapter 13, to estimate standard errors appropriately.
• Section 3.1: Write out the bivariate regression equation and explain all its elements
(dependent variable, independent variable, slope, intercept, error term). Draw a hypo-
thetical scatterplot with a small number of observations and show how bivariate OLS
is estimated, identifying residuals, fitted values, and what it means to be a best-fit line.
Sketch an appropriate best-fit line and identify β̂0 and β̂1 on the sketch. Write out the equation for the fitted line.
• Section 3.2: Explain why β̂1 is a random variable and sketch its distribution. Explain the sources of this randomness (sampling and modeled randomness).
• Section 3.3: Explain what it means for the OLS estimate β̂1 to be an unbiased estimator.
• Section 3.4: Write out the standard equation for the variance of β̂1 in bivariate OLS.
• Section 3.6: Identify the conditions required for the standard variance equation for β̂1 to be accurate. Explain why these two conditions are less important than the
exogeneity condition.
• Section 3.7: Explain four ways to assess goodness of fit. Explain why R2 alone does not tell us whether an analysis is useful.
• Section 3.8: Explain what outliers are, how they can affect results, and what to do
about them.
Further Reading
Beck (2010) provides an excellent discussion of what to report from a regression analysis.
Weighted least squares is a type of generalized least squares that can be used when dealing
with heteroscedastic data. Chapter 8 of Kennedy (2008) discusses weighted least squares and
other issues when dealing with errors that are heteroscedastic or correlated with each other.
These issues are often referred to as violations of a “spherical errors” condition. The term
spherical errors is pretentious statistical jargon that means errors are both homoscedastic and uncorrelated with each other.
Murray (2006, 500) provides a good discussion of probability limits and consistency for
OLS estimates.
We discuss what to do with autocorrelated errors in Chapter 13. The Further Reading
section at the end of that chapter provides links to the very large literature on time series
data analysis.
Key Terms
• Autocorrelation (105)
• Bias (87)
• Central Limit Theorem (84)
• Consistency (100)
• Continuous variable (81)
• Degrees of freedom (151)
• Distribution (80)
• Fitted value (70)
• Heteroscedastic (103)
• Heteroscedasticity-consistent standard errors (103)
• Homoscedastic (103)
• Modeled randomness (79)
• Normal distribution (83)
• Outlier (117)
• Predicted values (70)
• Probability density (83)
• Probability distribution (81)
• Probability limit (99)
• Random variable (80)
• Regression line (70)
• Residual (70)
• Sampling randomness (79)
• Standard error (93)
• Standard error of the regression (107)
• Time series data (105)
• Unbiased estimator (87)
• Variance (93)
• Variance of the regression (95)
Computing Corner
Stata
1. Using the donut and weight data described in Chapter 1 on page 57, estimate a bi-
variate OLS regression by typing reg weight donuts. The command “reg” stands for
“regression.” The general format is reg Y X for a dependent variable Y and independent
variable X.
Stata’s regression output looks like this:
Source | SS df MS Number of obs = 13
-------------+------------------------------ F( 1, 11) = 22.48
Model | 46731.7593 1 46731.7593 Prob > F = 0.0006
Residual | 22863.933 11 2078.53936 R-squared = 0.6715
-------------+------------------------------ Adj R-squared = 0.6416
Total | 69595.6923 12 5799.64103 Root MSE = 45.591
------------------------------------------------------------------------------
weight | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
donuts | 9.103799 1.919976 4.74 0.001 4.877961 13.32964
_cons | 122.6156 16.36114 7.49 0.000 86.60499 158.6262
------------------------------------------------------------------------------
There is a lot of information here, not all of which is useful. The vital information is in the bottom table that shows that β̂1 is 9.10 with a standard error of 1.92 and β̂0 is 122.62 with a standard error of 16.36. We cover t, P>|t| and 95% confidence intervals in Chapter 4.
The column on the upper right has some useful information too, indicating the number of observations, R2, and Root MSE. (As we noted in the chapter, Stata refers to the standard error of the regression, σ̂, as root MSE, which is Stata's shorthand for the square root of the mean squared error.) We discuss the adjusted R2 in Section 5.4. The F and Prob > F on the right side of the output relate to information that we cover on page 343; it's generally not particularly useful.
The table in the upper left is pretty useless. Contemporary researchers seldom use the
information in the Source, SS, df, and MS columns.
2. In Stata, commands often have subcommands that are invoked after a comma. To
estimate the model with heteroscedasticity-consistent standard errors (as discussed on
page 103) simply add the , robust subcommand to Stata’s regression command. For
example: reg weight donuts, robust.
3. To generate predicted values, type predict YourNameHere after running an OLS model.
This command will create a new variable named “YourNameHere.” In our example, we
name the variable Fitted: predict Fitted. A variable containing the residuals is cre-
ated by adding a “, residuals” subcommand to the predict command: predict
Residuals, residuals.
We can display the actual values, fitted values, and residuals with the list command:
list weight Fitted Residuals.
4. To run a regression that excludes certain observations, add an if statement. For example, to include only observations for which name is not Homer, run reg weight donuts if name !="Homer". In this
example, we use quotes because the name variable is a string variable, meaning it is not
a number. If we want to include only observations where weight is greater than 100 we
can type reg weight donuts if weight > 100.
R
1. The following commands use the donut data from Chapter 1 on page 59. R is an
object-oriented language, which means that our regression commands create objects
containing information which we ask R to display as needed. To estimate an OLS
regression, we create an object called OLSResults (we could choose a different name)
by typing OLSResults = lm(weight ~ donuts). This command stores information
about the regression results in an object called OLSResults. The lm command stands
for "linear model" and is the R command for OLS. The general format is lm(Y ~ X)
for a dependent variable Y and independent variable X. To display these regression
results, type summary(OLSResults), which produces
lm(formula = weight ~ donuts)
Residuals:
Min 1Q Median 3Q Max
-93.135 -9.479 0.757 35.108 55.073
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.616 16.361 7.494 0.0000121
donuts 9.104 1.920 4.742 0.000608
Residual standard error: 45.59 on 11 degrees of freedom
Multiple R-squared: 0.6715, Adjusted R-squared: 0.6416
F-statistic: 22.48 on 1 and 11 DF, p-value: 0.0006078
The vital information is in the bottom table that shows that β̂1 is 9.104 with a standard error of 1.920 and β̂0 is 122.616 with a standard error of 16.361. We cover t value and
Pr(>|t|) in Chapter 4.
R refers to the standard error of the regression (σ̂) as the residual standard error and lists it below the regression results. Next to that is the degrees of freedom. To calculate the number of observations in the data set analyzed, recall that degrees of freedom equals N − k. Since we know k (the number of estimated coefficients) is 2 for this model, we can infer the sample size is 13. (Yes, this is probably more work than it
should be to display sample size.)
The multiple R2 (which is just the R2 ) is below the residual standard error. We dis-
cuss the adjusted R2 later on page 232. At the bottom is an F-statistic and related
information; this refers to a test we cover on page 343. It’s usually not a center of
attention.
The information on residuals at the top is pretty useless. Contemporary researchers
seldom use that information.
2. The regression object created by R contains lots of other information as well. The
information can be listed by typing the object name, a dollar sign, and the appropriate
syntax. For example, the fitted values for a regression model are stored in the format of
Object$fitted.values. In our case, they are OLSResults$fitted.values. For more
details, type help(lm) in R and look for the list of components associated with “objects
of class lm,” which is R’s way of referring to the regression results like we created above.
To see the fitted values, type OLSResults$fitted.values, which produces
1 2 3 4 5 6 ...
250.0688 122.6156 122.6156 168.1346 309.2435 129.4435 . . .
To see the residuals, type OLSResults$residuals, which produces
1 2 3 4 5 6 ...
24.9312070 18.3843881 -52.6156119 -93.1346052 0.7565158 -49.4434609 . . .
3. To create a scatterplot with a regression line included we can type26
plot(donuts, weight)
abline(OLSResults)
4. One way to exclude an observation from a regression is to use brackets to limit the vari-
able to only those observations for which the condition in the brackets is true; to indicate
a “not equal” condition use “!=”. In other words, weight[name != "Homer"] refers to
values of the weight variable for which the name variable is not equal to “Homer.”
To run a regression on data that excludes observations for which name is Homer, run
OLSResultsNoHomer = lm(weight[name != "Homer"] ~ donuts[name != "Homer"]).
Here we use quotes because the name variable is a string variable, meaning it is not a
number.27 If we want to include only observations where weight is greater than 100 we
can type OLSResultsNoLow = lm(weight[weight>100] ~ donuts[weight>100]).
5. There are a number of ways to estimate the model with heteroscedasticity-consistent
standard errors (as discussed on page 103). The easiest may be to use an R package,
26 Figure 3.10 jittered the data to make it a bit easier to see more data points. To jitter data in an R plot, type plot(jitter(donuts), jitter(weight)).
27 There are more efficient ways to exclude data when using data frames. For example, if the variables are all included in a data frame called dta, we could type OLSResultsNoHomer = lm(weight ~ donuts, data = dta[name != "Homer", ]).
which is a set of R commands that we install for specific tasks. For heteroscedasticity-
consistent standard errors, the AER package is useful. There are two steps to using
this package.
(a) We need to install the package. Type install.packages("AER"). R will ask us
to pick a location – this is the source where we get the package. It doesn’t matter
where we pick. We can also install a package manually from the packages command
in the toolbar. We need to do the installation only once on each computer. The
package will be saved and available for use by R.
(b) Every time we open R and want to use the commands in the AER (or other)
package, we need to tell R to load the package. We do this with the library
command. We have to use the library command in every session we use a package.
Assuming the AER package has been installed, we can run OLS with heteroscedasticity-
consistent standard errors via the following code:
library(AER)
OLSResults = lm(weight ~ donuts)
coeftest(OLSResults, vcov = vcovHC(OLSResults, type = "HC1"))
The last line is elaborate. The command coeftest is asking for information on the
variance of the estimates (among other things) and the vcov=vcovHC part of the com-
mand is asking for heteroscedasticity-consistent standard errors. There are multiple
ways to estimate such standard errors and the HC1 asks for the most commonly used
form of these standard errors.28
Exercises
1. Use the data in PresVote.dta to answer the following questions about the relationship
between changes in real disposable income and presidential election results. Table 3.4
describes the variables.
a. Create a scatter plot like Figure 3.1.
b. Estimate an OLS regression in which the vote share of the incumbent party is re-
gressed on change in real disposable income. Report the estimated regression equa-
tion and interpret the coefficients.
c. What is the fitted value for 1996? For 1972?
28 The vcov terminology is short for variance-covariance and the vcovHC terminology is short for heteroscedasticity-
consistent standard errors.
Table 3.4: Variables for Questions on Presidential Elections and the Economy
Salaryi = β0 + β1 Educationi + εi
For this problem, we are going to assume that the true model is
Salaryi = 10,000 + 1,000Educationi + εi
The model indicates that the salary for each person is $10,000 plus $1,000 times the number of years of education plus the error term for the individual. Our goal is to explore how much our estimate β̂1 varies.
Enter the following code into a Stata .do file. It will simulate a data set with 100
observations (as determined with the set obs command). Values of education for each
observation are between 0 and 16 years. The error term is normally distributed with a standard deviation of 10,000 (as determined with the scalar SD command).
program OLS_Sim
clear
set obs 100 /* Set sample size */
gen Ed = 16*runiform() /* Generate education (ind. variable) */
scalar SD = 10000 /* Set value of standard deviation of error term */
The simulate line runs the code 50 times (as determined in the reps(50) command)
and will save the β̂ coefficients and standard errors for each simulation. The values of β̂Education for each simulation are listed in a variable called _b_Ed; the values of β̂0 for each simulation are listed in a variable called _b_cons. The values of se(β̂Education) for each simulation are listed in a variable called _se_Ed; the values of se(β̂0) for each simulation are listed in a variable called _se_cons.
We can look at the estimated coefficients (via the list command) and summarize them
(via the summarize command):
list _b_* _se* /* Coefficient estimates & std. errors for each simulation */
summarize _b* /* Summarize coefficient estimates for each simulation */
a. Explain why the means of the estimated coefficients across the multiple simulations
are what they are.
b. What are the minimum and maximum values of the estimated coefficients on ed-
ucation? Explain whether these values are inconsistent with our statement in the
chapter that OLS estimates are unbiased.
c. Re-run the simulation with a larger sample size in each simulation. Specifically, set
the sample size to 1,000 in each simulation. (Do this by changing the set obs line of
the code.) Compare the mean, minimum, and maximum of the estimated coefficients
on education to the original results above.
d. Re-run the simulation with a smaller sample size in each simulation. Specifically,
set the sample size to 20 in each simulation. Compare the mean, minimum, and
maximum of the estimated coefficients on education to the original results above.
e. Re-set the sample size to 100 for each simulation and re-run the simulation with a smaller standard deviation for each simulation. Specifically, set SD to 500 for each simulation. (Do this by changing the scalar SD line of the code.) Compare
the mean, minimum, and maximum of the estimated coefficients on education to the
original results above.
f. Keeping the sample size at 100 for each simulation, re-run the simulation with a larger standard deviation for each simulation. Specifically, set SD to 50,000
for each simulation. Compare the mean, minimum, and maximum of the estimated
coefficients on education to the original results above.
g. Revert to the original model (sample size at 100 and SD at 10,000). Now run 500 simulations. (Do this by changing the simulate _b _se, reps(50) line of the code to use reps(500).)
a. Estimate a model where height at age 33 explains income at age 33. Explain β̂1 and β̂0.
b. Create a scatterplot of height and income at age 33. Identify outliers.
c. Create a scatterplot of height and income at age 33 but exclude observations with
wages per hour more than 400 British pounds and height less than 40 inches. Describe
difference from the earlier plot. Which plot seems the more reasonable basis for
statistical analysis? Why?
d. Re-estimate the bivariate OLS model from part (a) but exclude four outliers with
very high wages and outliers with height below 40 inches. Briefly compare results to
earlier results.
e. What happens when the sample size is smaller? To answer this question, re-estimate
the bivariate OLS model from above (that excludes outliers) but limit the analysis
to the first 800 observations.29 Which changes more from the results with the full
29 To do this in Stata include if _n < 800 at the end of the Stata regress command. Because some observations have missing data and others are omitted as outliers, the actual sample size in the regression will fall a bit lower than 800. The "_n" notation is Stata's way of indicating the observation number, which is the row number of the observation in the data set.
sample: the estimated coefficient on height or the estimated standard error of the
coefficient on height? Explain.
4. Table 3.6 lists the variables in the WorkWomen.dta and WorkMen.dta data sets, which
are based on Chakraborty, Holter, and Stepanchuk (2012). Answer the following ques-
tions about the relationship between hours worked and divorce rates.
Table 3.6: Variables for Divorce Rate and Hours Worked
a. For each data set (for women and for men), create a scatterplot of hours worked on
the Y-axis and divorce rates on the X-axis.
b. For each data set estimate an OLS regression in which hours worked is regressed on
divorce rates. Report the estimated regression equation and interpret the coefficients.
Explain differences in coefficients, if any.
c. What are the fitted value and residual for men in Germany?
d. What are the fitted value and residual for women in Spain?
5. Use the data in Table 3.6 to answer the following questions about the relationship
between hours worked and tax rates.
a. For each data set (for women and for men), create a scatterplot of hours worked on
the Y-axis and tax rates on the X-axis.
b. For each data set estimate an OLS regression in which hours worked is regressed on
tax rates. Report the estimated regression equation and interpret the coefficients.
Explain differences in coefficients, if any.
c. What are the fitted value and residual for men in the United States?
d. What are the fitted value and residual for women in Italy?
CHAPTER 4
Hypothesis Testing and Interval Estimation: Answering Research Questions
vaccinated sheep had died. Two more unvaccinated sheep died in front of the visitors’ eyes
and the last unvaccinated sheep died the next day. Of the vaccinated sheep, only one died
and that was from symptoms inconsistent with anthrax. Nobody needed fancy statistics to
conclude the vaccine worked; they only needed masks to cover the smell.
Mostly, though, the conclusions from an experiment are not so obvious. What if the
death toll had been two unvaccinated sheep and one vaccinated sheep? That well could have
happened by chance. What if five unvaccinated sheep died and no vaccinated sheep died? That outcome would seem less likely to have happened simply by chance. But would it be enough to convince us?
These kinds of questions pervade all statistical analysis. We’re trying to answer questions
and while it’s pretty easy to see if some policy is associated with more of some outcome, it’s
much harder to know at what point we should become convinced the relationship is real,
rather than the result of the hurly-burly randomness of real life.
Statistics provides an infrastructure for answering these questions via hypothesis test-
ing. Hypothesis testing allows us to assess whether the observed data is consistent or not
with a claim of interest. The process does not yield 100 percent definitive answers, but
rather translates our statistical estimates into statements like “We are quite confident that
the vote share of the incumbent president’s party goes up in the United States when the
economy is good” or “We are quite confident that tall people get paid more.”
The standard statistical way to talk about hypotheses is a bit of an acquired taste.
Suppose there is no effect (that is, that β1 = 0). What is the probability that when we run OLS on the data we actually have, we see a coefficient as large as we actually observe? That is, suppose we want to test the claim that β1 = 0. If this claim were true (meaning β1 = 0),
what is the probability of observing a β̂1 = 0.4 or 7.2 or whatever result our OLS produced? If this probability of observing the β̂1 we actually observe is very small if β1 = 0, then we can reasonably infer that the hypothesis that β1 = 0 is probably not true.
Intuitively we know that if a treatment has no effect, the probability of seeing a huge
difference is low and the chance of seeing a small difference is large. The magic of stats – and it is quite remarkable – is that we can quantify the probability of seeing any observed difference when there is in fact no effect.
In this chapter we discuss the tools of hypothesis testing. Section 4.1 lays out the core logic
and terminology. Section 4.2 covers the workhorse of hypothesis testing, the t test. Section
4.3 introduces p-values, which are a useful byproduct of the hypothesis testing enterprise.
Section 4.4 discusses statistical power, a concept that sometimes goes underappreciated de-
spite its cool name. Power helps us appreciate the difference between finding no relationship
because there is no relationship or because we don’t have enough data. Section 4.5 discusses
some of the very real limitations to the hypothesis testing approach and Section 4.6 then
introduces the confidence interval approach to estimation, which avoids some of the problems
of hypothesis testing.
Much of the material in this chapter will be familiar to those who have had a probability and statistics course. Learning or tuning up our understanding of this material will put us in a good position for the rest of the book.
We want to use statistics to answer questions and the main way to do so is to use OLS to
assess hypotheses. In this section, we introduce the null and alternative hypotheses, apply
the concepts to our presidential election example, and then develop the important concept
of significance level.
The null hypothesis is typically a hypothesis of no effect. Consider the height and wage example from page 112:
Wagesi = β0 + β1 Heighti + εi
The standard null hypothesis is that height has no effect on wages. Or, more formally,
H0 : β1 = 0,
with the subscript zero after the H indicating this is the null hypothesis.
When we test a null hypothesis, we either "reject" or "fail to reject" it. When we reject a null hypothesis, we are actually saying that the probability of seeing the β̂1 that we estimated is very low if the null hypothesis were true. For example, it is unlikely we will observe a large β̂1 with a small standard error if the truth were β1 = 0. If we do nonetheless observe a large β̂1 with a small standard error, we will reject the null hypothesis and refer to the coefficient as statistically significant.
When we fail to reject a null hypothesis, we are saying that the β̂1 we observe would not be particularly unlikely if the null hypothesis were true. For example, we typically fail to reject the null hypothesis when we observe a small β̂1. That outcome would not be surprising at all if β1 = 0. We can also fail to reject null hypotheses when uncertainty is high. That is, a large β̂1 may not be too surprising even when β1 = 0 if the variance of β̂1 is large relative to the value of β̂1. We formalize this logic when we discuss t statistics in the next section.
The heart of proper statistical analysis is that we recognize that we might be making a mistake. When we reject a null hypothesis we are concluding that it is unlikely that β1 = 0 given the β̂1 we observe.
When we fail to reject a null hypothesis we are saying it would not surprise us if β1 = 0 given the β̂1 we observe. We are definitely not saying that we know that β1 = 0 when we fail to reject the null. Instead, the situation is like when a jury says "not guilty"; the accused is not necessarily innocent, but there was not enough evidence to conclude otherwise.
We characterize possible mistakes in two ways. Type I errors occur when we reject
a null hypothesis even when it is true. If we say height increases wages, but actually it
doesn’t, we’re committing a Type I error. Type II errors occur when we fail to reject a
null hypothesis even when it is false. If we say that there is no relationship between height
and wages, but there actually is one, we’re committing a Type II error. Table 4.1 summarizes
this terminology.
Standard hypothesis testing focuses heavily on Type I error. That is, the approach is
                           β1 ≠ 0                                          β1 = 0
Reject H0                  Correct inference                               Type I error: wrongly reject null
Fail to reject H0          Type II error: wrongly fail to reject null      Correct inference
built around specifying an acceptable level of Type I error and proceeding from there. We
should not forget Type II error, though. There are many situations in which we have to take
the threat of Type II error seriously; we discuss these when we discuss statistical power in
Section 4.4.
If we reject the null hypothesis, we accept the alternative hypothesis. We do not prove
the alternative hypothesis is true. Rather, the alternative hypothesis is the idea we hang
onto when we have evidence that is inconsistent with the null hypothesis.
A one-sided alternative hypothesis has a direction. For example, if we have theoretical reasons to believe that being taller increases wages, then the alternative hypothesis for the following model
Wagesi = β0 + β1 Heighti + εi
is HA : β1 > 0. A two-sided alternative hypothesis does not have a direction: if we believe height affects wages but we're not sure whether tall people get paid more or less, then the alternative hypothesis is HA : β1 ≠ 0. It seems reasonable to believe that we should have at least an idea of the direction of the
coefficient on our variable of interest, implying that two-sided alternatives might be rare.
They are not, however, in part because they are more statistically cautious in the manner
we discuss below.
Null and alternative hypotheses are how we translate substantive ideas into statistical tests. For published work, it is generally a breeze to identify null hypotheses: Just find the β̂ that the authors jabber on most about. The main null hypothesis is typically that this coefficient is zero.
OLS coefficients under the null hypothesis for the presidential election example
With a null hypothesis in hand, we can move toward serious statistical analysis. Let's consider the presidential-election example that opened Chapter 3. To identify a null hypothesis, we start with the model
Vote sharet = β0 + β1 Change in incomet + εt
where Vote sharet is the percent of the vote received by the incumbent president's party in year t and the independent variable, Change in incomet, is the percent change in real disposable income in the United States in the year before the presidential election. The null hypothesis is H0 : β1 = 0.
What is the distribution of β̂1 under the null hypothesis? Pretty simple: It is a normally distributed random variable centered on zero because OLS produces unbiased estimates and, if the true value of β1 is zero, then an unbiased distribution of β̂1 will be centered on zero.
How wide is the distribution of β̂1 under the null hypothesis? In contrast to the mean of the distribution, which we know under the null, the width depends on the data and the standard error implied by the data. In other words, we allow the data to tell us the standard error.
Table 4.2 shows the results for the model. Of particular interest for us at this point is that the standard error of the β̂1 estimate is 0.52. This number tells us how wide the distribution of β̂1 is under the null hypothesis.
With this information we can picture the distribution of β̂1 under the null. Specifically, Figure 4.1 shows the probability density function of β̂1 under the null hypothesis, which is a normal probability density centered at zero with a standard deviation of 0.52. We also refer to this as the distribution of β̂1 under the null hypothesis. We introduced probability density functions in Section 3.2 of Chapter 3 and discuss them in further detail in the appendix.
Figure 4.1 illustrates the key idea of hypothesis testing. The actual value of β̂1 that we estimated is 2.29. That number seems pretty unlikely, doesn't it? Most of the distribution
FIGURE 4.1: Distribution of β̂1 Under the Null Hypothesis for Presidential Election Example (a distribution centered at zero with a standard error of 0.52; the actual value of β̂1 is 2.29, while a value such as -0.3 is an example of a β̂1 for which we would fail to reject the null).
of β̂1 under the null hypothesis is to the left of the β̂1 observed. We formalize things in the next section, but intuitively it's reasonable to think that the observed value of β̂1 is so unlikely if the null were true that, well, the null hypothesis is probably not true.
Now name a value of β̂1 that would lead us not to reject the null hypothesis. In other words, name a value of β̂1 that is perfectly likely under the null hypothesis. We show one such example in Figure 4.1 by putting a line at β̂1 = -0.3. A value like this would be completely unsurprising if the null hypothesis were true. Hence if we observed such a value for β̂1 we would deem it to be consistent with the null hypothesis and we would not reject it.
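To put a number on "pretty unlikely," here is a small R calculation (not from the text) using the null distribution just described, a normal distribution centered at zero with a standard deviation of 0.52:

se <- 0.52                                               # standard error of the estimate
beta.hat <- 2.29                                         # estimated coefficient
pnorm(beta.hat, mean = 0, sd = se, lower.tail = FALSE)   # probability of seeing 2.29 or more if beta1 = 0
# The answer is roughly 5 in a million, which is why the estimate looks so surprising under the null.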
Significance level
Given that our strategy is to reject the null hypothesis when we observe a β̂1 that is quite unlikely under the null hypothesis, the natural question is: Just how unlikely does β̂1 have to be? We get to choose the answer to this question. In other words, we get to decide our standard for what we deem to be sufficiently unlikely to reject the null hypothesis. We'll call this probability the significance level and denote it with α (the Greek letter alpha). A significance level determines how unlikely a result has to be under the null hypothesis for us to reject the null hypothesis. A very common significance level is 5 percent (meaning α = 0.05).
If we set α = 0.05, then we reject the null when we observe a β̂1 that is so large that
we would expect only a 5 percent chance of seeing the observed value or higher under the null hypothesis. Setting α = 0.05 means that there is a 5 percent chance that we would see a value high enough to reject the null hypothesis even when the null hypothesis is true; in other words, α is the probability of Type I error.
If we want to be more cautious (in the sense of requiring a more extreme result to reject the null hypothesis) we can choose α = 0.01, in which case we will reject the null only if we observe a β̂1 that has no more than a 1 percent chance of occurring under the null hypothesis. There is a trade-off, however: as the probability of making a Type I error decreases, the probability of making a Type II error increases. In other words, the
more we say we’re going to need really strong evidence to reject the null hypothesis (which
is what we say when we make – small), the more likely we are going to fail to reject the null
hypothesis even when the null hypothesis is wrong (which is Type II error).
Remember This
1. A null hypothesis is typically a hypothesis of no effect that we write as H0 : β1 = 0.
• We reject a null hypothesis when the statistical evidence is inconsistent with
the null hypothesis. A coefficient estimate is statistically significant if we
reject the null hypothesis that the coefficient is zero.
• We fail to reject a null hypothesis when the statistical evidence is consistent
with the null hypothesis.
• Type I error occurs when we wrongly reject a null hypothesis.
• Type II error occurs when we wrongly fail to reject a null hypothesis.
2. An alternative hypothesis is the hypothesis we accept if we reject the null hypothesis.
• We choose a one-sided alternative hypothesis if theory suggests β1 > 0 or theory suggests β1 < 0.
• We choose a two-sided alternative hypothesis if theory does not provide guidance as to whether β1 is greater than or less than zero.
3. The significance level (α) refers to the probability of Type I error for our hypothesis test. We choose the value of the significance level, typically 0.01 or 0.05.
4. There is a trade-off between Type I and Type II error. If we lower α we decrease the probability of making a Type I error, but increase the probability of making a Type II error.
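As a rough illustration of how the choice of α maps into a rejection threshold, here is a short R sketch (not from the text) using the normal approximation and the 0.52 standard error from the presidential-election example; the exact cutoffs come from the t distribution discussed in the next section:

se <- 0.52              # standard error of the estimated coefficient
qnorm(1 - 0.05) * se    # one-sided cutoff for alpha = 0.05: reject if the estimate exceeds about 0.86
qnorm(1 - 0.01) * se    # one-sided cutoff for alpha = 0.01: a more demanding cutoff of about 1.21
# Lowering alpha pushes the cutoff out, making Type I errors rarer but Type II errors more likely.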
Discussion Questions
4.2 t tests
The most common tool we use for hypothesis testing in OLS is the t test. There's a quick rule of thumb for t tests: If the absolute value of β̂1/se(β̂1) is bigger than 2, reject the null hypothesis (recall that se(β̂1) is the standard error of our coefficient estimate). If not, don't.
In this section we provide the logic and tools of t testing so that we can be more precise.
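Applying the rule of thumb to the presidential-election estimates reported earlier, a quick R check (not from the text):

beta.hat <- 2.29        # estimated coefficient on change in income
se <- 0.52              # its standard error
abs(beta.hat / se)      # about 4.4, comfortably bigger than 2, so the rule of thumb says reject the null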
To put our t tests in context, let's begin with the fact that we have calculated β̂1 and are trying to figure out if β̂1 would be highly surprising if the null hypothesis were true. A challenge is that the scale of our β̂1 could be anything. In our presidential-election model above, we estimated β̂1 to be 2.29. Is that estimate surprising under the null? As we saw in Figure 4.1 on page 143, it is unlikely to observe a β̂1 that big when the standard error of β̂1 is only 0.52. What if the standard error of β̂1 were 2.0? The distribution of β̂1 under the null hypothesis would still be centered at zero, but it would be really wide, as in Figure 4.2. In this case, it really wouldn't be so surprising to see a β̂1 of 2.29 if the null hypothesis that β1 = 0 were true.
What we really care about is not the β̂1 coefficient estimate by itself, but rather how large the β̂1 coefficient is relative to its standard error. In other words, it is unlikely to observe a β̂1 coefficient that is much bigger than its standard error, which will place it outside the range we would typically see if the null hypothesis were true. Therefore we use a test statistic that consists of the estimated coefficient divided by the estimated standard deviation of the coefficient: β̂1/se(β̂1). Thus our test statistic reflects how many standard errors above or below zero the estimated coefficient is. If β̂1 is 6 and se(β̂1) is 2, then our test statistic will be 3 because the estimated coefficient is 3 standard errors above zero. If the standard error had been 12 instead, then the value of our test statistic would be only 0.5.
FIGURE 4.2: Distribution of β̂1 Under the Null Hypothesis with Larger Standard Error for Presidential-Election Example (standard error of 2.0; the actual value of β̂1 is 2.29).
The t distribution
Dividing β̂1 by its standard error solves the scale problem, but introduces another challenge: the ratio is itself a random variable because it depends on the estimated β̂1 and its estimated standard error. What distribution does it follow? It's a tricky question and now is a good time to turn to our friends at Guinness Brewing for help. Really. Not for what you might think, but for work they did in the early twentieth century demonstrating that the distribution of β̂1/se(β̂1) follows a distribution we call the t distribution.1 The t distribution is bell-shaped like a normal distribution but has "fatter tails."2 We say it has fat tails because the values on the far left and far right have higher probabilities than for the normal distribution. The extent of these chubby tails depends on the sample size; as the sample size gets bigger, the tails melt down to become the same as the normal distribution. What's going on is that we need to be more cautious about rejecting the null because it is possible that by chance our estimate of se(β̂1) will be too small, which will make β̂1/se(β̂1) look like it's really big. When we have small amounts of data, the issue is serious because we will be
1 Like many statistical terms, the t distribution and t test have quirky origins. William Sealy Gosset devised the test in 1908 working for Guinness Brewery in Dublin. His pen name was "Student." There already was an s test (now long forgotten) so Gosset named his test and distribution after the second letter of his pen name. The standard error of β̂1 follows a statistical distribution called a χ2 distribution, and the ratio of a normally distributed random variable and a χ2 random variable follows a t distribution. More details are in the appendix.
2 That’s a statistical term. Seriously.
c
•2014 Oxford University Press 150
Chapter 4. Hypothesis Testing and Interval Estimation: Answering Research Questions
quite uncertain about se(—ˆ1 ); when we have lots of data, we’ll be more confident about our
estimate of se(—ˆ1 ) and, as we’ll see, the fat tails of the t distribution fade away and the t
The specific shape of a t distribution depends on the degrees of freedom, which is sample
size minus the number of parameters. A bivariate OLS model estimates two parameters (β̂₀
and β̂₁), which means, for example, that the degrees of freedom for a bivariate OLS model
equal N − 2.
Figure 4.3 displays three different t distributions with a normal distribution plotted in
the background of each panel as a dotted line. Panel (a) shows a t distribution with degrees
of freedom equal to 2. Check out those fat tails. The probability of observing a value as
high as 3 is higher for the t distribution than for the normal distribution. The same thing
goes for the probability of observing a value as low as −3. Panel (b) of Figure 4.3 shows
a t distribution with degrees of freedom equal to 5. If we look closely, we can see some
chubbiness in the tails as the t distribution has higher probabilities at, for example, values
greater than two. We have to look pretty closely to see that, though. Panel (c) shows a
t distribution with degrees of freedom equal to 50. It looks just like a
normal distribution and, in fact, covers up the normal distribution so we cannot see it.
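To see the fat tails numerically, we can compare upper-tail probabilities from the t distribution with those from the normal distribution. A small R sketch using base R’s pt() and pnorm() functions:

# Probability of observing a value greater than 3 under each distribution
pt(3, df = 2, lower.tail = FALSE)    # about 0.048 with 2 degrees of freedom
pt(3, df = 5, lower.tail = FALSE)    # about 0.015
pt(3, df = 50, lower.tail = FALSE)   # about 0.002
pnorm(3, lower.tail = FALSE)         # about 0.001, the normal benchmark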
3 More technically, the estimate of the standard error follows a statistical distribution called a χ² distribution
(the Greek letter is chi and is pronounced “kai”). The t distribution characterizes the distribution of a normally
distributed random variable (β̂₁) divided by a χ²-distributed random variable.
[Figure 4.3 plots the probability density of β̂₁/se(β̂₁) for t distributions with 2, 5, and 50 degrees of freedom (panels (a), (b), and (c)), each with a normal distribution in the background; in panel (c) the normal distribution is covered by the t distribution.]
Critical values
Once we know the distribution of β̂₁/se(β̂₁), we can come up with a critical value. A critical
value is the threshold for our test statistic. Loosely speaking, we reject the null hypothesis if
β̂₁/se(β̂₁) (the test statistic) is greater than the critical value; if β̂₁/se(β̂₁) is below the
critical value, we fail to reject the null hypothesis.
More precisely, our specific decision rule depends on the nature of the alternative hypoth-
esis. Table 4.3 displays the specific rules. Rather than trying to memorize these rules, it is
better to concentrate on understanding the logic behind them. If the alternative hypothesis
is two-sided, then big values of β̂₁ relative to the standard error incline us to reject the
null. We don’t particularly care if they are very positive or very negative. If the alternative
hypothesis is that β₁ > 0, then only large, positive values of β̂₁ will incline us to reject the
null hypothesis in favor of the alternative hypothesis. Observing a very negative β̂₁ would
be odd, but certainly would not incline us to believe that the true value of β₁ is greater than
zero. Similarly, if the alternative hypothesis is that β₁ < 0, then only very negative values
of β̂₁ will incline us to reject the null hypothesis in favor of the alternative hypothesis. We
refer to the appropriate critical value in the table because the actual value of the critical
value will depend on whether the test is one- or two-sided, as we discuss below.
The critical value for t tests depends on the t distribution and identifies the point at
which we decide the observed β̂₁/se(β̂₁) is sufficiently unlikely under the null hypothesis that we
reject the null.
Critical values depend on α, the significance level we choose, our degrees of freedom,
and whether the alternative is one-sided or two-sided. Figure 4.4 depicts critical values for
various scenarios. We assume the sample size is large in each, allowing us to use the normal
approximation to the t distribution.
Panel (a) of Figure 4.4 shows critical values for α = 0.05 and a two-sided alternative
hypothesis. The distribution of the t statistic is centered at zero under the null hypothesis
that β₁ = 0. For a two-sided alternative hypothesis, we want to identify ranges that are far
from zero and unlikely under the null hypothesis. For α = 0.05 we want to find the range
that constitutes the least-likely 5 percent of the distribution under the null. This 5 percent
is the sum of the 2.5 percent on the far left and the 2.5 percent on the far right. Values in
these ranges are not impossible, but they are unlikely. For a large sample size, the critical
values that mark off the least-likely 2.5 percent regions of the distribution are −1.96 and
1.96.
Panel (b) of Figure 4.4 depicts the situation if we choose α = 0.01. In this case, we’re
saying we’re going to need to observe an even more unlikely β̂₁ under the null hypothesis in
order to reject the null hypothesis. The critical value for a large sample size is 2.58.
[Figure 4.4 plots the distribution of β̂₁/se(β̂₁) in three panels. Panel (a) marks the critical values −1.96 and 1.96, with 2.5% of the normal distribution to the left of −1.96 and 2.5% to the right of 1.96. Panel (b) marks −2.58 and 2.58, with 0.5% of the distribution in each tail. Panel (c) marks 1.64, with 5% of the normal distribution to its right.]
FIGURE 4.4: Critical Values for Large Sample t tests
This number defines the point at which there is a 0.005 probability (which is half of α) of being
higher than the critical value and a 0.005 probability of being less than the negative of it.
The picture and critical values differ a bit for a one-tailed test in which we look only at
one side of the distribution. Panel (c) of Figure 4.4 depicts the situation when α = 0.05 and
HA: β₁ > 0. Here, 5 percent of the distribution is to the right of 1.64, meaning that we will
reject the null hypothesis in favor of the alternative that β₁ > 0 if β̂₁/se(β̂₁) > 1.64.
Note that the one-sided critical value for α = 0.05 is lower than the two-sided critical
value. One-sided critical values will always be lower for any given value of α, meaning that
it is easier to reject the null hypothesis for a one-sided alternative hypothesis than for a two-
sided alternative hypothesis. Hence, using critical values based on a two-sided alternative is
statistically cautious in the sense that we are less likely to look like we’re over-eager to reject
the null hypothesis.
Table 4.4 displays critical values of the t distribution for one-sided and two-sided alter-
native hypotheses for common values of α. When the degrees of freedom are very small
(typically due to a small sample size), the critical values are relatively large. For example,
with 2 degrees of freedom and α = 0.05, we need to see a t stat above 2.92 to reject the
null.4 With 10 degrees of freedom, we need to see a t stat above 1.81 to reject the null. With
100 degrees of freedom, we need a t stat above 1.66 to reject the null. As the degrees of
freedom get higher, the t distribution looks more and more like a normal distribution, and
for infinite degrees of freedom it is exactly like a normal distribution, producing identical
critical values. For degrees of freedom above 100, it is reasonable to use critical values from
the normal distribution.

4 It’s unlikely that we would seriously estimate a model with 2 degrees of freedom. For a bivariate OLS model,
that would mean estimating a model with 4 observations.
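The same pattern shows up if we compute the critical values directly. Here is a small R sketch (using qt(), which also appears in the Computing Corner) that reproduces the one-sided α = 0.05 critical values just mentioned:

# One-sided critical values for alpha = 0.05 at several degrees of freedom
alpha <- 0.05
qt(1 - alpha, df = 2)     # about 2.92
qt(1 - alpha, df = 10)    # about 1.81
qt(1 - alpha, df = 100)   # about 1.66
qnorm(1 - alpha)          # about 1.64, the large-sample (normal) benchmark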
We compare β̂₁/se(β̂₁) to our critical value and reject if the magnitude is larger than the
critical value. We refer to the ratio β̂₁/se(β̂₁) as the t statistic (or, “t stat” as the kids
say). The t statistic is so named because that ratio will be compared to a critical value that
depends on the t distribution in the manner we have just outlined. Tests based on two-sided
alternative hypotheses with α = 0.05 are very common. When the sample size is large, the critical
value for such a test is 1.96. Hence the rule of thumb is that a t statistic bigger than 2 is
statistically significant.
To show t testing in action, Table 4.5 provides the results of the height and wages models
from Chapter 3 but now adds t statistics. As before, we show results using standard errors
estimated by the equation that requires errors to be homoscedastic and standard errors
estimated via an equation that allows errors to be heteroscedastic. The coefficients are the
same across the two approaches; only the standard errors, and hence the t statistics, differ.
The column on the left shows that the t statistic from the homoscedastic model for the
coefficient on adult height is 4.225, meaning that β̂₁ is 4.225 standard deviations away from
zero. The t statistic from the heteroscedastic model for the coefficient on adult height is
4.325, which is essentially the same as in the homoscedastic model. For simplicity, we’ll
focus on the results based on the homoscedastic standard errors.
Is this coefficient on adult height statistically significant? To answer that question, we’ll
need a critical value. To pick a critical value, we need to choose a one-sided or two-sided
alternative hypothesis and a significance level. Let’s start with a two-sided test and α = 0.05.
For a t distribution we also need to know the degrees of freedom. Recall that the degrees
of freedom are the sample size minus the number of parameters estimated. The smaller
the sample size, the more uncertainty we have about our standard error estimate and hence
the larger we make our critical value. Here, the sample size is 1,910 and we estimate two
parameters, so the degrees of freedom are 1,908. For a sample this large, we can reasonably
use the critical values from the last row of Table 4.4. The critical value for a two-sided test
with α = 0.05 and a high number of degrees of freedom is 1.96. Because our t statistic of
4.225 is higher than 1.96, we reject the null hypothesis. It’s that easy.
Finally, it’s worth noting that we can extend the t test logic to cases in which the null
hypothesis refers to some value other than zero. Such cases are not super common, but not
unheard of. Suppose, for example, that our null hypothesis is H0: β₁ = 7 versus HA: β₁ ≠ 7.
In this case, we simply need to check how many standard deviations β̂₁ is away from 7. We
do so by comparing (β̂₁ − 7)/se(β̂₁) against the standard critical values we developed above. More
generally, to test a null hypothesis that H0: β₁ = β^Null we look at (β̂₁ − β^Null)/se(β̂₁),
where β^Null is the value of β₁ specified in the null hypothesis.
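To make the recipe concrete, here is a minimal R sketch. The coefficient of 0.41 and standard error of 0.1 are the approximate height-and-wages values used elsewhere in this chapter; the β^Null = 7 line is purely illustrative:

# Test H0: beta1 = beta_null using t = (beta1_hat - beta_null) / se(beta1_hat)
beta1_hat <- 0.41
se_beta1 <- 0.10
beta_null <- 0                 # the usual null; replace with 7 for H0: beta1 = 7
t_stat <- (beta1_hat - beta_null) / se_beta1
crit <- qt(0.975, df = 1908)   # two-sided test, alpha = 0.05, large sample
abs(t_stat) > crit             # TRUE: reject the null hypothesis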
Remember This
1. We use a t test to test the null hypothesis H0: β₁ = 0. The steps are as follows:
(a) Choose a one-sided or two-sided alternative hypothesis.
(b) Set a significance level, α, usually equal to 0.01 or 0.05.
(c) Find a critical value based on the t distribution. This value depends on α,
whether the alternative hypothesis is one-sided or two-sided, and the degrees
of freedom (equal to sample size minus number of parameters estimated).
(d) Use OLS to estimate parameters.
• For a two-sided alternative hypothesis, we reject the null hypothesis if
|β̂₁/se(β̂₁)| > the critical value. Otherwise, we fail to reject the null hypothesis.
• For a one-sided alternative hypothesis that β₁ > 0, we reject the null
hypothesis if β̂₁/se(β̂₁) > the critical value.
2. We can test any hypothesis of the form H0: β₁ = β^Null using |(β̂₁ − β^Null)/se(β̂₁)| as the test
statistic for a t test.
Discussion Questions
4.3 p-values
The p-value is a useful byproduct of the hypothesis testing framework. It indicates the
probability of observing a coefficient as high as we actually did if the null hypothesis were
true. In this section we explain how to calculate p-values and why they’re useful.
As a practical matter, the thing to remember is that we reject the null if the p-value
is less than –. Our rule of thumb here is “small p-value means reject”: Low p-values are
associated with rejecting the null and high p-values are associated with failing to reject the
null hypothesis.
P-values can be calculated for any null hypothesis; we focus on the most common null
hypothesis, in which β₁ = 0. Most statistical software reports a two-sided p-value, which indi-
cates the probability of observing a coefficient larger in magnitude (either positively or negatively)
than the coefficient actually observed if the null hypothesis were true.
Panel (a) of Figure 4.5 shows the p-value calculation for the β̂₁ estimate in the wage and
height example we discussed on page 158. The t statistic is 4.23. The p-value is calculated by
finding the likelihood of getting a t statistic larger in magnitude than observed under the
null hypothesis. There is a 0.0000122 probability that the t statistic will be larger than 4.23.
(In other words, there is a tiny probability we would observe a t statistic as high as 4.23 if
the null hypothesis were true.) Because the normal distribution is symmetric, there is also
a 0.0000122 probability that the t statistic will be less than −4.23. Hence the p-value will
be twice the probability of being above the observed t statistic and equals 0.0000244.5 We
see a very small p-value, meaning that the observed β̂₁ is really, really unlikely if β₁ actually
equals zero.
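The following R sketch reproduces this p-value; pt() gives the t distribution’s CDF, and 1,908 is the degrees of freedom from the height and wages example:

# Two-sided p-value: twice the probability of a t statistic beyond |observed t|
t_stat <- 4.23
2 * pt(abs(t_stat), df = 1908, lower.tail = FALSE)   # about 0.0000244
# With a sample this large the normal distribution gives nearly the same answer:
2 * pnorm(abs(t_stat), lower.tail = FALSE)           # about 0.0000234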
Suppose, however, that our β̂₁ were 0.09 (instead of the 0.41 it actually was). The t
statistic would then be 1.73. Panel (b) of Figure 4.5 shows the p-value in this case. There
is a 0.042 probability of observing a t statistic greater than 1.73 under the null hypothesis
(and a 0.042 probability of observing a t statistic less than −1.73 under the null), so the
p-value in this case would be 0.084. In this case, just by looking at the p-value we could say
that we would reject the null for α = 0.10 but fail to reject the null for α = 0.05.
5 Here we are calculating two-sided p-values, which are the output most commonly reported by statistical software.
If β̂₁/se(β̂₁) is greater than zero, then the two-sided p-value is twice the probability of being greater than that value. If
β̂₁/se(β̂₁) is less than zero, then the two-sided p-value is twice the probability of being less than that value. A one-sided
p-value is calculated analogously, without the doubling.
[Figure 4.5 plots the distribution of β̂₁/se(β̂₁) for two cases: in panel (a) the t statistic is 4.23 and the p-value is 0.0000244; in panel (b) the t statistic is 1.73 and the p-value is 0.084.]
FIGURE 4.5: Two Examples of p-values
P-values are helpful not only because they show us whether we reject the null hypothesis,
but also whether we really reject the null or just barely reject the null. For example, a p-value
of 0.0001 indicates there is only a 0.0001 probability of observing a β̂₁ as large as what we
observed if β₁ = 0. In this case, we are not only rejecting, we are decisively rejecting.
Seeing a coefficient large enough to produce such a p-value is highly, highly unlikely if β₁ = 0.
On the other hand, if the p-value is 0.049, we are just barely rejecting the null for α = 0.05
and would, relatively speaking, have less confidence that the null is false. For α = 0.05, we
just barely fail to reject the null hypothesis with a p-value of 0.051.
We typically don’t need to calculate p-values ourselves because any statistical package
that conducts OLS will provide p-values. Our job is to know what they mean. Calculating
p-values is straightforward, though, especially for large sample sizes. The Computing Corner
provides details.6
6 For a two-sided p-value we want to know the probability of observing a t statistic larger in magnitude than the absolute value
of the t statistic we actually observe under the null hypothesis. This is 2 × (1 − Φ(|β̂₁/se(β̂₁)|)), where Φ is the Greek
letter phi (pronounced like the first bit of the word final) and Φ() indicates the normal cumulative density function
(CDF). (We use the normal CDF in our discussion of statistical power below; see page 774 for more details). If the
alternative hypothesis is HA: β₁ > 0, the p-value is the probability of observing a t statistic higher than the observed
t statistic under the null hypothesis: 1 − Φ(β̂₁/se(β̂₁)). If the alternative hypothesis is HA: β₁ < 0, the p-value is the
probability of observing a t statistic less than the observed t statistic under the null hypothesis: Φ(β̂₁/se(β̂₁)).
Remember This
The p-value is the probability of observing a coefficient as large in magnitude as
actually observed if the null hypothesis is true.
1. The lower the p-value, the less consistent the estimated β̂₁ is with the null hypothesis.
2. We reject the null hypothesis if the p-value is less than α.
3. A p-value can be useful to indicate the weight of evidence against a null hypoth-
esis.
4.4 Power
The hypothesis testing infrastructure we’ve discussed so far is designed to deal with the
possibility of Type I error, which occurs when we reject the null hypothesis when it is
actually true. When we set the significance level, we are setting the probability of making a
Type I error. Obviously, we’d really rather not believe the null is false when it is true.
Type II errors aren’t so hot either, though. We make a Type II error when β₁ is really
something other than zero, but we fail to reject the null hypothesis that β₁ is zero. In this
section we explain statistical power, the statistical concept associated with Type II errors.
We open by discussing the importance and meaning of Type II error, then show how to
calculate power and how to create power curves. We finish the section by discussing when
to care about power.
Type II error can be serious. For example, suppose a new medicine really saves lives, but
that the U.S. Food and Drug Administration is given an analysis in which the β̂₁ estimate of
its efficacy is not statistically significant. If the FDA fails to approve the drug, people will
die unnecessarily. That’s not “oops”; that’s horrific. Even when the stakes are lower, we can
fully imagine how stupid we’d feel if we conclude a policy doesn’t work when in fact it does
work, but we just happened to get a random realization of β̂₁ that was not high enough to
be statistically significant.
Type II error happens because it is possible to observe values of β̂₁ that are less than the
critical value even if β₁ (the true value of the parameter) is greater than zero. This is more
likely to happen the smaller the true β₁ is relative to its standard error.
Figure 4.6 shows the probability of Type II error for three different values of β₁. In these
figures, we assume a large sample (allowing us to use the normal distribution for critical
values), a one-sided alternative hypothesis, and α = 0.01. In this case, the critical value is
2.32, which means that we reject the null hypothesis whenever the t statistic is bigger than 2.32.
Panel (a) of Figure 4.6 displays the probability of Type II error if the true value of β₁
equals 1. In this case, the distribution of β̂₁ will be centered at 1. Only 9.3 percent of this
distribution is to the right of 2.32, meaning that we have only a 9.3 percent chance of rejecting
the null hypothesis. In other words, even though the null hypothesis actually is false – we’re
assuming β₁ equals one, not zero – we have only a roughly one in ten chance of rejecting the
null. In other other words, our test is not particularly able to provide statistically significant
evidence against the null when the true β₁ is this small.
[Figure 4.6 plots the distribution of β̂₁ when the true value of β₁ equals 1, 2, and 3 (panels (a), (b), and (c)), with the critical value 2.32 marked in each panel; in panel (c) the probability of rejecting the null is 0.751.]
FIGURE 4.6: Statistical Power for Three Values of β₁, α = 0.01, and a One-Sided Alternative Hypothesis
Panel (b) of Figure 4.6 displays the probability of Type II error if the true value of β₁
equals 2. In this case, the distribution of β̂₁ will be centered at 2. Here 37.4 percent of the
distribution is to the right of 2.32. Better, but hardly great: even though β₁ > 0, we still have
well under a fifty-fifty chance of rejecting the null hypothesis.
Panel (c) of Figure 4.6 displays the probability of Type II error if the true value of β₁
equals 3. In this case, the distribution of β̂₁ will be centered at 3. Here 75.1 percent of the
distribution is to the right of 2.32. We’re making progress, but still are far from perfection.
In other words, the true value of β₁ has to be near or above 3 before we have even a 75.1 percent
chance of rejecting the null hypothesis.
These examples illustrate why we use the somewhat convoluted “fail to reject the null”
terminology; when we observe a β̂₁ less than the critical value, it is still quite possible that
the true value is not zero. Failure to find an effect is not the same as finding no effect.
Calculating power
The main tool for thinking about whether we are making Type II errors is power. The
statistical definition of power differs from how we use the word in ordinary conversation.
Power in the statistical sense refers to the ability of our data to reject the null. A high-
powered statistical test will reject the null with a very high probability when the null is
false; a low-powered statistical test will reject the null with a low probability when the
null is false. Think of statistical power like the power of a microscope. Using a powerful
microscope, we can distinguish small differences in an object, differences that we cannot see with a less powerful instrument.
To calculate power we begin by noting that the probability we reject the null for any true
value β₁^True is the probability that the t statistic is greater than the critical value. We can
write this condition as follows (where the condition following the vertical line is what we’re
assuming to be true):

Pr(Reject null given β₁ = β₁^True) = Pr(β̂₁/se(β̂₁) > Critical value | β₁ = β₁^True)   (4.4)
In other words, the power is the probability the t statistic is higher than the critical value.
This probability will depend on the actual value of β₁, because the distribution of β̂₁ is
centered at the true value of β₁.
To make these calculations easier, we need to do a couple more steps. First, note that the
probability that the t statistic is bigger than the critical value (as described in Equation 4.4)
is equal to 1 minus the probability that the t statistic is less than the critical value, yielding

Pr(Reject null given β₁ = β₁^True) = 1 − Pr(β̂₁/se(β̂₁) < Critical value | β₁ = β₁^True)   (4.5)

The key element of this equation is Pr(β̂₁/se(β̂₁) < Critical value | β₁ = β₁^True). This math-
ematical term seems complicated, but we actually know a fair bit about it. For a large
sample size, the t statistic (which is β̂₁/se(β̂₁)) will be normally distributed with a variance of
one around the true value divided by the standard error of the estimated coefficient. And,
from the properties of the normal distribution (see page 782 for a review), we know the
probability that the t statistic will be less than the critical value, which implies

Pr(Reject null given β₁ = β₁^True) = 1 − Φ(Critical value − β₁^True/se(β̂₁))

where Φ is the Greek letter phi (pronounced like the first bit of the word final) and Φ() indicates
the normal cumulative density function (see page 774 for more details). This quantity will
vary depending on the true value of β₁ we wish to use in our power calculations.7
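Under these assumptions the power calculation is a one-liner in R. This sketch reproduces the probabilities reported for Figure 4.6 (critical value 2.32, se(β̂₁) = 1, one-sided alternative, α = 0.01):

# Power = 1 - Phi(critical value - beta_true / se(beta1_hat))
beta_true <- c(1, 2, 3)                  # candidate true values of beta1
se_beta1 <- 1                            # standard error assumed in the example
crit <- 2.32                             # one-sided critical value for alpha = 0.01
1 - pnorm(crit - beta_true / se_beta1)   # about 0.093, 0.374, 0.751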
One thing that can be puzzling is how to decide what true value to use when calculating
power. There really is no specific value that we should look at; instead, the point is that
we can pick any value and calculate the power. We might pick a value of β₁ that indicates
a substantial real-world effect and see what the probability of rejecting the null is for that
value. If the probability is low (meaning power is low), then we should be a bit skeptical
that we have enough data to reject the null for such a true value. If the probability is high
(meaning power is high), then we can be confident that if the true β₁ is that value, then we’d
probably reject the null hypothesis.
Power curves
Even better than picking a single value to use to calculate power, we can look at power
over a range of possible true values of β₁. A power curve characterizes the probability of
rejecting the null for each possible value of the parameter. Figure 4.7 displays two power
curves. The solid line on top is the power curve for when se(β̂₁) = 1.0 and α = 0.01. On
the horizontal axis are hypothetical values of β₁. The line shows the probability of rejecting
the null for a one-tailed test of H0: β₁ = 0 versus HA: β₁ > 0 for α = 0.01 and a sample
large enough to use the normal approximation to the t distribution. To reject the null under
these conditions requires a t stat greater than 2.32 (see Table 4.4). This power curve plots,
for each possible value of β₁, the probability that β̂₁/se(β̂₁) (which in this case is β̂₁/1.0) is
greater than 2.32. This curve includes the values we calculated in Figure 4.6 but now also
covers all the other possible values of β₁ as well.
Look first at the values of β₁ that are above zero, but small. For these values the prob-
ability of rejecting the null is quite small. In other words, even though the null hypothesis
is false for these values (since β₁ > 0), we’re unlikely to reject the null that β₁ = 0. As β₁
increases, this probability increases, and by around β₁ = 4 the probability of rejecting the
null approaches 1.0. That is, if the true value of β₁ is 4 or bigger, then we will reject the
null hypothesis with near certainty.
The dashed line in Figure 4.7 displays a second power curve for which the standard error
is bigger, here equal to 2.0. The significance level is the same as for the first power curve, α = 0.01.
[Figure 4.7 plots the probability of rejecting the null hypothesis (for α = 0.01) against possible true values of β₁ from 0 to 10. The solid power curve is for se(β̂₁) = 1.0; the dashed power curve, which lies below it, is for se(β̂₁) = 2.0.]
FIGURE 4.7: Power Curves for Two Values of se(β̂₁)
We immediately see that the statistical power is lower. For every possible value of
β₁, the probability of rejecting the null hypothesis is lower than when se(β̂₁) = 1.0 because
there is more uncertainty with the higher standard error for the estimate. For this standard
error, the probability of rejecting the null when β₁ equals 2 is only 0.09. So even though the null
hypothesis is false, we will rarely reject it.
One of the main determinants of se(β̂₁) is sample size (see page 97). Hence, a useful rule
of thumb is that hypothesis tests based on large samples are usually high-powered and
hypothesis tests based on small samples are usually low-powered. In Figure 4.7, we can
think of the solid line as the power curve for a large sample and the dashed line as the power
curve for a smaller sample. More generally, though, statistical power is a function of the
size of the standard error, whatever its source.
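The Computing Corner at the end of this chapter gives Stata code for a power curve; here is a parallel R sketch, under the same assumptions as Figure 4.7 rather than output from any data set, that draws both curves:

# Power curves for a one-sided test with alpha = 0.01 (critical value 2.32)
beta_range <- seq(0, 10, by = 0.1)              # possible true values of beta1
power_se1 <- 1 - pnorm(2.32 - beta_range / 1)   # se(beta1_hat) = 1.0
power_se2 <- 1 - pnorm(2.32 - beta_range / 2)   # se(beta1_hat) = 2.0
plot(beta_range, power_se1, type = "l",
     xlab = "True value of beta1", ylab = "Probability of rejecting the null")
lines(beta_range, power_se2, lty = 2)           # dashed curve: larger standard error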
Power is particularly relevant when someone presents a null result, a finding in which
the null hypothesis is not rejected. For example, someone may say class size is not related
to test scores or that an experimental treatment does not work. In this case, we need to ask
what the power of the test was. It could be, for example, that the sample size is very small
such that the probability of rejecting the null is small even for substantively large values of
β₁.

8 What happens when β₁ actually is zero? In this case, the null hypothesis is true and power isn’t the right concept.
Instead, the probability of rejecting the null in this case is the probability of rejecting the null when it is true. In
other words, the probability of rejecting the null when β₁ = 0 is the probability of committing a Type I error, the α
level we set.
Remember This
Statistical power refers to the probability of rejecting a null hypothesis for a given
value of β₁.
1. A power curve shows the probability of rejecting the null for a range of possible
values of β₁.
2. Large samples typically produce high-power statistical tests. Small samples typ-
ically produce low-power statistical tests.
3. It is particularly important to discuss power when presenting null results, which
fail to reject the null hypothesis.
Discussion Questions
1. For each of the following, indicate the power of the test of the null
hypothesis H0: β₁ = 0 against the alternative hypothesis of HA: β₁ > 0
for a large sample size and α = 0.01 for the given true value of β₁. We’ll
assume se(β̂₁) = 0.75. Draw a sketch to help explain your numbers.
(a) β₁^True = 1
(b) β₁^True = 2
2. Suppose the estimated se(β̂₁) doubled. What will happen to the power
of the test for the above two cases? First answer in general terms and
then calculate specific answers.
3. Suppose se(β̂₁) = 2.5. What is the probability of committing a Type II
error for each of the above true values of β₁?
4.5 Straight Talk about Hypothesis Testing

The ins and outs of hypothesis testing can be confusing. There are t distributions, degrees
of freedom, one-sided tests, two-sided tests, lions, tigers, and bears. Such confusion is unfor-
tunate for two reasons. First, the essence is simple: High t statistics indicate that the β̂₁ we
observe would be unlikely if the null hypothesis were true; low t statistics do not. Second, computers make
hypothesis testing super easy. They crank out t stats and p-values lickety-split.
Sometimes these details distract us from the big picture: Hypothesis testing is not the
whole story. In this section we discuss four important limits to the hypothesis testing frame-
work.
First, and most importantly, all hypothesis testing tools we develop – all of them! – are
predicated on the assumption that there is no endogeneity. If there is endogeneity, then the
hypothesis testing tools are useless. If the input is junk, then even a fancy triple-backflip of
a statistical test will produce junk.
Second, hypothesis tests offer only a blunt tool for analyzing data. Sometimes they are
misleadingly decisive. Suppose we have a sample of 1,000 and we are interested in a two-sided
hypothesis test for α = 0.05. If we observe a t statistic of 1.95 we will fail to reject the null.
If we observe a t statistic of 1.97 we will reject the null. The world is telling us essentially the
same thing in both cases, but the hypothesis testing approach gives us dramatically different
answers.
Third, a hypothesis test can mask important information. Suppose the t statistic on one
variable is 2 and that the t statistic for another is 25. In both cases, we reject the null. But
there’s a big difference. We’re kinda-sorta confident the null is not correct when the t stat
is 2. We’re damn sure the null sucks when the t stat is 25. Hypothesis testing alone does
not make such a distinction. We should. The p-values we discussed earlier are helpful here, as are the confidence intervals we discuss below.
Fourth, hypothesis tests and their focus on statistical significance can distract us from
substantive significance. A coefficient is substantively significant if it indicates that the
independent variable has a meaningful effect on the dependent variable.
While it can be a bit subjective as to how big a coefficient has to be for us to believe it
matters, this is a conversation we need to have. And statistical significance is not always a
good guide. Remember that t stats depend a lot on se(β̂₁) and that se(β̂₁) in turn
depends on sample size and other factors (see page 97). If we have a really big sample (and
these days it is increasingly common to have sample sizes in the millions), then the standard
error will be tiny and our t stat might be huge even for a substantively trivial β̂₁ estimate.
In these cases, we may reject the null even when the β̂₁ coefficient suggests a minor
effect.
For example, suppose we look at average test scores for every elementary classroom in the
country as a function of the salary of the teachers. We could conceivably get a statistically
significant coefficient that implied, say, an increase of 0.01 points out of 100 for every hundred
thousand dollars we pay teachers. Statistically significant, yes; substantively significant, not
so much.
Or, conversely, we could have a small sample size that would lead to a large standard error
on β̂₁ and, say, to a failure to reject the null. But the coefficient itself could be quite big. We
cannot be sure the effect is really there, but it’s worth appreciating that the data in such a case are indicating
the possibility of a substantively significant relationship. In this instance, getting more data
would be especially valuable.
Remember This
Statistical significance is not the same as substantive significance.
1. A coefficient is statistically significant if we reject the null hypothesis.
2. A coefficient is substantively significant if the variable has a meaningful effect on
the dependent variable.
3. With large data sets, substantively small effects can sometimes be statistically
significant.
4. With small data sets, substantively large effects can sometimes be statistically
insignificant.
4.6 Confidence Intervals

One way to get many of the advantages of hypothesis testing without the stark black/white decision it imposes is to use a confidence interval. A
confidence interval defines the range of true values that are most consistent with the observed
coefficient estimate. A confidence interval contrasts with a point estimate, which is a single
number, such as β̂₁, that serves as our best single guess of the parameter.
This section explains how confidence intervals are calculated and why they are useful.
The intuitive way to think about confidence intervals is that they give us a range in which
we’re confident the true parameter lies. An approximate rule of thumb is that the confidence
interval for a β̂₁ estimate goes from two standard errors below β̂₁ to two standard errors
above β̂₁. That is, the confidence interval for an estimate β̂₁ will approximately cover the
range from β̂₁ − 2 × se(β̂₁) to β̂₁ + 2 × se(β̂₁).
The full explanation of confidence intervals involves similar statistical logic as t stats.
The starting point is the realization that we can assess the probability of observing the β̂₁
for any “true” β₁. For some values of β₁, our observed β̂₁ wouldn’t be surprising. Suppose,
for example, we observe a coefficient of 0.41 with a standard error of 0.1 as we did in Table
3.2. If the true value were 0.41, a β̂₁ near 0.41 wouldn’t be too surprising. If the true value
were 0.5, we’d be a wee bit surprised, perhaps, to observe β̂₁ = 0.41 but not shocked. For
some values of β₁, though, the observed β̂₁ would be surprising. If the true value were 10,
for example, we’d be gobsmacked to observe β̂₁ = 0.41 with a standard error of 0.1. Hence,
if we see β̂₁ = 0.41 with a standard error of 0.1, we’re pretty darn sure the true value of β₁
isn’t 10.
Confidence intervals generalize this logic to identify the range of true values that would
be reasonably likely to produce the β̂₁ that we observe. They identify the range of true
values for which the observed β̂₁ and se(β̂₁) would not be too unlikely. We get to choose
what we mean by unlikely by choosing our significance level, which is typically α = 0.05 or
α = 0.01. We’ll often refer to confidence levels, which are 1 − α. The lower bound of a 95%
confidence interval will be a value of β₁ such that there is less than a 2.5% probability
of observing a β̂₁ as high as the β̂₁ actually observed. The upper bound of a 95% confidence
interval will be a value of β₁ such that there is less than a 2.5% probability of observing a
β̂₁ as low as the β̂₁ actually observed.
Figure 4.8 illustrates the meaning of a confidence interval. Suppose β̂₁ = 0.41 and se(β̂₁) =
0.1. For any given true value of β₁ we can calculate the probability of observing the β̂₁ we
actually did observe. Panel (a) shows that if β₁ really were 0.606, the distribution of β̂₁
would be centered at 0.606 and we would see a value as low as 0.41 (what we actually
observe for β̂₁) only 2.5 percent of the time. Panel (b) shows that if β₁ really were 0.214, the
distribution of β̂₁ would be centered at 0.214 and we would see a value as high as 0.41 (what
we actually observe for β̂₁) only 2.5 percent of the time. In other words, our 95% confidence
interval ranges from 0.214 to 0.606 and includes the values of β₁ such that it wouldn’t be
too surprising to observe the β̂₁ = 0.41 we actually observed.
[Figure 4.8 plots the probability density of β̂₁ for two possible true values of β₁. The upper bound of a 95% confidence interval is the value of β₁ such that we would see the observed β̂₁ or lower 2.5 percent of the time: panel (a) shows that if the true value of β₁ is 0.606, we would see a β̂₁ equal to or less than 0.41 2.5% of the time; panel (b) shows that if the true value of β₁ is 0.214, we would see a β̂₁ equal to or greater than 0.41 2.5% of the time.]
Figure 4.8 does not tell us how to calculate the upper and lower bounds of a confidence
interval. We could use trial and error. Much better is an equation based on the properties
of the distribution of β̂₁. A confidence interval runs from β̂₁ − critical value × se(β̂₁) to β̂₁ +
critical value × se(β̂₁). For large samples and α = 0.05, the critical value is 1.96, giving rise
to the rule of thumb that a 95% confidence interval is approximately β̂₁ ± 2 × the standard
error of β̂₁. In our example, where β̂₁ = 0.41 and se(β̂₁) = 0.1, we can be 95% confident
that the true value of β₁ lies between 0.214 and 0.606.
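A small R sketch of this calculation, using the coefficient and standard error from the example (for a fitted model, the confint() command noted in the Computing Corner does the same thing automatically):

# 95% confidence interval: estimate +/- critical value * standard error
beta1_hat <- 0.41
se_beta1 <- 0.10
crit <- qnorm(1 - 0.05 / 2)             # 1.96 for a large sample
beta1_hat + c(-1, 1) * crit * se_beta1  # about 0.214 to 0.606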
Table 4.6 shows some commonly used confidence intervals for large sample sizes. The
large sample size allows us to use the normal distribution to calculate critical values. A
90% confidence interval for our example is 0.246 to 0.574. The 99% confidence interval for
a β̂₁ = 0.41 and se(β̂₁) = 0.1 is from 0.152 to 0.668. Notice that the higher the confidence
level, the wider the confidence interval.
Confidence intervals are closely related to hypothesis tests. Because confidence intervals
tell us the range of possible true values that are consistent with what we’ve seen, we simply
need to see if the confidence interval on our estimate includes zero. If it does not, zero was
not a value that would likely produce the data and estimates we observe, and we can therefore
reject H0: β₁ = 0.
Confidence intervals do more than hypothesis tests, though, because they provide infor-
mation on the likely location of the true value. If the confidence interval is mostly positive
but just barely covers zero, we would fail to reject the null hypothesis, but we would also
recognize that the evidence suggests the true value is likely positive. If the confidence in-
terval does not cover zero but is restricted to a region of substantively unimpressive values
of β₁, then we can conclude that while the coefficient is statistically different from zero, it
is not substantively significant.
Remember This
1. A confidence interval indicates a range of values in which the true value is likely to
lie, given the data.
• The lower bound of a 95% confidence interval will be a value of β₁ such
that there is less than a 2.5% probability of observing a β̂₁ as high as the β̂₁
actually observed.
• The upper bound of a 95% confidence interval will be a value of β₁ such that
there is less than a 2.5% probability of observing a β̂₁ as low as the β̂₁ actually
observed.
2. A confidence interval is calculated as β̂₁ ± t-critical value × se(β̂₁), where the t-
critical value is the critical value from the t table. It depends on the sample size
and α, the significance level. For large samples and α = 0.05, the t-critical value
is 1.96.
4.7 Conclusion
Statistical inference refers to the process of reaching conclusions based on the data. Hy-
pothesis tests are central to inference, particularly t tests. They’re pretty easy. Honestly,
a well-trained parrot could probably do simple t tests. Look at the damn t statistic! Is it
bigger than the critical value or not?
We can do much more. With p-values and confidence intervals we can characterize our
findings with some nuance. With power tests we can recognize the likelihood of failing to see
effects even when they’re there. Taken as a whole, then, these tools help us make inferences
that go well beyond a simple reject-or-don’t-reject verdict.
• Section 4.1: Explain the conceptual building blocks of hypothesis testing, including null
and alternative hypotheses.
• Section 4.6: Explain confidence intervals. Explain the rule of thumb for approximating
a 95% confidence interval.
Further Reading
Ziliak and McCloskey (2008) provide a book-length attack on the hypothesis testing frame-
work. Theirs is hardly the first such critique, but it may be the most fun.
An alternative approach, Bayesian statistics, produces estimates of the form “there is an 8.2 percent probability that β is less than zero.”
Happily, there are huge commonalities across Bayesian statistics and the approach used in
this (and most other) introductory books. Simon Jackman’s Bayesian Analysis for the Social
Sciences is a good place to start.
Key Terms
• Alternative hypothesis (140)
• Confidence interval (201)
• Confidence levels (179)
• Critical value (153)
• Degrees of freedom (151)
• Hypothesis testing (136)
• Null hypothesis (138)
• One-sided alternative hypothesis (140)
• Null result (173)
• p-value (161)
• Point estimate (178)
• Power (168)
Computing Corner
Stata
1. To find the critical value from a t distribution for a given α and N − k degrees of freedom,
use the inverse t tail function in Stata: display invttail(n-k, α).10
• To calculate the critical value for a one-tailed t test with n − k = 100 and α = 0.05,
type display invttail(100, 0.05)
• To calculate the critical value for a two-tailed t test with n − k = 100 and α = 0.05,
type display invttail(100, 0.05/2)
2. To find the critical value from a normal distribution for a given α, use the inverse normal
function in Stata. The display command tells Stata to print the results on the screen.
For a two-sided test with α = 0.05, type display invnormal(1-0.05/2). For a one-sided
test with α = 0.01, type display invnormal(1-0.01).
10 This is referred to as an inverse t function because we provide a percent (the α) and it returns the value of the t
distribution for which α percent of the distribution is larger in magnitude. For a non-inverse t function, we typically
provide some value for t and the function tells us how much of the distribution is larger in magnitude. The tail part
of the function name refers to the fact that we’re dealing with the far ends of the distribution.
3. The regression command in Stata (e.g. reg Y X1 X2) reports two-sided p-values and
confidence intervals. To generate the p-values from the t statistic only, use display
2*ttail(DF, abs(TSTAT)) where DF is the degrees of freedom and TSTAT is the
observed value of the t statistic.11 For a two-sided p-value for a t statistic of 4.23 based
on 1,908 degrees of freedom, type display 2*ttail(1908, 4.23).
4. Use the following code to create a power curve for α = 0.01 and a one-sided alternative
hypothesis covering 71 possible values of the true β₁ from 0 to 7:
set obs 71
gen BetaRange = (_n-1)/10  /* Sequence of possible betas from 0 to 7 */
scalar stderrorBeta = 1.0  /* Standard error of beta-hat */
gen PowerCurve = normal(BetaRange/stderrorBeta - 2.32)
  /* Probability t statistic is greater than critical value */
  /* for each value in BetaRange/stderrorBeta */
graph twoway (line PowerCurve BetaRange)
R
1. In R, inverse probability distribution functions start with q (no reason why, really; it’s
just a convention). To calculate the critical value for a two-tailed t test with n − k =
100 and α = 0.05, use the inverse t distribution command: type qt(1-0.05/2, 100). To find
the one-tailed critical value for a t distribution for α = 0.01 and 100 degrees of freedom:
qt(1-0.01, 100).
2. To find the critical value from a normal distribution for a given α, use the inverse
normal function in R. For a two-sided test: qnorm(1-α/2). For a one-sided test:
qnorm(1-α).
3. The p-value reported in summary(lm(Y ~ X1)) is a two-sided p-value. To generate
the p-value from the t statistic only, use 2*(1-pt(abs(TSTAT), DF)) where TSTAT
is the observed value of the t statistic and DF is the degrees of freedom. For example,
for a two-sided p-value for a t statistic of 4.23 based on 1,908 degrees of freedom, type
2*(1-pt(abs(4.23), 1908)).
4. To calculate confidence intervals using the regression results from the Simpsons data
on page 128, use the confint command. For example, the 95% confidence intervals for
the coefficient estimates in the donut regression model from the Chapter 3 Computing
Corner can be produced by applying confint() to the saved regression object.
11 The ttail function in Stata reports the probability of a t distributed random variable being higher than a t
statistic we provide (which we denote here as TSTAT). This syntax contrasts to the convention for normal distribution
functions, which typically report the probability of being less than the t statistic we provide.
Exercises
1. Persico, Postlewaite, and Silverman (2004) analyzed data from the National Longitudi-
nal Survey of Youth (NLSY) 1979 cohort to assess the relationship between height and
wages for white men. Here we explore the relationship between height and wages for
the full sample that includes men and women and all races. The NLSY is a nationally
representative sample of 12,686 young men and women who were 14-22 years old when
they were first surveyed in 1979. These individuals were interviewed annually through
1994 and biannually since then. Table 4.7 describes the variables from heightwage.dta
we’ll use for this question.
Table 4.7: Variables for Height and Wage Data in the United States
a. Create a scatterplot of adult wages against adult height. What does this plot suggest
about the relationship between height and wages?
b. Estimate an OLS regression in which adult wages is regressed on adult height for
all respondents. Report the estimated regression equation and interpret the results,
explaining in particular what the p-value means.
c. Assess whether the null hypothesis that the coefficient on height81 equals zero is
rejected at the 0.05 significance level for one-sided and for two-sided hypothesis tests.
2. In this problem, we will conduct statistical analysis on the sheep experiment discussed
at the beginning of the chapter. We will create variables and use OLS to analyze
their relationships. Death is the dependent variable and treatment is the independent
variable. For all models, the treatment variable will equal 1 for the first 24 observations
and will equal zero for the last 24 observations.
a. Suppose, as in the example, that only one sheep in the treatment group died and all
sheep in the control group died. Is the treatment coefficient statistically significant?
What is the (two-sided) p-value? What is the confidence interval?
b. Suppose now that only one sheep in the treatment group died and only 10 sheep in
the control group died. Is the treatment coefficient statistically significant? What is
the (two-sided) p-value? What is the confidence interval?
c. Continue supposing that only one sheep in the treatment group died. What is the
minimal number of sheep in the control group that needed to die for the treatment
effect to be statistically significant? (Solve by trial and error.)
3. Voters care about the economy, often more than any other issue. It is not surprising,
then, that politicians invariably argue that their party is best for the economy. Who
is right? In this exercise we’ll look at the U.S. economic and presidential party data
in PresPartyEconGrowth.dta to test if there is any difference in economic performance
between Republican and Democratic presidents. We will use two different dependent
variables:
• ChangeGDPpc is the change in real per capita GDP in each year from 1962 to 2013
(in inflation-adjusted U.S. dollars, available from the World Bank)
• Unemployment is the unemployment rate each year from 1947 to 2013 (available
from the Bureau of Labor Statistics).
Our independent variable is LagDemPres. This variable equals 1 if the president in the
previous year was a Democrat and equals 0 if the president in the previous year was
a Republican. The idea is that the president’s policies take some time to take effect
so that the economic growth in a given year depended on who was president the year
before.12
12 Other ways of considering the question are addressed in the large academic literature on presidents and the economy.
variable to be equal to the actual standard error of β̂₁. Note: The first line clears all data; you will need to re-load
the data set if you wish to run additional analyses. If you have created a syntax file it will be easy to re-load and
re-run what you have done so far.
clear
set obs 201
gen BetaRange = 4*(_n-1) /* Sequence of true beta values from 0 to 800 */
1000 to 0.) Do this for 500 simulations and report what percent of time we reject
the null at the α = 0.05 level with a two-sided alternative hypothesis.
5. We will continue the analysis of height and wages in Britain from the homework problem
in Chapter 3 on page 133.
a. Estimate the model with income at age 33 as the dependent variable and height at
age 33 as the independent variable. (Exclude observations with wages above 400
British pounds per hour and height less than 40 inches.) Interpret the t statistics on
the coefficients.
b. Explain the p-values for the two estimated coefficients.
c. Show how to calculate the 95% confidence interval for the coefficient on height.
d. Do we accept or reject the null hypothesis that β₁ = 0 for α = 0.01 and a two-sided
alternative? Explain why.
e. Do we accept or reject the null hypothesis that β₀ = 0 (the constant) for α = 0.01
and a two-sided alternative? Explain why.
f. Limit the sample size to the first 800 observations.14 Do we accept or reject the null
hypothesis that β₁ = 0 for α = 0.01 and a two-sided alternative? Explain if/how/why
this answer differs from the earlier hypothesis test about β₁.
6. The dataset MLBattend.dta contains Major League Baseball attendance records for 32
teams from the 1970s through 2000. This problem uses the power calculation described
on page 169.
a. Estimate a regression in which home attendance rate is the dependent variable and
runs scored is the independent variable. Report your results and interpret all coeffi-
cients.
b. Use the standard error from your results to calculate the statistical power of a test
of H0: β_runs scored = 0 vs. HA: β_runs scored > 0 with α = 0.05 (assuming a large sample
for simplicity) for three cases:
i. β_runs scored = 100
ii. β_runs scored = 400
iii. β_runs scored = 1,000
c. Suppose we had much less data than we actually do, such that the standard error on
the coefficient on runs scored were 900 (which is much larger than what we estimated).
Using a standard error of 900, calculate the statistical power of a test of
H0: β_runs scored = 0 vs. HA: β_runs scored > 0 with α = 0.05 (assuming a large sample
for simplicity) for three cases:
14 In Stata, do this by adding & _n < 800 to the end of the if statement at the end of the regress command.
CHAPTER 5
Multivariate OLS: Where the Action Is
For example, suppose we’ve been tasked to figure out how sales respond to temperature:

Sales_t = β₀ + β₁ Temperature_t + ε_t

where Sales_t is sales in billions of dollars during month t and Temperature_t is the average
[Figure 5.1 plots monthly retail sales (billions of $) against average monthly temperature (in degrees Fahrenheit), with a downward-sloping fitted line.]
FIGURE 5.1: Monthly Retail Sales and Temperature in New Jersey from 1992 to 2013
temperature in the month. Figure 5.1 shows monthly data for New Jersey for about 20
years. We’ve also added the fitted line from a bivariate regression. It’s negative, implying
that people shop more when it’s colder.
Is that the full story? Could there be endogeneity, meaning there is something that is
correlated with temperature and associated with more shopping? Think about shopping in
the United States. When is it at its most frenzied? Right before Christmas. Something that
happens in December ... when it’s cold. In other words, we think there is something in the
error term (Christmas shopping season) that is correlated with temperature. That’s a recipe
for endogeneity.
In this chapter, we learn how to control for other variables so that we can avoid (or at
least reduce) endogeneity and thereby see causal associations more clearly. Multivariate OLS
is the tool that allows us to do so. In our shopping example, multivariate OLS helps us see
that once we account for the December effect, higher temperatures are associated with higher
sales.
Multivariate OLS refers to OLS with multiple independent variables. We’re simply
going to add variables to the OLS model developed in the previous chapters. What do we
gain from doing so? Two things: bias reduction and precision. When we reduce bias, we
get more accurate parameter estimates because the coefficient estimates are on average less
skewed away from the true value. When we increase precision, we reduce uncertainty because
the distribution of coefficient estimates is more closely clustered toward the true value.
In this chapter we explain how to use multivariate OLS to fight endogeneity. Section 5.1
introduces the model and shows how controlling for multiple variables can lead to better
estimates. Section 5.2 discusses omitted variable bias, which occurs when we fail to control
for variables that affect Y and are correlated with included variables. Section 5.3 shows how
the omitted variable bias framework can be used to understand what happens when we use
poorly measured variables. Section 5.4 explains the precision of our estimates in multivariate
OLS. Section 5.5 concludes the chapter with a more big-think discussion of how to decide which variables to include in a model.
[FIGURE 5.2: Monthly Retail Sales and Temperature in New Jersey with December Indicated. Panel (a) plots monthly retail sales (billions of $) against average monthly temperature (in degrees Fahrenheit), distinguishing December sales from other months; panel (b) plots the same data with December sales reduced by $5 billion.]
Multivariate OLS allows us to control for multiple independent variables at once. In this
section, we explore two examples in which controlling for additional variables has a huge
effect on the results. Then we discuss the mechanics of the multivariate estimation process.
The sales and temperature example is useful for getting the hang of multivariate analysis.
Panel (a) of Figure 5.2 has the same data as Figure 5.1, but we’ve indicated the December
observations with triangles. Clearly New Jerseyites shop more in December; it looks like the
average sales are around $11 billion in December versus average sales of around $6 billion
per month in other months. We want to learn whether there is a temperature effect after
taking into account that December sales run about $5 billion higher than other months.
The idea behind multivariate OLS is to net out this December effect and then see what
the relationship between sales and temperature is. That is, suppose we subtracted the $5
billion bump from all the December observations and then looked at the relationship between
temperature and sales. That is what we've done in Panel (b) of Figure 5.2, where each December observation is now $5 billion lower than before. When we look at the data this way, the negative relationship between temperature and sales seems to go away, and it may even be positive.
In essence, multivariate OLS nets out the effects of other variables when it controls for
additional variables. When we actually implement multivariate OLS, we (or, really, comput-
ers) do everything at once, controlling for the December effect while estimating the effect of
temperature even as we are simultaneously controlling for temperature while estimating the
December effect.
Table 5.1 shows the results for both a bivariate and multivariate model for our sales data.
In the bivariate model, the coefficient on temperature is negative and statistically significant,
implying that folks like to shop in the cold. When we use multivariate OLS to control for
December (by including the December variable that equals one for observations from the
month of December and zero for all other observations), the coefficient on temperature
becomes positive and statistically significant. Our conclusion has flipped! Heat brings out
the cash. Whether this relationship exists because people like shopping when it’s warm
or are going out to buy swimsuits and sunscreen, we can’t say. We can say, though, that
there’s pretty strong evidence that our initial bivariate finding that people shop less as the
The way we interpret multivariate OLS regression coefficients is slightly different from how we interpret bivariate OLS regression coefficients. We still say that a one unit increase in X is associated with a β̂1 increase in Y, but now we need to add, “Holding constant the other factors in the model.” We therefore interpret our multivariate results as “Controlling for the December shopping boost, increases in temperature are associated with more shopping.” In particular, the multivariate estimate implies that, controlling for the surge in shopping in December, each additional degree of temperature is associated with roughly $14 million more in monthly retail sales.
We don’t have to say the full long version every time we talk about multivariate OLS
results – unless we’re stalling for time – as people who understand multivariate OLS will
understand the longer, technically correct interpretation. We can also use the fancy-pants
phrase ceteris paribus, which means all else equal, as in “Ceteris paribus, the effect of a one degree increase in temperature on retail shopping in New Jersey is $14 million.”
The way statisticians talk about multivariate results takes some getting used to. When
statisticians say things like holding all else constant or holding all else equal they are simply
referring to the fact that other variables are in the model and have been statistically controlled
for. What they really mean is more like netting out the effect of other variables in the
model. The logic behind saying that other factors are constant is that once we have netted
out the effects of these other variables it is as if the values of these variables are equal for
every observation. The language doesn’t exactly sparkle with clarity, but the idea is not
particularly subtle. Hence, when someone says something like “holding X2 constant, the estimated effect of a one-unit change in X1 is β̂1,” we need simply to remember that they mean the effects of the other variables in the model have been netted out.
Table 5.1: Bivariate and Multivariate Results for Retail Sales Data

                 Bivariate      Multivariate
Temperature      -0.019*        0.014*
                 (0.007)        (0.005)
                 [t = 2.59]     [t = 3.02]
December                        5.63*
                                (0.26)
                                [t = 21.76]
Constant         7.16*          4.94*
                 (0.41)         (0.26)
                 [t = 17.54]    [t = 18.86]
N                256            256
σ̂                1.82           1.07
R²               0.026          0.661

Standard errors in parentheses; * indicates significance at p < 0.05
Here’s another example that shows what happens when we add variables to a model. We
use the data on height and wages introduced in Chapter 3 on page 113. The bivariate model
was

Wages_i = β0 + β1AdultHeight_i + ε_i     (5.1)

where Wages_i is the wages of men in the sample in 1996 and AdultHeight_i is their adult height measured in 1985.
This is observational data and the reality with such data is that the bivariate model is
suspect. There are many ways something in the error term could be correlated with the
independent variable.
The authors of the height and wage study identified several additional variables to include in the model, focusing in particular on one: adolescent height. They reasoned that people who were tall as teenagers could have developed more confidence and participated in more high school activities, and that this experience could have laid the groundwork for higher wages later in life.
If teen height is actually boosting adult wages in the way that the researchers suspected,
then it is possible that the bivariate model with only adult height (Equation 5.1) will suggest
a relationship even though the real action is between adolescent height and wages. How can we sort this out?
Multivariate OLS comes to the rescue. It allows us to simply “pull” adolescent height out
of the error term and into the model by including it as an additional variable:

Wages_i = β0 + β1AdultHeight_i + β2AdolescentHeight_i + ε_i

where β1 reflects the effect on wages of being one inch taller as an adult when including adolescent height in the model and β2 reflects the effect on wages of being one inch taller as an adolescent when including adult height in the model.
The coefficients are estimated using similar logic as for bivariate OLS. We’ll discuss esti-
mation momentarily. For now, though, let’s concentrate on the differences between bivariate
and multivariate results. Both are presented in Table 5.2. The first column shows the coef-
ficient and standard error on β̂1 for the bivariate model with only adult height in the model;
these are identical to the results presented in Chapter 3 on page 115. The coefficient of 0.41
implies that each inch of height is associated with an additional 41 cents per hour in wages.
The second column shows results from the multivariate analysis; they tell quite a different
story. The coefficient on adult height is, at 0.003, essentially zero. The coefficient on ado-
lescent height, in contrast, is 0.48, implying that, controlling for adult height, adult wages
were 48 cents higher per hour for each inch taller someone was when younger. The standard
error on this coefficient is 0.19 with a t statistic that is higher than 2, implying a statistically
significant effect.
Table 5.2: Bivariate and Multiple Multivariate Results for Height and Wages Data

                    Bivariate      Multivariate (a)   Multivariate (b)
Adult height        0.41*          0.003              0.03
                    (0.10)         (0.20)             (0.20)
                    [t = 4.23]     [t = 0.02]         [t = 0.17]
Adolescent height                  0.48*              0.35
                                   (0.19)             (0.19)
                                   [t = 2.49]         [t = 1.82]
Athletics                                             3.02*
                                                      (0.56)
                                                      [t = 5.36]
Clubs                                                 1.88*
                                                      (0.28)
                                                      [t = 6.69]
Constant            -13.09         -18.14*            -13.57*
                    (6.90)         (7.14)             (7.05)
                    [t = 1.90]     [t = 2.54]         [t = 1.92]
N                   1,910          1,870              1,851
σ̂                   11.9           12.0               11.7
R²                  0.01           0.01               0.06

Standard errors in parentheses; * indicates significance at p < 0.05
Figure 5.3 displays the confidence intervals implied by the coefficients and their standard
errors. The dots in the figures are placed at the coefficient estimates (e.g., 0.41 for the coefficient on adult height in the bivariate model and 0.003 for the coefficient on adult height in the multivariate model). The lines indicate the range of the 95% confidence interval.
As discussed in Chapter 4 on page 181, confidence intervals indicate the range of true values of β most consistent with the observed estimate; they are calculated as β̂ ± 1.96 × se(β̂).
The confidence interval for the coefficient on adult height in the bivariate model is clearly
[FIGURE 5.3: 95% Confidence Intervals for Coefficients in Adult Height, Adolescent Height, and Wage Models. The panels show the estimated coefficients on adult height and adolescent height with their 95% confidence intervals.]
positive and relatively narrow and it does not include zero. However, the confidence interval
for the coefficient on adult height becomes wide and includes zero in the multivariate model.
In other words, the multivariate model suggests that the effect of adult height on wages is
small or even zero when controlling for adolescent height. In contrast, the confidence interval
for adolescent height is positive, reasonably wide, and far from zero when controlling for adult
height. These results suggest that the effect of adolescent height on wages is large and the effect of adult height is small, perhaps zero.
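As a quick check, the confidence intervals plotted in Figure 5.3 can be reproduced from the coefficients and standard errors in Table 5.2 using the β̂ ± 1.96 × se(β̂) formula; a minimal Stata sketch:

* 95% confidence intervals from Table 5.2: coefficient +/- 1.96 * standard error
display "Adult height, bivariate:          " (0.41  - 1.96*0.10) " to " (0.41  + 1.96*0.10)
display "Adult height, multivariate (a):   " (0.003 - 1.96*0.20) " to " (0.003 + 1.96*0.20)
display "Adolescent height, multivar. (a): " (0.48  - 1.96*0.19) " to " (0.48  + 1.96*0.19)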
In this head-to-head battle of the two height variables, adolescent height wins: The
coefficient on it is large and its confidence interval is far from zero. The coefficient on
adult height, however, is puny and has a confidence interval that clearly covers zero. In
other words, the multivariate model we have estimated is telling us that being tall as a kid
matters more than being tall as a grown-up. This conclusion is quite thought-provoking.
It appears that the height premium in wages does not reflect a height fetish by bosses, but
instead reflects the human capital developed in youth extracurricular activities. Eat your
Multivariate OLS allows us to keep adding independent variables; that’s where the “multi”
comes from. Whenever we think of another variable that could plausibly be in the error
term and be correlated with the independent variable of interest, we simply add it to the
model (thereby removing it from the error term and eliminating it as a possible source of
endogeneity). Lather. Rinse. Repeat. Do this long enough and we may be able to wash
away sources of endogeneity lurking in the error term. The model will look something like

Yi = β0 + β1X1i + β2X2i + ... + βkXki + εi

where each X is another variable and k is the total number of independent variables. Often
a single variable or perhaps a subset of variables is of primary interest. We refer to the other independent variables as control variables, as these are included to control for factors that could affect the dependent variable and be correlated with the independent variables of primary interest. Control variables and control groups are different: A control variable is an additional variable we include in a model, while a control group is the group to which we compare the treatment group in an experiment.
The authors of the height and wage study argue that adolescent height in and of itself
was not causing increased wages. Their view is that adolescent height translated into oppor-
tunities that provide skills and experience that led to better ability to get high wages later.
They view increased participation in clubs and sports activities as a channel for adolescent
height to improve wage-increasing human capital. In statistical terms, the claim is that
participation in clubs and athletics was a factor in the error term of a model with only adult
height and adolescent height. If either height variable is correlated with any of the factors left in the error term, the coefficient estimates will be biased.
With the right data, we can check the claim that the effect of adolescent height on
adult wages is due, at least in part, to the effect of adolescent height on participation in
developmentally helpful activities. In this case, the researchers had measures of the number
of clubs each person participated in (excluding athletics and academic/honor society clubs)
and a dummy variable that indicated whether or not each person participated in high school
athletics.
In an experiment, the only difference between the treatment and control groups is the experimental treatment. If we were experimenting on samples in petri dishes, for example, we could
treat temperature as a control variable. We would make sure that the temperature is the same for all petri dishes used
in the experiment. Hence, the control group has everything similar to the treatment group, except the treatment. In
observational studies, we cannot determine the values of other factors, but we can try to net out these other factors
such that once we have taken into account these factors, the treated and untreated groups should be the same. In
the Christmas shopping example, the dummy variable for December is our control variable. The idea is that once
we net out the effect of Christmas on shopping patterns in the United States, retail sales should only differ based
on differences in the temperature. If we worry (as we should) that additional factors other than temperature still
matter, we should include other control variables until we feel confident that the only remaining difference is due to
the variable of interest.
The right-most column of Table 5.2 therefore presents “multivariate (b)” results from a
model that also includes measures of participation in activities as a young person. If the
way for adolescent height to translate into higher wages is truly that tall adolescents have
more opportunities to develop leadership and other skills, then we would expect part of the
adolescent height effect to be absorbed by the additional variables. As we see in the right-
most column, this is part of the story. The coefficient on adolescent height in the multivariate
(b) column goes down to 0.35 with a standard error of 0.19, which is statistically insignificant.
The coefficients on the clubs and athletics variables are 1.88 and 3.02 respectively with
standard errors of 0.28 and 0.56, implying highly statistically significant effects.
By the way, notice the R²s at the bottom of the table. They are 0.01, 0.01, and 0.06. Terrible, right? Recall that R² is the square of the correlation of observed and fitted values. (Or, equivalently, these R² numbers indicate the proportion of the variance of wages explained by the independent variables.) These values mean that even in the best-fitting model the correlation of observed and fitted values of wages is only about 0.245 (because √0.06 ≈ 0.245). That's not so hot, but we shouldn't care. That's not how we evaluate
models. As discussed on page 108 in Chapter 3, we evaluate the strength of estimated rela-
tionships based on coefficient estimates and standard errors, not based on directly looking
at R².
Measuring everything in the error term is unlikely, of course. But if we can measure more variables and pull more factors out
of the error term, our estimates will typically become less biased and be distributed more
closely to the true value. We provide more details when we discuss omitted variable bias in the next section.
Given how important it is to control for additional variables, we may reasonably wonder
about how exactly multivariate OLS controls for multiple variables. Basically, the estima-
tion of the multivariate model follows the same OLS principles used in the bivariate OLS
model. Understanding the estimation process is not essential for good analysis per se, but
understanding it helps us get comfortable with the model and its fitted values.
First, write out the equation for the residual, which is the difference between actual and
fitted values:
ε̂i = Yi − Ŷi
Second, square the residuals (for the same reasons as on page 71):

ε̂i² = (Yi − (β̂0 + β̂1X1i + β̂2X2i + ... + β̂kXki))²

Third, choose the β̂ values that minimize the sum of these squared residuals across all observations.
The name “ordinary least squares” (OLS) describes the process: ordinary because we
haven’t gotten to the fancy stuff yet, least because we’re minimizing the deviations between
fitted and actual values, and squares because there was a squared thing going on in there.
Like I said earlier, it’s an absurd name. It’s like calling a hamburger a “kill-with-stun-gun-
then-grill-and-put-on-a-bun.” OLS is what people call it, though, so we have to get used to
it.
Remember This
1. Multivariate OLS is used to estimate a model with multiple independent variables.
2. Multivariate OLS fights endogeneity by pulling variables from the error term into
the estimated equation.
3. As with bivariate OLS, the multivariate OLS estimation process selects β̂s in a way that minimizes the sum of squared residuals.
Discussion Questions
1. Mother Jones magazine blogger Kevin Drum (2013a, b, c) offers the
following scenario: Suppose we gathered records of a thousand school
children aged 7 to 12 and used a bivariate model and found that heavier
kids scored better on standardized math tests.
a) Based on these results, should we recommend that kids should eat
lots of potato chips and french fries if they want to grow up to be
scientists?
b) Write down a model that embodies Drum’s scenario.
c) Propose additional variables for this model.
d) Would inclusion of additional controls bolster the evidence? Would
doing so provide definitive proof?
2. Researchers from the National Center for Addiction and Substance
Abuse at Columbia University (2011) suggest that time spent on Face-
book and Twitter increases risks of smoking, drinking, and drug use.
They found that compared to kids who spent no time on social net-
working sites, kids who spent time on the sites each day were five times
likelier to smoke cigarettes, three times more likely to drink alcohol, and
twice as likely to smoke pot. The researchers argue that kids who use
social media regularly see others engaged in such behaviors and then
emulate them.
a) Write down the model implied by the above discussion and discuss factors that are in the error term.
b) What specifically has to be true about these factors for their omission to cause bias? Discuss whether these conditions are likely to hold for the factors you identify.
c) Discuss which factors could be measured and controlled for and
which would be difficult to measure and control for.
3. Suppose we are interested in knowing the relationship between hours
studied and scores on a Spanish exam.
a) Suppose some kids don’t study at all but ace the exam, leading to
a bivariate OLS result that studying has little or no effect on the
score. Would you be convinced by these results?
b) Write down a model and discuss your answer to (a) above in terms
of the error term.
c) What if some kids speak Spanish at home? Discuss implications for
a bivariate model that does not include this factor and a multivari-
ate model that controls for this factor.
Another way to think about how multivariate OLS fights bias is by looking at what happens
when we fail to soak up one of the error term variables. That is, what happens if we omit a
variable that should be in the model? In this section we show that omitting a variable that matters can bias the coefficient estimates for the variables that remain in the model.
Let’s start with a case in which the true model has two independent variables, X1 and
X2:

Yi = β0 + β1X1i + β2X2i + νi     (5.4)

We assume (for now) that the error in this true model, νi, is uncorrelated with X1i and X2i.
(The Greek letter ν is pronounced “new” – even though it looks like a v.) As usual with multivariate OLS, the β1 parameter reflects how much higher Yi would be if we increased X1i by one; β2 reflects how much higher Yi would be if we increased X2i by one.
Now suppose we omit X2 and estimate a model with X1 as the only independent variable:

Yi = β0^OmitX2 + β1^OmitX2 X1i + εi

where β1^OmitX2 indicates the coefficient on X1i we get when we omit variable X2 from the model. How close will β̂1^OmitX2 be to β1 in Equation 5.4? In other words, will β̂1^OmitX2 be an unbiased estimator of β1? Or, in English: Will our estimate of the effect of X1 suck if we omit X2? We ask questions like this every time we analyze observational data.
It’s useful to first characterize the relationship between the two independent variables,
regression that is not directly the one of interest, but yields information helpful in analyzing
the equation we really care about. In this case, we use the following equation to assess how
where ”0 (“delta”) and ”1 are coefficients for this auxiliary regression and ·i (“tau,” rhymes
with what you say when you stub your toe) is how we denote the error term (which acts
just like the error term in our other equations, but we’re trying to make it clear that we’re
This equation for X2i is not based on a causal model. Instead, we are using a regression model to indicate the relationship between the included variable (X1) and the excluded variable (X2). If δ1 = 0, then X1 and X2 are not related. If δ1 is large in magnitude, then they are strongly related.
If we substitute the equation for X2i (Equation 5.6) into the main equation (Equation 5.4), we can show that

β1^OmitX2 = β1 + β2δ1     (5.7)

where β1 and β2 come from the main equation (Equation 5.4) and δ1 comes from the auxiliary equation (Equation 5.6).
Given our assumption that τ and ν are not correlated with any independent variable, we can use our bivariate OLS results to know that β̂1^OmitX2 will be distributed normally with a mean of β1 + β2δ1. The β2δ1 piece is the omitted variable bias.
2 In the last line we replace β2τi + νi with εi. If, as we're assuming here, τi and νi are uncorrelated with each other and uncorrelated with X1, then the sum of them has the same properties (even when τi is multiplied by β2).
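Equation 5.7 is easy to verify by simulation. The sketch below uses made-up data and parameter values: β1 = 2, β2 = 3, and δ1 = 0.5, so omitting X2 should push the coefficient on X1 toward 2 + 3 × 0.5 = 3.5.

clear
set obs 10000
set seed 2014
gen x1 = rnormal()
gen x2 = 0.5*x1 + rnormal()            // auxiliary equation: delta1 = 0.5
gen y  = 1 + 2*x1 + 3*x2 + rnormal()   // true model: beta1 = 2, beta2 = 3
regress y x1 x2                        // both variables included: coefficient on x1 near 2
regress y x1                           // x2 omitted: coefficient on x1 near 3.5
regress x2 x1                          // auxiliary regression: coefficient near 0.5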
In other words, when we omit X2, the coefficient on X1 (which is β1^OmitX2) will pick up not only β1, which is the effect of X1 on Y, but also β2, the effect of the omitted variable on Y, to the extent that X2 is related to X1 (which is captured by δ1).
This result is consistent with our intuition about endogeneity: When X2 is omitted and thereby relegated to the error term, we won't be able to understand the true relationship between X1 and Y because X1 is correlated with the error term.
There are two ways to kill omitted variable bias. First, if β2 = 0 then β2δ1 = 0 and there is no omitted variable bias. This is easy to explain: if the omitted variable X2 has no effect on Yi (which is the implication of β2 = 0), then there will be no omitted variable
bias. It’s kind of cheating because we’re saying if you omit a variable that really shouldn’t
have been in the model, then you will not have omitted variable bias. It’s like saying that
you won’t gain weight from eating ice cream if the name of your next door neighbor is Neil.
Nonetheless, it’s a helpful starting point because it clarifies that a variable has to matter for
The more interesting way to kill omitted variable bias is if δ1 = 0. The parameter δ1 from Equation 5.6 tells us how strongly X1 and X2 are related. If X1 and X2 are not related, then δ1 is zero. This, in turn, means β̂1^OmitX2 will be an unbiased estimate of β1 from Equation 5.4, the true effect of X1 on Y, even though we omitted X2 from the model.3
3 We derive this result more formally on page 721.
In other words, if the omitted variable is not correlated with the included variable, then no bias results from omitting it. This is our endogeneity logic again: whatever we omit from the model ends up in the error term. If the omitted variable hanging out in the error term is correlated with the included variable (which means δ1 ≠ 0), then we have endogeneity and we have
bias. We now have an equation that tells us the extent of the bias. If, on the other hand, the
omitted variable hanging out in the error term is not correlated with the included variable
(which means δ1 = 0), then we do not have endogeneity and we do not have bias. Happy,
happy, happy.
If either of these two conditions holds, there is no omitted variable bias. In most cases,
though, we can’t be sure that at least one condition holds because we don’t actually have a
measure of the omitted variable. In that case, we can use omitted variable bias concepts to
speculate on the magnitude of the bias. The magnitude of bias depends on how much the
omitted variable explains Y (which is determined by —2 ) and how much the omitted variable
is related to the included variable (which is reflected in ”1 ). Sometimes we can come up with
possible bias but believe that —2 or ”1 is small, meaning that we shouldn’t lose too much
sleep over bias. On the other hand, in other cases, we might think —2 and ”1 are huge. Hello,
insomnia.
In Chapter 14 we cover additional topics related to omitted variable bias. On page 728 we
discuss how to use the bias equation to anticipate whether omission of a variable will cause
the estimated coefficient to be higher or lower than it should be. On page 729 we discuss the
more complicated case in which the true model and estimated model have more variables.
In these situations, things get a little harder to predict than in the case we have discussed.
As a general matter, bias usually (but not always) goes down when we add variables that belong in the model.
Remember This
Two conditions must both be true for omitted variable bias to occur:
1. The omitted variable affects the dependent variable.
• Mathematically: β2 ≠ 0 in Equation 5.4 on page 209.
• An equivalent way to state this condition is that X2i really should have been
in Equation 5.4 in the first place.
2. The omitted variable is correlated with the included independent variable.
• Mathematically: δ1 ≠ 0 in Equation 5.6 on page 210.
3. Omitted variable bias is more complicated in models with more independent
variables, but the main intuition applies.
The data is structured such that even though data exists on the economic growth in
these countries for each year, we are looking only at the average growth rate across the forty
years from 1960 to 2000. Thus each country gets only a single observation. We control for
GDP per capita in 1960 because of a well-established phenomenon that countries that were
wealthier in 1960 have a slower growth rate. The poor countries simply have more economic
capacity to grow. The main independent variable of interest at this point is average years of schooling.
The results in the left-hand column of Table 5.3 suggest that additional years of schooling promote economic growth. The β̂1 estimate implies that each additional average year of
schooling within a country is associated with 0.44 percentage points higher annual economic
growth. With a t statistic of 4.22, this is a highly statistically significant result. Using the
standard error and techniques from page 181, we can also calculate a 95% confidence interval for this coefficient.
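The standard error itself is not reported in the text, but it can be backed out from the coefficient and t statistic; a minimal Stata sketch of the calculation (using 1.96 as the large-sample critical value):

local b  = 0.44
local se = `b'/4.22      // standard error implied by the reported coefficient and t statistic
display "95% CI: " %5.3f (`b' - 1.96*`se') " to " %5.3f (`b' + 1.96*`se')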
Sounds good: More education, more growth. Nothing more to see here, right? Not to
Hanushek and Woessmann. Their intuition was that not all schooling is equal. They were
skeptical that simply sitting in class and racking up the years improves economically useful
skills and argued that we should assess whether quality of education made a difference, not
simply the quantity of it. As their measure of quality, they used average math and science
test scores.
Before getting to their updated model, it’s useful to get a feel for the data. Panel (a) of
Figure 5.4 shows a scatterplot of economic growth and average years of schooling. There’s
not an obvious relationship. (The strong positive coefficient we observe in the first column
of Table 5.3 is due to the fact that GDP in 1960 was also controlled for.) Panel (b) of Figure
5.4 shows a scatterplot of economic growth and average test scores. The observations with
high test scores also tended to have high economic growth, suggesting a relationship between
the two.
Could it be that the real story is that test scores explain growth, not years in school? If
so, why is there a significant coefficient on years of schooling in the first column of Table
5.3? We know the answer: Omitted variable bias. As discussed on page 209, if a variable
that matters (and we suspect test scores matter) is omitted, the estimate of the effect of the
variable that is included will be biased if the omitted variable is correlated with the included
variable. To address this issue, panel (c) of Figure 5.4 shows a scatterplot of average test
scores and average years of schooling. Yes, indeed, these variables look quite correlated as
observations with high years of schooling also tend to have high test scores. Hence, the conditions for omitted variable bias are plausibly satisfied.
Therefore it makes sense to add test scores to the model, as the right-hand column of Table 5.3 does. The coefficient on years of schooling differs markedly from before. It is now very close
to zero. The coefficient on average test scores, on the other hand, is 1.97 and statistically significant.
[FIGURE 5.4: Economic Growth, Years of School, and Test Scores. Panel (a) plots average economic growth (in %) against average years of school; panel (b) plots average economic growth against average test scores; panel (c) plots average test scores against average years of school.]
Because the scale of the test score variable is not immediately obvious, we need to do
a bit of work to interpret the substantive significance of the coefficient estimate. Based on
descriptive statistics (not reported), the standard deviation of the test score variable is 0.61.
The results therefore imply that increasing average test scores by a standard deviation is associated with a 0.61 × 1.97 = 1.20 percentage point increase in the average annual growth rate over these forty years. This increase is large when we are talking about annual growth rates sustained over four decades.
Notice the very different story we have across the two columns. In the first one, years of
schooling is enough for economic growth. In the second specification, quality of education
as measured with math and science test scores matters more. The second specification is
better because it shows that a theoretically sensible variable matters a lot. Excluding this
variable, as the first specification does, risks omitted variable bias. In short, these results
suggest education is about quality, not quantity. High test scores explain economic growth
better than years in school. Crappy schools do little; good ones do a lot. These results don’t
end the conversation about education and economic growth, but they do move it ahead a good deal.
We can apply omitted variable concepts to understand the effects of measurement error on
our estimates. Measurement error is pretty common; it occurs when a variable is measured
inaccurately.
In this section we define the problem, show how to think of it as an omitted variables
problem, and then characterize the nature of the bias caused when independent variables are measured with error.
Quick: How much money is in your bank account? It's pretty hard to recall the exact
amount (unless it’s zero!). So a survey of wealth relying on people to recall their savings
is probably going to have at least a little error and maybe a lot (especially as people start
getting squirrely about talking about money and some overreport and some underreport).
And many, perhaps even most, variables could have error. Just think how hard it would be to measure precisely many of the things social scientists care about.
OLS will do just fine if the measurement error is only in the dependent variable. In this
case, the measurement error is simply part of the overall error term. The bigger the error,
the bigger the variance of the error term. We know that in bivariate OLS, a larger variance of the error term increases the variance of β̂1 without causing bias; the same logic applies here.
OLS will not do so well if the measurement error is in an independent variable. In this case, the OLS estimate will systematically under-estimate the magnitude of the coefficient. To see why, suppose the true model is

Yi = β0 + β1X1i* + εi

where X1i* is the true value of the independent variable, free of measurement error.
Instead we observe our independent variable with error; that is, we observe some X1 that is a function of the true value X1* and some error. For example, suppose we observe reported rather than actual wealth:

X1i = X1i* + νi

We keep things simple here by assuming that the measurement error (νi) has a mean of zero and is uncorrelated with X1i* and εi. Rearranging gives X1i* = X1i − νi, and substituting this into the true model yields
Yi = β0 + β1(X1i − νi) + εi
   = β0 + β1X1i − β1νi + εi     (5.8)
The trick here is to think of this example as an omitted variable problem where νi is the
omitted variable. We don’t observe the measurement error directly, right? If we could observe
it, we would fix our darn measure of X1 . So what we do is treat the measurement error as
an unobserved variable that by definition we must omit and see how this particular form of
omitted variable bias plays out. Compared to a generic omitted variable bias problem, we
know two things that allow us to be more specific than in the general omitted variable case:
the coefficient on the omitted term (νi) is −β1 and νi relates to X1 as in Equation 5.8.
We go step by step through the logic and math on page 732 in Chapter 14. The upshot
is that as the sample size gets very large, the estimated coefficient when the independent variable is measured with error converges to

plim β̂1 = β1 × σ²_X1* / (σ²_ν + σ²_X1*)

Notice that β̂1 converges to the true coefficient times a quantity that has to be less than one.
The equation becomes quite intuitive if we look at two extreme scenarios. If σ²_ν is zero, the measurement error has no variance and must always equal zero (given that we assumed it is a mean-zero random variable). In this case σ²_X1* / (σ²_ν + σ²_X1*) will equal one (assuming σ²_X1* is not zero, which is simply assuming X1* varies). In other words, if there is no error in the measured value of X1 (which is what σ²_ν = 0 means), then plim β̂1 = β1 and our estimate of β1 will converge to the true value as the sample gets larger. This conclusion makes sense: No measurement error, no problem. At the other extreme, if the variance of the measurement error is large relative to the variance of X1*, the ratio σ²_X1* / (σ²_ν + σ²_X1*) will be well below one, which means that the probability limit of β̂1 will be smaller than the true value. This result also makes sense: If the measurement of the independent variable is junky, it makes sense that our estimate of its effect gets pulled toward zero.
We refer to this particular example of omitted variable bias as attenuation bias because when we omit the measurement error term from the model our estimate β̂1 deviates from the true value by a multiplicative factor between zero and one. This means that β̂1 will be closer to zero than it should be when X1 is measured with error. If the true value of β1 is some positive number, we see values of β̂1 less than they should be. If the true value of β1 is negative, we see values of β̂1 larger (meaning closer to zero) than they should be.
Remember This
1. Measurement error in the dependent variable does not bias β̂ coefficients, but does increase the variance of the estimates.
2. Measurement error in an independent variable causes attenuation bias. That is, when X1 is measured with error, β̂1 will be closer to zero than it should be.
• The attenuation bias is a consequence of the omission of the measurement
error from the estimated model.
• The larger the measurement error, the larger the attenuation bias.
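A simulation makes the attenuation result concrete. In this hypothetical sketch the true X1* has variance 4 and the measurement error has variance 1, so the estimated slope should shrink toward 2 × 4/(4 + 1) = 1.6.

clear
set obs 10000
set seed 2014
gen xstar = rnormal(0, 2)              // true X1*, standard deviation 2 (variance 4)
gen x     = xstar + rnormal(0, 1)      // observed X1 = X1* + measurement error (variance 1)
gen y     = 1 + 2*xstar + rnormal()    // true model: beta1 = 2
regress y xstar                        // coefficient near the true value of 2
regress y x                            // attenuated coefficient, roughly 1.6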
Precision is crucial for hypothesis tests and confidence intervals. In this section we show how the precision of our estimates in multivariate OLS is influenced by the extent to which the multiple independent variables co-vary together. We
also discuss goodness of fit in the multivariate model and, in particular, what happens when
we include independent variables that don’t explain the dependent variable at all.
The variance of coefficient estimates for the multivariate model is similar to the variance of β̂1 for the bivariate model. As with the variance of β̂1 in bivariate OLS, the equation we present applies when errors are homoscedastic and not correlated with each other. Things get more complicated when errors are heteroscedastic or correlated with each other, but the intuitions here still apply.
We denote the coefficient of interest as β̂j to indicate it is the coefficient associated with the jth independent variable. The variance of the coefficient on the jth independent variable is

var(β̂j) = σ̂² / (N × var(Xj)(1 − R²j))     (5.9)
This equation is similar to the equation for the variance of β̂1 in bivariate OLS (Equation 3.9 on page 95). The new bit relates to the (1 − R²j) in the denominator. Before elaborating on R²j, let's note the parts from the bivariate variance equation that carry through to the multivariate context.
• The numerator contains σ̂², which is smaller the better the model fits the data. All else equal, β̂j will be more precise the better our variables are able to explain the dependent variable. This point is particularly relevant for experiments. As long as the experiment was not plagued by attrition, balance, or compliance problems, we are not worried about endogeneity and hence do not need to add control variables to avoid bias. Multivariate OLS does help in experiments, however, by improving the fit of the model, thus reducing σ̂² and, with it, the variance of our estimates.
• In the denominator we see the sample size, N . As for the bivariate model, as we get
more data, this term in the denominator gets bigger, making var(β̂j) smaller.
• The denominator also includes var(Xj), the variance of the jth independent variable. The more Xj varies, the bigger the denominator will be. The bigger the denominator, the smaller var(β̂j) will be.
Multicollinearity
The new element in Equation 5.9 compared to the earlier variance equation is the (1 − R²j). Notice the j subscript. We use the subscript to indicate that R²j is the R² from an auxiliary regression in which Xj is the dependent variable and all the other independent variables in the full model are the independent variables in the auxiliary model. The R² without the j subscript is simply the R² from the main equation.
There is a different R²j for each independent variable. For example, if our model is

Yi = β0 + β1X1i + β2X2i + β3X3i + εi

then
• R²1 is the R² from X1i = γ0 + γ1X2i + γ2X3i + εi, where the γ (the Greek letter gamma) parameters are estimated coefficients from OLS. We're not really interested in the value of these parameters. We're not making any causal claims about this model and are just interested in how well the other independent variables explain X1 (as summarized by R²1).
• R²2 and R²3 come from analogous auxiliary regressions in which X2 and X3, respectively, are the dependent variables (each summarized by the corresponding R²j). (We're being a bit loose notationally and re-using the γ notation in each equation.)
These R²j tell us how much the other variables explain Xj. If the other variables explain Xj very well, the R²j will be high and – here's the key insight – the denominator will be smaller. Notice that the denominator of the equation for var(β̂j) has (1 − R²j) in it. Remember that any R² is between 0 and 1, so as R²j gets bigger, 1 − R²j gets smaller, which in turn makes var(β̂j) bigger. The intuition is that if variable Xj is virtually indistinguishable from the other independent variables, it makes sense that it is hard to tell how much that variable itself matters. When an independent variable is strongly related to the other independent variables, the variance of the coefficient we estimate for that variable will be high. We say there is multicollinearity when independent variables have strong linear relationships. The term comes from “multi” for multiple variables and “co-linear” because they vary together in a linear fashion. The polysyllabic jargon should not hide a simple fact: The variance of our estimates increases when an independent variable is strongly related to the other independent variables in the model. That is multicollinearity.
It’s really important to understand what multicollinearity does. It does not cause bias. It
doesn’t even cause the standard errors of —ˆ1 to be incorrect. It simply causes the standard
errors to be bigger than they would be if there were no multicollinearity. In other words,
OLS is on top of the whole multicollinearity thing, producing estimates that are unbiased
with appropriately calculated uncertainty. It’s just that when variables are strongly related
to each other we’re going to have more uncertainty – the distributions of —ˆ1 will be wider,
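A short simulation with made-up data illustrates both points: collinearity widens the standard errors but leaves the estimates centered on the truth. The postestimation command estat vif reports the variance inflation factors for the last regression.

clear
set obs 500
set seed 2014
gen x1 = rnormal()
gen x2 = x1 + 0.3*rnormal()      // x2 highly correlated with x1
gen x3 = rnormal()               // x3 essentially uncorrelated with x1
gen y  = 1 + 2*x1 + rnormal()
regress y x1 x3                  // small standard error on x1
regress y x1 x2                  // larger standard error on x1; estimate still near 2
estat vif                        // variance inflation factors for the collinear model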
What, then, should we do about multicollinearity? If we have a lot of data, our standard
errors may be small enough to make reasonable inferences about the coefficients on the
collinear variables. In that case, we do not have to do anything. OLS is fine and we’re
perfectly happy. Both of our empirical examples in this chapter are consistent with this
scenario. In the height and wages analysis in Table 5.2, adult height and adolescent height
are highly correlated (at 0.86, actually) and yet the actual effects of these two variables are
so different that we can parse out their differential effects with the amount of data we have.
In the education and economic growth analysis in Table 5.3, the years of school and test
score variables are correlated at 0.81 and yet the effects are different enough that we can
parse out the differential effects of these two variables with the data we have.
However, if we have substantial multicollinearity we may get very large standard errors
on the collinear variables, making us unable to say much about any one variable. Some are tempted in such cases to drop one or more of the highly multicollinear variables and to focus only on the results for the remaining variables. This isn't quite fair, as we may not have solid evidence to know which variables we should drop and which we should keep.
A better approach is to be honest: We should just say that the collinear variables taken as
a group seem to matter or not and that we can’t parse out the individual effects of these
variables.
Suppose, for example, we are trying to explain undergraduate grades with two variables: scores from a standardized math test and scores from a standardized verbal reasoning test. Suppose also that these test score variables are highly correlated
and that when we run a model with both variables as independent variables, they are both
statistically insignificant in part because the standard errors will be very high due to the
high R²j values. If we drop one of the test scores, the remaining test score variable may be
statistically significant, but it would be poor form to believe, then, that only that test score
affected undergraduate grades. Instead, we should use the tools we present later in Section
7.4 on page 342 that allow us to assess whether both variables taken together explain grades. At that point, we may be able to say that we know standardized test scores matter, but that we cannot say much about the relative effect of math versus verbal test scores. So even though
it’d be more fun to say which test score matters, the statistical evidence may simply not be
An extreme case is perfect multicollinearity, which occurs when one independent variable is an exact linear function of one or more of the other independent variables. When that happens, R²j = 1 and var(β̂j) blows up due to having (1 − R²j) in the denominator (in the sense that the denominator becomes zero, which is a big no-no). In this case, statistical software will either refuse to estimate the model or will automatically delete enough variables to eliminate the problem. The most common example of perfect multicollinearity is when someone includes the same variable twice in a model.
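To see how the software reacts, here is a minimal made-up example; Stata will drop one of the perfectly collinear variables and note the omission.

clear
set obs 100
set seed 2014
gen x1 = rnormal()
gen x2 = 2*x1                 // x2 is an exact linear function of x1
gen y  = 1 + x1 + rnormal()
regress y x1 x2               // one of x1 and x2 is omitted because of collinearity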
Goodness of fit
Let’s talk about the regular old R2 , the one without a j subscript. As with the R2 for a
bivariate OLS model, the R2 for a multivariate OLS model measures goodness of fit and is
the square of the correlation of the fitted values and actual values (see Section 3.7).5 As
before, it can be interesting to know how well the model explains the dependent variable,
but this information is often not particularly useful. A good model can have a low R2 and
There is one additional wrinkle for R2 in the multivariate context. Adding a variable to
a model necessarily makes the R2 go up, at least by a tiny bit. To see why, notice that OLS
minimizes the sum of squared errors. If we add a new variable, the fit cannot be worse than
before because we can simply set the coefficient on this new variable to be zero, which is
5 The model needs to have a constant term for this interpretation to work – and for R² to be sensible.
equivalent to not having the variable in the model in the first place. In other words, every variable we add can only make the fit a little better, even if the variable doesn't truly affect the dependent variable. Just by chance, estimating a non-zero coefficient on this variable will typically improve the fit for at least a few observations.
Devious people therefore think “Aha, I can boost my R² by adding variables.” First of
all, who cares? R² isn't directly useful for much. Second of all, that's cheating. Therefore, most statistical software programs report so-called adjusted R² results. This measure is based on the R² but lowers the value depending on how many variables are in the model. The adjustment is ad hoc, and different people do it in different ways. The idea behind the adjustment is perfectly reasonable, but it's seldom worth getting too worked up about adjusting it. It's like electronic cigarettes. Yes, smoking them is less bad than smoking regular cigarettes, but that doesn't make them good for you.
The equation for the variance of β̂j is also helpful for understanding what happens when we include an irrelevant variable, meaning we add a variable to the model that shouldn't be there. Whereas our omitted variable discussion was about what happens when we exclude a variable that should be in the model, here we want to know what happens when we include a variable that should not be in the model.
Including an irrelevant variable does not cause bias. We can think of the situation as if we wrote down a model and the correct coefficient on the irrelevant variable happens to be zero. That doesn't cause bias; it's just another variable, we should get an unbiased estimate of its (zero) coefficient, and including it will not create endogeneity.
It might seem therefore that the goal is simply to add as many variables as we can get our
hands on. That is, the more we control for, the less likely there are to be factors in the error
term that are correlated with the independent variable of interest. The reality is different.
Including an irrelevant variable is not harmless. Doing so makes our estimates less precise, because including it will necessarily increase R²j, since R² always goes up when variables are added.6 This conclusion makes sense: The more we clutter up our analysis with variables that don't really matter, the harder it is to see a clear relationship between a given variable and the dependent variable.
6 Our discussion just above was about the regular R², but it also applies to any R² (from the main equation or an auxiliary equation). R² goes up as the number of variables increases.
Remember This
1. If errors are not correlated with each other and are homoscedastic, the variance of the β̂j estimate is

   var(β̂j) = σ̂² / (N × var(Xj)(1 − R²j))
Discussion Questions
1. How much will other variables explain Xj when Xj is a randomly as-
signed treatment? Approximately what will R²j be?
2. Suppose we are designing an experiment in which you can determine
the value of all independent variables for all observations. Do we want
the independent variables to be highly correlated or not? Why or why
not?
The judicial independence measure for country s at time t is based on the tenure of judges and the scope of judicial authority.7
We pretty quickly see that a bivariate model will be insufficient. What factors are in the
error term? Could they be correlated with judicial independence? Experience seems to show
that human rights violations occur less in wealthy countries. Wealthy countries also tend
to have more independent judiciaries. In other words, omission of country wealth plausibly satisfies the conditions for omitted variable bias to occur: The variable influences the dependent variable (human rights) and is correlated with the included independent variable (judicial independence).
7 This example is based on La Porta et al (2004). Measurement of abstract concepts like human rights and judicial
independence is not simple. See Harvey (2011) for more details.
Therefore it is a good idea to control for wealth when looking at the effect of judicial
independence on human rights. The left column of Table 5.4 presents results from such a
model. Wealth is measured by per capita GDP. The coefficient on judicial independence
is 11.37, suggesting that judicial independence does indeed improve human rights. The t
statistic is 2.53 so we reject the null hypothesis that the effect of judicial independence is
zero.
Is this the full story? Is there some omitted variable that affects human rights (the depen-
dent variable) that is correlated with judicial independence (the key independent variable)?
If there is, there could be a spurious association between judicial independence and human rights.
New York University professor Anna Harvey (2011) proposes exactly such a critique. She
argues that democracy might protect human rights and that the degree of democracy in a country could well be correlated with judicial independence.
Before we discuss what Harvey found, let’s think about what would have to be true if
omitting a measure of democracy is indeed causing bias using our conditions on page 214.
First, the level of democracy in a country actually needs to affect the dependent variable,
human rights (this is the β2 ≠ 0 condition). Is that true here? Very plausibly. We don't
know beforehand, of course, but it certainly seems possible that torture tends not to be a
great vote-getter. Second, democracy needs to be correlated with the independent variable
Table 5.4: Effects of Judicial Independence on Human Rights - Including Democracy Variable
of interest, which in this case is judicial independence. This we know is almost certainly
true: democracy and judicial independence definitely seem to go together in the modern
world. In Harvey’s data, democracy and judicial independence correlate at 0.26; not huge,
but not nuthin’. Therefore we have a legitimate candidate for omitted variable bias.
The right-hand column of Table 5.4 shows that Harvey’s intuition was right. When the
democracy measure is added, the coefficients on both judicial independence and GDP per
capita fall precipitously. The coefficient on democracy, however, is 24.93 with a t statistic of
Statistical significance is not the same as substantive significance. Let’s try to interpret
our results in a more meaningful way. If we generate descriptive statistics for our human
rights dependent variable, we see that it ranges from 17 to 99, with a mean of 67 and a
standard deviation of 24. Doing the same for the democracy variable indicates that it ranges
from 0 to 2 with a mean of 1.07 and a standard deviation of 0.79. A coefficient of 24.93 implies that a one standard deviation change in the democracy measure is associated with a 24.93 × 0.79 = 19.7 increase on the human rights scale. Given that the standard deviation of the dependent variable is 24, this is a pretty sizable association between democracy and human rights.8
This is a textbook example of omitted variable bias.9 When democracy is not accounted
for, judicial independence is strongly associated with human rights. When democracy is
accounted for, however, the effect of judicial independence fades to virtually nothing. And,
this is not just about statistics. How we view the world is at stake, too. The conclusion from
8 Determining exactly what is a substantively large effect can be subjective. There’s no rule book on what is large.
Those who have worked in a substantive area for a long time often get a good sense of what are large effects. An
effect might be considered large if it is larger than the effect of other variables that people think are important. Or
an effect might be considered large if we know that the benefit is estimated to be much higher than the cost. In the
human rights case, we can get a sense of what a 19.7 change in human rights scale means by looking at countries
that were around 20 points different on that scale. Pakistan was 22 points higher than North Korea. Decide if it
would make a difference to vacation in North Korea or Pakistan. If it would make a difference, then 19.7 is a large
difference; if not, then it’s not.
9 Or, it is now...
the initial model was that courts protect human rights. The additional statistical analysis suggests instead that democracy, not judicial independence, is doing the heavy lifting.
The example also highlights the somewhat provisional nature of social scientific conclu-
sions. Someone may come along with a variable to add or another way to analyze our data
that will change our conclusions. That is the nature of the social scientific process. We do
the best we can, but we leave room (sometimes a little, sometimes a lot) for a better way to
Table 5.4 also includes some diagnostics to help us think about multicollinearity, for surely
things like judicial independence, democracy, and wealth are correlated. Before looking at
specific diagnostics, we should note that collinearity of independent variables does not cause
bias. It doesn’t even cause the variance equation to be wrong. Instead, multicollinearity sim-
ply causes the variance to be higher than if there were no collinearity among the independent
variables.
The R²j for the judicial independence variable is 0.153, which comes from an auxiliary regression in which judicial independence is the dependent variable and the GDP and democracy variables are the independent variables. This value isn't particularly high, and if we plug it into the equation for the variance inflation factor (VIF) (which is just the part of the variance of β̂j associated with multicollinearity) we see that the VIF for the judicial independence variable is 1/(1 − R²j) = 1/(1 − 0.153) = 1.18. In other words, the variance
of the coefficient on the judicial independence variable is 1.18 times larger than it would
have been if the judicial independence variable were completely uncorrelated with the other independent variables. The R²j for the GDP per capita variable is higher and corresponds to a VIF of 2.24, which is larger but still not in a range people get too worried about. And, just to reiterate, this is not a problem to be corrected. Rather, we are simply noting that one source of variance of the coefficient estimate on GDP is multicollinearity. Another source is the sample size, and another is the fit of the model (indicated by σ̂, which here indicates that the fitted values are, on average, roughly 11.5 units away from the actual values).
That means we have to choose. We call the process model specification because it
is the process of specifying which variables we include in the model.10 This process is
tricky. Political scientist Phil Schrodt (2010) has noted that most experienced statistical
10Model specification also includes deciding on the functional form of the model. We discuss these issues in Chapter
7.
analysts have witnessed cases in which “even minor changes in model specification can lead to coefficient estimates that bounce around like a box full of gerbils on methamphetamines.”
This is an exaggeration – perhaps a box of caffeinated chinchillas is more like it – but there is more than a little truth to it.
In this section we discuss three dangers in model specification and how to conduct and present analysis with those dangers in mind. The central worry is model fishing, which occurs when a researcher adds and subtracts variables until they get just the answers they were looking for. Sometimes a given result may emerge under just the right conditions – perhaps when variables X1 and X4 are included and X2 and X3 are excluded – and this is not a result we should trust.
First, Model fishing is possible because the coefficient on any given variable can change
depending on what other variables are in the model. We have already discussed how omitted
variable bias can affect coefficients. We have also discussed how multicollinearity drives up
variance of our estimates, meaning that the —ˆ1 estimates will tend to bounce around more
when the independent variables are highly correlated with each other.
A second challenge in model specification is that sample size can change as we include
more variables. Sometimes we’re missing observations for some variables. For example, in
survey data it is pretty common that a good chunk of people do not answer questions
about their annual income. If we include a variable like income, OLS will use only the
observations for which that variable is not missing. Including a variable that is missing for half the people in
a sample will cut our sample size in half. This change in the sample can cause coefficient
estimates to jump around because, as we talked about with regard to sampling distributions
(on page ??), coefficients will differ for each sample. In some instances, the effects on a
coefficient can be substantial.
A third challenge in model specification is the inclusion of post-treatment variables in the model. These are variables that are themselves affected by
the independent variable of interest. For example, Harvard Professor Gary King discusses a
study of the effect of oil prices on perceptions of oil shortages based on surveys over several
years. We may be tempted to include a measure of media stories on oil because media stories
very plausibly affect public perceptions. On the other hand, the media stories themselves
may be a consequence of the oil price increase, meaning that if we include a media variable
in the model we may be underestimating the effect of oil prices on public opinion.
In an experimental context, we should control only for variables measured before the experiment or variables that do not
change (such as sex and race). In an observational context, it can be tricky to irrefutably
identify which variables are post-treatment, so it is good practice to think carefully about our
controls and to report results with and without variables that themselves may be affected by the independent variable of interest.
There are certain good practices that mitigate some of the dangers inherent in model speci-
fication. The first is to adhere to the replication standard. Some people see how coefficient
estimates can change dramatically depending on specification and become statistical cyn-
ics. They believe that statistics can be manipulated to give any answer. Such thinking lies
behind the aphorism "There are three kinds of lies: lies, damned lies, and statistics." Adhering to the replication standard – making our data and code available so that others can check and reproduce our results – means that statistical claims can be verified and, therefore,
believed. In this view, the saying should be "There are three kinds of lies: lies, damned lies, and statistics that cannot be replicated."
A second good practice is to present results from multiple specifications in a way that
allows readers to understand which steps of the specification are the crucial ones for the
conclusion being offered. Coefficients will change when variables are added or excluded; that
is, after all, the point of multivariate analysis. For the analysis to be credible, though, it
needs to be clear about which specification decisions drive the results. Readers need to know
whether the results are robust to a number of reasonable specification choices or depend on one particular specification.
All statistical analysis should, as a matter of course, report multiple specifications, typi-
cally from a simple model to more complicated models. We saw an example with the height
and weight data in Table 5.2 on page 201 and will see more examples throughout the book.
Remember This
1. An important part of model specification is choosing what variables to include in
the model. Challenges in this process include
(a) Model fishing, which occurs when a researcher searches for a subset of possible
independent variables that provides a desired result.
(b) Changes in sample size (and potentially in results) due to inclusion of inde-
pendent variables with missing observations.
(c) Distortions caused by including post-treatment variables in a model.
2. Researchers should adhere to the replication standard and report multiple specifi-
cations in order to demonstrate the robustness of results and to highlight variables
associated with changes in coefficients.
5.6 Conclusion
Multivariate OLS is a huge help in our fight against endogeneity because it allows us to
add variables to our models. Doing so cuts off at least part of the correlation between an
independent variable and the error term because the included variables are no longer in the
error term. For observational data, multivariate OLS is essential, although we seldom
can wholly defeat endogeneity simply by including variables. For experimental data not
suffering from attrition, balance, or compliance problems, we can beat endogeneity without
multivariate OLS, but multivariate OLS makes our estimates more precise.
A useful way to think about multivariate OLS is as an effort to avoid omitted variable
bias. Omitting a variable causes problems when both of the following are true: the omitted
variable affects the dependent variable and it is correlated with the included independent
variable.
While we are most concerned with the factors that bias estimates, we have also identified
four factors that make our estimates less precise. Three were the same as with bivariate
OLS: poor model fit, limited variation in the independent variable, and small data sets. The fourth, multicollinearity, is new to the multivariate context: when independent
variables are highly correlated, they get in the way of each other and make it hard for us to
know which one has which effect. The result is not bias, but imprecision.
• Section 5.1: Write down the multivariate regression equation and explain all its elements
(dependent variable, independent variables, coefficients, intercept and error term). Ex-
plain how adding a variable to a multivariate OLS model can help fight endogeneity.
• Section 5.2: Explain omitted variable bias, including the two conditions necessary for it to occur.
• Section 5.3: Explain what measurement error in dependent and independent variables does to our estimates.
• Section 5.4: Produce the equation for the variance of β̂1 and explain the elements of it,
including σ̂², Σ_{i=1}^{N} (X_ij − X̄_j)², and R_j². Use this equation to explain the consequences of multicollinearity.
Further Reading
King, Keohane, and Verba (1994) provide an intuitive and useful discussion of omitted
variable bias.
Another key point is that the real problem with multicollinear data is that the estimates will be imprecise.
We defeat imprecision with more data; hence the problem of multicollinearity is, at root, a problem of not having enough data.
Morgan and Winship (2014) provide a fascinating alternative way of thinking about con-
trolling for multiple variables. They spend a fair bit of time discussing the strengths and weaknesses of different approaches to doing so.
Statistical results can often be more effectively presented as figures instead of tables.
Kastellec and Leoni (2007) provide a nice overview of the advantages and options for such
an approach.
Key Terms
• Adjusted R2 (231)
• Attenuation bias (223)
• Auxiliary regression (210)
• Ceteris paribus (197)
• Confidence interval (201)
• Irrelevant variable (231)
• Measurement error (220)
• Model-fishing (241)
• Model specification (240)
• Multicollinearity (227)
• Multivariate OLS (194)
• Omitted variable bias (211)
• Perfect multicollinearity (230)
• Variance inflation factor (227)
Computing Corner
Stata
1. To estimate a multivariate OLS model, we simply extend the syntax from bivariate OLS
(described on page 126). The syntax is
reg Y X1 X2 X3
For heteroscedasticity-consistent standard errors, simply add the robust subcommand
(as discussed on page 103):
reg Y X1 X2 X3, robust
2. There are two ways to assess multicollinearity.
• Calculating the R_j² for each variable. For example, calculate the R_1² via
reg X1 X2 X3
and calculate the R_2² via
reg X2 X1 X3
• Stata also provides a variance inflation factor command that estimates 1/(1 − R_j²) for
each variable. This command needs to be run immediately after the main model of
interest. For example,
reg Y X1 X2 X3
vif
would provide the variance inflation factor for all variables from the main model.
A VIF of 5, for example, indicates that the variance is five times higher than it
would be if there were no multicollinearity.
R
1. To estimate a multivariate OLS model, we simply extend the syntax described on page
128. The syntax is
OLSResults = lm(Y ~ X1 + X2 + X3)
For heteroscedasticity-consistent standard errors, install and load the AER package and
use the coeftest and vcovHC commands as follows (as discussed on page 130):
coeftest(OLSResults, vcov = vcovHC(OLSResults, type = "HC1"))
2. To assess multicollinearity, calculate the R_j² for each variable. For example, calculate
the R_1² via
AuxReg1 = lm(X1 ~ X2 + X3)
and calculate the R_2² via
AuxReg2 = lm(X2 ~ X1 + X3)
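The auxiliary R² values can then be converted into variance inflation factors directly. A minimal sketch, continuing from the AuxReg1, AuxReg2, and OLSResults objects defined above; the vif() function is in the car package, which is assumed to be installed:
# VIFs computed from the auxiliary R-squared values
vif_X1 <- 1 / (1 - summary(AuxReg1)$r.squared)
vif_X2 <- 1 / (1 - summary(AuxReg2)$r.squared)
vif_X1
vif_X2
# Alternative: report the VIF for every variable in the main model at once
# library(car)
# vif(OLSResults)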
Exercises
1. Table 5.5 describes variables from heightwage.dta we will use in this problem. We
previously saw this data in Chapter 3 on page 112 and in Chapter 4 on page 187.
a. Estimate two OLS regression models: one in which adult wages is regressed on adult
height for all respondents, the other in which adult wages is regressed on adult height
and adolescent height for all respondents. Discuss differences across the two models.
Explain why the coefficient on adult height changed.
b. Assess the multicollinearity of the two height variables using (i) a plot, (ii) the variance
inflation factor command, and (iii) an auxiliary regression. For the plot, run once
Table 5.5: Variables for Height and Weight Data in the United States
without the jitter subcommand (e.g., scatter X1 X2) and once with it (e.g., scatter
X1 X2, jitter(3)), and choose the more informative of the two plots. (Note that in
the auxiliary regression it’s useful to limit the sample to observations where wage96
is not missing so that the R2 from the auxiliary regression will be based on the same
number of observations as the regression used for the vif command. The syntax is if
wage96 !=. where the exclamation means “not” and the period is how Stata marks
missing values.)
c. Notice that IQ is omitted from the model. Is this a problem? Why or why not?
d. Notice that eye color is omitted from the model. Is this a problem? Why or why
not?
e. You’re the boss! Estimate a model that you think sheds light on an interesting
relationship, using the data in the file. The specification decisions include deciding
whether to limit the sample and what variables to include. Report only a single
additional specification. Describe in two paragraphs or less why this is an interesting
way to assess the data.
2. Use the MLBattend.dta data on Major League Baseball attendance records for 32 teams
from the 1970s through 2000 that we used in Chapter 4 on page 190. We are interested
in the factors that impact baseball game attendance.
a. Estimate a regression in which home attendance rate is the dependent variable and
wins, runs scored, and runs allowed are the independent variables. Report your
results, identify variables that are statistically significant, and interpret all significant
coefficients.
b. Suppose someone argues that we need to take into account the fact that the U.S.
population grew from 1970 through 2000. This particular data set does not have a
population variable, but it does have a variable called season, which indicates what
season the data is from (e.g., season equals 1969 for observations from 1969 and
season equals 1981 for observations from 1981, etc.). What are the conditions that
need to be true for omission of the season variable to bias other coefficients? Do you
think they hold in this case?
c. Estimate a second regression using the dependent and independent variables from
part (a), but including season as an additional independent variable to control for
trends on overall attendance over time. Report your results and discuss the differ-
ences between these results and those observed in part (a).
d. What is the relationship between season and runs scored? Assess with an auxiliary
regression and a scatterplot. Discuss the implications for the results in part (c).
3. Do cell phones distract drivers and cause accidents? Worried that they do, many states
over the last ten years have passed legislation to reduce distracted driving. Fourteen
states have passed legislation making handheld cell phone use while driving illegal and
44 states have banned texting while driving. This problem looks more closely at the
relationship between cell phones and traffic fatalities. Table 5.6 describes the variables
in the data set Cellphone 2012 homework.dta.
Table 5.6: Variables for Cell Phones and Traffic Deaths Questions
a. While we don’t have the number of people who are using the phone while driving,
we do have the number of cell phones subscriptions within a state (in thousands).
Estimate a bivariate model with traffic deaths as the dependent variable and number
of cell phone subscriptions as the independent variable. Briefly discuss the results
and explain if you suspect endogeneity and why.
b. Add population to the model. What happens to the coefficient on cell phone sub-
scriptions? Why?
c. Add total miles driven to the model. What happens to the coefficient on cell phone
subscriptions? Why?
d. Based on the model in part (c), calculate the variance inflation factor for population
and total miles driven. Why are they different? Discuss implications of this level
of multicollinearity for the coefficient estimates and the precision of the coefficient
estimates.
4. What determines how much drivers are fined if they are stopped for speeding? Do
demographics like age, gender, and race matter? To answer this question, we'll in-
vestigate traffic stops and citations in Massachusetts using data from Makowsky and
Stratmann (2009). Even though state law sets a formula for tickets based on how fast
the driver was driving, police officers in practice often deviate from the formula. The
data in speeding tickets text.dta includes information on all traffic stops. It contains
an amount for the fine for only those observations for which the police officer decided
to assess a fine.
Table 5.7: Variables for Speeding Ticket Data
a. Estimate a bivariate OLS model in which ticket amount is a function of age. Is age
statistically significant? Is endogeneity possible?
b. Estimate the model from part (a) also controlling for miles per hour over the speed
limit. Explain what happens to the coefficient on age and why.
c. Suppose we had only the first 1,000 observations in the data set. Estimate the model
from part (b) and report on what happens to the standard errors and t statistics when
we have fewer observations. (In Stata, use if _n <= 1000 at the end of the regression
command to limit the sample to the first 1,000 observations. Because the amount is
missing for drivers who were not fined, the sample size will be much smaller than
1,000.)
5. We will continue the analysis of height and wages in Britain from the homework prob-
lem in Chapter 3 on page 133. We want to know if the relationship between height
and wages in the United States also occurs among British men. The data set height-
wage british males multivariate.dta contains data on males in Britain from Persico,
Postlewaite, and Silverman (2004). Table 5.8 lists the variables.12
12For the reasons discussed in the homework problem in Chapter 3 on page 133 we limit the data set to observations
with height greater than 40 inches and self-reported income less than 400 British pounds per hour. We also exclude
a. Persico, Postlewaite, and Silverman (2004) argue that adolescent height is most rel-
evant because it is height at these ages that affects the self-confidence to develop
interpersonal skills at a young age. Estimate a model with wages at age 33 as the
dependent variable and both height at age 33 and age 16 as independent variables.
What happens to the coefficient on height at age 33? Explain what is going on here.
b. Let’s keep going. Add height at age 7 to the above model and discuss the results.
Be sure to note changes in sample size (and its possible effects) and to discuss the
implications of adding a variable with the statistical significance observed for the
height at age 7 variable.
c. Is there multicollinearity in the model from part (b)? Diagnose it and indicate its
consequences. Be specific as to whether the multicollinearity will bias coefficients or
have some other effect.
d. Perhaps characteristics of parents affect height (some force kids to eat veggies, while
others give them only french fries and Fanta). Add the two parental education
variables to the model and discuss results. Include only height at age 16 (meaning
we do not include the height at ages 33 and 7 for this question – although feel free
to include them too on your own; the results are interesting).
e. Perhaps kids had their food stolen by greedy siblings. Add the number of siblings to
the model and discuss results.
f. We have included a variable, Ht16Noisy, which is the height measured at age 16 with
some random error included. In other words, it does not equal the actual measured
height at age 16, but is a "noisy" measure of height at age 16. Estimate the model
using Ht16Noisy instead of height16 and discuss any changes in the coefficient on the
height at age 16 variable.
observations of individuals who grew shorter from age 16 to age 33. Excluding these observations doesn’t substantially
affect the results we see here, but since it’s reasonable to believe there is some kind of non-trivial measurement error
for these cases, we exclude them for the analysis for this question.
CHAPTER 6
Dummy Variables: Smarter Than You Think
Consider data from English Premier League soccer in 2012-13. Panel (a) of Figure 6.1 shows the goal differential for Manch-
ester City’s 38 games, distinguishing between home and away games. The average goal
differential for away games is about 0.32 (meaning the team scored on average 0.32 more
goals than their opponents when playing away from home). The average goal differential
for home games is about 1.37, meaning that the goal differential is more than 1 goal higher
at home. Well done, obnoxious drunk fans! Panel (b) of Figure 6.1 shows the goal differen-
tial for Manchester United. The average goal differential for away games is about 0.90 and
the average goal differential for home games is about 1.37 (coincidentally the same value as
for Manchester City). These numbers mean that the home field advantage for Manchester
United is only about 0.47. C’mon Manchester United fans – yell louder!
We can use OLS to easily generate such estimates and conduct hypothesis tests. And we
can do much more. We can estimate such difference of means while controlling for other
variables and we can see whether covariates have different effects at home and away. The key
step is using a dummy variable, a variable that equals either zero or one, as the independent
variable.
In this chapter we show the many powerful uses of dummy variables in OLS models.
Section 6.1 shows how to use a bivariate OLS model for difference of means. Section 6.2
shows how to use multivariate OLS to control for other variables when conducting a difference
of means test. In Section 6.3 we use dummy variables to control for categorical variables,
which indicate membership in one of multiple categories. Religion and race are
classic categorical variables. Section 6.4 discusses how dummy variable interactions allow
us to estimate different slopes for different groups. This chapter covers dummy independent variables.
[Figure: goal differential (vertical axis) for away (0) and home (1) games; panel (a) is Manchester City, panel (b) is Manchester United, with the average for away games marked in each panel]
FIGURE 6.1: Goal Differentials for Home and Away Games for Manchester City and Manchester United
Researchers frequently want to know whether two groups differ. In experiments, researchers
are curious whether the treatment group differed from the control group. In observational
studies, researchers want to compare outcomes between categories: men versus women,
college grads versus high school grads, Ohio State versus Michigan. These comparisons are
often referred to as difference of means tests because they involve comparing the mean
of Y for one group (e.g., the treatment group) against the mean of Y for another group (e.g.,
the control group). In this section, we show how to use the bivariate regression model and
OLS to make such comparisons. We also work through an example involving opinions about
President Obama.
Consider a typical experiment. There is a treatment group that is a randomly selected group
of individuals who were given a treatment. There is also a control group that received no
treatment. We use a dummy variable to represent whether or not someone was in the
treatment group. A dummy variable equals either zero or one for each observation. Dummy
variables are also referred to as dichotomous variables. Typically, the dummy variable
equals one for those in the treatment group and zero for those in the control group.
Y_i = β0 + β1 Treatment_i + ε_i        (6.1)

where Y_i is the dependent variable, β0 is the intercept, β1 is the effect of being treated, and
Treatment_i is our independent variable ("X_i"). This variable equals 1 if person i received
the experimental treatment and 0 otherwise. As usual, ε_i is the error term. Because this
is an experiment (one that we assume does not suffer from attrition, balancing, or compli-
ance problems), ε_i will be uncorrelated with Treatment_i, thereby satisfying the exogeneity
condition.
The standard interpretation of β̂1 from bivariate OLS applies here: A one-unit increase in
the independent variable is associated with a β̂1 increase in Y_i. (See page 78 on the standard
OLS interpretation.) Equation 6.1 implies that getting the treatment (going from 0 to 1 on
Treatment_i) is associated with a β1 change in Y_i.
When our independent variable is a dummy variable, as with our Treatment_i variable,
we can also treat β̂1 as an estimate of the difference of means of our dependent variable Y
across the two groups. To see why, note first that the fitted value for the control group (for
whom Treatment_i = 0) is

Ŷ_i = β̂0 + β̂1 × 0 = β̂0

In other words, β̂0 is the predicted value of Y for individuals in the control group. It is
not surprising that the value of β̂0 that best fits the data is simply the average of Y_i for
individuals in the control group.
The fitted value for the treatment group (for whom Treatment_i = 1) is

Ŷ_i = β̂0 + β̂1 × 1 = β̂0 + β̂1

In other words, β̂0 + β̂1 is the predicted value of Y for individuals in the treatment group.
The best predictor of this value is simply the average of Y for individuals in the treatment
group. Because β̂0 is the average of individuals in the control group, β̂1 is the difference in
averages between the treatment and control groups. If β̂1 > 0, then the average Y for those
in the treatment group is higher than for those in the control group. If β̂1 < 0, then the
average Y for those in the treatment group is lower than for those in the control group. If
β̂1 = 0, then the average Y for those in the treatment group is no different than the average
for those in the control group.
In other words, our slope coefficient (β̂1) is, in the case of a bivariate OLS model with a
dummy independent variable, a measure of the difference in means across the two groups.
The standard error on this coefficient tells us how much uncertainty we have and determines
whether we can reject the null hypothesis of no difference between the groups.
Figure 6.2 graphically displays the difference of means test in bivariate OLS with a scat-
1 The proof is a bit laborious. We show it in the appendix on page 798.
[Figure: the dependent variable plotted against the treatment variable (control group = 0, treatment group = 1); β̂0 marks the average for the control group, β̂0 + β̂1 the average for the treatment group, and β̂1 is the slope connecting them]
FIGURE 6.2: Bivariate OLS with a Dummy Independent Variable
terplot of data. It looks a bit different than our previous scatterplots (e.g. Figure 3.1 on
page 66) because here the independent variable takes on only two values: 0 or 1. Hence
the observations are stacked at 0 and 1. In our example the values of Y when X = 0 are
generally lower than the values of Y when X = 1. The parameter β̂0 corresponds to the
average of Y for all observations for which X = 0. The average for the treatment group (for
whom X = 1) is β̂0 + β̂1. The difference in averages across the groups is β̂1. A key point
is that the standard interpretation of coefficients in bivariate OLS still applies: A one-unit increase in the independent variable is associated with a β̂1 change in the dependent variable. When the independent variable is a dummy variable – as it
typically is for experiments and often is for observational data – we can simply run bivariate
OLS and the β̂1 coefficient tells us the difference of means. The standard error on this
coefficient tells us how precisely we have measured this difference and allows us to conduct
hypothesis tests.
OLS produces difference of means tests for observational data as well. The model and
interpretation are the same; the difference is how much we worry about whether the exogene-
ity assumption is satisfied. Typically, exogeneity will be seriously in doubt for observational
data. And sometimes it is useful to use OLS to estimate the difference of means as a de-
scriptive matter, even when exogeneity is in doubt.
Difference of means tests can be conducted without using OLS. Doing so is totally fine, of
course; in fact, OLS and non-OLS difference of means tests that assume the same variance across
groups produce identical estimates and standard errors. The advantage of the OLS approach
is that we can do it within a framework that also does all the other things OLS does, such as controlling for other variables.
Table 6.1 provides an example. The left-hand column presents results from a model of feel-
ings toward President Obama from a December 2011 public opinion survey. The dependent
variable is a "feeling thermometer" scale where 0 is feeling very cold toward him and 100 is feeling very warm toward him.
The independent variable is a dummy variable called Democrat that is 1 for respondents
who identify themselves as Democrats and 0 for those who do not. The Democrat variable
therefore lumps everyone else together (Republicans, independents, supporters of other parties, and non-partisans). The results indicate that Democrats rate Obama 41.82 points higher than non-Democrats, on average.
Difference of means tests convey the same essential information when the coding of the
dummy variable is flipped. The column on the right in Table 6.1 shows results from a
model in which NotDemocrat was the independent variable. This variable is the opposite
2A standard OLS regression model produces a standard error and a t statistic that are equivalent to the standard
error and t statistic produced by a difference of means test in which variance is assumed to be the same across both
groups. An OLS model with heteroscedasticity-consistent standard errors (as discussed in Section 3.6) produces a
standard error and t statistic that are equivalent to a difference of means test in which variance differs across groups.
The Computing Corner shows how to estimate these models.
of the Democrat variable, equaling 1 for non-Democrats and zero for Democrats. The
numerical results are different, but they nonetheless contain the same information. The
constant in the right-hand column is the mean evaluation of Obama by Democrats. In the first specification this mean
is β̂0 + β̂1 = 23.38 + 41.82 = 65.20; in the second specification it is simply β̂0 because
this is the mean value for the excluded category. In the first equation the coefficient on
Democrat is 41.82, indicating that Democrats evaluated Obama 41.82 points higher than non-Democrats; in the second equation the coefficient on NotDemocrat is -41.82, which conveys the same information with the sign flipped.
Figure 6.3 scatterplots the data and highlights the estimated differences in means between
non-Democrats and Democrats. Dummy variables can be a bit tricky to plot because the
values of the independent variable are only 0 or 1, causing the data to overlap such that we
can’t tell if each dot in the scatterplot indicates 2 or 200 observations. A trick of the trade
is to jitter each observation by adding a small random number to each observation for the
independent and dependent variables. The jittered data gives the cloud-like images in the
figure that help us get a decent sense of the data. We jitter only the data that is plotted;
we do not jitter the data when running the statistical analysis. The Computing Corner at
the end of this chapter shows how to jitter data for plots.3
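A minimal R version of this trick, using simulated data that only loosely mimics the feeling thermometer example (the numbers are made up): jitter the plotted copy of the data, but estimate the model on the original values.
set.seed(1)
democrat <- rbinom(500, 1, 0.4)
feeling  <- 23 + 42 * democrat + rnorm(500, sd = 20)
feeling  <- pmin(pmax(feeling, 0), 100)      # keep within the 0-100 scale
# Jittered copies for display only
plot(jitter(democrat, amount = 0.08), jitter(feeling, amount = 2),
     xlab = "Democrat (0/1)", ylab = "Feeling thermometer")
# The regression uses the unjittered data
summary(lm(feeling ~ democrat))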
Non-Democrats’ feelings toward Obama clearly run lower because there are many more
observations at the low end of the feeling thermometer scale for non-Democrats. Their
average feeling thermometer rating is 23.38. Feelings toward Obama among Democrats are
higher, with an average of 65.20. Both of the specifications in Table 6.1 tell this same story.
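The equivalence described in footnote 2 is easy to verify. Here is a minimal R sketch with simulated data (the numbers are invented and only loosely echo this example); the estimate, standard error, and t statistic from OLS match the equal-variance difference of means test, apart from the sign convention.
set.seed(7)
treatment <- rep(c(0, 1), each = 100)
y <- 10 + 3 * treatment + rnorm(200, sd = 4)
# Bivariate OLS: the coefficient on the dummy is the difference of means
summary(lm(y ~ treatment))
# Classical difference of means test assuming equal variances
# (t.test reports mean of group 0 minus mean of group 1, so the sign flips)
t.test(y ~ factor(treatment), var.equal = TRUE)
# Group means computed directly, for comparison
tapply(y, treatment, mean)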
Remember This
A difference of means test assesses whether the average value of the dependent variable
differs between two groups.
1. We often are interested in the difference of means between treatment and control
groups or between women and men or between other groupings.
2. Difference of means tests can be implemented in bivariate OLS by using a dummy
independent variable.
Y_i = β0 + β1 Treatment_i + ε_i
(a) The estimate of the mean for the control group is β̂0.
(b) The estimate of the mean for the treatment group is β̂0 + β̂1.
(c) The estimate for the difference in means between groups is β̂1.
[Figure 6.3: feeling thermometer toward Obama (0-100, vertical axis) by partisan identification (non-Democrats = 0, Democrats = 1); β̂0 marks the average for non-Democrats and β̂0 + β̂1 the average for Democrats]
[Figure 6.4: scatterplots of a dependent variable against a treatment dummy (control group = 0, treatment group = 1) for three hypothetical data sets, shown in panels (a), (b), and (c)]
Discussion Questions
1. Approximately what are the averages of Y for the treatment and control
groups in each panel of Figure 6.4? Approximately what is the estimated
difference of means in each panel?
2. Approximately what are the values of β̂0 and β̂1 in each panel of Figure 6.4?
In the height and wage data, each person is recorded as male or female; for now, we'll use a male dummy variable that is 1 if the person is male
and 0 if the person is female.4 Later, we’ll come back and re-do the analysis with a female
dummy variable.
Figure 6.5 displays a scatterplot of height and gender. The figure shows that men are, as
expected, taller on average than women as the man-blob is clearly higher than the woman-
blob.
That's not very precise, though, so we'll use an OLS model to provide a specific estimate of the difference in means:

Height_i = β0 + β1 Male_i + ε_i
4 Sometimes people will name a variable like this “gender.” That’s annoying! Readers will have to dig through the
paper to figure out whether 1 indicates males or females.
[Figure 6.5: scatterplot of height in inches (vertical axis) against gender (women = 0, men = 1)]
The estimated coefficient β̂0 tells us the average height for the group for which the dummy
variable is zero, which in this case is women. The estimated coefficients β̂0 + β̂1 tell us the
average height for the group for which the dummy variable is 1, which in this case is men.
The results are reported in Table 6.2. The average height of women is β̂0, which is 64.23
inches. The average height for men is β̂0 + β̂1, which is 64.23 + 5.79 = 70.02 inches. The
difference between the two groups is estimated as β̂1, which is 5.79 inches.
Table 6.2: Difference of Means Test for Height and Gender
Constant 64.23
(0.04)
[t = 1,633.6]
Male 5.79
(0.06)
[t = 103.4]
N 10,863
Standard errors in parentheses
This estimate is quite precise. The t statistic for male is 103.4, which allows us to reject
the null hypothesis. We can also use our confidence interval algorithm from page 181 to
produce a 95% confidence interval for β̂1 of 5.68 to 5.90 inches. In other words, we are 95%
confident that the difference of means of height between men and women is between 5.68 and 5.90 inches.
Figure 6.6 adds the information from Table 6.2 to the scatterplot. We can see that β̂0 is
estimating the middle of the women-blob, β̂0 + β̂1 is estimating the middle of the men-blob,
and the difference between the two is β̂1. We can interpret the estimated effect of going from
zero to one on the independent variable (which is equivalent to going from female to male)
as an increase of 5.79 inches in average height.
We noted earlier that it is reasonable to code the treatment as being female. If we replace
the male dummy variable with a female dummy variable the model becomes

Height_i = β0 + β1 Female_i + ε_i

Now the estimated coefficient β̂0 will tell us the average height for men (the group for
which Female = 0). The estimated coefficients β̂0 + β̂1 will tell us the average height for
women and the difference between the two groups is estimated as β̂1.
The results with the female dummy variable are in the right hand column of Table 6.3.
The numbers should look familiar, because we are learning the same information from the
data. It is just that the accounting is a bit different. What is the estimate of the average
height for men? It is β̂0 in the right-hand column, which is 70.02. Sound familiar? That
was the number we got from our initial results (reported again in the left-hand column of
Table 6.3); in that case we had to add β̂0 + β̂1 because when the dummy variable indicated
men, we needed both coefficients to get the average height for men. What is the difference
between males and females estimated in the right-hand column? It is -5.79, which is the
same as before, only negative. The underlying fact is that women are estimated to be 5.79
inches shorter on average. If we have coded our dummy variable as Female = 1, then going
[Figure 6.6: scatterplot of height in inches against gender (women = 0, men = 1), with β̂0 marking the average height for women and β̂0 + β̂1 the average height for men]
from zero to one on the independent variable is associated with a decline of 5.79 inches on
average. If we have coded our dummy variable as Male = 1, then going from zero to one on the independent variable is associated with an increase of 5.79 inches on average.
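The flip in coding is easy to verify in R with simulated heights (the numbers below are chosen to echo Table 6.2 but are made up):
set.seed(3)
male   <- rbinom(1000, 1, 0.5)
height <- 64.2 + 5.8 * male + rnorm(1000, sd = 2.8)
female <- 1 - male
coef(lm(height ~ male))     # intercept = average for women; slope is positive
coef(lm(height ~ female))   # intercept = average for men; slope same size, negative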
We can easily extend difference of means tests to multivariate OLS. Doing so is useful because
it allows us to control for other variables when assessing whether two groups are different.
For example, earlier in this chapter we assessed the home field advantage of Manchester
City. Suppose we want to estimate that advantage while controlling for the quality of the opponent. Using multivariate OLS we can
estimate

Goal differential_i = β0 + β1 Home_i + β2 Opponent quality_i + ε_i

where Opponent quality_i measures the opponent's overall goal differential in all other games.
The β̂1 estimate will tell us, controlling for opponent quality, whether the goal differential
was higher for Manchester City for home games. The results are in Table 6.4.
Table 6.4: Manchester City Example with Dummy and Continuous Independent Variables
Y_i = β0 + β1 Dummy_i + β2 X_i + ε_i        (6.3)
It is useful to think graphically about the fitted lines from this kind of model. Figure 6.7
shows the data for Manchester City’s results in 2012-13. The observations for home games
(for which the home dummy variable equals one) are black dots; the observations for away
games (for which the home dummy variable equals zero) are grey dots.
As discussed on page 281, the intercept for the Home_i = 0 observations (the away games)
will be β̂0 and the intercept for the Home_i = 1 observations (the home games) will be β̂0 + β̂1,
which equals the intercept for away games plus the bump (up or down) for home games.
[Figure: fitted lines for the Manchester City model; the line for home games (Home = 1) has intercept β̂0 + β̂1, the line for away games (Home = 0) has intercept β̂0, and both share the slope β̂2]
FIGURE 6.7: Fitted Values for Model with Dummy Variable and Control Variable: Manchester City Example
Note that the coefficient indicating the difference of means is the coefficient on the dummy
variable. (Note that which β we should look at depends on how we write the model. For this
model, β1 indicates the difference of means controlling for the other variable, but it would
be β2 if we wrote the dummy variable as the second independent variable, as in Equation 6.4 below.)
The innovation is that our difference of means test here also controls for another variable,
in this case, opponent quality. Here the effect of a one-unit increase in opponent quality is
β̂2; this effect is the same for the Home_i = 1 and Home_i = 0 groups. Hence the fitted lines
are two parallel lines, one for each group, separated by β̂1, the differential bump associated
with being in the Home_i = 1 group. In the figure, β̂1 is greater than zero, but it could be
less than zero (in which case the dark line for the Home_i = 1 group would be below the grey
line) or equal to zero (in which case the dark line and grey lines would overlap exactly).
We can add additional independent variables to our heart’s content, allowing us to assess
the difference of means between the Homei = 1 and Homei = 0 groups in a manner that
controls for the additional variables. Such models are incredibly common.
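Here is a minimal R sketch of such a model using simulated data patterned loosely on the Manchester City example (home, opp_quality, and the coefficient values are invented for illustration). The fitted values trace out two parallel lines separated by the coefficient on the dummy variable.
set.seed(11)
home        <- rep(c(0, 1), each = 19)     # 38 games, half away and half home
opp_quality <- rnorm(38)
goal_diff   <- 0.3 + 1.0 * home - 0.5 * opp_quality + rnorm(38)
fit <- lm(goal_diff ~ home + opp_quality)
summary(fit)
# Two parallel fitted lines: away games (Home = 0, dashed) and home games (Home = 1)
plot(opp_quality, goal_diff, pch = ifelse(home == 1, 19, 1))
abline(coef(fit)[1], coef(fit)["opp_quality"], lty = 2)
abline(coef(fit)[1] + coef(fit)["home"], coef(fit)["opp_quality"])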
Remember This
1. Including a dummy variable in a multivariate regression allows us to conduct a
difference of means test while controlling for other factors with a model such as
Y_i = β0 + β1 X_i + β2 Dummy_i + ε_i        (6.4)
2. The fitted values from this model will be two parallel lines, each with a slope of
β̂1 and separated by β̂2 for all values of X.
6.3 Using Dummy Variables to Control for Categorical Variables
Categorical variables (also known as nominal variables) are common in data analysis.
They have two or more categories, but the categories have no intrinsic ordering. Information
on religion is often contained in a categorical variable: 1 for Buddhist, 2 for Christian, 3 for
Hindu, and so forth. Race, industry, and many more attributes also appear as categorical
variables. Categorical variables differ from dummy variables in that categorical variables
have multiple categories. Categorical variables differ from ordinal variables in that ordinal
variables express rank but not necessarily relative size. An example of an ordinal variable is a survey response coded 1 for "strongly disagree" up through 5 for "strongly agree."
In this section, we show how to use dummy variables to analyze categorical variables.
We illustrate the technique with an example about wage differentials across regions in the
United States.
5 It is possible to treat ordinal independent variables in the same way as categorical variables in the manner we
describe here. Or it is common to simply include ordinal independent variables directly in a regression model and
interpret a one-unit increase as moving from one category to another.
We might suspect that wages in the United States are different in different regions. Are they
higher in the northeast? Or are they higher in the south? Suppose we have data on wages
and on region. It should be easy to figure this out, right? Well, yes, as long as we appreciate
how to handle categorical variables, whose numerical codes simply indicate membership in a
category. They are common in policy analysis. For example, suppose our region variable is
coded such that 1 indicates a person is from the northeast, 2 indicates a person is from the
midwest, 3 indicates a person is from the south, and 4 indicates a person is from the west.
How should we incorporate categorical variables into OLS models? Should we estimate

Wage_i = β0 + β1 Region_i + ε_i

where Wage_i is the wages of person i and Region_i is the region person i lived in, as defined
above.
No, no, and no. Though the categorical variable may be coded numerically, it has no
inherent order, which means the units are not meaningful. The midwest is not “1” more
than the northeast; the south is not “1” more than the midwest.
So what do we do with categorical variables? Dummy variables save the day. We simply
convert categorical variables into a series of dummy variables, a different one for each cate-
gory. If region is the nominal variable, we simply create a northeast dummy variable (1 for
people from the northeast, 0 otherwise), a midwest dummy variable (1 for midwesterners, 0 otherwise), and so on for each region.
The catch is that we cannot include dummy variables for every category because if we did,
we would have perfect multicollinearity (as we discussed on page 230). Hence we exclude
one of the dummy variables and treat that category as the reference category (also re-
ferred to as the excluded category), which means that coefficients on the included dummy
variables indicate the difference between the category indicated by the dummy variable and the reference category.
We’ve already been doing something like this with dichotomous dummy variables. When
we used the male dummy variable in our height and wages example on page 267, we did
not include a female dummy variable, meaning that females were the reference category and
the coefficient on the male dummy variable indicated how much taller men were. When we
used the female dummy variable, men were the reference category and the coefficient on the
female dummy variable indicated how much shorter females were on average.
To see how categorical variables work in practice, we will analyze women’s wage data in 1996
across the northeast, midwest, south, and west in the United States. We won’t, of course,
include a single region variable. Instead, we create dummy variables for each region and
include all but one of them in the OLS regression. For example, if we treat West as the excluded category, the model is

Wage_i = β0 + β1 Northeast_i + β2 Midwest_i + β3 South_i + ε_i
The results for this regression are in column (a) of Table 6.5. The β̂0 result tells us
that the average wage per hour for women in the west (the excluded category) was $12.50.
Women in the northeast are estimated to receive $2.02 more per hour than those in the west,
or $14.52 per hour. Women in the midwest earn $1.59 less than women in the west, which
works out to $10.91 per hour. And women in the south receive $2.13 less than women in the
west, which works out to $10.37 per hour.
Column (b) of Table 6.5 shows the results from the same data, but with south as the
excluded category instead of west. The β̂0 result tells us that the average wage per hour
for women in the south (the excluded category) was $10.37. Women in the northeast get
$4.15 per hour more than women in the south, or $14.52 per hour. Women in the midwest
receive $0.54 per hour more than women in the south (which works out to $10.91 per hour) and women
in the west get $2.13 per hour more than women in the south (which works out to $12.50
per hour). The key pattern is that the estimated amount that women in each region get
is the same in columns (a) and (b). Columns (c) and (d) have midwest and northeast as
the excluded categories respectively and, with calculations like those we just did, we can see
that the estimated average wages for each region are the same in all specifications.
Hence it is important to always remember that the coefficient estimates themselves are
only meaningful with reference to the excluded category. Even though the coefficients on
each dummy variable change across the specifications, the underlying estimates for wages in
each region do not. Think of the difference between Fahrenheit and Celsius – the temperature is the same even though the numbers used to describe it differ.
Thus we don’t need to stress about which category should be excluded because it simply
doesn't matter. Any difference in the coefficients is simply due to the reference category we are using. In the
first specification, we are comparing wages in the northeast, midwest, and south to the west;
in the second specification, we are comparing wages in the northeast, midwest, and west to
the south. The reason that the coefficient on midwest is negative in the first specification
and positive in the second is that women in the midwest earn less than women in the west
(the reference category in specification (a)) and earn more than women in the south (the
reference category in specification (b)). In both specifications (and the subsequent two), the estimated average wage for each region is the same.
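In R, the bookkeeping for categorical variables can be handled with factor(), which creates the dummy variables automatically, and relevel(), which changes the reference category. A minimal sketch with simulated wages (the regional differences below are made up to echo Table 6.5):
set.seed(5)
region <- sample(c("northeast", "midwest", "south", "west"), 2000, replace = TRUE)
wage   <- 12.5 + 2.0 * (region == "northeast") - 1.6 * (region == "midwest") -
          2.1 * (region == "south") + rnorm(2000, sd = 6)
# West as the excluded (reference) category
summary(lm(wage ~ relevel(factor(region), ref = "west")))
# South as the excluded category: the coefficients change, but the implied
# average wage for each region is identical
summary(lm(wage ~ relevel(factor(region), ref = "south")))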
Remember This
To use dummy variables to control for categorical variables, we include dummy vari-
ables for every category except one.
1. The excluded category is the reference point and all the coefficients on the in-
cluded dummy variables indicate how much higher or lower each group is than
the excluded category.
2. Coefficients differ depending on which excluded category is used, but when inter-
preted appropriately the results do not change across specifications.
Discussion Questions
1. Suppose we wanted to conduct a cross-national study of opinion in North
America and have a variable named “country” that is coded 1 for re-
spondents from the United States, 2 for respondents from Mexico, and
3 for respondents from Canada. Write a model and explain how to
interpret the coefficients.
2. For the results in Table 6.6, indicate what the coefficients are in boxes
(a) through (j).
Table 6.6: Hypothetical Results for Wages and Region Using Different Excluded Categories
happens in Congress.
The 2010 congressional election was a particularly interesting election. Republicans lost
seats in 2006 and 2008, but came storming back with a big win in 2010. No one doubted that
the newly elected Republicans were more conservative than the Democrats they replaced, but
many thought that the Tea Party movement also succeeded in its efforts to elect Republicans who were more conservative than the Republicans already serving in Congress.
Is this view correct? To answer this question we use data on all Republicans elected to
Congress in 2010 (including both those newly elected in 2010 and those who were reelected
in 2010). We can begin with a bivariate OLS model that estimates the difference in means:

Conservatism_i = β0 + β1 Newly elected 2010_i + ε_i        (6.6)

where Conservatism_i is a measure of how conservative representative i is based
on his or her voting record in Congress (ranging from 0.18 for the most moderate Republican
to +1 for the most conservative Republican) and Newly elected 2010_i is 1 if representative
i was newly elected in 2010 and 0 otherwise.6 A positive value for β̂1 would indicate newly
elected Republicans were more conservative than Republicans in Congress who had been elected before 2010.
The results from a bivariate model are in Table 6.7. Republicans elected before 2010 had
a 0.67 mean level of conservatism (β̂0) while Republicans elected in 2010 had a 0.70 mean
level of conservatism (β̂0 + β̂1). The difference (β̂1) is not statistically significant.
Table 6.7: Difference of Means of Conservatism of Republicans Elected in 2010
This is a bivariate OLS analysis of observational data. As such, we’re suspicious that
there are unmeasured factors lurking in the error term that could be correlated with the newly elected dummy variable.
To explore this situation further, let’s consider where newly elected Republicans in 2010
6 The citations in the appendix on page 800 show where to get data measuring the conservatism of every member
of the U.S. Congress ever.
came from. Some replaced GOP representatives who had retired, but most defeated incum-
bent Democrats. What kind of districts had incumbent Democrats? Districts that elected
Democrats in 2008. These districts were probably more liberal than the districts that had elected Republicans.
Hence there might be something in the error term (district ideology) that is correlated
with the variable of interest (the dummy variable for newly elected Republicans), which could
cause bias because the district ideology might push these members a bit to the left (coming
from more moderate districts than typical Republican incumbents) which might mask any
extra conservatism that the newly elected might manifest. In other words, the newly elected
Republicans might come from districts that are different from other Republican districts
and this fact, not that they were newly elected, could affect how conservative they are in
Congress. In statistical terms, we are concerned that the estimate of β̂1 from Equation 6.6 suffers from omitted variable bias.
Figure 6.8 corroborates these suspicions. We measure district liberalism with the percent
of the presidential vote in district i that Barack Obama received in the 2008 presidential
election because the percentage Obama received was higher in more liberal districts. Panel
(a) shows the district Obama vote on the X-axis and conservatism of Republican representa-
tives on the Y-axis. A bivariate OLS line is included in the figure, showing that conservatism
is indeed related to Obama vote share: The more votes Obama got, the less conservative the district's Republican representative.
Panel (b) of Figure 6.8 shows whether or not a member was newly elected on the X-axis
and Obama vote on the Y-axis. The data have been jittered for ease of viewing and a
bivariate fitted line (that assesses difference of means) is included. It appears that newly
elected members did indeed come from more liberal districts: The average Obama vote for
Republicans elected before 2010 was 0.42 and the average Obama vote for Republicans newly elected in 2010 was higher.
Hence district liberalism appears to satisfy the two conditions for omitted variable bias:
It looks like district liberalism affected the dependent variable (as seen in panel (a) of Figure
6.8) and also was correlated with the independent variable of interest (as seen in panel (b)
of Figure 6.8).
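To see the logic of this omitted variable bias in miniature, here is a small R simulation. All names and numbers are invented; it mimics the mechanism described above, not the actual congressional data.
set.seed(13)
n <- 241
obama_vote    <- runif(n, 0.2, 0.6)
# Newly elected Republicans are more likely in relatively Obama-friendly districts
newly_elected <- rbinom(n, 1, plogis(4 * (obama_vote - 0.45)))
conservatism  <- 0.9 - 0.9 * obama_vote + 0.06 * newly_elected + rnorm(n, sd = 0.1)
coef(lm(conservatism ~ newly_elected))                # omits district ideology: biased downward
coef(lm(conservatism ~ newly_elected + obama_vote))   # controls for it: close to the true 0.06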
Multivariate OLS allows us to account for district liberalism. We’ll say, OK, the newly
elected Republicans represented districts that were a bit more liberal, so let’s factor in how
much less conservative members of Congress are when their districts are relatively liberal
and see, from that baseline, whether the newly elected members of Congress in 2010 were more conservative. The model is

Conservatism_i = β0 + β1 Newly elected 2010_i + β2 District Obama percent 2008_i + ε_i

where District Obama percent 2008_i is the percent of the presidential vote in district i that Barack Obama received in 2008.
Table 6.8 shows the results. The bivariate column is the same as in Table 6.7. In the
[Figure: panel (a) plots the ideological conservatism of Republican representatives against the district Obama vote in 2008, with a fitted line; panel (b) plots the district Obama vote in 2008 for members first elected before 2010 versus those newly elected in 2010]
FIGURE 6.8: Relation Between Omitted Variable (Obama Vote) and Other Variables
multivariate (a) column we add the District Obama percent 2008 variable and see that the
coefficient on Newly elected 2010 has doubled in size. In terms of difference of means testing,
we can say that controlling for district ideology, the mean conservatism of Republicans who
were newly elected is 0.059 higher on the ideological scale we are using. The standard error
is 0.022 and the t statistic is 2.68, suggesting that we can reject the null hypothesis that the
effect is zero.
The magnitude isn’t humongous; being 0.059 more conservative on an ideological scale
that ranges from 0.18 to 1.0 for Republicans isn't that big of a deal. But the estimated
effect is much larger than in the bivariate specification, consistent with our suspicion that the omission of Obama vote in 2008
was causing omitted variable bias. Without controlling for Obama vote, the fact that more
newly elected Republicans came from relatively Obama-friendly districts masked some of the extra conservatism of the newly elected Republicans.
In the multivariate (b) column we also control for district median income and dummy
variables indicating whether the district was in the south, midwest, or west (northeast is the
excluded category). These variables all seem to matter as their t stats all imply statistically
significant effects. The estimated effect of being newly elected in 2010 in the multivariate (b)
specification is 0.087, which is bigger than the estimate in the multivariate (a) column, imply-
ing that the multivariate (a) specification suffered from omitted variable bias. The estimate
implies that newly elected Republicans were 0.087 units more conservative, controlling for district ideology, district income, and region.
                      Bivariate     Multivariate
                                    (a)          (b)          (c)
Newly elected 2010     0.027        0.059*       0.087*       0.087*
                      (0.023)      (0.022)      (0.022)      (0.022)
                      [t = 1.16]   [t = 2.68]   [t = 3.88]   [t = 3.88]
District Obama 2008                -0.879*      -0.936*      -0.936*
                                   (0.138)      (0.160)      (0.160)
                                   [t = 6.39]   [t = 5.85]   [t = 5.85]
Income                                           0.002*       0.002*
                                                (0.001)      (0.001)
                                                [t = 2.42]   [t = 2.42]
South                                            0.151*
                                                (0.035)
                                                [t = 4.27]
Midwest                                          0.174*       0.023
                                                (0.035)      (0.027)
                                                [t = 5.01]   [t = 0.87]
West                                             0.180*       0.030
                                                (0.037)      (0.029)
                                                [t = 4.89]   [t = 1.00]
Northeast                                                    -0.151*
                                                             (0.035)
                                                             [t = 4.27]
Constant               0.670*       1.043*       0.797*       0.948*
                      (0.014)      (0.060)      (0.084)      (0.065)
                      [t = 47.91]  [t = 17.43]  [t = 9.50]   [t = 14.61]
N                      241          241          241          241
R2                     0.006        0.151        0.254        0.254
Standard errors in parentheses. The dependent variable is the conservatism for Republican members of Congress.
Multivariate column (c) shows what happens when we use the south as our excluded
category instead of the northeast. The coefficients on the newly elected dummy variable,
the district Obama 2008 percent, and the income variables are unchanged. Remember:
Changing the excluded category only affects how we interpret the coefficients on the dummy
variables associated with the categorical variable in question; doing so does not affect the
other coefficients. In multivariate column (c) we see that the coefficient on midwest is 0.023
and not statistically significant. Wait a minute! Wasn’t it significant in multivariate column
(b)? Yes, but in that column the coefficient on Midwest was comparing the conservatism of
Midwestern members of Congress to that of members from the northeast (the excluded category) and, yes, relative to Northeasterners the Midwesterners are more conservative. In column (c) the
comparison is to Southerners and, no, Midwesterners are not significantly more conservative
than Southerners. In fact, we can see in column (b) that Midwesterners were 0.023 more
conservative than Southerners (by comparing the south and midwest coefficients) and this is
exactly the value we get for the midwest coefficient in column (c) when south is the excluded
reference point. We can go through such a thought process for each of the coefficients and see
that the bottom line is that as long as we know how to use dummy variables for categorical
variables, the substantive results are exactly the same in multivariate columns (b) and (c).
Figure 6.9 shows the 95% confidence intervals for the bivariate and multivariate models.
The 95% confidence interval based on the bivariate OLS model ranges from -0.018 to 0.072.7
This confidence interval covers zero, which is another way of saying that the coefficient is not
statistically significant. When we add variables in multivariate specifications (a) and (b),
7 Following the confidence interval guide on page 181, we calculate the confidence interval as β̂1 ± 1.96 × se(β̂1) = 0.027 ± 1.96 × 0.023, which is a range from -0.018 to 0.072.
FIGURE 6.9: Confidence Intervals for Newly Elected Variable in Table 6.8
the 95% confidence interval shifts because controlling for district liberalism, income, and
regional differences leads to larger estimates with confidence intervals not covering zero.8
We don't need to plot the results for multivariate column (c) because the results for the newly elected variable are identical to those in multivariate column (b).
Dummy variables can do even more work for us. We may face a situation in which being in
the Dummyi = 1 group does not simply give each individual a bump up or down. It could
be that group membership could interact with another independent variable, changing the
way the independent variable affects Y . For example, it could be that discrimination does
not simply mean that all men get paid more by the same amount. It could be that work
experience for men is more highly rewarded than work experience for women. We address
this possibility with models in which a dummy independent variable interacts with (meaning is multiplied by) another independent variable. The following OLS model allows the effect of X to differ across groups:

Yi = β0 + β1Xi + β2Dummyi + β3Dummyi × Xi + εi

The third variable is produced by multiplying the Dummyi variable times the Xi variable. In a spreadsheet, we would simply create a new column that is the product of the Dummy variable and the X variable.
When Dummyi = 0, the fitted value is

Ŷi = β̂0 + β̂1Xi

In other words, the estimated intercept for the Dummyi = 0 group is β̂0 and the estimated slope is β̂1. When Dummyi = 1, the fitted value is

Ŷi = (β̂0 + β̂2) + (β̂1 + β̂3)Xi

In other words, the estimated intercept for the Dummyi = 1 group is β̂0 + β̂2 and the estimated slope is β̂1 + β̂3.

An example of what the fitted lines will look like is in Figure 6.10. As before, the intercept for the Dummyi = 0 group will be β̂0 and the intercept for the Dummyi = 1 group will be β̂0 + β̂2, which is the intercept for everybody plus the bump (up or down) for being in the Dummyi = 1 group. What's new is that this model allows for the slope to differ by group such that the fitted lines are no longer parallel. The slope of the line for the Dummyi = 0 group will be β̂1. The slope of the line for the Dummyi = 1 group will be β̂1 + β̂3.
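As a concrete illustration, here is a minimal R sketch of estimating such a dummy interaction model on simulated data; the variable names (wage, experience, male) are hypothetical stand-ins for the salary example, not the book's data set:

   set.seed(42)
   n <- 200
   experience <- runif(n, 0, 10)
   male <- rbinom(n, 1, 0.5)
   # Simulate a true model with an intercept bump (beta2) and a slope bump (beta3) for men
   wage <- 30 + 1.5 * experience + 3 * male + 1 * male * experience + rnorm(n, sd = 2)

   # Create the interaction by multiplying the dummy times the continuous variable
   male_experience <- male * experience
   fit <- lm(wage ~ experience + male + male_experience)
   summary(fit)

   # Estimated slope for the Dummy = 1 (male) group is beta1-hat + beta3-hat
   coef(fit)["experience"] + coef(fit)["male_experience"]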
[FIGURE 6.10: Fitted salary lines (in $1,000s) against years of experience (0 to 10). One line has intercept β̂0 and slope β̂1 (labeled as the slope for women); the other has intercept β̂0 + β̂2 and the steeper slope β̂1 + β̂3 (labeled as the slope for men).]
The coefficient β̂3 is the differential slope for the Dummyi = 1 group, meaning that it tells us how different the effect of X is for the Dummyi = 1 group compared to the Dummyi = 0 group. β̂3 is positive in Figure 6.10, meaning that the slope of the fitted line for the Dummyi = 1 group is steeper than the slope of the line for the Dummyi = 0 group. If β̂3 were zero, the slope of the fitted line for the Dummyi = 1 group would be no steeper than the slope of the line for the Dummyi = 0 group, meaning that the slopes would be the same for both groups. If β̂3 were negative, the slope of the fitted line for the Dummyi = 1 group would be less steep than the slope of the line for the Dummyi = 0 group.

Interpreting interaction variables can be a bit tricky because β̂3 can be negative while the effect of X on Y for the Dummyi = 1 group is still positive. For example, if β̂1 = 10 and β̂3 = −3, the slope for the Dummyi = 1 group would be positive because the slope is the sum of the coefficients and therefore equals 7. The negative β̂3 indicates that the slope for the Dummyi = 1 group is less than the slope for the other group; it does not tell us whether the effect of X is positive or negative. We have to look at the sum of β̂1 and β̂3 to know that. Table 6.9 summarizes how to interpret coefficients when dummy-interaction variables are included.

The standard error of β̂3 is useful for calculating confidence intervals for the difference in
slope coefficients across the two groups. Standard errors for some quantities of interest are tricky, though. To generate confidence intervals for the effect of X on Y we need to be alert. For the Dummyi = 0 group, the effect is simply β̂1 and we can use the standard error of β̂1. For the Dummyi = 1 group, the effect is β̂1 + β̂3; the standard error of this effect is more complicated because we have to take into account the standard errors of both β̂1 and β̂3 in addition to any correlation between β̂1 and β̂3 (which is associated with the correlation of X1 and X3). The appendix provides more details on how to do this on page 801.
Remember This
Interaction variables allow us to estimate effects that depend on more than one vari-
able.
1. A dummy interaction is created by multiplying a dummy variable times another
variable.
2. Including a dummy interaction in a multivariate regression allows us to conduct a difference of means test while controlling for other factors with a model such as

   Yi = β0 + β1Xi + β2Dummyi + β3Dummyi × Xi + εi
[FIGURE 6.11: Six panels, (a) through (f), each plotting Y against X (both ranging from 0 to 10) with separate fitted lines for the Dummyi = 1 group and the Dummyi = 0 group.]
Discussion Questions
1. For each panel in Figure 6.11, indicate whether each of β0, β1, β2, and β3 is less than, equal to, or greater than zero for the following model:

   Yi = β0 + β1Xi + β2Dummyi + β3Dummyi × Xi + εi
not to love? The savings, however, may be overpromised. In this case study, we analyze the energy used to heat a house
before and after the homeowner installed a programmable thermostat. The attraction of a
programmable thermostat is that it allows the user to pre-set temperatures at energy efficient
levels, especially for the middle of the night when the house doesn't need to be as warm (or as cool).
Figure 6.12 shows a scatterplot of monthly observations of the gas used in the house
(measured in therms) and heating degree days, a measure of how cold it was in the month.10
We've marked the months without a programmable thermostat as squares and the months with a programmable thermostat with a different symbol.
Visually, we immediately see that heating goes up as it gets colder. Not a huge surprise.
10 For each day, the heating degree day is measured as the number of degrees that the day's average temperature is below
65 degrees Fahrenheit, the temperature below which buildings may need to be heated. The monthly measure adds
up the daily measures and provides a rough measure of how much need for heating there was in the month. If the
temperature is above 65 degrees, the heating degree day measure will be zero.
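To make the construction of the measure concrete, here is a small R sketch; the daily temperature values are made up for the illustration:

   # Hypothetical daily average temperatures (degrees Fahrenheit) for one month
   avg_temp <- c(28, 35, 41, 50, 63, 68, 70, 55, 47, 33)

   # Daily heating degree days: degrees below 65, with a floor of zero
   daily_hdd <- pmax(65 - avg_temp, 0)

   # Monthly heating degree days: the sum of the daily values
   sum(daily_hdd)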
FIGURE 6.12: Heating Used and Heating Degree Days for Homeowner Who Installed a Programmable
Thermostat
We also can see the possibility that the programmable thermostat lowered gas usage because
the observations with the programmable thermostat seem lower. However, it is not clear how big the difference is.
We need a model to get a more precise answer. What model is best? Let's start with a simple difference of means model in which monthly gas usage (in therms) is regressed on a dummy variable indicating whether the programmable thermostat was installed.
The results for this model are in column (a) of Table 6.10 and indicate that the homeowner
used 13.02 fewer therms of energy in months when he had the programmable thermostat than
without it. Therms cost about $1.59 at this time, so the homeowner saved roughly $20.70
per month on average. That’s not bad. However, the effect is not statistically significant
(not even close, really, as the t statistic is only 0.54), so based on this result we should be reluctant to conclude that the thermostat reduced gas usage.
The difference of means model does not control for anything else and we know that the
coefficient on the programmable thermostat variable will be biased if there is some other
variable that matters and is correlated with the programmable thermostat variable. In this
case, we know unambiguously that heating degree days matters and it is plausible that the
heating degree days differed in the months with and without the programmable thermostat.
The results for this model are in column (b) of Table 6.10. The heating degree day variable is statistically significant, and controlling for it changes the coefficient on the programmable thermostat variable, which is now 20.045. The standard
error on the programmable thermostat variable also goes down a ton due to the much smaller σ̂ that comes from the much better fit we get by including the heating degree day variable. The effect
of the programmable thermostat variable is statistically significant and given a cost of $1.59
per therm, the savings is about $31.87 per month. Because a programmable thermostat
costs about $60 plus installation, the programmable thermostat should pay for itself pretty
quickly.
However, something about these results should nag at us. This is about gas usage only
which in this house goes overwhelmingly to heating (with the rest going to heat water and
for the stove). Does it make sense that the programmable thermostat should save $30 in the
middle of the summer? The furnace is never on and, well, that's a lot of scrambled eggs on the stove.
If we think about it, the effect of the thermostat must be interactive. That is, the
thermostat can save more money in cold months, when turning the thermostat down at night saves a lot of gas, than in warm months, when the furnace hardly runs. We therefore estimate a model that includes the programmable thermostat dummy, heating degree days, and their interaction.
The results for this model are in column (c) of Table 6.10. The coefficient on pro-
grammable thermostat indicates the difference in therms when the other variables are zero.
Because both variables have heating degree days in them, the coefficient on programmable
thermostat indicates the effect of the thermostat when heating degree days are zero (meaning
the weather is warm for the whole month). The coefficient of -0.479 with a t statistic of 0.11
indicates no effect at all. This might seem to be bad news, but it is actually good news for us given that we have figured out that the programmable thermostat can't reduce heating costs when there is no heating going on in the first place.
The main effect of the thermostat is captured by the coefficient on the interactive term,
Programmable thermostat × HDD. This coefficient is -0.062 and is highly statistically significant with a t statistic of 7.00. For every one unit increase in HDD, the programmable thermostat
lowered the therms used by 0.062 therms. In a month with the heating degree day variable
equal to 500, the homeowner is estimated to reduce therms used by 500 × 0.062 = 31 after the programmable thermostat was installed (which lowers the bill by $49.29 at $1.59 per therm). In a month with the heating degree day variable equal to 1000, the homeowner is estimated to reduce therms used by 1000 × 0.062 = 62 (which lowers the bill by $98.58 at $1.59 per
therm). Suddenly we’re talking real money. And we’re doing so from a model that makes
intuitive sense because the savings should indeed differ depending on how cold it is.11
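The arithmetic behind these savings estimates can be reproduced with a few lines of R; the helper function below is just a convenience for this example, not anything from the book:

   # Therms and dollars saved at a given monthly HDD, using the estimated
   # interaction coefficient (-0.062) and a price of $1.59 per therm
   thermostat_savings <- function(hdd, beta_interaction = -0.062, price = 1.59) {
     therms_saved <- -beta_interaction * hdd
     c(therms = therms_saved, dollars = therms_saved * price)
   }

   thermostat_savings(500)   # about 31 therms, roughly $49.29
   thermostat_savings(1000)  # about 62 therms, roughly $98.58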
This case provides an excellent example of how useful – and distinctive – the dummy
variable models we’ve presented in this chapter can be. In panel (a) of Figure 6.13 we show
the fitted values based on model (b) in Table 6.10, which controls for heating degree days
but models the effect of the thermostat as a constant difference across all values of heating
degree day. The effect is statistically significant and rather substantial, but it doesn’t ring
true because it suggests the savings from reduced use of gas for the furnace are the same
in a warm summer month as in a frigid winter month. Panel (b) of Figure 6.13 shows the
fitted values based on model (c) in Table 6.10, which allows the effect of the thermostat to
vary depending on the heating degree days. This is an interactive model that yields fitted
lines with different slopes. Just by looking at the lines, we can see the fitted lines for model
11 We might be worried about correlated errors given that this is time series data. As discussed on page 104,
the coefficient estimates are not biased if the errors are correlated, but standard OLS standard errors might not be
appropriate. In Chapter 13 we show how to estimate models with correlated errors. Spoiler alert: The results get a
bit stronger.
[Panels (a) and (b) plot therms used (0 to 300) against heating degree days, marking months with and without the programmable thermostat and showing the fitted lines from models (b) and (c) of Table 6.10.]
FIGURE 6.13: Heating Used and Heating Degree Days with Fitted Values for Different Models
(c) fit the data better. The effects are statistically significant and substantial and, perhaps
most importantly, make more sense because the effect of the programmable thermostat on gas usage depends on how cold the month is.
6.5 Conclusion
Dummy variables are incredibly useful. Despite a less-than-flattering name, they do some of
the most important work in all of statistics. Experiments almost always are analyzed with
treatment group dummy variables. A huge proportion of observational studies care about or
control for dummy variables such as gender or race. And, when interacted with continuous
variables, dummy variables allow us to investigate whether the effects of certain variables
differ by group.
We have mastered the core points of this chapter when we can do the following.
• Section 6.1: Write down a model for a difference of means test using bivariate OLS. Which parameter measures the estimated difference? Sketch a diagram that illustrates the model.
• Section 6.2: Write down a model for a difference of means test using multivariate OLS. Which parameter measures the estimated difference? Sketch a diagram that illustrates the model.
• Section 6.3: Explain how to incorporate categorical variables in OLS models. What
is the excluded category? Explain why coefficient estimates change when the excluded
category changes.
• Section 6.4: Write down a model that has a dummy variable (D) interaction with a continuous variable (X).
Further Reading
Brambor, Clark, and Golder (2006) and Kam and Franzese (2007) both provide excellent
discussions of interactions, including the appropriate interpretation of models with two con-
tinuous variables interacted. Braumoeller (2004) does a good job injecting caution into the
interpretation of coefficients on lower order terms when interaction variables are included in
the model.
Key Terms
• Categorical variable (277)
• Control group (257)
• Dichotomous variable (257)
• Difference of means test (257)
• Dummy variable (257)
• Excluded category (278)
• Jitter (264)
Computing Corner
Stata
1. A difference of means test in OLS is simply reg Y Dum. This command will produce an
identical estimate, standard error, and t statistic as ttest Y, by(Dum). To allow the
variance to differ across the two groups, the OLS model is reg Y Dum, robust and the
stand-alone t test is ttest Y, by(Dum) unequal.
2. To create an interaction variable named “DumInteract”, simply type gen DumInteract
= Dum * X where Dum is the name of the dummy variable and X is the name of the
continuous variable.
3. Page 801 discusses how to generate a standard error in Stata for the effect of X on Y
for the Dummyi = 1 group.
R
1. A difference of means test in OLS is simply lm(Y ~ Dum). This command will produce an identical estimate, standard error, and t statistic as t.test(Y[Dum==1], Y[Dum==0], var.equal = TRUE). To allow the variance to differ across the two groups, the stand-alone t test is t.test(Y[Dum==1], Y[Dum==0], var.equal = FALSE). The OLS version of this model takes a bit more work, as it involves estimating the heteroscedasticity-consistent standard error model described on page 130. It is
   library(lmtest)      # provides coeftest()
   library(sandwich)    # provides vcovHC()
   OLSResults = lm(Y ~ Dum)
   coeftest(OLSResults, vcov = vcovHC(OLSResults, type = "HC1"))
2. To create an interaction variable named “DumInteract”, simply type DumInteract =
Dum * X where Dum is the name of the dummy variable and X is the name of the
continuous variable.
3. Page 801 discusses how to generate a standard error in R for the effect of X on Y for
the Dummyi = 1 group.
Exercises
1. Use data from heightwage.dta that we used in Chapter 5 on page 248.
a. Estimate an OLS regression model with adult wages as the dependent variable and
adult height, adolescent height, and a dummy variable for males as the independent
variables. Does controlling for gender affect the results?
b. Generate a female dummy variable. Estimate a model with both a male dummy
variable and a female dummy variable. What happens? Why?
c. Re-estimate the model from part (a) separately for males and females. Do these
results differ from the model in which male was included as a dummy variable? Why
or why not?
d. Estimate a model in which adult wages is the dependent variable and in which
there are controls for adult and adolescent height in addition to dummy variable
interactions of male times each of the two height variables. Compare the results to
the results from part (c).
e. Estimate a model in which adult wages is the dependent variable and in which
there are controls for male, adult height, adolescent height and two dummy variable
interactions of male times each of the two height variables. Compare the results to
the results from part (c).
f. Every observation is categorized into one of four regions based on where they lived
in 1996. The four regions are northeast (norest96), midwest (norcen96), south
(south96), and west (west96). Add dummy variables for regions to a model ex-
plaining wages in 1996 as a function of height in 1981, male, and male times height
in 1981. First exclude west, then exclude south and explain the changes to the
coefficients on the height variables and the regional dummy variables.
2. These questions are based on a paper “The Fed may be politically independent, but it
is not politically indifferent” by William Clark and Vincent Arel-Bundock (2013). The
paper explores the relationship between elections and the federal funds rate (FFR).
The FFR is the average interest rate at which federal funds trade in a day and is
often a benchmark for financial markets. Table 6.11 describes the variables from
fed 2012 kkedits.dta that we use in this problem.
a. Create two scatterplots, one when a Democrat is in office and one when a Republican
is in office, showing the relationship between the federal funds rate and the distance
to election. Comment on the differences in the relationships. The variable election
is coded 0 to 15, representing each quarter from one election to the next. For each
presidential term, the value of election is zero in the first quarter containing the
election and 15 in the quarter before the next election.
Variable Description
FEDFUNDS Effective federal funds rate (in percent)
lag FEDFUNDS Lagged effective federal funds rate (in percent)
Democrat Democrat = 1, Republican = 0
Election Quarter since previous election (0 to 15)
Inflation Annualized inflation rate (1% inflation = 1.00)
DATE Date
b. Create an interaction variable between election and democrat to test whether or not
closeness to elections has the same effect on Democrats and Republicans. Run a
model with the federal funds rate as the dependent variable, allowing the effect of
the election variable to vary by party of the president.
i. What change in federal fund rates is associated with a one unit increase in the
election variable when the president is a Republican?
ii. What change in federal fund rates is associated with a one unit increase in the
election variable when the president is a Democrat?
c. Is the effect of election statistically significant under Republicans? (Easy.) Is the
effect of election statistically significant under Democrats? (Not so easy.) How can
the answer be determined? Run any additional tests if necessary.
d. Graph two fitted lines for relationship between elections and interest rates, one for
Republicans and one for Democrats. (Use the twoway and lfit commands with ap-
propriate if statements; label by hand.) Briefly describe the relationship.
e. Re-run the model controlling for both the interest rate in the previous quarter
(lag FEDFUND) and inflation and discuss results, focusing on (i) effect of election
for Republicans, (ii) the differential effect of election for Democrats, (iii) impact of
lagged federal funds rate, and (iv) inflation. (Simply report the statistical signifi-
cance of the coefficient estimates; don’t go through the entire analysis from part (c)
above.)
3. This problem uses the cell phone and traffic dataset described in Chapter 5 on page 250
to analyze the relationship between cell and texting bans and traffic fatalities. We add
two variables: cell ban is coded 1 if it is illegal to operate a hand-held cell phone while
driving and 0 otherwise; text ban is coded 1 if it is illegal to text while driving and
0 otherwise.
a. Add the dummy variables for cell phone bans and texting bans to the model from
Chapter 5 on page 250. Interpret the coefficients on these dummy variables.
[Figure 6.14: marginal effect of the ban (vertical axis labeled "Marginal effect of text ban", roughly −1500 to 500) plotted against total miles, with confidence intervals shown as dashed lines.]
b. Based on the results from part (a), how many lives would be saved if California
had a cell phone ban? How many lives would be saved if Wyoming had a cell phone
ban? Discuss implications for the proper specification of the model.
c. Estimate a model in which total miles is interacted with both the cell phone ban and
texting ban variables. What is the estimated effect of a cell phone ban for California?
For Wyoming? What is the effect of a texting ban for California? For Wyoming?
What is the effect of total miles?
d. This question uses material from page 801. Figure 6.14 displays the effect of the cell
phone ban as a function of total miles. The confidence intervals are depicted with
the dashed lines. Identify the points on the fitted lines for the estimated effects for
California and Wyoming from the results in part (c). Explain the conditions under
which the cell phone ban has a statistically significant effect.12
4. In this problem we continue analyzing the speeding ticket data first introduced in Chap-
ter 5 on page 251. The variables we use are in Table 6.12.
a. Implement a simple difference of means test using OLS to assess whether the fines
for men and women are different. Do we have any reason to expect endogeneity?
Explain.
b. Implement a difference of means for men and women that controls for age and miles
per hour. Do we have any reason to expect endogeneity? Explain.
12 Brambor, Clark, and Golder (2006) provide Stata code to create plots like this for models with interaction variables.
c. Building from the above model, also assess whether there are differences in fines for
African Americans and Hispanics. Explain what the coefficients on these variables
mean.
d. Look at standard errors on coefficients for the Female, Black, and Hispanic variables.
Why are they different?
e. Within a single OLS model, assess whether miles over the speed limit has a differential
effect on the fines for women, African Americans, and Hispanics.
5. There is a consensus among economists that increasing government spending and cut-
ting taxes boost economic growth during recessions. Do regular citizens share in this
consensus? We care because political leaders often feel pressure to do what voters want
whether or not it would be effective.
To get at this issue, a 2012 YouGov survey asked people questions about what would
happen to unemployment if the government raised taxes or increased government spend-
ing. Answers were coded into three categories based on how consistent they were with
the economic consensus. On the tax question, people who said raising taxes would
raise unemployment were coded as “3” (the correct answer), people who said raising
taxes would have no effect on unemployment were coded as “2” and people who said
raising taxes would lower unemployment were coded as “1”. On the spending question,
people who said raising government spending would lower unemployment were coded
as “3” (the correct answer), people who said raising spending would have no effect on
unemployment were coded as “2,” and people who said raising spending would raise
unemployment were coded as “1.”
a. Estimate two bivariate OLS models in which political knowledge predicts the answers.
In one model, use the tax dependent variable; in the other model, use the spending dependent variable.
13 We could separate non-Republicans into Democrats and Independents using tools for categorical variables dis-
cussed in Section 6.3. Our conclusions would be generally similar in this particular example.
CHAPTER 7
Transforming Variables, Comparing Variables
Figure 7.1 shows the average life satisfaction score for respondents in each two-year age group. The scores range from 1 ("dissatisfied") to 10 ("satisfied").1
There is a pretty clear pattern: people start off reasonably satisfied at age 18 and then reality
hits, making them less satisfied until their mid-40s. Happily, things brighten from that point
1 We have used multivariate OLS to net out the effect of income, religiosity, and children from the life satisfaction
scores.
[FIGURE 7.1: Life satisfaction (vertical axis, roughly 4.5 to 8.0) plotted by age (horizontal axis, 20 to 70).]
onward, such that old folks are generally the happiest bunch. (Who knew?) This pattern is
not an anomaly: Other surveys at other times and in other countries reveal similar patterns.
The relationship is U-shaped (or smile shaped, if you will).2 Given what we’ve done so far,
it may not be obvious how to make OLS estimate such a model. However, OLS is actually
quite flexible and the goal of this chapter is to show off some of the cool tricks OLS can do,
including estimating non-linear relationships like the one we see in the life satisfaction data.
The unifying theme is that each of these tricks involves a transformation of the variables in the model.
2To my knowledge there is no study of chocolate and happiness, but I’m pretty sure it would be an upside down U;
people might get happier the more they eat for a while, but at some point, more chocolate has to lead to unhappiness,
like the kid in Willy Wonka.
The particular tasks we tackle in this chapter are estimating non-linear models and comparing coefficients. Section 7.1 shows how to estimate non-linear effects with polynomial models. Section 7.2 shows how to produce a different kind of non-linear model using logged variables, which are particularly useful to characterize effects in percentage terms. Section 7.3 shows how to make OLS coefficients more comparable by standardizing variables. Section 7.4 shows how to formally test whether coefficients differ from each other. The technique applies broadly to hypotheses involving more than one coefficient.
The world doesn’t always move in straight lines and, happily, neither do OLS estimates. In
this section we explain the difference between linear and non-linear models in the regression
context and then introduce quadratic and polynomial models as flexible tools to deal with
non-linear models.
The standard OLS model is remarkably flexible. It can, for example, estimate non-linear
effects. This idea might seem a little weird at first. Didn’t we say that OLS is also known
as linear regression (page 67)? How can we estimate non-linear effects with a linear
regression model? The reason is a bit pedantic, but here goes: when we refer to a “linear”
model we mean linear in parameters, which means that the βs aren't squared, cubed, or otherwise transformed. Hence, neither of the following models can be estimated with OLS:

Yi = β0 + β1²X1i + εi

Yi = β0 + β1X1i + √β2 X1i + εi

The X's, though, are fair game and hence we can square, cube, log, or otherwise transform the X's to produce fitted curves instead of fitted lines. Therefore both of the following models are okay in OLS because each β simply multiplies itself times some independent variable that may itself be transformed:

Yi = β0 + β1X1i + β2X1i² + εi

Yi = β0 + β1X1i + β2√(X1i⁷) + εi
Non-linear relationships are common in the real world. Figure 7.2 shows data on life
expectancy and GDP per capita for countries across the world. We immediately sense that
there is a positive relationship: The wealthier countries definitely have higher life expectancy.
But we also see that the relationship is a curve, rather than a line, because life expectancy
rises rapidly at the lower levels of GDP per capita, but then flattens out. Based on this data,
it's pretty reasonable to expect that an increase of $1,000 in per capita GDP per year could have
3 The world doesn’t end if we really want to estimate a model that is non-linear in the —s. We just need something
other than OLS to estimate the model. In Chapter 12 we discuss probit models, which are non-linear in the —s.
FIGURE 7.2: Life Expectancy and Per Capita GDP in 2011 for All Countries in the World
a fairly substantial effect on life expectancy in a country with low GDP per capita, while an
increase of $1,000 in per capita GDP for a very wealthy country could have only a negligible
effect on life expectancy. Therefore we want to get beyond estimating only straight lines.
Figure 7.3 shows the life expectancy data with two different kinds of fitted lines. Panel (a) shows a standard linear fitted line.
The fit isn't great. The fitted line is lower than the data for many of the observations
with low GDP values. For observations with high GDP levels, the fitted line dramatically
overestimates life expectancy. As bad as it is, this is the best possible straight line in terms
[Panels (a) and (b) plot life expectancy (in years, 50 to 90) against GDP per capita (in thousands of dollars, 0 to 100).]
FIGURE 7.3: Linear and Quadratic Fitted Lines for Life Expectancy Data
Polynomial models
We can generate a better fit using a polynomial model. Polynomial models include not only an independent variable, but also the independent variable raised to some power. By using a polynomial model, we can produce fitted value lines that curve.
The simplest example of a polynomial model is a quadratic model that includes X and X²:

Yi = β0 + β1X1i + β2X1i² + εi     (7.2)
Panel (b) of Figure 7.3 plots this fitted curve. It better captures the non-linearity in the
data as life expectancy rises rapidly at low levels of GDP and then levels off. The fitted
curve is not perfect. The predicted life expectancy is still a bit low for low values of GDP
and the turn to negative effects seems more dramatic than warranted by the data. We'll see other ways to fit this kind of pattern, including logged variables, in Section 7.2.
Note that the effect of X changes depending on the value of X. In panel (b) of Figure 7.3,
the effect of GDP on life expectancy is large for low values of GDP. That is, when GDP
goes from 0 to $20,000, the fitted value for life expectancy increases relatively rapidly. The
effect of GDP on life expectancy is smaller as GDP gets higher, as the change in fitted life
expectancy when GDP goes from $40,000 to $60,000 is much smaller than the change in
fitted life expectancy when GDP goes from 0 to $20,000. The predicted effect of GDP even turns negative at the highest levels of GDP per capita.
We need some calculus to get the specific equation for the effect of X on Y. We refer to the effect of X on Y as ∂Y/∂X1:

∂Y/∂X1 = β1 + 2β2X1     (7.4)
This equation means that when interpreting results from a polynomial regression we can’t
look at individual coefficients in isolation, but instead need to know how the coefficients on X and X² work together.
Figure 7.4 illustrates more generally the kinds of relationships that a quadratic model
can account for. Each panel illustrates a different quadratic function. Panel (a) shows an
example in which the effect of X is getting bigger as X gets bigger. Panel (b) shows an
example in which the effect of X on Y is getting smaller. In both of the top panels, Y gets
bigger as X gets bigger, but the relationships have a quite different feel.
In panels (c) and (d) there are negative relationships between X and Y : the more X,
4 Equation 7.4 is the result of using standard calculus tools to take the derivative of Y in Equation 7.2 with respect to X1. The derivative is the slope evaluated at a given value of X1. For a linear model, the slope is always the same and is β̂1. The ∂Y in the numerator refers to the change in Y; the ∂X1 in the denominator refers to the change in X1. The fraction ∂Y/∂X1 therefore refers to the change in Y divided by the change in X1, which is the slope.
the less Y . Again, though, we see very different types of relationships. In panel (c) there
is a leveling out, while in panel (d) the negative effect of X on Y is accelerating as X gets
bigger.
A quadratic OLS model can even estimate relationships that change directions. In panel (e), Y initially gets bigger as X increases, but then it levels out and eventually further increases in X decrease Y. In panel (f), we see the opposite pattern, with Y initially getting smaller as X rises before eventually turning up.
One of the nice things about using a quadratic specification in OLS is that we don’t have
to know ahead of time whether the relationship is curving down or up, flattening out or
getting steeper. The data will tell us. We can simply estimate a quadratic model and if the
relationship is like the one in panel (a), the estimated OLS coefficients will yield a curve like the one in that panel; if the relationship is like panel (f), OLS will produce coefficients that best fit the data.
So if we have data that looks like any of the patterns in Figure 7.4 we need simply estimate
a quadratic OLS model and we will get fitted lines that reflect the data.
Polynomial models with cubed or higher order terms can account for patterns that wiggle
and bounce even more than the quadratic model. It’s relatively rare to use higher order
polynomial models. Often the data simply doesn’t support such a model. In addition,
using higher order terms without strong theoretical reasons can be a bit fishy – as in raising
the specter of model fishing. A control variable with a high order can be more defensible,
but ideally our main results do not depend on untheorized high order polynomial control
[FIGURE 7.4: Six panels, (a) through (f), each plotting a different quadratic relationship between X and Y; the examples shown include Y = −0.1X + 0.02X², Y = 20X − 0.1X², and Y = −20X + 0.1X².]
variables.
Remember This
OLS can estimate non-linear effects via polynomial models.
1. A polynomial model includes X raised to powers greater than 1. The general form is

   Yi = β0 + β1Xi + β2Xi² + β3Xi³ + ... + βkXi^k + εi

2. The most commonly used polynomial model is the quadratic model

   Yi = β0 + β1Xi + β2Xi² + εi

   • The effect of Xi in a quadratic model varies depending on the value of X.
   • The estimated effect of a one unit increase in Xi in a quadratic model is β1 + 2β2X.
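In R, a quadratic model can be estimated by adding the squared term with I(); the sketch below uses simulated data rather than the book's life expectancy file:

   set.seed(7)
   x <- runif(300, 0, 100)
   y <- 50 + 0.8 * x - 0.005 * x^2 + rnorm(300, sd = 3)   # rises quickly, then flattens

   # Quadratic OLS: include X and X squared (I() protects the ^ arithmetic in the formula)
   quad_fit <- lm(y ~ x + I(x^2))
   summary(quad_fit)

   # Estimated effect of a one unit increase in X, evaluated at X = 20 and X = 80:
   # beta1-hat + 2 * beta2-hat * X
   b <- coef(quad_fit)
   b["x"] + 2 * b["I(x^2)"] * c(20, 80)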
Panel (b) of Figure 7.5 includes the fitted line from a bivariate OLS model with Year as the independent variable:

Temperaturei = β0 + β1Yeari + εi

The linear model fits reasonably well, although it seems to be underestimating recent temperatures.
Column (a) of Table 7.1 shows the coefficient estimates for the linear model. The estimated β̂1 is 0.006 with a standard error of 0.0003. The t statistic of 18.74 indicates a highly
[FIGURE 7.5: Temperature (deviation from average pre-industrial temperature, in Fahrenheit) plotted by year; panels (b) and (c) add the linear and quadratic fitted lines, respectively.]
statistically significant coefficient. The result suggests that the earth has been getting 0.006
degrees warmer each year since 1879 (when the data series begins).
Table 7.1: Global Temperature from 1879 to 2012
              (a)           (b)
Constant      -10.46        155.68
              (0.57)        (30.27)
              [t = 18.31]   [t = 5.14]
Year          0.006         -0.166
              (0.0003)      (0.031)
              [t = 18.74]   [t = 5.31]
Year²                       0.000044
                            (0.000008)
                            [t = 5.49]
N             128           128
R2            0.73          0.78
Standard errors in parentheses
The data looks pretty non-linear, so we also estimate the following quadratic OLS model:

Temperaturei = β0 + β1Yeari + β2Yeari² + εi

in which Year and Year² are independent variables. This model allows us to assess whether
the temperature change has been speeding up or slowing down by allowing us to estimate
a curve in which the change per year in recent years is, depending on the data, larger or
smaller than the change per year in earlier years. We have plotted the fitted line in panel
(c) of Figure 7.5; notice that it is a curve that is getting steeper over time. It fits the data
even better with less underestimation in recent years and less overestimation in the 1970s.
Column (b) of Table 7.1 reports results from the quadratic model. The coefficients on
Year and Year² have t stats over 5, indicating clear statistical significance.
The coefficient on Year is -0.166 and the coefficient on Year² is 0.000044. What the heck do those numbers mean? Not much at a glance. Recall, however, that in a quadratic model an increase in Year by one will be associated with a β̂1 + 2β̂2Yeari increase in estimated average global temperature. This means the predicted change from an increase in Year by one in 1900 is −0.166 + 2 × 0.000044 × 1900 = 0.0012 degrees. The predicted change in temperature from an increase in Year by one in 2000 is −0.166 + 2 × 0.000044 × 2000 = 0.01 degrees.

In other words, the predicted effect of Year changes over time in the quadratic model. In particular, the estimated rate of warming in 2000 (0.01 degrees per year) is around eight times the estimated rate of warming in 1900 (0.0012 degrees per year).
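This marginal-effect arithmetic can be reproduced directly in R from the reported column (b) coefficients:

   beta1 <- -0.166     # coefficient on Year
   beta2 <- 0.000044   # coefficient on Year squared

   beta1 + 2 * beta2 * 1900   # about 0.0012 degrees per year in 1900
   beta1 + 2 * beta2 * 2000   # about 0.01 degrees per year in 2000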
We won’t pay much attention at this point to the standard errors because errors are almost
surely autocorrelated, which would make the standard errors reported by OLS incorrect and
probably too small. We address autocorrelation and other time-series aspects of this data in
Chapter 13.
Empirical analysts, especially in economics, often use logged variables. Logged variables
allow for non-linear relationships but have cool properties that allow us to interpret estimated
effects in percentage terms. In this section we discuss logs and how they work in OLS models
and show how they work in our height and wages example. While we show several different ways to use logged variables, the key thing to remember is that if there's a log, there's a percentage interpretation lurking somewhere.

We'll work with so-called natural logs, which revolve around the constant e, which equals approximately 2.71828 and is, like π ≈ 3.14, one of those numbers that pops up all over in math. Recall that if e² = 7.38, then ln(7.38) = 2. (We use the notation "ln" to refer to natural log.) In other words, the natural log of some number k is the exponent to which we have to raise e to obtain k. The fact that ln(3) = 1.10 means that e^1.10 = 3 (with rounding).
For our purposes, we won’t be using the mathematical properties of logs too much.5 We
instead note that using logged variables in OLS equations can allow us to characterize non-
linear relationships that are broadly similar to panels (a) and (c) of Figure 7.4. In that sense, logged variables are another way to fit curves rather than straight lines.
Models with logged variables have an additional, very attractive feature. The estimated
coefficients can be interpreted directly in percentage terms. That is, with the right logged
model we can produce results that tell us how much a one percent increase in X affects Y .
Often this is a good way to think about empirical questions. For example, suppose we have
wage data for a large number of people across many years and we want to know how inflation affects wages. We might be tempted to estimate

Wageit = β0 + β1Inflationt + εit
5 We derive the marginal effects in log models in the appendix on page 801.
where wages per hour for person i in year t is the dependent variable and the inflation rate in
year t is the independent variable. The estimated —ˆ1 in this model would tell us the increase
in wages per hour that would be associated with a one unit increase in inflation. At first
glance, this might seem like an OK model, but it is actually absurd. Suppose the model
produces —ˆ1 = 1.25; that result would say that everyone – whatever their wage level – would
get another $1.25 for every one point increase in inflation. That conclusion is actually kind
of weird: Such a model is, by design, saying that for every one percent increase in inflation,
the richest CEO gets another buck and a quarter per hour, as does the lowliest temp.
What we really want is a model that allows us to estimate what percent people’s salary
changes with inflation. Using logged variables allows us to do so. For example, we could
estimate a log-linear model in which the dependent variable is transformed by taking the natural log:

ln(Wageit) = β0 + β1Inflationt + εit

It turns out, through the magic of calculus (presented on page 801), that the β̂1 in this model can be interpreted as the percentage change in Y associated with a one unit increase in X.
In other words, if β̂1 = 0.82 we would say that a one unit increase in inflation is associated with a 0.82 percent increase in wages. The CEO would get a 0.82 percent increase of her
high wages; the temp would get a 0.82 percent increase of his low wages. These are very different dollar amounts but the same percentage change.
We can also estimate a model in which the dependent variable is not logged, but the independent variable is transformed by taking the natural log of it. Such a model would look like

Yi = β0 + β1 ln Xi + εi     (7.7)
Here, β1 indicates the effect of a one percent increase in X on Y. We need to divide the
estimated coefficient by 100 to convert it to units of Y . This is one of the odd hiccups in
models with logged variables: The units can be a bit tricky. While we can memorize the
way units work in these various models, the safe course of action here is to simply remember
that we’ll probably have to look up how units in logged models work when we use logged
At the pinnacle of loggy-ness is the so-called log-log model. This model allows us to estimate the percent change in Y associated with a percent change in X. For example, if we want to know the elasticity of airline tickets, we could get data on sales and prices and estimate the following model:

ln(Salesi) = β0 + β1 ln(Pricei) + εi
where the dependent variable is the natural log of monthly ticket sales on routes (e.g., New
York to Tokyo) and the independent variable is the natural log of the monthly average price
of the tickets on those routes. β̂1 estimates the percentage change of sales when ticket prices
go up by one percent.
Another hiccup with logged models is that the values of the variable being logged must be
greater than zero. The reason is that the mathematical log function is undefined for values
less than or equal to zero.6 Hence, logged models work best with economic variables such as
sales, quantities, and prices. Even there, though, it is not uncommon to see an observation
with zero sales or zero wages and we will be forced to omit such observations.7
Logged models are super easy to estimate; we’ll see how in the Computing Corner. The
key is interpretation. If the model has a logged variable or variables, we know the coefficients
reflect a percentage of some sort, with the exact interpretation depending on which variables
are logged.
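For instance, here is a short R sketch of the three variants using simulated, strictly positive data (the variable names are illustrative, not from the height and wage file):

   set.seed(1)
   n <- 500
   x <- runif(n, 60, 80)                              # e.g., height in inches
   y <- exp(0.5 + 0.03 * x + rnorm(n, sd = 0.4))      # strictly positive outcome, e.g., wages

   linear_log <- lm(y ~ log(x))        # divide the coefficient by 100 for a 1% change in X
   log_linear <- lm(log(y) ~ x)        # coefficient is the proportional change in Y per unit of X
   log_log    <- lm(log(y) ~ log(x))   # coefficient is the elasticity
   summary(log_log)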
Table 7.2 takes us back to the height and wage data we discussed on page ??. It reports
results from four regressions. In the first specification nothing is logged. Interpretation of
β̂1 is old hat: a one inch increase in adolescent height is associated with a $0.412 increase in hourly wages.
6 Recall that (natural) log of k is the exponent to which we have to raise e to obtain k. There is no number that we can raise e to and get 0. We can get close by raising e to minus a huge number; for example, e^(−100) = 1/e^(100), which is very close to zero, but not quite zero.
7 Some people re-code these numbers as something very close to zero (such as 0.0000001) on the reasoning that the log
function is defined for low positive values and the essential information (that the variable is near zero) in such observations is
not lost. However, it’s always a bit sketchy to be changing values (even from zero to a small number), so tread carefully.
Table 7.2
                        Wages        Wages         Log wages     Log wages
                        (linear)     (linear-log)  (log-linear)  (log-log)
Adolescent height       0.412*                     0.033*
                        (0.098)                    (0.015)
                        [t = 4.23]                 [t = 2.23]
Log adolescent height                29.316*                     2.362*
                                     (6.834)                     (1.021)
                                     [t = 4.29]                  [t = 2.31]
Constant                -13.093      -108.778*     0.001         -7.754
                        (6.897)      (29.092)      (1.031)       (4.348)
                        [t = 1.90]   [t = 3.74]    [t = 0.01]    [t = 1.78]
N                       1,910        1,910         1,910         1,910
R2                      0.009        0.010         0.003         0.003
Standard errors in parentheses
* indicates significance at p < 0.05
The second column reports results from a linear-log model in which the dependent variable is not logged and the independent variable is logged. The interpretation of β̂1 is that a one percent increase in X (which is adolescent height in this case) is associated with a 29.316/100 = $0.293 increase in hourly wages. The dividing by 100 is a bit unusual, but no big deal once we know to expect it.
The third column reports results from a model in which the dependent variable has been
logged but the independent variable has not been logged. The interpretation of β̂1 here is
that a one inch increase in height is associated with a 3.3% increase in wages.
The fourth column reports a log-log model in which both the dependent variable and
independent variable have been logged. The interpretation of β̂1 here is that a one percent increase in height is associated with a 2.362 percent increase in wages. Note that in the log-
linear column, the percentage is on a 0 to 1 scale and in the log-log column the percentage is on a 0 to 100 scale. Yeah, that's a pain; it's just how the math works out.
So which model is best? Sadly, there is no magic bullet in selecting models here, another
hiccup when working with logged models. We can’t simply look at the R2 because they
are not comparable: In the first two models the dependent variable is Y and in the last
two the dependent variable is ln(Y). As is often the case in statistics, some judgment will be required. If we want to interpret effects as elasticities, the log-log model is natural. In other contexts, we have to decide whether we think the causal
mechanism makes more sense in percentage terms and whether it applies to the dependent variable, the independent variable, or both.
Remember This
1. How to interpret logged models:
   • Log-linear (logged Y, unlogged X): β̂1 is the percentage change in Y (on a 0 to 1 scale) associated with a one unit increase in X.
   • Linear-log (unlogged Y, logged X): β̂1/100 is the change in Y associated with a one percent increase in X.
   • Log-log (both logged): β̂1 is the percent change in Y associated with a one percent increase in X.
2. Logged models have some challenges not found in other models (the three hic-
cups):
(a) The scale of the —ˆ coefficients varies depending on whether the model is
log-linear, linear-log, or log-log.
(b) We cannot log variables that have values less than or equal to zero.
(c) There is no simple test for choosing among log-linear, linear-log, and log-log
models.
We frequently want to compare coefficients. That is, we want to say whether X1 or X2 has
a bigger effect on Y . If the variables are on the same scale, this task is pretty easy. For
example, in the height and wages model, both adolescent and adult height are measured in
inches, so we can naturally compare the estimated effects of an inch of adult height versus an inch of adolescent height.
When the variables are not on the same scale, we have a tougher time making a direct comparison. Suppose, for example, that we want to know whether batting average or home runs matter more for baseball salaries. Players with high batting averages get on base a lot, keeping the offense going and increasing the odds of scoring. Players who hit home runs score right away, sometimes in bunches. Which group of players earns more? We might first address this question by estimating

Salaryi = β0 + β1Batting averagei + β2Home runsi + εi
The results are in Table 7.3. The coefficient on batting average is 12,417,629.72. That’s
huge! The coefficient on home runs is 129,627.36. Also big. But nothing like the coefficient
on batting average. Batting average must have a much bigger effect on salaries than home
runs, right?
Umm, no. These variables aren’t comparable. Batting averages typically range from 0.200
to 0.350 (meaning most players get a hit between 20% and 35% of the time). Home runs
per season range from 0 to 73 (with a lot more zeros than 73s!). Each OLS coefficient in the
model tells us what happens if we increase the variable by “1”. For batting average, that’s an
impossibly large increase (going from probability of getting a hit of 0 to a probability of 1.0).
For home runs, that’s just another day at the ballpark. In other words, when “1” means
something very different for two variables, we’d be nuts to directly compare the regression
Standardizing coefficients
A common solution is to standardize the variables by converting them to standard deviations from their means. That is, instead of having a variable that indicates a baseball player's batting average, we have a variable that indicates how many standard deviations above or below the average batting average a player was. Instead of having a variable that indicates home runs, we have a variable that indicates how many standard deviations above or below the average number of home runs a player hit. The attraction of standardizing variables is that a one unit increase in any standardized independent variable means the same thing: a one standard deviation increase.
When the dependent variable is also standardized, the coefficient on a standardized independent variable can be interpreted as "Controlling for the other variables
in the model, a one standard deviation increase in X is associated with a β̂1 standard deviation change in Y." The standardized version of a variable is created as

VariableStandardized = (Variable − mean(Variable)) / sd(Variable)     (7.9)

where mean(Variable) is the average of the variable for all units in the sample and sd(Variable) is the standard deviation of the variable in the sample.
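In R, the scale() function standardizes a variable in exactly this way; a minimal sketch with made-up salary-style data (the means and standard deviations used to simulate are taken from the discussion below, not from the actual data set):

   set.seed(3)
   salary   <- rnorm(100, mean = 2024616, sd = 2764512)
   batting  <- rnorm(100, mean = 0.267, sd = 0.031)
   homeruns <- rpois(100, lambda = 11)

   # scale() subtracts the mean and divides by the standard deviation
   salary_std <- scale(salary)    # same as (salary - mean(salary)) / sd(salary)

   # A regression with every variable standardized yields standardized coefficients
   std_fit <- lm(scale(salary) ~ scale(batting) + scale(homeruns))
   summary(std_fit)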
Table 7.4 reports the means and standard deviations of the variables for our baseball
salary example. Table 7.5 then uses these means and standard deviations to report the un-
standardized and standardized values of salary, batting average, and home runs for three
selected players. Player 1 earned $5.85 million. Given that the standard deviation of salaries in the data set was $2,764,512 (and the average salary was $2,024,616), the standardized value of this player's salary is (5,850,000 − 2,024,616)/2,764,512 = 1.38. In other words, player 1 earned 1.38 standard deviations more than the average salary. This player's batting average was 0.267, which is exactly the average. Hence, his standardized batting average is zero. He hit 43 home runs, which is 2.99 standard deviations above the average number of home runs.
Table 7.6 displays standardized OLS results along with the unstandardized results from
Table 7.5
                        Unstandardized                         Standardized
Player ID   Salary      Batting average  Home runs    Salary   Batting average  Home runs
1           5,850,000   0.267            43           1.38     0.00             2.99
2           2,000,000   0.200            4            -0.01    -2.11            -0.79
3           870,000     0.317            33           -0.42    1.56             2.03
Table 7.3. The standardized results allow us to reasonably compare the effects of batting
average and home runs on salary. We see in Table 7.4 that a standard deviation of batting
average is 0.031. The standardized coefficient column tells us that an increase of one standard deviation in batting average is associated with an increase in salary of 0.14 standard deviations. So, for example, a player raising his batting average by 0.031 from 0.267 to 0.298 can expect an increase in salary of 0.14 × $2,764,512 = $387,032. A player who increases his home runs by one standard deviation (which Table 7.4 tells us is 10.31 home runs) can expect a 0.48 standard deviation increase in salary (which is 0.48 × $2,764,512 = $1,326,966).
In other words, home runs have a bigger bang for the buck. Eat your steroid-laced Wheaties,
kids.8
While results from OLS models with standardized variables seem quite different, they are
really only re-scaling the original results. The model fit is the same whether standardized
or unstandardized variables are used. Notice that the R2 is identical. Also, the conclu-
sions about statistical significance are the same in the unstandardized and standardized
regressions; we can see that by comparing the t statistics. Think of the standardization as a currency conversion: in unstandardized form, it is as if the coefficients
8 That’s a joke! Wheaties are gross.
Table 7.6: Standardized Determinants of Major League Baseball Salaries, 1985 - 2005
                   Unstandardized     Standardized
Batting average    12,417,629.72      0.14
                   (940,985.99)       (0.01)
                   [t = 13.20]        [t = 13.20]
Home runs          129,627.36         0.48
                   (2,889.77)         (0.01)
                   [t = 44.86]        [t = 44.86]
Constant           -2,869,439.40      0.00
                   (244,241.12)       (0.01)
                   [t = 11.75]        [t = 0.00]
N                  6,762              6,762
R2                 0.30               0.30
Standard errors in parentheses
are reported in different currencies, but in standardized form, the coefficients are reported
in a common currency. The underlying real prices are the same whether they are reported in one currency or another.
Remember This
Standardized coefficients allow the effects of two independent variables to be compared.
1. When the independent variable, Xk, and dependent variable are standardized, a one standard deviation increase in Xk is associated with a β̂k standard deviation increase in the dependent variable.
2. Statistical significance and model fit are the same for unstandardized and stan-
dardized results.
The standardized coefficients on batting average and home runs look quite different. But are
they statistically significantly different from each other? The t statistics in Table 7.6 tell us
that each is statistically significantly different from zero, but they tell us nothing about whether the two coefficients differ from each other.
Answering this kind of question is trickier than the t tests we’ve seen because we’re dealing
with more than one estimated coefficient. Both estimates have uncertainty associated with
them and, to make things worse, they may co-vary in ways that we want to take into account.
In this section, we discuss F tests as a solution to this challenge, explain two different types of
commonly used hypotheses about multiple coefficients, and then show how to use R² results from the restricted and unrestricted models to carry out the test.
F tests
There are several ways to test hypotheses involving multiple coefficients. We focus on an F
test. This test shares features with hypothesis tests discussed earlier (on page 146). When
using an F test, we define null and alternative hypotheses, set a significance level, and compare a test statistic to a critical value. The new elements are that we use a funky test statistic and we compare it to a critical value derived from an F distribution rather than a t distribution (the F distribution is described on page 783).
The funky test statistic is an F statistic. It is based on R² values from two separate OLS specifications.
The first specification is the unrestricted model, which is simply the full model. For a model with three independent variables, for example, the unrestricted model is Yi = β0 + β1X1i + β2X2i + β3X3i + εi. The model is called unrestricted because we are imposing no restrictions on the values the estimated coefficients can take.
The second specification is the so-called restricted model, in which we force the computer to give us results that comport with the null hypothesis. It's called restricted because we are restricting the estimated values of β̂1, β̂2, and β̂3 to be consistent with the null hypothesis. How do we do that? Sounds hard. Actually, it isn't. We simply take the relationship implied by the null hypothesis and impose it on the unrestricted model. We can divide the null hypotheses we typically test into two cases.

Case 1: Multiple coefficients equal zero under the null hypothesis
It is fairly common to see researchers test a null like H0: β1 = β2 = 0. This is a null that both coefficients are zero; we reject it if we observe evidence that one or both coefficients are
not equal to zero. This type of hypothesis is particularly useful when we have multicollinear
variables. In such circumstances, the multicollinearity may drive up the standard errors
of the β̂ estimates such that we have very imprecise (and likely statistically insignificant) estimates for the individual coefficients. By testing the null that both of the multicollinear variables equal zero, we can at least learn if one (or both) of them is non-zero, even as we remain unable to say precisely which one.
In this case, imposing the null hypothesis means making sure that our estimates of β1 and β2 are both zero. The process is actually easy-schmeasy: Just set the coefficients to zero and see that the resulting model is simply a model without variables X1 and X2. Specifically,

Yi = β0 + β3X3i + εi
Statistical programs such as Stata and R automatically report results for “the” F test
which is a test that the coefficients on all the independent variables equal zero.
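In R, for example, this overall F test appears at the bottom of the standard regression summary; a minimal sketch using a built-in data set:

# After fitting a multivariate model, summary() reports the F test of the null
# that all slope coefficients equal zero (here using R's built-in mtcars data).
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)               # bottom line: "F-statistic: ... on 2 and 29 DF, p-value: ..."
summary(fit)$fstatistic    # the F value with its numerator and denominator df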
Case 2: One or more coefficients equal each other under the null hypothesis
A more complicated – and interesting – case occurs when we want to test whether the effect
of one variable is larger than the effect of another. In this case, the null hypothesis will be
that both coefficients are the same. For example, if we want to know if the effect of X1 is bigger than the effect of X2, the null hypothesis will be H0: β1 = β2. Note that such a hypothesis test makes sense only if the scales of X1 and X2 are the same or the two variables have been standardized.
In this case, imposing the null hypothesis to create the restricted equation involves re-writing the unrestricted equation so that the two coefficients are the same. We can do so, for example, by replacing β2 with β1 (because they are equal under the null). After some rearranging, the restricted model is Yi = β0 + β1(X1i + X2i) + β3X3i + εi, in which a one-unit increase in X1i + X2i is associated with a β1 change in Yi. To estimate this model, we need only to create a new variable equal to X1 + X2 and include it in place of the two separate variables. The cool thing is that if we increase X1 by one unit, X1i + X2i goes up by one and we expect a β1 increase in Y. At the same time, if we increase X2 by one unit, X1i + X2i also goes up by one and we expect the same β1 increase in Y, which is just what the null hypothesis of equal effects implies.
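As a rough R sketch of this idea (simulated data, hypothetical variable names), the restricted model simply replaces the two separate variables with their sum so that a single coefficient applies to both:

# Simulated data with hypothetical variable names
set.seed(2)
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
dat$y <- 1 + 0.5 * dat$x1 + 0.5 * dat$x2 + dat$x3 + rnorm(200)

# Unrestricted model: separate coefficients on x1 and x2
unrestricted <- lm(y ~ x1 + x2 + x3, data = dat)

# Restricted model under H0: beta1 = beta2.  I() treats x1 + x2 as a single
# variable, forcing a common coefficient on the two variables.
restricted <- lm(y ~ I(x1 + x2) + x3, data = dat)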
The statistical fits of the unrestricted and restricted models are measured with R²_Unrestricted and R²_Restricted. These are simply the R²s from each separate model. The R²_Unrestricted will
always be higher because the model without restrictions can generate a better model fit than the same model subject to some restrictions. This conclusion is a little counterintuitive at first, but note that R²_Unrestricted will be higher than R²_Restricted even when the null hypothesis is true, because when estimating the unrestricted equation the software not only has the option of estimating both coefficients to be whatever the value is under the null (hence assuring the same fit as in the restricted model), but also has the option of any other deviation, large or small, that improves the fit. The difference between the two R² values will be small, however, when the null hypothesis is true. If we are testing H0: β1 = β2 = 0 and β1 and β2 really are zero, then imposing the null barely hurts the fit because the optimal values of β̂1 and β̂2 really are around zero. If the null is false and β1 and β2 are much different than zero, then there will be a huge difference between R²_Unrestricted and R²_Restricted because setting them to non-zero values, as happens only in the unrestricted model, substantially improves the fit.
Hence, the heart of an F test is the difference between R²_Unrestricted and R²_Restricted. When the difference is small, imposing the null doesn't do too much damage to the model fit. When the difference is large, imposing the null does a lot of damage to model fit.
The F statistic is

F(q, N−k) = [(R²_Unrestricted − R²_Restricted) / q] / [(1 − R²_Unrestricted) / (N − k)]

The q term refers to how many constraints are in the null hypothesis. That's just a fancy way of saying how many equal signs are in the null hypothesis. So for H0: β1 = β2 the value of q is 1; for H0: β1 = β2 = 0 the value of q is 2. The N − k term is a degrees of freedom term, like what we saw with the t distribution. This is the sample size minus the number of parameters estimated in the unrestricted model. (For example, k for Equation 7.11 will be three because we estimate β̂0, β̂1, and β̂2.) We need to know these terms because the shape of the F distribution depends on the sample size and the number of constraints in the null, and dividing by q and N − k supplies the other bits needed to ensure that the F statistic is distributed according to an F distribution. The F distribution describes the relative probability of observing different values of the F statistic under the null hypothesis. It allows us to know the probability that the F statistic will be bigger than any given number when the null is true. We can use this fact to identify critical values for our hypothesis tests.
How we approach the alternative hypothesis depends on the type of null hypothesis. For case 1 null hypotheses (in which multiple coefficients are zero under the null hypothesis), the alternative hypothesis is that at least one of them is not zero. In other words, the null hypothesis is that they all are zero and the alternative is the negation of that, which is that at least one of them is not zero.
For case 2 null hypotheses (in which two or more coefficients are equal under the null hypothesis), the alternative hypothesis is typically that the effect of one variable
is larger than the other. The critical value remains the same, but we add a requirement
that the coefficients actually go in the direction of the specified alternative hypothesis. For
example, if we are testing H0: β1 = β2 versus HA: β1 > β2, we reject the null in favor of the alternative hypothesis if the F statistic is bigger than the critical value and β̂1 is actually bigger than β̂2.
This all may sound complicated, but the process isn't that hard, really. (And, as we show in the Computing Corner, statistical software makes it really easy.) The crucial step is formulating the null hypothesis and using it to create a restricted equation. This process is actually pretty easy. If we're dealing with a case 1 null hypothesis that multiple coefficients are zero, we simply drop the variables listed in the null from the restricted equation. If we're dealing with a case 2 null hypothesis that two or more coefficients are equal to each other, we simply create a new variable that is the sum of the variables and use that in the restricted equation in place of the separate variables.
To see F testing in action, let's return to our standardized baseball salary model and first test the case 1 null that neither variable matters: H0: β1 = β2 = 0. The R²_Unrestricted is 0.2992 (it's usually necessary to be more precise than the 0.30 reported in Table 7.6).
For the restricted model, we simply drop the variables listed in the null hypothesis, yielding

Salaryi = β0 + εi

The R²_Restricted for this model is 0.00 (because there are no independent variables to explain the dependent variable). We calculate the F statistic by substituting these values, along with q (which is 2 because there are 2 equal signs in the null hypothesis) and N − k, which is the sample size (6,762) minus 3 (because there are three coefficients estimated in the unrestricted model), which is 6,759. The result is

F(2, 6759) = [(0.2992 − 0.00)/2] / [(1 − 0.2992)/6,759] = 1442.846
The critical value (which we show how to identify in the Computing Corner on pages 359
and 360) is 3.00. Since the F statistic is (way!) higher than the critical value, we reject the
null handily.
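For readers who want to see the mechanics, here is a hedged R sketch on simulated stand-in data (not the actual baseball data): it computes the F statistic from the two R² values and checks it against anova(), which performs the same restricted-versus-unrestricted comparison.

# Simulated stand-in data with hypothetical variable names
set.seed(3)
mlb <- data.frame(batavg = rnorm(1000), hr = rnorm(1000))
mlb$salary <- 0.1 * mlb$batavg + 0.5 * mlb$hr + rnorm(1000)

unrestricted <- lm(salary ~ batavg + hr, data = mlb)
restricted   <- lm(salary ~ 1, data = mlb)    # H0: both coefficients equal zero

# F statistic from the R-squared formula
r2u <- summary(unrestricted)$r.squared
r2r <- summary(restricted)$r.squared          # 0: no explanatory variables
q   <- 2                                      # two equal signs in the null
dfu <- summary(unrestricted)$df[2]            # N - k
((r2u - r2r) / q) / ((1 - r2u) / dfu)

anova(restricted, unrestricted)               # same test, done directly
qf(0.95, df1 = q, df2 = dfu)                  # critical value for alpha = 0.05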
Or we can easily test which effect is bigger by testing the following case 2 null hypothesis: H0: β1 = β2, which says that the standardized effect of batting average equals the standardized effect of home runs.
The R²_Unrestricted continues to be 0.2992. For the restricted model, we simply replace the individual batting average and home run variables with a variable that is the sum of the two (standardized) variables. The R²_Restricted turns out to be 0.2602. We calculate the F statistic by substituting these values, along with q (which is 1 because there is 1 equal sign in the null hypothesis) and N − k, which is still 6,759. The result is

F(1, 6759) = [(0.2992 − 0.2602)/1] / [(1 − 0.2992)/6,759] = 376.14
The critical value (which we show how to identify in the Computing Corner on pages 359
and 360) is 3.84. Here too the F statistic is vastly higher than the critical value, and we reject the null that the two standardized effects are equal.
Remember This
F tests are useful for testing hypotheses involving multiple coefficients. To implement an F test for the model

Yi = β0 + β1X1i + β2X2i + β3X3i + εi

1. State the null hypothesis (for example, H0: β1 = β2 = 0 or H0: β1 = β2) and estimate both the unrestricted model and a restricted model that imposes the null.
2. Calculate F(q, N−k) = [(R²_Unrestricted − R²_Restricted)/q] / [(1 − R²_Unrestricted)/(N − k)], where q is the number of equal signs in the null hypothesis and k is the number of parameters in the unrestricted model.
3. Reject the null hypothesis if the F statistic is bigger than the critical value from the F distribution.
Let's apply F tests to the height and wages example, using a model in which wages depend on height measured in 1981, height measured in 1985, and participation in clubs and athletics:

Wagei = β0 + β1Height81i + β2Height85i + β3Clubsi + β4Athleticsi + εi
Let’s test two different null hypotheses with multiple coefficients. First, let’s test a case
1 null that neither height variable has an effect on wages. This null is H0 : —1 = —2 = 0. The
Table 7.7 presents results necessary to test this null. We use an F test that requires R2
values from two specifications. Column (a) presents the unrestricted model; at the bottom is
the RU2 nrestricted , which is 0.06086. Column (b) presents the restricted model; at the bottom is
the RRestricted
2
, which is 0.05295. There are two restrictions in this null, meaning q = 2. The
sample size is 1,851 and the number of parameters in the unrestricted model is 5, meaning N − k = 1,846.
Hence, for H0: β1 = β2 = 0,

F(2, 1846) = [(0.06086 − 0.05295)/2] / [(1 − 0.06086)/1,846] = 7.77
We have to use software (or tables) to find the critical value. We’ll discuss that process
below on page 359. For q = 2 and N − k = 1,846, the critical value for α = 0.05 is 3.00.
Because our F statistic as calculated above is bigger than that, we can reject the null. In
other words, the data is telling us that if the null were true, it would be very unlikely to see
such a big difference in fit between the unrestricted and restricted models.10
Second, let’s test the following case 2 null, H0 : —1 = —2 . Column (a) still presents the
unrestricted model; at the bottom is the RU2 nrestricted , which is 0.06086. The restricted model
is different for this null. Following the logic of discussed on page 345, it is
Column (c) presents the results for this restricted model; at the bottom is the RRestricted
2
,
which is 0.0605. There is one restriction in this null, meaning q = 1. The sample size
is still 1, 851 and the number of parameters in the unrestricted model is still 5, meaning
N ≠ k = 1, 846.
Hence, for H0: β1 = β2,

F(1, 1846) = [(0.06086 − 0.06050)/1] / [(1 − 0.06086)/1,846] = 0.71
We again have to use software (or tables) to find the critical value. For q = 1 and N − k = 1,846, the critical value for α = 0.05 is 3.85. Because our F statistic as calculated
above is less than the critical value, we fail to reject the null that the two coefficients are
equal. The coefficients are quite different in the unrestricted model (0.03 and 0.35) but notice
10 The specific value of the F statistic provided by automated software F tests will differ from the above because
they do not round to 3 digits as we have done.
that the standard errors are large enough that we cannot reject the null that either one is
zero. In other words, we have a lot of uncertainty in our estimates. The F test formalizes
this uncertainty by forcing OLS to give us the same coefficient on both height variables and,
when we do this, the overall model fit is pretty close to the model fit when the coefficients
are allowed to vary across the two variables. If the null is true, this result is what we would
expect because imposing the null would not lower R² by very much. If the null were false, then imposing the null would have likely caused a more substantial reduction in R²_Restricted.
7.5 Conclusion
The multivariate OLS model is very powerful, and this chapter has worked through some of its practical capabilities. First, the world is not necessarily linear, and the multivariate OLS framework can accommodate that. Polynomial models, of which quadratic models are the most common, can produce fitted lines with increasing returns, diminishing returns, U-shaped, and upside-down U-shaped relationships. Logged models let us express relationships in percentage terms.
Often we care not only about individual variables, but also about how variables relate to
each other. Which variable has a bigger effect? As a first cut, we can standardize variables
to make them plausibly comparable. If and when the variables are comparable, we can test
for which effect is larger using F tests, a class of hypothesis tests that also allows us to test
other hypotheses about multiple coefficients, such as whether a group of coefficients is all
zero.
If we have mastered the core points of this chapter, we can do the following.
• Section 7.1: Explain polynomial models and quadratic models. Sketch the various kinds of relationships that a quadratic model can estimate. Show how to interpret coefficients: what is the marginal effect of a change in the independent variable?
• Section 7.4: Explain how to test a hypothesis about multiple coefficients. Use an F test
to test the following null hypotheses for the model Yi = β0 + β1X1i + β2X2i + εi.
– H0: β1 = β2 = 0
– H0: β1 = β2
Further Reading
Empirical papers using logged variables are very common; see, for example, Card (1990).
Zakir Hossain (2011) discusses the use of Box-Cox tests to help decide which functional form
(linear, log-linear, linear-log, or log-log) is best. Achen (1982, 77) critiques standardized variables, in part because they depend on the standard deviations of the independent variables in the sample.
Key Terms
• Elasticity (332)
• F test (342)
• F statistic (343)
• Linear-log model (332)
• Log-linear model (331)
• Log-log model (332)
• Polynomial model (321)
• Quadratic model (321)
• Restricted model (343)
• Specification (343)
• Standardize (338)
• Standardized coefficient (341)
• Unrestricted model (343)
Computing Corner
Stata
1. To estimate a quadratic model in Stata, simply generate a new variable equal to the
square of the variable (e.g., gen X1Squared = X1^2) and then include it in a regression (e.g., reg Y X1 X1Squared X2).
2. To estimate a linear-logged model in Stata, simply generate a new variable equal to the
log of the independent variable (e.g., gen X1Log = log(X1)) and then include it in a
regression (e.g., reg Y X1 X1Log X2). Log-linear and log-log models proceed similarly.
3. In Stata, there is an easy way and a hard way to generate standardized regression coef-
ficients. Here’s the easy way: Simply type , beta at the end of a regression command.
For example, reg salary BattingAverage Homeruns, beta.
A harder but perhaps more transparent approach is to create standardized versions of each variable (by subtracting the mean and dividing by the standard deviation) and then run the regression using those variables. The standardized coefficients are listed, as usual, under “Coef.” Notice that they are identical to the results from using the , beta command.
4. Stata has a very convenient way to conduct F tests for hypotheses involving multiple
coefficients. Simply estimate the unrestricted model and then type test and then the
coefficients involved and restriction implied by the null. For example, to test the null
hypothesis that the coefficients on Height81 and Height85 are both equal to zero, type
the following:
reg Wage Height81 Height85 Clubs Athletics
test Height81 Height85
R
1. To estimate a quadratic model in R, simply generate a new variable equal to the square of the variable (e.g., X1Squared = X1^2) and then include it in a regression (e.g., lm(Y ~ X1 + X1Squared + X2)).
2. To estimate a linear-logged model in R, simply generate a new variable equal to the log of the independent variable (e.g., X1Log = log(X1)) and then include it in a regression (e.g., lm(Y ~ X1 + X1Log + X2)). Log-linear and log-log models proceed similarly.
3. In R, there is an easy way and a hard way to generate standardized regression co-
efficients. Here’s the easy way: Use the scale command in R. This command will
automatically create standardized variables on the fly:
summary(lm(scale(Sal) ~ scale(BatAvg) + scale(HR)))
A harder but perhaps more transparent approach is to simply create standardized variables and then use them to estimate a regression model. Standardized variables can be created manually (e.g., Sal_std = (bb$salary - mean(bb$salary))/sqrt(var(bb$salary))). After standardizing all variables, simply run an OLS model using the standardized variables:
summary(lm(Sal_std ~ BatAvg_std + HR_std))
4. There are automated functions available on the web to do F tests for hypotheses involv-
ing multiple coefficients, but they require a fair amount of work to get them working at
first. Here we present a manual approach for the tests on page 352.
R stores R2 values and degrees of freedom information for each model and we can access
this information by using the “summary” command followed by a dollar sign and the
appropriate name. To see the various values of R2 for the unrestricted and restricted
models type
summary(Unrestricted)$r.squared
summary(Restricted1)$r.squared
summary(Restricted2)$r.squared
To see the degrees of freedom for the unrestricted model, type
summary(Unrestricted)$df[2]
We’ll have to keep track of q on our own.
To calculate the F statistic for H0 : —1 = —2 = 0 as described on page 353, type
((summary(Unrestricted)$r.squared - summary(Restricted1)$r.squared)/2) /
((1-summary(Unrestricted)$r.squared)/summary(Unrestricted)$df[2])
This code will produce slightly different F statistics than on page 353 due to rounding.
5. To find the critical value from an F distribution for a given α, q, and N − k, type
qf(1-α, df1=q, df2=N-k)
For example, to calculate the critical value on page 353 for H0: β1 = β2 = 0, type
qf(.95, df1=2, df2=1846)
Exercises
1. The relationship between political instability and democracy is important and likely
quite complicated. Do democracies manage conflict in a way that reduces instability
or do they stir up conflict? Using the Instability PS data.dta data set from Zaryab
Iqbal and Christopher Zorn, answer the following questions. The data set covers 157
countries between 1946 and 1997. The unit of observation is the country-year. The
variables are listed in Table 7.8.
Table 7.8: Variables for Political Instability Questions
Variable Description
Ccode Country code
Year Year
Instab Index of instability (revolutions, crises, coups, etc.). Ranges from −4.65 to +10.07
Coldwar Cold war year (1=yes, 0=no)
Gdplag GDP in previous year
Democracy Democracy score in previous year, ranges from 0 (most
autocratic) to 100 (most democratic)
a. Estimate a bivariate model with instability as the dependent variable and democracy
as the independent variable. Because the units of the variables are not intuitive, use
standardized coefficients to interpret. Briefly discuss the estimated relationship and
whether you expect endogeneity.
b. To combat endogeneity, include a variable for lagged GDP. Discuss changes in results,
if any.
c. Perhaps GDP is better conceived of in log terms. Estimate a model with logged
lagged GDP and interpret the coefficient on this GDP variable.
d. Suppose we are interested in whether instability was higher or lower during the Cold
War. Run two models. In the first, add a Cold War dummy variable to the above
specification. In the second model add a logged Cold War dummy variable to the
above specification. Discuss what happens.
e. It is possible that the positive relationship between democracy and political instabil-
ity is due to the fact that in more democratic countries, people feel freer to engage
in confrontational political activities such as demonstrations. It may be, however,
that this relationship is only positive up to a point or that more democracy increases
political instability more at lower levels of political freedom. Estimate a quadratic
model, building off the specification above. Use a figure to depict the estimated
relationship and use calculus to indicate the point at which the sign on democracy
changes.
2. Use the globaled.dta data on education and growth from Hanushek and Woessmann for
this question. The variables are
a. Use standardized variables to assess whether the effect of test scores is larger than the effect of years in school on economic growth. At this point, simply compare the
Variable Description
name Country name
code Country code
ypcgr Average annual growth rate (GDP per capita),
1960-2000
testavg Average combined math and science standardized
test scores, 1964-2003
edavg Average years of schooling over 1960-2000
ypc60 GDP per Capita in 1960
region Region
open Openness of the economy scale
proprts Security of property rights scale
different effects in a meaningful way. We’ll do statistical tests next. The dependent
variable is GDP growth per year. For this part, control for average test scores,
average years of schooling over 1960-2000, and GDP per capita in 1960.
b. Now conduct a statistical test of whether the (appropriately comparable) effects of
test scores and years in school on economic growth are different. Do this test in two
ways: (i) use the test command in Stata and (ii) generate values necessary to use
an F test equation. Report differences/similarities in results.
c. Now control for openness of economy and security of property rights. Which matters
more: test scores or property rights? Use appropriate statistical evidence in your
answer.
3. We will continue the analysis of height and wages in Britain from the homework problem
in Chapter 5 on page 252. We’ll use the data set heightwage british all multivariate.dta
which includes men and women and the variables listed in Table 7.10.11
a. Estimate a model explaining wages at age 33 as a function of female, height at age
16, mother’s education, father’s education, and number of siblings. Use standardized
coefficients to assess whether height or siblings have a larger effect on wages.
b. Implement a difference of means test across males and females using bivariate OLS.
Do this twice: once with female as the dummy variable and the second time with
11For the reasons discussed in the homework problem in Chapter 3 on page 133 we limit the data set to observations
with height greater than 40 inches and self-reported income less than 400 British pounds per hour. We also exclude
observations of individuals who grew shorter from age 16 to age 33. Excluding these observations doesn’t really affect
the results, but these observations are just odd enough to make us think that in these cases there is some kind of
non-trivial measurement error.
male as the dummy variable (the male variable needs to be generated). Interpret the
coefficient on the gender variable in each model and compare results across models.
c. Now do the same test, but with log of wages at age 33 as the dependent variable.
Use female as the dummy variable. Interpret the coefficient on the female dummy
variable.
d. How much does height explain salary differences across genders? Estimate a differ-
ence of means test across genders controlling for height at age 33 and age at 16.
Explain the results.
e. Does the effect of height vary across genders? Use tools of this chapter to test
for differential effects of height across genders. Use logged wages at age 33 as the
dependent variable and control for height at age 16 and the number of siblings.
Explain the estimated effect of height at age 16 for men and for women.
4. Use the MLBattend.dta we used in Chapter 4 on page 190 and Chapter 5 on page 249.
Which matters more for attendance: winning or runs scored? [To keep us on the same
page, use home attend as the dependent variable and control for wins, runs scored,
runs allowed and season.]
5. In this problem we continue analyzing the speeding ticket data first introduced in Chap-
ter 5 on page 251. The variables we will use are in Table 7.11.
a. Is the effect of age on fines non-linear? Assess this question by estimating a model
with a quadratic age term, controlling for MPHover, Female, Black, and Hispanic.
Interpret the coefficients on the age variables.
b. Sketch the relationship between age and ticket amount from the above quadratic
model. Do so by calculating the fitted value for a white male with 0 MPHover
(probably not a lot of people going zero miles over speed limit got a ticket, but this
simplifies calculations a lot) for ages equal to 20, 25, 30, 35, 40, and 70. (Note:
Either calculate these by hand (or in Excel) from the estimated coefficients or use Stata's display function. To display the fitted value for a zero-year-old white male in Stata, use display _b[_cons] + _b[Age]*0 + _b[AgeSq]*0^2.)
c. Use Equation 7.4 to calculate the marginal effect of age at ages 20, 35, and 70.
Describe how these marginal effects relate to your sketch.
d. Calculate the age that is associated with the lowest predicted fine based on the
quadratic OLS model results above.
e. Do drivers from out of town and out of state get treated differently? Do state police
treat these non-locals differently than local police? Estimate a model that allows us to
assess whether out-of-towners and out-of-staters are treated differently and whether
state police respond differently to out-of-towners and out-of-staters. Interpret the
coefficients on the relevant variables.
f. Test whether the two state police interaction terms are jointly significant. Briefly
explain the results.
Part II
CHAPTER 8

Using Fixed Effects Models to Fight Endogeneity in Panel Data and Difference-in-Difference Models

Do police reduce crime? It is possible that they deter bad guys or get them off the streets.
It is natural to try to answer the question by using standard OLS to analyze data on crime
and police in cities over time. The problem is that we’d risk getting things wrong, possibly
very wrong, because of endogeneity: factors that cause cities to have lots of police and also
cause lots of crime. These factors include gangs, drugs, and subcultures of hopelessness or
lawlessness. Using standard OLS techniques to estimate a model that does not control for these factors (and most won't, because these factors are very hard to measure) may produce estimates suggesting that police cause crime, because the places with lots of crime also have lots of police.
In this chapter we introduce fixed-effects models as a simple yet powerful way to fight
endogeneity. As we explain in more detail in this chapter, fixed effects models boil down
to models that have dummy variables that control for otherwise unexplained unit-level dif-
ferences in outcomes across units. The fixed effect approach is broadly applicable and sta-
tistically important. Depending on the nature of the data set, the approach allows us to
control for attributes of individuals, cities, states, countries, and many other units of obser-
vation. The theoretical appeal of fixed effects models is that they reduce the set of possible
causes of endogeneity. The practical appeal of fixed effect models is that they often produce
profoundly different – and more credible – results than basic OLS models.
There are two contexts in which the fixed effect logic is particularly useful. The first
is when we have panel data, which consists of multiple observations for a specific set of
units. Observing annual crime rates in a set of cities over 20 years is an example. So too
is observing national unemployment rates for every year from 1946 to the present for all
advanced economies. Anyone analyzing such data needs to use fixed effects models to be
taken seriously.
The logic behind the fixed effect approach also is important when we conduct difference-
in-difference analysis, which is particularly helpful when evaluating policy changes. In it, we
compare changes in units affected by some policy change to changes in units not affected by
the policy. We show how difference-in-difference methods rely on the logic of fixed effects models
and, in some cases, use the same tools as panel data analysis.
In this chapter, we show the power and ease of implementing fixed effects models. Section
8.1 uses a panel data example to illustrate how basic OLS can fail when the error term is
correlated with the independent variable. Section 8.2 shows how fixed effects can come to
the rescue in this case (and others). In so doing, the section describes how to estimate fixed
effects models using dummy variables or so-called de-meaned data. Section 8.3 explains the
mildly miraculous ability of fixed effects models to control for variables even as the models
are unable to estimate coefficients associated with these variables. This ability is a blessing in
that we control for these variables; it is a curse in that we sometimes are curious about such
coefficients. Section 8.4 extends fixed effect logic to so-called two-way fixed effects models
that control for both unit and time related fixed effects. Section 8.5 discusses difference-in-
difference methods that rely on fixed effect type logic and are widely used in policy analysis.
In this section we show how using basic OLS to analyze crime data in U.S. cities over time can lead us dangerously astray. Understanding the problem helps us understand the merits of the fixed effects approach.
We explore a data set that covers robberies per capita and police officers per capita in 59
large cities in the United States from 1951 to 1992.1 Table 8.1 presents OLS results from the pooled model

Crimeit = β0 + β1Policei,t−1 + εit          (8.1)

where Crimeit is crime in city i at time t and Policei,t−1 is a measure of the number of police
on duty in city i in the previous year. It’s common to use lagged police in an effort to avoid
the problem that the number of police in a given year could be simultaneously determined
by the number of crimes in that year. We re-visit this point in Section 8.4. For now, let’s
take it as a fairly conventional modeling choice when analyzing the effect of police on crime.
Notice also that the subscripts have both i’s and t’s in them. This is new and will become
important later.
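Constructing a lagged variable like Policei,t−1 has to be done within each city rather than across the whole stacked data set. Here is a minimal base-R sketch; the data frame and variable names are hypothetical, and the data are assumed to be sorted by year within each city:

# Hypothetical city-year panel, sorted by city and then by year
panel <- data.frame(city   = rep(c("A", "B"), each = 4),
                    year   = rep(2001:2004, times = 2),
                    police = c(2.0, 2.1, 2.3, 2.2, 3.0, 3.1, 3.3, 3.2))

# Lag police by one year within each city; the first year in each city becomes NA
panel$police_lag <- ave(panel$police, panel$city,
                        FUN = function(x) c(NA, head(x, -1)))
panel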
We’ll refer to this model as a pooled model. In a pooled model, an observation is com-
pletely described by its X variables and nothing is made of the fact that some observations
came from one city and others from another city. For all the computer knew when running
1 This data is from Marvell and Moody (1996). Their paper discusses a more comprehensive analysis of this data.
Table 8.1 shows the results. The coefficient on the police variable is positive and very
statistically significant. Yikes. More cops, more crime. Weird. In fact, for every additional
police officer per capita, there were 2.37 more robberies per capita. Were we to take these
results at face value, we would believe that cities could eliminate more than two robberies
per capita for every police officer per capita they fired.
Table 8.1: Basic OLS Analysis of Burglary and Police Officers, 1951-1992
Pooled OLS
Lag police, per capita 2.37
(0.07)
[t = 32.59]
N 1,232
Standard errors in parentheses
Of course, we don’t believe the pooled results. We worry that there are unmeasured
factors lurking in the error term that could be correlated with the number of police, thereby
causing bias. The error term in Equation 8.1 contains gangs, drugs, economic hopelessness,
broken families, and much more. If any of those factors is correlated with the number of
police in a given city, we have endogeneity. Given that police are more likely to be deployed
when and where there are gangs, drugs, and economic desolation, it seems inevitable that the police variable will be correlated with the error associated with each city. To keep our discussion relatively simple, we'll turn
FIGURE 8.1: Robberies and Police for Large Cities in California, 1971-1992 (robberies per 1,000 people plotted against police per 1,000 people)
our attention to five California cities: Los Angeles, San Francisco, Oakland, Fresno, and
Sacramento. Figure 8.1 plots their per capita robbery and police data from 1971 to 1992.
Consistent with the OLS results on all cities, the message seems clear that robberies are
more common when there are more police. However, we actually have more information than
displayed in Figure 8.1. We know which city each observation comes from. Figure 8.2 re-plots the data from Figure 8.1, but in a way that differentiates by city. The underlying data is
exactly the same, but the observations for each city have different shapes. The observations
for Fresno are the circles in the lower left, the observations for Oakland are the triangles in
the top middle, and so forth. What does the relationship between police and crime look like
FIGURE 8.2: Robberies and Police for Specified Cities in California, 1971-1992 (robberies per 1,000 people plotted against police per 1,000 people, with observations marked by city: Oakland, San Francisco, Sacramento, Los Angeles, and Fresno)
now?
It’s still a bit hard to see so Figure 8.3 adds a fitted lines for each city. These are OLS
regression lines estimated on a city-by-city basis. All are negative and some are dramatically
so (Los Angeles and San Francisco). The claim that police reduce crime is looking much
better. Within each individual city, robberies tend to decline as police increase.
The difference between the pooled OLS results and these city-specific regression lines
presents a puzzle. How can the pooled OLS estimates suggest such a radically different
conclusion than Figure 8.3? The reason is the villain of this book – endogeneity.
Here’s how it happens. Think about what’s in the error term ‘it in Equation 8.1: gangs,
FIGURE 8.3: Robberies and Police for Specified Cities in California with City-specific Regression Lines, 1971-1992 (robberies per 1,000 people plotted against police per 1,000 people)
drugs, and all that. These factors almost certainly affect the crime across cities and are
plausibly correlated with the number of police, because cities with bigger gang or drug
problems hire more police officers. Many of these elements in the error term are also stable
within each city, at least in the twenty-year time frame we are looking at. A city that has
a culture or history of crime in year 1 probably has a culture or history of crime in year 20
as well. This is the case in our selected cities: San Francisco has lots of police and many
robberies while Fresno has not so many police and not so many robberies.
And here’s what creates endogeneity: These city-specific baseline levels of crime are
correlated with the dependent variable. The cities with the most robberies (Oakland, Los
Angeles, and San Francisco) have the most police. The cities with fewest robberies (Fresno
and Sacramento) have the fewest police. If we are not able to find another variable to control
for whatever is causing these differential levels of baselines – and, if it is something hard to
measure like history or culture or gangs or drugs, we may not be able to – then standard OLS will have endogeneity-induced bias and will lead us to the spurious inference we discussed above.
The problem we have identified here occurs in many contexts. Let’s look at another example
to get comfortable with identifying factors that can cause endogeneity. Suppose we want to
assess whether private schools produce better test scores than public schools and we begin with the pooled model

Test scoresit = β0 + β1Private schoolit + εit
where Test scoresit is the test score of student i at time t and Private schoolit is a dummy variable
that is 1 if student i is in a private school at time t. This model is for a (hypothetical) data
set in which we observe test scores for specific children over a number of years.
The following three simple questions help us identify possibly troublesome endogeneity.
What is in the error term? Test performance depends potentially not only on whether a
child went to a private school (a variable in the model) but also on his or her intelligence,
diligence, teacher’s ability, family support, and many other factors in the error term. While
we can hope to measure some of these factors, it is a virtual certainty that we will not be
Are there any stable unit-specific elements in the error term? Intelligence, diligence, and
family support are likely to be quite stable for individual students across time.
Are the stable unit-specific elements in the error term likely to be correlated with the indepen-
dent variable? It is quite likely that family support, at least, is correlated with attendance
at private schools because families with the wealth and/or interest in private schools are
likely to provide other kinds of educational support to their children. This tendency is by no
means set in stone because countless kids with good family support go to public schools and
there are certainly kids with no family support who end up in private schools. On average,
though, it is reasonable to suspect that kids in private schools have more family support. If
this is the case, then what may look to be a causal effect of private schools on test scores
may be little more than an indirect effect of family support on test scores.
Remember This
1. A pooled model with panel data ignores the panel nature of the data. The
equation is
Yit = β0 + β1Xit + εit
2. A common source of endogeneity when using a pooled model to analyze panel
data is that the specific units have different baseline levels of Y and these levels
are correlated with X. For example, cities with higher crime (meaning they have
high unit-specific error terms) also tend to have more police, creating a correlation
in a pooled model between the error term and the police independent variable.
In this section, we introduce fixed effects as a way to deal with at least part of the endogeneity
described in the previous section. We define the term and then show two ways to estimate fixed effects models.
Starting with Equation 8.1, we divide the error term, εit, into a fixed effect, αi, and a random error term νit (ν is the Greek letter nu, pronounced "new" even though it looks like a "v"). Our focus here is on αi; we'll assume the νit part of the error term is well-behaved,
meaning that it is homoscedastic and not correlated with any independent variable. We
can therefore re-write Equation 8.1 as

Crimeit = β0 + β1Policei,t−1 + αi + νit
A fixed effects model is simply a model that contains a parameter like αi that captures
differences in the dependent variable associated with each unit and/or period.
The fixed effect αi is that part of the unobserved error that is the same value for every observation for unit i. It basically reflects the average value of the dependent variable for unit i, after controlling for the independent variables. The unit is the unit of observation. Even though we write down only a single parameter (αi), we're actually representing a different value for each unit. That is, this parameter takes on a potentially different value for each unit. In the city crime model, the value of αi will be different for each city. If Pittsburgh had a higher average number of robberies than Portland, the αi for Pittsburgh would be higher than the αi for Portland.
The amazing thing about the fixed effects parameter is that it allows us to control for a vast array of unmeasured attributes of the units in the data set. These could correspond to gangs, drugs, culture, history, or factors
we haven’t even thought of. The key is that the fixed effect term allows different units to
Why is it useful to model fixed effects in this way? When fixed effects are in the error
term, as in the pooled OLS model, they can cause endogeneity and bias. But if we can pull
them out of the error term we will have overcome this source of endogeneity. We do so by
controlling for the fixed effects, which will take them out of the error term so that they no
longer can be a source for the correlation of the error term and an independent variable.
This strategy is similar to the one we pursued with multivariate OLS: We identified some
factor in the error term that could cause endogeneity and pulled it out of the error term by
How do we do pull the fixed effects out of the error term? Easy! We simply estimate a
different intercept for each unit. We can do so as long as we have multiple observations for
Concretely, we simply create a dummy variable for each unit and estimate an OLS model,
but now controlling for the fixed effects directly. This approach is called the least squares
dummy variable (LSDV) approach. In the LSDV approach, we create dummy variables
for each unit and include these dummy variables in the model:
Crimeit = β0 + β1Policei,t−1 + α1D1i + α2D2i + ... + αP−1D(P−1)i + νit

where D1i is a dummy variable that equals 1 if the observation is from the first unit (which in our crime example is city), D2i is a dummy variable that equals 1 if the observation is from the second unit, and so on to the (P − 1)th unit. We exclude the dummy for one unit because we can't have a dummy variable for every unit if we include β0 (for reasons we discussed on page 276 in Chapter 6).2 The data will look like the data in Table 8.2, which
includes the city, year, the dependent and independent variables, and the first three dummy
variables. In the Computing Corner, we show how to quickly create these dummy variables.
Table 8.2: Example of Robbery and Police Data for Cities in California
With this simple step we have just soaked up anything (anything) that is in the error
term that is fixed within unit over the time period of the panel.
We are really just running OLS with loads of dummy variables. In other words, we’ve seen
this before. Specifically, in Chapter 6 on page 276 we showed how to use multiple dummy
2 It doesn’t really matter which unit we exclude, we exclude the P th unit for convenience; plus it is fun to try to
pronounce (P ≠ 1)t h.
variables to account for categorical variables. Here, the categorical variable is the unit of observation (the city, in our crime example).
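In R, one way to implement the LSDV approach is to let factor() generate the unit dummies automatically. The sketch below uses simulated data with made-up city baselines and hypothetical variable names, so it illustrates the idea rather than reproducing the book's results:

# Simulated panel: three cities with different baseline robbery levels
set.seed(4)
crime <- data.frame(city = rep(c("Fresno", "Oakland", "Sacramento"), each = 20))
crime$police_lag <- rnorm(60, mean = rep(c(2.2, 3.2, 2.6), each = 20), sd = 0.2)
crime$robbery    <- rep(c(2, 9, 4), each = 20) - 1.5 * crime$police_lag +
                    rnorm(60, sd = 0.5)

pooled <- lm(robbery ~ police_lag, data = crime)                 # ignores the cities
lsdv   <- lm(robbery ~ police_lag + factor(city), data = crime)  # city dummies

coef(pooled)["police_lag"]   # pushed upward by the city baselines
coef(lsdv)["police_lag"]     # close to the true within-city effect of -1.5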
De-meaned approach
We shouldn’t let the old-news feel of the LSDV approach lead us to underestimate fixed
effects models. They’re actually doing a lot of work, work that we can better see when
we consider a second way to estimate fixed effects models, called the de-meaned approach. It's an odd term – it sounds like we're trying to humiliate data – but it describes well what we're doing. (Data is pretty shameless anyway.) When using the de-meaned approach, we subtract off the unit-specific averages from both the independent and dependent variables. The approach allows us to control for the fixed effects (the αi terms) without estimating a coefficient for each one.
Why might we want to do this? Two reasons. First, it can be a bit of a hassle creating
dummy variables for every unit and then wading through results with so many variables. A
model of voting in the United Nations, for example, could need roughly 200 dummy variables, one for each member country.
Second, the inner workings of the de-meaned estimator reveal the intuition behind fixed effects models. This reason is more important. The de-meaned model looks like

Yit − Ȳi· = β1(Xit − X̄i·) + ν̃it

where Ȳi· is the average of Y for unit i over all time periods in the data set and X̄i· is the
average of X for unit i over all time periods in the data set. In our crime data, Ȳ_Fresno· is the average crime in Fresno over the time frame of our data and X̄_Fresno· is the average police per capita in Fresno over the time frame of our data.3
Estimating a model using this transformed data will produce exactly the same coefficient and standard error estimates for β̂1 as produced by the LSDV approach.
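The de-meaning itself is easy to do by hand in R with ave(), which returns each observation's unit mean; the sketch below (simulated data and hypothetical names again) shows that the de-meaned slope matches the LSDV slope. Note that running plain lm() on hand-de-meaned data does not adjust the degrees of freedom for the estimated unit means, which is one reason canned fixed-effects routines are normally used for the standard errors:

# Simulated city-year panel (same structure as the LSDV sketch above)
set.seed(5)
crime <- data.frame(city = rep(c("A", "B", "C"), each = 20))
crime$police_lag <- rnorm(60, mean = rep(c(2.2, 3.2, 2.6), each = 20), sd = 0.2)
crime$robbery    <- rep(c(2, 9, 4), each = 20) - 1.5 * crime$police_lag +
                    rnorm(60, sd = 0.5)

# De-mean by city: ave() returns each observation's city average
crime$robbery_dm <- crime$robbery    - ave(crime$robbery,    crime$city)
crime$police_dm  <- crime$police_lag - ave(crime$police_lag, crime$city)

demeaned <- lm(robbery_dm ~ police_dm, data = crime)
lsdv     <- lm(robbery ~ police_lag + factor(city), data = crime)

coef(demeaned)["police_dm"]   # identical to ...
coef(lsdv)["police_lag"]      # ... the LSDV slope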
The de-meaned approach also allows us to see that fixed effects models convert data to
deviations from mean levels for each unit and variable. In other words, fixed effects models
are about differences within units, not differences across units. In the pooled model for our
city crime data, the variables reflect differences in police and robberies in Los Angeles relative
to police and robberies in Fresno. In the fixed effects model, the variables are transformed
to reflect how much robberies in Los Angeles differ from average levels in Los Angeles as a
function of how much police in Los Angeles differ from average levels of police in Los Angeles.
An example shows how this works. Recall the data on crime earlier, where we saw that
estimating the model with a pooled model led to very different coefficients than with the
fixed effects model. The reason for the difference was, of course, that the pooled model was
3 The de-meaned equation is derived by subtracting the same thing from both sides of Equation 8.3. Specifically, note that the average dependent variable for unit i over time is Ȳi· = β0 + β1X̄i· + αi + ν̄i·. If we subtract the left-hand side of this equation from the left-hand side of Equation 8.3 and the right-hand side of this equation from the right-hand side of Equation 8.3, we get Yit − Ȳi· = β0 + β1Xit + αi + νit − β0 − β1X̄i· − ᾱi· − ν̄i·. The α terms cancel because αi equals ᾱi· (the fixed effect for a unit is, by definition, the same for all observations of that unit in all time periods). Rearranging terms yields something that is almost Equation 8.5. For simplicity, we let ν̃it = νit − ν̄i·; this new error term will inherit the properties of νit, such as being uncorrelated with the independent variable and having a mean of zero.
plagued by endogeneity and the fixed effects model was not. How does the fixed effects model
fix things? Figure 8.4 presents illustrative data for two made-up cities, Fresnomento and Los
Frangelese. In panel (a) the pooled data is plotted as in Figure 8.1, with each observation
number indicated. The relationship between police and robberies looks positive and, indeed, the pooled regression line slopes upward.
In panel (b) of Figure 8.4 we plot the same data after it has been de-meaned. Table 8.3
shows how we generated the de-meaned data. Notice, for example, that observation 1 is from
Los Frangelese in 2010. The number of police (the value of Xit ) was 4, which is one of the
bigger numbers in the Xit column. When we compare this number to the average number
of police per 1, 000 people in Los Frangelese (which was 5.33), though, it is low. In fact, the
de-meaned value of the police variable for Los Frangelese in 2010 is -1.33, indicating that
the police per 1, 000 people was actually 1.33 lower than the average for Los Frangelese in
Although the raw values of Y get bigger as the raw values of X get bigger, the relationship between Yit − Ȳi· and Xit − X̄i· is quite different. Panel (b) of Figure 8.4 shows a clear negative relationship between the de-meaned variables.
FIGURE 8.4: Panel (a): Robberies per 1,000 people plotted against police per 1,000 people for two hypothetical cities (Los Frangelese and Fresnomento), with the pooled regression line. Panel (b): The same data de-meaned by city, with the regression line for the de-meaned (fixed effects) model.
Table 8.3: Robberies and Police Data for Hypothetical Cities in California
In practice, we seldom calculate the de-meaned variables ourselves. There are easy ways
to implement the model in Stata and R. We describe these techniques in the Computing Corner.
Table 8.4 shows the results for a basic fixed effects model for our city crime data. We
include the pooled results from Table 8.1 for reference. The coefficient on police per capita
has fallen from 2.37 to 1.49 once we include fixed effects, suggesting that there were indeed
more police officers in cities with higher baseline levels of crime. In other words, the fixed effects were real (meaning some cities have higher average robberies per capita even when controlling for the number of police) and these effects were correlated with the number of police officers. The fixed effects model controls for these city-specific averages and leads to a substantially lower estimate.
However, the coefficient still suggests every police officer per capita is associated with 1.49
more robberies. This estimate seems quite large and is highly statistically significant. We’ll
revisit this data once again in Section 8.4 with models that account for additional important
factors.
We should note that we do not indicate whether results in Table 8.4 were estimated with
LSDV or the de-meaned approach. Why? Because it doesn't matter. Either one would produce exactly the same estimates.
Remember This
1. A fixed effects model includes an αi term for every unit.
Yit = β0 + β1X1it + αi + εit
2. The fixed effects approach allows us to control for any factor that is fixed within
unit for the entire panel, whether or not we observe this factor.
3. There are two ways to produce identical fixed effects coefficient estimates for the
model,
(a) In the least squares dummy variable (LSDV) approach, we simply include
dummy variables for each unit except the excluded reference category.
(b) In the de-meaned approach, we transform the data such that the dependent
and independent variables indicate deviations from the unit mean.
Discussion Question
1. What factors influence student evaluations of professors in college
courses? Do instructors teaching large classes get evaluated less favor-
ably? Consider using the following model to assess the question based
on a data set of evaluations of instructors across multiple classes and
multiple years.
Fixed effects models are relatively easy to implement. In practice, though, there are several
elements that take a bit of experience to get used to. In this section we explore the conse-
quences of using fixed effects models when they’re necessary and when they’re not. We also
explain why fixed effects models cannot estimate some relationships even as they control for
them.
It’s useful to consider possible downsides of using fixed effects models. What if we control
for fixed effects when αi = 0 for all units? In this case, a pooled model that ignores fixed
effects cannot be biased. After all, if the fixed effects are zero, they don’t exist and they
cannot therefore cause bias. Could including fixed effects in this case cause bias? The answer
is no, for the same reasons we discussed earlier (in Chapter 5 on page 231) that controlling
for irrelevant variables does not cause bias. Bias occurs when errors are correlated with
independent variables, and as a general matter including extra variables does not cause errors to become correlated with independent variables.5
If the fixed effects are non-zero, we of course want to control for them. We should note,
5 Controlling for fixed effects when all the αi = 0 will lead to larger standard errors, though, so if we can establish that there is no sign of a non-zero αi for any unit, we may wish to also estimate a model without fixed effects. To test for unit-specific fixed effects we can implement an F test following the process discussed in Chapter 7 on page 351. The null hypothesis is H0: α1 = α2 = α3 = . . . = 0. The alternative hypothesis is that at least one of the fixed
effects is non-zero. The unrestricted model is a model with fixed effects (most easily thought of as the LSDV model
that has dummy variables for each specific unit). The restricted model is a model without any fixed effects, which is
simply the pooled OLS model. We provide computer code on pages 415 and 416.
however, that just because some (or many!) αi are non-zero does not necessarily mean that
our fixed effects model will produce different results from our pooled model. Recall that
bias occurs when errors are correlated with an independent variable. The fixed effects could
exist, but they are not necessarily correlated with the independent variables. In other words,
fixed effects must not only exist to cause bias; they must be correlated with the independent
variables to cause bias. It’s not at all impossible to observe instances in real data where
fixed effects exist but don't cause bias. In such cases, the coefficients from the pooled and fixed effects models will be similar.6
The prudent approach to analyzing panel data is therefore to control for fixed effects. If
the fixed effects are zero, we’ll get unbiased results even with the controls for fixed effects.
If the fixed effects are non-zero, we’ll get unbiased results that will differ or not from pooled
results depending on whether the fixed effects are correlated with the independent variable.
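Following the F test logic described in the footnote, one way to check whether the fixed effects matter at all is to compare the pooled (restricted) and LSDV (unrestricted) models directly; a sketch with simulated data and hypothetical variable names:

# Simulated data: four units, some with non-zero fixed effects
set.seed(6)
d <- data.frame(unit = rep(c("A", "B", "C", "D"), each = 15), x = rnorm(60))
d$y <- 1 + 0.8 * d$x + rep(c(0, 2, -1, 3), each = 15) + rnorm(60)

pooled <- lm(y ~ x, data = d)                  # restricted model: no fixed effects
lsdv   <- lm(y ~ x + factor(unit), data = d)   # unrestricted model: unit dummies

# F test of H0: all unit fixed effects are zero
anova(pooled, lsdv)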
A downside to fixed effects models is that they make it impossible to estimate effects for certain variables that we might be interested in. As is often the case, there is no free lunch.
Specifically, fixed effects models cannot estimate coefficients on any variables that are
fixed for all individuals over the entire time frame. Suppose, for example, that in the process
of analyzing our city crime data we wonder if cities in the north are more crime prone. We
6 A so-called Hausman test can be used to test whether fixed effects are causing bias. If the results indicate no sign
of bias when fixed effects are not controlled for, then we could use the random effects model discussed in Chapter 15
on page 748.
studiously create a dummy variable North_i that equals one if a city is in a northern state and zero otherwise, and we add this variable to our model.
Sadly, this approach won't work. The reason is easiest to see by considering the fixed
effects model in de-meaned terms. The north variable will be converted to North_it − North_i·.
What is the value of this de-meaned variable for a city in the north? The North_it part will
equal one for all time periods for such a city. But, wait, this means that North_i· will also
be one because that is the average of this variable for this northern city. That means the
value of the de-meaned north variable will equal zero for any city in the north. What is the
value of the de-meaned north variable for a non-northern city? Similar logic applies: The
North_it part will equal zero for all time periods and so too will North_i· for a non-northern
city. The de-meaned north variable will therefore also equal zero for non-northern cities. In
other words, the de-meaned variable will be zero for all cities in all years. The first job of a
variable is to vary. If it doesn't, well, that ain't no variable! Hence it will not be possible to
estimate a coefficient on the north variable.
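Here is a tiny R sketch (made-up numbers and hypothetical variable names) that makes the point concrete: the de-meaned north variable is zero everywhere, and in an LSDV regression the north dummy is perfectly collinear with the city dummies.

    df <- data.frame(
      city  = rep(c("A", "B"), each = 3),
      year  = rep(1:3, times = 2),
      north = rep(c(1, 0), each = 3),                        # fixed within each city
      crime = c(10, 12, 11, 5, 6, 7)
    )
    df$north_demeaned <- df$north - ave(df$north, df$city)   # de-meaned north variable
    df$north_demeaned                                        # all zeros
    lm(crime ~ factor(city) + north, data = df)              # north is reported as NA (dropped)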
More generally, a fixed effects model (estimated with either LSDV or the de-meaned approach)
cannot estimate a coefficient on a variable if the variable does not change within
units for all units. So even though the variable varies across cities (e.g., the North_i variable
7 Because we know that LSDV and de-meaned approaches produce identical results, we know that we will not be able
to estimate a coefficient on the north variable in an LSDV model either. This is the result of perfect multicollinearity:
the north variable is perfectly explained as the sum of the dummy variables for the northern cities.
is 1 for some cities and 0 for other cities), we can’t estimate a coefficient on it because within
cities it does not vary. This issue arises in many other contexts. In panel data where individ-
uals are the unit of observation, fixed effects models cannot estimate coefficients on variables
such as gender or race that do not vary within individuals. In panel data on countries, the
effect of variables such as area or being landlocked cannot be estimated when they do not vary over the span of the data.
Not being able to include such a variable does not mean fixed effects models do not control
for it. The unit specific fixed effect is controlling for all factors that are fixed within a unit
for the span of the data set. The model cannot parse out which of these unchanging factors
have which effect, but it does control for them via the fixed effects parameters.
Some variables might be fixed within some units but vary within other units. Those
we can estimate. For example, a dummy variable that indicates whether a city has more
than a million people will not vary for many cities that have been above or below one million
in population for the entire span of the panel data. However, if at least some cities have
risen above one million or declined below one million during the period covered in the panel
data, we can estimate a coefficient on this variable.
Panel data models need not be completely silent with regard to variables that do not
vary. We can investigate how unchanging variables interact with variables that do change.
For example, we can interact the north dummy variable with the police variable in our city crime model and estimate the model with city fixed effects. The coefficient on this interaction, which we can call β̂2, will tell us how different the coefficient on the police variable is for northern cities.
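A minimal R sketch of this idea (simulated data, hypothetical names): the north dummy itself cannot be estimated, but its interaction with the time-varying police variable can.

    set.seed(7)
    panel <- data.frame(
      city   = rep(paste0("city", 1:6), each = 5),
      north  = rep(c(1, 1, 1, 0, 0, 0), each = 5),   # time-invariant within city
      police = runif(30, 1, 3)                       # time-varying
    )
    panel$crime <- 10 - 1.0 * panel$police - 0.5 * panel$police * panel$north + rnorm(30)
    fe_interact <- lm(crime ~ police + police:north + factor(city), data = panel)
    coef(summary(fe_interact))["police:north", ]     # the analog of beta2-hat in the text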
Sometimes people are tempted to abandon fixed effects because they care about variables
that do not vary within unit. That’s cheating. The point of fixed effects is that if there is
something fixed within individuals across the panel that is correlated with an independent
variable, we risk bias. Bias is bad and we can’t just close our eyes to it to get to a coefficient
we want to estimate. In this case, the best case scenario is that we run a fixed effects model
and test for whether we need the fixed effects, find that we do not, and then proceed guilt
free. But let’s not get our hopes up. We almost always need the fixed effects.
Remember This
1. Fixed effects models do not cause bias when implemented in situations in which
α_i = 0 for all units.
2. Pooled OLS models are biased only when fixed effects are correlated with the
independent variable.
3. Fixed effects models cannot estimate coefficients on variables that do not vary
within at least some units. Fixed effects models do control for these factors,
though, as they are subsumed within the unit specific fixed effect.
Discussion Questions
1. Suppose we have panel data on voter opinions toward government spend-
ing in 1990, 1992, and 1994. Explain why we can or cannot estimate
the effect of each of the following in a fixed effects model:
(a) Gender
(b) Income
(c) Race
(d) Party identification
2. Suppose we have panel data on the annual economic performance of 100
countries from 1960 to 2010. Explain why we can or cannot estimate
the effect of each of the following in a fixed effects model:
(a) Average education
(b) Democracy, which is coded 1 if political control is determined by
competitive elections
(c) Country size
(d) Proximity to equator
3. Suppose we have panel data on the annual economic performance of 50
U.S. states from 1960 to 2010. Explain why we can or cannot estimate
the effect of each of the following in a fixed effects model:
(a) Average education
(b) Democracy, which is coded 1 if political control is determined by
competitive elections
(c) State size
(d) Proximity to Canada
So far we have presented models in which there is a fixed effect for the unit of observation.
We refer to such models as one-way fixed effect models. We can generalize the approach
to a two-way fixed effects model in which we allow for fixed effects not only at the unit
level but also at the time level. That is, just as some cities might have more crime than
others (due to unmeasured history of violence or culture), some years might have more crime
than others due to unmeasured factors. Therefore we add a time fixed effect to our model,
making it

Y_it = α_i + τ_t + β1 X_it + ε_it

where we've taken Equation 8.3 from page 377 and added τ_t (the Greek letter tau), which
accounts for differences in crime for all units in year t. This notation provides a short-hand
way to indicate that each separate time period gets its own τ_t effect on the dependent variable
(in addition to the α_i effect on the dependent variable for each individual unit of observation).
Similar to our one-way fixed effects model, the single parameter for a time fixed effect
indicates the average difference for all observations in a given year, after having controlled
for the other variables in the model. A positive fixed effect for the year 2008 (τ_2008) would
indicate that, controlling for all other factors, the dependent variable was higher for all units
in the data set in 2008. A negative fixed effect for the year 2014 (τ_2014) would indicate that,
controlling for all other factors, the dependent variable was lower for all units in the data
set in 2014.
There are lots of examples where we suspect a time fixed effect may be appropriate:
• The whole world suffered an economic downturn in 2008 due to a financial crisis starting
in the United States. Hence any model with economic dependent variables could merit
a time fixed effect to soak up this distinctive characteristic of the economy in 2008.
• Approval of political institutions went way up in the United States after the September
11, 2001 terrorist attacks. This was clearly a time specific factor that affected the entire
country.
We can estimate a two-way fixed effects model in several different ways. The simplest approach is
to extend the LSDV approach to include dummy variables both for units and for time periods.
We can also use a two-way de-meaned approach.8 We can also use a hybrid LSDV/de-meaned
approach in which we de-mean the data within units and include dummy variables for the time periods.
Table 8.5 shows the huge effect that using a two-way fixed effects model has on our
analysis of the city crime data. For reference, it shows the pooled OLS and one-way fixed
8 The algebra is a bit more involved than for a one-way model, but the result has a similar feel:

(Y_it − Y_i· − Y_·t + Y_··) = β1 (X_it − X_i· − X_·t + X_··) + (ε_it − ε_i· − ε_·t + ε_··)

where the dot notation indicates what is averaged over, such that Y_i· is the average value of Y for unit i over time,
Y_·t is the average value of Y for all units at time t, and Y_·· is the average over all units and all time periods. Don't
worry, we almost certainly won't have to create these variables ourselves; the equation is included just to provide a sense of
how a one-way fixed effects model extends to a two-way fixed effects model.
effects results. The third column displays the results for a two-way fixed effects model
controlling only for police per capita. In contrast to the pooled and one-way models, the
coefficient is small (0.14) and statistically insignificant, suggesting that police spending and
crime were both high in certain years, a pattern that biased the earlier estimates. Once we controlled for the fact that robberies were common
in some years throughout the country (possibly due, for example, to the crack epidemic that
was more serious in some years than others), we were able to net out a source of substantial
bias.
The fourth and final column reports two-way fixed effects results from a model that also
controls for the lagged per capita robbery rate in each city in order to control for city-specific
trends in crime. The estimate from this model implies that an increase of one police officer
per 100,000 people is associated with a decrease of 0.202 robberies per capita. The effect is statistically significant.
It is useful to take a moment to appreciate that not all models are created equal. A
cynic might look at the results in Table 8.5 and conclude that statistics can be made to say
anything. But this is not the right way to think about the results. The models do indeed
produce different results, but there are reasons for the differences. One of the models is
better. A good statistical analyst will know this. Using statistical logic we can explain why
the pooled results are suspect. We know pretty much what is going on: There are fixed
9 The additional control variable is called a lagged dependent variable. Inclusion of such a variable is common in
analysis of panel data. These variables often are highly statistically significant, as is the case here. These types of
control variables raise some complications, which we address in Chapter 15 on advanced panel data models.
Table 8.5: Burglary and Police Officers, Pooled versus Fixed Effect Models, 1951-1992
effects in the error term of the pooled model that are correlated with the police variable,
thereby biasing the pooled OLS coefficients. So while there is indeed output from statistical
software that could be taken to imply that police cause crime, we know better. Treating
all results as equivalent is not serious statistics; that’s just pressing buttons on a computer.
Instead of supporting statistical cynicism, this example testifies to the benefits of appropriate
analysis.
Remember This
1. A two-way fixed effects model accounts for both unit and time specific errors.
2. A two-way fixed effects model is written as
Y_it = α_i + τ_t + β1 X_it + ε_it
3. Estimation of a two-way fixed effects model can be done with an LSDV approach
(which has dummy variables for each unit and each period in the data set), with
a de-meaned approach, or with a combination of the two.
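As a quick check on the claim that these estimation routes agree, here is a small R simulation sketch (made-up data): the two-way LSDV regression and the two-way de-meaned regression described in the footnote above return the same slope. (The de-meaned version's reported standard errors are not adjusted for the absorbed dummies, so only the point estimates should be compared.)

    set.seed(3)
    d <- expand.grid(city = 1:20, year = 1:8)
    d$x <- rnorm(nrow(d))
    d$y <- 1.5 * d$x + rep(rnorm(20), times = 8) +                    # unit effects
           rep(rnorm(8), each = 20) + rnorm(nrow(d))                  # time effects + noise
    lsdv <- lm(y ~ x + factor(city) + factor(year), data = d)         # two-way LSDV
    dm <- function(v) v - ave(v, d$city) - ave(v, d$year) + mean(v)   # two-way de-meaning
    demeaned <- lm(dm(d$y) ~ dm(d$x) - 1)
    c(coef(lsdv)["x"], coef(demeaned))                                # identical point estimates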
Consider, for example, a fixed effects model of trade between pairs of countries,

Bilateral trade_it = α_i + β1 Alliance_it + ε_it

where Bilateral trade_it is total trade volume between countries in dyad i at time t. A dyad
is something that consists of two elements. Here, a dyad indicates a pair of countries, and
the data indicate how much trade flows between them. For example, the United States
and Canada form one dyad, the United States and Japan form another dyad, and so on.
Alliance_it is a dummy variable that is 1 if the countries in the dyad have entered into a security
alliance at time t. The α_i term captures the amount by which trade in a given dyad is
consistently higher or lower than trade in other dyads.
Because the unit of observation is country-pair dyad, fixed effects here relate to factors
related to a pair of countries. For example, the fixed effect for the United States - New
Zealand dyad in the trade model may be higher because of shared language. The fixed effect
for the China-India dyad might be negative because the two countries are divided by mountains (which
make trade more difficult).
As we consider whether a fixed effects model is necessary, we need to think about whether
the dyad specific fixed effects could be correlated with the independent variables. Dyad spe-
cific fixed effects could exist because of a history of commerce between two countries, a
favorable trading geography (not divided by mountains, for example), economic complemen-
tarities of some sort, and so on. These factors could also make it easier or harder to form
alliances.
Table 8.6 reports results from Green, Kim, and Yoon (2001) based on data covering trade
and alliances from 1951 to 1992. The dependent variable is the amount of trade between
the two countries in a given dyad in a given year. In addition to the alliance measure, the
independent variables are GDP (total gross domestic product of the two countries in the
dyad), P opulation (total population of the two countries in the dyad), Distance (distance
between the capitals of the two countries in the dyad), and Democracy (the minimum value
of a democracy ranking for the two countries in the dyad; the higher the value the more
democracy).
The dependent and continuous independent variables are logged. Logging variables is a
common practice in this literature; the interpretation is that a one percent increase in any
independent variable is associated with a β̂ percent increase in trade volume. (We discussed
logged variables in Chapter 7.)
The results are remarkable. In the pooled model, Alliance is associated with a 0.745
percentage point decline in trade. In the one-way fixed effects model, the estimate completely
flips and is associated with a 0.777 increase in trade. In the two-way fixed effects model,
the estimated effect remains positive and significant but drops to 0.459. The coefficients on
Population and Democracy also flip while being statistically significant across the board.
Table 8.6: Bilateral Trade, Pooled versus Fixed Effect Models, 1951-1992
These results are shocking. If someone told us that they were going to estimate an “OLS
model of bilateral trade relations" we'd be pretty impressed, right? But actually, that model
produces conclusions almost completely opposite to those from the more appropriate fixed effects models.
There are other interesting things going on as well. The coefficient on Distance disap-
pears in the fixed effects models. Yikes! What’s going on? The reason, of course, is that
the distance between two countries does not change. Fixed effects models cannot estimate
coefficients on such a variable because it does not vary within unit over the course of the
panel. Does that mean that the effect of distance is not controlled for? That would seem to
be a problem because distance certainly affects trade. It’s not a problem, though, because
even though fixed effects models cannot estimate coefficients on variables that do not vary
within unit of observation (which is dyad pairs of countries in this data set), the effects of
these variables are controlled for via the fixed effect. And, even better, not only is the effect
of distance controlled for, so too are hard-to-measure factors such as being on a trade route
or having cultural affinities. That’s what the fixed effect is - a big ball of all the effects that
are the same within units for the period of the panel.
Not all coefficients flip. The coefficient on GDP is relatively stable, indicating that unlike
the variables that do flip signs from the pooled to fixed effects specifications, GDP does
not seem to be correlated with the unmeasured fixed effects that influence trade between
countries.
8.5 Difference-in-difference
The logic of fixed effects plays a major role in difference-in-difference models, models
that look at differences in changes in treated states compared to untreated units and are
particularly useful in policy evaluation. In this section, we explain the logic of this approach,
show how to use OLS to estimate these models, and then link the approach to the two-way
fixed effects model.
Difference-in-difference logic
Many U.S. states have passed so-called "stand your ground" laws that allow individuals to use lethal force when they reasonably believe they are
threatened.10 Does such a law prevent homicides by making criminals fearful of resistance?
Naturally, we would start by looking at the change in homicides in a state that passed
the law. This approach is what every policy-maker in the history of time uses to assess the
impact of a policy change. Suppose we find homicides went up in the states that passed the
law. Is that fact enough to lead us to conclude that the law increases crime?
It doesn’t take a ton of thinking to realize that such evidence is pretty weak. Homicides
could rise or fall for a lot of reasons, many of them completely unrelated to stand your
ground laws. If homicides went up not only in the state that passed the law, but in all states
10 See McClellan and Tekin (2012) and Cheng and Hoekstra (2013).
– even where there was no policy change – then we can’t seriously blame the law for the rise
What we really want to do is to look at differences in the state that passed the policy
compared to differences in other similar states that did not pass the law. To use experimental
language, we want to look at the difference in treated states compared to the difference in untreated (control) states:
YT − YC   (8.9)
where YT is the change in the dependent variable in treated states (those that passed the
policy) and YC is the change in the dependent variable in the untreated states that did not
pass the policy. We call this approach the difference-in-difference approach because we look
at the difference in the changes in treated and untreated states.
Rather than calculating the changes in treated and untreated states by hand and then taking the difference, we'll use OLS to
produce the same result. The advantage is that OLS will also spit out standard errors on
our estimate. We can also easily add additional control variables when we use OLS.
The basic difference-in-difference OLS model is

Y_it = β0 + β1 Treated_i + β2 After_t + β3 (Treated_i × After_t) + ε_it

where Treated_i equals 1 for a treated state and 0 for a control state, After_t equals 1 for
all after observations (from both control and treated units), and Treated_i × After_t is an
interaction of Treated_i and After_t. This interaction variable will equal 1 for treated states
in the after period and 0 otherwise.
The control states have some mean level of homicides that we denote with β0; the treated
states have some mean level of homicides that we denote with β0 + β1. If β1 is
positive, the mean level for the treated states is higher than in control states. If β1 is negative,
the mean level for the treated states is lower. If β1 is zero, the mean level for the
treated states is the same as in control states. This pre-existing difference in mean levels was
there before the law was even passed, so the law can't be the cause of these differences. Instead,
the differences represented by β1 are simply the pre-existing differences between the treated and
untreated states. This parameter is analogous to a unit fixed effect, although here it applies to
the entire group of treated states rather than to each individual unit.
The model captures national trends with the β2 After_t term. The dependent variable for
all states, treated and not, changes by β2 in the after period. This parameter is analogous
to a time fixed effect, although for the entire post-treatment period rather than individual
time periods.
The key coefficient is β3. This is the coefficient on the interaction between Treated_i and
After_t. This variable equals 1 only for treated units in the after period. The coefficient
tells us how much additional change occurred in the treated states after the policy went into effect,
after controlling for pre-existing differences between the treated and control states (β1) and
differences between the before and after periods for all states (β2).
If we work out the fitted values for changes in treated and control states, we can see how
this regression model produces a difference-in-difference estimate. First, note that the fitted
value for treated states in the after period is β0 + β1 + β2 + β3 (because Treated_i, After_t,
and Treated_i × After_t all equal 1 for treated states in the after period). Second, note that
the fitted value for treated states in the before period is β0 + β1 (because After_t and the
interaction both equal 0 in the before period), so the change for treated states is β2 + β3. The
fitted value for control states in the after period is β0 + β2 (because Treated_i and
Treated_i × After_t equal 0 for control states). The fitted value for control states
in the before period is β0, so the change for control states is β2. The difference in differences
is therefore (β2 + β3) − β2 = β3.
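A short simulated R sketch (made-up data and effect sizes) shows that the coefficient on the interaction term from OLS is exactly the difference-in-difference of the four group means:

    set.seed(1)
    n <- 400
    treated <- rep(c(0, 1), each = n / 2)
    after   <- rep(c(0, 1), times = n / 2)
    y <- 2 + 1 * treated + 0.5 * after + 0.8 * treated * after + rnorm(n, sd = 0.3)
    d <- data.frame(y, treated, after)

    did <- lm(y ~ treated * after, data = d)          # difference-in-difference via OLS
    coef(did)["treated:after"]                        # estimate of beta3

    means <- tapply(d$y, list(d$treated, d$after), mean)
    (means["1", "1"] - means["1", "0"]) -             # change for treated ...
      (means["0", "1"] - means["0", "0"])             # ... minus change for controls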
Figure 8.5 displays two examples that illustrate the logic of difference-in-difference models.
In panel (a) there is no treatment effect. The dependent variables for the treated and control
states differ in the before period by β1. Then the dependent variable for both the treated
and control units rose by β2 in the after period. In other words, Y was bigger for the treated
than the control before and after the treatment by the same amount. The implication is that
the treatment had no effect even though Y went up in treatment states after they passed
the law.
Panel (b) shows an example with a treatment effect. The dependent variables for the
treated and control states differ in the before period by β1. Then the dependent variable for
both the treated and control units rose by β2 in the after period, but the value of Y for the
treated unit rose yet another β3. In other words, the treated group was β1 bigger than the
control before the treatment, and β1 + β3 bigger than the control after the treatment. The
implication is that the treatment caused a β3 bump over and above the pre-existing across-unit
difference and the common over-time change.
Consider how the difference-in-difference approach would assess outcomes in our gun law
example. If homicides declined in states with stand your ground laws more than in states
without such laws, the evidence supports the claim that the law prevented homicides. Such
an outcome could happen if homicides went down by 10 in states with the law and went
down only by 2 in other states. Such an outcome could also happen if homicides actually
went up by 2 in states with stand your ground laws but went up by 10 in other states. In both cases, homicides in states with the law fell by 8 relative to homicides in the other states.
One great thing about using OLS to estimate difference-in-difference models is that it is
easy to control for other variables in the OLS model. Simply include them as covariates
and do what we've been doing. In other words, simply add a β4 X_it term (and additional terms for any other control variables) to the difference-in-difference model.
FIGURE 8.5: Difference-in-difference Examples. Panel (a): no treatment effect. Panel (b): treatment effect.
A difference-in-difference model works not only with panel data but also with rolling cross-
section data. Rolling cross section data consists of data from each treated and untreated
region in which the individual observations come from different individuals across time pe-
riods. An example of a rolling cross section of data is a repeated national survey of people
about their health insurance over multiple years. We could look to see if state-level decisions
about Medicaid coverage in 2014 led to different changes in treated states relative to untreated
states. For such data we can easily create dummy variables indicating whether the observation
came from a treated state or not and whether the observation was in the before or after period.
For panel data, we can also generate difference-in-difference estimates with a two-way fixed effects model:

Y_it = α_i + τ_t + β3 (Treated_i × Post_t) + β4 X_it + ε_it

where
• the α_i terms (the unit specific fixed effects) capture differences that exist across units
• the τ_t terms (the time specific fixed effects) capture differences that exist across all units
in every period. If homicide rates are higher in 2007 than in 2003, then the τ_t for 2007 will be larger than the τ_t for 2003
• β3 is the coefficient on the interaction of Treated_i, which indicates whether the observation is from
a treatment unit (meaning in our case that Treated_i = 1 for states that passed stand
your ground laws), and Post_t, which indicates whether or not the observation is post-
treatment (meaning in our case that the observation comes after the state passed a stand
your ground law). This interaction variable will equal 1 for treated states in the post-treatment period and 0 otherwise.
Table 8.7 shows an analysis of stand your ground laws by Georgia State University
economists Chandler McClellan and Erdal Tekin. They implemented a state and time fixed
effect version of a difference-in-difference model and found that the homicide rate per 100,000
residents went up by 0.033 after the passage of the stand your ground laws. In other words,
controlling for pre-existing differences in state homicide rates (via state fixed effects), national
trends in homicide rates (via time fixed effects), and additional controls related to
race, age, and percent of residents living in urban areas, they found that homicide rates rose after stand your ground laws were passed.
Table 8.7: Effect of Stand Your Ground Laws on Homicide Rate Per 100,000 Residents

Variable                    Coefficient
Stand your ground laws      0.033**
                            (0.013)
                            [t = 2.54]
State fixed effects         Included
Period fixed effects        Included

Adapted from Appendix Table 1 of McClellan and Tekin (2012). Standard errors are in parentheses; ** indicates p < 0.05, two-tailed test. The model includes controls for racial, age, and urban demographics.
Remember This
A difference-in-difference model estimates the effect of a change in policy by comparing
changes in treated units to changes in control units.
1. A basic difference-in-difference estimator is YT − YC , where YT is the change
in the dependent variable for the treated unit and YC is the change in the
dependent variable for a control unit.
2. Difference-in-difference estimates can be generated from the following OLS model:
Y_it = β0 + β1 Treated_i + β2 After_t + β3 (Treated_i × After_t) + β4 X_it + ε_it
3. For panel data, we can use a two-way fixed effects model to estimate difference-
in-difference effects.
Y_it = α_i + τ_t + β3 (Treated_i × After_t) + β4 X_it + ε_it
where the α_i fixed effects capture differences in units that existed both before and
after treatment and the τ_t fixed effects capture differences common to all units in each time
period.
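A minimal simulated R sketch of this two-way fixed effects difference-in-difference regression (all names and effect sizes are invented for illustration):

    set.seed(5)
    d <- expand.grid(state = 1:30, year = 2000:2009)
    d$treated      <- as.numeric(d$state <= 10)              # states that eventually adopt the law
    d$post         <- as.numeric(d$year >= 2005)
    d$treated_post <- d$treated * d$post
    d$y <- 5 + rep(rnorm(30), times = 10) +                  # state fixed effects
           rep(rnorm(10), each = 30) +                       # year fixed effects
           0.4 * d$treated_post + rnorm(nrow(d), sd = 0.2)   # true treatment effect is 0.4
    twfe <- lm(y ~ treated_post + factor(state) + factor(year), data = d)
    coef(summary(twfe))["treated_post", ]                    # difference-in-difference estimate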
Discussion Questions
1. For each of the four panels in Figure 8.6, indicate the values of β0, β1, β2, and β3.
2. For each of the following examples, explain how to create (i) a simple
difference-in-difference model.
(c) Some neighborhoods in Los Angeles had zoning changes that made
FIGURE 8.6: Difference-in-difference examples for Discussion Question 1. Each of the four panels plots Y for treated and control units in the before and after periods.
8.6 Conclusion
Again and again we've emphasized the importance of exogeneity. If X is uncorrelated with
ε we get unbiased estimates and we are happy. Experiments are sought after because the
randomization in them ensures – or at least aids – exogeneity. With OLS we can – sometimes,
maybe, almost, sort of, kind of – approximate exogeneity by soaking up enough of the error
term with measured variables such that what remains correlates little or not at all with X.
Realistically, though, we know that we will not be able to measure everything. Real
variables with real causal force will almost certainly lurk in the error term. Are we stuck?
Turns out, no (or, at least, not yet). We’ve got a few more tricks up our sleeve. One of the
best tricks is to use fixed effects tools. Although uncomplicated, the fixed effects approach
can knock out a whole class of unmeasured (and even unknown) variables that lurk in the
error term. Simply put, any factor that is fixed across time periods for each unit or fixed
across units for each time period can be knocked out of the error term. Fixed effects tools
are powerful and, as we have seen in real examples, they can produce results that differ
dramatically from pooled OLS results.
• Section 8.1: Explain how a pooled model can be problematic when analyzing
panel data.
• Section 8.2: Write down a fixed effects model and explain the fixed effect. Give examples
of the kinds of factors subsumed in a fixed effect. Explain how to estimate a fixed effects model.
• Section 8.3: Explain why coefficients on variables that do not vary within unit cannot be
estimated in fixed effects models. Explain how these variables are nonetheless controlled for.
• Section 8.5: Explain the logic behind a difference-in-difference estimator. Provide and
explain an OLS model that produces difference-in-difference estimates.
Further Reading
Chapter 15 discusses advanced panel data models. Baltagi (2005) is a more technical survey of panel data methods.
Green, Kim, and Yoon (2001) provide a nice discussion of panel data methods in international
relations. Wilson and Butler (2007) re-analyze articles that did not use fixed effects
and show how accounting for fixed effects can change the results.
If we use pooled OLS to analyze panel data sets we are quite likely to have errors that
are correlated within unit in the manner discussed on page 104. This correlation of errors
will not cause OLS β̂1 estimates to be biased, but it will make the standard OLS equation
for the variance of β̂1 inappropriate. While fixed effects models typically account for a
substantial portion of the correlation of errors, there is also a large literature on techniques
to deal with the correlation of errors in panel data and difference-in-difference models. We
discuss one portion of this literature when we cover random effects models in Chapter 15.
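One common way to address within-unit error correlation in practice is to report cluster-robust standard errors. Here is a minimal R sketch (simulated data; it relies on the add-on sandwich and lmtest packages):

    library(sandwich)
    library(lmtest)
    set.seed(9)
    d <- data.frame(unit = rep(1:50, each = 6))
    d$x <- rep(rnorm(50), each = 6) + rnorm(300)                 # x varies partly at the unit level
    d$y <- 1 + 2 * d$x + rep(rnorm(50), each = 6) + rnorm(300)   # errors correlated within unit
    pooled <- lm(y ~ x, data = d)
    coeftest(pooled)["x", ]                                      # conventional standard error
    coeftest(pooled, vcov. = vcovCL(pooled, cluster = d$unit))["x", ]   # cluster-robust standard error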
Bertrand, Duflo, and Mullainathan (2004) show that standard error estimates for difference-in-difference
models can be much too small when the correlation of errors over time within units is ignored.
Hausman and Taylor (1981) discuss an approach for estimating parameters on time-invariant
covariates.
Key Terms
• De-meaned approach (380)
• Difference-in-difference model (401)
• Dyad (397)
• Fixed effect (376)
• Fixed effects model (377)
• Least squares dummy variable (LSDV) approach (378)
• One-way fixed effect model (393)
• Panel data (367)
• Pooled model (369)
• Rolling cross-section data (407)
• Two-way fixed effect model (393)
Computing Corner
Stata
1. To estimate a panel data model using the LSDV approach, we run an OLS model with
dummy variables for each unit.
(a) Generate dummy variables for each unit:
tabulate City, generate(CityDum)
This command generates a variable called “CityDum1” that is 1 for observations
from the first city listed in “City” and 0 otherwise, a variable called “CityDum2”
that is 1 for observations from the second city listed in “City,” and so on.
(b) Estimate the model with the regress command: regress Y X1 X2 X3 CityDum2-CityDum50.
The notation CityDum2-CityDum50 tells Stata to include each
of the city dummies from CityDum2 to CityDum50. As we discussed in Chapter 7,
we need an excluded category. By starting at CityDum2 in our list
of dummy variables, we are setting the first city as the excluded reference category.
(c) To use an F test to test whether the fixed effects are all zero, the unrestricted model
is the model with the dummy variables we just estimated. The restricted model is
a regression model without the dummy variables (also known as the pooled model):
regress Y X1 X2 X3.
2. To estimate a one-way fixed effects model using the de-meaned approach:
xtreg y x1 x2 x3, fe i(City)
The subcommand of , fe tells Stata to estimate a fixed effects model. The i(City)
subcommand tells Stata to use the City variable to identify the city for each observation.
3. To estimate a two-way fixed effects model:
(a) Create dummy variables for years:
tabulate Year, gen(Yr)
This command generates a variable called “Yr1” that is 1 for observations in the
first year and 0 otherwise, a variable called “Yr2” that is 1 for observations in the
second year and 0 otherwise and so on.
(b) Run Stata’s built-in one-way fixed effects model and also include the dummies for
the years:
xtreg Y X1 X2 X3 Yr2-Yr10, fe i(City)
where Yr2-Yr10 is a shortcut way of including every Yr variable from Yr2 to Yr10.
4. There are several ways to implement difference-in-difference models:
(a) To implement a basic difference-in-difference model, type reg Y Treat After TreatAfter
X2, where Treat indicates membership in the treatment group, After indicates that the period
is the after period, TreatAfter is the interaction of the two variables, and X2
is one (or more) control variables.
(b) To implement a panel data version of a difference-in-difference model, type xtreg
Y TreatAfter X2 Yr2-Yr10, fe i(City).
(c) To plot the basic difference-in-difference results, plot separate fitted lines for the
treated and untreated groups:
graph twoway (lfit Y After if Treat ==0) (lfit Y After if Treat ==1).
R
1. To estimate a panel data model using the LSDV approach, we run an OLS model with
dummy variables for each unit.
(a) While it is possible to name and include dummy variables for every unit, doing so
can be a colossal pain when we have lots of units. It is usually easiest to use the
factor command, which will automatically include dummy variables for each unit.
The code is lm(Y ~ X1 + factor(unit)). This command will estimate a model
in which there is a dummy variable for every unique value of the unit variable.
For example, if our data looked like Table 8.2 on page 379, including
a factor(city) term in the regression code would lead to dummy variables being
included for each city.
(b) To implement an F test on the hypothesis that all fixed effects (both unit and time)
are zero, the unrestricted equation is the full model and the restricted equation is
the model with no fixed effects.
Unrestricted = lm(Y ~ X1 + factor(unit) + factor(time))
Restricted   = lm(Y ~ X1)
See page 360 for more details on how to implement an F test in R.
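A minimal sketch of this comparison, assuming the two models above have been stored under those names, uses R's built-in anova function:

    anova(Restricted, Unrestricted)   # F test of the null that all fixed effects are zero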
2. To estimate a one-way fixed effect model using the de-meaned approach use one of
several add-on packages that automate the steps in panel data analysis. We discussed
how to install an R package in Chapter 3 on page 130. For fixed effects models we can
use the “plm” command from the “plm” package.
(a) Install the package by typing install.packages("plm"). Once installed on a
computer, the package can be brought into R’s memory with the library(plm)
command.
(b) The “plm” command works like the “lm” command. We indicate the dependent
variable and the independent variables for the main equation. We need to indicate
what the units are with the index = c("city", "year") command. These are the
variable names that indicate your units and time variables, which will vary depend-
ing on the data set. For a one-way fixed effects model, include model="within".
library(plm)
plm(Y ~ X1 + X2, index = c("city"), model = "within")
3. To estimate a two-way fixed effects model, we have two options.
(a) We can simply include time dummies as covariates in a one-way fixed effects model
plm(Y ~ X1 + X2 + factor(year), index = c("city"), model = "within")
(b) We can use the plm command and indicate the unit and time variables with the
index = c("city", "year") command. These are the variable names that indi-
cate your units and time variables, which will vary depending on your data set. We
also need to include the subcommand effect="twoways".
plm(Y ~ X1 + X2, index = c("city", "year"), model = "within",
    effect = "twoways")
4. There are several ways to implement difference-in-difference models:
(a) To implement a basic difference-in-difference model, type lm(Y ~ Treat + After
+ TreatAfter + X2), where Treat indicates membership in the treatment group, After
indicates that the period is the after period, TreatAfter is the interaction of the two
variables, and X2 is one (or more) control variables.
(b) To implement a panel data version of a difference-in-difference model, type lm(Y
~ TreatAfter + factor(Unit) + factor(Year) + X2).
(c) To plot the basic difference-in-difference results, plot separate fitted lines for the
treated and untreated groups:
plot(After, Y, type = "n")
abline(lm(Y[Treat==0] ~ After[Treat==0]))
abline(lm(Y[Treat==1] ~ After[Treat==1]))
Exercises
1. Researchers have long been interested in the relationship between economic factors and
presidential elections. The PresApproval.dta data set includes data on presidential
approval polls and unemployment rates by state over a number of years. Table 8.8 lists
the variables.
a. Using pooled data for all years, estimate a pooled OLS regression explaining pres-
idential approval as a function of state unemployment rate. Report the estimated
regression equation and interpret the results.
Table 8.8: Variables for Presidential Approval Question
b. Many political observers believe politics in the South are different. Add South as
an additional independent variable and re-estimate the model from part (a). Report
the estimated regression equation. Do the results change?
c. Re-estimate the model from part (b) controlling for state fixed effects using the de-
meaned approach. How does this approach affect the results? What happens to the
South variable in this model? Why? Does this model control for differences between
Southern and other states?
d. Re-estimate the model from part (c) controlling for state fixed effects using the LSDV
approach. (Do not include a south dummy variable). Compare the coefficients and
standard errors for the unemployment variable.
e. Estimate a two-way fixed effects model. How does this model affect the results?
2. How do young people respond to economic conditions? Are they more likely to pursue
public service when jobs are scarce? To get at this question, we’ll analyze data in
PeaceCorps.dta, which contains variables on state economies and applications to the
Peace Corps. Table 8.9 lists the variables.
Table 8.9: Variables for Peace Corps Question
a. Before looking at the data, what relationship do you hypothesize between these two
variables? Explain your hypothesis.
b. Run a pooled regression of Peace Corps applicants per capita on the state unemploy-
ment rate and year dummies. Describe and critique the results.
c. Plot the relationship between the state economy and Peace Corps applications. Does
any single state stick out? How may this outlier affect the estimate on unemployment
rate in the pooled regression in part (b) above? Create a scatterplot without the
unusual state and comment briefly on the difference from the scatterplot with all
observations.
d. Run the pooled model from part (b) without the outlier. Comment briefly on the
results.
e. Run a two-way fixed effect model without the outlier using the LSDV approach. Do
your results change from the pooled analysis? Which results are preferable?
f. Run a two-way fixed effects model without the outlier using the fixed effects command
in Stata or R. Compare to LSDV results.
3. We wish to better understand the factors that contribute to a student’s favorable overall
evaluation of an instructor. The data set TeachingEval HW.dta contains average faculty
evaluation scores, class size, a dummy variable indicating required courses, and the
percent of grades that were A- and above. Table 8.10 lists the variables.
Table 8.10: Variables for Teaching Evaluation Questions
a. Estimate a model ignoring the panel structure of the data. Use overall evaluation
of the instructor as the dependent variable and the class size, required, and grades
variables as independent variables. Report and briefly describe the results.
b. Explain what a fixed effect for each of the following would control for: instructor,
course, and year.
c. Using the equation from part (a), estimate a model that includes a fixed effect for
instructor. Report your results and explain any differences from part (a).
d. Now estimate a two-way fixed effects model with year as an additional fixed effect.
Report and briefly describe your results.
4. In 1993, Georgia initiated a HOPE scholarship program to let state residents who had at
least a B average in high school attend public college in Georgia for free. The program is
not need based. Did the program increase college enrollment? Or did it simply transfer
funds to families who would have sent their children to college anyway? Dynarski (2000)
assessed this question using data on young people in Georgia and neighboring states.12
Table 8.11 lists the variables.
Table 8.11: Variables for the HOPE Scholarship Question
5. Table 8.12 describes variables in TexasSchools.dta, a data set covering 1,020 Texas
school board districts and teachers’ salaries in them from 2003 to 2009. Anzia (2012)
used this data to estimate the effect of election timing on teachers’ salaries in Texas.
Some believe that teachers will be paid more when school board members are elected in
“off-cycle” elections where only school board members are up for election. The idea is
that teachers and their allies will mobilize for these elections while many other citizens
will tune out. In this view, teachers’ salaries will be relatively lower when school boards
are elected in “on-cycle” elections that also have elections for state and national offices;
in these on-cycle elections, turnout will be higher and teachers and teachers unions will
have relatively less influence.
From 2003 to 2006 all districts in the sample elected their school board members off-
cycle. A change in state policies in 2006 led some, but not all, districts to elect their
school board members on-cycle from 2007 onward. The districts that switched then
stayed switched for the period 2007 to 2009 and no other district switched.
Table 8.12: Variables for the Texas School Board Data
switch?
c. Run a one-way fixed effects model where the fixed effect relates to individual school
districts. Interpret the results and explain whether this model accounts for time
trends that could affect all districts.
d. Now use a two-way fixed effects model to estimate a difference-in-difference approach.
Interpret the results and explain whether this model accounts for (i) differences in
pre-existing attributes of the switcher districts and non-switcher districts and (ii)
differences in the post switch years that affected all districts whether or not they
switched.
e. Suppose that we tried to estimate the above two-way fixed effects model on only the
last three years of the data (2007, 2008, and 2009). Would we be able to estimate
the effect of oncycle for this subset of the data? Why or why not?
6. This problem uses a panel version of the dataset described in Chapter 5 on page 250
to analyze the effect of cell phone and texting bans on traffic fatalities. Use deaths
per mile as the dependent variable because this variable accounts for the pattern we
saw earlier that miles driven is a strong predictor of the number of fatalities. Table
8.13 describes the variables in the data set Cellphone panel homework.dta; it covers all
states plus Washington, DC from 2006 to 2012.
Table 8.13: Variables in the Cell Phones and Traffic Deaths Panel Data Set
a. Estimate a pooled OLS model with deaths per mile as the dependent variable and
cell phone ban and text ban as the two independent variables. Briefly interpret the
results.
b. Describe a possible state fixed effect that could cause endogeneity and bias in the
model from part (a).
c. Estimate a one-way fixed effects model that controls for state level fixed effects.
Include deaths per mile as the dependent variable and cell phone ban and text ban
as the two independent variables. Does the coefficient on cell phone ban change in
the manner you would expect based on your answer from part (a)?
d. Describe a possible year fixed effect that could cause endogeneity and bias in the
fixed effects model in part (c).
e. Estimate a two-way fixed effects model using the hybrid de-meaned approach dis-
cussed in the chapter. Include deaths per mile as the dependent variable and cell
phone ban and text ban as the two independent variables. Does the coefficient on
cell phone ban change in the manner you would expect based on your answer in part
(d)?
f. The model in part (e) is somewhat sparse with regard to control variables. Estimate a
two-way fixed effects model that includes control variables for cell phones per 10,000
people and percent urban. Briefly describe changes in inference about the effect of
cell phone and text bans.
g. Estimate the same two-way fixed effects model using the least square dummy variable
(LSDV) approach. Compare the coefficient and t statistic on the cell phone variable
to the results from part (f).
h. Based on the LSDV results, identify the states with large positive and negative fixed
effects. Explain what these mean (being sure to note the excluded category) and
speculate about what is different about the positive and negative fixed effect states.
(It is helpful to connect the state number to the state name; in Stata, do this with the
command list state state_numeric if year==2012.)
CHAPTER 9
INSTRUMENTAL VARIABLES: USING EXOGENOUS VARIATION TO FIGHT ENDOGENEITY
People enrolled in Medicaid differ from those not enrolled, not only in terms of income but also in many other
factors. Some factors such as age, race, and gender are fairly easy to measure. Other factors
such as health, lifestyle, wealth, and medical knowledge are difficult to measure, however.
The danger is that these unmeasured factors may be correlated with enrollment in Med-
icaid. Who is more likely to enroll: a poor sick person or a poor healthy person? Probably
the sick people are more likely to be enrolled, which means that comparing health outcomes
of enrollees and non-enrollees could show differences not only due to Medicaid, but also due to pre-existing differences in health between the two groups.
We must therefore be cautious – or clever – when analyzing Medicaid. This chapter goes
with clever. We show how we can navigate around endogeneity using instrumental variables.
This approach is relatively advanced, but its logic is pretty simple. The idea is to find
exogenous variation in X and use only that variation to estimate the effect of X on Y . For
the Medicaid question, we want to look for some variation in enrollment in the program
that is unrelated to the health outcomes of individuals. One way is to find some factor that
changed enrollment but was unrelated to health or lifestyle or any other factor that affects the
health outcome variable. In this chapter we show how to incorporate instrumental variables
using a technique called two-stage least squares (2SLS). In Chapter 10 we revisit 2SLS
techniques to analyze randomized experiments in which not everyone complies with their
assigned treatment.
Like many powerful tools, 2SLS can be a bit dangerous. We won’t cut off a finger using
it, but if we aren’t careful we could end up with worse estimates than we would with OLS.
And, also like many powerful tools, the approach is not cheap. In this case, the cost is that
the estimates produced by 2SLS are typically quite a bit less precise than OLS estimates.
In this chapter we provide the instruction manual for this tool. Section 9.1 provides an
example in which an instrumental variables approach proves useful. Section 9.2 presents
the basics for the 2SLS model. Section 9.3 discusses what to do when we have multiple
instruments. Section 9.4 discusses what happens to 2SLS estimates when the instruments
are flawed. Section 9.5 discusses why 2SLS estimates tend to be less precise than OLS
estimates. Section 9.6 then applies 2SLS tools to so-called simultaneous equation models in which variables simultaneously cause one another.
Before we work through the steps of the 2SLS approach, this section will introduce the logic
of the approach with an example about police and crime by Freakonomics author Steve
Levitt. We’ve seen the question of whether police reduce crime before (on page 371) and
know full well that an observational study almost certainly suffers from endogeneity. Why?
It is highly likely that things in the error term that cause crime – factors such as drug use,
gang warfare, demographic changes, and so forth – also are related to how many police
officers a city has. After all, it is just common sense that communities that expect more
crime hire more police. Equation 9.1 shows the basic model:

Crime_it = β0 + β1 Police_it + ε_it   (9.1)
Levitt’s (2002) idea is that while some police are hired for endogenous reasons (city leaders
expect more crime, so hire more police), other police are hired for exogenous reasons (the
city simply had more money, so spent it). In particular, Levitt argues that the number
of firefighters in a city reflects voters’ tastes for public services, union power, and perhaps
political patronage. These factors also partially predict the size of the police force and are
not directly related to crime. In other words, to the extent that changes in the number of
firefighters predict changes in police numbers, these changes in police are exogenous because
they have nothing to do with crime. The idea, then, is to isolate the portion of changes in
the police force associated with changes in the number of firefighters and see if crime went down when these exogenous changes in police occurred.
We’ll work through the exact steps of the process below. For now we can get a sense
of how instrumental variables can matter by looking at Levitt’s results. The left column of
results in Table 9.1 shows the coefficient on police estimated via a standard OLS estimation
of Equation 9.1 based on an OLS analysis with covariates and year dummy variables but
no city fixed effects. The coefficient is positive and significant, implying police cause crime.
Yikes!
Table 9.1: Levitt (2002) Results on Effect of Police Officers on Violent Crime
However, we’re pretty sure that endogeneity distorts simple OLS results in this context.
The second column shows that the results change dramatically when city fixed effects are
included. As discussed in Chapter 8, fixed effects account for the fact that cities with
chronically high crime also tend to have larger police forces. The estimated effect of police shrinks considerably once city fixed effects are included.
The third column shows the results when the instrumental variables technique is used.
The sign on the police variable is negative and marginally statistically significant. This result
is a huge difference from the OLS results without city fixed effects and a non-trivial difference from
the fixed effects results. The instrumental variables approach essentially
estimates the number of police that cities add when they add firefighters and assesses whether
crime changed in conjunction with these particular changes in police. Levitt is using an
instrumental variable (the number of firefighters) that explains the independent variable of interest
(which in this case is the log of the number of police per capita) but does not directly explain
the dependent variable (which in this case is violent crime).
The example also highlights some limits to instrumental variables methods. First, the
increase in police associated with changes in firefighters may not really be exogenous. That is,
is the firefighter variable truly independent of the error term in Equation 9.1? It is possible,
for example, that reelection-minded political leaders provide other public services when they
boost the number of firefighters – goodies such as tax cuts, roads, and new stadiums – and
that these policy choices may affect crime (perhaps by improving economic growth). In that
case, we worry that our exogenous bump in police is actually associated with factors that
also affect crime, and that those factors may be in the error term. Therefore as we develop
the logic of instrumental variables we also spend a lot of time worrying about the exogeneity
of our instruments.
A second concern is that we may reasonably worry that changes in firefighters do not
account for much of the variation in police forces. In that case, the exogenous change we
are measuring will be modest and may lead to imprecise estimates. We see this in Table 9.1
where the instrumental variable standard errors are more than four times larger than the OLS standard errors.
Remember This
1. An instrumental variable is a variable that explains the endogenous independent
variable of interest but does not directly explain the dependent variable.
2. When we use the instrumental variable approach, we focus on changes in Y due
to the changes in X that are attributable to changes in the instrumental variable.
3. Two major challenges associated with using instrumental variables are
(a) It is often hard to find an appropriate instrumental variable that is exogenous.
(b) Estimates based on instrumental variables are often imprecise.
We implement the instrumental variables approach with the two-stage least squares (2SLS)
approach. As you can see from the name, it’s still a least squares approach, meaning that
the underlying calculations are still based on minimizing the sum of squared residuals as in
OLS. The new element is that it has – you guessed it – two stages, unlike standard OLS, which has only one.
In this section we distinguish endogenous and instrumental variables, explain the two stages of 2SLS, discuss the characteristics of good instrumental variables, and describe the challenges involved in finding them. The main equation of interest looks familiar:

Yi = β0 + β1 X1i + β2 X2i + εi    (9.2)

where Yi is our dependent variable, X1i is our main variable of interest, and X2i is a control variable.
The difference is that X1i is an endogenous variable, which means that it is correlated with the error term. Our goal with 2SLS is to replace the endogenous X1i with a different variable that measures only that portion of X1i that is not related to the error term in Equation 9.2.
We model X1i as

X1i = γ0 + γ1 Zi + γ2 X2i + νi    (9.3)

where Zi is a new variable we are adding to the analysis, X2i is the control variable from Equation 9.2, the γ's are coefficients that determine how well Zi and X2i explain X1i, and νi is the error term. (Recall that γ is the Greek letter gamma and ν is the Greek letter nu.) We call
Z our instrumental variable; this variable is the star of this chapter, hands down. It is the variable that provides the exogenous variation in X1i that the whole approach depends on.
In Levitt’s police and crime example, the police officers per capita variable is the endoge-
nous variable (X1 in our notation) and the firefighters variable is the instrumental variable
(Z in our notation). The instrumental variable is the variable that causes the endogenous
variable to change for reasons unrelated to the error term. In other words, in Levitt's model, Z (firefighters) explains X1i (police per capita) but is not correlated with the error term in Equation 9.2.
First, we estimate γ̂ values based on Equation 9.3 in order to generate fitted values of X1:

X̂1i = γ̂0 + γ̂1 Zi + γ̂2 X2i
Notice that X̂1i is a function only of Z, X2, and the γ̂'s. That fact has important implications for what we are trying to do. The error term when X1i is the dependent variable is νi; it is almost certainly correlated with the error term in the Yi equation (which is ε). That is, drug use and criminal history likely affect both the number of police (X1) and crime (Y). This means the actual value of X1 is correlated with ε; the fitted value X̂1i, on the other hand, is only a function of Z, X2, and the γ̂'s. So even though police forces in reality may be ebbing and flowing as related to drug use and other factors in the error term of Equation 9.2, the fitted value X̂1i will not. Our X̂1i will ebb and flow only with changes in Z and X2, which means our fitted value of X has been purged of the association between X and ε.
All control variables from the second stage model must be included in the first stage. We
want our instrument to explain variation in X1 over and above any variation that can be explained by the control variables.
In the second stage, we estimate our outcome equation, but (key point here) we use X̂1i –
the fitted value of X1i – rather than the actual value of X1i . In other words, instead of using
X1i, which we suspect to be endogenous (correlated with ε), we use X̂1i, which has been purged of X1i's association with the error. Specifically, the second stage of the 2SLS model is

Yi = β0 + β1 X̂1i + β2 X2i + εi    (9.4)
The little hat on X̂1i is a big deal. Once we appreciate why we’re using it and how to generate
it, then 2SLS becomes easy. We are now estimating how much the exogenous variation in
X1i affects Y. Notice also that there is no Z in Equation 9.4. By the logic of 2SLS, Z only affects Y through its effect on X1i.
Control variables play an important role, just as in OLS. If there is some factor that affects
Y and is correlated with Z, we need to include it in the second stage regression. Otherwise,
the instrument will possibly soak up some of the effect of this omitted factor rather than
merely exogenous variation in X1 . For example, suppose that cities in the South started
facing more forest fires and hence hired more firefighters. In that case, Levitt’s firefighter
instrument for police officers will also contain variation due to region. If we do not control
for region in the second stage regression, then it is possible that some of the region effect will end up being attributed to the instrumented police variable.
Actual estimation using 2SLS is a bit more involved than simply running OLS with X̂1
because the standard errors need to be adjusted to account for the fact that X̂1i is itself an estimate; statistical software makes this adjustment for us automatically.
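To make the mechanics concrete, here is a minimal R sketch using simulated data (the data-generating process and variable names are ours, purely for illustration). It runs the two stages by hand and then uses the ivreg command from the AER package, which produces the same coefficient but also corrects the standard errors:

# Minimal 2SLS sketch with simulated data (illustrative only)
library(AER)                                    # provides ivreg()
set.seed(1234)
n  <- 1000
Z  <- rnorm(n)                                  # instrument
X2 <- rnorm(n)                                  # exogenous control variable
u  <- rnorm(n)                                  # unobserved factor in the error term
X1 <- 0.8 * Z + 0.5 * X2 + u + rnorm(n)         # endogenous variable: depends on u
Y  <- 1 + 2 * X1 + 1 * X2 + 3 * u + rnorm(n)    # outcome: also depends on u

# First stage: regress the endogenous variable on the instrument and the control
stage1 <- lm(X1 ~ Z + X2)
X1hat  <- fitted(stage1)

# Second stage: use the fitted values in place of X1
stage2 <- lm(Y ~ X1hat + X2)    # same point estimate as 2SLS, but wrong standard errors

# ivreg runs both stages and adjusts the standard errors appropriately
summary(ivreg(Y ~ X1 + X2 | Z + X2))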
The success of 2SLS hinges on the instrument. Good instruments satisfy two conditions.
These conditions are conceptually simple, but in practice they are hard to satisfy.
The first condition is that an instrument must actually explain the endogenous variable
of interest. That is, our endogenous variable, X1 , must vary in relation to our instrument, Z.
This is the inclusion condition: a condition that Z needs to exert a meaningful effect in the
first stage equation that explains X1i . In Levitt’s police example, police forces must actually
rise and fall as firefighter numbers change. It is a plausible claim, but is not guaranteed. We
can easily check this condition for any potential instrument, Z, by estimating the first stage
model of the form of Equation 9.3. If the coefficient on Z is statistically significant, we have
satisfied this condition. For reasons we explain later in Sections 9.4 and 9.5, the more Zi explains X1i, the better.
The second condition is that an instrument must be uncorrelated with the error term in the
main equation, Equation 9.2. This condition is the exclusion condition because it implies
we can exclude the instrument from the second stage equation because the instrument exerts no direct effect on the dependent variable. When we treat an instrument as exogenous, we are saying that it reflects no part of the error term in the main equation and hence can be excluded from it.
be excluded from it. Recall the kinds of things in the error term in a crime model: drug use,
gang warfare, demographic changes, and so forth. Levitt’s use of firefighters as an instrument
was based on an argument that the number of firefighters in a city was uncorrelated with these unmeasured determinants of crime.
Unfortunately, there is no direct test of whether Z is uncorrelated with ε. The whole point of the error term is that it covers unmeasured factors. We simply cannot directly observe the error term, so we cannot directly check its correlation with Z.
A natural instinct is to try to test the exclusion condition by including Z directly in the
second stage, but this method won't work. If Z is a good instrument, it will explain X1i, which in turn will affect Y. We will therefore observe some effect of Z on Y even when the exclusion condition holds, simply because Z operates through X1i. Instead the discussion of the exclusion
condition will need to be primarily conceptual rather than statistical. We will need to justify
why Z does not affect Y directly without statistical analysis. Yes, that’s a bummer and,
frankly, a pretty weird position to be in for a statistical analyst. Life is like that sometimes.2
Finding an instrument that satisfies the exclusion condition is really hard with observational
data. Economists Josh Angrist and Alan Krueger provided a famous example in a 1991
study of the effect of education on wages. Because the personal traits that lead a person to
2A test called the Hausman test (or the Durbin-Wu-Hausman test) is sometimes referred to as a test of endogeneity.
We should be careful to recognize that this is not a test of the exclusion restriction. Instead, the Hausman test tests
whether X is endogenous. It is not a test of whether Z is exogenous. Hausman derived the test by noting that if Z is
exogenous and X is endogenous, then OLS and 2SLS should produce very different β̂ estimates. If Z is exogenous and X is exogenous, then OLS and 2SLS should produce similar β̂ estimates. The test involves assessing how different the β̂ estimates are from OLS and 2SLS. Crucially, we need to assume Z is exogenous for this test. That's the claim
we usually want to test, so this test is often of limited value.
get more education (smarts, diligence, family wealth) are often the traits that lead someone
to financial success, education is very likely to be endogenous when explaining wages. They
therefore sought an instrument for education, a variable that would explain years of schooling
but not have anything to do with wages. They identified a very clever possibility: quarter
of birth.
While this idea seems crazy at first, it actually makes sense. Quarter of birth satisfies
the inclusion condition because how much schooling a person gets depends, in part, on what
month a person is born in. Birth month matters because of laws that say that young people
have to stay in school until they are 16. For a school district that starts kids in school based
on their age on September 1, kids born in July would be in 11th grade when they turn 16
while kids born in October (who started a year later) would only be in 10th grade when they
turn 16. Hence kids born in July can’t legally drop out until they are in the 11th grade,
while kids born in October can drop out in the 10th grade. The effect is not huge, but with a
lot of data (and Angrist and Krueger had a lot of data), this effect is statistically significant.
Quarter of birth also seems to satisfy the exclusion condition because birth month seems
unrelated to unmeasured factors that affect salary, such as smarts, diligence, and family
wealth. (Astrologers disagree, by the way.) However, Bound, Jaeger, and Baker (1995)
showed that quarter of birth has been associated with school attendance rates, behavioral
sclerosis, region, and income. (Wealthy families, for example, have fewer babies in the winter months.) If an instrument as clever as quarter of
birth doesn’t satisfy the exclusion condition, it’s fair to say a lot of less clever instruments
may be in trouble as well. Hence, we should be duly cautious when using instruments,
being sure to implement the diagnostics discussed below and being sure to test theories with more than one approach.
Remember This
Two-stage least squares uses exogenous variation in X to estimate the effect of X on
Y.
1. In the first stage, the endogenous independent variable is the dependent variable
and the instrument, Z, is an independent variable:
X1i = γ0 + γ1 Zi + γ2 X2i + νi
2. In the second stage, X̂1i (the fitted values from the first stage) is an independent
variable:
Yi = β0 + β1 X̂1i + β2 X2i + εi
Discussion Questions
1. Some people believe cell phones and related technology like Twitter have
increased social unrest by making it easier to organize protest or vio-
lence. Pierskalla and Hollenbach (2013) tested this view using African
data. In its most basic form, the model was
Violencei = β0 + β1 Cell phone coveragei + εi
where Violencei is data on organized violence in city i and Cell phone
coveragei measures availability of cell phone coverage in city i.
a) Explain why endogeneity may be a concern.
b) Consider a measure of regulatory quality as an instrument for cell
phone coverage. This variable is proposed based on a separate study
of telecommunications policy in African countries that found that
regulatory quality increased cell phone availability. Explain how to
test whether this variable satisfies the inclusion condition.
c) Does the regulatory quality variable satisfy the exclusion condition?
Can we test whether this condition holds?
2. Do political protests affect election results? Consider the follow-
ing model, which is a simplified version of the analysis presented in
Madestam, Shoag, Veuger, and Yanagizawa-Drott (2013):
Republican votei = β0 + β1 Tea Party protest turnouti + εi
where Republican votei is the vote for the Republican candidate for
Congress in district i in 2010 and Tea Party protest turnout i measures
the number of people who showed up at Tea Party protests in district i
on April 15, 2009, a day when protests were planned across the United
States.
a) Explain why endogeneity may be a concern.
b) Consider local rainfall on April 15, 2009 as an instrument for Tea
Party protest turnout. Explain how to test whether the rain variable
satisfies the inclusion condition.
c) Does the local rainfall variable satisfy the exclusion condition? Can
we test whether this condition holds?
A naive analyst using observational data might not think so, however. Suppose we analyze a model along the lines of

Deathi = β0 + β1 NICUi + εi

where Deathi equals 1 if the baby passed away and NICUi equals 1 if the delivery occurred in a hospital with a NICU and 0 otherwise.
It is highly likely that the coefficient in this case would be positive. It is beyond doubt
that the hardest births go to the NICU, meaning the key independent variable (NICU) will
be correlated with factors associated with a higher risk of death. In other words, we are quite
certain endogeneity would bias the coefficient upward. We could, of course, add covariates
that indicate risk factors in the pregnancy. Doing so would reduce the endogeneity by taking
factors correlated with NICU out of the error term and putting them in the equation. We
would, nonetheless, still worry that cases that are harder than usual, in ways that are difficult to measure, would be more likely to end up in NICUs, meaning that some endogeneity would remain.
Perhaps experiments could be helpful. They are, after all, designed to ensure exogeneity.
They are also completely out of bounds in this context. It is shocking to even consider ran-
domly assigning mothers to NICU and non-NICU facilities. It won’t and shouldn’t happen.
So are we done? Do we have to accept multivariate OLS as the best we can do? Not quite.
Instrumental variables, and 2SLS in particular, give us hope for producing more accurate
estimates. What we need is something that explains exogenous variation in use of NICU.
That is, can we identify some variable that explains usage of NICUs but is not correlated with the error term in the infant mortality equation?
Lorch, Baiocchi, Ahlberg, and Small (2012) identified a good prospect: distance to a
NICU. Specifically, they created a dummy variable that we’ll call Near NICU which equals
one for mothers for whom there was less than 10 minutes difference in travel time to a NICU
compared to another delivery hospital. The idea is that mothers who lived closer to a NICU
hospital would be more likely to deliver at the hospital that had the NICU. At the same
time, distance to a NICU should not directly affect birth outcomes; it should affect birth outcomes only by affecting whether the mother delivers at a NICU hospital.
Does this variable satisfy the conditions necessary for an instrument? The first condition is
that the instrumental variable explains the endogenous variable, which in this case is whether
the mother delivered at a NICU. Table 9.2 shows the results from a multivariate analysis in
which the dependent variable was a dummy variable indicating delivery at a NICU and the
main independent variable was the variable indicating that the mother lived near a NICU.
Clearly, mothers who live close to a NICU are more likely to deliver at a hospital with
a NICU. The estimated coefficient on Near NICU is highly statistically significant with a t
statistic over 178. Distance does a very good job explaining NICU usage. The table shows
coefficients for two other variables as well (the actual analysis has 60 control variables).
Gestational age indicates how long the baby had been gestating as of the time of delivery.
Zip code poverty indicates the percent of people in a zip code living below the poverty line.
Both of these control variables are significant, with babies that are gestationally older less
likely to be delivered in NICU hospitals and women from high-poverty zip codes more likely to deliver in them.
The second condition is that the instrumental variable is not correlated with the error
term in the second stage. This is the exclusion condition that holds that we can justifiably
exclude the instrument from the second stage. Certainly it seems highly unlikely that the
mere fact of living near a NICU would help a baby other than by the mother going to a
NICU. It is, however, possible that living near a NICU could be correlated with a risk factor.
What if NICUs tended to be in large urban hospitals in poor areas? In that case, living
near a NICU could be correlated with poverty, which in turn could be something that is a
pregnancy risk factor. Hence it is crucial in this analysis that poverty is a control variable
in both the first and second stages. In the first stage, controlling for poverty allows us to identify how much more likely mothers who live near a NICU are to deliver at one, taking neighborhood poverty into account. In the second stage, controlling for poverty allows us to control for
the effect of this variable so as not to conflate it with the effect of actually going to a NICU
hospital.
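In R, the corresponding 2SLS model could be estimated along the following lines. This is only a sketch: the variable names (death, nicu, gest_age, zip_poverty, near_nicu) and the data frame births are hypothetical stand-ins for the actual Lorch et al. data, which includes roughly 60 controls.

library(AER)
# Second stage: infant death on (instrumented) NICU delivery plus controls;
# after the vertical bar: the instrument near_nicu plus the same controls (first stage)
nicu_iv <- ivreg(death ~ nicu + gest_age + zip_poverty |
                 near_nicu + gest_age + zip_poverty,
                 data = births)   # 'births' is a hypothetical data frame
summary(nicu_iv)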
Table 9.3 presents results for assessing the effect of giving birth in a NICU hospital. The
first column shows results from a bivariate model predicting whether the baby passes away
as a function of whether the delivery was in a NICU hospital. The coefficient is positive
and highly significant, meaning babies delivered in NICU hospitals are more likely to die.
For the reasons discussed earlier, we would never believe this conclusion due to obvious
endogeneity, but it provides a useful baseline to appreciate the pitfalls of failing to account
for endogeneity.
The second column shows that adding covariates changes the results considerably: giving birth in a NICU hospital is now associated with a lower chance of death. The
effect is statistically significant with a t statistic of 6.72. The table reports results for two
covariates, gestational age and zip code poverty. The highly statistically significant coefficient
on gestational age indicates that babies that have been gestating longer are less likely to die.
The effect of zip code poverty is marginally statistically significant. The full analysis included the other control variables as well.
We’re still worried that the multivariate OLS result could be biased upward (meaning less
negative than it should be) if unmeasured pregnancy risk factors sent women to the NICU
hospitals and, in turn, raised the chances the babies would die. The 2SLS results address this concern by focusing on the exogenous change in utilization of NICU hospitals associated with living near them. The coefficient on NICU delivery continues to be negative and, at -0.0058, is almost 50 percent larger in magnitude than the multivariate OLS estimate (in this case, almost 50 percent more negative). This is the coefficient on the fitted
value of NICU utilization that is generated using the coefficients estimated in Table 9.2.
The estimated coefficient on NICU utilization is statistically significant, but with a smaller t
statistic than multivariate OLS, consistent with the fact that 2SLS results are typically less precise than OLS results.
Sometimes we have multiple potential instrumental variables that we think predict X but
not Y. In this section we explain how to handle multiple instruments and the additional diagnostic tests they make possible.
When we have multiple instruments, we proceed more or less as we have been doing but simply include all instruments in the first stage. So if we had three instruments (Z1, Z2, and Z3), the first stage would include all three of them along with the control variables.
If these are all valid instruments, we have multiple sources of exogeneity that could improve the precision of our estimates.
When we have multiple instruments, the best way to assess whether the instruments
adequately predict the endogenous variable is to use an F test for the null hypothesis that
the coefficients on all instruments in the first stage are equal to zero. For our example, the null hypothesis is that the coefficients on Z1, Z2, and Z3 in the first stage are all zero.
In this case rejecting the null would lead us to conclude that at least one of the instruments
helps explain X1i . We discuss a rule of thumb for this test on page 451.
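As a rough R sketch with generic variable names (Y, X1, X2 and instruments Z1, Z2, Z3, all ours for illustration), this F test amounts to comparing the first stage with and without the instruments:

# First stage with all three instruments plus the control variable
unrestricted <- lm(X1 ~ Z1 + Z2 + Z3 + X2)
# First stage without the instruments
restricted   <- lm(X1 ~ X2)
# F test of the null that the coefficients on Z1, Z2, and Z3 are all zero
anova(restricted, unrestricted)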
Overidentification tests
When we have multiple instruments, we can also conduct overidentification tests. The name of the test comes from the fact that we say that an instrumental variable model is
identified if we have an instrument that can explain X without directly influencing Y . When
we have more than one instrument, the equation is overidentified; that sounds a bit ominous,
like something will explode.3 It is actually a good thing. Having multiple instruments means
we can do some additional analysis that will shed light on the performance of the instruments.
The references in the Further Reading and appendix point to a number of formal tests
regarding multiple instruments. They can get a bit involved, but the core intuition is rather
simple. If each instrument is valid (meaning each satisfies the two conditions for instruments),
then using each one of them alone should produce an unbiased estimate of —1 . Therefore,
as an overidentification test, we can simply estimate the 2SLS model with each individual
instrument alone. The coefficient estimates should look pretty much the same given that
each instrument alone under these circumstances produces an unbiased estimator. Hence,
3 Everyone out now! The model is going to blow any minute ... it’s way overidentified!
if each of these models produces coefficients that are similar, we can feel pretty confident
that each is a decent instrument (or that they all are equally bad, which is the skunk at the picnic here).
If the instruments produce vastly different β̂1 coefficient estimates, then we have to re-think our instruments. This can happen if one of the instruments violates the exclusion condition. The catch is that we don't know which instrument is the bad one. Suppose β̂1 based on one instrument differs sharply from β̂1 based on another; we know something is wrong, but we do not know which instrument is at fault.
An overidentification test is like having two clocks. If the clocks show different times, we
know at least one, and possibly both, are wrong. If both clocks show the same time, we can be more confident that both are right, although they could still both be wrong in the same way.
Overidentification tests are relatively uncommon, not because they aren’t useful, but
because it’s hard to find one good instrument, let alone two or more.
Remember This
An instrumental variable is overidentified when there are multiple instruments for a
single endogenous variable.
1. To estimate a 2SLS model with multiple valid instruments, simply include all of
them in the first stage.
2. To use overidentification tests to assess instruments, run 2SLS models separately
with each instrumental variable. If the second stage coefficients on the endogenous
variable in question are similar across models, this result is evidence that all the
instruments are valid.
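A minimal R sketch of the comparison, assuming two generic instruments Z1 and Z2 for the endogenous variable X1 (names are ours, for illustration):

library(AER)
# 2SLS using each instrument on its own
iv_z1 <- ivreg(Y ~ X1 + X2 | Z1 + X2)
iv_z2 <- ivreg(Y ~ X1 + X2 | Z2 + X2)
# If both instruments are valid, the coefficients on X1 should be similar
coef(iv_z1)["X1"]
coef(iv_z2)["X1"]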
2SLS estimates are fragile. In this section, we show how they can go bad if Z is correlated with the error term, especially when Z only weakly explains the endogenous variable.
As discussed earlier, observational data seldom provide instruments for which we can be sure
that the correlation of Z and ε is literally zero. Sometimes we have potential instruments that we believe correlate with ε just a little bit or, at least, a lot less than X correlates with ε. We refer to such a variable as a quasi-instrument. A little bit of correlation between Z and ε does not necessarily render 2SLS useless. To see why, let's consider the simple case where there is one independent variable and one instrument. We examine the probability limit of β̂1 because the properties of probability limits are easier to work with than expectations in this context.4 For reference, we first note that the probability limit of the OLS estimate is

plim β̂1^OLS = β1 + corr(X, ε) × (σε / σX)    (9.7)
where plim refers to the probability limit and corr indicates the correlation of the two variables in parentheses. If X is exogenous, then corr(X, ε) = 0 and the OLS estimate β̂1 converges to the true value β1.
That's a good thing! If corr(X, ε) is non-zero, however, the OLS estimate β̂1 will converge to something other than β1 as the sample size gets very large. That's not good.
If we use a quasi-instrument to estimate a 2SLS model, the probability limit of the 2SLS estimate of β1 is

plim β̂1^2SLS = β1 + [corr(Z, ε) / corr(Z, X1)] × (σε / σX1)    (9.8)

If corr(Z, ε) is zero, then the probability limit of β̂1^2SLS is β1.5 Another good thing!
Otherwise the 2SLS estimate β̂1 will converge to something other than β1 as the sample size gets very large.
Equation 9.8 has two very different implications. On the one hand, the equation can be
grounds for optimism about 2SLS. If we compare the probability limits from the OLS and
2SLS models, we see that if there is only a small correlation between Z and ε and there is a high correlation between Z and X, then 2SLS will perform better than OLS, especially when the correlation of X and ε is large. This can happen when an instrument does a great job
predicting X, but has a wee bit of correlation with the error in the main equation. In other
words, quasi-instruments may help us get estimates that are closer to the true value.
On the other hand, the correlation of Z and X1 in the denominator of Equation 9.8 implies that when the instrument does a poor job of explaining X1, even a small amount of correlation between Z and ε can get magnified by virtue of being divided by a very small
number. In the education and wages example, the month of birth explained so little of the
5 The form of this equation is from Wooldridge (2009), based on Bound, Jaeger, and Baker (1995).
variation in education that the danger was that even a dash of correlation between month of birth and the error term could badly distort the 2SLS estimates.
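A simulation sketch (our own illustrative numbers, not from any of the studies discussed) shows both implications of Equation 9.8. Each instrument below has the same small correlation with the error term; the strong one largely rescues 2SLS, while the weak one makes matters worse than OLS:

library(AER)
set.seed(42)
n  <- 100000                              # large sample to approximate probability limits
e  <- rnorm(n)                            # error term in the main equation
Zs <- 0.05 * e + rnorm(n)                 # quasi-instrument, strongly related to X1 below
Zw <- 0.05 * e + rnorm(n)                 # quasi-instrument, only weakly related to X1
X1 <- 1.0 * Zs + 0.02 * Zw + 0.8 * e + rnorm(n)
Y  <- 1 + 2 * X1 + e                      # true beta1 = 2

coef(lm(Y ~ X1))["X1"]                    # OLS: noticeably above 2
coef(ivreg(Y ~ X1 | Zs))["X1"]            # 2SLS, strong quasi-instrument: close to 2
coef(ivreg(Y ~ X1 | Zw))["X1"]            # 2SLS, weak quasi-instrument: far from 2, worse than OLS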
The possibility that our instrument may have some correlation with ε means that we have to be on guard against problems associated with weak instruments when using 2SLS. A weak instrument is an instrument that adds little explanatory power to the first stage regression. Equation 9.8 showed that when we have a weak instrument, a small amount of correlation between the instrument and the error term can lead 2SLS to produce β̂1 estimates that diverge from the true value, even in very large samples.
Weak instruments create additional problems as well. Technically, 2SLS produces consistent but biased estimates of β1. This means that even though the 2SLS estimate is converging toward the true value β1 as the sample gets large, for any given sample the expected value of the estimate will not be β1. In particular, the expected value of β̂1 from 2SLS will be skewed toward the β̂1 from OLS. In large samples this is not a big problem, but in small samples it may be more troubling. In short, it means that 2SLS has a tendency to look more like OLS than we would like in small samples. This problem worsens as the fit of the first stage regression gets worse.
We might be tempted therefore to try to pump up the fit of our first stage model by
including additional instruments. Unfortunately, it’s not that simple. The bias of 2SLS
associated with small samples also worsens as the number of instruments increases, creating
a trade-off between the number of instruments and the explanatory power of the instruments
in the first stage. Each additional instrument brings at least a bit more explanatory power,
but will also bring with it a bit more small sample bias. The details are rather involved; see the Further Reading section at the end of the chapter.
The practical response to these concerns is to report how well the instrument explains X1 in the first stage regression. When we use multivariate regression, we'll want to know
how much more Z explains X1 than the other variables in the model. We’ll look for large t
statistics for the Z variable in the first stage. The typical rule of thumb is that the t statistic
should be greater than 3. When we have multiple instruments a rule of thumb is that the F
statistic should be at least 10 for the test of the null hypothesis that the coefficients on all
instruments are all zero in the first stage regression. This rule of thumb is not a statistical
test, but rather provides a guideline for what to aim for when assessing whether the first stage relationship is strong enough.
6 The rule of thumb is from Staiger and Stock (1997). We can, of course, run an F test even when we have only a
single instrument. A cool curiosity is that the F statistic in this case will be the square of the t statistic. This means that when we have only a single instrument, we can simply look for a t statistic that is bigger than √10, which we approximate (roughly!) by saying the t statistic should be bigger than 3. The appendix provides more information
on the F distribution on page 783.
Remember This
1. A quasi-instrument is an instrument that is correlated with the error term in the
main equation. If the correlation of the quasi-instrument (Z) and the error term
(ε) is small relative to the correlation of the quasi-instrument and the endogenous
variable (X), the 2SLS estimate based on Z will converge to something closer to
the true value than the OLS estimate will as the sample size gets very large.
2. A weak instrument does a poor job explaining the endogenous variable (X). Weak
instruments magnify the problems associated with quasi-instruments and also can
cause bias in small samples.
3. All 2SLS analyses should report tests of the independent explanatory power of the
instrumental variable or variables in the first stage regression. A rule of thumb is
that the F statistic should be at least 10 for the hypothesis that the coefficients
on all instruments in the first stage regression equal zero.
To calculate proper standard errors for 2SLS we need to account for the fact that the fitted
X̂1 values are themselves estimates. Any statistical program worth its salt does this auto-
matically, so we typically will not have to worry about the nitty-gritty of calculating precision
for 2SLS.
What we should appreciate, however, is that standard errors for 2SLS estimates differ in
interesting ways from OLS standard errors. In this section we show why they tend to be bigger and
how this result is largely related to the fit of the first stage regression.
The variance of 2SLS estimates is similar to the variance of OLS estimates. Recall that the variance of an OLS coefficient estimate is
var(β̂j) = σ̂² / [N × var(Xj) × (1 − R²j)]    (9.9)

where σ̂² is the variance of ε (which is estimated as σ̂² = Σ(Yi − Ŷi)² / (N − k)) and R²j is the R² from a regression of Xj on the other independent variables.
For a 2SLS estimate, the variance of the coefficient on the instrumented variable is
var(β̂1^2SLS) = σ̂² / [N × var(X̂1) × (1 − R²X̂1,NoZ)]    (9.10)

where σ̂² = Σ(Yi − Ŷi)² / (N − k) using fitted values from the 2SLS estimation and R²X̂1,NoZ is the R² from a regression of X̂1 on all the other independent variables (X̂1 = γ0 + γ2 X2 + ...) but not the instrument Z.
As with OLS, variance is lower when there is a good model fit (meaning a low σ̂²) and a larger sample size.
The new points for the 2SLS variance equation relate to the fact that we use X̂1i instead of X1i:
• The denominator of Equation 9.10 contains var(X̂1 ) which is the variance of the fitted
value, X̂1 (notice the hat). If the fitted values do not vary much, then var(X̂1 ) will be
relatively small. That’s a problem because we want this quantity to be big in order to
produce a small variance. In other words, we want the fitted values for our endogenous
variable to vary a lot. A poor fit in the first stage regression can lead the fitted values
to vary little; a good fit will lead the fitted values to vary more.
• The R²X̂1,NoZ term in Equation 9.10 is the R² from

X̂1i = π0 + π2 X2i + ηi

where we use π, the Greek letter pi, for the coefficients and η, the Greek letter eta, for the error term to highlight the fact that this is a new model, different from earlier models. Notice that Z is not in this regression, meaning that the R² from it measures the extent to which X̂1 can be explained by the other independent variables alone; the less additional explanatory power Z provides, the higher this R² will be and the less precise the 2SLS estimate.
The point here is not to learn how to calculate standard error estimates by hand. Com-
puter programs will do that perfectly well. The point is to understand the sources of variance
in 2SLS. In particular, it is useful to see that the ability of Z to add additional explanatory
power to explain X1 is important. If it does not, our β̂1^2SLS estimates will be imprecise.
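A short R sketch (again with simulated data and numbers of our own choosing) of how first stage strength feeds into second stage precision:

library(AER)
set.seed(7)
n  <- 2000
u  <- rnorm(n)
Zs <- rnorm(n)                                  # strong instrument
Zw <- rnorm(n)                                  # weak instrument
X1 <- 1.0 * Zs + 0.05 * Zw + u + rnorm(n)
Y  <- 1 + 2 * X1 + 2 * u + rnorm(n)

iv_strong <- ivreg(Y ~ X1 | Zs)
iv_weak   <- ivreg(Y ~ X1 | Zw)
summary(iv_strong)   # standard error on X1 is modest
summary(iv_weak)     # standard error on X1 is far larger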
As for goodness of fit, the conventional R² for 2SLS is basically broken. It is possible for it to be negative. If we really need a measure of goodness of fit, the square of the correlation of the fitted values and actual values will do. However, as we discussed when we introduced R² on page 106, the validity of the results does not depend on the overall goodness of fit.
Remember This
1. Four factors influence the variance of 2SLS β̂j estimates.
(a) Model fit: The better the model fits, the lower will be σ̂² and var(β̂j^2SLS).
(b) Sample size: The more observations, the lower will be var(β̂j^2SLS).
(c) The overall fit of the first stage regression: The better the fit of the first stage model, the higher will be var(X̂1) and the lower var(β̂1^2SLS) will be.
(d) The explanatory power of the instrument in explaining X1:
• If Z is a weak instrument (meaning it does a poor job explaining X1 when controlling for the other X variables), then R²X̂1,NoZ will be high because the other independent variables account for most of the variation in X̂1, and var(β̂1^2SLS) will be large.
In simultaneous equation models, X causes Y and Y also causes X. In this section we explain these models, why simultaneity causes endogeneity, and how to use 2SLS to estimate them.
The first equation (Equation 9.12) models Y1 as a function of Y2, W (a variable that affects both dependent variables), and Z1 (a variable that affects only Y1):

Y1i = β0 + β1 Y2i + β2 Wi + β3 Z1i + ε1i    (9.12)

The second equation (Equation 9.13) models Y2 as a function of Y1, W (the variable that affects both dependent variables), and Z2 (a variable that affects only Y2):

Y2i = γ0 + γ1 Y1i + γ2 Wi + γ3 Z2i + ε2i    (9.13)

Examples of this kind of simultaneity abound:
• Effective government institutions may spur economic growth. At the same time, strong economic growth may give governments the resources to build effective institutions.
• Individual views toward the Affordable Care Act (“ObamaCare”) may be influenced by what a person thinks of President Obama. At the same time, views of President Obama may be influenced by what a person thinks of the Affordable Care Act.
With simultaneity comes endogeneity. Let’s consider Y2i , which is an independent variable
in Equation 9.12. We know from Equation 9.13 that Y2i is a function of Y1i , which in turn
is a function of ε1i. Thus Y2i must be correlated with ε1i, which therefore means we have endogeneity in Equation 9.12.
Simultaneous equations are a bit mind-twisting at first. It really helps to work through
the logic for ourselves. Consider the classic market equilibrium case in which price depends
on quantity supplied and vice versa. Suppose we look only at price as a function of quantity
supplied. Because quantity supplied depends on price, such a model is really looking at price as a function of a variable that is itself a function of price, a textbook case of endogeneity.
We deal with the problem in a way similar to what we did for instrumental variable models. Only now we have two equations, so we'll do 2SLS twice. We just need to make sure that our first stage regression does not include the other endogenous variable.
Let's focus on the case where we are more interested in the Y1 equation; the logic goes the same way for the Y2 equation. Because Y2 appears in the Y1 equation as an endogenous variable, we'll want to find an instrument for it: a variable that predicts Y2 but does not directly predict Y1. We have such a variable in this case. It is Z2, which is in the Y2 equation but not the Y1 equation.
The tricky thing is that Y2 is a function of Y1 . If we were to run a first stage model for
Y2 and to include Y1 and then put the fitted value into the equation for Y1 , then we would
have a variable that is a function of Y1 explaining Y1 . Not cool. Instead we work with a
reduced form equation for Y2. In a reduced form equation, Y2 is only a function of the exogenous variables (which are the W and Z variables, not the Y variables). For this example, the reduced form equation for Y2 is

Y2i = π0 + π1 Wi + π2 Z1i + π3 Z2i + ν2i    (9.14)

We use the Greek letter π (pronounced “pie”) to indicate our coefficients (and ν2i for the reduced form error term) because they will differ from the coefficients in Equation 9.13, given that Equation 9.14 does not include Y1. We show in the appendix on page 802 how the reduced form relates to Equations 9.12 and 9.13.
In the second stage, we estimate the Y1 equation using the fitted values from the reduced form:

Y1i = β0 + β1 Ŷ2i + β2 Wi + β3 Z1i + ε1i

where Ŷ2i is the fitted value from the first stage regression (Equation 9.14).
For simultaneous equation models to work, they need to be identified, which is to say that
we need to have the right number of instruments. For 2SLS with one equation, we need at
least one instrument that satisfies both the inclusion and exclusion conditions. When we
have two equations we need at least one instrument for each equation. That is, to estimate
both equations we need one variable that belongs in Equation 9.12 but not in Equation 9.13
(which is Z1 in our notation) and one variable that belongs in Equation 9.13 but not in Equation 9.12 (which is Z2 in our notation). If we do not have an instrument for each equation, we can nonetheless plow ahead with the equation for which we do have
an instrument. So if we have only a variable that works in the second equation, but not
first equation, then we can estimate the first equation (because the instrument allows us to
estimate a fitted value for the endogenous variable in the first equation). If we have only
a variable that works in the first equation, but not the second equation, then we can estimate the second equation (because the instrument allows us to estimate a fitted value for the endogenous variable in the second equation).
In fact, we can view the police and crime example discussed in Section 9.1 as a simul-
taneous equation model with police and crime determining each other simultaneously. To
estimate the effect of police on crime, Levitt needed an instrument that predicted police but
not crime. He argued that his firefighter variable fit the bill and then used that instrument
in a first stage model predicting police forces, generating a fitted value of police that he used
in the model predicting crime. We discussed this model as a single equation, but the analysis is the same as in a simultaneous equation framework.
Remember This
We can use instrumental variables to estimate coefficients for the following simultane-
ous equation model:
Y1i = β0 + β1 Y2i + β2 Wi + β3 Z1i + ε1i
Y2i = γ0 + γ1 Y1i + γ2 Wi + γ3 Z2i + ε2i
1. To estimate the coefficients in the first equation:
• In the first stage, we estimate a model in which the endogenous variable is the
dependent variable and all W and Z variables are the independent variables.
Importantly, the other endogenous variable (Y1 ) is not included in this first
stage:

Y2i = π0 + π1 Wi + π2 Z1i + π3 Z2i + ν2i
• In the second stage, we estimate a model in which the fitted values from the
first stage, Ŷ2i, are used as an independent variable:

Y1i = β0 + β1 Ŷ2i + β2 Wi + β3 Z1i + ε1i
Case Study: Support for President Bush and the Iraq War
Did support for the Iraq War boost support for President George W. Bush, or did people support the war mainly because they already liked the president? It is easy to imagine the causal arrow running in both directions, with views of the war shaped in part by views of the president.
If causality really runs both ways and war support influences Bush support while Bush
support also influences war support, then it is hard to estimate how much these factors
affected each other. For example, when predicting Bush support in terms of war support we worry that the war support variable is endogenous.
Our simultaneous equation framework can help us deal with this problem. The two equations treat Bush support in 2002 and Iraq War support in 2002 as the two endogenous variables, each appearing as an explanatory variable in the other's equation.
We typically have one or more W variables that affect both dependent variables. In
this case, we probably want to control for political party because it is pretty reasonable to
assume that people who supported Bush’s Republican Party were more likely to support
both the war and President Bush. Such W variables are not strictly required by the method; as a practical matter with observational data, however, variables that belong in both equations are almost always present.
More importantly, we need to figure out the Z variables. These are crucial because they
are our instruments and we need one for each equation. That is, we need one variable that
explains war support but not Bush support and one variable that explains Bush support but not war support.
In particular, we focus on two proposed instruments. For the first equation, our instru-
ment is Bush feeling thermometer in 2000, which indicates how individuals thought of Bush
in 2000 before the war was an issue.7 Because the Iraq War was not an issue in 2000, it is
not possible for that issue to affect what people thought of Bush. But what people thought
of Bush in 2000 almost certainly predicted what they thought of Bush in 2002; after all,
it’s pretty common for people to take a liking (or disliking) to politicians and to carry that
feeling through over time. Hence, Bush support in 2000 can reasonably be included in the equation explaining Bush support in 2002 and excluded from the equation explaining support for the Iraq War in 2002.
An instrument for the second equation needs to explain support for the war in 2002,
but not support for Bush in 2002. One such variable is support for defense spending in
2000, before the Iraq War was an issue. Defending this instrument requires a bit of subtlety.
7 The American National Election Study from the University of Michigan has panel data from both 2000 and 2002.
This means the same people were surveyed in 2000 and 2002, so we know what people thought of President Bush
before the Iraq War was an issue.
Surely, support for defense spending in 2000 is correlated with Bush support in 2002, because
political conservatives tended to like defense spending and George Bush, even spanning
different years. Such a correlation would violate the exclusion condition that the instrument
is not correlated with the error term. However, once we control for support for the Republican
Party in 2002 and support for President Bush in 2000, it is less clear that support for defense
spending should still directly affect Bush support in 2002. In other words, conditional on
party affiliation and earlier views of Bush, the views of defense spending may affect views of
Bush only inasmuch as they affected views of the Iraq War. As is typical with instrumental
variables for observational data, this claim is hardly beyond question. The most reasonable defense of it as an instrument is that it is less correlated with the error term in the Bush support equation than is Iraq War support in 2002 itself. Bailey and Wilcox (2015) provide a more detailed discussion.
We can estimate each equation of this model using 2SLS. In the first stage, we need to
show that the instruments explain the endogenous variables, meaning that Bush support in
2000 explains Bush support in 2002 and that defense spending support in 2000 explains
support for the Iraq War in 2002, when controlling for the other variables.
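Concretely, the two 2SLS models might be run in R along the following lines. This is a sketch only: bush2002, iraq2002, repub2002, bush2000, defense2000, and the data frame anes are our hypothetical labels for the ANES panel measures, not the actual variable names.

library(AER)
# Bush support in 2002, with Iraq War support instrumented by defense spending views in 2000
bush_eq <- ivreg(bush2002 ~ iraq2002 + repub2002 + bush2000 |
                 bush2000 + repub2002 + defense2000, data = anes)
# Iraq War support in 2002, with Bush support instrumented by the Bush thermometer in 2000
iraq_eq <- ivreg(iraq2002 ~ bush2002 + repub2002 + defense2000 |
                 bush2000 + repub2002 + defense2000, data = anes)
summary(bush_eq)
summary(iraq_eq)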
Table 9.4 shows that these expectations about the first stage models are borne out in the
data. In the column on the left we see that Bush support in 2000 is highly predictive of
Bush support in 2002. The t statistic is immense at 12.29, telling us there is little doubt
these variables are related. Notice what is not in this model: support for the Iraq War in
2002. That is the other dependent variable. As discussed on page 458, the other endogenous
variable is excluded from the first stage equation in simultaneous equation analysis. Notice
also what is in this first stage model: defense spending support in 2000, the instrument for
Iraq War support in 2002. This seems odd, as only a minute ago we were claiming that
defense spending support in 2000 does not directly affect Bush support in 2002. We’re still
sticking by this claim, but recall that it is a conditional claim: Conditional on support for
the Iraq War in 2002, defense spending support in 2000 does not directly affect Bush support
in 2002. In this reduced form first-stage regression, though, we do not include support for
the Iraq War in 2002. It’s not surprising that the defense spending support variable affects
Bush support in 2002 in this equation, as it is soaking up the effect of the Iraq War support variable that is excluded from the reduced form.
The column on the right shows that support for defense spending in 2000 predicts support
for the Iraq War in 2002 when controlling for the other variables. The t statistic is quite
healthy, at 4.44. For this first stage regression we do not include Bush support in 2002 (the
other dependent variable) but do include Bush support in 2000 (the other instrument).
Table 9.4: First Stage Reduced Form Regressions for Bush/Iraq War Simultaneous Equation Model
The second-stage results are the results of most interest. The left column of Table 9.5
shows that people who liked the Iraq War in 2002 were more likely to like Bush in 2002.
The key thing is that in this model we are not using actual support for the Iraq War in
2002, which would almost certainly be endogenous given that a person who supported Bush
in 2002 was highly likely to support the Iraq War. Instead, via the magic of 2SLS, the Iraq
support 2002 variable is the fitted value of the first-stage results we discussed just a minute
ago. The variation in this Iraq War support variable therefore does not come from anything
directly related to support for Bush in 2002, but instead comes from variation in what people thought about defense spending in 2000, before the war was an issue.
The column on the right shows that Bush support in 2002 also explains support for the
Iraq War in 2002. The key thing is that the Bush support 2002 variable is a fitted value
from the first stage reduced form equation. The variation in this variable therefore does not
come from anything directly related to support for the Iraq War in 2002, but instead comes
from variation in what people thought of Bush in 2000 before the war was an issue.
Table 9.5: Second Stage Results for Bush/Iraq War Simultaneous Equation Model
Because this approach to simultaneous equations breaks the analysis into two separate
2SLS models, we can also be picky if necessary. Suppose we believe that the exclusion
condition for the instrument for Bush support in 2002 (which is Bush support in 2000) is
pretty good, but we do not believe the exclusion restriction for the instrument for Iraq War in
2002 (which is support for defense spending in 2000). In that case, the 2SLS results from the
model in which Iraq War 2002 is the dependent variable still stand, as the key independent
variable will be a fitted value from a first stage regression that we believe is sensible. The
other results may be less sensible, but the weakness of one instrument does not kill both
models, only the model where the instrument is needed. In such a case, it may make sense to report results only for the equation with the credible instrument.
9.7 Conclusion
2SLS is a great tool for fighting endogeneity. It provides us a means to use exogenous changes
in an endogenous independent variable to isolate causal effects. It’s easy to implement, both
conceptually (two simple regressions) and practically (let the computer do it).
The problem is that a fully convincing 2SLS analysis can be pretty elusive. In observational data, good instruments are hard to come by because the assumption that the instrument is uncorrelated with the error term is unverifiable statistically and often arguable in practice. The method also often produces imprecise
estimates which means that even if we have a good instrument, it might not tell us much
about the relationship we are studying. Even imperfect instruments, though, can be useful
because they can be less prone to bias than OLS, especially if the instrument performs well in the first stage.
• Section 9.2: Explain the first stage and second stage regressions in 2SLS. What two conditions must a good instrument satisfy?
• Section 9.4: Explain quasi-instruments and weak instruments and their implications for
2SLS analysis. What results from the first stage must be reported and why?
• Section 9.5: Explain how the first stage results affect the precision of the second stage
results.
• Section 9.6: Explain what simultaneity is and why it causes endogeneity. Describe
how to use 2SLS to estimate simultaneous equations, noting the difference from non-
simultaneous models.
Further Reading
Murray (2006a) summarizes the instrumental variable approach and is particularly good
discussing finite sample bias and many statistical tests that are useful when diagnosing
whether instrumental variables conditions are met. Baiocchi, Cheng, and Small (2014) provide an accessible overview of instrumental variables methods.
One topic that has generated considerable academic interest is the possibility that the
effect of X differs within a population. In this case, 2SLS estimates the local average
treatment effect (LATE), which is the causal effect only for those people affected by the
instrument. This effect is considered “local” in the sense of describing the effect for the
specific class of individuals for whom the endogenous X1 variable was influenced by the
exogenous Z variable.8
In addition, scholars who study instrumental variables methods discuss the importance
of monotonicity, which is a condition that the effect of the instrument on the endogenous
variable goes in the same direction for everyone in a population. This condition rules out
the possibility that an increase in Z causes some units to increase X and other units to
decrease X. Finally, scholars also discuss the stable unit treatment value assumption
(SUTVA), a condition that the treatment doesn’t vary in unmeasured ways across individuals
and that there are no spillover effects that might occur – for example, if untreated neighbors
of someone in the treatment group get some of the benefit of treatment via their neighbor.
Imbens (2014) and Chapter 4 of Angrist and Pischke (2009) discuss these points in detail
and provide mathematical derivations. Sovey and Green (2011) discuss these and related issues with a focus on applications in political science.
Key Terms
• 2SLS (425)
• Exclusion condition (434)
• Identified (458)
• Inclusion condition (434)
• Instrumental variable (428)
• Local average treatment effect (from Further Reading section, 468)
• Monotonicity (from Further Reading section, 469)
• Overidentification test (446)
• Probability limit (448)
• Quasi-instrument (448)
• Reduced form equation (458)
• Simultaneous equation model (455)
• Stable unit treatment value assumption (from Further Reading section, 469)
• Two-stage least squares (425)
• Weak instrument (450)
Computing Corner
Stata
1. To estimate a 2SLS model in Stata, use the ivregress 2sls command (which stands for
instrumental variable regression). It works like the reg command in Stata, but now the
endogenous variable (X1 in the example below) is indicated along with the instrument
(Z in our notation in this chapter) in parentheses. The , first subcommand tells
Stata to also display the first stage regression, something we should always do:
ivregress 2sls Y X2 X3 (X1 = Z), first
2. It is important to assess the explanatory power of the instruments in the first stage
regression.
• The rule of thumb when there is only one instrument is that the t statistic on the
instrument in the first stage should be above 3. The higher, the better.
• When there are multiple instruments, run an F test. The rule of thumb is that the
F statistic should be larger than 10.
reg X1 Z1 Z2 X2 X3 /* Regress endogenous variable on instruments */
/* and all other independent variables */
test Z1=Z2=0 /* F test that instruments are both zero */
3. To estimate a simultaneous equation model, we simply use the ivregress command twice:
ivregress 2sls Y1 W1 Z1 (Y2 = Z2), first
ivregress 2sls Y2 W1 Z2 (Y1 = Z1), first
R
1. To estimate a 2SLS model in R, we can use the ivreg command from the AER package.
• See page 130 on how to install the AER package. Recall that we need to tell R to
use the package with the library command below for each R session in which we
use the package.
• Other packages provide similar commands to estimate 2SLS models; they’re gen-
erally pretty similar, especially for standard 2SLS models.
• The ivreg command operates like the lm command. We indicate the dependent
variable and the independent variables for the main equation. The new bit is that
we include a vertical line, after which we note the independent variables in the first
stage. R figures out that whatever is in the first part but not the second is an
endogenous variable. In this case, X1 is in the first part but not the second and
therefore is the endogenous variable:
library(AER)
ivreg(Y ~ X1 + X2 + X3 | Z1 + Z2 + X2 + X3)
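• The summary of an ivreg fit can also report first stage diagnostics. In recent versions of the AER package (an assumption worth checking against your installed version), passing diagnostics = TRUE reports a weak-instruments F test along with Wu-Hausman and Sargan tests:
iv.fit = ivreg(Y ~ X1 + X2 + X3 | Z1 + Z2 + X2 + X3)
summary(iv.fit, diagnostics = TRUE)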
2. It is important to assess the explanatory power of the instruments in the first stage
regression.
• If there is only one instrument, the rule of thumb is that the t statistic on the
instrument in the first stage should be above 3. The higher, the better.
lm(X1 ~ Z1 + X2 + X3)
• When there are multiple instruments, run an F test with an unrestricted equation
that includes the instruments and a restricted equation that does not. The rule
of thumb is that the F statistic should be larger than 10. See page 360 on how to
implement an F test in R.
Unrestricted = lm(X1 ~ Z1 + Z2 + X2 + X3)
Restricted = lm(X1 ~ X2 + X3)
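One way to compute the F statistic from these two models is base R's anova command, which reports the F test for the restricted-versus-unrestricted comparison (a minimal sketch using the models above):
anova(Restricted, Unrestricted)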
3. We can also use the ivreg command to estimate a simultaneous equation model. Indi-
cate the full model and then, after the vertical line, indicate the reduced form variables
that will be included (which is all variables but the other dependent variable):
library(AER)
ivreg(Y1 ~ Y2 + W1 + Z1 | Z1 + W1 + Z2)
ivreg(Y2 ~ Y1 + W1 + Z2 | Z1 + W1 + Z2)
Exercises
1. Does economic growth reduce the odds of civil conflict? Miguel, Satyanath, and Ser-
genti (2004) use an instrumental variable approach to assess the relationship between
economic growth and civil war. They provide data (available in RainIV.dta) on 41
African countries from 1981 to 1999, including the variables listed in Table 9.6.
Table 9.6: Variables for Rainfall and Economic Growth Question
a. Estimate a bivariate OLS model in which the occurrence of civil conflict is the de-
pendent variable and lagged GDP growth is the independent variable. Comment on
the results.
b. Add control variables for initial GDP, democracy, mountains, and ethnic and reli-
gious fractionalization to the model in part (a). Do these results establish a causal
relationship between the economy and civil conflict?
c. Consider lagged rainfall growth as an instrument for lagged GDP growth. What are
the two conditions needed for a good instrument? Describe if and how we test the
two conditions. Provide appropriate statistical results.
d. Explain in your own words how instrumenting for GDP with rain could help us
identify the causal effect of the economy on civil conflict.
e. Use the dependent and independent variables from part (b), but now instrument for
lagged GDP growth with lagged rainfall growth. Comment on the results.
f. Re-do the 2SLS model in part (e), but this time add country fixed effects using
dummy variables. Comment on the quality of the instrument in the first stage and
the results for the effect of lagged economic growth in the second stage.
g. (funky) Estimate the first stage from the 2SLS model in part (f) and save the residu-
als. Then estimate a regular OLS model that includes the same independent variables
from part (f) and country dummies. Use lagged GDP growth (do not use fitted val-
ues) and now include the residuals from the first stage that you just saved. Compare
the coefficient on lagged GDP growth you get here to the coefficient on that variable
in the 2SLS. Discuss how endogeneity is being handled in this specification.
2. Can television inform people about public affairs? It’s a tricky question because the
kind of nerds (like us) who watch public affairs oriented TV are already pretty well
informed to begin with. Therefore political scientists Albertson and Lawrence (2009)
conducted a field experiment in which they randomly assigned people to treatment
and control conditions. Those assigned to the treatment condition were told to watch
a specific television broadcast about affirmative action and that they would be later
interviewed on it. Those in the control group were not told about the program but
were told that they would be re-interviewed later. The program they studied aired in
California prior to the vote on Proposition 209, a controversial proposition relating to
affirmative action. Their data (available in NewsStudy.dta) includes the variables listed
in Table 9.7.
a. Estimate a bivariate OLS model in which the information the respondent has about
Proposition 209 is the dependent variable and whether or not they watched the
program is the independent variable. Comment on the results, especially if and how
they might be biased.
b. Estimate the model in part (a), but now include measures of political interest, news-
paper reading and education. Are the results different? Have we defeated endogene-
ity?
c. Why might the assignment variable be a good instrument for watching the program?
What test or tests can we run?
d. Estimate a 2SLS model using assignment to the treatment group as an instru-
ment for whether the respondent watched the program. Use the additional indepen-
dent variables from part (b). Compare the first-stage results to the results in part (c). Are
they similar? Are they identical? (Hint: Compare sample sizes.)
e. What do the 2SLS results suggest about the effect of watching the program on
information levels? Compare the results to those in part (b). Have we defeated
endogeneity?
3. Suppose we want to understand the demand curve for a particular commodity. We’ll
use the following demand curve equation:
Quantity_t^D = β_0 + β_1 Price_t + ε_t^D
a. To see that prices and quantities are endogenous, draw supply and demand curves
and discuss what happens when the demand curve shifts out (which corresponds to
some change in the error term of demand function). Note also what happens to price
in equilibrium and discuss how this creates endogeneity.
b. The data set fishdata.dta (from Angrist, Graddy, and Imbens (2000)) provides data
on prices and quantities of a certain kind of fish (called Whiting)9 at the Fulton Fish
Market in New York over 111 days. The variables are indicated in Table 9.8. The
price and quantity variables are logged. Estimate a naive OLS model of demand
in which quantity is the dependent variable and price is the independent variable.
Briefly interpret results and then discuss whether this analysis is useful.
c. Angrist, Graddy, and Imbens suggest that a dummy variable indicating a storm at
sea is a good instrumental variable that should affect the supply equation but not
the demand equation. Stormy is a dummy variable that indicates wave height was
greater than 4.5 feet and wind speed was greater than 18 knots. Use 2SLS to estimate
a demand function in which stormy is an instrument for price. Discuss first stage
and second stage results, interpreting the most relevant portions.
d. Re-estimate the demand equation but with additional controls. Continue to use
stormy as an instrument for price, but now also include covariates that account for
the days of the week and weather on shore. Discuss first stage and second stage
results, interpreting the most relevant portions.
4. Does education reduce crime? If so, spending more on education could be a long-term
tool in the fight against crime. The file inmates.dta contains data used by Lochner
and Moretti in their 2004 article in The American Economic Review on the effects of
education on crime. Table 9.9 describes the variables.
Table 9.9: Variables for Education and Crime Questions
a. Run a linear probability model with prison as the dependent variable and education,
age, and African-American as independent variables. Make this a fixed effects model
by including dummies for state of residence (state) and year of census data (year).
Report and briefly describe the results.
b. Based on the OLS results, can we causally conclude that increasing education will
reduce crime? Why is it difficult to estimate the effect of education on criminal
activity?
c. Lochner and Moretti use 2SLS to improve upon their OLS estimates. They use
changes in compulsory attendance laws (set by each state) as an instrument. The
variable ca9 indicates that compulsory schooling is equal to 9 years, ca10 indicates
that compulsory schooling is equal to 10 years, and ca11 is 11 or more years. The con-
trol group is 8 or fewer years. Does this set of instruments satisfy the two conditions
for good instruments?
d. Estimate a 2SLS model using the instruments described above and the control vari-
ables from the OLS model above (including state and year dummy variables). Briefly
explain the results.
e. 2SLS is known for being less precise than OLS. Is that true here? Is this a problem
for the analysis in this case? Why or why not?
5. Does economic growth lead to democracy? This question is at the heart of our un-
derstanding of how politics and the economy interact. The answer also exerts huge
influence on policy: If we believe economic growth leads to democracy, then we may be
more willing to pursue economic growth first and let democracy come later. If economic
growth does not lead to democracy, then perhaps economic sanctions or other tools may
make sense if we wish to promote democracy. Acemoglu, Johnson, Robinson, and Yared
(2008) analyzed this question using data on democracy and GDP growth from 1960 to
2000. The data is in the form of five year panels, meaning there is one observation for
each country every five years. Table 9.10 describes the variables.
Table 9.10: Variables for Income and Democracy Questions
a. Are countries with higher income per capita more democratic? Run a pooled regres-
sion model with democracy (democracy_fh) as the dependent variable and logged GDP per capita (log_gdp) as the independent variable. Lag log_gdp so the model reflects that income at time t − 1 predicts democracy at time t. Describe the results.
What are the concerns with this model?
b. Re-run the model from part (a), but now include fixed effects for year and country.
Describe the model. How does including these fixed effects change the results?
c. To better establish causality, the authors use two-stage least squares. One of the in-
struments that they use is changes in the income of trading partners (worldincome).
They theorize that the income of the countries that a country trades with should
predict its own GDP but not directly affect the level of democracy in the country.
Discuss the viability of this instrument with specific reference to the conditions that
instruments need to satisfy. Provide evidence as appropriate.
d. Run a 2SLS model that uses worldincome as an instrument for logged GDP. Re-
member to lag both. Compare the coefficient and standard error to the OLS and
panel data results.
CHAPTER 10
Experiments: Dealing with Real-World Challenges
The 2012 Obama campaign knew how to raise money effectively and, perhaps most importantly of all, how to design randomized experiments to figure out what worked.
One thing they did was work their email list almost to exhaustion with a slew of fundrais-
ing pitches over the course of the campaign. These pitches were not random – or, wait,
actually they were random in the sense that the campaign tested them ruthlessly using
experimental methods. On June 26, 2012, for example, they sent email messages with ran-
domly selected subject lines, ranging from the minimalist “Change” to the sincere “Thankful
every day” to the politically scary “I will be outspent.” The campaign then tracked which
subject lines generated the most donations. On that day the “I will be outspent” message
kicked butt, producing almost five times the donations the “Thankful every day” subject line
did. As a result, the campaign sent millions of people emails with the “I will be outspent” subject line and, according to the campaign, raised millions more than they would have if they had gone with a different subject line.
Of course, campaigns are not the only organizations that use randomized experiments.
Governments and researchers interested in health care, economic development, and many
other public policy issues use them all the time. And experiments are important in the
private sector as well. Capital One, one of the largest credit card companies in the United
States, grew from virtually nothing largely on the strength of a commitment to experiment-
driven decision-making. Google, Amazon, Facebook, and eBay also experiment relentlessly.
Randomized experiments pose an alluring solution to our quest for exogeneity. Let’s create
it! That is, exogeneity requires that our independent variable of interest be uncorrelated with
the error term. As we discussed in Section 1.3, if our independent variable is uncorrelated
with everything, it is uncorrelated with the error term. Hence if the independent variable is randomly assigned, it will be uncorrelated with the error term, giving us the exogeneity we need for causal inference. In the classic experimental design, we randomly choose a subset of subjects to be the treatment group, treat them, and then look for differences compared
to an untreated control group.1 As discussed in Section 6.1, we can use OLS to estimate a model such as
Y_i = β_0 + β_1 Treatment_i + ε_i (10.1)
where Y_i is the outcome we care about and Treatment_i equals one for subjects in the
treatment group.
In reality, randomized experiments face a host of challenges. Not only are they costly,
potentially infeasible, and sometimes unethical as discussed in Section 1.3, they run into
several challenges that can undo the desired exogeneity of randomized experiments. This
chapter focuses on these challenges. Section 10.1 discusses the challenges raised by possible
dissimilarity of the treatment and control groups. If the treatment group differs from the
control group in ways other than the treatment, then we can’t be sure if the treatment or
other differences explain differences across these groups. Section 10.2 discusses the challenges
raised by non-compliance with assignment to experimental groups. Section 10.3 shows how
1 Often the control group is given a placebo treatment of some sort. In medicine, this is the well-known sugar
pill instead of medicine. In social science, a placebo treatment may be an experience that shares the form, but not the
content of the treatment. For example, in a study of advertising efficacy, a placebo group might be shown a public
service ad. The idea is that the mere act of viewing an ad, any ad, could affect respondents and that ad designers
want their ad to cause changes over and above that.
to use the 2SLS tools from Chapter 9 to deal with non-compliance. Section 10.4 discusses
the challenge posed to experiments by attrition, a common problem that arises when people
leave the experiment. This chapter concludes in Section 10.5 by changing gears to discuss
We refer to the attrition, balance, and compliance challenges facing experiments as ABC
issues.2 Every analysis of experiments should discuss these ABC issues explicitly.
When we run experiments we worry that randomization may fail to produce comparable
treatment and control groups, in which case the treatment and control groups might differ
in more ways than just the experimental treatment. If the treatment group is older, for
example, then we worry that the differences between the treatment and control groups could be due to age rather than to the treatment itself.
In this section we discuss how to try to ensure that treatment and control groups are
equivalent, explain how treatment and control groups can differ, show how to detect such differences, and discuss what to do when we find them.
Ideally, researchers will be able to ensure that their treatment and control groups are similar.
They do so by blocking, which involves picking treatment and control groups in a way that
ensures they will be the same for selected covariates. A simple form of blocking is to separate
the sample into men and women and then randomly pick treatment and control subjects
within those blocks. Doing so ensures that the treatment and control groups will not differ
by sex. Unfortunately, there are limits to blocking. Sometimes it just won’t work in the
context of an experiment being carried out in the real world. Or, more pervasively, there
are practical concerns because it gets harder and harder to make blocking work the more
variables we wish to block for. For example, if we want to ensure treatment and control groups are the same with respect to age and sex, we would have to pick subsets of women in each age group and men in each age group. If we add race to our wish list, then we'll have even smaller sets of individuals in targeted blocks to randomize within. Eventually, things get very complicated and our sample size can't provide people in every block. The Further Reading section at the end of this chapter points to more on blocking.
In situations where no blocking is possible or blocking is not able to account for all variables,
differences in treatment and control groups can arise in two ways. First, the randomization
procedures may have failed. Some experimental treatments are quite valuable, such as free
health care, access to a new cancer drug, or admission to a good school. A researcher may
desire this treatment to be randomly allocated, but the family of a sick person or ambitious
school child may be able to get that person into the treatment group. Or perhaps the
people implementing the program aren’t quite on board with randomization and put some
people in or out of the treatment group for their own reasons. Or maybe the folks doing the randomization simply make mistakes.
Second, even if there is no explicit violation of randomization, the treatment and control
groups may differ substantially simply by chance. Suppose we want to conduct a random
experiment on a four person family of mom, dad, big sister, and little brother. Even if we
pick the two-person treatment and control groups randomly, we’ll likely get groups that differ
in important ways. Maybe the treatment group will be dad and little brother; too many guys
there. Or maybe the treatment group will be mom and dad; too many middle-aged people
there. In these cases, any outcome differences between the treatment and control groups
would be due not only to the treatment but also possibly to the sex or age differences. Of
course, the odds that the treatment and control groups differ substantially fall rapidly as the
sample size increases (a good reason to have a big sample!). The chance that such differences arise never goes away entirely, however, especially in small samples.
Therefore an important first step in analyzing an experiment is to check for balance. Balance
exists when the treatment and control groups are similar in all measurable ways. The
core diagnostic for balance involves simply comparing difference of means for all possible
independent variables between those assigned to the treatment and control groups. To do
so we use our OLS difference of means test (as discussed on page 257) to assess, for each X variable,
X_i = γ_0 + γ_1 TreatmentAssigned_i + ν_i (10.2)
where TreatmentAssigned_i is 1 for those assigned to the treatment group and 0 for those assigned to the control group. We use the Greek letter γ (gamma) to indicate the coefficients and the Greek letter ν (nu) to indicate the error term. We do not use β and ε here so as
to emphasize that the model differs from the main model (Equation 10.1 on page 480). We
estimate Equation 10.2 for each potential independent variable; each equation will produce
a different γ̂_1 estimate. A statistically significant γ̂_1 estimate indicates that the X variable in question differs between the treatment and control groups.
Ideally, we won't see any statistically significant γ̂_1 estimates; this outcome would indicate the treatment and control groups are balanced. If the γ̂_1 estimates are statistically significant for many X variables, we do not have balance in our experimentally assigned groups, which suggests something may have gone wrong with the randomization.3
3More advanced balance tests also allow us to assess whether the variance of a variable is the same across treatment
and control groups. See, for example, Imai (2005).
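To make this recipe concrete, here is a minimal R sketch in the spirit of the Computing Corners, assuming a hypothetical data frame dat with a 0/1 treatment-assignment variable TreatmentAssigned and a few hypothetical pre-treatment covariates; it estimates Equation 10.2 for each covariate and reports the γ̂_1 estimate and its p-value:
covariates = c("age", "male", "education")   # hypothetical covariate names
for (v in covariates) {
  fit = lm(reformulate("TreatmentAssigned", response = v), data = dat)
  est = summary(fit)$coefficients["TreatmentAssigned", ]
  cat(v, ": gamma1-hat =", round(est["Estimate"], 3),
      " p-value =", round(est["Pr(>|t|)"], 3), "\n")
}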
We should keep statistical power in mind when evaluating balance tests. As we discussed
on page 165, statistical power relates to the probability of rejecting the null hypothesis when
we should. Power is low in small data sets, meaning that when there are few observations we
are unlikely to find statistically significant differences in treatment and control groups even
when there are differences. In contrast, power is high for large data sets, meaning we may
observe statistically significant differences even when the actual differences are substantively
small. Hence, balance tests are sensitive not only to whether there are differences across
treatment and control groups, but also to the factors that affect power. We should therefore
be cautious in believing we have achieved balance when we have small samples and we should
be sure to assess the substantive importance of any differences we see in large samples.
What if the treatment and control groups differ for only one or two variables? This
situation is not enough to indicate that randomization failed. Recall that even when there
is no difference between treatment and control groups, we will reject the null hypothesis of
no difference five percent of the time when α = 0.05. Thus if we look at twenty variables, for example, it would be perfectly natural for the means of the treatment and control groups to differ significantly for one of them simply by chance.
Good results on balancing tests also suggest (without proving) that balance has been
achieved even on the variables we can’t measure. Remember, the key to experiments is
that no unmeasured factor in the error term is correlated with the independent variable.
Given that we cannot see the darn things in the error term, it seems a bit unfair to expect
us to have any confidence about what’s going on in there. However, if balance has been
achieved for everything we can observe, we can reasonably (albeit cautiously) speculate that
the treatment and control groups are also balanced for factors we cannot observe.
If we do find imbalances, we should not ignore them. First, we should assess the magnitude
of the difference. Even if only one X variable differs across treatment and control groups,
it could be a sign of a deeper problem if the difference is huge. Second, we should control
for even smallish differences in treatment and control groups in our analysis, lest we conflate outcome differences in Y that are due to the treatment with outcome differences that are due to some X for which the treatment and control groups differ.
In other words, when we have imbalances it is a good idea to use multivariate OLS even
though in theory we need only bivariate OLS due to the random assignment of treatment. For
example, if we find that the treatment and control groups differ in age, we should estimate
Y_i = β_0 + β_1 Treatment_i + β_2 Age_i + ε_i
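In R, this simply means adding the imbalanced covariate to the treatment regression. A minimal sketch, assuming a hypothetical data frame dat containing Y, Treatment, and Age:
lm(Y ~ Treatment, data = dat)        # bivariate model that relies on randomization alone
lm(Y ~ Treatment + Age, data = dat)  # also controls for the imbalanced Age variable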
In adding control variables, we should be careful to control only for variables that are
measured before the treatment or that do not vary over time. If we control for a variable
measured after the treatment, it is possible that it will be affected by the treatment itself,
thereby making it hard to figure out the actual effect of treatment. For example, suppose
we are analyzing an experiment where job training was randomly assigned within a certain
population. In assessing whether the job training helped people get jobs, we would not want
to control for test scores measured after the treatment because the scores could have been
affected by the training. Including such a post-treatment variable will muddy the analysis
because part of the effect of treatment may be captured by this post-treatment variable.
Remember This
1. Experimental treatment and control groups are balanced if the average values
of independent variables are not substantially different for people assigned to
treatment and control groups.
2. We check for balance by conducting difference of means tests for all possible
independent variables.
3. It is a good idea to control for imbalanced variables when assessing the effect of
a treatment.
poverty goes away. Others are skeptical, wondering if the money spent by governmental and nongovernmental organizations actually improves people's lives.
Using observational studies to settle this debate is dicey. Such studies estimate something like
Health_it = β_0 + β_1 Aid_it + β_2 X_it + ε_it
where Health_it is the health of person i at time t, Aid_it is the amount of foreign aid going to person i's village, and X_it represents one or more variables that affect health. The problem
is that the error may be correlated with aid. Aid may flow to places where people are truly
needy, with economic and social problems that go beyond any of the simple measures of
poverty we may have. Or resources may flow to places that are actually better off and better organized, and hence better able to attract aid in the first place.
In other words, aid is probably endogenous. And because we cannot know if aid is
positively or negatively correlated with the error term, we have to admit that we don’t know
whether the actual effects are larger or smaller than what we observe with the observational data.
If the government resources flowed exogenously, however, we could analyze health and
other outcomes and be much more confident that we are measuring the effect of the aid. The Mexican Progresa program provides an example. In the late 1990s the Mexican government wanted to run a village-based health care program,
but realized it did not have enough resources to cover all villages at once. They decided
the fairest way to pick villages was to pick them randomly and voila! an experiment was
born. They randomly selected 320 villages as treatment cases and implemented the program
there. They also monitored 185 control villages where no new program was implemented.
In the program, eligible families received a cash transfer worth about 20 to 30 percent of
household income if they participated in health screening and education activities.
Before assessing whether the treatment worked, analysts needed to assess whether ran-
domization worked. We want to know whether villages were indeed selected randomly and, if so, whether the treatment and control villages were similar with regard to factors that could influence health. Table 10.1 provides
results for balancing tests for the Progresa program. The first column has the γ̂_0 estimates from Equation 10.2 for various X variables. These are the averages of the variable in question for the young children in the control villages. The second column displays the γ̂_1 estimates,
which indicate how much higher or lower the average of the variable in question is for chil-
Table 10.1: Balancing Tests for Progresa Experiment: Differences of Means Tests Using OLS
dren in the treatment villages. For example, the first line indicates that the children in the
treatment village were 0.01 years older than the children in the control village. The t statis-
tic is very small for this coefficient and the p-value is high, indicating that this difference is
not at all statistically significant. For the second row, the male variable equals one for boys
and zero for girls. The average of this variable indicates the percent of the sample that were
boys. In the control villages, 49 percent of the children were males; 51 percent (γ̂_0 + γ̂_1) of the children in the treatment villages were male. This two percentage point difference is statistically
significant at the 0.10 level (given that the p-value is less than 0.10). The most statistically
significant difference we see is in mother’s years of education, for which the p-value is 0.06.
In addition, houses in the treatment group were less likely to have electricity, with a p-value
of 0.09.
These results were taken by the study author to indicate that balance was achieved.
We see, though, that achieving balance is an art rather than a science: out of twelve variables, we would expect one or perhaps two to be statistically significant at the α = 0.10 level even if there were, in fact, no differences across the groups. These imbalances should not be forgotten; in this case, the analysts controlled for all of the listed variables in their subsequent analyses.
And, by the way, did the Progresa program work? In a word, yes. Using difference of
means tests, analysts found that kids in the treatment villages were sick less often, taller, and generally healthier than kids in the control villages.
Many social science experiments also have to deal with compliance problems, which arise
when some people assigned to a treatment do not experience the treatment to which they
were assigned. A compliance problem can happen, for example, when someone is randomly
assigned to receive a phone call asking him to donate to charity. If the person does not answer
the phone, we say (perhaps a bit harshly) that he failed to comply with the experimental
treatment.
In this section we show how non-compliance can create endogeneity, present a schematic
for thinking about the problem, and introduce so-called intention-to-treat models as one way to deal with it.
Non-compliance is often non-random, opening a back door for endogeneity to weasel its way
into the experiments because the people who comply with a treatment may systematically
differ from the people who do not. This is precisely the problem we use experiments to avoid.
Educational voucher experiments illustrate how endogeneity can sneak in with non-
compliance. These experiments typically start when someone ponies up a ton of money
to send poor kids to private schools. Because there are more poor kids than money, appli-
cants are randomly chosen in a lottery to receive vouchers to attend private schools. These
are the kids in the treatment group. The kids who aren’t selected in the lottery are the
control group.4 After a year of schooling (or more), the test scores of the treatment and
control groups are compared to see if kids in private schools did better. Because being in
the voucher schools is a function of a random lottery, we can hope that the only systematic
difference between the treatment and control groups is whether they attended the private
school and that factor therefore caused any differences in outcomes we observe.
4 Researchers in this area are careful to analyze only students who actually applied for the vouchers because the
kinds of students (and parents) who apply for vouchers for private schools almost certainly differ from students (and
parents) who do not.
Non-compliance complicates matters. Not everyone who receives the voucher uses it to
attend private school. In a late 1990s New York City voucher experiment discussed by
Howell and Peterson (2004), for example, 74 percent of families who were offered vouchers
used them in the first year. That number fell to 62 percent and 53 percent after two and
three years of the program, respectively. There are lots of reasons kids with vouchers might
end up not attending the private school. They might find the private school unwelcoming
or too demanding. Their parents may move. Some of these causes are plausibly related
to academic performance: A child who finds private school too demanding is likely less
academically ambitious than one who does not have that reaction. In that case, the kids
who actually use vouchers to attend private school (the “compliers” in our terminology) are
not a randomly selected group, but rather are a group that could systematically differ from
kids who decline to use the vouchers. The result can be endogeneity because the variable of
interest (attending private school) could be correlated with factors in the error term (such
Figure 10.1 provides a schematic of the non-compliance problem (Imai 2005). At the top
level, a researcher randomly assigns subjects to receive the treatment or not. If a subject is assigned to the treatment group, Zi = 1; if not, Zi = 0. Subjects assigned treatment who actually receive it are the compliers, and for them
Ti = 1, where T indicates whether the person actually received the treatment. The people
who are assigned to treatment (Zi = 1) but who do not actually receive it (Ti = 0) are the
non-compliers.
For everyone in the control group Zi equals zero, indicating they were not assigned to
receive the treatment. We don’t observe compliance for people in the control group because
they’re not given a chance to comply. Hence the dashed lines in the figure indicate that
we can’t know who among the control group are would-be compliers and would-be non-
compliers.5
We can see the mischief caused by non-compliance when we think about how to compare
treatment and control groups in this context. We could compare the students who actually
went to the private school (Ti = 1) to those who didn’t (Ti = 0). Note, however, that the
Ti = 1 group includes only compliers – the kind of students who, when given the chance to
go to a private school, took it. These students are likely to be more academically ambitious
than the non-compliers. The Ti = 0 group includes non-compliers (for whom Zi = 1 and
Ti = 0) and those not assigned to treatment (for whom Zi = 0). This comparison likely
stacks the deck in favor of finding that the private schools improve test scores because the Ti = 1 group consists entirely of the kind of students who take a private school slot when given the chance.
[Figure 10.1: Schematic of treatment assignment and compliance. Treatment assignment is random: Zi = 1 or Zi = 0. For those assigned treatment (Zi = 1), compliance is non-random: Ti = 1 for compliers and Ti = 0 for non-compliers. For the control group (Zi = 0), compliance is unobserved and Ti = 0.]
Alternatively, we could compare the compliers (the Ti = 1 students) to the whole control group (the Zi = 0 students). This method too has problems. The control group contains would-be compliers and would-be non-compliers, while the treatment group in this approach only has compliers. Any differences found with this comparison could be due either to the effect of the private school or to the fact that the complier group has no non-compliers while the control group includes both complier types.
Intention-to-treat models
In an ITT analysis, we compare the means of those assigned treatment (the whole Zi = 1 group
that includes those who received and did not receive the treatment) to those not assigned
treatment (the Zi = 0 group that includes would-be compliers and would-be non-compliers).
The ITT approach sidesteps non-compliance endogeneity at the cost of producing estimates
that are statistically conservative (meaning we expect the estimated coefficients to be smaller in magnitude than the true treatment effect).
To understand ITT, let’s start with the non-ITT model that we really care about:
Y_i = β_0 + β_1 Treatment_i + ε_i (10.4)
For individuals who receive no treatment (Treatment_i = 0), we expect Y_i to equal some baseline value, β_0. For individuals who have received the treatment (Treatment_i = 1), we expect Y_i to equal β_0 + β_1. This simple bivariate OLS model allows us to test for the effect of the treatment, but when there is non-compliance it is vulnerable to correlation between the treatment and the error term because the type of people who comply with the treatment may differ systematically from the type of people who do not.
The idea behind the ITT approach is to look for differences between the whole treatment
group (whether they complied or not) and the whole control group. The model is
Y_i = δ_0 + δ_1 Z_i + ν_i (10.5)
In this model, Z_i is 1 for individuals assigned to the treatment group. We use δ to highlight the fact that we are using treatment assignment as the independent variable rather than treatment received. The δ_1 coefficient captures the difference between all the people we intended to treat and all the people we did not intend to treat.
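For concreteness, here is a minimal R sketch of the two models, assuming a hypothetical data frame dat with the outcome Y, treatment assignment Z, and treatment actually received treat_received:
lm(Y ~ treat_received, data = dat)  # Equation 10.4 with treatment as delivered; vulnerable to non-compliance endogeneity
lm(Y ~ Z, data = dat)               # Equation 10.5: intention-to-treat, using assignment only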
Note that Z is uncorrelated with the error term. It reflects assignment to treatment
(rather than actual compliance with treatment) and hence none of the compliance issues are
able to sneak in correlation with anything, including the error term. Therefore the coefficient
estimate associated with the treatment assignment variable will not be clouded by other
factors that could explain both the dependent variable and compliance. For example, if
we use ITT analysis to explain the relationship between test scores and attending private
schools, we do not have to worry that our key independent variable is also capturing the
fact that the more academically ambitious kids may have been more likely to use the private
school vouchers. ITT avoids this problem by comparing all kids given a chance to use the vouchers to all kids not given that chance.
ITT is not costless, however. When there is non-compliance, ITT will underestimate
the treatment effect. This means the ITT estimate, δ̂_1, is a lower bound estimate of β_1, the effect of the treatment itself from Equation 10.4 on page 496. In other words, we expect the magnitude of the δ̂_1 parameter estimated from Equation 10.5 to be smaller than the magnitude of β_1.
To see why, consider the two extreme possibilities for compliance: zero compliance and full
compliance. If there is zero compliance such that no one assigned treatment complied (Ti = 0
for all Zi = 1), then δ_1 has to be 0 because there is no difference between the treatment and control groups. (No one took the treatment!) At the other extreme, if everyone assigned treatment (Zi = 1) also complied (Ti = 1), then the Treatment_i variable in Equation 10.4 will be identical to Zi (treatment assignment) in Equation 10.5. In this instance, β̂_1 will be an unbiased estimate of the treatment effect because full compliance preserves the exogeneity of the random experiment. In this case, β̂_1 = δ̂_1 because the variables in the two equations are identical.
Hence we know that the ITT estimate δ̂_1 is going to be somewhere between zero and an unbiased estimator of the true treatment effect. The lower the compliance, the more the ITT estimate will be biased toward zero. The ITT estimator is still preferable to β̂_1 from
a model with treatment received when there are non-compliance problems because that β̂_1
can be biased due to the endogeneity that enters the model when compliers differ from
non-compliers.
The ITT approach is a cop-out, but in a good way. When we use it, we’re being con-
servative in the sense that the estimate will be prone to underestimate the magnitude of
the treatment effect. If we find an effect using the ITT approach, it will not be due to endogeneity sneaking in via non-compliance.
Researchers regularly estimate ITT effects. Sometimes whether or not someone complied
with a treatment is not known. For example, if the experimenter mailed advertisements to
randomly selected households, it will be very hard, if not impossible, to know who actually saw them.
Or sometimes the ITT effect is the most relevant quantity of interest, such as when
we know that compliance will be spotty and we want to build non-compliance into our estimate of a program's likely effect. Consider, for example, an experiment in Kenya that provided medical treatment for intestinal worms to children at randomly selected
schools. Some children in the treated schools were not treated because they missed school the
day the medicine was administered. An ITT analysis in this case compares kids assigned to
treatment (whether or not they were in school) to kids not assigned to treatment. Because
some kids will always miss school for a treatment like this, policymakers may care more
about the ITT-estimated effect of the program because ITT takes into account both the treatment effect and the realistic rate of compliance.
Remember This
1. In an experimental context, a person assigned to receive a treatment who actually
receives the treatment is said to comply with the treatment.
2. Non-compliance creates endogeneity when compliers differ from non-compliers.
3. Intention-to-treat (ITT) analysis compares people assigned to treatment (whether
they complied or not) to people in the control group.
• ITT is not vulnerable to endogeneity caused by non-compliance.
• ITT estimates will be smaller in magnitude than the true treatment effect.
The more the non-compliance, the closer to zero will be the ITT estimates.
An even better way to deal with non-compliance is to use 2SLS to directly estimate the effect
of treatment. The key insight is that randomized treatment assignment is a great instrument.
It satisfies the exclusion condition because it is uncorrelated with everything, including the error term. Random assignment also usually satisfies the inclusion condition because being randomly assigned to treatment typically increases the probability of actually receiving the treatment.
In this section we build on material from Section 9.2 to show how to use 2SLS to deal with non-compliance.
To see how to use 2SLS to analyze an experiment with non-compliance, let’s look at an
experimental study of get-out-the-vote efforts. Political consultants often joke that they
know half of what they do works, they just don't know which half. An experiment might help them figure out which half.
We begin by laying out what an observational study of campaign effectiveness looks like.
A simple model is
Turnout_i = β_0 + β_1 CampaignContact_i + ε_i
where Turnout_i equals 1 for people who voted and equals 0 for those who did not.6 The independent variable, CampaignContact_i, equals 1 for people the campaign contacted and 0 otherwise.
What is in the error term? Certainly political interest will be, because more politically
attuned people are more likely to vote. We’ll have endogeneity if political interest (something
in the error term) is correlated with contact by a campaign (the independent variable). We
will probably have endogeneity because campaigns do not want to waste their time contacting
people who won't vote. Hence, we'll have endogeneity unless the campaign is incompetent at targeting likely voters.
Such endogeneity could corrupt the results easily. Suppose we find a positive association
between campaign contact and turnout. We should worry that the relationship is due not
6 The dependent variable is a dichotomous variable. We discuss how to deal with such dependent variables in
Chapter 12.
to the campaign contact, but due to the fact that the kind of people who were contacted
were more likely to vote even before they were contacted. Such concerns make it very hard to isolate the effect of campaign contact with observational data.
Professors Alan Gerber and Don Green (2000, 2005) were struck by these problems with
observational studies and have almost single handedly built an empire of experimental
studies in American politics.7 As part of their signature study, they randomly assigned
citizens to receive in-person visits from a get-out-the-vote campaign. In their study all the
factors that affect turnout would be uncorrelated with assignment to receive the treatment.8
Compliance is a challenge in such studies. When campaign volunteers visited, not ev-
eryone answered the door. Some people weren’t home. Some were in the middle of dinner.
Maybe a few ran out the back door screaming when they saw the hippie volunteer ringing
their doorbell.
Non-compliance, of course, could affect the results. If the more socially outgoing types
answered the door (hence receiving the treatment) and the more reclusive types did not
answer the door (hence not receiving the treatment even though they were assigned to it),
the treatment variable as delivered would depend not only on the random assignment, but
also on how outgoing a person was. If more outgoing people are more likely to vote, then
we have endogeneity because treatment as delivered will be correlated with the sociability
7 Or should we say double handedly? Or, really, quadruple handedly?
8 The study also looked at other campaign tactics, such as phone calls and mailing postcards. They didn’t work
as well as the personal visits; for simplicity, we focus only on the in-person visits.
To get around this problem, Gerber and Green used treatment assignment as an instru-
ment. This variable, which we’ve been calling Zi , indicates whether a person was randomly
selected to receive a treatment. This variable is well suited to satisfy the conditions neces-
sary for a good instrument we discussed in Section 9.2. First, it should be included in the
first stage because being randomly assigned to be contacted by the campaign does indeed
increase campaign contact. Table 10.2 shows the results from the first stage of Gerber and Green's analysis. The dependent variable in the first stage indicates if the person actually talked to the volunteer canvasser. The independent variable is whether the person was randomly assigned to be visited.
These results suggest that 27.9 percent of those assigned to be visited were actually
visited. In other words, 27.9 percent of the treatment group complied with the treatment.
This estimate is hugely statistically significant, in part due to the large sample size. The
intercept is 0.0, implying no one in the non-contact-assigned group was contacted by this campaign.
The treatment assignment variable Zi also is highly likely to satisfy the 2SLS exclusion condition: assignment should affect turnout only through people actually getting campaign contact. Being assigned to be contacted by the campaign
in and of itself does not affect turnout. Note we are not saying that the people who actually
complied (received a campaign contact) are random, for all the reasons above related to
concerns about compliance. We are simply saying that when we put a check next to some
randomly selected names indicating they should be visited, these folks were indeed randomly
selected. That means Z is uncorrelated with ‘ and can, therefore, be excluded from the main
equation.
In the second stage regression, we use the fitted values from the first stage regression as
the independent variable. Table 10.3 shows that the effect of a personal visit is to increase
the probability of turning out to vote by 8.7 percentage points. This estimate is statistically significant, as we can see from the t statistic of 3.34. We could improve the precision of the estimate by adding control variables.
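In R, the whole 2SLS procedure can be run in one step with the ivreg command from the AER package. A minimal sketch with hypothetical variable names (turnout, contacted, and assigned) in a hypothetical data frame gg:
library(AER)
fit2sls = ivreg(turnout ~ contacted | assigned, data = gg)  # assignment instruments for actual contact
summary(fit2sls, diagnostics = TRUE)  # diagnostics include the first-stage (weak instruments) F test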
Understanding the way the fitted values work is useful for understanding how 2SLS works
here. Table 10.4 shows the three different ways we are measuring campaign contact for three selected observations. The first column shows contact assignment: the campaign was assigned to visit Laura and Bryce and not to visit Gio. This selection was randomly determined. In
the second column is actual contact, which is observed contact by the campaign. Laura
answered the door when the campaign volunteer knocked, but Bryce did not. (No one
went to poor Gio’s door.) The third column displays the fitted value from the first stage
equation for the treatment variable. These fitted values depend only on contact assignment.
Laura and Bryce were randomly assigned to be visited (Z = 1), so both their fitted values were T̂ = 0.0 + 0.279 × 1 = 0.279, even though Laura was actually contacted and Bryce
wasn't. Gio was not assigned to be visited (Z = 0), so his fitted contact value was T̂ = 0.0 + 0.279 × 0 = 0.
Table 10.4: Various Measures of Campaign Contact in 2SLS Model for Selected Observations
2SLS uses the “Contact-fitted” (T̂ ) variable. This variable might be the weirdest thing
in the whole book9 , so it is worth taking the time to really understand it. Even though
Bryce was not contacted, his fitted value T̂ is 0.279, just the same as Laura, who was successfully
visited. Clearly, this variable looks very different than actual observed campaign contact.
Yes, this is odd, but it is a feature, not a bug. The core inferential problem, as we have noted, is
endogeneity in actual observed contact. Bryce might be avoiding contact because he loathes
politics. That’s why we don’t want to use observed contact as a variable – it would capture
not only the effect of contact, but also the fact that the type of people who get contact in
observational data are different. The fitted value, however, only varies according the Z –
something that is exogenous. In other words, by looking at the bump up in expected contact
associated with being in the randomly assigned contact-assigned group, we have isolated the
exogenous bump up in contact associated with the exogenous factor and can assess whether
9 Other than the ferret thing - also weird.
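The fitted-value logic can also be seen by running the two stages by hand, reusing the hypothetical variable names from the sketch above (note that this manual version recovers the 2SLS coefficient but not the correct standard errors, so ivreg remains the better tool):
firststage = lm(contacted ~ assigned, data = gg)   # first stage
gg$contact_hat = fitted(firststage)                # 0.279 when assigned = 1, 0 when assigned = 0 (per Table 10.2)
lm(turnout ~ contact_hat, data = gg)               # second stage using fitted contact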
Remember This
1. 2SLS is useful to analyze experiments when there is imperfect compliance with
the experimental treatment.
2. Assignment to treatment typically satisfies the inclusion and exclusion conditions
necessary for instruments in 2SLS analysis.
The goal is to figure out what police should do when they come upon a domestic violence
incident. Police can either take a hard line and arrest suspects whenever possible or they
can take a conciliatory line and decline to make an arrest as long as no one is in immediate
danger. Either approach could potentially be more effective: Arresting suspects creates clear
consequences for offenders, while not arresting them may possibly defuse the situation.
So what should police do? This is a great question to answer empirically. A model based on observational data would look something like
Arrested later_i = β_0 + β_1 Arrested initially_i + β_2 X_i + ε_i
where Arrested later is 1 if the person is arrested at some later date for domestic violence,
Arrested initially is 1 if the suspect was arrested at the time of the initial domestic violence
report, and X refers to other variables, such as whether a weapon or drugs were involved in the initial incident.
Why might there be endogeneity? (That is, why might we suspect Arrested initially
to be correlated with the error term?) Elements in the error term include person-specific
characteristics. Some people who have police called on them are indeed nasty; let’s call them
the bad eggs. Other people who have the police called on them are involved in a once-in-
a-lifetime incident; compared to the overall population of people who have police called on
them, they are the (relative) good eggs. Such personality traits are in the error term of the model predicting future arrests.
We could also easily imagine that people’s good or bad eggness will affect whether they
were arrested initially. Police who arrive at the scene of a domestic violence incident involving
a bad egg will, on average, find more threat; police who arrive at the scene of an incident
involving a (relative) good egg will likely see a less threatening environment. We would
expect police to arrest the bad egg types more often and we would expect these folks to have
more problems in the future. Observational data could therefore suggest that arrests make
things worse because those arrested are more likely to be bad eggs and therefore more likely
to be re-arrested.
The problem is endogeneity. The correlation of the arrested initially variable and the
personal characteristics in the error term makes it impossible for observational data to isolate
the effect of the policy (arrest) from the fact that this policy may be differentially applied to different kinds of people.
If police randomly arrested people when on domestic violence calls, then our arrest variable would no longer be correlated
with the personal traits of the perpetrators. Of course, this idea is insane, right? Police can’t
randomly arrest people (can they?). Believe it or not, researchers in Minneapolis created an
experiment to do just that. We’ll simplify the experiment a bit; more details are in Angrist
(2006). The Minneapolis researchers gave police a note pad to document incidents. The
note pad had randomly colored pages; the police officer was supposed to arrest or not arrest the suspect depending on the color of the page that came up. Of course, police will not always make such important decisions as to arrest or not based simply on the color of pages in a
notebook. Some circumstances are so dangerous that an arrest must be made, notebook
be damned. Endogeneity concerns arise because the type of people arrested under these
circumstances (the bad eggs) are different than those not arrested.
2SLS can rescue the situation. First we’ll show how randomization in experiments satisfies
the 2SLS conditions and then we’ll show how 2SLS works and compares to other approaches.
The inclusion condition is that Z explains X. In this case, the condition requires that
assignment to the arrest treatment actually predicts being arrested. Table 10.5 shows that
those assigned to be arrested were 77.3 percentage points more likely to be arrested, even
when controlling for whether a weapon or drugs were recorded as being present at the scene.
The effect is massively statistically significant with a t statistic of 17.98. The intercept was
not directly reported in the original paper, but from other information in the paper we can infer that it is about 0.216.
Table 10.5: First Stage Regression in Domestic Violence Experiment: Explaining Arrests (N = 314; standard errors in parentheses; * indicates significance at p < 0.05)
Assignment to the arrest treatment is very plausibly uncorrelated with the error term.
This condition is not testable and must instead be argued based on non-statistical evidence.
Here, the argument is pretty simple: The instrument was randomly generated and therefore cannot be correlated with anything, including the error term.
Before we present the 2SLS results, let’s be clear about the variable used in the 2SLS
model as compared to the variables used in other approaches. Table 10.6 shows the three
different ways to measure arrest. The first (Z) is whether an individual was assigned to
the arrest treatment. The second (T ) is whether a person was in fact arrested. The third
(T̂ ) is the fitted value of arrest based on Z. We report 4 examples, assuming none of
them had weapons or drugs in their initial incident. Person 1 was assigned to be arrested
and in fact arrested. His fitted value is γ̂_0 + γ̂_1 × 1 = 0.216 + 0.773 = 0.989. Person 2
was assigned to be arrested and not arrested. His fitted value is the same as person 1's:
γ̂_0 + γ̂_1 × 1 = 0.216 + 0.773 = 0.989. Person 3 was not assigned to be arrested but was in
fact arrested. He was probably pretty nasty when the police showed up. His fitted value
is γ̂_0 + γ̂_1 × 0 = 0.216 + 0 = 0.216. Person 4 was not assigned to be arrested and was not
arrested. He was probably relatively calm when the police showed up. His fitted value is
γ̂_0 + γ̂_1 × 0 = 0.216 + 0 = 0.216. Even though we suspect persons 3 and 4 are very different
types of people, the fitted values are the same, which is actually a good thing because factors
associated with actually being arrested (the bad egg-ness) that are correlated with the error
term in the equation predicting future arrests are purged from the T̂ variable.
Table 10.7 shows the results from three different ways to estimate a model in which
Arrested later is the dependent variable. The models also control for whether a weapon or
drugs were involved in the initial incident. The OLS model uses treatment delivered (T ) as
the independent variable. The ITT model uses treatment assigned (Z) as the independent
variable. The 2SLS model uses the fitted value of treatment (T̂ ) as the independent variable.
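Here is a minimal R sketch of the three specifications, using hypothetical variable names (rearrested, arrested, assigned_arrest, weapon, and drugs) in a hypothetical data frame dv:
library(AER)
ols = lm(rearrested ~ arrested + weapon + drugs, data = dv)          # treatment delivered (T)
itt = lm(rearrested ~ assigned_arrest + weapon + drugs, data = dv)   # treatment assigned (Z)
tsls = ivreg(rearrested ~ arrested + weapon + drugs |
             assigned_arrest + weapon + drugs, data = dv)            # 2SLS: instrument with assignment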
The first column shows that OLS estimates a 7 percentage point decrease in probability
of a re-arrest later. The independent variable was whether or not someone was actually
arrested. This group includes people who were randomly assigned to be arrested and people
who were in the no-arrest assigned treatment group but were arrested anyway. We worry
about bias using this variable because we suspect that the bad eggs were more likely to get
arrested.10
10 The OLS model reported here is still based on partially randomized data, because many who were arrested were
arrested due to the randomization in the police protocol. If we had purely observational data with no randomization,
the bias of OLS would be worse as only those who were bad eggs would likely be arrested.
The second column shows that ITT estimates that being assigned to the arrest treatment
lowers the probability of being arrested later by 10.8 percentage points. This result is
more negative than the OLS estimate and is statistically significant. The ITT model avoids
endogeneity because treatment assignment cannot be correlated with the error term. The
approach will understate the true effect when there is non-compliance, either because some people not assigned to the treatment got it or because not everyone who was assigned to the treatment actually received it.
The third column shows the 2SLS results. In this model, the independent variable is
the fitted value of the treatment. The estimated coefficient on arrest is even more negative
than the ITT estimate, indicating that the probability of re-arrest for individuals who were
arrested is 14 percentage points lower than for individuals who were not initially arrested.
The magnitude is double the effect estimated by OLS. This result implies that the city of Minneapolis could lower the probability of re-arrest by 14 percentage points by arresting individuals on the initial call. 2SLS is the best model because it accounts
for non-compliance and provides an unbiased estimate of the effect that arresting someone has on the probability of being arrested later.
This study was quite influential and spawned other similar studies elsewhere; see Berk,
10.4 Attrition
Another challenge for experiments is attrition. Attrition occurs when people drop out of
an experiment altogether such that we do not observe the dependent variable for them.
Attrition can happen when experimental subjects become frustrated with the experiment
and discontinue participation or when they are too busy to respond, move away, or even
pass away. Attrition can occur for both treatment and control groups.
In this section we explain how attrition can infect even randomized experiments with
endogeneity, show how to detect problematic attrition, and describe three ways to counteract it.
Attrition opens a back door for endogeneity to enter our experiments when it is non-random.
Suppose we randomly give some people free donuts. If some of them eat so many donuts
that they can’t rise from the couch to answer the experimenter’s phone calls, we no longer
have data for these folks. This is a problem, because these observations would be of people
who got lots of donuts and had a pretty bad health outcome. Losing these observations will
make donuts look less bad and thereby bias our conclusions.
Attrition is real. In the New York City school choice experiment discussed earlier, re-
searchers intended to track test scores of students in the treatment and control groups over
time. For a surprising number of students, such tracking was not possible. Some moved
away, some were absent on test days, and some probably got lost in the computer system.
And attrition can be non-random. In the New York school choice experiment, 67 percent
of African-American students in the treatment group took the test in year two of the experi-
ment, while only 55 percent of African-American students in the control group took the test
in year two. We should wonder if these groups are comparable and should worry that any
test differentials discovered could be due to differential attrition rather than to the effects of
private schooling.
To detect problematic attrition, we can simply compare attrition rates in the treatment and control groups. Statistically, we could estimate

$$\mathit{Attrition}_i = \gamma_0 + \gamma_1 \mathit{Treatment}_i + \nu_i$$

where Attrition_i equals 1 for observations for which we do not observe the dependent variable
and equals 0 for observations for which we observe the dependent variable. A statistically
significant γ̂1 would indicate differential attrition across treatment and control groups.
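To make this concrete, here is a minimal R sketch of the attrition check. The data frame expdata and its variables Y and TreatmentAssignment are hypothetical names used only for illustration.

# Create an attrition indicator: 1 if the outcome is missing, 0 otherwise
expdata$attrition <- ifelse(is.na(expdata$Y), 1, 0)

# Regress attrition on treatment assignment; the coefficient on
# TreatmentAssignment estimates the difference in attrition rates
attrition.check <- lm(attrition ~ TreatmentAssignment, data = expdata)
summary(attrition.check)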
We can add some nuance to our evaluation of attrition by looking for differential attrition
patterns in the treatment and control groups. For example, we can interact the treatment variable with one or more covariates in a model explaining attrition. For
example, when analyzing a randomized charter school experiment we could explore whether
high test scores in earlier years were associated with differential attrition in the treatment
group. Using the tools we discussed in Section 6.4 for interaction variables, the model would
be

$$\mathit{Attrition}_i = \gamma_0 + \gamma_1 \mathit{Treatment}_i + \gamma_2 \mathit{EarlyTestScores}_i + \gamma_3 (\mathit{Treatment}_i \times \mathit{EarlyTestScores}_i) + \nu_i$$

where EarlyTestScores_i is the pre-experimental test score of student i. If γ3 is not equal to
zero, then the treatment would appear to have had a differential effect on kids who were good
students in the pre-experimental period. Perhaps kids with high test scores were really likely
to stick around in the treated group (which means they attended charter schools), while the
good students in the control group (who did not attend a charter school) were less likely
to stick around (perhaps moving to a different school district and thereby making their test
scores for the period after the experiment had run unavailable). In this situation, treated
and control groups would differ on the early test score measure, something that should show up when we check for balance among the observations that remain in the sample.
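In R, a hedged sketch of this interacted attrition model, using the same hypothetical data frame and a hypothetical pre-experimental test score variable EarlyTestScores, is:

# Does attrition depend on treatment differently for students with
# high pre-experimental test scores? The interaction term captures this.
attrition.interact <- lm(attrition ~ TreatmentAssignment * EarlyTestScores,
                         data = expdata)
summary(attrition.interact)
# A significant coefficient on TreatmentAssignment:EarlyTestScores suggests
# differential attrition related to early test scores.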
There is no magic bullet to zap attrition, but there are three strategies that can prove
useful. The first is simply to control for variables associated with attrition in the final
analysis. Suppose we found that kids with higher pre-treatment test scores were more likely
to stay in the experiment. We would be wise to control for pre-treatment test scores with
multivariate OLS. However, this strategy cannot counter unmeasured sources of attrition
that could be correlated with treatment status and post-treatment test scores.
A second approach to dealing with attrition is to use a trimmed data set to make the
groups more plausibly comparable. A trimmed data set is one for which certain observations
are removed in order to offset potential bias due to attrition. Suppose we observe 10%
attrition in the treated group and 5% attrition in the control group. We should worry that
weak students were dropping out of the treatment group, making the comparison between
treated and untreated unfair because the treated group may have shed some of its weakest
students. A statistically conservative approach here would be to trim the control group by
removing another 5% of the weakest students before doing our analysis so that now both
groups in the data have 10% attrition rates. This practice is statistically conservative in the
sense that it makes it harder to observe a statistically significant treatment effect because
it is unlikely that literally all of those who dropped out from the treatment group were the
worst students.
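The following R sketch illustrates the trimming idea under the assumptions in the example: 10 percent attrition in the treated group, 5 percent in the control group, and a pre-treatment test score used to identify the weakest remaining control-group students. All variable names are hypothetical.

# Observed (non-attrited) students in each group
control <- subset(expdata, TreatmentAssignment == 0 & attrition == 0)
treated <- subset(expdata, TreatmentAssignment == 1 & attrition == 0)

# Additional share of control observations to drop so both groups have
# roughly the same attrition rate (here, 5 percent more)
extra.trim   <- 0.05
cutoff.score <- quantile(control$EarlyTestScores, probs = extra.trim)

# Drop the weakest remaining control students, then re-estimate
control.trimmed <- subset(control, EarlyTestScores > cutoff.score)
trimmed.data    <- rbind(treated, control.trimmed)
trimmed.model   <- lm(Y ~ TreatmentAssignment, data = trimmed.data)
summary(trimmed.model)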
A third approach to dealing with attrition is to use selection models. The most famous is the Heckman selection model (Heckman 1979). In this approach, we would
model both the process of being observed (which is a dichotomous variable equaling 1 for
those for whom we observe the dependent variable and 0 for others) and the outcome (the
model with the dependent variable of interest, such as test scores). These models build on
the probit model we discuss in Chapter 12. More details are in the Further Reading section at the end of this chapter.
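For readers who want to experiment with a selection model, the sketch below uses the selection() function from the sampleSelection package in R. The variable names (observed, testscore, distance_moved) are hypothetical, and in practice the selection equation should include at least one variable that affects being observed but not the outcome.

library(sampleSelection)

# observed = 1 if we see the outcome (e.g., a test score), 0 otherwise.
# The first formula models being observed; the second models the outcome.
heckman.model <- selection(
  observed  ~ TreatmentAssignment + EarlyTestScores + distance_moved,
  testscore ~ TreatmentAssignment + EarlyTestScores,
  data = expdata
)
summary(heckman.model)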
Remember This
1. Attrition occurs when individuals drop out of an experiment, causing us not to
have outcome data for them.
2. Non-random attrition can cause endogeneity even when treatment is randomly
assigned.
3. We can detect problematic attrition by looking for differences in attrition rates
across treated and control groups.
4. Attrition can be addressed by using multivariate OLS, trimmed data sets, or
selection models.
Providing health care more efficiently is therefore one of the biggest policy questions we face. One option that
gets a lot of interest is to change the way we pay for health care. We could, for example,
make consumers pay more for medical care so that they use only what they really need.
In such an approach, health insurance would cover the really big catastrophic items (think
heart replacement), but would cover less of the more mundane, potentially avoidable items.
To know if such an approach will work we need to answer two questions. First, are health
care outcomes the same or better under such a system? It's not much of a reform if it simply saves money by making us sicker. Second, do medical expenditures actually go down when people have to pay more of the costs themselves?
Because there are many health insurance plans on the private market, we could imagine using observational data to answer these questions. We could see if people on relatively stingy health insurance plans (that pay only for very major costs) are as healthy as others and whether they spend less on medical care.
Such an approach really wouldn’t be very useful though. Why? You guessed it:
Insurance is endogenous; those who expect to demand more services have a clear incentive to seek out more generous coverage.
In other words, because sick people probably seek out better health care coverage, a non-
experimental analysis of health coverage and costs would likely show that health care costs
more for those with better coverage. That wouldn't mean the generous coverage caused the higher costs.
Suppose we don't have a good measure of whether someone has diabetes. We would
expect that people with diabetes seek out generous coverage because they expect to use the
doctor a lot. The result would be a correlation between the error term and type of health
plan such that we would see people in the generous health plan having worse health outcomes
(because of all those people with diabetes who signed up for the generous plan). Or maybe
insurance companies figure out a way to measure whether people have diabetes and not let
them into generous insurance plans, which would mean the people in the generous plans
would be healthier than others. Here too, the diabetes in the error term would be correlated with the type of health plan.
Thus we have a good candidate for a randomized experiment, which is exactly what
ambitious researchers at RAND did in the 1970s. They randomly assigned people to various
health plans including a free plan that covered medical care at no cost and various cost-
sharing plans that had different levels of co-payments. With randomization, the type of
people assigned to a free plan should be expected to be the same as the type of people assigned
to a cost-sharing plan. The only expected difference between the groups is their health plans
and hence, to the extent that the groups differed in utilization or health outcomes, the differences could be attributed to the plans themselves.
The RAND researchers found that medical expenses were 45 percent higher for people in
plans with no out-of-pocket medical expenses compared to those who had stingy insurance
plans (which required people to pay 95 percent of costs, up to a $1,000 yearly maximum).
In general, health outcomes were no worse for those in the stingy plans.11 This experiment
has been incredibly influential – it is the reason we pay $10 or whatever in co-payments when we check out at the doctor's office.
Attrition is a crucial issue in evaluating the RAND experiment. Not everyone stayed in
the experiment. Some people may have moved, some may have died, and others may have
been unhappy with the plan they were randomly placed in and opted out of the experiment.
The threat to the validity of this experiment is that this attrition may have been non-random.
If the type of people who stayed with one plan were systematically different than the type of
11Outcomes for people in the stingy plans were worse for some subgroups and some conditions, leading the researchers
to suggest programs targeted at specific conditions rather than providing fee-free service for all health care.
people who stayed with another plan, comparing health outcomes or utilization rates across
these groups may be inappropriate because the groups differ both in their health plans and in the kinds of people who remained in the experiment.
Aron-Dine, Einav, and Finkelstein (2013) reexamined the data in light of attrition and
other concerns. They show that the free plan had 1,894 people randomly assigned to it. Of
those, 114 (5 percent) were non-compliers who declined to participate. Of the remainder who
participated, 89 (5 percent) attrited by leaving the experiment. These low numbers are not
very surprising. The free plan was gold-plated, covering everything. The cost-sharing plan
requiring the highest out-of-pocket expenditures had 1,121 people assigned to it. Of these, 269 (24 percent) declined to participate and another 145 (17 percent) left the experiment. These patterns contrast markedly with the non-compliance and attrition patterns for the free plan.
What kind of people would we expect to leave a cost-sharing plan? Probably the kind
of people who ended up paying a lot of money under the plan. And what kind of people
would end up paying a lot of money under a cost-sharing plan? Sick people, most likely. So
that means we have reason to worry that the free plan had all kinds of people, but that the
cost-sharing plans had a sizeable hunk of sick people who pulled out. So any finding that
the cost-sharing plans yielded the same health outcomes could be due either to the plans
not having different health impacts or to the free plan being better, but having a sicker
population.
To address this concern, Aron-Dine, Einav, and Finkelstein created a trimmed data set based on techniques from Lee (2009). They dropped the highest spenders in the
free care plan until they had a data set with the same proportion of observations from those
assigned to the free plan and to the costly plan. Comparing these two groups is equivalent
to assuming that those who left the costly plan were the most expensive patients; since
this is unlikely to be completely true, the results from such a comparison are considered
a lower bound as actual differences between the groups would be larger if some of the
people who dropped out from the costly plan were not among the most expensive patients.
The results indicated that the effect of the cost-sharing plan was still negative, meaning it
lowered expenditures. However, the magnitude of the effect was substantially less than the
magnitude reported in the initial study, which did little to account for differential attrition.
10.5 Natural Experiments
Sometimes, however, an experiment may fall into our laps. That is, we might find that the world has essentially already run a natural experiment that pretty much looks like a randomized experiment, but one that we didn't have to muck about with actually implementing. In a natural experiment,
the values of the independent variable have been determined by a random, or at least an
exogenous, process.
In this section we discuss some of the clever ways researchers have been able to use natural experiments, situations in which the world essentially provides us with treatment and control groups that look pretty much like they would look if we had randomly assigned the treatment ourselves.
In 2010, a virtually unknown candidate named Alvin Greene won the South Carolina primary election to run for the
U.S. Senate. Greene had done no campaigning and was not exactly an obvious Senatorial
candidate: He had been involuntarily discharged from both the Army and the Air Force and
had been unemployed since leaving the military. Oh yes, he was also under indictment. Despite all of this, he won the primary against a former state legislator. While some wondered if something nefarious was
going on, many pointed to a more mundane possibility: When voters don’t know much
about candidates, they might pick the first name they see. Greene was first on the ballot.
An experimental test of this proposition would involve randomly rotating the ballot order
12 Greene went on to get only 28 percent of the vote in the general election (a dismal outcome, although if you are
a glass-half-full type you would note that 364,598 South Carolinians voted for him). After the defeat Greene turned
his sights to the presidency, saying “I’m the next president. I’ll be 35 just before November, so I was born to be
president. I’m the man. I’m the man. I’m the man. Greene’s the man. I’m the man. I’m the greatest person ever. I
was born to be president. I’m the man, I’m the greatest individual ever” (Shiner 2010; see also Khimm 2010).
of candidates and seeing if candidates do better when they appear first on the ballot. Con-
ceptually that’s not too hard, but practically it is a lot to ask given that election officials
are pretty protective of how they run elections. In the 1998 Democratic primary in New
York City, however, election officials decided on their own to rotate the order of candidates’
names by precinct. Political scientists Jonathan Koppell and Jennifer Steen got wind of this
decision and analyzed the election as a natural experiment. Their 2004 paper found that
candidates received more votes in precincts where they were listed first in 71 of 79 races. In
seven of those races the differences were enough to determine the election outcome. That’s
pretty good work for an experiment they didn’t even set up.
Researchers have found other clever opportunities for natural experiments. An impor-
tant question is whether economic stimulus packages of tax cuts and government spending
increases that were implemented in response to the 2008 recession boosted growth. At a
first glance, such analysis should be easy. We know how much the federal government cut
taxes and increased spending. We also know how the economy performed. Of course, things
are not so simple because, as former Chair of the Council of Economic Advisers Christina
Romer (2011) noted, “Fiscal actions are often taken in response to other things happening
in the economy.” When looking at the relationship between two variables, like consumer
spending and the tax rebate, we “need to worry that a third variable, like the fall in wealth,
is influencing both of them.” Failing to take account of such an omitted variable leads to a biased estimate of the effect of the stimulus.
One way to deal with this challenge is to find exogenous variation in stimulus spending that
is not correlated with any of the omitted variables we worry about. It is typically very hard
to do so, but sometimes natural experiments pop up. For example, Parker et al. (2011) noted
that the 2008 stimulus consisted of tax rebate checks that were sent out in stages according to
the last two digits of people’s Social Security numbers. That means the timing was effectively
random for each family. After all, the last two digits are essentially randomly assigned to
people when they are born. This means, in turn, that the timing of the government spending
by family was exogenous. An analyst's dream come true! The researchers found that spending among families that had received a check was almost $500 higher than spending among families whose checks had not yet arrived, bolstering the case that the fiscal stimulus boosted consumer spending.
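The basic strategy can be sketched with a simple regression: because the timing of the rebate checks was effectively random, comparing spending between families that had and had not yet received a check is informative. A minimal R version with hypothetical variable names (this is a sketch of the idea, not Parker et al.'s actual specification):

# received_check = 1 if the family had received its rebate by the survey period
# spending = household consumption spending in that period
rebate.model <- lm(spending ~ received_check + income + family_size,
                   data = rebatedata)
summary(rebate.model)
# The coefficient on received_check estimates how much more families that had
# already received a check spent, relative to otherwise similar families
# whose checks had not yet arrived.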
Remember This
1. A natural experiment occurs when the values of the independent variable have
been determined by a random, or at least an exogenous, process.
2. Natural experiments are widely used and can be analyzed with OLS, 2SLS, or
other tools.
Consider another classic question: Do more police on the streets reduce crime? A simple comparison of crime and police levels suffers from endogeneity because police tend to be deployed most heavily where crime is worst. Could we use experiments to test the relationship? Sure, all we need to do is head down
to the police station and tell them to assign officers randomly to different places. The idea is
not completely crazy and, frankly, it is the kind of thing police should consider doing. This
idea is not an easy sell, though. Can you imagine the outrage if a crime occurred in an area that had been randomly assigned fewer police officers?
In 2005, economists Jonathan Klick and Alexander Tabarrok identified a clever natural
experiment that looks much like the randomized experiment we proposed. They noticed that
Washington DC deployed more police when the terror alert level was high. The high terror
alert was not random; presumably there was some cause somewhere prompting the terror
alert. It was exogenous, though. Whatever leads terrorists to threaten carnage, it was not
something that was associated with factors that lead local criminals in Washington DC to
stick up a liquor store. In other words, it was highly unlikely that terror alerts are correlated
with the things in the error term causing endogeneity that we identified above. It was as
if someone had designed a study in which extra police would be deployed at random times,
only in this case the “random” times were essentially selected by terrorist suspects with no
Klick and Tabarrok therefore assessed whether crime declined when the terror alert level
was high. Table 10.8 reports their main results. They found that crimes decreased when
the terror alert level went up. They also controlled for subway ridership in order to account
for the possibility that more people (and tourists in particular) around could make for more
targets for crime. The effect of the high terror alerts was still negative. Because this variable
was exogenous to crime in Washington and could, they argued, affect crime only by means of the increased police presence, their result provided pretty good evidence that
police can reduce crime. They used ordinary least squares, but the tools of analysis were
really less important than the vision of finding something that caused exogenous changes to
police deployment and then tracking changes in crime. Again, this is a pretty good day's work for a study the researchers did not have to set up themselves.
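A stylized R version of this kind of analysis, with hypothetical variable names (daily crime counts, a high-alert dummy, and a ridership control), might look like the following; it is a sketch of the general strategy rather than Klick and Tabarrok's exact specification.

# crime: daily number of crimes in Washington DC
# high_alert: 1 on days when the terror alert level was elevated
# log_ridership: log of daily Metro ridership, controlling for how many
#   potential targets are out and about
alert.model <- lm(crime ~ high_alert + log_ridership, data = dcdata)
summary(alert.model)
# A negative coefficient on high_alert indicates that crime fell on days
# when extra police were deployed because of the alert.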
10.6 Conclusion
Experiments are incredibly promising for statistical inference. If we want to know whether X causes Y, we randomly assign the treatment and compare values of Y for the treatment and control groups. The approach is simple, elegant, and has been used across the social sciences.
For all their promise, though, experiments are like a movie star. Even though many
people idealize them, they lose a bit of luster in real life. Movie stars’ teeth are a bit yellow
and they aren’t as witty without a script. Experiments don’t always achieve balance, suffer
from non-compliance and attrition, and in many circumstances are not feasible, ethical, or
generalizable.
For these reasons we need to take particular care when examining experiments. We need to
diagnose and, if necessary, respond to ABC issues. Every experiment needs to assess balance
to ensure that the treatment and control groups do not differ systematically except for the
treatment. Many social science experiments also have potential non-compliance problems if
people can choose not to experience the randomly assigned treatment. Non-compliance can undo the benefits of randomization, but we can get back to unbiased inference if we use ITT or 2SLS to analyze the experiment. Finally,
attrition occurs when people leave the experiment, which can be a problem if the attrition
is related to the treatment. Attrition is hard to solve, but it must be diagnosed and, if it
is a problem, we should at least use multivariate OLS or trimmed data to ameliorate the
problem.
The following steps provide a general guide to implementing and analyzing a randomized
experiment.
2. Randomly pick a subset of the population and give them the treatment. The rest are the control group.
3. Diagnose ABC issues:
(a) Assess balance with difference of means tests for all possible independent variables.
(b) Assess compliance by looking at what percent of those assigned to treatment actually received it.
(c) Assess non-random attrition by looking for differences in observation patterns across treatment and control groups.
4. Gather data on the outcome variable Y and assess differences between treated and
control groups.
(a) If there is perfect balance and compliance and no attrition, use bivariate OLS.
Multivariate OLS also will be appropriate and will provide more precise estimates.
(b) If there are imbalances, use multivariate OLS, controlling for variables that are imbalanced.
(c) If there is non-compliance, use ITT or 2SLS.
(d) If there is attrition, use multivariate OLS, trim the data, or use a selection model.
• Section 10.1: Explain how to assess whether randomization was successful with balanc-
ing tests.
• Section 10.2: Explain how imperfect compliance can create endogeneity. What is the
ITT approach and how does it avoid conflating treatment effects and non-compliance effects?
• Section 10.3: Explain how 2SLS can be useful for experiments with imperfect compli-
ance.
• Section 10.4: Explain how attrition can create endogeneity. What are some steps we can take to diagnose and address attrition?
Further Reading
Experiments are booming in the social sciences. Gerber and Green (2012) provide a compre-
hensive guide to field experiments. Banerjee and Duflo (2011) is an excellent introduction to
experiments in the developing world and Duflo, Glennerster, and Kremer (2008) provides an
experimental toolkit useful for experiments in the developing world and beyond. Dunning
(2012) is detailed guide to natural experiments. Manzi (2012) is a readable guide to and
critique of randomized experiments in social science and business. On page 190 he refers to
a report to Congress in 2008 that identified policies that demonstrated significant results in rigorous randomized evaluations.
Attrition is one of the harder things to deal with and different analysts take different
approaches. Gerber and Green (2012, 214) discuss their approach to dealing with at-
trition. There is a large literature on selection models; see, for example, Das, Newey, and
Vella (2003). Some experimentalists resist using selection models because those models rely
heavily on assumptions about the distributions of error terms and functional form.
Imai, King, and Stuart (2008) discuss how to use blocking to get more efficiency and less bias from experiments.
Key Terms
• ABC issues (481)
• Attrition (515)
• Balance (484)
• Blocking (482)
• Compliance (491)
• Intention-to-treat analysis (496)
• Natural experiment (524)
• Selection model (518)
• Trimmed data set (518)
Computing Corner
Stata
• To assess balance, estimate a series of bivariate regression models with all “X” variables
as dependent variables and treatment assignment as independent variables:
reg X1 TreatmentAssignment
reg X2 TreatmentAssignment
• To estimate an ITT model, estimate a model with the outcome of interest as the de-
pendent variable and treatment assignment as the main independent variable. Other
variables can be included, especially if there are balance problems.
reg Y TreatmentAssignment X1 X2
• To estimate a 2SLS model, estimate a model with the outcome of interest as the de-
pendent variable and treatment assignment as an instrument for treatment-delivered.
Other variables can be included, especially if there are balance problems.
ivregress 2sls Y X1 X2 (Treatment = TreatmentAssignment)
R
• To assess balance, estimate a series of bivariate regression models with all “X” variables
as dependent variables and treatment assignment as independent variables:
lm(X1 ~ TreatmentAssignment)
lm(X2 ~ TreatmentAssignment)
• To estimate an ITT model, estimate a model with the outcome of interest as the de-
pendent variable and treatment assignment as the main independent variable. Other
variables can be included, especially if there are balance problems.
lm(Y ~ TreatmentAssignment + X1 + X2)
• To estimate a 2SLS model, estimate a model with the outcome of interest as the de-
pendent variable and treatment assignment as an instrument for treatment-delivered.
Other variables can be included, especially if there are balance problems. As discussed
on page 471 we’ll use the ivreg command from the AER library:
library(AER)
ivreg(Y ~ Treatment + X2 + X3 | TreatmentAssignment + X2 + X3)
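Some of the exercises below ask for standard errors clustered by village using Stata's cluster option; a rough R equivalent, assuming the sandwich and lmtest packages are installed, is the following sketch:

library(sandwich)
library(lmtest)

# Fit the model with lm() as usual
model <- lm(Y ~ Treatment + X1 + X2, data = mydata)

# Recompute standard errors allowing errors to be correlated within clusters
coeftest(model, vcov = vcovCL(model, cluster = mydata$clustercode))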
Exercises
1. In an effort to better understand the effects of get-out-the-vote messages on voter
turnout, Gerber and Green (2005) conducted a randomized field experiment involv-
ing approximately 30,000 individuals in New Haven, Connecticut in 1998. One of the
experimental treatments was in-person visits. These were randomly assigned and in-
volved a volunteer visiting the person’s home and encouraging him or her to vote. The
file GerberGreenData.dta contains the variables described in Table 10.9.
Table 10.9: Variables for Get-out-the-vote Experiment from Gerber and Green (2005)
Variable Description
Voted Voted in the 1998 election (voted = 1)
ContactAssigned Assigned to in-person contact (assigned = 1)
ContactObserved Actually contacted via in-person visit (treated = 1)
Ward Ward number
PeopleHH Household size
a. Estimate a bivariate model of the effect of actual contact on voting. Is the model
biased? Why or why not?
Variable Description
education 0 = not reported; 1 = some high school; 2 = high school graduate;
3 = some college; 4 = college graduate or more
yearsexp Number of years of work experience
honors 1 = Resume mentions some honors
volunteer 1 = Resume mentions some volunteering experience
military 1 = Applicant has some military experience
computerskills 1 = Resume mentions computer skills
afn_american 1 = African-American-sounding name; 0 = White-sounding name
call 1 = applicant was called back; 0 = applicant was not called back
female 1 = female; 0 = male
h_quality 1 = high-quality resume; 0 = low-quality resume
a. What would be the concern of looking at the number of callbacks by race from an
observational study?
b. Check balance between the two groups (resumes with African-American-sounding
names and resumes with White sounding names) on the following variables: edu-
cation, years of experience, volunteering experience, honors, computer skills, and
gender. The treatment is whether the resume had an African-American name or not
as indicated by the variable afn_american.
c. What would compliance be in the context of this experiment? Is there a potential
non-compliance problem?
d. What variables would we need in order to use 2SLS to deal with non-compliance?
e. Calculate the intention-to-treat (ITT) effect on receiving a callback. The
variable call is coded 1 if a person received a callback and 0 otherwise. Use OLS
with call as the dependent variable.
f. We’re going to add covariates shortly. Discuss the implications of adding covariates
to this analysis of a randomized experiment.
g. Re-run the analysis from part (e) with controls for education, years of experience,
volunteering experience, honors, computer skills and gender. Report the results and
briefly describe the effect of having an African-American sounding name and if/how
the estimated effect changed from the earlier results.
h. The authors were also interested to see if race had a differential effect for high quality
resumes and low-quality resumes. They created a variable h_quality that indicated
a high quality resume based on labor market experience, career profile, existence of
gaps in employment, and skills. Using the controls from part (g) plus the high quality
indicator variable, estimate the effect of having an African-American sounding name
for high quality and low quality resumes.
4. Improving education in Afghanistan may be key to bringing development and stability to
the country. Only 37 percent of primary-school-age children in Afghanistan attended
school, and there is a large gender gap in enrollment (with girls 17 percentage points
less likely to attend school). Traditional schools in Afghanistan serve children from
numerous villages. Some believe that creating more village-based schools can increase
enrollment and student performance by bringing education closer to home. To assess
this belief, researchers Dana Burde and Leigh Linden (2013) conducted a randomized
experiment to test the effects of adding village based schools. For a sample of 12 equally
sized village groups, they randomly selected 5 groups to receive a village-based school.
One of the original village groups could not be surveyed and was dropped, resulting
in 11 village-groups with 5 treatment villages in which a new school was built and 6
control villages in which no new school was built.
This question focuses on the treatment effects for the fall 2007 semester, which was after
the schools had been provided. There were 1,490 children across the treatment and con-
trol villages. Table 10.11 displays the variables in the data set schools_experiment_HW.dta.
Variable Description
formal_school Enrolled in school
testscores Fall test scores (normalized); tests were to be given
to all children whether in school or not
treatment Assigned to village-based school = 1; otherwise = 0
age Age of child
girl Girl = 1; Boy = 0
sheep Number of sheep owned
duration_village Duration family has lived in village
farmer Farmer = 1
education_head Years of education of head of household
number_ppl_hh Number of people living in household
distance_nearest_school Distance to nearest school
f07_test_observed Equals 1 if test was observed for fall 2007
clustercode Village code
f07_hh_id Household ID
a. What are the issues with studying the effects of new schools in Afghanistan that are
not randomly assigned?
b. Why is checking balance an important first step in analyzing a randomized experi-
ment?
c. Did randomization work? Check the balance of the following variables: age of child,
girl, number of sheep family owns, length of time family lived in village, farmer,
years of education for household head, number of people in household, and distance
to nearest school.
d. On page 104 we discussed the fact that if errors are correlated, then the standard
OLS estimates for the standard error of β̂ are incorrect. In this case, we might expect
errors to be correlated within village. That is, knowing the error for one child in a
given village may provide us some information about the error for another child in
the same village. The way to generate standard errors that account for correlated
errors within some unit is to add the , cluster(ClusterName) option at the end of Stata's regression command. In this case, the cluster is the village, as indicated
with the variable clustercode. Re-do the balance tests from part (c) with clustered
standard errors. Do the coefficients change? Do the standard errors change? Do our
conclusions change?
e. Calculate the effect of being in a treatment village on fall enrollment. Use OLS and
report the fitted value of the school attendance variable for control and treatment
villages, respectively.
f. Calculate the effect of being in a treatment village on fall enrollment while controlling
for age of child, girl, number of sheep family owns, length of time family lived in
village, farmer, years of education for household head, number of people in household,
and distance to nearest school. Use the standard errors that account for within-village
correlation of errors. Is the coefficient on treatment substantially different from the
bivariate OLS results? Why or why not? Briefly note any control variables that are
significantly associated with attending school.
g. Calculate the effect of being in a treatment village on fall test scores. Use the model
that calculates standard errors that account for within-village correlation of errors.
Interpret the results.
h. Calculate the effect of being in a treatment village on test scores while controlling for
age of child, girl, number of sheep family owns, length of time family lived in village,
farmer, years of education for household head, number of people in household, and
distance to nearest school. Use the standard errors that account for within-village
correlation of errors. Is the coefficient on treatment substantially different from the
bivariate OLS results? Why or why not? Briefly note any control variables that are
significantly associated with higher test scores.
i. Compare the sample size for the enrollment and test score data. What concern does
this comparison raise?
j. Assess whether attrition was associated with treatment. Use the standard errors that
account for within-village correlation of errors.
CHAPTER 11
Regression Discontinuity: Looking for Jumps in Data
In this chapter we offer a third way to fight endogeneity: looking for discontinuities. Discontinuities arise when a treatment is given in a mechanical way to observations above some cutoff. These
jumps indicate the causal effects of treatments under reasonably general conditions.
Suppose, for example, that we want to know whether drinking alcohol causes grades to go
down. An observational study might be fun, but worthless: It’s a pretty good bet that the
kind of people who drink a lot also have other things in their error term that also account
for low grades (e.g., lack of interest in school). An experimental study might be even more problematic: we can hardly assign students at random to drink heavily.
But we still have some tricks to get at the effect of drinking. Consider the Air Force
Academy where the drinking age is strictly enforced. Students over 21 are allowed to drink;
students under 21 are not allowed to drink and face expulsion if caught. If we can compare
the performance on final exams of those students who had just turned 21 to those who had not quite reached their 21st birthday, we can learn about the effect of (legal) drinking.
Carrell, Hoekstra, and West (2010) did this and Figure 11.1 summarizes their results.
Each circle shows average test score for students grouped by age. The circle on the far left
shows the average test score for students who were 270 days before their 21st birthday when
they took their test. The circle on the far right shows the average test score for students
who turned 21 270 days before their test. In the middle are those who had just turned 21.
We’ve included fit lines to help make the pattern clear. Those who had not yet turned
21 scored higher. There is a discontinuity at the zero point in the figure (corresponding to
students taking a test on their 21st birthday). If we can’t come up with another explanation
for test scores to change at this point, we have pretty good evidence that drinking hurts
grades.

[Figure 11.1: Average normalized grade by days before or after students' 21st birthdays, with fitted lines on each side of the cutoff]
Regression discontinuity (RD) analysis formalizes this logic. It uses regression anal-
ysis to identify possible discontinuities at the point the treatment applies. For the drinking
age case, RD analysis involves fitting an OLS model that allows us to see whether there is a discontinuity in test scores right at the point students turn 21.
Regression discontinuity analysis has been used in a variety of contexts where some treat-
ment of interest is determined by a strict cutoff. Card, Dobkin, and Maestas (2009) used
RD to analyze the effect of Medicare on health because Medicare eligibility kicks in the day
someone turns 65. Lee (2008) used RD to study the effect of incumbency on reelection to
Congress because incumbents are decided by whoever gets 50 percent plus one or more of
the vote. Lerman (2009) used RD to assess the effect of being in a high security prison on
inmate aggression because the security level of the prison to which convicts are sent depends on a classification score with strict cutoffs.
RD is appealing in part because the alternatives often fall short: observational data may not provide exogeneity, good instruments are hard to come by, and experiments can be expensive or infeasible. And even when experiments can work, they can seem unfair or capricious to policymakers, who may not like allocating some treatment randomly. In RD, the treatment is assigned according to a rule, which to many people seems more reasonable and fair.
RD models can work when analyzing individuals, states, counties, and other units. We
keep things simple and in this chapter mostly discuss RD as applied to individuals, but
the technique works perfectly well for other units that have treatment assigned by a cutoff rule.
In this chapter, we show how to use RD models to estimate causal effects. Section 11.1
presents the core RD model. Section 11.2 then presents ways to more flexibly estimate RD
models. Section 11.3 shows how to limit the data sets and create graphs that are particularly
useful in the RD context. The RD approach is not bullet-proof, though, and Section 11.4 discusses its limitations.
11.1 The Basic RD Model
In this section we present the basic regression discontinuity model and explain the role of the assignment variable in the model. We then translate the regression discontinuity model into a convenient graphical form and explain the key condition necessary for the model to work.
The starting point is the assignment variable, a variable that determines whether or not someone receives some treatment. People with values of
the assignment variable above some cutoff receive the treatment; people with values of the
assignment variable less than the cutoff do not receive the treatment.
As long as the only thing that changes at the cutoff is that the person gets the treatment,
then any bump up or down in the dependent variable at the cutoff will reflect the causal effect of the treatment.
One way to understand why is to look at observations very, very close to the cutoff. The
only difference between those just above and just below the cutoff is the treatment. For
example, Medicare kicks in when someone turns 65. If we compare the health of people one
minute before their 65th birthday to the health of people who turned 65 one minute ago,
we could reasonably believe that the only difference between those two groups is that the
federal government provides health care for some but not others.
That’s a pretty extreme example, though. As a practical matter, we typically don’t have
data on very many people very close to our cutoff. Because statistical precision depends on
sample size (as we discussed on page 233), we typically can’t expect very useful estimates
unless we expand our data set to include observations some degree above and below the
cutoff. For Medicare, for example, perhaps we’ll need to look at people days, weeks, or
months from their 65th birthday to get a reasonable sample size. Thus the treated and
untreated will differ not only in whether they got the treatment but also in the assignment
variable. People 65 years and two months old not only get Medicare, but they are also older
than people two months shy of their 65th birthday. While four months doesn’t seem like
a lot for an individual, health declines with age in the whole population and there will be
people who experience some bad turn during those four months.
Regression discontinuity models therefore control for treatment and the assignment vari-
able. In its most basic form, a regression discontinuity model looks like
$$Y_i = \beta_0 + \beta_1 T_i + \beta_2 (X_{1i} - C) + \epsilon_i \qquad (11.1)$$
$$T_i = 1 \text{ if } X_{1i} \geq C$$
$$T_i = 0 \text{ if } X_{1i} < C$$
where Ti is a dummy variable indicating whether or not person i received the treatment and
X1i − C is our assignment variable, which indicates how much above or below the cutoff an
observation is. For reasons we explain below, it is useful to convert our assignment variable
into a variable that indicates how much above or below the cutoff a person was.
Figure 11.2 displays a scatterplot of data and fitted lines for a typical RD model. This
picture captures the essence of RD. If we understand it, we understand RD models. The
distance to the cutoff variable, X1i − C, is along the horizontal axis. In this particular
example, C = 0, meaning that the eligibility for the treatment kicked in when X1 equalled
zero. Those with X1 above zero got the treatment; those with X1 below zero did not get the
treatment. Starting from the left we see that the dependent variable rises as X1i − C gets bigger and, whoa, jumps up at the cutoff point (when X1 = 0). This jump at the cutoff is the treatment effect we are after.
[Figure 11.2: A typical RD scatterplot with fitted lines; the intercept is β̂0 below the cutoff and β̂0 + β̂1 above it, the slope is β̂2, and the bump at the cutoff is β̂1]
The parameters in the model are easy to locate in the figure. The most important
parameter is β1, which is the effect of being in the treatment group. This is the bump at the heart of RD analysis. The slope parameter, β2, captures the relationship between the
distance to the cutoff variable and the dependent variable. In this basic version of the RD
model, this slope is the same above and below the cutoff.
Figure 11.3 displays more examples of results from RD models. In panel (a) β1 is positive, just like in Figure 11.2, but β2 is negative, creating a downward slope for the assignment variable. In panel (b), the treatment has no effect, meaning that β1 = 0. Even though everyone above the cutoff received the treatment, there is no discernible discontinuity in the dependent variable at the cutoff point. In panel (c), β1 is negative because there is a bump downward at the cutoff, implying that the treatment lowered the dependent variable.
The key assumption for RD to work is that the error term itself does not jump at the point
of the discontinuity. In other words, we’re assuming that the error term, whatever is in it, is
continuous without any jumps up or down when the assignment variable crosses the cutoff.
One of the cool things about RD is that even if the error term is correlated with the
assignment variable, the estimated effect of the treatment is still valid. To see why, sup-
pose C = 0 and the error and assignment variable are correlated and we characterize the
correlation as follows:

$$\epsilon_i = \rho X_{1i} + \nu_i \qquad (11.2)$$

where the Greek letter ρ (pronounced rho) captures how strongly the error and X1i are related, and νi is a random term that is uncorrelated with X1i. In the Medicare example, mortality is the dependent variable, the treatment T is Medicare (which kicks in
the second someone turns 65), age is the assignment variable and health is in the error term.
It is totally reasonable to believe that health is related to age, and we use Equation 11.2 to capture that relationship.
If we estimate a model that does not control for the assignment variable (X1i), such as

$$Y_i = \beta_0 + \beta_1 T_i + \epsilon_i$$
there will be endogeneity because the treatment, T , depends on X1i , which is correlated with
the error. In the Medicare example, if we predict mortality as a function of Medicare only,
the Medicare variable will pick up not only the effect of the program, but also the effect of
health, which is in the error term and is correlated with age, which is in turn correlated
with Medicare.
If we control for X1i, however, the correlation between T and ε disappears. To see why, we begin with the basic RD model (Equation 11.1 on page 547) and, for simplicity, assume C = 0. Substituting Equation 11.2 for the error term gives

$$\begin{aligned}
Y_i &= \beta_0 + \beta_1 T_i + \beta_2 X_{1i} + \epsilon_i \\
    &= \beta_0 + \beta_1 T_i + \beta_2 X_{1i} + \rho X_{1i} + \nu_i \\
    &= \beta_0 + \beta_1 T_i + (\beta_2 + \rho) X_{1i} + \nu_i \\
    &= \beta_0 + \beta_1 T_i + \tilde{\beta}_2 X_{1i} + \nu_i
\end{aligned}$$
Notice that we have an equation in which the error term is now νi (the part of Equation 11.2
that is uncorrelated with anything). Hence, the treatment variable, T , in the RD model is
uncorrelated with the error term even though the assignment variable is correlated with the
error term. This means that OLS will provide an unbiased estimate of β1, the coefficient on Ti.
Meanwhile, the coefficient we estimate on the X1i assignment variable is β̃2 (notice the squiggly on top), a combination of β2 (with no squiggly on top; the actual effect of X1i on Y) and ρ (the degree of correlation between X1i and the error term in the original model, ε). Thus we do not put a lot of stock in the estimated coefficient on the assignment variable, because it combines the actual effect of the assignment variable and the correlation of the assignment variable and the error. That's okay, though, because our main interest is β1, the effect of the treatment.
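A small R simulation illustrates the point; the numbers are made up purely for illustration. Even though the error is correlated with the assignment variable, controlling for the assignment variable lets OLS recover the treatment effect.

set.seed(123)
n     <- 1000
X1    <- runif(n, -10, 10)          # assignment variable, cutoff C = 0
treat <- as.numeric(X1 >= 0)        # treatment indicator (T in the text)
e     <- 0.5 * X1 + rnorm(n)        # error correlated with X1 (rho = 0.5)
Y     <- 2 + 3 * treat + 1 * X1 + e # true treatment effect is 3

# Omitting the assignment variable yields a badly biased treatment estimate
summary(lm(Y ~ treat))

# Controlling for the assignment variable recovers a treatment effect near 3;
# the coefficient on X1 estimates beta2 + rho (here 1.5), as in the text
summary(lm(Y ~ treat + X1))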
Remember This
A regression discontinuity (RD) analysis can be used when treatment depends on an
assignment variable being above some cutoff C.
1. The basic model is
$$Y_i = \beta_0 + \beta_1 T_i + \beta_2 (X_{1i} - C) + \epsilon_i$$
$$T_i = 1 \text{ if } X_{1i} \geq C$$
$$T_i = 0 \text{ if } X_{1i} < C$$
2. RD models require that the error term is continuous at the cutoff. That is, the
value of the error term does not jump up or down at the cutoff.
3. RD identifies a causal effect of treatment because the assignment variable soaks
up the correlation of error and treatment.
Discussion Questions
1. Many school districts pay for new school buildings with bond issues
that need to be approved by voters. Supporters of these bond issues
typically argue that new buildings improve schools and thereby boost
housing values. Cellini, Ferreira, and Rothstein (2010) used RD to test
if passage of school bonds caused housing values to rise.
a) What is the assignment variable?
b) Explain how to use a basic regression discontinuity approach to esti-
mate the effect of school bond passage on housing values.
c) Provide a specific equation for the model.
2. Medicare benefits kick in automatically in the United States the day
a person turns 65 years old. Many believe that people with health in-
surance are less likely to die because they will be more likely to seek
treatment and doctors will be more willing to conduct tests and proce-
dures for them. Card, Dobkin, and Maestas (2009) used RD to address
this question.
a) What is the assignment variable?
b) Explain how to use a basic regression discontinuity approach to esti-
mate the effect of Medicare coverage on the probability of dying.
c) Provide a specific equation for the model. For simplicity use a linear
probability model (as discussed on page 592).
11.2 More Flexible RD Models
In the basic version of the RD model, the slope of the line is the same on both sides of the cutoff for treatment. This might not be the case in reality. In this section we show how to implement more flexible regression discontinuity models that allow the slope to vary or allow for a non-linear relationship between the assignment variable and the outcome.
Varying slopes model
Most RD applications allow the slope to vary above and below the threshold. By incorpo-
rating tools we discussed in Section 6.4, the following model will produce estimates in which the slope differs on each side of the cutoff:

$$Y_i = \beta_0 + \beta_1 T_i + \beta_2 (X_{1i} - C) + \beta_3 (X_{1i} - C) T_i + \epsilon_i$$
$$T_i = 1 \text{ if } X_{1i} \geq C$$
$$T_i = 0 \text{ if } X_{1i} < C$$

The new term at the end of the equation is an interaction between T and X1 − C. The coefficient on that interaction, β3, captures how different the slope is for observations where X1 is greater than C. The slope for untreated observations (for whom Ti = 0) will simply be β2, which is the slope for observations to the left of the cutoff. The slope for the treated observations (for whom Ti = 1) will be β2 + β3, which is the slope for observations to the right of the cutoff. (Recall our discussion on page 297 in Chapter 7 regarding the proper interpretation of such interactions.)
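In R, a hedged sketch of the varying slopes model, assuming a hypothetical data frame rd.data with outcome Y, treatment dummy treat, assignment variable X1, and a known cutoff C, is:

C <- 0                                  # cutoff (an assumed value for this sketch)
rd.data$Xc <- rd.data$X1 - C            # center the assignment variable at the cutoff

# treat * Xc expands to treat + Xc + treat:Xc, so the slope can differ on each
# side of the cutoff; the coefficient on treat is the bump at the cutoff.
varying.slopes <- lm(Y ~ treat * Xc, data = rd.data)
summary(varying.slopes)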
Figure 11.4 displays examples in which the slopes differ above and below the cutoff. In panel (a), β2 = 1 and β3 = 2. Because β3 is greater than zero, the slope is steeper for observations to the right of the cutoff. The slope for observations to the left of the cutoff is
1 (the value of β2) and the slope for observations to the right of the cutoff is β2 + β3 = 3.
In panel (b) of Figure 11.4, β3 is zero, meaning that the slope is the same (and equal to β2) on both sides of the cutoff. In panel (c), β3 is less than zero, meaning that the slope is less steep for observations for which X1 is greater than C. Note that just because β3 is negative does not mean that the slope for observations to the right of the cutoff will be negative (although it could be). A negative value of β3 simply means that the slope is less steep for observations to the right of the cutoff. In panel (c), β3 = −β2, which is why the fitted line is flat to the right of the cutoff.
It is important to use X1i − C instead of X1i for the assignment variable when estimating an RD model with varying slopes. In this model, we're estimating two separate lines. The intercept for the line for the untreated group is β̂0 and the intercept for the line for the treated group is β̂0 + β̂1. If we used X1i as the assignment variable, the β̂1 estimate would indicate the difference between treated and control when X1i is zero; we care about the difference between treated and control when X1i equals the cutoff. By using X1i − C instead of X1i for the assignment variable, β̂1 will indicate the difference between treated and control when X1i − C is zero, which is exactly at the cutoff.
Polynomial model
Once we start thinking about how the slope could vary across different values of X1 , it is
easy to start thinking about other possibilities. Hence more technical RD analyses spend
a lot of effort estimating relationships that are even more flexible than the varying slopes
model. One way to estimate more flexible relationships between the assignment variable
and outcome is to use our polynomial regression model from page 317 in Chapter 7 to allow
the relationship between X1 to Y to wiggle and curve. The RD insight is that however
wiggly that line gets, we’re still looking for a bump (a discontinuity) at the point where the
For example, we can use polynomial models to allow the estimated lines to curve differently
above and below the treatment threshold with a model like the following:

$$Y_i = \beta_0 + \beta_1 T_i + \beta_2 (X_{1i} - C) + \beta_3 (X_{1i} - C)^2 + \beta_4 (X_{1i} - C) T_i + \beta_5 (X_{1i} - C)^2 T_i + \epsilon_i$$
$$T_i = 1 \text{ if } X_{1i} \geq C$$
$$T_i = 0 \text{ if } X_{1i} < C$$
Figure 11.5 shows two examples that can be estimated with such a polynomial model.
In panel (a), the value of Y accelerates as X1 approaches the cutoff, dips at the point of
treatment, and then accelerates again from that lower point. In panel (b), the relationship
appears relatively flat for values of X1 below the cutoff. There is a fairly substantial bump
up in Y at the cutoff. After that, Y rises sharply with X1 and then falls sharply.
It is virtually impossible to predict funky non-linear relationships like these ahead of time.
The goal is to find a functional form for the relationship between X1 − C and outcomes that soaks up any relation between X1 − C and outcomes so that any bump at the cutoff reflects
only the causal effect of the treatment. This means we can estimate the polynomial models
and see what happens even without full theory about how the line should wiggle.
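A sketch of one such polynomial specification in R, a quadratic on each side of the cutoff using the same hypothetical rd.data from above, is:

# Quadratic in the centered assignment variable, fully interacted with the
# treatment dummy so the curve can differ above and below the cutoff.
poly.rd <- lm(Y ~ treat * (Xc + I(Xc^2)), data = rd.data)
summary(poly.rd)
# It is good practice to compare the estimated bump (the coefficient on treat)
# with the estimate from the simpler linear varying slopes model.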
With this flexibility comes danger, though. Polynomial models are quite sensitive and
sometimes can produce bumps at the cutoff that are bigger than they should be. Therefore
we should always report simple linear models too so as not to look like we are fishing around for a result.
Remember This
1. When conducting regression discontinuity analysis it is useful to allow for a more
flexible relationship between assignment variable and outcome.
• A varying slopes model allows the slope to vary on different sides of the
treatment cutoff:
$$Y_i = \beta_0 + \beta_1 T_i + \beta_2 (X_{1i} - C) + \beta_3 (X_{1i} - C) T_i + \epsilon_i$$
• We can also use polynomial models to allow for non-linear relationships be-
tween the assignment and outcome variables.
[FIGURE 11.6: Various Fitted Lines for RD Model of Form Y_i = β0 + β1 T_i + β2 (X1i − C) + β3 (X1i − C) T_i; six panels, (a) through (f), each plotting Y against X with the cutoff marked]
Discussion Question
For each panel in Figure 11.6, indicate whether each of β1, β2, and β3 is less than, equal to, or greater than zero for the varying slopes RD model:

$$Y_i = \beta_0 + \beta_1 T_i + \beta_2 (X_{1i} - C) + \beta_3 (X_{1i} - C) T_i + \epsilon_i$$
11.3 Windows and Binned Graphs
There are other ways to make RD models flexible. An intuitive approach is to simply focus
on a subset of the data near the threshold. In this section we show the benefits and costs of
doing so and introduce binned graphs as a useful tool for all RD analysis.
As we discussed earlier, polynomial models can be a bit hard to work with. An easier
alternative (or at least supplement) to polynomial models is to narrow the window in which
we look. The window is the range of the assignment variable to which we limit our analysis;
we only look at observations with values of the assignment variable in this range. Ideally,
we’d make the window very, very small near the cutoff. For such a small window, we’d be
looking only at those observations just below and just above the cutoff. These observations
would be very similar and hence the treatment effect would be the difference in Y for the
untreated (those just below the cutoff) and the treated (those just above the cutoff).
A smaller window allows us to worry less about the functional form on both sides of the
cutoff. Figure 11.7 provides some examples. In panels (a) and (b), we show the same figures
as in Figure 11.5, but highlight a small window. Below each of these panels we show just
the line in the highlighted smaller window. While the relationships are quite non-linear for
the full window, we can see that they are approximately linear in the smaller windows. For
example, when we look only at observations where X1 is between -1 and 1 for panel (a) we
see two more or less linear lines, one on each side of the cutoff. When we look only at observations
where X1 is between -1 and 1 for panel (b) we see a more or less flat line below the cutoff and
a positively sloped line above the cutoff. So even though the actual relationships between
the assignment variable and Y are non-linear in both panels, a reasonably simple varying
slopes model should be more than sufficient when we focus on the smaller window. A smaller
window for these cases allows us to feel more comfortable that our results do not depend on
sensitive polynomial models, but instead reflect differences between treated and untreated observations near the cutoff.
As a practical matter, we usually don't have very many observations in a small window
near the cutoff, so in order to have any hope of having any statistical power, we'll need a window wide enough to include a reasonable number of observations.
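As a rough sketch of what restricting the window looks like in R (using the Computing Corner variable names; the window of 1 unit on each side of the cutoff is arbitrary):

# Varying slopes RD using only observations within 1 unit of the cutoff
InWindow = abs(X1minusC) < 1
WindowResults = lm(Y ~ T + X1minusC + X1minusC:T, subset = InWindow)
summary(WindowResults)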
Binned graphs
A useful way to visualize RD data is to create binned graphs. Binned graphs look like scatterplots but are a bit different.
To construct a bin plot, we divide the X1 variable into multiple regions (or “bins”) above
and below the cutoff and then calculate the average value of Y within each of those regions.
When we plot the data we get something that looks like panel (a) of Figure 11.8. Notice
there is a single observation for each bin, producing a cleaner graph than a scatterplot of all
observations.
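Here is one way a binned graph might be constructed in R; this is an illustrative sketch (the bin width of 1 and the variable names are assumptions, not the book's code):

# Label each observation with the midpoint of its bin on the assignment variable
Bin = floor(X1minusC) + 0.5
# Average the outcome within each bin and plot one point per bin
BinMeans = tapply(Y, Bin, mean)
plot(as.numeric(names(BinMeans)), BinMeans,
     xlab = "Assignment variable (X1 - C)", ylab = "Average Y within bin")
abline(v = 0, lty = 2)   # dashed vertical line at the cutoff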
FIGURE 11.7: Smaller Windows for Fitted Lines for Polynomial RD Model in Figure 11.5 [panels (a) and (b) plot the dependent variable (Y) against the assignment variable (X1 - C) with the cutoff marked; below each panel, the fitted lines are shown for the smaller window from -1 to 1]
The bin plot provides guidance for selecting the right RD model. If the relationship is
highly non-linear or seems dramatically different above and below the cutpoint, the bin plot
will let us know. In panel (a) of Figure 11.8 we see a bit of non-linearity in the relationship
because there is a U-shaped relationship between X1 and Y for values of X1 below the cutoff.
This relationship suggests a quadratic could be appropriate or, even simpler, we could narrow
the window to focus only on the range of X1 where the relationship is more linear. Panel (b)
of Figure 11.8 shows the fitted lines based on an analysis where only observations for which
X1 is between 900 and 2200 are used. The implied treatment effect is the bump in the data
indicated by β1 in the figure. We do not actually use the binned data to estimate the model; we estimate the model with the full set of individual observations and use the binned graph only to visualize the relationship.
Remember This
1. It is useful to look at smaller window sizes when possible by looking only at data
close to the treatment cutoff.
2. Binned graphs help us visualize the discontinuity and the possibly non-linear
relationship between assignment variable and the outcome.
[Figure 11.8: Binned graph of the dependent variable (Y). Panel (a) shows the binned data; panel (b) shows fitted lines on the restricted window, with the estimated treatment effect β1 marked at the cutoff.]
But does it work? Gormley, Phillips, and Gayer (2008) used RD to evaluate one piece
of the puzzle by looking at the impact of universal pre-K on test scores in Tulsa, Oklahoma.
They could do so because children born on or before September 1, 2001, were eligible to
enroll in the program in 2005-06, while children born after this date had to wait to enroll until the 2006-07 school year.
Figure 11.9 shows a bin plot for this analysis. The dependent variable is test scores from
a letter-word identification test that measures early writing skills. The children took the test
a year after the older kids started pre-K. The kids born before September 1 spent the year
in pre-K; the kids born after September 1 spent the year doing whatever it is four-year-olds do.
The horizontal axis shows age measured in days from the pre-K cutoff date. The data is
binned in groups of 14 days so that each data point shows the average test scores for children
with ages in a 14 day range. While the actual statistical analysis uses all observations, the
binned graph helps us better see the relationship between the cutoff and test scores than a plot of every individual observation would.

One of the nice features of RD is that the plot often tells the story. We'll do formal
statistical analysis in a second, but in this case, as in many RD examples, we can see the answer just by looking at the figure.
There’s no mistaking the data: there is a jump in test scores precisely at the point of
discontinuity. There’s a clear relationship of kids scoring higher as they get older (as we can
see from the positive slope on age) but right at the age-related cutoff for the pre-K program
there is a substantial bump up. The kids above the cutoff went to pre-K. The kids who
were below the cutoff did not. If the program had no effect, the kids who didn’t go to pre-K
would score lower than the kids who did, simply because they were younger. But there is no
obvious reason why there should be a discontinuity right at the cutoff except if the program itself had an effect.

Table 11.1 shows statistical results for the basic and varying slopes RD models. For the
basic model, the coefficient on the Pre-K variable is positive and highly significant, with a
t statistic of 10.31. The coefficient indicates the bump that we see in Figure 11.9. The age
variable is also highly significant. No surprise there, as older children did better on the test.
In the varying slopes model, the coefficient on the treatment is virtually unchanged from
the basic model, indicating a bump of 3.479 in test scores for the kids who went to pre-K.
The effect is highly statistically significant with a t statistic of 10.23. The coefficient on the
interaction is insignificant, indicating that the slope on age is the same for kids who had pre-K and kids who did not.

[Figure 11.9: Binned graph of letter-word identification test scores against age measured in days from the pre-K cutoff. Scores rise with age and jump up at the cutoff.]

Regression discontinuity identifies treatment effects as long as treatment depends on some threshold and the error term is
continuous at the treatment threshold. However, RD can go wrong, and in this section we
discuss situations in which RD doesn't work and how to detect these situations. We also discuss the extent to which RD results generalize beyond the cutoff.
Imperfect assignment
One drawback to the RD approach is that it’s pretty rare to have an assignment variable
that decisively determines treatment. If we’re looking at the effect of going to a certain
college, for example, we probably cannot use RD because admission was based on multiple
factors, none of which was cut and dried. Or if we’re trying to assess the effectiveness of
a political advertising campaign, it’s probably the case that the campaign didn’t simply
advertise in cities where its poll results were less than some threshold, but instead
probably used some criteria to identify where it might run ads and then weighed a number of other considerations in deciding where to actually advertise.
In the Further Reading section we point to readings on so-called fuzzy RD models that
can be used when the assignment variable imperfectly predicts treatment. Fuzzy RD models
can be useful when there is a point at which treatment becomes much more likely,
but not necessarily guaranteed. For example, a college might only look at people with a test
score of 160 or higher. Being above 160 may not guarantee admission, but there is a huge
jump up in probability of admission for those who score 160 instead of 159.
A bigger problem for RD models occurs when the error can be discontinuous at the treatment
threshold. Real people living their lives may do things that create a bump in the error term
at the discontinuity. For example, suppose that a GPA in high school above 3.0 makes
students eligible for a tuition discount at a state university. This seems like a promising RD
design: Use high school GPA as the assignment variable and set a threshold at 3.0. We can
then see, for example, if graduation rates (Y ) are higher for students who got the tuition
discount.
The problem is that the high school students (and teachers) know the threshold and how
close they are to it. Students who plan ahead and really want to go to college will make
damn sure that their high school GPA is north of 3.0. Students who are drifting through
life and haven’t gotten around to thinking about college won’t be so careful. Therefore we
could expect that when we are looking at students with GPA’s near 3.0, the more ambitious
students pile up on one side and the slackers pile up on the other. If we think ambition
influences graduation (it does!), then ambition (something in the error term) jumps at the discontinuity.
Therefore any RD analysis should discuss whether the only thing happening at the dis-
continuity is the treatment. Do the individuals know about the cutoff? Sometimes they
don’t. Perhaps a worker training program enrolls people who score over some number on
a screening test. The folks taking the test probably don’t know what the number is so it’s
unlikely they would be able to game the system. Or even if people know the score they need,
we can often reasonably assume they’ll do their best because they presumably won’t be able
to precisely know how much effort will be enough to exceed the cutoff. If the test can be
re-taken, though, the more ambitious folks might keep taking it until they pass while the
less ambitious will head home to watch Breaking Bad. In such a situation, something in the
error term (ambitiousness) would jump at the cutoff because the ambitious people would
tend to be above the cutoff and the less ambitious people would be below it.
Given the vulnerabilities of the RD model, two diagnostic tests are important to assess the
appropriateness of the RD approach. First, we want to know if the assignment variable itself
acts peculiar at the cutoff. If the values of the assignment variable cluster just above the
cutoff, we should worry that people know about the cutoff and are able to manipulate things
in order to get over it. In such a situation, it’s quite plausible that the people who are able to
just get over the cutoff are different than those who do not, perhaps because they have more
ambition (as in our example above) or because they have better contacts or information or
other advantages. To the extent that these factors also affect the dependent variable, we’ll
violate the assumption that the error term does not have a discrete jump at the cutoff.
The best way to assess whether there is clustering on one side of the cutoff is to create a
histogram of the assignment variable, looking for unusual activity in the assignment variable at the cutoff point.

[Figure 11.10: Histograms of the assignment variable. Panel (a) shows frequencies that bounce around but show no obvious clustering at the cutoff; panel (b) shows clear clustering just above the cutoff.]

Panel (a) in Figure 11.10 shows a histogram of assignment values in
a case where there is no obvious clustering. The frequency of values in each bin for the
assignment variable bounces around a bit here and there, but is mostly smooth and there is
no clear bump up or down at the cutoff. In contrast, the histogram in panel (b) shows clear
clustering just above the cutoff. When faced with data like panel (b), it’s pretty reasonable
to suspect that the word is out about what the cutoff is and that people are able to do something to get themselves just above it.
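In R, this eyeball test amounts to plotting a histogram of the assignment variable and marking the cutoff; a quick sketch (the number of bins is arbitrary, and different bin choices are worth checking):

# Histogram of the assignment variable, centered at the cutoff
hist(X1minusC, breaks = 20, main = "", xlab = "Assignment variable (X1 - C)")
abline(v = 0, lty = 2)   # dashed vertical line at the cutoff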
The second diagnostic test involves assessing whether other variables act weird at the
discontinuity. For RD to be valid, we want only Y to jump at the point where T equals one,
nothing else. If some other variable jumps at the discontinuity, we may wonder if the people
involved are somehow self-selecting (or being selected) based on some additional factors. If
so, it could be that these other factors that are jumping at the discontinuity may be causing
a jump in Y , not the treatment. A basic diagnostic test of this sort looks like
X2i = γ0 + γ1Ti + γ2(X1i - C) + νi
Ti = 1 if X1i ≥ C
Ti = 0 if X1i < C
A statistically significant γ̂1 coefficient from this model means that X2 jumps at the treatment discontinuity, which casts doubt on the main assumption of the RD model that the only thing changing at the cutoff is the treatment.

1 Formally testing for discontinuity of the assignment variable at the cutoff is a bit tricky. McCrary (2008) has
more. Usually, a visual assessment provides a good sense of what is going on, although it's a good idea to try different
bin sizes to make sure what you're seeing is not an artifact of one particular choice for bin size.
A significant γ̂1 from this diagnostic test doesn't necessarily kill the RD, but we would
need to control for X2 in the RD model and explain why this additional variable jumps at
the discontinuity. It also makes sense to conduct balance tests using varying slopes models.
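A balance test of this kind is just another OLS regression with the covariate on the left-hand side. A minimal R sketch, where X2 is a hypothetical covariate and the other names follow the Computing Corner:

# Balance test: does covariate X2 jump at the cutoff?
BalanceResults = lm(X2 ~ T + X1minusC)
summary(BalanceResults)   # a statistically significant coefficient on T is a warning sign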
Including any variable that jumps at the discontinuity is only a partial fix, though, because
if we observe a difference at the cutoff in some variable we can measure, it’s plausible that
there is also a difference at the cutoff in some variable we can’t measure. We can measure
education reasonably well; it’s a lot harder to measure intelligence and it’s extremely hard
to measure conscientiousness. If we see that people are more educated at the cutoff, we’ll
worry that they are also more intelligent and conscientious, meaning we’ll worry that at
the discontinuity our treated group may differ from the untreated group in ways we can’t
measure.
Generalizability of RD results
An RD estimate is what is known as the local average treatment effect (LATE). This concept comes up for instrumental
variables models as well (as discussed on page 469). The idea is that the effects of the
treatment may differ within the population: A training program might work great for some
types of people but do nothing for others. The treatment effect estimated by regression
discontinuity is the effect of the treatment on those folks who have X1 equal to the threshold.
Perhaps the treatment would have no effect on people with very low values of the assignment
variable. Or perhaps the treatment effect grows as the assignment variable grows. RD will
not be able to speak to these possibilities because we observe only the treatment happening at
one cutoff. Hence it is possible that the RD results do not generalize to the whole population.
Remember This
To assess the appropriateness of RD:
1. Qualitatively assess whether people have control over the assignment variable.
2. Conduct diagnostic tests.
• Assess the distribution of the assignment variable using a histogram to see if
there is clustering on one side of the cutoff.
• Run RD models using other covariates as dependent variables. The treatment
should not be associated with any discontinuity in any covariate.
The results in Table 11.2 are based on a varying slopes model in which the key variable is the dummy variable indicating
someone was older than 21 when he or she took the exam. This model also controlled for the
assignment variable, age, allowing the effect of age to vary before and after people turned
21. The dependent variable is standardized test scores, meaning that the results in the first
column indicate that turning 21 decreased test scores by 0.092 standard deviations. This
effect is highly statistically significant with a t statistic of 30.67. Adding controls strengthens
the results, as reported in the second column. The results are quite similar when we allow
the age variable to affect test scores non-linearly by including a quadratic function of age in
the model.
Are we confident that the only thing that happens at the discontinuity is that students
become eligible to drink? That is, are we confident that there is no discontinuity in the error
term at the point people turn 21? First, we want to think about the issue qualitatively.
Table 11.2: RD Analysis of Drinking Age and Test Scores (from Carrell, Hoekstra, and West 2010)
Obviously, people can’t affect their age, so there’s little worry that people are manipulating
the assignment variable. And while it is possible, for example, that good students decide to
drop out just after their 21st birthday, which would mean that the students we observe who
just turned 21 are more likely to be bad students, this possibility doesn’t seem particularly
likely.
We can also run diagnostic tests. Figure 11.11 shows the frequency of observations for
students above and below the age cutoff. There is no sign of people manipulating the
assignment variable because the distribution of ages is mostly constant, with some apparently random bumps here and there.
[FIGURE 11.11: Histogram of Age Observations for Drinking Age Case Study]

We can also assess whether other covariates showed discontinuities at the 21st birthday.
As discussed above, the defining RD assumption is that the only discontinuity at the cutoff is
the treatment, so we estimate models in which other variables are used as dependent variables. The model we're testing is
Covariatei = γ0 + γ1Ti + γ2(Agei - C) + νi
Ti = 1 if X1i ≥ C
Ti = 0 if X1i < C
Table 11.3 shows results for three covariates: SAT math scores, SAT verbal scores, and
physical fitness. For none of these covariates is γ̂1 statistically significant, suggesting that
there is no bump in covariates at the point of the discontinuity, something that is consistent
with the idea that the only thing changing at the discontinuity is the treatment.
Table 11.3: RD Diagnostics for Drinking Age and Test Scores (from Carrell, Hoekstra, and West 2010)
11.5 Conclusion
Regression discontinuity is a powerful statistical tool. It works even when the treatment we
are trying to analyze is correlated with the error. It works because the assignment variable –
a variable that determines whether or not a unit gets the treatment – soaks up endogeneity.
The only assumption we need is that there is no discontinuity in the error term at the cutoff.
If we have such a situation, the basic RD model is super simple. It is just an OLS
model with a dummy variable (indicating treatment) and a variable indicating distance to
the cutoff. More complicated RD models allow for more complicated relationships between
the assignment variable and the dependent variable, but no matter the model, the heart of
the RD remains looking for a bump in the value of Y at the cutoff point for the assignment
variable. If the treatment is the only thing that jumps at the discontinuity, then we can attribute any bump in the dependent variable to
the treatment. RD fills a hole where panel, instrumental variable, or experimental techniques aren't up to the task.
RD analysis is quite clean. Anybody can pretty much see the answer by looking at a binned
graph and the statistical models are relatively simple to implement and explain.
RD is not without pitfalls, however. If people can manipulate their score on the assign-
ment variable, then the RD estimate no longer simply captures the effect of treatment, but
also captures the effects of whatever qualities are overrepresented among the folks who were
able to get their assignment score above the threshold. For this reason it is very important
to report diagnostics that help us sniff out possible discontinuities in the error term at the
cutoff.
• Section 11.1: Write down a basic regression discontinuity model and explain all terms,
including treatment variable, assignment variable, and cutoff. Explain how RD models
overcome endogeneity.
• Section 11.2: Write down and explain RD models with varying slopes and non-linear
relationships.
• Section 11.3: Explain why it is useful to look at a smaller window. Explain a binned graph.
• Section 11.4: Explain conditions under which RD might not be appropriate. Explain diagnostic tests used to assess the appropriateness of RD.
Further Reading
Imbens and Lemieux (2008) and Lee and Lemieux (2010) go into additional detail on re-
gression discontinuity designs in a way that is useful for practitioners, including discussions
of fuzzy RD models. Bloom (2012) is another useful overview of RD methods. Cook (2008)
and related studies compare results of regression discontinuity and experiments and find that regression discontinuity works well.
See Grimmer, Hersh, Feinstein, and Carpenter (2010) for an example of using diagnostics to assess an RD design.
Key Terms
• Assignment variable (545)
• Binned graphs (562)
• Discontinuity (541)
• Fuzzy RD models (570)
• Local average treatment effect (LATE) (575)
• Regression discontinuity (RD)(544)
• Window (561)
Computing Corner
Stata
To estimate an RD model in Stata, create a dummy treatment variable and an X1 - C
variable and use the syntax for multivariate OLS.
1. The following commands create variables needed for RD. Note that a scalar variable
is simply a variable with a single value (in contrast to a typical variable that has a list
of values).
scalar cutoff = 10 /* Create scalar variable equal to cutoff */
gen T = 0 /* Initially create a T with all zeros */
replace T = 1 if X1 > cutoff /* For all T with X1 > cutoff value, set */
/* value of T to be one */
gen X1minusC = X1 - cutoff /* Creates X1-C variable */
2. Basic RD is a simple OLS model:
reg Y T X1minusC
3. To estimate a model with varying slopes, first create an interaction variable and then
run OLS:
gen X1minusCxT = X1minusC * T
reg Y T X1minusC X1minusCxT
4. To create a scatterplot with the fitted lines from a varying slopes RD model, do the
following:
graph twoway (scatter Y X1minusC) (lfit Y X1minusC if T == 0) /*
*/ (lfit Y X1minusC if T == 1)
R
To estimate an RD model in R, we create a dummy treatment variable and an X1 - C variable
and use the syntax for multivariate OLS.
1. The following commands create variables needed for RD. Note that a scalar variable is
simply a variable with a single value (in contrast to a typical variable that has a list of
values).
Cutoff = 10 # Create scalar variable equal to cutoff
T = rep(0, length(X1)) # Initially create a T with all zeros
T[X1 > Cutoff] = 1 # For all T with X1 > cutoff value, set
# value of T to be one
X1minusC = X1 - Cutoff # Creates X1-C variable
2. Basic RD is a simple OLS model:
RDResults = lm(Y ~ T + X1minusC)
3. To estimate a model with varying slopes, first create an interaction variable and then
run OLS:
X1minusCxT = X1minusC * T
RDResults = lm(Y ~ T + X1minusC + X1minusCxT)
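4. To create a scatterplot of the data with separate fitted lines on each side of the cutoff, one possible approach (a sketch, not code from the book) is:

# Scatterplot with fitted lines below and above the cutoff
plot(X1minusC, Y)
abline(lm(Y[T == 0] ~ X1minusC[T == 0]))   # fitted line for untreated observations
abline(lm(Y[T == 1] ~ X1minusC[T == 1]))   # fitted line for treated observations
abline(v = 0, lty = 2)                     # dashed vertical line at the cutoff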
Exercises
1. As discussed on page 566 Gormley, Phillips, and Gayer (2008) used RD to evaluate the
impact of pre-K on test scores in Tulsa. Children born on or before September 1, 2001,
were eligible to enroll in the program in 2005-06, while children born after this date
had to wait to enroll until the 2006-07 school year. Table 11.4 lists the variables. The
pre-K data set covers 1,943 children just beginning the program in 2006-07 (preschool
entrants) and 1,568 children who just finished the program and began kindergarten in
2006-07 (preschool alumni).
Variable Description
age Age, days from the birthday cutoff. The cutoff value is coded as 0,
negative values indicate days born after the cutoff; positive values
indicate days born before the cutoff
cutoff Treatment indicator (1 = born before cutoff, 0 = born after cutoff)
wjtest01 Woodcock-Johnson letter-word identification test score
female Female (1 = yes, 0 = no)
black Black (1 = yes, 0 = no)
white White (1 = yes, 0 = no)
hispanic Hispanic (1 = yes, 0 = no)
freelunch Eligible for free lunch based on low income in 2006-07 (1 = yes, 0 = no)
a. Why should there be a bump in the dependent variable right at the point where
a child’s birthday renders him or her eligible to have participated in preschool the
previous year (2005-06) rather than the current year (2006-07)? Should we see jumps
at other points as well?
b. Assess whether there is a discontinuity at the cutoff for the free-lunch status, gender,
and race/ethnicity covariates.
c. Repeat the tests for covariate discontinuities restricting the sample to a one-month
(30 day) window on either side of the cutoff. Do the results change? Why or why
not?
d. Using letter-word identification test score as the dependent variable, estimate a basic
regression discontinuity model controlling for treatment status (born before the cut-
off) and the assignment variable (age measured as days from the cutoff). What is the
estimated effect of the preschool program on letter-word identification test scores?
e. Estimate the effect of pre-K using an RD specification that allows the relationship
to vary on either side of the cutoff. Do the results change? Should we prefer this
model? Why or why not?
f. Add controls for lunch status, gender, and race/ethnicity to the model. Does adding
these controls change the results? Why or why not?
g. Re-estimate the model from part (f) limiting the window to one month (30 days)
on either side of the cutoff. Do the results change? How do the standard errors in
this model compare to those from the model using the full data set?
2. Gormley, Phillips, and Gayer (2008) also used RD to evaluate the impact of Head Start
on test scores in Tulsa. Children born on or before September 1, 2001, were eligible
to enroll in the program in 2005-06, while children born after this date had to wait to
enroll until the 2006-07 school year. The variable names and definitions are the same
as in Table 11.4, although in this case the data refers to 732 children just beginning
the program in 2006-07 (Head Start entrants) and 470 children who just finished the
program and were beginning kindergarten in 2006-07 (Head Start alumni).
a. Assess whether there is a discontinuity at the cutoff for the free-lunch status, gender,
and race/ethnicity covariates.
b. Repeat the tests for covariate discontinuities restricting the sample to a one-month
(30 day) window on either side of the cutoff. Do the results change? Why or why
not?
c. Using letter-word identification test score as the dependent variable, estimate a basic
regression discontinuity model. What is the estimated effect of the preschool program
on letter-word identification test scores?
d. Estimate the effect of Head Start using an RD specification that allows the relation-
ship to vary on either side of the cutoff. Do the results change? Should we prefer
this model? Why or why not?
e. Add controls for lunch status, gender, and race/ethnicity to the model. Do the results
change? Why or why not?
f. Re-estimate the model from part (e) limiting the window to one month (30 days) on
either side of the cutoff. Do the results change? How do the standard errors in this
model compare to those from the model using the full data set?
3. Congressional elections are decided by a clear rule: whoever gets the most votes in
November wins. Because virtually every congressional race in the United States is
between two parties, that means whoever gets more than 50 percent of the vote wins.2
We can use this fact to estimate the effect of political party on ideology. Some argue that
Republicans and Democrats are very distinctive; others argue that members of Congress
have strong incentives to respond to the median voter in their districts, regardless of
party. We can assess how much party matters by looking at the ideology of members of
Congress in the 112th Congress (from 2011 to 2012). Table 11.5 lists the variables.
a. Suppose we try to explain congressional ideology as a function of political party only.
Explain how endogeneity might be a problem.
b. How can an RD model fight endogeneity when trying to assess if and how party
affects congressional ideology?
c. Generate a scatterplot of congressional ideology against GOP2party and based on
this plot discuss what you think the RD will indicate.
2 We’ll look only at votes going to the two major parties, Democrats and Republicans in order to have a nice 50
percent cutoff.
Variable Description
GOP2party2010 The percent of the vote received by the Republican congressional candidate
in the district in 2010. Ranges from 0 to 1.
GOPwin2010 Dummy variable indicating Republican won; equals 1 if GOP2party2010 > 0.5.
Ideology The conservativism of the member of Congress as measured by Carroll, Lewis,
Lo, Poole, and Rosenthal (2009, 2014). Ranges from -0.779 to 1.293. Higher
values indicate more conservative voting in Congress.
ChildPoverty Percentage of district children living in poverty. Ranges from 0.03 to 0.49.
MedianIncome Median income in the district. Ranges from $23,291 to $103,664.
Obama2008 Percent of vote for Barack Obama in the district in 2008 presidential election.
Ranges from 0.23 to 0.95.
WhitePct Percent of the district that is non-Hispanic White. Ranges from 0.03 to 0.97.
d. Write down a basic RD model for this question and explain the terms.
e. Estimate a basic RD model and interpret coefficients.
f. Create an adjusted assignment variable (equal to GOP2party2010 - 0.50) and use it
to estimate a varying slopes RD model and interpret coefficients. Create a plot that
has a scatterplot of the data and fitted lines from the model. Calculate the fitted
values for four observations: a Democrat with GOP2party2010 = 0, a Democrat with
GOP2party2010 = 0.5, a Republican with GOP2party2010 = 0.5, and a Republican
with GOP2party2010 = 1.0.
g. Re-estimate the varying slopes model but use the unadjusted variable (and unad-
justed interaction). Compare coefficient estimates to your results in part (f). Calcu-
late the fitted values for four observations: a Democrat with GOP2party2010 = 0,
a Democrat with GOP2party2010 = 0.5, a Republican with GOP2party2010 = 0.5
and a Republican with GOP2party2010 = 1.0. Compare to the fitted values in part
(f).
h. Assess whether there is clustering of the assignment variable just above the cutoff.
i. Assess whether there are discontinuities at GOPVote = 0.50 for ChildPoverty, Me-
dianIncome, Obama2008 and WhitePct. Discuss the implications of your findings.
j. Estimate a varying slopes model controlling for ChildPoverty, MedianIncome, Obama2008,
and WhitePct. Discuss these results in light of your findings from part (i).
k. Estimate a quadratic RD model and interpret results.
l. Estimate a varying slopes model with a window of GOP vote share from 0.4 to 0.6.
Discuss any meaningful differences in coefficients and standard errors from the earlier
varying slopes model.
Variable Description
County County indicator
Mortality County mortality rate for children aged 5 to 9 from 1973 to 1983, limited
to causes plausibly affected by Head Start
Poverty Poverty rate in 1960. Transformed by subtracting off cutoff; also divided
by 10 for easier interpretation
HeadStart Dummy variable indicating counties that received Head Start assistance.
Counties with poverty greater than 59.2 are coded as 1; counties with
poverty less than 59.2 are coded as 0
Bin The “bin” label for each observation, based on dividing the poverty variable into 50 bins
a. Write out an equation for a basic RD design to assess the effect of Head Start assis-
tance on child mortality rates. Draw a picture of what you expect the relationship
to look like. Note that in this example, treatment occurs for low values of the as-
signment variable.
b. Explain how RD can identify a causal effect of Head Start assistance on mortality.
c. Estimate the effect of Head Start on mortality rate using a basic RD design.
d. Estimate the effect of Head Start on mortality rate using a varying slopes RD design.
e. Estimate a basic RD model with (adjusted) poverty values that are between -0.8 and
0.8. Comment on your findings.
f. Implement a quadratic RD design. Comment on the results.
g. Create a scatterplot of the mortality and poverty data. What do you see?
h. Use the following code to create a binned graph of the mortality and poverty data.
What do you see?3
3The trick to creating a binned graph is associating each observation with a bin label that is in the middle of the bin.
Stata code that does this is scalar BinNum = 50; scalar BinMin = -6; scalar BinMax = 3; scalar BinLength
Part III
CHAPTER 12
Dummy Dependent Variables
When we use data to analyze such phenomena – and many others – we need to confront
the fact that the outcomes are dichotomous. They either happened or didn’t, meaning
that our dependent variable is either 1 (happened) or 0 (didn’t happen). Although we can
continue to use OLS for dichotomous dependent variables, the probit and logit models we
introduce in this chapter often fit the data better. Probit and logit models come with a cost, though: their coefficients are harder to interpret.
This chapter explains how to deal with dichotomous dependent variables. Section 12.1
shows how to use OLS to estimate these models. OLS does fine, but there are some things
that aren’t quite right. Hence Section 12.2 introduces a new model, called a latent variable
model, to model dichotomous outcomes. Section 12.3 then presents the workhorse probit and
logit models. These models differ from OLS and Section 12.4 explains how. Section 12.5 then
presents the somewhat laborious process of interpreting coefficients from these models. Probit
and logit models have several cool properties, but ease of interpretation is not one of them.
Section 12.6 concludes by showing how to test hypotheses involving multiple coefficients.
The easiest way to analyze a dichotomous dependent variable is to use the linear prob-
ability model (LPM). This is just a fancy way of saying just run your darn OLS model
already.1 The LPM has witnessed a bit of a renaissance lately as people have realized that
despite some clear defects, it often conveniently and effectively characterizes the relationships
between independent variables and outcomes. If there is no endogeneity (a big if, as
we know all too well), then the coefficients will be the right sign and will generally imply a
substantive relationship similar to that estimated by the more complicated probit and logit models.

1 We discussed dichotomous independent variables in Chapter 7.

In this section we show how the LPM model works and describe its limitations.
One nice feature of OLS is that it generates the best estimate of the expected value of Y
as a linear function of the independent variables. In other words, we can think of OLS as
providing us

E[Yi | X1, X2] = β0 + β1X1i + β2X2i

where E[Yi | X1, X2] is the expected value of Yi given the values of X1i and X2i.
When the dependent variable is dichotomous, the expected value of Y is equal to the
probability the variable equals one. For example, consider a dependent variable that is 1 if it

2 The terms linear and non-linear can sometimes get confusing in statistics. A linear model is one of the form
Yi = β0 + β1X1i + β2X2i + ... where none of the parameters to be estimated are multiplied, divided, or raised to
powers of other parameters. In other words, all the parameters enter in their own little plus term. A non-linear model
is one where some of the parameters are multiplied, divided, or raised to powers of other parameters. Linear models
can estimate some non-linear relationships (by creating terms that are functions of the independent variables, not
the parameters). We described this process in Section 7.1 of Chapter 7. Such polynomial models will not, however,
solve the deficiencies of OLS for dichotomous dependent variables. The models that do address the problems, the
probit and logit models we cover later in this chapter, are complex functions of other parameters and are therefore
necessarily non-linear models.
rains and 0 if it doesn’t. If there is a 40% chance of rain, the expected value of this variable
is 0.40. If there is an 85% chance of rain, the expected value of this variable is 0.85. In other
words, because E[Y|X] = Probability(Y = 1|X), OLS with a dichotomous dependent variable provides a model of the probability that Y equals one given the independent variables.
The interpretation of β̂1 from this model is that a one-unit increase in X1 is associated with a β̂1 increase in the probability that Y equals one.
Table 12.1 displays the results from an LPM model of the probability of admission into
a competitive Canadian law school. The independent variable is college grade point average
(GPA) (measured on a 100 point scale, as is common in Canada). The coefficient on GPA is
0.032, meaning that an increase of one point on the 100-point GPA scale is associated with a 0.032 increase in the probability of admission.
Table 12.1

GPA             0.032*
                (0.003)
                [t = 12.29]
Constant        -2.28*
                (0.206)
                [t = 11.10]
N               514
R²              0.23
Minimum Ŷi      -0.995
Maximum Ŷi      0.682

Standard errors in parentheses
* indicates significance at p < 0.05
FIGURE 12.1: Scatterplot of Law School Admissions Data and LPM Fitted Line [admitted (1) or rejected (0) plotted against GPA on a 100-point scale, with the LPM fitted line]
Figure 12.1 shows a scatterplot of the law school admissions data with the fitted line from
the LPM model included. The scatterplot looks different than a typical regression model
scatterplot because the dependent variable is either 0 or 1, creating two horizontal lines of
observations. Each point is a light vertical line and the scatterplot looks like a dark bar
where there are many observations. We can see that folks with GPAs under 80 mostly do
not get admitted while people with GPAs above 85 mostly do get admitted.
The expected value of Y based on the LPM model is a straight line with a slope of 0.032.
Clearly, as GPAs rise, the probability of admission rises as well. The difference from OLS is
that instead of interpreting β̂1 as the increase in the value of Y associated with a one-unit
increase in X, we now interpret β̂1 as the increase in the probability that Y equals one associated with a one-unit increase in X.
Limits to LPM
While Figure 12.1 is generally sensible, it also has a glaring flaw. The fitted line goes below
zero. In fact, the fitted line goes far below zero. The poor soul with a GPA of 40 has a fitted
value of -0.995. This is nonsensical (and a bit sad). Probabilities must be between 0 and 1.
For a low enough value of X, the predicted value falls below zero; for a high enough value of X, it rises above one.
The problem with LPM isn’t only that it sometimes provides fitted values that make no
sense. We could, after all, simply say that any time we see a fitted value below 0, we’ll call
that a 0 and anytime we see a fitted value above 1 we’ll call that a 1. The deeper problem
is that fitting a straight line to data with a dichotomous dependent variable runs the risk
of misspecifying the relationship between the independent variables and the dichotomous
dependent variable.
Figure 12.2 illustrates an example of LPM's problem. Panel (a) depicts a fitted line
from an LPM model of law school admissions based on the six hypothetical
observations indicated. The line is reasonably steep, implying a clear relationship. Now
suppose that we add three observations from applicants with very high GPAs, all of whom
were admitted. These observations are the triangles in the upper right of panel (b). Common
sense suggests these observations should strengthen our belief that GPAs predict admission
into law school. Sadly, LPM lacks common sense. The figure shows that the LPM fitted line
with the new observations (the dashed line) is flatter than the original estimate, implying
that the estimated relationship is weaker than the relationship we estimated in the original sample.
What’s that all about? It’s pretty easy to understand once we appreciate that the LPM
needs to fit a linear relationship. If these three new applicants had higher GPAs, from an
LPM perspective we should expect them to have a higher probability of admission than the
applicants in the initial sample. But the dependent variable can’t get higher than one, so
the LPM interprets the new data as suggesting a weaker relationship. In other
words, because these applicants had higher independent variables but not higher dependent
variables, the LPM model infers that the independent variable is not driving the dependent
variable higher.

[Figure 12.2: Panel (a) shows an LPM fitted line through six hypothetical law school admission observations; panel (b) adds three admitted applicants with very high GPAs and shows that the new fitted line (dashed) is flatter. Both panels plot probability of admission against GPA on a 100-point scale.]
What really is going on is that once GPAs are high enough, students are pretty much
all admitted. The probability of admission rises with GPAs up to a certain level, but then levels off as applicants are pretty
much all admitted when their GPAs are above that level. The probit and logit models we introduce below are designed to capture this kind of relationship.
In LPM’s defense, it won’t systematically estimate positive slopes when the actual slope is
negative. And we should not underestimate its convenience and practicality. Nonetheless, we
should worry that LPM may sometimes leave us with an incomplete view of the relationship between the independent variables and the probability that Y equals one.
4 LPM also has a heteroscedasticity problem. As discussed earlier, heteroscedasticity seldom is a more serious
problem than endogeneity, but the heteroscedasticity means that we have to cast a skeptical eye toward standard
errors estimated by LPM. There is a fix to dealing with this problem, but the process is complicated enough that we
might as well run the probit or logit models described below. For more details, see Long (1997, 39).
Remember This
The linear probability model (LPM) uses OLS to estimate a model with a dichotomous
dependent variable.
1. The coefficients are easy to interpret: a one-unit increase in Xj is associated with
a βj increase in the probability that Y equals one.
2. Limitations of the LPM include:
• Fitted values of Ŷi may be greater than one or less than zero.
• Coefficients from an LPM model may mischaracterize the nature of the rela-
tionship between X and Y .
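Because the LPM is just OLS with a 0/1 dependent variable, fitting it in R takes one line; a sketch with hypothetical variables admit (1 = admitted, 0 = not) and gpa:

# Linear probability model: OLS on a dichotomous dependent variable
LPMResults = lm(admit ~ gpa)
summary(LPMResults)        # slope = change in Pr(admit = 1) per one-point GPA increase
range(fitted(LPMResults))  # fitted values can fall below 0 or above 1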
Given these limits to the LPM model, our goal is to develop a model that will produce fitted
values between zero and one. In this section, we describe the S-curves that achieve this goal
and introduce latent variables as a tool that will help us estimate S-curves.
S-curves
Figure 12.3 shows the law school admissions data. The LPM fitted line, in all its negative
probability glory, is there, but we have also added a fitted curve from a probit model. The
probit fitted line looks like a tilted letter “S” such that the relationship between X and the
dichotomous dependent variable is non-linear. We explain how to generate such a curve over
the course of this chapter, but for now let’s note some of its nice features.
[FIGURE 12.3: Scatterplot of Law School Admissions Data and LPM and Probit Fitted Lines (probability of admission plotted against GPA on a 100-point scale)]

For applicants with GPAs below 70 or so, the probit fitted line has flattened out. This
means no matter how low their GPAs go, their fitted probability of admission does not go
below zero. For applicants with very high GPAs, increasing GPA leads to only small increases
in the probability of admission. Even if GPAs were to go very, very high, the probit fitted
line flattens out so that no one will have a predicted probability of admission greater than
1.
Not only does the S-shaped curve of the probit fitted line avoid nonsensical probability
estimates, it also reflects the data better in several respects. First, there is a range of GPAs
where the effect on admissions is quite high. Look in the range from around 80 to around
90. As GPA rises in this range, the effect on probability of admission is quite high, much
higher than implied by the LPM fitted line. Second, even though the LPM fitted values for
the high GPAs are logically possible (because they are between 0 and 1), they don’t reflect
the data particularly well. The person with the highest GPA in the entire sample (a GPA
of 92) is predicted by the LPM model to have only a 68% probability of admission. The
probit model, in contrast, predicts a 96% probability of admission for this GPA star.
Latent variables
To generate such non-linear fitted lines, we’re going to think in terms of a latent variable.
Something is latent if you don’t see it. A latent variable is something we don’t see, at
least not directly. We'll think of the observed dummy dependent variable (which is zero or
one) as being driven by an unobserved, continuous latent variable. If the value of an observation's
latent variable is high, then the dependent variable for that observation is likely to be one;
if the value of an observation's latent variable is low, then the dependent variable for that
observation is unlikely to equal 1.
Here’s an example. Pundits and politicians obsess over presidential approval. They know
that the president’s re-election and policy choices are often tied to the state of his approval.
Pollsters regularly ask survey respondents: Do you approve or disapprove of the way the President is handling his job? That's our dichotomous dependent variable, but
we know full well that the range of responses to the president covers a lot more than two
choices. Some people froth at the mouth in anger at the mention of the president. Others adore the president, and many are somewhere in between.
It’s useful to think of these different attitudes as different latent attitudes toward the
president. We can think of the people who hate the president as having very negative values
of a latent presidential approval variable. People who are so-so about the president have
values of a latent presidential approval variable near zero. People who love the president have very high values of the latent variable.
We think in terms of a latent variable because it is easy to write down a model for a
continuous latent measure of the propensity to approve of the president. It looks like an OLS
model. Specifically, Yi* (pronounced "Y-star") is the latent propensity to be a 1 (an ugly
phrase, but that's really what it is). It depends on some independent variable X and the βs.
We'll model the observed dichotomous dependent variable as a function of this unobserved
latent variable. We observe Y = 1 (notice the lack of a star) for people whose latent feelings
are above zero.5 If the latent variable is less than zero, we observe Y = 0. (We ignore ties at exactly zero, which occur with probability zero for a continuous latent variable.)
This latent variable approach is consistent with how the world works. There are folks
who approve of the president but differ in the degree to which they approve; they are all
ones in the observed variable (Y) but vary in the latent variable (Y*). There are folks who
disapprove of the president but differ in the degree to which they disapprove of the president;
they are all zeros in the observed variable (Y) but vary in the latent variable (Y*).
Formally, we connect the latent and observed variables as follows. The observed variable
is

Yi = 0 if Yi* < 0
Yi = 1 if Yi* ≥ 0

Because Yi* = β0 + β1Xi + εi, the condition for observing Yi = 1 is that

β0 + β1Xi + εi ≥ 0

which is equivalent to

εi ≥ -β0 - β1Xi
5 Because the latent variable is unobserved, we have the luxury of labeling the point in the latent variable space at
which folks become 1’s as zero.
In other words, if the random error term is greater than or equal to -β0 - β1Xi, we'll observe Yi = 1.
With this characterization, the probability that the dependent variable is one is necessarily
bounded between 0 and 1 because it is expressed in terms of the probability that the error
term is greater or less than some number. Our task in the next section is to characterize the distribution of that error term.
Remember This
Latent variable models are helpful to analyze dichotomous dependent variables.
1. The latent (unobserved) variable is

Yi* = β0 + β1Xi + εi

2. We observe Yi = 1 when Yi* ≥ 0 and Yi = 0 when Yi* < 0.
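A small simulation helps make the latent variable setup concrete; this is an illustrative sketch with arbitrary parameter values, not code from the book:

set.seed(1)
X = rnorm(1000)                      # independent variable
Ystar = -0.5 + 1 * X + rnorm(1000)   # latent propensity: beta0 + beta1*X + normal error
Y = as.numeric(Ystar >= 0)           # we observe only whether the latent variable is >= 0
table(Y)                             # observed 0/1 outcomes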
Probit model and logit model both allow us to estimate the relationship between X and Y
in a way that the fitted values are necessarily between 0 and 1, thereby producing estimates
that more accurately capture the full relationship between X and Y than do LPM models.
The probit and logit models are effectively very similar, but they differ in the equations
they use to characterize the error term distributions. In this section we explain the equations for each model.
Probit model
The key assumption in a probit model is that the error term (εi) is itself normally distributed.
We've worked with the normal distribution a lot because the Central Limit Theorem
(from page 85) implies that with enough data, OLS coefficient estimates are normally
distributed no matter how ε is distributed. For the probit model we're saying that ε itself is
normally distributed. So while normality of β̂1 is a proven result for OLS, normality of ε is an assumption.
Before we explain the equation for the probit model, it is useful to do a bit of bookkeeping.
We have shown that Pr(Yi = 1|X1) = Pr(εi ≥ -β0 - β1X1i), but this equation is a bit hard
to work with given the widespread convention in probability to characterize the distribution
of a random variable in terms of the probability that it is less than some value. Therefore,
we’re going to do a quick trick based on the symmetry of the normal distribution, a property
that means the distribution has the same shape on each side of its mean. This means that
the probability of seeing something larger than some number is the same as the probability of
seeing something less than the negative of that number. Figure 12.4 illustrates this property.
In panel (a), we shade the probability of being greater than -1.5. In panel (b), we shade the
probability of being less than 1.5.

[Figure 12.4: Two standard normal densities. Panel (a) shades the area to the right of -1.5; panel (b) shades the area to the left of 1.5.]

The symmetry of the normal distribution backs up what
our eyes suggest: the shaded areas are equally sized, indicating equal probabilities. In other
words, Prob(εi > -1.5) = Prob(εi < 1.5). This fact allows us to re-write Pr(Yi = 1|X1) = Pr(εi ≥ -β0 - β1X1i) as Pr(Yi = 1|X1) = Pr(εi ≤ β0 + β1X1i).
There isn’t a huge conceptual issue here, but simply one that makes it much easier to charac-
terize the model with conventional tools for working with normal distributions. In particular,
stating the condition in this way simplifies our use of the cumulative distribution func-
tion (CDF) of a standard normal distribution. The CDF tells us how much of a normal
distribution is to the left of any given point. Feed it a number and the CDF function will
tell us the probability a standard normal random variable is less than that number.
Figure 12.5 shows examples for several values of β0 + β1X1i. Panel (a) shows a standard
normal PDF with the portion to the left of -0.7 shaded. Below that in panel (d) we show a
CDF function with the value of the CDF at -0.7 highlighted. The value is roughly equal to
0.25 which is the area of the normal curve that is to the left of -0.7 in panel (a).
Panel (b) shows a standard normal density curve with the portion to the left of +0.7
shaded. Clearly this is more than half of the distribution. The CDF function below it in
panel (e) shows that, in fact, roughly 0.75 of a standard normal density is to the left of +0.7.
Panel (c) shows a standard normal PDF with the portion to the left of 2.3 shaded. Panel
(f) below that shows a CDF function with the value of the CDF at 2.3 highlighted, which is
about 0.99. Notice that the CDF function can't be less than zero or more than one because
it is impossible to have less than zero percent or more than 100 percent of the area of the
distribution to the left of any given point.

[Figure 12.5: Standard normal PDFs with the areas to the left of -0.7, 0.7, and 2.3 shaded (panels (a), (b), and (c)), and the corresponding CDF values of roughly 0.25, 0.75, and 0.99 (panels (d), (e), and (f)).]
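In R, the standard normal CDF is the built-in pnorm() function, so the values highlighted in Figure 12.5 can be checked directly:

pnorm(-0.7)   # about 0.24: area of the standard normal to the left of -0.7
pnorm(0.7)    # about 0.76
pnorm(2.3)    # about 0.99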
The notation we'll use for the normal CDF is Φ() (the Greek letter Φ is pronounced "fi,"
as in wi-fi), which indicates the probability that a normally distributed random variable (ε
in our case) is less than the value inside the parentheses. With this notation, the probit model is

Prob(Yi = 1) = Φ(β0 + β1X1i)
The probit model produces estimates of β that best fit the data. That is, to the extent
possible, probit estimates will produce β̂s that lead to high predicted probabilities for observations
that actually were 1s. Likewise, to the extent possible, probit estimates will produce
β̂s that lead to low predicted probabilities for observations that actually were 0s. We discuss
the estimation process in more detail in Section 12.4.
Logit model
Logit models also allow us to estimate parameters for a model with a dichotomous de-
pendent variable in a way that the fitted values are necessarily between 0 and 1. They are
functionally very similar to probit models. The difference from a probit model is the equa-
tion that characterizes the error term. The equation differs dramatically from the probit
equation, but it turns out this difference has little practical import.
In a logit model

Prob(Yi = 1) = e^(β0 + β1X1i) / (1 + e^(β0 + β1X1i))
To get a feel for the logit equation, consider when β0 + β1X1i is humongous. In the numerator
e is raised to that big number, which leads to a super big number. In the denominator will
be that same number plus 1, which is pretty much the same number. Hence the probability
will be very, very close to one. But no matter how big β0 + β1X1i gets, the probability will never exceed one.
If β0 + β1X1i is super negative, then the numerator of the logit function will have e raised
to a huge negative number, which is the same as one over e raised to a big number, which
is essentially zero. The denominator will have that number plus one, meaning the fraction
is very close to 0/1, which means that the probability that Yi = 1 will be very, very close to
zero. No matter how negative β0 + β1X1i gets, the probability will never go below zero.6
6 If β0 + β1X1i is zero, then Prob(Yi = 1) = 0.5. It's a good exercise to work out why. The logit function can also
be written as

Prob(Yi = 1) = 1 / (1 + e^(-(β0 + β1X1i)))
The probit and logit models are rivals, but friendly rivals. When properly interpreted,
they yield virtually identical results. Do not sweat the difference. Simply pick probit or logit
and get on with life. Back in the early days of computers, the logit model was often preferred
because it is computationally easier than the probit model. Now powerful computers make the computational difference irrelevant.
Remember This
The probit and logit models are very similar. Both estimate S-shaped fitted lines that
are always above zero and below one.
1. In a probit model, Prob(Yi = 1) = Φ(β0 + β1X1i).
2. In a logit model, Prob(Yi = 1) = e^(β0 + β1X1i) / (1 + e^(β0 + β1X1i)).
12.4 Estimation
So how do we select the best β̂ given the data? The estimation process for the probit and
logit models is called maximum likelihood estimation (MLE). This process is more com-
plicated than estimating coefficients using OLS. Understanding the inner workings of MLE
is not necessary to implement or understand probit and logit models. Such an understanding
can be helpful, however, for more advanced work and we discuss the technique in more detail
In this section we explain the properties of MLE estimates, describe the fitted values
produced by probit and logit models, and show how goodness of fit is measured in MLE
models.
Happily, many major statistical properties of OLS estimates carry over to MLE estimates.
For large samples, the parameter estimates are normally distributed and consistent if there
is no endogeneity. That means we can interpret statistical significance and create confidence
intervals and p-values much as we have done with OLS models. One modest difference is
that we use z tests rather than t tests for MLE models. Z tests compare test statistics to
critical values based on the normal distribution. Because the t distribution approximates the
normal distribution in large samples, z tests and t tests are very similar practically speaking.
The critical values will continue to be the familiar values we used in OLS. In particular, we
can continue to rely on the rule of thumb that a coefficient is statistically significant at the 0.05 level if it is roughly twice as large as its standard error (in absolute value).
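Concretely, the z statistic is simply the estimated coefficient divided by its standard error, and the p-value comes from the normal distribution. A short R sketch with made-up numbers:

## z test sketch: coefficient divided by standard error, compared to the normal distribution
beta_hat = 0.5           ## illustrative coefficient (made up)
se_hat = 0.2             ## illustrative standard error (made up)
z = beta_hat / se_hat
2 * (1 - pnorm(abs(z)))  ## two-sided p-value; compare z to 1.96 for the 0.05 level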
The estimated β̂s from a probit model will produce fitted lines that best fit the data. Figure 12.6 shows examples.

[Figure 12.6: Probit fitted lines for four patterns of dichotomous data. Panel (a): β̂0 = −3, β̂1 = 2. Panel (b): β̂0 = −4, β̂1 = 6. Panel (c): β̂0 = −1, β̂1 = 1. Panel (d): β̂0 = 3, β̂1 = −2. In each panel the vertical axis is the probability that Y = 1, the horizontal axis is X, and the observations are marked with small vertical lines at Y = 0 and Y = 1.]

Panel (a) shows a classic probit fitted line. The observed data are
indicated with small vertical lines. For low values of X, Y is mostly zero, with a few exceptions; for high values of X, all Ys are one. The estimated β̂0 coefficient is −3, indicating that low values of X are associated with low probabilities that Y = 1. The estimated β̂1 coefficient is positive because higher values of X are associated with a high probability of observing Y = 1.
To calculate fitted values for the model depicted in panel (a) of Figure 12.6 we need to
supply a value of X and use the coefficient estimates in the probit equation. Using the fact that β̂0 = −3 and β̂1 = 2, the fitted probability of observing Y = 1 when X = 0 is

Ŷi = Prob(Yi = 1)
   = Φ(β̂0 + β̂1 Xi)
   = Φ(−3 + 2 × 0)
   = Φ(−3)
   = 0.001

Based on these same coefficient estimates, the fitted probability of observing Y = 1 when X = 1.5 is

Ŷi = Φ(−3 + 2 × 1.5)
   = Φ(0)
   = 0.5
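These numbers are easy to reproduce in R, where pnorm plays the role of Φ (the coefficients below are the panel (a) estimates):

## Fitted probabilities using the panel (a) probit estimates
b0 = -3
b1 = 2
pnorm(b0 + b1 * 0)    ## about 0.001 when X = 0
pnorm(b0 + b1 * 1.5)  ## 0.5 when X = 1.5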
Panel (b) of Figure 12.6 shows a somewhat similar relationship, but here there is a starker
transition between the Y = 0 and Y = 1 observations. When X is less than about 0.5, the
Ys are all zero; when X is greater than about 1.0, the Ys are all one. This pattern of data indicates a strong relationship between X and Y, and β̂1 is, not surprisingly, larger in panel (b) than in panel (a).
Panel (c) of Figure 12.6 shows a common situation in which the relationship between X
and Y is rather weak. The estimated coefficients produce a fitted line that is pretty flat and
we don’t even see the full S-shape emblematic of probit models. If we were to display the
fitted line for a much broader range of X values, we would see the S-shape because the fitted
probabilities would flatten out at zero for sufficiently negative values of X and the fitted
probabilities would flatten out at one for sufficiently positive values of X. Sometimes, as in
this case, the flattening of a probit fitted line occurs outside the range of observed values of
X.
Panel (d) shows a case in which the β̂0 coefficient is positive and β̂1 is negative. This case
best fits the pattern of the data in which Y = 1 for low values of X and Y = 0 for high
values of X.
Yes, calculating fitted values by hand this way gets pretty ugly. Usually (but not always) there is a convenient way to get statistical software to generate fitted values if we ask nicely. We'll discuss how in the Computing Corner.
The overall fit of a probit or logit model is reported with a log likelihood statistic, often
written as “log L.” This statistic is a byproduct of the MLE estimation process. The log
likelihood is the log of the probability of observing the Y outcomes we did given the X data and the β̂s. It is an odd way to report how well the model fits because, well, it is incomprehensible. The upside of the incomprehensibility of this fit statistic is that we are less likely to put too much emphasis on it, in contrast to the more accessible R2 for OLS models.

The log likelihood is useful in hypothesis tests involving multiple coefficients. Just as R2 feeds into the F statistic (as discussed on page 351), the log likelihood feeds into the test statistic used when we are interested in hypotheses involving multiple coefficients in probit and logit models.
Remember This
1. Probit and logit models are estimated via maximum likelihood estimation (MLE)
instead of OLS.
2. We can assess the statistical significance of MLE estimates of β̂ using z tests,
which closely resemble t tests in large samples for OLS models.
Discussion Questions
1. For each panel in Figure 12.6 on page 614 identify the value of X that
produces Ŷi = 0.5. Use the probit equation.
2. Based on Table 12.2, indicate whether the following statements are true,
false, or indeterminate.
(a) The coefficient on X1 in column (a) is statistically significant.
(b) The coefficient on X1 in column (b) is statistically significant.
(c) The results in column (a) imply a one unit increase in X1 is asso-
ciated with a 50 percentage point increase in the probability that
Y = 1.
(d) The fitted probability using the estimate in column (a) for X1i = 0
and X2i = 0 is 0.
(e) The fitted probability using the estimate in column (b) for X1i = 0
and X2i = 0 is approximately 1.
3. Based on Table 12.2, indicate the fitted probability for the following:
(a) Column (a) and X1i = 4 and X2i = 0.
(b) Column (a) and X1i = 0 and X2i = 4.
(c) Column (b) and X1i = 0 and X2i = 1.
Table 12.2: Probit Results for Discussion Questions

                     (a)       (b)
X1                   0.5       1.0
                    (0.1)     (1.0)
X2                  -0.5      -3.0
                    (0.1)     (1.0)
Constant             0.00      3.0
                    (0.1)     (0.0)
N                    500       500
log L              -1000     -1200
Standard errors in parentheses
The LPM model may have its problems, but it is definitely easy to interpret: A one unit increase in an independent variable is associated with a β̂1 change in the probability that Y = 1. Probit and logit models have their strengths, but being easy to interpret is not one of them. Their coefficients feed into the probit and logit equations that determine the probability that Y = 1. These complicated equations keep the predicted values above zero and less than one, but they can only do so by having the effect of X vary across values of X.
In this section we explain how the estimated effect of X1 on Y in probit and logit models
depends not only on the value of X1 , but also on the value of the other independent variables.
We then describe approaches to interpreting the coefficient estimates from these models.
Figure 12.7 displays the fitted line from the probit model of law school admission. Increasing GPA from 70 to 75 is associated with only a small increase in the predicted probability of admission (about 3 percentage points), while increasing GPA from 85 to 90 is associated with a large increase in predicted probability (about 30 percentage points). The change in predicted probability then gets small – really small – when we increase GPA from 95 to 100 (about 1 percentage point).
This is certainly a more complicated story than in OLS, but it is perfectly sensible. For someone with a very low GPA, increasing it really doesn't get them seriously considered for admission. For a middle range of GPAs, increases in GPA are indeed associated with real increases in the probability of being admitted. After a certain point, however, higher GPAs have little effect on the probability of being admitted because pretty much everyone with such a high GPA is admitted.

[Figure 12.7: Fitted probit line for the probability of law school admission by GPA (on a 100-point scale). The predicted probability rises by about 0.03 when GPA goes from 70 to 75, by about 0.30 when GPA goes from 85 to 90, and by about 0.01 when GPA goes from 95 to 100.]
There’s another wrinkle: the other variables. In the non-linear world, the effect of increasing
X1 varies not only over values of X1 , but also over values of the other variables in the model.
Suppose, for example, that we're analyzing law school admission in terms of college GPAs and standardized Law School Admission Test (LSAT) scores. The effect of GPA actually depends on the value of the LSAT score. If an applicant's LSAT score is very high, then the predicted probability will be near one based on that alone and there will be very little room for an increased GPA to affect the predicted probability of being admitted to law school. If an applicant's LSAT score is low, then there will be a lot more room for an increased GPA to affect the predicted probability of admission.
The fact that the estimated effect of X1 on the probability Y = 1 depends on the values
of X1 and the other independent variables creates a knotty problem: How do we convey the
magnitude of the estimated effect? In other words, how do we substantively interpret probit and logit coefficients?
There are several reasonable ways to approach this issue. Here we focus on simulations.
For a continuous independent variable, the simulation involves calculating the average increase in fitted probabilities if we were to increase X1 by one standard deviation for every observation. First we calculate the fitted values for all observations using the estimated β̂s and the observed values of all independent variables. Then we add one standard deviation of X1 to the observed value of X1 for every observation and calculate new fitted values for all observations. The average difference in these two fitted values across all observations is the simulated effect of increasing X1 by one standard deviation. The bigger β̂1, the bigger this simulated effect will be.
It is not set in stone that we add one standard deviation. Sometimes it may make sense
to calculate these quantities by simply using an increase of one or some other amount.
These simulations make the coefficients interpretable in a common sense way. We can say things like, "The estimates imply that increasing GPA by one standard deviation is associated with a certain percentage point increase in the probability of being admitted to law school." That's still a mouthful, but much more meaningful than the β̂ itself.
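Here is a minimal R sketch of this simulation for a continuous variable, assuming a probit model with two independent variables in a data frame called dat (the object and variable names are placeholders, not from the text):

## Observed-value, discrete-differences simulation for a continuous X1 (sketch)
ProbitResult = glm(Y ~ X1 + X2, data = dat, family = binomial(link = "probit"))

## Fitted probabilities at the observed values of all independent variables
P1 = predict(ProbitResult, type = "response")

## Fitted probabilities with X1 increased by one standard deviation for every observation
datPlus = dat
datPlus$X1 = dat$X1 + sd(dat$X1)
P2 = predict(ProbitResult, newdata = datPlus, type = "response")

## The average difference is the simulated effect of a one standard deviation increase in X1
mean(P2 - P1)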
When X1 is a dummy variable, the simulated effect is the average increase in fitted probabilities if the value of X1 for every observation were to go from zero to one. We first calculate the fitted values for all observations using the estimated β̂s, setting X1 = 0 for all observations and using the observed values for all other independent variables. Then we calculate the fitted values for all observations setting X1 = 1 for all observations while still using the observed values for all other independent variables. The average difference in these two fitted values across all observations is the estimated effect of X1 going from 0 to 1.

We refer to this as an observed-value, discrete-differences approach. The "observed-value" part of the name comes from the fact that we use observed values when calculating simulated probabilities. The alternative to the observed-value approach is the average-case approach, which creates a single composite observation whose independent variables equal sample averages. We say a bit more about this distinction in the Further Reading section. The "discrete-differences" part of the name refers to our use of specific, discrete differences in X1, such as a one standard deviation increase or a change from 0 to 1. The alternative approach is the marginal effects approach, which calculates the effect of changing X1 by a
minuscule amount. This calculus-based approach is a bit more involved (but easy with a
simple trick) and produces results that are generally similar to the approach we present.
We discuss the marginal effects approach in the appendix on page 808 and show how to
implement the approach in the Computing Corner on pages 646 and 648.
Interpreting logit coefficients proceeds in the same way, only we use the logit equation
(Equation 12.2) instead of the probit equation. For example, for an observed-value, discrete
differences simulation of the effect of a continuous variable, we calculate logit fitted values
for all observations and then calculate logit fitted values when the variable is increased by
a standard deviation. The average difference in fitted values is the simulated effect of a one standard deviation increase in that variable.
Remember This
1. To interpret probit coefficients using the observed-value, discrete-differences
method, use the following guide.
• If X1 is continuous:
(a) For each observation, calculate P1i as the standard fitted probability from
the probit results.
(b) For each observation, calculate P2i as the fitted probability with X1 increased by one standard deviation:
P2i = Φ(β̂0 + β̂1(X1i + σX1) + β̂2X2i + β̂3X3i + ...)
(c) The simulated effect of increasing X1 by one standard deviation is the
average difference P2i − P1i across all observations.
• If X1 is a dummy variable:
(a) For each observation, calculate P1i as the fitted probability but with X1i set to 0 for all observations.
(b) For each observation, calculate P2i as the fitted probability but with X1i set to 1 for all observations.
(c) The simulated effect of X1 going from 0 to 1 is the average difference P2i − P1i across all observations.
Discussion Questions
1. Suppose X1 is a dummy variable. Explain how to calculate the effect of
X1 on the probability Y = 1.
2. Suppose X2 is a continuous variable. Explain how to calculate the effect of X2 on the probability Y = 1.
To see these interpretation issues in action, consider the relationship between dog ownership and support for Barack Obama in the 2008 U.S. presidential election analyzed by Mutz (2010).7 We model the probability that Obamai equals 1 as a function of dog ownership and political ideology,
where Obamai is 1 for individuals who said they voted for Obama and 0 for everyone else.
The Dogi variable is 1 for people who own a dog and 0 for everyone else. Ideologyi measures political ideology on a seven-point scale running from strong liberal (1) to strong conservative (7). Table 12.3 presents the results.

LPM results are in the left column. The coefficient on dog ownership is highly statistically significant, and β̂Dog = −0.05 implies that dog owners were 5 percentage points less likely to support Obama. Ideology is also highly significant with the LPM estimate, implying that people were 20.7 percentage points less likely to support Obama for every one unit increase in their
7 Yes, the author’s name really is Mutz.
conservatism on a seven-point ideology scale. The t statistics indicate that both variables are highly statistically significant.
The fitted probabilities of voting for Obama from the LPM model range from minus eight
percent to plus 121 percent. In this case, the minimum fitted value will be for dog-owning
strong conservatives (for whom the ideology variable equals 7). The fitted value from the
LPM model for such a person is 1.421 − 1 × 0.05 − 7 × 0.207 = −0.08. The maximum fitted value will be for non-dog-owning strong liberals (for whom the ideology variable equals 1). The fitted value from the LPM model for such a person is 1.421 − 0 × 0.05 − 1 × 0.207 = 1.21.
Yeah, these values are weird; probabilities below zero and above one do not make sense.
Table 12.3: Dog Ownership and Probability of Supporting Obama in 2008 Election
The second column and third columns of Table 12.3 display probit and logit results. These
models are, as we know, designed not to produce such odd fitted values and, in so doing, to
better capture the relationship between the independent and dependent variables.
Interpreting statistical significance in these models is very familiar given our work with
OLS. For large samples, MLE coefficients divided by their standard errors will come from
normal distributions with means of zero. Hence, we can ascertain statistical significance
easily and quickly simply by looking at the z statistics, where the critical values are based
on the normal distribution. Given that the t statistics we used for OLS are approximately
normally distributed in large samples, we use essentially the same critical values and generate
essentially similar p-values given the ratio of our coefficients to their standard errors. The
estimated coefficient on dog ownership in the probit model is highly statistically significant
with a z statistic of 8.53. The z statistic for the dog owner coefficient in the logit model
is 8.44, meaning the coefficient is also statistically significant. The coefficient on ideology
is very significant in both the probit and logit models with z statistics of 71.68 and 63.98,
respectively.
Interpreting the coefficients is not so straightforward. What exactly do they mean? Does
the fact that β̂Dog = −0.213 in the probit model imply that dog owners are 21.3 percentage points less likely to vote for Obama? Does the fact that β̂Ideology = −0.753 in the probit model imply that people get 75.3 percentage points less supportive of Obama for every one unit increase in ideology?
No. No. (No!) We'll focus on the probit model, but the logic is analogous for the logit
model. The coefficient estimates from probit feed into the complicated probit equation on
page 610. We must use our simulation technique to understand the substantive implications
of our probit estimates. Table 12.4 interprets the probit coefficients in a substantively useful
way. Because the Dog variable is a dummy variable, the estimated effect of β̂Dog is calculated
by comparing the fitted probabilities for all individuals when the value of Dogi is set to 0
for all people and when the value of Dogi is set to 1 for all people. The average difference
in probabilities is -0.052. (This effect is eerily similar to the LPM estimate, a common
occurrence.)
Table 12.4: Estimated Effect of Dog Ownership and Ideology on Probability of Supporting Obama in 2008
Election
In effect, what we're doing is simulating the change in support for Obama if no one owned a dog compared to if everyone owned a dog. If β̂1 is big, there will be big differences in these probabilities because the first set of probabilities will not have β̂1 in the equation (because we multiply β̂1 by zero) and the second set of probabilities will have β̂1 (because we multiply β̂1 by 1). If β̂1 is very small, then the two probabilities will differ very little.
Table 12.4 also shows the estimated effect of making everyone one unit more conservative
on the ideology measure. First we calculated fitted values from the probit model for everyone
using the observed values of all independent variables. Then we calculated fitted values for
everyone using their actual ideology score plus one. The average difference in these two
fitted probabilities across the whole population is the estimated effect of a one unit change
in ideology. The average difference in probabilities for the probit model is -0.180. In other
words, our probit coefficient on ideology implies that increasing conservative ideology by one unit is associated with an 18 percentage point decrease in the probability of supporting Obama.
The logit estimated effects in Table 12.4 are generated via the same process, but plugging
the logit-estimated coefficients into the logit equation instead of the probit equation. The
logit estimated effects for each variable are virtually identical to the probit estimated effects.
This is almost always the case because the two models are doing the same work, just with different functional forms.
Figure 12.8 helps us visualize the results by displaying the fitted values from the LPM,
probit and logit estimates. The solid line in each panel is the fitted line for non-dog owners.
The dashed line in each panel is the fitted line for dog owners. In all panels we see that the fitted probability of supporting Obama declines as people become more conservative. We also see that dog owners are less likely to support Obama, although
this effect doesn’t seem to have as much impact as ideology does. The LPM lines do not
dramatically differ from the probit and logit lines, although they go above one and below
zero. The probit and logit fitted lines look a bit different than the probit and logit fitted lines
we have seen so far because in this case the probabilities are declining as the independent
variable increases, making the lines look more like a backward S than the S shape we’ve seen
so far. Regardless, the probit and logit fitted lines are visually indistinguishable. In fact, the fitted values from the probit and logit models correlate at 0.9996; such high correlations are not unusual when comparing fitted probit and logit values.

[FIGURE 12.8: Fitted Lines from LPM, Probit, and Logit Models. Each panel plots the probability of voting for Obama against ideology; the solid line is the fitted line for non-dog owners and the dashed line is the fitted line for dog owners.]
These results constitute only an initial cut on the analysis. We are concerned, as always,
about possible bias. Is there any source of endogeneity missing in the model? In particular,
could there be something not currently in the model that is correlated with dog ownership?
Sometimes we are interested in hypotheses about multiple coefficients. That is, we might not simply want to know if β1 is different from zero, but whether it is bigger than β2. In this section we show how to conduct such tests when using MLE models such as probit and logit.
In the OLS context we used F tests to test hypotheses involving multiple coefficients; we
discussed these tests in Section 7.4 of Chapter 7. The key idea was to compare the fit of a
model that imposed no restrictions to the fit of a model that imposed the restriction implicit
in the null hypothesis. If the null hypothesis is true, then forcing the computer to spit back
results consistent with the null will not reduce the fit very much. If the null hypothesis is
false, though, forcing the computer to spit back results consistent with it will reduce the fit
substantially.
We’ll continue to use the same logic here. The difference is that we do not measure fit
with R2 as with OLS, but with the log-likelihood as described in Section 12.5. We will
look at the difference in log likelihoods from the restricted and unrestricted estimates. The
statistical test is called a likelihood ratio test (LR test) and the test statistic is

LR = 2(logL_UR − logL_R)
If the null hypothesis is true, the log-likelihood should be pretty much the same for the
restricted and unrestricted versions of the model. Hence a big difference in the likelihoods
indicates that the null is false. Statistical theory implies that if the null hypothesis is true,
the difference in log-likelihoods will follow a specific distribution and hence we can use that
distribution to calculate critical values for hypothesis testing. The distribution is a χ2 with degrees of freedom equal to the number of equal signs in the null hypothesis (χ is the
Greek letter chi, pronounced “ky” as in Kyle). We show in the Computing Corner how to
generate critical values and p-values based on this distribution. The appendix provides more details.
An example makes this process clear. It’s not hard. Suppose we want to know more
about pet politics. Perhaps our pets reveal or even cause some deep political feelings. As
Mutz (2010) noted, dog owners “might have been drawn more to the emotionally effusive
McCain ... If one of the candidates were to jump on you at the door and lick your ear, it
would surely be McCain.” Obama, on the other hand, was more cat-like and emotionally
cool.
So let's assess whether dog owners and cat owners differed politically. The unrestricted model is

Prob(Obamai = 1) = Φ(β0 + β1 Dogi + β2 Cati + ...)

This is the unrestricted equation because we are letting the coefficients on Dog and Cat be whatever values best fit the data.
The null hypothesis is a hypothesis that the effect of owning dogs and cats is the same:
H0 : β1 = β2. We impose this null hypothesis on the model by forcing the computer to give us results where the coefficients on Dog and Cat are equal. We do so by replacing β2 with β1
8 It may seem odd that this is called a likelihood ratio test when the statistic is the difference in log likelihoods. The test can also be considered as the log of the ratio of the two likelihoods. Because log(L_UR / L_R) = logL_UR − logL_R, we can use the form we do. Most software reports the log likelihood, not the (unlogged) likelihood, so it's more convenient to use the difference of log likelihoods rather than the ratio of likelihoods. The 2 is there just to make things work; don't ask.
in the model (which we can do because under the null hypothesis they are equal), yielding a restricted model of

Prob(Obamai = 1) = Φ(β0 + β1 (Dogi + Cati) + ...)
Therefore, we need simply to estimate these two models, calculate the difference in log
likelihoods, and then compare this difference to a critical value from the appropriate distri-
bution. We estimate the restricted model by creating a new variable, which is Dogi + Cati .
Table 12.5 shows the results. In the unrestricted column are results from the model in
which the dog owner and cat owner variables are entered separately. At the bottom is the unrestricted log likelihood that we will feed into the LR test.
Before we do anything more, this is a good time to do a bit of common sense approxi-
mating. The coefficients on dog and cat in the unrestricted model in Table 12.5 are both
negative and statistically significant, but the coefficient on dog is almost three times the
size of the cat coefficient. Both coefficients have relatively small standard errors, so it is reasonable to expect that we will reject the null hypothesis that the two coefficients are equal.
In the restricted model column are coefficients from the model in which the two separate
dog and cat variables have been replaced by a single dog plus cat variable. At the bottom
is the restricted log likelihood that we will feed into the LR test.
Table 12.5: Unrestricted and Restricted Probit Results for Likelihood Ratio Test
Using the tools described in the Computing Corner, we can determine the p-value and if
it is less than the significance level we have set, we can reject the null hypothesis. In this
case, the p-value associated with an LR value of 14.302 is 0.0002, far below a conventional significance level such as 0.05.
Or, equivalently, we can reject the null hypothesis if the LR statistic is greater than the
critical value for our significance level. The critical value for a significance level of 0.05 is
3.84 and our LR test statistic of 14.302 far exceeds that. This means we can easily reject the null hypothesis that dog ownership and cat ownership have the same effect.9 Dogs are more strongly associated with opposition to Obama than cats are.
9 Of course, the sensible interpretation here is that the kinds of people who own dogs and cats are more likely to
have certain political views.
Remember This
Use a likelihood ratio (LR) test to test hypotheses involving multiple coefficients for
probit and logit models.
1. Estimate an unrestricted model that is the full model:
Prob(Yi = 1) = Φ(β0 + β1X1i + β2X2i + β3X3i)
2. Write down the restricted model implied by the null hypothesis (for example, by imposing equality of coefficients or setting coefficients to zero).
3. Estimate the restricted model.
4. Use the log-likelihood values from the unrestricted and restricted models to calculate the LR test statistic:
LR = 2(logL_UR − logL_R)
5. The larger the difference between the log likelihoods, the more the null hypothesis
is reducing fit and, therefore, the more likely we are to reject the null.
• The test statistic is distributed according to a χ2 distribution with degrees
of freedom equal to the number of equal signs in the null hypothesis.
• Code for generating critical values and p-values for this distribution is in the
Computing Corner on pages 646 and 648.
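For readers working in R rather than Stata, here is a sketch of the LR test done by hand, using as an example the null hypothesis that the coefficients on X2 and X3 both equal zero (two equal signs, so two degrees of freedom); the data frame and variable names are placeholders:

## Likelihood ratio test sketch in R (placeholder names)
Unrestricted = glm(Y ~ X1 + X2 + X3, data = dat, family = binomial(link = "probit"))
Restricted   = glm(Y ~ X1, data = dat, family = binomial(link = "probit"))

LR = 2 * (as.numeric(logLik(Unrestricted)) - as.numeric(logLik(Restricted)))
LR
qchisq(0.95, df = 2)    ## critical value for a 0.05 significance level
1 - pchisq(LR, df = 2)  ## p-value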
Civil wars can devastate a country, in the worst cases leading to the collapse of civilization. Are civil wars much more likely in countries that are divided along ethnic or religious lines? Many think
so, arguing that these pre-existing divisions can explode into armed conflict. Stanford pro-
fessors James Fearon and David Laitin (2003) aren't so sure. They suspect that instability has more to do with factors such as poverty than with ethnic or religious divisions.
In this case study we explore these possible determinants of civil war. We’ll see that while
omitted variable bias plays out in a broadly similar fashion across LPM and probit models,
the two approaches nonetheless provide rather different pictures about what is going on.
The dependent variable is civil war onset, coded from 1945 to 1999 for 161 countries that had a population of at least half a million in 1990. It is 1 for country years in which a civil
war began and 0 in all other country years. We’ll look at three independent variables:
• Ethnic fractionalization measures ethnic divisions within each country; it ranges from
0.001 to 0.93 with a mean of 0.39 and a standard deviation of 0.29. The higher the value, the more ethnically divided the country.

• Religious fractionalization measures religious divisions within each country; it ranges from 0 to 0.78 with a mean of 0.37 and a standard deviation of 0.22. The higher the value, the more religiously divided the country.
• GDP is lagged GDP per capita. The GDP measure is lagged so as not to be tainted by
the civil war itself, which almost surely had an effect on the economy. It is measured in
thousands of US dollars that are adjusted for inflation. The variable ranges from 0.05
Table 12.6 shows results for LPM and probit models. For each method we present results
with and without GDP. We see a similar pattern when GDP is omitted: in the LPM (a) specification, the ethnic fractionalization variable is statistically significant but the religious fractionalization variable is not. The same thing is true for the probit (a) specification that does not have GDP.
However, Fearon and Laitin’s suspicion was supported by both LPM and probit analyses.
When GDP is included, the ethnic fractionalization variable becomes insignificant in both
LPM and probit (although it is close to significant in the LPM model). The GDP variable
is highly statistically significant in both LPM and probit models. So the general conclusion
that GDP seems to matter more than ethnic fractionalization does not depend on which model we use.
However, the two models do tell slightly different stories. Figure 12.9 shows the fitted
lines from the LPM and probit models for the specifications that include the GDP variable.
Table 12.6: LPM and Probit Models of Civil War Onset

                              LPM                          Probit
                        (a)          (b)             (a)           (b)
Ethnic                 0.019*       0.012           0.451*        0.154
fractionalization     (0.006)      (0.006)         (0.141)       (0.149)
                     [t = 3.30]   [z = 1.84]      [z = 3.20]    [z = 1.03]
Religious             -0.002        0.002          -0.051         0.033
fractionalization     (0.008)      (0.008)         (0.185)       (0.198)
                     [t = 0.33]   [z = 0.27]      [z = 0.28]    [z = 0.17]
GDP per capita                     -0.0015*                      -0.108*
(in $1000 US)                      (0.0004)                      (0.024)
                                  [z = 3.97]                    [z = 4.58]
Constant               0.010*       0.017*         -2.297*       -1.945*
                      (0.003)      (0.004)         (0.086)       (0.108)
                     [z = 3.05]   [z = 4.49]      [z = 26.67]   [z = 18.01]
N                      6610         6373            6610          6373
R2                     0.002        0.004
σ̂                      0.128        0.128
log L                                              -549.092      -508.545
Standard errors in parentheses; * indicates significance at p < 0.05
When calculating these lines, we held the ethnic and religious variables at their mean values.
The LPM model has its characteristic brutally straight fitted line. It suggests that whatever
its wealth, a country sees its probability of civil war decline as it gets wealthier. It does this
to the point of not making sense because the fitted probabilities are negative (and hence
meaningless) for countries with per capita GDP above about $20,000 per year. The probit
model has a curve. We’re seeing only a hint of the S-curve because even the poorest countries
have less than a 4% probability of having a civil war. But we do see that the effect of GDP
is concentrated among the poorest countries. For them, the effect of income is relatively
higher, certainly higher than the LPM model suggests. But for countries with about $10,000
per capita GDP per year, there is basically no effect of income on the probability of a civil
war. So even as the broad conclusion that GDP matters is similar in the LPM and probit models, the way in which GDP matters is quite different across the models.

[FIGURE 12.9: Fitted Lines from LPM and Probit Models for Civil War Data (Holding Ethnic and Religious Fractionalization at Their Means). The vertical axis is the probability of civil war; the horizontal axis is GDP per capita in thousands of US dollars.]
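The probit line in Figure 12.9 can be traced out by plugging the probit (b) estimates from Table 12.6 into the probit equation while holding the fractionalization variables at their means; the R sketch below does this, and does the same for the LPM (b) estimates:

## Fitted lines from the column (b) estimates in Table 12.6, holding ethnic
## fractionalization at 0.39 and religious fractionalization at 0.37
gdp = seq(0, 70, by = 1)   ## GDP per capita, in thousands of US dollars
probit_fit = pnorm(-1.945 + 0.154 * 0.39 + 0.033 * 0.37 - 0.108 * gdp)
lpm_fit    = 0.017 + 0.012 * 0.39 + 0.002 * 0.37 - 0.0015 * gdp

plot(gdp, lpm_fit, type = "l", lty = 2, ylim = c(-0.08, 0.04),
     xlab = "GDP (per capita in thousands of US $)",
     ylab = "Probability of civil war")
lines(gdp, probit_fit)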
12.7 Conclusion
Things we care about are often dichotomous, whether it is unemployment, vote choice,
graduation, war, or countless other phenomena. We can use OLS to analyze such data via
the linear probability model, but we risk producing models that do not fully reflect the dichotomous nature of the dependent variable.
The solution is to fit an S-shaped relationship via probit or logit models. Probit and
logit models are, as a practical matter, interchangeable as long as sufficient care is taken
when interpreting coefficients. The cost of these models is that they are more complicated to interpret than OLS.
• Section 12.1: Explain the linear probability model. How do we estimate it? How do we interpret its coefficients?
• Section 12.2: Describe what a latent variable is and how it relates to the observed
dichotomous variable.
• Section 12.3: Describe the probit and logit models. What is the equation for the
probability that Yi = 1 for a probit model? What is the equation for the probability that Yi = 1 for a logit model?
• Section 12.4: Discuss the estimation procedure used for probit and logit models and how to generate
fitted values.
• Section 12.5: Explain how to interpret probit coefficients using the observed-value,
discrete-differences approach.
• Section 12.6: Explain how to test hypotheses about multiple coefficients using probit
or logit models.
Further Reading
There is no settled consensus on the best way to interpret probit and logit coefficients.
Substantive conclusions rarely depend on the mode of presentation, so any of the methods
is legitimate. Hanmer and Kalkan (2013) argue for the observed-value approach and against the average-case approach.
MLE models do not inherit all properties of OLS models. In OLS, heteroscedasticity
does not bias coefficient estimates; it only makes the conventional equation for the standard
error of —ˆ1 inappropriate. In probit and logit models, heteroscedasticity can induce bias
(Alvarez and Brehm 1995), but correcting for heteroscedasticity may not always be feasible.
King and Zeng (2001) discuss small sample properties of logistic models, noting in partic-
ular that small-sample bias can be large when the dependent variable is a rare event, with few 1s relative to the number of 0s.
Probit and logit models are examples of limited dependent variable models. In these
models, the dependent variable is restricted in some way. As we have seen, the dependent
variable in probit models is limited to two values, 1 and 0. MLE can be used for many other
types of limited dependent variable models. If the dependent variable is ordinal with more
than two categories (e.g., answers to a survey question where answers are very satisfied, somewhat satisfied, or not satisfied), an ordered probit model is appropriate; it is estimated with MLE methods and is a modest extension of the probit model. Some dependent variables
are categorical. For example, we may be analyzing the mode of transportation to work
(with walking, biking, driving, and taking public transportation as options). In such a
case, multinomial logit is useful, another MLE technique. Other dependent variables are
counts (number of people on a bus) or lengths of time (how long between buses or how long
someone survives after a disease diagnosis). Models with these dependent variables also can
be estimated with MLE methods, such as count models and duration models. Long (1997)
introduces maximum likelihood and covers a broad variety of MLE techniques. King (1989)
explains the general approach. Box-Steffensmeier and Jones (2004) is an excellent guide to
duration models.
Key Terms
• Cumulative distribution function (CDF) (608)
• Dichotomous (591)
• Linear probability model (LPM) (592)
• Latent variable (602)
Computing Corner
Stata
** Display results
sum PDiff if e(sample)
• If X1 is a dummy variable:
** Estimate probit model
probit Y X1 X2
** Display results
sum PDiff if e(sample)
• The margins command produces average marginal effects, which are the average of the
slopes with respect to each independent variable evaluated at observed values of the
independent variables. See page 808 for more details. These are easy to implement in
Stata, with similar syntax for both probit and logit models.
probit Y X1 X2
margins, dydx(X1)
• To conduct an LR test in Stata, use the lrtest command. For example, to test the
null hypothesis that the coefficients on both X2 and X3 equal zero we can first run the
constrained model and save the results using the estimates store command:
probit Y X1
estimates store RESTRICTED
Then run the unconstrained command followed by the lrtest command and the name
of the constrained model.
probit Y X1 X2 X3
lrtest RESTRICTED
Stata will produce a value of the likelihood ratio statistic and a p-value. We can imple-
ment an LR test manually by simply running the restricted and unrestricted models and
plugging the log likelihoods into the likelihood ratio test equation of 2(logL_UR − logL_R) (as explained on page 635). To ascertain the critical value for an LR test with d.f. = 1 and a 0.95 confidence level, type

display invchi2(1, 0.95)
To ascertain the p-value for likelihood ratio test with d.f. = 1 and substituting log-
likelihood values in for logLunrestricted and logLrestricted, type
display 1-chi2(1, 2*(logLunrestricted - logLrestricted))
Even easier, we can use Stata’s test command to conduct a Wald test, which is a test
that is asymptotically equivalent to the likelihood ratio test (which is a fancy way of
saying the test statistics get really close to each other as the sample size goes to infin-
ity). For example,
probit Y X1 X2 X3
test X2 = X3 =0
• To estimate a logit model in Stata, use similar logic and structure as for a probit model.
Here are the key differences for the continuous variable example:
logit Y X1 X2
gen LogitP1 = exp(_b[_cons] + _b[X1]*X1 + _b[X2]*X2) / (1 + exp(_b[_cons] + _b[X1]*X1 + _b[X2]*X2))
gen LogitP2 = exp(_b[_cons] + _b[X1]*X1Plus + _b[X2]*X2) / (1 + exp(_b[_cons] + _b[X1]*X1Plus + _b[X2]*X2))
• To graph fitted lines from a probit or logit model that has only one independent variable,
first estimate the model and save the fitted values. Then use the following command:
graph twoway (line ProbitFit X, connect(l))
R
To implement a probit or logit analysis in R, we use the glm function, which stands for
“generalized linear model” (as opposed to the lm function, which stands for “linear model”).
• If X1 is continuous:
## Estimate probit model and name it Result (or anything we choose)
Result = glm(Y ~ X1 + X2, family = binomial(link = "probit"))
## Create a variable equal to X1 plus one standard deviation of X1 (which here equals 1)
X1Plus = X1 + 1
• If X1 is a dummy variable (a sketch of the remaining steps appears after this list):
## Estimate probit model and name it "Result" (or anything we choose)
Result = glm(Y ~ X1 + X2, family = binomial(link = "probit"))
## Critical value for LR test with d.f. =1 and 0.95 confidence level
qchisq(0.95, 1)
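• One way to finish the dummy variable calculation (a sketch that assumes Y, X1, and X2 are variables in the workspace, as above) is to plug the estimated coefficients into the probit equation with X1 set to 0 and then to 1 for all observations:

## Fitted probability with X1 set to 0 for all observations
P1 = pnorm(coef(Result)["(Intercept)"] + coef(Result)["X1"]*0 + coef(Result)["X2"]*X2)
## Fitted probability with X1 set to 1 for all observations
P2 = pnorm(coef(Result)["(Intercept)"] + coef(Result)["X1"]*1 + coef(Result)["X2"]*X2)
## Simulated effect of X1 going from 0 to 1: average difference in fitted probabilities
PDiff = P2 - P1
mean(PDiff)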
Exercises
1. In this question, we explore the effect of opinion about the Iraq War on the presidential
election in 2004 using the dataset BushIraq.dta. The variables we will focus on are
listed in Table 12.7.
a. Estimate two probit models: one with only ProIraqWar02 as the independent vari-
able and the other with all the independent variables listed in the table. Which is
better? Why? Comment briefly on statistical significance.
Variable Description
Bushvote04 Dummy variable =1 if person voted for President Bush in 2004
ProIraqWar02 Position on Iraq War, ranges from 0 (opposed war) to 3 (favored war)
Party02 Partisan affiliation, ranges from 0 for strong Democrats to 6 for strong Republicans
BushVote00 Dummy variable =1 if person voted for President Bush in 2000
CutRichTaxes02 Views on cutting taxes for wealthy, ranges from 0 (oppose) to 2 (favor)
Abortion00 Views on abortion, ranges from 1 (strongly oppose) to 4 (strongly support)
b. Use the model with all the independent variables and the observed-value, discrete-
differences approach to calculate the effect of a one standard deviation increase in
ProIraqWar02 on support for Bush.
c. Use the model with all the independent variables listed in the table and the observed-
value, discrete-differences approach to calculate the effect of a one standard deviation
increase in Party02 on support for Bush. Compare to the effect of ProIraqWar02.
d. Use Stata’s marginal effects command to calculate the marginal effects of all inde-
pendent variables. Briefly comment on differences from calculations in parts (a) and
(c).
e. Run the same model using logit and
i Briefly comment on patterns of statistical significance compared to probit results.
ii Briefly comment on coefficient values compared to probit results.
iii Use Stata’s margins commands to calculate marginal effects of variables and
briefly comment on differences or similarities from probit results.
f. Calculate the correlation of the fitted values from the probit and logit models.
g. Test the null hypothesis that the coefficients on the three policy opinion variables
(ProIraqWar02, CutRichTaxes02, Abortion00) all equal zero using a likelihood ra-
tio test. Do this work manually (showing your work) and using the Stata commands
for a likelihood ratio test.
2. Public attitudes toward global warming influence the policy response to the issue. The
dataset EnvSurvey.dta provides data from a nationally representative survey of the
American public that asked multiple questions about the environment and energy. Table
12.8 lists the variables.
a. Use a linear probability model to estimate the probability of saying that global
warming is real and caused by humans (the dependent variable is HumanCause2).
Control for sex, being white, education, income, age, and partisan identification.
Variable Description
Male Dummy variable = 1 for men
White Dummy variable = 1 for whites
Education Education, ranging from 1 for no formal education to 14 for
professional/doctorate degree (treat as a continuous variable)
Income Income, ranging from 1 for household income < $5000 to 19 for
household income > $175000 (treat as a continuous variable)
Age Age
Party7 Partisan identification, ranging from 1 for strong Republicans, 2 for
not-so-strong Republican, 3 leans Republican, 4 undecided/independent,
5 leans Democrat, 6 not-so-strong Democrat, 7 strong Democrat
i. Which variable has the most important influence on this opinion? Why?
ii. What are the minimum and maximum fitted values from this model? Discuss
implications briefly.
iii. Add age-squared to the model. What is the effect of age? Use a simple sketch if
necessary, with key point(s) identified.
b. Use a probit model to estimate the probability of saying that global warming is
real and caused by humans (the dependent variable is HumanCause2). Use the
independent variables from part (a), including the age-squared variable.
i. Compare statistical significance with LPM results.
ii. What are the minimum and maximum fitted values from this model? Discuss
implications briefly.
iii. Use the observed-value, discrete-differences approach to indicate the effect of
partisan identification on the probability of saying global warming is real and
caused by humans. For simplicity, simulate the effect of an increase of 1 unit
on this 7 point scale (as opposed to the effect of one standard deviation, as we
have done for continuous variables in other cases). Compare to LPM and Stata’s
“marginal effects” interpretations.
iv. Use the observed-value, discrete-differences approach to indicate the effect of
being male on the probability of saying global warming is real and caused by
humans. Compare to LPM and Stata’s “marginal effects” interpretations.
c. The survey also included a survey experiment in which respondents were randomly
assigned to different question wordings about an additional question about global
warming. The idea was to see which frames were most likely to lead people to agree
that the earth is getting warmer. The variable we analyze here is called W armAgree.
It records whether or not respondents agreed that the earth’s average temperature
FIGURE 12.10: Figure Included for Some Respondents in Global Warming Survey Experiment
is rising. The experimental treatment consisted of four different ways to phrase the
question.
• The variable Treatment equals 1 for people who were asked "Based on your
personal experiences and observations, do you agree or disagree with the following
statement: The average temperature on earth is getting warmer.”
• The variable Treatment equals 2 for people who were given the following infor-
mation and Figure 12.10 before asking them if they agreed or not that average
temperature of the earth is getting warmer: “The following figure shows the
average global temperature compared to the average temperature from 1951-
1980. The temperature analysis comes from weather data from more than 1,000
meteorological stations around the world, satellite observations of sea surface
temperature, and Antarctic research station measurements.”
• The variable Treatment equals 3 for people who were given the following in-
formation before asking them if they agreed or not that average temperature
of the earth is getting warmer: “Scientists working at the National Aeronautics
and Space Administration (NASA) have concluded that the average global tem-
perature has increased by about a half degree Celsius compared to the average
temperature from 1951-1980. The temperature analysis comes from weather data
from more than 1,000 meteorological stations around the world, satellite observa-
tions of sea surface temperature, and Antarctic research station measurements.”
• The variable Treatment equals 4 for people who were simply asked "Do you agree
or disagree with the following statement: The average temperature on earth is
getting warmer.” This is the control group.
Which frame was most effective in affecting opinion about global warming?
3. What determines whether organizations fire their leaders? It’s often hard for outsiders
a. Run a probit model explaining whether the coach was fired as a function of winning
percentage. Graph the fitted values from this model on the same graph as the fitted values from a bivariate linear probability model (use the lfit command to plot the LPM results). Explain the differences in the plots.
b. Estimate LPM, probit, and logit models of coach firings using winning percent, lagged
winning percent, a new coach dummy, strength of schedule, and coach tenure as
independent variables. Are the coefficients substantially different? How about the z
statistics?
c. Indicate the minimum, mean, and maximum of the fitted values for each model and
briefly discuss.
d. What are the correlations of the three fitted values?
e. It’s kind of odd to say that lag winning percentage affects the probability that new
coaches got fired because they were not coaching for the year associated with the
lagged winning percentage. Include an interaction for the fired last year dummy and
lagged winning percentage. The effect of lagged winning percentage on probability
of being fired is the sum of the coefficients on lagged winning percentage and the
interaction. Test the null hypothesis that lagged winning percentage has no effect
on coaches who are new (meaning coaches for whom firedlastyear = 1). Use a Wald
test (which is most convenient) and likelihood ratio test.
4. Are members of Congress more likely to meet with donors? To answer this question,
Kalla and Broockman (2014) conducted a field experiment in which they had political
activists attempt to schedule meetings with 191 congressional offices regarding efforts to
ban a potentially harmful chemical. The messages the activists sent to the congressional
offices were randomized. Some messages described the people requesting the meeting
as “local constituents” and others described the people requesting the meeting as “local
campaign donors.” Table 12.10 describes two key variables from the experiment.
Table 12.10: Variables for Donor Experiment
Variable Description
donor treat Dummy variable indicating activists seeking meeting were identified as donors.
staffrank Highest ranking person attending the meeting: 0 for no meeting, 1 for non-policy
staff, 2 for legislative assistant, 3 for legislative director, 4 for chief of
staff, and 5 for member of Congress.
a. Before we analyze the experimental data, let’s first suppose we were to conduct
an observational study of access based on a sample of Americans where we ran a
regression in which the dependent variable indicates having met with a member of
Congress and the independent variable was whether or not the individual donated
money to a member of Congress. Would there be concerns about endogeneity? If so,
why?
b. Use a probit model to estimate the effect of the donor treatment condition on prob-
ability of meeting with a member of Congress. Interpret the results. Table 12.10
describes the variables.
c. What factors are missing from the model? What does this omission mean for our
results?
d. Use a linear probability model (LPM) to estimate the same model. Interpret results.
Assess the correlation of the fitted values from the probit and LPM models.
e. Use an LPM model to assess the probability of meeting with a senior staffer (defined as staffrank > 2).
f. Use an LPM model to assess the probability of meeting with a low-level staffer (defined as staffrank = 1).
g. Table 12.11 shows results for balance tests for two variables: Obama vote share in the
congressional district and the overall campaign contributions received by the member
of Congress contacted. Discuss the implication of these results for balance.
Part IV
Advanced Material

CHAPTER 13
Time Series: Dealing with Stickiness over Time
someone says global warming is a fraud. Kids, put some more coal on the campfire!
If we use global temperature data to try to pin down trends and associated variables we
are using time series data, data for a particular unit (such as a country or planet) over time.
Time series data is distinct from cross-sectional data, which is data for many units at a
given point in time (such as data on the GDP per capita in all countries in 2012).
Analyzing time series data is deceptively tricky because the data in one year almost
certainly depends on the data in the year before. This seemingly innocuous fact creates
complications, some of which are relatively easy to deal with and others of which are a good deal harder.
In this chapter we introduce two approaches to time series data. The first treats the year-to-year dependence as a feature of the error term. As discussed on page 106, autocorrelation doesn't cause our OLS coefficients to be biased, but it will typically cause standard OLS estimates of the variance of β̂1 to be incorrect. It's pretty easy to purge the data of this autocorrelation; our estimates will continue to be unbiased, but now the estimated variance of β̂1 will be appropriate as well.
The second approach to time series data treats the dependent variable in one period as
directly depending on what the value of the dependent variable was in the previous period.
In this approach, the data remembers: A bump up in year one will affect year two, and because the value in year two will affect year three (and so on), the bump in year one will percolate through the entire data series. This is called a dynamic model; such models
might seem pretty similar to other OLS models, but they actually differ in important and
funky ways.
This chapter covers both approaches to dealing with time series data. Section 13.1 in-
troduces a model for autocorrelation. Section 13.2 shows how to use this model to detect
autocorrelation and Section 13.3 shows how to purge autocorrelation from the model. Sec-
tion 13.4 introduces dynamic models and Section 13.5 discusses a central but complicated issue that arises with dynamic models.
One reasonable approach to time series data is to think of the errors as being correlated over
time. If errors are correlated, β̂1 is unbiased, but the standard equation from page 225 for the variance of β̂1 (Equation 5.9) is not accurate.1 Often, the variance estimated by OLS will be too low and will cause our confidence intervals to be too small and lead us to reject null hypotheses more often than we should.
In this section we lay the groundwork for dealing with autocorrelation by developing a
model with autoregressive errors. Autoregressive errors are one type of possibly autocorre-
lated errors; they are the most widely used and quite intuitive. We also provide examples of autocorrelated errors. The basic model is
Yt = β0 + β1Xt + εt     (13.1)
This model has slightly different notation than our earlier OLS model. Instead of using “i”
to indicate each individual observation, we use “t” to indicate each time period. Yt therefore
indicates the dependent variable at time t; Xt indicates the independent variable at time t.
We'll focus on an autoregressive model for error terms. This is the most common model of autocorrelated errors. It contrasts with a model in which the dependent variable depends directly on the value of the dependent variable in the previous period. Here we model the error as depending on the error in the previous period. In Section 13.4 we use an autoregressive model for the dependent variable (as opposed to for the error as we're doing here).2

εt = ρεt−1 + νt     (13.2)

This equation says that the error term for time period t equals ρ (the Greek letter "rho") times the error in the previous period plus a random error, νt (the Greek letter nu, pronounced
2 The terms here can get a bit confusing. Autocorrelation refers to errors being correlated with each other. An
autoregressive model is one way (the most common way) to model autocorrelation. It is possible to model correlated
errors differently. For example, errors can be the average of errors from some number of previous periods, an error
process referred to as a moving average error process.
"new"). We assume νt is uncorrelated with the independent variable and other error terms. The εt−1 is referred to as the lagged error because it is the error from the previous period.
We’ll lag other variables as well, which means using the value from the previous period.
Suppose we’re looking at global temperature data from 1880 to 2012 as a dependent
variable and carbon emissions as an independent variable. Suppose we lack a good measure
of sunspots, a solar phenomenon that may affect temperature. Because sunspots strengthen
and weaken over a roughly 11 year cycle, they will be correlated from period to period. This
factor could be in the error term of our global temperature model and could therefore cause the errors to be correlated over time.
The ρ term indicates the extent to which the errors are correlated over time. If ρ is zero, then the errors are not correlated and the autoregressive model reduces to a simple OLS model (because Equation 13.2 becomes εt = νt when ρ = 0). If ρ is greater than zero, then a high value of ε in period t − 1 will likely lead to a high value of ε in period t. Think of the errors in this case as being a bit sticky. Instead of bouncing around like independent random values, they tend to run high for a while, then low for a while.
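To build intuition, here is a small R simulation (not from the text) of AR(1) errors; plotting the resulting series for different values of ρ reproduces the patterns discussed next and shown in Figure 13.1:

## Simulate n periods of AR(1) errors for a given rho (illustrative sketch)
simulate_ar1 = function(n, rho) {
  e = numeric(n)
  nu = rnorm(n)                    ## the random shocks nu_t
  e[1] = nu[1]
  for (t in 2:n) {
    e[t] = rho * e[t - 1] + nu[t]  ## error in period t depends on the error in period t-1
  }
  e
}

plot(simulate_ar1(50, 0.8), type = "l")   ## smooth, slow-moving errors (positive autocorrelation)
plot(simulate_ar1(50, 0), type = "l")     ## independent errors (no autocorrelation)
plot(simulate_ar1(50, -0.8), type = "l")  ## spiky, back-and-forth errors (negative autocorrelation)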
If ρ is less than zero, then a positive value of ε in period t − 1 is more likely to lead to a negative value of ε in the next period. In other words, the errors bounce violently back and forth over time.
Figure 13.1 shows examples of errors with varying degrees and types of autocorrelation.
Panel (a) shows an example in which ρ is 0.8. This positive autocorrelation produces a
relatively smooth graph, with values tending to be above zero for a few periods and then
below zero for a few periods and so on. This graph is telling us that if we know the error in
one period, we then have some sense of what it will be in the following period. If the error
is positive in period t, then it’s likely (but not certain) the error will be positive in period
t + 1.
Panel (b) of Figure 13.1 shows a case when there is no autocorrelation. The error in time t is not a function of the error in the previous period. The tell-tale signature of no autocorrelation is the randomness: It is generally spiky, but here and there the error might drift in one direction for a period or two simply by chance.
Panel (c) of Figure 13.1 shows negative serial correlation with ρ = −0.8. The signature of negative serial correlation is extreme spikiness because a positive error is more likely to be followed by a negative error (and vice versa).
The absolute value of ρ has to be less than one in autoregressive models. If ρ were greater than one, the errors would tend to grow larger in each time period and would spiral out of control.
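To build intuition, here is a minimal R sketch (not from the book's materials) that simulates AR(1) errors for three illustrative values of ρ, mirroring the kinds of patterns shown in Figure 13.1; the seed and number of periods are arbitrary.
# Simulate AR(1) errors: epsilon_t = rho * epsilon_{t-1} + nu_t
set.seed(42)                          # arbitrary seed for reproducibility
simulate_ar1 <- function(rho, periods = 60) {
  nu <- rnorm(periods)                # the well-behaved error, nu
  e <- numeric(periods)
  e[1] <- nu[1]
  for (t in 2:periods) {
    e[t] <- rho * e[t - 1] + nu[t]    # Equation 13.2
  }
  e
}
par(mfrow = c(1, 3))
plot(simulate_ar1(0.8),  type = "l", ylab = "error", main = "rho = 0.8")
plot(simulate_ar1(0),    type = "l", ylab = "error", main = "rho = 0")
plot(simulate_ar1(-0.8), type = "l", ylab = "error", main = "rho = -0.8")
The positively correlated series drifts smoothly, the uncorrelated series is spiky, and the negatively correlated series whipsaws back and forth.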
[Figure 13.1: simulated error series plotted over time (Year on the horizontal axis); panels (a), (b), and (c) correspond to ρ = 0.8, ρ = 0, and ρ = −0.8.]
More generally, the error can be a function of errors from multiple previous periods. If the error is a function of only the error from the previous period, the model is referred to as an AR(1) model (pronounced A-R-1). If the error is a function of the errors from the two previous periods, the model is referred to as an AR(2) model, and so on.
Remember This
1. Autocorrelation refers to errors being correlated with each other.
2. One type of autocorrelated error occurs when errors come from an autoregressive
model in which the error term in period t is a function of the error in previous
periods.
3. The equation for error in an AR(1) model is
εt = ρεt−1 + νt
13.2 Detecting Autocorrelation
We know what autocorrelation is. We know what it causes. But just because data is time series data does not necessarily mean the errors will be correlated. We need to assess whether autocorrelation exists in our data and model. If it does, then we need to correct for it. If it does not, then we can go on our merry way with OLS. In this section we show how to detect autocorrelation both graphically and with an auxiliary regression.
The first way to detect autocorrelation is simply to graph the error terms over time. Autocorrelated data has a distinctive pattern and will typically jump out pretty clearly from a graph. As is typical with graphical methods, looking at a picture doesn't yield a cut-and-dried answer. The advantage, though, is that it allows us to understand the data, perhaps more deeply than a single test statistic would.
To detect autocorrelation graphically, we first run a standard OLS model ignoring the autocorrelation and generate residuals. They are calculated as ε̂t = Yt − β̂0 − β̂1Xt. (If our model has more independent variables, we would include them in the calculation as we do in multivariate OLS.) We simply graph these residuals over time and describe what we see. If the errors move slowly as in panel (a) of Figure 13.1, they're positively correlated. If errors bounce violently as in panel (c) of Figure 13.1, they're negatively correlated. If we can't really tell, then the errors are probably not strongly correlated.
Wait a minute! Why are we looking at residuals from an OLS equation that does not correct for autocorrelation? Isn't the whole point of this chapter that we need to take into account autocorrelation? Busted, right? Not really. The OLS coefficient estimates from a model that ignores autocorrelation are unbiased even when there is autocorrelation. Because the residuals are a function of these β̂s, they are unbiased too. The OLS standard errors are flawed, but we're not using them here; we're using only the residuals.
Positive autocorrelation is common in time series data. Panel (a) of Figure 13.2 shows
global climate data over time with a fitted line from the following model:
Temperaturet = β0 + β1Yeart + εt
The temperature hovers above the trend line for periods (such as around World War 2
and now) and below the line for other periods (such as 1950 to 1980). This hovering is a
sign that the error in one period is correlated with the error in the next period. Panel (b)
of Figure 13.2 shows the residuals from this regression. For each observation, the residual
is the distance from the fitted line; so the residual plot is essentially panel (a) tilted so that
the fitted line in panel (a) is now the horizontal line in panel (b).
A more formal way to detect autocorrelation is to estimate the degree of autocorrelation using an auxiliary regression. We have seen auxiliary regressions before (in the multicollinearity discussion on page 210, for example); they are additional regressions that are related to, but not the same as, the regression of interest. When detecting autocorrelation, we estimate the following model:
ε̂t = ρε̂t−1 + νt   (13.3)
[Figure 13.2: (a) global temperature over time with fitted line; (b) residuals from that fitted line.]
where ε̂t and ε̂t−1 are simply the residuals and lagged residuals from the initial OLS estimation (the Computing Corner starting on page 705 shows how to generate them). If ρ̂ is statistically significantly different from zero, we conclude that the errors are autocorrelated.
Table 13.1 shows the results of such a lagged error model for the climate data in Figure 13.2. The dependent variable in this model is the error from the model, and the independent variable is the lagged value of the error. We're using this model to estimate how closely ε̂t and ε̂t−1 are related. The answer? They are strongly related. The coefficient on ε̂t−1 is 0.608, meaning that our ρ̂ estimate is 0.608, which is quite a strong relation. The standard error is 0.072, implying a t statistic of 8.39, which is well beyond any conventional critical value. We can therefore handily reject the null that ρ = 0 and conclude that errors are autocorrelated.
3 If we believe that the independent variables might be correlated with the error term, we can also include them in the auxiliary regression, such that we estimate ε̂t = ρε̂t−1 + γXt + νt. With this model we continue to look for a statistically significant ρ̂ estimate.
4 This approach is closely related to the so-called Durbin-Watson test for autocorrelation. This test statistic is widely reported, but it has a much more complicated distribution than a t distribution and requires use of specific tables. In general, it produces very similar results to the process we explained with the auxiliary regression.
Table 13.1: Detecting Autocorrelation Using OLS and Lagged Error Model
Remember This
To detect autocorrelation in time series:
1. Graph the residuals from a standard OLS model over time. If the plot is relatively
smooth, positive autocorrelation likely exists.
2. Estimate the following OLS model:
ε̂t = ρε̂t−1 + νt
A statistically significant estimate of ρ indicates autocorrelation.
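The following is a minimal R sketch of these two detection steps for generic time-ordered vectors y and x (hypothetical names, not variables from the book's data sets).
ols <- lm(y ~ x)                      # standard OLS, ignoring autocorrelation
e <- resid(ols)                       # residuals are unbiased even with AR(1) errors
plot(e, type = "l")                   # graphical check: a smooth pattern suggests rho > 0
lag_e <- c(NA, e[-length(e)])         # lagged residuals
aux <- lm(e ~ lag_e)                  # auxiliary regression of e_t on e_{t-1}
summary(aux)$coefficients["lag_e", ]  # rho-hat, its standard error, and t statistic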
13.3 Correcting for Autocorrelation
The way to deal with autocorrelation is to get rid of it. That doesn't really sound like something we're supposed to be able to do, but we can do it via a few steps. In this section we derive a model that purges the autocorrelation from the error and then explain how to estimate it.
Our goal is to purge autocorrelation from the data by transforming the dependent and independent variables before estimating our model. Once we purge the autocorrelation,
OLS using the transformed data will produce an unbiased estimate of β1 and an appropriate estimate of var(β̂1). In contrast, OLS on the untransformed data will produce an unbiased estimate of β1 but a flawed estimate of var(β̂1).
The process is called ρ-transforming ("rho-transforming") the data. Because these steps are automated in many software packages, we typically will not do them manually. If we understand the steps, though, we can use the results more confidently and effectively.
We begin by replacing the εt in the main equation (Equation 13.1) with ρεt−1 + νt from Equation 13.2:
Yt = β0 + β1Xt + ρεt−1 + νt   (13.4)
This equation looks like a standard OLS equation except for a pesky ρεt−1 term. Our goal is to get rid of that term, which we can do in the following steps:
1. Write an equation for the lagged value of Yt, which simply requires replacing the t subscripts in Equation 13.1 with t − 1:
   Yt−1 = β0 + β1Xt−1 + εt−1   (13.5)
2. Multiply both sides of Equation 13.5 by ρ:
   ρYt−1 = ρβ0 + ρβ1Xt−1 + ρεt−1   (13.6)
3. Subtract the equation for ρYt−1 (Equation 13.6) from Equation 13.4. That is, subtract the left side of Equation 13.6 from the left side of Equation 13.4 and subtract the right side of Equation 13.6 from the right side of Equation 13.4:
   Yt − ρYt−1 = β0 − ρβ0 + β1Xt − ρβ1Xt−1 + ρεt−1 − ρεt−1 + νt
4. Notice that the ρεt−1 terms cancel.
5. Group the remaining terms:
   Yt − ρYt−1 = β0(1 − ρ) + β1(Xt − ρXt−1) + νt
6. Use squiggles to indicate the transformed variables (where Ỹt = Yt − ρYt−1, β̃0 = β0(1 − ρ), and X̃t = Xt − ρXt−1):
   Ỹt = β̃0 + β1X̃t + νt
The key thing is to look at the error term in this new equation. It is νt, which we said at the outset is the well-behaved part of the error term that is not autocorrelated. Where is εt, the naughty autocorrelated part of the error term? Gone! That's the thing. That's what we accomplished with these equations: We end up with an equation that looks pretty similar to our OLS equation, with a dependent variable (Ỹt), parameters to estimate (β̃0 and β1), an independent variable (X̃t), and an error term, νt. The difference is that, unlike our
original model (based on Equations 13.1 and 13.2), this model has no autocorrelation. By using Ỹt and X̃t we have transformed the model from one that suffers from autocorrelation into one that does not.
What we have to do, then, is estimate a model with the Ỹ and X̃ (note the squiggles over the variable names) instead of Y and X. Table 13.2 shows the transformed variables for several observations. The columns labeled Y and X show the original data. The columns labeled Ỹ and X̃ show the transformed data. We assume for this example that, based on results from an initial OLS model, we have estimated ρ̂ = 0.5. In this case, the Ỹ observation for 2001 will be the actual value in 2001 (which is 110) minus ρ̂ times the value of Y in 2000: Ỹ2001 = 110 − 0.5 × 100 = 60. Notice that the first observation in the ρ-transformed data will be missing because we don't know the lagged value for that observation.
Once we’ve created these transformed variables, things are easy. If we think in terms of
a spreadsheet, we'll simply use the columns Ỹ and X̃ when we estimate the ρ-transformed model. The standard errors produced by this ρ-transformed model will not be corrupted by autocorrelation, unlike the standard errors from a model with untransformed data.
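Here is a minimal R sketch of the transformation itself, using ρ̂ = 0.5 as in the text; the Y values 100 and 110 come from the example above, while the remaining Y and X values are made up for illustration.
rho_hat <- 0.5                                   # estimate from an initial auxiliary regression
year <- 2000:2004
Y <- c(100, 110, 104, 116, 120)                  # 100 and 110 match the text; others invented
X <- c(20, 24, 22, 27, 29)                       # invented independent variable
Y_tilde <- Y - rho_hat * c(NA, Y[-length(Y)])    # Y_t minus rho-hat times Y_{t-1}
X_tilde <- X - rho_hat * c(NA, X[-length(X)])
data.frame(year, Y, X, Y_tilde, X_tilde)         # Y_tilde for 2001 is 110 - 0.5*100 = 60
summary(lm(Y_tilde ~ X_tilde))                   # estimate the rho-transformed model
The first row of the transformed data is NA, reflecting the missing first observation discussed above.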
This estimation procedure is often referred to as the Cochrane-Orcutt model or the Prais-Winsten model.5 These names are useful to remember when using Stata or other software to estimate ρ-transformed models.
5 The Prais-Winsten approach approximates values for the missing first observation in the ρ-transformed data.
Running the ρ-transformed model produces coefficient estimates that are unbiased and consistent (as were simple OLS coefficients) and also produces accurate standard errors. Usually (but not always), analysis of ρ-transformed data will produce larger standard errors than in the simple OLS model. That means our estimates are less precise (but more honest!). Confidence intervals will be larger, and it will be harder to reject null hypotheses.
It is worth emphasizing that the β̂1 coefficient we estimate in the ρ-transformed model is an estimate of β1. Throughout all the rigmarole of the transformation process, the value of β1 doesn't change. The value of β1 in the original equation is the same as the value of β1 in the transformed equation. Hence when we get results from ρ-transformed models we still speak of them in the same terms as β1 estimates from standard OLS. That is, a one-unit increase in X is associated with a β̂1 change in Y.
One thing that is unintuitive is that we get different coefficient estimates than with the simple OLS model. Are ρ-transformed results "better"? No and yes, actually. No, in the
sense that both OLS and ρ-transformed estimates are unbiased and consistent, which means that in expectation the estimates equal the true value and as we get more data they converge to the true value. These things can be true and the models can still yield different estimates. It's just as when we flip a coin 100 times: We likely get a different count of heads every time we do it, even though the expected number of heads is 50. That's pretty much what is going on here, as the two approaches are different realizations of random processes that are correct on average but still have random noise. The ρ-transformed estimates are better in the sense that they come with correct standard errors. The estimates from OLS do not.
Remember This
We correct for autocorrelation by ρ-transforming the data, a process that purges the autocorrelation from the data.
1. We estimate the following model, in which Ỹt = Yt − ρ̂Yt−1 and X̃t = Xt − ρ̂Xt−1:
   Ỹt = β̃0 + β1X̃t + νt
6 The intercept estimated in a ρ-transformed model is actually β0(1 − ρ̂). If we want to know the fitted value when Xt is zero (which is the meaning of the intercept in a standard OLS model), we need to divide β̃0 by (1 − ρ̂). The appendix discusses an additional assumption implicit in the ρ-transformed model.
To see the ρ-transformation in action, we return to the global temperature data, this time allowing for a non-linear time trend by including both Year and Year squared as independent variables.
The first column of Table 13.3 shows results from a standard OLS analysis of the model. However, we suspect the errors in this model are autocorrelated. If so, we cannot believe the standard errors from OLS, which in turn means the t statistics are wrong because t statistics are calculated using standard errors.
[Figure: global temperature (deviation from pre-industrial average) plotted over time.]
Table 13.3: Global Temperature Model Estimated Using OLS and Via ρ-Transformed Data

                          OLS             ρ-transformed
Year                     -0.165*         -0.174*
                         (0.031)         (0.057)
                         [t = 5.31]      [t = 3.09]
Year squared              0.000044*       0.000046*
                         (0.000008)      (0.000015)
                         [t = 5.48]      [t = 3.20]
Constant                 155.68*         79.97*
                         (30.27)         (26.67)
                         [t = 5.14]      [t = 2.99]
From auxiliary regression
ρ̂                         0.514*         -0.021
                         (0.077)         (0.090)
                         [t = 6.65]      [t = 0.28]
N                         128             127
R²                        0.79            0.55

Standard errors in parentheses.
* indicates significance at p < 0.05.
Table 13.3 reports that ρ̂ = 0.514, which is generated by estimating an auxiliary regression with the errors as the dependent variable and the lagged errors as the independent variable. The autocorrelation is lower than in the model without squared year as an independent variable (as reported on page 669), but it is nonetheless highly statistically significant, indicating that the errors are autocorrelated.
The second column of Table 13.3 shows results from a ρ-transformed model. The β̂1 and β̂2 haven't changed much from the first column. This outcome isn't too surprising given that both OLS and ρ-transformed models produce unbiased estimates of β1 and β2. The difference is in the standard errors. The standard errors on the Year and Year squared variables have almost doubled, which has almost halved the t statistics for β̂1 and β̂2 to near three.
In this particular instance, the relationship between year and temperature is so strong that even with these larger standard errors we will reject the null hypotheses of no relationship at any conventional significance level (such as α = 0.05 or α = 0.01). What we see, though, is the large effect addressing autocorrelation has on the standard errors. The standard errors produced by OLS were too small due to autocorrelation. In other words, OLS overstated the precision of our estimates.
Several aspects of the results from the ρ-transformed model are worth noting. First, the ρ̂ from the auxiliary regression is now very small (-0.021) and statistically insignificant, indicating that we have indeed purged the model of first-order autocorrelation. Well done! Second, the R² is lower in the ρ-transformed model; it reports the traditional goodness of fit statistic for the transformed model, but it is not directly meaningful or comparable to the R² in the original OLS model. Third, the constant changes quite a bit, from 155.68 to 79.97. Recall, however, that the constant in the ρ-transformed model is actually β0(1 − ρ) (where ρ is the estimate of autocorrelation in the untransformed model), which means the estimate of β0 is 79.97/(1 − 0.514) = 164.5, which is close to the estimate of β̂0 in the OLS model.
13.4 Dynamic Models
Another way to deal with time series data is to use a dynamic model. In a dynamic model, the value of the dependent variable directly depends on the value of the dependent variable
in the previous period. In this section we explain the dynamic model and discuss three ways it changes how we analyze the data. The dynamic model is
Yt = γYt−1 + β0 + β1Xt + εt   (13.7)
where the new term is γ (the Greek letter gamma) times the value of the lagged dependent variable, Yt−1. The coefficient γ indicates the extent to which the dependent variable depends on its lagged value. The higher it is, the more the dependence across time. If the data is really generated according to a dynamic process, omitting the lagged dependent variable would risk omitted variable bias; and given that the coefficient on the lagged dependent variable is often very large, that means we risk large omitted variable bias by omitting the lagged dependent variable.
As a practical matter, a dynamic model with a lagged dependent variable is super easy to estimate: We simply include the lagged value of the dependent variable as an additional independent variable in OLS. Be alert, though. This seemingly modest change in the model shakes up a lot of our statistical analysis.
First, the interpretation of the coefficients changes. In non-dynamic OLS models (which simply means OLS models that do not have a lagged dependent variable as an independent variable), a one-unit increase in X is associated with a β̂1 increase in Y. In a dynamic model, a one-unit increase in X in period 1 is associated with an immediate β̂1 increase in Y1, but we don't stop seeing that kind of effect there. Y2 will also go up because Y2 depends on Y1. In other words, an increase in X has not only immediate effects, but also long-term effects because the boost to Y in one period carries forward into later periods; the estimated long-term effect works out to β̂1 divided by (1 − γ̂) in the long term.7 If γ is big (near 1), then the dependent variable has a lot of memory. A change in one period strongly affects the value of the dependent variable in the next period. In this case, the long-term effect of X will be much bigger than β̂1 because the estimated long-term effect will be β̂1 divided by a small number. If γ is near zero, on the other hand, then the dependent variable has little memory, meaning that the dependent variable depends little on its value in the previous period. In this case, the long-term effect of X will be pretty much β̂1 because the estimated long-term effect will be β̂1 divided by a number close to one.
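A minimal R sketch of these calculations for hypothetical time-ordered vectors y and x (the division by 1 − γ̂ follows the long-term effect formula discussed above):
lag_y <- c(NA, y[-length(y)])                 # lagged dependent variable
dyn <- lm(y ~ lag_y + x)                      # dynamic model: Y_t on Y_{t-1} and X_t
gamma_hat <- coef(dyn)["lag_y"]
beta1_hat <- coef(dyn)["x"]
short_run <- beta1_hat                        # immediate effect of a one-unit change in X
long_run <- beta1_hat / (1 - gamma_hat)       # approximate long-term effect
c(short_run = unname(short_run), long_run = unname(long_run))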
Second, autocorrelated errors cause a lot more trouble in dynamic models than in non-dynamic models. Recall that in OLS, correlated errors mess up the standard OLS estimates of the variance of β̂1, but they do
7 The condition that the absolute value of γ is less than one rules out certain kinds of explosive processes in which Y gets increasingly bigger or smaller every period. This condition is related to the requirement that data be "stationary," as discussed below on page 683.
not bias the estimates of β̂1. In dynamic models, correlated errors cause bias. It's not too hard to see why. If εt is correlated with εt−1, then it also has to be correlated with Yt−1 because Yt−1 is obviously a function of εt−1. In such a situation, one of the independent variables (Yt−1) is correlated with the error, which is a bias-causing no-no in OLS. The bias is worse for the estimate of the coefficient on the lagged dependent variable than for β̂1. If the autocorrelation in the errors is modest or weak, this bias is relatively small.
Third, including a lagged dependent variable when it is irrelevant (meaning γ = 0) can lead to biased estimates of β̂1. Recall from page 233 that in OLS, including an irrelevant variable (a variable whose true coefficient is zero) will increase standard errors but will not cause bias. In a dynamic model, though, including the lagged dependent variable when γ = 0 leads β̂1 to be biased if the error is autocorrelated and the independent variable itself follows an autoregressive process (such that its value depends on its lagged value). When these two conditions hold, including the irrelevant lagged dependent variable causes β̂1 to be vastly understated because the lagged dependent variable will wrongly soak up much of the variation that X should explain.
Should we include a lagged dependent variable in our time series model? On the one hand, if we exclude the lagged dependent variable when it should be there (when γ ≠ 0), we risk omitted variable bias. On the other hand, if we include it when it should not be there (when γ = 0), we risk bias if the errors are autocorrelated. It's quite a conundrum.
There is no firm answer, but we’re not helpless. The best place to start is with theory
about the nature of the dependent variable being modeled. If we have good reasons to
suspect that it truly is a dynamic process then including the lagged dependent variable is
the best course. For example, many people suspect that political affiliation is a dynamic
process. What party a person identifies with depends not only on external factors like the
state of the economy, but also on what party he or she identified with last period. It’s a
well-known fact of life that many people interpret facts through partisan lenses. Democrats
will see economic conditions in a way that is most favorable to Democrats; Republicans
will see economic conditions in a way that is most favorable to Republicans. This means
that party identification will be sticky in the manner implied by the dynamic model, and it is therefore reasonable to include a lagged dependent variable when modeling it.
In addition, when we include a lagged dependent variable we should test for autocorrelated
errors. If we find that the errors are autocorrelated, then we should worry about possible
bias in the estimate of β̂1; the higher the autocorrelation of errors, the more we should worry.
We discussed how to test for autocorrelation on page 669. If we find autocorrelation, we can
ρ-transform the data to purge the autocorrelation; we'll see an example in our case study on
page 695.
Remember This
1. A dynamic time series model includes a lagged dependent variable as a control
variable. For example,
Yt = γYt−1 + β0 + β1Xt + εt
13.5 Stationarity
We also need to think about stationarity when we analyze time series data. A stationary variable has the same distribution throughout the entire time series. This is a complicated topic, and we'll only scratch the surface. The upshot is that stationarity is good and its opposite, nonstationarity, is a pain in the tuckus. When working with time series data, we need to diagnose nonstationarity and, if we find it, deal with it.
In this section we define nonstationarity as a so-called unit root problem and then explain how spurious regression results are a huge danger with nonstationary data. Spurious regression results are less likely with stationary data. We also show how to detect nonstationarity and what to do when we find it.
A variable is stationary if it has the same distribution for the entire time series. A variable is nonstationary if its distribution depends on time. A variable whose mean is getting bigger over time, for example, is nonstationary. Nonstationarity comes in multiple flavors, but we'll focus on a case in which data is prone to persistent trends in a way we define more precisely below. In this case, the mean of the distribution of the variable changes with time. Consider the following simple model:
Yt = γYt−1 + εt   (13.8)
We consider three cases for γ, the coefficient on the lagged dependent variable: when its absolute value is less than one, when it is greater than one, and when it is exactly equal to one.
If the absolute value of γ is less than one, life is relatively easy. The lagged dependent variable affects the dependent variable, but the effect diminishes over time. To see why, note that we can write the value of Y in the third time period as a function of the previous values of Y and the errors:
Y3 = γY2 + ε3
   = γ(γY1 + ε2) + ε3
   = γ(γ(γY0 + ε1) + ε2) + ε3
   = γ³Y0 + γ²ε1 + γε2 + ε3
When |γ| < 1, the effect of any given value of Y will decay over time. In this case, the effect of Y0 on Y3 is γ³Y0; because γ < 1, γ³ will be less than one. We could extend the above logic to show that the effect of Y0 on Y4 will be γ⁴, which is less than γ³ when γ < 1. The effect of the error term in a given period will also follow a similar pattern, fading as time goes on.
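A quick illustration in R: the weight on Y0 after k periods is γ^k, so the influence of the past fades when |γ| < 1 but never fades when γ = 1 (the unit root case discussed below). The γ values here are illustrative.
k <- 0:20
round(0.5^k, 4)    # gamma = 0.5: the weight on Y0 shrinks toward zero quickly
round(0.9^k, 4)    # gamma = 0.9: the weight shrinks more slowly
1^k                # gamma = 1: every past shock keeps its full weight forever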
This case presents some differences from standard OLS, but it turns out that the property that the effects of previous values of Y and the error fade away means that we will not face the serious problems described below for the unit root case.
What if γ were greater than one? In this case, we'd see an explosive process because the value of Y would grow by an increasing amount each period. Time series analysts rule out such a possibility on theoretical grounds. Variables just don't explode like this, certainly not the kinds of variables we typically analyze in the social sciences.
The tricky case occurs when γ is exactly equal to one. In this case the variable is said to have a unit root. In a model with a single lag of the dependent variable, a unit root simply means that the coefficient on the lagged dependent variable (γ for the model as we've written it) is equal to one. The terminology is a bit quirky: Unit refers to the number one and root refers to the source of something, in this case the lagged dependent variable whose coefficient equals one.
A variable with a unit root is nonstationary and causes several problems. The most serious is that spurious regression results are highly probable when regressing a variable with a unit root on another variable with a unit root.8 A spurious regression is one in which the regression results suggest that X affects Y when in fact X has no effect on Y; the apparent relationship is an artifact of the data.
It's reasonably easy to come up with possible spurious results in time series data. Think about the U.S. population from 1900 to 2010. It rose pretty steadily, right? Now think about the price of butter from 1900 to 2010. It also rose steadily. If we were to run a regression predicting the price of butter as a function of population, we would see a significant coefficient on population because low values of population went with low butter prices and high values of population went with high butter prices. Maybe that's true, but here's why we should be skeptical: It's quite possible these are just two variables that both happen to be trending up. We could replace the population of the United States with the population of Yemen (also trending up) and the price of butter with the number of deer in the United States (also trending up). We'd again have two variables trending together, and if we put them in a simple OLS model we would observe a spurious positive relationship between the population of Yemen and the number of deer in the United States.
8 Another problem is that the coefficient on the lagged dependent variable will be biased downward, so the coefficient divided by its standard error will not follow a t distribution.
The reason that a nonstationary variable is prone to spurious results is that a variable
with a unit root is trendy. Not in a fashionable sense, but in a streaky sense. A variable
with a unit root might go up for a while, then down for even longer, blip up, and then continue
down. These unit root variables look like Zorro slashed out their pattern with his sword: A
zig up, a long zag down, another zig up, and so on.9
Figure 13.4 shows examples of two simulated variables with unit roots. In panel (a), Y is simulated according to Yt = Yt−1 + εt; in this particular simulation, Y trends mostly up, but with periods in which it goes down for a bit. In panel (b), X is simulated according to Xt = Xt−1 + νt. In this particular simulation, X trends mostly down, with a flat period early on and some mini-peaks later in the time series. Importantly, X and Y have absolutely nothing to do with each other in the way they were generated: the errors for the two series were drawn independently of each other.
Panel (c) of Figure 13.4 scatterplots X and Y and includes a fitted OLS regression line.
The regression line has a negative slope that is highly statistically significant. And com-
pletely spurious. The variables are completely unrelated to each other. The reason we see a
significant relationship is that Y was working its way up while X was working its way down
for most of the first part of the series. These movements create a pattern in which a negative OLS coefficient occurs but does not indicate an actual relationship. In other words, panel (c) displays a spurious regression result.
9 Zorro’s slashes would probably go more side-to-side, so maybe think of unit root variables as slashed by an inebriated Zorro.
[Figure 13.4: Panels (a) and (b) plot the simulated Y and X series over time; panel (c) is a scatterplot of Y against X with a fitted OLS line (β̂1 = −0.81, t statistic for β̂1 = −36.1).]
Of course, this is a single example. It is, however, quite representative because unit root
variables are so prone to trends. When Y goes up, there is a pretty good chance that X will
be on a trend too: If X is going up, too, then the OLS coefficient on X would be positive. If
X is trending down when Y is trending up, then the OLS coefficient on X would be negative.
Hence, the sign of the coefficient in these spurious regression results is not predictable. What is predictable is that two such variables will often exhibit (spurious) statistically significant relationships.10
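A minimal R sketch that mimics the logic behind Figure 13.4: two unit root series generated completely independently of each other will often produce a large (and spurious) t statistic. The seed and number of periods are arbitrary, so exact results will differ from the figure.
set.seed(7)                      # arbitrary seed
periods <- 200
y <- cumsum(rnorm(periods))      # Y_t = Y_{t-1} + error: a random walk
x <- cumsum(rnorm(periods))      # X_t = X_{t-1} + error, generated independently of Y
summary(lm(y ~ x))               # the t statistic on x is typically far larger than it should be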
Variables without unit roots behave differently. Panels (a) and (b) of Figure 13.5 show a
simulation of two time series variables where the coefficient on the lagged dependent variable
is 0.5 (as opposed to 1.0 in the unit root simulations). They certainly don’t look like Zorro
sword slashes. They look more like Zorro sneezed them out. And OLS finds no relationship
between the two variables, as is clear in panel (c), which shows a scatterplot of X and Y .
Again, this is a single simulation, but it is a highly representative one because variables without unit roots typically don't exhibit the trendiness that causes unit root variables to produce spurious results.
[Figure 13.5, panels (a) and (b): the simulated Y and X series over time; panel (c): scatterplot of Y against X with a fitted OLS line (β̂1 = −0.08, t statistic for β̂1 = −0.997).]
FIGURE 13.5: Data without Unit Roots
Unit roots are surprisingly common in theory and practice. Unit roots are also known as random walks because the series starts at Yt−1 and takes a random step (the error term) from there. Random walks are important in finance; the efficient market hypothesis holds that stock market prices account for all information such that there will be no systematic pattern going forward. A classic book about investing is A Random Walk Down Wall Street (Malkiel 2003); the title is not, ahem, random, but connects unit roots to finance via the random walk terminology. In practice, many variables show signs of having unit roots, including GDP.
To test for a unit root (which means the variable is nonstationary), we test whether γ is equal to one for the dependent variable and the independent variables. If γ is equal to one for a variable or variables, we have nonstationarity and worry about spurious regression results.
The main test for unit roots has a cool name: the Dickey-Fuller test. This is a hypothesis test in which the null hypothesis is γ = 1 and the alternative hypothesis is γ < 1. The standard way to implement the Dickey-Fuller test is to transform the model by subtracting Yt−1 from both sides of Equation 13.8, which yields
ΔYt = (γ − 1)Yt−1 + εt
which we can rewrite as
ΔYt = αYt−1 + εt
where the dependent variable (ΔYt, pronounced "delta Y"; Δ is a capital Greek delta, a different symbol than the lowercase delta, δ) is now the change in Y in period t and the independent variable is the lagged value of Y. Here we're using notation suggesting a unit root test for the dependent variable. We also run unit root tests with the same approach for independent variables.
The new parameter is α = γ − 1. Under the null hypothesis that γ = 1, our new parameter α equals zero. Under the alternative hypothesis that γ < 1, our new parameter α is less than zero.
In practice we typically use an augmented Dickey-Fuller test that also includes an intercept, a time trend, and the lagged change in Y:
ΔYt = αYt−1 + β0 + β1Timet + β2ΔYt−1 + εt
where Timet is a variable indicating which time period observation t is. Time is equal to 1 in the first period, 2 in the second, and so on.
The focus of the Dickey-Fuller approach is the estimate of α. What we do with our estimate of α takes some getting used to. The null hypothesis is that Y is nonstationary.
That's bad. We want to reject the null hypothesis. The alternative is that Y is stationary. That's good. If we reject the null hypothesis in favor of the alternative hypothesis that α < 0, we can treat the variable as stationary.
The catch is that if the variable actually is nonstationary, the estimated coefficient is not normally distributed, which means the coefficient divided by its standard error will not have a t distribution. Hence we have to use so-called Dickey-Fuller critical values, which are bigger than standard critical values, making it harder to reject the null hypothesis that the variable is nonstationary. Statistical software reports these critical values, as we discuss in the Computing Corner; more details are in the references indicated in the Further Reading section and the appendix.
If the Dickey-Fuller test indicates that a variable is nonstationary, the standard approach is to move to a differenced model in which all variables are converted from levels to changes (the change is the difference between the value of the variable at time t and time t − 1). We'll see an example on page 700.
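Converting levels to changes is easy in practice; here is a minimal R sketch for hypothetical time-ordered vectors y and x (diff() returns a series one observation shorter, so we pad with NA to keep the rows aligned).
change_y <- c(NA, diff(y))       # Delta Y_t = Y_t - Y_{t-1}
change_x <- c(NA, diff(x))       # same for a hypothetical independent variable x
summary(lm(change_y ~ change_x)) # a differenced regression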
Remember This
A variable is stationary if its distribution is the same for the entire data set. A common
violation of stationarity occurs when data has a persistent trend.
1. Nonstationary data can lead to statistically significant regression results that are
spurious when two variables have similar trends.
2. The test for stationarity is a Dickey-Fuller test. The most widely used format of
this test is an augmented Dickey-Fuller test:
ΔYt = αYt−1 + β0 + β1Timet + β2ΔYt−1 + εt
If we reject the null hypothesis that – = 0, we conclude that the data is stationary
and can use untransformed data. If we fail to reject the null hypothesis that – = 0,
we conclude the data is nonstationary and should use a model with differenced
data.
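In addition to running the augmented Dickey-Fuller regression by hand (as in the Computing Corner), packaged implementations exist. A minimal sketch, assuming the tseries package is installed and y is a time-ordered vector:
library(tseries)                                # provides adf.test()
adf.test(y, alternative = "stationary", k = 1)  # a small p-value rejects the unit-root null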
The figure below plots global temperature together with carbon dioxide emissions, measured in parts per million, with values indicated on the right side of the figure. Clearly, these variables seem to move together. The question is how confident we can be that the relationship is real rather than spurious.
We'll analyze this question with a dynamic model. We begin with a model that allows for the non-linear time trend from page 675; this model has Year and Year squared as independent variables.11
We'll also include temperature from the previous time period (the lagged dependent variable), which makes the model a dynamic model.
11 Including these variables is not a no-brainer; one might argue that the independent variables are causing the
non-linear time trend and we don’t want the time trend in there to soak up variance. Welcome to time series analysis.
Without definitively resolving the question, we’ll start from there because including time trends is an analytically
conservative approach in the sense that it will typically make it harder, not easier, to find statistical significance for
independent variables.
[Figure: global temperature (deviation from pre-industrial average, in Fahrenheit; left-hand scale) and carbon dioxide (parts per million; right-hand scale) plotted over time.]
The independent variable of interest here is carbon dioxide. We want to know whether increases in carbon dioxide are associated with increases in temperature. The model is
Temperaturet = γTemperaturet−1 + β0 + β1Yeart + β2Yeart² + β3CO2t + εt   (13.10)
where CO2t is a measure of the concentration of carbon dioxide in the atmosphere. This is a much (much!) simpler model than climate scientists use; our model simply gives us a broad-brush picture as to whether the relationship between carbon dioxide and temperature holds up once we account for time series complications.
Our first worry is that the data might not be stationary. If that is the case, there is a risk of spurious regression. Therefore the first two columns of Table 13.4 show Dickey-Fuller results for the substantive variables, temperature and carbon dioxide. We use an augmented Dickey-Fuller test that includes a time trend and a lagged change variable.
Recall that the null hypothesis in a Dickey-Fuller test is that the data is nonstationary. The alternative hypothesis in a Dickey-Fuller test is that the data is stationary; we will accept this alternative only if the coefficient is sufficiently negative. (Yes, this way of thinking takes some getting used to.)
To show that data is stationary (which is a good thing!), we need a sufficiently negative t statistic on the estimate of α. For the temperature variable, the t statistic in the Dickey-
Fuller test is -4.22. As we discussed earlier, the critical values for the Dickey-Fuller test are
not the same as for standard t tests. They are listed in the bottom of the table. Because
the t statistic on the lagged value of temperature is more negative than the critical value,
even at the 1 percent level, we can reject the null hypothesis of nonstationarity. In other
words, the temperature data is stationary. We get a different answer for carbon dioxide.
The t statistic is positive. That immediately dooms a Dickey-Fuller test because we need to
see t statistics more negative than the critical values in order to reject the null. In this case,
we do not reject the null hypothesis and therefore conclude that the carbon dioxide data
is nonstationary. This means that we should be wary of using the carbon dioxide variable in its level form.
A good way to begin to deal with nonstationarity is to use differenced data, which we generate by creating a variable that is the change of a variable in period t, as opposed to the level of the variable in period t.
We still need to check for stationarity with the differenced data and hence, back we go
to Table 13.4 for the Dickey-Fuller tests; this time the last two columns test for stationarity
using the change in temperature and change in carbon dioxide variables. The t statistic
on the lagged value of the change in temperature is -12.04, allowing us to easily reject the
null hypothesis of nonstationarity for temperature. For carbon dioxide, the t statistic on
the lagged value of the change in carbon dioxide is -3.31, which is more negative than the critical value at the 10 percent level. We conclude that the change in carbon dioxide is stationary. However,
because the change in carbon dioxide is stationary only at the 10 percent level, a thorough analysis would also explore additional time series techniques, such as the error correction model discussed in the appendix.12
Because of the nonstationarity of the carbon dioxide variable, we'll work with a differenced
12 Dickey-Fuller tests tend to be low powered (see, e.g., Kennedy 2008, 302), meaning that they often fail to reject the null even when the null is false. For this reason, some people are willing to use relatively high significance levels (e.g., α = 0.10). In addition, the consequences of failing to address nonstationarity are serious when data is nonstationary, while the consequences of addressing nonstationarity when data is actually stationary are not such a big deal. Thus many researchers are inclined to use differenced data when there are any hints of nonstationarity (Kennedy 2008, 309).
model in which the variables are changes. The dependent variable is the change in temperature. The independent variables reflect the change in each of the variables from Equation 13.10. Because the change in Year is 1 every year, this variable disappears (a variable that doesn't vary is no variable!); the intercept will now capture the average annual rise or fall in the dependent variable. The other variables are simply the changes in the remaining variables from Equation 13.10.
Table 13.5 displays the results. The change in carbon dioxide is, indeed, statistically significant, with a coefficient of 0.052 and a t statistic of 2.004. In this instance, then, the visual relationship we see between temperature and carbon dioxide holds up even after we account for nonstationarity and the dynamics of the dependent variable.
13.6 Conclusion
Time series data is all over: prices, jobs, elections, weather, migration, and much more. To analyze it well, we need to deal with several challenges.
One is autocorrelation. Autocorrelation does not cause coefficient estimates from OLS to be biased. It does, however, render the standard equation for the variance of β̂ (from page 225) inaccurate. Often standard OLS will produce standard errors that are too small when there is autocorrelation,
Table 13.5: Change in Temperature as a Function of Change in Carbon Dioxide and Other Factors
giving us false confidence about how precise our understanding of the relationship is.
We can correct for autocorrelation by ρ-transforming the data. This approach produces not only unbiased estimates of β1 (just like OLS) but also correct standard errors for β̂1 (an improvement over OLS). In the ρ-transformation approach, we model the error at time t as a function of the error at time t − 1 plus an uncorrelated random error.
Another, more complicated, challenge in time series data is the possibility that the dependent variable is dynamic, which means that the value of the dependent variable in one period depends directly on its value in the previous period. Dynamic models include the lagged dependent variable as an independent variable. In these models, the effect of the independent variable
has short-term and long-term elements. In dynamic models, autocorrelation creates bias, and including a lagged dependent variable that does not belong in the model can also cause bias.
As a practical matter, time series analysis can be hard. Very hard. This chapter lays the foundations, but there is a much larger literature that gets funky fast. In fact, sometimes the many options can feel overwhelming. Here are some considerations to keep in mind when analyzing time series data:
• Deal with stationarity. It's often treated as an advanced topic, but it can be a serious problem.
• It's probably a good idea to use a lagged dependent variable, and if we do, check for autocorrelation. Autocorrelation does not cause bias in standard OLS, but when we include a lagged dependent variable, it can cause bias. If we don't check for autocorrelation ourselves, eventually someone will check it for us. We want to know the answer first.
• We may end up estimating several models: perhaps a ρ-transformed model, a model with a lagged dependent variable, and perhaps a differenced model. How do we know which model is correct? Ideally, all models provide more or less the same result. Whew. All too often, though, they do not. Then we need to conduct diagnostics and also think carefully about
the data generating process. Is the data dynamic such that this year’s dependent variable
depends directly on last year’s? If so, we should probably lean toward the results from
the model with the lagged dependent variable. If not, then we might lean toward the ρ-transformed result. Sometimes we may simply have to report both and give our honest best
sense of which seems more consistent with theory and the data.
• Section 13.1: Define autocorrelation and describe its consequences for OLS.
• Section 13.2: Describe two ways to detect autocorrelation in time series data.
• Section 13.4: Explain what a dynamic model is and three ways dynamic models differ from standard OLS models.
• Section 13.5: Explain stationarity and how nonstationary data can produce spurious regression results.
Further Reading
Researchers do not always agree on whether lagged dependent variables should be included
in models. Achen (2000) discusses bias that can occur when lagged dependent variables
are included. Keele and Kelly (2006) present simulation evidence that the bias that occurs
when including a lagged dependent variable is small unless the autocorrelation of errors is
quite large. Wilson and Butler (2007) discuss how the bias is worse for the coefficient on the lagged dependent variable than for the coefficients on other variables.
De Boef and Keele (2008) provide a nice discussion of the error correction model, a model
which can accommodate a broad range of time series dynamics into a single model.
of the latest in time series modeling techniques. Wooldridge (2009, chapters 11 and 18)
discusses advanced topics in time series analysis, including stationarity and cointegration.
Stock and Watson (2011) provide an extensive introduction to using time series models to forecast.
Key Terms
• AR(1) model (664)
• Augmented Dickey-Fuller test (692)
• Autoregressive model (660)
• Cross-sectional data (658)
• Dickey-Fuller test (691)
• Dynamic model (678)
• Lagged variable (661)
• Spurious regression (686)
• Stationarity (683)
Computing Corner
Stata
1. To detect autocorrelation, proceed in the following steps:
regress Temp Year /* Estimate basic regression model */
predict Err, resid /* Save residuals using resid */
/* subcommand of predict command */
scatter Err Year /* Plot residuals over time */
tsset Year /* Tell Stata which variable indicates time */
reg Err L.Err /* Auxiliary regression of residuals on lagged residuals */
/* (the L. prefix creates lagged values and requires tsset) */
2. To correct for autocorrelation, proceed in two steps:
tsset Year /* Identify time series */
prais Temp Year, corc twostep /* rho-transformed model */
The tsset command informs Stata which variable orders the data chronologically. The
prais command (pronounced “price” and named after the originator of the technique)
is the main command for estimating fl-transformed models. The subcommands after
the comma (corc twostep) tell Stata to handle the first observation as we have here.
There are other options described in the Stata help which can be accessed by typing
help prais.
3. Running a dynamic model is simple: Just include a lagged dependent variable. If we have already told Stata which variable indicates time using the tsset command described in part 1 above, then we can simply run reg Y L.Y X1 X2. Or, we can create the lagged dependent variable manually before running the model:
gen LagY = Y[_n-1] /* Generate lagged dependent variable */
reg Y LagY X1 X2 X3
4. To implement an augmented Dickey-Fuller test, use Stata’s “dfuller” command, using
the “trend” subcommand to include the trend variable and the “lags(1)” subcommand to
include the lagged change. The “regress” subcommand displays the regression results
underlying the Dickey-Fuller test. Stata automatically displays the relevant critical
values for this test.
dfuller Y, trend lags(1) regress
R
1. To detect autocorrelation in R, proceed in the following steps:
ClimateOLS = lm(Temp ~ Year) # Estimate basic regression model
Err = resid(ClimateOLS) # Save residuals
plot(Year, Err) # Plot residuals over time
LagErr = c(NA, Err[1:(length(Err)-1)]) # Generate lagged error variable
LagErrOLS = lm(Err ~ LagErr) # Auxiliary regression
summary(LagErrOLS) # Display results
2. To correct for autocorrelation, proceed in the following steps:
Rho = summary(LagErrOLS)$coefficients[2] # Rho is the estimate of rho-hat
N = length(Temp) # Length of Temp variable
LagTemp = c(NA, Temp[1:(N-1)]) # Create lagged temperature
LagYear = c(NA, Year[1:(N-1)]) # Create lagged year
TempRho = Temp - Rho*LagTemp # Create rho-transformed temperature
YearRho = Year - Rho*LagYear # Create rho-transformed year
ClimateRho = lm(TempRho ~ YearRho) # Estimate rho-transformed model
summary(ClimateRho) # Display results
3. Running a dynamic model is simple: Just include a lagged dependent variable.
ClimateLDV = lm(Temp ~ LagTemp + Year) # Estimate dynamic model with lagged dependent variable
4. We can implement an augmented Dickey-Fuller test by creating the variables in the
model and running the appropriate regression. For example,
ChangeTemp = Temp - LagTemp # Create change in temperature
LagChangeTemp = c(NA, ChangeTemp[1:(N-1)]) # Create lagged change in temperature
AugDickeyF = lm(ChangeTemp ~ LagTemp + Year + LagChangeTemp) # Augmented Dickey-Fuller regression
summary(AugDickeyF) # Display results
Exercises
1. The Washington Post published data on bike share ridership (measured in trips per
day) over the month of January 2014. Bike share ridership is what we want to explain.
The Post also provided data on daily low temperature (a variable we call lowtemp) and
a dummy variable for weekends. We’ll use these as our explanatory variables. The data
is available in BikeShare.dta.
a. Use an auxiliary regression to assess whether the errors are autocorrelated.
b. Run a model that corrects for AR(1) autocorrelation. Are these results different from
a model in which we do not correct for AR(1) autocorrelation? So that everyone is on the same page, use the , corc twostep subcommands.
2. These questions revisit the monetary policy data we worked with in Chapter 6 on page
309.
a. Estimate a model of the federal funds rate, controlling for whether the president
was a Democrat, the number of quarters from the last election, an interaction of
the Democrat dummy variable and the number of quarters from the last election,
and inflation. Assess whether there is first order autocorrelation using a plot and an
auxiliary regression.
b. Estimate the model from part (a) using the fl-transformation approach and interpret
the coefficients.
c. Estimate the model from part (a), but add a variable for the lagged value of the
federal funds rate. Interpret the results and assess whether there is first order auto-
correlation using a plot and an auxiliary regression.
d. Estimate the dynamic model (with a lagged dependent variable) using the ρ-transformation
approach and interpret the coefficients.
3. The file BondUpdate.dta contains data on James Bond films from 1962 to 2012. We
want to know how budget and ratings mattered for how well the movies did at the box
office. Table 13.6 describes the variables.
Table 13.6: Variables for James Bond Movie Questions
a. Estimate an OLS model in which the amount each film grossed is the dependent
variable and ratings and budgets are the independent variables. Assess whether
there is autocorrelation.
b. Correct for autocorrelation. Did the results change? Did the autocorrelation go
away?
c. Now estimate a dynamic model. What is the short-term and (approximate) long-term
effect of a 1-point increase in rating?
d. Assess the stationarity of the revenue, rating, and budget variables.
e. Estimate a differenced model and explain results.
f. Build from the above models to assess the worth (in terms of revenue) of specific
actors.
CHAPTER 14
ADVANCED OLS
In Chapters 3 through 5 we worked through the OLS model from the basic bivariate model to a variety of multivariate models. We focused on the practical and substantive issues that arise when using these models to answer real questions with real data.
we do in this chapter. We also go into more detail about omitted variable bias by deriving
the conditions for it to exist in a particular case and discussing how these results generalize.
We derive the OLS estimate of β̂1 in a simplified model and show it is unbiased in Section 14.1. Section 14.2 derives the variance of β̂1, showing how the conditions that errors are homoscedastic and not correlated with each other are necessary for the basic equation for the variance of β̂1. Section 14.3 derives the omitted variable bias conditions explained in Chapter
5. Section 14.4 shows how to anticipate the sign of omitted variable bias, a useful tool when
faced with an omitted variable problem. Section 14.5 extends the omitted variable bias
framework to models with multiple independent variables. Things get complicated fast.
However, we can see how the core intuition carries on. Section 14.6 derives the equation for
The best way to appreciate how the OLS assumptions come together to produce coefficient estimates that are unbiased, consistent, normally distributed, and with a specific standard error equation is to derive the equations for the β̂ estimates. The good news is that the process is really quite cool. The other good news is that it's not that hard. The bad news is, well, math. Two good newses beat one bad news, so off we go.
In this section we derive the equation for β̂1 for a simplified regression model and then show that the estimate is unbiased.
We work here with a simplified model that has a single independent variable and coefficient, but no intercept. This model builds from King, Keohane, and Verba (1994, 98).
Yi = β1Xi + εi   (14.1)
Not having β0 in the model simplifies the derivation considerably while retaining the essential intuition.
Our goal is to find the value of β̂1 that minimizes the sum of the squared residuals; this value will produce a line that best fits the scatterplot. The residual for a given observation is
ε̂i = Yi − β̂1Xi
and the sum of squared residuals is
Σ ε̂i² = Σ (Yi − β̂1Xi)²   (14.2)
We want to figure out what value of β̂1 minimizes this sum. Some simple calculus does the trick. A function reaches a minimum or maximum at a point where its slope is flat, that is, where the slope is zero. The derivative is the slope, so we simply have to find the value of β̂1 at which the derivative of Equation 14.2 with respect to β̂1 equals zero.
To confirm that a flat spot is a minimum rather than a maximum, we check the second derivative: if we are at a peak, our slope should get more negative as X gets bigger (we go downhill); if we are at a minimum, our slope should get bigger as X goes higher. The second derivative measures changes in the derivative, so it has to be negative for a flat spot to be a maximum (and we have to be aware of things like "saddle points" - topics covered in any calculus book).
Setting the derivative of Equation 14.2 with respect to β̂1 equal to zero and simplifying yields
Σ 2(Yi − β̂1Xi)Xi = 0
Σ (Yi − β̂1Xi)Xi = 0
Σ YiXi − β̂1 Σ Xi² = 0
Σ YiXi = β̂1 Σ Xi²
Dividing both sides by Σ Xi² gives
β̂1 = Σ YiXi / Σ Xi²   (14.3)
Equation 14.3, then, is the OLS estimate for β̂1 in a model with no β0. It looks quite similar to the equation for the OLS estimate of β̂1 in the bivariate model with β0 (which is Equation 3.4 on page 72). The only difference is that here we do not subtract X̄ from Xi and Ȳ from Yi.
In the model with an intercept, the logic is the same, but we take the derivative with respect to β̂0 and with respect to β̂1 to produce two equations that we solve simultaneously.
The estimate β̂1 is a random variable because its equation includes Yi, which we know depends on εi, which is a random variable. Hence β̂1 will bounce around as the values of εi bounce around.
We can use Equation 14.3 to explain the relationship of β̂1 to the true value of β1 by substituting for Yi and simplifying.
Substitute for Yi using Equation 14.1 (which is the simplified model we're using here, in which β0 = 0):
β̂1 = Σ (β1Xi + εi)Xi / Σ Xi²
Multiplying out the numerator and separating the sums yields an equation that characterizes the estimate in terms of the unobserved "true" values of β1 and ε:
β̂1 = β1 (Σ Xi² / Σ Xi²) + Σ εiXi / Σ Xi² = β1 + Σ εiXi / Σ Xi²   (14.4)
In other words, β̂1 is β1 (the true value) plus an ugly fraction with sums of ε and X in it.
From this point, we can show that β̂1 is unbiased. Here we need to show the conditions under which the expected value of β̂1 equals β1. The expected value of β̂1 is the value of β̂1 we would get, on average, if we repeatedly regenerated data sets from the original model and calculated the average of all the β̂1s estimated from these multiple data sets. It's not that we would ever do this - in fact, with observational data it is impossible to do so. Instead, thinking of estimating β̂1 from multiple realizations from the true model is a conceptual way for us to think about whether the coefficient estimates on average skew too high, skew too low, or hit the true value.
It helps the intuition to note that we could, in principle, generate the expected value of β̂1 for an experiment if we re-ran the experiment over and over again and calculated the
average of the β̂1 s estimated. Or, more plausibly, we could run a computer simulation in which we repeatedly regenerated data (which would involve simulating a new εi for each observation for each iteration) and calculated the average of the β̂1 s estimated.
To show that β̂1 is unbiased we use the formal statistical concept of expected value. The expected value of a random variable is the value we expect the random variable to take, on average, across many realizations. We proceed in the following steps.
1. Take the expected value of both sides of Equation 14.4.
2. The expectation of a fixed number is that number, meaning that E[β1] = β1. Recall that in our model, β1 (without the hat) is some number. We don't know it, but it is some number, maybe 2, maybe 0, maybe -0.341. Hence the expectation of β1 is simply whatever number it is. It's like asking what the expectation of the number 2 is. It's 2!

$E[\hat{\beta}_1] = \beta_1 + E\left[\frac{\sum \epsilon_i X_i}{\sum X_i^2}\right]$

3. Use the fact that E[k × g(ε)] = k × E[g(ε)] for constant k and random function g(ε). Here 1/ΣXi² is a constant (equaling one over whatever the sum of Xi² is) and Σ εi Xi is a random variable:

$E[\hat{\beta}_1] = \beta_1 + \frac{1}{\sum X_i^2} E\left[\sum \epsilon_i X_i\right]$

4. We can move the expectation operator inside the summation because the expectation of a sum is the sum of expectations:

$E[\hat{\beta}_1] = \beta_1 + \frac{1}{\sum X_i^2} \sum E[\epsilon_i X_i] \qquad (14.5)$
Equation 14.5 means that the expectation of β̂1 is the true value (β1) plus some number 1/ΣXi² times the sum of the E[εi Xi]s. At this point we use our Very Important Condition, which is the exogeneity condition that εi and Xi are uncorrelated. We show next that this condition is equivalent to saying that E[εi Xi] = 0, which means Σ E[εi Xi] = 0, which will imply that β̂1 is unbiased.
1. If εi and Xi are uncorrelated, then the covariance of εi and Xi is zero because correlation is defined as

$correlation(X_i, \epsilon_i) = \frac{covariance(X_i, \epsilon_i)}{\sqrt{var(X_i)var(\epsilon_i)}}$

2. Using the definition of covariance and setting it to zero yields the following, where we refer to the mean of Xi as µX and the mean of the εi distribution as µε (the Greek letter mu):

$E[X_i \epsilon_i - X_i \mu_\epsilon - \mu_X \epsilon_i + \mu_X \mu_\epsilon] = 0$
4. Using the fact that the expectation of a sum is the sum of expectations, we can rewrite the equation as

$E[X_i \epsilon_i] - E[X_i \mu_\epsilon] - E[\mu_X \epsilon_i] + E[\mu_X \mu_\epsilon] = 0$

5. Using the fact that µε and µX are fixed numbers, we can pull them out of the expectations:

$E[X_i \epsilon_i] - \mu_\epsilon E[X_i] - \mu_X E[\epsilon_i] + \mu_X \mu_\epsilon = 0$

6. Here we add an additional assumption that is necessary, but not particularly substantively interesting. We assume that the mean of the error distribution is zero. In other words, we assume µε = 0, which is another way of saying that the error term in our model is simply the random noise around whatever the constant is.3 This assumption allows us to cancel any term with µε or with E[εi]. In other words, if the exogeneity condition is satisfied and the error term has mean zero, then

$E[X_i \epsilon_i] = 0$
If E[Xi εi] = 0, Equation 14.5 tells us that the expected value of β̂1 will be β1. In other words, if the error term and independent variable are uncorrelated, then the OLS estimate
3 In a model that has a non-zero β0, the estimated constant coefficient would absorb any non-zero mean in the error term. For example, if the mean of the error term was actually 5, then the estimated constant would simply be five bigger than what it would be otherwise. Because we so seldom care about the constant term, it's reasonable to think of it simply as including the mean value of any error term.
β̂1 is an unbiased estimator of β1. This same logic carries through in the bivariate model with a constant.
Showing that β̂1 is unbiased does not say much about whether any given estimate will be near β1. The estimate β̂1 is a random variable after all and it is possible that some β̂1 will be way too low and that some will be way too high. All that unbiasedness says is that, on average, the β̂1 will not run higher or lower than the true value.
Remember This
1. We derive the β̂1 equation by setting the derivative of the sum of squared residuals equation to zero and solving for β̂1.
2. The key step in showing that β̂1 is unbiased depends on the condition that X and ε are uncorrelated.
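A minimal simulation sketch (not from the book; the parameter values and the use of numpy are assumptions for illustration) of the unbiasedness result: regenerate data from the no-intercept model many times and check that the average of β̂1 = Σ Yi Xi / Σ Xi² (Equation 14.3) is close to the true β1.

```python
# Simulation sketch: unbiasedness of beta_1_hat in the no-intercept model.
import numpy as np

rng = np.random.default_rng(0)
beta1 = 2.0
n, reps = 100, 5000
X = rng.uniform(1, 10, size=n)        # X treated as fixed across replications

estimates = np.empty(reps)
for r in range(reps):
    eps = rng.normal(0, 3, size=n)    # mean-zero error, uncorrelated with X
    Y = beta1 * X + eps
    estimates[r] = np.sum(Y * X) / np.sum(X ** 2)   # Equation 14.3

print("true beta1:", beta1)
print("average estimate across replications:", estimates.mean())
```

The average estimate lands very close to the true value, even though any individual estimate bounces around it.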
In this section we show how to derive an equation for the standard error of β̂. In so doing we see how we use the conditions that errors are homoscedastic and uncorrelated with each other. Importantly, these assumptions are not necessary for unbiasedness of OLS estimates. If these assumptions do not hold, we can still use OLS, but we'll have to do something different (as discussed in Chapter 13, for example) to get the right standard error estimates.
We'll combine two assumptions and some statistical properties of the variance operator to produce a specific equation for the variance of β̂1. We assume that the Xi are fixed numbers.
1. We start with the β̂1 equation (Equation 14.4) and take the variance of both sides:

$var[\hat{\beta}_1] = var\left[\beta_1 + \frac{\sum \epsilon_i X_i}{\sum X_i^2}\right]$

2. Use the fact that the variance of a sum of a constant (the true value β1) and a function of a random variable is simply the variance of the function of the random variable (see variance fact #1 in the appendix):

$var[\hat{\beta}_1] = var\left[\frac{\sum \epsilon_i X_i}{\sum X_i^2}\right]$

3. Note that 1/ΣXi² is a constant (as we noted on page 715, too) and use variance fact #2 from page 769 that the variance of k times a random variable is k² times the variance of the random variable:

$var[\hat{\beta}_1] = \left(\frac{1}{\sum X_i^2}\right)^2 var\left[\sum \epsilon_i X_i\right]$

4. The no autocorrelation condition (as discussed in Section 3.6 of Chapter 3) means that corr(εi, εj) = 0 for all i ≠ j. If this condition is satisfied, we can treat the variance of a sum as the sum of the variances (using variance fact #4 on page 769 that says that the variance of a sum of uncorrelated random variables equals the sum of the variances of the individual variables):

$var[\hat{\beta}_1] = \left(\frac{1}{\sum X_i^2}\right)^2 \sum var[X_i \epsilon_i]$

5. Because the Xi are fixed numbers, we can pull Xi² out of each variance term (variance fact #2 again):

$var[\hat{\beta}_1] = \left(\frac{1}{\sum X_i^2}\right)^2 \sum X_i^2 \, var[\epsilon_i]$
6. If we assume homoscedasticity (as discussed in Section 3.6 of Chapter 3), we can make additional simplifications. If the error term is homoscedastic, the variance for each εi is the same value, σ²:

$var[\hat{\beta}_1] = \left(\frac{1}{\sum X_i^2}\right)^2 \sum X_i^2 \sigma^2 = \sigma^2 \frac{\sum X_i^2}{\left(\sum X_i^2\right)^2} = \frac{\sigma^2}{\sum X_i^2} \qquad (14.6)$

7. If we don't assume homoscedasticity, we can use ε̂i² as the estimate for the variance of each observation's error:

$var[\hat{\beta}_1] = \left(\frac{1}{\sum X_i^2}\right)^2 \sum X_i^2 \hat{\epsilon}_i^2 \qquad (14.7)$

Equation 14.7 is great in that it provides an appropriate estimate for the variance of β̂1 even when errors are heteroscedastic. However, it is quite unwieldy, making it harder for us to see the intuition about variance as we can with the variance of β̂1 when errors are homoscedastic.
In this section we have derived the variance of β̂1 in our simplified model with no constant (for both homoscedastic and heteroscedastic cases). Equation 14.6 looks quite similar to the
variance of the homoscedastic bivariate model with a constant, which we saw on page 95 in Chapter 3. The only difference is that when β0 is included in the model, the sum in the denominator is Σ(Xi − X̄)² instead of ΣXi². The derivation process is essentially the same in both cases.
Let's take a moment to appreciate how amazing it is that we have been able to derive an equation for the variance of β̂1. With just a few assumptions, we are able to characterize how precise our estimate of β̂1 will be as a function of the variance of ε and the Xi values. The equation for the variance of β̂1 in the multivariate model is similar (see Equation 5.9 on page 225), and the intuition discussed here applies for that model as well.
Remember This
1. We derive the variance of β̂1 equation by calculating the variance of the β̂1 equation.
2. If the errors are homoscedastic and not correlated with each other, the variance equation is in a convenient form.
3. If the errors are not homoscedastic or are correlated with each other, OLS estimates are still unbiased, but the easy-to-use standard OLS equation for the variance of β̂1 is no longer appropriate.
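A rough sketch (not from the book; parameter values and helper names are assumptions) comparing the homoscedastic formula (Equation 14.6) and the residual-based formula (Equation 14.7) with the simulated sampling variance of β̂1 in the no-intercept model.

```python
# Compare Equation 14.6 and Equation 14.7 with the simulated variance of beta_1_hat.
import numpy as np

rng = np.random.default_rng(1)
beta1, sigma = 2.0, 3.0
n, reps = 200, 5000
X = rng.uniform(1, 10, size=n)

ests = np.empty(reps)
for r in range(reps):
    eps = rng.normal(0, sigma, size=n)           # homoscedastic errors
    Y = beta1 * X + eps
    ests[r] = np.sum(Y * X) / np.sum(X ** 2)

# One sample to compute the two variance estimates
eps = rng.normal(0, sigma, size=n)
Y = beta1 * X + eps
b1 = np.sum(Y * X) / np.sum(X ** 2)
resid = Y - b1 * X
sigma2_hat = np.sum(resid ** 2) / (n - 1)                          # assumed estimate of sigma^2
var_classical = sigma2_hat / np.sum(X ** 2)                        # Equation 14.6
var_robust = np.sum(X ** 2 * resid ** 2) / (np.sum(X ** 2) ** 2)   # Equation 14.7

print("simulated variance of beta_1_hat:", ests.var())
print("classical estimate (Eq. 14.6):   ", var_classical)
print("robust estimate (Eq. 14.7):      ", var_robust)
```

With homoscedastic errors the two formulas give similar answers; with heteroscedastic errors only the Equation 14.7 version remains appropriate.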
On page 214 in Chapter 5 we discussed omitted variable bias, an absolutely central concept in understanding multivariate OLS. In this section we derive the conditions for omitted variable bias. Suppose the true model is

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \nu_i \qquad (14.8)$
where Yi is the dependent variable, there are two independent variables, X1i and X2i, and νi (the Greek letter nu, pronounced "new") is an error term that is not correlated with any of the independent variables. For example, suppose the dependent variable is test scores and the independent variables are class size and family wealth. We assume (for this discussion) that Equation 14.8 is the true model.
Suppose, however, that we estimate a model with only X1i:

$Y_i = \beta_0^{OmitX_2} + \beta_1^{OmitX_2} X_{1i} + \epsilon_i \qquad (14.9)$

where we will use β1^OmitX2 to indicate the estimate we get from the model that omits variable X2. How close will β̂1^OmitX2 (the coefficient on X1i in Equation 14.9) be to the true value (β1 in Equation 14.8)? In other words, will β̂1^OmitX2 be an unbiased estimator of β1? This situation is common for observational data because we will almost always suspect that we have omitted something that matters.
The equation for β̂1^OmitX2 is the equation for a bivariate slope coefficient (see Equation 3.4 on page 72):

$\hat{\beta}_1^{OmitX_2} = \frac{\sum (X_{1i} - \bar{X}_1)(Y_i - \bar{Y})}{\sum (X_{1i} - \bar{X}_1)^2} \qquad (14.10)$
By substituting the true value of Yi into Equation 14.10 and rearranging we can answer this question. We know from Equation 14.8 that the true value of Yi is β0 + β1X1i + β2X2i + νi. Because the βs are fixed values, the average of Yi is β0 + β1X̄1 + β2X̄2 + ν̄. Substituting for Yi and Ȳ in Equation 14.10 and doing some re-arranging yields

$\hat{\beta}_1^{OmitX_2} = \frac{\sum (X_{1i} - \bar{X}_1)(\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \nu_i - \beta_0 - \beta_1 \bar{X}_1 - \beta_2 \bar{X}_2 - \bar{\nu})}{\sum (X_{1i} - \bar{X}_1)^2}$

$= \frac{\sum (X_{1i} - \bar{X}_1)\left(\beta_1 (X_{1i} - \bar{X}_1) + \beta_2 (X_{2i} - \bar{X}_2) + \nu_i - \bar{\nu}\right)}{\sum (X_{1i} - \bar{X}_1)^2}$

Gathering terms and using the fact that $\sum \beta_1 (X_{1i} - \bar{X}_1)^2 = \beta_1 \sum (X_{1i} - \bar{X}_1)^2$ yields

$\hat{\beta}_1^{OmitX_2} = \beta_1 \frac{\sum (X_{1i} - \bar{X}_1)^2}{\sum (X_{1i} - \bar{X}_1)^2} + \beta_2 \frac{\sum (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{\sum (X_{1i} - \bar{X}_1)^2} + \frac{\sum (X_{1i} - \bar{X}_1)(\nu_i - \bar{\nu})}{\sum (X_{1i} - \bar{X}_1)^2}$

We then take the expected value of both sides. Our assumption that ν is uncorrelated with X1 means that the expected value of $\sum (X_{1i} - \bar{X}_1)(\nu_i - \bar{\nu})$ is zero, which causes the last term to drop out:4

$E[\hat{\beta}_1^{OmitX_2}] = \beta_1 + \beta_2 \frac{\sum (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{\sum (X_{1i} - \bar{X}_1)^2} \qquad (14.11)$

meaning that the expected value of β̂1^OmitX2 is β1 plus β2 times a messy fraction. In other words, the estimate β̂1^OmitX2 will deviate, on average, from the true value, β1, by $\beta_2 \frac{\sum (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{\sum (X_{1i} - \bar{X}_1)^2}$.
4 The logic is similar to how we showed on page 717 that if X and ε are uncorrelated, then E[Σ Xi εi] = 0; in this case, (X1i − X̄1) is analogous to Xi in the earlier proof and (νi − ν̄) is analogous to εi in the earlier proof.
Note that $\frac{\sum (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{\sum (X_{1i} - \bar{X}_1)^2}$ is simply the equation for the estimate of δ̂1 from the following model:

$X_{2i} = \delta_0 + \delta_1 X_{1i} + \tau_i$

See, for example, page 72 and note that we have X2 and X̄2 where we had Y and Ȳ in the bivariate slope equation.
Therefore, we can conclude that our coefficient estimate β̂1^OmitX2 from the model that leaves out X2 will, on average, equal β1 + β2δ1. The bias term, β2δ̂1, can be zero in two ways. One way is for β2 to be zero: if the omitted variable has no effect on Y, omitting X2 does not cause our coefficient estimate to be biased. This is excellent news. If it were not true, we would have to include variables that had nothing to do with Y in our models.
The other way for β2δ̂1 to be zero is for δ̂1 to be zero, which happens whenever X1 explains none of the variation in the omitted independent variable. In short, if X1 and X2 are independent (such that regressing X2 on X1 yields a slope coefficient of zero), then even though we omitted X2 from the model, β̂1^OmitX2 will be an unbiased estimate of β1, the true effect of X1 on Y (from Equation 14.8). No harm, no foul.
The flip side of these conditions is that when we estimate a model that omits a variable that affects Y (meaning that β2 doesn't equal zero) and is correlated with the included variable, OLS will be biased. The extent of the bias depends on how much the omitted variable explains Y (which is determined by β2) and how much the omitted variable is correlated with the included variable.
What is the take-away here? Omitted variable bias is a problem if both conditions are met: (1) The omitted variable actually matters (β2 ≠ 0) and (2) X2 (the omitted variable) is correlated with X1 (the included variable). This shorthand is remarkably useful in evaluating OLS models.
Remember This
The conditions for omitted variable bias can be derived by substituting the true value of Y into the β̂1 equation for the model with X2 omitted.
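A simulation sketch (not from the book; the coefficient values and the slope helper are assumptions) verifying the omitted variable bias result: the average coefficient from regressing Y on X1 alone equals β1 + β2δ1, where δ1 is the slope from regressing X2 on X1.

```python
# Omitted variable bias: compare the average biased slope with beta_1 + beta_2 * delta_1.
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1, beta2 = 1.0, 2.0, 3.0
n, reps = 500, 2000

def slope(y, x):
    """Bivariate OLS slope: sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)."""
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

X1 = rng.normal(0, 1, size=n)
X2 = 0.5 * X1 + rng.normal(0, 1, size=n)   # X2 correlated with X1
delta1 = slope(X2, X1)

ests = np.empty(reps)
for r in range(reps):
    nu = rng.normal(0, 1, size=n)
    Y = beta0 + beta1 * X1 + beta2 * X2 + nu
    ests[r] = slope(Y, X1)                  # model that omits X2

print("average of beta_1_hat (omitting X2):", ests.mean())
print("beta_1 + beta_2 * delta_1:          ", beta1 + beta2 * delta1)
```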
It is fairly common that the omitted variable must remain omitted because we simply do not have a measure of it. In these situations, all is not lost. (A lot is lost, but not all.) We can use the concepts we have developed so far to work through the implication of omitting the variable in question. In this section we show how to anticipate the effects of omitting a variable.
Suppose, for example, that we are interested in the effect of education on income and estimate the model

$Income_i = \beta_0 + \beta_1 Education_i + \epsilon_i \qquad (14.12)$
where Incomei is the monthly salary or wages of individual i and Educationi is the number of years of schooling individual i completed. We are worried, as usual, that there are factors in the error term that are correlated with education. We worry, for example, that some people are more productive than others (a factor in the error term that affects income) and that productive folks are more likely to get more schooling (school may be easier for them). In other words, we fear the true equation is

$Income_i = \beta_0 + \beta_1 Education_i + \beta_2 Productivity_i + \epsilon_i$

where Productivityi taps the combination of intelligence, diligence, and maturity that leads person i to add a lot of value to his or her organization. Most data sets will not have a good measure of productivity.
ductivity will push our estimates of the effect of education higher or lower.5 Our omitted
variable bias results (such as Equation 14.11) indicate that the bias from omitting produc-
tivity depends on the effect that productivity has on the dependent variable (—2 ) and on the
In our example, we believe productivity boosts income (—2 > 0). We also believe that
there is a positive relationship between education and productivity. Hence, the bias will be
positive because it is —2 > 0 times the effect of the productivity on education. A positive
5Another option is to use panel data that allows us to control for certain unmeasured factors; we do that in Chapter
8. Or we can try to find exogenous variation in education (variation in education that is not due to differences in
productivity); that’s what we do in Chapter 9.
bias implies that omitting productivity induces a positive bias for education. In other words, the effect of education on income in a model that does not control for productivity will be overstated. The magnitude of the bias will be related to how strong these two components are. If we think productivity has a huge effect on income and is strongly related to education, we should expect the bias to be large.
In this example, this bias would lead us to be skeptical of a result from a model like Equation 14.12 that omits productivity. In particular, if we were to find that β̂1 is greater than zero, we would worry that the omitted variable bias has inflated the estimate. On the other hand, if the results showed that education did not matter or had a negative coefficient, we would be more confident in our results because the bias would, on average, have pushed the estimate upward.
This line of reasoning is called "signing the bias" and would lead us to treat the estimated effects based on Equation 14.12 as an upper bound on the likely effects of education on income.
Table 14.1 summarizes the relationship for the simple case of one omitted variable. If X2, the omitted variable, has a positive effect on Y (meaning β2 > 0) and X2 and X1 are positively correlated, then a model with only X1 will produce a coefficient on X1 that is biased upwards: The estimate will be too big because some of the effect of the unmeasured variable is attributed to X1.
Table 14.1

                              β2 (effect of omitted variable on Y)
Correlation of X1 and X2      > 0                       0          < 0
> 0                           Overstate coefficient     No bias    Understate coefficient
0                             No bias                   No bias    No bias
< 0                           Understate coefficient    No bias    Overstate coefficient

Cell entries show the sign of the bias for an omitted variable bias problem in which a single variable (X2) is omitted. The true equation is Equation 14.8 and the estimated model is Equation 14.9. If β2 > 0 and X1 and X2 are positively correlated, β̂1^OmitX2 (the coefficient on X1 from a model omitting X2) will, in expectation, overstate β1.
Remember This
We can use the equation for omitted variable bias to anticipate the effect of omitting
a variable on the coefficient estimate for an included variable.
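A sketch of "signing the bias" for the education/income example. The coefficient values below are hypothetical, chosen only so that productivity raises income (β2 > 0) and is positively related to education; omitting it should therefore inflate the estimated return to education.

```python
# Signing the bias: the naive education coefficient overstates the true effect.
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
productivity = rng.normal(0, 1, size=n)
education = 12 + 2 * productivity + rng.normal(0, 2, size=n)     # positive link
income = 500 + 100 * education + 300 * productivity + rng.normal(0, 200, size=n)

def slope(y, x):
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

print("true effect of education:       100")
print("estimate omitting productivity:", round(slope(income, education), 1))
```

The estimate comes out well above 100, consistent with treating the naive estimate as an upper bound.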
Discussion Questions
1. Suppose we are interested in knowing how much social media affect people’s in-
come. Suppose also that Facebook provided us data on how much each individual
spent on Facebook during work hours. The model is
Our omitted variable discussion in Section 5.2 was based on a case in which the true model had two variables and a single variable was omitted. In this section we show how things are more complicated when the true model has more than two independent variables.
Assuming that the error in the true model (ν) is not correlated with any of the independent variables, we can derive an expression (Equation 14.16) for the expected value of the coefficient on X1 when X3 is omitted, where r31 is the correlation of X3 and X1, r21 is the correlation of X2 and X1, r32 is the correlation of X3 and X2, and V3 and V1 are the variances of X3 and X1, respectively.
Clearly, there are more moving parts in this case than in the case we discussed earlier.
Equation 14.16 contains commonalities with the simpler omitted variable bias example we discussed in Section 5.2. The effect of the omitted variable in the true model looms large. Here β3 is the effect of the omitted variable X3 on Y and it plays a central role in the bias term. If β3 is zero, there is no omitted variable bias because the crazy fraction will be multiplied by zero and thereby disappear. As with the simpler omitted variable bias case, omitting a variable only causes bias if that variable actually affects Y.
The bias term has more factors, however. The r31 term is the correlation of the excluded variable (X3) and the first variable (X1). It is the first term in the numerator of the bias term, playing a similar role as the correlation of the excluded and included variables in the simpler model. The complication now is that the correlation of the two included variables (r21) and the correlation of the omitted variable and the other included variable (r32) also matter.
We can take away some simple principles. If the included independent variables are not correlated (which would mean that r21 = 0), then the equation simplifies to essentially what we were dealing with in the simple case. If the excluded variable is not correlated with the other included variable (r32 = 0), we again can go back to the intuition from the simple omitted variable bias model. If, however, both of these correlations are non-zero (and, to be practical, relatively large), then the simple case intuition may not travel well and we should tread carefully. We'll still be worried about omitted variable bias, but our ability to sign the bias will be limited.
Remember This
When there are multiple variables in the true equation, the effect of omitting one of
them depends in a complicated way on the interrelations of all variables.
1. As in the simpler model, if the omitted variable does not affect Y , then there is
no omitted variable bias.
2. The equation for omitted variable bias when the true equation has only two
variables often provides a reasonable approximation of the effects.
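A simulation sketch (hypothetical parameter values and correlations) for the case in which the true model has X1, X2, and X3 but we omit X3: the bias in the coefficient on X1 depends on β3 and on the correlations among all three variables.

```python
# Omitting X3 from a model that includes X1 and X2: bias depends on all the correlations.
import numpy as np

rng = np.random.default_rng(4)
n, reps = 2000, 1000
beta1, beta2, beta3 = 1.0, 1.0, 2.0

# Build correlated regressors so that r21, r31, and r32 are all nonzero
cov = np.array([[1.0, 0.4, 0.5],
                [0.4, 1.0, 0.3],
                [0.5, 0.3, 1.0]])
X = rng.multivariate_normal(np.zeros(3), cov, size=n)
X1, X2, X3 = X[:, 0], X[:, 1], X[:, 2]

b1_hats = np.empty(reps)
for r in range(reps):
    nu = rng.normal(0, 1, size=n)
    Y = beta1 * X1 + beta2 * X2 + beta3 * X3 + nu
    design = np.column_stack([np.ones(n), X1, X2])       # X3 omitted
    coefs, *_ = np.linalg.lstsq(design, Y, rcond=None)
    b1_hats[r] = coefs[1]

print("true beta_1:", beta1)
print("average estimate with X3 omitted:", b1_hats.mean())
```

Changing the correlations in the covariance matrix shows how the sign and size of the bias shift when the included variables are themselves correlated.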
We discussed measurement error in Section 5.3 of Chapter 5. Here we derive the equation for attenuation bias due to measurement error in an independent variable for the case where there is one independent variable. We also discuss implications of measurement error when there are multiple independent variables.
We start with a true model based on the actual value of the independent variable, which we denote X*1i:

$Y_i = \beta_0 + \beta_1 X^*_{1i} + \epsilon_i \qquad (14.17)$

What we observe, however, is X1i, which is the actual value plus measurement error νi:

$X_{1i} = X^*_{1i} + \nu_i \qquad (14.18)$

Substituting X*1i = X1i − νi into the true model yields

$Y_i = \beta_0 + \beta_1 (X_{1i} - \nu_i) + \epsilon_i = \beta_0 + \beta_1 X_{1i} - \beta_1 \nu_i + \epsilon_i \qquad (14.19)$

Let's treat ν as the omitted variable and −β1 as the coefficient on the omitted variable.
(Compare these to X2 and β2 in Equation 5.7 in Section 5.2.) Doing so allows us to write

$\beta_1^{OmitX_2} = \beta_1 - \beta_1 \frac{cov(X_1, \nu)}{var(X_1)} \qquad (14.20)$

where we use the covariance-based equation from page 91 to calculate δ1 in the standard omitted variable bias result. Because ν is a mean-zero error uncorrelated with X*1, the covariance of X1 and ν is σ²ν and the variance of X1 is σ²X*1 + σ²ν, so

$\beta_1^{OmitX_2} = \beta_1 - \beta_1 \frac{\sigma^2_\nu}{\sigma^2_{X^*_1} + \sigma^2_\nu} \qquad (14.21)$

In terms of probability limits (the value to which the estimate converges as the sample size grows), this is

$\text{plim } \hat{\beta}_1 = \beta_1 \left(1 - \frac{\sigma^2_\nu}{\sigma^2_\nu + \sigma^2_{X^*_1}}\right)$

Finally, we use the fact that $1 - \frac{\sigma^2_\nu}{\sigma^2_\nu + \sigma^2_{X^*_1}} = \frac{\sigma^2_{X^*_1}}{\sigma^2_\nu + \sigma^2_{X^*_1}}$ to produce

$\text{plim } \hat{\beta}_1 = \beta_1 \frac{\sigma^2_{X^*_1}}{\sigma^2_\nu + \sigma^2_{X^*_1}}$
We have so far dealt with a bivariate regression with a single, poorly measured independent variable for which the error is a mean-zero random variable uncorrelated with anything else. If we have multiple independent variables and a single badly measured variable, it is still the case that the coefficient on the poorly measured independent variable will suffer from attenuation bias. The other coefficients will also suffer, although in a way that is harder to characterize in general.
Remember This
1. We can derive the effect of a poorly measured independent variable using omitted
variable logic.
2. A single poorly measured independent variable can cause other coefficients to be
biased.
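A sketch (not from the book; the variances below are hypothetical) of attenuation bias: the observed regressor is X1 = X*1 + ν, and the estimated slope shrinks toward zero by the factor σ²X* / (σ²X* + σ²ν).

```python
# Attenuation bias from classical measurement error in the regressor.
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
beta0, beta1 = 1.0, 2.0
sigma_xstar, sigma_nu = 1.0, 0.7

x_star = rng.normal(0, sigma_xstar, size=n)
eps = rng.normal(0, 1, size=n)
Y = beta0 + beta1 * x_star + eps
X1 = x_star + rng.normal(0, sigma_nu, size=n)     # measured with error

slope = np.sum((X1 - X1.mean()) * (Y - Y.mean())) / np.sum((X1 - X1.mean()) ** 2)
attenuation = sigma_xstar**2 / (sigma_xstar**2 + sigma_nu**2)

print("estimated slope:                ", round(slope, 3))
print("beta_1 times attenuation factor:", round(beta1 * attenuation, 3))
```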
14.7 Conclusion
OLS goes a long way with just a few assumptions about the model and the error terms. How important is it to be able to know exactly how these assumptions come together to provide all this good stuff? On a practical level, not very. We can go about most of our day-to-day statistical work without reproducing these derivations.
On a deeper level, though, it is useful to know how the assumptions matter. The statistical properties of OLS are not magic. They're not even that hard, once we break the derivations down step-by-step. The assumptions we rely on play specific roles in figuring out the properties of our estimates, as we have seen in the derivations in this chapter. We also formalized our understanding of omitted variable bias, helping us know the conditions under which omitting a variable causes bias.
We don't need to be able to produce all the derivations from scratch. We will have a good mastery of the material when we can answer the following questions:
• Section 14.1: Explain the steps in deriving the equation for the OLS estimate of β̂1.
• Section 14.2: What assumptions are crucial to derive the standard equation for the variance of β̂1?
• Section 14.3: Show how to derive the omitted variable bias equation.
• Section 14.4: Show how to use the omitted variable bias equation to “sign the bias.”
• Section 14.5: Explain how omitted variable bias works when there are multiple variables in the true model.
• Section 14.6: Show how to use omitted variable bias tools to characterize the effect of
measurement error.
Further Reading
See Clarke (2005) for further details on omitted variables. Greene (2003, 148) offers a textbook treatment.
Greene (2003, 86) discusses the implications of measurement error when there are multiple independent variables in the model. Cragg (1994) provides an accessible overview of problems caused by measurement error.
CHAPTER 15

Advanced Panel Data
In Chapter 8 we used fixed effects in panel data models to control for unmeasured factors that are fixed within units. We did so by including dummy variables for the units or by re-scaling the data. We can also control for many time factors by including fixed effects for time periods.
The models get more complicated when we start thinking about more elaborate dependence across time. We face a major choice of whether we want to treat serial dependence in terms of serially correlated errors or in terms of dynamic models in which the value of Yt depends directly on the value of Y in the previous period. These two approaches lead to different models and different estimation issues.
In this chapter, we introduce these approaches and discuss how they connect to the panel
data analysis we covered in Chapter 8. Section 15.1 shows how to deal with autocorrelation in panel data models. Section 15.2 introduces dynamic models for panel data analysis. Section 15.3 presents an alternative to fixed effects models called random effects models. Random effects models treat unit-specific error as something that complicates standard error calculations but does not cause bias. They're not as useful as fixed effects models, but it is worth knowing when they can and cannot be used.
In panel data, it would make sense to worry about autocorrelation for the same reasons it
would make sense to worry about autocorrelation in time series data. Remember all the
stuff in the error term? Lots of that will stick around for a while. Unmeasured factors in
year 1 may linger to affect what is going on in year 2 and so on. In this section we explain
how to deal with autocorrelation in panel models, first without fixed effects and then with
fixed effects.
Before we get into diagnosing and addressing the problem, we need to remind ourselves of the stakes: Autocorrelation does not cause bias in the standard OLS framework, but it does cause OLS estimates of standard errors to be incorrect. Often, it causes the OLS estimates of standard errors to be too small because we don't really have the number of independent observations that OLS assumes we have.
Consider a panel model in which the errors follow a first-order autoregressive process:

$Y_{it} = \beta_0 + \beta_1 X_{1it} + \beta_2 X_{2it} + \epsilon_{it}, \qquad \epsilon_{it} = \rho\epsilon_{i,t-1} + \nu_{it}$

where νit (the Greek letter nu) is a mean-zero, random error term that is not correlated with the independent variables. There are N units and T time periods in the panel data set. We limit ourselves to first-order autocorrelation (the error this period is a function of the error last period). The tools we discuss generalize pretty easily to higher orders of autocorrelation.1
Estimation is relatively simple. First, we estimate the model using standard OLS. We then use the residuals from the OLS model to test for signs of autocorrelated errors. We can do so because OLS β̂ estimates are unbiased even if errors are autocorrelated, which means the residuals are reasonable estimates of the errors.
We test for autocorrelated errors in this context using something called a Lagrange Multiplier (LM) test. The LM test is similar to our test for autocorrelation in Chapter 13 on time series data. We regress the residuals on their lags and the independent variables:

$\hat{\epsilon}_{it} = \rho\hat{\epsilon}_{i,t-1} + \gamma_1 X_{1it} + \gamma_2 X_{2it} + \eta_{it}$
where ηit (the Greek letter eta) is a mean-zero, random error term. We use the fact that N × R² from this auxiliary regression is distributed χ²₁ (the Greek letter chi, pronounced "kai") under the null hypothesis of no autocorrelation.
If the LM test indicates that there is autocorrelation, we will estimate an AR(1) model by ρ-transforming the data as discussed in Chapter 13.
To test for autocorrelation in a panel data model that has fixed effects we must deal with a slight wrinkle. The fixed effects induce correlation in the de-meaned errors even when there is no correlation in the actual errors. The error term in the de-meaned model is (εit − ε̄i·), which means that the de-meaned error for unit i will include the mean of the error terms for unit i (ε̄i·), which in turn means that 1/T of any given error term will appear in all of unit i's de-meaned error terms. So, for example, εi1 (the raw error in the first period) is in the first de-meaned error term, the second de-meaned error term, and so on via the ε̄i· term. The result will be at least a little autocorrelation because the de-meaned error terms in the first and second periods, for example, will move together at least a little bit because they both have some of the same terms.
To test for AR(1) errors, run a model with residuals from the fixed effects model: ε̂it = ρε̂i,t−1 + γ1X1it + γ2X2it + ηit.
Remember This
To estimate panel models that account for autocorrelated errors, proceed in the following steps:
1. Estimate an initial model that does not address autocorrelation. This model can be either an OLS model or a fixed effects model.
2. Use residuals from the initial model to test for autocorrelation using a Lagrange Multiplier test that is based on the R² from the following model:

$\hat{\epsilon}_{it} = \rho\hat{\epsilon}_{i,t-1} + \gamma_1 X_{1it} + \gamma_2 X_{2it} + \eta_{it}$

If the model includes fixed effects, the coefficient and residual estimates are biased, although the bias decreases as T increases.
3. If we reject the null hypothesis of no autocorrelation (which will happen when the R² in the above equation is high), then we should remove the autocorrelation by ρ-transforming the data as discussed in Chapter 13.
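A rough sketch of the LM test described above (this is not the book's code; the function name, the use of the count of usable observations for N × R², and the simulated two-step example are assumptions): regress the residuals on their within-unit lag and the X variables, then compare N × R² with the 5% critical value of a χ² distribution with one degree of freedom (about 3.84).

```python
# Lagrange Multiplier test for AR(1) errors in panel data (sketch).
import numpy as np

def lm_test_ar1(resid, X, unit_ids):
    """resid: residuals from the initial model; X: regressor matrix; unit_ids: unit for each row."""
    # Build lagged residuals within each unit (first period of each unit is dropped)
    lag = np.full(resid.shape, np.nan)
    for u in np.unique(unit_ids):
        idx = np.where(unit_ids == u)[0]
        lag[idx[1:]] = resid[idx[:-1]]
    keep = ~np.isnan(lag)
    y = resid[keep]
    Z = np.column_stack([lag[keep], X[keep]])
    coefs, *_ = np.linalg.lstsq(Z, y, rcond=None)
    fitted = Z @ coefs
    r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
    stat = len(y) * r2                 # number of usable observations times R^2
    return stat, stat > 3.84           # 3.84 is the 5% critical value for chi-squared(1)

# Tiny demonstration with simulated AR(1) errors in a 50-unit, 20-period panel
rng = np.random.default_rng(6)
units = np.repeat(np.arange(50), 20)
x = rng.normal(size=units.size)
e = np.zeros(units.size)
for u in np.unique(units):
    idx = np.where(units == u)[0]
    for j, i in enumerate(idx):
        e[i] = rng.normal() if j == 0 else 0.6 * e[idx[j - 1]] + rng.normal()
y = 1 + 2 * x + e
X = np.column_stack([np.ones(units.size), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(lm_test_ar1(y - X @ b, X, units))    # large statistic -> reject no autocorrelation
```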
We can also model temporal dependence with the dynamic models we discussed in Section 13.4. In these models the current value of Yit could depend directly on Yi,t−1, the value of Y for unit i in the previous period.
These models are sneakily complex. They seem easy because they simply require us to include a lagged dependent variable in an OLS model. They actually have many knotty aspects that differ from standard OLS models. In this section we discuss dynamic models for panel data, first without fixed effects and then with fixed effects.
A dynamic model for panel data includes the lagged dependent variable on the right-hand side:

$Y_{it} = \beta_0 + \gamma Y_{i,t-1} + \beta_1 X_{1it} + \beta_2 X_{2it} + \epsilon_{it}$

where γ (the Greek letter gamma) is the effect of the lagged dependent variable, the βs are the immediate effects of the independent variables, and εit is uncorrelated with the independent variables.
We see how tricky this model is once we try to characterize the effect of X1it on Yit. Obviously, if X1it increases by one unit, there will be a β1 increase in Yit that period. Notice, though, that an increase in Yit in one period affects Yit in future periods via the γYi,t−1 term in the model. Hence increasing X1it in the first period, for example, will affect the value of Yit in the first period, which will then affect Y in the next period. In other words, if we change X1it we get not only β1 more Yit but we get γ × β1 more Y in the next period and so on. In other words, a change in X1it today dribbles on to affect Y forever through the lagged dependent variable.
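One way to summarize this dribbling-on effect (a sketch added here, not from the text) is the cumulative long-run impact of a one-unit change in X1it, assuming |γ| < 1 so the geometric series converges:

```latex
% Cumulative effect of a one-unit change in X_{1it}: the immediate effect plus the
% effects that persist through the lagged dependent variable (assumes |\gamma| < 1).
\beta_1 + \gamma\beta_1 + \gamma^2\beta_1 + \cdots
  \;=\; \beta_1\sum_{s=0}^{\infty}\gamma^{s}
  \;=\; \frac{\beta_1}{1-\gamma}
```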
The coefficient on the lagged dependent variable deserves special attention. On the one hand, it is typically highly significant, which is good news if we have a control variable that soaks up variance unexplained by other variables. On the other hand, the lagged dependent variable can be too good – so highly significant that it sucks the significance out of the other independent variables. In fact, if there is serial autocorrelation and trending
in the independent variable, including a lagged dependent variable causes bias. In such a case, Princeton political scientist Chris Achen (2000) has noted that the lagged dependent variable

does not conduct itself like a decent, well-behaved proxy. Instead it is a kleptomaniac, picking up the effect, not only of excluded variables, but also of the included variables if they are sufficiently trended. As a result, the impact of the included

This conclusion does not mean that lagged dependent variables are evil, but rather that we should tread carefully when including them. In particular, we should estimate models with and without them. If results differ substantially, we should be sure to go through all the tests and logic described below when deciding whether to place more weight on the model with the lagged dependent variable.
The good news is that if the errors are not autocorrelated, using OLS for a model with lagged dependent variables works fine. Given that the lagged dependent variable commonly soaks up any serial dependence in the data, this approach is reasonable and widely used.2
If the errors are autocorrelated, however, OLS will produce biased estimates of β̂ when a lagged dependent variable is included. In this case, autocorrelation does more than render standard errors incorrect; the correlation between the error term and the lagged dependent variable actually messes up the estimates. This bias is worth mulling
2 See Beck and Katz (2011).
over a bit. It happens because models with lagged dependent variables are outside of the conventional OLS framework. Hence even though autocorrelation does not cause bias in standard OLS models, it does here.
Why does autocorrelation cause bias in a model when we include a lagged dependent variable? It's pretty easy to see: Yi,t−1 of course contains εi,t−1 and if εi,t−1 is correlated with εit – which is exactly what first-order autocorrelation implies – then one of the independent variables (the lagged dependent variable) is correlated with the error term.
This problem is not particularly hard to deal with. Suppose there is no autocorrelation. In that case, OLS estimates are unbiased, meaning that the residuals from the OLS model are consistent too. We can therefore use these residuals in a Lagrange Multiplier test like the one we described earlier on page 739. If we fail to reject the null hypothesis (which is quite common, because lagged dependent variables often zap autocorrelation), then OLS it is. If we reject the null hypothesis of no autocorrelation, then we can use an AR(1) model like the one discussed in Chapter 13 to rid the data of autocorrelation and thereby get us unbiased estimates.
The lagged dependent variable often captures the unit-specific variance that fixed effects capture. Hence it is not uncommon to see lagged dependent variables used in place of fixed effects. Sometimes we may want both in our model, so we therefore move on to consider models that include a lagged dependent variable and fixed effects.
Beware! Things get complicated when we include a lagged dependent variable and fixed effects in the same model. OLS is biased in this situation. Bummer. Recall from Section 8.2 that fixed effects models are equivalent to de-meaned estimates. That means a fixed effects model with a lagged dependent variable will include a variable (Yi,t−1 − Ȳi,t−1). The Ȳi,t−1 part of this variable is the average of the lagged dependent variable over all periods. This average will therefore include the value of Yit which, in turn, contains εit. Hence, the de-meaned lagged dependent variable will be correlated with εit. The extent of this bias depends on the magnitude of this correlation, which is proportional to 1/T, where T is the length of the time series for each observation (often the number of years of data). For a small panel with just 2 or 3 periods, the bias can be serious. For a panel with 20 or more periods, the problem is less serious. One piece of good news here is that the bias in a model with a lagged dependent variable and fixed effects is worse for the coefficient on the lagged dependent variable; simulation studies indicate that bias is modest for coefficients on the Xit variables, the variables we usually care about.
Two ways to estimate dynamic panel data models with fixed effects
What to do? One option is to follow instrumental variable (IV) logic. We cover instrumental variables in Chapter 9. In this context the IV approach relies on finding some variable that is correlated with the independent variable in question and not correlated with the error. Most IV approaches rely on using lagged values of the independent variables, which are typically correlated with the independent variable in question but not correlated with the error, because the error is something that happens later. The Arellano and Bond (1991) approach, for example, uses all available lags as instruments. These models are quite complicated.
Another option is to use OLS, accepting some bias in exchange for better accuracy and less complexity. While we have talked a lot about bias, we have not yet discussed the
trade-off between bias and accuracy, largely because in basic models such as OLS, unbiased
models are also the most accurate so we don’t have to worry about the trade-off. But in
more complicated models, it is possible to have an estimator that produces coefficients that
are biased but still pretty close to the true value. It is also possible to have an estimator
that is unbiased, but very imprecise. IV estimators are in the latter category – they are, on
average, going to get us the true value, but they have higher variance.
Here’s a goofy example of the trade-off between bias and accuracy. Consider two esti-
mators of average height in the United States. The first is the height of a single person
randomly sampled. This estimator is unbiased – after all, the average of this estimator will
have to be the average of the whole population. But clearly this estimator isn't very precise.
The second estimator of average height in the United States is the average height of 500 randomly selected people, but measured with a measuring stick that is inaccurate by a quarter of an inch (making every measurement a quarter inch too big).3 Which estimate of average height would we rather have? The second one may well make up what it loses in bias by being more precise. That's the situation here because the OLS estimate is biased, but relatively precise.
Neal Beck and Jonathan Katz (2011) have run a series of simulations of several options for estimating models with lagged dependent variables and fixed effects. They find that OLS performs better in terms of actually being more likely to produce estimates close to the true value than the IV approach, even though OLS estimates are a bit biased. The performance of OLS improves as T increases.
H.L. Mencken said that for every problem there is a solution that is simple, neat, and wrong. OLS in this context is certainly simple and neat. And, yet, it is wrong in the sense of being biased when we have a lagged dependent variable and fixed effects. But OLS is more accurate (meaning the variance of β̂1 is smaller) than the alternatives.
3 Yes, yes, we could subtract the quarter of the inch from all the height measurements. Work with me here. We’re
trying to make a point!
Remember This
1. Researchers often include lagged dependent variables to account for serial depen-
dence. A model with a lagged dependent variable is called a dynamic model.
(a) Dynamic models differ from conventional OLS models in many respects.
(b) In a dynamic model, a change in X has an immediate effect on Y , but also
has an ongoing effect on future Y s because any change in Y associated with a
change in X will affect future values of Y via the lagged dependent variable.
(c) If there are no fixed effects in the model and there is no autocorrelation, then
using OLS for a model with a lagged dependent variable produces unbiased
coefficient estimates.
(d) If there are no fixed effects in the model and there is autocorrelation, the
autocorrelation must be purged from the data in order to generate unbiased
estimates.
2. OLS estimates from models with both a lagged dependent variable and fixed
effects are biased.
(a) One alternative to OLS is to use an instrumental variables approach. This
approach produces unbiased estimates, but is complicated and produces im-
precise estimates.
(b) OLS is useful to estimate a model with a lagged dependent variable and fixed
effects.
• The bias is not severe and decreases as T , the number of observations for
each unit, increases.
• OLS in this context produces relatively precise parameter estimates.
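A simulation sketch (hypothetical parameter values; not the book's code) of the bias from combining a lagged dependent variable with fixed effects: the within (de-meaned) OLS estimate of γ is too small, the bias shrinks as T grows, and the coefficient on X is much less affected.

```python
# Bias of the within estimator in a dynamic panel model with fixed effects.
import numpy as np

rng = np.random.default_rng(7)
gamma, beta = 0.5, 1.0

def within_estimates(N, T, reps=300):
    out = np.empty((reps, 2))
    for r in range(reps):
        alpha = rng.normal(0, 1, size=N)
        x = rng.normal(0, 1, size=(N, T))
        y = np.zeros((N, T))
        y[:, 0] = alpha + beta * x[:, 0] + rng.normal(0, 1, size=N)
        for t in range(1, T):
            y[:, t] = gamma * y[:, t - 1] + beta * x[:, t] + alpha + rng.normal(0, 1, size=N)
        # de-mean within units over the usable periods (those with a lag available)
        ylag, ycur, xcur = y[:, :-1], y[:, 1:], x[:, 1:]
        dm = lambda a: a - a.mean(axis=1, keepdims=True)
        Z = np.column_stack([dm(ylag).ravel(), dm(xcur).ravel()])
        coefs, *_ = np.linalg.lstsq(Z, dm(ycur).ravel(), rcond=None)
        out[r] = coefs
    return out.mean(axis=0)

for T in (5, 20, 50):
    g_hat, b_hat = within_estimates(N=200, T=T)
    print(f"T={T:2d}: average gamma_hat={g_hat:.3f} (true {gamma}), beta_hat={b_hat:.3f} (true {beta})")
```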
The term "fixed effects" is used to distinguish these models from "random effects" models. In this section we present an overview of random effects models and discuss when they can be used.
In a random effects model, the unit-specific error term is itself considered a random variable. Rather than estimating a separate fixed effect for each unit,
random effects models leave the αi s in the error term and account for them when calculating standard errors. We won't cover the calculations here other than to note that they can get a bit tricky.
An advantage of random effects models is that they estimate coefficients on variables that do not vary within unit (the kind of variables that get dropped in fixed effects models). This possibility contrasts with fixed effects models, which cannot estimate coefficients on variables that do not vary within units.
The disadvantage of random effects models is that the random effects estimates are unbiased only if the random effects (the αi) are uncorrelated with the X. The core challenge in OLS (which we discussed at length earlier) is that the error term is correlated with the independent variable; this problem continues with random effects models, which address correlation of errors across observations, but not correlation of errors and independent variables. Hence, random effects models fail to take advantage of a major attraction of panel data, which is that we can deal with the possible correlation of the unit-specific effects with the independent variables.
A statistical test called a Hausman test tests random effects against fixed effects models. Once we understand this test, we can see why the bang-to-buck payoff for random effects models is generally pretty low. In a Hausman test we estimate both a fixed effects model and a random effects model using the same data. Under the null hypothesis that the αi are uncorrelated
with the X, the estimates should be similar. Under the alternative, the estimates should be different because the random effects estimates should be corrupted by the correlation of the αi with the X.
The decision rule for a Hausman test is the following: If fixed effects and random effects give us pretty much the same answer, we fail to reject the null hypothesis and can use random effects. If the two approaches provide different answers, we reject the null and should use fixed effects. Ultimately, we believe either the fixed effects estimate (when we reject the null hypothesis of no correlation between αi and Xi) or pretty much the fixed effects answer (when we fail to reject the null hypothesis of no correlation between αi and Xi).4
If used appropriately, random effects have some advantages. When the αi are uncorrelated with the Xi, random effects models will generally produce smaller standard errors on coefficients than fixed effects models. In addition, as T gets large the differences between fixed and random effects decline; in practice, however, the differences can be substantial for the panel lengths we typically see.
Remember This
Random effects models do not estimate fixed effects for each unit, but rather adjust standard errors and estimates to account for unit-specific elements of the error term.
1. Random effects models produce unbiased estimates of β̂1 only when the αi are uncorrelated with the X variables.
2. Fixed effects models work whether the αi are uncorrelated with the X variables or not, making fixed effects a more generally useful approach.
4 For more details on the Hausman test, see Wooldridge (2002, 288).
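A simulation sketch (not the book's code; pooled OLS is used here only as a simple stand-in for an estimator that, like random effects, leaves the αi in the error term) of the key contrast in this section: when the unit effects are correlated with X, leaving them in the error term produces bias, while the fixed effects (within) estimator does not.

```python
# Unit effects correlated with X: pooled/RE-style estimate biased, within estimate not.
import numpy as np

rng = np.random.default_rng(8)
N, T, beta = 300, 10, 1.0

alpha = rng.normal(0, 1, size=N)
X = 0.8 * alpha[:, None] + rng.normal(0, 1, size=(N, T))   # X correlated with alpha_i
Y = beta * X + alpha[:, None] + rng.normal(0, 1, size=(N, T))

# Estimator that leaves alpha_i in the error term (pooled bivariate slope)
x, y = X.ravel(), Y.ravel()
pooled = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Fixed effects (within) estimator: de-mean by unit
Xd = X - X.mean(axis=1, keepdims=True)
Yd = Y - Y.mean(axis=1, keepdims=True)
within = np.sum(Xd * Yd) / np.sum(Xd ** 2)

print("true beta:                      ", beta)
print("pooled estimate:                ", round(pooled, 3))
print("fixed effects (within) estimate:", round(within, 3))
```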
15.4 Conclusion
Serial dependence in panel data models is an important and complicated challenge. There are two major approaches to dealing with it. One is to treat the serial dependence as autocorrelated errors. In this case we can test for autocorrelation and, if necessary, purge the autocorrelation from the data.
The other approach is to estimate a dynamic model that includes a lagged dependent variable. Dynamic models are quite different from standard OLS models. Among other things, a change in X in one period affects Y in every subsequent period.
Our approach to estimating a model with a lagged dependent variable depends on whether there is autocorrelation and whether we included fixed effects or not. If there is no autocorrelation and we do not include fixed effects, the model is easy to estimate via OLS and the estimates are unbiased.
If we include fixed effects in a model with a lagged dependent variable, OLS will produce biased results. However, scholars have found that the bias is relatively small and that OLS is often more accurate than the alternatives.
We will have a good mastery of the material when we can answer the following questions:
• Section 15.1: How do we diagnose and correct for autocorrelation in panel data models?
• Section 15.2: What are the consequences of including lagged dependent variables in models with and without fixed effects? Under what conditions is it reasonable to use lagged dependent variables?
• Section 15.3: What are random effects models? When are they appropriate?
Further Reading
There is a large and complicated literature on accounting for time dependence in panel data models. Beck and Katz (2011) is an excellent guide. Among other things, they discuss how to conduct an LM test for AR(1) errors in a model without fixed effects, the bias in models with autocorrelation and lagged dependent variables, and the bias of fixed effects models with lagged dependent variables.
There are many other excellent resources. Wooldridge (2002) is a valuable reference for more advanced issues in analysis of panel data. Achen (2000) is an important article, pushing for caution in use of lagged dependent variables. Wawro (2002) provides a nice overview of dynamic panel data models.
Another approach to dealing with bias in dynamic models with fixed effects is to correct for bias directly as suggested by Kiviet (1995). This procedure works reasonably well in many settings.
Key Terms
• Random effects models (748)
CHAPTER 16

Conclusion: How to Be a Statistical Realist
mously wrote
for an indefinite time: the only check on it is that sooner or later a false belief
The goal of statistics is to provide a less violent empirical battlefield where theories bump up against evidence. Doing statistical work can nonetheless put us on an emotional roller coaster. We careen from elation after figuring out a new double-tongue-twister statistical model to depression when multiple seemingly valid statistical analyses point in different directions.
Some deal with the situation by fetishizing technical complexity. They pick the most
complicated statistical approach possible and treat the results as the truth. If others don’t
understand the analysis, it is because their puny brains cannot keep up with the mathe-
matical geniuses in the computer lab. Their overconfidence is annoying and intellectually
dangerous.
Others deal with the situation by becoming statistical skeptics. For them, statistics provide no answers. They avoid statistics or, worse, they manipulate them. Their nihilism, too, is intellectually dangerous.
What are we to do? It might seem that avoiding statistics may limit harm. Statistics are a bit like a chainsaw: If used recklessly, the damage can be terrible. So maybe it's best to stay away altogether.
The problem with this approach is that there really is no alternative to statistics. As
baseball analyst Bill James says, the alternative to statistics is not “no statistics.” The
alternative to statistics is bad statistics. Anyone who makes any empirical argument about
the world is making a statistical argument. It might be based on vague data that is not
systematically analyzed, but that’s what people who judge from experience or intuition
are doing. Hence, despite the inability of statistics to answer all questions or to be above
manipulation, a serious effort to understand the world will involve some statistical reasoning.
A better approach is realism about statistics. In the right hands, chainsaws are awesome.
If we learn how to use the tool properly, what it can and can’t do, we can make a lot of
progress.
Any statistical model of the world must simplify. And if we're going to simplify the world, let's do it usefully. Statistician George Box put it this way:

Since all models are wrong the scientist must be alert to what is importantly wrong.

The tiger abroad is almost always endogeneity. So we must prioritize fighting this tiger, using our core statistical tool kit of experiments, OLS, fixed effects models, instrumental variables, and regression discontinuity. There will be many challenges in any statistical project, but we must not let them distract us from focusing on the fight against endogeneity.
The second characteristic of a statistical realist is that he or she values robustness. Serious analysts do not believe assertions based on a single significant coefficient in a single statistical specification. For even well-designed studies with good data, we worry that the results could depend on a very specific model specification. A statistical realist will show that the results are robust by assessing a reasonable range of specifications, perhaps with and without certain control variables.
Third, a statistical realist adheres to the replication standard. Others must see our work and be able to re-create, modify, correct, and build off our analysis. Results cannot be scientifically credible otherwise. Replications can be direct, whereby they do exactly the same procedures on the same data. Or they can be indirect, where a similar research design is applied to new data.
Fourth, a statistical realist is wary of complexity. Sometimes complex models are inevitable. However, just because a model is more complicated does not necessarily make it more likely to be true. It is more likely to have mistakes. Sometimes complexity becomes a shield behind which analysts hide, intentionally or not, moving their conclusions effectively beyond scrutiny.
Remember, statistical analysis is hard, but not because of the math. Statistics is hard because the world is a complicated place. If anything, the math makes things easier by providing tools to simplify the world. A certain amount of jargon among specialists in the field is inevitable and helps experts communicate efficiently. If a result only holds underneath
layers of impenetrable math, however, be wary. Check your wallet. Count your silverware.
Investor Peter Lynch often remarked that he wouldn’t invest in any business idea that
couldn’t be illustrated with a crayon. If the story isn’t simple, it’s probably wrong. This
attitude is useful for statistical analysts as well. There will almost certainly have to be
background work that is not broadly accessible, but to be most persuasive the results should
include a figure or story that simply summarizes the basis for the finding. Perhaps we’ll
have to use a sharp crayon, but if we can’t explain our results with a crayon we should keep
working.
Fifth, a statistical realist thinks holistically. We should step back from any given statistical result and consider the totality of the evidence. The following indicators of causality provide a useful framework. None is necessary; none is sufficient. Taken together, though, the more these conditions are satisfied, the more confident we can be that a given causal claim is true.
• Strength: This is the simplest criterion. Is there a strong relationship between the independent and dependent variables?
– A strong observed relationship is less likely due simply to random chance. Even a process with no true relationship can lead to the occasional "significant" result. The random noise producing such a result is more likely to produce a weak, rather than a strong, observed relationship. A very strong relationship is highly unlikely to simply be the result of random noise.
– It is also more plausible that a strong result due only to endogeneity will be due to some relatively obvious source of endogeneity.
– Explain things that matter. Our goal is not simply to intone the words "statistically significant."
• Consistency: Is the relationship observed across multiple contexts?
– All too often, a given theoretical claim is tested with the very data that suggested
the result. That’s not much to go on; a random or spurious relationship in one data
set does not a full-blown theory make. Hence we should be cautious about claims
until they are observed across multiple contexts. In that case, it is less likely that
the result is due to chance or to an analyst leaning on the data to get a result he
or she wanted.
– If results are not observed across multiple contexts, are there contextual differences?
Perhaps the real finding is explaining why a relationship exists in one context and
not others.
– Or, if other results are different, can we explain why the other results are wrong? It is emphatically not the case that we should interpret two competing statistical results as a draw. One result could be based on a mistake. If that's the case, explain why (nicely, of course). If we can't explain why one approach is better, though, then we are left with conflicting results and we need to be cautious about what we conclude.
• Specificity: Are the patterns in the data consistent with the specific claim? Each theory should be mined for as many specific claims as possible, not only about direct effects, but also about indirect effects and mechanisms. As importantly, the theory should be mined for claims about when we won't see the relationship. This line of thinking allows us to conduct placebo tests in which we should see null results. In other words, the theory should tell us not only where we should see effects but also where we should not.
• Plausibility: Given what we know about the world, does the result make sense? Sometimes results are implausible on their face: If someone found that eating french fries led to weight loss, we should probably ask some probing questions before Supersizing. That doesn't mean we should treat implausible results as wrong. After all, the idea that the earth revolves around the sun was pretty implausible at first. Implausible results just require a higher standard of evidence before we believe them.
These criteria are not as cut and dried as looking at confidence intervals or hypothesis
tests. They are more important because they determine not “statistical significance” but
what we conclude about empirical relationships. They should never be far from the mind of
a statistical realist who wants to use data to learn about how the world really works.
So we have done a lot in this book. We've covered a vast array of statistical tools. We've just now described a productive mindset, that of a statistical realist. There is one more element: creativity. Think of statistics as the grammar for good analysis. It is not the story. No one reads a book and says "Great grammar!" A terrible book might have bad grammar, but a good book needs more than good grammar. The material we covered in this book provides the grammar for making convincing claims about the way the world works. The story is up to us.
Achen (1982) is an 80 page paean to statistical realism. As he puts it, “The uninitiated
are often tempted to trust every statistical study or none. It is the task of empirical social
scientists to be wiser.” Achen followed this publication with a 2002 article arguing for keeping
models simple.
The criteria for evaluating research discussed here are strongly influenced by the Bradford-Hill criteria from Bradford-Hill (1965). Nevin (2013) assesses the Bradford-Hill criteria for the theory that lead in gasoline was responsible for the crime surge in the United States in the second half of the twentieth century.
ACKNOWLEDGEMENTS
This book has benefited from close reading and probing questions from a large number
of people, including students at the McCourt School of Public Policy at Georgetown, and
my current and former colleagues and students at Georgetown University, including Shirley
Adelstein, Rachel Blum, Ian Gale, Ariya Hagh, Carolyn Hill, Mark Hines, Dan Hopkins,
Jeremy Horowitz, Huade Huo, Wes Joe, Karin Kitchens, Jon Ladd, Jens Ludwig, Paul
Musgrave, Sheeva Nesva, Hans Noel, Ji Yeon Park, Betsy Pearl, Lindsay Pettingill, Barbara
Credit (and/or blame) for the Simpson’s figure goes to Paul Musgrave.
Participants at a seminar on the book at the University of Maryland gave excellent early
feedback, especially Antoine Banks, Brandon Bartels, Kanisha Bond, Ernesto Calvo, Sarah
Croco, Michael Hanmer, Danny Hayes, Eric Lawrence, Irwin Morris, and John Sides.
In addition, colleagues across the country have been incredibly helpful, especially Allison
Carnegie, Daniel Henderson, Luke Keele, David Peterson, Wendy Tam-Cho, Craig Volden
and Chris Way. Anonymous reviewers for Oxford University Press provided supportive yet
I also appreciate the generosity of colleagues who shared data, including Bill Clark, Anna
Appendices
MATH AND PROBABILITY BACKGROUND
A Summation
• $\sum_{i=1}^{N} X_i = X_1 + X_2 + X_3 + ... + X_N$
• If a variable in the summation does not have a subscript, it can be "pulled out" of the summation. For example,

$\sum_{i=1}^{N} \beta X_i = \beta X_1 + \beta X_2 + ... + \beta X_N = \beta(X_1 + X_2 + X_3 + ... + X_N) = \beta \sum_{i=1}^{N} X_i$

• If a variable in the summation has a subscript, it cannot be "pulled out" of the summation. For example $\sum_{i=1}^{N} X_i Y_i = X_1 Y_1 + X_2 Y_2 + X_3 Y_3 + ... + X_N Y_N$ cannot as a general matter be simplified.
• As a general matter a non-linear function in a sum is not the same as the non-linear function of the sum. For example, as a general matter $\sum_{i=1}^{N} X_i^2$ will not equal $\left(\sum_{i=1}^{N} X_i\right)^2$.
B Expectation
• Expectation is the value we expect a random variable to be. The expectation is basically the average of the random variable if we could sample from the variable's distribution a huge number of times.
• For example, the expected value of a six-sided die is 3.5. If we roll a die a huge number of times, we'd expect each side to come up an equal proportion of times, so the expected average will equal the average of 1, 2, 3, 4, 5, and 6. More formally, the expected value will be $\sum_{i=1}^{6} p(X_i) X_i$ where X is 1, 2, 3, 4, 5, and 6 and p(Xi) is the probability of each outcome, which in this example is 1/6 for each value.
• The expectation of some number k times a function is equal to k times the expectation
of the function. That is, E[kg(X)] = kE[g(X)] for constant k where g(X) is some
function of X. Suppose we want to know the expectation of 10 times the number on a die. We can say that this is simply 10 times the expectation of the die roll, which is 10 × 3.5 = 35.
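A small simulation in R (an illustration, not from the book) shows both facts at work: the average of many die rolls approaches 3.5, and multiplying each roll by 10 multiplies the average by 10.

   set.seed(123)
   rolls <- sample(1:6, 100000, replace = TRUE)   # many rolls of a fair die
   mean(rolls)                                    # close to 3.5
   mean(10 * rolls)                               # close to 35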
C Variance
• The variance of a random variable is a measure of how spread out the distribution is.
   var(X) = (1/N) Σ_{i=1}^{N} (Xi − X̄)²
It is useful to deconstruct exactly what the variance equation does. The math is pretty simple: for each observation, subtract the mean, square the difference, and then average these squared deviations across all observations. Several properties of the variance are also useful:
1. The variance of a constant plus a random variable is the variance of the random variable. That is, let k be a fixed number and ε be a random variable with variance σ²; then
   var(k + ε) = var(k) + var(ε) = 0 + var(ε) = σ²
2. The variance of a random variable times a constant is the constant squared times the variance of the random variable. That is, let k be some constant and ε be a random variable with variance σ²; then
   var(kε) = k²var(ε) = k²σ²
3. When random variables are correlated, the variance of a sum (or difference) of random variables depends on the variances and covariance of the variables. Let ε and τ be random variables; then
   var(ε + τ) = var(ε) + var(τ) + 2cov(ε, τ)
   var(ε − τ) = var(ε) + var(τ) − 2cov(ε, τ)
   where cov(ε, τ) is the covariance of ε and τ.
4. When random variables are uncorrelated, the variance of a sum (or difference) of random variables equals the sum of the variances. This outcome follows directly from the above, which we can see by noting that if two random variables are uncorrelated, then their covariance equals zero and the covariance term drops out of the equation.
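The variance properties above can be verified numerically in R with simulated random variables (the numbers below are arbitrary):

   set.seed(123)
   eps <- rnorm(100000, sd = 2)   # variance 4
   tau <- rnorm(100000, sd = 3)   # variance 9, independent of eps
   k <- 5
   var(k + eps)     # close to var(eps) = 4
   var(k * eps)     # close to k^2 * var(eps) = 100
   var(eps + tau)   # close to 4 + 9 = 13 because cov(eps, tau) is about zero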
D Covariance
• Covariance measures how much two random variables vary together. In large samples, the covariance of X and Y is cov(X, Y) = (1/N) Σ_{i=1}^{N} (Xi − X̄)(Yi − Ȳ).
• As with variance, there are several useful properties when dealing with covariance.
  1. The covariance of a random variable, ε, and a constant, k, is zero: cov(ε, k) = 0.
  2. The covariance of a random variable, ε, with itself is the variance of that variable: cov(ε, ε) = var(ε).
  3. The covariance of k1ε and k2τ, where k1 and k2 are constants and ε and τ are random variables, is k1k2 cov(ε, τ).
  4. The covariance of a random variable with the sum of another random variable and a constant is the covariance of the two random variables. Formally, let ε and τ be random variables and k a constant; then cov(ε, τ + k) = cov(ε, τ).
E Correlation
The correlation of two random variables is their covariance divided by the product of their standard deviations:

   corr(X, Y) = cov(X, Y) / (σX σY)

If X = Y for all observations, cov(X, Y) = cov(X, X) = var(X) and σX = σY, implying that the denominator will be σ²X, which is the variance of X. These calculations therefore imply that the correlation when X = Y will be +1, which is the upper bound for correlation.¹
The equation for correlation looks a bit like the equation for the slope coefficient in bivariate OLS. In fact, the bivariate OLS slope estimate can be written as a re-standardized correlation:

   β̂1 (bivariate OLS) = corr(X, Y) × (σY / σX)
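A quick check of this relationship in R with simulated data (the data-generating process is made up for illustration):

   set.seed(123)
   x <- rnorm(1000)
   y <- 2 + 3 * x + rnorm(1000)
   cor(x, y) * sd(y) / sd(x)   # the re-standardized correlation
   coef(lm(y ~ x))["x"]        # the bivariate OLS slope: the same number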
F Probability Density Functions
A probability density function (PDF) indicates the relative probability that a continuous random variable takes on a given value. Panels (c) and (d) of Figure 3.4 from Section 3.2 provide examples of two PDFs.
¹ We also get perfect correlation if the variables are identical once normalized. That is, X and Y are perfectly correlated if X = 10Y or if X = 5 + 3Y and so forth. In these cases (Xi − X̄)/σX = (Yi − Ȳ)/σY for all observations.
While the shapes of PDFs can vary considerably, all of them share certain fundamental
features. The values of a PDF are greater than or equal to zero for all possible values of the
random variable. The total area under the curve defined by the PDF equals one.
One tricky thing about PDFs is that they are continuous functions, meaning that we cannot say the probability a random variable equals 2.2 is equal to the value of the function evaluated at 2.2. The value of the function is pretty much the same at 2.2000001 and 2.2000002, and because there are always more possible values very near to any given value, the total probability would soon exceed one. Instead, we need to think in terms of the probability that the random variable is in some (possibly small) region of values. Hence we need to work with areas under the PDF curve rather than the height of the curve at a single point.
Figure A.1 shows the PDF for an example of a random variable. While we cannot use the
PDF to simply calculate the probability the random variable equals, say, 1.5, it is possible to
calculate the probability that the random variable is between 1.5 and any other value. The
figure highlights the area under the PDF curve between 1.5 and 1.8. This area corresponds
to the probability this random variable is between 1.5 and 1.8. In the next section we show
examples of how to calculate such probabilities based on PDFs from the normal distribution.2
² More formally, we can indicate a PDF as a function, f(x), that is greater than or equal to zero for all values of x. The fact that the total area under the curve equals one means that ∫_{−∞}^{∞} f(x)dx = 1. The probability that the random variable x is between a and b is ∫_{a}^{b} f(x)dx = F(b) − F(a), where F() is the integral of f().
[Figure A.1: the PDF of an example random variable, plotted as probability density against the value of x, with the area under the curve between 1.5 and 1.8 highlighted.]
G Normal Distributions
We work a lot with the standard normal distribution. (Only to us stats geeks does “standard normal” not seem repetitive.) A normal distribution is a specific (and famous) type of PDF, and a standard normal distribution is a normal distribution with mean zero and a variance of one. The standard deviation of a standard normal distribution is also one, since the square root of one is one. We are often interested in the probability of observing standard normal random variables that are less than or equal to some number. We denote by Φ(Z) = Prob(X < Z) the probability that a standard normal random variable X is less than Z. This is known as the cumulative distribution function (CDF) because it indicates the probability of seeing a random variable less than some value. It simply expresses the area under a PDF curve to the left of some value.
Figure A.2 shows four examples using the CDF for standard normal PDFs. Panel (a) shows Φ(0), which is the probability that a standard normal random variable will be less than 0. It is the area under the PDF to the left of 0. We can see that it is half of the total area, meaning that the area to the left of 0 is 0.50 and, therefore, the probability of observing a value of a standard normal random variable that is less than 0 is 0.50. Panel (b) shows Φ(−2), which is the probability that a standard normal random variable will be less than −2. It is the proportion of the total area that is left of −2, which is 0.023.
[Four panels of the standard normal PDF with the shaded areas Φ(0) = Prob(X < 0) = 0.500, Φ(−2) = Prob(X < −2) = 0.023, Φ(1.96) = Prob(X < 1.96) = 0.975, and Φ(1.0) = Prob(X < 1.0) = 0.841.]
FIGURE A.2: Probabilities that a Standard Normal Random Variable is Less Than Some Value
Panel (c) shows Φ(1.96), which is the probability that a standard normal random variable will be less than 1.96. It is 0.975. Panel (d) shows Φ(1), which is the probability that a standard normal random variable will be less than 1. It is 0.841.
We can also use our knowledge of the standard normal distribution to calculate the probability that β̂1 is greater than some value. The trick here is to recall that if the probability of an event is p, the probability of the event not happening is 1 − p. Common sense tells us that if there is a 15% chance of rain, then there is an 85% probability of no rain.
To calculate the probability that a standard normal variable is greater than some value, Z, use 1 − Φ(Z). Figure A.3 shows four examples. Panel (a) shows 1 − Φ(0), which is the probability that a standard normal random variable will be greater than 0. This probability is 0.50. Panel (b) highlights 1 − Φ(−2), which is the probability that a standard normal random variable will be greater than −2. It is 0.98. Panel (c) shows 1 − Φ(1.96), which is the probability that a standard normal random variable will be greater than 1.96. It is 0.025. Panel (d) shows 1 − Φ(1), which is the probability that a standard normal random variable will be greater than 1. It is 0.159.
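In R, the pnorm function returns Φ, so the values in Figures A.2 and A.3 can be reproduced directly (results rounded):

   pnorm(0)          # 0.500
   pnorm(-2)         # 0.023
   pnorm(1.96)       # 0.975
   1 - pnorm(1.96)   # 0.025
   1 - pnorm(1)      # 0.159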
Figure A.4 shows some key information about the standard normal distribution. In the table’s left-hand column is some number and in the right-hand column is the probability that a standard normal random variable will be less than that number. There is, for example, a 0.01 probability that a standard normal random variable will be less than −2.32. We can see this graphically in panel (a): in the top bell-shaped curve, the portion to the left of −2.32 is shaded, and it makes up 1 percent of the total area.
Because the standard deviation of a standard normal is one, all the numbers in the left
hand column can be considered as the number of standard deviations above or below the
mean. That is, the number -1 refers to a point that is one standard deviation below the
mean and the number +3 refers to a point that is 3 standard deviations above the mean.
The third row of the table shows that there is a probability of 0.01 that we’ll observe a value more than 2.32 standard deviations below the mean.
[Four panels of the standard normal PDF with the shaded areas 1 − Φ(0) = Prob(X > 0) = 0.500, 1 − Φ(−2) = Prob(X > −2) = 0.977, 1 − Φ(1.96) = Prob(X > 1.96) = 0.025, and 1 − Φ(1.0) = Prob(X > 1.0) = 0.159.]
FIGURE A.3: Probabilities that a Standard Normal Random Variable is Greater Than Some Value
[Figure A.4: key values of the standard normal distribution. The accompanying panels show the standard normal density with the areas for Prob(β̂1 ≤ −2.33), Prob(β̂1 ≤ 0), and Prob(β̂1 ≤ 1.96) shaded.]

SD = number of standard deviations      Probability β̂1 ≤ SD
above or below the mean, β1
-3.00    0.001
-2.58    0.005
-2.32    0.010
-2.00    0.023
-1.96    0.025
-1.64    0.050
-1.28    0.100
-1.00    0.160
 0.00    0.500
 1.00    0.840
 1.28    0.900
 1.64    0.950
 1.96    0.975
 2.00    0.977
 2.32    0.990
 2.58    0.995
 3.00    0.999
Going down to the row where SD = 0.00, we see that if β̂1 is standard normally distributed, there is a 0.50 probability of
being below 0. This probability is intuitive – the normal distribution is symmetric and we
have the same chance of seeing something above its mean as below it. Panel (b) shows this
graphically.
Going down to the shaded row where SD = 1.96, we see that there is a 0.975 proba-
bility that a standard normal random variable will be less than 1.96. Panel (c) shows this
graphically, with 97.5% of the standard normal distribution shaded. We see this value a
lot in statistics because twice the probability of being greater than 1.96 is 0.05, which is a commonly used significance level for hypothesis tests.
We can convert any normally distributed random variable to a standard normally dis-
tributed random variable. This process is known as standardizing values and is pretty easy.
This trick is valuable because it allows us to use the intuition and content of Figure A.4 to
work with any normal distribution, whatever its mean and standard deviation.
For example, suppose we have a normal random variable with a mean of 10 and a standard
deviation of 1 and we want to know the probability of observing a value less than 8. From
common sense, we can figure out that in this case 8 is 2 standard deviations below the mean.
Hence we can use Figure A.4 to see that the probability of observing a value less than 8 from
a normal distribution with mean 10 and standard deviation of one is 0.023 (see the fourth
row of the table, which shows that the probability a standard normal random variable is less
than -2 is 0.023).
How did we get there? First, subtract the mean from the value in question to see how far
it is from the mean. Then divide this quantity by the standard deviation to calculate how
many standard deviations away from the mean it is. More generally, for any given number
B drawn from a distribution with mean β1 and standard deviation se(β̂1), we can calculate the number of standard deviations B is away from the mean via the following equation:

   Standard deviations from mean = (B − β1) / se(β̂1)          (A-2)
Notice that the β1 has no hat but se(β̂1) does. Seems odd, doesn’t it? There is a logic to it. We’ll be working a lot with hypothetical values of β1, asking, for example, what the probability β̂1 is greater than some number would be if the “true” β1 were zero. But we’ll want to work with the precision implied by our actual data, so we’ll use se(β̂1).
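A small sketch in R of Equation A-2 in action; the numbers echo the first row of Table A.1 and are purely hypothetical:

   B  <- 3     # the value of interest
   b1 <- 0     # the hypothetical beta_1
   se <- 3     # se(beta_1 hat)
   (B - b1) / se          # 1: B is one standard deviation above the mean
   pnorm((B - b1) / se)   # about 0.84: the probability of observing a value below B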
To help us get comfortable with converting the distribution of β̂1 to the standard normal distribution, Table A.1 shows several examples. In the first example (the first two rows), β1 is 0 and the standard error of β̂1 is 3. Recall that the standard error of β̂1 measures the width of the β̂1 distribution. In this case, 3 is 1 standard deviation above the mean and 1 is 0.33 standard deviations above the mean.
The third and fourth rows of Table A.1 show an example when β1 is 4 and the standard deviation is 3. In this case, 7 is 1 standard deviation above the mean and 1 is 1 standard deviation below the mean. In the bottom portion of the table (the last two rows), β1 is 8 and the standard deviation of β̂1 is 2. In this case, 6 is 1 standard deviation below the mean and 1 is 3.5 standard deviations below the mean.
Table A.1

Number B | β1 (hypothetical) | se(β̂1) | Standardized       | Description
3        | 0                 | 3       | (3 − 0)/3 = 1      | 3 is 1 standard deviation above the mean of 0 when se(β̂1) = 3
1        | 0                 | 3       | (1 − 0)/3 = 0.33   | 1 is 0.33 standard deviations above the mean of 0 when se(β̂1) = 3
7        | 4                 | 3       | (7 − 4)/3 = 1      | 7 is 1 standard deviation above the mean of 4 when se(β̂1) = 3
1        | 4                 | 3       | (1 − 4)/3 = −1     | 1 is 1 standard deviation below the mean of 4 when se(β̂1) = 3
6        | 8                 | 2       | (6 − 8)/2 = −1     | 6 is 1 standard deviation below the mean of 8 when se(β̂1) = 2
1        | 8                 | 2       | (1 − 8)/2 = −3.5   | 1 is 3.5 standard deviations below the mean of 8 when se(β̂1) = 2
To calculate Φ(Z) we use a table such as the one in Figure A.4 or, more likely, computer software such as that described in the Computing Corner at the end of this appendix.
Remember This
A standard normal distribution is a normal distribution with a mean of zero
and a standard deviation of one.
• Any normal distribution can be converted to a standard normal distri-
bution.
• If β̂1 is distributed normally with mean β1 and standard deviation se(β̂1), then (β̂1 − β1)/se(β̂1) will be distributed as a standard normal random variable.
Discussion Questions
1. What is the probability a standard normal random variable is less than
or equal to 1.64?
2. What is the probability a standard normal random variable is less than
or equal to -1.28?
3. What is the probability a standard normal random variable is greater
than 1.28?
4. What is the probability a normal random variable with a mean of zero
and a standard deviation of 2 is less than -4?
5. What is the probability a normal random variable with a mean of zero
and a variance of 9 is less than -3?
6. Approximately what is the probability a normal random variable with
a mean of 7.2 and a variance of 4 is less than 9?
H The χ², t, and F Distributions
The normal distribution may be the most famous distribution, but it is far from the only workhorse distribution in statistical analysis. In this section we briefly discuss three other distributions that are particularly common in econometric practice: the χ², t, and F distributions. Each of these distributions is derived from the normal distribution.
The χ² distribution
[Figure A.5: a χ²(2) distribution (panel a) and a χ²(4) distribution (panel b), each plotted as probability density against the value of x with the most extreme 5 percent of the distribution highlighted.]
A squared standard normal random variable follows a χ² distribution with one degree of freedom. The sum of n independent squared standard normal random variables follows a χ² distribution with n degrees of freedom.
The χ² distribution arises in many different statistical contexts. We’ll see below that it is a building block for the t and F distributions.
The shape of the χ² distribution varies according to the degrees of freedom. Figure A.5
shows two examples of χ² distributions. Panel (a) shows a χ² distribution with 2 degrees of freedom. We have highlighted the most extreme 5 percent of the distribution, which demonstrates that the critical value from a χ²(2) distribution is roughly 6. Panel (b) shows a χ² distribution with 4 degrees of freedom. The critical value from a χ²(4) distribution is around 9.5.
The Computing Corner in Chapter 12 on pages 646 and 648 shows how to identify critical values from a χ² distribution. Software will often, but not always, automatically provide these critical values and the associated p-values.
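These critical values can be confirmed in R with the qchisq function (results rounded):

   qchisq(0.95, df = 2)   # 5.99, the "roughly 6" value for a chi-squared(2)
   qchisq(0.95, df = 4)   # 9.49, the "around 9.5" value for a chi-squared(4)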
The t distribution
The t distribution characterizes the distribution of the ratio of a normal random variable and the square root of a χ² random variable divided by its degrees of freedom. While such a ratio may seem to be a pretty obscure combination of things to worry about, we’ve seen in Section 4.2 that the t distribution is an incredibly useful distribution. We know that our OLS coefficients (among other estimators) are normally distributed. We also know (although we talk about this less) that the estimates of the standard errors are distributed according to a χ² distribution. Because we form t statistics by dividing coefficient estimates by standard error estimates, we want to know the distribution of the ratio of the coefficient estimate to the standard error estimate. Formally, a t distribution with n degrees of freedom is the ratio of a standard normal random variable, z, to the square root of a χ² random variable, x, divided by its degrees of freedom:

   t(n) = z / √(x/n)
Virtually every statistical software package automatically produces t statistics for every
coefficient estimated. We can also use t tests to test hypotheses about multiple coefficients,
although in Section 7.4 we focused on F tests for this purpose on the grounds of convenience.
The shape of the t distribution is quite similar to the normal distribution. As shown in Figure 4.3 in Chapter 4, the t distribution is a bit wider than the normal distribution. This means that extreme values are more likely from a t distribution than from a normal distribution. However, the difference is modest except in small samples, and it disappears as the sample size gets large.
The F distribution
An F-distributed random variable is the ratio of two independent χ² random variables, each divided by its degrees of freedom. The distribution is named in honor of legendary statistician R. A. Fisher.
Formally, for independent χ² random variables x1 and x2 with degrees of freedom n1 and n2:

   F(n1, n2) = (x1/n1) / (x2/n2)
Since χ² variables are positive, a ratio of two of them must be positive as well, meaning that random variables following an F distribution are greater than or equal to zero.
The square of a t-distributed random variable with n degrees of freedom follows an F(1, n) distribution. To see this, note that a t-distributed variable is a normal random variable divided by the square root of a χ² random variable. Squaring the t-distributed variable gives us a squared normal in the numerator, which is χ², and a χ² in the denominator. In other words, this gives us the ratio of two χ² random variables, which will be distributed according to an F distribution. We used this fact when noting on page 451 that in certain cases we can square a t statistic to produce an F statistic that can be compared to a rule of thumb about F statistics in the first stage of 2SLS analyses.
We use the F distribution when doing F tests which, among other things, allow us to test hypotheses involving multiple coefficients.
The F distribution depends on two degrees of freedom parameters. In the F test examples, the degrees of freedom for the test statistic depend on the number of restrictions on the parameters and the sample size. The order of the degrees of freedom is important: the numerator degrees of freedom (based on the number of restrictions) is listed first and the denominator degrees of freedom (based on the sample size) is listed second.
The F distribution does not have an easily identifiable shape like the normal and t distributions.
[Figure A.6: four F distributions plotted as probability density against the value of x, with the most extreme 5 percent of each distribution highlighted: (a) F(3, 2000) with critical value 2.61; (b) F(18, 300) with critical value 1.64; (c) F(2, 100) with critical value 3.09; (d) F(9, 10) with critical value 3.02.]
Instead, its shape changes rather dramatically depending on the degrees of freedom. Figure A.6 plots four examples of F distributions, each with different degrees of freedom. For each panel we highlight the extreme 5 percent of the distribution, providing a sense of the values necessary to reject the null hypothesis in each case. Panel (a) shows an F distribution with degrees of freedom equal to 3 and 2,000. This would be the distribution of an F statistic testing 3 restrictions in a data set with 2010 observations and 10 parameters to be estimated. The critical value is 2.61, meaning that an F test statistic greater than 2.61 would lead us to reject the null hypothesis. Panel (b) displays an F distribution with degrees of freedom equal to 18 and 300, and so on.
The Computing Corner in Chapter 7 on pages 359 and 360 shows how to identify critical values from an F distribution. Software will often, but not always, automatically provide the relevant critical values and p-values when reporting F statistics.
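In R, the qf function returns F critical values, and a quick check also illustrates the t-squared relationship noted above (the 30 degrees of freedom are arbitrary):

   qf(0.95, df1 = 3, df2 = 2000)   # 2.61, as in panel (a) of Figure A.6
   qf(0.95, df1 = 18, df2 = 300)   # 1.64, as in panel (b)
   qt(0.975, df = 30)^2            # about 4.17 ...
   qf(0.95, df1 = 1, df2 = 30)     # ... which equals the F(1, 30) critical value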
I Sampling
Section 3.2 of Chapter 3 discussed two sources of variation in our estimates: sampling randomness and randomness captured by the model’s error term.
Imagine that we are trying to figure out some feature of a given population. For example,
suppose we are trying to ascertain the average age of everyone in the world at a given time. If
we had (accurate) data from every single person, we’re done. Obviously, that’s not going to
happen, so we take a random sample. Since this random sample will not contain every single
person, the average age of people from it probably will not exactly match the population
average. And if we were to take another random sample we would likely get a different
average because we would have different people in our sample. Maybe the first time we got
more babies than usual and the second time we got the world’s oldest living person.
The genius of the sampling perspective is that we characterize the degree of randomness we should observe in our random sample. The variation will depend on the sample size we use and the underlying variance of the variable in the population.
A useful exercise is to take some population, say the students in a statistics class, and
gather information about every person in the population for some variable. Then if we
draw random samples from this population we will see that the mean of the variable in the
sampled group will bounce around for each random sample we draw. The amazing thing about statistics is that we will be able to say certain things about the mean of the averages we get across the random samples and the variance of the averages. If the sample size is large, we will be able to approximate the distribution of these averages with a normal distribution with a variance we can calculate based on the sample size and the underlying variance in the population.
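A minimal version of this exercise in R, using a made-up population rather than an actual class:

   set.seed(123)
   population <- rnorm(5000, mean = 20, sd = 4)              # a hypothetical population
   sample_means <- replicate(2000, mean(sample(population, 25)))
   mean(sample_means)    # close to the population mean of 20
   var(sample_means)     # close to var(population) / 25
   hist(sample_means)    # roughly bell-shaped, as the theory predicts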
This logic applies to regression coefficients as well. Hence, if we want to know the
relationship between age and wealth in the whole world, we can draw a random sample
and know that we will have variation related to the fact that we observe only a subset of
the target population. And, recall from Section 6.1 that OLS easily estimates means and
difference of means, so even our average age example works in an OLS context.
As attractive as it is to attribute all uncertainty in our estimates to sampling variation, this is not very practical. First, it is not uncommon to observe an entire population.
For example, if we want to know the relationship between education and wages in European
countries from 2000 to 2014, we could probably come up with data for each country and year
in our target population. And yet, we would be naive to believe that there is no uncertainty
in our estimates. Hence, there is almost always another source of randomness, something we capture with the error term in our model.
Second, the sampling paradigm requires that the samples from the underlying target
population be random. If the sampling is not random, we run the risk of introducing endo-
geneity as the type of observations that make their way into our analysis may systematically
differ from the people or units that we do not observe. A classic example is that we may
observe the wages of women who work, but this subsample is unlikely to be a random sample
from all women. The women who work are likely more ambitious and/or more financially
dependent on working.
Even public opinion polling data, a presumed model of random sampling, seldom provides
random samples from underlying populations. Commercial polls often have response rates
less than 20 percent and even academic surveys struggle to get response rates near 50 percent.
It is reasonable to believe that the people who respond differ in economic, social, and personality traits, meaning that simply attributing variation to sampling variation may be problematic.
When we think about the randomness in our coefficient estimates, then, we should not limit ourselves to thinking of variation in coefficients solely in terms of sampling variation. Instead, it is useful to step back and write down a model that simply includes an error term representing uncertainty in our model.
If the observations are drawn from a truly random sample of the target population (Hint:
they never are), then we can proceed with thinking of uncertainty reflecting only sampling
variation. However, if there is no random sampling, either because we have data on the full population or because the sample is not random, then we can model the selection process and assess whether or not the non-random sampling process induced correlation between the
independent variables and the error term. The Heckman selection model referenced in Chap-
ter 10 on page 518 provides a framework for considering such issues. Such selection is very
tricky to assess, however, and researchers continue to struggle with the best way to address
the issue.
J Further Reading
K Computing Corner
Excel
Sometimes Excel is the quickest way to calculate quantities of interest related to the normal
distribution.
• There are several ways to find the probability a standard normal is less than some value.
  1. Use the NORM.S.DIST function, which calculates the standard normal distribution. Use a 1 after the comma to produce the cumulative probability, which is the percent of the distribution to the left of the value indicated.
  2. Use the NORMDIST function and indicate the mean and the standard deviation which, for a standard normal, are 0 and 1 respectively. Use a 1 after the last comma to produce the cumulative probability, which is the percent of the distribution to the left of the value indicated.
• For a non-standard normal variable, use the NORMDIST function and indicate the mean and the standard deviation. For example, if the mean is 9 and the standard deviation is 3.2, the probability this distribution yields a random variable less than 7 is NORMDIST(7, 9, 3.2, 1), which is about 0.27.
Stata
• To calculate the probability a standard normal is less than some value in Stata, use the normal command. For example, display normal(2) will return the probability that a standard normal random variable is less than 2. For a variable with a different mean and standard deviation, we can also standardize the variable manually. For example, display normal((7-9)/3.2) returns the probability that a normal variable with a mean of 9 and a standard deviation of 3.2 is less than 7.
• To calculate the probability a standard normal is less than some value in R, use the pnorm command. For example, pnorm(2, mean = 0, sd = 1) will return the probability that a standard normal random variable is less than 2. For a variable with a different mean and standard deviation, we can also standardize the variable manually. For example, pnorm((7-9)/3.2) returns the probability that a normal variable with a mean of 9 and a standard deviation of 3.2 is less than 7.
CITATIONS AND ADDITIONAL NOTES
Student preface
mixingmemory/2006/11/the_illusion_of_explanatory_de.php.
Chapter 1
• Page 4 Gary Burtless (1995, 65) provides the initial motivation for this example – he
used Twinkies.
Chapter 3
• Page 65 Sides and Vavreck (2013) provide a great look at how theory can help cut
• Page 85 For a discussion of the Central Limit Theorem and its connection to the nor-
mality of OLS coefficient estimates see, for example, Lumley et al. (2002). They note
that for errors that are themselves nearly normal or do not have severe outliers, 80 or
• Page 101 Stock and Watson (2011, 674) present examples of estimators that highlight
the differences between bias and inconsistency. The estimators are silly, but they make
the point.
– Suppose we tried to estimate the mean of a variable with the first observation in
a sample. This will be unbiased because in expectation this will be equal to the
average of the population. Recall that expectation can be thought of as the average
value of an estimator we would get if we ran an experiment over and over again. This
estimator will not be consistent, though, because no matter how many observations
we have, we’re only using the first observation, meaning that the variance of the
estimator will not get smaller as the sample size gets very large. So, yes, no one in
their right mind would use this estimator, but it nonetheless shows an example of an estimator that is unbiased but not consistent.
– Suppose we tried to estimate the mean of a variable with the sample mean plus 1/N.
This will be biased because the expectation of this estimator will be the population average plus 1/N. However, this estimator will be consistent because the variance of a sample mean goes down as sample size increases and the 1/N bit will go to zero as the sample size goes to infinity. Again, this is a nutty estimator that no one would use in practice, but it shows how it is possible for an estimator to be biased, but consistent.
Chapter 4
• Page 136 For a report on the Pasteur example, see Manzi (2012, 73) and http://pyramid.spd.louisville.edu/~eri/fos/Pasteur_Pouilly-le-fort.pdf.
• Page 150 The distribution of the standard error of β̂1 follows a χ² distribution. A normally distributed coefficient estimate divided by such a standard error estimate therefore follows a t distribution.
• Page 166 The medical example is from Wilson and Butler (2007, 105).
Chapter 5
• Page 211 In Chapter 14 we show on page 715 that the bias term in a simplified example for a model with no constant is E[Σ εiXi / Σ Xi²]. For the more standard case that includes a constant in the model, the bias term is E[Σ εi(Xi − X̄) / Σ (Xi − X̄)²], which is the covariance of X and ε divided by the variance of X. See Greene (2003, 148) for a generalization of the omitted variable bias formula for any number of included and excluded variables.
• Page 237 Harvey’s analysis includes other variables, including a measure of how ethni-
cally and linguistically divided countries are and a measure of distance from the equator
(which is often used in the literature to capture a historical pattern that countries close
Chapter 6
• Page 259 To formally show that the OLS β̂1 and β̂0 estimates are functions of the means of the treated and untreated groups requires a bit of a slog through some algebra. From page 72, we know that the bivariate OLS equation for the slope is

   β̂1 = Σ_{i=1}^{N} (Ti − T̄)(Yi − Ȳ) / Σ_{i=1}^{N} (Ti − T̄)²
where we use Ti to indicate that our independent variable is a dummy variable (where Ti = 1 indicates a treated observation). We can break the sum into two parts, one part for Ti = 1 observations and the other for Ti = 0 observations. We’ll also refer to T̄ as p, where p indicates the percent of observations that were treated, which is the average of the dummy independent variable. (This is not strictly necessary, but helpful to highlight the intuition that the average of our independent variable is the percent of observations that were treated.) For the Ti = 1 observations, (Ti − p) = (1 − p) because, by definition, the value of Ti in this group is 1. For the Ti = 0 observations, (Ti − p) = (−p) because, by definition, the value of Ti in this group is 0. We can pull these terms out of the summation because they are constants within each group. The denominator simplifies to NT(1 − p), where NT is the number of observations who were treated (and therefore have Ti = 1).³ We also break the equation into three parts, producing

   β̂1 = [(1 − p) Σ_{Ti=1} Yi] / [NT(1 − p)] − [(1 − p) Σ_{Ti=1} Ȳ] / [NT(1 − p)] − [p Σ_{Ti=0} (Yi − Ȳ)] / [NT(1 − p)]
The (1 − p) in the numerator and denominator of the first and second terms cancels out. Note also that the sum of Ȳ for the observations where Ti = 1 equals NT Ȳ, allowing us to simplify the second term to Ȳ. Substituting NT/N for p in the third term produces a denominator of NC, where NC is the number of observations in the control group (for whom Ti = 0).⁴ We denote the average of the treated group (Σ_{Ti=1} Yi / NT) as ȲT and the average of the control group (Σ_{Ti=0} Yi / NC) as ȲC. We can re-write our equation as

   β̂1 = ȲT − Ȳ − (Σ_{Ti=0} Yi)/NC + (Σ_{Ti=0} Ȳ)/NC
³ To see this, re-write Σ_{i=1}^{N} (Ti − p)² as Σ_{i=1}^{N} Ti² − 2p Σ_{i=1}^{N} Ti + Σ_{i=1}^{N} p². Note that both Σ Ti² and Σ Ti equal NT because the squared value of a dummy variable is equal to itself and because the sum of a dummy variable is equal to the number of observations for which Ti = 1. We also use the facts that Σ p² equals Np² and p = NT/N, which allows us to write the denominator as NT − 2NT²/N + NT²/N. Simplifying yields NT(1 − p).
⁴ To see this, substitute NT/N for p and simplify, noting that NC = N − NT.
Using the fact that Σ_{Ti=0} Ȳ equals NC Ȳ, we can cancel some terms and (finally!) get our result:

   β̂1 = ȲT − ȲC

To show that β̂0 is ȲC, use Equation 3.5 from page 73, noting that Ȳ = (ȲT NT + ȲC NC)/N.
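The algebra can be confirmed numerically in R with made-up data:

   set.seed(123)
   treat <- rbinom(100, 1, 0.4)
   y <- 5 + 2 * treat + rnorm(100)
   coef(lm(y ~ treat))                         # intercept and slope
   mean(y[treat == 0])                         # equals the intercept (control-group mean)
   mean(y[treat == 1]) - mean(y[treat == 0])   # equals the slope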
• Page 262 Discussions of non-OLS difference of means tests sometimes get bogged down in whether the variance is the same across the treatment and control groups. If the variance differs across treatment and control groups, we have heteroscedasticity, which can be addressed with heteroscedasticity-consistent standard errors.
• Page 283 Poole and Rosenthal (1997) have measured the ideology of members of Congress based on their roll call voting records.
• Page 283 For more on the ideological shifts in the Republican Party, see Bailey, Mum-
• Page 296 See Kam and Franzese (2007, 48) for the derivation of the variance of estimated effects. The variance of β̂1 + Di β̂3 is var(β̂1) + Di² var(β̂3) + 2Di covar(β̂1, β̂3), where covar(β̂1, β̂3) is the covariance of the two coefficient estimates. In Stata, we can pull this covariance from the variance–covariance matrix of the coefficient estimates:

   regress Y X1 D X1D
   matrix V = get(VCE)
   disp V[3,1]

The matrix V is the variance–covariance matrix for the coefficient estimates. The covariance of β̂1 and β̂3 is the entry in the third row and first column of that matrix.
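A parallel sketch in R (the data here are simulated solely to make the code self-contained):

   set.seed(123)
   dat <- data.frame(X1 = rnorm(200), D = rbinom(200, 1, 0.5))
   dat$Y <- 1 + 0.5 * dat$X1 + 0.3 * dat$D + 0.4 * dat$X1 * dat$D + rnorm(200)
   m <- lm(Y ~ X1 * D, data = dat)
   V <- vcov(m)                    # variance-covariance matrix of the estimates
   # standard error of the effect of X1 when D = 1:
   sqrt(V["X1", "X1"] + 1^2 * V["X1:D", "X1:D"] + 2 * 1 * V["X1", "X1:D"])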
Chapter 7
• Page 279 This data is from Persico, Postlewaite, and Silverman (2004). Results
are broadly similar even if we exclude outliers with very high salaries.
• Page 319 The data on life expectancy and GDP per capita are from the World Bank’s World Development Indicators (http://data.worldbank.org/indicator/).
• Page 336 In log-linear models, a one-unit increase in X is associated with a change in Y of approximately 100 × β1 percent. To see why, start with a model in which

   Y = e^{β0} e^{β1 X} e^{ε}

If we use the fact that log(e^A e^B e^C) = A + B + C and log both sides, we get the log-linear formulation:

   ln Y = β0 + β1 X + ε

Differentiating the original equation with respect to X gives

   dY/dX = e^{β0} β1 e^{β1 X} e^{ε}

so that

   (dY/Y)/dX = e^{β0} β1 e^{β1 X} e^{ε} / (e^{β0} e^{β1 X} e^{ε}) = β1

In other words, β1 is the proportional change in Y associated with a one-unit change in X.
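A numeric illustration in R of why the percentage interpretation works well for small coefficients (values are made up):

   b1 <- 0.05
   exp(b1) - 1   # 0.051: close to b1, so roughly a 5 percent change
   b1 <- 0.5
   exp(b1) - 1   # 0.649: for large coefficients the 100 x beta_1 percent shortcut is rougher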
Chapter 8
• Page 380 See Bailey, Strezhnev, and Voeten (2015) for U.N. voting data.
Chapter 9
• Page 424 Endogeneity is a central concern of Medicaid literature. See, for example,
Currie and Gruber (1996), Finkelstein et al. (2012) and Baicker et al. (2013).
• Page 458 The reduced form is simply the model rewritten to be only a function of the
non-endogenous variables (which are the X and Z variables, not the Y variables). This
equation isn’t anything fancy, although it takes a bit of math to see where it comes from: substitute one equation into the other, collect the terms involving X1, and rearrange by moving all Y2 terms to the left side of the equation. This “reduced form” equation isn’t a causal model in any way. The π coefficients are crazy mixtures of the coefficients in Equations 9.12 and 9.13, which are the equations that embody the story we are trying to evaluate. The reduced form equation is simply the model re-expressed so that the endogenous variable depends only on the exogenous variables.
Chapter 10
• Page ?? See Newhouse (1993) and Gerber and Green (2012, 212-214) for more on the
RAND experiment.
Chapter 12
• Page 612 A good place to start when considering MLE is with the name. Maximum is,
well, maximum; likelihood refers to the probability of observing the data we observe given particular values of the parameters; and estimation means we use the data to estimate those parameters. For most people, the new bit is the likelihood. The concept is actually quite close to
ordinary usage. Roughly 20 percent of the U.S. population is under 15. What is the
likelihood that when we pick three people randomly we get two people under 15 and
one over 15? The likelihood (which we’ll label “L”) is L = 0.2 × 0.2 × 0.8 = 0.03. In
other words, if we pick three people at random in the United States, there is a 3 percent
chance (or, “likelihood”) we will observe two people under 15 and one over 15.
We can apply this concept when we do not know the underlying probability. Suppose
that we want to figure out what proportion of the population has health insurance.
Let’s call “p_insured” the probability someone is insured (which is simply the proportion
of insured in the United States). Suppose we randomly select three people, ask them if
they are insured, and find out that two are insured and one is not. The probability (or likelihood) of observing this particular outcome is L = p_insured × p_insured × (1 − p_insured). MLE finds an estimate of p_insured that maximizes the likelihood of observing the data we actually observed.
We can get a feel for what values lead to high or low likelihoods by trying out a few candidate values of p_insured. A guess of p_insured = 0.5 gives L = 0.5 × 0.5 × (1 − 0.5) = 0.125. A guess of p_insured = 0.7 gives L = 0.7 × 0.7 × 0.3 = 0.147, which is even better. But a guess of p_insured = 0.9 gives L = 0.9 × 0.9 × 0.1 = 0.081, which is not as good.
Conceivably we could keep plugging different values of p_insured into the likelihood equation until we found the best value. Or, calculus gives us tools to quickly find maxima.⁵ When we observe two people with insurance and one without, the value of p_insured that maximizes the likelihood is 2/3 which, by the way, is the common-sense estimate when two out of three sampled people are insured.
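A one-line numerical check of this maximization in R (a sketch, not the book's code):

   lik <- function(p) p^2 * (1 - p)   # likelihood of observing two insured, one uninsured
   optimize(lik, interval = c(0, 1), maximum = TRUE)$maximum   # approximately 2/3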
To use MLE to estimate a probit model we extend this logic. Instead of estimating a
⁵ Here’s the formal way to do this using calculus. First, calculate the derivative of the likelihood with respect to p: ∂L/∂p = 2p_insured − 3p_insured². Second, set the derivative to zero; solving for p_insured yields p_insured = 2/3.
single probability parameter (p_insured in our example above), we estimate the probability that each person is insured as a function of independent variables, Φ(β0 + β1Xi), and substitute this expression for p_insured into the likelihood equation above. In this case, the thing we are trying to learn about is no longer p_insured, but is now the βs, which determine the probability that each person is insured. The likelihood if we observe two people who are insured and one who is not is L = Φ(β0 + β1X1) × Φ(β0 + β1X2) × (1 − Φ(β0 + β1X3)), where Φ(β0 + β1X1) is the probability that person 1 is insured (here X1 refers to the value of X for the first person rather than a separate variable X1 as we typically use the notation elsewhere), Φ(β0 + β1X2) is the probability that person 2 is insured, and 1 − Φ(β0 + β1X3) is the probability that person 3 is not insured. MLE finds the β̂ that maximizes the likelihood, L. The actual estimation process is handled numerically by statistical software.
• Page 623 To use the average-case approach, create a single “average” person for whom
the value of each independent variable is the average of that independent variable. We
calculate a fitted probability for this person. Then we add one to the value of X1 for this
average person and calculate how much the fitted probability goes up. The downside
of the average-case approach is that in the real data there might not be anyone who is
average across all variables as the variables might typically cluster together. It’s also
kind of weird because dummy variables for the “average” person will be between 0 and 1
even though no single observation will have any value other than 0 and 1. This means,
for example, that the “average” person will be 0.52 female and 0.85 right-handed and
so forth.
To interpret probit coefficients using the average-case approach, use the following guide.
– If X1 is continuous:
1. Calculate P1 as the fitted probability using —ˆ given all variables are at their
rather than simply by one. For example, if the scale of X1 is in the millions of
3. The difference P2 ≠P1 is the estimated effect of a one standard deviation increase
– If X1 is a dummy variable:
1. Calculate P1 as the fitted probability using β̂ given X1 = 0 and all other variables at their means.
2. Calculate P2 as the fitted probability using β̂ given X1 = 1 and all other variables at their means.
3. The difference P2 − P1 is the estimated effect of a one-unit increase in X1, holding the other variables at their means.
• Page 623 The marginal effects approach uses calculus to determine the slope of the
fitted line. Obviously the slope of the probit fitted line varies, so we have to determine
a reasonable point to calculate this slope. In the observed-value approach, we find the
slope at the point defined by actual values of all the independent variables. This will be ∂Prob(Yi = 1)/∂X1. We know that the Prob(Yi = 1) is a CDF and one of the nice properties of
a CDF is that the derivative is simply the PDF. (We can see this graphically in Figure
12.5 by noting that if we increase the number on the horizontal axis by a small amount,
the CDF will increase by the value of the PDF at that point.) Applying that property
plus the chain rule, we get ∂Φ(β̂0 + β̂1X1i + β̂2X2i)/∂X1 = φ(β̂0 + β̂1X1i + β̂2X2i)β̂1, where φ() is the normal PDF. Hence the marginal effect of increasing X1 at the observed values is φ(β̂0 + β̂1X1i + β̂2X2i)β̂1, which the observed-value approach averages across all observations.
If the scale of X1 is large such that an increase of 1 unit is small, then the marginal effects and discrete differences approaches will yield similar results. If the scale of X1 is small such that an increase of 1 unit is a relatively large increase, then the marginal effects and discrete differences approaches can diverge more noticeably.
We show how to calculate marginal effects in Stata on page 646 and in R on page 648.
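A minimal sketch in R of the observed-value marginal-effect calculation for a continuous variable (the data are simulated here only so the code runs; the book's own R code is on page 648):

   set.seed(123)
   x1 <- rnorm(500)
   x2 <- rnorm(500)
   y  <- rbinom(500, 1, pnorm(0.2 + 0.8 * x1 - 0.5 * x2))
   m  <- glm(y ~ x1 + x2, family = binomial(link = "probit"))
   xb <- predict(m, type = "link")        # fitted latent index for each observation
   mean(dnorm(xb) * coef(m)["x1"])        # average marginal effect of x1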
Chapter 14
• Page 733 We can also derive the attenuation bias result using the general form of endogeneity from page 91, which is plim β̂1 = β1 + corr(X1, ε) σε/σX1 = β1 + cov(X1, ε)/σ²X1. Note that “ε” in Equation 14.19 actually contains −β1νi + εi. Solving for cov(X1, −β1ν + ε) yields −β1σ²ν.
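A quick simulation in R of attenuation bias (all numbers are illustrative):

   set.seed(123)
   x_true <- rnorm(10000)                 # variance 1
   x_obs  <- x_true + rnorm(10000)        # measured with error; nu also has variance 1
   y      <- 2 * x_true + rnorm(10000)
   coef(lm(y ~ x_obs))["x_obs"]           # about 1, i.e., 2 * 1/(1 + 1)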
Chapter 13
• Page 661 Another form of correlated errors is spatial autocorrelation, which occurs when
the error for one observation is correlated with the error for another observation that is
spatially close to it. Our polling example is predicated on the idea that there may be
spatial autocorrelation because those who live close to each other (and sleep in the same
bed!) may have correlated errors. This kind of situation can arise with geographic based
data such as state or county level data because there may be unmeasured similarities (meaning stuff in the error term) that are common within regions. The consequences of spatial autocorrelation are similar to those of serial correlation: spatial autocorrelation does not cause bias. Spatial autocorrelation does cause the conventional
standard error equation for OLS coefficients to be incorrect. The easiest first step for
dealing with this situation is simply to include a dummy variable for region. Often
this step will capture any regional correlations not captured by the other independent
variables. A more technically complex way of dealing with this situation is via spatial
regression statistical models. The intuition underlying these models is similar to that
for serial correlation, but the math is typically harder. See, for example, Tam-Cho and
Gimpel (2012).
• Page 666 Wooldridge (2009, 416) discusses inclusion of X variables in this test.
• Page 673 The difference between the two approaches is that the Cochrane-Orcutt method loses the first observation (because there are no lagged variables for it), while the Prais-Winsten method fills in the transformed first observation with a reasonable transformation.
• Page 674 Wooldridge (2009, 424) notes that the ρ-transformed approach also requires that εt not be correlated with Xt−1 or Xt+1. In a ρ-transformed model, the independent variable is Xt − ρXt−1 and the error is εt − ρεt−1. If the lagged error term (εt−1) is correlated with Xt, then the independent variable in the ρ-transformed model will be correlated with the error term in the ρ-transformed model. We are assuming that the distribution of the error term isn’t shifting over time (see our discussion of stationarity on page 683 for more on this topic). In other words, the ρ-transformed approach is appropriate only if εt is uncorrelated with Xt−1, Xt, and Xt+1.
• Page 682 The so-called Breusch-Godfrey test is a more general test for autocorrelation.
• Page 689 R code to generate multiple simulations with unit root (or other) time series
variables:
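A minimal sketch of such a simulation, assuming the goal is to show how often two independent random walks appear spuriously related:

   set.seed(123)
   pvals <- replicate(1000, {
     y <- cumsum(rnorm(100))   # a random walk (unit root) series
     x <- cumsum(rnorm(100))   # an independent random walk
     summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
   })
   mean(pvals < 0.05)          # far above 0.05: spurious "significance" is common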
Chapter 16
• Page 756 Columbia Professor Andrew Gelman directed me to this saying of Bill James.
GUIDE TO SELECTED DISCUSSION QUESTIONS
Chapter 1
1. In panel (b) of Figure 1.4 we can see that β0 > 0 (it’s around 40) and β1 > 0 as well.
Panel (d): Note that the X-axis ranges from about −6 to +6. β0 is the value of Y when X is zero and is therefore 2, which can be seen in Figure A.7. β0 is not the value of Y at the left-most point in the figure as it was for the other panels in Figure 1.4.
Chapter 4
[Figure A.7: the fitted line for panel (d) of Figure 1.4, with Y plotted against the independent variable X, which runs from −6 to 6.]
1. (a) The t statistic for the coefficient on change in income is 2.29/0.52 = 4.40.
(b) The degrees of freedom is sample size minus the number of parameters estimated, so it is 17 − 2 = 15.
(c) The critical value for a two-sided alternative hypothesis and α = 0.01 is 2.95. We reject the null hypothesis because 4.40 > 2.95.
(d) The critical value for a one-sided alternative hypothesis and α = 0.05 is 1.75. We reject the null hypothesis because 4.40 > 1.75.
2. The critical value from a two-sided test is bigger because it marks the point beyond which only α/2 of the distribution lies, which pushes the critical value farther into the tail. As Table 4.4 on page 157 shows, the two-sided critical values are larger than the one-sided critical values for all values of α.
3. The critical values from a small sample are larger because there is additional uncertainty about our estimate of the standard error of β̂1 that the t distribution accounts for. In other words, even when the null hypothesis is true, the data could work out such that we get an unusually small estimate of se(β̂1), which would push up our t statistic. The more uncertainty there is about se(β̂1), the more often we can expect to see large t statistics even when the null hypothesis is true. As the sample size increases, uncertainty about se(β̂1) decreases, and this source of large t statistics fades away.
1. The power of a test is the probability of observing a t statistic higher than the critical value given the true value of β1 and the se(β̂1), α, and the alternative hypothesis posited in the question. This will be 1 − Φ(Critical value − β1True/se(β̂1)). The critical value will be 2.32 for α = 0.01 and a one-sided alternative hypothesis. The sketches will be normal distributions centered at β1True/se(β̂1) with the portion of the distribution greater than the critical value shaded.
(a) The power when β1True = 1 is 1 − Φ(2.32 − 1/0.75) = 0.162.
(b) The power when β1True = 2 is 1 − Φ(2.32 − 2/0.75) = 0.636.
2. If the estimated se(β̂1) doubled, the power will go down because the center of the t statistic distribution will shift toward zero (because β1True/se(β̂1) gets smaller as the standard error increases). For this higher standard error, the power when β1True = 1 is 1 − Φ(2.32 − 1/1.5) = 0.049 and the power when β1True = 2 is 1 − Φ(2.32 − 2/1.5) = 0.161.
3. The probability of committing a Type II error is simply one minus the power. Hence when se(β̂1) = 0.75, the probability of committing a Type II error is 0.838 for β1True = 1 and 0.364 for β1True = 2.
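These power numbers can be reproduced in R with pnorm (a quick check, not the book's code):

   1 - pnorm(2.32 - 1 / 0.75)   # 0.162
   1 - pnorm(2.32 - 2 / 0.75)   # 0.636
   1 - pnorm(2.32 - 1 / 1.5)    # 0.049
   1 - pnorm(2.32 - 2 / 1.5)    # about 0.16, matching the 0.161 in the text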
Chapter 5
1. a) Do you accept this recommendation? Kevin Drum’s response to this scenario: “If
you’re smart, you’d think I’m an idiot. As kids get older, they weigh more. They
also do better on math tests. I haven’t discovered a link between weight and math
ability. All I’ve discovered is the obvious fact that older kids know more math.”
c) Propose additional variables for this model. Age is an obvious factor to control for.
There could be others: Family income, class size, instructional techniques, and so
forth.
d) Does inclusion of additional controls provide definitive proof? Kevin Drum writes:
“The usual way to handle this is to control for age. That is, I need to find out
if kids of the same age show the same relationship, namely that heavier ones are
better at math. Suppose I did that, and it turned out they are. Am I vindicated?
Not quite. It’s possible, for example, that kids who like math are more sedentary
than kids who don’t. That makes them heavier. The chain of causation doesn’t go
from weight to math, it goes from fondness for math to scores on math tests. But
1. Not at all. R²j will be zero. In a random experiment, the treatment is uncorrelated with anything. Most importantly this buys us exogeneity, but it also buys us increased precision.
2. We’d like to have a low variance for estimates and to get that we want the R²j to be small. In other words, we want the other variables to explain as little of Xj as possible.
Chapter 6
1. A model in which a three-category categorical country variable has been converted into multiple dummy variables with the United States as the excluded category looks like the following:

   Yi = β0 + β1X1i + β2Canadai + β3Mexicoi + εi

The estimated constant (β̂0) is the average value of Yi for units in the excluded category (in this case, U.S. citizens) after taking into account the effect of X1. The coefficient on the Canada dummy variable (β̂2) estimates how much more or less Canadians feel about the dependent variable than U.S. citizens, controlling for X1. The coefficient on the Mexico dummy variable (β̂3) estimates how much more or less Mexicans feel about the dependent variable than U.S. citizens, controlling for X1. A model that excluded a different category would be equally valid and would produce substantively identical results, although the coefficients on the dummy variables will differ because they will refer to different reference categories than the United States.
2. (a) 25
(b) 20
(c) 30
(d) 115
(e) 5
(f) -20
(g) 120
(h) -5
(i) -25
(j) 5
If β1 + β3 is positive, then the effect of X is still positive for the treatment group even when β3 is negative.
Chapter 8
1. (a) The error term includes the ability of the students, the quality of the teacher, the
time the class meets, the room (Does it have a window? Is it loud?) and other
factors.
(b) There is likely a teacher specific fixed effect that differs across teachers. There may
also be a course-specific error term (e.g., students always love stats! and hate, hate, hate some other course).
(c) It is plausible that more students take courses from popular teachers (who have
high fixed effects), which would induce correlation between the error term and the
number of students (unless fixed effect is included or some other measure of teacher
quality is included).
Discussion questions on page 410 - see Table A.2:
Table A.2: Values of β0, β1, β2, and β3 in Figure 8.6
Chapter 9
Discussion questions on page 438:
that rainfall on April 15, 2009 is modestly correlated with unemployment, meaning
they need to control for that factor in their full model.
3. Do institutions matter?
a) It could be that countries with high economic growth have better institutions. That
is, rich countries can pay for people and other things necessary to make government
work more effectively. This situation is analogous to the crime and police example.
The police could be going to where the crime is; in this case, the good institutions
could be going to where the economic growth is.
b) To test whether the settler mortality variable satisfies the inclusion condition we run
a model in which institutional quality is the dependent variable and settler mortality
is the independent variable.
c) We cannot directly test whether settler mortality in the 18th century has a direct
effect on modern economic growth. The authors of the study argue that this variable
is so far in the past and relates to a threat to mortality that modern technology may
well have changed. They argue that the only reasonable effect that long-ago settler
mortality has on modern growth is due to the government institutions created by
colonial powers.
Chapter 11
Discussion questions on page 553:
1. See Cellini, Ferreira, and Rothstein (2009) for a study of school bond passage and
housing values.
a) The assignment variable is the election results. The threshold is 50 percent. (The
model in the actual paper is a bit more involved. Some bond measures needed more
than 50 percent to pass so their assignment variable is actually percent above or
below the threshold needed to win.)
b) Estimate a model of house values using election results as an assignment variable
and a dummy variable for passage of the bond issue as the treatment variable.
c) The basic version of their model is
House valuesi = β0 + β1Ti + β2(Election supporti − 50) + νi
Ti = 1 if Election supporti ≥ 50
Ti = 0 if Election supporti < 50
where House values i is the average housing value in city i three years after the bond
election and Election support i is the percent of voters who supported the education
bond. (The model in the actual paper is a bit more involved. Among other things,
the authors used logged values of home prices; see page 329 for details on how to use
logged models.)
2. See Card, Dobkin, and Maestas (2009) for an influential study that used RD to study
Medicare. They looked at mortality of all patients admitted to the hospital. They used
individual data, but for the purposes of this question, we’ll work with grouped data so
that we can use a continuous variable (percent of people in a given group who died)
instead of a dichotomous variable (whether an individual died or not). The data are
grouped by birth months so that everyone in the group is the same age.
a) The assignment variable is age. The threshold is 65 years of age.
b) Estimate a model of mortality using age as an assignment variable and a dummy variable for
being over 65 as the treatment variable. As people get older we expect mortality to
increase, but we do not expect mortality to “jump” at 65 once we have accounted
for the effect of age on mortality.
c) A basic equation for this model is
Mortalityg = β0 + β1Tg + β2(Ageg − 65) + νg
Tg = 1 if Ageg ≥ 65
Tg = 0 if Ageg < 65
where Mortalityg is the percent of people in group g who died within a week of being
admitted, Tg is whether people in the group were eligible for Medicare, and Ageg
is the age of the people in the group (which will be essentially the same for all in
the group because they share the same birth month). Card, Dobkin, and Maestas
looked at mortality across a wide range of time frames – within a day, week, year,
and so on.
Discussion questions on page 560:
(a) β1 = 0, β2 = 0, β3 < 0
(b) β1 < 0, β2 = 0, β3 > 0
(c) β1 > 0, β2 < 0, β3 = 0
(d) β1 < 0, β2 > 0, β3 < 0
(e) β1 > 0, β2 > 0, β3 < 0 (actually β3 = −β2)
(f) β1 < 0, β2 < 0, β3 > 0 (here too β3 = −β2, which means β3 is positive because β2 is negative)
Chapter 12
Discussion questions on page 618:
1. Solve for Yi* = 0.
• Panel (a): X = 1.5
• Panel (b): X = 23
• Panel (c): X = 1.0
• Panel (d): X = 1.5
2. True, false, or indeterminate based on Table 12.2:
(a) True. The t statistic is 5, which is statistically significant for any reasonable sig-
nificance level.
(b) False. The t statistic is 1, which is not statistically significant for any reasonable
significance level.
(c) False! Probit coefficients cannot be directly interpreted.
(d) False. The fitted probability is Φ(0), which is 0.50.
(e) True. The fitted probability is Φ(3), which is approximately 1 because virtually all
of the area under a standard normal curve is to the left of 3.
3. Fitted values based on Table 12.2:
(a) The fitted probability is Φ(0 + 0.5 × 4 − 0.5 × 0) = Φ(2), which is 0.978.
(b) The fitted probability is Φ(0 + 0.5 × 0 − 0.5 × 4) = Φ(−2), which is 0.022.
(c) The fitted probability is Φ(3 + 1.0 × 0 − 3.0 × 1) = Φ(0), which is 0.5.
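In R, these standard normal CDF values can be checked with pnorm(); the calls below simply restate the calculations above:

pnorm(0)                       # 0.50
pnorm(3)                       # approximately 1
pnorm(0 + 0.5 * 4 - 0.5 * 0)   # = pnorm(2), about 0.98
pnorm(0 + 0.5 * 0 - 0.5 * 4)   # = pnorm(-2), about 0.02
pnorm(3 + 1.0 * 0 - 3.0 * 1)   # = pnorm(0), 0.50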
Discussion questions on page 625:
1. Use the observed-variable, discrete-differences approach to interpreting the coefficient.
Calculate the fitted probability for all observations using X1i = 0 and the actual value
of X2i . Then calculate the fitted probability for all observations using X1i = 1 and the
actual value of X2i . The average difference in these fitted probabilities is the average
effect of X1 on the probability Y = 1.
2. Use the observed-variable, discrete-differences approach to interpreting the coefficient.
Calculate the fitted probability for all observations using actual values of X1i and X2i .
Then calculate the fitted probability for all observations using X1i + 1 in place of the
actual value of X1i, keeping the actual value of X2i. The average difference in these fitted probabilities is the average
effect of a one unit increase in X1 on the probability Y = 1.
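A sketch of both calculations in R, assuming a probit model has been fit as m <- glm(Y ~ X1 + X2, family = binomial(link = "probit"), data = dat); the data frame and variable names are hypothetical:

# Question 1: average effect of moving X1 from 0 to 1
d0 <- dat; d0$X1 <- 0
d1 <- dat; d1$X1 <- 1
mean(predict(m, newdata = d1, type = "response") -
     predict(m, newdata = d0, type = "response"))

# Question 2: average effect of a one-unit increase in X1
dplus <- dat; dplus$X1 <- dat$X1 + 1
mean(predict(m, newdata = dplus, type = "response") -
     predict(m, newdata = dat, type = "response"))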
Chapter 14
Discussion questions on page 729:
1. The full model that includes the unmeasured productivity is
Income_i = β₀ + β₁Facebook hours_i + β₂Productivity_i + ε_i
We can expect that more productive people spend less time on Facebook (meaning the
correlation of X1 and X2 is negative). We can also expect that more productive people
earn more (meaning that β₂ > 0). Hence we would expect β̂₁ from a model excluding
productivity to understate the effect of Facebook, meaning that the estimated effect will be
more negative than it really is. Be careful to note that understate in this context does not
mean simply that the coefficient will be small (i.e., close to zero); rather, it means
that the coefficient will be either less positive or more negative than it should be.
A reasonable expectation is that the β̂₁ in the model
without productivity will be less than zero. If so, we should worry that some portion
of the negative coefficient comes from the fact that we have not measured productivity.
Note also that we are speculating about relationships of productivity with the other
variables. We could be wrong.
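A small simulation in R can illustrate this logic; the numbers below are made up purely so that Facebook hours and productivity are negatively correlated and productivity raises income:

set.seed(1)
n            <- 10000
productivity <- rnorm(n)
facebook     <- -0.5 * productivity + rnorm(n)                 # negatively correlated with productivity
income       <- 0 * facebook + 2 * productivity + rnorm(n)     # true effect of Facebook set to zero
coef(lm(income ~ facebook))["facebook"]                        # biased downward: clearly negative
coef(lm(income ~ facebook + productivity))["facebook"]         # close to the true value of zero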
2. The full model that includes the candidate quality variable is
Vote share_i = β₀ + β₁Campaign spending_i + β₂Candidate quality_i + ε_i
We can expect that candidate quality is associated with raising more money (meaning
the correlation of X1 and X2 is positive). We can also expect that higher quality
candidates get higher vote shares (meaning that β₂ > 0). Hence we would expect that
β̂₁ from a model that excludes candidate quality would overstate the effect of campaign
spending, meaning that the effect will be more positive than it really is. Suppose we
observe a positive β̂₁ in the model without candidate quality. We should worry that
some portion of that positive coefficient is due to the omission of candidate quality from
the model.
variable will be less than or equal to whatever value is 1.28 standard deviations below
its mean.
3. Using Table A.4, we see that the probability a standard normal random variable is
less than 1.28 is 0.90. Because the probability of being above some value is one
minus the probability of being below some value, there is a 10% chance that a normal
random variable will be greater than or equal to whatever number is 1.28 standard
deviations above its mean.
4. We need to convert the number -4 to something in terms of standard deviations from
the mean. The value -4 is 2 standard deviations below the mean of 0 when the standard
deviation is 2. Using Table A.4 we see that the probability a normal random variable
with a mean of zero is less (more negative) than 2 standard deviations below its mean
is 0.023. In other words, the probability of being less than (−4 − 0)/2 = −2 is 0.023.
5. First, convert -3 to standard deviations above or below the mean. In this case, if
the variance is 9, then the standard deviation (the square root of the variance) is 3.
Therefore -3 is the same as 1 standard deviation below the mean. From the table in
Figure A.4, we see that there is a 0.16 probability a normal variable will be more than
1 standard deviation below its mean. In other words, the probability of being less than
(−3 − 0)/√9 = −1 is 0.16.
6. First convert 9 to standard deviations above or below the mean. The standard deviation
(the square root of the variance) is 2. The value 9 is (9 − 7.2)/2 = 1.8/2 = 0.9 standard deviations
above the mean. The value 0.9 does not appear in Figure A.4, but it is close to 1 and
the probability of being less than 1 is 0.84. Therefore a reasonable approximation is in
the vicinity of 0.8. The actual value is 0.82 and can be calculated as discussed in the
Computing Corner on page 793.
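In R, these conversions can also be done directly with pnorm(), which accepts a mean and a standard deviation; the calls below restate questions 3 through 6:

1 - pnorm(1.28)                    # question 3: about 0.10
pnorm(-4, mean = 0, sd = 2)        # question 4: about 0.023
pnorm(-3, mean = 0, sd = sqrt(9))  # question 5: about 0.16
pnorm(9, mean = 7.2, sd = 2)       # question 6: about 0.82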
BIBLIOGRAPHY
Acemoglu, Daron, Simon Johnson, and James A. Robinson. 2001. The Colonial Origins
of Comparative Development: An Empirical Investigation. American Economic Review
91(5): 1369-1401.
Acemoglu, Daron, Simon Johnson, James A. Robinson, and Pierre Yared. 2008. Income and
Democracy. American Economic Review 98(3): 808-842.
Achen, Christopher H. 2002. Toward a new political methodology: Microfoundations and
ART. Annual Review of Political Science 5: 423-450.
Achen, Christopher H. 2000. Why Lagged Dependent Variables Can Suppress the Explana-
tory Power of Other Independent Variables. Manuscript, University of Michigan.
Achen, Christopher H. 1982. Interpreting and Using Regression. Newbury Park, NJ: Sage
Publications.
Albertson, Bethany and Adria Lawrence. 2009. After the Credits Roll: The Long-Term
Effects of Educational Television on Public Knowledge and Attitudes. American Politics
Research 37(2): 275-300.
Alvarez, R. Michael and John Brehm. 1995. American Ambivalence Towards Abortion
Policy: Development of a Heteroskedastic Probit Model of Competing Values. American
Journal of Political Science 39(4): 1055-1082.
Anderson, James M., John M. Macdonald, Ricky Bluthenthal, and J. Scott Ashwood. 2013.
Reducing Crime By Shaping the Built Environment With Zoning: An Empirical Study
Of Los Angeles. University of Pennsylvania Law Review 161: 699-756.
Angrist, Joshua and Alan Krueger. 1991. Does Compulsory School Attendance Affect
Schooling and Earnings? Quarterly Journal of Economics. 106(4): 979-1014.
Angrist, Joshua and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An Em-
piricist’s Companion. Princeton, NJ: Princeton University Press.
Angrist, Joshua and Jörn-Steffen Pischke. 2010. The Credibility Revolution in Empirical
Economics: How Better Research Design is Taking the Con out of Econometrics. Working
Paper 15794 http://www.nber.org/papers/w15794.
Angrist, Joshua. 2006. Instrumental Variables Methods in Experimental Criminological
Research: What, Why and How. Journal of Experimental Criminology 2(1): 23-44.
Angrist, Joshua, Kathryn Graddy, and Guido Imbens. 2000. The Interpretation of Instru-
mental Variables Estimators in Simultaneous Equations Models with an Application to
the Demand for Fish. Review of Economic Studies 67(3): 499-527.
Anscombe, Francis J. 1973. Graphs in Statistical Analysis. American Statistician 27(1):
17-21.
Anzia, Sarah. 2012. The Election Timing Effect: Evidence from a Policy Intervention in
Texas. Quarterly Journal of Political Science 7(3): 209-248.
Arellano, Manuel and Stephen Bond. 1991. Some Tests of Specification for Panel Data.
Review of Economic Studies 58(2): 277-297.
Aron-Dine, Aviva, Liran Einav, and Amy Finkelstein. 2013. The RAND Health Insurance
Experiment, Three Decades Later. Journal of Economic Perspectives 27(1): 197-222.
Baicker, Katherine, Sarah Taubman, Heidi Allen, Mira Bernstein, Jonathan Gruber, Joseph
P. Newhouse, Eric Schneider, Bill Wright, Alan Zaslavsky, Amy Finkelstein, and the
Oregon Health Study Group. 2013. The Oregon Experiment - Medicaid’s Effects on
Clinical Outcomes. New England Journal of Medicine 368(18): 1713-1722.
Bailey, Michael A., Jon Mummolo and Hans Noel. 2012. Tea Party Influence: A Story of
Activists and Elites. American Politics Research 40(5): 769-804.
Bailey, Michael A. and Elliott Fullmer. 2011. Balancing in the States, 1978-2009. State
Politics and Policy Quarterly 11(2): 149-167.
Bailey, Michael A. and Clyde Wilcox. 2015. A Two Way Street on Iraq: On the Inter-
actions of Voter Policy Preferences and Presidential Approval. Manuscript, Georgetown
University.
Bailey, Michael A., Daniel J. Hopkins, and Todd Rogers. 2015. Unresponsive and Unper-
suaded: The Unintended Consequences of Voter Persuasion Efforts. Manuscript, George-
town University.
Bailey, Michael A., Anton Strezhnev, and Erik Voeten. 2015. Estimating Dynamic State
Preferences from United Nations Voting Data. Manuscript, Georgetown University.
Baiocchi, Michael, Jing Cheng, and Dylan S. Small. 2014. Tutorial in Biostatistics: Instru-
mental Variable Methods for Causal Inference. Statistics in Medicine 33(13): 2297-2340.
Baltagi, Badi H. 2005. Econometric Analysis of Panel Data, 3rd edition. New York: Wiley.
Banerjee, Abhijit Vinayak and Esther Duflo. 2011. Poor Economics: A Radical Rethinking
of the Way to Fight Global Poverty. Public Affairs.
Bartels, Larry M. 2008. Unequal Democracy: The Political Economy of the New Gilded Age.
Princeton, NJ: Princeton University Press.
Beck, Nathaniel and Jonathan N. Katz. 1996. Nuisance vs. Substance: Specifying and
Estimating Time-Series-Cross-Section Models. Political Analysis 6: 1-36.
Beck, Nathaniel and Jonathan N. Katz. 2011. Modeling Dynamics in Time-Series Cross-
Section Political Economy Data. Annual Review of Political Science 14: 331-352.
Beck, Nathaniel. 2010. Making Regression and Related Output More Helpful to Users. The
Political Methodologist 18(1): 4-9.
Berk, Richard A., Alec Campbell, Ruth Klap, and Bruce Western. 1992. The Deterrent
Effect of Arrest in Incidents of Domestic Violence: A Bayesian Analysis of Four Field
Experiments. American Sociological Review 57(5): 698-708.
Bertrand, Marianne, Esther Duflo and Sendhil Mullainathan. 2004. How Much Should We
Trust Differences-In-Differences Estimates? Quarterly Journal of Economics 119(1): 249-
275.
Bertrand, Marianne and Sendhil Mullainathan. 2004. Are Emily and Greg More Employable
than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. American
Economic Review 94(4): 991-1013.
Blinder, Alan S. and Mark W. Watson. 2013. Presidents and the Economy: A Forensic
Investigation. Manuscript, Princeton University.
Bloom, Howard S. 2012. Modern Regression Discontinuity Analysis. Journal of Research on
Educational Effectiveness 5(1): 43-82.
Bound, John, David Jaeger, and Regina Baker. 1995. Problems with Instrumental Vari-
ables Estimation When the Correlation Between the Instruments and the Endogenous
Explanatory Variable is Weak. Journal of the American Statistical Association 90(430):
443-450.
Box, George E.P. 1976. Science and Statistics. Journal of the American Statistical Associa-
tion 71(356): 791-799.
Box-Steffensmeier, Janet M. and Bradford S. Jones. 2004. Event History Modeling: A Guide
for Social Scientists. Cambridge, England: Cambridge University Press.
Box-Steffensmeier, Janet M., John R. Freeman, Matthew P. Hitt and Jon C. W. Pevehouse.
2014. Time Series Analysis for the Social Sciences. Cambridge, England: Cambridge
University Press.
Bradford-Hill, Austin. 1965. The Environment and Disease: Association or Causation?
Proceedings of the Royal Society of Medicine. 58(5): 295-300.
Brambor, Thomas, William Roberts Clark, and Matt Golder. 2006. Understanding Interac-
tion Models: Improving Empirical Analyses. Political Analysis 14: 63-82.
Braumoeller, Bear F. 2004. Hypothesis Testing and Multiplicative Interaction Terms. In-
ternational Organization 58(4): 807-820.
Brown, Peter C., Henry L. Roediger III, and Mark A. McDaniel. 2014. Making it Stick: the
Science of Successful Learning. Cambridge, MA: Harvard University Press.
Brownlee, Shannon and Jeanne Lenzer. 2009. Does the Vaccine Matter? The Atlantic
November. www.theatlantic.com/doc/200911/brownlee-h1n1/2
Buckles, Kasey and Dan Hungerman. 2013. Season of Birth and Later Outcomes: Old
Questions, New Answers. The Review of Economics and Statistics 95(3): 711-724.
Buddlemeyer, Hielke and Emmanuel Skofias. 2003. An Evaluation on the Performance of
Clarke, Kevin A. 2005. The Phantom Menace: Omitted Variable Bias in Econometric Re-
search. Conflict Management and Peace Science 22(4): 341-352. [http://www.rochester.
edu/college/psc/clarke/CMPSOmit.pdf]
Comiskey, Michael and Lawrence C. Marsh. 2012. Presidents, Parties, and the Business
Cycle, 1949-2009. Presidential Studies Quarterly 42(1): 40-59.
Cook, Thomas. 2008. Waiting for Life to Arrive: A history of the Regression Discontinuity
Design in Psychology, Statistics and Economics. Journal of Econometrics 142(2): 636-
654.
Currie, Janet and Jonathan Gruber. 1996. Saving Babies: The Efficacy and Cost of Recent
Changes in the Medicaid Eligibility of Pregnant Women. Journal of Political Economy
104(6): 1263-1296.
Cragg, John G. 1994. Making Good Inferences from Bad Data. Canadian Journal of Eco-
nomics 27(4): 776-800.
Das, Mitali, Whitney K. Newey, and Francis Vella. 2003. Nonparametric Estimation of
Sample Selection Models. The Review of Economic Studies 70(1): 33-58.
De Boef, Suzanna and Luke Keele. 2008. Taking Time Seriously. American Journal of
Political Science 52(1): 184-200.
DiazGranados, Carlos A., Martine Denis, Stanley Plotkin. 2012. Seasonal Influenza Vaccine
Efficacy and Its Determinants in Children and Non-elderly Adults: A Systematic Review
with Meta-analyses of Controlled Trials. Vaccine 31(1): 49-57.
Duflo, Esther, Rachel Glennerster, and Michael Kremer. 2008. Using Randomization in
Development Economics Research: A Toolkit. In T. Schultz and John Strauss, eds.,
Handbook of Development Economics Vol. 4. Amsterdam and New York: North Holland.
Dunning, Thad. 2012. Natural Experiments in the Social Sciences: A Design-Based Ap-
proach. Cambridge, England: Cambridge University Press.
Drum, Kevin. 2013a. America’s Real Criminal Element: Lead - New Research Finds Pb
is the Hidden Villain Behind Violent Crime, Lower IQs, and Even the ADHD Epidemic.
Mother Jones January/February.
Drum, Kevin. 2013b. Crime Is at its Lowest Level in 50 Years. A Simple Molecule May Be
the Reason Why. At Mother Jones.com blog, January 3 at http://www.motherjones.
com/kevin-drum/2013/01/lead-crime-connection.
Drum, Kevin. 2013c. Lead and Crime: A Response to Jim Manzi. At Mother Jones.com
blog, January 12 at http://www.motherjones.com/kevin-drum/2013/01
/lead-and-crime-response-jim-manzi.
Dynarski, Susan. 2000. Hope for Whom? Financial Aid for the Middle Class and Its Impact
on College Attendance. National Tax Journal 53 (3, part 2): 629- 662.
Elwert, Felix and Christopher Winship. 2014. Endogenous Selection Bias: The Problem of
Conditioning on a Collider Variable. Annual Review of Sociology. 40(1): 31-53.
Erikson, Robert S. and Thomas R. Palfrey. 2000. Equilibrium in Campaign Spending
Games: Theory and Data. American Political Science Review 94(3): 595-610.
Fearon, James D. and David D. Laitin. 2003. Ethnicity, Insurgency, and Civil War. Ameri-
can Political Science Review 97(1): 75-90.
Finkelstein, Amy, Sarah Taubman, Bill Wright, Mira Bernstein, Jonathan Gruber, Joseph P.
Newhouse, Heidi Allen, Katherine Baicker, and the Oregon Health Study Group. 2012.
The Oregon Health Insurance Experiment: Evidence from the First Year. Quarterly
Journal of Economics 127(3): 1057-1106.
Gaubatz, Kurt Taylor. 2015. A Survivor’s Guide to R: An Introduction for the Uninitiated
and the Unnerved. Los Angeles: Sage.
Gerber, Alan S. and Donald P. Green. 2012. Field Experiments: Design, Analysis, and
Interpretation. New York: W.W. Norton & Company.
Gerber, Alan S. and Donald P. Green. 2000. The Effects of Canvassing, Telephone Calls
and Direct Mail on Voter Turnout: A Field Experiment. The American Political Science
Review 94(3): 653-663.
Gerber, Alan S., and Donald P. Green. 2005. Correction to Gerber and Green (2000),
Replication of Disputed Findings, and Reply to Imai (2005). American Political Science
Review 99(2): 301-13.
Gertler, Paul. 2004. Do Conditional Cash Transfers Improve Child Health? Evidence from
PROGRESA’s Control Randomized Experiment. American Economic Review 94(2): 336-
41.
Gimpel, James G, Francis E. Lee, and Rebecca U. Thorpe. 2010. The Distributive Politics
of the Federal Stimulus: The Geography of the American Recovery and Reinvestment Act
of 2009. Paper presented at American Political Science Association Meetings.
Goldberger, Arthur S. 1991. A Course in Econometrics. Cambridge, Massachusetts: Har-
vard University Press.
Gormley, William T., Jr., Deborah Phillips, and Ted Gayer. 2008. Preschool Programs Can
Boost School Readiness. Science 320 (5884): 1723-24.
Green, Donald P., Soo Yeon Kim, and David H. Yoon. 2001. Dirty Pool. International
Organization 55(2): 441-468.
Green, Joshua. 2012. The Science Behind Those Obama Campaign E-Mails. Business Week
(November 29). Accessed from http://www.businessweek.com/articles/2012-11-29/
the-science-behind-those-obama-campaign-e-mails.
Greene, William. 2003. Econometric Analysis. New York: Prentice Hall.
Greene, William. 2008. Econometric Analysis. New York: Prentice Hall.
Grimmer, Justin, Eitan Hersh, Brian Feinstein, and Daniel Carpenter. 2010. Are Close
Elections Randomly Determined? Manuscript, Stanford University.
Hanmer, Michael J. and Kerem Ozan Kalkan. 2013. Behind the Curve: Clarifying the
Best Approach to Calculating Predicted Probabilities and Marginal Effects from Limited
Dependent Variable Models. American Journal of Political Science 57(1): 263-277.
Hanushek, Eric A. and Ludger Woessmann. 2009. Do Better Schools Lead to More Growth?
Cognitive Skills, Economic Outcomes, and Causation. NBER Working Paper 14633.
Harvey, Anna. 2011. What’s So Great About Independent Courts? Rethinking Crossna-
tional Studies of Judicial Independence. Manuscript, New York University. Available at
http://politics.as.nyu.edu/docs/IO/2787/HarveyJI.pdf.
Hausman, Jerry A. and William E. Taylor. 1981. Panel Data and Unobservable Individual
Effects. Econometrica 49(6): 1377-1398.
Heckman, James J. 1979. Sample Selection Bias as a Specification Error. Econometrica
47(1): 153-161.
Herndon, Thomas, Michael Ash, and Robert Pollin. 2014. Does high public debt consis-
tently stifle economic growth? A critique of Reinhart and Rogoff. Cambridge Journal of
Economics 38(2): 257-279.
Howell, William G. and Paul E. Peterson. 2004. The Use of Theory in Randomized Field
Trials: Lessons from School Voucher Research on Disaggregation, Missing Data, and the
Generalization of Findings. The American Behavioral Scientist 47(5): 634-657.
Imai, Kosuke. 2005. Do Get-Out-The-Vote Calls Reduce Turnout? The Importance of
Statistical Methods for Field Experiments. American Political Science Review 99(2):
283-300.
Imai, Kosuke, Gary King, and Elizabeth A. Stuart. 2008. Misunderstandings among Exper-
imentalists and Observationalists about Causal Inference. Journal of the Royal Statistical
Society, Series A (Statistics in Society) 171(2): 481-502.
Imbens, Guido W. 2014. Instrumental Variables: An Econometrician’s Perspective. IZA
Discussion Paper No. 8048.
Imbens, Guido W. and Thomas Lemieux. 2008. Regression Discontinuity Designs: A Guide
to Practice. Journal of Econometrics 142(2): 615-635.
Iqbal, Zaryab and Christopher Zorn. 2008. The Political Consequences of Assassination,
Journal of Conflict Resolution 52(3): 385-400.
Jackman, Simon. 2009. Bayesian Analysis for the Social Sciences. New York: Wiley.
Jacobsmeier, Matthew L. and Daniel G. Lewis. 2013. Barking up the Wrong Tree: Why Bo
Didn’t Fetch Many Votes for Barack Obama in 2012. PS 46(1): 49-59.
Jacobson, Gary C. 1978. Effects of Campaign Spending in Congressional Elections. Ameri-
can Political Science Review 72(2): 469-491.
Kalla, Joshua L. and David E. Broockman. 2014. Congressional Officials Grant Access
Due To Campaign Contributions: A Randomized Field Experiment. Manuscript, Yale
University.
Kam, Cindy D. and Robert J. Franceze, Jr. 2007. Modeling and Interpreting Interactive
Hypotheses in Regression Analysis. Ann Arbor, MI: University of Michigan Press.
Kastellec, Jonathan P. and Eduardo L. Leoni. 2007. Using Graphs Instead of Tables in
Political Science. Perspectives on Politics 5(4): 755-771.
Keele, Luke and David Park. 2006. Difficult Choices: An Evaluation of Heterogenous Choice
Models. Manuscript, Ohio State University.
Keele, Luke and Nathan J. Kelly. 2006. Dynamic Models for Dynamic Theories: The Ins
and Outs of Lagged Dependent Variables. Political Analysis 14: 186-205.
Kennedy, Peter. 2008. A Guide to Econometrics, 6th edition. Malden, MA: Blackwell
Publishing.
Khimm, Suzy. 2010. Who Is Alvin Greene? Mother Jones Jun. 8 accessed at http:
//motherjones.com/mojo/2010/06/alvin-greene-south-carolina.
King, Gary. 1991. Truth is Stranger than Prediction, More Questionable than Causal
Inference. American Journal of Political Science 35(4): 1047-1053.
King, Gary. 1995. Replication, Replication. PS: Political Science and Politics 28(3): 444-
452.
King, Gary and Langche Zeng. 2001. Logistic Regression in Rare Events Data. Political
Analysis 9: 137-163.
King, Gary, Robert Keohane, and Sidney Verba. 1994. Designing Social Inquiry: Scientific
Inference in Qualitative Research Princeton, NJ: Princeton University Press.
Kiviet, Jan F. 1995. On bias, inconsistency, and efficiency of various estimators in dynamic
panel data models. Journal of Econometrics 68(1): 53-78.
Klick, Jonathan and Alexander Tabarrok. 2005. Using Terror Alert Levels to Estimate the
Effect of Police on Crime. Journal of Law and Economics 48(1): 267-79.
Koppell, Jonathan G. S. and Jennifer A. Steen. 2004. The Effects of Ballot Position on
Election Outcomes. The Journal of Politics 66(1): 267-281.
La Porta, Rafael, F. Lopez-de-Silanes, C. Pop-Eleches, and A. Shleifer. 2004. Judicial
Checks and Balances. Journal of Political Economy 112(2): 445-470.
Lee, David S. 2008. Randomized Experiments from Non-random Selection in U.S. House
Elections. Journal of Econometrics 142(2): 675-697.
Lee, David S. 2009. Training, Wages, and Sample Selection: Estimating Sharp Bounds on
Treatment Effects. Review of Economic Studies 76(3): 1071-1102.
Lee, David S. and Thomas Lemieux. 2010. Regression Discontinuity Designs in Economics.
Journal of Economic Literature 48(2): 281-355.
Lerman, Amy E. 2009. The People Prisons Make: Effects of Incarceration on Criminal
Psychology. In Do Prisons Make Us Safer, ed. Steve Raphael and Michael Stoll. New
York: Russell Sage Foundation.
Levitt, Steven D. 1997. Using Electoral Cycles in Police Hiring to Estimate the Effect of
Police on Crime. American Economic Review 87(3): 270-290.
Levitt, Steven D. 2002. Using Electoral Cycles in Police Hiring to Estimate the Effect of
Police on Crime: A Reply. American Economic Review 92(4): 1244-250.
Lochner, Lance, and Enrico Moretti. 2004. The Effect of Education on Crime: Evidence
from Prison Inmates, Arrests, and Self-Reports. American Economic Review 94(1): 155-189.
Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables.
London: Sage Publications.
Lorch, Scott A., Michael Baiocchi, Corinne S. Ahlberg, and Dylan E. Small. 2012. The
Differential Impact of Delivery Hospital on the Outcomes of Premature Infants. Pediatrics
130(2): 270-278.
Ludwig, Jens and Douglass L. Miller. 2007. Does Head Start Improve Children’s Life
Chances? Evidence from a Regression Discontinuity Design. The Quarterly Journal of
Economics 122(1): 159-208.
Lumley, Thomas, Paula Diehr, Scott Emerson, and Lu Chen. 2002. The Importance of the
Normality Assumption in Large Public Health Data Sets. Annual Review of Public Health
23: 151-69.
Madestam, Andreas, Daniel Shoag, Stan Veuger, and David Yanagizawa-Drott. 2013. Do
Political Protests Matter? Evidence from the Tea Party Movement. June 29 version from
http://www.hks.harvard.edu/fs/dyanagi/Research/TeaParty_Protests.pdf.
Makowsky, Michael and Thomas Stratmann. 2009. Political Economy at Any Speed: What
Determines Traffic Citations? The American Economic Review 99(1): 509-527.
Malkiel, Burton G. 2003. A Random Walk Down Wall Street: The Time-Tested Strategy for
Successful Investing. New York: W.W. Norton.
Manning, Willard G., Joseph P. Newhouse, Naihua Duan, Emmett B. Keeler, and Arleen
Leibowitz. 1987. Health Insurance and the Demand for Medical Care: Evidence from a
Randomized Experiment. American Economic Review 77(3): 251-277.
Manzi, Jim. 2012. Uncontrolled: The Surprising Payoff of Trial-and-Error for Business,
Politics and Society. Basic Books.
Marvell, Thomas B and Carlisle E. Moody. 1996. Specification Problems, Police Levels and
Crime Rates. Criminology 34(4): 609-646.
McClellan, Chandler B. and Erdal Tekin. 2012. Stand Your Ground Laws and Homicides.
NBER Working Paper 18187.
McCrary, Justin. 2002. Using Electoral Cycles in Police Hiring to Estimate the Effect of
Police on Crime: Comment. The American Economic Review 92(4): 1236-1243.
McCrary, Justin. 2008. Manipulation of the Running Variable in the Regression Disconti-
nuity Design: A Density Test. Journal of Econometrics 142(2): 698-714.
Miguel, Edward and Michael Kremer. 2004. Worms: Identifying Impacts on Education and
Health in the Presence of Treatment Externalities. Econometrica 72(1): 159-217.
Miguel, Edward, Shanker Satyanath, and Ernest Sergenti. 2004. Economic Shocks and Civil
Conflict: An Instrumental Variables Approach. Journal of Political Economy 112(4):
725-753.
Morgan, Stephen L. and Christopher Winship. 2014. Counterfactuals and Causal Infer-
ence: Methods and Principles for Social Research. Second edition. Cambridge, England: Cambridge University Press.
Effect of Cell Phone Coverage on Political Violence in Africa. American Political Science
Review. 107(2): 207-224
Poole, Keith and Howard Rosenthal. 1997. Congress: A Political-Economic History of Roll
Call Voting. Oxford: Oxford University Press.
Reinhart, Carmen M. and Kenneth S. Rogoff. 2010. Growth in a Time of Debt. American
Economic Review: Papers & Proceedings 100(2): 573-578.
Reyes, Jessica Wolpaw. 2007. Environmental Policy as Social Policy? The Impact of
Childhood Lead Exposure on Crime. NBER Working Paper 13097.
Rice, John A. 2007. Mathematical Statistics and Data Analysis. Cengage Learning.
Roach, Michael A. 2013. Mean Reversion or a Breath of Fresh Air? The Effect of NFL
Coaching Changes on Team Performance in the Salary Cap Era. Applied Economics
Letters 20(17): 1553-1556.
Romer, Christina D. 2011. What Do We Know about the Effects of Fiscal Policy? Separating
Evidence from Ideology. Talk at Hamilton College (November 7) available at http:
//elsa.berkeley.edu/~cromer/WrittenVersionofEffectsofFiscalPolicy.pdf.
Rossin-Slater, Maya, Christopher J. Ruhm, and Jane Waldfogel. 2013. The Effects of
California's Paid Family Leave Program on Mothers' Leave-Taking and Subsequent Labor
Market Outcomes. Journal of Policy Analysis and Management 32(2): 224-245.
Schrodt, Phil. 2010. Seven Deadly Sins of Contemporary Quantitative Political Science.
Paper presented at American Political Science Association Meetings.
Shiner, Meredith. 2010. Alvin Greene: Born to Be President. POLITICO (November 17) ac-
cessed at http://www.politico.com/news/stories/1110/45268.html#ixzz1jLDAM0uo.
Sides, John and Lynn Vavreck. 2013. The Gamble: Choice and Chance in the 2012 Presi-
dential Election. Princeton, NJ: Princeton University Press.
Snipes, Jeffrey B. and Edward R. Maguire. 1995. Country Music, Suicide, and Spuriousness.
Social Forces 74(1): 327-329.
Solnick, Sara J. and David Hemenway. 2011. The ‘Twinkie Defense’: the relationship
between carbonated non-diet soft drinks and violence perpetration among Boston high
school students. Injury Prevention 2011-040117.
Sovey, Allison J. and Donald P. Green. 2011. Instrumental Variables Estimation in Political
Science: A Reader’s Guide. American Journal of Political Science 55(1): 188-200.
Stack, Steven and Jim Gundlach. 1992. The Effect of Country Music on Suicide. Social
Forces 71(1): 211-218.
Staiger, Douglas and James H. Stock. 1997. Instrumental Variables Regressions with Weak
Instruments. Econometrica. 65(3): 557-86.
Stock, James H and Mark W. Watson. 2011. Introduction to Econometrics. Third edition.
Boston: Addison-Wesley.
Swirl. 2014. Swirl: statistics with interactive R learning. [computer software package]
available at http://swirlstats.com/index.html.
GLOSSARY
ABC issues Three issues that every experiment needs to address: attrition, balance, and
compliance. 481
adjusted R² The R² with a penalty for the number of variables included in the model.
Widely reported, but rarely useful. 231
alternative hypothesis An alternative hypothesis is what we accept if we reject the null.
It’s not something that we are proving (given inherent statistical uncertainty) but it is
the idea we hang onto if we reject the null. 140
AR(1) model An autoregressive model in which the dependent variable depends on its
value in the previous period. Contrasted to, for example, an AR(2) model, which
includes the value from the previous period and the value from the period before that.
AR(1) models are often used to model correlated errors in time series data. 664
assignment variable An assignment variable is relevant in regression discontinuity analy-
sis. Such a variable determines whether or not someone receives some treatment. People
with values of the assignment variable above some cutoff receive the treatment; people
with values of the assignment variable less than the cutoff do not receive the treatment.
545
attenuation bias A form of bias in which the estimated coefficient is closer to zero than it
should be. Measurement error in the independent variable causes attenuation bias. 223
attrition Attrition occurs when people drop out of an experiment altogether such that we
do not observe the dependent variable for them. 515
augmented Dickey-Fuller test A test for unit root for time series data that includes a
time trend and lagged values of the change in the variable as independent variables.
692
autocorrelation Errors are autocorrelated if the error from one observation is correlated
with the error of another. One of the assumptions necessary to use the standard equation
for variance of OLS estimates is that errors are not autocorrelated. Autocorrelation is
common in time series data. 105
autoregressive model A time series model in which the dependent variable is a function
of previous values of the dependent variable. Autocorrelation is often modeled with
an autoregressive model such that the error term is a function of previous error terms.
Dynamic models are also autoregressive models in that the dependent variable depends
on lagged values of the dependent variable. 660
auxiliary regression An auxiliary regression is a regression that is not directly the one
of interest, but is related and yields information helpful in analyzing the equation we
really care about. 210
balance In experiments, treatment and control groups are balanced if the distributions of
variables are the same for the treatment and control groups. 484
bias A biased coefficient estimate will systematically be higher or lower than the true value.
87
binned graphs Binned graphs are used in regression discontinuity analysis. The assign-
ment variable is divided into bins and for each bin, the average value of the dependent
variable is plotted. These are useful to visualize a discontinuity at the treatment cut-
off. Binned graphs also are useful to identify possible non-linearities in the relationship
between the assignment variable and the dependent variable. 562
blocking Blocking involves picking treatment and control groups so that they are equal in
covariates. 482
categorical variable A variable that has two or more categories, but which does not have
an intrinsic ordering. Also known as a nominal variable. 277
CDF See cumulative distribution function. 608
Central Limit Theorem The mean of a sufficiently large number of independent draws
from any distribution will be normally distributed. Because OLS estimates are weighted
averages, the Central Limit Theorem implies the distribution of β̂₁ will be normally
distributed. 84
ceteris paribus All else equal. A phrase used when describing multivariate regression re-
sults: a coefficient indicates the change in the dependent variable associated with a change
in one independent variable, with all other independent variables held constant. 197
codebook A file that describes sources for variables and any adjustments made. A codebook
is a necessary element of a replication file. 47
compliance A compliance problem occurs when subjects assigned to an experimental treat-
ment do not actually experience the treatment, often because they opt out in some way.
491
confidence interval A confidence interval defines the range of true values that are consis-
tent with the observed coefficient estimate. Confidence intervals depend on the point
estimate, β̂₁, and the measure of uncertainty, se(β̂₁). 177
confidence levels A term used when referring to confidence intervals, based on 1 − α. 179
consistency A consistent estimator is one for which the distribution of the estimate gets
closer and closer to the true value as the sample size increases. For example, the
bivariate OLS estimate β̂₁ consistently estimates β₁ if X is uncorrelated with ε. 100
constant The parameter β₀ in a regression model. It is the point at which a regression line
crosses the Y-axis. It is the expected value of the dependent variable when all independent
variables equal zero. Also referred to as the intercept. 7
continuous variable A variable that takes on any possible value over some range. Con-
tinuous variables are distinct from discrete variables, which can take on only a limited
number of possible values. 81
control group In an experiment, the group that does not receive the treatment of interest.
32
control variable An independent variable included in a statistical model to control for
some factor that is not the primary factor of interest. 203
correlation Correlation measures the extent to which two variables are linearly related to
each other. A correlation of 1 indicates the variables move together in a straight line.
A correlation of 0 indicates the variables are not linearly related to each other. A
correlation of -1 indicates the variables move in opposite directions. 17
critical value In hypothesis testing, a value above which a β̂₁ would be so unlikely as to
lead us to reject the null. 153
cross-sectional data Cross-sectional data has observations for multiple units for one time
period. Each observation indicates the value of a variable for a given unit for the same
point in time. Cross-sectional data is typically contrasted to panel and time series data.
658
cumulative distribution function The cumulative distribution function, or CDF, indi-
cates how much of a normal distribution is to the left of any given point. 608
de-meaned approach An approach to estimating fixed effects models for panel data. The
one-way version involves subtracting off average values within units from all variables.
This approach saves us from having to include dummy variables for every unit and
highlights the fact that fixed effects models estimate parameters based on variation
within units, not between them. 380
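A minimal R sketch of the one-way within transformation, assuming a data frame dat with hypothetical variables y, x, and a unit identifier unit:

dat$y_dm <- dat$y - ave(dat$y, dat$unit)   # subtract unit means from y
dat$x_dm <- dat$x - ave(dat$x, dat$unit)   # subtract unit means from x
summary(lm(y_dm ~ x_dm, data = dat))       # within (fixed effects) estimate of the slope;
                                           # reported standard errors ignore the degrees-of-freedom correction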
degrees of freedom The degrees of freedom is the sample size minus the number of param-
eters. It refers to the amount of information we have available to use in the estimation
process. As a practical matter, degrees of freedom corrections produce more uncer-
tainty for smaller sample sizes. The shape of a t distribution depends on the degrees of
freedom. The higher the degrees of freedom, the more a t distribution looks like a normal
distribution. 96
dependent variable The outcome of interest, usually denoted as Y . It is called the de-
pendent variable because its value depends on the values of the independent variables,
parameters and error term. 4
dichotomous Divided into two parts. 591
dichotomous variables A dichotomous variable takes on one of two values, almost always
zero or one, for all observations. Also known as a dummy variable. 257
Dickey-Fuller test A test for unit roots, used in dynamic models. 691
difference of means test Tests that involve comparing the mean of Y for one group (e.g.,
the treatment group) against the mean of Y for another group (e.g., the control group).
They can be conducted with bivariate and multivariate OLS and other statistical pro-
cedures. 257
difference-in-difference model A model that looks at differences in changes in treated
units compared to untreated units. These models are particularly useful in policy
evaluation. 401
discontinuity A discontinuity occurs when a graph of a line has a sudden jump up or down.
541
distribution The range of possible values for a random variable and the associated relative
probabilities for each value. Examples of four distributions are displayed in Figure 3.4.
80
dummy variable A dummy variable equals either zero or one for all observations. Dummy
variables are sometimes referred to as dichotomous variables. 257
dyad A dyad is something that consists of two elements. For some data sets such as a trade
data set, a dyad indicates a pair of countries and the data indicates how much trade
flows between them. 397
dynamic model A dynamic model is a time series model that includes a lagged dependent
variable as an independent variable. Among other differences, the interpretation of
coefficients differs in dynamic models from that in standard OLS models. Sometimes
referred to as an autoregressive model. 678
external validity A research finding is externally valid when it applies beyond the context
in which the analysis was conducted. 36
heteroscedastic A random variable is heteroscedastic if the variance differs for some ob-
servations. For example, observations from one part of the country may be measured
with little error while observations from another part of the country may be measured
with considerable error. Heteroscedasticity violates one of the assumptions necessary
to use the standard equation for variance of OLS estimates. 103
heteroscedasticity-consistent standard errors Standard errors for the coefficients in
OLS that are appropriate even when errors are heteroscedastic. 103
homoscedastic A random variable is homoscedastic if the variance is the same for all ob-
servations. One of the assumptions necessary to use the standard equation for variance
of OLS estimates is that errors are homoscedastic. 102
hypothesis testing A process assessing whether the observed data is consistent or not with
a claim of interest. t tests and F tests are widely used tools in hypothesis testing. 136
jitter A process used when scatterplotting data. A small random number is added to each
observation only for the purposes of plotting. This procedure produces cloud-like images
which overlap less than the unjittered data and hence provide a better sense of the data.
112
lagged variable A lagged variable is a variable with the values from the previous period.
661
latent variable A latent variable for a probit or logit model is an unobserved continuous
variable reflecting the propensity of an individual observation of Yi to equal 1. 602
least squares dummy variable (LSDV) approach An approach to estimating fixed ef-
fects models when analyzing panel data. 378
likelihood ratio test A statistical test for maximum likelihood models that is useful in
testing hypotheses involving multiple coefficients. 632
linear probability model A model used when the dependent variable is dichotomous.
This is an OLS model in which the coefficients are interpreted as the change in proba-
bility of observing Yi = 1 for a one unit change in X. 592
linear-log model A model in which the dependent variable is not transformed by taking
natural log and the independent variable is transformed by taking a natural log. In
such a model, a one percent increase in X is associated with a β₁/100 change in Y. 332
local average treatment effect For instrumental variables models, the local average treat-
ment effect (LATE) is the causal effect only for those people affected by the instrument.
Relevant if the effect of X on Y varies within the population. 468
log likelihood The log likelihood is the log of the probability of observing the Y outcomes
we did given the X data and the β̂s. It is a byproduct of the MLE estimation process. 616
log-linear model A model in which the dependent variable is transformed by taking the
natural log of it. A one unit change in X in a log-linear model is associated with a β₁
percent change in Y (on a 0-to-1 scale). 331
log-log model A model in which the dependent variable and independent variables are
transformed by taking natural log. In these models, a one percent change in X is
associated with a β₁ percent change in Y (on a 0-to-1 scale). 332
logit model A way to analyze data with a dichotomous dependent variable. The error term
in a logit model is logistically distributed. 610
LPM The common short-hand used to describe linear probability models, a type of model
used to estimate models with a dichotomous dependent variable. 592
LR test See Likelihood ratio test. 632
maximum likelihood estimation The estimation process used to generate coefficient es-
timates for probit and logit models, among others. 612
measurement error Measurement error occurs when a variable is measured inaccurately.
If the dependent variable has measurement error, OLS coefficient estimates are unbiased,
but less precise. If an independent variable has measurement error, OLS coefficient
estimates suffer from attenuation bias where the magnitude of the attenuation depends
on how large the measurement error variance is relative to the variance of the variable.
220
MLE The common short-hand used to describe maximum likelihood estimation models.
612
model specification The process of deciding which variables should go in a statistical
model. 240
model-fishing Model-fishing occurs when researchers add and subtract variables until
they get just the answers they were looking for. 241
modeled randomness Variation that occurs due to inherent variation in the data genera-
tion process. This source of randomness exists even when we observe data for an entire
population. 79
monotonicity A condition invoked when discussing instrumental variables models. It re-
quires that the effect of the instrument on the endogenous variable goes in the same
direction for everyone in a population. 469
multicollinearity Variables are multicollinear if they are correlated. The consequence of
multicollinearity is that the variance of β̂₁ will be higher than if there were no multi-
collinearity. Multicollinearity does not cause bias. 227
multivariate OLS OLS with multiple independent variables. 194
nominal variable A variable that has two or more categories, but which does not have
an intrinsic ordering. Also known as a categorical variable. Typical examples include
“region” (north, south, east, west) or “religion” (Catholic, Protestant, Jewish, Muslim,
Other, Secular). 276
normal distribution A normal distribution is a bell-shaped probability density that char-
acterizes the probability of observing outcomes for normally distributed random vari-
ables. Because of the Central Limit Theorem, many statistical quantities are distributed
normally. 83
null hypothesis A hypothesis of no effect. Statistical tests will reject or fail to reject such
hypotheses. The most common null hypothesis is β₁ = 0, written as H₀: β₁ = 0. 138
null result A finding in which the null hypothesis is not rejected. 173
observational study Observational studies use data generated in an environment not con-
trolled by a researcher. They are distinguished from experimental studies and are
sometimes referred to as non-experimental studies. 36
omitted variable bias Bias that results from omitting a variable that affects the dependent
variable and is correlated with the independent variable. 211
one-sided alternative hypothesis An alternative to the null hypothesis that indicates
whether the coefficient (or function of coefficients) is higher or lower than the value
indicated in the null hypothesis. Typically written as H_A: β₁ > 0 or H_A: β₁ < 0. 140
one-way fixed effects model A panel data model that allows for fixed effects at the unit
level. 393
ordinal variable A variable that expresses rank but not necessarily relative size. An ex-
ample of an ordinal variable is one indicating answers to a survey question that is coded
1 = strongly disagree, 2 = disagree, 3 = agree, 4 = strongly agree. 276
outlier An observation that is extremely different from the rest of the sample. 117
overidentification test A test used for 2SLS models when we have more than one in-
strument. The logic of the test is that the estimated coefficient on the endogenous
variable in the second stage equation should be roughly the same when each individual
instrument is used alone. 446
p-value The probability of observing a coefficient as high as we actually did if the null
hypothesis were true. 161
panel data Panel data has observations for a multiple units over time. Each observation
indicates the value of a variable for a given unit at a given point in time. Panel data is
typically contrasted to cross-sectional and time series data. 367
perfect multicollinearity Perfect multicollinearity occurs when an independent variable
is completely explained by other independent variables. 230
plim A widely used abbreviation for probability limit, the value to which an estimator
converges as the sample size gets very, very large. 100
point estimate Point estimates describe our best guess as to what the true values are. 178
polynomial model Models that include values of X raised to powers more than one. A
polynomial model is an example of a non-linear model in which the effect of X on Y
varies depending on the value of X. The fitted values will be defined by a curve. A
quadratic model is an example of a polynomial model. 321, 325
pooled model A pooled model treats all observations as independent observations. Pooled
models contrast with fixed effect models that control for unit-specific or time-specific
fixed effects. 369
power Power refers to the ability of our data to reject the null. A high-powered statistical
test will reject the null with a very high probability when the null is false; a low-powered
statistical test will reject the null with a low probability when the null is false. 168
power curve A curve that characterizes the probability of rejecting the null for each pos-
sible value of the parameter. 171
predicted values The value of Y predicted by our estimated equation. For a bivariate OLS
model it is Ŷ_i = β̂₀ + β̂₁X_i. Also called fitted values. 70
probability density A probability density is a graph or formula that describes the relative
probability a random variable is near a specified value. 81
probability density function A mathematical function that describes the relative prob-
ability for a continuous random variable to take on a given value. 771
probability distribution A probability distribution is a graph or formula that gives the
probability for each possible value of a random variable. 81
probability limit The value to which a distribution converges as the sample size gets very
large. When the error is uncorrelated with the independent variables, the probability
limit of β̂₁ is β₁. The probability limit of a consistent estimator is the true value of the
parameter. 99
probit model A way to analyze data with a dichotomous dependent variable. The key
assumption is that the error term is normally distributed. 606
quadratic model Models that include X and X² as independent variables. The fitted
values will be defined by a curve. A quadratic model is an example of a polynomial
model. 321
quasi-instrument An instrumental variable that is not strictly exogenous, meaning that
there is a non-zero correlation of it and the error term in the equation of interest. 2SLS
using a quasi-instrument may produce a better estimate than OLS if the correlation
of the quasi-instrument and the error in the main equation is small relative to the
correlation of the quasi-instrument and the endogenous variable. 448
random effects model Random effects models treat unit-specific error as a random vari-
able that is uncorrelated with the independent variable. 748
random variable A variable that takes on values in a range and with the probabilities
defined by a distribution. 80
randomization Randomization is the process of determining the experimental value of the
key independent variable based on a random process. If successful, randomization will
ensure that the independent variable is uncorrelated with all variables, including factors
in the error term. 32
RD See regression discontinuity. 544
reduced form equation In a reduced form equation Y1 is only a function of the non-
endogenous variables (which are the X and Z variables, not the Y variables). Used in
simultaneous equation models. 458
reference category When including dummy variables indicating the multiple categories
of a nominal variable, we need to exclude a dummy variable for one of the groups,
which we refer to as the reference category. The coefficients on all the included dummy
variables indicate how much higher or lower the dependent variable is for each group
relative to the reference category. Also referred to as the excluded category. 278
regression discontinuity Regression discontinuity techniques use regression analysis to
identify possible discontinuities at the point some treatment applies. 544
regression line A regression line is the fitted line from a bivariate regression. 70
replication Research that meets a replication standard can be duplicated based on the
information provided at the time of publication. 46
replication file Replication files document how exactly data is gathered and organized.
When done properly, these files allow others to check our work by following our steps
and seeing if they get identical results. 47
residual A residual is the difference between the fitted value and observed value. Graphi-
cally, it is the distance between an estimated line and an observation. Mathematically,
a residual is ε̂_i = Y_i − β̂₀ − β̂₁X_i. An equivalent way to calculate a residual is ε̂_i = Y_i − Ŷ_i. 70
restricted model A restricted model is the model in an F test that imposes the restriction
that the null hypothesis is true. If the fit of the restricted model is much worse than
the fit of the unrestricted model, we infer that the null hypothesis is not true. 343
robust Statistical results are robust if they do not change when the model changes. 49
rolling cross-section data Repeated cross-sections of data from different individuals at
different points in time. An example would be a survey of U.S. citizens each year in
which different citizens are chosen each year. 407
what the unit of X is (be it inches, dollars or years), effects across variables can be
compared because each β̂ represents the effect of a one standard deviation change in X
on Y . 340
stationarity A time series term indicating that a variable has the same distribution through-
out the entire time series. Variables that have persistent trends are nonstationary.
Statistical analysis of nonstationary variables can yield spurious regression results. 683
statistically significant A coefficient is statistically significant when we reject the null
hypothesis that it is zero. In this case, the observed value of the coefficient is a sufficient
number of standard deviations from the value posited in the null hypothesis that we
reject the null hypothesis. 138
substantive significance If a reasonable change in the independent variable is associated
with a meaningful change in the dependent variable, the effect is substantively signif-
icant. Some statistically significant effects are not substantively significant, especially
for large data sets. 176
t distribution A distribution that looks like a normal distribution, but with fatter tails.
The exact shape of the distribution depends on the degrees of freedom. This distribution
converges to a normal distribution for large sample sizes. 150
t statistic The test statistic used in a t test. It is equal to (β̂₁ − β^Null)/se(β̂₁). If the t statistic is
greater than our critical value, we reject the null hypothesis. 157
t test A hypothesis test for hypotheses about a normal random variable with an estimated
standard error. It involves comparing |β̂₁/se(β̂₁)| to a critical value from a t distribution
determined by the chosen significance level (α). For large sample sizes, a t test is closely
approximated by a z test. 147
time series data Time series data has observations for a single unit over time. Each ob-
servation indicates the value of a variable at a given point in time. The data proceed
in order, indicating, for example, annual, monthly, or daily data. Time series data is
typically contrasted to cross-sectional and panel data. 105
treatment group In an experiment, the group that receives the treatment of interest. 32
trimmed data set A trimmed data set is one for which observations are removed in a way
to offset potential bias due to attrition. 518
two-sided alternative hypothesis An alternative to the null hypothesis that indicates
the coefficient (or function of coefficients) is higher or lower than the value indicated in
the null hypothesis. Typically written as H_A: β₁ ≠ 0. 140
two-stage least squares Two-stage least squares uses exogenous variation in X to estimate
the effect of X on Y . In the first stage, we estimate a model in which the endogenous
independent variable is the dependent variable and the instrument, Z, is an independent
variable. In the second stage, we estimate a model in which the fitted values from the
first stage, X̂1i , is an independent variable. 425
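A sketch of the two stages in R, with hypothetical names y (outcome), x (endogenous regressor), and z (instrument) in a data frame dat; in practice a dedicated 2SLS routine should be used because the second-stage standard errors below are not the correct 2SLS standard errors:

stage1    <- lm(x ~ z, data = dat)      # first stage: endogenous variable on the instrument
dat$x_hat <- fitted(stage1)             # fitted values
stage2    <- lm(y ~ x_hat, data = dat)  # second stage: outcome on the fitted values
coef(stage2)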
two-way fixed effects model A panel data model that allows for fixed effects at the unit
and time levels. 393
Type I error A hypothesis testing error that occurs when we reject a null hypothesis even
when it is true. 139
Type II error A hypothesis testing error that occurs when we fail to reject a null hypothesis
even when it is false. 139
unbiased estimator An unbiased coefficient estimate will on average equal the true value of
the parameter. An unbiased estimator can produce individual estimates that are quite
incorrect; on average, though, the too low estimates are probabilistically balanced by
too high estimates for unbiased estimators. OLS produces unbiased parameter estimates
if the independent variables are uncorrelated with the error term. 87
unit root A variable with a unit root has a coefficient equal to one on the lagged variable.
A variable with a unit root is nonstationary and must be modeled differently than a
stationary variable. 685
unrestricted model An unrestricted model is the model in an F test that imposes no
restrictions on the coefficients. If the fit of the restricted model is much worse than the
fit of the unrestricted model, we infer that the null hypothesis is not true. 343
variance Variance is a measure of how much a random variable varies. In graphical terms,
the variance of a random variable characterizes how wide the distribution is. 93
variance inflation factor A measure of how much variance is inflated due to multicollinear-
ity. It can be estimated for each variable and is equal to 1/(1 − R²_j), where R²_j is from an
auxiliary regression in which X_j is the dependent variable and all other independent
variables from the main equation are included as independent variables. 227
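A sketch of the calculation in R for a hypothetical variable X1 with other regressors X2 and X3 in a data frame dat:

aux    <- lm(X1 ~ X2 + X3, data = dat)      # auxiliary regression for X1
vif_X1 <- 1 / (1 - summary(aux)$r.squared)  # variance inflation factor for X1
vif_X1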
variance of the regression The variance of the regression measures how well the model
explains variation in the dependent variable. For large samples, it is estimated as
σ̂² = (1/N) Σ_{i=1}^{N} (Y_i − Ŷ_i)². 95
weak instrument A weak instrument is an instrumental variable that adds little explana-
tory power to the first stage regression in a 2SLS analysis. 450
window A window is the range of observations we analyze in a regression discontinuity
analysis. The smaller the window, the less we need to worry about non-linear functional
forms. 561
z test A hypothesis test involving comparison of a test statistic to a critical value based
on a normal distribution. Examples include hypothesis tests for maximum likelihood
estimation models and tests of hypotheses about a normal random variable with a
known standard error. 613