Monographs on Statistics and Applied Probability 105

Measurement Error
in Nonlinear Models
A Modern Perspective
Second Edition
MONOGRAPHS ON STATISTICS AND APPLIED PROBABILITY

General Editors
V. Isham, N. Keiding, T. Louis, S. Murphy, R. L. Smith, and H. Tong

1 Stochastic Population Models in Ecology and Epidemiology M.S. Bartlett (1960)
2 Queues D.R. Cox and W.L. Smith (1961)
3 Monte Carlo Methods J.M. Hammersley and D.C. Handscomb (1964)
4 The Statistical Analysis of Series of Events D.R. Cox and P.A.W. Lewis (1966)
5 Population Genetics W.J. Ewens (1969)
6 Probability, Statistics and Time M.S. Bartlett (1975)
7 Statistical Inference S.D. Silvey (1975)
8 The Analysis of Contingency Tables B.S. Everitt (1977)
9 Multivariate Analysis in Behavioural Research A.E. Maxwell (1977)
10 Stochastic Abundance Models S. Engen (1978)
11 Some Basic Theory for Statistical Inference E.J.G. Pitman (1979)
12 Point Processes D.R. Cox and V. Isham (1980)
13 Identification of Outliers D.M. Hawkins (1980)
14 Optimal Design S.D. Silvey (1980)
15 Finite Mixture Distributions B.S. Everitt and D.J. Hand (1981)
16 Classification A.D. Gordon (1981)
17 Distribution-Free Statistical Methods, 2nd edition J.S. Maritz (1995)
18 Residuals and Influence in Regression R.D. Cook and S. Weisberg (1982)
19 Applications of Queueing Theory, 2nd edition G.F. Newell (1982)
20 Risk Theory, 3rd edition R.E. Beard, T. Pentikäinen and E. Pesonen (1984)
21 Analysis of Survival Data D.R. Cox and D. Oakes (1984)
22 An Introduction to Latent Variable Models B.S. Everitt (1984)
23 Bandit Problems D.A. Berry and B. Fristedt (1985)
24 Stochastic Modelling and Control M.H.A. Davis and R. Vinter (1985)
25 The Statistical Analysis of Composition Data J. Aitchison (1986)
26 Density Estimation for Statistics and Data Analysis B.W. Silverman (1986)
27 Regression Analysis with Applications G.B. Wetherill (1986)
28 Sequential Methods in Statistics, 3rd edition G.B. Wetherill and K.D. Glazebrook (1986)
29 Tensor Methods in Statistics P. McCullagh (1987)
30 Transformation and Weighting in Regression R.J. Carroll and D. Ruppert (1988)
31 Asymptotic Techniques for Use in Statistics O.E. Barndorff-Nielsen and D.R. Cox (1989)
32 Analysis of Binary Data, 2nd edition D.R. Cox and E.J. Snell (1989)
33 Analysis of Infectious Disease Data N.G. Becker (1989)
34 Design and Analysis of Cross-Over Trials B. Jones and M.G. Kenward (1989)
35 Empirical Bayes Methods, 2nd edition J.S. Maritz and T. Lwin (1989)
36 Symmetric Multivariate and Related Distributions K.T. Fang, S. Kotz and K.W. Ng (1990)
37 Generalized Linear Models, 2nd edition P. McCullagh and J.A. Nelder (1989)
38 Cyclic and Computer Generated Designs, 2nd edition J.A. John and E.R. Williams (1995)
39 Analog Estimation Methods in Econometrics C.F. Manski (1988)
40 Subset Selection in Regression A.J. Miller (1990)
41 Analysis of Repeated Measures M.J. Crowder and D.J. Hand (1990)
42 Statistical Reasoning with Imprecise Probabilities P. Walley (1991)
43 Generalized Additive Models T.J. Hastie and R.J. Tibshirani (1990)
44 Inspection Errors for Attributes in Quality Control N.L. Johnson, S. Kotz and X. Wu (1991)
45 The Analysis of Contingency Tables, 2nd edition B.S. Everitt (1992)
46 The Analysis of Quantal Response Data B.J.T. Morgan (1992)
47 Longitudinal Data with Serial Correlation—A State-Space Approach R.H. Jones (1993)
48 Differential Geometry and Statistics M.K. Murray and J.W. Rice (1993)
49 Markov Models and Optimization M.H.A. Davis (1993)
50 Networks and Chaos—Statistical and Probabilistic Aspects O.E. Barndorff-Nielsen, J.L. Jensen and W.S. Kendall (1993)
51 Number-Theoretic Methods in Statistics K.-T. Fang and Y. Wang (1994)
52 Inference and Asymptotics O.E. Barndorff-Nielsen and D.R. Cox (1994)
53 Practical Risk Theory for Actuaries C.D. Daykin, T. Pentikäinen and M. Pesonen (1994)
54 Biplots J.C. Gower and D.J. Hand (1996)
55 Predictive Inference—An Introduction S. Geisser (1993)
56 Model-Free Curve Estimation M.E. Tarter and M.D. Lock (1993)
57 An Introduction to the Bootstrap B. Efron and R.J. Tibshirani (1993)
58 Nonparametric Regression and Generalized Linear Models P.J. Green and B.W. Silverman (1994)
59 Multidimensional Scaling T.F. Cox and M.A.A. Cox (1994)
60 Kernel Smoothing M.P. Wand and M.C. Jones (1995)
61 Statistics for Long Memory Processes J. Beran (1995)
62 Nonlinear Models for Repeated Measurement Data M. Davidian and D.M. Giltinan (1995)
63 Measurement Error in Nonlinear Models R.J. Carroll, D. Ruppert and L.A. Stefanski (1995)
64 Analyzing and Modeling Rank Data J.J. Marden (1995)
65 Time Series Models—In Econometrics, Finance and Other Fields D.R. Cox, D.V. Hinkley and O.E. Barndorff-Nielsen (1996)
66 Local Polynomial Modeling and its Applications J. Fan and I. Gijbels (1996)
67 Multivariate Dependencies—Models, Analysis and Interpretation D.R. Cox and N. Wermuth (1996)
68 Statistical Inference—Based on the Likelihood A. Azzalini (1996)
69 Bayes and Empirical Bayes Methods for Data Analysis B.P. Carlin and T.A. Louis (1996)
70 Hidden Markov and Other Models for Discrete-Valued Time Series I.L. MacDonald and W. Zucchini (1997)
71 Statistical Evidence—A Likelihood Paradigm R. Royall (1997)
72 Analysis of Incomplete Multivariate Data J.L. Schafer (1997)
73 Multivariate Models and Dependence Concepts H. Joe (1997)
74 Theory of Sample Surveys M.E. Thompson (1997)
75 Retrial Queues G. Falin and J.G.C. Templeton (1997)
76 Theory of Dispersion Models B. Jørgensen (1997)
77 Mixed Poisson Processes J. Grandell (1997)
78 Variance Components Estimation—Mixed Models, Methodologies and Applications P.S.R.S. Rao (1997)
79 Bayesian Methods for Finite Population Sampling G. Meeden and M. Ghosh (1997)
80 Stochastic Geometry—Likelihood and Computation O.E. Barndorff-Nielsen, W.S. Kendall and M.N.M. van Lieshout (1998)
81 Computer-Assisted Analysis of Mixtures and Applications—Meta-analysis, Disease Mapping and Others D. Böhning (1999)
82 Classification, 2nd edition A.D. Gordon (1999)
83 Semimartingales and their Statistical Inference B.L.S. Prakasa Rao (1999)
84 Statistical Aspects of BSE and vCJD—Models for Epidemics C.A. Donnelly and N.M. Ferguson (1999)
85 Set-Indexed Martingales G. Ivanoff and E. Merzbach (2000)
86 The Theory of the Design of Experiments D.R. Cox and N. Reid (2000)
87 Complex Stochastic Systems O.E. Barndorff-Nielsen, D.R. Cox and C. Klüppelberg (2001)
88 Multidimensional Scaling, 2nd edition T.F. Cox and M.A.A. Cox (2001)
89 Algebraic Statistics—Computational Commutative Algebra in Statistics G. Pistone, E. Riccomagno and H.P. Wynn (2001)
90 Analysis of Time Series Structure—SSA and Related Techniques N. Golyandina, V. Nekrutkin and A.A. Zhigljavsky (2001)
91 Subjective Probability Models for Lifetimes Fabio Spizzichino (2001)
92 Empirical Likelihood Art B. Owen (2001)
93 Statistics in the 21st Century Adrian E. Raftery, Martin A. Tanner, and Martin T. Wells (2001)
94 Accelerated Life Models: Modeling and Statistical Analysis Vilijandas Bagdonavicius and Mikhail Nikulin (2001)
95 Subset Selection in Regression, Second Edition Alan Miller (2002)
96 Topics in Modelling of Clustered Data Marc Aerts, Helena Geys, Geert Molenberghs, and Louise M. Ryan (2002)
97 Components of Variance D.R. Cox and P.J. Solomon (2002)
98 Design and Analysis of Cross-Over Trials, 2nd Edition Byron Jones and Michael G. Kenward (2003)
99 Extreme Values in Finance, Telecommunications, and the Environment Bärbel Finkenstädt and Holger Rootzén (2003)
100 Statistical Inference and Simulation for Spatial Point Processes Jesper Møller and Rasmus Plenge Waagepetersen (2004)
101 Hierarchical Modeling and Analysis for Spatial Data Sudipto Banerjee, Bradley P. Carlin, and Alan E. Gelfand (2004)
102 Diagnostic Checks in Time Series Wai Keung Li (2004)
103 Stereology for Statisticians Adrian Baddeley and Eva B. Vedel Jensen (2004)
104 Gaussian Markov Random Fields: Theory and Applications Håvard Rue and Leonhard Held (2005)
105 Measurement Error in Nonlinear Models: A Modern Perspective, Second Edition Raymond J. Carroll, David Ruppert, Leonard A. Stefanski, and Ciprian M. Crainiceanu (2006)

Monographs on Statistics and Applied Probability 105

Measurement Error in Nonlinear Models
A Modern Perspective
Second Edition

Raymond J. Carroll
David Ruppert
Leonard A. Stefanski
Ciprian M. Crainiceanu

Chapman & Hall/CRC
Taylor & Francis Group
Boca Raton London New York

Chapman & Hall/CRC is an imprint of the Taylor & Francis Group, an informa business
Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2006 by Taylor and Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works


Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1

International Standard Book Number-10: 1-58488-633-1 (Hardcover)


International Standard Book Number-13: 978-1-58488-633-4 (Hardcover)

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.

No part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC) 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com

To our families and friends
Preface to the First Edition

This monograph is about analysis strategies for regression problems in which predictors are measured with error. These problems are commonly known as measurement error modeling or errors-in-variables. There is an enormous literature on this topic in linear regression, as summarized by Fuller (1987). Our interest lies almost exclusively in the analysis of nonlinear regression models, defined generally enough to include generalized linear models, transform-both-sides models, and quasilikelihood and variance function problems.

The effects of measurement error are well known, and we basically assume that the reader understands that measurement error in predictors causes biases in estimated regression coefficients, and hence that the field is largely about correcting for such effects. Chapter 3∗ summarizes much of what is known about the consequences of measurement error for estimating linear regression parameters, although the material is not exhaustive.

Nonlinear errors-in-variables modeling began in earnest in the early 1980s with the publication of a series of papers on diverse topics: Prentice (1982) on survival analysis; Carroll, Spiegelman, Lan, Bailey, and Abbott (1984) and Stefanski and Carroll (1985) on binary regression; Armstrong (1985) on generalized linear models; Amemiya (1985) on instrumental variables; and Stefanski (1985) on estimating equations. David Byar and Mitchell Gail organized a workshop on the topic in 1987 at the National Institutes of Health, which in 1989 was published as a special issue of Statistics in Medicine. Since these early papers, the field has grown dramatically, as evidenced by the bibliography at the end of this book. Unlike the early 1980s, the literature is now so large that it is difficult to understand the main ideas from individual papers. Indeed, a first draft of this book, completed in late 1990, consisted only of the material in four of the first five chapters. Essentially all the rest of the material has been developed since 1990. In a field as rapidly evolving as this one, and with the entrance of many new researchers into the area, we can present but a snapshot of the current state of knowledge.

This book can be divided broadly into four main parts: Chapters 1–2, 3–6, 7–8, and 9–14. In addition, there is Appendix A, a review of relevant fitting methods and statistical models.

The first part is introductory. Chapter 1 gives a number of applications where measurement error is of concern, and defines basic terminology of error structure, data sources and the distinction between functional and structural models. Chapter 2 gives an overview of the important ideas from linear regression, particularly the biases caused by measurement error and some estimation techniques.

The second part gives the basic ideas and techniques of what we call functional modeling, where the distribution of the true predictor is not modeled parametrically. In addition, in these chapters it is assumed that the true predictor is never observable. The focus is on the additive measurement error model, although periodically we describe modifications for the multiplicative error model. Chapters 3 and 4 discuss two broadly applicable functional methods, regression calibration and simulation-extrapolation (SIMEX), which can be thought of as the default approaches. Chapter 5 discusses a broadly based approach to the use of instrumental variables. All three of these chapters focus on estimators which are easily computed but yield only approximately consistent estimates. Chapter 6 is still based on the assumption that the true predictor is never observable, but here we provide functional techniques which are fully and not just approximately consistent. This material is somewhat more daunting in (algebraic) appearance than the approximate techniques, but even so the methods themselves are often easily programmed. Throughout this part of the book, we use examples of binary regression modeling.

The third part of the book concerns structural modeling, meaning that the distribution of the true predictor is parametrically modeled. Chapter 7 describes the likelihood approach to estimation and inference in measurement error models, while Chapter 8 briefly covers Bayesian modeling. Here we become more focused on the distinction between functional and structural modeling, and also describe the measurement error problem as a missing data problem. We also allow for the possibility that the true predictor can be measured in a subset of the study population. The discussion is fully general and applies to categorical data as well as to the additive and multiplicative measurement error models. While at this point the use of structural modeling in measurement error models is not very popular, we believe it will become more so in the very near future.

The fourth part of the book is devoted to more specialized topics. Chapter 9 takes up the study of functional techniques which are applicable when the predictor can be observed in a subset of the study. Chapter 10 discusses functional estimation in models with generalized linear structure and an unknown link function. Chapter 11 describes the effects that measurement error has on hypothesis testing. Nonparametric regression and density function estimation are addressed in Chapter 12. Errors in the response rather than in predictors are described in Chapter 13. In Chapter 14, a variety of topics are addressed briefly: case-control studies, differential measurement error, functional mixture methods, design of two-stage studies and survival analysis.

We have tried to design the text so that it can be read at two levels. Many readers will be interested only in the background material and in the definition of the specific methods that can be employed. These readers will find that the chapters in the middle two parts of the text (functional and structural modeling) begin with preliminary discussion, move into the definition of the methods, and are then followed by a worked numerical example. The end of the example serves as a flag that the material is about to become more detailed, with justifications of the methods, derivations of estimated standard errors, etc. Those readers who are not interested in such details should skip the material following the examples at first (and perhaps last) reading.

It is our intention that the part of the book on functional models (Chapters 3–6) can be understood at an overview level without an extensive background in theoretical statistics, at least through the numerical examples. The structural modeling approach requires that one knows about likelihood and Bayesian methods, but with this exception the material is not particularly specialized. The fourth part of the book (Chapters 9–14) is more technical, and we suggest that those interested mainly in an overview simply read the first section of each of those chapters. A full appreciation of the text, especially its details, requires a strong background in likelihood methods, estimating equations and quasilikelihood and variance function models. For inference, we typically provide estimated standard errors, as well as suggest use of “the” bootstrap. These topics are all covered in Appendix A, albeit briefly. For more background on the models used in this monograph, we highly recommend reading Chapter 1 of Fuller (1987) for an introduction to linear measurement error models and the first four chapters of McCullagh and Nelder (1989) for further discussion of generalized linear models, including logistic regression.

This is a book about general ideas and strategies of estimation and inference, not a book about a specific problem. Our interest in the field started with logistic regression, and many of our examples are based upon this problem. However, our philosophy is that measurement error occurs in many fields and in a variety of guises, and what is needed is an outline of strategies for handling progressively more difficult problems. While logistic regression may well be the most important nonlinear measurement error model, the strategies here are applied to a hard-core nonlinear regression bioassay problem (Chapter 3), a changepoint problem (Chapter 7), and a 2 × 2 table with misclassification (Chapter 8). Our hope is that the strategies will be sufficiently clear that they can be applied to new problems as they arise.

We have tried to represent the main themes in the field, and to reference as many research papers as possible. Obviously, as in any monograph, the selection of topics and material to be emphasized reflects our own interests. We apologize in advance to those workers whose work we have neglected to cite, or whose work should have been better advertised.

Carroll’s research and the writing of this book were supported by grants from the National Cancer Institute (CA–57030 and CA–61067). After January 1, 1996, Splus and SAS computer programs (on SPARC architecture SunOS versions 4 and 5 and for Windows on PCs), which implement (for major generalized linear models) many of the functional methods described in this book, can be obtained by sending a message to [email protected]. The body of the text should contain only a valid return email address. This will generate an automatic reply with instructions on how to get the software.

Much of Stefanski’s research on measurement error problems has been supported by grants from the National Science Foundation (DMS–8613681 and DMS–9200915) and by funding from the Environmental Monitoring and Assessment Program, U.S. Environmental Protection Agency.

We want to thank Jim Calvin, Bobby Gutierrez, Stephen Eckert, Joey Lin, C. Y. Wang, and Naisyin Wang for helpful general comments; Donna Spiegelman for a detailed reading of the manuscript; Jeff Buzas, John Cook, Tony Olsen, and Scott Overton for ideas and comments related to our research; and Viswanath Devanarayan for computing assistance and comments. Rob Abbott stimulated our initial interest in the field in 1981 with a question concerning the effects of measurement error in the Framingham Heart Study; this example appears throughout our discussion. Larry Freedman and Mitch Gail have commented on much of our work and have been instrumental in guiding us to interesting problems. Nancy Potischman introduced us to the world of nutritional epidemiology, where measurement error is of fundamental concern. Our friend Leon Gleser has been a source of support and inspiration for many years and has been a great influence on our thinking.

This book uses data supplied by the National Heart, Lung, and Blood Institute, NIH, DHHS from the Framingham Heart Study. The views expressed in this paper are those of the authors and do not necessarily reflect the views of the National Heart, Lung, and Blood Institute or of the Framingham Study.

∗ Chapter numbers in this preface refer to the first edition, not the present edition.
Preface to the Second Edition

Since the first edition of Measurement Error in Nonlinear Models appeared in 1995, the field of measurement error and exposure uncertainty has undergone an explosion in research. Some of these areas are the following:

• Bayesian computation via Markov Chain Monte Carlo techniques is now widely used in practice. The first edition had a short and not particularly satisfactory Chapter 9 on this topic. In this edition, we have greatly expanded the material and also the applications. Even if one is not a card-carrying Bayesian, Bayesian computation is a natural way to handle what we call the structural approach to measurement error modeling.

• A new chapter has been added on longitudinal data and mixed models, areas that have seen tremendous growth since the first edition.

• Semiparametric and nonparametric methods are enjoying increased application. The field of semiparametric and nonparametric regression (Ruppert, Wand, and Carroll, 2003) has become extremely important in the past 11 years, and in measurement error problems techniques are now much better established. We have revamped the old chapter on nonparametric regression and density estimation (Chapter 12) and added a new chapter (Chapter 13) to reflect the changes in the literature.

• Methods for handling covariate measurement error in survival analysis have been developing rapidly. The first edition had a section on survival analysis in the final chapter, “Other Topics.” This section has been greatly expanded and made into a separate Chapter 14.

• The area of missing data has also expanded vigorously over the last 11 years, especially due to the work of Robins and his colleagues. This work and its connections with measurement error now needs a book-length treatment of its own. Therefore, with some reluctance, we decided to delete much of the old material on validation data as a missing data problem.

• We have completely rewritten the score function chapter, both to keep up with advances in this area and to make the exposition more transparent.

The background material in Appendix A has been expanded to make the book somewhat more self-contained. Technical material that appeared as appendices to individual chapters in the first edition has now been collected into a new Appendix B.

Carroll’s research has been supported since 1990 by a grant from the National Cancer Institute (CA57030). The work of Raymond Carroll partially occurred during multiple visits to Peter Hall at the Centre of Excellence for Mathematics and Statistics of Complex Systems at the Australian National University, whose support is gratefully acknowledged, along with the opportunity to take thousands of photos of kangaroos (http://www.stat.tamu.edu/∼carroll/compressed kangaroo.jpg). David Ruppert was supported by the National Science Foundation (DMS 04-538) and the National Institutes of Health (CA57030). Leonard Stefanski also received support from the National Science Foundation and the National Institutes of Health.

In this second edition, we especially acknowledge our colleagues with whom we have discussed measurement error problems and worked since 1995, including Scott Berry, Dennis Boos, John Buonaccorsi, Jeff Buzas, Josef Coresh, Marie Davidian, Eugene Demidenko, Laurence Freedman, Wayne Fuller, Mitchell Gail, Bobby Gutierrez, Peter Hall, Victor Kipnis, Liang Li, Xihong Lin, Jay Lubin, Yanyuan Ma, Doug Midthune, Sastry Pantula, Dan Schafer, John Staudenmayer, Sally Thurston, Tor Tosteson, Naisyin Wang, and Alan Welsh. Owen Hoffman introduced us to the problem of radiation dosimetry and the ideas of shared Berkson and classical uncertainties.

We once again acknowledge Robert Abbott for introducing us to the problem in 1980, when he brought to Raymond Carroll a referee report demanding that he explain the impact of measurement error on the (logistic regression) Framingham data. We would love to acknowledge that anonymous referee for starting us along the path of measurement error in nonlinear models.

We also thank Mitchell Gail, one of the world’s great biostatisticians, for his advice and friendship over the last 25 years.

We are extremely grateful to Rick Rossi for a detailed reading of the manuscript, a reading that led to many changes in substance and exposition. Rick is the only head of a Department of Mathematics and Statistics who is also a licensed trout-fishing guide.

Finally, and with gratitude, we acknowledge our good friend Leon Gleser, who, to quote the first edition, has been a source of support and inspiration for many years and has been a great influence on our thinking.

Our book Web site is
http://www.stat.tamu.edu/∼carroll/eiv.SecondEdition.
Guide to Notation

In this section we give brief explanations and representative examples of the notation used in this monograph. For precise definitions, see the text.

Ân, B̂n: components of the sandwich formula
α0: intercept in model for E(X|Z, W)
αw: coefficient of W in model for E(X|Z, W)
αz: coefficient of Z in model for E(X|Z, W)
β0: intercept in a model for E(Y|X, Z)
βx: coefficient of X in model for E(Y|X, Z)
βz: coefficient of Z in model for E(Y|X, Z)
β1ZX: coefficient of 1 in generalized linear regression
∆: indicator of validation data, for example, where X is observed
dim(β): dimension of the vector β
fX: density of X
fY,W,T|Z: density of (Y, W, T) given Z
F(·): unknown link function
σ²g(Z, X, B, θ): var(Y|Z, X) in QVF model
G: extrapolant function in SIMEX
GQ: quadratic extrapolant function
GRL: rational linear extrapolant function
γ0,cm: intercept in a regression calibration model
γᵗz,cm: coefficient of Z in a regression calibration model
γᵗw,cm: coefficient of W in a regression calibration model
γ0,em: intercept in an error model
γᵗx,em: coefficient of X in an error model
γᵗw,em: coefficient of W in an error model
H(v): (1 + exp(−v))⁻¹, for example, the logistic function
h: bandwidth in nonparametric regression or density estimation
In(Θ): Fisher information
k: with equal replication, the number of replicates for all subjects
ki: number of replicates of the ith subject
K(·): kernel used in nonparametric regression or density estimation
κcm: σ²cm/σ²
Λ(·): likelihood ratio
L(·): generalized score function
mX(Z, W, γcm): E(X|Z, W)
mY(Z, X, β): E(Y|Z, X) in QVF (quasilikelihood variance function) model
mY,x(z, x, β): (∂/∂x) mY(z, x, β)
mY,xx(z, x, β): (∂²/∂x²) mY(z, x, β)
π(Y, Z, W, α): probability of selection into a validation study
Ψ, ψ: estimating functions
S: Y measured with error (S = Y + V)
si(y|Θ): score function
σu²: variance of U
σ²X|Z: conditional variance of X given Z
σxy: the covariance between random variables X and Y
ρxy: the correlation between X and Y, which is defined as σxy/(σx σy)
ΣZX: covariance matrix between the random vectors Z and X
T: observation related to X
Θb(λ): simulated estimator used in SIMEX
Θ(λ): average of the Θb(λ)s
U: observation error in an error model
Ub,k: pseudo-error in SIMEX
V: measurement error in the response
W: observation related to X
X: covariates measured with error
Y: response
Yi·: average of Yij over j
[Ỹ|Z̃, X̃, B]: density of Ỹ given (Z̃, X̃, B) (Bayesian notation)
Z: covariates measured without error
ζ: parameter controlling the amount of simulated extra measurement error in SIMEX

If m(x) is any function, then m′(x) and m′′(x) are its first and second derivatives, and m^(m)(x) is its mth derivative for m > 2.

For a vector or matrix A, Aᵗ is its transpose and, if A is an invertible matrix, A⁻¹ is its inverse.

If a = (a1, . . . , an) is a vector, then ||a|| is its Euclidean norm, that is, ||a|| = (a1² + · · · + an²)^(1/2).

If X and Y are random variables, then [X] is the distribution of X and [X|Y] is the conditional distribution of X given Y. This notation is becoming standard in the Bayesian literature.

Contents

1 INTRODUCTION 1
1.1 The Double/Triple Whammy of Measurement Error 1
1.2 Classical Measurement Error: A Nutrition Example 2
1.3 Measurement Error Examples 3
1.4 Radiation Epidemiology and Berkson Errors 4
1.4.1 The Difference Between Berkson and Classical
Errors: How to Gain More Power Without Really
Trying 5
1.5 Classical Measurement Error Model Extensions 7
1.6 Other Examples of Measurement Error Models 9
1.6.1 NHANES 9
1.6.2 Nurses’ Health Study 10
1.6.3 The Atherosclerosis Risk in Communities Study 11
1.6.4 Bioassay in a Herbicide Study 11
1.6.5 Lung Function in Children 12
1.6.6 Coronary Heart Disease and Blood Pressure 12
1.6.7 A-Bomb Survivors Data 13
1.6.8 Blood Pressure and Urinary Sodium Chloride 13
1.6.9 Multiplicative Error for Confidentiality 14
1.6.10 Cervical Cancer and Herpes Simplex Virus 14
1.7 Checking the Classical Error Model 14
1.8 Loss of Power 18
1.8.1 Linear Regression Example 18
1.8.2 Radiation Epidemiology Example 20
1.9 A Brief Tour 23
Bibliographic Notes 23

2 IMPORTANT CONCEPTS 25
2.1 Functional and Structural Models 25
2.2 Models for Measurement Error 26
2.2.1 General Approaches: Berkson and Classical Models 26
2.2.2 Is It Berkson or Classical? 27
2.2.3 Berkson Models from Classical 28
2.2.4 Transportability of Models 29

2.2.5 Potential Dangers of Transporting Models 30
2.2.6 Semicontinuous Variables 32
2.2.7 Misclassification of a Discrete Covariate 32
2.3 Sources of Data 32
2.4 Is There an “Exact” Predictor? What Is Truth? 33
2.5 Differential and Nondifferential Error 36
2.6 Prediction 38
Bibliographic Notes 39

3 LINEAR REGRESSION AND ATTENUATION 41
3.1 Introduction 41
3.2 Bias Caused by Measurement Error 41
3.2.1 Simple Linear Regression with Additive Error 42
3.2.2 Regression Calibration: Classical Error as Berkson Error 44
3.2.3 Simple Linear Regression with Berkson Error 45
3.2.4 Simple Linear Regression, More Complex Error Structure 46
3.2.5 Summary of Simple Linear Regression 49
3.3 Multiple and Orthogonal Regression 52
3.3.1 Multiple Regression: Single Covariate Measured with Error 52
3.3.2 Multiple Covariates Measured with Error 53
3.4 Correcting for Bias 55
3.4.1 Method of Moments 55
3.4.2 Orthogonal Regression 57
3.5 Bias Versus Variance 60
3.5.1 Theoretical Bias–Variance Tradeoff Calculations 61
3.6 Attenuation in General Problems 63
Bibliographic Notes 64

4 REGRESSION CALIBRATION 65
4.1 Overview 65
4.2 The Regression Calibration Algorithm 66
4.3 NHANES Example 66
4.4 Estimating the Calibration Function Parameters 70
4.4.1 Overview and First Methods 70
4.4.2 Best Linear Approximations Using Replicate Data 70
4.4.3 Alternatives When Using Partial Replicates 72
4.4.4 James–Stein Calibration 72
4.5 Multiplicative Measurement Error 72
4.5.1 Should Predictors Be Transformed? 73
4.5.2 Lognormal X and U 74
4.5.3 Linear Regression 77
4.5.4 Additive and Multiplicative Error 78
4.6 Standard Errors 79
4.7 Expanded Regression Calibration Models 79
4.7.1 The Expanded Approximation Defined 81
4.7.2 Implementation 83
4.7.3 Bioassay Data 85
4.8 Examples of the Approximations 90
4.8.1 Linear Regression 90
4.8.2 Logistic Regression 90
4.8.3 Loglinear Mean Models 93
4.9 Theoretical Examples 94
4.9.1 Homoscedastic Regression 94
4.9.2 Quadratic Regression with Homoscedastic Regression Calibration 94
4.9.3 Loglinear Mean Model 95
Bibliographic Notes and Software 95

5 SIMULATION EXTRAPOLATION 97
5.1 Overview 97
5.2 Simulation Extrapolation Heuristics 98
5.2.1 SIMEX in Simple Linear Regression 98
5.3 The SIMEX Algorithm 100
5.3.1 Simulation and Extrapolation Steps 100
5.3.2 Extrapolant Function Considerations 108
5.3.3 SIMEX Standard Errors 110
5.3.4 Extensions and Refinements 111
5.3.5 Multiple Covariates with Measurement Error 112
5.4 Applications 112
5.4.1 Framingham Heart Study 112
5.4.2 Single Covariate Measured with Error 113
5.4.3 Multiple Covariates Measured with Error 118
5.5 SIMEX in Some Important Special Cases 120
5.5.1 Multiple Linear Regression 120
5.5.2 Loglinear Mean Models 122
5.5.3 Quadratic Mean Models 122
5.6 Extensions and Related Methods 123
5.6.1 Mixture of Berkson and Classical Error 123
5.6.2 Misclassification SIMEX 125
5.6.3 Checking Structural Model Robustness via Remeasurement 126
Bibliographic Notes 128

6 INSTRUMENTAL VARIABLES 129
6.1 Overview 129
6.1.1 A Note on Notation 130
6.2 Instrumental Variables in Linear Models 131
6.2.1 Instrumental Variables via Differentiation 131
6.2.2 Simple Linear Regression with One Instrument 132
6.2.3 Linear Regression with Multiple Instruments 134
6.3 Approximate Instrumental Variable Estimation 137
6.3.1 IV Assumptions 137
6.3.2 Mean and Variance Function Models 138
6.3.3 First Regression Calibration IV Algorithm 139
6.3.4 Second Regression Calibration IV Algorithm 140
6.4 Adjusted Score Method 140
6.5 Examples 143
6.5.1 Framingham Data 143
6.5.2 Simulated Data 145
6.6 Other Methodologies 145
6.6.1 Hybrid Classical and Regression Calibration 145
6.6.2 Error Model Approaches 147
Bibliographic Notes 148

7 SCORE FUNCTION METHODS 151
7.1 Overview 151
7.2 Linear and Logistic Regression 152
7.2.1 Linear Regression Corrected and Conditional Scores 152
7.2.2 Logistic Regression Corrected and Conditional Scores 157
7.2.3 Framingham Data Example 159
7.3 Conditional Score Functions 162
7.3.1 Conditional Score Basic Theory 162
7.3.2 Conditional Scores for Basic Models 164
7.3.3 Conditional Scores for More Complicated Models 166
7.4 Corrected Score Functions 169
7.4.1 Corrected Score Basic Theory 170
7.4.2 Monte Carlo Corrected Scores 170
7.4.3 Some Exact Corrected Scores 172
7.4.4 SIMEX Connection 173
7.4.5 Corrected Scores with Replicate Measurements 173
7.5 Computation and Asymptotic Approximations 174
7.5.1 Known Measurement Error Variance 175
7.5.2 Estimated Measurement Error Variance 176
7.6 Comparison of Conditional and Corrected Scores 177
Bibliographic Notes 178

8 LIKELIHOOD AND QUASILIKELIHOOD 181
8.1 Introduction 181
8.1.1 Step 1: The Likelihood If X Were Observable 183
8.1.2 A General Concern: Identifiable Models 184
8.2 Steps 2 and 3: Constructing Likelihoods 184
8.2.1 The Discrete Case 185
8.2.2 Likelihood Construction for General Error Models 186
8.2.3 The Berkson Model 188
8.2.4 Error Model Choice 189
8.3 Step 4: Numerical Computation of Likelihoods 190
8.4 Cervical Cancer and Herpes 190
8.5 Framingham Data 192
8.6 Nevada Test Site Reanalysis 193
8.6.1 Regression Calibration Implementation 195
8.6.2 Maximum Likelihood Implementation 196
8.7 Bronchitis Example 197
8.7.1 Calculating the Likelihood 198
8.7.2 Effects of Measurement Error on Threshold Models 199
8.7.3 Simulation Study and Maximum Likelihood 199
8.7.4 Berkson Analysis of the Data 201
8.8 Quasilikelihood and Variance Function Models 201
8.8.1 Details of Step 3 for QVF Models 202
8.8.2 Details of Step 4 for QVF Models 203
Bibliographic Notes 203

9 BAYESIAN METHODS 205
9.1 Overview 205
9.1.1 Problem Formulation 205
9.1.2 Posterior Inference 207
9.1.3 Bayesian Functional and Structural Models 208
9.1.4 Modularity of Bayesian MCMC 209
9.2 The Gibbs Sampler 209
9.3 Metropolis–Hastings Algorithm 211
9.4 Linear Regression 213
9.4.1 Example 216
9.5 Nonlinear Models 219
9.5.1 A General Model 219
9.5.2 Polynomial Regression 220
9.5.3 Multiplicative Error 221
9.5.4 Segmented Regression 222
9.6 Logistic Regression 223
9.7 Berkson Errors 225
9.7.1 Nonlinear Regression with Berkson Errors 225
9.7.2 Logistic Regression with Berkson Errors 227
9.7.3 Bronchitis Data 228
9.8 Automatic Implementation 230
9.8.1 Implementation and Simulations in WinBUGS 231
9.8.2 More Complex Models 234
9.9 Cervical Cancer and Herpes 235
9.10 Framingham Data 237
9.11 OPEN Data: A Variance Components Model 238
Bibliographic Notes 240

10 HYPOTHESIS TESTING 243
10.1 Overview 243
10.1.1 Simple Linear Regression, Normally Distributed X 243
10.1.2 Analysis of Covariance 246
10.1.3 General Considerations: What Is a Valid Test? 248
10.1.4 Summary of Major Results 248
10.2 The Regression Calibration Approximation 249
10.2.1 Testing H0: βx = 0 250
10.2.2 Testing H0: βz = 0 250
10.2.3 Testing H0: (βxᵗ, βzᵗ)ᵗ = 0 250
10.3 Illustration: OPEN Data 251
10.4 Hypotheses about Subvectors of βx and βz 251
10.4.1 Illustration: Framingham Data 252
10.5 Efficient Score Tests of H0: βx = 0 253
10.5.1 Generalized Score Tests 254
Bibliographic Notes 257

11 LONGITUDINAL DATA AND MIXED MODELS 259
11.1 Mixed Models for Longitudinal Data 259
11.1.1 Simple Linear Mixed Models 259
11.1.2 The General Linear Mixed Model 260
11.1.3 The Linear Logistic Mixed Model 261
11.1.4 The Generalized Linear Mixed Model 261
11.2 Mixed Measurement Error Models 262
11.2.1 The Variance Components Model Revisited 262
11.2.2 General Considerations 263
11.2.3 Some Simple Examples 263
11.2.4 Models for Within-Subject X-Correlation 265
11.3 A Bias-Corrected Estimator 265
11.4 SIMEX for GLMMEMs 267
11.5 Regression Calibration for GLMMs 267
11.6 Maximum Likelihood Estimation 268
11.7 Joint Modeling 268
11.8 Other Models and Applications 269
11.8.1 Models with Random Effects Multiplied by X 269
11.8.2 Models with Random Effects Depending Nonlinearly on X 270
11.8.3 Inducing a True-Data Model from a Standard Observed Data Model 270
11.8.4 Autoregressive Models in Longitudinal Data 271
11.9 Example: The CHOICE Study 272
11.9.1 Basic Model 273
11.9.2 Naive Replication and Sensitivity 273
11.9.3 Accounting for Biological Variability 274
Bibliographic Notes 276

12 NONPARAMETRIC ESTIMATION 279
12.1 Deconvolution 279
12.1.1 The Problem 279
12.1.2 Fourier Inversion 280
12.1.3 Methodology 280
12.1.4 Properties of Deconvolution Methods 281
12.1.5 Is It Possible to Estimate the Bandwidth? 282
12.1.6 Parametric Deconvolution 284
12.1.7 Estimating Distribution Functions 287
12.1.8 Optimal Score Tests 288
12.1.9 Framingham Data 289
12.1.10 NHANES Data 290
12.1.11 Bayesian Density Estimation by Normal Mixtures 291
12.2 Nonparametric Regression 293
12.2.1 Local-Polynomial, Kernel-Weighted Regression 293
12.2.2 Splines 294
12.2.3 QVF and Likelihood Models 295
12.2.4 SIMEX for Nonparametric Regression 296
12.2.5 Regression Calibration 297
12.2.6 Structural Splines 297
12.2.7 Taylex and Other Methods 298
12.3 Baseline Change Example 299
12.3.1 Discussion of the Baseline Change Controls Data 301
Bibliographic Notes 302

13 SEMIPARAMETRIC REGRESSION 303
13.1 Overview 303
13.2 Additive Models 303
13.3 MCMC for Additive Spline Models 304
13.4 Monte Carlo EM-Algorithm 305
13.4.1 Starting Values 306
13.4.2 Metropolis–Hastings Fact 306
13.4.3 The Algorithm 306
13.5 Simulation with Classical Errors 309
13.6 Simulation with Berkson Errors 311
13.7 Semiparametrics: X Modeled Parametrically 312
13.8 Parametric Models: No Assumptions on X 314
13.8.1 Deconvolution Methods 314
13.8.2 Models Linear in Functions of X 315
13.8.3 Linear Logistic Regression with Replicates 316
13.8.4 Doubly Robust Parametric Modeling 317
Bibliographic Notes 318

14 SURVIVAL DATA 319
14.1 Notation and Assumptions 319
14.2 Induced Hazard Function 320
14.3 Regression Calibration for Survival Analysis 321
14.3.1 Methodology and Asymptotic Properties 321
14.3.2 Risk Set Calibration 322
14.4 SIMEX for Survival Analysis 323
14.5 Chronic Kidney Disease Progression 324
14.5.1 Regression Calibration for CKD Progression 325
14.5.2 SIMEX for CKD Progression 326
14.6 Semi and Nonparametric Methods 329
14.6.1 Nonparametric Estimation with Validation Data 330
14.6.2 Nonparametric Estimation with Replicated Data 332
14.6.3 Likelihood Estimation 333
14.7 Likelihood Inference for Frailty Models 336
Bibliographic Notes 337

15 RESPONSE VARIABLE ERROR 339
15.1 Response Error and Linear Regression 339
15.2 Other Forms of Additive Response Error 343
15.2.1 Biased Responses 343
15.2.2 Response Error in Heteroscedastic Regression 344
15.3 Logistic Regression with Response Error 345
15.3.1 The Impact of Response Misclassification 345
15.3.2 Correcting for Response Misclassification 347
15.4 Likelihood Methods 353
15.4.1 General Likelihood Theory and Surrogates 353
15.4.2 Validation Data 354
15.5 Use of Complete Data Only 355
15.5.1 Likelihood of the Validation Data 355
15.5.2 Other Methods 356
15.6 Semiparametric Methods for Validation Data 356
15.6.1 Simple Random Sampling 356
15.6.2 Other Types of Sampling 357
Bibliographic Notes 358

A BACKGROUND MATERIAL 359
A.1 Overview 359
A.2 Normal and Lognormal Distributions 359
A.3 Gamma and Inverse-Gamma Distributions 360
A.4 Best and Best Linear Prediction and Regression 361
A.4.1 Linear Prediction 361
A.4.2 Best Linear Prediction without an Intercept 363
A.4.3 Nonlinear Prediction 363
A.5 Likelihood Methods 364
A.5.1 Notation 364
A.5.2 Maximum Likelihood Estimation 364
A.5.3 Likelihood Ratio Tests 365
A.5.4 Profile Likelihood and Likelihood Ratio Confidence Intervals 365
A.5.5 Efficient Score Tests 366
A.6 Unbiased Estimating Equations 367
A.6.1 Introduction and Basic Large Sample Theory 367
A.6.2 Sandwich Formula Example: Linear Regression without Measurement Error 369
A.6.3 Sandwich Method and Likelihood-Type Inference 370
A.6.4 Unbiased, but Conditionally Biased, Estimating Equations 372
A.6.5 Biased Estimating Equations 372
A.6.6 Stacking Estimating Equations: Using Prior Estimates of Some Parameters 372
A.7 Quasilikelihood and Variance Function Models (QVF) 374
A.7.1 General Ideas 374
A.7.2 Estimation and Inference for QVF Models 375
A.8 Generalized Linear Models 377
A.9 Bootstrap Methods 377
A.9.1 Introduction 377
A.9.2 Nonlinear Regression without Measurement Error 378
A.9.3 Bootstrapping Heteroscedastic Regression Models 380
A.9.4 Bootstrapping Logistic Regression Models 380
A.9.5 Bootstrapping Measurement Error Models 381
A.9.6 Bootstrap Confidence Intervals 382

B TECHNICAL DETAILS 385
B.1 Appendix to Chapter 1: Power in Berkson and Classical Error Models 385
B.2 Appendix to Chapter 3: Linear Regression and Attenuation 386
B.3 Appendix to Chapter 4: Regression Calibration 387
B.3.1 Standard Errors and Replication 387
B.3.2 Quadratic Regression: Details of the Expanded Calibration Model 391
B.3.3 Heuristics and Accuracy of the Approximations 391
B.4 Appendix to Chapter 5: SIMEX 392
B.4.1 Simulation Extrapolation Variance Estimation 393
B.4.2 Estimating Equation Approach to Variance Estimation 395
B.5 Appendix to Chapter 6: Instrumental Variables 399
B.5.1 Derivation of the Estimators 399
B.5.2 Asymptotic Distribution Approximations 401
B.6 Appendix to Chapter 7: Score Function Methods 406
B.6.1 Technical Complements to Conditional Score Theory 406
B.6.2 Technical Complements to Distribution Theory for Estimated Σuu 406
B.7 Appendix to Chapter 8: Likelihood and Quasilikelihood 407
B.7.1 Monte Carlo Computation of Integrals 407
B.7.2 Linear, Probit, and Logistic Regression 408
B.8 Appendix to Chapter 9: Bayesian Methods 409
B.8.1 Code for Section 9.8.1 409
B.8.2 Code for Section 9.11 410

References 413
CHAPTER 1

INTRODUCTION

1.1 The Double/Triple Whammy of Measurement Error

Measurement error in covariates has three effects:

• It causes bias in parameter estimation for statistical models.
• It leads to a loss of power, sometimes profound, for detecting interesting relationships among variables.
• It masks the features of the data, making graphical model analysis difficult.

We call the first two the double whammy of measurement error. Most of
the statistical methods described in this book are aimed at the first prob-
lem, namely, to correct for biases of estimation caused by measurement
error. Later in this chapter, we will describe an example from radiation
dosimetry and the profound loss of power for detecting risks that occurs
with uncertainties in individual doses. Here, we briefly describe the third
issue, the masking of features.
Consider a regression of a response Y on a predictor X, uniformly
distributed on the interval [−2, 2]. Suppose that the mean is sin(2X)
and the variance σǫ² = 0.10. In the top panel of Figure 1.1, we plot 200
simulated observations from such a model that indicate quite clearly
the sinusoidal aspect of the regression function. However, suppose that
instead of observing X, we observe W, normally distributed with mean
X but with variance 4/9. As we will later describe in Section 3.2.1, this
is an attenuation coefficient of 0.75. Thus, what we observe is not X, but
an unbiased estimate of it, W. In the bottom panel of Figure 1.1, we
plot the observed data Y versus W. Note that the sinusoid is no longer
evident and the main feature of the data has been hidden.
It is also worth noting that the variability about the sinusoid is far
smaller when X is observed than the variability about any curve one
could reasonably guess at when only W is observed. This is one sub-
stantial cause of the loss of power. Finally, if one only observes (Y, W)
and hence the bottom panel of Figure 1.1, it would be essentially impos-
sible to reconstruct the sinusoid, and something different would certainly
be used. This is the bias caused by measurement error.
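
This masking is easy to reproduce. The following sketch (our own illustration in Python, not code from the book) simulates the model just described: X uniform on [−2, 2], mean function sin(2X), error variance 0.10, and an observed W equal to X plus normal error with variance 4/9, which yields the attenuation coefficient of 0.75 mentioned above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

x = rng.uniform(-2.0, 2.0, n)                             # true predictor, Var(X) = 4/3
y = np.sin(2.0 * x) + rng.normal(0.0, np.sqrt(0.10), n)   # mean sin(2X), error variance 0.10
w = x + rng.normal(0.0, np.sqrt(4.0 / 9.0), n)            # observed predictor, classical error

# Attenuation coefficient: Var(X) / {Var(X) + Var(U)} = (4/3) / (4/3 + 4/9) = 0.75
print("attenuation:", (4.0 / 3.0) / (4.0 / 3.0 + 4.0 / 9.0))

# The sinusoid is clear in the (X, Y) scatter but hidden in the (W, Y) scatter;
# one crude summary is how well sin(2 * predictor) tracks Y in each case.
print("corr{Y, sin(2X)}:", np.corrcoef(y, np.sin(2.0 * x))[0, 1])
print("corr{Y, sin(2W)}:", np.corrcoef(y, np.sin(2.0 * w))[0, 1])
```

Plotting (X, Y) and (W, Y) from this simulation reproduces the two panels of Figure 1.1.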

Figure 1.1 Illustration of the bias, loss of power, and masking of features caused by measurement error in predictors. Top panel: regression on the true covariate. Bottom panel: regression on the observed covariate.

1.2 Classical Measurement Error: A Nutrition Example

Much of the measurement error literature is based around what is called classical measurement error, in which the truth is measured with additive error, usually with constant variance. We introduce the classical measurement error model via an example from nutrition.

In the National Cancer Institute’s OPEN study, see Subar, Thompson, Kipnis, et al. (2001), one interest is in measuring the logarithm of dietary protein intake. True, long-term log-intake is denoted by X, but this cannot be observed in practice. Instead, the investigators measured a biomarker of log-protein intake, namely urinary nitrogen, denoted by W. In this study, 297 subjects had replicated urinary nitrogen measurements. If there were no measurement error, then of course the two biomarker measurements would be equal, but then, since this is a book about measurement error, we would not be wasting space. Indeed, in Figure 1.2 we see that when we plot the second biomarker versus the first, the correlation is relatively high (0.695), but there clearly is some variability in the measurements.

Figure 1.2 OPEN Study data, scatterplot of the logarithm of the first and second protein biomarker measurements (x-axis: First Protein Biomarker, Attenuation = 0.694; y-axis: Second Protein Biomarker; plot title: OPEN data, Protein, Log Scale, Correlation = 0.695). The fact that there is scatter means that the biomarker has measurement error.

In this context, there is evidence from feeding studies that the protein biomarker captures true protein intake with added variability. Such situations are often called classical measurement error. In symbols, let Xi be the true log-protein intake for individual i, and let Wij be the jth biomarker log-protein measurement. Then the classical measurement error model states that

Wij = Xi + Uij.     (1.1)

In this model, Wij is an unbiased measure of Xi, so that Uij must have mean zero, that is, in symbols, E(Uij|Xi) = 0. The error structure of Uij could be homoscedastic (constant variance) or heteroscedastic. In this particular example, we will show later, in Section 1.7, that the measurement error structure is approximately normal with constant variance, so we can reasonably think that Uij|Xi ∼ Normal(0, σu²).
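
Model (1.1) and the value of replicates can be illustrated with a small simulation. The sketch below uses hypothetical means and variances (they are not the OPEN estimates); it shows that the correlation between two replicate biomarkers estimates the reliability ratio σx²/(σx² + σu²), and that σu² itself is identified from within-person differences.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 297, 2                              # 297 subjects with 2 biomarker replicates, as in OPEN
mu_x, sigma_x, sigma_u = 6.0, 0.5, 0.33    # hypothetical values, not the OPEN estimates

x = rng.normal(mu_x, sigma_x, n)                    # true long-term log-protein intake X_i
w = x[:, None] + rng.normal(0.0, sigma_u, (n, k))   # W_ij = X_i + U_ij, classical error (1.1)

# The correlation between replicates estimates the reliability (attenuation) ratio
# sigma_x^2 / (sigma_x^2 + sigma_u^2).
print("corr(W_i1, W_i2):", np.corrcoef(w[:, 0], w[:, 1])[0, 1])
print("sigma_x^2 / (sigma_x^2 + sigma_u^2):", sigma_x**2 / (sigma_x**2 + sigma_u**2))

# Replicates also identify the error variance, since E{(W_i1 - W_i2)^2} = 2 * sigma_u^2.
print("estimated sigma_u^2:", np.mean((w[:, 0] - w[:, 1]) ** 2) / 2.0, " truth:", sigma_u**2)
```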

1.3 Measurement Error Examples

Nonlinear measurement error models commonly begin with an underlying nonlinear model for the response Y in terms of the predictors. We distinguish between two kinds of predictors: Z represents those predictors that, for all practical purposes, are measured without error, and X those that cannot be observed exactly for all study subjects. The distinguishing feature of a measurement error problem is that we can observe a variable W, which is related to an unobservable X. The parameters in the model relating Y and (Z, X) cannot, of course, be estimated directly by fitting Y to (Z, X), since X is not observed. The goal of measurement error modeling is to obtain nearly unbiased estimates of these parameters indirectly by fitting a model for Y in terms of (Z, W). Attainment of this goal requires careful analysis. Substituting W for X, but making no adjustments in the usual fitting methods for this substitution, leads to estimates that are biased, sometimes seriously, see Figure 1.1. The problem here is that the parameters of the regression of Y on (Z, W) are different from those of Y on (Z, X).

In assessing measurement error, careful attention must be given to the type and nature of the error, and the sources of data that allow modeling of this error. The following examples illustrate some of the different types of problems considered in this book.

1.4 Radiation Epidemiology and Berkson Errors

There are many studies relating radiation exposure to disease, including the Nevada Test Site (NTS) Thyroid Disease Study and the Hanford Thyroid Disease Study (HTDS). Full disclosure: One of us (RJC) was involved in litigation concerning HTDS, and his expert report is available at http://www.downwinders.com/files/htds expert report.pdf, the plaintiffs’ Web site, at least as of May 2005.

Stevens, Till, Thomas, et al. (1992); Kerber, Till, Simon, et al. (1993); and Simon, Till, Lloyd, et al. (1995) described the Nevada test site study, where radiation exposure largely came as the result of above-ground nuclear testing in the 1950s. Similar statistical issues arise in the Hanford Thyroid Disease Study: see Davis, Kopecky, Stram, et al. (2002); Stram and Kopecky (2003); and Kopecky, Davis, Hamilton, et al. (2004), where radiation was released in the 1950s and 1960s. In the Nevada study, over 2,000 individuals who were exposed to radiation as children were examined for thyroid disease. The primary radiation exposure came from milk and vegetables. The idea of the study was to relate various thyroid disease outcomes to radiation exposure to the thyroid.

Of course, once again, since this is a book about measurement error, the main exposure of interest, radiation to the thyroid, cannot be observed exactly. What is typical in these studies is to build a large dosimetry model that attempts to convert the known data about the above-ground nuclear tests to radiation actually absorbed into the thyroid. Dosimetry calculations in NTS were based on age at exposure, gender, residence history, x-ray history, whether the individual was breast-fed as a child, and a diet questionnaire filled out by the parent, focusing on milk consumption and vegetables. The data were then input into a complex model and, for each individual, the point estimate of thyroid dose and an associated standard error for the measurement error were reported. Roughly similar considerations led to the dose estimates and uncertainties in HTDS.

In both NTS and HTDS, the authors consider analyses taking into account the uncertainties (measurement error) in dose estimates. Indeed, both consider the classical measurement error situation in (1.1). The HTDS study, though, also considered a different type of measurement error, and based most of their power calculations on it. We will go into detail on the power and analysis issues; see Section 1.8.2 of this chapter for power and Section 8.6 for the analysis.

What we see in the classical measurement error model (1.1) is that the observed dose equals the true dose plus (classical) measurement error. This, of course, means that the variability of the observed doses will be greater than the variability of true doses. In HTDS, in contrast, the authors not only consider this classical measurement error, but they also turn the issue around; namely, they assumed that the true dose is equal to the estimated dose plus measurement error. In symbols, this is

Xi = Wi + Ui,     (1.2)

where E(Ui|Wi) = 0, so that the true dose has more variability than the estimated dose; contrast with (1.1). Model (1.2) is called a Berkson measurement error model, see Berkson (1950).

1.4.1 The Difference Between Berkson and Classical Errors: How to Gain More Power Without Really Trying

Measurement error modeling requires considerable care. In this section, we discuss why it is crucial that one understands the seemingly subtle differences between Berkson and classical errors, and we illustrate some possible pitfalls when choosing between the two error models. As far as we are aware, one cannot be put in jail for using the wrong model, but an incorrect measurement error model often causes erroneous inferences, which to a statistician is worse than going to jail (okay, we have exaggerated). In Section 2.2.2 we provide additional guidance so that the reader can be confident of choosing the correct error model in his/her own work.

The difference between Berkson and classical measurement error is major when one is planning a study a priori, especially when one is attempting power calculations. There are some technical similarities between classical and Berkson errors, see Section 3.2.2, but different issues arise in power calculations. What we will indicate here is that, for a given measurement error variance, if you want to convince yourself that you have lots of statistical power despite measurement error, just pretend that the measurement error is Berkson and not classical.
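As a minimal numerical sketch of this warning, the code below uses the illustrative variances worked through in the next paragraph (observed-dose variance 2.0, error variance 1.0); the sample size and slope in the power calculation are hypothetical choices of ours, not values from any study.

```python
import math

var_w, var_u = 2.0, 1.0   # variance of the observed dose W and of the measurement error U

# Berkson model (1.2): X = W + U with U independent of W, so Var(X) = Var(W) + Var(U).
# Classical model (1.1): W = X + U with U independent of X, so Var(X) = Var(W) - Var(U).
var_x = {"Berkson": var_w + var_u, "classical": var_w - var_u}
print(var_x)   # {'Berkson': 3.0, 'classical': 1.0}


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Approximate power one would claim for a two-sided 5% test of slope = 0 in a linear
# model Y = beta * X + error with residual SD 1, for a hypothetical design with
# n = 500 subjects and true slope beta = 0.1.
n, beta = 500, 0.1
for model, vx in var_x.items():
    nc = beta * math.sqrt(n * vx)                          # approximate noncentrality
    power = normal_cdf(nc - 1.96) + normal_cdf(-nc - 1.96)
    print(model, "implied Var(X) =", vx, " claimed power ~", round(power, 2))
```

Treating classical error as Berkson triples the variance attributed to the true dose here, and with it the power one would claim.
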
Suppose that the observed data have a normal distribution with mean zero and variance σw² = 2.0. Suppose also that the measurement error has variance σu² = 1.0. Then, if one assumes a Berkson model, the true doses have mean zero and variance σx² = 3.0. This is so because the variance of X in (1.2) is the sum of the variance of W (σw² = 2.0) and the variance of the Berkson measurement error U (σu² = 1.0). Now, in major contrast, if one assumes that the measurement error is classical instead of Berkson, then the variance of X is, from (1.1), the difference of the variance of W (2.0) and the variance of the classical measurement error U (1.0), that is, 1.0. In other words, if we assume Berkson error, we think that the true dose X has variance 3.0, while if we assume classical measurement error, we think that the variance of the true dose equals 1.0, a feature reflected in Figure 1.3. Now, for a given set of parameter values of risk, it is generally the case that the power increases when the variance of true exposure X increases. Hence, assuming Berkson when the error is classical leads to a grossly optimistic overstatement of power.

Figure 1.3 A hypothetical example where the observed doses W have mean zero and variance 2.0, while the measurement errors have mean zero and variance 1.0. Displayed are the distributions of true dose that you think you have if you think that the errors are Berkson (top) or if you think the errors are classical (bottom). The much smaller variability of true dose under the classical model indicates that the power for detecting effects will be much smaller than if the errors are Berkson.
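The variance bookkeeping in this example is easy to mishandle, so a minimal numerical sketch may help. The Python fragment below simply reproduces the arithmetic above for the hypothetical values var(W) = 2.0 and var(U) = 1.0; the variable names are ours and nothing here comes from a particular software package.

```python
# Hypothetical example from the text: observed values W with variance 2.0,
# measurement error U with variance 1.0.
var_w = 2.0
var_u = 1.0

# Berkson model (1.2): X = W + U with U independent of W,
# so the implied variance of the true dose is the SUM.
var_x_berkson = var_w + var_u        # 3.0

# Classical model (1.1): W = X + U with U independent of X,
# so the implied variance of the true dose is the DIFFERENCE.
var_x_classical = var_w - var_u      # 1.0

print("Implied var(X) under Berkson:  ", var_x_berkson)
print("Implied var(X) under classical:", var_x_classical)
```

Because power for detecting a dose effect generally increases with the variance of the true exposure, treating classical error as Berkson triples the exposure variance one thinks one has in this example, and with it the apparent power.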
Figure 1.4 OPEN Study data, scatterplot of the logarithm of energy (calories) using a food frequency questionnaire and a biomarker.

Further discussion of differences and similarities between power in classical and Berkson error models can be found in Section B.1.

1.5 Classical Measurement Error Model Extensions

It almost goes without saying, but we will say it, that measurement error models can be more complex than the classical additive measurement error model (1.1) or the classical Berkson error model (1.2). Here we illustrate some of the complexities of measurement error modeling via an important nutrition biomarker study.

The study of diet and disease has been a major motivation for nonlinear measurement error modeling. In these studies, it is typical to measure diet via a self-report instrument, for example, a food frequency questionnaire (FFQ), some sort of diary, or a 24-hour recall interview. It has been appreciated for decades that these self-report instruments are only imperfect measures of long-term dietary intakes, and hence that measurement error is a major concern.

To understand the profound nature of measurement error in this context, we consider the National Cancer Institute's OPEN study, which is one of the largest biomarker studies ever done; see Subar, Kipnis, Troiano, et al. (2003) and Kipnis, Midthune, Freedman, et al. (2003). We illustrate this measurement error with energy (caloric) intake measures. In the OPEN Study, energy intake was measured by the dietary history questionnaire, an FFQ described in Subar, Thompson, Kipnis, et al. (2001). In keeping with our notation, since the FFQ is not the truth, we will denote by W the log energy intake as measured by the FFQ. In addition, the investigators obtained a near-perfect biomarker measure of energy intake using a technique called doubly-labeled water (DLW), which we call X. DLW is basically what it sounds like: Participants drink water that is enriched with respect to two isotopes, and urine samples allow the measurement of energy expenditure.

Figure 1.5 OPEN Study data, histograms of energy (calories) using a biomarker (top panel) and a food frequency questionnaire (bottom panel). Note how individuals report far fewer calories than they actually consume.

That true intake X and observed intake W can be very different is seen in Figure 1.4, where we plot the FFQ versus the biomarker along with the associated least squares line. The correlation between truth and observed is only 0.28, indicating that the FFQ is not a very good measure of energy intake. It is also interesting to note the histograms for these two instruments; see Figure 1.5. One can see there that the FFQ is also clearly badly biased downward in general for energy intake, that is, people eat more calories than they are willing to report (no surprise!).

In this example, because of the biases seen in Figures 1.4 and 1.5 the FFQ is not an unbiased measure of true energy intake, and hence the classical measurement error model (1.1) clearly does not hold. A more reasonable model, promoted in a series of papers by Kipnis et al. (1999, 2001, 2003), is to allow for bias as well as variance components:

Wij = γ0 + γ1 Xij + Uij ,    Uij = ri + εij ,    (1.3)

where ri ∼ Normal(0, σr²) and εij ∼ Normal(0, σε²). In model (1.3), the linear regression in true intake reflects the biases of the FFQ. The structure of the measurement error random variables Uij is that they have two components: a shared component r and a random component ε. Kipnis et al. (1999, 2001, 2003) call the shared component person-specific bias, reflecting the idea that two people who eat exactly the same foods will nonetheless systematically report intakes differently when given multiple FFQs. Fuller (1987) calls the person-specific bias an equation error.

Of course, if γ0 = 0, γ1 = 1, and ri ≡ 0, then we have the standard classical measurement error model (1.1).
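As a concrete illustration of how the person-specific bias in (1.3) behaves, here is a minimal simulation sketch in Python. The parameter values are hypothetical choices of ours, not the OPEN estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 1000, 2                        # persons, replicate FFQs per person

# Hypothetical parameter values (for illustration only).
gamma0, gamma1 = 1.5, 0.7             # intercept and slope biases of the FFQ
sigma_x, sigma_r, sigma_eps = 0.7, 0.4, 0.5

x = 5.0 + sigma_x * rng.standard_normal(n)        # true log intake X_i
r = sigma_r * rng.standard_normal(n)              # person-specific bias r_i
eps = sigma_eps * rng.standard_normal((n, m))     # within-person error eps_ij

# Model (1.3): W_ij = gamma0 + gamma1 * X_i + r_i + eps_ij
w = gamma0 + gamma1 * x[:, None] + r[:, None] + eps

# The shared component r_i makes two FFQs from the same person correlated
# even after conditioning on true intake.
resid = w - (gamma0 + gamma1 * x[:, None])
print("corr(U_i1, U_i2):  ", round(np.corrcoef(resid[:, 0], resid[:, 1])[0, 1], 3))
print("theoretical value: ", round(sigma_r**2 / (sigma_r**2 + sigma_eps**2), 3))
```

The shared component is exactly why a second administration of the same FFQ cannot be treated as an independent replicate: its error is correlated with the first report through ri.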
1.6 Other Examples of Measurement Error Models

1.6.1 NHANES

The NHANES-I Epidemiologic Study Cohort data set (Jones, Schatzkin, Green, et al., 1987) is a cohort study originally consisting of 8,596 women who were interviewed about their nutrition habits and later examined for evidence of cancer. We restrict attention to a subcohort of 3,145 women aged 25–50 who have no missing data on the variables of interest.

The response Y indicates the presence of breast cancer. The predictor variables Z, assumed to be measured without significant error, include the following: age, poverty index ratio, body mass index, alcohol (Yes, No), family history of breast cancer, age at menarche, and menopausal status. We are primarily interested in the effects of nutrition variables X that are known to be imprecisely measured, for example, "long-term" saturated fat intake.

If all these underlying variables were observable, then a standard logistic regression analysis would be performed. However, it is both difficult and expensive to measure long-term diet in a large cohort. In the NHANES data, instead of observing X, the measured W was a 24-hour recall, that is, each participant's diet in the previous 24 hours was recalled and nutrition variables computed. That the measurement error is large in 24-hour recalls has been documented previously (Beaton, Milnor, & Little, 1979; Wu, Whittemore, & Jung, 1986). Indeed, there is evidence to support the conclusion that more than half of the variability in the observed data is due to measurement error.

There are several sources of the measurement error. First, there is the error in the ascertainment of food consumption in the previous 24 hours, especially amounts. Some of this type of error is purely random, while another part is due to systematic bias, for example, some people resist giving an accurate description of their consumption of snacks. The size of potential systematic bias can be determined in some instances (Freedman, Carroll, & Wax, 1991), but in the present study we have available only the 24-hour recall information, and any systematic bias is unidentifiable.

The major source of "error" is the fact that a single day's diet does not serve as an adequate measure of the previous year's diet. There are seasonal differences in diet, as well as day-to-day variations. This points out the fact that measurement error is much more than simple recording or instrument error and encompasses many different sources of variability.

There is insufficient information in the NHANES data to model measurement error directly. Instead, the measurement error structure was modeled using an external data set, the CSFII (Continuing Survey of Food Intakes by Individuals) data (Thompson, Sowers, Frongillo, et al., 1992). The CSFII data contain the 24-hour recall measures W, as well as three additional 24-hour recall phone interviews. Using external data, rather than assessing measurement error on an internal subset of the primary study, entails certain risks that we discuss later in this chapter. The basic problem is that parameters in the external study may differ from parameters in the primary study, leading to bias when external estimates are transported to the primary study.

1.6.2 Nurses' Health Study

While the OPEN Study focused on the properties of instruments for measuring nutrient intakes, the real interest is in relating disease and nutrient intakes. A famous and still ongoing study concerning nutrition and breast cancer has been considered by Rosner, Willett, & Spiegelman (1989) and Rosner, Spiegelman, & Willett (1990), namely, the Nurses' Health Study. The study has over 80,000 participants and includes many breast cancer cases. The variables are much the same as in the OPEN study, with the exceptions that (1) alcohol is assessed differently and (2) a food-frequency questionnaire was used instead of 24-hour recall interviews. The size of the measurement error in the nutrition variables is still quite large. Here, X = (long-term average alcohol intake, long-term average nutrient intake) and W = (alcohol intake measured by FFQs, nutrient intake measured by FFQs). It is known that W is both highly variable and biased as an estimator of X.

The Nurses' Health Study was designed so that a direct assessment of measurement error is possible. Specifically, 173 nurses recorded alcohol and nutrient intakes in diary form for four different weeks over the course of a year. The average, T, of these diary entries is taken to be an unbiased estimate of X. We will call T a second measure of X. Thus, in contrast to NHANES, measurement error was assessed on data internal to the primary study. Because T is unbiased for X, E(T|W) = E(X|W), so we can estimate E(X|W) by regressing T on W. Estimating E(X|W) is the crucial first step in regression calibration, a widely used method of correcting for measurement error; see Chapter 4.
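To make the last point concrete, here is a minimal Python sketch of that first calibration step. The data are synthetic stand-ins with an error model of our own invention, since the Nurses' Health Study data are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 173                                        # nurses with diary data, as in the text

# Synthetic stand-ins: X = true intake, W = FFQ (biased), T = diary average (unbiased).
x = rng.normal(0.0, 1.0, n)
w = 0.5 + 0.8 * x + rng.normal(0.0, 0.7, n)    # hypothetical biased FFQ
t = x + rng.normal(0.0, 0.4, n)                # unbiased second measure

# Because E(T | W) = E(X | W), a linear regression of T on W estimates the
# calibration function E(X | W), the first step of regression calibration.
slope, intercept = np.polyfit(w, t, 1)
print(f"Estimated E(X | W) is roughly {intercept:.2f} + {slope:.2f} * W")
```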
1.6.3 The Atherosclerosis Risk in Communities Study

The Atherosclerosis Risk in Communities (ARIC) study is a multipurpose prospective cohort study described in detail by The ARIC Investigators (1989). From 1987 through 1989, 15,792 male and female volunteers were recruited from four U.S. communities (Forsyth County, NC; suburban Minneapolis, MN; Washington County, MD; and Jackson, MS) for a baseline visit including at-home interviews, clinic examination, and laboratory measurements. Participants returned approximately every three years for second (1990–1992), third (1993–1995), and fourth (1996–1998) visits. Time to event data were obtained from annual participant interviews and review of local hospital discharge lists and county death certificates. The "event" was chronic kidney disease (CKD).

One purpose of the study was to explain the race effect on the progression of CKD. In particular, African-Americans have maintained approximately four times the age- and sex-adjusted rate of end-stage renal disease (ESRD) compared to whites during the last two decades (USRDS, 2003), while the prevalence of decreased kidney function (CKD Stage 3) in the U.S. is lower among African-Americans than whites. These patterns suggest that African-Americans progress faster through the different stages of kidney disease.

In Chapter 14 we investigate the race effect on the probability of progression to CKD using a survival data approach. An important confounder is the baseline kidney function, which is typically measured by the estimated glomerular filtration rate (eGFR), which is a noisy version of GFR obtained from a prediction equation. The nature of the adjustment is more complex because of the nonmonotonic relationship between eGFR and progression probability.

1.6.4 Bioassay in a Herbicide Study

Rudemo, Ruppert, & Streibig (1989) consider a bioassay experiment with plants, in which eight herbicides were applied. For each of these eight combinations, six (common) nonzero doses were applied and the dry weight Y of five plants grown in the same pot was measured. In this instance, the predictor variable X of interest is the amount of the herbicide actually absorbed by the plant, a quantity that cannot be measured. Here the response is continuous, and if X were observable, then a nonlinear regression model would have been fit, probably by nonlinear least squares. The four-parameter logistic model (not to be confused with logistic regression where the response is binary) is commonly used.

However, X is not observable; instead, we know only the nominal concentration W of herbicide applied to the plant. The sources of error include not only the error in diluting to the nominal concentration, but also the fact that two plants receiving the same amount of herbicide may absorb different amounts.

In this example, the measurement error was not assessed directly. Instead, the authors assumed that the true amount X was linearly related to the nominal amount W with nonconstant variance. This error model, combined with the approach discussed in Chapter 4, was used to construct a new model for the observed data.

1.6.5 Lung Function in Children

Tosteson, Stefanski, & Schafer (1989) described an example in which the response was the presence (Y = 1) or absence (Y = 0) of wheeze in children, which is an indicator of lung dysfunction. The predictor variable of interest is X = personal exposure to NO2. Since Y is a binary variable, if X were observable, the authors would have used logistic or probit regression to model the relationship of Y and X. However, X was not available in their study. Instead, the investigators were able to measure a bivariate variable W, consisting of observed kitchen and bedroom concentrations of NO2 in the child's home. School-aged children spend only a portion of their time in their homes, and only a portion of that time in their kitchens and bedrooms. Thus, it is clear that the true NO2 concentration is not fully explained by what happens in the kitchen and bedroom.

While X was not measured in the primary data set, two independent, external studies were available in which both X and W were observed. We will describe this example in more detail later in this chapter.

1.6.6 Coronary Heart Disease and Blood Pressure

The Framingham study (Kannel, Neaton, Wentworth, et al., 1986) is a large cohort study following individuals for the development Y of coronary heart disease. The main predictor of interest in the study is systolic blood pressure, but other variables include age at first exam, body mass, serum cholesterol, and whether or not the person is a smoker. In principle at least, Z consists only of age, body mass, and smoking status, while the variables X measured with error are serum cholesterol and systolic blood pressure. It should be noted that in a related analysis MacMahon, Peto, Cutler, et al. (1990) consider only the last as a variable measured with error. We will follow this convention in our discussion.

Again, it is impossible to measure long-term systolic blood pressure X. Instead, what is available is the blood pressure W observed during a clinic visit. The reason that the long-term X and the single-visit W differ is that blood pressure has major daily, as well as seasonal, variation. Generally, the classical measurement error model (1.1) is used in this context.

In this study, we have an extra measurement of blood pressure T from a clinic visit taken 4 years before W was observed. Hence, unlike any of the other studies we have discussed, in the Framingham study we have information on measurement error for each individual. One can look at T as simply a replicate of W. However, T may be a biased measure of X because of temporal changes in the distribution of blood pressure in the population. Each way of looking at the data is useful and leads to different methods of analysis.

1.6.7 A-Bomb Survivors Data

Pierce, Stram, Vaeth, et al. (1992) considered analysis of A-bomb survivor data from the Hiroshima and Nagasaki explosions. They discuss various responses Y, including the number of chromosomal aberrations. The true radiation dose X cannot be measured; instead, estimates W are available. They assume, as an approximation, that W = 0 if and only if X = 0. They adopt a fully parametric approach, specifying that when X and W are positive, then W is lognormal with median X and coefficient of variation of 30%. They assume that if X is positive, it has a Weibull distribution. In symbols, they propose the multiplicative model

W = X U,    log(U) ∼ Normal(µu, σu²),

where log(U) has mean µu = 0 and variance σu² = 0.0862.

1.6.8 Blood Pressure and Urinary Sodium Chloride

Liu & Liang (1992) described a problem of logistic regression where the response Y is the presence of high systolic blood pressure (greater than 140). However, in this particular study blood pressure was measured many times and the average recorded, so that the amount of measurement error in the average systolic blood pressure is reasonably small. The predictors Z measured without error are age and body mass index. The
predictor X subject to measurement error is urinary sodium chloride, which is subject to error because of intra-individual variation over time and also possibly due to measurement error in the chemical analyses. In order to understand the effects of measurement error, 24-hour urinary sodium chloride was measured on 6 consecutive days.

1.6.9 Multiplicative Error for Confidentiality

Hwang (1986) used survey data released by the U.S. Department of Energy on energy consumption by U.S. households. The exact values of certain variables, for example, heating and cooling degree days, were not given since this information might allow the homeowners to be identified. Instead the Department of Energy multiplied these variables by computer-generated random numbers. The Department of Energy released the method for generating the random errors, so this is a rare case where the error distribution is known exactly.

1.6.10 Cervical Cancer and Herpes Simplex Virus

In this example, the question is whether exposure to herpes simplex virus increases the risk of cervical cancer. The data are listed in Carroll, Gail, & Lubin (1993). The response Y is the indicator of invasive cervical cancer, X is exposure to herpes simplex virus, type 2 (HSV-2) measured by a refined western blot procedure, and W is exposure to HSV-2 measured by the western blot procedure. See Hildesheim, Mann, Brinton, et al. (1991) for biological background to this problem. There are 115 complete observations where (Y, X, W) is observed and 1,929 incomplete observations where only (Y, W) is observed. There are 39 cases (Y = 1) among the complete data and 693 cases among the incomplete data. Among the complete data, there is substantial misclassification, that is, observations where X ≠ W. Also, there is evidence of differential error, meaning that the probability of misclassification depends on the response, that is, P(X = W | X = x, Y = 0) ≠ P(X = W | X = x, Y = 1).

1.7 Checking the Classical Error Model

Suppose that the classical additive measurement error model (1.1) holds, and that the errors U are symmetric and have constant variance in both X and any covariates Z measured without error, that is, var(U|Z, X) = σ² (a constant). Then, if the instrument W can be replicated, the sample standard deviations of the W-values for individuals are uncorrelated with the individual means, and they are also uncorrelated with Z. Further, suppose that these errors are normally distributed. Then differences of the replicates within an individual are normally distributed. This leads to simple graphical devices, implemented in the short code sketch later in this section:

• Plot the sample standard deviation of the W-values for an individual against his/her sample mean, call it W̄. If there are no obvious trends, this suggests that the measurement error variance does not depend on X.

• Plot the sample standard deviation of the W-values for an individual against his/her covariates Z. If there are no obvious trends, this suggests that the measurement error variance does not depend on Z.

• Form the differences between replications within an individual, and then form a q-q plot of these differences across individuals. If the q-q plot shows no evidence of nonnormality, this suggests that the measurement errors are also roughly normally distributed.

Figure 1.6 OPEN Study data, plot of the within-individual standard deviation versus mean of the actual untransformed protein biomarkers. The obvious regression slope indicates that the variance of the measurement error depends on true protein intake.

For example, consider the protein biomarker in the OPEN study; see Section 1.2. In Figure 1.6 we plot the standard deviation of the replicates versus the mean in the original protein scale. The fact that there is an obvious regression slope and the standard deviation of the biomarker varies by a factor of four over the range of the biomarker's mean is strong evidence that, at the very least, the variance of the measurement error depends on true intake.
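These checks are easy to script. The following Python sketch generates hypothetical replicate data with an error variance that grows with true intake (mimicking the raw protein biomarker) and then draws the two key diagnostics, the within-person standard deviation versus mean plot and the q-q plot of within-person differences. Everything here, including the error model and parameter values, is an illustrative assumption of ours.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical replicate data: two error-prone measurements per person, with
# error standard deviation proportional to true intake.
n = 500
x = rng.lognormal(mean=5.8, sigma=0.3, size=n)                 # true intake
W = x[:, None] + 0.15 * x[:, None] * rng.standard_normal((n, 2))

means = W.mean(axis=1)                  # per-person mean
sds = W.std(axis=1, ddof=1)             # per-person standard deviation
diffs = W[:, 0] - W[:, 1]               # within-person differences

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
axes[0].scatter(means, sds, s=8)        # a trend here => error variance depends on X
axes[0].set_xlabel("per-person mean of W")
axes[0].set_ylabel("per-person s.d. of W")
stats.probplot(diffs, dist="norm", plot=axes[1])   # q-q plot of differences
plt.tight_layout()
plt.show()

# Repeating the same two plots after taking logarithms of W (as in Figures
# 1.7 and 1.8 below) should largely remove the trend in the first panel.
```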
Figure 1.7 OPEN Study data, plot of the within-individual standard deviation versus mean of the log protein biomarkers. The lack of any major regression slope indicates approximately constant variance measurement error.

Figure 1.8 OPEN Study data, q-q plot of the differences of the log protein biomarkers. The nearly straight line of the data indicates nearly normally distributed measurement errors.

Figure 1.9 Normal q-q plot of the differences between independent Lognormal(0,1) random variables, n = 200.

A standard way to remove nonconstant variability is via a transformation, and the obvious first attempt is to take logarithms. Figure 1.7 is the standard deviation versus the mean plot in this transformed scale. In contrast to Figure 1.6, here we see no major trend, suggesting that the transformation was successful in removing most of the nonconstant variation. Figure 1.8 gives the q-q plot of the differences: this is not a perfect straight line, but it is reasonably close to straight, suggesting that the transformation has also helped make the data much closer to normally distributed.

Using differences between replicates to assess normality has its pitfalls. The difference between two iid random variables has a symmetric distribution even when the random variables themselves are highly skewed. Thus, nonnormality of measurement errors is somewhat hidden by using differences. For example, Figure 1.9 is a normal q-q plot of the differences between 200 pairs of Lognormal(0,1) random variables; see Section A.2 for the lognormal distribution. Note that the q-q plot shows no sign of asymmetry. Nonnormality is evident only in the presence of heavier-than-Gaussian tails.
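A two-line experiment makes the pitfall concrete. This Python sketch, our own illustration rather than code from the OPEN study, shows that differencing iid lognormal errors wipes out the skewness and leaves only heavy tails as a clue to nonnormality.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

u1 = rng.lognormal(0.0, 1.0, 200)     # highly skewed "measurement errors"
u2 = rng.lognormal(0.0, 1.0, 200)
d = u1 - u2                           # differences between iid replicates

# The differences are symmetric by construction, so skewness is hidden ...
print("skewness of U:                 ", round(stats.skew(u1), 2))
print("skewness of differences:       ", round(stats.skew(d), 2))

# ... and nonnormality shows up only through heavier-than-Gaussian tails.
print("excess kurtosis of differences:", round(stats.kurtosis(d), 2))
```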
1.8 Loss of Power

Classical measurement error causes loss of power, sometimes a profound loss of power. We illustrate this in two situations: linear regression and radiation epidemiology.

1.8.1 Linear Regression Example

Figure 1.10 An illustration of the loss of power when there is classical measurement error. When X is observed, the measurement error variance = 0.0, and the power is 90%. When X is not observed and the measurement error variance = 1.0, 1/2 of the variability of the observed W is due to noise, and the power is only 62%. When 2/3 of the variability of W is due to noise, the power is only 44%.

Here we consider the simple linear regression model

Yi = β0 + βx Xi + εi ,

where β0 = 0.0, βx = 0.69, var(X) = var(ε) = 1.0, and the sample size is n = 20. The results here are based on exact calculations using the program nQuery Advisor. The slope was chosen so that, when X is observed, there is approximately 90% power for a one-sided test of the null hypothesis H0: βx = 0.

We added classical measurement error to the true Xs using the model (1.1), where we varied the variance of the measurement errors U from 0.0 to 2.0. When var(U) = 0.0, we are in the case that there is no classical measurement error, and the power is 90%. When the measurement error variance is var(U) = 1.0, this means that the observed predictors have variance var(W) = var(X) + var(U) = 2.0, and hence 1/2 of the variability in the observed predictors is due to noise. At the extreme with var(U) = 2.0, 2/3 of the variability in the observed predictors is due to noise.

The results are displayed in Figure 1.10. Here we see that while the power would be 90% if X could be observed, when the measurement error variance equals the variance of X, and hence 1/2 of the variability in W is due to noise, the power crashes to 62%. Even worse, when 2/3 of the variability in the observed W is noise, the power falls below 50%. This is the first of the double whammy of measurement error; see Section 1.1.

Figure 1.11 The sample size version of Figure 1.10. When there is no measurement error, the sample size needed for 90% power is n = 20. When X is not observed and the measurement error variance = 1.0, 1/2 of the variability of the observed W is due to noise, the necessary sample size for 90% power more than doubles to n = 45. When 2/3 of the variability of W is due to noise, the required sample size is n > 70.

The flip side of a loss of power due to classical measurement error is that sample sizes necessary to gain a given power can increase dramatically. The following power calculations were done assuming all variances are known, and so should be interpreted qualitatively. In Figure 1.11, we show that while only n = 20 is required for 90% power when there is no measurement error, when 1/2 of the variability in the observed predictor W is due to noise, we require at least n = 45 observations, more than double the original sample size! Even more dramatic, when 2/3 of the variability in the observed predictor W is due to noise, the required sample size more than triples, to over n = 70!
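The qualitative message is easy to reproduce. The sketch below uses a normal approximation to the power of the one-sided slope test when Y is regressed on the error-prone W, so the numbers differ slightly from the exact nQuery Advisor calculations quoted above; the attenuation of the slope to λβx with λ = var(X)/{var(X) + var(U)} (see Section 3.2.1) is what drives the loss of power.

```python
import numpy as np
from scipy.stats import norm

beta_x, var_x, var_eps, n, alpha = 0.69, 1.0, 1.0, 20, 0.05

def power_one_sided(var_u):
    """Approximate power of the one-sided test of H0: slope = 0 when Y is
    regressed on W = X + U instead of X (normal approximation)."""
    lam = var_x / (var_x + var_u)                    # attenuation
    slope_w = lam * beta_x                           # slope of Y on W
    var_w = var_x + var_u
    resid_var = var_eps + beta_x**2 * var_x * (1 - lam)
    ncp = slope_w * np.sqrt(n * var_w / resid_var)   # approximate noncentrality
    return norm.sf(norm.ppf(1 - alpha) - ncp)

for var_u in [0.0, 1.0, 2.0]:
    print(f"var(U) = {var_u:.1f}: power is roughly {power_one_sided(var_u):.2f}")
# Gives roughly 0.92, 0.62, and 0.46, close to the 90%, 62%, and 44% reported
# from the exact calculations.
```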
1.8.2 Radiation Epidemiology Example

In this section, we describe a second simulation showing the effects of classical measurement error. In particular, we show that if one assumes that the measurement error is entirely Berkson but it is partially classical, then one grossly overestimates power.

Figure 1.12 Simulation results for radiation epidemiology, with true excess relative risk 4.0. Displayed is the power for detecting an effect from an analysis ignoring measurement error when the percentage of the total error that is classical varies from 0% (all Berkson) to 30% (majority Berkson). Note the very rapid loss of power that accrues when errors become classical in nature. This simulation study shows that if one thinks that all measurement error is Berkson, but 30% is classical, then one overestimates the power one really has.

In the Hanford Thyroid Disease Study (HTDS) litigation, we used what is called an excess relative risk model. Let Z denote gender, X denote true but unobservable dose to the thyroid, and Y be some indicator of thyroid disease. Let H(x) = {1 + exp(−x)}⁻¹ be the logistic distribution function, often simply called the logistic function. Then the model fit is

pr(Y = 1|X, Z) = H{β0 + βz Z + log(1 + βx X)}.    (1.4)

The parameter βx is the excess relative risk parameter.

In our simulations, we used the model of Reeves et al. (2001) and Mallick et al. (2002) to simulate true and calculated doses. This model consists of the variables described above, along with a latent intermediate variable L between X and W that allows for mixtures of Berkson and classical error. This model is

log(X) = log(L) + Ub ,    (1.5)
log(W) = log(L) + Uc ,    (1.6)

where Ub denotes Berkson-type error, and Uc denotes classical-type error. The standard classical measurement error model (1.1) is obtained by setting Ub = 0. The Berkson model (1.2) is obtained by setting Uc = 0.

The simulation assumed that log(L), Ub and Uc were normally distributed. The details of the simulation were as follows, with the values chosen roughly in accord with data in the HTDS; a code sketch of one replicate of this setup follows the list:

• The number of study participants was n = 3,000, with equal numbers of men and women. With no dose, (β0, βz) were chosen so that men had a disease probability of 0.0049, while the disease probability for women was 0.0098. The excess relative risk was βx = 4.0.

• The mean of log(L) = log(0.10).

• The standard deviation of log(W) = log(2.7).

• The variance of the Berkson errors is σb², the variance of the classical errors is σc², and √(σb² + σc²) = log(2.3).

• The values of σb² and σc² were varied so that the percentage of the total measurement error variance that is Berkson is 100%, 90%, 80%, and 70%. Stram and Kopecky (2003) mention that there is classical uncertainty in HTDS because of the use of a food frequency questionnaire to measure milk and vegetable consumption.

• A maximum likelihood analysis ignoring measurement errors was employed.
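The following Python fragment is a condensed, single-replicate sketch of such a simulation; the actual study used many replicates and reported power and median estimates. The 70% Berkson split and the optimizer settings are our own choices, and the "naive" fit simply plugs the calculated dose W into model (1.4) as if it were the true dose.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit                    # the logistic function H(.)

rng = np.random.default_rng(5)
n, beta_x = 3000, 4.0                              # true excess relative risk 4.0

# Baseline (zero-dose) disease probabilities: 0.0049 for men, 0.0098 for women.
z = rng.integers(0, 2, n)                          # 0 = male, 1 = female
beta0 = np.log(0.0049 / (1 - 0.0049))
beta_z = np.log(0.0098 / (1 - 0.0098)) - beta0

# Mixture of Berkson and classical error, models (1.5)-(1.6); here 70% Berkson.
sd_total, frac_berkson = np.log(2.3), 0.7
sd_b = np.sqrt(frac_berkson) * sd_total
sd_c = np.sqrt(1 - frac_berkson) * sd_total
sd_L = np.sqrt(np.log(2.7) ** 2 - sd_c ** 2)       # so that sd{log(W)} = log(2.7)
logL = np.log(0.10) + sd_L * rng.standard_normal(n)
x = np.exp(logL + sd_b * rng.standard_normal(n))   # true dose, (1.5)
w = np.exp(logL + sd_c * rng.standard_normal(n))   # calculated dose, (1.6)

# Disease model (1.4), then a naive ML fit that uses W in place of X.
y = rng.random(n) < expit(beta0 + beta_z * z + np.log1p(beta_x * x))

def negloglik(theta, dose):
    b0, bz, bx = theta
    p = np.clip(expit(b0 + bz * z + np.log1p(np.maximum(bx, 0) * dose)), 1e-10, 1 - 1e-10)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

fit = minimize(negloglik, x0=[beta0, beta_z, 1.0], args=(w,), method="Nelder-Mead")
print("naive excess relative risk estimate:", round(fit.x[2], 2))
```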
The results of this simulation are displayed in Figures 1.12 and 1.13. Figure 1.12 shows one aspect of the double whammy, namely the profound loss of power when the data are subject to classical measurement error. Note that the power is 75% when all the measurement error is Berkson, but when 30% of the total measurement error variance is classical, the power drops to nearly 40%. In this instance, the practically important part of this simulation has to do with the power that is announced at the design stage. If one assumes that all uncertainty is Berkson, then one will announce 75% power to detect an excess relative risk of 4.0, and one would expect to see an effect of this magnitude. However, if in reality 30% of the measurement error is classical, then the actual power is only 40%, and one would not expect to see a statistically significant result. A statistically nonsignificant result could then easily be misinterpreted as a lack of effect, rather than the actuality: a lack of power. If you want to convince yourself that you have lots of statistical power despite measurement error, just pretend that the measurement error is Berkson and not classical.

Figure 1.13 Simulation results for radiation epidemiology, with true excess relative risk 4.0. Displayed are the median estimated excess relative risks from an analysis ignoring measurement error when the percentage of the total error that is classical varies from 0% (all Berkson) to 30% (majority Berkson). Note the very rapid bias that accrues when errors become classical in nature.

Figure 1.13 shows the other aspect of the double whammy, namely the bias caused by classical measurement error. Note that the true value of the excess relative risk is 4.0, and if all uncertainty is Berkson, then the estimated excess relative risk has median very close to the actual value. In other words, there is not much bias in parameter estimation in the Berkson scenario. However, as more of the measurement error becomes classical, a far different story emerges. Thus, if 30% of the measurement error variance is classical, one will tend to observe an excess relative risk of 2.0, half the actual risk. Again, if one assumes all uncertainty is Berkson, one might well conclude that there is little risk of radiation with an excess relative risk of 2.0, but in fact the much larger true relative risk would be masked by the classical measurement error.

1.9 A Brief Tour

As noted in the preface, this monograph is structured into four parts: background material, functional modeling where the marginal distribution of X is not modeled, structural modeling where a parametric model for the marginal distribution of X is assumed, and specialized topics. Here we provide another brief overview of where we are going.

It is commonly thought that the effect of measurement error is "bias toward the null," and hence that one can ignore measurement error for the purpose of testing whether a predictor is "statistically significant." This lovely and appealing folklore is sometimes true but unfortunately often wrong. The reader may find Chapters 3 and 10 (especially Section 3.6) instructive, for it is in these chapters that we describe in detail the effects of ignoring measurement error.

With continuously measured variables, the classical error model (1.1) is often assumed. The question of how one checks this assumption has not been discussed in the literature. Section 1.7 suggests one such method, namely, plotting the intra-individual standard deviation against the mean, which should show no structure if (1.1) holds. This, and a simple graphical device to check for normality of the errors, are described in Section 8.5. Often, the measured value of W is replicated, and the usual assumption is that the replicates are independent.

Having specified an error model, one can use either functional modeling methods (Chapters 4–7) or structural modeling methods (Chapters 8–9). Hypothesis testing is discussed in Chapter 10. Longitudinal data and mixed models are described briefly in Chapter 11. Density estimation and nonparametric regression methods appear in Chapters 12–13. The analysis of survival data and cases where the response is measured with errors occurs in Chapters 14–15.

Bibliographic Notes

The model (1.3) for measurement error of a FFQ actually goes back to Cochran (1968), who cited Pearson (1902) as his inspiration; see Carroll (2003). The paper of Pearson is well worth reading for its innovative use of statistics to understand the biases in self-report instruments, in his case the bisection of lines. Amusingly, Pearson had a colleague who did the actual work of measuring the errors made in bisecting 1,500 lines: "Dr. Lee spent several months in the summer of 1896 in the reduction of the observations," one of the better illustrations of why one does not want to be a postdoc.

Our common notation that X stands for the covariate measured with error, W is its mismeasured version, and Z are the covariates measured without error is not, unfortunately, the only notation in use. Fuller (1987), for example, used "x" for the covariate measured with error, and "X" for its mismeasured version. Pierce & Kellerer (2004) use "x" for the covariate measured with error and "z" for its mismeasured version. Many authors use "Z" for the mismeasured version of our X, and others even interchange the meaning of Z and X! Luckily, some authors use our convention, for example, Tsiatis & Ma (2004) and Huang & Wang (2001). The annoying lack of a common notation (we, of course, think ours is the best) can make it rather difficult to read papers in the area.

The program nQuery Advisor is available from Statistical Solutions, Stonehill Corporate Center, Suite 104, 999 Broadway, Saugus, MA 01906, http://www.statsol.ie/nquery/nquery.htm. We are not affiliated with that company.

Tosteson, Buzas, Demidenko, & Karagas (2003) studied power and sample size for score tests for generalized regression models with covariate measurement error and provide software which, at the time of this writing, is at http://biostat.hitchcock.org/MeasurementError/Analytics/SampleSizeCalculationsforLogisticRegression.asp.
CHAPTER 2

IMPORTANT CONCEPTS

2.1 Functional and Structural Models

Historically, the taxonomy of measurement error models has been based


upon two major defining characteristics. The first is the structure of the
error model relating W to X, and the second is the type and amount of
additional data available to assess the important features of this error
model, for example, replicate measurements as in the Framingham data
or second measurements as in the NHANES study. These two factors,
error structure and data structure, are clearly related, since more so-
phisticated error models can be entertained only if sufficient data are
available for estimation. We take up the issue of error models in detail
in Section 2.2, although it is a recurrent theme throughout the book.
The second defining characteristic is determined by properties of the
unobserved true values Xi , i = 1, . . . , n. Traditionally, a distinction was
made between classical functional models, in which the Xs are regarded
as a sequence of unknown fixed constants or parameters, and classical
structural models, in which the Xs are regarded as random variables. The
trouble with the classical functional models is that one is then tempted
to use the maximum likelihood paradigm to estimate the nuisance pa-
rameters, the Xs, along with the parameters of interest in a regression
model. This approach works in linear regression, but in virtually no other context; see, for example, Stefanski & Carroll (1985) for one of the many examples where the method fails.
We believe that it is more fruitful to make a distinction between func-
tional modeling, where the Xs may be either fixed or random, but in
the latter case no, or only minimal, assumptions are made about the
distribution of the Xs, and structural modeling, where models, usually
parametric, are placed on the distribution of the random Xs. Besides
the fact that our approach is cleaner, it also leads to more useful meth-
ods of estimation and inference than the old idea of treating the Xs as
parameters.
Likely the most important concept to keep in mind is the idea of ro-
bust model inference. In the functional modeling approach, we make no
assumptions about the distribution of the unobserved Xs. In contrast, in
a typical structural approach, some version of a parametric distribution

for the Xs is assumed, and concern inevitably arises that the resulting estimates and inferences will depend upon the parametric model chosen. Over time, the two approaches have been moving closer to one another. For example, in a number of Bayesian structural approaches, flexible parametric models have been chosen, the flexibility helping in terms of model robustness; see, for example, Carroll, Roeder, & Wasserman (1999) and Mallick, Hoffman, & Carroll (2002). Tsiatis & Ma (2004) described a frequentist method that is functional in the sense that the estimators are consistent no matter what the distribution of X is, but the method is also structural in the sense that a pilot distribution for X must be specified. The Tsiatis–Ma approach involves solving an integral equation, or at least approximating the solution, and either way of implementation requires major computational effort.

Functional modeling is at the heart of the first part of this book, especially in Chapters 4, 5, and 7. The key point is that even when the Xs form a random sample from a population, functional modeling is useful because it leads to estimation procedures that are robust to misspecification of the distribution of X. As described in Chapter 8, structural modeling has an important role to play in applications (see also Chapter 9), but a concern is the robustness of inference to assumptions made about the distribution of X.

Throughout, we will treat Z1, . . . , Zn as fixed constants, and our analyses will be conditioned on their values. The practice of conditioning on known covariates is standard in regression analysis.

2.2 Models for Measurement Error

2.2.1 General Approaches: Berkson and Classical Models

A fundamental prerequisite for analyzing a measurement error problem is specification of a model for the measurement error process. There are two general types:

• Error models, including classical measurement error models, where the conditional distribution of W given (Z, X) is modeled.

• Regression calibration models, including Berkson error models, where the conditional distribution of X given (Z, W) is modeled.

We have already discussed two variants of the classical error model, see (1.1) for the simplest additive model, and see (1.3) for an example of biases in the instrument, along with a more complex variance components structure. Somewhat more generally, we can suppose that the relationship between the measured W and the unobserved X also depends on the observed predictors Z, as in, for example,

W = γ0 + γx^t X + γz^t Z + U,    E(U|X, Z) = 0.    (2.1)

Thus, in the OPEN study described in Section 1.5, it is possible that bias in the FFQ might depend on gender, age, or even body mass index.

In (2.1), the measurement errors U have mean zero given the observed and unobserved covariates, but nothing is said otherwise about the structure of U. In the OPEN study, the variability might very well be expected to depend on gender. In addition, nothing is said about the distributional structure of U, so that, for example, if we have replicated instruments Wij for the ith person, the Uij might have a variance components structure, as in (1.3).

By a regression calibration model we mean one which focuses on the distribution of X given (Z, W). We have already described the most famous case, the Berkson model; see equation (1.2). More generally, one might be willing to model the distribution of the unobserved covariates directly as a function of the observed versions, as in

X = γ0 + γ1^t W + γ2^t Z + U,    E(U|Z, W) = 0.    (2.2)

The Berkson model says that true X is unbiased for nominal W, so that γ0 = γ2 = 0 and γ1 = 1. Mallick & Gelfand (1996) basically started with model (2.2), for example.

2.2.2 Is It Berkson or Classical?

Compared to complex undertakings such as rocket science or automotive repair, determining whether data follow the classical additive measurement error model (1.1) or the standard Berkson error model (1.2) is generally simple in practice. Basically, if the choice is between the two, then the choice is classical if an error-prone covariate is necessarily measured uniquely to an individual, and especially if that measurement can be replicated. Thus, for example, if people fill out a food frequency questionnaire or if they get a blood pressure measurement, then the errors and uncertainties are of classical type. If all individuals in a small group or strata are given the same value of the error-prone covariate, for example, textile workers or miners working in a job classification for a fixed number of years are assigned the same exposure to dust, but the true exposure is particular to an individual, as it almost certainly would be, then the measurement error is Berkson. In the herbicide study, the measured concentration W is fixed by design and the true concentration X varies due to error, so that the Berkson model is more appropriate.

Other differences between Berkson and classical error models, which might help distinguish between them in practice, are that in the classical model the error U is independent of X, or at least E(U|X) = 0, while for Berkson errors U is independent of W or at least E(U|W) = 0. Therefore, var(W) > var(X) for classical errors and var(X) > var(W) for Berkson errors.
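This variance comparison is easy to see in a quick simulation. The sketch below is illustrative only, with arbitrary parameter values of our own choosing; it generates data under each error model and confirms the ordering.

```python
import numpy as np

rng = np.random.default_rng(7)
n, sigma_u = 100_000, 0.5

# Classical: W = X + U, so the observed values are noisier than the truth.
x_c = rng.normal(0.0, 1.0, n)
w_c = x_c + rng.normal(0.0, sigma_u, n)

# Berkson: X = W + U, so the truth is noisier than the assigned values.
w_b = rng.normal(0.0, 1.0, n)
x_b = w_b + rng.normal(0.0, sigma_u, n)

print("classical: var(W) =", round(float(w_c.var()), 2), "> var(X) =", round(float(x_c.var()), 2))
print("Berkson:   var(X) =", round(float(x_b.var()), 2), "> var(W) =", round(float(w_b.var()), 2))
```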
In practice, the choice is not always between the classical additive measurement error model (1.1) and the standard Berkson error model (1.2). As we have seen, with a food frequency questionnaire, as well as with other instruments based on self-report, more complex models incorporating biases are required. We still measure quantities unique to the individual, and the measurements can in principle be replicated, but biases must be entertained, and the general classical error model (2.1) is appropriate.

Outside of the Berkson model, the general regression calibration model (2.2) is typically used in ad hoc ways, simply as a modeling device and not based on any fundamental considerations. Briefly, as we will see later in this book, likelihood-type calculations can become fairly simple if one has a regression calibration model in hand and one can estimate it. For example, consider the lung function study of Tosteson et al. (1989). In this study, interest was in the relationship of long-term true NO2 intake, X, in children on the eventual development of lung disease. The variable X was not available. The vector W consists of bedroom and kitchen NO2 levels as measured by in situ or stationary recording devices. Certainly, X and W are related, but children are exposed to other sources of NO2, for example, in other parts of the house, at school, etc.

The available data consisted of the primary study in which Y and W were observed, and two external studies, from different locations, study populations, and investigators, in which (X, W) were observed. In this problem, the regression calibration model (2.2) seems physically reasonable, because a child's total exposure X can be thought of as a sum of in-home exposure and other uncontrolled factors. Tosteson, Stefanski, & Schafer (1989) fit (2.2) to each of the external studies, found remarkable similarities in the estimated regression calibration parameters (γ), and concluded that the assumption of a common model for all three studies was a reasonable working assumption.

A general error model (2.1) could also have been fit. However, W here is bivariate, X is univariate, and implementation of estimates and inferences is simply less convenient here than it is for a regression calibration model.

2.2.3 Berkson Models from Classical

There is an interesting relationship at a technical level between error models and regression calibration models; see Chapter 4. This relationship is important in regression calibration where a model for X given W is needed, but we start with a model for W given X. If one has a structural model so that one knows the marginal distribution of X, then an error model can be converted into a regression calibration model by Bayes theorem. Specifically,

fX|W(x|w) = fW|X(w|x) fX(x) / ∫ fW|X(w|x) fX(x) dx,

where fX is the density of X, fW|X is the density of W given X, and fX|W is the density of X given W. For example, suppose that W = X + U, where X and U are uncorrelated. Then, as discussed in Section A.4, the best linear predictor of X given W is (1 − λ)E(X) + λW, and

X = (1 − λ)E(X) + λW + U*,    (2.3)

where λ = σx²/(σx² + σu²) is the attenuation, U* = (1 − λ){X − E(X)} − λU, and a simple calculation shows that U* and W are uncorrelated. If X and U are independent and normally distributed, then so are X and U*. Attenuation means that the magnitude of a regression coefficient is biased towards zero, and the attenuation coefficient measures the size of the attenuation in simple linear regression with classical additive measurement errors; see Section 3.2.1.

Equation (2.3) has the form of a Berkson model, even though the error model is classical. Note, however, that the slope of X on W is λ, not 1. Therefore, the variance of X is smaller than the variance of W in keeping with the classical rather than Berkson errors.
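Equation (2.3) can be checked numerically. The short Python sketch below, our own illustration with arbitrary parameter values, simulates a classical error model, forms λ and U*, and verifies both the exact decomposition and the fact that the regression of X on W has slope λ rather than 1.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000
mu_x, sigma_x, sigma_u = 2.0, 1.0, 0.8

x = mu_x + sigma_x * rng.standard_normal(n)
u = sigma_u * rng.standard_normal(n)
w = x + u                                          # classical error model (1.1)

lam = sigma_x**2 / (sigma_x**2 + sigma_u**2)       # attenuation lambda
u_star = (1 - lam) * (x - mu_x) - lam * u          # U* from equation (2.3)

# X decomposes exactly as (1 - lambda) E(X) + lambda W + U*,
# and U* is (empirically) uncorrelated with W.
print("exact decomposition holds:", np.allclose(x, (1 - lam) * mu_x + lam * w + u_star))
print("corr(U*, W):", round(float(np.corrcoef(u_star, w)[0, 1]), 4))
print("slope of X on W:", round(float(np.polyfit(w, x, 1)[0]), 3), " lambda:", round(lam, 3))
```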
2.2.4 Transportability of Models

In some studies, the measurement error process is not assessed directly, but instead data from other independent studies, called external data sets, are used. In this section, we discuss the appropriateness of using information from independent studies and the manner in which this information should be used.

We say that parameters of a model can be transported from one study to another if the model holds with the same parameter values in both studies. Typically, in applications only a subset of the model parameters need be transportable. Transportability means that not only the model but also the relevant parameter estimates can be transported without bias.

In many instances, approximately the same classical error model holds across different populations. For example, consider systolic blood pressure at two different clinical centers. Assuming similar levels of training for technicians making the measurements and a similar protocol, for example, sitting after a resting period, it is reasonable to expect that the distribution of the error in the recorded measure W does not depend on the clinical center one enters, or on the technician making the measurement, or on the value of X being measured, except possibly for heteroscedasticity. Thus, in classical error models it is often reasonable to assume that the error distribution of W given (Z, X) is the same across different populations. However, even here some care is needed because a major component of the measurement error might be sampling error. If the populations differ in temporal variation or sampling frequency, then the error distribution would differ.

Much, much more rarely, the same regression calibration model can sometimes be assumed to hold across different studies. For example, consider the NO2 study described in Section 1.6.5. If we have two populations of suburban children, then it may be reasonable to assume that the sources of NO2 exposure other than the bedroom and kitchen will be approximately the same, and the error models are transportable. However, if one study consists of suburban children living in a nonindustrial area and the second study consists of children living in an inner city near an industrialized area, the assumption of transportable error models would be tenuous at best.

2.2.5 Potential Dangers of Transporting Models

The use of independent-study data to assess error model structure carries with it the danger of introducing estimation bias into the primary study analysis.

First, consider the regression calibration model for NO2 intake. The primary data set of Tosteson, Stefanski, & Schafer (1989) (Section 1.6.5) is a sample from Watertown, Massachusetts. Two independent data sets were used to fit the parameters in (2.2): one from the Netherlands and one from Portage, Wisconsin. The parameter estimates for this model in the two external data sets were essentially the same, leading Tosteson et al. (1989) to conclude that the common regression relationship from the Dutch and Portage studies was likely to be appropriate for the Watertown study as well. However, as these authors note in some detail, it is important to remember that this is an assumption, plausible in this instance, but still one not to be made lightly. If Watertown were to have a much different pattern of NO2 exposure than Portage or the Netherlands, then the estimated parameters in model (2.2) from the latter two studies, while similar, might be biased for the Watertown study, and the results for Watertown hence incorrect.

The issue of transporting results for error models is critical in the classical measurement error model as well. Consider the MRFIT study (Kannel et al., 1986), in which X is long-term systolic blood pressure. The external data set is the Framingham data (MacMahon, Peto, Cutler, et al., 1990). Carroll & Stefanski (1994) discussed these studies in detail, but here we use the studies only to illustrate the potential pitfalls of extrapolating across studies. It is reasonable to assume that the classical measurement error model (1.1) holds with the same measurement error variance for both studies, which reduces to stating that the distribution of W given (Z, X) is the same in the two studies. However, the distribution of X appears to differ substantially in the two studies, with the MRFIT study having smaller variance. Under these circumstances, while the error model is probably transportable, a regression calibration model formed from Framingham would not be transportable to MRFIT. The problem is that, by Bayes's theorem, the distribution of X given (Z, W) depends on both the distribution of W given (Z, X) and the distribution of X given Z, and the latter is not transportable.

Figure 2.1 Comparison of % calories from fat using a food frequency questionnaire from two studies. Note how the distributions seem very different, calling into question whether the distribution of true intake can be transported between studies.

That this is not merely a theoretical exercise is illustrated in Figure 2.1, which shows the distribution of calories from fat (fat density) for two study populations. In this figure, we plot the observed values W from two studies: the validation arm of the Nurses' Health Study (NHS) and a study done by the American Cancer Society (ACS), both using the same food frequency questionnaire (FFQ). What we see in this figure is that
the observed fat density for the ACS study seems to have much more variability than the data in the NHS. Assuming that the error properties of the FFQ were the same in the two studies, it would clearly make no sense to pretend that the distribution of the exact predictor X is the same in the two studies, that is, the distribution of the exact predictor is not transportable.

2.2.6 Semicontinuous Variables

Some variables—such as nutrient intakes of food groups, such as red meat, or environmental exposures—have a positive probability of being zero and otherwise have a positive continuous distribution. Such variables have been called semicontinuous by Schafer (1997). An example is radiation exposure in the atomic bomb survivors study described in Section 1.6.7. As mentioned in that section, Pierce, Stram, Vaeth, et al. (1992) assume that W = 0 if and only if X = 0. In many studies, this assumption is unlikely to hold and is, at best, a useful approximation.

An alternative model was used by Li, Shao, & Palta (2005). These authors assume that there exists a latent continuous variable V such that

X = max(0, V) and W = max(0, V + U),

where U is measurement error. When both X and W are positive, then the usual classical measurement error model W = X + U holds. Notice that it is possible for X to be zero while W is positive, or vice versa.
that it is possible for X to be zero while W is positive, or vice versa.
than a single observation, that is, the classical error model is the target.
Such data cannot be used to test whether W is unbiased for X, as in
2.2.7 Misclassification of a Discrete Covariate the classical measurement error model (1.1), or biased, as in the general
measurement error model (2.1). However, if one is willing to assume
So far in this chapter, it has been assumed that the mismeasured co- (1.1), then replication data can be used to estimate the variance of the
variate is continuously distributed. For discrete covariates, measurement measurement error U.
error means misclassification. A common situation is a binary covariate, Data sets sometimes contain a second measurement T, which may or
where X and W are both either 0 or 1, for example, the diagnoses of her- may not be unbiased for X, in addition to the primary measurement
pes simplex virus by the refined western blot and western blot tests dis- W. If T is internal, then it need not be unbiased to be useful. In this
cussed in Section 1.6.10. In such cases, the misclassification model can be case, T is called an instrumental variable (IV) and can be used in an
parameterized using the misclassification probabilities pr(W = 1|X = 0) instrumental variable analysis provided that T possesses certain other
and pr(W = 0|X = 1); see Section 8.4. statistical properties (Chapter 6). If T is external, then it is useful in
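When an internal validation sample records both X and W, these two probabilities can be estimated by the corresponding sample proportions. The sketch below is our own illustration with made-up 0/1 data, not an example from the text.

```python
import numpy as np

# Hypothetical internal validation data: true X and error-prone W, both binary.
x = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0])
w = np.array([0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0])

pr_w1_given_x0 = np.mean(w[x == 0] == 1)   # estimate of pr(W = 1 | X = 0)
pr_w0_given_x1 = np.mean(w[x == 1] == 0)   # estimate of pr(W = 0 | X = 1)

print("estimated pr(W = 1 | X = 0):", pr_w1_given_x0)
print("estimated pr(W = 0 | X = 1):", pr_w0_given_x1)
```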
2.3 Sources of Data

In order to perform a measurement error analysis, as seen in (2.1)–(2.2), one needs information about either W given (X, Z) (classical measurement error) or about X given (Z, W) (regression calibration). In this section, we will discuss various data sources that allow estimation of the critical distributions. These data sources can be partitioned into two main categories:

• Internal subsets of the primary data.
• External or independent studies.

Within each of these broad categories, there are three types of data, all of which we assume to be available in a random subsample of the data set in question:

• Validation data, in which X is observable directly. This is the relatively rare circumstance where a measurement error problem is also a missing data problem.
• Replication data, in which replicates of W are available.
• Instrumental data, in which another variable T is observable in addition to W.

An internal validation data set is the ideal, because it can be used with all known techniques, permits direct examination of the error structure, and typically leads to much greater precision of estimation and inference. We cannot express too forcefully that if it is possible to construct an internal validation data set, one should strive to do so. External validation data can be used to assess any of the models (1.1)–(2.2) in the external data, but one is always making an assumption when transporting such models to the primary data.

Usually, one would make replicate measurements if there were good reason to believe that the replicated mean is a better estimate of X than a single observation, that is, the classical error model is the target. Such data cannot be used to test whether W is unbiased for X, as in the classical measurement error model (1.1), or biased, as in the general measurement error model (2.1). However, if one is willing to assume (1.1), then replication data can be used to estimate the variance of the measurement error U.

Data sets sometimes contain a second measurement T, which may or may not be unbiased for X, in addition to the primary measurement W. If T is internal, then it need not be unbiased to be useful. In this case, T is called an instrumental variable (IV) and can be used in an instrumental variable analysis provided that T possesses certain other statistical properties (Chapter 6). If T is external, then it is useful in general only if it is unbiased for X. In this case, T can be used in a regression calibration analysis (Chapter 4).

2.4 Is There an "Exact" Predictor? What Is Truth?

We have based our discussion on the existence of an exact predictor X and measurement error models that provide information about this
predictor. However, in practice, it is often the case that the term exact or true needs to be carefully defined prior to discussion of error models.

In almost all cases, one has to take an operational definition for the exact predictor. In the measurement error literature, the term gold standard is often used for the operationally defined exact predictor, though sometimes this term is used for an exact predictor that cannot be operationally defined. In the NHANES study, the operational definition is the average saturated fat intake over a year-long period as measured by the average of 24-hour recall instruments. One can think of this as the best measure of exposure that could possibly be determined in practice, and even here it is extremely difficult to measure this quantity. Having made this operational definition for X, we are in a position to undertake an analysis, for clearly the observed measure W is unbiased for X when measured on a randomly selected day. In this case, the measurement error model (1.1) is reasonable. However, in order to ascertain the distributional properties of the measurement error, one requires a replication experiment. The simplest way to take replicates is to perform 24-hour recalls on a few consecutive days (see also Section 1.6.8), but the problem here is that such replicates are probably not conditionally independent given the long-term average, and a variance component model such as (1.3) would likely be required. After all, if one is on an ice cream jag, several consecutive days of ice cream may show up in the 24-hour recall, even though it is rarely eaten.

This type of replication does not measure the true error, which is highly influenced by intraindividual variation in diet. Hence, with replicates on consecutive days, estimating the variance of the measurement error by components-of-variance techniques will underestimate the measurement error.

The same problem may occur in the urinary sodium chloride example (Section 1.6.8), because the replicates were recorded on consecutive days. Liu & Liang (1992) suggested that intraindividual variation is an important component of variability, and the design is not ideal for measuring this variation.

If one wants to estimate the measurement error variance consistently, it is much simpler if replicates can be taken far enough apart in time that the errors can reasonably be considered independent (see Chapter 4 for details). Otherwise, assumptions must be made about the form of the correlation structure; see Wang, Carroll, & Liang (1996). In the CSFII component of the NHANES study, measurements were taken at least two months apart, but there was still some small correlation between errors. In the Nurses' Health Study (Section 1.6.2), the exact predictor is the long-term average intake as measured by food records. Replicated food records were taken at four different points during the year, thus properly accounting for intraindividual variation.

Using an operational definition for an "exact" predictor is often reasonable and justifiable on the grounds that it is the best one could ever possibly hope to accomplish. However, such definitions may be controversial. For example, consider the breast cancer and fat controversy. One way to determine whether changing one's fat intake lowers the risk of developing breast cancer is to do a clinical trial, where the treatment group is actively encouraged to change their dietary behavior. Even this is controversial, because noncompliance can occur in either the treatment or the control arm. If instead one uses prospective data, as in the NHANES study, along with an operational definition of long-term intake, one should be aware that the results of a measurement error analysis could be invalid if true long-term intake and operational long-term intake differ in subtle ways. Suppose that the operational definition of fat and calories could be measured, and call these (FatO, CaloriesO), while the actual long-term intake is (FatA, CaloriesA). If breast cancer risk is associated with age and fat intake through the logistic regression model

Pr(Y = 1|FatA, CaloriesA, Age) = H(β0 + β1 Age + β2 CaloriesA + β3 FatA),

where here and throughout the book, H(x) = {1 + exp(−x)}⁻¹ is the logistic distribution function, then the important parameter is β3, with β3 > 0 corresponding to the conclusion that increased fat intake at a given level of calories leads to increased cancer risk.

However, suppose that the observed fat and calories are actually biased measures of the long-term average:

FatO = γ1 FatA + γ2 CaloriesA;
CaloriesO = γ3 FatA + γ4 CaloriesA.

Then a little algebra shows that the regression of disease on the operationally defined measures has a slope for operationally defined fat of

(γ4 β3 − γ3 β2) / (γ1 γ4 − γ2 γ3).

Depending on the parameter configurations, this can take on a sign different from β3. For example, suppose that β3 = 0 and there really is no fat effect. Using the operational definition, a measurement error analysis would lead to a fat effect of −γ3 β2/(γ1 γ4 − γ2 γ3), which may be nonzero. Hence, in this instance, there really is no fat effect, but our operational definition might lead us to find one.
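The algebra behind this sign change is a two-by-two matrix inversion and can be verified symbolically. The following check is our own sketch, not part of the original derivation; only standard sympy calls are used.

```python
import sympy as sp

b2, b3, g1, g2, g3, g4 = sp.symbols("beta2 beta3 gamma1 gamma2 gamma3 gamma4")
FatO, CalO = sp.symbols("FatO CaloriesO")

# Invert FatO = g1*FatA + g2*CalA and CaloriesO = g3*FatA + g4*CalA.
G = sp.Matrix([[g1, g2], [g3, g4]])
FatA, CalA = G.inv() * sp.Matrix([FatO, CalO])

# Rewrite the linear predictor beta2*CaloriesA + beta3*FatA in operational
# terms and read off the coefficient of FatO.
lin_pred = sp.expand(b2 * CalA + b3 * FatA)
print(sp.simplify(lin_pred.coeff(FatO)))
# -> (beta3*gamma4 - beta2*gamma3)/(gamma1*gamma4 - gamma2*gamma3)
```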
In our experience, researchers in nutrition shy away from terms such as true intake, because except for a few recovery biomarkers (protein and energy), the operational definition is not truth. However, the
operational definition, for example, the average of many repeated 24-hour recalls, is generally clear to subject-matter experts and not particularly controversial.

2.5 Differential and Nondifferential Error

It is important to make a distinction between differential and nondifferential measurement error. Nondifferential measurement error occurs when W contains no information about Y other than what is available in X and Z. The technical definition is that measurement error is nondifferential if the distribution of Y given (X, Z, W) depends only on (X, Z). In this case, W is said to be a surrogate. In other words, W is a surrogate if it is conditionally independent of the response given the true covariates; measurement error is differential otherwise.

For instance, consider the Framingham example of Section 1.6.6. The predictor of major interest is long-term systolic blood pressure (X), but we can only observe blood pressure on a single day (W). It seems plausible that a single day's blood pressure contributes essentially no information over and above that given by true long-term blood pressure, and hence that measurement error is nondifferential. The same remarks apply to the nutrition examples in Sections 1.6.1 and 1.6.2: Dietary intake on a single day should not contribute information about overall health that is not already present in long-term diet intake.

Many problems can plausibly be classified as having nondifferential measurement error, especially when the true and observed covariates occur at a fixed point in time and the response is measured at a later time.

There are two exceptions to keep in mind. First, in case-control or choice-based sampling studies, the response is obtained first and then subsequent follow-up ascertains the covariates. In nutrition studies, this ordering of measurement typically causes differential measurement error. For instance, here the true predictor would be long-term diet before diagnosis, but the nature of case-control studies is that reported diet is obtainable only after diagnosis. A woman who develops breast cancer may well change her diet, so the reported diet as measured after diagnosis is clearly still correlated with cancer outcomes, even after taking into account long-term diet before diagnosis.

A second setting for differential measurement error occurs when W is not merely a mismeasured version of X, but is a separate variable acting as a type of proxy for X.

For example, in an important paper with major implications for the analysis of retrospective studies in the presence of missing data, Satten & Kupper (1993) described an example of estimating the risk of coronary heart disease where X is an indicator of elevated LDL (low density lipoprotein cholesterol level), taking the values 1 and 0 according as the LDL does or does not exceed 160. For their value W they use total cholesterol. In their particular data set, both X and W are available, and it transpires that the relationship between W and Y is differential, that is, there is still a relationship between the two even after accounting for X. While the example is somewhat forced on our part, one should be aware that problems in which W is not merely a mismeasured version of X may well have differential measurement error.

It is also important to realize that the definition of a surrogate depends on the other variables, Z, in the model. For example, consider a model in which Z has two components, say Z = (Z1, Z2). Then it is possible that W is a surrogate in the model containing both Z1 and Z2 but not in the model containing only Z1. Buzas et al. (2004) pointed out that this has implications when different models for the response are considered, and they gave a simple example illustrating this phenomenon. We next present a modified version of their algebraic example.

Suppose that X, Z1, ǫ1, ǫ2, U1, and U2 are mutually independent normal random variables with zero means. Define Z2 = X + ǫ1 + U1, Y = β0 + βz1 Z1 + βz2 Z2 + βx X + ǫ2, and W = X + ǫ1 + U2. Because of joint normality, it is straightforward to show that E(Y | Z1, Z2, X, W) = E(Y | Z1, Z2, X) and consequently that W is a surrogate in the model containing both Z1 and Z2. However,

E(Y | Z1, X) = β0 + βz1 Z1 + (βz2 + βx)X,
E(Y | Z1, X, W) = E(Y | Z1, X) + βz2 E(ǫ1 | Z1, X, W).   (2.4)

The last expectation in (2.4) is not equal to zero because W depends on ǫ1. Thus W is not a surrogate in the model that contains only Z1 unless βz2 = 0. So the presence or absence of Z2 in the model determines whether or not W is a surrogate. The driving feature of this example is that the measurement error, W − X, is correlated with the covariate Z2. Problems in which measurement error is correlated with error-free predictors arise in practice and are amenable to the methods of regression calibration in Chapter 4 and instrumental variable estimation in Chapter 6.
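A quick simulation makes the algebraic point concrete. The sketch below is our own illustration with arbitrary coefficient values: with Z1, Z2, and X in the fitted model the coefficient on W is essentially zero, but it is far from zero when Z2 is omitted.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
bz1, bz2, bx = 1.0, 2.0, 3.0

x, z1, e1, e2, u1, u2 = rng.normal(size=(6, n))
z2 = x + e1 + u1
w = x + e1 + u2
y = 1.0 + bz1 * z1 + bz2 * z2 + bx * x + e2

def ols(y, *cols):
    """Least squares fit of y on an intercept plus the given columns."""
    design = np.column_stack((np.ones_like(y),) + cols)
    return np.linalg.lstsq(design, y, rcond=None)[0]

# Coefficient on W is ~0 when Z1, Z2, and X are all in the model (W is a surrogate).
print(ols(y, z1, z2, x, w))
# Coefficient on W is far from 0 when only Z1 and X are included.
print(ols(y, z1, x, w))
```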
The reason why nondifferential measurement error is important is that, as we will show in subsequent chapters, one can typically estimate parameters in models for responses given true covariates, even when the true covariates (X) are not observable. With differential measurement error, this is not the case: Outside of a few special situations, one must observe the true covariate on some study subjects. Most of this book focuses on nondifferential measurement error models.
2.5.0.1 A Technical Derivation

Here is a little technical argument illustrating why nondifferential measurement error is so useful. With nondifferential measurement error, the relationship between Y and W is greatly simplified relative to the case of differential measurement error. In simple linear regression, for example, it means that the regression in the observed data is a linear regression of Y on E(X|W), because

E(Y|W) = E{E(Y|X, W)|W}
       = E{E(Y|X)|W}
       = E(β0 + βx X|W)
       = β0 + βx E(X|W).

The assumption of nondifferential measurement error is used to justify the second equality above. This argument is the basis of the regression calibration method; see Chapter 4.

2.6 Prediction

In Chapter 3 we discuss the biases caused by measurement error for estimating regression parameters, and the effects on hypothesis testing are described in Chapter 10. Much of the rest of the book is taken up with methods for removing the biases caused by measurement error, with brief descriptions of inference at each step.

Prediction of a response is, however, another matter. Generally, there is no need for the modeling of measurement error to play a role in the prediction problem. If a predictor X is measured with error and one wants to predict a response based on the error-prone version W of X, then except for a special case discussed below, it rarely makes any sense to worry about measurement error. The reason for this is quite simple: W is error-free as a measurement of itself! If one has an original set of data (Y, Z, W), one can fit a convenient model to Y as a function of (Z, W). Predicting Y from (Z, W) is merely a matter of using this model for prediction, that is, substituting known values of W and Z into the regression model for Y on (Z, W); the prediction errors from this model will minimize the expected squared prediction errors in the class of all linear unbiased predictors. Predictions with (Z, W) naively substituted for (Z, X) in the regression of Y on (Z, X) will be biased and can have large prediction errors.

Another potential prediction method is to use the methodology discussed throughout this book to estimate the regression of Y on (Z, X) and then to substitute into this model {Z, E(X|W)}. Though this seems like a nice idea, it turns out to be equivalent to simply ignoring the measurement error, that is, to substituting (Z, W) into the fitted model for the regression of Y on (Z, W).

The one situation requiring that we correctly model the measurement error occurs when we develop a prediction model using data from one population but we wish to predict in another population. A naive prediction model that ignores measurement error may not be transportable. In more detail, if Y = β0 + βx X + ǫ in both populations, then Y = β0* + λβx W + ǫ*, where β0* and λ = σx²/(σx² + σu²) may differ between populations if either σx² or σu² does. Thus, the regression of Y on W may be different for the two populations.

Bibliographic Notes

An interesting discussion about the issues of Berkson and classical modeling is given throughout Uncertainties in Radiation Dosimetry and Their Impact on Dose-Response Analysis, E. Ron and F. O. Hoffman, editors, National Cancer Institute Press, 1999. This book, which arose from a conference on radiation epidemiology, has papers or discussions by many leading statisticians. Although we have stated (Section 2.2.2) that "Compared to complex undertakings such as rocket science or automotive repair, determining whether data follow the classical additive measurement error model (1.1) or the standard Berkson error model (1.2) is generally simple in practice," the conference discussions make it clear that radiation, where the errors are a complex mixture of classical and Berkson errors, is a case where it is difficult to sort through what models to use.
CHAPTER 3

LINEAR REGRESSION AND ATTENUATION

3.1 Introduction

This chapter summarizes some of the known results about the effects of
measurement error in linear regression and describes some of the sta-
tistical methods used to correct for those effects. Our discussion of the
linear model is intended only to set the stage for our main topic, nonlin-
ear measurement error models, and is far from complete. A comprehen-
sive account of linear measurement error models can be found in Fuller
(1987).

3.2 Bias Caused by Measurement Error

Many textbooks contain a brief description of measurement error in linear regression, usually focusing on simple linear regression and arriving
at the conclusion that the effect of measurement error is to bias the
slope estimate in the direction of zero. Bias of this nature is commonly
referred to as attenuation or attenuation to the null.
In fact, though, even this simple conclusion must be qualified, because
it depends on the relationship between the measurement, W, and the
true predictor, X, and possibly other variables in the regression model
as well. In particular, the effect of measurement error depends on the
model under consideration and on the joint distribution of the measure-
ment error and the other variables. In linear regression, the effects of
measurement error vary depending on (i) the regression model, be it
simple or multiple regression; (ii) whether or not the predictor measured
with error is univariate or multivariate; and (iii) the presence of bias
in the measurement. The effects can range from the simple attenuation
described above to situations where (a) real effects are hidden; (b) ob-
served data exhibit relationships that are not present in the error-free
data; and (c) even the signs (±) of estimated coefficients are reversed
relative to the case with no measurement error.
The key point is that the measurement error distribution determines
the effects of measurement error, and thus appropriate methods for correcting for the effects of measurement error depend on the measurement error distribution.

Figure 3.1 [panel titles: True X Data, Observed W Data; vertical axis: Response] Illustration of additive measurement error model. The left panel displays the true (Y, X) data, while the right panel displays the observed (Y, W) data. Note how the true X data plot has less variability and a more obvious nonzero effect.

Figure 3.2 [title: Effects of Measurement Error in Linear Regression: Illustration; horizontal axis: True (Solid) and Observed (Empty) Predictors] Illustration of additive measurement error model. Here we combine the data in Figure 3.1 and add in least squares fitted lines: The solid line and solid circles are for the true X data, while the dashed line and empty circles are for the observed, error-prone W data. Note how the slope to the true X data is steeper, and the variability about the line is much smaller.

3.2.1 Simple Linear Regression with Additive Error

The basic effects of classical measurement error on simple linear regression can be seen in Figures 3.1 and 3.2. These effects are the double whammy of measurement error described in Section 1.1, namely loss of power when testing and bias in parameter estimation. The third whammy, masking of features, occurs only in nonlinear models, since obviously a straight line has no features to mask.

The left panel of Figure 3.1 displays error-free data (Y, X) generated from the linear regression model Y = β0 + βx X + ǫ, where X has mean µx = 0 and variance σx² = 1, the intercept is β0 = 0, the slope is βx = 1, and the error about the regression line ǫ is independent of X, has mean zero, and has variance σǫ² = 0.25. The right panel displays the error-contaminated data (Y, W), where W = X + U, and U is independent of X, has mean zero, and has variance σu² = 1. This is the classical additive measurement error model; see Section 1.2. Note how the (Y, X) data are more tightly grouped around a well delineated line, while the error-prone (Y, W) data have much more variability about a much less obvious line. This is the loss of power through additional variability.

In Figure 3.2 we combine the data sets: The solid circles and solid line are the (Y, X) data and least squares fit, while the empty circles and dashed line are the (Y, W) data and their least squares fit. Here we see the bias in the least squares line due to classical measurement error.

We can understand the phenomena in Figures 3.1–3.2 through some theoretical calculations. For example, it is well known that an ordinary least squares regression of Y on W is a consistent estimate not of βx, but instead of βx* = λβx, where

λ = σx² / (σx² + σu²) < 1.   (3.1)

Thus ordinary least squares regression of Y on W produces an estimator that is attenuated toward zero. The attenuating factor, λ, is called the reliability ratio (Fuller, 1987). This attenuation is particularly pronounced in Figures 3.1–3.2.

One would expect that because W is an error-prone predictor, it has a weaker relationship with the response than does X, as seen in Figure 3.1. This can be seen both by the attenuation and also by the fact that
the residual variance of this regression of Y on W is

var(Y|W) = σǫ² + βx² σu² σx² / (σx² + σu²) = σǫ² + λ βx² σu².   (3.2)

This facet of the problem is often ignored, but it is important. Measurement error causes a double whammy: Not only is the slope attenuated, but the data are more noisy, with an increased error about the line.

It is not surprising that measurement error, as another source of error, increases variability about the line. Indeed, we can substitute X = W − U into the regression model to obtain the model Y = β0 + βx W + (ǫ − βx U), with error (ǫ − βx U) that has variance σǫ² + βx² σu² > σǫ² and covariate W. What may be surprising is that this additional error causes bias. However, the error and the covariate have a common component U, which causes them to be correlated. The correlation between the error and covariate is the source of the bias.

In light of the effects of classical measurement error discussed above, one might expect that the least squares estimate of slope calculated from measured (Y, W) is more variable than the slope estimator calculated from the true (Y, X) data. This is not always the case. Buzas, Stefanski, and Tosteson (2004) pointed out that the naive estimate of slope can be less variable than the true data estimator. In fact, for the classical error model, the variance of the naive estimator is less than the variance of the true-data estimator asymptotically if and only if βx² σx²/(σx² + σu²) < σǫ²/σx², which is possible when σǫ² is large, or σu² is large, or βx² is small. So, relative to the case of no measurement error, classical errors can result in more precise estimates of the wrong, that is, biased, quantity. This phenomenon explains, in part, why naive-analysis confidence intervals often have disastrous coverage probabilities; not only are they centered on the wrong value, but they sometimes have shorter length than would be obtained with the true data. This phenomenon cannot occur with Berkson errors, for which the variance of the naive estimator is never less than the variance of the true-data estimator asymptotically.
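A small simulation, a sketch of our own using the same parameter values as Figure 3.1, makes the attenuation (3.1) and the inflated residual variance (3.2) concrete.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000
beta0, betax = 0.0, 1.0
sx2, su2, se2 = 1.0, 1.0, 0.25           # Var(X), Var(U), Var(eps), as in Figure 3.1

x = rng.normal(0.0, np.sqrt(sx2), n)
y = beta0 + betax * x + rng.normal(0.0, np.sqrt(se2), n)
w = x + rng.normal(0.0, np.sqrt(su2), n)

lam = sx2 / (sx2 + su2)                   # reliability ratio, equation (3.1)
slope_w, intercept_w = np.polyfit(w, y, 1)  # naive least squares of Y on W
resid_var = np.var(y - (intercept_w + slope_w * w))

print("naive slope      :", slope_w, " vs lambda * betax =", lam * betax)
print("residual variance:", resid_var, " vs (3.2) =", se2 + lam * betax**2 * su2)
```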
3.2.3 Simple Linear Regression with Berkson Error

3.2.2 Regression Calibration: Classical Error as Berkson Error Suppose that we have linear regression, Yi = β0 + βx Xi + ǫi , with
unbiased Berkson error, that is, Xi = Wi + Ui . Then E(Xi |Wi ) = Wi
There is another way of looking at the bias that will give further insight, so that E(Yi |Wi ) = β0 + βx Wi . As a consequence, the naive estimator
namely that by a simple mapping, classical measurement error can be that regresses Yi on Wi is unbiased for β0 and βx . This unbiasedness
made into a Berkson model. Define Wblp = (1 − λ)µx + λW, the best can be seen in Figure 3.3 which illustrates linear regression with Berkson
linear predictor of X based on W. Then, by (A.8) of Appendix A, errors. In the figure, (Yi , Xi ) and (Yi , Wi ) are plotted, as well as fits
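In the same spirit as the previous sketch, the following self-contained simulation (ours, under the assumption that λ and µx are known) shows that regressing Y on Wblp = (1 − λ)µx + λW removes the attenuation.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500_000
betax, mux, sx2, su2 = 1.0, 2.0, 1.0, 1.0
lam = sx2 / (sx2 + su2)                       # reliability ratio

x = rng.normal(mux, np.sqrt(sx2), n)
y = betax * x + rng.normal(0.0, 0.5, n)
w = x + rng.normal(0.0, np.sqrt(su2), n)      # classical measurement error

w_blp = (1.0 - lam) * mux + lam * w           # best linear predictor of X given W
slope_blp = np.polyfit(w_blp, y, 1)[0]
print("slope of Y on W_blp:", slope_blp, " (target betax =", betax, ")")
```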
to both (Yi , Xi ) and (Yi , Wi ). The Wi are equally spaced on [−1, 3],
Xi = Wi + Ui , Ui = Normal(0, 1), ǫi = Normal(0, 0.5), n = 50, β0 = 1,
X = Wblp + U∗ , (3.3) and βx = 1.

44 45
Figure 3.3 [legend: x,y; w,y; fit x,y; fit w,y] Simple linear regression with unbiased Berkson errors. Theory shows that the fit of Yi to Wi is unbiased for the regression of Yi on Xi, and the two fits are, in fact, similar.

3.2.4 Simple Linear Regression, More Complex Error Structure

Despite admonitions of Fuller (1987) and others to the contrary, it is a common perception that the effect of measurement error is always to attenuate the line. In fact, attenuation depends critically on the classical additive measurement error model. In this section, we discuss two deviations from the classical additive error model that do not lead to attenuation.

We continue with the simple linear regression model, but now we make the error structure more complex in two ways. First, we will no longer insist that W be unbiased for X. The intent of studying this departure from the classical additive error model is to study what happens when one pretends that one has an unbiased surrogate, but in fact the surrogate is biased.

A second departure from the additive model is to allow the errors in the linear regression model to be correlated with the errors in the predictors. This is differential measurement error; see Section 2.5. One example where this problem arises naturally is in dietary calibration studies (Freedman et al., 1991). In a typical dietary calibration study, one is interested in the relationship between a self-administered food frequency questionnaire (FFQ, the value of Y) and usual (or long-term) dietary intake (the value of X) as measures of, for example, the percentage of calories from fat in a person's diet. FFQs are thought to be biased for usual intake, and in a calibration study researchers will obtain a second measure (the value of W), typically either from a food diary or from an interview in which the study subject reports his or her diet in the previous 24 hours. In this context, it is often assumed that the diary or recall is unbiased for usual intake. In principle, then, we have simple linear regression with an additive measurement error model, but in practice a complication can arise. It is often the case that the FFQ and the diary or recall are given very nearly contemporaneously in time, as in the Women's Health Trial Vanguard Study (Henderson et al., 1990). In this case, it makes little sense to pretend that the error in the relationship between the FFQ (Y) and usual intake (X) is uncorrelated with the error in the relationship between a diary or recall (W) and usual intake. This correlation has been demonstrated (Freedman, Carroll, and Wax, 1991), and in this section we will discuss its effects.

To express the possibility of bias in W, we write the model as W = γ0 + γ1 X + U, where U is independent of X and has mean zero and variance σu². To express the possibility of correlated errors, we will write the correlation between ǫ and U as ρǫu. The classical additive measurement error model sets γ0 = 0, ρǫu = 0, and γ1 = 1, so that W = X + U.

If (X, ǫ, U) are jointly normally distributed, then the regression of Y on W is linear with intercept

β0* = β0 + βx µx − βx*(γ0 + γ1 µx),

and slope

βx* = (βx γ1 σx² + ρǫu σǫ σu) / (γ1² σx² + σu²).   (3.5)

Examination of (3.5) shows that if W is biased (γ1 ≠ 1) or if there is significant correlation between the measurement error and the error about the true line (ρǫu ≠ 0), it is possible for |βx*| > |βx|, an effect exactly the opposite of attenuation. Thus, correction for bias induced by measurement error clearly depends on the nature, as well as the extent, of the measurement error.

For purposes of completeness, we note that the residual variance of the linear regression of Y on W is

var(Y|W) = σǫ² + (βx² σu² σx² − ρǫu² σǫ² σu² − 2 βx γ1 σx² ρǫu σǫ σu) / (γ1² σx² + σu²).
3.2.4.1 Diagnostic for Correlation of Errors in Regression and Measurement Errors

In some cases, there is a simple graphical diagnostic to check whether the errors in the regression are correlated with the classical measurement errors. The methods are related to the graphical diagnostics used to detect whether the additive error model is reasonable; see Section 1.7.

Figure 3.4 [scatterplot; panel title: WHT Controls data, %-Calories from Fat, Correlation = −0.07; horizontal axis: Difference of FFQ; vertical axis: Difference of Food Records] Women's Health Trial Vanguard Study Data. This is a plot for % calories from fat of the differences of food records and the differences of food frequency questionnaires. With replicated Y and W, this plot is a diagnostic for whether errors in a regression are correlated with the classical measurement errors.

Specifically, suppose that the error-prone instrument is replicated, so that we observe Wij = γ0 + γ1 Xi + Uij. The difference Wi1 − Wi2 = Ui1 − Ui2 is "pure" error, unrelated to Xi. Suppose further that the response is replicated, so that we observe Yij = β0 + βx Xi + ri + ǫij, where ri is person-specific bias or equation error; see Section 1.5. Then differences Yi1 − Yi2 = ǫi1 − ǫi2 are the model errors. A plot of the two sets of differences will help reveal whether the regression errors and the measurement errors are correlated. This is illustrated in Figure 3.4, where there appears to be a very strong correlation between the model errors and the measurement errors. A formal test can be performed by regressing one set of differences on the other and testing the null hypothesis that the slope is zero. This plotting method and the test assume that the covariances of errors separated in time are small. This assumption seems reasonable if the time separation is at all large.
Specifically, suppose that the error-prone instrument is replicated, so 3.2.5.2 Surrogate Measurement
that we observe Wij = γ0 + γ1 Xi + Uij . The difference Wi1 − Wi2 =
Ui1 − Ui2 is “pure” error, unrelated to Xi . Suppose further that the As defined in Section, 2.5, a surrogate measurement is one for which the
response is replicated, so that we observe Yij = β0 + βx Xi + ri + ǫij , conditional distribution of Y given (X, Z, W) depends only on (X, Z).
where ri is person-specific bias or equation error; see Section 1.5. Then In this case, W is also said to be a surrogate. The second row of Table
differences Yi1 − Yi2 = ǫi1 − ǫi2 are the model errors. A plot of the 3.1 shows how the parameters in the regression of Y on W depend on
two sets of differences will help reveal whether the regression errors and β0 , βx , σ 2 when W is a surrogate, with no additional assumptions about
the measurement errors are correlated. This is illustrated in Figure 3.4, the type of error model. With a surrogate, it is apparent that knowledge
where there appears to be a very strong correlation between the model of or estimability of σxw is enough to recover βx from the regression of
errors and the measurement errors. A formal test can be performed by Y on W. The residual variance in the regression of Y on W is always
regressing one set of differences on the other and testing the null hypoth- greater than σ 2 when W is a surrogate. In this sense, a surrogate is
esis that the slope is zero. This plotting method and the test assume that always less informative than X.

48 49
Error model | ρ²xw | Intercept | Slope | Residual variance
Differential | ρ²xw | β0 + βx µx − {(βx σxw + σǫw)/σw²} µw | (βx σxw + σǫw)/σw² | σǫ² + βx² σx² − (βx σxw + σǫw)²/σw²
Surrogate | ρ²xw | β0 + βx µx − (βx σxw/σw²) µw | βx σxw/σw² | σǫ² + βx² σx² (1 − ρ²xw)
Classical | σx²/(σx² + σuc²) | β0 + βx µx (1 − ρ²xw) | βx σx²/(σx² + σuc²) | σǫ² + βx² σx² (1 − ρ²xw)
B/C mixture | σL⁴/{(σL² + σub²)(σL² + σuc²)} | β0 + βx µx {1 − σL²/(σL² + σuc²)} | βx σL²/(σL² + σuc²) | σǫ² + βx² σx² (1 − ρ²xw)
Berkson | (σx² − σub²)/σx² | β0 | βx | σǫ² + βx² σx² (1 − ρ²xw)
No error | 1 | β0 | βx | σǫ²

Table 3.1 Table entries are error model squared correlations, and intercepts, slopes, and residual variances of the linear model relating Y to W when (Y, X, W) is multivariate normal, for the cases where W is: a general differential measurement, a general surrogate, an unbiased classical-error measurement, an unbiased classical/Berkson mixture error measurement, an unbiased Berkson measurement, and the case of no error (W = X). Classical error variance, σuc²; Berkson error variance, σub²; B/C mixture error model, X = L + Ub, W = L + Uc; Y = β0 + βx X + ǫ.

3.2.5.3 Classical Error Model

In the classical error model, W is a surrogate and E(W | X) = X, and we can write W = X + Uc, where Uc is a measurement error. Here we use the subscript c to emphasize that the error is classical and to avoid confusion with the two error models discussed below. We have already discussed this model in detail elsewhere, for example, in Sections 1.2 and 2.2. It is apparent from the third row of Table 3.1 that if the reliability ratio, λ = σx²/(σx² + σuc²), is known or can be estimated, then βx can be recovered from the regression of Y on W.

3.2.5.4 Berkson Error Model

In the Berkson error model, W is a surrogate and E(X | W) = W, and we can write X = W + Ub, where Ub is a Berkson error. This model has been discussed in detail elsewhere, for example, Sections 1.4 and 2.2. It is apparent from the fifth row of Table 3.1 that the regression parameters are not biased by Berkson measurement error. However, note that the residual variance in the regression of Y on W is greater than σ², a consequence of the fact that for this model W is a surrogate. Both the unbiasedness and increased residual variation are well illustrated in Figure 3.3.

3.2.5.5 Berkson/Classical Mixture Error Model

We now consider an error model that was encountered previously (see Section 1.8.2) on the log-scale, and is discussed again at length in Section 8.6. Here we consider the additive version. The defining characteristic is that the error model contains both classical and Berkson components. Specifically, it is assumed that

X = L + Ub,   (3.7)
W = L + Uc.   (3.8)

When Ub = 0, X = L and the classical error model is obtained, whereas the Berkson error model results when Uc = 0, since then W = L. We denote the variances of the error terms by σuc² and σub². This error model has features of both the classical and Berkson error models. Note that there is bias in the regression parameters when σuc² > 0, as in the classical model. The inflation in the residual variance has the same form as the other nondifferential error models in terms of ρ²xw, but ρ²xw depends on both error variances for this model.

The error models in Table 3.1 are arranged from most to least problematic in terms of the negative effects of measurement error. Although we discussed the Berkson/classical mixture error model last, in the
hierarchy of error models its place is between the classical and Berkson error models.

3.3 Multiple and Orthogonal Regression

3.3.1 Multiple Regression: Single Covariate Measured with Error

In multiple linear regression, the effects of measurement error are more complicated, even for the classical additive error model.

We now consider the case where X is scalar, but there are additional covariates Z measured without error. The linear model is now

Y = β0 + βx X + βzᵗ Z + ǫ,   (3.9)

where Z and βz are column vectors, and βzᵗ is a row vector. In Appendix B.2 it is shown that if W is unbiased for X, and the measurement error U is independent of X, Z, and ǫ, then the least squares regression estimator of the coefficient of W consistently estimates λ1 βx, where

λ1 = σx|z² / σw|z² = σx|z² / (σx|z² + σu²),   (3.10)

and σw|z² and σx|z² are the residual variances of the regressions of W on Z and X on Z, respectively. Note that λ1 is equal to the simple linear regression attenuation, λ = σx²/(σx² + σu²), only when X and Z are uncorrelated. Otherwise, σx|z² < σx² and λ1 < λ, showing that collinearity increases attenuation.

The problem of measurement error–induced bias is not restricted to the regression coefficient of X. The coefficient of Z is also biased in general, unless Z is independent of X (Carroll, Gallo, and Gleser, 1985; Gleser, Carroll, and Gallo, 1987). In Section B.2 it is shown that for the model (3.9), the naive ordinary least squares estimates not βz but rather

βz* = βz + βx(1 − λ1)Γz,   (3.11)

where Γzᵗ is the coefficient of Z in the regression of X on Z, that is, E(X | Z) = Γ0 + Γzᵗ Z.

This result has important consequences when interest centers on the effects of covariates measured without error. Carroll et al. (1985) and Carroll (1989) showed that in the two-group analysis of covariance where Z is a treatment assignment variable, naive linear regression produces a consistent estimate of the treatment effect only if the design is balanced, that is, X has the same mean in both groups and is independent of treatment. With considerable imbalance, the naive analysis may lead to the conclusions that (i) there is a treatment effect when none actually exists; and (ii) the effects are negative when they are actually positive, or vice versa. Figure 3.5 illustrates this process schematically. In the left panel, we show linear regression fits in the analysis of covariance model when there is no effect of treatment, that is, the two lines are the same. At the bottom of this panel, we draw schematic density functions for X in the two groups: The solid lines are the treatment group with smaller X. The effect of measurement error in this problem is attenuation around the mean in each group, leading to the right panel, where the linear regression fits to the observed W are given. Now note that the lines are not identical, indicating that we would observe a treatment effect, even though it does not exist.

Figure 3.5 [panel titles: ANCOVA, True X data (left); ANCOVA, Observed W data (right)] Illustration of the effects of measurement error in an unbalanced analysis of covariance. The left panel shows the actual (Y, X) fitted functions, which are the same, indicating no treatment effect. The density functions of X in the two groups are very different, however, as can be seen in the schematic density functions of X at the bottom. The right panel shows what happens when there is measurement error in the continuous covariate: Now the observed data suggest a large treatment effect.
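The spurious treatment effect in an unbalanced design is easy to reproduce numerically. The sketch below is our own illustration with arbitrary group means; the true treatment effect is zero, but the naive fit with the error-prone covariate is far from zero.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
z = rng.integers(0, 2, n)                     # treatment indicator, unbalanced on X
x = rng.normal(0.0 + 2.0 * z, 1.0)            # X has different means in the two groups
y = 1.0 + 1.0 * x + rng.normal(0.0, 0.5, n)   # no treatment effect: Y depends on X only
w = x + rng.normal(0.0, 1.0, n)               # classical measurement error

design = np.column_stack([np.ones(n), w, z])
coef = np.linalg.lstsq(design, y, rcond=None)[0]
print("naive ANCOVA treatment effect:", coef[2])   # far from the true value 0
```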
3.3.2 Multiple Covariates Measured with Error

Now suppose that there are covariates Z measured without error, that W is unbiased for X, which may consist of multiple predictors, and that the linear regression model is Y = β0 + βxᵗ X + βzᵗ Z + ǫ. If we write Σab to be the covariance matrix between random variables A and B, then
naive ordinary linear regression consistently estimates not (βx, βz) but rather

\[
\begin{pmatrix} \beta_x^* \\ \beta_z^* \end{pmatrix}
= \begin{pmatrix} \Sigma_{xx}+\Sigma_{uu} & \Sigma_{xz} \\ \Sigma_{zx} & \Sigma_{zz} \end{pmatrix}^{-1}
\left\{ \begin{pmatrix} \Sigma_{xy} \\ \Sigma_{zy} \end{pmatrix}
+ \begin{pmatrix} \Sigma_{u\epsilon} \\ 0 \end{pmatrix} \right\}
\qquad (3.12)
\]
\[
= \begin{pmatrix} \Sigma_{xx}+\Sigma_{uu} & \Sigma_{xz} \\ \Sigma_{zx} & \Sigma_{zz} \end{pmatrix}^{-1}
\left\{ \begin{pmatrix} \Sigma_{xx} & \Sigma_{xz} \\ \Sigma_{zx} & \Sigma_{zz} \end{pmatrix}
\begin{pmatrix} \beta_x \\ \beta_z \end{pmatrix}
+ \begin{pmatrix} \Sigma_{u\epsilon} \\ 0 \end{pmatrix} \right\}.
\]

Thus, ordinary linear regression is biased. In Section 3.4, we take up the issue of bias correction. However, before doing so, it is worth taking a minute to explore the bias result (3.12). Consider a case of a regression on two error-prone covariates, where the coefficient βx in the regression of Y on X is (1.0, −0.2)ᵗ, and where the components of X are independent so that Σxx is the identity matrix. Let the variances of the measurement errors U both equal 1.0, and let their correlation ρ vary from −0.9 to 0.9. In Figure 3.6 we graph what least squares ignoring measurement error is really estimating in the second component (−0.2) of βx as ρ varies. When the correlation between the measurement errors is large but negative, least squares actually suggests that the coefficient is positive when it really is negative. Equally surprising, if the correlation between the measurement errors is large and positive, least squares actually suggests a more negative effect than actually exists.

Figure 3.6 [plot title: Surprising Bias Caused By Correlated Measurement Errors; horizontal axis: Correlation of Errors; vertical axis: Naive Method Result, with a horizontal reference line marked True Parameter] Illustration of the effects of correlated measurement error with two variables measured with error. The true variables are actually uncorrelated, while the errors are correlated, with correlations ranging from −0.9 to 0.9. Displayed is a plot of what least squares estimates against the correlation of the measurement errors. The true value of the parameter of interest is −0.2.
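The curve in Figure 3.6 can be traced by evaluating (3.12) directly. The sketch below is our own; it specializes (3.12) to the setup just described, with no Z and with Σuǫ = 0.

```python
import numpy as np

betax = np.array([1.0, -0.2])
Sigma_xx = np.eye(2)                       # components of X independent, unit variance

for rho in (-0.9, -0.5, 0.0, 0.5, 0.9):
    Sigma_uu = np.array([[1.0, rho], [rho, 1.0]])    # correlated measurement errors
    # Special case of (3.12): naive limit = (Sigma_xx + Sigma_uu)^{-1} Sigma_xx betax
    naive_limit = np.linalg.solve(Sigma_xx + Sigma_uu, Sigma_xx @ betax)
    print(f"rho = {rho:+.1f}: naive estimate of second coefficient = {naive_limit[1]:+.3f}")
```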
3.4 Correcting for Bias

As we have just seen, the ordinary least squares estimator is typically biased under measurement error, and the direction and magnitude of the bias depend on the regression model, the measurement error distribution, and the correlation between the true predictor variables. In this section, we describe two common methods for eliminating bias.

3.4.1 Method of Moments

In simple linear regression with the classical additive error model, we have seen in (3.1) that ordinary least squares is an estimate of λβx, where λ is the reliability ratio. If the reliability ratio were known, then one could obtain an unbiased estimate of βx simply by dividing the ordinary least squares slope β̂x* by the reliability ratio.

Of course, the reliability ratio is rarely known in practice, and one has to estimate it. If σ̂u² is an estimate of the measurement error variance (this is discussed in Section 4.4), and if σ̂w² is the sample variance of the Ws, then a consistent estimate of the reliability ratio is λ̂ = (σ̂w² − σ̂u²)/σ̂w². The resulting estimate is β̂x*/λ̂.

In small samples, the sampling distribution of β̂x*/λ̂ is highly skewed, and in such cases a modified version of the method-of-moments estimator is recommended (Fuller, 1987; Section 2.5.1). Fuller's modification depends upon a tuning parameter α. Fuller does not give explicit advice about choosing α, but in his simulations α = 2 produced more accurate estimates than the unmodified estimator. As an example, in Figure 3.7, in the top panel, we plot the histogram of the corrected estimate when n = 20, X is standard normal, the reliability ratio is 0.5, and the error about the line in the regression model is 0.25: The skewness is clear. In the bottom panel, we plot the histogram of Fuller's corrected estimator: It is slightly biased downwards, but very much more symmetric. In this figure, Fuller's method was defined as follows. Let σ̂yw and σ̂y² be the sample covariance between Y and W and the sample variance of Y, respectively. Define κ̂ = (σ̂w² − σ̂yw²/σ̂y²)/σ̂u².
Then define σ̂x² = σ̂w² − σ̂u² if κ̂ ≥ 1 + (n − 1)⁻¹, while otherwise σ̂x² = σ̂w² − σ̂u²{κ̂ − (n − 1)⁻¹}. Then Fuller's corrected estimate with his α = 2 is given as β̂x* σ̂w²/{σ̂x² + 2σ̂u²/(n − 1)}.

Figure 3.7 [two histogram panels titled "Correction for Attenuation Estimate" and "Correction for Attenuation Estimate: Fuller Modification"] Illustration of the small-sample distribution of the method-of-moments estimator of the slope in simple linear regression when n = 20 and the reliability ratio λ = 0.5. The top panel is the usual method-of-moments estimate, while the bottom panel is Fuller's correction to it.
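Both estimators can be written down directly from the formulas above. The code below is our own sketch, not Fuller's software; it assumes an estimate of the measurement error variance, sigma_u2, is supplied from replication or external data.

```python
import numpy as np

def mom_slope(y, w, sigma_u2):
    """Method-of-moments (correction-for-attenuation) estimate of the slope."""
    s_w2 = np.var(w, ddof=1)
    lam_hat = (s_w2 - sigma_u2) / s_w2            # estimated reliability ratio
    slope_naive = np.cov(y, w, ddof=1)[0, 1] / s_w2
    return slope_naive / lam_hat

def fuller_slope(y, w, sigma_u2, alpha=2.0):
    """Fuller's modified correction for attenuation with tuning parameter alpha."""
    n = len(y)
    s_w2 = np.var(w, ddof=1)
    s_y2 = np.var(y, ddof=1)
    s_yw = np.cov(y, w, ddof=1)[0, 1]
    slope_naive = s_yw / s_w2
    kappa = (s_w2 - s_yw**2 / s_y2) / sigma_u2
    if kappa >= 1.0 + 1.0 / (n - 1.0):
        s_x2 = s_w2 - sigma_u2
    else:
        s_x2 = s_w2 - sigma_u2 * (kappa - 1.0 / (n - 1.0))
    return slope_naive * s_w2 / (s_x2 + alpha * sigma_u2 / (n - 1.0))
```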
The algorithm described above is called the method-of-moments estimator. The terminology is apt, because ordinary least squares and the reliability ratio depend only on moments of the observed data.

The method-of-moments estimator can be constructed for the general linear model, not just for simple linear regression. Suppose that W is unbiased for X, and consider the general linear regression model Y = β0 + βxᵗ X + βzᵗ Z + ǫ. The ordinary least squares estimator is biased even in large samples because it estimates (3.12). When Σuu and Σuǫ are known or can be estimated, (3.12) can be used to construct a simple method-of-moments estimator that is commonly used to correct for the bias. Let Sab be the sample covariance between random variables A and B. The method-of-moments estimator that corrects for the bias in the case that Σuu and Σuǫ are known is

\[
\begin{pmatrix} S_{ww}-\Sigma_{uu} & S_{wz} \\ S_{zw} & S_{zz} \end{pmatrix}^{-1}
\begin{pmatrix} S_{wy}-\Sigma_{u\epsilon} \\ S_{zy} \end{pmatrix}.
\qquad (3.13)
\]

In the case that Σuu and Σuǫ are estimated, the estimates replace the known values in (3.13). It is often reasonable to assume that Σuǫ = 0, in which case (3.13) simplifies accordingly.

In the event that W is biased for X, that is, W = γ0 + γx X + U (the error calibration model), the method-of-moments estimator can still be used, provided estimates of (γ0, γx) are available. The strategy is to calculate the estimators above using the error-calibrated variate W* = γ̂x⁻¹(W − γ̂0).

3.4.2 Orthogonal Regression

Another well publicized method for linear regression in the presence of measurement error is orthogonal regression; see Fuller (1987, Section 1.3.3). This is sometimes known as the linear statistical relationship (Tan and Iglewicz, 1999) or the linear functional relationship. However, for reasons given below, we are skeptical about the general utility of orthogonal regression, in large part because it is so easily misused. Although it is not fundamental to understanding later material on nonlinear models, we take the opportunity to discuss orthogonal regression at length here in order to emphasize the potential pitfalls associated with it. The work appeared as Carroll and Ruppert (1996), but the message is worth repeating. This section can be skipped by those who are interested only in estimation for nonlinear models or who plan never to use orthogonal regression.

Let Y = β0 + βx X + ǫ and W = X + U, where ǫ and U are uncorrelated. Whereas the method-of-moments estimator (Section 3.4) requires knowledge or estimability of the measurement error variance σu², orthogonal regression requires the same for the ratio η = σǫ²/σu².

The orthogonal regression estimator minimizes the orthogonal distance of (Y, W) to the line β0 + βx X, weighted by η, that is, it minimizes

Σⁿᵢ₌₁ {(Yi − β0 − βx xi)² + η (Wi − xi)²}   (3.14)

in the unknown parameters (β0, βx, x1, ..., xn).

In fact, (3.14) is the sum of squared orthogonal distances between the points (Yi, Wi)ⁿ₁ and the line y = β0 + βx x only in the special case that η = 1. However, the term orthogonal regression is used to describe the method regardless of the value of η < ∞.

The orthogonal regression estimator is the functional maximum likelihood estimator (Sections 2.1 and 7.1) assuming that (X1, ..., Xn) are unknown fixed constants, and that the errors (ǫ, U) are independent and normally distributed.
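For completeness, a sketch of the closed-form minimizer of (3.14) for known η is given below. It is our own implementation of the standard solution (often called Deming regression), not a recommendation to use the method.

```python
import numpy as np

def orthogonal_regression(y, w, eta):
    """Orthogonal regression slope and intercept for known eta = sigma_eps^2 / sigma_u^2,
    minimizing criterion (3.14)."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    s_ww = np.var(w, ddof=1)
    s_yy = np.var(y, ddof=1)
    s_yw = np.cov(y, w, ddof=1)[0, 1]
    slope = ((s_yy - eta * s_ww)
             + np.sqrt((s_yy - eta * s_ww) ** 2 + 4.0 * eta * s_yw ** 2)) / (2.0 * s_yw)
    intercept = y.mean() - slope * w.mean()
    return intercept, slope
```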
Orthogonal regression has the appearance of greater applicability than
method-of-moments estimation in that only the ratio, η, of the error variances need be known or estimated. However, it is our experience that in the majority of problems η cannot be specified or estimated correctly, and use of orthogonal regression with an improperly specified value of η often results in an unacceptably large overcorrection for attenuation due to measurement error.

We illustrate the problem with some data from a consulting problem (Table 3.2). The data include two measurements of a response variable, Yi1 and Yi2, and one predictor variable with true value Xi, i = 1, ..., 8. The data are proprietary and we cannot disclose the nature of the application. Accordingly, all of the variables have been standardized to have sample means and variances 0 and 1, respectively.

    Wi        Yi1       Yi2
  −1.8007   −0.5558   −0.9089
  −0.7717    0.2076    0.6499
  −0.4287   −1.7365   −1.8542
  −0.0857   −0.9018    0.2040
   0.2572   −0.2312   −0.3097
   0.6002    0.2967    0.5072
   0.9432    0.5928    1.5381
   1.2862    1.2420    1.2599

Table 3.2 Orthogonal regression example with replicated response.

We take as the response variable to be used in the regression analysis Yi = (Yi1 + Yi2)/2, the average of the two response measurements. Using an independent experiment, it had been estimated that σu² ≈ 0.0424, also after standardization. Because the sample standard deviation of W is 1.0, measurement error induces very little bias here. The estimated reliability ratio is λ̂ = 1 − 0.0424 ≈ 0.96, and so attenuation is only about 4%. The ordinary least squares estimated slope from regressing the average of the responses on W is 0.65, while the method-of-moments slope estimate is λ̂⁻¹ × 0.65 ≈ 0.68.

In a first analysis of these data, our client thought that orthogonal regression was an appropriate method for these data. A components-of-variance analysis resulted in the estimate 0.0683 for the response measurement error variance. If η is estimated by η̂ = 0.0683/0.0424 ≈ 1.6118, then the resulting orthogonal regression slope estimate is 0.88.

The difference in these two estimates, |0.88 − 0.68|, is larger than would be expected from random variation alone. Clearly, something is amiss. The method-of-moments correction for attenuation is only λ̂⁻¹ ≈ 1.04, whereas orthogonal regression, in effect, produces a correction for attenuation of approximately 1.35 ≈ 0.88/0.65.

The problem lies in the nature of the regression model error ǫ, which is typically the sum of two components: (i) ǫM, the measurement error in determination of the response; and (ii) ǫL, the equation error, that is, the variation about the regression line of the true response in the absence of measurement error. See Section 1.5 for another example of equation error, which in nutrition is called person-specific bias.

If we have replicated measurements, Yij, of the true response, then Yij = β0 + βx Xi + ǫL,i + ǫM,ij, and of course their average is Ȳi· = β0 + βx Xi + ǫL,i + ǭM,i·. Here and throughout the book, a subscript "dot" and overbar means averaging. For example, with k replicates,

Ȳi· = k⁻¹ Σᵏⱼ₌₁ Yij;    ǭM,i· = k⁻¹ Σᵏⱼ₌₁ ǫM,ij.

The components-of-variance analysis estimates only the variance of the average measurement error ǭM,i· in the responses, but completely ignores the variability, ǫL,i, about the line. The net effect is to underestimate η and thus overstate the correction required of the ordinary least squares estimate, because var(ǭM,i·)/σu² is used as the estimate of η instead of the larger, appropriate value {var(ǭM,i·) + var(ǫL,i)}/σu².

The naive use of orthogonal regression on the data in Table 3.2 has assumed that there is no additional variability about the line in addition to that due to measurement error in the response, that is, ǫL,i = 0. To check this, refer to Figure 3.8. Each replicated response is indicated by a filled and an empty circle. Remember that there is little measurement error in W. In addition, the replication analysis suggested that the standard deviation of the replicates was less than 10% of the variability of the responses. Thus, in the absence of equation error we would expect to see the replicated pairs falling along a clearly delineated straight line. This is far from the case, suggesting that the equation error ǫL,i is a large part of the variability of the responses. Indeed, while the replication analysis suggests that var(ǭM,i·) ≈ 0.0683, a method-of-moments analysis suggests var(ǫL,i) ≈ 0.4860.

Fuller (1987) was one of the first to emphasize the importance of equation error. In our experience, outside of some special laboratory validation studies, equation error is almost always important in linear regression. In the majority of cases, orthogonal regression is an
priate technique, unless estimation of both the response measurement error and the equation error is possible.

Figure 3.8 Illustration where the assumptions of orthogonal regression appear violated. The filled and empty circles represent replicated values of the response. Note the evidence of equation error because the replicate responses are very close to each other, indicating little response measurement error, but the circles do not fall on a line, indicating some type of response error.

In some cases, Y and W are measured in the same way, for example, if they are both blood pressure measurements taken at different times. Here, it is often entirely reasonable to assume that the variance of ǫM equals σu^2, and then there is a temptation to ignore equation error and hence set η = 1. Almost universally, this is a mistake: Equation error generally exists. This temptation is especially acute when replicates are absent, so that σu^2 cannot be estimated and the method-of-moments estimator cannot be used.

3.5 Bias Versus Variance

Estimates which do not account for measurement error are typically biased. Correcting for this bias entails what is often referred to as a bias versus variance tradeoff. What this means is that, in most problems, the very nature of correcting for bias is that the resulting corrected estimator will be more variable than the biased estimator. Of course, when an estimator is more variable, the confidence intervals associated with it are longer.

Later in this section we will describe theory, but it is instructive to consider an extreme case, using the same simulated data as in Figure 3.7 and Section 3.4.1. In this problem, the sample size is n = 20, the true slope is βx = 1.0 and the reliability ratio is λ = 0.5. The top panel of Figure 3.9 gives the histogram of Fuller's modification of the method-of-moments estimator, while the bottom panel gives the histogram of the naive method that ignores measurement error. Note how the naive estimator is badly biased: Indeed, we know it estimates λβx = 0.5, and it is tightly bunched around this (wrong) value. The method-of-moments estimator is roughly unbiased, but this correction for bias is at the cost of a much greater variability (2.7 times greater in the simulation).

Figure 3.9 Bias versus variance tradeoff in estimating the slope in simple linear regression. This is an extreme example of simple linear regression, with a sample size of n = 20 and a reliability ratio of λ = 0.5. The true value of the slope is βx = 1. The top panel is Fuller's modification of the correction for attenuation estimate; the bottom is the naive estimate that ignores measurement error. The former is much more variable; the latter is very badly biased.

3.5.1 Theoretical Bias–Variance Tradeoff Calculations

In this section, we will illustrate the bias versus variance tradeoff theoretically in simple linear regression. This material is somewhat technical, and readers may skip it without any loss of understanding of the main points of measurement error models.
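The extreme case just described is easy to reproduce by simulation. The following is a minimal sketch in Python; it is our illustration rather than the code behind Figure 3.9, and the hard lower bound placed on the estimated reliability ratio only mimics the spirit of Fuller's modification.

import numpy as np

rng = np.random.default_rng(1)
n, beta_x = 20, 1.0
sigma2_x = sigma2_u = 1.0          # reliability ratio lambda = 0.5
sigma2_eps = 0.25

naive, corrected = [], []
for _ in range(2000):
    x = rng.normal(0.0, np.sqrt(sigma2_x), n)
    w = x + rng.normal(0.0, np.sqrt(sigma2_u), n)        # W = X + U
    y = beta_x * x + rng.normal(0.0, np.sqrt(sigma2_eps), n)
    s_ww = np.var(w, ddof=1)
    b_naive = np.cov(w, y)[0, 1] / s_ww                  # ignores measurement error
    lam_hat = max(s_ww - sigma2_u, 0.1 * s_ww) / s_ww    # crude bound, in the spirit of Fuller
    naive.append(b_naive)
    corrected.append(b_naive / lam_hat)                  # correction for attenuation

print("naive:     mean %.2f, variance %.3f" % (np.mean(naive), np.var(naive)))
print("corrected: mean %.2f, variance %.3f" % (np.mean(corrected), np.var(corrected)))

On a typical run the naive slope is centered near 0.5 while the corrected slope is centered near 1 with several times the variance, which is precisely the tradeoff quantified next.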
Consider the simple linear regression model, Y = β0 + βx X + ǫ, with additive independent measurement error, W = X + U, under the simplifying assumption of joint normality of X, U, and ǫ. Further, suppose that the reliability ratio λ in (3.1) is known. We make this assumption only to simplify the discussion in this section. Generally, in applications it is seldom the case that this parameter is known, although there are exceptions (Fuller, 1987).

Let β̂x∗ denote the least squares estimate of slope from the regression of Y on W. We know that its mean is E(β̂x∗) = λβx. Denote its variance by σ∗^2.

The method-of-moments estimator of βx is β̂x,mm = λ^{-1} β̂x∗ and has mean E(β̂x,mm) = βx and variance var(β̂x,mm) = λ^{-2} σ∗^2.

Because λ < 1, it is clear that while the correction for attenuation in β̂x,mm reduces its bias to zero, there is an increase in variability relative to the variance of the biased estimator β̂x∗. The variability is inflated even further if an estimate λ̂ is used in place of λ.

The price for reduced bias is increased variance. This phenomenon is not restricted to the simple model and estimator in this section, but occurs with almost universal generality in the analysis of measurement error models. In cases where the absence of bias is of paramount importance, there is usually no escaping the increase in variance. In cases where some bias can be tolerated, consideration of mean squared error is necessary.

In the following material, we indicate that there are compromise estimators that may outperform both uncorrected and corrected estimators, at least in small samples. Surprisingly, outside of the work detailed in Fuller (1987), such compromise estimators have not been much investigated, especially for nonlinear models.

Remember that mean squared error (MSE) is the sum of the variance plus the square of the bias. This is an interesting criterion to use, because uncorrected estimators have more bias but smaller variance than corrected estimators, and the bias versus variance tradeoff is transparent. Note that

MSE(β̂x∗) = σ∗^2 + (1 − λ)^2 βx^2; and
MSE(β̂x,mm) = λ^{-2} σ∗^2.   (3.15)

It follows that

MSE(β̂x,mm) < MSE(β̂x∗)

if and only if

σ∗^2 < λ^2 (1 − λ) βx^2 / (1 + λ).

Because σ∗^2 decreases with increasing sample size, we can conclude that in sufficiently large samples it is always beneficial, in terms of mean squared error, to correct for attenuation due to measurement error.

Consider now the alternative estimator β̂x,a = a β̂x∗ for a fixed constant a. The mean squared error of this estimator is a^2 σ∗^2 + (aλ − 1)^2 βx^2, which is minimized when a = a∗ = λβx^2/(σ∗^2 + λ^2 βx^2). Ignoring the fact that a∗ depends on unknown parameters, we consider the "estimator" β̂x,∗ = a∗ β̂x∗, which has smaller mean squared error than either β̂x,mm or β̂x∗. Note that as σ∗^2 → 0, a∗ → λ^{-1}.

The estimator β̂x,∗ achieves its mean-squared-error superiority by making a partial correction for attenuation in the sense that a∗ < λ^{-1}. This simple exercise illustrates that estimators that make only partial corrections for attenuation can have good mean-squared-error performance. Although we have used a simple model and a somewhat artificial estimator to facilitate the discussion of bias and variance, all of the conclusions made above hold, at least to a very good approximation, in general for both linear and nonlinear regression measurement error models.

3.6 Attenuation in General Problems

We have already seen that, even in linear regression with multiple covariates, the effects of measurement error are complex and not easily described. In this section, we provide a brief overview of what happens in nonlinear models.

Consider a scalar covariate X measured with error, and suppose that there are no other covariates. In the classical error model for simple linear regression, we have seen that the bias caused by measurement error is always in the form of attenuation, so that ordinary least squares preserves the sign of the regression coefficient asymptotically, but is biased towards zero. Attenuation is a consequence then of (i) the simple linear regression model; and (ii) the classical additive error model. Without (i) and (ii), the effects of measurement error are more complex; we have already seen that attenuation may not hold if (ii) is violated.

In logistic regression when X is measured with additive error, attenuation does not always occur (Stefanski and Carroll, 1985), but it is typical. More generally, in most problems with a scalar X and no covariates Z, the underlying trend between Y and X is preserved under nondifferential measurement error, in the sense that the correlation between Y and W is positive whenever both E(Y|X) and E(W|X) are increasing functions of X (Weinberg, Umbach, and Greenland, 1993). Technically, this follows because with nondifferential measurement error, Y and W are uncorrelated given X, and hence the covariance between Y and W is just the covariance between E(Y|X) and E(W|X).
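Spelled out, the calculation behind the last sentence is the usual covariance decomposition (this one-line display is our addition; it uses only the nondifferential-error property that Y and W are uncorrelated given X):

cov(Y, W) = E{cov(Y, W|X)} + cov{E(Y|X), E(W|X)} = cov{E(Y|X), E(W|X)},

and the covariance of two increasing functions of the same scalar random variable is nonnegative, and is positive when both are strictly increasing and X is nondegenerate.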
Positively, this result says that for the very simplest of problems (scalar
X, no covariates Z measured without error), the general trend in the data
is typically unaffected by nondifferential measurement error. However,
the result illustrates only part of a complex picture, because it describes
only the correlation between Y and W and says nothing about the
structure of this relationship.
For example, one might expect that if the regression, E(Y|X), of Y on
X is nondecreasing in X, and if W = X + U where U is independent of
X and Y, then the regression of Y on W would also be nondecreasing.
But Hwang and Stefanski (1994) have shown that this need not be the
case, although it is true in linear regression with normally distributed
measurement error. However, these results show that making inferences
about details in the relationship of Y and X, based on the observed
relationship between Y and W, is a difficult problem in general.
There are other practical reasons why ignoring measurement error is
not acceptable. First, estimating the direction of the relationship be-
tween Y and X correctly is nice, but as emphasized by MacMahon et al.
(1990) we can be misled if we severely underestimate its magnitude. Sec-
ond, the result does not apply to multiple covariates, as we have noted
in Figure 3.5 for the analysis of covariance and in Figure 3.6 for corre-
lated measurement errors. Indeed, we have already seen that in multiple
linear regression under the additive measurement error model, the ob-
served and underlying trends may be entirely different. Finally, it is also
the case (Section 10.1) that, especially with multiple covariates, one can
use error modeling to improve the power of inferences. In large classes
of problems, then, there is simply no alternative to careful consideration
of the measurement error structure.

Bibliographic Notes
The linear regression problem has a long history and continues to be the
subject of research. Excellent historic background can be found in the
papers by Lindley (1953), Lord (1960), Cochran (1968) and Madansky
(1959). Further technical analyses are given by Fuller (1980), Car-
roll and Gallo (1982, 1984), and Carroll et al. (1985). Diagnostics are
discussed by Carroll and Spiegelman (1986, 1992) and Cheng and Tsai
(1992). Robustness is discussed by Ketellapper and Ronner (1984), Za-
mar (1988, 1992), Cheng and van Ness (1988), and Carroll et al. (1993).
Ganse, Amemiya, and Fuller (1983) discussed an interesting prediction
problem. Hwang (1986) and Hasenabeldy et al. (1989) discuss problems
with unusual error structure. Boggs et al. (1988) discussed computational
aspects of orthogonal regression in nonlinear models.

CHAPTER 4

REGRESSION CALIBRATION

4.1 Overview
In this monograph we will describe two simple, generally applicable ap-
proaches to measurement error analysis: regression calibration in this
chapter and simulation extrapolation (SIMEX) in Chapter 5.
The basis of regression calibration is the replacement of X by the
regression of X on (Z, W). After this approximation, one performs a
standard analysis. The simplicity of this algorithm disguises its power.
As Pierce and Kellerer (2004) state, regression calibration “is widely
used, effective (and) reasonably well investigated.” Regression calibra-
tion shares with multiple imputation the advantage that, as Pierce and Kellerer
note, “A great many analyses of the same cohort data are made for differ-
ent purposes . . . it is very convenient that (once the replacement is made)
essentially the same methods for ongoing analyses can be employed as if
X were observed.” Regression calibration is simple and potentially appli-
cable to any regression model, provided the approximation is sufficiently
accurate. SIMEX shares these advantages but is more computationally
intensive.
Of course, with anything so simple, yet seemingly so general, there
have to be some catches. These are:
• Estimating the basic quantity, the regression of X on (W, Z), is an
art. After all, we do not observe X! There is an active literature on
this topic, reviewed in Sections 4.4 and 4.5.
• No simple approximation can always be accurate. Regression calibra-
tion tends to be most useful for generalized linear models (GLIM),
helpful given the vast array of applications of these models. Indeed,
in many GLIM, the approximation is exact or painfully close to being
exact. We review this issue in Sections 4.8 and B.3.3.
• On the other hand, the regression calibration approximation can be
rather poor for highly nonlinear models, although sometimes fixups
are possible; see Section 4.7, where a unique application to a bioassay
example is made.
The algorithm is given in Section 4.2. An example using the NHANES
data is given in Section 4.3. Basic to the algorithm is a model for

E(X|Z, W), and methods of fitting such models are discussed in Sections 4.4 and 4.5. Section 4.6 provides brief remarks on calculating standard errors. The expanded regression calibration approximation in Section 4.7 attempts to improve the basic regression calibration approximation; the section includes a second example, the bioassay data. Sections 4.8 and 4.9 are devoted to theoretical justification of regression calibration and expanded regression calibration. Technical details, of which there are many, are relegated to Appendix B.3.

4.2 The Regression Calibration Algorithm

The regression calibration algorithm is as follows:

• Estimate the regression of X on (Z, W), mX(Z, W, γ), depending on parameters γ, which are estimated by γ̂. How to do this is described in Sections 4.4 and 4.5.

• Replace the unobserved X by its estimate mX(Z, W, γ̂), and then run a standard analysis to obtain parameter estimates.

• Adjust the resulting standard errors to account for the estimation of γ, using either the bootstrap or sandwich method; consult Appendix A for the discussion of these techniques.

Suppose, for example, that the mean of Y given (X, Z) can be described by

E(Y|Z, X) = mY(Z, X, B)   (4.1)

for some unknown parameter B. The replacement of X in (4.1) by its estimated value in effect proposes a modified model for the observed data, namely

E(Y|Z, W) ≈ mY{Z, mX(Z, W, γ), B}.   (4.2)

It is important to emphasize that the regression calibration model (4.2) is an approximate, working model for the observed data. It is not necessarily the same as the actual mean for the observed data, but in many cases is only modestly different. Even as an approximation, the regression calibration model can be improved; see Section 4.7 for refinements.

4.3 NHANES Example

The purpose of this section is to give an example of the application of regression calibration to logistic regression. In particular, we will illustrate the bias versus variance tradeoff exemplified by Figure 3.9 in Section 3.5.

We consider the analysis of the NHANES–I Epidemiologic Study Cohort data set (Jones, Schatzen, Green, et al., 1987). The predictor variables Z that are assumed to have been measured without appreciable error are age, poverty index ratio, body mass index, use of alcohol (yes–no), family history of breast cancer, age at menarche (a dummy variable taking on the value 1 if the age is ≤ 12), menopausal status (pre or post), and race. The variable measured with error, X, is long-term average daily intake of saturated fat (in grams). The response is breast cancer incidence. The analysis in this section is restricted to 3,145 women aged 25–50 with complete data on all the variables listed above; 59 had breast cancer. In general, logistic regression analyses with a small number of disease cases are very sensitive to misclassification, case deletion, etc.

Figure 4.1 Density estimates of transformed saturated fat for cases and controls: NHANES data.

Saturated fat was measured via a 24-hour recall, that is, a participant's diet in the previous 24 hours was recalled and nutrition variables computed. It is measured with considerable error (Beaton, Milner, and Little, 1979; Wu, Whittemore, and Jung, 1986), leading to controversy as regards the use of 24-hour recall to assess breast cancer risk (Prentice, Pepe, and Self, 1989; Willett, Meir, Colditz, et al., 1987).

Our analysis concerns the effect of saturated fat on risk of breast cancer, adjusted for the other variables. To give a first indication of the effects, we considered the marginal effect of saturated fat. Specifically, we considered the variable log(5 + saturated fat) and computed kernel density estimates (Silverman, 1986) of this variable for the breast cancer cases and for the noncases. The transformation was chosen for illustra-
tive purposes and because it makes the observed values nearly normally distributed. The results are given in Figure 4.1. Note that this figure indicates a small marginal but protective effect due to higher levels of saturated fat in the diet, which is in opposition to one popular hypothesis. Thus we should expect the logistic regression coefficient of saturated fat to be negative (hence, the higher the levels of fat, the lower the estimated risk of breast cancer).

Variable                   Estimate   Std. Error   p-value
Age /25                       2.09        .53       < .001
Poverty index                  .13        .08         .10
Body mass index / 100        −1.67       2.55         .51
Alcohol                        .42        .29         .14
Family history                 .63        .44         .16
Age at menarche              −0.19        .27         .48
Premenopausal                  .85        .43         .05
Race                           .19        .38         .62
log(5 + saturated fat)       −0.97        .29       < .001

Table 4.1 Logistic regression in the NHANES data.

In Table 4.1 we list the result of ignoring measurement error. This analysis suggests that transformed saturated fat is a highly significant predictor of risk with a negative logistic regression coefficient. Results in Chapter 10 show that the p-value is asymptotically valid because there are no other covariates measured with error.

There are at least two problems with these data that suggest that the results should be treated with extreme caution.

The first reason is that few epidemiologists would trust the results of a single 24-hour recall as a measure of long-term daily intake. The second is that if one also adds caloric intake into the model, something often done by epidemiologists, then the statistical significance for saturated fat seen in Table 4.1 disappears, with a p-value of 0.07.

By using data from the Continuing Survey of Food Intake by Individuals (CSFII, see Thompson, Sowers, Frongillo, et al., 1992), we estimate that over 75% of the variance of a single 24-hour recall is made up of measurement error. This analysis is fairly involved and was discussed in too much detail in the first edition: Here, we simply take the estimate as given, namely, that the observed sample variance of W is 0.233, and for the additive measurement error model, the measurement error variance is estimated as σ̂u^2 = 0.171. This error variance estimate is relatively close to the value 0.143 formed using a components-of-variance estimator given by (4.3) below when applied to 24-hour recalls in the American Cancer Society Cancer Prevention Study II (CPS II) Nutrition Survey Validation Study, which has n = 184 individuals with four 24-hour recalls per individual.

Figure 4.2 Bootstrap analysis of the estimated coefficient ignoring measurement error (top panel) and accounting for it via regression calibration. Note how the effect of measurement error is to attenuate the coefficient, and the effect of correcting for measurement error is to widen confidence intervals. Compare with Figure 3.9.

We applied regression calibration to these data, using the "resampling pairs" bootstrap (Section A.9.2) to get estimated standard errors. The parameter estimate was −4.67 with an estimated variance of 2.26, along with an associated percentile 95% confidence interval from −10.37 to −1.38. What might be most interesting is the results from this bootstrap, given as histograms in Figure 4.2. There we see the bias versus variance tradeoff exemplified by Figure 3.9 in Section 3.5. Specifically, note how the bootstrap, when ignoring measurement error, is tightly bunched around a far too small estimated value, while the bootstrap accounting for measurement error is centered at a very different place and with much more variability.
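Schematically, the analysis just described follows the three steps of the algorithm in Section 4.2. The sketch below, in Python, is our own illustration and not the software used for the NHANES analysis (nor the STATA rcal function mentioned in Section 4.6); it assumes a scalar mismeasured covariate, a single error-free covariate, an externally estimated error variance, and synthetic data, and it omits the bootstrap adjustment of standard errors.

import numpy as np

def fit_logistic(design, y, iters=25):
    """Newton-Raphson fit of a logistic regression; returns the coefficient vector."""
    beta = np.zeros(design.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        weights = p * (1.0 - p)
        hessian = design.T @ (design * weights[:, None])
        beta = beta + np.linalg.solve(hessian, design.T @ (y - p))
    return beta

def regression_calibration_logistic(y, z, w, sigma2_u):
    """Steps of Section 4.2 for scalar W and scalar Z with an external error variance."""
    # Step 1: best linear approximation to E(X | Z, W) under the additive error model.
    mu_w, mu_z = w.mean(), z.mean()
    s_ww, s_zz = w.var(ddof=1), z.var(ddof=1)
    s_wz = np.cov(w, z)[0, 1]
    s_xx = s_ww - sigma2_u                        # var(X) = var(W) - var(U)
    top = np.array([s_xx, s_wz])
    cov_wz = np.array([[s_xx + sigma2_u, s_wz],
                       [s_wz,            s_zz]])
    slopes = np.linalg.solve(cov_wz, top)
    x_hat = mu_w + slopes[0] * (w - mu_w) + slopes[1] * (z - mu_z)
    # Step 2: run the standard (logistic) analysis with X replaced by x_hat.
    # Step 3 (not shown): bootstrap the whole procedure to get standard errors.
    design = np.column_stack([np.ones_like(w), x_hat, z])
    return fit_logistic(design, y)

# Tiny synthetic illustration with invented parameter values.
rng = np.random.default_rng(0)
n, sigma2_u = 2000, 0.75
x = rng.normal(0, 1, n); z = rng.normal(0, 1, n)
w = x + rng.normal(0, np.sqrt(sigma2_u), n)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-(-1.0 + 1.0 * x + 0.5 * z)))).astype(float)
print(regression_calibration_logistic(y, z, w, sigma2_u))

The calibration step is simply the best linear approximation of Section 4.4.2 below, specialized to one measurement per subject.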
4.4 Estimating the Calibration Function Parameters

4.4.1 Overview and First Methods

The basic point of using the regression calibration approximation is that one runs a favorite analysis with X replaced by its regression on (Z, W). In this section, we discuss how to do this regression.

There are two simple cases:

• With internal validation data, the simplest approach is to regress X on the other covariates (Z, W) in the validation data. Of course, this is a missing data problem, and generally one would then use missing data techniques rather than regression calibration. Regression calibration in this instance is simply a poor person's imputation methodology. As a practical matter, for a quick analysis we suggest that one use the X data where it is available, but add in a dummy variable to distinguish between the cases that X or its regression calibration version are used.

• In some problems, for example, in nutritional epidemiology, an unbiased instrument T is available for a subset of the study participants; see Section 2.3. Here, by definition of "unbiased instrument," the regression of T on (Z, W) is the same as what we want, the regression of X on (Z, W). This is the method used by Rosner, Spiegelman and Willett (1990) in their analysis of the Nurses' Health Study; see Section 1.6.2. In that study, health outcomes and dietary intakes as measured by a food frequency questionnaire (FFQ) W were observed on all study participants. On a subset of the study participants, dietary intakes were assessed by food diaries, T. The investigators assumed that the diaries were unbiased for usual dietary intake and applied regression calibration by regressing the intakes from diaries on those from the FFQ.

With validation data or an unbiased instrument, models for E(X|Z, W) can be checked by ordinary regression diagnostics such as residual plots.

4.4.2 Best Linear Approximations Using Replicate Data

Here we consider the classical additive error model W = X + U, where conditional on (Z, X) the errors have mean zero and constant covariance matrix Σuu. We describe an algorithm yielding a linear approximation to the regression calibration function. The algorithm is applicable when Σuu is estimated via external data or via internal replicates. The method was derived independently by Carroll and Stefanski (1990) and Gleser (1990), and used by Liu and Liang (1992) and Wang, Carroll, and Liang (1996).

In this subsection, we will discuss using replicate measurements of X, that is, replicated W measuring the same X. When necessary, the convention made in this book is to adjust the replicates a priori so that they have the same sample means.

Suppose there are ki replicate measurements, Wi1, ..., Wiki, of Xi, and W̄i· is their mean. Replication enables us to estimate the measurement error covariance matrix Σuu by the usual components of variance analysis, as follows:

Σ̂uu = { Σ_{i=1}^n Σ_{j=1}^{ki} (Wij − W̄i·)(Wij − W̄i·)^t } / { Σ_{i=1}^n (ki − 1) }.   (4.3)

In (4.3), remember that we are using the "dot and overbar" notation to mean averaging over the "dotted" subscript.

Write Σab as the covariance matrix between two random variables, and let μa be the mean of a random variable. The best linear approximation to X given (Z, W) is

E(X|Z, W) ≈ μx + (Σxx, Σxz) [ Σxx + Σuu/k, Σxz; Σxz^t, Σzz ]^{-1} ( W̄ − μw; Z − μz ).   (4.4)

Here is how one can operationalize (4.4) based on observations (Zi, W̄i·), replicate sample sizes ki and estimated error covariance matrix Σ̂uu. We use analysis of variance formulae. Let

μ̂x = μ̂w = Σ_{i=1}^n ki W̄i· / Σ_{i=1}^n ki;   μ̂z = Z̄·;
ν = Σ_{i=1}^n ki − (Σ_{i=1}^n ki^2) / (Σ_{i=1}^n ki);
Σ̂zz = (n − 1)^{-1} Σ_{i=1}^n (Zi − Z̄·)(Zi − Z̄·)^t;
Σ̂xz = Σ_{i=1}^n ki (W̄i· − μ̂w)(Zi − Z̄·)^t / ν;
Σ̂xx = [ { Σ_{i=1}^n ki (W̄i· − μ̂w)(W̄i· − μ̂w)^t } − (n − 1) Σ̂uu ] / ν.

The resulting estimated calibration function is

E(Xi|Zi, W̄i·) ≈ μ̂w + (Σ̂xx, Σ̂xz) [ Σ̂xx + Σ̂uu/ki, Σ̂xz; Σ̂xz^t, Σ̂zz ]^{-1} ( W̄i· − μ̂w; Zi − Z̄· ).   (4.5)

In linear regression, if there are no replicates (ki ≡ 1) but an external
estimate Σ̂uu is available, or if there are exactly two replicates (ki ≡ 2), in which case Σ̂uu is half the sample covariance matrix of the differences Wi1 − Wi2, regression calibration reproduces the classical method-of-moments estimates, that is, the estimators (3.13) of Section 3.4 with Σuu estimated from replicates and Σǫu assumed to be 0.

When the number of replicates is not constant, the algorithm can be shown to produce consistent estimates in linear regression and (approximately!) in logistic regression. For loglinear mean models, the intercept is biased, so one should add a dummy variable to the regression indicating whether or not an observation is replicated.

4.4.3 Alternatives When Using Partial Replicates

The linear approximations defined above are only approximations, but they can be checked by using the replicates themselves. As is typical, if only a partial subset of the study has an internal replicate (ki = 2), while most of the data are not replicated (ki = 1), the partial replicates can be used to check the best linear approximations to E(X|Z, W) defined above, by fitting models to the regression of Wi2 on (Zi, Wi1). If necessary, the partial replication data can be used in this way to estimate E(X|Z, W). A good picture to study is Figure 1.2, where we plot the log protein biomarkers against one another. The scatterplot suggests a linear relationship, and a linear model here seems perfectly reasonable.

4.4.4 James–Stein Calibration

Whittemore (1989) also proposed regression calibration in the case that X is scalar, there is no Z, and the additive error model applies. If σu^2 is unknown and there are k replicates at each observation, then instead of the method-of-moments estimate (4.5) of E(X|W), she suggested use of the James–Stein estimate, namely

W̄·· + [ 1 − {(n − 1)/(n − 3)} {n(k − 1)/(n(k − 1) + 2)} (σ̂u^2/k)/σ̂w^2 ] (W̄i· − W̄··),

where σ̂u^2 is the usual components of variance estimate of σu^2 defined in (4.3) and σ̂w^2 is the sample variance of the terms W̄i·. Typically, the James–Stein and moments estimates are nearly the same.

4.5 Multiplicative Measurement Error

Until now we have assumed that the measurement errors are additive, but multiplicative errors are common and require special treatment. Multiplicative errors can be converted to additive ones by a log transformation, and first we discuss when a log transformation should be used. Then we introduce alternative strategies for use when a log transformation seems inappropriate.

4.5.1 Should Predictors Be Transformed?

A good example of multiplicative measurement error can be seen in Figures 1.6, 1.7, and 1.8, as described in Section 1.7. This is a case where taking logarithms seems to lead to an additive measurement error model with constant measurement error variance. Other scientists also find multiplicative measurement errors. Lyles and Kupper (1997) state that "there is much evidence for this model" (meaning the multiplicative error model). Pierce, Stram, Vaeth, et al. (1992) study data from the Radiation Effects Research Foundation (RERF) in Hiroshima. They state, "It is accepted that radiation dose–estimation errors are more homogeneous on a multiplicative than on an additive scale." Hwang (1986) studied data on energy consumption and again found multiplicative errors.

While the existence of multiplicative measurement errors is in no doubt, what to do in that situation is a matter of some controversy. Indeed, this has nothing to do with measurement error: taking the logarithm of a predictor is a perfectly traditional way to lessen the effects of leverage in regression. Thus, many authors simply use the transformed data scale. In most of the nutrition examples of which we are aware, investigators use the transformed predictor as W and carry out analyses: It is trivial then to construct relative risks from the lowest to highest quintiles of a nutrient. Alternatively, they often categorize the observed data into quintiles and run a test for trend against the quintile indicators. It would not be typical to run logistic regression analyses on the original scale data, which are often horribly skew. As we will see, multiplicative measurement error generally means that the largest observed values are very far from the actual values, often an order of magnitude in difference. These considerations dictate against using the original data scale when running an analysis that ignores measurement error.

There are, however, many researchers who prefer to fit a regression model in the original scale, rather than in a transformed scale. We tend to have little sympathy, in the absence of data analysis, for assertions of the type that scientifically one scale is to be preferred. However, it is important to have the flexibility to fit measurement error models on the original data scale. In this section, we describe how to implement regression calibration in the multiplicative context.
4.5.2 Lognormal X and U

In this section, among other things, we will show that in linear regression the effect of multiplicative measurement error is to make the observed untransformed data appear as if they are curved, not linear.

The multiplicative lognormal error model with a scalar X and an unbiased version of it is

W = XU,   log(U) ∼ Normal{−(1/2)σu^2, σu^2}.   (4.6)

For simplicity, we assume that there are no covariates Z measured without error. If X is also lognormal and independent of U, then regression calibration takes a simple form. Let μw,log and σw,log^2 be the mean and variance of log(W), respectively. Let

λ = var{log(X)}/var{log(W)} = (σw,log^2 − σu^2)/σw,log^2,

and α = μw,log(1 − λ) + (1/2)σu^2. Then by (A.2) and (A.3)

E(X|W) = W^λ exp(α + λσu^2/2).   (4.7)
var(X|W) = W^{2λ} exp(2α) {exp(2λσu^2) − exp(λσu^2)}.   (4.8)

Replacing μw,log and σw,log^2 by the sample mean and variance of log(W), and plugging these values into the forms for α and λ allows one to implement regression calibration using (4.7). Of course, one needs an estimate of σu^2 as well. This parameter can be estimated using validation or replication data by the methods discussed in Section 4.4, but applied to log(W), log(U), and log(X). Note how the regression calibration function is nonlinear in W.

The way to derive (4.7) and (4.8) is modestly amusing. Take logarithms of both sides of (4.6) to get log(W) = log(X) + log(U), and then use equation (A.9) of Appendix A.4 to find that log(X) = α + λ log(W) + V, where V is Normal(0, λσu^2), and finally

X = W^λ exp{α + V},

from which (4.7) and (4.8) follow from standard moment generating properties; see (A.2) of the appendix.

One of the exciting consequences of (4.7) is that if the regression of Y on X is linear in X, say E(Y|X) = β0 + βxX, then the regression of Y on the observed W is nonlinear, that is, by (4.7)

E(Y|W) = β0 + βx exp(α + λσu^2/2) W^λ.

Therefore, to obtain an asymptotically unbiased slope estimate, one regresses Y on W^λ exp(α + λσu^2/2). Note that the regression of Y on W is not linear, even though the regression of Y on X is linear. This is because the regression of X on W is not linear.

Figure 4.3 Simulation of multiplicative measurement error: W = XU, log(X) = Normal(0, 1/4), log(U) = Normal(−1/8, 1/4), Y = (1/2)X + Normal(0, 0.04), n = 50 observations. The solid line is the fit to the unobserved X data in asterisks, while the dashed line is the spline fit to the observed W data in plus signs. Note how the multiplicative error has induced a curve into what should have been a straight line. Note too the stretching effect of the measurement error.

Figure 4.3 shows 50 observations of simulated data with multiplicative lognormal measurement errors and Y linear in X. The data with the true X values are plotted with asterisks and the data with the surrogates W are plotted with pluses. A dotted line connects (Yi, Xi) to (Yi, Wi) for each i = 1, ..., n. Penalized splines (Ruppert, Wand, and Carroll, 2003) were fit to the true covariates and the surrogates and plotted as solid and dashed lines, respectively. Notice that, as theory predicts, the spline fit to the true covariates is linear but the spline fit to the surrogates is curved. One can also see attenuation; the derivative (slope) of the curved fit to the surrogates is seen to be everywhere less than the slope of the straight line fit to the true covariates.

Figure 4.4 shows 1,000 observations of simulated data from the same joint distribution as in Figure 4.3. Only (Yi, Wi) data are plotted, but penalized spline fits are shown to both (Yi, Wi) and (Yi, Xi). Figures 4.3 and 4.4 are similar, but, due to the larger sample size, the latter has more extreme W values and shows the curvature of E(Y|W) more dramatically.

Figure 4.5 uses the simulated data in Figure 4.4 and shows a plot of Y
versus W^λ, with λ = 1/2, and a penalized spline fit to it. As predicted by the theory, the regression of Y on W^λ appears linear.

Figure 4.4 Simulation of multiplicative measurement error: n = 1000, W = XU, log(X) = Normal(0, 1/4), log(U) = Normal(−1/8, 1/4), Y = X/2 + Normal(0, 0.04). The solid line is the fit to the unobserved X data in asterisks, while the dashed line is the spline fit to the observed W data in plus signs.

In this context, regression calibration means replacing the unknown X by E(X|W) given by (4.7). This is nonlinear regression calibration, since E(X|W) is nonlinear in W. Notice by formula (4.8) for var(X|W) that, except when βx = 0, the regression of Y on W is heteroscedastic even if the regression of Y on X is homoscedastic. In the presence of heteroscedasticity, ordinary unweighted least-squares is inefficient and, to gain efficiency, statisticians often use quasilikelihood; see Section A.7. Lyles and Kupper (1997) proposed a quasilikelihood estimator that was somewhat superior to nonlinear regression calibration in their simulation study, especially when the covariate measurement error is large.

In this section, we have focused on the case when X is lognormal, because of the simplicity of the expressions. The methods described above should work reasonably well if the unobserved covariate X is roughly lognormal, but there are no sensitivity studies done to date to confirm this.

The estimator of E(X|W) in this section assumes that both X and U are lognormally distributed. Pierce and Kellerer (2004) have described a method based on a Laplace approximation that is more nonparametric for the estimation of E(X|W).

Figure 4.5 Simulation of multiplicative measurement error: n = 1000, W = XU, log(X) = Normal(0, 1/4), log(U) = Normal(−1/8, 1/4), Y = X/2 + Normal(0, 0.04). Note: λ = 1/2. The line is a penalized spline fit to the regression of Y on W^λ, the regression calibration approximation. Theory predicts that this should be linear.

4.5.3 Linear Regression

Fuller (1984) and Hwang (1986) independently developed a method-of-moments correction for multiplicative measurement error in linear regression. They make no assumptions that either the measurement errors or the true predictors are lognormal.

Their basic idea is to regress Y on W (not W^λ) and then make a method-of-moments correction similar to (3.13):

[ Sww./Muu, Swz; Szw, Szz ]^{-1} ( Swy − Σuǫ; Szy ),   (4.9)

where A./B is coordinate-wise division of equal-size matrices A and B, that is, (A./B)ij = Aij/Bij, Muu is the second moment matrix of U, Sww is the sum of cross-products matrix for W, and so forth. In the following, Σuǫ is assumed to be zero. The Fuller–Hwang method is called the "correction method" by Lyles and Kupper (1997) and is similar to linear regression calibration defined in Section 4.2, because both methods are based on regressing Y on either W itself (Fuller–Hwang method) or a linear function of W (regression calibration). In fact, for scalar X
the Fuller–Hwang estimator is the same as linear regression calibration when the calibration function predicts X using a linear function of W without an intercept, that is, a function of form λW, as discussed in Section A.4.2. As shown in that section, λ = 1/E(U^2), and therefore one can show that both the Fuller–Hwang method and regression calibration without an intercept multiply the ordinary least-squares slope estimate by E(U^2). Thus, the Fuller–Hwang estimator uses a less accurate predictor of X than linear regression calibration when the calibration function is allowed an intercept, which suggests that the Fuller–Hwang method might be inferior to the latter. The two estimators have apparently not been compared, possibly because nonlinear regression calibration seems more appropriate than either of them.

The Fuller–Hwang estimator is consistent but was found to be badly biased in simulations of Lyles and Kupper (1999). The bias is still noticeable for n = 10,000 in their simulations. Iturria, Carroll, and Firth (1999) found similar problems when linear regression calibration is used with multiplicative errors.

In addition, Iturria, Carroll, and Firth (1999) studied polynomial regression with multiplicative error. One of their general methods is a special case of the Fuller–Hwang estimator. Their "partial regression" estimator assumes lognormality of (X, U) and generalizes the nonlinear regression calibration estimator discussed earlier. In a simulation with lognormal X and U, the partial regression estimator is often much more efficient than the ones that do not assume lognormality. For all these reasons, we favor our approaches over those of the Fuller–Hwang estimator.

The regression calibration methods of this section are not the only ways of handling multiplicative errors. For example, the Bayesian analysis of multiplicative error is discussed in Section 9.5.3.

4.5.4 Additive and Multiplicative Error

A model with both additive and multiplicative error is W = XU1 + U2, where U1 and U2 are independent errors with variances σu,1^2 and σu,2^2, respectively. This model implies that var(W|X) = X^2 σu,1^2 + σu,2^2. For sufficiently small values of X, var(W|X) ≈ σu,2^2, while for sufficiently large values of X, var(W|X) ≈ X^2 σu,1^2. This model has been studied by Rocke and Durbin (2001) and applied by them to gene expression levels measured by cDNA slides. As far as we are aware, this model has not been applied as a measurement error model in regression, but research on this topic seems well worthwhile. A Berkson model that contains a mixture of additive and multiplicative errors has been proposed by Stram and Kopecky (2003) in their study of the Hanford Thyroid Disease Study.

4.6 Standard Errors

It is possible to provide asymptotic formulae for standard errors (Carroll and Stefanski, 1990), but doing so is extremely tedious because of the multiplicity of special cases. Some explicit formulae are given in the appendix (Section B.3.1) for the case of generalized linear models, and for models in which one specifies only the mean and variance of the response given the predictors.

The bootstrap (Section A.9) requires less programming (and mathematics!) but takes more computer time. In the first edition, we remarked that this can be a real issue because, as Donna Spiegelman has pointed out, many researchers would prefer to have quick standard errors instead of having to use the bootstrap repeatedly while building models for their data. However, faster computers and better software are reducing the time needed to perform bootstrap inference. For example, the rcal function in STATA uses the bootstrap to obtain standard errors in "real time," 1 minute for the ARIC data set.

In its simplest form, the bootstrap can be used to form standard error estimates, and then t-statistics can be constructed using the bootstrap standard errors. The bootstrap percentile method can be used for confidence intervals. Approximate bootstrap pivots can be formed by ignoring the variability in the estimation of the calibration function.

4.7 Expanded Regression Calibration Models

A major purpose of regression calibration is to derive an approximate model for the observed (Y, Z, W) data in terms of the fundamental model parameters. The regression calibration method is one means to this end: Merely replace X by an estimate of E(X|Z, W). This method works remarkably well in problems such as generalized linear models, for example, linear regression, logistic regression, Poisson and gamma regression with loglinear links, etc. However, it is often not appropriate for highly nonlinear problems.

It is convenient for our purposes to cast the problems in the form of what are called mean and variance models, often called quasilikelihood and variance function (QVF) models, which are described in more generality and detail in (A.35) and (A.36). Readers unfamiliar with the ideas of quasilikelihood may wish to skip this material at first reading and continue into later chapters.

Mean and variance models specify the mean and variance of a response
Y as functions of covariates (X, Z) and unknown parameters. For example, in linear regression, the mean is a linear function of the covariates, and the variance is constant. We write these models in general as

E(Y|Z, X) = mY(Z, X, B)   (4.10)
var(Y|Z, X) = σ^2 g^2(Z, X, B, θ),   (4.11)

where g^2(Z, X, B, θ) is some nonnegative function and σ^2 is a scale parameter. The parameter vector θ contains parameters in addition to B that specify the variance function. In some models, for example, linear, logistic, Poisson and gamma regression, θ is not needed, since there are no additional parameters.

Of course, since X is not observed, to fit a mean and variance function model, what we need is the mean and variance of Y given the observed data. There are two possible approaches:

• Posit a probability model for the distribution of X given (Z, W), then compute, exactly, E(Y|Z, W) = E{mY(Z, X, B)|Z, W} and var(Y|Z, W) = var{mY(Z, X, B)|Z, W} + σ^2 E{g^2(Z, X, B, θ)|Z, W}.

• Instead of a probability model for the entire distribution, posit a model for the mean and variance of X given (Z, W) and then do Taylor series expansions to estimate the mean and variance of the response given the observed data. These are the expanded regression calibration approximations.

Regression calibration, in effect, says that X given (Z, W) is completely specified, with no error, by its mean, so that

E(Y|Z, W) ≈ mY{Z, mX(Z, W, γ), B};   (4.12)
var(Y|Z, W) ≈ σ^2 g^2{Z, mX(Z, W, γ), B, θ}.   (4.13)

We will show that in some cases, the model can be modified to improve the fit; see Section 4.7.3 for a striking data application.

An example will help explain the possible need for refined approximations. Consider the simple linear homoscedastic regression model E(Y|X) = β0 + βxX and var(Y|X) = σ^2. Suppose the measurement process induces a heteroscedastic Berkson model where E(X|W) = W and var(X|W) = σrc^2 W^{2γ}, where rc stands for regression calibration. The regression calibration approximate model states that the observed data follow a simple linear homoscedastic regression model with X replaced by E(X|W) = W. However, while this gives a correct mean function, the actual variance function for the observed data is heteroscedastic: var(Y|W) = σ^2 + σrc^2 βx^2 W^{2γ}. Hence the regression calibration model gives a consistent estimate of the slope and intercept, but the estimate is inefficient because weighted least squares should have been used. If it is important enough to affect the efficiency of the estimates, the heteroscedasticity should show up in residual plots.

The preceding example shows that a refined approximation can improve efficiency of estimation, while the next describes a simple situation where bias can also be corrected; another example is discussed in the loglinear mean model case in Section 4.8.3. Consider ordinary homoscedastic quadratic regression with E(Y|X) = β0 + βx,1X + βx,2X^2. Use the same heteroscedastic Berkson model as before. Then the regression calibration approximation suggests a homoscedastic model with X replaced by W, while in fact the observed data have mean β0 + βx,1W + βx,2(W^2 + σrc^2 W^{2γ}). If the Berkson error model is heteroscedastic, the regression calibration approximation will lead to a biased estimate of the regression parameters.

It is important to stress that these examples do not invalidate regression calibration as a method, because the heteroscedasticity in the Berkson error model has to be fairly severe before much effect will be noticed. However, there clearly is a need for refined approximations that take over when the regression calibration approximation breaks down.

4.7.1 The Expanded Approximation Defined

We will consider the QVF models (4.10) and (4.11). We will focus entirely on the case that X is a scalar. Although the general theory (Carroll and Stefanski, 1990) does allow multiple predictors, the algebraic details are unusually complex.

We will discuss three different sets of approximations:

• A general formula.

• A modification of the general formula that is range preserving, for example, when a function must be positive.

• A simplification of the formula when functions are not too badly curved.

4.7.1.1 The General Development

The expanded approximation starts with both the mean and variance of X given (Z, W):

E(X|Z, W) = mX(Z, W, γ);   (4.14)
var(X|Z, W) = σrc^2 V^2(Z, W, γ).   (4.15)

We wish to construct approximations to the mean and variance function of the observed response given the observed covariates. Carroll and Stefanski (1990) based such approximations on pretending that σrc^2 is "small"; if it equals zero, the resulting approximate model is the regression calibration model.

Here is how the approximation works. Let mY,x and mY,xx be the first and second derivatives of mY(z, x, B) with respect to x, and let sx(z, w, B, θ, γ) and sxx(·) be the first and second derivatives of s(z, x, B, θ) = g^2(z, x, B, θ) with respect to x and evaluated at x = E(X|Z = z, W = w). Defining mX(·) = mX(Z, W, γ) and V(·) = V(Z, W, γ), simple Taylor series expansions in Section B.3.3 with σrc^2 → 0 yield the following approximate model, which we call the expanded regression calibration model:

E(Y|Z, W) ≈ mY{Z, mX(·), B} + (1/2)σrc^2 V^2(·) mY,xx{Z, mX(·), B};   (4.16)
var(Y|Z, W) ≈ σ^2 g^2{Z, mX(·), B, θ} + σrc^2 V^2(·) {mY,x^2(·) + (1/2)σ^2 sxx(·)}.   (4.17)

There are important points to note about the approximate model (4.16)–(4.17):

• By setting σrc^2 = 0, it reduces to the regression calibration model, in which we need only estimate E(X|Z, W).

• It is an approximate model that serves as a guide to final model construction in individual cases. We are not assuming that the measurement error is small, only pretending that it is in order to derive a plausible model for the observed data in terms of the regression parameters of interest. In some instances, terms can be dropped or combined with others to form even simpler useful models for the observed data.

• It is a mean and variance model for the observed data. Hence, the techniques of model fitting and model exploration discussed in Carroll and Ruppert (1988) can be applied to nonlinear measurement error model data.

4.7.1.2 Range-Preserving Modification

One potential problem with the expanded regression calibration model (4.16)–(4.17) is that it might not be range preserving. For example, because of the term sxx(·), the variance function (4.17) need not necessarily be positive. If the original function mY(·) is positive, the new approximate mean function (4.16) need not be positive because of the term mY,xx(·). A range-preserving expanded regression calibration model for the observed data is

E(Y|Z, W) ≈ mY[Z, mX(·) + (1/2)σrc^2 V^2(·) mY,xx(·)/mY,x(·), B];   (4.18)
var(Y|Z, W) ≈ σrc^2 mY,x^2{Z, mX(·), B} V^2(·) + σ^2 g^2[Z, mX(·) + (1/2)σrc^2 V^2(·) sxx(·)/sx(·), B, θ].   (4.19)

4.7.1.3 Models Without Severe Curvature

When the models for the mean and variance are not severely curved, mY,xx and sxx are small relative to mY(·) and g^2(·), respectively. In this case, setting κ = σrc^2/σ^2, the mean and variance functions of the observed data greatly simplify to

E(Y|Z, W) ≈ mY{Z, mX(·), B};
var(Y|Z, W) ≈ σ^2 [g^2{Z, mX(·), B, θ} + κ V^2(·) mY,x^2(·)].

Having estimated the mean function mX(·), this is just a QVF model in the parameters (B, θ∗), where θ∗ consists of θ, κ and the other parameters in the function V^2(·). In principle, the QVF fitting methods of Appendix A can be used.

4.7.2 Implementation

The approximations (4.16) and (4.17) require specification of the mean and variance functions (4.14) and (4.15). In the Berkson model, the former is just W and a flexible model for the latter is σrc^2 W^{2γ}, with γ = 0 indicating homoscedasticity. We will see later in a variety of examples that, for this Berkson class, the model parameters (B, θ) are often estimable via QVF techniques using the approximate models, without the need for any validation data. The Berkson framework thus serves as an ideal environment for expanded regression calibration models.

Outside the Berkson class, we have already discussed in Sections 4.4 and 4.5 methods for estimating the conditional mean of X. If possible, one should use available data to estimate the conditional variance function. For example, if there are k unbiased replicates in an additive measurement error model, then the natural counterpart to the best linear estimate of the mean function is the usual formula for the variance in a regression, namely var(X|Z, W) = σrc^2, where if σx^2 is the variance of X and σu^2 is the measurement error variance,

σrc^2 = σx^2 − (σx^2, Σxz) [ σx^2 + σu^2/k, Σxz; Σxz^t, Σzz ]^{-1} (σx^2, Σxz)^t.

This can be estimated using the formulae of Section 4.4.2.
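As a numerical check of the quadratic-regression example of Section 4.7 (our own sketch; the parameter values are invented purely for illustration), the following Python fragment compares the plain regression calibration mean β0 + βx,1W + βx,2W^2 with the expanded mean β0 + βx,1W + βx,2(W^2 + σrc^2 W^{2γ}) under a heteroscedastic Berkson model.

import numpy as np

rng = np.random.default_rng(2)
beta0, beta1, beta2 = 1.0, 2.0, 1.5
sigma_rc, gamma_b, sigma_eps = 0.4, 1.0, 0.1       # Berkson scale, power, response sd

for wi in (0.5, 1.0, 2.0):                          # a few fixed design points
    # Berkson model: E(X|W) = W, var(X|W) = sigma_rc^2 * W^(2*gamma_b)
    x = wi + sigma_rc * wi**gamma_b * rng.standard_normal(200_000)
    y = beta0 + beta1 * x + beta2 * x**2 + sigma_eps * rng.standard_normal(x.size)
    rc_mean = beta0 + beta1 * wi + beta2 * wi**2                    # plain regression calibration
    erc_mean = rc_mean + beta2 * sigma_rc**2 * wi**(2 * gamma_b)    # expanded correction term
    print(f"W={wi:3.1f}  E(Y|W)={y.mean():6.3f}  RC={rc_mean:6.3f}  expanded RC={erc_mean:6.3f}")

The Monte Carlo mean of Y at each fixed W agrees with the expanded mean, while the plain regression calibration mean is biased, exactly as the expansion predicts.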
H W Y H W Y H W Y H W Y H W Y H W Y H W Y H W Y

0 0 1.51 0 0 1.43 1 1 0.05 1 2 0.06 0 0 1.21 0 0 1.10 1 1 0.04 1 2 0.09


1 4 0.15 1 8 0.40 1 16 0.76 1 32 0.95 1 4 0.12 1 8 0.25 1 16 0.56 1 32 1.04
2 1 0.04 2 2 0.07 2 4 0.13 2 8 0.52 2 1 0.05 2 2 0.06 2 4 0.14 2 8 0.35
2 16 0.79 2 32 1.17 3 1 0.05 3 2 0.26 2 16 0.90 2 32 1.12 3 1 0.06 3 2 0.21
3 4 0.28 3 8 0.70 3 16 1.05 3 32 1.30 3 4 0.37 3 8 0.60 3 16 1.01 3 32 0.70
4 1 0.11 4 2 0.42 4 4 0.59 4 8 0.90 4 1 0.10 4 2 0.20 4 4 0.47 4 8 0.95
4 16 1.08 4 32 1.24 5 1 0.04 5 2 0.06 4 16 1.07 4 32 0.93 5 1 0.05 5 2 0.07
5 4 0.19 5 8 0.50 5 16 0.84 5 32 1.17 5 4 0.09 5 8 0.29 5 16 0.78 5 32 1.05
6 1 0.04 6 2 0.04 6 4 0.24 6 8 0.70 6 1 0.05 6 2 0.07 6 4 0.16 6 8 0.39
6 16 1.21 6 32 1.01 7 1 0.05 7 2 0.08 6 16 0.78 6 32 0.97 7 1 0.04 7 2 0.11
7 4 0.14 7 8 0.60 7 16 1.20 7 32 1.30 7 4 0.24 7 8 0.48 7 16 0.94 7 32 1.30
8 1 0.38 8 2 0.64 8 4 0.88 8 8 1.09 8 1 0.15 8 2 0.26 8 4 0.60 8 8 0.87
8 16 1.50 8 32 1.30 8 16 0.61 8 32 0.98
0 0 1.01 0 0 1.34 1 1 0.05 1 2 0.07
Table 4.2 continued.
1 4 0.09 1 8 0.26 1 16 0.55 1 32 1.21
2 1 0.04 2 2 0.06 2 4 0.19 2 8 1.16
2 16 0.96 2 32 1.13 3 1 0.04 3 2 0.17
3 4 0.33 3 8 0.50 3 16 1.11 3 32 1.20 4.7.3 Bioassay Data
4 1 0.12 4 2 0.30 4 4 0.41 4 8 1.06
4 16 1.29 4 32 1.17 5 1 0.04 5 2 0.07 Rudemo, Ruppert, and Streibig (1989) described a bioassay problem fol-
5 4 0.19 5 8 0.36 5 16 0.88 5 32 1.16 lowing a heteroscedastic Berkson error model. In this experiment, four
6 1 0.04 6 2 0.05 6 4 0.22 6 8 0.61 herbicides were applied either as technical grades or as commercial for-
6 16 1.15 6 32 1.39 7 1 0.04 7 2 0.18 mulations; thus there are eight herbicides: four pairs of two herbicides
7 4 0.27 7 8 0.88 7 16 0.97 7 32 1.26 each. The herbicides were applied at the six different nonzero doses 2j−5
8 1 0.29 8 2 0.98 8 4 1.12 8 8 1.10 for j = 0, 1, . . . , 5. There were also two zero dose observations. The re-
8 16 1.13 8 32 1.31 sponse Y was the dry weight of five plants grown in the same pot. There
were three complete replicates of this experiment done at three different
time periods, so that the replicates are a blocking factor. The data are
Table 4.2 The bioassay data. Here Y is the response and W is the nominal listed in Table 4.2.
dose time 32. The herbicides H are listed as 1–8, and H = 0 means a zero Let Z1 be a vector of size eight with a single nonzero element indicating
dose. The replicates R are separated by horizontal lines. The herbicide pairs which herbicide was applied, and let Z2 be a vector of size four indicating
are (1,5), (2,6), (3,7), and (4,8). Continued on next page. the herbicide pair. Let Z = (Z1 , Z2 ). For zero doses, Z1 and Z2 may be
defined arbitrarily as any nonzero value. In the absence of measurement
error for doses, and if there were no random variation, the relationship
between response and dose, X, is expected to be
β1 − β0
Y ≈ mY (Z, X, B) = β0 + ½ ¾β4t Z2 . (4.20)
X
1+
β3t Z1

This problem is exactly of the type amenable to analysis by the
transform-both-sides (TBS) methodology of Carroll and Ruppert (1988);
*
see also Ruppert, Carroll, and Cressie (1989). Specifically, model (4.20) is

0.5
* a theoretical model for the data in the absence of any randomness, which,
when fit, shows a pattern of heteroscedasticity. The TBS methodology
0.4 *
* suggests controlling for the heteroscedasticity by transforming both sides
0.3

*
**
of the equation:
* * *
0.2

* * *
* * * *
*
* *
*
** * h(Y, λ) ≈ h {mY (Z, X, B), λ} , (4.21)
* * ** *
* * * * * **
0.1

* * * * *
******* * *** * **
* ** * * ** * *
* * **
*
* ** ** where the transformation family can be arbitrary but is taken here as
** ********** *** *** * ****
* * * **
**** ***
* * ** ** *** * * ** * *
*** ** * * the power transformation family:
0.0

* *** * * *

0.0 0.2 0.4 0.6 0.8 1.0 1.2


h(v, λ) = (v λ − 1)/λ if λ 6= 0;
= log(v) if λ = 0.

Of course, the actual dose applied X may be different from the nom-
0.4

inal dose applied W. It seems reasonable in this context to consider


2
the Berkson error model with mean W and variance σrc W2γ , the het-
0.3

eroscedasticity indicating the perfectly plausible assumption that the


size of the error made depends on the nominal dose applied. With this
0.2

specification, the regression calibration approximation replaces X by W.


Letting Yij be the j th replicate at the ith herbicide–dose combination,
0.1

the TBS-regression calibration model incorporating randomness is


h(Yij , λ) = h {mY (Zi , Wi , B), λ} + ηj + ǫij , (4.22)
0.0

1 2 3 4 5 6
where ǫij is the homoscedastic random effect with variance σ 2 , and ηj
is the fixed block effect. The parameters were fit using maximum likeli-
hood assuming that the errors are normally distributed, as described by
Carroll and Ruppert (1988, Chapter 4). This involves maximizing the
Figure 4.6 Bioassay data. Absolute residual analysis for an ordinary nonlinear loglikelihood
least squares fit. Note the increasing variability for larger predicted values. Ã
2
1 X [h(Yij , λ) − h {mY (Zi , Wi , B), λ} − ηj ]

2 i,j σ2
Model (4.20) is typically referred to as the four-parameter logistic model. !
Physically, the parameters β0 and β1 should be nonnegative, since they +log(σ 2 ) − 2(λ − 1)log(Yij ) .
are the approximate dry weight at infinite and zero doses, respectively.
An initial ordinary nonlinear least squares fit to the data with a fixed
block effect had a negative estimate of β0 . Figure 4.6 displays a plot The estimated transformation, λ b = 0.117, is very near the log trans-
of absolute residuals versus predicted means. Also displayed are box formation. The residual plots are given in Figure 4.7, where we still see
plots of the residuals formed by splitting the data into six equal-sized some unexplained structure to the variability, since the extremes of the
groups ordered on the basis of predicted values. Both figures show that predicted means have smaller variability than the centers.
the residuals are clearly heteroscedastic, with the response variance an To account for the unexplained variability, we now the consider higher-
increasing function of the predicted value. order approximate models (4.16) and (4.17). Denoting the left-hand

* *

1.0
0.6
* *

0.8
* * * *
* *
0.4
* * * * *

0.6
* *
* * ** * * *
*
* * *
** * * * * * * * ** *
**

0.4
* * * * * * * * ** *
** * * * *** * * * * *
0.2

*
* * * * ** * * **
** *
* ** * * * * ** ** * ** ** ** *
* * * * *

0.2
** **
* * * ** ** * * * **** *** ** ** ** * * ** *
* * * * * **
** * * * * * * * * * * * * * * * * * ** ** **
* ** ***
* *** * ** * * * * * *
** ** * * * *
* * * * ** *
* ** * ** * ****** ***
* *
* * ** * *
* *
* * * * * * * * * ** * * ***********

0.0
0.0

* * * ** * * * * * * * **

-2.5 -2.0 -1.5 -1.0 -0.5 0.0 -5 -4 -3 -2 -1 0


0.6

0.8
0.6
0.4

0.4
0.2

0.2
0.0

0.0
1 2 3 4 5 6 1 2 3 4 5 6

Figure 4.7 Bioassay data. Absolute residual analysis for an ordinary Figure 4.8 Bioassay data. Absolute residual analysis for a second-order ap-
transform-both-sides fit. Note the unexplained structure of the variability. proximate transform-both-sides fit.

whereas before ǫij has variance σ 2 . This is a heteroscedastic TBS model,


side of (4.21) by Y∗ and the right-hand side by mY ∗ (·), and noting all of whose parameters are identifiable and hence estimable from the
that the four-parameter logistic model is one in which mY,xx /mY is observed data. The identifiability of parameters in the Berkson model
typically small, the approximate model (4.17)© says that Y∗ has mean is a general phenomenon; see Section 4.9. The likelihood of (4.23) is the
h {mY (Z, W, B)} and variance σ 2 +σrc 2
W2γ (mY )λ−1 (Z, W, B)mY,x (Z, same as before but with σ 2 replaced by
ª2 2
W, B) . If we define κ = σrc /σ 2 , in contrast to (4.22) an approximate h © ª2 i
model for the data is σ 2 1 + κWi2γ (mY )λ−1 (·)mY,x (·) .

h(Yij , λ) = h {mY (·), λ} + ηj (4.23) This model was fit to the data, and λb ≈ −1/3 with an approximate
h © ª 2
i1/2 standard error of 0.12. The corresponding residual plots are given in
+ǫij 1 + κWi2γ (mY )λ−1 (·)mY,x (·) , Figure 4.8. Here we see no real hint of unexplained variability. As a

88 89
further check, we can contrast the models (4.23) and (4.22) by means “too large” (Rosner, Willett, and Spiegelman, 1989; Rosner, Spiegelman,
of a likelihood ratio test, the two extra parameters being (γ, κ). The and Willett, 1990; Whittemore, 1989). Let the binary response Y fol-
likelihood ratio test for the hypothesis that these two parameters equal low the logistic model Pr(Y = 1|Z, X) = H (β0 + βxt X + βzt Z), where
zero had a chi-squared value of over 30 based on two degrees of freedom, H(v) = {1 + exp(−v)}−1 is the logistic distribution function. The key
indicating a large improvement in the fit due to allowing for possible problem is computing the probability of a response Y given (Z, W).
heteroscedasticity in the Berkson error model. For example, suppose that X given (Z, W) is normally distributed with
mean mX (Z, W, γ) and (co)variance function V (Z, W, γ). Let p be the
number of components of X. As described in more detail in Chapter 8,
4.8 Examples of the Approximations
the probability that Y = 1 for values of (Z, W) is
In this section, we investigate the appropriateness of the regression cal- R h i
t
ibration algorithm in a variety of settings. H(·) exp −(1/2) {x − mX (·)} V −1 (·) {x − mX (·)} dx
, (4.24)
(2π)p/2 |V (·)|1/2
4.8.1 Linear Regression where H(·) = H(β0 +βxt x+βzt Z). Formula (4.24) does not have a closed-
form solution; Crouch and Spiegelman (1990) developed a fast algorithm
Consider linear regression when the variance of Y given (Z, X) is con- that they have implemented in FORTRAN: unfortunately, as far as we
stant, so that the mean and variance of Y when given (Z, X) are β0 + know, this algorithm is not in widespread use. Monahan and Stefanski
βxt X+βzt Z and σ 2 , respectively. As an approximation, the regression cal- (1991) described a different method easily applicable to all standard
ibration model says that the observed data also have constant variance computer packages. However, a simple technique often works just as
but have regression function given by E(Y|Z, W) = β0 +βxt mX (Z, W, γ)+ well, namely, to approximate the logistic by the probit. It is well known
βzt Z. Because we assume nondifferential measurement error (Section 2.5), that H(v) ≈ Φ(v/1.7), where Φ(·) is the standard normal distribution
the regression calibration model accurately reproduces the regression function (Johnson and Kotz, 1970; Liang and Liu, 1991; Monahan and
function, but the observed data have a different variance, namely Stefanski, 1991).
var(Y|Z, W) = σ 2 + βxt var(X|Z, W)βx . In Figure 4.9 we plot the density and distribution functions of the
logistic and normal distributions, and the reader will note that the lo-
Note the difference here: The regression calibration model is a working gistic and normal are very similar. With some standard algebra (Carroll,
model for the observed data, which may differ somewhat from the ac- Bailey, Spiegelman, et al., 1984), one can approximate (4.24) by
tual or true model for the observed data. In this case, the regression " #
calibration approximation gives the correct mean function, and the vari- β0 + βxt mX (Z, W, γ) + βzt Z
Pr(Y = 1|Z, W) ≈ H . (4.25)
ance function is also correct and constant if X has a constant covariance {1 + βxt V (Z, W, γ)βx /1.72 }
1/2
matrix given (Z, W).
If, however, X has nonconstant conditional variance, the regression In most cases, the denominator in (4.25) is very nearly 1, and regres-
calibration approximation would suggest the homoscedastic linear model sion calibration is a good approximation; the exception is for “large”
when the variances are heteroscedastic. In this case, while the least βxt V (·)βx . In general, the denominator in (4.25) means that regression
squares estimates would be consistent, the usual standard errors are in- calibration will lead to estimates of the main risk parameters that are
correct. There are three options: (i) use least squares and bootstrap by slightly attenuated.
resampling vectors (Section A.9.2); (ii) use least-squares and the sand- The approximation (4.25) is often remarkably good, even when the
wich method for constructing standard errors (Section A.6); and (iii) true predictor X is rather far from normally distributed. To test this,
expand the model using the methods of Section 4.7. we dropped Z and computed the approximations and exact forms of
pr(Y = 1|W) under the following scenario. For the distribution of X,
we chose either a standard normal distribution or the chi-squared distri-
4.8.2 Logistic Regression
bution with one degree of freedom. The logistic intercept β0 and slope
Regression calibration is also well established in logistic regression, at βx were chosen so that there was a 10% positive response rate (Y = 1)
least as long as the effects of the variable X measured with error are not on average, and so that exp {βx (q90 − q10 )} = 3, where qa is the ath per-

90 91
Comparison of Logistic and Probit Regression
1
Logit
Probit
0.9

0.8

0.7
Density and CDF

0.6

0.5

0.4

0.3

0.2

0.1

0
−4 −3 −2 −1 0 1 2 3 4

Figure 4.9 The standard logistic distribution and density functions compared
to the normal distribution and density functions with standard deviation 1.70.
The point of the graph is to show how close the two are.

Figure 4.10 Values of pr(Y = 1|W) are plotted against W in the solid line,
while the regression calibration approximation is the dotted line. The measure-
ment error is additive on the first row and multiplicative on the second row.
The fact that the lines are nearly indistinguishable is the whole point. See text
centile of the distribution of X. In the terminology of epidemiology, this for more details.
means that the “relative risk” is 3.0 in moving from the 10th to the 90th
percentile of the distribution of X, a representative situation.
4.8.3 Loglinear Mean Models
In Figure 4.10 we plot values of pr(Y = 1|W) against W in the solid
line, for the range from the 5th to the 95th percentile of the distribution As might occur for gamma or lognormal data, suppose E(Y|Z, X) =
2
of W. The regression calibration approximation is the dotted line. The exp(β0 + βxt X + βzt Z) and var(Y|Z, X) = σ 2 {E(Y|Z, X)} . Suppose
measurement error is additive on the first row and multiplicative on the that the calibration of X on (Z, W) has mean mX (Z, W, γ), and denote
second row. The top left plot has W = X + U where (X, U) follow a the moment generating function of the calibration distribution by
bivariate standard normal distribution, while the top right plot differs in © ª © ª
that both follow a chi-squared distribution with one degree of freedom. E exp(at X)|Z, W = exp at mX (Z, W, γ) + v(a, Z, W, γ) ,
The bottom row has W = XU, where U follows a chi-squared distribu-
where v(·) is a general function which differs from distribution to distri-
tion with one degree of freedom; on the left, X is standard normal, while
bution. If (·) = (Z, W, γ), the observed data then follow the model
on the right, X is chi-squared. Note that the solid and dashed lines very
nearly overlap. In all of these cases, the measurement error is very large, © ª
E(Y|Z, W) = exp β0 + βxt mX (·) + βzt Z + v(βx , ·) ;
so in some sense we are displaying a worst case scenario. For these four © ª
very different situations, the regression calibration approximation works var(Y|Z, W) = exp 2β0 + 2βxt mX (·) + 2βzt Z + v(2βx , ·)
£ 2 ¤
very well indeed. × σ + 1 − exp {2v(βx , ·) − v(2βx , ·)} .

92 93
If the calibration distribution for X is normal with constant covari- it is off by a constant, since its intercept β0∗ differs from β0 . Here, how-
ance matrix Σxx , then v(a, ·) = (1/2)at Σxx a. Remarkably, for β0∗ = ever, the approximate expanded mean model (4.16) is exact, and β0 can
β0 + (1/2)βxt Σxx|z,w βx , the observed data also follow the loglinear mean be estimated as long as one has available an estimate of the calibration
model with intercept β0∗ and a new variance parameter σ∗2 . Thus, the variance σ 2 ; see the previous section.
regression calibration approximation is exactly correct for the slope pa- If the error of X about its conditional mean is homoscedastic and
rameters (βx , βz )! The conclusion holds more generally, requiring only symmetrically distributed, for example, normally distributed, then the
that X− mX (Z, W, γ) have distribution independent of (Z, W). expanded regression calibration model accurately reflects the form of the
variance function for the observed data. Details are given in Appendix
B.3.2. If the error is asymmetric, then the expanded model (4.17) misses
4.9 Theoretical Examples
a term involving the third error moment.
4.9.1 Homoscedastic Regression
The simple homoscedastic linear regression model is mY (z, x, B) = β0 + 4.9.3 Loglinear Mean Model
βx x + βz z with g 2 (·) = V 2 (·) = 1. If the variance function (4.15) is
homoscedastic, then the approximate model (4.16)–(4.17) is exact in The loglinear mean model of Section 4.8.3 has E(Y|X) = exp(β0 +βx X),
this case with E(Y|Z, W) = β0 + βx mX (·) + βz Z and var(Y|Z, W) = and variance proportional to the square of the mean with constant of
2 2
σ 2 + σrc βx , that is, a homoscedastic regression model. One sees clearly proportionality σ 2 . If calibration is homoscedastic and normally dis-
that the effect of measurement error is to inflate the error about the tributed,
© the actual mean functionª for the observed data is E(Y|W) =
observed line. exp β0 + (1/2)βx2 σ 2 + βx mX (W) . The mean model of regression cali-
In simple linear regression satisfying a Berkson error model with pos- bration is exp {β0 + βx mX (W)}. Regression calibration yields a consis-
sibly heteroscedastic calibration variances σrc 2
W2γ , the approximations tent estimate of the slope βx but not of the intercept.
are©again exact: E(Y|Z, ª W) = β 0 + β x W + βz Z and var(Y|Z, W) = In this problem, the range-preserving expanded regression calibration
σ 2 1 + βx2 (σrc2
/σ 2 )W2γ . The reader will recognize this as a QVF model, model (4.18) correctly captures the mean of the observed data. Inter-
2
where the parameter θ = (γ, κ = σrc /σ 2 ). As long as γ 6= 0, all the pa- estingly, it also captures the essential feature of the variance function,
rameters are estimable by standard QVF techniques, without recourse since both the actual and approximate variance functions (4.19) are a
to validation or replication data. constant times exp {2β0 + 2βx mX (W)}.
This problem is an example of a remarkable fact, namely that in Berk-
son error problems, the approximations (4.16) and (4.17) often lead to an
identifiable model, so that the parameters can all be estimated without Bibliographic Notes and Software
recourse to validation data. Of course, if one does indeed have valida-
This regression calibration algorithm was suggested as a general ap-
tion data, then they can be used to improve upon the approximate QVF
proach by Carroll and Stefanski (1990) and Gleser (1990). Prentice
estimators.
(1982) pioneered the idea for the proportional hazard model, where it is
still the default option, and a modification of it has been suggested for
4.9.2 Quadratic Regression with Homoscedastic Regression Calibration this topic by Clayton (1991); see Chapter 14. Armstrong (1985) suggests
regression calibration for generalized linear models, and Fuller (1987,
Ordinary quadratic regression has mean function E(Y|X) = β0 +βx,1 X+ pp. 261–262) briefly mentioned the idea. Rosner, Willett and Spiegel-
βx,2 X2 . With homoscedastic regression calibration, the observed data man (1989) and Rosner, Spiegelman, and Willett (1990) developed the
have mean function idea for logistic regression into a methodology particularly popular in
E(Y|W) = (β0 + βx,2 σ 2 ) + βx,1 mX (W) + βx,2 m2X (W) epidemiology.
There is a long history of approximately consistent estimates in non-
= β0∗ + βx,1 mX (W) + βx,2 m2X (W).
linear problems, of which regression calibration and the SIMEX method
As illustrated in Section 4.8.3, the regression calibration model accu- (Chapter 5) and are the most recent such methods. Readers should
rately reflects the observed data in terms of the slope parameters, but also consult Stefanski and Carroll (1985), Stefanski (1985), Amemiya

94 95
and Fuller (1988), Amemiya (1985, 1990a, 1990b), and Whittemore and
Keller (1988) for other approaches.
Stata (http://www.stata.com/merror) has code for regression calibra-
tion and SIMEX (see next chapter) for generalized linear models. The
programs allow for known measurement error variance, measurement
error variance estimated by replications, bootstrapping, etc. A detailed
example using the Framingham data along with the data are at the book
Web site:
http://www.stat.tamu.edu/∼carroll/eiv.SecondEdition/statacode.php.

96
CHAPTER 5

SIMULATION EXTRAPOLATION

5.1 Overview

In this chapter we describe a measurement error bias-correction method


that shares the simplicity, generality, and approximate-inference charac-
teristics of regression calibration. As the previous chapter indicated, re-
gression calibration is ideally suited for problems in which the calibration
function E(X | W) can be estimated reasonably well and to problems
such as generalized linear models. Simulation extrapolation (SIMEX) is
ideally suited to problems with additive measurement error, and more
generally to any problem in which the measurement error generating
process can be imitated on a computer via Monte Carlo methods.
SIMEX is a simulation-based method of estimating and reducing bias
due to measurement error. SIMEX estimates are obtained by adding
additional measurement error to the data in a resampling-like stage, es-
tablishing a trend of measurement error–induced bias versus the variance
of the added measurement error, and extrapolating this trend back to the
case of no measurement error. The technique was proposed by Cook and
Stefanski (1994) and further developed by Stefanski and Cook (1995),
Carroll, Küchenhoff, Lombard, and Stefanski (1996), Devanarayan (1996),
Carroll and Stefanski (1997), and Devanarayan and Stefanski (2002).
SIMEX is closely related to the Monte Carlo corrected score (MCCS)
method described in Chapter 7, and the interested reader may want
to read the present chapter and the MCCS material in Chapter 7 in
combination.
The fact that measurement error in a predictor variable induces bias in
parameter estimates is counterintuitive to many people. An integral com-
ponent of SIMEX is a self-contained simulation study resulting in graph-
ical displays that illustrate the effect of measurement error on parameter
estimates and the need for bias correction. The graphical displays are
useful when it is necessary to motivate or explain a measurement error
model analysis.
SIMEX is very general in the sense that the bias due to measurement
error in almost any estimator of almost any parameter is readily esti-
mated and corrected, at least approximately. In the absence of measure-
ment error, it is often the case that competing estimators are available

97
that are consistent for the same parameter, and only differ asymptoti-
cally with respect to sampling variability. However, these same estima-
1.2
tors can be differentially affected by measurement error. In Section 5.3.1
we present such an illustrative example and show how SIMEX clearly b
Θ
reveals the differences in biases. 1.0
simex

The key features of the SIMEX algorithm are described in the con-
text of linear regression in the following section. Detailed descriptions
of the method for different measurement error models are then given, 0.8

along with an illustrative application to regression through the origin

Θ(ζ)
using weighted least squares estimation. Next, the SIMEX method is

b
0.6
illustrated for different measurement error models using data from the b
Θ naive
Framingham Heart Study. The first four sections are sufficient for the •

reader wanting a working knowledge of the SIMEX method. Following 0.4 •
PSfrag replacements •
the Framingham example, theoretical aspects of SIMEX estimation are •

also described. • •
0.2

-1.0 0.0 1.0 2.0


5.2 Simulation Extrapolation Heuristics ζ

5.2.1 SIMEX in Simple Linear Regression Figure 5.1 A generic SIMEX plot showing the effect on a statistic of adding
measurement error with variance ζσu2 to the data when estimating a parameter
Θ. The abscissa (x-axis) is ζ, and the ordinate (y-axis) is the estimated coeffi-
We describe the basic idea of SIMEX in the context of simple linear
cient. The SIMEX estimate is an extrapolation to ζ = −1. The naive estimate
regression with additive measurement error. In Section 5.3 we show how
occurs at ζ = 0.
to extend SIMEX to nonadditive models and provide additional exam-
ples. Suppose that Y = β0 + βx X + ǫ, with additive measurement error
W = X + U, where U is independent of (Y, X) and has mean zero and simulation experiments in which the level of the measurement error, as
variance σu2 . The ordinary least squares estimate of βx , denoted βbx,naive , measured by its variance, is intentionally varied.
consistently estimates βx σx2 /(σx2 + σu2 ) (Chapter 3) and thus is biased To this end, suppose that in addition to the original data used to
for βx when σu2 > 0. For this simple model, the effect of measurement calculate βbx,naive , there are M − 1 additional data sets available, each
error on the least squares estimator is easily determined mathematically, with successively larger measurement error variances, say (1 + ζm )σu2 ,
and simple method-of-moments bias corrections are known (Chapter 3; where 0 = ζ1 < ζ2 < · · · < ζM are known. The least squares estimate
Fuller, 1987). Thus, in practice, simple linear regression typically would of slope from the mth data set, βbx,m , consistently estimates βx σx2 /{σx2 +
not be a candidate for SIMEX analysis. However, we use it here to (1 + ζm )σu2 } (Chapter 3; Fuller, 1987).
show that SIMEX provides essentially the same bias corrections as the We can formulate this setup as a nonlinear regression model, with data
method-of-moments. {(ζm , βbx,m ), m = 1, . . . , M }, dependent variable βbx,m , and independent
The key idea underlying SIMEX is the fact that the effect of measure- variable ζm . Asymptotically, the mean function of this regression has the
ment error on an estimator can be determined experimentally via simula- form
tion. In a study of the effect of radiation exposure on tumor development βx σx2
in rats, one is naturally led to an experiment in which radiation dose is E(βbx,m | ζ) = G(ζ) = , ζ ≥ 0.
σx2 + (1 + ζ)σu2
varied. Similarly, in a study of the biasing effects of measurement error
on an estimator, one is naturally led to an experiment in which the level Note that G(−1) = βx . That is, the parameter of interest is obtained
of measurement error is varied. So if we regard measurement error as a from G(ζ) by extrapolation to ζ = −1. The significance of ζ = −1 will
factor whose influence on an estimator is to be determined, we consider become apparent later in this chapter and again in Chapter 7. Heuristi-

98 99
cally, it suffices to see that the measurement error variances in our data 5.3.1.1 Homoscedastic Errors with Known Error Variance
sets are equal to (1 + ζm )σu2 . Ideally, we would like error-free data sets,
While SIMEX is a general methodology, it is easiest to understand
and in terms of ζm this corresponds to having (1 + ζm )σu2 = 0, and thus
when there is only a single, scalar predictor X subject to additive error,
ζm = −1.
though there could be multiple covariates Z measured without error,
SIMEX imitates the procedure described above, as illustrated schemat-
and Wi = Xi + Ui , where Ui is a normal random variable with variance
ically in Figure 5.1.
σu2 , and is independent of Xi , Zi and Yi . Typically, minor violations of
• In the simulation step, additional independent measurement errors the assumption of normality of the measurement errors is not critical in
with variance ζm σu2 are generated and added to the original W data, practice. We assume that the measurement error variance, σu2 , is known
thereby creating data sets with successively larger measurement error or sufficiently well estimated to regard as known.
variances. For the mth data set, the total measurement error variance SIMEX, like regression calibration, is applicable to general estimation
is σu2 + ζm σu2 = (1 + ζm )σu2 . methods, for example, least-squares, maximum likelihood, quasilikeli-
• Next, estimates are obtained from each of the generated contaminated hood, etc. In this section, we will not distinguish among the methods,
data sets. but instead will refer to “the estimator” to mean the chosen estima-
• The simulation and estimation steps are repeated a large number of tion method computed as if there were no measurement error. We let Θ
times, and the average value of the estimate for each level of contam- denote the parameter of interest.
ination is calculated. These averages are plotted against the ζ values The first part of the algorithm is the simulation step. As described
and a regression technique, for example, nonlinear least squares, is above, this involves using simulation to create additional data sets of
used to fit an extrapolant function to the averaged, error-contaminated increasingly larger measurement error (1 + ζ)σu2 . For any ζ ≥ 0, define
estimates. See Section 5.3.2 for a discussion of extrapolation. p
Wb,i (ζ) = Wi + ζ Ub,i , i = 1, . . . , n, b = 1, . . . , B, (5.1)
• Extrapolation to the ideal case of no measurement error (ζ = −1)
yields the SIMEX estimate. where the computer-generated pseudo errors, {Ub,i }ni=1 , are mutually
independent, independent of all the observed data, and identically dis-
tributed, normal random variables with mean 0 and variance σu2 . We call
5.3 The SIMEX Algorithm Wb,i (ζ) a remeasurement of Wi , because it is a measurement of Wi in
5.3.1 Simulation and Extrapolation Steps the same statistical sense that Wi is a measurement of Xi .
Note that var(Wi |Xi ) = σu2 , whereas
We now explain the SIMEX algorithm in detail for four combinations of
error models and measured data. In the first, a single measurement for var{Wb,i (ζ)|Xi } = (1 + ζ)σu2 = (1 + ζ)var(Wi |Xi ). (5.2)
each case is available and the measurement errors are homoscedastic with
The error variance in the remeasured data has been inflated by a mul-
a known, or independently estimated, variance. In the second, a single
tiplicative factor, (1 + ζ) in this case, that equals zero when ζ = −1.
measurement for each case is available and the measurement errors are
Because E{Wb,i (ζ)|Xi } = Xi , (5.2) implies that the mean squared
heteroscedastic with known variances. In the third case, replicate mea-
error of Wb,i as a measurement of Xi defined as MSE{Wb,i (ζ)} =
surements are assumed but no additional assumptions are made about
E[{Wb,i (ζ) − Xi }2 |Xi ] converges to zero as ζ → −1. This is the key
the error variances, that is, it is not assumed that they are known and
property of the simulated pseudo data, or remeasured data.
they could be homoscedastic or heteroscedastic. In the fourth case, we
Having generated the remeasured predictors, we compute the corre-
show how the method generalizes to certain multiplicative error models b b (ζ) to be the estimator when the
sponding naive estimates. Define Θ
and give some illustrative examples. n
{Wb,i (ζ)}1 are used, and define the average of these estimators as
We do not discuss the extrapolation step in detail in any of the four
cases that follow. Once a functional form is selected, fitting the extrap- b PB b
Θ(ζ) = B −1 b=1 Θ b (ζ). (5.3)
olant function and extrapolating are routine applications of linear or
b
nonlinear regression, using ζ as the independent variable and Θ(ζ) given b
By design, Θ(ζ) is the sample mean of {Θb b (ζ)}B , and hence is the aver-
1
below in equation (5.3) as the dependent variable. However, the choice age of the estimates obtained from a large number of experiments with
of functional form is important, and we discuss that in Section 5.3.2. the same amount of measurement error. The reason for averaging over

100 101
many simulations is that we are interested in estimating the extra bias b
The averaged naive estimates, Θ(ζ), are calculated in exactly the same
due to added measurement error, not in inducing more variability, and way as for the case of the homoscedastic error model. The SIMEX es-
averaging reduces the Monte Carlo simulation variation. It is the points b simex , is again obtained by modeling and extrapolation to
timator, Θ
b m )}M that are plotted as filled circles in Figure 5.1. This is the
{ζm , Θ(ζ 2
ζ = −1, as this is the value of ζ for which (1 + ζ)σu,i = 0 for all i.
1
simulation component of SIMEX.
b
Note that the components of Θ(ζ) are all functions of the same scalar 5.3.1.3 Heteroscedastic Errors with Unknown Variances and Replicate
ζ, and there is a separate extrapolation step for each component of Measurements
b
Θ(ζ). The extrapolation step entails modeling each of the components
b
of Θ(ζ) as functions of ζ for ζ ≥ 0 and extrapolating the fitted models We now consider an error model that allows for arbitrary unknown het-
back to ζ = −1. The vector of extrapolated values yields the simu- eroscedastic error variances. SIMEX estimation for this model was devel-
lation extrapolation estimator denoted Θ b simex . In Figure 5.1, the ex- oped and studied by Devanarayan (1996) and Devanarayan and Stefan-
trapolation is indicated by the dashed line and the SIMEX estimate ski (2002). For this model ki ≥ 2 replicate measurements are necessary
2
is plotted as a cross. Heuristically, the significance of ζ = −1 follows for each subject in order to identify the error variances σu,i . The as-
b
from the fact that Θ(ζ) is calculated from measurements having vari- sumed error model is Wi,j = Xi + Ui,j , where Ui,j , j = 1, . . . , ki , are
2 2
ance var{Wb,i (ζ)|Xi } = (1 + ζ)σu2 , and we want to extrapolate to the Normal(0, σu,i ), independent of Xi , Zi and Yi with all σu,i unknown.
case in which the error variance in the measurements is zero, that is, With replicate measurements, the best measurement of Xi is the mean
(1 + ζ)σu2 = 0, or equivalently ζ = −1. Note that although we cannot Wi,. , and we define the so-called naive estimation procedure as doing
add measurement error with negative variance, ζσu2 = −σu2 , we can add the usual, nonmeasurement error analysis, of the data (Yi , Zi , Wi,. )n1 .
error with positive variance, determine the form of the bias as a func- 2
Because the variances, σu,i , are unknown, we cannot generate remea-
tion of ζ, and extrapolate to the hypothetical case of adding negative sured data as in (5.4). However, recall that the key property of the
variance (ζ = −1). remeasured data is that the variance of the best measurement of Xi is
inflated by the factor 1 + ζ. With replicate measurements, we can obtain
5.3.1.2 Heteroscedastic Errors with Known Error Variances such variance-inflated measurements by taking suboptimal linear combi-
Suppose now that Wi = Xi + Ui , where Ui is a normal random variable nations of the replicate measurements. This is done using random linear
with variance σu,i2 2
, is independent of Xi , Zi and Yi , and σu,i is known. contrasts.
t
This is not a common error model, but it provides a useful stepping PSuppose that cb,i P= (c
2
b,i,1 , . . . , cb,i,ki ) is a normalized contrast vector,
stone to other, more common heteroscedastic error models. In addition, c
j b,i,j = 0 and c
j b,i,j = 1. Define
it is appropriate when Wi is the mean of ki ≥ 1 replicate measurements, Pk i
each having known variance σu2 , in which case σu,i2
= σu2 /ki . Wb,i (ζ) = Wi,· + (ζ/ki )1/2 j=1 cb,i,j Wi,j , (5.6)
In this case, the only change in the algorithm is that the remeasure-
for i = 1, . . . , n, b = 1, . . . , B. With this definition, a little calculation
ment procedure in (5.1) is replaced by
p indicates that E{Wb,i (ζ)|Xi } = Xi and
Wb,i (ζ) = Wi + ζ Ub,i , i = 1, . . . , n, b = 1, . . . , B, (5.4) 2
var{Wb,i (ζ)|Xi } = (1 + ζ)σu,i /ki = (1 + ζ)var(Wi,· |Xi ). (5.7)
where the pseudo errors, {Ub,i }ni=1 , are again mutually independent,
independent of all the observed data, and identically distributed, normal Thus the remeasurements Wb,i (ζ) from (5.6) have the same key prop-
2 erties as the remeasurements in (5.1) and (5.4), that is, the variances of
random variables with mean 0 and variance σu,i . Note that
2 the error in the remeasurements are inflated by a multiplicative factor
var{Wb,i (ζ)|Xi } = (1 + ζ)σu,i = (1 + ζ)var(Wi |Xi ), (5.5) that vanishes when ζ = −1, and MSE{Wb,i (ζ)} → 0 as ζ → −1.
and E{Wb,i (ζ)|Xi } = Xi . So just as in the preceding case, we see that Because we want to average over B remeasured data sets, we need
the two variances, var(Wb,i (ζ)|Xi ) and var(Wi |Xi ), differ by a mul- a way to generate random, replicate versions of (5.6). We do this by
tiplicative factor that vanishes when ζ = −1, and consequently that making the contrasts random. We get statistical replicates of Wb,i (ζ)
MSE{Wb,i (ζ)} = E[{Wb,i (ζ) − Xi }2 |Xi ] → 0 as ζ → −1, the key prop- by sampling cb,i uniformly from the set of all normalized contrast vec-
erty of the remeasured data. tors of dimension ki . This is easily accomplished using pseudorandom

102 103
Normal(0, 1) random variables. If Zb,i,1 , . . . , Zb,i,ki are Normal(0, 1), then However, because the error model is not unbiased on the natural scale
2
(E(Wi |Xi ) = Xi eσu /2 6= Xi ), the relevant measure is mean squared
Zb,i,j − Z b,i,·
cb,i,j = qP ¢2 , (5.8) error, not variance. Tedious but routine calculations show that, for the
ki ¡
j=1 Z b,i,j − Z b,i,· multiplicative model,
P P
are such that j cb,i,j = 0 and j c2b,i,j = 1. Furthermore, the random MSE{Wb,i (ζ)|Xi } = c(ζ, σu2 ) MSE{Wi |Xi }, (5.10)
contrast vector cb,i = (cb,i,1 , . . . , cb,i,ki )t is uniformly distributed on the
where
set of all normalized contrast vectors of dimension ki (Devanarayan and
2 2 2
Stefanski, 2002). {eσu (1+ζ) − 1}2 + eσu (1+ζ) {eσu (1+ζ) − 1}
b c(ζ, σu2 ) = 2 2 2 ,
The averaged naive estimates, Θ(ζ), are calculated in exactly the same {eσu − 1}2 + eσu {eσu − 1}
way as for the previous two cases. Also, the SIMEX estimator, Θ b simex ,
b is such that c(0, σu2 ) = 1, c(ζ, σu2 ) is increasing in ζ > 0 for all σu2 , and
is again obtained by modeling the relationship between Θ(ζ) and ζ,
limζ→−1 c(ζ, σu2 ) = 0. Thus (5.10) is the biased error model counterpart
and extrapolating to ζ = −1. Because this version of SIMEX generates
of (5.2), (5.5), and (5.7).
pseudo errors from the observed data (via the random contrasts), we
In the multiplicative model used above, Wi is biased for Xi because
call it empirical SIMEX to distinguish it from versions of SIMEX that
Ui is assumed to have a mean of 0. An alternative assumption is that
generate pseudo errors from a parametric normal model, for example,
Wi is unbiased for Xi , which requires that E(Ui ) = −σu2 /2. This as-
the Normal(0, σu2 ) model.
sumption was used in Section 4.5.2. Either assumption is plausible but,
5.3.1.4 Nonadditive Measurement Error unfortunately, neither can be checked without validation data. If one is
certain that E(Ui ) = 0, then one might divide Wi by exp(σu2 /2) to get
Thus far, we have described the SIMEX algorithm for additive measure- a surrogate that is unbiased. However, we did not do this here because,
ment error models. However, SIMEX applies more generally and is often by definition, the naive analysis is to leave Wi unchanged.
easily extended to other error models (Eckert, Carroll, and Wang, 1997). For more general, nonmultiplicative error models, suppose that we
For example, consider multiplicative error. Taking logarithms trans- can transform W to an additive model by a transformation H, so that
forms the multiplicative model to the additive model, but as discussed H(W) = H(X) + U. This is an example of the transform-both-sides
in Section 4.5, some investigators feel that the most appropriate predic- model; see (4.21). If H has an inverse function G, then the simulation
tor of Y is X on the original, not log, scale. In regression calibration, step generates
multiplicative error is handled in special ways; see Section 4.5. SIMEX n p o
works more naturally, in that one performs the simulation step (5.1) on Wb,i (ζ) = G H(Wi ) + ζUb,i .
the logarithm of W, and not on W itself. To see this, suppose that the
observed data error model is In the multiplicative model, H = log and G = exp. A standard class of
transformation models is the power family discussed in Section 4.7.3. If
log(Wi ) = log(Xi ) + Ui , replicate measurements are available, one can also investigate the ap-
where Ui are Normal(0, σu2 ). The remeasured data are obtained as propriateness of different transformations; see Section 1.7 for a detailed
p discussion. As mentioned there, after transformation the standard devi-
log{Wb,i (ζ)} = log(Wi ) + ζUb,i , ation of the intraindividual replicates should be uncorrelated with their
where Ub,i are Normal(0, σu2 ) pseudorandom variables. Note that upon mean, and one can find the power transformation which makes the two
transformation uncorrelated.
p We now present a simple, yet instructive example with multiplicative
Wb,i (ζ) = exp{log(Wi ) + ζUb,i }. (5.9)
measurement error. In addition to illustrating the SIMEX method in a
In the previous three examples, the key property of the remeasured nonadditive error model, the example also shows that estimators of the
data was the fact that variance was increased by the multiplicative factor same parameter can be differentially affected by measurement error, and
1 + ζ — see equations (5.2), (5.5) and (5.7) — and that this multiplier that SIMEX provides insight into the differential sensitivity of estimators
vanishes when ζ = −1. The multiplicative model has a similar property. to measurement error.

104 105
The true-data model is regression through the origin,
Yi = βXi + ǫi
where the equation errors have mean zero and finite variances, and the
error model for the observed data is additive on the log scale
log(Wi ) = log(Xi ) + Ui , (5.11)
where Ui are Normal(0, σu2 ) with the error variance assumed known.
We consider five weighted least squares estimators with weights pro-
portional to powers of the predictor. The true-data estimators considered
are
Pn 1−p
b 1 Yi Xi
β(p) = P n 2−p , (5.12)
1 Xi

for p = 0 (ordinary least squares), p = 1/2, p = 1 (ratio estimation),


p = 3/2, and p = 2 (mean of ratios). The corresponding naive estimators,
βb(p),naive , are obtained by replacing Xi with Wi in (5.12). In the absence
of measurement error, all five estimators are unbiased, and the choice
among them would be made on the basis of efficiency, as dictated by the
assumed or modeled heteroscedasticity in ǫi .
We generated a data set of size n = 100 from the regression-through-
the-origin model with the ǫi independent and identically distributed
Normal(0, σǫ2 ), and the predictors X1 , . .√. , Xn distributed as a shifted
and scaled chi-squared, Xi = (χ25 + 1)/ 46 where E(X2i ) = 1, σǫ2 =
0.125, σu2 = 0.25, and β = 1.0. Then we applied the multiplicative error
model SIMEX procedure (5.9) for each of the five estimators (5.12).
The error-free data pairs (Xi , Yi ) are plotted in the top-left panel
of Figure 5.2, and observed data pairs (Wi , Yi ) in the top-right panel.
The lower-left panel displays an overlay of the points generated in the
SIMEX simulation step (B = 500) for each of the five estimators. A cor-
responding overlay of the SIMEX extrapolations appears in the lower-
right panel. Quadratic extrapolant functions were used. The five recog-
nizable point plots in the lower left panel and the five curves in the lower
right panel, in order lower to upper, correspond to the estimators with
p = 0, 1/2, 1, 3/2 and 2. Figure 5.2 Regression through the origin: weighted least squares estimation
with multiplicative measurement error. Top left, true data; top right, observed
Note that although the five estimators are differentially affected by
data; bottom left, βb(p) (ζ) estimates calculated in the simulation step (B = 500);
measurement error, the SIMEX estimator for each is corrected appro- bottom right, extrapolation with quadratic extrapolant; bottom two plots, p =
priately, as evidenced by the clustering of the extrapolations to ζ = −1 0, 1/2, 1, 3/2, 2, lower to upper.
around the true parameter values β = 1 in the lower-right panel. In this
example, the simple quadratic extrapolant adequately adjusts for bias.
Figure 5.2 indicates that measurement error attenuates the weighted
least squares estimators (5.12) with p = 0, 1/2, and 1 (decreasing
curves); expands the estimator (bias away from zero) with p = 2 (in-

106 107
creasing curve); and has no biasing effects on the estimator with p = 1.5 cause there are certain regression models for which the asymptotic func-
(horizontal curve). Remember that all of the estimators are consistent tional forms are known, and these provide good approximate extrapolant
with error-free data. Thus, this example shows that bias can depend functions for use in other models, SIMEX remains an attractive alterna-
on the method of estimation, in addition to showing that expansion is tive to MCCS.
possible. In Section 5.3.4.1, we will define what we mean by non-iid pseudo
For this simple model, we can do the mathematics to explain the errors. In multiple linear regression with these non-iid pseudo errors,
apparent trends in Figure 5.2. Using properties of the normal distri- the extrapolant function,
bution moment generating function, one can show that as n → ∞,
γ2 γ1 γ3 + γ2 + γ1 ζ
βb(p),Naive → β(p) where GRL (ζ, Γ) = γ1 + = , (5.14)
γ3 + ζ γ3 + ζ
β(p) = βexp{(2p − 3)σu2 /2}. (5.13)
where Γ = (γ1 , γ2 , γ3 )t , reproduces the usual method-of-moments esti-
The exponent in (5.13) is negative when p < 1.5 (attenuation), positive mators; see Section 5.5.1. Because GRL (ζ, Γ) is a ratio of two linear
when p > 1.5 (expansion), and equal to zero when p = 1.5 (no bias). functions we call it the rational linear extrapolant.
Thus, among the class of weighted least squares estimators (5.12), there SIMEX can be automated in the sense that GRL (ζ, Γ) can be employed
is one estimator that is robust to measurement error: βb(p) with p = 1.5. to the exclusion of other functional forms. However, this is not recom-
A point we want to emphasize is that the estimators from the SIMEX mended, especially in new situations where the effects of measurement
extrapolation step revealed the robustness of βb(1.5) quite convincingly, error are not reasonably well understood. For one thing, as described
and it can do so for more complicated estimators for which mathematical below and seen in Küchenhoff and Carroll (1995), sometimes the ra-
analysis is intractable. Huang, Stefanski, and Davidian (2006) presented tional linear extrapolant has wild behavior. SIMEX is a technique for
methods for testing the robustness of estimators to measurement error studying the effects of measurement error in statistical models and ap-
using estimates from the SIMEX simulation step. An overview of their proximating the bias due to measurement error. The extrapolation step
method is given in Section 5.6.3. should be approached as any other modeling problem, with attention
Finally, we note that the measurement error robustness of the weighted paid to adequacy of the extrapolant based on theoretical considerations,
least squares estimator with p = 1.5 depends critically on the assumed residual analysis, and possibly the use of linearizing transformations. Of
error model (5.11). Had we started with an error model for which W is course, extrapolation is risky in general even when model diagnostics fail
unbiased for X on the untransformed scale, that is, E(W|X) = X, then to indicate problems, and this should be kept in mind.
it is readily seen the usual ratio estimator (p = 1) is consistent for β. In many problems of interest the magnitude of the measurement error
variance, σu2 , is such that the curvature in the best or “true” extrapolant
function is slight and is adequately modeled by either GRL (ζ, Γ) or the
5.3.2 Extrapolant Function Considerations simple quadratic extrapolant,
It follows from the results in Stefanski and Cook (1995) that, under GQ (ζ, Γ) = γ1 + γ2 ζ + γ3 ζ 2 . (5.15)
fairly general conditions, asymptotically there is a function of ζ, that,
when extrapolated to ζ = −1, the true parameter is obtained. However, An advantage of the quadratic extrapolant is that it is often numer-
this function is seldom known, so it is usually estimated by one of a few ically more stable than GRL (ζ, Γ). Instability of the rational linear ex-
simple functional forms. This is what makes SIMEX an approximate trapolant can occur, for example, when the effects of measurement error
method in practice. on a parameter are negligible and a constant, or nearly constant, extrap-
As mentioned previously, SIMEX is closely related to the Monte Carlo olant function is required. Such situations arise, for example, with the
corrected score (MCCS) method described in Chapter 7. In fact, MCCS coefficient of an error-free covariate Z that is uncorrelated with W. In
is an asymptotically exact-extrapolant version of SIMEX for models sat- this case, in (5.14) γ2 ≈ 0 and γ3 is nearly unidentifiable. In cases where
isfying certain smoothness conditions. In other words, MCCS is a version GRL (ζ, Γ) is used to model a nearly horizontal line, γ b1 and γb2 are well
of SIMEX that avoids extrapolation functions. However, MCCS is both determined, but γ b3 is not. Problems arise when 0 < γ b3 < 1, for then the
mathematically and computationally more involved, whereas SIMEX fitted model has a singularity in the range of extrapolation [−1, 0). The
only requires repeated application of the naive estimation method. Be- problem is easily solved by fitting GQ (ζ, Γ) in these cases. The quadratic

108 109
extrapolant typically results in conservative corrections for attenuation; tation is extremely efficient, and bootstrap standard errors for SIMEX
however, the increase in bias is often offset by a reduction in variability. take place in fast (clock) time. Even with our own implementation, most
Of course, problems with the rational linear extrapolant need not be bootstrap applications take place in a reasonable (clock) time.
confined to situations as just described. Asymptotic covariance estimation methods based on the sandwich es-
Simulation evidence and our experience with applications thus far timator are described in Section B.4.2. These are easy to implement
suggest that the extrapolant be fit for ζ in the range [0, ζmax ], where in specific applications but require additional programming. However,
1 ≤ ζmax ≤ 2. We denote the grid of ζ values employed by Λ, that is, when σu2 is known or nearly so, the SIMEX calculations themselves ad-
Λ = (ζ1 , ζ2 , . . . , ζM ), where typically ζ1 = 0 and ζM = ζmax . mit a simple standard error estimator. Here, we consider only the case of
The quadratic extrapolant is a linear model and thus is easily fit. The homoscedastic measurement error. For the case of heteroscedastic error
rational linear extrapolant generally requires a nonlinear least squares and empirical SIMEX, see Devanarayan (1996).
program to fit the model. However, it is possible to obtain exact analytic Let τbb2 (ζ) be any variance estimator attached to Θ b b (ζ), for example,
fits to three points, and this provides a means of obtaining good starting the sandwich estimator or the inverse of the information matrix, and
values. let τb2 (ζ) be their average for b = 1, . . . , B. Let s2∆ (ζ) be the sample
Let ζ0∗ < ζ1∗ < ζ2∗ and define dij = ai − aj , 0 ≤ i < j ≤ 2. Then fitting covariance matrix of the terms Θ b b (ζ) for b = 1, . . . , B. Then, as shown
GRL (ζ, Γ) to the points {aj , ζj∗ }20 results in parameter estimates in Section B.4.1, variance estimates for the SIMEX estimator can be
d12 ζ2∗ (ζ1∗ − ζ0∗ ) − ζ0∗ d01 (ζ2∗ − ζ1∗ ) obtained by extrapolating the components of the differences, τb2 (ζ) −
γ
b3 = s2∆ (ζ), to ζ = −1. When τb2 (ζ) is estimated by the Fisher information
d01 (ζ2∗ − ζ1∗ ) − d12 (ζ1∗ − ζ0∗ )
matrix or sandwich formula, then the extrapolant is called the SIMEX
γ3 + ζ1∗ )(b
d12 (b γ3 + ζ2∗ )
γ
b2 = Information or SIMEX Sandwich variance estimator, respectively.
ζ2 − ζ1∗

γ
b2
γ
b1 = a0 − . 5.3.4 Extensions and Refinements
b3 + ζ0∗
γ
An algorithm we employ successfully to obtain starting values for fit- 5.3.4.1 Modifications of the Simulation Step
b m )}M , where
ting GRL (ζ, Γ) starts by fitting a quadratic model to {ζm , θ(ζ 1 There is a simple modification to the simulation step that is sometimes
the ζm are equally spaced over [0, ζmax ]. Initial parameter estimates for
useful. As described above, the pseudo errors are generated indepen-
fitting GRL (ζ, Γ) are obtained from a three-point fit to (b aj , ζj∗ )20 , where
dently of (Yi , Zi , Wi )n1 as Normal(0, σu2 ) random variables. The Monte
ζ0∗ = 0, ζ1∗ = ζmax /2, ζ2∗ = ζmax and b
aj is the predicted value correspond- b
Carlo variance in Θ(ζ) can be reduced by the use of pseudo errors con-
ing to ζj∗ from the fitted quadratic model. In our experience, initial values
strained so that for each fixed b, Pthe sequence (Ub,i )ni=1 has mean zero,
obtained in this fashion are generally very good and frequently differ in- n
population variance σu , that is, i=1 U2b,i = nσu2 , and its sample correla-
2
significantly from the fully iterated, nonlinear least squares parameter
tions with (Yi , Zi , Wi )n1 are all zero. We call pseudo errors constrained
estimates.
in this manner non-iid pseudo errors. In some simple models, such as
linear regression, the Monte Carlo variance is reduced to zero by the use
5.3.3 SIMEX Standard Errors of non-iid pseudo errors.
Inference for SIMEX estimators can be performed either via the boot- The non-iid pseudo errors are generated by first generating indepen-
strap or the theory of M-estimators (Section A.6), in particular by means dent standard normal pseudo errors (U∗b,i )n1 . Next, fit a linear regression
of the sandwich estimator. Because of the computational burden of the model of the pseudo errors on (Yi , Zi , Wi )n1 , including an intercept. The
SIMEX estimator, the bootstrap requires considerably more computing non-iid pseudo errors are obtained by multiplying the residuals from this
time than do other methods. Without efficient implementation of the regression by the constant
estimation scheme at each step, even with current computing resources £ ¤1/2
c = nσu2 /{(n − p − 1) MSE } ,
the SIMEX bootstrap may take an inconveniently long (clock) time to
compute. In STATA’s implementation of generalized linear models with where MSE is the usual linear regression mean squared error, and p is
measurement error (see http://www.stata.com/merror), the implemen- the dimension of (Y, Zt , Wt )t .

110 111
5.3.4.2 Estimating the Measurement Error Variance of CHD. Predictors employed in this example are the patient’s age at
Exam #2, smoking status at Exam #1, serum cholesterol at Exam #3,
When the measurement error variance σu2 is unknown, it must be es-
and systolic blood pressure (SBP) at Exam #3, the last is the average
timated with auxiliary data, as described in Chapter 4; see especially
of two measurements taken by different examiners during the same visit.
(4.3). The estimate is then substituted for σu2 in the SIMEX algorithm,
In order to illustrate the various SIMEX methods we do multiple anal-
and standard errors are calculated as described in Section B.4.2.
yses. In the first set of analyses, we treat serum cholesterol as error free,
so that the only predictor measured with error is SBP. In these anal-
5.3.5 Multiple Covariates with Measurement Error yses, the error-free covariates Z, are age, smoking status, and serum
cholesterol. For W, we employ a modified version of a transformation
So far, it has been assumed that X is scalar. For the case of a multivariate originally due to Cornfield and discussed by Carroll, Spiegelman, Lan,
X, only a minor change is needed. Suppose that Wi = Xi + Ui and Ui et al. (1984), setting W = log(SBP − 50). Implicitly, we are defining
is Normal(0, Σu ), that is, Ui is multivariate normal with mean zero and X as the long-term average of W. In the final analysis, we illustrate
covariance matrix Σu . Then, to generate the pseudo errors we again use SIMEX when there are two predictors measured with error: SBP and
(5.4) and ζ remains a scalar, the only change being that now Ub,i is serum cholesterol.
generated as Normal(0, Σu ). Note that we again have E{Wb,i (ζ)|Xi } =
Xi and
5.4.2 Single Covariate Measured with Error
var{Wb,i (ζ)|Xi } = (1 + ζ)Σu = (1 + ζ)var(Wi |Xi ), (5.16)
In addition to the variables discussed above, we also have SBP measured
which is the multivariate counterpart of (5.2). Extrapolation is, in prin-
at Exam #2. The mean transformed SBP at Exams #2 and #3 are 4.37
ciple, the same as in the scalar X case because ζ is a scalar even for
and 4.35, respectively. Their difference has mean 0.02, and standard
multivariate X. However, the number of remeasured data sets, B, re-
error 0.0040, so that the large-sample test of equality of means has p-
quired to achieve acceptable Monte Carlo estimation precision will gen-
value < 0.0001. Thus in fact, the measurement at Exam #2 is not exactly
erally need to be larger when there are multiple covariates measured
a replicate, but the difference in means from Exam #2 to Exam #3 is
with error. This is because the Monte Carlo averaging in the simula-
close to negligible for all practical purposes.
tion step, see (5.3), is effectively a means of numerical integration. As
We present two sets of analyses. Both use the full complement of
with any numerical integration method, higher-dimensional integration
replicate measurements from Exams #2 and #3. We calculate estimates
requires greater computational effort for comparable levels of precision.
and standard errors for the naive method, regression calibration, and
Also, less is known about the general utility of the simple extrapolant
two versions of SIMEX: SIMEX assuming homoscedastic measurement
functions, for example, the quadratic, in the multivariate X case, espe-
errors, and empirical SIMEX allowing for possibly heteroscedastic errors.
cially for data with large measurement error variances and either strong
The regression calibration and homoscedastic SIMEX analyses use a
multicollinearity among the X variables or high correlation among the
pooled estimate of σu2 from the full complement of replicates. In this
measurement errors.
case, the large degrees of freedom for estimating σu2 means that there
is very little penalty in terms of added variability for estimating the
5.4 Applications measurement error variance.

5.4.1 Framingham Heart Study 5.4.2.1 SIMEX and Homoscedastic Measurement Error
We illustrate the methods using data from the Framingham Heart Study,
This analysis uses the replicate SBP measurements from Exams #2 and
correcting for bias due to measurement error in systolic blood pressure
#3 for all study participants. The transformed data are Wi,j , where
and serum cholesterol measurements. The Framingham study consists
i denotes the individual and j = 1, 2 refers to the transformed SBP
of a series of exams taken two years apart. We use Exam #3 as the
at Exams #2 and #3, respectively. The overall surrogate is W i,· , the
baseline. There are 1,615 men aged 31–65 in this data set, with the
sample mean for each individual. The model is
outcome, Y, indicating the occurrence of coronary heart disease (CHD)
within an eight-year period following Exam #3; there were 128 cases Wi,j = Xi + Ui,j ,

112 113
2
are 1,614 degrees of freedom for estimating σ bu,∗ and thus, for practical
Age Smoke Chol LSBP purposes, the measurement error variance is known.
In Table 5.1, we list the results of the naive analysis that ignores mea-
surement error, the regression calibration analysis, and the SIMEX anal-
Naive .055 .59 .0078 1.70 ysis. For the naive analysis, “Sand.” and “Info.” refer to the sandwich
Sand. .010 .24 .0019 .39 and information standard errors discussed in Appendix A; the latter is
Info. .011 .24 .0021 .41 the output from standard statistical packages.
For the regression calibration analysis, the first set of sandwich and
Reg. Cal. .053 .60 .0077 2.00 information standard errors are those obtained from a standard logistic
Sand.1 .010 .24 .0019 .46 regression analysis having substituted the calibration equation for W,
Info.1 .011 .25 .0021 .49 and ignoring the fact that the equation is estimated. The second set
Sand.2 .010 .24 .0019 .46 of sandwich standard errors are as described in Section B.3, while the
Bootstrap .010 .25 .0019 .46 bootstrap analysis uses the methods of Appendix A.
For the SIMEX estimator, M-estimator refers to estimates derived
SIMEX .053 .60 .0078 1.93 from the theory of Section B.4.2 for the case where σu2 is estimated from
Simex, Sand.3 .010 .24 .0019 .43 the replicate measurements. Sandwich and Information refer to estimates
Simex, Info.3 .011 .25 .0021 .47 defined in Section B.4.1, with τb2 (ζ) derived from the naive sandwich and
M-est. 4 .010 .24 .0019 .44 naive information estimates, respectively. The M-estimation sandwich
and SIMEX sandwich standard errors yield nearly identical standard
Empirical SIMEX5 .054 .60 .0078 1.94 errors because σu2 is so well estimated.
Simex, Sand. .011 .24 .0020 .44 Figure 5.3 contains plots of the logistic regression coefficients Θ(ζ) b
Simex, Info. .012 .25 .0021 .47 for eight equally spaced values of ζ spanning [0, 2] (solid circles). The
M-est. .011 .24 .0020 .44 points plotted at ζ = 0 are the naive estimates Θ b naive . For this example,
B = 2000. Because of double averaging over n and B, taking B this
large is not necessary in general (see the related discussion of corrected
Table 5.1 Estimates and standard errors from the Framingham data logistic scores for linear regression in Section 7.2.1). However, there is no harm
regression analysis. This analysis assumes that all observations have replicated in taking B large, unless computing time is an issue.
SBP. “Naive” = the regression on average of replicated SBP. “Sand.” = sand- The nonlinear least-squares fits of GRL (ζ, Γ) to the components of
wich standard errors. “Info.” = information standard errors. Also, 1 = calibra- b m )}8 (solid curves) are extrapolated to ζ = −1 (dashed curves)
{ζm , Θ(ζ 1
tion function known; 2 = calibration function estimated; 3 = σu2 known; 4 = σu2 resulting in the SIMEX estimators (crosses). The open circles are the
estimated; and 5 = Empirical SIMEX with no assumptions on measurement er- SIMEX estimators that result from fitting quadratic extrapolants, which
ror variances (standard errors computed as for regular SIMEX). Here “Smoke” are essentially the same as the rational linear extrapolants — not sur-
is smoking status, “Chol” is cholesterol, and “LSBP” is log(SBP−50). prising given the small amount of measurement error in this example.
We have stated previously that the SIMEX plot displays the effect of
measurement error on parameter estimates. This is especially noticeable
in Figure 5.3. In each of the four graphs in Figure 5.3, the range of
where the Ui,j have mean zero and variance σu2 . The components of the ordinate corresponds to a one-standard error confidence interval for
variance estimator (4.3) is σ bu2 = 0.01259. the naive estimate constructed using the information standard errors.
We employ SIMEX using Wi∗ = Wi,· and U∗i = Ui,· . The sample vari- Thus, Figure 5.3 illustrates the effect of measurement error relative to
ance of (Wi∗ )n1 is σ 2
bw,∗ = 0.04543, and the estimated measurement error the variability in the naive estimate. It is apparent that the effect of
2
variance is σ
bu,∗ = σbu2 /2 = 0.00630. Thus, the linear model correction for measurement error is of practical importance only on the coefficient of
attenuation, that is, the inverse of the reliability ratio, for these data is log(SBP − 50).
1.16, so that there is only a small amount of measurement error. There The SIMEX sandwich and the M-estimator (with σu2 estimated) meth-

114 115
Age Smoking Age Smoking

6.7 8.4 1.86 8.33

o
c βbA (ζ)

c βbS (ζ)
6.1 7.1 1.57 7.03
1.39e-4 0.0625
• • • 0.60 • • • • • • • • • • • • •
• • 0.59 1.43e-4 • • • 0.0629

βbA (ζ)

n
• •

βbS (ζ)
5.5 5.9 • • • • • • • • 1.28 5.72
0.055

var

var
0.053
4.9 4.7 0.99 4.42

4.3 3.4 0.70 3.12

-1.0 0.0 1.0 2.0 -1.0 0.0 1.0 2.0 -1.0 0.0 1.0 2.0 -1.0 0.0 1.0 2.0
ζ ζ ζ ζ
Cholesterol Log(SBP-50) Cholesterol Log(SBP-50)

9.9 2.12 5.94 2.32

o
PSfrag replacements

o
0.221

c βbC (ζ)

c βbL (ζ)
1.93
8.8 1.91 5.01 1.96
4.45e-6 0.174
PSfrag replacements 0.0078 1.70 4.46e-6
• • • • • • • • •

βbC (ζ)

n
• • •
βbL (ζ)
7.8 • • • • 1.70 • 4.08 1.60
• •
0.0078

var

var
• •
• •
• •
6.8 1.49 3.16 1.23 •
• •


5.7 1.28 2.23 0.87

-1.0 0.0 1.0 2.0 -1.0 0.0 1.0 2.0 -1.0 0.0 1.0 2.0 -1.0 0.0 1.0 2.0
ζ ζ ζ ζ

Figure 5.3 Coefficient extrapolation functions for the Framingham logistic re- Figure 5.4 Variance extrapolation functions for the Framingham logistic re-
gression modeling. The simulated estimates {βb(·) (ζm ), ζm }81 are plotted (solid gression variance estimation. Values of {(b τ 2 (ζm ) − s2∆ (ζm )), ζm }81 for each
circles) and the fitted rational linear extrapolant (solid line) is extrapolated coefficient estimate (see Section 5.3.3 for definitions of τb2 (ζm ) and s2∆ (ζm ))
to ζ = −1 (dashed line), resulting in the SIMEX estimate (cross). Open cir- are plotted (solid circles) and the fitted rational linear extrapolant (solid line)
cles indicate SIMEX estimates obtained with the quadratic extrapolant. The is extrapolated to ζ = −1 (dashed line), resulting in the SIMEX variance es-
coefficient axis labels for Age are multiplied by 102 , for Smoking by 101 , for timate (cross). Open circles indicate SIMEX variance estimates obtained with
Cholesterol by 103 , and for Log(SBP−50) by 100 . Naive and SIMEX estimate the quadratic extrapolant. Naive variance estimates are obtained via the sand-
values in the graphs are in original units of measurement. wich formula. The coefficient axis labels for Age are multiplied by 10 4 , for
Smoking by 102 , for Cholesterol by 106 , and for Log(SBP−50) by 101 . Naive
and SIMEX estimate values in the graphs are in original units of measurement.

ods of variance estimation yield similar results in this example. The dif- 5.4.2.2 Empirical SIMEX and Heteroscedastic Measurement Error
ference between the SIMEX sandwich and information methods is due For the analysis in this section, we use the same data as in the previous
to differences in the naive sandwich and information methods for these analysis. However, the model is now
data.
Wi,j = Xi + Ui,j ,
Figure 5.4 displays the variance extrapolant functions fit to the com-
ponents of τb2 (ζ)−s2∆ (ζ) used to obtain the SIMEX information variances 2
where the Ui,j have mean zero and variance σu,i . That is, the assumption
and standard errors. The figure is constructed using the same conven- of homogeneity of variances is not made, and we use empirical SIMEX
tions used in the construction of Figure 5.3. For these plots, the ranges as described in Section 5.3. The results of the analysis are reported in
of the ordinates are (1/2)var(naive)
c to (4/3)var(naive),
c where var(naive)
c Table 5.1. For the empirical SIMEX estimator, standard errors were
is the information variance estimate of the naive estimator. calculated as described in Devanarayan (1996) and are the empirical

116 117
SIMEX counterparts of the three versions of regular SIMEX standard (IRLS EIM) Scale param = 1
Deviance = 824.240423 (1/df) Deviance = .5119506
errors. Pearson = 1458.82744 (1/df) Pearson = .906104
Variance Function: V(u) = u(1-u) [Bernoulli]
Link Function : g(u) = log(u/(1-u)) [Logit]
5.4.3 Multiple Covariates Measured with Error Standard Errors : EIM Hessian

In this section, we consider a model with two predictors measured with


firstchd Coef. Std. Err. z P>|z| [95% Conf. Interval]
error and use it to illustrate both the SIMEX method and the STATA
computing environment for SIMEX estimation. The true-data model is age .056446 .0117413 4.81 0.000 .0334334 .0794585
smoke .572659 .2498046 2.29 0.022 .0830509 1.062267
similar to that considered in the first analysis in this section. The major
lcholest3 2.039176 .5435454 3.75 0.000 .9738468 3.104506
difference is that we now regard serum cholesterol as measured with error lsbp3 1.518676 .3889605 3.90 0.000 .7563275 2.281025
and use the repeat measurements from Exams #2 and #3 to estimate _cons -23.39799 3.413942 -6.85 0.000 -30.0892 -16.70679
the measurement error variance.
Preliminary analysis of the duplicate measures of cholesterol indicated
that the measurement error is heteroscedastic with variation increasing
with the mean. In the previous analyses, cholesterol was regarded as error The STATA code and output for the SIMEX analysis appear below.
free, and thus error modeling issues did not arise. Now that we regard Prior to running this STATA code, the elements of the estimated error
covariance matrix, Σ b u in (5.17) were assigned to V and are input to
cholesterol as measured with error, it makes sense to consider trans-
formations to simplify the error model structure. In this case, simply the SIMEX procedure with the command suuinit(V). One subtlety in
taking logarithms homogenizes the error variance nicely. This changes STATA is that the order of the predictor-variable variances along the di-
our true-data model from the one considered in the preceding section to agonal of V, must correspond to the order of the variables measured with
the logistic model with predictors Z1 = age, Z2 = smoking status, X1 = error listed in the STATA simex command. In this example, wcholest is
log(cholesterol) at Exam #3, and X2 = log(SBP−50) at Exam #3. the first variable measured with error and wlsbp is the second, and so
The assumed error model is (W1 , W2 ) = (X1 , X2 ) + (U1 , U2 ), where their error variances are placed in the (1, 1) and (2, 2) components of V
(U1 , U2 ) is bivariate normal with zero mean and covariance matrix Σu . respectively.
The error covariance matrix was estimated by one-half the sample co- . simex (firstchd = age smoke) (wcholest:lcholest3)
variance matrix of the differences between the Exam #2 and Exam #3 (wlsbp:lsbp3), family(binomial) suuinit(V) bstrap seed(10008)
Estimated time to perform bootstrap: 2.40 minutes.
measurements of X1 and X2 , resulting in
µ ¶ Simulation extrapolation No. of obs = 1615
b u = 0.00846 0.000673 .
Σ (5.17) Bootstraps reps = 199
0.000673 0.0126 Residual df = 1610 Wald F(4,1610) = 24.10
Prob > F = 0.0000
The estimated correlation is small, .065, but significantly different from Variance Function: V(u) = u(1-u) [Bernoulli]
Link Function : g(u) = log(u/(1-u)) [Logit]
zero (p-value = .0088), so we do not assume independence of the mea-
surement errors.
Bootstrap
The two error variances correspond to marginal reliability ratios of firstchd Coef. Std. Err. t P>|t| [95% Conf. Interval]
λ1 = 0.73 and λ2 = 0.76, respectively, for W1 and W2 . Thus, in the
absence of strong multicollinearity, we expect the SIMEX estimates of age .0545443 .0099631 5.47 0.000 .0350023 .0740863
smoke .5803764 .2638591 2.20 0.028 .0628329 1.09792
the coefficients of log(cholesterol) and log(SBP−50) to be approximately wcholest 2.5346 .7278619 3.48 0.001 1.106944 3.962256
1/λ1 = 1.37 and 1/λ2 = 1.32 times as large as the corresponding naive wlsbp 1.84699 .4529421 4.08 0.000 .9585718 2.735408
estimates. _cons -27.44831 4.231603 -6.49 0.000 -35.74834 -19.14828
Following is the STATA code and output for the naive analysis:
. qvf firstchd age smoke lcholest3 lsbp3, family(binomial)

Generalized linear models No. of obs = 1615


STATA also provides SIMEX plots for visually assessing the extrapo-
Optimization : MQL Fisher scoring Residual df = 1610 lation step. The variables Z1 = age and Z2 = smoking status are not

118 119
affected by measurement error much, so we only present the SIMEX
Simulation Extrapolation: wcholest
plots for X1 = log(cholesterol) and X2 = log(SBP−50) in Figure 5.5.
Extrapolant: Quadratic Type: Mean
Note that whereas we use ζ as the variance inflation factor for SIMEX SIMEX Estimate

2.5
remeasured data, the default in STATA is to identify this parameter as
“Lambda.”
Note that with the SIMEX analysis there is substantial bias correction
in the coefficients for the coefficients of log(cholesterol) and log(SBP−50), Naive Estimate

2
but not quite as large as predicted from the inverse marginal reliabil-

Coefficient
ity ratios, c.f., for log(cholesterol) (1.37)(2.04) = 2.79, for log(SBP−50)
(1.32)(1.52) = 2.01. Three factors contribute to the differences. First,
whereas the marginal reliability ratios provide a useful rule of thumb for

1.5
determining bias corrections, they do not account for collinearity among
the predictors or for correlation among the measurement errors. The
proper rule-of-thumb multiplier in this case is the inverse of the relia-
bility matrix (Gleser, 1992), but because it is matrix-valued it is not as

1
readily computed and so not as useful. Second, there is variability in the −1 0 1 2
Lambda
SIMEX estimates associated with the choice of B. In STATA the default Naive: 2.039176 SIMEX: 2.5346
is B = 199, but this can be overridden using the breps() command. For
these data, increasing B to 1,000 results in greater corrections for attenu-
ation (we got estimated coefficients for log(cholesterol) and log(SBP−50) Simulation Extrapolation: wlsbp
Extrapolant: Quadratic Type: Mean
of 2.65 and 1.86, respectively). Recall that for multiple predictors mea- SIMEX Estimate
sured with error, greater replication is necessary. Finally, the default

1.8
extrapolant in STATA is the quadratic, which generally results in some-
what less correction for bias than the rational linear extrapolant. Using

1.6
the rational-linear extrapolant, we got estimates for log(cholesterol) and
log(SBP−50) of 2.76 and 1.89, respectively. Naive Estimate

Coefficient
1.4
5.5 SIMEX in Some Important Special Cases
This section describes the bias-correction properties of SIMEX in four

1.2
important special cases.

1
5.5.1 Multiple Linear Regression −1 0 1 2
Lambda
Consider the multiple linear regression model Naive: 1.518676 SIMEX: 1.84699

Yi = β0 + βzt Zi + βx Xi + ǫi .
In the notation of Section 5.3, Θ = (β0 , βzt , βx )t . If non-iid pseudo errors
are employed in the SIMEX simulation step, it is readily seen that Figure 5.5 STATA SIMEX plots for log(cholesterol) (top) and log(SBP−50)
  −1 (bottom) for the Framingham logistic regression model with Z1 = age, Z2 =
X n 1 Zti Wi  smoking status, X1 = log(cholesterol), and X2 = log(SBP−50). Note that
b
Θ(ζ) =  Zi Zi Zti Zi Wi  where we use ζ, STATA uses “Lambda.”
 
i=1 Wi Wi Zti Wi2 + ζσu2

120 121
  
X n Yi  because it is an example where neither the quadratic nor the rational
×  Zi Yi  . linear extrapolant provides exact answers.
 
i=1 Wi Yi Consider fitting a quadratic regression model using orthogonal polyno-
Solving this system of equations we find that mials and least square estimation. Components of the parameter vector
Θ = (β0 , βx,1 , βx,2 )t are the coefficients in the linear model
βbv (ζ) = (Vt V)−1 Vt Y (5.18)
t −1 t
¡ t t t −1 t
¢ Yi = β0 + βx,1 (Xi − X) + βx,2 (X2i − a − bXi ) + ǫi , (5.20)
(V V) V W W Y − W V(V V) V Y
− , where a = a{(Xi )n1 } and b = b{(Xi )n1 }are the intercept and slope,
Wt W − Wt V(Vt V)−1 Vt W + ζσ 2
respectively, of the least squares regression line of X2i on Xi . Model
Wt Y − Wt V(Vt V)−1 Vt Y (5.20) is a reparameterization of the usual quadratic regression model
βbx (ζ) = , (5.19) Yi = β0 + βx,1 Xi + βx,2 X2i + ǫi . The usual model often has severe
Wt W
− Wt V(Vt V)−1 Vt W + ζσ 2
collinearity, but the reparameterized model is orthogonal. The so-called
where βv = (β0 , βzt )t , Vt = (V1 , V2 , . . . , Vn ) with Vi = (1, Zti )t . All naive estimator for this model is obtained by fitting the quadratic re-
of the components of Θ(ζ)b are functions of ζ of the form GRL (ζ, Γ) for gression to (Yi , Wi )n1 , noting that Wi replaces Xi , i = 1, . . . , n, in the
suitably defined, component-dependent Γ = (γ1 , γ2 , γ3 )t . definitions of a and b.
It follows that if the models fit in the SIMEX extrapolation step have Let µx,j = E(Xj ), j = 1, . . . , 4. We assume for simplicity that µx,1 = 0
the form GRL (ζ, Γ), allowing different Γ for different components, then and µx,2 = 1. The exact functional form of Θ b b (ζ) is known for this model
SIMEX results in the usual method-of-moments estimator of Θ. b
and is used to show that asymptotically, Θ(ζ) converges in probability
to Θ(ζ) given by
5.5.2 Loglinear Mean Models β0 (ζ) = β0 ,
Suppose that X is a scalar and that E(Y|X) = exp(β0 + βx X), with βx,1 σx2
βx,1 (ζ) = ,
variance function var(Y | X) = σ 2 exp {θ (β0 + βx X)} for some constants σx2 + δ
σ 2 and θ. It follows from the appendix in Stefanski (1989) that if (W, X) µx,3 βx,1 δ + (1 + δ)βx,2 (µx,4 − 1) − µ2x,3 βx,2
has a bivariate normal distribution and generalized least squares is the βx,2 (ζ) = ,
(1 + δ)(µx,4 − 1 + 4δ + 2δ 2 ) − µ2x,3
method of estimation, then β c0 (ζ) and β
cx (ζ) consistently estimate
where δ = (1 + ζ)σu2 .
µx σu2 βx + βx2 σx2 σu2 /2 Note that both β0 (ζ) and βx,1 (ζ) are functions of ζ of the form GRL (ζ, Γ),
β0 (ζ) = β0 + (1 + ζ)
σx2 + (1 + ζ)σu2 whereas βx,2 (ζ) is not. For arbitrary choices of σu2 , µx,3 , µx,4 , βx,1 , and
and βx,2 , the shape of βx,2 (ζ) can vary dramatically for −1 ≤ ζ ≤ 2, thereby
βx σx2 invalidating the extrapolation step employing an approximate extrap-
βx (ζ) = ,
σx2 + (1 + ζ)σu2 olant. However, in many practical cases, the quadratic extrapolant cor-
respectively, where µx = E(X), σx2 = Var(X) and σu2 = Var(W | X). rects for most of the bias, especially for σu2 sufficiently small. When X
The rational linear extrapolant is asymptotically exact for estimating is normally distributed, βx,2 (ζ) = βx,2 /(1 + δ)2 , which is monotone for
both β0 and βx . all ζ ≥ −1 and reasonably well approximated by either a quadratic or
GRL (ζ, Γ) for a limited but useful range of values of σu2 .

5.5.3 Quadratic Mean Models


5.6 Extensions and Related Methods
There is already a literature on polynomial regression with additive mea-
5.6.1 Mixture of Berkson and Classical Error
surement error; see Wolter and Fuller (1982); Stefanski (1989); Cheng
and Schneeweiss (1998); Iturria, Carroll, and Firth (1999); and Cheng, We now consider the Berkson/classical mixed error model, which was
Schneeweiss, and Thamerus (2000). Thus, the use of SIMEX in this prob- discussed previously in Section 3.2.5 (see Table 3.1, and also Section
lem thus might not be considered in practice, but it is still interesting 1.8.2 and Section 8.6 for log-scale versions of the model). Recall that

122 123
the defining characteristic is that the error model contains both classical defined in (5.25) when s2w and W are replaced by their asymptotic limits
2
and Berkson components. Specifically, it is assumed that to σw and µx . Now consider the remeasured random variable
X = L + Ub , (5.21) W(ζ) = a1 + a2 W + a3 U. (5.27)
W = L + Uc . (5.22) Noting that E{W(ζ) | X} = a1 + a2 E(W | X) and var(W(ζ) | X} =
When Ub = 0, the classical error model is obtained, whereas the Berkson a22 var(W | X) + a23 , and that as ζ → −1, a1 → (1 − 1/γx )µx , a2 → 1/γx
error model results when Uc = 0. The variances of the error terms are and a3 → σx2 − σw2
/γx2 , it follows that as ζ → −1,
σu2 c and σu2 b . Some features of this error model, when (X, W) is bivariate
E{W(ζ) | X} → X and var{W(ζ) | X} → 0. (5.28)
normal, that we will use later are:
σx2 − σu2 b Thus, just as in the other error models we considered in Section 5.3.1,
E(W | X) = (1 − γx )µx + γx X, where γx = ; the mean squared error MSE{W(ζ)|X} = E[{W(ζ) − X}2 |X] converges
σx2
to zero as ζ → −1; see (5.10).
cov(X, W) = γx σx2 ;
2
var(W | X) = σw − γx2 σx2 ; 5.6.2 Misclassification SIMEX
var(W) = σx − σu2 b + σu2 c .
2
(5.23) Küchenhoff, Mwalili, and Lesaffre (2005) developed a general method of
Apart from Schafer, Stefanski, and Carroll (1999), SIMEX for this er- correcting for bias in regression and other estimators when discrete data
ror model has not been considered and is not as well studied as classical- are misclassified, called the misclassification SIMEX (MC-SIMEX). In
error SIMEX. We now show how to implement SIMEX estimation in broad strokes, the method works in much the same way as the SIMEX
this model, assuming that σu2 c and σu2 b are known. Define methods discussed previously. However, the details of the method dif-
fer, especially the simulation component, which could logically be called
bx2 = s2w − σu2 c + σu2 b ,
σ and µ
bx = W, (5.24) reclassification in the spirit of the term remeasurement used previously.
where W and s2w are the sample mean and variance of the W data. The method requires that the misclassification matrix Π = (πij ) be

Then, for 0 ≤ ζ ≤ ζmax where ζmax ≤ ζmax σx2 − σ
= (b bu2 b )/σu2 b , set known or estimable where

bx2 − σu2 b (1 + ζ)
σ πij = pr(W = i | X = j). (5.29)
a2 =
b ,
σx2 − σu2 b )
(b Note that the case of no misclassification corresponds to having Π = I,
q the identity matrix.
a3 = +
b σ a22 (b
bx2 + σu2 c (1 + ζ) − σu2 b (1 + ζ) − b σx2 + σu2 c − σu2 b ) , In Section 8.4, we discuss an example of misclassification using max-
imum likelihood methods, when X is binary. In such cases, maximum
a1 = (1 − b
b a2 )b
µx . (5.25) likelihood is relatively simple, and there would be little need to use MC-

Note that when ζ ≤ ζmax the term under the radical sign is nonneg- SIMEX.
ative, and hence b
a3 is real. Then the bth error-inflated set of pseudo In continuous-variable SIMEX, remeasured data are generated in the
measurements is defined as sense that W(ζ) is constructed as a measurement of W, in the same man-
ner that W is a measurement of X. With misclassification, all variables
Wb,i (ζ) = b
a1 + b
a2 Wi + b
a3 Ub,i , i = 1, · · · , n. (5.26)
are discrete. Küchenhoff, Mwalili, and Lesaffre (2005) show how to gen-
With the one change that the upper bound of the grid of ζ values must erate reclassified data in the sense that W(ζ) is constructed as a misclas-

not exceed ζmax , the SIMEX algorithm works the same from this point sified version W, in the same manner that W is a misclassified version of
on as it does for the case of classical measurement error. X. Suppose that Π has the spectral decomposition Π = EΛE −1 , where Λ
We now show that under the assumption that (X, W) is bivariate nor- is the diagonal matrix of eigenvalues and E is the corresponding matrix
mal, the remeasured data from (5.26) possess, asymptotically, the key of eigenvectors. We can now write symbolically W = M C[Π](X), where
property of remeasured data that we saw for other error models in equa- the misclassification operation, M C[Π](X), denotes the generation of
tions (5.2), (5.5), (5.7), and (5.10). Let a1 , a2 , and a3 denote quantities the misclassified variable W from the true variable X according to the

124 125
probabilities (5.29). Define Πζ = EΛζ E −1 . In MC-SIMEX reclassified squares estimator (p = 1.5) was robust to measurement error, and the
data are generated as robustness was apparent from the horizontal SIMEX plot in Figure 5.2.
The method of Huang, Stefanski, and Davidian (2006) is based on the
Wb,i (ζ) = M C[Πζ ](Wi ), (5.30) fact that if an estimator is not biased by measurement error, then its
where the random reclassification step is repeated b = 1, . . . , B times SIMEX plot should be linear with zero slope. They developed this idea
for each of the i = 1, . . . , n variables. As in SIMEX, the simulation step for checking robustness of parametric modeling assumptions in struc-
in (5.30) is repeated for a grid of ζ values, 0 ≤ ζ1 < · · · < ζM . Once tural measurement error models.
the reclassified data are generated, the rest of the SIMEX algorithm is We give an overview of the method for a simple structural model of
similar to those discussed previously. the type in equation (8.7). In a structural model, the Xi are regarded
The key idea behind continuous-variable SIMEX is that if W|X ∼ as random variables. If a parametric model is assumed for the density of
Normal(X, σu2 ) and W(ζ)|W ∼ Normal(W, ζσu2 ), then the conditional X, say fX (x, α
e2 ), then the density of the observed data is
distribution W(ζ)|X ∼ Normal(W, (1 + ζ)σu2 ). The analogous property
e2 , σu2 ) =
fY,W (y, w|Θ, α
for MC-SIMEX is if W = M C[Π](X) and W(ζ) = M C[Πζ ](W), then Z
W(ζ) = M C[Π1+ζ ](X), where the three preceding equalities denote fY |X (y|x, Θ)fW |X (w|x, σu2 )fX (x, α
e2 )dx, (5.31)
equality in distribution. For continuous-variable SIMEX, ζ = −1 corre-
sponds to the case of no measurement error in the sense that (1+ζ)σu2 = where fW |X (w|x, σu2 ) is the Normal(X, σu2 ) density. The corresponding
0. For MC-SIMEX, ζ = −1 corresponds to the case of no misclassifica- likelihood for the case σu2 is known is
tion in the sense that Π1+ζ = Π0 = I.
n
Y
The heuristic explanation of why MC-SIMEX works is similar to the
L(Θ, α
e2 ) = e2 , σu2 ).
fY,W (Yi , Wi |Θ, α (5.32)
explanation for SIMEX. A statistic calculated from the misclassified
b = Θ(W
b i=1
data, say Θ 1 , . . . , Wn ), will converge asymptotically to a lim-
iting value that depends on the matrix Π, say Θ(Π). In the case of no The appealing features of structural modeling are that inference is based
misclassification, Π = I and the true-data statistic would be consistent, on the likelihood (5.32) and the estimators are consistent and asymp-
leading to the conclusion that Θ(I) = Θ0 , the true parameter value. totically efficient as long as the model is correct. However, the Achilles’
The same statistic calculated from data generated according to (5.30) heel of structural model is specification of the model for X. If this is
will converge to Θ(Π1+ζ ). Now, if we could model how Θ(Π1+ζ ) de- not correct, then maximum likelihood estimators need not be consis-
pends on ζ, then we could extrapolate the model to ζ = −1, resulting in tent or efficient. Remeasurement provides a method of checking whether
limζ→−1 Θ(Π1+ζ ) = Θ(Π0 ) = Θ(I) = Θ0 , the true parameter. The ex- misspecification of the model for X is causing bias in the parameter of
trapolation step does exactly this with the finite-sample data estimates. interest θ. The method is based on the observation that if estimators of
Küchenhoff, Mwalili, and Lesaffre (2005) investigated the asymptotic Θ based on the model fY,W (y, w|Θ, α e2 , σu2 ) using data {Yi , Wi } are not
true extrapolant function for a number of representative models and biased by measurement error, then estimators of Θ based on the model
concluded that a quadratic extrapolant function and a loglinear extrap- fY,W (y, w|Θ, α e2 , (1 + ζ)σu2 ) using remeasured data {Yi , Wi (ζ)} should
olant function are adequate for a wide variety of models. not be biased by measurement error. Alternatively, if fY,W (y, w|Θ, α e2 , σu2 )
2
is a correct model for (Y, W), then fY,W (y, w|Θ, α e2 , (1 + ζ)σu ) is nec-
essarily a correct model for (Y, W(ζ)). And in this case, the SIMEX
5.6.3 Checking Structural Model Robustness via Remeasurement pseudodata estimators Θ(ζ) b are consistent for the true Θ for all ζ > 0.
In this section, we briefly describe a useful remeasurement method that Consequently, the plot of Θ(ζ) b versus ζ should be flat. Conversely, if the
has its roots in SIMEX estimation. Huang, Stefanski, and Davidian b
plot of Θ(ζ) versus ζ is not flat, then the model for X is not correct,
(2006) show how to use remeasurement and SIMEX-like plots to check assuming the other components of the model are correct.
the robustness of certain model assumptions in structural measurement The suggested procedure is simple. For a given assumed model for X,
error models, such as those described in Chapter 8. The idea is sim- generate remeasured data sets and calculate Θ(ζ), b as described in Sec-
ple, and we have already seen the essence of it in the weighted least tion 5.3. Then construct a SIMEX plot as in Figure 5.1. If the plot is
squares example in Section 5.3.1. In that example, one weighted least a flat line, then the indicated conclusion is that the assumed model for

126 127
X is robust to bias from measurement error. If the plot is not flat, then
the indicated conclusion is that the assumed model for X is not robust.
Subjective determination from the plot is not necessary. Huang, Stefan-
ski and Davidian (2006) proposed and studied a test statistic for making
an objective determination of robustness. A more complete treatment
of the robustness in structural measurement error models and details of
the test statistic for testing robustness can be found in their paper.

Bibliographic Notes
In addition to STATA’s implementation of continuous-variable SIMEX,
an R implementation of MC-SIMEX (Section 5.6.2) and continuous-
variable SIMEX has been written by Wolfgang Lederer, see http://cran.r-
mirror.de/src/contrib/Descriptions/simex.html.
Since the original paper by Cook and Stefanski (1994), a number of
papers have appeared that extend the basic SIMEX method, adapt it to
a particular model, or shed light on its performance via comparisons to
other methods; see Küchenhoff and Carroll (1997); Eckert, Carroll, and
Wang (1997); Wang, Lin, Gutierrez, and Carroll (1998); Luo, Stokes,
and Sager (1998); Fung and Krewski (1999); Lin and Carroll (1999,
2000); Polzehl and Zwanzig (2004); Staudenmayer and Ruppert (2004);
Li and Lin (2003a,b); Devanarayan and Stefanski (2002); Kim and Gleser
(2000); Kim, Hong, and Jeong (2000); Holcomb (1999); Carroll, Maca,
and Ruppert (1999); Jeong and Kim (2003); Li and Lin (2003).
SIMEX has found applications in biostatistics and epidemiology (Mar-
cus and Elias, 1998; Marschner, Emberson, Irwin, et al., 2004; Greene
and Cai, 2004), ecology (Hwang and Huang, 2003; Gould, Stefanski,
and Pollock, 1999; Solow, 1998; Kangas, 1998), and data confidentiality
(Lechner and Pohlmeier, 2004).
Besides MC-SIMEX introduced in Section 5.6.2, there is a large liter-
ature on correcting the effects of misclassification of a discrete covariate.
See Gustafson (2004) for an extensive discussion and Buonaccorsi, Laake,
and Veirød, (2005) for some recent results.

128
CHAPTER 6

INSTRUMENTAL VARIABLES

6.1 Overview

The methods discussed thus far depend on knowing the measurement


error variance, or estimating it, for example, with replicate measure-
ments or validation data. However, it is not always possible to obtain
replicates, and thus direct estimation of the measurement error variance
is sometimes impossible. In the absence of information about the mea-
surement error variance, estimation of the regression model parameters
is still possible, provided the data contain an instrumental variable (IV),
T, in addition to the unbiased measurement, W = X + U.
In later sections, we state more precisely the conditions required of an
instrument, as they differ somewhat from one model to another. How-
ever, in all cases an instrument must possess three key properties: (i)
T must not be independent of X; (ii) T must be uncorrelated with the
measurement error U = W − X; (iii) T must be uncorrelated with
Y − E(Y | Z, X). In summary, T must be uncorrelated with all the
variability remaining after accounting for (Z, X). It is of some inter-
est that in certain cases, especially linear regression, U can be corre-
lated with the variability remaining after accounting for (Z, X), that is,
Y − E(Y | Z, X), and thus differential measurement error sometimes
can be allowed.
One possible source of an instrumental variable is a second, possi-
bly biased, measurement of X obtained by an independent measuring
method. Thus, the assumption that a variable is an instrument is weaker
than the assumption that it is a replicate measurement. However, the
added generality is gained at the expense of increased variability in bias-
corrected estimators relative to cases where the measurement error vari-
ance is known or directly estimated. More important, if T is assumed to
be an instrument when it is not, that is, if T is correlated with either
U = W − X or Y − E(Y | Z, X), then instrumental variable estimators
can be biased asymptotically, regardless of the size of the measurement
error variance. So falsely assuming a variable is an instrument can lead
to erroneous inferences even in the case of large sample size and small
measurement error; see Sections 6.2.2.1 and 6.5.2.
In the Framingham data analysis of Chapter 5, it was assumed explic-

129
itly that transformed blood pressure measurements from successive exam Note that those components of U e corresponding to the error-free vari-
periods were replicate measurements, even though a test of the replicate ables (1, Zt ) will equal zero.
measurements assumption was found to be statistically (although not
practically) significant. The same data can also be analyzed under the
weaker assumption that the Exam #2 blood pressure measurements are 6.2 Instrumental Variables in Linear Models
instrumental variables. We do this in Section 6.5 to illustrate the instru-
mental variable methods. 6.2.1 Instrumental Variables via Differentiation
In this chapter, we restrict attention to the important and common
case in which there is a generalized linear model relating Y to (Z, X), Much intuition about the manner in which an instrumental variable is
that is, the mean and variance functions depend on a linear function of used can be obtained by considering the following equations
the covariates and predictors. Except for the linear model, we also as-
sume that the regression of X on (Z, T, W) is linear, although in Section Y = f (X) + ǫ,
6.6 other possibilities are considered. In other words, we assume a re-
W = X + U,
gression calibration model (Section 2.2), leading to a hybrid combination
of classical additive error and regression calibration error. Instrumental
regarding the scalars Y, X, ǫ, W, and U as mathematical variables.
variable estimation is introduced in the context of linear models in Sec-
Differentiating both sides of the top equation with respect to T, us-
tion 6.2. We then describe an extension of the linear IV estimators to
ing the chain rule ∂f /∂T = (∂f /∂X)(∂X/∂T), noting that ∂X/∂T =
nonlinear models using regression calibration–like approximations in Sec-
∂W/∂T − ∂U/∂∂T, and rearranging terms results in
tion 6.3. An alternative generalization of linear model IV estimation due
to Buzas (1997) is presented in Section 6.4. The methods are illustrated
∂W ∂f
by example in Section 6.5. Section 6.6 discusses other approaches. The = ∂Y/∂T + (∂f /∂X)(∂U/∂T) − ∂ǫ/∂T.
∂T ∂X
chapter concludes with some bibliographic notes. Additional technical
details are in Appendix B.5. Consequently if ∂U/∂T = ∂ǫ/∂T = 0 and ∂W/∂T 6= 0, then

6.1.1 A Note on Notation ∂f ∂Y/∂T


= . (6.2)
∂X ∂W/∂T
In this chapter it is necessary to indicate numerous regression param-
eters and we adopt the notation used by Stefanski and Buzas (1995). That is, as long as we know how Y and W change with T, we can
Consider linear regression with mean β0 + βzt Z + βxt X. Then βY |1ZX is determine the way that f changes with X.
the coefficient of 1, that is, the intercept β0 , in the generalized linear re- The suggestive analysis above explains the essential features and work-
gression of Y on 1, Z and X. Also, βYt |1Z X = βzt is the coefficient of Z in ings of instrumental variable estimation in linear measurement error
the regression of Y on 1, Z and X. This notation allows representation models. If an instrument, T, is such that it is not related to U or ǫ
of subsets of coefficient vectors, for example, (∂U/∂T = ∂ǫ/∂T = 0) but is related to X (note that when ∂U/∂T = 0,
βYt |1Z X = (βY |1ZX , βYt |1Z X ) = (β0 , βzt ) then ∂W/∂T = ∂X/∂T and ∂X/∂T 6= 0 =⇒ ∂W/∂T 6= 0), then we
can determine how f varies with X using only the observed variables Y,
and, if the regression of X on (Z, T) has mean α0 + αzt Z + αtt T, then W, and T. For linear models, the essential properties of an instrument
t t t t t are that T is uncorrelated with U and ǫ, and is correlated with X.
βX|1ZT = (βX|1ZT , βX|1Z T , βX|1ZT ) = (α0 , αz , αt ).
The derivation of (6.2) depends critically on the lack of relationships
Also, many of the results in this chapter are best described in terms of between T and ǫ, and between T and U, and also on the denominator on
the composite vectors the right-hand side of (6.2) being nonzero. If either of the first two condi-
e = (1, Zt , Xt )t , f = (1, Zt , Wt )t , tions is not met, (6.2) would not be an equality (the statistical analogue
X W is bias). If the third condition is violated, we get indeterminacy because
e = (1, Zt , Tt )t ,
T e =W
U f − X. e (6.1) of division by zero (the statistical analogue is excessive variability).

130 131
6.2.2 Simple Linear Regression with One Instrument of the measurement error variance. In fact, even when σu2 = 0, IV esti-
mation can lead to large biases when the IV assumptions are violated.
We first consider simple linear regression with one instrument. Suppose
So the possibility exists that in trying to correct for a small amount of
that Y, W, and T, are scalar random variables such that
bias due to measurement error, one could introduce a large amount of
Y = βY |1X + XβY |1X + ǫ, bias due to an erroneous IV assumption. The message should be clear:
W = X + U, (6.3) Use IV estimation only when there is convincing evidence that the IV
assumptions are reasonable.
where ǫ and U have mean zero, and all random variables have finite vari-
ances. Define covariances among these variables as follows: cov(T, Y) = Relative to the asymptotic results in the previous paragraph, the po-
σty , cov(T, X) = σtx , etc. In order to estimate the slope in the regression tential pitfalls can be even greater with finite samples of data when σ tx
of Y on X, that is, βY |1X , we only require that is not far from zero. This is because random variation in the denomina-
tor, σ
btw , of (6.6), can cause it to be arbitrarily close to zero, in which
σtǫ = σtu = 0, and σtx 6= 0. (6.4) case the estimator in (6.6) is virtually worthless by dint of its excessive
To see this, note that (6.3) implies that variability. Fortunately, we can gain insight into whether this is a likely
problem by testing the null hypothesis H0 : σtw = 0. This is most easily
cov(T, Y) = σty = σtx βY |1X + σtǫ done by testing for zero slope in the linear regression of W on T. In-
cov(T, W) = σtw = σtx + σtu , (6.5) strumental variable estimation is contraindicated unless there is strong
so that if (6.4) holds, then evidence that this slope is nonzero (Fuller, 1987, p. 54).

cov(T, Y) σtx βY |1X + σtǫ Problems similar to those noted above occur with multiple predictors
= = βY |1X . measured with error and multiple instrumental variables, although the
cov(T, W) σtx + σtu
linear algebra for diagnosing and understanding them is more involved.
Suppose now that (Yi , Wi , Ti ) are a sample with the structure (6.3), and For this more general setting, Fuller (1987, p. 150–154) describes a test
that σ
bty and σ
btw are the sample covariances of σty and σtw , respectively. analogous to the regression test described above, and also a small-sample
Then the instrumental variable estimator, modification of (6.6) that, in effect, controls the denominator so that it
σ
bty does not get too close to zero; see Section 6.2.3.1 for details.
βbYIV|1X = , (6.6)
σ
btw
is a consistent estimator of βY |1X .
6.2.2.2 Technical Generality of the Assumptions
6.2.2.1 IV Estimation Potential Pitfalls
We now use this simple model to illustrate our previous warnings about
the potential pitfalls associated with instrumental variable estimation When the IV assumptions (6.4) hold, the consistency of βbYIV|1X is note-
when the IV assumptions are not satisfied. Because sample covariances worthy for the lack of other conditions under which it is obtained. Al-
are consistent estimators, we know that as n → ∞, though the representation for Y in (6.3) is suggestive of the common
linear model, consistency does not require the usual linear model as-
cov(T, Y) σtx βY |1X + σtǫ
βbYIV|1X −→ = . (6.7) sumption that ǫ and X are uncorrelated. Neither are any conditions
cov(T, W) σtx + σtu required about the relationship between X and the instrument T other
Consider the right-hand side of (6.7) for various combinations of as- than that of nonzero covariance in (6.4); nor is it required that U and ǫ,
sumption violations: σtǫ 6= 0; σtu 6= 0; σtx = 0. For example, if σtx 6= 0 or U and X be uncorrelated. Although very few assumptions are neces-
and σtu = 0, but σtǫ 6= 0, then the IV estimator has an asymptotic bias sary, it does not mean that the various covariances can be arbitrary. The
given by σtǫ /σtx , which can have either sign (±) and can be of any mag- fact that the covariance matrix of the composite vector (Y, X, W, T)
nitude, depending on how close |σtx | is to zero. Clearly, there are other must be nonnegative definite imposes restrictions on certain variances
combinations of assumption violations that also lead to potentially sig- and covariances. For example, it is impossible to have corr(T, X) = 0.99,
nificant biases. Note that such biases are possible regardless of the size corr(X, U) = 0.99, and corr(T, U) = 0.

132 133
6.2.2.3 Practical Generality of the Assumptions eW
E(T f t) = Ω e f = Ω e e + Ω e e . (6.10)
TW TX TU

The technical generality of the assumptions seems impressive, but as a Equation (6.10) is the multiple linear regression counterpart of (6.5).
practical matter is much less so. The assumption is that T is uncorrelated Note that
with ǫ and U, which in practice is often approximately equivalent to
assuming that T is independent of (ǫ, U). However, T has to be related (ΩtT
eWf ΩT
eWf)
−1 t
ΩT
eWf ΩTY
e = (ΩtT
eWf ΩT
eWf)
−1 t
ΩT
eWf ΩT
eXe βY |X e.
e = βY |X

to X, and so as a practical matter, X too has to be independent of (ǫ, U). Consequently if we replace expectations by averages, so that for example
b e f = n−1 Pn T
Ω e ft
TW i=1 i Wi , then the instrumental variable estimator,
6.2.3 Linear Regression with Multiple Instruments
βbYIV|Xe = (Ω
bt Ω
TeW
b e f )−1 Ω
f T W
bt Ω
TeW
be ,
f TY (6.11)
We now consider multiple linear regression with multiple instruments,
starting with the case where the number of instruments is the same as the is a consistent estimator of βY |Xe .
number of components of X. The case where the number of instruments
exceeds the number of predictors is presented at the end of this section. 6.2.3.1 Small Sample Modification
Suppose that the scalar Y and the composite vectors W f and T e in (6.1)
Fuller (1987, p. 150–154) described a small sample modification to handle
are such that
the instability that can be caused by the matrix inversion in (6.11). It
e t β e + ǫ,
Y=X solves small sample problems arising with weak predictors, but has less of
Y |X
f =X
e + U,
e an ameliorative effect on problems due to violations of the assumptions
W (6.8)
that T and ǫ, and T and U are uncorrelated.
where ǫ and U have mean zero, and all random variables have finite Let V = [Y, W],f and define S = [Y, W] = T( e T e t T)
e −1 T
e t V. Let q be
second moments. the number of components of T. e Define
In what follows, covariances are replaced by uncentered, expected · ¸
crossproduct matrices, for example ΩTY = E(TY)e in place of σty = Saa11 Saa12
e Saa = = (n − q)−1 V t (V − S).
cov(T, Y), a consequence of the fact that a column of ones is included Saa21 Saa22
f and T.
in each of W e Let dim(·) denote the dimension of the argument. Let κ be the smallest root of the determinant equation |S t S − κSaa |.
The multiple linear regression counterparts of assumptions (6.4) are Let α > 0 be a fixed constant, for example, α = 4. Fuller proposed the
e estimator
ΩTǫ
e = 0, ΩT
eUe = 0, and rank(ΩT
eXe ) = dim(X), (6.9)
βbYIV|Xe = {W t W − (κ − α)Saa22 }−1 {W t Y − (κ − α)Saa21 }.
The last assumption requires that T and X not be independent. As we
discussed in detail in Section 6.2.2 for the cases of simple linear regres-
sion with one instrument, violations of the key assumptions (assumptions 6.2.3.2 Technical Generality of the Result
(6.4) for simple linear regression and (6.9) for multiple linear regression),
The consistency of βbYIV|Xe is again noteworthy for the lack of conditions
can have severe consequences. The case where the instrument is an in-
dependent measurement of X obtained using a second, independent, under which it is obtained. The only conditions necessary are those in
method of measurement (possibly biased or with different error vari- (6.9), the third of which requires at least as many instruments as vari-
ance) is one where the key assumptions (6.9) can be expected to hold a ables measured with error. However, consistency does not require that
priori. Even in such cases, however, the declaration of independence is any of the expected crossproduct matrices ΩXǫ e , ΩX e , ΩUǫ
eU e equal zero,

seldom infallible. For other cases, often subject matter expertise must be although again, as for simple linear regression, the fact the covariance
brought to bear on the problem of determining whether the assumptions matrix of the composite vector (Y, X, W, T) must be nonnegative def-
in (6.9) are reasonable. inite imparts certain restriction on these crossproduct matrices. Also,
It follows from (6.8) that even though certain instrumental variable estimators can be written as
functions of the linear least squares regressions of Y and W f on T,e the
e
E(TY) = ΩTY
e = ΩT
eX e + ΩTǫ
e βY |X e , assumption that these regressions are linear in T e is not necessary.

134 135
6.2.3.3 Practical Generality of the Result c1 , the resulting estimator
in (6.12) for some other nonsingular matrix M
can be written as
As in simple linear regression, for most practical purposes the assump-
IV,(M c †
) c )
−(M
tion that T is uncorrelated with ǫ and U means that X is as well. βbY |Xe = βbf e 1 βbY |Te , (6.13)
W |T

6.2.3.4 More Instruments than Predictors −(M1 )


where βbf = (βbW
t bf e )−1 βbt M1 and βb e is the least squares
W |Te f |Te M1 βW |T f |Te
W Y |T
Instrumental variable estimation differs somewhat when the number of e i . Note the similarity of
coefficient estimate in the regression of Yi on T
instrumental variables exceeds the number of variables measured with (6.13) to (6.6).
error, the case we now consider. Our presentation parallels that of Sec-
tion 6.2.3. We assume the model in (6.8) and pick up the discussion
with (6.10). The key difference when dim(T) e > dim(X) e is that there are 6.3 Approximate Instrumental Variable Estimation
more equations in (6.10) than there are regression coefficients, and we 6.3.1 IV Assumptions
use generalized inverses in place of ordinary matrix inverses. Considering
(6.10), then for any generalized inverse of ΩT We have taken care to explain the conditions required for instrumen-
f , say
eW
tal variable estimation in linear models in order to make it easier to
−(M )
Ωe f = (ΩtT
eWf M ΩT
eWf)
−1 t
ΩT
eWf M, with M nonsingular, understand certain of the conditions we invoke for instrumental vari-
TW
able estimation in nonlinear models. Here, we continue to assume that
it follows that
a parametric model is correctly specified for E(Y | Z, X) and that the
−(M ) −(M )
Ω e f ΩTY
e = Ωe e ΩT
eXe βY |X
e = βY |Xe . measurement error U is independent of Z.
TW TX
Whereas lack of correlation is often sufficient when working with first-
Consequently if (Yi , Wi , Ti ), i = 1, . . . , n is an iid sample satisfying
and second-moment estimators, for example, as in linear regression, it
b e f and Ω
(6.8), Ω b e are any consistent estimators of Ω e f and Ω e ,
TW TY TW TY generally is replaced by independence when working with more compli-
c
and M converges in probability to a nonsingular matrix M , then the cated estimators. Thus, for the remainder of this chapter we work under
instrumental variable estimator, a stronger set of assumptions than those required for instrumental vari-
IV,(M c) c)
b −(M able estimation in linear models. These assumptions are the following:
βbY |Xe =Ω be ,
e f ΩTY (6.12)
TW 1. T is correlated with X;
is a consistent estimator of βY |Xe . 2. T is independent of the measurement error U = W − X in the sur-
One means of generating an estimator of the form (6.12) is to first do rogate W;
a multivariate regression of W f i on T
e i and calculate the predicted values
3. (W, T) is a surrogate for X, in particular E(Y k | Z, X, W, T) =
b i = βbt T
from it, X e i . These predicted values are denoted by X b i because
f |Te
W E(Yk | Z, X) for k = 1, 2. The key point here is that both T and
under the model (6.8), X b i = βbt T ei + Ue ∗ , where Ue ∗ = βbt T e i has the measurement error U in the surrogate W are independent of
e Te
X| i i e |Te
U
e ∗, X
bi any variation in the response Y after accounting for (Z, X). In linear
mean 0. Thus, apart from the addition of the mean zero vector U i regression, this means that T and U are independent of the residual
equals the predicted values that would be obtained from the regression error ǫ.
e i on T
of X e i . With X
b i so defined, the coefficient vector estimate from
the least-squares regression of Yi on X b i can be written as With these assumptions, we can derive an alternative explanation of
instrumental variable estimation in linear models. Note that
c ) c∗ )
IV,(M b −(M
βbY |Xe ∗ = Ω be ,

ef
TW TY E(Y | Z, T) = E{E(Y | Z, X, W, T) | Z, T}
c−1 b e e = (n P e et b P e ft
where M ∗ = Ω −1
Ti Ti ), ΩT
eWf = (n
−1
Ti Wi ) and = E{E(Y | Z, X) | Z, T}
P e TT
b
ΩTY
e = (n −1
Ti Yi ). = E(βY |1ZX + βYt |1Z X Z + βYt |1ZX X | Z, T)
c with
Alternatively, if we replace M = βY |1ZX + βYt |1Z X Z + βYt |1ZX E(X | Z, T)
Mc† = Ω
b −1 M
c1 Ω
b −1 = βY |1ZX + βYt |1Z X Z + βYt |1ZX E(W − U | Z, T)
eT
T e eT
T e

136 137
= βY |1ZX + βYt |1Z X Z + βYt |1ZX E(W | Z, T). (6.14) precisely in Section B.5.1, but we note here that in addition to the con-
ditions stated in Section 6.1, we also assume that the regression of X
The key steps in this derivation require that T is a surrogate for X and
on (Z, T, W) is approximately linear; see (B.27), that is, we assume a
that U is independent of T and Z. It follows from (6.14) that if E(W |
regression calibration model, see Section 2.2. This restricts the applicabil-
Z, T) is linear T, that is, E(W | Z, T) = βW |1ZT +βW |1Z T Z+βW |1ZT T,
ity of the methods below somewhat, but is sufficiently general to encom-
then by equating coefficients of T we get that βY |1ZT = βW |1ZT βY |1ZX ,
(−)
pass many potential applications. Combined with the classical additive
and consequently that βY |1ZX = βW |1ZT βY |1ZT when the βW |1ZT has measurement error model for W, these assumptions result in a hybrid
(−)
the required left inverse βW |1ZT . of classical and regression calibration structures, a subject discussed in
more detail in Section 6.6.
We now describe two regression-calibration approximations for use
6.3.2 Mean and Variance Function Models with instrumental variables.
We consider generalized linear models, and mean–variance models. Ex-
amples of these models are linear, logistic, and Poisson regression. As 6.3.3 First Regression Calibration IV Algorithm
described more fully in Sections A.7 and A.8, such models depend on
a linear combination of the predictors plus possibly a parameter θ that This section describes a simple method when the number of instruments
describes the variability in the response. The sections listed above give exactly equals the numbers of variables measured with error, and a sec-
details for model fitting when there is no measurement error. It might be ond method for the case dim(T) e > dim(X).e In Section B.5.1.1, it is
useful upon first reading to think of this chapter simply as dealing with shown that, to the level of approximation of regression calibration,
a class of important models, the details of fitting of which are standard e | T)}
e = mY {β t E(X
E(Y|T) e = f (β t β t T).e
e
Y |X Y |X e Te
e X|
in many computer programs.
These models can be written in general form as Since W = X+U and since U is independent of (Z, T), the regression of
f on T
W e is the same as the regression of X
e on T,
e so that β f e = β e e ,
E(Y|Z, X) = f (βY |1ZX + βYt |1Z X Z + βYt |1ZX X), (6.15) W |T X|T
and hence it follows that (approximately)
var(Y|Z, X) = σ 2 g 2 (βY |1ZX + βYt |1Z X Z + βYt |1ZX X, θ), (6.16) e.
f |Te βY |X
βY |Te = βW (6.17)
and include homoscedastic linear regression, where f (v) = v and g ≡ e in the generalized linear regression of Y on
That is, the coefficient of T
1, and logistic regression where σ 2 = 1, f is the logistic distribution e is the product of β t t
function and g 2 is the Bernoulli variance f (1 − f ). The only notational
T e
Y |X
and βWf |Te .

change with other parts of the book is that the parameters β0 , βz , and This leads to an extremely simple algorithm:
βx have been replaced by βY |1ZX , βY |1Z X and βY |1ZX , respectively. • Let dz be the number of Z variables, dx be the number of X variables,
In terms of the composite vectors and dt be the number of T variables. Perform a multivariate regression
f on T
of W e to obtain βbt , which is a matrix with 1+dim(Z)+dim(X)
e = (1, Zt , Xt )t ,
X f = (1, Zt , Wt )t ,
W f e
W |T

e = (1, Zt , Tt )t , e =W
f −X e rows and 1 + dim(Z) + dim(T) columns. For j = 1, ..., 1 + dim(Z), the
T U
j th row of βbW
f |Te has a 1.0 in the j
th
column and all other elements
defined in (6.1) and βY |Xe = (βY |1ZX , βYt |1Z X , βYt |1ZX )t , the basic model equal to zero, reflecting the fact that the regression of Z on (Z, T) has
(6.15)–(6.16) becomes no error. For k = 1, ..., dim(X), row 1 + dim(Z) + k of βbW f |Te contains
e
E(Y|X) e
= f (βYt |Xe X), the regression coefficients when regressing the k th element of W on
(Z, T), including the intercept in this regression.
e
var(Y|X) e θ).
= σ 2 g 2 (βYt |Xe X, • Then perform a generalized linear regression of Y on the predicted val-
ues βbW
t e bIV 1,RC .
The goal is to estimate βY |Xe , θ and σ . 2 e , which we denote βY |X
f |Te T to obtain an estimator of βY |X e

The assumptions that are necessary for our methods are stated more • This estimator is easily computed, as it requires only linear regressions

138 139
of the components of Wf on T,
e and then quasilikelihood and variance duces fully consistent estimators in certain important generalized lin-
function estimation of Y on the “predictors” βbW
t e
f |Te T. ear models with scalar predictor X subject to measurement error. The
method is based upon the hybrid of classical and regression calibration
The second means of exploiting the basic regression calibration ap-
models, a subject discussed in the regression calibration approximation
proximation works directly from the identity (6.17). For a fixed nonsin-
−(M ) in Section 6.3.2, and also discussed in more detail in Section 6.6.
gular matrix M1 , let βbf e 1 = (βbW
t bf e )−1 βbt M1 . The second
W |T f |Te M1 βW |T f |Te
W In the hybrid approach, along with the measurement error model W =
estimator is X + U, we have a regression calibration model for X given (Z, T), where
IV 1,(M ) −(M ) we write E(X|Z, T) = mX (Z, T, γ). Generally, the parameter γ will have
βbY |Xe 1 = βbf e 1 βbY |Te , (6.18)
W |T to be estimated, but this is the beauty inherent in the assumptions of
the hybrid approach, namely, that as long as the measurement error U
where βbY |Te is the estimated regression coefficient when the generalized
is independent of (Z, T), then (possibly nonlinear) regression of W on
model is fit to the (Y, T) e data. Note that (6.18) makes evident the
(Z, T) will provide an estimate of γ. This is generally done by solving
b
requirement that β f e be of full rank. When T and W are the same the least squares equation of the form
W |T
dimension, this estimator does not depend on M1 and is identical to n
X
the first estimator, but not otherwise. When there are more instruments ψmX (Wi , Zi , T, γ
b) = 0.
than variables measured with error the choice of M1 matters. In Section i=1
B.5.2.1 we derive an estimate M c1 that minimizes the asymptotic variance
The starting point for Buzas’ method is that along with the regression
b IV 1,(M1 )
of βY |Xe . Section B.5.2 gives the relevant asymptotic distribution, parameters βY |1ZX , there may be additional parameters τ . He then sup-
although of course the bootstrap can always be used. poses that there is a score function that produces consistent estimators
in the absence of measurement error. In his framework, the mean func-
tion is denoted by mY (Z, X, βY |1ZX , τ ). The form of the mean functions
6.3.4 Second Regression Calibration IV Algorithm of most interest here is where
The second algorithm exploits the fact that both W and T are surro- a1 + a2 exp(a5 x)
mY (Z, x; βY |1ZX ) = ,
gates. The derivation of the estimator is involved (Section B.5.1.2), but a3 + a4 exp(a5 x)
the estimator is not difficult to compute.
where a1 , . . . , a5 are scalar functions of Z and βY |1ZX , but not x. Note-
Let dim(Z) be the number of components of Z. Define
worthy in this class are
βY |Te W
f = βY |1ZT W , • The logistic mean model mY (Z, x; βY |1ZX ) = 1/{1 + exp(−βY |1ZX −
βY |1Z X Z − βY |1ZX x)}, obtained when a1 = a3 = 1, a2 = 0, a4 =
βY |Te W
f = (01×d , βYt |1ZT W )t ,
exp(−βY |1ZX − βY |1Z X Z) and a5 = −βY |1ZX .
where d = 1 + dim(Z). Then, for a given matrix M2 , the second instru- • The Poisson loglinear mean model mY (Z, x; βY |1ZX ) = exp(βY |1ZX +
mental variables estimator is βY |1Z X Z +βY |1ZX x), obtained when a1 = a4 = 0, a2 = exp(βY |1ZX +
IV 2,(M ) −(M ) bf e βb e f ). βY |1Z X Z), a3 = 1 and a5 = βY |1ZX .
βbY |Xe 2 = βbf e 2 (βbY |Te W
f + βW |T Y |T W
W |T In these problems, the score function ψ has the form
IV 2,(M )
When T and W are the same dimension, βbY |Xe 2 does not depend on ψ(Y, Z, X, βY |1ZX , τ ) =
M2 . In Section B.5.2.1, we derive an estimate of M2 that minimizes the {Y − mY (Z, X, βY |1ZX , τ )}g(Z, X, βY |1ZX , τ );
IV 2,(M )
asymptotic variance of βbY |Xe 2 for the case dim(T) > dim(W). and is such that the estimating equations
Pn
i=1 ψ(Yi , Zi , Xi , βY |1ZX , τ ) = 0,
6.4 Adjusted Score Method
produce consistent estimators of βY |1ZX and τ in the absence of measure-
Buzas (1997) developed an approach to IV estimation that, unlike the ment error. The class of estimators covered by this setup includes nonlin-
approximate regression calibration approach in Section 6.3, actually pro- ear least squares (Gallant, 1987), quasilikelihood and variance function

140 141
models (Carroll and Ruppert, 1988), and generalized linear models (Mc- 6.5 Examples
Cullagh and Nelder, 1989), among others.
6.5.1 Framingham Data
Buzas (1997) showed that when the measurement error U is sym-
metrically distributed about zero given (Z, X, T), then a score function We now illustrate the methods presented in this chapter using the Fram-
leading to consistent estimation is the following. Define ingham heart study data from Section 5.4.1, wherein two systolic blood
¯ ¯ pressure measurements from each of two exams were used. It was as-
¯ mY,x (Z, E(X | Z, T); βY |1ZX ) ¯1/2 sumed that the two transformed variates
φ(Z, W, T, βY |1ZX ) = ¯¯ ¯
mY,x (Z, W; βY |1ZX ) ¯
W1 = log{(SBP3,1 + SBP3,2 )/2 − 50}
with mY,x (Z, x; βY |1ZX ) = (∂/∂x)mY (Z, x; βY |1ZX ). Then the modi- and
fied score leading to consistent estimation is W2 = log{(SBP2,1 + SBP2,2 )/2 − 50},
ψIV (Y, Z, W, T, βY |1ZX , τ ) = where SBPi,j is the j th measurement of SBP from the ith exam, j =
1, 2, i = 2, 3, were replicate measurements of the long-term average
{Y − mY (Z, W; βY |1ZX )}φ(Z, W, T, βY |1ZX )
transformed SBP.
× g(Z, E(X | Z, T), βY |1ZX , τ ). Table 6.1 displays estimates of the same logistic regression model fit in
Inference can be obtained either by the bootstrap or by using the method Section 5.4.2.1 with the difference that W2 was employed as an instru-
of stacking estimating equations in Section A.6.6, that is, stack ψmX (·) mental variable, not as a replicate measurement, that is, in the notation
below ψIV (·). of this section, W = W1 and T = W2 .
IV 1,(M )
Because T has the same dimension as W, the estimate β e 1 does
Y |X
not depend on M1 and is equivalent to βYIV|Xe1,RC . This common estimate
IV 2,(M )
Age Smoke Chol LSBP is listed under IV1 in Table 6.1. Also βY |Xe 2 does not depend on M2
and is listed under IV2 in the table. For the Buzas estimate, a linear
Naive .056 .573 .0078 1.524 regression model for E(X | Z, T) was used, M (Z, T, γ) = (1, Zt , Tt )γ
Std. Err. .010 .243 .0019 .364 with γ b obtained by least squares, so that
ψmX (W, Z, T, γ) = {W − (1, Zt , Tt )γ}(1, Zt , Tt )t .
IV1 .054 .577 .0076 2.002
Table 6.2 displays estimates of the same logistic regression model with
Std. Err. .011 .244 .0020 .517
the difference that the instrumental variable T was taken to be the two-
dimensional variate
IV2 .054 .579 .0077 1.935
Std. Err. .011 .244 .0020 .513 T = {log(SBP2,1 ), log(SBP2,2 )}. (6.19)
Note the similarity among the estimates in Tables 6.1 and 6.2.
Adj Score .055 .597 .0082 1.930 The primary purpose of this second analysis is to illustrate the differ-
Std. Err. .011 .250 .0020 .494 ences between the estimators when dim(T) > dim(X), and to emphasize
that T need only be correlated with X, and not a second measurement,
for the methods to be applicable.
Table 6.1 Estimates and standard errors from the Framingham data in- However, we also use this model to illustrate further the key assump-
strumental variable logistic regression analysis. This analysis used the one- tions (6.9) and to discuss the issues involved in verifying them.
dimensional instrumental variable LSBP = log{(SBP2,1 + SBP2,2 )/2 − 50}. Rewrite the IV model (6.19) as T = (T1 , T2 ). Given our previous
“Smoke” is smoking status and “Chol” is cholesterol level. Standard errors discussions of the Framingham data, it follows that reasonable models
calculated using the sandwich method. for W and T are
W = X + U,

T_1 = a_1 + b_1 X + U_1,
T_2 = a_2 + b_2 X + U_2,                                        (6.20)

where U, U_1, and U_2 are mutually independent and independent of
Y, X, and Z. These independence assumptions are comparable to those
used previously to justify the various analyses of the Framingham data.
With ǫ replaced by Y − E(Y|Z, X) (because the model for Y is logistic,
not linear), the aforementioned independence assumptions ensure the
validity of the first two components of (6.9). Now for the model (6.20)
the crossproduct matrix Ω_{T̃X̃} is the 6 × 5 matrix,

Ω_{T̃X̃} = E [ 1             Z^t                 X
               Z             ZZ^t                ZX
               a_1 + b_1 X   (a_1 + b_1 X)Z^t    (a_1 + b_1 X)X
               a_2 + b_2 X   (a_2 + b_2 X)Z^t    (a_2 + b_2 X)X ]

(recall that dim(Z) = 3 and dim(X̃) = 5 for the Framingham data). It
should be obvious that rank(Ω_{T̃X̃}) = 5, if and only if at least one of
b_1 and b_2 is nonzero. (Okay, we’re kidding about the “obvious” part —
you’ll just have to take our word on this one. However, for those doubting
Thomases among the readers, we suggest that you first consider the case
where the components of X and Z are iid and standardized to mean zero
and variance one, and then extend to the general case.)
This example illustrates the fact that not all instruments need to be
correlated with X, and that if multiple instruments are available (all
satisfying (6.9) of course) there is no harm using them. In fact, as Fuller
(1987, p. 154) notes, adding more instrumental variables can improve
the quality of an IV estimator.

               Age     Smoke   Chol    LSBP

Naive          .056    .573    .0078   1.524
Std. Err.      .010    .243    .0019   .364

IV1,RC         .054    .577    .0076   1.877
Std. Err.      .011    .244    .0020   .481

IV1,(M_1)      .054    .577    .0076   1.884
Std. Err.      .011    .244    .0020   .483

IV2,(M_2)      .054    .579    .0077   1.860
Std. Err.      .011    .244    .0020   .484

Adj Score      .055    .592    .0082   1.887
Std. Err.      .011    .250    .0020   .494

Table 6.2 Estimates and standard errors from the Framingham data
instrumental variable logistic regression analysis. This analysis used the
two-dimensional instrumental variable {log(SBP2,1), log(SBP2,2)}. “Smoke”
is smoking status and “Chol” is cholesterol level. Standard errors calculated
using the sandwich method.

6.5.2 Simulated Data

The instrumental variable results in Table 6.2 are very close to what was
obtained for regression calibration and SIMEX; see Table 5.1 in Section
5.4.1. The reader can then be forgiven for concluding that instrumental
variable analyses are equivalent to corrections for attenuation. The
Framingham data, though, are a special case, because the instrument
is for all practical purposes an unbiased estimate of X with the same
measurement error as that of W. A simulation will help dispel the notion
that instrumental variables are always equivalent to corrections for
attenuation. First the simulation. Consider linear regression of Y on X,
with slope βx = 1 and error about the line σ_ǫ² = 1. Let the sample size be
n = 400. Let X = Normal(0, 1), and let the measurement error variance
σ_u² = 1, so that the reliability ratio is λ = 0.50. We consider two cases. In
the first, W is replicated in order to estimate σ_u², but only the first
replicate is used. In the second, we have an instrument T = 0.2X + ν, where
ν = Normal(0, 1). Here, the instrument does not have a high correlation
with W, so the division inherent in (6.6) is bound to cause a problem.
Then, in 500 simulations, the naive estimator is biased as expected, and
the correction for attenuation and instrumental variable estimators are
nearly unbiased. However, the correction for attenuation estimator had
much less variability than the instrumental variables estimator, either
in its raw form or with the correction for small samples; see Figure 6.1.
Even the corrected form has a variance more than four times greater
than the correction for attenuation.
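A minimal Python sketch of this simulation is given below. It is an
illustration only: the constants are those stated above, the variable names
are ours, and the small-sample corrected version of the IV estimator shown
in Figure 6.1 is omitted.

    import numpy as np

    rng = np.random.default_rng(1)
    n, n_sim = 400, 500
    beta_x, sigma_eps, sigma_u = 1.0, 1.0, 1.0

    naive, corrected, iv = [], [], []
    for _ in range(n_sim):
        x = rng.normal(0, 1, n)
        y = beta_x * x + rng.normal(0, sigma_eps, n)
        w1 = x + rng.normal(0, sigma_u, n)   # first replicate, used in the fits
        w2 = x + rng.normal(0, sigma_u, n)   # second replicate, used only to estimate sigma_u^2
        t = 0.2 * x + rng.normal(0, 1, n)    # weak instrument

        # naive least squares slope of Y on W1
        b_naive = np.cov(w1, y)[0, 1] / np.var(w1, ddof=1)
        naive.append(b_naive)

        # correction for attenuation: divide by the estimated reliability ratio
        sigma_u2_hat = np.mean((w1 - w2) ** 2) / 2.0
        lam_hat = 1.0 - sigma_u2_hat / np.var(w1, ddof=1)
        corrected.append(b_naive / lam_hat)

        # instrumental variable ratio estimator: cov(Y,T)/cov(W1,T)
        iv.append(np.cov(y, t)[0, 1] / np.cov(w1, t)[0, 1])

    for name, est in [("naive", naive), ("corrected", corrected), ("IV", iv)]:
        est = np.asarray(est)
        print(f"{name:10s} mean = {est.mean():6.3f}  sd = {est.std(ddof=1):6.3f}")

The output reproduces the qualitative message of Figure 6.1: the naive slope
is attenuated, while both corrected estimators are nearly unbiased but the
weak-instrument IV estimator is far more variable.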
6.6 Other Methodologies

6.6.1 Hybrid Classical and Regression Calibration

We have seen two examples in which the classical additive measurement
error model relating (W, X) is combined with a parametric regression
calibration model relating X to (Z, T); see Sections 6.3.2 and 6.4. Several
papers use the same basic modeling strategy.
Hausman, Newey, Ichimura, et al. (1991) consider the polynomial
regression model in which the unobserved true X is measured with
classical additive error, while it is related to the instrument through a
regression calibration model (Section 2.2). Specifically, their model is that
W = X + U, and that

Y = β_0 + β_z^t Z + Σ_{j=1}^p β_{x,j} X^j + ǫ;
X = α_0 + α_z^t Z + α_t^t T + ν.                               (6.21)

Of course, (6.21) is a regression calibration model. Hausman, Newey,
Ichimura, et al. (1991) assume, in effect, that ǫ and U are each independent
of (Z, T), but that they need not be independent of one another.
They also assume that ν is independent of (Z, T), but they allow ν and
ǫ to be correlated. Effectively, their method is to compute higher-order
moments of the observed data to show that all parameters can be
identified, and then to estimate these moments.

Figure 6.1 Comparison of methods in simulated data when the instrument is
weak. Top left: naive estimate. Bottom left: correction for attenuation. Top
right: instrumental variables estimator. Bottom right: instrumental variables
with small sample correction.

Schennach (2006) also considers a hybrid version of the classical and
regression calibration approaches when X is scalar and there are no
covariates Z. Her general model also has W = X + U and takes the
form

Y = m_Y(X, B) + ǫ;
X = m_X(T, γ) + ν.                                             (6.22)

She assumes, in effect, that ǫ is independent of (ν, T), that U is
independent of (ǫ, ν, T), and that ν is independent of T. She notes that it
is possible to extend the model to include Z. The functional forms of m_Y(·)
and m_X(·) are assumed known, with the unknowns being the parameters
(B, γ). Her method is more complex than in the polynomial case.

6.6.2 Error Model Approaches

In the instrumental variable context, hybrid models such as (6.21) and
(6.22) are appealing because their means as a function of (Z, T) can
be estimated simply by regressing W on (Z, T). This has led us to
regression calibration as an approximate device (Section 6.3), the adjusted
score method for certain special problems (Section 6.4), and other
modeling approaches (Section 6.6.1). All these methods are intrinsically
different from the approaches to measurement error modeling when the
error variance is known or can be estimated in other ways, for example,
replication.
Under stronger technical, although not practical, conditions that were
previously discussed, however, it is possible to achieve identifiability of
estimation and also to employ methods from previous and succeeding
chapters, for example, SIMEX; see Carroll, Ruppert, Crainiceanu, et al.
(2004). First consider the case when there is no Z. For scalar X, Carroll,
Ruppert, Crainiceanu, et al. (2004) start with a model that relates the
response to covariates and random error as

Y = G(X, B, ǫ).

This is a completely general model including generalized linear models,
nonlinear models, etc. These authors assume the usual classical additive
error model W = X + U, and they relate the instrument via a
generalization of the biased classical error model in (2.1) discussed in
Section 2.2.1. Specifically, to begin with they assume that

T = α_0 + α_1 X + ν,                                           (6.23)
and that (ǫ, U, ν, X) are all mutually independent. Then, it follows that

var(U) = σ_u² = var(W) − cov(W, T) cov(Y, W) / cov(Y, T).

One can thus estimate σ_u² by replacing the variance and covariances by
their sample versions and then using one’s favorite estimation method
tuned to the case where an estimate of σ_u² is available. If the model
G(X, B, ǫ) is simple linear regression, this algorithm produces the usual
instrumental variable estimator. It is worth pointing out that this
method-of-moments estimate need not be positive, or smaller than var(W),
and Carroll, Ruppert, Crainiceanu, et al. (2004) suggest placing bounds on
the attenuation.
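A small Python sketch of this moment calculation is given below. It is
illustrative only: the data and names are hypothetical, and the estimate is
truncated to the interval [0, var(W)] in the spirit of the bounding
suggestion above.

    import numpy as np

    def estimate_sigma_u2(y, w, t):
        """Method-of-moments estimate of the measurement error variance
        var(U) = var(W) - cov(W,T)*cov(Y,W)/cov(Y,T),
        truncated to the admissible interval [0, var(W)]."""
        var_w = np.var(w, ddof=1)
        cov = lambda a, b: np.cov(a, b)[0, 1]
        sigma_u2 = var_w - cov(w, t) * cov(y, w) / cov(y, t)
        return float(np.clip(sigma_u2, 0.0, var_w))

    # toy example with a scalar error-prone predictor and a scalar instrument
    rng = np.random.default_rng(0)
    x = rng.normal(size=2000)
    y = 1.0 + 2.0 * x + rng.normal(size=2000)
    w = x + rng.normal(scale=0.7, size=2000)            # true sigma_u^2 = 0.49
    t = 0.5 + 0.8 * x + rng.normal(scale=0.5, size=2000)
    print(estimate_sigma_u2(y, w, t))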
More generally, we can include Z by writing the model

Y = G(Z, X, B, ǫ);
T = f_0(Z) + f_1(Z) X + ν,

the latter being a varying coefficient model, with functions f_0 and f_1.
Carroll, Ruppert, Crainiceanu, et al. (2004) show how to estimate σ_u² in
this general model, even when f_0 and f_1 are modeled nonparametrically.
Once σ_u² is estimated, estimating the model parameter B can be done
by any of the methods discussed elsewhere in this book.
For example, in the Framingham data, it makes sense to use (6.23),
since the instrument is a second, possibly biased, measurement of true
blood pressure. When we applied the method to estimate σ_u² and then
used regression calibration, we obtained estimates and standard errors
that were essentially the same as in Table 6.1.

Bibliographic Notes

The literature on instrumental variable estimation in linear models is
extensive. For readers wanting more than the introduction in Section 6.2,
a good place to start is Fuller (1987). Interest in instrumental variables
in nonlinear measurement error models is more recent, and a number
of methods have been proposed that are generally either more specialized,
or more involved than the methods described in this chapter. See,
for example, Amemiya (1985, 1990a, 1990b) for general nonlinear
measurement error models; Stefanski and Buzas (1995), Buzas and Stefanski
(1996b), and Thoresen and Laake (1999) for binary measurement error
models; Buzas and Stefanski (1996c) for certain generalized linear models;
and Hausman, Newey, Ichimura, et al. (1991) for polynomial models.
Carroll, Ruppert, Crainiceanu, et al. (2004) provided identifiability results
for very general measurement error models with instrumental variables,
and develop methods for estimating nonlinear measurement error models
nonparametrically by combining an estimate of the measurement error
variance derived from the instrument with methods of nonparametric
estimation for measurement error models with known error variance.
Other papers of general interest are Carroll and Stefanski (1994) and
Greenland (2000).
CHAPTER 7

SCORE FUNCTION METHODS

7.1 Overview

Regression calibration (Chapter 4) and SIMEX (Chapter 5) are widely


applicable, general methods for eliminating or reducing measurement
error bias. These methods result in estimators that are consistent in
important special cases, such as linear regression and loglinear mean
models, but that are only approximately consistent in general.
In this chapter, we describe methods that are almost as widely ap-
plicable, but that result in fully consistent estimators more generally.
Consistency is achieved by virtue of the fact that the estimators are M-
estimators whose score functions are unbiased in the presence of mea-
surement error. This property is also true of structural model maximum
likelihood and quasilikelihood estimates, as discussed in Chapter 8. The
lack of assumptions about the unknown Xi distinguishes the methods
in this chapter from those in Chapter 8. The methods are functional
methods, as defined in Section 2.1.
However, we do not deal with functional modeling as it is used in the
linear models measurement error setting, for it is not a viable option for
nonlinear measurement error models. Suppose for the sake of discussion
that the measurement error covariance matrix Σuu is known. In the old
classical functional model, the unobservable Xi are fixed constants and
are regarded as parameters. With additive, normally distributed mea-
surement error, functional maximum likelihood maximizes the joint den-
sity of the observed data with respect to all of the unknown parameters,
including the Xi . While this works for linear regression (Gleser, 1981),
it fails for more complex models such as logistic regression (Stefanski
and Carroll, 1985). Indeed, the functional estimator in most nonlinear
models is both extremely difficult to compute and not even consistent
or valid. The methods in this chapter make no assumptions about the
Xi , are often easier computationally, and lead to valid estimation and
inference.
We focus on the case of additive, normally distributed measurement
error, so that W = X+U with U distributed as a normal random vector
with mean zero and covariance matrix Σuu , and two broad classes of
score function methods that have frequent application.

• The conditional-score method of Stefanski and Carroll (1987) exploits in the absence of measurement error is
special structures in important models such as linear, logistic, Poisson    
1
loglinear, and gamma-inverse regression, using a traditional statistical  {Yi − (1, Zti , Xti )Θ1 }  Zi  
device, conditioning on sufficient statistics, to obtain estimators.  

ΨLS (Yi , Zi , Xi , Θ) =  µ Xi  . (7.1)
¶ 
• The corrected-score method effectively estimates the estimator that  n−p 2
σ 2 − {Yi − (1, Zti , Xti )Θ1 }
one would use if there were no measurement error. n
We start with linear and logistic regression, using these important spe- The upper equation is the least squares score function (the so-called
cial cases both as motivation for and explanation of the general methods. normal equations) for Θ1 , the regression parameters. The factor (n−p)/n
Next, the conditional- and corrected-score methods are illustrated with a with p = dim(Θ1 ) in the lower equation implements the usual degrees-
logistic regression example in Section 7.2.3. Then, in successive sections, of-freedom correction for the estimator of σ 2 .
we describe the conditional-score and corrected-score methods in detail,
covering the basic theory and giving examples for each method. 7.2.1.1 Linear Regression Conditional Score
We warn the reader that the mathematical notation of conditional and
We now describe an approach to consistent estimation that requires no
corrected scores is more complex than that of regression calibration and
assumptions about the X-variables. The derivation of the method, but
SIMEX. However, the formulae are simple to program and implement,
not its validity, assumes normality of the true-regression equation error,
with the possible exception of Monte Carlo corrected scores (MCCS),
ǫi , as well as the measurement errors Ui . Define
which require complex variable computation that may not be available in
all programming languages. More important, the methods in this chapter ∆i = Wi + Yi Σuu βx /σ 2 . (7.2)
result in fully consistent estimators under the conditions stated on the
true-data model and the error model, not just approximately consistent, Given Zi and Xi , the random variables Yi and ∆i are linear functions
as is often the case for regression calibration and SIMEX. of jointly normal random vectors and thus are jointly normal, condition-
ally on (Zi , Xi ). Consequently, the conditional distribution of Yi given
(Zi , Xi , ∆i ) is also normal, and standard multivariate-normal calcula-
7.2 Linear and Logistic Regression tions show that
This section introduces the ideas of corrected and conditional scores in β0 + βzt Zi + βxt ∆i
E(Yi | Zi , Xi , ∆i ) = E(Yi | Zi , ∆i ) = ,
two important problems, namely, linear regression and logistic regres- 1 + βxt Σuu βx /σ 2
sion. In linear regression, of course, we already know how to construct σ2
valid estimation and inferential methods, as described in Section 3.4, so var(Yi | Zi , Xi , ∆i ) = var(Yi | Zi , ∆i ) = . (7.3)
1 + βxt Σuu βx /σ 2
nothing really new is being done here: The calculations are simply easier
to follow for this case, and those wishing to understand the ideas, espe- These conditional moments are noteworthy for their lack of dependence
cially the new ideas for corrected scores, will find the linear regression on Xi . We will show in Section 7.3 that this is by design, that is, the
calculations give useful insight. For logistic regression, these methods manner in which ∆i is defined ensures that these moments depend only
produce consistent, and not just approximately consistent methods. on the observed data and not on Xi .
It follows from (7.3) that the conditional score,
  
1
7.2.1 Linear Regression Corrected and Conditional Scores  {Yi − E(Yi | Zi , ∆i )}  Zi  
 
Consider the multiple linear regression model with mean E(Y | Z, X) = ΨCond (Yi , Zi , Wi , Θ) = 
 ∆i  ,
 2 
β0 + βzt Z + βxt X, variance var(Y | Z, X) = σ 2 , and the classical additive, {Y i − E(Y i | Z i , ∆ i )}
σ2 −
nondifferential error model W = X + U with U = Normal(0, Σuu ) var(Yi | Zi , ∆i )/σ 2
where Σuu is known. Write the unknown regression parameter as Θ1 =
has the property that
(β0 , βzt , βxt )t and Θ = (Θt1 , Θ2 )t with Θ2 = σ 2 .
The ordinary least squares score function for multiple linear regression E {ΨCond (Yi , Zi , Wi , Θ) | Zi , ∆i } = 0,

so its unconditional mean also vanishes. Thus, ΨCond can be used to form Σuu . Consider the complex-valued random variate,
unbiased estimating equations, f b,i = Wi + ιUb,i .
Pn W (7.6)
i=1 ΨCond (Yi , Zi , Wi , Θ) = 0. (7.4)
The Monte Carlo corrected score (MCCS) is obtained in three steps:
However, in practice we estimate the parameters by solving the small- 1. Replace Xi with W f b,i in a score function that is unbiased in the
sample modified estimating equations absence of measurement error — for linear least squares regression
   
1 this is (7.1).
 {Y i − E(Y i | Z i , ∆ i )}  Zi   2. Take the real part, Re(·), of the resulting expression to eliminate the
Pn  
 ∆i  = 0. (7.5) imaginary part.
i=1  µ 
 n − p¶ {Y i − E(Y i | Z i , ∆i )}
2  3. Average over multiple sets of pseudorandom vectors, b = 1, . . . , B.
σ2 −
n var(Yi | Zi , ∆i )/σ 2 For linear regression, these steps result in
n o
The factor (n − p)/n in the equation for σ 2 implements the degrees-of- e MCCS,B (Yi , Zi , Wi , Θ) = B −1 PB Re ΨLS (Yi , Zi , W
Ψ f b,i , Θ)
b=1
freedom correction for the estimator of σ 2 . The asymptotic theory of M-      
estimators in Section A.6 can be applied to approximate the distribution 1 0
b
of Θ.  {Yi − (1, Zti , Wit )Θ1 }  Zi  +  0  
 
= Wi cu,i βx
M ,
µ ¶ 
7.2.1.2 Linear Regression Corrected Score  n−p 2 
cu,i βx
σ 2 − {Yi − (1, Zti , Wit )Θ1 } + βxt M
n
We now derive the corrected score for linear regression using the general
method of construction described in Section 7.4. The corrected score for where M cu,i = B −1 PB Ub,i UT . Because E(M cu,i ) = Σuu , it follows
b=1 b,i
linear regression is readily obtained using other approaches, hence the that for all i and B,
general method of construction is overkill for this case. However, the n o
E Ψ e MCCS,B (Yi , Zi , Wi , Θ) | Zi , Xi = ΨLS (Yi , Zi , Xi , Θ), (7.7)
development is instructive and readily generalizes to problems of greater
interest. and consequently that, if we ignore the degrees-of-freedom correction
The general method of constructing corrected scores uses complex factor (n − p)/n in (7.1),
variables and complex-valued functions. Although familiarity with com- n o
plex variables is helpful to understand how the method works, it is not E Ψ e MCCS,B (Yi , Zi , Wi , Θ) = 0. (7.8)
essential to using, or even implementing the methods, provided one uses
a programming language with complex number capabilities (GAUSS and Equation (7.7) provides insight into how corrected scores work. The cor-
rected score, Ψe MCCS,B (Yi , Zi , Wi , Θ), is an unbiased estimator of the
MATLAB have such capabilities, for example). We use the √ bold Greek
letter iota (ι) to denote the unit imaginary number, ι = −1 , to dis- score that would have been be used, ΨLS (Yi , Zi , Xi , Θ), if measurement
tinguish it from the observation index i. Only a few facts about complex error were not present.
numbers are used in this section: a) ι2 = −1; b) the real part of com- It follows from general M-estimator theory (Section A.6) that under
plex number is Re(z1 + ιz2 ) = z1 ; and c) if f (z1 + ιz2 ) is a function regularity conditions the estimating equations,
Pn e
of a complex variable, then f (z1 + ιz2 ) = g(z1 , z2 ) + ιh(z1 , z2 ), where g ΨMCCS,B (Yi , Zi , Wi , Θ) = 0, (7.9)
i=1
and h are both real-valued functions and g is the real part of f , that is,
g(z1 , z2 ) = Re{f (z1 + ιz2 )}. admit a consistent and asymptotically normal sequence of solutions.
The general method of construction has a similar feel to SIMEX, in We can gain further insight into the workings of corrected scores by
that we use the computer to generate random variables to help in defining solving (7.9) for the case of linear regression, resulting in
³ ´−1
an estimator. In the case of corrected scores, these random variables are b1 = M c1zw,1zw − Ωe cy,1zw ,
Θ M
defined as follows. ½³ ¾
Now, for b = 1, ..., B, generate random variables Ub,i that are inde- Pn ´2
b2 = (n − p)−1 i=1
σ Yi − Yb i − βbt Σ b uu
bx ,
β
pendent normal random vectors with mean zero and covariance matrix x

     
where 1 0
 {Yi − (1, Zti , Wit )Θ1 }  Zi  +  0  
   
1 Zti Wit  Wi Σuu βx .
µ ¶ 
c1zw,1zw = n −1 Pn  Zi  n−p 
M i=1 Zi Zti Zi Wit  , 2 t t 2 t
σ − {Yi − (1, Zi , Wi )Θ1 } + βx Σuu βx
Wi Wi Zti Wi Wit n
 
0 0 0 ³ ´ Note E{Ψ e MCCS,B (Yi , Zi , Wi , Θ) | Yi , Zi , Wi } = Ψ
e CS (Yi , Zi , Wi , Θ). We
e = 0 0
Ω 0 , Σ b uu = n−1 Pn M c
i=1 u,i , e CS (Yi , Zi , Wi , Θ) a corrected score to distinguish it from the Monte
call Ψ
0 0 b uu
Σ e MCCS,B . Corrected scores for certain other simple
  Carlo corrected score Ψ
P 1 common statistical models can be found without using Monte Carlo aver-
cy,1zw n b i = (1, Zt , Wt )Θ
b 1.
M = n−1 i=1 Yi  Zi  , Y i i aging, and some of these are given in Section 7.4.3. However, whenever
Wi a corrected score exists, the Monte Carlo corrected score estimates it
precisely for B large and avoids the mathematical problem of finding it,
Because we are working under the assumption that Σuu is known, although, of course, at the cost of complex variable computation.
it probably seems odd, and it is certainly inefficient, that these esti- Note that in the above discussion, no assumptions were made about
mators depend on the random matrix Σ b uu . The sensible strategy is to the true-regression equation error, ǫi = Yi − E(Yi | Zi , Xi ), either
b
replace Σuu with Σuu . Doing so yields, apart from degrees-of-freedom in practice or in the derivation. A final important point to note about
corrections on the relevant covariance matrices, the usual linear mod- the corrected-score method is that no assumptions are made about the
els, method-of-moments correction for measurement error bias (Fuller, unobserved X variables other than those assumptions that would be
1987). Practically, the substitution of Σuu for Σb uu can also be accom- needed to ensure consistent estimation in the absence of measurement
b error. This fact follows from the key property (7.7).
plished by taking B large, because Σuu converges to Σuu as B → ∞.
Usually, in practice B does not need to be very large to obtain good
results. This is because the randomness introduced in the construction 7.2.2 Logistic Regression Corrected and Conditional Scores
of the Monte Carlo corrected scores is subject to double averaging over
Now we consider the multiple logistic regression model, pr(Y = 1 |
n and B. This is apparent in the linear regression corrected-score esti-
Z, X) = H(β0 + βzt Z + βxt X), where H(t) = 1/{1 + exp(−t)} is the
mator, as it depends on the Ub,i only via
logistic distribution function, and the classical additive, nondifferential
error model W = X + U with U = Normal(0, Σuu ) where Σuu is known.
b uu = (nB)−1 Pn PB Ub,i Ub,i
Σ t
, Write the unknown regression parameter as Θ = (β0 , βzt , βxt )t .
i=1 b=1
The maximum likelihood score function for multiple logistic regression
and the variances of the components of this random matrix are on the in the absence of measurement error is
 
order of (nB)−1 . 1
Herein lies the advantage of the general theory in Section 7.4. For ΨML (Yi , Zi , Xi , Θ) = [Yi − H{(1, Zti , Xti )Θ}]  Zi  . (7.10)
many measurement error models, substituting the complex variate W f b,i Xi
defined in (7.6) for Xi into a score function that is unbiased in the ab-
sence of measurement error, taking the real part, and averaging over 7.2.2.1 Logistic Regression Conditional Score
b = 1, . . . , B results in an unbiased score that is a function of the ob- The conditional-score method for logistic regression is similar to that for
served data. In the linear model we can shortcut the pseudorandom linear regression. We again start by defining
number generation and averaging, because all of the expressions involved
depend only on first- and second-order sample moments. The corrected ∆i = Wi + Yi Σuu βx . (7.11)
score in this case is Note that the definition in (7.11) differs slightly from that in (7.2), due
to the absence of a variance parameter in logistic regression. Conditioned
e MCCS (Yi , Zi , Wi , Θ) =
Ψ on (Zi , Xi ), both Yi and Wi have exponential family densities. Standard

exponential family calculations (a good exercise) show that the linear model, the resulting expression is not very enlightening, does
¡ ¢ not have a closed-form solution, and its limit as B → ∞ is not easy to
E(Yi | Zi , Xi , ∆i ) = H β0 + βzt Zi + βxt ∆i − βxt Σuu βx /2
obtain. Expanding and simplifying are also not necessary for comput-
= E(Yi | Zi , ∆i ) ing purposes, provided the programming software has complex number
= pr(Yi = 1 | Zi , ∆i ). (7.12) capabilities. Because the logistic model is not covered by the mathemat-
ical theory of corrected scores, analogues of neither (7.7) or (7.8) hold
As in linear regression, the conditional distribution of Yi given (Zi , ∆)
exactly, but both hold to a high degree of approximation.
does not depend on Xi . It follows from (7.12) that the conditional score,
  As with the linear model, estimating equations are formed in the usual
1 fashion, that is,
ΨCond (Yi , Zi , Wi , Θ) = {Yi − E(Yi | Zi , ∆i )}  Zi  ,
∆i Pn e
i=1 ΨMCCS,B (Yi , Zi , Wi , Θ) = 0,
has the property that
E {ΨCond (Yi , Zi , Wi , Θ) | Zi , ∆i } = 0, and large-sample inference uses the standard M-estimation methods in
Section A.6.
so its unconditional mean also vanishes. Thus ΨCond can be used to form
unbiased estimating equations,
Pn
i=1 ΨCond (Yi , Zi , Wi , Θ) = 0, (7.13) 7.2.3 Framingham Data Example
to which the standard asymptotic theory on M-estimators in Section
b For issues of
A.6 can be applied to approximate the distribution of Θ. We illustrate the corrected- and conditional-score methods for logistic
computation, see Section 7.5. regression with the Framingham data used in the example of Section
4.3. All of the replicate measurements were used, and thus our variance
7.2.2.2 Logistic Regression Corrected Score estimate is based on 1,614 degrees of freedom and we proceed under the
assumption that the sampling variability in this estimate is negligible,
We now derive the corrected score for logistic regression using the gen-
that is, the case of known measurement error.
eral method of construction described in Section 7.4. The logistic model
does not satisfy the smoothness conditions required by the corrected- Estimates and standard errors are in Table 7.1, for the conditional-
score theory. However, Novick and Stefanski (2002) showed that even score estimator (7.13), the corrected-score estimators (7.13) with B = 10
though the logistic score does not have the requisite smoothness prop- and B = 10, 000, and the naive estimates for comparison. These esti-
erties, the corrected-score method can still be applied, and as long as mates should be compared with those in Table 5.1, where almost the
the measurement error variance is not large, it produces nearly consis- same answers are obtained. As explained in Section (7.2.2), conditional-
tent estimators. In other words, when applied to logistic regression, the score estimators are fully consistent as long as the logistic model and
corrected-score method is approximate in the sense of reducing measure- normal error model hold, and possess certain asymptotic variance opti-
ment error bias, but the quality of the approximation is so remarkably mality properties. The standard errors in Table 7.1 were computed from
good that the bias is negligible in practice. the sandwich-formula variance estimates in Section 7.5.1.
The method of construction is identical to that for the linear model The difference among the three measurement error model estimates
with the one exception that ΨML in (7.10) replaces ΨLS in (7.1). With is clearly negligible. The equivalence, to the number of significant digits
f b,i defined as in (7.6), the corrected score for logistic regression is
W presented, of the corrected-score estimators for B = 10 and B = 10, 000
n o supports the claim made in Section 7.2.1 that B does not need to be
e MCCS,B (Yi , Zi , Wi , Θ) = B −1 PB Re ΨML (Yi , Zi , W
Ψ f i , Θ) . very large to obtain satisfactory results. The similarity between the
b=1
conditional-score estimates and the corrected-score estimates supports
Just as we did for the linear model in (7.2.1), it is possible to expand the claim that the corrected-score procedure is, for most practical pur-
f i , Θ)} to obtain an ex-
and simplify the expressions Re{ΨML (Yi , Zi , W poses, comparable to consistent methods, even though it is not covered
e MCCS,B in terms of standard functions. However, unlike
pression for Ψ by the theory in Section 7.4.

Age Smoke Chol LSBP Age Smoke LChol LSBP

Naive .055 .59 .0078 1.71 Naive .056 .57 2.04 1.52
Std. Err. .010 .24 .0019 .39 Std. Err. .011 .24 .52 .37

Conditional .053 .60 .0078 1.94 SIMEX (STATA) .055 .58 2.53 1.85
Std. Err. .011 .24 .0020 .44 Std. Err. .010 .26 .73 .45

Corrected (B = 10) .054 .60 .0078 1.94 Conditional .054 .60 2.84 1.93
Std. Err. .011 .24 .0020 .44 Std. Err. .011 .25 .72 .47

Corrected (B = 104 ) .054 .60 .0078 1.94 Corrected (B = 102 ) .054 .59 2.83 1.92
Std. Err. .011 .24 .0020 .44 Std. Err. .011 .25 .72 .47

Corrected (B = 104 ) .054 .59 2.82 1.92


Std. Err. .011 .25 .72 .47
Table 7.1 Conditional-score and corrected-score estimates and sandwich
standard errors from the Framingham data logistic regression analyses.
Here “Smoke” is smoking status, “Chol” is cholesterol, and “LSBP” is
log(SBP−50). Two sets of corrected-score estimates were calculated using dif- Table 7.2 Conditional-score and corrected-score estimates and sandwich stan-
ferent levels of Monte Carlo averaging, B = 10 and B = 10, 000. dard errors from the Framingham data logistic regression analyses with both
SBP and cholesterol measured with error. Here “Smoke” is smoking sta-
tus, “LChol” is log(cholesterol), and “LSBP” is log(SBP−50). Two sets of
7.2.3.1 Two Predictors Measured with Error
corrected-score estimates were calculated using different levels of Monte Carlo
An appealing feature of the conditional- and corrected-score methods is averaging, B = 100 and B = 10, 000.
the ease with which multiple predictors measured with error are handled.
We now consider the Framingham logistic model for the case in which
both systolic blood pressure and serum cholesterol are measured with 0.73 and λ2 = 0.76, respectively for W1 and W2 , with linear model
error, first analyzed in Section 5.4.3 using SIMEX. corrections for attenuation of 1/λ1 = 1.37 and 1/λ2 = 1.32. So in the
Recall that when serum cholesterol entered the model as a predic- absence of strong multicollinearity the conditional- and corrected-score
tor measured with error, error modeling considerations indicated that estimators of the coefficients of log(cholesterol) and log(SBP−50) should
a log transformation was appropriate to homogenize error variances. be inflated by approximately 37% and 32%, compared to the naive esti-
Thus, as in Section 5.4.3, the true-data model includes predictors Z1 = mates.
age, Z2 = smoking status, X1 = log(cholesterol) at Exam #3 and The results of the analysis displayed in Table 7.2 are consistent with
X2 = log(SBP−50) at Exam #3. The error model is (W1 , W2 ) = expectations. Difference between the conditional- and corrected-score
(X1 , X2 ) + (U1 , U2 ), where (U1 , U2 ) is bivariate normal with zero mean estimates are negligible, and the bias correction in the estimates is con-
and covariance matrix Σu , with error covariance matrix estimate, sistent with the reliability ratios reported above, for log(cholesterol),
µ ¶ 2.82/2.04 = 1.38 ≈ 1.37 and for log(SBP−50), 1.92/1.52 = 1.26 ≈ 1.32.
b 0.00846 0.000673
Σu = . For comparison, we include the naive and SIMEX estimates from Section
0.000673 0.0126
5.4.3. Recall that the SIMEX estimates are somewhat undercorrected for
The two error variances result in marginal reliability ratios of λ1 = bias, as explained in Section 5.4.3.

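The Monte Carlo corrected score is equally simple to program when complex
arithmetic is available. The sketch below is an illustrative Python version
for logistic regression with a single error-prone covariate and simulated
data; it is not the Framingham analysis, and the names are ours. It follows
the recipe of Section 7.2.2.2: substitute the complex-valued remeasured
data for Xi in the logistic score, take real parts, and average over
b = 1, . . . , B.

    import numpy as np
    from scipy.optimize import fsolve

    def H(t):
        # complex-safe logistic function
        return 1.0 / (1.0 + np.exp(-t))

    def mccs_score(theta, y, z, w, u_pseudo):
        """Monte Carlo corrected score, as in (7.27), for logistic regression:
        average over b of Re{Psi_ML(Y, Z, W + i*U_b, theta)}."""
        beta0, beta_z, beta_x = theta
        total = np.zeros(3)
        for u_b in u_pseudo:                     # u_pseudo has shape (B, n)
            w_tilde = w + 1j * u_b               # complex remeasured data, as in (7.6)
            resid = y - H(beta0 + beta_z * z + beta_x * w_tilde)
            design = np.column_stack([np.ones_like(z), z, w_tilde])
            total += np.real(design.T @ resid)
        return total / len(u_pseudo)

    rng = np.random.default_rng(3)
    n, B, sigma_uu = 1000, 50, 0.5
    z = rng.normal(size=n)
    x = rng.normal(size=n)
    y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 0.5 * z + 1.0 * x))))
    w = x + rng.normal(scale=np.sqrt(sigma_uu), size=n)
    u_pseudo = rng.normal(scale=np.sqrt(sigma_uu), size=(B, n))  # held fixed across iterations

    theta_hat = fsolve(mccs_score, x0=np.zeros(3), args=(y, z, w, u_pseudo))
    print(theta_hat)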
As in Table 7.1 for the analysis assuming only ln(SBP−50) is mea- where
sured with error, we calculated the Monte Carlo corrected score esti-    
1
mates using two different levels of Monte Carlo averaging. However, for  {Yi − D′ (ηi )}  Zi  
the present model we took the lower level equal to B = 100, not B = 10  

ΨQL (Yi , Zi , Xi ) =  Xi , (7.16)

used in Table 7.1. The averaging required in the Monte Carlo corrected  µn − p¶ {Yi − D′ (ηi )}
2 
score, see (7.27), is effectively calculating an integral. As with any nu- φ−
merical method of integration, higher-dimensional integration require n D′′ (ηi )
greater computational effort. Thus, the more variables measured with where ηi = β0 + βzt Zi + βxt Xi .
error, the larger one should take B. For certain models (7.15) produces maximum likelihood estimators
apart from the degrees of freedom correction (n − p)/n. However, in
7.3 Conditional Score Functions general it results in quasilikelihood estimators; see Section A.

In this section, we describe the conditional-score estimators of Stefanski 7.3.1.2 GLIM MEMs and Conditional Scores
and Carroll (1987) for an important class of generalized linear models.
We first present the basic theory, followed by conditional scores for spe- Assume now that the measurement error is additive and normally dis-
cific models. Finally, certain extensions are presented to describe the tributed, with error covariance matrix Σuu . If X is regarded as an un-
range of applications of the conditional-score approach. Once again, we known parameter and all other parameters are assumed known, then
note that this section, like this chapter as a whole, is heavy with formulae
∆ = W + YΣuu βx /φ (7.17)
and algebra, but exhibiting the formulae makes the methods usable.
is a sufficient statistic for X (Stefanski and Carroll, 1987). Furthermore,
7.3.1 Conditional Score Basic Theory the conditional distribution of Y given (Z, ∆) = (z, δ) is a canonical
generalized linear model of the same form as (7.14) with certain changes.
7.3.1.1 Generalized Linear Models (GLIM) With (Y, Z, ∆) = (y, z, δ), replace x with δ, and η, c, and D with
Canonical generalized linear models (McCullagh and Nelder, 1989) for η∗ = β0 + βzt z + βxt δ;
Y given (Z, X) have density or mass function
½ ¾ c∗ (y, φ, βxt Σuu βx ) = c(y, φ) − (1/2)(y/φ)2 βxt Σuu βx ;
yη − D(η) D∗ (η∗ , φ, βxt Σuu βx )
f (y|z, x, Θ) = exp + c(y, φ) , (7.14) ·Z ¸
φ © ª
t
= φlog exp yη∗ /φ + c∗ (y, φ, βx Σuu βx ) dµ(y) ,
where η = β0 + βzt z + βxt x is called the natural parameter, and Θ =
(β0 , βzt , βxt , φ) is the unknown parameter to be estimated. The mean and
where the last term is a sum if Y is discrete and an integral otherwise.
variance of Y are D ′ (η) and φD ′′ (η). This class of models includes:
This means that the conditional density or mass function is
• linear regression:√mean = η, variance = φ, D(η) = η 2 /2, c(y, φ) =
−y 2 /(2φ) − log( 2πφ ); f (y|z,
½δ, Θ, Σuu ) = ¾
yη∗ − D∗ (η∗ , φ, βxt Σuu βx )

• logistic regression: mean = H(η), variance = H (η), φ ≡ 1, D(η) = exp + c∗ (y, φ, βxt Σuu βx ) , (7.18)
−log {1 − H(η)}, c(y, φ) = 0, where H(x) = 1/{1 + exp(−x)}; φ
• Poisson loglinear regression: mean = exp(η), variance = exp(η), φ ≡ where η∗ = β0 + βzt z + βxt δ.
1, D(η) = exp(η), c(y, φ) = −log(y!); The correspondence between (7.14) and (7.18) suggests simply sub-
• Gamma inverse regression: mean = −1/η, variance = −φ/η, D(η) = stituting D∗ (η∗ , φ, βxt Σuu βx ) for D(η) into (7.15)–(7.16), and solving the
−log(−η), c(y, φ) = φ−1 log(y/φ) − log {yΓ(1/φ)}. resulting equations replacing ηi by η∗,i = β0 + βxt ∆i + βzt Zi , noting that
∆i depends on βx and φ. This simple idea is easily implemented and
If the Xi were observed, then Θ is estimated by solving
Pn produces consistent estimators.
i=1 ΨQL (Yi , Zi , Xi ) = 0, (7.15) The conditional mean and variance of Y given (Z, ∆) are determined

by the derivatives of D∗ with respect to η∗ , that is, and computation of the mean and variance functions requires numerical
∂ summation unless βxt Σuu βx = 0.
E(Yi | Zi , ∆i ) = m(η∗ , φ, βxt Σuu βx ) = D∗ ;
∂η∗
∂2 7.3.2.2 Linear and Logistic Models with Interactions
var(Yi | Zi , ∆i ) = φv(η∗ , φ, βxt Σuu βx ) = φ 2 D∗ . (7.19)
∂η∗ Consider the usual form of the generalized linear model (7.14) with the
The estimates of Θ = (β0 , βx , βz , φ) are obtained by solving difference that η = β0 + βzt z + βxt x + xt βxz z where βxz is a dim(X) ×
Pn dim(Z) matrix of interaction parameters. Conditional-score estimation
i=1 ΨCond (Yi , Zi , Wi , Θ) = 0, for this model was studied by Dagalp (2001). The model allows for inter-
where actions between the variables measured with error and those measured
without error. In particular, it allows for analysis of covariance models
ΨCond (Yi , Zi , Wi , Θ) = with some of the covariates measured with error by having Z indicate
  
1 group membership. The appropriate elements of βxz can be constrained

 {Yi − E(Yi | Zi , ∆i )}  Zi  
 to equal zero if the model does not contain all possible interactions.
 ∆i  (7.20) The full parameter vector is denoted by Θ = (β0 , βzt , βxt , vec∗ (βxz )t , φ)
µ 
 n − p¶ {Yi − E(Yi | Zi , ∆i )}
2  where vec∗ denotes the operator that maps the non-zero-constrained el-
φ− ements of the parameter matrix reading left to right, and top to bottom
n var(Yi | Zi , ∆i )/φ
to a column vector. Assuming the normal measurement error model,
with η∗,i = β0 +βzt Zi +βxt ∆i , with ∆i = Wi +Yi Σuu βx /φ. Stefanski and W = Normal(X, Σuu ), the distribution of the observed data again ad-
Carroll (1987) discuss a number of ways of deriving unbiased estimating mits a sufficient statistic for X,
equations from (7.18) and (7.19). The approach described here is the
simplest to implement. ∆ = W + YΣuu (βx + βxz Z)/φ.
This means that we can obtain unbiased score functions in the same
7.3.2 Conditional Scores for Basic Models fashion as with previous models, taking care to include components for
the interaction components. For this model,
In Sections 7.2.1 and 7.2.2, we presented the conditional scores for linear
and logistic regression, respectively. It is an informative exercise to derive ΨCond (Yi , Zi , Wi , Θ) =
those formulae from the general theory, and we leave it to the reader to    
1
do so. In this section, we show how to derive the conditional scores in
 {Y − E(Y | Z , ∆ )}  Zi  
more complex models.  i i i i   
 ∆i 
 Zi ⊗ ∆ , (7.22)
µ 
7.3.2.1 Poisson Regression  n − p¶ {Yi − E(Yi | Zi , ∆i )}
2 
φ−
Linear and logistic regression are the only common canonical models for n var(Yi | Zi , ∆i )/φ
which D∗′ and D∗ ′′ have closed-form expressions. In general, either nu- where Zi ⊗ ∆ represents a column vector of length dim{vec∗ (βxz )} con-
merical integration or summation is required to determine the moments taining the product of the kth element of Zi and the rth element of ∆i
(7.19). For example, for Poisson regression (for which φ ≡ 1), if and only if the (r, k) element of βxz in not constrained to equal zero.
(∞ )
X Define
t −1 2 t
D∗ (η∗ , φ, βx Σuu βx ) = log (y!) exp(yη∗ − y βx Σuu βx /2) .
ξ = βx + βxz Z.
y=0
For linear regression with φ = σ 2 the required conditional expectations
For this model, η∗ = β0 + βzt Z + βxt ∆ and
are
P∞ j
j y=0 y (y!)
−1
exp{y(η∗ ) − y 2 βxt Σuu βx /2} β0 + βzt Z + ξ t ∆
E(Y | Z, ∆) = P∞ −1 exp{y(η ) − y 2 β t Σ β /2}
, (7.21) E(Yi | Zi , ∆i ) = ,
y=0 (y!) ∗ uu x 1 + ξ t Σuu ξ/σ 2

164 165
σ2 7.3.3.2 Proportional Hazards Model with Longitudinal Covariates
var(Yi | Zi , ∆i ) = .
1+ ξ t Σuu ξ/σ 2 Measured with Error
For logistic regression, φ ≡ 1, only the top component of (7.22) is Tsiatis and Davidian (2001) used conditional score techniques to elimi-
relevant, and we need only the first conditional moment, nate subject-specific, time-dependent covariate process parameters when
the time-dependent covariate process is measured with error. In their
E(Yi | Zi , ∆i ) = pr(Y = 1 | Zi , ∆i ) = H(β0 + βzt Z + ξ t ∆ − ξ t Σuu ξ/2), model, the observed data for each subject includes the time on study Vi ,
where H(·) is, as usual, the logistic distribution function. failure indicator Fi , error-free time-independent covariate Zi , and longi-
tudinal measurements Wi (tij ) = Xi (tij ) + ǫij , ti1 < · · · < ti,ki , where
the unobserved time-dependent covariate process is modeled as Xi (u) =
7.3.3 Conditional Scores for More Complicated Models αoi +α1i u and the errors ǫij are independent Normal(0, σ 2 ). The survival
model assumes that the hazard of failure is λi (u) = λ0 (u)exp{γXi (u) +
The following examples provide a sample of models for which conditional-
η t Zi }.
score methods have been derived and studied since the first edition in b i (u) to be the ordinary least squares estimator of Xi (u)
Defining X
1995. The models, and the technical details of the derivations and the
using all of the longitudinal data up to and including time u, the counting
score functions, are generally more complicated than those considered
process increment, dNi (u) = I(u ≤ Vi < u + du, Fi = 1, ti2 ≤ u), and
previously. Our intent is to illustrate the range of application of the
the at-risk process Yi (u)I(Vi ≥ u, ti2 ≤ u), Tsiatis and and Davidian’s
conditional-score approach, and we omit many of the mathematical de-
assumptions are such that conditioned on {αi , Yi (u) = 1, Zi }, X b i (u) =
tails, describing only the models and the relevant conditioning sufficient 2
statistic. Normal{αoi + α1i u, σ θi (u)}, where θi (u) is known. It follows that up
to order du the conditional likelihood of {dNi (u), X b i (u)} given {Yi (u) =
7.3.3.1 Conditional Scores with Instrumental Variables 1, αi , Zi } admits a sufficient statistic for Xi (u) of the form
b i (u) + γσ 2 θi (u)dNi (u).
∆i (u) = X (7.24)
Buzas and Stefanski (1996c) studied conditional-score estimation for the
generalized linear model in (7.14) with observed predictor following the The statistic ∆i (u) is used to derive conditional estimating equations
additive error model, W = X + U, where U = Normal(0, Σuu ), for the free of the αi by conditioning on ∆i . Note the similarity of (7.24) to
case that Σuu is unknown but an instrument is observed, (7.17). Because of the formal equivalence between proportional hazard
partial likelihood and logistic regression likelihood, the technical details
T = Normal(γ1 + γz Z + γx X, Ω), (7.23)
of the corrected score for the proportional hazard model are similar to
where the parameters in (7.23) are also unknown. This is a version of the those for logistic regression.
model studied in Section 6.3.2, with the additional structure of (7.14)
imposed on the primary model relating Y to X and the multivariate 7.3.3.3 Matched Case-Control Studies with Covariate Measurement
linear model structure of (7.23). Note that in the most general case with Error
Z, X, W, and T vector-valued, the regression in (7.23) is multivariate
McShane, Midthune, Dorgan, et al. (2001) used the conditional-score
and γ1 , γz , and γx are matrices of the appropriate dimensions. For this
method to derive estimators for matched case-control studies when co-
model Buzas and Stefanski (1996c) derive conditional-score functions
variates are measured with error. Their study design was a 1 : M
under the assumptions that Y, W, and T are conditionally independent
matched case-control study with K strata, where in the absence of mea-
given Z and X, and rank(γx ) = dim(X). The latter assumption requires
surement error the preferred method of inference is based on the condi-
at least as many instruments as variables measured with error. Under
tional prospective likelihood,
these assumptions £ ¤
pr Y1 , . . . , Yk | {Xk , Zk , (Tk = 1)}K
k=1
∆ = W + YΣuu βx /φ + Σuu γxt Ω−1 T nP o
K exp M +1 t t
Y j=1 Ykj (Xkj βx + Zkj βz )
is a sufficient statistic for X when all other parameters are assumed = PM +1 ,
t t
known. Conditional scores are obtained by conditioning on this statistic. k=1 j=1 exp(Xkj βx + Zkj βz )

where Yk = (Yk1 , . . . , Yk,M +1 ) is the vector of binary responses for the unbiased error model Wi∗ = Xi + U∗i where Wi∗ = (Dti Di )−1 Dti Wi and
PM +1
M + 1 subjects in the kth stratum, Tk = j=1 Ykj , (Xtkj , Ztkj )t is the U∗i = (Dti Di )−1 Dti Ui is Normal{0, (Dti Di )−1 σu2 }. If we now substitute
error-free covariate for the jth subject in the kth stratum, and Xk = Wi∗ for W and (Dti Di )−1 σu2 for Σuu into the expression for the suffi-
∗ ∗ t −1 2
(Xtk1 , . . . , Xtk,M +1 )t , Zk = (Ztk1 , . . . , Ztk,M +1 )t . The measurement error cient statistic
© tgiven in2(7.17), we ª get ∆i = Wi + Y(Di Di ) σu βx /φ =
t −1
model is a Gaussian, nondifferential additive model with Wkj = Xkj + (Di Di ) Di Wi + σu Yi βx /φ In other words, after transforming to
Ukj , where the model for the errors Ukj allows for multiple additive an unbiased error model, the sufficient statistic in (7.25) is a matrix
components subject to certain restrictions. multiple of the general model sufficient statistic in (7.17). The facts that
With Bk,x = (Yk2 βxt , . . . , Yk,M +1 βxt )t , Dkz = (Ztk2 , . . . , Ztk,M +1 )t − the matrix, (Dti Di )−1 , is known, and that a known, full-rank multiple
Zk1 , Dkx = (Xtk2 , . . . , Xtk,M +1 )t − Xtk1 , Dkw = (Wk2
t t t
, . . . , Wk,M t of a sufficient statistic is also sufficient, establish the link between the
+1 ) −
t
Wk1 , and Dku = Dkw − Dkx , where Σdu ,du = cov(Dku , Dku ), McShane general theory statistic (7.17) and the form of the statistic (7.25) used
et al. (2001) showed that by Li, Zhang, and Davidian (2004).

∆k = Dkw + Σdu ,du Bk,x


7.4 Corrected Score Functions
is sufficient for Dkx when Dkx is regarded as a parameter and all other
parameters are assumed known, k = 1, . . . , K. Thus by conditioning on This section gives the basic theory and the algorithm of corrected score
the ∆k , estimating equations can be derived that do not depend on the functions. It applies to any model for which the usual estimator in the
unobserved Xkj . absence of measurement error is an M-estimator. The basic idea is very
simple:
7.3.3.4 Joint Models with Subject-Specific Parameters
• Let ΨTrue (Y, Z, X, Θ), where Θ is the collection of all unknown pa-
Joint models are discussed in greater detail in Section 11.7. Here, we con- rameters, denote the score function that would be used for estimation
sider a particular joint model that is amenable to the conditional score if X were observed. This could be a nonlinear least-squares score, a
method. Li, Zhang, and Davidian (2004) adapted the conditional-score likelihood score (derivative of the loglikelihood), etc.
method to joint models with subject-specific random effects. Rather • Because X is not observed and hence ΨTrue (Y, Z, X, Θ) cannot be
than model the distribution of the subject-specific effects, they showed used for estimation, we do the next best thing and construct an un-
how to derive conditional scores that are free of the subject-specific ef- biased estimator of ΨTrue (Y, Z, X, Θ) based on the observed data.
fects. In their model the ith subject has observed data: Yi , the primary This new score function is ΨCS (Y, Z, W, Θ). It has the property that
response; Zi , the error-free predictors; and longitudinal measurements E{ΨCS (Y, Z, W, Θ)} = ΨTrue (Y, Z, X, Θ), and thus is also unbiased.
Wi = (Wi1 , . . . , Wiki )t with Wij measured at time tij . The longitudinal
data are assumed to follow the model Wi = Di Xi + Ui , where Di is a • The corrected score function, ΨCS (Y, Z, W, Θ), is used for estimation
ki × q full-rank design matrix depending on tij , Xi is a random, subject- of Θ, the calculation of standard errors, inference, etc., using the M-
specific effect modeling features of the ith longitudinal profile, and Ui estimation techniques in Section (A.6).
are Normal(0, σu2 I), independent of Xi and across i. There are basically two ways to find the corrected score function:
Li, Zhang, and Davidian (2004) assumed that conditioned on (Zi , Xi )
1. Be clever! In some cases, one can intuit the corrected score function
the primary endpoint Yi follows a generalized linear model of the form
exactly. Some examples where this is possible are given in Section
(7.14). It follows that
7.4.3.
∆i = Dti Wi + σu2 Yi βx /φ (7.25)
2. When intuition is lacking, or the corrected score is prohibitively com-
is sufficient for Xi when all other parameters are assumed known. They plicated, an alternative is to use the complex variable theory, as we
then derived and studied conditional score estimators, as well as another did in Section 7.2, and let the computer calculate the score function
conditional estimator described by Stefanski and Carroll (1987). and solve it. Obviously, if we can be clever, we would not use the
It is instructive to reconcile the statistic in (7.25) with the form of the Monte Carlo approach, but the Monte Carlo approach expands the
statistic given in (7.17) for the general model. Starting with the linear possible applications of the methodology. We discuss this approach in
model Wi = Di Xi + Ui , multiplication by (Dti Di )−1 Dti results in the detail in Section 7.4.2.

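The construction developed below rests on the fact that averaging
Re{f(Wi + ιUb,i)} over the pseudo errors gives an unbiased estimate of
f(Xi). A quick numerical check of this identity for f(x) = exp(cx), with
arbitrary illustrative constants, can be done in a few lines of Python.

    import numpy as np

    rng = np.random.default_rng(4)
    x, sigma_u, c = 1.3, 0.8, 0.7     # fixed true X, error sd, constant in f(x) = exp(c*x)
    n_rep, B = 50000, 25

    w = x + rng.normal(scale=sigma_u, size=n_rep)          # observed W = X + U
    u_b = rng.normal(scale=sigma_u, size=(n_rep, B))       # pseudo errors U_{b}
    f_tilde = np.real(np.exp(c * (w[:, None] + 1j * u_b))).mean(axis=1)

    print(np.exp(c * x))      # target f(X)
    print(f_tilde.mean())     # Monte Carlo average; agrees with f(X) up to simulation error

The two printed numbers agree up to Monte Carlo error, illustrating the
unbiasedness property that the corrected-score theory formalizes.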
7.4.1 Corrected Score Basic Theory • For b = 1, ..., B, generate random numbers Ub,i that are normally
distributed with mean zero and covariance matrix Σuu .
The method of corrected scores does not assume a model for the observed
data per se. Rather, it starts with the assumption that there exists an • Form the complex-valued random variables
unbiased score function that produces consistent estimators with error- f b,i = Wi + ιUb,i .
W (7.26)
free data. This is the true-data score function described above and called √
ΨTrue (Y, Z, X, Θ). The true-data score function should have the property where ι = −1.
that if X were observable, the score function would be unbiased, that is, • Define the Monte Carlo corrected score
E{ΨTrue (Y, Z, X, θ)|Z, X} = 0. ΨMCCS,B (Yi , Zi , Wi , Θ) =
PB f b,i , Θ)}. (7.27)
For the linear and logistic regression models in Sections 7.2.1 and 7.2.2, B −1 b=1 Re{ΨTrue (Yi , Zi , W
ΨTrue was the least squares and maximum likelihood score in (7.1) and
(7.10), respectively. • Get an estimator for Θ by solving the corrected-score estimating equa-
A corrected score is a function, ΨCS , of the observed data having the tions
property that it is unbiased for the true-data score function. In symbols, Pn
this means that i=1 ΨMCCS,B (Yi , Zi , Wi , Θ) = 0.

E {ΨCS (Y, Z, W, Θ)|Y, Z, X)} = ΨTrue (Y, Z, X, Θ). • As B → ∞, ΨMCCS,B (Yi , Zi , Wi , Θ) → ΨCS (Yi , Zi , Wi , Θ). The num-
ber of generated random variables per subject, B, needs to be large
It follows from (7.26) and (7.26) that ΨCS is also conditionally unbiased, enough to make this limit approximately correct. Often, however,
that is, E{ΨCS (Y, Z, W, Θ) = 0. Thus, by the general theory of M- rather small values of B suffice.
estimation, the estimating equations,
Pn • The resulting corrected-score estimators are M-estimators to which
i=1 ΨCS (Yi , Zi , Wi , Θ) = 0, the standard asymptotic results in Section A.6 apply.
possess a consistent, asymptotically normally sequence of solutions (Naka-
mura, 1990), whose asymptotic distribution is readily approximated us- 7.4.2.2 The Theory
ing the M-estimation techniques in Section A.6. A mathematical result on which the corrected-score theory is based is
Note that no assumptions about the Xi are made. Thus, the corrected- that for suitably smooth, integrable functions, f (·), the function defined
score method provides an attractive approach to consistent estimation by
when data are measured with error. The key technical problem is finding h n o i
a corrected score satisfying (7.26) for a given ΨTrue . Corrected scores have e
f (Wi ) = E Re f (Wf b,i ) |Xi , Wi (7.28)
been identified for particular models by Nakamura (1990) and Stefanski
(1989), and the results in Gray, Watkins, and Schucany (1973) provide a does not depend on Xi and is an unbiased estimator of f (Xi ), where
means of obtaining corrected scores via infinite series. These calculations Re{} denotes the real part of its argument, that is,
are described in Section 7.4.3. In the absence of such exact results, Novick n o
E e f (Wi )|Xi = f (Xi ) (7.29)
and Stefanski (2002) describe a general method of constructing corrected
scores based on simple Monte Carlo averaging. We now outline their (Stefanski, 1989; Stefanski and Cook, 1995). We will not prove the gen-
method. eral result here. However, verification of (7.29) for the function g(x) =
exp(ct x) is instructive and also provides results that are used later in
7.4.2 Monte Carlo Corrected Scores this section. First, note that by independence of Ub,i and (Xi , Wi ) and
properties of the normal distribution characteristic function,
7.4.2.1 The Algorithm h n o i
ge(Wi ) = E Re g(W f b,i ) |Xi , Wi
The algorithm is simple, although perhaps with complex numbers simple
is not the most appropriate word. The method is as follows. = exp(ct Wi )exp(−ct Σuu c/2).

170 171
P2 © ª
Now the fact that E{e g (Wi )|Xi } = g(Xi ) follows immediately from k=0 ck (y, z, Θ)(βxt x)k + c3 (y, z, Θ)exp(βxt x);
the normal moment generating function identity, E{exp(ct Wi )|Xi } =
see the examples given below. Then, using normal distribution moment
exp(ct Xi + ct Σuu c/2).
generating function identities, the required function is
The special case g(Xi ) = exp(ct Xi ) is very useful. It follows from the
result for the exponential function that (7.29) also holds for the partial Ψ∗ (y, z, w,
" Θ, Σuu ) =
derivative (∂/∂c)g(Xi ) = Xi exp(ct Xi ) and for higher-order derivatives, ∂ X©
2
ª
as well. Also, P it is clear that (7.29) holds for linear combinations of ck (y, z, Θ)(βxt w)k − c2 (y, z, Θ)βxt Σuu βx
∂Θt
exponentials, j exp(ctj Xi ), and their partial derivatives with respect k=0 #
to cj . These results can be used to find the exact corrected scores in +c3 (y, z, Θ)exp(βxt w − .5βxt Σuu βx ) .
Section (7.4.3).
It follows from (7.29), that for score functions with components that
Regression models in this class include:
are sufficiently smooth and integrable functions of their third argument,
h i • Normal linear with
f b,i , Θ)}|(Yi , Xi ) = ΨTrue (Yi , Zi , Xi , Θ), √ mean = η, variance = φ, c0 = −(y − β0 −
E Re{ΨTrue (Yi , Zi , W βzt z)2 /(2φ) − log( φ ), c1 = (y − β0 − βzt z)/φ, c2 = −(2φ)−1 , c3 = 0.
f b,i , Θ)} is a corrected score. • Poisson with mean = exp(η), variance = exp(η), c0 = y(β0 + βzt z) −
that is, Re{ΨTrue (Yi , Zi , W
log(y!), c1 = y, c2 = 0, c3 = −exp(β0 + βzt z).
The corrected score Re{ΨTrue (Yi , Zi , W f b,i , Θ)} in (7.30) depends on
−1
• Gamma with mean = exp(η), variance =©φexp(2η), ª c0 = −φ (β0 +
the particular generated random vector Zb,i . The preferred corrected t −1 −1 −1 −1 −1
f b,i , Θ)|(Yi , Zi , Wi )}, which eliminates vari- βz z)+(φ −1)log(y)+φ log(φ )−log Γ(φ ) , c1 = φ , c2 = 0,
score is E{ΨTrue (Yi , Zi , W c3 = −φ−1 yexp(−β0 − βzt z).
ability due to Ub,i . The conditional expectation is not always easy to
determine mathematically. However, Monte Carlo integration provides
a simple solution, resulting in the Monte Carlo corrected score (7.27). 7.4.4 SIMEX Connection
The Monte Carlo corrected score possesses the key property, There is a connection between SIMEX and the Monte Carlo corrected-
E{ΨMCCS,B (Yi , Zi , Wi , Θ) | Yi , Zi , Xi } = ΨTrue (Yi , Zi , Xi , Θ), score method. SIMEX adds measurement error multiple times, computes
the new estimator over each generated data set, and then extrapolates
from (7.30) and converges to the exact conditional expectation desired
back to the case of no measurement error. The sequence of operations in
as B → ∞, that is,
simulation extrapolation is 1) generate pseudo-random, (real-valued) re-
ΨCS (Yi , Zi , Wi , Θ) = lim ΨMCCS,B (Yi , Zi , Wi , Θ). √ 1/2
B→∞
measured data sets as Wb,i (ζ) = Wi + ζΣuu Ub,i ; 2) calculate average
estimates from the remeasured data sets; 3) determine the dependence
Corrected-score estimating equations are formed in the usual fashion as
of the averaged estimates on ζ; and 4) extrapolate to ζ = −1.
described above.
The Monte Carlo corrected-score method is obtained more or less by
reordering these operations. It starts with the complex-valued, pseudo-
7.4.3 Some Exact Corrected Scores random data sets. Note that these are obtained by letting ζ → −1 in the
1/2
The exponential function g(X) = exp(ct X) is a special case of (7.29) SIMEX remeasured data, limζ→−1 Wb,i (ζ) = Wi + ιΣuu Ub,i . Rather
studied in Section 7.4, and extensions derived from it are useful for than calculate an estimate from each complex pseudodata set and aver-
finding exact corrected scores when the true-data score functions are aging, the complex, pseudodata estimating equations are averaged, the
linear combinations of products of powers and exponential. imaginary part is removed, and then the averaged equations are solved,
resulting in a single estimate.
7.4.3.1 Likelihoods with Exponentials and Powers
One useful class of models that admit corrected scores contains those 7.4.5 Corrected Scores with Replicate Measurements
models with log-likelihoods of the form
The connection to SIMEX described in the previous section extends
log {f (y|z, x, Θ)} = even further. In Section 5.3.1, we described a version of the SIMEX al-

172 173
gorithm that automatically accommodates heteroscedastic measurement 7.5.1 Known Measurement Error Variance
error with unknown variances, provided ki ≥ 2 replicate measurements
are available for each true Xi . The key innovation there was that pseudo Let Ψ∗ (Y, Z, W, Θ, Σuu ) denote either a conditional or corrected score.
data are generated as random linear contrasts of the replicate measure- Showing the dependence of the score on Σuu will be useful when we deal
ments. Similar methods of generating pseudo errors, with the key change with the case of estimated measurement error variance. Suppose that Θ b∗
that ζ = −1, as described in the preceding section, can be used to is a solution to the estimating equations,
construct corrected scores from replicate measurements that automat- Pn b
ically handle heteroscedastic measurement error. Here, we present the i=1 Ψ∗ (Yi , Zi , Wi , Θ∗ , Σuu ) = 0. (7.32)
approach described in Stefanski, Novick and Devanarayan (2005) for the © ª
b ∗ −Θ) is asymptotically Normal 0, A−1 B(A−1 )t ,
Then, generally, n1/2 (Θ
case that Xi is a scalar.
Assume that the error model is Wi,j = Xi + Ui,j , where Ui,j , j = where A and B are consistently estimated by
2
1, . . . , ki , i = 1, . . . , n are independent Normal(0, σu,i ), independent of b = n−1 Pn Ψ∗Θ (Yi , Zi , Wi , Θ
A b ∗ , Σuu )
2 2 i=1
Xi , Zi , and Yi , with all σu,i unknown. Let Wi and σ bi denote the sample P
b = n−1 n Ψ∗ (Yi , Zi , Wi , Θ
B b ∗ , Σuu )Ψt (Yi , Zi , Wi , Θ
b ∗ , Σuu ),
mean and sample variance of the ith set of replicates, and define i=1 ∗

f b,i = Wi + ι(ki − 1)1/2 σ


W bi Tb,i , (7.30) with Ψ∗Θ (Y, Z, W, Θ, Σuu ) = (∂/∂Θt )Ψ∗ (Y, Z, W, Θ, Σuu ).
2 2 The matrix A b also appears in the Newton–Raphson iteration for solv-
where Tb,i = Vb,1 (Vb,1 + · · · + Vb,k i −1
)−1/2 and the Vb,i are generated as b (0) , successive iterates are
ing (7.32). Starting with an initial estimate Θ
independent Normal(0, 1) independent of the data.
obtained via
Stefanski, Novick and Devanarayan (2005) proved a result comparable
³ ´
to that in (7.28) and (7.29). They showed that if f (·) is a suitably smooth, b (k+1)
Θ ∗ =Θb (k)
∗ −A
b−1 n−1 Pn Ψ∗ Yi , Zi , Wi , Θ b (k)
∗ , Σuu .
i=1
integrable function, then
h n o i
ef (Wi ) = E Re f (W f b,i ) |Xi , Wi (7.31) Estimating equations for both conditional- and corrected-score estimates
can have multiple solutions, and thus Newton–Raphson iteration can be
does not depend on Xi and is an unbiased estimator of f (Xi ), that is, sensitive to starting values. Although the naive estimate is often a rea-
n o sonable initial estimate, it is sometimes necessary to use a measurement-
E ef (Wi )|Xi = f (Xi ). error bias-corrected estimate such as regression calibration or SIMEX
estimates; see Small, Wang, and Yang (2000) for a discussion of the
This result is used to construct corrected scores for the replicate-data,
multiple-root problem.
unknown-heteroscedastic-error-variance model in the same manner that
the result in (7.28) and (7.29) was used to construct them for the known- Different estimators of A and B are sometimes used for both the
error-variance case in Section (7.4.1). The Monte Carlo corrected score conditional- and corrected-score methods. For the conditional-score me-
has exactly the same form, thod, define
PB f b,i , Θ)};
ΨMCCS,B (Yi , Zi , Wi , Θ) = B −1 b=1 Re{ΨTrue (Yi , Zi , W a {Z, ∆(Θ, Σuu ), Θ, Σuu } = E {Ψ∗Θ (·) |Z, ∆(Θ, Σuu )} ,
f b,i is defined in (7.30), as opposed to (7.26). b {Z, ∆(Θ, Σuu ), Θ, Σuu } = cov {Ψ∗ (·) |Z, ∆(Θ, Σuu )} .
the only difference is that W
Then the alternate estimators are
n o
b2 = n−1 Pn a Zi , ∆i (Θ,
7.5 Computation and Asymptotic Approximations
A b Σuu ), Θ,
b Σuu ,
i=1
Conditional-score and corrected-score estimators are M-estimators, and n o
thus the usual numerical methods and asymptotic theory for M-estimators b2 = n−1 Pn b Zi , ∆i (Θ,
B b Σuu ), Θ,
b Σuu . (7.33)
i=1
apply to both. We outline the computation and distribution approxima-
tions here for the case where Σuu is known, and when it is estimated Comparable estimates of A and B for corrected scores are substantially
from independent data. more involved; see Novick and Stefanski (2002).

174 175
7.5.2 Estimated Measurement Error Variance where
When Σuu is unknown, additional data are required to estimate it consis- n¡ ¢¡ ¢t o ³ ´
di = vech Wij − Wi· Wij − Wi· − (ki − 1)vech Σb uu .
tently and the asymptotic variance-covariance matrix of the estimators
is altered. Let Ψ∗ (Y, Z, W, Θ, Σuu ) denote either a conditional score or a
corrected score and Θ b ∗ the estimator obtained by solving the estimating
equations with Σuu replaced by Σ b uu . Define γ = vech(Σuu ), where vech 7.6 Comparison of Conditional and Corrected Scores
is the vector-half of a symmetric matrix, that is, its distinct elements.
When an independent estimate of the error covariance matrix is avail- Conditional-score and corrected-score methods are both functional meth-
able, the following method applies. Let γ b be an estimate of γ that is ods, and thus they have in common the fact that neither one requires
assumed to be independent of Θ b ∗ , with asymptotic covariance matrix assumptions on the Xi for consistency to hold in general. However, they
Cn (Σuu ). If we define differ in other important ways, such as underlying assumptions and ease
Pn ∂ of computation.
Dn (Θ, Σuu ) = i=1 Ψ (Yi , Zi , Wi , Θ, Σuu ) , In general, conditional scores are derived under specific assumptions
∂γ t
about both the model for Y given (Z, X) and the error model for W
then a consistent estimate of the covariance matrix of Θ b is given X, whereas corrected scores assume only a correct estimating func-
³ ´n ³ ´
b−1 Θ b ∗, Σ
b uu B b Θ b ∗, Σ
b uu + tion if X were observed, and sufficient assumptions on the error model to
n−1 A
o ³ ´ enable unbiased estimation of the true-data estimating function. Conse-
Dn (Θb ∗, Σ
b uu )Cn (Σ
b uu )Dt (Θ
b ∗, Σ
b uu ) Ab−t Θb ∗, Σ
b uu , quently, when the assumptions underlying the conditional-score method
n
are satisfied, it will usually be more efficient. Some conditional scores re-
where A b and Bb are the matrices estimated in the construction of sandwich- quire numerical summation or integration. In principle, exact corrected
formulae variance estimates. We have shown their dependence on Θ and scores also require integration; however, the Monte Carlo corrected score
Σuu to emphasize that they are to be computed at the estimated values methods come with a simple, built-in solution to this computational
b and Σ
Θ b uu . problem when the required integrals are not analytically tractable, al-
Finally, a problem of considerable importance occurs when there are though the simplicity requires complex-variable computation (which not
ki independent replicate measurements of Xi , Wij = Xi + Uij , j = all will find simple).
1, . . . , ki . A common situation is when ki = 1 for most i, and the re- A comparison of the two approaches has been made for Poisson re-
mainder of the data have a single replicate (ki = 2). Constructing esti- gression, which is one of the few models where both methods apply.
mated standard errors for this problem has not been done previously, and For this model, the corrected-score estimator is more convenient be-
the justification for our results is given in Appendix B.6. The necessary cause the corrected score has a closed form expression, whereas the
changes are as follows. In computing the estimates, in the previous defi- conditional-score estimator requires numerical summation; see (7.21).
nitions, replace Σuu with Σuu /ki and Wi with Wi· , the sample mean of However, the conditional-score estimator is more efficient in some prac-
the replicates. The estimate of Σuu is the usual components of variance tical cases (Stefanski, 1989). We also note that, for the Poisson model,
estimator, Kukush, Schneeweiss, and Wolf (2004) compared the corrected-score es-
Pn Pk i ¡ ¢¡ ¢t timator to a structural estimator and conclude that the former, while
i=1 j=1 Wij − Wi· Wij − Wi· more variable, is preferred on the basis of insensitivity to structural-
b
Σuu = Pn .
i=1 (ki − 1) model assumptions, except when the error variance is large.
While the above variance estimator has a known asymptotic distribution The conditional-score method and certain extensions thereof have a
(based on the Wishart distribution), it is easier in practice to use the theoretical advantage in terms of efficiency. For the canonical generalized
sandwich estimator of its variance, linear models of Section 7.3, Stefanski and Carroll (1987) showed that
Pn any unbiased estimating equation for (β0 , βzt , βxt )t must be conditionally
t
b uu ) = P i=1 di di
Cn (Σ unbiased given (Z, ∆), and from this they deduce that the asymptotically
n 2,
{ i=1 (ki − 1)} efficient estimating equations for structural models are based on score

176 177
functions of the form that does not possess the same functional-modeling properties as cor-
  rected scores, is presented by Wang and Pepe (2000).
 1 
Both conditional-score and corrected-score estimating equations can
{Yi − E(Yi | Zi , ∆i )} Zi .
  have multiple solutions. In simpler models we have not found this to be
E(Xi | Zi , ∆i )
a problem, but it can be with more complicated models. Small, Wang,
This result shows that, in general, none of the methods we have proposed and Yang (2000) discussed the multiple root problem and proposed some
previously is asymptotically efficient in structural models, except when solutions.
E(X | Z, ∆) is linear in (Z, ∆). This is the case in linear regression with
(Z, X) marginally normally distributed, and in logistic regression when
(Z, X) given Y is normally distributed, that is, the linear discriminant
model. The problem of constructing fully efficient conditional-score es-
timators based on simultaneous estimation of E(Xi | Zi , ∆i ) has been
studied (Lindsay, 1985; Bickel and Ritov, 1987; van der Vaart, 1988),
although the methods are generally too specialized or too difficult to
implement in practice routinely.
Both methods have further extensions not mentioned previously. The
conditional-score method is easily extended to the case that the model
for W given X is a canonical generalized linear model with natural pa-
rameter X. Buzas and Stefanski (1996a) described a simple extension of
corrected-score methods to additive nonnormal error models when the
true-data score function depends on X only through exp(βxt X) and the
measurement error possesses a moment-generating function. Extensions
to nonadditive models are also possible in some cases. For example, Li,
Palta, and Shao, (2004) studied a corrected score for linear regression
with a Poisson surrogate. Nakamura (1990) showed how to construct
a corrected estimating equation for linear regression with multiplica-
tive lognormal errors. He also suggested different methods of estimating
standard errors.

7.7 Bibliographic Notes

Conditioning to remove nuisance parameters is a standard technique in


statistics. The first systematic application of the technique to general-
ized linear measurement error models appeared in Stefanski and Carroll
(1987), which was based on earlier work by Lindsay (1982). Related
methods and approaches can be found in Liang and Tsou (1992), Liang
and Zeger (1995), Hanfelt and Liang (1995), Hanfelt and Liang (1997),
Rathouz and Liang (1999), and Hanfelt (2003).
The systematic development of corrected-score methods started with
Nakamura (1990) and Stefanski (1989). Further developments and ap-
plications of this method can be found in Buzas and Stefanski (1996a),
Augustin (2004), Kukush, Schneeweiss, and Wolf (2004), Li, Palta, and
Shao (2004), and Song and Huang (2005). A related technique, but one

178 179
CHAPTER 8

LIKELIHOOD AND
QUASILIKELIHOOD

8.1 Introduction
This chapter describes the use of likelihood methods in nonlinear mea-
surement error models. Prior to the first edition of this text, there were
only a handful of applications of likelihood methods in our context. Since
that time, largely inspired by the revolution in Bayesian computing, con-
struction of likelihood methods with computation by either frequentist
or Bayesian means has become fairly common.
There are a number of important differences between the likelihood
methods in this chapter and the methods described in previous chapters:
• The previous methods are based on additive or multiplicative mea-
surement error models, possibly after a transformation. Typically, few
if any distributional assumptions about the distribution of X are re-
quired. Likelihood methods require stronger distributional assump-
tions, but they can be applied to more general problems, including
those with discrete covariates subject to misclassification.
• The likelihood for a fully specified parametric model can be used to
obtain likelihood ratio confidence intervals. In methods not based on
likelihoods, inference is based on bootstrapping or on normal approx-
imations. In highly nonlinear problems, likelihood-based confidence
intervals are generally more reliable than those derived from normal
approximations and less computationally intensive than bootstrap-
ping.
• Whereas the previous methods require little more than the use of
standard statistical packages, likelihood methods are often computa-
tionally more demanding.
• Robustness to modeling assumptions is a concern for both approaches,
but is generally more difficult to understand with likelihood methods.
• When the simpler methods described previously are applicable, a like-
lihood analysis will generally buy one increased efficiency, that is,
smaller standard errors, albeit at the cost of extra modeling assump-
tions, the old “no free lunch” phenomenon. Sometimes, the gains in

181
chosen. This could be a classical error model, a Berkson model, a
Step 1: Select the likelihood combination of the two, etc. If one has classical components in the
models as if X were observed. measurement error model, then typically one also needs to specify a
distribution for the unobserved X given the observable covariates Z;
see Section 2.2.3. Much of the grief of a likelihood analysis revolves
around this step.
Step 2: Select the error
• Step 3: The likelihood function is constructed using the building
model, e.g., Berkson, classical. blocks obtained in previous steps.
If classical, also select model
for unobserved X given Z. • Step 4: Now one has to do the sometimes hard slogging to compute
the likelihood function, obtain parameter estimates, do inferences, etc.
Because X is latent, that is, unobservable, this step can be difficult
or time-consuming, because one must integrate out the possibly high
dimensional latent variable.
Step 3: Form the likelihood
We organize our discussion of likelihood methods around these four
function. steps. Except where noted, we assume nondifferential measurement error
(Section 2.5). For a review of maximum likelihood methods in general,
see Appendix A.5.
Fully specified likelihood problems, including problems where X is not
Step 4: Compute likelihood observable or is observable for only a subset of the data, are discussed in
function and maximize. Sections 8.2 and 8.3. The use of likelihood ideas in quasilikelihood and
variance function models (QVF) (Section A.7) is covered in Section 8.8.

Figure 8.1 Flowchart for the steps in a likelihood analysis.


8.1.1 Step 1: The Likelihood If X Were Observable

efficiency are very minor, as in typical logistic regression; see Ste- Likelihood analysis starts with an “outcome model” for the distribution
fanski and Carroll (1990b), who contrasted the maximum likelihood of the response given the true predictors. The likelihood (density or
estimate and the conditional scores estimate of Chapter 7. They found mass) function of Y given (Z, X) will be called fY |Z,X (y|z, x, B) here,
that the conditional score estimates are usually fairly efficient relative and interest lies in estimating B.
to the maximum likelihood estimate, unless the measurement error The form of the likelihood function can generally be specified by
is “large” or the logistic coefficient is “large,” where the definition of reference to any standard statistics text. For example, if Y is nor-
large is somewhat vague. One should be aware, though, that their cal- mally distributed with mean β0 + βxt X + βzt Z and variance σ 2 , then
culations indicate that situations exist where properly parameterized B = (β0 , βx , βz , σ 2 ) and
maximum likelihood estimates are considerably more efficient than £ ¤
estimates derived from functional modeling considerations. fY |Z,X (y|z, x, B) = σ −1 φ {(y − (β0 + βxt x + βzt z)}/σ ,
Figure 8.1 illustrates the steps in doing a likelihood analysis of a mea- where φ(v) is the standard normal density function. If Y follows a logistic
surement error model. These steps are as follows: regression model with mean H(β0 + βxt X + βzt Z), then B = (β0 , βx , βz )
• Step 1: To perform a likelihood analysis, one must specify a para- and
metric model for every component the data. Any likelihood analysis ¡ ¢
fY |Z,X (y|z, x, B) = H y β0 + βxt x + βzt z
begins with the model one would use if X were observable. © ¡ ¢ª1−y
• Step 2: The next crucial decision is the error model that is to be × 1 − H β0 + βxt x + βzt z .

182 183
8.1.2 A General Concern: Identifiable Models starts with determination of the joint distribution of Y, W, and T given
Z, as these are the observed variates.
The concept of identifiability means that if one actually had an infinite
To understand what is going on, we first describe the discrete case
number of observed data values, then one would know the parameters
when there is neither a covariate Z measured without error nor a second
exactly, that is, they are identified. When a problem is not identifiable,
measure T. We then describe in detail the classical and Berkson models
it means that a key piece of information is unavailable. For example, in
in turn.
linear regression with (Y, W, X) all normally distributed, as described
in Section 3.2.1, the parameters are not identifiable because a key piece
of information is absent, namely, the measurement error variance. For 8.2.1 The Discrete Case
this reason, replication data is needed to help estimate the measurement
First consider a simple problem wherein Y, W, and X are discrete ran-
error variance.
dom variables; no second measure T is observed; and there are no other
In nonlinear measurement error models, sometimes the parameters
covariates Z. From basic probability, we know that
are identifiable without any extra information other than measures of X
(Y, Z, W), that is, without validation or replications. Brown and Mari- pr(Y = y, W = w) = pr(Y = y, W = w, X = x)
ano (1993) discuss this issue, considering both likelihood and quasilike- x
X
lihood techniques. = pr(Y = y|W = w, X = x)pr(W = w, X = x). (8.1)
A word of warning: One should not be overly impressed by all claims x
of identifiability. Many problems of practical importance actually are
When W is a surrogate (nondifferential measurement error, see Section
identifiable, but only barely so, and estimation without additional data
2.5), it provides no additional information about Y when X is known,
is not practical. For instance, in linear regression it is known that the
so (8.1) is
regression parameters can be identified without validation or replication
as long as X is not normally distributed (Fuller, 1987, pp. 72–73). How- pr (Y = y, W = w)
ever, this means that the parameter estimates will be very unstable if X X
= pr(Y = y|X = x, B)pr(W = w, X = x), (8.2)
is at all close to being normally distributed. In binary regression with a x
normally distributed calibration, it is known that the probit model is not
identified (Carroll, Spiegelman, Lan, et al., 1984) but that the logistic where, to achieve a parsimonious model, we have (1) replaced pr(Y =
model is (Küchenhoff, 1990). The difference between these two models y|W = w, X = x) by pr(Y = y|X = x) and (2) indicated that typically
is so slight (Figure 4.9) that there is really no useful information about one would use a parametric model pr(Y = y|X = x, B) for the latter.
the parameters without some additional validation or replication data. Thus, in addition to the underlying model, we must specify a model for
However, lest we leave you with a glum picture, there are indeed cases the joint distribution of W and X. How we do this depends on the model
that the nonlinearity in the model helps make identifiability practical; for relating W and X.
example, Rudemo, Ruppert, and Streibig, et al. (1989) describe a highly
nonlinear model that is both identifiable and informative; see Section 8.2.1.1 Classical Case, Discrete Data
4.7.3. For example, in the context of classical measurement error and, for sim-
This discussion highlights the Sherlock Holmes phenomenon: Data are plicity, assuming no Z, we would specify a model for W given X, and
good, but more and different types of data are better. then a model for X. In other words,
X
pr(Y = y|X = x, B)pr(W = w|X = x)pr(X = x). (8.3)
8.2 Steps 2 and 3: Constructing Likelihoods x

Having specified the likelihood function as if X were observable, we now Equation (8.3) has three components: (a) the underlying “outcome model”
turn to constructing the form of the likelihood function. This consists of primary interest; (b) the error model for W given the true covariates;
of 1 or 2 steps, depending on whether the error model is Berkson or and (c) the distribution of the true covariates, sometimes called the ex-
classical. In this section, we allow for general error models and for the posure model in epidemiology. Both (a) and (b) are expected; almost all
possibility that a second measure T is available. A likelihood analysis the methods we have discussed so far require an underlying model and

184 185
an error model. However, (c) is unexpected, in fact a bit disconcerting, density is σu−1 φ {(w − x)/σu }, where φ(·) is the standard normal
because it requires a model for the distribution of the unobservable X. density function.
It is (c) that causes almost all the practical problems of implementation – If W is binary, a natural error model is the logistic where, for
and model selection with maximum likelihood methods. example, α
e1 = (α11 , α12 , α13 ) and pr(W = 1|X = x, Z = z) =
t
H(α11 + α12 x + α13 z).
8.2.1.2 Berkson Case, Discrete Data
– Multiplicative models occur when W = XU, where typically U
The likelihood of the observed data is (8.2) because W is a surrogate. has a lognormal or gamma distribution with E(U) = 1.
At this point, however, the analysis changes. When the Berkson model
• We use fX|Z (x|z, α
e2 ) to denote the density or mass function of X given
holds, we write
Z. As might be expected, the latent nature of X makes specifying
pr(Y = y, W = w) (8.4) this distribution a matter of art. Nonetheless, there are some general
X guidelines:
= pr(Y = y|X = x, B)pr(X = x|W = w)pr(W = w).
x – When X is univariate, generalized linear models (Section A.8) are
The third component of (8.4) is the distribution of W and conveys no natural and useful. For example, one might hypothesize that X is
information about the critical parameter B. Thus, we will divide both normally distributed in a linear regression on Z, or that it follows
sides of (8.4) by pr(W = w) to get likelihoods conditional on W, namely, a gamma or lognormal loglinear model in Z. If X were binary, a
X linear logistic regression on Z would be a natural candidate.
pr(Y = y|W = w) = pr(Y = y|X = x, B)pr(X = x|W = w). (8.5)
– Some model robustness can be gained by specifying flexible distri-
x
butions for X given Z. One class is to suppose that, depending on
the context, X or log(X) follows a linear regression in Z, but that
8.2.2 Likelihood Construction for General Error Models the regression errors have a mixture of normals density. Mixtures
We now describe the form of the likelihood function for general error of normals can be difficult to work with, and an obvious alternative
models. is to use the so-called seminonparametric family (SNP) of David-
When there are covariates Z measured without error, or when there ian and Gallant (1993, p. 478): see Zhang and Davidian (2001)
are second measures T, (8.3) changes in two ways. The second measure for a computationally convenient form of this approach. Davidian
is appended to W, and all probabilities are conditional on Z. Thus, (8.3) and Gallant’s mixture model generalizes easily to the case that all
is generalized to components of X are continuous. We point out that Bayesians of-
X ten use Dirichlet process mixtures to achieve a seminonparametric
pr(Y = y|Z = z, X = x, B) modeling approach.
x
– For mixtures of discrete and continuous variables, the models of
× pr(W = w|Z = z, X = x)pr(X = x|Z = z). Zhao, Prentice, and Self (1992) hold considerable promise. Other-
In general, in problems where X is not observed but there is a natural wise, one can proceed on a case-by-case basis. For example, one
error model, then in addition to specifying the underlying model and can split X into discrete and continuous components. The distri-
the error model, we must hypothesize a distribution for X given Z. In bution of the continuous component given the discrete components
summary: might be modeled by multivariate normal linear regression, while
that of the discrete component given Z could be any multivariate
• The error model has a density or mass function which we will denote
discrete random variable. We would be remiss in not pointing out
by fW,T |Z,X (w, t|z, x, α
e1 ).
that multivariate discrete models can be difficult to specify.
– In many applications, the error model does not depend on z. For
Having hypothesized the various models, the likelihood that (Y = y, W =
example, in the classical additive measurement error model (1.1)
w, T = t) given that Z = z is then
with normally distributed measurement error, σu2 is the only com-
ponent of α
e1 , there is no second measure T, and the error model fY,W,T |Z (y, w, t|z, B, α
e1 , α
e2 )

186 187
Z
= fY |Z,X,W,T (y|z, x, w, t, B)fW,T |Z,X (w, t|z, x, α
e1 ) residual variance σǫ2 from the Berkson error variance σu2 , so that nei-
ther is identified; see Section 8.1.2.
×fX|Z (x|z, α
e2 )dµ(x) (8.6) It can also be shown that the additive Berkson model with homoscedas-
Z
tic errors leads to consistent estimates of nonintercept parameters in log-
= fY |Z,X (y|z, x, B)fW,T |Z,X (w, t|z, x, α
e1 )
linear models and often to nearly consistent estimates in logistic regres-
×fX|Z (x|z, α
e2 )dµ(x). (8.7) sion. In the latter case, the exceptions occur with severe measurement
error and a strong predictive effect; see Burr (1988).
The notation dµ(x) indicates that the integrals are sums if X is discrete In general problems, we must specify the conditional density or mass
and integrals if X is continuous. The assumption of nondifferential mea- function of X given W, which we denote by fX|W (x|w, γ e). In the usual
surement error (Section 2.5), which is equivalent to assuming that W Berkson model, γ e is σu2 , and the density is σu−1 φ {(x − w)/σu }. In a
and T are surrogates for X, was used in going from (8.6) to (8.7), and Berkson model is proportional to W 2θ , the density
© where the θvariance
ª
will be used without mention elsewhere in this chapter. The likelihood θ −1
is (w σu ) φ (x − w)/(w σu ) . The likelihood function then becomes
for the problem is just the product over the sample of the terms (8.7)
evaluated at the data. fY |Z,W (y|z, w, B, γ
e)
Z
Of interest in applications is the density function of Y given (Z, W, T),
= fY |Z,X (y|z, x, B)fX|W (x|w, γ
e)dµ(x). (8.8)
which is (8.7) divided by its integral or, in the discrete case, sum over
y. This density is an important tool in the process of model criticism, The likelihood for the problem is the product over the sample of the
because it allows us to compute such diagnostics as the conditional mean terms (8.8) evaluated at the data.
and variance of Y given (Z, W, T), so that standard model verification As a practical matter, there is rarely a direct “second measure” in the
techniques from regression analysis can be used. Berkson additive or multiplicative models. This means that the parame-
ters in the Berkson model can be estimated only through the likelihood
8.2.3 The Berkson Model (8.8). In some cases, such as homoscedastic linear regression described
above, not all of the parameters can be identified (estimated). For non-
In the Berkson model, a univariate X is not observed, but it is related to linear models, identification usually is possible.
a univariate W by X = W + U, perhaps after a transformation. There In classical generalized linear models, a likelihood analysis of a ho-
are no other covariates. Usually, U is taken to be independent of W and moscedastic, additive Berkson model can be shown to be equivalent to a
normally distributed with mean zero and variance σu2 , but more complex random coefficients analysis with random intercept for each study par-
models are possible. For example, in the bioassay data of Chapter 4, the ticipant.
variance might be modeled as σu2 W2θ .
The additive model is not a requirement. In some cases, it might be
8.2.4 Error Model Choice
more reasonable to assume that X = WU, where U has mean 1.0 and
is either lognormal or gamma. Modeling always has options. For example, there is nothing illegal in sim-
The Berkson additive model has an unusual feature in the linear re- ply specifying a model for X given W, as in equation (8.8), or a model
gression problem. Suppose the regression has true mean β0 +Xβx +Zt βz for X given (Z, W), in which case Z is added to the error distribution
and residual variance σǫ2 . Then in the pure Berkson model, we replace in (8.8). One could then simply ignore the detailed aspects of the mea-
X with W + U, with the following consequences. surement error models, including such inconvenient things as whether
• The good news is the following: The observed data have regression the measurement error is additive and homoscedastic, etc. One can even
mean β0 +Wt βx +Zt βz , the observed data have the correct regression specify reasonably flexible models for such distributions, for example, by
line, so that the naive analysis yields valid estimates of the regression using the Davidian and Gallant models. Mallick and Gelfand (1996) do
line. just this in a Bayesian context.
This purely empirical approach has the attraction of its empiricism,
• The bad news is that the observed data have residual variance σǫ2 + but it almost forces one to write down very general models for X given
βx2 σu2 . This means that the observed data cannot disentangle the true (W, Z) in order to achieve sensible answers. In addition, there is the

188 189
potential for a loss of information, because the real likelihood is the As mentioned in the introduction to this chapter, maximum likelihood
likelihood of Y and W given Z, not simply the likelihood of Y given is a particularly useful technique for treating the problem of misclassi-
(W, Z). This may seem like a minor difference, but we suspect that the fication of a discrete covariate. One reason for this is that numerical
difference is not minor. There is little to no literature on whether such an integration of X out of the joint density of Y, X, and W is replaced by
approach can yield sensible answers when additive–multiplicative error an easy summation. Another reason is that, for a discrete covariate, one
models hold. can use a structural model without the need to make strong structural
assumptions, since, for example, a binary random variable must have a
Bernoulli distribution.
8.3 Step 4: Numerical Computation of Likelihoods The log odds-ratio β is defined by
The overall likelihood based on a sample of size n is the product over the pr(X = 1|Y = 1)/pr(X = 1|Y = 0)
sample of (8.6) when X is unobserved or the product over the sample of exp(β) = (8.9)
pr(X = 0|Y = 1)/pr(X = 0|Y = 0)
(8.8) in the Berkson model. Typically, one maximizes the logarithm of
pr(Y = 1|X = 1)/pr(Y = 0|X = 1)
the overall likelihood in the unknown parameters. There are two ways = . (8.10)
one can maximize the likelihood function. The most direct is to compute pr(Y = 1|X = 0)/pr(Y = 0|X = 0)
the likelihood function itself, and then use numerical optimization tech- Here (8.9) is the odds-ratio for a retrospective study where the disease
niques to maximize the likelihood. Below we provide a few details about status Y is fixed, while (8.10) is the log-odds for a prospective study
computing the likelihood function. The second general approach is to where the risk factor X is fixed. The equality of the two odds-ratios
view the problem as a missing-data problem, and then use missing-data allows one to parameterize either prospectively (in terms of Y given X
techniques; see for example Little and Rubin (2002), Tanner (1993), and and W) or retrospectively (in terms of X and W given Y) and is the
Geyer and Thompson (1992). theoretical basis for case-control studies.
Computing the likelihoods (8.7) and (8.8) analytically is easy if X This problem is particularly easy to parameterize retrospectively by
is discrete, as the conditional expectations are simply sums of terms. specifying the distributions of X given Y, and W given (X, Y). With dif-
Likelihoods in which X has some continuous components can be com- ferential measurement error, the six free parameters are αxd = Pr(W =
puted using a number of different approaches. In some problems, the 1|X = x, Y = d) and γd = Pr(X = 1|Y = d), x = 0, 1 and d = 0, 1.
loglikelihood can be computed or very well approximated analytically, For the validation data where X is observed, the likelihood is
for example, linear, probit, and logistic regression with (W, X) normally nv Y
1 Y 1
1 Y
Y
distributed; see Section B.7.2. In most problems that we have encoun- pr(Wi = w, Xi = x|Yi = y)I(Wi =w,Xi =x,Yi =y) , (8.11)
tered, X is a scalar or a 2 × 1 vector. In these cases, standard numerical i=1 y=0 x=0 w=0
methods, such as Gaussian quadrature, can be applied, although they are
not always very good. When sufficient computing resources are available, where nv is the size of the validation data set. For the nonvalidation
the likelihood can be computed using Monte Carlo techniques (Section data we integrate out X by a simple summation pr(W = w|Y = y) =
B.7.1). One of the advantages of a Bayesian analysis by simulation meth- pr(W = w, X = 0|Y = y) + pr(W = w, X = 1|Y = y), and then the
ods is that X can be integrated out as part of the processing of sampling likelihood is
n
Y 1
Y 1
Y
from the posterior; see Chapter 9. nv

pr(Wi = w|Yi = y)I(Wi =w,Yi =y) , (8.12)


i=1 y=0 w=0
8.4 Cervical Cancer and Herpes
where nnv = n − nv is the size of the validation data set. The likeli-
In the cervical cancer example of Section 1.6.10, (W, Y, X) are all bi- hood for the data set itself is the product of (8.11) and (8.12), with all
nary, W is a possibly misclassified version of X, and there is no variable probabilities expressed in terms of the α’s and γ’s.
Z. In principle, MC-SIMEX (Section 5.6.2) could be used, but maximum A maximum likelihood analysis yielded βb = 0.609 (std. error = 0.350).
likelihood is so simple with binary X that there seems little reason to use For nondifferential misclassification, the analysis simplifies in that αx0 =
MC-SIMEX here. It would obviously be of interest to compare maximum αx1 = αx , and then βb = 0.958 (std. error = 0.237).
likelihood approaches to misclassification of X with MC-SIMEX. The noticeable difference in βb between assuming differential and non-

190 191
differential misclassification suggests that, in this example, misclassifi- 20
Framingham, Standard Devation versus Mean

cation is differential. In Section 9.9, this issue is further explored by


comparing estimates of α0d = Pr(W = 1|X = x, Y = d) for d = 1 and 15

d = 0. 10

5
8.5 Framingham Data
0
In this section we describe an example that has classical error structure. 90 100 110 120 130 140 150
Original Scale
160 170 180 190 200

The Framingham heart study was described in Section 1.6.6. Here X,


the transformed long-term systolic blood pressure, is not observable, and 0.2

the likelihoods of Section 8.2.2 are appropriate. The sample size is n =


0.15
1,615. As before, Z includes age, smoking status, and serum cholesterol.
Transformed systolic blood pressure (SBP) is log(SBP−50). 0.1

At Exam #2, the mean and standard deviation of transformed systolic


0.05
blood pressure are 4.374 and .226, respectively, while the corresponding
figures at Exam #3 are 4.355 and .229. The difference between mea- 0
3.8 4 4.2 4.4 4.6 4.8 5
surements at Exam #2 and Exam #3 has mean 0.019 and standard Transformed Scale
deviation .159, indicating a statistically significant difference in means
due largely to the sample size (n = 1, 615). However, the following anal- Figure 8.2 Framingham systolic blood pressure data. Plot of cubic regres-
ysis will allow for differences in the means. The standard deviations are sion fits to the regression of intraindividual standard deviation against the
sufficiently similar that we will assume that the two exams have the same mean. Top figure is in the original scale; bottom is for the transformation
variability. log(SBP−50). The noticeable trend in the top figure suggests that the measure-
We write W and T for the transformed SBP at Exams 3 and 2, re- ment errors are heteroscedastic in that scale.
spectively. Since Exam #2 is not a true replicate, we are treating it
as a second measure, differing from Exam #3 only in the mean. Thus, possible to compute the likelihood (8.7) analytically; see section B.7.2.
W = X + U and T = α11 + X + V, where U and V are independent We used this analytical calculation, rather than numerical integration.
with common measurement error variance σu2 , and α11 represents the When using all the data, the likelihood estimate for systolic blood pres-
(small) difference between the two exams. sure had a logistic coefficient of 2.013 with an (information) estimated
There is justification for the assumption that transformed systolic standard error of 0.496, which is essentially the same as the regression
blood pressure can be modeled reasonably by an additive model with calibration analysis; compare with Table 5.1.
normally distributed, homoscedastic measurement error. We use the
techniques of Section 1.7. The q-q plot of the differences of transformed
8.6 Nevada Test Site Reanalysis
systolic blood pressure in the two exams is reasonable, although not
perfect, indicating approximate normality of the measurement errors. This section describes a problem that has a combination of Berkson
The regression fits of the intraindividual standard deviation versus the and classical measurement errors, and it is one in which the errors are
mean are plotted in the original and transformed scale in Figure 8.2. multiplicative rather than additive.
The trend in the former suggests heteroscedastic measurement errors, In Section 1.8.2, we described a simulation study in a model where
while the lack of pattern in the latter suggests the transformation is a part of the measurement error was Berkson and part was classical. The
reasonable one. excess relative risk model (1.4) was used, and here for convenience we
Since the transformed systolic blood pressures are themselves approx- redisplay the model:
imately normally distributed, we will also assume that X given Z is
t pr(Y = 1|X, Z) = H {β0 + βz Z + log(1 + βx X)} . (8.13)
normally distributed with mean α21 Z and variance σx2 .
Using the probit approximation to the logistic (Section 4.8.2), it is The parameter βx is the excess relative risk parameter. The mixture of

192 193
classical and Berkson error models is given in equations (1.5) and (1.6). Nevada Test Site, Thyroiditis
Again, for convenience, we redisplay this multiplicative measurement
error model:
log(X) = log(L) + Ub , (8.14)
log(W) = log(L) + Uc , (8.15)
where Ub denotes Berkson-type error, and Uc denotes classical-type er- Mixture, MLE
ror. The standard classical measurement error model (1.1) is obtained by
setting Ub = 0. The Berkson model (1.2) is obtained by setting Uc = 0.
In this section, we analyze the Nevada Test Site data, with outcome
variable thyroiditis. The original data and their analyses were described
Mixture, RegCal
by Stevens, Till, Thomas, et al. (1992); Kerber et al. (1993); and Simon,
Till, Lloyd, et al. (1995). We use instead a revised version of these data
that have corrected dosimetry as well as corrected health evaluations
(Lyon, Alder, Stone, et al., 2006). In the risk model (8.13), the predictors
Z consisted of gender and age at exposure, while X is the true dose. Berkson, RegCal
The data file gives an estimate for each individual of the total error
variance in the log scale, but does not separate out the Berkson and
classical uncertainties. In this illustration, we assumed that 40% of the
uncertainty was classical, reflecting the important components due to
dietary measurement error.
In these data, Owen Hoffman suggested the use of strata, because it 0 5 10 15 20 25 30
is known that the doses received by individuals vary greatly, depending Confidence Interval for Excess Relative Risk
on where they were located. Thus, we fit models (8.14) and (8.15) in five
different strata, namely (a) Washington County, (b) Lincoln County, Figure 8.3 Nevada Test Site analysis of the excess relative risk for thyroiditis.
(c) Graham County, (d) all others in Utah, and (e) all others. In these Estimates are the vertical lines, while the horizontal lines are 95% confidence
models, log(L) was normally distributed in a regression on gender and intervals. Included are a pure Berkson analysis and two analyses that assume
a mixture of classical and independent Berkson measurement errors, with 40%
age, with the regression coefficients and the variance about the mean
of the measurement error variance being classical. Note how the classical com-
depending on the strata. ponent greatly increases the excess relative risk estimate.
We performed three analyses of these data:
• The first model assumed that all uncertainty was Berkson and em-
ployed regression calibration. Specifically, since log(X) = log(W) + test for the hypothesis of no effect due to radiation is < 10−7 . Note
Ub , with Ub normally distributed with mean zero and known vari- how acknowledging the classical measurement error greatly increases the
2
ance σub depending on the individual, E(X|W) = Wexp(σub 2
/2): This estimated excess relative risk, by a factor nearly of two. Of potential
latter value was used in place of true dose. scientific interest is that the upper ends of the confidence intervals shown
in Figure 8.3 change from 11.1 for the pure Berkson analysis to 18.8 for
• The second model assumes that 40% of the uncertainty is classical. We
the mixture of Berkson and classical analysis, indicating the potential
again implemented regression calibration; see below for the details.
for a much stronger dose effect.
• The third analysis was a maximum likelihood analysis; see below for
details.
8.6.1 Regression Calibration Implementation
The results are described in Figure 8.3. The Berkson analysis yields
an excess relative risk estimate βbx = 5.3, regression calibration βbx = 8.7, Here is how we implemented regression calibration for the Nevada Test
and maximum likelihood βbx = 9.9. The p-value using a likelihood ratio 2
Site thyroiditis example. Let σi,tot be the variance of the uncertainty in

194 195
true dose for an individual. Then the Berkson error variance for that already done. Remember that log(Xi ) is normally distributed with mean
2 2
individual is σi,ub = 0.6 × σi,tot , while the classical error variance for µxi = α0s + Zti α1s and variance σxi
2
.
2 2
that individual is σi,uc = 0.4 × σi,tot . We assumed that, for an individual We now just apply (8.7). Let φ(x, µ, σ 2 ) be the normal density func-
i who falls into stratum s, log(Li ) was normally distributed with mean tion, with mean µ and variance σ 2 evaluated at x. Then, the likelihood
α0s + Zti α1s and variance σLs 2
. This means that log(Xi ) and log(Wi ) are function for Yi and log(Wi ) given Zi is
jointly normally distributed with common mean α0s + Zti α1s , variances Z
Y 1−Yi
2
σxi = σLs2 2
+ σi,ub and σwi 2
= σLs2 2
+ σi,uc , respectively, and covariance [H{s, Zi }] i [1 − H{s, Zi }]
2 2
σLs . Define ρi = σLs /(σxi σwi ). By simple algebraic calculations, this 2 2
×φ{log(Wi ), µiw|x (Zi , s), σiw|x }φ(s, µxi , σxi )ds.
means that log(Xi ) given (Wi , Zi ) is normally distributed with mean
(α0s +Zti α1s )(1−σLs2 2
/σwi )+(σLs2 2
/σwi )log(Wi ) and variance σxi2
(1−ρ2i ). Unfortunately, this likelihood function does not have a closed form.
As in Section 4.5, this is a multiplicative measurement error with a Rather than computing the integral using Monte Carlo methods (Section
lognormal structure, and hence it follows that B.7.1), we used numerical quadrature. Specifically, Gaussian quadra-
E(Xi |Wi , Zi ) = exp{(α0s + Zti α1s )(1 − σLs
2 2
/σwi 2
) + σxi (1 − ρ2 )/2}. Rture (Thisted, 1988) is a wayP of approximating integrals of the form
g(t)exp(−t2 )ds as a sum j wj g(tj ). To apply this, we have to do
2 a change p of variables of the likelihood, namely, to replace s by t =
It remains to estimate α0s , α1s , and σLs , and here we use the method
2 , so that the likelihood becomes
(s − µxi )/ 2σxi
of moments. First note that the regression of log(Wi ) on log(Li ) for
a person in stratum s is just α0s + Zti α1s , so that α0s and α1s can be Z · q ¸Yi · q ¸1−Yi
estimated by this regression. Since the residual variance for an individual H{µxi + t 2σxi 2 ,Z } 1 − H{µ + t 2σ 2 ,Z }
i xi xi i
2 2 2
is σwi = σLs + σi,uc , if the (known) mean of the classical uncertainties q
σi,uc within stratum s is σs2 , then the mean squared error of the regression
2
×φ{log(Wi ), µiw|x (Zi , µxi + t 2σxi 2 ), σ 2 2
iw|x }exp(−t )dt.
2
has mean σLs + σs2 : Subtracting σs2 from the observed regression mean
squared error yields a method of moments estimate of σLs 2
. In our implementation, we started from the regression calibration esti-
mates and used the function optimizer “fmincon” in MATLAB.

8.6.2 Maximum Likelihood Implementation


8.7 Bronchitis Example
The implementation of maximum likelihood is fairly straightforward.
The four steps described in Figure 8.1 work as follows. Basically, we are This section describes a cautionary tale about identifying Berkson error
going to compute the likelihood function for Y and log(W) given Z, and models, which we believe are often better analyzed via Bayesian meth-
we will work with the log scale. ods.
The first step, of course, is the regression model (8.13). Write In occupational medicine, an important problem is the assessment of
the health hazard of specific harmful substances in a working area. One
H{log(X), Z} = H [β0 + βz Z + log{1 + βx exp{log(X)}] . approach to modeling assumes that there is a threshold concentration,
The the likelihood function if log(X) could be observed is just a typical called the threshold limiting value (TLV), under which there is no risk
logistic likelihood: due to the substance. Estimating the TLV is of particular interest in the
industrial workplace. We consider here the specific problem of estimating
Y 1−Y
[H{log(X), Z}] [1 − H{log(X), Z}] . the TLV in a dust-laden mechanical engineering plant in Munich.
The second step is the error model, which is really of a form described The regressor variable X is the logarithm of 1.0 plus the average dust
2
in (2.1) of Section 2.2. Let ρi∗ = σLs 2
/σxi , so that given {log(Xi ), Zi }, concentration in the working area over the period of time in question,
log(Wi ) is normally distributed with mean and Y is the indicator that the worker has bronchitis. In addition, the
duration of exposure Z1 and smoking status Z2 are also measured. Fol-
µiw|x {Zi , log(Xi )} = (α0s + Zti α1s )(1 − σLs
2 2
/σxi 2
) + (σLs 2
/σxi )log(Xi ) lowing Ulm (1991), we based our analysis upon the segmented logistic
2
and variance σiw|x 2
= σwi (1 − ρ2i∗ ). model
The third step is the distribution of log(X) given Z, which we have pr(Y = 1|X, Z)

196 197
= H {β0 + βx,1 (X − βx,2 )+ + βz,1 Z1 + βz,2 Z2 } , (8.16) (Step 4). For simplicity, write

where (a)+ = a if a > 0 and = 0 if a ≤ 0. The parameter of primary H(Y, X, Z, B) = H {β0 + βx,1 (X − βx,2 )+ + βz,1 Z1 + βz,2 Z2 } .
interest is βx,2 , the TLV. Let φ(x, µ, σ 2 ) be the normal density function with mean µ and variance
It is impossible to measure X exactly, and instead sample dust con- σ 2 evaluated at x. Then, as in the Nevada Test Site example in Section
centrations were obtained several times between 1960 and 1977. The re- 8.6, from (8.8) the likelihood function is
sulting measurements are W. There were 1,246 observations: 23% of the Z
workers reported chronic bronchitis, and 74% were smokers. Measured {H(Y, x, Z, B)}Y {1 − H(Y, x, Z, B)}1−Y φ(x, 0, σu2 )dx
dust concentration had a mean of 1.07 and a standard deviation of 0.72. Z
The durations Z1 were effectively independent of concentrations, with = {H(Y, W + s(2σu2 )1/2 , Z, B)}Y {1 − H(·)}1−Y exp(−s2 )ds,
correlation 0.093, compare with Ulm’s (1991) Figure 3. Smoking status is
also effectively independent of dust concentration, with the smokers hav- which can be computed by Gaussian quadrature. Note that the maxi-
ing mean concentration 1.068, and the nonsmokers having mean 1.083. mization is supposed to be over B and σu2 .
Thus, in this example, for likelihood calculations we will treat the Z’s
as if they were independent of X. 8.7.1.1 Theoretical Identifiability
A preliminary segmented regression analysis ignoring measurement
error suggested an estimated TLV βbx,2 = 1.27. We will call this the All the parameters, including the Berkson error variance, are identified,
naive TLV. in the sense that if the sample size were infinite, then all parameters
would be known. Küchenhoff and Carroll (1997) showed this fact in
As in Section 8.6, the data really consist of a complex mixture of Berk-
probit regression, and it is generally true in nonlinear models.
son and classical errors. The classical errors come from the measures of
dust concentration in factories, while the Berkson errors come from the
usual occupational epidemiology construct, wherein no direct measures 8.7.2 Effects of Measurement Error on Threshold Models
of dust exposure are taken on individuals, but instead plant records of
It is first of all interesting to understand what the effects of measurement
where they worked and for how long are used to impute some version
error are on segmented regression models or threshold models. We made
of dust exposure. In this section, for illustrative purposes, we will as-
the point in Section 1.1 that measurement error causes loss of features.
sume a pure Berkson error structure. In the first edition of this book, we
Here, that loss is quite profound. In Figure 8.4, we graph (solid line)
reported a much different classical error analysis with a flexible distribu-
the true probabilities as a function of X in a segmented logistic regres-
tion for X; see also Küchenhoff and Carroll (1995). A Bayesian treatment
sion with intercept β0 = 0, slope βx = 3 and threshold = 0. Note the
of segmented regression can be found in Gössi and Küchenhoff (2001);
abrupt change in the probability surface at the threshold. We also plot
in Carroll, Roeder, and Wasserman (1999), who analyzed the Bronchitis
(dashed line) the actual probabilities of the observed data as a function
example assuming Berkson errors and a semiparametric error distribu-
of W when there is Berkson measurement error with variance σu2 = 1.
tion; in Section 9.5.4, where a classical error model is assumed and either
Note how the observed data have smooth probabilities: Indeed, the true
validation or replication data are available; and in Section 9.7.3, where
threshold nature of the data have been obliterated by the measurement
the Bronchitis data are analyzed assuming normally distributed Berkson
error. One can easily imagine, then, that trying to identify the threshold
errors. The likelihood analysis of segmented regression when validation
or the error variance σu2 is likely to be challenging.
data are available is discussed by Staudenmayer and Spiegelman (2002),
who assumed a Berkson error model.
8.7.3 Simulation Study and Maximum Likelihood

8.7.1 Calculating the Likelihood We performed a small simulation study to show how difficult it might
be to estimate a threshold model in a Berkson error case. We fit the
We have already identified the model if X were observed (Step 1), and threshold model (8.16) to the observed data and used this fit to get
we have decided upon a Berkson error model with measurement error estimates of the parameters. We kept the W and Z data fixed, as
variance (Steps 2 and 3), so it remains to compute the likelihood function in the actual data, used the naive parameter estimates as the true

198 199
Threshold Regression Probabilities the posterior distribution, will not be this extreme, for example, would
1
True not be equal to zero when estimating a variance.
Observed

0.9
8.7.4 Berkson Analysis of the Data

0.8
If the reader has been paying attention, the previous discussion is ob-
viously leading up to a problem with the analysis. We applied Berkson
measurement error maximum likelihood to the bronchitis data, and the
Probability

0.7 estimated measurement error variance was σu2 = 0.0! Of course, the sim-
ulation study showed that this can happen in as many as one third of
0.6
data sets, so it is an unfortunate finding but certainly no surprise. In
some sense, this analysis is a cautionary tale that technical identifiabil-
ity does not always lead to practical identifiability. The bioassay data of
0.5 Section 4.7.3 are, of course, the counterpoint to this: There are indeed
problems where technical and real identifiability coincide.
0.4
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
W and X
8.8 Quasilikelihood and Variance Function Models
Figure 8.4 The true probabilities (solid line) as a function of X and the ob- Quasilikelihood and variance function (QVF) models are defined in Sec-
served probabilities (dashed line) as a function of W in segmented Berkson tion A.7. In this approach, we model only the mean and variance func-
logistic regression with intercept β0 = 0, slope βx = 3, threshold = 0, and tions of the response, and not its entire distribution, writing the mean
Berkson measurement error with variance = 1. Note how the observed data
function as E(Y|Z, X) = mY (Z, X, B) and the variance function as
have smooth probabilities, while the true but unobserved data have the abrupt
var(Y|Z, X) = σ 2 g 2 (Z, X, B, θ).
change at the threshold.
Quasilikelihood and variance function techniques require that we com-
pute the mean and variance functions of the observed data (and not the
parameters, and generated large Berkson errors X = W + U, where unobservable data). These are given by
var(U) = var(W) = 0.72.2 = 0.52. We then generated simulation obser-
E(Y|Z, W) = E {mY (·)|Z, W} , (8.17)
vations Y from model (8.16) and ran 200 simulated data sets. We then
fit a maximum likelihood analysis to each simulated data set. © ª
The true TLV in this simulation was βx,2 = 1.27, and the mean of var(Y|Z, W) = σ 2 E g 2 (·)|Z, W + var {mY (·)|Z, W} . (8.18)
the estimates across the simulations was 1.21, very nearly unbiased. The Equations (8.17) and (8.18) define a variance function model. If we knew
true Berkson error variance was 0.52, while the mean estimate over the the functional forms of the mean and variance functions, then we could
simulations was 0.43, only slightly biased. So, one might ask, “What’s apply the fitting and model criticism techniques discussed in Section A.7.
the problem?” The problem is that in 35% of the simulated data sets, the Note how both (8.17) and (8.18) require an estimate of a model for the
MLE for σu2 = 0! It is, to put it mildly, not very helpful when one knows distribution of the unobserved covariate given the observed covariates
that there is Berkson error but an algorithm announces that the Berkson and the surrogate.
error does not exist. This is one of those cases where there is technical A QVF model analysis follows the same pattern of a likelihood anal-
identifiability of a parameter, but the free lunch of identifiability is rather ysis. As described in Figure 8.5, the steps required are as follows.
skimpy.
• Step 1: Specify the mean and variance of Y if X were observed.
This example also illustrates a problem with maximum likelihood
when the likelihood is maximized at the boundary of the parameter • Step 2: Specify a model relating W to X that allows identification
space. Then the MLE takes a value which is the most extreme case of of all model parameters. This will have a classical component, so it
plausible values. In contrast, the usual Bayesian estimator, the mean of is the same as Step 2 for a classical analysis, that is, a model for W

200 201
8.8.2 Details of Step 4 for QVF Models
Step 1: Select the QVF model
mean and variance of Y. The density or mass function of X given (Z, W) is then given by
fW |Z,X (w|z, x, αe1 )fX|Z (x|z, α
e2 )
fX|Z,W (x|z, w) = R .
fW |Z,X (w|z, v, α
e1 )fX|Z (v|z, α
e2 )dµ(v)
From this, one can obtain (8.17) and (8.18) by integration and either
Step 2: Use the W, Z data only analytically or numerically. Thus, for example, (8.17) becomes
and maximum likelihood to Z
estimate via the distribution of E(Y|W, Z) = mY (Z, x, B)fX|Z,W (x|Z, W)dx.
the unobserved X given Z
The sandwich method or the bootstrap can be used for inference, al-
and W. though of course one must take into account the estimation of α
e 1 and
α
e2 , something the bootstrap does naturally.

Step 3: Form the mean and


Bibliographic Notes
variance of the observed data.
Earlier references prior to the first edition of this text include Carroll,
Spiegelman, Lan, et al. (1984) and Schafer (1987, 1993) for probit re-
gression; Whittemore and Gong (1991) in a Poisson model; Crouch and
Spiegelman (1990) and Wang, Carroll and Liang (1997) in logistic re-
Step 4: Compute gression; and Küchenhoff and Carroll (1997) in a change point prob-
quasilikelihood estimates. lem. Some recent references include Turnbull, Jiang, and Clark (1997);
Gould, Stefanski, and Pollock (1997); Spiegelman and Casella (1997);
Lyles, Munoz, Xu, et al. (1999); Buonaccorsi, Demidenko, and Toste-
Figure 8.5 The steps in a quasilikelihood analysis with measurement error. son (2000); Nummi (2000); Higdon and Schafer (2001); Schafer, Lubin,
Ron, et al. (2001); Schafer (2002); Aitkin and Rocci (2002); Mallick,
given (X, Z) and a model for X given Z. We write the densities given Hoffman, and Carroll (2002); Augustin (2004); and Wannemuehler and
by these models as fW |Z,X (w|z, x, α
e1 ) and fX|Z (x|z, α
e2 ), respectively. Lyles (2005).
• Step 3: Do a maximum likelihood analysis of the (W, Z) data only For simple linear regression, Schafer and Purdy (1996) compare max-
to estimate the parameters αe1 and α e2 ; see below for details. imum likelihood estimators with profile-likelihood confidence intervals
to method-of-moments estimators with confidence intervals based on
• Step 4: Form (8.17)–(8.18), the observed data mean and variance
asymptotic normality. They note that in some situations the estima-
functions, and then apply the fitting methods in Section A.7.
tors can have skewed distributions and then likelihood-based intervals
have more accurate coverage probabilities.
8.8.1 Details of Step 3 for QVF Models The cervical cancer data in Section 8.4 has validation data, so that
the misclassification probabilities can be easily estimated. In other ex-
The (reduced) likelihood for a single observation based upon only the
amples, validation data are absent but there is more than one surrogate.
observed covariates is
Z Gustafson (2005) discusses identifiability in such contexts, as well as
fW |Z,X (W|Z, x, α
e1 )fX|Z (x|Z, α
e2 )dµ(x), Bayesian strategies for handling nonidentifiability.

where again the integral is replaced by a sum if X is discrete. The (W, Z)


data are used to estimate (eα1 , α
e2 ) by multiplying this reduced likelihood
over the observations, taking logarithms, and then maximizing.

202 203
CHAPTER 9

BAYESIAN METHODS

9.1 Overview
Over the last two decades, there has been an “MCMC revolution” in
which Bayesian methods have become a highly popular and effective
tool for the applied statistician. This chapter is a brief introduction to
Bayesian methods and their applications in measurement error problems.
The reader new to Bayesian statistics is referred to the bibliographic
notes at the end of this chapter for further reading.
We will not go into the philosophy of the Bayesian approach, whether
one should be an objective or a subjective Bayesian, and so forth. We
recommend reading Efron (2005), who has a number of amusing com-
ments on the differences between Bayesians and Frequentists, and also
on the differences among Bayesians. Our focus here will be how to for-
mulate measurement error models from the Bayesian perspective, and
how to compute them. For those familiar with Bayesian software such as
WinBUGS, a Bayesian analysis is sometimes relatively straightforward.
Bayesian methods also allow one to use other sources of information,
for example, from similar studies, to help estimate parameters that are
poorly identified by the data alone. A disadvantage of Bayesian methods,
which is shared by maximum likelihood, is that, compared to regression
calibration, computation of Bayes estimators is intensive. Another dis-
advantage shared by maximum likelihood is that one must specify a full
likelihood, and therefore one should investigate whether the estimator is
robust to possible model misspecification.

9.1.1 Problem Formulation


Luckily, Bayesian methods start from a likelihood function, a topic we
have already addressed in Chapter 8 and illustrated with a four-step
approach in Figure 8.1.
In the Bayesian approach, there are five essential steps:
• Step 1: This is the same as the first step in a likelihood approach.
Specifically, one must specify a parametric model for every component
of the data. Any likelihood analysis begins with the model one would
use if X were observable.

205
the likelihood of all the data, including W, is formed as if X were
Step 1: Select the likelihood available.
model as if X were observed • Step 4: In the Bayesian approach, parameters are treated as if they
were random, one of the essential differences with likelihood methods.
If one is going to treat parameters as random, then they need to be
given distributions, called prior distributions. Much of the controversy
Step 2: Select the error model among statisticians regarding Bayesian methods revolves around these
and select model for X given Z prior distributions.
• Step 5: The final step is to compute Bayesian quantities, in particular
the posterior distribution of parameters given all the observed data.
There are various approaches to doing this, most of them revolving
around Markov Chain Monte Carlo (MCMC) methods, often based
Step 3: Form the likelihood on the Gibbs sampler. In some problems, such as with WinBUGS,
function as if X were observed users do not actually have to do anything but run a program, and
the appropriate posterior quantities become available. In other cases,
though, either the standard program is not suitable to the problem,
or the program does not work well, in which case one has to tailor the
approach carefully. This usually involves detailed algebraic calculation
Step 4: Select priors of what are called the complete conditionals, the distribution of the
parameters, and the X values, given everything else in the model. We
give a detailed example of this process in Section 9.4.

9.1.2 Posterior Inference


Step 5: Compute complete
conditionals. Perform MCMC Bayesian inference is based upon the posterior density, which is the
conditional density of unobserved quantities (the parameters and un-
observed covariates) given the observed data, and summarizes all of the
information about the unobservables. For example, the mean, median, or
Figure 9.1 Five basic steps in performing a Bayesian analysis of a measure- mode of the posterior density are all suitable point estimators. A region
ment error problem. If automatic software such as WinBUGS is used, the with probability (1 − α) under the posterior is called a credible set, and
complete conditionals, which often require detailed algebra, need not be com- is a Bayesian analog to a confidence region. To calculate the posterior,
puted. one can take the joint density of the data and parameters and, at least
in principle, integrate out the parameters to get the marginal density of
• Step 2: This step too agrees with the likelihood approach. The next the data. One can then divide the joint density by this marginal density
crucial decision is the error model that is to be chosen. This could be to get the posterior density.
a classical error model, a Berkson model, or a combination of the two. There are many “textbook examples” where the posterior can be com-
If one has classical components in the measurement error model, then puted analytically, but in practical applications this is often a non trivial
typically one also needs to specify a distribution for the unobserved problem requiring high-dimensional numerical integration. The compu-
X given the observable covariates Z. tational problem has been the subject of much recent research. The
method currently receiving the most attention in the literature is the
• Step 3: The typical Bayesian approach treats X as missing data, Gibbs sampler and related methods such as the Metropolis–Hastings
and, in effect, imputes it multiple times by drawing from the condi- algorithm (Hastings, 1970; Geman & Geman, 1984; Gelfand & Smith,
tional distribution of X given all other variables. Thus, at this step, 1990).

206 207
The Gibbs sampler, which is often called Markov Chain Monte Carlo tion which itself has unknown hyperparameters. Lindley and El Sayyad
(MCMC), generates a Markov chain whose stationary distribution is the (1968) wrote the first Bayesian paper on functional models, covering the
posterior distribution. The key feature of the Gibbs sampler is that this linear regression case. Because of their complexity, we do not consider
chain can be simulated using only the joint density of the parameters, the Bayesian functional models here.
unobserved X-values and the observed data, for example, the product A second possibility intermediate between functional and hard-core
of the likelihood and the prior, and not the unknown posterior density structural approaches is to specify flexible distributions, much as we
which would require an often intractable integral. If the chain is run suggested in Section 8.2.2. Carroll, Roeder, and Wasserman (1999) and
long enough, then the observations in a sample from the chain are ap- Richardson, Leblond, Jaussent, and Green (2002) used mixtures of nor-
proximately identically distributed, with common distribution equal to mal distributions. Gustafson, Le, and Vallee (2002) used an approach
the posterior. Thus posterior moments, the posterior density, and other based on approximating the distribution of X by a discrete distribution.
posterior quantities can be estimated from a sample from the chain. In this chapter, the Zi ’s are treated as fixed constants, as we have
The Gibbs sampler “fills in” or imputes the values of the unobserved done before in non-Bayesian treatments. This makes perfect sense, since
covariates X by sampling from their conditional distribution given the Bayesians only need to treat unknown quantities as random variables.
observed data and the other parameters. This type of imputation differs Thus, the likelihood is the conditional density of the Yi ’s, Wi ’s, and any
from the imputation of regression calibration in two important ways. Xi ’s that are observed, given the parameters and the Zi ’s. The posterior
First, the Gibbs sampler makes a large number of imputations from the is the conditional density of the parameters given all data, that is, the
conditional distribution of X, whereas regression calibration uses a single Zi ’s, Yi ’s, Wi ’s, and any observed Xi ’s.
imputation, namely the conditional expectation of X given W and Z.
Second, the Gibbs sampler conditions on Y as well as W and Z when
imputing values of X, but regression calibration does not use information 9.1.4 Modularity of Bayesian MCMC
about Y when imputing X.
The beauty of the Bayesian paradigm combined with modern MCMC
computing is its tremendous flexibility. The technology is “modular”
9.1.3 Bayesian Functional and Structural Models in that the methods of handling, for example, multiplicative error, seg-
mented regression and the logistic regression risk model can be combined
We made the point in Section 2.1 that our view of functional and struc- easily. In effect, if one knows how to handle these problems separately, it
tural modeling is that in the former, we make no or at most few as- is often rather easy to combine them into a single analysis and program.
sumptions about the distribution of the unobserved X-values. Chapters
5 and 7 describe methods that are explicitly functional, while regression
calibration is approximately functional. 9.2 The Gibbs Sampler
In contrast, likelihood methods (Chapter 8) and Bayesian methods
As in Chapter 8, especially equation (8.7), the first three steps of our
necessarily must specify a distribution for X in one way or another, and
Bayesian paradigm result in the likelihood computed as if X were observ-
here the distinction between functional and structural is blurred. Effec-
able. Dropping the second measure T, this likelihood for an individual
tively, structural Bayesian likelihood modeling imposes a simple model
observation becomes
on X, such as the normal model, while functional methods specify flexible
distributions for X. We use structural models in this chapter. Examples f (Y, W, X|Z, Ω) = fY |Z,X (Y|Z, X, B)
of this approach are given by Schmid and Rosner (1993), Richardson
×fW |Z,X (W|Z, X, α
e1 )fX|Z (X|Z, α
e2 ),
and Gilks (1993), and Stephens and Dellaportas (1992).
There are at least several ways to formulate a Bayesian functional where Ω is the collection of all unknown parameters. As in the fourth
model. One way would allow the distribution of X to depend on the step of the Bayesian paradigm, we let Ω have a prior distribution π(Ω).
observation number, i. Müller and Roeder (1997) used this idea for the The likelihood of all the ”data” then becomes
case when X is partially observed. They assume that the (Xi , Zi , Wi ) n
Y
are jointly normally distributed with mean µi and covariance matrix π(Ω) f (Yi , Wi , Xi |Zi , Ω).
Σi , where θi = (µi , Σi ) is modeled by a Dirichlet process distribu- i=1

208 209
To keep this section simple, we have not included the possibility of val- • Quantities such as the posterior mean and posterior quantiles are es-
idation data here, but that could be done with only some additional timated by the sample mean and quantiles of Ω1 , Ω2 , . . ., while kernel
effort, mostly notational. To keep notation compact, we will write the density estimates are used to approximate the entire posterior density
e X,
ensemble of Y, X, etc., as Y, e etc. This means that the likelihood can or the marginal posterior density of a single parameter or subset of
be expressed as parameters.
e W,
π(Ω)f (Y, f X|
e Z,
e Ω). An important point is that the first two steps do not require that
The posterior distribution of Ω is then one evaluates the integral in the denominator on the right-hand sides of
R (9.2), (9.3), and (9.4).
¯ π(Ω) f (Y, e W,
f x e Ω)de
e|Z, x
¯e f e Generating pseudorandom observations from (9.4) is the heart of the
f (Ω¯Y, W, Z) = R . (9.1)
e f
π(ω)f (Y, W, x e
e|Z, ω)de
xdω Gibbs sampler. Often the prior on ωj is conditionality conjugate so that
the full conditional for ωj is in the same parametric family as the prior,
The practical problem is that, even if the integration in x e can be ac-
for example, both are normal or both are inverse-gamma; see Section
complished or approximated as in Chapter 8, the denominator of (9.1)
A.3 for a discussion of the inverse-gamma distribution. In such cases, the
may be very difficult to compute. Numerical integration typically fails to
denominator of (9.4) can be determined from the form of the posterior
provide an adequate approximation even when there are as few as three
and the integral need not be explicitly calculated.
or four components to Ω.
If we do not have conditional conjugacy, then drawing from the full
The Gibbs sampler is one solution to the dilemma. The Gibbs sampler
conditional of ωj is more difficult. In this situation, we will use a Metro-
is an iterative, Monte Carlo method consisting of the following main
polis–Hastings, step which will be described soon. The Metropolis–Hast-
steps, starting with initial values of Ω:
ings algorithm does not require that the integral in (9.4) be evaluated.
• Generate a sample of the unobserved X-values by sampling from their
posterior distributions given the current value of Ω, the posterior dis-
tribution of Xi being 9.3 Metropolis–Hastings Algorithm
f (Yi , Wi , Xi |Zi , Ω)
f (Xi |Yi , Wi , Zi , Ω) = R . (9.2) The Metropolis–Hastings algorithm (MH algorithm) is a very versatile
f (Yi , Wi , x|Zi , Ω)dx
and flexible tool, and even includes the Gibbs sampler as a special case.
As we indicate below, this can be done without having to evaluate Suppose we want to sample from a certain density, which in applications
the integral in (9.2). to Bayesian statistics is the posterior, and that the density is Cf (·),
• Generate a new value of Ω from its posterior distribution given the where f is known but the normalizing constant C > 0 is difficult to eval-
observed data and the current generated X-values, namely, uate; see, for example, (9.3). The MH algorithm uses f without knowl-
¯ e W,
π(Ω)f (Y, f X|
e Z,
e Ω) edge of C to generate a Markov chain whose stationary distribution is
¯e f e e
f (Ω¯Y, W, Z, X) = R . (9.3) Cf (·).
e f e e
π(ω)f (Y, W, X|Z, ω)dω To simplify the notation, we will subsume the unobserved X into Ω;
Often, this is done one element of Ω at a time, holding the others this involves no loss of generality, since a Bayesian treats all unknown
fixed (as described below, here too we do not need to compute the quantities in the same way. Suppose that the current value of Ω is Ωcurr .
integral). Thus, for example, if the j th value of Ω is ωj , and the other The idea is to generate (see below) a ”candidate” value Ωcand and either
components of Ω are Ω(−j) , then the posterior in question is simply accept it as the new value or reject it and stay with the current value.
e W,
f Z,
e X,
e Ω(−j) ) Over repeated application, this process results in random variables with
f (ωj |Y, (9.4) the desired distribution.
e W,
π(ωj , Ω(−j) )f (Y, f X|
e Z,
e ωj , Ω(−j) ) Mechanically, one has to have a candidate distribution, which may
=R . depend upon the current value. We write this candidate density as
e f e e
π(ω , Ω(−j) )f (Y, W, X|Z, ω ∗ , Ω(−j) )dω ∗

j j j
q(Ωcand |Ωcurr ). Gelman, Stern, Carlin, and Rubin (2004) call q(·|·) a
• Repeat this many times. Discard the first few of the generated sam- “jumping rule,” since it may generate the jump from Ωcurr to Ωcand .
ples, the so-called burn-in period. Thus, a candidate Ωcand is generated from q(·|Ωcurr ). This candidate is

210 211
accepted and becomes Ωcurr with probability (1995), Gelman et al. (2004), and in many other books and papers. See
½ ¾ Roberts and Rosenthal (2001) for more discussion about scaling of MH
f (Ωcand )q(Ωcurr |Ωcand )
r = min 1, . (9.5) jumping rules.
f (Ωcurr )q(Ωcand |Ωcurr )
More precisely, a uniform(0,1) random variable V is drawn, and then we
9.4 Linear Regression
set Ωcurr = Ωcand if V ≤ r.
The popular “random-walk” MH algorithm uses q(Ωcand |Ωcurr ) = h( In this section, an example is presented where the full conditionals are
Ωcand − Ωcurr ) for some probability density h. Often, as in our examples, all conjugate. For those new to Bayesian computations, we will show
h(·) is symmetric so that in some detail how the full conditionals can be found. In the following
½ ¾ sections, this example will be modified to models where some, but not
f (Ωcand )
r = min 1, . (9.6) all, full conditionals are conjugate.
f (Ωcurr ) Suppose we have a linear regression with a scalar covariate X measured
The “Metropolis–Hastings within Gibbs algorithm” uses the MH al- with error and a vector Z of covariates known exactly. Then the first
gorithm at those steps in a Gibbs sampler where the full conditional is three steps in Figure 9.1 are as follows. The so-called “outcome model”
difficult to sample. Suppose sampling ωj is one such step. If we generate for the outcome Y given all of the covariates (observed or not) is
the candidate ωj,cand from h(· − ωj,curr ) where h is symmetric and ωj,curr
Yi = Normal(Zti βz + Xi βx , σǫ2 ). (9.7)
is the current value of ωj , then r in (9.6) is
( ) Suppose that we have replicates of the surrogate W for X. Then the
e W,
f (ωj,cand |Y, f Z,
e ωℓ,curr for ℓ 6= j) so-called “measurement model” is
r = min 1, .
e W,
f (ωj,curr |Y, f Z,
e ωℓ,curr for ℓ 6= j)
Wi,j = Normal(Xi , σu2 ), j = 1, . . . , ki . (9.8)
Often, h is a normal density, a heavy-tailed normal mixture, or a t- Finally, suppose that the “exposure model” for the covariate measured
density. The scale parameter of this density should be chosen so that with error, X, given Z is
typical values of ωj,cand are neither too close to nor too far from ωj,curr .
If ωj,cand is too close to ωj,curr with high probability, then the MH algo- Xi = Normal(α0 + Zti αz , σx2 ). (9.9)
rithm takes mostly very small steps and does not move quickly enough. If The term exposure model comes from epidemiology, where X is often
ωj,cand is generally too far from ωj,curr , then the probability of acceptance exposure to a toxicant.
is small. To get good performance of the Metropolis within Gibbs algo- For this model, it is possible to have conjugate priors for all of the full
rithm, we might use a Normal(0, σ 2 ) proposal density where σ 2 is tuned conditionals. The prior we will use is that independently
to the algorithm so that the acceptance probability is between 25% and
50%. Gelman, Carlin, Stern, and Rubin (2004, p. 306) state that the op- βx = Normal(0, σβ2 ), βz = Normal(0, σβ2 I)
timal jumping rule has 44% acceptance in one dimension and about 23%
acceptance probability in high dimensions when the jumping and target α0 = Normal(0, σα2 ), αz = Normal(0, σα2 I),
densities have the same shape. To allow for occasional large jumps, one
σǫ2 = IG(δǫ,1 , δǫ,2 ), σu2 = IG(δu,1 , δu,2 ), σx2 = IG(δx,1 , δx,2 ).
might instead use a heavy-tailed normal mixture of 90% Normal(0, σ 2 )
and 10% Normal(0, Lσ 2 ), where L might be 2, 3, 5, or even 10. This As discussed in Section A.3, this prior is conjugate for the full condition-
density is very easy to sample from, since we need only generate inde- als. Here IG(·, ·) is the inverse gamma density, and the hyperparameters
pendent Z ∼ Normal(0, 1) and U ∼ [0, 1]. Then, we multiply Z by σ or
√ σβ and σµ are chosen to be “large” and the δ hyperparameters to be
L σ according as U ≤ 0.9 or U > 0.9. The Normal(0, Lσ 2 ) component “small” so that the priors are relatively noninformative. In particular,
gives the mixture heavy tails and allows the sampler to take large steps because σβ and σµ are large, using a mean of zero for the normal priors
occasionally. One can experiment with the value of L to see which gives should not have much influence on the posterior. See Section A.3 for the
the best mixing, that is, the least autocorrelation in the sample. definition of the inverse gamma distribution and discussion about choos-
More information on the Gibbs sampler and the MH algorithm can ing the hyperparameters of an inverse gamma prior. The unknowns in
be found in Roberts, Gelman, and Gilks (1997), Chib and Greenberg this model are (βx , βz , σǫ , σx , σu ), (X1 , . . . , Xn ), and (α0 , αz , σx ).

212 213
Define rather than known. Therefore, C will vary on each iteration of the Gibbs
µ ¶ µ ¶
Zi βz sampler. The parameters ∆ and σǫ will also vary, even if there is no
Ci = , Y = (Y1 , ..., Yn )t , and β = . measurement error.
Xi βx
The full conditional for α = (α0 , αzt )t can be found in the same way
The likelihood for a single observation is
as for β. First, analogous to (9.11),
1 ½ Pn ¾
f (Yi , Wi , Xi |Zi , Ω) = (2π)−3/2 t 2
αt α
σx σǫ σuki i=1 {Xi − (α0 + Zi αz )}
f (α|others) ∝ exp − − 2 .
¡ ¢2 2σx2 2σα
× exp{− Yi − C ti β /(2σǫ2 )} (9.10)
n P o Let Di = (1 Zti )t and let D be the matrix with ith row equal to Dit . Also,
ki
×exp − j=1 (Wi,j − Xi )2 /(2σu2 ) − (Xi − α0 − Zti αz )2 /(2σx2 ) . let η = σx2 /σα2 . Then, analogous to (9.13),
The joint likelihood is, of course, the product over index i of the terms n¡ ¢−1 t ¡ ¢−1 o
f (α|others) = N Dt D + ηI D X, σx2 Dt D + ηI , (9.14)
(9.10). The joint density of all observed data and all unknown quantities
(parameters and true X’s for nonvalidation data) is the product of the
where X = (X1 , . . . , Xn )t .
joint likelihood and the joint prior. Pk i
To find the full conditional for Xi , define Wi = J=1 Wi,j /ki . Then
In our calculations, we will use the following:
£ t 2 2
¤
f (Xi |others) ∝ exp −(Yi − Xi βx − Zi βz ) /(2σǫ ) (9.15)
Rule: If for some p-dimensional parameter θ we have © ª
© ¡ ¢ ª ×exp −(Xi − α0 − Zi αz ) /(2σx ) − ki (Wi − Xi ) /(2σu2 ) .
t 2 2 2

f (θ|others) ∝ exp − θt Aθ − 2bθ /2


After some algebra and applying the Rule again, f (Xi |others) is seen to
where the constant of proportionality is independent of θ, then f (θ|others) be normal with mean
is Normal(A−1 b, A−1 ). 2
(Yi − Zti βz )(βx /σǫ2 ) + (α0 + Zti αz )/σx2 + Wi /σW
2
(βx2 /σǫ2 ) + (1/σx2 ) + 1/σW
To find the full conditional for β, we isolate the terms depending on β
in this joint density. We write the full conditional of β given the others and variance
as f (β|others). This gives us © ª−1
( ) (βx2 /σǫ2 ) + (1/σx2 ) + (1/σW
2
) .
n
1 X t 2 1 t Notice that the mean of this full conditional distribution for Xi given
f (β|others) ∝ exp − 2 (Yi − C i β) − 2 β β , (9.11)
2σǫ i=1 2σβ everything else depends on Yi , so that, unlike in regression calibration,
where the first term in the exponent comes from the likelihood and the Yi is used for imputation of Xi .
second comes from the prior. Let C have ith row C ti and let ∆ = σǫ2 /σβ2 . Now we will find the full conditional for σǫ2 . Recall that the prior
Then (9.11) can be rearranged to is IG(δǫ,1 , δǫ,2 ), where from Appendix A.3 we know that the IG(α, β)
· ¸ distribution has mean β/(α − 1) if α > 1 and density proportional to
1 © ¡ ¢ ª x−(α+1) exp(−β/x). Isolating the terms depending on σǫ2 in the joint
f (β|others) ∝ exp − 2 β t C t C + ∆I β + 2C t Yβ , (9.12)
2σǫ density of the observed data and the unknowns, we have
where Y = (Y1 , . . . , Yn )t . Using the Rule, f (σǫ2 |others)
³© ª−1 t ¡ ¢−1 ´ ½ Pn ¾
f (β|others) = N C t C + ∆I C Y, σǫ2 C t C + ∆I . (9.13) −δǫ,2 + − 12 i=1 (Yi − Xi βx − Zti βz )2
∝ (σǫ2 )−(δǫ,1 +n/2+1) exp ,
σǫ2
Here we see how the Gibbs sampler can avoid the need to calculate
integrals. The normalizing constant in (9.12) can be found from (9.13) which implies that
simply by knowing the form of the normal distribution. " ( n
)#
X
Result (9.13) is exactly what we would get without measurement er- f (σǫ2 |others) = IG (δǫ,1 + n/2), δǫ,2 + (1/2) (Yi − Xi βx − Zti βz )2 .
ror, except that for the nonvalidation data the X’s in C are “filled-in” i=1

214 215
1.2 0.8
By similar calculations,
1.1
½ 1
Pn ¾ 0.7
−δx,2 − i=1 (Xi − µx )2 1
f (σx2 |others) ∝ (σx2 )−(δx,1 +n/2+1) exp 2
, 0.6

beta0

x
σx2

beta
0.9
0.5
0.8
so that 0.4
" ( n
)# 0.7
X
f (σx2 |others) = IG (δx,1 + (n/2)), δx,2 + (1/2) (Xi − µx )2 . 0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000
iteration iteration
i=1
Pn 0.5 1.4
Let MJ = i=1 ki /2. Then we have in addition that
1.2
0.4
f (σu2 |others) 1

0
( Pn Pk i )

alpha
0.3

βz
1 2
−δu,2 − j=1 (Wi,j − Xi ) 0.8
2 i=1
∝ (σu2 )−(δu,1 +MJ +1) exp , 0.2
σu2 0.6

0.1 0.4
whence 0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000
   iteration iteration
 1 Xn X ki  0.8 1.2
f (σu2 |others) = IG (δu,1 + MJ ), δu,2 + (Wi,j − Xi )2  . 1.1
 2 i=1 j=1  0.6
1

alphtax|z
The Gibbs sampler requires a starting value for Ω. For βx , βz , and σǫ , 0.4 0.9

x
σ
one can use estimates from the regression of Yi on Zi and Xi (validation 0.2
0.8

data) or W (nonvalidation data). Although there will be some bias, 0.7


0
these naive estimators should be in a region of reasonably high posterior 0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000
probability, and bias should not be a problem since they are being used iteration iteration

only as starting values. We start Xi at Wi . Also, µx and σx can be 1.3 0.4

started at the sample mean and standard deviation of the starting values 1.2 0.35

of the Xi ’s. The replication data can be used to find an analysis of 1.1 0.3
variance estimate of σu2 for use as a starting value; see equation (4.3).

σw

σe
1 0.25

0.9 0.2
9.4.1 Example 0.8
0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000
We simulated data with the following parameters: n = 200, β t = (β0 , βx , iteration iteration

βz ) = (1, 0.5, 0.3), αt = (α0 , αz ) = (1, 0.2), Xi = α0 + αz Zi + Vi , where


Vi ∼ Normal(0, σx2 ) with σx = 1. The Zi were independent Normal(1, 1), Figure 9.2 Every 20th iteration of the Gibbs sampler for the linear regression
and since the analysis is conditioned on their values, their mean and example.
variance are not treated as parameters. Also,
Yi = β0 + βx Xi + βz Zi + ǫi , (9.16) guess at σǫ2 of δǫ,2 /δǫ,1 = 1/3, and that the prior has the amount of
where ǫi = Normal(0, σǫ2 )
with σǫ = 0.3, and Wi,j = Normal(Xi , σu2 ), information that would be obtained from 2δǫ,1 = 6 observations. The
with σu2 = 1. The observed data are (Yi , Zi , Wi,1 , Wi,2 ). same is true of the other δ’s. We experimented with other choices of
We used Gibbs sampling with 10,000 iterations after a burn-in period these prior parameters, in particular, smaller values of the effective prior
of 2,000 iterations. The prior parameters were σβ = σα = 1000, δǫ,1 = sample size, and found that the posterior was relatively insensitive to
3, δǫ,2 = 1, δx,1 = 3, δx,2 = 1, and δu,1 = 3, δu,2 = 1. As discussed the priors, provided that δǫ,2 is not too large.
in Section A.3, the choice of δǫ,1 = 3 and δǫ,2 = 1 suggests a prior Starting values for the unobserved covariates were Xi = Wi = (Wi,1 +

216 217
Wi,2 )/2. The starting values of the parameters were chosen indepen- ever, for linear regression, Bayesian MCMC is a bit of overkill. The
dently: σx , σu , σǫ ∼ Uniform(0.05, 3). The starting value for β and α real strength of Bayesian MCMC is the ability to handle more difficult
were generated from (9.13) and (9.14). problems, for example, segmented regression with multiplicative errors,
Figure 9.2 shows every 20th iteration of the Gibbs sampler. These are a problem that appears not to have been discussed in the literature but
the so-called trace plots that are used to monitor convergence of the which can be tackled by MCMC in a straightforward manner; see Section
Gibbs sampler, that is, at convergence, they should have no discernible 9.1.4.
pattern. No patterns are observed, and thus the sampler appears to have
mixed well. This subset of the iterations was used to make the plots 9.5 Nonlinear Models
clearer; for estimation of posterior means and variance, all iterates were
used. Using all iterates, the sample autocorrelation for βx looks like an The ideas in Section 9.4 can be generalized to complex regression models
AR(1) process with a first-order autocorrelation of about 0.7. We used a in X.
large number (10,000) of iterations to reduce the potentially high Monte
Carlo variability due to autocorrelation. 9.5.1 A General Model
To study the amount of Monte Carlo error from Gibbs sampling and
The models we will study are all special cases of the following general
to see if 10,000 iterations is adequate, the Gibbs sampler was repeated
outcome model
four more times on the same simulated data set but with new random
starting values for σx , σu , and σǫ . The averages of the five posterior [Yi |Xi , Zi , β, θ, σǫ ] = Normal{m(Xi , Zi , β, θ), σǫ2 }, (9.17)
means and standard deviations for βx were 0.4836 and 0.0407. The stan- where
dard deviation of the five posterior means, which estimates Monte Carlo m(Xi , Zi , β, θ) = φ(Xi , Zi )t β1 + ψ(Xi , Zi , θ)t β2 (9.18)
error, was only 0.00093. Thus, the Monte Carlo error of the estimated
is a linear function in β1 , β2 and nonlinear in θ. The functions φ and
posterior means was small relative to the posterior variances, and of
ψ may include nonlinear terms in X and Z, as well as interactions,
course this error was reduced further by averaging the five estimates.
and may be scalar or vector valued. When ψ ≡ 0, particular cases of
The results for the other parameters were similar.
model (9.17) include linear and polynomial regression, interaction mod-
It is useful to compare this Bayesian analysis to a naive estimate that
els, and multiplicative error models. An example of nonlinear component
ignores measurement error. The naive estimate from regressing Yi on
is ψ(Xi , Zi , θ) = |Xi − θ|+ , which appears in segmented regression with
Wi and Zi was βbx = 0.346 with a standard error of 0.0233, so the naive
an unknown break point location. We assume that the other compo-
estimator is only about half as variable as the Bayes estimator, but the
nents of the linear model in Section 9.4 remain unchanged and that Xi
mean square error of the naive estimator will be much larger and due
is scalar, though this assumption could easily be relaxed. The unknowns
almost entirely to bias. The estimated attenuation was 0.701, so the bias-
in this model are (β, θ, σǫ , σu ), (X1 , . . . , Xn ), (α0 , αz , σx ).
corrected estimate was 0.346/0.701 = 0.494. Ignoring the uncertainty
In addition to the priors considered in Section 9.4, we consider a gen-
in the attenuation, the standard error of the bias-corrected estimate is
eral prior π(θ) for θ and assume that all priors are mutually independent.
0.0233/0.701 = 0.0322. This standard error is smaller than the posterior
It is easy to check that the full conditionals f (α|others), f (σx2 |others),
standard deviation but is certainly an underestimate of variability, and
and f (σu2 |others) are unchanged, and that
if we wanted to use the bias-corrected estimator we would want to use " #
X n
the bootstrap or the sandwich formula to get a better standard error. 2 2
f (σǫ |others) = IG δǫ,1 + (n/2), δǫ,2 + (1/2) {Yi − m(Xi , Zi , β, θ)} .
In summary, in this example the Bayes estimate of βx is similar to
i=1
the naive estimate corrected for attenuation, which coincides with the
regression calibration estimate. The Bayes estimator takes more work to Denoting by C(θ) the matrix with i th
row
program but gives a posterior standard deviation that takes into account C ti (θ) = [φ(Xi , Zi ), ψ(Xi , Zi , θ)],
uncertainty due to estimating other parameters. The estimator corrected
for attenuation would require bootstrapping or some type of asymptotic letting β = (β1t , β2t )t , and letting ∆ = σǫ2 /σβ2 , the full conditional for β
−1
approximation, for example, the delta-method or the sandwich formula becomes normal with mean {C(θ)t C(θ) + ∆I} C(θ)t Y and covariance
−1
from estimating equations theory, to account for this uncertainty. How- matrix {C(θ)t C(θ) + ∆I} .

218 219
By grouping together all terms that depend on θ one obtains acceptance. We found that the performance of the Gibbs sampler was
" n # not particularly sensitive to the value of B, and B equal to 1 or 2.5 also
X {Y(1) − ψ(Xi , Zi , θ)β2 }2
f (θ|others) ∝ exp − i
π(θ), (9.19) worked well.
i=1
2σǫ2 As in the linear example, we used five runs of the Gibbs sampler, each
(1) with 10,000 iterations, and with the same starting value distribution as
where Yi = Yi − φ(Xi , Zi )β1 . Since ψ is a nonlinear function in θ, before. The posterior means of β0 , βx,1 , βx,2 , and βz were 1.015, 0.493
this full conditional is generally not in a known family of distributions 0.191, and 0.348, close to the true values of the parameters, which were
regardless of how π(θ) is chosen. One can update θ using a random walk 1.0, 0.5, 0.2, and 0.3. In contrast, the naive estimates obtained by fitting
MH step using Normal(θ, Bσθ2 ) as the proposal density, where B is tuned (9.23) with Xi replaced by Wi were 1.18, 0.427, 0.104, and 0.394, so,
to get a moderate acceptance rate. in particular, the coefficient of X2 was biased downward by nearly 50%.
The full conditional for Xi is The posterior standard deviations were 0.057, 0.056, 0.027, and 0.040,
£ ¤
f (Xi |others) ∝ exp −{Yi − m(Xi , Zi , β, θ)}2 /(2σǫ2 ) (9.20) while the standard errors of the naive estimates were 0.079, 0.052, 0.021,
© 2 2 2 2
¤ and 0.049.
×exp (Xi − α0 − αz Zi ) /(2σx ) + ki (Wi − Xi ) /(2σu ) .
To update Xi , we use a random walk MH step with Normal(Xi , B σu2 /ki )
9.5.3 Multiplicative Error
with the “dispersion” factor, B, chosen to provide a reasonable accep-
tance rate. We now show that a linear regression model (9.7) with multiplicative
We now discuss the details of implementation for polynomial, multi- measurement error is a particular case of model (9.17). As discussed in
plicative measurement error and segmented regression. Section 4.5, this model is relatively common in applications. Indeed, if
X∗i = log(Xi ) and Wi,j

= log(Wi,j ) then the outcome model becomes
9.5.2 Polynomial Regression ∗
Yi = Zti βz + eXi βx + ǫi ,
A particular case of the outcome model (9.17) is the polynomial regres- which can be obtained from (9.17) by setting φ(X∗i , Zi ) = (Zti , eXi )

sion in X and ψ(X∗i , Zi , θ) = 0. The ith row of C := C(θ) is C ti = φ(X∗i , Zi ) and


Yi = Zti βz + Xi βx,1 + · · · + Xpi βx,p + ǫi , (9.21) β = (βzt , βx )t .
2
for some p > 1, where ǫi are independent Normal(0, σǫ ), obtained by We replace the exposure model (9.9) by a lognormal exposure model
setting φ(Xi , Zi ) = (Zti , Xi , . . . , Xpi ) and ψ(Xi , Zi , θ) = 0. The ith row where (9.24) holds with Xi replaced by
of C := C(θ) is C ti = φ(Xi , Zi ) and β = (βzt , βx,1 , . . . , βx,p )t . With this
notation, all full conditionals are as described in Section 9.5.1. In partic- X∗i ∼ Normal(α0 + Zti αz , σx2 ). (9.24)
ular, the full conditional of θ in (9.19) is not necessary because ψ = 0. The measurement model is
In this example, the full conditional for Xi is the only nonstandard dis- ∗
tribution and can be obtained as a particular case of (9.20) as [Wi,j |Xi ] ∼ Normal(X∗i , σu2 ), j = 1, . . . , ki , i = 1, . . . , n. (9.25)
© ª
f (Xi |others) ∝ exp −(Yi − C ti β)2 /(2σǫ2 ) (9.22) With this notation, the full conditionals for this model are the same
© t 2 2 2 2
ª as in Section 9.5.1. One trivial change is that Xi is replaced everywhere
×exp −(Xi − α0 − Zi αz ) /(2σx ) − ki (Wi − Xi ) /(2σu ) . by X∗i , and the full conditional of θ is not needed because ψ = 0.
The full conditional for Xi is nonstandard because C i contains powers To illustrate these ideas, we simulated 200 observations with β0 = 1,
of Xi . βx = 0.3, βz = 0.3, α0 = 0, αz = 0.2, σx = 1, and σu = 1. The Zi
To illustrate these ideas, consider the quadratic regression in X were Normal(−1, 1). We ran the Gibbs sampler with tuning parame-
ter B = 2.5, which gave a 30% acceptance rate. Figure 9.3 shows the
Yi = β0 + βx,1 Xi + βx,2 X2i + βz Zi + ǫi , (9.23)
output from one of five runs of the Gibbs sampler. There were 10,500
with βx,2 = 0.2 and the other parameters unchanged. To update Xi the iterations, of which the first 500 were discarded. One can see that β0
proposal density was Normal(Xi , B σu2 /ki ). After some experimentation, and, especially, βx mix more slowly than the other parameters, yet their
the “dispersion” factor B was chosen to be 1.5 to get approximately 25% mixing seems adequate. In particular, the standard deviation of the five

220 221
1.1 0.5
where we use the notation a+ = min(0, a), θ is the knot, βx,1 is the
1
0.4
slope of Y on X before the knot, and βx,2 is the change in this slope at
the knot. An intercept could be included in Zti βz .
beta0

x
beta
0.9

0.8
0.3 The outcome model (9.26) is a particular case of model (9.17) with
φ(Xi , Zi ) = (Zti , Xi ) and ψ(Xi , Zi , θ) = (Xi − θ)+ . The ith row of C(θ)
is C ti (θ) = {Zti , Xi , (Xi − θ)+ }t and β = (βzt , βx,1 , βx,2 )t . With this
0.7 0.2
0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000
iteration iteration
notation, all full conditionals are as described in Section 9.5.1.
0.35 0.6
To illustrate segmented regression with measurement error and un-
0.3 0.4
known knot location we simulated data with n = 200, J = 2, β0 = 1,

0
βx = 1, βx,2 = 0.8, βz = 0.1, θ = 1, α0 = 1, αz = 0, σǫ = 0.15,

alpha
0.25 0.2
z
β

0.2 0 σx = 1, and σu = 1. The Zi were Normal(1, 1). Since αz = 1, the Xi


were Normal(1, 1) independently of the Zi .
−0.2
0 2000 4000 6000
iteration
8000 10000 0 2000 4000 6000
iteration
8000 10000
We ran the Gibbs sampler 5 times, each with 10,000 iterations. Start-
0.8 1.3
ing values for θ were Uniform(0.5, 1.5). In the prior for θ, we used the
1.2
Normal(µθ , σθ2 ) distribution with µθ = W and σθ = 5 s(W), where s(W)
0.6
was the sample standard deviation of W 1 , . . . , Wn . This prior was de-
alphtax|z

1.1
0.4 signed to have high prior probability over the entire range of observed
x
σ

1
0.2
0.9
values of W. In the proposal density for θ, we used B = 0.01. This
0 0.8
value was selected by trial and error and gave an acceptance rate of 36%
6000 8000 10000 6000 8000 10000
0 2000 4000
iteration
0 2000 4000
iteration and adequate mixing. The posterior mean and standard deviation of θ
1.1 0.4 were 0.93 and 0.11, respectively. The Monte Carlo standard error of the
posterior mean was only 0.005.
1 0.35
Figure 9.4 reveals how well the Bayesian modeling imputes the Xi and
σw

0.9 0.3
leads to good estimates of θ. The top left plot shows the true Xi plotted
σe

0.8 0.25 with the Yi . The bottom right plot is similar, except that instead of
0.7 0.2 the unknown Xi we use the imputed Xi from the 10,000th iteration of
0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000
iteration iteration the fifth run of the Gibbs sampler. Notice that the general pattern of X
versus Y is the same for the true and the imputed Xi . In contrast, a plot
Figure 9.3 Every 20th iteration of the Gibbs sampler for the linear regression of Yi and either Wi or E(X b i |Wi ) = (1 − λ) b Wi + λbWi shows much less
example with multiplicative error. b
similarity with the (Xi , Yi ) plot. Here, λ is the estimated attenuation
and W is the mean of W1 , . . . , Wn .
The plot of the imputed Xi versus Yi shows the existence and lo-
posterior means
√ for βx was 0.0076 giving a Monte Carlo standard er- cation of the knot quite clearly, and it is not surprising that θ can be
ror of 0.0078/ 5 = 0.0034, while the posterior standard deviation of estimated with reasonably accuracy. Of course, this “feedback” of infor-
that parameter was 0.0377, about 10 times larger than the Monte Carlo mation about the Xi to information about θ works both ways. Accurate
standard error. knowledge of θ well helps impute the Xi . One estimates both the Xi and
θ well in this example because their joint posterior has highest probabil-
ity near their true values.
9.5.4 Segmented Regression

A commonly used regression model is a segmented line, that is, two lines 9.6 Logistic Regression
joined together at a knot. This model can be written as
In this section, we assume the same model with nonlinear measurement
Yi = Zti βz + βx,1 Xi + βx,2 (Xi − θ)+ + ǫi , (9.26) error as in Section 9.5 but with a binary outcome. We use the logistic

222 223
¸
4 4 (Xi − α0 − αz Zi )2 (Wi − Xi )2
3.5 3.5 + + 2 . (9.28)
σx2 σW
3 3

2.5 2.5 To update Xi we use a random-walk MH step with the same Normal(Xi ,
y

y
2
2 2 B σW ) proposal density as for polynomial regression. To update β we use
1.5 1.5
a random-walk MH step with proposal density N {β, B ′ var(β)}, b where
1 1

0.5 0.5
b
var(β) is the covariance matrix of the naive logistic regression estimator
−2 0 2 4 6 −4 −2 0 2 4 6
true x wbar using W in place of X and B ′ is another tuning constant. A similar
strategy may be applied to update θ when ψ in (9.18) is not identically
4 4
zero.
3.5 3.5

3 3
To illustrate the fitting algorithms for logistic regression with mea-
2.5 2.5
surement error, we simulated data from a quadratic regression similar to
the one in Section 9.5.2 but with a binary response following the logistic
y

y
2 2

1.5 1.5 regression model. The intercept β0 was changed to −1 so that there were
1 1 roughly equal numbers of 0s and 1s among the Yi . Also, the sample size
0.5
−2 0 2 4
0.5
−2 0 2 4 was increased to n = 1, 500 to ensure reasonable estimation accuracy for
E(X|W) imputed x
β. Otherwise, the parameters were the same as the example in Section
9.5.2. The tuning parameters in the MH steps were B = B ′ = 1.5. This
Figure 9.4 Segmented regression. Plots of Yi and Xi and three estimator of
gave acceptance rates of about 52% for the Xi and about 28% for β.
Xi . Top left: Y plotted versus the true X. Top right: Y plotted versus the
mean of the replicated W-values. Bottom left: Y plotted versus the regression
Figure 9.5 show the output from one of the five runs of the Gibbs
calibration estimates of X. Bottom right: Y plotted versus the imputed X in sampler. The samplers appear to have converged and to have mixed rea-
a single iteration of the Gibbs sampler. Note how the Gibbs sampler more sonably well. The posterior mean of β was (−1.18, 0.55, 0.24, 0.30), which
faithfully reproduces the true X-values. can be compared to β = (−1, 0.5, 0.2, 0.3). The posterior standard devi-
ations were (0.13, 0.17, 0.09, 0.06). The Monte Carlo error, as measured
by the between-run standard deviations of the posterior means, was less
regression model than one-tenth as large as the posterior standard deviations.
½ ¾
P (Yi = 1|Xi , Zi )
log = m(Xi , Zi , β, θ),
P (Yi = 0|Xi , Zi ) 9.7 Berkson Errors
so the outcome likelihood is proportional to
" n # The Bayesian analysis of Berkson models is similar to, but somewhat
X n
X n o simpler than, the Bayesian analysis of error models. The reason for the
m(Xi ,Zi ,β,θ)
exp Yi m(Xi , Zi , β, θ) − log 1 + e , simplicity is that we need a Berkson error model only for [X|W] or
i=1 i=1
[X|W, Z]. If, instead, we had an error model [W|X, Z] then, as we have
" seen, we would also need a structural model [X|Z].
n
X n
X n o
[β, θ|others] ∝ exp Yi m(Xi , Zi , β, θ) − log 1 + em(Xi ,Zi ,β,θ) We will consider nonlinear regression with a continuously distributed
i=1 i=1 Y first and then logistic regression.
#
βtβ
− π(θ), (9.27)
2σβ2 9.7.1 Nonlinear Regression with Berkson Errors
and Suppose that we have outcome model (9.17), which for the reader’s con-
" n
X n
X n o venience is
[Xi |others] ∝ exp Yi m(Xi , Zi , β, θ) − log 1 + em(Xi ,Zi ,β,θ)
i=1 i=1 [Yi |Xi , Zi , β, θ, σǫ ] = Normal{m(Xi , Zi , β, θ), σǫ2 }, (9.29)

224 225
−0.5 1.5 Specifically, equation (9.20), which is
£ ¤
−1 1 f (Xi |others) ∝ exp −{Yi − m(Xi , Zi , β, θ)}2 /(2σǫ2 )
© ª

x,1
0
beta

× exp −(Xi − α0 − αz Zi )2 /(2σx2 ) − ki (Wi − Xi )2 /(2σu2 ) ,

beta
−1.5 0.5
is modified to
−2 0 £ ¤
0 2000 4000 6000
iteration
8000 10000 0 2000 4000 6000
iteration
8000 10000 f (Xi |others) ∝ exp −{Yi − m(Xi , Zi , β, θ)}2 /(2σǫ2 ) (9.30)
© ¤
0.6 0.5
× exp −(Wi − Xi )2 /(2σu2 ) .

0.4
0.4 Thus, we see two modifications. The term −(Xi − α0 − αz Zi )2 /(2σx2 ) in
0.3 (9.20), which came from the structural assumption, is not needed and
x,2

z
0.2

β
ki (Wi − Xi )2 is replaced by (Wi − Xi )2 since there are no replicates in
β

0.2
0
0.1 the Berkson model. That’s it for changes—everything else is the same!
−0.2 0
This analysis illustrates a general principle, which may have been ob-
0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000
iteration iteration vious to the reader, but should be emphasized. When we have a Berkson
model that gives [X|Z, W], we do not need a model for marginal density
1.3 0.25
[W] of W. The Wi are observed so that we can condition upon them.
1.2
0.2 In contrast, if we have an error model for [W|Z, X], we cannot do a con-
alphtax|z

ditional analysis given the Xi since these are unobserved, and therefore
alpha0

1.1
0.15
1 a structural model for [X] or, perhaps, [X|Z] is also needed.
0.1
0.9

0.8 0.05
0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000 9.7.2 Logistic Regression with Berkson Errors
iteration iteration

1.2 1.15 When errors are Berkson, the analysis of a logistic regression model
1.1
described in Section 9.6 changes in a way very similar to the changes
1.1
1.05
just seen for nonlinear regression. In particular, equation (9.28), which
is
σw
σx

1
1 " n
X n
X n o
0.95
[Xi |others] ∝ exp Yi m(Xi , Zi , β, θ) − log 1 + em(Xi ,Zi ,β,θ)
0.9 0.9
0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000 i=1 i=1
iteration iteration
¸
(Xi − α0 − αz Zi )2 (Wi − Xi )2
Figure 9.5 Every 20th iteration of the Gibbs sampler for the quadratic logistic
+ + 2 ,
σx2 σW
regression example.
becomes
" n
X n
X n o
but now with Berkson error so that we observe Wi where [Xi |others] ∝ exp Yi m(Xi , Zi , β, θ) − log 1 + em(Xi ,Zi ,β,θ)
i=1 i=1
Xi = Wi + Ui , E(Ui |Zi , Wi ) = 0.
¸
(Wi − Xi )2
Model (9.29) is nonlinear in general, but includes linear models as a + . (9.31)
special case. The analysis in Section 9.5.1, which was based upon repli- σu2
cated classical measurement error and a structural model that says that As before, the term (Xi − α0 − αz Zi )2 /σx2 in (9.28) came from the struc-
X|Z ∼ Normal(α0 + αz Z), must be changed slightly because of the tural model and is not needed for a Berkson analysis, and W i is replaced
Berkson errors. The only full conditionals that change are for the Xi . by Wi because there is no replication.

226 227
5
x 10
2.5 portional to the likelihood, since there are uniform priors on σu and βx,2
and a very diffuse prior on β. The histogram is monotonically decreas-
ing, in agreement with the MLE of 0 for σu . However, the posterior is
2 very diffuse and much larger values of σu are plausible under the poste-
rior. In fact, the posterior mean, standard deviation, 0.025 quantile, and
0.975 quantile of σu were 0.13, 0.098, 0.027, and 0.37, respectively. The
1.5 95% credible interval of (0.027, 0.37) is not much different from (0.0344,
0.3906), the interval formed by the 2.5 and 97.5 percentiles of the prior.
frequency

Thus, the data provide some, but not much, information about σu .
1

−1.6 3

2.5
−1.8
0.5 2
−2

βx,1
β0
1.5
−2.2
1
0 −2.4
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.5
σu
−2.6 0
0 0.5 1 1.5 2 2.5 0 0.5 1 1.5 2 2.5
iteration 5 iteration 5
x 10 x 10
Figure 9.6 Munich bronchitis data. Histogram of 1,250,000 samples from the
posterior for σu . 1.8 1.2

1.6
1
1.4
9.7.3 Bronchitis Data 0.8
1.2

βx,2

βz,1
We now continue the analysis of the bronchitis data described in Section 1 0.6

8.7. Recall that in that section we found that the MLE of the Berk- 0.8
0.4
son measurement error standard deviation, σu , was zero. Our Bayesian 0.6

analysis will show that σu is poorly determined by the data. Although 0.4
0 0.5 1 1.5 2 2.5
0.2
0 0.5 1 1.5 2 2.5
iteration 5 iteration 5
σu is theoretically identifiable, for practical purposes it is not identified. x 10 x 10

Gustafson (2005) has an extensive discussion of nonidentified models. 0.7 0.4


He argues in favor of using informative priors on nonidentified nuisance
parameters, such as σu here. The following analysis applies Gustafson’s 0.6
0.3

strategy to σu . 0.5

βz,2

σu
We will use a Uniform (0.025, 0.4) prior for σu . This prior seems rea- 0.4
0.2

sonable, since σw is 0.72, so the lower limit of the prior implies very little 0.1
measurement error. Also, the upper limit is over twice the value, 0.187, 0.3

assumed in previous work by Gössi and Küchenhoff (2001). We will use 0.2
0 0.5 1 1.5 2 2.5
0
0 0.5 1 1.5 2 2.5
a Uniform {1.05 min(Wi ), 0.95 max(Wi )} prior for βx,2 . This prior is rea- iteration 5
x 10
iteration 5
x 10

sonable since βx,2 is a TLV (threshold limiting value) within the range of
the observed data. The prior on β, the vector of all regression coefficient, Figure 9.7 Trace plots for the Munich bronchitis data.
is Normal(0, 106 I).
There were five MCMC runs, each of 250,000 iterations excluding a Figure 9.7 shows trace plots for the first of the five MCMC runs. Trace
burn-in of 1,000 iterations. Figure 9.6 is a histogram of the 1,250,000 plots for the other runs are similar. The mixing for σu is poor, but the
values of σu2 from the five runs combined. The posterior is roughly pro- mixing for the other parameters is much better. The poor mixing of σu

228 229
3
5
x 10 LAB programs used in the previous sections are specially tailored and
optimized to address these issues. However, standard software, such as
WinBUGS, may prove to be a powerful additional tool in applications
2.5 where many models are explored. We now show how to use WinBUGS
for fitting models introduced in Sections 9.4 and 9.5.
2

9.8.1 Implementation and Simulations in WinBUGS


frequency

1.5
We describe in detail the implementation of the linear model in Section
9.4 and note only the necessary changes for the more complex models.
1 The complete commented code presented in Appendix B.8.1 follows step
by step the model description in Section 9.4.

0.5
1.2
0.6
1.1

βx
1 0.5

β
0
0 0.5 1 1.5 2 2.5 3 3.5
θ 0.9
0.4
0.8
0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000
Figure 9.8 Munich bronchitis data. Histogram of 1,250,000 samples from the
0.4 1.2
posterior for TLV, βx,2 .

α0
z
1

β
0.3
0.8
0.2 0.6
was the reason we used 250,000 iterations per run rather than a smaller 0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000
1.2
value, such as 10,000, which was used in previous examples. 0.4 1.1
We experimented with a Uniform(0, 10) prior for σu and encountered 0.3

x|z
1

σx
α
0.2
difficulties. On some runs, the sampler would get stuck at σu = 0 and 0.1
0.9
0.8
Xi = Wi for all i. On runs where this problem did not occur the mixing 0
0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000
was very poor for σu , and fair to poor for the other parameters. We
1.1
conclude that a reasonably informative prior on σu is necessary. However, 0.4

ε
1 0.35

σ
σ
fixing σu at a single value, as Gössi and Küchenhoff (2001) have done, 0.3
0.9
is not necessary. 0.25
0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000
Figure 9.8 is a histogram of the 1,250,000 value of βx,2 from the com- iteration iteration

bined runs with burn-ins excluded. The posterior mean of βx,2 was 1.28,
very close to the naive of 1.27 found in Section 8.7. This is not surprising, Figure 9.9 Every 20th iteration for the WinBUGS Gibbs sampler for the linear
since the simulations in Section 8.7.3 showed that the naive estimator regression example.
had only a slight negative bias. The 95% highest posterior density cred-
ible interval was (0.53, 1.73). The first for loop specifies the outcome, measurement, and exposure
model (9.7), (9.8), and (9.9). Note that Nobservations is the sample
size and that the # sign indicates a comment. The code is structured
9.8 Automatic Implementation and intuitive. For example, the two lines in the outcome model
Bayesian analysis for complex models with covariates measured with er-
Y[i]~dnorm(meanY[i],taueps)
ror needs to be based on carefully constructed prior, full conditional,
meanY[i]<-beta[1]+beta[2]*X[i]+beta[3]*Z[i]
and proposal distributions combined with critical examination of the
convergence and mixing properties of the Markov chains. The MAT- specify that the outcome of the ith subject, Yi , has a normal distribution

230 231
with mean mY (i) = β1 + β2 Xi + β3 Zi and precision parameter τǫ =
1/σǫ2 . It is quite common in Bayesian analysis to specify the normal
distribution in terms of its precision instead of its variance.
The nested for loop corresponding to the replication model

0.06
for (j in 1:Nreplications) {W[i,j]~dnorm(X[i],tauu)}

0.05
specifies that, conditional on the unobserved exposure, Xi , of the ith
subject the proxies Wi,j are normally distributed with mean Xi and
precision τu = 1/σu2 . Here Nreplications is the number of replications

0.04
and it happened to be the same for all subjects. A different number
of replications could easily be accommodated by replacing the scalar

0.03
Nreplications by a vector Nreplications[].
The code corresponding to the measurement error model

0.02
X[i]~dnorm(meanX[i],taux)
meanX[i]<-alpha[1]+alpha[2]*Z[i]

0.01
specifies that the exposure of the ith subject, Xi , has a normal distribu-
tion with mean α1 + α2 Zi and precision parameter τx = 1/σx2 .
The code for prior distributions

0.00
tauu~dgamma(3,1)
Bayes/WinBUGS Naive
taueps~dgamma(3,1)
taux~dgamma(3,1)
specifies that the precision parameters τu , τǫ , τx have independent Gamma
priors with parameters 3 and 1. The dgamma(a,b) notation in WinBUGS
Figure 9.10 Squared error for the Bayes and naive methods for estimating the
specifies a Gamma distribution with mean a/b and variance a/b2 . The
exposure effect βx in the linear model with measurement error in the model
code for prior distributions (9.16).
for (i in 1:nalphas){alpha[i]~dnorm(0,1.0E-6)}
for (i in 1:nbetas){beta[i]~dnorm(0,1.0E-6)} defines the reliability ratio λ = τu /(τu + τx ) = σx2 /(σx2 + σu2 ).
specifies that the parameters α1 , α2 , β1 , β2 , β3 have independent normal To assess the quality of inference based on the WinBUGS program,
priors with mean zero and precision 10−6 . Here nalphas and nbetas we simulated 2,000 data sets from the linear model with measurement
denote the number of α and β parameters. error described in Section 9.4.1. For each data set we used 10,500 simu-
The last part of the code contains only definitions of explicit functions lations based on the WinBUGS program and we discarded the first 500
of the model parameters. For example, simulations as burn in.
Figure 9.9 shows every 20th iteration of the Gibbs sampler for one
sigmaeps<-1/sqrt(taueps) data set, indicating that the mixing properties are comparable to those
sigmau<-1/sqrt(tauu) shown in Figure 9.2. However, this is not always the case and Win-
sigmax<-1/sqrt(taux) BUGS programs typically need 10 to 100 times more simulations than
√ √ √ expert programs to achieve comparable estimation accuracy. Of course,
define the standard deviations σǫ = 1/ τǫ , σu = 1/ τu and σx = 1/ τx
for the outcome, replication and exposure models, respectively, and the time saved by using WinBUGS instead of writing a program often
compensates for the extra computational time.
lambda<-tauu/(tauu+taux) Figure 9.10 displays the squared error of the posterior mean of the

232 233
exposure effect βx using Bayes and naive estimators for the linear model where (X[i]-theta)*step(X[i]-theta) represents (X i − θ)+ because
with measurement error introduced in Section 9.4. More precisely, for the step(a) in WinBUGS is equal to a if a > 0 and 0 otherwise. One needs
(B)
dth data set, d = 1, . . . , 2000, denote by βbx,d the posterior mean of βx only to add the prior for θ:
(N )
using the WinBUGS program and by βb x,d the MLE of βx in a standard theta~dnorm(barWbar,prec.theta)
linear regression, where Xi is replaced by Wi = (Wi1 + Wi2 )/2. Then,
(B) (N )
where barWbar represents the average of all Wij observations and prec.-
the two boxplots in Figure 9.9 correspond to (βbx,d − βx )2 and (βbx,d − theta represents 1/(25σW 2
) and are part of the data.
βx )2 , respectively. WinBUGS uses a rather inefficient simulation algorithm for fitting
We also calculated the coverage probabilities of βx by the 90% and complex measurement error models. This is most probably due to the
95% equal-tail probability credible intervals obtained from the Bayesian sampling scheme, which updates one parameter at a time and does not
analysis based on MCMC simulations implemented in WinBUGS. The take advantage of the explicit full conditionals of groups of parameters.
true value of the parameter βx was covered for 89.5% and 94.6% of the For example, if γ = (γ1 , γ2 )t has a full conditional Normal(µγ , Σγ ) with
data sets by the 90% and 95% credible intervals, respectively. In contrast, a very strong posterior correlation, it is much more efficient to sample
the true value of βx was never covered by the 95% confidence interval of directly from Normal(µγ , Σγ ) than to sample γ1 given γ2 and the others
the naive analysis because of its bias. and then γ2 given γ1 and the others.
Therefore, the mixing properties of the Markov chains generated by
WinBUGS should be carefully analyzed using multiple very long chains.
9.8.2 More Complex Models We also found that simple reparameterizations, such as centering and
orthogonalization of covariates, can substantially improve mixing.
Only minor changes are necessary to fit the quadratic polynomial regres-
While we encourage development, when feasible, of expert programs
sion model in Section 9.5.2. Indeed, the only change is that the specifi-
along the lines described in Sections 9.4 and 9.5, WinBUGS can be a
cation of the mean function of the outcome model becomes
valuable additional tool. The main strengths of WinBUGS are
meanY[i]<-beta[1]+beta[2]*X[i]+beta[3]*pow(X[i],2) 1. Flexibility: Moderate model changes correspond to simple program
+beta[4]*Z[i] changes.
2. Simplicity: Program follows almost literally the statistical model.
while the number of β parameters in the data nbetas is changed from
3 to 4. Here pow(X[i],2) represents X2i . 3. Robustness: Program is less prone to errors.
As discussed in Section 9.5.3, the multiplicative measurement error 4. Operability: Programs can be called from different environments, such
model is equivalent with an additive measurement error model using a as R or MATLAB.

log exposure scale. This can be achieved by the transformations Wi,j = The main weakness of WinBUGS is that chains may exhibit very poor

log(Wi,j ) and Xi = log(Xi ). From a notational perspective in Win- mixing properties when parameters have high posterior correlations.
BUGS, there is no need to use the X∗i notation instead of the Xi as long This problem may be avoided by expert programs through the careful
as the data is transformed accordingly. Therefore, the only necessary study of full conditional distributions.
change is that the mean function of the outcome model becomes
meanY[i]<-beta[1]+beta[2]*exp(X[i])+beta[3]*Z[i] 9.9 Cervical Cancer and Herpes

where exp(X[i]) represents eXi and W[i,j] represents Wi,j ∗
. So far in this chapter, we have assumed that a continuously distributed
To fit the segmented regression model in Section 9.5.4, one needs to covariate is measured with error. However, Bayesian analysis is straight-
change the mean function of the outcome model to forward when a discrete covariate is misclassified.
In this section, we continue the analysis given in Section 8.4 of the cer-
meanY[i]<-beta[1]+beta[2]*X[i] vical cancer data discussed in Section 1.6.10. In particular, we continue
+beta[3]*(X[i]-theta)*step(X[i]-theta) the retrospective parameterization in Section 8.4 using αxd = Pr(W =
+beta[4]*Z[i] 1|X = x, Y = d) and γd = Pr(X = 1|Y = d), x = 0, 1 and d = 0, 1.

234 235
We use beta priors with parameters (axd , bxd ) for the α’s and (a∗d , b∗d ) between the estimates for d = 1 and for d = 0, indicating the critical
for the γ’s, with the α’s and γ’s being mutually independent. If we impose nature of whether or not the error is assumed to be nondifferential.
the constraints, αx0 = αx1 for x = 0, 1, then we have a four-parameter, This example shows the value of validation data—without it, one is
nondifferential measurement error model. The log-odds ratio is related forced to assume nondifferential error and may, unwittingly, reach erro-
to the γ’s by neous conclusions because this assumption does not hold. If at all feasi-
ble, the collection of validation is worth the extra effort and expense.
β = log [{γ1 /(1 − γ1 )} / {γ0 /(1 − γ0 )}] .
Thus, the posterior distribution of β can be found via transformation 9.10 Framingham Data
from the posterior distribution of the γ’s.
As an illustration, we consider only those males ages 45+ whose choles-
If we could observe all the X’s, the joint density of the parameters
terol values at Exam #3 ranged from 200 to 300, giving a data set of
and all the data would be proportional to
n = 641 observations. Recall that Y is the indicator of coronary heart
1 Y1
"
Y disease. Initial frequentist analysis of this data set showed no evidence
axd −1 b −1
αxd (1 − αxd ) xd (9.32) of age or cholesterol effects, so we work only with two covariates, smok-
x=0 d=0
# ing status, Z, and X = log(SBP−50), where SBP is long-term average
n n
Y oI(Xi =x,Yi =d)
Wi 1−Wi
systolic blood pressure. The main surrogate W is the measurement of
× αxd (1 − αxd ) log(SBP−50) at Exam #3, while the replicate T is log(SBP−50) mea-
i=1 sured at Exam #2. Given (Z, X), W and T are assumed independent
" n n
#
1
Y Y oI(Yi =d) and normally distributed with mean X and variance σu2 ; σu2 = α e1 in the
a∗ −1 b∗
d −1 Xi 1−Xi
× γd d (1 − γd ) γd (1 − γd ) . general notation of Chapter 8. The distribution of X given Z is assumed
d=0 i=1 2
to be normal with mean α0 + αz Z and variance σx|z (e
α2 in the general
We can use (9.4) and (9.32) to note that P the posterior distribution of 2
notation). We also assume that σx|z is constant, that is, independent of
n ∗
γ
Pd is a beta distribution with parameters i=1 Xi I(Yi = d) + ad and Z. Let Θ = (σu2 , α0 , αz , σx|z
2
).
n ∗
i=1 (1 − X i )I(Y i = d) + b d . The posterior
Pn distribution of α xd is also a Previous analysis suggested that the measurement error variance is
betaP distribution but with parameters i=1 Wi I(Xi = x, Yi = d) + axd less than 50% of the variance of the true long-term SBP given smoking
n
and i=1 (1 − Wi )I(Xi = x, Yi = d) + bxd . The conditional distribution status. We define ∆ = σu2 /σx|z 2
to be the ratio of these variances and
of a missing Xi , given the (Wi , Yi ) and the parameters, is Bernoulli assume ∆ ∈ (0, 0.5). Restricting the range here makes sense, and we
with success probability p1i /(p0i + p1i ), where would not credit an analysis that suggested that the measurement error
x
pxi = γY (1 − γYi )
1−x Wi
αxY (1 − αxYi )
1−Wi
. variance is larger than the variance of true long-term SBP given smoking
i i
status.
Thus, in order to implement the Gibbs sampler, we need to simulate The Bayesian analysis will be based on the original model, so that Y
observations from the Bernoulli and beta distributions, both of which given (X, Z) is treated as being logistic with mean
are easy to do using standard programs, so the Metropolis–Hastings
H(β0 + βx X + βz Z) .
algorithm was not needed.
2
For nondifferential measurement error, the only difference in these The unknown parameters are (β0 , βx , βz , α0 , αz , σx|z , ∆). The first five
calculations is that αx0 = αx1 = αx , which have a beta of these are given diffuse (noninformative) locally uniform priors, the
Pprior
n
with pa-
rameters (ax P
, bx ) and a beta posterior with parameters i=1 Wi I(Xi = next-to-last has a diffuse inverse Gamma prior, the density functions
n 2
x) + ax and i=1 (1 − Wi )I(Xi = x) + bd . being proportional to 1/σx|z , and ∆ has a uniform prior on the interval
We used uniform priors throughout, so that axd = bxd = a∗d = b∗d = between zero and one half.
1. We ran the Gibbs sampling with an initial burn-in period of 2, 000 We use WinBUGS to implement the Bayesian logistic regression model.
simulations, and then recorded every 50th simulation thereafter. The The WinBUGS model, together with an R file used for data and output
posterior modes were 0.623 and 0.927, respectively, these being very manipulation, is provided as part of the software files for this book.
2
close to the maximum likelihood estimates. Note the large difference Mixing was very good for βz , α0 , αz , σx|z , σu2 , and λ. For these pa-

236 237
−5 3

−10 2

βx
Parameter ML. Boot. Bayes Bayes
0
β −15 1
est. se p. mean p. std.
0 90000 210000 300000 0 90000 210000 300000
1.5
β
x β0 −10.10 2.400 −10.78 2.542
4.47
1 βx 1.76 0.540 1.91 0.562
z

α0
0.5 4.43
β

0 4.4
βz 0.38 0.310 0.40 0.302
−0.5 α0 4.42 0.019 4.42 0.019
0 90000 210000 300000 0 90000 210000 300000
10 × αz −0.19 0.210 −0.20 0.217
0.06 2
0
0.055
10 × σx|z 0.47 0.033 0.51 0.032
αx|z

x|z
−0.03 10 × σu2 0.14 0.011 0.16 0.008

σ2
0.05
−0.06
0.045 λ 0.30 0.031 0.28 0.025
0 90000 210000 300000 0 90000 210000 300000
0.35
0.016
0.015 0.3
Table 9.1 Framingham data. The effects of SBP and smoking are given by
2
σu

λ
0.014
0.013 0.25 βx and βz , respectively. The measurement error variance is σu2 . The mean
0 90000 210000 300000 0 90000 210000 300000 of long-term SBP given smoking status is linear with intercept α0 , slope αz
2
and variance σx|z . Also, λ = σu2 /σx|z
2
. “ML” = maximum likelihood, “se” =
Figure 9.11 Every 600th iteration of the Gibbs sampler for Framingham ex- standard error, “Boot.” = bootstrap, “Bayes” =Bayesian inference based on
ample. Gibbs sampling implemented in WinBUGS, “p. mean” = posterior mean, and
“p. std” = posterior standard deviation.

rameters 1, 000 burn-in and 10, 000 simulations were enough for accurate OPEN Data, Protein, Posterior Density for λ
9
estimation. However, the chains corresponding to β0 and βx were mixing
very slowly, and we ran 310, 000 iterations of the Gibbs algorithm and 8
discarded the first 10, 000 as burn-in. Figure 9.11 displays every 600th
7
iteration for the model parameters with similar, but less clear patterns,
for the unthinned chains. 6
Table 9.1 compares the inference results for the maximum likelihood
5
analysis, based on the regression calibration approximation with the
Bayesian inference based on Gibbs sampling. Clearly, the two types of 4

inferences agree reasonably closely on most parameters. The Bayesian 3


analysis estimates an 8.5% higher effect of SBP βx = 1.91 for Gibbs sam-
pling, compared to βx = 1.76 for maximum likelihood, but the difference 2

is small relative to the standard errors. Results in Table 9.1 are similar 1
to the likelihood and regression calibration results given in Section 8.5,
0
and the differences are easily due to our use here of only 641 of the 1,615 0 0.05 0.1 0.15 0.2 0.25 0.3
subjects analyzed in Section 8.5.

Figure 9.12 Results of the OPEN Study for Protein intake for females. Plotted
9.11 OPEN Data: A Variance Components Model
is the posterior density of the attenuation λ, defined in this case as the slope
The OPEN Study was introduced in Section 1.2 and Section 1.5, see of the regression of true intake on a single food frequency questionnaire. The
Subar, Kipnis, Troiano, et al. (2003) and Kipnis, Midthune, Freedman, posterior mean is 0.13, with 95% credible interval [0.04, 0.21], roughly in line
et al. (2003)indexsLongitudinal data. Briefly, each participant completed with results reported previously.

238 239
up to two food frequency questionnaires (FFQ) which measured reported erences include two classics, Box & Tiao (1973) and Berger (1985). The
Protein intake, and also up to two biomarkers for Protein intake (urinary latter has an extensive and excellent theoretical treatment. There is also
nitrogen). Letting Y denote the logarithm of the FFQ, W the logarithm now a statistical package for Bayesian computation, called WinBUGS:
of the biomarker and X the logarithm of usual intake, the variance com- we will illustrate the use of WinBUGS in this chapter. The literature
ponents model used is now even includes an excellent book devoted exclusively to the Bayesian
approach to measurement error modeling, especially for categorical data,
Yij = β0 + βx Xi + ri + ǫij , (9.33)
see Gustafson (2004).
Wij = Xij + Uij , Good introductions to MCMC are given by Gelman, Carlin, Stern, &
where ǫij = Normal(0, σǫ2 ), Uij = Normal(0, σu2 ) and ri = Normal(0, σr2 ): Rubin (2004), Carlin & Louis (2003), and Gilks, Richardson, & Spiegel-
the terms ri is a person-specific bias or equation error, see Section 1.5. halter (1996).
In Chapter 11, we note that (9.33) is a linear mixed model with repeated The mechanics of stopping the Gibbs sampler and whether one should
measures. We used a subset of the women in the OPEN study for this use one long sequence or a number of shorter sequences are matters of
analysisindexsLongitudinal data. some controversy and not discussed here; however, we note that Gel-
The purpose of the OPEN study was to investigate the properties man & Rubin (1992) and Geyer (1992) give exactly opposite recommen-
of the FFQ for use in large cohort studies. In regression calibration, dations. There is a large literature on diagnostics for convergence; see
Chapter 4, in a cohort study we use the regression of usual intake on the Cowles & Carlin (1996), Polson (1996), Brooks & Gelman (1998), Kass,
FFQ as the predictor of disease outcome. The slope of this regression is Carlin, Gelman, & Neal (1998), and Mengersen, Robert, & Guihenneuc-
simply Jouyaux (1999). Kass et al. (1998) is an interesting panel discussion of
what is actually done in practice by three Bayesian experts, Carlin, Gel-
λregcal = cov(Q, X)/var(Q). man, and Neal: Kass, though also an expert, is the moderator so we do
Kipnis, Subar, Midthune, et al. (2003) describe λregcal as the attenuation not learn about his views or experiences. This discussion is quite inter-
factor and note that the regression calibration approximation says that esting and well worth reading, unless you are already a Bayesian expert
if the true relative risk is R, then the observed relative risk from the use yourself, and probably even in that case. It seems that the experts do not
of the FFQ will be Rλregcal . For example, a true relative risk of 2 would use sophisticated convergence diagnostics, because they feel that these
appear as 2.4 = 1.32 if the attenuation factor were 0.4 and as 2.2 = 1.15 can be misleading. However, they all look at trace plots of various pa-
rameters, such as Figure 9.2. Carlin and Gelman monitor R b (Gelman
if the attenuation factor were 0.2. It is thus of considerable interest to
estimate λregcal . The WinBUGS code along with the prior distributions & Rubin, 1992), which compares the estimated posterior variance from
used is given in Appendix B.8.2. several chains combined to the average posterior variance from the in-
dividual chains. R b close to 1 means that the chains have mixed. Carlin
We plot the posterior density of λregcal in Figure 9.12. The posterior
mean is 0.13, with 95% credible interval [0.04, 0.21], roughly in line with and Neal also compute autocorrelations of various parameters; high au-
results reported by Kipnis, Subar, Midthune, et al. (2003). This means tocorrelations are a sign of slow mixing. Neal also suggests looking at
that a true relative risk of 2 for Protein intake will be attenuated to a the log posterior density, which will be neither steadily increasing nor
relative risk of 20.13 = 1.09 when using the FFQ. As Kipnis, et al. state: steadily decreasing if the chain has converged.
“Our data clearly document the failure of the FFQ to provide a suffi- Alternatives to the Metropolis–Hastings algorithm have been pro-
ciently accurate report of absolute protein . . . intake to allow detection posed, though they seem less used in practice. For example, Smith &
of their moderate associations with disease. Gelfand (1992) discuss the rejection method and the weighted bootstrap
method. Ritter & Tanner (1992) and references therein discuss ways of
drawing samples from (9.4), including the griddy Gibbs sampler, which
Bibliographic Notes effectively discretizes the components of Ω in a clever way; this can be
useful since sampling from a multinomial distribution is trivial.
Since the first edition of this book, the literature on Bayesian compu-
tation has exploded. The reader is referred to Gelman, Carlin, Stern, &
Rubin, Gelman, (2004), Carlin & Louis (2000), and Gilks, Richardson, &
Spiegelhalter, (1996) for a thorough introduction. Other important ref-

240 241
CHAPTER 10

HYPOTHESIS TESTING

10.1 Overview
In this chapter, we discuss hypothesis tests concerning regression pa-
rameters when X is measured with error. In Section 3.3.1 we argued,
in the context of some special cases, that naive tests for the regression
coefficients of Z are not valid in general when X is measured with error
and X is correlated with Z. In particular, we illustrated this in Figure
3.5, where we graphically illustrated a two-group, unbalanced analysis
of covariance, showing that if X has a different distribution in the two
groups, then the treatment effect test is invalid in the naive test, where
W is simply substituted for X. In this chapter, we give a more detailed
and thorough account of testing when X is measured with error.
To keep the exposition simple, we focus on linear regression. However,
the results of Sections 10.2.1, 10.2.3, and 10.5 hold in general, and the
results of Sections 10.2.2 and 10.4 hold to a good approximation for
all generalized linear models, including logistic regression, whenever the
regression calibration approximation is reasonable. More generally, the
same can be said of any problem for which the mean and variance of
the response depends only upon a linear combination of the predictors,
which we assume throughout this chapter. We also assume nondifferen-
tial, additive measurement error, W = X + U.

10.1.1 Simple Linear Regression, Normally Distributed X


In Section 3.2.1 we discussed the effects of measurement error on estima-
tion in the simple linear regression model; see especially equation (3.4).
Recall that the model is Y = β0 + βx X + ǫ, where X has mean µx and
variance σx2 = 1, and the error about the regression line ǫ is independent
of X, has mean zero and variance σǫ2 . Suppose that instead of X we
observe W = X + U, where U is independent of X, has mean zero, and
variance σu2 = 1. The attenuation is σx2 /(σx2 + σu2 ). As we described, if X
is normally distributed, the observed regression of Y on W is the linear
model with intercept β0 + βx µx (1 − λ), slope λβx , and residual variance
σǫ2 + λβx2 σu2 .
Now consider testing the null hypothesis of no effect due to X: H0 :

243
βx = 0. Since the observed data have slope λβx , if the null hypothesis 1987, Section 2.5.1), and computed its standard error using 3, 000 boot-
is true, then in the observed data the slope is also zero. In other words, strap simulations. We then modified the simulations so that the level of
with nondifferential error in this simple setup, no relationship between each test was exactly 0.05. The comparison of the two methods is dis-
Y and X means no relationship between Y and the observed W. This played in Figure 10.1, where we see that the naive test has the greater
has two consequences for the naive test that ignores the measurement power, as predicted by the theory.
error: The loss of power by the test that corrects for measurement error is
• Since the observed data have zero slope under the null hypothesis, the due to Fuller’s correction and its bootstrap standard error. Using the
naive test is valid, in the sense that its level (Type I error) is correct. Fuller correction seems reasonable since it provides an estimator with
good finite-sample properties, in particular, less finite-sample bias than
• Because X is normally distributed, the observed data actually follow b However, if one divides the
simply dividing the naive estimator by λ.
a linear model. Hence, the naive test is efficient in this special case. b then, though the estimate
naive estimator and its standard error by λ,
Of course, this efficiency only holds for normally distributed X.
of βx may be unstable in small samples, the t-statistic corrected in this
way would be the same as the naive t-statistic because the λ’s b would
Normal X, λ = 0.5 cancel. Thus, this less sophisticated correction would, ironically, result
0.8
Naive in the naive test and hence be efficient.
0.7 MEM t−test

0.6 Normal X, λ = 0.5


1.8

0.5 Naive
1.6
MEM
Power

0.4
1.4

Mean estimates βx
0.3 1.2

0.2 1

0.8
0.1

0.6
0
0 0.5 1 1.5 2
0.4
βx
0.2

Figure 10.1 This illustrates the power of a 5%-level test for the null hypothesis
0
of zero slope in a simple linear regression when X is normally distributed,
n = 20, and σx2 = σǫ2 = σu2 = 1. Compared are the naive test that ignores −0.2
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2
measurement error (solid line) and the test that accounts for measurement βx
error by estimating the standard deviation of the method of moments estimator
via the bootstrap (dashed line).
Figure 10.2 This illustrates the effects of measurement error in a simple linear
regression when X is normally distributed, n = 20, and σx2 = σǫ2 = σu2 = 1,
To illustrate these points, we did a small simulation study similar to and for different values of βx . The dotted line reflects the true value of βx .
that in Section 1.8.1, with n = 20 observations and µx = 0, σx2 = σǫ2 = Compared are the naive estimate of the slope ignoring measurement error (solid
σu2 = 1. In this simulation, σu2 was assumed known. We varied βx and in- line) and the estimate that accounts for measurement error (dashed line). Note
vestigated the power of two tests. The first is the naive test, which is the the severe small-sample bias in the naive estimate, as well as the near lack of
efficient test because X, U, and ǫ are all normally distributed. The other small-sample bias for the measurement error estimate. The point here is that
test is one based upon accounting for measurement error. Specifically, as if estimation and inference about βx are of interest, then measurement error
in Section 3.4.1, we computed the Fuller’s corrected estimator (Fuller, needs to be accounted for.

244 245
Figure 10.1 might cause one to think that measurement error can be mally distributed with mean µ1 and variance σx2 = 1, and in the second
safely ignored. This is certainly true provided the model is simple linear group, X is normally distributed with mean µ2 and variance σx2 = 1. The
regression, all random variables are normally distributed, and the only measurement error variance is σu2 = 1, and the residual mean square is
interest is in testing the null hypothesis of zero slope. As a cautionary σǫ2 = 1. The difference in the mean of X in the two groups is ∆ = µ2 −µ1 ,
note, in Figure 10.2 we illustrate once again the effects of measurement with larger values of ∆ reflecting increased imbalance between the two
error on estimation. In this figure, we show that the naive estimate is groups. In symbols,
very severely biased, when the correction for attenuation with Fuller’s
Y = β0 + βx X + βz Z + ǫ,
modification has almost no small-sample bias.
where Z is the dummy variable indicating group assignment and the
10.1.2 Analysis of Covariance mean of X given Z = z is µz .

Analysis of Covariance, Type I Error of Nominal 5% Test


0.25
Analysis of Covariance, No Treatment Effect
Naive
Naive MEM

Type I Error for Treatment Effect


MEM 0.2
0.5
Estimate of Treatment Effect

0.4
0.15

0.3

0.1

0.2

0.1 0.05

0
0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

−0.1 Difference in Mean of X Between Treatments


0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Difference in Mean of X Between Treatments
Figure 10.4 This illustrates the effects of measurement error in unbalanced
Figure 10.3 This illustrates the effects of measurement error in unbalanced analysis of covariance, when X is normally distributed, n = 20, and σ x2 =
analysis of covariance, when X is normally distributed, n = 20, and σ x2 = σǫ2 = σu2 = 1, and for different values of ∆, the difference in the mean of X
σǫ2 = σu2 = 1, and for different values of ∆, the difference in the mean of between the true groups. Compared are the Type I errors of the naive test for
X between the true groups. Compared are the mean estimates of the treatment treatment effect ignoring measurement error (solid line) and the Type I error of
effect ignoring measurement error (solid line) and accounting for measurement the test that accounts for measurement error (dashed line). Note the increased
error (dashed line). Note the increased bias of the naive estimate of treatment level of the naive test as the imbalance between the two groups increases.
effect as the imbalance between the two groups increases. The dashed line shows
that the correction results in only a slight negative bias, which is small-sample As described in Section 3.3.1, when ignoring measurement error, the
effect. effect of measurement error in X is to bias the estimate of the treatment
effect βz . This is illustrated in Figure 10.3, which is the result of a
It is, of course, not always true that the Type-I error of the naive simulation study with 1, 000 replications, where we display the mean
test is the nominal 5%. Consider, for example, the analysis of covariance estimate of treatment effect βz as a function of the difference in the mean
model described in Section 3.3.1 and Figure 3.5. Consider a situation of X in the two groups, ∆ = µ2 − µ1 , both ignoring and accounting for
of two-group analysis of covariance where in the first group, X is nor- measurement error. The latter method uses Fuller’s modification of the

246 247
correction for attenuation. Note the severe bias of the naive estimate • The naive test of no effects due to (Zt , Xt )t is valid, that is, that none
for larger values of ∆ and the corresponding near lack of bias in the of the covariates affects Y.
correction for attenuation. • The naive test of no effects due to Z is not valid in general but is
The bias in treatment effect when ignoring measurement error also valid under some restrictive assumptions.
leads to invalid tests, that is, the usual test that ignores measurement
error has Type I error greater than the nominal 5%. In Figure 10.4, • The naive test of no effects due to a specified subvector of X, for
we plot the Type I error as a function of ∆ ignoring and accounting for example, the first component of X, is not valid in general.
measurement error: The latter uses a t-test with standard error estimated • When Y follows a generalized linear model (Section A.8) in Z and
by 1, 000 bootstrap simulations. X, then we show that the efficient score test of no effects due to
The analysis of covariance illustrates that hypothesis testing in the X is easily obtained: One takes the efficient score test when X is
measurement error context is not fully straightforward. Understanding observed and replaces X by a parametric estimate of E(X|Z, W). Put
when the naive test that ignores measurement error is valid and attains another way, a null hypothesis test based on regression calibration is
its nominal Type I error is thus of considerable importance. (asymptotically) efficient.
These results are obtained using the regression calibration approxima-
10.1.3 General Considerations: What Is a Valid Test? tion, which takes the regression model for Y given Z and X and replaces
X by E(X|Z, W). Recall that throughout this chapter we assume that
Assuming that one or more of the estimation methods described in the response depends only upon a linear combination of the predictors, for
previous chapters is applicable, the simplest approach to hypothesis test- example, as in a generalized linear model.
ing forms the required test statistic from the parameter estimates and
their estimated standard errors. Such tests are justified whenever the
estimators themselves are justified. However, this approach to testing is 10.2 The Regression Calibration Approximation
only possible when the indicated methods of estimation are possible, and
In linear regression, the mean of the response given the true covariates
thus require either knowledge of the measurement error variance or the
is β0 + βzt Z + βxt X. Under the additional assumption that the possibly
presence of validation data or replicate measurements or instrumental
multivariate regression of X on Z and W is linear, that is,
variables, etc.
There are certain situations in which naive hypothesis tests are justi- E(X | Z, W) = α0 + αzt Z + αw
t
W,
fied and thus can be performed without additional data or information
we have that the observed data also have a linear mean, namely
of any kind. Here naive means that we ignore measurement error and
substitute W for X in a test that is valid when X is observed. This chap- E(Y | Z, W) = β0 + βxt α0 + (βzt + βxt αzt )Z + βxt αw
t
W. (10.1)
ter studies naive tests, describing when they are and are not acceptable,
and indicates how supplementary data, when available, can be used to Equation (10.1) is the starting point for our discussion of testing. One
t
improve the efficiency of naive tests. of the assumptions of our measurement error model is that αw is an
We use the criterion of asymptotic validity to distinguish between invertible matrix.
acceptable and nonacceptable tests. We say a test is asymptotically valid A naive analysis of the data fits a linear model as well. We write this
if its Type I error rate approaches its nominal level as the sample size model as
increases. Asymptotic validity, which we shorten to validity, of a test is E(Y | Z, W) = γ0 + γzt Z + γw
t
W. (10.2)
a minimal requirement for acceptability.
It is the correspondence between the naive model (10.2) and the actual
model (10.1) that is of interest here.
10.1.4 Summary of Major Results The assumption made above that αw t
is invertible is not onerous. In
The main results on the validity of naive tests under nondifferential the case of classical multivariate measurement error where W = X + U,
measurement error are as follows: if the conditional covariance matrix of X given Z, Σx|z , is invertible then
t
• The naive test of no effects due to X is valid. αw = Λz = Σx|z (Σx|z + Σu )−1 ,

248 249
t
so αw is invertible whenever Σx|z is invertible, that is, whenever we 10.3 Illustration: OPEN Data
do not have complete collinearity of the components of (X Z). This is
In many nutrition studies, the response Y is binary (disease or not),
a minimal assumption for βx to be estimable even when there is no
in which case logistic regression is the likely model choice. If pr(Y =
measurement error.
1|Z, W) = H(β0 + βzt Z + βxt X), then following (10.1) the regression
calibration approximation is that
10.2.1 Testing H0 : βx = 0 © ª
pr(Y = 1 | Z, W) = H β0 + βxt α0 + (βzt + βxt αzt )Z + βxt αw
t
W .(10.3)
Here we show that the naive test of no effect due to any of the predictors Some of these concepts can be illustrated numerically in the OPEN
measured with error is asymptotically valid, a point illustrated for simple data; see Section 1.2. Recall here that Z is the logarithm of energy
linear regression in Section 10.1.1. The result though holds in general, (caloric) intake as measured by the doubly labeled biomarker, which
and not just for linear regression. we are taking as measured without error.
A comparison of (10.1) and (10.2) shows that βx = 0 implies that A standard practice is to take X to be the logarithm of protein density,
αw βx = 0, which in turn implies that γw = 0. The converse is also true, which is the percentage of calories coming from protein. Effectively, this
namely that γw = 0 implies that βx = 0 because αw is invertible. is simply the logarithm of the ratio of protein intake to energy intake. The
Because γw = 0 if βx = 0, it follows that the naive test, that is, the surrogate for X is W, the logarithm of the ratio of the protein biomarker
test of H0 : γw = 0, is a valid test of H0 : βx = 0. to energy intake. The interpretation is rather nice: If we change X, then
Although γw = 0 only if βx = 0, this reverse implication, though we are changing the relative composition of what we eat.
perhaps interesting, is not necessary for the validity of the naive test. Using the methods for regression calibration in Section 4.4.2, that is,
equation (4.4), we obtain the estimate that E(X|W, \ Z) = α b0 + αbw W +
α
bz Z ≈ −0.42 + 0.54W + 0.06Z, and that the estimated correlation be-
10.2.2 Testing H0 : βz = 0
tween X and Z is −0.15. If we inspect (10.1), we see that when we ignore
Here we show that in linear regression, the naive tests for effects due measurement error, the slope in the regression of a response Y on (W, Z)
to Z is typically invalid, except under special circumstances, a point has an approximate coefficient βz + 0.06βx for Z. This suggests that if
illustrated for the analysis of covariance in Section 10.1.2. there is no real energy effect (βz = 0), then since the observed data
Further comparison of (10.1) and (10.2) shows that βz = 0 implies that should manifest a slope of only approximately 0.06βx , we are unlikely to
γz = 0, only if αz βx = 0. It follows that the naive test of H0 : βz = 0 is conclude incorrectly that there is an energy effect when we ignore mea-
valid if X is unrelated to Y in the model (10.7), that is, βx = 0, or if Z surement error, unless βx is large and hence X is a very strong predictor
is unrelated to X, that is, αz = 0. of the response.
In generalized linear models, the naive test is valid when Z and X are
independent, at least approximately, at the level of the regression cali- 10.4 Hypotheses about Subvectors of βx and βz
bration approximation. Gail, Wieand, and Piantadosi (1984) and Gail,
Tan, and Piantadosi (1988) showed that when the regression calibration There are situations in which interest focuses on testing for effects due to
approximation fails for logistic regression, then the naive test is no longer some subset of the predictors measured with error, or due to some subset
even approximately valid. of the error-free covariates. That is, if X = (Xt1 , Xt2 )t , βx = (βx,1
t t
, βx,2 )t ,
t t t t t t
The general conclusion is that the test of H0 : βz = 0 is invalid, and Z = (Z1 , Z2 ) , βz = (βz,1 , βz,2 ) , then we may be interested in
although there are certain situations in which it is valid. testing H0 : βx,1 = 0 or H0 : βz,1 = 0.
We have already seen that for testing H0 : βz = 0, the naive test is
not valid in general, and it follows from similar reasoning that the same
10.2.3 Testing H0 : (βxt , βzt )t = 0 is true of naive tests of H0 : βz,1 = 0. Therefore, we restrict attention to
naive tests of H0 : βx,1 = 0.
A final comparison of (10.1) and (10.2) shows that (βxt , βzt )t = 0 if and Suppose now that βxt X = βx,1t t
X1 + βx,2 X2 and that
t
only if (γw , γzt )t = 0, so the naive test that none of the covariates affects
t t t
Y is valid in general. E(X1 | Z, W1 , W2 ) = α1,0 + α1,z Z + α1,w 1
W1 + α1,w 2
W2 ;

250 251
t t t
E(X2 | Z, W1 , W2 ) = α2,0 + α2,z Z + α2,w 1
W1 + α2,w 2
W2 , (10.4) essentially zero. The variances of transformed blood pressure and choles-
terol were 0.0525 and 0.0316, respectively, with a correlation of 0.0966.
where W = (W1t , W2t )t is partitioned as is X.
In other words, both transformed blood pressure and transformed choles-
With these changes (10.1) becomes
terol and their measurement errors are essentially independent. The cor-
t t relations of observed blood pressure and cholesterol with age and smok-
E(Y | Z, W) = β0 + βx,1 α1,0 + βx,2 α2,0
+(βzt + βx,1
t t
α1,z t
+ βx,2 t
α2,z t
)Z + (βx,1 t
α1,w t
+ βx,2 t
α2,w )W1 ing status were also modest. When we used the regression calibration
1 1
formula (4.5), we found the following regressions:
t t t t
+(βx,1 α1,w 2
+ βx,2 α2,w 2
)W2 , (10.5)
E(X1 | Z, W1 , W2 ) ≈ 1.0686 + (0.0131, −0.0041)Z
and in a naive analysis of the data the mean model
+ 0.7459W1 + 0.0076W2 ;
E(Y | Z, W) = γ0 + γzt Z + γw
t
1
t
W1 + γw 1
W2 (10.6) E(X2 | Z, W1 , W2 ) ≈ 1.4298 + (0.0022, 0.0013)Z
is fit to the observed data. + 0.0059W1 + 0.7310W2 .
Comparing (10.5) and (10.6) shows that βx,1 = 0 implies that γw1 = 0
As is seen here, effectively regression calibration of transformed systolic
only if α2,w1 βx,2 = 0. It follows that the naive test of H0 : βx,1 = 0 is
blood pressure on all the variables is essentially the same as regression
valid only if α2,w1 βx,2 = 0. If X2 is related to Y, then βx,2 is nonzero.
calibration using the blood pressure measurements alone, and similarly
If X2 is related to W1 in (10.4), then α2,w1 is nonzero. This is the case
for cholesterol. In particular, we see that effectively, α1,z ≈ 0, α1,w2 ≈ 0,
whenever some components of X1 are correlated with some components
α2,z ≈ 0 and α2,w1 ≈ 0, so that in practice the naive test for systolic
of X2 .
blood pressure is very nearly valid.
For example, consider the NHANES study introduced in Chapter 1
and discussed in more detail in Chapter 4. Let X be the vector of true
total caloric intake (TC = X1 ) and saturated fat (SF = X2 ), and let 10.5 Efficient Score Tests of H0 : βx = 0
Z denote nondietary variables. The naive test for a SF effect simply
In this section, we assume that Y given Z and X follows a general-
substitutes observed TC and SF intake for true TC and SF intake, and
ized linear model (Section A.8). In particular, the mean and variance
it is a valid test provided there is no risk of breast cancer due to TC
functions for these models are in the form
(βx,1 = 0) or when the regression of true SF intake on observed SF,
observed TC and non-dietary variables has no component due to TC E(Y|Z, X) = mY (Z, X, B) = mY (β0 + βzt Z + βxt X); (10.7)
(α2,w1 = 0). var(Y|Z, X) = σ 2 g 2 (Z, X, B, θ)
In general, the conclusion is that the test of H0 : βx,1 = 0 is invalid,
= σ 2 g 2 (β0 + βzt Z + βxt , θ). (10.8)
although there are certain situations in which it is valid.
We show that the naive score test of H0 : βx = 0, while asymptotically
valid in general, is not generally an efficient score test. However, we do
10.4.1 Illustration: Framingham Data
find a test that is asymptotically equivalent to the efficient score test
The Framingham Heart Study was introduced in Section 1.6.6 and de- and show that under certain conditions this test is equal to the naive
scribed in more detail in Sections 5.4.1, 6.5, 7.2.3, 8.5, and 9.10. Here we score test.
consider two variables measured with error, namely transformed systolic Recall that the naive test simply substitutes W for X. We show that
blood pressure (X1 ) and the logarithm of cholesterol (X2 ). The variables if a parametric model for E(X|Z, W) is appropriate, say E(X|Z, W) =
Z measured without error in this example are age and smoking status: mX (Z, W, α), and if α b is a n1/2 -consistent estimator of α, then the
Age was normalized to have mean zero and variance one. test that substitutes mX (Z, W, α b) for X is asymptotically an efficient
Using the replicates of blood pressure and cholesterol, we find an esti- score test. It must be emphasized that this result about substituting
mate of the measurement error covariance matrix using equation (4.3) as mX (Z, W, α b) for X requires the assumption of a generalized linear model.
follows: The measurement error variance for transformed systolic blood The validity of naive null tests for predictors measured with error, and
pressure was 0.0126, that for transformed cholesterol was 0.0085, and the the efficiency for generalized linear models of tests which replace X by
correlation of the measurement errors was estimated as 0.0652, that is, E(X|Z, W), was shown by Tosteson and Tsiatis (1988). For the special

252 253
case of models with canonical link functions, the efficiency of tests that where in the last equation the dependence of C1 , C2 , and C3 on (β0 , βz , α, θ)
replace X by E(X|Z, W) follows from the form of the efficient score has been suppressed for brevity.
for generalized linear measurement error models given in Stefanski and Let θb be any n1/2 -consistent estimate of the variance parameter θ; see
Carroll (1987). Section A.7 or Carroll and Ruppert (1988, Chapter 3) for some methods
It follows from these results that the only time the naive test of of estimating θ. If α is unknown, for example, when
H0 : βx = 0 in generalized linear models is equivalent to the efficient
Hi (α) = E(X|Z, W) = m(Z, W, α),
score test occurs when E(X|Z, W) is independent of Z and linear in
W. Moreover, Tosteson and Tsiatis (1988) showed that the asymptotic then we assume a n1/2 -consistent estimator of α. Methods of estimating
relative efficiency (ARE) of the naive test to the efficient score test is α are discussed in Chapter 4. The quasilikelihood and variance function
always less than 1, unless the two tests are equivalent. They also showed (QVF) estimates of (β0 , βz ), (βb0 , βbz ), satisfy
that, for the special case where X is univariate and Z is not present, this n n o
X
ARE is {corr (E(X|W), W)}2 . Thus, the naive test can be arbitrarily 0= (1, Zti )t di (βb0 , βbz , θ)
b Yi − mY (βb0 + βbt Zi ) .
z
inefficient if E(X|W) is sufficiently nonlinear in W. i=1
The mathematical arguments supporting these statements are given
With dim(Z) denoting the dimension of Z, define
in the following subsection. This subsection is fairly technical and can
n o2
be omitted on first reading.
Xn Yi − mY (βb0 + βbzt Zi )
b2 = {n − 1 − dim(Z)}−1
σ .
10.5.1 Generalized Score Tests i=1 g 2 (βb0 + βbzt Zi , θ)
b

To define a generalized score test of H0 : βx = 0, let Hi (α) be any random We consider test statistics of the form
vector depending on (Zi , Xi , Wi ) and the parameter α and having the b−2 Lt (βb0 , βbz , α
σ b −1 (βb0 , βbz , α
b, θ)D b t (βb0 , βbz , α
b, θ)L b
b, θ). (10.10)
same dimension as Xi . Possible choices of Hi (α) are discussed later.
Define When X is observable, then setting Hi (α) = Xi in (10.9) results in
(10.10) being the usual score test statistic of H0 : βx = 0. The naive score
L(β0 , βz , α, θ) = test statistic is obtained by setting Hi (α) = Wi in (10.9). We show in
n
1 X © ª this section that when E(X|Z, W) = m(Z, W, α), then setting Hi (α) =
√ Hi (α)di (β0 , βz , θ) Yi − mY (β0 + βzt Zi ) , (10.9) m(Zi , Wi , α) in (10.9) results in a test statistic that is asymptotically
n i=1
equivalent to the efficient score test statistic.
where di used here and ci used below are defined by We now show that under the hypothesis H0 : βx = 0, the test statistic
di (β0 , βz , θ) = m′Y (β0 + βzt Zi )/g 2 (β0 + βzt Zi , θ) in (10.10) is asymptotically chi-square with degrees of freedom equal to
the common dimension of Hi (α), Xi and βx . It follows from Carroll and
ci (β0 , βz , θ) = di (β0 , βz , θ)m′Y (β0 + βzt Zi ).
Ruppert (1988, Chapter 7) that to order op (1), under H0
Our test statistic is based on L with the parameters β0 , βz , α, and θ µ ¶ n µ ¶
replaced by estimators. Also define √ βb − β0 C −1 X 1 © ª
n b0 ≈ √3 di Yi − mY (β0 + βzt Zi ) ,
n βz − βz n Zi
X i=1
C1 (β0 , βz , α, θ) = n−1 Hi (α)Hit (α)ci (β0 , βz , θ); where the dependence of C3 and di on the parameters has been sup-
i=1
n
pressed. Since E(Yi |Zi , Wi ) = E(Yi |Zi , Xi ) = mY (β0 + βzt Zi ) under
X the null hypothesis, it is straightforward to show that to order op (1),
C2 (β0 , βz , α, θ) = n−1 Hi (α)(1, Zti )t ci (β0 , βz , θ);
i=1 X n
n
X Lt (βb0 , βbz , α b ≈ √1
b, θ) di
C3 (β0 , βz , α, θ) = n−1 (1, Zti )t (1, Zti )ci (β0 , βz , θ); n i=1
½ µ ¶¾
i=1 © ª 1
× Yi − mY (β0 + βzt Zi ) Hi (α) − C2 C3−1 , (10.11)
D(β0 , βz , α, θ) = C1 − C2 C3−1 C2t , Zi

254 255
and Lt (βb0 , βbz , α b is hence asymptotically multivariate normal with
b, θ) Bibliographic Notes
mean zero and covariance matrix σ 2 D(β0 , βz , α, θ). In (10.11) di = di (β0 ,
For the case studied above in Section 10.5.1 there is a parametric model,
βz , θ). It follows that (10.10) has the indicated chi-square distribution.
E(X|Z, W) = m(Z, W, α). As mentioned before, n1/2 -consistent esti-
It remains to show that for generalized linear models, substituting
mation of α is possible by the methods in Chapter 4. It is also possible to
E(X|Z, W) for Hi (α) in (10.10) results in a test that is asymptoti-
constructed asymptotically efficient or nearly efficient score tests based
cally equivalent to the efficient score test. The argument is adapted from
on nonparametric estimates of E(X|Z, W). Stefanski and Carroll (1990a,
Tosteson and Tsiatis (1988).
1991) constructed semiparametric tests that achieve full or nearly full
The density or mass function of a generalized linear model uses the
efficiency when W is unbiased for X and its measurement error variance
exponential family density given by (A.41). Write ξ = g(η) with η =
is known or independently estimated. Sepanski (1992) used nonparamet-
β0 + βxt x + βzt z. Using the assumption of nondifferential measurement
ric regression techniques to construct efficient tests when there exists an
error (conditional independence so that Y and W are independent given
independent validation data set or an independent data set containing
X and Z), the density or mass function of the observed data is
an unbiased instrumental variable.
Z
The ROC curve is commonly used to assess the ability of a marker
fY|Z,W (y|z, w) = fY|Z,X (y|z, x)fX|Z,W (x|z, w)dµ(x)
to diagnose the presence of a disease or other condition, for example, in
Z · ¸ Reiser (2000), serum creatine kinease is used to diagnose when a woman
yg(η) − C {g(η)}
= exp + c(y, φ) fX|Z,W (x|z, w)dµ(x). is a carrier of DMD (Duchenne muscular dystrophy). Reiser (2000) dis-
φ
cusses the estimation of ROC curves when the marker is measured with
Write h(y, z) = exp ([yg(β0 + βzt z) − C {g(β0 + βzt z)}] /φ). Since c(y, φ) error.
does not depend on βx , the likelihood score used in construction of the
efficient score statistic is
¯
∂ © ª¯
log fY|Z,W (y|z, w) ¯¯
∂βx βx =0
Z ¯
1 ∂ ¯
= fY|Z,X (y|z, x)fX|Z,W (x|z, w)dµ(x)¯¯
h(y, z) ∂βx βx =0
·Z
1
= fX|Z,W (x|z, w)fY|Z,X (y|z, x)
h(y, z)
¸
∂ © ª
× log fY|Z,X (y|z, x) dµ(x)
∂βx βx =0
Z ¯
∂ © ª¯
= fX|Z,W (x|z, w) log fY|Z,X (y|z, x) ¯¯ dµ(x)
∂βx βx =0
£ © ª¤
= g ′ (β0 + βzt z) y − C ′ g(β0 + βzt z)
Z
× (x/φ)fX|Z,W (x|z, w)dµ(x)
1£ © ª¤
= y − C ′ g 2 (β0 + βzt z)
φ
g ′ (β0 + βzt z)E(X|Z = z, W = w). (10.12)
If X were observable, the only difference in these calculations would
be that X would replace E(X|Z, W) in (10.12). Hence, the efficient score
test for the observed data is obtained by substituting E(X|Z, W) for X.

256 257
CHAPTER 11

LONGITUDINAL DATA AND


MIXED MODELS

This chapter is concerned with mixed models and longitudinal/clustered


data structures, ones that are more complex than simple random sam-
pling. That is, in previous chapters we have described situations in which
the observed data are (Yi , Wi , Zi ) for individuals i = 1, ..., n.
Actually, we have already described a simple example of such a more
complex data structure, namely the OPEN data as analyzed in Section
9.11. As seen there, we had repeated measures Yij on each individual,
rather than just a single observation Yi . Repeated measures are a type
of clustered data and can be analyzed using mixed model technology.
This chapter is meant to give the reader an overview of some of the
developments in mixed models with covariate measurement error. The
linear mixed model (LMM) has, of course, been the format for most of
the advances, but more recent developments have focused on nonlinear
mixed models. Book-length treatments of mixed models are given by
many authors, including Verbeke and Molenberghs (2000); McCulloch
and Searle (2001); Ruppert, Wand, and Carroll, (2003); and Demidenko
(2004).

11.1 Mixed Models for Longitudinal Data

11.1.1 Simple Linear Mixed Models

Longitudinal data arise when a sample of subjects is followed over a


period of time and, for each subject, some or all variables are measured at
multiple time points. The subjects are often called clusters. Longitudinal
data are a special type of clustered data, one where time is an important
component. Both longitudinal data and clustered data models are often
analyzed by mixed model technology.
Mixed effects models are a natural extension of linear and generalized
linear models for modeling clustered data. The simplest example of a
mixed model is the mixed balanced one-way ANOVA model, where a
single variable, call it Y, is measured at J time points for each of I
subjects. Thus, the data are Yij , i = 1, . . . , I and j = 1, . . . , J. The

259
model is  many, if not all, covariates without error are in both Zij and Aij ; see
 Yij = µ + bi + ǫij ; the examples in Section 11.2.3.
bi ∼ Normal(0, σb2 ); (11.1) Note that marginally, the mean is our old friend β0 +Xtij βx +Ztij βz , but
ǫij ∼ Normal(0, σǫ2 ). the random effects bi induce correlations among the observations within

This model can be viewed as a compromise between two fixed effect an individual. Thus, the variance of Yij is Atij D(θ)Aij + σǫ2 , while the
models: one that assumes equal intercepts for all subjects (σb2 = 0) and covariance between Yij and Yik is Atij D(θ)Aik .
one that assumes a different intercept for each subject (effectively, σb2 =
∞), and the estimate of bi is a weighted average of Y ·· , the grand average
11.1.3 The Linear Logistic Mixed Model
of all Yij , and Y i· , the average of all Yij in the ith sample. The mixed
model with 0 < σb2 < ∞ allows each subject to have a different intercept Mixed models are, of course, not confined to the linear model. For ex-
but assumes that the intercepts are similar, with the degree of similarity ample, suppose that the response Yij is binary. Then the linear logistic
increasing as σb2 decreases. mixed model is the natural modification of (11.2),
An attractive property of model (11.1) is that observations in the same
subject are correlated with correlation coefficient ρ = σb2 /(σb2 + σǫ2 ). In pr(Yij = 1|bi ) = H(β0 + Xtij βx + Ztij βz + Atij bi ), (11.3)
a mixed model framework, the random coefficients bi , i = 1, . . . , I are where, as usual, H(·) is the logistic distribution function.
called random effects, their variance is called a variance component, and The major differences between the linear mixed model (11.2) and the
ρ is called the within-subject, or within-cluster, correlation. logistic mixed model (11.3) are (a) computation and (b) in interpreta-
Note how the OPEN data model (9.33) is a generalization of (11.1). It tion of the fixed effects. Using the probit approximation to the logistic
has the random effect (there called ri and referred to as person-specific distribution function (Section 4.8.2), we see that marginally,
bias), but instead of a mean common to all individuals, it has a linear ( )
regression mean structure. β0 + Xtij βx + Ztij βz
pr(Yij = 1) ≈ H . (11.4)
(1 + Atij D(θ)Aij /2.9)1/2
11.1.2 The General Linear Mixed Model Because the Aij can depend upon Zij , the interpretation of, for example,
The general linear mixed model is (no surprise!) a generalization of the βx as the effect of changing Xij is no longer correct; see Heagerty and
simple mixed model (11.1). In the general linear mixed model, the mean Kurland (2001) for discussion.
µ + bi for each individual is replaced by a regression with random effects.
Specifically, keeping to the notation of this book,
11.1.4 The Generalized Linear Mixed Model
Yij = β0 + Xtij βx + Ztij βz + Atij bi + ǫij , (11.2)
These ideas extend naturally to more complex models for longitudinal
where the random effects bi that vary between subjects are assumed to data and generate the flexible class of generalized linear mixed models
have a normal distribution with mean zero and covariance matrix D(θ), (GLMMs). In a generalized linear mixed model, given the random ef-
depending on a parameter θ: in symbols, bi ∼ Normal{0, D(θ)}. In fects bi , the responses Yij are assumed to have a distribution (normal,
addition, the ǫij are mutually independent with mean zero and variance binomial, etc.), whose mean is given as µb ij,x , where for some function
i

σǫ2 . The regression parameters βx and βz that are constant between g(·),
subjects are called fixed effects. The parameters in θ are called variance g(µb t t t
ij,x ) = β0 + Xij βx + Zij βz + Aij bi . (11.5)
i

components or, more generally when D(θ) is not diagonal, covariance


components. Here µb i
ij,x is the expected value of Yij , the j measured response on the
th

In what follows, except for the example described in Section 11.8.1, ith subject and Xij , Zij , and Aij are covariate vectors of dimension p1 ,
the covariates Zij and Aij are assumed to be observed without error. p2 , and q, respectively.
A major reason for distinguishing between them, rather than putting In the linear mixed model, g(·) is the identity function, while in the lo-
them into a single vector Zij as done elsewhere in this book, is that gistic mixed mode, g(·) is the inverse of the logistic distribution function,
the regression coefficients of Aij , namely bi , are random effects. Often etc.

260 261
11.2 Mixed Measurement Error Models 11.2.2 General Considerations

The generalized linear mixed measurement error model (GLMMeM) An important general principle is that, under the assumption of additive
model of Wang, Lin, Gutierrez, et al. (1998) starts with (11.5), but now error and a normal structural model, the effect of measurement error on
allows for measurement error in the Xij . a GLMM relating Yij to Xij and Zij is to create a new GLMM relat-
In a GLMMeM, Xij is not observed; instead one observes Wij that is ing Yij to Wij and Zij . Stated differently, under the assumptions of
related to Xij . The analytic closed-form bias calculations in Wang, Lin, additive error and a normal structural model, a GLMMeM in the true
Gutierrez, et al. (1998) are obtained under the assumption of classical covariates becomes a GLMM in the observed covariates. The important
additive errors, that is, consequence of this principle is that analytic expressions for bias and
bias-correction can be found by comparing the parameters of the GLM-
Wij = Xij + Uij , (11.6) MeM to those of the GLMM.
A more precise expression of this principle is given by Wang, Lin,
where the Uij are independent Normal(0, Σu ), but (11.6) was not needed
Gutierrez, et al. (1998) Suppose that Xij is scalar, and define the vector
by these authors for numerical bias calculations, estimation, or inference.
Xi = (Xi1 , . . . , Xini )t , and define Zi , Wi , and Ui similarly. Suppose
that
11.2.1 The Variance Components Model Revisited Xi = 1i η0 + Zi η z + exi ,
where 1i is an ni × 1 vector of ones and exi given Zi is Normal(0, Σxxi ).
It is instructive to consider the OPEN study data model (9.33). In our Also define Λi = Σxxi {Σxxi + cov(Ui )}−1 . Then
random effects notation, we have that Xij ≡ Xi , Yij = β0 + β1 Xi +
bi + ǫij , Wij = Xi + Uij . As it transpires, the observed data become a Xi = (Ii − Λi )(1i η0 + Zi η z ) + Λi Wi + b∗i , (11.7)
combination of two linear mixed models, and the unobserved Xi become where b∗i is independent of bi and Wi . It follows from (11.7) that
random effects. To see this, let µx be the population mean of X and
Xij = α0j + η tz Zti αzj + Wit αwj + C tij b∗i (11.8)
let σx2 be the population variance of X. Let µy = β0 + βx µx , and write
∆xi = Xi − µx . Then the observed data consist of two linear mixed for some α0j , αzj , αwj , and C ij .
models: It follows from (11.5) and (11.8) that the observed data (Yi |Wi , Zi )
follow the GLMM with mean
Yij = µy + (βx ∆xi + bi ) + ǫij ; b∗
g(µij,w
i
) = (β0 + α0j βx ) + Wit αwj βx + (η tz Zti αzj βx + Ztij βz )
Wij = µx + ∆xi + Uij .
+ (Atij bi + C tij βx b∗i ). (11.9)
The complication is that the two linear mixed effect models are corre-
lated because ∆xi occurs in both of them. It is an interesting exercise to Note specifically how the variance structure of the Xij becomes an im-
combine these two linear mixed models into a single linear mixed model, portant consideration when properly handling a GLMMeM.
although the notation is nasty.
There are two main points to this exercise: 11.2.3 Some Simple Examples
• With effort, a linear mixed model with measurement error in covari- To illustrate LMMs, GLMMs, and GLMMeMs, this section contains sev-
ates can first be turned into two linear mixed models with correlated eral simple, hypothetical examples. Suppose that on the j th yearly visit
components, and then with notational wizardry be turned into a sin- of the ith subject to a clinic we observe the subject’s systolic blood pres-
gle, albeit more complex, linear mixed model. sure Yij and age Zij , and that there is a linear relationship between
these two variables with a subject-specific intercept and slope. Then a
• Although it is a little difficult to see from this example, the fact that suitable LMM is
the variance of X shows up in the random effects here means that
when handling a GLMMeM model, the variance structure of the co- Yij = β0 + Zij βz + Aij bi + ǫij , (11.10)
variates measured with error must be taken into account. The next where Aij = ( 1 Zij ) and bi = ( b0,i bz,i )t . Here β0 and βz are the
subsection makes this more explicit. average intercept and slope across the population of all potential subjects

262 263
and b0,i and bz,i are the deviations of the ith subject’s intercept and slope 11.2.4 Models for Within-Subject X-Correlation
from average. The matrix of covariance components is
· ¸ The unbiasedness of the naive estimator of D in this example is due
var(b0,i ) cov(b0,i , b1,i ) to the restrictive assumption that the Xij are mutually independent;
D(θ) = ,
cov(b0,i , b1,i ) var(b1,i ) this assumption is called the “homogenous model” by Wang et al. In
many examples, there will be within-subject correlation between the
and θ = {var(b0,i ), var(b1,i ), cov(b0,i , b1,i )}t contains the unique compo-
Xij , and this correlation causes D11 to be biased upward. Wang, Lin,
nents of D. Now suppose that Yij is also related to a true nutrient
Gutierrez, et al. (1998) have a “heterogenous” model for use in such
intake, Xij , over the previous year. If the regression coefficient for Xij is
examples. As should be clear from our discussion, and as is made explicit
a fixed effect, that is, independent of the subject, then the relationship
in Wang et al. (1998, Section 4), if the homogeneous model is fit when the
between Yij , Xij , and Zij could be modeled by the LMM
heterogeneous model holds, then biases occur, both in the fixed effects
Yij = β0 + Xij βx + Zij βz + Aij bi + ǫij . (11.11) and in the random effects. Hence, as mentioned previously, in fitting a
mixed model with measurement error, it is important to consider the
If the true intake is unobserved and the observed intake is Wij , then we structure of the X-variables within each individual.
have a linear mixed measurement error model (LMMeM, a special case
of a GLMMeM).
As mentioned above, for additive errors and a normal structural model, 11.3 A Bias-Corrected Estimator
measurement error’s effect on a GLMM for Yij , Xij , and Zij is to induce
a new GLMM relating Yij to Wij and Zij . To illustrate this principle, An early study of measurement error in longitudinal modeling is Toste-
we use the fact that if Xij is Normal(µz , σx2 ) and independent of Zij , son, Buonaccorsi, and Demidenko (1998). These authors assume that
if the Xij are mutually independent, and if the classical additive error for the ith subject, one observes a t-dimensional vector Yi of responses,
model (11.6) holds, then we have a regression calibration model which corresponds to a t-dimensional vector Xi of true covariate values.
The observed covariates values are Wi = Xi + Ui , where the measure-
Xij = γ0 + λWij + b∗ij , (11.12) ment error vector Ui has t iid Normal(0, σu2 ) components. In their ex-
where the b∗ij are mutually independent and independent of Wij , γ0 = ample, Yi contains five yearly observed plasma beta-carotene levels and
(1−λ)µx , and λ is the attenuation; see Section 2.2.1. Substituting (11.12) Wi contains the corresponding values of observed beta-carotene intakes
into (11.11), one obtains measured by food frequency questionnaires (FFQ) given at the same
times as the plasma beta-carotene assays. Thus, Xi is defined as the
Yij = (β0 + βx γ0 ) + Wij γ1 βx + Zij βz + Aij bi + ǫ∗ij , (11.13) corresponding true beta-carotene intakes, each over the year prior to the
FFQ.
where ǫ∗ij = βx b∗ij + ǫij . Clearly, (11.13) is an LMM. An important feature of their model is that they assume neither val-
Comparing the parameters in (11.11) to those in (11.13) gives analyti- idation data nor replication of the measurements. For example, in their
cal expressions for the asymptotic biases of the naive estimator, because application one never observes true intakes of beta-carotene and no sub-
the naive estimator will be consistent for the parameters in (11.13). For ject fills out more that one FFQ at any yearly visit. Of course, one could
the fixed effects, these biases are the same as discussed in Chapter 3. The have some subjects complete two FFQs at some visits, but these would
naive estimator of the covariance components matrix D is unbiased, since not be true replicates because their errors would be highly correlated.
the random effects part of the model remains Aij bi . The only variance We have seen that measurement error models are usually not identified
component for which the naive estimator is biased is σǫ2 , since the “error” in the absence of validation or replication data. However, for longitudinal
in (11.13) is ǫ∗ij = βx b∗ij + ǫij , so that the naive estimator is consistent data, the repeated measurements can substitute for replication and, as
for βx2 var(b∗ij ) + σǫ2 . This is an example of another general principle: will be seen, allow parameter identifiability, at least if one is willing to
Naive estimates of variance parameters typically are either unbiased or put structure on the mean of the X-values. See also Higgins, Davidian,
biased upward, because the variation included by measurement error is and Giltinan (1997), who noted the same point. The independence of
not modeled and so is attributed to the random effects or error. the measurement errors if, of course, an assumption in itself.

264 265
The model of Tosteson et al. for Yi given Xi is showed in Buonaccorsi, Demidenko, and Tosteson (2000) that maximum
likelihood and pseudo likelihood estimators can be considerably more
Yi = µ + ΓXi + Zvi + ǫi , (11.14)
efficient.
where Γ is a parameter matrix, Z is a known design matrix, and vi is
a Normal(0, Ω) random effect where Ω is unknown. In addition, ǫi is
Normal(0, σǫ2 ). 11.4 SIMEX for GLMMEMs
They found that the naive estimator of Γ is attenuated by the factor
ΣT (ΣT + σu2 I)−1 , but at this level of generality it is apparently not
Wang, Lin, Gutierrez, et al. (1998) have studied the SIMEX method for
possible to get explicit results for the bias of the naive estimator of the
estimation of the parameter in a GLMMeM. They found the SIMEX is
(co)variance component matrix Ω.
straightforward to apply and effective for removing measurement error
Many of their further results assume that Γ = γI, where I is the t × t
induced bias. They used the quadratic extrapolation function. SIMEX
identity matrix and γ is a scalar parameter.
is, of course, applied to some naive estimator, that is, an estimator that
They assume a structural model
would be used if there were no measurement error. For GLMMs there
Xi = µX + Rφi , (11.15) are several possible choice for the naive estimator. Wang et al. (1998)
used the corrected penalized quasilikelihood method (CPQL) of Breslow
where R is a known t × q design matrix, q < t, and φi is a Normal(0, ΩT ) and Lin (1995) and Lin and Breslow (1996).
random effect. The assumption that q < t is crucial and implies that
the (co)variance component matrix ΩT has only q(q + 1)/2, rather than
t(t + 1)/2 unique components. This dimension-reduction identifies the
parameters of the model, even though there are no replicate measure- 11.5 Regression Calibration for GLMMs
ments of the components of Xi .
Typical choices of R are R1 = (1 1 · · · 1)t and As we have seen in Chapter 4, regression calibration is a simple and
µ ¶t effective method for estimating the parameters in a GLM with covariate
1 1 ··· 1 measurement error. However, a naive application of regression calibra-
R2 = . (11.16)
1 2 ··· t tion is not suitable for GLMMeMs (Wang, Lin, and Gutierrez, 1999).
Using R1 implies that components of Xi are constant, each equaling The reason for this is that substituting E(X|W, Z) for X in a GLMM
µX +φi . In this case, the components of Wi are true replicates. However, correctly specifies that fixed-effects structure, but not the random-effects
this assumption is often suspect, and then R2 might be more reasonable, structure. Therefore, the bias of the naive estimators of variance com-
since R2 implies that the components of Xi follow a linear time trend. ponents is not corrected properly by regression calibration. Wang, Lin,
Note that R2 can be used only if t ≥ 3, since q = 2 for R2 . Gutierrez, et al. (1998) stated that since, in general models such as logis-
Under the assumption that Z = R, Tosteson et al. obtained explicit tic regression, fixed-effects parameters and variance components are not
results for the bias of the naive estimator of Ω, the matrix of (co)variance orthogonal, the fixed-effects parameters estimates will also be biased.
components of the random effects, vi . In particular, they found that the Despite these difficulties, Buonaccorsi, Demidenko, and Tosteson (2000)
naive estimator of Ω is positively biased, since it is estimating both Ω have found regression calibration suitable and, in fact, highly efficient for
and extra variability due to measurement error. estimation of fixed-effects in linear mixed models, a special case in which
Tosteson et al. reparameterized their model into two parameter vec- fixed-effects parameters and variance components are orthogonal. More-
tors: one for the marginal density of Wi and the other for the condi- over, in the context of linear mixed models, they showed how one can
tional density of Yi given Wi ; this is different from our discussion in correct the bias of the regression calibration estimates of the variance
Section 11.2.1. Doing this allows all parameters to be estimated by stan- components. The “corrected regression calibration” method equals the
dard methods using currently available software; they used SAS PROC pseudomaximum likelihood estimator discussed in Section 11.6.
MIXED, though S-PLUS or R could be used. They mentioned that this It is worth reiterating the point made in Section 11.2.4, namely, that
“bias corrected” estimator is not the MLE, but they conjectured that regression calibration requires that the within-subject correlation struc-
it is highly efficient. Their conjecture turned out to be false, since they ture of the X-values be properly specified.

266 267
11.6 Maximum Likelihood Estimation though subjects vary in cycle length, the cycle were standardized to 28
days. During the 28-day standard cycle, log PDG stays constant dur-
Buonaccorsi, Demidenko, and Tosteson (2000) continued the study of the
ing the first 14 days and then rises linearly for 7 days before decreasing
linear mixed models in Tosteson, Buonaccorsi, and Demidenko (1998)
linearly at the same rate for the remaining 7 days. Thus, the pattern
and compared the bias-corrected estimator in Tosteson et al. (1998)
of PDG fluctuation can be described by two parameters: the intercept,
with the maximum likelihood and pseudomaximum likelihood estima-
which is the baseline level during the first 14 days, and the slope, which
tors. They partitioned the parameters into two vectors: θ1 , which con-
is the linear rate of increase or decrease during the last 14 days. Al-
tains the parameters in the model for [Y|X, Z], and θ2 , which contains
though this general pattern is constant across women, the intercept and
the parameters in the model for [W|X, Z]. The likelihood for the ob-
slope parameters are subject-specific. Li et al. used these parameters as
served data is the product of the likelihood f (Y|W, Z; θ1 , θ2 ) for Y
covariates in a model where the response Y is absence of osteopenia,
given (W, Z) and the likelihood f (W|Z; θ2 ) for W given Z. Maximum
which is defined as BMD above the 33rd percentile. However, the inter-
likelihood maximizes the product f (Y|W, Z; θ1 , θ2 )f (W|Z; θ2 ). Pseudo-
cept and slope for any subject are unknown and only longitudinal PDG
maximum likelihood (Gong and Samaniego, 1981) estimates θ2 by max-
measurements are available, so the intercept and slope are estimated
imizing f (W|Z; θ2 ) and then maximizes f (Y|W, Z; θ1 , θ2 ) over θ1 with
with error. Li et al. (2004) use several of the estimators discussed in this
θ2 held fixed at this prior estimate. They showed that the pseudomax-
section and find that the subject-specific intercept is not related to the
imum likelihood estimator equals their corrected regression calibration
absence of osteopenia (p ≈ 0.5, depending slightly upon the method) but
estimator mentioned in Section 11.5.
the subject-specific might be (0.07 ≤ p ≤ 0.11 for the various methods).
In a study of efficiency, Buonaccorsi, Demidenko, and Tosteson (2000)
As in the previous example, regressing the absence of osteopenia on the
showed that the pseudomaximum likelihood has nearly the same effi-
intercept and slope is a better summary of the data than relating the
ciency as full maximum likelihood, but the bias-corrected estimator in
absence of osteopenia directly to the PDG values.
Tosteson et al. (1998) has a noticeably lower efficiency.
A number of estimators have been developed for joint modeling. Wang,
Wang, and Wang (2000) proposed a pseudoexpected estimating equation
11.7 Joint Modeling estimator (EEE) (Wang and Pepe, 1999), a regression calibration estima-
tor (RC), and a refined regression calibration estimator. They found that
As discussed in Section 7.3.3.4, joint modeling (Wang, Wang, and Wang,
2000) refers to the use of subject-specific random-effects parameters the RC estimator was biased in nonlinear models. The EEE estimator
performs well but requires numerical integration. The refined RC estima-
from a mixed model as covariates in a second model. Typically, the
random-effects parameters serve as a summary of a series of measure- tor does not require numerical integration and its performance is close to
ments thought to be related to the outcome in the second model. For that of the EEE estimator. Li, Zhang, and Davidian (2004) proposed two
functional estimators, the sufficiency estimator and the conditional score
example, researchers have investigated child-specific linear trends in BMI
(body mass index) between 3 and 5 years of age and related these pa- estimator, both based upon Stefanski and Carroll (1987); see Section
rameters to adult obesity. Wang, Wang, and Wang (2000) found that 7.3.3.4. Li, Zhang, and Davidian (2005) studied two flexible structural
estimators, maximum likelihood and maximum pseudolikelihood, using
both the initial BMI at age 3 and the slope of the linear trend between
3 and 5 years of age had a significant effect on the risk of adult obesity. the seminonparametric (SNP) structural model. Full maximum likeli-
Clearly, relating the risk of adult obesity to the subject-specific inter- hood requires numerical integration, and Li et al. used Gauss–Hermite
quadrature, though Monte Carlo integration could be used.
cept and slope of BMI is a more insightful analysis than relating the risk
directly to numerous measurements of BMI taken on each individual.
The intercepts and slopes are comparable across individuals, while the 11.8 Other Models and Applications
BMI measurements themselves may not be taken at the same age for all
11.8.1 Models with Random Effects Multiplied by X
individuals and therefore may not be directly comparable.
In another application of joint modeling, Li, Zhang, and Davidian Previously, we have assumed that the random effects bi have covariate
(2004) presented an example where progesterone levels (PDG) in women, vectors Aij that are observed exactly. This need not be the case, of
as well as a number of baseline covariates, are related to bone mineral course.
density (BMD) in the hip. PDG varies over the menstrual cycle. Al- Liang, Wu, and Carroll (2003) considered the varying coefficient linear

268 269
mixed model, where random effects multiply X as well as on Z. While • Measurement error then induces a model for the observed data. This
they worked in great generality and included the use of regression splines observed data model may be more or less standard.
(Section 12.2.2 and Chapter 13), their essential idea can be seen in the • Use this observed data model to estimate the underlying true-data
varying coefficient model, namely, model.
Yij = β0i + Xtij βxi + Ztij βzi + ǫij . Zidek, Le, Wong, et al. (1998) and Zidek, White, Le, et al. (1998) took
This is a simple linear regression model, where the regression lines de- exactly the opposite approach, in a clustered-data situation. Specifically,
pend on the individual or cluster. If we define Aij = (1, Xtij , Ztij )t , then they specified a standard GLMM for the observed data, and then, by
the model becomes a linear mixed model: Taylor series expansion and the like attempt to approximate the under-
lying true-data model. The advantage of their approach is that standard
Yij = β0 + Xtij βx + Ztij βz + bi0 + Xtij bix + Ztij biz + ǫij models for observed data are, at least in principle, easier to check. The
= β0 + Xtij βx + Ztij βz + Atij bi + ǫij . (11.17) disadvantage of course is that it is easily possible that the underlying
true-data model derived by their approach may be nearly unrecogniz-
If we now substitute (11.12) into (11.17), we see that the observed data able, except if approximations are made to do so.
no longer follow a standard mixed model, because now the random effects Fairly roughly, in their particular instance, Zidek, Le, Wong, et al.
bi in (11.17) are multiplied by the induced random effects b∗ij in (11.12). (1998) started with a model in which they specified the mean E(Wij |Zi )
Liang et al. (2003) fit this model using regression calibration. and cov(Wij , Wik |Zi ) = Gik , where Zi is the collection of observed
Z-values for the ith person. Let Wi be the collection of observed W-
11.8.2 Models with Random Effects Depending Nonlinearly on X values for the ith person. In our parlance, this is basically specifying the
regression of X on Z as well as the measurement error variance. They
Higgins, Davidian, and Giltinan (1997) and Wu (2002) describe an non- then assumed a model for the observed data: For some parameter α, a
linear mixed effects model where the random effects themselves depend known function G, and letting Aij depend on (Zij , Wij ), their model is
on covariates measured with error. The general form of the model is that
t
Vij = β0 + Wij βw + Ztij βz + Atij (Wij , Zij )bi ;
Yij = mY (Zij , Bij ) + ǫij ;
E(Yij |Wi , Zi , bi ) = mY (Vij ); (11.18)
Bij = d(Xij , βx , bi ),
cov(Yij , Yik |Wi , Zi , bi ) = G(α, Vij ). (11.19)
for known functions mY (·) and d(·). Note here how the random effects
With all the nonlinearities in this model, the hard part clearly is to go
Bij depend on a subject-level random effect bi as well as the true but
from this model for the observed data to the underlying true-data model.
unobserved covariates Xij . It is assumed that Wij = Xij + Uij , where
If one were willing to make the assumption that the random effects bi
the measurement errors Uij are independent with variance σu2 . As in
were independent of the W-values and the Z-values, this can be done
Section 11.3, there are no replicate data to understand the measurement
directly by numerical integration. Zidek et al. instead used various clever
error properties, so a model is used along the lines of (11.15). For ex-
Taylor series approximations to obtain and approximate version of the
ample, Wu assumed that Xij = Z∗ij (µX + αi ), where Z∗ij are observed
underlying true-data model.
and αi are independent random effects. Higginset al. used a regression
calibration approach to fit the models, while Wu used the EM-algorithm.
11.8.4 Autoregressive Models in Longitudinal Data
11.8.3 Inducing a True-Data Model from a Standard Observed Data Schmid, Segal, and Rosner (1994) have an interesting early discussion
Model of measurement error in the longitudinal linear mixed model when data
Throughout this book, we have taken a common approach: within a person have an autoregressive error structure. Specifically, they
allowed covariates (Zi,ps , Xi,ps ) that are person-specific but do not vary
• Begin with a model for the data as if X could be observed, the so- with time, and covariates (Zij,tv , Xij,tv ) that vary with time, so that in
called underlying true-data model. our notation Xij = (Xi,ps , Xij,tv ), Wij = (Wi,ps , Wij,tv ) and Zij =
• Specify the error model properties. (Zi,ps , Zij,tv ). As seen, for example, in Section 11.2.3, one must specify

270 271
correctly the variance components structure of the measurement errors 11.9.1 Basic Model
in Wij and the variance components structure of the true covariates Xij
The following mixed model was used for fitting the data:
in order to obtain valid inferences. Schmid et al. found that when this is
Yijc ∼ Normal{fD(i) (tij ) + Zti βz + Xij βx + ric , σǫ2 }

done, the maximum likelihood estimate is nearly unbiased and has good 
Wij ∼ Normal(Xij , σu2 )

inference properties such as confidence intervals. 
(11.20)
 ric ∼ Normal(sc , σr2 )
∼ Normal(µs , σs2 ),

sc

where Yijc is the (log) white blood cell count for subject i at the visit j
11.9 Example: The CHOICE Study at the clinic c. The temporal effect is captured by the functions fD(i) (tij ),
where D(i) is the dialysis type indicator for subject i. Both functions,
f0 (t) and f1 (t), were modeled as linear regression splines, that is, piece-
wise linear functions, with three knots at 5, 10, 15 months after initia-
The Choices for Healthy Outcomes in Caring for ESRD (CHOICE) study tion of chronic dialysis. The two covariates, Zi , that were not subject
is a multicenter prospective study to investigate treatment choices and to measurement error were age at baseline and sex. CRP was subject
outcomes of dialysis care among patients beginning dialysis. Its ratio- to measurement error and true (log) CRP was denoted by Xij , while
nale and design have been reported by Powe, Klag, Sadler, et al. (1996). observed (log) CRP was denoted by Wij . The subject-level random ef-
Briefly, the CHOICE study recruited 1,041 incident dialysis patients fect ric is assumed to have a normal distribution with mean equal to
from 81 DCI (Dialysis Clinic Inc.) clinics between 1995 and 1998. Eli- the clinic level mean sc and variance σr2 , while the clinic means were
gibility criteria for CHOICE study included ability to provide informed assumed to have a normal distribution with mean equal to the overall
consent for participation, age older than 17 years, and ability to speak mean and variance σs2 . A useful measure to describe the sources of resid-
English or Spanish. Patients were enrolled a median of 45 days from ini- ual variability is R = σr2 /(σr2 + σs2 ), which is the fraction of within and
tiation of chronic dialysis (98% within 4 months), 54% of the cohort had between clinic variability attributable to within clinic variability.
diabetes at baseline and 51% of the cohort died by December 1, 2001.
Dialysis population is subject to high risk of inflammation. The white 11.9.2 Naive Replication and Sensitivity
blood cell (WBC) count and the C-reactive protein (CRP) are both in-
flammatory markers but may reflect different physiologic changes. How In order to understand the measurement error variance σu2 , we need some
WBC correlates with CRP after the initiation of dialysis remains un- version of a replication study. The subtlety here is that there are two
known. We use the CHOICE data to describe the longitudinal associ- sources of measurement error. The first is that what we measure is short-
ation between WBC and CRP and examine whether dynamic changes term (log) CRP, and CRP at a specific time does not fully characterize
are different between hemo and peritoneal dialysis. the average CRP, much as the OPEN protein biomarker measured short-
term protein intake. The second source of measurement error is the assay
Because dialysis patients have a high death rate, a complete analysis variability, that is, the error that the laboratory makes in biochemically
would require the joint modeling of the survival and biomarker pro- measuring a sample. In most cases, the assay variability is small relative
cesses. However, for illustration purposes we focus only on subjects who to the variability of the short-term measurement around its long-term
have died during the study. We further restrict our data set to those average, and if only assay variability is taken into account, little correc-
subjects who have at least one WBC and one CRP measure during the tion for measurement error will be made.
same visit. Our subset of the data contained 373 subjects from 28 dialysis
Suppose, for example, that all one considers is laboratory variability.
clinics. The analysis of these data is complicated by the expected nonlin-
Then a replication model is
ear longitudinal trajectory of the biomarkers after initiation of dialysis, (
clustering of observations within subject and of subjects within clinics, (r) (r)
Wlk ∼ Normal(Xl , σu2 )
and measurement error in CRP. Information on the CRP measurement (r) (r) (11.21)
Xl ∼ Normal(µx , σx2 ),
error is available from a blinded duplicate CRP assay conducted on 42
subjects. where (r) indicates that these data come from a separate replication

272 273
(r)
study. Here Wlk , k = 1, 2 are the two lab results of the same sam-
ple from subject l = 1, . . . , 42. The joint analysis of models (11.20) and CRP Sex Age R
(11.21) was done using Bayesian inference based on MCMC sampling, as
described in Chapter 9. What we can expect, of course, is that labora-
tory/assay variability will be small relative to biological variability and λ > .999 Point est. .074 -.044 -.0027 .953
the variability of (log) CRP in the population, so that little correction Std. Err. .005 .031 .0012 .033
for measurement error will occur.
To show this, Table 11.1 displays posterior medians and 95% credible λ = .9 Point est. .080 -.045 -.0029 .953
intervals for several parameters of interest, based on 100,000 simulations Std. Err. .005 .031 .0012 .033
from the joint distribution after discarding 10,000 burn-in samples. Not
surprisingly, there is a strong, statistically significant correlation between
(log) WBC and (log) CRP. Also, even after adjusting for CRP the effect λ = .8 Point est. .087 -.045 -.0029 .952
of age is statistically significant and indicates smaller WBC correspond- Std. Err. .006 .031 .0012 .033
ing to older subjects. Sex was not statistically significant. Another in-
teresting finding is that most of the residual variability (∼ 95%) is due
to between-subject variability, with roughly 5% variability being due
Table 11.1 Estimates and standard errors from the CHOICE data Bayesian
to between-clinic variability. Reflecting good measurement calibration
analysis using the longitudinal model (11.20) that also accounts for clustering.
of the (log) CRP, the posterior mean of the reliability parameter was “CRP” is the (log) CRP and “R” is R = σr2 /(σr2 + σs2 ), which is the fraction
0.9994 with a confidence interval [0.9989, 0.9997]. This reliability is of of within and between clinic variability attributable to within clinic variability.
course misleading because it reflects only the precision of measuring Standard errors are obtained from the simulation algorithm. Different values
(log) CRP in a given blood sample and does not incorporate potential of λ indicate the corresponding reliability, with λ > .999 corresponding to
short term biological variability of (log) CRP. Because no direct data data that takes into account only laboratory measurement error and λ = .8, .9
were available to assess the biological measurement error due to using corresponding to hypothetical levels of biological reliability.
one blood sample to represent the short term (log) CRP average, we
conducted a sensitivity analysis using several smaller levels of reliability.
by assuming the following model,
Table 11.1 shows results if the biological reliability of (log) CRP were
∼ Normal{Zti βz + Xij βx + ric , σǫ2 }

λ = 0.9 and λ = 0.8. Interestingly, none of the parameter inferences  Yijc
changes significantly, with the exception of the (log) CRP parameter, ric ∼ Normal(sc , σr2 ) (11.22)
which changes by 18% from 0.074 to 0.087. sc ∼ Normal(µs , σs2 ),

The top plot in Figure 11.1 displays the posterior means of f0 (t) (solid where information about the unobserved process Xij and measurement
line) and f1 (t) (dashed line) for t ≤ 30 months, corresponding to hemo error variance is obtained from the model
and peritoneal dialysis, respectively. The bottom plot displays the pos-
 Wij ∼ Normal(Xij , σu2 )

terior mean and 95% pointwise confidence intervals of f1 (t) − f0 (t).
Xij = fD(i) (tij ) + vi (11.23)
vi ∼ Normal(µv , σv2 ).

11.9.3 Accounting for Biological Variability Here the true unobserved (log) CRP process is assumed to be a dialysis-
specific regression spline with three knots and subject specific random
As described in Section 11.3, one way to get at the biological variabil- intercepts. The rest of the variability in the observed (log) CRP is as-
ity (measurement error) in longitudinal studies is by assuming a simple sumed to be measurement error with variance σu2 . Note that σu2 is es-
and reasonable model for the variable observed without error. Moreover, timable from the model without using the replicated lab data, which was
measurement error variance is identifiable as long as the number of de- not used in this model.
grees of freedom in the measurement error model is smaller than the Table 11.2 reports posterior means and standard errors for several
number of observations per subject. We illustrate this methodology here parameters of interest. Interestingly, the point estimator relating (log)

274 275
Adjusted dialysis effect
CRP Sex Age R
2.3

2.2 Point est. .101 -.043 -.0028 .954


Std. err. .014 .032 .0013 .032
2.1
0 5 10 15 20 25 30
Time (months)

0.4
Table 11.2 Estimates and standard errors from the CHOICE data Bayesian
Adjusted difference

analysis using the longitudinal model (11.22) with biological measurement error
0.2 model (11.23) that also accounts for clustering. “CRP” is the (log) CRP and
“R” is R = σr2 /(σr2 + σs2 ), which is the fraction of within and between clinic
0 variability attributable to within clinic variability. Standard errors are obtained
from the simulation algorithm.
−0.2
0 5 10 15 20 25 30
Time (months)
allows all parameters to be identified. Li, Shao, and Palta (2005) have
another interesting application of this important concept, and they used
Figure 11.1 Comparison of trajectories of (log) WBC for hemo and peritoneal a structural model different from that of Tosteson et al.; see also Wu
dialysis patients adjusted for (log) CRP, age and sex. Top: adjusted population
(2002).
(log) WBC trajectories for Hemodialysis (solid line) and peritoneal dialysis
(dashed line). Bottom: difference between adjusted population (log) WBC tra-
The functional estimators for joint modeling in Li, Zhang, and David-
jectories (peritoneal-hemo) with pointwise 95% confidence intervals. ian (2004) are extended to multivariate longitudinal data by Li, Wang,
and Wang (2005).
Ko and Davidian (2000) studied a two-component nonlinear model
WBC to (log) CRP when biological measurement error is taken into for longitudinal data. In the first component of the model, a vector of
account is 0.101, or 36% larger than the estimator based only on labo- responses on a subject depends on covariates and a subject-specific pa-
ratory measurement error reported in Table 11.1. The standard error of rameter. In the second component of the model, the subject-specific pa-
the (log) CRP estimator has also increased from 0.005 to 0.014, or 180%. rameters depend on covariates, random effects, and fixed effects. Some of
This is essentially due to much larger estimated biological measurement the covariates in the second component are measured with error. Ko and
error variance, with a posterior mean equal to 0.91, which corresponds Davidian presented an example from an AIDS clinical trial that shows
to a posterior reliability of 0.51 for the (log) CRP data. Results were the flexibility of this methodology.
practically unchanged for the other parameters.

Bibliographic Notes
Wang and Davidian (1996) were among the first authors to study the
effects of measurement error on variance component estimators. They
studied Berkson models and found that even a modest amount of mea-
surement error could seriously bias the estimates of intrasubject vari-
ability.
As mentioned earlier, Higgins, Davidian and Giltinan (1997) and Toste-
son, Buonaccorsi, and Demidenko (1998) discovered that if a mismea-
sured covariate X is observed longitudinally, then a structural model for
X with dimension less than the number of X-observations per subject

276 277
CHAPTER 12

NONPARAMETRIC
ESTIMATION

In this chapter, we give an overview of two nonparametric estimation


problems that are of interest in their own right and also arise as sec-
ondary problems in regression calibration and hypothesis testing. The
first problem is the estimation of the density of a random variable X,
while the second is the nonparametric estimation of a regression, both
when X is measured with error.

12.1 Deconvolution
12.1.1 The Problem
The fundamental problem of deconvolution is that of estimating the
density of X when W = X + U is observed and the density of U is
known. Closely related is the problem of estimating the regression func-
tion, m(w) = E(X | W = w), when only W = X+U is observed and the
density of U is known. The latter estimation problem is encountered in
both regression calibration (Chapter 4) and hypothesis testing (Chapter
10).
There are at least three reasons for trying to understand the density
function of X. Suppose that X is a continuous, scalar random variable,
and that there are no covariates Z measured without error.
• Sometimes, the distribution of the latent X is of intrinsic interest, for
example, in nutritional epidemiology, where X represents the usual
intake of foods. In this case, let fX (x) beR its density function. Then
c
the distribution function is pr(X ≤ c) = −∞ fX (x)dx.
• When X is unobservable, likelihood methods (Chapter 8) require a
model for the density of X. Regression calibration (Chapter 4) consists
of the usual analysis but with X replaced by
Z
1
m(W) = E(X|W) = xfX (x)fw|x (W|x)dx
fW (W)
Z
1
= xfX (x)fU (W − x)dx. (12.1)
fW (W)

279
• In Section 10.5, it was shown that when testing for the effect of the for certain smooth kernels, Fourier inversion of fbX (x) is possible; see also
covariate measured with error, replacing X with an estimate of its Stefanski (1989). With an appropriately smooth kernel, the estimator,
regression m(W) on W yields the hypothesis test with the highest Z
1 φbw (t)
local power (asymptotically). fbx (x) = e−itx dt,
2π φu (t)
Estimating the density function, fX , of X is thus critical.
exists, and for suitable choice of bandwidth is consistent for fX (x). The
deconvoluting kernel density estimator, fbx (x), integrates to one but is
12.1.2 Fourier Inversion not always positive. It has the alternative representation
The density function fW is the convolution of fX and fU , Pn
fbx (x) = (nh)−1 K∗ {(Wi − x)/h, h},
i=1
Z
fW (w) = fX (x)fU (w − x)dx, (12.2) where the deconvoluting kernel is
Z
1 φK (y)
and we thus refer to the problem of estimating fX in the absence of K∗ (t, h) = eity dy.
2π φu (y/h)
parametric assumptions as deconvolution.
When both fW and fU are known, fX is recovered by Fourier inversion. The deconvoluting kernel density estimator has pointwise mean squared
Letting φa denote theR characteristic function of the random variable A, error
n o2
for example, φw (t) = eitw fW (w)dw, we have that φx (t) = φw (t)/φu (t).
MSE = E fbx (x) − fX (x)
Then by Fourier inversion,
Z Z Z ½ ¾2
1 1 φw (t) 4 −1 φK (t)
fX (x) = e−itx
φx (t)dt = e−itx dt. ∼ ch + (2πhn) dt;
2π 2π φu (t) |φu (t/h)|
Z Z n o2
′′
Even if, as we will now suppose, the density function fU of U is known, where c = (1/4) x2 K(x)dx fX (x) dx.
the problem is complicated by the fact that the density of W is unknown
and must be estimated. For the deconvolution problem under these as-
sumptions, estimators with known rates of convergence were first ob- 12.1.4 Properties of Deconvolution Methods
tained by Stefanski and Carroll (1986, 1990c), Carroll and Hall (1988) The best bandwidth, in the sense of minimizing MSE asymptotically,
and Liu and Taylor (1989). and the best MSE depend on the error density through its character-
istic function φu . It is well known that in the absence of measurement
12.1.3 Methodology error (U ≡ 0), when fX has two continuous derivatives the best MSE
converges to zero at the rate n−4/5 . However, for nondegenerate U, con-
We now describe a solution to the deconvolution problem. Statisticians vergence rates are much slower in general. The best rate of convergence
have studied kernel density estimates of fW of the form depends on the tail behavior of |φu (t)|, with lighter tails resulting in
Pn slower rates of convergence. The tail behavior of |φu (t)| is in turn re-
fbw (w) = (nh)−1 i=1 K {(Wi − w)/h} ,
lated to the smoothness of fU (u) at u = 0, with smoother densities
where K(·) is a density function and h is the bandwidth, both chosen by having characteristic functions with lighter tails.
the user. The function fbw is itself a density function, with characteristic For example, if U is normally distributed, then
function φbw . It has long been known that for estimation of fW (w) the
|φu (t)| = exp(−σu2 t2 /2)
choice of kernel is relatively unimportant, and ease of use commonly
dictates the choice of K(·), for example, the standard normal density or is extremely light tailed, and the mean squared error converges to 0
−2
a density with bounded support. at a rate no faster than the exceedingly slow rate of {log(n)} . The
It transpires that for commonly used kernels, the estimated density implication is that with normally distributed errors, it is not possible
fbX (x) cannot be deconvolved, in that the integral encountered in Fourier to estimate the actual value of fX (x) well. However, detailed analyses
inversion is not defined. Stefanski and Carroll (1986, 1990c) showed that by Wand (1998) indicate that, for lower levels of measurement error,

280 281
deconvolving density estimators can perform well for reasonable sample will be nigh well impossible in this context. Specifically, we refer to the
sizes. problem of actually estimating the bandwidth.
If U has a more peaked density function than the normal, then |φu (t)| A simple example will suffice to make the point. Later on, we will
does not diminish to 0 as rapidly, and the deconvoluting kernel density study the Framingham data; see Section 12.1.9. These data have 1, 615
estimator has better asymptotic performance. For example, √ consider √ the observations with a reliability ratio of about 0.75, so this is hardly a
Laplace distribution with density function fU (u) = (1/σu 2)exp(− 2|u| nasty example. We applied both of the default methods described above
/σu ). In this case φu (t) = 2/(2 + σu2 t2 ), and the optimal mean squared to these data. Figure 12.1 is the result, and it is amusing. The default
error converges to zero at the rate n−4/9 , tolerably close to the rate in deconvolution method appropriate for Laplace errors is so wild that it
the absence of measurement error, that is, n−4/5 . swamps the default deconvolution method appropriate for Gaussian er-
The fact that smoothness of the error density determines how well rors as well as the best-fitting normal approximation. We removed this
fX can be estimated is a disconcerting nonrobustness result. An open totally ridiculous estimate and replotted, see Figure 12.2: This is not
problem, of course, is how to construct deconvolution estimates that are much better!
adaptive to the amount of smoothness of the measurement error density. The point is that one should be wary, maybe even suspicious, of meth-
We note that the slow rate of convergence of fbX (x) is intrinsic to ods of automatic bandwidth selection in deconvolution kernel methods.
the deconvolution problem, and not specific to the deconvoluting kernel We tend to think that the better method is to vary the bandwidth from
density estimator, which is known to achieve the best rate of convergence smallest to larger and stop when the graph becomes reasonably smooth.
in general (Carroll and Hall, 1988; Stefanski and Carroll, 1990c). The best that one can hope to get out of this is a look at shape.
However, rates of convergence are not always fully informative with
regard to the adequacy of fbx (x) for estimating the basic shape of fX (x).
Normal and Deconvolution Fits
As shown in the examples below, despite the slow pointwise rate, the 0.2
Normal
estimator itself can provide useful information about shape. Taylex

In applications, calculation of fbx (x) requires specification or estima- 0.15


Gaussian−Deco
Laplace−Deco
tion of a bandwidth h. Stefanski and Carroll (1990c) described a band-
width estimator when the improper sinc kernel, K(t) = (πt)−1 sin(t), 0.1
is used. Stefanski (1990) showed that for a large class of kernels and a
large class of error densities that includes the normal densities, the mean
0.05
squared error is minimized asymptotically by a known sequence of band-
−1/2
widths — the optimal bandwidth is h = hG = σu {log(n)} for nor-
0
mal (Gaussian) error. For Laplace measurement error and the kernel with
characteristic function φK (t) = (1−t2 )3 when |t| ≤ 1 and zero otherwise,
Fan, Truong, and Wang (1991) suggested taking hL = (1/2)σu n−1/9 . −0.05

Fan and Truong (1993) and Carroll and Hall (2004) also considered
the use of the deconvoluting kernel function K∗ (t, h) = φ(x){1 − σu2 (x2 − −0.1

1)/(2h2 )}, which is the deconvoluting kernel when the errors have a
Laplace density and the basic kernel function is the standard normal −0.15
60 80 100 120 140 160 180 200 220 240 260
density φ(x). Carroll and Hall (2004) were more interested in regression Framingham Untransformed SBP

function estimation and called their method Taylex; see also Section
12.2.7. Figure 12.1 Density estimates of untransformed SBP in Framingham. Four es-
timates are considered here, but the wildly varying one uses the kernel function
with characteristic function φK (t) = (1 − t2 )3 when |t| ≤ 1 and zero otherwise,
12.1.5 Is It Possible to Estimate the Bandwidth? and the default bandwidth (1/2)σu n−1/9 . The only real purpose of this figure
is to show that automatic bandwidth selection for deconvolution is very hard.
Not withstanding the previous comments, the fact that deconvolution is
hard theoretically means that things that are hard to do in easy problems

282 283
Normal and Deconvolution Fits 12.1.6.2 Moment Methods
0.04
Normal
Taylex
0.035 Gaussian−Deco We can also learn something about the first four moments of X without
numerical integration, useful, for example, if one wants to employ the
0.03
Pearson or Johnson family of densities. Suppose that W = X+U, where
0.025 U is normally distributed with mean zero and variance σu2 . The mean of
2
W is µx = E(X); the variance of W is σw = σx2 + σu2 . Let κ3x and κ4x
0.02
be the skewness and kurtosis of X, being 0 and 3, respectively, if X is
0.015 normally distributed. Then the skewness and kurtosis of W are related
to the skewness and kurtosis of X as follows:
0.01

κ3w = κ3x σx3 /σw


3
;
0.005
κ4w 4
= (κ4x σx4 + 6σx2 σu2 + 3σu4 )/σw , (12.3)
0

from which the skewness and kurtosis of X can be extracted.


−0.005
With replicates, one can push this through even further, making min-
−0.01
60 80 100 120 140 160 180 200 220 240 260
imal distributional assumptions about U, and then fit a parametric dis-
Framingham Untransformed SBP
tribution for X via method of moments. To be specific, suppose that in
a sample of size n, one observes replicate observations Wi,j = Xi + Ui,j
Figure 12.2 Density estimates of untransformed SBP in Framingham. Three (i = 1, . . . , n and j = 1, 2), where it is assumed only that the distribution
estimates are considered here: the best normal approximation (solid line), the
of the errors is symmetrically distributed about zero, something which
Taylex method (dashed line), and the deconvoluting estimator, which uses the
sinc function with a default bandwidth (dot-dashed line).
often can be achieved by transformation.
Let µbw = W·· (the mean), and for k = 2, 3, 4 define sw,k to be the
¡ ¢k
sample mean of the terms Wi,· − µ bw . For k = 2, 4 define su,k to be
12.1.6 Parametric Deconvolution k
the sample mean of the terms {(Ui,1 − Ui,2 )/2} . The term sw,k is an
estimate of the kth central moment of the W i,· ’s, while under symmetry
12.1.6.1 Likelihood Methods
su,k is an estimate of the kth moment of (Ui,1 − Ui,2 )/2, which because
Nonparametric deconvolution is not the only way to estimate the density of symmetry is the same as the kth moment of (Ui1 +Ui2 )/2 = W i· −Xi .
of X in an additive model. By equating moments we find the following consistent estimates of the
If one has a parametric model in mind for X, for example, Weibull, moments of the distribution of X,
gamma, skew-normal, skew-t, mixtures of normals (Wasserman and Roeder,
E(X) = µx ≈ µ
bw ;
1997; Carroll, Maca, and Ruppert, 1999), the SNP (seminonparametric)
family (Zhang and Davidian, 2001), etc., then the density/likelihood E(X − µx )2 ≈ sw,2 − su,2 ;
function for the observed W is given by the fundamental convolution E(X − µx )3 ≈ sw,3 ;
equation (12.2). Assuming that the integration can be done, for exam- E(X − µx )4 ≈ sw,4 − su,4 − 6(sw,2 − su,2 )su,2 .
ple, numerically for maximum likelihood, via MCMC for Bayes, we can
then estimate the unknown parameters and hence obtain an estimate of
12.1.6.3 The SNP Family
the density for X.
This simple prescription can be more or less easy, depending on the In the case of additive normally distributed measurement error, the SNP
flexibility of the parametric family involved. After all, the basic fact is (seminonparametric) distribution has a ready form. The SNP density for
that if no assumptions are made, then it is very difficult to assess the X in the scalar case with K ≤ 2, location µ and scale σx is given as
density function accurately: This suggests that even flexible parametric
methods may have numerical difficulties. fX (x) = σx−1 φ{(x − µ)/σx }GK {(x − µ)/σx },

284 285
PK
where GK (x) = ( j=0 aj xj )2 . If K = 0, this is just the normal density can be computed exactly as
function with mean µ and standard deviation σx . For K > 0, there are n o
constraints on the aj that make this a density function. For example, if fW (w) = (λ/σx2 )1/2 φ(η) (a0 + a1 λ1/2 η)2 + θ2 a21 .
K = 1, we can write a0 = sin(α) and a1 = cos(α), where −π/2 ≤ α ≤
π/2 is a free parameter. Similarly, if K = 2, then define c1 = sin(α1 ), For K = 2,
½
c2 = cos(α1 )sin(α2 ) and c3 = cos(α1 )cos(α2 ), where (α1 , α2 ) are the free
parameters such that −π/2 ≤ α1 , α2 ≤ π/2. Let A be the 3 × 3 matrix fW (w) = (λ/σx2 )1/2 φ(η) (a0 + a1 κ + a2 κ2 )2 + 3a22 θ4
with diagonal elements (1, 1, 3), with element (1, 3) and element (3, 1) ¾
equal to 1.0, and equal to 0.0 elsewhere. Let B be the symmetric square +(θa1 + 2θκa2 )2 + 2a2 θ2 (a0 + a1 κ + a2 κ2 ) .
root of A. Then (a0 , a1 , a2 )t = B −1 (c1 , c2 , c3 )t .
For any of K = 0, 1, 2, the idea is to use maximum likelihood to estimate
the parameters.
Simulation 1 Simulation 2 Simulation 3
2 3 4 Of course, nothing comes for free in deconvolution problems. To see
1.5
2
3 this, we generated nine data sets with n = 3, 145 observations, mean
1 2 3.24, X normally distributed with variance 0.052, and U normally dis-
1
0.5 1 tributed with variance 0.171, in line with the NHANES example in Sec-
0
2.5 3 3.5 4
0
2.5 3 3.5 4
0
2.5 3 3.5 4
tion 12.1.10. This is a lot of measurement error! We fit the SNP dis-
Simulation 4 Simulation 5 Simulation 6
tribution with K = 2 to do a parametric deconvolution. In this case,
4 2 3 remember, X is normally distributed, but five of the nine fits are very
3 1.5
2 nonnormal, with one suggesting a t-density (bottom right, ignore the
2 1 extra modes) and the others being noticeably multimodal, even though
1
1 0.5 the SNP family with K = 2 includes the normal distribution. The point
0
2.5 3 3.5 4
0
2.5 3 3.5 4
0
2.5 3 3.5 4 is that one should not overinterpret things like multiple modes when
Simulation 7 Simulation 8 Simulation 9 doing deconvolution with a flexible family of distributions.
2 2 4
This example is a little unfair, because SNP fits almost always come
1.5 1.5 3
equipped with mention of model selection. In Figure 12.4 we have plotted
1 1 2
the fits to the same nine simulated data sets as in Figure 12.3, with the
0.5 0.5 1
exception that we have allowed K = 0, the correct model, and K = 1
0 0 0
2.5 3 3.5 4 2.5 3 3.5 4 2.5 3 3.5 4 as well as K = 2, and let the model be chosen by AIC, which penalizes
slightly towards the correct normal model. Here, AIC selected the normal
Figure 12.3 Simulation of the SNP family for parametric deconvolution with
model in eight of the nine data sets.
K = 2 and normally distributed X and U. The mean and variance of X are
3.24 and 0.052, respectively, while the variance of U is 0.171. The solid line is
the SNP fit with K = 2, while the dashed line is the normal fit. Displayed are 12.1.7 Estimating Distribution Functions
nine simulated data sets: The fits should all look normal, but do not.
The pessimistic nature of the results for density estimation with normally
distributed error extends to estimating quantiles of the distribution of X,
Let the measurement error have variance σu2 , and make the definitions for example, pr(X ≤ x). Here the optimal achievable rate of convergence
λ = σx2 /(σx2 + σu2 ), θ = (λσu2 /σx2 )1/2 , η = λ1/2 (w − µ)/σx , and κ = λ1/2 η. −3
is of the order {log(n)} , hardly much of an improvement! This casts
Then, the density function of the observed W is given as
doubt on the feasibility of estimating quantiles of the distribution of X
Z without making parametric assumptions.
fW (w) = (λ/σx2 )1/2 φ(η) φ(z)G(κ + θz)dz, (12.4) There are at least two alternatives to a full-blown likelihood analy-
sis. The moment-matching method described previously starts from a
where φ(·) is the standard normal density function. When K = 1, this model for the density function of X, but makes no assumptions about

286 287
Simulation 1 Simulation 2 Simulation 3 Normal and Best LaPlace
2 2 2 0.025
Normal
1.5 1.5 1.5 LaPlace

1 1 1

0.5 0.5 0.5 0.02

0 0 0
2.5 3 3.5 4 2.5 3 3.5 4 2.5 3 3.5 4

Simulation 4 Simulation 5 Simulation 6


2 2 2
0.015
1.5 1.5 1.5

1 1 1

0.5 0.5 0.5


0.01
0 0 0
2.5 3 3.5 4 2.5 3 3.5 4 2.5 3 3.5 4

Simulation 7 Simulation 8 Simulation 9


3 2 2
0.005
1.5 1.5
2
1 1
1
0.5 0.5
0
0 0 0 80 100 120 140 160 180 200 220
2.5 3 3.5 4 2.5 3 3.5 4 2.5 3 3.5 4 Framingham Untransformed SBP

Figure 12.4 Simulation of the SNP family for parametric deconvolution with
Figure 12.5 Density estimates of untransformed SBP in Framingham. Two
K chosen by AIC and normally distributed X and U. The mean and variance
estimates are considered here: the best normal approximation (solid line), the
of X are 3.24 and 0.052, respectively, while the variance of U is 0.171. The
Taylex method (dashed line), with bandwidth chosen by eye to be as small as
solid line is the SNP fit with K = 2, while the dashed line is the normal fit. In
possible but still be smooth.
contrast to Figure 12.3, which always used a flexible model, the use of AIC to
penalize toward the normal model works reasonably well here.
the density of U. Its output is an estimated density function that yields estimated quantiles.

Alternatively, with no model for the density of X but a good model for the error density of U, the SIMEX method can be applied. Previous applications of SIMEX have been to estimated parameters and nonparametric regression estimates, but here the basic input is an empirical distribution function (possibly presmoothed). Stefanski and Bay (1996) studied SIMEX methods for deconvoluting finite-population distribution functions.

12.1.8 Optimal Score Tests

While estimating a density function nonparametrically is difficult in the presence of measurement error, estimating smooth functionals of the unknown density, for example, m(w) = E(X|W = w), is often not as difficult.

For estimating m(w), we can simply replace fX and fW in (12.1) by their estimators. Stefanski and Carroll (1991) showed that this substitution works, in the sense that the resulting estimate of m(w) when substituted into the score test typically achieves the same local power as if m(w) were a known function.

The reason for this is that m(w) is much easier to estimate than fX, because of the extra integration in (12.1). In fact, with normally distributed measurement errors, the rate of convergence for estimating m(w) is of order n^{−4/7}, while for Laplace error the rate is the usual nonparametric one, that is, n^{−4/5} (Stefanski and Carroll, 1991).

12.1.9 Framingham Data

We applied deconvoluting kernel density estimation techniques to the Framingham data, for untransformed systolic blood pressure (SBP), rather than using the transformation log(SBP − 50). We used SBP at Exam #2 only to estimate the measurement error variance, but deconvolved SBP measured at Exam #3 (W). In the original scale, observed SBP had mean 130.01, variance 395.65, and the estimated measurement error variance was 83.69. This leads to an estimate of the variance for long-term SBP (X) of 311.96.
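A minimal sketch of a deconvoluting kernel density estimator of the kind used in this section, assuming normally distributed measurement error with known variance. The choice of smoothing kernel (one whose Fourier transform is (1 − t²)³ on [−1, 1]), the bandwidth value, and all function names are our own illustrative choices, not the estimator actually used for the Framingham analysis; the toy data only mimic the rough magnitudes quoted in the text.

```python
import numpy as np

def deconvolving_kde(x_grid, w, sigma_u, h, n_t=801):
    """Deconvoluting kernel density estimate of f_X on x_grid, assuming
    W = X + U with U ~ Normal(0, sigma_u^2).  The smoothing kernel is the
    one whose Fourier transform is (1 - t^2)^3 on [-1, 1]."""
    w = np.asarray(w, dtype=float)
    t = np.linspace(-1.0, 1.0, n_t)                       # support of the kernel's FT
    dt = t[1] - t[0]
    phi_K = (1.0 - t ** 2) ** 3                           # FT of the smoothing kernel
    phi_U = np.exp(-0.5 * (sigma_u * t / h) ** 2)         # FT of U evaluated at t / h
    ecf_W = np.exp(1j * np.outer(t / h, w)).mean(axis=1)  # empirical CF of W at t / h
    weights = phi_K * ecf_W / phi_U
    est = np.empty(len(x_grid))
    for j, x in enumerate(x_grid):
        integrand = np.exp(-1j * (t / h) * x) * weights
        est[j] = np.real(integrand.sum() * dt) / (2.0 * np.pi * h)
    return np.maximum(est, 0.0)                           # clip small negative values

# Toy usage: deconvolve simulated SBP-like data with known error variance
# (variances roughly matching the 311.96 and 83.69 quoted above).
rng = np.random.default_rng(1)
x_true = rng.normal(130.0, np.sqrt(312.0), size=500)
w_obs = x_true + rng.normal(0.0, np.sqrt(84.0), size=500)
grid = np.linspace(60.0, 200.0, 71)
f_hat = deconvolving_kde(grid, w_obs, sigma_u=np.sqrt(84.0), h=12.0)
```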

288 289
[Figure 12.6 appears here: panel titled “Parametric Deconvolution,” density estimates plotted against Framingham untransformed SBP. Figure 12.7 appears here: panel titled “Normal and Two Smooth Fits,” density estimates plotted against NHANES transformed saturated fat.]

Figure 12.6 Density estimates of untransformed SBP in Framingham. Three estimates are considered here: the best normal approximation (dashed line), the SNP fit with K = 2 (solid line) and the best fitting t-density (dotted line).

Figure 12.7 Density estimates of transformed saturated fat in NHANES. Three estimates are considered here: the best normal approximation (solid line), Gaussian deconvolution (dashed line), and Laplace deconvolution (dotted line), both with bandwidth selected by eye. There is a clear suggestion of symmetry in the data, but not much else.
Figure 12.5 shows the best-fitting normal distribution, along with the Taylex deconvoluting kernel fit (Carroll and Hall, 2004), where the bandwidth was chosen by eye to be as small as possible while retaining smoothness. The only real point of interest here is that the deconvoluting kernel fit picks up the skewness inherent in the untransformed SBP data, thus correctly suggesting that the data should be transformed. The parametric deconvolution fit using the SNP distribution is given in Figure 12.6, along with the best-fitting t-density and the best-fitting normal density. Here, the skewness we know to exist is exhibited by the lonely little mode over on the right.

12.1.10 NHANES Data

The NHANES data (Chapter 4) exhibit considerably more measurement error, and consequently deconvolution is much harder. For these data we have earlier derived the variances σ̂w² = 0.223, σ̂u² = 0.171 and σ̂x² = 0.052.

We used the same methods as for the Framingham data. Figure 12.7 gives the best-fitting normal density, along with the deconvolution density estimates for normal and Laplace errors, with bandwidths selected to be smooth but small. The sample skewness of W is nearly zero (−0.05), and this is reflected in the near symmetry of the plots.

Figure 12.8 gives the best fitting normal, the best-fitting t-density, and the SNP density with K = 2 as parametric deconvolution methods. There is a suggestion on the latter two that the data are somewhat like a t-density, here with about 5 degrees of freedom. The sample kurtosis is 3.32, where a kurtosis of 3 applies for the normal distribution. If the kurtosis of X is denoted by κ4x, then in the additive error model with normally distributed errors the kurtosis for W is given by (12.3). Substituting sample estimates of (κ4w, σx², σu², σw²) and solving for κ4x, the kurtosis for X is estimated to be approximately 9.0, indicating about 5.0 degrees of freedom since the kurtosis of a t-density is 3 + 6/(r − 4), where r is the degrees of freedom.

12.1.11 Bayesian Density Estimation by Normal Mixtures

If the reader is unfamiliar with Bayesian estimation, then Chapter 9 should be read before this section.
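Here is a small numerical check of the kurtosis calculation in Section 12.1.10 above. It assumes that equation (12.3), which is not reproduced in this excerpt, is the usual fourth-cumulant identity for an additive normal error model, (κ4w − 3)σw⁴ = (κ4x − 3)σx⁴; the numbers are the sample estimates quoted in the text.

```python
kappa_4w = 3.32                         # sample kurtosis of W
sigma2_w, sigma2_x = 0.223, 0.052       # variance estimates quoted in the text

# Solve the fourth-cumulant identity for the kurtosis of X, then convert to
# t-density degrees of freedom via kappa = 3 + 6 / (r - 4).
kappa_4x = 3.0 + (kappa_4w - 3.0) * (sigma2_w / sigma2_x) ** 2
df = 4.0 + 6.0 / (kappa_4x - 3.0)

print(round(kappa_4x, 1), round(df, 1))  # roughly 8.9 and 5.0, as reported in the text
```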

290 291
[Figure 12.8 appears here: panel titled “Parametric Deconvolution,” density estimates plotted against NHANES transformed saturated fat.]

Figure 12.8 Density estimates of transformed saturated fat in NHANES. Three estimates are considered here: the best normal approximation (dashed line), the SNP fit with K = 2 (solid line), and the best fitting t-density (dotted line).

Wasserman and Roeder (1997) proposed a Bayesian method for nonparametric density estimation. They assume that the density fX of X1, . . . , Xn is a normal mixture

fX(x) = Σ_{i=1}^k pi φ{(x − µi)/σi}/σi,   pi ≥ 0,   Σ_{i=1}^k pi = 1,   k ≤ L,

where L is a known upper bound for the number of components and φ is the Normal(0, 1) density. Since any smooth density can be closely approximated by a normal mixture, this method is nonparametric, that is, it is appropriate even if fX is not exactly a normal mixture. Wasserman and Roeder described how k, p1, . . . , pk, µ1, . . . , µk, and σ1, . . . , σk can be estimated by a Gibbs sampler. Carroll, Maca, and Ruppert (1999) showed that this Gibbs sampler can be applied when X1, . . . , Xn are observed with measurement error, that is, one observes W1, . . . , Wn where Wi = Xi + Ui and Ui is Normal(0, σu²) for a known σu; of course, in practice one substitutes an estimate for σu. The only modification needed to accommodate the measurement error is a simple idea used repeatedly in Chapter 9: During the MCMC the unobserved Xi are sampled from their full conditionals. Given these imputed values, the other steps of the Gibbs sampler are exactly the same as when the Xi are observed.

12.2 Nonparametric Regression

Nonparametric regression has become a rapidly developing field as researchers have realized that parametric regression is not suitable for adequately fitting curves to all data sets that arise in practice. Nonparametric regression entails estimating the mean of Y as a function of X,

E(Y|X = x0) = mY(x0),        (12.5)

without the imposition of mY belonging to a parametric family of functions. We focus on the local-polynomial, kernel-weighted regression and spline estimators of mY.

The most promising approach we know of for nonparametric regression with measurement error is a Bayesian methodology using splines and MCMC introduced by Berry, Carroll, and Ruppert (2002) and its frequentist counterpart, which was introduced by Ganguli, Staudenmayer, and Wand (2005). The algorithm is given also in Ruppert, Wand and Carroll (2003, Chapter 15.3) in a somewhat easier-to-digest form. We will discuss this methodology in detail in Chapter 13. In the present chapter, earlier and simpler estimators will be discussed.

12.2.1 Local-Polynomial, Kernel-Weighted Regression

When X is observable, the local, order-p polynomial estimator is β̂0(x), the solution for β0 to the weighted least squares problem minimizing

Σ_{i=1}^n {Yi − Σ_{j=0}^p βj (Xi − x)^j}² Kh(Xi − x).        (12.6)

Here h is called the bandwidth, K is a kernel function such that ∫ K(u) du = 1, and Kh(u) = h^{−1} K(u/h). The function K(·) and the bandwidth h are under the control of the investigator, and in practice the latter is the more important.

Problem (12.6) is a straightforward weighted least squares problem, and hence is easily solved numerically. The local least squares estimator of mY(x) is then

m̂Y(x, h) = β̂0(x),        (12.7)

while for j ≤ p, the estimator of the jth derivative of mY(x) is j! β̂j(x). Estimator (12.7) has had long use in time series analysis, and is a special case of the robust, local regression estimators in Cleveland (1979).
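A minimal sketch of the local linear (p = 1) special case of (12.6)–(12.7), assuming X is observed; the Epanechnikov kernel, the helper name, and the toy data are our own illustrative choices.

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local linear estimate m_hat_Y(x0, h) = beta0_hat(x0), i.e. (12.6)-(12.7)
    with p = 1 and an Epanechnikov kernel."""
    u = (x - x0) / h
    kh = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0) / h   # K_h(X_i - x0)
    X = np.column_stack([np.ones_like(x), x - x0])                    # columns for beta0, beta1
    beta = np.linalg.solve(X.T @ (X * kh[:, None]), X.T @ (kh * y))   # weighted least squares
    return beta[0]

# Toy usage on simulated data without measurement error.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=300)
y = np.sin(2.0 * x) + rng.normal(0.0, 0.2, size=300)
fit = [local_linear(x0, x, y, h=0.3) for x0 in np.linspace(-1.5, 1.5, 7)]
```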

292 293
Cleveland and Devlin (1988) discussed practical implementation and presented several interesting case studies where local regression data analysis is considerably more insightful than classic linear regression analysis. Ruppert and Wand (1994) described the asymptotic theory of these estimators. R-code is available, see http://web.maths.unsw.edu.au/∼wand.

As in parametric problems, ignoring measurement error causes inconsistent estimation of mY(x). The regression calibration and SIMEX methods of Chapters 4 and 5 provide simple means for constructing approximately consistent estimators of mY(x) in the case that W = X + U, where U has mean zero. Hastie and Stuetzle (1989) describe an alternative method for an orthogonal regression problem, wherein it is assumed that the conditional variances of Y and W given X are equal; we have already commented (Section 3.4.2) on the general lack of applicability of such an assumption.

In this section, we describe algorithms for nonparametric regression taking measurement error into account.

12.2.2 Splines

Low-degree polynomials are effective for approximating a smooth function in small regions. However, if we wish to approximate a smooth function over a large region, polynomial approximation typically does not work well. One can increase the polynomial order to gain degrees of freedom, but higher-order polynomials can be highly oscillatory, and changing the coefficients to increase the accuracy of the approximation in one location changes the polynomial globally and may decrease accuracy elsewhere. The solution to this problem is to piece together a number of low-degree polynomials. A pth degree spline with knots κ1, . . . , κK is a piecewise polynomial function s with polynomial form changing at each knot in such a way that the jth derivative of s is continuous everywhere for j ≤ p − 1. In practice, generally p is 1, 2, or 3.

There are many ways to parameterize a spline. Usually, one works with a spline basis. Given a fixed degree and knots, one can find 1 + p + K basis functions, collectively called a basis, such that any spline with this degree and knots is a linear combination of these basis functions. There are many, in fact, infinitely many, bases, and we can work with whichever one we like. One basis that is easy to understand is the truncated power function basis. The pth degree truncated power function with knot κi is (x − κi)^p_+, which is defined to be zero if x ≤ κi and (x − κi)^p if x > κi. The truncated power basis of degree p and knots κ1 < . . . < κK is {1, x, . . . , x^p, (x − κ1)^p_+, . . . , (x − κK)^p_+}, and an arbitrary spline with this degree and knots can be written as

s(x) = Σ_{k=0}^p βk x^k + Σ_{k=1}^K bk (x − κk)^p_+.        (12.8)

If a spline is fit by ordinary least squares, then the number and locations of the knots have a tremendous effect on the fit, and there is a large literature on the so-called adaptive splines where the knots are chosen to provide the most accurate estimate; see Stone, Hansen, Kooperberg, et al. (1997) and Hansen and Kooperberg (2002). An alternative to adaptive knot selection is to use a large number of knots and to place a penalty on the “roughness” of the fit. For example, cubic smoothing splines use a knot at every unique value of x and penalize the integral of the squared second derivative of the fit. This penalty reduces the curvature of the fit.

Ruppert, Wand, and Carroll (2003) introduce a simple penalty that is convenient for our purposes. They fit s(x) in (12.8) to data {(Xi, Yi)}_{i=1}^n by minimizing

Σ_{i=1}^n [Yi − {Σ_{k=0}^p βk Xi^k + Σ_{k=1}^K bk (Xi − κk)^p_+}]² + λ Σ_{k=1}^K bk².        (12.9)

The knots can be equally spaced or spaced so that there are roughly an equal number of the unique values of the Xi between each pair of adjacent knots. The number of knots is generally between 5 and 20. The exact number of knots has little effect on the fit, because the penalty prevents overfitting (Ruppert, 2002). The spline minimizing (12.9) is called a penalized spline. The smoothing parameter λ has a crucial influence on the fit and must be chosen appropriately. Data-based methods for selecting λ include generalized crossvalidation, REML, and Bayesian estimation; see Ruppert, Wand, and Carroll (2003).

12.2.3 QVF and Likelihood Models

Local polynomial nonparametric regression is easily extended to likelihood and quasilikelihood and variance function (QVF) models. The reason is that local polynomial regression can be looked at in two ways that permit immediate generalization. First, as seen in (12.6), local polynomial regression estimation of mY(x0) at a value x0 is equivalent to a weighted maximum likelihood estimate of the intercept in the model, assuming that Y is normally distributed with mean β0 + β1(X − x0), constant variance, and with the weights Kh(X − x0). Thus, in other generalized linear models (logistic, Poisson, gamma, etc.), the suggestion

294 295
is to perform a weighted likelihood analysis with a mean of the form the SIMEX estimator is smoother and has smaller mean squared error
h {β0 + β1 (X − x0 )} for some function h(·). than SIMEX with the “naive” bandwidth used by Carroll, Maca, and
Extending local polynomial nonparametric regression to QVF models Ruppert (1999).
is also routine. As seen in (12.7), local linear regression is a weighted
QVF estimate based on a model with mean β0 +β1 (X−x0 ) and constant 12.2.4.2 SIMEX Applied to Penalized Splines
variance, and with extra weighting Kh (X−x0 ). The suggestion in general
Carroll, Maca, and Ruppert (1999) also applied SIMEX to penalized
problems is to do the QVF analysis with argument β0 + β1 (X − x0 ) and
splines and found that the SIMEX/splines estimator performed quite
extra weighting Kh (X − x0 ).
similarly to SIMEX/local polynomial estimator and was inferior to their
structural spline estimator, which will be discussed in Section 12.2.6.
12.2.4 SIMEX for Nonparametric Regression
Use of SIMEX in nonparametric regression follows the same ideas as in 12.2.5 Regression Calibration
parametric problems. We require an additive error model W = X + U, In 1995, regression calibration made some sense for nonparametric re-
where U is independent of X with variance σu2 . Sometimes, a transfor- gression, since there was little in the way of a literature. Now there
mation of the original surrogate is required to achieve additivity and is more, and we do not think regression calibration should be used in
homoscedasticity. The SIMEX algorithm for nonparametric regression this context. At best, in its expanded form, it will be able to capture
is as follows: quadratic functions, but fitting quadratics is not the intent of the field.
(a) Fix values for λ ∈ Λ = (0 < λ1 < . . . < λM ).
(b) For b = 1, . . . , B, let ǫib be the non-iid pseudoerrors.
12.2.6 Structural Splines
(c) Define Wib (λ) = Wi + σu λ1/2 ǫib .
(d) For b = 1, . . . , B and λ ∈ Λ, compute the nonparametric regres- A more sophisticated application of the regression calibration idea was
sion estimate (12.7) by regressing Yi on Wib (λ). Call the resulting proposed by Carroll, Maca, and Ruppert (1999). The spline regression
estimate fb(x, b, λ). model
Xp K
X
(e) Let fb(x, λ) be the sample mean of the terms fb(x, b, λ).
Yi = βk Xki + bi (Xi − κk )p+ + ǫi , (12.10)
(f) For each x, extrapolate the values fb(x, λ) as a function of λ back k=0 i=k
to λ = −1, resulting in the SIMEX estimator fb(x). implies that
p
X K
X
12.2.4.1 SIMEX Applied to Local Polynomial Regression E(Yi |Wi ) = βk E(Xki |Wi ) + bi E{(Xi − κk )p+ |Wi }. (12.11)
An interesting problem is how best to choose the smoothing parameter. k=0 i=k

The smoothing parameter determines how one trades off smoothing bias If we can estimate each of the conditional expectations in (12.11), then
and variance; smaller bandwidths give more variance but less smooth- we can fit this equation to the Yi to estimate the parameters in (12.10).
ing bias. For the case of local polynomial regression, Carroll, Maca, and Fortunately, estimation of these conditional expectations is a straightfor-
Ruppert (1999) applied Ruppert’s (1997) EBBS (empirical bias band- ward application of Bayesian density estimation methodology of Section
width selector) method to the naive estimates in step (d), but doing 12.1.11, as will be explained in the next paragraph. Fitting (12.11) is
this resulted in final SIMEX estimators that were undersmoothed. This an example of regression calibration if we think of Xi , . . . , Xpi , (Xi −
problem was addressed by Staudenmayer and Ruppert (2004), who no- κ1 )p+ , . . . , (Xi − κK )p+ as a p + K dimensional set of covariates measured
ticed that the undersmoothing occurred because the SIMEX extrapolant with error.
used to estimate f (x) is much more variable than the naive estimators When the MCMC algorithm in Section 12.1.11 is run, each Xi is
of f (x) in step (d) to which EBBS was being applied. They developed imputed from its conditional distribution given Wi and the parameters.
an asymptotic theory for the SIMEX extrapolant so that EBBS could be Thus, if for any function g we average g(Xi ) over the imputed values of
applied to the SIMEX estimator itself, rather than to the naive estima- Xi from an MCMC sample, then we estimate E{g(Xi )|Wi }. If this is
tors fed into SIMEX. With the Staudenmayer and Ruppert bandwidth, done for g(x) equal to each of x, . . . , xp , (x − κ1 )p+ , . . . , (x − κK )p+ and

296 297
for i = 1, . . . , n, then these quantities can be used as the covariates to fit (12.11).

Since the methodology of Section 12.1.11 is based on a flexible normal mixture model for the density of X, the algorithm described in the present section was called the “structural spline” method by Carroll, Maca, and Ruppert (1999). In their simulation experiment, the structural spline estimates had substantially smaller mean squared errors than SIMEX applied to either local polynomial estimation or penalized splines. However, Berry, Carroll, and Ruppert (2002) found that a fully Bayesian model outperformed the structural spline estimator. Their fully Bayesian approach is described in Chapter 13.

12.2.7 Taylex and Other Methods

12.2.7.1 Globally Consistent Methods via Deconvolution

A globally consistent deconvoluting kernel regression function estimate can be obtained by replacing the kernel in (12.6) with a deconvoluting kernel (Fan and Truong, 1993), resulting in what we refer to as a deconvoluting kernel, local regression estimator.

However, the bandwidth selection problem associated with this approach is by no means trivial, and the rates of convergence for the resulting estimators are the same as for the density estimation problem. Moreover, in a simulation study of Carroll, Maca, and Ruppert (1999), the deconvoluting kernel estimate was applied with the ideal “oracle” bandwidth that minimized the mean squared error, and even then it was inferior to SIMEX and structural splines.

All the deconvolution methods described to this point require that the distribution of the measurement error be known, except up to a scale parameter such as a variance. Schennach (2004) developed a method that allows the measurement error distribution to be completely unknown, as long as there are replicates of W, that is, Wij = Xi + Uij for j = 1, 2. One can easily see how this can be in the most important special case of additive symmetric errors Uij, because there is a special trick that applies. In this case, Wi· = Xi + Ui· = Xi + (Ui1 + Ui2)/2, while (Wi1 − Wi2)/2 = (Ui1 − Ui2)/2. Here is the trick: Because of symmetry of the errors, the distributions of Ui2 and −Ui2 are the same, and thus the distributions of (Ui1 + Ui2)/2 and (Ui1 − Ui2)/2 are the same. This means that (Wi1 − Wi2)/2 can be used to estimate the density function of Ui·. Of course, the slow rates of convergence for deconvolution methods do not get any better once one has to estimate the error distribution.

[Figure 12.9 appears here: the control data, response versus baseline.]

Figure 12.9 The control data in the baseline change example.

12.2.7.2 Taylex

Carroll and Hall's (2004) Taylex method (for Taylor Series Expansion) is in the class of deconvolution estimators, with the deconvoluting kernel function K∗(t, h) = φ(t){1 − σu²(t² − 1)/(2h²)}. However, and crucially, they do not advertise the method as globally consistent, but merely approximately consistent for relatively small measurement error. They used the following bandwidth selection algorithm. They first fitted local linear regression with local bandwidths computed using Ruppert's (1997) EBBS methodology, using the Taylex kernel with σu² = 0. Then the mean of the EBBS bandwidths was computed, and the bandwidth actually used was the average EBBS bandwidth multiplied by 0.75. Calculations by Staudenmayer and Ruppert (2004) show that there is a bias term of order O(σu⁴) in the regression estimate, and holding this fixed suggests that the bandwidth should be of order smaller than the usual n^{−1/5}. The 0.75 correction is an ad hoc means of accomplishing this; a small illustrative sketch of this kernel is given just below.

12.3 Baseline Change Example

We analyze data originally analyzed by Berry, Carroll, and Ruppert (2002). Unfortunately, we do not have permission to discuss the study details here, or to make the data publicly available. The data have been transformed and rescaled and have had random noise added.
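As promised above, here is a minimal sketch of the Taylex deconvoluting kernel of Section 12.2.7.2 used as a drop-in replacement for the kernel in a local linear fit, as in (12.6). The bandwidth value, the function names, and the toy data are our own illustrative choices; this is not the authors' implementation, and in particular it does not include the EBBS-based bandwidth recipe described in the text.

```python
import numpy as np

def taylex_kernel(t, h, sigma_u2):
    """Taylex deconvoluting kernel K*(t, h) = phi(t){1 - sigma_u^2 (t^2 - 1) / (2 h^2)}."""
    phi = np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)
    return phi * (1.0 - sigma_u2 * (t ** 2 - 1.0) / (2.0 * h ** 2))

def taylex_local_linear(x0, w_obs, y, h, sigma_u2):
    """Local linear fit of the form (12.6), with K replaced by the Taylex kernel."""
    t = (w_obs - x0) / h
    k = taylex_kernel(t, h, sigma_u2) / h
    X = np.column_stack([np.ones_like(w_obs), w_obs - x0])
    beta = np.linalg.solve(X.T @ (X * k[:, None]), X.T @ (k * y))
    return beta[0]

# Toy usage: moderate measurement error added to a smooth curve.
rng = np.random.default_rng(3)
x = rng.uniform(-2.0, 2.0, size=400)
w = x + rng.normal(0.0, 0.3, size=400)
y = np.sin(2.0 * x) + rng.normal(0.0, 0.2, size=400)
fit = [taylex_local_linear(x0, w, y, h=0.4, sigma_u2=0.09) for x0 in np.linspace(-1.5, 1.5, 7)]
```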

298 299
[Figure 12.10 appears here: panel titled “Control Data, Naive Kernel Fits,” change from baseline versus baseline, with bandwidths h = 0.73, 1.09, 1.63 and a quadratic fit. Figure 12.11 appears here: panel titled “Control Data, SIMEX Kernel Fits,” change from baseline versus baseline, naive and SIMEX fits.]

Figure 12.10 Local quadratic kernel regression fits to the baseline change controls data, along with a quadratic fit. All methods ignore measurement error.

Figure 12.11 A naive local quadratic fit (solid line) and a SIMEX local quadratic fit (dashed line) for the baseline change controls data.

Essentially, there is a treatment group and a control group, which are ror. As is typical of kernel methods, the smallest bandwidth has the least
evaluated using a scale at baseline (W) and at the end of the study (Y). smooth character.
Smaller values of both indicate more severe disease. The scale itself is We then applied SIMEX to both the local polynomial fits and to the
subject to considerable error because it is based on a combination of quadratic fit. In Figure 12.11, we used the middle of the three band-
self-report and clinic interview. The study investigators estimate that in widths and display the naive and SIMEX fits. One can see that the
their transformed and rescaled form, the measurement error variance is “bumps,” or less colloquially the local features, in the naive fit are ex-
approximately σu2 = 0.35. aggerated somewhat in the SIMEX fit. Figure 12.12 gives the SIMEX
A preliminary Wilcoxon test applied to the observed change from base- local quadratic fit and the SIMEX global quadratic fit.
line, Y −W , indicated a highly statistically significant difference between
the two groups.
12.3.1 Discussion of the Baseline Change Controls Data
The main interest focuses on the population mean change from base-
line ∆(X) = mY (X) − X for the two groups, and most importantly the Figure 12.12 is difficult to interpret without the scientific context, which
difference between these two functions. Here we describe results only for we are not at liberty to discuss. However, we can note that the people
the control group. The data are given in Figure 12.9. in the study are controls, that is, not given a treatment. The higher the
Preliminary nonparametric regression analysis of the data ignoring change from baseline, the more the patient has improved by the end of
measurement error indicates possible nonlinearity in the data: A quadratic the study. Since these are untreated patients, what this figure shows is a
regression is marginally statistically significant (p ≈ 0.03). When we cor- placebo effect, sometimes a rather strong one. Both the local quadratic
rected the quadratic fits for measurement error (Cheng and Schneeweiss, SIMEX fit and the global quadratic SIMEX fit suggest that the placebo
1998) and bootstrapped the resulting parameter estimates, both p-values effect is confined away from those doing best or worst at baseline. In the
exceeded 0.20, although the fitted functions had substantial curvature. context of the actual study, this is not so far-fetched. In contrast, the
In Figure 12.10, we plot local quadratic kernel estimates with three band- naive fits (Figure 12.10) suggest that those who are doing the worst at
widths, along with a quadratic regression, all ignoring measurement er- baseline (lowest values) have a strong placebo effect.

300 301
[Figure 12.12 appears here: panel titled “Control Data, SIMEX Fits,” kernel and quadratic SIMEX fits, change from baseline versus baseline.]

Figure 12.12 A SIMEX local quadratic fit (solid line) and SIMEX global
quadratic fit (dashed line) to the baseline change controls data.

Bibliographic Notes
The early work of Stefanski and Carroll (1986, 1990c), Carroll and Hall
(1988) and Liu and Taylor (1989) has since spawned a considerable lit-
erature on deconvolution; see for example Liu and Taylor (1990), Zhang
(1990), Fan (1991a,b,c; 1992a), Fan, et al. (1991), Masry and Rice (1992),
Fan and Truong (1993), Fan and Masry (1993), and Stefanski (1989,
1990). An interesting econometric application using a modification of
these methods is discussed by Horowitz and Markatou (1993). There
continues to be interest in the problem of deconvolution (Taupin, 2001;
Delaigle and Gijbels, 2004ab, 2005, 2006).
There have been a number of monographs on nonparametric regression
(for example, Müller, 1988; Härdle, 1990; Hastie and Tibshirani, 1990;
Fan and Gijbels, 1996; Bowman and Azzalini, 1997; and Ruppert, Wand,
and Carroll, 2003), where it is shown that nonparametric regression tech-
niques have much to offer in applications.

302
CHAPTER 13

SEMIPARAMETRIC
REGRESSION

13.1 Overview
Semiparametric models combine parametric and nonparametric submod-
els. For example, in a semiparametric regression model the effects of
some, but not all, covariates are modeled nonparametrically. In this chap-
ter we will take a Bayesian viewpoint, mostly for the pragmatic reason
that we have found that a Bayesian analysis is easiest to implement.
This chapter considers three main topics. In the first, primary topic,
the effect of X on the response is modeled nonparametrically and the
effect of Z is modeled parametrically. The second topic considered is
the opposite: the effect of X on the response is modeled parametrically
and the effect of Z is modeled nonparametrically. Finally, the third topic
considers the case that both X and Z are modeled parametrically by a
nonlinear form, but the distribution of X is treated nonparametrically.

13.2 Additive Models


For concreteness, we will focus on the additive model where Zi is a vector
of observed covariates with linear effects and Xi is an unobserved scalar
covariate with an effect of unknown form. Thus, we will use the model
Yi = s(Xi ) + βzt Zi + ǫi , (13.1)
where s is smooth, but otherwise unknown. Model (13.1) is additive
because the effects of Xi and the components of Zi are added together;
there are no interaction terms that would be functions of two or more
covariates. We will use a spline model given by (12.8) and repeated here
as
s(x) = Σ_{k=0}^p βk x^k + Σ_{k=1}^K bk (x − κk)^p_+.        (13.2)

As discussed in Section 12.2.2, κ1 < . . . < κK are knots and (x − κi )p+


is defined to be zero if x ≤ κi and (x − κi )p if x > κi . The degree p is
typically 1, 2, or 3 and there are between five and 20 knots. The knots

303
are usually at “equally spaced” quantiles of the Wi , for example, at the prior variance σb2 , bx is Normal(0, σb2 I), and σb2 is IG(δb,1 , δb,2 ), where IG
deciles if K = 9. stands for the inverse-Gamma distribution defined in Section A.3. Using
Model (13.1) is one of many spline models discussed in detail in Rup- the [ · ] notation described in the Guide to Notation, we can write this
pert, Wand, and Carroll (2003), which deals mostly with covariates with- prior as
out error but has a chapter on measurement error. Readers not familiar
[bz |σb2 ] = Normal(0, σb2 I), and [σb2 ] = IG(δb,1 , δb,2 ). (13.7)
with semiparametric regression may wish to consult that reference. Once
the analysis of (13.1) is understood, it can be extended to many other As discussed in Section 9.4, for the prior to be noninformative the values
models, for example, ones where the effects of some components of Zi of δb,1 and δb,2 should be small. For δb,1 , “small” means small relative
are also modeled via splines or where Xi has several components. to the sample size. Since δb,2 is a prior guess about the variance of the
bk , and since spline coefficients are often quite small, it is crucial that
δb,2 be sufficiently small. Put differently, δb,2 should be small relative
13.3 MCMC for Additive Spline Models
to typical values of the b2k . Depending on the application, δb,2 = 10−8
As seen in Chapter 9, sampling from the posterior for a regression model or even smaller may be necessary to prevent overfitting, which in this
with covariate measurement error is nearly identical to sampling from the context is the same as undersmoothing. The reason for this is that the
same model without covariate measurement error. The main difference spline coefficients are typically very small, though how small depends
is that there is an extra step where the unknown values of the Xi for upon the number of knots and the scaling of X (Crainiceanu, Ruppert,
nonvalidation data are imputed. and Wand, 2005).
By combining (13.1) and (13.2) we arrive at the model Besides the prior on the regression coefficients, we need a prior on σǫ2 ,
p K the variance of ǫ1 , . . . , ǫn . We also need an error model for W|X or a
X X
Yi = βk Xki + bk (Xi − κk )p+ + βzt Zi + ǫi . (13.3) Berkson model for X|W, and, in the case of an error model, a structural
k=0 k=1
“exposure” model for X|W. Of course, we need priors on these models
as well. Each of these models and priors is chosen in the same manner
Clearly, (13.3) is a linear model, albeit a somewhat complicated one. as in Chapter 9.
In Section 9.4, we used nearly flat, that is, noninformative priors, for The full conditionals are the same as in Section 9.4 for error models
the regression coefficients of a linear model. This strategy will not work or Section 9.7 for Berkson models, except for σb2 which did not appear
well for model (13.3), because of the large number of coefficients. What is in models studied in Chapter 9. It is not difficult to show that the full
needed is a method for shrinking the spline coefficients b1 , . . . , bK toward conditional for σb2 is
zero to prevent overfitting. This shrinkage can be accomplished by using
K
X
a more informative prior for b1 , . . . , bK while continuing to use a flat
prior for the other coefficients. To separate the two types of regression f (σb2 |others) = IG{δb,1 + K/2, δb,2 + (1/2) b2k }. (13.8)
coefficients, we will rewrite (13.3) as k=1

t Here again, we see that δb,2 will dominate the posterior if it is large
Yi = βzx m1 (Xi , Zi ) + btx m2 (Xi ) + ǫi , (13.4) PK
relative to k=1 b2k .
where
m1 (Xi , Zi ) = (1, Xi , . . . , Xpi , Zti ), 13.4 Monte Carlo EM-Algorithm
m2 (Xi ) = {(Xi − κ1 )p+ , . . . , (Xi − κK )p+ }, (13.5) Ganguli, Staudenmayer, and Wand (2005) developed a Monte Carlo EM-
βzx = (β0 , . . . , βp , βzt )t , algorithm for the structural nonparametric regression problem, in the
and bx = (b1 , . . . , bK )t . model
Xp K
X
We will use the prior Yi = βk Xki + bk (Xi − κk )p+ + ǫi . (13.9)
k=0 k=1
[βzx ] = Normal(0, σβ2 I) (13.6)
Their algorithm is relatively simple, although like the Bayesian approach
where σβ2 is “large,” say 106 . The prior for bx is hierarchical: given a in Section 13.3, a Metropolis–Hastings step is required; see Section 9.3.

304 305
We refer to this approach as the GSW-EM (Ganguli, Staudenmayer, Simulated Data, Classical Errors
1.5
Wand-EM) algorithm. The algorithm is given also in Ruppert et al.
(2003, Chapter 15.3), although their σv2 is our σu2 and their σu2 is our σb2 . 1

0.5
13.4.1 Starting Values
The GSW-EM algorithm starts with estimates of µx = E(X), σx2 = 0

var(X) and E(X|W); see, for example, Section 4.4 for the regression

Y
calibration calculations. Define V to be the n × (p + 1) matrix with ith −0.5

row given as (1, Xi , . . . , Xpi ), and define Z to be the n × K matrix with


−1
ith row m2 (Xi ) given by (13.5). Let Y, X , W, and E be the n × 1 vectors
with ith element Yi , Wi , Xi and ǫi , respectively. Then the model can Actual Data
−1.5
be written as True Function

Y = Vβx + Zbx + E, −2
−5 0 5
X
where cov(bx ) = σb2 I
and cov(E) = σǫ2 I.
This is a standard linear
mixed model, and replacing Xi with its regression calibration estimate
Figure 13.1 Classical error simulation with sine function. Solid line: true func-
E(Xi |Wi ), REML is used to obtain starting values for (βx , bx , σb2 , σǫ2 ).
tion. The true X-observations are plotted: note how the true function is readily
estimated if X were observable.
13.4.2 Metropolis–Hastings Fact
The density of (X1 , ..., Xn ) given bx , Y, W is proportional to (3) Define
½ ¾ · ¸
1 1 1
exp − 2 kY −Vβx −Zbx k2 − 2 kV −µx 1k2 − 2 kW −Vk2 . (13.10) V tV V tZ
2σǫ 2σx 2σu P = .
Z tV Z t Z + (σǫ2 /σb2 )I
Here k · k is the Euclidean norm; see the Guide to Notation. As a re-
The value of P is unknown and must be imputed, because we do not
sult, generating conditional quasi-random variates from X ’s conditional
observe V.
distribution can be done using the Metropolis–Hastings algorithm. This
fact is also used by Berry, Carroll, and Ruppert (2002).
(4) Define C = [V, Z]. With the results from step 2, compute Monte Carlo
estimates of the conditional expectations of P , C t Y, and C t C given
13.4.3 The Algorithm (Y, W, bx ). We denote estimates of these quantities by Pb, Cd d
t Y and C t C.
b
For example, P is computed by defining Vj to be the same as V except
We now specify the complete algorithm:
that the true X-values are replaced by their j th Metropolis–Hastings
Pm
(1) Set t = 0. generated value. Then an upper-left block of Pb is m−1 j=1 Vjt Vj ,
P m
(2) Use the Metropolis–Hastings algorithm, applied one element at a and upper-right block is m−1 j=1 Vjt Z, and so forth.
time, to draw m samples from the distribution of Xi given (bx , Y, W)
evaluated at the current estimates of (βx , bx , σb2 , σǫ2 ). Call these sam- (5) Holding the estimates from the previous step fixed, run several it-
ples (X1i , ..., Xmi ). This is the most time-consuming step in the EM erations of an update scheme. For instance, using the standard EM
algorithm. The choice of the number of samples m is a problem that algorithm to compute REML estimates (for example Dempster, Ru-
remains difficult to solve. In their calculations, Ganguli, Stauden- bin, and Tsutakawa, 1981), the nested updates are as follows. First,
mayer, and Wand started with m = 50 and increased m by 10 for update βx and bx as {βxt , btx }t = Pb−1 Cd t Y. Then, update σ 2 as σ 2 =
b b
each interaction of the EM algorithm, to a maximum of 500. K −1 {btx bx + σǫ2 trace(Pb−1 ). Finally, update σǫ2 as follows. Define D =

306 307
{βxt , btx }t and let the current value be σǫ,curr
2
. Then Simulated Data, Classical Errors
1

σǫ2 = n−1 {Y t Y − 2(Cd d


t Y)t D + D t C t CD} + n−1 σ 2 dt b −1 ).
ǫ,curr trace(C C P
0.8 True
Naive
0.6
GSW
2
(6) Update µx and σx from their current µx,curr and σx,curr using stan- 0.4
Bayes
dard point estimates based on the Monte Carlo data from Step 2.
0.2
Specifically, µx is the mean across all the values of components of
(X1i , ..., Xmi ). Also, σx2 is the mean across all components of (Xji −

Y
0

µx,curr )2 . −0.2

−0.4
We terminate the algorithm after plots of the current estimates of the
regression function appear to stabilize −0.6

−0.8

Simulated Data, Classical Errors −1


−4 −3 −2 −1 0 1 2 3 4
1.5 X

1 Figure 13.3 Classical error simulation with sine function. Solid line: true func-
tion. Dashed line: naive spline estimate that ignores measurement error. Dot-
dashed line: measurement error spline fit by EM-algorithm of Ganguli, et al.
0.5 (2005). Dotted line: 5-knot Bayesian fit.

0 13.5 Simulation with Classical Errors


Y

We simulated data with replicated classical errors, Wij = Xi + Uij ,


−0.5 j = 1, 2, and a sinusoidal regression function Yi = sin(Xi ) + ǫi . In
the simulation Xi = Normal(0.0, 4.0), Ui = Normal(0.0, 4.0), ǫi =
Normal{0, (0.2)2 }, and n = 200. The actual data and the true regression
−1
are given in Figure 13.1, where the true function is apparent.
To see what is going on, we first plot the observed data (W i , Yi ).
−1.5 Once again we see the double-whammy of classical measurement error
described in Section 1.1: in contrast to the true data in Figure 13.1, the
observed data are clearly more variable about any version of a line, hence
−2 the loss of power, and the true function is now no longer apparent, hence
−6 −4 −2 0 2 4 6
the loss of features. The idea that one can recreate the true line given in
Observed Predictor Figure 13.1 from the observed data in Figure 13.2 is daunting.
In Figure 13.3 we see a plot of a Bayes estimate, the true regression
Figure 13.2 Classical error simulation with sine function. Figure 13.1 gives function, a naive fit and the EM-algorithm fit of Ganguli, et al. (2005)
the true-X data along with the observed responses, where the sine function is defined in Section 13.4. The Bayes estimator and the naive fit used five
obvious. Here we plot the responses Y against the observed values W of the knots at quantiles of the W. The Bayes estimator that had σx2 and σǫ2
mean of the two mismeasured covariates. Note the lack of features in the data, had IG(1, 1) priors, µx and the β’s had Normal(0, 100) priors, and σb2
where the sine function is masked.
had an IG(0.01, 100) prior.
The naive fit was a penalized spline fit of W i to Yi and shows clear

308 309
Simulated Data, Classical Errors function is quite obvious, much more so than in the plot of Yi against
1.5
Imputed Wi .
True
1
1.5

0.5

1
0
Y

−0.5 0.5

−1
0
−1.5

−0.5
−2
−5 0 5
Imputed X w,y
Bayes, 1/4
−1
Bayes, 1
Figure 13.4 A plot of the imputed Xi versus Yi . The imputed Xi are from the Bayes, 4
sin(5*x)
from the last iteration of the Gibbs sampler. Notice the sinusoidal pattern of −1.5
the true regression function (solid curve labelled “true”) is quite evident here, −1.5 −1 −0.5 0 0.5 1 1.5
although it was largely hidden in the plot of W i versus Yi in Figure 13.2.
Figure 13.5 Bayes estimates with Berkson errors. “Bayes, C” is the Bayes
estimate for a given C.
bias, both being attenuated toward zeros and having its peaks somewhat
mislocated. The Bayes fit, and the EM-algorithm, can virtually recreate
the correct function, somewhat remarkable in light of the observed data
13.6 Simulation with Berkson Errors
in Figure 13.2.
The Bayes estimate is the mean of 5,000 iterations of a Gibbs sampler, Berkson errors were simulated with Wi equally spaced on (−1, 1), Xi =
after a burn-in of 5,000 iterations. Because a spline is a linear model, but Wi + Ui , where Ui = Normal{0, (0.2)2 }, Yi = sin(5Xi ) + ǫi with ǫi =
with a hierarchical prior for the regression coefficients given by (13.6) Normal{0, (0.2)2 }, and n = 300.
and (13.7), the Gibbs sampler was the same as used for linear regression In the case of classical errors, we used knots at quantiles of the Wi .
with classical errors in Section 9.4, except that there was an extra step Since the Wi are more dispersed than the Xi when the errors are clas-
for sampling σb2 using (13.8). sical, the knots should cover the range of the Xi . In the case of Berkson
The Gibbs sampler used the naive penalized spline fit to get starting errors, this might not be true. Therefore, we simulated a set of Xi by
values for the parameters. The starting value for Xi was Wi . The spline adding Normal(0, σ̃u2 ) random variates to the Wi . Here, σ̃u2 is the mean
was of degree p = 2 with five knots placed so that there was approxi- of the prior on σu2 . The knots were at quantiles of these simulated Xi .
mately an equal number of Wi between each adjacent pair. The knot There were 10 knots.
locations are not considered to be unknown parameters and the knots The prior on σu2 was IG(1/2, Cσu2 /2) where C was 1/4, 1, and 4 cor-
were kept fixed during the Gibbs sampler. responding to prior guesses of σu2 equal to 1/4, 1, or 4 times the true
The power of the Gibbs sampler is that it uses both Yi , Wi and value. Since the prior effective sample size is 1, the value of C should
knowledge of the regression function to impute Xi . The result is that not be too important. The Bayes estimates of s(x) = sin(5x) are shown
often Xi is estimated with remarkable accuracy. This accuracy can be in Figure 13.5. In Figure 13.6 we see the naive spline fit of Y to W, the
seen in Figure 13.4, which is a plot of Yi against the imputed Xi from ideal spline fit of Y to X, the Bayes estimator with C = 1, and the true
the last iteration of the Gibbs sampler. The shape of the true regression curve.

310 311
1.5
0.2

σ2
u
1 C=1/4
0.1
0 500 1000 1500 2000 2500 3000
iteration
0.5

0.2

σ2
u
0 C=1
0.1
0 500 1000 1500 2000 2500 3000
iteration
−0.5

w,y 0.2

σ2
u
fit x,y
−1
fit w,y C=4
Bayes 0.1
0 500 1000 1500 2000 2500 3000
sin(5*x) iteration
−1.5
−1.5 −1 −0.5 0 0.5 1 1.5
Figure 13.7 Berkson errors. Estimates of σu2 .
Figure 13.6 Berkson errors. True curve (sin(5x)), Bayes estimate with C =1
(Bayes), ideal spline fit using true covariates (x-y), and naive spline fit (w-y).

measurement errors. An infeasible estimator is


In Figure 13.7 we see the trace plots for σu for the three values of n
X
C. One can see that σu can be rather accurately estimated despite the βbinfeas = [n−1 {Wi − E(W|Zi )}{Wi − E(W|Zi )} − Σtuu ]−1
lack of replication. It is interesting that σu is so well estimated here. In i=1
linear regression, σu is not identified. For some nonlinear models, such n
X
−1
as segmented binary regression, σu is theoretically identified but there ×n {Wi − E(W|Zi )}{Yi − E(Y|Zi )}. (13.11)
is so little information about σu in the data that, for practical purposes, i=1
it is not identified; see the Munich bronchitis example in Section 9.7.3.
Here, we have two things going for us when we estimate σu : Liang et al. simply replace E(W|Z) and E(Y|Z) with any convenient
nonparametric regression, which is feasible because these are all observed
• The response is continuous not binary. quantities. They used kernels because it is easy to prove results for ker-
• The true curve is very nonlinear. nels, but splines, etc., can be used as well.
Liang and Wang (2005) considered the partially linear single index
model where now Z is multivariate and
13.7 Semiparametrics: X Modeled Parametrically
E(Y|X, Z) = Xt βx + θ(Zt βz ).
The other variant of semiparametric regression is when the effect of
X on Y is modeled parametrically but at least a component of Z is This is a harder problem, and they developed a clever two-step approach
modelled nonparametrically. For example, Liang, Härdle, and Carroll wherein they first estimated βx without having to worry about βz , and
(1999) considered the partially linear model then updated to get an estimate of βz . Specifically, Y − E(Y|Z) = {X −
E(Y|X, Z) = Xt βx + θ(Z), E(X|Z)}t βx , which is a partially linear model as described above, so that
βx is easily estimated. They then noted that E(Y|Z) − E(Xt |Z)βx =
where θ(·) is an unknown function. Similar to (13.12), the approach is θ(Zt βz ), which is a standard single-index model, for which a host of
a correction for attenuation. Let Σuu be the covariance matrix of the solutions are known.

312 313
13.8 Parametric Models: No Assumptions on X minimizing
n
X
There has been a great deal of recent activity about parametric response
models that attempt to correct for measurement error under minimal ω(Wi ){Yi − m(Wi , B, fbX , fbW , fW |X )}2 .
i=1
assumptions about the latent variable X or the measurement error when
observing W = X+U. In this section, we try to summarize this research This line of attack has been taken, for example, by Taupin (2001). She
concisely, much of it being theoretical in nature. It is probably too soon allowed the unknown mean function f (X, B) to have different amounts of
to tell how this recent literature will affect the practice in the area, smoothness, and showed that if it is smooth in any standard sense, then
because some of the methods are computationally complex or rely on the algorithm will produce estimates of B that have parametric rates of
the use of characteristic functions. convergence, although standard error estimates were not obtained.
Before 1995, there was only one set of literature available that yielded
globally consistent estimation, namely, the conditional and corrected 13.8.1.2 Partially Linear Models
methods described in Chapter 7, and that only for special cases, such Liang (2000) used deconvolution methods in the partially linear model
as linear and logistic regression and under strong parametric assump- where E(Y|X, Z) = Zt β + θ(X), where θ(·) is an unknown function.
tions about the measurement error distribution. The newest literature This is the same model as (13.1) that we analyzed in Sections 13.2 and
expands the models that can be considered. When replications of W are 13.3 using splines. Liang started with the infeasible “estimator”
available, assumptions about the error distribution may also sometimes " #−1
n
X
be avoided.
βbinfeas = n−1
{Zi − E(Z|Xi )}{Zi − E(Z|Xi )} t

i=1
13.8.1 Deconvolution Methods X n
×n−1 {Zi − E(Z|Xi )}{Yi − E(Y|Xi )}. (13.12)
13.8.1.1 Least Squares Methods
i=1
Suppose that the mean of Y given X is given as mY (X, B). Remember He then replaced E(Z|Xi ) and E(Y|Xi ) by deconvoluting kernel regres-
that fX (·) is the unknown density function of X, fW|X (·) is the density sion estimators. For normally distributed measurement errors, he derived
function of W given X, and fW (·) is the density function of W. Then, a limiting normal distribution for the resulting estimate of β, under the
as described in (12.1) in Section 12.1, the mean of Y given an observed restriction that Z and X were independent.
W is Zhu and Cui (2003) considered the same model, except that they al-
Z
1 lowed Z to also be observed with error independent of the error in X.
mY|W (W, B, fX , fW , fW |X ) = mY (x, B)fX (x)fW |X (W|x)dx.
fW (W) They doid not require that X and Z be independent. Their method is
effectively a correction for attenuation version of the infeasible estimate.
The obvious approach then is to do some form of least squares. Let ω(W)
be a weight function. Then, assuming that all the density functions are
known, one way to estimate B is to minimize 13.8.2 Models Linear in Functions of X
n
X Schennach (2004a) considered models of the form
ω(Wi ){Yi − mY|W (Wi , B, fX , fW , fW |X )}2 .
i=1 E(Y|Z, X) = Zt βz + h(X)βx ,
To implement this idea, one has to produce estimates for the density She noted that B = (βzt , βxt )t can be recovered as
functions (fX , fW , fW |X ). For example, suppose that W = X + U, · ¸−1
where U is normally distributed with mean zero and known variance E(ZZt ) E{Zht (X)}
B= [E(Zt Y), E{ht (X)Y}]t .
σu2 , so that fW |X (w|x) = σu−1 φ{(w − x)/σu }, where φ(·) is the standard E{h(X)Zt } E{h(X)ht (X)}
normal density function. Let fbX be the deconvoluting kernel density Her work is striking because the measurement error distribution need
estimator defined in Section 12.1, let fbW be the corresponding regular not be specified. Indeed, she allowed for a second measurement, T =
kernel density estimator. Then it would be tempting to estimate B by X + V, where V has mean zero and is independent of everything else

314 315
(her conditions are actually somewhat weaker than this). No assumptions 13.8.4 Doubly Robust Parametric Modeling
are made about the distribution of either the measurement error U or
The approaches outlined in this section have been more or less ad hoc,
the measurement error V. She showed how to go about estimating the
some of them with fairly daunting computational issues.
moments necessary to identify B, and derived the limiting distribution
Tsiatis and Ma (2004) avoided the ad hoc nature of the approach
of her estimator.
by developing a modern semiparametric approach to the problem. They
While Schennach’s approach uses characteristic function, this is not
started with a full parametric model for Y given (X, Z) in terms of a
deconvolution in any sense. However, Schennach (2004b) used much the
parameter B, and they assumed that the distribution of W given X is
same approach to develop a deconvoluting kernel density estimator with-
completely known, or is known except for a parameter. Their methodol-
out assumptions about the distribution of the measurement errors; see
ogy has the following features:
also Section 12.2.7.
• They first specify a candidate distribution for X, which may or may
not be correct.
13.8.3 Linear Logistic Regression with Replicates • Their method, which is not maximum likelihood for this candidate
distribution, provides a consistent estimate of the parameter B, no
Huang and Wang (2001) considered linear logistic regression when there matter what the actual distribution of X is.
are replicated error prone measures, so that Wij = Xi + Uij is observed
for j = 1, 2. Although we believe that simple regression calibration is • Their method is also the most efficient approach among all methods
generally quite well suited to this problem, at least for the purpose of that are consistent in the above sense. Thus, for example, for canonical
having a complete asymptotic theory, there is a need to have consistent, exponential families with normally distributed measurement error,
and not just approximately consistent, estimation procedures. Of course, their method reduces to the conditional score method described in
if the measurement errors are normally distributed, we have already Section 7.2.2.
described such a methodology; see Chapter 7. Tsiatis and Ma applied their methodology to the logistic model that is
Huang and Wang made no assumptions about the distribution of the quadratic in X, with impressive improvement in bias reduction compared
measurement errors or about the distribution of X. Their notation is even to their version of regression calibration (similar but not the same
somewhat difficult to decipher (sometimes too much generality detracts as the one proposed in this book), much less the naive method.
from otherwise excellent papers), but in the case of two replicates, works The methodology can be briefly summarized as follows. Suppose that

like this. Define a density function fX|Z (x|z) has been hypothesized as the density for
X X given Z. Let S(Y, X, Z, B) be the score function if X were observable,
Ψi1 (β0 , B) = {Yi − 1 + Yi exp(−β0 − Zti βz − Wij t t t
βx )}(1, Wik ); that is the derivative of the loglikelihood function. Define S ∗ (Y, W, Z, B) =
j6=k
X E ∗ {S(Y, X, Z, B)|Y, W, Z}, where the superscript means that the ex-
Ψi2 (β0 , B) = {Yi + (Yi − 1)exp(β0 + Zti βz + Wij
t t t
βx )}(1, Wik ). pectation is taken with respect to the hypothesized model. Then there is
j6=k a function a(X, Z) with the property that it solves the integral equation
Pn E{S ∗ (Y, W, Z, B)|X, Z} = E [E ∗ {a(X, Z)|Y, W, Z}|X, Z] . (13.13)
For j = 1, 2, let βb0j and Bbj be a solution to 0 = i=1 Ψij (β0 , B),
b be the sample covariance matrix of the terms {Ψi1 (βb01 , Bb1 )
and let Λ Further define
and {Ψi2 (βb02 , Bb2 ) when combined. Then estimate B by minimizing the Seff (Y, W, Z, B) = S ∗ (Y, W, Z, B) − E ∗ {a(X, Z)|Y, W, Z}.
quadratic form
Then,
P Tsiatis and Ma proposed to estimate B by solving the equation
· Pn ¸t · Pn ¸
Ψi1 (β01 , B) b Pi=1 Ψi1 (β01 , B) 0 = i Seff (Yi , Wi , Zi , B).
Pi=1
n Λ n . Theoretically, this approach has a great deal to be said for it. One can
i=1 Ψi1 (β02 , B) i=1 Ψi1 (β02 , B)
use best guesses to get something near a maximum likelihood solution,
Huang and Wang derived the limiting distribution of the estimator and but still have robustness against specifying the model for X incorrectly.
also allowed for the possibility that there are more than two replicates. The practical difficulties are the following:

316 317
• First, one has to be able to compute S ∗ (Y, W, Z, B), which requires
numerical integration, but then of course so too does maximum like-
lihood.
• More of an issue is actually solving the integral equation (13.13). In
their Section 4.2, Tsiatis and Ma (2004) came up with an approximate
solution. Specifically, they discretized X, stating that it takes on a
finite number of values, and then they specified the probability of
these values given X. In their example, they allowed X to take on
15 different values, and made the probabilities of X given Z to be
proportional to the density function of X given Z at these 15 values.
They then solved (13.13) in this discrete setting, which is little more
than solving linear equations with somewhat messy input arguments.
The Tsiatis and Ma methodology has considerable potential. How-
ever, more numerical work will be needed in order to understand how to
discretize, and more important, it would be very useful if multipurpose
software could be developed.

Bibliographic Notes
Berry, Carroll, and Ruppert (2002) developed the Bayesian spline method-
ology for nonparametric regression with covariate measurement error
that is the basis for this chapter. Mallick, Hoffman, and Carroll (2002)
developed semiparametric methods for Berkson errors in the Nevada
Test Site example, although in their case they knew the variance of the
Berkson errors. The application of this regression spline methodology to
Berkson errors with unknown Berkson error variance is new.
It has long been known that, in parametric problems, the Berkson
error variance in the model X = W + U is identified if the true re-
gression function is not linear; see, for example, Rudemo, Ruppert, and
Streibig (1989) in Section 4.7.3. These results apply to our approach,
which are flexible and nonlinear parametric methods, hence semipara-
metric. In purely nonparametric regression problems, identifiability of
the measurement error variance and hence of the regression function is
harder. Delaigle et al. (2006) pointed out that with Berkson errors, if
the true regression function is mY (·), then what we estimate is γ(W) =
E{mY (W+U)}, and identifiability of the true regression function means
we need to be able to recover it from γ(·). This can be tricky, since in
Berkson models X is more variable than W. An interesting theoretical
issue is whether one can hope to recover the true function mY (·) beyond
the range of the observed W-values. At first, this seems difficult, if not
impossible, but simulation results suggest otherwise, at least partly; see
Figure 13.5, where the estimates follow the true function beyond the
W-values.

318
CHAPTER 14

SURVIVAL DATA

Survival analysis has developed from the analysis of life tables in actu-
arial sciences and has enjoyed remarkable success with modern applica-
tions in medicine, epidemiology, and the social sciences. The popularity
of survival analysis models, such as the Cox proportional hazards model,
is probably surpassed only by the popularity of standard linear regres-
sion models.
Survival data are the product of a continuous death process coupled
with a censoring mechanism. Typically, the death rate depends on a
number of factors, and time to death is only partially observed for those
subjects with censored observations. Standard analyses of survival data
assume that all covariates affecting survival rates are observed without
error. However, in many applications some of the covariates are subject
to measurement error or are available without error only for a subsample
of the population.

14.1 Notation and Assumptions


Like most research areas in statistics, survival analysis has several stan-
dard sets of notations. In this chapter, we will follow notation intro-
duced by Miller (1998). Assume that n subjects are observed over time
and their failure times T1 , . . . , Tn are subject to right censoring and
C1 , . . . , Cn are the corresponding censoring times. Let
δi = I(Ti < Ci )
be the failure indicator and
Yi = min(Ti , Ci )
be the time to failure or censoring for subject i. Denote by
Ri = {j : Yj ≥ Yi }, (14.1)
the risk set when the event corresponding to subject i occurs. Ri is the
index set for those subjects who have not failed and are uncensored at
the time the ith subject fails or is censored. The at-risk indicator process
for the ith subject is defined as
Yi (t) = I(Yi ≥ t).
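To fix ideas, here is a small numerical illustration of this notation; the simulated failure and censoring times, and all names in the code, are our own and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
T = rng.exponential(2.0, size=n)        # failure times T_1, ..., T_n
C = rng.exponential(3.0, size=n)        # censoring times C_1, ..., C_n
Y = np.minimum(T, C)                    # observed time, Y_i = min(T_i, C_i)
delta = (T < C).astype(int)             # failure indicator, delta_i = I(T_i < C_i)

# Risk set R_i = {j : Y_j >= Y_i}, as in (14.1): subjects still under
# observation when subject i fails or is censored.
risk_sets = [np.flatnonzero(Y >= Y[i]) for i in range(n)]

def at_risk(i, t):
    """At-risk indicator process Y_i(t) = I(Y_i >= t)."""
    return int(Y[i] >= t)

print(np.round(Y, 2), delta, [r.tolist() for r in risk_sets])
```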

319
We assume that the survival probability for each subject depends on model as
covariates that are subject to measurement error, Xi , as well as on co- λ (t|Z, W) = E [λ (t|X, Z) |T ≥ t, Z, W]
variates that are not, Zi . The covariate Xi is measured through the usual (14.6)
classical measurement error model = λ0 (t)exp (βzt Z) E {exp(βxt X|T ≥ t, Z, W)} .
Wi = Xi + Ui , (14.2) As shown by Prentice (1982) and by Pepe, Self, and Prentice (1989),
where the distribution of Ui is known or estimable. We also assume that the difficulty is that the conditional expectation in (14.6) for the observed
(Ti , Xi , Ci , Ui ) are iid random vectors, Ci is independent of (Ti , Xi ), data depends upon the unknown baseline hazard function λ0 . This de-
and Ui is independent of (Ti , Xi , Ci ). The observed data are the vectors pendence is due to the conditioning on (T ≥ t). The induced hazard
(Yi , δi , Wi , Zi ), where (Yi , δi ) is a proxy observation for (Ti , Ci ) and function does not factor into a product of an arbitrary baseline hazard
Wi is a proxy observation for Xi . and a term that depends only on observed data and an unknown param-
eter, and the methodology for proportional hazards regression cannot be
The distribution of the failure time, Ti , is completely described by
applied without modification.
the hazard rate
An important simplification occurs when the failure events are rare,
P (t < Ti < t + dt|Xi , Zi ) that is, when the probability of survival beyond time t, P (T ≥ t), is
λi (t|Xi , Zi ) = lim
dt↓0 dtP (Ti > t|Xi , Zi ) close to 1. The rare-event assumption implies that the hazard (14.6) of
the observed data can be approximated by
and can be interpreted as the instantaneous risk that the time Ti of ¡ ¢ © ª
an event equals t conditional on no events for subject i prior to time t. λ∗ (t|Z, W) = λ0 (t)exp βzt Z E exp(βxt X|Z, W) . (14.7)
The proportional hazards model introduced by Cox (1972) is the most
commonly used model for the hazard rate and assumes that A special case that leads directly to regression calibration is when
X given Z and W is normally distributed with mean m(Z, W, γ) and
λi (t|Xi , Zi ) = λ0 (t)exp(βxt Xi + βzt Zi ), (14.3) with constant covariance matrix Σ. In this case the approximate hazard
function is, from (14.7),
where λ0 (·) is an unspecified baseline hazard function that does not de- © ª
pend on the λ∗ (t|Z, W) = λ∗0 (t)exp βxt m (Z, W, γ) ,
R tcovariate values. The baseline cumulative hazard function
is Λ0 (t) = 0 λ0 (s)ds. In the standard regression case when Xi are ob-
where λ∗0 (t) = λ0 (t)exp(0.5βxt Σβx ), which is still arbitrary since λ0 is
served, Cox (1972) suggested that inference on βx and βz be based on
arbitrary.
the log partial likelihood function
  
X n X  14.3 Regression Calibration for Survival Analysis
l(βx , βz ) = δi βxt Xi + βzt Zi − log exp(βxt Xj + βzt Zj )  ,
  One of the first applications of regression calibration was proposed by
i=1 j∈Ri
(14.4) Prentice (1982) for estimating the parameters in a Cox model. The idea
which does not depend on λ0 (·). An alternative strategy is to use the log of regression calibration is to replace the covariate of interest by its
of the full likelihood of the model (14.3) conditional mean E(X|Z, W) = m (Z, W, γ) and is a first-order bias-
Pn t t
correction method.
L(βx , βz ) = i=1 δi [βx Xi + βz Zi + log{λ0 (Yi )}]

RY (14.5)
t t
−eβx Xi +βz Zi 0 i λ0 (s)ds. 14.3.1 Methodology and Asymptotic Properties
The procedure starts by estimating X by mX (Z, W, γ b), where γ
b is an
14.2 Induced Hazard Function estimator of γ that could be obtained as in Section 4.4. The next step
is to maximize the log partial likelihood (14.4), where Xi is replaced
When X is unobservable and instead we observe a surrogate W, Prentice by Xb i = mX (Z, W, γ b). If X∗i = mX (Z, W, γ), then the approximate
(1982) introduced the induced hazard function for the Cox regression regression calibration Cox model assumes that the hazard function for

320 321
the observed data is As with regression calibration in general, the advantage of Clayton’s
method is that no new software need be developed, other than calculat-
λ(t|X∗i , Zi ) = λ∗0 (t)exp(βx∗ t X∗i + βz∗ t Zi ). (14.8)
ing the means within risk sets. Formula (4.5) shows how to generalize
Under the regularity conditions in Wang, Wang, and Carroll (1997) one this method to multivariate covariates and covariates measured without
can show that the parameter estimator obtained by maximizing (14.4) error.
with Xi replaced by Xb i is a consistent, asymptotically normal, estimator

of βx . Because model (14.8) is just an approximation of the true model, 14.4 SIMEX for Survival Analysis
these results are only approximate for the parameter, βx , of the true
model. In practice, model (14.8) is often a good approximation of model The simulation–extrapolation (SIMEX) procedure proposed by Cook
(14.6). and Stefanski (1994) and presented in detail in Chapter 5 is a general
A major advantage of regression calibration is that, after fitting a methodology that extends naturally to survival analysis. For simplicity
reasonable model for E(Xi |Zi , Wi ), one can use existent software de- of presentation, we consider the case when only one variable is measured
signed for proportional hazards models, such as R or S–plus (coxph() with error.
and survreg() functions) or SAS (PHREG procedure), to produce first- The essential idea is to simulate new data by adding increasing amounts
order bias-corrected estimators of the parameters of a Cox model with of noise to the measured values Wi of the error prone covariate Xi , com-
covariates subject to measurement error. pute the estimator on each simulated data set, model the expectation
of the estimator as a function of the measurement error variance, and
extrapolate back to the case of no measurement error. More precisely, if
14.3.2 Risk Set Calibration σu2 is the variance of the measurement error, then for each ζ on a grid
Clayton (1991) proposed a modification of regression calibration that of points between [0, 2] we simulate
does not require events to be rare. If the X’s were observable, and if Xi p
Wb,i (ζ) = Wi + ζUb,i , b = 1, . . . , B, (14.9)
is the covariate associated with the ith event, in the absence of ties the
usual proportional hazards regression would maximize where Ub,i are normal, mean zero, independent random variables with
k
variance σu2 , and B is the number of simulations for each value of ζ. The
Y exp(βxt Xi ) measurement error variance of the contaminated observations Wb,i (ζ) is
P ,
i=1
t
j∈Ri exp(βx Xj ) (1 + ζ)σu2 , and the case of no measurement error corresponds to ζ = −1.
By replacing Xi with Wb,i (ζ) in the hazard function (14.3), we obtain
where Ri is the risk set (14.1) at the time when failure or censoring of © ª © ª
subject i occurs. Clayton suggested using regression calibration within λi t|Wb,i (ζ), Zi = λ0 (t)exp βxt Wb,i (ζ) + βzt Zi , (14.10)
each risk set, Ri , given in (14.1). He assumed that the true values X and either the partial likelihood (14.4) or the full likelihood (14.5) could
within the ith risk set are normally distributed with mean µi and variance be used to produce estimators βbxb (ζ) and βbzb (ζ). For each level of added
σx2 , and that within this risk set W = X + U, where U is normally noise ζ one obtains
distributed with mean zero and variance σu2 . Neither σx2 nor σu2 depends
B B
upon the risk set in his formulation. Given an estimate σ bu2 , one can 1 X bb 1 X bb
2 βbx (ζ) = βx (ζ), βbz (ζ) = βz (ζ) .
construct an estimate of σ bx just as in the equations following (4.4). B B
b=1 b=1
Clayton thus modified regression calibration by using it within each
risk set. Within each risk set, he applied the formula (4.5) for the best A quadratic or rational extrapolant, as described in Section 5.3.2, can
unbiased estimate of the X’s. Specifically, in the absence of replication, then be used to obtain the estimated values corresponding to ζ = −1.
for any member of the ith risk set, the estimate of the true covariate X For the case of multivariate failure time data, Greene and Cai (2004)
is have established the consistency and asymptotic normality of the SIMEX
estimator when the measurement error variance and an exact extrap-
b =µ bx2
σ
X bi + (W − µ
bi ) , olant are known. Li and Lin (2003a) have used SIMEX coupled with
bx2 + σ
σ bu2 the EM algorithm to provide inference for clustered survival data when
where µ
bi is the sample mean of the W’s in the ith risk set. some of the covariates are subject to measurement error.

322 323
The ideas extend to the case in which more than one predictor is prone
to measurement error. Suppose Wi = Xi + Ui with Ui independent
Normal(0, Ω), where Ω is a known positive definite q ×q variance matrix.
If Ω1/2 is its positive square root, then remeasured data is generated as
p

0.020
Wb,i (ζ) = Wi + ζ Ω1/2 Ub,i (ζ),

where Ub,i (ζ) are independent Normal(0, Iq ) vectors and ζ is a positive

0.015
scalar. Note that
© ª

Density
Cov Wb,i (ζ) = (1 + ζ) Ω,

0.010
which converges to the zero matrix as ζ → −1. After this, the rest of the
simulation and extrapolation steps are conceptually similar.

0.005
14.5 Chronic Kidney Disease Progression

0.000
To illustrate the regression calibration and SIMEX methodologies in
survival analysis, we analyze time-to-event data, where the event is de-
50 100 150 200
tection of primary coronary kidney disease (CKD). Primary CKD could
be viewed as the least severe phase of kidney disease and is typically de- Baseline eGFR

fined in relationship to the estimated glomerular filtration rate (eGFR)


of the kidney. Primary CKD is defined as either achievement of followup
Figure 14.1 Baseline estimated glomerular filtration rate (eGFR) for African-
eGFR < 60 or a post baseline CKD hospitalization or death (Marsh- Americans (solid line) and others (dashed line).
Manzi, Crainiceanu, Astor, et al., 2005).
Specifically, we are interested in testing whether African-Americans
are at higher risk of CKD progression. Data were obtained from the 14.5.1 Regression Calibration for CKD Progression
Atherosclerosis Risk in Communities (ARIC) study, a large multipur-
pose epidemiological study conducted in four U.S. communities (Forsyth Regression calibration is a first-order bias-reduction method that works
County, NC; suburban Minneapolis, MN; Washington County, MD; and well when the covariates subject to measurement error enter the model
Jackson, MS). A detailed description of the ARIC study is provided by linearly. Because eGFR has a nonlinear effect on survival probability,
the ARIC investigators (1989). In short, from 1987 through 1989, 15, 792 in this section we consider only subjects with baseline eGFR less than
male and female volunteers aged 45 through 64 were recruited from these 120. This is done only to illustrate regression calibration. One is primar-
communities for a baseline and three subsequent visits. For the purpose ily interested in the relationship between survival and eGFR over the
of this study, all primary CKD events up to January 1, 2003 were in- entire range of eGFR, and in Section 14.5.2 we model this nonlinear re-
cluded and the time-to-event data were obtained from annual participant lationship with a spline and correct for measurement error with SIMEX.
interviews and review of local hospital discharge lists and county death Several subjects were omitted from our analyses due to missing data
certificates. or baseline eGFR values smaller than 60, indicating decreased baseline
The estimated glomerular filtration rate (eGFR) is a measure of kidney kidney function. This reduced the number of subjects from 15, 792 to
function and characterizes the different stages of kidney disease. eGFR is 15, 080 in our full data set and 13, 359 in the reduced data set (eGFR <
subject to measurement error, and the measurement error variance was 120).
estimated from a different replication study. We consider Cox models Figure 14.1 shows the estimated probability density of baseline eGFR
for time-to-CKD events and we include eGFR, an indicator of African- for African-Americans compared to others, indicating better baseline
American race, age at baseline visit, and sex as covariates. kidney function for African-Americans. We considered a Cox model de-

324 325
eGFR AA Age Sex AA Age Sex

Naive −0.061 0.55 0.075 −0.01 Naive 0.50 0.070 0.011


SE 0.0022 0.061 0.0048 0.052 SE 0.059 0.0047 0.051

Reg. Cal. −0.105 0.84 .064 0.024 SIMEX 0.63 0.054 0.061
SE 0.0038 0.064 0.0049 0.052 SE 0.062 0.0049 0.052

Table 14.1 Estimates and standard errors (SE) of risk factors using a reduced Table 14.2 Estimates and standard errors (SE) of risk factors using all sub-
ARIC data set (13, 359 subjects) corresponding to eGFR < 120 and events jects with eGFR > 60 (15, 080 subjects) using events up to 2002. Naive is the
observed from first to second visit. Naive, regression on observed eGFR; “AA” regression using the observed eGFR; “AA” is African-American race.
is African-American race.

where mY (·) is a function of the eGFR, and AA denotes the African-


scribing the time to primary CKD events with covariates eGFR, an in- American race. We used a linear spline with four equally spaced knots
dicator of African-American race, sex, and age. Table 14.1 compares between eGFR = 70 and eGFR = 165. More precisely,
results of the naive analysis, which uses the observed eGFR values, with 4
the regression calibration, which uses the means of eGFR conditional on X
mY (x) = β1 x + αk (x − κk )+ , (14.12)
the observed eGFR and the other covariates. Not accounting for mea- k=1
surement error in eGFR would decrease the size of the effect of eGFR
by 42% and of the African-American race indicator by 35%, and would where κk , k = 1, . . . , 4 are the knots of the spline and a+ is equal to
increase the effect of age by 17%. The effect of sex on CKD progression a if a > 0 and 0 otherwise. In this parameterization, the αk parameter
was not statistically significant under either the naive or the regression represents the change in the slope of the log hazard ratio at knot κk
calibration procedure. corresponding to eGFR. The proportional hazard model (14.11) using
The measurement error variance was estimated using data from the the linear spline (14.12) with fixed knots to describe the effect of eGFR
Third National Health and Nutrition Examination Survey (NHANES is linear in the α and β parameters but it is nonlinear in the variable
III). Duplicate eGFR measurements were obtained for each of 513 par- measured with error.
ticipants aged 45 to 64 with eGFR ≥ 60 from two visits at a median of Following the SIMEX methodology described in Chapter 5, we simu-
17 days apart (Coresh et al., 2002). The estimated measurement error lated data sets using
variance was σ bu2 = 77.56, was treated as a constant in our analyses, and p
eGFRb,i (ζ) = eGFRi + ζUb,i , i = 1, . . . , n, b = 1, . . . , B, (14.13)
corresponded to a reliability of 0.80 for eGFR when all subjects with
eGFR ≥ 60 were considered and only 0.60 for eGFR of subjects with where Ub,i are normal, mean zero, independent random variables with
60 ≤ eGFR < 120. variance σu2 , where σu2 is the variance of the measurement error asso-
ciated with eGFR. We used 10 values for ζ on an equally spaced grid
between 0.2 and 2 and B = 50 simulated data sets for each value of ζ.
14.5.2 SIMEX for CKD Progression The entire program was implemented in R and run in approximately 5
Given the nonlinear relationship in the full data set between eGFR and minutes on a PC (3.6GHz CPU, 3.6Gb RAM), with more than 99% of
the hazard ratio, we fit the following Cox proportional hazard model the computation time being dedicated to fitting the 500 Cox models,
each with 15, 080 observations.
λi (t) = λ0 (t) exp{mY (eGFRi ) + β2 AAi + β3 Agei + β4 Sexi }, (14.11) Models were fit by replacing eGFRb,i (ζ) for eGFRi in model (14.13).

326 327
African American African American effect of age by 30%. The effect of sex on progression to primary CKD
was not statistically significant under either the naive or the SIMEX

0.0042
0.65
o
procedure.
Coefficient
o

Variance

0.0036
0.50
To obtain the SIMEX estimator of the eGFR effect, we estimated the

0.0030
0.35
function mY (·) on an equally spaced grid of points xg , g = 1, . . . , G =
−1.0 −0.5 0.0 0.5 1.0 1.5 2.0 −1.0 −0.5 0.0 0.5 1.0 1.5 2.0
100, between the minimum and maximum observed eGFR. For each level
of added noise, ζσu2 , the SIMEX estimator at each grid point, xg , is
ζ ζ

Age Age
B
1 X b
m
b Y (xg , ζ) = m
b Y (xg , ζ),
0.080

2.5 e−05
B
Coefficient

Variance
o b=1
0.065

2.0 e−05
o
where m b bY (xg , ζ) is the estimated function at xg using the bth simulated
0.050

−1.0 −0.5 0.0 0.5 1.0 1.5 2.0 −1.0 −0.5 0.0 0.5 1.0 1.5 2.0
data set obtained as in (14.13) at the noise level (1+ζ)σu2 . For every grid
ζ ζ
point we then used a quadratic linear extrapolant to obtain the SIMEX
Sex Sex estimator m b Y (xg , ζ = −1). The solid lines in Figure 14.3 represent the
estimated function mY (·), m b Y (xg , ζ), for ζ = 0, 0.4, 0.8, 1.2, 1.6, 2, with
0.06

o
0.0027

higher values of noise corresponding to higher intercepts and less shape


Coefficient

Variance

o
0.02

definition. The bottom dashed line is the SIMEX estimated curve.


−0.02

0.0024

The nonmonotonic shape of all curves is clear in Figure 14.3, with


−1.0 −0.5 0.0 0.5 1.0 1.5 2.0 −1.0 −0.5 0.0 0.5 1.0 1.5 2.0
unexpected estimated increase in CKD hazard for very large values of
ζ ζ
eGFR. Such results should be interpreted cautiously for two reasons.
First, the apparent increase may not be statistically significant, since
Figure 14.2 Coefficient and variance extrapolation curves for the ARIC sur- there are only 30 CKD cases with baseline eGFR > 140 and 14 with
vival modeling. The simulated estimates are based on 50 simulated data sets baseline eGFR > 150. The total number of CKD cases in our data set
and are plotted as solid circles. The fitted quadratic extrapolant (solid line) is
was 1, 605. Second, eGFR is not a direct measure of the kidney function
extrapolated to ζ = −1 (dashed line), resulting in the SIMEX estimate (open
circle).
and is typically obtained from a prediction equation, with creatinine as
an important predictor. Creatinine is produced by muscles and is filtered
out of the body by the kidney. Thus, lower values of creatinine typically
If βbkb (ζ), k = 2, 4 are the parameter estimates, then predict better kidney function. However, very low values of creatinine,
which would predict very large values of eGFR, could also be due to
B
1 X bb lack of muscular mass which, in turn, is associated with higher CKD in-
βbk (ζ) = βk (ζ), k = 2, 3, 4 cidence. In short, very low values of creatinine may occur either because
B
b=1 the kidney does amazing filtration work or because the subject already
are the estimated effects for noise level (1 + ζ)σu2 . Figure 14.2 displays has other serious problems and lacks muscular mass. The latter mech-
βbk (ζ) in the left column as filled black circles. The parameter estimates anism may actually be the one that is providing the increasing pattern
are obtained using a quadratic extrapolant evaluated at ζ = −1, which corresponding to eGFR > 140, irrespective of its statistical significance.
corresponds to zero measurement error variance. A similar method is
applied for the variance of the parameter estimates presented in the 14.6 Semi and Nonparametric Methods
right column. The only difference is that we extrapolate separately the
sampling and the measurement error variability, as described in Section Semiparametric models usually refer to a combination of parametric,
B.4.1. Table 14.2 provides a comparison between the naive and SIMEX richly parameterized, and nonparametric models. Survival models with
estimates showing that ignoring measurement error would artificially covariate measurement error have three main components that may be
decrease the effect of African-American race by 21% and increase the modeled semi or nonparametrically:

328 329
(Z, W) are categorical. If V and V̄ are the sets of indices corresponding
to validation and nonvalidation data, respectively, then
P t
j∈V Yj (t)I {Zj = Zi , Wj = Wi } exp(βx Xj )
ebi (t|βx ) = P (14.14)
j∈V Yj (t)I {Zj = Zi , Wj = Wi }
−2
is a simple nonparametric estimator of E {exp(βxt Xi |Ti ≥ t, Zi , Wi )}.
−4

The estimator ebi (t) is easy to calculate and represents the average of
exp(βxt Xj ) over those subjects in the validation data that are still at risk
f(eGFR,n.knots=4)

−6

at time t and share the same observed covariate values (Zj , Wj ) with
the ith subject. The induced hazard function with the partial likelihood
−8

replaced by an estimator is
b
λ(t|W t
i , Zi ) = λ0 (t)exp(β Zi ) {exp(βx Xi )I(i ∈ V ) + e
bi (t|βx )I(i ∈
/ V )} .
z
−10

Zhou and Pepe (1995) suggested maximizing the following estimator of


the log partial likelihood:
−12

 
Xn X
b i (βx , βz )} − log{
δi log{H b j (βx , βz )} ,
−14

EPL(βx , βz ) = H
i=1 j∈Ri
60 80 100 120 140 160 180 200
(14.15)
Baseline eGFR
where H b i (βx , βz ) = exp(βzt Zi ) {exp(βx Xi )I(i ∈ V ) + ebi (Yi |βx )I(i ∈ / V )}
is the estimated relative risk of subject i. Because the H b i (βx , βz ) is a
Figure 14.3 Linear spline fits with K = 4 knots. Function estimators based on weighted average of hazard ratios of subjects in the validation sample,
50 simulated data sets corresponding to ζ = 0, 0.4, 0.8, 1.2, 1.6, 2 are plotted as standard Cox regression software cannot be used to maximize (14.15).
solid lines, with larger values of noise corresponding to higher intercepts. The One possible solution would be to maximize (14.15) directly using non-
SIMEX estimate is the dashed line. linear optimization software.
A more serious limitation of the procedure occurs when the condi-
1. The conditional expectation E {exp(βxt X|T ≥ t, Z, W)}. tional distribution [X|Z, W] depends on three or more discrete covari-
2. The distribution function of X. ates. In this situation, it is difficult to estimate the conditional distribu-
tion [X|Z, W] well from a validation sample that is usually small. This
3. The baseline hazard function λ0 (t).
could have severe effects on the variability of the parameter estimate.
Various parametric and nonparametric methods depend on which com- Also, in practice, the conditional distribution [X|Z, W] often depends
ponent or combination of components is modeled nonparametrically, as on continuous covariates.
well as on the choice between partial or full likelihood function. A similar nonparametric estimator using kernel smoothing was pro-
posed by Zhou and Wang (2000) when (Z, W) are continuous. If K(·) is a
14.6.1 Nonparametric Estimation with Validation Data multivariate kernel function, then the conditional relative risk H b i (t|βx , βz )
t t
= E {exp(βx Xi + βz Zi |Ti ≥ t, Zi , Wi )} can be estimated as
When the rare-failure assumption introduced in Section 14.2 does not
hold, approximating E {exp(βxt X|T ≥ t, Z, W)} by E {exp(βxt X|Z, W)} Hb i (t|βx , βz )
may lead to seriously biased estimates (Hughes, 1993). One way to avoid P © t t t t t t
ª
j∈V Yj (t)Kh (Zi , Wi ) − (Zj , Wj ) exp(βxt Xj + βzt Zj )
this problem is to estimate E {exp(βxt X|T ≥ t, Z, W)} nonparametri- = P © t t t t t t
ª , (14.16)
cally. j∈V Yj (t)Kh (Zi , Wi ) − (Zj , Wj )
Assuming the existence of a validation sample, Zhou and Pepe (1995) where Kh (·) = K(·/h) and h is the kernel bandwidth size. Therefore,
proposed a nonparametric estimator of E {exp(βxt X|T ≥ t, Z, W)} when for a subject i that is not in the validation data, that is, i ∈
/ V , the

330 331
estimated hazard ratio H b i (t|βx , βz ) is a weighted sum of all hazard ratios The score function is
Hb j (t|βx , βz ) of subjects in the validation data that are still at risk at time (
∂l(βx ,βz ) Pn
t. The weights assigned to the hazard ratio of each subject depend on ∂βx ∂βz = i=1 δi [Xi , Zi ]
the distance between the observed covariates (Zti , Wit )t for subject i and
(Ztj , Wjt )t in the validation data. ) (14.17)
P
As in the case of discrete covariates, maximizing the function EPL(βx , j∈Ri [Xj ,Zj ] exp (βxt
Xj +βzt Zj )
− P
exp t X +β t Z )
(βx ,
βz ) from equation (14.15) can be used to estimate (βx , βz ). Maximiz- j∈Ri j z j

ing EPL(βx , βz ) requires a separate nonlinear maximization algorithm.


However, writing the code should be straightforward because derivatives where [X, Z] denotes the matrix obtained by binding the columns of X
of H b i (βx , βz ) can be calculated as weighted sums of the derivatives of and Z. The naive approach would replace Xi with (Wi1 + Wi2 )/2, while
Hj (βx , βz ), j ∈ V , with the weights being calculated only once. the regression calibration would replace Xi with E[Xi |Wi1 , Wi2 , Zi ].
The main problem is that the estimators of the induced hazard func- Huang and Wang (2000) proposed replacing the score function (14.17)
tion will depend heavily on the bandwidth size h of the kernel function. with
( P )
Since typical conditions for asymptotic consistency provide no informa- n
l(βx , βz ) X
∂e j∈Ri Bj (βx , βz )
tion about the choice of bandwidth for finite samples, careful data de- = δi Ai − P , (14.18)
pendent tuning is usually necessary. This problem is especially difficult
∂βx ∂βz i=1 j∈Ri Cj (βx , βz )

when there two or more covariates (Z, W), because the typically small where
validation data would be used for tuning of the smoothing parameters of
[Wi1 , Zi ] + [Wi1 , Zi ]
a multivariate kernel. Another limitation of the methods in this section Ai = ,
is that they require a validation sample that, in many applications, is 2
not available. [Wi1 , Zj ] exp(βxt Wi2 ) + [Wi2 , Zj ] exp(βxt Wi1 )
Bj (βx , βz ) = exp(βzt Zj )
2
and
14.6.2 Nonparametric Estimation with Replicated Data
exp(βxt Wi1 ) + exp(βxt Wi1 )
Cj (βx , βz ) = exp(βzt Zj ).
Huang and Wang (2000) have proposed a nonparametric approach for 2
the case when replicated proxy observations are available for each sub- Huang and Wang (2000) showed that the estimator obtained by max-
ject. They assumed that for each variable prone to measurement error, imizing (14.18) is consistent and asymptotically normal. When more
Xi , there are at least two proxy measurements, Wi1 , Wi2 linked to Xi than two replicates are available the formulas for Ai , Bj (βx , βz ) and
through a classical additive measurement error model Cj (βx , βz ) are slightly more complicated by taking averages over all
replicates instead of just two.
Wij = Xi + Uij , j = 1, 2,
A potential problem with this approach is that the approximate score
where Uij are mutually independent and independent of all other vari- function (14.18) can be evaluated only for those subjects i that have
ables. The approach is nonparametric because it does not require spec- repeated measurements. Therefore, serious losses of efficiency may occur
ification of the baseline hazard function, λ0 , or the distribution of Uij . when replication data are available for only a small subsample. Biased
However, the proportional hazard function is modeled parametrically. estimators may occur when the subset of subjects with replicated data
If Xi were observed without error, then a consistent, asymptotically is not a random subsample of the original population sample.
normal estimator of (βxt , βzt ) can be obtained by solving the score equa-
tion 14.6.3 Likelihood Estimation
∂l(βx , βz )
= 0, Likelihood estimation is a different approach to estimation in the context
∂βx ∂βz of survival analysis with measurement error. Under the assumptions in
where l(βx , βz ) is the log partial likelihood function defined in (14.4). Section 14.1, the likelihood associated with one observation, i, in the

332 333
Cox model is using the following conditional distribution
Qm I(Yi =tj )
j=1 λ0 (tj ) exp {δi (βxt Xi + βzt Zi )}
Z ( Z )
Yi h Pm i
δi
{λ(Yi |x, Zi )} exp − λ(u|x, Zi )du f (Wi , x|Zi )dx, (14.19) × exp −exp{δi (βxt Xi + βzt Zi )} j=1 λ0 (tj )I(tj ≤ Yi )
0

× f (Wi |Xi , Zi ).

where f (w, x|z) = f (w|x, z)f (x) is the joint conditional density function These functions work well when the integral in equation (14.20) is low
of the random variables W and X given Z. It is assumed that X is dimensional, that is, when the number of variables subject to measure-
independent of Z. If t1 , . . . , tm are all the unique failure times then the ment error is small.
full likelihood function can be written as The normality assumption can be further relaxed by considering a
more general family of distributions for the unobserved variables Xi .
For the case when only one variable is observed with error, Hu et al.
(1998) used the semi-nonparametric (SNP) class of smooth densities of
" Gallant and Nychka (1987),
Qn R Qm ½ ¾
L = i=1 j=1 λ0 (tj )I(Yi =tj ) exp{δi (βxt x + βzt Zi )} 1 ¡ ¢
K 2 1 (x − µx )2
f (x|θ) = 1 + a1 x + . . . + a K x exp − ,
C(θ) σx 2σx2
h Pm i (14.21)
× exp −exp{δi (βxt x + βzt Zi )} j=1 λ0 (tj )I(tj ≤ Yi ) where
R θ = (a 1 , . . . , a K , µx , σ x ) and C(θ) is a constant that ensures
f (x|θ)dx = 1. The class of smooth densities (14.21) contains the nor-
#
mal densities as a particular case when a1 = . . . = aK = 0. Because the
×f (Wi |x, Zi )f (x|θ)dx , number of monomials, K, is unknown and can theoretically be very large,
the family of distributions (14.21) can be viewed as a nonparametric fam-
(14.20) ily. However, in practice, it is very rare that K ≥ 3 is necessary. When
where βx , βz , λ0 (t1 ), . . . , λ0 (tm ), θ are treated as unknown parameters. K is small, maximizing (14.20) when f (x|θ) has the form (14.21) could
Hu, Tsiatis, and Davidian (1998) assumed that the conditional density provide a useful sensitivity analysis to the specification of the marginal
f (w|x, z) is known and used parametric, semi and nonparametric models distribution of X. Alternatively, the robustness of the specification can
for f (x|θ). be checked using the remeasurement method of Huang, Stefanski, and
The parametric model assumes that X has a normal distribution with Davidian (2006) (see also Section 5.6.3). The FORTRAN program Nlmix
parameters θ = (µx , Σx ), which in many applications is a reasonable (Davidian and Gallant, 1993) can handle random effect distributions of
assumption. When this assumption is not reasonable one can often find the type (14.21).
a transformation of the observed proxies, W, that would be consistent The fully nonparametric approach of Hu et al. (1998) for modeling
with the normality assumption. While Hu, Tsiatis, and Davidian (1998) f (x|θ) uses a binning strategy similar to histogram estimation. More
called this a fully parametric method, the baseline hazard function is precisely, a set of locations x1 , . . . , xK is fixed, where K << n is the
not parameterized. Thus, the procedure requires maximization over a number of support points of the approximate distribution. The proba-
large number of parameters and could be considered nonparametric with bility mass function of X is represented as
respect to the hazard function. K
Y I(X=xk )
An important feature of this methodology is that existent software f (x|θ) = pk , (14.22)
k=1
developed for fitting nonlinear mixed effect models, such as the FOR-
TRAN program Nlmix (Davidian and Gallant, 1993) or the R function where θ = (K, x1 , . . . , xK , p1 , . . . , pK ), pk = P (X = xk ), k = 1, . . . , K
PK
nlme, can be adapted to maximize (14.20). This can be done by treating and k=1 pk = 1. While, in principle, one could maximize the likelihood
the unobserved variables Xi as independent normal random effects and over θ, λ0 (t1 ), . . . , λ0 (tm ), βx and βz this is a not a realistic approach.

334 335
Hu, Tsiatis, and Davidian (1998) fixed K to be moderately large (K = where Uij are mean zero measurement error variables, independent of
20) and x1 , . . . , xK equally spaced on the range of observed values of W. Tij , Cij , Xij and bi .
For θ = (p1 , . . . , pK ), Hu et al. proposed an EM algorithm to maximize A full likelihood approach was proposed by Li and Lin (2000) for fitting
(14.20) and provided a simulation study comparing these methods with model (14.23), assuming normality of the frailty and [X|W, Z] distribu-
regression calibration. tions. They used an EM algorithm to maximize the marginal likelihood
Somewhat unexpectedly, regression calibration performs remarkably by treating the frailties and the covariates observed with error as miss-
well even in small samples (n = 100) when the distribution of X is nor- ing data. The “complete data” for the ith cluster are the observed data
mal and the attenuation factor is moderate or small. The full likelihood (Zij , Wij ) and the unobserved (Xij , bi ). The complete data likelihood
analysis using the normal distribution performed well even when the for this cluster is
distribution of X was not normal. From a practical perspective, apply-
Li (Θ; Xij , bi , Zij , Wij ) = {λij (t|Xij , Zij , bi )}δij ×
ing normalizing transformations to the observed W and using regression
calibration may be a very good first step of the analysis. As discussed ( Z )
Yi
by Hu et al. (1998), applying a likelihood-based method may be com- exp − λ(u|Xij , Zij , bi )du φ(bi , σb2 )φ(Xij |Wij , Zij , θ), (14.24)
putationally prohibitive for realistic data sets. A reasonable alternative 0

could be to apply these methods to a random subsample of the data as where φ(bi , σb2 ) is the normal density of bi with mean zero and variance
a sensitivity analysis. σb2 and φ(Xij |Wij , Zij ) is the conditional normal density of Xij given
One limitation of the methods described in this section is that the (Wij , Zij ). Here Θ is the vector of all parameters and includes the pa-
distribution of X is not allowed to depend on observed covariates Z. An- rameters of the proportional hazard function, (βx , βz ); the parameter of
other limitation is that the methods are designed for one variable subject the random effects model, σb2 ; the parameter of the conditional distri-
to measurement error, and they do not easily generalize to multiple cor- bution [X|W, Z], θ; and the set of all jumps in the integrated baseline
related variables. Lastly, the computational burden seems prohibitive for hazard function, ∆Λ0 (t).
data sets with thousands of observations and multiple covariates. Of course, the complete data likelihood cannot be used directly for esti-
mation since it contains unobserved data. Instead, one uses the marginal
likelihood of the observed data, which is obtained by integrating out the
14.7 Likelihood Inference for Frailty Models
unobserved quantities, that is
 
Random effects models have been discussed in Chapter 11. Random I Z Y J 
Y
effects have also been used in survival analysis with clustered data, but L(Θ) = Li (Θ, ; Xij , bi , Zij , Wij )dXij dbi .
in this context they are called frailties. In this section, we will use the  
i=1 j=1
notation introduced in Section 14.1 but use a pair of indices (i, j) instead
of the single index i, where i = 1, . . . , I is the cluster index, and j = The EM algorithm of Li and Lin (2000) usesd Monte Carlo simulations
1, . . . , J is the observation index within cluster. to perform these integrations at the E-step. A full likelihood analysis
Conditional on the cluster-specific frailty, the proportional hazards requires an estimate of the baseline hazard, and Li and Lin used a non-
function follows a Cox model (14.3): parametric maximum likelihood estimator.
¡ ¢
λij (t|Xij , Zij , bi ) = λ0 (t) exp βxt Xij + βzt Zij + bi , (14.23)
Bibliographic Notes
where Xij are variables subject to measurement error, Zij are observed
without measurement error, and bi is the cluster specific frailty. In ad- Substantial methodological and applied research has been dedicated in
dition to the standard assumptions for survival analysis we will assume recent years to survival analysis with covariates subject to measurement
that bi are iid Normal(0, σb2 ), independent of failure, censoring, and mea- error, starting with the seminal paper by Prentice (1982). The regression
surement error processes. We assume that the proxy variables Wij follow calibration approach was expanded and refined by Pepe, Self, and Pren-
a classical additive measurement error model tice (1989) and Wang, Hsu, Feng, and Prentice (1997). Clayton (1991)
used regression calibration within risk sets, thus avoiding the rare dis-
Wij = Xij + Uij , ease assumption. For data containing a validation sample, Zhou and Pepe

336 337
(1995) and Zhou and Wang (2000) proposed nonparametric estimators
of the induced hazard function. For data with at least two replicates,
Huang and Wang (2000) proposed a consistent nonparametric estimator
based on a modification of the partial likelihood score equation. Au-
gustin (2004) showed that Nakamura’s (1992) methodology of adjusting
the likelihood can be applied to the Breslow likelihood to provide an ex-
act corrected likelihood. This result circumvented the impossibility result
derived by Stefanski (1989) for the partial likelihood. Hu, Tsiatis, and
Davidian (1998) have proposed likelihood maximization algorithms for
parametric and nonparametric specifications of the distribution of the
unobserved variables. Greene and Cai (2004) established the asymptotic
properties of the SIMEX estimators for models with measurement er-
ror and multivariate failure time data. Hu and Lin (2004) introduced a
modified score equation and established the asymptotic properties of the
estimators for multivariate failure time data. Li and Lin (2000, 2003a)
used the EM algorithm and SIMEX, respectively, to provide maximum
likelihood estimators for frailty models with variables observed with er-
ror. Song and Huang (2005) compared the conditional score estimation of
Tsiatis and Davidian (2001) with Nakamura’s (1992) parametric adjust-
ment. Tadesse, Ibrahim, Gentleman, et al. (2005) discussed the Bayesian
analysis of times to remission using as covariates gene expression levels
measured by microarrays.
Surrogate markers are outcomes that are correlated with the outcome
of primary interest, for example CD4 counts have been used as a surro-
gate marker for survival times in AIDS clinical trials (Dafni and Tsiatis,
1998). The advantage of a surrogate marker is that it can indicate rel-
atively rapidly whether a treatment is effective, for example an AIDS
treatment can be judged effective relatively quickly if it causes a sig-
nificant increase in CD4 counts, whereas effectiveness based on survival
times might not be evident until enough deaths have occurred, which
might take years. In the meantime, patients would be deprived of a new
and effective treatment. A surrogate marker is an outcome, that is a
variable depending upon treatment, as well as a covariate for predicting
the primary outcome. Dafni and Tsiatis (1998) discussed methodology
for handling surrogate outcomes measured with error.

338
CHAPTER 15

RESPONSE VARIABLE ERROR

15.1 Response Error and Linear Regression

In preceding chapters, we have focused primarily on problems associated


with measurement error in predictor variables. In this chapter, we con-
sider problems that arise when a true response is measured with error.
Since in previous chapters we have designated X as a covariate measured
with error, to emphasize that we are not combining response error and
covariate error, we will not use X in this chapter.
Abrevaya and Hausman (2004) state “Classical measurement error
(that is, additive error uncorrelated with the covariates) in the dependent
variable is generally ignored in regression analysis because it simply gets
absorbed into the error residual.” It is interesting to consider this claim.

Without Response Error With Response Error


6 6

4 4

2 2

0 0

−2 −2

−4 −4

−6 −6
−2 −1 0 1 2 −2 −1 0 1 2

Figure 15.1 An illustration of response error in linear regression with unbi-


ased classical measurement error. The solid line is the data without response
measurement error, while the dashed line is the observed data with response
measurement error. Note the increased variability about the line when there is
response error.

339
Least Squares Fits With and Without Response Error Without Response Error With Response Error
2.5 3 3

2
2 2
1.5

1
1 1
0.5

0 0 0

−0.5
−1 −1
−1

−1.5
−2 −2
−2 No Response Error
With Response Error
−2.5 −3 −3
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2 −2 −1 0 1 2 −2 −1 0 1 2

Figure 15.2 The fitted least squares lines in the data from Figure 15.1. The Figure 15.3 Two hundred simulated data sets without (left panel) and with
left panel is the fitted least squares without response measurement error, while (right panel) measurement error in the setup of Figure 15.1. Note that re-
the right panel is the fitted least squares line with response measurement error. sponse measurement error that is classical and unbiased simply increases the
Note the lack of bias due to this type of response measurement error. variability of the least squares fitted lines, without affecting bias.

unbiased response measurement error simply increases the variability of


We generated linear regression data so that the true Z values were the fitted lines.
equally spaced on the interval [−2, 2], the intercept β0 = 0.0, and the We now make the following conclusions, the first of which is supported
slope βz = 1. The error about the line was σǫ2 = 1, so that Y = by the exercise with simulated data that we have just undertaken:
Normal(Z, 1). We then added to Y normally distributed response er- • In linear regression with unbiased and homoscedastic response mea-
ror with variance σv2 = 3, that is we observe S = Y + V, where V = surement error, the response measurement error increases the vari-
Normal(0, 3.0). Note that the measurement error in the response is three ability of the fitted lines without causing bias.
times the error about the line. If the measurement error were this large • We now go on to make a far stronger conclusion. In linear or nonlinear
and had been in the predictors, we know that the effect on the fitted regression that has homoscedastic errors about the true line, the only
lines would be enormous. effects of adding unbiased, homoscedastic response measurement error
What happens, though, when the measurement error is in the re- is to increase the variability of the fitted lines and surfaces, and to
sponse? The remark of Abrevaya and Hausman is illustrated in Figures decrease power for detecting effects. All tests, confidence intervals,
15.1, 15.2, and 15.3. In Figure 15.1, we show a typical set of data gener- etc. are perfectly valid: they are simply less powerful.
ated with and without response error. The obvious feature we see here The argument for the last very strong statement is perfectly simple.
is that the unbiased measurement error in the response increases the Suppose that without response error, Y has mean mY (Z, B) and vari-
variability of the observe data about the line. ance σ 2 . Now suppose that we observe S, which is just Y with additive
Figure 15.2 delves a little deeper, displaying the actual fitted lines. error σv2 . Then the observe response S has mean mY (Z, B) and variance
2
The remarkable thing here is that, even though the data with response σnew = σ 2 + σv2 . Thus, the observed data have the same mean and a
measurement error have four times the variability about the line as the constant, but larger variance.
data without response measurement error, the two lines are very simi- There is one caveat. For strongly nonlinear models, the larger response
lar. Figure 15.3 is the results of 200 simulated data sets, showing that variance has further implications. Inference for nonlinear models is often

340 341
σ =0
u
σ = 0.5
u
15.2 Other Forms of Additive Response Error
0.999 0.999
0.997 0.997
0.99
0.98
0.99
0.98
15.2.1 Biased Responses
0.95 0.95
0.90 0.90
Probability If S is not unbiased for Y, then regression of it on the observed predictors

Probability
0.75 0.75
0.50 0.50
0.25 0.25
leads to biased estimates of the main regression parameters. For example,
0.10
0.05
0.10
0.05 suppose Y given Z follows a normal linear model with mean β0 + βzt Z
0.02 0.02
0.01
0.003
0.01
0.003
and variance σǫ2 , while S given (Y, Z) follows a normal linear model with
0.001 0.001
1.8 2 2.2 1.5 2 2.5 3 mean γ0 + γ1 Y and variance σv2 . Here S is biased, and the observed data
esitmate of beta esitmate of beta
follow a normal linear model with mean γ0 + β0 γ1 + γ1 βzt Z and variance
σ =1 σ = 1.5
0.999
u
0.999
u σv2 + γ12 σǫ2 . Thus, instead of estimating βz , naive regression ignoring
0.997
0.99
0.98
0.997
0.99
0.98
measurement error in the response estimates γ1 βz .
0.95
0.90
0.95
0.90 There is an obvious solution to this problem, namely, to change S
Probability

Probability
0.75 0.75
0.50 0.50
so that it is unbiased, that is use (S − γ0 )/γ1 . The careful reader will
0.25 0.25 note that when a writer says things are obvious, he/she means something
0.10 0.10
0.05
0.02
0.05
0.02 different. Clearly (a better word!), the problem here is to obtain informa-
0.01 0.01
0.003
0.001
0.003
0.001 tion about (γ0 , γ1 ). In a series of papers, Buonaccorsi (1991, 1996) and
1 2 3 4 1 2 3 4 5 6
esitmate of beta esitmate of beta Buonaccorsi and Tosteson (1993) discussed how to do just this. Here, we
give a brief overview of what they proposed.
Figure 15.4 Normal plots of βb for an exponential regression model with differ-
ent amounts of measurement error in the response. 15.2.1.1 Validation Data
Suppose that validation data are available on a simple random subsample
of the primary data. The idea neatly breaks down into a series of steps:
based on approximation of the model by a linear one using a Taylor
• Use the validation subsample data to obtain estimates of B, the pa-
expansion of the parameter, β, about its true value, β0 , for example
rameters relating Y and Z, and (γ0 , γ1 ): call the former Bb1 .
Yi = mY (Zi , β) + ǫi ≈ mY (Zi , β0 ) + f ′ (Zi , β0 )(β − β0 ) + ǫi . • Create an estimated unbiased response as (S − γb0 )/b
γ1 and run your
favorite analysis to get a second estimate Bb2 .
The error in the Taylor approximation decreases to zero as β approaches
• Estimate the joint covariance matrix of these estimates using the boot-
β0 .
strap, and call it Σ.
An increase in response variance causes βb to vary more about β0 , which
makes the approximation less accurate. This can be seen in Figure 15.4, • Form the best weighted combination of the two estimates, namely
which has normal plots of βb from 250 simulations of the model Bb = (J t Σ−1 J)−1 J t Σ−1 (Bb1t , Bb2t )t ,

Yi = exp(−βZi ) + ǫi + Ui , i = 1, . . . , 200, where J = (I, I) and I is the identity matrix with the same number
of rows as there are elements in B.
where Z1 . . . . , Z200 are equally spaced on [0, 1], β = 2, σǫ = 1/4, and b −1 J)−1 as the estimated covariance matrix for the combined
• Use (J t Σ
σu = 0, 0.5, 1, and 1.5, respectively, in the four panels starting at the estimate B.b
top left. Notice that larger values of σu increase not only the variability
of βb but also its skewness. Because the errors are normally distributed,
15.2.1.2 Alloyed Gold Standard
βb would have an exact normal distribution if the model were linear.
Deviation from normality increases with σu , because larger values of σu In some (presumably fairly rare) cases, one might not have validation
increase the effect of the nonlinearity in the model. This problem does data, but instead, for a random subsample of the primary data, one
not occur, of course, if the model is linear. might have two independent replicate unbiased measurements of Y; call

342 343
them (S1∗ , S2,∗ ). These unbiased replicates are in addition to the biased 15.2.2.2 General Case
surrogate S measured on the main study sample.
Luckily, the general case is the simple case, just with more general
In this case, we use the same algorithm as for validation data, with
notation. Now the regression function for Y is something general like
the following changes:
mY (Z, B), and the variance function for Y is something general like
• Use the unbiased response (S1,∗ + S2,∗ )/2 to get Bb1 . σǫ2 g 2 (Z, B, θ). The rule, though, remains the same: if there is additive,
unbiased response error, the regression function remains the same, the
• Estimate (γ0 , γ1 ) using measurement error methods (!) as described
variance function changes, and Section A.7 shows how to cope with the
in Chapter 3, because the replication data follow the model,
change. As before, if we observe S = Y + V, where V has mean zero and
S = γ0 + γ1 Y + V; variance σv2 , then the new variance function is just σv2 + σǫ2 g 2 (Z, B, θ).
Sj,∗ = Y + Uj,∗ for j = 1, 2,
15.2.2.3 Ignoring Heteroscedasticity
where U1,∗ and U2,∗ are independent with mean zero. This is a linear
regression measurement error model with response S and “true co- We think it is silly, but many people ignore nonconstant variability and
variate” Y and with replicate measurements S1,∗ and S2,∗ of Y. The fit unweighted regressions, with the variance for the regression function
methods reviewed in Chapter 3 are used to estimate (γ0 , γ1 ). parameters fixed up by devices like the bootstrap (Section A.9) or the
sandwich method (Section A.6).
Why silly?
15.2.2 Response Error in Heteroscedastic Regression • Not accounting properly for variability leads to a decrease in efficiency
Weighted least squares and generalized least squares approaches are of- for estimating the regression parameter B. In effect, this means throw-
ten used when the data exhibit nonconstant variances, that is are het- ing away data just for sport. Few investigators have enough data that
eroscedastic. We call such models QVF models (Sections 8.8 and A.7), they are willing to throw some away to entertain the statistician.
because they combine aspects of quasilikelihood and variance function • Not all of statistics is estimating regression parameters. It is often
modeling. Sections 8.8 and A.7 describes what these models are, and important to understand the variability in order to make inferences
how to fit and make inference about them. Of course, we like to think about predictions and calibrations. This ihas been demonstrated in
of Carroll and Ruppert (1988) as the authoritative text on the topic. striking detail by Carroll (2002) and in a series of examples by Carroll
Briefly, additive unbiased response measurement error in a heterosce- and Ruppert (1988).
dastic regression simply changes the form of the variance function. Once
one keeps track of the change, then the methods of Section A.7 apply.
15.3 Logistic Regression with Response Error

15.2.2.1 A Simple Case 15.3.1 The Impact of Response Misclassification

To see this in a simple case, suppose that the regression function one In logistic regression, response error is misclassification. There are two
wants to fit to Y is linear: β0 + βz Z, but that the variance about the primary differences with regression models having a continuous response
true line is σǫ2 Zα . If Y were observed and α were known, one would and additive response measurement error:
simply perform a weighted least squares regression with weights Z−α : • Additive measurement error makes no sense. The error occurs when a
Section A.7 shows how to estimate α. positive response Y = 1 is transmuted into a negative response S = 0,
Now suppose, however, that instead of observing Y, one observed and vice versa.
S = Y +V, where V has mean zero and variance σv2 . Then we are simply • Misclassification is biased response error, and the bias needs to be
adding variability, so that S has the same linear regression function as accounted for.
does Y, but the variance becomes σv2 + σǫ2 Zα . The form of the variance
Thus, consider a logistic regression model that has probability of re-
function has changed by the addition of the response error variance σv2 .
sponse
However, Section A.7 is so general that the methods described there
apply to the new model for S. pr(Y = 1|Z) = H(β0 + βzt Z), (15.1)

344 345
Logistic Regression With Misclassification Also, response misclassification can lead to major biases in parameter
0.9
estimates. In our case, the true slope is βz = 1, but the response mis-
0.8
classification makes logistic regression think that the slope is more along
the lines of 0.40, a major difference.
0.7 This illustration indicates that response misclassification does need to
be accounted for.
0.6
Probability

0.5
15.3.2 Correcting for Response Misclassification
0.4
The profound impact of response misclassification in logistic regression
has led to the development of many interesting statistical methods,
0.3
see among many others Palmgren and Ekholm (1982, 1987), Ekholm
0.2 and Palmgren (1987), Copas (1988), Neuhaus (2002), Ramalho (2002),
Prescott and Garthwaite (2002), and Paulino et al. (2003).
0.1
No Misclassification
With Misclassification
0
15.3.2.1 Unknown Misclassification Probabilities
−3 −2 −1 0 1 2 3
Predictor
If one believes the misclassification probabilities are independent of the
covariates, then estimating all the parameters (π1 , π0 , β0 , βz ) can be done
Figure 15.5 Illustration of the effect of misclassification of the response in
via maximum likelihood or Bayesian approaches. Let the probability
logistic regression. Solid line: the true probability of response. Dashed line: the
observed probability of response when cases (noncases) are classified incorrectly
model (15.2) be denoted as Ψ(S, Z, π0 , π1 , β0 , βz ). Then the loglikelihood
20% (30%) of the time. function to be maximized is just
n
X £
Si log{Ψ(S, Z, π0 , π1 , β0 , βz )} (15.3)
where H(·) is the logistic distribution function. Pretend that misclassifi-
i=1
cation does not depend on Z, and that we classify individuals correctly ¤
+(1 − Si )log{1 − Ψ(S, Z, π0 , π1 , β0 , βz )} .
with probabilities
pr(S = 1|Y = 1, Z) = π1 ; Maximization of this loglikelihood can be done via many devices, includ-
ing the method of scoring, iteratively reweighted least squares and the
pr(S = 0|Y = 0, Z) = π0 . EM-algorithm.
The observed data no longer follow the logistic model (15.1), but instead The major practical issue is that the classification probabilities are
have the more complex form only very weakly identified by the data, that is they are difficult to
estimate with any precision, and that difficulty carries over to estima-
pr(S = 1|Z) = (1 − π0 ) + (π1 + π0 − 1)H(β0 + βzt Z). (15.2)
tion of the underlying risk function. Copas (1988) states that “accurate
Figure 15.5 gives an illustration of the impact of misclassification of estimation of (the misclassification parameters) is very difficult if not
the response. In this setting, those who actually have a response Y = 1 impossible unless n is extremely large.” This is one of the classic cases
are correctly classified with probability 80%, while those who did not where parameters may be identified theoretically but not in any practi-
have a response, that is Y = 0, are correctly classified with probability cal sense; see also Section 8.1.2. Copas (1988) and Neuhaus (2002) both
70%. The logistic intercept is −1.0 and the slope is 1.0. One can see basically concluded that the best one can hope to do is a sensitivity
that the effect of response misclassification is to bias the true line badly, analysis for plausible values of the misclassification probabilities.
somewhat along the lines of an attenuation. The difference between the Paulino et al. (2003), in a slightly different context, addressed the
major impact of misclassification of the response here and the near null problem of lack of practical identifiability of the misclassification proba-
impact of unbiased response error in linear regression (Figure 15.2) is bilities via the Bayesian route, using informative prior distributions that
profound and important. were developed with the help of a subject-matter expert.

346 347
We next describe situations in which there is information about the misclassification probabilities.

15.3.2.2 Known Misclassification Probabilities

In the presumably rare event that the classification probabilities π1 and π0 are known, maximizing the loglikelihood (15.3) in (β0, βz) is simple using iteratively reweighted least squares.

Figure 15.6 Illustration of the effect of misclassification of the response in logistic regression. Solid line: density estimate of the estimated slopes in a simulation study of the logistic regression when there is no misclassification. Dashed line: density estimate with misclassification, when the misclassification probabilities are known. The observed probability of response with cases (non-cases) are classified incorrectly 20% (30%) of the time. Note the profound loss of information due to response misclassification.

Figure 15.6 describes a simulation study of the same logistic model described previously when the number of observations is n = 500. It contrasts the density function of the logistic regression slope estimator if there were no misclassification (solid line) versus what happens when there is misclassification, but the misclassification probabilities are known. The point of this figure is to note that if the classification probabilities are known, then one can indeed construct an approximately consistent estimate of the true slope (in contrast, if one ignores the misclassification, one thinks that the slope is 0.40, not the correct 1.00), but that the effect of misclassification is to increase greatly the variability of the fitted logistic slope.

15.3.2.3 Validation Data

In some cases, there might be validation data, that is, Y may be observable on a subset of the study. In this case, one can directly estimate the classification probabilities.

Figure 15.7 Illustration of the effect of misclassification of the response in logistic regression, when there is 20% validation done completely at random. Solid line: density estimate of the slope in the logistic regression for the validation data. Dashed line: density estimate for the MLE. Dotted line: density estimate for pseudolikelihood.

One possibility is to estimate π1 (π0) as the fraction of those in the validation study who are correctly classified among those whose true value is Y = 1 (Y = 0), pretend that these are known, and then maximize the (now pretend) likelihood (15.3). This approach is called pseudolikelihood, a methodology that has a long and honorable history in statistics. There are two major difficulties with such an approach:

• It is invalid, leading to biased estimation and inference, if selection into the validation study depends on the observed values of S or Z, as might reasonably happen.

• It is inefficient, because in this case a proper likelihood analysis can
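Continuing the simulation sketch above (and reusing H, s and z from it), the following Python fragment illustrates, under our own assumptions, maximum likelihood estimation of (β0, βz) when π1 and π0 are treated as known, based on the observed-data success probability (15.2).

```python
# Sketch: corrected fit when the misclassification probabilities are known.
import numpy as np
from scipy.optimize import minimize

def negloglik_known_pi(b, s, z, pi1, pi0):
    p_s = (1 - pi0) + (pi1 + pi0 - 1) * H(b[0] + b[1] * z)   # pr(S = 1 | Z), eq. (15.2)
    return -np.sum(s * np.log(p_s) + (1 - s) * np.log(1 - p_s))

fit = minimize(negloglik_known_pi, x0=np.zeros(2),
               args=(s, z, pi1, pi0), method="BFGS")
print("corrected estimate of (beta0, betaz):", fit.x)
```

In repeated simulations this estimator is approximately unbiased for the true slope, but, as the text notes, its sampling variability is much larger than that of the estimator based on the true responses.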
be undertaken that uses the observed Y values effectively. A detailed derivation of the likelihood function is delayed until Section 15.4 below. See also Prescott and Garthwaite (2002) for a Bayesian treatment.

In Figure 15.7, we display what happens to the complete data estimate, the pseudolikelihood estimate, and the maximum likelihood estimate of the slope when there is 20% randomly selected validation. The complete data estimate, also called the complete-cases estimate, uses only the cases that have all variables measured, that is, only the validation data. All the methods are consistent estimates, and there is little to choose between them.

Figure 15.8 Illustration of the effect of misclassification of the response in logistic regression, when selection into the validation study depends on S and Z. Solid line: density estimate of the slope in the logistic regression for the validation data. Dashed line: density estimate for the MLE. Dotted line: density estimate for pseudolikelihood. The actual slope is 1.0: note the bias in all but the MLE.

However, in Figure 15.8 validation is more complex and depends on both S and Z. If S = 1 and Z > 0, we observe Y with probability 0.05. If S = 1 and Z ≤ 0, we observe Y with probability 0.15. If S = 0 and Z > 0, we observe Y with probability 0.20. If S = 0 and Z ≤ 0, we observe Y with probability 0.40. This figure shows that only the maximum likelihood estimate is unbiased, with the complete data estimates and the pseudolikelihood estimates incurring substantial bias.

Validation Data
Z   S   Y   Count
0   0   0   19
0   0   1   5
0   1   0   7
0   1   1   14
1   0   0   28
1   0   1   27
1   1   0   8
1   1   1   24

Nonvalidation Data
0   0   –   47

Table 15.1 GVHD data set. Here Y = 1 if the patient develops chronic GVHD and Y = 0 otherwise, while S = 1 if the patient develops acute GVHD. The predictor Z = 1 if the patient is aged 20 or greater, and zero otherwise.

15.3.2.4 Example

In this section, we present an example where selection into the validation study depends on the mismeasured response. We compare the maximum likelihood estimate with the naive use of the complete data. The latter is not valid and appears to be seriously biased in this case.

Pepe (1992) and Pepe et al. (1994) described a study of 179 aplastic anemia patients given bone marrow transplants. The objective of the analysis is to relate patient age to incidence of chronic graft versus host disease (GVHD). Patients who develop acute GVHD, which manifests itself early in the post transplant period, are at high risk of developing chronic GVHD. Thus, in this example Y is chronic GVHD, S is acute GVHD, and Z = 1 or 0 according to whether the patient is aged 20 or greater or is less than 20 years of age. The data are given in Table 15.1. A logistic regression model for Y given Z is assumed.

The selection process as described by Pepe et al. (1994) is to select only 1/3 of low risk patients (less than 20 years old and no acute GVHD) into the validation study, while following all other patients. Thus, π(S, Z) = 1/3 if S = 0 and Z = 0, otherwise π(S, Z) = 1. Note that, here, selection
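A minimal pseudolikelihood sketch, again continuing the earlier simulation (it reuses rng, n, y, s, z, and negloglik_known_pi, all of which are our own names), is shown below. The sensitivity and specificity are estimated from a randomly selected validation subsample and then treated as known.

```python
# Sketch: pseudolikelihood with a 20% validation subsample selected at random.
import numpy as np
from scipy.optimize import minimize

delta = rng.binomial(1, 0.20, n).astype(bool)      # validation indicator
v_y, v_s = y[delta], s[delta]
pi1_hat = np.mean(v_s[v_y == 1] == 1)              # estimated sensitivity
pi0_hat = np.mean(v_s[v_y == 0] == 0)              # estimated specificity

pseudo = minimize(negloglik_known_pi, np.zeros(2),
                  args=(s, z, pi1_hat, pi0_hat), method="BFGS")
print("pseudolikelihood estimate of (beta0, betaz):", pseudo.x)
```

With completely random validation this behaves like the known-probability analysis; if selection depends on S or Z, the same construction is biased, as discussed in the text.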
into the validation study depends on both S and Z, so that an ordinary logistic regression analysis on the complete data (∆ = 1) will be invalid.

We performed the following analyses: (i) use of validation or complete data only, which is not valid in this problem because of the nature of the selection process, but is included for comparison, and (ii) maximum likelihood. The results of the two analyses are listed in Table 15.2. We see that the validation data analysis is badly biased relative to the valid maximum likelihood analysis, with markedly different significance levels.

                     Validation Data    MLE
β̂z                        0.66         1.13
Standard Error             0.37         0.38
p-value                    0.078        0.004

Table 15.2 Analysis of GVHD data set, with the validation data analysis and the maximum likelihood analysis. In this data set, selection depends on both Z and S, so that an analysis based only upon the validation data will lead to biased estimates and inference.

15.3.2.5 Repeats and Multiple Instruments

We have seen above that there is very little information about the misclassification probabilities if we only observe S, while validation data in which Y is observed does provide such information. Going from nothing to everything is a large gap!

Suppose that there is no validation study component. In some cases, experiments done by others in which Y is observed provide information about the misclassification, along with standard error estimates. Using these estimates provides a means of estimation of the underlying logistic regression model, with standard errors that can be propagated through by drawing bootstrap samples from the previous study and the current study separately.

In other cases, replication of S can be used to gain information about the misclassification probabilities. For example, if Y and S are binary, and if the misclassification probability is the same for both values of Y, then two independent replicates of S per person suffice to identify the misclassification probability. Otherwise, at least three independent replicates are necessary for identification. Whether technical identifiability results in practical identifiability is not clear, and one has to expect that in the absence of a strong prior distribution on the misclassification rates, the effect of misclassification will be to lower power greatly.

15.4 Likelihood Methods

In this section, we describe the technical details of likelihood methods for response measurement error. As seen in Section 15.3, one has to be careful to separate out theoretical identifiability of parameters from actual identifiability, the latter meaning that there is enough information about the parameters in the observed data to make their estimation practical.

15.4.1 General Likelihood Theory and Surrogates

Let fS|Y,Z(s|y, z, γ) denote the density or mass function for S given (Y, Z). We will call S a surrogate response if its distribution depends only on the true response, that is, fS|Y,Z(s|y, z, γ) = fS|Y(s|y, γ). All the models we have considered in detail to this point are for surrogate responses.

In the case of a surrogate response, a very pleasant thing occurs. Specifically, if there is no relationship between the true response Y and the predictors, then neither is there one between the observed response S and the predictors. Thus, if one's only goal is to check whether there is any predictive ability in any of the predictors, and if S is a surrogate, then using the observed data provides a valid test. However, like everything having to do with measurement error, a valid test does not mean a powerful test: measurement error in the response lowers power.

This definition of a surrogate response is the natural counterpart to a surrogate predictor, because it implies that all the information in the relationship between S and the predictors is explained by the underlying response.

In general, that is, for a possibly nonsurrogate response, the likelihood function for the observed response is

fS|Z(s|z, B, γ) = Σ_y fY|Z(y|z, B) fS|Y,Z(s|y, z, γ).    (15.4)

If Y is a continuous random variable, the sum is replaced by an integral. If S is a surrogate, then fS|Y(s|y, γ) replaces fS|Y,Z(s|y, z, γ) in (15.4), showing that if there is no relationship between the true response and the predictors, then neither is there one between the observed response and the predictors. The reason for this is that under the stated conditions, neither term inside the integral depends on the predictors: the first because Y is not related to Z, and the second because S is a surrogate. However, if S is not a surrogate, then there may be no relationship between the true response and the covariates, but the observed response may be related to the predictors.
It follows that if interest lies in determining whether the predictors contain any information about the response, one can use naive hypothesis tests and ignore response error only if S is a surrogate. The resulting tests have asymptotically correct level, but decreased power relative to tests derived from true response data. This property of a surrogate is important in clinical trials; see Prentice (1989).

Note that one implication of (15.4) is that a likelihood analysis with mismeasured responses requires a model for the distribution of response error. We have already seen an example of this approach in Section 15.3.

Just as in the predictor-error problem, it is sometimes, but not always, the case that the parameters (B, γ) are identifiable, that is, can be estimated from data on (S, Z) alone. We have seen two examples of this: (a) in regression models with a continuous response and additive unbiased measurement error, the parameters in the model for the mean are identified; and (b) logistic regression when S is a surrogate. Of course, in the latter case, as seen in Section 15.3.2, the identifiability is merely a technical one, not practical.

15.4.2 Validation Data

We now suppose that there is a validation subsample obtained by measuring Y on units in the primary sample selected with probability π(S, Z). The presence (absence) of validation data on a primary-sample unit is indicated by ∆ = 1 (0). Then, based on a primary sample of size n, the likelihood of the observed data for a general proxy S is

∏_{i=1}^n [{f(S_i|Y_i, Z_i, γ) f(Y_i|Z_i, B)}^{∆_i} × {f(S_i|Z_i, B, γ)}^{1−∆_i}],    (15.5)

where f(S_i|Z_i, B, γ) is computed by (15.4) and we have dropped the subscripts on the density functions for brevity.

The model for the distribution of S given (Y, Z) is a critical component of (15.5). If S is discrete, then one approach is to model this conditional distribution by a polytomous logistic model. For example, suppose the levels of S are (0, 1, . . . , S). A standard logistic model is

pr(S ≥ s|Y, Z) = H(γ_{0s} + γ_1 Y + γ_2^t Z),  s = 1, . . . , S.

When S is not discrete, a simple strategy is to categorize it into S levels, and then use the logistic model above.

As described above, likelihood analysis is, in principle, straightforward. There are two obvious potential drawbacks, namely that one has to worry about the model for the measurement error and then one has to compute the likelihood. These are little different from what is required for any likelihood problem.

15.5 Use of Complete Data Only

We have downplayed descriptions of the very large literature when there is a gold standard for a covariate X measured with error. This huge literature, which includes both the missing data likelihood literature and the missing data semiparametric literature, tends to be technical, and entire books can, and have been, written about them.

In the case that the response Y can be observed on a subset of the study data, the literature is much smaller and more manageable. Herewith we make a few remarks on methods that use the validation data only, throwing away any of the data when Y is not observed. It is not entirely clear why one would do this instead of performing a complete likelihood or Bayesian analysis, except in the case of logistic regression with a surrogate where selection depends only on S, in which case the various validation data analyses are simple variants of logistic regression. Given the availability of logistic regression software, this is certainly a useful simplification.

In what follows, selection into the validation study occurs with probability π(S, Z). Let ∆ = 1 denote selection into the validation study.

15.5.1 Likelihood of the Validation Data

The validation data have the likelihood function for a single observation given by

f(Y, S|Z, ∆ = 1) = π(S, Z) f(S|Y, Z, γ) f(Y|Z, B) / [Σ_s Σ_y π(s, Z) f(s|y, Z, γ) f(y|Z, B)],    (15.6)

where again if S or Y are continuous, the sums are replaced by integrals. Here are a few implications of (15.6):

• If selection into the validation study is completely at random, or if it simply depends on the predictors but not S, then one can run the standard analysis on the (Y, Z) data and ignore S entirely. Checking this is a small math calculation.

• In general, (15.6) cannot be simplified, and in particular, using the standard analysis on the observed (Y, Z) data leads to bias; see Figure 15.8.

• Logistic regression has a very special place here. If S is a binary surrogate, and if selection into the validation study depends on S
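The following Python sketch (our own illustration, continuing the earlier simulation and reusing H, y, s, z, and delta from it) shows one way to maximize the full likelihood (15.5) for a binary Y and a binary surrogate S with pr(S = 1|Y = 1) = π1 and pr(S = 0|Y = 0) = π0, estimating (β0, βz, π1, π0) jointly.

```python
# Sketch: full maximum likelihood (15.5) with validation data.
import numpy as np
from scipy.optimize import minimize

def negloglik_full(theta, y, s, z, delta):
    b0, bz, pi1, pi0 = theta
    pY = H(b0 + bz * z)                               # pr(Y = 1 | Z)
    pS_given_Y1 = np.where(s == 1, pi1, 1 - pi1)      # f(S | Y = 1)
    pS_given_Y0 = np.where(s == 1, 1 - pi0, pi0)      # f(S | Y = 0)
    # validation cases contribute f(S|Y) f(Y|Z); the rest sum over y as in (15.4)
    lik_val = np.where(y == 1, pS_given_Y1 * pY, pS_given_Y0 * (1 - pY))
    lik_non = pS_given_Y1 * pY + pS_given_Y0 * (1 - pY)
    return -np.sum(np.where(delta, np.log(lik_val), np.log(lik_non)))

mle = minimize(negloglik_full, np.array([0.0, 0.0, 0.8, 0.7]),
               args=(y, s, z, delta), method="L-BFGS-B",
               bounds=[(None, None), (None, None), (0.01, 0.99), (0.01, 0.99)])
print("MLE of (beta0, betaz, pi1, pi0):", mle.x)
```

Note that, in practice, y would be available only where delta is True; the sketch uses the simulated y vector for convenience but only the validation values enter the likelihood.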
only, then running a logistic regression ignoring the very existence of S leads to valid inference about the nonintercept parameters; see Tosteson and Ware (1990).

15.5.2 Other Methods

In some problems, it can occur that there are two data sets; a primary one in which (S, Z) are observed (∆ = 0), and an independent data set in which (Y, Z) are observed (∆ = 1). This may occur when Y is a sensitive endpoint such as income, and S is reported income. Because of confidentiality concerns, it might be impossible to measure Y and S together. In such problems, the likelihood is

∏_{i=1}^n {f(Y_i|Z_i, B)}^{∆_i} {f(S_i|Z_i, B, γ)}^{1−∆_i}.

15.6 Semiparametric Methods for Validation Data

As we have suggested, likelihood methods can potentially be troublesome because they might be sensitive to the assumed distribution for the mismeasured response. This has led to a small literature on semiparametric methods, which attempt in various guises to model the distribution of S given Y and the covariates nonparametrically.

15.6.1 Simple Random Sampling

Suppose that selection into the second stage validation study is by simple random sampling, that is, all possible samples of the specified sample size are equally likely. Pepe (1992) constructed a pseudolikelihood method similar in spirit to that of Carroll and Wand (1991) and Pepe and Fleming (1991) for the mismeasured covariate problem with validation data. The basic idea is to use the validation data to form a nonparametric estimator f̂_{S|Y,Z} of f_{S|Y,Z}. One then substitutes this estimator into (15.4) to obtain an estimator f̂_{S|Z}(s|z, B) and then maximizes

∏_{i=1}^n {f(Y_i|Z_i, B)}^{∆_i} {f̂(S_i|Z_i, B)}^{1−∆_i}.

This approach requires an estimator of f_{S|Y,Z}. Here are a few comments:

• If all the random variables are discrete, the nonparametric estimator of the probability that S = s given (Y, Z) = (y, z) is the fraction in the validation study which have S = s among those with (Y, Z) = (y, z), although we prefer flexible parametric models in this case.

• Problems that have continuous components of (S, Y, Z) are more complicated. For example, suppose that S is continuous, but the other random variables are discrete. Then the density function of S in each of the cells formed by the various combinations of (Y, Z) must be estimated. Even in the simplest case that (Y, Z) are binary, this means estimating four density functions using validation data only. While the asymptotic theory of such a procedure has been investigated (Pepe, 1992), we know of no numerical evidence indicating that the density estimation methods will work adequately in finite samples, nor is there any guidance on the practical problems of bandwidth selection and dimension reduction when two or more components of (S, Y, Z) are continuous.

• In practice, if S is not already naturally categorical, then an alternative strategy is to perform such categorization, fit a flexible logistic model to the distribution of S given the other variables, and maximize the resulting likelihood (15.5).

15.6.2 Other Types of Sampling

Pseudolikelihood can be modified when selection into the second stage of the study is not by simple random sampling. The estimating equations for the EM-algorithm maximizing (15.5) are

0 = Σ_{i=1}^n ∆_i {Ψ1(Y_i, Z_i, B) + Ψ2(S_i, Y_i, Z_i, γ)}
  + Σ_{i=1}^n (1 − ∆_i) E{Ψ1(Y_i, Z_i, B) + Ψ2(S_i, Y_i, Z_i, γ) | S_i, Z_i},

where

Ψ1 = ((∂/∂B) log(fY|Z)^t, 0^t)^t,
Ψ2 = (0^t, (∂/∂γ) log(fS|Y,Z)^t)^t.

The idea is to use the validation data to estimate E{Ψ1(Y_i, Z_i, B)|S_i, Z_i} and then solve

0 = Σ_{i=1}^n [∆_i Ψ1(Y_i, Z_i, B) + (1 − ∆_i) Ê{Ψ1(Y_i, Z_i, B)|S_i, Z_i}].

For example, suppose that (S, Z) are all discrete. Now define I_{ij} to
equal one when (S_j, Z_j) = (S_i, Z_i) and zero otherwise. Then

Ê{Ψ1(Y_i, Z_i, B)|S_i, Z_i} = Σ_{j=1}^n ∆_j Ψ1(Y_j, Z_j, B) I_{ij} / Σ_{j=1}^n ∆_j I_{ij}.
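A short Python sketch of this cell-average estimate is given below. It is our own illustration, not the book's code; it treats Z as discrete (as in the GVHD example) and assumes a logistic model, for which Ψ1(Y, Z, B) = {Y − H(β0 + βz Z)}(1, Z)^t. The names H, y, s, z, delta continue the earlier sketches.

```python
# Sketch: Ê{Psi1 | S, Z} as a cell average over validation cases.
import numpy as np

def psi1(yy, zz, b):
    resid = yy - H(b[0] + b[1] * zz)
    return np.column_stack((resid, resid * zz))        # n x 2 matrix of scores

def cell_average_E_psi1(b, y, s, z, delta):
    scores = psi1(y, z, b)
    Ehat = np.zeros_like(scores)
    for i in range(len(y)):
        cell = (s == s[i]) & (z == z[i]) & delta        # validation cases in the (S, Z) cell
        Ehat[i] = scores[cell].mean(axis=0)             # undefined if a cell has no validation cases
    return Ehat
```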

In other cases, nonparametric regression can be used. In the discrete case, Pepe et al. (1994) derived an estimate of the asymptotic covariance matrix of B̂ as A^{−1}(A + B)A^{−t}, where

A = − Σ_{i=1}^n ∆_i (∂/∂B^t) Ψ1(Y_i, Z_i, B̂)
    − Σ_{i=1}^n (1 − ∆_i) [Σ_{j=1}^n ∆_j (∂/∂B^t) Ψ1(Y_j, Z_j, B̂) I_{ij}] / [Σ_{j=1}^n ∆_j I_{ij}];

B = Σ_{s,z} {n(s, z) n_2(s, z) / n_1(s, z)} r(s, z, B̂);

n_1(s, z), n_2(s, z), and n(s, z) are the number of validation, nonvalidation and total cases with (S, Z) = (s, z), and where r(s, z, B̂) is the sample covariance matrix of Ψ1(Y, Z, B̂) computed from observations with (∆, S, Z) = (1, s, z).
Bibliographic Notes
Lyles, Williamson, Lin, and Heilig (2005) extend McNemar’s test for
paired binary outcomes to the situation where the outcomes are mis-
classified.

APPENDIX A

BACKGROUND MATERIAL

A.1 Overview
This Appendix collects some of the technical tools that are required for
understanding the theory employed in this monograph. The background
material is, of course, available in the literature, but often widely scat-
tered, and one can use this chapter as a brief tour of likelihood, quasi-
likelihood and estimating equations.
Sections A.2 and A.3 discuss the normal and lognormal distributions and the gamma and inverse-gamma distributions, respectively. Section A.4 discusses predic-
tion of an unknown random variable by another random variable and
introduces “best prediction,” which can be considered a population ana-
log to regression; conversely, regression is the sample analog of best pre-
diction. Section A.5 reviews likelihood methods, which will be familiar
to most readers. Section A.6 is a brief introduction to the method of
estimating equations, a widely applicable tool that is the basis of all
estimators in this book. Section A.8 defines generalized linear models.
The bootstrap is explained in Section A.9, but one need only note while
reading the text that the bootstrap is a computer-intensive method for
performing inference.

A.2 Normal and Lognormal Distributions


We say that X is Normal(µx, σx²) if it is normally distributed with mean µx and variance σx². Then the density of X is σx^{−1} φ{(x − µx)/σx}, where φ is the standard normal pdf

φ(x) = (2π)^{−1/2} exp(−x²/2),    (A.1)

and the CDF of X is Φ{(x − µx)/σx}, where Φ(x) = ∫_{−∞}^x φ(u) du is the standard normal CDF.
We say that the random vector X = (X1, . . . , Xp)^t has a joint multivariate normal distribution with mean (vector) µx and covariance matrix Σx if X has density

(2π)^{−p/2} |Σx|^{−1/2} exp{−(1/2)(X − µx)^t Σx^{−1}(X − µx)}.

The random variable X is said to have a Lognormal(µx, σx²) distribution if log(X) is Normal(µx, σx²). In that case

E(X) = exp(µx + σx²/2), and    (A.2)
var(X) = exp(2µx){exp(2σx²) − exp(σx²)}.    (A.3)

A.3 Gamma and Inverse-Gamma Distributions

A random variable X has a Gamma(α, β) distribution if its probability density function is

{β^α/Γ(α)} x^{α−1} exp(−βx),  x > 0.

Here, Γ(·) is the gamma function, α > 0 is a shape parameter, and β^{−1} > 0 is a scale parameter. A word of caution is in order: some authors denote the scale parameter by β, in which case their β is the reciprocal of ours. The expectation of this distribution is α/β, while its variance is α/β². The chi-square distribution with n degrees of freedom is the Gamma(n/2, 1/2) distribution and arises as a sampling distribution in Gaussian models.

If X is Gamma(α, β), then the distribution of X^{−1} is called Inverse-Gamma(α, β), abbreviated as IG(α, β). Then α is the shape parameter of X and β is its scale (not inverse scale) parameter. The density of X^{−1} is

{β^α/Γ(α)} x^{−(α+1)} exp(−β/x),  x > 0.

The mode of the IG(α, β) density is β/(α + 1), and the expectation is β/(α − 1) for α > 1. For α ≤ 1, the expectation is infinite.

The inverse Gamma distribution is the conjugate prior for variance parameters in many Gaussian models. As a simple case, suppose that X1, . . . , Xn are iid Normal(0, σ²). Then the likelihood is

(2πσ²)^{−n/2} exp{−Σ_{i=1}^n Xi²/(2σ²)}.

Therefore, if the prior on σ² is IG(α, β), then the joint density of σ², X1, . . . , Xn is proportional to the IG(α + n/2, β + Σ_{i=1}^n Xi²/2) density. It follows that the posterior distribution of σ² is IG(α + n/2, β + Σ_{i=1}^n Xi²/2).

The parameters α and β in the prior have this interpretation: The prior is equivalent to 2α observations with sum of squares equal to 2β, which, if actually observed, would give a prior variance estimate of β/α. Thus, an Inverse-Gamma(α, β) prior can be viewed as a prior guess of β/α for the variance, based on 2α observations. A value of α that is small relative to n/2 is "noninformative." Also, the value of β has little influence relative to the data only when it is small relative to Σ_{i=1}^n Xi².

Because Σ_{i=1}^n Xi² can be arbitrarily small, there can be no choice of β that is "noninformative" in all situations. For example, β = 0.001, which seems "small," will completely dominate the likelihood if Σ_{i=1}^n Xi² = 0.0001. Since the Xi are observed in the present example, one should be aware when β is large relative to Σ_{i=1}^n Xi². However, in the case of a prior on the variance of unobservable random effects, more care is required. Otherwise, the prior might dominate the likelihood. At the very least, one should reconsider the choice of β unless it is smaller than the posterior mean of this variance.

A.4 Best and Best Linear Prediction and Regression

A.4.1 Linear Prediction

Let X and Y be any two random variables. If the value of Y is unknown but X is known and is correlated with Y, then we can estimate or "predict" Y using X. The best linear predictor of Y is γ0 + γx X, where γ0 and γx are chosen to minimize the mean square error

E{Y − (γ0 + γx X)}² = [E{Y − (γ0 + γx X)}]² + var(Y − γx X).    (A.4)

On the right-hand side of (A.4), the first term is squared bias and the second is variance. The variance does not depend on γ0, so the optimal value of γ0 is µy − γx µx, which eliminates the bias. The variance is

σy² + γx² σx² − 2γx σxy,

where σxy is the covariance between X and Y, and an easy calculus exercise shows that the variance is minimized by γx = σxy/σx². In summary, the best linear predictor of Y based on X is

Ŷ = µy + (σxy/σx²)(X − µx).    (A.5)

The prediction error is

Y − Ŷ = (Y − µy) − (σxy/σx²)(X − µx),    (A.6)

where σxy is the covariance between X and Y. It is an easy calculation to show that the prediction error is uncorrelated with X and that

var(Y − Ŷ) = σy²(1 − ρxy²).    (A.7)

There is an intuitive reason for this: if the error were correlated with X, then the error itself could be predicted and therefore we could improve the predictor of Y, but we know that this predictor cannot be improved since it was chosen to be the best possible.

As an illustration, consider the classical error model W = X + U, where X and U are uncorrelated and E(U) = 0. Then σxw = σx², λ = σxw/σw²
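The conjugate update for σ² described above is easy to verify numerically. The following Python sketch (ours, not part of the original text; it uses scipy's invgamma, whose scale parameter corresponds to β in the parameterization used here) computes the posterior mean and a credible interval under assumed values of α, β and simulated data.

```python
# Sketch: Inverse-Gamma posterior for sigma^2 with Normal(0, sigma^2) data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(0.0, 2.0, size=200)          # data with true sigma^2 = 4
alpha, beta = 2.0, 2.0                      # prior guess beta/alpha = 1, worth 2*alpha observations

alpha_post = alpha + len(x) / 2
beta_post = beta + np.sum(x**2) / 2
print("posterior mean of sigma^2:", beta_post / (alpha_post - 1))

draws = stats.invgamma(alpha_post, scale=beta_post).rvs(10_000, random_state=rng)
print("95% credible interval:", np.percentile(draws, [2.5, 97.5]))
```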
(the attenuation), and µw = µx, so the best linear predictor of X given W is

X̂ = µx + λ(W − µx) = (1 − λ)µx + λW    (A.8)

and

X = µx + λ(W − µx) + U*,    (A.9)

where the prediction error U* = X − X̂ is uncorrelated with W and has variance σx²(1 − λ), since λ = ρxw² because σxw = σx².

So far, we have assumed the ideal situation where µx, σxy, and σx² are known. In practice, they will generally be unknown and replaced by estimates. If we observe an iid sample, (Yi, Xi), then we use the sample means, variances, and covariances, and γ̂x = σ̂xy/σ̂x² and γ̂0 = Ȳ − γ̂x X̄ are the usual ordinary least-squares estimates with

γ̂x = Σ_{i=1}^n (Yi − Ȳ)(Xi − X̄) / Σ_{i=1}^n (Xi − X̄)².

If we have more than one X-variable, then the best linear predictor of Y given X = (X1, . . . , Xp) is

Ŷ = µy + Σyx Σx^{−1}(X − µx),    (A.10)

where Σyx = (σ_{Y X1}, . . . , σ_{Y Xp}) and Σx is the covariance matrix of X. If the means, variances, and covariances are replaced by their analogs from a sample, (Yi, Xi1, . . . , Xip), i = 1, . . . , n, then (A.10) becomes the ordinary least-squares estimator.

Equation (A.10) remains valid when Y is a vector. Then

Σyx = E[{Y − E(Y)}{X − E(X)}^t].    (A.11)

As an illustration of the vector Y case, we will generalize (A.8) to the case where

W = X + U

with W, X, and U all vectors. Then Σw = Σx + Σu and

X̂ = µx + Λ(W − µx) = (I − Λ)µx + ΛW,    (A.12)

where I is the identity matrix and

Λ = Σx(Σx + Σu)^{−1}.    (A.13)

Here, we are assuming that (Σx + Σu) is invertible, which will hold if Σx is invertible since Σu must be positive semidefinite. Note that Λ is a multivariate generalization of the attenuation λ.

If the distribution of X depends on Z, then (A.12) is replaced by

X̂ = µx + (Σx, Σxz) [ Σx + Σu, Σxz ; Σxz^t, Σz ]^{−1} {(W^t, Z^t)^t − (µx^t, µz^t)^t},    (A.14)

where the semicolon separates the rows of the 2 × 2 block matrix and Σxz = E[{X − µx}{Z − µz}^t]. When W is replicated, (A.14) generalizes to (4.4).

A.4.2 Best Linear Prediction without an Intercept

Prediction without an intercept is inferior to prediction with an intercept, except in the rare case where the intercept of the best linear predictor is zero. So you no doubt are wondering why we are bothering you with this topic. The reason is that this material is needed to understand the Fuller–Hwang estimator discussed in Section 4.5.3. You should read this section only if you are reading Section 4.5.3.

If X and Y are scalars and we predict Y by a linear function of X without an intercept, that is, with a predictor of form λX, then a simple calculation shows that the mean squared error (MSE) of prediction is minimized by λ = E(XY)/E(X²).

From a sample of {(Yi, Xi)}_{i=1}^n pairs, ŶNI can be estimated by linear regression without an intercept. Here the subscript "NI" reminds us that the prediction is done with "no intercept." The estimate of λ is

λ̂ = Σ_{i=1}^n Yi Xi / Σ_{i=1}^n Xi².

As an example, consider the multiplicative error model W = XU with true covariate X and surrogate W, where X and U are independent and E(U) = 1. We need to predict X using W, so now X plays the role played by Y in the previous three paragraphs and W plays the role of X. Then E(W²) = E(X²)E(U²) and, assuming that var(U) ≠ 0, one has that λ = 1/E(U²) < 1 since E(U²) > {E(U)}² = 1. The best linear predictor of X without an intercept is X̂NI = λW. The predictor X̂NI will have at least as large an MSE of prediction as the best linear predictor with an intercept. One problem with prediction without an intercept is that it is biased in the sense that E(X̂NI) = λE(X) ≠ E(X), assuming that var(U) ≠ 0. This is because X̂NI attenuates towards zero, whereas from (A.9) we see that X̂ shrinks W towards µx.

A.4.3 Nonlinear Prediction

If we do not constrain the predictor to be linear in X, then the predictor minimizing the MSE over all functions of X is the conditional mean of Y given X, that is, the best predictor of Y given X is Ŷ = E(Y|X).

There are some circumstances in which E(Y|X) is linear in X, in which case the best predictor and the best linear predictor coincide. For example, this happy situation occurs whenever (Y, X) is jointly normally distributed. Another case in which the best predictor is linear in X occurs
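As a small numerical check of the attenuation and the best linear predictor (A.8), the following Python sketch (ours, with arbitrarily chosen parameter values) simulates the classical error model W = X + U and compares the theoretical and empirical quantities.

```python
# Sketch: attenuation lambda and best linear predictor in the model W = X + U.
import numpy as np

rng = np.random.default_rng(3)
mu_x, sigma_x2, sigma_u2 = 2.0, 1.0, 0.5
x = rng.normal(mu_x, np.sqrt(sigma_x2), 100_000)
w = x + rng.normal(0.0, np.sqrt(sigma_u2), x.size)

lam = sigma_x2 / (sigma_x2 + sigma_u2)       # attenuation lambda = sigma_xw / sigma_w^2
x_hat = (1 - lam) * mu_x + lam * w           # best linear predictor (A.8)

print("lambda:", lam)
print("empirical slope of the regression of X on W:", np.polyfit(w, x, 1)[0])
print("MSE of the best linear predictor:", np.mean((x - x_hat) ** 2))
print("theoretical value sigma_x^2 (1 - lambda):", sigma_x2 * (1 - lam))
```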
when the linearity is a consequence of a modeling assumption, for example, if it is assumed that Y = β0 + βx^t X + ε, where ε is independent of X.

Finding the best predictor E(Y|X) requires knowledge of the joint distribution of (Y, X), not simply means, variances, and covariances. In practice, E(Y|X) is estimated by nonparametric regression. There are many techniques for nonparametric regression, for example, smoothing splines, local polynomial regression such as LOESS, and fixed-knot penalized splines (Ruppert et al., 2003). When X is multivariate, one typically uses dimension-reduction techniques such as additive models or sliced-inverse regression to avoid the so-called curse of dimensionality.

A.5 Likelihood Methods

A.5.1 Notation

Denote the unknown parameter by Θ. The vector of observations, including response, covariates, surrogates, etc., is denoted by (Ỹi, Zi) for i = 1, ..., n, where, as before, Zi is the vector of covariates that is observable without error and Ỹi collects all the other random variables into one vector. The data set (Ỹi, Zi), i = 1, ..., n, is the aggregation of all data sets, primary and external, including replication and validation data. Thus, the composition of Ỹi will depend on i, for example, whether the ith case is a validation case, a replication case, etc. We emphasize that Ỹi is different from the response Yi used throughout the book, and hence the use of tildes. The Ỹi are assumed independent, with the density of Ỹi depending both on Zi and on the type of data set the ith case came from and denoted by fi(ỹ|Θ). We assume that fi has two continuous derivatives with respect to Θ. The loglikelihood is

L(Θ) = Σ_{i=1}^n log{fi(Ỹi|Θ)}.

A.5.2 Maximum Likelihood Estimation

In practice, maximum likelihood is probably the most widely used method of estimation. It is reasonably easy to implement, efficient, and the basis of readily available inferential methods, such as standard errors by Fisher information and likelihood ratio tests. Also, many other common estimators are closely related to maximum likelihood estimators, for example, the least squares estimator, which is the maximum likelihood estimator under certain circumstances, and quasilikelihood estimators. In this section, we quickly review some of these topics.

The maximum likelihood estimator (MLE), denoted by Θ̂, maximizes L(Θ). Under some regularity conditions, for example in Serfling (1980), the MLE has a simple asymptotic distribution. The "likelihood score" or "score function" is si(y|Θ) = (∂/∂Θ) log{fi(y|Θ)}. The Fisher information matrix, or expected information, is

In(Θ) = −Σ_{i=1}^n E{(∂/∂Θ^t) si(Ỹi|Θ)}    (A.15)
      = Σ_{i=1}^n E{si(Ỹi|Θ) si^t(Ỹi|Θ)}.    (A.16)

In large samples, the MLE is approximately normally distributed with mean Θ and covariance matrix In^{−1}(Θ), whose entries converge to 0 as n → ∞. There are several methods of estimating In(Θ). The most obvious is In(Θ̂). Efron and Hinkley (1978) presented arguments in favor of using instead the observed Fisher information matrix, defined as

Î_n = −Σ_{i=1}^n (∂/∂Θ^t) si(Ỹi|Θ̂),    (A.17)

which is an empirical version of (A.15). The empirical version of (A.16) is

B̂_n = Σ_{i=1}^n si(Ỹi|Θ̂) si^t(Ỹi|Θ̂),

which is not used directly to estimate In, but is part of the so-called sandwich formula, Î_n^{−1} B̂_n Î_n^{−1}, used to estimate In^{−1}(Θ). As discussed in Section A.6, the sandwich formula has certain "robustness" properties but can be subject to high sampling variability.

A.5.3 Likelihood Ratio Tests

Suppose that the dimension of Θ is dim(Θ) = p, that ϕ is a known function of Θ such that dim{ϕ(Θ)} = p1 < p, and that we wish to test H0: ϕ(Θ) = 0 against the general alternative that ϕ(Θ) ≠ 0. We suppose that rank{(∂/∂Θ^t)ϕ(Θ)} = p1, so that the constraints imposed by the null hypothesis are linearly independent; otherwise p1 is not well defined, that is, we can add redundant constraints and increase p1 without changing H0, and the following result is invalid.

Let Θ̂0 maximize L(Θ) subject to ϕ(Θ) = 0, and define LR = {L(Θ̂) − L(Θ̂0)}, the log likelihood ratio. Under H0, 2 × LR converges in distribution to the chi-squared distribution with p1 degrees of freedom. Thus, an asymptotically valid test rejects the null hypothesis if LR exceeds χ²_{p1}(α)/2, where χ²_{p1}(α) is the (1 − α) quantile of the chi-squared distribution with p1 degrees of freedom.

A.5.4 Profile Likelihood and Likelihood Ratio Confidence Intervals

Profile likelihood is used to draw inferences about a single component of the parameter vector. Suppose that Θ = (θ1, Θ2), where θ1 is univariate. Let c be a hypothesized value of θ1. To test H0: θ1 = c using the theory
of Section A.5.3, we use ϕ(Θ) = θ1 − c and find Θ̂2(c) so that {c, Θ̂2(c)} maximizes L subject to H0. The function Lmax(θ1) = L{θ1, Θ̂2(θ1)} is called the profile likelihood function for θ1; it does not involve Θ2 since the log likelihood has been maximized over Θ2. Then, LR = L(Θ̂) − Lmax(c) where, as before, Θ̂ is the MLE. One rejects the null hypothesis if LR exceeds χ²_1(α)/2.

Inference for θ1 is typically based on the profile likelihood. In particular, the likelihood ratio confidence region for θ1 is the set

{θ1 : Lmax(θ1) > L(Θ̂) − χ²_1(α)/2}.

This region is also the set of all c such that we cannot reject the null hypothesis H0: θ1 = c. The confidence region is typically an interval, but there can be exceptions. An alternative large-sample interval is

θ̂1 ± Φ^{−1}(1 − α/2) se(θ̂1),    (A.18)

where se(θ̂1) is the standard error of θ̂1, say from the Fisher information matrix or from bootstrapping, as in Section A.9. For nonlinear models, the accuracy of (A.18) is questionable, that is, the true coverage probability is likely to be somewhat different than (1 − α), and the likelihood ratio interval is preferred.

A.5.5 Efficient Score Tests

The efficient score test, or simply the "score test," is due to Rao (1947). Under the null hypothesis, the efficient score test is asymptotically equivalent to the likelihood ratio test, for example, the difference between the two test statistics converges to 0 in probability. The advantage of the efficient score test is that the MLE needs to be computed only under the null hypothesis, not under the alternative, as for the likelihood ratio test. This can be very convenient when testing the null hypothesis of no effects for covariates measured with error, since these covariates, and hence measurement error, can be ignored when fitting under H0.

To define the score test, start by partitioning Θ as (Θ1^t, Θ2^t)^t, where dim(Θ1) = p1, 1 ≤ p1 ≤ p. We will test the null hypothesis that H0: Θ1 = 0. Many hypotheses can be put into this form, possibly after reparameterization. Let S(Θ) = Σ_{i=1}^n si(Ỹi|Θ) and partition S into S1 and S2 with dimensions p1 and (p − p1), respectively, that is, S(Θ) = {S1^t(Θ), S2^t(Θ)}^t with

S1(Θ) = (∂/∂Θ1) Σ_{i=1}^n log{fi(y|Θ)},
S2(Θ) = (∂/∂Θ2) Σ_{i=1}^n log{fi(y|Θ)}.

Note that, in general, S1(Θ) depends on both Θ1 and Θ2, and similarly for S2(Θ). Let Θ̂0 = (0^t, Θ̂_{0,2}^t)^t be the MLE of Θ under H0. Notice that S2(Θ̂0) = 0 since Θ̂_{0,2} maximizes the likelihood over Θ2 when Θ1 = 0.

The basic idea behind the efficient score test is that under H0 we expect S1(Θ̂0) to be close to 0, since the expectation of S(Θ) is 0 and Θ̂0 is consistent for Θ.

Let In^{11} be the upper left corner of (In)^{−1} evaluated at Θ̂0. The efficient score test statistic measures the departure of S1(Θ̂0) from 0 and is defined as

Rn = S1(Θ̂0)^t In^{11} S1(Θ̂0) = S(Θ̂0)^t In^{−1} S(Θ̂0).

The equality holds because S2(Θ̂0) = 0.

Under H0, Rn asymptotically has a chi-squared distribution with p1 degrees of freedom, so we reject H0 if Rn exceeds the (1 − α) chi-squared quantile, χ²_{p1}(α). See Cox and Hinkley (1974, Section 9.3) for a proof of the asymptotic distribution.

A.6 Unbiased Estimating Equations

All of the estimators described in this book, including the MLE, can be characterized as solutions to unbiased estimating equations. Understanding the relationship between estimators and estimating equations is useful because it permits easy and routine calculation of estimated standard errors. The theory of estimating equations arose from two distinct lines of research, in Godambe's (1960) study of efficiency and Huber's (1964, 1967) work on robust statistics. Huber's (1967) seminal paper used estimating equations to understand the behavior of the MLE under model misspecification, but his work also applies to estimators that are not the MLE under any model. Over time, estimating equations became established as a highly effective, unified approach for studying wide classes of estimators; see, for example, Carroll and Ruppert (1988), who use estimating equation theory to analyze a variety of transformation and weighting methods in regression.

This section reviews the basic ideas of estimating equations; see Huber (1967), Ruppert (1985), Carroll and Ruppert (1988), McLeish and Small (1988), Desmond (1989), or Godambe (1991) for more extensive discussion.

A.6.1 Introduction and Basic Large Sample Theory

As in Section A.5, the unknown parameter is Θ, and the vector of observations, including response, covariates, surrogates, etc., is denoted by (Ỹi, Zi) for i = 1, ..., n. For each i, let Ψi be a function of (Ỹi, Θ) taking values in p-dimensional space (p = dim(Θ)). Typically, Ψi depends on i through Zi and the type of data set the ith case belongs to, for example, whether that case is validation data, etc. An estimating equation for Θ
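The profile likelihood confidence region described above is easy to compute by brute force over a grid. The following Python sketch (ours, for an assumed logistic regression with an intercept as the single nuisance parameter) implements the region {θ1 : Lmax(θ1) > L(Θ̂) − χ²_1(α)/2}.

```python
# Sketch: profile-likelihood confidence interval for a logistic regression slope.
import numpy as np
from scipy.optimize import minimize, minimize_scalar
from scipy.stats import chi2

rng = np.random.default_rng(4)
n = 200
z = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.0 * z))))

def loglik(b0, b1):
    eta = b0 + b1 * z
    return np.sum(y * eta - np.log1p(np.exp(eta)))

def profile(b1):                                    # maximize over the nuisance intercept
    return -minimize_scalar(lambda b0: -loglik(b0, b1)).fun

mle = minimize(lambda b: -loglik(b[0], b[1]), np.zeros(2), method="BFGS")
lmax_at_mle, cut = -mle.fun, chi2.ppf(0.95, 1) / 2

grid = np.linspace(mle.x[1] - 2, mle.x[1] + 2, 401)
inside = np.array([profile(b1) for b1 in grid]) > lmax_at_mle - cut
print("95% profile likelihood interval for the slope:",
      grid[inside].min(), grid[inside].max())
```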
has the form

0 = n^{−1} Σ_{i=1}^n Ψi(Ỹi, Θ).    (A.19)

The solution, Θ̂, to (A.19) as Θ ranges across the set of possible parameter values is called an M-estimator of Θ, a term due to Huber (1964). In practice, one obtains an estimator by some principle, for example, maximum likelihood, least squares, generalized least squares, etc. Then, one shows that the estimator satisfies an equation of form (A.19), and Ψi is identified. The point is that one doesn't choose the Ψi's directly; rather, they are defined through the choice of an estimator.

In (A.19), the function Ψi is called an estimating function and depends on i through Zi. The estimating function (and hence the estimating equation) is said to be conditionally unbiased if it has mean zero when evaluated at the true parameter, that is,

0 = E{Ψi(Ỹi, Θ)},  for i = 1, ..., n.    (A.20)

As elsewhere in this book, expectations and covariances are always conditional upon {Zi}_1^n.

If the estimating equations are unbiased, then under certain regularity conditions Θ̂ is a consistent estimator of Θ. See Huber (1967) for the regularity conditions and proof in the iid case. The basic idea is that for each value of Θ the right-hand side of (A.19) converges to its expectation by the law of large numbers, and the true Θ is a zero of the expectation of (A.19). One of the regularity conditions is that the true Θ is the only zero, so that Θ̂ will converge to Θ under some additional conditions.

Moreover, if Θ̂ is consistent, then by a Taylor series approximation

0 ≈ n^{−1} Σ_{i=1}^n Ψi(Ỹi, Θ) + {n^{−1} Σ_{i=1}^n (∂/∂Θ^t) Ψi(Ỹi, Θ)} (Θ̂ − Θ),

where Θ now is the true parameter value. Applying the law of large numbers to the term in curly brackets, we have

Θ̂ − Θ ≈ −An(Θ)^{−1} n^{−1} Σ_{i=1}^n Ψi(Ỹi, Θ),    (A.21)

where An(Θ) is given by (A.23) below. Define An^{−t}(Θ) = {An^{−1}(Θ)}^t. Then Θ̂ is asymptotically normally distributed with mean Θ and covariance matrix n^{−1} An^{−1}(Θ) Bn(Θ) An^{−t}(Θ), where

Bn(Θ) = n^{−1} Σ_{i=1}^n cov{Ψi(Ỹi, Θ)};    (A.22)
An(Θ) = n^{−1} Σ_{i=1}^n E{(∂/∂Θ^t) Ψi(Ỹi, Θ)}.    (A.23)

See Huber (1967) for a proof. There are two ways to estimate this covariance matrix. The first uses empirical expectation and is often called the sandwich estimator or a robust covariance estimator (a term we do not like; see below); in the former terminology, Bn is sandwiched between the inverses of An. The sandwich estimator uses

Ân = n^{−1} Σ_{i=1}^n (∂/∂Θ^t) Ψi(Ỹi, Θ̂);    (A.24)
B̂n = n^{−1} Σ_{i=1}^n Ψi(Ỹi, Θ̂) Ψi^t(Ỹi, Θ̂).    (A.25)

Note that B̂n is a sample covariance matrix of {Ψi(Ỹi, Θ̂)}_{i=1}^n, since Θ̂ solves (A.19).

The second method, called the model-based expectation method, uses an underlying model to evaluate (A.22) and (A.23) exactly, and then substitutes the estimated value Θ̂ for Θ, that is, uses An^{−1}(Θ̂) Bn(Θ̂) An^{−t}(Θ̂).

If Ψi is the likelihood score, that is, Ψi = si, where si is defined in Section A.5.2, then Θ̂ is the MLE. In this case, both Bn(Θ) and An(Θ) equal the Fisher information matrix, In(Θ). However, Ân and B̂n are generally different, so the sandwich method differs from using the observed Fisher information.

As a general rule, the sandwich method provides a consistent estimate of the covariance matrix of Θ̂, without the need to make any distributional assumptions. In this sense it is robust. However, in comparison with the model-based expectation method, when a distributional model is reasonable the sandwich estimator is typically inefficient, which can unnecessarily inflate the length of confidence intervals (Kauermann and Carroll, 2001). This inefficiency is why we do not like to call the sandwich method "robust." Robustness usually means insensitivity to assumptions at the price of a small loss of efficiency, whereas the sandwich formula can lose a great deal of efficiency.

A.6.2 Sandwich Formula Example: Linear Regression without Measurement Error

As an example, consider ordinary multiple regression without measurement errors so that Yi = β0 + βz^t Zi + εi, where the ε's are independent, mean-zero random variables. Let Zi* = (1, Zi^t)^t and Θ = (β0, βz^t)^t. Then the ordinary least squares estimator is an M-estimator with Ψi(Yi, Θ) = (Yi − β0 − βz^t Zi) Zi*. Also,

(∂/∂Θ^t) Ψi(Yi, Θ) = −Zi*(Zi*)^t,
An = −n^{−1} Σ_{i=1}^n Zi*(Zi*)^t,    (A.26)

and if one assumes that the variance of εi is a constant σ² for all i, then

Bn = −σ² An.    (A.27)
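The linear regression example is easy to reproduce numerically. The following Python sketch (ours, with heteroscedastic errors chosen to highlight the contrast) computes both the model-based and the sandwich covariance estimates for ordinary least squares, following (A.24)-(A.28).

```python
# Sketch: model-based vs. sandwich covariance estimates for OLS.
import numpy as np

rng = np.random.default_rng(5)
n = 500
z = rng.normal(size=n)
eps = rng.normal(scale=1 + np.abs(z), size=n)          # heteroscedastic errors
y = 1.0 + 2.0 * z + eps

Zs = np.column_stack((np.ones(n), z))                  # Z*_i = (1, Z_i)^t
theta_hat = np.linalg.solve(Zs.T @ Zs, Zs.T @ y)       # OLS estimate
resid = y - Zs @ theta_hat

A_hat = -(Zs.T @ Zs) / n                               # empirical version of (A.26)
B_model = -np.mean(resid**2) * A_hat                   # model-based, assumes constant variance
B_sand = (Zs.T * resid**2) @ Zs / n                    # score covariance (A.28)

A_inv = np.linalg.inv(A_hat)
cov_model = A_inv @ B_model @ A_inv.T / n
cov_sand = A_inv @ B_sand @ A_inv.T / n
print("model-based SEs:", np.sqrt(np.diag(cov_model)))
print("sandwich SEs:   ", np.sqrt(np.diag(cov_sand)))  # remain valid under heteroscedasticity
```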
Notice that An and Bn do not depend on Θ, so they are known exactly except for the factor σ² in Bn. The model-based expectation method gives covariance matrix −n^{−1}σ² An^{−1}, the well-known variance of the least squares estimator. Generally, σ² is estimated by the residual mean square. The sandwich formula uses Ân = An and the score covariance,

B̂n = n^{−1} Σ_{i=1}^n (Yi − β̂0 − β̂z^t Zi)² Zi*(Zi*)^t.    (A.28)

We have not made distributional assumptions about εi, but we have assumed homoscedasticity, that is, that var(εi) ≡ σ². To illustrate the "robustness" of the sandwich formula, consider the heteroscedastic model where the variance of εi is σi², depending on Zi. Then Bn is no longer given by (A.27) but rather by

Bn = n^{−1} Σ_{i=1}^n σi² Zi*(Zi*)^t,

which is consistently estimated by (A.28). Thus, the sandwich formula is heteroscedasticity consistent. In contrast, the model-based estimator of Bn, which is −σ̂² An with An given by (A.26), is inconsistent for Bn. This makes model-based estimation of the covariance matrix of Θ̂ inconsistent.

The inefficiency of the sandwich estimator can also be seen in this example. Suppose that there is a high leverage point, that is, an observation with an outlying value of Zi. Then, as seen in (A.28), the value of B̂n is highly dependent upon the squared residual of this observation. This makes B̂n highly variable and indicates the additional problem that B̂n is very sensitive to outliers.

A.6.3 Sandwich Method and Likelihood-Type Inference

Less well known are likelihood ratio-type extensions of sandwich standard errors; see Huber (1967), Schrader and Hettmansperger (1980), Kent (1982), Ronchetti (1982), and Li and McCullagh (1994). This theory is essentially an extension of the theory of estimating equations, where the estimating equation is assumed to correspond to a criterion function, that is, solving the estimating equation optimizes the criterion function.

In the general theory, we consider inferences about a parameter vector Θ, and we assume that the estimate Θ̂ maximizes an estimating criterion, ℓ(Θ), which is effectively the working log likelihood, although it need not be the logarithm of an actual density function. Following Li and McCullagh (1994), we refer to exp(ℓ) = exp(Σ ℓi) as the quasilikelihood function. (Here, ℓi is the log quasilikelihood for the ith case and ℓ is the log quasilikelihood for the entire data set.) Define the score function, a type of estimating function, as

Ui(Θ) = (∂/∂Θ) ℓi(Θ|Ỹi),

and define

Jn = Σ_{i=1}^n E{Ui(Θ) Ui(Θ)^t},    (A.29)

and the negative expected hessian,

Hn = −Σ_{i=1}^n E{(∂/∂Θ^t) Ui(Θ)}.    (A.30)

If ℓ were the true log likelihood, then we would have Hn = Jn, but this equality usually fails for quasilikelihood. As in the theory of estimating equations, the parameter Θ is determined by the equation E{Ui(Θ)} = 0 for all i (conditionally unbiased), or possibly through the weaker constraint that Σ_{i=1}^n E{Ui(Θ)} = 0 (unbiased).

We partition Θ = (γ^t, η^t)^t, where γ is the p-dimensional parameter vector of interest and η is the vector of nuisance parameters. Partition H, omitting the subscript n for ease of notation, similarly as

H = ( Hγγ  Hγη ; Hηγ  Hηη ),

and define Hγγ·η = Hγγ − Hγη Hηη^{−1} Hηγ.

Let Θ̂0 = (γ0^t, η̂0^t)^t denote the maximum quasilikelihood estimate subject to γ = γ0. We need the large sample distribution of the log quasilikelihood ratio,

L(γ0) = 2{ℓ(Θ̂) − ℓ(Θ̂0)}.

The following result is well known under various regularity conditions. For the basic idea of the proof, see Kent (1982).

Theorem: If γ = γ0, then, as the number of independent observations increases, L(γ0) converges in distribution to Σ_{k=1}^p λk Wk, where W1, ..., Wp are independently distributed as χ²_1, and λ1, ..., λp are the eigenvalues of Hγγ·η (H^{−1} J H^{−1})γγ.

To use this result in practice, either to perform a quasilikelihood ratio test of H0: γ = γ0 or to compute a quasilikelihood confidence set for γ0, we need to estimate the matrices H and J. If all data are independent, an obvious approach is to replace the theoretical expectations in (A.29) and (A.30) with the analogous empirical averages.

We also need to compute quantiles of the distribution of Σ_k λ̂k Wk. Observe that if p = 1, the appropriate distribution is simply a scaled χ²_1 distribution. If p > 1, then algorithms given by Marazzi (1980) and Griffiths and Hill (1985) may be used. A quick and simple way to do
the computation is to simulate from the distribution of Σ_k λ̂k Wk, since chi-squared random variables are easy to generate.
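A minimal Python sketch of this simulation approach (ours; the eigenvalues used in the example call are hypothetical placeholders for values estimated from H and J) is:

```python
# Sketch: simulated quantile of the weighted chi-squared limit sum_k lambda_k W_k.
import numpy as np

def weighted_chi2_quantile(lambdas, alpha=0.05, nsim=200_000, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.chisquare(1, size=(nsim, len(lambdas)))     # independent chi^2_1 draws
    return np.quantile(w @ np.asarray(lambdas), 1 - alpha)

print(weighted_chi2_quantile([1.8, 0.6, 0.3]))          # hypothetical eigenvalues
```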
Suppose that that α b solves the estimating equation
A.6.4 Unbiased, but Conditionally Biased, Estimating Equations Pn e
0= i=1 φi (Yi , α), (A.32)
It is possible to relax (A.20) to
n o
Pn e i , Θ)
b , and Bb solves
0 = i=1 E Ψi (Y
Pn e
and then the estimating function and estimating equation are not con- 0= i=1 Ψi (Yi , B, α
b), (A.33)
ditionally unbiased, but are still said to be unbiased. The theory of con-
ditionally unbiased estimating equations carries over without change to with αb in (A.33) fixed at the solution to (A.32). The estimating func-
estimating equations that are merely unbiased. tions in (A.32) and (A.33) are assumed to be conditionally unbiased.
Since (b b solves (A.32) and (A.33) simultaneously, the asymptotic
α, B)
distribution of (b b can be found by stacking (A.32) and (A.33) into a
α, B)
A.6.5 Biased Estimating Equations
single estimating equation:
The estimation methods described in Chapters 4, 5, and 6 are approx- Ã !
µ ¶
imately consistent, in the sense that they consistently estimate a value 0 −1 Pn
e i , α)
φi (Y
that closely approximates the true parameter. These estimators are formed =n i=1 e i , B, α) . (A.34)
0 Ψi (Y
by estimating equations such as (A.19), but the estimating functions are
not unbiased for the true parameter Θ. Usually there exists Θ∗ , which
One then applies the usual theory to (A.34). Partition An = An (Θ),
is close to Θ and which solves
Pn n o Bn = Bn (Θ), and A−1 −t
n Bn An according to the dimensions of α and B.
0= E Ψi (Y e i , Θ∗ ) . (A.31) Then the asymptotic variance of Bb is n−1 times the lower right submatrix
i=1
of A−1 −t
n Bn An . After some algebra, one gets
In such cases, Θb is still asymptotically normally distributed, but with
n
mean Θ∗ instead of mean Θ. In fact, the theory of Section A.6.4 is b ≈ n−1 A−1 Bn,22 − An,21 A−1 Bn,12
var(B) n,22 n,11
applicable since the equations are unbiased for Θ∗ . If o
n o t t t
0 = E Ψi (Y e i , Θ∗ ) , for i = 1, ..., n, −Bn,12 A−t −1 −t −t
n,11 An,21 + An,21 An,11 Bn,11 An,11 An,21 An,22 ,

then the the estimating functions are conditionally unbiased for Θ∗ and where
the sandwich method yields asymptotically correct standard error esti- n ∂ o
mators. Pn e i , α) ,
An,11 = i=1 E φ i (Y
∂αt
Pn n ∂ o
An,21 = Ψi (Y e i , B, α) ,
A.6.6 Stacking Estimating Equations: Using Prior Estimates of Some i=1 E t
∂α
Parameters n ∂ o
Pn e i , B, α) ,
An,22 = i=1 E Ψi (Y
To estimate the regression parameter, B, in a measurement error model, ∂B t
one often uses the estimates of the measurement error parameters, α, Pn e i , α)φt (Y e i , α),
Bn,11 = i=1 φ i (Y i
obtained from another data set. How does uncertainty about the mea- Pn
Bn,12 = e t e
surement error parameters affect the accuracy of the estimated regres- i=1 φi (Yi , α)Ψi (Yi , α, B), and
Pn e t e
sion parameter? In this subsection, we develop the theory to answer this Bn,22 = i=1 Ψi (Yi , α, B)Ψi (Yi , α, B).
question. The fact that such complicated estimating schemes can be eas-
ily analyzed by the theory of estimating equations further illustrates the As usual, the components of An and Bn can be estimated by model-
power of this theory. based expectations or by the sandwich method.

372 373
A.7 Quasilikelihood and Variance Function Models (QVF)

A.7.1 General Ideas

In the case of no measurement error, Carroll and Ruppert (1988) described estimation based upon the mean and variance functions of the observed data, that is, the conditional mean and variance of Y as functions of (Z, X). We will call these QVF methods, for quasilikelihood and variance functions. The models include the important class of generalized linear models (McCullagh and Nelder, 1989; Section A.8 of this monograph), and in particular linear, logistic, Poisson, and gamma regression. QVF estimation is an important special case of estimating equations.

The typical regression model is a specification of the relationship between the mean of a response Y and the predictors (Z, X):

E(Y|Z, X) = mY(Z, X, B),    (A.35)

where mY(·) is the mean function and B is the regression parameter. Generally, specification of the model is incomplete without an accompanying model for the variances,

var(Y|Z, X) = σ² g²(Z, X, B, θ),    (A.36)

where g(·) is called the variance function and θ is called the variance function parameter. We find it convenient in (A.36) to separate the variance parameters into the scale factor σ² and θ, which determines the possible heteroscedasticity.

The combination of (A.35) and (A.36) includes many important special cases, among them:

• Homoscedastic linear and nonlinear regression, with g(z, x, B, θ) ≡ 1. For linear regression, mY(z, x, B) = β0 + βx^t x + βz^t z.

• Generalized linear models, including Poisson and gamma regression, with

g(z, x, B, θ) = mY^θ(z, x, B)

for some parameter θ. For example, θ = 1/2 for Poisson regression, while θ = 1 for gamma and lognormal models.

• Logistic regression, where mY(z, x, B) = H(β0 + βx^t x + βz^t z), H(v) = 1/{1 + exp(−v)}, and since Y is Bernoulli distributed, g² = mY(1 − mY), σ² = 1, and there is no parameter θ.

Model (A.35)-(A.36) includes examples from fields including epidemiology, econometrics, fisheries research, quality control, pharmacokinetics, assay development, etc. See Carroll and Ruppert (1988, Chapters 2-4) for more details.

A.7.2 Estimation and Inference for QVF Models

Specification of only the mean and variance models (A.35)-(A.36) allows one to construct estimates of the parameters (B, θ). No further detailed distributional assumptions are necessary. Given θ, B can be estimated by generalized (weighted) least squares (GLS), a term often now referred to as quasilikelihood estimation. The conditionally unbiased estimating function for estimating B by GLS is

[{Y − mY(Z, X, B)} / {σ² g²(Z, X, B, θ)}] mYB(Z, X, B),    (A.37)

where

mYB(Z, X, B) = (∂/∂B) mY(Z, X, B)

is the vector of partial derivatives of the mean function. The conditionally unbiased estimating equation for B is the sum of (A.37) over the observed data.

To understand why (A.37) is the GLS estimating function, note that the nonlinear least squares (LS) estimator, which minimizes

Σ_{i=1}^n {Y − mY(Z, X, B)}²,

solves

Σ_{i=1}^n {Y − mY(Z, X, B)} mYB(Z, X, B) = 0.    (A.38)

The LS estimator is inefficient and can be improved by weighting the summands in (A.38) by reciprocal variances; the result is (A.37).

There are many methods for estimating θ. These may be based on true replicates if they exist, or on functions of squared residuals. These methods are reviewed in Chapters 3 and 6 of Carroll and Ruppert (1988); see also Davidian and Carroll (1987) and Rudemo et al. (1989). Let (·) stand for the argument (Z, X, B). If we define

R(Y, ·, θ, σ) = {Y − mY(·)} / {σ g(·, θ)},    (A.39)

then one such (approximately) conditionally unbiased score function for θ (and σ) given B is

{R²(Y, ·, θ, σ) − (n − dim(B))/n} (∂/∂(σ, θ)^t) log{σ g(·, θ)},    (A.40)

where dim(B) is the number of components of the vector B. The (approximately) conditionally unbiased estimating equation for θ and σ is the sum of (A.40) over the observed data. The resulting M-estimator is closely related to the REML estimator used in variance components modeling; see Searle, Casella, and McCulloch (1992).
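The following Python sketch (ours, for an assumed loglinear mean mY(z, B) = exp(b0 + b1 z) with Poisson-type variance, θ = 1/2 and σ² = 1) shows one way to solve the GLS estimating equation (A.37) by iteratively reweighted least squares.

```python
# Sketch: GLS / quasilikelihood fit of a loglinear mean with variance g^2 = m_Y.
import numpy as np

rng = np.random.default_rng(6)
z = rng.uniform(0, 2, 300)
y = rng.poisson(np.exp(0.3 + 0.8 * z))

def mean_fn(b):
    return np.exp(b[0] + b[1] * z)

b = np.zeros(2)
for _ in range(25):                                   # Fisher-scoring style updates
    m = mean_fn(b)
    X = np.column_stack((m, m * z))                   # m_YB = dm/dB for this mean function
    w = 1.0 / m                                       # reciprocal variances, g^2 = m
    step = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * (y - m)))
    b = b + step
print("GLS / quasilikelihood estimate of (b0, b1):", b)
```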
As described by Carroll and Ruppert (1988), (A.37)–(A.40) are weighted least squares estimating equations, and nonlinear regression algorithms can be used to estimate the parameters.

There are two specific types of covariance estimates, depending on whether or not one believes that the variance model has been approximately correctly specified. We concentrate here on inference for the regression parameter B, referring the reader to Chapter 3 of Carroll and Ruppert (1988) for variance parameter inference. Based on a sample of size n, B̂ is generally asymptotically normally distributed with mean B and covariance matrix n^{-1} A_n^{-1} B_n A_n^{-1}, where if (·) stands for (Zi, Xi, B),

      A_n = n^{-1} Σ_{i=1}^n {mY_B(·)} {mY_B(·)}^t {σ² g²(·, θ)}^{-1};

      B_n = n^{-1} Σ_{i=1}^n {mY_B(·)} {mY_B(·)}^t E{Yi − mY(·)}² / {σ⁴ g⁴(·, θ)}.

The matrix B_n in this expression is the same as (A.22) in the general theory of unbiased estimating equations. The matrix A_n is the same as (A.23), but it is simplified somewhat by using the fact that E(Y|Z, X) = f(Z, X, B).

If the variance model is correct, then E{Yi − mY(Zi, Xi, B)}² = σ² g²(Zi, Xi, B, θ), A_n = B_n and an asymptotically correct covariance matrix is n^{-1} Â_n^{-1}, where (·) stands for (Zi, Xi, B̂) and

      Â_n = n^{-1} Σ_{i=1}^n {mY_B(·)} {mY_B(·)}^t {σ̂² g²(·, θ̂)}^{-1}.

If one has severe doubts about the variance model, one can use the sandwich method to estimate E{Yi − mY(·)}², leading to the covariance matrix estimate n^{-1} Â_n^{-1} B̂_n Â_n^{-1}, where

      B̂_n = n^{-1} Σ_{i=1}^n {mY_B(·)} {mY_B(·)}^t {Yi − mY(·)}² / {σ̂⁴ g⁴(·, θ̂)}.

In some situations, the method of Section A.6.3 can be used in place of the sandwich method.

With a flexible variance model that seems to fit the data fairly well, we prefer the covariance matrix estimate n^{-1} Â_n^{-1}, because it can be much less variable than the sandwich estimator. Drum and McCullagh (1993) basically come to the same conclusion, stating that “unless there is good reason to believe that the assumed variance function is substantially incorrect, the model-based estimator seems to be preferable in applied work.” Moreover, if the assumed variance function is clearly inadequate, most statisticians would find a better variance model and then use n^{-1} Â_n^{-1} with the better-fitting model.

In addition to formal fitting methods, simple graphical displays exist to evaluate the models (A.35)–(A.36). Ordinary and weighted residual plots with smoothing can be used to understand departures from the assumed mean function, while absolute residual plots can be used to detect deviations from the assumed variance function. These graphical techniques are discussed in Chapter 2, section 7 of Carroll and Ruppert (1988).

A.8 Generalized Linear Models

Exponential families have density or mass function

      f(y|ξ) = exp{ (yξ − C(ξ))/φ + c(y, φ) }.                           (A.41)

With superscripted (j) referring to the jth derivative, the mean and variance of Y are µ = C′(ξ) and φC′′(ξ), respectively. See, for example, McCullagh and Nelder (1989).

If ξ is a function of a linear combination of predictors, say ξ = Ξ(η) where η = (β0 + βx^t X + βz^t Z), then we have a generalized linear model. Generalized linear models include many of the common regression models, for example, normal, logistic, Poisson, and gamma. Consideration of specific models is discussed in detail in Chapter 7. Generalized linear models are mean and variance models in the observed data, and can be fit using QVF methods.

If we define L = (C′ ◦ Ξ)^{-1}, then L(µ) = η; L is called the link function since it links the mean of the response and the linear predictor, η. If Ξ is the identity function, then we say that the model is canonical; this implies that L = (C′)^{-1}, which is called the canonical link function. The link function L, or equivalently Ξ, should be chosen so that the model fits the data as well as possible. However, if the canonical link function fits reasonably well, then it is typically used, because doing so simplifies the analysis.

A.9 Bootstrap Methods

A.9.1 Introduction

The bootstrap is a widely used tool for analyzing the sampling variability of complex statistical methods. The basic idea is quite simple. One creates simulated data sets, called bootstrap data sets, whose distribution is equal to an estimate of the probability distribution of the actual data. Any statistical method that is applied to the actual data can also be applied to the bootstrap data sets. Thus, the empirical distribution of an estimator or test statistic across the bootstrap data sets can be used to estimate the actual sampling distribution of that statistic.
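To make the idea concrete, here is a minimal sketch (Python; the estimator function and the row-per-observation data matrix are hypothetical stand-ins) that re-applies an estimator to resampled data sets; the covariance formula (A.42) below turns such replicates into standard errors.

    # Sketch: generic nonparametric bootstrap replicates of an estimator.
    # 'estimator' maps a data matrix (one row per observation) to a parameter vector.
    import numpy as np

    def bootstrap_replicates(data, estimator, M=2000, seed=0):
        rng = np.random.default_rng(seed)
        n = data.shape[0]
        reps = []
        for _ in range(M):
            idx = rng.integers(0, n, size=n)   # draw n rows with replacement
            reps.append(estimator(data[idx]))  # apply the same estimator to the bootstrap data
        return np.asarray(reps)                # M x dim(Theta) array of Theta-hat^(m)

Resampling whole rows corresponds to the “resampling pairs” scheme discussed in Section A.9.2.1.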
For example, suppose that Θ̂ is obtained by applying some estimator to the actual data, and Θ̂^(m) is obtained by applying the same estimator to the mth bootstrap data set, m = 1, ..., M, where M is the number of bootstrap data sets that we generate, and let Θ̄ be the average of Θ̂^(1), ..., Θ̂^(M). Then, the covariance matrix of Θ̂ can be estimated by

      vâr(Θ̂) = (M − 1)^{-1} Σ_{m=1}^M (Θ̂^(m) − Θ̄)(Θ̂^(m) − Θ̄)^t.        (A.42)

Despite this underlying simplicity, implementation of the bootstrap can be a complex, albeit fascinating subject. There are many ways to estimate the probability distribution of the data, and it is not always obvious which way is most appropriate. Bootstrap standard errors are easily found from (A.42), and these can be plugged into (A.18) to get “normal theory” confidence intervals. However, these simple confidence intervals are not particularly accurate, and several improved bootstrap intervals have been developed. Comparing bootstrap standard errors and confidence intervals with traditional methods and comparing the various bootstrap intervals with each other requires the powerful methodology of Edgeworth expansions. Efron and Tibshirani (1993) give an excellent, comprehensive account of bootstrapping theory and applications. For more mathematical theory, including Edgeworth expansions, see Hall (1992). Here we give enough background so that the reader can understand how the bootstrap is applied to obtain standard errors in the examples.

A.9.2 Nonlinear Regression without Measurement Error

To illustrate the basic principles of bootstrapping, we start with nonlinear regression without measurement error. Suppose that Yi = mY(Zi, B) + ǫi, where the Zi are, as usual, covariates measured without error, and the ǫi's are independent with the density of ǫi possibly depending on Zi. There are at least three distinct methods for creating the bootstrap data sets. Efron and Tibshirani (1993) call the first two methods resampling pairs and resampling residuals. The third method is a form of the parametric bootstrap.

A.9.2.1 Resampling Pairs

Resampling pairs means forming a bootstrap data set by sampling at random with replacement from {(Yi, Zi)}_1^n. The advantage of this method is that it requires minimal assumptions. If ǫi has a distribution depending on Zi in the real data, then this dependence is captured by the resampling, since the (Yi, Zi) pairs are never broken during the resampling. Therefore, standard errors and confidence intervals produced by this type of bootstrapping will be asymptotically valid in the presence of heteroscedasticity or other forms of nonhomogeneity. Besides this type of robustness, another advantage of resampling pairs is that it is easy to extend to more complex situations, such as measurement error models.

The disadvantage of resampling pairs is that the bootstrap data sets will have different sets of Zi's than the original data. For example, if there is a high leverage point in the original data, it may appear several times or not at all in a given bootstrap data set. Therefore, this form of bootstrapping estimates unconditional sampling distributions, not sampling distributions conditional on the Zi's. Some statisticians will object to this, asking, “Even if the Zi's are random, why should I care that I might have gotten different Zi's than I did? I know the values of the Zi's that I got, and I want to condition upon them.” We feel that this objection is valid. However, as Efron and Tibshirani (1993) point out, often conditional and unconditional standard errors are nearly equal.

A.9.2.2 Resampling Residuals

The purpose behind resampling residuals is to condition upon the Zi's. The ith residual is ei = Yi − mY(Zi, B̂), where B̂ is, say, the nonlinear least squares estimate. To create the mth bootstrap data set we first center the residuals by subtracting their sample mean, ē, and then draw {ei^(m)}_{i=1}^n randomly, with replacement, from {(ei − ē)}_1^n. Then we let Yi^(m) = mY(Zi, B̂) + ei^(m). The mth bootstrap data set is {(Yi^(m), Zi)}_{i=1}^n. Notice that the bootstrap data sets have the same set of Zi's as the original data, so that bootstrap sampling distributions are conditional on the Zi's. By design, the distribution of the ith “error” in a bootstrap data set is independent of Zi. Therefore, resampling residuals is only appropriate when the ǫi's in the actual data are identically distributed, and is particularly sensitive to the homoscedasticity assumption.

A.9.2.3 The Parametric Bootstrap

The parametric bootstrap can be used when we assume a parametric model for the ǫi's. Let f be a known mean-zero density, say the standard normal density, φ. Assume that the density of ǫi is in the scale family f(·/σ)/σ, σ > 0, and let σ̂ be a consistent estimator of σ, say the residual root-mean square if f is equal to φ. Then, as when resampling residuals, the bootstrap data sets are {(Yi^(m), Zi)}_1^n, where Yi^(m) = mY(Zi, B̂) + ǫi^(m), but now the ǫi^(m)'s are, conditional on the observed data, iid from f(·/σ̂)/σ̂. Like resampling residuals, the parametric bootstrap estimates sampling distributions that are conditional on the Zi's and requires that the ǫi's be independent of the Zi's. In addition,
like other parametric statistical methods, the parametric bootstrap is more efficient when the parametric assumptions are met, but possibly biased otherwise.

A.9.3 Bootstrapping Heteroscedastic Regression Models

Consider the QVF model

      Yi = mY(Zi, B) + σ g(Zi, B, θ) ǫi,

where the ǫi's are iid. The assumption of iid errors holds when Yi given Zi is normal, but this assumption precludes logistic, Poisson, and gamma regression, for example. This model can be fit by the methods of section A.7.2. To estimate the sampling distribution of the QVF estimators, bootstrap data sets can be formed by resampling from the set of pairs {(Yi, Zi)}_1^n, as discussed for nonlinear regression models in Section A.9.2.

Resampling residuals requires some reasonably obvious changes from Section A.9.2. First, define the ith residual to be

      ei = {Yi − mY(Zi, B̂)} / {σ̂ g(Zi, B̂, θ̂)} − ē,

where ē is defined so that the ei's sum to 0. To form the mth bootstrap data set, let {ei^(m)}_{i=1}^n be sampled with replacement from the residuals and then let

      Yi^(m) = mY(Zi, B̂) + σ̂ g(Zi, B̂, θ̂) ei^(m).

Note that ei^(m) is not the residual from the ith of the original observations, but is equally likely to be any of the n residuals from the original observations. See Carroll and Ruppert (1991) for further discussion of bootstrapping heteroscedastic regression models, with application to prediction and tolerance intervals for the response.

A.9.4 Bootstrapping Logistic Regression Models

Consider the logistic regression model without measurement error,

      pr(Yi = 1|Zi) = H(β0 + βz^t Zi),

where, as elsewhere in this book, H(v) = {1 + exp(−v)}^{-1}. The general purpose technique of resampling pairs works here, of course. Resampling residuals is not applicable, since the residuals will have skewness depending on Zi and so are not homogeneous even after weighting as in Section A.9.3. The parametric bootstrap, however, is easy to implement. To form the mth data set, fix the Zi's equal to their values in the real data and let Yi^(m) be Bernoulli with

      pr(Yi^(m) = 1|Zi) = H(β̂0 + β̂z^t Zi).

A.9.5 Bootstrapping Measurement Error Models

In a measurement error problem, a typical data vector consists of Zi and a subset of the following data: the response Yi, the true covariates Xi, {wi,j : j = 1, ..., ki} which are replicate surrogates for Xi, and a second surrogate Ti. We divide the total collection of data into homogeneous data sets, which have the same variables measured on each observation and are from a common source, for example, primary data, internal replication data, external replication data, and internal validation data.

The method of “resampling pairs” ignores the various data subsets, and can often be successful (Efron, 1994). Taking into account the data subsets is better called “resampling vectors,” and consists of resampling, with replacement, independently from each of the homogeneous data sets. This ensures that each bootstrap data set has the same amount of validation data, data with two replicates of w, data with three replications, etc., as the actual data set. Although in principle we wish to condition on the Zi's and resampling vectors does not do this, resampling vectors is a useful expedient and allows us to bootstrap any collection of data sets with minimal assumptions. In the examples in this monograph, we have reported the “resampling pairs” bootstrap analyses, but because of the large sample sizes, the reported results do not differ substantially from the “resampling vectors” bootstrap.

Resampling residuals is applicable to validation data when there are two regression models: one for Yi given (Zi, Xi) and another for wi given (Zi, Xi). One fits both models and resamples residuals from the first to create the bootstrap Yi^(m)'s and from the second to create the wi^(m)'s. This method generates sampling distributions that are conditional on the observed (Zi, Xi)'s.

The parametric bootstrap can be used when the response, given the observed covariates, has a distribution in a known parametric family. For example, suppose one has a logistic regression model with internal validation data. One can fix the (Zi, Xi, wi) vectors of the validation data and create bootstrap responses as in Section A.9.4, using (Zi, Xi) in place of Zi. Because wi is a surrogate, it is not used to create the bootstrap responses of validation data. For the nonvalidation data, one fixes the (Zi, wi) vectors. Using regression calibration as described in Chapter 4, one fits an approximate logistic model for Yi given (Zi, wi) and again creates bootstrap responses distributed according to the fitted
model. The bootstrap sampling distributions generated in this way are conditional on all observed covariates.

A.9.6 Bootstrap Confidence Intervals

As in Section A.5.4, let Θ^t = (θ1, Θ2^t), where θ1 is univariate, and suppose that we want a confidence interval for θ1. The simplest bootstrap confidence interval is “normal based.” The bootstrap covariance matrix in (A.42) is used for a standard error

      se(θ̂1) = [{vâr(Θ̂)}_{11}]^{1/2}.

This standard error is then plugged into (A.18), giving

      θ̂1 ± Φ^{-1}(1 − α/2) se(θ̂1).                                        (A.43)

The so-called percentile methods replace the normal approximation in (A.43) by percentiles of the empirical distribution of {(θ̂1^(m) − θ̂1)}_1^M. The best of these percentile methods are the so-called BCa and ABC intervals, and they are generally more accurate than (A.43) in the sense of having a true coverage probability closer to the nominal (1 − α); see Efron and Tibshirani (1993) for a full description of these intervals.

Hall (1992) has stressed the advantages of bootstrapping an asymptotically pivotal quantity, that is, a quantity whose asymptotic distribution is independent of unknown parameters. The percentile-t methods use the “studentized” quantity

      t = (θ̂1 − θ1) / se(θ̂1),                                              (A.44)

which is an asymptotic pivot with a large-sample standard normal distribution for all values of θ. Let se^(m)(θ̂1) be the standard error of θ̂1 computed from the mth bootstrap data set and let

      t^(m) = (θ̂1^(m) − θ̂1) / se^(m)(θ̂1).

Typically, se^(m)(θ̂1) will come from an expression for the asymptotic variance matrix of Θ̂ (for example, the inverse of the observed Fisher information matrix given by (A.17)) rather than bootstrapping, since the latter would require two levels of bootstrapping: an outer level for {t^(m)}_1^M and, for each m, an inner level for calculating the denominator of t^(m). This would be very computationally expensive, especially for the nonlinear estimators in this monograph. Let t_{1−α} be the (1 − α) quantile of {|t^(m)|}_1^M. Then the symmetric percentile-t confidence interval is

      θ̂1 ± se(θ̂1) t_{1−α}.                                                 (A.45)

Note that se(θ̂1) is calculated from the original data in the same way that se^(m)(θ̂1) is calculated from the mth bootstrap data set.
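As an illustration, here is a minimal sketch (Python; theta_reps and se_reps are hypothetical arrays holding θ̂1^(m) and se^(m)(θ̂1) from the bootstrap data sets, while theta_hat and se_hat come from the original data) of the normal-based interval (A.43) and the symmetric percentile-t interval (A.45).

    # Sketch: normal-based interval (A.43) and symmetric percentile-t interval (A.45).
    import numpy as np
    from scipy.stats import norm

    def bootstrap_intervals(theta_hat, se_hat, theta_reps, se_reps, alpha=0.05):
        # (A.43): plug the bootstrap standard error into the normal-theory interval
        se_boot = np.std(theta_reps, ddof=1)
        z = norm.ppf(1 - alpha / 2)
        normal_ci = (theta_hat - z * se_boot, theta_hat + z * se_boot)
        # (A.45): studentize each replicate and take the (1 - alpha) quantile of |t^(m)|
        t_reps = (theta_reps - theta_hat) / se_reps
        t_crit = np.quantile(np.abs(t_reps), 1 - alpha)
        percentile_t_ci = (theta_hat - se_hat * t_crit, theta_hat + se_hat * t_crit)
        return normal_ci, percentile_t_ci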
APPENDIX B

TECHNICAL DETAILS

This Appendix is a collection of technical details that will interest some,


but certainly not all, readers.

B.1 Appendix to Chapter 1: Power in Berkson and Classical


Error Models
In Chapter 1, we emphasized why it is important not to mix up Berkson
and classical error models when calculating power. In this section, we
discuss what is common between these two error models with regard to
power.
In the normal linear model with (Y, W, X, U) jointly normal, if W is
a surrogate (nondifferential error model), then power is a function of the
measurement error model only via the squared correlation between W
and X, ρ²xw = σ²x/σ²w (classical) or ρ²xw = σ²w/σ²x (Berkson), and thus is
the same for classical and Berkson error models with equal correlations.
So whether the error model is Berkson or classical, the loss of power is
the same provided ρ2xw is the same in the two models. This is also true of
noncalibrated measurements, that is, ones with biases and multiplicative
constants (for example, W = α0 + αx X + U). So when comparing the
effects of measurement error on loss of power across different types of
measurement error models, the squared correlation is the most relevant
quantity on which to focus the discussion. Looking at variances is not
very meaningful because var(U) is not comparable across error models,
and it depends on whether or not we have calibrated measurements, but
correlations are comparable across error types. See Buzas, Tosteson, and
Stefanski (2004) for further discussion.
The explanation of the effects of measurement error on power in terms
of correlations has some useful implications for designing studies. The
most useful of these is that one should try to make an educated guess
(or estimate) of the squared correlation between W and X, as this is the
most relevant quantity.
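For instance, the short sketch below (Python; the variance inputs are hypothetical) computes the squared correlation implied by each error model, which is the quantity on which such comparisons should be based.

    # Sketch: squared correlation between W and X under the two error models.
    # Classical: W = X + U, so var(W) = var(X) + var(U) and rho2 = var(X) / var(W).
    # Berkson:   X = W + U, so var(X) = var(W) + var(U) and rho2 = var(W) / var(X).
    def rho2_classical(sigma2_x, sigma2_u):
        return sigma2_x / (sigma2_x + sigma2_u)

    def rho2_berkson(sigma2_w, sigma2_u):
        return sigma2_w / (sigma2_w + sigma2_u)

    # Example: sigma2_x = sigma2_w = 1.0 and sigma2_u = 0.5 give 2/3 in both cases,
    # so the implied loss of power is the same whenever the squared correlations match.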
This discussion also shows why assuming a Berkson model when a
classical model holds can lead to an inflated estimate of power. Suppose
we have an estimate of σ²u and estimate σ²w from the observed Wi. Define f(x) = (x − σ²u)/x = 1 − σ²u/x, x > 0, and note that f is strictly

increasing. If one assumes Berkson error, then ρ2xy is estimated by f (σw 2
+ using the definition of (B.4), we find that the coefficient of Z in (B.1) is
2 2 2
σu ). If classical errors are assumed, then ρxy is estimated by f (σw ). Since t
f is strictly increasing, for fixed estimates of σu2 and σw 2
, the estimate βz∗ = βzt + βx (1 − λ1 )Γtz . (B.5)
2
of ρxy is larger for the Berkson error model than for the classical error
model. The point here is that although the power would be the same for B.3 Appendix to Chapter 4: Regression Calibration
the two models if they had the same values of ρ2xy , for fixed values of σu2
2
and σw , ρ2xy is larger when Berkson errors are assumed. B.3.1 Standard Errors and Replication
As promised in Section 4.6, here we provide formulae for asymptotic
B.2 Appendix to Chapter 3: Linear Regression and standard errors for generalized linear models, wherein
Attenuation
E(Y|Z, X) = f (β0 + βxt X + βzt Z);
Here we establish (3.10) and (3.11) under the assumption of multivari- var(Y|Z, X) = σ 2 g 2 (β0 + βxt X + βzt Z).
ate normality. Taking expectations of both sides of (3.9) conditional on
(X, Z) leads to the identity Let f ′ (·) be the derivative of the function f (·), and let B = (β0 , βxt , βzt )t .
We will use here the best linear approximations of Section 4.4.2. Let n
E(Y | W, Z) = β0 + βx E(X | W, Z) + βzt Z. (B.1) be the size of the main data set, and N − n the size of any independent
data set giving information about the measurement error variance Σuu .
Under joint normality the regression of X on (W, Z) is linear. To facili-
Let ∆ = 1 mean that the main data set is used, and ∆ = 0 otherwise.
tate the derivation, we parameterize this as
Remember
Pn thatPthere arePkni replicates for the i individual and that
th

n 2
E(X | W, Z) = γ0 + γw {W − E(W | Z)} + γzt {Z − E(Z)} . (B.2) ν = i=1 ki − i=1 ki / i=1 ki .
Make the definitions α = (n − 1)/ν, Σ b wz = Σb xz , Σ b zw = Σ
bt , Σ b ww =
wz
Because of the orthogonalization in (B.2) it is immediate that b b
Σxx + αΣuu , rwi = (Wi· − µw ), rzi = (Zi − µz ), and
E (E [X {W − E(W | Z)} | Z]) PN PN PN
γw = ³ h i´ bw = i=1 ∆i ki Wi· / i=1 ∆i ki ; µ
µ bz = n−1 i=1 ∆i Zi ; (B.6)
2
E E {W − E(W | Z)} | Z  
0 0 0
E {E(XW | Z) − E(X | Z)E(W | Z)} Ψ1i∗ =  0 (nki /ν)rwi rwi t
(nki /ν)rwi rzi t ;
= 2 , (B.3) t t
σw|z 0 (nki /ν)rzi rwi {n/(n − 1)}rzi rzi
Ψ1i = Ψ1i∗ − Vi ;
2
where σw|z = var(W | Z).  
0 0 0
Now because U is independent of Z, E(W | Z) = E(X | Z), E(XW | Vi =  0 bi1 bi2  ;
Z) = E(X2 | Z), and the numerator in (B.3) is just σx|z 2
. Independence 0 bti2 bi3
2
of U and Z also implies that σw|z 2
= σx|z + σu2 . It follows that  
nki  X X X 
2
σx|z 2
σx|z bi1 = Σxx 1 − 2ki / ∆j kj + ∆j kj2 /( ∆j kj )2
ν  
γw = 2 = 2 , (B.4) j j j
σw|z σx|z + σu2 X
+Σuu (n/ν)(1 − ki / ∆j kj );
as claimed. j
Suppose now that E(X | Z) = Γ0 + Γtz Z. As noted previously, E(W | X
bi2 = Σxz (n/ν)(ki − ki2 / ∆j kj ); bi3 = Σzz .
Z) = E(X | Z), and thus E(W | Z) = Γ0 + Γtz Z also.
j
Again, because of the orthogonalization in (B.2), it is immediate that
γz = Γz . In what follows, except where explicitly noted, we assume that the
If we now replace E(W | Z) with Γ0 + Γtz Z in (B.2) and substitute data have been centered, so that µbw = 0 and µ bz = 0. This is accom-
the right-hand side of (B.2) into (B.1), and then collect coefficients of Z plished by subtracting the original values of the quantities (B.6) from

the W’s and Z’s, and has an effect only on the intercept. Reestimating reproduces the regression calibration estimates. Now make the following
the intercept after “uncentering” is described at the end of this section. series of definitions:
The analysis requires an estimate of Σuu . For this we only assume that n o2
sbi = b t B)/g(
f ′ (Q b Q b
b t B) ;
for some random variables Ψ2i and Ψ3i , if i i
    b1n = n−1 P N b b t bi ;
0 0 0 0 0 0 A i=1 ∆i Qi Qi s
© ª
Sb =  0 Σ b uu 0  ; S =  0 Σuu 0  , ri = Yi − f (Qi B) f ′ (Qti B)Qi /g 2 (Qti B);
t

0 0 0 0 0 0 N
X
then din1 = n−1 ∆j sj Qj Rjt cj Ψ1i {I − cj (D − αS)} B;
  j=1
0 0 0
N
Sb − S = 0 b uu − Σuu 0 
Σ X © ª
0 0 0 din2 = n−1 ∆j sj Qj Rjt cj Ψ2i (α − kj−1 )(D − αS)cj − αI B;
j=1
−1 PN
≈n i=1 {∆i Ψ2i + (1 − ∆i )Ψ3i } . (B.7) N
X © ª
For example, if the estimator comes from an independent data set of size din3 = n−1 ∆j sj Qj Rjt cj Ψ3i (α − kj−1 )(D − αS)cj − αI B;
N − n, then Ψ2i = 0 and j=1
  ein = ∆i (ri − din1 − din2 ) − (1 − ∆i )din3 .
0 0 0
Ψ3i =  0 ψ3i 0  , where Here and in what follows, si , Qi , ci , A1n , etc. are obtained by removing
0 0 0 the estimates in each of their terms. Similarly, rbi , dbin1 , dbin2 , ebin , etc. are
obtained by replacing population quantities by their estimates.
Pk i ¡ ¢¡ ¢t We are going to show that
j=1 Wij − Wi· Wij − Wi· − (ki − 1)Σuu
ψ3i = PN . PN
n−1 l=1 (1 − ∆l )(kl − 1) Bb − B ≈ A−1 n−11n ein ,
i=1 (B.8)

If the estimate of Σuu comes from internal data, then Ψ3i = 0 and and hence a consistent asymptotic covariance matrix estimate obtained
  by using the sandwich method is
0 0 0
n−1 Ab−1 A
b b−1
Ψ2i =  0 ψ2i 0  , where 1n 2n A1n , where (B.9)
½ ³ ´ ³ ´
b2n = n−1 PN ∆i rbi − dbin1 − dbin2 rbi − dbin1 − dbin2
0 0 0 t
A i=1
Pk i ¡ ¢¡ ¢t ¾
Wij − Wi· Wij − Wi· − (ki − 1)Σuu
ψ2i =
j=1
PN . +(1 − ∆i )dbin3 dbin3
t
. (B.10)
n−1 l=1 ∆l (kl − 1)
Now make the further definitions The information-type asymptotic covariance matrix uses

1 0 0
 b2n,i = A
A b1n − n−1 PN ∆i rbi rbt .
b2n + A (B.11)
i=1 i
Db = 0 Σ b ww Σ b wz  ;
It is worth noting that deriving (B.9) and (B.11) takes considerable
0 Σ b zw Σ b zz
effort, and that programming it is not trivial. The bootstrap avoids both
n o−1
ci = D
b b − (α − k −1 )Sb . steps, at the cost of extra computer time.
i
To verify (B.8), note by the definition of the quasilikelihood estimator
Let D and S be the limiting values of D b and S.b Let I be the identity and by a Taylor series, we have the expansion
t PN n o
matrix of the same dimension as B. Define Ri = (1, Wi· , Zti )t and Q bi = 0 = n−1/2 i=1 ∆i Yi − f (Q b ti B)
b f ′ (Q b ti B)
bQ b i /g 2 (Q
b ti B)
b
b −αS)b
(D b ci Ri . Using the fact that the data are centered, it is an easy but n ³ ´o
PN b t Bb − Qt B
crucial calculation to show that Q b i = (1, E(X
b t |Zi , Wi· ), Zt )t , that is, it ≈ n−1/2 i=1 ∆i ri − si Qi Q i i
i i

½ ³ ´t ¾
PN b estimate of it. If that experiment reports an asymptotic variance ξ/n b
≈ n−1/2 ∆
i=1 i ri − s Q
i i Q i − Q i B (B.12)
based on a sample of size N − n, then Ψ3i is a scalar and simplifications
³ ´
result which enable a valid asymptotic analysis. Define
−A1n n1/2 Bb − B . n o
PN b j Rt b −1 b b cj − αI B.b
However, by a standard linear expansion of matrices, dn4 = n−1 j=1 ∆j sbj Q j cj (α − kj )(D − αS)b
n o P
b i − Qi = b − αS)b
b ci − (D − αS)ci Ri Then, in (B.10) replace n−1 b bt t b
Q (D i (1 − ∆i )din3 din3 with dn4 dn4 nξ/(N − n).
n o
≈ b − D) − α(Sb − S) ci Ri
(D
n o B.3.2 Quadratic Regression: Details of the Expanded Calibration Model
−(D − αS)ci (D b − D) − (α − k −1 )(Sb − S) ci Ri
i Here we show that, as stated in Section 4.9.2, in quadratic regression,
b − D)ci Ri if X given W is symmetrically distributed and homoscedastic, the ex-
= {I − (D − αS)ci } (D
© ª panded© model (4.17)ªaccurately summarizes the variance function. Let
+ (α − ki−1 )(D − αS)ci − αI (Sb − S)ci Ri . κ = E (X − m)4 |W , which is constant because of the homoscedastic-
b − D) ≈ n−1/2 PN ∆i Ψ1i , and substituting this to-
However, n1/2 (D
ity.∗ Then, if r = X − m, the variance function is given by
i=1
gether with (B.7) means that var(Y|W) = σ 2 + βx,1
2 2
var(X|W) + βx,2 var(X2 |W)
½ ³ ´t ¾ © ª
−1/2 PN b +2βx,1 βx,2 cov (X, X2 )|W
n i=1 ∆i ri − si Qi Qi − Qi B © ª
= σ 2 + βx,1
2
σ 2 + βx,2
2
E X4 − (m2 + σ 2 )2 |W
PN £ © ª ¤
≈ n−1/2 i=1 ∆i ri +2βx,1 βx,2 E r r2 + 2mr − σ 2 |W
N
X
PN = σ 2 + βx,1
2
σ 2 + βx,2
2
(κ + 4m2 σ 2 − σ 4 ) + 4βx,1 βx,2 mσ 2
−n−1/2 t
i=1 ∆i si Qi Ri ci n
−1
∆j Ψ1j {I − ci (D − αS)} B
j=1 = σ∗2 + σ 2 (βx,1 + 2βx,2 m)2 ,
N
PN X where σ∗2 = σ 2 + βx,2
2
κ − σ 4 . The approximation (4.17) is of exactly the
−n−1/2 t
i=1 ∆i si Qi Ri ci n
−1
{∆j Ψ2j + (1 − ∆j )Ψ3j } same form. The only difference is that it replaces the correct σ∗2 with σ 2 ,
j=1
© ª but this replacement is unimportant since both are constant.
× (α − ki−1 )(D − αS)ci − αI B.
If we interchange the roles of i and j in the last expressions and inset B.3.3 Heuristics and Accuracy of the Approximations
into (B.12), we obtain (B.8).
While the standard error formulae have assumed centering, one can The essential step in regression calibration is the replacement of X with
still make inference about the original intercept that would have been E(X|W, Z) = m(Z, W, γ) in (4.10) and (4.11), leading to the model
obtained had one not centered. Letting the original means of the Zi ’s and (4.12)–(4.13). This model can be justified by a “small-σ” argument, that
Wi· ’s be µ bw,o , the original intercept is estimated by βb0 + βbxt µ
bz,o and µ bw,o is, by assuming that the measurement error is small. The basic idea is
bt that under small measurement error, X will be close to its expectation.
+βz µ bz,o . If one conditions on the observed values of µ bz,o and µ bw,o , then
However, even with small measurement error, X may not be close to W,
this revised intercept is the linear combination at Bb = (1, µ b
btz,o )B,
btz,o , µ so naively replacing X with W may lead to large bias, hence the need
and its variance is estimated by n−1 at A b A
−1 b b −1
1n 2n A1n a. for calibration. For simplicity, assume that X is univariate. Let X =
If Σuu is known, or if one is willing to ignore the variation in its esti- E(X|Z, W) + V, where E(V|Z, W) = 0 and var(V|Z, W) = σX|Z,W 2
.
mate Σ b uu , set din2 = din3 = 0. This may be relevant if Σ b uu comes from
Let m(·) = m(Z, W, γ). Let fx and fxx be the first and second partial
a large, careful independent study, for which only summary statistics are 2
derivatives of f (z, x, B) with respect to x. Assuming that σX|Z,W is small,
available (a common occurrence).
In other cases, W is a scalar variable, Σuu cannot be treated as known ∗ More precisely, in addition to a constant variance, we are also assuming a constant
and one must rely on an independent experiment that reports only an kurtosis, as is true if, for example, X given W is normally distributed.

and hence that V is small with high probability, we have the Taylor We now describe two methods of estimating the covariance matrix
approximation: of the asymptotic distribution of Θ b
¯ simex that avoid nested resampling.
n o We do so in the context of homoscedastic measurement error. The first
¯
E(Y|Z, W) = E E(Y|Z, W, X)¯Z, W method uses the pseudo estimates, Θ b b (ζ), generated during the SIMEX
n
simulation step in a procedure akin to Tukey’s jackknife variance es-
≈ E f (Z, m(·), B) + fx (Z, m(·), B) V
¯ o timate. Its applicability is limited to situations in which σu2 is known
¯ or estimated well enough to justify such an assumption. The second
+(1/2)fxx (Z, m(·), B) V2 ¯Z, W
method exploits the fact that Θ b
2 simex is asymptotically equivalent to
= f {Z, m(·), B} + (1/2)fxx {Z, m(·), B} σX|Z,W . an M-estimator and makes use of standard formulae from Appendix A.
2 This method requires additional programming but has the flexibility to
Model (4.12) results from dropping the term involving σX|Z,W , which
accommodate situations in which σu2 is estimated and the variation in
can be justified by the small-σ assumption. This term is retained in the
bu2 is not negligible.
σ
expanded regression calibration model developed in Section 4.7.
To derive (4.13), note that
n ¯ o B.4.1 Simulation Extrapolation Variance Estimation
¯
var(Y|Z, W) = var E(Y|Z, W, X)¯Z, W (B.13)
n ¯ o Stefanski and Cook (1995) establish a close relationship between SIMEX
¯
+E var(Y|Z, W, X)¯Z, W . inference and jackknife inference. In particular, they identified a method
of variance estimation applicable when σu2 is known that closely parallels
The first term on the right-hand side of (B.13) is
Tukey’s jackknife variance estimation. We now describe the implemen-
var{f (Z, X, B)|Z, W} ≈ var{fx (Z, m(·), B)V|Z, W} tation of their method of estimating var(Θ b
simex ).
= fx2 {Z, m(·), B} σX|Z,W
2
, It is convenient to introduce a function T to denote the estimator
under study. For example, T {(Yi , Zi , Xi )n1 } is the estimator of Θ when
which represents variability in Y due to measurement error and is set
X is observable, and T {(Yi , Zi , Wi )n1 } is the naive estimator.
equal to 0 in the regression calibration approximation, but is used in
For theoretical purposes we let
the expanded regression calibration approximation of Section 4.7. Let n o
p
sx and sxx be the first and second partial derivatives of g 2 (z, x, B, θ) b b (ζ) = T (Yi , Zi , Wi + ζUb,i )n1 ,
Θ
with respect to x. The second term on the right-hand side of (B.13) is
we redefine
E{σ 2 g 2 (Z, X, B, θ)|Z, W} ≈ σ 2 g 2 (Z, m(·), B, θ) n o
1 2
b
Θ(ζ) =E Θb b (ζ) | (Yi , Zi , Wi )n1 . (B.14)
+ sxx (Z, m(·), B, θ)σX|Z,W .
2
2 The expectation in (B.14) is with respect to the distribution of (Ub,i )ni=1
Setting the term involving σX|Z,W equal to 0 gives the regression calibra-
only, since we condition on the observed data. It can be obtained as the
tion approximation, while both terms are used in expanded regression b 1 (ζ) + · · · + Θ
b B (ζ)}/B. In effect, Θ(ζ)
b
limit as B → ∞ of the average {Θ
calibration.
is the estimator obtained when computing power is unlimited.
We now introduce a second function, Tvar to denote an associated
B.4 Appendix to Chapter 5: SIMEX variance estimator, that is,
The ease with which estimates can be obtained via SIMEX, even for Tvar {(Yi , Zi , Xi )n1 } = var(
c Θ b c {(Yi , Zi , Xi )n1 }],
true ) = var[T
very complicated and nonstandard models, is offset somewhat by the
where Θb
complexity of the resulting estimates, making the calculation of standard true denotes the “estimator” calculated from the “true” data
errors difficult or at least nonstandard. Except for the computational (Yi , Zi , Xi )n1 .
burden of nested resampling schemes, SIMEX is a natural candidate We allow T to be p-dimensional, in which case Tvar is (p × p)-matrix
for the use of the bootstrap or a standard implementation of Tukey’s valued, and variance refers to the variance–covariance matrix. For ex-
jackknife to calculate standard errors. ample, Tvar could be the inverse of the information matrix when Θb
true

is a maximum likelihood estimator. Alternatively, Tvar could be a sand- The variance matrix s2∆ (ζ) is an unbiased estimator of the conditional
wich estimator for either a maximum likelihood estimator or a general variance var{Θ b b (ζ) − Θ(ζ)
b | (Yi , Zi , Wi )n1 } for all B > 1 and converges
M-estimator (Appendix A). in probability to its conditional expectation as B → ∞. Since E{Θ b b (ζ)−
We use τ 2 to denote the parameter var(Θ b 2
true ), τbtrue to denote the b n 2
Θ(ζ) | (Yi , Zi , Wi )1 } = 0, it follows that unconditionally E{s∆ (ζ)} =
n 2
true variance estimator Tvar {(Yi , Zi , Xi )1 }, and τbnaive to denote the var{Θb b (ζ) − Θ(ζ)}.
b
naive variance estimator Tvar {(Yi , Zi , Wi )n1 }. Thus, the component of variance we want to estimate is given by
Stefanski and Cook (1995) show that b b 2
var(Θ simex − Θtrue ) = − lim E{s∆ (ζ)}.
b n b ζ→−1
E{Θ simex | (Yi , Zi , Xi )1 } ≈ Θtrue , (B.15)
This can be (approximately) estimated by fitting models to the compo-
where the approximation is due to both a large-sample approximation nents of s2∆ (ζ) as functions of ζ > 0 and extrapolating the component
and to use of an approximate extrapolant function. We will make use models back to ζ = −1. We use sb2∆ (−1) to denote the estimated variance
of such approximations without further explanation; see Stefanski and matrix obtained by this procedure.
Cook (1995) for additional explanation. In light of (B.16), the definition of τ 2 , and (B.19) the difference,
It follows from Equation (B.15) that b
τbsimex − sb2∆ (−1), is an estimator of var{Θ
2
simex }. In practice, sepa-
b
var(Θ b b b rate extrapolant functions are not fit to the components of both τb2 (ζ)
simex ) ≈ var(Θtrue ) + var(Θsimex − Θtrue ). (B.16)
and s2∆ (ζ); rather, the components of the difference, τb2 (ζ) − s2∆ (ζ), are
Equation (B.16) decomposes the variance of Θ b
simex into a component modeled and extrapolated to ζ = −1.
due to sampling variability, var(Θ b ) = τ 2
, and a component due to In summary, for SIMEX estimation with known σu2 , the simulation
true
measurement error variability, var(Θ b − Θb b
simex true ). step results in Θ(ζ), τb2 (ζ) and s2∆ (ζ) for ζ ∈ Λ. The model extrapolation
SIMEX estimation can be used to estimate the first component τ 2 . b b
of Θ(ζ) to ζ = −1, Θsimex , provides an estimator of Θ, and the model
That is, extrapolation of (the components of) the difference, τb2 (ζ) − s2∆ (ζ) to
τbb2 (ζ) = Tvar [{Yi , Zi , Wb,i (ζ)}n1 ] ζ = −1 provides an estimator of var(Θ b
simex ). It should be emphasized
is calculated for b = 1, . . . , B, and upon averaging and letting B → ∞, that the entire procedure is approximate in the sense that it is generally
results in τb2 (ζ). The components of τb2 (ζ) are then plotted as functions valid only in large samples with small measurement error.
of ζ, extrapolant models are fit to the components of {b τ 2 (ζm ), ζm }M
1 There is no guarantee that the estimated covariance matrix so ob-
and the modeled values at ζ = −1 are estimates of the corresponding tained is positive definite. This is similar to the nonpositivity problems
components of τ 2 . that arise in estimating components-of-variance. We have not encoun-
The basic building blocks required to estimate the second component tered problems of this nature, although there is no guarantee that they
of the variance, var(Θ b b
simex − Θtrue ), are the differences will not occur. If it transpires that the estimated variance of a linear
b is negative, a possible course of action is to plot,
combination, say γ t Θ,
b b (ζ) − Θ(ζ),
∆b (ζ) = Θ b b = 1, . . . , B. (B.17) model, and extrapolate directly the points [γ t {b τ 2 (ζm )−s2∆ (ζm }γ, ζm ]M
1 ,
Define though there is also no guarantee that its extrapolation will be nonneg-
PB ative.
s2∆ (ζ) = (B − 1)−1 t
b=1 ∆b (ζ)∆b (ζ), (B.18)
b b (ζ)}B . Its significance stems
that is, the sample variance matrix of {Θ B.4.2 Estimating Equation Approach to Variance Estimation
b=1
from the fact that
This section is based on the results in Carroll, Küchenhoff, Lombard,
b
var(Θ b b b
simex − Θtrue ) = − lim var{Θb (ζ) − Θ(ζ)};
ζ→−1
(B.19) and Stefanski (1996). Assuming iid sampling, these authors showed that
the estimate of Θ is asymptotically normally distributed and proposed
see Stefanski and Cook (1995). The minus sign on the right-hand side an estimator of its asymptotic covariance matrix. We highlight the main
b b (ζ) − Θ(ζ)}
of (B.19) is not an misprint; var{Θ b is positive for ζ > 0 and points of the asymptotic analysis in order to motivate the proposed vari-
zero for ζ = 0, so the extrapolant is negative for ζ < 0. ance estimator.

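Before turning to the estimating equation formulation, a minimal sketch of the SIMEX simulation and extrapolation steps for the point estimator may help fix ideas (Python; the function naive_estimator, the grid of ζ values, and the quadratic extrapolant are illustrative assumptions rather than prescriptions).

    # Sketch: SIMEX point estimation with known measurement error variance sigma2_u.
    # 'naive_estimator' maps (y, z, w) to a parameter vector; extrapolation is quadratic in zeta.
    import numpy as np

    def simex(y, z, w, sigma2_u, naive_estimator, zetas=(0.5, 1.0, 1.5, 2.0), B=100, seed=0):
        rng = np.random.default_rng(seed)
        n = len(w)
        grid = [0.0]
        means = [np.asarray(naive_estimator(y, z, w))]            # zeta = 0: the naive estimate
        for zeta in zetas:
            reps = []
            for _ in range(B):
                u = rng.normal(0.0, np.sqrt(zeta * sigma2_u), size=n)  # add extra error
                reps.append(naive_estimator(y, z, w + u))
            grid.append(zeta)
            means.append(np.mean(reps, axis=0))                   # Theta-hat(zeta), averaged over b
        grid, means = np.asarray(grid), np.asarray(means)
        design = np.vander(grid, 3, increasing=True)              # columns 1, zeta, zeta^2
        coef, *_ = np.linalg.lstsq(design, means, rcond=None)     # componentwise quadratic fits
        return np.array([1.0, -1.0, 1.0]) @ coef                  # extrapolate to zeta = -1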
We describe the application of SIMEX in the setting of M-estimation, Θ∗ (Λ). Define
that is, using unbiased estimating equations (Appendix A), assuming
ΨB,i(1) {σu2 , Λ, Θ∗ (Λ)} = vec[χB,i {σu2 , ζ, Θ(ζ)}, ζ ∈ Λ]
that in the absence of measurement errors, M-estimation produces con-
sistent estimators. A11 {σu2 , Λ, Θ∗ (Λ)} = diag[A{σu2 , ζ, Θ(ζ)}, ζ ∈ Λ].
The estimator obtained in the absence of measurement error is denoted Then, using (B.22), the joint limit distribution of n1/2 {Θ b ∗ (Λ) − Θ∗ (Λ)}
b
Θ true and solves the system of equations is seen to be multivariate normally distributed with mean zero and co-
Pn b variance Σ, where
0 = n−1 i=1 Ψ(Yi , Zi , Xi , Θ true ). (B.20)
© 2 ª © −1 ªt
This is just a version of (A.19) and is hence applicable to variance func- Σ = A−1 11 (·)C11 σu , Λ, Θ∗ (Λ) A11 (·) (B.23)
© 2 ª £ © 2 ª¤
tion and generalized linear models. In multiple linear regression, Ψ(·) C11 σu , Λ, Θ∗ (Λ) = Cov ΨB,1(1) σu , Λ, Θ∗ (Λ) . (B.24)
represents the normal equations for a single observation, namely,
Define G ∗ (Λ, Γ∗ ) = vec [{G(ζm , Γj )}m=1,...,M, j=1,...,p ], where Γ∗ =
Ψ(Y, Z, X, Θ) = (Y − β0 − βzt Z t t t
− βx X)(1, Z , X ) . (Γt1 , . . . , Γtp )t and Γj is the parameter vector estimated in the extrap-
olation step for the j th component of Θ(ζ),b j = 1, . . . , p.
In multiple logistic regression, with H(·) being the logistic distribution ∗ b ∗ ∗
Define R(Γ ) = Θ∗ (Λ) − G (Λ, Γ ). The extrapolation steps results in
function,
b ∗ , obtained by minimizing Rt (Γ∗ )R(Γ∗ ). The estimating equation for
Γ
© ¡ ¢ª
Ψ(Y, Z, X, Θ) = Y − H β0 + βzt Z + βx X (1, Zt , Xt )t . b ∗ has the form 0 = s(Γ∗ )R(Γ∗ ) where st (Γ∗ ) = {∂/∂(Γ∗ )t }R(Γ∗ ). With
Γ
The solution to (B.20) cannot be calculated, since it depends on the D(Γ∗ ) = s(Γ∗ )st (Γ∗ ), standard asymptotic results show that
unobserved true predictors. The estimator obtained by ignoring mea- b ∗ − Γ∗ ) ≈ N{0, Σ(Γ∗ )},
n−1/2 (Γ
b
surement error is denoted by Θ naive and solves the system of equations where Σ(Γ∗ ) = D−1 (Γ∗ )s(Γ∗ )Σst (Γ∗ )D−1 (Γ∗ ) and Σ is given by (B.23).
P n b b b∗ √
0 = n−1 i=1 Ψ(Yi , Zi , Wi , Θ naive ). Now Θ ∗
simex = G (−1, Γ ) and thus by the ∆ method, the n -normalized
SIMEX estimator is asymptotically normal with asymptotic variance,
For fixed b and ζ and large n, a standard linearization (Appendix A)
shows that GΓ∗∗ (−1, Γ∗ )Σ(Γ∗ ){GΓ∗∗ (−1, Γ∗ )}t
n o
n1/2 Θb b (ζ) − Θ(ζ) ≈ −A−1 {σu2 , ζ, Θ(ζ)} where GΓ∗∗ (ζ, Γ∗ ) = {∂/∂(Γ∗ )t }G ∗ (ζ, Γ∗ ).
Pn Note that the matrix C11 (·) in (B.24) is consistently estimated by
× n−1/2 i=1 Ψ{Yi , Zi , Wb,i (ζ), Θ(ζ)}, (B.21) Cb11 (·), the sample covariance matrix of [ΨB,i(1) {σu2 , Λ, Θb ∗ (Λ)}]n . Also,
1

where A{σu2 , ζ, Θ(ζ)} = E [ΨΘ {Y, Z, Wb,i (ζ), Θ(ζ)}], and A11 (·) is consistently estimated by Ab11 (·) = diag{Abm (·)} for m = 1, . . . , M ,
where
ΨΘ {Y, Z, Wb,i (ζ), Θ} = (∂/∂Θt )Ψ{Y, Z, Wb,i (ζ), Θ}. Pn PB
Abm (·) = (nB)−1 i=1
b m )}.
ΨΘ {Yi , Zi , Wb,i (ζm ), Θ(ζ
b=1
Averaging (B.21) over b results in the asymptotic approximation
The indicated variance estimator is
n o
b
n1/2 Θ(ζ) − Θ(ζ) ≈ −A−1 (·) b ∗ )Σ(
n−1 G ∗∗ (−1, Γ b Γb ∗ ){G ∗∗ (−1, Γ
b ∗ )}t , (B.25)
Γ Γ
Pn b Γb ∗ ) = s(Γ
b ∗ )st (Γ
b ∗ ) and
× n−1/2 i=1 χB,i {σu2 , ζ, Θ(ζ)}, (B.22) where D(
P B b Γb∗ ) b −1 (Γ
b ∗ )s(Γ
b ∗ )Σs
b t (Γ
b ∗ )D
b −1 (Γ
b ∗ );
where χB,i {σu2 , ζ, Θ(ζ)} = B −1 b=1 Ψ{Yi , Zi , Wb,i (ζ), Θ(ζ)}, and A−1 (·) Σ( = D
n ot
= A−1 {σu2 , ζ, Θ(ζ)}. The summands χB,i (·) in (B.22) are independent b
Σ = Ab−1 b−1 b−1
11 (·)C11 (·) A11 (·) .
and identically distributed with mean zero.
Let Λ = {ζ1 , . . . , ζM } denote the grid of values used in the extrapola- When σu2 is estimated, the estimating equation approach is modified
tion step. Let Θb ∗ (Λ) denote {Θ b t (ζ1 ), . . . , Θ
b t (ζM )}t , which we also denote by the inclusion of additional estimating equations employed in the es-
b 2
vec{Θ(ζ), ζ ∈ Λ}. The corresponding vector of estimands is denoted by timation of σ
cu . We illustrate the case in which each Wi is the mean of

two replicate measurements, Wij , j = 1, 2 where B.5 Appendix to Chapter 6: Instrumental Variables
Wij = Xi + Ui,j , j = 1, 2, i = 1, . . . , n. B.5.1 Derivation of the Estimators
With replicates, Wi is replaced by Wi∗ = Wi,. and σu2
by = 2
σu,∗ σu2 /2. In this section, we derive the estimators presented in Section 6.3. We
Let ½ ¾ start with the following assumptions:
2 (Di − µ)2 − σu,∗
2
Ψ(i)2 (σu,∗ , µ) = , t t
Di − µ E(X | Z, T, W) = βX|1 ZT W + βX|1Z T W Z +
P t t
where Di = (Wi1 − Wi2 )/2. Then solving 2
Ψi(2) (σu,∗ , µ) = 0, results βX|1ZT W T + βX|1ZT W W; (B.27)
in the estimators µ b = D and σ bu,∗ = (n − 1)sd /n, where s2d is the sample
2 2 E(X − W | Z, X, T) = 0; (B.28)
© ª
variance of (Di )n1 and consistently estimates σu,∗ 2
. E(Y | Z, T, W) = E E(Y | Z, X) | Z, T, W . (B.29)
By combining ΨB,i(1) and Ψi(2) into a single estimating equation and
applying standard theory, the covariance matrix of the joint distribution We have discussed each of these previously. Note that (B.27) and (B.28)
b ∗ (Λ), σ
of Θ 2
bu,∗ , and µb is imply that E(X | Z, T) = E(W | Z, T) and also that βW |1ZT = βX|1ZT ,
½ ¾−1 ½ ¾½ ¾−t βW |1Z T = βX|1Z T , and βW |1ZT = βX|1ZT .
−1 A11 (·) A12 (·) C11 (·) C12 (·) A11 (·) A12 (·)
n t (B.26)
,
0 A22 (·) C12 (·) C22 (·) 0 A22 (·) B.5.1.1 First Regression Calibration Instrumental Variable Algorithm
where The first algorithms are simple to describe once (6.17) is justified, which
½ ¾ · © 2 ª¸
C11 (·) C12 (·) ΨB,1(1) σu,∗ , Λ, Θ∗ (Λ) we do now. Making use of the fact that T is a surrogate, (B.29) and the
t = C∗ (·) = cov ¡ 2 ¢ , standard regression calibration approximation results in the approximate
C12 (·) C22 (·) Ψ1(2) σu,∗ , µ
model
2
A12 {σu,∗ , Λ, Θ∗ (Λ)} e
E(Y|T) e | T)}
= f {βYt |Xe E(X e = f (β t β t T),e (B.30)
e Te
e X|
Y |X
P n ∂ © 2 ª
= n−1 i=1 E[ Ψ σu,∗ , Λ, Θ∗ (Λ) ], e
var(Y|T) e | T),
= σ 2 g 2 {βYt |Xe E(X e θ} (B.31)
2 , µ) B,i(1)
∂(σu,∗
2 2 t
(βYt |Xe βX| e
and = σ g e Te T, θ).
½ ¾
¡ 2 ¢ Pn ∂ ¡ 2 ¢ e in the generalized
A22 σu,∗ , µ = n−1 i=1 E Ψ σu,∗ , µ It follows from (B.30)–(B.31) that the coefficient of T
2 , µ) i(2)
∂(σu,∗ e t t t
f |Te =
½ ¾ ½ ¾ e Te . By (B.28) βW
linear regression of Y on T is βY |Te = βY |Xe βX|
Pn 1 2(Di − µ) 1 0
= −n−1 i=1 E =− . e Te , and (6.17) follows.
βX|
0 1 0 1
Estimating these quantities via the sandwich method is ©straightforward. B.5.1.2 Second Regression Calibration Instrumental Variable Algorithm
2
ª
For A12 (·) remove the expectation symbol and replace σu,∗ , Θ∗ (Λ), µ The derivation of the second algorithm is somewhat involved, but the
n o
by the estimates σ 2
bu,∗ b ∗ (Λ), µ
,Θ b . The covariance matrix C∗ (·) can be estimator is relatively easy to compute. Remember that the strategy is
estimated by the sample covariance matrix of the vectors to exploit the fact that both W and T are surrogates.
" n o# Making use of the fact that both T and W are surrogates, applica-
ΨB,i(1) σ 2
bu,∗ b ∗ (Λ)
, Λ, Θ tion of the standard regression calibration approximation produces the
¡ 2 ¢ .
Ψi(2) σbu,∗ , µ
b approximate model
These estimates are substituted into (B.26), thereby obtaining an esti- e W)
E(Y|T, e | T,
f = f {β t E(X e W)},
f (B.32)
e
Y |X
b ∗ (Λ), σ
mate of the joint covariance matrix of Θ 2
bu,∗ , and µ
b. The submatrix e W)
var(Y|T, f = σ g 2 2 e
{βYt |Xe E(X e W),
| T, f θ}. (B.33)
corresponding to the components of Θ b ∗ (Λ) is now employed in (B.25) in
b
place of Σ. Under the linear regression assumption (B.27), there exist coefficient

t t
matrices βX| f and βX|
e Te W f , such that
e Te W f |Te , is shown to be equivalent to
βW

e | T,
e W)
f = βt e t f f |Te H2 βY |Te W = βW
H1 βY |Te W + βW e.
f |Te βY |X
E(X f T + βX|
e Te W
X| f W.
e Te W (B.34)
Let βbY |Te W be the estimated regression parameter from the gener-
Taking conditional expectations of both sides of (B.34) with respect to
e and using the fact that E(X
T e | T)
e = βt T e results in the identity alized linear regression of Y on (1, Z, T, W), and let βbW f |Te be as be-
e Te
X|
fore. Under the identifiability assumption that for a given matrix M2 ,
t e t e t t e (βbf
t bf e ) is asymptotically nonsingular, it follows that the esti-
e Te T = βX|
βX| f T + βX|
e Te W f βW
e Te W f |Te T. W |Te M2 βW |T
mator (6.19), namely,
e and using the fact that β f e = β e e , we find
Equating coefficients of T W |T X|T IV 2,(M ) −(M ) b
that βbY |1X 2 = βbf e 2 (H1 βbY |Te W + βbW
f |Te H2 βY |Te W ),
W |T
t
βX|
e Te = t
βX| f
e Te W + t
βX|
e Te W
t
e Te .
f βX| (B.35) is approximately consistent for βY |Xe .
IV 2,(M )
t
When T and W are the same dimension, βbY |1X 2 does not depend
Solving (B.35) for βX| f and then substitution into (B.34) shows that
e Te W c2 that minimizes the
on M2 . In Section B.5.2.1, we derive an estimate M
IV 2,(M )
e | T,
E(X e W)
f = (I − β t
e Te W
t e t
e Te T + βX|
f )βX|
f
f W.
e Te W (B.36) asymptotic variance of βb Y |1X
2
for the case dim(T) > dim(W).
X|

By convention, βYt |Te W et ft t


f is the regression coefficient of (T , W ) in the B.5.2 Asymptotic Distribution Approximations
e and W.
generalized linear regression of Y on T f The indicated model is
We first derive the asymptotic distributions assuming that M1 and M2
overparameterized ,and thus the components of βYt |Te W
f are not uniquely are fixed and that M-estimation is used in the generalized linear and
determined. Although other specifications are possible, we define the linear regression modeling steps. We then show how to estimate M1 and
f uniquely as
components of βY |Te W M2 for efficient asymptotic inference.
Let ψ denote the score function for the generalized linear model under
βY |Te W
f = βY |1ZT W ,
consideration (6.17)–(6.17). This score function has as many as three
βY |Te W
f = (01×d , βYt |1ZT W )t , components, the first corresponding to the unknown regression param-
eter, the second and third to the parameters in the variance function.
where d = 1 + dim(Z). Let H1 and H2 be the matrices that define All of the components are functions of the unknown parameters, the
f and βY |Te W
βY |Te W f = H1 βY |1ZT W
f in terms of βY |1ZT W , so that βY |Te W response variable and the vector of covariate/predictor variables. For
e
and β e f = H2 βY |1ZT W . Also note that because T = (1, Zt , Tt )t , our example, with logistic regression there are no variance parameters and
Y |T W
notation allows us to write βYt |1ZT W = βYt |Te W . ψ(y, x, β) = {y − H(β t x)} x where H(t) = 1/ {1 + exp(−t)}.
Let n o
e and
Substitution of (B.36) into (B.32) and equating coefficients of T e i , β e , σ 2 , θ1
ψ1i = ψ Yi , T Y |T 1
f
W results in the two equations:
denote the ith score function employed in fitting the approximate model
t t
βYt |Te W
f = βYt |Xe (I − βX| e Te ,
f )βX|
e Te W (B.37) e i )n . Let
(B.30)–(B.31) to (Yi , T 1
t t
n o
βYt |Te W
f = βY |Xe βX| f.
e Te W (B.38) ψ2i = ψ Yi , (T e t , Wt )t , β e , σ 2 , θ2
i i 2
Y |T W
t
e Te and adding the resulting equation to
Postmultiplying (B.38) by βX| denote the ith score
n function employedon in fitting the approximate model
(B.37) results in the single equation, (B.32)–(B.33) to Yi , (T e t , Wt )t . Note that each fit of the generalized
i i
1
βY |Te W f = βX|
e Te βY |Te W
f + βX| e,
e Te βY |X
linear model produces estimates of the variance parameters as well as
the regression coefficients. These are denoted with subscripts as above,
e Te =
which, upon using the definitions of H1 and H2 and the identity βX| for example, σ12 , θ1 , etc.

f |Te ), for
Let ψ3i denote the ith score function used to estimate vec(βW are estimates of the variance matrices of the asymptotic distributions of
1,RC bIV 1,(M1 ) IV 2,(M )
example, for least squares estimation βbYIV|1X , βY |1X , and βbY |1X 2 , respectively.
³ ´
f i − βt T e e
ψ3i = W f |Te i ⊗ Ti ,
W B.5.2.1 Two-Stage Estimation
and let n o When T and W have the same dimension, the estimators (6.18) and
t e IV 1,RC
, σ 2
, θ .
f |Te Ti ), βY |X
ψ4i = ψ Yi , (βW e 3 3 (6.19) do not depend on M1 and M2 . However, when there are more
Finally, define ψ5i and ψ6i as instruments than predictors measured with error it is possible to iden-
³ ´ tify and consistently estimate matrices M1 and M2 which minimize the
t IV 1,(M1 ) t
ψ5i = βW f |Te βY |X
f |Te M1 βW e f |Te M1 βY |Te ,
− βW asymptotic variance matrix of the corresponding estimators. We give the
results first and then sketch their derivations.
and For an asymptotically efficient estimator (6.18), replace M1 with
³ ´
t IV 2,(M2 )
ψ6i = βW f |Te βY |X
f |Te M2 βW e c1,opt = Ω
M b 1,1 − Ω
b 1,7 C
bt − C
bΩb 7,1 + C
bΩb 7,7 C
b t )−1 ,
t
f |Te H2 βY |Te W ).
+ βW ³ ´t
f |Te M2 (H1 βY |Te W
−βW
b = Id ⊗ βbIV 1,(I) , Id is the identity matrix of dimension
where C e
T Y |1X Te
Note that neither ψ5i nor ψ6i depends on i. IV 1,(I)
dTe = dim(Te), and βbY |1X is the estimator obtained by setting M1 equal
Define the composite parameter
( to IdTe .
³ ´ For an asymptotically efficient estimator (6.19), replace M2 with
Θ = t
βYt |Te , σ12 , θ1t , βYt |Te W , σ22 , θ2t , vect βW
f |Te , (B.39) ½
M2,opt = (H1 + βbW
c b bf e )t +
f |Te )Ω4,4 (H1 + βW
)t |T
³ ´t ³ ´t ³ ´t
βYIV|Xe1,RC , σ32 , θ3t ,
IV 1,(M )
βY |Xe 1
IV 2,(M )
, βY |Xe 2 , ¾−1
(H1 + βbW b bt b b bf e )t + D
f |Te )Ω4,7 D + D Ω7,4 (H1 + βW
bΩb 7,7 D
bt ,
|T
and the i th
composite score function
¡ t ¢ where D b = Id ⊗ (H2 βb e )t − Id ⊗ βbIV 2,(I) )t and βbIV 2,(I) is the
t t t t t t e
T Y |T W e
T Y |1X Y |1X
ψi (Θ) = ψ1i , ψ2i , ψ3i , ψ4i , ψ5i , ψ6i . (B.40) estimator obtained by setting M2 equal to IdTe .
b solves
It follows that Θ We now describe the main steps in the demonstrations of the asymp-
c
totic efficiency of M c
Pn 1,opt and M2,opt .
b =0
i=1 ψi (Θ) dim(Θ)×1 , The argument for M c
1,opt and the estimator (6.18) is simpler and is
showing that Θb is an M-estimator. Thus, under fairly general conditions given first. We start with a heuristic derivation of the efficient estimator.
b is approximately normally distributed in large samples and the theory
Θ Consider the basic identity in (6.17), βY |Te = βW f |Te βY |Xe . Replacing

of Chapter A applies. b b b b
β e with β e − (β e − β e ) and β f e with β f e − (β f e − β f e )
Y |T Y |T Y |T Y |T W |T W |T W |T W |T
An estimate of the asymptotic covariance matrixP of Θ b is given by and rearranging terms shows that this equation is equivalent to
b−1 b b−1 t b n b
the sandwich formula An Bn (An ) , where An = i=1 ψiΘ (Θ) with b bf e − β f e )β e .
t b Pn b t b βbY |Te = βbW e + (βY |Te − βY |Te ) − (βW
f |Te βY |X |T W |T Y |X
ψiΘ (Θ) = ∂ψi (Θ)/∂Θ , and Bn = i=1 ψi (Θ)ψi (Θ). Note that because
we are fitting approximate (or misspecified) models, information-based This equation has the structure of a linear model with response vector
standard errors, that is, standard errors obtained by replacing A bn and βbY |Te , design matrix βbW e , and equation er-
f |Te , regression parameter βY |X
b
Bn with model-based estimates exploiting the information identity, are ror (βb e − β e ) − (βbf e − β f e )β e . Let Σ denote the covariance
Y |T Y |T W |T W |T Y |X
generally not appropriate. matrix of this equation error. The best linear unbiased estimator of βY |Xe
b=A
Let Ω b−1 b b−1 t b
n Bn (An ) and let Ωi,j , i, j = 1, . . . , 12 denote the (i, j)
th
in this pseudolinear model is
submatrix of Ω b corresponding to the natural partitioning induced by
−1 b
the components of Θ in (B.39). It follows that Ω b 8,8 , Ωb 11,11 , and Ω
b 12,12 (βbW
t
f |Te Σ
−1 b
βWf |Te )
−1 bt
βWf |Te Σ βY |Te ,

which is exactly (6.18) with M1 = Σ−1 . Note that the estimator M c the parameter estimates. From the approximation we find that
1,opt
−1
is a consistent estimator of Σ . ³√ ´
c )
Showing that the heuristic derivation is correct and that there is no AVAR
IV 2,(M
n βbY |1X 2 =
penalty for using an estimated covariance matrix is somewhat more in- h n³ ´ oi ³ ´t
volved, but entails nothing more than linearization via Taylor series ap- −(M )
β f e 2 AVAR H1 + βW f |Te H2 ǫ1 + Dǫ3
−(M )
βf e 2 ,
W |T W |T
proximations and ∆-method arguments.
c1 be a consistent estimator of the matrix M1 . Expanding the
Let M
c1 )
IV 1,(M which is minimized when
estimating equation for βb Y |1X around the true parameters results in
the approximation h n³ ´ oi−1
o f |Te H2 ǫ1 + Dǫ3
M2 = AVAR H1 + βW .
√ n IV 1,(M
c ) −(M )
n βbY |1X 1 − βY |Xe ≈ β f e 1 (ǫ2 − Cǫ3 ) ,
W |T

where B.5.2.2 Computing Estimates and Standard Errors


√ ³ ´
ǫ2 = n βbY |Te − βY |Te ,
√ n ³ ´ ³ ´o The two-stage estimates are only slightly more difficult to compute than
ǫ3 = n vec βbW f |Te − vec βW
f |Te , the first-stage estimates. Here, we describe an algorithm that results in
both estimates.
C = IdTe ⊗ βYt |Xe . b
Note that for fixed matrices M1 and M2 all of the components of Θ
This Taylor series approximation is noteworthy for the fact that it is the in (B.39) either are calculated directly as linear regression or general-
same for M1 known as it is for M1 estimated. Consequently, there is no ized linear regression estimates, or are simple transformations of such
penalty asymptotically for estimating M1 . estimates. So for fixed M1 and M2 , obtaining Θ b is straightforward.
Thus, with AVAR denoting asymptotic variance, we have that Asymptotic variance estimation is most easily accomplished by first
n√ c )
o ³ ´t programming the two functions
IV 1,(M −(M ) −(M )
AVAR n βbY |1X 1 = β f e 1 {AVAR (ǫ2 − Cǫ3 )} β f e 1 .
W |T W |T Pn
G1 (Θ) = ψi (Θ),
That this asymptotic variance is minimized when Pi=1
n t
G2 (Θ) = i=1 ψi (Θ)ψi (Θ) ,
−1
M1 = {AVAR (ǫ2 − Cǫ3 )}
is a consequence of the optimality of weighted-least squares linear re- where ψi (Θ) is the ith composite score function from (B.40). Although
we do not actually solve G1 (Θ) = 0 to find Θ, b it should be true that
gression.
c2 be a consistent estimator of the matrix M2 . Expanding the
Let M b
G1 (Θ) = 0. This provides a check on the programming of G1 .
c2 )
IV 2,(M b results in the matrix A bn .
estimating equation for βb Y |1X around the true parameters results in Numerical differentiation of G1 at Θ = Θ
the approximation Alternatively, analytical derivatives of ψi (Θ) can be used, but these
o n³ ´ o are complicated and tedious to program. Evaluation of G2 at Θ = Θ b
√ n IV 2,(Mc ) −(M )
n βbY |1X 2 − βY |Xe ≈ β f e 2 f |Te H2 ǫ1 + Dǫ3 ,
H1 + β W is the matrix B bn . The covariance matrix of Θb is then found as Ω b =
W |T
Ab−1 B
bn (A
b−1 )t .
n n
where The algorithm described above is first used with M1 and M2 set to
√ ³ ´
e resulting in the first-stage
ǫ1 = n βbY |Te W − βY |Te W , the identity matrix of dimension dim(T),
³ ´t estimates and estimated asymptotic covariance matrix. Next, M1 and
M2 are set to M c c
D = IdTe ⊗ H2 βY |Te W − IdTe ⊗ βYt |Xe . 1,opt and M2,opt , respectively, as described in Section
B.5.2.1. A second implementation of the algorithm results in the second-
As before, estimating M2 does not affect the asymptotic distribution of stage estimates and estimated asymptotic covariance matrix.

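The covariance computation just described can be sketched generically as follows (Python; the per-observation score function psi and the data object are hypothetical placeholders, and a numerical Jacobian stands in for analytic derivatives).

    # Sketch: sandwich covariance from stacked estimating equations,
    # Omega-hat = A_n^{-1} B_n (A_n^{-1})^t, with A_n from a numerical Jacobian of G1.
    import numpy as np

    def sandwich_covariance(theta_hat, psi, data, eps=1e-6):
        # psi(theta, data) returns an n x dim(theta) array of per-observation scores
        scores = psi(theta_hat, data)
        p = scores.shape[1]
        B_n = scores.T @ scores                        # G2 evaluated at theta-hat
        G1 = lambda th: psi(th, data).sum(axis=0)      # G1(theta) = sum_i psi_i(theta)
        A_n = np.zeros((p, p))
        for j in range(p):                             # central-difference Jacobian of G1
            step = np.zeros(p)
            step[j] = eps
            A_n[:, j] = (G1(theta_hat + step) - G1(theta_hat - step)) / (2.0 * eps)
        A_inv = np.linalg.inv(A_n)
        return A_inv @ B_n @ A_inv.T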
B.6 Appendix to Chapter 7: Score Function Methods

B.6.1 Technical Complements to Conditional Score Theory

We first justify (7.18). The joint density of Y and W is the product of (7.14) and the normal density, and hence is proportional to

\exp\left\{ \frac{y\eta - D(\eta)}{\phi} + c(y,\phi) - (1/2)(w - x)^t \Sigma_{uu}^{-1} (w - x) \right\}
  \propto \exp\left\{ y(\beta_0 + \beta_z^t z)/\phi + c(y,\phi)
      - (1/2)\, w^t \Sigma_{uu}^{-1} w + x^t \Sigma_{uu}^{-1}\left(w + y\Sigma_{uu}\beta_x/\phi\right) \right\},

where \propto indicates that we have dropped factors not depending on y or w. Now set δ = w + yΣuuβx/φ and make a change of variables (which has Jacobian 1). The joint density of (Y, ∆) given (Z, X) is thus seen to be proportional to

\exp\left\{ y(\beta_0 + \beta_x^t\delta + \beta_z^t z)/\phi + c(y,\phi)
      - (1/2)(y/\phi)^2 \beta_x^t \Sigma_{uu} \beta_x \right\}
  = \exp\left\{ y\eta_*/\phi + c_*(y, \phi, \beta_x^t \Sigma_{uu} \beta_x) \right\}.   (B.41)

The conditional density of Y given (Z, X, ∆) is (B.41) divided by its integral with respect to y, which is necessarily of the form (7.18) as claimed, with

D_*(\eta_*, \phi, \beta_x^t \Sigma_{uu} \beta_x)
  = \phi \log\left[ \int \exp\left\{ y\eta_*/\phi + c_*(y, \phi, \beta_x^t \Sigma_{uu} \beta_x) \right\} d\mu(y) \right],   (B.42)

where, as before, the notation means that (B.42) is a sum if Y is discrete and an integral otherwise.

B.6.2 Technical Complements to Distribution Theory for Estimated Σuu

Next we justify the estimated standard errors for Θ̂ when there is partial replication. Recall that with normally distributed observations, the sample mean and the sample covariance matrix are independent. Hence, Σ̂uu and γ̂ = vech(Σ̂uu) are independent of all the terms (Yi, Zi, Xi, Ui·) and also independent of (Yi, Zi, Wi·). By a Taylor series expansion,

A_n(\cdot)\left( \hat\Theta - \Theta \right)
  \approx \sum_{i=1}^{n} \Psi_C(Y_i, Z_i, W_i, \Theta, \Sigma_{uu})
    + D_n(\Theta, \Sigma_{uu})\left( \hat\gamma - \gamma \right).

Because the two terms in this sum are independent, the total covariance is the sum of the two covariances, namely B_n(\cdot) + D_n(\cdot) C_n(\cdot) D_n^t(\cdot), as claimed.

B.7 Appendix to Chapter 8: Likelihood and Quasilikelihood

B.7.1 Monte Carlo Computation of Integrals

If one can easily generate observations from the conditional distribution of X given Z (error model) or given (Z, W) (calibration model), an appealing and easily programmed Monte Carlo approximation due to McFadden (1989) can be used to compute likelihoods. The error model likelihood (8.7) can be approximated as follows. Generate on a computer a sample (X_1^s, ..., X_N^s) of size N from the density f(x|z, α̃2) of X given Z = z. Then for large enough N,

f_{Y,W|Z}(y, w \mid z, B, \tilde\alpha_1, \tilde\alpha_2)
  \approx N^{-1} \sum_{i=1}^{N} f_{Y|Z,X}(y \mid z, X_i^s, B)\, f_{W|Z,X}(w \mid z, X_i^s, \tilde\alpha_1).   (B.43)

The dependence of (B.43) on α̃2 comes from the fact that the distribution of X given Z depends on α̃2.

We approximate

f_{Y|Z,W}(y \mid z, w, B, \tilde\gamma)
  = \int f_{Y|Z,X}(y \mid z, x, B)\, f_{X|Z,W}(x \mid z, w, \tilde\gamma)\, d\mu(x)   (B.44)

by generating a sample (X_1^s, ..., X_N^s) of size N from the distribution f_{X|Z,W}(x|z, w, γ̃) of X given (Z = z, W = w). Then for large enough N,

f_{Y|Z,W}(y \mid z, w, B, \tilde\gamma)
  \approx N^{-1} \sum_{i=1}^{N} f_{Y|Z,X}(y \mid z, X_i^s, B).

This “brute force” Monte Carlo integration method is computationally intensive, for two reasons. First, one has to generate random observations for each value of (Yi, Zi, Wi), which may be a formidable task if the sample size is large. Second, and somewhat less important, maximum likelihood is an iterative algorithm, and one must generate simulated X's at each iteration. Brown and Mariano (1993) suggested that N must be fairly large compared to n^{1/2} in order to eliminate the effects of Monte Carlo variance. They also suggested a modification that is less computationally intensive.

There is a practical issue with approximations such as (B.43): if (B.43) is recomputed at each stage of an iterative process with different random numbers, then optimization routines will tend to get confused unless N is very large. If, for example, X given Z is normally distributed, a better approach is to generate a fixed but large number N of standard normals once, and then simply rescale these fixed numbers at each iteration to have the appropriate mean and variance.
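A minimal sketch of this device, in Python, is given below. It is illustrative rather than code from the book: it assumes a normal calibration model in which X given (Z = z, W = w) has a user-supplied mean and standard deviation, and a user-supplied outcome density f_{Y|Z,X}. The same fixed standard normal draws are reused at every likelihood evaluation, so the approximate likelihood is a smooth, reproducible function of the parameters.

import numpy as np

rng = np.random.default_rng(12345)
FIXED_NORMALS = rng.standard_normal(2000)   # generated once, reused at every call

def approx_loglik(params, y, z, w, f_y_given_zx, calib_mean, calib_sd):
    """Monte Carlo approximation to sum_i log f_{Y|Z,W}(y_i | z_i, w_i), as in (B.44).

    f_y_given_zx(y, z, x_draws, params) -> outcome-model density at each draw;
    calib_mean(z, w, params), calib_sd(z, w, params) -> mean and sd of X given (Z, W).
    """
    total = 0.0
    for yi, zi, wi in zip(y, z, w):
        # Rescale the fixed standard normals to the current mean and sd.
        x_draws = calib_mean(zi, wi, params) + calib_sd(zi, wi, params) * FIXED_NORMALS
        f_hat = np.mean(f_y_given_zx(yi, zi, x_draws, params))
        total += np.log(f_hat)
    return total

An optimizer can then be applied directly to approx_loglik; the essential point is only that the random numbers are not regenerated inside the objective function.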
B.7.2 Linear, Probit, and Logistic Regression

B.7.2.1 Linear Regression

In some cases, the required likelihoods can be computed exactly or very nearly so. Suppose that W and T are each normally distributed unbiased replicates of X, being independent given X, and each having covariance matrix Σuu. Suppose also that X itself is normally distributed with mean γ^t Z and covariance matrix Σxx. As elsewhere, all distributions are conditioned on Z. In this case, in order to allow for an intercept, the first element of Z equals 1.0.

In normal linear regression where the response has mean β0 + βx^t X + βz^t Z and variance σ², the joint distribution of (Y, W, T) given Z is multivariate normal with means β0 + βx^t γ^t Z + βz^t Z, γ^t Z, and γ^t Z, respectively, and covariance matrix

\Sigma_{y,w,t} =
\begin{pmatrix}
\sigma^2 + \beta_x^t \Sigma_{xx} \beta_x & \beta_x^t \Sigma_{xx} & \beta_x^t \Sigma_{xx} \\
\Sigma_{xx} \beta_x & \Sigma_{xx} + \Sigma_{uu} & \Sigma_{xx} \\
\Sigma_{xx} \beta_x & \Sigma_{xx} & \Sigma_{xx} + \Sigma_{uu}
\end{pmatrix}.

B.7.2.2 Distribution of X Given the Observed Data

For probit and logistic regression, we compute the joint density using the formulas f_{Y,W|Z} = f_{Y|Z,W} f_{W|Z} and f_{Y,W,T|Z} = f_{Y|Z,W,T} f_{W,T|Z}. This requires a few preliminary calculations.

First, consider W alone. Our model says that W given Z is normally distributed with mean α21^t Z and covariance matrix Σxx + Σuu. Define Λw = Σxx(Σxx + Σuu)^{-1}, m(Z, W) = (I − Λw)γ^t Z + Λw W, and Σx|z,w = (I − Λw)Σxx. From linear regression theory (see, for example, Section A.4), X given (Z, W) is normally distributed with mean m(Z, W) and covariance matrix Σx|z,w.

Next, consider W and T together. Our model says that given Z they are jointly normally distributed with common mean γ1^t Z, common individual covariance matrices (Σxx + Σuu), and cross-covariance matrix Σxx. If we define

\Lambda_{w,t} = (\Sigma_{xx}, \Sigma_{xx})
\begin{pmatrix}
\Sigma_{xx} + \Sigma_{uu} & \Sigma_{xx} \\
\Sigma_{xx} & \Sigma_{xx} + \Sigma_{uu}
\end{pmatrix}^{-1}
= (\Sigma_{xx}, \Sigma_{xx})\, \Gamma_{w,t}^{-1},

then X given (Z, W, T) is normally distributed with mean and covariance matrix

m(Z, W, T) = \gamma^t Z + \Lambda_{w,t} \left\{ (W - \gamma^t Z)^t, (T - \gamma^t Z)^t \right\}^t,
\Sigma_{x|z,w,t} = \Sigma_{xx} - \Lambda_{w,t} (\Sigma_{xx}, \Sigma_{xx})^t,

respectively.

B.7.2.3 Probit and Logistic Regression

Now we return to probit and logistic regression. In probit regression, exact statements are possible. We have indicated that given either (Z, W) or (Z, W, T), X is normally distributed with mean m(·) and covariance matrix Σx|·, where Σx|· is either Σx|z,w or Σx|z,w,t, and similarly for m(·). From the calculations in Section B.3, it follows that

pr(Y = 1 \mid Z, W, T) = \Phi\left\{ \frac{\beta_0 + \beta_x^t m(\cdot) + \beta_z^t Z}
                                          {(1 + \beta_x^t \Sigma_{x|\cdot} \beta_x)^{1/2}} \right\}.

For logistic regression (Section 4.8.2), a good approximation is

pr(Y = 1 \mid Z, W, T) \approx H\left\{ \frac{\beta_0 + \beta_x^t m(\cdot) + \beta_z^t Z}
                                             {(1 + \beta_x^t \Sigma_{x|\cdot} \beta_x / 1.7^2)^{1/2}} \right\};   (B.45)

see also Monahan and Stefanski (1992).

Write Θ = (B, Σuu, α21, Σxx) and r(W) = r(W, α21) = W − α21^t Z. Using (B.45), except for a constant in logistic regression, the logarithm of the approximate likelihood for (Y, W, T) given Z is

\ell(Y, Z, W, T, \Theta) = -(1/2)\log\{\det(\Gamma_{w,t})\}   (B.46)
  + Y \log\{H(\cdot)\} + (1 - Y)\log\{1 - H(\cdot)\}
  - (1/2) \left\{ r^t(W), r^t(T) \right\} \Gamma_{w,t}^{-1} \left\{ r^t(W), r^t(T) \right\}^t.

A similar result applies if only W is measured, namely,

\ell(Y, Z, W, \Theta) = -(1/2)\log\{\det(\Sigma_{xx} + \Sigma_{uu})\}
  + Y \log\{H(\cdot)\} + (1 - Y)\log\{1 - H(\cdot)\}
  - (1/2)\, r^t(W, \gamma_1) (\Sigma_{uu} + \Sigma_{xx})^{-1} r(W, \gamma_1).
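A small numerical sketch of these formulas, in Python, may help fix ideas. It is not code from the book: the scalar parameter values below are arbitrary, and only the single-surrogate case (X given Z and W) is shown. It computes Λw, m(Z, W), and Σx|z,w as in Section B.7.2.2, and then the exact probit probability and the approximate logistic probability (B.45).

import numpy as np
from scipy.stats import norm

# Hypothetical scalar parameter values, for illustration only.
beta0, betax, betaz = -1.0, 0.7, 0.3
gamma = 0.5        # X given Z has mean gamma * z
sigma_xx = 1.0     # var(X | Z)
sigma_uu = 0.5     # measurement error variance of W

def prob_y1_given_zw(z, w, link="probit"):
    """pr(Y = 1 | Z, W): exact for probit; the 1.7 approximation for logistic."""
    lam = sigma_xx / (sigma_xx + sigma_uu)        # Lambda_w
    m = (1.0 - lam) * gamma * z + lam * w         # m(Z, W)
    s2 = (1.0 - lam) * sigma_xx                   # Sigma_{x|z,w}
    eta = beta0 + betax * m + betaz * z
    if link == "probit":
        return norm.cdf(eta / np.sqrt(1.0 + betax**2 * s2))
    # logistic: shrink the linear predictor as in (B.45), then apply H
    arg = eta / np.sqrt(1.0 + betax**2 * s2 / 1.7**2)
    return 1.0 / (1.0 + np.exp(-arg))

print(prob_y1_given_zw(z=1.0, w=0.2, link="probit"))
print(prob_y1_given_zw(z=1.0, w=0.2, link="logistic"))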
B.8 Appendix to Chapter 9: Bayesian Methods

B.8.1 Code for Section 9.8.1

model
{#BEGIN MODEL
for (i in 1:Nobservations)
{#BEGIN FOR i in 1:Nobservations

#Outcome model
Y[i]~dnorm(meanY[i],taueps)
meanY[i]<-beta[1]+beta[2]*X[i]+beta[3]*Z[i]

#Replication model
for (j in 1:Nreplications) {W[i,j]~dnorm(X[i],tauu)}

#Exposure model
X[i]~dnorm(meanX[i],taux)
meanX[i]<-alpha[1]+alpha[2]*Z[i]
}#END FOR i in 1:Nobservations

#Noninformative priors on the model parameters
tauu~dgamma(3,1)
taueps~dgamma(3,1)
taux~dgamma(3,1)

#Priors for alpha and beta
for (i in 1:nalphas){alpha[i]~dnorm(0,1.0E-6)}
for (i in 1:nbetas){beta[i]~dnorm(0,1.0E-6)}

#Deterministic transformations: standard deviations
sigmaeps<-1/sqrt(taueps)
sigmau<-1/sqrt(tauu)
sigmax<-1/sqrt(taux)

#Deterministic transformation: attenuation
lambda<-tauu/(tauu+taux)
}#END MODEL

B.8.2 Code for Section 9.11

model
{#BEGIN MODEL
for (i in 1:Nobservations)
{#BEGIN for i in 1:Nobservations

#Outcome model (repeated observations of FFQ)
logFFQ1[i]~dnorm(meanlogFFQ[i],taueps)
logFFQ2[i]~dnorm(meanlogFFQ[i],taueps)

#Model for mean of log FFQ
meanlogFFQ[i]~dnorm(meanmeanlogFFQ[i],taur)

#Define the fixed effects part of the mean FFQ
meanmeanlogFFQ[i]<-beta[1]+beta[2]*X[i]

#Biomarker model for log protein
logprotein1[i]~dnorm(X[i],tauu)
logprotein2[i]~dnorm(X[i],tauu)

#Exposure model
X[i]~dnorm(meanX[i],taux)
meanX[i]<-alpha[1]+alpha[2]*AGE[i]+alpha[3]*BMI[i]
}#END for i in 1:Nobservations

#Define lambda (a noninformative prior is assigned)
tauu<-lambda*taux/(1-lambda)

#Noninformative priors on the model parameters
lambda~dunif(0,1)
taueps~dgamma(3,0.1)
taux~dgamma(3,0.1)
taur~dgamma(3,0.1)

#Define the signal attenuation
attenuation<-beta[2]/(pow(beta[2],2)+taux/taur+taux/taueps)

#Priors for the fixed effects
for (i in 1:nalphas){alpha[i]~dnorm(0,1.0E-6)}
for (i in 1:nbetas){beta[i]~dnorm(0,1.0E-6)}

#Deterministic transformations (obtain variances)
sigma2eps<-1/taueps
sigma2x<-1/taux
sigma2u<-1/tauu
sigma2r<-1/taur
}#END MODEL
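A brief remark on the deterministic line lambda<-tauu/(tauu+taux) in the Section 9.8.1 code above: WinBUGS parameterizes normal distributions by precisions, so, assuming the usual definition of the attenuation for classical measurement error,

\lambda = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_u^2}
        = \frac{1/\texttt{taux}}{1/\texttt{taux} + 1/\texttt{tauu}}
        = \frac{\texttt{tauu}}{\texttt{tauu} + \texttt{taux}},

which is exactly the expression monitored in the code.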
References

Abrevaya, J., & Hausman, J. A. (2004). Response error in a transformation


model with an application to earnings equation estimation. Econometrics
Journal, 7, 366–388.
Aitkin, M., & Rocci, R. (2002). A general maximum likelihood analysis of
measurement error in generalized linear models. Statistics and Computing,
12, 163–174.
Albert, P. S. (1999). A mover-stayer model for longitudinal marker data. Bio-
metrics, 55, 1252–1257.
Albert, P. S., & Dodd, L. E. (2004). A cautionary note on the robustness of
latent class models for estimating diagnostic error without a gold standard.
Biometrics, 60, 427–435.
Amemiya, Y. (1985). Instrumental variable estimator for the nonlinear errors
in variables model. Journal of Econometrics, 28, 273–289.
Amemiya, Y. (1990a). Instrumental variable estimation of the nonlinear mea-
surement error model. In P. J. Brown & W. A. Fuller, (Eds.), Statistical
Analysis of Measurement Error Models and Application, Providence, RI:
American Mathematics Society.
Amemiya, Y. (1990b). Two stage instrumental variable estimators for the non-
linear errors in variables model. Journal of Econometrics, 44, 311–332.
Amemiya, Y., & Fuller, W. A. (1988). Estimation for the nonlinear functional
relationship. Annals of Statistics, 16, 147–160.
ARIC Investigators (1989). The Atherosclerosis Risk in Communities (ARIC)
Study: Design and objectives. International Journal of Epidemiology, 129,
687–702.
Armstrong, B. (1985). Measurement error in generalized linear models. Com-
munications in Statistics, Series B, 14, 529–544.
Augustin, T. (2004). An exact corrected log-likelihood function for Cox’s pro-
portional hazards model under measurement error and some extensions.
Scandinavian Journal of Statistics, 31, 43–50.
Baker, S. G., Wax, Y., & Patterson, B. H. (1993). Regression analysis of
grouped survival data: Informative censoring and double sampling. Bio-
metrics, 49, 379–389.
Beaton, G. H., Milner, J., & Little, J. A. (1979). Sources of variation in 24-hour
dietary recall data: Implications for nutrition study design and interpreta-
tion. American Journal of Clinical Nutrition, 32, 2546–2559.
Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd
ed., New York: Springer-Verlag.
Berkson, J. (1950). Are there two regressions? Journal of the American Sta-

tistical Association, 45, 164–180. W. Ahrens & I. Pigeot (Eds.), Handbook of Epidemiology, London: Springer.
Berry, S. A., Carroll, R. J., & Ruppert, D. (2002). Bayesian smoothing and Carlin, B. P., & Louis, T. A. (2000). Bayes and Empirical Bayes Methods for
regression splines for measurement error problems. Journal of the American Data Analysis, 2nd ed., London & New York: Chapman & Hall.
Statistical Association, 97, 160–169. Carroll, R. J. (1989). Covariance analysis in generalized linear measurement
Bickel, P. J., & Ritov, Y. (1987). Efficient estimation in the errors in variables error models. Statistics in Medicine, 8, 1075–1093.
model. The Annals of Statistics, 15, 513–540. Carroll, R. J. (1997). Surprising effects of measurement error on an aggregate
Boggs, P. T., Spiegelman, C. H., Donaldson, J. R., & Schnabel, R. B. (1988). data estimation. Biometrika, 84, 231–234.
A computational examination of orthogonal distance regression. Journal of Carroll, R. J. (1999). Risk assessment with subjectively derived doses. In E.
Econometrics, 38, 169–201. Ron & F. O. Hoffman (Eds.), Uncertainties in Radiation Dosimetry and
Bowman, A. W., & Azzalini, A. (1997). Applied Smoothing Techniques for Their Impact on Dose response Analysis, National Cancer Institute Press.
Data Analysis, Oxford: Clarendon Press. Carroll, R. J. (2003). Variances are not always nuisance parameters: The 2002
Box, G. E. P., & Tiao, G. C. (1973). Bayesian Inference in Statistical Analysis, R. A. Fisher Lecture. Biometrics, 59, 211–220.
Reading, MA: Addison-Wesley. Carroll, R. J., & Galindo, C. D. (1998). Measurement error, biases and the
Breslow, N., & Lin, X. (1995). Bias correction in generalized linear mixed validation of complex models. Environmental Health Perspectives, 106 (Sup-
models with a single component of dispersion, Biometrika, 82, 81-91. plement 6), 1535–1539.
Brooks, S. P., & Gelman, A. (1998). General methods for monitoring con- Carroll, R. J., & Gallo, P. P. (1982). Some aspects of robustness in functional
vergence of iterative simulations, Journal of Computational and Graphical errors-in-variables regression models. Communications in Statistics, Series
Statistics, 7, 434–455 A, 11, 2573–2585.
Brown, B. W., & Mariano, R. S. (1993). Stochastic simulations for inference Carroll, R. J., & Gallo, P. P. (1984). Comparisons between maximum likelihood
in nonlinear errors-in-variables models. Handbook of Statistics, Vol. 11, 611– and method of moments in a linear errors-in-variables regression model. In
627. North Holland: New York. T. J Santner & A. C. Tamhane (Eds.), Design of Experiment: Ranking and
Buonaccorsi, J. P. (1991). Measurement error, linear calibration and inferences Selection, New York: Marcel Dekker.
for means. Computational Statistics and Data Analysis, 11, 239–257. Carroll, R. J., & Hall, P. (1988). Optimal rates of convergence for deconvolving
Buonaccorsi, J. P. (1996). Measurement error in the response in the general a density. Journal of the American Statistical Association, 83, 1184–1186.
linear model. Journal of the American Statistical Association, 91, 633–642. Carroll, R. J., & Hall, P. (2004). Low-order approximations in deconvolution
Buonaccorsi, J. P., & Tosteson, T. (1993). Correcting for nonlinear measure- and regression with errors in variables. Journal of the Royal Statistical So-
ment error in the dependent variable in the general linear model. Commu- ciety, Series B, 66, 31–46.
nications in Statistics, Theory & Methods, 22, 2687–2702. Carroll, R. J., & Ruppert, D. (1988). Transformation and Weighting in Re-
Buonaccorsi, J. P., Demidenko, E., & Tosteson, T. (2000). Estimation in lon- gression. London: Chapman & Hall.
gitudinal random effects models with measurement error. Statistica Sinica, Carroll, R. J., & Ruppert, D. (1991). Prediction and tolerance intervals with
10, 885–903. transformation and/or weighting. Technometrics, 33, 197–210.
Buonaccorsi, J. P., Laake, P., & Veirød, M. (2005). On the effect of misclassi- Carroll, R. J., & Ruppert, D. (1996). The use and misuse of orthogonal regres-
fication on bias of perfectly measured covariates in regression. Biometrics, sion estimation in linear errors-in-variables models. The American Statisti-
61, 831–836. cian, 50, 1–6.
Burr, D. (1988). On errors-in-variables in binary regression—Berkson case. Carroll, R. J., & Spiegelman, C. H. (1986). The effect of small measurement
Journal of the American Statistical Association, 83, 739–743. error on precision instrument calibration. Journal of Quality Technology,
Buzas, J. S. (1997). Instrumental variable estimation in nonlinear measure- 18, 170–173.
ment error models. Communications in Statistics, Part A, 26, 2861–2877. Carroll, R. J., & Spiegelman, C. H. (1992). Diagnostics for nonlinearity and
Buzas, J. S., & Stefanski, L. A. (1996a). A note on corrected score estimation. heteroscedasticity in errors in variables regression. Technometrics, 34, 186–
Statistics & Probability Letters, 28 , 1–8. 196.
Buzas, J. S., & Stefanski, L. A. (1996b). Instrumental variable estimation Carroll, R. J., & Stefanski, L. A. (1990) Approximate quasilikelihood estima-
in a probit measurement error model. Journal of Statistical Planning and tion in models with surrogate predictors. Journal of the American Statistical
Inference, 55, 47–62. Association, 85, 652–663.
Buzas, J. S., & Stefanski, L. A. (1996c). Instrumental variables estimation Carroll, R. J., & Stefanski, L. A. (1994). Measurement error, instrumental
in generalized linear measurement error models. Journal of the American variables and corrections for attenuation with applications to meta-analyses.
Statistical Association, 91, 999–1006. Statistics in Medicine, 13, 1265–1282.
Buzas, J. S., Stefanski, L. A., & Tosteson, T. D. (2004). Measurement Error. In Carroll, R. J., & Stefanski, L. A. (1997). Asymptotic theory for the Simex

estimator in measurement error models. In S. Panchapakesan & N. Bal- Journal of the American Statistical Association, 91, 242–250.
akrishnan (Eds.) Advances in Statistical Decision Theory and Applications, Carroll, R. J., Midthune, D., Freedman, L. S., & Kipnis, V. (2006). Seem-
(pp. 151–164), Basel: Birkhauser Verlag. ingly unrelated measurement error models, with application to nutritional
Carroll, R. J., & Wand, M. P. (1991). Semiparametric estimation in logistic epidemiology. Biometrics, to appear.
measurement error models. Journal of the Royal Statistical Society, Series Carroll, R. J., Ruppert, D., Tosteson, T. D., Crainiceanu, C., & Karagas, M.
B, 53, 573–585. R. (2004). Nonparametric regression and instrumental variables. Journal of
Carroll, R. J., Eltinge, J. L., & Ruppert, D. (1993). Robust linear regression the American Statistical Association, 99, 736–750.
in replicated measurement error models. Letters in Statistics & Probability, Carroll, R. J., Spiegelman, C., Lan, K. K., Bailey, K. T., & Abbott, R. D.
16, 169–175. (1984). On errors-in-variables for binary regression models. Biometrika, 71,
Carroll, R. J., Freedman, L., & Hartman, A. (1996). The use of semiquanti- 19–26.
tative food frequency questionnaires to estimate the distribution of usual Cheng, C.-L., & van Ness, J. W. (1988). Generalized M-estimators for errors
intake. American Journal of Epidemiology, 143, 392–404. in variables regression. The Annals of Statistics, 20, 385–397.
Carroll, R. J., Freedman, L. S., & Pee, D. (1997). Design aspects of calibration Cheng, C.-L., & Schneeweiss, H. (1998). Polynomial regression with errors in
studies in nutrition, with analysis of missing data in linear measurement the variables. Journal of the Royal Statistical Society, Series B, 60, 189–199.
error models. Biometrics, 53, 1440–1451. Cheng, C.-L., & Tsai, C.-L. (1992). Diagnostics in measurement error models.
Carroll, R. J., Gail, M. H., & Lubin, J. H. (1993). Case-control studies with Unpublished.
errors in predictors. Journal of the American Statistical Association, 88, Cheng, C.-L., Schneeweiss, H., & Thamerus, M. (2000). A small sample esti-
177–191. mator for a polynomial regression with errors in the variables. Journal of
Carroll, R. J., Gallo, P. P., & Gleser, L. J. (1985). Comparison of least squares the Royal Statistical Society, Series B, 62, 699–709 .
and errors-in-variables regression, with special reference to randomized anal- Clayton, D. G. (1991). Models for the analysis of cohort and case-control
ysis of covariance. Journal of the American Statistical Association, 80, 929– studies with inaccurately measured exposures. In J. H. Dwyer, M. Feinleib,
932. P. Lipsert, P., et al. (Eds.), Statistical Models for Longitudinal Studies of
Carroll, R. J., Hall, P. G., & Ruppert, D. (1994). Estimation of lag in misreg- Health, (pp. 301–331). New York: Oxford University Press.
istration problems for averaged signals. Journal of the American Statistical Cleveland, W. (1979). Robust locally weighted regression and smoothing scat-
Association, 89, 219–229. terplots. Journal of the American Statistical Association, 74, 829–836.
Carroll, R. J., Knickerbocker, R. K., & Wang, C. Y. (1995). Dimension reduc- Cleveland, W., & Devlin, S. (1988). Locally weighted regression: An approach
tion in semiparametric measurement error models. The Annals of Statistics, to regression analysis by local fitting. Journal of the American Statistical
23, 161–181. Association, 83, 596–610.
Carroll, R. J., Maca, J. D., & Ruppert, D. (1998). Nonparametric regression Cochran, W. G. (1968). Errors of measurement in statistics. Technometrics,
splines for generalized linear measurement error models. In Econometrics in 10, 637–666.
Theory and Practice: Festschrift in the Honour of Hans Schneeweiss, (pp. Cook, J. R., & Stefanski, L. A. (1994). Simulation-extrapolation estimation in
23–30), Physica Verlag. parametric measurement error models. Journal of the American Statistical
Carroll, R. J., Maca, J. D., & Ruppert, D. (1999). Nonparametric regression Association, 89, 1314–1328.
with errors in covariates. Biometrika, 86, 541–554. Copas, J. B. (1988). Binary regression models for contaminated data (with
Carroll, R. J., Roeder, K., & Wasserman, L. (1999). Flexible parametric mea- discussion). Journal of the Royal Statistical Society, Series B, 50, 225–265.
surement error models. Biometrics, 55, 44–54. Cornfield, J. (1962). Joint dependence of risk of coronary heart disease on
Carroll, R. J., Wang, C. Y., & Wang, S. (1995). Asymptotics for prospective serum cholesterol and systolic blood pressure: A discriminant function anal-
analysis of stratified logistic case-control studies. Journal of the American ysis. Federal Proceedings, 21, 58–61.
Statistical Association, 90, 157–169. Cowles, M. K., & Carlin, B. P. (1996). Markov Chain Monte Carlo convergence
Carroll, R. J., Wang, S., & Wang, C. Y. (1995). Asymptotics for prospective diagnostics: A comparative review. Journal of the American Statistical As-
analysis of stratified logistic case-control studies. Journal of the American sociation, 91, 883–904.
Statistical Association, 90, 157–169. Cox, D. R. (1972). Regression models and life tables (with discussion). Journal
Carroll, R. J., Freedman, L. S., Kipnis, V. & Li, L. (1998). A new class of of the Royal Statistical Society, Series B, 34, 187–220.
measurement error models, with applications to estimating the distribution Cox, D. R., & Hinkley, D. V. (1974). Theoretical Statistics. London: Chapman
of usual intake. Canadian Journal of Statistics, 26, 467–477. & Hall.
Carroll, R. J., Küchenhoff, H., Lombard, F., & Stefanski, L. A. (1996). Asymp- Crainiceanu, C., Ruppert, D., & Wand, M. (2005). Bayesian Analysis for pe-
totics for the SIMEX estimator in structural measurement error models. nalized spline regression using WinBUGS. Journal of Statistical Software.

Volume 14, Issue 14. (http://www.jstatsoft.org/) American Journal of Epidemiology, 132, 746–748.
Crouch, RE. A., & Spiegelman, D. (1990). The evaluation of integrals of the Drum, M., & McCullagh, P. (1993). Comment on the paper by Fitzmaurice,
form −∞ f (t)exp(−t2 )dt: Applications to logistic-normal models. Journal

Laird, & Rotnitzky. Statistical Science, 8, 300–301.
of the American Statistical Association, 85, 464–467. Eckert, R. S., Carroll, R. J., & Wang, N. (1997). Transformations to additivity
Dafni, U. & Tsiatis, A. (1998). Evaluating surrogate markers of clinical out- in measurement error models. Biometrics, 53, 262–272.
come when measured with error. Biometrics, 54, 1445–1462. Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans.
Dagalp, R. E. (2001). Estimators for Generalized Linear Measurement Error Philadelphia: SIAM.
Models with Interaction Terms. Unpublished Ph.D. thesis, North Carolina Efron, B. (1994). Missing data, imputation and the bootstrap. Journal of the
State University, Raleigh, NC. American Statistical Association, 463–475.
Davidian, M., & Carroll R. J. (1987). Variance function estimation. Journal Efron, B. (2005). Bayesians, frequentists and scientists. Journal of the Amer-
of the American Statistical Association, 82, 1079–1091. ican Statistical Association, 100, 1–5.
Davidian, M., & Gallant, A. R. (1993). The nonlinear mixed effects model Efron, B., & Hinkley, D. V. (1978). Assessing the accuracy of the max-
with a smooth random effects density. Biometrika, 80, 475–488. imum likelihood estimator: Observed versus expectedFisher information.
Davis, S., Kopecky, K., & Hamilton, T. (2002). Hanford Thyroid Disease Study Biometrika, 65, 457–487.
Final Report. Centers for Disease Control and Prevention (CDC). Efron, B., & Tibshirani, R. (1993). An Introduction to the Bootstrap. London:
Delaigle, A., & Gijbels, I. (2004a). Comparison of data-driven bandwidth selec- Chapman & Hall.
tion procedures in deconvolution kernel density estimation. Computational Ekholm, A., Green, M., & Palmgren, J. (1986). Fitting exponential family
Statistics and Data Analysis, 45, 249–267. nonlinear models in GLIM 3.77. GLIM Newsletter, 13, 4–13.
Delaigle, A., & Gijbels, I. (2004b). Bootstrap bandwidth selection in kernel Ekholm, A., & Palmgren, J. (1982). A model for a binary response with mis-
density estimation from a contaminated sample. Annals of the Institute of classification. In GLIM-82, editor R. Gilchrist. Heidelberg: Springer.
Statistical Mathematics, 56, 19–47. Ekholm, A., & Palmgren, J. (1987). Correction for misclassifiction using dou-
Delaigle, A., & Gijbels, I. (2005). Data-driven boundary estimation in deconvo- bly sampled data. Journal of Official Statistics, 3, 419–429.
lution problems. Computational Statistics and Data Analysis, 50, 1965–1994 Fan, J. (1991a). On the optimal rates of convergence for nonparametric de-
Delaigle, A., & I. Gijbels (2006). Estimation of boundary and discontinuity convolution problems. The Annals of Statistics, 19, 1257–1272.
points in deconvolution problems. Statistica Sinica, to appear. Fan, J. (1991b). Asymptotic normality for deconvolving kernel density estima-
Delaigle, A., Hall, P., & Qui, P. (2006). Nonparametric methods for solving tors. Sankhyā, Series A, 53, 97–110.
the Berkson errors-in-variables problem. Journal of the Royal Statistical Fan, J. (1991c). Global behavior of deconvolution kernel estimates. Statistica
Society, Series B, 68, 201–220. Sinica, 1, 541–551.
Demidenko, E. (2004). Mixed Models: Theory and Applications. New York: Fan, J. (1992a). Deconvolution with supersmooth distributions. Canadian
John Wiley & Sons. Journal of Statistics, 20, 23–37.
Dempster, A. P., Rubin, D. B., & Tsutakawa, R. K. (1981). Estimation in Fan, J., & Gijbels, I. (1996). Local Polynomial Modelling and Its Applications,
covariance components models. Journal of the American Statistical Associ- London: Chapman & Hall.
ation, 76, 341–353. Fan, J., & Masry, E. (1993). Multivariate regression estimation with errors-in-
Desmond, A. F. (1989). Estimating Equations, Theory of, In S. Kotz & N. variables: asymptotic normality for mixing processes Journal of Multivariate
L. Johnson, (Eds.) Encyclopedia of Statistical Sciences, (pp. 56–59), New Analysis, 43, 237–271.
York: John Wiley & Sons. Fan, J., & Truong, Y. K. (1993). Nonparametric regression with errors in
Devanarayan, V. (1996). Simulation Extrapolation Method for Heteroscedas- variables. The Annals of Statistics, 21, 1900–1925.
tic Measurement Error Models with Replicate Measurements. Unpublished Fan, J., Truong, Y. K., & Wang, Y. (1991). Nonparametric function esti-
Ph.D. thesis, North Carolina State University, Raleigh, NC. mation involving errors-in-variables. In G. Roussas (Ed.), Nonparametric
Devanarayan, V., & Stefanski, L. A. (2002). Empirical simulation extrapola- Functional Estimation and Related Topics (pp. 613–627), Dordrecht: Kluwer
tion for measurement error models with replicate measurements. Statistics Academic Publishers.
& Probability Letters, 59, 219–225. Freedman, L. S., Carroll, R. J., & Wax, Y. (1991). Estimating the relationship
Dominici F., Zeger S. L., & Samet, J. M. (2000). A Measurement error model between dietary intake obtained from a food frequency questionnaire and
for time-series studies of air pollution and mortality. Biostatistics, l, 157– true average intake. American Journal of Epidemiology, 134, 510–520.
175. Freedman, L. S., Feinberg, V., Kipnis, V., Midthune, D., & Carroll, R. J.
Dosemeci, M., Wacholder, S., & Lubin, J. H. (1990). Does nondifferential mis- (2004). A new method for dealing with measurement error in explanatory
classification of exposure always bias a true effect towards the null value? variables of regression models. Biometrics, 60, 171–181.

Freedman, L., Schatzkin, A., & Wax, Y. (1990). The effect of dietary mea- Gleser, L. J. (1990). Improvements of the naive approach to estimation in non-
surement error on the sample size of a cohort study. American Journal of linear errors-in-variables regression models. In P. J. Brown & W. A. Fuller
Epidemiology, 132, 1185–1195. (Eds.) Statistical Analysis of Measurement Error Models and Application.
Fuller, W. A. (1980). Properties of some estimators for the errors in variables American Mathematics Society, Providence.
model. The Annals of Statistics, 8, 407–422. Gleser, L. J. (1992). The importance of assessing measurement reliability in
Fuller, W. A. (1984). Measurement error models with heterogeneous error multivariate regression, Journal of the American Statistical Association, 87,
variances. In Y. P. Chaubey & T. D. Dwivedi (Eds.) Topics in Applied 696–707.
Statistics, (pp. 257–289). Montreal: Concordia University. Gleser, L. J., Carroll, R. J., & Gallo, P. P. (1987). The limiting distribution
Fuller, W. A. (1987). Measurement Error Models. New York: John Wiley & of least squares in an errors-in-variables linear regression model. Annals of
Sons. Statistics, 15, 220–233.
Fung, K. Y., & Krewski, D. (1999). On measurement error adjustment methods Godambe, V. P. (1960). An optimum property of regular maximum likelihood
in Poisson regression. Environmetrics, 10, 213–224. estimation. Annals of Mathematical Statistics, 31, 1208–1211.
Gail, M. H., Tan, W. Y., & Piantadosi, S. (1988). Tests for no treatment effect Godambe, V. P. (1991). Estimating functions. New York: Clarendon Press.
in randomized clinical trials. Biometrika, 75, 57–64. Gong, G., & Samaniego, F. (1981). Pseudo maximum likelihood estimation:
Gail, M. H., Wieand, S. & Piantadosi, S. (1984). Biased estimates of treatment Theory and applications. The Annals of Statistics, 9, 861–869.
effect in randomized experiments with nonlinear regressions and omitted Gössi, C., & Küchenhoff, H. (2001). Bayesian analysis of logistic regression
covariates. Biometrika, 71, 431–444. with an unknown change point and covariate measurement error. Statistics
Gallant, A. R., & Nychka, D. W. (1987). Seminonparametric maximum likeli- in Medicine, 20, 3109–3121.
hood estimation. Econometrica, 55, 363–390. Gould, W. R., Stefanski, L. A., & Pollock, K. H. (1997). Effects of measurement
Gallo, P. P. (1982). Consistency of some regression estimates when some vari- error on catch-effort estimation. Canadian Journal of Fisheries and Aquatic
ables are subject to error. Communications in Statistics, Series A, 11, 973– Science, 54, 898–906.
983. Gould, W. R., Stefanski, L. A., & Pollock, K. H. (1999). Use of simulation-
Ganguli, B., Staudenmayer, J., & Wand, M. P. (2005). Additive models with extrapolation estimation in catch-effort analyses. Canadian Journal of Fish-
predictors subject to measurement error. Australian and New Zealand Jour- eries and Aquatic Sciences, 56, 1234–1240.
nal of Statistics, 47, 193–202. Gray H. L., Watkins, T. A., & Schucany W. R. (1973). On the jackknife
Ganse, R. A., Amemiya, Y., & Fuller, W. A. (1983). Prediction when both statistic and its relation to UMVU estimators in the normal case. Commu-
variables are subject to error, with application to earthquake magnitude. nications in Statistics, Theory & Methods, 2, 285–320.
Journal of the American Statistical Association, 78, 761–765. Green, P., & Silverman, B. (1994). Nonparametric Regression and Generalized
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to cal- Linear Models. New York: Chapman & Hall.
culating marginal densities. Journal of the American Statistical Association, Greene, W. F., & Cai, J. (2004). Measurement error in covariates in the
85, 398–409. marginal hazards model for multivariate failure time data. Biometrics, 60,
Gelman, A., & Rubin, D. R. (1992). Inference from iterative simulation using 987–996.
multiple sequences. Statistical Science, 7, 457–511. Greenland, S. (1988a). Statistical uncertainty due to misclassification: Im-
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2004). Bayesian Data plications for validation substudies. Journal of Clinical Epidemiology, 41,
Analysis, 2nd ed. London & New York: Chapman & Hall. 1167–1174.
Geman, S., & Geman D. (1984). Stochastic relaxation,, Gibbs distributions, Greenland, S. (1988b). On sample size and power calculations for studies using
and the Bayesian restoration of images. IEEE Transactions on Pattern confidence intervals. American Journal of Epidemiology, 128, 231–236.
Analysis and Machine Intelligence, 6, 721–741. Greenland, S. (1988c). Variance estimation for epidemiologic effect estimates
Geyer, C. J. (1992). Practical Markov chain Monte-Carlo. Statistical Science, under misclassification. Statistics in Medicine, 7, 745–757.
7, 473–511. Greenland, S., & Kleinbaum, D. G. (1983). Correcting for misclassification in
Geyer, C. J., & Thompson, E. A. (1992). Constrained Monte Carlo maximum two-way tables and pair-matched studies. International Journal of Epidemi-
likelihood for dependent data (with discussion). Journal of the Royal Sta- ology, 12, 93–97.
tistical Society, Series B, 54, 657–700. Griffiths, P., & Hill, I. D. (1985). Applied Statistics Algorithms. London: Hor-
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1996). Markov Chain wood.
Monte Carlo in Practice. London & New York: Chapman & Hall. Gustafson, P. (2004). Measurement Error and Misclassification in Statistics
Gleser, L. J. (1981). Estimation in a multivariate errors in variables regression and Epidemiology. Boca Raton: CRC/Chapman & Hall.
model: Large sample results. The Annals of Statistics, 9, 24–44. Gustafson, P. (2005) On model expansion, model contraction, identifiability

and prior information: two illustrative scenarios involving mismeasured vari- cancer. International Journal of Cancer, 49, 335–340.
ables (with discussion),. Statistical Science, 20, 111–140. Holcomb, J.P. (1999). Regression with covariates and outcome calculated from
Gustafson, P., Le, N. D., & Vallée, M. (2002). A Bayesian approach to case- a common set of variables measured with error: Estimation using the SIMEX
control studies with errors in covariables. Biostatistics, 3, 229–243. method. Statistics in Medicine, 18, 2847–2862.
Hall, P. (1989). On projection pursuit regression. The Annals of Statistics, 17, Horowitz, J. L., & Markatou, M. (1993). Semiparametic estimation of regres-
573–588. sion models for panel data. Unpublished.
Hall, P. G. (1992). The Bootstrap and Edgeworth Expansion, New York: Hotelling, H. (1940). The selection of variates for use in prediction with some
Springer-Verlag. comments on the general problem of nuisance parameters. Annals of Math-
Härdle, W. (1990). Applied Nonparametric Regression. New York: Cambridge ematical Statistics, 11, 271–283.
University Press. Hu, C. & Lin, D. Y. (2004). Semiparametric failure time regression with repli-
Hanfelt, J. J. (2003). Conditioning to reduce the sensitivity of general esti- cates of mismeasured covariates. Journal of the American Statistical Asso-
mating functions to nuisance parameters. Biometrika, 90, 517–531. ciation, 99, 105–118.
Hanfelt, J. J., & Liang, K. Y. (1995). Approximate likelihood ratios for general Hu, P., Tsiatis, A. A., & Davidian, M. (1998). Estimating the parameters of the
estimating functions. Biometrika, 82, 461–477. Cox model when covariate variables are measured with errors. Biometrics,
Hanfelt, J. J. ,& Liang, K. Y. (1997). Approximate likelihoods for generalized 54, 1407–1419.
linear errors-in-variables models. Journal of the Royal Statistical Society, Huang, Y., & Wang, C. Y. (2000). Cox regression with accurate covari-
Series B, 59, 627–637. ates unascertainable: A nonparametric correction approach. Journal of the
Hansen, M. H., & Kooperberg, C. (2002). Spline adaptation in extended linear American Statistical Association, 95, 1209–1219.
models (with discussion). Statistical Science, 17, 2–51. Huang, Y., & Wang, C. Y. (2001). Consistent functional methods for logis-
Hasenabeldy, N., Fuller, W. A., & Ware, J. (1988). Indoor air pollution and tic regression with errors in covariates. Journal of the American Statistical
pulmonary performance: investigating errors in exposure assessment. Statis- Association, 96, 1469–1482.
tics in Medicine, 8, 1109–1126. Huang, X., Stefanski, L., & Davidian, M. (2006). Latent-model robustness in
Hastie, T., & Stuetzle, W. (1989). Principal curves. Journal of the American structural measurement error models. Biometrika, 93, 53–64.
Statistical Association, 84, 502–516. Huber, P. J. (1964). Robust estimation of a location parameter. Annals of
Hastie, T., & Tibshirani, R. (1990). Generalized Additive Models, New York: Mathematical Statistics, 35, 73–101.
Chapman & Hall. Huber, P. J. (1967). The behavior of maximum likelihood estimates under
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains nonstandard conditions. Proceedings of the 5th Berkeley Symposium, 1, 221–
and their applications. Biometrika, 57, 97–109. 233.
Hausman, J. A., Newey, W. K., Ichimura, H., & Powell, J. L. (1991). Identifi- Hughes, M. D. (1993). Regression dilution in the proportional hazards model.
cation and estimation of polynomial errors-in-variables models. Journal of Biometrics, 49, 1056–1066.
Econometrics, 50, 273–295. Hunter, W. G., & Lamboy, W. F. (1981). A Bayesian analysis of the linear
Heagerty, P. J. & Kurland, B. F. (2001). Misspecified maximum likelihood calibration problem. Technometrics, 23, 323–328.
estimates and generalized linear mixed models. Biometrika, 88, 973–985. Hwang, J. T. (1986). Multiplicative errors in variables models with applica-
Henderson, M. M., Kushi, L. H., Thompson, D. J., et al. (1990). Feasibility of tions to the recent data released by the U.S. Department of Energy. Journal
a randomized trial of a low-fat diet for the prevention of breast cancer: Di- of the American Statistical Association, 81, 680–688.
etary compliance in the Women’s Health Trial Vanguard Study. Preventive Hwang, W. H., & Huang, S.Y.H. (2003). Estimation in capture-recapture mod-
Medicine, 19, 115–133. els when covariates are subject to measurement errors. Biometrics, 59, 1113–
Higdon, R., & Schafer, D. W. (2001). Maximum likelihood computations for 1122.
regression with measurement error. Computational Statistics & Data Anal- Hwang, J. T., & Stefanski, L. A. (1994). Monotonicity of regression functions
ysis, 35, 283–299. in structural measurement error models. Statistics & Probability Letters, 20,
Higgins, K. M., Davidian, M., & Giltinan, D. M. (1997). A two-step approach 113–116.
to measurement error in time-dependent covariates in nonlinear mixed ef- Iturria, S., Carroll, R. J., & Firth, D. (1999). Multiplicative measurement error
fects models, with application to IGF-1 pharmacokinetics. Journal of the estimation: Estimating equations. Journal of the Royal Statistical Society,
American Statistical Association, 92, 436–448. Series B, 61, 547–562.
Hildesheim, A., Mann, V., Brinton, L. A., Szklo, M., Reeves, W. C., & Rawls, Jeong, M., & Kim, C. (2003). Some properties of SIMEX estimator in partially
W. E. (1991). Herpes Simplex Virus Type 2: A possible interaction with linear measurement error model. Journal of the Korean Statistical Society,
Human Papillomavirus Types 16/18 in the development of invasive cervical 32, 85–92.

Johnson, N. L., & Kotz, S. (1970). Distributions in Statistics, Vol. 2. Boston: biomarker study. American Journal of Epidemiology, 158, 14–21.
Houghton-Mifflin. Ko, H., & Davidian, M. (2000) Correcting for measurement error in individual-
Jones, D. Y., Schatzkin, A., Green, S. B., Block, G., Brinton, L. A., Ziegler, level covariates in nonlinear mixed effects models. Biometrics, 56, 368–375.
R. G., Hoover, R., & Taylor, P. R. (1987). Dietary fat and breast cancer in Kopecky, K. J., Davis, S., Hamilton, T. E., Saporito, M. S., & Onstad, L.
the National Health and Nutrition Survey I: Epidemiologic follow-up study. E. (2004). Estimation of thyroid radiation doses for the Hanford Thyroid
Journal of the National Cancer Institute, 79, 465–471. Disease Study: Results and implications for statistical power of the epidemi-
Kangas, A. S. (1998). Effect of errors-in-variables on coefficients of a growth ological analyses. Health Physics, 87, 15–32.
model and on prediction of growth. Forest Ecology And Management, 102, Küchenhoff, H. (1990). Logit- und Probitregression mit Fehlen in den Vari-
203–212. abeln. Frankfurt am Main: Anton Hain.
Kannel, W. B., Neaton, J. D., Wentworth, D., Thomas, H. E., Stamler, J., Küchenhoff, H., & Carroll, R. J. (1997). Segmented regression with er-
Hulley, S. B., & Kjelsberg, M. O. (1986). Overall and coronary heart disease rors in predictors: Semi-parametric and parametric methods. Statistics in
mortality rates in relation to major risk factors in 325,348 men screened for Medicine, 16, 169–188.
MRFIT. American Heart Journal, 112, 825–836. Küchenhoff, H., Mwalili, S. M., & Lesaffre, E. (2006). A general method for
Kass, R. E., Carlin, B. P., Gelman, A., & Neal, R. M. (1998). Markov chain dealing with misclassification in regression: The misclassification SIMEX.
Monte Carlo in practice: A roundtable discussion. The American Statisti- Biometrics, to appear.
cian, 52, 93–100. Kukush, A., Schneeweiss, H., & Wolf, R. (2004). Three estimators for the
Kauermann, G., & Carroll, R. J. (2001). The Sandwich variance estimator: Ef- Poisson regression model with measurement errors. Statistical Papers, 45,
ficiency properties and coverage probability of confidence intervals. Journal N3, 351–368.
of the American Statistical Association, 96, 1387–1396. Landin, R., Carroll, R. J., & Freedman, L. S. (1995). Adjusting for time trends
Kent, J. T. (1982). Robust properties of likelihood ratio tests. Biometrika, 69, when estimating the relationship between dietary intake obtained from a
19–27. food frequency questionnaire and true average intake. Biometrics, 51, 169–
Kerber, R. L., Till, J. E., Simon, S. L., Lyon, J. L. Thomas, D. C., Preston- 181.
Martin, S., Rollison, M. L., Lloyd, R. D., & Stevens, W. (1993). A cohort Lechner, S., & Pohlmeier, W. (2004). To blank or not to blank? A compari-
study of thyroid disease in relation to fallout from nuclear weapons testing. son of the effects of disclosure limitation methods on nonlinear regression
Journal of the American Medical Association, 270, 2076–2083. estimates. Annals of the New York Academy of Sciences, 3050, 187–200.
Ketellapper, R. H., & Ronner, R. E. (1984). Are robust estimation methods Li, B., & McCullagh, P. (1994). Potential functions and conservative estimat-
useful in the structural errors in variables model? Metrika, 31, 33–41. ing functions. Annals of Statistics, 22, 340–356.
Kim, C., Hong, C., & Jeong, M. (2000). Simulation-extrapolation via the Li, E., Wang, N., and Wang, N-Y. (2005). Joint models for a primary endpoint
Bezier curve in measurement error models. Communications In Statistics— and multivariate longitudinal data. Manuscript.
Simulation and Computation, 29, 1135–1147. Li, E., Zhang, D., & Davidian, M. (2004). Conditional estimation for gener-
Kim, J., & Gleser, L. J. (2000). SIMEX approaches to measurement error alized linear models when covariates are subject-specific parameters in a
in ROC studies. Communications in Statistics—Theory and Methods, 29, mixed model for longitudinal parameters. Biometrics 60, 1–7.
2473–2491. Li, E., Zhang, D., & Davidian, M. (2005). Likelihood and pseudo-likelihood
Kipnis, V., Carroll, R. J., & Freedman, L. S. & Li, L. (1999). A new dietary methods for semiparametric joint models for a primary endpoint and longi-
measurement error model and its application to the estimation of relative tudinal data. Manuscript.
risk: Application to four validation studies. American Journal of Epidemi- Li, L., Palta, M., & Shao, J. (2004). A measurement error model with a Poisson
ology, 150, 642–651. distributed surrogate. Statistics in Medicine, 23, 2527-2536.
Kipnis, V., Midthune, D., Freedman, L. S., Bingham, S., Day, N. E., Riboli, Li, L., Shao, J., & Palta, M. (2005). A longitudinal measurement error model
E., & Carroll, R. J. (2003). Bias in dietary-report instruments and its im- with a semicontinuous covariate. Biometrics, 61, 828–830.
plications for nutritional epidemiology. Public Health Nutrition, 5, 915–923. Li, Y., & Lin, X. (2000). Covariate measurement errors in frailty models for
Kipnis V., Midthune D., Freedman L. S., Bingham S., Schatzkin A., Subar A., clustered survival data. Biometrika, 87, 849–866.
& Carroll R. J. (2001). Empirical evidence of correlated biases in dietary Li, Y., & Lin, X. (2003a). Functional inference in frailty measurement error
assessment instruments and its implications. American Journal of Epidemi- models for clustered survival data using the SIMEX approach. Journal of
ology, 153, 394–403. The American Statistical Association, 98, 191–203.
Kipnis, V., Subar, A. F., Midthune, D., Freedman, L. S., Ballard-Barbash, R., Li, Y., & Lin, X. (2003b). Testing the correlation for clustered categorical and
Troiano, R. Bingham, S., Schoeller, D. A., Schatzkin, A., & Carroll, R. J. censored discrete time-to-event data when covariates are measured with-
(2003). The structure of dietary measurement error: Results of the OPEN out/with errors. Biometrics, 59, 25–35.

Liang, H. (2000). Asymptotic normality of parametric part in partly linear tation and Simulation, 35, 145–167.
models with measurement error in the nonparametric part. Journal of Sta- Liu, K., Stamler, J., Dyer, A., McKeever, J., & McKeever, P. (1978). Statistical
tistical Planning & Inference, 86, 51–62. methods to assess and minimize the role of intra-individual variability in
Liang, H., & Wang, N. (2005). Partially linear single-index measurement error obscuring the relationship between dietary lipids and serum cholesterol.
models. Statistica Sinica, 15, 99–116. Journal of Chronic Diseases, 31, 399–418.
Liang, H., Härdle, W., & Carroll, R. J. (1999). Large sample theory in a Liu, X., & Liang, K. Y. (1992). Efficacy of repeated measures in regression
semiparametric partially linear errors in variables model. The Annals of models with measurement error. Biometrics, 48, 645–654.
Statistics, 27, 1519–1535. Lord, F. M. (1960). Large sample covariance analysis when the control variable
Liang, H., Wu, H., & Carroll, R. J. (2003). The relationship between viro- is fallible. Journal of the American Statistical Association, 55, 307–321.
logic and immunologic responses in AIDS clinical research using mixed- Luo, M., Stokes, L., & Sager, T. (1998). Estimation of the CDF of a finite
effects varying-coefficient semiparametric models with measurement error. population in the presence of a calibration sample. Environmental and Eco-
Biostatistics, 4, 297–312. logical Statistics, 5, 277–289.
Liang, K. Y., & Liu, X. H. (1991). Estimating equations in generalized lin- Lubin, J. H., Schafer, D. W. Ron, E. Stovall, M., & Carroll, R. J. (2004). A
ear models with measurement error. In V. P. Godambe (Ed.) Estimating reanalysis of thyroid neoplasms in the Israeli tinea capitis study accounting
Functions, Oxford: Clarendon Press. for dose uncertainties. Radiation Research, 161, 359–368.
Liang, K. Y., & Tsou, D. (1992). Empirical Bayes and conditional inference Lyles, R. H., & Kupper, L. L. (1997). A detailed evaluation of adjustment
with many nuisance parameters. Biometrika, 79, 261–270. methods for multiplicative measurement error in linear regression with ap-
Liang, K. Y., & Zeger, S. L. (1995). Inference based on estimating functions plications in occupational epidemiology. Biometrics, 53, 1008–1025.
in the presence of nuisance parameters. Statistical Science 10, 158–173. Lyles, R., Williamson, J., Lin, H.-M., & Heilig, C. (2005). Extending McNe-
Lin, D. Y., & Ying, Z. (1993). Cox regression with incomplete covariate mea- mar’s test: Estimation and inference when paired binary outcome data are
surements. Journal of the American Statistical Association, 88, 1341–1349. misclassified. Biometrics, 61, 287–294.
Lin, X., & Breslow, N. (1996). Bias correction in generalized linear mixed Lyles, R. H., Munoz, A., Xu, J., Taylor, J. M. G., & Chmiel, J. S. (1999).
models with multiple components of dispersion. Journal of the American Adjusting for measurement error to assess health effects of variability in
Statistical Association, 91, 1007–1016. biomarkers. Statistics in Medicine, 18, 1069–1086.
Lin, X., & Carroll, R. J. (1999). SIMEX variance component tests in general- Lyon, J. L., Alder, S. C., Stone, M. B., Scholl, A., Reading, J. C. Holubkov,
ized linear mixed measurement error models. Biometrics, 55, 613–619. R., Sheng, X. White, G. L., Hegmann, K. T., Anspaugh, L., Hoffman, F.
Lin, X., & Carroll, R. J. (2000). Nonparametric function estimation for clus- O., Simon, S. L., Thomas, B., Carroll, R. J., & Meikle, A. W. (2006). Thy-
tered data when the predictor is measured without/with error. Journal of roid disease associated with exposure to the Nevada Test Site radiation: A
the American Statistical Association, 95, 520–534. reevaluation based on corrected dosimetry and examination data. Preprint.
Lindley, D. V. (1953). Estimation of a functional relationship. Biometrika, 40, Ma, Y., & Carroll, R. J. (2006). Locally efficient estimators for semiparametric
47–49. models with measurement error. Journal of the American Statistical Asso-
Lindley, D. V., & El Sayyad, G. M. (1968). The Bayesian estimation of a linear ciation, to appear.
functional relationship. Journal of the Royal Statistical Society, Series B, MacMahon, S., Peto, R., Cutler, J., Collins, R., Sorlie, P., Neaton, J., Ab-
30, 190–202. bott, R., Godwin, J., Dyer, A., & Stamler, J. (1990). Blood pressure, stroke
Lindsay, B. G. (1982). Conditional score functions: Some optimality results. and coronary heart disease: Part 1, prolonged differences in blood pressure:
Biometrika, 69, 503–512. Prospective observational studies corrected for the regression dilution bias.
Lindsay, B. G. (1983). The geometry of mixture likelihoods, Part I: A general Lancet, 335, 765–774.
theory. The Annals of Statistics, 11, 86–94. Madansky, A. (1959). The fitting of straight lines when both variables are
Lindsay, B. G. (1985). Using empirical partially Bayes inference for increased subject to error. Journal of the American Statistical Association, 54, 173–
efficiency. The Annals of Statistics, 13, 914–32. 205.
Little, R. J. A., & Rubin, D. B. (2002). Statistical Analysis with Missing Data, Mallick, B. K., & Gelfand, A. E. (1996). Semiparametric errors-in-variables
2nd ed. New York: John Wiley & Sons. models: A Bayesian approach. Journal of Statistical Planning and Inference,
Liu, M. C., & Taylor, R. L. (1989). A consistent nonparametric density esti- 52, 307–321.
mator for the deconvolution problem. Canadian Journal of Statistics, 17, Mallick, B., Hoffman, F., & Carroll, R. (2002). Semiparametric regression mod-
399–410. eling with mixtures of Berkson and classical error, with application to fallout
Liu, M. C., & Taylor, R. L. (1990). Simulation and computation of a nonpara- from the Nevada test site. Biometrics, 58, 13–20.
metric density estimator for the deconvolution problem. Statistical Compu- Marazzi, A. (1980). ROBETH, a subroutine library for robust statistical proce-

dures. COMPSTAT 1980, Proceedings in Computational Statistics, Vienna: Newey, W. K. (1991). Semiparametric efficiency bounds. Journal of Applied
Physica. Econometrics, 5, 99–135.
Marcus, A. H., & Elias, R. W. (1998) Some useful statistical methods for Novick, S. J., & Stefanski, L. A. (2002). Corrected score estimation via com-
model validation. Environmental Health Perspectives, 106, 1541–1550. plex variable simulation extrapolation. Journal of the American Statistical
Marschner, I. C., Emberson, J., Irwig, L., & Walter, S. D. (2004). The num- Association, 97, 472–481.
ber needed to treat (NNT) can be adjusted for bias when the outcome is Nummi, T. (2000). Analysis of growth curves under measurement errors. Jour-
measured with error. Journal of Clinical Epidemiology, 57, 1244–1252. nal of Applied Statistics, 27, 235–243.
Marsh-Manzi, J., Crainiceanu, C. M., Astor, B. C., Powe, N. R., Klag, M. Palmgren, J. (1987). Precision of double sampling estimators for comparing
J., Taylor, H. A., & Coresh, J. (2005). Increased risk of CKD progression two probabilities. Biometrika, 74, 687–694.
and ESRD in African Americans: The Atherosclerosis Risk in Communities Palmgren, J., & Ekholm, A. (1987). Exponential family non-linear models for
(ARIC) Study. Submitted. categorical data with errors of observation. Applied Stochastic Models and
Masry, E., & Rice, J. A. (1992). Gaussian deconvolution via differentiation. Data Analysis, 3, 111–124.
Canadian Journal of Statistics, 20, 9–21. Palta, M., & Lin, C.-Y. (1999). Latent variables, measurement error and
McCullagh, P. (1980). Regression models for ordinal data (with discussion). methods for analysing longitudinal binary and ordinal data. Statistics in
Journal of the Royal Statistical Society, Series B, 42, 109–142. Medicine, 18, 385–396.
McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Paulino, C. D., Soares, P., & Neuhaus, J. (2003). Binomial regression with
London: Chapman & Hall. misclassification. Biometrics, 59, 670–675.
McCulloch, C. E., & Searle, S. R. (2001). Generalized, Linear, and Mixed Pearson, K. (1902). On the mathematical theory of errors of judgment. Philo-
Models. New York: John Wiley & Sons. sophical Transactions of the Royal Society of London A, 198, 235–299.
McFadden, D. (1989). A method of simulated moments for estimation of dis- Pepe, M. S. (1992). Inference using surrogate outcome data and a validation
crete response models without numerical integration. Econometrica, 57, sample. Biometrika, 79, 355–365.
239–265. Pepe, M. S., & Fleming, T. R. (1991). A general nonparametric method for
McLeish, D. L., & Small, C. G. (1988). The Theory and Applications of Sta- dealing with errors in missing or surrogate covariate data. Journal of the
tistical Inference Functions. New York: Springer-Verlag. American Statistical Association, 86, 108–113.
McShane, L., Midthune, D. N., Dorgan, J. F., Freedman, L. S., & Carroll, R. J. Pepe, M. S., Reilly, M., & Fleming, T. R. (1994). Auxilliary outcome data and
(2001). Covariate measurement error adjustment for matched case-control the mean score method. Journal of Statistical Planning and Inference, 42,
studies. Biometrics, 57, 62–73. 137–160.
Mengersen, K. L., Robert, C. P., & Guihenneuc-Jouyaux, C. (1999). MCMC Pepe, M. S., Self, S. G., & Prentice, R. L. (1989). Further results in covariate
convergence diagnostics: A reviewww, In J. M. Bernardo, J. O. Berger, A. measurement errors in cohort studies with time to response data. Statistics
F. Dawid & A. F. M. Smith, (Eds.), Bayesian Statistics 6 Oxford: Oxford in Medicine, 8, 1167–1178.
University Press. Pierce, D. A., & Kellerer, A. M. (2004). Adjusting for covariate errors with
Miller, R. G. (1998). Survival Analysis. New York: John Wiley & Sons. nonparametric assessment of the true covariate distribution. Biometrika, 91,
Monahan, J., & Stefanski, L. A. (1992). Normal scale mixture approximations 863–876.
to F ∗ (z) and computation of the logistic-normal integral. In N. Balakrishnan Pierce, D. A., Stram, D. O., Vaeth, M., & Schafer, D. (1992). Some insights
(Ed.) Handbook of the Logistic Distribution (pp. 529–540). New York: Marcel into the errors in variables problem provided by consideration of radiation
Dekker. dose-response analyses for the A-bomb survivors. Journal of the American
Müller, H-G. (1988). Nonparametric Analysis of Longitudinal Data. Berlin: Statistical Association, 87, 351–359.
Springer-Verlag. Polson, N. G. (1996). Convergence of Markov chain Monte Carlo algorithms,
Müller, P., & Roeder, K. (1997). A Bayesian semiparametric model for case- In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.),
control studies with errors in variables. Biometrika, 84, 523–537. Bayesian Statistics, 5 Oxford: Oxford University Press.
Nakamura, T. (1990). Corrected score functions for errors-in-variables models: Polzehl, J., & Zwanzig, S. (2004). On a symmetrized simulation extrapolation
Methodology and application to generalized linear models. Biometrika, 77, estimator in linear errors-in-variables models. Computational Statistics &
127–137. Data Analysis, 47, 675–688.
Nakamura, T. (1992). Proportional hazards models with covariates subject to Powe N. R., Klag M. J., Sadler J. H., Anderson G. F., Bass E. B., Briggs W.
measurement error. Biometrics, 48, 829–838. A., Fink N. E., Levey A. S., Levin N. W., Meyer K. B., Rubin H. R., & Wu
Neuhaus, J. M. (2002). Analysis of clustered and longitudinal binary data A. W. (1996). Choices for healthy outcomes in caring for end stage renal
subject to response misclassification. Biometrics, 58, 675–683. disease. Seminars in Dialysis 9, 9–11.

Prentice, R. L. (1976). Use of the logistic model in retrospective studies. Biometrics, 32, 599–606.
Prentice, R. L. (1982). Covariate measurement errors and parameter estimation in a failure time regression model. Biometrika, 69, 331–342.
Prentice, R. L. (1989). Surrogate endpoints in clinical trials: Definition and operational criteria. Statistics in Medicine, 8, 431–440.
Prentice, R. L., Pepe, M., & Self, S. G. (1989). Dietary fat and breast cancer: A review of the literature and a discussion of methodologic issues. Cancer Research, 49, 3147–3156.
Prentice, R. L., & Pyke, R. (1979). Logistic disease incidence models and case-control studies. Biometrika, 66, 403–411.
Prescott, G. J., & Garthwaite, P. H. (2002). A simple Bayesian analysis of misclassified binary data with a validation substudy. Biometrics, 58, 454–458.
Ramalho, E. A. (2002). Regression models for choice-based samples with misclassification in the response variable. Journal of Econometrics, 106, 171–201.
Rao, C. R. (1947). Large-sample test of statistical hypotheses concerning several parameters with applications to problems of estimation. Proceedings of the Cambridge Philosophical Society, 44, 50–57.
Rathouz, P. J., & Liang, K. Y. (1999). Reducing sensitivity to nuisance parameters in semiparametric models: A quasi-score method. Biometrika, 86, 857–869.
Reeves, G. K., Cox, D. R., Darby, S. C., & Whitley, E. (1998). Some aspects of measurement error in explanatory variables for continuous and binary regression models. Statistics in Medicine, 17, 2157–2177.
Reilly, M., & Pepe, M. S. (1994). A mean score method for missing and auxiliary covariate data in regression models. Biometrika, 82, 299–314.
Reiser, B. (2000). Measuring the effectiveness of diagnostic markers in the presence of measurement error through the use of ROC curves. Statistics in Medicine, 19, 2115–2159.
Richardson, S., & Gilks, W. R. (1993). A Bayesian approach to measurement error problems in epidemiology using conditional independence models. American Journal of Epidemiology, 138, 430–442.
Richardson, S., & Gilks, W. R. (1993). Conditional independence models for epidemiological studies with covariate measurement error. Statistics in Medicine, 12, 1703–1722.
Richardson, S., Leblond, L., Jaussent, I., & Green, P. J. (2002). Mixture models in measurement error problems, with reference to epidemiological studies. Journal of the Royal Statistical Society, Series A, 165, 549–566.
Ritter, C., & Tanner, M. A. (1992). Facilitating the Gibbs sampler: The Gibbs stopper and the griddy Gibbs stopper. Journal of the American Statistical Association, 87, 861–868.
Roberts, G. O., Gelman, A., & Gilks, W. R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7, 110–120.
Roberts, G. O., & Rosenthal, J. S. (2001). Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16, 351–367.
Rocke, D., & Durbin, B. (2001). A model for measurement error for gene expression arrays. Journal of Computational Biology, 8, 557–569.
Roeder, K., Carroll, R. J., & Lindsay, B. G. (1996). A semiparametric mixture approach to case-control studies with errors in covariables. Journal of the American Statistical Association, 91, 722–732.
Ronchetti, E. (1982). Robust testing in linear models: The infinitesimal approach. Ph.D. thesis, ETH, Zurich.
Rosner, B., Spiegelman, D., & Willett, W. C. (1990). Correction of logistic regression relative risk estimates and confidence intervals for measurement error: The case of multiple covariates measured with error. American Journal of Epidemiology, 132, 734–745.
Rosner, B., Willett, W. C., & Spiegelman, D. (1989). Correction of logistic regression relative risk estimates and confidence intervals for systematic within-person measurement error. Statistics in Medicine, 8, 1051–1070.
Rudemo, M., Ruppert, D., & Streibig, J. C. (1989). Random effect models in nonlinear regression with applications to bioassay. Biometrics, 45, 349–362.
Ruppert, D. (1985). M-estimators. In S. Kotz & N. L. Johnson (Eds.), Encyclopedia of Statistical Sciences, vol. 5 (pp. 443–449). New York: John Wiley & Sons.
Ruppert, D. (1997). Empirical-bias bandwidths for local polynomial nonparametric regression and density estimation. Journal of the American Statistical Association, 92, 1049–1062.
Ruppert, D. (2002). Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics, 11, 735–757.
Ruppert, D., & Carroll, R. J. (2000). Spatially adaptive penalties for spline fitting. Australian & New Zealand Journal of Statistics, 42, 205–223.
Ruppert, D., & Wand, M. P. (1994). Multivariate locally weighted least squares regression. The Annals of Statistics, 22, 1346–1370.
Ruppert, D., Carroll, R. J., & Cressie, N. (1989). A transformation/weighting model for estimating Michaelis-Menten parameters. Biometrics, 45, 637–362.
Ruppert, D., Wand, M. P., & Carroll, R. J. (2003). Semiparametric Regression. Cambridge, UK: Cambridge University Press.
Satten, G. A., & Kupper, L. L. (1993). Inferences about exposure-disease association using probability of exposure information. Journal of the American Statistical Association, 88, 200–208.
Schafer, D. W. (1987). Covariate measurement error in generalized linear models. Biometrika, 74, 385–391.
Schafer, D. W. (1992). Replacement methods for measurement error models. Unpublished.
Schafer, D. W. (1993). Likelihood analysis for probit regression with measurement errors. Biometrika, 80, 899–904.
Schafer, D. W. (2002). Likelihood analysis and flexible structural modeling for measurement error model regression. Journal of Statistical Computation and Simulation, 72, 33–45.
Schafer, D., & James, I. R. (1991). Weibull regression with covariate measurement errors and assessment of unemployment duration dependence. Unpublished.
Schafer, D. W., & Purdy, K. (1996). Likelihood analysis for errors-in-variables regression with replicate measurement. Biometrika, 83, 813–824.
Schafer, D. W., Lubin, J. H., Ron, E., Stovall, M., & Carroll, R. J. (2001). Thyroid cancer following scalp irradiation: A reanalysis accounting for uncertainty in dosimetry. Biometrics, 57, 689–697.
Schafer, D. W., Stefanski, L. A., & Carroll, R. J. (1999). Consideration of measurement errors in the international radiation study of cervical cancer. In E. Ron & F. O. Hoffman (Eds.), Uncertainties in Radiation Dosimetry and Their Impact on Dose-Response Analysis. National Cancer Institute Press.
Schafer, J. (1997). The Analysis of Incomplete Multivariate Data. New York: Chapman & Hall/CRC.
Schatzkin, A., Kipnis, V., Subar, A. F., Midthune, D., Carroll, R. J., Bingham, S., Schoeller, D. A., Troiano, R., & Freedman, L. S. (2003). A comparison of a food frequency questionnaire with a 24-hour recall for use in an epidemiological cohort study: Results from the biomarker-based OPEN study. International Journal of Epidemiology, 32, 1054–1062.
Schennach, S. M. (2006). Instrumental variables estimation of nonlinear errors-in-variables models. Preprint available at http://home.uchicago.edu/∼smschenn/cv schennach.pdf.
Schennach, S. M. (2004a). Estimation of nonlinear models with measurement error. Econometrica, 72, 33–75.
Schennach, S. M. (2004b). Nonparametric regression in the presence of measurement error. Econometric Theory, 20, 1046–1093.
Schmid, C. H., & Rosner, B. (1993). A Bayesian approach to logistic regression models having measurement error following a mixture distribution. Statistics in Medicine, 12, 1141–1153.
Schmid, C. H., Segal, M. R., & Rosner, B. (1994). Incorporating measurement error in the estimation of autoregressive models for longitudinal data. Journal of Statistical Planning and Inference, 42, 1–18.
Schrader, R. M., & Hettmansperger, T. P. (1980). Robust analysis of variance based upon a likelihood criterion. Biometrika, 67, 93–101.
Searle, S. R., Casella, G., & McCulloch, C. E. (1992). Variance Components. New York: John Wiley & Sons.
Sepanski, J. H. (1992). Score tests in a generalized linear model with surrogate covariates. Statistics & Probability Letters, 15, 1–10.
Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. New York: John Wiley & Sons.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. London & New York: Chapman & Hall.
Simon, S. L., Till, J. E., Lloyd, R. D., Kerber, R. L., Thomas, D. C., Preston-Martin, S., Lyon, J. L., & Stevens, W. (1995). The Utah Leukemia case-control study: Dosimetry methodology and results. Health Physics, 68, 460–471.
Small, C. G., Wang, J., & Yang, Z. (2000). Eliminating multiple root problems in estimation. Statistical Science, 15, 313–341.
Smith, A. F. M., & Gelfand, A. E. (1992). Bayesian statistics without tears: A sampling-resampling perspective. American Statistician, 46, 84–88.
Solow, A. R. (1998). On fitting a population model in the presence of observation error. Ecology, 79, 1463–1466.
Song, X., & Huang, Y. (2005). On corrected score approach for proportional hazards model with covariate measurement error. Biometrics, 61, 702–714.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B, 64, 583–616.
Spiegelman, C. H. (1986). Two pitfalls of using standard regression diagnostics when both X and Y have measurement error. The American Statistician, 40, 245–248.
Spiegelman, D. (1994). Cost-efficient study designs for relative risk modeling with covariate measurement error. Journal of Statistical Planning and Inference, 42, 187–208.
Spiegelman, D., & Casella, M. (1997). Fully parametric and semi-parametric regression models for common events with covariate measurement error in main study/validation study designs. Biometrics, 53, 395–409.
Sposto, R., Preston, D. L., Shimizu, Y., & Mabuchi, K. (1992). The effect of diagnostic misclassification on non-cancer and cancer mortality dose response in A-bomb survivors. Biometrics, 48, 605–618.
Staudenmayer, J., & Spiegelman, D. (2002). Segmented regression in the presence of covariate measurement error in main study/validation study designs. Biometrics, 58, 871–877.
Staudenmayer, J., & Ruppert, D. (2004). Local polynomial regression and simulation-extrapolation. Journal of the Royal Statistical Society, Series B, 66, 17–30.
Stefanski, L. A. (1985). The effects of measurement error on parameter estimation. Biometrika, 72, 583–592.
Stefanski, L. A. (1989). Unbiased estimation of a nonlinear function of a normal mean with application to measurement error models. Communications in Statistics, Series A, 18, 4335–4358.
Stefanski, L. A. (1990). Rates of convergence of some estimators in a class of deconvolution problems. Statistics & Probability Letters, 9, 229–235.
Stefanski, L. A., & Bay, J. M. (1996). Simulation extrapolation deconvolution of finite population cumulative distribution function estimators. Biometrika, 83, 407–417.
Stefanski, L. A., & Buzas, J. S. (1995). Instrumental variable estimation in binary regression measurement error models. Journal of the American Statistical Association, 90, 541–550.
Stefanski, L. A., & Carroll, R. J. (1985). Covariate measurement error in logistic regression. Annals of Statistics, 13, 1335–1351.
Stefanski, L. A., & Carroll, R. J. (1986). Deconvoluting kernel density estimators. Statistics, 21, 169–184.
Stefanski, L. A., & Carroll, R. J. (1987). Conditional scores and optimal scores in generalized linear measurement error models. Biometrika, 74, 703–716.
Stefanski, L. A., & Carroll, R. J. (1990a). Score tests in generalized linear measurement error models. Journal of the Royal Statistical Society, Series B, 52, 345–359.
Stefanski, L. A., & Carroll, R. J. (1990b). Structural logistic regression measurement error models. In P. J. Brown & W. A. Fuller (Eds.), Statistical analysis of measurement error models and applications: Proceedings of the AMS-IMS-SIAM joint summer research conference held June 10-16, 1989, with support from the National Science Foundation and the U.S. Army Research Office. Providence, RI: American Mathematical Society.
Stefanski, L. A., & Carroll, R. J. (1990c). Deconvoluting kernel density estimators. Statistics, 21, 165–184.
Stefanski, L. A., & Carroll, R. J. (1991). Deconvolution based score tests in measurement error models. The Annals of Statistics, 19, 249–259.
Stefanski, L. A., & Cook, J. (1995). Simulation extrapolation: The measurement error jackknife. Journal of the American Statistical Association, 90, 1247–1256.
Stefanski, L. A., Novick, S. J., & Devanarayan, V. (2005). Estimating a nonlinear function of a normal mean. Biometrika, 92, 732–736.
Stephens, D. A., & Dellaportas, P. (1992). Bayesian analysis of generalized linear models with covariate measurement error. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian Statistics 4 (pp. 813–820). Oxford: Oxford University Press.
Stevens, W., Till, J. E., Thomas, D. C., et al. (1992). Assessment of leukemia and thyroid disease in relation to fallout in Utah: Report of a cohort study of thyroid disease and radioactive fallout from the Nevada test site. Salt Lake City: University of Utah.
Stone, C. J., Hansen, M. H., Kooperberg, C., & Truong, Y. K. (1997). Polynomial splines and their tensor products in extended linear modeling. The Annals of Statistics, 25, 1371–1425.
Stram, D. O., & Kopecky, K. J. (2003). Power and uncertainty analysis of epidemiological studies of radiation-related disease risk in which dose estimates are based on a complex dosimetry system: Some observations. Radiation Research, 160, 408–417.
Subar, A. F., Kipnis, V., Troiano, R. P., Midthune, D., Schoeller, D. A., Bingham, S., Sharbaugh, C. O., Trabulsi, J., Runswick, S., Ballard-Barbash, R., Sunshine, J., & Schatzkin, A. (2003). Using intake biomarkers to evaluate the extent of dietary misreporting in a large sample of adults: The Observing Protein and Energy Nutrition (OPEN) study. American Journal of Epidemiology, 158, 1–13.
Subar, A. F., Thompson, F. E., Kipnis, V., Midthune, D., Hurwitz, P., McNutt, S., McIntosh, A., & Rosenfeld, S. (2001). Comparative validation of the Block, Willett and National Cancer Institute food frequency questionnaires: The Eating at America’s Table Study. American Journal of Epidemiology, 154, 1089–1099.
Tadesse, M., Ibrahim, J., Gentleman, R., Chiaretti, S., Ritz, J., & Foa, R. (2005). Bayesian error-in-variable survival model for the analysis of GeneChip arrays. Biometrics, 61, 488–497.
Tan, C. Y., & Iglewicz, B. (1999). Measurement-methods comparisons and linear statistical relationship. Technometrics, 41, 192–201.
Tanner, M. A. (1993). Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions (2nd ed.). New York: Springer-Verlag.
Taupin, M. L. (2001). Semi-parametric estimation in the nonlinear structural errors-in-variables model. The Annals of Statistics, 29, 66–93.
Thisted, R. A. (1988). Elements of Statistical Computing. New York & London: Chapman & Hall.
Thomas, D. C., Gauderman, J., & Kerber, R. (1993). A nonparametric Monte-Carlo approach to adjustment for covariate measurement errors in regression analysis. Unpublished.
Thomas, D., Stram, D., & Dwyer, J. (1993). Exposure measurement error: Influence on exposure-disease relationships and methods of correction. Annual Review of Public Health, 14, 69–93.
Thompson, F. E., Sowers, M. F., Frongillo, E. A., & Parpia, B. J. (1992). Sources of fiber and fat in diets of U.S. women aged 19–50: Implications for nutrition education and policy. American Journal of Public Health, 82, 695–718.
Tosteson, T., & Tsiatis, A. (1988). The asymptotic relative efficiency of score tests in a generalized linear model with surrogate covariates. Biometrika, 75, 507–514.
Tosteson, T. D., & Ware, J. H. (1990). Designing a logistic regression study using surrogate measures of exposure and outcome. Biometrika, 77, 11–20.
Tosteson, T., Buonaccorsi, J., & Demidenko, E. (1998). Covariate measurement error and the estimation of random effect parameters in a mixed model for longitudinal data. Statistics in Medicine, 17, 1959–1971.
Tosteson, T., Stefanski, L. A., & Schafer, D. W. (1989). A measurement error model for binary and ordinal regression. Statistics in Medicine, 8, 1139–1147.
Tosteson, T., Buzas, J., Demidenko, E., & Karagas, M. (2003). Power and sample size calculations for generalized regression models with covariate measurement error. Statistics in Medicine, 22, 1069–1082.
Tsiatis, A. A., & Davidian, M. (2001). A semiparametric estimator for the proportional hazards model with longitudinal covariates measured with error. Biometrika, 88, 447–458.
Tsiatis, A. A., & Ma, Y. (2004). Locally efficient semiparametric estimators for functional measurement error models. Biometrika, 91, 835–848.
Turnbull, B. W., Jiang, W., & Clark, L. C. (1997). Regression models for recurrent event data: Parametric random effects models with measurement error. Statistics in Medicine, 16, 853–864.
Ulm, K. (1991). A statistical method for assessing a threshold in epidemiological studies. Statistics in Medicine, 10, 341–349.
United States Renal Data System. (2003). USRDS 2003 Annual Data Report. Bethesda, MD: National Institutes of Health, National Institute of Diabetes and Digestive and Kidney Diseases.
van der Vaart, A. (1988). Estimating a real parameter in a class of semiparametric models. The Annals of Statistics, 16, 1450–1474.
Verbeke, G., & Molenberghs, G. (2000). Linear Mixed Models for Longitudinal Data. New York: Springer-Verlag.
Wand, M. P. (1998). Finite sample performance of deconvolving density estimators. Statistics & Probability Letters, 37, 131–139.
Wang, C. Y., & Carroll, R. J. (1994). Robust estimation in case-control studies with errors in predictors. In J. O. Berger & S. S. Gupta (Eds.), Statistical Decision Theory and Related Topics, V. New York: Springer-Verlag.
Wang, C. Y., & Pepe, M. S. (2000). Expected estimating equations to accommodate covariate measurement error. Journal of the Royal Statistical Society, Series B, 62, 509–524.
Wang, C. Y., Wang, N., & Wang, S. (2000). Regression analysis when covariates are regression parameters of a random effects model for observed longitudinal measurements. Biometrics, 56, 487–495.
Wang, C. Y., Wang, S., & Carroll, R. J. (1997). Estimation in choice-based sampling with measurement error and bootstrap analysis. Journal of Econometrics, 77, 65–86.
Wang, C. Y., Hsu, L., Feng, Z. D., & Prentice, R. L. (1997). Regression calibration in failure time regression. Biometrics, 53, 131–145.
Wang, N., & Davidian, M. (1996). A note on covariate measurement error in nonlinear mixed effects models. Biometrika, 83, 801–812.
Wang, N., Carroll, R. J., & Liang, K. Y. (1996). Quasilikelihood estimation in measurement error models with correlated replicates. Biometrics, 52, 401–411.
Wang, N., Lin, X., & Gutierrez, R. (1999). A bias correction regression calibration approach in generalized linear mixed measurement error model. Communications in Statistics, Series A, Theory and Methods, 28, 217–232.
Wang, N., Lin, X., Gutierrez, R. G., & Carroll, R. J. (1998). Generalized linear mixed measurement error models. Journal of the American Statistical Association, 93, 249–261.
Wannemuehler, K. A., & Lyles, R. H. (2005). A unified model for covariate measurement error adjustment in an occupational health study while accounting for non-detectable exposures. Applied Statistics, Journal of the Royal Statistical Society, Series C, 54, 259–271.
Wasserman, L., & Roeder, K. (1997). Bayesian density estimation using mixtures of normals. Journal of the American Statistical Association, 90, 1247–1256.
Weinberg, C. R., Umbach, D. M., & Greenland, S. (1993). When will nondifferential misclassification preserve the direction of a trend? American Journal of Epidemiology, 140, 565–571.
Weinberg, C. R., & Wacholder, S. (1993). Prospective analysis of case-control data under general multiplicative-intercept models. Biometrika, 80, 461–465.
Whittemore, A. S. (1989). Errors in variables regression using Stein estimates. American Statistician, 43, 226–228.
Whittemore, A. S., & Gong, G. (1991). Poisson regression with misclassified counts: Application to cervical cancer mortality rates. Applied Statistics, 40, 81–93.
Whittemore, A. S., & Keller, J. B. (1988). Approximations for regression with covariate measurement error. Journal of the American Statistical Association, 83, 1057–1066.
Willett, W. C. (1989). An overview of issues related to the correction of nondifferential exposure measurement error in epidemiologic studies. Statistics in Medicine, 8, 1031–1040.
Willett, W. C., Meir, J. S., Colditz, G. A., Rosner, B. A., Hennekens, C. H., & Speizer, F. E. (1987). Dietary fat and the risk of breast cancer. New England Journal of Medicine, 316, 22–25.
Willett, W. C., Sampson, L., Stampfer, M. J., Rosner, B., Bain, C., Witschi, J., Hennekens, C. H., & Speizer, F. E. (1985). Reproducibility and validity of a semiquantitative food frequency questionnaire. American Journal of Epidemiology, 122, 51–65.
Wolter, K. M., & Fuller, W. A. (1982a). Estimation of the quadratic errors in variables model. Biometrika, 69, 175–182.
Wu, L. (2002). A joint model for nonlinear mixed-effects models with censoring and covariates measured with error, with application to AIDS studies. Journal of the American Statistical Association, 97, 955–964.
Wu, M. L., Whittemore, A. S., & Jung, D. L. (1986). Errors in reported dietary intakes. American Journal of Epidemiology, 124, 826–835.
Zamar, R. H. (1988). Orthogonal regression M-estimators. Biometrika, 76, 149–154.
Zamar, R. H. (1992). Bias-robust estimation in the errors in variables model. The Annals of Statistics, 20, 1875–1888.
Zeger, S. L., & Karim, M. R. (1991). Generalized linear models with random effects: A Gibbs sampling approach. Journal of the American Statistical Association, 86, 79–86.
Zhang, C. H. (1990). Fourier methods for estimating mixing densities and distributions. The Annals of Statistics, 18, 806–831.
Zhang, D., & Davidian, M. (2001). Linear mixed models with flexible distributions of random effects for longitudinal data. Biometrics, 57, 795–802.
Zhao, L. P., Prentice, R. L., & Self, S. G. (1992). Multivariate mean parameter estimation by using a partly exponential model. Journal of the Royal Statistical Society, Series B, 54, 805–812.
Zhou, H., & Pepe, M. S. (1995). Auxiliary covariate data in failure time regression. Biometrika, 82, 139–149.
Zhou, H., & Wang, C. Y. (2000). Failure time regression with continuous covariates measured with error. Journal of the Royal Statistical Society, Series B, 62, 657–665.
Zhu, L., & Cui, H. (2003). A semi-parametric regression model with errors in variables. Scandinavian Journal of Statistics, 30, 429–442.
Zidek, J. V., Le, N. D., Wong, H., & Burnett, R. T. (1998). Including structural measurement errors in the nonlinear regression analysis of clustered data. The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 26, 537–548.
Zidek, J. V., White, R., Le, N. D., Sun, W., & Burnett, R. T. (1998). Imputing unmeasured explanatory variables in environmental epidemiology with application to health impact analysis of air pollution. Ecological and Environmental Statistics, 5, 99–115.