
Definitions, methods, and applications in interpretable machine learning

W. James Murdoch^a,1, Chandan Singh^b,1, Karl Kumbier^a,2, Reza Abbasi-Asl^b,c,d,2, and Bin Yu^a,b,3

^a Statistics Department, University of California, Berkeley, CA 94720; ^b Electrical Engineering and Computer Science Department, University of California, Berkeley, CA 94720; ^c Department of Neurology, University of California, San Francisco, CA 94158; and ^d Allen Institute for Brain Science, Seattle, WA 98109

Contributed by Bin Yu, July 1, 2019 (sent for review January 16, 2019; reviewed by Rich Caruana and Giles Hooker)

Author contributions: W.J.M., C.S., K.K., R.A.-A., and B.Y. designed research; W.J.M., C.S., K.K., and R.A.-A. performed research; and W.J.M. and C.S. wrote the paper.
Reviewers: R.C., Microsoft Research; and G.H., Cornell University.
The authors declare no competing interest.
Published under the PNAS license.
1 W.J.M. and C.S. contributed equally to this work.
2 K.K. and R.A.-A. contributed equally to this work.
3 To whom correspondence may be addressed. Email: [email protected]
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1900654116/-/DCSupplemental.
First published October 16, 2019.
* For clarity, throughout this paper we use the term “model” to refer to both machine-learning models and algorithms.
† Examples were selected through a nonexhaustive search of related work.

Abstract

Machine-learning models have demonstrated great success in learning complex patterns that enable them to make predictions about unobserved data. In addition to using models for prediction, the ability to interpret what a model has learned is receiving an increasing amount of attention. However, this increased focus has led to considerable confusion about the notion of interpretability. In particular, it is unclear how the wide array of proposed interpretation methods are related and what common concepts can be used to evaluate them. We aim to address these concerns by defining interpretability in the context of machine learning and introducing the predictive, descriptive, relevant (PDR) framework for discussing interpretations. The PDR framework provides 3 overarching desiderata for evaluation: predictive accuracy, descriptive accuracy, and relevancy, with relevancy judged relative to a human audience. Moreover, to help manage the deluge of interpretation methods, we introduce a categorization of existing techniques into model-based and post hoc categories, with subgroups including sparsity, modularity, and simulatability. To demonstrate how practitioners can use the PDR framework to evaluate and understand interpretations, we provide numerous real-world examples. These examples highlight the often underappreciated role played by human audiences in discussions of interpretability. Finally, based on our framework, we discuss limitations of existing methods and directions for future work. We hope that this work will provide a common vocabulary that will make it easier for both practitioners and researchers to discuss and choose from the full range of interpretation methods.

interpretability | machine learning | explainability | relevancy

Significance

The recent surge in interpretability research has led to confusion on numerous fronts. In particular, it is unclear what it means to be interpretable and how to select, evaluate, or even discuss methods for producing interpretations of machine-learning models. We aim to clarify these concerns by defining interpretable machine learning and constructing a unifying framework for existing methods which highlights the underappreciated role played by human audiences. Within this framework, methods are organized into 2 classes: model based and post hoc. To provide guidance in selecting and evaluating interpretation methods, we introduce 3 desiderata: predictive accuracy, descriptive accuracy, and relevancy. Using our framework, we review existing work, grounded in real-world studies which exemplify our desiderata, and suggest directions for future work.

Machine learning (ML) has recently received considerable attention for its ability to accurately predict a wide variety of complex phenomena. However, there is a growing realization that, in addition to predictions, ML models are capable of producing knowledge about domain relationships contained in data, often referred to as interpretations. These interpretations have found uses in their own right, e.g., medicine (1), policymaking (2), and science (3, 4), as well as in auditing the predictions themselves in response to issues such as regulatory pressure (5) and fairness (6). In these domains, interpretations have been shown to help with evaluating a learned model, providing information to repair a model (if needed), and building trust with domain experts (7).

In the absence of a well-formed definition of interpretability, a broad range of methods with a correspondingly broad range of outputs (e.g., visualizations, natural language, mathematical equations) have been labeled as interpretation. This has led to considerable confusion about the notion of interpretability. In particular, it is unclear what it means to interpret something, what common threads exist among disparate methods, and how to select an interpretation method for a particular problem/audience.

In this paper, we attempt to address these concerns. To do so, we first define interpretability in the context of machine learning and place it within a generic data science life cycle. This allows us to distinguish between 2 main classes of interpretation methods: model based* and post hoc. We then introduce the predictive, descriptive, relevant (PDR) framework, consisting of 3 desiderata for evaluating and constructing interpretations: predictive accuracy, descriptive accuracy, and relevancy, where relevancy is judged by a human audience. Using these terms, we categorize a broad range of existing methods, all grounded in real-world examples.† In doing so, we provide a common vocabulary for researchers and practitioners to use in evaluating and selecting interpretation methods. We then show how our work enables a clearer discussion of open problems for future research.

1. Defining Interpretable Machine Learning

On its own, interpretability is a broad, poorly defined concept. Taken to its full generality, to interpret data means to extract information (of some form) from them. The set of methods falling under this umbrella spans everything from designing an initial experiment to visualizing final results. In this overly general form, interpretability is not substantially different from the established concepts of data science and applied statistics.

Instead of general interpretability, we focus on the use of interpretations to produce insight from ML models as part of the larger data–science life cycle.
We define interpretable machine learning as the extraction of relevant knowledge from a machine-learning model concerning relationships either contained in data or learned by the model. Here, we view knowledge as being relevant if it provides insight for a particular audience into a chosen problem. These insights are often used to guide communication, actions, and discovery. They can be produced in formats such as visualizations, natural language, or mathematical equations, depending on the context and audience. For instance, a doctor who must diagnose a single patient will want qualitatively different information than an engineer determining whether an image classifier is discriminating by race. What we define as interpretable ML is sometimes referred to as explainable ML, intelligible ML, or transparent ML. We include these headings under our definition.

2. Background

Interpretability is a quickly growing field in machine learning, and there have been multiple works examining various aspects of interpretations (sometimes under the heading of explainable AI). One line of work focuses on providing an overview of different interpretation methods, with a strong emphasis on post hoc interpretations of deep learning models (8, 9), sometimes pointing out similarities between various methods (10, 11). Other work has focused on the narrower problem of evaluating interpretations (12, 13) and what properties they should satisfy (14). These previous works touch on different subsets of interpretability, but do not address interpretable machine learning as a whole, and give limited guidance on how interpretability can actually be used in data–science life cycles. We aim to do so by providing a framework and vocabulary that fully capture interpretable machine learning, its benefits, and its applications to concrete data problems.

Interpretability also plays a role in other research areas. For example, interpretability is a major topic when considering bias and fairness in ML models (15–17). In psychology, the general notions of interpretability and explanations have been studied at a more abstract level (18, 19), providing relevant conceptual perspectives. Additionally, we comment on 2 areas that are distinct from, but closely related to, interpretability: causal inference and stability.

Causal Inference. Causal inference (20) is a subject from statistics that is related to, but distinct from, interpretable machine learning. According to a prevalent view, causal inference methods focus solely on extracting causal relationships from data, i.e., statements that altering one variable will cause a change in another. In contrast, interpretable ML, like most other statistical techniques, is used to describe general relationships. Whether or not these relationships are causal cannot be verified through interpretable ML techniques, as they are not designed to distinguish between causal and noncausal effects.

In some instances, researchers use both interpretable machine learning and causal inference in a single analysis (21). One form of this is where the noncausal relationships extracted by interpretable ML are used to suggest potential causal relationships. These relationships can then be further analyzed using causal inference methods and fully validated through experimental studies.

Stability. Stability, as a generalization of robustness in statistics, is a concept that applies throughout the entire data–science life cycle, including interpretable ML. The stability principle requires that each step in the life cycle be stable with respect to appropriate perturbations, such as small changes in the model or data. Recently, stability has been shown to be important in applied statistical problems, for example when trying to draw conclusions about a scientific problem (22) and in more general settings (23). Stability can be helpful in evaluating interpretation methods and is a prerequisite for trustworthy interpretations. That is, one should not interpret parts of a model which are not stable to appropriate perturbations to the model and data. This is demonstrated through examples in the text (21, 24, 25).

3. Interpretation in the Data–Science Life Cycle

Before discussing interpretation methods, we first place the process of interpretable ML within the broader data–science life cycle. Fig. 1 presents a deliberately general description of this process, intended to capture most data-science problems. What is generally referred to as interpretation largely occurs in the modeling and post hoc analysis stages, with the problem, data, and audience providing the context required to choose appropriate methods.

Fig. 1. Overview of different stages (black text) in a data–science life cycle where interpretability is important. Main stages are discussed in Section 3 and accuracy (blue text) is described in Section 4.

Problem, Data, and Audience. At the beginning of the cycle, a data–science practitioner defines a domain problem that the practitioner wishes to understand using data. This problem can take many forms. In a scientific setting, the practitioner may be interested in relationships contained in the data, such as how brain cells in a particular area of the visual system relate to visual stimuli (26). In industrial settings, the problem often concerns the predictive performance or other qualities of a model, such as how to assign credit scores with high accuracy (27) or do so fairly with respect to gender and race (17). The nature of the problem plays a role in interpretability, as the relevant context and audience are essential in determining what methods to use.

After choosing a domain problem, the practitioner collects data to study it. Aspects of the data-collection process can affect the interpretation pipeline. Notably, biases in the data (i.e., mismatches between the collected data and the population of interest) will manifest themselves in the model, restricting one's ability to generalize interpretations generated from the data to the population of interest.

Model. Based on the chosen problem and collected data, the practitioner then constructs a predictive model. At this stage, the practitioner processes, cleans, and visualizes data; extracts features; selects a model (or several models); and fits it. Interpretability considerations often come into play in this step, related to the choice between simpler, easier-to-interpret models and more complex, black-box models, which may fit the data better. The model's ability to fit the data is measured through predictive accuracy.

Post Hoc Analysis. Having fitted a model (or models), the practitioner then analyzes it for answers to the original question. The process of analyzing the model often involves using interpretability methods to extract various (stable) forms of information from the model. The extracted information can then be analyzed and displayed using standard data analysis methods, such as scatter plots and histograms. The ability of the interpretations to properly describe what the model has learned is denoted by descriptive accuracy.

Iterate. If sufficient answers are uncovered after the post hoc analysis stage, the practitioner finishes.


Otherwise, the practitioner updates something in the chain (problem, data, and/or model) and then iterates (28). Note that the practitioner can terminate the loop at any stage, depending on the context of the problem.

Interpretation Methods within the PDR Framework. In the framework described above, our definition of interpretable ML focuses on methods in either the modeling or post hoc analysis stages. We call interpretability in the modeling stage model-based interpretability (Section 5). This part of interpretability is focused upon the construction of models that readily provide insight into the relationships they have learned. To provide this insight, model-based interpretability techniques must generally use simpler models, which can result in lower predictive accuracy. Consequently, model-based interpretability is best used when the underlying relationship is sufficiently simple that model-based techniques can achieve reasonable predictive accuracy or when predictive accuracy is not a concern.

We call interpretability in the post hoc analysis stage post hoc interpretability (Section 6). In contrast to model-based interpretability, which alters the model to allow for interpretation, post hoc interpretation methods take a trained model as input and extract information about what relationships the model has learned. They are most helpful when the data are especially complex and practitioners need to train a black-box model to achieve reasonable predictive accuracy.

After discussing desiderata for interpretation methods, we investigate these 2 forms of interpretation in detail and discuss associated methods.

4. The PDR Desiderata for Interpretations

In general, it is unclear how to select and evaluate interpretation methods for a particular problem and audience. To help guide this process, we introduce the PDR framework, consisting of 3 desiderata that should be used to select interpretation methods for a particular problem: predictive accuracy, descriptive accuracy, and relevancy.

A. Accuracy. The information produced by an interpretation method should be faithful to the underlying process the practitioner is trying to understand. In the context of ML, there are 2 areas where errors can arise: when approximating the underlying data relationships with a model (predictive accuracy) and when approximating what the model has learned using an interpretation method (descriptive accuracy). For an interpretation to be trustworthy, one should try to maximize both accuracies. In cases where either accuracy is not very high, the resulting interpretations may still be useful. However, it is especially important to check their trustworthiness through external validation, such as running an additional experiment.

A.1. Predictive accuracy. The first source of error occurs during the model stage, when an ML model is constructed. If the model learns a poor approximation of the underlying relationships in the data, any information extracted from the model is unlikely to be accurate. Evaluating the quality of a model's fit has been well studied in standard supervised ML frameworks, through measures such as test-set accuracy. In the context of interpretation, we describe this error as predictive accuracy.

Note that in problems involving interpretability, one must appropriately measure predictive accuracy. In particular, the data used to check predictive accuracy must resemble the population of interest; for instance, a model evaluated on patients from one hospital may not generalize to others. Moreover, problems often require a notion of predictive accuracy that goes beyond just average accuracy: the distribution of predictions matters. For instance, it could be problematic if the prediction error is much higher for a particular class. Finally, the predictive accuracy should be stable with respect to reasonable data and model perturbations. One should not trust interpretations from a model which changes dramatically when trained on a slightly smaller subset of the data.
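To make this last check concrete, the following sketch (simulated data and a generic scikit-learn classifier, shown purely for illustration rather than drawn from any of the cited studies) refits the same model on random subsamples of the training set and compares both held-out accuracy and the predictions themselves; large variation in either would argue against interpreting any single fit.

```python
# A minimal sketch of checking that predictive accuracy is stable under data
# perturbations: refit the same model on random subsamples of the training set
# and compare held-out accuracy and prediction agreement.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
accuracies, predictions = [], []
for seed in range(10):
    # Perturb the data: train on a random 80% subsample of the training set.
    idx = rng.choice(len(X_train), size=int(0.8 * len(X_train)), replace=False)
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_train[idx], y_train[idx])
    accuracies.append(model.score(X_test, y_test))
    predictions.append(model.predict(X_test))

print("test accuracy across refits: mean=%.3f, sd=%.3f"
      % (np.mean(accuracies), np.std(accuracies)))
# Fraction of test points on which every refit makes the same prediction; a low
# value would be a warning sign before interpreting any single fitted model.
agreement = np.mean(np.all(np.array(predictions) == predictions[0], axis=0))
print("prediction agreement across refits: %.3f" % agreement)
```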
A.2. Descriptive accuracy. The second source of error occurs during the post hoc analysis stage, when interpretation methods are used to analyze a fitted model. Oftentimes, interpretations provide an imperfect representation of the relationships learned by a model. This is especially challenging for complex black-box models such as neural networks, which store nonlinear relationships between variables in nonobvious forms.

Definition: We define descriptive accuracy, in the context of interpretation, as the degree to which an interpretation method objectively captures the relationships learned by machine-learning models.

A.3. A common conflict: predictive vs. descriptive accuracy. In selecting what model to use, practitioners are sometimes faced with a trade-off between predictive and descriptive accuracy. On the one hand, the simplicity of model-based interpretation methods yields consistently high descriptive accuracy, but can sometimes result in lower predictive accuracy on complex datasets. On the other hand, in complex settings such as image analysis, complicated models can provide high predictive accuracy, but are harder to analyze, resulting in lower descriptive accuracy.

B. Relevancy. When selecting an interpretation method, it is not enough for the method to have high accuracy—the extracted information must also be relevant. For example, in the context of genomics, a patient, doctor, biologist, and statistician may each want different (yet consistent) interpretations from the same model. The context provided by the problem and data stages in Fig. 1 guides what kinds of relationships a practitioner is interested in learning about and, by extension, the methods that should be used.

Definition: We define an interpretation to be relevant if it provides insight for a particular audience into a chosen domain problem.

Relevancy often plays a key role in determining the trade-off between predictive and descriptive accuracy. Depending on the context of the problem at hand, a practitioner may choose to focus on one over the other. For instance, when interpretability is used to audit a model's predictions, such as to enforce fairness, descriptive accuracy can be more important. In contrast, interpretability can also be used solely as a tool to increase the predictive accuracy of a model, for instance, through improved feature engineering.

Having outlined the main desiderata for interpretation methods, we now discuss how they link to interpretation in the modeling and post hoc analysis stages of the data–science life cycle. Fig. 2 draws parallels between our desiderata for interpretation techniques introduced in Section 4 and our categorization of methods in Sections 5 and 6. In particular, both post hoc and model-based methods aim to increase descriptive accuracy, but only model-based methods affect predictive accuracy. Not shown is relevancy, which determines what type of output is helpful for a particular problem and audience.

Fig. 2. Impact of interpretability methods on descriptive and predictive accuracies. Model-based interpretability (Section 5) involves using a simpler model to fit the data, which can negatively affect predictive accuracy but yields higher descriptive accuracy. Post hoc interpretability (Section 6) involves using methods to extract information from a trained model (with no effect on predictive accuracy). These correspond to the model and post hoc stages in Fig. 1.

5. Model-Based Interpretability

We now discuss how interpretability considerations come into play in the modeling stage of the data–science life cycle (Fig. 1). At this stage, the practitioner constructs an ML model from the collected data. We define model-based interpretability as the construction of models that readily provide insight into the relationships they have learned. Different model-based interpretability methods provide different ways of increasing descriptive accuracy by constructing models which are easier to understand, sometimes resulting in lower predictive accuracy. The main challenge of model-based interpretability is to come up with models that are simple enough to be easily understood by the audience, while maintaining high predictive accuracy.

In selecting a model to solve a domain problem, the practitioner must consider the entirety of the PDR framework. The first desideratum to consider is predictive accuracy. If the constructed model does not accurately represent the underlying problem, any subsequent analysis will be suspect (29, 30). Second, the main purpose of model-based interpretation methods is to increase descriptive accuracy. Finally, the relevancy of a model's output must be considered and is determined by the context of the problem, data, and audience. We now discuss some common types of model-based interpretability methods.

A. Sparsity. When the practitioner believes that the underlying relationship in question is based upon a sparse set of signals, the practitioner can impose sparsity on the model by limiting the number of nonzero parameters. In this section, we focus on linear models, but sparsity can be helpful more generally. When the number of nonzero parameters is sufficiently small, a practitioner can interpret the variables corresponding to those parameters as being meaningfully related to the outcome in question and can also interpret the magnitude and direction of the parameters. However, before one can interpret a sparse parameter set, one should check the stability of the parameters. For example, if the signs/magnitudes of parameters or the predictions change due to small perturbations of the dataset, the coefficients should not be interpreted (31).

When the practitioner is able to correctly incorporate sparsity into the model, it can improve all 3 interpretation desiderata. By reducing the number of parameters to analyze, sparse models can be easier to understand, yielding higher descriptive accuracy. Moreover, incorporating prior information in the form of sparsity into a sparse problem can help a model achieve higher predictive accuracy and yield more relevant insights. Note that incorporating sparsity can often be quite difficult, as it requires understanding the data-specific structure of the sparsity and how it can be modeled.

Methods for obtaining sparsity often place a penalty on a loss function, as in LASSO (32) and sparse coding (33), or on model selection criteria such as AIC or BIC (34, 35). Many search-based methods have also been developed to find sparse solutions; these search through the space of nonzero coefficients using classical subset-selection methods [e.g., orthogonal matching pursuit (36)]. Model sparsity is often useful for high-dimensional problems, where the goal is to identify key features for further analysis. For instance, sparsity penalties have been incorporated into random forests to identify a sparse subset of important features (37).
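As a minimal illustration of sparsity via an l1 penalty (simulated data and scikit-learn's LassoCV, used here purely for illustration, not the genomics setting of the example that follows), the sketch below recovers a short list of nonzero coefficients from a high-dimensional problem.

```python
# A minimal sketch of sparsity via the lasso: on simulated data where only a few
# features truly matter, the l1 penalty drives most coefficients to exactly zero,
# leaving a short list of candidate variables (and their signs) to interpret.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 1000                          # high dimensional: far more features than samples
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]    # only 5 features have nonzero effects
y = X @ beta + rng.normal(scale=0.5, size=n)

lasso = LassoCV(cv=5).fit(X, y)           # penalty strength chosen by cross-validation
selected = np.flatnonzero(lasso.coef_)
print("number of nonzero coefficients:", selected.size)
for j in selected:
    print(f"feature {j}: coefficient {lasso.coef_[j]:+.2f}")
# Before interpreting the signs/magnitudes, one would also check their stability
# under perturbations of the data (e.g., by refitting on subsamples).
```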
In the following example from genomics, sparsity is used to increase the relevancy of an interpretation by reducing the number of potential interactions to a manageable level.

Example (Ex): Identifying interactions among regulatory factors or biomolecules is an important question in genomics. Typical genomic datasets include thousands or even millions of features, many of which are active in specific cellular or developmental contexts. The massive scale of such datasets makes interpretation a considerable challenge. Sparsity penalties are frequently used to make the data manageable for statisticians and their collaborating biologists to discuss and identify promising candidates for further experiments.

For instance, one recent study (24) uses a biclustering approach based on sparse canonical correlation analysis (SCCA) to identify interactions among genomic expression features in Drosophila melanogaster (fruit flies) and Caenorhabditis elegans (roundworms). Sparsity penalties enable key interactions among features to be summarized in heatmaps which contain few enough variables for a human to analyze. The authors of this study also perform stability analysis, finding their model to be robust to different initializations and perturbations to hyperparameters.

B. Simulatability. A model is said to be simulatable if a human (for whom the interpretation is intended) is able to internally simulate and reason about its entire decision-making process (i.e., how a trained model produces an output for an arbitrary input). This is a very strong constraint to place on a model and can generally be satisfied only when the number of features is low and the underlying relationship is simple. Decision trees (38) are often cited as a simulatable model, due to their hierarchical decision-making process. Another example is lists of rules (39, 40), which can easily be simulated. However, it is important to note that these models cease to be simulatable when they become large. In particular, as the complexity of the model increases (the number of nodes in a decision tree or the number of rules in a list), it becomes increasingly difficult for a human to simulate it internally.
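The sketch below illustrates simulatability with a depth-2 decision tree fitted to a small dataset bundled with scikit-learn (an illustrative stand-in, not the clinical data in the example that follows); its complete decision process prints as a handful of if–then rules.

```python
# A minimal sketch of a simulatable model: a depth-2 decision tree whose full
# decision process can be printed as a short list of if-then rules.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data.data, data.target)

# The printed rules are short enough for a person to step through by hand;
# a much deeper tree would no longer be simulatable in this sense.
print(export_text(tree, feature_names=list(data.feature_names)))
print("training accuracy: %.3f" % tree.score(data.data, data.target))
```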
Due to their simplicity, simulatable models have very high descriptive accuracy. When they can also provide reasonable predictive accuracy, they can be very effective. In the following example, a simulatable model is able to produce high predictive accuracy, while maintaining the high levels of descriptive accuracy and relevancy normally attained by small-scale rule-based models.

Ex: In medical practice, when a patient has been diagnosed with atrial fibrillation, caregivers often want to predict the risk that the particular patient will have a stroke in the next year. Given the potential ramifications of medical decisions, it is important that these predictions are not only accurate, but also interpretable to both the caregivers and the patients.

To make the prediction, ref. 40 uses data from 12,586 patients detailing their age, gender, history of drugs and conditions, and whether they had a stroke within 1 y of diagnosis. To construct a model with high predictive and descriptive accuracy, ref. 40 introduces a method for learning lists of if–then rules that are predictive of 1-y stroke risk. The resulting classifier, displayed in SI Appendix, Fig. S1, requires only 7 if–then statements to achieve competitive accuracy and is easy for even nontechnical practitioners to quickly understand.

Although this model is able to achieve high predictive and descriptive accuracy, it is important to note that the lack of stability in these types of models can limit their uses. If the practitioner's intent is simply to understand a model that is ultimately used for predictions, these types of models can be very effective. However, if the practitioner wants to produce knowledge about the underlying dataset, the fact that the learned rules can change significantly when the model is retrained limits their generalizability.

C. Modularity. We define an ML model to be modular if a meaningful portion(s) of its prediction-making process can be interpreted independently. A wide array of models satisfies modularity to different degrees. Generalized additive models (41) force the relationship between variables in the model to be additive. In deep learning, specific methods such as attention (42) and modular network architectures (43) provide limited insight into a network's inner workings. Probabilistic models can enforce modularity by specifying a conditional independence structure, which makes it easier to reason about different parts of a model independently (44).
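As a toy illustration of this kind of modular structure, the following sketch fits a two-term additive model by backfitting with a crude binned smoother on simulated data. It is only a sketch of the additive idea, not the pairwise-interaction GAM used in the study below, but it shows how each univariate component can be examined on its own.

```python
# A minimal numpy sketch of modularity in an additive model: each univariate
# component f_j is fitted by backfitting with a simple binned-mean smoother,
# and each fitted component can then be inspected independently.
import numpy as np

def smooth(x, r, n_bins=20):
    """Crude smoother: average the partial residuals r within quantile bins of x."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    which = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    means = np.array([r[which == b].mean() for b in range(n_bins)])
    return means[which]

rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(-2, 2, size=(n, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.2, size=n)

# Backfitting: cycle through coordinates, fitting each component to the residual
# left over after subtracting the other component.
f = np.zeros((2, n))
intercept = y.mean()
for _ in range(20):
    for j in range(2):
        partial_residual = y - intercept - f[1 - j]
        f[j] = smooth(X[:, j], partial_residual)
        f[j] -= f[j].mean()              # identifiability: center each component

# Each component is a separate module: f[0] should roughly trace sin(x0) and
# f[1] a centered 0.5*x1^2 curve, and each can be plotted/interpreted alone.
for j in range(2):
    print(f"fitted component f_{j} ranges over [{f[j].min():+.2f}, {f[j].max():+.2f}]")
```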
The following example uses modularity to produce relevant interpretations for use in diagnosing biases in training data.

Ex: When prioritizing patient care for patients with pneumonia in a hospital, one possible method is to predict the likelihood of death within 60 d and focus on the patients with a higher mortality risk. Given the potential life and death consequences, being able to explain the reasons for hospitalizing a patient or not is very important.

A recent study (7) uses a dataset of 14,199 patients with pneumonia, with 46 features including demographics (e.g., age and gender), simple physical measurements (e.g., heart rate, blood pressure), and laboratory tests (e.g., white blood cell count, blood urea nitrogen). To predict mortality risk, the researchers use a generalized additive model with pairwise interactions, displayed below. The univariate and pairwise terms (f_j(x_j) and f_{ij}(x_i, x_j)) can be individually interpreted in the form of curves and heatmaps, respectively:

g(E[y]) = \beta_0 + \sum_j f_j(x_j) + \sum_{i \neq j} f_{ij}(x_i, x_j).    [1]

By inspecting the individual modules, the researchers found a number of counterintuitive properties of their model. For instance, the fitted model learned that having asthma is associated with a lower risk of dying from pneumonia. In reality, the opposite is true—patients with asthma are known to have a higher risk of death from pneumonia. Because of this, in the collected data all patients with asthma received aggressive care, which was fortunately effective at reducing their risk of mortality relative to the general population.

In this instance, if the model were used without having been interpreted, pneumonia patients with asthma would have been deprioritized for hospitalization. Consequently, the use of ML would increase their likelihood of dying. Fortunately, the use of an interpretable model enabled the researchers to identify and correct errors like this one, better ensuring that the model could be trusted in the real world.

D. Domain-Based Feature Engineering. While the type of model is important in producing a useful interpretation, so are the features that are used as inputs to the model. Having more informative features makes the relationship that needs to be learned by the model simpler, allowing one to use other model-based interpretability methods. Moreover, when the features have more meaning to a particular audience, they become easier to interpret.

In many individual domains, expert knowledge can be useful in constructing feature sets that are useful for building predictive models. The particular algorithms used to extract features are generally domain specific, relying both on the practitioner's existing domain expertise and on insights drawn from the data through exploratory data analysis. For example, in natural language processing, documents are embedded into vectors using tf-idf (45). Moreover, using ratios, such as the body mass index (BMI), instead of raw features can greatly simplify the relationship a model learns, resulting in improved interpretations. In the example below, domain knowledge about cloud coverage is exploited to design 3 simple features that increase predictive accuracy while maintaining the high descriptive accuracy of a simple predictive model.

Ex: When modeling global climate patterns, an important quantity is the amount and location of arctic cloud coverage. Due to the complex, layered nature of climate models, it is beneficial to have simple, easily auditable, cloud coverage models for use by downstream climate scientists.

In ref. 46, the authors use an unlabeled dataset of arctic satellite imagery to build a model predicting whether each pixel in an image contains clouds or not. Given the qualitative similarity between ice and clouds, this is a challenging prediction problem. By conducting exploratory data analysis and using domain knowledge through interactions with climate scientists, the authors identify 3 simple features that are sufficient to cluster whether or not images contain clouds. Using these 3 features as input to quadratic discriminant analysis, they achieve both high predictive accuracy and transparency when compared with expert labels (which were not used in developing the model).
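The sketch below illustrates the two generic devices mentioned above, tf-idf text features and a ratio feature such as BMI, on made-up data; it is not the cloud-coverage feature set of ref. 46.

```python
# A minimal sketch of domain-based feature engineering: tf-idf vectors for text
# and a ratio feature (BMI) for tabular health data. The tiny corpus and the
# measurements below are invented for illustration.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# tf-idf: documents become sparse vectors whose coordinates are interpretable
# as reweighted word frequencies.
docs = [
    "the patient reported chest pain and shortness of breath",
    "routine follow up visit with no new symptoms",
    "chest x ray showed no acute findings",
]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
print("vocabulary size:", len(vectorizer.vocabulary_))
print("tf-idf weight of 'chest' in doc 0: %.3f"
      % doc_vectors[0, vectorizer.vocabulary_["chest"]])

# Ratio feature: encoding weight and height as BMI = kg / m^2 gives a single,
# audience-meaningful input that even a linear model can use directly.
weight_kg = np.array([70.0, 95.0, 54.0])
height_m = np.array([1.75, 1.80, 1.62])
bmi = weight_kg / height_m ** 2
print("BMI features:", np.round(bmi, 1))
```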
E. Model-Based Feature Engineering. There are a variety of automatic approaches for constructing interpretable features. Two examples are unsupervised learning and dimensionality reduction. Unsupervised methods, such as clustering, matrix factorization, and dictionary learning, aim to process unlabeled data and output a description of their structure. These structures often shed insight into relationships contained within the data and can be useful in building predictive models. Dimensionality reduction focuses on finding a representation of the data which is lower dimensional than the original data. Methods such as principal components analysis (47), independent components analysis (48), and canonical correlation analysis (49) can often identify a few interpretable dimensions, which can then be used as input to a model or to provide insights in their own right. Using fewer inputs can not only improve descriptive accuracy, but also increase predictive accuracy by reducing the number of parameters to fit. In the following example, unsupervised learning is used to represent images in a low-dimensional, genetically meaningful, space.

Ex: Heterogeneity is an important consideration in genomic problems and associated data. In many cases, regulatory factors or biomolecules can play a specific role in one context, such as a particular cell type or developmental stage, and have a very different role in other contexts. Thus, it is important to understand the “local” behavior of regulatory factors or biomolecules. A recent study (50) uses unsupervised learning to learn spatial patterns of gene expression in Drosophila (fruit fly) embryos. In particular, it uses stability-driven nonnegative matrix factorization to decompose images of complex spatial gene expression patterns into a library of 21 “principal patterns,” which can be viewed as preorgan regions. This decomposition, which is interpretable to biologists, allows the study of gene–gene interactions in preorgan regions of the developing embryo.
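As a minimal illustration of this kind of feature construction (random nonnegative data standing in for the expression images, and scikit-learn's plain NMF rather than the stability-driven variant used in ref. 50), the sketch below factors a data matrix into a small dictionary of parts plus per-sample loadings that can serve as low-dimensional features.

```python
# A minimal sketch of model-based feature engineering with nonnegative matrix
# factorization: a nonnegative data matrix is decomposed into a small dictionary
# of parts and per-sample weights, which can be used as interpretable features.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
true_parts = rng.random((5, 400))                           # 5 underlying "patterns"
weights = rng.random((300, 5))
X = weights @ true_parts + 0.01 * rng.random((300, 400))    # 300 samples, 400 features

nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)     # per-sample loadings: the new 5-dimensional features
H = nmf.components_          # dictionary of 5 nonnegative patterns

print("reconstruction error:", round(float(nmf.reconstruction_err_), 3))
print("new feature matrix shape:", W.shape)                 # (300, 5) instead of (300, 400)
# As in the study above, one would also check that the recovered patterns are
# stable across random restarts and data perturbations before interpreting them.
```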
6. Post Hoc Interpretability

We now discuss how interpretability considerations come into play in the post hoc analysis stage of the data–science life cycle. At this stage, the practitioner analyzes a trained model to provide insights into the learned relationships. This is particularly challenging when the model's parameters do not clearly show what relationships the model has learned. To aid in this process, a variety of post hoc interpretability methods have been developed to provide insight into what a trained model has learned, without changing the underlying model. These methods are particularly important for settings where the collected data are high dimensional and complex, such as with image data. In these settings, interpretation methods must deal with the challenge that individual features are not semantically meaningful, making the problem more challenging than on datasets with more meaningful features. Once the information has been extracted from the fitted model, it can be analyzed using standard exploratory data analysis techniques, such as scatter plots and histograms.

When conducting post hoc analysis, the model has already been trained, so its predictive accuracy is fixed. Thus, under the PDR framework, a researcher must consider only descriptive accuracy and relevancy (relative to a particular audience). Improving on each of these criteria is an area of active research.

Most widely useful post hoc interpretation methods fall into 2 main categories: prediction-level and dataset-level interpretations, which are sometimes referred to as local and global interpretations, respectively. Prediction-level interpretation methods focus on explaining individual predictions made by models, such as what features and/or interactions led to the particular prediction. Dataset-level approaches focus on the global relationships the model has learned, such as what visual patterns are associated with a predicted response. These 2 categories have much in common (in fact, dataset-level approaches often yield information at the prediction level), but we discuss them separately, as methods at the different levels are meaningfully different. Prediction-level insights can provide fine-grained information about individual predictions, but often fail to yield dataset-level insights when it is not feasible to examine a sufficient number of prediction-level interpretations.

A. Dataset-Level Interpretation. When practitioners are interested in more general relationships learned by a model, e.g., relationships that are relevant for a particular class of responses, they use dataset-level interpretations. For instance, this form of interpretation can be useful when it is not feasible for a practitioner to look at a large number of local predictions. In addition to the areas below, we note that there are other emerging techniques, such as model distillation (51, 52).

A.1. Interaction and feature importances. Feature importance scores, at the dataset level, try to capture how much individual features contribute, across a dataset, to a prediction. These scores can provide insights into what features the model has identified as important for which outcomes and their relative importance. Methods have been developed to score individual features in many models, including neural networks (53), random forests (54, 55), and generic classifiers (56).

In addition to feature importances, methods exist to extract important interactions between features. Interactions are important as ML models are often highly nonlinear and learn complex interactions between features. Methods exist to extract interactions from many ML models, including random forests (21, 57, 58) and neural networks (59, 60). In the example below, the descriptive accuracy of random forests is increased by extracting Boolean interactions (a problem-relevant form of interpretation) from a trained model.

Ex: High-order interactions among regulatory factors or genes play an important role in defining cell type-specific behavior in biological systems. Thus, extracting such interactions from genomic data is an important problem in biology.

A previous line of work considers the problem of searching for biological interactions associated with important biological processes (21, 57). To identify candidate biological interactions, the authors train a series of iteratively reweighted random forests (RFs) and search for stable combinations of features that frequently co-occur along the predictive RF decision paths. This approach takes a step beyond evaluating the importance of individual features in an RF, providing a more complete description of how features influence predicted responses. By interpreting the interactions used in RFs, the researchers identified gene–gene interactions with 80% accuracy in the Drosophila embryo and identified candidate targets for higher-order interactions.
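As a simple, widely used instance of a dataset-level feature importance (illustrated here with permutation importance on simulated data, not the iterative random forest procedure described above), the sketch below scores each feature by the drop in held-out accuracy when its column is shuffled.

```python
# A minimal sketch of a dataset-level feature importance: permutation importance
# for a random forest, which scores each feature by how much held-out accuracy
# drops when that feature's column is shuffled.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)
result = permutation_importance(forest, X_test, y_test, n_repeats=20, random_state=0)

for j in result.importances_mean.argsort()[::-1]:
    print(f"feature {j}: importance {result.importances_mean[j]:.3f} "
          f"+/- {result.importances_std[j]:.3f}")
# With shuffle=False, the informative features are the first 3 columns, so they
# should receive the largest scores.
```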

A.2. Statistical feature importances. In some instances, in addition to the raw value, we can compute statistical measures of confidence as feature importance scores, a standard technique taught in introductory statistics classes. By making assumptions about the underlying data-generating process, models like linear and logistic regression can compute confidence intervals and hypothesis tests for the values, and linear combinations, of their coefficients. These statistics can be helpful in determining the degree to which the observed coefficients are statistically significant. It is important to note that the assumptions of the underlying probabilistic model must be fully verified before using this form of interpretation. Below we present a cautionary example where different assumptions lead to opposing conclusions being drawn from the same dataset.

Ex: Here, we consider the lawsuit Students for Fair Admissions, Inc. v. Harvard regarding the use of race in undergraduate admissions to Harvard University. Initial reports by Harvard's Office of Institutional Research used logistic regression to model the probability of admission using different features of an applicant's profile, including race (61). This analysis found that the coefficient associated with being Asian (and not low income) was −0.418 with a significant P value (<0.001). This negative coefficient suggested that being Asian had a significant negative association with admission probability.

Subsequent analysis from both sides in the lawsuit attempted to analyze the modeling and assumptions to decide on the significance of race in the model's decision. The plaintiff's expert report (62) suggested that race was being unfairly used, building on the original report from Harvard's Office of Institutional Research; it also incorporated analysis of more subjective factors, such as “personal ratings,” which seem to hurt Asian students' admission. In contrast, the expert report supporting Harvard University (63) finds that, by accounting for certain other variables, the effect of race on Asian students' acceptance is no longer significant. Significances derived from statistical tests in regression or logistic regression models at best establish association, but not causation. Hence the analyses from both sides are flawed. This example demonstrates the practical and misleading consequences of statistical feature importances when used inappropriately.
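The sketch below illustrates statistical feature importances with a logistic regression fitted in statsmodels on simulated data; it also mimics, in a stylized way, how conclusions about one variable can flip depending on which correlated covariates are included. It is not a reanalysis of the admissions data discussed above.

```python
# A minimal sketch of statistical feature importances: a logistic regression
# reports coefficients, confidence intervals, and P values, but the apparent
# effect of x1 depends on whether a correlated covariate x2 is included.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
x2 = rng.normal(size=n)
x1 = 0.8 * x2 + rng.normal(scale=0.6, size=n)          # x1 and x2 are correlated
logit_p = 1.5 * x2 + rng.normal(scale=0.1, size=n)     # outcome driven by x2, not x1
y = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

# Model A: x1 only -- x1 picks up x2's effect and looks strongly "significant".
model_a = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)
# Model B: x1 and x2 -- x1's coefficient shrinks toward zero.
model_b = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)

print("Model A (x1 only):   coef", np.round(model_a.params[1], 2),
      "95% CI", np.round(model_a.conf_int()[1], 2))
print("Model B (x1 and x2): coef", np.round(model_b.params[1], 2),
      "95% CI", np.round(model_b.conf_int()[1], 2))
# Either way, these are measures of association under the model's assumptions,
# not causal effects.
```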
A.3. Visualizations. When dealing with high-dimensional datasets, it can be challenging to quickly understand the complex relationships that a model has learned, making the presentation of the results particularly important. To help deal with this, researchers have developed a number of different visualizations which help to understand what a model has learned. For linear models with regularization, plots of regression coefficient paths show how varying a regularization parameter affects the fitted coefficients. When visualizing convolutional neural networks trained on image data, work has been done on visualizing filters (64, 65), maximally activating responses of individual neurons or classes (66), understanding intraclass variation (67), and grouping different neurons (68). For long short-term memory networks (LSTMs), researchers have focused on analyzing the state vector, identifying individual dimensions that correspond to meaningful features (e.g., position in line, within quotes) (69), and building tools to track the model's decision process over the course of a sequence (70).
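As a small example of the first visualization mentioned above, the sketch below plots lasso regression coefficient paths for the diabetes dataset bundled with scikit-learn, showing how each coefficient enters or leaves the model as the regularization strength varies.

```python
# A minimal sketch of one visualization discussed above: regression coefficient
# paths, showing how each lasso coefficient changes with the penalty strength.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lasso_path
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

alphas, coefs, _ = lasso_path(X, y)      # coefs has shape (n_features, n_alphas)

plt.figure(figsize=(6, 4))
for j, path in enumerate(coefs):
    plt.plot(np.log10(alphas), path, label=f"x{j}")
plt.xlabel("log10(regularization strength)")
plt.ylabel("coefficient")
plt.title("Lasso coefficient paths")
plt.legend(fontsize="small", ncol=2)
plt.tight_layout()
plt.show()
```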
In the following example, relevant interpretations are produced by using maximal activation images to identify patterns that drive the responses of brain cells.

Ex: A recent study visualizes learned information from deep neural networks to understand individual brain cells (25). In this study, macaque monkeys were shown images while the responses of brain cells in their visual system (area V4) were recorded. Neural networks were trained to predict the responses of brain cells to the images. These neural networks produce accurate fits, but provide little insight into what patterns in the images increase the brain cells' response without further analysis. To remedy this, the authors introduce DeepTune, a method which provides a visualization, accessible to neuroscientists and others, of the patterns which activate a brain cell. The main intuition behind the method is to optimize the input of a network to maximize the response of a neural network model (which represents a brain cell).

The authors go on to analyze the major problem of instability. When post hoc visualizations attempt to answer scientific questions, the visualizations must be stable to reasonable perturbations; if there are changes in the visualization due to the choice of a model, it is likely not meaningful. The authors address this explicitly by fitting 18 different models to the data and using a stable optimization over all of the models to produce a final consensus DeepTune visualization.

A.4. Analyzing trends and outliers in predictions. When interpreting the performance of an ML model, it can be helpful to look not just at the average accuracy, but also at the distribution of predictions and errors. For example, residual plots can identify heterogeneity in predictions and suggest particular data points to analyze, such as outliers in the predictions or examples which had the largest prediction errors. Moreover, these plots can be used to analyze trends across the predictions. For instance, influence functions are able to efficiently identify mislabeled data points.

B. Prediction-Level Interpretation. Prediction-level approaches are useful when a practitioner is interested in understanding how individual predictions are made by a model. Note that prediction-level approaches can sometimes be aggregated to yield dataset-level insights.

B.1. Feature importance scores. The most popular approach to prediction-level interpretation has involved assigning importance scores to individual features. Intuitively, a variable with a large positive (negative) score made a highly positive (negative) contribution to a particular prediction. In the deep learning literature, a number of different approaches have been proposed to address this problem (71–78), with some methods for other models as well (79). These are often displayed in the form of a heatmap highlighting important features. Note that feature importance scores at the prediction level can offer much more information than feature importance scores at the dataset level. This is a result of heterogeneity in a nonlinear model: the importance of a feature can vary for different examples as a result of interactions with other features.

While this area has seen progress in recent years, concerns have been raised about the descriptive accuracy of these methods. In particular, ref. 80 shows that many popular methods produce similar interpretations for a trained model versus a randomly initialized one and are qualitatively very similar to an edge detector. Moreover, it has been shown that some feature importance scores for CNNs are doing (partial) image recovery which is unrelated to the network decisions (81).

Ex: When using ML models to predict sensitive outcomes, such as whether a person should receive a loan or a criminal sentence, it is important to verify that the algorithm is not discriminating against people based on protected attributes, such as race or gender. This problem is often described as ensuring ML models are “fair.” In ref. 17, the authors introduce a variable importance measure designed to isolate the contributions of individual variables, such as gender, among a set of correlated variables.

Based on these variable importance scores, the authors construct transparency reports, such as the one displayed in SI Appendix, Fig. S2, which displays the importance of features used to predict that “Mr. Z” is likely to be arrested in the future (an outcome which is often used in predictive policing), with each bar corresponding to a feature provided to the classifier and the y axis displaying the importance score for that feature. In this instance, the race feature has the largest value, indicating that the classifier is indeed discriminating based on race. Thus, in this instance, prediction-level feature importance scores can identify that a model is unfairly discriminating based on race.
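As a minimal, model-agnostic illustration of a prediction-level importance score (a simple occlusion-style heuristic on simulated data, not any of the published methods cited above), the sketch below perturbs one feature at a time for a single example and records the change in the predicted probability.

```python
# A minimal, model-agnostic sketch of a prediction-level importance score: for a
# single example, replace each feature in turn with its training-set mean and
# record how much the predicted probability changes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

x = X[0].copy()                    # the individual prediction to explain
baseline = X.mean(axis=0)          # reference values used to "remove" a feature
p_original = model.predict_proba([x])[0, 1]

scores = np.zeros(len(x))
for j in range(len(x)):
    x_perturbed = x.copy()
    x_perturbed[j] = baseline[j]   # occlude feature j
    scores[j] = p_original - model.predict_proba([x_perturbed])[0, 1]

print(f"predicted P(class 1) for this example: {p_original:.3f}")
for j in np.argsort(-np.abs(scores)):
    print(f"feature {j}: score {scores[j]:+.3f}")
# A large positive score means feature j's observed value pushed this particular
# prediction toward class 1 relative to the baseline value.
```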
A.1. Measuring descriptive accuracy. One way to measure an improvement to an interpretation method is to demonstrate that its output better captures what the ML model has learned, i.e., its descriptive accuracy. However, unlike predictive accuracy, descriptive accuracy is generally very challenging to measure or quantify (82). As a fallback, researchers often show individual, cherry-picked interpretations which seem "reasonable." These kinds of evaluations are limited and unfalsifiable. In particular, these results are limited to the few examples shown and not generally applicable to the entire dataset.

While the community has not settled on a standard evaluation protocol, there are some promising directions. In particular, the use of simulation studies presents a partial solution. In this setting, a researcher defines a simple generative process, generates a large amount of data from that process, and trains the ML model on those data. Assuming a proper simulation setup, a sufficiently powerful model to recover the generative process, and sufficiently large training data, the trained model should achieve near-perfect generalization accuracy. To compute an evaluation metric, the researcher can then check whether the interpretations recover aspects of the original generative process. For example, refs. 59 and 89 train neural networks on a suite of generative models with certain built-in interactions and test whether their method successfully recovers them. Here, due to the ML model's near-perfect generalization accuracy, we know that the model is likely to have recovered some aspects of the generative process, thus providing a ground truth against which to evaluate interpretations. In a related approach, when an underlying scientific problem has been previously studied, prior experimental findings can serve as a partial ground truth to retrospectively validate interpretations (21).
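The sketch below illustrates such a simulation study in a deliberately simple setting: the response depends only on an interaction between the first 2 of 10 features, a flexible model is fit to the simulated data, and a permutation-based importance score is checked against this known ground truth. The generative process, model, and importance measure are illustrative choices, not those used in refs. 59 and 89.

```python
# Minimal sketch of a simulation study: only features 0 and 1 (through their
# interaction) drive the response; the remaining features are pure noise.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=5000)  # known generative process

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
print("test R^2:", round(model.score(X_te, y_te), 3))  # check generalization first

# Evaluation metric: do the interpretation's top-ranked features match the truth?
imp = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
top_two = set(np.argsort(imp.importances_mean)[-2:].tolist())
print("recovered features:", top_two, "ground truth:", {0, 1})
```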
A.2. Demonstrating relevancy to real-world problems. Another angle for developing improved interpretation methods is to improve the relevancy of interpretations for some audience or problem. This is normally done by introducing a novel form of output, such as feature heatmaps (71), rationales (90), or feature hierarchies (84), or identifying important elements in the training set (86). A common pitfall in the current literature is to focus on the novel output, ignoring what real-world problems it can actually solve. Given the abundance of possible interpretations, it is particularly easy for researchers to propose novel methods which do not actually solve any real-world problems.

There have been 2 dominant approaches for demonstrating improved relevancy. The first, and strongest, is to directly use the introduced method in solving a domain problem. For instance, in one example discussed above (21), the authors evaluated a new interpretation method (iterative random forests) by demonstrating that it could be used to identify meaningful biological Boolean interactions for use in experiments. In instances like this, where the interpretations are used directly to solve a domain problem, their relevancy is indisputable. A second, less direct, approach is the use of human studies, often through services like Amazon's Mechanical Turk. Here, humans are asked to perform certain tasks, such as evaluating how much they trust a model's predictions (84). While challenging to properly construct and perform, these studies are vital to demonstrate that new interpretation methods are, in fact, relevant to any potential practitioners. However, one shortcoming of this approach is that it is only possible to use a general audience of AMT crowdsourced workers, rather than a more relevant, domain-specific audience.

B. Model Based. Now that we have discussed the general problem of evaluating interpretations, we highlight important challenges for the 2 main subfields of interpretable machine learning: model-based and post hoc interpretability. Whenever model-based interpretability can achieve reasonable predictive accuracy and relevancy, by virtue of its high descriptive accuracy it is preferable to fitting a more complex model and relying upon post hoc interpretability. Thus, the main focus for model-based interpretability is increasing its range of possible use cases by increasing its predictive accuracy through more accurate models and transparent feature engineering. It is worth noting that sometimes a combination of model-based and post hoc interpretations is ideal.

B.1. Building accurate and interpretable models. In many instances, model-based interpretability methods fail to achieve a reasonable predictive accuracy. In these cases, practitioners are forced to abandon model-based interpretations in search of more accurate models. Thus, an effective way of increasing the potential uses for model-based interpretability is to devise new modeling methods which produce higher predictive accuracy while maintaining their high descriptive accuracy and relevance. Promising examples of this work include the previously discussed examples on estimating pneumonia risk from patient data (7) and Bayesian models for generating rule lists to estimate a patient's risk of stroke (40). Detailed directions for this work are suggested in ref. 91.
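The sketch below illustrates the workflow that this line of work aims to support: fit a simple, directly readable model first and quantify the accuracy gap to a more flexible alternative before abandoning model-based interpretability. It uses synthetic data and generic scikit-learn models; it is not the additive model of ref. 7 or the Bayesian rule lists of ref. 40.

```python
# Minimal sketch: compare a sparse, readable model against a black box and
# only accept the black box if the accuracy gap justifies it. Synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

simple = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_tr, y_tr)
black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("sparse logistic regression accuracy:", simple.score(X_te, y_te))
print("random forest accuracy:", black_box.score(X_te, y_te))
print("nonzero coefficients:", int((simple.coef_ != 0).sum()), "of", X.shape[1])
```

If the simple model is within an acceptable margin of the more complex one, its sparse coefficients already constitute a model-based interpretation; if not, the measured gap quantifies what interpretability would cost in predictive accuracy.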
B.2. Tools for feature engineering. When we have more informative and meaningful features, we can use simpler modeling methods to achieve a comparable predictive accuracy. Thus, methods that can produce more useful features broaden the potential uses of model-based interpretations. The first main category of work lies in improved tools for exploratory data analysis. By better enabling researchers to interact with and understand their data, these tools (combined with domain knowledge) provide increased opportunities for them to identify helpful features. Examples include interactive environments (92–94), tools for visualization (95–97), and data exploration tools (98, 99). The second category falls under unsupervised learning, which is often used as a tool for automatically finding relevant structure in data. Improvements in unsupervised techniques such as clustering and matrix factorization could lead to more useful features.
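As a small illustration of this second category, the sketch below factorizes a synthetic nonnegative data matrix into a handful of components and fits a simple classifier on the learned features, with the factor loadings available for inspection as engineered features. The data, number of components, and models are arbitrary placeholder choices.

```python
# Minimal sketch: matrix factorization as automatic feature engineering,
# followed by a simple model on the learned features. Synthetic count data.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.poisson(lam=3.0, size=(1000, 50)).astype(float)  # e.g., count data
signal = X[:, :5].sum(axis=1)
y = (signal > signal.mean()).astype(int)                 # hypothetical labels

nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)                    # 5 learned features per sample
print("factor loadings shape:", nmf.components_.shape)   # (5, 50), inspectable

clf = LogisticRegression(max_iter=1000)
print("CV accuracy on the 5 learned features:",
      cross_val_score(clf, W, y, cv=5).mean().round(3))
```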
C. Post Hoc. In contrast to model-based interpretability, much of post hoc interpretability is relatively new, with many foundational concepts still unclear. In particular, we feel that 2 of the most important questions to be answered are what an interpretation of an ML model should look like and how post hoc interpretations can be used to increase a model's predictive accuracy. It has also been emphasized that in high-stakes decisions practitioners should be very careful when applying post hoc methods with unknown descriptive accuracy (91).

C.1. What an interpretation of a black box should look like. Given a black-box predictor and real-world problem, it is generally unclear what format, or combination of formats, is best to fully capture a model's behavior. Researchers have proposed a variety of interpretation forms, including feature heatmaps (71), feature hierarchies (84), and identifying important elements in the training set (86). However, in all instances there is a gap between the simple information provided by these interpretations and what the model has actually learned. Moreover, it is unclear whether any of the current interpretation forms can fully capture a model's behavior or whether a new format altogether is needed. How to close that gap, while producing outputs relevant to a particular audience/problem, is an open problem.
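One simple instance of the last of these forms is sketched below: for a query point, it surfaces the training examples that a tree ensemble treats as most similar, using shared leaf membership as the similarity measure. This is a crude, case-based proxy for identifying important training elements, not the influence-function approach of ref. 86; the data and model are placeholders.

```python
# Minimal sketch of one interpretation form: surface the training examples a
# model treats as most similar to a query point, via shared leaf membership.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

leaves_train = model.apply(X)      # (n_samples, n_trees) leaf indices
leaves_query = model.apply(X[:1])  # treat training example 0 as the query

# Similarity = fraction of trees in which the two points land in the same leaf.
similarity = (leaves_train == leaves_query).mean(axis=1)
nearest = np.argsort(similarity)[::-1][1:4]  # skip the query itself
print("most similar training examples:", nearest, "labels:", y[nearest])
```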
C.2. Using interpretations to improve predictive accuracy. In some instances, post hoc interpretations uncover that a model has learned relationships a practitioner knows to be incorrect. For instance, prior interpretation work has shown that a binary husky vs. wolf classifier simply learns to identify whether there is snow in the image, ignoring the animals themselves (77). A natural question to ask is whether it is possible for the practitioner to correct these relationships learned by the model and consequently increase its predictive accuracy. Given the challenges surrounding simply generating post hoc interpretations, research on their uses has been limited (100, 101), particularly in modern deep learning models.



However, as the field of post hoc interpretations continues to mature, this could be an exciting avenue for researchers to increase the predictive accuracy of their models by exploiting prior knowledge, independently of any other benefits of interpretations.
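A minimal sketch of this idea, loosely in the spirit of ref. 100, is shown below: during training, the loss is augmented with a penalty on the input gradients of features the practitioner believes should be irrelevant (the analog of the snow pixels above). The network, data, choice of forbidden features, and penalty weight are all hypothetical and do not reproduce the exact formulation of ref. 100.

```python
# Minimal sketch: penalize the model for relying on features a practitioner
# knows to be irrelevant (here, features 5-9). All choices are placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

X = torch.randn(256, 10)
y = (X[:, 0] + X[:, 5] > 0).float().unsqueeze(1)  # feature 5 acts as a spurious cue
forbidden = slice(5, 10)                          # features the model should ignore

for step in range(200):
    X.requires_grad_(True)
    loss = nn.functional.binary_cross_entropy_with_logits(model(X), y)
    # Input gradients, kept in the graph so that they can be penalized.
    grads = torch.autograd.grad(loss, X, create_graph=True)[0]
    penalty = (grads[:, forbidden] ** 2).sum()
    opt.zero_grad()
    (loss + 10.0 * penalty).backward()
    opt.step()
    X = X.detach()
```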
ACKNOWLEDGMENTS. This research was supported in part by Grants Army Research Office W911NF1710005, Office of Naval Research N00014-16-1-2664, National Science Foundation (NSF) DMS-1613002, and NSF IIS 1741340; a Natural Sciences and Engineering Research Council of Canada Postgraduate Scholarships-Doctoral program fellowship; and an Adobe research award. We thank the Center for Science of Information, a US NSF Science and Technology Center, under Grant CCF-0939370. R.A.-A. thanks the Allen Institute founder, Paul G. Allen, for his vision, encouragement, and support.

1. G. Litjens et al., A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017).
2. T. Brennan, W. L. Oliver, The emergence of machine learning techniques in criminology. Criminol. Public Policy 12, 551–562 (2013).
3. C. Angermueller, T. Pärnamaa, L. Parts, O. Stegle, Deep learning for computational biology. Mol. Syst. Biol. 12, 878 (2016).
4. M. A. T. Vu et al., A shared vision for machine learning in neuroscience. J. Neurosci. 38, 1601–1607 (2018).
5. B. Goodman, S. Flaxman, European Union regulations on algorithmic decision-making and a "right to explanation". arXiv:1606.08813 (31 August 2016).
6. C. Dwork, M. Hardt, T. Pitassi, O. Reingold, R. Zemel, "Fairness through awareness" in Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, S. Goldwasser, Ed. (ACM, New York, NY, 2012), pp. 214–226.
7. R. Caruana et al., "Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission" in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, L. Cao, C. Zhang, Eds. (ACM, New York, NY, 2015), pp. 1721–1730.
8. S. Chakraborty et al., "Interpretability of deep learning models: A survey of results" in Interpretability of Deep Learning Models: A Survey of Results, D. El Baz, J. Gao, R. Grymes, Eds. (IEEE, San Francisco, CA, 2017).
9. R. Guidotti, A. Monreale, F. Turini, D. Pedreschi, F. Giannotti, A survey of methods for explaining black box models. arXiv:1802.01933 (21 June 2018).
10. S. M. Lundberg, S. I. Lee, "A unified approach to interpreting model predictions" in Advances in Neural Information Processing Systems, T. Sejnowski, Ed. (Neural Information Processing Systems, 2017), pp. 4768–4777.
11. M. Ancona, E. Ceolini, C. Oztireli, M. Gross, "Towards better understanding of gradient-based attribution methods for deep neural networks" in 6th International Conference on Learning Representations, A. Rush, Ed. (ICLR, 2018).
12. F. Doshi-Velez, B. Kim, A roadmap for a rigorous science of interpretability. arXiv:1702.08608 (2 March 2017).
13. L. H. Gilpin et al., Explaining explanations: An approach to evaluating interpretability of machine learning. arXiv:1806.00069 (3 February 2019).
14. Z. C. Lipton, The mythos of model interpretability. arXiv:1606.03490 (6 March 2017).
15. M. Hardt, E. Price, N. Srebro, "Equality of opportunity in supervised learning" in Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, Eds. (Neural Information Processing Systems, 2016), pp. 3315–3323.
16. D. Boyd, K. Crawford, Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Inf. Commun. Soc. 15, 662–679 (2012).
17. A. Datta, S. Sen, Y. Zick, "Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems" in 2016 IEEE Symposium on Security and Privacy (SP), M. Locasto, Ed. (IEEE, San Jose, CA, 2016), pp. 598–617.
18. F. C. Keil, Explanation and understanding. Annu. Rev. Psychol. 57, 227–254 (2006).
19. T. Lombrozo, The structure and function of explanations. Trends Cogn. Sci. 10, 464–470 (2006).
20. G. W. Imbens, D. B. Rubin, Causal Inference in Statistics, Social, and Biomedical Sciences (Cambridge University Press, 2015).
21. S. Basu, K. Kumbier, J. B. Brown, B. Yu, Iterative random forests to discover predictive and stable high-order interactions. Proc. Natl. Acad. Sci. U.S.A. 115, 1943–1948 (2018).
22. B. Yu, Stability. Bernoulli 19, 1484–1500 (2013).
23. F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, W. A. Stahel, Robust Statistics: The Approach Based on Influence Functions (John Wiley & Sons, 2011), vol. 196.
24. H. Pimentel, Z. Hu, H. Huang, Biclustering by sparse canonical correlation analysis. Quant. Biol. 6, 56–67 (2018).
25. R. Abbasi-Asl et al., The DeepTune framework for modeling and characterizing neurons in visual cortex area V4. bioRxiv p. 465534 (9 November 2018).
26. A. W. Roe et al., Toward a unified theory of visual area V4. Neuron 74, 12–29 (2012).
27. C. L. Huang, M. C. Chen, C. J. Wang, Credit scoring with a data mining approach based on support vector machines. Expert Syst. Appl. 33, 847–856 (2007).
28. G. E. Box, Science and statistics. J. Am. Stat. Assoc. 71, 791–799 (1976).
29. L. Breiman, Statistical modeling: The two cultures (with comments and a rejoinder by the author). Stat. Sci. 16, 199–231 (2001).
30. D. A. Freedman, Statistical models and shoe leather. Sociol. Methodol. 21, 291–313 (1991).
31. C. Lim, B. Yu, Estimation stability with cross-validation (ESCV). J. Comput. Graph. Stat. 25, 464–492 (2016).
32. R. Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58, 267–288 (1996).
33. B. A. Olshausen, D. J. Field, Sparse coding with an overcomplete basis set: A strategy employed by V1? Vis. Res. 37, 3311–3325 (1997).
34. H. Akaike, "Factor analysis and AIC" in Selected Papers of Hirotugu Akaike (Springer, 1987), pp. 371–386.
35. K. P. Burnham, D. R. Anderson, Multimodel inference: Understanding AIC and BIC in model selection. Sociol. Methods Res. 33, 261–304 (2004).
36. Y. C. Pati, R. Rezaiifar, P. S. Krishnaprasad, "Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition" in Proceedings of the 27th Asilomar Conference on Signals, Systems & Computers, F. Harris, Ed. (IEEE, Pacific Grove, CA, 1993), pp. 40–44.
37. D. Amaratunga, J. Cabrera, Y. S. Lee, Enriched random forests. Bioinformatics 24, 2010–2014 (2008).
38. L. Breiman, J. Friedman, R. Olshen, C. J. Stone, Classification and Regression Trees (Chapman and Hall, 1984).
39. J. H. Friedman, B. E. Popescu, Predictive learning via rule ensembles. Ann. Appl. Stat. 2, 916–954 (2008).
40. B. Letham, C. Rudin, T. H. McCormick, D. Madigan, Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. Ann. Appl. Stat. 9, 1350–1371 (2015).
41. T. Hastie, R. Tibshirani, Generalized additive models. Stat. Sci. 1, 297–318 (1986).
42. J. Kim, J. F. Canny, "Interpretable learning for self-driving cars by visualizing causal attention" in ICCV, K. Ikeuchi, G. Medioni, M. Pelillo, Eds. (IEEE, 2017), pp. 2961–2969.
43. J. Andreas, M. Rohrbach, T. Darrell, D. Klein, "Neural module networks" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, R. Bajcsy, F. Li, T. Tuytelaars, Eds. (IEEE, 2016), pp. 39–48.
44. D. Koller, N. Friedman, F. Bach, Probabilistic Graphical Models: Principles and Techniques (MIT Press, 2009).
45. J. Ramos, "Using tf-idf to determine word relevance in document queries" in Proceedings of the First Instructional Conference on Machine Learning, T. Fawcett, N. Mishra, Eds. (ICML, 2003), vol. 242, pp. 133–142.
46. T. Shi, B. Yu, E. E. Clothiaux, A. J. Braverman, Daytime arctic cloud detection based on multi-angle satellite data with case studies. J. Am. Stat. Assoc. 103, 584–593 (2008).
47. I. Jolliffe, Principal Component Analysis (Springer, 1986).
48. A. J. Bell, T. J. Sejnowski, An information-maximization approach to blind separation and blind deconvolution. Neural Comput. 7, 1129–1159 (1995).
49. H. Hotelling, Relations between two sets of variates. Biometrika 28, 321–377 (1936).
50. S. Wu et al., Stability-driven nonnegative matrix factorization to interpret spatial gene expression and build local gene networks. Proc. Natl. Acad. Sci. U.S.A. 113, 4290–4295 (2016).
51. M. Craven, J. W. Shavlik, "Extracting tree-structured representations of trained networks" in Advances in Neural Information Processing Systems, T. Petsche, Ed. (Neural Information Processing Systems, 1996), pp. 24–30.
52. N. Frosst, G. Hinton, Distilling a neural network into a soft decision tree. arXiv:1711.09784 (27 November 2017).
53. J. D. Olden, M. K. Joy, R. G. Death, An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data. Ecol. Model. 178, 389–397 (2004).
54. L. Breiman, Random forests. Mach. Learn. 45, 5–32 (2001).
55. C. Strobl, A. L. Boulesteix, T. Kneib, T. Augustin, A. Zeileis, Conditional variable importance for random forests. BMC Bioinf. 9, 307 (2008).
56. A. Altmann, L. Toloşi, O. Sander, T. Lengauer, Permutation importance: A corrected feature importance measure. Bioinformatics 26, 1340–1347 (2010).
57. K. Kumbier, S. Basu, J. B. Brown, S. Celniker, B. Yu, Refining interaction search through signed iterative random forests. arXiv:1810.07287 (16 October 2018).
58. S. Devlin, C. Singh, W. J. Murdoch, B. Yu, Disentangled attribution curves for interpreting random forests and boosted trees. arXiv:1905.07631 (18 May 2019).
59. M. Tsang, D. Cheng, Y. Liu, Detecting statistical interactions from neural network weights. arXiv:1705.04977 (27 February 2018).
60. R. Abbasi-Asl, B. Yu, Structural compression of convolutional neural networks based on greedy filter pruning. arXiv:1705.07356 (21 July 2017).
61. Office of Institutional Research HU, Exhibit 157: Demographics of Harvard college applicants. http://samv91khoyt2i553a2t1s05i-wpengine.netdna-ssl.com/wp-content/uploads/2018/06/Doc-421-157-May-30-2013-Report.pdf (2018), pp. 8–9.
62. P. S. Arcidiacono, Exhibit A: Expert report of Peter S. Arcidiacono. http://samv91khoyt2i553a2t1s05i-wpengine.netdna-ssl.com/wp-content/uploads/2018/06/Doc-415-1-Arcidiacono-Expert-Report.pdf (2018).
63. D. Card, Exhibit 33: Report of David Card. https://projects.iq.harvard.edu/files/diverse-education/files/legal - card report revised filing.pdf (2018).
64. M. D. Zeiler, R. Fergus, "Visualizing and understanding convolutional networks" in European Conference on Computer Vision, D. Fleet, T. Pajdla, B. Schiele, T. Tuytelaars, Eds. (Springer, Zurich, Switzerland, 2014), pp. 818–833.
65. C. Olah, A. Mordvintsev, L. Schubert, Feature visualization. Distill 2, e7 (2017).
66. A. Mordvintsev, C. Olah, M. Tyka, Deepdream-a code example for visualizing neural networks. Google Res. 2, 5 (2015).
67. D. Wei, B. Zhou, A. Torralba, W. Freeman, Understanding intra-class knowledge inside CNN. arXiv:1507.02379 (21 July 2015).
68. Q. Zhang, R. Cao, F. Shi, Y. N. Wu, S. C. Zhu, Interpreting CNN knowledge via an explanatory graph. arXiv:1708.01785 (2017).
69. A. Karpathy, J. Johnson, L. Fei-Fei, Visualizing and understanding recurrent networks. arXiv:1506.02078 (17 November 2015).

70. H. Strobelt, S. Gehrmann, B. Huber, H. Pfister, A. M. Rush, Visual analysis of hidden state dynamics in recurrent neural networks. arXiv:1606.07461v1 (23 June 2016).
71. M. Sundararajan, A. Taly, Q. Yan, "Axiomatic attribution for deep networks" in ICML, T. Jebara, Ed. (ICML, 2017).
72. R. R. Selvaraju et al., Grad-CAM: Visual explanations from deep networks via gradient-based localization. https://arxiv.org/abs/1610.02391 v3 7(8). Accessed 7 December 2018.
73. D. Baehrens et al., How to explain individual classification decisions. J. Mach. Learn. Res. 11, 1803–1831 (2010).
74. A. Shrikumar, P. Greenside, A. Shcherbina, A. Kundaje, Not just a black box: Learning important features through propagating activation differences. arXiv:1605.01713 (11 April 2017).
75. W. J. Murdoch, A. Szlam, Automatic rule extraction from long short term memory networks. arXiv:1702.02540 (24 February 2017).
76. P. Dabkowski, Y. Gal, Real time image saliency for black box classifiers. arXiv:1705.07857 (22 May 2017).
77. M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?: Explaining the predictions of any classifier" in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, B. Krishnapuram, M. Shah, Eds. (ACM, New York, NY, 2016), pp. 1135–1144.
78. L. M. Zintgraf, T. S. Cohen, T. Adel, M. Welling, Visualizing deep neural network decisions: Prediction difference analysis. arXiv:1702.04595 (15 February 2017).
79. S. M. Lundberg, G. G. Erion, S. I. Lee, Consistent individualized feature attribution for tree ensembles. arXiv:1802.03888 (7 March 2019).
80. J. Adebayo et al., "Sanity checks for saliency maps" in Advances in Neural Information Processing Systems, T. Sejnowski, Ed. (Neural Information Processing Systems, 2018), pp. 9505–9515.
81. W. Nie, Y. Zhang, A. Patel, A theoretical explanation for perplexing behaviors of backpropagation-based visualizations. arXiv:1805.07039 (8 June 2018).
82. G. Hooker, Generalized functional ANOVA diagnostics for high-dimensional functions of dependent variables. J. Comput. Graph. Stat. 16, 709–732 (2007).
83. W. J. Murdoch, P. J. Liu, B. Yu, "Beyond word importance: Contextual decomposition to extract interactions from LSTMs" in ICLR, A. Rush, Ed. (ICLR, 2018).
84. C. Singh, W. J. Murdoch, B. Yu, "Hierarchical interpretations for neural network predictions" in ICLR, D. Sonog, K. Cho, M. White, Eds. (ICLR, 2019).
85. A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, B. Schiele, "Grounding of textual phrases in images by reconstruction" in European Conference on Computer Vision, H. Bischof, D. Cremers, B. Schiele, R. Zabih, Eds. (Springer, New York, NY, 2016).
86. P. W. Koh, P. Liang, Understanding black-box predictions via influence functions. arXiv:1703.04730 (10 July 2017).
87. R. Caruana, H. Kangarloo, J. Dionisio, U. Sinha, D. Johnson, "Case-based explanation of non-case-based learning methods" in Proceedings of the AMIA Symposium (American Medical Informatics Association, Bethesda, MD, 1999), p. 212.
88. N. Papernot, P. McDaniel, Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. arXiv:1803.04765 (13 March 2018).
89. M. Tsang, Y. Sun, D. Ren, Y. Liu, Can I trust you more? Model-agnostic hierarchical explanations. arXiv:1812.04801 (12 December 2018).
90. T. Lei, R. Barzilay, T. Jaakkola, Rationalizing neural predictions. arXiv:1606.04155 (2 November 2016).
91. C. Rudin, Please stop explaining black box models for high stakes decisions. arXiv:1811.10154 (22 September 2019).
92. T. Kluyver et al., "Jupyter notebooks-a publishing format for reproducible computational workflows" in ELPUB (ePrints Soton, 2016), pp. 87–90.
93. F. Pérez, B. E. Granger, IPython: A system for interactive scientific computing. Comput. Sci. Eng. 9, 21–29 (2007).
94. RStudio Team, RStudio: Integrated Development Environment for R (RStudio, Inc., Boston, MA, 2016).
95. R. Barter, B. Yu, Superheat: Supervised heatmaps for visualizing complex data. arXiv:1512.01524 (26 January 2017).
96. H. Wickham, ggplot2: Elegant Graphics for Data Analysis (Springer, 2016).
97. M. Waskom et al., Seaborn: Statistical data visualization. https://seaborn.pydata.org (2014). Accessed 15 May 2017.
98. W. McKinney et al., "Data structures for statistical computing in python" in Proceedings of the 9th Python in Science Conference (SciPy, Austin, TX, 2010), vol. 445, pp. 51–56.
99. H. Wickham, tidyverse: Easily install and load the 'tidyverse' (Version 1.2.1, CRAN, 2017).
100. A. S. Ross, M. C. Hughes, F. Doshi-Velez, Right for the right reasons: Training differentiable models by constraining their explanations. arXiv:1703.03717 (25 May 2017).
101. O. Zaidan, J. Eisner, C. Piatko, "Using 'annotator rationales' to improve machine learning for text categorization" in Proceedings of NAACL HLT, C. Sidner, Ed. (ACL, 2007), pp. 260–267.