Definitions, methods, and applications in interpretable machine learning
W. James Murdoch^a,1, Chandan Singh^b,1, Karl Kumbier^a,2, Reza Abbasi-Asl^b,c,d,2, and Bin Yu^a,b,3

^a Statistics Department, University of California, Berkeley, CA 94720; ^b Electrical Engineering and Computer Science Department, University of California, Berkeley, CA 94720; ^c Department of Neurology, University of California, San Francisco, CA 94158; and ^d Allen Institute for Brain Science, Seattle, WA 98109
Contributed by Bin Yu, July 1, 2019 (sent for review January 16, 2019; reviewed by Rich Caruana and Giles Hooker)
Machine-learning models have demonstrated great success in learning complex patterns that enable them to make predictions about unobserved data. In addition to using models for prediction, the ability to interpret what a model has learned is receiving an increasing amount of attention. However, this increased focus has led to considerable confusion about the notion of interpretability. In particular, it is unclear how the wide array of proposed interpretation methods are related and what common concepts can be used to evaluate them. We aim to address these concerns by defining interpretability in the context of machine learning and introducing the predictive, descriptive, relevant (PDR) framework for discussing interpretations. The PDR framework provides 3 overarching desiderata for evaluation: predictive accuracy, descriptive accuracy, and relevancy, with relevancy judged relative to a human audience. Moreover, to help manage the deluge of interpretation methods, we introduce a categorization of existing techniques into model-based and post hoc categories, with subgroups including sparsity, modularity, and simulatability. To demonstrate how practitioners can use the PDR framework to evaluate and understand interpretations, we provide numerous real-world examples. These examples highlight the often underappreciated role played by human audiences in discussions of interpretability. Finally, based on our framework, we discuss limitations of existing methods and directions for future work. We hope that this work will provide a common vocabulary that will make it easier for both practitioners and researchers to discuss and choose from the full range of interpretation methods.

interpretability | machine learning | explainability | relevancy

Significance

The recent surge in interpretability research has led to confusion on numerous fronts. In particular, it is unclear what it means to be interpretable and how to select, evaluate, or even discuss methods for producing interpretations of machine-learning models. We aim to clarify these concerns

model based and post hoc. We then introduce the predictive, descriptive, relevant (PDR) framework, consisting of 3 desiderata for evaluating and constructing interpretations: predictive accuracy, descriptive accuracy, and relevancy, where relevancy is judged by a human audience. Using these terms, we categorize a broad range of existing methods, all grounded in real-world examples. In doing so, we provide a common vocabulary for researchers and practitioners to use in evaluating and selecting interpretation methods. We then show how our work enables a clearer discussion of open problems for future research.

1. Defining Interpretable Machine Learning

On its own, interpretability is a broad, poorly defined concept. Taken to its full generality, to interpret data means to extract information (of some form) from them. The set of methods falling under this umbrella spans everything from designing an initial experiment to visualizing final results. In this overly general form, interpretability is not substantially different from the established concepts of data science and applied statistics.

Instead of general interpretability, we focus on the use of interpretations to produce insight from ML models as part of the larger data–science life cycle. We define interpretable machine learning as the extraction of relevant knowledge from
plex, and practitioners need to train a black-box model to achieve reasonable predictive accuracy.

After discussing desiderata for interpretation methods, we investigate these 2 forms of interpretations in detail and discuss associated methods.

4. The PDR Desiderata for Interpretations

In general, it is unclear how to select and evaluate interpretation methods for a particular problem and audience. To help

B. Relevancy. When selecting an interpretation method, it is not enough for the method to have high accuracy—the extracted information must also be relevant. For example, in the context of genomics, a patient, doctor, biologist, and statistician may each want different (yet consistent) interpretations from the same model. The context provided by the problem and data stages in Fig. 1 guides what kinds of relationships a practitioner is interested in learning about and by extension the methods that should
reducing the number of parameters to analyze, sparse models
can be easier to understand, yielding higher descriptive accu-
racy. Moreover, incorporating prior information in the form of
sparsity into a sparse problem can help a model achieve higher
predictive accuracy and yield more relevant insights. Note that
incorporating sparsity can often be quite difficult, as it requires
understanding the data-specific structure of the sparsity and how
it can be modeled.
Methods for obtaining sparsity often utilize a penalty on a
loss function, such as LASSO (32) and sparse coding (33),
or on model selection criteria such as AIC or BIC (34, 35).
Many search-based methods have been developed to find sparse
solutions. These methods search through the space of nonzero
coefficients using classical subset-selection methods [e.g., orthog-
onal matching pursuit (36)]. Model sparsity is often useful for
high-dimensional problems, where the goal is to identify key fea-
tures for further analysis. For instance, sparsity penalties have
been incorporated into random forests to identify a sparse subset
of important features (37).
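To make the penalty-based route concrete, LASSO adds an L1 penalty to the least-squares loss, minimizing $\|y - X\beta\|_2^2 + \lambda \|\beta\|_1$, which drives many coefficients exactly to zero. The following minimal sketch (not from the paper; the data, penalty strength, and variable names are hypothetical choices for illustration) fits such a model with scikit-learn and reports which coefficients remain nonzero.

```python
# Minimal sketch: an L1 (LASSO) penalty drives most coefficients to exactly zero,
# leaving a small, more interpretable set of candidate features.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic high-dimensional data: only 5 of the 200 features carry signal.
X, y = make_regression(n_samples=100, n_features=200, n_informative=5,
                       noise=1.0, random_state=0)

model = Lasso(alpha=1.0).fit(X, y)      # alpha controls the strength of the sparsity penalty
selected = np.flatnonzero(model.coef_)  # indices of features with nonzero coefficients

print(f"{selected.size} of {X.shape[1]} features retained:", selected)
```

In practice, the penalty strength would be tuned, for example by cross-validation with LassoCV, trading predictive accuracy against the number of retained features.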
In the following example from genomics, sparsity is used
to increase the relevancy of an interpretation by reducing the
number of potential interactions to a manageable level.
Example (Ex): Identifying interactions among regulatory fac-
tors or biomolecules is an important question in genomics.
Typical genomic datasets include thousands or even millions
of features, many of which are active in specific cellular or
developmental contexts. The massive scale of such datasets
makes interpretation a considerable challenge. Sparsity penalties
are frequently used to make the data manageable for statisti-
cians and their collaborating biologists to discuss and identify
promising candidates for further experiments.
[Fig. 2. Impact of interpretability methods on descriptive and predictive accuracy.]

For instance, one recent study (24) uses a biclustering approach based on sparse canonical correlation analysis (SCCA)
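The sparse CCA underlying this biclustering approach is not available in scikit-learn, but the core idea (finding low-dimensional projections of two data views that are maximally correlated) can be sketched with plain, nonsparse CCA on synthetic data; everything in the snippet below is a hypothetical illustration rather than the authors' method.

```python
# Rough, hypothetical sketch of the idea behind (sparse) CCA: find projections of two
# data views that are maximally correlated. Plain CCA is used here as a stand-in,
# since sparse CCA is not part of scikit-learn; all sizes below are made up.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=(n, 2))  # shared structure across both views
X = latent @ rng.normal(size=(2, 50)) + 0.5 * rng.normal(size=(n, 50))  # view 1, e.g., 50 expression features
Y = latent @ rng.normal(size=(2, 30)) + 0.5 * rng.normal(size=(n, 30))  # view 2, e.g., 30 regulatory features

cca = CCA(n_components=2).fit(X, Y)
X_c, Y_c = cca.transform(X, Y)

# Correlation of paired canonical variates; high values indicate shared structure.
for k in range(2):
    print(f"component {k}: corr = {np.corrcoef(X_c[:, k], Y_c[:, k])[0, 1]:.2f}")
```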
ularity to different degrees. Generalized additive models (41) force the relationship between variables in the model to be additive. In deep learning, specific methods such as attention (42) and modular network architectures (43) provide limited insight into a network's inner workings. Probabilistic models can enforce modularity by specifying a conditional independence structure which makes it easier to reason about different parts of a model independently (44).

The following example uses modularity to produce relevant interpretations for use in diagnosing biases in training data.

Ex: When prioritizing patient care for patients with pneumonia in a hospital, one possible method is to predict the likelihood of death within 60 d and focus on the patients with a higher mortality risk. Given the potential life and death consequences, being able to explain the reasons for hospitalizing a patient or not is very important.

A recent study (7) uses a dataset of 14,199 patients with pneumonia, with 46 features including demographics (e.g., age and gender), simple physical measurements (e.g., heart rate, blood pressure), and laboratory tests (e.g., white blood cell count, blood urea nitrogen). To predict mortality risk, the researchers use a generalized additive model with pairwise interactions, displayed below. The univariate and pairwise terms ($f_j(x_j)$ and $f_{ij}(x_i, x_j)$) can be individually interpreted in the form of curves and heatmaps, respectively:

$$g(\mathbb{E}[y]) = \beta_0 + \sum_{j} f_j(x_j) + \sum_{i \neq j} f_{ij}(x_i, x_j). \qquad [1]$$

By inspecting the individual modules, the researchers found a number of counterintuitive properties of their model. For instance, the fitted model learned that having asthma is associated with a lower risk of dying from pneumonia. In reality, the opposite is true—patients with asthma are known to have a higher risk of death from pneumonia. Because of this, in the collected data all patients with asthma received aggressive care, which was fortunately effective at reducing their risk of mortality relative to the general population.

In this instance, if the model were used without having been interpreted, pneumonia patients with asthma would have been deprioritized for hospitalization. Consequently, the use of ML would increase their likelihood of dying. Fortunately, the use of an interpretable model enabled the researchers to identify and

simple predictive model.

Ex: When modeling global climate patterns, an important quantity is the amount and location of arctic cloud coverage. Due to the complex, layered nature of climate models, it is beneficial to have simple, easily auditable, cloud coverage models for use by downstream climate scientists.

In ref. 46, the authors use an unlabeled dataset of arctic satellite imagery to build a model predicting whether each pixel in an image contains clouds or not. Given the qualitative similarity between ice and clouds, this is a challenging prediction problem. By conducting exploratory data analysis and using domain knowledge through interactions with climate scientists, the authors identify 3 simple features that are sufficient to cluster whether or not images contain clouds. Using these 3 features as input to quadratic discriminant analysis, they achieve both high predictive accuracy and transparency when compared with expert labels (which were not used in developing the model).

E. Model-Based Feature Engineering. There are a variety of automatic approaches for constructing interpretable features. Two examples are unsupervised learning and dimensionality reduction. Unsupervised methods, such as clustering, matrix factorization, and dictionary learning, aim to process unlabeled data and output a description of their structure. These structures often shed insight into relationships contained within the data and can be useful in building predictive models. Dimensionality reduction focuses on finding a representation of the data which is lower dimensional than the original data. Methods such as principal components analysis (47), independent components analysis (48), and canonical correlation analysis (49) can often identify a few interpretable dimensions, which can then be used as input to a model or to provide insights in their own right. Using fewer inputs can not only improve descriptive accuracy, but also increase predictive accuracy by reducing the number of parameters to fit. In the following example, unsupervised learning is used to represent images in a low-dimensional, genetically meaningful, space.

Ex: Heterogeneity is an important consideration in genomic problems and associated data. In many cases, regulatory factors or biomolecules can play a specific role in one context, such as a particular cell type or developmental stage, and have a very different role in other contexts. Thus, it is important to understand the "local" behavior of regulatory factors
or biomolecules. A recent study (50) uses unsupervised learning to learn spatial patterns of gene expression in Drosophila (fruit fly) embryos. In particular, it uses stability-driven nonnegative matrix factorization to decompose images of complex spatial gene expression patterns into a library of 21 "principal patterns," which can be viewed as preorgan regions. This decomposition, which is interpretable to biologists, allows the study of gene–gene interactions in preorgan regions of the developing embryo.

6. Post Hoc Interpretability

We now discuss how interpretability considerations come into play in the post hoc analysis stage of the data–science life cycle. At this stage, the practitioner analyzes a trained model to provide insights into the learned relationships. This is particularly challenging when the model's parameters do not clearly show what relationships the model has learned. To aid in this process, a variety of post hoc interpretability methods have been developed to provide insight into what a trained model has learned, without changing the underlying model. These methods are particularly important for settings where the collected data are high dimensional and complex, such as with image data. In these settings, interpretation methods must deal with the challenge that individual features are not semantically meaningful, making the problem more challenging than on datasets with more meaningful features. Once the information has been extracted from the fitted model, it can be analyzed using standard, exploratory data analysis techniques, such as scatter plots and histograms.

When conducting post hoc analysis, the model has already been trained, so its predictive accuracy is fixed. Thus, under the PDR framework, a researcher must consider only descriptive accuracy and relevancy (relative to a particular audience). Improving on each of these criteria are areas of active research.

Most widely useful post hoc interpretation methods fall into 2 main categories: prediction-level and dataset-level interpretations, which are sometimes referred to as local and global interpretations, respectively. Prediction-level interpretation methods focus on explaining individual predictions made by models, such as what features and/or interactions led to the particular prediction. Dataset-level approaches focus on the global relationships the model has learned, such as what visual patterns are associated with a predicted response. These 2 categories have much in common (in fact, dataset-level approaches often yield information at the prediction level), but we discuss them separately, as methods at different levels are meaningfully different. Prediction-level insights can provide fine-grained information about individual predictions, but often fail to yield dataset-level insights when it is not feasible to examine a sufficient amount of prediction-level interpretations.

A. Dataset-Level Interpretation. When practitioners are interested in more general relationships learned by a model, e.g., relationships that are relevant for a particular class of responses, they use dataset-level interpretations. For instance, this form of interpretation can be useful when it is not feasible for a practitioner to look at a large number of local predictions. In addition to the areas below, we note that there are other emerging techniques, such as model distillation (51, 52).

A.1. Interaction and feature importances. Feature importance scores, at the dataset level, try to capture how much individual features contribute, across a dataset, to a prediction. These scores can provide insights into what features the model has identified as important for which outcomes and their relative importance. Methods have been developed to score individual features in many models including neural networks (53), random forests (54, 55), and generic classifiers (56).

In addition to feature importances, methods exist to extract important interactions between features. Interactions are important as ML models are often highly nonlinear and learn complex interactions between features. Methods exist to extract interactions from many ML models, including random forests (21, 57, 58) and neural networks (59, 60). In the below example, the descriptive accuracy of random forests is increased by extracting Boolean interactions (a problem-relevant form of interpretation) from a trained model.

Ex: High-order interactions among regulatory factors or genes play an important role in defining cell type-specific behavior in biological systems. Thus, extracting such interactions from genomic data is an important problem in biology.

A previous line of work considers the problem of searching for biological interactions associated with important biological processes (21, 57). To identify candidate biological interactions, the authors train a series of iteratively reweighted random forests (RFs) and search for stable combinations of features that frequently co-occur along the predictive RF decision paths. This approach takes a step beyond evaluating the importance of individual features in an RF, providing a more complete description of how features influence predicted responses. By interpreting the interactions used in RFs, the researchers identified gene–gene interactions with 80% accuracy in the Drosophila embryo and identify candidate targets for higher-order interactions.

A.2. Statistical feature importances. In some instances, in addition to the raw value, we can compute statistical measures of confidence as feature importance scores, a standard technique taught in introductory statistics classes. By making assumptions about the underlying data-generating process, models like linear and logistic regression can compute confidence intervals and hypothesis tests for the values, and linear combinations, of their coefficients. These statistics can be helpful in determining the degree to which the observed coefficients are statistically significant. It is important to note that the assumptions of the underlying probabilistic model must be fully verified before using this form of interpretation. Below we present a cautionary example where different assumptions lead to opposing conclusions being drawn from the same dataset.

Ex: Here, we consider the lawsuit Students for Fair Admissions, Inc. v. Harvard regarding the use of race in undergraduate admissions to Harvard University. Initial reports by Harvard's Office of Institutional Research used logistic regression to model the probability of admission using different features of an applicant's profile, including race (61). This analysis found that the coefficient associated with being Asian (and not low income) was −0.418 with a significant P value (<0.001). This negative coefficient suggested that being Asian had a significant negative association with admission probability.

Subsequent analysis from both sides in the lawsuit attempted to analyze the modeling and assumptions to decide on the significance of race in the model's decision. The plaintiff's expert report (62) suggested that race was being unfairly used by building on the original report from Harvard's Office of Institutional Research. It also incorporates analysis on more subjective factors such as "personal ratings" which seem to hurt Asian students' admission. In contrast, the expert report supporting Harvard University (63) finds that by accounting for certain other variables, the effect of race on Asian students' acceptance is no longer significant. Significances derived from statistical tests in regression or logistic regression models at best establish association, but not causation. Hence the analyses from both sides are flawed. This example demonstrates the practical and misleading consequences of statistical feature importances when used inappropriately.

A.3. Visualizations. When dealing with high-dimensional datasets, it can be challenging to quickly understand the complex relationships that a model has learned, making the presentation
fits, but provide little insight into what patterns in the images increase the brain cells' response without further analysis. To remedy this, the authors introduce DeepTune, a method which provides a visualization, accessible to neuroscientists and others, of the patterns which activate a brain cell. The main intuition behind the method is to optimize the input of a network to maximize the response of a neural network model (which represents a brain cell).

The authors go on to analyze the major problem of instability. When post hoc visualizations attempt to answer scientific questions, the visualizations must be stable to reasonable perturbations; if there are changes in the visualization due to the choice of a model, it is likely not meaningful. The authors address this explicitly by fitting 18 different models to the data and using a stable optimization over all of the models to produce a final consensus DeepTune visualization.

A.4. Analyzing trends and outliers in predictions. When interpreting the performance of an ML model, it can be helpful to look not just at the average accuracy, but also at the distribution of predictions and errors. For example, residual plots can identify heterogeneity in predictions and suggest particular data points to analyze, such as outliers in the predictions, or examples which had the largest prediction errors. Moreover, these plots can be used to analyze trends across the predictions. For instance, in the example below, influence functions are able to efficiently identify mislabeled data points.

B. Prediction-Level Interpretation. Prediction-level approaches are useful when a practitioner is interested in understanding how individual predictions are made by a model. Note that prediction-level approaches can sometimes be aggregated to yield dataset-level insights.

B.1. Feature importance scores. The most popular approach to prediction-level interpretation has involved assigning importance scores to individual features. Intuitively, a variable with a large positive (negative) score made a highly positive (negative) contribution to a particular prediction. In the deep learning literature, a number of different approaches have been proposed to address this problem (71–78), with some methods for other models as well (79). These are often displayed in the form of a heatmap highlighting important features. Note that feature importance scores at the prediction level can offer much more information than feature importance scores at the dataset level. This is a result of heterogeneity in a nonlinear model: The impor-

ing), with each bar corresponding to a feature provided to the classifier, and the y axis displaying the importance score for that feature. In this instance, the race feature is the largest value, indicating that the classifier is indeed discriminating based on race. Thus, in this instance, prediction-level feature importance scores can identify that a model is unfairly discriminating based on race.

B.2. Alternatives to feature importances. While feature importance scores can provide useful insights, they also have a number of limitations (80, 82). For instance, they are unable to capture when algorithms learn interactions between variables. There is currently an evolving body of work centered around uncovering and addressing these limitations. These methods focus on explicitly capturing and displaying the interactions learned by a neural network (83, 84), alternative forms of interpretations such as textual explanations (85), influential data points (86), and analyzing nearest neighbors (87, 88).

7. Future Work

Having introduced the PDR framework for defining and discussing interpretable machine learning, we now leverage it to frame what we feel are the field's most important challenges moving forward. Below, we present open problems tied to each of this paper's 3 main sections: interpretation desiderata (Section 4), model-based interpretability (Section 5), and post hoc interpretability (Section 6).

A. Measuring Interpretation Desiderata. Currently, there is no clear consensus in the community around how to evaluate interpretation methods, although some recent works have begun to address it (12–14). As a result, the standard of evaluation varies considerably across different works, making it challenging both for researchers in the field to measure progress and for prospective users to select suitable methods. Within the PDR framework, to constitute an improvement, an interpretation method must improve at least one desideratum (predictive accuracy, descriptive accuracy, or relevancy) without unduly harming the others. While improvements in predictive accuracy are easy to measure, measuring improvements in descriptive accuracy and relevancy remains a challenge.

A.1. Measuring descriptive accuracy. One way to measure an improvement to an interpretation method is to demonstrate that its output better captures what the ML model has learned, i.e., its descriptive accuracy. However, unlike predictive accuracy,
descriptive accuracy is generally very challenging to measure or quantify (82). As a fallback, researchers often show individual, cherry-picked, interpretations which seem "reasonable." These kinds of evaluations are limited and unfalsifiable. In particular, these results are limited to the few examples shown and not generally applicable to the entire dataset.

While the community has not settled on a standard evaluation protocol, there are some promising directions. In particular, the use of simulation studies presents a partial solution. In this setting, a researcher defines a simple generative process, generates a large amount of data from that process, and trains the ML model on those data. Assuming a proper simulation setup, a sufficiently powerful model to recover the generative process, and sufficiently large training data, the trained model should achieve near-perfect generalization accuracy. To compute an evaluation metric, the researcher can then check whether the interpretations recover aspects of the original generative process. For example, refs. 59 and 89 train neural networks on a suite of generative models with certain built-in interactions and test whether their method successfully recovers them. Here, due to the ML model's near-perfect generalization accuracy, we know that the model is likely to have recovered some aspects of the generative process, thus providing a ground truth against which to evaluate interpretations. In a related approach, when an underlying scientific problem has been previously studied, prior experimental findings can serve as a partial ground truth to retrospectively validate interpretations (21).

A.2. Demonstrating relevancy to real-world problems. Another angle for developing improved interpretation methods is to improve the relevancy of interpretations for some audience or problem. This is normally done by introducing a novel form of output, such as feature heatmaps (71), rationales (90), or feature hierarchies (84), or identifying important elements in the training set (86). A common pitfall in the current literature is to focus on the novel output, ignoring what real-world problems it can actually solve. Given the abundance of possible interpretations, it is particularly easy for researchers to propose novel methods which do not actually solve any real-world problems.

There have been 2 dominant approaches for demonstrating improved relevancy. The first, and strongest, is to directly use the introduced method in solving a domain problem. For instance, in one example discussed above (21), the authors evaluated a new interpretation method (iterative random forests) by demonstrating that it could be used to identify meaningful biological Boolean interactions for use in experiments. In instances like this, where the interpretations are used directly to solve a domain problem, their relevancy is indisputable. A second, less direct, approach is the use of human studies, often through services like Amazon's Mechanical Turk. Here, humans are asked to perform certain tasks, such as evaluating how much they trust a model's predictions (84). While challenging to properly construct and perform, these studies are vital to demonstrate that new interpretation methods are, in fact, relevant to any potential practitioners. However, one shortcoming of this approach is that it is only possible to use a general audience of AMT crowdsourced workers, rather than a more relevant, domain-specific audience.

B. Model Based. Now that we have discussed the general problem of evaluating interpretations, we highlight important challenges for the 2 main subfields of interpretable machine learning: model-based and post hoc interpretability. Whenever model-based interpretability can achieve reasonable predictive accuracy and relevancy, by virtue of its high descriptive accuracy it is preferable to fitting a more complex model and relying upon post hoc interpretability. Thus, the main focus for model-based interpretability is increasing its range of possible use cases by increasing its predictive accuracy through more accurate models and transparent feature engineering. It is worth noting that sometimes a combination of model-based and post hoc interpretations is ideal.

B.1. Building accurate and interpretable models. In many instances, model-based interpretability methods fail to achieve a reasonable predictive accuracy. In these cases, practitioners are forced to abandon model-based interpretations in search of more accurate models. Thus, an effective way of increasing the potential uses for model-based interpretability is to devise new modeling methods which produce higher predictive accuracy while maintaining their high descriptive accuracy and relevance. Promising examples of this work include the previously discussed examples on estimating pneumonia risk from patient data (7) and Bayesian models for generating rule lists to estimate a patient's risk of stroke (40). Detailed directions for this work are suggested in ref. 91.

B.2. Tools for feature engineering. When we have more informative and meaningful features, we can use simpler modeling methods to achieve a comparable predictive accuracy. Thus, methods that can produce more useful features broaden the potential uses of model-based interpretations. The first main category of work lies in improved tools for exploratory data analysis. By better enabling researchers to interact with and understand their data, these tools (combined with domain knowledge) provide increased opportunities for them to identify helpful features. Examples include interactive environments (92–94), tools for visualization (95–97), and data exploration tools (98, 99). The second category falls under unsupervised learning, which is often used as a tool for automatically finding relevant structure in data. Improvements in unsupervised techniques such as clustering and matrix factorization could lead to more useful features.

C. Post Hoc. In contrast to model-based interpretability, much of post hoc interpretability is relatively new, with many foundational concepts still unclear. In particular, we feel that 2 of the most important questions to be answered are what an interpretation of an ML model should look like and how post hoc interpretations can be used to increase a model's predictive accuracy. It has also been emphasized that in high-stakes decisions practitioners should be very careful when applying post hoc methods with unknown descriptive accuracy (91).

C.1. What an interpretation of a black box should look like. Given a black-box predictor and real-world problem, it is generally unclear what format, or combination of formats, is best to fully capture a model's behavior. Researchers have proposed a variety of interpretation forms, including feature heatmaps (71), feature hierarchies (84), and identifying important elements in the training set (86). However, in all instances there is a gap between the simple information provided by these interpretations and what the model has actually learned. Moreover, it is unclear whether any of the current interpretation forms can fully capture a model's behavior or whether a new format altogether is needed. How to close that gap, while producing outputs relevant to a particular audience/problem, is an open problem.

C.2. Using interpretations to improve predictive accuracy. In some instances, post hoc interpretations uncover that a model has learned relationships a practitioner knows to be incorrect. For instance, prior interpretation work has shown that a binary husky vs. wolf classifier simply learns to identify whether there is snow in the image, ignoring the animals themselves (77). A natural question to ask is whether it is possible for the practitioner to correct these relationships learned by the model and consequently increase its predictive accuracy. Given the challenges surrounding simply generating post hoc interpretations, research on their uses has been limited (100, 101), particularly in modern deep learning models. However, as the field of post hoc interpretations continues to mature, this could be an exciting
1. G. Litjens et al., A survey on deep learning in medical image analysis. Med. Image Anal. 42, 60–88 (2017).
2. T. Brennan, W. L. Oliver, The emergence of machine learning techniques in criminology. Criminol. Public Policy 12, 551–562 (2013).
3. C. Angermueller, T. Pärnamaa, L. Parts, O. Stegle, Deep learning for computational biology. Mol. Syst. Biol. 12, 878 (2016).
4. M. A. T. Vu et al., A shared vision for machine learning in neuroscience. J. Neurosci. 38, 1601–1607 (2018).
5. B. Goodman, S. Flaxman, European Union regulations on algorithmic decision-making and a "right to explanation". arXiv:1606.08813 (31 August 2016).
6. C. Dwork, M. Hardt, T. Pitassi, O. Reingold, R. Zemel, "Fairness through awareness" in Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, S. Goldwasser, Ed. (ACM, New York, NY, 2012), pp. 214–226.
7. R. Caruana et al., "Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission" in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, L. Cao, C. Zhang, Eds. (ACM, New York, NY, 2015), pp. 1721–1730.
8. S. Chakraborty et al., "Interpretability of deep learning models: A survey of results" in Interpretability of Deep Learning Models: A Survey of Results, D. El Baz, J. Gao, R. Grymes, Eds. (IEEE, San Francisco, CA, 2017).
9. R. Guidotti, A. Monreale, F. Turini, D. Pedreschi, F. Giannotti, A survey of methods for explaining black box models. arXiv:1802.01933 (21 June 2018).
10. S. M. Lundberg, S. I. Lee, "A unified approach to interpreting model predictions" in Advances in Neural Information Processing Systems, T. Sejnowski, Ed. (Neural Information Processing Systems, 2017), pp. 4768–4777.
11. M. Ancona, E. Ceolini, C. Oztireli, M. Gross, "Towards better understanding of gradient-based attribution methods for deep neural networks" in 6th International Conference on Learning Representations, A. Rush, Ed. (ICLR, 2018).
12. F. Doshi-Velez, B. Kim, A roadmap for a rigorous science of interpretability. arXiv:1702.08608 (2 March 2017).
13. L. H. Gilpin et al., Explaining explanations: An approach to evaluating interpretability of machine learning. arXiv:1806.00069 (3 February 2019).
14. Z. C. Lipton, The mythos of model interpretability. arXiv:1606.03490 (6 March 2017).
15. M. Hardt, E. Price, N. Srebro, "Equality of opportunity in supervised learning" in Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, Eds. (Neural Information Processing Systems, 2016), pp. 3315–3323.
16. D. Boyd, K. Crawford, Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Inf. Commun. Soc. 15, 662–679 (2012).
17. A. Datta, S. Sen, Y. Zick, "Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems" in 2016 IEEE Symposium on Security and Privacy (SP), M. Locasto, Ed. (IEEE, San Jose, CA, 2016), pp. 598–617.
18. F. C. Keil, Explanation and understanding. Annu. Rev. Psychol. 57, 227–254 (2006).
19. T. Lombrozo, The structure and function of explanations. Trends Cogn. Sci. 10, 464–470 (2006).
20. G. W. Imbens, D. B. Rubin, Causal Inference in Statistics, Social, and Biomedical Sciences (Cambridge University Press, 2015).
21. S. Basu, K. Kumbier, J. B. Brown, B. Yu, Iterative random forests to discover predictive and stable high-order interactions. Proc. Natl. Acad. Sci. U.S.A. 115, 1943–1948 (2018).
22. B. Yu, Stability. Bernoulli 19, 1484–1500 (2013).
23. F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, W. A. Stahel, Robust Statistics: The Approach Based on Influence Functions (John Wiley & Sons, 2011), vol. 196.
24. H. Pimentel, Z. Hu, H. Huang, Biclustering by sparse canonical correlation analysis. Quant. Biol. 6, 56–67 (2018).
25. R. Abbasi-Asl et al., The DeepTune framework for modeling and characterizing neurons in visual cortex area V4. bioRxiv p. 465534 (9 November 2018).
26. A. W. Roe et al., Toward a unified theory of visual area v4. Neuron 74, 12–29 (2012).
27. C. L. Huang, M. C. Chen, C. J. Wang, Credit scoring with a data mining approach based on support vector machines. Expert Syst. Appl. 33, 847–856 (2007).
28. G. E. Box, Science and statistics. J. Am. Stat. Assoc. 71, 791–799 (1976).
29. L. Breiman, Statistical modeling: The two cultures (with comments and a rejoinder by the author). Stat. Sci. 16, 199–231 (2001).
30. D. A. Freedman, Statistical models and shoe leather. Sociol. Methodol. 21, 291–313 (1991).
31. C. Lim, B. Yu, Estimation stability with cross-validation (ESCV). J. Comput. Graph. Stat. 25, 464–492 (2016).
32. R. Tibshirani, Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58, 267–288 (1996).
33. B. A. Olshausen, D. J. Field, Sparse coding with an overcomplete basis set: A strategy employed by V1? Vis. Res. 37, 3311–3325 (1997).
34. H. Akaike, "Factor analysis and AIC" in Selected Papers of Hirotugu Akaike (Springer, 1987), pp. 371–386.
35. K. P. Burnham, D. R. Anderson, Multimodel inference: Understanding AIC and BIC in model selection. Sociol. Methods Res. 33, 261–304 (2004).
36. Y. C. Pati, R. Rezaiifar, P. S. Krishnaprasad, "Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition" in Proceedings of the 27th Asilomar Conference on Signals, Systems & Computers, F. Harris, Ed. (IEEE, Pacific Grove, CA, 1993), pp. 40–44.
37. D. Amaratunga, J. Cabrera, Y. S. Lee, Enriched random forests. Bioinformatics 24, 2010–2014 (2008).
38. L. Breiman, J. Friedman, R. Olshen, C. J. Stone, Classification and Regression Trees (Chapman and Hall, 1984).
39. J. H. Friedman, B. E. Popescu, Predictive learning via rule ensembles. Ann. Appl. Stat. 2, 916–954 (2008).
40. B. Letham, C. Rudin, T. H. McCormick, D. Madigan, Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. Ann. Appl. Stat. 9, 1350–1371 (2015).
41. T. Hastie, R. Tibshirani, Generalized additive models. Stat. Sci. 1, 297–318 (1986).
42. J. Kim, J. F. Canny, "Interpretable learning for self-driving cars by visualizing causal attention" in ICCV, K. Ikeuchi, G. Medioni, M. Pelillo, Eds. (IEEE, 2017), pp. 2961–2969.
43. J. Andreas, M. Rohrbach, T. Darrell, D. Klein, "Neural module networks" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, R. Bajcsy, F. Li, T. Tuytelaars, Eds. (IEEE, 2016), pp. 39–48.
44. D. Koller, N. Friedman, F. Bach, Probabilistic Graphical Models: Principles and Techniques (MIT Press, 2009).
45. J. Ramos, "Using tf-idf to determine word relevance in document queries" in Proceedings of the First Instructional Conference on Machine Learning, T. Fawcett, N. Mishra, Eds. (ICML, 2003), vol. 242, pp. 133–142.
46. T. Shi, B. Yu, E. E. Clothiaux, A. J. Braverman, Daytime arctic cloud detection based on multi-angle satellite data with case studies. J. Am. Stat. Assoc. 103, 584–593 (2008).
47. I. Jolliffe, Principal Component Analysis (Springer, 1986).
48. A. J. Bell, T. J. Sejnowski, An information-maximization approach to blind separation and blind deconvolution. Neural Comput. 7, 1129–1159 (1995).
49. H. Hotelling, Relations between two sets of variates. Biometrika 28, 321–377 (1936).
50. S. Wu et al., Stability-driven nonnegative matrix factorization to interpret spatial gene expression and build local gene networks. Proc. Natl. Acad. Sci. U.S.A. 113, 4290–4295 (2016).
51. M. Craven, J. W. Shavlik, "Extracting tree-structured representations of trained networks" in Advances in Neural Information Processing Systems, T. Petsche, Ed. (Neural Information Processing Systems, 1996), pp. 24–30.
52. N. Frosst, G. Hinton, Distilling a neural network into a soft decision tree. arXiv:1711.09784 (27 November 2017).
53. J. D. Olden, M. K. Joy, R. G. Death, An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data. Ecol. Model. 178, 389–397 (2004).
54. L. Breiman, Random forests. Mach. Learn. 45, 5–32 (2001).
55. C. Strobl, A. L. Boulesteix, T. Kneib, T. Augustin, A. Zeileis, Conditional variable importance for random forests. BMC Bioinf. 9, 307 (2008).
56. A. Altmann, L. Toloşi, O. Sander, T. Lengauer, Permutation importance: A corrected feature importance measure. Bioinformatics 26, 1340–1347 (2010).
57. K. Kumbier, S. Basu, J. B. Brown, S. Celniker, B. Yu, Refining interaction search through signed iterative random forests. arXiv:1810.07287 (16 October 2018).
58. S. Devlin, C. Singh, W. J. Murdoch, B. Yu, Disentangled attribution curves for interpreting random forests and boosted trees. arXiv:1905.07631 (18 May 2019).
59. M. Tsang, D. Cheng, Y. Liu, Detecting statistical interactions from neural network weights. arXiv:1705.04977 (27 February 2018).
60. R. Abbasi-Asl, B. Yu, Structural compression of convolutional neural networks based on greedy filter pruning. arXiv:1705.07356 (21 July 2017).
61. Office of Institutional Research HU, Exhibit 157: Demographics of Harvard college applicants. http://samv91khoyt2i553a2t1s05i-wpengine.netdna-ssl.com/wp-content/uploads/2018/06/Doc-421-157-May-30-2013-Report.pdf (2018), pp. 8–9.
62. P. S. Arcidiacono, Exhibit A: Expert report of Peter S. Arcidiacono. http://samv91khoyt2i553a2t1s05i-wpengine.netdna-ssl.com/wp-content/uploads/2018/06/Doc-415-1-Arcidiacono-Expert-Report.pdf (2018).
63. D. Card, Exhibit 33: Report of David Card. https://projects.iq.harvard.edu/files/diverse-education/files/legal_-_card_report_revised_filing.pdf (2018).
64. M. D. Zeiler, R. Fergus, "Visualizing and understanding convolutional networks" in European Conference on Computer Vision, D. Fleet, T. Padjla, B. Schiele, T. Tuytelaars, Eds. (Springer, Zurich, Switzerland, 2014), pp. 818–833.
65. C. Olah, A. Mordvintsev, L. Schubert, Feature visualization. Distill 2, e7 (2017).
66. A. Mordvintsev, C. Olah, M. Tyka, Deepdream-a code example for visualizing neural networks. Google Res. 2, 5 (2015).
67. D. Wei, B. Zhou, A. Torrabla, W. Freeman, Understanding intra-class knowledge inside CNN. arXiv:1507.02379 (21 July 2015).
68. Q. Zhang, R. Cao, F. Shi, Y. N. Wu, S. C. Zhu, Interpreting CNN knowledge via an explanatory graph. arXiv:1708.01785 (2017).
69. A. Karpathy, J. Johnson, L. Fei-Fei, Visualizing and understanding recurrent networks. arXiv:1506.02078 (17 November 2015).
70. H. Strobelt, S. Gehrmann, B. Huber, H. Pfister, A. M. Rush, Visual analysis of hidden state dynamics in recurrent neural networks. arXiv:1606.07461v1 (23 June 2016).
71. M. Sundararajan, A. Taly, Q. Yan, "Axiomatic attribution for deep networks" in ICML, T. Jebara, Ed. (ICML, 2017).
72. R. R. Selvaraju et al., Grad-cam: Visual explanations from deep networks via gradient-based localization. https://arxiv.org/abs/1610.02391 v3 7(8). Accessed 7 December 2018.
73. D. Baehrens et al., How to explain individual classification decisions. J. Mach. Learn. Res. 11, 1803–1831 (2010).
74. A. Shrikumar, P. Greenside, A. Shcherbina, A. Kundaje, Not just a black box: Learning important features through propagating activation differences. arXiv:1605.01713 (11 April 2017).
75. W. J. Murdoch, A. Szlam, Automatic rule extraction from long short term memory networks. arXiv:1702.02540 (24 February 2017).
76. P. Dabkowski, Y. Gal, Real time image saliency for black box classifiers. arXiv:1705.07857 (22 May 2017).
77. M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?: Explaining the predictions of any classifier" in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, B. Krishnapuram, M. Shah, Eds. (ACM, New York, NY, 2016), pp. 1135–1144.
78. L. M. Zintgraf, T. S. Cohen, T. Adel, M. Welling, Visualizing deep neural network decisions: Prediction difference analysis. arXiv:1702.04595 (15 February 2017).
79. S. M. Lundberg, G. G. Erion, S. I. Lee, Consistent individualized feature attribution for tree ensembles. arXiv:1802.03888 (7 March 2019).
80. J. Adebayo et al., "Sanity checks for saliency maps" in Advances in Neural Information Processing Systems, T. Sejnowski, Ed. (Neural Information Processing Systems, 2018), pp. 9505–9515.
81. W. Nie, Y. Zhang, A. Patel, A theoretical explanation for perplexing behaviors of backpropagation-based visualizations. arXiv:1805.07039 (8 June 2018).
82. G. Hooker, Generalized functional ANOVA diagnostics for high-dimensional functions of dependent variables. J. Comput. Graph. Stat. 16, 709–732 (2007).
83. W. J. Murdoch, P. J. Liu, B. Yu, "Beyond word importance: Contextual decomposition to extract interactions from LSTMs" in ICLR, A. Rush, Ed. (ICLR, 2018).
84. C. Singh, W. J. Murdoch, B. Yu, "Hierarchical interpretations for neural network predictions" in ICLR, D. Song, K. Cho, M. White, Eds. (ICLR, 2019).
85. A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, B. Schiele, "Grounding of textual phrases in images by reconstruction" in European Conference on Computer Vision, H. Bischof, D. Cremers, B. Schiele, R. Zabih, Eds. (Springer, New York, NY, 2016).
86. P. W. Koh, P. Liang, Understanding black-box predictions via influence functions. arXiv:1703.04730 (10 July 2017).
87. R. Caruana, H. Kangarloo, J. Dionisio, U. Sinha, D. Johnson, "Case-based explanation of non-case-based learning methods" in Proceedings of the AMIA Symposium (American Medical Informatics Association, Bethesda, MD, 1999), p. 212.
88. N. Papernot, P. McDaniel, Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. arXiv:1803.04765 (13 March 2018).
89. M. Tsang, Y. Sun, D. Ren, Y. Liu, Can I trust you more? Model-agnostic hierarchical explanations. arXiv:1812.04801 (12 December 2018).
90. T. Lei, R. Barzilay, T. Jaakkola, Rationalizing neural predictions. arXiv:1606.04155 (2 November 2016).
91. C. Rudin, Please stop explaining black box models for high stakes decisions. arXiv:1811.10154 (22 September 2019).
92. T. Kluyver et al., "Jupyter notebooks-a publishing format for reproducible computational workflows" in ELPUB (ePrints Soton, 2016), pp. 87–90.
93. F. Pérez, B. E. Granger, Ipython: A system for interactive scientific computing. Comput. Sci. Eng. 9, 21–29 (2007).
94. RStudio Team, RStudio: Integrated Development Environment for R (RStudio, Inc., Boston, MA, 2016).
95. R. Barter, B. Yu, Superheat: Supervised heatmaps for visualizing complex data. arXiv:1512.01524 (26 January 2017).
96. H. Wickham, ggplot2: Elegant Graphics for Data Analysis (Springer, 2016).
97. M. Waskom et al., Seaborn: Statistical data visualization. https://seaborn.pydata.org (2014). Accessed 15 May 2017.
98. W. McKinney et al., "Data structures for statistical computing in python" in Proceedings of the 9th Python in Science Conference (SciPy, Austin, TX, 2010), vol. 445, pp. 51–56.
99. H. Wickham, tidyverse: Easily install and load the 'tidyverse' (Version 1.2.1, CRAN, 2017).
100. A. S. Ross, M. C. Hughes, F. Doshi-Velez, Right for the right reasons: Training differentiable models by constraining their explanations. arXiv:1703.03717 (25 May 2017).
101. O. Zaidan, J. Eisner, C. Piatko, "Using "annotator rationales" to improve machine learning for text categorization" in Proceedings of NAACL HLT, C. Sidner, Ed. (ACL, 2007), pp. 260–267.