
Computational Materials Science 193 (2021) 110360

Contents lists available at ScienceDirect

Computational Materials Science


journal homepage: www.elsevier.com/locate/commatsci

Machine learning in materials science: From explainable predictions to autonomous design
Ghanshyam Pilania
Materials Science and Technology Division, Los Alamos National Laboratory, Los Alamos, NM 87544, USA

Keywords: Materials informatics; Statistical learning; Physics-informed learning; Domain-specific learning; Materials discovery

Abstract

The advent of big data and algorithmic developments in the field of machine learning (and artificial intelligence in general) have greatly impacted the entire spectrum of the physical sciences, including materials science. Materials data, measured or computed, combined with various techniques of machine learning have been employed to address a myriad of challenging problems, such as the development of efficient and predictive surrogate models for a range of materials properties, screening and down-selection of novel candidate materials for targeted applications, and new methodologies to improve and further expedite molecular and atomistic simulations, with likely many more important developments to come in the foreseeable future. While the applications thus far have provided a glimpse of the true potential data-enabled routes have to offer, it has also become clear that further progress in this direction hinges on our ability to understand, explain and rationalize the findings of a machine learning model in light of the domain knowledge. This focused review provides an overview of the main areas where machine learning has been widely and successfully used in materials science. Subsequently, a brief discussion of several techniques that have been helpful in extracting physically-meaningful insights, causal relationships and design-centric knowledge from materials data is provided. Finally, we identify some of the imminent opportunities and challenges that the materials community faces in this exciting and rapidly growing field.

1. Introduction

We are living in the age of "big data" and information. The amount of data generated, shared, processed and stored around the planet on a daily basis is unprecedented. The summation of all this data is collectively called the global datasphere. Based on the latest estimates, it is predicted that the global datasphere will grow exponentially to 175 zettabytes by 2025 (a zettabyte is 10^21 bytes) [1]. The sheer volume, production speed and heterogeneous nature of this data naturally demand new and efficient methods of analysis to unearth hidden patterns, trends and insights in this vast sea of information. This ever-growing need to analyze and make sense of big data has been a primary driving force behind the state-of-the-art machine learning (ML) methods and algorithms.

ML broadly refers to the use of algorithms and computer systems that can learn to perform a task given just the relevant data, do not require any explicit programming specific to the task, and get better with experience (i.e., the available past data). ML models can be supervised, semi-supervised or unsupervised, depending on the type of available training data. In supervised learning, the training data consist of sets of input and associated output values; in other words, labelled training samples are required. On the other hand, if the training dataset contains unlabeled samples, unsupervised learning can be used in order to identify trends and patterns in the data. Semi-supervised learning can be used for large datasets with partially missing labels. In the past decade, tools and applications built on ML have found widespread use in areas as diverse as transportation, communications, healthcare, business intelligence and strategy, social networking, and industrial research [2]. This paradigm shift has been brought about by a confluence of sustained growth in computing power, the aforementioned data revolution, multiple algorithmic breakthroughs, and the design and deployment of hardware customized to boost the performance of ML algorithms. Moreover, a self-reinforcing and synergistic growth of computing, data, algorithms, and co-designed software and hardware (the components forming the ML ecosystem) has further helped expedite the pace of development in each individual area, with each benefitting from and driven by the progress made in the other fields of the ecosystem.

The resounding success of big data and ML methods in tasks such as image and speech recognition [3,4] and language translation [5,6], as well as the superhuman performance achieved by artificial intelligence (AI) based algorithms in games of chess [7], Go [8,9], poker [10,11] and Jeopardy [12], has also been reflected in their widespread adoption in the physical sciences. More specifically, in materials science and related fields the use of ML-based methods has led to several developments pertaining to the design and development of new materials and a better understanding of the existing ones. Novel ML-based routes for mapping

https://doi.org/10.1016/j.commatsci.2021.110360
Received 1 November 2020; Received in revised form 30 January 2021; Accepted 1 February 2021
Available online 10 March 2021
0927-0256/© 2021 Elsevier B.V. All rights reserved.
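The supervised/unsupervised distinction drawn in the introduction can be made concrete in a few lines: a least-squares fit to labelled (input, output) pairs stands in for supervised learning, and a two-cluster k-means on unlabelled points stands in for unsupervised learning. All data below are synthetic and purely illustrative; the sketch is not drawn from any work cited in this review.

```python
import numpy as np

rng = np.random.default_rng(4)

# --- Supervised learning: labelled (input, output) pairs ------------
# Fit a least-squares line to toy labelled data generated from
# y = 2x + 1 plus noise; the labels y drive the learning.
x = rng.uniform(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=50)
A = np.column_stack([x, np.ones_like(x)])
slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]

# --- Unsupervised learning: unlabelled samples ----------------------
# Two-cluster k-means recovers structure without any labels at all.
pts = np.vstack([rng.normal(0.0, 0.3, (30, 2)),
                 rng.normal(3.0, 0.3, (30, 2))])
centers = pts[[0, -1]].copy()                 # crude initialization
for _ in range(20):
    d = np.linalg.norm(pts[:, None] - centers[None], axis=2)
    labels = d.argmin(axis=1)                 # assign to nearest center
    centers = np.array([pts[labels == k].mean(axis=0) for k in (0, 1)])

print(f"fitted slope={slope:.2f}, intercept={intercept:.2f}")
print("cluster centers:", centers.round(1))
```

Semi-supervised methods, mentioned above, sit between these two extremes: they use a small labelled subset together with the structure of the unlabelled bulk.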

potential energy surfaces and forcefields have allowed for atomistic simulations of molecules and solids reaching beyond the tradeoffs of accuracy, speed, and time- and length-scales possible within traditional molecular dynamics simulations. More recently, the focus has been on using active learning (or adaptive design) to enable autonomous, robot-assisted development of functional materials with prespecified properties. Given a target chemical space and constraints related to available resources, development time and a property wish list, these efforts have focused on harnessing the power of optimal learning concepts within the context of efficient experimental design. Finally, going beyond supervised learning, the use of natural language processing techniques to automatically extract and synthesize materials science knowledge present in the published literature, in the form of information-dense word embeddings that capture complex materials science concepts, remains a very exciting and potentially transformative new area of research.

A wide range of ML algorithms, vastly varying in terms of complexity and transparency, have been employed for data-enabled materials design. On one side of the spectrum lie, for instance, tree-based classification and regression methods that are completely transparent when it comes to explaining the model predictions. The other extreme is occupied by deep neural networks and ensemble-based methods, which allow for little insight into the inner workings of the models leading to the final results. Since a vast majority of studies in the field of materials informatics have focused on developing ML-based surrogate models of structure-property-processing relationships, the primary emphasis has always been on achieving a predictive accuracy as high as possible. This quest for improved performance naturally creates a bias toward employing more complex, and therefore less transparent and poorly-explainable, models. However, eventually an integration of the knowledge mined from (and assimilation of the discoveries made with) statistical pattern recognition techniques back into materials science demands a deeper analysis and a better understanding of the findings in light of the domain knowledge. To expedite the pace of progress and the potential impact ML methods can bear on materials development, AI algorithms must be tasked with generating understanding that explains the obtained results, as we now uniquely task human intuition. The need for explainable models in the hard sciences, including materials science, has recently led to a surge of activity in the field of explainable AI (frequently referred to as XAI) [13,14].

In this contribution, after reviewing various recent applications of ML in materials science and related fields, we focus on a selected set of ML tools and techniques that have been employed in the past both to automatically extract physically meaningful knowledge from the data and to better rationalize the existence of causal relationships in the identified patterns. Using selected examples from recent studies, we emphasize that integration of relevant domain knowledge is a crucial step in devising an ML strategy and that this becomes even more critical when dealing with small training datasets. Finally, we identify and discuss key challenges and opportunities faced by the potent and quickly growing field of ML-enabled materials design. Throughout the review, it is assumed that the reader is familiar with the basic nomenclature and standard methods of ML applied to the field of materials informatics. A familiarity with the best practices of statistical learning is also assumed; these topics are therefore not covered here but can be found elsewhere [15–17].

2. Use of machine learning in materials science

The use of ML to address design and development challenges in materials science and other related fields is an actively growing area, which has seen rapid growth in the last decade. The amount of scientific research published in this field has exhibited a sustained exponential growth since 2014, with the number of contributions approximately doubling every one and a half years [15]. Therefore, an exhaustive survey of the entire spectrum of this research is beyond the scope of this review; however, we refer interested readers to a number of excellent reviews where a significant portion of these recent developments and applications have been covered and discussed [15,22,16,23–26]. Below we present a selected set of key areas where informatics-based methods have proved particularly promising and widely applicable. These are also depicted graphically in Fig. 1. By choosing specific examples in each of these areas, we highlight how ML is helping to progress the field by reducing barriers in materials design via addressing challenges related to materials modeling, synthesis and characterization.

2.1. Efficient and predictive surrogate models

A vast majority of recent work in the field falls within this category, where ML-based surrogate models provide an alternative data-enabled route to establish desired processing-structure-property-performance linkages within the target chemical space. Relying on easily accessible and carefully devised numerical representations, frequently referred to as features or descriptors, ML algorithms are used to develop validated mappings that connect problem-relevant aspects of materials' composition, structure, morphology, processing, etc. to the target property or performance criteria, while largely bypassing traditional time- and resource-intensive experimental and computational routes (see Fig. 1a). The selection of an appropriate descriptor is one of the most crucial aspects of the entire surrogate model building exercise, and often relies heavily on domain-specific expertise. Further, best practices of statistical learning, such as appropriate and unbiased selection of training data representative of the underlying true data distribution, use of cross-validation for model hyperparameter selection, and testing on unseen data, are required to ensure a truly predictive optimal learning model. Once developed, validated and rigorously tested to be predictive within a given domain of applicability, the true value of such models lies in their remarkable speed compared to the traditional property prediction or measurement routes. As a result, ML-based surrogate models are particularly well suited for high throughput screening efforts where one aims to identify molecules or compounds with one or more properties in a pre-specified range. If a subset of properties exhibits conflicting trends or inverse relationships, looking for "optimal" compounds corresponds to finding chemistries falling on or near the underlying Pareto front, which provides the best achievable tradeoffs among the conflicting responses.

ML algorithms have been employed to identify potential nonlinear multivariate relationships for a wide variety of materials classes spanning metals and alloys, ceramics and composites, polymers, two-dimensional materials, organic-inorganic hybrids and multicomponent heteroanionic compounds [23,27,28]. The applications cover varied length scales, e.g., electronic-, atomic- and meso-scales [16]. Successful attempts at materials property prediction using ML include estimation of energetics [29–33], phase stability and cation/anion ordering [34–40], defect energetics [41–44], bandgaps [45–49], melting and glass transition temperatures [38,50,51], mechanical and elastic properties [52–54], thermal conductivity [55], dielectric properties [56,57], tendency for crystallization [58], catalytic activity [59,60] and radiation damage resistance [61].

2.2. Materials design and discovery

Further building on their primary strength of allowing fast yet accurate predictions of materials properties, ML-based surrogate models can be employed in various ways to enable materials design and discovery. In the most straightforward approach, a developed model can be used to make predictions on the entire set of combinatorially-enumerated compounds falling within the domain of applicability of the model. Even more excitingly, multiple property prediction models can be integrated as part of a hierarchical down-selection pipeline to screen materials based on increasingly complex and stringent criteria employed at each of the subsequent stages [62–64]. Another approach is to "invert" the forward materials-to-properties prediction route via


Fig. 1. Key application areas of ML in materials science are highlighted and further discussed in the text. (a) Surrogate model development for efficient materials
property predictions. (b) Iterative framework for adaptive design and active learning. Adapted from Ref. [16]. (c) Generative materials design using variational
autoencoders (VAEs) and generative adversarial networks (GANs). Adapted from Ref. [18]. (d) ML-enabled autonomous materials synthesis via combining design of
experiment algorithms with automated robotic platforms. Adapted from Ref. [19]. (e) Use of ML-based force fields to address a range of atomistic materials
simulation problems. (f) Deep learning for accurate characterization of atomic-scale materials imaging data. Adapted from Ref. [20]. (g) Use of natural language
processing and ML to automatically extract scientific knowledge and insights from scientific texts. Adapted from Ref. [21].
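As a deliberately simplified sketch of the surrogate-modeling workflow of Section 2.1 and Fig. 1a, the snippet below fits a kernel ridge regression model mapping hypothetical three-component descriptor vectors to a synthetic target property. The descriptors, the target function, and the hyperparameters (gamma, lam) are illustrative assumptions only; in a real study the descriptors would come from domain knowledge and the hyperparameters from cross-validation, as discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical descriptor matrix: 100 "materials", 3 features each
# (e.g., averaged electronegativity, atomic radius, valence count).
X = rng.uniform(0.0, 1.0, size=(100, 3))
# Synthetic target standing in for a measured or computed property.
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=100)

# Hold out unseen data -- the only honest estimate of predictivity.
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

def rbf_kernel(A, B, gamma=5.0):
    """Gaussian (RBF) kernel between two sets of descriptor vectors."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

# Kernel ridge regression: solve (K + lam*I) alpha = y_train.
lam = 1e-3
K = rbf_kernel(X_train, X_train)
alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

# Predicting on new candidates costs only a kernel evaluation --
# the speed advantage that makes surrogate screening attractive.
y_pred = rbf_kernel(X_test, X_train) @ alpha
rmse = np.sqrt(np.mean((y_pred - y_test) ** 2))
print(f"test RMSE: {rmse:.3f}")
```

Once validated, the same trained model can score an arbitrarily large enumerated candidate set at negligible cost, which is what enables the high throughput screening and Pareto-front analyses described in Section 2.1.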

employing an optimization routine such as evolutionary algorithms, simulated annealing, minima-hopping, or swarm optimization-based routines [63,65]. Contrary to a direct brute-force enumeration approach that relies on virtual screening of candidate materials from a pre-defined set of possibilities, the optimization-based inversion route focuses on directly predicting a set of materials that satisfy certain pre-specified target objectives, leading to a more general approach to materials discovery. In addition to the enumeration, multi-step screening and optimization-based inversion routes, more sophisticated approaches are being explored by the community to further expedite materials development, as discussed below.

2.2.1. Active learning

The ML-based surrogate models discussed above can allow for a quick identification of candidates with tailored properties for further validation via either experimental synthesis or more elaborate domain-knowledge-based computations. However, such an approach is inherently passive and does not allow for any control over the prediction errors resulting from the size and quality of the training dataset. Therefore, given an ML model, the selection of candidates on which to perform the next experiments or computations, such that the generated data, when fed back into the current model, leads to the maximum expected improvement (measured in terms of either improving the model or identifying materials with properties falling within or close to the desired range), is a key challenge in achieving optimal experimental design. In the past decade, active-learning algorithms that exploit Bayesian optimization frameworks have been developed to effectively address this challenge [66–68].

As shown schematically in Fig. 1b, active learning adopts an iterative procedure where predictions using the current ML model are used to guide the data collection effort in a batch mode to further improve the model [69]. The approach relies heavily on the use of model predictions and uncertainties together with a judiciously selected acquisition or utility function that prioritizes the decision-making process on unseen data. More specifically, the adaptive design loop employs an ML model to achieve a target objective with the smallest possible number of measurements or computations. This is achieved by balancing the exploitation-exploration trade-off during the model development. At any given stage, one can perform the next computation/measurement on the candidate predicted to have the property closest to the desired value (i.e., model exploitation) or try to further improve the model by selecting a material where the predictions are worst in quality (i.e., with the largest predictive uncertainties). By choosing the latter, one allows for exploration of less-sampled portions of the design space, leading to an improved model with reduced uncertainties as well as an improved likelihood of meeting the objective upon exploitation. A number of recent materials design and discovery efforts have demonstrated the power and utility of active learning methods in applications ranging from the design of shape memory alloys with improved thermal hysteresis [70,71] to identifying a Pb-free piezoelectric material with the largest measured electrostrain [72], and from optimizing GaN light emitting diode structures [73] to finding high glass transition temperature polymers [74].

2.2.2. Generative design

In conventional screening and discovery efforts, including past active-learning based efforts, the exploration space is generally defined by a set of candidates that either already exist as part of a known database or can be systematically enumerated. In contrast, deep-learning-based generative models focus on building a continuous materials vector space, often referred to as the latent space. Once the information embedded in the materials training dataset is mapped onto the latent space, it can be used to generate new data points on demand. Furthermore, by building a parallel mapping between the latent space and a property of interest, new materials with their property in a desired target range can be generated to enable inverse design [18,75,76]. In this respect, generative models are a class of deep learning methods that
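The exploitation-exploration logic of the adaptive design loop (Fig. 1b) can be sketched with a toy Bayesian-optimization loop. The one-dimensional "design space", the Gaussian-process surrogate, and the upper-confidence-bound acquisition function below are illustrative choices made for brevity; they are not the specific algorithms of Refs. [66–69].

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for an expensive experiment or computation: a scalar
# "composition" knob x and a property we want to maximize.
def expensive_property(x):
    return np.sin(3.0 * x) * np.exp(-0.3 * x)

X_pool = np.linspace(0.0, 4.0, 200)                       # candidates
X_obs = list(rng.choice(X_pool, size=3, replace=False))   # seed data
y_obs = [expensive_property(x) for x in X_obs]

def rbf(a, b, ls=0.5):
    """Squared-exponential kernel between two 1-D coordinate arrays."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

for step in range(10):
    Xo, yo = np.array(X_obs), np.array(y_obs)
    # Gaussian-process posterior mean and variance over the pool.
    K = rbf(Xo, Xo) + 1e-6 * np.eye(len(Xo))
    Ks = rbf(X_pool, Xo)
    mean = Ks @ np.linalg.solve(K, yo)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    # Upper confidence bound: exploit (mean) plus explore (uncertainty).
    ucb = mean + 2.0 * np.sqrt(np.clip(var, 0.0, None))
    x_next = X_pool[np.argmax(ucb)]
    X_obs.append(x_next)                  # "run" the chosen experiment
    y_obs.append(expensive_property(x_next))

print(f"best property after {len(y_obs)} evaluations: {max(y_obs):.3f}")
```

The weight on the uncertainty term (here 2.0) sets the exploration-exploitation balance; in a real campaign `expensive_property` would be a synthesis run or a DFT calculation, and the kernel length scale would itself be learned from the accumulating data.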


seek to model the underlying probability distribution of both structure and property mapped over a non-linear latent space. The materials generated using these models can be very diverse and, in terms of the functionality they exhibit, considerably distinct from the known materials in the training data. This is because the underlying structure-property relationships are frequently nonlinear in nature for complex functional materials. As a result, the generative design approach presents a higher potential for discovery and novel materials design compared to conventional high throughput virtual screening efforts, which are typically limited by the existing materials databases [77].

As schematically illustrated in Fig. 1c, variational autoencoders (VAEs) [78–81] and generative adversarial networks (GANs) [82–86] have recently emerged as the two most popular methods in deep-learning-based generative modeling. A VAE setup consists of two deep neural networks, namely, the encoder and the decoder. The encoder nonlinearly projects the target chemical space onto a low-dimensional latent space, and the decoder implements the inverse mapping, allowing for generation of materials corresponding to specific regions in the latent space. In contrast, a GAN uses a pair of networks, the generator and the discriminator, to learn the underlying materials data distribution implicitly. The generator tries to emulate the real data distribution, while the discriminator is tasked with distinguishing the generated synthetic (or fake) data from the real data. The overall training process is built around the generator trying to maximize the probability of the discriminator making an error, while the discriminator gets better at catching the fake data.

While a number of exciting studies utilizing the generative power of VAEs and GANs to identify molecules with desired properties have recently been reported [75,82–84,87–91], applications of these methods to solids have been rather limited owing to the additional challenges associated with representing materials with periodic boundary conditions. Although a number of suitable representations built on compositional and configurational details or graph-based encodings exist for solids, and have been demonstrated to predict several key properties as discussed above in Section 2.1, most of these representations are not invertible. That is, given a representation, the composition and crystal structure details of the material cannot be uniquely identified. On the other hand, any successful domain-specific application requires that the features generated from the latent space be invertible back to a realistic crystal structure. To address this issue, 3-dimensional voxel image representations have been put forward with some success [76,92,93]. However, this route faces challenges associated with images not being translational-, rotational-, and supercell-invariant, as well as relatively poor efficiency due to the memory-intensive nature of the representations, leading to longer training times. More recently, a crystal representation inspired by the "point cloud" method [94–96] (where objects are considered as a set of points and vectors with three-dimensional coordinates) was suggested by Kim et al. [97] to represent the crystal structure as a set of atomic coordinates and cell parameters. The new representation was used with a GAN to generate and explore new crystal structures within the Mg-Mn-O ternary system to find a promising photoanode material for water splitting. Moreover, this inversion-free representation was shown to be more efficient by a factor of ∼400 compared to the previously reported image-based representations.

2.2.3. Autonomous synthesis

In the previous sections, we have discussed how ML-based surrogate modeling, active learning and deep-learning generative models are being used to expedite chemical space exploration and enable inverse design. The power of ML combined with automated robotic platforms has led to even more exciting opportunities in autonomous synthesis and self-driving laboratories [19,98]. Here it is important to note the distinction between automated and autonomous systems. While the former refers to robotic platforms that can handle repetitive tasks in a high throughput manner, the latter points specifically to intelligent systems that can adapt appropriately to new information, as and when it becomes available, with little human intervention. In this regard, when compared to automated systems, autonomous systems are very dynamic in nature and can adjust on-the-fly to available information in order to achieve optimal experimental design. As depicted graphically in Fig. 1d, the ability to employ ML algorithms as experiment planners that skip marginally informative experiments in favor of the most informative next experiments lies at the heart of the efficiency boost gained with an autonomous discovery process. These gains in experimental efficiency can be as high as an order of magnitude over conventional high throughput screening approaches [99].

Some of the early studies reporting autonomous materials synthesis targeted unsupervised growth of carbon nanotubes and production of Bose-Einstein condensates [100,101]. Since then, a number of other applications have been successfully demonstrated, including discovery of chemical reactions [102,103], crystallization of giant self-assembled polyoxometalate clusters [104], assembly of layered superlattices [105], synthesis of perovskite quantum dots with tuned bandgaps, quantum yield and composition polydispersity [106], and optimization of synthesis conditions for the formation of high-quality organic-inorganic hybrid halide perovskite single crystals [107]. In addition, open-source portable, modular and versatile software packages, such as ChemOS [108], are under active development to enable remote control of self-driving laboratories, provide access to distributed computing resources, and integrate cutting-edge ML methods in a seamless manner. Beyond the three core components, namely the automation hardware, compute resources and ML algorithms, the integration of additional auxiliary features, such as image and speech recognition, access to on-demand distributed cloud computing resources, and improved graphical user and web interfaces, into autonomous materials design platforms is expected to both improve their user-friendliness and enrich their capabilities in the imminent future [19].

2.3. Molecular and atomistic simulations

Quantum mechanical and classical forcefield-based atomistic simulation methods play a powerful role in modeling and understanding materials behavior and properties via accurate studies of a diverse range of phenomena, including thermal and mass transport, phase transformations, chemical reactions, mechanical behavior, and materials degradation and failure [109–111]. Starting from fundamental laws governing interatomic interactions, molecular dynamics [112] (and related atomistic methods) can be used to follow the classical equations of motion in time to enable highly accurate predictions of materials behavior in full atomistic detail. However, quantum mechanical methods and classical simulations vastly differ in the accuracy (and the concomitant computational cost) with which they capture details of the interatomic interactions. Quantum-mechanics-based methods, such as density functional theory (DFT), are versatile and offer the capability to accurately model a range of chemistries and chemical environments. However, these methods remain computationally very demanding, limiting both the length- and time-scales of the accessible phenomena (to nanometers and picoseconds, respectively) [113]. Semi-empirical methods capture the essence of the interatomic interactions in a coarse-grained manner (via parameterized analytical functional forms), and are thus an inexpensive solution to the materials simulation problem [114–116]. Nevertheless, their applicability is severely restricted to the specific chemistries and chemical environments considered during parameterization, and accuracy cannot be guaranteed for properties not explicitly targeted by the fit. Therefore, one of the goals for ML algorithms in this arena is to help develop potentials (referred to as ML potentials or ML forcefields) that can achieve accuracies approaching those of quantum mechanical methods at the cost of semi-empirical methods, to accomplish a variety of tasks (for instance, see Fig. 1e).

The last decade has witnessed a tremendous amount of activity and successes in the field of data-driven atomistic simulations and, in
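As a minimal illustration of the invariance requirements that configurational representations for such ML potentials must satisfy, the snippet below computes a purely distance-based radial fingerprint for a small atomic cluster, loosely in the spirit of symmetry-function descriptors, and checks that it is unchanged under a rigid translation and rotation. The cluster geometry and the eta, r_s parameters are arbitrary illustrative values, not taken from any cited scheme.

```python
import numpy as np

rng = np.random.default_rng(3)

def radial_descriptor(positions, eta=0.5, r_s=2.0):
    """Sum-of-Gaussians radial fingerprint of an atomic cluster.
    It depends only on interatomic distances, so it is invariant
    to rigid translations and rotations of the whole cluster."""
    g = 0.0
    n = len(positions)
    for i in range(n):
        for j in range(n):
            if i != j:
                r = np.linalg.norm(positions[i] - positions[j])
                g += np.exp(-eta * (r - r_s) ** 2)
    return g

# A small hypothetical 4-atom cluster.
pos = rng.normal(size=(4, 3))

# A rigid translation and a rotation about z by 40 degrees.
t = np.array([1.0, -2.0, 0.5])
theta = np.deg2rad(40.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])

g0 = radial_descriptor(pos)
g1 = radial_descriptor(pos + t)       # translated cluster
g2 = radial_descriptor(pos @ R.T)     # rotated cluster
print(f"{g0:.6f} {g1:.6f} {g2:.6f}")  # identical up to float error
```

A production descriptor would be evaluated per atom with a smooth cutoff function and a set of (eta, r_s) values, and would also need permutation invariance over like atoms, but the invariance argument is the same.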


particular, in the area of ML forcefield development. Unlike traditional semi-empirical methods, which rely on domain-knowledge-based specific functional forms and rigid parameterizations, ML methods use past accumulated data to make interpolative predictions of the energy and forces in the chemical space of interest. A major challenge in this direction has been the development of configurational representations for ML that respect the required symmetry and invariance constraints (e.g., translational, rotational, and exchange of like atoms), capture the details of potential energy surfaces at sub-atomic-level resolution, and are "smoothly-varying" (i.e., continuous and differentiable) with respect to small variations in atomic positions. Several local-atomic-environment fingerprinting schemes with varying cost-accuracy tradeoffs have been proposed, including those based on symmetry functions [117–119], Coulomb matrices [120,121], bispectra of neighborhood atomic densities [122], smooth overlap of atomic positions (SOAP) [123–125], AGNI [126–128], moment tensor potentials [129] and others [130,131]. These representations, combined with well-established ML algorithms such as kernel ridge regression, Gaussian process regression or deep neural networks, have been widely used to explore diverse materials energy landscapes. Transfer learning approaches have been particularly promising in accessing cheap surrogate models for highly accurate and computationally-demanding beyond-DFT-level energetics [132,133]. Active learning strategies can also be very effective here in strategically acquiring training data that is uniformly spread out over the target configurational space in order to develop robust and effective ML models [133–135]. In the near future, ML-augmented molecular and materials simulations hold promise to significantly narrow the gap between the simulated and experimentally-observed time and length scales, while providing high-fidelity predictions of the behavior of matter under varying environmental and processing conditions.

Another direction that has been explored on this frontier deals with

materials. However, extraction of structure-property relationships from these truly large databases remains a formidable challenge beyond the scope of conventional data-analysis techniques that are based largely on manual inspection by a domain-knowledge expert. Taking advantage of big datasets available from state-of-the-art characterization techniques, recent ML efforts have focused on developing theory-guided mappings between the characterized atomic-level structure and the measured response surfaces (e.g., see Fig. 1f) [147,20,148].

In addition, ML-based efficient on-the-fly analysis of materials characterization data can help address workflow bottlenecks in imaging applications [149,150]. For instance, the electron backscatter diffraction (EBSD) technique is routinely employed to obtain three-dimensional spatially resolved crystallographic characterization of polycrystalline samples as large as 10 mm [151,152], and provides orientation maps at about 200 nm spatial resolution and 0.5 deg crystal orientation resolution. While top-of-the-line commercially available orientation imaging microscopes can make these measurements at unprecedented speeds (one diffraction pattern measurement in less than 1 ms) [153], the use of traditional indexing techniques for orientation reconstruction from highly noisy EBSD patterns remains a bottleneck, requiring much longer time scales. This has been a major limitation towards implementing the efficient real-time orientation indexing required, for instance, to study in-situ microstructure evolution. A number of recent studies have shown that ML-based methods are robust to experimentally measured image noise and can be used to index orientations as fast as the highest EBSD scanning rates [150,154,155]. Other notable examples of ML-aided characterization include classification of local chemical environments from X-ray absorption spectra [156], identification of two-dimensional heterostructures in optical microscopy [157], automated tuning of microscope controls, data acquisition and analysis [158], phase identification in Raman spectroscopy [159], and automated image segmentation and image reconstruction for magnetic resonance imaging
using ML to bypass explicit solutions of the Schrodinger’s equation (or [160–162]. In future, merging the ML-extracted knowledge from ma­
the Kohn-Sham equation within the DFT framework) to come up with a terials characterization data with physics-informed models will enable a
much faster linearly-scaling data-enabled route to address the electronic new paradigm of materials research where theoretical predictions and
structure problem for materials. A number of good ideas, including experimental observations go hand-in-hand at the microscopic levels.
learning the kinetic energy functionals as well as learning density-
potential and energy-density maps, have been proposed [136–140]. 2.5. Automated knowledge extraction from text
However, one can argue that the field is still a state of infancy, as most of
these studies have dealt largely with toy problems and simple test cases. A significantly large amount of materials scientific knowledge today
One notable exception in this direction was presented by Chandra­ exists as text (manuscripts, reports, abstracts etc.), which is continuously
shekhar et al. [141] showing how ML can be utilized to map an external growing at an unprecedented rate. However, due to absence of efficient
potential (governed solely by the type and positions of nuclei) directly algorithms that can directly extract correlations, connections and re­
onto the corresponding electronic charge density and local density of lationships from text inputs, this information rich resource remains
states. The predicted density of states and charge density can in turn be largely untapped and the materials community has mainly relied on
utilized to obtain the total energy and other derived properties of the expert-curated and well-structured property databases for materials
system. The demonstration of the proposed approach on realistic poly­ design and discovery efforts undertaken in the past. However, in the last
meric and metallic systems further shows tremendous promise of similar decade a number of breakthroughs in natural language processing (NLP)
approaches in an attempt to integrate ML within the inner workings of have opened up exciting new avenues in materials science and related
DFT (and more broadly, quantum mechanics). fields. Most remarkably, use of ML algorithms, such as Word2vec
[163,164] and GloVe [165] to construct high dimensional vector spaces
2.4. Materials characterization (commonly referred to as embeddings) for words appearing in a text
corpus such that their relative semantic and syntactic relationships are
In addition to the modeling, simulations and synthesis, progress of preserved has given rise to a generalized approach that can be used to
atomically resolved imaging techniques has opened up new avenues for mine scientific literature in a highly effective manner. These word em­
ML-based methods to aid in achieving rapid and quantitative charac­ beddings can capture complex materials science concepts and structure-
terization of functional matter under both static and dynamic condi­ property relationships directly from text without any need for explicit
tions. Although traditionally characterization techniques have mainly domain knowledge insertion. These notions are graphically captured in
been used to illustrate a material system’s qualitative structure or Fig. 1g.
behavior, improved resolution and multi-probe characterization options A practical demonstration using this approach to capture latent
accessible in modern imaging tools offer much more quantitative and knowledge from materials science literature was put forward by Tshi­
information-rich measurements. For instance, today real-space imaging toyan et al. [166], who collected and processed materials-related
techniques such as scanning transmission electron microscopy research from approximately 3.3 million scientific abstracts published
[142,143], scanning tunneling microscopy [144,145] and atomic force between 1922 and 2018 in more than 1,000 journals. As a major finding,
microscopy [146] permit direct imaging of atomic-level structure and this study showed that information regarding future discoveries already
functional properties in complex multi-component and multi-phase exists, to a large extent, in past publications in a latent form and

5
G. Pilania Computational Materials Science 193 (2021) 110360

therefore such NLP models can potentially recommend new functional materials several years before their normal course of discovery.

In the same vein as the previous study, using polymers as an example class of functional materials, Shetty and Ramprasad also confirmed that materials science knowledge can be automatically inferred from the textual information contained in scientific texts [21]. Using a data set of nearly 0.5 million polymer papers, it was shown that vector representations trained for every word appearing in the accumulated text corpus were able to capture crucial materials knowledge in a completely unsupervised manner. Subsequently, ML-based temporal studies, aimed at tracking the popularity of various polymers for different applications, were able to identify new polymers for novel applications based solely on the domain knowledge contained in the mined database.

Another challenging problem that has been targeted with automated text mining, via combining ML and NLP, pertains to identifying realistic materials synthesis routes. In particular, Kim et al. [167] demonstrated the use of ML methods to successfully predict the critical synthesis parameters needed to make targeted materials (titania nanotubes via hydrothermal methods in this particular case), where the training dataset was automatically compiled from tens of thousands of scholarly publications using NLP techniques. More importantly, the study also showed the capacity for transfer learning of the developed ML models by predicting synthesis outcomes on materials systems not included in the original training set.

As evident from the examples discussed in this section, ML-based methods and algorithms have found a wide range of applications within the field of materials design and development. The developed models largely rely on materials descriptors or features to numerically represent details of the problem, such as the chemical composition and configurational structure of the material, processing conditions and relevant environmental factors. The choice of an appropriate descriptor set is a crucial step of enormous importance. The choice of an initial set is typically based either solely on the underlying domain knowledge of the problem (i.e., mechanistic details, well-established and physically intuitive relationships, constitutive laws of physics and chemistry, etc.) or on an unbiased selection, using the available data, starting from a very large set of combinatorial possibilities. One can argue that both routes have their own pros and cons: while the former approach is likely to result in models that are more amenable to physical interpretation, the latter harbors an increased potential for discoveries that are typically beyond the realm of conventional wisdom. Regardless, an exploratory analysis utilizing several approaches to the problem at hand during an ML model-building exercise is always helpful. Eventually, our ability not only to generate transparent ML models, but also to extract physical insights from these surrogates, while preserving the potential for discovery that is intrinsic to data-enabled methods, will dictate the extent of the impact of materials informatics on the field.

3. Physical insights from materials learning

3.1. Performance-transparency tradeoffs

In addition to delivering robust and accurate predictions, ML models in the physical sciences are often required to provide new scientific understanding and physical insights directly from observational or simulated data. A prerequisite to domain-knowledge extraction via ML is explainability: the ability to rationalize individual predictions by examining the inner workings of a transparent model and further interpreting the outcomes in combination with expert knowledge. Therefore, a collection of interpretations for a transparent model, when evaluated by a domain-knowledge expert, leads to explainability. Within this context, transparency is largely confined to the details of the employed ML model (i.e., details pertaining to the specific choices of model class, model complexity, learning algorithm employed, hyperparameters, initial constraints, etc.), while interpretability combines both the input data and the ML model to make sense of the output. Going from interpretability to explainability requires the involvement of a human with a scientific understanding of the problem. In the quest to learn from learning machines (or intelligible intelligence), widely acceptable concepts of transparency, interpretability and explainability have recently emerged as the core elements deemed necessary to enable scientific outcomes from ML endeavors [168].

The aforementioned notions are directly connected to model complexity. Simple, and therefore transparent, ML models are highly amenable to interpretations and explanations; however, they generally suffer from relatively poor accuracy and reliability as compared to more complex "black-box" type models. Therefore, similar to the well-known bias-variance tradeoffs that are invoked to prevent overfitting while building a robust predictive ML model, the balancing of a performance (reliability and accuracy) versus transparency tradeoff needs to be carefully considered for explainable ML models.

Fig. 2. Schematic showing the trade-off between model performance and transparency. The area of potential future improvement due to improved explainable AI techniques and tools is highlighted in green, with the improvement directions represented with arrows. Adapted from Ref. [14].

3.2. Hybrid and local-learning approaches for improved transparency

Given the common scenarios where model performance closely accompanies model complexity, model transparency (and therefore interpretability and explainability) exhibits a downward slope that has largely remained unavoidable in the past. This situation is graphically represented in Fig. 2, where deep neural networks occupy one extreme, offering excellent performance but little transparency [14]. The other extreme, of high transparency with relatively poor performance, is occupied by decision trees and rule-based algorithms that are completely interpretable. However, going beyond traditional single-model frameworks, more sophisticated hybrid methods have been suggested to simultaneously improve model transparency and performance [170–173]. For instance, Kailkhura et al. [170] recently presented an approach that first transforms a regression problem into a multi-class classification problem on sub-sampled training data to balance the distribution of the least represented material classes. Subsequently, smaller and simpler models for the different classes are trained to gain a better understanding of the different subdomain-specific regimes. This domain-specific learning enabled a rationale-generator component of the framework, which can provide both model-level and decision-level explanations. This led to improvements in the overall transparency and explainability of the model as compared to the conventional approach of training just one regression model for the entire dataset. Finally, a transfer learning technique harnessing correlations between multiple properties was employed to compensate for the model performance reduction resulting from improved transparency. In a different study, Sutton et al. [171] presented a new sub-group-discovery-based

approach to identify the domains of applicability of ML models and showed that domain-specific learning is not only crucial for a deeper understanding and improved interpretability, but can also significantly improve prediction performance for certain domains. The idea of fitting a local domain-specific model to gain an improved understanding of otherwise opaque ML models lies at the heart of the local interpretable model-agnostic explanations (LIME) algorithm [172], which can explain the predictions of any classifier in a faithful way by approximating it locally with an interpretable model. In the future, developments in the direction of transparency-preserving hybrid modeling approaches, and a focus on interpretability-driven new model developments, are going to further expand these frontiers, highlighted in green in Fig. 2.

3.3. Causality- and consistency-based validations

An explainable model further opens doors for devising testable hypotheses or more stringent validation tests for specific predictions to address their consistency, generalizability and causality. A compelling example demonstration in this direction was presented by Ouyang et al. using the sure independence screening and sparsifying operator (SISSO) method based on the compressed sensing technique [169]. Note that this method allows for efficient exploration of vast descriptor spaces (with the number of unique descriptors typically reaching up to several billions) to identify transparent analytical descriptor-property relationships, and it has been widely applied to address a diverse set of materials design and discovery problems [34,169,174–176]. Ouyang et al. applied a SISSO-based approach to learn an accurate, transparent and predictive metal-insulator classification model for binary AxBy-type compounds [169]. Simple two-dimensional analytical descriptors found by SISSO led to an almost perfect classification (with 99.0% accuracy) of metal versus nonmetal chemistries for a set of 299 compounds in total (see Fig. 3a). More interestingly, to conclusively show that the discovered descriptors indeed bore a causal relationship with the metallic or insulating behavior exhibited by the materials, the model was employed to rediscover the available pressure-induced insulator-to-metal transitions, with a number of chemistries that were known to undergo such a transition lying consistently near the classification boundary, as shown in Fig. 3b. Furthermore, the model was able to make additional predictions of yet unknown transition candidates, ripe for experimental validation. Additional evidence of an underlying causal relationship was provided by depicting a qualitative yet clear trend between the experimental band gap of the insulators and the scaled distance from the dividing line (Fig. 3c). In a similar spirit of finding accurate symbolic expressions that match data from an unknown function, Udrescu et al. [177] developed a recursive multidimensional symbolic regression algorithm, named AI Feynman, and demonstrated the rediscovery of a set of 100 hand-picked equations from the Feynman Lectures on Physics [178–180]. These contributions suggest that compressed-sensing- and symbolic-regression-based techniques, combined with appropriately identified domain-knowledge-based constraints, can be enormously helpful in gaining physical insights from materials data.

3.4. Informatics-enhanced design maps

The efficient interpolation ability of ML algorithms in high-dimensional spaces can be harnessed in the development of informatics-enhanced design maps that are much more informative and information-rich than traditional methods, which have largely employed two-dimensional maps. As an example, Fig. 4 compares a traditional tolerance factor versus octahedral factor structure map, often invoked to identify formable perovskite oxides [181], with its informatics-enhanced counterpart. Indeed, the pair of geometrical descriptors shows a remarkable predictive power, and all the known compounds that have been successfully synthesized in a perovskite crystal structure tend to cluster in this plot, as depicted in Fig. 4a. A shortcoming of such an approach, however, could be that the descriptor pair is solely based on size effects (i.e., coordination-environment-dependent Shannon ionic radii [182]) and completely ignores aspects of local bonding interactions, such as ionicity versus covalency, relative electronegativity differences between different cations, etc., which might also play an important role in dictating formability in perovskites. Although one can argue that some of these aspects are implicitly accounted for in the relative atomic and ionic size trends, the ability to explicitly incorporate additional relevant factors might significantly improve such conventional maps in terms of their predictive power. For instance, Fig. 4b shows an analogous plot generated by a random forest ML model that was trained and validated on a much larger set of descriptors, including the octahedral and tolerance factors as well as electronegativities, ionization potentials, electron affinities, and orbital-dependent pseudopotential radii of the cations. Once the model has been trained and validated, it can be used to make probabilistic estimates of perovskite formability in the entire multi-dimensional input feature space, and these predictions can be projected back onto a two-dimensional plot for the two classical geometric factors, while integrating out or marginalizing all the other feature dimensions, as

Fig. 3. An example depicting SISSO classification performance in separating metals from insulators. (a) A near-perfect classification of metal/nonmetal for 299 binary AxBy-type materials. Symbols χ, IE and x represent Pauling electronegativity, ionization energy and atomic composition, respectively; ΣVatom/Vcell represents the packing factor. Red circles, blue squares, and open blue squares represent metals, non-metals, and the three erroneously characterized non-metals, respectively. (b) Representation of pressure-induced insulator-to-metal transitions (red arrows) and materials that remain insulators upon compression (blue arrows). Computational predictions at steps of 1 GPa are shown with green bars. (c) Correlation between the band gap of the non-metals and the scaled coordinate from the dividing line. Adapted from Ref. [169], with permissions. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
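The descriptor-search idea behind approaches such as SISSO can be illustrated with a deliberately tiny sketch. This is not the implementation of Ref. [169]: the two primary features, the sample values and the margin-based score below are all invented for illustration, and real descriptor searches enumerate billions of candidates with sparsifying regression rather than this brute-force ranking.

```python
# Toy illustration of a descriptor search (NOT the actual SISSO code):
# enumerate simple algebraic combinations of two primary features and
# rank each candidate by the margin with which a single threshold
# separates two classes. All feature names and values are invented.

samples = [  # (electronegativity_difference, packing_factor, is_metal)
    (0.4, 0.74, True), (0.3, 0.68, True), (0.5, 0.72, True),
    (1.8, 0.52, False), (2.1, 0.48, False), (1.6, 0.55, False),
]

def candidates(x, y):
    """Simple nonlinear combinations of the two primary features."""
    return {"x": x, "y": y, "x*y": x * y, "x/y": x / y,
            "x^2": x * x, "x+y": x + y, "|x-y|": abs(x - y)}

def margin(metal_vals, insulator_vals):
    """Gap between the classes along one descriptor axis; positive only
    when a single threshold separates them (metals assumed to lie low)."""
    return min(insulator_vals) - max(metal_vals)

best = None
for name in candidates(1.0, 1.0):
    m = [candidates(x, y)[name] for x, y, is_metal in samples if is_metal]
    i = [candidates(x, y)[name] for x, y, is_metal in samples if not is_metal]
    score = margin(m, i)
    if best is None or score > best[1]:
        best = (name, score)

print(best)  # the candidate descriptor with the widest classification margin
```

A real SISSO run replaces the hand-enumerated dictionary with systematically generated feature spaces and sparsifying fits, but the core idea of ranking candidate expressions by how well they separate the classes is the same.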

Fig. 4. A comparison of structure maps between the tolerance and octahedral factors for perovskite formability. (a) Conventional structure map with a scatter plot. The perovskite formability region is given by a convex hull encompassing the known examples (green circles). (b) Informatics-enhanced structure map with the same set of variables, explicitly accounting for the probability of formation. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
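The "integrating out" step that produces a map like Fig. 4b can be sketched in a few lines. Everything below is a stand-in for illustration: `model_probability` is an arbitrary hand-written function, not the trained random forest from the text, and the feature names, values and grids are invented.

```python
import math
import statistics

def model_probability(tolerance, octahedral, electroneg_diff):
    """Stand-in for a trained classifier's formability probability
    (an arbitrary smooth function, used here only for illustration)."""
    score = (-((tolerance - 0.95) ** 2) / 0.02
             - ((octahedral - 0.55) ** 2) / 0.05
             - 0.1 * electroneg_diff)
    return 1.0 / (1.0 + math.exp(-score))

# Observed values of the remaining feature, to be averaged (marginalized) out.
observed_electroneg_diff = [0.5, 1.0, 1.5, 2.0]

def partial_dependence(tolerance, octahedral):
    """Project the model onto the two geometric factors by averaging its
    output over the empirical distribution of the other feature."""
    return statistics.mean(
        model_probability(tolerance, octahedral, e)
        for e in observed_electroneg_diff
    )

# Evaluate the projected map on a small grid of the two classical factors.
grid = [(round(0.80 + 0.05 * i, 2), round(0.40 + 0.05 * j, 2))
        for i in range(5) for j in range(5)]
design_map = {point: partial_dependence(*point) for point in grid}
best_point = max(design_map, key=design_map.get)
print(best_point, round(design_map[best_point], 3))
```

With a real trained model, common ML toolkits provide this projection directly via partial-dependence utilities, so the loop would not be written by hand; the point here is only the marginalization itself.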

shown in Fig. 4b. One might argue that Fig. 4b is more informative, since it implicitly contains trends reflected by the entire set of descriptors that were used to train the model, and not just the tolerance and octahedral factors, as in the case of Fig. 4a. Moreover, the informatics-based route allows for the generation of analogous plots for any pair of features drawn from the original input feature set. Here we note that a closely related approach, known as partial dependence plots, is readily available in tree-based ensemble models [183]. Although we have focused on a relatively simple example, it is not hard to imagine much more complex situations where such an approach can be applied. In complex materials design problems, the ability to construct such design maps to explore and rationalize intricate trends and tradeoffs among key design variables can be enormously helpful.

Finally, we note that a great deal of research has lately gone into the development of explainable deep learning techniques, and these efforts have been reviewed in a number of surveys [14,184–187]. While it is impractical to delve deep into this large body of work, we note here in passing that, at a higher level, explainable deep learning methods largely fall into three broad categories, namely visualization, model distillation and intrinsic methods [187]. As the nomenclature suggests, visualization methods rely on scientific visualization to single out the key characteristics of an input that strongly influence the output in order to generate an explanation. The model distillation approach resorts to a separate "glass-box" ML model that is trained to mimic the input-output behavior of the original "black-box" model, but in a more transparent manner, by identifying specific decision rules that lead to the final output. Intrinsic methods, on the other hand, have an explanation system integrated within by design and can therefore balance the transparency-performance tradeoff on-the-fly by jointly optimizing both model performance and some quality measure of the explanations produced. In the future, as materials datasets grow larger, these techniques will play roles of increasing importance in materials design problems.

4. Challenges and opportunities ahead

As briefly alluded to above, over the past decade the field of materials informatics has grown exponentially. The early phase of this growth was largely focused on developing a deeper understanding of ML model development itself, with a primary focus on testing the efficacy and efficiency of data-enabled approaches in materials development. In this phase, studies emphasized addressing basic questions such as: "How do different statistical learning methods work?"; "What are their potential strengths and weaknesses?"; "How does one select an appropriate method for a given problem?"; "What are some best practices of statistical learning that one should follow for developing and validating a predictive model?", etc. These efforts have culminated in the democratization of the process of training a model on materials data, with the availability of several open-source ML packages and repositories for model development and dissemination. Now that the field has matured from a specialized area of research into an established discipline, the research focus has shifted to a number of more general materials-science-specific issues that the community is currently grappling with.

Although ML problems are frequently referred to as "big data" problems, the datasets used in materials design and discovery problems are generally relatively small, barring certain cases dealing with small molecules or imaging data for materials characterization. A large fraction of the materials data available in open-source materials databases comes from first-principles computations, with a major emphasis on ground-state atomic structures and energetics. The availability of high-quality data on most functional properties generated via direct experimental measurements is rather limited. On the other hand, an informatics effort that targets the discovery of a new functional material with a desired set of properties usually requires a dataset with several compounds in a target compositional and configurational space (i.e., for given chemistries and crystal structures), with entries on multiple properties spanning a range of processing conditions. Such datasets are extremely difficult to populate starting from publicly available materials data. In fact, the accumulation and curation of an initial high-quality dataset remains a highly laborious and time-consuming step for most materials design efforts today. To address this data-scarcity problem going forward, a crucial next step is the development and widespread use of data-mining and NLP-based high-throughput data acquisition techniques, along with advanced methods to extract data directly from graphics, which permit a much faster and semi-automated extraction of materials datasets from past literature.

In the past, materials development has largely been led by chemical-intuition-guided explorations. Moreover, results from failed experiments are rarely reported in the peer-reviewed literature. As a result, in addition to being sparse, the available data distributions can be highly imbalanced and skewed, violating one of the central assumptions of

most standard ML methods requiring uniformly sampled and balanced training data. Furthermore, data coming from different sources can have varying levels of noise. Robust model development in such situations demands more advanced analysis, going beyond the standard predictive-accuracy-centric "testing on unseen data" approach. Advanced methods for rigorous uncertainty quantification, establishment of domains of applicability, and effective correction of class-imbalance problems and skewed data distributions are just beginning to find inroads into materials informatics [170,171,188].

In addition to being sparse and skewed, materials data can be highly heterogeneous and can be generated at varying levels of fidelity. Furthermore, due to the underlying cost-accuracy tradeoffs in both the experimental and computational techniques available for data acquisition, a larger amount of data is generally available at a lower fidelity level. The development and use of advanced algorithms that allow for effective integration of information coming in from varying-fidelity sources, while explicitly accounting for the different noise levels in the different data segments, to make predictions at the highest level of fidelity (i.e., at the highest predictive accuracy and lowest uncertainty levels), is highly desirable in materials informatics [189]. Such algorithms also provide a means to address the data-scarcity problem, as they have proven effective in learning-from-small-datasets scenarios [190].

In addition to the aforementioned challenges, which largely concern the amount and quality of available training data, effective approaches that enable the integration of domain knowledge with ML could be transformative. In this direction, both the ability to put domain knowledge into an ML model and the ability to extract new physical insights from an explainable ML model should be considered. On the one hand, ML algorithms are required that can directly integrate available mechanistic understanding and known domain knowledge (in terms of physical laws and well-established principles; constraints such as boundary conditions, asymptotic limits, smoothness criteria, symmetries and invariances; and other problem-specific knowledge obtained from theory and simulations) to train more efficiently with smaller datasets [191]. On the other hand, breakthroughs in terms of implementing hybrid and locally-interpretable models (as discussed in Section 3) to explain ML predictions are sought. Going beyond standard statistical validation techniques, more stringent domain-specific validation criteria, using either direct experimentation or rigorously validated first-principles computational methods, are required to establish that the identified correlations indeed represent causal relationships with true predictive power.

Finally, to facilitate the documentation, dissemination and effective use of the highly multiscale, multidimensional and heterogeneous nature of materials data, it is desirable to develop new file formats and data structures that are flexible enough to handle this level of complexity. Encouraging documentation of not just the data, but also the relevant metadata (providing a much richer context for the primary data) across the community will be increasingly helpful going forward as text-to-knowledge mining methods progress. Infrastructure development for sharing not just the data, but also the developed ML models themselves, is desired to address one of the most important issues, reproducibility. Some efforts in these directions are currently in progress [192–195]. Going forward, cultivating a culture that encourages publishing results from failed experiments, along with the adoption of publication file formats that by design enable efficient data extraction via text mining, will open new avenues for information-rich materials datasets. As ML methods become increasingly popular and more widely used within the materials community, addressing these challenges becomes critical for expediting the pace of progress.

5. Conclusions

ML and data-enabled methods represent the advent of a new paradigm in materials science. As a result, the way in which materials design is carried out is changing in profound ways. Starting from a niche area, within a short period of time materials informatics has already been established as a full-blown, mature discipline. ML algorithms are already aiding in efficient materials property predictions, materials design and discovery, as well as in different components of experimental design, dealing with the identification, organization and prioritization of next experiments. Going forward, a number of crucial challenges need to be addressed, pertaining to the accessibility and quality of data, as well as to integrating domain knowledge into ML models (beyond the means of feature selection and feature engineering) and extracting novel insights out of the trained models. Upon success, materials science in the coming decades will be defined by our ability to learn from learning machines.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Ghanshyam Pilania: Conceptualization, Writing - original draft, Writing - review & editing.

Acknowledgements

I would like to acknowledge the many fellow researchers, collaborators and mentors whom I had a chance to work with in the last few years. I also acknowledge the financial support from the Laboratory Directed Research and Development (LDRD) program of the Los Alamos National Laboratory (LANL) via multiple projects (#20190001DR and #20190043DR). LANL is operated by Triad National Security, LLC, for the National Nuclear Security Administration of the U.S. Department of Energy (Contract No. 89233218CNA000001).

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.commatsci.2021.110360.

References

[1] D. Reinsel, J. Gantz, J. Rydning, The Digitization of the World from Edge to Core, International Data Corporation, Framingham, 2018.
[2] P. Larrañaga, D. Atienza, J. Diaz-Rozo, A. Ogbechie, C.E. Puerto-Santana, C. Bielza, Industrial Applications of Machine Learning, CRC Press, 2018.
[3] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[4] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, vol. 1, MIT Press, Cambridge, 2016.
[5] C. Manning, H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
[6] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners, OpenAI Blog 1 (8) (2019) 9.
[7] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Mastering the game of go without human knowledge, Nature 550 (7676) (2017) 354–359.
[8] D. Silver, A. Huang, C.J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, Mastering the game of go with deep neural networks and tree search, Nature 529 (7587) (2016) 484–489.
[9] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, Mastering chess and shogi by self-play with a general reinforcement learning algorithm, arXiv preprint arXiv:1712.01815
and discovery has traditionally been pursued in the field is poised to (2017).


[10] M. Moravcík, M. Schmid, N. Burch, V. Lisy, D. Morrill, N. Bard, T. Davis, K. Waugh, M. Johanson, M. Bowling, Deepstack: expert-level artificial intelligence in heads-up no-limit poker, Science 356 (6337) (2017) 508–513, publisher: American Association for the Advancement of Science.
[11] N. Brown, T. Sandholm, Superhuman AI for heads-up no-limit poker: Libratus beats top professionals, Science 359 (6374) (2018) 418–424, publisher: American Association for the Advancement of Science.
[12] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A.A. Kalyanpur, A. Lally, J.W. Murdock, E. Nyberg, J. Prager, Building Watson: an overview of the DeepQA project, AI Mag. 31 (3) (2010) 59–79.
[13] A. Adadi, M. Berrada, Peeking inside the black-box: a survey on explainable artificial intelligence (XAI), IEEE Access 6 (2018) 52138–52160, publisher: IEEE.
[14] A.B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García, S. Gil-López, D. Molina, R. Benjamins, Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI, Inf. Fusion 58 (2020) 82–115, publisher: Elsevier.
[15] D. Morgan, R. Jacobs, Opportunities and challenges for machine learning in materials science, Annu. Rev. Mater. Res. 50 (2020), publisher: Annual Reviews.
[16] R. Ramprasad, R. Batra, G. Pilania, A. Mannodi-Kanakkithodi, C. Kim, Machine learning in materials informatics: recent applications and prospects, NPJ Comput. Mater. 3 (1) (2017) 1–13, publisher: Nature Publishing Group.
[17] T. Mueller, A.G. Kusne, R. Ramprasad, Machine learning in materials science: recent progress and emerging applications, Rev. Comput. Chem. 29 (2016) 186–273, publisher: Wiley Online Library.
[18] B. Sanchez-Lengeling, A. Aspuru-Guzik, Inverse molecular design using machine learning: generative models for matter engineering, Science 361 (6400) (2018) 360–365, publisher: American Association for the Advancement of Science. doi:10.1126/science.aat2663. URL: https://science.sciencemag.org/content/361/6400/360.
[19] F. Häse, L.M. Roch, A. Aspuru-Guzik, Next-generation experimentation with self-driving laboratories, Trends Chem. 1 (3) (2019) 282–291, publisher: Elsevier.
[20] M. Ziatdinov, O. Dyck, A. Maksov, X. Li, X. Sang, K. Xiao, R.R. Unocic, R. Vasudevan, S. Jesse, S.V. Kalinin, Deep learning of atomically resolved scanning transmission electron microscopy images: chemical identification and tracking local transformations, ACS Nano 11 (12) (2017) 12742–12752, publisher: ACS Publications.
[21] P. Shetty, R. Ramprasad, Automated knowledge extraction from polymer literature using natural language processing, iScience 24 (1) (2021) 101922, publisher: Elsevier.
[22] R. Batra, L. Song, R. Ramprasad, Emerging materials intelligence ecosystems propelled by machine learning, Nat. Rev. Mater. (2020) 1–24, publisher: Nature Publishing Group.
[23] J. Schmidt, M.R. Marques, S. Botti, M.A. Marques, Recent advances and applications of machine learning in solid-state materials science, NPJ Comput. Mater. 5 (1) (2019) 1–36, publisher: Nature Publishing Group.
[24] K.T. Butler, D.W. Davies, H. Cartwright, O. Isayev, A. Walsh, Machine learning for molecular and materials science, Nature 559 (7715) (2018) 547–555, publisher: Nature Publishing Group.
[25] L. Ward, C. Wolverton, Atomistic calculations and materials informatics: a review, Curr. Opin. Solid State Mater. Sci. 21 (3) (2017) 167–176, publisher: Elsevier.
[26] A. Jain, G. Hautier, S.P. Ong, K. Persson, New opportunities for materials informatics: resources and data mining techniques for uncovering hidden relationships, J. Mater. Res. 31 (8) (2016) 977–994, publisher: Cambridge University Press.
[27] G. Pilania, P.V. Balachandran, J.E. Gubernatis, T. Lookman, Data-based methods for materials design and discovery: basic ideas and general methods, Synth. Lect. Mater. Opt. 1 (1) (2020) 1–188, publisher: Morgan & Claypool Publishers.
[28] L. Chen, G. Pilania, R. Batra, T.D. Huan, C. Kim, C. Kuenneth, R. Ramprasad, Polymer informatics: current status and critical next steps, Mater. Sci. Eng. R: Rep. 144 (2021) 100595, publisher: Elsevier.
[29] F.A. Faber, A. Lindmaa, O.A. Von Lilienfeld, R. Armiento, Machine learning energies of 2 million elpasolite (ABC2D6) crystals, Phys. Rev. Lett. 117 (13) (2016) 135502, publisher: APS.
[30] B. Meredig, A. Agrawal, S. Kirklin, J.E. Saal, J.W. Doak, A. Thompson, K. Zhang, A. Choudhary, C. Wolverton, Combinatorial screening for new materials in unconstrained composition space with machine learning, Phys. Rev. B 89 (9) (2014) 094104, publisher: APS.
[31] A.M. Deml, R. O’Hayre, C. Wolverton, V. Stevanovic, Predicting density functional theory total energies and enthalpies of formation of metal-nonmetal compounds by linear regression, Phys. Rev. B 93 (8) (2016) 085142, publisher: APS.
[32] F. Legrain, J. Carrete, A. van Roekeghem, S. Curtarolo, N. Mingo, How chemical composition alone can predict vibrational free energies and entropies of solids, Chem. Mater. 29 (15) (2017) 6220–6227, publisher: ACS Publications.
[33] A. Talapatra, B.P. Uberuaga, C.R. Stanek, G. Pilania, A machine learning approach for the prediction of formability and thermodynamic stability of single and double perovskite oxides, Chem. Mater., publisher: ACS Publications.
[34] C.J. Bartel, C. Sutton, B.R. Goldsmith, R. Ouyang, C.B. Musgrave, L.M. Ghiringhelli, M. Scheffler, New tolerance factor to predict the stability of perovskite oxides and halides, Sci. Adv. 5 (2) (2019) eaav0693, publisher: American Association for the Advancement of Science.
[35] G. Pilania, P.V. Balachandran, C. Kim, T. Lookman, Finding new perovskite halides via machine learning, Front. Mater. 3 (2016) 19, publisher: Frontiers.
[36] G. Pilania, J.E. Gubernatis, T. Lookman, Classification of octet AB-type binary compounds using dynamical charges: a materials informatics perspective, Sci. Rep. 5 (2015) 17504, publisher: Nature Publishing Group.
[37] G. Pilania, P.V. Balachandran, J.E. Gubernatis, T. Lookman, Classification of ABO3 perovskite solids: a machine learning study, Acta Crystallogr. Sect. B Struct. Sci., Cryst. Eng. Mater. 71 (5) (2015) 507–513, publisher: International Union of Crystallography.
[38] G. Pilania, J.E. Gubernatis, T. Lookman, Structure classification and melting temperature prediction in octet AB solids via machine learning, Phys. Rev. B 91 (21) (2015) 214302, publisher: APS.
[39] G. Pilania, A. Ghosh, S.T. Hartman, R. Mishra, C.R. Stanek, B.P. Uberuaga, Anion order in oxysulfide perovskites: origins and implications, NPJ Comput. Mater. 6 (1) (2020) 1–11, publisher: Nature Publishing Group.
[40] G. Pilania, X.-Y. Liu, Machine learning properties of binary wurtzite superlattices, J. Mater. Sci. 53 (9) (2018) 6652–6664, publisher: Springer.
[41] B. Medasani, A. Gamst, H. Ding, W. Chen, K.A. Persson, M. Asta, A. Canning, M. Haranczyk, Predicting defect behavior in B2 intermetallics by merging ab initio modeling and machine learning, NPJ Comput. Mater. 2 (1) (2016) 1–10, publisher: Nature Publishing Group.
[42] A. Mannodi-Kanakkithodi, M.Y. Toriyama, F.G. Sen, M.J. Davis, R.F. Klie, M.K. Chan, Machine-learned impurity level prediction for semiconductors: the example of Cd-based chalcogenides, NPJ Comput. Mater. 6 (1) (2020) 1–14, publisher: Nature Publishing Group.
[43] V. Sharma, P. Kumar, P. Dev, G. Pilania, Machine learning substitutional defect formation energies in ABO3 perovskites, J. Appl. Phys. 128 (3) (2020) 034902, publisher: AIP Publishing LLC.
[44] R. Batra, G. Pilania, B.P. Uberuaga, R. Ramprasad, Multifidelity information fusion with machine learning: a case study of dopant formation energies in hafnia, ACS Appl. Mater. Interfaces 11 (28) (2019) 24906–24918, publisher: ACS Publications.
[45] Y. Zhuo, A. Mansouri Tehrani, J. Brgoch, Predicting the band gaps of inorganic solids by machine learning, J. Phys. Chem. Lett. 9 (7) (2018) 1668–1673, publisher: ACS Publications.
[46] A. Mishra, S. Satsangi, A.C. Rajan, H. Mizuseki, K.-R. Lee, A.K. Singh, Accelerated data-driven accurate positioning of the band edges of MXenes, J. Phys. Chem. Lett. 10 (4) (2019) 780–785, publisher: ACS Publications.
[47] A.C. Rajan, A. Mishra, S. Satsangi, R. Vaish, H. Mizuseki, K.-R. Lee, A.K. Singh, Machine-learning-assisted accurate band gap predictions of functionalized MXene, Chem. Mater. 30 (12) (2018) 4031–4038, publisher: ACS Publications.
[48] G. Pilania, A. Mannodi-Kanakkithodi, B.P. Uberuaga, R. Ramprasad, J.E. Gubernatis, T. Lookman, Machine learning bandgaps of double perovskites, Sci. Rep. 6 (2016) 19375, publisher: Nature Publishing Group.
[49] G. Pilania, J.E. Gubernatis, T. Lookman, Multi-fidelity machine learning models for accurate bandgap predictions of solids, Comput. Mater. Sci. 129 (2017) 156–163, publisher: Elsevier.
[50] G. Pilania, C.N. Iverson, T. Lookman, B.L. Marrone, Machine-learning-based predictive modeling of glass transition temperatures: a case of polyhydroxyalkanoate homopolymers and copolymers, J. Chem. Inf. Model. 59 (12) (2019) 5013–5025, publisher: ACS Publications.
[51] A. Seko, T. Maekawa, K. Tsuda, I. Tanaka, Machine learning with systematic density-functional theory calculations: application to melting temperatures of single- and binary-component solids, Phys. Rev. B 89 (5) (2014) 054303. URL: https://journals.aps.org/prb/abstract/10.1103/PhysRevB.89.054303.
[52] M. De Jong, W. Chen, R. Notestine, K. Persson, G. Ceder, A. Jain, M. Asta, A. Gamst, A statistical learning framework for materials science: application to elastic moduli of k-nary inorganic polycrystalline compounds, Sci. Rep. 6 (2016) 34256, publisher: Nature Publishing Group.
[53] S. Aryal, R. Sakidja, M.W. Barsoum, W.-Y. Ching, A genomic approach to the stability, elastic, and electronic properties of the MAX phases, Phys. Status Solidi (b) 251 (8) (2014) 1480–1497, publisher: Wiley Online Library.
[54] S. Chatterjee, M. Murugananth, H. Bhadeshia, δ TRIP steel, Mater. Sci. Technol. 23 (7) (2007) 819–827, publisher: Taylor & Francis.
[55] A. Seko, A. Togo, H. Hayashi, K. Tsuda, L. Chaput, I. Tanaka, Prediction of low-thermal-conductivity compounds with first-principles anharmonic lattice-dynamics calculations and Bayesian optimization, Phys. Rev. Lett. 115 (20) (2015) 205901. URL: https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.115.205901.
[56] C. Kim, G. Pilania, R. Ramprasad, From organized high-throughput data to phenomenological theory using machine learning: the example of dielectric breakdown, Chem. Mater. 28 (5) (2016) 1304–1311, publisher: ACS Publications.
[57] C. Kim, G. Pilania, R. Ramprasad, Machine learning assisted predictions of intrinsic dielectric breakdown strength of ABX3 perovskites, J. Phys. Chem. C 120 (27) (2016) 14575–14580, publisher: ACS Publications.
[58] S. Venkatram, R. Batra, L. Chen, C. Kim, M. Shelton, R. Ramprasad, Predicting crystallization tendency of polymers using multifidelity information fusion and machine learning, J. Phys. Chem. B 124 (28) (2020) 6046–6054, publisher: ACS Publications.
[59] M. Andersen, S.V. Levchenko, M. Scheffler, K. Reuter, Beyond scaling relations for the description of catalytic materials, ACS Catal. 9 (4) (2019) 2752–2759, publisher: ACS Publications.
[60] B. Weng, Z. Song, R. Zhu, Q. Yan, Q. Sun, C.G. Grice, Y. Yan, W.-J. Yin, Simple descriptor derived from symbolic regression accelerating the discovery of new perovskite catalysts, Nat. Commun. 11 (1) (2020) 1–8, publisher: Nature Publishing Group.
[61] G. Pilania, K.R. Whittle, C. Jiang, R.W. Grimes, C.R. Stanek, K.E. Sickafus, B.P. Uberuaga, Using machine learning to identify factors that govern


amorphization of irradiated pyrochlores, Chem. Mater. 29 (6) (2017) 2574–2583. URL: http://pubs.acs.org/doi/abs/10.1021/acs.chemmater.6b04666.
[62] V. Sharma, C. Wang, R.G. Lorenzini, R. Ma, Q. Zhu, D.W. Sinkovits, G. Pilania, A.R. Oganov, S. Kumar, G.A. Sotzing, Rational design of all organic polymer dielectrics, Nat. Commun. 5 (1) (2014) 1–8, publisher: Nature Publishing Group.
[63] A. Mannodi-Kanakkithodi, G. Pilania, T.D. Huan, T. Lookman, R. Ramprasad, Machine learning strategy for accelerated design of polymer dielectrics, Sci. Rep. 6 (2016) 20952, publisher: Nature Publishing Group.
[64] G. Pilania, C. Wang, X. Jiang, S. Rajasekaran, R. Ramprasad, Accelerating materials property predictions using machine learning, Sci. Rep. 3 (1) (2013) 1–6, publisher: Nature Publishing Group.
[65] A. Zunger, Inverse design in search of materials with target functionalities, Nat. Rev. Chem. 2 (4) (2018) 1–16, publisher: Nature Publishing Group.
[66] W.B. Powell, The knowledge gradient for optimal learning, Wiley Encyclopedia of Operations Research and Management Science (2010), publisher: Wiley Online Library.
[67] W.B. Powell, I.O. Ryzhov, Optimal Learning, vol. 841, John Wiley & Sons, 2012.
[68] I.O. Ryzhov, W.B. Powell, P.I. Frazier, The knowledge gradient algorithm for a general class of online learning problems, Oper. Res. 60 (1) (2012) 180–195, publisher: INFORMS.
[69] T. Lookman, P.V. Balachandran, D. Xue, R. Yuan, Active learning in materials science with emphasis on adaptive sampling using uncertainties for targeted design, NPJ Comput. Mater. 5 (1) (2019) 1–17, publisher: Nature Publishing Group.
[70] D. Xue, D. Xue, R. Yuan, Y. Zhou, P.V. Balachandran, X. Ding, J. Sun, T. Lookman, An informatics approach to transformation temperatures of NiTi-based shape memory alloys, Acta Mater. 125 (2017) 532–541, publisher: Elsevier.
[71] D. Xue, P.V. Balachandran, J. Hogden, J. Theiler, D. Xue, T. Lookman, Accelerated search for materials with targeted properties by adaptive design, Nat. Commun. 7 (1) (2016) 1–9, publisher: Nature Publishing Group.
[72] D. Xue, P.V. Balachandran, R. Yuan, T. Hu, X. Qian, E.R. Dougherty, T. Lookman, Accelerated search for BaTiO3-based piezoelectrics with vertical morphotropic phase boundary using Bayesian learning, Proc. Nat. Acad. Sci. 113 (47) (2016) 13301–13306, publisher: National Acad Sciences.
[73] B. Rouet-Leduc, K. Barros, T. Lookman, C.J. Humphreys, Optimisation of GaN LEDs and the reduction of efficiency droop using active machine learning, Sci. Rep. 6 (2016) 24862, publisher: Nature Publishing Group.
[74] C. Kim, A. Chandrasekaran, A. Jha, R. Ramprasad, Active-learning and materials design: the example of high glass transition temperature polymers, MRS Commun. 9 (3) (2019) 860–866, publisher: Cambridge University Press.
[75] R. Gómez-Bombarelli, J.N. Wei, D. Duvenaud, J.M. Hernández-Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, T.D. Hirzel, R.P. Adams, A. Aspuru-Guzik, Automatic chemical design using a data-driven continuous representation of molecules, ACS Central Sci. 4 (2) (2018) 268–276, publisher: ACS Publications.
[76] J. Noh, J. Kim, H.S. Stein, B. Sanchez-Lengeling, J.M. Gregoire, A. Aspuru-Guzik, Y. Jung, Inverse design of solid-state materials via a continuous representation, Matter 1 (5) (2019) 1370–1384, publisher: Elsevier.
[77] Q. Vanhaelen, Y.-C. Lin, A. Zhavoronkov, The advent of generative chemistry, ACS Med. Chem. Lett. 11 (8) (2020) 1496–1505, publisher: ACS Publications.
[78] D.P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv preprint arXiv:1312.6114 (2013).
[79] C. Doersch, Tutorial on variational autoencoders, arXiv preprint arXiv:1606.05908 (2016).
[80] D.J. Rezende, S. Mohamed, D. Wierstra, Stochastic backpropagation and approximate inference in deep generative models, arXiv preprint arXiv:1401.4082 (2014).
[81] R. Batra, H. Dai, T.D. Huan, L. Chen, C. Kim, W.R. Gutekunst, L. Song, R. Ramprasad, Polymers for extreme conditions designed using syntax-directed variational autoencoders, Chem. Mater. (2020), publisher: ACS Publications.
[82] E. Putin, A. Asadulaev, Y. Ivanenkov, V. Aladinskiy, B. Sanchez-Lengeling, A. Aspuru-Guzik, A. Zhavoronkov, Reinforced adversarial neural computer for de novo molecular design, J. Chem. Inf. Model. 58 (6) (2018) 1194–1204, publisher: ACS Publications.
[83] B. Sanchez-Lengeling, C. Outeiral, G.L. Guimaraes, A. Aspuru-Guzik, Optimizing distributions over molecular space. An objective-reinforced generative adversarial network for inverse-design chemistry (ORGANIC), publisher: ChemRxiv (2017).
[84] G.L. Guimaraes, B. Sanchez-Lengeling, C. Outeiral, P.L.C. Farias, A. Aspuru-Guzik, Objective-reinforced generative adversarial networks (ORGAN) for sequence generation models, arXiv preprint arXiv:1705.10843 (2017).
[85] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[86] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for training GANs, in: Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
[87] N. De Cao, T. Kipf, MolGAN: an implicit generative model for small molecular graphs, arXiv preprint arXiv:1805.11973 (2018).
[88] A. Kadurin, S. Nikolenko, K. Khrabrov, A. Aliper, A. Zhavoronkov, druGAN: an advanced generative adversarial autoencoder model for de novo generation of new molecules with desired molecular properties in silico, Mol. Pharmaceut. 14 (9) (2017) 3098–3104, publisher: ACS Publications.
[89] T. Blaschke, M. Olivecrona, O. Engkvist, J. Bajorath, H. Chen, Application of generative autoencoder in de novo molecular design, Mol. Inf. 37 (1) (2018) 1700123, publisher: Wiley Online Library.
[90] L. Yu, W. Zhang, J. Wang, Y. Yu, SeqGAN: sequence generative adversarial nets with policy gradient, in: Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[91] J. Lim, S. Ryu, J.W. Kim, W.Y. Kim, Molecular generative model based on conditional variational autoencoder for de novo molecular design, J. Cheminf. 10 (1) (2018) 1–9, publisher: BioMed Central.
[92] J. Hoffmann, L. Maestrati, Y. Sawada, J. Tang, J.M. Sellier, Y. Bengio, Data-driven approach to encoding and decoding 3-d crystal structures, arXiv preprint arXiv:1909.00949 (2019).
[93] B. Kim, S. Lee, J. Kim, Inverse design of porous materials using artificial neural networks, Sci. Adv. 6 (1) (2020) eaax9324, publisher: American Association for the Advancement of Science.
[94] A.H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, O. Beijbom, PointPillars: fast encoders for object detection from point clouds, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12697–12705.
[95] J. Li, B.M. Chen, G. Hee Lee, SO-Net: self-organizing network for point cloud analysis, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9397–9406.
[96] W. Wu, Z. Qi, L. Fuxin, PointConv: deep convolutional networks on 3d point clouds, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9621–9630.
[97] S. Kim, J. Noh, G.H. Gu, A. Aspuru-Guzik, Y. Jung, Generative adversarial networks for crystal structure prediction, arXiv preprint arXiv:2004.01396 (2020).
[98] B.P. MacLeod, F.G. Parlane, T.D. Morrissey, F. Häse, L.M. Roch, K.E. Dettelbach, R. Moreira, L.P. Yunker, M.B. Rooney, J.R. Deeth, Self-driving laboratory for accelerated discovery of thin-film materials, Sci. Adv. 6 (20) (2020) eaaz8867, publisher: American Association for the Advancement of Science.
[99] D.P. Tabor, L.M. Roch, S.K. Saikin, C. Kreisbeck, D. Sheberla, J.H. Montoya, S. Dwaraknath, M. Aykol, C. Ortiz, H. Tribukait, Accelerating the discovery of materials for clean energy in the era of smart automation, Nat. Rev. Mater. 3 (5) (2018) 5–20, publisher: Nature Publishing Group.
[100] P. Nikolaev, D. Hooper, F. Webber, R. Rao, K. Decker, M. Krein, J. Poleski, R. Barto, B. Maruyama, Autonomy in materials research: a case study in carbon nanotube growth, NPJ Comput. Mater. 2 (1) (2016) 1–6, publisher: Nature Publishing Group.
[101] P.B. Wigley, P.J. Everitt, A. van den Hengel, J.W. Bastian, M.A. Sooriyabandara, G.D. McDonald, K.S. Hardman, C.D. Quinlivan, P. Manju, C.C. Kuhn, Fast machine-learning online optimization of ultra-cold-atom experiments, Sci. Rep. 6 (1) (2016) 1–6, publisher: Nature Publishing Group.
[102] J.M. Granda, L. Donina, V. Dragone, D.-L. Long, L. Cronin, Controlling an organic synthesis robot with machine learning to search for new reactivity, Nature 559 (7714) (2018) 377–381, publisher: Nature Publishing Group.
[103] V. Dragone, V. Sans, A.B. Henson, J.M. Granda, L. Cronin, An autonomous organic reaction search engine for chemical reactivity, Nat. Commun. 8 (1) (2017) 1–8, publisher: Nature Publishing Group.
[104] V. Duros, J. Grizou, W. Xuan, Z. Hosni, D.-L. Long, H.N. Miras, L. Cronin, Human versus robots in the discovery and crystallization of gigantic polyoxometalates, Angew. Chem. Int. Ed. 56 (36) (2017) 10815–10820, publisher: Wiley Online Library.
[105] S. Masubuchi, M. Morimoto, S. Morikawa, M. Onodera, Y. Asakawa, K. Watanabe, T. Taniguchi, T. Machida, Autonomous robotic searching and assembly of two-dimensional crystals to build van der Waals superlattices, Nat. Commun. 9 (1) (2018) 1–12, publisher: Nature Publishing Group.
[106] R.W. Epps, M.S. Bowen, A.A. Volk, K. Abdel-Latif, S. Han, K.G. Reyes, A. Amassian, M. Abolhasani, Artificial chemist: an autonomous quantum dot synthesis bot, Adv. Mater. (2020) 2001626, publisher: Wiley Online Library.
[107] Z. Li, M.A. Najeeb, L. Alves, A.Z. Sherman, V. Shekar, P. Cruz Parrilla, I.M. Pendleton, W. Wang, P.W. Nega, M. Zeller, Robot-accelerated perovskite investigation and discovery, Chem. Mater. (2020), publisher: ACS Publications.
[108] L.M. Roch, F. Häse, C. Kreisbeck, T. Tamayo-Mendoza, L.P. Yunker, J.E. Hein, A. Aspuru-Guzik, ChemOS: an orchestration software to democratize autonomous discovery, PLoS One 15 (4) (2020) e0229862, publisher: Public Library of Science.
[109] A.F. Voter, F. Montalenti, T.C. Germann, Extending the time scale in atomistic simulation of materials, Annu. Rev. Mater. Res. 32 (1) (2002) 321–346, publisher: Annual Reviews.
[110] T. Frauenheim, G. Seifert, M. Elstner, T. Niehaus, C. Köhler, M. Amkreutz, M. Sternberg, Z. Hajnal, A. Di Carlo, S. Suhai, Atomistic simulations of complex materials: ground-state and excited-state properties, J. Phys. Condens. Matter 14 (11) (2002) 3015, publisher: IOP Publishing.
[111] V. Brázdová, D.R. Bowler, Atomistic Computer Simulations: A Practical Guide, John Wiley & Sons, 2013.
[112] D.C. Rapaport, The Art of Molecular Dynamics Simulation, Cambridge University Press, 2004.
[113] D. Marx, J. Hutter, Ab Initio Molecular Dynamics: Basic Theory and Advanced Methods, Cambridge University Press, 2009.
[114] F.H. Stillinger, T.A. Weber, Computer simulation of local order in condensed phases of silicon, Phys. Rev. B 31 (8) (1985) 5262, publisher: APS.
[115] M.S. Daw, M.I. Baskes, Embedded-atom method: derivation and application to impurities, surfaces, and other defects in metals, Phys. Rev. B 29 (12) (1984) 6443, publisher: APS.
[116] E.B. Tadmor, R.E. Miller, Modeling Materials: Continuum, Atomistic and Multiscale Techniques, Cambridge University Press, 2011.


[117] J. Behler, M. Parrinello, Generalized neural-network representation of high-dimensional potential-energy surfaces, Phys. Rev. Lett. 98 (14) (2007) 146401, publisher: APS.
[118] J. Behler, R. Martonák, D. Donadio, M. Parrinello, Metadynamics simulations of the high-pressure phases of silicon employing a high-dimensional neural network potential, Phys. Rev. Lett. 100 (18) (2008) 185501, publisher: APS.
[119] J. Behler, Representing potential energy surfaces by high-dimensional neural network potentials, J. Phys. Condens. Matter 26 (18) (2014) 183001, publisher: IOP Publishing.
[120] M. Rupp, A. Tkatchenko, K.-R. Müller, O.A. Von Lilienfeld, Fast and accurate modeling of molecular atomization energies with machine learning, Phys. Rev. Lett. 108 (5) (2012) 058301, publisher: APS.
[121] M. Rupp, Machine learning for quantum mechanics in a nutshell, Int. J. Quantum Chem. 115 (16) (2015) 1058–1073, publisher: Wiley Online Library.
[122] A.P. Bartók, M.C. Payne, R. Kondor, G. Csányi, Gaussian approximation potentials: the accuracy of quantum mechanics, without the electrons, Phys. Rev. Lett. 104 (13) (2010) 136403, publisher: APS.
[123] A.P. Bartók, R. Kondor, G. Csányi, On representing chemical environments, Phys. Rev. B 87 (18) (2013) 184115, publisher: APS.
[124] W.J. Szlachta, A.P. Bartók, G. Csányi, Accuracy and transferability of Gaussian approximation potential models for tungsten, Phys. Rev. B 90 (10) (2014) 104108, publisher: APS.
[125] A.P. Bartók, G. Csányi, Gaussian approximation potentials: a brief tutorial introduction, Int. J. Quantum Chem. 115 (16) (2015) 1051–1057, publisher: Wiley Online Library.
[126] V. Botu, R. Batra, J. Chapman, R. Ramprasad, Machine learning force fields: construction, validation, and outlook, J. Phys. Chem. C 121 (1) (2017) 511–522, publisher: ACS Publications.
[127] V. Botu, J. Chapman, R. Ramprasad, A study of adatom ripening on an Al (1 1 1) surface with machine learning force fields, Comput. Mater. Sci. 129 (2017) 332–335, publisher: Elsevier.
[128] T.D. Huan, R. Batra, J. Chapman, S. Krishnan, L. Chen, R. Ramprasad, A universal strategy for the creation of machine learning-based atomistic force fields, NPJ Comput. Mater. 3 (1) (2017) 1–8, publisher: Nature Publishing Group.
[129] A.V. Shapeev, Moment tensor potentials: a class of systematically improvable interatomic potentials, Multiscale Model. Simul. 14 (3) (2016) 1153–1173, publisher: SIAM.
[130] S. Jindal, S. Chiriki, S.S. Bulusu, Spherical harmonics based descriptor for neural network potentials: structure and dynamics of Au147 nanocluster, J. Chem. Phys. 146 (20) (2017) 204301, publisher: AIP Publishing LLC.
[131] A.P. Thompson, L.P. Swiler, C.R. Trott, S.M. Foiles, G.J. Tucker, Spectral neighbor analysis method for automated generation of quantum-accurate interatomic potentials, J. Comput. Phys. 285 (2015) 316–330, publisher: Elsevier.
[132] J.S. Smith, O. Isayev, A.E. Roitberg, ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost, Chem. Sci. 8 (4) (2017) 3192–3203, publisher: Royal Society of Chemistry.
[133] J.S. Smith, B.T. Nebgen, R. Zubatyuk, N. Lubbers, C. Devereux, K. Barros, S. Tretiak, O. Isayev, A.E. Roitberg, Approaching coupled cluster accuracy with a general-purpose neural network potential through transfer learning, Nat. Commun. 10 (1) (2019) 1–8, publisher: Nature Publishing Group.
[134] E.V. Podryabinkin, A.V. Shapeev, Active learning of linearly parametrized interatomic potentials, Comput. Mater. Sci. 140 (2017) 171–180, publisher: Elsevier.
[135] J.S. Smith, B. Nebgen, N. Lubbers, O. Isayev, A.E. Roitberg, Less is more: sampling chemical space with active learning, J. Chem. Phys. 148 (24) (2018) 241733, publisher: AIP Publishing LLC.
[136] J.C. Snyder, M. Rupp, K. Hansen, K.-R. Müller, K. Burke, Finding density functionals with machine learning, Phys. Rev. Lett. 108 (25) (2012) 253002, publisher: APS.
[137] K. Yao, J. Parkhill, Kinetic energy of hydrocarbons as a function of electron density and convolutional neural networks, J. Chem. Theory Comput. 12 (3) (2016) 1139–1147, publisher: ACS Publications.
[147] S.V. Kalinin, B.G. Sumpter, R.K. Archibald, Big-deep-smart data in imaging for guiding materials design, Nat. Mater. 14 (10) (2015) 973–980, publisher: Nature Publishing Group.
[148] J.L. Lansford, D.G. Vlachos, Infrared spectroscopy data- and physics-driven machine learning for characterizing surface microstructure of complex materials, Nat. Commun. 11 (1) (2020) 1–12, publisher: Nature Publishing Group.
[149] M.J. Cherukara, Y.S. Nashed, R.J. Harder, Real-time coherent diffraction inversion using deep generative networks, Sci. Rep. 8 (1) (2018) 1–8, publisher: Nature Publishing Group.
[150] Y.-F. Shen, R. Pokharel, T.J. Nizolek, A. Kumar, T. Lookman, Convolutional neural network-based method for real-time orientation indexing of measured electron backscatter diffraction patterns, Acta Mater. 170 (2019) 118–131, publisher: Elsevier.
[151] R.A. Schwarzer, D.P. Field, B.L. Adams, M. Kumar, A.J. Schwartz, Present state of electron backscatter diffraction and prospective developments, in: Electron Backscatter Diffraction in Materials Science, Springer, 2009, pp. 1–20.
[152] S.I. Wright, M.M. Nowell, A review of in situ EBSD studies, in: Electron Backscatter Diffraction in Materials Science, Springer, 2009, pp. 329–337.
[153] T.B. Britton, J. Jiang, Y. Guo, A. Vilalta-Clemente, D. Wallis, L.N. Hansen, A. Winkelmann, A.J. Wilkinson, Tutorial: crystal orientations and EBSD, or which way is up?, Mater. Charact. 117 (2016) 113–126, publisher: Elsevier.
[154] R. Liu, A. Agrawal, W.-k. Liao, A. Choudhary, M. De Graef, Materials discovery: understanding polycrystals from large-scale electron patterns, in: 2016 IEEE International Conference on Big Data (Big Data), IEEE, 2016, pp. 2261–2269.
[155] D. Jha, S. Singh, R. Al-Bahrani, W.-K. Liao, A. Choudhary, M. De Graef, A. Agrawal, Extracting grain orientations from EBSD patterns of polycrystalline materials using convolutional neural networks, Microscopy Microanal. 24 (5) (2018) 497–502, publisher: Cambridge University Press.
[156] M.R. Carbone, S. Yoo, M. Topsakal, D. Lu, Classification of local chemical environments from X-ray absorption spectra using supervised machine learning, Phys. Rev. Mater. 3 (3) (2019) 033604, publisher: APS.
[157] X. Lin, Z. Si, W. Fu, J. Yang, S. Guo, Y. Cao, J. Zhang, X. Wang, P. Liu, K. Jiang, Intelligent identification of two-dimensional nanostructures by machine-learning optical microscopy, Nano Res. 11 (12) (2018) 6316–6324, publisher: Springer.
[158] C.C. Mody, Instrumental Community: Probe Microscopy and the Path to Nanotechnology, MIT Press, 2011.
[159] A. Cui, K. Jiang, M. Jiang, L. Shang, L. Zhu, Z. Hu, G. Xu, J. Chu, Decoding phases of matter by machine-learning Raman spectroscopy, Phys. Rev. Appl. 12 (5) (2019) 054049, publisher: APS.
[160] A. Fakhry, T. Zeng, S. Ji, Residual deconvolutional networks for brain electron microscopy image segmentation, IEEE Trans. Med. Imag. 36 (2) (2016) 447–456, publisher: IEEE.
[161] T.M. Quan, D.G. Hildebrand, W.-K. Jeong, FusionNet: a deep fully residual convolutional neural network for image segmentation in connectomics, arXiv preprint arXiv:1612.05360 (2016).
[162] B. Zhu, J.Z. Liu, S.F. Cauley, B.R. Rosen, M.S. Rosen, Image reconstruction by domain-transform manifold learning, Nature 555 (7697) (2018) 487–492, publisher: Nature Publishing Group.
[163] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
[164] T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[165] J. Pennington, R. Socher, C.D. Manning, GloVe: global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[166] V. Tshitoyan, J. Dagdelen, L. Weston, A. Dunn, Z. Rong, O. Kononova, K.A. Persson, G. Ceder, A. Jain, Unsupervised word embeddings capture latent knowledge from materials science literature, Nature 571 (7763) (2019) 95–98, publisher: Nature Publishing Group.
[167] E. Kim, K. Huang, A. Saunders, A. McCallum, G. Ceder, E. Olivetti, Materials synthesis insights from scientific literature via text extraction and machine learning, Chem. Mater. 29 (21) (2017) 9436–9444, publisher: ACS Publications.
[138] F. Brockherde, L. Vogt, L. Li, M.E. Tuckerman, K. Burke, K.-R. Müller, Bypassing learning, Chem. Mater. 29 (21) (2017) 9436–9444, publisher: ACS Publications.
the kohn-sham equations with machine learning, Nat. Commun. 8 (1) (2017) [168] R. Roscher, B. Bohn, M.F. Duarte, J. Garcke, Explainable machine learning for
1–10, publisher: Nature Publishing Group. scientific insights and discoveries, IEEE Access 8 (2020) 42200–42216, publisher:
[139] R. Nagai, R. Akashi, O. Sugino, Completing density functional theory by machine IEEE.
learning hidden messages from molecules, NPJ Comput. Mater. 6 (1) (2020) 1–8, [169] R. Ouyang, S. Curtarolo, E. Ahmetcik, M. Scheffler, L.M. Ghiringhelli, SISSO: a
publisher: Nature Publishing Group. compressed-sensing method for identifying the best low-dimensional descriptor in
[140] M. Bogojeski, L. Vogt-Maranto, M.E. Tuckerman, K.-R. Müller, K. Burke, Quantum an immensity of offered candidates, Phys. Rev. Mater. als 2 (8) (2018) 083802,
chemical accuracy from density functional approximations via machine learning, publisher: APS.
Nat. Commun. 11 (1) (2020) 1–11, publisher: Nature Publishing Group. [170] B. Kailkhura, B. Gallagher, S. Kim, A. Hiszpanski, T.Y.-J. Han, Reliable and
[141] A. Chandrasekaran, D. Kamal, R. Batra, C. Kim, L. Chen, R. Ramprasad, Solving explainable machine-learning methods for accelerated material discovery, NPJ
the electronic structure problem with machine learning, NPJ Comput. Mater. 5 Comput. Mater. 5 (1) (2019) 1–9, publisher: Nature Publishing Group.
(1) (2019) 1–7, publisher: Nature Publishing Group. [171] C. Sutton, M. Boley, L.M. Ghiringhelli, M. Rupp, J. Vreeken, M. Scheffler,
[142] A.V. Crewe, Scanning electron microscopes: is high resolution possible?, Science Identifying domains of applicability of machine learning models for materials
154 (3750) (1966) 729–738, publisher: American Association for the science, Nat. Commun. 11 (1) (2020) 1–9, publisher: Nature Publishing Group.
Advancement of Science. [172] M.T. Ribeiro, S. Singh, C. Guestrin, Why should i trust you? Explaining the
[143] S.J. Pennycook, P.D. Nellist, Scanning Transmission Electron Microscopy: predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD
Imaging and Analysis, Springer Science & Business Media, 2011. International Conference on Knowledge Discovery and Data Mining, 2016,
[144] G. Binnig, H. Rohrer, C. Gerber, E. Weibel, 7× 7 reconstruction on si (111) pp. 1135–1144.
resolved in real space, Phys. Rev. Lett. 50 (2) (1983) 120, publisher: APS. [173] M. Haghighatlari, C.-Y. Shih, J. Hachmann, Thinking globally, acting locally: on
[145] G. Binnig, H. Rohrer, C. Gerber, E. Weibel, Surface studies by scanning tunneling the issue of training set imbalance and the case for local machine learning models
microscopy, Phys. Rev. Lett. 49 (1) (1982) 57, publisher: APS. in chemistry, preprint at ChemRxiv: https://doi. org/10.26434/chemrxiv
[146] C. Gerber, H.P. Lang, How the doors to the nanoworld were opened, Nat. 8796947 (2019) v2.
Nanotechnol. 1 (1) (2006) 3–5, publisher: Nature Publishing Group. [174] R. Ouyang, E. Ahmetcik, C. Carbogno, M. Scheffler, L.M. Ghiringhelli,
Simultaneous learning of several materials properties from incomplete databases

12
G. Pilania Computational Materials Science 193 (2021) 110360

with multi-task SISSO, J. Phys. Mater. 2 (2) (2019) 024002, publisher: IOP [189] A.I. Forrester, A. Sóbester, A.J. Keane, Multi-fidelity optimization via surrogate
Publishing. modelling, Proc. Roy. Soc. A: Math., Phys. Eng. Sci. 463 (2088) (2007)
[175] S.R. Xie, G.R. Stewart, J.J. Hamlin, P.J. Hirschfeld, R.G. Hennig, Functional form 3251–3269, publisher: The Royal Society London.
of the superconducting critical temperature from machine learning, Phys. Rev. B [190] Y. Zhang, C. Ling, A strategy to apply machine learning to small datasets in
100 (17) (2019) 174513, publisher: APS. materials science, NPJ Comput. Mater. 4 (1) (2018) 1–8, publisher: Nature
[176] G. Cao, R. Ouyang, L.M. Ghiringhelli, M. Scheffler, H. Liu, C. Carbogno, Z. Zhang, Publishing Group.
Artificial intelligence for high-throughput discovery of topological insulators: the [191] G. Pilania, K.J. McClellan, C.R. Stanek, B.P. Uberuaga, Physics-informed machine
example of alloyed tetradymites, Phys. Rev. Mater. 4 (3) (2020) 034204, learning for inorganic scintillator discovery, J. Chem. Phys. 148 (24) (2018)
publisher: APS. 241729, publisher: AIP Publishing LLC.
[177] S.-M. Udrescu, M. Tegmark, AI feynman: A physics-inspired method for symbolic [192] L.M. Ghiringhelli, C. Carbogno, S. Levchenko, F. Mohamed, G. Huhs, M. Lüders,
regression, Sci. Adv. 6 (16) (2020) eaay2631, publisher: American Association for M. Oliveira, M. Scheffler, Towards efficient data exchange and sharing for big-
the Advancement of Science. data driven materials science: metadata and data formats, NPJ Comput. Mater. 3
[178] R.P. Feynman, R.B. Leighton, M. Sands, The Feynman lectures on physics, Vol. I: (1) (2017) 1–9, publisher: Nature Publishing Group.
The New Millennium Edition: Mainly Mechanics, Radiation, and Heat, vol. 1, [193] C. Draxl, M. Scheffler, Big data-driven materials science and its FAIR data
Basic books, 2011. infrastructure, Handbook Mater. Model.: Methods Theory Model. (2020) 49–73,
[179] R.P. Feynman, R.B. Leighton, M.L. Sands, The Feynman Lectures on Physics, vol. Publisher: Springer.
2, Addison-Wesley, Redwood City, CA, CA, 1989. [194] C. Draxl, M. Scheffler, NOMAD: The FAIR concept for big data-driven materials
[180] R.P. Feynman, R.B. Leighton, M. Sands, The Feynman Lectures on Physics, science, Mrs. Bull. 43 (9) (2018) 676–682, publisher: Cambridge University Press.
Volume III: Quantum Mechanics, vol. 3, Basic Books, 2010. [195] R. Chard, Z. Li, K. Chard, L. Ward, Y. Babuji, A. Woodard, S. Tuecke, B. Blaiszik,
[181] O. Muller, R. Roy, The Major Ternary Structural Families, Springer Verlag, Berlin, M. Franklin, I. Foster, DLHub: model and data serving for science, in: 2019 IEEE
Heidelberg, New York, 1974. International Parallel and Distributed Processing Symposium (IPDPS), IEEE,
[182] R.D. Shannon, Revised effective ionic radii and systematic studies of interatomic 2019, pp. 283–292.
distances in halides and chalcogenides, Acta Crystallogr. Section A Cryst. Phys.
Diffract., Theor. Gen. Crystallogr. 32 (5) (1976) 751–767, publisher: International
Union of Crystallography.
Ghanshyam Pilania is a senior scientist in the Materials Science and Technology Division at the Los Alamos National Laboratory (LANL). He received a B. Tech in Metallurgical and Materials Engineering from the Indian Institute of Technology Roorkee, India in 2007, followed by a Ph.D. in Materials Science and Engineering from the University of Connecticut, Storrs, USA in 2012. His four-year postdoctoral work was supported by a LANL Director's Postdoctoral Fellowship award and an Alexander von Humboldt postdoctoral fellowship at the Fritz Haber Institute of the Max Planck Society. At LANL, his current research interests broadly include developing and applying high-throughput electronic structure and atomistic methods to understand and design functional materials, with a particular focus on targeted materials design and discovery using materials informatics and machine learning based techniques.