Lecture 1 Foundations of Materials Informatics

Materials Informatics applies data science to materials science, leveraging machine learning to uncover complex trends and facilitate breakthroughs in material development. The field, still relatively new, emphasizes the importance of data-driven approaches to overcome traditional research limitations and improve material discovery. Various machine learning techniques are utilized to analyze small, heterogeneous datasets, enabling the identification of new materials and properties through innovative computational methods.


Materials Informatics

Taylor D. Sparks
University of Utah, Materials Science and Engineering
Many people and agencies make this work possible!
My full materials informatics course is available on YouTube/Github!
Materials Informatics is data science applied to materials science

Structure

Characterization
Performance
Property

Processing
Machine learning is capable of extracting highly complex trends from data
Machine learning has already made huge impacts on science and engineering!

Al 7A77
New alloy!
Materials Informatics is only a few decades old!
In the early days of materials informatics, nobody knew what they were doing
We initially only used “big data” to write analytical reviews

Sparks et al. (2016) Script. Mat.


Gaultois et al. (2013) Chem Mat.
Ghadbeigi et al. (2016) Energy Environ. Sci.
Gaultois et al. (2016) APL Mat.
Why do we need data-driven materials science?
Most materials research is incremental; breakthroughs are the exception!
Useful information can be mined from large datasets without domain knowledge!
Are there materials “genes” responsible for their properties and performance?

• Equip the next-generation workforce
• Enable a paradigm shift in materials development
• Integrate experiments, computation, and theory
• Facilitate access to materials data
New tools of discovery are needed to explore chemical whitespace
Is materials informatics a passing fad or here to stay?
Machine learning is at the top of the Hype Cycle
Some discoveries are “1% inspiration, 99% perspiration”
Some discoveries are being in the right place at the right time
Some inspiration can be taken from nature
Some discoveries require a keen eye for detail
Some discoveries require a keen ear for detail
Some discoveries result from totally unsafe lab practices!
Some discoveries require mistakes
Materials Informatics differs substantially from traditional machine learning
Materials Informatics datasets are typically pretty small compared to traditional ML
Materials Informatics usually focuses on identifying unusual outliers rather than averages
Materials research suffers from heterogeneous data with many modalities
Why do we need data-driven materials discovery?
Uncertainty quantification is extremely important when each new data point is expensive

Facilities and physical lab space required $$

Materials need to be purchased $$


Why do we need data-driven materials discovery?
Uncertainty quantification is extremely important when each new data point is expensive

Samples need to be synthesized

High chance of failure

Diverse reasons for failure


Why do we need data-driven materials discovery?
Uncertainty quantification is extremely important when each new data point is expensive

Samples need to be characterized $$

Characterization requires equipment $$

Characterization requires expertise $$

Characterization takes time


Why do we need data-driven materials discovery?
Compositional design space is enormous!

Total unique inorganic compounds ≈ 10¹²

Four-component systems (AaBbCcDd), ignoring dopants (a, b, c, and d > 0.03).

https://htwins.net/scale2/
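To get a feel for that number, here is a back-of-the-envelope sketch in Python; the ~80-element pool and the 1% stoichiometry grid are assumptions for illustration, not the slide's exact counting.

from math import comb

n_elements = 80                        # assumed pool of practically usable elements
element_choices = comb(n_elements, 4)  # ways to choose the set {A, B, C, D}

# Stars-and-bars: positive integer solutions to a + b + c + d = 100,
# i.e. compositions AaBbCcDd on an assumed 1% stoichiometry grid
grid_points = comb(100 - 1, 4 - 1)

total = element_choices * grid_points
print(f"{element_choices:,} systems x {grid_points:,} compositions = {total:.1e}")
# ~2.5e11 quaternary candidates, within an order of magnitude of the slide's ~10^12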
Traditional ML has moved away from feature engineering to deep learning
Feature engineering can significantly improve predictions with <10³ data instances

Anton Oliynyk!
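At this data scale, feature engineering usually means composition-based descriptors. A minimal sketch of the idea follows; the tiny element-property table and the formula parser are illustrative stand-ins for full descriptor sets such as Magpie.

import re
import numpy as np

# Illustrative elemental property table: (atomic number, Pauling electronegativity)
ELEMENT_PROPS = {"W": (74, 2.36), "B": (5, 2.04), "C": (6, 2.55), "Re": (75, 1.9)}

def parse_formula(formula):
    """'WB4' -> {'W': 1.0, 'B': 4.0} (simple formulas only, no parentheses)."""
    counts = {}
    for el, num in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[el] = counts.get(el, 0.0) + (float(num) if num else 1.0)
    return counts

def featurize(formula):
    """Fraction-weighted mean plus max of each elemental property."""
    counts = parse_formula(formula)
    fracs = np.array(list(counts.values())) / sum(counts.values())
    props = np.array([ELEMENT_PROPS[el] for el in counts])
    return np.concatenate([fracs @ props, props.max(axis=0)])

print(featurize("WB4"))  # -> [mean Z, mean electronegativity, max Z, max EN]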
Structure-property-processing linkages depend on interpretable relationships

Structure

Characterization
Performance

Property

Processing
Machine learning is learning from patterns in data

AI: the theory and development of


computer systems able to perform
tasks that normally require human
intelligence, such as visual perception,
speech recognition, decision-making,
and translation between languages.
Machine learning is learning from patterns in data

ML: the use and development of computer


systems that are able to learn and adapt
without following explicit instructions, by
using algorithms and statistical models to
analyze and draw inferences from patterns
in data.
Machine learning is learning from patterns in data

Deep learning: part of a broader family of


machine learning methods that imitates
the workings of the human brain in
processing data and creating patterns for
use in decision making.
Materials scientists have long noticed patterns in data
Empirical relationships are correlations that may not be supported by theory
Materials scientists are using nearly all types of machine learning
There are multiple types of machine learning algorithms available

Ensemble Techniques:
• Random forest
• Gradient boosted
• Adaboost
• Extra Trees

Bayesian:
• Kriging or GP
• Gaussian RF
• Bayesian NN

Neural Networks:
• ANN
• GAN
• CNN

Support Vector Machine:
• SVR
• Linear SVR

Linear Models:
• Lasso
• Ridge
• K nearest neighbors
There are multiple types of machine learning algorithms available

Ensemble Techniques:
• Fast learners
• Efficient (parallelization)
• Non-linear
• Problem with extrapolation
• Feature weights

Bayesian:
• Works well with small data
• Includes uncertainty
• "Physics informed" as priors utilized

Neural Networks:
• Fast (GPU)
• Feature-free
• GANs
• High accuracy
• Blackest box
• Overfitting

Support Vector Machine:
• Kernel selection
• Good metrics
• Hinge loss
• Scales poorly

Linear Models:
• Interpretable
• Fast
• Not suitable for many problems (linear vs non-linear)
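A minimal sketch comparing one model from several of these families on the same regression task with scikit-learn; the synthetic data and hyperparameters are illustrative placeholders for a real featurized materials dataset.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 5))                    # stand-in feature matrix
y = X[:, 0] ** 2 + np.sin(3 * X[:, 1]) + 0.1 * rng.normal(size=200)

models = {
    "random forest":    RandomForestRegressor(n_estimators=200, random_state=0),
    "gaussian process": GaussianProcessRegressor(),
    "SVR":              SVR(kernel="rbf"),
    "ridge":            Ridge(alpha=1.0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name:>16}: R2 = {scores.mean():.2f} +/- {scores.std():.2f}")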
Case study: Superhard materials
Superhard materials have important commercial applications
Diamond is the ultimate superhard material
High hardness materials can be classified in two groups
(i) Containing only light elements like B, C, N, O, Si
Examples: c-BN, BC2N, B6O. Preparation of these materials requires extreme pressures of ≈15 GPa and temperatures >2100 °C.

Wentorf J. Chem. Phys. 1961, 34, 809.


He et al. Appl. Phys. Lett. 2002, 81, 643.

Solozhenko et al. Appl. Phys. Lett. 2001, 78, 1385.

(ii) Containing a transition metal and main-group elements (B, N, C)
Examples: ReB2, WB4. Synthesis tends to require only elevated temperatures between 1500 °C and 2000 °C.

Kaner and co-workers, Science 2007, 316, 436.
Kaner and co-workers, PNAS 2011, 108, 10958.

Example 1: Using a machine learning model to identify new materials
Task: Predict a material property using regression
Machine learning can screen for new materials!
Hardness measurement databases don't exist, so we need a proxy

Mansouri et al. (2017) Int. Mat. Man. Innov.


Bulk and shear modulus should serve as a good proxy

[Figure: hardness correlates with the combined proxy G³/B₀² (GPa), built from bulk modulus B₀ (GPa) and shear modulus G (GPa)]

Chen et al. (2011) Intermetallics
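The Chen et al. (2011) model cited above is what connects these moduli to hardness, and explains why G³/B₀² works as a proxy. A minimal sketch, using approximate literature moduli for diamond as a sanity check:

def vickers_hardness_chen(bulk_gpa, shear_gpa):
    """Chen et al. (2011): Hv = 2(k^2 G)^0.585 - 3, with Pugh's ratio k = G/B.
    Note k^2 G = G^3/B^2, the combined proxy on this slide."""
    k = shear_gpa / bulk_gpa
    return 2.0 * (k ** 2 * shear_gpa) ** 0.585 - 3.0

# Diamond: B ~ 443 GPa, G ~ 535 GPa (approximate literature values)
print(f"{vickers_hardness_chen(443, 535):.0f} GPa")  # ~95 GPa, i.e. superhard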


Bulk and shear modulus should be good proxies for hardness

https://materialsproject.org/

Mansouri et al. (2018) JACS


Feature selection is critical for predictions

Mansouri et al. (2018) JACS


Features are either composition-based or structure-based

Composition-based features vs Structure-based features


Not all features are equally important
Many algorithms and hyperparameters possible for training
A model must first be validated prior to use for predictions
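A minimal cross-validation sketch producing the same style of RMSE_CV and R²_CV metrics reported on the next slide; the data and model here are stand-ins, not the paper's pipeline.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(1)
X = rng.uniform(size=(500, 10))                               # stand-in features
y = 200 * X[:, 0] + 50 * X[:, 1] + 5 * rng.normal(size=500)   # stand-in moduli (GPa)

cv = cross_validate(
    GradientBoostingRegressor(random_state=0), X, y, cv=10,
    scoring=("neg_root_mean_squared_error", "r2"),
)
print(f"RMSE_CV = {-cv['test_neg_root_mean_squared_error'].mean():.2f} GPa")
print(f"R2_CV   = {cv['test_r2'].mean():.2f}")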
Bulk modulus predictions perform better than shear modulus

Bulk modulus: RMSE_CV = 17.21 GPa, R²_CV = 0.94
Shear modulus: RMSE_CV = 16.35 GPa, R²_CV = 0.84

Mansouri et al. (2018) JACS


Bulk and shear modulus predicted for entire Pearson database

Prediction of B and G for 15,770 binaries, 56,266 ternaries, and 46,251 quaternaries from Pearson's Crystal Database (PCD)

118,287 compounds predicted in less than 30 seconds!

Using an Intel® Core™ i5-4690K CPU @ 3.50 GHz (Windows 10 PC)

Mansouri et al. (2018) JACS
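A sketch of why the screening step is so fast: once validated, the fitted models score the whole candidate table in one vectorized pass. The feature table and freshly fitted stand-in models below are hypothetical placeholders for the paper's trained models.

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
X_train = rng.uniform(size=(500, 10))        # stand-in featurized training set
bulk_model = GradientBoostingRegressor().fit(X_train, rng.uniform(50, 400, 500))
shear_model = GradientBoostingRegressor().fit(X_train, rng.uniform(20, 300, 500))

candidates = rng.uniform(size=(118_287, 10)) # one feature row per PCD entry
predictions = pd.DataFrame({
    "B_pred_GPa": bulk_model.predict(candidates),   # a single vectorized pass
    "G_pred_GPa": shear_model.predict(candidates),
})
print(len(predictions), "compounds scored")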


Superhard materials need to be both incompressible and rigid

Mansouri et al. (2018) JACS


Known superhard materials show up in the top right quadrant

[Scatter plot of predicted G vs. B: known superhard materials BN, WB4, TaC, WC, ReB2, and OsB2 all fall in the top-right quadrant]

Mansouri et al. (2018) JACS
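The quadrant selection itself is a simple joint filter on the two predictions. A minimal sketch; the B values echo the slide's predictions, while the G values and cutoffs are invented for illustration, not the paper's thresholds.

import pandas as pd

predictions = pd.DataFrame({
    "formula":    ["ReWC", "MoWBC", "NaCl"],
    "B_pred_GPa": [370, 398, 25],
    "G_pred_GPa": [240, 260, 15],
})

B_MIN, G_MIN = 300, 200   # GPa; assumed "top-right quadrant" cutoffs
mask = (predictions["B_pred_GPa"] > B_MIN) & (predictions["G_pred_GPa"] > G_MIN)
print(predictions[mask])  # ReWC and MoWBC survive; NaCl does not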


New alternative superhard candidates identified

Re-W-C system

Mo-W-B-C system

Mansouri et al. (2018) JACS


Materials could be synthesized easily at ambient pressure

Mansouri et al. (2018) JACS


Rapid discovery of two new superhard materials at low load

Mo-W-B-C Re-W-C

Mansouri et al. (2018) JACS


High pressure synchrotron measurements confirmed bulk modulus

Mansouri et al. (2018) JACS


We can track volume as pressure is increased

Mansouri et al. (2018) JACS


Bulk modulus determined by fitting volume as function of pressure
Third-order Birch-Murnaghan Equation of State
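The third-order Birch-Murnaghan EOS expresses pressure as a function of compression: P(V) = (3B₀/2)[(V₀/V)^(7/3) - (V₀/V)^(5/3)] × {1 + (3/4)(B₀' - 4)[(V₀/V)^(2/3) - 1]}. A sketch of the fitting step with scipy follows; the data here is synthetic, whereas the slide's values came from the synchrotron measurements.

import numpy as np
from scipy.optimize import curve_fit

def birch_murnaghan_3(V, V0, B0, Bp):
    """Pressure (GPa) at volume V, given V0, B0 (GPa), and Bp = B0'."""
    eta = (V0 / V) ** (1.0 / 3.0)
    return 1.5 * B0 * (eta ** 7 - eta ** 5) * (1 + 0.75 * (Bp - 4) * (eta ** 2 - 1))

# Synthetic "measurement" of a 372 GPa material (like ReWC below)
V = np.linspace(9.0, 10.0, 15)                         # unit-cell volumes (A^3)
P = birch_murnaghan_3(V, 10.0, 372.0, 4.0)
P += np.random.default_rng(5).normal(0, 0.5, P.size)   # measurement noise

(V0, B0, Bp), _ = curve_fit(birch_murnaghan_3, V, P, p0=(10.0, 300.0, 4.0))
print(f"B0 = {B0:.0f} GPa, B0' = {Bp:.2f}")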

ReWC: ML predicted 370 GPa, experimental 372 ± 3.6 GPa
MoWBC: ML predicted 398 GPa, experimental 380 ± 8.1 GPa

Mansouri et al. (2018) JACS


There are even more types!

https://machinelearningmastery.com/types-of-learning-in-machine-learning/
Unsupervised learning includes clustering, density estimation, visualization, and projection
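A minimal unsupervised-learning sketch: k-means clustering on a stand-in feature matrix, with PCA as the projection step; the two synthetic groups are illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (50, 8)), rng.normal(4, 1, (50, 8))])  # two groups

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
X_2d = PCA(n_components=2).fit_transform(X)   # projection for visualization
print(np.bincount(labels))                    # cluster sizes, e.g. [50 50]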
Reinforcement learning differs from supervised learning

An agent operates in an environment and must learn to act using feedback


- No fixed dataset and the feedback may be delayed or noisy
Hybrid learning blurs lines between learning types

https://machinelearningmastery.com/types-of-learning-in-machine-learning/
Semi-supervised learning requires making the most of only partially labeled data

Algorithms are used to learn relationships between labeled and unlabeled data so that all of the data can be used
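A minimal semi-supervised sketch with scikit-learn's LabelSpreading: unlabeled points are marked -1 and labels propagate through the data. The two synthetic clusters are illustrative.

import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(3, 0.5, (30, 2))])
y = np.full(60, -1)       # -1 marks a point as unlabeled
y[0], y[30] = 0, 1        # only two ground-truth labels

model = LabelSpreading().fit(X, y)
print(model.transduction_[:5], model.transduction_[30:35])  # inferred labels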
Self-supervised learning is unsupervised learning framed as supervised learning problem
Supervised learning algorithms are used to solve an alternate or pretext task, the result of which is a model or representation that can be used to solve the original (actual) modeling problem.

Example pretext tasks: colorization, inpainting
Multi-instance learning is supervised learning where “bags” of samples are labeled

Members of the “Perovskite” bag all contain some shared attributes along with some non-shared attributes.

Q: which attributes are essential to the "Perovskite" bag?


Inference refers to reaching an outcome or making a decision

https://machinelearningmastery.com/types-of-learning-in-machine-learning/
Inductive vs deductive learning are opposites

Inductive learning is learning general rules from specific examples.


Deductive learning is learning specific examples from general rules.
Transductive learning is predicting specific examples from specific examples.

Inductive learning:
- Model learns the general rules
- Draw general conclusions about future from past examples
- Fitting the ML model

Deductive learning:
- Top-down reasoning requiring all premises to be met before reaching a conclusion
- Using the ML model for inference

Transductive learning:
- Better predictions with few labeled points
- No predictive model is built; each new prediction requires a full calculation again
Inference refers to reaching an outcome or making a decision

https://machinelearningmastery.com/types-of-learning-in-machine-learning/
Multi-task learning is fitting a model to one dataset with multiple related problems

Training models together is more than just efficient; it should improve overall performance!
- Useful when a dataset has abundant labeled data for one task but much less for another
- This lets us "borrow statistical strength" from tasks with lots of data and share it with tasks with little data
- Improves model generalizability (see the sketch below)
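A minimal multi-task sketch: MultiTaskLasso fits several related targets (e.g., two related material properties) with a shared sparsity pattern, borrowing strength across tasks; the synthetic data is illustrative.

import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 20))
W = np.zeros((20, 2))
W[:5] = rng.normal(size=(5, 2))              # only 5 features matter, shared by both tasks
Y = X @ W + 0.1 * rng.normal(size=(100, 2))  # two related target properties

model = MultiTaskLasso(alpha=0.05).fit(X, Y)
print((model.coef_ != 0).sum(axis=1))        # nonzero coefficients found per task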
Active learning allows for very efficient learning when new data points are expensive

Active learning is a technique where the model is able to query an oracle during the learning process in order to resolve ambiguity.
- Well-suited to small datasets where new data is expensive to generate or label
- Very efficient learner, since the model can ignore examples it already understands well
- Similar to semi-supervised learning, except new ground-truth labels are generated instead of relying on models to label the unlabeled data
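One common query heuristic, sketched minimally below: use the spread of per-tree predictions in a random forest as an uncertainty estimate and query the most uncertain candidate next (a query-by-committee-style choice, assumed here for illustration).

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(9)
X_train = rng.uniform(-1, 1, (20, 3))                  # the few labeled experiments
y_train = X_train[:, 0] ** 2 + 0.05 * rng.normal(size=20)
X_pool = rng.uniform(-1, 1, (500, 3))                  # unlabeled candidate pool

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
per_tree = np.stack([tree.predict(X_pool) for tree in forest.estimators_])
query_idx = per_tree.std(axis=0).argmax()              # most disagreement = query next
print("next experiment:", X_pool[query_idx])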
Online learning involves continually updating the model after each data point is acquired
Online learning involves using the data available and updating the model directly before
a prediction is required or after the last observation was made.
- Well-suited to sequential datasets where new data could be changing over time
(consider shoe sales as a fad comes and goes)
- Possibly subject to catastrophic interference (catastrophic forgetting)
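A minimal online-learning sketch: scikit-learn's SGDRegressor updates its weights with partial_fit as each new observation streams in; the stream here is synthetic.

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(10)
model = SGDRegressor(learning_rate="constant", eta0=0.01)

for t in range(1000):                      # observations arrive one at a time
    x = rng.normal(size=(1, 3))
    y = np.array([2 * x[0, 0] - x[0, 1]])  # the underlying relationship
    model.partial_fit(x, y)                # update in place; no retraining from scratch

print(model.coef_.round(2))                # drifts toward [2, -1, 0]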
Transfer learning is when a trained model can be applied to another related task

In transfer learning, the learner must perform two or more different tasks, but we assume
that many of the factors that explain the variations in task 1 are relevant to the variations
that need to be captured for learning subsequent tasks.
- Well-suited for instances when first task has extensive data, but subsequent tasks have
only limited data.
- Differs from multi-task learning by sequentially learning the different tasks
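One simple flavor of transfer, chosen here for illustration (not the only approach): learn a representation on the data-rich task and reuse it as the input features for the data-poor task.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

rng = np.random.default_rng(11)
X_big = rng.normal(size=(5000, 50))     # task 1: plenty of data
X_small = rng.normal(size=(30, 50))     # task 2: only 30 labeled examples
y_small = X_small[:, 0] + 0.1 * rng.normal(size=30)

representation = PCA(n_components=10).fit(X_big)                 # learned on task 1
model = Ridge().fit(representation.transform(X_small), y_small)  # reused on task 2
print(model.score(representation.transform(X_small), y_small))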
Ensemble learning is when multiple models are trained on data and the results are combined
The objective of ensemble learning is to achieve better performance with the ensemble
of models as compared to any individual model. This involves both deciding how to
create models used in the ensemble and how to best combine the predictions from the
ensemble members.
- Takes advantage of pros/cons of each algorithm or model type
- Can provide additional measure of uncertainty
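A minimal ensemble sketch: average the predictions of different model types and use their spread as a rough uncertainty measure; the simple mean combiner and synthetic data are illustrative choices.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

rng = np.random.default_rng(12)
X = rng.uniform(size=(200, 4))
y = np.sin(3 * X[:, 0]) + X[:, 1] + 0.05 * rng.normal(size=200)

members = [RandomForestRegressor(random_state=0), Ridge(), SVR()]
member_preds = np.stack([m.fit(X, y).predict(X) for m in members])

ensemble_mean = member_preds.mean(axis=0)   # the combined prediction
ensemble_std = member_preds.std(axis=0)     # disagreement as a rough uncertainty
print(ensemble_mean[:3].round(2), ensemble_std[:3].round(2))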
