ARTICLE OPEN

MatSciBERT: A materials domain language model for text mining and information extraction

Tanishq Gupta1, Mohd Zaki2, N. M. Anoop Krishnan2,3 ✉ and Mausam3,4 ✉

A large amount of materials science knowledge is generated and stored as text published in peer-reviewed scientific literature. While recent developments in natural language processing, such as Bidirectional Encoder Representations from Transformers (BERT) models, provide promising information extraction tools, these models may yield suboptimal results when applied to the materials domain since they are not trained on materials science-specific notation and jargon. Here, we present a materials-aware language model, namely, MatSciBERT, trained on a large corpus of peer-reviewed materials science publications. We show that MatSciBERT outperforms SciBERT, a language model trained on a scientific corpus, and establishes state-of-the-art results on three downstream tasks: named entity recognition, relation classification, and abstract classification. We make the pre-trained weights of MatSciBERT publicly accessible for accelerated materials discovery and information extraction from materials science texts.
npj Computational Materials (2022) 8:102; https://doi.org/10.1038/s41524-022-00784-w

1Department of Mathematics, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India. 2Department of Civil Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India. 3School of Artificial Intelligence, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India. 4Department of Computer Science and Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India. ✉email: [email protected]; [email protected]

INTRODUCTION
Discovering materials and utilizing them for practical applications is an extremely time-consuming process that may span decades1,2. To accelerate this process, we need to exploit and harness the knowledge on materials that has been developed over the centuries through rigorous scientific procedure in a cohesive fashion3–8. Textbooks, scientific publications, reports, handbooks, websites, etc., serve as a large data repository that can be mined for obtaining the already existing information9,10. However, it is a challenging task to extract useful information from these texts since most of the scientific data is semi- or un-structured in the form of text, paragraphs with cross references, image captions, and tables10–12. Extracting such information manually is extremely time- and resource-intensive and relies on the interpretation of a domain expert.

Natural language processing (NLP), a sub-domain of artificial intelligence, presents an alternate approach that can automate information extraction from text. Earlier approaches in NLP relied on non-neural methods based on n-grams such as Brown et al. (1992)13, the structural learning framework by Ando and Zhang (2005)14, or structural correspondence learning by Blitzer et al. (2006)15, but these are no longer state of the art. Neural pre-trained embeddings like word2vec16,17 and GloVe18 are quite popular, but they lack domain-specific knowledge and do not produce contextual embeddings. Recent progress in NLP has led to the development of a computational paradigm in which a large, pre-trained language model (LM) is finetuned for domain-specific tasks. Research has consistently shown that this pretrain-finetune paradigm leads to the best overall task performance19–23. Statistically, an LM is a probability distribution over a sequence of words such that, for a given set of words, it assigns a probability to each word24. Recently, due to the availability of large amounts of text and high computing power, researchers have been able to pre-train these large neural language models. For example, Bidirectional Encoder Representations from Transformers (BERT)25 is trained on BookCorpus26 and English Wikipedia, resulting in state-of-the-art performance on multiple NLP tasks like question answering and entity recognition, to name a few.

Researchers have used NLP tools to automate database creation for ML applications in the materials science domain. For instance, ChemDataExtractor27, an NLP pipeline, has been used to create databases of battery materials28, Curie and Néel temperatures of magnetic materials29, and inorganic material synthesis routes30. Similarly, NLP has been used to collect the composition and dissolution rate of calcium aluminosilicate glassy materials31, zeolite synthesis routes to synthesize germanium-containing zeolites32, and to extract process and testing parameters of oxide glasses, thereby enabling improved prediction of the Vickers hardness11. Researchers have also made an automated NLP tool to create databases using the information extracted from computational materials science research papers33. NLP has also been used for other tasks such as topic modeling in glasses, that is, to group the literature into different topics in an unsupervised fashion, and to find images based on specific queries such as elements present, synthesis or characterization techniques, and applications10.

A comprehensive review by Olivetti et al. (2019) describes several ways in which NLP can benefit the materials science community34. By providing insights into chemical parsing tools like OSCAR435, capable of identifying entities and chemicals from text, Artificial Chemist36, which takes precursor information as input and generates synthetic routes to manufacture optoelectronic semiconductors with targeted band gaps, a robotic system for making thin films to produce cleaner and sustainable energy solutions37, and the identification of more than 80 million materials science domain-specific named entities, researchers have prompted the accelerated discovery of materials for different applications through the combination of ML and NLP techniques. Researchers have shown the domain adaptation capability of word2vec and BERT in the field of biological sciences as BioWordVec38 and BioBERT19; other domain-specific BERTs include SciBERT21 trained on a scientific and biomedical corpus39, ClinicalBERT40 trained on 2 million clinical notes in the MIMIC-III v1.4 database41, mBERT42 for multilingual machine translation tasks, PatentBERT23 for patent classification, and FinBERT for financial tasks22. This suggests that a materials-aware LM can significantly accelerate the research in the field by further adapting to downstream tasks9,34. Although there were no papers on developing materials-aware language models prior to this work43, in a recent preprint44, Walker et al. (2021) emphasize the impact of domain-specific language models on named entity recognition (NER) tasks in materials science.

In this work, we train a materials science domain-specific BERT, namely MatSciBERT. Figure 1 shows the graphical summary of the methodology adopted in this work, encompassing creating the materials science corpus, training MatSciBERT, and evaluating different downstream tasks. We achieve state-of-the-art results on domain-specific tasks as listed below.
a. NER on the SOFC and SOFC-Slot datasets by Friedrich et al. (2020)45 and the Matscholar dataset by Weston et al. (2019)9
b. Glass vs. Non-Glass classification of paper abstracts10
c. Relation Classification on the MSPT corpus46
The present work, thus, bridges the gap in the availability of a materials domain language model, allowing researchers to automate information extraction, knowledge graph completion, and other downstream tasks and hence accelerate the discovery of materials. We have hosted the MatSciBERT pre-trained weights at https://huggingface.co/m3rg-iitd/matscibert and codes for pre-training and finetuning on downstream tasks at https://github.com/M3RG-IITD/MatSciBERT. Also, the codes with finetuned models for the downstream tasks are available at https://doi.org/10.5281/zenodo.6413296.

Fig. 1 Methodology for training MatSciBERT. We create the Materials Science Corpus (MSC) through query search followed by selection of relevant research papers. MatSciBERT, pre-trained on MSC, is evaluated on various downstream tasks.

RESULTS AND DISCUSSION
Dataset
Textual datasets are an integral part of the training of an LM. There exist many general-purpose corpora like BookCorpus26 and English Wikipedia, and domain-specific corpora like the biomedical corpus39 and the clinical database41, to name a few. However, none of these corpora is suitable for the materials domain. Therefore, with the aim of providing a materials-specific LM, we first create a corpus spanning four important materials science families: inorganic glasses, metallic glasses, alloys, and cement and concrete. It should be noted that although these broad categories are mentioned, several other categories of materials, including two-dimensional materials, were also present in the corpus. Specifically, we have selected ~150 K papers out of ~1 M papers downloaded from the Elsevier Science Direct Database. The steps to create the corpus are provided in the Methods section. The details about the number of papers and words for each family are given in Supplementary Table 1. We have also provided the list of DOIs and PIIs of the papers used to pre-train MatSciBERT in the GitHub repository for this work.

The materials science corpus developed for this work has ~285 M words, which is nearly 9% of the number of words used to pre-train SciBERT (3.17B words) and BERT (3.3B words). Since we continue pre-training SciBERT, MatSciBERT is effectively trained on a corpus consisting of 3.17 + 0.28 = 3.45B words. From Supplementary Table 1, one can observe that 40% of the words are from research papers related to inorganic glasses and ceramics, and 20% each from bulk metallic glasses (BMG), alloys, and cement. Although the number of research papers for "cement and concrete" is more than for "inorganic glasses and ceramics", the latter has a higher word count. This is because of the greater number of full-text documents retrieved for the latter category. Supplementary Table 2 presents the word count of important strings relevant to the field of materials science. It should be noted that the corpus encompasses the important fields of thermoelectrics, nanomaterials, polymers, and biomaterials. Also, note that the corpora used for training the language model consist of both experimental and computational works, as both these approaches play a crucial role in understanding material response. The average paper length for this corpus is ~1848 words, which is two-thirds of the average paper length of 2769 words for the SciBERT corpus. The lower average paper length can be attributed to two things: (a) In general, materials science papers are shorter than biomedical papers. We verified this by computing the average paper length of full-text materials science papers, which came out to be 2366 words. (b) There are also papers without full text in our corpus. In that case, we have used the abstracts of such papers to arrive at the final corpus.

Pre-training of MatSciBERT
For MatSciBERT pre-training, we follow the domain adaptive pre-training proposed by Gururangan et al. (2020). In this work, the authors continued pre-training of the initial LM on a corpus of domain-specific text20. They observed a significant improvement in the performance on domain-specific downstream tasks for all the four domains despite the overlap between the initial LM vocabulary and the domain-specific vocabulary being less than 54.1%. BioBERT19 and FinBERT22 were also developed using a similar approach, where the vanilla BERT model was further pre-trained on domain-specific text and tokenization is done using the original BERT vocabulary. We initialize MatSciBERT weights with those of some suitable LM and then pre-train it on MSC. To determine the appropriate initial weights for MatSciBERT, we trained an uncased wordpiece47 vocabulary based on the MSC using the tokenizers library48. The overlap of the MSC vocabulary is 53.64% with the uncased SciBERT21 vocabulary and 38.90% with the uncased BERT vocabulary. Because of the larger overlap with the vocabulary of SciBERT, we tokenize our corpus using the SciBERT vocabulary and initialize the MatSciBERT weights with those of SciBERT as made publicly available by Beltagy et al. (2019)21. It is worth mentioning that a materials science domain-specific vocabulary would likely represent the corpus with a smaller number of wordpieces and potentially lead to a better language model. For example, "yttria-stabilized zirconia" is tokenized as ["yt", "##tri", "##a", "-", "stabilized", "zircon", "##ia"] by the SciBERT vocabulary, whereas a domain-specific tokenization might have resulted in ["yttria", "-", "stabilized", "zirconia"]. However, using a domain-specific tokenizer does not allow the use of SciBERT weights and hence forgoes the scientific knowledge already learned by SciBERT. Further, using the SciBERT vocabulary for the materials domain is not necessarily detrimental, since deep neural language models have the capacity to learn repeating patterns that represent new words using the existing tokenizer. For instance, when the wordpieces "yt", "##tri", and "##a" occur consecutively, SciBERT indeed recognizes that some material is being discussed, as demonstrated in the downstream tasks. This is also why most domain-specific BERT-based LMs like FinBERT22, BioBERT19, and ClinicalBERT40 extend the pre-training instead of using domain-specific tokenizers and learning from scratch.
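As an illustration of the points above (this snippet is ours, not taken from the authors' repository), the released checkpoint can be loaded through the Hugging Face transformers library named in the text, and the wordpiece split of "yttria-stabilized zirconia" under the SciBERT vocabulary can be inspected directly:

```python
# Minimal usage sketch; assumes the transformers package and the
# publicly hosted checkpoint "m3rg-iitd/matscibert" mentioned in the text.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("m3rg-iitd/matscibert")
model = AutoModel.from_pretrained("m3rg-iitd/matscibert")

# The SciBERT vocabulary has no single token for this material, so it is
# split into sub-word units (cf. ["yt", "##tri", "##a", ...] discussed above).
print(tokenizer.tokenize("yttria-stabilized zirconia"))

# Contextual embeddings for a materials-science sentence.
inputs = tokenizer("The glass transition temperature of the alloy was measured.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```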

Table 1. Macro-F1 scores on the test set for SOFC-Slot and SOFC datasets averaged over three seeds and five cross-validation splits. Values in parentheses show the results on the validation set.

Architecture     LM = MatSciBERT               LM = SciBERT                  LM = BERT                     SOTA
SOFC-Slot dataset
LM-Linear        63.82 ± 2.53 (67.53 ± 4.23)   58.64 ± 1.49 (64.58 ± 3.73)   57.06 ± 2.86 (61.68 ± 5.23)   62.6 (67.8 ± 12.9)
LM-CRF           65.35 ± 2.73 (70.07 ± 3.36)   59.07 ± 2.85 (68.31 ± 2.88)   58.26 ± 1.73 (65.38 ± 3.96)
LM-BiLSTM-CRF    65.95 ± 2.69 (69.76 ± 3.72)   61.68 ± 1.42 (68.44 ± 3.15)   55.44 ± 1.97 (65.36 ± 3.68)
SOFC dataset
LM-Linear        82.28 ± 1.11 (81.60 ± 2.63)   79.91 ± 1.20 (80.91 ± 2.37)   77.08 ± 1.75 (79.61 ± 3.01)   81.5 (81.7 ± 4.2)
LM-CRF           82.39 ± 1.23 (82.61 ± 2.34)   81.07 ± 0.93 (82.04 ± 2.36)   78.93 ± 1.62 (81.26 ± 2.87)
LM-BiLSTM-CRF    82.24 ± 1.12 (82.61 ± 2.77)   80.12 ± 1.00 (81.92 ± 2.27)   78.15 ± 1.55 (80.94 ± 2.72)

The details of the pre-training procedure are provided in the Methods section. The pre-training was performed for 360 h, after which the model achieved a final perplexity of 2.998 on the validation set (see Supplementary Fig. 1a). Although not directly comparable due to different vocabularies and validation corpora, the BERT25 and RoBERTa49 authors report perplexities of 3.99 and 3.68, respectively, which are in the same range. We also provide graphs for other evaluation metrics like MLM loss and MLM accuracy in Supplementary Fig. 1b, c. The final pre-trained LM was then used to evaluate different materials science domain-specific downstream tasks, details of which are described in the subsequent sections. The performance of the LM on the downstream tasks was compared with that of SciBERT, BERT, and other baseline models to evaluate the effectiveness of MatSciBERT in learning materials-specific information.

In order to understand the effect of pre-training on the model performance, a materials domain-specific downstream task, NER on SOFC-Slot, was performed using the model at regular intervals of pre-training. To this end, the pre-trained model was finetuned on the training set of the SOFC-Slot dataset. The choice of the SOFC-Slot dataset was based on the fact that the dataset comprises fine-grained materials-specific information. Thus, this dataset is appropriate to distinguish the performance of SciBERT from the materials-aware LMs. The performance of these finetuned models was evaluated on the test set. The LM-CRF architecture was used for the analysis since LM-CRF consistently gives the best performance for the downstream task, as shown later in this work. The macro-F1 averages across three seeds exhibited an increasing trend (see Supplementary Fig. 2a), suggesting the importance of training for longer durations. We also show a similar graph for the abstract classification task (Supplementary Fig. 2b).

Downstream tasks
Here, we evaluate MatSciBERT on three materials science specific downstream tasks, namely, Named Entity Recognition (NER), Relation Classification, and Paper Abstract Classification.

We now present the results on the three materials science NER datasets as described in the Methods section. To the best of our knowledge, the best Macro-F1 on the solid oxide fuel cells (SOFC) and SOFC-Slot datasets is 81.50% and 62.60%, respectively, as reported by Friedrich et al. (2020), who introduced the datasets44. We run the experiments on the same train-validation-test splits as done by Friedrich et al. (2020) for a fair comparison of results. Moreover, since the authors reported results averaged over 17 entities (the extra entity is "Thickness") for the SOFC-Slot dataset, we also report the results taking the 'Thickness' entity into account.

Table 1 shows the Macro-F1 scores for the NER task on the SOFC-Slot and SOFC datasets by MatSciBERT, SciBERT, and BERT. We observe that LM-CRF always performs better than LM-Linear. This can be attributed to the fact that the CRF layer can model the BIO tags accurately. Also, all SciBERT architectures perform better than the corresponding BERT architectures. We obtained an improvement of ~6.3 Macro-F1 and ~3.2 Micro-F1 (see Supplementary Table 3) on the SOFC-Slot test set for MatSciBERT vs. SciBERT while using the LM-CRF architecture. For the SOFC test dataset, MatSciBERT-BiLSTM-CRF performs better than SciBERT-BiLSTM-CRF by ~2.1 Macro-F1 and ~2.1 Micro-F1. Similar improvements can be seen for other architectures as well. These MatSciBERT results also surpass the current best results on the SOFC-Slot and SOFC datasets by ~3.35 and ~0.9 Macro-F1, respectively. It is worth noting that the SOFC-Slot dataset consists of 17 entity types and hence has more fine-grained information regarding the materials. On the other hand, SOFC has only four entity types representing coarse-grained information. We notice that the performance of MatSciBERT on SOFC-Slot is significantly better than that of SciBERT. To further evaluate this aspect, we analyzed the F1-score of both SciBERT and MatSciBERT on all the 17 entity types of the SOFC-Slot data individually, as shown in Fig. 2. Interestingly, we observe that for all the materials-related entity types, namely anode material, cathode material, electrolyte material, interlayer material, and support material, MatSciBERT performs better than SciBERT. In addition, for materials-related properties such as open circuit voltage and degradation rate, MatSciBERT is able to significantly outperform SciBERT. This suggests that MatSciBERT is indeed able to capitalize on the additional information learned from the MSC to deliver better performance on complex problems specific to the materials domain.

Now, we present the results for the Matscholar dataset9 in Table 2. For this dataset too, MatSciBERT outperforms SciBERT and BERT as well as the current best results, as can be seen in the case of the LM-CRF architecture. The authors obtained a Macro-F1 of 85.41% on the validation set and 85.10% on the test set, and Micro-F1 of 87.09% and 87.04% (see Supplementary Table 4).
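For readers who wish to reproduce this kind of NER finetuning, the sketch below shows the LM-Linear variant (a token-classification head on top of MatSciBERT); the label list is an illustrative placeholder, and the CRF variants described in the Methods add a CRF layer on top of the emissions.

```python
# Hedged sketch of token classification (NER) with MatSciBERT; the label
# set below is a toy subset, not the SOFC-Slot or Matscholar label space.
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-MAT", "I-MAT", "B-PRO", "I-PRO"]  # illustrative only
tokenizer = AutoTokenizer.from_pretrained("m3rg-iitd/matscibert")
model = AutoModelForTokenClassification.from_pretrained(
    "m3rg-iitd/matscibert",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

enc = tokenizer("LiFePO4 shows excellent cycling stability.", return_tensors="pt")
logits = model(**enc).logits                 # (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)[0].tolist()
print([model.config.id2label[i] for i in pred_ids])  # untrained head: arbitrary tags
```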

We observe that our best model, MatSciBERT-CRF, has Macro-F1 values of 88.66% and 86.38%, both better than the existing state of the art. In order to demonstrate the performance of MatSciBERT, we show an example from the validation set of the dataset in Supplementary Figs. 3 and 4. The overall superior performance of MatSciBERT is evident from Table 2.

Table 3 shows the results for the Relation Classification task performed on the Materials Synthesis Procedures dataset46. We also compare the results with two recent baseline models, MaxPool and MaxAtt50, details of which can be found in the Methods section. Even in this task, we observe that MatSciBERT performs better than SciBERT, BERT, and the baseline models consistently, although with a lower margin.

In the Paper Abstract Classification downstream task, we consider the ability of LMs to classify a manuscript into glass vs. non-glass topics based on an in-house dataset10. This is a binary classification problem, with the input being the abstract of a manuscript. Here too, we use the same baseline models MaxPool and MaxAtt50. Table 4 shows the comparison of accuracies achieved by MatSciBERT, SciBERT, BERT, and the baselines. It can be clearly seen that MatSciBERT outperforms SciBERT by more than 2.75% accuracy on the test set.

Altogether, we demonstrate that MatSciBERT, pre-trained on a materials science corpus, can perform better than SciBERT for all the downstream tasks such as NER, abstract classification, and relation classification on materials datasets. These results also suggest that the scientific literature in the materials domain, on which MatSciBERT is pre-trained, is significantly different from the computer science and biomedical domains on which SciBERT is trained. Specifically, each scientific discipline exhibits significant variability in terms of ontology, vocabulary, and domain-specific notations. Thus, the development of a domain-specific language model, even within the scientific literature, can significantly enhance the performance in downstream tasks related to text mining and information extraction from literature.

Fig. 2 Comparison of MatSciBERT and SciBERT on validation sets of SOFC-Slot dataset. The entity-level F1-scores for the MatSciBERT and SciBERT models are shown in blue and red, respectively. The bold colored text represents the best model's score.

Applications in materials domain
Now, we discuss some of the potential areas of application of MatSciBERT in materials science. These areas can range from the simple topic-based classification of research papers to discovering materials or alternate applications for existing materials. We demonstrate some of these applications as follows: (i) Document classification: A large number of manuscripts have been published on materials-related topics, and the numbers are increasing exponentially. Identifying manuscripts related to a given topic is a challenging task. Traditionally, these tasks are carried out employing approaches such as term frequency-inverse document frequency (TF-IDF) or Word2Vec, which is used along with a classification algorithm. However, these approaches directly vectorize a word and are not context sensitive. For instance, in the phrases "flat glass", "glass transition temperature", and "tea glass", the word "glass" is used in a very different sense. MatSciBERT is able to extract the contextual meaning of the embeddings and will thus be able to classify the topics more effectively, thereby enabling improved topic classification. This is evident from the binary classification results presented earlier in Table 4, where we observe that the accuracy obtained using MatSciBERT (96.22%) was found to be significantly higher than the results obtained using pooling-based BiLSTM models (91.44%). This approach can be extended to a larger set of abstracts for the accurate classification of documents from the literature.

(ii) Topic modeling: Topic modeling is an unsupervised approach of grouping documents belonging to similar topics together. Traditionally, topic modeling employs algorithms such as latent Dirichlet allocation (LDA) along with TF-IDF or Word2Vec to cluster documents having the same or semantically similar words together. Note that these approaches rely purely on the frequency of a word (in TF-IDF) or the embeddings of the word (in Word2Vec) for clustering, without taking into account the context. The use of context-aware embeddings as learned in MatSciBERT could significantly enhance the topic modeling task. As a preliminary study, we perform topic modeling using MatSciBERT on an in-house corpus of abstracts on glasses and ceramics. Note that the same corpus was used in an earlier work10 for topic modeling using LDA. Specifically, we obtain the output embeddings of the [CLS] token for each abstract using MatSciBERT. Further, these embeddings were projected into two dimensions using the UMAP algorithm51 and then clustered using the k-means algorithm52. We then concatenate all the abstracts belonging to the same cluster and calculate the most frequent words for each cluster/topic.
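A compact sketch of this [CLS]-embedding pipeline is given below; it assumes the umap-learn and scikit-learn packages and uses placeholder abstracts, so the cluster count and preprocessing are illustrative rather than the exact configuration used in the paper.

```python
# Sketch of topic modeling with [CLS] embeddings, UMAP projection, and k-means.
import torch
import umap                      # umap-learn
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("m3rg-iitd/matscibert")
model = AutoModel.from_pretrained("m3rg-iitd/matscibert").eval()

# Placeholder corpus; in the study this is an in-house set of glass/ceramic abstracts.
abstracts = [f"Placeholder abstract {i} about glasses and ceramics." for i in range(50)]

with torch.no_grad():
    enc = tokenizer(abstracts, padding=True, truncation=True,
                    max_length=512, return_tensors="pt")
    cls_emb = model(**enc).last_hidden_state[:, 0, :]   # one [CLS] vector per abstract

coords = umap.UMAP(n_components=2).fit_transform(cls_emb.numpy())
topics = KMeans(n_clusters=10, n_init=10).fit_predict(coords)
print(topics)    # cluster/topic id assigned to each abstract
```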

Table 2. Macro-F1 scores on the test set for Matscholar averaged over three seeds. Values in parentheses show the results on the validation set.

Architecture     LM = MatSciBERT               LM = SciBERT                  LM = BERT                     SOTA
LM-Linear        85.46 ± 0.13 (87.83 ± 1.21)   83.80 ± 0.32 (86.05 ± 0.55)   82.10 ± 0.81 (82.79 ± 0.20)   85.10 (85.41)
LM-CRF           86.38 ± 0.49 (88.66 ± 0.88)   85.04 ± 0.77 (88.07 ± 0.96)   84.07 ± 0.19 (84.61 ± 0.81)
LM-BiLSTM-CRF    86.09 ± 0.46 (89.15 ± 0.57)   85.66 ± 0.24 (87.66 ± 0.29)   83.39 ± 0.20 (84.07 ± 0.29)

Table 3. Test set results for Materials Synthesis Procedures dataset averaged over three seeds. Values in parentheses represent the results on the validation set.

Metric      MatSciBERT                    SciBERT                       BERT                          MaxPool                       MaxAtt
Macro-F1    89.02 ± 0.27 (88.31 ± 0.14)   87.22 ± 0.58 (87.21 ± 0.17)   85.40 ± 1.45 (85.95 ± 0.78)   81.19 ± 1.54 (80.93 ± 0.71)   80.39 ± 0.85 (81.53 ± 2.23)
Micro-F1    91.94 ± 0.20 (91.50 ± 0.20)   91.04 ± 0.32 (91.03 ± 0.08)   90.16 ± 0.69 (90.44 ± 0.54)   86.81 ± 1.55 (86.68 ± 0.84)   87.16 ± 0.60 (87.62 ± 1.34)

Table 4. Test set results for glass vs. non-glass dataset averaged over three seeds. Values in parentheses represent the results on the validation set.

Metric      MatSciBERT                    SciBERT                       BERT                          MaxPool                       MaxAtt
Accuracy    96.22 ± 0.16 (95.33 ± 0.27)   93.44 ± 0.57 (94.00 ± 0.00)   93.89 ± 0.68 (93.33 ± 0.98)   91.44 ± 0.31 (92.22 ± 0.56)   91.44 ± 0.68 (91.22 ± 0.16)

Supplementary Tables 5 and 6 show the top ten topics obtained using LDA and MatSciBERT, respectively. The top 10 keywords associated with each topic are also provided in the tables. We observe that the topics and keywords from MatSciBERT-based topic modeling are more coherent than the ones obtained from LDA. Further, the actual topics associated with the keywords are not very apparent from Supplementary Table 5. Specifically, Topic 9 by LDA contains keywords from French, suggesting that the topic represents French publications. Similarly, Topic 5 and Topic 3 have several generic keywords that don't represent a topic clearly. On the other hand, the keywords obtained by MatSciBERT enable a domain expert to identify the topics well. For instance, some of the topics identified based on the keywords by three selected domain experts are dissolution of silicates (9), oxide thin films synthesis and their properties (8, 6), materials for energy (0), electrical behavior of ceramics (1), and luminescence studies (5). Despite their efforts, the same three domain experts were unable to identify coherent topics based on the keywords provided by LDA. Altogether, MatSciBERT can be used for topic modeling, thereby providing a broad overview of the topics covered in the literature considered.

(iii) Information extraction from images: Images hold a large amount of information regarding the structure and properties of materials. A proxy to identify relevant images would be to go through the captions of all the images. However, each caption may contain multiple entities, and identifying the relevant keywords might be a challenging task. To this end, MatSciBERT finetuned on NER can be an extremely useful tool for extracting information from figure captions.

Here, we extracted entities from the figure captions used by Venugopal et al. (2021)10 using MatSciBERT finetuned on the Matscholar NER dataset. Specifically, entities were extracted from ~110,000 image captions on topics related to inorganic glasses. Using MatSciBERT, we obtained 87,318 entities as DSC (sample descriptor), 10,633 entities under APL (application), 145,324 as MAT (inorganic material), 76,898 as PRO (material property), 73,241 as CMT (characterization method), 33,426 as SMT (synthesis method), and 2,676 as SPL (symmetry/phase label). Figure 3 shows the top 10 extracted entities under the seven categories proposed in the Matscholar dataset. The top entities associated with each of the categories are coating (application), XRD (characterization), glass (sample descriptor, inorganic material), composition (material property), heat (synthesis method), and hexagonal (symmetry/phase). Further details associated with each category can also be obtained from these named entities. It should be noted that each caption may be associated with more than one entity. These entities can then be used to obtain relevant images for specific queries such as "XRD measurements of glasses used for coating", "emission spectra of doped glasses", or "SEM images of bioglasses with Ag", to name a few.

Further, Fig. 4 shows some of the selected captions from the image captions along with the corresponding manual annotation by Venugopal et al. (2021)10. The task of assigning tags to each caption was carried out by human experts. Note that only one word was assigned per image caption in the previous work. Using the MatSciBERT NER model, we show that multiple entities are extracted for the selected five captions. This illustrates the large amount of information that can be captured using the LM proposed in this work.

(iv) Materials caption graph: In addition to the queries mentioned earlier, graph representations can provide in-depth insights into the information spread across figure captions. For instance, questions such as "which synthesis and characterization methods are commonly used for a specific material?" and "what are the methods for measuring a specific property?" can be easily answered using knowledge graphs. Here, we demonstrate how the information in figure captions can be represented using materials caption graphs (MCG). To this end, we first randomly select 10,000 figure captions from glass-related publications. Further, we extract the entities and their types from the figure captions using MatSciBERT finetuned on the Matscholar NER dataset. For each caption, we create a fully connected graph by connecting all the entities present in that caption. These graphs are then joined together to form a large MCG. We demonstrate some insights gained from the MCGs below.

Figure 5 shows two subsets of graphs extracted from the MCGs. In Fig. 5a, we identified two entities that are two-hop neighbors, namely, Tg and anneal. Note that these entities do not share an edge. In other words, these two entities are not found simultaneously in any given caption. We then identified the intersection of all the one-hop neighbors of both the nodes and plotted the graph as shown in Fig. 5a. The thickness of an edge represents the strength of the connection in terms of the number of occurrences. We observe that there are four common one-hop neighbors for Tg and anneal, namely, XRD, doped, glass, and amorphous. This means that these four entities occur in captions along with Tg and anneal, even though these two entities are not directly connected in the captions used for generating the graph. Figure 5a suggests that Tg is related to glass, amorphous, and doped materials and that these materials can be synthesized by annealing. Similarly, the structures obtained by annealing can be characterized by XRD. From these results, we can also infer that Tg is affected by annealing, which agrees with the conventional knowledge in glass science.

Similarly, Fig. 5b shows all the entities connected to the node XRD. To this end, we select all the captions having XRD as CMT.


Fig. 3 Top-10 entities for various categories. a APL Application, b CMT Characterization method, c DSC Sample descriptor, d MAT Inorganic
material, e PRO Material Property, and f SMT Synthesis method.

After obtaining all the entities in those captions, we randomly sample 20 pairs and then plot them as shown in Fig. 5b. Note that the number of edges is 18 and the number of nodes is 19 because of one pair being (XRD, XRD) and two similar pairs (XRD, glass). The node color represents the entity type, and the edge width represents the frequency of the pair in the entire database of entities extracted from the captions where "XRD" is present. Using the graph, we can obtain the following information:
1. XRD is used as a characterization method for different material descriptors like glass, doped materials, nanofibers, and films.
2. Materials prepared using synthesis methods (SMT) like aging, heat-treatment, and annealing are also characterized using XRD.
3. While studying the property (PRO) glass transition temperature (Tg), XRD was also performed to characterize the samples.
4. In the case of silica glass ceramics (SGCs), phosphor, and phosphor-in-glass (PiG) applications (APL), XRD is used as CMT.
5. For different materials like ZnO, glasses, CsPBr3, and yttria partially stabilized zirconia (YPSZ), XRD is a major CMT, which is evident from the thicker edge widths.
Note that this information covers a wide range of materials and applications in the materials literature. Similar graphs can be generated for different entities and entity types using the MCG to gain insights into the materials literature.


Fig. 4 Comparison of MatSciBERT based NER tagging with manually assigned labels. MatSciBERT-based NER model is able to extract
multiple entities as compared to single manual label for each caption.

Fig. 5 Materials caption graph. a Connecting two unconnected entities, b exploring entities related to characterization method “XRD”.

(v) Other applications such as relation classification: MatSciBERT can also be applied to address several other problems such as relation classification and question answering. The relation classification task demonstrated in the present manuscript can provide key information regarding several aspects in materials science which are followed in a sequence. These would include synthesis and testing protocols, and measurement sequences. This information can be further used to discover an optimal pathway for material synthesis. In addition, such approaches can also be used to obtain the effect of different testing and environmental conditions, along with the relevant parameters, on the measured property of materials. This could be especially important for those properties, such as hardness or fracture toughness, which are highly sensitive to sample preparation protocols, testing conditions, and the equipment used. Thus, the LM can enable the extraction of information regarding synthesis and testing conditions that are otherwise buried in the text.

At this juncture, it is worth noting that there are very few annotated datasets available for the materials corpus. This contrasts with the biomedical corpus, where several annotated datasets are available for different downstream tasks such as relation extraction, question answering, and NER. While the development of a materials science specific language model can significantly accelerate NLP-related applications in materials, the development of annotated datasets is equally important for accelerating materials discovery.

In conclusion, we developed a materials-aware language model, namely, MatSciBERT, that is trained on a materials science corpus derived from journals. The LM, trained from the initial weights of SciBERT, exploits the knowledge on computer science and biomedical corpora (on which the original SciBERT was pre-trained) along with the additional information from the materials domain. We test the performance of MatSciBERT on several downstream tasks such as document classification, NER, and relation classification. We demonstrate that MatSciBERT exhibits superior performance on all the datasets tested in comparison to SciBERT. Finally, we discuss some of the applications through which MatSciBERT can enable accelerated information extraction from materials science text corpora. To enable accelerated text mining and information extraction, the pre-trained weights of MatSciBERT are made publicly available at https://huggingface.co/m3rg-iitd/matscibert.

METHODS
Dataset collection and preparation
To train an LM in a generalizable way, a considerable amount of data is required. For example, BERT25 was pre-trained on BookCorpus26 and English Wikipedia, containing a total of 3.3 billion words. SciBERT21, an LM trained on scientific literature, was pre-trained using a corpus consisting of 82% papers from the broad biomedical domain and 18% papers from the computer science domain. However, we note that none of these LMs includes text related to the materials domain. Here, we consider materials science literature from four broad categories, namely, inorganic glasses and ceramics, metallic glasses, cement and concrete, and alloys, to cover the materials domain in a representative fashion.

The first step in retrieving the research papers is a query search on the Crossref metadata database53. This resulted in a list of more than 1 M articles. Although Crossref gives search results from different journals and publishers, we downloaded papers only from the Elsevier Science Direct database using their sanctioned API54. Note that the Elsevier API returns the research articles in XML format; hence, we wrote a custom XML parser for extracting the text. Occasionally, there were papers having only an abstract and not full text, depending upon the journal and publication date. Since the abstracts contain concise information about the problem statement being discussed in the paper and what the research contributions are, we have included them in our corpus. Therefore, we have included all the sections of the paper when available and abstracts otherwise. For glass science-related papers, the details are given in our previous work10. For concrete and alloys, we first downloaded many research papers for each material category using several queries such as "cement", "interfacial transition zone", "magnesium alloy", and "magnesium alloy composite materials", to name a few.

Since all the downloaded papers did not belong to a particular class of materials, we manually annotated 500 papers based on their abstracts, as to whether they were relevant to the field of interest or not. Further, we finetuned SciBERT classifiers21,55, one for each category of material, on these labeled abstracts for identifying relevant papers among the downloaded 1 M articles. We consider these selected papers from each category of materials for training the language model. A detailed description of the Materials Science Corpus (MSC) is given in the Results and Discussion section of the paper. Finally, we divided this corpus into training and validation sets, with 85% being used to train the language model and the remaining 15% as validation to assess the model's performance on unseen text.

Note that the texts in the scientific literature may have several symbols, including some random characters. Sometimes the same semantic symbol has many Unicode surface forms. To address these anomalies, we also performed Unicode normalization of the MSC to:
a. get rid of random Unicode characters, and
b. map different Unicode characters having similar meaning and appearance to either a single standard character or a sequence of standard characters.
For example, % gets mapped to %, > to >, ⋙ to ⋙, = and = to =, and ¾ to 3/4, to name a few. First, we normalized the corpus using the BertNormalizer from the tokenizers library by Hugging Face56,57. Next, we created a list containing mappings of the Unicode characters appearing in the MSC. We mapped random characters to space so that they do not interfere during pre-training. It is important to note that we also perform this normalization step on every dataset before passing it through the MatSciBERT tokenizer.
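A hedged sketch of this normalization step is shown below; BertNormalizer is the component named in the text, while the character-mapping table is purely illustrative and not the authors' actual list.

```python
# Unicode clean-up sketch: a hand-made look-alike map followed by BertNormalizer.
from tokenizers.normalizers import BertNormalizer

normalizer = BertNormalizer(lowercase=True)   # uncased, as for the SciBERT vocabulary

# Illustrative mapping of Unicode variants to standard characters.
CHAR_MAP = {
    "\u2212": "-",     # minus sign -> hyphen-minus
    "\u00be": "3/4",   # vulgar fraction three quarters
    "\u2009": " ",     # thin space -> plain space
}

def normalize(text: str) -> str:
    for src, tgt in CHAR_MAP.items():
        text = text.replace(src, tgt)
    return normalizer.normalize_str(text)

print(normalize("Tg \u2212 20 K, \u00be of the samples"))
```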
Pre-training of MatSciBERT
We pre-train MatSciBERT on MSC as detailed in the last sub-section. Pre-training an LM from scratch requires significant computational power and a large dataset. To address this issue, we initialize MatSciBERT with weights from SciBERT and perform tokenization using the SciBERT uncased vocabulary. This has the additional advantage that existing models relying on SciBERT, which are pre-trained on biomedical and computer science corpora, can be interchangeably used with MatSciBERT. Further, the vocabulary existing in the scientific literature as constructed by SciBERT can be used to reasonably represent the new words in the materials domain.

To pre-train MatSciBERT, we employ the optimized training recipe RoBERTa49, suggested by Liu et al. (2019). This approach has been shown to significantly improve the performance of the original BERT. Specifically, the following simple modifications were adopted for MatSciBERT pre-training:
1. Dynamic whole word masking: This involves masking at the word level instead of masking at the wordpiece level, as discussed in the latest release of the BERT pre-training code by Google58. Each time a sequence is sampled, we randomly mask 15% of the words and let the model predict each masked wordpiece token independently.
2. Removing the NSP loss from the training objective: BERT was pre-trained using two unsupervised tasks: Masked-LM and Next-Sentence Prediction (NSP). NSP takes as input a pair of sentences and predicts whether the two sentences follow each other or not. The RoBERTa authors claim that removing the NSP loss matches or slightly improves downstream task performance.
3. Training on full-length sequences: BERT was pre-trained with a sequence length of 128 for 90% of the steps and with a sequence length of 512 for the remaining 10% of the steps. The RoBERTa authors obtained better performance by training only with full-length sequences. Here, input sequences are allowed to contain segments of more than one document, and the [SEP] token is used to separate the documents within an input sequence.
4. Using larger batch sizes: The authors also found that training with larger mini-batches improved the pre-training loss and increased the end-task performance.
Following these modifications, we pre-train MatSciBERT on the MSC with a maximum sequence length of 512 tokens for fifteen days on 2 NVIDIA V100 32GB GPUs with a batch size of 256 sequences. We use the AdamW optimizer with β1 = 0.9, β2 = 0.98, ε = 1e−6, weight decay = 1e−2, and a linear decay schedule for the learning rate with warmup ratio = 4.8% and peak learning rate = 1e−4. The pre-training code is written using the PyTorch59 and Transformers57 libraries and is available at our GitHub repository for this work: https://github.com/M3RG-IITD/MatSciBERT.
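The following condensed sketch (ours, not the authors' training script) shows how continued masked-language-model pre-training with these choices (whole-word masking, no NSP objective, 512-token sequences) can be set up with the Transformers Trainer; the corpus path, step count, and batch splitting are placeholders, and the packing of multiple documents into one sequence is omitted for brevity.

```python
# Hedged sketch of continued MLM pre-training from SciBERT on a materials corpus.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForWholeWordMask, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForMaskedLM.from_pretrained("allenai/scibert_scivocab_uncased")

corpus = load_dataset("text", data_files={"train": "msc_corpus.txt"})  # placeholder path
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# Dynamic whole-word masking of 15% of words per sampled sequence; no NSP objective.
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="matscibert_pretraining",
    per_device_train_batch_size=32,     # combined with accumulation/GPUs for ~256 sequences
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    adam_beta2=0.98,
    weight_decay=1e-2,
    warmup_ratio=0.048,
    max_steps=100_000,                  # placeholder; the paper trains for ~15 days
)

Trainer(model=model, args=args, data_collator=collator,
        train_dataset=tokenized["train"]).train()
```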
Downstream tasks
Once the LM is pre-trained, we finetune it on various supervised downstream tasks. The pre-trained LM is augmented with a task-specific output layer. Finetuning is done to adapt the model to specific tasks as well as to learn the task-specific randomly initialized weights present in the output layer. Finetuning is done on all the parameters end-to-end. We evaluate the performance of MatSciBERT on the following three downstream NLP tasks:
1. Named Entity Recognition (NER) involves identifying domain-specific named entities in a given sentence. Entities are encoded using the BIO scheme to account for multi-token entities53. The dataset for the NER task includes various sentences, with each sentence being split into multiple tokens. Gold labels are provided for each token. More formally, let E = {e1, … ek} be the set of k entity types for a given dataset. If [x1, … xn] are the tokens of a sentence and [y1, … yn] are the labels for these tokens, then each yi ∈ L = {B-e1, I-e1, … B-ek, I-ek, O}. Here, B-ei and I-ei represent the beginning and inside of entity ei.
2. Input for the Relation Classification60 task consists of a sentence and an ordered pair of entity spans in that sentence. Output is a label denoting the directed relationship between the two entities. The two entity spans can be represented as s1 = (i, j) and s2 = (k, l), where i and j denote the starting and ending index of the first entity and, similarly, k and l denote the starting and ending index of the second entity in the input statement. Here, i ≤ j, k ≤ l, and (j < k or l < i). The last constraint guarantees that the two entities do not overlap with each other. The output label belongs to L, where L is a fixed set of relation types. An example of a sentence from the task is given in Fig. 6. The task is to predict labels like "Participant_Material" and "Apparatus_Of" given the sentence and pair of entities as input.
3. In the Paper Abstract Classification task, we are given the abstract of a research paper, and we have to classify whether the abstract is relevant to a given field or not.
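To make the BIO encoding concrete, the toy example below spells out the label space L for a two-entity-type set E = {MAT, PRO} and a gold labeling of one tokenized sentence; it is not tied to any particular dataset.

```python
# BIO label space for a toy entity set E = {MAT, PRO}.
entity_types = ["MAT", "PRO"]                                   # E = {e1, ..., ek}
label_set = ["O"] + [f"{p}-{e}" for e in entity_types for p in ("B", "I")]
print(label_set)        # ['O', 'B-MAT', 'I-MAT', 'B-PRO', 'I-PRO']

tokens = ["yttria", "-", "stabilized", "zirconia", "is", "an", "electrolyte"]
labels = ["B-MAT", "I-MAT", "I-MAT", "I-MAT", "O", "O", "O"]    # one gold label per token
assert len(tokens) == len(labels)
```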


Fig. 6 Relation classification task. The different entities are enclosed in boxes with their respective labels. The related entities are connected
using arrows labeled with the relation.

Datasets
We use the following three Materials Science-based NER datasets to evaluate the performance of MatSciBERT against SciBERT:
1. Matscholar NER dataset9 by Weston et al. (2019): This dataset is publicly available and contains seven different entity types. Training, validation, and test sets consist of 440, 511, and 546 sentences, respectively. Entity types present in this dataset are inorganic material (MAT), symmetry/phase label (SPL), sample descriptor (DSC), material property (PRO), material application (APL), synthesis method (SMT), and characterization method (CMT).
2. Solid Oxide Fuel Cells – Entity Mention Extraction (SOFC) dataset by Friedrich et al. (2020)45: This dataset consists of 45 open-access scholarly articles annotated by domain experts. Four different entity types have been annotated by the authors, namely Material, Experiment, Value, and Device. There are 611, 92, and 173 sentences in the training, validation, and test sets, respectively.
3. Solid Oxide Fuel Cells – Slot Filling (SOFC-Slot) dataset by Friedrich et al. (2020)45: This is the same as the above dataset except that entity types are more fine-grained. There are 16 different entity types, namely Anode Material, Cathode Material, Conductivity, Current Density, Degradation Rate, Device, Electrolyte Material, Fuel Used, Interlayer Material, Open Circuit Voltage, Power Density, Resistance, Support Material, Time of Operation, Voltage, and Working Temperature. Two additional entity types: Experiment Evoking Word and Thickness, are used for training the models.
For relation classification, we use the Materials Synthesis Procedures dataset by Mysore et al. (2019)46. This dataset consists of 230 synthesis procedures annotated as graphs where nodes represent the participants of synthesis steps, and edges specify the relationships between the nodes. The average length of a synthesis procedure is nine sentences, and 26 tokens are present in each sentence on average. The dataset consists of 16 relation labels. The relation labels have been divided into three categories by the authors:
a. Operation-Argument relations: Recipe target, Solvent material, Atmospheric material, Recipe precursor, Participant material, Apparatus of, Condition of
b. Non-Operation Entity relations: Descriptor of, Number of, Amount of, Apparatus-attr-of, Brand of, Core of, Property of, Type of
c. Operation-Operation relations: Next operation
The train, validation, and test set consist of 150, 30, and 50 annotated material synthesis procedures, respectively.
The dataset for classifying research papers as related to glass science or not on the basis of their abstracts has been taken from Venugopal et al. (2021)10. The authors have manually labeled 1500 abstracts as glass and non-glass. These abstracts belong to different fields of glass science like bioactive glasses, rare-earth glasses, glass ceramics, thin-film studies, and optical, dielectric, and thermal properties of glasses, to name a few. We divide the abstracts into a train-validation-test split of 3:1:1.
Modeling
For the NER task, we use the BERT contextual output embedding of the first wordpiece of every token to classify the tokens among |L| classes. We model the NER task using three architectures: LM-Linear, LM-CRF, and LM-BiLSTM-CRF. Here, LM can be replaced by any BERT-based transformer model. We take LM to be BERT, SciBERT, and MatSciBERT in this work.
1. LM-Linear: The output embeddings of the wordpieces are passed through a linear layer with softmax activation. We use the BERT Token Classifier implementation of the transformers library57.
2. LM-CRF: We replace the final softmax activation of the LM-Linear architecture with a CRF layer61 so that the model can learn to label the tokens belonging to the same entity mention and also learn the transition scores between different entity types. We use the CRF implementation of the PyTorch-CRF library62.
3. LM-BiLSTM-CRF: A Bidirectional Long Short-Term Memory63 network is added in between the LM and the CRF layer. BERT embeddings of all the wordpieces are passed through a stacked BiLSTM. The output of the BiLSTM is finally fed to the CRF layer to make predictions.
In the case of the Relation Classification task, we use the Entity Markers-Entity Start architecture60 proposed by Soares et al. (2019) for modeling. Here, we surround the entity spans within the sentence with some special wordpieces. We wrap the first and second entities with [E1], [\E1] and [E2], [\E2], respectively. We concatenate the output embeddings of [E1] and [E2] and then pass them through a linear layer with softmax activation. We use the standard cross-entropy loss function for the training of the linear layer and the finetuning of the language model.
For the baseline, we use two recent models, MaxPool and MaxAtt, proposed by Maini et al. (2020)50. In this approach too, the pair of entities are wrapped with the same special tokens. Then GloVe embeddings18 of the words in the input sentence are passed through a BiLSTM, an aggregation mechanism (different for MaxPool and MaxAtt) over words, and a linear layer with softmax activation.
In the Paper Abstract Classification task, we use the output embedding of the CLS token to encode the entire text/abstract. We pass this embedding through a simple classifier to make predictions. We use the BERT Sentence Classifier implementation of the transformers library57. For the baseline, we use a similar approach as relation classification except that there is no pair of input entities.
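A compact sketch of the LM-CRF architecture is given below, using the pytorch-crf package named above; the alignment of sub-word pieces to the first wordpiece of each token is omitted for brevity, and the class name is our own.

```python
# Hedged sketch of an LM-CRF tagger: encoder -> linear emissions -> CRF.
import torch.nn as nn
from torchcrf import CRF                 # pytorch-crf package
from transformers import AutoModel

class LMCRFTagger(nn.Module):
    def __init__(self, lm_name: str, num_labels: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(lm_name)
        self.emission = nn.Linear(self.encoder.config.hidden_size, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        emissions = self.emission(hidden)
        mask = attention_mask.bool()
        if labels is not None:
            # Negative log-likelihood of the gold tag sequence.
            return -self.crf(emissions, labels, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)   # best tag sequence per sentence

# model = LMCRFTagger("m3rg-iitd/matscibert", num_labels=35)   # e.g. a BIO label space
```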
Hyperparameters
We use a linear decay schedule for the learning rate with a warmup ratio of 0.1. To ensure sufficient training of randomly initialized non-BERT layers, we set different learning rates for the BERT part and non-BERT part. We set the peak learning rate of the non-BERT part to 3e−4 and choose the peak learning rate of the BERT part from [2e−5, 3e−5, 5e−5], whichever results in a maximum validation score averaged across three seeds. We use a batch size of 16 and an AdamW optimizer for all the architectures. For the LM-BiLSTM-CRF architecture, we use a 2-layer stacked BiLSTM with a hidden dimension of 300 and dropout of 0.2 in between the layers. We perform finetuning for 15, 20, and 40 epochs for the Matscholar, SOFC, and SOFC-Slot datasets, respectively, as initial experiments exhibited little or no improvement after the specified number of epochs. All the weights of any given architecture are updated during finetuning, i.e., we do not freeze any of the weights. We make the code for finetuning and different architectures publicly available. We refer readers to the code for further details about the hyperparameters.
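The two-learning-rate setup described above can be written down as follows; the parameter-name filter assumes an encoder attribute like the sketch in the Modeling subsection and is therefore illustrative.

```python
# Sketch of discriminative learning rates with a linear warmup/decay schedule.
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, bert_lr=2e-5, head_lr=3e-4, num_training_steps=1000):
    bert_params, head_params = [], []
    for name, param in model.named_parameters():
        (bert_params if name.startswith("encoder") else head_params).append(param)
    optimizer = torch.optim.AdamW([
        {"params": bert_params, "lr": bert_lr},   # BERT part: 2e-5 / 3e-5 / 5e-5
        {"params": head_params, "lr": head_lr},   # non-BERT part: 3e-4
    ])
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * num_training_steps),   # warmup ratio of 0.1
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler
```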

Evaluation metrics
We evaluate the NER task based on entity-level exact matches. We use the CoNLL evaluation script (https://github.com/spyysalo/conlleval.py). For the NER and Relation Classification tasks, we use Micro-F1 and Macro-F1 as the primary evaluation metrics. We use accuracy to evaluate the performance of the paper abstract classification task.
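As a drop-in alternative to the CoNLL script referenced above (an assumption on our part; the paper itself uses the script), the seqeval package computes the same entity-level exact-match scores:

```python
# Entity-level exact-match F1 on BIO-tagged sequences.
from seqeval.metrics import f1_score

y_true = [["B-MAT", "I-MAT", "O", "B-PRO"]]
y_pred = [["B-MAT", "I-MAT", "O", "O"]]

print(f1_score(y_true, y_pred, average="micro"))   # entity-level Micro-F1
print(f1_score(y_true, y_pred, average="macro"))   # entity-level Macro-F1
```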
DATA AVAILABILITY
Any data used for the work are available from the corresponding authors upon reasonable request. The PIIs and DOIs of the research papers used in this work are available at https://github.com/M3RG-IITD/MatSciBERT/blob/main/pretraining/piis_dois.csv.

CODE AVAILABILITY
All the codes used in the present work are available at https://github.com/M3RG-IITD/MatSciBERT. Also, the codes with finetuned models for the downstream tasks are available at https://doi.org/10.5281/zenodo.6413296.

Received: 29 October 2021; Accepted: 12 April 2022
REFERENCES
1. National Science and Technology Council (US). Materials genome initiative for global competitiveness. (Executive Office of the President, National Science and Technology Council, https://www.mgi.gov/sites/default/files/documents/materials_genome_initiative-final.pdf, 2011).
2. Jain, A. et al. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Mater. 1, 011002 (2013).
3. Zunger, A. Inverse design in search of materials with target functionalities. Nat. Rev. Chem. 2, 1–16 (2018).
4. Chen, C. et al. A critical review of machine learning of energy materials. Adv. Energy Mater. 10, 1903242 (2020).
5. de Pablo, J. J. et al. New frontiers for the materials genome initiative. Npj Comput. Mater. 5, 1–23 (2019).
6. Greenaway, R. L. & Jelfs, K. E. Integrating computational and experimental workflows for accelerated organic materials discovery. Adv. Mater. 33, 2004831 (2021).
7. Ravinder et al. Artificial intelligence and machine learning in glass science and technology: 21 challenges for the 21st century. Int. J. Appl. Glass Sci. 12, 277–292 (2021).
8. Zanotto, E. D. & Coutinho, F. A. B. How many non-crystalline solids can be made from all the elements of the periodic table? J. Non-Cryst. Solids 347, 285–288 (2004).
9. Weston, L. et al. Named entity recognition and normalization applied to large-scale information extraction from the materials science literature. J. Chem. Inf. Model. 59, 3692–3702 (2019).
10. Venugopal, V. et al. Looking through glass: Knowledge discovery from materials science literature using natural language processing. Patterns 2, 100290 (2021).
11. Zaki, M., Jayadeva & Krishnan, N. M. A. Extracting processing and testing parameters from materials science literature for improved property prediction of glasses. Chem. Eng. Process. - Process Intensif. 108607 (2021). https://doi.org/10.1016/j.cep.2021.108607.
12. El-Bousiydy, H. et al. What can text mining tell us about lithium-ion battery researchers’ habits? Batter. Supercaps 4, 758–766 (2021).
13. Brown, P. F., Della Pietra, V. J., deSouza, P. V., Lai, J. C. & Mercer, R. L. Class-based n-gram models of natural language. Comput. Linguist. 18, 467–480 (1992).
14. Ando, R. K. & Zhang, T. A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res. 6, 1817–1853 (2005).
15. Blitzer, J., McDonald, R. & Pereira, F. Domain adaptation with structural correspondence learning. in Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing 120–128 (Association for Computational Linguistics, 2006).
16. Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. in 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2–4, 2013, Workshop Track Proceedings (eds. Bengio, Y. & LeCun, Y.) (2013).
17. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. Distributed representations of words and phrases and their compositionality. in Advances in Neural Information Processing Systems vol. 26 (Curran Associates, Inc., 2013).
18. Pennington, J., Socher, R. & Manning, C. GloVe: Global vectors for word representation. in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) 1532–1543 (Association for Computational Linguistics, 2014).
19. Lee, J. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36, 1234–1240 (2020).
20. Gururangan, S. et al. Don’t stop pretraining: Adapt language models to domains and tasks. in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 8342–8360 (Association for Computational Linguistics, 2020).
21. Beltagy, I., Lo, K. & Cohan, A. SciBERT: A pretrained language model for scientific text. in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019 (eds. Inui, K., Jiang, J., Ng, V. & Wan, X.) 3613–3618 (Association for Computational Linguistics, 2019).
22. Araci, D. FinBERT: Financial sentiment analysis with pre-trained language models. Preprint at https://arxiv.org/abs/1908.10063 (2019).
23. Lee, J.-S. & Hsiang, J. Patent classification by fine-tuning BERT language model. World Pat. Inf. 61, 101965 (2020).
24. Manning, C. & Schutze, H. Foundations of Statistical Natural Language Processing. (MIT Press, 1999).
25. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. in NAACL-HLT (1) 4171–4186 (Association for Computational Linguistics, 2019).
26. Zhu, Y. et al. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. in 2015 IEEE International Conference on Computer Vision (ICCV) 19–27 (2015).
27. Swain, M. C. & Cole, J. M. ChemDataExtractor: A toolkit for automated extraction of chemical information from the scientific literature. J. Chem. Inf. Model. 56, 1894–1904 (2016).
28. Huang, S. & Cole, J. M. A database of battery materials auto-generated using ChemDataExtractor. Sci. Data 7, 260 (2020).
29. Court, C. J. & Cole, J. M. Auto-generated materials database of Curie and Néel temperatures via semi-supervised relationship extraction. Sci. Data 5, 180111 (2018).
30. Kononova, O. et al. Text-mined dataset of inorganic materials synthesis recipes. Sci. Data 6, 203 (2019).
31. Uvegi, H. et al. Literature mining for alternative cementitious precursors and dissolution rate modeling of glassy phases. J. Am. Ceram. Soc. 104, 3042–3057 (2020).
32. Jensen, Z. et al. A machine learning approach to zeolite synthesis enabled by automatic literature data extraction. ACS Cent. Sci. 5, 892–899 (2019).
33. Guha, S. et al. MatScIE: An automated tool for the generation of databases of methods and parameters used in the computational materials science literature. Comput. Mater. Sci. 192, 110325 (2021).
34. Olivetti, E. A. et al. Data-driven materials research enabled by natural language processing and information extraction. Appl. Phys. Rev. 7, 041317 (2020).
35. Jessop, D. M., Adams, S. E., Willighagen, E. L., Hawizy, L. & Murray-Rust, P. OSCAR4: a flexible architecture for chemical text-mining. J. Cheminformatics 3, 41 (2011).
36. Epps, R. W. et al. Artificial chemist: an autonomous quantum dot synthesis bot. Adv. Mater. 32, 2001626 (2020).
37. MacLeod, B. P. et al. Self-driving laboratory for accelerated discovery of thin-film materials. Sci. Adv. 6, eaaz8867 (2020).
38. Zhang, Y., Chen, Q., Yang, Z., Lin, H. & Lu, Z. BioWordVec, improving biomedical word embeddings with subword information and MeSH. Sci. Data 6, 52 (2019).
39. Ammar, W. et al. Construction of the Literature Graph in Semantic Scholar. in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers) 84–91 (Association for Computational Linguistics, 2018).
40. Alsentzer, E. et al. Publicly available clinical BERT embeddings. in Proceedings of the 2nd Clinical Natural Language Processing Workshop 72–78 (Association for Computational Linguistics, 2019).
41. Johnson, A. E. W. et al. MIMIC-III, a freely accessible critical care database. Sci. Data 3, 160035 (2016).
42. Libovický, J., Rosa, R. & Fraser, A. On the language neutrality of pre-trained multilingual representations. in Findings of the Association for Computational Linguistics: EMNLP 2020 1663–1674 (Association for Computational Linguistics, 2020).
43. Gupta, T., Zaki, M., Krishnan, N. M. A. & Mausam. MatSciBERT: A materials domain language model for text mining and information extraction. Preprint at https://arxiv.org/abs/2109.15290 (2021).
44. Walker, N. et al. The impact of domain-specific pre-training on named entity recognition tasks in materials science. Available at SSRN 3950755 (2021).
45. Friedrich, A. et al. The SOFC-Exp corpus and neural approaches to information extraction in the materials science domain. in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 1255–1268 (Association for Computational Linguistics, 2020).

46. Mysore, S. et al. The materials science procedural text corpus: Annotating materials synthesis procedures with shallow semantic structures. in Proceedings of the 13th Linguistic Annotation Workshop 56–64 (Association for Computational Linguistics, 2019). https://doi.org/10.18653/v1/W19-4007.
47. Wu, Y. et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. Preprint at https://arxiv.org/abs/1609.08144 (2016).
48. Tokenizer. https://huggingface.co/transformers/main_classes/main_classes/tokenizer.html.
49. Liu, Y. et al. RoBERTa: A robustly optimized BERT pretraining approach. Preprint at https://arxiv.org/abs/1907.11692 (2019).
50. Maini, P., Kolluru, K., Pruthi, D. & Mausam. Why and when should you pool? Analyzing pooling in recurrent architectures. in Findings of the Association for Computational Linguistics: EMNLP 2020 4568–4586 (Association for Computational Linguistics, 2020).
51. Sainburg, T., McInnes, L. & Gentner, T. Q. Parametric UMAP embeddings for representation and semisupervised learning. Neural Comput. 33, 2881–2907 (2021).
52. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning. (MIT Press, 2016).
53. allenai/scibert_scivocab_uncased · Hugging Face. https://huggingface.co/allenai/scibert_scivocab_uncased.
54. Hugging Face. GitHub https://github.com/huggingface.
55. Wolf, T. et al. Transformers: State-of-the-art natural language processing. in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations 38–45 (Association for Computational Linguistics, 2020).
56. bert/run_pretraining.py at master · google-research/bert. GitHub https://github.com/google-research/bert.
57. Paszke, A. et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library.
58. Crossref Metadata Search. https://search.crossref.org/.
59. Elsevier Developer Portal. https://dev.elsevier.com/.
60. Baldini Soares, L., FitzGerald, N., Ling, J. & Kwiatkowski, T. Matching the blanks: Distributional similarity for relation learning. in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2895–2905 (Association for Computational Linguistics, 2019).
61. Lafferty, J. D., McCallum, A. & Pereira, F. C. N. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. in ICML 282–289 (Morgan Kaufmann, 2001).
62. pytorch-crf — pytorch-crf 0.7.2 documentation. https://pytorch-crf.readthedocs.io/en/stable/.
63. Huang, Z., Xu, W. & Yu, K. Bidirectional LSTM-CRF models for sequence tagging. Preprint at https://arxiv.org/abs/1508.01991 (2015).

ACKNOWLEDGEMENTS
N.M.A.K. acknowledges the funding support received from SERB (ECR/2018/002228), DST (DST/INSPIRE/04/2016/002774), BRNS YSRA (53/20/01/2021-BRNS), and ISRO RESPOND as part of the STC at IIT Delhi. M.Z. acknowledges the funding received from the PMRF award by the Government of India. M. acknowledges grants by Google, IBM, Bloomberg, and a Jai Gupta chair fellowship. The authors thank the High Performance Computing (HPC) facility at IIT Delhi for computational and storage resources.

AUTHOR CONTRIBUTIONS
M. and N.M.A.K. supervised the work. T.G. developed the codes and trained the models. M.Z., along with T.G., performed data collection and processing. All the authors analyzed the results and wrote the manuscript.

COMPETING INTERESTS
The authors declare no competing interests.

ADDITIONAL INFORMATION
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41524-022-00784-w.

Correspondence and requests for materials should be addressed to N. M. Anoop Krishnan or Mausam.

Reprints and permission information is available at http://www.nature.com/reprints

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2022