A Text Mining-based Approach for Comprehensive Understanding of Railway Operational Equipment Failures
Keywords Text mining, Railway operational equipment failure, BERT, BiLSTM, CRF, Knowledge graph
Railway transportation is a crucial component of modern infrastructure, demanding high levels of safety and reliability. The upkeep and coordinated operation of technical equipment, specifically operation equipment, are fundamental to ensuring the smooth functioning of railway transport services. The rapid expansion of the railway network and the continual increase in operational mileage have been accompanied by growing intricacy in the safety management of operational equipment, owing to the evolving nature of internal and external operational conditions. Despite advancements in safety design, this complexity imposes a heightened risk of operational failures within the railway system. Consequently, railway departments across China have amassed a significant volume of failure reports pertaining to operational equipment. These documents, rich in details that include timing, affected locations, causes, remedial measures, and further information, serve as a comprehensive data source for the analysis of malfunction patterns and the prediction of future failures. However, traditional failure report analysis approaches rely heavily on expert interpretation, frequently leading to the underutilization of valuable data when expert experience is insufficient. The advent of digital technologies in railway operations, along with the emergence of natural language processing (NLP) techniques, underscores the need for more advanced methods of analyzing failure texts. Within this context, designing advanced text mining techniques not only enables a more comprehensive analysis of historical failures and the exploration of interrelations among failure-related factors, thereby contributing to the prediction of future failures, but also strengthens support for maintenance and operational decision-making, including scheduling. This advances the efficiency and safety of railway operations.
At present, there has been no research focused specifically on the mining and analysis of railway operational equipment failures. However, some scholars have utilized text mining techniques to analyze railway accident reports for risk assessment1. They introduced a BERT model to perform NER on these reports, identifying entities such as incident names and causes. However, they defined only four types of accident entities and two types of entity relationships, which is insufficient for a comprehensive analysis of historical accident reports. Additionally, the precision, recall, and F1 score for identifying certain classes of entities, such as accident causes and descriptions, did not exceed 93%, indicating room for improvement.
ROEF report texts are marked by their complexity, incorporating entities such as failure descriptions, lines, and categories, which present more diverse features compared to general texts. This complexity is further exemplified by the lengthy railway-specific terminologies used in failure descriptions. Moreover, the scarcity of publicly accessible datasets in this field hinders the identification of named entities within ROEF texts.
To address the challenges mentioned above, this study effectively leverages the text processing capabilities of text mining technology by applying an optimized NER model to the ROEF domain and constructing the ROEF knowledge graph (ROEFKG). Initially, the study collects real historical failure report data provided by a Chinese railway bureau and performs operations such as data cleaning and labeling to construct a Chinese corpus in this field, thereby improving the utilization rate of knowledge. The BERT-BiLSTM-CRF model is then optimized by concatenating data from the BERT and BiLSTM layers, processing it with an entity attention layer that uses an attention mechanism to extract more profound features from the preceding layers' outputs, and reducing the dimension with a fully connected layer sized to the number of labels, providing more comprehensive data for the CRF layer. Additionally, a dropout regularization technique is employed during model training to enhance its generalization ability. This novel NER model extracts essential information such as the time of failure occurrence, line, train number, failure description, failure cause, corrective measures taken, failure location, failure category, responsible system, and the effect of the failure. Comparative evaluations of precision, recall, and F1 score demonstrate that our model achieves superior results on the provided dataset. Finally, the causal transmission paths among entities were standardized, leading to the establishment of the ROEFKG model. This model reveals the interconnections among historical failure-related entities, thereby laying a foundation for fault prediction and enhancing railway operational safety.
The remainder of this paper is organized as follows: Sect. "Related work" reviews relevant literature and previous research. Section "NER algorithm module" delves into a comprehensive description of the key methodology employed. Section "Experimental settings" discusses data collection, preprocessing, annotation, the definition of entities and relationships, and details the experimental environment setup. Section "Experimental results and analysis" is dedicated to an experimental comparative analysis to evaluate the model's effectiveness. Section "Visualized results: knowledge graph construction" focuses on the creation of a Neo4j database and explicates the development of the ROEFKG. Section "Conclusion" concludes the paper by summarizing the findings.
Related work
KGs are characterized as data graphs that amass and convey knowledge pertaining to the real world2. Entities of interest are represented as nodes within these graphs, while the relationships between entities are depicted by edges3,4. These representations leverage formal semantics, enabling efficient and unambiguous processing by computers. Due to their significant role in processing heterogeneous information within a machine-readable context, substantial and ongoing research efforts have been dedicated to KGs in recent years5. The proposed KGs have found widespread adoption in various AI systems6,7, including recommender systems, question answering, and information retrieval. Furthermore, they have been extensively applied across diverse domains (e.g., education8 and healthcare9) to improve human life and societal well-being10,11. The structured information management and visualization capabilities of KGs also aid in application development and platform design12.
In recent advancements, KGs have also been strategically integrated into safety analysis research to facilitate knowledge modeling and risk management endeavors. For instance, a community gas-safety risk-prediction method was introduced by Zhang et al.13, aimed at addressing the intricate and ongoing factors pertaining to community gas safety, utilizing temporal KGs. Gan et al.14 proposed the integration of multiple sources of knowledge in flag state control detection using emerging KG technology. Additionally, Mao et al.15 engineered a semi-automatic KG development solution tailored for process safety within the chemical industry. These endeavors underscore the critical role and application of KG technology in promoting safety management and risk assessment in various industries beyond the realm of transportation.
To date, some scholarly endeavors have been directed towards researching knowledge aspects related to ROEF. Liu et al.16 suggested a KG-based method for mining railway operational accidents, primarily focusing on the British railway dataset with a limited scope of data coverage. Lin and Wang17 employed text mining techniques to extract the causes of railway switch failures. Sobrie et al.18 utilized deep learning techniques for real-time prediction of railway delays. Lin et al.19 applied NER from NLP to identify entities in railway signal equipment failure information. These studies, while aligned with our research direction, focused on uncovering latent dangers and risks within the railway operation process through the analysis of relevant report texts. However, they fell short of presenting a comprehensive construction of the knowledge system and the visualization of knowledge links, thereby limiting the provision of intuitive and efficient decision support to field staff.
NER, a direct approach to knowledge discovery from text data, has recently been widely applied in geology20–22, agriculture23–25, medicine26–28, and other fields. NER is a pivotal task in NLP29, aiming to identify and categorize key information units termed named entities (NEs) from textual data. These entities fundamentally fall into two broad classifications: generic NEs, such as persons, locations, and organizations, and domain-specific NEs, encompassing specialized terminologies30. The identification of NEs serves as a cornerstone for numerous NLP applications, including relation extraction31–33, machine translation34, and question answering35, thereby marking NER as a critical area of research within the field. In contrast to traditional research paradigms, the emphasis here is placed on failure-factor-oriented failure-related entities, which are distinct from conventional entities such as individuals or geographic locations. Consequently, recognition requires not only a well-designed algorithm but also an understanding of the underlying semantics. To sum up, in this paper we build a NER (in a broad sense) algorithm model by defining ROEF-related entities that capture the characteristics of the text reports, achieving in-depth mining of failure information.
NER algorithm module
The BiLSTM layer encodes the BERT output in both directions, considering the contextual nuances of the data and thus elevating the model's accuracy. Additionally, the CRF layer facilitates the prediction of optimal label sequences by accounting for label dependencies based on ample contextual information, significantly heightening the accuracy and robustness of the model.
BERT layer
In comparison with earlier models such as ELMo36 and OpenAI GPT37, BERT distinctively adopts a 12-layer encoder from the Transformer architecture38 as its fundamental component, extracting context information from both preceding and succeeding text to derive word vectors. The core design of BERT is centered around the adoption of Masked Language Modeling (MLM) as an effective strategy for learning the contextual semantics within a corpus. Predominantly structured around the Transformer's encoders, this pretrained model, specifically the 12-layer BERT-BASE model used in this study, comprises a sequence of 12 encoders. The Transformer, characterized by its utilization of the attention mechanism, outlines a profound network architecture as depicted in Fig. 2, facilitating the extraction of semantic relationships from input text sequences with noteworthy efficiency.
In the Encoder's architecture, the attention mechanism is identified as a critical component: the weight coefficient is dynamically adjusted according to the degree of correlation among words within a sentence, enabling the acquisition of the final word representation, as depicted in Eq. (1):
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \quad (1)$$
where Q, K, and V denote word vector matrices, with d_k representing the embedding dimension. The Encoder's multi-head attention mechanism involves mapping Q, K, and V through multiple distinct linear transformations and subsequently concatenating the different attention heads, as demonstrated in Eqs. (2)–(3):
$$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{Attention}\left(QW_i^{Q},\ KW_i^{K},\ VW_i^{V}\right) \quad (2)$$

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O} \quad (3)$$
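To make Eqs. (1)–(3) concrete, the following is a minimal PyTorch sketch of scaled dot-product and multi-head attention; the batch size, sequence length, and the 12-head/64-dimension split (BERT-base's configuration) are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of Eqs. (1)-(3): scaled dot-product and multi-head attention.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Eq. (1): Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ V

x = torch.randn(1, 8, 768)                       # 8 tokens, 768-dim embeddings
num_heads, d_head = 12, 64                       # BERT-base: 12 heads of 64 dims

W_q = torch.nn.Linear(768, num_heads * d_head)   # per-head projections W_i^Q
W_k = torch.nn.Linear(768, num_heads * d_head)   # W_i^K
W_v = torch.nn.Linear(768, num_heads * d_head)   # W_i^V
W_o = torch.nn.Linear(num_heads * d_head, 768)   # output projection W^O in Eq. (3)

def split_heads(t):
    # (batch, tokens, heads*dims) -> (batch, heads, tokens, dims)
    B, T, _ = t.shape
    return t.view(B, T, num_heads, d_head).transpose(1, 2)

# Eq. (2): one attention head per projected subspace.
heads = scaled_dot_product_attention(split_heads(W_q(x)),
                                     split_heads(W_k(x)),
                                     split_heads(W_v(x)))
# Eq. (3): concatenate all heads and apply W^O.
out = W_o(heads.transpose(1, 2).reshape(1, 8, num_heads * d_head))
print(out.shape)  # torch.Size([1, 8, 768])
```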
After the encoder stack has captured the contextual semantics of each word in the sentence, the sequence of word vectors generated by the BERT layer is provided as input to the second module, the BiLSTM layer, for semantic encoding.
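As an illustration of this step, the sketch below obtains contextual word vectors from a 12-layer Chinese BERT-base encoder; it assumes the Hugging Face transformers library and the public bert-base-chinese checkpoint, which may differ from the authors' exact setup.

```python
# Sketch: obtaining contextual word vectors from a 12-layer BERT-base encoder.
# Assumes the Hugging Face `transformers` library and the public
# `bert-base-chinese` checkpoint; the paper's exact checkpoint may differ.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

text = "道岔9号定位无表示"  # hypothetical failure-description snippet
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)

# One 768-dim contextual vector per token; this sequence is what the
# BiLSTM layer consumes for semantic encoding.
word_vectors = outputs.last_hidden_state
print(word_vectors.shape)  # e.g. torch.Size([1, seq_len, 768])
```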
BiLSTM layer
Long Short-Term Memory (LSTM)39 is recognized as a variant of the Recurrent Neural Network (RNN)40. It sustains long-term dependencies through the adept incorporation of gating mechanisms, effectively mitigating the gradient explosion and vanishing challenges encountered during the training of RNNs. The principal components of the LSTM architecture are the forget gate, input gate, output gate, and the memory cell. The LSTM regulates information flow through these gate functions, realizing both long- and short-term memory as it extracts information from sequences, as shown in Fig. 3.
In LSTM models, the composition includes the input word x_t, cell state c_t, temporary cell state c̃_t, hidden state h_t, forget gate f_t, input gate i_t, and output gate o_t. Within the context of NER, the forget gate is utilized for the selection of recognized information, while the input gate determines the information to be stored in the cell state. Both are determined by the input word at the current moment x_t and the hidden state from the previous time step h_{t-1}, as depicted in Eqs. (4)–(5)41:
$$f_t = \sigma\left(w_f \cdot [h_{t-1}, x_t] + b_f\right) \quad (4)$$

$$i_t = \sigma\left(w_i \cdot [h_{t-1}, x_t] + b_i\right) \quad (5)$$
In the given context, σ represents the sigmoid activation function. By integrating the gates above with the cell state from the previous moment, the current cell state c_t can be obtained, as shown in Eqs. (6)–(7):
$$\tilde{c}_t = \tanh\left(w \cdot [h_{t-1}, x_t] + b\right) \quad (6)$$

$$c_t = f_t \times c_{t-1} + i_t \times \tilde{c}_t \quad (7)$$
In the described framework, the activation function used for the temporary cell state is the tanh function. The forget gate f_t is entrusted with regulating which segments of information from the previous cell state c_{t-1} should be preserved; this is instrumental in the calculation of the current cell state c_t. Furthermore, the input gate dictates which characteristics of the temporary cell state c̃_t are imparted onto the current cell state c_t.
Subsequently, the values of the output gate o_t and the hidden state h_t are derived from the cell state at the current moment. These derivations are encapsulated in Eqs. (8) and (9):
$$o_t = \sigma\left(w_o \cdot [h_{t-1}, x_t] + b_o\right) \quad (8)$$

$$h_t = o_t \ast \tanh(c_t) \quad (9)$$
Within the specified formulas, the weight coefficients are denoted by w_f, w_i, w, and w_o for the forget gate, input gate, temporary cell state, and output gate, respectively. Correspondingly, b_f, b_i, b, and b_o serve as the offset vectors for each respective component.
To sum up, the LSTM provides a robust framework for modeling time-series data and sequences with its built-in mechanisms for long-term dependency learning. Building upon the foundational principles of LSTMs, the BiLSTM extends this architecture by deploying two separate LSTM layers that process the input sequence in forward and backward directions42. This bidirectional processing is particularly effective for tasks wherein the context of both preceding and subsequent elements is crucial for accurate predictions. As a result, in this study the BiLSTM is adopted as a component of the NER model for ROEF reports.
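The following sketch shows how such a bidirectional encoder can be instantiated over BERT's output; the hidden size of 256 is an assumed hyperparameter, not the paper's tuned value.

```python
# Minimal sketch of the bidirectional LSTM encoder applied after BERT.
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=768,   # matches BERT's 768-dim word vectors
                 hidden_size=256,  # assumed value, not the paper's setting
                 num_layers=1,
                 batch_first=True,
                 bidirectional=True)

word_vectors = torch.randn(1, 8, 768)   # stand-in for BERT output
encoded, _ = bilstm(word_vectors)
# Forward and backward hidden states are concatenated per token,
# giving 2 * hidden_size features for the downstream layers.
print(encoded.shape)  # torch.Size([1, 8, 512])
```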
Entity attention layer
An entity attention layer is introduced between the BiLSTM and CRF layers. For each token, an attention weight is computed as in Eq. (10):

$$\alpha_i = \frac{\exp(W h_i + b)}{\sum_{j=1}^{N} \exp(W h_j + b)} \quad (10)$$

where α_i represents the attention weight assigned to token i, h_i is the hidden state vector of token i, W and b are learnable parameters, and N is the sequence length. The final entity-enhanced representation, H′, is then obtained as:

$$H' = \sum_{i=1}^{N} \alpha_i h_i \quad (11)$$
This refined representation is then passed through a fully connected layer for dimensionality reduction before being processed by the CRF layer for optimal entity classification.
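A minimal sketch of this layer is given below, implementing the token-scoring form of Eq. (10) followed by the fully connected reduction. Note that Eq. (11) sums the weighted vectors into one representation, whereas this sketch keeps per-token weighted features so that the CRF still receives one emission vector per token; that choice, and the hidden and label dimensions, are assumptions on our part.

```python
# Sketch of the entity attention layer (Eq. (10)) plus the fully connected
# reduction to label space. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class EntityAttention(nn.Module):
    def __init__(self, hidden_dim=512, num_labels=21):  # assumed sizes
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)   # computes W h_i + b, Eq. (10)
        self.fc = nn.Linear(hidden_dim, num_labels)

    def forward(self, h):                            # h: (batch, N, hidden_dim)
        alpha = torch.softmax(self.score(h), dim=1)  # α_i over the sequence
        weighted = alpha * h                 # re-weight each token's features
        return self.fc(weighted)             # per-token emission scores for CRF

layer = EntityAttention()
emissions = layer(torch.randn(1, 8, 512))    # e.g. BiLSTM output
print(emissions.shape)                       # torch.Size([1, 8, 21])
```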
The introduction of the entity attention layer substantially enhances NER performance, particularly in scenarios with imbalanced fault category distributions. By dynamically assigning higher weights to distinguishing features of underrepresented fault entities, the model mitigates the issue of class imbalance and reduces misclassification errors. For instance, in a dataset where "电缆芯线断线" (cable core disconnection) appears far less frequently than "道岔故障" (turnout failure), the entity attention mechanism amplifies the feature importance of rare entities, improving their recognition without compromising the detection of more frequent categories.
This layer refines entity representations based on surrounding context, leading to more accurate semantic disambiguation across different fault types. For example, when processing "signal light blackout", the attention mechanism prioritizes correlations with "power failure" rather than unrelated words in the sequence. Such context-aware adjustments enhance the model's generalization ability.
By selectively filtering out non-entity noise, the attention mechanism mitigates false positives. In a sentence like "After the train passed, the signal light went out", the model disregards non-relevant contextual phrases (e.g., "After the train passed") and focuses on core diagnostic entities, improving precision.
The introduction of the entity attention layer leads to significant empirical performance gains in NER tasks, particularly for imbalanced fault category datasets. By assigning greater attention to underrepresented entities, the model effectively improves recall for rare fault categories while maintaining a balance between precision and recall, thereby enhancing the F1 score. Additionally, the selective focus on critical entity features helps the model converge faster, reducing the risk of overfitting and ensuring robust generalization across different fault types. Experimental results (detailed in Sect. 5) further confirm these improvements, demonstrating a notable increase in recall for rare entities, a balanced precision-recall tradeoff, and an overall enhancement in the F1 score.
By addressing the data imbalance issue directly through context-aware attention weighting, the entity attention layer ensures a more reliable and equitable entity recognition performance across all fault categories, including those critical to railway safety.
CRF layer
In the task of NER, it has been observed that the BiLSTM model exhibits proficiency in handling long-distance textual information; however, its capability to address dependencies between adjacent tags remains inadequate43. This deficiency is effectively compensated by the employment of a CRF, which derives an optimal prediction sequence through the analysis of relationships between adjacent tags. The application of CRF to enhance the LSTM network model has been demonstrated to accomplish significant feature matching capability, as evidenced in the domains of finance44, medicine45, and agriculture46. Consequently, the introduction of CRF for the optimization of the aforementioned models is proposed, aiming to further elevate the performance of advanced text embedding and model training.
CRFs are recognized as conditional probability distribution models, assigned to generate output sequences based on a given set of input sequences. This method has gained prominence as a quintessential technique for addressing challenges within the domain of NLP47. In the framework of a CRF, vertices symbolize random variables and their interrelations are denoted by edges, composing an undirected graph. When a particular text sequence O = {O_1, O_2, ..., O_T} along with its associated tag sequence S = {x_1, x_2, ..., x_T} is provided, the probability of the state sequence is determined by Eq. (12):
$$P(S|O) = \frac{1}{Z(O)} \exp\left(\sum_{i=1}^{T} \sum_{k} \lambda_k f_k(x_i, O_i)\right) \quad (12)$$
in which Z(O) acts as the normalization factor and f_k(x_i, O_i) is the state feature function, as depicted in Eq. (13):
$$f_k(x_i, O_i) = \begin{cases} 1, & O_i = \text{word and } x_i = \text{tag} \\ 0, & \text{otherwise} \end{cases} \quad (13)$$
λ_k represents the respective correlation weight. The objective of this modeling approach is to systematically evaluate the probable outcomes of tagging sequences, given a sequence of text inputs.
Ultimately, the outcome of entity recognition is ascertained by identifying the tag sequence that achieves the maximum probability, expressed as Eq. (14):

$$S^{*} = \arg\max_{S} \{P(S|O)\} \quad (14)$$
This modeling choice underscores the CRF's capacity to intricately analyze input sequences and their corresponding tag sequences, seeking to forecast outcomes with precision. Such functionality accentuates the utility of CRFs in pioneering advancements within the sphere of NLP, fostering a profound understanding and application of language processing techniques.
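As a concrete illustration of Eq. (14), the sketch below performs Viterbi decoding over per-token emission scores and a tag transition matrix, which is how a trained CRF recovers the highest-probability tag sequence; the scores here are random stand-ins, not learned weights.

```python
# Viterbi decoding for Eq. (14): recover the tag sequence S* maximizing
# P(S|O) from per-token emission scores and a tag transition matrix.
import numpy as np

def viterbi_decode(emissions, transitions):
    """emissions: (T, K) per-token tag scores; transitions: (K, K)."""
    T, K = emissions.shape
    score = emissions[0].copy()            # best score ending in each tag
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # score of moving from every previous tag to every current tag
        cand = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # Follow back-pointers from the best final tag.
    best = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        best.append(int(backptr[t][best[-1]]))
    return best[::-1]

rng = np.random.default_rng(0)
tags = viterbi_decode(rng.normal(size=(8, 5)), rng.normal(size=(5, 5)))
print(tags)  # one tag index per token, e.g. [3, 0, 4, ...]
```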
Experimental settings
Data acquisition and preprocessing
This study focuses on NER for ROEF. Following the Chinese ROEF classification standards, the dataset was systematically categorized. Table 1 presents the 16 fault categories and their corresponding data quantities, covering 1,690 fault reports.
The raw dataset contains multiple inconsistencies and noise, making it unsuitable for direct NER application. The primary challenges observed in the raw text include:
• Extraneous characters and erroneous word entries, which introduce noise and affect recognition accuracy.
• Unstructured sentence formatting, making entity segmentation difficult.
• Overly lengthy and complex fault descriptions, complicating model processing.
To address these issues, we implemented a structured data preprocessing pipeline, consisting of data cleaning and sentence segmentation, ensuring high-quality input for NER.
Certain reports contained incomplete records or inconsistent terminology, which required standardization.
Raw input example: After the train arrived, the signal light did not turn on. Reason: missing. Handling method: not recorded.
Processed output: After the train arrived, the signal light did not turn on. Reason: unknown. Handling method: not provided.
Here, "missing" is replaced with "unknown" to maintain consistency, and "not recorded" is rephrased as "not provided" to ensure uniform terminology.
The raw dataset contained title numbers, redundant punctuation, and irrelevant metadata that could interfere with entity extraction.
Raw input example: Report No.: 2023A06 The train signal light went out, and the station attendant reported that the fault was not cleared. @Error Code #232.
Processed output: The train signal light went out, and the station attendant reported that the fault was not cleared.
Removal of metadata, special symbols (@, #, []), and redundant identifiers ensures that only meaningful content is retained for NER processing.
Many reports contained long, unstructured descriptions with unclear sentence boundaries, making text parsing challenging.
Raw input example: During train operation the signal light suddenly went out upon inspection a power supply anomaly was detected the station attendant reported that the turnout signal light was unresponsive
Processed output: During train operation, the signal light suddenly went out. Upon inspection, a power supply anomaly was detected. The station attendant reported that the turnout signal light was unresponsive.
Here, proper sentence boundaries were restored using punctuation, guided by fault-related keywords.
To further illustrate the impact of preprocessing, Fig. 4 compares a raw fault report (Fig. 4a) with its preprocessed version (Fig. 4b). The structured text output ensures cleaner and more standardized input for the NER model, leading to improved accuracy in entity recognition. These preprocessing techniques lay a solid foundation for reliable NER performance and subsequent fault diagnosis.
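An illustrative sketch of this pipeline is shown below, covering metadata stripping, terminology standardization, and sentence segmentation; the regular expressions and replacement pairs are assumptions for demonstration rather than the authors' exact rules.

```python
# Illustrative preprocessing sketch: metadata stripping, terminology
# standardization, and sentence segmentation. Patterns are assumptions.
import re

TERM_MAP = {"missing": "unknown", "not recorded": "not provided"}

def clean_report(text):
    # Remove report identifiers, error codes, and special symbols.
    text = re.sub(r"Report No\.?:\s*\S+", "", text)
    text = re.sub(r"@\S+|#\d+|\[|\]", "", text)
    # Standardize inconsistent terminology.
    for old, new in TERM_MAP.items():
        text = text.replace(old, new)
    return re.sub(r"\s+", " ", text).strip()

def split_sentences(text):
    # Restore sentence boundaries on terminal punctuation (incl. Chinese 。).
    return [s.strip() for s in re.split(r"(?<=[.!?。])\s*", text) if s.strip()]

raw = "Report No.: 2023A06 The train signal light went out. @Error Code #232."
print(split_sentences(clean_report(raw)))
# ['The train signal light went out.']
```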
The BIO tagging scheme was adopted to label the category of each word for processing. A total of 15,461 entities were annotated, highlighting ten main features of ROEF, including time, line, train number, failure description, failure cause, measures, failure location, failure category, responsibility system, and failure impact, as presented in Table 2. Complex entities are annotated as multiple separate entities. For example, in the entity "Li-Qin Line 20111" (tagged Li(B-LIN) Qin(I-LIN) Line(I-LIN) 2(B-NUM) 0(I-NUM) 1(I-NUM) 1(I-NUM) 1(I-NUM) number(I-NUM)), "Li-Qin Line" and "20111" are tagged as Failure Line and Train Number respectively; in the experiments such complex entities are also matched individually, while segments of the text not constituting named entities are designated as 'O'.
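The sketch below shows this character-level BIO labeling for the example above, together with a helper that recovers entity spans; the Chinese characters used for "Li-Qin Line" (黎钦线) are our assumption for illustration.

```python
# Character-level BIO labels for the "Li-Qin Line 20111" example;
# the characters 黎钦线 are an assumed rendering of "Li-Qin Line".
tokens = ["黎", "钦", "线", "2", "0", "1", "1", "1", "次"]
labels = ["B-LIN", "I-LIN", "I-LIN",
          "B-NUM", "I-NUM", "I-NUM", "I-NUM", "I-NUM", "I-NUM"]

def extract_entities(tokens, labels):
    """Collect (entity_text, type) spans from a BIO sequence."""
    entities, current, etype = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                entities.append(("".join(current), etype))
            current, etype = [tok], lab[2:]
        elif lab.startswith("I-") and current:
            current.append(tok)
        else:  # 'O' tag or stray 'I-' closes any open entity
            if current:
                entities.append(("".join(current), etype))
            current, etype = [], None
    if current:
        entities.append(("".join(current), etype))
    return entities

print(extract_entities(tokens, labels))
# [('黎钦线', 'LIN'), ('20111次', 'NUM')]
```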
Further, a relationship link encompassing the various entities is defined as shown in Fig. 5, facilitating a comprehensive understanding of the interconnections between the identified entities.
Ultimately, the dataset is partitioned in a ratio of 8:1:1, a strategic division designed to allocate the data for training, validation, and testing purposes, respectively. This prepared dataset is then input into the NER model. Analogous to the text preprocessing procedure employed for reports detailing signal failure events within the Chinese railway system50, this manuscript presents a visual depiction of the entire preprocessing sequence, as illustrated in Fig. 6. This figure serves as an informative guide delineating the steps requisite for the preparation of data prior to training NER models.
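For reference, an 8:1:1 split of the 1,690 reports can be reproduced as sketched below; the use of scikit-learn and the fixed random seed are our assumptions.

```python
# Sketch of the 8:1:1 train/validation/test split described above.
from sklearn.model_selection import train_test_split

reports = [f"report_{i}" for i in range(1690)]   # stand-ins for 1,690 reports
train, rest = train_test_split(reports, test_size=0.2, random_state=42)
val, test = train_test_split(rest, test_size=0.5, random_state=42)
print(len(train), len(val), len(test))  # 1352 169 169
```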
Table 2. Types and numbers of ROEF entities. The first column, "Tags", lists the abbreviations used in the tagging process.
The entity attention layer plays a vital role in mitigating this issue by enhancing the representation of low-frequency entities and allowing the model to learn more informative features from their surrounding context. Specifically:
The recall rate for "MEA" increased from 84.02% (baseline model) to 97.62% (optimized model), a relative improvement of 16.18%. The recall rate for "DES" improved from 92.90% to 97.02%, a relative improvement of 4.43%. This substantial improvement highlights the effectiveness of the entity attention layer in addressing data imbalance by capturing richer semantic dependencies within the context.
The entity attention layer selectively emphasizes relevant entity features, reducing the impact of dominant high-frequency categories and ensuring better contextual differentiation for underrepresented entity types.
Dropout regularization helps prevent overfitting to high-frequency entities, ensuring that all entity types are adequately represented in the learned embeddings.
The optimized model consistently achieves high recall values across all entity types, indicating its capacity to capture entity occurrences even in failure categories with limited training samples. For instance, the recall of REA increases from 0.6641 to 0.9768, demonstrating a notable enhancement in recognizing rare entity mentions. This suggests that the model successfully generalizes to low-resource scenarios by leveraging contextual information more effectively.
Certain entity types, such as EFF and LIN, showed significant improvements in F1 score, increasing from 0.7444 to 0.9793 and from 0.7157 to 0.9769, respectively. These entities are crucial for failure analysis, and their improved recognition ensures that key failure characteristics are better captured. The increase in precision and recall for these entities further confirms that the proposed enhancements improve the extraction of critical information, even when training data is sparse.
Unlike the baseline model, where some entities exhibited a significant gap between precision and recall, the optimized model achieves a more balanced performance. For example, in the baseline model, NUM had a relatively low recall of 0.6787, while the optimized model increased it to 0.8651, ensuring a more stable performance across different entity categories. This balance reduces the likelihood of the model being overly conservative or aggressive in recognizing certain entities, which is crucial for handling real-world failure reports.
The overall increase in F1 score across all entities confirms the effectiveness of our proposed approach in handling data imbalance. Since failure reports from G4, G5, G10, G11, G14, and G16 represent a low-resource scenario, the strong performance of the optimized model demonstrates its robustness and adaptability when dealing with underrepresented categories. This suggests that even with limited training samples for specific failure types, the model maintains a high level of recognition accuracy, reinforcing its generalization ability across diverse failure scenarios.
The above findings confirm that the proposed entity attention mechanism significantly improves the model's capability to recognize entities in imbalanced datasets. By enhancing the representation of low-frequency entities, the optimized model not only outperforms the baseline in low-resource scenarios but also ensures a more balanced, accurate, and generalizable entity recognition process for railway failure reports. This effectively addresses concerns regarding the impact of dataset imbalance on the model's performance and provides strong evidence of its applicability in real-world railway maintenance and fault diagnosis tasks.
Table 7. Summary of micro, macro, and weighted averages for optimized model performance evaluation.
Fig. 9. Confusion matrix of the optimized model. Each row of the matrix represents the actual entity type, while each column represents the predicted entity type. The diagonal cells indicate correctly classified instances, and the off-diagonal cells reflect misclassifications. As observed, most predictions are concentrated along the diagonal, confirming high accuracy. However, the confusion between MEA and REA suggests some overlap in linguistic features, which may contribute to the misclassification. The relatively low error rates across all tags further validate the effectiveness of the optimized model.
The results in Table 8 indicate that traditional machine learning models, particularly SVM and CRF, demonstrate competitive performance, achieving F1 scores above 0.81. However, these approaches rely heavily on feature engineering and struggle with complex linguistic patterns that require deeper contextual understanding. The proposed optimized BERT-BiLSTM-CRF model outperforms all baselines, achieving an F1 score of 0.9875, with statistically significant improvements (p < 0.01, paired t-test).
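For illustration, a paired t-test of this kind can be run with SciPy as sketched below; the per-fold F1 scores are invented placeholders, not the paper's measurements.

```python
# Sketch of the paired t-test used to assess significance of the F1 gains.
from scipy import stats

baseline_f1 = [0.912, 0.905, 0.918, 0.909, 0.915]   # hypothetical fold scores
optimized_f1 = [0.986, 0.984, 0.989, 0.987, 0.988]  # hypothetical fold scores

t_stat, p_value = stats.ttest_rel(optimized_f1, baseline_f1)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # significant if p < 0.01
```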
The superior performance of our model can be attributed to transformer-based contextual embeddings, which dynamically capture semantic relationships between words, unlike traditional models that rely on static, handcrafted features. Furthermore, the bi-directional structure of BiLSTM enhances sequence modeling, while the CRF layer refines entity boundaries through global inference, making the model particularly effective for domain-specific NER tasks.
This comparative study underscores the significant advantages of the optimized BERT-BiLSTM-CRF model over traditional machine learning approaches. While CRF, SVM, and RF remain viable for general NER tasks, their reliance on feature engineering and limited contextual awareness restricts their effectiveness in complex, real-world scenarios. In contrast, our model leverages deep contextual embeddings and structured sequence modeling, yielding state-of-the-art performance.
These findings suggest that the proposed approach can be effectively deployed in real-world railway fault diagnosis systems, enabling automated, high-accuracy entity recognition.
Table 9. Comparison of parameter size, training time, and GPU memory usage across models.

Model                         Parameter size (M)   Training time per epoch (s)   Total training time (h)   GPU memory usage (GB)
CRF                           0.2                  5                             0.02                      1
SVM                           1.5                  20                            0.08                      2
RF                            5.2                  35                            0.15                      3
BiLSTM-CRF                    10.5                 35                            0.2                       4
BERT                          110.0                120                           0.7                       10
BERT-CRF                      115.0                150                           0.8                       12
BERT-BiLSTM-CRF               125.0                180                           1.0                       14
BERT-BiLSTM-CRF (optimized)   118.0                160                           0.9                       13
DistilBERT-BiLSTM-CRF         82.0                 110                           0.6                       9
The comparison covers deep learning-based models (BERT, BERT-CRF, BiLSTM-CRF, BERT-BiLSTM-CRF, and DistilBERT-BiLSTM-CRF) and traditional machine learning models (CRF, SVM, RF). The evaluation is conducted in the same experimental environment, utilizing an RTX 4090 GPU with 24 GB of memory.
Table 9 presents the comparison of parameter sizes, average training time per epoch, total training time, and GPU memory usage across the different models. From Table 9, it is evident that models incorporating BERT-based architectures generally exhibit higher parameter counts and computational costs compared to traditional machine learning models. The BiLSTM-CRF model, while lightweight in terms of parameter size and training time, does not leverage contextual word representations, leading to inferior performance in entity recognition tasks.
The proposed BERT-BiLSTM-CRF (optimized) model introduces improvements in both efficiency and performance by optimizing the BiLSTM hidden unit size and implementing a refined training strategy. Compared to the standard BERT-BiLSTM-CRF model, the optimized version reduces parameter size by approximately 5.6%, decreases training time per epoch by 11.1%, and requires 7.1% less GPU memory, without compromising entity recognition accuracy.
While cascading multiple neural network layers inevitably increases computational demands, the enhanced contextual representation and sequence modeling capabilities justify the additional cost, particularly in mission-critical applications such as railway fault diagnosis. Given that the fault diagnosis process does not require real-time inference but rather prioritizes high recall and precision, the computational complexity remains acceptable for practical deployment.
Moreover, the introduction of DistilBERT-BiLSTM-CRF as a lightweight alternative demonstrates a potential trade-off between model efficiency and performance. Although DistilBERT reduces computational overhead, its lower number of transformer layers may lead to a decline in entity recognition accuracy, which is undesirable for high-stakes applications.
In conclusion, the BERT-BiLSTM-CRF (optimized) model achieves a well-balanced trade-off between computational efficiency and fault diagnosis accuracy. Future work may explore quantization and model pruning techniques to further enhance efficiency while maintaining robust entity recognition performance.
subgraph extracted from one failure report, and (c) displays the completed ROEFKG composed of multi-typed entities and relationships. Entity types are color-coded, and relationships are expressed via labeled directional edges. A legend is provided to ensure readability. The entities representing different aspects of the failure data are distinguished by colors in the visualization: "TIM" (time) in purple, "LIN" (line) in blue, "NUM" (number) in light gray, "DES" (description) in lavender, "REA" (reason) in yellow, "LOC" (location) in orange, "MEA" (measure) in green, "CAT" (category) in dark gray, and "SYS" (system) in deep blue. The relationships between these entities are depicted with colored arrows: "occurred at time" in purple, "occurred at line" in blue, "occurred at number" in light gray, "occurred at location" in lavender, "result in" in yellow, "category of failure" in orange, "due to" in green, "responsible system" in dark gray, and "taken" in deep blue. This visualization schema simplifies the representation and comprehension of the failure data structure, allowing for a clear understanding of the relationships and attributes associated with each entity. Based on the KG depicted, an intuitive understanding of the ROEF reports, along with their causes and impacts, can be acquired by tracing the relational links between physical nodes, which in turn influences the operational efficiency and safety of train movement. For example:
(1) "When Luwu Station was processing the approach for train number 20111, the positioning of switch number 9 showed no indication" (hereafter the "failure description") -[occurred at number]-> train number 20111;
(2) failure description -[occurred at line]-> Liqin Line;
(3) failure description -[occurred at time]-> 12:25 on April 8, 2020;
(4) failure description -[responsible system]-> electrical service;
(5) failure description -[category of failure]-> G7 signal equipment malfunction;
(6) failure description -[result in]-> did not affect the train;
(7) failure description -[taken]-> Electrical and track maintenance departments were notified for inspection and handling. At 12:53, the track maintenance department verified that the equipment was functioning normally, and at 13:11, the electrical service confirmed the rectification, restoring normal train operations;
(8) failure description -[occurred at location]-> turnout equipment;
(9) failure description -[due to]-> the diode of switch number 9 was malfunctioning.
On the other hand, the Neo4j graph database affords users the capability to perform an array of operations on the stored visual information, including indexing, querying, adding, and deleting, among others. Consequently, the KG developed on the Neo4j platform, tailored to encapsulate information pertinent to Chinese reports, is poised to offer online services. These services are instrumental in providing guidance to on-site railway personnel, thereby facilitating the execution of safety management and decision-making processes.
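As an illustration of such services, the sketch below loads one failure-report subgraph into Neo4j with the official Python driver; the connection URI, credentials, and node and relationship names are assumptions modeled on the relations listed above.

```python
# Sketch: loading one failure-report subgraph into Neo4j via the official
# Python driver. URI, credentials, and property names are assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

def add_failure(tx, description, line, reason):
    tx.run(
        "MERGE (d:Description {text: $description}) "
        "MERGE (l:Line {name: $line}) "
        "MERGE (r:Reason {text: $reason}) "
        "MERGE (d)-[:OCCURRED_AT_LINE]->(l) "
        "MERGE (d)-[:DUE_TO]->(r)",
        description=description, line=line, reason=reason)

with driver.session() as session:
    session.execute_write(add_failure,
                          "switch number 9 showed no indication",
                          "Liqin Line",
                          "the diode of switch number 9 was malfunctioning")
driver.close()
```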
Conclusion
Given the dynamic environment and complex equipment failures in the railway transportation system, this
paper comprehensively analyzes railway rolling stock equipment failure report data using advanced text mining
techniques. !e innovative work and conclusions are as follows:
(1) A collection of on-site ROEF reports served as the primary dataset, totaling 1,690 reports, categorized into 16 failure types based on railway transportation equipment failure classification standards. From these reports, a corpus was constructed, encompassing over 350,000 lines of named entities and labels. Moreover, ten categories of failure-related entities and the interrelationships within ROEF reports were defined.
(2) Leveraging the constructed corpus, the optimized BERT-BiLSTM-CRF model was employed to enhance entity extraction from railway equipment failure texts. To address challenges posed by imbalanced data distribution, an entity attention layer was introduced, allowing the model to adaptively assign higher attention weights to critical but underrepresented entities. This mechanism enhances the feature representation of low-frequency fault types, ensuring their more effective recognition. Furthermore, the dropout regularization technique was incorporated to improve model robustness. Experimental results confirm that the optimized BERT-BiLSTM-CRF model significantly outperforms the baseline approach in NER tasks, particularly in handling imbalanced entity distributions.
(3) Following entity extraction and relationship analysis, a ROEFKG was constructed using the Neo4j database. This graphical representation provides an intuitive visualization of failure patterns, enhancing knowledge retrieval and interpretability. The developed ROEFKG supports failure diagnosis and decision-making by establishing structured interconnections among different fault-related entities.
In summary, the entity database and knowledge graph developed in this study provide a data-driven framework for understanding railway operational failures, enhancing knowledge application efficiency, and bridging current gaps in ROEF report mining. The introduction of the entity attention layer plays a crucial role in mitigating the impact of data imbalance, enabling more effective recognition of both frequent and infrequent fault types. This enhancement ensures that critical but rare failures are not overshadowed by dominant categories, improving the model's reliability and its practical applicability in railway operational safety.
Data availability
The datasets generated and analyzed during the current study are not publicly available due to the ongoing nature of the project, but they are available from the corresponding author upon reasonable request.
References
1. Liu, C. & Yang, S. Using text mining to Establish knowledge graph from accident/incident reports in risk assessment. Expert Syst.
Appl. 207, 117991 (2022).
2. Guo, L. et al. Distributed representations of entities in open-world knowledge graphs. Knowledge-Based Syst. 290, 111582 (2024).
3. Cheng, D., Yang, F., Xiang, S. & Liu, J. Financial time series forecasting with multi-modality graph neural network. Pattern
Recognit. 121, 108218 (2022).
4. Hogan, A. et al. Knowledge graphs. ACM Comput. Surv. 54 (4), 1–37 (2021).
5. Dai, Y., Wang, S., Chen, X., Xu, C. & Guo, W. Generative adversarial networks based on Wasserstein distance for knowledge graph
embeddings. Knowledge-Based Syst. 190, 105165 (2020).
6. Ko, H., Witherell, P., Lu, Y. & Kim, S. Machine learning and knowledge graph based design rule construction for additive
manufacturing. Addit. Manuf. 37, 101620 (2021).
7. Mohamed, S. K., Nounu, A. & Nováček, V. Biological applications of knowledge graph embedding models. Briefings Bioinf. 22 (2), 1679–1693 (2021).
8. Chen, D., Chen, J., Fang, C. & Zhang, Z. Complex visual question answering based on uniform form and content. Appl. Intell. 54,
4602–4620 (2024).
9. Zafar, A., Varshney, D., Kumar, S. S., Das, A. & Ekbal, A. Are my answers medically accurate? Exploiting medical knowledge graphs
for medical question answering. Appl. Intell. 54, 2172–2187 (2024).
10. Bounhas, I., Soudani, N. & Slimani, Y. Building a morpho-semantic knowledge graph for Arabic information retrieval. Inf. Process.
Manage. 57 (6), 102124 (2020).
11. Sun, R. et al. Multi-modal knowledge graphs for recommender systems. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 1405–1414 (ACM, 2020).
12. Nettleton, D. F. & Salas, J. A data driven anonymization system for information rich online social network graphs. Expert Syst.
Appl. 55, 87–105 (2016).
13. Zhang, Q. et al. Construction of knowledge graphs for maritime dangerous goods. Sustainability 11 (10), 2849 (2019).
14. Gan, L. et al. Construction of knowledge graph for flag state control (FSC) inspection for ships: A case study from China. J. Mar. Sci. Eng. 10 (10), 1352 (2022).
15. Mao, S., Zhao, Y., Chen, J., Wang, B. & Tang, Y. Development of process safety knowledge graph: A case study on delayed coking
process. Comput. Chem. Eng. 143, 107094 (2020).
16. Liu, J., Schmid, F., Li, K. & Zheng, W. A knowledge graph-based approach for exploring railway operational accidents. Reliab. Eng.
Syst. Saf. 207, 107352 (2021).
17. Lin, C. & Wang, G. Failure cause extraction of railway switches based on text mining. In Proceedings of the International Conference on Computer Science and Artificial Intelligence, 237–241 (ACM, 2017).
18. Sobrie, L., Verschelde, M., Hennebel, V. & Roets, B. Capturing complexity over space and time via deep learning: an application to
real-time delay prediction in railways. Eur. J. Oper. Res. 310 (3), 1201–1217 (2023).
19. Lin, J., Li, S., Qin, N. & Ding, S. Entity recognition of railway signal equipment fault information based on RoBERTa-wwm and
deep learning integration. Math. Biosci. Eng. 21 (1), 1228–1248 (2024).
20. Cai, Z. et al. The sources and transport pathways of sediment in the northern Ninetyeast Ridge of the Indian Ocean over the last 35,000 years. Front. Mar. Sci. 10, 1073054 (2023).
21. Li, W. et al. Chinese word segmentation based on self-learning model and geological knowledge for the geoscience domain. Earth
Space Sci. 8 (6), e2021EA001673 (2021).
22. Qiu, Q. et al. Chinese engineering geological named entity recognition by fusing multi-features and data enhancement using deep
learning. Expert Syst. Appl. 238, 121925 (2024).
23. Liang, J., Li, D., Lin, Y., Wu, S. & Huang, Z. Named entity recognition of Chinese crop diseases and pests based on RoBERTa-wwm
with adversarial training. Agron 13 (3), 941 (2023).
24. Yin, T. et al. Research on life cycle assessment and performance comparison of bioethanol production from various biomass
feedstocks. Sustainability 16 (5), 1788 (2024).
25. Zhang, D., Zheng, G., Liu, H., Ma, X. & Xi, L. AWdpCNER: automated Wdp Chinese named entity recognition from wheat diseases
and pests text. Agric 13 (6), 1220 (2023).
26. Hu, Z. & Ma, X. A novel neural network model fusion approach for improving medical named entity recognition in online health
expert question-answering services. Expert Syst Appl. 223, 119880 (2023).
27. Yang, P. et al. A large-scale and multi-source medical knowledge graph for intelligent medicine applications. Knowledge-Based Syst.
284, 111323 (2024).
28. Wu, S. et al. Deep learning in clinical natural language processing: A methodical review. J. Am. Med. Inf. Assoc. 27 (3), 457–470 (2020).
29. Helwe, C. & Elbassuoni, S. Arabic named entity recognition via deep co-learning. Artif. Intell. Rev. 52, 197–215 (2019).
30. Li, J., Sun, A., Han, J. & Li, C. A survey on deep learning for named entity recognition. IEEE Trans. Knowl. Data Eng. 34 (1), 50–70
(2020).
31. Bunescu, R. & Mooney, R. A shortest path dependency kernel for relation extraction. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, 724–731 (ACL, 2005).
32. Culotta, A. & Sorensen, J. Dependency tree kernels for relation extraction. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), 423–429 (ACL, 2004).
33. Mintz, M., Bills, S., Snow, R. & Jurafsky, D. Distant supervision for relation extraction without labeled data. In: Proceedings of the
Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing
of the AFNLP, 1003–1011 (ACL, 2009).
34. Modrzejewski, M., Exel, M., Buschbeck, B., Ha, T. L. & Waibel, A. Incorporating external annotation to improve named entity translation in NMT. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, 45–51 (EAMT, 2020).
35. Mollá, D., van Zaanen, M. & Smith, D. Named entity recognition for question answering. In Proceedings of the Australasian Language Technology Workshop, 51–58 (ALTA, 2006).
36. Patel, M. et al. An evolutionarily conserved autoinhibitory molecular switch in ELMO proteins regulates Rac signaling. Curr. Biol.
20 (22), 2021–2027 (2010).
37. Tay, Y., Luu, A. T. & Hui, S. C. Recurrently controlled recurrent networks. In 32nd Conference on Neural Information Processing Systems (NeurIPS, 2018).
38. Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS, 2017).
39. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9 (8), 1735–1780 (1997).
40. Li, J. et al. WCP-RNN: a novel RNN-based approach for bio-NER in Chinese EMRs. J. Supercomput. 76, 1450–1467 (2020).
41. Tao, F. & Liu, G. Advanced LSTM: A study about better time dependency modeling in emotion recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2906–2910 (IEEE, 2018).
42. Li, W. et al. UD_BBC: named entity recognition in social network combined BERT-BiLSTM-CRF with active learning. Eng. Appl.
Artif. Intell. 116, 105460 (2022).
43. Ronran, C. & Lee, S. Effect of character and word features in bidirectional LSTM-CRF for NER. In 2020 IEEE International Conference on Big Data and Smart Computing (BigComp), 613–616 (IEEE, 2020).
44. Chen, Z., Ji, W., Ding, L. & Song, B. Fine-grained document-level "nancial event argument extraction approach. Eng. Appl. Artif.
Intell. 121, 105943 (2023).
45. Kang, T., Perotte, A., Tang, Y., Ta, C. & Weng, C. UMLS-based data augmentation for natural language processing of clinical research literature. J. Am. Med. Inf. Assoc. 28 (4), 812–823 (2021).
46. Liu, Y. et al. Naming entity recognition of citrus pests and diseases based on the BERT-BiLSTM-CRF model. Expert Syst. Appl. 234,
121103 (2023).
47. Zhou, S. et al. A cross-institutional evaluation on breast cancer phenotyping NLP algorithms on electronic health records. Comput.
Struct. Biotechnol. J. 22, 32–40 (2023).
48. Hinze, A., Heese, R., Schlegel, A. & Paschke, A. Manual semantic annotations: user evaluation of interface and interaction designs.
J. Web Semant. 58, 100516 (2019).
49. Rani, P. S., Suresh, R. M. & Sethukarasi, R. Multi-level semantic annotation and uni"ed data integration using semantic web
ontology in big data processing. Cluster Comput. 22 (Suppl 5), 10401–10413 (2019).
50. Liu, C. & Yang, S. A text mining-based approach for understanding Chinese railway incidents caused by electromagnetic interference. Eng. Appl. Artif. Intell. 117, 105598 (2023).
51. Chen, X. et al. Multi-target detection and tracking based on CRF network and spatio-temporal attention for sports videos. Sci. Rep.
15, 6808 (2025).
52. Wang, W. et al. Prediction model of water inrush risk level of coal seam floor based on KPCA-DBO-SVM. Sci. Rep. 15, 10393 (2025).
53. Fang, M. et al. Missing value imputation for > 2 MeV electron fluxes in geostationary orbit based on a GA-RF model. Sci. Rep. 15, 10427 (2025).
54. Lin, J., Zhao, Y., Huang, W., Liu, C. & Pu, H. Domain knowledge graph-based research progress of knowledge representation.
Neural Comput. Appl. 33, 681–690 (2021).
55. Atzeni, P., Bugiotti, F., Cabibbo, L. & Torlone, R. Data modeling in the NoSQL world. Comput. Stand. Interfaces. 67, 103149 (2020).
Acknowledgements
We gratefully acknowledge that this research is supported by the Project on the High-Quality Development and
Safety Assurance System and Key Technologies for Railways, funded by China National Railway Group Corpo-
ration Limited.
Author contributions
X. Y.: Conceptualization, Methodology, So'ware, Validation, Data Curation, Writing—Original Dra', Writ-
ing—Review and Editing. H. L.: Methodology, Formal analysis, Validation, Data Curation, Writing—Review
and Editing. Y. X., N.S. and R. H.: Formal analysis, Validation, Writing—Review and Editing.
Declarations
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to H.L.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
© The Author(s) 2025