0% found this document useful (0 votes)

5 views22 pages

A Text Mining-based Approach for Comprehensive Und

Uploaded by

jenny17191719

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

5 views22 pages

A Text Mining-based Approach for Comprehensive Und

Uploaded by

jenny17191719

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 22

www.nature.

com/scientificreports

OPEN A text mining-based approach for

comprehensive understanding
of Chinese railway operational
equipment failure reports
Xiaorui Yang, Honghui Li, Yi Xu, Nahao Shen & Ruiyi He
Railway operational equipment is crucial for ensuring the safe, smooth, and efficient operation of
trains. Comprehensive analysis and mining of historical railway operational equipment failure (ROEF)
reports are of significant importance for improving railway safety. Currently, significant challenges
in comprehensively analyzing ROEF reports arise due to limitations in text mining technologies. To
address this concern, this study leverages advanced text mining techniques to thoroughly analyze
these reports. Firstly, real historical failure report data provided by a Chinese railway bureau is used as
the data source. The data is preprocessed and an ROEF corpus is constructed according to the related
standard. Secondly, based on this corpus, text mining techniques are introduced to build an innovative
named entity recognition (NER) model. This model combines bidirectional encoder representations
from transformers (BERT), bidirectional long short-term memory (BiLSTM) networks, and conditional
random fields (CRF), with an additional entity attention layer to deeply extract entity features. This
network architecture is used to classify specific entities in the unstructured data of failure reports.
Finally, a knowledge graph (KG) is constructed using the Neo4j database to store and visualize the
extracted ROEF-related entities and relationships. The results indicate that by constructing the
topological relationships of the ROEF network, this study enables the analysis and visualization of
potential relationships of historical failure factors, laying a foundation for predicting failures and
enhancing railway safety, while also filling the current gap in the mining and analysis of ROEF reports.

Keywords Text mining, Railway operational equipment failure, BERT, BiLSTM, CRF, Knowledge graph

Railway transportation is a crucial component of modern infrastructure, demanding high levels of safety and
reliability. !e upkeep and coordinated operation of technical equipment, speci"cally operation equipment,
are fundamental to ensuring the smooth functioning of railway transport services. !e rapid expansion of
the railway network and the continual increase in operational mileage have been accompanied by a growing
intricacy in safety management of operational equipment due to the evolving nature of internal and external
operational conditions. Despite advancements in safety design, this complexity imposes a heightened risk of
operational failures within the railway system. Consequently, railway departments across China have amassed
a signi"cant volume of failure reports pertaining to operational equipment. !ese documents, rich in an array
of details that include not only timing, a#ected locations, causes, and remedial measures but also encompass
additional information, serve as a comprehensive data source for the analysis of malfunction patterns and the
prediction of future failures. However, traditional failure reports analysis approaches are heavily reliant on
expert interpretation, frequently leading to the underutilization of valuable data due to the insu$ciency of
expert experience. !e advent of digital technologies in railway operations, along with the emergence of natural
language processing (NLP) techniques, emphasizes the need for more advanced techniques for analyzing failure
texts. Within this context, designing advanced designing advanced text mining techniques not only contribute
to a more comprehensive analysis of historical failures and the exploration of interrelations among factors
linked to such failures, thereby contributing to the prediction of future failures, but also strengthen support for
maintenance and operational decision-making, including scheduling. !is advances the e$ciency and safety of
railway operations.

School of Computer Science and Technology, Beijing Jiaotong University, Beijing 100044, China. email:

[email protected]

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 1

Content courtesy of Springer Nature, terms of use apply. Rights reserved
www.nature.com/scientificreports/

At present, there has been no research focused on the mining and analysis of railway operational equipment
failures. However, some scholars have utilized text mining techniques to analyze railway accident reports for
risk assessment1. !ey introduced a BERT model to perform NER on these reports, identifying entities such as
incident names and causes. However, they de"ned only four types of accident entities and two types of entity
relationships, which is insu$cient for a comprehensive analysis of historical accident reports. Additionally, the
precision, recall, and f1 value for identifying certain classes of entities, such as accident causes and descriptions,
did not exceed 93%, indicating room for improvement.
ROEF reports texts are marked by their complexity, incorporating entities such as failure description, lines,
and categories, which present more diverse features compared to general texts. !is complexity is further
exempli"ed by the lengthier railway-speci"c terminologies used in failure descriptions. Moreover, the scarcity of
publicly accessible datasets in this "eld hinders the identi"cation process of named entities within ROEF texts.
To address the challenges mentioned above, this study e#ectively leverages the text processing capabilities of
text mining technology by applying an optimized NER model to the ROEF domain and constructing the ROEF
knowledge graph (ROEFKG). Initially, the study collects real historical failure report data provided by a Chinese
railway bureau and performs operations such as data cleaning and labeling to construct a Chinese corpus in this
"eld, thereby improving the utilization rate of knowledge. !e BERT-BiLSTM-CRF model is then optimized by
concatenating data from the BERT and BiLSTM layers, processing it with an entity attention layer that uses an
attention mechanism to extract more profound features from the preceding layers’ output labels, and reducing
the dimension with a fully connected layer based on the number of labels to provide more comprehensive
data for the CRF layer. Additionally, a dropout regularization technique is employed during model training to
enhance its generalization ability. !is novel NER model extracts essential information such as the time of failure
occurrence, line, train number, failure description, reason of failure, corrective measures taken, failure location,
failure category, responsible system, and the e#ect of the failure. Comparative evaluations of precision, recall,
and f1 value demonstrate that our model achieves superior results on the provided dataset. Finally, the causal
transmission paths among entities were standardized, leading to the establishment of the ROEFKG model. !is
model reveals the interconnections among historical failure-related entities, thereby laying a foundation for fault
prediction and enhancing railway operational safety.
!e remaining structure of this paper is organized as follows: Sect.%“Related work” reviews relevant literature
and previous research. Section% “NER algorithm module” delves into a comprehensive description of the key
methodology employed. Section%“Experimental settings” discusses data collection, preprocessing, annotation, the
de"nition of entities and relationships, and details the experimental environment setup. Section%“Experimental
results and analysis” is dedicated to an experimental comparative analysis to evaluate the model’s e#ectiveness.
Section%“Visualized results: knowledge graph construction” focuses on the creation of a Neo4j
database and explicates the development of the ROEFKG. Section% “Conclusion” concludes the paper by
summarizing the "ndings.

Related work
KGs are characterized as data graphs that amass and convey knowledge pertaining to the real world2. Entities
of interest are represented as nodes within these graphs, while the relationships between entities are depicted
by edges3,4. !ese representations leverage formal semantics, enabling e$cient and unambiguous processing by
computers. Due to their signi"cant role in processing heterogeneous information within a machine-readable
context, substantial and ongoing research e#orts have been dedicated to KGs in recent years5. !e proposed
KGs have found widespread adoption in various AI systems6,7including recommender systems, question
answering, and information retrieval. Furthermore, they have been extensively applied across diverse domains
(e.g., education8 and healthcare9 to improve human life and societal well-being10,11. !e structured information
management and visualization capabilities of KG also aid in application development and platform design12.
In recent advancements, KG have also been strategically integrated into safety analysis research to facilitate
knowledge modeling and risk management endeavors. For instance, A community gas-safety risk-prediction
method is introduced by Zhang et al.13aimed at addressing the intricate and ongoing factors pertaining
to community gas safety, utilizing temporal KG. Gan et al.14 proposes the integration of multiple sources of
knowledge in &ag state control detection using emerging KG technology. Additionally, Mao et al.15 engineered
a semi-automatic KG development solution tailored for process safety within the chemical industry. !ese
endeavors underscore the critical role and application of KG technology in promoting safety management and
risk assessment in various industries beyond the realm of transportation.
To date, scholarly endeavors have been directed towards researching knowledge aspects of ROEF. Liu et
al.16 suggested a KG-based method for mining railway operational accidents, primarily focusing on the British
railway dataset with a limited scope of data coverage. Lin and Wang17 employed text mining techniques to extract
the causes of railway switch failures. Sobrie et al.18 utilized deep learning techniques for real-time prediction
of railway delays. Lin et al.19 applied NER from NLP to identify entities in railway signal equipment failure
information. !ese studies, while aligned with our research direction, focused on uncovering latent dangers and
risks within the railway operation process through the analysis of relevant report texts. However, they fell short
of presenting a comprehensive construction of the knowledge system and the visualization of knowledge links,
thereby limiting the provision of intuitive and e$cient decision support to "eld sta#.
NER is a direct approach to complete knowledge discovery from text data, has been recently widely applied in
geology20–22agriculture23–25medicine26–28and other "elds. NER is a pivotal task in NLP29aiming to identify and
categorize key information units termed named entities (NEs) from textual data. !ese entities fundamentally fall

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 2

Content courtesy of Springer Nature, terms of use apply. Rights reserved
www.nature.com/scientificreports/

into two broad classi"cations: generic NEs, such as persons, locations, and organizations, and domain-speci"c
NEs, encompassing specialized terminologies30. !e identi"cation of NEs serves as a cornerstone for numerous
NLP applications, including relation extraction31–33machine translation34and question answering35thereby
marking NER as a critical area of research within the "eld. In contrast to traditional research paradigms, the
emphasis is placed on failure-factor-oriented failure-related entities, which are distinct from conventional
entities such as individuals or geographic locations. Consequently, recognition is not only facilitated by the
algorithm designed but also an understanding of the underlying semantics is compassed. To sum up, in this
paper, we build a NER (in a broad sense) algorithm model by de"ning ROEF-related entities containing the
characteristics of text reports to achieve in-depth mining of failure information.

NER algorithm module

The overarching architecture of the model
In this study, the text sequences are preprocessed for NER using the BERT model. Rich features at the word
level, encompassing syntax and semantics, are extracted from the sequences and subsequently transformed into
corresponding word vectors by the BERT model. !ese vectors undergo further processing in a BiLSTM layer,
where extensive contextual information surrounding the sequence is extracted.
To enhance the representation of entity-related features, an entity attention layer is introduced a'er the
BiLSTM layer. !is layer dynamically assigns attention weights to di#erent entity representations, prioritizing
the most informative features while "ltering out irrelevant noise. By incorporating entity-level attention, the
model e#ectively re"nes the extracted features and improves entity recognition accuracy. !e outputs of the
BERT layer and BiLSTM layer are concatenated before being fed into the entity attention mechanism, ensuring
a more "ne-grained and semantically enriched representation of entities.
!e enhanced feature representations generated by the entity attention layer are then input into a fully
connected layer for dimensionality, streamlining the data for subsequent encoding and analysis. In the "nal stage,
a CRF layer is employed to decode the processed information and generate an optimal sequence of predicted
labels, e#ectively representing the identi"ed entities within the text. !e architectural framework of the model is
meticulously illustrated in Fig.%1, providing a visual representation of the sequential &ow and integration of the
various components and layers constituting the model.
!e data utilized in this study were sourced from a Chinese railway bureau, comprising partial ROEF reports
recorded between January 2018 and June 2023. !ese authentic historical data re&ect the operational realities
of the Chinese railway system, characterized by their high relevance and coherence to this study, e#ectively
supporting the analysis presented herein. !e BERT model, as a pretrained model, excels at extracting inter-
sentence information within the corpus, thereby endowing the model with robust semantic comprehension
capabilities. !e BiLSTM component e#ectively mitigates the loss of unidirectional information by thoroughly

Fig. 1. !e overall framework of the model.

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 3

Content courtesy of Springer Nature, terms of use apply. Rights reserved
www.nature.com/scientificreports/

considering the contextual nuances of the data, thus elevating the model’s accuracy. Additionally, the CRF
layer facilitates the prediction of optimal label sequences by accounting for label dependencies based on ample
contextual information, signi"cantly heightening the accuracy and robustness of the model.

BERT layer
In comparison with earlier models such as ELMO36 and openAI-GPT37BERT distinctively adopts a 12-layer
encoder from the transformer architecture38 as its fundamental component for extracting context information
from both preceding and succeeding texts to derive word vectors. !e core design of BERT is centered around
the adoption of Masked Language Modeling (MLM) as an e#ective strategy for learning the contextual
semantics within a corpus. Predominantly structured around the Transformer’s encoders, this pretrained model,
speci"cally the 12-layer BERT BASE model mentioned in this study, comprises a sequence of 12 Encoders.
!e Transformer, characterized by its utilization of the attention mechanism, outlines a profound network
architecture as depicted in Fig.%2, facilitating the extraction of semantic relationships from input text sequences
with noteworthy e$ciency.
In the Encoder’s architecture, the attention mechanism is identi"ed as a critical component, where the weight
coe$cient is dynamically adjusted according to the correlation degree among words within a sentence, enabling
the acquisition of the "nal word’s representation, as depicted in Eq.%(1):
! "
QK T
Attention (Q, K, V ) = Soft max √ V (1)
dk

where Q, K , and V are denoted as word vector matrices, with dk representing the Embedding dimension. !e
Encoder’s multi-head Attention mechanism involves mapping Q, K , and V through multiple distinct linear
transformations and subsequently splicing di#erent attentions together, as demonstrated in Eq.%(2) ~ (3):
! "
headi = Attention (Qi , Ki , Vi ) = Attention QWiQ , KWiK , V WiV (2)

Multi Head (Q, K, V ) = head1 ⊕ head2 ⊕ . . . ⊕ headi (3)

Where W is recognized as the weight matrix, and ⊕ indicates splicing h matrices.

!e BERT model, for processing the input character sequence, conducts word segmentation based on the
predetermined maximum sequence length, appending the [CLS] label at the sequence’s commencement and the
[SEP] label between two sentences. !e Embedding of three vectors Token embedding, Segment embedding,
and Position embedding is utilized to output the model’s input sequence word. Next, feature extraction is
performed through a bidirectional Transformer, thus enabling the extraction of vector representations with rich
semantic features for each input word. Following the acquisition of vector representations for each word within

Fig. 2. BERT model structure.

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 4

Content courtesy of Springer Nature, terms of use apply. Rights reserved
www.nature.com/scientificreports/

the sentence, the sequence of word vectors generated by the BERT layer is then provided as input to the second
BiLSTM module for semantic encoding.

BiLSTM layer
Long-Short Term Memory (LSTM)39 is recognized as a variant of Recurrent Neural Network (RNN)40. !e
achievement of sustaining long-term dependencies is enabled through the adept incorporation of gating
mechanisms, e#ectively mitigating the challenges associated with gradient explosion and vanishing encountered
during the training phase of RNNs. !e principal components constituting the LSTM architecture encompass
the forget gate, input gate, output gate, and the memory cell. !e regulation of information &ow through the gate
function is a#ected by the LSTM, promoting the realization of both long and short-term memory as it extracts
information from sequences, as shown in Fig.%3. ∼
In LSTM models, the composition includes the input word xt , cell state ct , temporary cell state c t , hidden
state ht , forget gate ft , input gate it , and output gate ot . Within the context of NER, the forget gate is utilized
for the selection of recognized information, the input gate is employed to determine the information to be stored
in the cell state. It is determined by the input word at the current moment xt and the hidden state from the
previous time step ht−1 , as depicted in Eq.%(4) ~ (5)41:
ft = σ (wf • [ht−1 , xt ] + bf ) (4)
it = σ (wi • [ht−1, xt ] + bi ) (5)

In the given context, σ is represented as the sigmoid activation function. By integrating the aforementioned
equations with the cell state from the previous moment, the current moment’s cell state Ct can be obtained, as
shown in Eq.%(6)–(7):
∼
ct = tan(w • [ht−1 , xt ] + b) (6)
∼
ct = ft × ct−1 + it × ct (7)

In the described framework, the activation function used for the temporary cell state is characterized by the tanh
function, the forget gate ft is entrusted with the responsibility of regulating which segments of information from
the previous cell state ct−1 should be preserved. !is is instrumental in the calculation of the current cell∼state
ct . Furthermore, the input gate is designated to dictate which characteristics of the temporary cell state c t are
imparted onto the current cell state ct .

Fig. 3. BiLSTM structure drawing.

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 5

Content courtesy of Springer Nature, terms of use apply. Rights reserved
www.nature.com/scientificreports/

Algorithm 1Entity attention layer for fault diagnosis.

Subsequently, the values pertaining to the output gate ot and the state of the hidden layer ht are derived
from the cell state at the contemporaneous moment. !ese derivations are encapsulated in Eqs.% (8) and (9),
illustrating the mathematical relationship and the process through which the values are ascertained.
ot = σ (wo · [ht−1 , xt ] + bo ) (8)
ht = ot ∗ tan (ct ) (9)

.
Within the speci"ed formulas, the weight coe$cients are denoted by wf , wi , w, and wo for the forget gate,
input gate, temporary cell state, and output gate, respectively. Correspondingly, bf , bi , b, and bo , serve as the
o#set vectors for each respective component.
To sum up, the LSTM provides a robust framework for modeling time-series data and sequences with its
in-built mechanisms for long-term dependency learning. Building upon the foundational principles of LSTMs,
the BiLSTM extends this architecture by deploying two separate LSTM layers that process the input sequence in
both forward and backward directions42. !is bidirectional processing is rendered particularly e#ective for tasks
wherein the context of both preceding and subsequent elements is crucial for accurate predictions. As a result, in
this study, the BiLSTM is adopted as a component of the NER model for ROEF reports.

Entity attention layer

!e entity attention layer is designed to enhance NER performance, especially when dealing with imbalanced fault
category datasets. By dynamically adjusting attention weights for di#erent entity tokens, this layer strengthens the
model’s ability to capture critical entity-speci"c features while "ltering out less relevant contextual information.
!e entity attention layer operates on the concatenated output of the BERT and BiLSTM layers, re"ning the
representation of pre-identi"ed entity tags. Speci"cally, the mechanism assigns higher attention scores to key
entity-related words while reducing the in&uence of non-entity tokens. !is selective focus helps to amplify the
features of rare fault categories, thereby mitigating the bias toward frequently occurring entities.
Mathematically, given an input sequence representation H from the BiLSTM layer, the attention mechanism
computes the entity-aware weight distribution as follows:

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 6

Content courtesy of Springer Nature, terms of use apply. Rights reserved
www.nature.com/scientificreports/

exp(W hi + b)
αi = !N (10)
j=1
exp(W hj + b)

where α i represents the attention weight assigned to token i, hi is the hidden state vector of token i, and are
learnable parameters, N is the sequence length. !e "nal entity-enhanced representation, H ! , is then obtained
as:
!
H! = N
i=1 α i hi (11)

!is re"ned representation is then passed through a fully connected layer for dimensionality reduction before
being processed by the CRF layer for optimal entity classi"cation.
!e introduction of the entity attention layer substantially enhances NER performance, particularly in
scenarios with imbalanced fault category distributions.

(1) Enhancing rare entity recognition.

By dynamically assigning higher weights to distinguishing features of underrepresented fault entities, the model
mitigates the issue of class imbalance and reduces misclassi"cation errors. For instance, in a dataset where “电缆
芯线断线” (cable core disconnection) appears far less frequently than “道岔故障” (turnout failure), the entity
attention mechanism ampli"es the feature importance of rare entities, improving their recognition without
compromising the detection of more frequent categories.

(2) Improving contextual representation.

!is layer re"nes entity representations based on surrounding context, leading to more accurate semantic
disambiguation across di#erent fault types. For example, when processing “signal light blackout”, the attention
mechanism prioritizes correlations with “power failure” rather than unrelated words in the sequence. Such
context-aware adjustments enhance the model’s generalization ability.

(3) Noise reduction and robust feature selection.

By selectively "ltering out non-entity noise, the attention mechanism mitigates false positives. In a sentence like
“A'er the train passed, the signal light went out”, the model disregards non-relevant contextual phrases (e.g.,
“A'er the train passed”) and focuses on core diagnostic entities, improving precision.

(4) Empirical performance gains on imbalanced datasets.

!e introduction of the entity attention layer leads to signi"cant empirical performance gains in NER tasks,
particularly for unbalanced fault category datasets. By assigning greater attention to underrepresented entities,
the model e#ectively improves recall for rare fault categories while maintaining a balance between precision and
recall, thereby enhancing thef1 score. Additionally, the selective focus on critical entity features helps the model
converge faster, reducing the risk of over"tting and ensuring robust generalization across di#erent fault types.
Experimental results (detailed in Sect.%5) further con"rm these improvements, demonstrating a notable increase
in recall for rare entities, a balanced precision-recall tradeo#, and an overall enhancement in thef1 score.
By addressing the data imbalance issue directly through context-aware attention weighting, the Entity
Attention Layer ensures a more reliable and equitable entity recognition performance across all fault categories,
including those critical to railway safety.

CRF layer
In the task of NER, it has been observed that the BiLSTM model exhibits pro"ciency in handling long-distance
textual information; however, its capability to address dependencies between adjacent tags remains inadequate43.
!e de"ciency of BiLSTM in managing tag adjacency is e#ectively compensated by the employment of CRF,
which are capable of deriving an optimal prediction sequence through the analysis of relationships between
adjacent tags. !e application of CRF to enhance the LSTM network model has been demonstrated to accomplish
a signi"cant feature matching capability, as evidenced in the domains of "nance44medicine45and agriculture46.
Consequently, the introduction of CRF for the optimization of the aforementioned models is proposed, aiming
to further elevate the performance of advanced text embedding and model training.
CRF are recognized as models for conditional probability distributions, assigned to generate output
sequences based on a given set of input sequences. !is method has gained prominence as a quintessential
technique for addressing challenges within the domain of NLP47. In the framework of CRF, vertices are utilized to
symbolize random variables, with their interrelations being denoted by edges, thereby composing an undirected
graph. When a particular text sequence O = {O1 , O2 , . . . , OT } along with its associated tag sequence
S = {x1 , x2 , . . . , xT } is provided, the probability of the state sequence is determined by the Eq.%(12):
!" " #
1
P (S|O) = exp T
k λ k fk (xi , Oi ) (12)
Z (O)
i=1

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 7

Content courtesy of Springer Nature, terms of use apply. Rights reserved
www.nature.com/scientificreports/

(10) in which (Z(O)) acts as the normalization factor, fk (xi , Oi )is known as the state feature function, as
depicted in equations (13):
!
1, Oi = word and xi = tag
fk (xi , Oi ) = 0, else (13)

.
λ k represents the respective correlation weight. !e objective of this modeling approach is to systematically
evaluate the probable outcomes of tagging sequences, given a sequence of text inputs.
Ultimately, the outcome of entity recognition is ascertained by identifying the tag sequence that achieves the
maximum probability, expressed as Eq.%(14):
S ∗ = args max {P ( S |O)} (14)

.
!is modeling choice underscores CRF’s capacity to intricately analyze input sequences and their
corresponding tag sequences, seeking to forecast outcomes with precision. Such functionality accentuates the
utility of CRF in pioneering advancements within the sphere of NLP, fostering a profound understanding and
application of language processing techniques.

Experimental settings
Data acquisition and preprocessing
!is study focuses on NER for ROEF. Following the Chinese ROEF classi"cation standards, the dataset was
systematically categorized. Table% 1 presents the 16 fault categories and their corresponding data quantities,
covering 1,690 fault reports.
!e raw dataset contains multiple inconsistencies and noise, making it unsuitable for direct NER application.
!e primary challenges observed in the raw text include:

• Extraneous characters and erroneous word entries, which introduce noise and a#ect recognition accuracy.
• Unstructured sentence formatting, making entity segmentation di$cult.
• Overly lengthy and complex fault descriptions, complicating model processing.

To address these issues, we implemented a structured data preprocessing pipeline, consisting of data cleaning
and sentence segmentation, ensuring high-quality input for NER.

(1) Handling missing values and format standardization.

Certain reports contained incomplete records or inconsistent terminology, which required standardization.
Raw input example: A'er the train arrived, the signal light did not turn on. Reason: missing. Handling
method: not recorded.
Processed output: A'er the train arrived, the signal light did not turn on. Reason: unknown. Handling
method: not provided.
Here, " missing” is replaced with " unknown” to maintain consistency, and “not recorded” is rephrased as “not
provided” to ensure uniform terminology.

(2) Removing irrelevant characters and erroneous entries.

!e raw dataset contained title numbers, redundant punctuation, and irrelevant metadata that could interfere
with entity extraction.
Raw input example: Report No.: 2023A06 !e train signal light went out, and the station attendant reported
that the fault was not cleared. @Error Code #232.
Processed output: !e train signal light went out, and the station attendant reported that the fault was not
cleared.
Removal of metadata, special symbols (@, #, []), and redundant identi"ers ensures that only meaningful
content is retained for NER processing.

(3) Sentence segmentation and logical restructuring.

Many reports contained long, unstructured descriptions with unclear sentence boundaries, making text parsing
challenging.
Raw input example: During train operation, the signal light suddenly went out. Upon inspection, a power
supply anomaly was detected. !e station attendant reported that the turnout signal light was unresponsive.
Processed output: During train operation, the signal light suddenly went out. Upon inspection, a power
supply anomaly was detected. !e station attendant reported that the turnout signal light was unresponsive.
Here, proper sentence boundaries were restored using punctuation, guided by fault-related keywords.
To further illustrate the impact of preprocessing, Fig.% 4 compares a raw fault report (Fig.% 4a) with its
preprocessed version (Fig.%4b). !e structured text output ensures cleaner and more standardized input for the
NER model, leading to improved accuracy in entity recognition. !ese preprocessing techniques lay a solid
foundation for reliable NER performance and subsequent fault diagnosis.

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 8

Content courtesy of Springer Nature, terms of use apply. Rights reserved
www.nature.com/scientificreports/

Fig. 4. Preprocessing comparison.

Define the entity type and their relationships

Semantic annotation, acknowledged as the process by which texts are marked to enable the automatic
identi"cation of named entities48is identi"ed as essential for NER. It plays a vital role in determining and re"ning
the composition of entities to be annotated. !is process is noted for enhancing the "tting ability of supervised
learning through iterative optimization, crucial for extracting valuable knowledge from big data in the context
of semantic web data management49. In the NER sequence labeling process, sequence labeling methods such
as BIO, BMES, and BIOES are employed. In this work, the BIO format is used to annotate entities in sentences,
wherein the initial word of an entity is marked as ‘B’, subsequent words within the entity are tagged as ‘I’, and
words not part of an entity are denoted as ‘O’. Following text annotation, computers are enabled to recognize the

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 9

Content courtesy of Springer Nature, terms of use apply. Rights reserved
www.nature.com/scientificreports/

Failure category Quantity

G1 locomotive failure 227
G2 vehicle failure 40
G3 trainset failure 98
G4 failures of railway ferry equipment 0
G5 failures of self-propelled specialized transport equipment 7
G6 line, bridge, and tunnel equipment failure 99
G7 signaling equipment failure 613
G8 communication equipment failure 80
G9 power supply equipment failure 108
G10 water supply equipment failure 1
G11 information system equipment failure 2
G12 end-of-train device failure 184
G13 monitoring and surveillance equipment failure 44
G14 damage to line safety protective equipment and facilities 17
G15 water damage, landslides, rockfalls, and tree falls 167
G16 other equipment failure 3

Table 1. !e quantities of various types of ROEF documented.

category of each word for processing. A total of 15,461 entities were annotated, highlighting ten main features
of ROEF, including time, line, train number, failure description, failure cause, measures, failure location, failure
category, responsibility system, and failure impact, as presented in Table%2. For complex entities, multiple entities
are annotated separately. For example, the entity “Li-Qin Line 20111“(Li(B-LIN)Qin(I-LIN)Line(I-LIN)2(B-
NUM)0(I-NUM)1(I-NUM)1(I-NUM)1(I-NUM)number(I-NUM)), “Li-Qin Line” and “Line 20111” are tagged
as Failure Line and Train Number respectively, and in the experiments the complex entities are also matched
individually, while the segments of the text not constituting named entities are designated as ‘O’.
Further, the construction of a relationship link encompassing various entities is de"ned as Fig.%5 to facilitating
a comprehensive understanding of the interconnections between the identi"ed entities.
Ultimately, the dataset is partitioned in a ratio of 8:1:1, a strategic division designed to allocate the data for
training, validation, and testing purposes, respectively. !is prepared dataset is then inputted into the NER
model. Analogous to the text preprocessing procedure employed for reports detailing signal failure events within
the Chinese railway system50this manuscript presents a visual depiction of the entire preprocessing sequence, as
illustrated in Fig.%6. !is "gure serves as an informative guide delineating the steps requisite for the preparation
of data prior to training NER models.

Training environment and parameters configuration

!e experimental environment utilized for our research is detailed as follows. !e implementation of our work
was conducted on a single R TX4090 GPU equipped with 24GB of display memory. !e so'ware environment
supporting our experiments includes Visual Studio version 1.83.1 and torch version 1.11.0. Additionally, the
model presented within this paper is constructed using the torch.nn module, which is a part of the torch
framework. A comprehensive overview of our training environment’s con"guration is provided in Table%3. !is
con"guration has been carefully selected to ensure the optimal performance of our model.
In order to ensure the integrity and reliability of the experimental outcomes, the training will be conducted
using "xed experimental parameters, as detailed in Table%4. !ese parameters include: optimizer, which speci"es
the optimization algorithm used; last_state_dim, denoting the output dimensionality of the BERT model’s "nal
layer; max_seq_length, indicating the maximum sequence length permissible; learning_rate, de"ning the step
size at each iteration to progressively minimize the loss function; lstm_units, referring to the dimensionality of
the output space of the BiLSTM network’s layers; batch_size, representing the quantity of data processed by the
model in a single iteration; drop_rate, denoting the probability at which elements are randomly omitted from
the input units during training; and epoch, which refers to the number of complete passes through the training
dataset.

Experimental results and analysis

Comparison of deep learning models
!e study compared "ve models: BERT, BERT-CRF, BiLSTM-CRF, BERT-BiLSTM-CRF, and DistilBERT-
BiLSTM-CRF, using the same dataset for training and testing. Figure%7 presents the comparative results of these
"ve models in identifying various entity types, focusing on precision, recall, and f1 value. !e training results
indicate that BERT-BiLSTM-CRF achieves the highest precision, recall, and f1 value for each entity category,
ranking "rst in all evaluation metrics for both macro-average and micro-average. !us, BERT-BiLSTM-CRF
demonstrates the best overall performance in entity recognition. Despite being a distilled version of BERT,
the DistilBERT model did not exhibit superior recognition advantages in this study, and its combined model
performance is inferior to that of the BERT model.

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 10

Content courtesy of Springer Nature, terms of use apply. Rights reserved
www.nature.com/scientificreports/

Table 2. Type and number of ROEF entity. !e "rst column “tags” are abbreviations used in the tagging
process.

Fig. 5. Entity-relationship link graph.

The optimized model is compared with the basic model

Overall performance comparison
!e optimized BERT-BiLSTM-CRF model is compared with the basic model that does not employ dropout
regularization technique and entity attention strategy, with the results presented in Fig.%8. Figure8a–c) illustrate
the performance comparison between the optimized BERT-BiLSTM-CRF and the basic model. Experimental
results indicate that the optimized BERT-BiLSTM-CRF model outperforms the basic model in terms of precision,
recall, and f1 value. Although the recall value of the NUM entity is slightly lower compared to the basic model,
the optimized model demonstrates superior recognition performance for other entities.
Figure (8d) illustrates the loss functions of both models, showcasing that the optimized model converges faster
and achieves smaller loss values, which indicates improved training stability and e$ciency. !e results further
demonstrate that the entity attention layer enables deeper feature extraction on entity tags, thereby providing
richer textual information for the CRF decoding process. Moreover, the integration of dropout regularization
during model training mitigates over"tting issues and enhances the model’s generalization ability.
One critical challenge in NER tasks for railway failure reports is the class imbalance issue. Certain entity
types, such as “MEA” (measure) and “DES” (failure description), appear more frequently, whereas others have
limited labeled samples. !is imbalance o'en leads to low recall rates for underrepresented entities, as the model
tends to favor high-frequency categories.

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 11

Content courtesy of Springer Nature, terms of use apply. Rights reserved
www.nature.com/scientificreports/

Fig. 6. Text preprocessing of Chinese ROEF reports.

Operating System Windows 10

CPU Intel(R) Xeon(R) Platinum 8352%V
GPU RTX 4090
python 3.10.13
torch 1.11.0
Visual Studio 1.83.1
numpy 1.26.3

Table 3. Training environment con"guration.

Parameter name Parameter value

optimizer Adam
last_state_dim 768
max_seq_length 500
learning_rate 3e-5
lstm_units 384
batch_size 8
drop_rate 0.1
epoch 20

Table 4. Parameter value setting.

!e Entity Attention Layer plays a vital role in mitigating this issue by enhancing the representation of low-
frequency entities and allowing the model to learn more informative features from their surrounding context.
Speci"cally:

(1) Improved recall for low-frequency entities.

!e recall rate for “MEA” increased from 84.02% (baseline model) to 97.62% (optimized model) (+ 16.18%).
!e recall rate for “DES” improved from 92.90 to 97.02% (+ 4.43%). !is substantial improvement highlights the
e#ectiveness of the entity attention layer in addressing data imbalance by capturing richer semantic dependencies
within the context.

(2) Enhanced feature extraction for low-frequency entities.

!e entity attention layer selectively emphasizes relevant entity features, reducing the impact of dominant high-
frequency categories and ensuring better contextual di#erentiation for underrepresented entity types.

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 12

Content courtesy of Springer Nature, terms of use apply. Rights reserved
www.nature.com/scientificreports/

Fig. 7. Comparative analysis of entity recognition models.

(3) Integration with dropout regularization to improve generalization.

Dropout regularization helps prevent over"tting to high-frequency entities, ensuring that all entity types are
adequately represented in the learned embeddings.

Model performance on low-frequency failure categories

To investigate the impact of data imbalance on model generalization, an entity recognition experiment was
conducted, focusing speci"cally on low-frequency failure categories (G4, G5, G10, G11, G14, and G16). !e
objective of this experiment is to evaluate the performance of the BERT-BiLSTM-CRF (optimized) model on
these underrepresented failure types and assess its ability to maintain robust generalization despite the inherent
data imbalance.
We extracted a subset of the dataset containing only failure reports from G4, G5, G10, G11, G14, and G16,
where entity annotations were retained for all ten entity types. Since these categories have signi"cantly fewer
samples than others, they serve as a representative low-resource scenario, allowing us to analyze the model’s
robustness in recognizing rare entity types.
!e recognition results of the baseline and optimized models on low-frequency failure categories are
presented in Tables%5 and 6, respectively. Analyzing these results reveals the following insights:

(1) Signi"cant performance improvement with entity attention mechanism.

!e optimized model (BERT-BiLSTM-CRF with entity attention) exhibits a substantial improvement in

precision, recall, and f1 score across all entity types compared to the baseline model. !e increase in recall,
particularly for entities such as CAT, DES, and REA, suggests that the model e#ectively mitigates the issue
of under-detection of rare entities. !is improvement highlights the ability of the entity attention mechanism
to enhance the representation of low-frequency entity types, thereby improving the model’s robustness in
imbalanced datasets.

(2) Better generalization to underrepresented failure categories.

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 13

Content courtesy of Springer Nature, terms of use apply. Rights reserved
www.nature.com/scientificreports/

Fig. 8. Comparison of optimized BERT-BiLSTM-CRF with basic model.

!e optimized model consistently achieves high recall values across all entity types, indicating its capacity to
capture entity occurrences even in failure categories with limited training samples. For instance, the recall of REA
increases from 0.6641 to 0.9768, demonstrating a notable enhancement in recognizing rare entity mentions. !is
suggests that the model successfully generalizes to low-resource scenarios by leveraging contextual information
more e#ectively.

(3) Enhanced recognition of critical entity types.

Certain entity types, such as EFF and LIN, showed signi"cant improvements inf1 score, increasing from 0.7444
to 0.9793 and 0.7157 to 0.9769, respectively. !ese entities are crucial for failure analysis, and their improved
recognition ensures that key failure characteristics are better captured. !e increase in precision and recall for
these entities further con"rms that the proposed enhancements improve the extraction of critical information,
even when training data is sparse.

(4) Balanced precision and recall across entity types.

Unlike the baseline model, where some entities exhibited a signi"cant gap between precision and recall, the
optimized model achieves a more balanced performance. For example, in the baseline model, NUM had
a relatively low recall of 0.6787, while the optimized model increased it to 0.8651, ensuring a more stable
performance across di#erent entity categories. !is balance reduces the likelihood of the model being overly
conservative or aggressive in recognizing certain entities, which is crucial for handling real-world failure reports.

(5) Demonstration of model robustness in imbalanced data.

!e overall increase inf1 score across all entities con"rms the e#ectiveness of our proposed approach in handling
data imbalance. Since failure reports from G4, G5, G10, G11, G14, and G16 represent a low-resource scenario,
the strong performance of the optimized model demonstrates its robustness and adaptability when dealing with
underrepresented categories. !is suggests that even with limited training samples for speci"c failure types, the
model maintains a high level of recognition accuracy, reinforcing its generalization ability across diverse failure
scenarios.
!e above "ndings con"rm that the proposed entity attention mechanism signi"cantly improves the model’s
capability to recognize entities in imbalanced datasets. By enhancing the representation of low-frequency
entities, the optimized model not only outperforms the baseline in low-resource scenarios but also ensures a
more balanced, accurate, and generalizable entity recognition process for railway failure reports. !is e#ectively
addresses concerns regarding the impact of dataset imbalance on the model’s performance and provides strong
evidence of its applicability in real-world railway maintenance and fault diagnosis tasks.

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 14

Content courtesy of Springer Nature, terms of use apply. Rights reserved
www.nature.com/scientificreports/

Entity Precision Recall F1

CAT 0.7682 0.7014 0.7333
DES 0.7921 0.7246 0.7569
EFF 0.7810 0.7112 0.7444
LIN 0.7543 0.6805 0.7157
LOC 0.7679 0.6998 0.7322
MEA 0.8026 0.7302 0.7648
NUM 0.7502 0.6787 0.7128
REA 0.7324 0.6641 0.6966
SYS 0.7784 0.7051 0.7403
TIM 0.7652 0.6925 0.7272

Table 5. Baseline model performance (BERT-BiLSTM-CRF).

Entity Precision Recall F1

CAT 0.9825 0.9701 0.9762
DES 0.9623 0.9504 0.9563
EFF 0.9882 0.9689 0.9793
LIN 0.9851 0.9717 0.9769
LOC 0.9904 0.9305 0.9809
MEA 0.9482 0.9721 0.9393
NUM 0.9860 0.8651 0.9790
REA 0.8923 0.9768 0.8785
SYS 0.9911 0.9768 0.9839
TIM 0.9885 0.9753 0.9819

Table 6. Optimized model performance (BERT-BiLSTM-CRF with entity attention).

Evaluation metrics and performance analysis

!e Table% 7 presents the evaluation metrics for entity recognition using an optimized BERT-BiLSTM-CRF
model, demonstrating its superior performance across several key indicators. !e micro-average scores show
high precision (98.51%), recall (98.99%), andf1 score (98.75%), indicating the model’s exceptional capability
to accurately identify entities throughout the dataset. Similarly, the macro-average metrics, with precision
at 98.49%, recall at 98.96%, and anf1 score of 98.72%, illustrate the model’s consistent performance in entity
recognition across di#erent categories, ensuring a balanced approach among classes of varying representation
levels. In addition, the weighted average scores—precision at 98.55%, recall at 98.99%, andf1 score at 98.76%—
further underscore the optimized model’s robustness, slightly improving precision while maintaining high recall
and f1 value.
!ese results reinforce the e#ectiveness of the optimized Entity Attention Layer, which enables a more
balanced recognition of di#erent entity types, preventing the model from being biased toward dominant
categories while preserving high recall for low-frequency entities.

Confusion matrix analysis

Figure 9 presents the confusion matrix of the optimized BERT-BiLSTM-CRF model, re&ecting its performance
on the test set. !e dataset described in Table% 2 is split into training, validation, and test sets in an 8:1:1
ratio, meaning the test set contains 10% of the total data instances. !e confusion matrix illustrates the error
distribution among di#erent entity types, demonstrating that the majority of predicted results align closely with
actual labels, con"rming the model’s robustness. Notably, the top three tags with the highest error rates are
identi"ed as “MEA”, “REA” and “DES”, with error rates of 1.80, 1.39, and 1.18%, respectively.
!e presence of higher error rates in MEA and REA is attributed to their data imbalance, as they constitute
a signi"cant portion of railway failure descriptions. However, compared to the baseline model, the optimized
model signi"cantly reduces error rates for these entity types, validating the e#ectiveness of the entity attention
layer in improving entity-level feature extraction and classi"cation accuracy.

Comparative study with traditional machine learning models

To ensure an unbiased and reproducible evaluation, all models were trained and tested on the same dataset,
employing identical preprocessing pipelines and hyperparameter tuning strategies. Traditional models leveraged
manually cra'ed linguistic features, including word n-grams, part-of-speech (POS) tags, and character-level
embeddings, while hyperparameters were systematically optimized via grid search: CRF51 employed a maximum
entropy framework, SVM52 used an RBF kernel, and RF53 was set with 100 decision trees.

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 15

Content courtesy of Springer Nature, terms of use apply. Rights reserved
www.nature.com/scientificreports/

Evaluation index Precision Recall F1

micro avg 0.9851 0.9899 0.9875
macro avg 0.9849 0.9896 0.9872
weighted avg 0.9855 0.9899 0.9876

Table 7. Summary of micro, macro, and weighted averages for optimized model performance evaluation.

Fig. 9. Confusion matrix of the optimized model. Each row of the matrix represents the actual entity type,
while each column represents the predicted entity type. !e diagonal cells indicate correctly classi"ed
instances, and the o#-diagonal cells re&ect misclassi"cations. As observed, most predictions are concentrated
along the diagonal, con"rming high accuracy. However, the confusion between MEA and REA suggests some
overlap in linguistic features, which may contribute to the misclassi"cation. !e relatively low error rates across
all tags further validate the e#ectiveness of the optimized model.

!e results in Table% 8 indicate that traditional machine learning models, particularly SVM and CRF,
demonstrate competitive performance, achieving f1 score above 0.81. However, these approaches heavily rely on
feature engineering and struggle with complex linguistic patterns that require deeper contextual understanding.
!e proposed optimized BERT-BiLSTM-CRF model outperforms all baselines, achieving an f1 score of 0.9875,
with statistically signi"cant improvements (p < 0.01, paired t-test).
!e superior performance of our model can be attributed to transformer-based contextual embeddings,
which dynamically capture semantic relationships between words, unlike traditional models that rely on static,
handcra'ed features. Furthermore, the bi-directional structure of BiLSTM enhances sequence modeling, while
the CRF layer re"nes entity boundaries through global inference, making the model particularly e#ective for
domain-speci"c NER tasks.
!is comparative study underscores the signi"cant advantages of the optimized BERT-BiLSTM-CRF model
over traditional machine learning approaches. While CRF, SVM, and RF remain viable for general NER tasks,
their reliance on feature engineering and limited contextual awareness restricts their e#ectiveness in complex,
real-world scenarios. In contrast, our model leverages deep contextual embeddings and structured sequence
modeling, yielding state-of-the-art performance.
!ese "ndings suggest that the proposed approach can be e#ectively deployed in real-world railway fault
diagnosis systems, enabling automated, high-accuracy entity recognition.

Computational efficiency analysis

To comprehensively evaluate the computational cost of the proposed BERT-BiLSTM-CRF (optimized) model,
we compare its training time, parameter size, and GPU memory consumption against various baseline models,

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 16

Content courtesy of Springer Nature, terms of use apply. Rights reserved
www.nature.com/scientificreports/

Precision Recall F1 Precision Recall F1

Models (micro) (micro) (micro) (macro) (macro) (macro)
CRF 0.7942 0.8051 0.7978 0.8113 0.8184 0.8148
SVM 0.8125 0.8192 0.8158 0.8274 0.8339 0.8306
RF 0.7698 0.7821 0.7759 0. 7883 0.7967 0.7925
BERT-BiLSTM-CRF(optimized) 0.9851 0.9899 0.9875 0.9849 0.9896 0.9872

Table 8. Performance comparison of traditional models and optimized BERT-BiLSTM-CRF.

Model Parameter size(M) Training time per epoch(s) Total training time (h) GPU memory usage(GB)
CRF 0.2 5 0.02 1
SVM 1.5 20 0.08 2
RF 5.2 35 0.15 3
BiLSTM-CRF 10.5 35 0.2 4
BERT 110.0 120 0.7 10
BERT-CRF 115.0 150 0.8 12
BERT-BiLSTM-CRF 125.0 180 1.0 14
BERT-BiLSTM-CRF(optimized) 118.0 160 0.9 13
DistilBERT-BiLSTM-CRF 82.0 110 0.6 9

Table 9. Model training time and computational resource comparison.

including deep learning-based models (BERT, BERT-CRF, BiLSTM-CRF, BERT-BiLSTM-CRF, and DistilBERT-
BiLSTM-CRF) and traditional machine learning models (CRF, SVM, RF). !e evaluation is conducted in the
same experimental environment, utilizing an RTX 4090 GPU with 24GB of memory.
Table% 9 presents the comparison of parameter sizes, average training time per epoch, total training time,
and GPU memory usage across di#erent models. From Table 9, it is evident that models incorporating BERT-
based architectures generally exhibit higher parameter counts and computational costs compared to traditional
machine learning models. !e BiLSTM-CRF model, while lightweight in terms of parameter size and training
time, does not leverage contextual word representations, leading to inferior performance in entity recognition
tasks.
!e proposed BERT-BiLSTM-CRF (optimized) model introduces improvements in both e$ciency and
performance by optimizing the BiLSTM hidden unit size and implementing a re"ned training strategy. Compared
to the standard BERT-BiLSTM-CRF model, the optimized version reduces parameter size by approximately
5.6%, decreases training time per epoch by 11.1%, and requires 7.1% less GPU memory, without compromising
entity recognition accuracy.
While cascading multiple neural network layers inevitably increases computational demands, the enhanced
contextual representation and sequence modeling capabilities justify the additional cost, particularly in mission-
critical applications such as railway fault diagnosis. Given that the fault diagnosis process does not require real-
time inference but rather prioritizes high recall and precision, the computational complexity remains acceptable
for practical deployment.
Moreover, the introduction of DistilBERT-BiLSTM-CRF as a lightweight alternative demonstrates a potential
trade-o# between model e$ciency and performance. Although DistilBERT reduces computational overhead, its
lower number of transformer layers may lead to a decline in entity recognition accuracy, which is undesirable
for high-stakes applications.
In conclusion, the BERT-BiLSTM-CRF (optimized) model achieves a well-balanced trade-o# between
computational e$ciency and fault diagnosis accuracy. Future work may explore quantization and model pruning
techniques to further enhance e$ciency while maintaining robust entity recognition performance.

Visualized results: knowledge graph construction

Knowledge is comprehended as an aggregation encompassing facts, concepts, rules, or principles, which
are symbolized, formalized, or modeled for delineating speci"c rules, a process referred to as knowledge
representation54. !e KG is constructed by employing concepts, entities, and texts as nodes, and various types
of relationships as edges, thus facilitating a visual depiction of knowledge. To precisely and distinctly manifest
the occurrence, representation, transitive relation, and evolution process of ROEF, a corresponding KG for the
failure-related entities previously identi"ed has been developed.
In alignment with the entity-relationship link delineated in Sect. "De"ne the entity type and their
relationships", represented by Fig.%5, the choice of Neo4j55 has been made for the delineation of the data model.
!is model leverages the concept of a graph, inclusive of nodes, edges, and their respective attributes, to enable
the representation and storage of unstructured text in a structured manner. To visualize the fault knowledge
structure derived from unstructured Chinese reports, we construct a domain-speci"c knowledge graph using
Neo4j. As shown in Fig.% 10, (a) presents the con"guration of the graph database model, (b) shows a partial

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 17

Content courtesy of Springer Nature, terms of use apply. Rights reserved
www.nature.com/scientificreports/

subgraph extracted from one failure report, and (c) displays the completed ROEFKG composed of multi-typed
entities and relationships. Entity types are color-coded, and relationships are expressed via labeled directional
edges. A legend is provided to ensure readability. !e entities representing di#erent aspects of the failure data
are distinguished by colors in the visualization: “TIM” (time) in purple, “LIN” (line) in blue, “NUM” (number)
in light gray, “DES” (description) in lavender, “REA” (reason) in yellow, “LOC” (location) in orange, “MEA”
(measure) in green, “CAT” (category) in dark gray, and “SYS” (system) in deep blue. !e relationships between
these entities are depicted with colored arrows: “occurred at time” in purple, " occurred at line” in blue, “occurred
at number” in light gray, “occurred at location” in lavender, “result in” in yellow, “category of failure” in orange,
“due to” in green, " responsible system” in dark gray, and “taken” in deep blue. !is visualization schema simpli"es
the representation and comprehension of the failure data structure, allowing for a clear understanding of the
relationships and attributes associated with each entity. Based on the KG depicted, an intuitive understanding of
the ROEF reports, along with their causes and impacts, can be acquired by tracing the relational links between
physical nodes, in turn, this in&uences the operational e$ciency and safety of train movement. E.g.,

(1) When Luwu Station was processing the approach for train number 20111, the positioning of switch number
occurred at number)
9 showed no indication) (the following is referred to as the “failure description”) → train
number 20111,
occurred at line
(2) failure description → Liqin line,
occurred at time
(3) failure description responsible
→ system12:25 on April 8,2020,
(4) failure description category→ of failure
electrical service,
(5) failure description → G7 signal equipment malfunction,
resultin
(6) failure description → did not a#ect the train,
taken
(7) failure description → Electrical and track maintenance departments were noti"ed for inspection and
handling. At 12:53, the track maintenance department veri"ed that the equipment was functioning normal-
ly, and at 13:11, the electrical service con"rmed the recti"cation, restoring normal train operations),
occurred at location
(8) failure description → turnout equipment,
due to
(9) failure description → the diode of switch number 9 was malfunctioning.

On the other hand, the Neo4j spatial database a#ords users the capability to perform an array of operations on
the stored visual information, including indexing, querying, adding, deleting, among others. Consequently, the
KG developed on the Neo4j platform, tailored to encapsulate information pertinent to Chinese reports, is poised
to o#er online services. !ese services are instrumental in providing guidance to on-site railway personnel,
thereby facilitating the execution of safety management and decision-making processes.

Conclusion
Given the dynamic environment and complex equipment failures in the railway transportation system, this
paper comprehensively analyzes railway rolling stock equipment failure report data using advanced text mining
techniques. !e innovative work and conclusions are as follows:

(1) A collection of on-site ROEFs reports served as the primary dataset, totaling 1,690 reports, categorized
into 16 failure types based on railway transportation equipment failure classi"cation standards. From these
reports, a corpus was constructed, encompassing over 350,000 lines of named entities and labels. Moreover,
ten categories of failure-related entities and the interrelationships within ROEF reports were de"ned.
(2) Leveraging the constructed corpus, the optimized BERT-BiLSTM-CRF model was employed to enhance
entity extraction from railway equipment failure texts. To address challenges posed by imbalanced data
distribution, an entity attention layer was introduced, allowing the model to adaptively assign higher atten-
tion weights to critical but underrepresented entities. !is mechanism enhances the feature representation
of low-frequency fault types, ensuring their more e#ective recognition. Furthermore, the dropout regu-
larization technique was incorporated to improve model robustness. Experimental results con"rm that
the optimized BERT-BiLSTM-CRF model signi"cantly outperforms the baseline approach in NER tasks,
particularly in handling imbalanced entity distributions.
(3) Following entity extraction and relationship analysis, a ROEFKG was constructed using the Neo4j database.
!is graphical representation provides an intuitive visualization of failure patterns, enhancing knowledge
retrieval and interpretability. !e developed ROEFKG supports failure diagnosis and decision-making by
establishing structured interconnections among di#erent fault-related entities.

In summary, the entity database and knowledge graph developed in this study provide a data-driven framework
for understanding railway operational failures, enhancing knowledge application e$ciency, and bridging current
gaps in ROEF report mining. !e introduction of the entity attention layer plays a crucial role in mitigating the
impact of data imbalance, enabling more e#ective recognition of both frequent and infrequent fault types. !is
enhancement ensures that critical but rare failures are not overshadowed by dominant categories, improving the
model’s reliability and its practical applicability in railway operational safety.

Data availability
!e datasets generated and analyzed during the current study are not publicly available due to the ongoing na-
ture of the project, but they are available from the corresponding author upon reasonable request.

Received: 17 July 2024; Accepted: 11 July 2025

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 18

Content courtesy of Springer Nature, terms of use apply. Rights reserved
www.nature.com/scientificreports/

Fig. 10. Neo4j modeling based KG.

References
1. Liu, C. & Yang, S. Using text mining to Establish knowledge graph from accident/incident reports in risk assessment. Expert Syst.
Appl. 207, 117991 (2022).
2. Guo, L. et al. Distributed representations of entities in open-world knowledge graphs. knowledge-based Syst. 290, 111582 (2024).
3. Cheng, D., Yang, F., Xiang, S. & Liu, J. Financial time series forecasting with multi-modality graph neural network. Pattern
Recognit. 121, 108218 (2022).
4. Hogan, A. et al. Knowledge graphs. ACM Comput. Surv. 54 (4), 1–37 (2021).
5. Dai, Y., Wang, S., Chen, X., Xu, C. & Guo, W. Generative adversarial networks based on Wasserstein distance for knowledge graph
embeddings. Knowledge-Based Syst. 190, 105165 (2020).
6. Ko, H., Witherell, P., Lu, Y. & Kim, S. Machine learning and knowledge graph based design rule construction for additive
manufacturing. Addit. Manuf. 37, 101620 (2021).
7. Mohamed, S. K., Nounu, A. & Nová(ek, V. Biological applications of knowledge graph embedding models. Brie!ngs Bioinf. 22 (2),
1679–1693 (2021).
8. Chen, D., Chen, J., Fang, C. & Zhang, Z. Complex visual question answering based on uniform form and content. Appl. Intell. 54,
4602–4620 (2024).
9. Zafar, A., Varshney, D., Kumar, S. S., Das, A. & Ekbal, A. Are my answers medically accurate? Exploiting medical knowledge graphs
for medical question answering. Appl. Intell. 54, 2172–2187 (2024).
10. Bounhas, I., Soudani, N. & Slimani, Y. Building a morpho-semantic knowledge graph for Arabic information retrieval. Inf. Process.
Manage. 57 (6), 102124 (2020).
11. Sun, R. et al. ACM,. Multi-modal knowledge graphs for recommender systems, In Proceedings of the 29th ACM International
Conference on Information & Knowledge Management, 1405–1414 (2020).

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 19

Content courtesy of Springer Nature, terms of use apply. Rights reserved
www.nature.com/scientificreports/

12. Nettleton, D. F. & Salas, J. A data driven anonymization system for information rich online social network graphs. Expert Syst.
Appl. 55, 87–105 (2016).
13. Zhang, Q. et al. Construction of knowledge graphs for maritime dangerous goods. Sustainability 11 (10), 2849 (2019).
14. Gan, L. et al. Construction of knowledge graph for &ag state control (FSC) inspection for ships: A case study from China. J. Mar.
Sci. Eng. 10 (10), 1352 (2022).
15. Mao, S., Zhao, Y., Chen, J., Wang, B. & Tang, Y. Development of process safety knowledge graph: A case study on delayed coking
process. Comput. Chem. Eng. 143, 107094 (2020).
16. Liu, J., Schmid, F., Li, K. & Zheng, W. A knowledge graph-based approach for exploring railway operational accidents. Reliab. Eng.
Syst. Saf. 207, 107352 (2021).
17. Lin, C. & Wang, G. Failure cause extraction of railway switches based on text mining. In Proceedings of the International Conference
on Computer Science and Arti!cial Intelligence, 237–241 (ACM, 2017)., 237–241 (ACM, 2017). (2017).
18. Sobrie, L., Verschelde, M., Hennebel, V. & Roets, B. Capturing complexity over space and time via deep learning: an application to
real-time delay prediction in railways. Eur. J. Oper. Res. 310 (3), 1201–1217 (2023).
19. Lin, J., Li, S., Qin, N. & Ding, S. Entity recognition of railway signal equipment fault information based on RoBERTa-wwm and
deep learning integration. Math. Biosci. Eng. 21 (1), 1228–1248 (2024).
20. Cai, Z. et al. !e sources and transport pathways of sediment in the Northern Ninety-east ridge of the India ocean over the last
35000 years. Front. Mar. Sci. 10, 1073054 (2023).
21. Li, W. et al. Chinese word segmentation based on self-learning model and geological knowledge for the geoscience domain. Earth
Space Sci. 8 (6), e2021EA001673 (2021).
22. Qiu, Q. et al. Chinese engineering geological named entity recognition by fusing multi-features and data enhancement using deep
learning. Expert Syst. Appl. 238, 121925 (2024).
23. Liang, J., Li, D., Lin, Y., Wu, S. & Huang, Z. Named entity recognition of Chinese crop diseases and pests based on RoBERTa-wwm
with adversarial training. Agron 13 (3), 941 (2023).
24. Yin, T. et al. Research on life cycle assessment and performance comparison of bioethanol production from various biomass
feedstocks. Sustainability 16 (5), 1788 (2024).
25. Zhang, D., Zheng, G., Liu, H., Ma, X. & Xi, L. AWdpCNER: automated Wdp Chinese named entity recognition from wheat diseases
and pests text. Agric 13 (6), 1220 (2023).
26. Hu, Z. & Ma, X. A novel neural network model fusion approach for improving medical named entity recognition in online health
expert question-answering services. Expert Syst Appl. 223, 119880 (2023).
27. Yang, P. et al. A large-scale and multi-source medical knowledge graph for intelligent medicine applications. Knowledge-Based Syst.
284, 111323 (2024).
28. Wu, S. et al. Deep learning in clinical natural Language processing: A methodical review. J. Am. Med. Inf. Assoc. 27 (3), 457–470
(2020).
29. Helwe, C. & Elbassuoni, S. Arabic named entity recognition via deep co-learning. Artif. Intell. Rev. 52, 197–215 (2019).
30. Li, J., Sun, A., Han, J. & Li, C. A survey on deep learning for named entity recognition. IEEE Trans. Knowl. Data Eng. 34 (1), 50–70
(2020).
31. Bunescu, R. & Mooney, R. A shortest path dependency kernel for relation extraction. in: Proceedings of human language
technology conference and conference on empirical methods in natural language processing. Vancouver. 724–731 (2005).
32. Culotta, A. & Sorensen, J. Dependency tree kernels for relation extraction. In: Proceedings of the 42nd annual meeting of the
association for computational linguistics (ACL-04), 423–429ACL, (2004).
33. Mintz, M., Bills, S., Snow, R. & Jurafsky, D. Distant supervision for relation extraction without labeled data. In: Proceedings of the
Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing
of the AFNLP, 1003–1011 (ACL, 2009).
34. Modrzejewski, M., Exel, M., Buschbeck, B., Ha, T. L. & Waibel, A. Incorporating external annotation to improve named entity
translation in NMT. In: Proceedings of the 22nd annual conference of the European association for machine translation. 45–51EAMT,
(2020).
35. Mollá, D., Van, Z. M. & Smith, D. Named entity recognition for question answering. In: Australasian Lang. Technol. Association
Workshop 51–58 (2006). (ALTA.
36. Patel, M. et al. An evolutionarily conserved autoinhibitory molecular switch in ELMO proteins regulates Rac signaling. Curr. Biol.
20 (22), 2021–2027 (2010).
37. Tay, Y., Luu, A. T. & Hui, S. C. Recurrently controlled recurrent networks. In: 32nd Conference on Neural Information Processing
SystemsNeurIPS,. (2018).
38. Vaswani, A. et al. Attention is all you need. In: Advances in Neural Information Processing Systems. (NeurIPS, (2017).
39. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9 (8), 1735–1780 (1997).
40. Li, J. et al. WCP-RNN: a novel RNN-based approach for bio-NER in Chinese EMRs. J. Supercomput. 76, 1450–1467 (2020).
41. Tao, F., Liu, G. & Advanced, L. S. T. M. A study about better time dependency modeling in emotion recognition. In: IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2906–2910 (IEEE,2019). 2906–2910 (IEEE,2019).
(2018).
42. Li, W. et al. UD_BBC: named entity recognition in social network combined BERT-BiLSTM-CRF with active learning. Eng. Appl.
Artif. Intell. 116, 105460 (2022).
43. Ronran, C. & Lee, S. E#ect of character and word features in bidirectional LSTM-CRF for NER. In: 2020 IEEE International
Conference on Big Data and Smart Computing (BigComp). 613–616IEEE, (2020).
44. Chen, Z., Ji, W., Ding, L. & Song, B. Fine-grained document-level "nancial event argument extraction approach. Eng. Appl. Artif.
Intell. 121, 105943 (2023).
45. Kang, T., Perotte, A., Tang, Y., Ta, C. & Weng, C. UMLS-based data augmentation for natural Language processing of clinical
research literature. J. Am. Med. Inf. Assoc. 28 (4), 812–823 (2021).
46. Liu, Y. et al. Naming entity recognition of citrus pests and diseases based on the BERT-BiLSTM-CRF model. Expert Syst. Appl. 234,
121103 (2023).
47. Zhou, S. et al. A cross-institutional evaluation on breast cancer phenotyping NLP algorithms on electronic health records. Comput.
Struct. Biotechnol. J. 22, 32–40 (2023).
48. Hinze, A., Heese, R., Schlegel, A. & Paschke, A. Manual semantic annotations: user evaluation of interface and interaction designs.
J. Web Semant. 58, 100516 (2019).
49. Rani, P. S., Suresh, R. M. & Sethukarasi, R. Multi-level semantic annotation and uni"ed data integration using semantic web
ontology in big data processing. Cluster Comput. 22 (Suppl 5), 10401–10413 (2019).
50. Liu, C. & Yang, S. A text mining-based approach for Understanding Chinese railway incidents caused by electromagnetic
Interferenc. Eng. Appl. Artif. Intell. 117, 105598 (2023).
51. Chen, X. et al. Multi-target detection and tracking based on CRF network and spatio-temporal attention for sports videos. Sci. Rep.
15, 6808 (2025).
52. Wang, W. et al. Prediction model of water inrush risk level of coal seam &oor based on KPCA-DBO-SVM. Sci. Rep. 15, 10393
(2025).
53. Fang, M. et al. Missing value imputation for > 2%mev electron &uxes in geostationary orbit based on GA-RF model. Sci. Rep. 15,
10427 (2025).

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 20

Content courtesy of Springer Nature, terms of use apply. Rights reserved
www.nature.com/scientificreports/

54. Lin, J., Zhao, Y., Huang, W., Liu, C. & Pu, H. Domain knowledge graph-based research progress of knowledge representation.
Neural Comput. Appl. 33, 681–690 (2021).
55. Atzeni, P., Bugiotti, F., Cabibbo, L. & Torlone, R. Data modeling in the NoSQL world. Comput. Stand. Interfaces. 67, 103149 (2020).

Acknowledgements
We gratefully acknowledge that this research is supported by the Project on the High-Quality Development and
Safety Assurance System and Key Technologies for Railways, funded by China National Railway Group Corpo-
ration Limited.

Author contributions
X. Y.: Conceptualization, Methodology, So'ware, Validation, Data Curation, Writing—Original Dra', Writ-
ing—Review and Editing. H. L.: Methodology, Formal analysis, Validation, Data Curation, Writing—Review
and Editing. Y. X., N.S. and R. H.: Formal analysis, Validation, Writing—Review and Editing.

Declarations

Competing interests
!e authors declare no competing interests.

Additional information
Correspondence and requests for materials should be addressed to H.L.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional a$liations.
Open Access !is article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives
4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in
any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide
a link to the Creative Commons licence, and indicate if you modi"ed the licensed material. You do not have
permission under this licence to share adapted material derived from this article or parts of it. !e images or
other third party material in this article are included in the article’s Creative Commons licence, unless indicated
otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence
and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to
obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommo
ns.org/licenses/by-nc-nd/4.0/.

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 21

Content courtesy of Springer Nature, terms of use apply. Rights reserved
Terms and Conditions
Springer Nature journal content, brought to you courtesy of Springer Nature Customer Service Center GmbH (“Springer Nature”).
Springer Nature supports a reasonable amount of sharing of research papers by authors, subscribers and authorised users (“Users”), for small-
scale personal, non-commercial use provided that all copyright, trade and service marks and other proprietary notices are maintained. By
accessing, sharing, receiving or otherwise using the Springer Nature journal content you agree to these terms of use (“Terms”). For these
purposes, Springer Nature considers academic use (by researchers and students) to be non-commercial.
These Terms are supplementary and will apply in addition to any applicable website terms and conditions, a relevant site licence or a personal
subscription. These Terms will prevail over any conflict or ambiguity with regards to the relevant terms, a site licence or a personal subscription
(to the extent of the conflict or ambiguity only). For Creative Commons-licensed articles, the terms of the Creative Commons license used will
apply.
We collect and use personal data to provide access to the Springer Nature journal content. We may also use these personal data internally within
ResearchGate and Springer Nature and as agreed share it, in an anonymised way, for purposes of tracking, analysis and reporting. We will not
otherwise disclose your personal data outside the ResearchGate or the Springer Nature group of companies unless we have your permission as
detailed in the Privacy Policy.
While Users may use the Springer Nature journal content for small scale, personal non-commercial use, it is important to note that Users may
not:

1. use such content for the purpose of providing other users with access on a regular or large scale basis or as a means to circumvent access
control;
2. use such content where to do so would be considered a criminal or statutory offence in any jurisdiction, or gives rise to civil liability, or is
otherwise unlawful;
3. falsely or misleadingly imply or suggest endorsement, approval , sponsorship, or association unless explicitly agreed to by Springer Nature in
writing;
4. use bots or other automated methods to access the content or redirect messages
5. override any security feature or exclusionary protocol; or
6. share the content in order to create substitute for Springer Nature products or services or a systematic database of Springer Nature journal
content.
In line with the restriction against commercial use, Springer Nature does not permit the creation of a product or service that creates revenue,
royalties, rent or income from our content or its inclusion as part of a paid for service or for other commercial gain. Springer Nature journal
content cannot be used for inter-library loans and librarians may not upload Springer Nature journal content on a large scale into their, or any
other, institutional repository.
These terms of use are reviewed regularly and may be amended at any time. Springer Nature is not obligated to publish any information or
content on this website and may remove it or features or functionality at our sole discretion, at any time with or without notice. Springer Nature
may revoke this licence to you at any time and remove access to any copies of the Springer Nature journal content which have been saved.
To the fullest extent permitted by law, Springer Nature makes no warranties, representations or guarantees to Users, either express or implied
with respect to the Springer nature journal content and all parties disclaim and waive any implied warranties or warranties imposed by law,
including merchantability or fitness for any particular purpose.
Please note that these rights do not automatically extend to content, data or other material published by Springer Nature that may be licensed
from third parties.
If you would like to use or distribute our Springer Nature journal content to a wider audience or on a regular basis or in any other manner not
expressly permitted by these Terms, please contact Springer Nature at

[email protected]

Railroad Accident Analysis by Machine Learning and Natural
No ratings yet
Railroad Accident Analysis by Machine Learning and Natural
14 pages
Using Text Mining To Establish Knowledge Graph From Accidentincident Reports in Risk Assessment
No ratings yet
Using Text Mining To Establish Knowledge Graph From Accidentincident Reports in Risk Assessment
20 pages
Vijayalakshmi.G-AI Based Rush Collision Prevention in Railways
No ratings yet
Vijayalakshmi.G-AI Based Rush Collision Prevention in Railways
9 pages
Text Mining Applied To Rail Accidents: Nayana Kamerkar, Kamlesh Patil, Ankita Kale Guide-Mrs. Nita Patil
No ratings yet
Text Mining Applied To Rail Accidents: Nayana Kamerkar, Kamlesh Patil, Ankita Kale Guide-Mrs. Nita Patil
4 pages
122 Yang KDD 05
No ratings yet
122 Yang KDD 05
11 pages
Automatic Robot For Rail Crack Detection Systems
No ratings yet
Automatic Robot For Rail Crack Detection Systems
5 pages
55 Artikel
No ratings yet
55 Artikel
18 pages
Resilience in Railway Transport Systems A Literature Review and Research Agenda
No ratings yet
Resilience in Railway Transport Systems A Literature Review and Research Agenda
23 pages
Quantitative Risk Assessment of Railway Intrusions With Text Mining and
No ratings yet
Quantitative Risk Assessment of Railway Intrusions With Text Mining and
16 pages
Intelligent Railway Safety Risk System
No ratings yet
Intelligent Railway Safety Risk System
16 pages
Liad016 1
No ratings yet
Liad016 1
24 pages
A Knowledge Graph-Based Hazard Prediction Approach For Preventing
No ratings yet
A Knowledge Graph-Based Hazard Prediction Approach For Preventing
19 pages
Unsupervised Machine Learning For Managing Safety Accidents in Railway Stations
No ratings yet
Unsupervised Machine Learning For Managing Safety Accidents in Railway Stations
16 pages
Unsupervised Machine Learning For Managing Safety Accidents in Railway Stations
No ratings yet
Unsupervised Machine Learning For Managing Safety Accidents in Railway Stations
11 pages
Study of Rail Breaks & Fracture
No ratings yet
Study of Rail Breaks & Fracture
15 pages
Railway Infra Journal
No ratings yet
Railway Infra Journal
20 pages
A Review of Deep Learning Applications For Railway Safety
No ratings yet
A Review of Deep Learning Applications For Railway Safety
29 pages
Identifying Climate-Related Failures in Railway Infrastructure Using Machine Learning
No ratings yet
Identifying Climate-Related Failures in Railway Infrastructure Using Machine Learning
19 pages
A Failure Study of The Railway Rail Services
No ratings yet
A Failure Study of The Railway Rail Services
7 pages
Automated Machine Learning Recognition To Diagnose Flood Resilience of Railway Switches and Crossings
No ratings yet
Automated Machine Learning Recognition To Diagnose Flood Resilience of Railway Switches and Crossings
11 pages
(2022) Bešinović-Resilience Assessment of Railway Networks...
No ratings yet
(2022) Bešinović-Resilience Assessment of Railway Networks...
15 pages
Al Based Acoustic Wave Monitoring of Rail Defects Like Cracks, Fracture and Prediction For Rail Wear, Quality Along With Other Parameter
No ratings yet
Al Based Acoustic Wave Monitoring of Rail Defects Like Cracks, Fracture and Prediction For Rail Wear, Quality Along With Other Parameter
104 pages
面向高速铁路道岔故障维修领域的知识图谱研究与应用卢冉
No ratings yet
面向高速铁路道岔故障维修领域的知识图谱研究与应用卢冉
62 pages
Artificial Intelligence in Railway Infrastructure
No ratings yet
Artificial Intelligence in Railway Infrastructure
24 pages
Optimization in Railway Scheduling
No ratings yet
Optimization in Railway Scheduling
8 pages
1 s2.0 S0952197623008060 Main
No ratings yet
1 s2.0 S0952197623008060 Main
16 pages
A New Methodology For Assessment of Railway Infrastructure Condition
No ratings yet
A New Methodology For Assessment of Railway Infrastructure Condition
10 pages
1 s2.0 S0925753517302187 Main
No ratings yet
1 s2.0 S0925753517302187 Main
8 pages
Risk Analysis - 2021 - Ghofrani - Analyzing Risk of Service Failures in Heavy Haul Rail Lines A Hybrid Approach For
No ratings yet
Risk Analysis - 2021 - Ghofrani - Analyzing Risk of Service Failures in Heavy Haul Rail Lines A Hybrid Approach For
20 pages
Sciencedirect: A Factor Analysis of Urban Railway Casualty Accidents and Establishment of Preventive Response Systems
No ratings yet
Sciencedirect: A Factor Analysis of Urban Railway Casualty Accidents and Establishment of Preventive Response Systems
10 pages
Sensors 24 00830 v2
No ratings yet
Sensors 24 00830 v2
4 pages
Time Series Data Mining For Railway Wheel and Track Monitoring: A Survey
No ratings yet
Time Series Data Mining For Railway Wheel and Track Monitoring: A Survey
19 pages
Geosciences 10 00425
No ratings yet
Geosciences 10 00425
24 pages
Railway Accident Risk Analysis Using AI
No ratings yet
Railway Accident Risk Analysis Using AI
28 pages
Deepikaijcset 1
No ratings yet
Deepikaijcset 1
6 pages
The Risk Management Methodology Used in Railway Systems - A Case Study of Alishan Forest Railway
No ratings yet
The Risk Management Methodology Used in Railway Systems - A Case Study of Alishan Forest Railway
16 pages
1 s2.0 S1568494620308450 Main
No ratings yet
1 s2.0 S1568494620308450 Main
10 pages
V3i4 1266
No ratings yet
V3i4 1266
8 pages
Tdad 007
No ratings yet
Tdad 007
12 pages
Unsupervised Machine Learning For Managing Safety
No ratings yet
Unsupervised Machine Learning For Managing Safety
15 pages
2018CrawfordKiftManuscript RailSafetymechrisk
No ratings yet
2018CrawfordKiftManuscript RailSafetymechrisk
31 pages
Fmics2019 Bahnhof
No ratings yet
Fmics2019 Bahnhof
17 pages
Liden 2015 EWGT Survey
No ratings yet
Liden 2015 EWGT Survey
11 pages
Wang 2023 - Elsevier
No ratings yet
Wang 2023 - Elsevier
21 pages
Processes 10 00724 v2
No ratings yet
Processes 10 00724 v2
23 pages
Research Paper - Predictive Maintenance Based On IoT and AI
No ratings yet
Research Paper - Predictive Maintenance Based On IoT and AI
6 pages
Big Data in Railway Systems
No ratings yet
Big Data in Railway Systems
9 pages
Ontology Driven Data Integration
No ratings yet
Ontology Driven Data Integration
44 pages
ML for Steel Plant Safety Analysis
No ratings yet
ML for Steel Plant Safety Analysis
12 pages
A - Survey - On - Data-Driven - Predictive Maintenance
No ratings yet
A - Survey - On - Data-Driven - Predictive Maintenance
23 pages
A Simple State-Based Prognostic Model For Railway Turnout SSBP
No ratings yet
A Simple State-Based Prognostic Model For Railway Turnout SSBP
9 pages
Self-Adaptive Algorithm For Scheduling Rail Tests
No ratings yet
Self-Adaptive Algorithm For Scheduling Rail Tests
3 pages
2023 Sinnemann Franzen INT
No ratings yet
2023 Sinnemann Franzen INT
3 pages
Systematic Literature Review of Risk Assessment 4
No ratings yet
Systematic Literature Review of Risk Assessment 4
14 pages
Review 1 (Project)
No ratings yet
Review 1 (Project)
18 pages
Disposal For The Pinching People and Objects Accid
No ratings yet
Disposal For The Pinching People and Objects Accid
9 pages
Sensors 24 02456 v2
No ratings yet
Sensors 24 02456 v2
25 pages
Turnout 7
No ratings yet
Turnout 7
79 pages
Mustafa Saritas Icisna2023
No ratings yet
Mustafa Saritas Icisna2023
8 pages
單字
No ratings yet
單字
3 pages
ACRJ
No ratings yet
ACRJ
21 pages
Unveiling The Potential of Sentiment: Can Large Language Models Predict Chinese Stock Price Movements?
No ratings yet
Unveiling The Potential of Sentiment: Can Large Language Models Predict Chinese Stock Price Movements?
9 pages
8 財務整理
No ratings yet
8 財務整理
12 pages
Assignment 2
No ratings yet
Assignment 2
5 pages
Assignment 期中
No ratings yet
Assignment 期中
12 pages
Mini Case
No ratings yet
Mini Case
7 pages
Mini Case
No ratings yet
Mini Case
9 pages
17 SAW vs. WPM
No ratings yet
17 SAW vs. WPM
5 pages
CH01行銷概念導論
No ratings yet
CH01行銷概念導論
118 pages
結論
No ratings yet
結論
2 pages
Mini Case
No ratings yet
Mini Case
16 pages
FM
No ratings yet
FM
44 pages
04 Eigenvector and WLS
No ratings yet
04 Eigenvector and WLS
11 pages
13 Permutation and QUALIFLEX
No ratings yet
13 Permutation and QUALIFLEX
12 pages
16 Objective Weight
No ratings yet
16 Objective Weight
3 pages
05 CRITIC and Entropy
No ratings yet
05 CRITIC and Entropy
12 pages
Assignment 1
No ratings yet
Assignment 1
1 page
B.Arch. Curriculum Map Overview
No ratings yet
B.Arch. Curriculum Map Overview
1 page
Bridge Works - Miscellaneous
No ratings yet
Bridge Works - Miscellaneous
26 pages
Homework Hotline d428
100% (1)
Homework Hotline d428
5 pages
Written Assignment Unit 4
No ratings yet
Written Assignment Unit 4
5 pages
Elemental Battle Armor (AP Gauss) (Sqd6)
No ratings yet
Elemental Battle Armor (AP Gauss) (Sqd6)
1 page
4JH1 Gestión Electrónica
No ratings yet
4JH1 Gestión Electrónica
79 pages
Chapter 4 Notes Class 12
100% (1)
Chapter 4 Notes Class 12
21 pages
Science 10 Lesson Plan
100% (1)
Science 10 Lesson Plan
7 pages
Listof C25 Batcheswith Times&Syllabus
No ratings yet
Listof C25 Batcheswith Times&Syllabus
4 pages
BBMA Flow Diagram
100% (1)
BBMA Flow Diagram
212 pages
SAEJ435 CV 001
100% (1)
SAEJ435 CV 001
13 pages
PDF No Bake Asweseeit - Compress
No ratings yet
PDF No Bake Asweseeit - Compress
132 pages
Lesson 1: Inspection, Palpation, Percussion & Auscultation
No ratings yet
Lesson 1: Inspection, Palpation, Percussion & Auscultation
7 pages
Surat Undangan Peserta ADIA
No ratings yet
Surat Undangan Peserta ADIA
9 pages
5GRAIL WCRR Presentation
No ratings yet
5GRAIL WCRR Presentation
6 pages
Passport Appointment Receipt India
No ratings yet
Passport Appointment Receipt India
3 pages
(L6) - (JEE 2.0) - 3D Geometry - 28th Nov
No ratings yet
(L6) - (JEE 2.0) - 3D Geometry - 28th Nov
44 pages
Mohr's Circle
100% (1)
Mohr's Circle
13 pages
CV Riston Belman Sidabutar
No ratings yet
CV Riston Belman Sidabutar
6 pages
Critical Book Review Guide
No ratings yet
Critical Book Review Guide
4 pages
01 History of Philippine Architecture
No ratings yet
01 History of Philippine Architecture
18 pages
Licensure Examination For Teachers Reviewer (Part 1)
100% (1)
Licensure Examination For Teachers Reviewer (Part 1)
11 pages
Improvised Mist Fan
No ratings yet
Improvised Mist Fan
32 pages
EMC Engineering Exam Insights
No ratings yet
EMC Engineering Exam Insights
3 pages
Custom DateTimePicker - Custom Controls WinForm C # - RJ Code Advance
No ratings yet
Custom DateTimePicker - Custom Controls WinForm C # - RJ Code Advance
12 pages
Thesis Help for Trade Students
100% (2)
Thesis Help for Trade Students
6 pages
Presentation 1
No ratings yet
Presentation 1
91 pages
Summary of Learning
No ratings yet
Summary of Learning
10 pages
Happy Elevators India PVT LTD: Sub: Elevator Quotation
No ratings yet
Happy Elevators India PVT LTD: Sub: Elevator Quotation
9 pages

A Text Mining-based Approach for Comprehensive Und

Uploaded by

A Text Mining-based Approach for Comprehensive Und

Uploaded by

www.nature.

OPEN A text mining-based approach for

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 1

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 2

NER algorithm module

Fig. 1. !e overall framework of the model.

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 3

Multi Head (Q, K, V ) = head1 ⊕ head2 ⊕ . . . ⊕ headi (3)

Where W is recognized as the weight matrix, and ⊕ indicates splicing h matrices.

Fig. 2. BERT model structure.

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 4

Fig. 3. BiLSTM structure drawing.

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 5

Algorithm 1Entity attention layer for fault diagnosis.

Entity attention layer

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 6

(1) Enhancing rare entity recognition.

(2) Improving contextual representation.

(3) Noise reduction and robust feature selection.

(4) Empirical performance gains on imbalanced datasets.

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 7

(1) Handling missing values and format standardization.

(2) Removing irrelevant characters and erroneous entries.

(3) Sentence segmentation and logical restructuring.

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 8

Fig. 4. Preprocessing comparison.

Define the entity type and their relationships

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 9

Failure category Quantity

Table 1. !e quantities of various types of ROEF documented.

Training environment and parameters configuration

Experimental results and analysis

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 10

Tags Type Number

Fig. 5. Entity-relationship link graph.

The optimized model is compared with the basic model

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 11

Fig. 6. Text preprocessing of Chinese ROEF reports.

Operating System Windows 10

Table 3. Training environment con"guration.

Parameter name Parameter value

Table 4. Parameter value setting.

(1) Improved recall for low-frequency entities.

(2) Enhanced feature extraction for low-frequency entities.

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 12

Fig. 7. Comparative analysis of entity recognition models.

(3) Integration with dropout regularization to improve generalization.

Model performance on low-frequency failure categories

(1) Signi"cant performance improvement with entity attention mechanism.

!e optimized model (BERT-BiLSTM-CRF with entity attention) exhibits a substantial improvement in

(2) Better generalization to underrepresented failure categories.

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 13

Fig. 8. Comparison of optimized BERT-BiLSTM-CRF with basic model.

(3) Enhanced recognition of critical entity types.

(4) Balanced precision and recall across entity types.

(5) Demonstration of model robustness in imbalanced data.

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 14

Entity Precision Recall F1

Table 5. Baseline model performance (BERT-BiLSTM-CRF).

Entity Precision Recall F1

Table 6. Optimized model performance (BERT-BiLSTM-CRF with entity attention).

Evaluation metrics and performance analysis

Confusion matrix analysis

Comparative study with traditional machine learning models

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 15

Evaluation index Precision Recall F1

Computational efficiency analysis

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 16

Precision Recall F1 Precision Recall F1

Table 8. Performance comparison of traditional models and optimized BERT-BiLSTM-CRF.

Table 9. Model training time and computational resource comparison.

Visualized results: knowledge graph construction

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 17

Received: 17 July 2024; Accepted: 11 July 2025

Scientific Reports | (2025) 15:27760 | https://doi.org/10.1038/s41598-025-11622-6 18

Fig. 10. Neo4j modeling based KG.