Chapter 3 Methods
This chapter details the design and implementation of the Retrieval-Augmented
Generation (RAG) pipelines developed to automate FMEA for RCM. Given the
growing availability of unstructured and semi-structured maintenance data, ranging
from technical manuals to ISO-compliant FMEA tables, this study implements three
RAG architectures: a baseline Vanilla RAG, an enhanced Semi-Structured RAG, and
an Agentic RAG approach. All three pipelines integrate large language models
(LLMs) with domain-specific retrieval mechanisms to generate context-aware outputs
grounded in trusted sources. The chapter describes the data used, gives a conceptual
overview of RAG, details the implementation of the three RAG pipelines, and presents
a performance evaluation using standard natural language generation metrics together
with the Retrieval-Augmented Generation Assessment (RAGAS) framework, with the
selected LLMs accessed via the Groq API.
3.1 Data collection and description
In this study, an LLM is utilized to perform the FMEA task. The model is grounded in
a knowledge base comprising previous FMEA studies conducted on various types of
equipment. To strengthen the model’s grasp of FMEA methodology, the knowledge
base also includes research on FMEA ontologies. Furthermore, the relevant ISO
standards governing FMEA procedures are incorporated to ensure alignment with
industry best practices. Papers containing high-quality FMEA tables, and a substantial
number of tables covering various kinds of pumps, were prioritized, and only articles
of a high standard were chosen to be part of the dataset.
3.2 Overview of RAG
RAG is an advanced neural architecture designed to enhance the generation
capabilities of LLMs by incorporating an explicit retrieval mechanism from an
external knowledge base. This hybrid approach mitigates the common limitations of
purely generative models, such as hallucination and inability to access updated or
domain-specific knowledge stored outside the model’s parameters. The RAG process
consists of three primary stages: Text Preprocessing, Information Retrieval, and Large
Language Model (LLM) Integration. Each of these stages plays a critical role in
ensuring that the final output is both relevant and accurate (Lewis et al., 2020).
Text Preprocessing: The first step in the framework involves Text
Preprocessing, where all available textual data is converted into vector
representations. This transformation is achieved using techniques such as word
embeddings (e.g., Word2Vec) or transformer-based embeddings (e.g. OpenAI’s
embeddings). These vector representations encapsulate the semantic meaning of the
text, allowing for efficient and meaningful comparisons. Once the text is vectorized, it
is stored in a vector database, which enables rapid and accurate retrieval of
information when needed. The purpose of this stage is to create a structured and
searchable knowledge base that can later be queried to retrieve relevant information
(Mikolov et al., 2013).
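To make this stage concrete, the following minimal sketch shows how a set of text chunks could be converted into dense vectors with an off-the-shelf embedding model. The sentence-transformers model name and the example sentences are illustrative assumptions, not the exact setup used in this study.

```python
# Illustrative sketch (not the thesis code): embedding text chunks with a
# sentence-transformer model. The model name is an assumption; any embedding
# model (e.g. OpenAI or HuggingFace embeddings) could be substituted.
from sentence_transformers import SentenceTransformer

chunks = [
    "Cavitation in centrifugal pumps is caused by low suction pressure.",
    "Mechanical seal failure is often linked to high gas volume fraction.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")          # assumed embedding model
vectors = embedder.encode(chunks, normalize_embeddings=True)

print(vectors.shape)   # (2, 384) -> one dense vector per chunk
```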
Information Retrieval: The second stage, Information Retrieval, is where the
system processes a user's query by converting it into a vector representation, similar to
how the initial dataset was transformed. This query vector is then compared with the
stored vectors in the database using cosine similarity or other distance-based measures
(e.g., Euclidean distance, dot product similarity). Cosine similarity is particularly
effective as it measures the angular distance between two vectors, ensuring that
semantically similar concepts are identified even if the exact wording differs. Once the
similarity scores are computed, the system selects the most relevant vectors from the
database – those that closely match the user’s query (Karpukhin et al., 2020). This
retrieval step ensures that the model has access to highly relevant external knowledge
before proceeding to generate the final response. The efficiency and accuracy of this
stage are critical, as retrieving incorrect or irrelevant data could compromise the
reliability of the generated output.
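Continuing the sketch above, the query is embedded with the same model and compared against the stored vectors; with normalized embeddings, cosine similarity reduces to a dot product. The variables `embedder`, `vectors`, and `chunks` are carried over from the previous illustrative example.

```python
# Illustrative sketch: ranking stored chunk vectors against a query vector by
# cosine similarity. Assumes `embedder`, `vectors`, `chunks` from the sketch above.
import numpy as np

query = "What causes mechanical seal failure in a pump?"
q_vec = embedder.encode([query], normalize_embeddings=True)[0]

# With normalized vectors, cosine similarity reduces to a dot product.
scores = vectors @ q_vec
top_k = np.argsort(scores)[::-1][:2]          # indices of the most similar chunks
for i in top_k:
    print(f"{scores[i]:.3f}  {chunks[i]}")
```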
Large Language Model (LLM) Integration: The final stage, LLM
Integration, occurs after the relevant vectors have been retrieved from the database. At
this point, the Large Language Model (LLM) processes the retrieved information and
generates a coherent, well-structured textual response. This is where the model
integrates both its pre-trained knowledge and the retrieved external knowledge to
formulate an informed and contextually appropriate output (Wang & Li, 2023). By
combining retrieval-based knowledge with generative capabilities, the LLM ensures
that responses are more precise, factually grounded and tailored to the specific query.
This hybrid approach significantly enhances the model’s ability to perform FMEA-
related tasks, as it can leverage historical FMEA studies, industry standards (such as
ISO guidelines), and structured ontological data to produce accurate and insightful
analyses.
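As a simple illustration of this integration step, the retrieved chunks can be concatenated with the user query into a single grounded prompt before being sent to the LLM; the template wording below is an assumption and can be adapted to the task.

```python
# Illustrative sketch: assembling the retrieved chunks and the user query into a
# single grounded prompt for the LLM. Assumes `chunks`, `top_k`, `query` from above.
retrieved = [chunks[i] for i in top_k]

prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n".join(f"- {c}" for c in retrieved) +
    f"\n\nQuestion: {query}\nAnswer:"
)
# `prompt` is then passed to the LLM (e.g. via the Groq API) for generation.
```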
3.3 Implementation of RAG
Figure 2. Illustration of the Text Preprocessing stage.
Figure 2 illustrates the steps involved in the Text Preprocessing stage. The workflow
starts with assembling a collection of relevant documents, which may include research
papers, technical standards, and various ontologies. These documents are stored within
a document directory to serve as the knowledge base for a RAG system. Once the
documents are collected, they undergo a preprocessing phase where each document is
divided into smaller, logically segmented text chunks. This chunking process ensures
that each piece of information is contextually relevant and easier for language models
to process and understand.
Following this, each individual text chunk is passed through an embedding
model, which converts the textual data into high-dimensional contextual vector
representations. These vector embeddings encapsulate the semantic meaning of the
original text, enabling accurate similarity comparisons. Finally, the resulting
contextual vectors are stored in a specialized storage system called a vector database
(or vector DB). This database is optimized for storing and retrieving vectorized data,
allowing the RAG system to quickly search and retrieve the most relevant chunks in
response to a user query, thus enhancing the quality and relevance of generated
responses by the LLM.
Figure 3 outlines the comprehensive process of preparing documents and
building a model for analyzing FMEA within the RAG framework. Initially, all
necessary documents are collected and organized in a designated folder. Key tools and
libraries, such as LangChain for workflow management, document loaders (using
Unstructured and PyPDF), and a vector database (Chroma DB) for storing vector
representations, are then installed and configured.
Figure 3. Code structure for text pre-processing part of the RAG implementation.
The model can be built using either API-based models (such as those from OpenAI,
Groq, or Mistral) or locally run models served through Ollama. In this study, the
Llama 3.1-8B model has been employed, accessed through Groq, a platform that
provides free API access to various open-source LLMs. Additionally, while Chroma DB is used as the
vector database in this setup, other options like FAISS (Facebook AI Similarity
Search) can serve similar purposes. Overall, this structured setup under the RAG
framework ensures that the model generates answers with precise and relevant context
drawn from the prepared documents.
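The snippet below sketches how such a Groq-hosted model might be initialized through LangChain; the package name, model identifier, and parameter values are assumptions that depend on the installed library versions and Groq’s current model catalogue.

```python
# Illustrative sketch: initializing a Groq-hosted Llama 3.1-8B model through
# LangChain. Package name and model identifier are assumptions.
import os
from langchain_groq import ChatGroq

os.environ["GROQ_API_KEY"] = "YOUR_API_KEY"          # placeholder API key

llm = ChatGroq(
    model="llama-3.1-8b-instant",                    # assumed Groq model id
    temperature=0,                                   # low temperature for factual output
)
```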
Figure 4. Retrieval of information and generation of response based on user query.
Once the setup is complete, the documents are loaded into Python and
segmented into smaller “chunks” – each with a size of 2000 tokens and an overlap of
500 tokens – to ensure that contextual integrity is maintained. The chunk size and
overlap are hyperparameters that can be varied depending on the document size and type. Each
of these chunks is then converted into a vector representation, a numerical format that
encapsulates the semantic meaning of the text. These vectors are stored in the vector
database that is necessary for efficient retrieval. Within the RAG framework, this
vector store is later utilized during the “Information Retrieval” phase. When a user
query is submitted, it is also transformed into a vector, and cosine similarity is used to
identify the most relevant stored vectors.
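A minimal sketch of this chunking and vector-store step is given below; the import paths vary between LangChain releases, and the file name and embedding model are placeholders rather than the exact corpus used in this study.

```python
# Illustrative sketch of the chunking and vector-store step described above.
# Import paths are assumptions and vary across LangChain releases.
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

docs = PyPDFLoader("fmea_corpus/subsea_pump_fmea.pdf").load()   # hypothetical file

# Note: this splitter counts characters unless a token-based length function is supplied.
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=500)
doc_chunks = splitter.split_documents(docs)

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")   # assumed model
vectordb = Chroma.from_documents(doc_chunks, embeddings, persist_directory="chroma_db")
```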
Figure 4 depicts the Information Retrieval process within the RAG framework.
When a user submits a query, it is first converted into a vector representation using the
same embedding process applied during text preprocessing.
This query vector is then compared against the stored vectors in the vector
database via cosine similarity, which helps identify the most relevant matches. The
top matching vectors, together with the original query, are then fed into the Large
Language Model. The LLM processes this combined input to generate a coherent and
contextually relevant response. In essence, the pipeline plays a dual role: an embedding
model converts the text corpus and the query into vectors, and the LLM generates the final output.
Additionally, the framework leverages the LangChain library to efficiently manage
tasks such as vector retrieval and text splitting.
Figure 5 presents a comprehensive pipeline for establishing a question-and-answer
system that leverages a Large Language Model within a Retrieval-Augmented
Generation (RAG) framework to generate FMEA insights. In this setup, the process
begins by configuring a retriever that efficiently queries a vectorized store of
FMEA-related documents. This retriever is a key component of the RAG framework,
ensuring that when a query is made, the most relevant contextual information is
promptly accessible.
Figure 5. Retrieval and output generation process.
Next, the language model is imported and initialized, equipping it to interpret
user queries and generate coherent responses. To seamlessly integrate the retrieval and
generation components, a processing chain is constructed using the LangChain library.
This chain effectively links the question-answering functionality with the retrieval
process, so that the language model can incorporate the retrieved data to craft an
informed answer. Once the system is set up, a user’s query activates the chain. The
retriever identifies and supplies the pertinent vectors from the database, and the
language model uses both the original query and the retrieved information to generate
a relevant response.
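A possible wiring of this chain, assuming the `vectordb` and `llm` objects from the earlier sketches, is shown below; the exact chain class and invocation style depend on the LangChain version.

```python
# Illustrative sketch (assumed API usage, not the exact thesis code): wiring the
# retriever and the Groq-hosted LLM into a question-answering chain.
from langchain.chains import RetrievalQA

retriever = vectordb.as_retriever(search_kwargs={"k": 4})   # retrieve the top-4 chunks

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,                          # ChatGroq model from the earlier sketch
    retriever=retriever,
    return_source_documents=True,     # keep retrieved chunks for inspection
)

# Newer LangChain versions use .invoke(); older ones call the chain directly.
result = qa_chain.invoke({"query": "Generate an FMEA table for a submersible pump."})
print(result["result"])
```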
The resulting output is then printed or checked for accuracy, ensuring the
quality of the generated answer. Subsequently, the FMEA data is transformed into a
DataFrame – a structured format that facilitates easier manipulation and analysis.
During this stage, certain columns such as Severity, Occurrence, Detection, and the
Risk Priority Number (RPN) are removed to account for the variability that should
instead be determined by a maintenance expert, considering the specific industry or
organizational context.
Finally, the refined DataFrame is exported to Excel, allowing for further use
and sharing of the processed FMEA data. Overall, this RAG-based workflow
streamlines the retrieval, processing, and dissemination of critical FMEA information,
ensuring that domain experts receive accurate and contextually relevant insights.
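The post-processing step can be sketched with pandas as follows; the example row and column names mirror the description above and are not taken from the actual output.

```python
# Illustrative sketch: dropping the subjective rating columns from a parsed FMEA
# table and exporting it to Excel. Example values are placeholders.
import pandas as pd

fmea_df = pd.DataFrame([{
    "Failure Mode": "Mechanical seal failure",
    "Cause": "High gas volume fraction",
    "Effect": "Leakage and loss of pressure",
    "Severity": 8, "Occurrence": 4, "Detection": 5, "RPN": 160,
}])

# Ratings are left to a maintenance expert, so they are removed before export.
fmea_df = fmea_df.drop(columns=["Severity", "Occurrence", "Detection", "RPN"])
fmea_df.to_excel("fmea_output.xlsx", index=False)
```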
In summary, the RAG framework implemented in this study integrates
structured retrieval and generative AI to enhance the context of the model’s outputs.
This approach not only improves the reliability of automated FMEA processes but
also facilitates better decision-making, ensuring that the analysis aligns with best
practices.
3.4 Semi-Structured RAG
This methodology has been specifically developed to enhance the generation and
interpretation of table-based outputs produced by LLMs. It builds upon the structure
of the conventional or “vanilla” RAG framework but addresses some of its critical
limitations. One of the primary drawbacks of vanilla RAG is its inability to effectively
distinguish between structured and unstructured data during the preprocessing phase.
As a result, important relational information present in tables may be lost or
misinterpreted when treated as plain text. To overcome this, the proposed approach
introduces a distinct preprocessing step that separates tabular and textual components
within the dataset. By handling these components independently, the model is
better equipped to preserve structural integrity and semantic meaning, thereby
improving both the quality and relevance of the generated outputs in tasks involving
table-based reasoning.
Figure 6. Semi-Structured RAG.
The modified pre-processing workflow is illustrated in Figure 6. To optimize
the system for FMEA table generation, the standard RAG pipeline has been adapted.
Since the main aim of the system is to generate FMEA tables, all tabular content is
first extracted from the document corpus; each table is then summarized, embedded,
and stored alongside its summary vectors in the database. The text segments are
likewise summarized, vectorized, and stored in the same database. This approach
improves the model’s ability to comprehend tables.
The rest of the process remains the same as in the “vanilla” RAG: the retrieval and
answer generation steps are unchanged, but specific chat prompt templates are used to
generate the table in a prescribed format, i.e. without severity, occurrence, detection,
and risk priority number values, to remove subjectivity from the output, and with an
additional column of corrective actions that suggests the necessary action for each
failure mode. Whenever a question concerns FMEA generation, or a specific entry
from a table such as a failure mode, cause, or suggested maintenance measure, the
system retrieves the vector embeddings of the tables; if the question relates to the
narrative text, the text vectors are retrieved instead.
3.5 Implementation of semi-structured RAG
As in the vanilla RAG, the necessary documents are collected and stored in a
directory. All the required libraries are then installed; most are the same as in the
previous implementation, but some new dependencies are needed, for example OCR
(optical character recognition) tools in case some tables in the documents are stored as
images. As shown in Figure 7, once the setup is complete, the documents are loaded
into Python and segmented into smaller “chunks” – each with a size of 550 tokens and
an overlap of 250 tokens – to ensure that contextual integrity is maintained. Once the
chunks have been obtained, the “table elements” are separated from the rest and
processed independently.
This is done because the major aim of the system is to generate FMEA tables,
and processing the tables separately improves the quality of the tabular output. The
tabular and textual chunks are then summarized with the help of the LLM, so that the
content of the document, and of the table chunks in particular, can be captured completely.
Figure 7. Code structure for semi structured RAG implementation.
Each of these chunks (tabular and text) is then converted into a vector
representation, a numerical format that encapsulates the semantic meaning of the text,
and those vectors, together with their summaries and associated metadata (also
vectorized), are then stored in a vector database for efficient retrieval. The remaining
process of vector retrieval and output generation is the same as in the previous
approach; only the vector store from which the chunks are obtained is different.
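The following sketch illustrates how table chunks might be summarized with the LLM and stored together with the raw table as metadata; the prompt wording, example table, and use of Chroma metadata are assumptions rather than the exact thesis code.

```python
# Illustrative sketch: summarizing table chunks with the LLM before embedding,
# so each stored vector carries a compact description of the table's content.
# Assumes `llm` and `embeddings` from the earlier sketches.
from langchain_community.vectorstores import Chroma

table_chunks = ["| Failure mode | Cause | Effect |\n| Seal failure | Wear | Leakage |"]

summaries = []
for tbl in table_chunks:
    msg = llm.invoke(f"Summarize the following FMEA table in two sentences:\n{tbl}")
    summaries.append(msg.content)

# Summaries are embedded and stored; the raw table is kept as metadata so it can
# be returned verbatim at generation time.
table_db = Chroma.from_texts(
    texts=summaries,
    embedding=embeddings,
    metadatas=[{"raw_table": t} for t in table_chunks],
    persist_directory="chroma_semi_structured",
)
```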
Chat prompt templates are used to guide the model’s behaviour, preventing
hallucinations and ensuring the model does not go outside the provided context, so
that the table is generated with only certain columns, namely failure mode, cause,
effect, mechanism of failure, and corrective action. The table generated by the model
therefore does not have to be re-formatted in Excel or by any other means, which was
not the case for the previous approach. Some of the LLM hyperparameters, such as
temperature, were also kept low to avoid the extra creativity that may lead the model
outside the provided context of the documents.
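A hedged example of such a chat prompt template is given below; the wording is an assumption that simply encodes the column constraints described above.

```python
# Illustrative sketch of a constrained chat prompt template; the wording is an
# assumption that encodes the column restrictions described in the text.
from langchain.prompts import ChatPromptTemplate

fmea_prompt = ChatPromptTemplate.from_template(
    "You are a reliability engineer. Using only the context below, produce an "
    "FMEA table with exactly these columns: Failure Mode, Cause, Effect, "
    "Mechanism of Failure, Corrective Action. Do not include Severity, "
    "Occurrence, Detection, or RPN values.\n\n"
    "Context:\n{context}\n\nTask: {question}"
)

fmea_chain = fmea_prompt | llm   # `llm` is the low-temperature ChatGroq model from earlier
# fmea_chain.invoke({"context": retrieved_table_text,
#                    "question": "Generate FMEA of a twin-screw pump"})
```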
3.6 Agentic RAG
Agentic RAG refers to an advanced RAG architecture that integrates autonomous
agents to enhance information retrieval, reasoning, and response generation.
Traditional RAG pipelines retrieve relevant documents from a knowledge base and
pass them to a language model to generate responses. Agentic RAG extends this by
using specialized agents—modular components with distinct roles such as query
reformulation, document filtering, summarization, fact-checking, or synthesis—that
can operate in a coordinated workflow. Unlike static pipelines, agentic systems are dynamic and
adaptive: agents can assess the quality of retrieved information, trigger additional
retrievals, break tasks into subgoals, or revise outputs based on feedback or
constraints. Agentic RAG represents a shift from single-shot, passive systems to goal-
directed, collaborative intelligence.
3.7 Implementation of Agentic RAG
In the preprocessing stage, as shown in Figure 8, the system begins with a set of
FMEA PDF files that contain both structured tables (listing failure modes, causes,
effects, and corrective actions) and unstructured narrative text (background
explanations, design rationale, and procedural details). One tool scans each page to
detect lines that resemble table entries, typically by checking for multiple numeric or
tabular cues, then aggregates those lines into a continuous block, breaks that block
into fixed-size segments, and converts each segment into a numerical embedding,
storing all resulting vectors in a dedicated “Table Vector DB.”
Figure 8. Text pre-processing in Agentic RAG.
A second component collects all remaining paragraphs, bullet points, and
headings, concatenates this narrative text into one long string, splits it into overlapping
chunks to preserve context across boundaries, and transforms each chunk into its own
vector representation, saving those in a separate “Text Vector DB.” By splitting the
raw PDF content into two distinct streams—table-like lines and free-form narrative—
and independently embedding and indexing each, the preprocessing stage ensures that
structured data and explanatory text remain separate, laying the groundwork for later
components to query the appropriate index when answering user questions.
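The two-stream preprocessing can be sketched as follows; the table-row heuristic, the example `pages` content, and the store names are illustrative assumptions rather than the exact thesis code.

```python
# Illustrative sketch of the two-stream preprocessing: lines that look like table
# rows are indexed separately from narrative text. Heuristic and example data are
# assumptions. Assumes `embeddings` from the earlier sketch.
from langchain_community.vectorstores import Chroma

pages = [   # placeholder page strings extracted from the FMEA PDFs
    "Failure mode | Cause | Effect\nSeal failure | Wear | Leakage\n"
    "The subsea multiphase pump operates in harsh environments.",
]

def looks_like_table_row(line: str) -> bool:
    # crude cue: several delimiter-separated cells or multiple numeric fields
    return line.count("|") >= 2 or sum(tok.isdigit() for tok in line.split()) >= 3

table_lines, text_lines = [], []
for page in pages:
    for line in page.splitlines():
        (table_lines if looks_like_table_row(line) else text_lines).append(line)

table_db = Chroma.from_texts(table_lines, embeddings, persist_directory="table_db")
text_db = Chroma.from_texts(text_lines, embeddings, persist_directory="text_db")
```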
Figure 9. Generation of output in Agentic RAG.
As shown in Figure 9, an LLM agent is used both to retrieve information, with the
help of retrieval tools, and to generate the response. The figure highlights how the
agentic RAG system handles user queries and produces the final outputs.
First, the agent receives the user’s request—such as “Generate FMEA of
submersible pump”—and passes it to a routing component that determines which
vector store (“table” or “text”) is most likely to contain the answer. The central agent,
powered by a language model and supported by its set of tools and optional memory,
then formulates a search query in vector space. It queries the appropriate vector
database (either the structured-data index of table embeddings or the narrative-text
index) to retrieve the top-k semantically relevant vectors. Once those relevant vectors,
whether they correspond to specific table rows or descriptive text passages are
fetched, the agent constructs a context-rich prompt combining the original query with
the retrieved content.
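A simplified stand-in for this routing behaviour is sketched below: a keyword check replaces the LLM-driven router, and the retrieved chunks are folded into a grounded prompt for the reasoning model. It assumes the `table_db`, `text_db`, and `llm` objects from the earlier sketches.

```python
# Simplified stand-in for the agent's routing step (not the exact thesis code):
# a keyword check replaces the LLM-driven router; retrieved chunks are folded
# into a grounded prompt. Assumes `table_db`, `text_db`, `llm` from earlier sketches.
def route_and_answer(query: str, k: int = 5) -> str:
    table_keywords = ("fmea", "failure mode", "cause", "effect", "corrective action")
    use_tables = any(kw in query.lower() for kw in table_keywords)

    store = table_db if use_tables else text_db          # choose the vector store
    docs = store.similarity_search(query, k=k)           # top-k semantically similar chunks
    context = "\n".join(d.page_content for d in docs)

    prompt = (f"Use only the context below to answer the request.\n\n"
              f"Context:\n{context}\n\nRequest: {query}")
    return llm.invoke(prompt).content                    # reasoning model generates the answer

print(route_and_answer("Generate FMEA of submersible pump"))
```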
Finally, the language model generates a coherent, grounded output (for
example, a detailed FMEA overview of the submersible pump) that directly reflects
the information held in the retrieved vectors. Because this is a more advanced
methodology, advanced reasoning models were used to improve performance and
output generation. By dynamically choosing the correct index at query time and
fusing the retrieved context into the LLM prompt, this stage of the pipeline ensures
that user requests are answered accurately and comprehensively, completing the
agentic RAG workflow.
3.8 Evaluation of textual quality
Evaluating the quality of the system’s generated text is essential, as it provides insight
into how closely the model's output aligns with both the reference text and the
expected answer. This evaluation plays a key role in tuning model parameters to
produce responses in a specific textual style or format, for instance, ensuring the
consistent use of domain-specific technical terminology. These stylistic preferences
can be embedded within the chat prompt template, effectively guiding the model to
generate text that adheres to desired conventions. Since assessing "text quality" is
inherently subjective, the use of well-defined evaluation metrics introduces a level of
objectivity, enabling more structured and data-driven decision-making when
optimizing model performance.
Metrics used for evaluating textual output were as follows:
1. BLEU: The Bilingual Evaluation Understudy (BLEU) score is a widely used
metric for evaluating the quality of text generated by machine translation or
other natural language generation systems. It functions by comparing the
overlap between n-grams of the generated output and one or more reference
texts. The underlying assumption is that higher similarity with human-
generated references indicates better quality. BLEU typically calculates
precision for n-grams up to four and includes a brevity penalty to penalize
outputs that are excessively short. Although initially developed for machine
translation, BLEU has been adapted for use in various other language
generation tasks (Papineni et al., 2002).
2. ROUGE: ROUGE, which stands for “Recall-Oriented Understudy for Gisting
Evaluation”, is a set of metrics used primarily to evaluate automatic summarization systems.
Unlike BLEU, which emphasizes precision, ROUGE focuses more on recall,
assessing how much of the reference text's content is captured in the generated
summary. The most common variants include ROUGE-N (which looks at n-
gram overlaps), ROUGE-L (which measures the longest common
subsequence), and ROUGE-S (which considers skip bigrams). These metrics
help quantify the informativeness and coverage of the generated text relative to
the reference (Ganesan et al., 2018; Lin, 2004).
I. ROUGE-1: This score evaluates the quality of generated text
by measuring unigram (single word) overlap with a reference. It
emphasizes content coverage, assuming that a higher word
match reflects better output quality. ROUGE-1 is widely used in
text summarization to assess whether key words from the
reference are present in the generated version (Lin, C.-Y. 2004).
II. ROUGE-2: ROUGE-2 measures bigram (two consecutive
words) overlaps between the generated and reference texts. It
captures not only content but also local syntactic structure,
making it more sensitive to fluency and coherence than
ROUGE-1. This metric is useful in summarization and
translation, where preserving short word sequences is important
(Lin, C.-Y. 2004).
III. ROUGE-L: This score evaluates similarity based on the
Longest Common Subsequence (LCS) between generated and
reference texts. It accounts for words in the same order, even if
they are not adjacent, reflecting the preservation of sentence
structure and word flow. This makes ROUGE-L effective in
tasks where word order and narrative flow matter (Lin, C.-Y.
2004).
Thus, to assess the textual quality of the generated outputs, these metrics were
utilized as the main evaluation measures. BLEU was used to calculate n-gram
precision, incorporating a smoothing function to account for variations at the sentence
level. In contrast, ROUGE-1, ROUGE-2, and ROUGE-L provided recall-oriented
insights into structural similarities between the model-generated text and reference
answers paraphrased from the source text. Together, these metrics offered a well-rounded analysis
of both the linguistic accuracy and semantic correspondence of the responses,
compared to the reference text.
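The metric computation can be reproduced with standard packages as sketched below; the reference and generated sentences are shortened examples, and the nltk and rouge-score libraries are assumed to be installed.

```python
# Illustrative sketch: scoring a generated answer against a reference with
# sentence-level BLEU (with smoothing) and ROUGE-1/2/L, as described above.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "Submersible pumps often experience failures due to clogging and overheating."
generated = "Submersible pumps often fail due to blockages and overheating."

bleu = sentence_bleu([reference.split()], generated.split(),
                     smoothing_function=SmoothingFunction().method1)

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, generated)

print(f"BLEU: {bleu:.3f}")
for name, res in rouge.items():
    print(f"{name}: F1 = {res.fmeasure:.3f}")
```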
Table 3 presents a qualitative comparison between reference text and responses
generated by a large language model for a set of FMEA questions related to
submersible pumps. The questions target specific aspects such as common failure
modes, root causes, and corrective actions. The comparison is used to assess the
model's ability to provide accurate, relevant, and technically sound information
aligned with expert knowledge. The generated answers were then checked against the
ground truths, and the model was evaluated using the above-mentioned metrics, namely
BLEU, ROUGE-1, ROUGE-2 and ROUGE-L. The dataset has been prepared from the
document titled “Failure Mode and Effect Analysis of Subsea Multiphase Pump” by
Oluwatoyin et al. (2014). The paper mainly contains theoretical information about
FMEA, as well as the FMEA of various types of pumps such as the twin-screw pump,
the helico-axial pump, etc.
Table 3. Dataset for textual quality evaluation
Question: What are the common failure modes of a submersible pump?
Reference text: Submersible pumps often experience failures due to clogging, overheating, and problems with the mechanical seal. Clogging happens when particles or sediment obstruct the intake or impeller. Overheating is typically caused by electrical faults like overloading or short circuits, which generate too much heat. Mechanical seal issues stem from factors like wear, corrosion, or incorrect installation. These problems can decrease pump efficiency, increase energy usage, and ultimately cause the pump to fail.
Generated answer: Submersible pumps often fail due to problems like blockages, overheating, and damage to the sealing components. Blockages typically occur when foreign particles or debris obstruct the pump's intake or rotor. Overheating is commonly caused by electrical malfunctions, such as circuit overloads or short circuits, which result in excessive heat accumulation. Failure of the mechanical seal is frequently due to factors like wear, corrosion, or improper installation. These issues can impair the pump's functionality, increase power consumption, and eventually lead to complete failure of the pump.

Question: What is the corrective action for motor failure in a submersible pump?
Reference text: To minimize the risk of these failures, it is important to follow proper installation practices, perform regular maintenance, and ensure the pump operates within its intended parameters. These actions help avoid clogging, overheating, and mechanical failures, promoting reliable performance and longer service life.
Generated answer: Preventing such failures requires regular maintenance, correct installation, and operating the pump within its specified limits. These practices help mitigate the risks of clogging, overheating, and mechanical breakdowns, thereby improving efficiency and extending the pump's lifespan.

Question: What is the cause of impeller failure in a submersible pump?
Reference text: Impeller damage in submersible pumps may result from corrosion, erosion, or mechanical stress. Corrosion occurs when the impeller is exposed to harmful or abrasive substances. Erosion is caused by the continuous impact of fast-moving water that gradually wears away the impeller surface. Mechanical stress from factors like excessive pressure, imbalance, or vibration can weaken the impeller over time, leading to fatigue and eventual breakdown.
Generated answer: Impeller damage in submersible pumps is often caused by corrosion, erosion, and mechanical stress. Corrosion takes place when the impeller comes into contact with chemically aggressive or abrasive materials. Erosion results from the repeated impact of high-velocity water, which gradually deteriorates the impeller surface. Mechanical stress, arising from conditions such as excessive pressure, imbalance, or intense vibration, can progressively weaken the impeller, ultimately leading to fatigue and failure.
3.9 Evaluation of RAG pipeline
As RAG systems are increasingly used in applications like document question
answering, chatbots, and knowledge base assistants, robust evaluation becomes
critical. Traditional metrics like BLEU or ROUGE fail to capture the unique
challenges of RAG, such as the factual correctness of the generated output, the
faithfulness of the retrieved context, and the relevance of the overall answer, amongst many others.
Retrieval-Augmented Generation Assessment, or RAGAS, fills this gap by
introducing metrics tailored to RAG pipelines, helping researchers comprehensively
measure overall system performance.
RAGAS is a standardized framework developed to evaluate the effectiveness of
RAG systems. It specifically measures how accurately these systems retrieve relevant
content and generate responses that are coherent and grounded in that retrieved
information (Es et al., 2023). RAGAS offers evaluation criteria tailored to RAG
workflows, enabling better diagnosis of retrieval and generation weaknesses,
enhancing system robustness, and facilitating more meaningful comparisons between
different RAG approaches. It achieves this by analysing structured data points, or
triplets, that encapsulate the relationship between queries, retrieved content, and
generated answers, and it applies language models to score these triplets across key
dimensions. These metrics offer a structured way to assess both retrieval and
generation quality, and are as follows (Yu et al., 2024):
1. Faithfulness: It assesses whether the generated answer is factually supported
by the retrieved context. It checks if the model’s response is grounded in the
content/context provided by the documents, rather than introducing
information that was not present. This metric is especially important in
preventing hallucinations, ensuring that the answer remains accurate and
trustworthy based on the retrieved data.
2. Answer Relevancy: Evaluates how well the generated answer aligns with the
original question. Unlike the other metrics, it doesn’t depend on the retrieved
context; instead, it measures whether the answer is on-topic and meaningful in
relation to the question. A high relevance score indicates that the answer is
appropriate and directly addresses the user's query.
3. Context Precision: Measures how relevant the retrieved context is to the
question. It evaluates whether the retrieval step returns documents that are
actually useful for answering the query. A high precision score indicates that
most of the retrieved information is directly related to the question, which
helps the model focus on high-quality, relevant input when generating an
answer.
4. Context Recall: Measures how completely the retrieved context covers the
information needed to produce the reference answer. It looks at whether all the
important content required to answer the question was actually retrieved. A high
context recall score means that the retrieval step surfaced the key pieces of
information needed when generating the response.
Together, they help identify weaknesses in the pipeline, guide improvements, and
ensure the model produces focused outputs. RAGAS is especially valuable when
comparing different RAG strategies or tuning components for domain-specific tasks.
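A minimal sketch of a RAGAS evaluation run over such triplets is shown below; the import paths and column names follow the commonly documented ragas interface and may differ slightly between library versions, and the example answer is a placeholder.

```python
# Illustrative sketch of a RAGAS evaluation over question/answer/context triplets.
# Import paths and column names are assumptions based on the documented ragas
# interface and may vary between versions.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (faithfulness, answer_relevancy,
                           context_precision, context_recall)

samples = Dataset.from_dict({
    "question":     ["What is FMEA?"],
    "answer":       ["FMEA is a structured method to identify failure modes and "
                     "their effects on system performance."],          # placeholder model answer
    "contexts":     [["FMEA is a structured method used to identify critical components, "
                      "their failure modes, and their effects on system performance."]],
    "ground_truth": ["Failure Mode and Effects Analysis identifies failure modes "
                     "and their effects on system performance."],
})

scores = evaluate(samples,
                  metrics=[faithfulness, answer_relevancy,
                           context_precision, context_recall])
print(scores)
```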
The following table shows a sample of the dataset that was used to evaluate the
language models for their retrieval and generation capacity. The dataset
has also been prepared from the document by Oluwatoyin et al. (2014). In addition to
the aforementioned evaluations, the system was further tested using a diverse set of
targeted queries designed to assess its ability to extract and reason over specific failure
modes, root causes, associated effects, and recommended corrective actions. These
queries were tailored to pumps of different types, thereby enabling a comprehensive
evaluation of the system’s domain adaptability and depth of failure analysis across
varied equipment.
Table 4. Dataset for RAGAS evaluation
Question: What is FMEA?
Reference: Failure Mode and Effects Analysis (FMEA) is a structured method used to identify critical components, their failure modes, and their effects on system performance. It helps in modelling and predicting failures to improve equipment reliability, especially in challenging environments like subsea applications.

Question: What is the difference between failure mode and failure mechanism?
Reference: Failure mode refers to the symptom of a failure and describes the effect by which a failure is observed (e.g., loss of pump function, reduced performance). Failure mechanism refers to the physical, chemical, or operational process that causes the failure mode (e.g., high temperature, pressure fluctuations, material degradation).

Question: What are the common failure modes of a helico-axial pump?
Reference: Cavitation: caused by low suction pressure, leading to vapor bubbles that can damage internal components. Seal failures: resulting from high pressure, temperature, or mechanical wear, leading to leaks. Impeller damage: due to wear, corrosion, or blockage, reducing pump efficiency. Bearing failure: caused by excessive load, improper lubrication, or misalignment.

Question: What are the common failure modes of a twin-screw pump?
Reference: Motor wear (due to increased speed/low lubrication), mechanical seal failure (caused by high gas volume fraction, high temperature, or gas bubbles), rotor failure (due to thermal expansion, screw rubs, or equipment damage), cooling system failure (leading to severe equipment damage), oil refilling failure (causing severe equipment damage), bearing damage (resulting from increased speed or mechanical seal leakage), and other failures (such as increased temperature, loss of volumetric efficiency, or corroded gears).

Question: What are the causes of seal failure in a submersible pump?
Reference: Mechanical seal failure in pumps can be caused by a number of issues, including high gas volume fraction within the pump. This leads to reduced cooling efficiency and can cause gas bubbles to form intermittently, further damaging the seal. Other potential causes include wear and tear on the seal material, improper installation, contamination from debris or chemicals, or corrosion of the seal and surrounding components.
3.10 Tools
1. LLM utilised in the study are as follows:
a) Llama 4 Scout 17B: Meta's Llama 4 Scout is a mixture-of-experts
(MoE) model with 17 billion active parameters, 16 experts, and
roughly 109 billion total parameters; only the relevant experts are
activated during inference. This design enhances efficiency by
engaging only the necessary components, and the model is
multimodal, able to process varied inputs (Meta AI, 2025).
b) DeepSeek r1: It is a first-generation reasoning model developed
by DeepSeek AI. Trained using large-scale reinforcement
learning, it excels in complex reasoning tasks across domains
such as mathematics, programming, and language. The model
leverages a mixture-of-experts (MoE) architecture to activate
only the necessary components during inference, enhancing
efficiency (Jin et al., 2025).
c) Gemma 2 9B IT: It is an instruction-tuned variant of Google's
Gemma 2 model. Part of the lightweight, state-of-the-art open
models from Google, it is built from the same research and
technology used to create the Gemini models. Gemma 2 9B IT
is optimized for language understanding, reasoning, and text
generation use cases (Riviere et al. 2024).
d) Mistral Saba: It is a 24-billion-parameter language model
developed by Mistral AI, specifically designed to cater to the
linguistic and cultural nuances of the Middle East and South
Asia. Trained on meticulously curated datasets from these
regions, Saba excels in understanding and generating text in
languages such as Arabic, Tamil, and Malayalam (Mistral AI
2025).
e) Llama 3.1-8b instruct: It is a lightweight instruction-tuned large
language model designed to deliver strong performance across a
wide range of tasks while remaining accessible for fine-tuning
and deployment on limited hardware. Built on the Llama 3
architecture, this 8-billion-parameter model benefits from
instruction tuning that enables more accurate and contextually
relevant responses compared to its base variant. Recent studies
have demonstrated its effectiveness in diverse applications
(Zhang et al., 2024).
2. Programming Languages: Python 3.9.
3. Programming Environment: Google Colab.
4. Vector Databases: Chroma DB and FAISS.
5. Computational Resources: T-4 GPU, freely available on Google Colab
platform.
6. LLM access: Via Groq, a platform that provides free access to open-source
LLMs for various tasks through an API key.