Leveraging Deep Reinforcement Learning Technique for Intrusion Detection in SCADA Infrastructure
ABSTRACT The prevalence of cyber-attacks perpetrated over the last two decades, including coordinated attempts to breach targeted organizations, has drastically and systematically exposed some of the more critical vulnerabilities in our cyber ecosystem, particularly in Supervisory Control and Data Acquisition (SCADA) systems, where targeted attacks aim to bypass signature-based protocols in an attempt to gain control over operational processes. In the past, researchers utilized deep learning and reinforcement learning algorithms to mitigate threats against industrial control systems (ICS). However, as technology evolves, these techniques become less effective at monitoring and enhancing the cybersecurity defenses of those systems against unwanted attacks. To address these concerns, we propose a deep reinforcement learning (DRL) framework for anomaly detection in SCADA networks. Our model utilizes a ''Q-network'', which allows it to achieve state-of-the-art performance in pattern recognition on complex tasks. We validated our solution on two publicly available datasets, WUSTL-IIoT-2018 and WUSTL-IIoT-2021, each comprising twenty-five networking features representing benign and attack traffic. The results obtained show that our model achieved an accuracy of 99.36% in attack detection, highlighting DRL's potential to enhance the security of critical infrastructure and laying the foundation for future research in this domain.
I. INTRODUCTION
… a deep neural network (DNN) algorithmic policy that interconnects the model's neurons gives it significant computational aptitude to analyze intricate network data and make decisions [5]. In terms of multi-tasking capabilities, the DRL technique provides researchers with an array of advanced AI frameworks ready to be deployed in various domains of industrial importance, such as smart grids for voltage regulation [2], adaptive assembly lines [6], and robotics, where specific functions are performed to optimize individual tasks [7]. In cybersecurity [8], for example, DRL algorithms are applied to filter network traffic and prevent intrusions [9], thus reducing the cost and manpower needed to accomplish complex and time-consuming cyber-monitoring tasks. Conceptually, a DRL technique is trained in a pre-defined environment where agents learn from low-dimensional feature inputs derived from metadata and perform meta-learning¹ through trial and error [10], [11].

As a subset of machine learning, DRL's computational proficiency and effectiveness in the video gaming industry display exceptional results in both offensive attack strategies and defensive tactics [12]. DRL's neural network application uses a function estimator to observe state or label inputs as it operates in a set environment, and uses a greedy algorithmic policy to process an unlimited history of network traffic for anomalous and malicious attacks [13]. These capabilities make it a valuable intrusion detection algorithm for protecting and defending against sophisticated cybersecurity attacks, including advanced persistent threats (APT) [14].

In the digital transformation era, various sectors such as business infrastructure, banking, IoT, Industrial IoT, and transportation, including SCADA infrastructure, are under constant threat of cybersecurity attacks and the risk of data breaches [14]. In recent years, machine learning techniques have shown great promise in enhancing the security posture of critical infrastructure by aiding in penetration testing, pattern detection, and real-time attack mapping. However, it should be noted that the complexity of code and the challenges posed by large datasets remain an obstacle for certain aspects of machine learning algorithms [15], [16], particularly in areas involving continuous complex calculations and large-scale decision-making.

In this paper, we conducted a qualitative assessment of cyber-attacks on critical infrastructure and their impact on business operations, drawing insights from studies, e.g., [17] and [18]. We performed a comprehensive review of prior research on the application of DRL methodology in combating SCADA network anomalies; our analysis revealed that while there have been numerous research efforts utilizing DRL approaches to enhance the security of SCADA network infrastructure, the focus has been primarily directed towards hardware and the monitoring aspects of critical infrastructure (CI), including the development of intrusion detection methods and autonomous controls using diverse datasets [19], [20], [21]. In fact, we did not find any relevant papers on DRL that used similar data at the time of our research. Considering these findings, we explored the effectiveness and robustness of the DRL framework [22] in the cybersecurity domain. We developed and implemented a model algorithm, testing it on two SCADA datasets: WUSTL-IIoT-2018 and WUSTL-IIoT-2021. Our research results demonstrated the successful application of DRL methodology in detecting cyber threats within SCADA's critical network [14].

¹ Meta-learning in ML refers to learning algorithms that continuously learn from other learning algorithms.

A. RESEARCH OBJECTIVES
The main goal of this work is to explore the capabilities of Deep Reinforcement Learning (DRL) algorithms to enhance the accuracy of anomaly detection.

In particular, this research explores the potential of implementing a DRL technique to protect and defend critical infrastructure from continuous cybersecurity attacks, as we investigate whether the algorithm can effectively enhance the security and resilience of SCADA systems against ongoing threats from bad actors.

Similarly, we explore possible challenges associated with applying DRL in smart grid systems and discuss specific difficulties or limitations in implementing the technique in the context of critical systems' network anomaly detection, which encompasses protection from cybersecurity attacks.

We discuss optimization and deployment strategies for utilizing the technique to effectively secure SCADA systems and provide insights into specific approaches, methodologies, and necessary considerations for maximizing the efficacy of DRL to improve network security.

Lastly, we identify the potential advantages and disadvantages of deploying DRL as an Intrusion Detection System and Intrusion Prevention System in Industrial Control Systems [3], [23], [24]. This approach seeks to analyze how DRL can enhance the security and resilience of ICS to effectively detect and prevent cybersecurity threats, improve incident response, and mitigate risks to critical infrastructure.

B. CONTRIBUTIONS
Our proposed approach presents a multifold contribution to critical infrastructures' existing security measures.

A) It incorporates novel features such as an actor-critic algorithm [23], designed for optimal policy updates [24], to dynamically assist the model (actor) in maximizing its decision process. The architectural designs of the actor and critic networks are mirrored with the same dynamic parameters to facilitate a reciprocal training environment. In this manner, our actor is trained to generate actions that maximize the expected rewards, while the critic evaluates the quality of those actions in terms of the expected value of future rewards. This iterative process continues until the model converges.
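As a concrete illustration of the mirrored actor and critic networks described above, the following minimal sketch (our own simplification using the Keras API that the paper lists among its tools, with hypothetical layer sizes rather than the authors' exact implementation) builds two identically configured networks and copies the actor's weights into the critic so that both start from the same dynamic parameters:

# Illustrative sketch only: hypothetical layer sizes, not the authors' exact architecture.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

N_FEATURES = 25   # networking features per traffic record (as in the WUSTL-IIoT datasets)
N_ACTIONS = 2     # e.g., classify a record as benign (0) or attack (1)

def build_network() -> tf.keras.Model:
    """Small fully connected network reused for both the actor and the critic."""
    return tf.keras.Sequential([
        layers.Dense(64, activation="relu", input_shape=(N_FEATURES,)),
        layers.Dense(32, activation="relu"),
        layers.Dense(N_ACTIONS),          # one output per possible action
    ])

actor = build_network()
critic = build_network()
critic.set_weights(actor.get_weights())   # mirror the actor's parameters in the critic

# The actor proposes an action for a state; the critic scores that state-action pair.
state = np.random.rand(1, N_FEATURES).astype("float32")
action = int(np.argmax(actor(state).numpy()))
action_value = float(critic(state).numpy()[0, action])
print(action, action_value)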
To address issues stemming from correlated data and non-stationary distributions, we introduce a ''ReplayMemory'' function [3] designed to store experiences and sample transitions (i.e., state, action, reward, next-state tuples) that the agent encounters during its exploration of the environment. This memory buffer enables the agent to learn from past experiences by randomly sampling previous transitions from the memory during training. By incorporating this technique, the agent can leverage a wider range of experiences from its entire history rather than relying solely on its most recent encounters.

B) We present a detailed assessment and conduct a qualitative and quantitative evaluation using two real-world datasets to train our model. We evaluate our algorithm on the WUSTL-IIoT-2018 dataset [25] and the WUSTL-IIoT-2021 dataset [26], consisting of network traffic protocols. To validate our solution, we load the saved model with test data containing multiple binary classification inputs. The agent applies the methods described in Algorithms 6 and 7 to read the labeled attacks and predict the targeted attack labels. The results obtained demonstrate that our DRL model can effectively classify threats in real time and provide detection and response [27].

C) Lastly, we offer valuable insights for future research directions regarding the use of DRL to detect cybersecurity attacks in the SCADA domain. Despite its successes, some of the challenges faced by DRL applications are deep neural network utilization, greedy policy implementation, and the need for sample or data efficiency. In addition to these challenges, there are several points of interest for future research on DRL implementation in the cybersecurity domain. These include: investigating the capability of the DRL framework to handle larger and more complex SCADA network datasets; evaluating the robustness and generalizability of the technique by testing it on diverse SCADA systems and network environments; and exploring the possibility of combining the DRL algorithm with other machine learning techniques, such as anomaly detection and ensemble methods, to further improve the accuracy and effectiveness of cyber threat detection and response. These research directions aim to address challenges in adopting DRL for cybersecurity and advance the development of active solutions in this field.

II. BACKGROUND
In this section, we provide a brief introduction to the DRL framework. Following that, we summarize the necessary background information related to SCADA infrastructure.

A. DEEP REINFORCEMENT LEARNING
Deep reinforcement learning (DRL), a subset of machine learning, is a branch of artificial intelligence that combines deep learning algorithms and reinforcement learning techniques, allowing an agent to interact with its environment and learn from trial and error. Using the Python programming language, we defined the following: an agent is represented through functions, software, or a block of code. The environment, on the other hand, is a function that simulates both virtual and physical training scenarios, containing parameters for an agent to exploit. In this conceptual framework (Fig. 1), the agent explores a pre-defined environment and exploits its parameters [28] to maximize reward signals.

In training, at each time step, the environment sends a scalar reward signal to the reinforcement learning agent for each action taken. It is important to highlight that, due to the iterative algorithmic nature of the DRL framework, the sole objective of the agent is to learn from its implemented stochastic policy to maximize cumulative reward over time [29].
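The following minimal Python sketch (our own illustration rather than the authors' code; the environment and the ReplayMemory class are hypothetical stand-ins) shows this interaction loop, with the agent receiving a scalar reward at each time step and storing the resulting transitions in the kind of ReplayMemory buffer introduced in contribution A):

# Illustrative sketch only; the environment, policy, and reward logic are placeholders.
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

class ReplayMemory:
    """Fixed-size buffer that stores transitions and returns random mini-batches."""
    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, *transition):
        self.buffer.append(Transition(*transition))

    def sample(self, batch_size: int):
        return random.sample(list(self.buffer), batch_size)

class ToyTrafficEnv:
    """Placeholder environment: a state is a feature vector, reward is +1 for a correct label."""
    def reset(self):
        self.label = random.randint(0, 1)               # 0 = benign, 1 = attack
        return [random.random() for _ in range(25)]     # 25 networking features

    def step(self, action):
        reward = 1.0 if action == self.label else -1.0  # scalar reward signal
        return self.reset(), reward, True               # one decision per traffic record

env, memory = ToyTrafficEnv(), ReplayMemory()
state, total_reward = env.reset(), 0.0
for _ in range(100):                                    # agent-environment loop
    action = random.randint(0, 1)                       # random policy as a stand-in for the agent
    next_state, reward, done = env.step(action)
    memory.push(state, action, reward, next_state)      # store the experienced transition
    total_reward += reward                              # cumulative reward the agent tries to maximize
    state = next_state

batch = memory.sample(32)                               # random mini-batch for a later learning update
print(total_reward, len(batch))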
FIGURE 1. DQN Agent in SCADA Environment Using MDP.
… the deep neural network, which enables the agent to learn and interact with the environment. During training and upon initialization, the agent uses data inputs as observations and its outputs as actions while attempting to maximize reward signals, where Q(s, a) represents the sum of all rewards and max Q(s′, a′) the maximum reward an agent is able to achieve from its current state [31]. The third component is the ''training algorithm'', which updates the deep neural network's weights based on the reward function and the actions taken by the agent. In our proposed model, the state-action value function is called a Q function and is defined as Q(s, a). Referred to as Q-learning, this Q function is used to obtain optimal rewards for each action as it updates the policy using the Bellman equation [24]:

Q(s, a) ← Q(s, a) + α (r + γ max Q(s′, a′) − Q(s, a))

1) DEEP Q-NETWORK ALGORITHM
Deep Q-Network (DQN) is a deep reinforcement learning algorithm that uses deep neural networks, or a Q-network, to approximate optimal action-value functions in a Markov Decision Process (MDP) [32]. The technique is built upon the traditional Q-learning algorithm, with the network serving as a function approximator. Unlike tabular reinforcement learning, which maintains a Q-table to store Q-values for each state-action pair, the Q-network takes the raw observation state as input and directly outputs the estimated Q-values for all possible actions. Introduced by Google DeepMind in 2013 [33], the DQN technique combines deep neural networks and Q-learning to learn optimal policies in complex and high-dimensional environments [34].

B. SCADA INFRASTRUCTURE
… and communications. Known as the Industrial Internet of Things, or IIoT 4.0 [38], these peripherals enable remote automation and high-level supervision of critical infrastructures like power grids and water distribution plants. The system is an integral component of smart grids and facilitates real-time data analysis, command, and control processes for assets. As depicted in Figure 2, a standard SCADA structural design can be outfitted with ''Remote Terminal Units, Programmable Logic Controllers, etc.'' These microprocessors communicate commands to field devices such as pump units and valves, and the processes are then displayed in a ''human-machine interface'' (HMI) for visual confirmation [39].

1) SCADA OPTIMIZATION PROCESS
• Control processes
• Monitor and gather information in real time
• Interact with various devices
• Record events in a back-end database

Based on Figure 2, regardless of topological configuration and structural design, a SCADA system uses data from its connected nodes or (IIoT) devices to perform its operations. These nodes are a collection of sensors and other monitoring devices attached to the network via Ethernet cables or wireless communication channels [40]. As depicted in Figure 3, remote control access requires internet communication and uses the following modes for data exchange: a ''Local Area Network'' (LAN) or a ''Wide Area Network'' (WAN). Protocols such as Modbus, DNP3, OPC, and others define the format and rules for exchanging data and commands between nodes within a system [40].
… and recover from network transmission attacks. The proposed methodology presented by Wang et al. [21] is a DRL-based Q-learning attack strategy, simulated on the IEEE 30-bus system for SCADA load management uncertainty, a vulnerability that a hacker may use to trip critical transmission lines. Their model detects and prevents smart-grid coordinated topological attacks by identifying false electronic communication. The actor-critic approach by Moradi et al. [29] is a high-level algorithm structured to strategically simulate attacks in a smart environment. The technique can simultaneously learn policies and detect network anomalies in smart electrical communication systems. The evaluation shows positive results against the Wood and Wollenberg 6-bus and the IEEE 30-bus systems, respectively. The abnormal flow detection in industrial control networks presented by Wang et al. [44] is a DRL model which the authors deploy to monitor ''abnormal flow'', a form of intrusion in ICS systems. The model helps prevent bad actors from taking over command of industrial control systems or natural gas pipeline operations. To the best of our knowledge, our study is the first to implement a DQN in the SCADA ''Industrial Internet of Things'' for intrusion detection. In contrast, the gap between our proposed algorithm and the related work compared in Table 5 lies in the sophistication of our model's architecture, the complexity of our datasets, and the performance achieved, highlighting our algorithm's advancements in handling complex tasks and preventing continuous cyberterrorism attacks against SCADA communication infrastructure.

IV. METHODOLOGY
As follows, we present the research methodology employed in constructing our DRL model, which is based on our investigative discoveries [38].

Our research aims to explore the potential of DRL techniques in the cybersecurity domain, specifically focusing on optimizing and deploying these techniques to enhance the security of industrial control systems. By leveraging DRL, we aim to strengthen the resilience of SCADA systems against various cyber threats. Here, we introduce the four-step methodology:
1. Building the DRL Model (Section IV-A). We developed a DQN-based approach that effectively addresses the challenges in protecting and defending critical infrastructure against cybersecurity attacks using complex IIoT datasets.
2. Applying DRL to SCADA (Section IV-B). We describe how the DRL model can be applied to the SCADA domain.
3. Optimizing DRL for SCADA (Section IV-C). We detail the steps to take in order to optimize our DRL model for the SCADA domain.
4. Monitoring SCADA with DRL (Section IV-D). We elaborate on the monitoring techniques that enable our model to identify potential attacks on SCADA systems.
Following this, we offer a detailed breakdown of the four steps.

A. BUILDING THE DRL MODEL
In this paper, we utilize datasets within the IoT and IIoT domain.

As follows, we outline some of the primary protocol and software deficiencies affecting IIoT with respect to SCADA, highlighting how these discrepancies continue to jeopardize the security of smart grid systems. In reference to the framework of IoT vulnerabilities outlined in [40] and [47], we categorize security threats in IIoT devices to include the following vulnerabilities:
• Outdated software; inadequate access control, which may allow unauthorized access.
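Separately from these protocol- and software-level weaknesses, building the model starts from the traffic datasets themselves. The following minimal sketch (our own illustration; the file name and column names are hypothetical placeholders rather than the actual WUSTL-IIoT schema) shows how such IIoT flow records could be loaded and label-encoded, in the spirit of the label dictionary used later in Algorithm 6:

# Illustrative sketch only: file and column names are placeholders, not the real dataset schema.
import pandas as pd

# Dictionary-style label mapping, mirroring the idea of Algorithm 6.
ATTACK_MAP = {"normal": 0, "Probe": 1, "R2L": 2, "DoS": 3}

def load_traffic(csv_path: str) -> tuple[pd.DataFrame, pd.Series]:
    """Load flow records, drop duplicates, and encode the label column as integers."""
    df = pd.read_csv(csv_path)
    df = df.drop_duplicates().dropna()                 # basic cleaning before training
    labels = df["label"].map(ATTACK_MAP).fillna(1)     # unmapped labels treated as generic attacks
    features = df.drop(columns=["label"])              # remaining networking features
    return features, labels.astype(int)

if __name__ == "__main__":
    X, y = load_traffic("wustl_iiot_sample.csv")       # hypothetical file name
    print(X.shape, y.value_counts())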
3) Actor-Critic: In our model, we implemented an actor-critic algorithm (Algorithm 2) [23], combined with our policy gradient, which updates the training agent's (actor's) expected reward for a given state-action pair:

δt = rt + γ · V(st+1) − V(st)
θ ← θ + αθ · δt · ∇θ log πθ(at | st)
V(st) ← V(st) + αV · δt

where:
• δt is the temporal difference (TD) error, which measures the difference between the predicted value of the current state-action pair and the actual reward received.
• V(st) is the estimated value of state st.
• αθ and αV are the learning rates for the policy and value function, respectively.

Algorithm 2 Actor-critic algorithm implementation
Input: Initialize policy, two soft Q and two target soft Q-DNNs; initialize experience replay buffer with attack samples;
Result: Optimized actor and critic DNNs
for each episode do
    for each step do
        sample actions from the policy; sample a transition from the environment; store the transition in the replay buffer;
    end
    for each gradient update step do
        update the soft Q-network weights; update the policy network weights; adjust the entropy temperature; update the target network weights;
    end
end

2) ON-POLICY VS OFF-POLICY
Actor-critic algorithms can be on-policy or off-policy. These methods are used in DRL to determine whether the data collected during training is used for updating the policy or remains neutral.

The main objective of this paper is to develop a DRL framework that improves network security with a robust algorithm that optimizes attack detection. To that end, we tested our actor-critic algorithm using the two policy methods; however, because on-policy (SARSA) uses the same policy it seeks to improve for experience collection [49], that approach could not fully optimize our actor network for our intended solution.

Based on this insight, we conducted our experiment using the off-policy approach, in which the method instead follows a tailored policy, such as ''ϵ-greedy'', for experience replay. In doing so, our agent learned from a parameterized policy [53] that differed from the one it was supposed to follow (on-policy), thus allowing it to explore the state space using a stochastic algorithmic policy.⁵

⁵ Stochastic: a probability distribution that allows an agent to explore the state space without always taking the same action.

In adopting this algorithm, the agent updates the action-values Q(S, A), moving the sum of rewards Q(s, a) toward max Q(s′, a′), the optimal reward at st+1, based on the maximum action-value of the next state, even though the exact action-value for the next state might not be fully known due to the random actions [35], [6]. Nonetheless, this tabular algorithm ensured the computation and convergence of the action-values, Qt+1(St, At) = Qt(St, At) + αt [Rt+1 + γ maxa Qt(St+1, a) − Qt(St, At)], for each state-action pair, leading to the exploration of the optimal action-values as states are explored. In Algorithm 3, we present the pseudo-code implementation of our off-policy model as described by Sutton and Barto [54].

Algorithm 3 Q-learning (off-policy approach) for estimating π ≈ π∗
Algorithm parameters: step size α ∈ (0, 1], small ϵ > 0;
Initialize Q(s, a), for all s ∈ S+, a ∈ A(s), arbitrarily, except that Q(terminal, ·) = 0;
for each episode do
    Initialize S;
    for each step of episode do
        Choose A from S using the policy derived from Q (ϵ-greedy);
        Take action A, observe R, S′;
        Q(S, A) ← Q(S, A) + α [R + γ maxa Q(S′, a) − Q(S, A)];
        S ← S′;
    end
end

In addition to the off-policy approach, we improved our method in its exploratory state of the environment by adding the decay function σ(N) = σ0 e^(kN). The algorithm represents the relationship between the model's input and output as the rate of decay. By defining the values of σ0 and k, we parameterized the rate at which the decay value changes during iterations. The method helps estimate future values and rewards based on observed trends [55].

3) SOLVING EXPLORATORY NOISE
In our algorithm, exploration noise arose from deliberate exploration strategies, environmental stochasticity, random initialization, and the use of experience replay. These features enable our agent to discover optimal policies in complex environments [56] and also encourage the agent to explore new actions.

To enhance our approach during the exploratory phase of the environment, we utilize a decay function. Gradually, the method reduces the level of exploration, thus making the agent more deterministic as it accumulates experiences in the environment. By incorporating a decay function within the algorithm as a policy, we helped guide the agent's actions towards increased determinism based on its experience level, using this equation:

π(a | s, N) = (1 − σ(N)) πtarget(a | s) + σ(N) / |A|
where πtarget(a | s) is the target policy, |A| is the number of actions in the action space, and σ(N) = σ0 e^(kN) is the decay function for the exploration noise [55].

Alternatively, when incorporating the decay function into the exploration noise as a policy, our agent learned how to effectively balance both the exploration and exploitation time, thereby allowing it to achieve better performance during each episode over the course of the training [57].

In testing our model's robustness, we train and evaluate our DRL algorithm with network traffic collected from a SCADA test bed. We inspect the data for typos and remove all duplicates and irrelevant content, which streamlines the data preparation. Next, we label the attacks to be iterated, so that when introduced to the environment, the model can learn from the patterns [35]. We set the decay parameter and epsilon values for each step, initialize both our loss function and the ReplayMemory with a batch size of 32, and then train the model for 250 episodes (Fig. 10). The successful results obtained from our experiment show that DRL has the potential to improve the detection and prevention of persistent attacks against SCADA systems.

Summary: DRL has the capability to learn patterns effectively, using a convolutional neural network while leveraging Deep Q-Learning (Fig. 4) to efficiently classify intrusions from SCADA's network traffic protocols.

B. APPLYING DRL TO SCADA
DRL is a powerful technique, ideal for monitoring and securing SCADA systems. However, from an algorithm design perspective, many challenges in method applicability must be considered when choosing the appropriate framework best suited for SCADA security, particularly when employing the Deep Q-learning algorithm and an epsilon-greedy policy [29]. During the software development, training, and testing of our DRL algorithm, overcoming the following challenges was crucial to the success of our experiment:

Exploration vs. Exploitation Trade-Off: This framework represents a fundamental concept in reinforcement learning, including DRL. It outlines the dilemma faced by an agent when deciding between exploring new actions or exploiting its current knowledge to maximize long-term rewards.

In the explorative state, training our actor in a complex environment with large state and action spaces is very time-consuming, as the agent network samples each set of actions to obtain optimal rewards. This risky strategy can lead to negative results and ultimately affects our model's detection outcome. In contrast, in the exploitation state our model experienced ''local optima'' and over-fitting as it exploited prior knowledge through experience in favor of long-term rewards [58]. Though exploiting existing knowledge can lead to more efficient and effective decision-making, the random selection of actions with a probability of ϵ (epsilon) can also lead to sub-optimal performance [59].

To address this trade-off, we improved the robustness of our DRL-based model by incorporating a deep Q-learning algorithm, along with an off-policy approach and a decay function. This optimization aims to overcome the limitations and enhance the overall performance of the model [14].

Summary: In developing our DQN model, the explorative and exploitative concepts within the DRL methodology presented a significant challenge in our design, particularly in large state and action spaces. To overcome this trade-off and achieve optimal performance, we carefully managed and balanced the technique by incorporating a ReplayMemory function with mini-batch training and a decay function in conjunction with an off-policy approach.

C. OPTIMIZING DRL FOR THE SCADA ENVIRONMENT
From our investigative analysis of SCADA, particularly from the perspective of its dynamic configuration (i.e., proprietary software, hardware, and topological configuration), the information collected and expertise gained from the analysis led to the structural design of our proposed approach.

As outlined in the previous subsection, the difference between exploration and exploitation can be balanced to help optimize DRL as they merge in unified facets, in which the model decides ''when'' to explore versus ''what'' to exploit as a strategy [48]. It is worth noting that several explorative strategies have been developed to address this trade-off, including Thompson sampling, Upper Confidence Bound (UCB), epsilon-greedy, and more. By incorporating a level of randomness, these strategies help balance this trade-off.

To overcome the performance challenges associated with the explorative and exploitative states, we designed a DRL framework that balances explorative strategies (Section IV-B) using epsilon-greedy, as presented in Algorithm 3, which encourages the agent network to explore by trying out new actions while actively taking advantage of experienced knowledge through exploitation. Our design incorporates regularization, in which the ''decay weight'' is adjusted, with an ensemble method comprised of an actor-critic (Algorithm 2); the technique helps balance the model's efficiency and prevents the model from failing to generalize to new environments.

We adopted this approach because SCADA datasets are subject to data bias, due to their dynamic and non-stationary properties [21], and also because of privacy and security concerns, as their structural designs differ based on regions, software, etc.

To avoid model specialization with respect to these DRL challenges, we create a separate data class with specific functions and methods tailored to accommodate each individual dataset's format [60]. This allows us to carefully tune and
adjust our hyperparameters [48], [59], as shown in Table 3, in order to achieve the best possible performance.

Summary: To optimize DRL for SCADA network security, we addressed the challenges caused by model generalization and specialization. They are due in part to data formatting; we create a separate data class with functions and methods tailored to SCADA designs to resolve the problem.

D. MONITORING SCADA USING THE DRL ALGORITHM
In our investigative findings, we identified various domains in which DRL algorithm implementations have shown remarkable success. For example, in the gaming industry, DRL deployment in video game applications developed learning strategies in complex and large action spaces for reward maximization; in assembly line operations, the technique also demonstrated substantial results in process optimization and labor reduction. However, with respect to these domains, when considering a similar methodology for cybersecurity threat detection, particularly in SCADA applications, we identified several functions in the DRL structural design that affect the algorithm's performance [3], [61], [62].

For example, implementing a greedy policy limitation function confines the agent to relying solely on current knowledge to exploit actions that maximize immediate rewards. The exploration and exploitation function within the algorithm design would direct the agent to either actively explore the environment to gain knowledge and discover new threat patterns about protocol vulnerabilities to maximize rewards, or exploit the environment by taking uncertain and potentially risky actions that may reduce immediate rewards [63]. The cascading effects of these features can make it challenging for a DRL agent to operate effectively in stochastic, unpredictably large, and complex environments [53].

Unlike video game environments and assembly lines, which are strictly designed with constants such as known rules and constraints and a deterministic configuration that makes it relatively easy for the DRL technique to optimize rewards, SCADA systems often operate in dynamic and unpredictable real-world environments, with complex interconnected systems comprising water treatment plants, power grids, or manufacturing facilities. The interdependencies of nodes and network communication settings of the individual systems have also proven to be very challenging for a DRL agent to optimize rewards effectively [52], [64]. As follows, we provide explanations of the different method implementations:

• To actively monitor SCADA systems, we address the challenges stemming from the DRL framework by developing a DQN model with defined security objectives that allow the algorithm to detect network anomalies and classify them as network intrusions based on our label parameters [58].
• To improve scalability, we implemented a replay memory and utilized a mini-batch function that allows the agent to sample data from its prior experiences to optimize learning. We employ an off-policy approach along with a decay function to balance the trade-off between exploration and exploitation for reward optimization while reducing unnecessary exploitation time [65].
• We pre-processed and formatted our data using multi-class binary classification, which reduces false positives, and then trained our model on historical data containing examples of both normal network behavior and various attack scenarios.
• We continuously monitor our DQN's performance by testing it on additional datasets to ensure its efficacy in real-time threat detection and response, thus simulating the automation of regular updates for new attack patterns and emerging security risks.

Using this approach, as the model adapts to changing network traffic, it generates appropriate actions or alerts to address security threats. This proactive monitoring technique allows our model to identify potential attacks on SCADA communication systems in real time and take preventive measures to ensure the integrity and reliability of the grid [66].

Summary: To train and deploy a DRL model to actively monitor smart infrastructure, we simulated the following: we selected two well-known datasets to simulate the data collection and formatting process; we defined the security objectives using binary classification; we developed and trained a DQN model using a SCADA dataset; we then reloaded the saved model and tested it on a second set of data, thus simulating ''real-time deployment.'' The result of this experiment shows that our model continuously analyzes the incoming data, leveraging its training to recognize patterns of normal behavior and anomalous activity.

V. IMPLEMENTATION AND EVALUATION
For our implementation, we utilize the SCADA use case presented in Fig. 5. This use-case scenario describes a network fully responsible for remotely supervising, monitoring, and controlling critical infrastructure, such as electrical grids and stations, water treatment facilities, transportation systems, or oil refineries. In this complex operational cyber-space, the threat model involves an unauthorized attacker gaining access to and compromising SCADA network traffic protocols, attempting to disrupt the system's operations.
A. SYSTEM THREAT MODEL
1. Initial Compromise: Through vulnerability exploitation, either by means of brute force or social engineering, an attacker gains unauthorized access to the system's network and begins to sabotage network traffic protocols and disrupt industrial processes.
2. Traffic Protocol Manipulation and Alteration: Upon establishing administrative command, the attacker cancels and encrypts control commands, injecting false sensor data to disrupt communications between IoT and IIoT devices. Such a compromise may lead to catastrophic loss and physical damage to consumers.
3. Detection Scheme and Response: Deploy a trained DRL as an intrusion detection and response system to mitigate and assess abnormal network behaviors and to identify possible threats caused by compromised traffic protocols.

Critical Infrastructure Model: Because modern critical infrastructure systems are autonomous, they use IoT and IIoT devices and internet protocols (i.e., TCP/IP) to monitor and control critical functions in water treatment plants, power grids, transportation networks, etc. Although these devices provide cutting-edge multi-functionality for data collection and automation processes, their designs pose the greatest security vulnerability to the system. Therefore, deploying a trained DRL as an NIDS [35], [67], [68] can authenticate communication and filter the TCP/IP traffic of attached nodes in a synchronous and asynchronous [66] manner to help detect lurking threats.

Applying DRL to Critical Infrastructure: In this section, we describe our Deep Q-Network approach to ensure the security of SCADA's network-attached nodes (Fig. 2). These interconnected IoT and IIoT devices lack the necessary security features, which creates a larger attack surface within the system, thus making it more susceptible to cyber threats. However, applying DRL as an intrusion detection ''avant-garde'' addresses this challenge by using deep Q-learning [58] algorithms to analyze network traffic patterns and identify anomalous behaviors for potential intrusions, while reducing downtime using the deployment approach in Fig. 5 and the detection strategy in Algorithm 4.

B. EXPERIMENTAL ENVIRONMENT
All of our experiments were conducted on a computer with the following specifications:
• Intel(R) Core(TM) i7-1700 CPU @ 2.90 GHz; 16 GB RAM
• Intel(R) UHD Graphics 630 GPU
• A dedicated AMD Radeon RX 640 GPU
• Windows 10 Pro operating system
• Python 3.9.12, gym 0.21.0, TensorFlow 2.9.1
• Keras 2.9.0, SciKit-Learn 1.0.2, matplotlib, seaborn

C. EVALUATION
In this section, we briefly outline the key tenets of our algorithm's research objective and provide a brief description of the datasets used in our proposed approach.

Our experimental objective was to develop and evaluate a DRL framework to assess its robustness and effectiveness and to determine whether the algorithm can be successfully leveraged to detect cybersecurity attacks in large and complex environments [29]; specifically, to investigate the impact of using the DRL technique on new and unseen data.
To validate these objectives, we developed a DQN model consisting of two hidden layers with an actor-critic algorithm and a replay-memory function.

As illustrated in Algorithm 5, to balance our model's explorative and exploitative behavior in large, complex state and action spaces, we implemented an off-policy approach along with a decay function that allows the agent to sample mini-batches.

1) DATASETS
The WUSTL-IIoT-2018 dataset is a dataset focused on Industrial Internet of Things (IIoT) systems, capturing features related to network traffic, device interactions, and protocols used. It contains a significant number of attacks, including DDoS, injection, and command execution attacks [25].

The WUSTL-IIoT-2021 dataset is a recent dataset focusing on IIoT devices and systems. It includes updated information on IIoT behavior, network interactions, and potential cybersecurity threats. The dataset contains a variety of features specific to IIoT, with a substantial number of attacks, including unauthorized access, data exfiltration, and device manipulation [69].

6 WUSTL-IIoT-2018: https://www.cse.wustl.edu/~jain/iiot/index.html
7 WUSTL-IIoT-2021: https://www.cse.wustl.edu/~jain/iiot2/index.html

Dataset Characteristics: During our preliminary analysis and pre-processing of the datasets (Fig. 6), we identified four types of attacks commonly simulated to test SCADA network security defenses: Port Scanner, Address Scan Attack, Device Identification Attack, and Exploit [16], [70], [71], [72]. These ''scans'' represent a precursor or the initial stage of more severe attacks. These frontal assaults help collect sensitive data and subsequently expose an organization's vulnerability [14]. Their success can lead to data breaches and possibly affect control decisions, causing equipment damage or triggering unsafe use of infrastructure assets. Based on those threat patterns, we mapped and labeled those attacks as our target detections (Algorithm 6).

Algorithm 6 Label mapping
Input: Initialize dataset
Process: Initialize an empty dictionary to store labels
Output: DICTIONARY = Attack_labels & Attack_map
for each label in mapping dict do
    if the label is ''normal'' then add to dictionary with a value of 0;
    if the label is ''Probe'' then add to dictionary with a value of 1;
    if the label is ''R2L'' then add to dictionary with a value of 2;
    if the label is ''DoS'' then add to dictionary with a value of 3;
end

Algorithm 7 Reading labels
Input: Initialize dictionary
Output: Attack_labels
forall correct labels do
    labels ← Agent
    for each iteration do
        st, at = []
        for each attack label in attack_map do
            n = length(attack_map)
        end
        return labels
    end
end

In meeting our set goals, we designed a double-layer Deep Q-Network algorithm [58]; we pre-processed two unique datasets and extracted features of importance for detecting cyber anomalies. We then label each selected feature and store them in a dictionary to fit our model's input, as shown in Algorithm 6. Upon initialization, the actor-network loops through the dictionary and matches the corresponding labels with each attack category, as presented in Algorithm 7. Next, we fine-tuned the algorithm using the values in Table 3. These methods allow our actor to recognize patterns from normal and anomalous network traffic behavior.

2) PERFORMANCE METRICS
For validation, we used the standard measurement metrics to assess the model's predictive performance: ''Accuracy, F1-score, False Positive, and False Negative rates.'' We also use the TP, TN, FP, and FN counts to calculate our model's performance.

Accuracy: measures the ratio of correctly predicted labels using this equation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

Precision: measures the accuracy of positive predictions:

Precision = TP / (TP + FP)    (2)

Recall: quantifies the number of correct positive predictions made out of all positive predictions that could have been made by the classifier:

Recall = TP / (TP + FN)    (3)

F1-score: measures the harmonic mean of precision and recall using this formula:

F1-score = (2 · P · R) / (P + R)    (4)

Lastly, we trained and evaluated our model using the selected datasets, WUSTL-IIoT-2018 [25] and WUSTL-IIoT-2021 [26], comprised of network traffic protocols. After successfully training the model, our DRL learns to classify threats in real time and provides detection and response [27]. With respect to validation (Section V-C2), we loaded the saved model with our test data, comprised of multiple binary classification inputs. Using both methods from Algorithms 6 and 7, the agent network reads the selected labeled attacks and predicts our targeted attack labels.

VI. RESULTS
The results presented from our evaluation illustrate the training characteristics of our DRL framework. Table 4 shows the evaluation results achieved on the two datasets considered in this work: WUSTL-IIoT-2018 and WUSTL-IIoT-2021. Likewise, in Table 5, we present the performance metrics of our model, which were compared with two other research works that also utilized a DRL technique (i.e., [15] and [44]) on the same dataset. We considered only the
research works by Tharewal et al. [15] and Wang et al. [44], because they were the only two presenting a complete evaluation of their solution with the appropriate evaluation metrics. In this experiment, we trained our model using an actor-critic algorithm that learns to make decisions. In the conceptual environment, as demonstrated in Figure 10, the critic evaluates the actor's performance by providing feedback; the actor then updates the policy distribution based on the information received, thus helping to improve the model's learning capabilities.

TABLE 5. Comparison table WUSTL-IIoT-2018.

After convergence, the model's total loss per episode drops from 0.5 and remains stable at under 0.3 for the duration of the training, as presented in Figure 8.

TABLE 3. Environment hyper-parameter.

Our DQN is trained with optimized parameters, as recorded in Table 3 and Algorithms 6 and 7, to accurately classify both normal and anomalous behavior from the input data. In this synchronous process, we set our critic parameters equal to those of our actor network (i.e., identically mirroring γ, ϵ, ϵmin, and rd) as target values.

We adjust our DQN minibatch_size after each training phase to minimize the loss function, aiming to improve the alignment between the predicted output of the model and the desired output, as labeled in Algorithm 6. Overall, the training results demonstrate that our model successfully captures and accurately interprets the assigned attack labels, achieving convergence with an improved loss of less than 0.4 over 250 training episodes (Fig. 10), while accumulating the highest number of rewards (Fig. 7).

During each learning step, our model updates both the actor's and the critic's parameters using ''policy gradients'' and the ''advantage value'', minimizing the mean squared error with the Bellman equation, as detailed and implemented in Section IV-A1.
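A minimal numerical sketch of this learning step (our own illustration with placeholder values, not the authors' implementation) computes the Bellman targets for a sampled mini-batch and the mean squared error between those targets and the predicted Q-values:

# Illustrative sketch only: random placeholder values stand in for real network predictions.
import numpy as np

gamma = 0.99                                               # discount factor
batch_size = 32                                            # mini-batch sampled from replay memory
n_actions = 2

rewards = np.random.choice([-1.0, 1.0], size=batch_size)   # scalar rewards for the batch
q_pred = np.random.rand(batch_size, n_actions)             # Q(s, a) from the online network
q_next = np.random.rand(batch_size, n_actions)             # Q(s', a') from the mirrored target network
actions = np.random.randint(0, n_actions, size=batch_size) # actions actually taken
done = np.random.rand(batch_size) < 0.5                    # whether each episode terminated

# Bellman target: r + gamma * max_a' Q(s', a'), with no bootstrapping on terminal states.
targets = rewards + gamma * q_next.max(axis=1) * (~done)

# Mean squared error between the predicted Q-values of the taken actions and the targets.
q_taken = q_pred[np.arange(batch_size), actions]
mse_loss = float(np.mean((q_taken - targets) ** 2))
print(mse_loss)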
FIGURE 9. Training results.

VII. THREATS TO THE VALIDITY
In this section, we explore the conceivable threats to the validity of our study and discuss criteria that could potentially impact the integrity and applicability of our findings. We acknowledge and address the following threats:

SCADA technology plays a significant role in the autonomous operations of smart infrastructure.

C. MITIGATION STRATEGIES
To alleviate these threats, we employed rigorous methodological procedures, including:
• randomized controlled trial selection
• carefully conducted sampling and robust statistical analyses
Nevertheless, it is important to acknowledge that some degree of uncertainty may still exist. Additionally, we have discussed and addressed several of these concerns in the limitations section.

D. LIMITATIONS
Despite our proposed DRL's promising outcomes and effective results in identifying patterns and detecting cyber-attacks in
SCADA infrastructure, we have identified three limitations that should be acknowledged. We detail these limitations and outline the measures taken to mitigate these constraints:
1. Limited Dataset: one of the limitations of our study is the availability of a limited dataset specific to SCADA infrastructure cyber-attacks. To mitigate this limitation, we utilized the WUSTL-IIoT-2018 and WUSTL-IIoT-2021 datasets, which include 25 networking features representing both benign and attack traffic. While these datasets provide valuable insights, they may not cover the entire spectrum of potential cyber-attacks. In particular, the selected datasets WUSTL-IIoT-2018 and WUSTL-IIoT-2021 may not represent the broader population of SCADA infrastructure, which could lead to sampling bias. Such inconsistency may affect a model's accuracy and cause poor performance on unseen data [73]. To address this, we ensured rigorous preprocessing to maximize the utility of the available data.
2. Generalization: our proposed model may face challenges in detecting novel or previously unseen cyber-attack types that are not present in the training dataset. Although the Deep Q-learning algorithm and the Q-network as function approximators are designed to capture complex patterns, the model's performance may be affected when encountering unknown attack patterns. In other words, our DRL model, trained on two specific datasets, as explained under ''non-stationary data,'' may not generalize well to other datasets or when applied to different scenarios. This could limit the applicability of the model and may require substantial retraining when attempting to use the saved model in a new context. To address this, we emphasize the need for continuous monitoring and updating of the model with new data to adapt to emerging threats and to ensure ongoing effectiveness in real-world scenarios.
3. Scalability: the proposed framework utilizes a fully connected neural network architecture with two hidden layers consisting of 64 and 32 fully connected neurons, respectively. While this architecture has demonstrated satisfactory performance in our experiments, scalability may become a concern when dealing with larger SCADA systems or more complex environments. To address this, future research should focus on investigating advanced network architectures and exploring techniques such as convolutional or recurrent neural networks to enhance the scalability of the model without compromising its performance.

E. FUTURE RESEARCH DIRECTIONS
Subsequently, we present four future research directions that arise from our study:
1. Focus on Adversarial Attacks: as the sophistication of cyber-attacks continues to evolve, future research should explore the vulnerability of the proposed DRL framework to adversarial attacks. Investigating methods to enhance the model's robustness against adversarial manipulations and exploring adversarial training techniques could significantly improve its real-world applicability and effectiveness.
2. Use Real-time Data: real-time detection and response are crucial in protecting SCADA infrastructure. Future research should focus on reducing the inference time of the proposed model to ensure timely identification and prevention of cyber-attacks. Techniques such as model compression, quantization, and hardware acceleration can be explored to achieve low-latency and efficient deployment of the framework in real-world scenarios.
3. Apply Transfer Learning and Data Augmentation: to address the limited dataset issue, future research can explore transfer learning techniques to leverage pre-trained models on larger and more diverse datasets in
related domains. Additionally, data augmentation techniques can be employed to generate synthetic samples that represent a wider range of attack scenarios, further enhancing the model's generalization capabilities.
4. Improve Explainability and Interpretability: DL models, including the proposed DRL framework, often lack interpretability, making it challenging to understand the reasoning behind their decisions. Because DRL models are difficult to interpret, it may be challenging to fix errors, diagnose, or even improve a model's overall performance, due in part to structural complexity. However, through proper design choices and documentation, including the model's assumptions and hyperparameter settings, researchers can gain a clear understanding of a model's behavior and even broader insight into its decision-making process. Future research should focus on developing techniques to explain and interpret the model's behavior, providing insights into the features and patterns that contribute to cyber-attack detection. Doing so would not only help enhance the model's trustworthiness but also enable cybersecurity analysts to gain valuable insights for further investigations.

By addressing these limitations and exploring future research directions, we can continue to advance the capabilities of DRL in detecting and preventing cyber-attacks in SCADA infrastructure, ultimately enhancing critical infrastructure security and protecting against emerging threats.

VIII. CONCLUSION
In this work, we aimed to address the challenges of the DRL framework and its applicability in cybersecurity to advance the development of effective solutions in this field. To achieve this, we designed a double-layered DQN model that utilizes a Deep Q-Learning algorithm to enhance learning in complex environments while adapting to new and unseen data. To balance our model's explorative and exploitative behavior in complex state and action spaces, we implemented an off-policy approach with a decay function, allowing the agent to sample mini-batches from a replay memory containing past experiences. To evaluate the model's efficacy, we explored the security challenge of protecting critical infrastructure against continuous cyber-security attacks. We conducted an investigative threat analysis of SCADA systems to assess their network dependency and potential vulnerabilities. Based on our findings, we selected the SCADA domain as our target objective and evaluated our DQN using two publicly available datasets from the SCADA testbed: WUSTL-IIoT-2018 and WUSTL-IIoT-2021. The results from our trained model show that our DQN can learn to classify threats at 99% accuracy and provide detection and response in real time. In retrospect, it is important to note that detecting certain types of attacks, such as spoofed traffic, destination packets, and total byte attacks, can be challenging and may require collaboration with other techniques and tools. Overall, the use of DRL for detecting cybersecurity attacks in SCADA infrastructure is a promising approach with the potential to significantly improve the security of these systems.

REFERENCES
[1] D. P. Joseph and J. Norman, ''An analysis of digital forensics in cyber security,'' in Proc. 1st Int. Conf. Artif. Intell. Cogn. Comput. (AICC). Singapore: Springer, 2018, pp. 701–708.
[2] R. Diao, Z. Wang, D. Shi, Q. Chang, J. Duan, and X. Zhang, ''Autonomous voltage control for grid operation using deep reinforcement learning,'' in Proc. IEEE Power Energy Soc. Gen. Meeting (PESGM), Aug. 2019, pp. 1–5.
[3] X. Liu, W. Yu, F. Liang, D. Griffith, and N. Golmie, ''On deep reinforcement learning security for industrial Internet of Things,'' Comput. Commun., vol. 168, pp. 20–32, Feb. 2021.
[4] J. Duan, D. Shi, R. Diao, H. Li, Z. Wang, B. Zhang, D. Bian, and Z. Yi, ''Deep-reinforcement-learning-based autonomous voltage control for power grid operations,'' IEEE Trans. Power Syst., vol. 35, no. 1, pp. 814–817, Jan. 2020.
[5] B. B. Zad, J.-F. Toubeau, O. Acclassato, O. Durieux, and F. Vallée, ''An innovative centralized voltage control method for MV distribution systems based on deep reinforcement learning: Application on a real test case in Benin,'' in Proc. CIRED 26th Int. Conf. Exhib. Electr. Distrib. Edison, NJ, USA: IET, 2021, pp. 1577–1581.
[6] Y. Liu, H. Xu, D. Liu, and L. Wang, ''A digital twin-based sim-to-real transfer for deep reinforcement learning-enabled industrial robot grasping,'' Robot. Comput.-Integr. Manuf., vol. 78, Aug. 2022, Art. no. 102365.
[7] J. Kober, J. A. Bagnell, and J. Peters, ''Reinforcement learning in robotics: A survey,'' Int. J. Robot. Res., vol. 32, no. 11, pp. 1238–1274, Sep. 2013.
[8] T. T. Nguyen and V. J. Reddi, ''Deep reinforcement learning for cyber security,'' IEEE Trans. Neural Netw. Learn. Syst., pp. 1–17, 2021.
[9] OpenAI. (2020). Robogym. [Online]. Available: https://github.com/openai/robogym
[10] T. Hospedales, A. Antoniou, P. Micaelli, and A. Storkey, ''Meta-learning in neural networks: A survey,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 9, pp. 5149–5169, Sep. 2022.
[11] S. Thrun and L. Pratt, Learning to Learn: Introduction and Overview. Boston, MA, USA: Springer, 1998, pp. 3–17, doi: 10.1007/978-1-4615-5529-2_1.
[12] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, ''The arcade learning environment: An evaluation platform for general agents,'' J. Artif. Intell. Res., vol. 47, pp. 253–279, Jun. 2013.
[13] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, ''Mastering the game of go without human knowledge,'' Nature, vol. 550, no. 7676, pp. 354–359, Oct. 2017.
[14] B. Ning and L. Xiao, ''Defense against advanced persistent threats in smart grids: A reinforcement learning approach,'' in Proc. 40th Chin. Control Conf. (CCC), Jul. 2021, pp. 8598–8603.
[15] S. Tharewal, M. W. Ashfaque, S. S. Banu, P. Uma, S. M. Hassen, and M. Shabaz, ''Intrusion detection system for industrial Internet of Things based on deep reinforcement learning,'' Wireless Commun. Mobile Comput., vol. 2022, pp. 1–8, Mar. 2022.
[16] O. Yousuf and R. N. Mir, ''DDoS attack detection in Internet of Things using recurrent neural network,'' Comput. Electr. Eng., vol. 101, Jul. 2022, Art. no. 108034. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S004579062200297X
[17] M. Landen, K. Chung, M. Ike, S. Mackay, J.-P. Watson, and W. Lee, ''DRAGON: Deep reinforcement learning for autonomous grid operation and attack detection,'' in Proc. 38th Annu. Comput. Secur. Appl. Conf., 2022, pp. 13–27.
[18] F. Wei, Z. Wan, and H. He, ''Cyber-attack recovery strategy for smart grid based on deep reinforcement learning,'' IEEE Trans. Smart Grid, vol. 11, no. 3, pp. 2476–2486, May 2020.
[19] D. Silver, ''Mastering the game of Go with deep neural networks and tree search,'' Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
[20] M. Alauthman, N. Aslam, M. Al-kasassbeh, S. Khan, A. Al-Qerem, and K.-K. Raymond Choo, ''An efficient reinforcement learning-based botnet detection approach,'' J. Netw. Comput. Appl., vol. 150, Jan. 2020, Art. no. 102479.
[21] Z. Wang, H. He, Z. Wan, and Y. Sun, ‘‘Coordinated topology attacks in smart grid using deep reinforcement learning,’’ IEEE Trans. Ind. Informat., vol. 17, no. 2, pp. 1407–1415, Feb. 2021.
[22] M. H. Ling, K.-L.-A. Yau, J. Qadir, G. S. Poh, and Q. Ni, ‘‘Application of reinforcement learning for security enhancement in cognitive radio networks,’’ Appl. Soft Comput., vol. 37, pp. 809–829, Dec. 2015.
[23] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, ‘‘Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,’’ in Proc. Int. Conf. Mach. Learn., 2018, pp. 1861–1870.
[24] A. Kathirgamanathan, E. Mangina, and D. P. Finn, ‘‘Development of a soft actor critic deep reinforcement learning approach for harnessing energy flexibility in a large office building,’’ Energy AI, vol. 5, Sep. 2021, Art. no. 100101. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2666546821000537
[25] M. Teixeira, T. Salman, M. Zolanvari, R. Jain, N. Meskin, and M. Samaka, ‘‘SCADA system testbed for cybersecurity research using machine learning approach,’’ Future Internet, vol. 10, no. 8, p. 76, Aug. 2018.
[26] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, ‘‘A detailed analysis of the KDD CUP 99 data set,’’ in Proc. IEEE Symp. Comput. Intell. Secur. Defense Appl., Jul. 2009, pp. 1–6.
[27] J. Li, T. Yu, H. Zhu, F. Li, D. Lin, and Z. Li, ‘‘Multi-agent deep reinforcement learning for sectional AGC dispatch,’’ IEEE Access, vol. 8, pp. 158067–158081, 2020.
[28] M. Botvinick, J. X. Wang, W. Dabney, K. J. Miller, and Z. Kurth-Nelson, ‘‘Deep reinforcement learning and its neuroscientific implications,’’ Neuron, vol. 107, no. 4, pp. 603–616, Aug. 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0896627320304682
[29] M. Moradi, Y. Weng, and Y.-C. Lai, ‘‘Defending smart electrical power grids against cyberattacks with deep Q-learning,’’ PRX Energy, vol. 1, no. 3, Nov. 2022, Art. no. 033005.
[30] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, ‘‘A brief survey of deep reinforcement learning,’’ 2017, arXiv:1708.05866.
[31] D.-J. Shin and J.-J. Kim, ‘‘Deep reinforcement learning-based network routing technology for data recovery in exa-scale cloud distributed clustering systems,’’ Appl. Sci., vol. 11, no. 18, p. 8727, Sep. 2021. [Online]. Available: https://www.mdpi.com/2076-3417/11/18/8727
[32] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. Hoboken, NJ, USA: Wiley, 2014.
[33] E. Gibney, ‘‘DeepMind algorithm beats people at classic video games,’’ Nature, vol. 518, no. 7540, pp. 465–466, 2015.
[34] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, ‘‘Deep reinforcement learning for dynamic multichannel access in wireless networks,’’ IEEE Trans. Cognit. Commun. Netw., vol. 4, no. 2, pp. 257–265, Jun. 2018.
[35] Z. Gao, Y. Gao, Y. Hu, Z. Jiang, and J. Su, ‘‘Application of deep Q-network in portfolio management,’’ in Proc. 5th IEEE Int. Conf. Big Data Analytics (ICBDA), May 2020, pp. 268–275.
[36] B. Berry, ‘‘Do you know these key SCADA concepts? SCADA tutorial: A quick, easy, comprehensive guide (white paper),’’ DPS Telecom, Fresno, CA, USA, Tech. Rep., 2011.
[37] J. Clifton and E. Laber, ‘‘Q-learning: Theory and applications,’’ Annu. Rev. Statist. Appl., vol. 7, pp. 279–301, Mar. 2020.
[38] S. R. Chhetri, S. Faezi, N. Rashid, and M. A. Al Faruque, ‘‘Manufacturing supply chain and product lifecycle security in the era of Industry 4.0,’’ J. Hardw. Syst. Secur., vol. 2, no. 1, pp. 51–68, Mar. 2018.
[39] R. Antrobus, B. Green, S. Frey, and A. Rashid, ‘‘The forgotten I in IIoT: A vulnerability scanner for industrial Internet of Things,’’ Living Internet Things (IoT), London, U.K., Tech. Rep., May 2019.
[40] X. Liu, C. Qian, W. G. Hatcher, H. Xu, W. Liao, and W. Yu, ‘‘Secure Internet of Things (IoT)-based smart-world critical infrastructures: Survey, case study and research opportunities,’’ IEEE Access, vol. 7, pp. 79523–79544, 2019.
[41] G. Stoneburner, A. Goguen, and A. Feringa, ‘‘Risk management guide for information technology systems,’’ NIST Special Publication, vol. 800, no. 30, pp. 30–800, 2002.
[42] R. Huang, Y. Li, and X. Wang, ‘‘Attention-aware deep reinforcement learning for detecting false data injection attacks in smart grids,’’ Int. J. Electr. Power Energy Syst., vol. 147, May 2023, Art. no. 108815. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0142061522008110
[43] D. Wu, A. Ren, W. Zhang, F. Fan, P. Liu, X. Fu, and J. Terpenny, ‘‘Cybersecurity for digital manufacturing,’’ J. Manuf. Syst., vol. 48, pp. 3–12, Jul. 2018. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0278612518300396
[44] W. Wang, J. Guo, Z. Wang, H. Wang, J. Cheng, C. Wang, M. Yuan, J. Kurths, X. Luo, and Y. Gao, ‘‘Abnormal flow detection in industrial control network based on deep reinforcement learning,’’ Appl. Math. Comput., vol. 409, Nov. 2021, Art. no. 126379.
[45] P. Čisar and S. M. Čisar, ‘‘General vulnerability aspects of Internet of Things,’’ in Proc. 16th IEEE Int. Symp. Comput. Intell. Informat. (CINTI), Nov. 2015, pp. 117–121.
[46] D. Torre, F. Mesadieu, and A. Chennamaneni, ‘‘Deep learning techniques to detect cybersecurity attacks: A systematic mapping study,’’ Empirical Softw. Eng., vol. 28, no. 3, p. 44, May 2023.
[47] R. Antrobus, B. Green, S. Frey, and A. Rashid, ‘‘The forgotten I in IIoT: A vulnerability scanner for industrial Internet of Things,’’ in Proc. Living Internet Things (IoT), 2019, pp. 1–8.
[48] B. Jang, M. Kim, G. Harerimana, and J. W. Kim, ‘‘Q-learning algorithms: A comprehensive classification and applications,’’ IEEE Access, vol. 7, pp. 133653–133667, 2019.
[49] J. Suh and T. Tanaka, ‘‘SARSA(0) reinforcement learning over fully homomorphic encryption,’’ in Proc. SICE Int. Symp. Control Syst. (SICE ISCS), Mar. 2021, pp. 1–7.
[50] I. C. Dolcetta and H. Ishii, ‘‘Approximate solutions of the Bellman equation of deterministic control theory,’’ Appl. Math. Optim., vol. 11, no. 1, pp. 161–181, Feb. 1984.
[51] R. W. Beard, G. N. Saridis, and J. T. Wen, ‘‘Galerkin approximations of the generalized Hamilton-Jacobi-Bellman equation,’’ Automatica, vol. 33, no. 12, pp. 2159–2177, Dec. 1997. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0005109897001283
[52] J. Peters and S. Schaal, ‘‘Reinforcement learning of motor skills with policy gradients,’’ Neural Netw., vol. 21, no. 4, pp. 682–697, May 2008. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0893608008000701
[53] A. Akagic and I. Džafić, ‘‘Deep reinforcement learning in smart grid: Progress and prospects,’’ in Proc. XXVIII Int. Conf. Inf., Commun. Autom. Technol. (ICAT), Jun. 2022, pp. 1–6.
[54] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[55] P. E. Protter, Stochastic Differential Equations. Berlin, Germany: Springer, 2005.
[56] S. Mohamed, J. Dong, A. R. Junejo, and D. C. Zuo, ‘‘Model-based: End-to-end molecular communication system through deep reinforcement learning auto encoder,’’ IEEE Access, vol. 7, pp. 70279–70286, 2019.
[57] B. Øksendal, Stochastic Differential Equations. Berlin, Germany: Springer, 2003, pp. 65–84, doi: 10.1007/978-3-642-14394-6_5.
[58] G. Fragkos, J. Johnson, and E. E. Tsiropoulou, ‘‘Dynamic role-based access control policy for smart grid applications: An offline deep reinforcement learning approach,’’ IEEE Trans. Hum.-Mach. Syst., vol. 52, no. 4, pp. 761–773, Aug. 2022.
[59] J. Khoury and M. Nassar, ‘‘A hybrid game theory and reinforcement learning approach for cyber-physical systems security,’’ in Proc. IEEE/IFIP Netw. Oper. Manage. Symp. (NOMS), Apr. 2020, pp. 1–9.
[60] Z. Wang and X. Chu, ‘‘Operating condition identification of complete wind turbine using DBN and improved DDPG-SOM,’’ in Proc. IEEE 11th Data Driven Control Learn. Syst. Conf. (DDCLS), Aug. 2022, pp. 94–101.
[61] D. Zhao, D. Liu, F. L. Lewis, J. C. Principe, and S. Squartini, ‘‘Special issue on deep reinforcement learning and adaptive dynamic programming,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 6, pp. 2038–2041, Jun. 2018.
[62] J. Xu, H. Wang, J. Rao, and J. Wang, ‘‘Zone scheduling optimization of pumps in water distribution networks with deep reinforcement learning and knowledge-assisted learning,’’ Soft Comput., vol. 25, no. 23, pp. 14757–14767, Dec. 2021.
[63] J. Stranahan, T. Soni, and V. Heydari, ‘‘Supervisory control and data acquisition testbed vulnerabilities and attacks,’’ in Proc. SoutheastCon, Apr. 2019, pp. 1–5.
[64] D. Hamouda, M. A. Ferrag, N. Benhamida, and H. Seridi, ‘‘Intrusion detection systems for industrial Internet of Things: A survey,’’ in Proc. Int. Conf. Theor. Applicative Aspects Comput. Sci. (ICTAACS), Dec. 2021, pp. 1–8.
[65] S. Wang, R. Diao, C. Xu, D. Shi, and Z. Wang, ‘‘On multi-event co-calibration of dynamic model parameters using soft actor-critic,’’ IEEE Trans. Power Syst., vol. 36, no. 1, pp. 521–524, Jan. 2021.
[66] X. Zhao, S. Ding, Y. An, and W. Jia, ‘‘Applications of asynchronous deep reinforcement learning based on dynamic updating weights,’’ Appl. Intell., vol. 49, pp. 581–591, 2019.
[67] S. Baek, J. Kim, H. Yu, G. Yang, I. Sohn, Y. Cho, and C. Park, ‘‘Intelligent feature selection for ECG-based personal authentication using deep reinforcement learning,’’ Sensors, vol. 23, no. 3, p. 1230, Jan. 2023. [Online]. Available: https://www.mdpi.com/1424-8220/23/3/1230
[68] M. Zheng, I. Zada, S. Shahzad, J. Iqbal, M. Shafiq, M. Zeeshan, and A. Ali, ‘‘Key performance indicators for the integration of the service-oriented architecture and scrum process model for IoT,’’ Scientific Program., vol. 2021, pp. 1–11, Feb. 2021.
[69] M. Zolanvari, M. A. Teixeira, L. Gupta, K. M. Khan, and R. Jain, ‘‘Machine learning-based network vulnerability analysis of industrial Internet of Things,’’ IEEE Internet Things J., vol. 6, no. 4, pp. 6822–6834, Aug. 2019.
[70] S. Cheung, B. Dutertre, M. Fong, U. Lindqvist, K. Skinner, and A. Valdes, ‘‘Using model-based intrusion detection for SCADA networks,’’ in Proc. SCADA Security Sci. Symp., vol. 46, 2007, pp. 1–12.
[71] W. Alsabbagh, S. Amogbonjaye, D. Urrego, and P. Langendörfer, ‘‘A stealthy false command injection attack on Modbus based SCADA systems,’’ in Proc. IEEE 20th Consum. Commun. Netw. Conf. (CCNC), Jan. 2023, pp. 1–9.
[72] I. Ortega-Fernandez and F. Liberati, ‘‘A review of denial of service attack and mitigation in the smart grid using reinforcement learning,’’ Energies, vol. 16, no. 2, p. 635, Jan. 2023. [Online]. Available: https://www.mdpi.com/1996-1073/16/2/635
[73] G. C. Cawley and N. L. Talbot, ‘‘On over-fitting in model selection and subsequent selection bias in performance evaluation,’’ J. Mach. Learn. Res., vol. 11, pp. 2079–2107, Jul. 2010.

DAMIANO TORRE (Member, IEEE) received the B.Sc. degree from the University of Bari, Italy, in 2009, the M.Sc. degree from the University of Castilla-La Mancha, Spain, in 2011, and the Ph.D. degree from Carleton University, Canada, in 2018. From 2020 to 2023, he was an Associate Research Scientist with the Centre for Cybersecurity Innovation, Texas A&M University Central Texas, USA. Prior to coming to the USA, he was a Research Associate with the University of Luxembourg, from 2018 to 2021. His research interests include computer science, and more specifically software engineering, cybersecurity, artificial intelligence, model-driven engineering, and empirical software engineering. He regularly serves on the organizing/program committees of ISSRE and QRS; and satellite events of ESEM, ICSE, and ASE.