Leveraging Deep Reinforcement Learning Technique for Intrusion Detection in SCADA Infrastructure
ABSTRACT The prevalence of cyber-attacks perpetrated over the last two decades, including coordinated attempts to breach targeted organizations, has drastically and systematically exposed some of the more critical vulnerabilities in our cyber ecosystem, particularly in Supervisory Control and Data Acquisition (SCADA) systems, where targeted attacks aim to bypass signature-based protocols in an attempt to gain control over operational processes. In the past, researchers utilized deep learning and reinforcement learning algorithms to mitigate threats against industrial control systems (ICS). However, as technology evolves, these techniques become less effective at monitoring and enhancing the cybersecurity defenses of those systems against unwanted attacks. To address these concerns, we propose a deep reinforcement learning (DRL) framework for anomaly detection in SCADA networks. Our model utilizes a ''Q-network'', which allows it to achieve state-of-the-art performance in pattern recognition on complex tasks. We validated our solution on two publicly available datasets, WUSTL-IIoT-2018 and WUSTL-IIoT-2021, each comprising twenty-five networking features representing benign and attack traffic. The results obtained show that our model achieved an accuracy of 99.36% in attack detection, highlighting DRL's potential to enhance the security of critical infrastructure and laying the foundation for future research in this domain.
I. INTRODUCTION
… a deep neural network (DNN) algorithmic policy that interconnects the model's neurons gives it significant computational aptitude to analyze intricate network data and make decisions [5]. In terms of multi-tasking capabilities, the DRL technique provides researchers with an array of advanced AI frameworks ready to be deployed in various domains of industrial importance, such as smart grids for voltage regulation [2], adaptive assembly lines [6], and robotics, where specific functions are performed to optimize individual tasks [7]. In cybersecurity [8], for example, DRL algorithms are applied to filter network traffic and prevent intrusions [9], thus reducing the cost and manpower needed to accomplish complex and time-consuming cyber-monitoring tasks. Conceptually, a DRL technique is trained in a pre-defined environment where agents learn from low-dimensional feature inputs derived from metadata and perform meta-learning¹ through trial and error [10], [11].

As a subset of machine learning, DRL's computational proficiency and effectiveness in the video gaming industry display exceptional results in both offensive attack strategies and defensive tactics [12]. DRL's neural network application uses a function estimator to observe state or label inputs as it operates in a set environment, and uses a greedy algorithmic policy to process an unlimited history of network traffic for anomalous and malicious attacks [13]. These capabilities make it a valuable intrusion detection algorithm for protecting and defending against sophisticated cybersecurity attacks, including advanced persistent threats (APT) [14].

In the digital transformation era, various sectors such as business infrastructure, banking, IoT, Industrial IoT, and transportation, including SCADA infrastructure, are under constant threat of cybersecurity attacks and the risk of data breaches [14]. In recent years, machine learning techniques have shown great promise in enhancing the security posture of critical infrastructure by aiding in penetration testing, pattern detection, and real-time attack mapping. However, it should be noted that the complexity of code and the challenges posed by large datasets remain an obstacle for certain aspects of machine learning algorithms [15], [16], particularly in areas involving continuous complex calculations and large-scale decision-making.

In this paper, we conducted a qualitative assessment of cyber-attacks on critical infrastructure and their impact on business operations, drawing insights from studies, e.g., [17] and [18]. We performed a comprehensive review of prior research on the application of DRL methodology in combating SCADA network anomalies; our analysis revealed that while there have been numerous research efforts utilizing DRL approaches to enhance the security of SCADA network infrastructure, the focus has been primarily directed towards hardware and the monitoring aspects of critical infrastructure (CI), including the development of intrusion detection methods and autonomous controls using diverse datasets [19], [20], [21]. In fact, we did not find any relevant papers on DRL that used similar data at the time of our research. Considering these findings, we explored the effectiveness and robustness of the DRL framework [22] in the cybersecurity domain. We developed and implemented a model algorithm, testing it on two SCADA datasets: WUSTL-IIoT-2018 and WUSTL-IIoT-2021. Our research results demonstrated the successful application of DRL methodology in detecting cyber threats within SCADA's critical network [14].

¹ Meta-learning in ML refers to learning algorithms that continuously learn from other learning algorithms.

A. RESEARCH OBJECTIVES
The main goal of this work is to explore the capabilities of Deep Reinforcement Learning (DRL) algorithms to enhance the accuracy of anomaly detection.

In particular, this research explores the potential of implementing a DRL technique to protect and defend critical infrastructure from continuous cybersecurity attacks, as we investigate whether the algorithm can effectively enhance the security and resilience of SCADA systems against ongoing threats from bad actors.

Similarly, we explore possible challenges associated with applying DRL in smart grid systems and discuss specific difficulties or limitations in implementing the technique in the context of critical systems' network anomaly detection, which encompasses protection from cybersecurity attacks.

We discuss optimization and deployment strategies for utilizing the technique to effectively secure SCADA systems and provide insights into specific approaches, methodologies, and necessary considerations for maximizing the efficacy of DRL to improve network security.

Lastly, we identify the potential advantages and disadvantages of deploying DRL as an Intrusion Detection System and Intrusion Prevention System in Industrial Control Systems [3], [23], [24]. This approach seeks to analyze how DRL can enhance the security and resilience of ICS to effectively detect and prevent cybersecurity threats, improve incident response, and mitigate risks to critical infrastructure.

B. CONTRIBUTIONS
Our proposed approach presents a multifold contribution to critical infrastructures' existing security measures.

A) It incorporates novel features such as an actor-critic algorithm [23], designed for optimal policy updates [24], to dynamically assist the model (actor) in maximizing its decision process. The architectural designs of the actor and critic networks are mirrored with the same dynamic parameters to facilitate a reciprocal training environment. In this manner, our actor is trained to generate actions that maximize the expected rewards, while the critic evaluates the quality of those actions in terms of the expected value of future rewards. This iterative process continues until the model converges.
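As a concrete illustration of the mirrored actor and critic networks described above, the following minimal sketch (our own simplification using the Keras API that the paper lists among its tools, with hypothetical layer sizes rather than the authors' exact implementation) builds two identically configured networks and copies the actor's weights into the critic so that both start from the same dynamic parameters:

# Illustrative sketch only: hypothetical layer sizes, not the authors' exact architecture.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

N_FEATURES = 25   # networking features per traffic record (as in the WUSTL-IIoT datasets)
N_ACTIONS = 2     # e.g., classify a record as benign (0) or attack (1)

def build_network() -> tf.keras.Model:
    """Small fully connected network reused for both the actor and the critic."""
    return tf.keras.Sequential([
        layers.Dense(64, activation="relu", input_shape=(N_FEATURES,)),
        layers.Dense(32, activation="relu"),
        layers.Dense(N_ACTIONS),          # one output per possible action
    ])

actor = build_network()
critic = build_network()
critic.set_weights(actor.get_weights())   # mirror the actor's parameters in the critic

# The actor proposes an action for a state; the critic scores that state-action pair.
state = np.random.rand(1, N_FEATURES).astype("float32")
action = int(np.argmax(actor(state).numpy()))
action_value = float(critic(state).numpy()[0, action])
print(action, action_value)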
To address issues stemming from correlated data and non-stationary distributions, we introduce a ''ReplayMemory'' function [3] designed to store experiences and sample transitions (i.e., state, action, reward, next-state tuples) that the agent encounters during its exploration of the environment. This memory buffer enables the agent to learn from past experiences by randomly sampling previous transitions from the memory during training. By incorporating this technique, the agent can leverage a wider range of experiences from its entire history rather than relying solely on its most recent encounters.

B) We present a detailed assessment and conduct a qualitative and quantitative evaluation using two real-world datasets to train our model. We evaluate our algorithm on the WUSTL-IIoT-2018 dataset [25] and the WUSTL-IIoT-2021 dataset [26], consisting of network traffic protocols. To validate our solution, we load the saved model with test data containing multiple binary classification inputs. The agent applies the methods described in Algorithms 6 and 7 to read the labeled attacks and predict the targeted attack labels. The results obtained demonstrate that our DRL model can effectively classify threats in real time and provide detection and response [27].

C) Lastly, we offer valuable insights for future research directions regarding the use of DRL to detect cybersecurity attacks in the SCADA domain. Despite its successes, some of the challenges faced by DRL applications are deep neural network utilization, greedy policy implementation, and the need for sample or data efficiency. In addition to these challenges, there are several points of interest for future research on DRL implementation in the cybersecurity domain. These include: investigating the capability of the DRL framework to handle larger and more complex SCADA network datasets; evaluating the robustness and generalizability of the technique by testing it on diverse SCADA systems and network environments; and exploring the possibility of combining the DRL algorithm with other machine learning techniques, such as anomaly detection and ensemble methods, to further improve the accuracy and effectiveness of cyber threat detection and response. These research directions aim to address challenges in adopting DRL for cybersecurity and advance the development of active solutions in this field.

II. BACKGROUND
In this section, we provide a brief introduction to the DRL framework. Following that, we summarize the necessary background information related to SCADA infrastructure.

A. DEEP REINFORCEMENT LEARNING
Deep reinforcement learning (DRL), a subset of machine learning, is a branch of artificial intelligence that combines deep learning algorithms and reinforcement learning techniques, allowing an agent to interact with its environment and learn from trial and error. Using the Python programming language, we defined the following: an agent is represented through functions, software, or a block of code. The environment, on the other hand, is a function that simulates both virtual and physical training scenarios, containing parameters for an agent to exploit. In this conceptual framework (Fig. 1), the agent explores a pre-defined environment and exploits its parameters [28] to maximize reward signals.

In training, at each time step, the environment sends a scalar reward signal to the reinforcement learning agent for each action taken. It is important to highlight that, due to the iterative algorithmic nature of the DRL framework, the sole objective of the agent is to learn from its implemented stochastic policy to maximize cumulative reward over time [29].
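The following minimal Python sketch (our own illustration rather than the authors' code; the environment and the ReplayMemory class are hypothetical stand-ins) shows this interaction loop, with the agent receiving a scalar reward at each time step and storing the resulting transitions in the kind of ReplayMemory buffer introduced in contribution A):

# Illustrative sketch only; the environment, policy, and reward logic are placeholders.
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

class ReplayMemory:
    """Fixed-size buffer that stores transitions and returns random mini-batches."""
    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, *transition):
        self.buffer.append(Transition(*transition))

    def sample(self, batch_size: int):
        return random.sample(list(self.buffer), batch_size)

class ToyTrafficEnv:
    """Placeholder environment: a state is a feature vector, reward is +1 for a correct label."""
    def reset(self):
        self.label = random.randint(0, 1)               # 0 = benign, 1 = attack
        return [random.random() for _ in range(25)]     # 25 networking features

    def step(self, action):
        reward = 1.0 if action == self.label else -1.0  # scalar reward signal
        return self.reset(), reward, True               # one decision per traffic record

env, memory = ToyTrafficEnv(), ReplayMemory()
state, total_reward = env.reset(), 0.0
for _ in range(100):                                    # agent-environment loop
    action = random.randint(0, 1)                       # random policy as a stand-in for the agent
    next_state, reward, done = env.step(action)
    memory.push(state, action, reward, next_state)      # store the experienced transition
    total_reward += reward                              # cumulative reward the agent tries to maximize
    state = next_state

batch = memory.sample(32)                               # random mini-batch for a later learning update
print(total_reward, len(batch))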
FIGURE 1. DQN Agent in SCADA Environment Using MDP.
… the deep neural network, which enables the agent to learn and interact with the environment. During training and upon initialization, the agent uses data inputs as observations and its outputs as actions while attempting to maximize reward signals, where Q(s, a) represents the sum of all rewards and max Q(s′, a′) the maximum reward an agent is able to achieve from its current state [31]. The third component is the ''training algorithm'', which updates the deep neural network's weights based on the reward function and the actions taken by the agent. In our proposed model, the state-action value function is called a Q function and is defined as Q(s, a). Referred to as Q-learning, this Q function is used to obtain optimal rewards for each action as it updates the policy using the Bellman equation [24]:

Q(s, a) ← Q(s, a) + α (r + γ max Q(s′, a′) − Q(s, a))

1) DEEP Q-NETWORK ALGORITHM
Deep Q-Network (DQN) is a deep reinforcement learning algorithm that uses deep neural networks, or a Q-network, to approximate optimal action-value functions in a Markov Decision Process (MDP) [32]. The technique is built upon the traditional Q-learning algorithm, with the network serving as a function approximator. Unlike tabular reinforcement learning, which maintains a Q-table to store Q-values for each state-action pair, the Q-network takes the raw observation state as input and directly outputs the estimated Q-values for all possible actions. Introduced by Google DeepMind in 2013 [33], the DQN technique combines deep neural networks and Q-learning to learn optimal policies in complex and high-dimensional environments [34].

B. SCADA INFRASTRUCTURE
… and communications. Known as the Industrial Internet of Things, or IIoT 4.0 [38], these peripherals enable remote automation and high-level supervision of critical infrastructures like power grids and water distribution plants. The system is an integral component of smart grids and facilitates real-time data analysis, command, and control processes for assets. As depicted in Figure 2, a standard SCADA structural design can be outfitted with ''Remote Terminal Units, Programmable Logic Controllers, etc.'' These microprocessors communicate commands to field devices such as pump units and valves, and the processes are then displayed in a ''human-machine interface'' (HMI) for visual confirmation [39].

1) SCADA OPTIMIZATION PROCESS
• Control processes
• Monitor and gather information in real time
• Interact with various devices
• Record events in a back-end database

Based on Figure 2, regardless of topological configuration and structural design, a SCADA system uses data from its connected nodes or (IIoT) devices to perform its operations. These nodes are a collection of sensors and other monitoring devices attached to the network via Ethernet cables or wireless communication channels [40]. As depicted in Figure 3, remote control access requires internet communication and uses the following modes for data exchange: a ''Local Area Network'' (LAN) or a ''Wide Area Network'' (WAN). Protocols such as Modbus, DNP3, OPC, and others define the format and rules for exchanging data and commands between nodes within a system [40].
… and recover from network transmission attacks. The proposed methodology presented by Wang et al. [21] is a DRL-based Q-learning attack strategy, simulated on the IEEE 30-bus system for SCADA load management uncertainty, a vulnerability that a hacker may use to trip critical transmission lines. Their model detects and prevents smart-grid coordinated topological attacks by identifying false electronic communication. The actor-critic approach by Moradi et al. [29] is a high-level algorithm structured to strategically simulate attacks in a smart environment. The technique can simultaneously learn policies and detect network anomalies in smart electrical communication systems. The evaluation shows positive results against the Wood and Wollenberg 6-bus and the IEEE 30-bus systems, respectively. The abnormal flow detection in industrial control networks presented by Wang et al. [44] is a DRL model which the authors deploy to monitor ''abnormal flow'', a form of intrusion in ICS systems. The model helps prevent bad actors from taking over command of industrial control systems or natural gas pipeline operations. To the best of our knowledge, our study is the first to implement a DQN in the SCADA ''Industrial Internet of Things'' for intrusion detection. In contrast, the gap between our proposed algorithm and the related work compared in Table 5 lies in the sophistication of our model's architecture, the complexity of our datasets, and the performance achieved, highlighting our algorithm's advancements in handling complex tasks and preventing continuous cyberterrorism attacks against SCADA communication infrastructure.

IV. METHODOLOGY
As follows, we present the research methodology employed in constructing our DRL model, which is based on our investigative discoveries [38].

Our research aims to explore the potential of DRL techniques in the cybersecurity domain, specifically focusing on optimizing and deploying these techniques to enhance the security of industrial control systems. By leveraging DRL, we aim to strengthen the resilience of SCADA systems against various cyber threats. Here, we introduce the four-step methodology:
1. Building the DRL Model (Section IV-A). We developed a DQN-based approach that effectively addresses the challenges in protecting and defending critical infrastructure against cybersecurity attacks using complex IIoT datasets.
2. Applying DRL to SCADA (Section IV-B). We describe how the DRL model can be applied to the SCADA domain.
3. Optimizing DRL for SCADA (Section IV-C). We detail the steps to take in order to optimize our DRL model for the SCADA domain.
4. Monitoring SCADA with DRL (Section IV-D). We elaborate on the monitoring techniques that enable our model to identify potential attacks on SCADA systems.
Following this, we offer a detailed breakdown of the four steps.

A. BUILDING THE DRL MODEL
In this paper, we utilize datasets within the IoT and IIoT domain.

As follows, we outline some of the primary protocol and software deficiencies affecting IIoT with respect to SCADA, highlighting how these discrepancies continue to jeopardize the security of smart grid systems. In reference to the framework of IoT vulnerabilities outlined in [40] and [47], we categorize security threats in IIoT devices to include the following vulnerabilities:
• Outdated software; inadequate access control, which may allow unauthorized access.
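Separately from these protocol- and software-level weaknesses, building the model starts from the traffic datasets themselves. The following minimal sketch (our own illustration; the file name and column names are hypothetical placeholders rather than the actual WUSTL-IIoT schema) shows how such IIoT flow records could be loaded and label-encoded, in the spirit of the label dictionary used later in Algorithm 6:

# Illustrative sketch only: file and column names are placeholders, not the real dataset schema.
import pandas as pd

# Dictionary-style label mapping, mirroring the idea of Algorithm 6.
ATTACK_MAP = {"normal": 0, "Probe": 1, "R2L": 2, "DoS": 3}

def load_traffic(csv_path: str) -> tuple[pd.DataFrame, pd.Series]:
    """Load flow records, drop duplicates, and encode the label column as integers."""
    df = pd.read_csv(csv_path)
    df = df.drop_duplicates().dropna()                 # basic cleaning before training
    labels = df["label"].map(ATTACK_MAP).fillna(1)     # unmapped labels treated as generic attacks
    features = df.drop(columns=["label"])              # remaining networking features
    return features, labels.astype(int)

if __name__ == "__main__":
    X, y = load_traffic("wustl_iiot_sample.csv")       # hypothetical file name
    print(X.shape, y.value_counts())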
3) Actor-Critic: In our model, we implemented an actor-critic algorithm (Algorithm 2) [23], combined with our policy gradient, which updates the training agent's (actor's) expected reward for a given state-action pair:

δt = rt + γ · V(st+1) − V(st)
θ ← θ + αθ · δt · ∇θ log πθ(at | st)
V(st) ← V(st) + αV · δt

where:
• δt is the temporal difference (TD) error, which measures the difference between the predicted value of the current state-action pair and the actual reward received.
• V(st) is the estimated value of state st.
• αθ and αV are the learning rates for the policy and value function, respectively.

Algorithm 2 Actor-critic algorithm implementation
Input: Initialize policy, two soft Q and two target soft Q-DNNs; initialize experience replay buffer with attack samples;
Result: Optimized actor and critic DNNs
for each episode do
    for each step do
        sample actions from the policy; sample a transition from the environment; store the transition in the replay buffer;
    end
    for each gradient update step do
        update the soft Q-network weights; update the policy network weights; adjust the entropy temperature; update the target network weights;
    end
end

2) ON-POLICY VS OFF-POLICY
Actor-critic algorithms can be on-policy or off-policy. These methods are used in DRL to determine whether the data collected during training is used for updating the policy or remains neutral.

The main objective of this paper is to develop a DRL framework that improves network security with a robust algorithm that optimizes attack detection. To that end, we tested our actor-critic algorithm using the two policy methods; however, because on-policy (SARSA) uses the same policy it seeks to improve for experience collection [49], that approach could not fully optimize our actor network for our intended solution.

Based on this insight, we conducted our experiment using the off-policy approach, in which the method instead follows a tailored policy, such as ''ϵ-greedy'', for experience replay. In doing so, our agent learned from a parameterized policy [53] that differed from the one it was supposed to follow (on-policy), thus allowing it to explore the state space using a stochastic algorithmic policy.⁵

⁵ Stochastic: a probability distribution that allows an agent to explore the state space without always taking the same action.

In adopting this algorithm, the agent updates the action-values Q(S, A), moving the sum of rewards Q(s, a) toward max Q(s′, a′), the optimal reward at st+1, based on the maximum action-value of the next state, even though the exact action-value for the next state might not be fully known due to the random actions [35], [6]. Nonetheless, this tabular algorithm ensured the computation and convergence of the action-values, Qt+1(St, At) = Qt(St, At) + αt [Rt+1 + γ maxa Qt(St+1, a) − Qt(St, At)], for each state-action pair, leading to the exploration of the optimal action-values as states are explored. In Algorithm 3, we present the pseudo-code implementation of our off-policy model as described by Sutton and Barto [54].

Algorithm 3 Q-learning (off-policy approach) for estimating π ≈ π∗
Algorithm parameters: step size α ∈ (0, 1], small ϵ > 0;
Initialize Q(s, a), for all s ∈ S+, a ∈ A(s), arbitrarily, except that Q(terminal, ·) = 0;
for each episode do
    Initialize S;
    for each step of episode do
        Choose A from S using the policy derived from Q (ϵ-greedy);
        Take action A, observe R, S′;
        Q(S, A) ← Q(S, A) + α [R + γ maxa Q(S′, a) − Q(S, A)];
        S ← S′;
    end
end

In addition to the off-policy approach, we improved our method in its exploratory state of the environment by adding the decay function σ(N) = σ0 e^(kN). The algorithm represents the relationship between the model's input and output as the rate of decay. By defining the values of σ0 and k, we parameterized the rate at which the decay value changes during iterations. The method helps estimate future values and rewards based on observed trends [55].

3) SOLVING EXPLORATORY NOISE
In our algorithm, exploration noise arose from deliberate exploration strategies, environmental stochasticity, random initialization, and the use of experience replay. These features enable our agent to discover optimal policies in complex environments [56] and also encourage the agent to explore new actions.

To enhance our approach during the exploratory phase of the environment, we utilize a decay function. Gradually, the method reduces the level of exploration, thus making the agent more deterministic as it accumulates experiences in the environment. By incorporating a decay function within the algorithm as a policy, we helped guide the agent's actions towards increased determinism based on its experience level, using this equation:

π(a | s, N) = (1 − σ(N)) πtarget(a | s) + σ(N) / |A|
where πtarget(a | s) is the target policy, |A| is the number of actions in the action space, and σ(N) = σ0 e^(kN) is the decay function for the exploration noise [55].

Alternatively, when incorporating the decay function into the exploration noise as a policy, our agent learned how to effectively balance both the exploration and exploitation time, thereby allowing it to achieve better performance during each episode over the course of the training [57].

In testing our model's robustness, we train and evaluate our DRL algorithm with network traffic collected from a SCADA test bed. We inspect the data for typos and remove all duplicates and irrelevant content, which streamlines the data preparation. Next, we label the attacks to be iterated, so that when introduced to the environment, the model can learn from the patterns [35]. We set the decay parameter and epsilon values for each step, initialize both our loss function and the ReplayMemory with a batch size of 32, and then train the model for 250 episodes (Fig. 10). The successful results obtained from our experiment show that DRL has the potential to improve the detection and prevention of persistent attacks against SCADA systems.

Summary: DRL has the capability to learn patterns effectively, using a convolutional neural network while leveraging Deep Q-Learning (Fig. 4) to efficiently classify intrusions from SCADA's network traffic protocols.

B. APPLYING DRL TO SCADA
DRL is a powerful technique, ideal for monitoring and securing SCADA systems. However, from an algorithm design perspective, many challenges in method applicability must be considered when choosing the appropriate framework best suited for SCADA security, particularly when employing the Deep Q-learning algorithm and an epsilon-greedy policy [29]. During the software development, training, and testing of our DRL algorithm, overcoming the following challenges was crucial to the success of our experiment:

Exploration vs. Exploitation Trade-Off: This framework represents a fundamental concept in reinforcement learning, including DRL. It outlines the dilemma faced by an agent when deciding between exploring new actions or exploiting its current knowledge to maximize long-term rewards.

In the explorative state, training our actor in a complex environment with large state and action spaces is very time-consuming, as the agent network samples each set of actions to obtain optimal rewards. This risky strategy can lead to negative results and ultimately affects our model's detection outcome. In contrast, in the exploitation state our model experienced ''local optima'' and over-fitting as it exploited prior knowledge through experience in favor of long-term rewards [58]. Though exploiting existing knowledge can lead to more efficient and effective decision-making, the random selection of actions with a probability of ϵ (epsilon) can also lead to sub-optimal performance [59].

To address this trade-off, we improved the robustness of our DRL-based model by incorporating a deep Q-learning algorithm, along with an off-policy approach and a decay function. This optimization aims to overcome the limitations and enhance the overall performance of the model [14].

Summary: In developing our DQN model, the explorative and exploitative concepts within the DRL methodology presented a significant challenge in our design, particularly in large state and action spaces. To overcome this trade-off and achieve optimal performance, we carefully managed and balanced the technique by incorporating a ReplayMemory function with mini-batch training and a decay function in conjunction with an off-policy approach.

C. OPTIMIZING DRL FOR THE SCADA ENVIRONMENT
From our investigative analysis of SCADA, particularly from the perspective of its dynamic configuration (i.e., proprietary software, hardware, and topological configuration), the information collected and expertise gained from the analysis led to the structural design of our proposed approach.

As outlined in the previous subsection, the difference between exploration and exploitation can be balanced to help optimize DRL as they merge in unified facets, in which the model decides ''when'' to explore versus ''what'' to exploit as a strategy [48]. It is worth noting that several explorative strategies have been developed to address this trade-off, including Thompson sampling, Upper Confidence Bound (UCB), epsilon-greedy, and more. By incorporating a level of randomness, these strategies help balance this trade-off.

To overcome the performance challenges associated with the explorative and exploitative states, we designed a DRL framework that balances explorative strategies (Section IV-B) using epsilon-greedy, as presented in Algorithm 3, which encourages the agent network to explore by trying out new actions while actively taking advantage of experienced knowledge through exploitation. Our design incorporates regularization, in which the ''decay weight'' is adjusted, with an ensemble method comprised of an actor-critic (Algorithm 2); the technique helps balance the model's efficiency and prevents the model from failing to generalize to new environments.

We adopted this approach because SCADA datasets are subject to data bias, due to their dynamic and non-stationary properties [21], and also because of privacy and security concerns, as their structural designs differ based on regions, software, etc.

To avoid model specialization with respect to these DRL challenges, we create a separate data class with specific functions and methods tailored to accommodate each individual dataset's format [60]. This allows us to carefully tune and
adjust our hyperparameters [48], [59], as shown in Table 3, in order to achieve the best possible performance.

Summary: To optimize DRL for SCADA network security, we addressed the challenges caused by model generalization and specialization. They are due in part to data formatting; we create a separate data class with functions and methods tailored to SCADA designs to resolve the problem.

D. MONITORING SCADA USING THE DRL ALGORITHM
In our investigative findings, we identified various domains in which DRL algorithm implementations have shown remarkable success. For example, in the gaming industry, DRL deployment in video game applications developed learning strategies in complex and large action spaces for reward maximization; in assembly line operations, the technique also demonstrated substantial results in process optimization and labor reduction. However, with respect to these domains, when considering a similar methodology for cybersecurity threat detection, particularly in SCADA applications, we identified several functions in the DRL structural design that affect the algorithm's performance [3], [61], [62].

For example, implementing a greedy policy limitation function confines the agent to relying solely on current knowledge to exploit actions that maximize immediate rewards. The exploration and exploitation function within the algorithm design would direct the agent to either actively explore the environment to gain knowledge and discover new threat patterns about protocol vulnerabilities to maximize rewards, or exploit the environment by taking uncertain and potentially risky actions that may reduce immediate rewards [63]. The cascading effects of these features can make it challenging for a DRL agent to operate effectively in stochastic, unpredictably large, and complex environments [53].

Unlike video game environments and assembly lines, which are strictly designed with constants such as known rules and constraints and a deterministic configuration that makes it relatively easy for the DRL technique to optimize rewards, SCADA systems often operate in dynamic and unpredictable real-world environments, with complex interconnected systems comprising water treatment plants, power grids, or manufacturing facilities. The interdependencies of nodes and network communication settings of the individual systems have also proven to be very challenging for a DRL agent to optimize rewards effectively [52], [64]. As follows, we provide explanations of the different method implementations:

• To actively monitor SCADA systems, we address the challenges stemming from the DRL framework by developing a DQN model with defined security objectives that allow the algorithm to detect network anomalies and classify them as network intrusions based on our label parameters [58].
• To improve scalability, we implemented a replay memory and utilized a mini-batch function that allows the agent to sample data from its prior experiences to optimize learning. We employ an off-policy approach along with a decay function to balance the trade-off between exploration and exploitation for reward optimization while reducing unnecessary exploitation time [65].
• We pre-processed and formatted our data using multi-class binary classification, which reduces false positives, and then trained our model on historical data containing examples of both normal network behavior and various attack scenarios.
• We continuously monitor our DQN's performance by testing it on additional datasets to ensure its efficacy in real-time threat detection and response, thus simulating the automation of regular updates for new attack patterns and emerging security risks.

Using this approach, as the model adapts to changing network traffic, it generates appropriate actions or alerts to address security threats. This proactive monitoring technique allows our model to identify potential attacks on SCADA communication systems in real time and take preventive measures to ensure the integrity and reliability of the grid [66].

Summary: To train and deploy a DRL model to actively monitor smart infrastructure, we simulated the following: we selected two well-known datasets to simulate the data collection and formatting process; we defined the security objectives using binary classification; we developed and trained a DQN model using a SCADA dataset; we then reloaded the saved model and tested it on a second set of data, thus simulating ''real-time deployment.'' The result of this experiment shows that our model continuously analyzes the incoming data, leveraging its training to recognize patterns of normal behavior and anomalous activity.

V. IMPLEMENTATION AND EVALUATION
For our implementation, we utilize the SCADA use case presented in Fig. 5. This use-case scenario describes a network fully responsible for remotely supervising, monitoring, and controlling critical infrastructure, such as electrical grids and stations, water treatment facilities, transportation systems, or oil refineries. In this complex operational cyber-space, the threat model involves an unauthorized attacker gaining access to and compromising SCADA network traffic protocols, attempting to disrupt the system's operations.
A. SYSTEM THREAT MODEL
1. Initial Compromise: Through vulnerability exploitation, either by means of brute force or social engineering, an attacker gains unauthorized access to the system's network and begins to sabotage network traffic protocols and disrupt industrial processes.
2. Traffic Protocol Manipulation and Alteration: Upon establishing administrative command, the attacker cancels and encrypts control commands, injecting false sensor data to disrupt communications between IoT and IIoT devices. Such a compromise may lead to catastrophic loss and physical damage to consumers.
3. Detection Scheme and Response: Deploy a trained DRL as an intrusion detection and response system to mitigate and assess abnormal network behaviors and to identify possible threats caused by compromised traffic protocols.

Critical Infrastructure Model: Because modern critical infrastructure systems are autonomous, they use IoT and IIoT devices and internet protocols (i.e., TCP/IP) to monitor and control critical functions in water treatment plants, power grids, transportation networks, etc. Although these devices provide cutting-edge multi-functionality for data collection and automation processes, their designs pose the greatest security vulnerability to the system. Therefore, deploying a trained DRL as an NIDS [35], [67], [68] can authenticate communication and filter the TCP/IP traffic of attached nodes in a synchronous and asynchronous [66] manner to help detect lurking threats.

Applying DRL to Critical Infrastructure: In this section, we describe our Deep Q-Network approach to ensure the security of SCADA's network-attached nodes (Fig. 2). These interconnected IoT and IIoT devices lack the necessary security features, which creates a larger attack surface within the system, thus making it more susceptible to cyber threats. However, applying DRL as an intrusion detection ''avant-garde'' addresses this challenge by using deep Q-learning [58] algorithms to analyze network traffic patterns and identify anomalous behaviors for potential intrusions, while reducing downtime using the deployment approach in Fig. 5 and the detection strategy in Algorithm 4.

B. EXPERIMENTAL ENVIRONMENT
All of our experiments were conducted on a computer with the following specifications:
• Intel(R) Core(TM) i7-1700 CPU @ 2.90 GHz; 16 GB RAM
• Intel(R) UHD Graphics 630 GPU
• A dedicated AMD Radeon RX 640 GPU
• Windows 10 Pro operating system
• Python 3.9.12, gym 0.21.0, TensorFlow 2.9.1
• Keras 2.9.0, SciKit-Learn 1.0.2, matplotlib, seaborn

C. EVALUATION
In this section, we briefly outline the key tenets of our algorithm's research objective and provide a brief description of the datasets used in our proposed approach.

Our experimental objective was to develop and evaluate a DRL framework to assess its robustness and effectiveness and to determine whether the algorithm can be successfully leveraged to detect cybersecurity attacks in large and complex environments [29]; specifically, to investigate the impact of using the DRL technique on new and unseen data.
To validate these objectives, we developed a DQN model consisting of two hidden layers with an actor-critic algorithm and a replay-memory function.

As illustrated in Algorithm 5, to balance our model's explorative and exploitative behavior in large, complex state and action spaces, we implemented an off-policy approach along with a decay function that allows the agent to sample mini-batches.

1) DATASETS
The WUSTL-IIoT-2018 dataset is a dataset focused on Industrial Internet of Things (IIoT) systems, capturing features related to network traffic, device interactions, and protocols used. It contains a significant number of attacks, including DDoS, injection, and command execution attacks [25].

The WUSTL-IIoT-2021 dataset is a recent dataset focusing on IIoT devices and systems. It includes updated information on IIoT behavior, network interactions, and potential cybersecurity threats. The dataset contains a variety of features specific to IIoT, with a substantial number of attacks, including unauthorized access, data exfiltration, and device manipulation [69].

6 WUSTL-IIoT-2018: https://www.cse.wustl.edu/~jain/iiot/index.html
7 WUSTL-IIoT-2021: https://www.cse.wustl.edu/~jain/iiot2/index.html

Dataset Characteristics: During our preliminary analysis and pre-processing of the datasets (Fig. 6), we identified four types of attacks commonly simulated to test SCADA network security defenses: Port Scanner, Address Scan Attack, Device Identification Attack, and Exploit [16], [70], [71], [72]. These ''scans'' represent a precursor or the initial stage of more severe attacks. These frontal assaults help collect sensitive data and subsequently expose an organization's vulnerability [14]. Their success can lead to data breaches and possibly affect control decisions, causing equipment damage or triggering unsafe use of infrastructure assets. Based on those threat patterns, we mapped and labeled those attacks as our target detections (Algorithm 6).

Algorithm 6 Label mapping
Input: Initialize dataset
Process: Initialize an empty dictionary to store labels
Output: DICTIONARY = Attack_labels & Attack_map
for each label in mapping dict do
    if the label is ''normal'' then add to dictionary with a value of 0;
    if the label is ''Probe'' then add to dictionary with a value of 1;
    if the label is ''R2L'' then add to dictionary with a value of 2;
    if the label is ''DoS'' then add to dictionary with a value of 3;
end

Algorithm 7 Reading labels
Input: Initialize dictionary
Output: Attack_labels
forall correct labels do
    labels ← Agent
    for each iteration do
        st, at = []
        for each attack label in attack_map do
            n = length(attack_map)
        end
        return labels
    end
end

In meeting our set goals, we designed a double-layer Deep Q-Network algorithm [58]; we pre-processed two unique datasets and extracted features of importance for detecting cyber anomalies. We then label each selected feature and store them in a dictionary to fit our model's input, as shown in Algorithm 6. Upon initialization, the actor-network loops through the dictionary and matches the corresponding labels with each attack category, as presented in Algorithm 7. Next, we fine-tuned the algorithm using the values in Table 3. These methods allow our actor to recognize patterns from normal and anomalous network traffic behavior.

2) PERFORMANCE METRICS
For validation, we used the standard measurement metrics to assess the model's predictive performance: ''Accuracy, F1-score, False Positive, and False Negative rates.'' We also use the TP, TN, FP, and FN counts to calculate our model's performance.

Accuracy: measures the ratio of correctly predicted labels using this equation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

Precision: measures the accuracy of positive predictions:

Precision = TP / (TP + FP)    (2)

Recall: quantifies the number of correct positive predictions made out of all positive predictions that could have been made by the classifier:

Recall = TP / (TP + FN)    (3)

F1-score: measures the harmonic mean of precision and recall using this formula:

F1-score = (2 · P · R) / (P + R)    (4)

Lastly, we trained and evaluated our model using the selected datasets, WUSTL-IIoT-2018 [25] and WUSTL-IIoT-2021 [26], comprised of network traffic protocols. After successfully training the model, our DRL learns to classify threats in real time and provides detection and response [27]. With respect to validation (Section V-C2), we loaded the saved model with our test data, comprised of multiple binary classification inputs. Using both methods from Algorithms 6 and 7, the agent network reads the selected labeled attacks and predicts our targeted attack labels.

VI. RESULTS
The results presented from our evaluation illustrate the training characteristics of our DRL framework. Table 4 shows the evaluation results achieved on the two datasets considered in this work: WUSTL-IIoT-2018 and WUSTL-IIoT-2021. Likewise, in Table 5, we present the performance metrics of our model, which were compared with two other research works that also utilized a DRL technique (i.e., [15] and [44]) on the same dataset. We considered only the
research works by Tharewal et al. [15] and Wang et al. [44], because they were the only two presenting a complete evaluation of their solution with the appropriate evaluation metrics. In this experiment, we trained our model using an actor-critic algorithm that learns to make decisions. In the conceptual environment, as demonstrated in Figure 10, the critic evaluates the actor's performance by providing feedback; the actor then updates the policy distribution based on the information received, thus helping to improve the model's learning capabilities.

TABLE 5. Comparison table WUSTL-IIoT-2018.

After convergence, the model's total loss per episode drops from 0.5 and remains stable at under 0.3 for the duration of the training, as presented in Figure 8.

TABLE 3. Environment hyper-parameter.

Our DQN is trained with optimized parameters, as recorded in Table 3 and Algorithms 6 and 7, to accurately classify both normal and anomalous behavior from the input data. In this synchronous process, we set our critic parameters equal to those of our actor network (i.e., identically mirroring γ, ϵ, ϵmin, and rd) as target values.

We adjust our DQN minibatch_size after each training phase to minimize the loss function, aiming to improve the alignment between the predicted output of the model and the desired output, as labeled in Algorithm 6. Overall, the training results demonstrate that our model successfully captures and accurately interprets the assigned attack labels, achieving convergence with an improved loss of less than 0.4 over 250 training episodes (Fig. 10), while accumulating the highest number of rewards (Fig. 7).

During each learning step, our model updates both the actor's and the critic's parameters using ''policy gradients'' and the ''advantage value'', minimizing the mean squared error with the Bellman equation, as detailed and implemented in Section IV-A1.
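A minimal numerical sketch of this learning step (our own illustration with placeholder values, not the authors' implementation) computes the Bellman targets for a sampled mini-batch and the mean squared error between those targets and the predicted Q-values:

# Illustrative sketch only: random placeholder values stand in for real network predictions.
import numpy as np

gamma = 0.99                                               # discount factor
batch_size = 32                                            # mini-batch sampled from replay memory
n_actions = 2

rewards = np.random.choice([-1.0, 1.0], size=batch_size)   # scalar rewards for the batch
q_pred = np.random.rand(batch_size, n_actions)             # Q(s, a) from the online network
q_next = np.random.rand(batch_size, n_actions)             # Q(s', a') from the mirrored target network
actions = np.random.randint(0, n_actions, size=batch_size) # actions actually taken
done = np.random.rand(batch_size) < 0.5                    # whether each episode terminated

# Bellman target: r + gamma * max_a' Q(s', a'), with no bootstrapping on terminal states.
targets = rewards + gamma * q_next.max(axis=1) * (~done)

# Mean squared error between the predicted Q-values of the taken actions and the targets.
q_taken = q_pred[np.arange(batch_size), actions]
mse_loss = float(np.mean((q_taken - targets) ** 2))
print(mse_loss)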
FIGURE 9. Training results.

VII. THREATS TO THE VALIDITY
In this section, we explore the conceivable threats to the validity of our study and discuss criteria that could potentially impact the integrity and applicability of our findings. We acknowledge and address the following threats:

SCADA technology plays a significant role in the autonomous operations of smart infrastructure.

C. MITIGATION STRATEGIES
To alleviate these threats, we employed rigorous methodological procedures, including:
• randomized controlled trial selection
• carefully conducted sampling and robust statistical analyses
Nevertheless, it is important to acknowledge that some degree of uncertainty may still exist. Additionally, we have discussed and addressed several of these concerns in the limitations section.

D. LIMITATIONS
Despite our proposed DRL's promising outcomes and effective results in identifying patterns and detecting cyber-attacks in
SCADA infrastructure, we have identified three limitations that should be acknowledged. We detail these limitations and outline the measures taken to mitigate these constraints:
1. Limited Dataset: one of the limitations of our study is the availability of a limited dataset specific to SCADA infrastructure cyber-attacks. To mitigate this limitation, we utilized the WUSTL-IIoT-2018 and WUSTL-IIoT-2021 datasets, which include 25 networking features representing both benign and attack traffic. While these datasets provide valuable insights, they may not cover the entire spectrum of potential cyber-attacks. In particular, the selected datasets WUSTL-IIoT-2018 and WUSTL-IIoT-2021 may not represent the broader population of SCADA infrastructure, which could lead to sampling bias. Such inconsistency may affect a model's accuracy and cause poor performance on unseen data [73]. To address this, we ensured rigorous preprocessing to maximize the utility of the available data.
2. Generalization: our proposed model may face challenges in detecting novel or previously unseen cyber-attack types that are not present in the training dataset. Although the Deep Q-learning algorithm and the Q-network as function approximators are designed to capture complex patterns, the model's performance may be affected when encountering unknown attack patterns. In other words, our DRL model, trained on two specific datasets, as explained under ''non-stationary data,'' may not generalize well to other datasets or when applied to different scenarios. This could limit the applicability of the model and may require substantial retraining when attempting to use the saved model in a new context. To address this, we emphasize the need for continuous monitoring and updating of the model with new data to adapt to emerging threats and to ensure ongoing effectiveness in real-world scenarios.
3. Scalability: the proposed framework utilizes a fully connected neural network architecture with two hidden layers consisting of 64 and 32 fully connected neurons, respectively. While this architecture has demonstrated satisfactory performance in our experiments, scalability may become a concern when dealing with larger SCADA systems or more complex environments. To address this, future research should focus on investigating advanced network architectures and exploring techniques such as convolutional or recurrent neural networks to enhance the scalability of the model without compromising its performance.

E. FUTURE RESEARCH DIRECTIONS
Subsequently, we present four future research directions that arise from our study:
1. Focus on Adversarial Attacks: as the sophistication of cyber-attacks continues to evolve, future research should explore the vulnerability of the proposed DRL framework to adversarial attacks. Investigating methods to enhance the model's robustness against adversarial manipulations and exploring adversarial training techniques could significantly improve its real-world applicability and effectiveness.
2. Use Real-time Data: real-time detection and response are crucial in protecting SCADA infrastructure. Future research should focus on reducing the inference time of the proposed model to ensure timely identification and prevention of cyber-attacks. Techniques such as model compression, quantization, and hardware acceleration can be explored to achieve low-latency and efficient deployment of the framework in real-world scenarios.
3. Apply Transfer Learning and Data Augmentation: to address the limited dataset issue, future research can explore transfer learning techniques to leverage pre-trained models on larger and more diverse datasets in
related domains. Additionally, data augmentation techniques can be employed to generate synthetic samples that represent a wider range of attack scenarios, further enhancing the model's generalization capabilities.
4. Improve Explainability and Interpretability: DL models, including the proposed DRL framework, often lack interpretability, making it challenging to understand the reasoning behind their decisions. Because DRL models are difficult to interpret, it may be challenging to fix errors, diagnose, or even improve a model's overall performance, due in part to structural complexity. However, through proper design choices and documentation, including the model's assumptions and hyperparameter settings, researchers can gain a clear understanding of a model's behavior and even broader insight into its decision-making process. Future research should focus on developing techniques to explain and interpret the model's behavior, providing insights into the features and patterns that contribute to cyber-attack detection. Doing so would not only help enhance the model's trustworthiness but also enable cybersecurity analysts to gain valuable insights for further investigations.

By addressing these limitations and exploring future research directions, we can continue to advance the capabilities of DRL in detecting and preventing cyber-attacks in SCADA infrastructure, ultimately enhancing critical infrastructure security and protecting against emerging threats.

VIII. CONCLUSION
In this work, we aimed to address the challenges of the DRL framework and its applicability in cybersecurity to advance the development of effective solutions in this field. To achieve this, we designed a double-layered DQN model that utilizes a Deep Q-Learning algorithm to enhance learning in complex environments while adapting to new and unseen data. To balance our model's explorative and exploitative behavior in complex state and action spaces, we implemented an off-policy approach with a decay function, allowing the agent to sample mini-batches from a replay memory containing past experiences. To evaluate the model's efficacy, we explored the security challenge of protecting critical infrastructure against continuous cyber-security attacks. We conducted an investigative threat analysis of SCADA systems to assess their network dependency and potential vulnerabilities. Based on our findings, we selected the SCADA domain as our target objective and evaluated our DQN using two publicly available datasets from the SCADA testbed: WUSTL-IIoT-2018 and WUSTL-IIoT-2021. The results from our trained model show that our DQN can learn to classify threats at 99% accuracy and provide detection and response in real time. In retrospect, it is important to note that detecting certain types of attacks, such as spoofed traffic, destination packets, and total byte attacks, can be challenging and may require collaboration with other techniques and tools. Overall, the use of DRL for detecting cybersecurity attacks in SCADA infrastructure is a promising approach with the potential to significantly improve the security of these systems.

REFERENCES
[1] D. P. Joseph and J. Norman, ''An analysis of digital forensics in cyber security,'' in Proc. 1st Int. Conf. Artif. Intell. Cogn. Comput. (AICC). Singapore: Springer, 2018, pp. 701–708.
[2] R. Diao, Z. Wang, D. Shi, Q. Chang, J. Duan, and X. Zhang, ''Autonomous voltage control for grid operation using deep reinforcement learning,'' in Proc. IEEE Power Energy Soc. Gen. Meeting (PESGM), Aug. 2019, pp. 1–5.
[3] X. Liu, W. Yu, F. Liang, D. Griffith, and N. Golmie, ''On deep reinforcement learning security for industrial Internet of Things,'' Comput. Commun., vol. 168, pp. 20–32, Feb. 2021.
[4] J. Duan, D. Shi, R. Diao, H. Li, Z. Wang, B. Zhang, D. Bian, and Z. Yi, ''Deep-reinforcement-learning-based autonomous voltage control for power grid operations,'' IEEE Trans. Power Syst., vol. 35, no. 1, pp. 814–817, Jan. 2020.
[5] B. B. Zad, J.-F. Toubeau, O. Acclassato, O. Durieux, and F. Vallée, ''An innovative centralized voltage control method for MV distribution systems based on deep reinforcement learning: Application on a real test case in Benin,'' in Proc. CIRED 26th Int. Conf. Exhib. Electr. Distrib. Edison, NJ, USA: IET, 2021, pp. 1577–1581.
[6] Y. Liu, H. Xu, D. Liu, and L. Wang, ''A digital twin-based sim-to-real transfer for deep reinforcement learning-enabled industrial robot grasping,'' Robot. Comput.-Integr. Manuf., vol. 78, Aug. 2022, Art. no. 102365.
[7] J. Kober, J. A. Bagnell, and J. Peters, ''Reinforcement learning in robotics: A survey,'' Int. J. Robot. Res., vol. 32, no. 11, pp. 1238–1274, Sep. 2013.
[8] T. T. Nguyen and V. J. Reddi, ''Deep reinforcement learning for cyber security,'' IEEE Trans. Neural Netw. Learn. Syst., pp. 1–17, 2021.
[9] OpenAI. (2020). Robogym. [Online]. Available: https://github.com/openai/robogym
[10] T. Hospedales, A. Antoniou, P. Micaelli, and A. Storkey, ''Meta-learning in neural networks: A survey,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 9, pp. 5149–5169, Sep. 2022.
[11] S. Thrun and L. Pratt, Learning to Learn: Introduction and Overview. Boston, MA, USA: Springer, 1998, pp. 3–17, doi: 10.1007/978-1-4615-5529-2_1.
[12] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, ''The arcade learning environment: An evaluation platform for general agents,'' J. Artif. Intell. Res., vol. 47, pp. 253–279, Jun. 2013.
[13] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, ''Mastering the game of go without human knowledge,'' Nature, vol. 550, no. 7676, pp. 354–359, Oct. 2017.
[14] B. Ning and L. Xiao, ''Defense against advanced persistent threats in smart grids: A reinforcement learning approach,'' in Proc. 40th Chin. Control Conf. (CCC), Jul. 2021, pp. 8598–8603.
[15] S. Tharewal, M. W. Ashfaque, S. S. Banu, P. Uma, S. M. Hassen, and M. Shabaz, ''Intrusion detection system for industrial Internet of Things based on deep reinforcement learning,'' Wireless Commun. Mobile Comput., vol. 2022, pp. 1–8, Mar. 2022.
[16] O. Yousuf and R. N. Mir, ''DDoS attack detection in Internet of Things using recurrent neural network,'' Comput. Electr. Eng., vol. 101, Jul. 2022, Art. no. 108034. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S004579062200297X
[17] M. Landen, K. Chung, M. Ike, S. Mackay, J.-P. Watson, and W. Lee, ''DRAGON: Deep reinforcement learning for autonomous grid operation and attack detection,'' in Proc. 38th Annu. Comput. Secur. Appl. Conf., 2022, pp. 13–27.
[18] F. Wei, Z. Wan, and H. He, ''Cyber-attack recovery strategy for smart grid based on deep reinforcement learning,'' IEEE Trans. Smart Grid, vol. 11, no. 3, pp. 2476–2486, May 2020.
[19] D. Silver, ''Mastering the game of Go with deep neural networks and tree search,'' Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
[20] M. Alauthman, N. Aslam, M. Al-kasassbeh, S. Khan, A. Al-Qerem, and K.-K. Raymond Choo, ''An efficient reinforcement learning-based botnet detection approach,'' J. Netw. Comput. Appl., vol. 150, Jan. 2020, Art. no. 102479.
[21] Z. Wang, H. He, Z. Wan, and Y. Sun, ‘‘Coordinated topology attacks in smart grid using deep reinforcement learning,’’ IEEE Trans. Ind. Informat., vol. 17, no. 2, pp. 1407–1415, Feb. 2021.
[22] M. H. Ling, K.-L.-A. Yau, J. Qadir, G. S. Poh, and Q. Ni, ‘‘Application of reinforcement learning for security enhancement in cognitive radio networks,’’ Appl. Soft Comput., vol. 37, pp. 809–829, Dec. 2015.
[23] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, ‘‘Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,’’ in Proc. Int. Conf. Mach. Learn., 2018, pp. 1861–1870.
[24] A. Kathirgamanathan, E. Mangina, and D. P. Finn, ‘‘Development of a soft actor critic deep reinforcement learning approach for harnessing energy flexibility in a large office building,’’ Energy AI, vol. 5, Sep. 2021, Art. no. 100101. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2666546821000537
[25] M. Teixeira, T. Salman, M. Zolanvari, R. Jain, N. Meskin, and M. Samaka, ‘‘SCADA system testbed for cybersecurity research using machine learning approach,’’ Future Internet, vol. 10, no. 8, p. 76, Aug. 2018.
[26] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, ‘‘A detailed analysis of the KDD CUP 99 data set,’’ in Proc. IEEE Symp. Comput. Intell. Secur. Defense Appl., Jul. 2009, pp. 1–6.
[27] J. Li, T. Yu, H. Zhu, F. Li, D. Lin, and Z. Li, ‘‘Multi-agent deep reinforcement learning for sectional AGC dispatch,’’ IEEE Access, vol. 8, pp. 158067–158081, 2020.
[28] M. Botvinick, J. X. Wang, W. Dabney, K. J. Miller, and Z. Kurth-Nelson, ‘‘Deep reinforcement learning and its neuroscientific implications,’’ Neuron, vol. 107, no. 4, pp. 603–616, Aug. 2020. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0896627320304682
[29] M. Moradi, Y. Weng, and Y.-C. Lai, ‘‘Defending smart electrical power grids against cyberattacks with deep Q-learning,’’ PRX Energy, vol. 1, no. 3, Nov. 2022, Art. no. 033005.
[30] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, ‘‘A brief survey of deep reinforcement learning,’’ 2017, arXiv:1708.05866.
[31] D.-J. Shin and J.-J. Kim, ‘‘Deep reinforcement learning-based network routing technology for data recovery in exa-scale cloud distributed clustering systems,’’ Appl. Sci., vol. 11, no. 18, p. 8727, Sep. 2021. [Online]. Available: https://www.mdpi.com/2076-3417/11/18/8727
[32] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. Hoboken, NJ, USA: Wiley, 2014.
[33] E. Gibney, ‘‘DeepMind algorithm beats people at classic video games,’’ Nature, vol. 518, no. 7540, pp. 465–466, 2015.
[34] S. Wang, H. Liu, P. H. Gomes, and B. Krishnamachari, ‘‘Deep reinforcement learning for dynamic multichannel access in wireless networks,’’ IEEE Trans. Cognit. Commun. Netw., vol. 4, no. 2, pp. 257–265, Jun. 2018.
[35] Z. Gao, Y. Gao, Y. Hu, Z. Jiang, and J. Su, ‘‘Application of deep Q-network in portfolio management,’’ in Proc. 5th IEEE Int. Conf. Big Data Analytics (ICBDA), May 2020, pp. 268–275.
[36] B. Berry, ‘‘Do you know these key SCADA concepts? SCADA tutorial: A quick, easy, comprehensive guide (white paper),’’ DPS Telecom, Fresno, CA, USA, Tech. Rep., 2011.
[37] J. Clifton and E. Laber, ‘‘Q-learning: Theory and applications,’’ Annu. Rev. Statist. Appl., vol. 7, pp. 279–301, Mar. 2020.
[38] S. R. Chhetri, S. Faezi, N. Rashid, and M. A. Al Faruque, ‘‘Manufacturing supply chain and product lifecycle security in the era of Industry 4.0,’’ J. Hardw. Syst. Secur., vol. 2, no. 1, pp. 51–68, Mar. 2018.
[39] R. Antrobus, B. Green, S. Frey, and A. Rashid, ‘‘The forgotten I in IIoT: A vulnerability scanner for industrial Internet of Things,’’ Living Internet Things (IoT), London, U.K., Tech. Rep., May 2019.
[40] X. Liu, C. Qian, W. G. Hatcher, H. Xu, W. Liao, and W. Yu, ‘‘Secure Internet of Things (IoT)-based smart-world critical infrastructures: Survey, case study and research opportunities,’’ IEEE Access, vol. 7, pp. 79523–79544, 2019.
[41] G. Stoneburner, A. Goguen, and A. Feringa, ‘‘Risk management guide for information technology systems,’’ NIST Special Publication, vol. 800, no. 30, pp. 30–800, 2002.
[42] R. Huang, Y. Li, and X. Wang, ‘‘Attention-aware deep reinforcement learning for detecting false data injection attacks in smart grids,’’ Int. J. Electr. Power Energy Syst., vol. 147, May 2023, Art. no. 108815. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0142061522008110
[43] D. Wu, A. Ren, W. Zhang, F. Fan, P. Liu, X. Fu, and J. Terpenny, ‘‘Cybersecurity for digital manufacturing,’’ J. Manuf. Syst., vol. 48, pp. 3–12, Jul. 2018. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0278612518300396
[44] W. Wang, J. Guo, Z. Wang, H. Wang, J. Cheng, C. Wang, M. Yuan, J. Kurths, X. Luo, and Y. Gao, ‘‘Abnormal flow detection in industrial control network based on deep reinforcement learning,’’ Appl. Math. Comput., vol. 409, Nov. 2021, Art. no. 126379.
[45] P. Čisar and S. M. Čisar, ‘‘General vulnerability aspects of Internet of Things,’’ in Proc. 16th IEEE Int. Symp. Comput. Intell. Informat. (CINTI), Nov. 2015, pp. 117–121.
[46] D. Torre, F. Mesadieu, and A. Chennamaneni, ‘‘Deep learning techniques to detect cybersecurity attacks: A systematic mapping study,’’ Empirical Softw. Eng., vol. 28, no. 3, p. 44, May 2023.
[47] R. Antrobus, B. Green, S. Frey, and A. Rashid, ‘‘The forgotten I in IIoT: A vulnerability scanner for industrial Internet of Things,’’ in Proc. Living Internet Things (IoT), 2019, pp. 1–8.
[48] B. Jang, M. Kim, G. Harerimana, and J. W. Kim, ‘‘Q-learning algorithms: A comprehensive classification and applications,’’ IEEE Access, vol. 7, pp. 133653–133667, 2019.
[49] J. Suh and T. Tanaka, ‘‘SARSA(0) reinforcement learning over fully homomorphic encryption,’’ in Proc. SICE Int. Symp. Control Syst. (SICE ISCS), Mar. 2021, pp. 1–7.
[50] I. C. Dolcetta and H. Ishii, ‘‘Approximate solutions of the Bellman equation of deterministic control theory,’’ Appl. Math. Optim., vol. 11, no. 1, pp. 161–181, Feb. 1984.
[51] R. W. Beard, G. N. Saridis, and J. T. Wen, ‘‘Galerkin approximations of the generalized Hamilton-Jacobi-Bellman equation,’’ Automatica, vol. 33, no. 12, pp. 2159–2177, Dec. 1997. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0005109897001283
[52] J. Peters and S. Schaal, ‘‘Reinforcement learning of motor skills with policy gradients,’’ Neural Netw., vol. 21, no. 4, pp. 682–697, May 2008. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0893608008000701
[53] A. Akagic and I. Džafić, ‘‘Deep reinforcement learning in smart grid: Progress and prospects,’’ in Proc. XXVIII Int. Conf. Inf., Commun. Autom. Technol. (ICAT), Jun. 2022, pp. 1–6.
[54] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[55] P. E. Protter, Stochastic Differential Equations. Berlin, Germany: Springer, 2005.
[56] S. Mohamed, J. Dong, A. R. Junejo, and D. C. Zuo, ‘‘Model-based: End-to-end molecular communication system through deep reinforcement learning auto encoder,’’ IEEE Access, vol. 7, pp. 70279–70286, 2019.
[57] B. Øksendal, Stochastic Differential Equations. Berlin, Germany: Springer, 2003, pp. 65–84, doi: 10.1007/978-3-642-14394-6_5.
[58] G. Fragkos, J. Johnson, and E. E. Tsiropoulou, ‘‘Dynamic role-based access control policy for smart grid applications: An offline deep reinforcement learning approach,’’ IEEE Trans. Hum.-Mach. Syst., vol. 52, no. 4, pp. 761–773, Aug. 2022.
[59] J. Khoury and M. Nassar, ‘‘A hybrid game theory and reinforcement learning approach for cyber-physical systems security,’’ in Proc. IEEE/IFIP Netw. Oper. Manage. Symp. (NOMS), Apr. 2020, pp. 1–9.
[60] Z. Wang and X. Chu, ‘‘Operating condition identification of complete wind turbine using DBN and improved DDPG-SOM,’’ in Proc. IEEE 11th Data Driven Control Learn. Syst. Conf. (DDCLS), Aug. 2022, pp. 94–101.
[61] D. Zhao, D. Liu, F. L. Lewis, J. C. Principe, and S. Squartini, ‘‘Special issue on deep reinforcement learning and adaptive dynamic programming,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 6, pp. 2038–2041, Jun. 2018.
[62] J. Xu, H. Wang, J. Rao, and J. Wang, ‘‘Zone scheduling optimization of pumps in water distribution networks with deep reinforcement learning and knowledge-assisted learning,’’ Soft Comput., vol. 25, no. 23, pp. 14757–14767, Dec. 2021.
[63] J. Stranahan, T. Soni, and V. Heydari, ‘‘Supervisory control and data acquisition testbed vulnerabilities and attacks,’’ in Proc. SoutheastCon, Apr. 2019, pp. 1–5.
[64] D. Hamouda, M. A. Ferrag, N. Benhamida, and H. Seridi, ‘‘Intrusion detection systems for industrial Internet of Things: A survey,’’ in Proc. Int. Conf. Theor. Applicative Aspects Comput. Sci. (ICTAACS), Dec. 2021, pp. 1–8.
[65] S. Wang, R. Diao, C. Xu, D. Shi, and Z. Wang, ‘‘On multi-event co-calibration of dynamic model parameters using soft actor-critic,’’ IEEE Trans. Power Syst., vol. 36, no. 1, pp. 521–524, Jan. 2021.
[66] X. Zhao, S. Ding, Y. An, and W. Jia, ‘‘Applications of asynchronous deep reinforcement learning based on dynamic updating weights,’’ Appl. Intell., vol. 49, pp. 581–591, 2019.
[67] S. Baek, J. Kim, H. Yu, G. Yang, I. Sohn, Y. Cho, and C. Park, ‘‘Intelligent feature selection for ECG-based personal authentication using deep reinforcement learning,’’ Sensors, vol. 23, no. 3, p. 1230, Jan. 2023. [Online]. Available: https://www.mdpi.com/1424-8220/23/3/1230
[68] M. Zheng, I. Zada, S. Shahzad, J. Iqbal, M. Shafiq, M. Zeeshan, and A. Ali, ‘‘Key performance indicators for the integration of the service-oriented architecture and scrum process model for IoT,’’ Scientific Program., vol. 2021, pp. 1–11, Feb. 2021.
[69] M. Zolanvari, M. A. Teixeira, L. Gupta, K. M. Khan, and R. Jain, ‘‘Machine learning-based network vulnerability analysis of industrial Internet of Things,’’ IEEE Internet Things J., vol. 6, no. 4, pp. 6822–6834, Aug. 2019.
[70] S. Cheung, B. Dutertre, M. Fong, U. Lindqvist, K. Skinner, and A. Valdes, ‘‘Using model-based intrusion detection for SCADA networks,’’ in Proc. SCADA Security Sci. Symp., vol. 46, 2007, pp. 1–12.
[71] W. Alsabbagh, S. Amogbonjaye, D. Urrego, and P. Langendörfer, ‘‘A stealthy false command injection attack on Modbus based SCADA systems,’’ in Proc. IEEE 20th Consum. Commun. Netw. Conf. (CCNC), Jan. 2023, pp. 1–9.
[72] I. Ortega-Fernandez and F. Liberati, ‘‘A review of denial of service attack and mitigation in the smart grid using reinforcement learning,’’ Energies, vol. 16, no. 2, p. 635, Jan. 2023. [Online]. Available: https://www.mdpi.com/1996-1073/16/2/635
[73] G. C. Cawley and N. L. Talbot, ‘‘On over-fitting in model selection and subsequent selection bias in performance evaluation,’’ J. Mach. Learn. Res., vol. 11, pp. 2079–2107, Jul. 2010.

DAMIANO TORRE (Member, IEEE) received the B.Sc. degree from the University of Bari, Italy, in 2009, the M.Sc. degree from the University of Castilla-La Mancha, Spain, in 2011, and the Ph.D. degree from Carleton University, Canada, in 2018. From 2020 to 2023, he was an Associate Research Scientist with the Centre for Cybersecurity Innovation, Texas A&M University Central Texas, USA. Prior to coming to the USA, he was a Research Associate with the University of Luxembourg, from 2018 to 2021. His research interests include computer science, and more specifically software engineering, cybersecurity, artificial intelligence, model-driven engineering, and empirical software engineering. He regularly serves on the organizing/program committees of ISSRE and QRS; and satellite events of ESEM, ICSE, and ASE.