Al Hawawreh2018 PDF
Article info
Keywords: Industrial internet of things (IIoT); Internet industrial control systems (IICSs); Deep learning; Auto-encoder
Abstract
Internet Industrial Control Systems (IICSs) that connect technological appliances and services with physical systems have become a new direction of research as they face different types of cyber-attacks that threaten their success in providing continuous services to organizations. Such threats cause firms to suffer financial and reputational losses and the stealing of important information. Although Network Intrusion Detection Systems (NIDSs) have been proposed to protect against them, they have the difficult task of collecting information for use in developing an intelligent NIDS which can proficiently detect existing and new attacks. In order to address this challenge, this paper proposes an anomaly detection technique for IICSs based on deep learning models that can learn and validate using information collected from TCP/IP packets. It includes a consecutive training process executed using a deep auto-encoder and deep feedforward neural network architecture which is evaluated using two well-known network datasets, namely, the NSL-KDD and UNSW-NB15. As the experimental results demonstrate that this technique can achieve a higher detection rate and lower false positive rate than eight recently developed techniques, it could be implemented in real IICS environments.
© 2018 Elsevier Ltd. All rights reserved.
M. AL-Hawawreh et al. / Journal of Information Security and Applications 41 (2018) 1–11
∗ Corresponding author. E-mail address: [email protected] (N. Moustafa).
https://doi.org/10.1016/j.jisa.2018.05.002
2214-2126/© 2018 Elsevier Ltd. All rights reserved.
1. Introduction
Cyberspace plays a key role in contemporary societies and economies as the internet has changed the ways in which people and organizations communicate and conduct business electronically [1]. Therefore, different devices, applications, and services, which link the virtual and physical worlds, are included in the new term the ‘Industrial Internet of Things’ (IIoT) [2]. The interoperability of Information Technology (IT) and Operational Technology (OT) exposes industrial environments that depend on closed and proprietary communication protocols to diverse types of anomalous activities [3]. IIoTs are connected to the internet via the TCP/IP protocol in the forms of Machine-to-machine (M2M) and Machine-to-people (M2P) using specific IIoT protocols, for example, Message Queue Telemetry Transport (MQTT) and Advanced Message Queuing Protocol (AMQP) [4]. There have been substantial increases in the numbers of loopholes and vulnerabilities in IICSs that can be breached using several sophisticated attack techniques, whereby attackers attempt to exploit these systems in order to achieve their aims of stealing valuable information and/or financial funds, and/or corrupting device resources [5]. It is expected that cyber threats to the IIoT/IICSs will cost up to $90 trillion by 2030 if the cyber-security domain cannot find promising control solutions for preventing them [6].
With the number of IIoT devices and applications rapidly increasing, protecting critical infrastructures (i.e., IICSs) is becoming a more critical issue for business [7]. In IIoT environments, malware, which leverages zero-day vulnerabilities, is one of the most common threats, whereby attackers infect critical devices in order to control and modify their operations using different methodologies, such as Advanced Persistent Threat (APT), Denial of Service (DoS) and Distributed DoS (DDoS). For example, the Stuxnet worm attacked the Iranian nuclear program in 2010, Iranian hackers penetrated the ICS of New York’s dam in 2013, BlackEnergy malware was directly responsible for power outages for at least 80,000 customers in Ukraine in 2015 and, more recently, an SFG malware attack targeted European energy companies [10,11]. These malicious activities have proven that ‘security by obscurity’ and traditional cyber-security mechanisms, including security policies, authentication, firewalls and signature-based Intrusion Detection Systems (IDSs), are no longer appropriate schemes for achieving efficient protection for critical infrastructure.
To detect IIoT attacks, a Network IDS (NIDS), which is the second line of defense after firewall, antivirus and access control systems, has to be deployed [8]. It is defined as a software and/or hardware mechanism used to monitor and detect suspicious events throughout networked systems [9], with its methodology categorized as either signature- or anomaly-based detection. The former identifies existing intrusions by comparing upcoming rules/signatures against a blacklist of their known rules but cannot detect new attacks while the latter can detect known and new attacks but also creates a huge number of errors [8]. An anomaly-based IDS could be a powerful technique if its methodology could successfully detect known and unknown attacks that attempt to breach IIoTs [8,9]. According to the literature, IDSs have been built based on classical machine-learning and data-mining techniques [12,13], rules-based models [14], artificial intelligence approaches [15] and statistical models [40]. However, these methods often produce high False Positive Rates (FPRs) due to overlapping between legitimate and anomalous observations.
In this study, we propose an effective Anomaly Detection System (ADS) for IICSs using deep-learning models to address the drawback of FPRs as much as possible because these models can automatically analyze raw network data to discover abnormal patterns efficiently. A deep-learning technique is very effective, as it can deal with high dimensionality and determine the latent structure from unlabeled data [34]. More importantly, in the training phase, it conducts a consecutive training process using the unsupervised Deep Auto-Encoder (DAE) algorithm to learn normal network behaviors and produce the optimal parameters (i.e., weights and biases). Then, a standard supervised deep neural network model uses the estimated parameters of the DAE model for effectively tuning its parameters and classifying network observations. The proposed technique is evaluated on two benchmark datasets, the NSL-KDD [38] and UNSW-NB15 [39,44,45], due to their wide and recent use for assessing ADSs. The experimental results reveal the superiority of the proposed technique compared with different network intrusion detection mechanisms, which clarifies its effectiveness for deployment in real-world IICS environments.
The remainder of this paper is organized as follows. Section 2 provides an overview of the most relevant literature concerning ADSs in industrial control systems and the IoT. Section 3 discusses the use of deep learning as an ADS. In Section 4, details of the design of the proposed model and its deployment in an IIoT environment are presented. Descriptions of the datasets and evaluation metrics used are presented in Section 5, and the experimental results discussed in Section 6. Finally, the conclusion is presented in Section 7.
2. Background and related work
This section explains the ADS technology and its approaches in IoT and industrial environments. Moreover, we focus on the approaches of shallow and deep networks, which are related to our proposed technique, demonstrating their capability of identifying suspicious activity.
2.1. ADS technology
An ADS is a fundamental security control mechanism which acts as a sniffer and decision engine for monitoring network traffic and identifying abnormal activities [18]. We focus on one that establishes a profile from normal data and considers any variation from it an attack because it can detect both known and unknown (zero-day) attacks [8,19]. Some ADSs have been introduced in IIoTs; for example, Shang et al. [20] proposed a Particle Swarm Optimization (PSO) technique-based ADS for improving the efficiency of the One Class Support Vector Machine (OCSVM) model by extracting packets of the Modbus/TCP communication protocol for training and validating the model. Similarly, Maglaras and Jiang [21] developed an IDS/ADS based on this model which was trained on offline data using the network traces collected from a SCADA environment. Silva and Schukat [22] used a K-NN classifier to build an IDS based on the context of the Modbus/TCP protocol. Although the above mechanisms achieved reasonable performances to some extent, they were dedicated to specific protocols with high FPRs.
Stewart et al. [23] proposed an adaptive IDS for fitting the dynamic architectures of SCADA systems using different OCSVM models to choose the most appropriate one for effectively detecting different attacks. However, this system consumed a high amount of computational resources while executing and produced a high false alarm rate for detection. Shang et al. [24] proposed an ADS for discovering attacks that penetrated the Modbus/TCP protocol by extracting different features of communication activities from SCADA systems which were used by an SVM algorithm to classify attacks. However, the detection process was not sufficiently effective for detecting abnormal behaviors.
Maglaras and Jiang [25] combined the OCSVM model and recursive K-means clustering algorithm to avoid the influence of kernel parameters on the OCSVM for effectively detecting network attacks. An IDS for a critical infrastructure based on an Artificial Neural Network (ANN) mechanism, which used error back-propagation and Levenberg-Marquardt functions to train a multi-perceptron ANN to detect abnormal network traffic, was presented by Linda et al. [26]. Hodo et al. [27] adopted an ANN to detect DoS/DDoS attacks in IoTs using a simulated network, and Chen et al. [28] proposed an artificial immune-based distributed IDS for IoT systems. More recently, Marsden et al. [56] proposed a Probability Risk Identification based Intrusion Detection System (PRI-IDS) mechanism by inspecting network traffic of the Modbus TCP/IP protocol for detecting replay attacks. However, these schemes produced high false alarm rates and had difficulty recognizing some new attacks.
2.2. Deep networks for IDS
IDSs have been studied using shallow and deep networks for detecting abnormal observations from the host- and network-based systems [47–50,54]. A shallow network is an ANN that often consists of one or two hidden layers, whilst a deep network comprises many hidden layers with different architectures. Deep learning is one of the most popular machine-learning techniques that academic and industrial researchers use due to its capability of learning a computational process in depth that mimics the natural behaviors of a human’s brain [29].
Deep learning can be categorized into different types depending on its architectural design which consists of hierarchical layers of non-linear processing levels [17]. According to Hodo et al. [54], deep networks are classified by architecture into generative and discriminative; the generative architecture models a joint probability distribution for observed data with their classes. There are four types of generative models, which are Recurrent Neural Networks (RNN), Deep Belief Network (DBN), Auto-Encoder (AE), and Deep Boltzmann Machine (DBM). The discriminative architecture, which models the posterior distributions of classes conditioned on the observed data, comprises RNN and Convolutional Neural Network (CNN) [54]. These models are described as follows.
• Generative deep architectures
RNN is considered as a supervised or unsupervised learning model. The core theory behind it is that information is linked in long sequences via a layer-by-layer connection with a feedback loop. There is a directed cycle between its layers that increases its reliability, with the capability of creating an internal memory for storing data of the previous input. RNN has two types, Elman and Jordan, based on the way the layers are connected. Elman consists of three layers (i.e., input, hidden, and output) in addition to the context layer. The hidden layer is connected to the context layer after each
feed-forward and learning rules are applied, and a copy of the previous hidden units is saved at the context units. Jordan networks are like Elman networks, but the context units are fed from the output units.
DAE is used for learning efficient coding in an unsupervised manner. The simplest architecture of DAE involves an input layer, more than one hidden layer and an output layer that has the same number of neurons as the input layer for reconstruction.
DBM is an undirected probabilistic generative model. It consists of energy and stochastic units for the overall network for producing binary results. A Restricted Boltzmann Machine (RBM) is applied to reduce hidden layers, which does not allow intra-layer connections between hidden units. Training a stack of DBM on unlabeled data as the input of the next layer and inserting a layer for discrimination can lead to constructing an architecture of DBN.
DBN consists of multiple hidden layers, where connections exist between layers but not between units within each layer. It is a composition of unsupervised and supervised learning networks. The unsupervised model is learned by a greedy layer-by-layer connection at a time, whereas the supervised network is one or more layers linked for classifying tasks.
• Discriminative deep architectures
RNN utilizes discriminative power for a classification task, and this occurs when the output of the model is labeled data in a sequence with the input.
CNN is a space invariant multi-perceptron ANN, which is biologically inspired by the organization of the animal visual cortex. It has many hidden layers, which typically consist of convolutional layers, pooling layers, fully connected layers and a normalization layer. The convolutional layers share many weights and so have few parameters, which makes the CNN easier to train than other models with the same number of hidden units.
Many recent research studies [46–53] have applied deep learning techniques for IDSs. A study by Alom et al. [46] used DBN which adopted the greedy layer-by-layer learning algorithm to learn each stack of RBM at a time to identify intrusion activities. Similarly, Gao et al. [47] suggested the use of the DBN technique to build an IDS. In the training phase, the greedy layer-by-layer algorithm was used for pre-training and fine-tuning the model. In [48], a deep auto-encoder algorithm was utilized to reduce the data dimensions as a pre-stage of classifying network data. The ANN mechanism was adopted as a classifier to evaluate the efficiency of the auto-encoder compared with the Principal Component Analysis (PCA), kernel PCA and factor analysis algorithms. The results revealed the technique’s efficiency for detecting network attacks.
Li et al. [49] presented a hybrid malicious code detector based on deep learning. In the first step, the auto-encoder was used for decreasing the data dimensions, and the unsupervised DBN model was applied to discover network attacks. Chuan-Long et al. [50] proposed an IDS based on RNN to classify the collected data. The experiments were conducted on a different number of hidden nodes and learning rate values. The technique accomplished a reasonable performance with the parameter setting of 80 hidden nodes and learning rate 0.1, but its computational processing was high. A flexible NIDS using a Self-Taught Learning (STL) algorithm was presented by Niyaz et al. [43]. The sparse auto-encoder was applied to learn a good feature representation while the soft-max regression technique was utilized for classifying the network data. The proposed model performed well in the evaluation process.
Tang et al. [51] built an IDS using a simple DFN which consists of three hidden layers; the model was trained and tested on the best six features selected from the NSL-KDD dataset. Seok et al. [52] utilized the CNN technique-based IDS for recognizing malware. In [53], the author proposed an ensemble method for IDS using different DFN architectures that contain shallow auto-encoder networks, DBN, DNN, and an extreme learning machine. The method was evaluated on the NSL-KDD dataset, and the experiment results showed a good performance of detecting abnormal observations from network data.
From the discussion above, it is observed that deep learning techniques could considerably improve the performance of designing a reliable IDS for IICSs with higher detection accuracy and low false alarm rates. This is the motivation for utilizing deep learning models in this study, due to their ability to extract features automatically with an in-depth analysis of network data and to detect outlier patterns in data as suspicious vectors. Our proposed DAE-DFFNN-based ADS technique contains a DAE algorithm to pre-train the DFFNN model that classifies network observations by ranking the parameter values of the DAE. It has the capability of discovering a good representation for network data and converting the high dimensional data to low dimensional using the decreased layer in the DAE-DFFNN model, as detailed in the following section.
3. Proposed ADS-based deep learning for IICSs
This study applies different architectures of deep-learning models to develop an efficient ADS for IIoT environments. In the training phase, a DAE algorithm learns using normal network observations to create the initialization parameters (i.e., weights and biases) and learn a deep representation of normal behaviors. These parameters are used as an initialization stage for training a standard Deep Feed Forward Neural Network (DFFNN) to discover existing and new attack instances. In the testing phase, the DFFNN is used to recognize malicious vectors. Different hidden nodes in the technique can proficiently learn a deep feature representation and capture the most important features by converting the high dimensions of data to low dimensions based on the decreased hidden layer. The details of the proposed ADS technique are explained in the following three subsections.
3.1. Deep feed-forward neural network (DFFNN)
Typically, a DFFNN is defined as an ANN technique that has an input layer, more than one hidden layer and an output layer with direct connections without a cycle between them [30]. Each hidden layer of the nodes represents abstracted features based on the previous level’s output which are automatically determined and collected in several layers to generate the outputs. To train this technique, a stochastic gradient descent back-propagation algorithm [42] is used.
In this deep-learning algorithm, the input data feeds into an input layer and is then propagated to the hidden layer, the output from which is a non-linear transformation of the data that passes to the output layer. A loss function or back-propagation error [31], which is the difference between the predicted and actual output, is calculated to evaluate the model’s performance and its value propagated backward through the hidden layers to update the weights. Calculations of the loss function are conducted based on single or mini-batch samples of, rather than all, the training data, with the weights updated after each sample is processed in order to properly fit the model.
The supervised training process in this algorithm depends on the randomness of the initialization of the neural network’s parameters which tends to place the model at local minima solutions with poor regularization [33]. To have better convergence properties and improve the results of supervised learning, pre-training
unsupervised techniques, in particular, an AE, can be used to create the initialization parameters.
3.2. Deep auto-encoder (DAE)
A DAE is a feed-forward neural network algorithm for learning efficient coding using an unsupervised technique [34]. It creates a representation of a set of data ($x$) by learning the approximation of an identity function, where the output ($\hat{x}$) is similar to the input ($x$), that is, $x \rightarrow \hat{x}$. Its schematic structure consists of vectors ($x^{(i)}$) in the input layer and more than one hidden layer with a non-linear activation function. The hidden layers are used to learn a compressed representation of the input data via fewer neurons than the input layer. As a result, it learns the most important features, reduces the dimensionality and represents an abstraction of the input data. Ultimately, the output layer ($\hat{x}^{(i)}$) is displayed as an approximate representation of the input layer.
The simplest architecture of an AE consists of an input layer, hidden layer, and output layer. Assuming that the training data ($x^{(i)}$) has $n$ samples, where each $x^{(i)}$ ($i \in (1, \ldots, n)$) has many dimensions, and there is a dimensional feature vector ($d_0$), the Tanh activation function is used [34] and computed by
$$T(t) = \frac{1 - e^{-2t}}{1 + e^{-2t}}, \tag{1}$$
The AE algorithm has two main parts, an encoder and decoder [16,35]. To map the input vector ($x^{(i)}$) into a hidden layer representation ($z^{(i)}$), a deterministic mapping called an encoder process ($f_\theta$) is used [16] and the dimensionality of $x^{(i)}$ reduced to provide the correct number of codes as
$$f_\theta\big(x^{(i)}\big) = T\big(W x^{(i)} + b\big) \tag{2}$$
where $W$ is a weight matrix of size $d_0 \times d_h$, $d_h$ a number of neurons in a hidden layer ($d_h < d_0$), $b$ the bias vector, $T$ a Tanh activation function and $\theta$ the mapping parameters $[W, b]$.
To reconstruct the input as an approximation ($\hat{x}^{(i)}$), the result of the hidden layer’s representation is mapped and the decoder process computed by the deterministic mapping ($g_{\theta'}$) as
$$g_{\theta'}\big(x^{(i)}\big) = T\big(W' z^{(i)} + b'\big), \tag{3}$$
where $W'$ is a $d_h \times d_0$ weight matrix, $b'$ a bias vector and $\theta'$ the mapping parameters $[W', b']$.
The input is formed in a compressed representation to fit the hidden layer, the data in which is then used as input to reconstruct the original data. The training process minimizes the reconstruction error (i.e., the difference between the original data and its low-dimensional reconstruction), which is calculated for a single or mini-batch training sample ($s$) by
$$E\big(x, \hat{x}\big) = \frac{1}{2} \sum_{i}^{s} \big\lVert x^{(i)} - \hat{x}^{(i)} \big\rVert^{2}, \tag{4}$$
$$\theta = \{W, b\} = \arg\min_{\theta} E\big(x, \hat{x}\big), \tag{5}$$
The DAE architecture depicted in Fig. 1 contains three hidden layers: an encoder, a bottleneck (which consists of fewer nodes than the previous layers and is used to represent the input data with a non-linear dimensionality reduction, where the number of nodes represents the number of dimensions) and a decoder. This model potentially operates using a non-linear Principal Component Analysis (PCA) technique for reducing dimensionality [36]. In order to identify legitimate and suspicious activities in IICSs, we propose an ADS based on deep learning that includes training and testing phases, as discussed below.
3.3. Training and testing phases of ADS-based deep learning
The DFFNN and DAE approaches discussed above are the fundamental mechanisms used to build the proposed ADS-based deep-learning technique; the structures of networks in the DAE-DFFNN model are depicted in Fig. 2. In the training phase, given an unlabeled normal training dataset A, and labeled training dataset B, where AB, a DAE with a bottleneck layer is trained based on only normal records (A) without any anomalous vectors to learn and discover the most important feature representations for normal
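As a rough illustration of how Eqs. (1)–(5) and the inner loop of Algorithm 1 fit together, the following sketch trains a single-hidden-layer auto-encoder on a handful of toy "normal" records using only the Python standard library. It is a minimal sketch under stated assumptions, not the implementation evaluated in this paper: the function names, the toy records, the weight initialisation range, the learning rate and the fixed epoch count (standing in for "until converged") are all hypothetical, and the encoder matrix is stored with one row per hidden neuron so that the product $W x^{(i)}$ is defined.

```python
import math
import random


def affine(W, v, b):
    """Return W v + b for a matrix W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def forward(params, x):
    """Encode x to z (Eq. 2) and decode z to x_hat (Eq. 3)."""
    W_enc, b_enc, W_dec, b_dec = params
    # math.tanh(t) equals Eq. (1): (1 - e^{-2t}) / (1 + e^{-2t})
    z = [math.tanh(t) for t in affine(W_enc, x, b_enc)]
    x_hat = [math.tanh(t) for t in affine(W_dec, z, b_dec)]
    return z, x_hat


def recon_error(params, data):
    """Eq. (4): E = 1/2 * sum_i ||x_i - x_hat_i||^2 over the given samples."""
    total = 0.0
    for x in data:
        _, x_hat = forward(params, x)
        total += sum((a - c) ** 2 for a, c in zip(x, x_hat))
    return 0.5 * total


def init_params(d0, dh, rng):
    """Small random weights, zero biases (the range is an assumption)."""
    def rand(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return [rand(dh, d0), [0.0] * dh, rand(d0, dh), [0.0] * d0]


def sgd_step(params, x, lr):
    """One back-propagation update on a single record."""
    W_enc, b_enc, W_dec, b_dec = params
    z, x_hat = forward(params, x)
    # dE/d(pre-activation) at the output layer; tanh'(t) = 1 - tanh(t)^2
    d_out = [(xh - xi) * (1.0 - xh * xh) for xh, xi in zip(x_hat, x)]
    # Back-propagate through the decoder weights to the hidden layer
    d_hid = [
        sum(W_dec[k][j] * d_out[k] for k in range(len(x))) * (1.0 - z[j] * z[j])
        for j in range(len(z))
    ]
    for k in range(len(x)):
        for j in range(len(z)):
            W_dec[k][j] -= lr * d_out[k] * z[j]
        b_dec[k] -= lr * d_out[k]
    for j in range(len(z)):
        for i in range(len(x)):
            W_enc[j][i] -= lr * d_hid[j] * x[i]
        b_enc[j] -= lr * d_hid[j]


def train_autoencoder(data, dh, epochs=200, lr=0.1, seed=0):
    """Repeat per-record updates; a fixed epoch count replaces 'until converged'."""
    params = init_params(len(data[0]), dh, random.Random(seed))
    for _ in range(epochs):
        for x in data:
            sgd_step(params, x, lr)
    return params


# Hypothetical, already-normalised "normal" records with d0 = 4 features.
NORMAL = [
    [0.5, -0.5, 0.2, -0.2],
    [0.4, -0.6, 0.1, -0.1],
    [0.6, -0.4, 0.3, -0.3],
    [0.5, -0.5, 0.25, -0.25],
]

if __name__ == "__main__":
    before = recon_error(init_params(4, 2, random.Random(0)), NORMAL)
    theta = train_autoencoder(NORMAL, dh=2)
    print("reconstruction error before/after:", before, recon_error(theta, NORMAL))
```

In the proposed technique the learned encoder weights and biases would then seed the supervised DFFNN; that fine-tuning stage is omitted here to keep the sketch short.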
Algorithm 1: Unsupervised training phase for proposed ADS-based deep learning.
Input: training dataset (A) with number of samples (n) of ($x^{(i)}$), where $i \in (1, \ldots, n)$.
Output: parameters $\theta = \{W, b\}$
Begin
  Initialize $\{W, b\}$;
  repeat
    For each record ($x^{(i)}$), do
      compute the activation ($z^{(i)}$) at the hidden layer and give the output ($\hat{x}^{(i)}$) to the output layer.
      compute the training error ($E(x^{(i)}, \hat{x}^{(i)})$).
      Back-propagate $E$ and update parameters $\theta = \{W, b\}$;
    End
  until converged
end
patterns. It is trained using all the data, where the input to the network ($x^{(i)}$) is passed through three hidden layers, including the bottleneck one, to reconstruct it ($\hat{x}^{(i)}$), where $W_e$, $W_n$ and $W_f$ are the weights of the DAE, DFFNN and final prediction model, as depicted in Fig. 2.
In the encoding step, the input layer in the first hidden layer is processed using Eqs. (1) and (2). In the bottleneck step, a low-dimensional non-linear transformation of the input features is executed to extract important representative features from the network data. Then, in the decoding step, the last hidden layer in the bottleneck feature is used to approximately replicate the input using Eqs. (1) and (3), and stochastic gradient descent back-propagation to reduce the loss function, that is, the mean square error between ($x^{(i)}$) and ($\hat{x}^{(i)}$) using Eqs. (4) and (5). The key procedures for unsupervised training of the proposed ADS-based deep learning are presented in Algorithm 1.
Then, the trained model is used as the starting point and initial parameters of weights and biases for training the supervised deep network and the process for tuning the network model conducted using the labeled training dataset ($B(x^{(i)}, y^{(i)})$). The same steps as previously followed are applied to learn and validate the proposed ADS technique on dataset B which includes normal and anomalous vectors that test the accuracy of the method. In more detail, the network model is trained based on the stochastic gradient descent back-propagation mechanism to minimize the loss function, with the mean square error calculated from the difference between the values of the target output ($y^{(i)}$) and predicted output ($g_\theta(x^{(i)})$), where $g_\theta$ is the hypothesis function that yields an estimated output.
In the testing phase, after the parameters are automatically learned in the training phase, the new dataset sample ($C \notin \{A, B\}$) is tested based on the final constructed network model. Each input record ($\hat{x}^{(i)}$) is passed to the input layer with the initialization weights and bias ($\theta_f = \{W_f, b_f\}$) adopted and then the input data is processed through the hidden layers. Finally, the output layer predicts the class of the input data as either normal or attack based on the estimated value of the loss function of each class.
4. Suggested framework for applying proposed ADS in IICSs
This study proposes an efficient anomaly detection model for protecting IICS environments against malicious activities. As illustrated in Fig. 3, its architecture consists of training and testing steps.
[Fig. 3. Block labels: Labeled Data; Unlabeled Normal Data; Features Transformation; Features Normalization; Learning Algorithms (Unsupervised Deep Learning, Supervised Deep Learning); Predicted Data; Features Conversion; Features Normalization; Classifier; Label (Normal || Attack).]
Data pre-processing, which includes feature transformation and normalization and is the first stage in the proposed ADS-based deep-learning mechanism, inspects and selects important information from the large-scale data in an IIoT environment.
• Feature transformation: as the proposed model accepts only numerical features, each symbolic feature value is converted into a numerical one; for example, the NSL-KDD dataset has many symbolic attributes such as protocol types with nominal values like ICMP, TCP, and UDP which are mapped into 1, 2 and 3, respectively.
• Feature normalization: since deep learning depends on weights, the different feature scales can bias data into particular layers which may cause certain weights to update faster than others [32]. Consequently, it is necessary to handle this issue using a statistical normalization whereby the Z-score function for each feature value ($v^{(i)}$) is performed using
$$Z^{(i)} = \frac{v^{(i)} - \mu}{\sigma} \tag{6}$$
where $\mu$ is the mean of the $n$ values for a given feature ($v^{(i)}$, $i \in 1, 2, 3, \ldots, n$) and $\sigma$ the standard deviation.
As network data contains a high-dimensional space, it is essential to reduce its dimensionality to save computational resources and design a lightweight and scalable ADS technique [36]. Consequently, the proposed DAE-DFFNN model is utilized to reduce the high dimensions into low ones using a central decreased layer. In more detail, there is a non-linear function in the model that encodes a large number of features into the lower feature set in the decreased hidden layer, so feature reduction is applied without the need for human knowledge. The aim of the DAE-DFFNN feature reduction is to eliminate the ambiguous structure in the input distribution and to find well-designed representations in terms of higher-level learned, as well as filtered and reduced, features.
Since a suspicious activity is determined by any change in the normal state of the network, analyzing normal behaviors is significant for facilitating the detection process. Therefore, an unsupervised learning process is applied to the normal data observations to estimate the initialization parameters of the weights and biases as a given input of the standard DFFNN for decreasing the processing time of building this model. The parameters are also re-tuned in the supervised deep-learning method using the labeled data (i.e., normal and malicious), with the final training model evaluated based on new data samples obtained during the testing phase, as explained in the aforementioned sections.
The placement of an IDS in the IIoT environment is critical for ensuring that the environment is secured against any malicious activities. As industrial systems transmit data for end-users and/or cloud storage through the IoT gateway which includes an internet–protocol connectivity (e.g., IP, TCP, UDP or HTTP), this gateway is a crucial location in which to deploy the proposed anomaly-IDS based on the deep-learning algorithm in an IIoT environment. Like other IDSs, the proposed system uses a set of cooperative and major components: a sniffing and monitoring unit; databases; and analyzer and response units. A general view of the structure of this IDS is presented in Fig. 4.
The main components of the proposed ADS are described below.
• Sniffing and monitoring unit: a sniffer, which is implemented in the gateway to monitor and collect the traffic exchanged between it and the external network via the internet, can be embedded in either the software or hardware to obtain the sent and received packets which it stores in files that are then passed to the raw traffic database. Three databases, raw traffic, behavior, and log, implemented in our IDS are installed and retained in the cloud storage at the network’s edge (i.e., fog computing storage). The first stores the raw network traffic collected by the sniffing unit, the second contains a list of previous datasets which is considered a profile history of the network while the third stores the new signatures of detected attacks and is used to continuously feed the behavior database.
• Analyzer unit: this is an important component in an IDS which consists of data processing and detection models.
➢ Data processing model: since an enormous amount of data can be extracted from packets and processing it is extremely challenging in terms of the processing power, resources and time required, the raw data should be passed to this model to convert it to useful information. In it, the collected network traffic is analyzed and gathered according to the size of the time window or flow, such as the source and destination IPs, source and destination ports, and protocol type. Moreover, the basic features are extracted for the data flow and the data converted to a uniform format, a process handled directly by the proposed ADS based on the deep learning algorithm in order to reduce the data’s dimensionality. Therefore, this processing model assists in the decision-
[Fig. 4. Block labels: Cloud Storage; Alert System; Response Unit; Intrusion Detection System.]
making process and prevents any bias and confusion for the Root (U2R), Remote to Local (R2L) and Normal [37,38]. However,
anomaly detector. although having been used widely in IDSs, it is outdated [40].
➢ Detection Model—the result obtained from the processing Therefore, to effectively evaluate our proposed work, a new
model is sent to the detection model for a decision to be dataset called UNSW-NB15 is used. It reflects real modern normal
made for each data group, that is, the flow or window size. behaviors and contains contemporary synthesized attack activities
This model is used to detect known and unknown attacks [39,44,45]. It has 257,673 records (i.e., 93,0 0 0 normal and 164,673
by learning from the behavior database whereby, if any of attacks), each with 41 features and a class label. There are ten dif-
the input data does not match normal network behavior, it ferent class labels, one normal and nine attacks, namely, Fuzzers,
is classified as an attack. Details of the detection process are Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shell-
discussed in Section 3. code, and Worms.
• Response unit—this contains the alert system model and log
database. The ADS alerts the system administrator to take the 5.2. Evaluation metrics
appropriate action when any abnormal activity is detected in
the network, with the new signature for a specific attack type The proposed ADS is evaluated on the two datasets in terms
stored in the log database which, in turn, is used to feed the of the accuracy, detection rates, and FPRs extracted from the con-
behavior database. fusion matrix terms; True Positive (TP) and False Negative (FN),
which indicate the numbers of attack observations correctly iden-
tified as anomalous and incorrectly identified as normal, respec-
5. Descriptions of datasets and evaluation metrics tively, and True Negative (TN) and False Positive (FP) which show
the numbers of normal observations correctly identified as normal
5.1. Datasets and incorrectly identified as attack, respectively. The main evalua-
tion metrics are calculated as follows.
Since a dataset plays a vital role in testing, analyzing and eval-
uating the behavior of a detection system, a good-quality one not • Accuracy identifies the total number of observations correctly
only produces efficient results for an offline system but is also po- identified with respect to the total number of observations and
tentially effective when deployed in a real environment. Most re- is calculated by
searchers have used the popular NSL-KDD dataset, an amended TP + TN
version of the KDD CUP 99 one which solved the main problems Accuarcy = , (7)
TP + TN + FP + FN
of KDD CUP 99 by eliminating its redundant records and select-
ing numbers of records from it in proportion to their percentages. • The Detection Rate (DR) is the ratio of attack to normal obser-
After pre-processing, it consists of 148,517 records (i.e., 77,054 nor- vations correctly classified and defined as
mal and 71,460 attacks), each of which contains 41 features and TP
a class label. There are five classes, namely, Probing, DoS, User to Det ection Rat e = , (8)
TP + FN
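The metrics above can be computed directly from the four confusion-matrix counts. The following is a minimal sketch rather than the authors' implementation: the names `ConfusionCounts`, `count_outcomes`, `accuracy`, `detection_rate` and `false_positive_rate` are hypothetical, labels are assumed to be encoded as 1 for attack and 0 for normal, and the FPR is taken to be the standard FP / (FP + TN), since its formula is not reproduced in this passage.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # attack observations flagged as attacks
    fn: int  # attack observations labelled normal
    tn: int  # normal observations labelled normal
    fp: int  # normal observations flagged as attacks


def count_outcomes(y_true: Sequence[int], y_pred: Sequence[int]) -> ConfusionCounts:
    """Tally confusion-matrix counts; 1 = attack, 0 = normal (assumed encoding)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    return ConfusionCounts(
        tp=sum(1 for t, p in pairs if t == 1 and p == 1),
        fn=sum(1 for t, p in pairs if t == 1 and p == 0),
        tn=sum(1 for t, p in pairs if t == 0 and p == 0),
        fp=sum(1 for t, p in pairs if t == 0 and p == 1),
    )


def accuracy(c: ConfusionCounts) -> float:
    """Eq. (7): correctly identified observations over all observations."""
    return (c.tp + c.tn) / (c.tp + c.tn + c.fp + c.fn)


def detection_rate(c: ConfusionCounts) -> float:
    """Eq. (8): share of attack observations identified as attacks."""
    return c.tp / (c.tp + c.fn)


def false_positive_rate(c: ConfusionCounts) -> float:
    """Standard FPR (assumed here): share of normal observations flagged as attacks."""
    return c.fp / (c.fp + c.tn)


if __name__ == "__main__":
    # Toy labels for illustration only; these are not results from the paper.
    truth = [1, 1, 1, 0, 0, 0, 0, 1]
    preds = [1, 0, 1, 0, 1, 0, 0, 1]
    counts = count_outcomes(truth, preds)
    print(accuracy(counts), detection_rate(counts), false_positive_rate(counts))
```

The sketch does not guard against empty classes: a test set with no attack observations (or no normal ones) would make the DR (or FPR) denominator zero, so a real evaluation would need to handle that case explicitly.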
8 M. AL-Hawawreh et al. / Journal of Information Security and Applications 41 (2018) 1–11
Fig. 5. (a) NSL-KDD dataset ROC curve; (b) UNSW-NB15 dataset ROC curve.
Fig. 6. (a) Detection rates for NSL-KDD dataset classes; (b) Detection rates for UNSW-NB15 dataset classes.
Table 2
Time cost (Elapsed and CPU time in seconds) for both datasets.
performance continually improves when the structure of the deep learning model is trained with more data.
The proposed model differs from previous IDSs based on deep learning in that it uses a simple mathematical algorithm (DAE) for unsupervised learning, which estimates the parameters in a suitable range; these parameters are then the input of the supervised DFFNN so that it can be built effectively and efficiently. In addition, the model, through the decreased hidden layer, learns and explores the high-level features, reduces the dimensionality of data automatically, and represents the crucial features well. Therefore, these traits make our proposed model appropriate for deployment in a real industrial environment containing a massive amount of unlabeled and unstructured data.

6.4. Pros and cons of proposed anomaly-IDS based on deep learning

The proposed ADS has several advantages. Firstly, it can readily distinguish normal from attack behaviors in an IIoT environment: because the model is trained on normal behavior in the first training phase, using the unsupervised deep learning algorithm, it obtains good representations of normal traffic. Secondly, it conducts an additional training process based on normal and attack behaviors to strengthen the system and ensure that it is capable of detecting sophisticated attacks. Thirdly, it provides an automated process for feature engineering which minimizes the time and effort required, and makes it more effective when deployed in a real environment. Finally, it depends only on parameters tuned in the training phase and tolerates noisy and outlier data.
Although a disadvantage of this model is that choosing its parameters for the training phase is not a trivial process, this is not considered a major problem given the current availability of sophisticated and fast hardware. Also, while it cannot handle non-numeric features or features in their original value ranges, we overcome this issue by using feature transformation and normalization steps in the pre-processing stage.

7. Conclusion

In this paper, an ADS model for detecting intrusive activities in IIoT environments using the data collected from TCP/IP traffic is proposed. It uses deep-learning methods for unsupervised learning with automatic dimensionality reduction and a good representation of normal network patterns. It obtains well-initialized rather than random parameters for the supervised training of the DFFNN. Then, to better tune these parameters, the supervised DFFNN is used. The proposed DAE-DFFNN model can successfully build and extract important features which enhance its overall performance. The final constructed model is tested on different data samples from the NSL-KDD and UNSW-NB15 datasets, with the results revealing that it achieves a higher detection rate and fewer false alarms than several techniques developed in recent studies. In future, we will extend this work to train this algorithm on real data collected from IIoT systems to demonstrate the efficiency of its implementation.
Also, we present a concept for deploying this proposed ADS model based on the deep-learning algorithm in a real-world IIoT environment by: firstly, collecting the TCP/IP traffic using the sniffer implemented in the gateway; and, secondly, pre-processing and analyzing the collected data in order to reveal any intrusive activity. Further modifications and ideas for deploying and validating this model in a real environment, and extending this work to handle different protocols, will be considered in future.

Supplementary materials

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.jisa.2018.05.002.

References

[1] Sherasiya T, Upadhyay H, Patel H. A survey: intrusion detection system for internet of things. J Comput Sci Eng 2016;5:91–8.
[2] Drath R, Horch A. Industry 4.0: hit or hype? [Industry forum]. IEEE Ind Electron Mag 2014:56–8.
[3] Shahzad A, Kim G, Elgamoudi A. Secure IoT platform for industrial control systems. In: Platform technology and service (PlatCon), international conference on. IEEE; 2017. p. 1–6.
[4] Katsikeas S, Fysarakis K, Miaoudakis A, Van Bemten A, Askoxylakis I, Papaefstathio I, Plemenos A. Lightweight and secure industrial IoT communications via the MQ telemetry transport protocol. Symposium on computers and communications conference. IEEE; 2017.
[5] Stouffer K, Falco J, Scarfone K. Guide to industrial control systems (ICS) security. NIST special publication; 2011. p. 6–16.
[6] Atlantic Council. http://publications.atlanticcouncil.org/cyberrisks//.
[7] Sitnikova E, Foo E, Vaughn B. The power of hands-on exercises in SCADA cyber security education. In: IFIP world conference on information security education. Springer; 2009. p. 83–94.
[8] Abraham A, Grosan C, Martin-vide C. Evolutionary design of intrusion detection. Int J Netw Secur 2007:328–39.
[9] Modi C, Patel D, Borisaniya B, Patel H, Patel A, Rajarajan M. A survey of intrusion detection techniques in cloud. J Netw Comput Appl 2013:42–57.
[10] Tzokatziou G, Maglaras A, Janicke H, He Y. Exploiting SCADA vulnerabilities using a human interface device. Int J Adv Comput Sci Appl 2015:234–41.
[11] Kushner D. The real story of stuxnet. In: IEEE spectrum 50; 2013. p. 48–53.
[12] Tsai F, Lin Y. A triangle area based nearest neighbors approach to intrusion detection. Pattern Recognit 2010;43:222–9.
[13] Ambusaidi A, He X, Nanda P, Tan Z. Building an intrusion detection system using a filter-based feature selection algorithm. IEEE Trans Comput 2016;65:2986–98.
[14] Moustafa N, Slay J. A hybrid feature selection for network intrusion detection systems: central points. arXiv preprint arXiv:1707.05505; 2017.
[15] Kim J, Bentley P, Aickelin U, Greensmith J, Tedesco G, Twycross J. Immune system approaches to intrusion detection – a review. Nat Comput 2007:413–66.
[16] Hardy W, Chen L, Hou S, Ye Y, Li X. DL4MD: a deep learning framework for intelligent malware detection. In: Proceedings of the international conference on data mining (DMIN). The steering committee of the world congress in computer science, computer engineering and applied computing (WorldComp); 2016. p. 61–7.
[17] Huang W, Song G, Hong H, Xie K. Deep architecture for traffic flow prediction: deep belief networks with multitask learning. In: IEEE transactions on intelligent transportation systems; 2014. p. 2191–201.
[18] Nadiammai G, Hemalatha M. Effective approach toward intrusion detection system using data mining techniques. Egypt Inf J 2013:37–50.
[19] Moustafa N, Slay J. The evaluation of network anomaly detection systems: statistical analysis of the UNSW-NB15 data set and the comparison with the KDD99 dataset. Inf Secur J 2016:18–31.
[20] Shang W, Zeng P, Wan M, Li L, An P. Intrusion detection algorithm based on OCSVM in industrial control system. Secur Commun Netw 2016:1040–9.
[21] Maglaras A, Jiang J. Intrusion detection in SCADA systems using machine learning techniques. In: Science and information conference (SAI). IEEE; 2014. p. 626–31.
[22] Silva P, Schukat M. On the use of K-NN in intrusion detection for industrial control systems. In: 13th international conference on information technology and telecommunication; 2014. p. 103–6.
[23] Stewart B, Rosa L, Maglaras A, Cruz T, Ferrag M, Simoes P, Janicke H. A novel intrusion detection mechanism for SCADA systems that automatically adapts to changes in network topology. Ind Netw Intell Syst 2017:1–12.
[24] Shang W, Cui J, Wan M, An P, Zeng P. Modbus communication behavior modeling and SVM intrusion detection method. In: Proceedings of the 6th international conference on communication and network security. ACM; 2016. p. 80–5.
[25] Maglaras A, Jiang J. Ocsvm model combined with k-means recursive clustering for intrusion detection in scada systems. In: Heterogeneous networking for quality, reliability, security and robustness (QShine), 10th international conference on. IEEE; 2014. p. 133–4.
[26] Linda O, Vollmer T, Manic M. Neural network based intrusion detection system for critical infrastructures. In: Neural networks, international joint conference on. IEEE; 2009. p. 1827–34.
[27] Hodo E, Bellekens X, Hamilton A, Dubouilh L, Iorkyase E, Tachtatzis C, Atkinson R. Threat analysis of iot networks using artificial neural network intrusion detection system. In: Networks, computers and communications (ISNCC), international symposium on. IEEE; 2016. p. 1–6.
[28] Chen R, Liu M, Chen C. An artificial immune-based distributed intrusion detection model for the Internet of Things. In: Advanced materials research; 2012. p. 165–8.
[29] Van Dijk C, Williams P. The history of artificial intelligence. Expert systems in auditing. Palgrave Macmillan UK; 1990. 21-16.
[30] Tang A, Mhamdi L, McLernon D, Zaidi R, Ghogho M. Deep learning approach for network intrusion detection in software defined networking. In: Wireless networks and mobile communications (WINCOM), international conference on. IEEE; 2016. p. 258–63.
[31] Svozil D, Kvasnička V, Pospichal J. Introduction to multi-layer feed-forward neural networks. Elsevier; 1997. p. 43–62.
[32] Recht B, Re C, Wright S, Niu F. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In: Advances in neural information processing systems; 2011. p. 693–701.
[33] Erhan D, Bengio Y, Courville A, Manzagol A, Vincent P, Bengio S. Why does unsupervised pre-training help deep learning? J Mach Learn Res 2010:625–60.
[34] Tao X, Kong D, Wei Y, Wang Y. A big network traffic data fusion approach based on fisher and deep auto-encoder. Information 2016:20–30.
[35] Lv Y, Duan Y, Kang W, Li Z, Wang Y. Traffic flow prediction with big data: a deep learning approach. IEEE Trans Intell Transp Syst 2015:865–73.
[36] Yousefi-Azar M, Varadharajan V, Hamey L, Tupakula U. Autoencoder-based feature learning for cyber security applications. In: Neural networks (IJCNN), 2017 international joint conference on. IEEE; 2017. p. 3854–61.
[37] Rathore S, Saxena A, Manoria M. Intrusion detection system on KDDCup99 dataset: a survey. Int J Comput Sci Inf Tech 2015.
[38] Tavallaee M, Bagheri E, Lu W, Ghorbani A. A detailed analysis of the KDD CUP 99 data set. In: Computational intelligence for security and defense applications. CISDA 2009. IEEE symposium on. IEEE; 2009. p. 1–6.
[39] Moustafa N, Slay J. UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In: Military communications and information systems conference (MilCIS). IEEE; 2015. p. 1–6.
[40] Moustafa N, Creech G, Slay J. Big data analytics for intrusion detection system: statistical decision-making using finite Dirichlet mixture models. In: Data Analytics and Decision Support for Cybersecurity. Springer; 2017. p. 127–56.
[41] Tan Z, Jamdagni A, He X, Nanda P, Liu P, Hu J. Detection of denial-of-service attacks based on computer vision techniques. IEEE Trans Comput 2015:2519–33.
[42] Li M, Zhang T, Chen Y, Smola J. Efficient mini-batch training for stochastic optimization. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM; 2014. p. 661–70.
[43] Niyaz Q, Sun W, Javaid A, Alam M. A deep learning approach for network intrusion detection system. In: Proceedings of the 9th EAI international conference on bio-inspired information and communications technologies (formerly BIONETICS). ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering); 2016. p. 21–6.
[44] Moustafa N, Slay J. The evaluation of network anomaly detection systems: statistical analysis of the UNSW-NB15 data set and the comparison with the KDD99 data set. Inf Secur J 2016;25:18–31.
[45] Moustafa N, Slay J. The significant features of the UNSW-NB15 and the KDD99 data sets for network intrusion detection systems. Building analysis datasets and gathering experience returns for security (BADGERS), 2015 4th international workshop on. IEEE; 2015.
[46] Alom MdZ, Bontupalli V, Taha. Intrusion detection using deep belief networks. In: Aerospace and electronics conference (NAECON), 2015 national. IEEE; 2015. p. 339–44.
[47] Gao N, Gao L, Gao Q, Wang H. An intrusion detection model based on deep belief networks. In: Advanced cloud and big data (CBD), 2014 second international conference on. IEEE; 2014. p. 247–52.
[48] Abolhasanzadeh B. Nonlinear dimensionality reduction for intrusion detection using auto-encoder bottleneck features. In: Information and knowledge technology (IKT), 2015 7th conference on. IEEE; 2015. p. 1–5.
[49] Li Y, Rong M, Runhai J. A hybrid malicious code detection method based on deep learning. Int J Secur Appl Methods 2015;9(5).
[50] Chuan-long Y, Yue-fei Z, Jin-long F, Xin-zheng H. A deep learning approach for intrusion detection using recurrent neural networks. IEEE Access 2017:1–7.
[51] Tang T, Mhamdi L, McLernon D, Zaidi S, Ghogho M. Deep learning approach for network intrusion detection in software defined networking. In: Wireless networks and mobile communications (WINCOM), 2016 international conference on. IEEE; 2016. p. 258–63.
[52] Seok S, Howon K. Visualized malware classification based-on convolutional neural network. J Korea Inst Inf Secur Cryptol 2016;26(1):197–208.
[53] Ludwig S. Intrusion detection of multiple attack classes using a deep neural net ensemble. 2017 IEEE symposium series on computational intelligence; 2017.
[54] Hodo E, Bellekens X, Hamilton A, Tachtatzis C, Atkinson R. Shallow and deep networks intrusion detection system: a taxonomy and survey. arXiv preprint arXiv:1701.02145; 2017.
[55] Lipton Z, Berkowitz J, Elkan C. A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019; 2015.
[56] Marsden T, Moustafa N, Sitnikova E, Creech G. Probability risk identification based intrusion detection system for SCADA systems. arXiv preprint arXiv:1711.02826; 2017.