Machine Learning Based Network Intrusion Detection
Abstract—Network security has become a very important issue and attracted a lot of study and practice. To detect or prevent network attacks, a network intrusion detection (NID) system may be equipped with machine learning algorithms to achieve better accuracy and faster detection speed. One of the major advantages of applying machine learning to network intrusion detection is that we do not need as much expert knowledge as the black or white list model requires. In this paper, we apply the equality constrained-optimization-based extreme learning machine to network intrusion detection. An adaptively incremental learning strategy is proposed to derive the optimal number of hidden neurons. The optimization criteria and a way of adaptively increasing hidden neurons with binary search are developed. The proposed approach is applied to network intrusion detection to examine its capability. Experimental results show our proposed approach is effective in building models with good attack detection rates and fast learning speed.

Keywords—network intrusion detection; machine learning; extreme learning machine; incremental learning.

I. INTRODUCTION

In response to the rising popularity and fast development of the Internet, more and more networks have been established for businesses, social media and governments. Because of the diverse and complicated expertise required for cyber-security, technical staff may not be able to manage their networks properly. As a result, the networks are prone to be attacked, and the concept of network intrusion detection (NID) has been proposed. NID systems are mainly used to protect networks from being attacked and to filter or detect malicious behavior launched by attackers, e.g., denial of service (DoS) attacks [1].

One of the most common detection approaches is signature-based detection using data mining algorithms, e.g., white list or black list models. Meng et al. proposed a way to improve signature-based detection by reducing the detection load, speeding up the signature matching process, and reducing the false alarm rate [2]. Jamdagni et al. proposed a detection system based on packet payloads [3]. Text features are extracted from payloads by the n-gram method and used to build normal behavior models. Both of the above apply analysis and statistical methods to generate signatures for their systems. However, signature-based models are created from the comprehensive and detailed data collected through protected networks. When unknown data are sent to such a signature-based detection system, they are more likely to be assigned to incorrect categories. As a result, if the system lacks a proper discriminating mechanism, it can cause many false positives or false negatives. For machine learning-based systems, their models can learn from training data and adjust the output classification results. The learning is based not only on one single type of data, as in white or black list models, but on all kinds of data. Through computation and comparison, incoming data to a machine learning-based system are classified into the most probable category. Many machine learning techniques, e.g., genetic and fuzzy algorithms [4], clustering and K-nearest neighbor (k-NN) algorithms [5], etc., have been used to build NID systems.

For NID systems, the sampling rate of network data is very high, and an appropriate detection model must be trained with a large amount of data, which may defeat an iterative training model such as BP neural networks. The fast learning speed of extreme learning machines (ELMs) makes them suitable for use in NID systems. Several modified versions of ELM have been developed to improve its capability. For classification problems, Huang et al. have proposed the inequality constrained-optimization-based ELM [6]. Huang et al. also proposed another ELM, called the equality constrained-optimization-based ELM (C-ELM) [7], which integrates the learning rule of least squares SVM (LS-SVM) [8], a well-known variant of SVM [9]. C-ELM adopts an objective function similar to that of LS-SVM to obtain a global solution for the weights of the output layer.

The number of hidden neurons in ELMs needs to be determined in advance by the user. If an ELM does not achieve the expected performance, different numbers of hidden neurons have to be tested until the user is satisfied [10]. However, every time a different number of hidden neurons is tested, the output weights need to be re-computed from scratch. In this paper, we propose an efficient C-ELM construction approach by which hidden neurons are added in an adaptively incremental way and the output weights are derived without re-computation from scratch. The optimization criteria and a way of adaptively increasing hidden neurons with binary search are developed.
The output weights are updated in an adaptively incremental way for C-ELM. Besides, re-computing the output weights from scratch is avoided and time can be saved.

A. Incremental Learning

Assume initially we create a C-ELM with n0 hidden neurons for an application. The values of the input weights and biases are randomly generated. We compute the H matrix of this n0-hidden-neurons C-ELM by Eq.(6) and denote it as H0, which is of size N × n0. The output weights Ω0 associated with this n0-hidden-neurons C-ELM, according to Eq.(5), are

\Omega_0 = \left( \frac{I_{n_0 \times n_0}}{C} + H_0^T H_0 \right)^{-1} H_0^T Y. \quad (7)

Let

H_0^\dagger = \left( \frac{I_{n_0 \times n_0}}{C} + H_0^T H_0 \right)^{-1} H_0^T. \quad (8)

We have

\Omega_0 = H_0^\dagger Y. \quad (9)
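For concreteness, Eqs.(7)-(9) can be sketched in Python/NumPy as follows. Since Eq.(6) is not reproduced in this excerpt, the sketch assumes the common ELM choice of a sigmoid hidden layer with random input weights and biases; the names elm_hidden_output and output_weights are ours, not part of the original formulation.

    import numpy as np

    rng = np.random.default_rng(0)

    def elm_hidden_output(X, W, b):
        # Hidden-layer output matrix H: one row per sample, one column
        # per hidden neuron (assumed sigmoid form of Eq.(6), which this
        # excerpt does not reproduce).
        return 1.0 / (1.0 + np.exp(-(X @ W + b)))

    def output_weights(H, Y, C):
        # Eqs.(7)-(9): Omega = H^dagger Y,
        # with H^dagger = (I/C + H^T H)^{-1} H^T.
        n = H.shape[1]
        H_dagger = np.linalg.solve(np.eye(n) / C + H.T @ H, H.T)
        return H_dagger @ Y, H_dagger

    # Toy shapes: N samples, d features, n0 hidden neurons, m outputs.
    N, d, n0, m = 200, 41, 5, 5
    X = rng.normal(size=(N, d))
    Y = rng.normal(size=(N, m))
    W0 = rng.normal(size=(d, n0))
    b0 = rng.normal(size=n0)
    H0 = elm_hidden_output(X, W0, b0)
    Omega0, H0_dagger = output_weights(H0, Y, C=1.0)   # Eq.(9)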
Now suppose we add Δn0 more hidden neurons to the C-ELM. The resulting machine then has n1 neurons in the hidden layer, where

n_1 = n_0 + \Delta n_0. \quad (10)

Let the H matrix of this n1-hidden-neurons C-ELM be H1, which is of size N × n1 and can be written as

H_1 = [\, H_0 \;\; \Delta H_0 \,] \quad (11)

where ΔH0, of size N × Δn0, is computed from the randomly generated values of the input weights and biases associated with the added Δn0 hidden neurons. From Eq.(5), the output weights of this n1-hidden-neurons C-ELM are

\Omega_1 = H_1^\dagger Y \quad (12)

where

H_1^\dagger = \left( \frac{I_{n_1 \times n_1}}{C} + H_1^T H_1 \right)^{-1} H_1^T = \left( \frac{I_{n_1 \times n_1}}{C} + [\, H_0 \;\; \Delta H_0 \,]^T [\, H_0 \;\; \Delta H_0 \,] \right)^{-1} \begin{bmatrix} H_0^T \\ \Delta H_0^T \end{bmatrix}. \quad (13)

Note that

I_{n_1 \times n_1} = \begin{bmatrix} I_{n_0 \times n_0} & 0 \\ 0 & I_{\Delta n_0 \times \Delta n_0} \end{bmatrix}. \quad (14)

Note also that Eq.(12) requires the computation of the inverse of an n1 × n1 matrix. To reduce the computational complexity, we can rewrite Eq.(13) as

H_1^\dagger = \begin{bmatrix} \frac{I_{n_0 \times n_0}}{C} + H_0^T H_0 & H_0^T \Delta H_0 \\ \Delta H_0^T H_0 & \frac{I_{\Delta n_0 \times \Delta n_0}}{C} + \Delta H_0^T \Delta H_0 \end{bmatrix}^{-1} \begin{bmatrix} H_0^T \\ \Delta H_0^T \end{bmatrix} = \begin{bmatrix} U_1 \\ D_1 \end{bmatrix}. \quad (15)

Based on the inversion of block matrices, we have

D_1 = \left[ \frac{I_{\Delta n_0 \times \Delta n_0}}{C} + \Delta H_0^T (I_{N \times N} - H_0 H_0^\dagger) \Delta H_0 \right]^{-1} \Delta H_0^T (I_{N \times N} - H_0 H_0^\dagger). \quad (16)

Similarly, we have

U_1 = H_0^\dagger (I_{N \times N} - \Delta H_0 D_1). \quad (17)

Note that only a Δn0 × Δn0 matrix is involved in computing U1 and D1. Therefore,

\Omega_1 = \begin{bmatrix} U_1 \\ D_1 \end{bmatrix} Y \quad (18)

which can be computed much more efficiently.
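Continuing the sketch above, the update of Eqs.(15)-(18) can be written so that only a Δn0 × Δn0 linear system is solved when neurons are added; the function name incremental_update is ours.

    def incremental_update(H_old, H_old_dagger, dH, Y, C):
        # Eqs.(15)-(18): output weights after adding hidden neurons
        # whose responses form the columns of dH; only a (dn x dn)
        # system is solved instead of re-inverting an (n1 x n1) matrix.
        dn = dH.shape[1]
        S = dH.T - (dH.T @ H_old) @ H_old_dagger          # dH^T (I - H H^dagger)
        D1 = np.linalg.solve(np.eye(dn) / C + S @ dH, S)  # Eq.(16)
        U1 = H_old_dagger - (H_old_dagger @ dH) @ D1      # Eq.(17)
        H_new_dagger = np.vstack([U1, D1])                # Eq.(15)
        return H_new_dagger @ Y, H_new_dagger             # Eq.(18)

    # Add dn0 new hidden neurons with fresh random weights and update.
    dn0 = 3
    dH0 = elm_hidden_output(X, rng.normal(size=(d, dn0)), rng.normal(size=dn0))
    Omega1, H1_dagger = incremental_update(H0, H0_dagger, dH0, Y, C=1.0)
    H1 = np.hstack([H0, dH0])   # Eq.(11)

Re-computing Ω1 by Eq.(7) on H1 = [H0 ΔH0] would require inverting an n1 × n1 matrix; the sketch above solves only a Δn0 × Δn0 system, which is the source of the speed-up claimed for Eq.(18).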
B. Constructing C-ELM

Now we proceed with constructing a C-ELM with an optimal number of hidden neurons for any given application. Initially, we create a C-ELM having n0 hidden neurons, a good choice of n0 being equal to the number of output nodes, m. H0† is computed, from which the output weights Ω0 are computed by Eq.(9). If the C-ELM matches our pre-specified goal, e.g., a training error bound, we are done. Otherwise, a number Δn0 of hidden neurons are added and we set n1 = n0 + Δn0. The output weights of the n1-hidden-neurons C-ELM are computed by Eq.(18). Then we check if the resulting n1-hidden-neurons C-ELM matches our goal. If so, we are done. Otherwise, another Δn1 hidden neurons are added and we set n2 = n1 + Δn1. The output weights of the n2-hidden-neurons C-ELM are computed by Eq.(18). Then we check if the resulting n2-hidden-neurons C-ELM matches our goal. If so, we are done. Otherwise, another Δn2 hidden neurons are added and the output weights of the n3-hidden-neurons C-ELM are computed by Eq.(18). This process iterates until the C-ELM matches our goal, as summarized below.

procedure Constructing C-ELM incrementally
    Initialize the C-ELM with n0 hidden neurons;
    Compute Ω0 by Eq.(9);
    Set k = 0;
    while the pre-specified goal is not met
        Set k = k + 1;
        Set nk = nk−1 + Δnk−1;
        Compute Ωk by Eq.(18);
    end while;
    Set K = k;
end procedure

The increments Δnk, k = 0, 1, . . ., are computed adaptively. At the time the procedure stops, K additions have been made, and the C-ELM obtained has nK hidden neurons with the output weights ΩK.
In0 ×n0
+ HT0 H0 HT0 ΔH0 HT0 ELM. In this example, there are 5 output nodes. Initially,
= C
+ ΔHT0 ΔH0 ΔHT0
IΔn0 ×Δn0
ΔHT0 H0 C 5 hidden nodes are set for the C-ELM. Therefore, n0 = 5.
U1 We would like the accuracy of the C-ELM to be stable. We
= . (15) calculate the first addition, Δn0 , which is 1. So we have
D1
n1 = 5 + 1 = 6. The goal is not met. We then calculate Δn1, which is also 1. So we have n2 = 6 + 1 = 7. The goal is not met either. We then calculate Δn2, and so on. The whole process is shown in TABLE I. At k = 41, the accuracy becomes stable and the process stops. Therefore, the C-ELM has 429 neurons in the hidden layer.

TABLE I. A CONSTRUCTION EXAMPLE
IV. APPLICATION IN NETWORK INTRUSION DETECTION

In this section, we apply our proposed approach to build C-ELMs as classifiers for network intrusion detection. Experiments are conducted on benchmark NID datasets, and comparisons with other methods are made to demonstrate the effectiveness of our approach.

A commonly used benchmark dataset, the NSL-KDD dataset [12], which is a less biased subset of the KDD-Cup 99 dataset [13], as shown in Table II, is adopted in our experiments. This dataset contains only 148,517 instances, with 41 features and 5 categories. Three criteria, ACC, REC, and FAR, are used for performance evaluation. For an experiment with Np actual positive instances and Nn actual negative instances, the four outcomes of a classifier can be formulated in a confusion matrix as shown in Table III. The evaluation criteria are defined as

\mathrm{ACC}\ \text{(Accuracy)} = \frac{TP + TN}{N_p + N_n}, \qquad \mathrm{FAR}\ \text{(False Alarm Rate)} = \frac{FP}{N_n}, \qquad \mathrm{REC}\ \text{(Recall)} = \frac{TP}{N_p}.

TABLE III. CONFUSION MATRIX

                         Predicted
    Actual         Positive    Negative
    Positive          TP          FN
    Negative          FP          TN

To measure the performance of a classifier on a given dataset, we divide the data into 10 folds and run the classifier 10 times with the dataset. The average of the results of the 10 runs is then reported as the performance of the classifier on the given dataset.
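As a minimal sketch continuing the NumPy snippets above, the three criteria can be computed from actual and predicted binary labels as follows; the function name evaluate is ours, and attacks are assumed to be labeled 1 and normal traffic 0.

    def evaluate(y_true, y_pred):
        # ACC, FAR and REC from the confusion matrix of Table III.
        # Under the 10-fold protocol, these values would be averaged
        # over the 10 runs.
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        acc = (tp + tn) / (tp + fn + fp + tn)   # (TP + TN) / (Np + Nn)
        far = fp / (fp + tn)                    # FP / Nn
        rec = tp / (tp + fn)                    # TP / Np
        return acc, far, rec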
TABLE IV. COMPARISON WITH SINGH'S IN BINARY CLASSIFICATION

Comparisons with the method proposed by Singh et al. [14] are shown in Table IV and Table V. Singh's method incorporates feature selection (FS) and data reduction (DR), and its performance depends on how FS and DR are used. In Table IV, the results are obtained by treating normal instances as negative and all the abnormal instances as positive. Consequently, only two categories are involved. Three different sets of results are shown for Singh's method, all taken directly from [14]. The column labeled 'All features' shows the results with all the original features involved. The column labeled 'FS' shows the results with feature selection. The column labeled 'FS+DR' shows the results with both feature selection and data reduction. We can observe that our approach is not the best in every case in Table IV, but it has its advantages. In particular, our approach has a much lower FAR, 0.79%, than Singh's method.

In Table V, the results are obtained by treating the instances of one category as positive and the instances of the other categories as negative. Consequently, 5 categories are involved. Three different sets of results are also shown for Singh's
method, all taken directly from [14]. ACC is given for the whole testing dataset, and REC and FAR are given for each category. From this table, we can see that our approach performs better than Singh's in almost every evaluation measure. Even for a minor category like the U2R attack, its instances can be well detected by our approach, with REC being 87.26%, much higher than the rates provided by Singh's. Furthermore, the number of hidden neurons derived by our approach is much lower than that derived by Singh's method.

V. CONCLUSION

We have presented an efficient C-ELM construction approach by which hidden neurons are added incrementally and the output weights are derived without re-computation from scratch. The optimization criteria and a way of adaptively increasing hidden neurons with binary search are developed. As a result, time and effort can be saved during the construction of an optimal C-ELM. Our proposed approach is applied to network intrusion detection to examine its capability. Experimental results have shown that the proposed approach is effective in building models with good attack detection rates and fast learning speed. We have also shown that even the attack instances in the minority are not ignored and are detectable.

REFERENCES

[1] Z. Tan, A. Jamdagni, X. He, P. Nanda, and R. P. Liu, "A system for denial-of-service attack detection based on multivariate correlation analysis," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 2, pp. 447-456, 2014.
[2] W. Meng, W. Li, and L.-F. Kwok, "EFM: Enhancing the performance of signature-based network intrusion detection systems using enhanced filter mechanism," Computers & Security, vol. 43, pp. 189-204, 2014.
[3] A. Jamdagni, Z. Tan, X. He, P. Nanda, and R. P. Liu, "RePIDS: A multi tier real-time payload-based intrusion detection system," Computer Networks, vol. 57, pp. 811-824, 2013.
[4] S. Elhag, A. Fernández, A. Bawakid, S. Alshomrani, and F. Herrera, "On the combination of genetic fuzzy systems and pairwise learning for improving detection rates on intrusion detection systems," Expert Systems With Applications, vol. 42, no. 22, pp. 193-202, 2015.
[5] W.-C. Lin, S.-W. Ke, and C.-F. Tsai, "CANN: An intrusion detection system based on combining cluster centers and nearest neighbors," Knowledge-Based Systems, vol. 78, pp. 13-21, 2015.
[6] G.-B. Huang, X. Ding, and H. Zhou, "Optimization method based extreme learning machine for classification," Neurocomputing, vol. 74, no. 1, pp. 155-163, 2010.
[7] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 42, no. 2, pp. 513-529, 2012.
[8] J. A. K. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Processing Letters, vol. 9, no. 3, pp. 293-300, 1999.
[9] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[10] R. Wang, S. Kwong, and D. D. Wang, "An analysis of ELM approximate error based on random weight matrix," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 21, no. supp02, pp. 1-12, 2013.
[11] R.-F. Xu, Z.-Y. Wang, and S.-J. Lee, "Constrained-optimization-based extreme learning machine with incremental learning," Proc. International Conference on Multimedia, Communication and Computing Application (MCCA 2014), Xiamen, China, October 16-17, 2014, pp. 315-318.
[12] NSL-KDD data set, http://www.unb.ca/research/iscx/dataset/iscx-NSL-KDD-dataset.html
[13] KDD-Cup 99 data set, https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[14] R. Singh, H. Kumar, and R. K. Singla, "An intrusion detection system using network traffic profiling and online sequential extreme learning machine," Expert Systems With Applications, vol. 42, pp. 8609-8624, 2015.