An Optimal Big Data Analytics With Concept Drift D
DOI:10.32604/cmc.2021.016626
Article
1 Department of Mathematics, Faculty of Science, New Valley University, El-Kharga, 72511, Egypt
2 Department of Information Systems, College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, Riyadh, 84428, Saudi Arabia
3 Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah Bint Abdulrahman University, Riyadh, 84428, Saudi Arabia
4 Department of Entrepreneurship and Logistics, Plekhanov Russian University of Economics, Moscow, 117997, Russia
5 Department of Logistics, State University of Management, Moscow, 109542, Russia
* Corresponding Author: Romany F. Mansour. Email: [email protected]
Received: 07 January 2021; Accepted: 01 March 2021
Abstract: Big data streams have become ubiquitous in recent years, thanks to the rapid
generation of massive volumes of data by different applications. It is challenging to
apply existing data mining tools and techniques directly to these big data streams. At
the same time, streaming data from several applications raises two major problems: class
imbalance and concept drift. This paper presents a new Multi-Objective Metaheuristic
Optimization-based Big Data Analytics with Concept Drift Detection (MOMBD-CDD) method
for high-dimensional streaming data. The MOMBD-CDD model operates in three stages:
pre-processing, concept drift detection (CDD), and classification. It addresses the class
imbalance problem with the Synthetic Minority Over-sampling Technique (SMOTE), and the
Glowworm Swarm Optimization (GSO) algorithm is employed to determine the oversampling
rate and the number of neighboring points used by SMOTE. The Statistical Test of Equal
Proportions (STEPD) is used as the CDD technique. Finally, a Bidirectional Long
Short-Term Memory (Bi-LSTM) model is applied for classification, and a GSO-based
hyperparameter tuning process computes the optimum Bi-LSTM parameters to improve
classification performance. The model was evaluated on high-dimensional benchmark
streaming datasets, namely the intrusion detection (NSL KDDCup) dataset and the ECUE
spam dataset. Extensive experimental validation confirmed the effectiveness of the
MOMBD-CDD model, which attained accuracies of 97.45% and 94.23% on the KDDCup99 and
ECUE spam datasets, respectively.
This work is licensed under a Creative Commons Attribution 4.0 International License,
which permits unrestricted use, distribution, and reproduction in any medium, provided
the original work is properly cited.
2844 CMC, 2021, vol.68, no.3
1 Introduction
The progressive deployment of Information Technology (IT) in different domains results in
the production of huge volumes of data. The velocity of big data exceeds what the
computational approaches of classic models can handle. Examples include Sensor Networks
(SN), spam filtering, traffic management, and Intrusion Detection Systems (IDS). In
general, a data stream S is unbounded: a sequence of samples arrives continuously and at
high speed. A primary limitation in this setting is concept drift, in which the concept
underlying the data changes dynamically, and this issue must be resolved. Concept drift is
common in real-time practical applications. For example, in recommender systems, the
preferences of a user may change frequently with situation, finances, climatic conditions,
and other such factors. These changes may degrade classification performance, so a
classifier should be able to detect them and react appropriately. Most learning models,
however, are developed for static environments, whereas real-time applications are
dynamic. Here, concept drift refers to a non-stationary environment in which the target
concept changes, so that the training data and the application data no longer agree [1].
Concept drift arises in domains such as monitoring, management and strategic planning,
personal guidance, and so on. The current research work applies techniques to detect and
handle concept drift.
Preprocessing is an important task since storage space is limited and the samples must
be scanned in a single pass. A representative set of samples must also be selected from
the data stream: the main purpose of sampling is to select a part of the data stream and
regard it as the ‘entire system.’ When processing stream data, irregular data is
important since it is used in several applications like weather data forecasting,
anomaly prediction, social media mining, etc. Class imbalance arises when one class is
represented by far fewer or far more instances than the others. Classes with the largest
number of data samples are called majority classes, while the remaining classes are
referred to as minority classes. In stream data classification, the majority class
dominates the samples and the minority class tends to be ignored.
Pre-processing is an effective way to balance the class distribution. When the reservoir
size is not allocated appropriately for stream data sourced from different devices, the
imbalance problem worsens. Resampling is therefore applied extensively to adjust the
sample set, either by eliminating majority-class instances (under-sampling) or by adding
minority-class instances (oversampling). However, the sensitivity of learning accuracy to
class imbalance depends on the distribution of the minority classes and the degree of
overlap among the classes. Concept drift, in turn, denotes changes in the distribution of
the samples and is a major problem in stream data analysis.
This research work presents a new Multi-Objective Metaheuristic Optimization-based Big
Data Analytics with Concept Drift Detection (MOMBD-CDD) method for high-dimensional
streaming data. The presented MOMBD-CDD model handles the class imbalance problem using
the Synthetic Minority Over-sampling Technique (SMOTE). The Glowworm Swarm Optimization
(GSO) algorithm is employed to determine the oversampling rate and the neighboring points
of SMOTE. Further, the Statistical Test of Equal Proportions (STEPD) is used as the CDD
technique. Finally, a Bidirectional Long Short-Term Memory (Bi-LSTM) model is applied for
classification, and a GSO-based hyperparameter tuning process is performed to enhance its
results. The proposed MOMBD-CDD model was evaluated through a comprehensive analysis on
high-dimensional benchmark streaming datasets, namely the intrusion detection (NSL
KDDCup) dataset and the ECUE dataset.
2 Literature Survey
Barros et al. [2] presented the Reactive Drift Detection Method (RDDM), which is based on
DDM. This technique discards the older samples of prolonged concepts, which helps in
detecting drifts and increases accuracy. Li et al. [3] proposed Ensemble Decision Trees
for Concept-drifting data streams (EDTC), which mimics cut-points during tree
construction. The method was used along with three different random Feature Selection
(FS) models. When an instance arrives, a growing node splits on randomly selected
features and the unwanted branches are eliminated. EDTC applies two thresholds and uses
local data distributions to detect drift. Ross et al. [4] proposed EWMA for Concept Drift
Detection (ECDD), a drift detection model based on the Exponentially Weighted Moving
Average (EWMA) chart. The model monitors the classification error stream and requires no
data to be kept in storage.
Widmer et al. [5] developed the Floating Rough Approximation (FLORA) approach to handle
concept drift with a set of descriptors, using a variable-sized sample window to select
the descriptors. Liu et al. [6] proposed a drift detection method for sensor networks
based on Angle Optimized Global Embedding (AOGE) and Principal Component Analysis (PCA).
PCA and AOGE examine the projection difference and the projection angle, which are then
used to detect drift. Bifet et al. [7] introduced the Adaptive Windowing mechanism
ADWIN2, an extended version of the ADWIN model. ADWIN2 keeps a window of variable size
that grows or shrinks when a concept drift is detected. Additionally, supervised models
detect drifts using the elements present in the window. Xu et al. [8] developed the
Dynamic Extreme Learning Machine (DELM), which builds on the Extreme Learning Machine
(ELM) technique for drift detection. The primary objective of this method was to apply a
double hidden layer to train the network and enhance its performance.
Lobo et al. [9] applied an evolving Spiking Neural Network (SNN) model to online learning
over data streams. This method primarily focused on limiting the number of neurons. By
exploiting data reduction methods, the study retained the learning potential of the
compressed set of neurons. Zhang et al. [10] proposed a three-layer drift detection model
for text data streams, comprising a label-space layer, a feature-space layer, and a layer
that maps labels to features. Lobo et al. [11] presented DRED, which relies on
multi-objective optimization for data labeling. The authors demonstrated the significance
of ensembles that are able to deal with changes in a data stream once a drift is detected.
Mirza et al. [12] proposed the Ensemble of Subset Online Sequential Extreme Learning
Machine (ESOS-ELM), a drift detection mechanism that also addresses class imbalance.
Arabmakki et al. [13] developed the Reduced labeled Samples Self Organizing Map (RLS-SOM)
to overcome the issues in imbalanced data streams. When a drift is detected, a Dynamic
Weighted Majority (DWM) ensemble is updated using the newly labeled samples. Lobo et
al. [14] proposed a mechanism to overcome the problems in imbalanced data streams and
recommended identifying the essential samples from older learners.
Sethi et al. [15] introduced Margin Density Drift Detection (MD3) to detect drift in
unlabeled streams. When the margin density deviates, a set of labeled samples is
collected so that the classifier can be retrained. De Andrade Silva et al. [16] proposed
the Fast Evolutionary Algorithm for Clustering data Streams (FEAC-Stream), which is based
on k-means clustering with automatic estimation of the number of clusters k. FEAC-Stream
applies the Page-Hinkley test to detect a reduction in cluster quality, upon which the
evolutionary estimation of k is initialized.
being minimum. In order to resolve these issues, the newly-deployed scheme detects
changes in $(\hat{f}(X_t), y_t)$. Here, $\hat{f}$ denotes the classifier applied for
prediction. A drift in $P(\hat{f}(X_t), y_t)$ implies a drift in $P(X_t, y_t)$ with
probability 1.
Assume $\hat{f}(X_t) = \hat{y}_t$ is a binary classifier for the data stream
$(X_t, y_t)$, and let CP denote the corresponding 2 × 2 confusion probability matrix for
$\hat{f}$. Here, CP[1, 1], CP[0, 0], CP[1, 0], and CP[0, 1] denote the proportions of True
Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN),
respectively, for classifier $\hat{f}$; for example,
CP[1, 1] = $P(y_t = 1, \hat{y}_t = 1)$. The four rates (TP rate, TN rate, positive
predictive value, and negative predictive value) are estimated through the formulae given
below.
$$P_{tpr} = \frac{TP}{TP + FN}, \quad P_{tnr} = \frac{TN}{TN + FP}, \quad P_{ppv} = \frac{TP}{FP + TP}, \quad P_{npv} = \frac{TN}{TN + FN}$$
All four rates in $P = \{P_{tpr}, P_{tnr}, P_{ppv}, P_{npv}\}$ equal 1 when there is no
misclassification. Under a stable concept (i.e., $P(X_t, y_t)$ remains unchanged),
$P_{tpr}$, $P_{tnr}$, $P_{ppv}$, and $P_{npv}$ remain the same. Hence, a significant
change in any rate in $P$ implies a change in the joint distribution of
$(y_t, \hat{y}_t)$. Note that at each time step $t$, for any possible $(y_t, \hat{y}_t)$
pair, only two of the four empirical rates in $P$ change; these two are said to be
“influenced by $(y_t, \hat{y}_t)$.” Additionally, a detected change is not necessarily an
actual concept drift and may be an unwanted alert for the empirical rates in $P$. This
occurs when the model derived from the historical data performs better in the
classification of big streaming data.
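The four rates defined above can be computed directly from a stream of (true label, predicted label) pairs. The following is a minimal sketch, not the authors' implementation; the function name `confusion_rates` and the toy stream are illustrative assumptions.

```python
from collections import Counter


def confusion_rates(pairs):
    """Return (tpr, tnr, ppv, npv) for an iterable of (y_true, y_pred) binary pairs."""
    counts = Counter(pairs)
    tp = counts[(1, 1)]
    tn = counts[(0, 0)]
    fp = counts[(0, 1)]  # true label 0, predicted 1
    fn = counts[(1, 0)]  # true label 1, predicted 0

    def ratio(num, den):
        # A rate is undefined when its denominator is empty.
        return num / den if den else float("nan")

    return (
        ratio(tp, tp + fn),  # P_tpr
        ratio(tn, tn + fp),  # P_tnr
        ratio(tp, fp + tp),  # P_ppv
        ratio(tn, tn + fn),  # P_npv
    )


# Hypothetical toy stream of (y_t, y_hat_t) pairs.
stream = [(1, 1)] * 3 + [(1, 0)] + [(0, 0)] * 4 + [(0, 1)] * 2
tpr, tnr, ppv, npv = confusion_rates(stream)
print(tpr, tnr, ppv, npv)
```

Each incoming pair touches exactly two of the four rates (for instance, a true positive changes only the TP rate and the positive predictive value), which is why the rates can be monitored incrementally.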
are deployed over Hadoop infrastructure. The MapReduce method, the scheme underlying the
MRODC approach, is used to improve the scalability and computational robustness of
classification. The MRODC method comprises the following steps.
• Based on N-grams, the polarity score is computed for each sentence
• Based on the polarity score, data classification is performed
• Based on the classified data, new words and Term Frequency (TF) are evaluated
Before diverse Data Mining (DM) models are applied, the raw data from HDFS undergoes
pre-processing. The iterations are processed in parallel by the Map function, followed by
the Combiner function and the Reduce function. The Map function receives every line
(sentence) as a separate key-value pair, which forms its input. Based on the developed
corpus, the Map function computes a value for each data object, and this value is
determined from different N-grams. The result of the Map function is forwarded to the
Combiner function. The Combiner function collects the whole set of data objects and
groups the data by class. It thus unifies all data with identical class values and saves
the sample values for the Reduce function, to which its output is transmitted. The Reduce
function retrieves the complete data from the different classes, i.e., the outcome of the
Combiner function. After the data from the different class labels is summarized and
evaluated, the final results are stored in HDFS along with the class labels, and the next
iteration proceeds.
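The Map, Combiner, and Reduce roles described above can be illustrated with a small single-process sketch. It is only a hedged illustration under stated assumptions: the polarity lexicon `LEXICON`, the three class names, and the function names are hypothetical, and no Hadoop API is used.

```python
from collections import defaultdict

# Hypothetical polarity lexicon keyed by n-gram (illustrative values only).
LEXICON = {
    ("good",): 1.0,
    ("bad",): -1.0,
    ("not", "good"): -1.5,
    ("very", "good"): 1.5,
}


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def map_fn(line_no, sentence):
    """Map: score one sentence with uni- and bi-grams, emit (class label, sentence)."""
    tokens = sentence.lower().split()
    score = sum(LEXICON.get(g, 0.0) for n in (1, 2) for g in ngrams(tokens, n))
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return label, sentence


def combine_fn(mapped):
    """Combiner: group the sentences that share a class label."""
    groups = defaultdict(list)
    for label, sentence in mapped:
        groups[label].append(sentence)
    return groups


def reduce_fn(label, sentences):
    """Reduce: compute term frequencies for one class."""
    tf = defaultdict(int)
    for sentence in sentences:
        for token in sentence.lower().split():
            tf[token] += 1
    return label, dict(tf)


lines = [
    "The service was very good",
    "This is not good",
    "The food was bad",
    "It arrived today",
]
mapped = [map_fn(i, s) for i, s in enumerate(lines)]
grouped = combine_fn(mapped)
results = dict(reduce_fn(label, sents) for label, sents in grouped.items())
print(results)
```

In a real deployment the three functions would run on separate nodes and exchange key-value pairs through the framework; here they are chained in memory only to show the data flow.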
To determine the sampling rate and the neighboring points of SMOTE, the GSO algorithm is
used. GSO uses agents called glowworms, each carrying a luminescent quantity named
luciferin. At the beginning, the glowworms, which represent the initial solutions, are
randomly distributed in the problem space. Each glowworm then moves toward brighter
glowworms within its sensor range. Finally, the glowworms gather at the brightest
locations, which are taken as the optimal solution for the given problem. GSO comprises
three phases, as described in the literature [20]. Fig. 2 illustrates the flowchart of
the GSO model.
constant set for which the value is 0.6, and $I_i(t)$ and $I_i(t + 1)$ are the luciferin
levels at iterations $t$ and $t + 1$, respectively. $F(x_i(t + 1))$ denotes the objective
function, which is the resultant power of the PV module and is given by:
$$F = P_{pv} = V_{PV} \cdot I \tag{3}$$
where $V_{PV}$ is the overall voltage of the PV cells in series. The voltage of a PV cell
is expressed as a function of the current $I$. Hence, $F$ is a function of solar
irradiation, current, and temperature. $I$ is the variable to be optimized, given by the
position of the glowworm, and $S$ denotes the input parameter.
$d_{i,j}(t) = \lVert x_i(t) - x_j(t) \rVert$ denotes the Euclidean distance between
glowworms $i$ and $j$ at iteration $t$, and $r_d^i$ denotes the variable neighborhood
range associated with glowworm $i$ at time $t$. The movement is selected by applying the
probability in Eq. (4): when $p_{ij_0}(t) = \max_j p_{ij}(t)$, glowworm $i$ is directed
toward the location of glowworm $j_0$. The locations of the glowworms are then updated.
The movement update rule is expressed as follows.
$$x_i(t + 1) = x_i(t) + s \cdot \frac{x_j(t) - x_i(t)}{\lVert x_j(t) - x_i(t) \rVert} \tag{6}$$
where $s$ denotes the step size, and $x_i(t)$ and $x_i(t + 1)$ are the positions of agent
$i$ at iterations $t$ and $t + 1$, respectively.
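The luciferin and movement phases can be sketched in a few lines. This is a simplified illustration rather than the configuration used in this work: it assumes the commonly cited GSO luciferin rule (decay the previous value by a factor 1 - rho and add gamma times the current fitness), a fixed sensing radius instead of the adaptive range $r_d^i$, and a hypothetical toy objective whose maximum is at the origin. The parameter values `rho`, `gamma`, `radius`, and `step` are assumptions.

```python
import math
import random


def gso_step(positions, luciferin, objective, rho=0.4, gamma=0.6, radius=1.5, step=0.05):
    """One simplified GSO iteration: luciferin update followed by movement."""
    # Luciferin phase: decay the old level and add a share of the current fitness.
    luciferin = [(1.0 - rho) * l + gamma * objective(x)
                 for l, x in zip(luciferin, positions)]
    new_positions = []
    for i, xi in enumerate(positions):
        # Neighbours: glowworms inside the sensing radius that glow brighter than i.
        brighter = [j for j, xj in enumerate(positions)
                    if j != i and math.dist(xi, xj) < radius and luciferin[j] > luciferin[i]]
        if not brighter:
            new_positions.append(xi)  # no brighter neighbour: stay in place
            continue
        # p_ij grows with (l_j - l_i), so the maximum-probability neighbour is the brightest.
        j0 = max(brighter, key=lambda j: luciferin[j] - luciferin[i])
        d = math.dist(xi, positions[j0])
        if d == 0.0:
            new_positions.append(xi)
            continue
        # Movement rule: a step of length `step` along the unit vector toward j0.
        new_positions.append(tuple(a + step * (b - a) / d for a, b in zip(xi, positions[j0])))
    return new_positions, luciferin


def toy_objective(x):
    # Hypothetical objective to maximise; its peak is at the origin.
    return -sum(v * v for v in x)


random.seed(0)
swarm = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(20)]
glow = [5.0] * len(swarm)
for _ in range(50):
    swarm, glow = gso_step(swarm, glow, toy_objective)
print(max(swarm, key=toy_objective))
```

When GSO is used to tune SMOTE, a glowworm position would encode the oversampling rate and the number of neighbors, and the objective would be a validation score of the downstream classifier; that wiring is omitted here.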
The resulting statistic of Eq. (8) is used to obtain the p-value from the standard normal
distribution, which is compared with the significance levels applied for drifts and
warnings. If p-value < $\alpha_d$, the null hypothesis ($r_0/n_0 = r_r/n_r$) is rejected
and STEPD signals a concept drift. Likewise, a warning is signaled when p-value <
$\alpha_w$.
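The decision rule above can be sketched as follows. The sketch assumes the usual form of the test of equal proportions with continuity correction, a two-sided p-value, and the thresholds `alpha_d = 0.003` and `alpha_w = 0.05`; these choices, the extra check that recent accuracy has actually dropped, and the function names are assumptions rather than details confirmed by the text. Here `r_old`/`n_old` are the correct predictions and total count in the older window, and `r_recent`/`n_recent` those in the recent window.

```python
import math


def stepd_p_value(r_old, n_old, r_recent, n_recent):
    """Two-sided p-value of the test of equal proportions with continuity correction."""
    p_hat = (r_old + r_recent) / (n_old + n_recent)
    inv = 1.0 / n_old + 1.0 / n_recent
    variance = p_hat * (1.0 - p_hat) * inv
    if variance == 0.0:
        return 1.0  # all predictions correct (or all wrong): no evidence of change
    t = (abs(r_old / n_old - r_recent / n_recent) - 0.5 * inv) / math.sqrt(variance)
    if t <= 0.0:
        return 1.0
    # P(|Z| > t) for a standard normal variable Z.
    return math.erfc(t / math.sqrt(2.0))


def stepd_state(r_old, n_old, r_recent, n_recent, alpha_d=0.003, alpha_w=0.05):
    """Return 'drift', 'warning', or 'stable' for the two accuracy windows."""
    p = stepd_p_value(r_old, n_old, r_recent, n_recent)
    # Assumption: only a drop in recent accuracy is treated as a drift or warning.
    degraded = r_recent / n_recent < r_old / n_old
    if degraded and p < alpha_d:
        return "drift"
    if degraded and p < alpha_w:
        return "warning"
    return "stable"


print(stepd_state(270, 300, 18, 30))
print(stepd_state(270, 300, 22, 30))
print(stepd_state(270, 300, 27, 30))
```

The three calls compare an older window with 90% accuracy against recent windows of 30 samples with 18, 22, and 27 correct predictions, covering a large drop, a moderate drop, and no change.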
where $x^t$ denotes the input sequence, $a_h^t$ is the network input to LSTM unit $h$ at
time $t$, and $b_h^t$ is the activation of unit $h$ at time $t$. $w_{lh}$ is the weight
from input $l$ to hidden unit $h$, $w_{h'h}$ is the weight from hidden unit $h'$ to hidden
unit $h$, and $\theta_h$ is the activation function of hidden unit $h$. The backward pass
of Bi-LSTM is defined by Eqs. (11) and (12).
$$\frac{\partial O}{\partial w_{hk}} = \sum_{t=1}^{T} \frac{\partial O}{\partial a_h^t}\, b_h^t \tag{11}$$
$$\frac{\partial O}{\partial a_h^t} = \theta_h'(a_h^t) \left( \sum_{k=1}^{K} \frac{\partial O}{\partial a_h^t}\, w_{hk} + \sum_{h'=1,\, t>0}^{H} \frac{\partial O}{\partial a_{h'}^{t+1}}\, w_{hh'} \right) \tag{12}$$
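To make the roles of $a_h^t$, $b_h^t$, $w_{lh}$, and $w_{h'h}$ concrete, the forward pass of a bidirectional recurrent layer can be sketched as below. This is a deliberately reduced illustration: plain tanh units stand in for full LSTM cells (no gates or cell state), the weights are hypothetical, and the two directions are combined by concatenation, which is an assumption.

```python
import math


def recurrent_pass(xs, w_in, w_rec):
    """Run a tanh recurrent layer over xs and return the activations b_h^t per step."""
    hidden = len(w_rec)
    b_prev = [0.0] * hidden
    outputs = []
    for x in xs:
        # a_h^t = sum_l x_l^t * w_lh + sum_h' b_h'^(t-1) * w_h'h
        a = [
            sum(x[l] * w_in[l][h] for l in range(len(x)))
            + sum(b_prev[hp] * w_rec[hp][h] for hp in range(hidden))
            for h in range(hidden)
        ]
        b_prev = [math.tanh(v) for v in a]  # b_h^t = theta_h(a_h^t)
        outputs.append(b_prev)
    return outputs


def bidirectional_pass(xs, fwd, bwd):
    """Concatenate a left-to-right pass with a right-to-left pass at every time step."""
    forward = recurrent_pass(xs, *fwd)
    backward = recurrent_pass(xs[::-1], *bwd)[::-1]
    return [f + b for f, b in zip(forward, backward)]


# Hypothetical weights: 2 inputs, 2 hidden units per direction.
w_in = [[1.0, 0.0], [0.0, 1.0]]
w_rec = [[0.5, 0.0], [0.0, 0.5]]
sequence = [[0.5, -1.0], [1.0, 0.0], [0.0, 2.0]]
states = bidirectional_pass(sequence, (w_in, w_rec), (w_in, w_rec))
print(states[0])
```

The second half of each output vector comes from the right-to-left pass, so the representation of the first time step already depends on later inputs, which is the property a bidirectional model adds.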
4 Performance Validation
4.1 Dataset Used
The performance of the presented MOMBD-CDD model was validated using two datasets,
namely KDDCup99 [23] and the ECUE spam dataset [24]. Tab. 1 shows the information
relevant to these datasets. The KDDCup99 dataset includes 125973 instances with two class
labels and 42 attributes. The ECUE spam dataset comprises 9978 instances with two class
labels and 4 attributes.
Tab. 2 shows the results attained after the class imbalance handling process by the
GSO-SMOTE technique. The table reports that the GSO-SMOTE technique resampled the
original KDDCup99 dataset from 125973 instances to 129843 instances. On the ECUE spam
dataset, the GSO-SMOTE model resampled the original 9978 instances into 17025 instances.
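The additional instances reported above are synthetic minority samples. A minimal sketch of the SMOTE interpolation step is given below, assuming an integer oversampling `rate` (synthetic samples per minority point) and `k` nearest minority neighbors; these are the two quantities that GSO is said to tune. The function name and the toy data are illustrative assumptions.

```python
import math
import random


def smote(minority, rate, k, rng):
    """Create `rate` synthetic samples per minority point by interpolating toward
    one of its k nearest minority neighbours (requires at least two points)."""
    synthetic = []
    for i, x in enumerate(minority):
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(rate):
            nb = rng.choice(neighbours)
            u = rng.random()  # interpolation weight in [0, 1)
            synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Hypothetical minority class with four 2-D points.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority_points, rate=3, k=2, rng=random.Random(0))
print(len(new_points))
```

Every synthetic point lies on the segment between a minority sample and one of its neighbors, so the oversampled class stays inside the region already occupied by the minority data.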
4.2 Results
Tab. 3 and Figs. 4–5 present the classification results of the MOMBD-CDD model on the
KDDCup99 and ECUE spam datasets. On the KDDCup99 dataset, the model achieved a
sensitivity, specificity, precision, accuracy, F-score, and kappa value of 97.84%,
95.43%, 97.17%, 97.45%, 96.51%, and 95.29%, respectively. On the ECUE spam dataset, the
model achieved a sensitivity, specificity, precision, accuracy, F-score, and kappa value
of 94.88%, 93.20%, 94.19%, 94.23%, 93.56%, and 92.90%, respectively.
Fig. 6 illustrates the results of the ROC analysis for the MOMBD-CDD model on the
KDDCup99 test dataset. The figure shows that the model achieved an AUC of 0.97398323.
Fig. 7 shows the results of the ROC analysis for the MOMBD-CDD model on the ECUE spam
test dataset. The figure shows that the model achieved an AUC of 0.932389937.
Tab. 4 and Fig. 8 compare the classification results of the MOMBD-CDD model with those
of existing methods on the KDDCup99 dataset [25]. The Gradient Boosting technique
achieved the lowest accuracy of 84.30%. The Naïve Bayesian model achieved a slightly
higher accuracy of 89.60%, and the Random Forest model obtained a moderate accuracy of
90.24%. The OC-SVM and Gaussian Process models demonstrated close accuracy values of
91.80% and 91.10%, respectively, while the DNN-SVM model exhibited a competitive accuracy
of 92%. The MOMBD-CDD model outperformed all of these methods with an accuracy of 97.45%.
Table 4: Performance comparison of the proposed method with recent methods on KDDCup99 dataset

Methods              Accuracy (%)
MOMBD-CDD            97.45
DNN + SVM            92.00
OC-SVM               91.80
Naive Bayesian       89.60
Gaussian Process     91.10
Random Forest        90.24
Gradient Boosting    84.30
Tab. 5 and Fig. 9 compare the classification results of the MOMBD-CDD method with those
of existing methods on the ECUE spam dataset [26–29]. The HELF technique resulted in the
lowest accuracy of 75%. The KNN model achieved a somewhat higher accuracy of 81.80%, and
the Genetic Algorithm model attained a moderate accuracy of 84%. The Adaboost model
performed better still, with an accuracy of 87%.
Table 5: Performance evaluation of the proposed method with recent methods on ECUE spam dataset

Methods              Accuracy (%)
MOMBD-CDD            94.23
CBT                  91.30
Naive Bayes          88.10
Genetic Algorithm    84.00
Flexible Bayes       88.80
HELF                 75.00
Adaboost             87.00
KNN                  81.80
Likewise, the Naive Bayes and Flexible Bayes models achieved close accuracy values of
88.10% and 88.80%, respectively, and the CBT model exhibited a competitive accuracy of
91.30%. Finally, the MOMBD-CDD model achieved the best result with an accuracy of 94.23%.
These results show that the MOMBD-CDD model outperformed the compared methods on both
datasets.
5 Conclusion
This research work proposed a novel MOMBD-CDD model for high-dimensional big streaming
data. The presented MOMBD-CDD model has three operational stages, namely pre-processing,
CDD, and classification. First, the online streaming big data was preprocessed to
transform the raw streaming data into a compatible format. Then, the preprocessed data
underwent class imbalance handling with the help of the SMOTE-GSO algorithm. Next, the
CDD process was carried out with the STEPD technique. Finally, the classification task
was performed by the Bi-LSTM model, whose hyperparameters were tuned by the GSO
algorithm. The model was extensively validated through experiments, which confirmed that
it produces effective outcomes. The presented MOMBD-CDD model attained accuracies of
97.45% and 94.23% on the KDDCup99 and ECUE spam datasets, respectively. In future, the
performance can be increased through clustering and feature selection techniques.
Funding Statement: The authors received no specific funding for this study.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding
the present study.
References
[1] I. Žliobaitė, “Learning under concept drift: An overview,” Technical report, Faculty of Mathematics
and Informatics, Vilnius University, Vilnius, Lithuania, 2009.
[2] R. S. M. Barros, D. R. L. Cabral, P. M. Gonçalves Jr. and S. G. T. C. Santos, “RDDM: Reactive drift
detection method,” Expert Systems with Applications, vol. 90, pp. 344–355, 2017.
[3] P. Li, X. Wu, X. Hu and H. Wang, “Learning concept-drifting data streams with random ensemble
decision trees,” Neurocomputing, vol. 166, no. 3, pp. 68–83, 2015.
[4] G. J. Ross, N. M. Adams, D. K. Tasoulis and D. J. Hand, “Exponentially weighted moving average charts
for detecting concept drift,” Pattern Recognition Letters, vol. 33, no. 2, pp. 191–198, 2012.
[5] G. Widmer and M. Kubat, “Learning in the presence of concept drift and hidden contexts,” Machine
Learning, vol. 23, no. 1, pp. 69–101, 1996.
[6] S. Liu, L. Feng, J. Wu, G. Hou and G. Han, “Concept drift detection for data stream learning based
on angle optimized global embedding and principal component analysis in sensor networks,” Computers
& Electrical Engineering, vol. 58, no. 8, pp. 327–336, 2017.
[7] A. Bifet and R. Gavaldà, “Learning from time-changing data with adaptive windowing,” in Society for
Industrial and Applied Mathematics. Int. Conf. on Data Mining, Minnesota, USA, 2007.
[8] S. Xu and J. Wang, “Dynamic extreme learning machine for data stream classification,”
Neurocomputing, vol. 238, no. 99, pp. 433–449, 2017.
[9] J. L. Lobo, I. Laña, J. D. Ser, M. N. Bilbao and N. Kasabov, “Evolving spiking neural networks for
online learning over drifting data streams,” Neural Networks, vol. 108, no. 12, pp. 1–19, 2018.
[10] Y. Zhang, G. Chu, P. Li, X. Hu and X. Wu, “Three-layer concept drifting detection in text data
streams,” Neurocomputing, vol. 260, no. 3, pp. 393–403, 2017.
[11] J. L. Lobo, J. D. Ser, M. N. Bilbao, C. Perfecto and S. S. Sanz, “DRED: An evolutionary diversity
generation method for concept drift adaptation in online learning environments,” Applied Soft Computing,
vol. 68, pp. 693–709, 2018.
[12] B. Mirza, Z. Lin and N. Liu, “Ensemble of subset online sequential extreme learning machine for class
imbalance and concept drift,” Neurocomputing, vol. 149, no. 9, pp. 316–329, 2015.
[13] E. Arabmakki and M. Kantardzic, “SOM-based partial labeling of imbalanced data stream,”
Neurocomputing, vol. 262, pp. 120–133, 2017.
[14] J. L. Lobo, J. D. Ser, M. N. Bilbao, I. Laña and S. S. Sanz, “A probabilistic sample matchmaking
strategy for imbalanced data streams with concept drift,” in Int. Symp. on Intelligent and Distributed
Computing IDC 2016: Intelligent Distributed Computing X , Cham, Springer, vol. 678, pp. 237–246, 2016.
[15] T. S. Sethi and M. Kantardzic, “On the reliable detection of concept drift from streaming unlabeled
data,” Expert Systems with Applications, vol. 82, no. 12, pp. 77–99, 2017.
[16] J. De Andrade Silva, E. R. Hruschka and J. Gama, “An evolutionary algorithm for clustering data
streams with a variable number of clusters,” Expert Systems with Applications, vol. 67, no. 1,
pp. 228–238, 2017.
[17] I. Kim and C. H. Park, “Concept drift detection on streaming data under limited labeling,” in IEEE
Int. Conf. on Computer and Information Technology, Nadi, Fiji, pp. 1–9, 2016.
[18] N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer, “SMOTE: Synthetic minority
over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[19] J. Wang, B. Makond, K. H. Chen and K. M. Wang, “A hybrid classifier combining SMOTE with
PSO to estimate 5-year survivability of breast cancer patients,” Applied Soft Computing, vol. 20,
pp. 15–24, 2014.
[20] N. Krishnan and D. Ghose, “Glowworm swarm optimization for searching higher dimensional spaces,”
in Innovations in Swarm Intelligence, vol. 248. Berlin, Heidelberg: Springer, pp. 61–75, 2009.
[21] D. R. De Lima Cabral and R. S. M. De Barros, “Concept drift detection based on Fisher’s exact test,”
Information Sciences, vol. 442–443, pp. 220–234, 2018.
[22] N. Yulita, M. I. Fanany and A. M. Arymurthy, “Bi-directional long short-term memory using quantized
data of deep belief networks for sleep stage classification,” Procedia Computer Science, vol. 116,
pp. 530–538, 2017.
[23] KDD Cup 1999 Data, The Third International Knowledge Discovery and Data Mining Tools
Competition. [Online]. Available: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html (Accessed June 14,
2020).
[24] ECUE Spam dataset, [Online]. Available: http://www.comp.dit.ie/sjdelany/dataset.htm (Accessed June
14, 2020).
[25] H. Hindy, R. Atkinson, C. Tachtatzis, J. N. Colin and E. Bayne, “Utilising deep learning techniques
for effective zero-day attack detection,” Electronics, vol. 9, no. 10, pp. 1–16, 2020.
[26] S. J. Delany, P. Cunningham, A. Tsymbal and L. Coyle, “A case-based technique for tracking concept
drift in spam filtering,” Knowledge-Based Systems, vol. 18, no. 4–5, pp. 187–195, 2005.
[27] N. Pérez-Díaz, D. Ruano-Ordás, F. Fdez-Riverola and J. R. Méndez, “Boosting accuracy of
classical machine learning antispam classifiers in real scenarios by applying rough set theory,” Scientific
Programming, vol. 2016, pp. 1–10, 2016.
[28] C. Zhao, Y. Xin, X. Li, Y. Yang and Y. Chen, “A heterogeneous ensemble learning framework for spam
detection in social networks with imbalanced data,” Applied Sciences, vol. 10, no. 3, pp. 1–18, 2020.
[29] N. Saidani, K. Adi and M. S. Allili, “A semantic-based classification approach for an enhanced spam
detection,” Computers & Security, vol. 94, no. 1, pp. 101716, 2020.