A Report On
Review Of Machine Learning-Based Zero-Day Attack
Detection: Challenges And Future Directions
Submitted in partial fulfillment of the requirement of
University of Mumbai for the Degree of
Bachelor of Technology
In
Computer Engineering
Submitted By
Saloni Dongare
Supervisor
Prof. Payel Thakur
Department of Computer Engineering
PILLAI COLLEGE OF ENGINEERING
New Panvel – 410 206
UNIVERSITY OF MUMBAI
Academic Year 2024–25
Table of Contents
1. Introduction
2. Implementation
3. Challenges
4. Conclusion and Future Work
4.1 Conclusion
4.2 Future Work
1. Introduction
Zero-day attacks are cybersecurity threats that exploit unknown vulnerabilities, allowing
attackers to bypass defenses without being detected. These attacks target flaws that are not
publicly disclosed, giving no time for security teams to patch systems. Traditional detection
methods, like signature-based systems, rely on known patterns to identify attacks and are
ineffective against zero-day threats since these patterns are unavailable. Anomaly-based
detection systems, while useful, can struggle with false positives or fail to detect more subtle
attacks. Machine learning (ML) has emerged as a promising tool for detecting zero-day attacks.
ML models analyze patterns in data and adapt to detect unusual behavior, potentially identifying
unknown attacks. Different types of ML models, such as supervised and unsupervised learning,
have been explored for zero-day detection. However, challenges persist, particularly with the
lack of data on zero-day attacks, which makes training these models difficult. Researchers often
assume similarities between zero-day and known attacks, but this needs to be validated.
Furthermore, evaluation of these models is challenging due to limited real-world testing data and
the absence of standardized benchmarks. Despite these challenges, ML-based approaches show
potential for improving zero-day attack detection. This report reviews existing ML models and
identifies gaps in current detection methods.
2. Implementation
2.1 OUTLIER-BASED ZERO-DAY ATTACK DETECTION USING NORMAL DATA
A. One-Class SVM-based Detection
● One-Class SVM learns a decision boundary around normal data using non-linear kernels.
Points that fall outside the boundary are considered outliers, i.e., potential zero-day
attacks.
● The model finds a hyperplane in the kernel-induced feature space that separates the normal
data from the origin with the largest possible margin, keeping normal data on one side of the plane.
● Training uses a kernel function (e.g., a Gaussian kernel) and solves a quadratic
programming problem to determine the model's boundary.
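For reference, the boundary described above is commonly written in the standard ν-parameterized One-Class SVM form. The symbols below (ν, ρ, ξ_i, φ, σ, α_i) do not appear in this report and are introduced only for illustration; the formulation is the textbook one, not necessarily the exact variant used in the reviewed experiments.

```latex
\begin{aligned}
\min_{w,\,\xi,\,\rho}\quad
  & \frac{1}{2}\lVert w\rVert^{2}
    + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho \\
\text{subject to}\quad
  & \langle w, \phi(x_i)\rangle \ge \rho - \xi_i,
    \qquad \xi_i \ge 0,\quad i = 1,\dots,n \\[4pt]
k(x, x') &= \exp\!\left(-\frac{\lVert x - x'\rVert^{2}}{2\sigma^{2}}\right) \\
f(x) &= \operatorname{sgn}\!\Big(\sum_{i=1}^{n}\alpha_i\,k(x_i, x) - \rho\Big)
\end{aligned}
```

Here n is the number of normal training points, φ is the feature map induced by the kernel k, and ν in (0, 1] upper-bounds the fraction of training points allowed to fall outside the boundary. The α_i are the dual variables obtained from the quadratic program, and a point with f(x) = -1 is flagged as an outlier, i.e., a potential zero-day attack.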
B. Autoencoder-based Detection
● Autoencoders, a type of neural network, aim to reconstruct input data with low error for
normal data and high error for outliers. The reconstruction error helps determine if new
data is an outlier.
● The autoencoder consists of an encoder that compresses the input and a decoder that
reconstructs it. A data point is considered anomalous if the reconstruction error exceeds a
threshold.
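The thresholding step can be sketched in a few lines of Python. This is a minimal illustration, not the reviewed implementation: `reconstruct` is a hypothetical hand-set stand-in for a trained autoencoder, and choosing the threshold as a quantile of the errors on normal data is an assumed rule.

```python
import math


def reconstruct(x):
    """Hypothetical stand-in for a trained autoencoder on 2-D inputs.

    Encoder: compress the point to a 1-D code (the mean of its values).
    Decoder: expand the code back to 2-D by repeating it.
    Points with x[0] close to x[1] are reconstructed with low error.
    """
    code = (x[0] + x[1]) / 2.0  # encoder
    return (code, code)         # decoder


def reconstruction_error(x, x_hat):
    """Mean squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def pick_threshold(normal_errors, quantile=0.95):
    """Assumed rule: take a quantile of the errors seen on normal data."""
    ranked = sorted(normal_errors)
    index = min(len(ranked) - 1, math.ceil(quantile * len(ranked)) - 1)
    return ranked[max(index, 0)]


def is_anomalous(x, threshold):
    """Flag a point whose reconstruction error exceeds the threshold."""
    return reconstruction_error(x, reconstruct(x)) > threshold


# Toy "normal" data lying close to the line x[0] == x[1].
normal_points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 4.0), (5.0, 4.9)]
normal_errors = [reconstruction_error(p, reconstruct(p)) for p in normal_points]
threshold = pick_threshold(normal_errors, quantile=1.0)

print(is_anomalous((2.5, 2.5), threshold))  # reconstructed well
print(is_anomalous((1.0, 5.0), threshold))  # far from the normal pattern
```

In practice the threshold trades detection rate against false positives: a lower quantile flags more traffic, including more benign traffic.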
C. Performance Comparison
● One-Class SVM and autoencoder models were trained on CIC-IDS2017 and NSL-KDD
datasets, which include various types of benign and attack traffic.
● Autoencoders generally outperform One-Class SVM for more complex zero-day attacks,
as their performance scales better with complex data.
● Both models have low false-positive rates, though detection accuracy varies by attack
type: attacks that differ sharply from normal traffic, such as Hulk and DDoS, have high
detection rates, while others, such as DoS-SlowHTTPTest, are detected poorly.
D. Ensemble of Autoencoders
● An ensemble approach, such as the Kitsune framework, enhances zero-day detection by
combining multiple autoencoders, each monitoring different network features.
● The ensemble method was evaluated using a test bed involving IP camera surveillance
systems, and Kitsune demonstrated performance comparable to offline anomaly detectors
like Isolation Forests and Gaussian Mixture Models. However, its performance varied
depending on the type of attack.
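The idea of several small autoencoders, each watching its own group of features, can be sketched as follows. This is a simplified illustration under stated assumptions: `make_mean_autoencoder` is a hypothetical stand-in for a trained autoencoder, the feature groups are made up, and the per-group errors are combined with a plain maximum, whereas Kitsune's published design passes them to a further output autoencoder.

```python
import math


def rmse(x, x_hat):
    """Root mean squared error between a vector and its reconstruction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x))


def make_mean_autoencoder():
    """Hypothetical stand-in for one small trained autoencoder.

    It encodes a feature group to its mean and decodes by repeating
    the mean, so groups with near-equal values reconstruct well.
    """
    def reconstruct(values):
        code = sum(values) / len(values)
        return [code] * len(values)
    return reconstruct


def ensemble_scores(sample, feature_groups, autoencoders):
    """Reconstruction error of each autoencoder on its own feature group."""
    scores = []
    for indices, reconstruct in zip(feature_groups, autoencoders):
        subset = [sample[i] for i in indices]
        scores.append(rmse(subset, reconstruct(subset)))
    return scores


def ensemble_anomaly_score(sample, feature_groups, autoencoders):
    """Simplified aggregation (assumption): the worst per-group error."""
    return max(ensemble_scores(sample, feature_groups, autoencoders))


# Two hypothetical feature groups over a 4-feature sample.
feature_groups = [(0, 1), (2, 3)]
autoencoders = [make_mean_autoencoder() for _ in feature_groups]

normal_score = ensemble_anomaly_score([1.0, 1.0, 5.0, 5.0], feature_groups, autoencoders)
odd_score = ensemble_anomaly_score([1.0, 1.0, 5.0, 9.0], feature_groups, autoencoders)
print(normal_score, odd_score)
```

Splitting the features keeps each autoencoder small, which is what makes this style of ensemble practical for online detection on constrained devices.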
2.2 SUPERVISED AND HYBRID LEARNING-BASED ZERO-DAY ATTACK
DETECTION USING LABELED DATA
A. Evaluation of Supervised Machine Learning Classifiers for Zero-Day Attack Detection
This section focuses on evaluating the effectiveness of six popular machine learning
classifiers—Random Forest, Gaussian Naive Bayes, Decision Tree, Multi-layer Perceptron,
K-Nearest Neighbors, and Quadratic Discriminant Analysis—for detecting zero-day attacks. The
CSE-CIC-IDS2018 dataset was used for training, containing labeled data from a variety of
attacks and benign network activities. After pre-processing to remove noisy features, 25
bi-directional flow features were used for model training. The performance of the classifiers was
evaluated on real-world zero-day attack data, showing that the Decision Tree model
outperformed others, achieving a true-positive rate of 96% and a false-positive rate of 5% at an
optimal tree depth.
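The two rates quoted above are computed from confusion-matrix counts. The short sketch below shows the arithmetic; the counts are hypothetical, chosen only so that the result matches the reported 96% and 5%, and are not taken from the reviewed study.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FN, FP, TN for binary labels (1 = attack, 0 = benign)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fn, fp, tn


def true_positive_rate(tp, fn):
    """Fraction of actual attacks that were detected."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """Fraction of benign flows that were wrongly flagged."""
    return fp / (fp + tn)


# Hypothetical evaluation set: 100 attack flows and 1000 benign flows.
y_true = [1] * 100 + [0] * 1000
y_pred = [1] * 96 + [0] * 4 + [1] * 50 + [0] * 950

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
tpr = true_positive_rate(tp, fn)
fpr = false_positive_rate(fp, tn)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

Because benign traffic usually far outnumbers attack traffic, even a 5% false-positive rate can produce many more false alarms than true detections, which is why both rates are reported together.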
B. Integrating Supervised and Unsupervised Learning for Zero-Day Malware Detection
Comar et al. developed a two-level hybrid detection method using supervised and unsupervised
learning to identify zero-day malware. The approach consists of a macro-level binary classifier
(Random Forest) to identify malicious traffic, followed by a micro-level classifier (multi-class
Support Vector Machine) to distinguish between known and zero-day malware. This layered
approach identified both known and zero-day malware, with an AUC of 0.91.
However, the F1 score for zero-day detection was lower, at 0.50, indicating room for
improvement.
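The control flow of such a two-level detector can be sketched as below. This is an illustration only, not the method of Comar et al.: the two `toy_` functions are hypothetical stand-ins for the trained macro- and micro-level models, the flow fields are invented, and the rule that a malicious flow matching no known family is reported as zero-day is an assumption.

```python
def two_level_detect(flow, is_malicious, family_scores, min_score=0.5):
    """Return 'benign', a known family name, or 'zero-day' for one flow."""
    # Macro level: binary decision, benign versus malicious.
    if not is_malicious(flow):
        return "benign"
    # Micro level: score the flow against each known malware family.
    scores = family_scores(flow)
    best_family = max(scores, key=scores.get)
    # Assumed rule: malicious traffic that matches no known family well
    # is reported as a zero-day candidate.
    if scores[best_family] >= min_score:
        return best_family
    return "zero-day"


# Hypothetical stand-ins for the trained macro- and micro-level models.
def toy_is_malicious(flow):
    return flow["failed_logins"] > 3


def toy_family_scores(flow):
    return {
        "botnet": 0.9 if flow["beacon"] else 0.1,
        "worm": 0.8 if flow["scans"] else 0.2,
    }


flows = [
    {"failed_logins": 0, "beacon": False, "scans": False},
    {"failed_logins": 10, "beacon": True, "scans": False},
    {"failed_logins": 10, "beacon": False, "scans": False},
]
labels = [two_level_detect(f, toy_is_malicious, toy_family_scores) for f in flows]
print(labels)
```

Separating the two decisions lets the macro-level model be tuned for recall on malicious traffic, while the micro-level model handles the harder question of whether that traffic resembles anything seen before.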
C. Hybrid Learning Using Available Unlabeled Data for Zero-Day Malware Detection
A hybrid approach was proposed that leverages unlabeled data to detect zero-day malware. The
method involves running files in a sandbox, extracting API call frequencies, and applying
k-means clustering to the combined labeled and unlabeled data. Geometric distances between data
points and cluster centroids are then used as additional features for training classifiers. This approach
reportedly achieved 100% detection accuracy when the augmented geometric features were used with
classifiers such as Random Forest and SVM.
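The feature-augmentation step can be sketched as follows. This is a minimal illustration under stated assumptions: the 2-D vectors stand in for hypothetical API call frequency features, k-means is a plain Lloyd's algorithm seeded with the first k points for determinism, and the downstream classifier is omitted.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: euclidean(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def augment_with_distances(points, centroids):
    """Append each point's distance to every centroid as extra features."""
    return [list(p) + [euclidean(p, c) for c in centroids] for p in points]


# Hypothetical API call frequency vectors.
labeled = [(0.0, 0.0), (10.0, 10.0)]
unlabeled = [(0.0, 2.0), (10.0, 8.0)]

# Cluster labeled and unlabeled samples together, then augment the
# labeled samples that a classifier would be trained on.
centroids = kmeans(labeled + unlabeled, k=2)
features = augment_with_distances(labeled, centroids)
print(centroids)
print(features[0])
```

The unlabeled samples influence where the centroids fall, so the distance features carry information about the overall data distribution even though only the labeled samples are used for supervised training.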
3. Challenges
The primary challenge of zero-day attacks is the lack of prior knowledge, making them nearly
impossible to detect with traditional intrusion detection systems (IDSs) that rely on predefined signatures. Attackers can
exploit these unknown vulnerabilities for long periods before any defensive measures are
developed, allowing significant damage to be inflicted. In rapidly evolving smart community
environments, this challenge is amplified, as interconnected systems handling critical
infrastructure are prime targets. The constantly changing nature of zero-day threats leaves
conventional security methods struggling to keep up, often resulting in severe breaches and
financial losses. Additionally, the sheer volume of potential vulnerabilities in modern software
and hardware increases the difficulty of maintaining comprehensive defenses. Attackers can
innovate faster than defenders can patch, making zero-day attacks especially dangerous in
environments where data security and system availability are critical. The lack of transparency in
proprietary systems further complicates efforts to detect and prevent these attacks.
4. Conclusion and Future Work
4.1 Conclusion
Zero-day attacks are frequent, lasting an average of 312 days before detection and causing
significant financial damage, with costs averaging $1.2 million per attack. Machine Learning
(ML)-based detection methods offer the most promising solution for identifying these attacks.
This review explored various ML approaches, including unsupervised, supervised, hybrid, and
transfer learning methods. However, key challenges remain, particularly the lack of zero-day
attack data in training sets, which limits the accuracy and robustness of existing models.
Additionally, limited datasets and feature spaces hinder the effectiveness of current detection
methods.
To improve ML-based zero-day detection, future efforts should focus on integrating the latest
ML advancements, incorporating expert knowledge, and developing standardized, data-rich
benchmarks for better evaluation and model improvement.
4.2 Future Work
To overcome the challenges in designing effective ML-based zero-day attack detection, efforts on
several fronts are essential. First, collecting data on zero-day attacks before they become widely
known is critical. Honeypots can be used to gather this data, allowing systems to learn from
real-world attacks before they are publicly disclosed. Another key area is effective feature
engineering, which requires integrating domain expertise. Attackers may disguise their attacks as
legitimate actions, so involving cybersecurity experts in selecting features ensures that new
attacks are detectable within the model’s feature space.
Staying updated with advancements in machine learning is also crucial. Reinforcement Learning
(RL), for example, allows systems to learn by interacting with the environment through trial and
error, making it particularly suited for zero-day attack detection. RL-based methods have shown
success in defending against various types of zero-day attacks, including strategic and random
attacks. Lastly, developing a comprehensive benchmark suite with standardized datasets and
evaluation tools is necessary to further research in ML-based zero-day detection systems. This
would significantly speed up innovation and improve the effectiveness of these systems.