Final Report With Modification
Dissertation-II / Internship-II
by
Brajendra Singh
23MCA0144
April, 2025
Intrusion detection system using machine learning
DECLARATION
I further declare that the work reported in this dissertation / internship has not been
submitted and will not be submitted, either in part or in full, for the award of any other
degree or diploma in this institute or any other institute or university.
Place: Vellore
Date: 16-04-2025
CERTIFICATE
This is to certify that the PMCA699J - <Dissertation-II/ Internship-II> entitled
Intrusion detection system using machine learning submitted by Brajendra Singh
(23MCA0144), SCORE, VIT, for the award of the degree of Master of Computer
Applications in Department of Computer Applications, is a record of bonafide work carried
out by him / her under my supervision during the period 13.12.2024 to 17.04.2025, as per
the VIT code of academic and research ethics.
The contents of this report have not been submitted and will not be submitted either in
part or in full, for the award of any other degree or diploma in this institute or any other
institute or university. The dissertation / internship fulfills the requirements and regulations of
the University and in my opinion meets the necessary standards for submission.
Place: Vellore
Date: 16-04-2025
ACKNOWLEDGEMENT
I am pleased to express my deep gratitude to my Dissertation-II / Internship-II guide,
Dr. Seetha R, Associate Professor Grade 1, School of Computer Science Engineering and
Information Systems, Vellore Institute of Technology, Vellore, for constant guidance and
continual encouragement in my endeavor. My association with my guide has not been
confined to academics; it has been a great opportunity to work with an expert in the field
of Machine Learning and Cyber Security.
I would like to express my heartfelt gratitude to Honorable Chancellor Dr. G Viswanathan;
respected Vice Presidents Mr. Sankar Viswanathan and Dr. Sekar Viswanathan; Vice
Chancellor Dr. V. S. Kanchana Bhaaskaran; Pro-Vice Chancellor Dr. Partha Sharathi
Mallick; and Registrar Dr. Jayabarathi T.
It is indeed a pleasure to thank my parents and friends who persuaded and encouraged me to
take up and complete my dissertation/ Internship successfully. Last, but not least, I express
my gratitude and appreciation to all those who have helped me directly or indirectly towards
the successful completion of the dissertation/ Internship.
Place: Vellore
Date: 16-04-2025 Brajendra Singh
Executive Summary
Intrusion Detection Systems (IDS) are essential for safeguarding networks against cyber
threats. Traditional IDS rely on signature-based methods, which struggle to detect novel
attacks. Machine Learning (ML) offers a promising solution by analyzing patterns in network
traffic to identify anomalies and potential threats dynamically.
This project explores the application of ML techniques in IDS, leveraging datasets like
CICIDS 2017, which contains real-world network traffic data. Various algorithms, including
Random Forest and Deep Learning models, are evaluated for their detection accuracy and
efficiency. The study also emphasizes the importance of feature selection, model
optimization, and dataset preprocessing to improve performance.
A key innovation in this project is the integration of TinyML, enabling IDS deployment on
resource-constrained edge devices. This approach enhances real-time threat detection with
minimal computational overhead, making cybersecurity more accessible for IoT and
embedded systems.
Through this research, we demonstrate that ML-based IDS significantly improve detection
rates, reduce false positives, and adapt better to emerging threats compared to traditional
methods. The findings contribute to the ongoing development of intelligent, lightweight, and
scalable security solutions for modern networks.
CONTENTS
Acknowledgement
Executive Summary
List of Figures
1 INTRODUCTION
1.1 Objective
1.2 Motivation
1.3 Background
3 LITERATURE SURVEY
4 TECHNICAL SPECIFICATION
9 SUMMARY
10 REFERENCES
List of Figures
CHAPTER 1
INTRODUCTION
The rise of digital communication and the rapid expansion of computer networks have
brought both convenience and security challenges. As organizations and individuals rely
more on internet-based services, the frequency and sophistication of cyber threats have also
increased. Cybercriminals continuously exploit network vulnerabilities, deploying malware,
launching phishing campaigns, and executing large-scale Distributed Denial-of-Service
(DDoS) attacks. Traditional security mechanisms, such as firewalls and antivirus software,
are effective but insufficient in countering modern cyber threats that evolve rapidly.
lightweight devices, enabling continuous security monitoring without relying on centralized
cloud-based detection systems.
This project explores the development and implementation of an Intrusion Detection System
(IDS) using Machine Learning, with a focus on real-time anomaly detection and lightweight
deployment using TinyML. The study aims to improve the accuracy, adaptability, and
efficiency of IDS while making cybersecurity more accessible to various computing
environments, including IoT networks, industrial control systems, and enterprise
infrastructures.
1.1 OBJECTIVE
The primary goal of this project is to develop a Machine Learning-based Intrusion Detection
System that enhances cybersecurity by efficiently identifying and classifying malicious
activities within a network. The system aims to improve detection accuracy, reduce false
alarms, and provide real-time security monitoring.
Key Objectives:
• Develop a robust IDS model that utilizes Machine Learning algorithms to classify
network traffic as normal or malicious.
• Improve detection accuracy and reduce false positives, ensuring a reliable security
framework for network environments.
• Optimize the IDS for lightweight deployment using TinyML, allowing real-time
monitoring on IoT and resource-constrained devices.
• Train the model using benchmark datasets such as CICIDS 2017, ensuring that the
system can accurately detect different types of cyberattacks, including Denial-of-Service
(DoS), botnet attacks, brute-force intrusions, and port scanning.
1.2 MOTIVATION
The motivation behind this project arises from the increasing volume and complexity of
cyber threats that target individuals, enterprises, and governments worldwide. Cybersecurity
breaches have severe consequences, including financial losses, data theft, reputational
damage, and national security risks. As attackers employ more sophisticated methods,
traditional security solutions have struggled to keep up, necessitating the development of
intelligent and automated detection mechanisms.
Key Motivations:
Conventional IDS techniques are primarily categorized into signature-based detection and
anomaly-based detection:
• Signature-based IDS rely on databases of known attack patterns. While effective for
detecting previously identified threats, they fail to recognize new or modified attack
patterns.
• Anomaly-based IDS detect deviations from normal network behavior, but many
traditional methods generate high false positives, making them unreliable.
By leveraging Machine Learning, IDS can learn normal network behavior dynamically and
identify anomalies with higher accuracy and lower false alarms.
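The contrast between the two detection styles described above can be illustrated with a minimal sketch. It is not the detection logic of this project: the signature strings, the single traffic feature (packets per second), the sample values, and the three-standard-deviation threshold are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

# Hypothetical signature database: substrings of known attack payloads.
KNOWN_SIGNATURES = {"' OR 1=1 --", "../../etc/passwd"}


def signature_match(payload):
    """Signature-based check: flag a payload only if it contains a known pattern."""
    return any(signature in payload for signature in KNOWN_SIGNATURES)


def fit_baseline(normal_rates):
    """Learn the mean and standard deviation of a feature from benign traffic."""
    return mean(normal_rates), stdev(normal_rates)


def is_anomalous(rate, baseline, k=3.0):
    """Anomaly-based check: flag values more than k standard deviations from the mean."""
    mu, sigma = baseline
    return abs(rate - mu) > k * sigma


# Packets per second observed during normal operation (invented values).
baseline = fit_baseline([98.0, 102.0, 100.0, 97.0, 103.0, 101.0, 99.0])

print(signature_match("GET /index.html"))  # no known pattern, so nothing is flagged
print(is_anomalous(100.5, baseline))       # close to the benign mean
print(is_anomalous(950.0, baseline))       # flood-like burst far from the baseline
```

The signature check can only recognize what is already in its database, while the statistical check reacts to any large deviation, which is also why a poorly chosen threshold produces false alarms.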
3. The Rise of IoT and the Need for Lightweight Security Solutions
The expansion of the Internet of Things (IoT) has introduced new security challenges. Many
IoT devices lack the processing power needed to run traditional IDS, making them vulnerable
to cyber threats. TinyML enables ML models to run on low-power, resource-constrained
devices, allowing IDS to be implemented in IoT and embedded environments for real-time
threat detection.
Manually updating IDS rules for new threats is inefficient and slow. ML-based IDS offer
adaptive security by continuously learning from new data. This ensures that new and
evolving attack patterns can be detected in real time, enhancing the overall security
framework.
The combination of ML and IDS presents a powerful opportunity to create more efficient,
accurate, and adaptable security solutions capable of protecting modern digital infrastructures
from emerging threats.
1.3 BACKGROUND
Since their introduction, intrusion detection systems (IDS) have evolved considerably. In the
early days of network security, IDS relied heavily on manually maintained rule sets, which
required constant updates to remain effective. As cyberattacks became more sophisticated,
the need for automated, data-driven solutions grew, which led to the adoption of machine
learning techniques in intrusion detection systems.
ML-based IDS leverage historical network data to train models that recognize malicious
activity based on behavioral patterns. Unlike traditional IDS that require human-defined
rules, ML-based systems automatically learn attack signatures and can generalize to detect
novel threats.
2. Anomaly-Based IDS:
CHAPTER 2
2.1 DESCRIPTION
The field of cybersecurity has become one of the most critical areas of research in the modern
digital era, driven by the growing number of cyber threats targeting individuals, enterprises,
and government organizations. Cybercriminals continuously develop new attack strategies,
leveraging sophisticated techniques to bypass traditional security measures. As a result,
organizations face a constant battle to secure their digital infrastructures against data
breaches, financial fraud, ransomware attacks, and unauthorized intrusions.
One of the most effective security mechanisms used to safeguard networks and systems is an
Intrusion Detection System (IDS). An IDS continuously monitors network traffic and system
activities, identifying suspicious behaviors, policy violations, or malicious attacks.
Traditional IDS techniques, such as signature-based detection and rule-based anomaly
detection, have proven to be insufficient in handling advanced cyber threats. These methods
rely heavily on predefined attack patterns and require frequent updates, making them
ineffective against zero-day attacks and evolving attack vectors.
Furthermore, this research emphasizes the use of TinyML, a subset of ML designed for
lightweight and resource-constrained environments. Traditional IDS require substantial
computational resources and are difficult to deploy on Internet of Things (IoT) devices,
embedded systems, and industrial control networks. TinyML enables real-time intrusion
detection on low-power devices, making it possible to implement IDS across various
infrastructures, including smart homes, healthcare devices, and edge computing networks.
This dissertation will develop, analyze, and evaluate different ML models for IDS, focusing
on accuracy, efficiency, and scalability. The research will employ well-known benchmark
datasets, such as CICIDS 2017, which contains real-world traffic patterns, including normal
activities and cyberattack scenarios. By conducting a comparative study of multiple ML
algorithms, such as Random Forest, Support Vector Machines (SVM), k-Nearest Neighbors
(KNN), Deep Neural Networks (DNN), and hybrid models, this research aims to determine
the most effective and computationally efficient approach for detecting cyber threats.
Additionally, this study will explore feature selection techniques, hyperparameter tuning, and
data preprocessing methods to optimize the IDS model. By enhancing the efficiency,
detection accuracy, and adaptability of ML-IDS, this dissertation will contribute to the
advancement of cybersecurity by proposing a scalable, intelligent, and real-world applicable
security solution.
2.2 GOALS
The primary objective of this dissertation is to design, develop, and evaluate an ML-powered
Intrusion Detection System that effectively detects cyber threats in both large-scale and
resource-constrained environments. The research aims to enhance cybersecurity by
improving intrusion detection accuracy, reducing false alarms, and enabling real-time
security monitoring using TinyML.
Specific Goals
This dissertation aims to create an efficient, accurate, and adaptive IDS model that can
identify and classify cyber threats in real time. The system will be capable of detecting
various attack types, including:
• Port scanning and network reconnaissance attacks
By leveraging ML techniques, the IDS will analyze network traffic data and dynamically
identify patterns indicative of malicious activities.
This study will evaluate the effectiveness of multiple Machine Learning algorithms in
detecting intrusions. The dissertation will compare different approaches to understand their
advantages and limitations, including:
• Supervised Learning Models (e.g., Decision Trees, Random Forest, SVM, KNN)
This comparative analysis will help in selecting the best-performing model for real-world
IDS applications.
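A comparative analysis of this kind can be sketched as follows. The example is only illustrative: it compares a k-Nearest Neighbors classifier with a depth-1 decision tree (a decision stump) on a handful of invented flow records, and the two features, their values, and the choice of k = 3 are assumptions rather than settings used in this dissertation.

```python
import math

# Toy flow records: (packets_per_second, mean_packet_size); label 1 = malicious.
# All values are invented for illustration.
TRAIN = [((10, 500), 0), ((12, 480), 0), ((15, 520), 0), ((11, 510), 0),
         ((300, 60), 1), ((280, 64), 1), ((350, 58), 1), ((320, 70), 1)]
TEST = [((14, 490), 0), ((9, 530), 0), ((310, 62), 1), ((290, 66), 1)]


def knn_predict(train, x, k=3):
    """k-Nearest Neighbors: majority label among the k closest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def fit_stump(train, feature=0):
    """Decision stump: choose the threshold on one feature with the fewest errors.

    Assumes that values above the threshold indicate malicious traffic.
    """
    values = sorted({x[feature] for x, _ in train})
    best_threshold, best_errors = values[0], len(train) + 1
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        errors = sum((x[feature] > threshold) != bool(y) for x, y in train)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def accuracy(predict, data):
    """Fraction of records whose predicted label matches the true label."""
    return sum(predict(x) == y for x, y in data) / len(data)


threshold = fit_stump(TRAIN)
results = {
    "kNN (k=3)": accuracy(lambda x: knn_predict(TRAIN, x), TEST),
    "decision stump": accuracy(lambda x: int(x[0] > threshold), TEST),
}
for name, acc in results.items():
    print(f"{name}: accuracy = {acc:.2f}")
```

On real traffic the features would need scaling before distance-based models such as KNN are applied, and accuracy alone would be supplemented with false positive and false negative rates.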
Feature selection plays a critical role in enhancing the accuracy and efficiency of ML models.
Reducing unnecessary or redundant features improves IDS performance while minimizing
computational overhead. This dissertation will implement:
By selecting the most relevant features, the IDS can process large datasets efficiently while
maintaining high detection accuracy.
One of the key objectives of this dissertation is to deploy a lightweight IDS using TinyML.
Most IDS solutions require high computing power, making them impractical for IoT devices,
embedded systems, and industrial applications. This research will:
This TinyML-based IDS will help secure smart homes, healthcare devices, and industrial
automation systems against cyber threats.
Traditional IDS solutions often suffer from high false positive rates, leading to unnecessary
alerts and inefficiencies. Additionally, failing to detect actual attacks (false negatives) can
result in significant security breaches. This dissertation will apply:
By fine-tuning the IDS, the system will become more reliable and efficient, ensuring accurate
cyber threat detection.
The CICIDS 2017 dataset will be used for model training and evaluation. This dataset
contains a wide range of normal and malicious network traffic, making it suitable for real-
world testing. The research will:
• Preprocess the dataset by normalizing and encoding network features
By using real-world datasets, the IDS will be better suited for deployment in practical
environments.
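The normalization and encoding step mentioned above could look like the following minimal sketch. It assumes min-max scaling for numeric features and integer label encoding for categorical ones; the field names and values are illustrative and are not actual CICIDS 2017 columns.

```python
def min_max_normalize(column):
    """Rescale a numeric column to the range [0, 1]; constant columns map to 0.0."""
    low, high = min(column), max(column)
    if high == low:
        return [0.0 for _ in column]
    return [(value - low) / (high - low) for value in column]


def label_encode(column):
    """Map each distinct category to an integer, assigned in sorted order."""
    mapping = {category: index for index, category in enumerate(sorted(set(column)))}
    return [mapping[value] for value in column], mapping


# Hypothetical flow attributes (illustrative names and values).
durations = [0.5, 2.0, 3.5, 0.5]
protocols = ["tcp", "udp", "tcp", "icmp"]

scaled = min_max_normalize(durations)
encoded, mapping = label_encode(protocols)
print(scaled)            # [0.0, 0.5, 1.0, 0.0]
print(encoded, mapping)  # [1, 2, 1, 0] {'icmp': 0, 'tcp': 1, 'udp': 2}
```

In practice the scaling parameters would be computed on the training split only and then reused on the test split, so that no information leaks from the evaluation data.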
The final goal is to develop a scalable, real-world IDS framework that can:
This research will explore deployment strategies for different infrastructures, ensuring that
the proposed IDS can be used across various cybersecurity domains.
CHAPTER 3
LITERATURE SURVEY
[1] Intrusion detection systems (IDS) have evolved significantly over the years,
transitioning from traditional signature-based methods to machine learning (ML) and
deep learning (DL)-based approaches that offer improved adaptability and efficiency.
Traditional IDS primarily rely on signature-based or rule-based mechanisms to detect
cyber threats by comparing incoming network traffic against a predefined database of
known attack patterns. While effective against previously identified threats, these systems
struggle to detect zero-day attacks and novel cyber threats, rendering them less reliable in
an era of rapidly evolving cybersecurity challenges. To address these limitations, machine
learning and deep learning models have been extensively explored, as they provide
dynamic, adaptive, and real-time threat detection capabilities that enhance the security of
networks and computer systems.
One of the most significant advancements in IDS research is the adoption of supervised
learning techniques such as Decision Trees (DT), Random Forest (RF), Support Vector
Machines (SVM), and Artificial Neural Networks (ANN). These models are trained on
labeled datasets, allowing them to accurately classify network activities as either benign
or malicious. Decision Trees and Random Forest models are particularly valued for their
high interpretability and efficiency, making them suitable for real-world cybersecurity
deployments. Additionally, ensemble learning methods, which combine multiple
classifiers, have been shown to improve intrusion detection accuracy by reducing bias and
variance in model predictions.
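The idea of combining multiple classifiers can be illustrated with a simple majority-voting sketch. The three base classifiers below are hypothetical threshold rules over invented flow fields; a real ensemble such as Random Forest would use trained decision trees instead.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(classifiers, sample):
    """Query every base classifier and combine the outputs by majority vote."""
    return majority_vote([classifier(sample) for classifier in classifiers])


# Three hypothetical base classifiers; thresholds and field names are invented.
def by_rate(flow):
    return "malicious" if flow["packets_per_s"] > 200 else "benign"


def by_size(flow):
    return "malicious" if flow["mean_packet_size"] < 80 else "benign"


def by_ports(flow):
    return "malicious" if flow["distinct_ports"] > 50 else "benign"


# An odd number of voters avoids ties for a two-class problem.
classifiers = [by_rate, by_size, by_ports]

flow = {"packets_per_s": 250, "mean_packet_size": 400, "distinct_ports": 120}
print(ensemble_predict(classifiers, flow))  # two of the three rules vote "malicious"
```

Because a single mistaken base classifier is outvoted by the others, the combined prediction tends to be more stable than any individual one.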
security breaches. However, the challenge with unsupervised learning lies in determining
the optimal number of clusters or principal components, which can affect detection
performance.
The integration of deep learning models has further revolutionized IDS capabilities,
particularly in handling complex, high-dimensional data. Artificial Neural Networks
(ANN), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN)
have been applied to detect sophisticated cyber threats by learning intricate patterns in
network traffic. CNNs have been utilized for feature extraction and classification, while
RNNs and Long Short-Term Memory (LSTM) networks have been particularly effective
in analyzing sequential network traffic data to identify suspicious activities. Studies
indicate that deep learning models can outperform traditional ML approaches in terms of
accuracy and adaptability. However, they often require high computational resources and
suffer from poor interpretability, making them challenging to implement in resource-
constrained environments such as edge devices and IoT networks.
The effectiveness of an IDS is also heavily dependent on the quality of the dataset used
for training and evaluation. Several benchmark datasets have been utilized in IDS
research, including CICIDS2017, UNSW-NB15, NSL-KDD, and KDD99. The
CICIDS2017 dataset is widely preferred due to its realistic attack scenarios, diverse
network traffic characteristics, and broad coverage of attack classes, making it a
robust choice for evaluating IDS models. In contrast, KDD99, one of the earliest IDS
datasets, has been criticized for containing redundant and outdated attack patterns, which
limits its effectiveness for modern cybersecurity challenges. Additionally, most IDS
datasets suffer from class imbalance issues, where certain types of attacks occur far less
frequently than others, leading to biased model predictions and poor performance in
detecting minority attack classes.
Despite the advantages of ML and DL-based IDS, several challenges remain unresolved.
One of the primary concerns is the real-time applicability of these models. While many
ML-based IDS achieve high accuracy in offline experiments, their deployment in real-
world environments is often limited due to high computational overhead, memory
constraints, and latency issues. This is particularly problematic for real-time network
security, where timely detection of intrusions is crucial. To address this, researchers have
explored lightweight IDS solutions using TinyML, where models are optimized for
deployment on resource-constrained devices such as ESP32 microcontrollers, Raspberry Pi
boards, and other IoT hardware. Techniques like quantization, pruning, and knowledge distillation are
employed to reduce model size and computation requirements, enabling real-time
anomaly detection on embedded systems.
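Of the techniques named above, quantization is the simplest to illustrate. The sketch below shows affine 8-bit quantization of a small weight vector in plain Python; the weight values are invented, and production toolchains typically apply this per tensor or per channel with calibration data, so this is an assumption-laden illustration rather than a deployment recipe.

```python
def quantize(weights, num_bits=8):
    """Affine quantization: map floats to integers in [0, 2**num_bits - 1]."""
    qmax = 2 ** num_bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax
    if scale == 0:
        scale = 1.0  # constant weights: any non-zero scale works
    zero_point = round(-w_min / scale)
    quantized = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate float weights from their integer codes."""
    return [(value - zero_point) * scale for value in quantized]


# Hypothetical weights of one small layer.
weights = [-0.62, -0.1, 0.0, 0.27, 0.91]
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)

print(q)
print([round(value, 3) for value in restored])
```

Each weight is stored in one byte instead of the four bytes of a 32-bit float, at the cost of a rounding error of roughly half the scale per weight.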
Another critical issue in IDS research is the high false positive rate (FPR). Many ML-
based models flag benign network activities as malicious due to overfitting or poor
generalization. This leads to alert fatigue among security analysts, making it difficult to
distinguish between real threats and false alarms. Hybrid models that combine multiple
ML/DL techniques have been proposed as a solution to this problem. For instance,
combining deep learning with traditional ML classifiers can help improve detection
accuracy while reducing false positives. Studies suggest that hybrid models leveraging
Random Forest with Deep Neural Networks (DNNs) provide superior results by
integrating the high accuracy of deep learning with the interpretability of decision trees.
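For reference, the false positive rate and the related metrics discussed above follow directly from the confusion matrix. The sketch below computes them for binary labels; the example predictions are invented and do not represent results of any surveyed model.

```python
def detection_metrics(y_true, y_pred):
    """Compute IDS evaluation metrics from binary labels (1 = attack, 0 = benign)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # attacks correctly flagged
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # benign traffic correctly passed
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # benign traffic flagged (false alarm)
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # attacks missed

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "false_positive_rate": ratio(fp, fp + tn),
        "detection_rate": ratio(tp, tp + fn),  # also called recall
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


# Eight benign flows followed by two attacks (invented example).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 0, 0, 0, 1, 0, 1, 0]
metrics = detection_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In this example the accuracy is 0.70 even though half of the attacks are missed and a quarter of the benign flows raise alarms, which is why accuracy alone is a weak indicator on imbalanced traffic.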
However, while hybrid IDS models enhance detection performance, they introduce higher
computational complexity and resource demands, making them difficult to deploy in low-
power or real-time environments. Furthermore, many deep learning models lack
interpretability, making it challenging for security professionals to understand why a
particular network event was classified as an attack. Explainable AI (XAI) techniques,
such as SHAP (SHapley Additive Explanations) and LIME (Local Interpretable Model-
Agnostic Explanations), are being explored to address this issue, offering insights into
model decisions and increasing trust in ML-based IDS solutions.
Despite these challenges, the future of ML and DL-powered IDS remains promising, with
ongoing research focusing on real-time adaptive learning, federated learning for privacy-
preserving IDS, and cloud-edge hybrid security frameworks. Future advancements in
network security datasets, low-power ML inference, and automated threat intelligence
systems will further drive the adoption of AI-driven IDS solutions in enterprise and
industrial applications.
[2] Intrusion Detection Systems (IDS) play a crucial role in cybersecurity by identifying
malicious activities in network traffic. Traditional IDS primarily relied on signature-based
methods, which, despite their effectiveness against known threats, struggle with the
detection of zero-day attacks and evolving cyber threats. The adoption of machine
learning (ML) and deep learning (DL) techniques has significantly enhanced the
adaptability of IDS, making them capable of detecting previously unseen attack patterns.
Among the key advancements in ML-based IDS is intelligent feature selection, which
focuses on selecting the most relevant attributes from network traffic data to improve
model efficiency, accuracy, and real-time applicability.
demonstrated remarkable improvements in anomaly detection. CNNs are particularly
effective in feature extraction, while LSTMs excel at capturing sequential patterns in
network traffic data, making them ideal for detecting slow-moving, stealthy attacks.
However, the adoption of deep learning in IDS is constrained by its high computational
demands and the lack of interpretability, which makes it difficult for security analysts to
understand the model's decision-making process.
The effectiveness of an ML-based IDS largely depends on the quality and relevance of
the dataset used for training and evaluation. Modern datasets like CICIDS2017 and
UNSW-NB15 provide realistic attack scenarios, making them more applicable to today’s
cybersecurity landscape. However, many studies still rely on outdated datasets such as
KDD99, which do not accurately represent modern attack vectors and contain redundant
data. The overreliance on outdated datasets hinders the development of effective IDS
models, as they fail to generalize well to new attack patterns.
Another major challenge in IDS research is class imbalance, where certain types of
attacks are significantly underrepresented in datasets. This imbalance leads to biased ML
models that struggle to detect minority attack classes, resulting in high false negative
rates. Techniques such as Synthetic Minority Over-sampling Technique (SMOTE) and
cost-sensitive learning have been proposed to address this issue by balancing the dataset
and improving the detection of rare attacks. However, these approaches add complexity
to the training process and may not always generalize well to real-world traffic.
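The core idea of SMOTE, generating synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbors, can be sketched in a few lines. This is a simplified illustration with invented two-dimensional points and k = 2; it is not the reference implementation found in dedicated libraries.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic samples by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the chosen sample, excluding itself.
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(p, base))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical minority-class (rare attack) samples with two features each.
minority = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1)]
new_samples = smote(minority, n_new=6)
print(len(minority) + len(new_samples))  # minority class grows from 4 to 10 samples
```

Because every synthetic point lies on a segment between two real minority samples, the method adds density to the minority region without simply duplicating records. Oversampling should be applied to the training split only.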
While deep learning models enhance intrusion detection accuracy, their high
computational requirements present significant challenges for real-time deployment.
Many DL-based IDS require powerful GPUs or cloud-based processing, making them
impractical for resource-constrained environments such as IoT networks, mobile devices,
and edge computing. To mitigate this, researchers are exploring lightweight IDS solutions
using TinyML, where models are optimized to run on low-power embedded devices like
ESP32 and Raspberry Pi. Techniques such as quantization, model pruning, and
knowledge distillation have been applied to reduce model size and computation
requirements, making IDS more feasible for real-time applications.
Another challenge is the interpretability of deep learning models. Security analysts often
require explainable AI (XAI) techniques to understand and trust model decisions. Recent
research focuses on model interpretability methods such as SHAP (SHapley Additive
Explanations) and LIME (Local Interpretable Model-agnostic Explanations), which
provide insights into how ML models classify network traffic. These advancements aim
to bridge the gap between model accuracy and trustworthiness, ensuring that IDS can be
effectively integrated into security operations.
The future of machine learning-powered IDS lies in adaptive learning, federated learning
for privacy-preserving intrusion detection, and cloud-edge hybrid security frameworks.
Federated learning allows models to be trained across multiple decentralized devices
without exposing sensitive network data, improving privacy and security. Additionally,
reinforcement learning (RL) and adversarial learning are being explored to create IDS
that can continuously evolve and adapt to emerging cyber threats.
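The aggregation step at the heart of federated learning can be sketched as a weighted average of locally trained model weights, commonly known as federated averaging. The client weight vectors and sample counts below are invented, and the sketch omits local training, communication, and any privacy mechanism.

```python
def federated_average(client_updates):
    """Average client weight vectors, weighting each by its local sample count.

    client_updates: list of (weights, n_samples) pairs, one per device.
    Only the weights leave a device; the raw traffic data stays local.
    """
    total_samples = sum(n for _, n in client_updates)
    dimension = len(client_updates[0][0])
    return [
        sum(weights[i] * n for weights, n in client_updates) / total_samples
        for i in range(dimension)
    ]


# Two hypothetical edge devices with locally trained two-parameter models.
updates = [([0.2, -0.4], 100), ([0.6, 0.0], 300)]
global_weights = federated_average(updates)
print([round(w, 3) for w in global_weights])  # [0.5, -0.1]
```

The device with three times as many samples contributes three times as much to the global model, which is then sent back to all devices for the next training round.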
Hybrid IDS, which integrate multiple ML/DL models with traditional rule-based systems,
are gaining traction in industry due to their higher detection accuracy and lower false
positives. As IDS research progresses, the focus will be on reducing computational
complexity, improving real-time performance, and enhancing model interpretability,
ensuring that ML-driven IDS solutions become scalable and practical for enterprise and
industrial applications.
[3] Intrusion Detection Systems (IDS) play a crucial role in cybersecurity by identifying
malicious activities in network traffic. Traditional IDS primarily relied on signature-based
methods, which, despite their effectiveness against known threats, struggle with the
detection of zero-day attacks and evolving cyber threats. The adoption of machine
learning (ML) and deep learning (DL) techniques has significantly enhanced the
adaptability of IDS, making them capable of detecting previously unseen attack patterns.
Among the key advancements in ML-based IDS is intelligent feature selection, which
focuses on selecting the most relevant attributes from network traffic data to improve
model efficiency, accuracy, and real-time applicability.
Challenges in Intrusion Detection: Dataset Limitations and Class Imbalance
The effectiveness of an ML-based IDS largely depends on the quality and relevance of
the dataset used for training and evaluation. Modern datasets like CICIDS2017 and
UNSW-NB15 provide realistic attack scenarios, making them more applicable to today’s
cybersecurity landscape. However, many studies still rely on outdated datasets such as
KDD99, which do not accurately represent modern attack vectors and contain redundant
data. The overreliance on outdated datasets hinders the development of effective IDS
models, as they fail to generalize well to new attack patterns.
Another major challenge in IDS research is class imbalance, where certain types of
attacks are significantly underrepresented in datasets. This imbalance leads to biased ML
models that struggle to detect minority attack classes, resulting in high false negative
rates. Techniques such as Synthetic Minority Over-sampling Technique (SMOTE) and
cost-sensitive learning have been proposed to address this issue by balancing the dataset
and improving the detection of rare attacks. However, these approaches add complexity
to the training process and may not always generalize well to real-world traffic.
While deep learning models enhance intrusion detection accuracy, their high
computational requirements present significant challenges for real-time deployment.
Many DL-based IDS require powerful GPUs or cloud-based processing, making them
impractical for resource-constrained environments such as IoT networks, mobile devices,
and edge computing. To mitigate this, researchers are exploring lightweight IDS solutions
using TinyML, where models are optimized to run on low-power embedded devices like
ESP32 and Raspberry Pi. Techniques such as quantization, model pruning, and
knowledge distillation have been applied to reduce model size and computation
requirements, making IDS more feasible for real-time applications.
Another challenge is the interpretability of deep learning models. Security analysts often
require explainable AI (XAI) techniques to understand and trust model decisions. Recent
research focuses on model interpretability methods such as SHAP (SHapley Additive
Explanations) and LIME (Local Interpretable Model-agnostic Explanations), which
provide insights into how ML models classify network traffic. These advancements aim
to bridge the gap between model accuracy and trustworthiness, ensuring that IDS can be
effectively integrated into security operations.
The future of machine learning-powered IDS lies in adaptive learning, federated learning
for privacy-preserving intrusion detection, and cloud-edge hybrid security frameworks.
Federated learning allows models to be trained across multiple decentralized devices
without exposing sensitive network data, improving privacy and security. Additionally,
reinforcement learning (RL) and adversarial learning are being explored to create IDS
that can continuously evolve and adapt to emerging cyber threats.
Hybrid IDS, which integrate multiple ML/DL models with traditional rule-based systems,
are gaining traction in industry due to their higher detection accuracy and lower false
positives. As IDS research progresses, the focus will be on reducing computational
complexity, improving real-time performance, and enhancing model interpretability,
ensuring that ML-driven IDS solutions become scalable and practical for enterprise and
industrial applications.
[4] Intrusion Detection Systems (IDS) are vital for network security, and real-time
intrusion detection has gained significant attention due to the increasing sophistication of
cyber threats. This paper presents a practical real-time intrusion detection system (RT-
IDS) that integrates machine learning techniques to enhance detection accuracy and
efficiency. By leveraging a structured methodology consisting of preprocessing,
classification, and post-processing phases, the system ensures a comprehensive and
effective approach to network threat detection. The study focuses on optimizing feature
selection using the Information Gain method, identifying 12 critical network traffic
features that significantly contribute to improving detection accuracy while minimizing
computational overhead. Feature selection plays a crucial role in ensuring that only the
most relevant attributes are considered, reducing unnecessary processing and enhancing
real-time applicability.
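Information Gain ranks a feature by how much knowing its value reduces the entropy of the class label. The following sketch computes it for discrete features and selects the top-ranked ones; the feature names and records are invented, continuous features would first need discretization, and the reviewed study's exact procedure may differ.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their expected entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


def top_k_features(columns, labels, k):
    """Return the names of the k features with the highest Information Gain."""
    gains = {name: information_gain(values, labels) for name, values in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)[:k]


# Four hypothetical flow records described by three discrete features.
labels = ["attack", "attack", "normal", "normal"]
columns = {
    "syn_flag": [1, 1, 0, 0],                              # separates the classes perfectly
    "protocol": ["tcp", "udp", "tcp", "udp"],              # carries no class information
    "size_bucket": ["small", "small", "small", "large"],   # partially informative
}
print(top_k_features(columns, labels, k=2))  # ['syn_flag', 'size_bucket']
```

Keeping only the highest-ranked features reduces the amount of data the classifier must process per flow, which is what makes the approach attractive for real-time detection.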
A key strength of the study is its real-world evaluation using the RLD09 dataset, which
was specifically designed to replicate practical network environments. Unlike traditional
benchmark datasets such as KDD99 or NSL-KDD, which have become outdated, RLD09
contains realistic traffic patterns and modern attack types. The system's ability to perform
well on this dataset showcases its applicability in contemporary network security
scenarios, demonstrating its effectiveness in identifying DoS (Denial of Service) and
Probe attacks. These types of intrusions pose significant threats to network infrastructure,
and the RT-IDS provides a robust solution for detecting them efficiently. The study's
emphasis on real-time detection is particularly important in today's cybersecurity
landscape, where immediate threat response is essential for mitigating potential damages.
However, despite its notable achievements, the study has several limitations that warrant
further exploration. One major concern is the reliance on the RLD09 dataset, which,
while tailored to real-world scenarios, may not generalize well across different network
environments. Attack patterns and network traffic characteristics vary significantly across
organizations, and an IDS trained on a specific dataset may struggle to maintain high
accuracy in diverse conditions. To improve generalizability, future research should
consider evaluating RT-IDS across multiple datasets, including CICIDS2017 and UNSW-
NB15, which provide a broader range of modern attack scenarios. Another limitation is
that the study primarily focuses on detecting only two types of attacks—DoS and
Probe—while neglecting other critical intrusion types, such as User to Root (U2R) and
Remote to Local (R2L) attacks. These attacks often involve sophisticated privilege
escalation techniques and require specialized detection mechanisms, which the current
RT-IDS framework does not address.
Another potential drawback is the reliance on the post-processing phase for refining
detection results. While this approach successfully reduces false alarms, it introduces a
dependency on grouped detection results, which may delay immediate responses to
critical threats. In time-sensitive cybersecurity incidents, a delay of even a few seconds
can have significant consequences, allowing attackers to exploit vulnerabilities before
countermeasures are deployed. Future enhancements should focus on integrating real-
time adaptive learning mechanisms that enable IDS models to make instant decisions
without relying heavily on post-processing corrections. Additionally, incorporating
explainable AI (XAI) techniques would improve model transparency, helping security
analysts understand the reasoning behind each detection decision. This would enhance
trust in machine learning-based IDS solutions and facilitate their adoption in enterprise
environments.
In conclusion, the study presents a highly effective real-time intrusion detection system
that leverages machine learning techniques to achieve high detection accuracy and
operational efficiency. The structured approach of feature selection, classification, and
post-processing ensures a robust detection mechanism, while real-world evaluation on the
RLD09 dataset demonstrates its practical applicability. However, limitations such as
dataset generalizability, exclusion of certain attack types, lack of deep learning
integration, and scalability concerns must be addressed to enhance the system’s
effectiveness. Future research should focus on integrating advanced machine learning
models, evaluating IDS performance under large-scale network conditions, and
developing adaptive mechanisms for detecting emerging threats in real-time. As cyber
threats continue to evolve, improving IDS frameworks with intelligent, scalable, and
interpretable machine learning approaches will be crucial for maintaining secure digital
infrastructures.
[5] Intrusion Detection Systems (IDS) play a crucial role in modern cybersecurity by
identifying and mitigating network intrusions in real time. The increasing complexity and
frequency of cyber threats necessitate the adoption of machine learning (ML) techniques
to enhance the adaptability and efficiency of IDS frameworks. This paper presents a
comprehensive analysis of various machine learning classifiers, comparing their
performance when combined with feature selection techniques. The study focuses on
optimizing IDS performance by evaluating different feature selection methods and
classification algorithms. By employing a systematic approach, the research provides
valuable insights into the most effective combinations of ML models for intrusion
detection.
One of the key strengths of this study is its emphasis on feature selection, which plays a
vital role in improving the efficiency of ML-based IDS. Feature selection techniques help
in reducing dimensionality, eliminating redundant attributes, and improving
computational efficiency. The research identifies the Information Gain Ratio (IGR) as an
effective feature selection method, enabling the identification of the most relevant
network traffic features that contribute to accurate intrusion detection. The k-Nearest
Neighbors (k-NN) classifier is highlighted as one of the best-performing models,
demonstrating high detection accuracy when paired with the IGR feature selection
technique. This combination is particularly effective in reducing false positives and
improving classification precision, making it suitable for real-world IDS deployments.
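To make the Information Gain Ratio concrete, the sketch below divides a feature's information gain by its split information (the entropy of the feature's own values). This penalizes features that take many distinct values. It assumes discrete features. The function names and the toy `service` and `flow_id` columns are hypothetical and do not reproduce the paper's code.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total) for n in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of a feature divided by its split information."""
    total = len(labels)
    conditional = 0.0
    for value in set(values):
        subset = [lab for v, lab in zip(values, labels) if v == value]
        conditional += len(subset) / total * entropy(subset)
    gain = entropy(labels) - conditional
    split_info = entropy(values)  # entropy of the feature's own value distribution
    return gain / split_info if split_info > 0 else 0.0


labels = ["attack", "attack", "attack", "normal", "normal", "normal"]
service = ["http", "http", "http", "dns", "dns", "dns"]   # two values, separates classes
flow_id = ["a", "b", "c", "d", "e", "f"]                  # unique per record

print(gain_ratio(service, labels), gain_ratio(flow_id, labels))
```

Both toy features have the same raw information gain of one bit, but the per-record identifier is ranked lower once the gain is normalized. This is the behavior that makes the ratio a safer selection criterion than raw gain.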
To ensure the reliability of the findings, the study employs a five-fold cross-validation
method, a widely accepted approach in ML research. Cross-validation enhances the
robustness of the results by ensuring that the model's performance is evaluated across
multiple data splits, reducing the likelihood of overfitting. This methodological rigor
strengthens the credibility of the study and provides a solid foundation for future IDS
research. The utilization of the NSL-KDD dataset for training and evaluation further
enhances the study’s practical relevance. NSL-KDD is a refined version of the traditional
KDD99 dataset, designed to address the redundancy and imbalance issues of its
predecessor. By focusing on this dataset, the study ensures that the models are tested on
realistic intrusion scenarios, making the findings more applicable to practical IDS
implementations.
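The five-fold cross-validation procedure mentioned above can be sketched in plain Python with a small k-NN classifier. This is an illustration only. The synthetic two-cluster data, the unstratified fold assignment, and the helper names `knn_predict` and `cross_validate` are assumptions. The study itself evaluates on NSL-KDD.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, point, k=3):
    """Majority vote among the k training points closest to `point` (Euclidean)."""
    nearest = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], point))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def cross_validate(xs, ys, folds=5, k=3, seed=0):
    """Return the accuracy on each of `folds` held-out partitions."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    scores = []
    for f in range(folds):
        test_idx = order[f::folds]                      # every folds-th shuffled sample
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        hits = sum(knn_predict(train_x, train_y, xs[i], k) == ys[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return scores


# Synthetic two-feature records (hypothetical): two well-separated clusters.
rng = random.Random(42)
xs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
xs += [(rng.gauss(6, 1), rng.gauss(6, 1)) for _ in range(50)]
ys = ["normal"] * 50 + ["attack"] * 50

scores = cross_validate(xs, ys)
print(sum(scores) / len(scores))
```

Each record is used for testing exactly once and for training four times. Reporting the mean and the spread of the five scores gives a more stable estimate than a single train/test split.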
However, despite its merits, the study has several limitations that must be considered. A
major drawback is its exclusive reliance on the NSL-KDD dataset, which, although
widely used in IDS research, has been criticized for not being fully representative of
modern cyber threats. While the dataset provides a structured way to evaluate IDS
models, it lacks real-world attack diversity, particularly emerging threats and zero-day
vulnerabilities that are prevalent in contemporary cybersecurity landscapes. The study
does not extend its evaluation to more comprehensive and modern datasets, such as
CICIDS2017 or UNSW-NB15, which contain realistic traffic patterns and attack
behaviors. As a result, the generalizability of the study’s findings to real-world scenarios
remains limited.
Additionally, the research does not explore ensemble learning methods, which have been
shown to enhance IDS accuracy by combining multiple classifiers. Ensemble methods,
such as Random Forest, Gradient Boosting, and XGBoost, have proven highly effective
in IDS tasks by leveraging the strengths of multiple base models to improve prediction
reliability. The absence of deep learning techniques is another limitation, as models such
as Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM)
networks have demonstrated superior performance in handling complex network traffic
patterns. The study’s exclusion of these advanced techniques leaves a gap in
understanding the full potential of ML in intrusion detection.
Another critical limitation of the paper is the lack of computational cost analysis for
different model and feature selection combinations. While ML-based IDS solutions offer
high detection accuracy, their practical deployment is often constrained by resource
limitations, particularly in environments with limited processing power and memory. A
detailed analysis of computational efficiency would have provided valuable insights into
the feasibility of deploying the proposed IDS models in real-world settings, especially in
low-resource environments such as embedded systems and IoT networks.
Furthermore, while the study identifies k-NN as an effective classifier, it does not address
the interpretability challenges associated with this model. k-NN operates as a distance-
based classifier, making it less transparent in explaining the decision-making process
compared to rule-based classifiers like Decision Trees. Interpretability is a crucial aspect
of IDS, as security analysts need to understand and validate model predictions to ensure
trust in the system. Without insights into how the classifier arrives at its decisions, IDS
solutions may face resistance from security professionals who require explainable AI
models for critical threat assessments.
Despite these limitations, the study makes a significant contribution to IDS research by
systematically evaluating feature selection techniques and ML classifiers. Its findings
serve as a foundation for future research, guiding the selection of feature reduction
methods and classification algorithms for developing efficient IDS frameworks. Moving
forward, addressing the identified limitations would enhance the study’s impact. Future
research should incorporate multiple datasets to improve generalizability, explore
ensemble and deep learning approaches for enhanced accuracy, analyze computational
costs for practical deployment, and integrate explainability techniques to improve model
interpretability.
[6] Intrusion Detection Systems (IDS) play a critical role in modern cybersecurity,
identifying unauthorized access and malicious activities in networks. With the increasing
sophistication of cyber threats, traditional rule-based intrusion detection techniques are no
longer sufficient. Machine learning (ML) has emerged as a powerful solution to enhance
IDS, offering improved accuracy, adaptability, and automation. ML-based IDS can
analyze large volumes of network traffic, detect complex attack patterns, and differentiate
between normal and anomalous behaviors more effectively than traditional systems.
One of the primary advantages of using ML for intrusion detection is its ability to identify
both known and unknown attacks. Conventional IDS typically rely on signature-based
detection, where predefined attack signatures are matched against incoming traffic.
However, this approach struggles against zero-day attacks or previously unseen threats.
ML, particularly through supervised and unsupervised learning techniques, can generalize
from existing attack patterns and detect anomalies that do not match predefined rules.
Advanced approaches such as ensemble learning and hybrid classifiers further enhance
detection accuracy by combining multiple models, thereby reducing false positives and
improving reliability.
Scalability is a crucial factor in modern network security, given the massive volumes of
data generated by enterprises and cloud-based environments. ML techniques such as
clustering, decision trees, and neural networks are well-suited to handle these large
datasets. They can process high-dimensional network traffic efficiently, making them
ideal for large-scale IDS deployments. Feature selection and dimensionality reduction
techniques like Information Gain and Principal Component Analysis (PCA) also help optimize performance by
reducing computational overhead while preserving critical information. Moreover, ML
models can be tailored to specific network environments, optimizing their performance
for different security requirements.
Despite these benefits, ML-based IDS face several challenges that need to be addressed.
One of the major drawbacks is the high computational cost associated with training and
deploying complex ML models. Advanced techniques such as Support Vector Machines
(SVM), Deep Neural Networks (DNN), and Genetic Algorithms require substantial
processing power and memory, making them impractical for resource-constrained
environments. Organizations deploying ML-based IDS must balance detection accuracy
with computational efficiency to ensure real-time threat mitigation without overwhelming
system resources.
containing redundant or outdated attack patterns that do not reflect modern threats. To
address this issue, more realistic datasets such as CICIDS2017 and UNSW-NB15 have
been introduced, but data collection and labeling remain time-consuming and resource-
intensive tasks.
False positives and false negatives continue to be persistent concerns in ML-based IDS. A
false positive occurs when legitimate network activity is incorrectly classified as an
attack, leading to unnecessary alerts and potential disruptions. Conversely, a false
negative happens when an actual intrusion goes undetected, posing significant security
risks. Balancing sensitivity and specificity is crucial to minimizing these errors.
Techniques such as threshold tuning, cost-sensitive learning, and anomaly detection
refinements help improve the precision of ML-based IDS, but achieving a perfect balance
remains a challenge.
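The trade-off between false positives and false negatives can be shown with a short threshold-tuning sketch. The scores, labels, candidate thresholds, and the cost weights `fn_cost` and `fp_cost` are hypothetical values chosen for illustration. The cost-weighted choice is one simple assumption about how an operator might prefer missed attacks to be penalized more heavily than false alarms.

```python
def error_rates(scores, labels, threshold):
    """Return (FPR, FNR) when scores >= threshold are flagged as attacks.

    labels: 1 for attack, 0 for normal traffic.
    """
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    negatives = labels.count(0)
    positives = labels.count(1)
    return fp / negatives, fn / positives


def pick_threshold(scores, labels, candidates, fn_cost=5.0, fp_cost=1.0):
    """Choose the candidate threshold with the lowest weighted error cost."""
    def cost(t):
        fpr, fnr = error_rates(scores, labels, t)
        return fp_cost * fpr + fn_cost * fnr
    return min(candidates, key=cost)


# Hypothetical classifier scores (estimated attack probability) and true labels.
scores = [0.05, 0.10, 0.20, 0.35, 0.45, 0.40, 0.55, 0.70, 0.85, 0.95]
labels = [0,    0,    0,    0,    0,    1,    1,    1,    1,    1]

for t in (0.3, 0.5, 0.7):
    print(t, error_rates(scores, labels, t))
best = pick_threshold(scores, labels, [0.3, 0.5, 0.7])
print(best)
```

Lowering the threshold removes missed attacks at the price of more false alarms. With a missed attack weighted five times as heavily as a false alarm, the lowest candidate threshold is selected here.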
[7] Intrusion detection has become a critical aspect of cybersecurity, given the rising
frequency and sophistication of cyber threats. Traditional methods, including signature-
based and anomaly-based approaches, struggle to keep pace with modern network
attacks. In response, deep learning, particularly Recurrent Neural Networks (RNN), has
emerged as a powerful solution for enhancing Intrusion Detection Systems (IDS). RNN-
based IDS leverage sequential data processing capabilities, making them well-suited for
analyzing time-series network traffic and identifying anomalous patterns indicative of
cyber intrusions.
One of the primary strengths of deep learning models, such as RNN-based intrusion
detection systems (RNN-IDS), is their ability to handle high-dimensional data efficiently.
Traditional machine learning methods like Support Vector Machines (SVM) and
Decision Trees (DT) rely heavily on handcrafted features, making them less adaptable to
evolving attack patterns. RNNs, on the other hand, automatically learn complex patterns
from network traffic data, reducing the need for extensive feature engineering. This
capability enables them to achieve high classification accuracy in both binary and multi-
class intrusion detection scenarios.
crucial for detecting sophisticated attack patterns that evolve over time. Unlike
feedforward neural networks that process data in isolation, RNNs retain information from
previous inputs, allowing them to recognize sequential attack behaviors. This advantage
is particularly beneficial for detecting advanced threats such as distributed denial-of-
service (DDoS) attacks, probing activities, and privilege escalation attempts. Empirical
studies have shown that RNN-based IDS models outperform conventional machine
learning approaches on standard datasets such as NSL-KDD and CICIDS2017,
demonstrating their effectiveness in real-world cybersecurity applications.
Despite these benefits, RNN-based IDS have several challenges and limitations. One of
the primary concerns is the high computational cost associated with training deep
learning models. RNNs require significant processing power, especially when dealing
with large-scale datasets. Training these models without specialized hardware, such as
Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs), can be time-
consuming and resource-intensive. This limitation makes it difficult to deploy deep
learning-based IDS in real-time applications where low-latency detection is crucial.
[8] Intrusion Detection Systems (IDS) have undergone significant advancements over the
years, transitioning from traditional signature-based methods to modern machine learning
(ML) and deep learning (DL) approaches. These intelligent techniques have demonstrated
superior adaptability and efficiency in identifying both known and novel cyber threats.
Researchers have focused on various ML models and hybrid approaches to optimize
intrusion detection performance.
Supervised learning models, such as Decision Trees (DT), Random Forest (RF), k-
Nearest Neighbors (k-NN), and Support Vector Machines (SVM), have been widely
explored for IDS. These techniques leverage labeled datasets to classify network traffic as
normal or malicious. Studies have shown that Random Forest and Decision Trees provide
high detection accuracy while maintaining interpretability, making them suitable for real-
world deployment. Additionally, ensemble methods and boosting techniques further
enhance IDS performance by combining multiple classifiers. However, supervised
models require extensive labeled datasets, which can be a limitation in real-world
scenarios.
environments.
One of the critical challenges in ML-based IDS research is the quality and availability of
datasets. Commonly used datasets, such as KDD99, NSL-KDD, CICIDS2017, and
UNSW-NB15, have limitations in representing modern attack patterns. Older datasets,
like KDD99, contain redundant and outdated attack scenarios, reducing their relevance
for contemporary IDS evaluations. Researchers emphasize the need for more diverse and
realistic datasets to improve the generalizability of ML-based IDS models.
[9] Intrusion Detection Systems (IDS) have become an essential component of modern
cybersecurity, helping to identify and mitigate network threats. The integration of
machine learning (ML) techniques into IDS has significantly improved their efficiency,
adaptability, and accuracy in detecting both known and novel attacks. This survey
explores various ML techniques used in IDS, their advantages, limitations, and challenges
in real-world deployment.
Several machine learning models have been employed for intrusion detection, each
offering unique strengths. Supervised learning techniques, such as Decision Trees (DT),
Random Forest (RF), Support Vector Machines (SVM), and k-Nearest Neighbors (k-
NN), rely on labeled datasets to classify network traffic as normal or malicious. These
models are effective in identifying known attack patterns, with studies showing high
detection accuracies, often exceeding 98%. However, supervised learning approaches
require extensive labeled datasets, which may not always be available or representative of
real-world attack scenarios.
To enhance IDS performance, researchers have explored hybrid approaches that combine
multiple ML or DL techniques. Hybrid models, such as RF combined with Neural
Networks or SVM with clustering algorithms, have demonstrated improved accuracy and
robustness in detecting intrusions. These models leverage the strengths of different
techniques to minimize false positives and false negatives. However, hybrid approaches
increase system complexity, requiring more computational power and expert tuning for
optimal performance.
One of the biggest challenges in ML-based IDS research is the availability and quality of
datasets. Commonly used datasets, such as KDD99, NSL-KDD, CICIDS2017, and
UNSW-NB15, provide benchmark testing environments, but many of these datasets fail
to reflect the constantly evolving nature of real-world cyber threats. Older datasets like
KDD99 contain redundant attack samples, limiting their relevance. Additionally, class
imbalance issues in many datasets lead to biased models that struggle to detect minority
attack classes effectively.
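One common mitigation for the class imbalance described above is to weight each class inversely to its frequency during training. The sketch below computes such weights with the heuristic n_samples / (n_classes * class_count), which is the same rule scikit-learn applies for `class_weight="balanced"`. The label counts are hypothetical and do not come from any of the datasets named above.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical label distribution with rare U2R and R2L records.
labels = ["normal"] * 900 + ["dos"] * 80 + ["r2l"] * 15 + ["u2r"] * 5
weights = balanced_class_weights(labels)
for cls, w in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{cls:>6}: {w:.2f}")
```

A misclassified record from the rarest class then counts far more in the training loss than a misclassified normal record, which pushes the model to pay attention to minority attack classes.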
[10] Intrusion Detection Systems (IDS) play a crucial role in identifying and mitigating
cyber threats, and the integration of machine learning (ML) techniques has significantly
improved their efficiency. Machine learning-based IDS can detect anomalous network
activities with high accuracy by analyzing traffic patterns and adapting to evolving
threats. This survey explores various ML techniques used for network intrusion detection,
their advantages, and the challenges associated with their implementation.
Supervised learning approaches have been widely adopted in IDS research due to their
ability to classify network traffic into normal or malicious categories. Random Forest
(RF), Support Vector Machines (SVM), and k-Nearest Neighbors (k-NN) are among the
most commonly used algorithms. Random Forest, in particular, has demonstrated high
accuracy, achieving up to 99.81% in certain studies, making it a preferred choice for
classification tasks. These models utilize historical attack data to learn decision
boundaries and recognize intrusion patterns effectively. However, supervised learning
techniques require large, well-labeled datasets, which can be difficult to obtain and
maintain.
Hybrid approaches that combine multiple ML techniques have been proposed to enhance
IDS performance. For instance, RF combined with Neural Networks or SVM with
clustering algorithms have demonstrated improved accuracy and robustness. These hybrid
models leverage the strengths of different techniques to minimize false positives and false
negatives. However, they increase computational complexity and require extensive tuning
for optimal performance.
One of the major challenges in ML-based IDS research is the availability and quality of
datasets. Widely used datasets such as KDD99, NSL-KDD, CICIDS2017, and UNSW-
NB15 provide benchmarking opportunities, but many fail to represent the dynamic nature
of real-world cyber threats. Some datasets, like KDD99, contain redundant attack
samples, leading to biased models. Additionally, class imbalance issues often result in
poor detection of minority attack types, such as User-to-Root (U2R) and Remote-to-Local
(R2L) intrusions.
[11] Intrusion Detection Systems (IDS) have evolved significantly with the integration of
deep learning techniques, offering enhanced accuracy and adaptability in identifying
cyber threats. Deep learning models, particularly Deep Neural Networks (DNNs),
Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long
Short-Term Memory (LSTM), have shown remarkable capabilities in analyzing complex
patterns in network traffic and detecting anomalous behaviors. These models leverage
multi-layered architectures to extract high-dimensional features, improving detection
rates for both known and unknown cyber-attacks.
Hybrid approaches have also gained attention, combining Network-based IDS (NIDS)
and Host-based IDS (HIDS) to improve overall security coverage. These methods
integrate multiple detection mechanisms, utilizing deep learning for network traffic
analysis while incorporating host-level event monitoring. Such frameworks enhance
detection accuracy by correlating network anomalies with host behaviors, reducing false
positives. Additionally, autoencoders and generative adversarial networks (GANs) have
been employed to detect zero-day attacks by learning normal traffic distributions and
flagging deviations.
Despite their advantages, deep learning-based IDS face several challenges. High
computational costs remain a significant concern, as training deep neural models requires
extensive hardware resources, particularly GPU acceleration, to process large datasets
efficiently. Moreover, deep learning models are data-hungry, requiring vast amounts of
labeled network traffic for training. The availability of quality datasets remains a
challenge, as benchmark datasets like KDDCup 99, NSL-KDD, CICIDS2017, and
UNSW-NB15 do not always represent real-world attack patterns accurately, leading to
potential generalization issues.
[12] Intrusion Detection Systems (IDS) have significantly evolved with the integration of
machine learning (ML) and deep learning (DL) techniques, providing enhanced detection
accuracy and adaptability against cyber threats. Traditional IDS methods, such as
signature-based and rule-based approaches, struggle to detect novel or zero-day attacks.
In contrast, ML and DL techniques offer the advantage of learning patterns from network
traffic and identifying anomalies, making them more efficient in detecting both known
and unknown threats.
Machine learning-based IDS typically use supervised, unsupervised, and hybrid learning
techniques to classify network traffic into normal and malicious categories. Supervised
learning methods such as Support Vector Machines (SVM), Decision Trees (DT),
Random Forest (RF), and k-Nearest Neighbors (k-NN) rely on labeled datasets to train
models for accurate threat detection. These models achieve high accuracy in binary and
multi-class classification scenarios, particularly when applied to benchmark datasets like
NSL-KDD, KDDCup 99, and CICIDS2017.
Deep learning techniques, such as Deep Neural Networks (DNN), Convolutional Neural
Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory
(LSTM), and Autoencoders, have demonstrated superior performance in IDS
applications. Unlike traditional ML methods, deep learning models automatically extract
high-level features from raw network data, reducing the need for extensive feature
engineering.
CNNs are particularly effective in intrusion detection due to their ability to capture spatial
dependencies in network traffic, whereas RNN and LSTM models excel at detecting
sequential attack patterns in time-series data. Studies have shown that LSTM-based IDS
outperform traditional models in identifying attacks like Denial-of-Service (DoS), user-
to-root (U2R), and remote-to-local (R2L) intrusions, achieving high detection accuracy
while maintaining robustness against evolving threats.
Despite their advantages, ML and DL-based IDS face several challenges. Computational
complexity is a significant limitation, as deep learning models require high processing
power and large memory resources to train effectively. This makes real-time intrusion
detection difficult, particularly in environments with limited hardware capabilities.
Additionally, dataset quality and representativeness remain crucial concerns, as existing
benchmark datasets often fail to capture modern, real-world attack scenarios.
Another key challenge is the interpretability of deep learning models. Unlike decision
trees and rule-based classifiers, deep neural networks function as black boxes, making it
difficult for security analysts to understand why a particular alert was triggered. This lack
of explainability limits their adoption in high-security environments where transparency
is essential.
CHAPTER 4
TECHNICAL SPECIFICATION
4.1 OVERVIEW
The implementation of an Intrusion Detection System (IDS) using Machine Learning (ML)
and TinyML requires a well-defined technical foundation. This chapter provides an in-depth
discussion of the hardware and software components, machine learning algorithms, feature
selection methods, dataset specifications, performance evaluation metrics, and deployment
strategies that contribute to the effective implementation of the proposed IDS.
A robust technical specification is essential for ensuring that the IDS functions optimally in
detecting cyber threats with high accuracy, minimal false positives, and real-time
adaptability. Since the IDS must operate efficiently in diverse environments—including
enterprise networks, cloud-based security systems, and low-power IoT devices—its design
must be scalable, computationally efficient, and capable of handling large datasets while
maintaining lightweight execution on constrained hardware.
This chapter details the computational requirements, programming tools, and technical
constraints that influence the design, development, and deployment of the ML-based IDS.
Additionally, it highlights how TinyML enables IDS models to function on low-power edge
devices, ensuring cybersecurity protection for IoT networks and embedded systems.
CPU: Intel Core i7/i9 or AMD Ryzen 7/9 (or equivalent) for high-speed processing of large
datasets.
GPU: NVIDIA RTX 3060/3080 or AMD Radeon RX 6800 (or higher) for accelerated deep
learning training using frameworks like TensorFlow and PyTorch.
RAM: Minimum 16GB (recommended 32GB) to handle large-scale network traffic datasets
efficiently.
Storage: At least 512GB SSD (preferably 1TB NVMe SSD) for fast data access, processing,
and model storage.
Embedded AI Chips: Intel Movidius Neural Compute Stick, Edge TPU, or ARM Cortex-M
processors.
The combination of high-performance computing for training and TinyML for lightweight
deployment ensures that the IDS remains scalable, adaptable, and efficient across diverse
environments.
4.3 SOFTWARE AND DEVELOPMENT TOOLS
4.3.1 Programming Languages
The IDS implementation utilizes multiple programming languages based on the requirements
of data preprocessing, model development, and deployment:
Python: The primary language for machine learning model development, using libraries like
TensorFlow, PyTorch, Scikit-learn, and Pandas.
Bash & PowerShell: Essential for automating data preprocessing, model execution, and
network traffic analysis.
TensorFlow & TensorFlow Lite: Used for ML model training and TinyML deployment.
PyTorch: Alternative deep learning framework for implementing complex neural networks.
Scikit-learn: For implementing traditional ML models such as Random Forest, SVM, and k-NN.
Matplotlib & Seaborn: For data visualization and exploratory analysis.
Wireshark & Zeek (formerly Bro): For capturing and analyzing network traffic data.
Cloud Deployment: AWS, Google Cloud, or Microsoft Azure for scalable, distributed
intrusion detection.
Edge Deployment: TensorFlow Lite for IoT security solutions on edge devices.
Types of Attacks: DDoS, brute-force, botnet, port scanning, phishing, SQL injection, and
more.
Feature Categories:
Basic network features: Source IP, destination IP, protocol type, packet size.
Traffic flow features: Time-based session statistics.
The dataset is preprocessed using feature selection and dimensionality reduction techniques
such as Principal Component Analysis (PCA) and Recursive Feature Elimination (RFE) to
improve model efficiency.
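The two reduction techniques named above can be sketched with scikit-learn as follows. The synthetic data from `make_classification`, the 95% variance target, the choice of Random Forest as the RFE estimator, and the number of retained features are all assumptions for illustration, not settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for flow records: 300 samples, 20 numeric features.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

# PCA: project onto the components that explain 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_pca = pca.fit_transform(X_scaled)

# RFE: repeatedly drop the least important original feature until 8 remain.
rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
          n_features_to_select=8)
X_rfe = rfe.fit_transform(X_scaled, y)

print(X_pca.shape, X_rfe.shape)
print("kept feature indices:", list(rfe.get_support(indices=True)))
```

PCA produces new combined axes, so the original feature names are lost, whereas RFE keeps a subset of the original columns. The second property is useful when analysts need to see which traffic attributes drive a detection.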
Random Forest: Ensemble learning technique for high accuracy and interpretability.
Deep Neural Networks (DNN): Advanced model capable of capturing complex patterns in
network traffic.
4.6 PERFORMANCE EVALUATION METRICS
To assess the effectiveness of the IDS, multiple performance metrics are considered:
False Positive Rate (FPR): Measures the proportion of normal traffic that is misclassified as an attack; a lower FPR means fewer false alarms.
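As a worked illustration, the sketch below computes the FPR from binary predictions, defined as FP / (FP + TN). Accuracy, precision, recall, and F1 are included as commonly reported companion metrics. They are an assumption here, not a list taken from this section. The label vectors and function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive="attack"):
    """Count true/false positives and negatives for a binary IDS decision."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn


def ids_metrics(y_true, y_pred):
    """Derive summary metrics from the four confusion-matrix counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }


y_true = ["attack", "attack", "attack", "normal", "normal", "normal", "normal", "normal"]
y_pred = ["attack", "attack", "normal", "attack", "normal", "normal", "normal", "normal"]
metrics = ids_metrics(y_true, y_pred)
print(metrics)
```

Here one of five normal records is flagged, giving an FPR of 0.2, and one of three attacks is missed, giving a recall of 2/3.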
4.7 CONCLUSION
The technical specifications outlined in this chapter provide a comprehensive framework for
developing a scalable, high-performance IDS using Machine Learning and TinyML. By
integrating optimized hardware, efficient software tools, real-world datasets, and robust ML
models, this research aims to enhance cybersecurity monitoring while ensuring lightweight
deployment for IoT and edge devices. The next chapter will discuss the implementation
details, including model development, training, and experimental results.
CHAPTER 5
DESIGN APPROACH AND DETAILS
5.1 DESIGN APPROACH / MATERIALS & METHODS
The design of an Intrusion Detection System (IDS) using Machine Learning (ML) and
TinyML follows a systematic approach that involves data collection, preprocessing, feature
selection, model training, evaluation, and deployment. Given the complexity of modern cyber
threats, the design focuses on accuracy, efficiency, scalability, and real-time adaptability. The
system is structured into multiple layers to ensure robust threat detection with minimal false
positives.
Data Acquisition Layer: Collects network traffic from live capture tools such as Wireshark
and Zeek, and from benchmark datasets such as the CICIDS datasets.
Feature Extraction Layer: Extracts relevant network features such as packet size, protocol
type, source/destination IP, and time-based traffic patterns.
Preprocessing Layer: Cleans and normalizes data to remove noise and improve ML model
efficiency.
Model Training Layer: Implements ML algorithms like Random Forest, SVM, k-NN, and
DNN for anomaly detection.
Inference Layer: Deploys trained models on cloud servers or TinyML-enabled edge devices
for real-time intrusion detection.
Alert & Response Layer: Generates security alerts when potential threats are detected,
providing actionable insights for network administrators.
This layered approach ensures a modular, scalable, and efficient intrusion detection
mechanism that can operate in enterprise environments, cloud-based infrastructures, and IoT
ecosystems.
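The layers above can be sketched as a single scikit-learn pipeline. This is a simplified illustration under stated assumptions: synthetic records from `make_classification` stand in for captured traffic, only the preprocessing, training, inference, and alert steps are shown, and the step names and variables are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Data acquisition stand-in: synthetic records instead of captured traffic.
X, y = make_classification(n_samples=400, n_features=15, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

ids_pipeline = Pipeline([
    ("preprocess", StandardScaler()),                      # preprocessing layer
    ("model", RandomForestClassifier(n_estimators=100,     # model training layer
                                     random_state=0)),
])
ids_pipeline.fit(X_train, y_train)

# Inference and alert layers: flag records predicted as attacks (label 1).
predictions = ids_pipeline.predict(X_test)
alerts = [i for i, label in enumerate(predictions) if label == 1]
print(f"accuracy={ids_pipeline.score(X_test, y_test):.3f}, alerts={len(alerts)}")
```

Keeping the scaler and the classifier in one object means the same preprocessing is applied at training and inference time, which avoids a common source of train/serve mismatch.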
Dimensionality Reduction: Using PCA and Recursive Feature Elimination (RFE) to remove
irrelevant features.
Anomaly Detection Preprocessing: Labeling attack types and distinguishing between normal
and malicious traffic.
These steps enhance the efficiency and accuracy of ML models, ensuring they generalize
well to real-world network traffic.
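As a rough illustration of the two dimensionality reduction techniques named above, the sketch below applies PCA and RFE to synthetic data. The feature counts, the 95% variance target, and the choice of Random Forest as the RFE estimator are assumptions made for the example, not values taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for network flow features (assumption).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=5, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# PCA: keep as many components as needed to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

# RFE: repeatedly drop the least important feature until 8 remain.
rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=42),
          n_features_to_select=8)
X_rfe = rfe.fit_transform(X_scaled, y)

print("PCA components kept:", X_pca.shape[1])
print("RFE features kept:", X_rfe.shape[1])
```

PCA produces new combined features, while RFE keeps a subset of the original columns, so RFE is usually easier to interpret when explaining which traffic attributes drive a detection.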
Deep Neural Networks (DNN): Advanced learning from complex attack patterns.
Each model undergoes hyperparameter tuning using Grid Search and Bayesian Optimization
to enhance detection capabilities.
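A minimal sketch of the Grid Search step, using scikit-learn's `GridSearchCV` on synthetic data, is given here. The parameter grid, the F1 scoring choice, and the three-fold cross-validation are assumptions for illustration. Bayesian Optimization is not shown because it needs a separate library.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for labelled traffic (assumption).
X, y = make_classification(n_samples=400, n_features=12, n_informative=6,
                           random_state=1)

# Candidate hyperparameters; every combination is cross-validated.
param_grid = {"n_estimators": [25, 50], "max_depth": [4, 8, None]}

search = GridSearchCV(RandomForestClassifier(random_state=1),
                      param_grid, cv=3, scoring="f1")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
```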
This ensures real-time intrusion detection with minimal power consumption, making it
feasible for IoT security.
NIST Cybersecurity Framework: Provides best practices for intrusion detection in enterprise
networks.
GDPR & CCPA Compliance: Ensures that user data is handled with privacy protection
measures.
5.2.2 Networking Protocols & Standards
IEEE 802.3 (Ethernet): Defines data transmission over wired networks.
FAIR Principles: Ensure that datasets and ML models are Findable, Accessible, Interoperable, and Reusable.
By following these codes and standards, the IDS remains robust, ethical, and aligned with
industry best practices.
Scalability Issues: High-speed networks generate large volumes of data, which may exceed
processing capacity without efficient feature selection.
Evolving Cyber Threats: New attack patterns may not be represented in existing datasets,
requiring continuous model updates.
Data Privacy Regulations: Certain network data cannot be publicly shared, limiting dataset
availability for model training.
To address these issues, the IDS integrates semi-supervised learning techniques that can
adapt to unknown attacks using anomaly detection.
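One way to picture anomaly detection for unknown attacks is a detector fitted on benign traffic only, which then flags anything that looks different. The sketch below uses scikit-learn's `IsolationForest` for this; the choice of algorithm, the synthetic data, and the contamination value are assumptions for illustration and are not confirmed as the project's method.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Benign traffic features (synthetic assumption), used for fitting.
normal_train = rng.normal(0.0, 1.0, size=(500, 4))
normal_test = rng.normal(0.0, 1.0, size=(100, 4))
# Previously unseen attack traffic, far from the benign distribution.
attack_test = rng.normal(6.0, 1.0, size=(20, 4))

# Fit on benign traffic only; no attack labels are needed.
detector = IsolationForest(n_estimators=100, contamination=0.01,
                           random_state=0)
detector.fit(normal_train)

# predict() returns +1 for inliers and -1 for outliers.
normal_flags = detector.predict(normal_test)
attack_flags = detector.predict(attack_test)

print("False alarms on benign traffic:", int((normal_flags == -1).sum()))
print("Attacks flagged:", int((attack_flags == -1).sum()), "of", len(attack_test))
```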
Lightweight Models (e.g., Decision Trees, Random Forest): These models run faster and
use less memory but may sacrifice accuracy, leading to potential false positives or negatives.
Solution: The IDS adopts a hybrid approach, where deep learning is used for periodic offline
training, and lightweight ML models are deployed for real-time inference on edge devices.
Federated Learning: Instead of centralizing data, federated learning trains IDS models across
multiple network nodes, enhancing security while preserving data privacy.
Edge AI Processing: TinyML-enabled IDS performs on-device processing, reducing
dependency on cloud-based analytics and improving response time.
Hybrid IDS Models: A combination of signature-based detection (for known attacks) and
ML-based anomaly detection (for novel threats) improves overall effectiveness.
Efficient Feature Selection: Reducing model complexity while preserving key attack
indicators.
Incremental Learning: Updating IDS models dynamically without retraining from scratch.
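The incremental learning alternative listed above can be sketched with scikit-learn's `partial_fit`, which updates a model batch by batch instead of retraining from scratch. The `SGDClassifier` model, the batch size of 200, and the synthetic stream are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic traffic: the first 1000 rows arrive as a stream,
# the last 200 are held out for evaluation (assumption).
X, y = make_classification(n_samples=1200, n_features=10, n_informative=6,
                           random_state=3)
X_stream, y_stream = X[:1000], y[:1000]
X_holdout, y_holdout = X[1000:], y[1000:]

model = SGDClassifier(random_state=3)
classes = np.unique(y)  # partial_fit must be told all classes up front

scores = []
for start in range(0, len(X_stream), 200):
    batch = slice(start, start + 200)
    # Update the existing weights with the new batch only.
    model.partial_fit(X_stream[batch], y_stream[batch], classes=classes)
    scores.append(model.score(X_holdout, y_holdout))

print("Hold-out accuracy after each batch:", [round(s, 3) for s in scores])
```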
5.4 CONCLUSION
This chapter provided an extensive discussion on the design approach, coding standards,
constraints, alternatives, and tradeoffs for the ML-based Intrusion Detection System. The IDS
is designed to be scalable, efficient, and adaptable, ensuring high detection accuracy while
optimizing computational efficiency. By leveraging TinyML, federated learning, and hybrid
ML techniques, the system is well-equipped to handle evolving cyber threats across
enterprise, cloud, and IoT environments.
CHAPTER 6
6.1 OVERVIEW
• Project Phases (Planning, Data Collection, Model Training, Testing, and Deployment)
This structured timeline ensures that all components of the IDS project are completed on
schedule and aligned with performance expectations.
The development of the IDS is divided into six main phases, each with its own timeline
and deliverables.
Phase | Description | Estimated Duration
Phase 6: Documentation & Final Report | Compiling results, writing dissertation, reviewing findings, final submission | 2 Weeks
Each phase includes specific tasks and milestones, ensuring a logical progression toward
the final IDS implementation.
6.3 TASKS AND SUBTASKS
Each phase consists of multiple tasks and subtasks that contribute to the overall
development of the IDS. Below is a detailed breakdown.
• Testing multiple models: Random Forest, SVM, k-NN, DNN, Hybrid Models
• Converting models into TensorFlow Lite format for lightweight execution
Each task is tracked using project management tools (JIRA, Trello) to ensure progress is
maintained.
Milestones are critical checkpoints to measure progress and ensure the project is on track.
Below are the major milestones:
Milestone Deliverable Deadline
• Solution: Use SMOTE to generate synthetic data for rare attack classes.
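The idea behind SMOTE can be illustrated with a simplified sketch: each synthetic sample is placed on the line segment between a rare-class sample and one of its nearest rare-class neighbours. The `smote_like` function below is a hypothetical teaching version written for this example, not the reference algorithm's full implementation; in practice the `SMOTE` class from the imbalanced-learn package would normally be used.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
    _, neighbours = nn.kneighbors(minority)
    base = rng.integers(0, len(minority), size=n_new)
    # Column 0 is the point itself, so pick from columns 1..k.
    picked = neighbours[base, rng.integers(1, k + 1, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the segment, in [0, 1)
    return minority[base] + gap * (minority[picked] - minority[base])


# Twelve samples of a rare attack class (synthetic assumption).
rng = np.random.default_rng(1)
rare_attacks = rng.normal(3.0, 0.5, size=(12, 4))

synthetic = smote_like(rare_attacks, n_new=40)
print("Synthetic rare-class samples:", synthetic.shape)
```

Oversampling should be applied to the training split only, so that synthetic samples never appear in the test data.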
• Issue: TinyML models might suffer accuracy loss after conversion.
CHAPTER 7
CODE:
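import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Load dataset and inspect it
dataset = pd.read_csv('dataset/cicids2017.csv')
print(dataset.head())
print(dataset.shape)
print(dataset.columns)

# Replace infinite values with NaN, then drop incomplete rows
dataset = dataset.replace([np.inf, -np.inf], np.nan).dropna()

# Encode attack labels as integers
le = LabelEncoder()
X = dataset.drop(columns=['Attack Type'])
y = le.fit_transform(dataset['Attack Type'])

# Split dataset (the 80/20 ratio and random_state are assumptions;
# the original values are not shown in this listing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize features (fit on the training split only to avoid leakage)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Define models (only the KNN entry survives in this listing; Random Forest
# is added as an assumption, based on the models named in Chapter 5)
models = {
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Train each model and record its test accuracy
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies[name] = acc
    print(f"{name}: {acc:.4f}")

# Linear SVM, left disabled (requires: from sklearn.svm import SVC)
#model = SVC(kernel='linear')
#model.fit(X_train, y_train)
#y_pred = model.predict(X_test)
#acc = accuracy_score(y_test, y_pred)
#accuracies["SVM"] = acc
#print(f"SVM: {acc:.4f}")

# Print accuracies as percentages
for name, acc in accuracies.items():
    print(f"{name}: {acc*100:.4f}")

# Plot accuracies (a bar chart is assumed; the plotting call is missing here)
plt.figure(figsize=(10, 5))
plt.bar(list(accuracies.keys()), list(accuracies.values()))
plt.xlabel("ML Algorithms")
plt.ylabel("Accuracy")
plt.title("Accuracy Comparison of ML Models for IDS")
plt.xticks(rotation=45)
plt.show()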
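
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Load dataset
df = pd.read_csv('dataset/cicids2017.csv')
if 'Index' in df.columns:
    df.drop(columns=['Index'], inplace=True)

# Handle missing and infinite values
df = df.replace([np.inf, -np.inf], np.nan).dropna()

# Encode attack labels as integers
label_encoder = LabelEncoder()
X = df.drop(columns=['Attack Type'])
y = label_encoder.fit_transform(df['Attack Type'])

# Split dataset (80/20 ratio and random_state are assumptions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize features (fit on the training split only to avoid leakage)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Small dense network, kept compact for TinyML deployment
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(X_train.shape[1],)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(len(np.unique(y)), activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train model (epochs, batch_size and validation_split are assumptions)
history = model.fit(X_train, y_train, epochs=10, batch_size=32,
                    validation_split=0.1)

# Evaluate model
loss, accuracy = model.evaluate(X_test, y_test)
print("Accuracy:", accuracy)
y_pred = np.argmax(model.predict(X_test), axis=1)
print(classification_report(y_test, y_pred))

# Convert to TensorFlow Lite (the output file name is an assumption)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with open('ids_model.tflite', 'wb') as f:
    f.write(tflite_model)

# Visualization
plt.figure(figsize=(10, 5))
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
plt.imshow(cm, cmap='Blues')
plt.colorbar()
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()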
7.2 SCREENSHOTS:
7.3 SYSTEM ARCHITECTURE
CHAPTER 8
The results of the IDS project are analyzed based on model performance, detection
accuracy, real-time deployment efficiency, and scalability.
The effectiveness of the IDS is measured using key performance indicators such as
accuracy, precision, recall, F1-score, and ROC-AUC. The results for different ML models
are summarized below:
Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | ROC-AUC Score
Support Vector Machine (SVM) | 97.5 | 96.2 | 97.1 | 96.6 | 0.982
Deep Neural Network (DNN) | 98.8 | 97.9 | 98.3 | 98.1 | 0.992
Hybrid Model (DNN + RF) | 99.6 | 99.1 | 99.3 | 99.2 | 0.999
These results demonstrate that hybrid models (DNN + Random Forest) achieve the
highest accuracy, ensuring better detection of cyber threats with minimal false positives.
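For reference, the five indicators reported above can be computed with scikit-learn as sketched below. The data here is synthetic and the classifier is a plain Random Forest chosen for the example, so the printed values are illustrative only and are not the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Imbalanced synthetic traffic: about 20% attacks (assumption).
X, y = make_classification(n_samples=800, n_features=12, n_informative=6,
                           weights=[0.8, 0.2], random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=7)

clf = RandomForestClassifier(n_estimators=50, random_state=7)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)               # hard labels
y_score = clf.predict_proba(X_test)[:, 1]  # attack probability, for ROC-AUC

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_score),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision reflects the false-positive rate of alerts, while recall reflects missed attacks, so both should be read together when comparing models on imbalanced traffic.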
8.1.2 Comparison with Traditional IDS
This advantage makes ML-based IDS superior, as it is more effective in handling
zero-day attacks.
2. Reducing False Positives
Although the IDS performs well, further improvements can be made in:
CHAPTER 9
SUMMARY
This project focuses on developing an Intrusion Detection System (IDS) using machine
learning (ML) techniques to enhance network security by identifying malicious activities.
Traditional IDS methods, such as signature-based detection, struggle with unknown
threats and require frequent updates. In contrast, ML-based IDS can automatically learn
patterns from network traffic and detect both known and unknown intrusions effectively.
The project utilizes supervised learning algorithms like Random Forest (RF), Support
Vector Machine (SVM), and K-Nearest Neighbors (KNN), as well as deep learning
models such as Recurrent Neural Networks (RNN) and Long Short-Term Memory
(LSTM) to classify network traffic as normal or malicious. The CICIDS 2017 dataset,
containing real-world attack scenarios, is used for training and evaluation.
Key challenges addressed include reducing false positives, improving detection accuracy,
and optimizing computational efficiency. The project also explores feature selection
techniques to enhance model performance while maintaining scalability. The final IDS
model aims to provide a real-time, efficient, and adaptable security solution capable of
defending against evolving cyber threats in modern network environments.
CHAPTER 10
REFERENCES
[1] Sajid, M., Malik, K. R., Almogren, A., Malik, T. S., Khan, A. H., Tanveer, J., & Rehman,
A. U. (2024). Enhancing intrusion detection: a hybrid machine and deep learning approach.
Journal of Cloud Computing, 13(1), 123.
[2] Aljehane, N. O., Mengash, H. A., Hassine, S. B., Alotaibi, F. A., Salama, A. S., &
Abdelbagi, S. (2024). Optimizing intrusion detection using intelligent feature selection with
machine learning model. Alexandria Engineering Journal, 91, 39-49.
[3] Mishra, P., Varadharajan, V., Tupakula, U., & Pilli, E. S. (2018). A detailed investigation
and analysis of using machine learning techniques for intrusion detection. IEEE
Communications Surveys & Tutorials, 21(1), 686-728.
[4] Sangkatsanee, P., Wattanapongsakorn, N., & Charnsripinyo, C. (2011). Practical real-time
intrusion detection using machine learning approaches. Computer Communications, 34(18),
2227-2235.
[5] Biswas, S. K. (2018). Intrusion detection using machine learning: A comparison study.
International Journal of Pure and Applied Mathematics, 118(19), 101-114.
[6] Tsai, C. F., Hsu, Y. F., Lin, C. Y., & Lin, W. Y. (2009). Intrusion detection by machine
learning: A review. Expert Systems with Applications, 36(10), 11994-12000.
[7] Yin, C., Zhu, Y., Fei, J., & He, X. (2017). A deep learning approach for intrusion
detection using recurrent neural networks. IEEE Access, 5, 21954-21961.
[8] Chowdhury, M. N., Ferens, K., & Ferens, M. (2016). Network intrusion detection using
machine learning. In Proceedings of the International Conference on Security and
Management (SAM) (p. 30). The Steering Committee of The World Congress in Computer
Science, Computer Engineering and Applied Computing (WorldComp).
[9] Wagh, S. K., Pachghare, V. K., & Kolhe, S. R. (2013). Survey on intrusion detection
system using machine learning techniques. International Journal of Computer Applications,
78(16), 30-37.
[10] Thaseen, I. S., Poorva, B., & Ushasree, P. S. (2020, February). Network intrusion
detection using machine learning techniques. In 2020 International conference on emerging
trends in information technology and engineering (IC-ETITE) (pp. 1-7). IEEE.
[11] Vinayakumar, R., Alazab, M., Soman, K. P., Poornachandran, P., Al-Nemrat, A., &
Venkatraman, S. (2019). Deep learning approach for intelligent intrusion detection system.
IEEE Access, 7, 41525-41550.
[12] Liu, H., & Lang, B. (2019). Machine learning and deep learning methods for intrusion
detection systems: A survey. Applied Sciences, 9(20), 4396.