
NETWORK BASED ATTACK CLASSIFICATION WITH
HYPER-PARAMETER TUNING AND ENHANCED
PREDICTION TECHNIQUE
A PROJECT REPORT

Submitted by

MOHANA SRI .T (422221104022)

PACHAIAMMAL .P.V (422221104023)

PAVITHRA .A (422221104025)

In partial fulfillment for the award of the degree

Of

BACHELOR OF ENGINEERING

in

COMPUTER SCIENCE AND ENGINEERING

SCHOOL OF ENGINEERING AND TECHNOLOGY

SURYA GROUP OF INSTITUTIONS

VIKIRAVANDI - 605 652

ANNA UNIVERSITY: CHENNAI 600 025

MAY 2025
ANNA UNIVERSITY: CHENNAI 600 025

BONAFIDE CERTIFICATE

Certified that this project “NETWORK BASED ATTACK CLASSIFICATION
WITH HYPER-PARAMETER TUNING AND ENHANCED PREDICTION
TECHNIQUE” is the bonafide work of “T. MOHANASRI (422221104022),
P.V. PACHAIAMMAL (422221104023), A. PAVITHRA (422221104025)” who
carried out the project work under my supervision.

SIGNATURE                                      SIGNATURE
Dr. A. Jagan, M.Tech., Ph.D.                   Mrs. B. Lakshmidevi, M.Tech.,
HoD & Vice Principal                           Supervisor
Associate Professor                            Assistant Professor
Dept. of Computer Science & Engineering        Dept. of Computer Science & Engineering
Surya Group of Institutions                    Surya Group of Institutions
School of Engineering and Technology           School of Engineering and Technology

Vikiravandi – 605 652.                         Vikiravandi – 605 652.

Submitted for the project viva voce examination held on ………………………

INTERNAL EXAMINER EXTERNAL EXAMINER


ACKNOWLEDGEMENT

We express our sincere thanks to our honorable Chairman
Dr. P. Gauthama Sigamani, M.S. (Ortho) and to our beloved Secretary
Dr. P. Ashok Sigamani, M.D. (Paed.) for their excellent encouragement in the
completion of this innovative project.

We extend our heartfelt thanks to our Principal Dr. M. Sankar, M.E., Ph.D.,
for giving us the opportunity to do the project.

We are highly indebted to Dr. A. Jagan, M.Tech., Ph.D., HoD & Vice
Principal, Department of Computer Science and Engineering, for providing
valuable insights into the subject and helping us wherever possible.

We would like to express our profound gratitude and sincere thanks to
Mrs. B. Lakshmidevi, M.Tech., Assistant Professor (Sr.G), Department of
Computer Science and Engineering, for her valuable guidance, suggestions, and
timely supervision for the successful completion of the project and for her
continuous support.

Finally, we also extend our heartfelt thanks to all our staff members and
friends who rendered their valuable help in making this project successful.
TABLE OF CONTENTS

CHAPTER NO    TITLE                                             PAGE NO

              LIST OF FIGURES                                        iv
              LIST OF ABBREVIATIONS                                   v
              ABSTRACT                                                1
1             INTRODUCTION                                            2
              1.1 BACKGROUND OF INTRUSION DETECTION                   2
              1.2 PROBLEM STATEMENT                                   3
              1.3 OBJECTIVES OF THE STUDY                             4
              1.4 SCOPE OF THE STUDY                                  5
              1.5 SIGNIFICANCE OF THE STUDY                           5
2             SYSTEM ANALYSIS                                         6
              2.1 LITERATURE SURVEY                                   6
                  2.1.1 Traditional Intrusion Detection Systems       6
                  2.1.2 Evaluation of Machine Learning in IDS         6
                  2.1.3 Use of CIC-IDS2017 Dataset                    7
                  2.1.4 Comparative Analysis of Algorithms            7
                  2.1.5 Dimensionality Reduction and Feature
                        Engineering                                   8
                  2.1.6 Gaps Identified in Existing Research          8
              2.2 EXISTING SYSTEM                                     9
                  2.2.1 Disadvantages                                 9
              2.3 PROPOSED SYSTEM                                     9
                  2.3.1 Advantages                                    9
3             SYSTEM ARCHITECTURE                                    10
              3.1 SYSTEM ARCHITECTURE DESIGN                         10
              3.2 UML DIAGRAMS                                       11
                  3.2.1 Use Case Diagram                             11
                  3.2.2 Class Diagram                                12
                  3.2.3 Activity Diagram                             13
                  3.2.4 Data Flow Diagram                            14
4             SYSTEM SPECIFICATION                                   15
              4.1 HARDWARE REQUIREMENTS                              15
              4.2 SOFTWARE REQUIREMENTS                              15
5             SYSTEM DESIGN                                          16
              5.1 SYSTEM MODULES                                     16
                  5.1.1 Data Collection Module                       16
                  5.1.2 Feature Extraction and Selection Module      17
                  5.1.3 Detection Engine Module                      18
                  5.1.4 Alert and Response Module                    18
                  5.1.5 Reporting and Analysis Module                19
6             SYSTEM IMPLEMENTATION                                  20
              6.1 PROBLEM DEFINITION                                 20
              6.2 IMPLEMENTATION                                     21
7             SYSTEM TESTING                                         23
              7.1 TESTING OBJECTIVES                                 23
              7.2 TYPES OF TESTING                                   23
                  7.2.1 Unit Testing                                 23
                  7.2.2 Functional Testing                           23
                  7.2.3 Integration Testing                          24
              7.3 TESTING STRATEGIES                                 24
                  7.3.1 White Box Testing                            24
                  7.3.2 Black Box Testing                            24
              7.4 ACCEPTANCE TESTING                                 25
                  7.4.1 Objective                                    25
                  7.4.2 Acceptance Criteria                          25
                  7.4.3 Testing Procedure                            25
8             CONCLUSION                                             27
9             FUTURE ENHANCEMENT                                     28
              APPENDIX 1                                             29
              APPENDIX 2                                             56
              REFERENCES                                             62
LIST OF FIGURES

FIGURE NO    FIGURE DESCRIPTION                        PAGE NO

1.1          Background of Intrusion Detection               3
3.1          Architecture Diagram                           10
3.2.1        Use Case Diagram                               11
3.2.2        Class Diagram                                  12
3.2.3        Activity Diagram                               13
3.2.4        Data Flow Diagram                              14
6.2          Implementation                                 21

LIST OF ABBREVIATIONS

IDS Intrusion Detection System

ML Machine Learning

CIC-IDS2017 Canadian Institute for Cybersecurity - Intrusion Detection System 2017
PCA Principal Component Analysis

SVM Support Vector Machine

KNN K-Nearest Neighbors

LR Logistic Regression

ROC Receiver Operating Characteristic

AUC Area Under Curve

TP True Positive

TN True Negative

FP False Positive

FN False Negative

F1-Score F1 Score (Harmonic Mean of Precision and Recall)

GUI Graphical User Interface

CSV Comma-Separated Values

ABSTRACT

Intrusion Detection Systems (IDS) are crucial for securing modern networks against
increasing cyber threats. This study utilizes the CIC-IDS2017 dataset, a benchmark dataset
that closely simulates real-world network traffic, to evaluate the performance of traditional
supervised machine learning algorithms in binary classification of normal and attack
traffic. To optimize performance and reduce computational overhead, data preprocessing
techniques such as feature selection and Principal Component Analysis (PCA) are applied.
PCA helps reduce dimensionality while preserving essential information, thereby enabling
more efficient training of classification models.

The machine learning models implemented include Logistic Regression, Support Vector
Machines (SVM), and K-Nearest Neighbors (KNN). Each algorithm is tested with two different
parameter configurations to examine the impact of tuning on model performance, and is
assessed using metrics such as accuracy, precision, recall, F1-score, ROC-AUC, and cross-
validation scores. The results reveal that both Logistic Regression and SVM achieve high
accuracy and generalization capability. This report provides a comprehensive performance
comparison and highlights the potential of classical machine learning models for real-world
intrusion detection tasks, offering insights into model selection and tuning for cybersecurity
applications.

CHAPTER 1
INTRODUCTION

Intrusion detection is a critical component of cybersecurity that involves monitoring


and analyzing network or system activities for signs of unauthorized access, misuse, or
anomalies that may indicate a security breach. By identifying potential threats in real time or
retrospectively, intrusion detection systems (IDS) help organizations protect sensitive data,
maintain system integrity, and respond quickly to malicious behavior. These systems can be
classified into host-based or network-based, and may use signature-based, anomaly-based, or
hybrid detection techniques to identify threats.

As cyberattacks become more sophisticated, the role of effective intrusion detection


grows increasingly vital in safeguarding digital infrastructure. To enhance the performance
and reduce the computational complexity of the models, data preprocessing techniques are
applied, including feature selection and Principal Component Analysis (PCA). These steps
help eliminate redundancy and improve model generalization by reducing dimensionality
without significant loss of information.

This study aims to provide insights into the practical implementation of machine learning
models for intrusion detection, highlighting the trade-offs between different algorithms and
the impact of tuning and dimensionality reduction. The findings serve as a reference for
developing scalable and efficient IDS solutions suitable for real-world deployment.

1.1 BACKGROUND OF INTRUSION DETECTION

As computer networks and internet services have become integral to modern society, they
have also become prime targets for cyberattacks. Organizations and individuals alike are
vulnerable to threats such as malware, phishing, denial-of-service (DoS) attacks, and data breaches.
To counter these threats, security mechanisms are employed to monitor, detect, and respond to
unauthorized activities.
Figure 1.1 Background of Intrusion Detection

Intrusion Detection Systems are designed to monitor network or system activities for
malicious behavior or policy violations. IDS solutions typically fall into two main categories:
Signature-based IDS and Anomaly-based IDS. Signature-based systems rely on predefined
patterns of known threats, offering high accuracy for previously encountered attacks but failing
to detect novel or evolving threats. In contrast, anomaly-based IDS uses statistical or machine
learning techniques to detect deviations from normal behavior, allowing for the identification of
zero-day and sophisticated attacks.

With the increasing volume and complexity of network traffic, manual monitoring and
static rule-based detection are no longer sufficient. This challenge has led to the integration of
machine learning and data mining techniques in intrusion detection. Machine learning models
can learn from historical data to distinguish between normal and malicious behavior, adapting to
new attack patterns over time. These models offer the potential for more accurate, scalable, and
real- time intrusion detection.

To support research and development in this area, several datasets have been created,
with the CIC-IDS2017 dataset standing out for its realistic simulation of modern network traffic
and labeled attack scenarios. It includes various types of attacks such as DDoS, brute force,
infiltration, and botnet traffic, providing a valuable resource for evaluating and training machine
learning models.

1.2 PROBLEM STATEMENT

These systems often rely on predefined attack patterns or signatures, which means they
are only effective at identifying attacks that match known signatures. As cyberattacks evolve and
new methods are continually developed by malicious actors, this approach proves insufficient in
detecting novel and complex attack types.

Moreover, many existing IDS solutions suffer from high false positive rates, where
benign activities are incorrectly flagged as malicious, leading to unnecessary alarms and resource
wastage. This not only impacts the performance of the system but also hampers the productivity
of security personnel, who must sift through large volumes of false alerts.

The CIC-IDS2017 dataset, a publicly available network traffic dataset, provides a


valuable resource for training and testing intrusion detection systems. This dataset contains labeled
instances of both benign and malicious network traffic, including various types of attacks such as
DoS (Denial of Service), DDoS (Distributed Denial of Service), and various probing and
information- gathering attacks.

Thus, the core problem addressed in this study is the development of an effective
machine learning-based intrusion detection system that can accurately detect both known and
unknown attacks, reduce false positives and false negatives, and provide an efficient solution for
real-time network monitoring. This study aims to provide a comparative analysis of their
performance in detecting network intrusions and contribute to the improvement of network
security systems.

1.3 OBJECTIVES OF THE STUDY


The primary objective of this study is to develop and evaluate a machine learning-based
Intrusion Detection System (IDS) using the CIC-IDS2017 dataset to identify and classify various
network intrusions. This study aims to explore the effectiveness of machine learning algorithms
in enhancing the performance of IDS systems, with particular focus on improving detection
accuracy and minimizing false positive and false negative rates. The key objectives can be
outlined as follows:

• Analyze the CIC-IDS2017 dataset for network traffic patterns
• Preprocess and optimize the dataset for machine learning
• Apply and evaluate multiple machine learning algorithms for intrusion detection.
• Identify the most effective machine learning model for real-time intrusion detection
• Contribute to the advancement of intrusion detection technologies

1.4 SCOPE OF THE STUDY

The scope of this study is centered on the development, application, and evaluation of
machine learning techniques for detecting intrusions in computer networks, using the CIC-
IDS2017 dataset. As cyber threats continue to grow in complexity and frequency, traditional
intrusion detection systems are often unable to keep up. This study aims to bridge that gap by
exploring intelligent, data-driven approaches that can identify both known and unknown attacks
with high accuracy and low false alarm rates.

The study focuses on five major algorithms such as Logistic Regression, Support Vector
Machines (SVM), Random Forest, Decision Tree, and K-Nearest Neighbors (KNN) to
evaluate their performance in both binary classification (attack vs. normal) and multi-class
classification.

The dataset used, CIC-IDS2017, is a widely accepted benchmark in cybersecurity


research, containing realistic and diverse traffic data that mimic real-world scenarios. It includes
several types of network attacks such as DDoS, Brute Force, Port Scans, and Infiltration attacks.
The scope of the study includes data cleaning, normalization, feature selection, dimensionality
reduction using Principal Component Analysis (PCA), and model evaluation using metrics such
as accuracy, precision, recall, and F1-score.

This study is intended for academic research, providing a foundation for future
implementations in practical cybersecurity systems. The findings can assist network
administrators, cybersecurity analysts, and developers in choosing appropriate algorithms for
building efficient IDS solutions

CHAPTER 2

SYSTEM ANALYSIS

2.1 LITERATURE SURVEY

2.1.1 TRADITIONAL INTRUSION DETECTION SYSTEMS

Shone, N., Ngoc, S. M., & Loo, J. (2018). A deep learning approach to network intrusion
detection. Proceedings of the International Conference on Neural Networks.

Traditional machine learning models like decision trees, SVMs, and random forests have
shown effectiveness but often require manual feature engineering and struggle with high-
dimensional data. In contrast, deep learning models such as convolutional neural networks,
recurrent neural networks, long short-term memory networks, and autoencoders have demonstrated
superior performance by automatically learning complex patterns from raw traffic data. While
traditional signature-based IDS have served as foundational tools in network security, their
limitations, such as high false-negative rates, reliance on continuous updates, and inability
to detect novel attacks, have driven the development of more advanced detection techniques.

2.1.2 EVALUATION OF MACHINE LEARNING IN IDS

Moustafa, N., & Slay, J. (2015). The evaluation of network anomaly detection systems: Statistical
analysis and benchmarking. Information Security Journal: A Global Perspective, 24(1), 1-18.

Existing literature on the evaluation of network anomaly detection systems highlights the
importance of rigorous statistical analysis and standardized benchmarking for assessing system
performance. Researchers have employed metrics such as detection rate, false positive rate,
precision, recall, and F1-score to quantify the effectiveness of various anomaly detection models.
Several studies have emphasized the need for statistically sound evaluation methods, including
cross-validation, hypothesis testing, and confidence intervals, to avoid overfitting and ensure the
generalizability of results. Moreover, recent works explore the integration of synthetic data
generation and adversarial testing to improve benchmarking relevance.

2.1.3 USE OF CIC-IDS2017 DATASET

Xia, F., Wang, S., & Yang, L. (2015). A deep learning approach for intrusion detection in
industrial control systems. IEEE Transactions on Industrial Informatics, 11(3), 698-707.

The CIC-IDS2017 dataset is distinguished by its variety and realism. It includes data
from multiple attack types such as Denial of Service (DoS), Distributed Denial of Service
(DDoS), botnets, brute-force attacks, and various other malicious activities that are commonly
encountered in real-world networks. Researchers have utilized this dataset to test and evaluate
various machine learning algorithms, such as Random Forest, Support Vector Machines (SVM),
and K-Nearest Neighbors (KNN), which are popular in intrusion detection tasks. These studies have
demonstrated the effectiveness of these models in detecting different types of attacks, with
particularly high detection rates for DoS and DDoS attacks.

2.1.4 COMPARATIVE ANALYSIS OF ALGORITHMS

Alazab, M., Tang, M., & Soni, S. (2020). An overview of machine learning-based intrusion
detection systems. Journal of Network and Computer Applications, 124, 34-43.

The literature on intrusion detection systems (IDS) reveals a broad and growing interest
in leveraging intelligent algorithms to identify and respond to cybersecurity threats. Traditional
IDS techniques often rely on rule-based or signature-based detection, which struggles to keep pace
with evolving attack patterns. In contrast, machine learning (ML) approaches such as decision
trees, support vector machines (SVM), k-nearest neighbors (KNN), naive Bayes, and ensemble
methods have shown improved performance by learning from historical data to detect both
known and unknown intrusions. More recent studies also integrate feature selection,
dimensionality reduction, and unsupervised learning techniques like clustering and anomaly
detection to handle large-scale and imbalanced datasets.

2.1.5 DIMENSIONALITY REDUCTION AND FEATURE ENGINEERING

Sadiq, M., Zaheer, S., & Jan, Z. (2020). A survey on intrusion detection systems in the era
of big data and machine learning. Future Generation Computer Systems, 112, 314-326.

Network traffic data often contains a large number of features, many of which may be
redundant or irrelevant to the detection of intrusions. To address these challenges, researchers have
explored various dimensionality reduction and feature engineering techniques to enhance the
performance of IDS. By applying dimensionality reduction techniques like PCA and performing
targeted feature engineering, IDS models become more efficient in processing large volumes of
network traffic. These techniques not only improve the speed of intrusion detection systems but
also ensure that they remain computationally feasible, especially in resource constrained
environments. They help reduce computational complexity, improve detection accuracy, and
enable real-time processing of network traffic.

2.1.6 GAPS IDENTIFIED IN EXISTING RESEARCH

Hussain, Z., & Hamid, S. (2019). Intrusion detection systems: A comprehensive survey and
future directions. Computer Networks, 149, 29-53.

The literature on intrusion detection systems (IDS) presents a comprehensive overview of


their evolution, classification, and effectiveness in combating a wide range of cyber threats. IDS
are generally categorized into signature-based, anomaly-based, and hybrid systems, each with its
strengths and limitations. Signature-based systems are effective against known threats but fail to
detect novel attacks, while anomaly-based systems can identify unknown patterns but often
suffer from high false positive rates. Recent research highlights the integration of machine learning,
deep learning, and data mining techniques to improve detection accuracy and adaptability.
Studies also examine deployment architectures—host-based, network-based, and cloud-based
IDS—and the role of real-time detection in complex environments.

2.2 EXISTING SYSTEM

The existing system uses the outdated KDD99 and NSL-KDD datasets for network
intrusion detection. These datasets suffer from class imbalance, scalability issues with
large-scale data, and poor real-time adaptability to evolving attack strategies. The system
struggles with outdated attack representations and sophisticated intrusion methods.

2.2.1 DISADVANTAGES
• Accuracy limitations, particularly with imbalanced datasets.

• Inadequate scalability for real-time environments.

• Higher false positive rates in certain cases.

2.3 PROPOSED SYSTEM


The proposed system aims to enhance DDoS attack classification using the CIC-IDS2017
dataset.
1. Key components include:
   The network dataset has two classes, Normal and Anomaly. The main aim is to detect
   anomalies using the labelled data. The system also attempts to detect patterns in normal
   and anomalous traffic without using labels, via unsupervised methods, and to build a
   model that detects abnormal behaviour in each system.
2. Performance metrics:
 Accuracy, Precision, Recall (Sensitivity), F1-Score, ROC-AUC
 Classification report
 Confusion matrix
2.3.1 ADVANTAGES
 High classification accuracy with optimized ML models.
 Effective handling of class imbalance using SMOTE.
 Real-time DDoS detection and classification.
 User-friendly web interface for data uploads and results display.
 Scalable and efficient system design.

CHAPTER 3
SYSTEM ARCHITECTURE

3.1 SYSTEM ARCHITECTURE DESIGN

An Architecture diagram is a graphical representation of a set of concepts that are part


of architecture, including their principles, elements and components. The system is developed
using object-oriented principles, ensuring that components such as data loaders, feature
selectors, classifiers, and evaluators are implemented as modular and reusable objects. This
design enhances maintainability and scalability, making it easier to update or extend the
system in the future. Each object represents a distinct functional unit such as data
preprocessing or model evaluation and interacts with others through well-defined interfaces.

Figure 3.1 Architecture Diagram

3.2 UML DIAGRAMS

3.2.1 USE CASE DIAGRAM

The use case diagram provides a high-level overview of the system's functional
requirements by visually representing the interactions between users (actors) and the
system's functionalities (use cases). In the Intrusion Detection System (IDS) built on the
CIC-IDS2017 dataset, the primary actor is the Network Analyst or System Administrator,
who interacts with the system to upload network traffic data, initiate preprocessing steps, and
run machine learning models. The system provides use cases such as Load Dataset, Preprocess
Data, Train Model, Evaluate Model, and Visualize Result. Each use case signifies a major
function of the system and is crucial for achieving the system's goal of identifying and
classifying network intrusions.

Figure 3.2.1 Use Case Diagram

3.2.2 CLASS DIAGRAM

The class diagram shows how the different entities (people, things and data) relate to
each other; in other words, it shows the static structure of the system. A class diagram can be
used to display logical classes, and it can also show implementation classes, which are the
things that programmers typically deal with. A class is depicted on the class diagram as a
rectangle with horizontal sections: the upper section holds the class name, the middle section
its attributes, and the lower section its operations ("methods"). The diagram has five main
classes, which give the attributes and operations used in each class.

Figure 3.2.2 Class Diagram

3.2.3 ACTIVITY DIAGRAM

Activity diagrams are graphical representations of workflows of stepwise activities and
actions with support for choice, iteration and concurrency. In the Unified Modelling
Language, activity diagrams are intended to model both computational and organizational
processes, and they show the overall flow of control. For the Intrusion Detection System
(IDS) using the CIC-IDS2017 dataset, the activity diagram outlines the end-to-end process
involved in detecting network intrusions using machine learning techniques. The flow ends
at the intrusion detection decision, where the trained and optimized model classifies network
traffic as normal or malicious.

Figure 3.2.3 Activity Diagram

3.2.4 DATA FLOW DIAGRAM

A Data Flow Diagram shows what kinds of information will be input to and output from
the system, where data will come from and go to, and where the data will be stored. It does not
show information about the timing of processes, or about whether processes will operate in
sequence or in parallel. For the Intrusion Detection System (IDS) based on the CIC-IDS2017
dataset, the DFD illustrates how raw network traffic data is transformed into actionable
intrusion alerts using machine learning techniques.

Fig 3.2.4 Data Flow Diagram

CHAPTER 4

SYSTEM SPECIFICATION

4.1 HARDWARE REQUIREMENTS

• Processor: Intel Core i5 (7th Gen) or AMD Ryzen 5 (minimum).

• RAM: Minimum 8 GB (16 GB recommended).

• Storage: 256 GB SSD or 500 GB HDD (minimum), 512 GB SSD recommended, with at
least 20 GB free for the dataset and logs.

• Graphics Card: Integrated GPU (minimum), NVIDIA GTX 1650 or higher
(recommended).

• Display: Minimum 13" screen with 1366x768 resolution.

• Network Interface: Ethernet or Dual-Band Wi-Fi.

4.2 SOFTWARE REQUIREMENTS

• Operating System: Windows 10 / Ubuntu 20.04 LTS or later.

• Python 3 with libraries such as NumPy, Pandas, scikit-learn, Matplotlib, and Seaborn
(as used in the implementation).

• Google Colab or a local Jupyter environment for running the notebooks.

CHAPTER 5

SYSTEM DESIGN

5.1 SYSTEM MODULES

5.1.1 DATA COLLECTION MODULE

The Data Collection Module is the first line of defense in an Intrusion Detection System
(IDS), tasked with gathering raw data from various network sources. This module collects data
in real-time from network devices like routers, switches, firewalls, servers, and endpoints. The
data may include network traffic logs, packet captures, system logs, user activity records, and
intrusion attempts, which are fundamental for detecting malicious activity.

The module ensures that data collection is continuous and uninterrupted, processing
incoming information to maintain an up-to-date snapshot of the system's security status. It uses
both active and passive data collection methods. Active collection involves querying devices for
logs and metrics, while passive collection listens to traffic on the network without interfering
with it.

The quality of the collected data is essential for the success of the IDS. Data
preprocessing occurs within this module, where tasks like normalization, timestamp
synchronization, and filtering irrelevant data take place. This ensures that the collected data is in
a consistent format and is free from redundancies, ensuring accurate analysis in later stages. By
collecting and preparing the data, this module ensures that subsequent modules receive
high-quality information for detection and analysis.

The module is also responsible for reducing system resource consumption by deploying
lightweight agents or utilizing existing network monitoring systems to minimize overhead. As
the foundation of the IDS, its effectiveness determines the overall system performance and
accuracy of threat detection.

To begin analyzing the CIC-IDS2017 dataset, several Python libraries were imported.
NumPy and Pandas are foundational libraries for handling numerical operations and
structured data, respectively. These were used to load and manipulate the dataset efficiently.
The dataset, consisting of both normal and malicious network traffic, was read into a Pandas
DataFrame, providing a tabular structure suitable for analysis.

Overall, this setup phase laid the groundwork for the rest of the analysis. By ensuring
that the appropriate tools were in place for loading, inspecting, and visualizing the data, it
became possible to proceed with preprocessing, feature selection, and model development
with confidence and clarity.

5.1.2 FEATURE EXTRACTION AND SELECTION MODULE

The feature Extraction and Selection Module is responsible for transforming raw data
into actionable insights by identifying key features that are most indicative of potential threats.
Once data is collected, it undergoes feature extraction, where important attributes such as network
traffic patterns, login attempts, packet sizes, protocol types, and session duration are identified
and extracted. These features serve as the basis for further analysis in the detection engine.

Feature extraction is often followed by feature selection, which aims to reduce the
dataset's complexity while retaining essential information. This process involves techniques such
as statistical analysis, correlation measures, and dimensionality reduction methods like Principal
Component Analysis (PCA). The goal is to eliminate redundant or irrelevant features that may
introduce noise, improving both the speed and accuracy of the detection process.

The output of this module is a refined dataset of relevant features that can be used to train
machine learning models or rule-based detection systems. By selecting the most significant
features, this module helps ensure that the IDS system is efficient and responsive while maintaining
a high detection rate. The quality of feature extraction and selection plays a pivotal role in the
accuracy of anomaly detection and signature-based detection mechanisms in the IDS.
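A minimal sketch of this reduction step, assuming the cleaned traffic records are in a DataFrame `data` with the label in an 'Attack Type' column; standardizing before PCA is a common choice assumed here, and the 95% variance threshold is illustrative.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = data.drop('Attack Type', axis = 1)
y = data['Attack Type']

X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature

pca = PCA(n_components = 0.95)                 # keep components explaining 95% of the variance
X_reduced = pca.fit_transform(X_scaled)

print(f'{X.shape[1]} features reduced to {pca.n_components_} components')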

5.1.3 DETECTION ENGINE MODULE

The Detection Engine Module is the heart of the Intrusion Detection System, responsible
for analyzing the data processed by the Feature Extraction and Selection Module to identify
potential security threats. This module applies various detection methods, including signature-
based and anomaly-based detection, to flag unusual activities within the network.

In signature-based detection, the engine compares observed behaviors against a database


of known attack patterns or signatures. This method is highly effective in detecting well-known
and recurring threats. However, it has limitations in identifying new, unknown, or evolving threats.
To overcome this, anomaly detection techniques are also employed. Anomaly detection models,
such as Decision Trees, Support Vector Machines (SVM), and Neural Networks, learn the baseline
behavior of a network over time. Any deviation from the baseline is considered suspicious,
which may indicate an intrusion attempt.

Machine learning models can be used in this module to improve the detection rate and
reduce false positives by training on historical data labeled as either normal or malicious. The
engine also needs to work in real-time, providing timely detection to minimize the damage
caused by attacks. Alerts generated by this engine are passed to the alert and response module, which
takes action based on the severity of the threat.

The detection engine is critical to the IDS’s overall effectiveness, as it determines how well
the system can identify both known and novel security threats in real-time.
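A minimal sketch of the supervised detection step, assuming a preprocessed feature DataFrame X and binary labels y (0 = normal, 1 = attack) produced by the earlier modules; the decision tree shown is one of the models named above, not the only option.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Learn normal/attack behaviour from labelled historical traffic.
clf = DecisionTreeClassifier(max_depth = 8, random_state = 0)
clf.fit(X_train, y_train)

# Flows predicted as attacks would be handed to the Alert and Response Module.
suspicious = X_test[clf.predict(X_test) == 1]
print(f'Flagged {len(suspicious)} of {len(X_test)} flows as suspicious')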

5.1.4 ALERT AND RESPONSE MODULE

The Alert and Response Module is a vital component that handles the alerts generated by
the Detection Engine and initiates the necessary response actions. Once a potential intrusion is
detected, this module assesses the severity and context of the threat. Alerts are generated and
communicated to system administrators through various channels such as dashboards, emails, or
Security Information and Event Management (SIEM) systems.

This module provides real-time alerts, helping administrators take prompt action to
mitigate any security breaches. Depending on the severity of the detected intrusion, the response
may vary. For minor incidents, the system may simply notify administrators, while for more
serious threats, automated responses are triggered. These responses can include blocking an IP
address, isolating a compromised system, terminating a suspicious session, or activating forensic
logging to gather additional evidence.

The alert and response module also plays a crucial role in post-event analysis by
maintaining detailed logs of each incident, which can be reviewed during investigations or for
improving the system’s rules. This module can be integrated with other security tools such as
firewalls, antivirus systems, or patch management tools to ensure coordinated actions during a
security event. The success of the IDS largely depends on the timely and appropriate actions

taken by this module.

5.1.5 REPORTING AND ANALYSIS MODULE

The Reporting and Analysis Module is responsible for generating detailed reports on the
system’s security status, including detected threats, response actions taken, and potential
vulnerabilities. It provides both real-time and historical data analysis, allowing administrators to
understand trends and patterns in network security incidents.

Reports generated by this module are often used for compliance purposes, helping
organizations adhere to security standards and regulations. The module can also generate
actionable insights, such as recurring attack patterns or areas of vulnerability that need attention.
The reports typically contain metrics like the number of attacks detected, the type of attack, the
duration of the attack, and the system’s response.

The analysis tools within this module help in identifying areas for system improvement
by reviewing past intrusions and their effectiveness. Through trend analysis, it can also predict
future threats and help in making proactive changes to the network security infrastructure. The
Reporting and Analysis Module plays a significant role in making the IDS more adaptive to
emerging threats.

CHAPTER 6

SYSTEM IMPLEMENTATION

6.1 PROBLEM DEFINITION

With the rapid growth of digital communication and networking technologies,


cybersecurity threats have become more frequent, complex, and damaging. Traditional security
systems such as firewalls and antivirus software are no longer sufficient to detect sophisticated
and evolving cyberattacks. This creates a critical need for intelligent systems capable of
identifying and preventing unauthorized access or malicious activities in real-time.

The core problem addressed in this project is the identification and classification of
intrusions in a network traffic environment using machine learning techniques.

Specifically, the challenge lies in accurately detecting both known and unknown attacks,
minimizing false positives, and handling high-dimensional data efficiently. To solve this, the
project uses the CIC-IDS2017 dataset, a comprehensive and realistic dataset that includes
various modern attack types such as DDoS, brute force, infiltration, and botnet activities.

The system applies data preprocessing techniques, feature selection, dimensionality


reduction, and a variety of machine learning algorithms including Logistic Regression, Support
Vector Machine, Decision Tree, Random Forest, and K-Nearest Neighbors to train predictive
models.

The objective is to design a robust, efficient, and scalable intrusion detection system
capable of distinguishing between normal and malicious traffic, thereby enhancing the security
posture of modern networks.

6.2 IMPLEMENTATION

Figure 6.2 Implementation

The implementation phase of this study was conducted using Google Colab due to its
accessibility and support for high-performance computing. To begin with, Google Drive was
mounted using the command drive.mount('/content/drive') to enable access to the CIC-IDS2017
dataset and to save output files such as trained models and visualizations. This integration
allowed for seamless loading and saving of data throughout the project.

After mounting, the CIC-IDS2017 dataset was loaded using Python’s pandas library. The
dataset includes network traffic data, both normal and malicious, across a range of attack categories
like DDoS, Botnet, and Brute Force. The dataset contains over 80 features, making it
comprehensive and suitable for building robust intrusion detection models.

Once loaded, the data was explored to understand its structure, identify missing values,
and check for data consistency. Preprocessing was a crucial step to prepare the dataset for
machine learning. Missing and infinite values were either removed or imputed appropriately.

Label encoding was applied to convert categorical values, such as attack types, into
numeric formats suitable for model training. Feature normalization was performed using Min-Max
scaling to ensure that all features contributed equally during training, avoiding bias due to differing
feature scales.
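A minimal sketch of these two steps, assuming the cleaned frame is in `data` with the attack category in an 'Attack Type' column:

from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Label encoding: map each attack category to an integer code.
le = LabelEncoder()
data['Attack Type'] = le.fit_transform(data['Attack Type'])

# Min-Max scaling: rescale every feature column to the [0, 1] range.
feature_cols = data.columns.drop('Attack Type')
data[feature_cols] = MinMaxScaler().fit_transform(data[feature_cols])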

Since the dataset was imbalanced, with benign samples far outnumbering malicious ones,
balancing techniques such as undersampling and SMOTE (Synthetic Minority Oversampling
Technique) were considered to ensure fair training across all classes. Principal Component
Analysis (PCA) was employed to reduce the feature space while retaining the most informative
attributes.

Correlation analysis and information gain methods were used to select features that had
the most impact on classification performance. These techniques not only enhanced model speed
but also helped in preventing overfitting.
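A minimal sketch of information-gain-based selection, assuming features X and labels y; mutual information is scikit-learn's information-gain style criterion, and k = 20 is an illustrative choice:

from sklearn.feature_selection import SelectKBest, mutual_info_classif

selector = SelectKBest(score_func = mutual_info_classif, k = 20)
X_selected = selector.fit_transform(X, y)

print(X.columns[selector.get_support()])   # names of the 20 retained features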

The classification models implemented included Logistic Regression (LR), Support Vector
Machine (SVM), Decision Tree (DT), Random Forest (RF), and K-Nearest Neighbors (KNN).
The models were trained using an 80-20 train-test split and evaluated using cross-validation to
ensure robustness. The objective was to compare these models in terms of accuracy, speed, and
reliability in detecting different types of intrusions.
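A minimal sketch of the split and cross-validated training, assuming preprocessed features X and labels y:

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# 80-20 train-test split, as used in the study.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

model = LogisticRegression(max_iter = 10000)
scores = cross_val_score(model, X_train, y_train, cv = 5)   # 5-fold cross-validation
print(f'Mean cross-validation accuracy: {scores.mean():.2f}')

model.fit(X_train, y_train)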

The trained models were then evaluated using performance metrics such as accuracy,
precision, recall, F1-score, and ROC-AUC. Confusion matrices were used to analyze the true
positives, false positives, true negatives, and false negatives for each model.
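A minimal sketch of this evaluation, assuming a fitted binary classifier `model` and the held-out split from above:

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score

y_pred = model.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('ROC-AUC :', roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
print(classification_report(y_test, y_pred))   # precision, recall, F1 per class
print(confusion_matrix(y_test, y_pred))        # rows: true class, columns: predicted class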

Visualization techniques such as bar charts and ROC curves were applied to provide an
intuitive understanding of each model’s strengths and limitations, allowing for effective
comparison.

Hyperparameter tuning was performed using methods like Grid Search and
Random Search to fine-tune each model's parameters for improved accuracy and generalizability.
The optimized model serves as the final output of the implementation phase, ready for
deployment or integration into a real-time intrusion detection system.
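A minimal sketch of Grid Search tuning, assuming the training split from above; the parameter grid values shown are illustrative, not the report's final settings:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators': [10, 15, 50], 'max_depth': [6, 8, 12]}

# Exhaustively evaluate every grid combination with 5-fold cross-validation.
search = GridSearchCV(RandomForestClassifier(random_state = 0), param_grid, cv = 5, scoring = 'f1')
search.fit(X_train, y_train)

print('Best parameters:', search.best_params_)
best_model = search.best_estimator_   # final tuned model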

CHAPTER 7

SYSTEM TESTING

7.1 TESTING OBJECTIVES

Testing is a set of activities that can be planned in advance and conducted systematically.
For this reason, a template for software testing, a set of steps into which specific test case
design techniques and testing methods can be placed, should be defined for the software
process. Testing often accounts for more effort than any other software engineering activity,
so it is reasonable to establish a systematic strategy for testing software.

7.2 TYPES OF TESTING

7.2.1 UNIT TESTING

For unit testing of the intrusion detection system, the pipeline is first modularized
into distinct components such as data preprocessing, feature selection, model prediction,
and evaluation. A Python testing framework like `unittest` or `pytest` is then used to write
test cases for each component. A small, representative subset of the dataset or mock data keeps
tests fast and focused. This structured testing approach ensures robustness and easier
debugging as the intrusion detection system is developed.
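A minimal pytest sketch for the preprocessing component; `preprocess` is a hypothetical helper assumed to drop rows with missing or infinite values and scale features to [0, 1]:

import numpy as np
import pandas as pd

from preprocessing import preprocess   # hypothetical module under test

def test_preprocess_removes_invalid_rows_and_scales():
    raw = pd.DataFrame({'Flow Bytes/s': [1.0, np.inf, np.nan, 250.0],
                        'Label': ['BENIGN', 'DDoS', 'BENIGN', 'PortScan']})
    clean = preprocess(raw)
    # No NaN or infinite values should survive preprocessing.
    assert np.isfinite(clean['Flow Bytes/s']).all()
    # Features should be scaled into the [0, 1] range.
    assert clean['Flow Bytes/s'].between(0, 1).all()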

7.2.2. FUNCTIONAL TESTING

Functional testing of the intrusion detection system involves feeding the full dataset,
or a representative portion of it, through the complete pipeline, including data loading,
preprocessing, feature extraction, model inference, and alert generation, and verifying that
the system correctly identifies normal and malicious traffic based on expected outputs.
Functional tests should validate that the system meets its intended requirements, such as
detecting specific attack types, generating appropriate logs or alerts, and maintaining
performance under realistic data volumes.

7.2.3. INTEGRATION TESTING

Integration testing verifies that the different modules of the system, such as data
ingestion, preprocessing, feature selection, model prediction, and evaluation, work together
as a cohesive unit. The goal is to catch issues that may not appear in isolated unit tests, such
as mismatched data formats, incorrect model input shapes, or failures in sequential processing.
Integration tests help confirm that the system behaves correctly when modules interact,
ensuring stability and correctness before full functional deployment.

7.3 TESTING STRATEGIES

7.3.1 WHITE BOX TESTING

White box testing of the intrusion detection system requires full access to the system's
internal logic, including its source code, algorithms, and data flow. This type of testing
involves designing test cases based on the internal structure of each module, such as the
control flow in data preprocessing, the logic in feature selection, and the conditions in the
classification model, to verify that all paths, branches, and conditions function correctly.
Testers can craft targeted inputs to trigger specific parts of the code, ensuring that the
system handles edge cases like missing or anomalous data and applies normalization correctly.

7.3.2 BLACK BOX TESTING

Black box testing treats the intrusion detection system as a closed unit without knowledge
of its internal workings, focusing only on the inputs and expected outputs. Various traffic
samples from the dataset, including both normal and malicious behaviors such as DoS,
PortScan, and brute-force attacks, are fed to the system to observe whether it correctly
identifies and classifies them without examining how the detection is implemented. This
testing approach helps verify that the system meets functional requirements, such as accuracy
and alert generation, and can handle different attack scenarios realistically.

7.4 ACCEPTANCE TESTING

Acceptance testing is the final phase of software testing that ensures the system meets the
requirements and expectations of the end users or stakeholders. For this Intrusion Detection
System project, acceptance testing was conducted to validate that the system performs effectively
in identifying and classifying network intrusions as per the defined objectives. This involves
running the full dataset, or a representative portion of it, through the system to ensure it
accurately detects known attacks and normal traffic, meets performance benchmarks (like
detection rate, false positive rate, and latency), and integrates properly with real-world network
environments. Acceptance testing focuses on verifying that the system behaves correctly from a
user or organizational perspective, confirming that it fulfills all functional, reliability, and
usability criteria.

7.4.1 OBJECTIVE

The objective of acceptance testing is to confirm that the IDS is:

 Functionally complete and meets all project specifications.


 Capable of detecting and classifying both normal and attack traffic accurately.
 Usable, reliable, and ready for deployment in a real-world network environment.

7.4.2 ACCEPTANCE CRITERIA

The following criteria were established to determine the successful acceptance of the system:

 Data Loading & Preprocessing


The system must successfully load the CIC-IDS2017 dataset and preprocess it
(including cleaning, normalization, and feature reduction).
 Model Training & Prediction
The system must train various machine learning models (e.g., Logistic Regression,
SVM, Decision Tree, Random Forest, KNN) without errors. It should predict
intrusion classes (binary and multi-class) with satisfactory accuracy.

 Performance Metrics
The system should achieve at least 90% accuracy for binary classification and
acceptable performance for multi-class classification. Precision, recall, and F1-score
should also be within expected thresholds for various attack types.

 System Integration
All modules (data processing, model training, evaluation, and output) must work together
without failures or crashes.
 User Feedback
The output results (e.g., confusion matrix, classification report) must be interpretable
and informative for cybersecurity professionals.

7.4.3 TESTING PROCEDURE

• A set of real-world scenarios from the CIC-IDS2017 dataset were used.

• Stakeholders simulated network environments containing various attack types.

• Each output was reviewed to ensure accurate detection and classification of attacks.

• The interface and response times were evaluated from a usability perspective.

• Ensure data is normalized, cleaned, and ready for processing.

• Make sure all attack types, edge cases, and performance scenarios are tested.

• Compare system output with expected results for attack detection and traffic classification.

• Keep clear records of all tests, bugs, and system modifications.

CHAPTER 8

CONCLUSION

The project also addressed attack classification by encoding the attack types and
mapping them to numerical labels, facilitating the application of machine learning algorithms
for attack detection. The encoding process, followed by feature scaling and normalization,
prepared the data for modeling, ensuring that the models would not be biased by extreme
values or irrelevant features. The project has provided a comprehensive framework for
analyzing network traffic data for intrusion detection. The results underscore the importance
of data preprocessing and feature engineering in machine learning workflows, particularly for
cybersecurity applications. The findings from the exploratory analysis, correlation studies,
and outlier investigations will directly inform the modeling phase, where machine learning
algorithms will be applied to predict attacks. This process not only enhances the accuracy of
the model but also improves its ability to detect and respond to network security threats
effectively. The next steps in this project will involve the application of classification
algorithms to build an intrusion detection model that can accurately identify both known and
unknown attack types, contributing to a robust network security system.

CHAPTER 9

FUTURE ENHANCEMENT
Future work on intrusion detection using the CICIDS 2017 dataset can focus on enhancing
detection accuracy, reducing false positives, and improving adaptability to evolving attack
patterns; the model must therefore be chosen carefully if a single model is to be retained. One
promising direction is the application of advanced deep learning models, such as recurrent neural
networks (RNNs), transformers, or graph neural networks, to better capture temporal and
relational patterns in network traffic. Work can also focus on optimizing model performance
through advanced machine learning and deep learning approaches. This includes implementing
automated hyper-parameter optimization techniques such as Bayesian optimization, grid
search, or genetic algorithms to fine-tune classifiers like Random Forest, XGBoost, or deep
neural networks. Enhanced prediction techniques such as ensemble learning, stacking models,
or attention-based deep architectures can be explored to improve classification accuracy and
robustness across diverse attack types. Additionally, incorporating techniques for feature
selection and dimensionality reduction can help reduce noise and computational cost while
preserving predictive power. Future work may also involve testing the generalizability of the proposed
models on unseen or evolving threats, integrating real-time detection capabilities, and
benchmarking against other datasets to validate model effectiveness in dynamic network
environments.

APPENDIX 1
SOURCE CODE

Dataset Characteristics and Exploratory Data Analysis

Load, View Data and Show Analysis on Rows and Columns

#from google.colab import drive

#drive.mount('/content/drive')

import numpy as np

import pandas as pd

import seaborn as sns

import missingno as msno

sns.set(style='darkgrid')

import matplotlib.pyplot as plt

# Loading the dataset (one CSV per capture day of CIC-IDS2017)

data1 = pd.read_csv("C:/Users/Narayanan/Desktop/ans technological solution/ANS Tech/network baced attack/Monday-WorkingHours.pcap_ISCX.csv")

data2 = pd.read_csv("C:/Users/Narayanan/Desktop/ans technological solution/ANS Tech/network baced attack/Tuesday-WorkingHours.pcap_ISCX.csv")

data3 = pd.read_csv("C:/Users/Narayanan/Desktop/ans technological solution/ANS Tech/network baced attack/Wednesday-workingHours.pcap_ISCX.csv")

data4 = pd.read_csv("C:/Users/Narayanan/Desktop/ans technological solution/ANS Tech/network baced attack/Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv")

data5 = pd.read_csv("C:/Users/Narayanan/Desktop/ans technological solution/ANS Tech/network baced attack/Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv")

data6 = pd.read_csv("C:/Users/Narayanan/Desktop/ans technological solution/ANS Tech/network baced attack/Friday-WorkingHours-Morning.pcap_ISCX.csv")

data7 = pd.read_csv("C:/Users/Narayanan/Desktop/ans technological solution/ANS Tech/network baced attack/Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv")

data8 = pd.read_csv("C:/Users/Narayanan/Desktop/ans technological solution/ANS Tech/network baced attack/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv")

data_list = [data1, data2, data3, data4, data5, data6, data7, data8]

print('Data dimensions: ')

for i, data in enumerate(data_list, start = 1):
    rows, cols = data.shape
    print(f'Data{i} -> {rows} rows, {cols} columns')

data = pd.concat(data_list)

rows, cols = data.shape

print('New dimension:')

print(f'Number of rows: {rows}')

print(f'Number of columns: {cols}')

print(f'Total cells: {rows * cols}')

# Renaming the columns by removing leading/trailing whitespace

col_names = {col: col.strip() for col in data.columns}

data.rename(columns = col_names, inplace = True)

data.columns

data.info()

pd.options.display.max_rows = 80

print('Overview of Columns:')

data.describe().transpose()

pd.options.display.max_columns = 80

data

# Identifying duplicate values

dups = data[data.duplicated()]

print(f'Number of duplicates: {len(dups)}')

data.drop_duplicates(inplace = True)

data.shape

# Identifying missing values

missing_val = data.isna().sum()

print(missing_val.loc[missing_val > 0])

# Checking for infinity values

numeric_cols = data.select_dtypes(include = np.number).columns

inf_count = np.isinf(data[numeric_cols]).sum()

print(inf_count[inf_count > 0])

# Calculating missing value percentage in the dataset

mis_per = (missing_val / len(data)) * 100

mis_table = pd.concat([missing_val, mis_per.round(2)], axis = 1)

mis_table = mis_table.rename(columns = {0 : 'Missing Values', 1 : 'Percentage of Total Values'})

print(mis_table.loc[mis_per > 0])

# Visualization of missing data

sns.set_palette('pastel')

colors = sns.color_palette()

missing_vals = [col for col in data.columns if data[col].isna().any()]

fig, ax = plt.subplots(figsize = (2, 6))

msno.bar(data[missing_vals], ax = ax, fontsize = 12, color = colors)

ax.set_xlabel('Features', fontsize = 12)

ax.set_ylabel('Non-Null Value Count', fontsize = 12)

ax.set_title('Missing Value Chart', fontsize = 12)

plt.show()

# Dealing with missing values (Columns with missing data)

plt.figure(figsize = (8, 3))

sns.boxplot(x = data['Flow Bytes/s'])

plt.xlabel('Boxplot of Flow Bytes/s')

plt.show()

colors = sns.color_palette('Blues')

plt.hist(data['Flow Bytes/s'], color = colors[1])

plt.title('Histogram of Flow Bytes/s')

plt.xlabel('Flow Bytes/s')

plt.ylabel('Frequency')

plt.show()

plt.figure(figsize = (8, 3))

sns.boxplot(x = data['Flow Packets/s'])

plt.xlabel('Boxplot of Flow Packets/s')

plt.show()

plt.hist(data['Flow Packets/s'], color = colors[1])

plt.title('Histogram of Flow Packets/s')

plt.xlabel('Flow Packets/s')

plt.ylabel('Frequency')

plt.show()

med_flow_bytes = data['Flow Bytes/s'].median()

med_flow_packets = data['Flow Packets/s'].median()

print('Median of Flow Bytes/s: ', med_flow_bytes)

print('Median of Flow Packets/s: ', med_flow_packets)

# Identifying outliers based on attack type
# Note: 'sampled_data' (a random sample of the cleaned dataset) and
# 'numeric_data' (its numeric feature columns) are assumed to be prepared
# during the exploratory analysis, e.g.:
sampled_data = data.sample(n = min(len(data), 100000), random_state = 0)
numeric_data = sampled_data.select_dtypes(include = np.number).columns

outlier_counts = {}

for i in numeric_data:
    for attack_type in sampled_data['Attack Type'].unique():
        attack_data = sampled_data[i][sampled_data['Attack Type'] == attack_type]
        q1, q3 = np.percentile(attack_data, [25, 75])
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr
        num_outliers = ((attack_data < lower_bound) | (attack_data > upper_bound)).sum()
        outlier_percent = num_outliers / len(attack_data) * 100
        outlier_counts[(i, attack_type)] = (num_outliers, outlier_percent)

for i in numeric_data:
    print(f'Feature: {i}')
    for attack_type in sampled_data['Attack Type'].unique():
        num_outliers, outlier_percent = outlier_counts[(i, attack_type)]
        print(f'  {attack_type}: {num_outliers} outliers ({outlier_percent:.2f}%)')

# Plotting the percentage of outliers that are higher than 20%

fig, ax = plt.subplots(figsize = (24, 10))

for i in numeric_data:
    for attack_type in sampled_data['Attack Type'].unique():
        num_outliers, outlier_percent = outlier_counts[(i, attack_type)]
        if outlier_percent > 20:
            ax.bar(f'{i} - {attack_type}', outlier_percent)

ax.set_xlabel('Feature-Attack Type')

ax.set_ylabel('Percentage of Outliers')

ax.set_title('Outlier Analysis')

ax.set_yticks(np.arange(0, 41, 10))

plt.xticks(rotation = 90)

plt.show()

attacks = data.loc[data['Attack Type'] != 'BENIGN']

plt.figure(figsize = (10, 6))

ax = sns.countplot(x = 'Attack Type', data = attacks, palette = 'pastel', order = attacks['Attack Type'].value_counts().index)

plt.title('Types of attacks')

plt.xlabel('Attack Type')

data.groupby('Attack Type').first()

3. Data Preprocessing

For improving performance and reducing memory-related errors

old_memory_usage = data.memory_usage().sum() / 1024 ** 2

print(f'Initial memory usage: {old_memory_usage:.2f} MB')

for col in data.columns:
    col_type = data[col].dtype
    if col_type != object:
        c_min = data[col].min()
        c_max = data[col].max()
        # Downcasting float64 to float32
        if str(col_type).find('float') >= 0 and c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
            data[col] = data[col].astype(np.float32)
        # Downcasting int64 to int32
        elif str(col_type).find('int') >= 0 and c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
            data[col] = data[col].astype(np.int32)

new_memory_usage = data.memory_usage().sum() / 1024 ** 2

print(f"Final memory usage: {new_memory_usage:.2f} MB")

# Calculating percentage reduction in memory usage

print(f'Reduced memory usage: {1 - (new_memory_usage / old_memory_usage):.2%}')

data.info()

data.describe().transpose()

# Columns after removing non-variant (constant) columns

data.columns

# Standardizing the dataset

from sklearn.preprocessing import StandardScaler

features = data.drop('Attack Type', axis = 1)

attacks = data['Attack Type']

scaler = StandardScaler()

scaled_features = scaler.fit_transform(features)

from sklearn.decomposition import IncrementalPCA

size = len(features.columns) // 2

ipca = IncrementalPCA(n_components = size, batch_size = 500)

for batch in np.array_split(scaled_features, len(features) // 500):
    ipca.partial_fit(batch)

print(f'Information retained: {sum(ipca.explained_variance_ratio_):.2%}')

transformed_features = ipca.transform(scaled_features)

new_data = pd.DataFrame(transformed_features, columns = [f'PC{i+1}' for i in range(size)])

new_data['Attack Type'] = attacks.values

new_data

4. Machine Learning Models

from sklearn.model_selection import cross_val_score

# Creating a balanced dataset for Binary Classification

normal_traffic = new_data.loc[new_data['Attack Type'] == 'BENIGN']

intrusions = new_data.loc[new_data['Attack Type'] != 'BENIGN']

normal_traffic = normal_traffic.sample(n = len(intrusions), replace = False)

ids_data = pd.concat([intrusions, normal_traffic])

ids_data['Attack Type'] = np.where((ids_data['Attack Type'] == 'BENIGN'), 0, 1)

bc_data = ids_data.sample(n = 15000)

print(bc_data['Attack Type'].value_counts())

# Splitting the data into features (X) and target (y)

from sklearn.model_selection import train_test_split

X_bc = bc_data.drop('Attack Type', axis = 1)

y_bc = bc_data['Attack Type']

X_train_bc, X_test_bc, y_train_bc, y_test_bc = train_test_split(X_bc, y_bc, test_size = 0.25, random_state = 0)

from sklearn.linear_model import LogisticRegression

lr1 = LogisticRegression(max_iter = 10000, C = 0.1, random_state = 0, solver = 'saga')

lr1.fit(X_train_bc, y_train_bc)

cv_lr1 = cross_val_score(lr1, X_train_bc, y_train_bc, cv = 5)

print('Logistic regression Model 1')

print(f'\nCross-validation scores:', ', '.join(map(str, cv_lr1)))

print(f'\nMean cross-validation score: {cv_lr1.mean():.2f}')

print('Logistic Regression Model 1 coefficients:')

print(*lr1.coef_, sep = ', ')

print('\nLogistic Regression Model 1 intercept:', *lr1.intercept_)

lr2 = LogisticRegression(max_iter = 15000, solver = 'sag', C = 100, random_state = 0)

lr2.fit(X_train_bc, y_train_bc)

cv_lr2 = cross_val_score(lr2, X_train_bc, y_train_bc, cv = 5)

print('Logistic regression Model 2')

print(f'\nCross-validation scores:', ', '.join(map(str, cv_lr2)))

print(f'\nMean cross-validation score: {cv_lr2.mean():.2f}')

print('Logistic Regression Model 2 coefficients:')

print(*lr2.coef_, sep = ', ')

print('\nLogistic Regression Model 2 intercept:', *lr2.intercept_)
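The two logistic regression configurations above vary C and the solver by hand. The same hyper-parameter search can be automated with scikit-learn's GridSearchCV; the sketch below is illustrative, and the grid values are assumptions rather than settings taken from the report:

from sklearn.model_selection import GridSearchCV

# Illustrative grid over regularization strength and solver
param_grid = {'C': [0.1, 1, 10, 100], 'solver': ['sag', 'saga']}
grid_lr = GridSearchCV(LogisticRegression(max_iter = 15000, random_state = 0), param_grid, cv = 5, n_jobs = -1)
grid_lr.fit(X_train_bc, y_train_bc)
print('Best parameters:', grid_lr.best_params_)
print(f'Best cross-validation score: {grid_lr.best_score_:.2f}')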

from sklearn.svm import SVC

svm1 = SVC(kernel = 'poly', C = 1, random_state = 0, probability = True)

svm1.fit(X_train_bc, y_train_bc)

cv_svm1 = cross_val_score(svm1, X_train_bc, y_train_bc, cv = 5)

print('Support Vector Machine Model 1')

print(f'\nCross-validation scores:', ', '.join(map(str, cv_svm1)))

print(f'\nMean cross-validation score: {cv_svm1.mean():.2f}')

svm2 = SVC(kernel = 'rbf', C = 1, gamma = 0.1, random_state = 0, probability = True)

svm2.fit(X_train_bc, y_train_bc)

cv_svm2 = cross_val_score(svm2, X_train_bc, y_train_bc, cv = 5)

print('Support Vector Machine Model 2')

print(f'\nCross-validation scores:', ', '.join(map(str, cv_svm2)))

print(f'\nMean cross-validation score: {cv_svm2.mean():.2f}')

print('SVM Model 1 intercept:', *svm1.intercept_)

print('SVM Model 2 intercept:', *svm2.intercept_)

# Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier
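The random forest, decision tree, and KNN models that follow are fit on X_train and y_train, which are not defined in this listing; they come from a multi-class train/test split of the PCA-transformed data. A minimal sketch of that split, assuming the same 75/25 ratio used for the binary task:

# Assumed multi-class split (the original split is not shown in this listing)
X = new_data.drop('Attack Type', axis = 1)
y = new_data['Attack Type']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)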

rf1 = RandomForestClassifier(n_estimators = 10, max_depth = 6, max_features = None, random_state = 0)

rf1.fit(X_train, y_train)

cv_rf1 = cross_val_score(rf1, X_train, y_train, cv = 5)

print('Random Forest Model 1')

print(f'\nCross-validation scores:', ', '.join(map(str, cv_rf1)))

print(f'\nMean cross-validation score: {cv_rf1.mean():.2f}')

rf2 = RandomForestClassifier(n_estimators = 15, max_depth = 8, max_features = 20, random_state = 0)

rf2.fit(X_train, y_train)

cv_rf2 = cross_val_score(rf2, X_train, y_train, cv = 5)

print('Random Forest Model 2')

print(f'\nCross-validation scores:', ', '.join(map(str, cv_rf2)))

print(f'\nMean cross-validation score: {cv_rf2.mean():.2f}')

# Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier

dt1 = DecisionTreeClassifier(max_depth = 6)

dt1.fit(X_train, y_train)

cv_dt1 = cross_val_score(dt1, X_train, y_train, cv = 5)

print('Decision Tree Model 1')

print(f'\nCross-validation scores:', ', '.join(map(str, cv_dt1)))

print(f'\nMean cross-validation score: {cv_dt1.mean():.2f}')

dt2 = DecisionTreeClassifier(max_depth = 8)

dt2.fit(X_train, y_train)

cv_dt2 = cross_val_score(dt2, X_train, y_train, cv = 5)

print('Decision Tree Model 2')

print(f'\nCross-validation scores:', ', '.join(map(str, cv_dt2)))

print(f'\nMean cross-validation score: {cv_dt2.mean():.2f}')

# K Nearest Neighbours

from sklearn.neighbors import KNeighborsClassifier

knn1 = KNeighborsClassifier(n_neighbors = 16)

knn1.fit(X_train, y_train)

cv_knn1 = cross_val_score(knn1, X_train, y_train, cv = 5)

print('K Nearest Neighbors Model 1')

print(f'\nCross-validation scores:', ', '.join(map(str, cv_knn1)))

print(f'\nMean cross-validation score: {cv_knn1.mean():.2f}')

knn2 = KNeighborsClassifier(n_neighbors = 8)

knn2.fit(X_train, y_train)

cv_knn2 = cross_val_score(knn2, X_train, y_train, cv = 5)

print('K Nearest Neighbors Model 2')

print(f'\nCross-validation scores:', ', '.join(map(str, cv_knn2)))

print(f'\nMean cross-validation score: {cv_knn2.mean():.2f}')
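Each model above repeats the same reporting lines; a small helper function (hypothetical, not part of the original code) makes the pattern explicit and reduces duplication:

def report_cv(name, model, X, y, cv = 5):
    # Print cross-validation scores and their mean for a given model
    scores = cross_val_score(model, X, y, cv = cv)
    print(name)
    print('\nCross-validation scores:', ', '.join(map(str, scores)))
    print(f'\nMean cross-validation score: {scores.mean():.2f}')
    return scores

# Example usage: cv_knn2 = report_cv('K Nearest Neighbors Model 2', knn2, X_train, y_train)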

5. Performance Evaluation and Discussion

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score, classification_report, roc_curve, auc, precision_recall_curve

# Logistic Regression Models Comparison

y_pred_lr1 = lr1.predict(X_test_bc)

y_pred_lr2 = lr2.predict(X_test_bc)

conf_matrix_model1 = confusion_matrix(y_test_bc, y_pred_lr1)

conf_matrix_model2 = confusion_matrix(y_test_bc, y_pred_lr2)

fig, axs = plt.subplots(1, 2, figsize = (12, 4))

sns.heatmap(conf_matrix_model1, annot = True, cmap = 'Blues', ax = axs[0])

axs[0].set_title('Model 1')

sns.heatmap(conf_matrix_model2, annot = True, cmap = 'Blues', ax = axs[1])

axs[1].set_title('Model 2')

axs[0].set_xlabel('Predicted label')

axs[0].set_ylabel('True label')

axs[1].set_xlabel('Predicted label')

plt.show()

y_prob_lr1 = lr1.predict_proba(X_test_bc)[:,1]

y_prob_lr2 = lr2.predict_proba(X_test_bc)[:,1]

fpr1, tpr1, _ = roc_curve(y_test_bc, y_prob_lr1)

roc_auc1 = auc(fpr1, tpr1)

fpr2, tpr2, _ = roc_curve(y_test_bc, y_prob_lr2)

roc_auc2 = auc(fpr2, tpr2)

colors = sns.color_palette('Set2', n_colors = 3)

fig, axes = plt.subplots(1, 3, figsize = (15, 5))

# ROC curves for each model and an overlay for direct comparison

axes[0].plot(fpr1, tpr1, label = f'ROC curve (area = {roc_auc1:.2%})', color = colors[1])

axes[0].set_title('ROC Curve (Model 1)')

axes[1].plot(fpr2, tpr2, label = f'ROC curve (area = {roc_auc2:.2%})', color = colors[2])

axes[1].set_title('ROC Curve (Model 2)')

axes[2].plot(fpr1, tpr1, color = colors[1], label = 'Model 1')

axes[2].plot(fpr2, tpr2, color = colors[2], label = 'Model 2')

axes[2].set_title('Model 1 vs Model 2')

axes[2].legend(loc = 'lower right')

plt.tight_layout()

plt.show()

precision1, recall1, threshold1 = precision_recall_curve(y_test_bc, y_prob_lr1)

precision2, recall2, threshold2 = precision_recall_curve(y_test_bc, y_prob_lr2)

fig, axs = plt.subplots(1, 3, figsize = (15, 5))

axs[0].plot(recall1, precision1, color = colors[1], label = 'Model 1')

axs[0].set_xlabel('Recall')

axs[0].set_ylabel('Precision')

axs[0].set_title('Precision-Recall Curve (Model 1)')

axs[1].plot(recall2, precision2, color = colors[2], label = 'Model 2')

axs[1].set_xlabel('Recall')

axs[1].set_ylabel('Precision')

axs[1].set_title('Precision-Recall Curve (Model 2)')

axs[2].plot(recall1, precision1, color = colors[1], label = 'Model 1')

axs[2].plot(recall2, precision2, color = colors[2], label = 'Model 2')

axs[2].set_title('Model 1 vs Model 2')

axs[2].legend(loc = 'lower left')

plt.show()

target_names = lr1.classes_

metrics1 = classification_report(y_true = y_test_bc, y_pred = y_pred_lr1, target_names = target_names, output_dict = True)

precision1 = [metrics1[target_name]['precision'] for target_name in target_names]

recall1 = [metrics1[target_name]['recall'] for target_name in target_names]

f1_score1 = [metrics1[target_name]['f1-score'] for target_name in target_names]

metrics2 = classification_report(y_true = y_test_bc, y_pred = y_pred_lr2, target_names = target_names, output_dict = True)

precision2 = [metrics2[target_name]['precision'] for target_name in target_names]

recall2 = [metrics2[target_name]['recall'] for target_name in target_names]

f1_score2 = [metrics2[target_name]['f1-score'] for target_name in target_names]

data1 = np.array([precision1, recall1, f1_score1])

data2 = np.array([precision2, recall2, f1_score2])

rows = ['Precision', 'Recall', 'F1-score']

fig, axs = plt.subplots(1, 2, figsize = (14, 6))

sns.heatmap(data1, cmap = 'Pastel1', annot = True, fmt = '.2f', xticklabels = target_names, yticklabels = rows, ax = axs[0])

sns.heatmap(data2, cmap = 'Pastel1', annot = True, fmt = '.2f', xticklabels = target_names, yticklabels = rows, ax = axs[1])

axs[0].set_title('Classification Report (Model 1)')

axs[1].set_title('Classification Report (Model 2)')

plt.show()

palette = sns.color_palette('Blues', n_colors = 3)

acc1 = accuracy_score(y_pred_lr1, y_test_bc)

acc2 = accuracy_score(y_pred_lr2, y_test_bc)

labels = ['Model 1', 'Model 2']

scores = [acc1, acc2]

fig, ax = plt.subplots(figsize = (9, 3))

ax.barh(labels, scores, color = palette)

ax.set_xlim([0, 1])

ax.set_xlabel('Accuracy Score')

ax.set_title('Logistic Regression Model Comparison')

for i, v in enumerate(scores):

ax.text(v + 0.01, i, str(round(v, 3)), ha = 'left', va = 'center')

plt.show()

palette = sns.color_palette('Greens', n_colors = 3)

labels = ['Model 1', 'Model 2']

scores = [cv_lr1.mean(), cv_lr2.mean()]

fig, ax = plt.subplots(figsize = (9, 3))

ax.barh(labels, scores, color = palette)

ax.set_xlim([0, 1])

ax.set_xlabel('Cross Validation Score')

ax.set_title('Logistic Regression Model Comparison (Cross Validation)')

for i, v in enumerate(scores):

ax.text(v + 0.01, i, str(round(v, 3)), ha = 'left', va = 'center')

plt.show()

scores = [cv_svm1.mean(), cv_svm2.mean()]

fig, ax = plt.subplots(figsize = (9, 3))

ax.barh(labels, scores, color = palette)

ax.set_xlim([0, 1])

ax.set_xlabel('Cross Validation Score')
ax.set_title('Support Vector Machine Model Comparison (Cross Validation)')

for i, v in enumerate(scores):

ax.text(v + 0.01, i, str(round(v, 3)), ha = 'left', va = 'center')

plt.show()

# Comparison of the Binary Classification Algorithms

1. Logistic Regression: Model 2

2. Support Vector Machine: Model 2
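The head-to-head comparison below uses the SVM's test-set predictions and probabilities, which are not computed anywhere in this listing. A minimal sketch, assuming the same held-out binary test set used for the logistic regression models:

# Assumed: predictions of the winning SVM on the binary test split
y_pred_svm2 = svm2.predict(X_test_bc)
y_prob_svm2 = svm2.predict_proba(X_test_bc)[:, 1]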

conf_matrix_model1 = confusion_matrix(y_test_bc, y_pred_lr2)

conf_matrix_model2 = confusion_matrix(y_test_bc, y_pred_svm2)

fig, axs = plt.subplots(1, 2, figsize = (12, 4))

sns.heatmap(conf_matrix_model1, annot = True, cmap = 'Blues', ax = axs[0])

axs[0].set_title('Logistic Regression')

sns.heatmap(conf_matrix_model2, annot = True, cmap = 'Blues', ax = axs[1])

axs[1].set_title('Support Vector Machine')

axs[0].set_xlabel('Predicted label')

axs[0].set_ylabel('True label')

axs[1].set_xlabel('Predicted label')

plt.show()

fpr1, tpr1, _ = roc_curve(y_test_bc, y_prob_lr2)

roc_auc1 = auc(fpr1, tpr1)

fpr2, tpr2, _ = roc_curve(y_test_bc, y_prob_svm2)

roc_auc2 = auc(fpr2, tpr2)

fig, axes = plt.subplots(1, 3, figsize = (15, 5))

axes[0].plot(fpr1, tpr1, label = f'ROC curve (area = {roc_auc1:.2%})', color = colors[1])

axes[0].plot([0, 1], [0, 1], color = colors[0], linestyle = '--')

axes[0].set_xlim([-0.05, 1.0])

axes[1].set_xlabel('False Positive Rate')

axes[1].set_ylabel('True Positive Rate')

axes[2].set_title('LR vs SVM')

axes[2].legend(loc = 'lower right')

plt.tight_layout()

plt.show()

precision1, recall1, threshold1 = precision_recall_curve(y_test_bc, y_prob_lr2)

precision2, recall2, threshold2 = precision_recall_curve(y_test_bc, y_prob_svm2)

fig, axs = plt.subplots(1, 3, figsize = (15, 5))

axs[0].plot(recall1, precision1, color = colors[1], label = 'Model 1')

axs[0].set_xlabel('Recall')

axs[0].set_ylabel('Precision')

axs[0].set_title('Precision-Recall Curve (LR)')

axs[1].plot(recall2, precision2, color = colors[2], label = 'Model 2')

axs[2].set_title('LR vs SVM')

axs[2].legend(loc = 'lower left')

plt.tight_layout()

plt.show()

target_names = svm2.classes_

metrics1 = classification_report(y_true = y_test_bc, y_pred = y_pred_lr2, target_names = target_names, output_dict = True)

precision1 = [metrics1[target_name]['precision'] for target_name in target_names]

recall1 = [metrics1[target_name]['recall'] for target_name in target_names]

f1_score1 = [metrics1[target_name]['f1-score'] for target_name in target_names]

metrics2 = classification_report(y_true = y_test_bc, y_pred = y_pred_svm2, target_names = target_names, output_dict = True)

precision2 = [metrics2[target_name]['precision'] for target_name in target_names]

recall2 = [metrics2[target_name]['recall'] for target_name in target_names]

f1_score2 = [metrics2[target_name]['f1-score'] for target_name in target_names]

data1 = np.array([precision1, recall1, f1_score1])

data2 = np.array([precision2, recall2, f1_score2])

rows = ['Precision', 'Recall', 'F1-score']

fig, axs = plt.subplots(1, 2, figsize = (14, 6))

sns.heatmap(data1, cmap = 'Pastel1', annot = True, fmt = '.2f', xticklabels = target_names, yticklabels = rows, ax = axs[0])

sns.heatmap(data2, cmap = 'Pastel1', annot = True, fmt = '.2f', xticklabels = target_names, yticklabels = rows, ax = axs[1])

axs[0].set_title('Classification Report (LR)')

axs[1].set_title('Classification Report (SVM)')

plt.show()

palette = sns.color_palette('Blues', n_colors = 2)

acc1 = accuracy_score(y_pred_lr2, y_test_bc)

acc2 = accuracy_score(y_pred_svm2, y_test_bc)

labels = ['Logistic Regression', 'Support Vector Machine']

scores = [acc1, acc2]

fig, ax = plt.subplots(figsize = (9, 3))

ax.barh(labels, scores, color = palette)

ax.set_xlim([0, 1])

ax.set_xlabel('Accuracy Score')

ax.set_title('Binary Classification Model Comparison')

for i, v in enumerate(scores):

ax.text(v + 0.01, i, str(round(v, 3)), ha = 'left', va = 'center')

plt.show()

palette = sns.color_palette('Greens', n_colors = 2)

labels = ['Logistic Regression', 'Support Vector Machine']

scores = [cv_lr2.mean(), cv_svm2.mean()]

fig, ax = plt.subplots(figsize = (9, 3))

ax.barh(labels, scores, color = palette)

ax.set_xlim([0, 1])

ax.set_xlabel('Cross Validation Score')

ax.set_title('Binary Classification Model Comparison')

for i, v in enumerate(scores):

ax.text(v + 0.01, i, str(round(v, 3)), ha = 'left', va = 'center')

plt.show()

# Random Forest Models Comparison

y_pred_rf1 = rf1.predict(X_test)

y_pred_rf2 = rf2.predict(X_test)

conf_matrix_model1 = confusion_matrix(y_test, y_pred_rf1)

conf_matrix_model2 = confusion_matrix(y_test, y_pred_rf2)

fig, axs = plt.subplots(1, 2, figsize = (16, 7))

sns.heatmap(conf_matrix_model1, annot = True, cmap = 'Blues', ax = axs[0], xticklabels = rf1.classes_, yticklabels = rf1.classes_)

axs[0].set_title('Model 1')

sns.heatmap(conf_matrix_model2, annot = True, cmap = 'Blues', ax = axs[1], xticklabels = rf2.classes_, yticklabels = rf2.classes_)

axs[1].set_title('Model 2')

axs[0].set_xlabel('Predicted label')

axs[0].set_ylabel('True label')

axs[1].set_xlabel('Predicted label')

fig.tight_layout()

plt.show()

target_names = rf1.classes_

metrics1 = classification_report(y_true = y_test, y_pred = y_pred_rf1, target_names = target_names, output_dict = True)

precision1 = [metrics1[target_name]['precision'] for target_name in target_names]

recall1 = [metrics1[target_name]['recall'] for target_name in target_names]

f1_score1 = [metrics1[target_name]['f1-score'] for target_name in target_names]

metrics2 = classification_report(y_true = y_test, y_pred = y_pred_rf2, target_names = target_names, output_dict = True)

precision2 = [metrics2[target_name]['precision'] for target_name in target_names]

recall2 = [metrics2[target_name]['recall'] for target_name in target_names]

f1_score2 = [metrics2[target_name]['f1-score'] for target_name in target_names]

data1 = np.array([precision1, recall1, f1_score1])

data2 = np.array([precision2, recall2, f1_score2])

rows = ['Precision', 'Recall', 'F1-score']

fig, axs = plt.subplots(1, 2, figsize = (14, 6))

sns.heatmap(data1, cmap = 'Pastel1', annot = True, fmt = '.2f', xticklabels = target_names, yticklabels = rows, ax = axs[0])

sns.heatmap(data2, cmap = 'Pastel1', annot = True, fmt = '.2f', xticklabels = target_names, yticklabels = rows, ax = axs[1])

axs[0].set_title('Classification Report (Model 1)')

axs[1].set_title('Classification Report (Model 2)')

fig.tight_layout()

plt.show()

palette = sns.color_palette('Blues', n_colors = 2)

acc1 = accuracy_score(y_pred_rf1, y_test)

acc2 = accuracy_score(y_pred_rf2, y_test)

labels = ['Model 1', 'Model 2']

scores = [acc1, acc2]

fig, ax = plt.subplots(figsize = (9, 3))

ax.barh(labels, scores, color = palette)

ax.set_xlim([0, 1])

ax.set_xlabel('Accuracy Score')

ax.set_title('Random Forest Model Comparison')

for i, v in enumerate(scores):

ax.text(v + 0.01, i, str(round(v, 3)), ha = 'left', va = 'center')

plt.show()

palette = sns.color_palette('Greens', n_colors = 2)

labels = ['Model 1', 'Model 2']

scores = [cv_rf1.mean(), cv_rf2.mean()]

fig, ax = plt.subplots(figsize = (9, 3))

ax.barh(labels, scores, color = palette)

ax.set_xlim([0, 1])

ax.set_xlabel('Cross Validation Score')

ax.set_title('Random Forest Model Comparison (Cross Validation)')

for i, v in enumerate(scores):

ax.text(v + 0.01, i, str(round(v, 3)), ha = 'left', va = 'center')

plt.show()

# Decision Trees Models Comparison

y_pred_dt1 = dt1.predict(X_test)

y_pred_dt2 = dt2.predict(X_test)

conf_matrix_model1 = confusion_matrix(y_test, y_pred_dt1)

conf_matrix_model2 = confusion_matrix(y_test, y_pred_dt2)

fig, axs = plt.subplots(1, 2, figsize = (16, 7))

sns.heatmap(conf_matrix_model1, annot = True, cmap = 'Blues', ax = axs[0], xticklabels = dt1.classes_, yticklabels = dt1.classes_)

axs[0].set_title('Model 1')

sns.heatmap(conf_matrix_model2, annot = True, cmap = 'Blues', ax = axs[1], xticklabels = dt2.classes_, yticklabels = dt2.classes_)

axs[1].set_title('Model 2')

axs[0].set_xlabel('Predicted label')

axs[0].set_ylabel('True label')

axs[1].set_xlabel('Predicted label')

fig.tight_layout()

plt.show()

metrics1 = classification_report(y_true = y_test, y_pred = y_pred_dt1, target_names = target_names, output_dict = True)

precision1 = [metrics1[target_name]['precision'] for target_name in target_names]

recall1 = [metrics1[target_name]['recall'] for target_name in target_names]

f1_score1 = [metrics1[target_name]['f1-score'] for target_name in target_names]

metrics2 = classification_report(y_true = y_test, y_pred = y_pred_dt2, target_names = target_names, output_dict = True)

precision2 = [metrics2[target_name]['precision'] for target_name in target_names]

recall2 = [metrics2[target_name]['recall'] for target_name in target_names]

f1_score2 = [metrics2[target_name]['f1-score'] for target_name in target_names]

data1 = np.array([precision1, recall1, f1_score1])

data2 = np.array([precision2, recall2, f1_score2])

rows = ['Precision', 'Recall', 'F1-score']

fig, axs = plt.subplots(1, 2, figsize = (14, 6))

sns.heatmap(data1, cmap = 'Pastel1', annot = True, fmt = '.2f', xticklabels = target_names, yticklabels = rows, ax = axs[0])

sns.heatmap(data2, cmap = 'Pastel1', annot = True, fmt = '.2f', xticklabels = target_names, yticklabels = rows, ax = axs[1])

axs[0].set_title('Classification Report (Model 1)')

axs[1].set_title('Classification Report (Model 2)')

fig.tight_layout()

plt.show()

palette = sns.color_palette('Blues', n_colors = 2)

acc1 = accuracy_score(y_pred_dt1, y_test)

acc2 = accuracy_score(y_pred_dt2, y_test)

labels = ['Model 1', 'Model 2']

scores = [acc1, acc2]

fig, ax = plt.subplots(figsize = (9, 3))

ax.barh(labels, scores, color = palette)

ax.set_xlim([0, 1])

ax.set_xlabel('Accuracy Score')

ax.set_title('Decision Trees Model Comparison')

for i, v in enumerate(scores):

ax.text(v + 0.01, i, str(round(v, 3)), ha = 'left', va = 'center')

plt.show()

palette = sns.color_palette('Greens', n_colors = 2)

labels = ['Model 1', 'Model 2']

scores = [cv_dt1.mean(), cv_dt2.mean()]

fig, ax = plt.subplots(figsize = (9, 3))

ax.barh(labels, scores, color = palette)

ax.set_xlim([0, 1])

ax.set_xlabel('Cross Validation Score')

ax.set_title('Decision Trees Model Comparison (Cross Validation)')

for i, v in enumerate(scores):

ax.text(v + 0.01, i, str(round(v, 3)), ha = 'left', va = 'center')

plt.show()
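The KNN classification reports below rely on test-set predictions that do not appear in this listing. A minimal sketch, assuming the same multi-class test split as the other models:

# Assumed: KNN predictions on the multi-class test split
y_pred_knn1 = knn1.predict(X_test)
y_pred_knn2 = knn2.predict(X_test)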

target_names = knn1.classes_

metrics1 = classification_report(y_true = y_test, y_pred = y_pred_knn1, target_names = target_names, output_dict = True)

precision1 = [metrics1[target_name]['precision'] for target_name in target_names]

recall1 = [metrics1[target_name]['recall'] for target_name in target_names]

f1_score1 = [metrics1[target_name]['f1-score'] for target_name in target_names]

metrics2 = classification_report(y_true = y_test, y_pred = y_pred_knn2, target_names = target_names, output_dict = True)

precision2 = [metrics2[target_name]['precision'] for target_name in target_names]

recall2 = [metrics2[target_name]['recall'] for target_name in target_names]

f1_score2 = [metrics2[target_name]['f1-score'] for target_name in target_names]

data1 = np.array([precision1, recall1, f1_score1])

data2 = np.array([precision2, recall2, f1_score2])

rows = ['Precision', 'Recall', 'F1-score']

fig, axs = plt.subplots(1, 2, figsize = (14, 6))

sns.heatmap(data1, cmap = 'Pastel1', annot = True, fmt = '.2f', xticklabels = target_names, yticklabels = rows, ax = axs[0])

sns.heatmap(data2, cmap = 'Pastel1', annot = True, fmt = '.2f', xticklabels = target_names, yticklabels = rows, ax = axs[1])

axs[0].set_title('Classification Report (Model 1)')

axs[1].set_title('Classification Report (Model 2)')

fig.tight_layout()

plt.show()

# Comparison of the Multi-class Classification Algorithms

palette = sns.color_palette('Blues', n_colors = 3)

labels = ['Random Forest', 'Decision Trees', 'K Nearest Neighbours']

scores = [accuracy_score(y_test, y_pred_rf2), accuracy_score(y_test, y_pred_dt2), accuracy_score(y_test, y_pred_knn2)]

fig, ax = plt.subplots(figsize = (9, 3))

ax.barh(labels, scores, color = palette)

ax.set_xlim([0, 1])

ax.set_xlabel('Accuracy Score')

ax.set_title('Multi-class Classification Model Comparison')

for i, v in enumerate(scores):

ax.text(v + 0.01, i, str(round(v, 4)), ha = 'left', va = 'center')

plt.show()

palette = sns.color_palette('Greens', n_colors = 3)

labels = ['Random Forest', 'Decision Trees', 'K Nearest Neighbours']

scores = [cv_rf2.mean(), cv_dt2.mean(), cv_knn2.mean()]

fig, ax = plt.subplots(figsize = (9, 3))

ax.barh(labels, scores, color = palette)

ax.set_xlim([0, 1])

ax.set_xlabel('Cross Validation Score')

ax.set_title('Multi-class Classification Model Comparison')

for i, v in enumerate(scores):

ax.text(v + 0.01, i, str(round(v, 3)), ha = 'left', va = 'center')

plt.show()

preds = [y_pred_rf2, y_pred_dt2, y_pred_knn2]

datas = []

# Building a precision/recall/F1 matrix from each model's classification report

for y_pred in preds:

m = classification_report(y_true = y_test, y_pred = y_pred, target_names = target_names, output_dict = True)

datas.append(np.array([[m[t]['precision'] for t in target_names], [m[t]['recall'] for t in target_names], [m[t]['f1-score'] for t in target_names]]))

fig, axs = plt.subplots(1, 3, figsize = (22, 8))

sns.heatmap(datas[0], cmap = 'Pastel1', annot = True, fmt = '.2f', xticklabels = target_names, yticklabels = rows, ax = axs[0])

sns.heatmap(datas[1], cmap = 'Pastel1', annot = True, fmt = '.2f', xticklabels = target_names, yticklabels = rows, ax = axs[1])

sns.heatmap(datas[2], cmap = 'Pastel1', annot = True, fmt = '.2f', xticklabels = target_names, yticklabels = rows, ax = axs[2])

axs[0].set_title('Classification Report (Random Forest)')

axs[1].set_title('Classification Report (Decision Trees)')

axs[2].set_title('Classification Report (K Nearest Neighbours)')

fig.tight_layout()

plt.show()

preds = [y_pred_rf2, y_pred_dt2, y_pred_knn2]

conf_matrix = [confusion_matrix(y_test, y_pred) for y_pred in preds]

fig, axs = plt.subplots(1, 3, figsize = (22, 8))

sns.heatmap(conf_matrix[0], annot = True, cmap = 'Blues', ax = axs[0], xticklabels = dt1.classes_, yticklabels = dt1.classes_)

sns.heatmap(conf_matrix[1], annot = True, cmap = 'Blues', ax = axs[1], xticklabels = dt1.classes_, yticklabels = dt1.classes_)

sns.heatmap(conf_matrix[2], annot = True, cmap = 'Blues', ax = axs[2], xticklabels = dt1.classes_, yticklabels = dt1.classes_)

axs[0].set_title('Confusion Matrix (Random Forest)')

axs[1].set_title('Confusion Matrix (Decision Trees)')

axs[2].set_title('Confusion Matrix (K Nearest Neighbours)')

axs[0].set_xlabel('Predicted label')

axs[1].set_xlabel('Predicted label')

axs[2].set_xlabel('Predicted label')

axs[0].set_ylabel('True label')

fig.tight_layout()

plt.show()
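As a compact numeric wrap-up of the three multi-class models (a sketch, not part of the original listing; it reuses the labels and preds defined above):

# Per-model test accuracy, printed alongside the charts above
for name, y_pred in zip(labels, preds):
    print(f'{name}: accuracy = {accuracy_score(y_test, y_pred):.4f}')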

APPENDIX 2

SCREENSHOT

LOGISTIC REGRESSION MODEL COMPARISON

Figure A2.1 Logistic Regression Model Comparison

CROSS VALIDATION

Figure A2.2 Cross Validation

PREDICT TABLE

Figure A2.3 Predict Table

ROC CURVE MODEL

Figure A2.4 ROC Curve Model

PRECISION-RECALL CURVE MODEL

Figure A2.5 Precision-Recall Curve Model

REFERENCES

LIST OF PAPERS

[1] Shone, N., Ngoc, S. M., & Loo, J. (2018). A deep learning approach to network
intrusion detection. Proceedings of the International Conference on Neural Networks.

[2] Moustafa, N., & Slay, J. (2015). The evaluation of network anomaly detection systems:
Statistical analysis and benchmarking. Information Security Journal: A Global
Perspective, 24(1), 1-18.

[3] Xia, F., Wang, S., & Yang, L. (2015). A deep learning approach for intrusion detection
in industrial control systems. IEEE Transactions on Industrial Informatics, 11(3), 698-
707.

[4] Alazab, M., Tang, M., & Soni, S. (2020). An overview of machine learning-based intrusion
detection systems. Journal of Network and Computer Applications, 124, 34-43.

[5] Sadiq, M., Zaheer, S., & Jan, Z. (2020). A survey on intrusion detection systems in the
era of big data and machine learning. Future Generation Computer Systems, 112, 314-
326.

[6] Hussain, Z., & Hamid, S. (2019). Intrusion detection systems: A comprehensive survey
and future directions. Computer Networks, 149, 29-53.

[7] Sikder, M. K., & Sharmila, D. (2021). A survey of machine learning techniques for
intrusion detection systems: Challenges and future directions. Journal of King Saud
University-Computer and Information Sciences.

[8] Stoll, C. (1989). The Cuckoo's Egg: Tracking a Spy Through the Maze of Computer
Espionage. An early and influential account of a real-world intrusion detection case.

[9] Heberlein, T., & Mukherjee, B. (1990). Network Intrusion Detection. Introduced one of
the first IDS prototypes at Lawrence Livermore National Lab.

[10] Northcutt, S. (1998). Crackers Knock, Don't Get In. Discusses the evolution and
challenges of IDS in the late 1990s.

[11] Butun, I., Morgera, S. D., & Sankar, R. (2014). A survey of intrusion detection systems
in wireless sensor networks. IEEE Communications Surveys & Tutorials, 16(1), 266–
282.

[12] Nguyen, T. T., & Armitage, G. (2008). A survey of techniques for internet traffic
classification using machine learning. IEEE Communications Surveys & Tutorials, 10(4),
56–76.

[13] Nehinbe, J. O. (2009). A simple method for improving intrusion detections in corporate
networks. In: International conference on information security and digital forensics.
Springer, pp 111–122.

[14] Nehinbe, J. O. (2011). A critical evaluation of datasets for investigating IDSs and IPSs
researches. In: 2011 IEEE 10th international conference on cybernetic intelligent
systems (CIS). IEEE, pp 92–97.

[15] Ngueajio, M. K., Washington, G., Rawat, D. B., & Ngueabou, Y. (2022). Intrusion
Detection Systems Using Support Vector Machines on the KDDCUP'99 and NSL-KDD
Datasets: A Comprehensive Survey.

[16] Xu, Z., Wu, Y., Wang, S., Gao, J., Qiu, T., Wang, Z., Wan, H., & Zhao, X. (2025). Deep
Learning-based Intrusion Detection Systems: A Survey.

[17] Gueriani, A., Kheddar, H., & Mazari, A. C. (2024). Deep Reinforcement Learning for
Intrusion Detection in IoT: A Survey.

[18] Hoque, M. S., Mukit, M. A., & Bikas, M. A. N. (2012). An Implementation of Intrusion
Detection System Using Genetic Algorithm.

[19] Ahmim, A., Ghoualmi-Zine, N., & Mellouk, A. (2020). A novel hierarchical intrusion detection system based on decision tree and rules-based models. Computers & Security, 92.

[20] Soltani, M., Salahshour, M. A., & Dehghantanha, A. (2020). A content-based deep intrusion detection system. arXiv:2001.05009.

[21] Pradhan, R. (2020). Decision tree based classifications on CICIDS 2017 dataset for the identification of DDoS, Botnet, and web attack, 20–26.

[22] Maseer, Z. K., & Mohammed, M. A. (2022). Benchmarking of ML for anomaly-based IDSs in CICIDS2017. IJISA, 17(2).

[23] Ashiku, L., & Dagli, C. (2020). Network intrusion detection system using deep learning. Procedia Computer Science, 170, 234–239.

[24] Ahmim, A., Ghoualmi-Zine, N., & Mellouk, A. (2020). A novel hierarchical intrusion
detection system based on decision tree and rules-based models. Computers & Security,
92.

[25] Soltani, M., Salahshour, M. A., & Dehghantanha, A. (2020). A content-based deep intrusion detection system. arXiv:2001.05009.

[26] Pradhan, R. (2020). Decision tree-based classifications on CICIDS 2017 dataset for identifying DDoS and botnet attacks. 18(6), 20–26.

[27] Zhang, Y., & Ran, X. (2022). A step-based deep learning approach for intrusion detection.
International Journal of Intelligent Systems and Applications, 14(2).

[28] Ashiku, L., & Dagli, C. (2021). Deep learning for network intrusion detection. Procedia
Computer Science, 170, 234–239.

[29] Camacho, J., Pérez-Villegas, A., García-Teodoro, P., & Maciá-Fernández, G. (2016). PCA-based multivariate statistical network monitoring for anomaly detection. Computers & Security, 59, 118–137.

[30] Nisioti, A., Mylonas, A., Yoo, P. D., & Katos, V. (2018). From intrusion detection to
attacker attribution: a comprehensive survey of unsupervised methods. IEEE
Communications Surveys & Tutorials, 20(4), 3369–3388.

