Report

Submitted by
PAVITHRA .A (422221104025)
of
BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING
SURYA GROUP OF INSTITUTIONS, VIKIRAVANDI-605652
MAY 2025
ANNA UNIVERSITY: CHENNAI 600 025
BONAFIDE CERTIFICATE
SIGNATURE
Dr. A. Jagan, M.Tech., Ph.D.
HoD & Vice Principal
Associate Professor
Dept. of Computer Science & Engineering
Surya Group of Institutions
School of Engineering and Technology

SIGNATURE
Mrs. B. Lakshmidevi, M.Tech.,
Supervisor
Assistant Professor
Dept. of Computer Science & Engineering
Surya Group of Institutions
School of Engineering and Technology
ACKNOWLEDGEMENT
We are highly indebted to Dr. A. Jagan, M.Tech., Ph.D., HoD & Vice
Principal, Department of Computer Science and Engineering, for providing
valuable insights into the subject and helping us wherever possible.
Finally, we also extend our heartfelt thanks to all our staff members and
friends who rendered their valuable help in making this project successful.
TABLE OF CONTENTS
LIST OF ABBREVIATIONS v
ABSTRACT 1
1 INTRODUCTION 2
2 SYSTEM ANALYSIS 6
2.1 LITERATURE SURVEY 6
2.1.1 Survey Paper 1 6
2.1.2 Survey Paper 2 6
2.1.3 Survey Paper 3 7
2.3.1 Advantage 9
3 SYSTEM ARCHITECTURE 10
3.1 SYSTEM ARCHITECTURE DESIGN 10
3.2 UML DIAGRAMS 11
3.2.1 Use Case Diagram 11
7.2.1 Unit Testing 23
7.2.2 Functional Testing 23
APPENDIX 1 29
APPENDIX 2 56
REFERENCES 62
LIST OF FIGURES
6.2 Implementation 21
LIST OF ABBREVIATIONS
ML Machine Learning
LR Logistic Regression
TP True Positive
TN True Negative
FP False Positive
FN False Negative
ABSTRACT
Intrusion Detection Systems (IDS) are crucial for securing modern networks against
increasing cyber threats. This study utilizes the CIC-IDS2017 dataset, a benchmark dataset
that closely simulates real-world network traffic, to evaluate the performance of traditional
supervised machine learning algorithms in binary classification of normal and attack traffic.
To optimize performance and reduce computational overhead, data preprocessing techniques
such as feature selection and Principal Component Analysis (PCA) are applied. PCA helps
reduce dimensionality while preserving essential information, thereby enabling more efficient
training of the classification models.
The machine learning models implemented include Logistic Regression, Support Vector
Machines (SVM), and K-Nearest Neighbors (KNN). Each algorithm is tested with two parameter
configurations to examine the impact of tuning on model performance, and the models are
assessed using metrics such as accuracy, precision, recall, F1-score, ROC-AUC, and
cross-validation scores. The results reveal that both Logistic Regression and SVM achieve high
accuracy and generalization capability. This report provides a comprehensive performance
comparison and highlights the potential of classical machine learning models for real-world
intrusion detection tasks, offering insights into model selection and tuning for cybersecurity
applications.
CHAPTER 1
INTRODUCTION
This study aims to provide insights into the practical implementation of machine learning
models for intrusion detection, highlighting the trade-offs between different algorithms and
the impact of tuning and dimensionality reduction. The findings serve as a reference for
developing scalable and efficient IDS solutions suitable for real-world deployment.
As computer networks and internet services have become integral to modern society, they
have also become prime targets for cyberattacks. Organizations and individuals alike are
vulnerable to threats such as malware, phishing, denial-of-service (DoS) attacks, and data breaches.
To counter these threats, security mechanisms are employed to monitor, detect, and respond to
unauthorized activities.
Figure 1.1 Background of Intrusion Detection
Intrusion Detection Systems are designed to monitor network or system activities for
malicious behavior or policy violations. IDS solutions typically fall into two main categories:
Signature-based IDS and Anomaly-based IDS. Signature-based systems rely on predefined
patterns of known threats, offering high accuracy for previously encountered attacks but failing
to detect novel or evolving threats. In contrast, anomaly-based IDS uses statistical or machine
learning techniques to detect deviations from normal behavior, allowing for the identification of
zero-day and sophisticated attacks.
With the increasing volume and complexity of network traffic, manual monitoring and
static rule-based detection are no longer sufficient. This challenge has led to the integration of
machine learning and data mining techniques in intrusion detection. Machine learning models
can learn from historical data to distinguish between normal and malicious behavior, adapting to
new attack patterns over time. These models offer the potential for more accurate, scalable, and
real-time intrusion detection.
To support research and development in this area, several datasets have been created,
with the CIC-IDS2017 dataset standing out for its realistic simulation of modern network traffic
and labeled attack scenarios. It includes various types of attacks such as DDoS, brute force,
infiltration, and botnet traffic, providing a valuable resource for evaluating and training machine
learning models.
1.2 PROBLEM STATEMENT
Traditional intrusion detection systems often rely on predefined attack patterns or signatures,
which means they are only effective at identifying attacks that match known signatures. As
cyberattacks evolve and new methods are continually developed by malicious actors, this approach
proves insufficient for detecting novel and complex attack types.
Moreover, many existing IDS solutions suffer from high false positive rates, where
benign activities are incorrectly flagged as malicious, leading to unnecessary alarms and resource
wastage. This not only impacts the performance of the system but also hampers the productivity
of security personnel, who must sift through large volumes of false alerts.
Thus, the core problem addressed in this study is the development of an effective
machine learning-based intrusion detection system that can accurately detect both known and
unknown attacks, reduce false positives and false negatives, and provide an efficient solution for
real-time network monitoring. This study aims to provide a comparative analysis of their
performance in detecting network intrusions and contribute to the improvement of network
security systems.
The main objectives of this study are to:
• Analyze the CIC-IDS2017 dataset for network traffic patterns.
• Preprocess and optimize the dataset for machine learning.
• Apply and evaluate multiple machine learning algorithms for intrusion detection.
• Identify the most effective machine learning model for real-time intrusion detection.
• Contribute to the advancement of intrusion detection technologies.
The scope of this study is centered on the development, application, and evaluation of
machine learning techniques for detecting intrusions in computer networks, using the CIC-
IDS2017 dataset. As cyber threats continue to grow in complexity and frequency, traditional
intrusion detection systems are often unable to keep up. This study aims to bridge that gap by
exploring intelligent, data-driven approaches that can identify both known and unknown attacks
with high accuracy and low false alarm rates.
The study focuses on five major algorithms, namely Logistic Regression, Support Vector
Machines (SVM), Random Forest, Decision Tree, and K-Nearest Neighbors (KNN), and evaluates
their performance in both binary classification (attack vs. normal) and multi-class
classification.
This study is intended for academic research, providing a foundation for future
implementations in practical cybersecurity systems. The findings can assist network
administrators, cybersecurity analysts, and developers in choosing appropriate algorithms for
building efficient IDS solutions.
CHAPTER 2
SYSTEM ANALYSIS
2.1 LITERATURE SURVEY
Shone, N., Ngoc, S. M., & Loo, J. (2018). A deep learning approach to network intrusion
detection. Proceedings of the International Conference on Neural Networks.
Traditional machine learning models like decision trees, SVMs, and random forests have
shown effectiveness but often require manual feature engineering and struggle with high
dimensional data. In contrast, deep learning models such as convolutional neural networks,
recurrent neural networks, long short-term memory networks, and autoencoders have demonstrated
superior performance by automatically learning complex patterns from raw traffic data. Although
traditional signature-based IDS have served as foundational tools in network security, their
limitations, such as high false-negative rates, reliance on continuous updates, and an inability
to detect novel attacks, have driven the development of more advanced detection techniques.
Moustafa, N., & Slay, J. (2015). The evaluation of network anomaly detection systems: Statistical
analysis and benchmarking. Information Security Journal: A Global Perspective, 24(1), 1-18.
Existing literature on the evaluation of network anomaly detection systems highlights the
importance of rigorous statistical analysis and standardized benchmarking for assessing system
performance. Researchers have employed metrics such as detection rate, false positive rate,
precision, recall, and F1-score to quantify the effectiveness of various anomaly detection models.
Several studies have emphasized the need for statistically sound evaluation methods, including
cross-validation, hypothesis testing, and confidence intervals, to avoid overfitting and ensure the
generalizability of results. Moreover, recent works explore the integration of synthetic data
generation and adversarial testing to improve benchmarking relevance.
2.1.3 USE OF CIC-IDS2017 DATASET
Xia, F., Wang, S., & Yang, L. (2015). A deep learning approach for intrusion detection in
industrial control systems. IEEE Transactions on Industrial Informatics, 11(3), 698-707.
The CIC-IDS2017 dataset is distinguished by its variety and realism. It includes data
from multiple attack types such as Denial of Service (DoS), Distributed Denial of Service
(DDoS), botnets, brute-force attacks, and various other malicious activities that are commonly
encountered in real-world networks. A significant advantage of the CIC-IDS2017 dataset is the
variety of attack scenarios it includes. Researchers have utilized this dataset to test and evaluate
various machine learning algorithms, such as Random Forest, Support Vector Machines (SVM),
and K-Nearest Neighbors (KNN), which are popular in intrusion detection tasks. These studies have
demonstrated the effectiveness of these models in detecting different types of attacks, with
particularly high detection rates for DoS and DDoS attacks.
Alazab, M., Tang, M., & Soni, S. (2020). An overview of machine learning-based intrusion
detection systems. Journal of Network and Computer Applications, 124, 34-43.
The literature on intrusion detection systems (IDS) reveals a broad and growing interest in leveraging
intelligent algorithms to identify and respond to cybersecurity threats. Traditional IDS
techniques often rely on rule-based or signature-based detection, which struggle to keep pace
with evolving attack patterns. In contrast, machine learning (ML) approaches such as decision
trees, support vector machines (SVM), k-nearest neighbors (KNN), naive Bayes, and ensemble
methods have shown improved performance by learning from historical data to detect both
known and unknown intrusions. More recent studies also integrate feature selection,
dimensionality reduction, and unsupervised learning techniques like clustering and anomaly
detection to handle large-scale and imbalanced datasets.
2.1.5 DIMENSIONALITY REDUCTION AND FEATURE ENGINEERING
Sadiq, M., Zaheer, S., & Jan, Z. (2020). A survey on intrusion detection systems in the era
of big data and machine learning. Future Generation Computer Systems, 112, 314-326.
Network traffic data often contains a large number of features, many of which may be
redundant or irrelevant to the detection of intrusions. To address these challenges, researchers have
explored various dimensionality reduction and feature engineering techniques to enhance the
performance of IDS. By applying dimensionality reduction techniques like PCA and performing
targeted feature engineering, IDS models become more efficient in processing large volumes of
network traffic. These techniques not only improve the speed of intrusion detection systems but
also ensure that they remain computationally feasible, especially in resource-constrained
environments. They help reduce computational complexity, improve detection accuracy, and
enable real-time processing of network traffic.
Hussain, Z., & Hamid, S. (2019). Intrusion detection systems: A comprehensive survey and
future directions. Computer Networks, 149, 29-53.
2.2 EXISTING SYSTEM
The existing system uses the outdated KDD99 and NSL-KDD datasets for network intrusion
detection. These datasets suffer from class imbalance, scalability issues with large-scale data,
and poor real-time adaptability to evolving attack strategies. As a result, the system struggles
with outdated attack representations and sophisticated intrusion methods.
2.2.1 DISADVANTAGES
• Accuracy limitations, particularly with imbalanced datasets.
CHAPTER 3
SYSTEM ARCHITECTURE
3.2 UML DIAGRAMS
3.2.1 USE CASE DIAGRAM
The use case diagram provides a high-level overview of the system's functional requirements
by visually representing the interactions between users (actors) and the system's functionalities
(use cases). In the Intrusion Detection System (IDS) built on the CIC-IDS2017 dataset, the
primary actor is the Network Analyst or System Administrator,
who interacts with the system to upload network traffic data, initiate preprocessing steps, and
run machine learning models. The system provides use cases such as Load Dataset, Preprocess
Data, Train Model, Evaluate Model, and Visualize Result. Each use case signifies a major
function of the system and is crucial for achieving the system’s goal of identifying and
classifying network intrusions.
3.2.2 CLASS DIAGRAM
The class diagram shows how the different entities (people, things and data) relate to
each other; in other words, it shows the static structures of the system. A class diagram can be
used to display logical classes. A class diagram can also be used to show implementation classes,
which are the things that programmers typically deal with. A class is depicted on the class
diagram as a rectangle with horizontal sections: the upper section holds the class name, the
middle section lists its attributes, and the lower section lists its operations ("methods").
The diagram has five main classes, which give the attributes and operations used in each class.
3.2.3 ACTIVITY DIAGRAM
3.2.4 DATA FLOW DIAGRAM
A Data Flow Diagram shows what kinds of information will be input to and output from
the system, where data will come from and go to and where the data will be stored. It does not
show information about the timing of processes, or information about whether processes will
operate in sequence or parallel. In the Intrusion Detection System (IDS) based on the CIC-IDS2017
dataset, the DFD illustrates how raw network traffic data is transformed into actionable
detection results.
CHAPTER 4
SYSTEM SPECIFICATION
CHAPTER 5
SYSTEM DESIGN
5.1.1 DATA COLLECTION MODULE
The Data Collection Module is the first line of defense in an Intrusion Detection System
(IDS), tasked with gathering raw data from various network sources. This module collects data
in real-time from network devices like routers, switches, firewalls, servers, and endpoints. The
data may include network traffic logs, packet captures, system logs, user activity records, and
intrusion attempts, which are fundamental for detecting malicious activity.
The module ensures that data collection is continuous and uninterrupted, processing
incoming information to maintain an up-to-date snapshot of the system’s security status. It uses
both active and passive data collection methods. Active collection involves querying devices for
logs and metrics, while passive collection listens to traffic on the network without interfering with it.
The quality of the collected data is essential for the success of the IDS. Data
preprocessing occurs within this module, where tasks like normalization, timestamp
synchronization, and filtering irrelevant data take place. This ensures that the collected data is in
a consistent format and is free from redundancies, ensuring accurate analysis in later stages. By
collecting and preparing the data, this module ensures that subsequent modules receive
high-quality information for detection and analysis.
The module is also responsible for reducing system resource consumption by deploying
lightweight agents or utilizing existing network monitoring systems to minimize overhead. As
the foundation of the IDS, its effectiveness determines the overall system performance and
accuracy of threat detection.
To begin analyzing the CIC-IDS2017 dataset, several Python libraries were imported.
NumPy and Pandas are foundational libraries for handling numerical operations and
structured data, respectively. These were used to load and manipulate the dataset efficiently.
The dataset, consisting of both normal and malicious network traffic, was read into a Pandas
DataFrame, providing a tabular structure suitable for analysis.
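A minimal sketch of this setup is shown below; the file name cicids2017_sample.csv is a
hypothetical placeholder for the actual dataset location.

# Illustrative setup sketch; the CSV path is a hypothetical placeholder
import numpy as np
import pandas as pd

data = pd.read_csv('cicids2017_sample.csv')   # load one CIC-IDS2017 capture file
print(data.shape)                             # number of flows and features
print(data.columns.tolist())                  # feature names
data.info()                                   # data types and non-null counts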
Overall, this setup phase laid the groundwork for the rest of the analysis. By ensuring
that the appropriate tools were in place for loading, inspecting, and visualizing the data, it
became possible to proceed with preprocessing, feature selection, and model development
with confidence and clarity.
5.1.2 FEATURE EXTRACTION AND SELECTION MODULE
The Feature Extraction and Selection Module is responsible for transforming raw data
into actionable insights by identifying key features that are most indicative of potential threats.
Once data is collected, it undergoes feature extraction, where important attributes such as network
traffic patterns, login attempts, packet sizes, protocol types, and session duration are identified
and extracted. These features serve as the basis for further analysis in the detection engine.
Feature extraction is often followed by feature selection, which aims to reduce the
dataset's complexity while retaining essential information. This process involves techniques such
as statistical analysis, correlation measures, and dimensionality reduction methods like Principal
Component Analysis (PCA). The goal is to eliminate redundant or irrelevant features that may
introduce noise, improving both the speed and accuracy of the detection process.
The output of this module is a refined dataset of relevant features that can be used to train
machine learning models or rule-based detection systems. By selecting the most significant
features, this module helps ensure that the IDS system is efficient and responsive while maintaining
a high detection rate. The quality of feature extraction and selection plays a pivotal role in the
accuracy of anomaly detection and signature-based detection mechanisms in the IDS.
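As an illustration of this idea, the sketch below drops one feature from every highly
correlated pair and then projects the remaining features onto a smaller number of principal
components; the correlation threshold and component count are illustrative assumptions, and X
is assumed to be a numeric feature DataFrame.

# Sketch: correlation-based filtering followed by PCA (illustrative parameters)
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def reduce_features(X: pd.DataFrame, corr_threshold: float = 0.95, n_components: int = 20):
    # Drop one feature from every pair whose absolute correlation exceeds the threshold
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    X_filtered = X.drop(columns=to_drop)

    # Standardize, then project onto the leading principal components
    X_scaled = StandardScaler().fit_transform(X_filtered)
    X_reduced = PCA(n_components=n_components).fit_transform(X_scaled)
    return X_reduced, to_drop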
5.1.3 DETECTION ENGINE MODULE
The Detection Engine Module is the heart of the Intrusion Detection System, responsible
for analyzing the data processed by the Feature Extraction and Selection Module to identify
potential security threats. This module applies various detection methods, including signature-based
and anomaly-based detection, to flag unusual activities within the network.
Machine learning models can be used in this module to improve the detection rate and
reduce false positives by training on historical data labeled as either normal or malicious. The
engine also needs to work in real-time, providing timely detection to minimize the damage
caused by attacks. Alerts generated by this engine are passed to the alert and response module, which
takes action based on the severity of the threat.
The detection engine is critical to the IDS’s overall effectiveness, as it determines how well
the system can identify both known and novel security threats in real-time.
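A simplified sketch of how a trained classifier could sit inside the detection engine is given
below; the model and scaler file names, and the use of the CIC-IDS2017 'BENIGN' label, are stated
assumptions rather than the project's exact implementation.

# Sketch: scoring an incoming flow with a previously trained classifier
import joblib
import numpy as np

model = joblib.load('ids_model.pkl')     # hypothetical classifier trained offline
scaler = joblib.load('ids_scaler.pkl')   # hypothetical preprocessing fitted on training data

def score_flow(flow_features: np.ndarray) -> str:
    """Return 'ALERT' for traffic predicted as an attack, 'OK' otherwise."""
    x = scaler.transform(flow_features.reshape(1, -1))
    label = model.predict(x)[0]
    return 'OK' if label == 'BENIGN' else 'ALERT'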
5.1.4 ALERT AND RESPONSE MODULE
The Alert and Response Module is a vital component that handles the alerts generated by
the Detection Engine and initiates the necessary response actions. Once a potential intrusion is
detected, this module assesses the severity and context of the threat. Alerts are generated and
communicated to system administrators through various channels such as dashboards, emails, or
Security Information and Event Management (SIEM) systems.
This module provides real-time alerts, helping administrators take prompt action to
mitigate any security breaches. Depending on the severity of the detected intrusion, the response
may vary. For minor incidents, the system may simply notify administrators, while for more
serious threats, automated responses are triggered. These responses can include blocking an IP
address, isolating a compromised system, terminating a suspicious session, or activating forensic
logging to gather additional evidence.
The alert and response module also plays a crucial role in post-event analysis by
maintaining detailed logs of each incident, which can be reviewed during investigations or for
improving the system’s rules. This module can be integrated with other security tools such as
firewalls, antivirus systems, or patch management tools to ensure coordinated actions during a
security event. The success of the IDS largely depends on the timely and appropriate actions
taken by this module.
5.1.5 REPORTING AND ANALYSIS MODULE
The Reporting and Analysis Module is responsible for generating detailed reports on the
system’s security status, including detected threats, response actions taken, and potential
vulnerabilities. It provides both real-time and historical data analysis, allowing administrators to
understand trends and patterns in network security incidents.
Reports generated by this module are often used for compliance purposes, helping
organizations adhere to security standards and regulations. The module can also generate
actionable insights, such as recurring attack patterns or areas of vulnerability that need attention.
The reports typically contain metrics like the number of attacks detected, the type of attack, the
duration of the attack, and the system’s response.
The analysis tools within this module help in identifying areas for system improvement
by reviewing past intrusions and their effectiveness. Through trend analysis, it can also predict
future threats and help in making proactive changes to the network security infrastructure. The
Reporting and Analysis Module plays a significant role in making the IDS more adaptive to
emerging threats.
CHAPTER 6
SYSTEM IMPLEMENTATION
The core problem addressed in this project is the identification and classification of
intrusions in a network traffic environment using machine learning techniques.
Specifically, the challenge lies in accurately detecting both known and unknown attacks,
minimizing false positives, and handling high-dimensional data efficiently. To solve this, the
project uses the CIC-IDS2017 dataset, a comprehensive and realistic dataset that includes
various modern attack types such as DDoS, brute force, infiltration, and botnet activities.
The objective is to design a robust, efficient, and scalable intrusion detection system
capable of distinguishing between normal and malicious traffic, thereby enhancing the security
posture of modern networks.
6.2 IMPLEMENTATION
The implementation phase of this study was conducted using Google Colab due to its
accessibility and support for high-performance computing. To begin with, Google Drive was
mounted using the command drive.mount('/content/drive') to enable access to the CIC-IDS2017
dataset and to save output files such as trained models and visualizations. This integration
allowed for seamless loading and saving of data throughout the project.
After mounting, the CIC-IDS2017 dataset was loaded using Python’s pandas library. The
dataset includes network traffic data, both normal and malicious, across a range of attack categories
like DDoS, Botnet, and Brute Force. The dataset contains over 80 features, making it
comprehensive and suitable for building robust intrusion detection models.
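A sketch of this Colab setup is shown below; the Drive folder path is a placeholder, and the
glob pattern assumes the eight daily capture files are stored in a single directory.

# Sketch of the Colab setup (the Drive folder path is a placeholder)
from google.colab import drive
import glob
import pandas as pd

drive.mount('/content/drive')

# Read every daily CIC-IDS2017 capture file and combine them into one DataFrame
csv_files = glob.glob('/content/drive/MyDrive/CIC-IDS2017/*.pcap_ISCX.csv')
data = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)
print('Combined shape:', data.shape)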
Once loaded, the data was explored to understand its structure, identify missing values,
and check for data consistency. Preprocessing was a crucial step to prepare the dataset for
machine learning. Missing and infinite values were either removed or imputed appropriately.
Label encoding was applied to convert categorical values, such as attack types, into
numeric formats suitable for model training. Feature normalization was performed using Min-Max
scaling to ensure that all features contributed equally during training, avoiding bias due to differing
feature scales.
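The following sketch summarizes these preprocessing steps; it assumes the class column is named
'Label' (in the raw CIC-IDS2017 files the column name may carry a leading space).

# Preprocessing sketch: clean invalid values, encode labels, scale features
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Replace infinite values with NaN, then drop incomplete rows
data = data.replace([np.inf, -np.inf], np.nan).dropna()

# Encode the attack labels into integers
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(data['Label'])

# Min-Max scale the numeric feature columns to the [0, 1] range
X = data.drop(columns=['Label']).select_dtypes(include=np.number)
X_scaled = MinMaxScaler().fit_transform(X)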
Since the dataset was imbalanced, with benign samples far outnumbering malicious ones,
balancing techniques such as undersampling and SMOTE (Synthetic Minority Oversampling
Technique) were considered to ensure fair training across all classes. Principal Component
Analysis (PCA) was employed to reduce the feature space while retaining the most informative
attributes.
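A sketch of these two steps is given below; it assumes the imbalanced-learn package is
installed, and the component count is illustrative.

# Sketch: class rebalancing followed by incremental PCA (illustrative parameters)
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import IncrementalPCA

# Option A: randomly undersample the dominant benign class
X_bal, y_bal = RandomUnderSampler(random_state=0).fit_resample(X_scaled, y)
# Option B (alternative): oversample the minority attack classes with SMOTE
# X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_scaled, y)

# IncrementalPCA keeps memory use bounded on the large CIC-IDS2017 feature matrix
ipca = IncrementalPCA(n_components=20, batch_size=500)
X_reduced = ipca.fit_transform(X_bal)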
Correlation analysis and information gain methods were used to select features that had
the most impact on classification performance. These techniques not only enhanced model speed
but also helped in preventing overfitting.
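For example, information gain can be estimated with scikit-learn's mutual information score, as
in the sketch below (continuing from the preprocessing sketch above).

# Sketch: ranking features by mutual information (information gain) with the class label
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

mi_scores = mutual_info_classif(X_scaled, y, random_state=0)
mi_ranking = pd.Series(mi_scores, index=X.columns).sort_values(ascending=False)
print(mi_ranking.head(15))   # the most informative features for classification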
The classification models implemented included Logistic Regression (LR), Support Vector
Machine (SVM), Decision Tree (DT), Random Forest (RF), and K-Nearest Neighbors (KNN). The models were trained
using an 80-20 train-test split and evaluated using cross-validation to ensure robustness. The
objective was to compare these models in terms of accuracy, speed, and reliability in detecting
different types of intrusions.
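A condensed sketch of this training and cross-validation procedure is shown below; the
hyperparameter values are illustrative rather than the exact configurations used in the appendix.

# Sketch: 80-20 split, model training, and 5-fold cross-validation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y_bal, test_size=0.2, random_state=0, stratify=y_bal)

models = {
    'LR': LogisticRegression(max_iter=10000),
    'SVM': SVC(probability=True),
    'DT': DecisionTreeClassifier(max_depth=6),
    'RF': RandomForestClassifier(n_estimators=100),
    'KNN': KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f'{name}: mean CV accuracy = {cv_scores.mean():.3f}')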
The trained models were then evaluated using performance metrics such as accuracy,
precision, recall, F1-score, and ROC-AUC. Confusion matrices were used to analyze the true
positives, false positives, true negatives, and false negatives for each model.
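A sketch of this evaluation step for one of the trained models is shown below; the ROC-AUC
line assumes the binary (attack vs. normal) labelling, where the second probability column
corresponds to the attack class.

# Sketch: evaluation metrics and confusion matrix for one trained model
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_pred = models['LR'].predict(X_test)
print('Accuracy :', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred, average='weighted'))
print('Recall   :', recall_score(y_test, y_pred, average='weighted'))
print('F1-score :', f1_score(y_test, y_pred, average='weighted'))
print(confusion_matrix(y_test, y_pred))   # rows are true labels, columns are predictions

# ROC-AUC for the binary case uses the predicted probability of the attack class
y_prob = models['LR'].predict_proba(X_test)[:, 1]
print('ROC-AUC  :', roc_auc_score(y_test, y_prob))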
Visualization techniques such as bar charts and ROC curves were applied to provide an
intuitive understanding of each model’s strengths and limitations, allowing for effective
comparison.
Hyperparameter tuning was performed using methods such as Grid Search and Random Search
to fine-tune the models' parameters for improved accuracy and generalizability.
This optimized model serves as the final output of the implementation phase, ready for
deployment or integration into a real-time intrusion detection system.
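A sketch of such a search over a Random Forest is given below; the parameter grid and scoring
choice are illustrative assumptions.

# Sketch: Grid Search cross-validation for hyperparameter tuning
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [10, 20, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=5, scoring='f1_weighted', n_jobs=-1)
grid.fit(X_train, y_train)
print('Best parameters:', grid.best_params_)
print('Best CV score  :', grid.best_score_)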
CHAPTER 7
SYSTEM TESTING
Testing is a set of activities that can be planned in advance and conducted systematically.
For this reason, a template for software testing, that is, a set of steps into which specific
test case design techniques and testing methods can be placed, should be defined for the
software process. Testing often accounts for more effort than any other software engineering
activity. It would therefore seem reasonable to establish a systematic strategy for testing software.
7.2.1 UNIT TESTING
For unit testing of the intrusion detection system, the system is first modularized into
distinct components such as data preprocessing, feature selection, model prediction, and
evaluation. A Python testing framework such as `unittest` or `pytest` is then used to write
test cases for each component. A small, representative subset of the dataset or mock data keeps
the tests fast and focused. This structured testing approach ensures robustness and easier
debugging as the intrusion detection system is developed.
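A minimal pytest sketch is shown below; the preprocessing module and its clean_dataset function
are hypothetical names used only to illustrate the structure of such a test.

# test_preprocessing.py -- illustrative unit test (module and function names are assumptions)
import numpy as np
import pandas as pd
from preprocessing import clean_dataset   # hypothetical project module

def test_clean_dataset_removes_invalid_rows():
    raw = pd.DataFrame({'Flow Bytes/s': [1.0, np.inf, np.nan],
                        'Label': ['BENIGN', 'BENIGN', 'DDoS']})
    cleaned = clean_dataset(raw)
    # Rows containing infinite or missing values should be dropped
    assert len(cleaned) == 1
    assert np.isfinite(cleaned['Flow Bytes/s']).all()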
7.2.2 FUNCTIONAL TESTING
Functional testing of the intrusion detection system involves feeding the full dataset, or a
representative portion of it, through the complete pipeline, including data loading,
preprocessing, feature extraction, model inference, and alert generation, and verifying that
the system correctly identifies normal and malicious traffic based on expected outputs.
Functional tests should validate that the system meets its intended requirements, such as
detecting specific attack types, generating appropriate logs or alerts, and maintaining
performance under realistic data volumes.
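A functional test can be sketched at the level of the whole pipeline, as below; run_pipeline and
the sample file path are hypothetical placeholders for the project's actual entry point and test data.

# Illustrative end-to-end functional test (pipeline entry point is a hypothetical name)
from pipeline import run_pipeline

def test_pipeline_detects_known_attacks():
    # A small labelled sample containing both benign and DDoS flows
    report = run_pipeline('tests/data/sample_mixed_traffic.csv')
    assert report['accuracy'] >= 0.90
    assert 'DDoS' in report['detected_attack_types']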
7.2.3 INTEGRATION TESTING
Integration testing of the intrusion detection system verifies that the different modules of
the system, such as data ingestion, preprocessing, feature selection, model prediction, and
evaluation, work together as a cohesive unit. The goal is to catch issues that may not appear
in isolated unit tests, such as mismatched data formats, incorrect model input shapes, or
failures in sequential processing. Integration tests help confirm that the system behaves
correctly when modules interact, ensuring stability and correctness before full functional
deployment.
White box testing of the intrusion detection system assumes full access to the system's internal
logic, including its source code, algorithms, and data flow. This type of testing involves
designing test cases based on the internal structure of each module, such as the control flow in
data preprocessing, the logic in feature selection, and the conditions in the classification
model, to verify that all paths, branches, and conditions function correctly. Targeted inputs can
be created from the dataset to trigger specific parts of the code, ensuring that the system
handles edge cases like missing or anomalous data and applies normalization correctly.
Black box testing treats the intrusion detection system as a closed unit without knowledge of
its internal workings, focusing only on inputs and expected outputs. Various traffic samples from
the dataset, including both normal and malicious behaviors such as DoS, PortScan, and brute-force
attacks, are supplied as input, and the tester observes whether the system correctly identifies
and classifies them without examining how the detection is implemented. This testing approach
helps verify that the system meets functional requirements, such as accuracy and alert
generation, and can handle different attack scenarios realistically.
7.4 ACCEPTANCE TESTING
Acceptance testing is the final phase of software testing that ensures the system meets the
requirements and expectations of the end users or stakeholders. For the Intrusion Detection
System project using the CIC-IDS2017 dataset, acceptance testing was conducted to validate that
the system performs effectively in identifying and classifying network intrusions as per the
defined objectives. This involves running the full dataset, or a representative portion of it,
through the system to ensure that it accurately detects known attacks and normal traffic, meets
performance benchmarks (such as detection rate, false positive rate, and latency), and integrates
properly with real-world network environments. Acceptance testing
focuses on verifying that the system behaves correctly from a user or organizational perspective,
confirming that it fulfills all functional, reliability, and usability criteria.
7.4.1 OBJECTIVE
The following criteria were established to determine the successful acceptance of the system:
Performance Metrics
The system should achieve at least 90% accuracy for binary classification and
acceptable performance for multi-class classification. Precision, recall, and F1-score
should also be within expected thresholds for various attack types.
System Integration
All modules (data processing, model training, evaluation, and output) must work together
without failures or crashes.
User Feedback
The output results (e.g., confusion matrix, classification report) must be interpretable
and informative for cybersecurity professionals.
• Each output was reviewed to ensure accurate detection and classification of attacks.
• The interface and response times were evaluated from a usability perspective.
• Make sure all attack types, edge cases, and performance scenarios are tested.
• Compare system output with expected results for attack detection and traffic classification.
CHAPTER 8
8.1 CONCLUSION
The project also addressed attack classification by encoding the attack types and
mapping them to numerical labels, facilitating the application of machine learning algorithms
for attack detection. The encoding process, followed by feature scaling and normalization,
prepared the data for modeling, ensuring that the models would not be biased by extreme
values or irrelevant features. The project has provided a comprehensive framework for
analyzing network traffic data for intrusion detection. The results underscore the importance
of data preprocessing and feature engineering in machine learning workflows, particularly for
cybersecurity applications. The findings from the exploratory analysis, correlation studies,
and outlier investigations will directly inform the modeling phase, where machine learning
algorithms will be applied to predict attacks. This process not only enhances the accuracy of
the model but also improves its ability to detect and respond to network security threats
effectively. The next steps in this project will involve the application of classification
algorithms to build an intrusion detection model that can accurately identify both known and
unknown attack types, contributing to a robust network security system.
CHAPTER 9
FUTURE ENHANCEMENT
If a single model is to be retained, it must be chosen according to the deployment
requirements. Future work on intrusion detection using the CIC-IDS2017 dataset can focus on
enhancing detection accuracy, reducing false positives, and improving adaptability to evolving
attack patterns. One
promising direction is the application of advanced deep learning models, such as recurrent neural
networks (RNNs), transformers, or graph neural networks, to better capture temporal and
relational patterns in network traffic. It can focus on optimizing model performance through
advanced machine learning and deep learning approaches. This includes implementing
automated hyperparameter optimization techniques such as Bayesian optimization, grid
search, or genetic algorithms to fine-tune classifiers like Random Forest, XGBoost, or deep
neural networks. Enhanced prediction techniques such as ensemble learning, stacking models,
or attention-based deep architectures can be explored to improve classification accuracy and
robustness across diverse attack types. Additionally, incorporating techniques for feature
selection and dimensionality reduction can help reduce noise and computational cost while
preserving predictive power. It may also involve testing the generalizability of the proposed
models on unseen or evolving threats, integrating real-time detection capabilities, and
benchmarking against other datasets to validate model effectiveness in dynamic network
environments.
APPENDIX 1
SOURCE CODE
# drive.mount('/content/drive')  # used when running on Google Colab
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')
data1 = pd.read_csv("C:/Users/Narayanan/Desktop/ans technological solution/ANS Tech/network baced attack/Monday-WorkingHours.pcap_ISCX.csv")
data8 = pd.read_csv("C:/Users/Narayanan/Desktop/ans technological solution/ANS Tech/network baced attack/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv")
# data_list collects the eight daily DataFrames (data1 ... data8); the intermediate
# read_csv calls are omitted from this listing
data = pd.concat(data_list)
print('New dimension:', data.shape)
data.columns
data.info()
pd.options.display.max_rows = 80
print('Overview of Columns:')
data.describe().transpose()
pd.options.display.max_columns = 80
data
dups = data[data.duplicated()]
data.drop_duplicates(inplace = True)
data.shape
missing_val = data.isna().sum()
numeric_cols = data.select_dtypes(include=np.number).columns  # numeric feature columns
inf_count = np.isinf(data[numeric_cols]).sum()
sns.set_palette('pastel')
colors = sns.color_palette()
ax.set_ylabel('Non-Null Value Count', fontsize = 12)
plt.show()
plt.show()
colors = sns.color_palette('Blues')
plt.xlabel('Flow Bytes/s')
plt.ylabel('Frequency')
plt.show()
plt.show()
plt.title('Histogram of Flow Bytes/s')
plt.ylabel('Frequency')
plt.show()
med_flow_bytes = data['Flow Bytes/s'].median()
outlier_counts = {}
# IQR-based outlier counting per numeric feature (loop body partially omitted in this listing)
for i in numeric_data:
    q1, q3 = data[i].quantile(0.25), data[i].quantile(0.75)
    iqr = q3 - q1
for i in numeric_data:
    print(f'Feature: {i}')
for i in numeric_data:
if outlier_percent > 20:
ax.set_xlabel('Feature-Attack Type')
ax.set_ylabel('Percentage of Outliers')
ax.set_title('Outlier Analysis')
plt.xticks(rotation = 90)
plt.show()
plt.title('Types of attacks')
plt.xlabel('Attack Type')
data.groupby('Attack Type').first()
# 3. Data Preprocessing
col_type = data[col].dtype
if col_type != object:
c_min = data[col].min()
c_max = data[col].max()
    # Downcasting float64 to float32
    data[col] = data[col].astype(np.float32)
elif str(col_type).find('int') >= 0 and c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
    data[col] = data[col].astype(np.int32)
data.info()
data.describe().transpose()
data.columns
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
size = len(features.columns) // 2
from sklearn.decomposition import IncrementalPCA
ipca = IncrementalPCA(n_components = size, batch_size = 500)
# partial_fit is called on successive batches of the scaled feature matrix
ipca.partial_fit(batch)
transformed_features = ipca.transform(scaled_features)
new_data
print(bc_data['Attack Type'].value_counts())
axis = 1)
from sklearn.linear_model import LogisticRegression
lr1 = LogisticRegression(max_iter = 10000, C = 0.1, random_state = 0, solver = 'saga')
lr1.fit(X_train_bc, y_train_bc)
lr2.fit(X_train_bc, y_train_bc)
svm1.fit(X_train_bc, y_train_bc)
print('\nCross-validation scores:', ', '.join(map(str, cv_svm1)))
svm2.fit(X_train_bc, y_train_bc)
rf1.fit(X_train, y_train)
rf2.fit(X_train, y_train)
print('\nCross-validation scores:', ', '.join(map(str, cv_rf1)))  # cv_rf1: cross-validation scores for rf1
dt1 = DecisionTreeClassifier(max_depth = 6)
dt1.fit(X_train, y_train)
dt2 = DecisionTreeClassifier(max_depth = 8)
dt2.fit(X_train, y_train)
knn1.fit(X_train, y_train)
print(f'\nMean cross-validation score: {cv_knn1.mean():.2f}')
knn2 = KNeighborsClassifier(n_neighbors = 8)
knn2.fit(X_train, y_train)
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc, precision_recall_curve
y_pred_lr1 = lr1.predict(X_test_bc)
y_pred_lr2 = lr2.predict(X_test_bc)
axs[0].set_title('Model 1')
axs[1].set_title('Model 2')
axs[0].set_xlabel('Predicted label')
axs[0].set_ylabel('True label')
axs[1].set_xlabel('Predicted label')
plt.show()
y_prob_lr1 = lr1.predict_proba(X_test_bc)[:,1]
y_prob_lr2 = lr2.predict_proba(X_test_bc)[:,1]
plt.tight_layout()
plt.show()
axs[0].set_xlabel('Recall')
axs[0].set_ylabel('Precision')
axs[1].set_xlabel('Recall')
axs[1].set_ylabel('Precision')
axs[2].plot(recall2, precision2, color = colors[2], label = 'Model 2')
plt.show()
fig, ax = plt.subplots(figsize = (9, 3))
ax.set_xlim([0, 1])
ax.set_xlabel('Accuracy Score')
for i, v in enumerate(scores):
plt.show()
ax.set_xlim([0, 1])
for i, v in enumerate(scores):
plt.show()
ax.set_title('Support Vector Machine Model Comparison (Cross Validation)')
for i, v in enumerate(scores):
plt.show()
axs[0].set_title('Logistic Regression')
axs[0].set_xlabel('Predicted label')
axs[0].set_ylabel('True label')
axs[1].set_xlabel('Predicted label')
plt.show()
axes[0].plot(fpr1, tpr1, label = f'ROC curve (area = {roc_auc1:.2%})', color = colors[1])
axes[0].set_xlim([-0.05, 1.0])
axes[2].set_title('LR vs SVM')
plt.tight_layout()
plt.show()
axs[0].set_xlabel('Recall')
axs[0].set_ylabel('Precision')
axs[2].set_title('LR vs SVM')
plt.tight_layout()
plt.show()
target_names = svm2.classes_
precision1 = [metrics1[target_name]['precision'] for target_name in target_names]
plt.show()
ax.barh(labels, scores, color = palette)
ax.set_xlim([0, 1])
ax.set_xlabel('Accuracy Score')
for i, v in enumerate(scores):
plt.show()
for i, v in enumerate(scores):
plt.show()
y_pred_rf1 = rf1.predict(X_test)
y_pred_rf2 = rf2.predict(X_test)
fig, axs = plt.subplots(1, 2, figsize = (16, 7))
axs[0].set_title('Model 1')
axs[1].set_title('Model 2')
axs[0].set_xlabel('Predicted label')
axs[0].set_ylabel('True label')
axs[1].set_xlabel('Predicted label')
fig.tight_layout()
plt.show()
target_names = rf1.classes_
rows = ['Precision', 'Recall', 'F1-score']
fig.tight_layout()
plt.show()
plt.show()
ax.set_xlim([0, 1])
for i, v in enumerate(scores):
ax.text(v + 0.01, i, str(round(v, 3)), ha = 'left', va = 'center')
plt.show()
y_pred_dt1 = dt1.predict(X_test)
y_pred_dt2 = dt2.predict(X_test)
sns.heatmap(data2, cmap = 'Pastel1', annot = True, fmt = '.2f', xticklabels = target_names,
yticklabels = rows, ax = axs[1])
fig.tight_layout()
plt.show()
ax.set_xlim([0, 1])
ax.set_xlabel('Accuracy Score')
for i, v in enumerate(scores):
plt.show()
plt.show()
target_names = knn1.classes_
metrics1 = classification_report(y_true = y_test, y_pred = y_pred_knn1, target_names =
target_names, output_dict = True)
fig.tight_layout()
plt.show()
ax.barh(labels, scores, color = palette)
ax.set_xlim([0, 1])
ax.set_xlabel('Accuracy Score')
for i, v in enumerate(scores):
plt.show()
for i, v in enumerate(scores):
datas = []
axs[1].set_title('Classification Report (Decision Trees)')
fig.tight_layout()
plt.show()
axs[0].set_xlabel('Predicted label')
axs[1].set_xlabel('Predicted label')
axs[2].set_xlabel('Predicted label')
axs[0].set_ylabel('True label')
fig.tight_layout()
plt.show()
APPENDIX 2
SCREENSHOTS
CROSS VALIDATION
PREDICT TABLE
ROC CURVE MODEL
PRECISION-RECALL CURVE MODEL
REFERENCES
LIST OF PAPERS
[1] Shone, N., Ngoc, S. M., & Loo, J. (2018). A deep learning approach to network
intrusion detection. Proceedings of the International Conference on Neural Networks.
[2] Moustafa, N., & Slay, J. (2015). The evaluation of network anomaly detection systems:
Statistical analysis and benchmarking. Information Security Journal: A Global
Perspective, 24(1), 1-18.
[3] Xia, F., Wang, S., & Yang, L. (2015). A deep learning approach for intrusion detection
in industrial control systems. IEEE Transactions on Industrial Informatics, 11(3), 698-
707.
[4] Alazab, M., Tang, M., & Soni, S. (2020). An overview of machine learning-based intrusion
detection systems. Journal of Network and Computer Applications, 124, 34-43.
[5] Sadiq, M., Zaheer, S., & Jan, Z. (2020). A survey on intrusion detection systems in the
era of big data and machine learning. Future Generation Computer Systems, 112, 314-
326.
[6] Hussain, Z., & Hamid, S. (2019). Intrusion detection systems: A comprehensive survey
and future directions. Computer Networks, 149, 29-53.
[7] Sikder, M. K., & Sharmila, D. (2021). A survey of machine learning techniques for
intrusion detection systems: Challenges and future directions. Journal of King Saud
University-Computer and Information Sciences.
[8] Stoll, C. (1989). The Cuckoo's Egg: Tracking a Spy Through the Maze of Computer
Espionage. An early and influential account of a real-world intrusion detection case.
[9] Heberlein, T., & Mukherjee, B. (1990). Network Intrusion Detection. Introduced one of
the first IDS prototypes at Lawrence Livermore National Lab.
[10] Northcutt, S. (1998). Crackers Knock, Don't Get In. Discusses the evolution and
challenges of IDS in the late 1990s.
[11] Butun, I., Morgera, S. D., & Sankar, R. (2014). A survey of intrusion detection systems
in wireless sensor networks. IEEE Communications Surveys & Tutorials, 16(1), 266–
282.
[12] Nguyen, T. T., & Armitage, G. (2008). A survey of techniques for internet traffic
classification using machine learning. IEEE Communications Surveys & Tutorials, 10(4),
56–76.
[13] Nehinbe, J. O. (2009). A simple method for improving intrusion detections in corporate
networks. In: International conference on information security and digital forensics.
Springer, pp 111–122.
[14] Nehinbe, J. O. (2011). A critical evaluation of datasets for investigating IDSs and IPSs
researches. In: 2011 IEEE 10th international conference on cybernetic intelligent
systems (CIS). IEEE, pp 92–97.
[15] Ngueajio, M. K., Washington, G., Rawat, D. B., & Ngueabou, Y. (2022). Intrusion
Detection Systems Using Support Vector Machines on the KDDCUP'99 and NSL-KDD
Datasets: A Comprehensive Survey.
[16] Xu, Z., Wu, Y., Wang, S., Gao, J., Qiu, T., Wang, Z., Wan, H., & Zhao, X. (2025). Deep
Learning-based Intrusion Detection Systems: A Survey.
[17] Gueriani, A., Kheddar, H., & Mazari, A. C. (2024). Deep Reinforcement Learning for
Intrusion Detection in IoT: A Survey.
[18] Hoque, M. S., Mukit, M. A., & Bikas, M. A. N. (2012). An Implementation of Intrusion
Detection System Using Genetic Algorithm.
[21] Pradhan, R. (2020). Decision tree based classifications on CICIDS 2017 dataset for the
identification of DDoS, Botnet, and web attack. pp. 20–26.
[23] Ashiku, L., & Dagli, C. (2020). Network intrusion detection system using deep learning.
Procedia Computer Science, 170, 234–239.
[24] Ahmim, A., Ghoualmi-Zine, N., & Mellouk, A. (2020). A novel hierarchical intrusion
detection system based on decision tree and rules-based models. Computers & Security,
92.
[25] Soltani, M., Salahshour, M. A., & Dehghantanha, A. (2020). A content-based deep
intrusion detection system. arXiv:2001.05009.
[26] Pradhan, R. (2020). Decision tree-based classifications on CICIDS 2017 dataset for
identifying DDoS and botnet attacks. 18(6), 20–26.
[27] Zhang, Y., & Ran, X. (2022). A step-based deep learning approach for intrusion detection.
International Journal of Intelligent Systems and Applications, 14(2).
[28] Ashiku, L., & Dagli, C. (2021). Deep learning for network intrusion detection. Procedia
Computer Science, 170, 234–239.
[29] Camacho, J., Pérez-Villegas, A., García-Teodoro, P., & Maciá-Fernández, G.(2016).
PCA- based multivariate statistical network monitoring for anomaly detection.
Computers & Security, 59, 118–137.
[30] Nisioti, A., Mylonas, A., Yoo, P. D., & Katos, V. (2018). From intrusion detection to
attacker attribution: a comprehensive survey of unsupervised methods. IEEE
Communications Surveys & Tutorials, 20(4), 3369–3388.