Final Report 1
SUBMITTED BY
CHETHAN K
1JS21AI009
DEPARTMENT OF AI & ML
JSS ACADEMY OF TECHNICAL EDUCATION
JSS Campus, Dr. Vishnuvardhan Road, Bengaluru-560060
2024 - 2025
JSS ACADEMY OF TECHNICAL EDUCATION
JSS Campus, Dr. Vishnuvardhan Road, Bengaluru-560060
DEPARTMENT OF ARTIFICIAL INTELLIGENCE
& MACHINE LEARNING
CERTIFICATE
This is to certify that the Technical Seminar entitled “Data Fingerprinting and
Visualization for AI-Enhanced Cyber-Defence Systems” carried out by CHETHAN K
bearing USN 1JS21AI009, a bonafide student of JSS ACADEMY OF
TECHNICAL EDUCATION, BENGALURU, in partial fulfilment of the 7th Semester
Technical Seminar (21AI81) in AI&ML of the Visvesvaraya Technological University,
Belagavi during the academic year 2024-25. It is certified that all corrections/suggestions
indicated for Internal Assessment have been incorporated in the report deposited in the
departmental library. The Technical Seminar report has been approved as it satisfies the
academic requirements in respect of Technical Seminar work prescribed for the said degree.
ACKNOWLEDGEMENT
I express my sincere thanks to our beloved principal, Dr. Bhimasen Soragaon, for having
supported me in my academic endeavors.
I am thankful for the resourceful guidance, timely assistance, and gracious support of my guide,
Ms. Sneha Y S, Assistant Professor, Department of CSE(AI&ML), who helped me in every
aspect of my Technical Seminar work.
Last, but not least, I am pleased to express my heartfelt thanks to all the teaching and
non-teaching staff of the Department of CSE(AI&ML) and to my friends who have rendered their
help, motivation, and support.
CHETHAN K
1JS21AI009
ABSTRACT
Artificial intelligence (AI)-assisted cyber-attacks have become a significant threat across all
stages of the cyber-defense lifecycle, leveraging advanced AI tools to outsmart traditional
defenses. In the reconnaissance phase, for example, AI-powered tools like MalGAN exploit
vulnerabilities in cyber-defense systems, making attacks more efficient and difficult to detect.
Existing countermeasures often fail against such AI-enhanced threats due to their inability to
recognize the sophisticated and dynamic attack patterns employed by these tools. This
highlights the urgent need for innovative approaches to bolster cyber-defense strategies and
counter AI-driven threats effectively.
To address this challenge, the paper proposes using data fingerprinting and visualization as
foundational tools within the AIECDS (AI-Enhanced Cyber-defense Systems) methodology.
Data fingerprinting involves creating unique identifiers for datasets, enabling precise tracking
and identification of malicious activities. Combined with visualization techniques, these
fingerprints transform complex data patterns into intuitive, graphical insights that simplify the
differentiation between benign and malicious behaviors. This approach not only reduces the
complexity of decision boundaries in machine-learning models but also enhances their
detection efficiency, even for malicious threats with minimal data samples. A practical use case
validated the effectiveness of this approach by demonstrating how fingerprinted network
sessions enable clear visual discrimination of benign and malicious events, underscoring its
potential to strengthen cybersecurity defenses against AI-driven attacks.
TABLE OF CONTENTS
ABSTRACT
1 INTRODUCTION
2 LITERATURE SURVEY
3 RELATED WORKS
4 THE AIECDS METHODOLOGY
5 EXPERIMENTAL RESULTS
8 REFERENCES
TABLE OF FIGURES
CHAPTER 1
INTRODUCTION
Defending cyber assets against increasingly sophisticated cyber-attacks is one of the most pressing
challenges of the 21st century. Global events like the COVID-19 pandemic and the Russian
invasion of Ukraine have provided threat actors with unprecedented opportunities to escalate
cyber-attacks. These events have highlighted key defensive challenges, including the growing
complexity of attacks, the prevalence of unprotected data, inadequate cybersecurity practices, and
an over-reliance on vulnerability management as the main defense strategy. Such reliance often
leaves cyber-defense perimeters vulnerable to exploitation. Simultaneously, attackers have
leveraged Artificial Intelligence (AI) to create more potent and adaptive attack strategies. These
AI-enhanced cyber-attacks exploit critical vulnerabilities more rapidly, leaving defenders with
minimal time to patch systems effectively. For example, zero-day attacks, which exploit unknown
vulnerabilities, nearly doubled by 2021, showcasing the rising sophistication of cyber threats.
Despite advances in cyber-defense systems, their effectiveness has not kept pace with the evolving
threat landscape. A study conducted by the University of New South Wales (UNSW) and the
Australian Cyber Security Centre (ACSC) in 2015 inadvertently exposed the limitations of state-
of-the-art defensive technologies. Many existing systems rely on machine learning (ML) models,
particularly anomaly detection techniques. However, these systems are vulnerable to adversarial
attacks, where threat actors manipulate data to confuse models during training or inference. For
instance, Generative Adversarial Networks (GANs) are frequently used by attackers to generate
malicious inputs that evade detection. These AI-enhanced techniques enable cybercriminals to
conduct sophisticated attacks that are increasingly difficult to detect and mitigate. A major factor
contributing to the ineffectiveness of these ML-based defenses is the reliance on laboratory-created
datasets rather than real-world or real-time data. Such datasets often fail to capture the complexity
and variability of real-world cyber threats, limiting the generalizability of these systems to
practical scenarios.
The "data problem" is a significant barrier to the development of effective AI-enhanced cyber-
defense solutions. High-quality, real-world datasets are essential for training ML models capable
of detecting modern threats. However, the acquisition of such data faces several challenges,
including the lack of access to large-scale datasets, the sensitive nature of cybersecurity
information, and concerns over privacy, security, and confidentiality. Most ML-based
cybersecurity studies have been conducted in controlled environments, without real-time testing
or exposure to live attack scenarios. This lack of real-time data undermines the ability of cyber-
defense systems to perform effectively in operational settings. Research emphasizes that real-
world and real-time data are critical for building robust models capable of countering sophisticated
threats.
One promising solution lies in the use of visualized data for training ML models. Visualizations
simplify complex, multimodal datasets by representing them in a more interpretable format,
making it easier for AI models to learn and detect patterns. This approach has been shown to
enhance model accuracy and efficiency, especially in cybersecurity applications. Visualized
representations also provide an opportunity to extract meaningful insights from real-world threat
environments, such as network traffic logs, which can reveal hidden patterns of malicious activity.
By reducing data complexity and improving interpretability, visualization bridges the gap between
raw data and actionable intelligence for both machines and human analysts.
Building on this concept, the study introduces a methodology for developing AI-enhanced cyber-
defense tools that incorporates data fingerprinting and data visualization as core components.
Data fingerprinting creates unique identifiers for datasets, allowing precise tracking and
differentiation between benign and malicious activities. When combined with visualization, these
fingerprints enable defenders to identify threats more effectively while simplifying the decision-
making process for machine-learning models. This methodology addresses key aspects of the data
problem by offering an innovative way to use limited datasets more efficiently and improve
detection capabilities, even for subtle or emerging threats.
The proposed methodology was validated through a use case focusing on the discovery of cyber
threats using fingerprinted network sessions.
CHAPTER 2
LITERATURE SURVEY
Data Fingerprinting and Visualization for AI-Enhanced Cyber-Defence Systems requires a study
of current research and advancements within the domains of data fingerprinting, data visualization,
and their respective roles in augmenting cybersecurity via artificial intelligence (AI). An organized
summary of relevant literature is presented below, emphasizing significant themes, methodologies,
results, and existing gaps within the prevailing body of research.
- Data Visualization: Visualization techniques help simplify complex datasets, making it easier for
analysts and machine learning models to interpret and understand data. Effective visualization can
enhance the detection of anomalies and patterns that may be indicative of cyber threats.
- AI-Accelerated Cyber-Attacks: Research indicates that attackers are increasingly using AI to
automate their attacks and make them more sophisticated and difficult to identify. Tools like
MalGAN and DeepLocker are clear examples of how AI is being used to bypass conventional
security measures.
- Need for Sophisticated Defensive Strategies: The use of artificial intelligence in cyber threats
calls for the development of sophisticated defensive strategies that can adapt to these evolving
tactics. Signature-based detection techniques often fail to cope with AI-enhanced attacks.
- Machine Learning Models: Many of today's cyber-defense systems use machine learning (ML)
models, most notably based on anomaly detection techniques. These models, however, frequently
suffer from incorrect assumptions built in during the training process and thus are exploitable.
- Limitations of Current Systems: Studies show that many machine learning-based cybersecurity
solutions have not been tested in real-time environments, which is a critical test for determining
their effectiveness. The lack of high-quality, real-world data hinders the development of robust
cyber-defense systems.
- Methodologies: A variety of methodologies for data fingerprinting have been proposed, focusing
on the encoding and extraction of meaningful data from network traffic. The purpose of these
methods is to create distinctive representations of data that are usable for effective threat detection.
In the representation of complex data, these techniques have shown they can ease interpretability
for expansive and multi-modal datasets, thus possibly smoothing the learning task for the AI
models concerned.
- Practical Uses: The use of visualization in cyber-defense systems shows great promise for
increasing detection rates and decreasing response times to threats. Visual information helps
analysts identify anomalies quickly, thereby improving the effectiveness of incident management.
- Real-World Data Availability: A notable deficiency in the existing literature pertains to the
insufficient availability of extensive, real-world attack datasets that are essential for the training
and evaluation of machine learning models. This constraint has implications for the practical
implementation of numerous suggested solutions.
- Utilization of Real-Time Data: The literature highlights the importance of real-time data in the
training and evaluation of cyber-defense systems for making them more effective. Future studies
should be directed towards approaches for collecting and utilizing real-world data while
addressing privacy and security concerns.
- Novel AI Methodologies: The study of new AI techniques, such as deep learning and
reinforcement learning, as well as data fingerprinting and visualization can potentially result in
more powerful cyber-defense strategies that adapt well to changing threats.
CHAPTER 3
RELATED WORKS
Current cyber-defense research outputs that are important for the research at hand include the
following.
The cyber-defense lifecycle refers to the various phases through which a cyber-attacker must
progress to successfully infiltrate an organization’s network and achieve their malicious goals. The
lifecycle is generally composed of several critical stages, including Reconnaissance,
Weaponization, Delivery, Exploitation, Installation, Command & Control, and Actions on
Objectives. Each phase corresponds to a different set of activities that attackers must complete,
and understanding these phases is vital for developing effective countermeasures in cyber-defense
systems. By analyzing and mitigating each phase of the lifecycle, cybersecurity professionals can
identify weaknesses in a system and proactively protect against potential threats. For the purpose
of developing AI-enhanced cyber-defense systems, it's crucial to focus on these phases, especially
the Reconnaissance phase, where attackers gather information about the target to plan their attack.
During reconnaissance, attackers collect intelligence such as vulnerabilities, network
configurations, and possible entry points, which lays the groundwork for the next phases of the
attack.
Among the phases of the cyber-defense lifecycle, Reconnaissance is of particular interest for
researchers looking to enhance cybersecurity through AI-driven solutions. This is because the
reconnaissance phase is where attackers first identify and gather information about their target. It
is during this stage that AI-enhanced tools, such as MalGAN, can be deployed to execute
sophisticated attacks. MalGAN is a type of Generative Adversarial Network (GAN), a machine
learning model that has been trained to generate adversarial malware. This malware is designed to
evade traditional malware detection methods, often referred to as black-box detectors, by hiding
or disguising its true nature. These detectors, which typically look for known patterns or signatures
of malicious activity, are ineffective against malware that can mutate and disguise itself as
legitimate files. MalGAN’s ability to create such malware that bypasses traditional security
measures exemplifies how attackers are utilizing AI-enhanced tools to circumvent existing cyber-
defense systems during the reconnaissance phase, making detection and mitigation much more
challenging for defenders.
Another example of an AI-enhanced tool used during the reconnaissance phase is DeepLocker,
an advanced form of malware that uses AI to conceal its malicious payload. DeepLocker’s
approach involves payload mutation, which means that the malware remains dormant and hidden
until triggered by a specific condition, such as a particular face recognition or geolocation trigger.
This makes it especially dangerous because the malware remains undetected until the right
circumstances allow it to activate, making it much harder to spot during the initial reconnaissance
phase. DeepLocker is a prime example of how AI can be used not only to conceal the presence of
malware but also to enhance the effectiveness of the payload by adapting its behavior based on
specific environmental triggers. Such techniques have revolutionized the way adversaries
approach the Command & Control phase of the cyber-defense lifecycle. In this phase, attackers
typically establish a channel to control and manipulate the compromised system remotely. AI-
enhanced malware like DeepLocker, which can evade detection and only activate under very
specific conditions, makes the Command & Control phase much harder for defenders to identify
and disrupt.
Understanding these AI-enhanced techniques can help reduce the complexity of detecting and
mitigating such attacks, enabling a more proactive defense against AI-driven threats in the
cyber-defense lifecycle.
Compounding the issue of outdated datasets is the challenge of extracting meaningful information
from real-world systems, particularly in complex environments like computer networks. Studies
have shown that many network-based cyber-defense systems are prototyped using simplified
calculated features derived from data telemetry and average statistical measurements. While these
simplified features make it easier to model and understand the data, they often lead to higher
inference sensitivity and longer delays in classifying cyber-attacks. These delays are particularly
problematic in real-time environments, where the ability to detect and respond to attacks promptly
is crucial. As a result, despite the evolution of ML-based countermeasures from feature-
engineering-based models to more advanced deep learning models, challenges remain in detecting
threats with small malicious sample datasets. The performance improvements seen with deep
learning models are incremental, and they have not yet sufficiently addressed the issue of detecting
rare or minuscule malicious samples. This is especially true when it comes to real-time detection,
where systems tend to perform worse in live environments compared to laboratory conditions.
This disparity suggests that many ML-based IDS models are not yet optimized for handling the
complexities of real-time, evolving cyber threats.
The difficulty in detecting cyber-attacks with minute sample datasets, as well as the challenges
posed by real-time timing differences, points to the limitations of traditional ML-based IDS
systems. These limitations are further exacerbated by the fact that most research has been
conducted on outdated datasets, without accounting for the latest threats. Similar concerns have
been raised in the field of malware detection, where no ML-based MDS approach has been able to
detect all types of malware effectively. Although advancements have been made in various
detection techniques such as signature-based, behavior-based, and deep learning approaches, none
have proven universally successful. A notable exception is the work by Malialis, who developed
a Reinforcement Learning (RL) model for real-time network intrusion detection and response. This
approach used real-time network packet data and employed a distributed RL defense system to
mitigate Distributed Denial-of-Service (DDoS) attacks. The system involved a network of routers
configured as RL agents, which were trained to limit the flow of DDoS traffic through the network.
This real-time approach demonstrated the potential of RL in improving the efficiency of cyber-
defense systems, but it remains a rare and specialized solution in the broader context of
cybersecurity research.
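To make the idea of distributed RL-based throttling more concrete, the sketch below shows a toy Q-learning router agent that chooses how much incoming traffic to drop based on its observed load. This is only an illustration of the general technique, not Malialis's implementation; the state buckets, action set, reward shaping, and hyperparameters are all assumed.

```python
# Illustrative sketch of a router-throttling RL agent in the spirit of the
# distributed DDoS-mitigation approach described above. NOT the original
# implementation; states, actions, and reward shaping are assumptions.
import random
from collections import defaultdict

ACTIONS = [0.0, 0.2, 0.4, 0.6, 0.8]          # fraction of incoming traffic to drop
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1        # Q-learning hyperparameters (assumed)

q_table = defaultdict(lambda: [0.0] * len(ACTIONS))

def bucket_load(load_mbps, capacity_mbps=100.0):
    """Discretize the router's observed load into a coarse state index."""
    ratio = min(load_mbps / capacity_mbps, 2.0)
    return int(ratio * 5)                    # states 0..10

def choose_action(state):
    """Epsilon-greedy choice of a throttle level for the current load state."""
    if random.random() < EPSILON:
        return random.randrange(len(ACTIONS))
    values = q_table[state]
    return values.index(max(values))

def reward(legit_kept_mbps, total_kept_mbps, capacity_mbps=100.0):
    """Reward legitimate traffic that gets through; penalize exceeding capacity."""
    overload_penalty = max(0.0, total_kept_mbps - capacity_mbps)
    return legit_kept_mbps - overload_penalty

def update(state, action, r, next_state):
    """Standard tabular Q-learning update."""
    best_next = max(q_table[next_state])
    q_table[state][action] += ALPHA * (r + GAMMA * best_next - q_table[state][action])
```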
Concerns surrounding adversarial machine learning in the context of cyber warfare have also been
highlighted. Researchers have pointed out that faulty assumptions during the training of ML
models—such as the assumption that data can be linearly separated—can introduce vulnerabilities
into cyber-defense systems. Adversarial AI, which manipulates models during inference to cause
them to misclassify data, has become a significant challenge. This highlights the need for more
robust methodologies that can address vulnerabilities in ML models and improve the ability of
cyber-defense systems to detect sophisticated cyber-attacks. Additionally, the increasing
sophistication of cyber-attacks, as reported in recent studies, underscores the importance of
developing countermeasures that can rapidly adapt to emerging threats. As cyber actors continue
to refine their tactics and increase the scale and speed of their attacks, the need for dynamic, real-
time countermeasures becomes even more pressing.
Given these challenges, researchers have identified several key requirements for developing
effective AI-enhanced cyber-defense systems. First, these systems must improve detection rates
and reduce detection times to ensure that attacks are identified and mitigated as quickly as possible.
Additionally, employing dynamic self-learning techniques, such as Reinforcement Learning
(RL), can enable systems to adapt to evolving threats in real time. A major focus must also be
placed on detecting adversarial and unknown cyber-attacks, which current systems struggle to
identify. To achieve this, AI-enhanced countermeasures must be trained using real-world
environments and real-time data, as opposed to outdated or simplified datasets. One critical aspect
of this approach is data fingerprinting, which involves extracting and encoding meaningful data
from real-world systems to create more accurate representations of potential threats. Data
fingerprinting can help systems distinguish between benign and malicious activities more
effectively. Moreover, data visualization plays a key role in overcoming the complexity of multi-
modal, threat-related data, enabling systems to handle large and diverse datasets more efficiently.
By addressing these challenges and implementing these requirements, AI-enhanced cyber-defense
systems can become more robust, adaptive, and capable of protecting against both known and
emerging cyber threats.
CHAPTER 4
THE AIECDS METHODOLOGY
A good example of this methodology in action is the use of the UNSW-15 dataset, which is widely
employed in cybersecurity research. The UNSW-15 dataset is derived from the University of New
South Wales (UNSW) network traffic, including both benign and malicious activities. It contains
around 100 GB of network data, including 82 million network packets in the training set alone,
making it a substantial source of real-time network data. This dataset includes a variety of attack
types such as Denial of Service (DoS) attacks, worms, backdoors, fuzzers, and zero-day attacks,
providing a diverse set of threat scenarios. Furthermore, the dataset is labeled with specific threat
categories rather than actual attack events, which helps researchers focus on the broader detection
of malicious patterns rather than relying on specific signatures. This labeling system enhances the
flexibility and scalability of machine learning algorithms, allowing them to generalize better across
different types of attacks and to detect unknown or novel threats.
The advantage of using such a dataset is that it allows for the training of AI models to detect threats
at the earliest stages of attack, such as when network packets are first received. By capturing
packets as they are transmitted through the network and analyzing them in real time, AI systems
can detect anomalies and potential threats before the malicious activity fully propagates through
the network. This real-time detection is crucial because it reduces the window of opportunity for
attackers to exploit vulnerabilities, thereby minimizing potential damage. Moreover, by processing
network data with minimal delays, the proposed system can operate with low computational
complexity, ensuring that threat detection occurs quickly without overwhelming system resources.
This ability to balance speed and efficiency is one of the key advantages of using the PCAP
approach and training systems with real-world, real-time data. It allows for rapid identification of
cyber threats while maintaining performance and scalability in live network environments.
In summary, training AI-enhanced countermeasures with real-time network data, such as that
provided by PCAP datasets like UNSW-15, is critical for the effectiveness of cyber-defense
systems. This methodology enables the early detection of threats in live networks, reduces delays
in threat response, and ensures that the AI models are robust enough to handle a wide range of
attack scenarios. By leveraging datasets that reflect real-world network traffic and maintaining
continuous, near-instantaneous analysis, AI-enhanced cyber-defense systems can provide more
effective and adaptive defense mechanisms against evolving cyber threats.
To meet the criteria for AI-Enhanced Cyber-Defense Systems (AIECDS), network packets must
be processed as they are received, which ensures timely detection of cyber threats. This approach
is particularly effective in minimizing detection delays and addressing the real-time nature of
network-based attacks. The packets, which are captured using technologies like Packet Capture
(PCAP), are processed in a continuous stream. As each packet is received, key features are
extracted, such as source and destination IP addresses, source and destination ports, protocol types,
and various flags associated with the packet. These features provide crucial information that can
be used to identify patterns and anomalies indicative of potential cyber threats. The extracted
features are stored temporarily in a buffer, which acts as a holding area before being processed for
threat detection.
The buffer serves an essential role in organizing the incoming data, ensuring that each packet's
relevant details are accumulated and stored for analysis. Once the buffer reaches a certain threshold
or the network session ends, the stored data is passed to the data preparation phase, where it is
further processed for machine learning (ML) models. After processing, the buffer is cleared, and
the system is ready to receive the next set of packets. This cycle is repeated continuously, allowing
the system to monitor network traffic in real-time without delays. This method is critical for
minimizing computational overhead, reducing latency, and ensuring that threats are detected as
soon as they enter the network.
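As a rough illustration of this capture-and-buffer cycle, the sketch below extracts the per-packet attributes mentioned above (addresses, ports, protocol, flags) and flushes the buffer to a data-preparation step once a threshold is reached. It assumes the scapy library; the field names and the flush threshold are illustrative rather than taken from the paper.

```python
# Minimal sketch of the continuous capture-and-buffer cycle, assuming scapy.
from scapy.all import sniff, IP, TCP, UDP

BUFFER = []
FLUSH_THRESHOLD = 1000   # assumed buffer size before handing off to data preparation

def extract_features(pkt):
    """Pull the per-packet attributes mentioned in the methodology."""
    if not pkt.haslayer(IP):
        return None
    feat = {
        "src_ip": pkt[IP].src,
        "dst_ip": pkt[IP].dst,
        "proto": pkt[IP].proto,
        "length": len(pkt),
        "src_port": None,
        "dst_port": None,
        "tcp_flags": None,
    }
    if pkt.haslayer(TCP):
        feat["src_port"], feat["dst_port"] = pkt[TCP].sport, pkt[TCP].dport
        feat["tcp_flags"] = str(pkt[TCP].flags)
    elif pkt.haslayer(UDP):
        feat["src_port"], feat["dst_port"] = pkt[UDP].sport, pkt[UDP].dport
    return feat

def prepare_for_detection(batch):
    pass  # placeholder for the data preparation phase described below

def handle_packet(pkt):
    feat = extract_features(pkt)
    if feat is None:
        return
    BUFFER.append(feat)
    if len(BUFFER) >= FLUSH_THRESHOLD:
        prepare_for_detection(list(BUFFER))   # hand off accumulated features
        BUFFER.clear()                        # buffer cleared, ready for the next packets

# sniff(prn=handle_packet, store=False)       # live capture; or read a PCAP with rdpcap()
```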
In the case of network session datasets, such as the UNSW-15 PCAP dataset, the packet data
typically includes a variety of protocols, including IP and address resolution protocols (ARP),
network protocols, and transport protocols. These datasets contain both normal and malicious
traffic, which is essential for training AI-based detection systems. During training, features like IP
source and destination addresses, IP length, TCP flags, ARP protocol sources and destinations,
and transmitted data are extracted. These features help the machine learning models understand
the underlying patterns of normal versus malicious activity.
For example, the UNSW-15 dataset, commonly used in cybersecurity research, contains over 100
GB of network traffic data, with more than 82 million packets in the training dataset alone. The
dataset is annotated with threat labels, which categorize the types of attacks present, such as Denial
of Service (DoS), worms, backdoors, fuzzers, and zero-day attacks. These labeled datasets are
valuable for training high-performance algorithms capable of distinguishing between benign and
malicious traffic. The detailed packet-level features enable the models to make more accurate
predictions and respond to potential threats in real-time. The use of such real-time data and
continuous processing of incoming packets helps to create robust cyber-defense systems capable
of responding to threats before they cause significant damage.
3. DATA PREPARATION
During the data preparation phase, the extracted features are carefully organized and stored in
buffers. Buffers act as temporary storage for packets as they are received or processed in batches.
These buffers may either be periodically input into the next phase or used to compile large datasets
for analysis over time. The goal is to ensure that no important network behavior is lost, even if
data needs to be analyzed in real-time or incrementally. The fingerprints, which are key features
representing network sessions, are created by associating the extracted features with specific threat
labels during the training phase. These labels categorize network events based on the type of threat
they represent, such as Denial of Service (DoS), worms, or zero-day attacks. By doing so, the
dataset becomes enriched with meaningful information, allowing the system to learn and
differentiate between benign and malicious events effectively.
In a use case, the data preparation phase might involve using a dataset like the UNSW-15 PCAP
dataset, which contains millions of network packets with a wide variety of attack scenarios. As
network traffic flows in real-time or is extracted from existing datasets, relevant features such as
ARP (Address Resolution Protocol), IP addresses, protocol types, and ports are selected. These
features are then linked to specific metadata, which could include information about the network's
topology, the operating system of the devices involved, or the applications in use at the time of the
traffic. This extra context helps enrich the features, allowing the AI models to build a more accurate
understanding of the network environment. For instance, identifying which applications are active
during a malicious event can provide additional clues about the attack’s nature or intent. By
combining these features with the threat labels, the system can create "fingerprints" for various
types of network sessions, effectively preparing the data for further analysis and machine learning
training. Through this process, the system can then be trained to recognize and classify various
types of cyber-attacks more accurately.
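A minimal sketch of how the prepared features, threat labels, and enriching metadata could be bundled per network session is shown below. The class and field names are hypothetical; the study does not prescribe a particular data structure.

```python
# Hypothetical per-session record for the data preparation phase.
from dataclasses import dataclass, field
from typing import List, Dict, Optional

@dataclass
class SessionRecord:
    session_key: tuple                  # (src_ip, dst_ip, src_port, dst_port, proto)
    packet_features: List[Dict]         # per-packet features accumulated from the buffer
    threat_label: Optional[str] = None  # e.g. "DoS", "worm", "benign" during training
    metadata: Dict = field(default_factory=dict)  # topology, OS, active applications, ...

def group_into_sessions(batch):
    """Group buffered packet features by their 5-tuple so each session can be fingerprinted."""
    sessions = {}
    for feat in batch:
        key = (feat["src_ip"], feat["dst_ip"], feat["src_port"], feat["dst_port"], feat["proto"])
        sessions.setdefault(key, SessionRecord(key, [])).packet_features.append(feat)
    return list(sessions.values())
```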
4. FINGERPRINT SESSIONS
The fingerprint session phase, as part of the AIECDS (AI-enhanced Cyber-defense Systems)
methodology, plays a crucial role in encoding meaningful data from real-world systems. This
phase focuses on extracting relevant data from network traffic or other real-world systems and
encoding them in a way that simplifies their complexity, making it easier to detect and analyze
potential cyber threats. One of the key objectives of this phase is to "fingerprint" the data, or in
other words, transform raw data into a unique, identifiable signature that represents specific
sessions or events. This is particularly important for tracking the flow of data within network
sessions and identifying potential security threats based on unusual or suspicious activity. To
achieve this, data are encoded into a structured format that can be easily visualized and analyzed
by machine learning models or cybersecurity experts.
A variety of encoding and visualization techniques can be used to achieve this. Two prominent
techniques chosen for the AIECDS methodology are Hilbert curves and tornado graphs. A Hilbert
curve is a type of continuous fractal space-filling curve that maintains the relative positions of data
points within a given structure. This is significant for the AIECDS methodology, as maintaining
the relative positions of network packets within network sessions is essential for accurately
identifying anomalous behavior. The positions of packets are crucial because they indicate the
flow and sequence of communication between different hosts in the network. Hilbert curves
provide an efficient way to represent this flow of data visually while preserving the relationships
between different data elements. On the other hand, tornado graphs, which are a type of bar chart,
are used to visualize patterns and distributions of data in a way that makes anomalies more easily
detectable. These visualization techniques make it possible to represent complex and multi-modal
threat-related data in a simplified, interpretable format, which is particularly useful for improving
the efficiency of threat detection systems.
The design of the fingerprint data itself depends on the specific use case, which in this case
involves detecting cyber threats through fingerprinted network sessions. The fingerprint is divided
into three distinct sections: the header, protocol discourse, and transmitted data. The header is
crucial because it identifies the unique attributes of a network session, such as source and
destination IP addresses, ports, and protocols. The encoding of the header section is done in a way
that highlights its significance within the fingerprint, ensuring that key behaviors of the network
session are not overlooked. For example, both the TCP and UDP port numbers can be encoded
using a color scale, with each range of port numbers represented by different shades. This encoding
system allows for the efficient visualization of the port numbers and their associated protocols
within the network session.
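The sketch below shows one possible shade encoding for port numbers in the header section. The report states only that port ranges map to different shades, so the IANA-style range boundaries and grayscale values used here are assumptions.

```python
# Assumed shade encoding for TCP/UDP port numbers in the header section.
def port_to_shade(port: int) -> int:
    """Map a port number to a grayscale intensity (0-255), one band per port range."""
    if port < 1024:          # well-known ports
        base, span, offset = 0, 1024, 0
    elif port < 49152:       # registered ports
        base, span, offset = 1024, 49152 - 1024, 85
    else:                    # dynamic/private ports
        base, span, offset = 49152, 65536 - 49152, 170
    return offset + int((port - base) / span * 84)   # shade within the range's band

# Example: port_to_shade(80) ~ 6, port_to_shade(8080) ~ 97, port_to_shade(50000) ~ 174
```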
The protocol discourse section of the fingerprint is designed to represent the sequence of
communication between multiple hosts in a network session. In this section, attributes such as the
sequence of packets exchanged during the session are encoded to represent the flow of data
between the source and destination IPs. The session typically follows a standard communication
process, which starts with the establishment of a connection (such as the TCP handshake),
followed by data transfer, and ends with the termination of the connection. By encoding this
sequence of events in a fingerprint, the system can track the entire flow of communication between
hosts. The fingerprint's design can capture up to 128 interactions between two hosts, which is
sufficient to represent most network sessions, including multiple flows that may occur within the
same session. This level of granularity allows for the detection of subtle anomalies in the
communication pattern that could indicate malicious activity, such as a denial of service attack or
unauthorized access attempts.
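As an illustration, the protocol discourse section could be built as a fixed-length vector of up to 128 host interactions, as sketched below. Encoding direction as the sign of the packet length and zero-padding short sessions are assumptions that are merely consistent with the stated value range of −1500 to 1500.

```python
# Sketch of the protocol discourse encoding (assumed sign/padding convention).
def encode_discourse(session_packets, src_ip, max_interactions=128, mtu=1500):
    """Build a 128-value sequence of signed packet lengths for one session."""
    seq = []
    for pkt in session_packets[:max_interactions]:
        length = min(pkt["length"], mtu)
        # positive = sent by the session source, negative = reply from the destination
        seq.append(length if pkt["src_ip"] == src_ip else -length)
    seq += [0] * (max_interactions - len(seq))   # pad short sessions to 128 values
    return seq
```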
Finally, the transmitted data section of the fingerprint is used to encode the actual data transmitted
during the network session. The transmitted data is represented by a 128 × 128 Hilbert curve,
which is a matrix of values that encodes the raw bytes of the data transmitted between the hosts.
Each byte is encoded as a grayscale color, which corresponds to a value in the range of 0 to 255.
This approach ensures that the fingerprint can capture the entire data transmission process within
a session, providing a comprehensive view of the traffic involved. The Hilbert curve structure
allows for a dense and compact representation of the transmitted data, making it easier to analyze
large volumes of network traffic without losing important details. The 128 × 128 matrix was
specifically chosen to accommodate the large amount of data typically exchanged during network
sessions, while still allowing for efficient processing and analysis. This section of the fingerprint
is critical for detecting data exfiltration, malware communication, and other forms of malicious
activity that involve the transfer of sensitive data over the network.
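A compact sketch of the transmitted-data encoding is given below: raw session bytes are laid out along a Hilbert curve in a 128 × 128 grayscale matrix, one byte per cell. The d2xy routine is the standard iterative Hilbert-curve conversion; how the paper orders and truncates payload bytes beyond 16,384 is assumed here.

```python
# Sketch: transmitted bytes placed along a Hilbert curve in a 128 x 128 grid.
import numpy as np

def d2xy(n, d):
    """Convert distance d along a Hilbert curve to (x, y) in an n x n grid (n a power of 2)."""
    x = y = 0
    s, t = 1, d
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate the quadrant when required
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def encode_transmitted_data(payload: bytes, size=128):
    """Place each payload byte as a grayscale value (0-255) along the Hilbert curve."""
    grid = np.zeros((size, size), dtype=np.uint8)
    for d, byte in enumerate(payload[: size * size]):   # truncate beyond 16,384 bytes (assumed)
        x, y = d2xy(size, d)
        grid[y, x] = byte
    return grid
```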
5. THREAT DETECTION
The AIECDS methodology incorporates dynamic self-learning and Reinforcement Learning (RL)
techniques to enhance threat detection capabilities. In the threat detection phase of the AIECDS
system, two primary tasks are involved: the management of the fingerprint system and the training
of the Detection Reinforcement Learning (DRL) model. The fingerprint management system is
responsible for buffering and maintaining all generated fingerprints. Each fingerprint represents a
unique snapshot of a network session or event and is crucial for detecting cyber threats. The system
ensures that each fingerprint is associated with specific metadata, including the available
fingerprint space, the time of its last presentation to the DRL model, and the time of its most recent
update.
The fingerprint management system operates by inserting newly created fingerprints into a buffer.
These fingerprints are periodically scheduled to be presented to the DRL model for threat detection.
The system also ensures that existing fingerprints are updated and re-scheduled for classification.
This ongoing process ensures that the DRL model is continuously exposed to fresh data, enabling
it to learn and adapt to emerging threat patterns. Once a fingerprint is classified by the DRL model,
the result is recorded in the fingerprint’s metadata. If a threat is detected, appropriate mitigation
actions are taken, and the fingerprint is removed from the buffer.
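The sketch below outlines a fingerprint management buffer along the lines described above, tracking when each fingerprint was last updated and last presented to the DRL model, and removing it once a threat is confirmed. The class name, scheduling rule, and presentation interval are assumptions.

```python
# Assumed sketch of the fingerprint management buffer.
import time

class FingerprintBuffer:
    def __init__(self):
        self._entries = {}   # fingerprint_id -> {fingerprint, last_presented, last_updated}

    def insert(self, fid, fingerprint):
        self._entries[fid] = {"fingerprint": fingerprint,
                              "last_presented": None,
                              "last_updated": time.time()}

    def update(self, fid, fingerprint):
        entry = self._entries[fid]
        entry["fingerprint"] = fingerprint
        entry["last_updated"] = time.time()        # re-scheduled for classification

    def due_for_presentation(self, interval=5.0):
        """Fingerprints updated since their last presentation, or never presented."""
        now = time.time()
        return [fid for fid, e in self._entries.items()
                if e["last_presented"] is None
                or e["last_updated"] > e["last_presented"]
                or now - e["last_presented"] > interval]

    def record_classification(self, fid, is_threat):
        self._entries[fid]["last_presented"] = time.time()
        if is_threat:
            # mitigation would be triggered here; the fingerprint leaves the buffer
            del self._entries[fid]
```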
The threat detection DRL model plays a critical role in the AIECDS methodology. Its primary
function is to identify cyber threats using as little information as possible, which is essential for
real-time threat detection. One of the AIECDS criteria is the ability to detect threats and attacks
even when only small sample datasets are available. This requirement is addressed by presenting
fingerprints to the DRL model in incremental steps as the fingerprints are updated over time. The
DRL model is trained to prioritize early threat detection, with higher rewards given for faster
detection of threats. Conversely, negative rewards are assigned when the DRL model makes
incorrect classifications or delays in detecting threats. This reward-based system enables the DRL
model to learn how to make more accurate predictions with fewer data points, improving its
performance in real-world environments.
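The reward principle described above could be expressed as simply as the following sketch, where earlier correct detections earn more and misclassifications are penalized; the exact values and decay schedule are assumptions, since the paper states only the principle.

```python
# Illustrative reward shaping for the detection model (values assumed).
def detection_reward(correct: bool, steps_observed: int, max_steps: int = 128) -> float:
    """Higher reward for earlier correct detection, negative reward for a wrong call."""
    if correct:
        return 1.0 - (steps_observed - 1) / max_steps   # full reward at the first increment
    return -1.0

# Example: detection_reward(True, 1) == 1.0; detection_reward(True, 65) ~ 0.5
```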
The integration of Reinforcement Learning ensures that the system can adapt to new and evolving
cyber threats, including adversarial and previously unknown attacks. By learning from patterns of
known malicious activity, the DRL model can identify new attack vectors and cyber threats that
may not have been seen before. This capability is crucial in the ever-changing landscape of
cybersecurity, where attackers continually devise new methods to bypass traditional defenses.
In practical terms, the system functions by continuously analyzing network session fingerprints to
identify patterns that distinguish benign traffic from malicious activity. For instance, over time, as
more fingerprints are collected, the DRL model is able to profile the boundaries between normal
and malicious network sessions. This allows the system to identify anomalies, such as previously
unseen attack types, based on their unique characteristics at the byte level. The use of byte-level
analysis is particularly effective because malicious code often manifests at this granular level,
making it easier to detect hidden threats within the network traffic. Unlike other systems that may
compare high-level data such as text, video, or music, the AIECDS focuses on the raw byte-level
information, where malicious intent is more likely to be detected.
Additionally, the fingerprinting process is not solely focused on identifying known threats but also
on detecting unknown or emerging cyber-attacks. By learning from the visualized features of
network sessions, the DRL model becomes resilient to variations in network traffic and is capable
of detecting subtle indicators of malicious behavior that might be missed by traditional signature-
based systems. The ability to detect unknown attacks is a critical aspect of the AIECDS, as it
enables the system to anticipate and respond to new types of cyber threats that have not yet been
cataloged in threat databases.
The RL model's ability to operate in real-time further enhances the AIECDS's capability. As
network packets are received, the system can immediately begin analyzing and classifying them,
reducing the delay in threat detection. This real-time functionality ensures that cyber threats are
detected and mitigated as they occur, providing timely protection for the network and its resources.
In summary, the integration of dynamic self-learning and DRL in the AIECDS methodology
allows for advanced and adaptive threat detection. By continuously updating the fingerprints and
training the DRL model, the system becomes more effective at identifying both known and
unknown cyber threats. Its ability to operate with minimal information and in real-time, combined
with byte-level analysis and adversarial learning, makes it a powerful tool in the fight against
increasingly sophisticated cyber-attacks.
CHAPTER 5
EXPERIMENTAL RESULTS
To structure the selection of malicious fingerprint samples for analysis, an organized clustering
methodology is utilized, leveraging k-means clustering. This approach provides a systematic way
to group malicious fingerprint samples, enabling the analysis of malware threat categories. K-
means clustering is particularly effective in situations where large volumes of data need to be
segmented into meaningful patterns or groups. For malware threat categories with more than four
fingerprinted network sessions, k-means clustering identifies distinct clusters, grouping similar
fingerprints based on their features.
The process begins with determining the optimal number of clusters for the dataset. This is
achieved using the elbow method, a widely recognized approach in clustering analysis. The elbow
method involves plotting the variance explained by the clustering model against the number of
clusters. The "optimal elbow" is the point at which the rate of decrease in variance slows
significantly, forming an elbow-like bend in the graph. This point represents the most appropriate
number of clusters, ensuring a balance between over-segmentation and under-segmentation of the
data.
Once the optimal number of clusters is identified, k-means clustering assigns each fingerprint to
the cluster with the smallest Euclidean distance to its center. The Euclidean distance serves as a
measure of similarity, ensuring that fingerprints within a cluster are closely related in terms of their
attributes. This grouping is especially useful for malware threat categories with a larger number of
fingerprints, as it provides insights into distinct behavioral patterns or attack characteristics within
each category.
For malware threat categories with fewer than or equal to four fingerprinted network sessions, all
fingerprints are selected without clustering. This ensures that smaller datasets are not overlooked
or excluded from analysis due to the clustering process. These smaller categories may still hold
significant information about specific or rare attack types, making their inclusion vital to a
comprehensive analysis.
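A sketch of this selection step is shown below using scikit-learn's k-means: categories with more than four fingerprints are clustered with k chosen by an elbow heuristic, while smaller categories are kept in full. The specific elbow rule and the choice of the centre-nearest fingerprint as each cluster's representative are assumptions.

```python
# Sketch of malicious fingerprint sample selection (k-means + elbow heuristic).
import numpy as np
from sklearn.cluster import KMeans

def pick_k_by_elbow(X, k_max=10):
    """Choose k where the decrease in within-cluster variance (inertia) levels off."""
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                for k in range(1, min(k_max, len(X)) + 1)]
    drops = np.diff(inertias)
    # elbow ~ first k where the improvement falls below 10% of the initial drop (assumed rule)
    for i, d in enumerate(drops, start=2):
        if abs(d) < 0.1 * abs(drops[0]):
            return i
    return len(inertias)

def select_samples(fingerprints):
    """fingerprints: array of flattened fingerprint vectors for one threat category."""
    X = np.asarray(fingerprints, dtype=float)
    if len(X) <= 4:
        return X                                   # small categories are kept in full
    k = pick_k_by_elbow(X)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    reps = []
    for c in range(k):                             # centre-nearest fingerprint per cluster
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        reps.append(X[members[np.argmin(dists)]])
    return np.array(reps)
```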
The outcome of this process results in a structured dataset where all malicious network session
fingerprints are either clustered or directly selected, depending on the size of the dataset for each
malware category. This structured approach allows for targeted and efficient analysis, where key
cluster fingerprints are analyzed to uncover critical features and behaviors associated with each
malware type. By grouping similar fingerprints and analyzing their patterns, this methodology
enhances the ability to detect and classify malicious activity with precision.
Additionally, the clustering process aids in identifying outliers or anomalous behaviors within the
data. These outliers may represent previously unknown attack patterns or variations within a
malware category. Overall, this structured selection and clustering approach streamline the
analysis of malicious fingerprints, enabling researchers and cybersecurity systems to build more
effective and resilient defense mechanisms against evolving malware threats.
2. FROBENIUS DISTANCE
The Frobenius distance measure plays a crucial role in comparing and analyzing differences
between pairs of fingerprints, particularly when these fingerprints exhibit similar pattern motifs.
The Frobenius distance is well-suited for this task because it calculates the movement or variation
between individual elements within a matrix. This characteristic makes it an effective metric for
identifying subtle differences in structure and content between fingerprints, which is essential for
applications like threat detection.
The importance of the Frobenius distance in this context stems from its alignment with the
requirements of the Deep Reinforcement Learning (DRL) threat detection model. The DRL
model operates by evaluating fingerprints based on their element-level differences. By
incorporating the Frobenius distance into its analysis, the DRL model can effectively identify
nuanced variations in fingerprint patterns, which may indicate unique threat behaviors. This
ensures that the model remains sensitive to subtle changes, which is critical for detecting advanced
and evolving cyber threats.
To assess the significance of the Frobenius distance for each fingerprint comparison, a
specialized gauge was introduced, as shown in Figure 8. This gauge evaluates the Frobenius
distance relative to the average element value of the fingerprints being compared. By
contextualizing the Frobenius distance within this framework, the gauge provides a standardized
metric for determining the relevance of the differences between fingerprints. This approach is
particularly valuable for analyzing the transmitted data and protocol discourse sections of
fingerprints, as it highlights variations that may be indicative of malicious activity.
In summary, the Frobenius distance measure is essential for distinguishing between fingerprints
with similar patterns by quantifying their element-level differences. Its integration into the DRL
threat detection model ensures robust analysis and detection of subtle anomalies in network
sessions. The accompanying significance gauge further enhances its utility by providing a clear
and standardized method for interpreting these differences, enabling more accurate and efficient
threat detection.
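For reference, a sketch of the Frobenius distance between two fingerprint matrices and of a significance gauge relating that distance to the average element value is given below. The exact gauge formula from Figure 8 is not reproduced in this report, so the simple ratio used here is an assumption.

```python
# Frobenius distance between fingerprint matrices and an assumed significance gauge.
import numpy as np

def frobenius_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Element-wise difference magnitude between two fingerprint matrices."""
    return float(np.linalg.norm(a.astype(float) - b.astype(float), ord="fro"))

def significance_gauge(a: np.ndarray, b: np.ndarray) -> float:
    """Distance normalized by the mean element value of both fingerprints (assumed form)."""
    mean_value = (a.astype(float).mean() + b.astype(float).mean()) / 2.0
    return frobenius_distance(a, b) / mean_value if mean_value else float("inf")
```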
The transmitted data section of a fingerprint comprises a 128 × 128 grid, where each element can
take one of 256 values (0 to 255). This yields 256^(128×128) possible matrices, an astronomically
large space of potential patterns. In practice, however, the analysis reveals a small, recurring set
of patterns within transmitted data. This observation underscores the structured nature
of network traffic and the predictability of malicious behavior patterns within it. The analysis
aimed to uncover unique patterns in transmitted data, conduct similarity assessments between
malicious and benign fingerprints, and identify broader insights about these relationships.
Through visual inspection of fingerprints, researchers identified 13 unique patterns that repeatedly
occurred in transmitted data. These patterns are summarized and illustrated in the pattern guide. Each pattern provides
distinct insights into the data structure of network sessions and highlights differences between
malicious and benign behavior. The use of an evaluation guide facilitated the categorization of
these patterns during the analysis process, emphasizing their significance in differentiating various
network behaviors.
Table 1 provides a comprehensive summary of the analysis, comparing malicious and benign
fingerprints side by side. For each row, the table includes details such as the threat category,
protocol, port, and results from the pattern guide for both malicious and benign fingerprints.
Additionally, metrics such as the Frobenius distance, mean, standard deviation, and statistical
significance are presented to quantify the differences between these fingerprints.
For instance, in the case of the shellcode threat type, both malicious fingerprints (5.1 and 5.2)
exhibited identical transmitted data patterns. However, the Frobenius distance between the two
was 1128 points, indicating significant variability despite similar patterns. Conversely, the
reconnaissance threat type displayed complete pattern matches between malicious and benign
fingerprints, yet the Frobenius distances were significantly smaller, with an average of 139 points
and a standard deviation of 28 points. These small differences illustrate how subtle variations in
transmitted data can be consistently identified, even in low-difference cases like reconnaissance.
The analysis identified Pattern 9 as the most dominant across both malicious and benign
fingerprints. It appeared 20 times in malicious fingerprints and 13 times in benign fingerprints. For
malicious fingerprints, Patterns 7 and 10 followed in prevalence, while Patterns 2 and 3 were more
common in benign fingerprints but rarely observed in malicious ones. Interestingly, Pattern 11 was
unique to a single malicious fingerprint, signifying its potential utility as a specific marker for
certain types of malware.
The transmitted data similarity analysis highlights the utility of fingerprints as a framework for
identifying differences between malware and benign network sessions. Crucially, no malware
fingerprint's transmitted data section was identical to its closest benign counterpart, as evidenced
by the absence of a Frobenius distance of zero. This finding suggests that even subtle differences
between malicious and benign network behaviors can be detected and leveraged to enhance threat
detection algorithms.
The analysis also revealed that previously undetectable malware categories, such as backdoor,
shellcode, and worms in the UNSW-15 simulation, exhibited distinct transmitted data patterns that
could facilitate their detection using simpler algorithms. For example, reconnaissance malware,
which had the smallest detection ratio (0.2%) in earlier simulations, demonstrated consistent
differences in transmitted data patterns, providing a basis for improved detection strategies.
Interestingly, seven malicious fingerprints (7, 8, 9, 10, 11, 21, and 23) shared their closest benign
fingerprints with two other unique benign fingerprints. This overlap further demonstrates the
effectiveness of the proposed fingerprinting solution in delineating the decision boundaries
between malicious and benign classifications. The ability to visually profile transmitted data
provides an additional layer of differentiation, enhancing the detection of complex threats.
The results indicate that fingerprinting transmitted data offers a promising approach for
distinguishing malicious from benign network behaviors. The unique patterns identified provide a
visual and measurable basis for detecting even subtle differences, enabling the classification of
malware threat categories with greater precision. Furthermore, the findings suggest that
transmitted data fingerprints can be instrumental in identifying unknown or emerging threats,
thereby addressing one of the core challenges in modern cybersecurity.
Turning to the protocol discourse section of the fingerprints, Table 2 presents a side-by-side
evaluation of the differences in network behavior between the malicious and benign classes. The
columns outline key features for both malicious and benign fingerprints, such as row numbers,
threat categories, protocols, ports, and pattern guide results. Furthermore, metrics like the sum of
differences, the Frobenius distance measure, mean, standard deviation, and the level of
significance are included, offering a multifaceted view of the disparities between these fingerprints.
Interestingly, all reconnaissance network sessions utilized port 111, commonly associated with
remote procedure calls. This consistent use of the same port highlights a behavioral signature that
could assist in threat detection. The lack of discernible differences in most reconnaissance
fingerprints suggests that malicious reconnaissance often mimics benign activity to evade
detection, further justifying the need for advanced fingerprinting methods.
The fingerprints associated with backdoor and shellcode threats exhibited notable differences,
despite their relatively simplistic communication patterns. These threats typically involve small
packets exchanged during the initial setup phase, limiting the amount of detectable data. However,
differences were observed in the number of packets exchanged in several fingerprints, such as
Fingerprints 1, 3, 4, and 6, all of which had nonzero distances.
The average Frobenius distances for backdoor and shellcode fingerprints were 232 points and 125
points, respectively, with standard deviations of 56 and 167 points. These metrics suggest
moderate variation among fingerprints within these threat categories, which could be leveraged
for detection. However, Fingerprint 5 displayed minimal significance, indicating it closely
resembled its benign counterpart. This underscores the variability within backdoor and shellcode
threat fingerprints and highlights the need for precise analysis to identify their unique
characteristics.
From the results in Table 2, it is evident that protocol discourse analysis is instrumental in
differentiating malicious from benign network activity. A total of 14 fingerprints exhibited no
differences in packet sequence or structure, resulting in an average Frobenius distance of 288
points. In contrast, fingerprints with one or more differences (20 out of 34) had a significantly
higher average Frobenius distance of 2473 points. This stark disparity reinforces the value of
detecting variations in protocol discourse as a key method for identifying malicious activity.
Notably, only four fingerprints—all reconnaissance fingerprints (30, 31, 32, and 34)—had zero
distances and identical packet sequences. This perfect match highlights the challenges of detecting
reconnaissance attacks due to their deliberate mimicry of benign behavior. Despite these
challenges, the protocol discourse data and corresponding visual fingerprints for most other threat
categories provided meaningful distinctions between malicious and benign fingerprints.
Summary of Findings-
The protocol discourse analysis revealed several important insights. Except for four
reconnaissance fingerprints, the majority of malicious fingerprints displayed distinct differences
in packet sequence, size, or flags compared to their benign counterparts. These differences were
quantified using Frobenius distance, with higher distances indicating greater dissimilarity. For
threats like backdoors and shellcode, the number of packets exchanged and setup phase behaviors
were key differentiators. Reconnaissance threats, while harder to detect due to their similarity to
benign patterns, still exhibited subtle discrepancies, such as in Fingerprint 33.
Overall, this analysis demonstrates that protocol discourse data and visual fingerprints are valuable
tools for identifying and differentiating malicious network sessions. By focusing on sequence
patterns, packet sizes, and specific protocol behaviors, the proposed methodology provides a
robust framework for enhancing the detection of various cyber threats.
CHAPTER 6
The protocol discourse section of the fingerprint plays a crucial role in representing the
behavioral patterns of network sessions, focusing on the interaction and sequence of transmitted
packets between hosts. This section comprises 128 values ranging from −1500 to 1500, a space that admits a staggering 384,000 factorial possible permutations. Although the number of possible permutations is astronomically large, the analysis reveals that network sessions tend to conform to specific packet ranges and phases, leading to identifiable and recurring patterns. These patterns
provide essential insights into how data flows in a network session, making it possible to
differentiate between benign and malicious activities.
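One way to read this encoding is that each of the 128 positions holds a signed packet length, with the sign indicating the direction of travel between the two hosts and zeros padding sessions shorter than 128 packets. The sketch below builds such a vector from a hypothetical packet list; the field names, the sign convention, and the padding rule are assumptions for illustration rather than the paper's exact encoding.

# Hedged sketch: encode a session as a fixed-length protocol discourse
# vector of signed packet lengths. Positive values = client to server,
# negative values = server to client. Conventions here are assumptions.
DISCOURSE_LEN = 128
MAX_PKT_LEN = 1500

def discourse_vector(packets, client_ip):
    values = []
    for pkt in packets[:DISCOURSE_LEN]:
        length = min(pkt["length"], MAX_PKT_LEN)
        direction = 1 if pkt["src"] == client_ip else -1
        values.append(direction * length)
    # Pad short sessions with zeros so every vector has exactly 128 entries.
    values += [0] * (DISCOURSE_LEN - len(values))
    return values

packets = [
    {"src": "10.0.0.5", "dst": "10.0.0.9", "length": 74},
    {"src": "10.0.0.9", "dst": "10.0.0.5", "length": 74},
    {"src": "10.0.0.5", "dst": "10.0.0.9", "length": 66},
    {"src": "10.0.0.5", "dst": "10.0.0.9", "length": 1200},
]

vector = discourse_vector(packets, client_ip="10.0.0.5")
print(vector[:8], "... total length:", len(vector))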
To decode and analyze these recurring patterns, two distinct pattern guides were employed in the
analysis:
1. Packet Pattern Guide: This guide focused on packet sizes and the phases of engagement
within the session. By examining the size and sequence of packets, it was possible to
recognize distinct behaviors that characterize the flow of information in various types of
network sessions.
2. Setup-Phase Packet Length and Sequence Analysis Guide: This guide specifically
targeted the setup-phase packets, with an emphasis on their lengths and sequences. It
identified repeating patterns that occur during the initial stages of a session and examined
how these sequences correspond to specific ports or protocols.
The packet pattern guide distinguished three phases of engagement within each session:
1. Setup Phase: This phase involves the exchange of initial packets to establish a connection
between two hosts. Typically, small packets are predominant in this phase.
2. Transfer Phase: Once the connection is established, data is transmitted between the hosts.
This phase features a mix of large and small packets, with the size often depending on the
type and purpose of the data being transferred.
3. Teardown Phase: This phase marks the closure of the session, with packets exchanged to
finalize the communication. The teardown phase often includes smaller packets, but
medium-sized packets may also appear.
These phases and their packet patterns were instrumental in revealing consistent structures in both
benign and malicious network sessions.
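To make the phase decomposition concrete, the sketch below labels each packet of a session as belonging to the setup, transfer, or teardown phase using simple positional and size heuristics. The thresholds and window sizes are illustrative assumptions and are not the pattern guides used in the study.

# Hedged sketch: assign setup / transfer / teardown labels to a session's
# packets. Thresholds and window sizes are illustrative assumptions only.
SMALL_PKT = 100      # bytes; small packets dominate setup and teardown
SETUP_WINDOW = 3     # treat the first few small packets as the setup phase
TEARDOWN_WINDOW = 2  # treat the last few packets as the teardown phase

def label_phases(packet_lengths):
    n = len(packet_lengths)
    labels = []
    for i, length in enumerate(packet_lengths):
        if i < SETUP_WINDOW and length <= SMALL_PKT:
            labels.append("setup")
        elif i >= n - TEARDOWN_WINDOW:
            labels.append("teardown")
        else:
            labels.append("transfer")
    return labels

lengths = [66, 66, 54, 1460, 1460, 880, 54, 54]
for length, phase in zip(lengths, label_phases(lengths)):
    print(f"{phase:9s} {length:5d} bytes")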
The analysis uncovered unique protocol discourse patterns by overlaying repeating setup-phase
sequences. Three distinct sequences emerged, which were annotated and visualized to reveal how
different sequences of packet lengths and phases combined. For instance, the setup-phase analysis
showed that even within a single type of network session, variations could occur in the number,
size, and order of packets exchanged.
Among the attributes annotated for each pattern were:
1. Partial Lengths: specific ranges of packet lengths observed within the sequence.
2. Partial Sequences: portions of the full sequence that retained their distinctive
characteristics.
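A repeating setup-phase pattern of this kind can be expressed as a short signature of expected packet lengths and checked against the opening packets of a session, as in the sketch below. The signature values and the tolerance parameter are hypothetical examples rather than patterns reported in the study.

# Hedged sketch: test whether a session's opening packets match a known
# repeating setup-phase signature (a short sequence of packet lengths).
# The signature values and tolerance are hypothetical examples.
def matches_setup_signature(packet_lengths, signature, tolerance=0):
    if len(packet_lengths) < len(signature):
        return False
    head = packet_lengths[:len(signature)]
    return all(abs(observed - expected) <= tolerance
               for observed, expected in zip(head, signature))

setup_signature = [74, 74, 66]           # hypothetical repeating setup shape
session_a = [74, 74, 66, 120, 1500, 64]  # matches the signature
session_b = [60, 1500, 1500, 52]         # does not match

print(matches_setup_signature(session_a, setup_signature))  # True
print(matches_setup_signature(session_b, setup_signature))  # False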
The visualization and annotation of these repeating setup-phase patterns provided critical insights:
• Malicious network sessions often exhibited distinct deviations in the setup phase when
compared to their benign counterparts. These deviations included unique packet lengths,
sequences, or abnormalities in the timing of the exchanges.
• Repeating patterns were observed across certain threat categories, offering a way to classify
and identify similar attack behaviors. For example, setup-phase patterns for reconnaissance
activities displayed repetitive sequences that could be linked to specific protocols and ports.
In summary, the protocol discourse section of the fingerprint proved to be a robust framework for
identifying and analyzing the behavior of network sessions. By leveraging packet pattern guides
and phase analysis, this section could differentiate between malicious and benign activities. The
repeating patterns and unique setup-phase behaviors provide a foundation for advanced detection
mechanisms, enabling threat detection models to classify and respond to cyber threats with greater
accuracy and efficiency.
CHAPTER 7
CONCLUSION
The AIECDS methodology proposed in this paper represents a comprehensive and innovative
approach to developing AI-enhanced cyber-defense systems. The methodology is designed to
address the challenges faced in cybersecurity, particularly in detecting and mitigating cyber threats
in real-time network environments. One of the key principles of the AIECDS framework is the
extraction of meaningful data from network traffic and the creation of visualized fingerprints that
represent these data points in a way that is both informative and interpretable by machine learning
models. By focusing on meaningful data extraction, the methodology ensures that the most
relevant features are captured from network sessions, making the detection process more accurate
and efficient.
The concept of fingerprinting plays a pivotal role in the AIECDS methodology. Fingerprints are
essentially unique representations of network sessions, created by extracting key features from
network traffic. These fingerprints are then analyzed to detect hidden patterns, which can be
indicative of malicious activity. The fingerprinting process enables the identification of subtle
differences between malicious and benign network sessions. The visual comparison of malicious
fingerprints with their closest benign counterparts further enhances this process by allowing for a
clearer and more interpretable distinction between the two. This approach is particularly beneficial
for the detection of attacks that may not be easily identifiable using traditional methods or on small
datasets.
One of the significant advantages of using visualized fingerprints is that it reduces the complexity
of decision boundaries. In many machine learning models, the decision boundary – which
separates benign from malicious traffic – can be highly complex, particularly when dealing with
large amounts of data or intricate attack patterns. By transforming network session data into visual
fingerprints, the complexity of these decision boundaries is significantly reduced, making it easier
for machine learning models to classify traffic correctly. This simplification not only enhances the
accuracy of the models but also improves their ability to detect threats, even when the available
data is limited or when the malicious attacks are rare or novel.
Moreover, the use of visualization in cybersecurity allows for a more intuitive understanding of
the data. Through the use of tools such as Hilbert curves, tornado graphs, and other visual encoding
techniques, the behavior of network sessions can be represented in a manner that makes it easier
for both humans and machine learning algorithms to detect patterns. Visualized data highlights
differences between malicious and benign sessions, allowing for a more granular analysis of
network traffic. This aids in the discovery of hidden threats, including new or unknown attack
types that may not have been previously detected by traditional cybersecurity methods.
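As a generic illustration of the Hilbert-curve idea mentioned above, the sketch below maps a one-dimensional sequence of values onto a small two-dimensional grid along a Hilbert curve, so that values that are close together in the sequence remain close together in the image. It demonstrates the encoding in general terms and is not the visualization pipeline used in the paper.

# Hedged sketch: place a 1-D sequence of values on a 2-D grid along a
# Hilbert curve, so nearby sequence positions stay nearby in the image.
def d2xy(order, d):
    """Map index d to (x, y) on a Hilbert curve covering a 2**order grid."""
    x = y = 0
    t = d
    s = 1
    side = 1 << order
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                       # rotate the quadrant when needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_image(values, order=3):
    side = 1 << order
    grid = [[0] * side for _ in range(side)]
    for d, value in enumerate(values[: side * side]):
        x, y = d2xy(order, d)
        grid[y][x] = value
    return grid

# Example: a hypothetical fingerprint slice of 64 values on an 8x8 grid.
values = [(i * 37) % 256 for i in range(64)]
for row in hilbert_image(values, order=3):
    print(" ".join(f"{v:3d}" for v in row))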
The application of the AIECDS methodology in the context of the case study, which focuses on
the detection of cyber threats using fingerprinted network sessions, demonstrates its practical
utility. In this use case, the methodology successfully identifies cyber threats by leveraging
fingerprinted data and visualizing network sessions. This approach proves to be particularly
effective for detecting threats with minimal sample datasets, which is a critical challenge in
cybersecurity. Often, cyber-attacks, particularly zero-day exploits, can have very few data points
available for detection, making traditional machine learning techniques less effective. By utilizing
visualized fingerprints, the AIECDS methodology overcomes this limitation and enhances
detection capabilities, even in the face of limited data.
In summary, the AIECDS methodology offers a significant contribution to the field of AI-
enhanced cyber-defense systems. By focusing on meaningful data extraction, fingerprinting, and
the visualization of network sessions, it improves the ability to detect malicious threats in real-
time environments. This approach simplifies the machine learning models required for threat
detection, reduces the complexity of decision boundaries, and provides a more effective means of
identifying hidden threats. The case study highlights the practical application of this methodology
and demonstrates its potential to transform cybersecurity practices, particularly in environments
where data availability is limited or where novel threats are encountered.
CHAPTER 8
REFERENCES
[1] ENISA Threat Landscape Report: July 2021 to July 2022, Eur. Union Agency for Cybersecur.
(ENISA), Athens, Greece, 2022.
[2] C. Gidi, ‘‘Vulnerability and threat trends report,’’ Skybox Secur., San Jose, CA, USA, Tech.
Rep., 2022.
[3] A. F. Police, ‘‘ACSC annual cyber threat report: July 2021 to June 2022,’’ Austral. Criminal
Intell. Commission (ACSC), Tech. Rep., 2022.
[4] R. Sobers. (May 2022). 89 Must-Know Data Breach Statistics 2022. Accessed: Jun. 29, 2022.
[Online]. Available: https://www.varonis.com/blog/cybersecurity-statistics
[5] N. Moustafa and J. Slay, ‘‘UNSW-NB15: A comprehensive data set for network intrusion
detection systems (UNSW-NB15 network data set),’’ in Proc. Mil. Commun. Inf. Syst. Conf.
(MilCIS), Nov. 2015, pp. 1–6.
[6] K. Shaukat, S. Luo, V. Varadharajan, I. A. Hameed, and M. Xu, ‘‘A survey on machine learning
techniques for cyber security in the last decade,’’ IEEE Access, vol. 8, pp. 222310–222354, 2020.
[7] N. Kaloudi and J. Li, ‘‘The AI-based cyber threat landscape: A survey,’’ ACM Comput. Surv.,
vol. 53, no. 1, pp. 1–34, Jan. 2021.
[8] T. T. Nguyen and V. J. Reddi, ‘‘Deep reinforcement learning for cyber security,’’ IEEE Trans.
Neural Netw. Learn. Syst., vol. 34, no. 8, pp. 3779–3795, Aug. 2021.
[9] I. F. Kilincer, F. Ertam, and A. Sengur, ‘‘Machine learning methods for cyber security intrusion
detection: Datasets and comparative study,’’ Comput. Netw., vol. 188, Apr. 2021, Art. no. 107840.
[10] J. K. F. Bowles, A. Silvina, E. Bin, and M. Vinov, ‘‘On defining rules for cancer data
fabrication,’’ in Proc. Int. Joint Conf. Rules Reasoning, in Lecture Notes in Computer Science,
vol. 12173, 2020, pp. 168–176.
[11] Y. Lu, M. Shen, H. Wang, X. Wang, C. van Rechem, T. Fu, and W. Wei, ‘‘Machine learning
for synthetic data generation: A review,’’ 2023, arXiv:2302.04062.
[12] K. Shaukat, S. Luo, V. Varadharajan, I. Hameed, S. Chen, D. Liu, and J. Li, ‘‘Performance
comparison and current challenges of using machine learning techniques in cybersecurity,’’
Energies, vol. 13, no. 10, p. 2509, May 2020.
[14] A. Wu, Y. Wang, X. Shu, D. Moritz, W. Cui, H. Zhang, D. Zhang, and H. Qu, ‘‘AI4VIS:
Survey on artificial intelligence approaches for data visualization,’’ IEEE Trans. Vis. Comput.
Graphics, vol. 28, no. 12, pp. 5049–5070, Dec. 2022.
[15] Threat Hunter Team. (Mar. 2022). Daxin Backdoor: In-Depth Analysis. Accessed: Dec. 20,
2022. [Online]. Available: https://symantec-enterpriseblogs.security.com/blogs/threat-intelligence/daxin-malware-espio
[16] W. Hu and Y. Tan, ‘‘Generating adversarial malware examples for blackbox attacks based on
GAN,’’ in Proc. Int. Conf. Data Mining Big Data. Singapore: Springer, Nov. 2022, pp. 409–423.
[17] D. Kirat, J. Jang, and M. Stoecklin. (2018). DeepLocker: Concealing Targeted Attacks With
AI Locksmithing. Black Hat USA. Accessed: Jul. 27, 2024. [Online]. Available:
https://i.blackhat.com/us-18/Thu-August-9/us-18-Kirat-DeepLockerConcealing-Targeted-Attacks-with-AI-Locksmithing.pdf
[18] Z. Ahmad, A. Shahid Khan, C. Wai Shiang, J. Abdullah, and F. Ahmad, ‘‘Network intrusion
detection system: A systematic study of machine
learning and deep learning approaches,’’ Trans. Emerg. Telecommun. Technol., vol. 32, no. 1, p.
e4150, Jan. 2021.
[19] K. Arshad, R. F. Ali, A. Muneer, I. A. Aziz, S. Naseer, N. S. Khan, and S. M. Taib, ‘‘Deep
reinforcement learning for anomaly detection: A systematic review,’’ IEEE Access, vol. 10, pp.
124017–124035, 2022.
[20] Y.-F. Hsu and M. Matsuoka, ‘‘A deep reinforcement learning approach for anomaly
network intrusion detection system,’’ in Proc. IEEE 9th Int. Conf. Cloud Netw. (CloudNet), Nov.
2020, pp. 1–6.