Malware Detection

Table of contents

1 Abstract
2 General introduction
   2.1 Problem statement
   2.2 Objective
3 Chapter One: Overview of malware and traditional detection techniques
   3.1 Malware
   3.2 Traditional malware detection approaches
      3.2.1 Signature-based Detection
      3.2.2 Behavioral-based Detection
   3.3 Problems with traditional malware detection approaches
   3.4 Machine learning
      3.4.1 How can machine learning help in malware detection
4 Chapter Two: Dataset
   4.1 Introduction
   4.2 Portable Executable file format
   4.3 Dataset description
      4.3.1 Feature extraction
5 Chapter Three: Model construction
   5.1 Introduction
   5.2 Static models
      5.2.1 XGBoost
      5.2.2 LightGBM
      5.2.3 MLP
   5.3 Convolutional Neural Network (CNN) for Static Malware Detection
      5.3.1 Feature Vector Transformation
      5.3.2 CNN Architecture
      5.3.3 Training and Evaluation
   5.4 Dynamic Analysis Models for Malware Detection
      5.4.1 Introduction to Dynamic Analysis
      5.4.2 API Calls in Executables
      5.4.3 Dynamic Analysis Using Speakeasy Emulation
      5.4.4 Why a Purely Dynamic Approach is Lacking
      5.4.5 Approaches Used
6 Chapter Four: Discussion
   6.1 Introduction
   6.2 Model results
      6.2.1 XGBoost model
      6.2.2 LightGBM model
      6.2.3 MLP model
      6.2.4 CNN model
   6.3 Discussion
      6.3.1 LightGBM
      6.3.2 XGBoost
      6.3.3 MLP (Multi-Layer Perceptron)
      6.3.4 CNN (Convolutional Neural Network)
      6.3.5 Dynamic Analysis
      6.3.6 Summary
      6.3.7 Explanation of Results
7 Chapter Five
   7.1 Achievements
   7.2 Limitations
      7.2.1 Lack of datasets
      7.2.2 Degradation Over Time
      7.2.3 Static and Dynamic Analysis Balance
      7.2.4 Feature Engineering
      7.2.5 Resource Intensity
   7.3 Future Improvements
      7.3.1 Enhanced Dataset Diversity
      7.3.2 Regular Model Updates
      7.3.3 Hybrid Analysis Techniques
      7.3.4 Advanced Feature Selection
      7.3.5 Real-Time Detection Capabilities
8 Conclusion

1 Abstract
Malware detection is a field that is constantly in flux. Increasingly organized and intelligent malware
distributors and authors make the detection and prevention of malware infections more difficult now
than it has ever been before, as does the proliferation of easy-to-use malware creation tools and more
complex and dangerous samples. Modern warfare and international cybercrime have also increasingly
involved cyberattacks, targeted distribution attacks and highly networked cyberespionage, resulting in
extremely dangerous malware that was explicitly designed and released by nation state actors for the
purposes of enacting large-scale damage to organizations or infrastructure. Because of this growing and
evolving threat, more advanced tools for malware detection, prevention and recovery are increasingly
necessary to defend computing networks from attack. Machine learning has been posited in the past as a
potential solution to many of the problems that the field currently faces with traditional methodologies.
The ability of algorithms to learn from existing samples and then apply that knowledge to novel samples
is one that holds great potential for this field, especially since many high-profile malware attacks in the
last decade were caused by previously unknown samples that exploited novel vulnerabilities and were not
detected by traditional systems until it was too late. Our work proposes a comprehensive approach to
malware detection that combines both dynamic and static analysis models. This dual-method strategy
is designed to maximize detection accuracy while minimizing the rate of false positives.

2 General introduction
In recent years, Artificial Intelligence (AI) has become an increasingly important tool across many sectors, revolutionizing the way we live and work. Its applications span a wide range of industries, including healthcare, education, agriculture and many other domains.
Cybersecurity is also an integral domain leveraging AI techniques. It is defined as the practice of protecting systems, networks, and programs from digital attacks. These cyberattacks are usually aimed at accessing, changing, or destroying sensitive information; extorting money from users via ransomware; or interrupting normal business processes.
Implementing effective cybersecurity measures is particularly challenging today because there are
more devices than people, and attackers are becoming more innovative.
Artificial intelligence is being increasingly utilised in cyber security for both offensive and defensive
applications. In offensive roles, it is used to predict and mimic attackers’ behaviors, allowing security
teams to proactively address vulnerabilities. On the defensive side, AI tools are employed to monitor
network traffic, detect anomalies, and respond to threats in real-time. The need for AI in cyber security
is particularly crucial in identifying and preventing new, unknown threats, as traditional signature-based
systems may not be effective in recognising evolving attack techniques.
The speed and accuracy of AI algorithms have significantly improved the ability of cyber security
systems to detect and respond to threats in real time. Machine learning algorithms can analyze vast
amounts of data to identify patterns and anomalies that could signal a potential security breach. AI-
powered systems can also automate routine security tasks, freeing up human experts to focus on more
complex and strategic aspects of cyber defense. Furthermore, AI can enhance the predictive capabilities
of security systems, allowing organisations to anticipate and proactively address potential vulnerabilities
before malicious actors exploit them. Overall, the integration of AI technology in cyber security has
proven to be crucial in fortifying organisations against ever-evolving cyber threats.

2.1 Problem statement :


The purpose of this work is to investigate whether integrating machine learning techniques and pattern recognition algorithms into traditional malware detection frameworks and methodologies results in a quantifiable improvement in detection performance. This research project explores the assertion that if machine learning and intelligent pattern recognition techniques were applied more fully to the identification of completely novel ("zero-day") vulnerabilities and malware, specifically in areas that currently rely primarily on behavior analysis and signature analysis alone, they would demonstrate a substantial improvement over existing identification methods and sample detection rates.

2.2 Objective :
The objective of this project is to develop a robust and efficient malware detection system that
integrates both dynamic and static analysis methods. By leveraging machine learning techniques, the
system aims to maximize detection accuracy and minimize false positives. This involves extracting and
analyzing relevant features from static sources, such as opcode sequences, API calls, and metadata, as
well as dynamic sources, including system calls, network traffic, and runtime behaviors. The project will
train a machine learning model on a comprehensive dataset of labeled benign and malicious samples
and rigorously evaluate its performance. By benchmarking the hybrid model against traditional single-
method approaches, the project seeks to demonstrate the enhanced effectiveness and efficiency of the
integrated approach. Ultimately, this work aims to contribute to the field of cybersecurity by providing
a more resilient and adaptive malware detection solution capable of keeping pace with evolving threats.

3 Chapter One: Overview of malware and traditional detection techniques
3.1 Malware :
Malware, or malicious software, encompasses a wide range of harmful programs that infiltrate com-
puter systems with the intent of causing damage, stealing data, or gaining unauthorized access. These
programs often exhibit deceptive and stealthy behavior to avoid detection and removal. They can silently
install themselves by exploiting vulnerabilities, piggybacking on legitimate software, or tricking users into
executing them through social engineering tactics. Once inside a system, malware can perform a variety
of malicious activities, such as logging keystrokes, redirecting web traffic, encrypting files for ransom,
or creating backdoors for further exploitation. Their behavior is characterized by a constant evolution,
adapting to new security measures and employing sophisticated techniques to remain concealed and
maintain persistence within the infected environment. The primary goal of malware is to compromise
the integrity, confidentiality, and availability of the target systems, often for financial gain, espionage, or
disruption.

3.2 Traditional malware detection approaches


Traditional malware detection approaches primarily relied on signature-based and heuristic techniques.

3.2.1 Signature-based Detection :


This method involves identifying known malware by comparing files and processes against a database
of known malware signatures, which are unique strings of data derived from the malware code. When a
match is found, the file is flagged as malicious. This approach is highly effective for detecting previously
identified threats but struggles with new, unknown malware variants, as it requires constant updates to
the signature database to recognize new threats.

3.2.2 Behavioral-based Detection :


Behavioral analysis monitors the behavior of programs in real-time, looking for actions that are typical
of malware, such as unauthorized access to system resources, attempts to disable security features, or
unexpected changes to system settings. By focusing on what programs do rather than their code, this
method can catch malware that evades signature-based detection. While effective against zero-day threats
and polymorphic malware, this approach can be resource-intensive and may also lead to false positives.

3.3 Problems with traditional malware detection approaches :
Traditional signature analysis techniques have the disadvantage of only being able to detect malware with a previously encountered signature, meaning that these detection techniques are limited in their ability to detect the novel types of malware that are released into the wild every day. Behavioral analysis is similarly limited by its knowledge of specific behaviors seen before, and thus can fail to detect malware that uses a new attack vector or exploit type that has never previously been encountered or recorded.

3.4 Machine learning :


Machine learning is a branch of artificial intelligence that focuses on developing algorithms that
enable computers to learn from and make predictions or decisions based on data. Unlike traditional
programming, where explicit instructions are given for every task, machine learning allows systems to
identify patterns and insights from large datasets on their own. By processing vast amounts of data,
these algorithms can detect subtle correlations and trends that might be missed by human analysis or
traditional methods. This ability to automatically improve and adapt from experience makes machine
learning particularly powerful for complex tasks such as image and speech recognition, predictive analytics, and, crucially, cybersecurity. In the context of malware detection, machine learning can analyze various attributes and behaviors of files, distinguishing between normal and malicious activities with high accuracy, even when dealing with new, previously unseen threats.

3.4.1 How can machine learning help in malware detection :


The integration of machine learning techniques into traditional malware detection approaches presents a promising avenue for enhancing detection accuracy beyond the limitations of signature-based and behavioral analysis methods. By leveraging machine learning algorithms, such as neural networks and decision trees, we can potentially uncover novel patterns and correlations in malware samples. This capability is crucial because it enables the identification of malware variants even when their specific signatures have not been previously recorded or encountered. Unlike signature-based detection, which relies on known patterns, machine learning algorithms can generalize from existing data to detect previously unseen threats. Similarly, while behavioral analysis observes the actions of malware during execution, machine learning models can learn from these behaviors and recognize subtle deviations indicative of malicious activity. Therefore, the integration of machine learning holds the promise of significantly improving the accuracy and efficacy of malware detection, making it a vital advancement in cybersecurity defense strategies.

4 Chapter Two: Dataset
4.1 Introduction :
In our work, we focus on malware detection using machine learning techniques applied to PE files,
or Portable Executables, primarily due to their critical role in the Windows operating system and their
widespread use across the Windows ecosystem. PE files serve as the primary format for executables, DLLs,
and device drivers, facilitating the execution of software applications and system processes. However,
their ubiquity also makes them an attractive target for malware distribution and propagation. Malicious
actors frequently disguise malware within PE files, employing various techniques like file obfuscation and
code packing to evade detection. As a result, PE files containing malware can spread undetected, posing
significant security risks to users and organizations alike. In our research, we recognize the importance
of thorough analysis of PE files for identifying and mitigating these threats. By examining the structure
and contents of PE files, we aim to uncover indicators of malicious activity, enabling proactive detection
and mitigation strategies. Leveraging machine learning algorithms trained on PE file features further
enhances our detection capabilities, allowing us to identify malware variants even in the absence of
specific signatures. Therefore, our focus on PE file analysis underscores its critical role in safeguarding
against malware threats and ensuring the security of Windows-based systems.

4.2 Portable Executable file format :


The PE file format describes the predominant executable format for Microsoft Windows operating systems, and includes executables, dynamically-linked libraries (DLLs), and FON font files. The format is currently supported on Intel, AMD and variants of ARM instruction set architectures. The file format is arranged with a number of standard headers (see Fig. 1 for the PE-32 format), followed by one or more sections. Headers include the Common Object File Format (COFF) file header that contains important information such as the type of machine for which the file is intended, the nature of the file (DLL, EXE, OBJ), the number of sections, the number of symbols, etc. The optional header identifies the linker version, the size of the code, the size of initialized and uninitialized data, the address of the entry point, etc. Data directories within the optional header provide pointers to the sections that follow it. This includes tables for exports, imports, resources, exceptions, debug information, certificate information, and relocation tables. As such, it provides a useful summary of the contents of an executable. Finally, the section table outlines the name, offset and size of each section in the PE file. PE sections contain code and initialized data that the Windows loader is to map into executable or readable/writeable memory pages, respectively, as well as imports, exports and resources defined by the file. Each section contains a header that specifies the size and address. An import address table instructs the loader which functions to statically import. A resources section may contain resources such as those required for user interfaces: cursors, fonts, bitmaps, icons, menus, etc. A basic PE file would normally contain a .text code section and one or more data sections (.data, .rdata or .bss). Relocation tables are typically stored in a .reloc section, used by the Windows loader to reassign a base address from the executable's preferred base. A .tls section contains special thread local storage (TLS) structures for storing thread-specific local variables, which has been exploited to redirect the entry point of an executable to first check if a debugger or other analysis tool is being run. Section names are arbitrary from the perspective of the Windows loader, but specific names have been adopted by precedent and are overwhelmingly common. Packers may create new sections; for example, the UPX packer creates UPX1 to house packed data and an empty section UPX0 that reserves an address range for runtime unpacking.
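
Many of the header and section fields described above can be inspected directly with the pefile Python library; the short sketch below is only illustrative (the file name is a placeholder) and is not part of the EMBER extraction pipeline itself.

```python
# Illustrative sketch: inspecting PE headers and sections with the pefile library.
# "sample.exe" is a placeholder path; error handling is omitted for brevity.
import pefile

pe = pefile.PE("sample.exe")

# COFF file header: target machine and number of sections.
print(hex(pe.FILE_HEADER.Machine), pe.FILE_HEADER.NumberOfSections)

# Optional header: entry point address and linker version.
print(hex(pe.OPTIONAL_HEADER.AddressOfEntryPoint),
      pe.OPTIONAL_HEADER.MajorLinkerVersion, pe.OPTIONAL_HEADER.MinorLinkerVersion)

# Section table: name, raw size, and entropy of each section.
for section in pe.sections:
    name = section.Name.rstrip(b"\x00").decode(errors="ignore")
    print(name, section.SizeOfRawData, round(section.get_entropy(), 3))
```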

4.3 Dataset description :
We utilized the EMBER dataset, a publicly available repository released in April 2018, comprising over 1.1 million Portable Executable (PE) files and distinguishing between malware and benign software. This dataset encapsulates a comprehensive array of surface features from each sample, represented in JSON files, with each file embodying the attributes of a sample. The EMBER dataset consists of a collection of JSON lines files, where each line contains a single JSON object. Each object includes the following types of data: the sha256 hash of the original file as a unique identifier; coarse time information (month resolution) that establishes an estimate of when the file was first seen; a label, which may be 0 for benign, 1 for malicious or -1 for unlabeled; and eight groups of raw features that include both parsed values as well as format-agnostic histograms.

File Period of collection Malware Benign Unlabeled Sum
train0 Dec 2006 – Dec 2017 0 50,000 0 50,000
train1 Jan 2018 – Feb 2018 60,000 50,000 60,000 170,000
train2 Mar 2018 – Apr 2018 60,000 50,000 60,000 170,000
train3 May 2018 – Jun 2018 60,000 50,000 60,000 170,000
train4 Jul 2018 – Aug 2018 60,000 50,000 60,000 170,000
train5 Sep 2018 – Oct 2018 60,000 50,000 60,000 170,000
test Nov 2018 – Dec 2018 100,000 100,000 0 200,000
All 400,000 400,000 300,000 1,100,000

Table 1 – Data Collection Summary

4.3.1 Feature extraction :
We extracted eight different groups of features from the PE files.

General file information (general) : The set of features in the general file information group
includes the file size and basic information obtained from the PE header : the virtual size of the file, the
number of imported and exported functions, whether the file has a debug section, thread local storage,
resources, relocations, or a signature, and the number of symbols.

Header information (header) : From the COFF header, we report the timestamp in the header,
the target machine (string) and a list of image characteristics (list of strings). From the optional header,
we provide the target subsystem (string), DLL characteristics (a list of strings), the file magic as a
string (e.g., “PE32”), major and minor image versions, linker versions, system versions and subsystem
versions, and the code, headers and commit sizes. To create model features, string descriptors such as
DLL characteristics, target machine, subsystem, etc. are summarized using the feature hashing trick
prior to training a model, with 10 bins allotted for each noisy indicator vector.

Imported functions (imports) : We parse the import address table and report the imported
functions by library. To create model features for the baseline model, we simply collect the set of unique
libraries and use the hashing trick to sketch the set (256 bins). Similarly, we use the hashing trick (1024
bins) to capture individual functions, representing each as a library:FunctionName string (e.g., kernel32.dll:CreateFileMappingA).
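
As a rough sketch of this hashing trick, the snippet below maps a sample's library:FunctionName strings into a fixed 1,024-bin vector with scikit-learn's FeatureHasher; the import list shown is a made-up example.

```python
# Illustrative sketch: hashing imported library:FunctionName strings into a
# fixed-length vector, as in the hashing trick described above (1,024 bins).
# The import list is a made-up example.
from sklearn.feature_extraction import FeatureHasher

imports = ["kernel32.dll:CreateFileMappingA",
           "kernel32.dll:VirtualAlloc",
           "ws2_32.dll:connect"]

hasher = FeatureHasher(n_features=1024, input_type="string")
vector = hasher.transform([imports]).toarray()[0]   # one 1,024-dimensional row
print(vector.shape)
```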

Exported functions (exports) : The raw features include a list of the exported functions. These
strings are summarized into model features using the hashing trick with 128 bins.

Section information (section) : Properties of each section are provided and include the name,
size, entropy, virtual size, and a list of strings representing section characteristics. The entry point is
specified by name. To convert to model features, we use the hashing trick on (section name, value) pairs
to create vectors containing section size, section entropy, and virtual size (50 bins each). We also use the
hashing trick to capture the characteristics (list of strings) for the entry point.

Format-agnostic features. The EMBER dataset also includes three groups of features that are
format agnostic, in that they do not require parsing of the PE file for extraction : a raw byte histogram,
byte entropy histogram based on work previously published in [26], and string extraction.

Byte histogram (histogram) : The byte histogram contains 256 integer values, representing the counts of each byte value within the file. When generating model features, this byte histogram is normalized to a distribution, since the file size is represented as a feature in the general file information group.

Byte-entropy histogram (byteentropy) : The byte-entropy histogram approximates the joint distribution p(H, X) of entropy H and byte value X. This is done as described in [26], by computing the scalar entropy H for a fixed-length window and pairing it with each byte occurrence within the window. This is repeated as the window slides across the input bytes. In our implementation, we use a window size of 2048 and a step size of 1024 bytes, with 16 × 16 bins that quantize entropy and the byte value. Before training, we normalize these counts to sum to unity.
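
A simplified sketch of this computation (window of 2048 bytes, step of 1024 bytes, 16 × 16 bins) is given below; it follows the description above rather than the exact reference EMBER implementation, so details such as the entropy quantization are illustrative.

```python
# Illustrative sketch: byte-entropy histogram over sliding windows
# (window 2048, step 1024, 16 x 16 bins). Details differ from the reference
# EMBER feature extractor; this only follows the description above.
import numpy as np

def byte_entropy_histogram(data: bytes, window=2048, step=1024):
    hist = np.zeros((16, 16), dtype=np.float64)        # entropy bins x byte-value bins
    for start in range(0, max(len(data) - window + 1, 1), step):
        block = np.frombuffer(data[start:start + window], dtype=np.uint8)
        counts = np.bincount(block >> 4, minlength=16)  # 16 coarse byte-value bins
        if counts.sum() == 0:
            continue
        probs = counts / counts.sum()
        entropy = -np.sum(probs[probs > 0] * np.log2(probs[probs > 0]))  # 0..4 bits
        entropy_bin = min(int(entropy * 4), 15)          # quantize entropy into 16 bins
        hist[entropy_bin] += counts
    hist = hist.flatten()
    return hist / hist.sum() if hist.sum() > 0 else hist  # normalize to sum to unity

features = byte_entropy_histogram(open("sample.exe", "rb").read())  # placeholder path
print(features.shape)  # (256,)
```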

String information (strings) : The dataset includes simple statistics about printable strings (consisting of characters in the range 0x20 to 0x7f, inclusive) that are at least five printable characters long. In particular, reported are the number of strings, their average length, a histogram of the printable characters within those strings, and the entropy of characters across all printable strings. The printable-characters distribution provides distinct information from the byte histogram above since it is derived only from strings containing at least five consecutive printable characters. In addition, the string feature group includes the number of strings that begin with C: (case insensitive) that may indicate a path, the number of occurrences of http:// or https:// (case insensitive) that may indicate a URL, the number of occurrences of HKEY_ that may indicate a registry key, and the number of occurrences of the short string MZ that may indicate the presence of a Windows PE dropper or bundled executables.
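
A minimal sketch of such string statistics is shown below; it approximates the description above (printable runs of at least five characters) and is not the exact EMBER extractor.

```python
# Illustrative sketch: simple printable-string statistics approximating the
# description above. The sample path is a placeholder.
import re

data = open("sample.exe", "rb").read()
strings = re.findall(rb"[\x20-\x7f]{5,}", data)        # printable runs, length >= 5

num_strings = len(strings)
avg_length = sum(len(s) for s in strings) / max(len(strings), 1)
num_paths = sum(1 for s in strings if s.lower().startswith(b"c:"))
num_urls = sum(s.lower().count(b"http://") + s.lower().count(b"https://") for s in strings)
num_registry = sum(s.count(b"HKEY_") for s in strings)
num_mz = sum(s.count(b"MZ") for s in strings)

print(num_strings, round(avg_length, 2), num_paths, num_urls, num_registry, num_mz)
```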

Feature group Number of features
General file information (general) 10
Header information (header) 62
Imported functions (imports) 1,280
Exported functions (exports) 128
Section information (section) 255
Byte histogram (histogram) 256
Byte-entropy histogram (byteentropy) 256
String information (strings) 104
All 2,351

Table 2 – Feature groups and their respective number of features

5 Chapter Three: Model construction
5.1 Introduction :
In this chapter, we delve into the construction and evaluation of our malware detection models,
which harness both static and dynamic analysis techniques. Our approach involves the development
of four models based on static analysis methods, utilizing popular machine learning algorithms such
as XGBoost, LightGBM, MLP (Multi-Layer Perceptron), and CNN (Convolutional Neural Network).
Additionally, we incorporate a model based on dynamic analysis for a comprehensive understanding
of malware behavior. This chapter provides insights into the design, implementation, and performance
evaluation of our malware detection models, shedding light on their effectiveness in identifying and
mitigating diverse malware threats.

5.2 Static models :


5.2.1 XGBoost :
XGBoost, a leading implementation of gradient boosting decision trees, offers a potent solution for
malware detection tasks, particularly suited for our dataset comprising 600,000 labeled training samples, each characterized by 2,353 features. XGBoost's efficacy stems from its ensemble learning approach, where multiple weak learners, typically decision trees, are iteratively trained to collectively make accurate predictions. Through gradient boosting, XGBoost optimizes a loss function by sequentially adding decision trees that focus on instances misclassified by preceding trees, thereby enhancing the model's ability to capture complex relationships within the data. Additionally, XGBoost incorporates regularization techniques such as L1 and L2 regularization to prevent overfitting and promote model robustness in the face
of noisy or irrelevant features commonly encountered in malware analysis. Its efficient parallel processing
capabilities further expedite model training and inference, making it well-suited for handling large-scale
datasets like ours within a reasonable timeframe. These attributes, coupled with its ability to effectively
harness the rich feature space provided by our extensive dataset, position XGBoost as a compelling
choice for malware detection, capable of discerning subtle patterns indicative of malicious behavior with
high accuracy and efficiency.

We trained our model using the following hyperparameters :
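
As an illustration of such a setup, the sketch below trains an XGBoost classifier on EMBER-style feature vectors; the file paths and hyperparameter values are placeholders rather than the exact configuration used in this project.

```python
# Illustrative sketch: training an XGBoost classifier on EMBER-style features.
# Paths and hyperparameter values are placeholders, not the exact settings used.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

X = np.load("ember_features.npy")   # placeholder: (n_samples, n_features) matrix
y = np.load("ember_labels.npy")     # placeholder: 0 = benign, 1 = malicious
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=8,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,          # L2 regularization
    n_jobs=-1,
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print("Validation accuracy:", model.score(X_val, y_val))
```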

5.2.2 LightGBM :
LightGBM, a gradient boosting framework developed by Microsoft, represents another formidable contender for malware detection tasks, offering several advantages over traditional approaches like XGBoost, especially when dealing with large datasets such as ours containing 600,000 labeled training samples, each described by 2,353 features. LightGBM's superiority lies in its efficient implementation of gradient boosting, which utilizes a novel technique called gradient-based one-side sampling (GOSS) to significantly accelerate training speed while retaining high accuracy. GOSS focuses on instances with larger gradients, enabling LightGBM to prioritize informative samples during the training process, thereby reducing computation overhead and improving efficiency. Additionally, LightGBM employs histogram-based decision tree algorithms, which discretize continuous features into discrete bins to construct histograms for efficient computation. This approach minimizes memory consumption and enhances scalability, making LightGBM well-suited for handling large and high-dimensional datasets like ours without compromising performance. Furthermore, LightGBM introduces exclusive feature importance calculation methods that account for the contribution of each feature in the context of its usage frequency and split gain. This feature importance mechanism offers deeper insights into the predictive power of individual features, aiding in feature selection and model interpretation. Overall, LightGBM's efficient training speed, memory scalability, and enhanced feature importance calculation make it a compelling choice for malware detection tasks, especially when dealing with large datasets like ours. Its ability to deliver superior performance while maintaining computational efficiency positions LightGBM as a formidable alternative to traditional approaches like XGBoost, offering promising avenues for advancing malware detection capabilities.

We trained the LightGBM model using the following hyperparameters :
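
A corresponding sketch for LightGBM is shown below, reusing the train/validation split from the XGBoost sketch above; again, the hyperparameter values are placeholders rather than the exact configuration used.

```python
# Illustrative sketch: training a LightGBM classifier on the same feature matrix
# (X_train, X_val, y_train, y_val as prepared in the XGBoost sketch above).
# Hyperparameter values are placeholders.
import lightgbm as lgb

model = lgb.LGBMClassifier(
    n_estimators=1000,
    num_leaves=128,
    learning_rate=0.05,
    colsample_bytree=0.8,     # GOSS can be enabled via boosting_type="goss"
    n_jobs=-1,
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
print("Validation accuracy:", model.score(X_val, y_val))
```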

We also used deep learning algorithms in this work.

5.2.3 MLP :
In our malware detection project, we employed a Multi-Layer Perceptron (MLP) neural network
model trained using the Adam optimizer with 50 epochs, a learning rate of 0.1, and a batch size of
32. MLPs are renowned for their ability to learn complex patterns in data, making them suitable for
our diverse dataset. The Adam optimizer adapts learning rates for each parameter, ensuring efficient
convergence and generalization. With these configurations, our MLP model offers a powerful tool for
detecting malware, leveraging deep learning to uncover subtle relationships within the data.
Model architecture :
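
A sketch of such a model in Keras is given below. It uses the training settings described above (Adam, 50 epochs, learning rate 0.1, batch size 32), while the hidden-layer widths are illustrative choices rather than the exact architecture used.

```python
# Illustrative sketch: a Keras MLP using the training settings described above.
# Hidden-layer widths are illustrative; X_train/X_val/y_train/y_val are the
# feature matrices and labels prepared as in the earlier sketches.
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(512, activation="relu"),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),   # probability that the file is malicious
])

model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.1),
              loss="binary_crossentropy",
              metrics=["accuracy"])

model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=50, batch_size=32)
```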

5.3 Convolutional Neural Network (CNN) for Static Malware Detection
5.3.1 Feature Vector Transformation
To leverage the powerful capabilities of Convolutional Neural Networks (CNNs) for static malware
detection, we transformed the feature vectors into a form suitable for image processing. Each feature
vector, originally consisting of 2381 features, was reshaped into a two-dimensional array with pixel values
ranging from 0 to 255 to mimic the structure of an image. This transformation enabled us to apply
CNN techniques, which are highly effective in capturing spatial hierarchies in data.
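
A hedged sketch of this transformation follows; the padding to a 49 × 49 square and the min–max scaling to the 0–255 range are illustrative choices, since only the general idea is described above.

```python
# Illustrative sketch: reshaping a 2,381-dimensional feature vector into an
# image-like array. Padding to 49x49 and min-max scaling are illustrative choices.
import numpy as np

def vector_to_image(vec, side=49):
    padded = np.zeros(side * side, dtype=np.float32)
    padded[: len(vec)] = vec                               # feature values + zero padding
    lo, hi = padded.min(), padded.max()
    scaled = (padded - lo) / (hi - lo + 1e-9) * 255.0       # scale to the 0..255 range
    return scaled.reshape(side, side, 1)                    # channel-last "image"

image = vector_to_image(np.random.rand(2381))
print(image.shape)  # (49, 49, 1)
```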

5.3.2 CNN Architecture


We designed a sequential CNN model using the Keras library, structured to extract and learn features
from the transformed "image" representations of the malware feature vectors. The architecture of the
CNN model is as follows :
— Convolutional Layers : Three convolutional layers (conv2d, conv2d_1, and conv2d_2) with 32,
64, and 128 filters, respectively, capture spatial features at different levels of granularity.
— Pooling Layers : Three max pooling layers (max_pooling2d, max_pooling2d_1, and max_pooling2d_2)
downsample the feature maps to reduce dimensionality and computation.
— Fully Connected Layer : A dense layer with 128 neurons, followed by a dropout layer to prevent
overfitting.
— Output Layer : A final dense layer with a single neuron to output the probability of the malware
being present.
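
A Keras sketch of this architecture is given below; the filter counts and layer order follow the description above, while kernel sizes, the dropout rate, and the input shape are illustrative assumptions.

```python
# Illustrative sketch: a sequential Keras CNN following the layer description above.
# Kernel sizes, dropout rate, and input shape are illustrative assumptions.
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(49, 49, 1)),                 # image-like feature vector
    keras.layers.Conv2D(32, (3, 3), activation="relu"),    # conv2d
    keras.layers.MaxPooling2D((2, 2)),                     # max_pooling2d
    keras.layers.Conv2D(64, (3, 3), activation="relu"),    # conv2d_1
    keras.layers.MaxPooling2D((2, 2)),                     # max_pooling2d_1
    keras.layers.Conv2D(128, (3, 3), activation="relu"),   # conv2d_2
    keras.layers.MaxPooling2D((2, 2)),                     # max_pooling2d_2
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation="relu"),            # fully connected layer
    keras.layers.Dropout(0.5),                             # dropout against overfitting
    keras.layers.Dense(1, activation="sigmoid"),           # probability of malware
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```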

5.3.3 Training and Evaluation


The model was trained on the transformed dataset, and the performance was evaluated using accuracy
and the false positive rate (FPR). The results obtained from the CNN model were as follows :
— Accuracy : 86%
— False Positive Rate (FPR) : 1%
These results indicate that the CNN model is effective in detecting malware with high accuracy and a
very low rate of false positives, demonstrating the suitability of CNNs for static malware detection tasks.

5.4 Dynamic Analysis Models for Malware Detection
5.4.1 Introduction to Dynamic Analysis
Dynamic analysis involves executing a program in a controlled environment to observe its behavior
during runtime. This approach is particularly useful for malware detection as it allows for the identification of malicious actions that might not be evident through static analysis alone. By monitoring the
interactions of the program with the system, dynamic analysis can reveal patterns and activities that
indicate malicious intent.

5.4.2 API Calls in Executables


API (Application Programming Interface) calls are requests made by a program to the underlying
operating system or libraries to perform specific functions. These calls provide a way for programs
to interact with the system’s hardware and software resources, such as file systems, memory, network
interfaces, and user interfaces. API calls are crucial for executing various tasks within a program and are
especially significant in dynamic malware analysis for the following reasons :
— Accessing System Resources : Programs use API calls to access files, network resources, and other
system components. For instance, CreateFile opens or creates a file, while WriteFile writes data
to a file.
— Process Management : API calls manage processes and threads. Examples include CreateProcess
to start a new process and TerminateProcess to end a process.
— Memory Management : Functions like VirtualAlloc and VirtualFree allocate and free memory.
— Network Communication : API calls such as WSASocket and connect manage network connections.
— Registry Operations : Functions like RegOpenKeyEx and RegSetValueEx read and write to the
Windows registry.
By monitoring API calls, dynamic analysis can detect behaviors indicative of malware, such as creating
or modifying system files, establishing network connections to command-and-control servers, or altering
registry settings.

5.4.3 Dynamic Analysis Using Speakeasy Emulation


Speakeasy Emulator Speakeasy is an advanced emulation framework designed to analyze the behavior of Windows malware. It emulates the execution of binary files in a controlled environment, capturing
detailed information about the system interactions they perform. This includes API calls, file operations,
network communications, and other significant actions.
The key features of Speakeasy include :
— High Fidelity Emulation : Accurately replicates the behavior of Windows operating systems, allowing for precise observation of malware actions.
— API Call Interception : Captures a wide range of API calls made by the malware, providing a
comprehensive view of its behavior.
— Behavioral Analysis : Records detailed logs of system interactions, which can be used to identify
malicious patterns and behaviors.
— Customizable Environment : Allows configuration of the emulated environment to match different
system setups, enhancing the detection of environment-specific malware behaviors.
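
As a rough illustration, the snippet below uses the open-source speakeasy package to emulate a PE file and collect its API-call trace; the sample path is a placeholder and the exact report layout may vary between speakeasy versions.

```python
# Illustrative sketch: emulating a PE file with the speakeasy package and
# collecting its API-call trace. The path is a placeholder and the report
# layout may differ between versions.
import speakeasy

se = speakeasy.Speakeasy()
module = se.load_module("sample.exe")
se.run_module(module)

report = se.get_report()
api_calls = [api.get("api_name")
             for entry in report.get("entry_points", [])
             for api in entry.get("apis", [])]
print(api_calls[:10])
```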

Quo Vadis Dataset The Quo Vadis dataset is a comprehensive collection of malware samples and their
corresponding behavioral reports generated by the Speakeasy emulator. It includes detailed records of API
call sequences and other system interactions performed by various malware samples during emulation.
This dataset is invaluable for training and evaluating dynamic malware detection models, as it provides
a rich source of real-world malware behaviors.

5.4.4 Why a Purely Dynamic Approach is Lacking


While dynamic analysis is a powerful tool for malware detection, it has several limitations that can
hinder its effectiveness :

— Evasion Techniques : Malware authors often implement techniques to detect the presence of an emulation or sandbox environment. Upon detection, the malware may alter its behavior or terminate
its execution, rendering the dynamic analysis ineffective.
— Resource Intensive : Executing and monitoring each sample in a controlled environment requires
significant computational resources and time. This can limit the scalability of dynamic analysis,
especially when dealing with large datasets.
— Limited Visibility : Some malicious actions may not be immediately observable during the execution
period. For example, time-triggered or conditionally executed payloads may remain dormant during
the analysis, leading to incomplete detection.
— Complexity and Coverage : The dynamic analysis environment must accurately replicate a wide
range of system configurations and states to trigger all possible malicious behaviors. Ensuring
comprehensive coverage of all potential execution paths is challenging.
Despite these limitations, dynamic analysis remains a crucial component of a comprehensive malware
detection strategy. When combined with static analysis and other techniques, it can significantly enhance
the detection and understanding of sophisticated malware threats.

5.4.5 Approaches Used


Decision Tree Classifier Decision trees are a simple yet powerful method for classification tasks,
including malware detection. They work by splitting the data into subsets based on the value of input
features, creating a tree-like model of decisions.
— Accuracy : 0.7436

Random Forest Random forests are an ensemble learning method that constructs multiple decision
trees during training and outputs the class that is the mode of the classes of individual trees.
— Accuracy : 0.6856

Multilayer Perceptron (MLP) MLP is a class of feedforward artificial neural network (ANN) that
consists of at least three layers of nodes : an input layer, a hidden layer, and an output layer. MLPs are
capable of modeling complex non-linear relationships.
— Accuracy : 0.7427
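
A sketch of this kind of pipeline is shown below: API-call sequences taken from the emulation reports are hashed into fixed-length vectors and used to fit the three classifiers. The toy data and the hashing-based featurization are illustrative choices, not the exact preprocessing used here.

```python
# Illustrative sketch: hashed API-call features fed to the three classifiers.
# api_sequences would normally come from the Speakeasy emulation reports;
# the toy lists and labels below are placeholders.
from sklearn.feature_extraction import FeatureHasher
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

api_sequences = [["CreateFileA", "WriteFile", "connect"],
                 ["RegOpenKeyExA", "RegSetValueExA"],
                 ["VirtualAlloc", "CreateProcessA", "connect"],
                 ["CreateFileA", "ReadFile"]]
labels = [1, 0, 1, 0]   # toy labels: 1 = malicious, 0 = benign

X = FeatureHasher(n_features=256, input_type="string").transform(api_sequences).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=0)

for clf in (DecisionTreeClassifier(), RandomForestClassifier(), MLPClassifier(max_iter=500)):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))
```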

6 Chapter Four: Discussion
6.1 Introduction :
In this chapter, we delve into a comprehensive discussion of the results obtained from our malware
detection experiments and the underlying factors contributing to these outcomes. We analyze the performance of our models, including XGBoost, LightGBM, MLP, and CNN, as well as our dynamic analysis-based approach. Through a systematic examination of the detection accuracy, false positive rate, and
other relevant metrics, we aim to elucidate the strengths and limitations of each model and approach.
Furthermore, we explore the impact of various factors, such as feature selection, hyperparameter tuning,
and dataset characteristics, on model performance. By providing insights into the rationale behind our
results, we seek to offer a deeper understanding of the effectiveness and efficacy of our malware detection
strategies, facilitating informed decisions for future research and practical applications in cybersecurity.

6.2 Model results :


Next, we present the results obtained from each model. Each model was evaluated using a confusion
matrix, accuracy, precision, recall, F1 score, and false positive rate.
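
These metrics can be computed from a model's predictions as in the following sketch, where the false positive percentage is taken over all predictions as in the figures below; the labels and predictions shown are a toy example.

```python
# Illustrative sketch: computing the evaluation metrics used below.
# y_true are ground-truth labels (0 = benign, 1 = malicious) and y_pred the
# model's predictions; the values shown are a toy example.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision :", precision_score(y_true, y_pred))
print("Recall :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
print("False Positives :", fp)
print("False Positives Percentage :", 100.0 * fp / len(y_true))
```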

6.2.1 XGBoost model :

— Accuracy : 0.9729
— Precision : 0.9742
— Recall : 0.9716
— F1 Score : 0.9729
— False Positives : 2570
— False Positives Percentage : 1.285%

6.2.2 LightGBM model :

— Accuracy : 0.9798
— Precision : 0.9843
— Recall : 0.9752
— F1 Score : 0.9797
— False Positives : 1555
— False Positives Percentage : 0.7775%

6.2.3 MLP model :

— Accuracy : 0.8112
— Precision : 0.9229
— Recall : 0.7608
— F1 Score : 0.8147
— False Positives : 3851
— False Positives Percentage : 1.9255%

6.2.4 CNN model :

— Accuracy : 0.8712
— Precision : 0.9129
— Recall : 0.8008
— F1 Score : 0.8547
— False Positives : 16612
— False Positives Percentage : 1.1655%

6.3 Discussion :
6.3.1 LightGBM
LightGBM demonstrated superior performance among all models, with an accuracy of 97.98%, a precision of 98.43%, and a recall of 97.52%. The F1 score of 97.97% highlights its balanced performance in terms of precision and recall. LightGBM produced 1,555 false positives, representing 0.78% of the total predictions.
The low false positive rate indicates LightGBM’s effectiveness in distinguishing between benign and
malicious samples. The gradient-based one-side sampling and histogram-based decision tree algorithms
used in LightGBM contributed to its high efficiency and accuracy, making it highly suitable for handling
large datasets and high-dimensional feature spaces.

6.3.2 XGBoost
XGBoost also showed strong performance, with an accuracy of 97.29%, a precision of 97.42%, and a recall of 97.15%. The F1 score was 97.29%, indicating a slightly lower performance compared to LightGBM. XGBoost generated 2,570 false positives, which is 1.29% of the total predictions. Although XGBoost
performed well, its higher false positive rate compared to LightGBM can be attributed to its traditional
gradient boosting approach, which may not be as efficient in handling large datasets with numerous
features. Nonetheless, XGBoost’s robustness and scalability still make it a valuable model for malware
detection.

6.3.3 MLP (Multi-Layer Perceptron)
The MLP model showed a significant drop in performance, with an accuracy of 81.12%, a precision of 92.29%, and a recall of 76.08%. The F1 score of 81.47% indicates a substantial imbalance between precision and recall. The high number of false positives (3,851) and the false positive rate of 1.93% highlight the challenges
faced by MLP in this context. The MLP’s difficulty in handling the complex and high-dimensional nature
of the dataset might have led to overfitting and poor generalization, emphasizing the need for further
tuning and optimization.

6.3.4 CNN (Convolutional Neural Network)


The CNN model achieved an accuracy of 87.12%, a precision of 91.29%, and a recall of 80.08%, resulting in an F1 score of 85.47%. The model produced 16,612 false positives, equating to 1.17% of the total predictions. While the CNN demonstrated decent performance, its higher false positive rate and lower accuracy compared to LightGBM and XGBoost suggest that it might struggle with distinguishing intricate patterns in the dataset. The CNN's convolutional layers, typically effective for image data, might not be as
advantageous for tabular data, necessitating alternative architectures or additional feature engineering.

6.3.5 Dynamic Analysis


The application of machine learning (ML) and deep learning (DL) methods to static analysis outperformed their application to dynamic analysis due to several key factors. Static analysis leverages a
comprehensive and stable set of features directly extracted from the binary code, providing a richer
dataset for training models. In contrast, dynamic analysis is susceptible to evasion techniques employed
by malware to avoid detection during execution, leading to incomplete or misleading behavioral data.
Additionally, static analysis is more resource-efficient and scalable, as it does not require the overhead
of running the executable in an emulated environment. This efficiency allows for the processing of larger
datasets and facilitates early detection by analyzing code pre-execution. These advantages collectively
contribute to the superior performance of static analysis when combined with ML and DL methods for
malware detection.

6.3.6 Summary
Our experiments reveal that LightGBM and XGBoost are the most effective models for malware
detection, with LightGBM slightly outperforming XGBoost in terms of accuracy and false positive rate.
The superior performance of these gradient boosting models can be attributed to their ability to handle
high-dimensional data efficiently and their advanced algorithms designed for large datasets. In contrast,
the MLP and CNN models, while useful, require further optimization and tuning to achieve comparable
performance levels. These findings underscore the importance of selecting appropriate models and techniques tailored to the specific characteristics of the dataset and the problem domain, paving the way for
more effective and reliable malware detection systems.

6.3.7 Explanation of Results


The superior performance of LightGBM and XGBoost can be attributed to their advanced gradient
boosting techniques, which are well-suited for handling large, high-dimensional datasets like the EMBER
dataset. These models are designed to iteratively improve their predictions by focusing on the hardest-to-classify samples, resulting in high accuracy and low false positive rates.
On the other hand, the MLP model struggled with the dataset’s complexity and dimensionality,
leading to overfitting and poor generalization. The CNN, while more effective than the MLP, faced
challenges due to its architecture being better suited for image data rather than tabular data. These
findings emphasize the importance of selecting appropriate models tailored to the dataset's characteristics. LightGBM and XGBoost's ability to process large amounts of data efficiently and capture intricate patterns makes them highly effective for malware detection in the context of the EMBER dataset.

7 Chapter Five :
7.1 Achievements :
After conducting this study, we have found that applying machine learning to malware detection is
not only feasible but also highly effective. Our best model, LightGBM, achieved an impressive accuracy
of 97.98% with a false positive rate of less than 1%. These results are comparable to, and in some cases
exceed, the performance of traditional detection approaches.
Additionally, the prediction time of our machine learning models was on par with traditional methods,
demonstrating that these advanced techniques can be implemented without sacrificing speed or efficiency.
This balance of high accuracy, low false positive rates, and efficient prediction times highlights the
practical applicability of machine learning in the field of malware detection.
Overall, our work underscores the potential of machine learning to enhance cybersecurity measures,
providing a powerful tool to combat the ever-evolving threat of malware. By continuing to refine and
develop these models, we can further improve their robustness and adaptability, ensuring the ongoing
protection of systems and data.

7.2 Limitations :
While our study demonstrates promising results in malware detection using machine learning models,
there are several limitations to acknowledge :

7.2.1 Lack of datasets :


The scarcity of datasets for malware detection, particularly of benign files, is a significant challenge.

7.2.2 Degradation Over Time :


Malware constantly evolves, with attackers regularly updating their techniques to evade detection.
This can lead to model degradation over time as new malware forms emerge that were not present in the
training data. Consequently, models may become less effective at detecting newer threats, necessitating
continuous updates and retraining.

7.2.3 Static and Dynamic Analysis Balance :


The majority of our models rely on static analysis features. While dynamic analysis was incorporated,
a more balanced approach might yield better results. Dynamic analysis is more computationally intensive
but can detect behaviors that static analysis misses.

7.2.4 Feature Engineering :


The high-dimensional feature space, while comprehensive, may include irrelevant or redundant fea-
tures that could affect model performance. More refined feature selection and engineering could improve
model efficiency and accuracy.

7.2.5 Resource Intensity :


Training models like XGBoost and LightGBM on such large datasets is computationally intensive
and requires significant resources. This might limit the practical deployment of these models in resource-
constrained environments.

7.3 Future Improvements
To address these limitations and enhance our malware detection approach, several future improvements are proposed.

7.3.1 Enhanced Dataset Diversity :


Incorporating additional datasets from diverse sources, including a more balanced representation of benign files, could improve model robustness and generalizability. Real-world samples from different environments would provide a more comprehensive training base.

7.3.2 Regular Model Updates :


Implementing a system for the continuous updating and retraining of models as new malware samples
are identified will help maintain detection efficacy. This approach will counteract the degradation of
models over time due to evolving malware forms.

7.3.3 Hybrid Analysis Techniques :


Combining static and dynamic analysis in a more integrated manner could leverage the strengths of
both approaches, improving detection rates and reducing false positives.

7.3.4 Advanced Feature Selection :


Implementing more sophisticated feature selection techniques, such as recursive feature elimination
or feature importance scoring, could reduce dimensionality and improve model performance.

7.3.5 Real-Time Detection Capabilities :


Developing models and systems optimized for real-time malware detection would make the solution
more applicable in practical scenarios, enhancing its utility in active cybersecurity defenses.

8 Conclusion :
In conclusion, our study demonstrates the potential of machine learning models, particularly LightGBM and XGBoost, in effectively detecting malware within large and complex datasets. By leveraging advanced gradient boosting techniques, these models achieve high accuracy and low false positive rates, showcasing their suitability for malware detection tasks. However, challenges such as dataset imbalance, model
degradation over time, feature engineering, and computational resource requirements highlight the need
for ongoing research and development.
Future improvements aimed at enhancing dataset diversity, integrating static and dynamic analysis,
refining feature selection, and optimizing model architectures promise to further advance the field of
malware detection. By addressing these areas, we can develop more robust, efficient, and generalizable
models, contributing to more effective cybersecurity measures.
Our work underscores the importance of continuing to evolve and adapt machine learning approaches
to meet the dynamic challenges posed by malware, ensuring the ongoing protection of systems and data
against malicious threats.
