Malware Detection

Contents

1 Abstract
2 General introduction
2.2 Objective
3 Chapter one
3.1 Malware
4 Chapter Two
4.1 Introduction
5 Chapter Three
5.1 Introduction
5.2.1 XGBoost
5.2.2 LightGBM
5.2.3 MLP
5.3 CNN : Convolutional Neural Network (CNN) for Static Malware Detection
5.4.3 Dynamic Analysis Using Speakeasy Emulation
6 Chapter Four
6.1 Introduction
6.3 Discussion
6.3.1 LightGBM
6.3.2 XGBoost
6.3.6 Summary
6.3.7 Explanation of Results
7 Chapter Five
7.1 Achievement
7.2 Limitations
8 Conclusion
1 Abstract
Malware detection is a field that is constantly in flux. Increasingly organized and intelligent malware
distributors and authors make the detection and prevention of malware infections more difficult now
than it has ever been before, as does the proliferation of easy-to-use malware creation tools and more
complex and dangerous samples. Modern warfare and international cybercrime have also increasingly
involved cyberattacks, targeted distribution attacks and highly networked cyberespionage, resulting in
extremely dangerous malware that was explicitly designed and released by nation state actors for the
purposes of enacting large-scale damage to organizations or infrastructure. Because of this growing and
evolving threat, more advanced tools for malware detection, prevention and recovery are increasingly
necessary to defend computing networks from attack. Machine learning has been posited in the past as a
potential solution to many of the problems that the field currently faces with traditional methodologies.
The ability of algorithms to learn from existing samples and then apply that knowledge to novel samples
is one that holds great potential for this field, especially since many high-profile malware attacks in the
last decade were caused by previously unknown samples that exploited novel vulnerabilities and were not
detected by traditional systems until it was too late. Our work proposes a comprehensive approach to
malware detection that combines both dynamic and static analysis models. This dual-method strategy
is designed to maximize detection accuracy while minimizing the rate of false positives.
2 General introduction
In recent years, Artificial Intelligence (AI) has increasingly become an important tool across many sectors, revolutionizing the way we live and work. Its applications span a wide range of industries, including healthcare, education, agriculture, and many other domains.
Cybersecurity is another domain that increasingly leverages AI techniques. It is defined as the practice of protecting systems, networks, and programs from digital attacks. These cyberattacks are usually aimed at accessing, changing, or destroying sensitive information ; extorting money from users via ransomware ; or interrupting normal business processes.
Implementing effective cybersecurity measures is particularly challenging today because there are
more devices than people, and attackers are becoming more innovative.
Artificial intelligence is being increasingly utilised in cyber security for both offensive and defensive
applications. In offensive roles, it is used to predict and mimic attackers’ behaviors, allowing security
teams to proactively address vulnerabilities. On the defensive side, AI tools are employed to monitor
network traffic, detect anomalies, and respond to threats in real-time. The need for AI in cyber security
is particularly crucial in identifying and preventing new, unknown threats, as traditional signature-based
systems may not be effective in recognising evolving attack techniques.
The speed and accuracy of AI algorithms have significantly improved the ability of cyber security
systems to detect and respond to threats in real time. Machine learning algorithms can analyze vast
amounts of data to identify patterns and anomalies that could signal a potential security breach. AI-
powered systems can also automate routine security tasks, freeing up human experts to focus on more
complex and strategic aspects of cyber defense. Furthermore, AI can enhance the predictive capabilities
of security systems, allowing organisations to anticipate and proactively address potential vulnerabilities
before malicious actors exploit them. Overall, the integration of AI technology in cyber security has
proven to be crucial in fortifying organisations against ever-evolving cyber threats.
2.2 Objective :
The objective of this project is to develop a robust and efficient malware detection system that
integrates both dynamic and static analysis methods. By leveraging machine learning techniques, the
system aims to maximize detection accuracy and minimize false positives. This involves extracting and
analyzing relevant features from static sources, such as opcode sequences, API calls, and metadata, as
well as dynamic sources, including system calls, network traffic, and runtime behaviors. The project will
train a machine learning model on a comprehensive dataset of labeled benign and malicious samples
and rigorously evaluate its performance. By benchmarking the hybrid model against traditional single-
method approaches, the project seeks to demonstrate the enhanced effectiveness and efficiency of the
integrated approach. Ultimately, this work aims to contribute to the field of cybersecurity by providing
a more resilient and adaptive malware detection solution capable of keeping pace with evolving threats.
3 Chapter one : Overview of malware and traditional detection techniques
3.1 Malware :
Malware, or malicious software, encompasses a wide range of harmful programs that infiltrate com-
puter systems with the intent of causing damage, stealing data, or gaining unauthorized access. These
programs often exhibit deceptive and stealthy behavior to avoid detection and removal. They can silently
install themselves by exploiting vulnerabilities, piggybacking on legitimate software, or tricking users into
executing them through social engineering tactics. Once inside a system, malware can perform a variety
of malicious activities, such as logging keystrokes, redirecting web traffic, encrypting files for ransom,
or creating backdoors for further exploitation. Their behavior is characterized by a constant evolution,
adapting to new security measures and employing sophisticated techniques to remain concealed and
maintain persistence within the infected environment. The primary goal of malware is to compromise
the integrity, confidentiality, and availability of the target systems, often for financial gain, espionage, or
disruption.
3.3 The problem with traditional malware detection approaches :
Traditional signature-based analysis techniques have the disadvantage of only being able to detect malware whose signature has previously been encountered, meaning that these
detection techniques are limited in their ability to detect all the novel types of malware that are being
released into the wild every day. Behavioral analysis is similarly limited by its knowledge of specific
behaviors it has seen before, and thus can fail to detect malware that has a new attack vector or exploit
type that has never previously been encountered or recorded.
4 Chapter Two : Dataset
4.1 Introduction :
In our work, we focus on malware detection using machine learning techniques applied to PE files,
or Portable Executables, primarily due to their critical role in the Windows operating system and their
widespread use across the Windows ecosystem. PE files serve as the primary format for executables, DLLs,
and device drivers, facilitating the execution of software applications and system processes. However,
their ubiquity also makes them an attractive target for malware distribution and propagation. Malicious
actors frequently disguise malware within PE files, employing various techniques like file obfuscation and
code packing to evade detection. As a result, PE files containing malware can spread undetected, posing
significant security risks to users and organizations alike. In our research, we recognize the importance
of thorough analysis of PE files for identifying and mitigating these threats. By examining the structure
and contents of PE files, we aim to uncover indicators of malicious activity, enabling proactive detection
and mitigation strategies. Leveraging machine learning algorithms trained on PE file features further
enhances our detection capabilities, allowing us to identify malware variants even in the absence of
specific signatures. Therefore, our focus on PE file analysis underscores its critical role in safeguarding
against malware threats and ensuring the security of Windows-based systems.
4.3 Dataset description :
We utilized the EMBER dataset, a publicly available repository first released in April 2018, comprising over 1.1 million Portable Executable (PE) files and distinguishing between malware and benign software. This dataset encapsulates a comprehensive array of surface features from each sample, represented in JSON format. The EMBER dataset consists of a collection of JSON lines files, where each line contains a single JSON object describing one sample. Each object includes the following data : the sha256 hash of the original file as a unique identifier ; coarse time information (month resolution) that establishes an estimate of when the file was first seen ; a label, which may be 0 for benign, 1 for malicious or -1 for unlabeled ; and eight groups of raw features that include both parsed values as well as format-agnostic histograms.
File    Period of collection   Malware   Benign    Unlabeled   Sum
train0  Dec 2006 – Dec 2017    0         50,000    0           50,000
train1  Jan 2018 – Feb 2018    60,000    50,000    60,000      170,000
train2  Mar 2018 – Apr 2018    60,000    50,000    60,000      170,000
train3  May 2018 – Jun 2018    60,000    50,000    60,000      170,000
train4  Jul 2018 – Aug 2018    60,000    50,000    60,000      170,000
train5  Sep 2018 – Oct 2018    60,000    50,000    60,000      170,000
test    Nov 2018 – Dec 2018    100,000   100,000   0           200,000
All                            400,000   400,000   300,000     1,100,000
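To make this record layout concrete, the short Python sketch below reads the first few lines of one of the EMBER JSON lines files and prints the fields described above. The file name and directory layout are assumptions based on the public EMBER release and are purely illustrative.

import json

# Path to one of the EMBER JSON lines files (name assumed from the public release).
path = "ember2018/train_features_0.jsonl"

with open(path, "r") as fh:
    for i, line in enumerate(fh):
        record = json.loads(line)        # one JSON object per line
        print(record["sha256"])          # unique identifier of the sample
        print(record["label"])           # 0 = benign, 1 = malicious, -1 = unlabeled
        # Raw feature groups described in the next section:
        for group in ("general", "header", "imports", "exports",
                      "section", "histogram", "byteentropy", "strings"):
            print(group, type(record.get(group)))
        if i == 2:                       # inspect only the first few records
            break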
4.3.1 Feature extraction :
We extracted eight different groups of features from the PE files.
General file information (general) : The set of features in the general file information group
includes the file size and basic information obtained from the PE header : the virtual size of the file, the
number of imported and exported functions, whether the file has a debug section, thread local storage,
resources, relocations, or a signature, and the number of symbols.
Header information (header) : From the COFF header, we report the timestamp in the header,
the target machine (string) and a list of image characteristics (list of strings). From the optional header,
we provide the target subsystem (string), DLL characteristics (a list of strings), the file magic as a
string (e.g., “PE32”), major and minor image versions, linker versions, system versions and subsystem
versions, and the code, headers and commit sizes. To create model features, string descriptors such as
DLL characteristics, target machine, subsystem, etc. are summarized using the feature hashing trick
prior to training a model, with 10 bins allotted for each noisy indicator vector.
Imported functions (imports) : We parse the import address table and report the imported
functions by library. To create model features for the baseline model, we simply collect the set of unique
libraries and use the hashing trick to sketch the set (256 bins). Similarly, we use the hashing trick (1024
bins) to capture individual functions, by representing each as a library:FunctionName string (e.g., kernel32.dll:CreateFileMappingA).
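As an illustration of this hashing trick, the sketch below uses scikit-learn's FeatureHasher with 256 bins for library names and 1,024 bins for library:FunctionName strings. It mirrors the description above but is a hedged example, not necessarily the exact EMBER implementation.

import numpy as np
from sklearn.feature_extraction import FeatureHasher

# Imports parsed from one PE file, as {library: [function, ...]} (toy example).
imports = {
    "kernel32.dll": ["CreateFileMappingA", "LoadLibraryA"],
    "ws2_32.dll": ["send", "recv"],
}

libraries = [lib.lower() for lib in imports]
pairs = [f"{lib.lower()}:{fn}" for lib, fns in imports.items() for fn in fns]

# Hashing trick: sketch the variable-length string sets into fixed-size vectors.
lib_hasher = FeatureHasher(n_features=256, input_type="string")
fn_hasher = FeatureHasher(n_features=1024, input_type="string")

lib_vec = lib_hasher.transform([libraries]).toarray()[0]   # shape (256,)
fn_vec = fn_hasher.transform([pairs]).toarray()[0]         # shape (1024,)

import_features = np.concatenate([lib_vec, fn_vec])        # 1,280 import features in total
print(import_features.shape)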
Exported functions (exports) : The raw features include a list of the exported functions. These
strings are summarized into model features using the hashing trick with 128 bins.
Section information (section) : Properties of each section are provided and include the name,
size, entropy, virtual size, and a list of strings representing section characteristics. The entry point is
specified by name. To convert to model features, we use the hashing trick on (section name, value) pairs
to create vectors containing section size, section entropy, and virtual size (50 bins each). We also use the
hashing trick to capture the characteristics (list of strings) for the entry point.
Format-agnostic features. The EMBER dataset also includes three groups of features that are
format agnostic, in that they do not require parsing of the PE file for extraction : a raw byte histogram,
byte entropy histogram based on work previously published in [26], and string extraction.
Byte histogram (histogram) : The byte histogram contains 256 integer values, representing the counts of each byte value within the file. When generating model features, this byte histogram is normalized to a distribution, since the file size is represented as a feature in the general file information group.
Byte-entropy histogram (byteentropy) : The byte-entropy histogram approximates the joint
distribution p(H,X) of entropy H and byte value X. This is done as described in [26], by computing the
scalar entropy H for a fixed-length window and pairing it with each byte occurrence within the window.
This is repeated as the window slides across the input bytes. In our implementation, we use a window
size of 2048 and a step size of 1024 bytes, with 16 × 16 bins that quantize entropy and the byte value.
Before training, we normalize these counts to sum to unity.
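A minimal sketch of this computation is given below, assuming a simple quantization scheme; the exact binning used by the EMBER extractor may differ.

import numpy as np

def byte_entropy_histogram(data: bytes, window: int = 2048, step: int = 1024) -> np.ndarray:
    # Illustrative sketch of the byte-entropy histogram described above.
    hist = np.zeros((16, 16), dtype=np.float64)
    buf = np.frombuffer(data, dtype=np.uint8)
    if len(buf) == 0:
        return hist.ravel()
    for start in range(0, max(len(buf) - window, 0) + 1, step):
        w = buf[start:start + window]
        counts = np.bincount(w, minlength=256)
        p = counts[counts > 0] / len(w)
        entropy = -np.sum(p * np.log2(p))            # scalar entropy of the window, in [0, 8]
        e_bin = min(int(entropy * 2), 15)            # quantize entropy into 16 bins (assumed scheme)
        b_bins = np.bincount(w >> 4, minlength=16)   # quantize byte values into 16 bins
        hist[e_bin] += b_bins                        # pair the window entropy with every byte occurrence
    return hist.ravel() / hist.sum()                 # normalize the counts to sum to unity

# "sample.exe" is a placeholder path for any PE file.
features = byte_entropy_histogram(open("sample.exe", "rb").read())   # 256-dimensional vector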
String information (strings) : The dataset includes simple statistics about
printable strings (consisting of characters in range 0x20 to 0x7f, inclusive) that are at least five printable
characters long. In particular, reported are the number of strings, their average length, a histogram of the
printable characters within those strings, and the entropy of characters across all printable strings. The
printable characters distribution provides distinct information from the byte histogram information above
since it is derived only from strings containing at least five consecutive printable characters. In addition,
the string feature group includes the number of strings that begin with C:\ (case insensitive) that may indicate a path, the number of occurrences of http:// or https:// (case insensitive) that may indicate a URL, the number of occurrences of HKEY that may indicate a registry key, and the number of occurrences of the short string MZ that may indicate the presence of an embedded or bundled executable.
Feature group                           Number of features
General file information (general)      10
Header information (header)             62
Imported functions (imports)            1,280
Exported functions (exports)            128
Section information (section)           255
Byte histogram (histogram)              256
Byte-entropy histogram (byteentropy)    256
String information (strings)            104
All                                     2,351
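Before training, the raw JSON objects are vectorized into the fixed-length numeric vectors summarized above. A minimal sketch, assuming the open-source ember Python package distributed with the dataset (function names taken from its public repository):

import ember   # open-source package released alongside the EMBER dataset

data_dir = "ember2018/"                    # directory holding the JSON lines files (assumed layout)

# One-time conversion of the raw JSON features into dense numeric arrays on disk.
ember.create_vectorized_features(data_dir)

# X_train/X_test are NumPy arrays of vectorized features, y_train/y_test the labels.
X_train, y_train, X_test, y_test = ember.read_vectorized_features(data_dir)

# Unlabeled samples carry label -1 and are dropped for supervised training.
labeled = y_train != -1
X_train, y_train = X_train[labeled], y_train[labeled]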
5 Chapter Three : Models construction
5.1 Introduction :
In this chapter, we delve into the construction and evaluation of our malware detection models,
which harness both static and dynamic analysis techniques. Our approach involves the development
of four models based on static analysis methods, utilizing popular machine learning algorithms such
as XGBoost, LightGBM, MLP (Multi-Layer Perceptron), and CNN (Convolutional Neural Network).
Additionally, we incorporate a model based on dynamic analysis for a comprehensive understanding
of malware behavior. This chapter provides insights into the design, implementation, and performance
evaluation of our malware detection models, shedding light on their effectiveness in identifying and
mitigating diverse malware threats.
5.2.1 XGBoost :
We trained the XGBoost model using the following hyperparameters :
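The hyperparameter table itself is not reproduced here. Purely as an illustration, a training run on the vectorized EMBER features could look like the sketch below, in which every hyperparameter value is an assumption rather than the setting actually used.

from xgboost import XGBClassifier

# X_train, y_train, X_test come from the vectorization sketch in Chapter Two.
# All hyperparameter values below are illustrative assumptions.
model = XGBClassifier(
    n_estimators=1000,
    max_depth=12,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="binary:logistic",
    n_jobs=-1,
)
model.fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]   # probability that each test sample is malicious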
5.2.2 LightGBM :
LightGBM, a gradient boosting framework developed by Microsoft, represents another formidable
contender for malware detection tasks, offering several advantages over traditional approaches like XG-
Boost, especially when dealing with large datasets such as ours, containing 600,000 labeled training samples,
each described by 2,353 features. LightGBM’s superiority lies in its efficient implementation of gradient
boosting, which utilizes a novel technique called gradient-based one-side sampling (GOSS) to significantly
accelerate training speed while retaining high accuracy. GOSS focuses on instances with larger gradients,
enabling LightGBM to prioritize informative samples during the training process, thereby reducing com-
putation overhead and improving efficiency. Additionally, LightGBM employs histogram-based decision
tree algorithms, which discretize continuous features into discrete bins to construct histograms for ef-
ficient computation. This approach minimizes memory consumption and enhances scalability, making
LightGBM well-suited for handling large and high-dimensional datasets like ours without compromising
performance. Furthermore, LightGBM introduces exclusive feature importance calculation methods that
account for the contribution of each feature in the context of its usage frequency and split gain. This fea-
ture importance mechanism offers deeper insights into the predictive power of individual features, aiding
in feature selection and model interpretation. Overall, LightGBM’s efficient training speed, memory sca-
lability, and enhanced feature importance calculation make it a compelling choice for malware detection
tasks, especially when dealing with large datasets like ours. Its ability to deliver superior performance
while maintaining computational efficiency positions LightGBM as a formidable alternative to traditional
approaches like XGBoost, offering promising avenues for advancing malware detection capabilities.
We trained the LightGBM model using the following hyperparameters :
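As with XGBoost, the hyperparameter table is not reproduced here; the sketch below is an illustrative training run that enables the GOSS mode discussed above, with assumed values for all other settings.

import lightgbm as lgb

# X_train, y_train, X_test come from the vectorization sketch in Chapter Two.
params = {
    "objective": "binary",
    "boosting_type": "goss",     # gradient-based one-side sampling, as discussed above
    "num_leaves": 1024,          # illustrative assumption
    "learning_rate": 0.05,       # illustrative assumption
    "feature_fraction": 0.8,     # illustrative assumption
    "metric": "auc",
}
train_set = lgb.Dataset(X_train, label=y_train)
booster = lgb.train(params, train_set, num_boost_round=1000)
scores = booster.predict(X_test)             # probability that each test sample is malicious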
5.2.3 MLP :
In our malware detection project, we employed a Multi-Layer Perceptron (MLP) neural network
model trained using the Adam optimizer with 50 epochs, a learning rate of 0.1, and a batch size of
32. MLPs are renowned for their ability to learn complex patterns in data, making them suitable for
our diverse dataset. The Adam optimizer adapts learning rates for each parameter, ensuring efficient
convergence and generalization. With these configurations, our MLP model offers a powerful tool for
detecting malware, leveraging deep learning to uncover subtle relationships within the data.
Model architecture :
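The architecture diagram is not reproduced here. As an illustration only, an MLP consistent with the training settings stated above (Adam optimizer, learning rate 0.1, 50 epochs, batch size 32) could be defined in Keras as follows; the layer sizes are assumptions, not the original design.

from tensorflow import keras
from tensorflow.keras import layers

# X_train, y_train come from the vectorization sketch in Chapter Two.
# Layer sizes are illustrative assumptions; optimizer settings follow the text above.
model = keras.Sequential([
    layers.Input(shape=(X_train.shape[1],)),      # one input per static feature
    layers.Dense(1024, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(512, activation="relu"),
    layers.Dense(1, activation="sigmoid"),        # probability of being malicious
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.1),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1)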
5.3 CNN : Convolutional Neural Network (CNN) for Static Malware Detection
5.3.1 Feature Vector Transformation
To leverage the powerful capabilities of Convolutional Neural Networks (CNNs) for static malware
detection, we transformed the feature vectors into a form suitable for image processing. Each feature
vector, originally consisting of 2381 features, was reshaped into a two-dimensional array with pixel values
ranging from 0 to 255 to mimic the structure of an image. This transformation enabled us to apply
CNN techniques, which are highly effective in capturing spatial hierarchies in data.
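A sketch of one possible transformation follows: because the feature vector does not form a perfect square, it can be zero-padded to 2,401 values and reshaped into a 49 × 49 grayscale image. The padding and per-sample min-max scaling shown here are assumptions, not necessarily the exact scheme used in our pipeline.

import numpy as np

def vector_to_image(vec: np.ndarray, side: int = 49) -> np.ndarray:
    # Zero-pad the static feature vector to side*side values (49 * 49 = 2,401)
    # and rescale it to the 0-255 range so it can be treated as a grayscale image.
    padded = np.zeros(side * side, dtype=np.float64)
    padded[:vec.size] = vec
    lo, hi = padded.min(), padded.max()
    scaled = (padded - lo) / (hi - lo + 1e-9) * 255.0   # per-sample min-max scaling (assumed)
    return scaled.reshape(side, side).astype(np.uint8)

# X_train comes from the vectorization sketch in Chapter Two.
images = np.stack([vector_to_image(v) for v in X_train])   # shape: (n_samples, 49, 49)
images = images[..., np.newaxis] / 255.0                    # add a channel dimension for the CNN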
5.4 Dynamic Analysis Models for Malware Detection
5.4.1 Introduction to Dynamic Analysis
Dynamic analysis involves executing a program in a controlled environment to observe its behavior
during runtime. This approach is particularly useful for malware detection as it allows for the identifi-
cation of malicious actions that might not be evident through static analysis alone. By monitoring the
interactions of the program with the system, dynamic analysis can reveal patterns and activities that
indicate malicious intent.
Quo Vadis Dataset The Quo Vadis dataset is a comprehensive collection of malware samples and their
corresponding behavioral reports generated by the Speakeasy emulator. It includes detailed records of API
call sequences and other system interactions performed by various malware samples during emulation.
This dataset is invaluable for training and evaluating dynamic malware detection models, as it provides
a rich source of real-world malware behaviors.
— Evasion Techniques : Malware authors often implement techniques to detect the presence of an emu-
lation or sandbox environment. Upon detection, the malware may alter its behavior or terminate
its execution, rendering the dynamic analysis ineffective.
— Resource Intensive : Executing and monitoring each sample in a controlled environment requires
significant computational resources and time. This can limit the scalability of dynamic analysis,
especially when dealing with large datasets.
— Limited Visibility : Some malicious actions may not be immediately observable during the execution
period. For example, time-triggered or conditionally executed payloads may remain dormant during
the analysis, leading to incomplete detection.
— Complexity and Coverage : The dynamic analysis environment must accurately replicate a wide
range of system configurations and states to trigger all possible malicious behaviors. Ensuring
comprehensive coverage of all potential execution paths is challenging.
Despite these limitations, dynamic analysis remains a crucial component of a comprehensive malware
detection strategy. When combined with static analysis and other techniques, it can significantly enhance
the detection and understanding of sophisticated malware threats.
Random Forest Random forests are an ensemble learning method that constructs multiple decision
trees during training and outputs the class that is the mode of the classes of individual trees.
— Accuracy : 0.6856
Multilayer Perceptron (MLP) MLP is a class of feedforward artificial neural network (ANN) that
consists of at least three layers of nodes : an input layer, a hidden layer, and an output layer. MLPs are
capable of modeling complex non-linear relationships.
— Accuracy : 0.7427
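The exact feature encoding used for these dynamic models is not detailed above. Purely as a hedged illustration, API call sequences taken from the Speakeasy emulation reports could be hashed into fixed-length vectors and fed to the two classifiers; every name and setting below is an assumption.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import FeatureHasher
from sklearn.neural_network import MLPClassifier

# Toy stand-ins for API call sequences extracted from emulation reports and their labels
# (0 = benign, 1 = malicious); in practice these come from the Quo Vadis behavioral reports.
api_sequences = [
    ["CreateFileA", "WriteFile", "RegSetValueA"],
    ["LoadLibraryA", "GetProcAddress", "VirtualAlloc"],
    ["CreateFileA", "ReadFile", "CloseHandle"],
    ["VirtualAlloc", "WriteProcessMemory", "CreateRemoteThread"],
]
labels = [0, 1, 0, 1]

# Hash the variable-length call sequences into fixed-length count vectors (bin count assumed).
hasher = FeatureHasher(n_features=512, input_type="string")
X_dyn = hasher.transform(api_sequences)

rf = RandomForestClassifier(n_estimators=200, n_jobs=-1).fit(X_dyn, labels)
mlp = MLPClassifier(hidden_layer_sizes=(256,), max_iter=50).fit(X_dyn, labels)
print(rf.predict(X_dyn), mlp.predict(X_dyn))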
6 Chapter Four : Discussion
6.1 Introduction :
In this chapter, we delve into a comprehensive discussion of the results obtained from our malware
detection experiments and the underlying factors contributing to these outcomes. We analyze the perfor-
mance of our models, including XGBoost, LightGBM, MLP, and CNN, as well as our dynamic analysis-
based approach. Through a systematic examination of the detection accuracy, false positive rate, and
other relevant metrics, we aim to elucidate the strengths and limitations of each model and approach.
Furthermore, we explore the impact of various factors, such as feature selection, hyperparameter tuning,
and dataset characteristics, on model performance. By providing insights into the rationale behind our
results, we seek to offer a deeper understanding of the effectiveness and efficacy of our malware detection
strategies, facilitating informed decisions for future research and practical applications in cybersecurity.
6.2.1 XGBoost model :
— Accuracy : 0.972925
— Precision : 0.9742291301077964
— Recall : 0.97155
— F1 Score : 0.9728877206158468
— False Positives : 2570
— False Positives Percentage : 1.285%
6.2.2 LightGBM model :
— Accuracy : 0.97983
— Precision : 0.9843049779966894
— Recall : 0.97521
— F1 Score : 0.9797363820852337
— False Positives : 1555
— False Positives Percentage : 0.7775%
6.2.3 MLP model :
— Accuracy : 0.811165
— Precision : 0.9228797436667668
— Recall : 0.76084
— F1 Score : 0.8147197118751458
— False Positives : 3851
— False Positives Percentage : 1.9255%
6.2.4 CNN model :
— Accuracy : 0.871165
— Precision : 0.9128797436667668
— Recall : 0.80084
— F1 Score : 0.8547197118751458
— False Positives : 16612
— False Positives Percentage : 1.1655%
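For reference, the metrics reported above can be computed from a model's test-set predictions as in the following sketch, assuming scikit-learn, with scores holding the predicted probabilities from one of the earlier training sketches and y_test the true labels.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_pred = (scores >= 0.5).astype(int)            # threshold the predicted probabilities

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
false_positive_pct = 100.0 * fp / len(y_test)   # false positives as a share of all test samples

print(accuracy, precision, recall, f1, fp, false_positive_pct)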
6.3 Discussion :
6.3.1 LightGBM
LightGBM demonstrated the best performance among all models, with an accuracy of 97.98%, a precision of 98.43%, and a recall of 97.52%. The F1 score of 97.97% highlights its balanced performance in terms of precision and recall. LightGBM produced 1,555 false positives, representing 0.78% of the total predictions.
The low false positive rate indicates LightGBM’s effectiveness in distinguishing between benign and
malicious samples. The gradient-based one-side sampling and histogram-based decision tree algorithms
used in LightGBM contributed to its high efficiency and accuracy, making it highly suitable for handling
large datasets and high-dimensional feature spaces.
6.3.2 XGBoost
XGBoost also showed strong performance, with an accuracy of 97.29%, a precision of 97.42%, and a recall of 97.15%. The F1 score was 97.29%, indicating slightly lower performance compared to LightGBM. XGBoost generated 2,570 false positives, which is 1.29% of the total predictions. Although XGBoost
performed well, its higher false positive rate compared to LightGBM can be attributed to its traditional
gradient boosting approach, which may not be as efficient in handling large datasets with numerous
features. Nonetheless, XGBoost’s robustness and scalability still make it a valuable model for malware
detection.
6.3.3 MLP (Multi-Layer Perceptron)
The MLP model showed a significant drop in performance, with an accuracy of 71.12%, a precision of 92.29%, and a recall of 46.08%. The F1 score of 61.47% indicates a substantial imbalance between precision and recall. The high number of false positives (3,851) and the false positive rate of 1.93% highlight the challenges
faced by MLP in this context. The MLP’s difficulty in handling the complex and high-dimensional nature
of the dataset might have led to overfitting and poor generalization, emphasizing the need for further
tuning and optimization.
6.3.6 Summary
Our experiments reveal that LightGBM and XGBoost are the most effective models for malware
detection, with LightGBM slightly outperforming XGBoost in terms of accuracy and false positive rate.
The superior performance of these gradient boosting models can be attributed to their ability to handle
high-dimensional data efficiently and their advanced algorithms designed for large datasets. In contrast,
the MLP and CNN models, while useful, require further optimization and tuning to achieve comparable
performance levels. These findings underscore the importance of selecting appropriate models and tech-
niques tailored to the specific characteristics of the dataset and the problem domain, paving the way for
more effective and reliable malware detection systems.
7 Chapter Five :
7.1 Achievement :
After conducting this study, we have found that applying machine learning to malware detection is
not only feasible but also highly effective. Our best model, LightGBM, achieved an impressive accuracy
of 97.98% with a false positive rate of less than 1%. These results are comparable to, and in some cases
exceed, the performance of traditional detection approaches.
Additionally, the prediction time of our machine learning models was on par with traditional methods,
demonstrating that these advanced techniques can be implemented without sacrificing speed or efficiency.
This balance of high accuracy, low false positive rates, and efficient prediction times highlights the
practical applicability of machine learning in the field of malware detection.
Overall, our work underscores the potential of machine learning to enhance cybersecurity measures,
providing a powerful tool to combat the ever-evolving threat of malware. By continuing to refine and
develop these models, we can further improve their robustness and adaptability, ensuring the ongoing
protection of systems and data.
7.2 Limitations :
While our study demonstrates promising results in malware detection using machine learning models,
there are several limitations to acknowledge :
7.3 Future Improvements
To address these limitations and enhance our malware detection approach
8 Conclusion :
In conclusion, our study demonstrates the potential of machine learning models, particularly LightGBM
and XGBoost, in effectively detecting malware within large and complex datasets. By leveraging advanced
gradient boosting techniques, these models achieve high accuracy and low false positive rates, showca-
sing their suitability for malware detection tasks. However, challenges such as dataset imbalance, model
degradation over time, feature engineering, and computational resource requirements highlight the need
for ongoing research and development.
Future improvements aimed at enhancing dataset diversity, integrating static and dynamic analysis,
refining feature selection, and optimizing model architectures promise to further advance the field of
malware detection. By addressing these areas, we can develop more robust, efficient, and generalizable
models, contributing to more effective cybersecurity measures.
Our work underscores the importance of continuing to evolve and adapt machine learning approaches
to meet the dynamic challenges posed by malware, ensuring the ongoing protection of systems and data
against malicious threats.