Group 7 Report

The project report focuses on the performance evaluation and detection of malware, emphasizing the importance of machine learning techniques in identifying and mitigating cyber threats. It discusses various types of malware, their analysis, and detection methods, while also detailing the implementation of machine learning algorithms to enhance malware detection capabilities. The report concludes that integrating machine learning into cybersecurity practices is crucial for effectively addressing the evolving landscape of cyber threats.

Uploaded by

AZHARUDDIN ALAM

A PROJECT REPORT ON

PERFORMANCE EVALUATION AND DETECTION OF MALWARE

SUBMITTED IN PARTIAL FULFILLMENT FOR AWARD OF DEGREE


OF

BACHELOR OF TECHNOLOGY

IN

COMPUTER SCIENCE AND ENGINEERING


By
AAYUSH KUMAR
(2001320100002)
ABHISHEK SAXENA
(2001320100010)
ANJILA CHOUDHARY
(2001320100026)
AZHARUDDIN ALAM
(2001320100041)

UNDER THE GUIDANCE OF


MR. ASHWINI KUMAR VERMA

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


GREATER NOIDA INSTITUTE OF TECHNOLOGY, GREATER NOIDA

Dr. A.P.J. Abdul Kalam Technical University,


Lucknow
MAY, 2024
BONAFIDE CERTIFICATE

Certified that this project report “Performance Evaluation and Detection of
Malware” is the bonafide work of “Aayush Kumar, Abhishek Saxena, Anjila
Choudhary and Azharuddin Alam”, who carried out the major project work under
the supervision of “Mr. Ashwini Kumar Verma”. Certified further that, to the best of
my knowledge, the work reported herein does not form part of any other report on
the basis of which a degree or award was conferred on an earlier occasion on this or
any other candidate.

Signature Supervisor

ACKNOWLEDGEMENT

I would like to express my profound gratitude to Dr. Sandeep Saxena, Head of
Department of Computer Science & Engineering, and Dr. Dhiraj Gupta, Director
of the Institution, for their contributions to the completion of my project titled
Performance Evaluation and Detection of Malware.

I would like to express my special thanks to our mentor, Mr. Ashwini Kumar Verma,
Assistant Professor in the Department of Computer Science & Engineering, for the
time and effort he provided throughout the semester. His advice and suggestions
were really helpful to me during the project's completion, and I am eternally
grateful to him.

Finally, I would like to thank my parents, who have been sincere and loyal to me at
every stage of my career and were always there to support me in difficult times. I
was able to complete my thesis thanks to their efforts too. My parents are my real
support and my pillars.

I would like to acknowledge that this project was completed entirely by me and not
by someone else.

Aayush Kumar

Abhishek Saxena

Anjila Choudhary

Azharuddin Alam

TABLE OF CONTENTS

BONAFIDE CERTIFICATE
ACKNOWLEDGEMENT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS AND ABBREVIATIONS
ABSTRACT
1. INTRODUCTION
    1.1 Problem Definition
    1.2 Aim and Objective
    1.3 Malware
        1.3.1 Types of Malware
        1.3.2 Types of Malware Analysis
        1.3.3 Types of Malware Detection
    1.4 Structure of Project
2. LITERATURE REVIEW
    2.1 Background Study
        2.1.1 Comparative Analysis of Research Papers
3. SYSTEM ANALYSIS
    3.1 Tools and Technology Requirements
        3.1.1 Python
        3.1.2 Google Colab
        3.1.3 PyCharm
        3.1.4 Pandas
        3.1.5 NumPy
        3.1.6 Scikit-Learn
4. FEASIBILITY STUDY
    4.1 Technical Feasibility
    4.2 Economical Feasibility
    4.3 Operation Feasibility
    4.4 Legal and Regulatory Feasibility
    4.5 Market Feasibility
5. METHODOLOGY
    5.1 Overview of Methodology
    5.2 Implementation
        5.2.1 Preparing the Dataset
        5.2.2 Learning Algorithm and their performance
6. CONCLUSIONS
    6.1 Conclusions
    6.2 Future Scope
    6.3 Applications
REFERENCES

LIST OF FIGURES

Fig 1 - Desktop Operating System Market Share
Fig 2 - Different first generation malware
Fig 3 - Encrypted malware
Fig 4 - Oligomorphic malware
Fig 5 - Polymorphic malware
Fig 6 - Metamorphic Malware
Fig 7 - Traditional Detection system
Fig 8 - Malware normalization
Fig 9 - Python
Fig 10 - Google Colab
Fig 11 - PyCharm
Fig 12 - Pandas
Fig 13 - NumPy
Fig 14 - Scikit-learn
Fig 15 - Flow Chart of the Proposed System
Fig 16 - Features of the dataset
Fig 17 - Organizing dataset
Fig 18 - Feature selection
Fig 19 - Weight of Different Features
Fig 20 - Splitting Dataset Size
Fig 21 - Linear Regression
Fig 22 - XGBoost
Fig 23 - Confusion Matrix of XGBoost
Fig 24 - Decision Tree
Fig 25 - Confusion Matrix of Decision Tree
Fig 26 - Random Forest
Fig 27 - Confusion Matrix of Random Forest
Fig 28 - KNN
Fig 29 - Confusion Matrix of KNN
Fig 30 - Logistic Regression
Fig 31 - Naive Bayes
Fig 32 - Confusion Matrix of Naive Bayes
Fig 33 - Stochastic Gradient Descent
Fig 34 - Confusion Matrix of SGD
Fig 35 - Support Vector Machine
Fig 36 - Confusion Matrix of SVM
Fig 37 - Analysis of different machine learning algorithms
Fig 38 - Graph of Different ML techniques and their Accuracy

LIST OF TABLES

Table 1 - Comparative Analysis of Research Papers
Table 2 - Pseudo Code for Conversion of PE Files to CSV Files

LIST OF ACRONYMS AND ABBREVIATIONS
ML - Machine Learning
AI - Artificial Intelligence
DDoS - Distributed Denial of Service
Fig - Figure
Colab - Colaboratory
IoT - Internet of Things
OS - Operating System
KNN - K-Nearest Neighbors
SGD - Stochastic Gradient Descent
SVM - Support Vector Machine

ABSTRACT
Breakthroughs in internet technology and computer networking have made high-speed
shared internet access widely available. As a result, the number of computer systems
that are potential targets for malware attacks grows every day. Malware, short for
malicious software, is a program specifically engineered to cause damage to a
computer, server, or network. It can also send data to bad actors and allow them to
access your information or systems without your permission. Common types of
malware include adware, fileless malware, viruses, worms, trojans (trojan horses),
bots, ransomware, and spyware. The first step of this project was to gather data on
PE files. PE stands for Portable Executable, the file format used for executables and
several other binary file types in 32-bit and 64-bit versions of Windows operating
systems. PE files are based on the Common Object File Format (COFF) and include
executables (.exe), dynamic link libraries (.dll), and kernel modules (.srv). The next
phase was to convert these PE files to CSV format so that we could extract important
features such as file name, file size, file characteristics, image version information,
OS version information (where available), and MD5 hash. With this enriched dataset
in CSV form, we applied a variety of machine learning algorithms to identify patterns
and derive value from it. Notable algorithms used in this phase include Support
Vector Machine (SVM), K-Nearest Neighbors (KNN), and Random Forest. Each
algorithm was chosen because it is well suited to classifying and analysing the
features extracted from PE files. Through this iterative procedure, we sought a
comprehensive understanding of the features and hazards of the PE files under
investigation, in order to improve practical malware detection and security on
Windows systems. Our strategy is made more effective by ensemble methods that
combine multiple machine learning models, increasing the prediction success of both
random forests and support vector machines. In addition, applying deep learning
methodologies such as convolutional neural networks (CNNs) and recurrent neural
networks (RNNs) to the PE file information can reveal intricate patterns and
connections and markedly improve the malware detection program.

Chapter 1
INTRODUCTION

These days it is no longer possible to imagine a world without the internet, and at the
pace technology is growing, the concept of the cyborg appears ever more real. We are
surrounded by technology: mobile phones, laptops, smartwatches, and transactions
made with digital money rather than cash. The Internet of Things is being developed
extensively, and a great deal of research, time, and energy is going into building
smart houses, smart cities, and the like. All of this is built on data; data is the key.
Technology, however, is not flawless; it has its own flaws and vulnerabilities. This
project focuses only on laptops and desktops, of which laptops account for the larger
share of users. Among the available operating systems, Windows holds the largest
share: NetMarketShare reports that 76.32% of these users run Windows.

Fig. 1: Desktop Operating System Market Share

In our modern digital world, malicious software is one of the most acute problems,
and it is only getting worse. Malware is any software designed to cause damage to a
computer system or network; its uses include, but are not limited to, espionage and
profit. Recently, malware has started to target embedded computation platforms such
as IoT devices, medical equipment, and E&I control systems. This is a result of the
increasing prevalence of such platforms, which are built around embedded computing
devices. Today's malware is complex and highly diverse: numerous strains can modify
not only their code but also their behavior, making them difficult to identify. Given
the enormous variety of malicious software in existence, relying only on
signature-based protection is not enough, and a more comprehensive toolkit is needed.

Malware families share common traits, which can be observed either statically or at
run time across their numerous variants. These patterns can be used to identify
families of malware. Static analysis comprises techniques for examining the content of
malicious files without running them. Dynamic analysis, by contrast, characterizes the
behavior of malicious files while they run, using techniques such as tracking
information flows, tracing function calls, and dynamic binary fingerprinting. Machine
learning techniques can be applied to a combination of static and behavioral artifacts
that represent the constantly changing structure of modern malware. This makes it
feasible to identify more sophisticated malware attacks that are almost impossible to
detect with a signature-based approach. Current machine learning strategies are
better than signature-based ones at detecting newly released malware, also known as
zero-day malware. Furthermore, deep learning algorithms offer even better
opportunities for feature extraction and representation thanks to implicit feature
engineering.
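
As a rough illustration of static analysis, the sketch below reads a few header fields from the raw bytes of a PE file without executing it and writes them out as one CSV row, in the spirit of the PE-to-CSV step described in the abstract. It is not the report's own conversion code. The function names, the chosen columns, and the toy header built by make_toy_pe are assumptions made for this example; only the Python standard library is used.

```python
import csv
import hashlib
import io
import struct

# COFF file header (20 bytes, little-endian): Machine, NumberOfSections,
# TimeDateStamp, PointerToSymbolTable, NumberOfSymbols, SizeOfOptionalHeader,
# Characteristics.
COFF_FORMAT = "<HHIIIHH"


def extract_pe_features(name, data):
    """Return static features read from raw PE bytes; nothing is executed."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing MZ (DOS) header")
    # The 4-byte field at offset 0x3C (e_lfanew) points to the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    coff_start = pe_offset + 4
    if data[pe_offset:coff_start] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    if len(data) < coff_start + struct.calcsize(COFF_FORMAT):
        raise ValueError("truncated COFF header")
    machine, n_sections, timestamp, _, _, _, characteristics = struct.unpack_from(
        COFF_FORMAT, data, coff_start
    )
    return {
        "Name": name,
        "FileSize": len(data),
        "MD5": hashlib.md5(data).hexdigest(),
        "Machine": machine,
        "NumberOfSections": n_sections,
        "TimeDateStamp": timestamp,
        "Characteristics": characteristics,
    }


def features_to_csv(rows):
    """Serialize a list of feature dicts to CSV text, one row per file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def make_toy_pe():
    """Build an 88-byte header-only blob for the demo (hypothetical, not runnable)."""
    dos_header = bytearray(64)
    dos_header[0:2] = b"MZ"
    struct.pack_into("<I", dos_header, 0x3C, 64)  # PE signature follows the DOS header
    coff = struct.pack(COFF_FORMAT, 0x014C, 3, 0, 0, 0, 224, 0x0102)
    return bytes(dos_header) + b"PE\x00\x00" + coff


if __name__ == "__main__":
    row = extract_pe_features("toy.exe", make_toy_pe())
    print(features_to_csv([row]))
```

A real pipeline would read each file from disk and would usually extract many more fields (optional header, section table, imports); the point here is only that such features come from parsing bytes, not from running the sample.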

The author considers machine learning important for dealing with potential cyber
security risks. Once the relevant data has been gathered across your firm, and
millions of data points about your clients, employees, and partners must be compared,
it is evident that you cannot rely on people alone to spot every possible threat.
Machine learning algorithms are useful tools for examining large data sets and
predicting possible future security problems. The author therefore highlights that
developing a suitable ML tool would be a beneficial implementation, and that it could
serve the purpose of identifying phishing websites.

Given the current rapid changes in the field of cyber threats, machine learning could
become a significant part of proper preventive measures. When faced with the large
amount of data generated by network traffic, system logs, and user activities, one can
rely on automated systems that make timely use of the available information.

The ML tool mentioned above will help your security engineers protect your clients
and firm from cyber threats that occur every day and cannot wait to be addressed. It
will also identify phishing websites that are new to cyberspace and provide
information about any newly appearing addresses.

Machine learning systems can be trained on a large body of data consisting of
legitimate as well as phishing websites to learn the features and patterns that define
each. These can include characteristics such as the URL, the content of the HTML
page, surrounding text, HTML page metadata, and other features. Such data can then
be used to train machine learning models with both supervised learning, where each
example is labeled as phishing or not, and unsupervised learning, where the algorithm
finds patterns in the data without first being shown the classes.

Machine learning models for phishing detection can, for example, use natural
language processing tools to determine whether a website contains suspicious phrases
such as "your account has been suspended", asks for private information, requests
that sensitive data be sent by email, has poor grammar, or shows other warning signs.
Additional features, such as URL age, the presence of an SSL certificate, and the
reputation score of the website, can also be used.
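
The kind of features described above can be sketched in a few lines. The example below is illustrative only: the phrase list, the feature names, and the function phishing_features are assumptions made for this sketch, and the HTTPS check is a crude stand-in for a real certificate check. URL age and reputation score need external lookups and are left out.

```python
from urllib.parse import urlparse

# Hypothetical phrase list; a real system would learn or curate a much larger one.
SUSPICIOUS_PHRASES = (
    "your account has been suspended",
    "verify your password",
    "confirm your identity",
)


def phishing_features(url, page_text):
    """Turn a URL and its visible page text into simple numeric/boolean features."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    host_is_ip = host.replace(".", "").isdigit()  # e.g. 192.168.10.5
    text = page_text.lower()
    return {
        # Rough proxy for "has an SSL certificate": the URL scheme is https.
        "uses_https": parsed.scheme == "https",
        "url_length": len(url),
        "host_is_ip": host_is_ip,
        # Labels before the registered name, e.g. "www" in www.example.com.
        "subdomain_count": 0 if host_is_ip else max(host.count(".") - 1, 0),
        "has_at_symbol": "@" in url,
        "suspicious_phrase_hits": sum(phrase in text for phrase in SUSPICIOUS_PHRASES),
    }


if __name__ == "__main__":
    print(phishing_features(
        "http://192.168.10.5/login",
        "Your account has been suspended. Verify your password now.",
    ))
```

Each dictionary produced this way could serve as one row of a training set, with a phishing or legitimate label attached for supervised learning.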

Moreover, deep learning, a subfield of machine learning, is now being used in phishing
website detection. Examples of deep learning architectures are convolutional neural
networks and recurrent neural networks, which excel at automatically finding patterns
and extracting better features than those chosen manually by experts.

The benefits of employing machine learning in phishing website detection are many.
First, it allows the analysis of a large number of websites, which would otherwise be
unmanageable. Second, the approach is more resistant to changes in technology,
because machine learning models can be retrained to detect such changes. Third,
ensemble models, in which multiple machine learning models are combined, can take
advantage of many features, making them more powerful and robust than an
individual classifier.
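
A minimal sketch of the ensemble idea, assuming the simplest combination rule (majority voting), is given below. The three rule-based "classifiers", their feature names, and the length threshold of 75 are hypothetical stand-ins chosen for this example; in practice each voter would be a trained model.

```python
from collections import Counter


def majority_vote(classifiers, sample):
    """Ask every classifier for a label and return the most common one plus all votes."""
    votes = [classify(sample) for classify in classifiers]
    label, _ = Counter(votes).most_common(1)[0]
    return label, votes


# Three toy classifiers, each looking at a different feature of the sample.
def rule_no_https(sample):
    return "legitimate" if sample["uses_https"] else "phishing"


def rule_long_url(sample):
    return "phishing" if sample["url_length"] > 75 else "legitimate"


def rule_phrases(sample):
    return "phishing" if sample["suspicious_phrase_hits"] > 0 else "legitimate"


if __name__ == "__main__":
    sample = {"uses_https": False, "url_length": 40, "suspicious_phrase_hits": 2}
    label, votes = majority_vote([rule_no_https, rule_long_url, rule_phrases], sample)
    print(label, votes)
```

Using an odd number of voters avoids ties for a two-class problem. Other ensemble schemes, such as weighted voting or stacking, follow the same pattern of combining several individual predictions.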

However, it has to be stated that ML is not a cure-all and should be used alongside
other security measures, such as user education, secure coding practices, and regular
software updates. Moreover, according to Goodfellow et al., ML models are
vulnerable to adversarial attacks, in which inputs are carefully designed by malicious
actors to deceive the model. Regular monitoring and retraining of the model may
therefore be necessary. In conclusion, employing machine learning in cybersecurity is
essential given the complex and ever-changing landscape of modern threats. Phishing
detection is only one example of how these powerful techniques can help in the fight
against cyber threats. It also shows that cybersecurity should be approached from
multiple sides and implemented at several levels.

1.1 Problem Definition

As demand for the internet increases, so does the risk of cyberattacks. Most
organizations and individuals experience problems with malware files on their
systems, and protecting important data on personal computer systems is very
difficult. Several techniques are used to detect malware present in computer systems.
However, some traditional detection techniques are not effective, so malware that
takes different forms goes undetected. There is therefore a need for malware
detectors that can detect malware more easily and reliably. Machine learning
algorithms can detect malware present in a system and have provided better results in
this area, and they are likely to be used increasingly for this purpose in future. Many
kinds of machine learning algorithms are used for malware detection; this research
considers only a few of the most relevant. Two important examples are the
convolutional neural network (CNN) and the recurrent neural network (RNN). CNNs
and RNNs are deep learning models that learn complex patterns and features from
data, which makes them useful for detecting variants of malware. Other important
machine learning algorithms are decision trees and random forests, which learn from
data and make decisions based on what they have learned.

The proposed research involves running each of the specified algorithms on the
required data to determine how efficiently it can detect malware. The performance
of each model will be measured using metrics such as accuracy, precision, recall, and
F1-score. The integrity and validity of the results will be assured by techniques such
as cross-validation and stratified sampling. Data collection will be performed at the
outset of the research and will involve acquiring a set of benign files and a set of
malware samples. Preprocessing is the next phase and will include data cleaning,
normalization, and any other techniques required to prepare the data, including
encoding the data for the tasks that follow. Feature extraction is the third key step:
to achieve the anticipated effect, it is necessary to identify the features that best
differentiate malware from benign files. Static analysis, which extracts features from
headers, sections, and code segments, will be employed alongside dynamic analysis,
which tests files in a controlled environment. Both approaches will be used to ensure
that the most effective features are chosen. The next step will be to train the
machine learning algorithms and test how accurately they operate. Hyperparameter
tuning, regularization, and ensemble methods will be integral to the training process.
Finally, performance will be assessed on the held-out test portion of the data set. In
conclusion, the report will compare the machine learning models applied,
underscoring the benefits and shortfalls of each type.
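
The evaluation metrics and the stratified sampling named above can be written out directly. The sketch below is a plain-Python illustration rather than the report's evaluation code. The function names are chosen for this example, label 1 is assumed to mean malware and 0 benign, and a library such as scikit-learn provides equivalent, better-tested routines.

```python
import random
from collections import defaultdict


def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1, with 0.0 returned for empty denominators."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0  # flagged files that are malware
    recall = tp / (tp + fn) if tp + fn else 0.0     # malware files that were flagged
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds so each fold keeps the class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


if __name__ == "__main__":
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
    print(classification_metrics(y_true, y_pred))
    print(stratified_folds(y_true, k=2))
```

In cross-validation, each fold in turn is held out for testing while the model is trained on the remaining folds, and the metrics are averaged over the k runs.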

I have read the abstract and find the research interesting, as it aims to improve
existing malware detection systems by employing different machine learning
classifiers. The findings appear to be important because they will provide more
insight and enable readers to protect themselves and their businesses from the
malware threat. I think the research has good prospects of success.

1.2 Aim and Objective

The aim of this research is to detect malware and evaluate detection performance by
applying different machine learning algorithms.

The objective of the project is to analyze the importance of performance evaluation
for malware detection systems. This is accomplished by exploring how performance
metrics reflect the efficacy of malware detection algorithms. This report also aims to:

• Evaluate the diverse performance evaluation methods employed in determining
  the efficiency of malware detection systems.
• Investigate the impact of performance requirements, such as the detection rate,
  on malware detection systems.
• Determine the challenges and limitations experienced when conducting
  performance evaluation in the evolving landscape of malware threats.

1.3 Malware

Advances in internet technology and computer networking have made high-speed
shared internet access commonplace. As a result of this development, the number of
computer systems prone to malware attacks is increasing daily[1][2]. The term
malware is an abbreviation of malicious software. Researchers define it as any
software specifically designed to disrupt a computer, server, client, or computer
network, steal private information, gain unauthorized access to information or
systems, deprive users of access to information, or otherwise compromise the user's
security and privacy without their knowledge[3]. Researchers normally classify
malware into one or more sub-types. Malware is regarded as harmful to all individuals
and businesses that use the internet because of the various problems it causes. The
amount of newly produced malware on the internet continues to escalate, while
antivirus companies work strenuously to limit this trend so that the majority of
computer users can be free from these threats[4][5]. Nevertheless, over the past
couple of years malware has become more sophisticated and, as a result, harder to
detect. Consequently, newer and more effective ways to detect and avert these
attacks must be developed. Malware attacks the privacy, reliability, and accessibility
of a computer system's data. This has led both academics and industry practitioners to
move from outdated static detection techniques to more dynamic and sophisticated
methods that rely on accumulated malware behaviour to monitor systems constantly
for malware attacks[6][7][8].

Malware is created by various actors with different motivations. Cybercriminals
create it to make money; such malware is designed to steal banking details, credit
card numbers, and passcodes, or to demand a ransom after illegally locking data.
Malware can also be developed by nation-states waging cyber warfare, in order to
obtain essential information about critical infrastructure, government systems, or
state assets. Hacktivists design malware to advance their ideological or political goals
and to gain access to data about their opponents or any organization they disagree
with. Finally, malware can be created by company insiders who hold a grudge and
want to take revenge on a previous employer. The potential implications of allowing
my system to be infected can be categorized as individual and organizational. As an
individual, my personal data is unlikely to remain secure once malware is on the
computer. If I run a home business, my bank account details, credit card numbers, and
program details can be stolen. Both ordinary and sensitive secrets may be exposed
and exploited for criminal gain. My system at the workplace can also be at risk: if
malware infects the system at work, I could be held responsible for leaking details of
our projects and damaging the company's reputation, and I am likely to be dismissed
from the job. If my data is lost, my documents and important papers can be deleted,
corrupted, or left in disorder. My PC can be locked, and I will need to spend a
considerable amount on experts to try to restore the saved data. Alternatively, I can
lose projects that have already been completed and saved for an important future
event.

To protect a computer from evolving malware threats, a multi-layered approach is
needed to reduce the risks associated with these dangerous programs. First, security
measures aimed at protecting the system from the influx of any malware should be in
place. These measures can include updating software and operating systems regularly,
using commercial or freeware anti-virus and anti-malware software, implementing
firewalls and intrusion prevention or detection systems, and raising cybersecurity
awareness among users. Second, regular backup of files and well-developed attack
mitigation and recovery plans should be made. Third, the field of malware analysis
and cybersecurity is constantly advancing, meaning that the research into malware
detection and other prevention mechanisms is needed to improve protection.

Such research is especially needed because of the decreasing efficiency of traditional
signature-based approaches. Such approaches can only identify malware by
comparing it to known database entries. The authors of malware use various
techniques to avoid such comparisons, including obfuscation, polymorphism, and
metamorphism, and researchers, as well as cybersecurity specialists, should come up
with more sophisticated and dynamic ways of dealing with emerging threats.
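
The weakness of exact signature matching can be illustrated with a minimal sketch. It
assumes, purely for illustration, that a signature is the SHA-256 digest of a whole
sample; real products may also use byte patterns. The function names and the
placeholder bytes are hypothetical and not taken from any tool used in this report.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hex."""
    return hashlib.sha256(data).hexdigest()


def build_signature_db(known_samples):
    """Build a set of digests that plays the role of the signature database."""
    return {sha256_hex(sample) for sample in known_samples}


def is_known_malware(data: bytes, signature_db) -> bool:
    """Flag data only if its digest matches a database entry exactly."""
    return sha256_hex(data) in signature_db


# Harmless placeholder bytes standing in for a previously catalogued sample.
known_sample = b"placeholder bytes for a catalogued sample"
signature_db = build_signature_db([known_sample])

# A trivially modified copy: one padding byte appended.
variant = known_sample + b"\x00"

print(is_known_malware(known_sample, signature_db))  # True
print(is_known_malware(variant, signature_db))       # False
```

The catalogued sample is found, while a copy that differs by a single byte is missed,
which is the gap that obfuscation, polymorphism, and metamorphism exploit.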

One direction where researchers can move is to leverage the powers of machine
learning for malware detection purposes. Machine learning involves programming a
computer to draw conclusions from data by identifying patterns or features based on
training with large datasets. This data-driven pattern recognition can bring several
benefits, including adaptability, generalization, automatic feature selection, and
scalability. Several common machine learning methods can be used for malware
detection tasks, such as decision trees, random forests, support vector machines, and
neural networks. In addition, unsupervised learning methods like clustering or
anomaly detection can be used to identify different malware types, whereas deep
learning architectures like convolutional or recurrent neural networks can serve as a
more sophisticated and more powerful ML tool. Using machine learning to detect
malware means that programs can be analyzed in terms of static features, such as
their binary code, headers, and other metadata, or in terms of dynamic characteristics
observed at run time.
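
As a rough illustration of what static features can look like, the sketch below turns
raw bytes into a few numeric values that could be fed to any of the classifiers listed
above. The choice of features (size, byte entropy, share of printable bytes) and the
function names are assumptions made for this example, not the feature set used in
this project.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def static_features(data: bytes) -> dict:
    """Compute a small, illustrative static feature vector for a byte string."""
    printable = sum(32 <= b < 127 for b in data)
    return {
        "size": len(data),
        "entropy": byte_entropy(data),
        "printable_ratio": printable / len(data) if data else 0.0,
    }


# Repetitive text has low entropy; evenly spread byte values have high entropy.
print(static_features(b"AAAAAAAA"))
print(static_features(bytes(range(256))))
```

High entropy is often associated with packed or encrypted content, which is why it
is a common candidate feature, although on its own it does not indicate malice.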

Although machine learning methods can offer solutions to the current malware threat,
it is important to note that they are not a silver bullet. They provide beneficial
results only when used as part of a comprehensive approach that combines them with
other security measures and best practices. The efficiency of machine learning
models depends heavily on the diversity and quality of the training samples, as well
as on the definition and proper application of valuable features. Furthermore,
adversarial machine learning is becoming a considerable problem: malware authors
will continue to construct samples that machine learning methods cannot recognize
for quite a long time. In conclusion, malware remains a major threat of the current
century. With the growing number of devices people use and their
interconnectedness, the risk of malware infection will only grow. Thus, high-quality
security measures and incident response plans have to be implemented together with
user awareness and R&D activities focused on new techniques, including machine
learning methods, that reduce the risk and efficiently prevent and fight malware
attacks.

1.4.1 Types Of Malware

There are two types of malware:

A. First Generation Malware

B. Second Generation Malware

A. First Generation Malware

The first generation of malware is static malware. It is classified based on attack
methods: viruses, worms, and trojans. A virus spreads by attaching itself to normal
files in the user’s applications and propagates through human actions. Worms, on the
other hand, also replicate themselves but exploit network vulnerabilities to spread.
Meanwhile, trojans trick the user by disguising themselves as useful software. More
types of malware fall under this classification, such as rootkits, which gain
unauthorized administrative access, hide their presence, and hinder detection by
altering system behaviour. Other variants include spyware, which gathers information
about the user without permission, crimeware, which is intended to facilitate online
crime, and adware, which delivers advertising material without permission. Despite
their different purposes, these types of malware share similar aspects in their goals
and design.

Fig. 2: Different First Generation Malware


Trojan Horse

A trojan is a kind of malicious software that pretends to be safe and useful,
imitating legitimate software to trick a user. However, when the user starts the
application, it releases malicious code that may ruin the files and applications
installed on the computer. Moreover, a trojan can steal sensitive information, such
as passwords or personal data. While computer viruses and worms replicate, trojans
do not spread on their own: they require the user’s help to get started, often in the
form of a deceptive download. Trojans are one of the most harmful and dangerous
types of malware, as they are often the source of serious and irreparable harm.
Trojans are often not detected until the damage is done[9]. Because they are usually
recognized only after they have reached and damaged a system, it is very difficult to
prevent them from entering a system or to eliminate them in time. This late discovery
is closely connected with another noticeable characteristic: trojans can cause
disproportionate harm. Since they are able to get past the first security measures and
remain undetected for a long time, their effects on the attacked systems and their
users are serious. Hence, a trojan deserves to be labelled a major threat to computers.

Virus

A virus is a self-replicating type of malware created to change the functionality of a
computer. It infects a software product and copies itself into other useful
application programs, which is how it spreads; a trojan horse, by contrast, does not
replicate. Depending on its intensity, it can cause tremendous harm to the computer
system. It can lead to unexpected data transfers, irrelevant information, and
discoloration or blurring of the screen. When a virus begins a Denial of Service
attack, it launches an unmanageable number of requests to the point where the system
shuts down and can no longer be accessed normally. Viruses can corrupt computer
files and spread to other programs. A virus may spread to more than one system on
a network and may reach all the systems and data drives the infected computer can
access. The data may be corrupted or deleted indiscriminately. Viruses can also
bypass system protections to gain access to sensitive records. More severely, a virus
can open access to the system in different ways, which hackers use to take
advantage of the compromised system. Viruses are also dangerous in that they can
replicate across networks, which enables them to spread widely without being
found. To this end, we must build reliable and secure computers and networks,
install up-to-date virus checkers, and update software without fail.

Adware

Adware is a relatively low-danger type of malware whose impact on the infected
device is limited. Its main task is to show advertising to the user. Unlike more
dangerous malware such as viruses or worms, which are designed to damage the system
or misuse the infected device, adware collects information from it, such as the
history of searches and visited webpages. It then uses this information to display
advertisements in which the user might be interested. That is why ads sometimes
appear on one’s screen that are connected to a product the person has recently
searched for or paid attention to. Such software is not as dangerous as many other
types of malware, since the information collected mostly reflects the user’s interests
and is not likely to be highly personal. However, it is still rather annoying: ads and
similar unwanted content appear automatically and get in the way of the user’s work.

Furthermore, the device may be under some threat since the collection of information
is carried out without the awareness and permission of the user, meaning that no one
can tell how carefully and where it is used. Finally, if the collected data grows too
large, it accumulates on the infected device, overloading it with unnecessary
information and reducing its speed. At first glance, the impact of adware is not
really dangerous, although it can be an annoying and disturbing problem that causes
inconvenience during work. The key factor that makes it dangerous is that it gathers
information about the user, not always appropriately, with no control over how that
information is used. Unlike other types of malware, it is not destructive, yet it still
affects the device’s performance[10].

Spyware

Spyware is a kind of application that executes without the user’s approval and often
without the user knowing that it is running. Spyware is malicious software that
installs itself on the target system. It is used to gather and track information about
a person and their browsing history. What it records varies and can include, but is
not restricted to, the types of websites the user visits, statistics on their computer
use, and even a copy of their transactions. Generally, spyware is bundled with
freeware. This unsuspected distribution has made it common for users to unknowingly
install spyware along with legitimate free software applications. Spyware functions
without the user’s awareness, collecting data that includes, but is not limited to,
what the person types on their keyboard, the websites they frequent, and their
private information. The information can be sent back to a third party that uses the
data for purposes including advertising, identity theft, or unlawful access to
information. Because of its ability to function without the user knowing, spyware is
a very common issue for people surfing the net. It is occasionally referred to as a
rootkit, both because of the way it is distributed with freeware and because it can
hide its traces on the user’s system without the user knowing it is there[11].
Spyware differs from other kinds of malware in that it does not damage or disrupt
the user’s computer system. Instead, spyware’s sole operation is surveillance that
allows a third party to spy on the host. This feature makes spyware one of the most
dangerous kinds of malware, as it can easily stay on a user’s computer system for an
extended period of time without the user knowing that the information on their
computer is being harvested. To avoid spyware, it is imperative that users combine
reliable anti-spyware software with general caution when downloading freeware
applications from unknown websites.

Worm

A worm is a form of malware that does not attach itself to any additional
software, a characteristic that sets it apart from the virus. Viruses require a host
program onto which they install themselves in order to spread. Worms, however, spread
without requiring host software to attach to. They move independently and quickly
across systems and networks by taking advantage of vulnerabilities in systems,
networks, and software. Typically, a worm finds an area of vulnerability and moves
through it to infect the target. Examples include open ports and poor password
practices that allow the worm to infect its target. Once inside the system, the worm
uses strategies to migrate to another system, such as sending a copy of itself to
another computer through an email, an instant message, or across network shares.
Alternatively, worms exploit software vulnerabilities automatically to spread
infections rather than relying on a user’s decision. This ability to find and exploit
vulnerabilities without user involvement is what allows a worm to trigger a highly
damaging attack. The threat worms pose to information systems and networks lies in
their capacity to corrupt system configuration, degrade network speed, and overload
network interfaces with traffic, which can develop into a DDoS condition. To curb
worms, a system should be highly secure, with regular software updates and a
well-managed network able to detect and block such attacks.
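
The effect of patching on worm-style spread can be sketched with a small graph
model. This is an abstract simulation only: the network layout, the host names, and
the rule that an infection passes to every vulnerable neighbour are assumptions made
for illustration, and no real propagation mechanism is implemented.

```python
from collections import deque


def simulate_spread(network, vulnerable, start):
    """Breadth-first model of spread; returns host -> hop count when reached."""
    if start not in vulnerable:
        return {}
    reached = {start: 0}
    queue = deque([start])
    while queue:
        host = queue.popleft()
        for neighbour in network.get(host, []):
            # Only unpatched hosts that are not yet reached can be affected.
            if neighbour in vulnerable and neighbour not in reached:
                reached[neighbour] = reached[host] + 1
                queue.append(neighbour)
    return reached


# Hypothetical five-host network given as an adjacency list.
network = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

# Host C is patched: the spread still reaches E through B and D.
print(simulate_spread(network, {"A", "B", "D", "E"}, "A"))

# Host D is patched instead: E is shielded because D is its only link.
print(simulate_spread(network, {"A", "B", "C", "E"}, "A"))
```

In this toy model, patching the single host that connects two parts of the network
protects everything behind it, which is one reason regular updates matter.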

Bot

A bot, also referred to as a web robot, is application software designed to perform
automated tasks over the internet. Malicious bots are a type of malware that gives an
operator access to the infected computer system. Unlike viruses or worms, which are
standalone pieces of software, bots operate as part of a larger network of infected
computers called a botnet. A bot is typically propagated through backdoors that are
installed on a victim computer by a virus or a worm. These backdoors allow the
operator to use a central control server for adding and controlling bots. Once
installed on a system, bots are capable of sending spam emails, conducting
Distributed Denial of Service attacks, stealing valuable data, and spreading other
malware. Bots are often employed by criminals to orchestrate large-scale operations,
using compromised systems for financial gain and other malevolent purposes. Because
they operate fully automatically in the background, bots often go unnoticed by the
owners of infected machines for long periods. Protecting against bot infection
requires robust cybersecurity measures, such as antivirus software, firewalls,
intrusion detection systems, and timely software updates. Additionally, user
education and avoiding suspicious links and email attachments are imperative.

Ransomware

Ransomware is a category of malware that aims to extort payment from its victims by
either encrypting the files on their computer or locking them out entirely. This
malicious software makes the victim’s files unusable by encrypting them and turning
them into what is effectively gibberish. Victims must usually pay a ransom, often in
the form of cryptocurrency, so that attackers will provide them with the decryption
key needed to unlock their files. The key is needed to decrypt the victim’s valuable
files; without it, their access remains permanently restricted. Ransomware infections
are mostly distributed by trojans, which deceive users into downloading and opening
ransomware on their systems. After that, the ransomware quickly spreads and encrypts
as many files as possible on the computer or the network it is connected to.
Ransomware attacks provide a big payday to some cybercriminals, who can demand large
sums from victims in return for decrypting their data. There is no guarantee that
payment will lead to the return of the ransomed files, and victims can end up losing
twice: the Bitcoin or other cryptocurrency they handed over, as well as their
valuable data. Ransomware infections can be fought and averted by implementing solid
cybersecurity strategies such as safe data backups, good antivirus protection, and
end-user training to help users recognize suspicious emails or downloads.
Organizations should also take appropriate network security action, including
firewalls and intrusion detection systems, to identify ransomware threats and stop
them before they cause harm[12].

Rootkits

Rootkits are a collection of computer software, typically malicious, designed to
enable access to a computer or an area within it. Rootkits are mainly intended to
hide the attacker’s activity and presence inside the compromised system, allowing the
attacker to act with administrator privileges without being caught on the target
host. Rootkits prevent the user from detecting unauthorized access to and
manipulation of their system, as they hide file changes and other nefarious
activities that may be going on in the operating system. Trojans, worms and viruses
are all associated with the hidden software components used to install rootkits but
are not part of the rootkit itself. Installing rootkits onto systems allows attackers
to alter core system files and processes in a manner that is practically undetectable
by antivirus software or other security measures. Attackers can intercept system
calls and network traffic as well, which allows them to stay in control without being
noticed. Cybersecurity professionals face a number of problems related to rootkits,
including the fact that rootkits conceal themselves and what they are doing in a way
that makes them difficult (or impossible) to eliminate. Advanced security tools and
methods, such as rootkit detection software, system monitoring, and forensic
analysis, are necessary for identifying and containing rootkits. Regular software
updates and user education help to avoid rootkit infections and minimize their
effects on computer systems and networks[12].

Backdoor

A backdoor is a type of malware that creates an additional hidden “entrance” for
malicious people into the infected system. A backdoor, unlike other types of malware,
does not destroy the system itself. Instead, it gives attackers an entry point into
the system, which they use to carry out malicious activities. Although backdoors can
be used alone, they are often deployed in combination with other types of malware or
as part of other attack packages. Once installed, a backdoor gives attackers a covert
way to break into the system and access it without going through the standard
authentication procedure. From that vantage point they can launch additional attacks
or perform any kind of malicious operation on the compromised machine. Backdoors
might be used, among other things, to steal sensitive information, to install further
malware to spy on the victim, or to perpetrate DDoS attacks. Because of their covert
operation, backdoors can sometimes go undiscovered in breached devices for weeks or
even years, providing the attackers with long-term access and control. Advanced
cybersecurity is required to detect and mitigate backdoor infections, which typically
includes monitoring systems regularly, conducting vulnerability assessments, and
deploying intrusion detection systems. Moreover, user education and awareness are
needed so that backdoor infections are eliminated and harm to systems and networks
is reduced[12].

Keylogger

A keylogger (short for keystroke logger) is a type of surveillance malware used to
log every keystroke on a system; the recording can be done by software or hardware
once the target has been affected. It runs silently, monitoring everything the user
types on their keyboard, including logins entered into applications and web
browsers, as well as the operating system’s own password box. The recorded keystrokes
are saved to a log file, which is typically encrypted, for example with an RSA key,
and sent back to the attacker, whose application decrypts the data transmitted from
the target. This information could range from passwords to a BVN and data related to
the make or model of an ATM card. It might also include credit card details entered
by the user. Keyloggers can be deployed through email attachments, software downloads
infected with a virus, or compromised web pages. Once installed, a keylogger quietly
collects the user’s keystrokes, operating out of sight and reporting back all data
entered on its host. Attackers could use the captured data for identity theft,
financial fraud, cyber espionage or other nefarious activities[13]. Keyloggers can be
difficult to detect and remove, as they are designed to avoid detection, are often
modular, and are often written with rootkit functionality. To avoid infection from
keyloggers, it is important to follow cybersecurity best practices such as updating
software regularly, installing reliable antivirus software and educating users about
the dangers of following unsolicited links or downloading email attachments.
Furthermore, the use of security applications like a firewall and an intrusion
detection system (IDS) provides an extra layer of defense and helps ensure timely
identification of and protection from keylogger attacks.

B. Second Generation Malware

Second-generation malware, otherwise referred to as dynamic malware, operates on a
more sophisticated level where the structure of the malware changes with each
infection, generating new variants while keeping its core functionality intact. These
variants are specially designed to defy detection by security mechanisms, creating
extreme difficulties for cybersecurity experts. Second-generation malware is
classically categorised according to the mechanism used to obscure either the code or
the structure of the malware itself, which in turn makes it difficult to detect by its
signature.

While this trend in the malware creation lifecycle is leading to more capabilities, it
poses a difficult challenge for cybersecurity professionals, who need to identify new
ways of detecting dynamic malware. Traditional methods may not respond fast
enough because dynamic malware keeps evolving. Attackers use different
obfuscation methods to hide the actual code of malware, so security solutions do not
detect and stop threats as well as they should. Consequently, to detect and
immediately respond to these dynamic threats, organizations need proactive security
solutions like behaviour-based analysis coupled with threat intelligence.
“Furthermore, the importance of user education and awareness cannot be understated
in helping to avoid these types of infections,” SafeBreach Labs added. “Users need to
actively protect themselves against social engineering attacks orchestrated by threat
actors attempting to infect endpoints with malware-based payloads.” With insight
into the latest threat trends and strong cybersecurity solutions, organizations should
be able to better defend themselves against the new threats posed by
second-generation malware.

Encrypted Malware

At the first level of concealment, we have encryption. Here the malware body
consists of encrypted malicious code along with a key and an encryption/decryption
algorithm [Fig. 3]. The body of the malware is XORed with the generated key,
making it harder to detect. The primary goal of developing malware with
encryption was to bypass static code analysis and classical signature-based
detection. Once it infects the system, the malware decrypts itself using the
decryption algorithm and the key, and at this point the malicious code can be
executed on the infected system. It then encrypts itself again with the encryption
algorithm and a new key to produce a new variant, in order to bypass the detection
mechanism. In the scope of malware, encryption is the primary obfuscation method:
the malicious code usually resides in a shell where its body has been encrypted and
is stored along with the algorithm and key for encrypting/decrypting. With this in
place, the actual nature of the malware is not directly visible, which defeats static
code analysis as well as traditional signature-based detection. Because the key is
generated dynamically, it is hard for security solutions to detect and analyze the
body. In most cases, this continuing encryption and decryption process results in a
constantly changing structure, an evolving threat that is extremely difficult for
security researchers to parse or analyze. This allows malware authors to hide their
illicit actions using encryption, so that compromised systems can be used long after
the malware is initially installed. As a result, the detection and mitigation
strategies of cybersecurity professionals must change to address the continuing
threats posed by encrypted malware. This highlights the importance of advanced
security solutions that leverage behaviour-based analysis, machine learning and
threat intelligence to effectively detect and neutralize encrypted malware threats
in real time. In addition, ensuring that antivirus signatures and intrusion detection
systems are updated regularly is essential to keep ahead of any new encryption
techniques used by malware authors. By taking these measures, organizations can
protect themselves from the harmful impact of encrypted malware attacks.

Fig. 3: Encrypted Malware
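
Why XOR encoding hides a body from pattern matching can be shown on harmless data.
In the sketch below the "body" is ordinary text containing a marker string that plays
the role of a signature; the marker, the keys, and the function name are hypothetical
and chosen only for this demonstration.

```python
def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR every byte of data with the key, repeating the key as needed."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


# A byte pattern that a scanner is assumed to look for (illustrative only).
SIGNATURE = b"MARKER-1234"

# Harmless text standing in for the body; it contains the pattern in the clear.
body = b"plain demo text that contains MARKER-1234 somewhere inside"

# Two different single-byte keys give two different encoded forms.
key_a, key_b = b"\x5a", b"\xc3"
encoded_a = xor_bytes(body, key_a)
encoded_b = xor_bytes(body, key_b)

print(SIGNATURE in body)                    # True
print(SIGNATURE in encoded_a)               # False
print(encoded_a == encoded_b)               # False
print(xor_bytes(encoded_a, key_a) == body)  # True
```

The pattern disappears from both encoded forms, the two forms differ from each
other, and applying the same key again restores the original bytes. The routine that
performs the decoding is not hidden in this way, so it can itself be matched by a
scanner.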


Oligomorphic Malware

The limitation of encrypted malware was the invariance of the decryptor: because the
decryptor stays the same within specific variants of the malware, an anti-malware
solution could easily detect it by searching for the signature of the decryptor
itself. Therefore, various concealment techniques evolved to escape this detection
mechanism. In oligomorphic malware (Fig. 4), the decryptors are mutated, meaning
that each variant draws its decryptor from a set of obfuscated decryptors. In
response to signature detection, malware authors continued to evolve their designs
by obscuring the decryptors, for example by mutating the generic elements across
different variants (see for example [27]). Fig. 4 shows an example of this advanced
variant, where the decryptors come from a variant-wide pool of obfuscated decryptors
and differ across variants. This is especially important because, by constantly
mutating the decryptors, malware developers can blur their “signatures”, making it
harder for anti-malware solutions to detect and destroy the threat. The malware then
dynamically encrypts and decrypts its body, making it even harder for ordinary
security mechanisms to detect. Employing low-level obfuscated decryptors adds an
additional level of complexity to overall malware analysis. Understanding these
behaviours, together with proper toolsets and techniques for detecting this activity,
will enable defenders to analyze such malware properly and develop detection
measures that can be employed quickly. In conclusion, the arms race between malware
authors and those who are trying to protect their networks, whether CIOs or IoT
operators, has become more intense, and there is no winner yet. Against the threat
of mutated decryptors and other more sophisticated obfuscation methods,
organizations must defend themselves with proper cybersecurity tools: modern
solutions backed by artificial intelligence (AI), machine learning, and proactive
behavioural analytics. In addition, the security community must work together and
share intelligence about cyber threats to get ahead of recurring attacks and
implement defenses for new strains of malware as quickly as possible.

Fig. 4: Oligomorphic Malware
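
The idea of a limited pool of decryptors can be modelled with placeholder strings.
In this sketch each "variant" is a stub label taken from a small fixed pool followed
by filler bytes; the pool, the labels, and the scanning function are assumptions for
illustration and contain no executable decryptor.

```python
# Hypothetical labels standing in for the distinct decryptors of one family.
DECRYPTOR_POOL = [b"STUB-A", b"STUB-B", b"STUB-C"]


def make_variant(index: int, filler: bytes) -> bytes:
    """Build a toy variant: one stub label from the pool plus filler bytes."""
    stub = DECRYPTOR_POOL[index % len(DECRYPTOR_POOL)]
    return stub + filler


def scan(sample: bytes, stub_signatures) -> bool:
    """Report a match if any known stub label occurs in the sample."""
    return any(signature in sample for signature in stub_signatures)


# Six toy variants that cycle through the three stubs.
variants = [make_variant(i, bytes([i]) * 8) for i in range(6)]

# A scanner that knows a single stub catches only the variants using it.
detected_with_one = sum(scan(v, [DECRYPTOR_POOL[0]]) for v in variants)

# A scanner that knows the whole pool catches every variant.
detected_with_all = sum(scan(v, DECRYPTOR_POOL) for v in variants)

print(detected_with_one, detected_with_all)  # 2 6
```

Because the pool is finite, collecting one signature per decryptor is enough to
cover the family in this model, which is the weakness that polymorphic malware
removes by generating far more decryptors.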

Polymorphic Malware

Polymorphic malware is similar to oligomorphic malware, except that it can produce
millions of decryptors by changing the instructions in each variant of the malware
[14], so signature-matching detection will not work against it. A mutation engine
produces an encryption algorithm and a matching decryption algorithm, and both the
malware code and the mutation engine itself are encrypted. In this way we get a new
variant of that particular malware. The dynamic mutation process is designed to hide
patterns from detection techniques such as the signature matching used in antivirus
software and other filters. The structure of polymorphic malware is shown in Fig. 5.
It generally consists of two main parts: the decryptor and the body of the malware.
The function of the decryptor is to decrypt the encrypted malware code so that it can
run on the infected computer and perform its malicious purpose. The body consists of
the payload and functionality of the malicious software. By constantly changing
their instructions and producing new strains with different decryptors, polymorphic
variants evade static signature or pattern-based security solutions that rely on the
same instructions remaining in all received samples. As a result, polymorphic
malware behaves dynamically, which makes it difficult to detect and resistant to
traditional approaches such as static signature-based analysis. Organizations are
best served defending against polymorphic malware by implementing sophisticated
cybersecurity strategies that combine behaviour-based analysis, machine learning and
threat intelligence to identify and stop new threats as soon as they appear.
Antivirus signatures and intrusion detection systems should also be updated
periodically. Staying ahead of polymorphic malware ultimately means outsmarting its
creators. Remaining watchful and prepared allows organizations to protect themselves
from polymorphic malware, as well as from the other threats that continuously
emerge.

Fig. 5: Polymorphic Malware
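
The effect of re-encoding on signature matching can be illustrated with a harmless sketch. It is not taken from this project: the placeholder body, the single-byte XOR standing in for the encryption algorithm, and the function names are all assumptions made for illustration. It only shows that the same body stored under two different keys produces different stored bytes, so a hash taken from one variant does not match the other.

```python
import hashlib

# Harmless placeholder standing in for the malware body (assumption for illustration).
BODY = b"PLACEHOLDER-BODY: prints a greeting"


def xor_bytes(data: bytes, key: int) -> bytes:
    """Apply a single-byte XOR; applying it again with the same key restores the data."""
    return bytes(b ^ key for b in data)


def signature(data: bytes) -> str:
    """A naive 'signature': the SHA-256 digest of the stored bytes."""
    return hashlib.sha256(data).hexdigest()


# Two variants of the same body, stored under different keys.
variant_a = xor_bytes(BODY, 0x21)
variant_b = xor_bytes(BODY, 0x5A)

# The stored bytes differ, so a hash signature taken from one variant misses the other.
print(signature(variant_a) == signature(variant_b))

# After decoding, both variants yield the identical body.
print(xor_bytes(variant_a, 0x21) == xor_bytes(variant_b, 0x5A) == BODY)
```

The first comparison prints False and the second prints True, which is why detection of such variants has to look at the decoded body or at behaviour rather than at the stored bytes.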

Metamorphic Malware

In metamorphic malware, shown in Fig. 6, the malware body itself is mutated rather
than the decryptor, creating a new variant whose actions remain the same in order to
avoid detection. This is called a body-polymorphic transformation. The metamorphic
behavior is achieved with several obfuscation methods, similar to those polymorphic
malware uses for creating variants, such as dead-code insertion, data modification,
control/data flow modification, register renaming, subroutine permutation and
equivalent code substitution.

Because the underlying actions are preserved while the code changes, each new variant
behaves like its predecessor but differs from it in structure and appearance. Every
cycle applies the obfuscation techniques again, erasing the earlier structure of the
code and creating a new one, so that every iteration looks unique and deviates from
all previous copies. Since each variant contains new lines of code, signature
matching is rendered futile, and because the malware does not present a fixed image,
alternative detection methods are needed. Security software may not report any
infection in such cases, which makes signature-based static analysis very difficult
and lets the malware operate undetected for a long time.

To counter such malware, an organization should keep its antivirus software updated.
Hyuk describes measures to prevent this type of malware, the first three of which
are intrusion prevention, intrusion detection and keeping antivirus software
updated. Following these steps consistently greatly reduces the risk of such an
infection.

Fig. 6 : Metamorphic Malware

1.4.2 Types Of Malware Analysis

Before malware can be detected, its behavior must first be understood. Analysis
helps to visualize the malware's behavior and the tasks it performs. There are a
number of methods for reaching this understanding, each with its own advantages,
although the amount of time and knowledge needed to use them varies widely.
Event-driven systems can use either event loggers or hooks to track the behavior of
malware. Host log monitoring can also be used to check the behavior of malware by
reviewing the event logs regularly. Using an event-driven system to detect malware
has a number of benefits: it can rely on DLL or API interception and has little or
no effect on timing. However, a lot of time and effort is spent on analysing the
telemetry.

The types of malware analysis are as follows:

 Static Analysis
Static analysis, also referred to as code analysis, examines the code and
structure of a malicious program without executing it in order to determine
its potential behaviour and properties. This technique helps cybersecurity
analysts identify potential threats and vulnerabilities. During static
analysis, analysts dissect the code to uncover indicators of compromise, such
as suspicious function calls, obfuscated strings and hidden payloads. Static
analysis can also reveal known malware signatures and patterns, allowing
analysts to classify the malware and understand its potential impact, and so
to develop effective mitigation strategies and security measures. However,
static analysis has its limitations, as sophisticated malware may employ
techniques such as code obfuscation and polymorphism to evade detection.
Despite these challenges, static analysis remains an essential component of
malware analysis. This reverse engineering can be achieved in one of the
following ways:
o File Format Inspection
File format inspection plays a crucial role in malware analysis,
offering valuable insights into the nature and characteristics of
suspicious files. Metadata, in particular, serves as a treasure trove of
information, providing analysts with essential details such as file type,
date of creation, compile time, and functions that have been imported
and exported. By examining metadata, analysts can determine the
origin and purpose of a file, helping to identify potentially malicious
content. For instance, the file type can reveal whether the file is an
executable, a document, or a script, providing clues about its intended
functionality. The date of creation and compile time can offer insights
into the timeline of the file's development, shedding light on its
potential relevance and significance in the context of a security
incident. Furthermore, metadata can unveil the functions that have
been imported and exported within the file, providing clues about its
behavior and capabilities. For example, imported functions may
indicate dependencies on external libraries or system resources, while
exported functions may suggest the presence of hooks or callbacks
used for malicious activities. By analyzing metadata, cybersecurity
analysts can gain a deeper understanding of the files under
investigation, enabling them to make informed decisions about the
level of risk and appropriate response actions. However, it is essential
to note that metadata inspection is just one component of a
comprehensive malware analysis strategy. Sophisticated malware may
employ techniques to manipulate or obfuscate metadata, making it
challenging to extract reliable information. Therefore, analysts must
complement metadata inspection with other analysis techniques, such
as dynamic analysis and code reverse engineering, to gain a
comprehensive understanding of the threat landscape and effectively
mitigate cyber risks.
o String Extraction
String extraction is a fundamental technique in malware analysis,
crucial for deciphering the behavior and functionality of suspicious
code. This method involves extracting strings from the code, including
error messages, status indicators, and other textual data embedded
within the malware. By analyzing these strings, analysts can glean
valuable insights into the workings of the malware, infer its intended
purpose, and uncover potential indicators of compromise. Error
messages, for example, may reveal clues about the malware's
functionality, providing insights into its behavior and potential impact
on the infected system. Status indicators can also offer valuable
information, indicating the progress of specific operations or the
success of certain actions performed by the malware. Additionally,
strings extracted from the code may include hardcoded URLs, IP
addresses, encryption keys, or command-and-control server addresses
used by the malware to communicate with remote servers or execute
malicious activities. By identifying and analyzing these strings,
analysts can gain a deeper understanding of the malware's capabilities,
command-and-control infrastructure, and potential attack vectors.
Furthermore, string extraction can aid in the development of detection
signatures and mitigation strategies, enabling cybersecurity
professionals to better defend against emerging threats and protect
sensitive systems and data. However, it's essential to note that string
extraction is just one component of a comprehensive malware analysis
process. Analysts must complement string analysis with other
techniques, such as dynamic analysis, code reverse engineering, and
behavioral analysis, to gain a holistic understanding of the malware's
behavior and effectively mitigate cyber risks. By leveraging a multi-
faceted approach to malware analysis, organizations can enhance their
cybersecurity posture and better defend against evolving cyber threats.
o AV Scanning
AV scanning, short for antivirus scanning, is a widely employed
technique in malware detection and prevention, utilized by both
individual users and organizations alike. The process involves
subjecting suspected files to scrutiny by antivirus software, which
compares their signatures, behaviors, and characteristics against a
database of known malware signatures. If a file matches a signature in
the database, it is flagged as malicious and appropriate action is taken
to quarantine, delete, or neutralize the threat. This method is
particularly effective against well-known malware strains, as most
antivirus programs maintain extensive databases of signatures for
common malware variants. However, it's important to note that AV
scanning is not foolproof, as it relies heavily on signature-based
detection and may fail to detect zero-day threats or polymorphic
malware that constantly alters its code to evade detection.
Additionally, AV scanning may produce false positives, flagging
legitimate files as malicious due to similarities with known malware
signatures. To mitigate these limitations, modern antivirus solutions
often incorporate heuristic analysis, sandboxing, and machine learning
algorithms to detect and block emerging threats based on their
behavior and characteristics rather than relying solely on signature
matching. Despite its limitations, AV scanning remains an essential
component of a layered cybersecurity strategy, providing a critical line
of defense against a wide range of known malware threats. By
regularly updating antivirus databases, conducting routine scans, and
supplementing AV scanning with other detection and mitigation
techniques, organizations can enhance their resilience against malware
attacks and safeguard their digital assets from harm.
o Disassembly
Disassembly is a fundamental technique in malware analysis,
providing analysts with deep insights into the inner workings of
malicious code. This method involves converting machine code, which
is the binary representation of software, into assembly language, a
human-readable format that represents the instructions executed by the
CPU. By disassembling malware, analysts can examine the logic and
functionality of the program at a granular level, allowing them to
identify malicious behaviors, uncover hidden functionality, and
understand the malware's attack techniques. Popular disassemblers
such as IDA Pro and Ghidra provide powerful tools for analyzing
executable files, offering features like interactive disassembly, graph
views, and function analysis to aid in the reverse engineering process.
These tools enable analysts to navigate through the disassembled code,
trace the execution flow, and identify key functions and routines
within the malware. Furthermore, disassembly allows analysts to
uncover obfuscated or encrypted code, revealing the true intentions of
the malware and providing valuable insights into its capabilities and
behavior. By leveraging disassembly techniques, cybersecurity
professionals can gain a deeper understanding of malware threats,
develop effective detection and mitigation strategies, and enhance their
organization's overall security posture. However, it's essential to note
that disassembly can be a complex and time-consuming process,
requiring specialized skills and expertise in reverse engineering and
assembly language programming. Additionally, malware authors often
employ anti-disassembly techniques to thwart analysis efforts, making
it challenging for analysts to extract meaningful information from the
disassembled code. Despite these challenges, disassembly remains a
critical tool in the arsenal of malware analysts, enabling them to
uncover the secrets hidden within malicious software and protect
against cyber threats effectively.
 Dynamic Analysis
Dynamic analysis, also known as behavioral analysis, is one of the most
powerful techniques in malware analysis because it provides visibility into
the real-time activity of a piece of malware. It entails running the malware
in a controlled environment (such as a sandbox or virtual machine), unlike
static analysis, which does not execute the file but instead disassembles and
inspects its code and contents. The malware's behavior is closely monitored
and all the activities the file performs are recorded. With this approach,
analysts can watch the malware operate and learn about its function, purpose
and intent to harm or interfere with a system. Because the malware runs in an
isolated environment, analysts can observe what it does, including how it
spreads, without putting production systems at risk. Dynamic analysis is also
generally faster than static analysis, because it quickly shows analysts how
the malware behaves and whether it is dangerous. It can additionally expose
new and previously unknown behaviors, which helps analysts detect new threats
and create countermeasures. Altogether, dynamic analysis is an essential part
of malware identification and prevention: it shows cybersecurity professionals
how threats actually behave and helps organizations block increasingly
sophisticated attackers.
 Hybrid Analysis
Hybrid analysis combines the most useful features of static and dynamic
analysis to gain better insight into malicious software. In the early steps of
hybrid analysis, relevant signatures and features of the malware are examined
using static approaches. This includes taking apart the code and examining the
metadata, obfuscation techniques and similarities to known patterns
(indicators of compromise). Using static analysis, an analyst can find known
threats and identify the particular malware according to a set of established
criteria.

Static analysis is followed by dynamic analysis, which gives deeper insight
into the malware's behaviour and potential when executed in a controlled
manner. The malware is run in a sandbox or virtual machine, and analysts
monitor its behavior and interaction with the system as it runs. Dynamic
analysis reveals the malware's execution behavior, network activity, file
system changes and any other unusual behaviors or anomalies.

By marrying static with dynamic approaches, hybrid analysis provides a broader
view of the malware's properties, capabilities and purpose. Static analysis
identifies known threats and forms a base understanding of the malware, while
dynamic analysis catches anything unexpected or new, such as attempts at
evasion. This combination allows cybersecurity experts to detect, analyze and
respond to increasingly complex and shifting malware threats.

Hybrid analysis can also improve the efficiency and effectiveness of malware
analysis. Static analysis is fast and scalable but gives little context about
the malware's runtime behaviour, which dynamic analysis supplies. Combining the
two enables organizations to build stronger cybersecurity strategies and
defenses against more advanced cyber threats.
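
Two of the static steps described earlier in this section, file format inspection and string extraction, can be sketched with the Python standard library alone. This is a minimal sketch rather than the tooling used in this project: the function names and the synthetic sample are assumptions, and only three COFF header fields are read. The offsets follow the PE layout, in which the DOS header stores the position of the PE signature at offset 0x3C and the COFF header follows that signature.

```python
import re
import struct
from datetime import datetime, timezone


def pe_header_info(data: bytes) -> dict:
    """Read a few COFF header fields from the bytes of a PE file."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # Offset 0x3C of the DOS header holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # COFF header: Machine (2 bytes), NumberOfSections (2), TimeDateStamp (4).
    machine, n_sections, timestamp = struct.unpack_from("<HHI", data, pe_offset + 4)
    return {
        "machine": hex(machine),
        "sections": n_sections,
        "compiled": datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat(),
    }


def extract_strings(data: bytes, min_len: int = 4) -> list:
    """Return runs of printable ASCII characters at least min_len long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


# Synthetic, harmless sample: a minimal header followed by one embedded string.
sample = bytearray(0x80)
sample[0:2] = b"MZ"
struct.pack_into("<I", sample, 0x3C, 0x40)            # PE signature at 0x40
sample[0x40:0x44] = b"PE\0\0"
struct.pack_into("<HHI", sample, 0x44, 0x14C, 3, 0)   # x86, 3 sections, epoch time
sample += b"\x00http://example.invalid/c2\x00ab\x00"
sample = bytes(sample)

print(pe_header_info(sample))
print(extract_strings(sample))
```

On a real sample the compile time and section count give a first impression of the file, and the extracted strings may contain URLs or API names worth following up. Both values can be forged or hidden by packing, which is why the text pairs static inspection with dynamic analysis.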

1.4.3 Types Of Malware Detection

Malware detection is the process of identifying and recognizing malicious software
in a computer system or network. It falls under the larger umbrella of cybersecurity
and covers the processes and technologies developed specifically for identifying and
handling malware on computing systems. Malware takes many forms, including viruses,
worms, Trojans, ransomware, spyware, adware and many more. The purpose of malware
detection is to protect computer systems and data against threats to their integrity,
confidentiality and availability by detecting malware before it causes any harm. This
means using a combination of tools, such as antivirus software, intrusion detection
systems (IDS) and endpoint protection platforms, to watch constantly for indicators
of compromise and anything out of the ordinary. In addition, advanced threat
intelligence and machine learning algorithms are increasingly used to detect
emergent malware threats in real time. Robust malware detection protects an
organization by identifying potential cyber threats early, strengthening its defenses
and minimizing the risk to its digital assets.

The different types of malware detection are as follows:

Signature Based Detection

Signature-based detection (Fig. 7), the traditional technique, is an easy and
efficient way to detect known malware [16]. After a piece of malware has been
identified, this technique extracts a unique short sequence or pattern of bytes that
distinguishes the malicious program from benign programs [17]. It is the most common
practice in cybersecurity: antivirus software and intrusion detection systems look
for these signatures in files, applications or network traffic and try to match them
against known malware variants. In most cases a signature consists of one or more
byte or code sequences, but it can also be built from behavior that resembles a known
malicious component. Once a match is found, the software can react by quarantining or
deleting the infected file, blocking network communication from specific IP
addresses or warning system administrators.

Although signature-based detection does well at recognizing and blocking known
threats, it has limitations. It cannot always detect zero-day attacks or polymorphic
malware, which changes its code or behavior upon infection in order to evade defense
mechanisms. A further drawback is that the signature database must be updated
regularly to cover the latest malware strains, and a system becomes vulnerable to
new threats if it is not. Nevertheless, signature-based detection remains a building
block of any layered cybersecurity defense and is necessary to counteract the long
tail of existing malware.

Fig. 7 : Traditional Detection System
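
The matching step of signature-based detection can be sketched as a search for known byte patterns. The sketch below is an illustration only: the signature names and byte patterns are made up, and real products use much larger databases and faster multi-pattern matching.

```python
from typing import Dict, List

# Hypothetical signature database: name -> byte pattern (made-up values).
SIGNATURES: Dict[str, bytes] = {
    "Demo.Family.A": bytes.fromhex("deadbeef4142"),
    "Demo.Family.B": b"\x90\x90\xeb\xfe",
}


def scan(data: bytes, signatures: Dict[str, bytes] = SIGNATURES) -> List[str]:
    """Return the names of all signatures whose byte pattern occurs in data."""
    return sorted(name for name, pattern in signatures.items() if pattern in data)


clean = b"just an ordinary file"
infected = b"header" + bytes.fromhex("deadbeef4142") + b"trailer"

print(scan(clean))      # []
print(scan(infected))   # ['Demo.Family.A']
```

A file whose bytes have been re-encoded no longer contains the stored pattern, so `scan` returns an empty list for it. This is the limitation against polymorphic malware described above.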

Heuristics Based Detection

From 2000 to 2010, heuristic-based detection was, along with signature-based
detection, a significant mechanism for malware mitigation and a promising method for
detecting novel, previously unseen malware [18], bridging the gaps left by
signature-based tools. The approach uses two methods of identification. The first
method is static: the code of a suspicious program is examined for defined patterns
within its structure, and if the irregularities or suspicious patterns found in the
file exceed a predefined threshold, the file is marked as possibly infected (Cheng
et al., 2014) [19]. The advantage of this type of static analysis is that
cybersecurity systems can still recognize common characteristics or behaviour
patterns exhibited by malware even if its exact signature is unknown and not yet
in the databases. Applying these heuristic approaches helps cybersecurity analysts
improve their malware detection and response capabilities, making it harder for
attackers to carry out malicious activities.

Heuristic detection aids the identification of unknown malware variants, but it
also has weaknesses. It can produce false positives, and it can miss complex threats
that hide their characteristics through, for example, obfuscation or polymorphism.
As a result, the typical strategy is a balanced one that relies on both heuristic and
signature-based detection methods in order to defend adequately against the variety
of malware threats.
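
The static heuristic described above, in which indicators are scored and compared with a threshold, can be sketched as follows. The indicators, weights and threshold are assumptions chosen for illustration and are not the values used in the cited work: the strings are Windows API names often associated with process injection, and byte entropy is used as a rough sign of packed or encrypted content.

```python
import math
from collections import Counter

# Hypothetical indicators, weights and threshold, chosen only for illustration.
SUSPICIOUS_STRINGS = (b"CreateRemoteThread", b"VirtualAllocEx", b"WriteProcessMemory")
ENTROPY_LIMIT = 7.0   # bits per byte; high values suggest packed or encrypted content
THRESHOLD = 2         # a file is flagged when its score exceeds this value


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def heuristic_score(data: bytes) -> int:
    """One point per suspicious string found, two points for high entropy."""
    score = sum(1 for s in SUSPICIOUS_STRINGS if s in data)
    if shannon_entropy(data) > ENTROPY_LIMIT:
        score += 2
    return score


def is_suspicious(data: bytes) -> bool:
    """Mark the file as possibly infected when the score exceeds the threshold."""
    return heuristic_score(data) > THRESHOLD


benign = b"hello world"
injector_like = b"..VirtualAllocEx..WriteProcessMemory..CreateRemoteThread.."

print(is_suspicious(benign))         # False
print(is_suspicious(injector_like))  # True
```

Lowering the threshold catches more unknown samples but also flags more legitimate programs, which is the false-positive trade-off noted in the text.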

Malware Normalization

As sophistication has advanced, malware authors have adopted automated advanced
malware generation toolkits [20] that use highly sophisticated obfuscation techniques
(e.g. Zeus, the Ultimate Packer for eXecutables and Mistfall). These kits can create
malware variants in the order of a couple of thousand a day, each one seemingly
different enough to bypass traditional signature or heuristic-based detection, which
makes them practically impossible to catch with those techniques. Malware
normalization addresses this explosion of obfuscated malware by removing the layers
of obfuscation wrapped around the malicious code. A normalizer takes an obfuscated
program and produces a normalized executable, and this normalized form can be used
to improve the detection accuracy of an existing antimalware product (Fig. 8) [21].

Working on the normalized form allows analysts to sidestep the obfuscation and
understand the structure and functionality of a sample. By analysing normalized
samples, security analysts can create stronger signatures and enhance their defenses
against advanced cyber threats. Normalization also helps security researchers
understand the tactics, techniques and procedures that adversaries use, which
supports proactive defensive strategies. Malware normalization therefore remains a
highly useful approach for strengthening cybersecurity defenses against automated
malware generation toolkits.

Fig. 8 : Malware Normalization
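
The idea of normalization can be shown on a toy scale. The sketch below is an assumption-laden illustration, not the normalizer of [21]: it works on hypothetical assembly-like text lines instead of real binaries, and it undoes only two of the obfuscations named earlier, dead-code insertion (here only NOP padding) and register renaming.

```python
import re

# Small set of x86 register names recognised by this toy normalizer.
REGISTERS = re.compile(r"\b(eax|ebx|ecx|edx|esi|edi)\b")


def normalize(listing):
    """Drop NOP padding and rename registers in order of first use."""
    mapping = {}

    def canonical(match):
        # The first register seen becomes r0, the next r1, and so on.
        return mapping.setdefault(match.group(1), "r%d" % len(mapping))

    result = []
    for line in listing:
        line = line.strip().lower()
        if not line or line == "nop":
            continue  # skip inserted dead code
        result.append(REGISTERS.sub(canonical, line))
    return result


# Two hypothetical variants that differ only by padding and register choice.
variant_a = ["mov eax, 5", "add eax, ebx", "ret"]
variant_b = ["nop", "mov ecx, 5", "nop", "add ecx, edx", "ret"]

print(normalize(variant_a))
print(normalize(variant_a) == normalize(variant_b))
```

Both variants reduce to the same normalized listing, so a single signature written against the normalized form covers both. A real normalizer has to work on disassembled binaries and needs data-flow analysis to decide which instructions are truly dead.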

Machine/Deep Learning Techniques


Presently, various machine learning [22] techniques are employed to identify
zero-day malware, that is, malware that has not been seen before. Because the models
learn from previous detections, this technique can detect not only known malware but
also unknown malware, and its adoption for malware detection has grown rapidly. The
process has two steps. In step 1, features (such as API calls, N-grams, strings,
opcodes and control flow graphs) are extracted from known datasets. These features
define the target concept and facilitate both learning and classification/detection.
In step 2, suitable machine learning algorithms (e.g., Decision Tree [23], Naive
Bayes [24], Data Mining [25], Hidden Markov Models [26] and Neural Networks) are
trained on these features to obtain a model that can detect or classify malware.
During training, the algorithms learn the patterns or characteristics found in each
feature, so that, once trained, the models can analyze new samples and classify them
as either benign or malicious based on what they have seen before.

Machine learning has many advantages for malware detection, among them adaptation to
unforeseen threats, scalability to large datasets, and the ability to detect many
variants that share the same features. By learning from new data and from feedback
on how they perform in practice, these techniques make detection systems more
accurate and efficient over time, which helps organizations strengthen their networks
and systems against cyberattacks.
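
The two-step process can be sketched with the standard library: opcode n-grams as the features of step 1 and a small multinomial Naive Bayes classifier as the algorithm of step 2. This is a minimal sketch under stated assumptions, not the models evaluated in this project: the opcode sequences are made-up toy data, the class and function names are hypothetical, and a practical system would use an established library and a real dataset.

```python
import math
from collections import Counter


def ngrams(tokens, n=2):
    """Step 1: return the n-grams (as tuples) of an opcode sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class NaiveBayes:
    """Step 2: multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, samples, labels):
        self.counts = {}              # label -> Counter of feature counts
        self.docs = Counter(labels)   # label -> number of training samples
        for feats, label in zip(samples, labels):
            self.counts.setdefault(label, Counter()).update(feats)
        self.vocab = {f for c in self.counts.values() for f in c}
        return self

    def predict(self, feats):
        total_docs = sum(self.docs.values())
        best_label, best_score = None, -math.inf
        for label, counter in self.counts.items():
            denom = sum(counter.values()) + len(self.vocab)
            score = math.log(self.docs[label] / total_docs)
            for f in feats:
                if f in self.vocab:   # features never seen in training are ignored
                    score += math.log((counter[f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up opcode sequences standing in for a labelled dataset.
training = [
    ("push mov call add ret".split(), "benign"),
    ("push mov add mov ret".split(), "benign"),
    ("xor jmp xor jmp call".split(), "malicious"),
    ("xor xor jmp call jmp".split(), "malicious"),
]
model = NaiveBayes().fit([ngrams(seq) for seq, _ in training],
                         [label for _, label in training])

print(model.predict(ngrams("xor jmp xor call".split())))   # malicious
print(model.predict(ngrams("push mov add ret".split())))   # benign
```

Neither test sequence appears in the training data; the classifier labels them from the bigrams they share with each class, which is how a trained model generalizes to unseen variants.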
1.5 Structure of the Project
This report is organized into the following chapters:
Chapter 1 gives a basic idea of the project and introduces the technical and
theoretical aspects needed to follow it.
Chapter 2 reviews other journals and research papers.
Chapter 3 describes the system design and the tools and techniques needed to achieve
it.
Chapter 4 presents the feasibility study, covering technical, economic, operational,
legal and market viability.
Chapter 5 gives a comparative study of the performance of each model, making it
easier to choose the most suitable one.
Chapter 6 provides the conclusion and discusses the future scope of the project.

CHAPTER 2

LITERATURE REVIEW

2.1 Background Study

As early as 2009, Daly published research showing the potential for planned and
coordinated attacks designed to gain long-term network control within a targeted
company [27]. The Quick Heal Threat Research Lab received over 350,000,000 malicious
files, targeting tens of thousands of workstations, in the first quarter of 2016
alone (Aimoto et al. [28]). In May 2017, Symantec revealed that it had uncovered the
Banswift cybercrime ring behind the theft of US $81 million from Bangladesh Bank
[29]. In another incident, incorrectly configured Elasticsearch and Logstash
instances caused Kid Security, an app designed to help parents monitor their
children's online safety, to leak user records for more than a month. The breach was
discovered in mid-September by security researcher Bob Diachenko, who said it
affected 300 million records, including 21,000 phone numbers and 31,000 email
addresses, along with some credit card information.

The primary goals of malware detection techniques are to identify malicious
software, to protect the system on which they are used, and to ensure that other
linked networks or computers are protected as well. Many authors have worked on the
detection and classification of malware samples and have proposed methods to
recognize a malware file or assign it to its family and version. A detailed summary
of the major research publications is given in Table 1. We have also observed malware
represented graphically, alongside feature vectors. Note that this task is
fundamentally different from classifying images, where labeling data can be
particularly simple (each image belongs to exactly one label) and it is easy to
obtain a lot of it. Recognizing malware and clustering it into families, by
contrast, are labor-intensive tasks that require expertise in the subject. In
another study, the authors suggested a learning technique to analyze dangerous code
and classify it according to its malware family [2]. The first step in malware
family identification is to group all portable executables (PE) and choose the
traits they have in common. These features then determine the specific category of
malware and the activities involved in defining the executable's organization; the
suggested approach reached an accuracy of approximately 99.8% [30].

Evolution of Malware Detection Techniques

The evolution of malware detection has kept pace with the increasing sophistication
of malware itself. Early methods of protection depended on signature-based
detection, which uses specific patterns or signatures of known malware to detect
malicious files. Although it was successful against the threats known at the time,
this method did not fare well against new or as yet unseen malware (zero-day
attacks).
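
The limitation described above can be illustrated with a minimal sketch, assuming that a signature is simply the SHA-256 digest of a file's contents. The signature set and the payload bytes below are hypothetical and exist only for illustration; real signature engines use richer byte patterns.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known-malicious content.
KNOWN_BAD_SHA256 = {
    hashlib.sha256(b"malicious payload v1").hexdigest(),
}


def is_known_malware(data: bytes, signatures=KNOWN_BAD_SHA256) -> bool:
    """Return True only if the exact content matches a stored signature."""
    return hashlib.sha256(data).hexdigest() in signatures


# The known sample is caught, but a variant that differs by one byte is missed,
# which is why signature matching struggles with unseen malware.
print(is_known_malware(b"malicious payload v1"))
print(is_known_malware(b"malicious payload v2"))
```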

To overcome this limitation, heuristic-based methods were subsequently developed.
These analyze the behavior of programs to determine whether they carry out actions
that are typical of a malware infection. Such approaches attempt to detect
suspicious activity (e.g., attempts to access system resources without the
appropriate permissions, or traffic flowing through a port different from the normal
communication ports) by inspecting system and network characteristics. Although less
rigid than signature scanning, heuristic approaches tend to produce false positives,
because well-behaved applications occasionally act like a virus or worm.
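
A rule-based heuristic of this kind can be sketched as a weighted score over observed behaviors. The behavior names, weights, and threshold below are assumptions chosen for illustration, not values taken from any particular product.

```python
# Hypothetical behaviour flags and weights (illustrative values only).
RULE_WEIGHTS = {
    "writes_to_system_dir": 3,
    "uses_unusual_port": 2,
    "disables_security_tool": 4,
    "reads_many_user_files": 2,
}
THRESHOLD = 5  # assumed cut-off above which a program is flagged


def heuristic_score(observed_behaviours) -> int:
    """Sum the weights of all observed behaviours; unknown ones count as 0."""
    return sum(RULE_WEIGHTS.get(name, 0) for name in observed_behaviours)


def is_suspicious(observed_behaviours) -> bool:
    return heuristic_score(observed_behaviours) >= THRESHOLD


dropper = {"writes_to_system_dir", "disables_security_tool"}
backup_tool = {"reads_many_user_files", "writes_to_system_dir"}  # benign program
text_editor = {"reads_many_user_files"}

for name, behaviours in [("dropper", dropper), ("backup_tool", backup_tool),
                         ("text_editor", text_editor)]:
    print(name, heuristic_score(behaviours), is_suspicious(behaviours))
```

In this sketch the benign backup tool reaches the threshold and is flagged, which is the false-positive behavior noted above.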

Malware Detection using Machine Learning and AI

Machine learning (ML) and artificial intelligence (AI) have greatly improved the
ability to detect and categorize malware. These techniques learn from large datasets
that contain examples of both benign and malicious software, which helps them
recognize patterns or deviations that could indicate malware. Machine learning models
such as Support Vector Machines (SVM) [7], Random Forests (RF) [23], and Neural
Networks (NN) have been used extensively. For instance, research has shown that
models based on opcode sequences, API call patterns, and n-grams are able to
distinguish benign from malicious software with a high success rate.
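
As a minimal sketch of how such inputs become model features, the snippet below counts n-grams over an opcode sequence. The opcode list and the vocabulary are made-up examples; in practice they would come from disassembling real samples.

```python
from collections import Counter


def ngrams(sequence, n):
    """Return all contiguous n-grams of the sequence as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def ngram_features(sequence, n, vocabulary):
    """Count how often each vocabulary n-gram occurs in the sequence."""
    counts = Counter(ngrams(sequence, n))
    return [counts[gram] for gram in vocabulary]


# Hypothetical opcode trace and bigram vocabulary.
opcodes = ["push", "mov", "call", "push", "mov", "ret"]
vocabulary = [("push", "mov"), ("mov", "call"), ("call", "ret")]

features = ngram_features(opcodes, 2, vocabulary)
print(features)  # [2, 1, 0]
```

The resulting fixed-length vector can be passed to any of the classifiers mentioned above.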

Deep learning (a class of machine learning) has expanded these capabilities even
further. Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and
similar models can extract features from raw data on their own, so the relevant
features do not have to be extracted through manual effort. This approach has been
particularly popular for image-based models that treat binary files as though they
were gray-scale images.
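
The image-based view can be sketched in a few lines: each byte of the file becomes one pixel intensity (0 to 255), and the bytes are laid out in rows of a fixed width. The width of 4 and the zero padding of the last row are assumptions made here for illustration.

```python
def bytes_to_grayscale(data: bytes, width: int):
    """Arrange raw bytes into rows of `width` pixels, zero-padding the last row."""
    if width <= 0:
        raise ValueError("width must be positive")
    padded_len = -(-len(data) // width) * width  # round up to a multiple of width
    padded = data.ljust(padded_len, b"\x00")
    return [list(padded[row:row + width]) for row in range(0, padded_len, width)]


# First bytes of a typical DOS header ("MZ" magic), used as toy input.
image = bytes_to_grayscale(b"MZ\x90\x00\x03\x00\x00", width=4)
print(image)  # [[77, 90, 144, 0], [3, 0, 0, 0]]
```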

Comparative Analysis of Research Contributions

The literature on malware detection is diverse, with numerous studies proposing
novel techniques and reporting high accuracy rates. A comparative analysis is
outlined in Table 1, which points to the diversity of approaches and their respective
performance measures. Feature engineering is a recurring theme, as effective feature
extraction plays a significant role in achieving high-accuracy malware detection.
Among the most common features are API calls, opcode sequences, and byte-code
patterns. The reviewed articles consistently indicate that well-engineered features
improve model performance. As for algorithms, different models have distinct
advantages. For instance, Random Forests and Support Vector Machines are recognized
for their robustness and high accuracy, while deep learning models such as CNNs and
RNNs can learn to extract complex features automatically. Their weakness, however,
is difficulty in learning abstract models, and oversimplification may complicate the
task. It is also apparent that hybrid models are more effective, as they combine
static and dynamic approaches. The static approach examines the code, that is, the
appearance of the application without running it, while the dynamic approach monitors
the software's activity while it is operating. Each approach is among the most
effective in its respective area, and the hybrid model benefits from combining their
traits. The rise of malware attacks has also led to concerns that a single machine is
insufficient to address evolving needs. One way to ensure the scalability and
efficiency of an approach is to train models on datasets and deploy them on
distributed systems, so that detection remains accurate and timely. Although
substantial progress has been observed, the field remains active as it addresses the
challenges posed by evolving types of malware.

Adversarial attacks pose the biggest concern. These are attacks in which the malware
is specifically crafted to evade detection. Given how common such threats have
become, researchers still need to focus on developing models that are robust against
them.

Challenges of the Future

Another challenge is posed by the emergence of new IoT devices and cloud computing.
These technologies allow attackers to develop new strategies, as the environment
undergoes substantial changes that require new malware detection methods. Old
methods need to be adapted and used alongside new and innovative methods to address
the emerging threats. All in all, the field of malware detection has experienced a
significant amount of progress driven by advances in machine learning and AI. The
comparative analysis of the reviewed papers indicates the relevance of feature
engineering, the significance of algorithm and model selection, and the value of the
hybrid approach. With cyber threats continuing to grow and evolve, the importance of
these aspects should not be underestimated.

2.1.1 Comparative Analysis Of Research Papers

The comparative analysis of research papers is given below:

| Authors | Inputs | Algorithm/Techniques | Findings |
|---|---|---|---|
| Ye et al. (2007) [31] | API execution sequences | Rule-based classifier | IMDS surpassed other data mining techniques and several antivirus programs with a 93% detection accuracy. |
| Shafiq et al. (2008) [32] | n-grams | HMM | The suggested technique detects malware with a TPR of 84.9% and an FPR of 16.7%. |
| Moskovitch et al. (2008) [33] | Opcode features | ANN, NB, DT and Adaboost | The model predicted the virus in the file under examination with a considerable degree of accuracy (94.5%). |
| Griffin et al. (2009) [34] | 48-byte string signature | 5-gram Markov chain model | Signatures with one or more components were used to train the classifiers. Having several component signatures increased the chance of a satisfying accuracy outcome when compared to the equivalent. A false positive rate (FPR) of less than 0.1 percent was achieved. |
| Nataraj et al. (2011) [35] | Gray scale image of binaries | KNN | Demonstrates 98% classification accuracy on a collection of malware from 25 different families. |
| Shabtai et al. (2012) [36] | 1-6 gram opcode features | Random Forest classifier, Naive Bayes, ANN, Logistic Regression, BDT, DT, and BNB | In terms of accuracy, G-Mean, FPR, and TPR, RF, BDT, and DT performed better than NB and BNB. Random Forest produced the best results with 95.14% accuracy. |
| Ravi et al. (2012) [37] | API call sequence | J4.8, IMDS, SVM, Rule Based classifier, Naive Bayes, and SVM | The suggested solution makes use of a third-order Markov model, which operates with 90% accuracy on the testing dataset and 99.38% accuracy on the training dataset. |
| Santos et al. (2013) [38] | Frequency of opcodes | DT, KNN, Bayesian, SVM | SVM performs better than 95.7% for features of opcode length two. |
| Comar et al. (2013) [39] | Flow level features | KNN, SVM, WL, RBF | For identifying new classes, the supervised weighted linear kernel provides the best performance metric. |
| Uppal et al. (2014) [40] | N-grams from API sequences | Naive Bayes, Random Forests, SVM, and Decision Tree Classifiers | SVM produces the best results (98.5% accuracy) out of all the classifiers. |
| Salehi et al. (2014) [41] | API calls | RF, J48, Rotation RF, FT, and NB | 94.6% was the greatest true positive rate of any classifier used, and random forest produced the highest results. |
| Sexton et al. (2015) [42] | Byte code sequences & opcodes | Naive Bayes, Rule Based classifier, Logistic Regression, SVM | The Markov chain approach to SVM revealed an 84.9% True Positive Rate. |
| Saxe et al. (2015) [43] | The string histogram, the byte sequence, and the 2D PE properties | Deep feed forward neural network | The suggested model yielded a 95% True Positive rate. |
| Narra et al. (2016) [44] | Opcode sequence | K-means, expectation maximization with HMM, SVM | The model operates with a 98% accuracy rate. |
| Ahmadi et al. (2016) [45] | Hex dump based features | XGBoost classification algorithm | A 99.8% detection accuracy was provided by the suggested model. |
| Kolosnjaji et al. (2016) [46] | System call sequences | Convolutional & Recurrent Neural Network | The average accuracy and recall of the combined model were 85.6% and 89.4%, respectively. |
| Narayanan et al. (2016) [47] | Image of polymorphic malware file | KNN, ANN, and SVM | Over the others, linear KNN provided an accuracy of 96.6%. |
| Nikolopoulos and Polenakis (2017) [48] | ScD graph created using system calls | SaMe-similarity and NP-similarity metrics | The suggested model has a detection rate of 83.42%. |
| Zhixing Xu et al. (2017) [49] | System calls for memory access | Logistic regression and random forest classifier | The random forest classifier performed better, with a true positive rate of 99%. |
| Raff et al. (2017) [50] | PE header features | LSTM, Random Forests, LR, ET | A fully connected network, once calibrated, may reach 93.3%. |
| Kotov et al. (2018) [51] | Windows API calls | Symbolic execution & HMM models | With an accuracy rate of 87.6%, the top prediction model detects malware. |
| Le et al. (2018) [52] | Gray scale image of binary malware file | Convolutional Neural Network | Using 10568 binary data to train the classifier, the accuracy rate was 98.5%. |
| Nguyen et al. (2018) [53] | Image-based representation of lazy binding CFG | CNN | CNN produced an accuracy of 98.87%. |
| Krcal et al. (2018) [54] | PE, API calls | MalConv, CNN, FNN | At 96.4% accuracy, the suggested convolution network outperforms other models. |
| Ni et al. (2018) [55] | Gray images based on SimHash | Hashing & CNN | The average accuracy of classification attained was 98.86%. |
| Rathore et al. (2019) [56] | Opcode features | RF, DNN with 2 & 7 hidden layers | RF outperforms DNN with a 99.6% accuracy rate. |
| O. Suciu et al. (2019) [57] | PE header features | FGSM | The suggested method shows that adversarial attacks on the model are effective. This does not provide efficient models when trained on small datasets. |
| Yuxin et al. (2019) [58] | n-gram | Deep Belief Network | When trained on unlabeled data, DBN outperformed KNN, SVM, and Decision Trees in terms of classification accuracy. |
| Rabbani et al. (2020) [59] | Protocols, jitters, IP addresses, TCP, and UDP | PSO with PNN | With 96.5% accuracy, the model was able to identify malicious behavior. |
| Yucel et al. (2020) [60] | Memory image of exe file | Virtual machines & 3D imaging | Using an average of 0.886, the authors' research looked at the similarity rates across many malware families. A few succeeded in reaching a 99.5 percent accuracy rate. |
| Vasan et al. (2021) [61] | Windows executables, system call sequences | Unsupervised anomaly detection using Isolation Forest | Displayed the potential of unsupervised learning for malware detection by achieving high accuracy in identifying previously unknown malware types. |
| umar et al. (2022) [62] | Android APK files, API calls, permissions | Hybrid model combining static and dynamic analysis using RNN | Detected malware with 98.7% accuracy, highlighting the effectiveness of hybrid approaches for Android malware detection. |
| Gibert et al. (2023) [63] | PE files, opcode sequences | LightGBM, CNN with attention mechanism | Achieved 99.4% accuracy in malware detection, outperforming other ML algorithms. Attention mechanism improved model interpretability. |

Table 1: Comparative Analysis of Research Papers
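
The findings in Table 1 are reported as accuracy, true positive rate (TPR), and false positive rate (FPR). As a reminder of how these relate, the sketch below computes them from confusion-matrix counts; the counts used are invented for illustration and do not come from any of the cited papers.

```python
def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard detection metrics from confusion-matrix counts."""
    return {
        "tpr": tp / (tp + fn),                        # detected malware / all malware
        "fpr": fp / (fp + tn),                        # flagged benign / all benign
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # correct / all samples
    }


# Hypothetical evaluation: 100 malware and 100 benign samples.
metrics = detection_metrics(tp=90, fp=5, tn=95, fn=10)
print(metrics)
```

A detector can score well on accuracy while still having an FPR that is too high for practical use, which is why several of the papers report TPR and FPR separately.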

CHAPTER 3

SYSTEM ANALYSIS

The project follows the WATERFALL MODEL, in which the phases are arranged in a
well-defined linear sequence. First, the feasibility study is carried out. Once that
is done, requirement gathering and project planning are the next phases. If a system
already exists and modification or the addition of a new module is necessary, an
analysis of the present system may be taken as the basic model. Design is carried out
after requirement analysis and before the coding phase; coding does not start until
design is complete. After coding, the next phase of the software development life
cycle is testing. In this model, the following sequence of activities is performed in
the software development project:

• Requirement
• Project Planning
• System design
• Detail design
• Coding
• Unit testing
• System integration & testing

Importantly, these activities are linearly ordered: the output of one phase is the
input of the next. The output of each phase should be such that, when the whole is
assembled, it fulfills the system's overall requirements. A few features of the
spiral model are also included: after each phase is completed, the work done is
reviewed by all the stakeholders involved in the project. As all the requirements
were known to us beforehand and our software is made with the objective of
computerizing/automating a pre-existing manual working system, we have chosen the
WATERFALL MODEL.

3.1 Tools and Technology Requirements

We used several tools and technologies for our project, all of which are discussed
here:

3.1.1 Python

Simple and Readable Syntax: Python's syntax is designed to be simple and easy to
read, making it an ideal language for beginners and experienced developers alike. Its
clean and straightforward syntax emphasizes readability and reduces the cost of
program maintenance.

Interpreted Language: Python is an interpreted language, meaning that code is run by
an interpreter without a separate compilation step, making it easy to debug and test.
This also allows for rapid development and prototyping, as changes can be made and
tried out quickly.

High-level Language: Python is a high-level language, which means it abstracts away
many low-level details like memory management and pointer manipulation, allowing
developers to focus on solving problems rather than dealing with system-level
intricacies.

Dynamic Typing: Python is dynamically typed, meaning that variable types are
determined at runtime rather than at compile time. This provides flexibility and
allows for more concise code, but can also lead to potential runtime errors if not
handled carefully.

Rich Standard Library: Python comes with a comprehensive standard library that
includes modules and packages for a wide range of tasks, from web development and
data manipulation to networking and GUI programming. This extensive library
reduces the need for developers to write code from scratch, speeding up development
time.

Fig. 9 : Python

3.1.2 Google Colab

It is a free tool for writing and running Python directly in the browser.

• Real-time Collaboration: Multiple people can work on a notebook at the same time
(as in Google Docs), with access rights helping to prevent conflicts.
• Pre-installed Libraries: Commonly used Python libraries for data science and
machine learning are already installed, so no time is wasted setting them up.
• GPU and TPU Support: Powerful GPUs and TPUs can be used for free when needed to
speed up calculations (very handy in machine learning projects).
• Google Drive Integration: Work is saved to Google Drive, which also makes it easy
to share content.
• Markdown Support: Formatted text and math equations can be added to notebooks
using Markdown and LaTeX, for a smoother experience in organizing and presenting
work.
• Accessibility: It is easily accessible and free of charge (unless an extensive
allocation of resources is used).
• No Setup: There is nothing to install on your computer. You only need to log in
with your Google account to start coding straight away and run code cells one by
one in serial order, which is a good approach for debugging.

Fig. 10 : Google Colab

3.1.3 PyCharm

It is a powerful integrated development environment (IDE) specifically designed for
Python development, created by JetBrains.

• Smart Code Editor: Provides intelligent code completion, real-time error
checking, and quick fixes, making coding faster and easier.
• Debugging Tools: Includes advanced debugging features like breakpoints,
step-by-step execution, and a visual debugger to help troubleshoot code.
• Version Control Integration: Supports version control systems like Git,
SVN, and Mercurial, allowing you to manage your code changes efficiently.
• Refactoring: Offers powerful refactoring tools to help you clean up and
improve your code, ensuring it remains maintainable and scalable.
• Web Development Support: Comes with support for popular web
frameworks like Django, Flask, and Pyramid, making it ideal for web
development.
• Database Tools: Includes built-in tools for connecting to and managing
databases, allowing you to interact with databases directly from the IDE.
• Customizable and Extensible: You can customize the IDE with various
plugins and themes to fit your workflow and preferences.
• Cross-Platform: Available for Windows, macOS, and Linux, ensuring you
can work on any operating system.
• Professional and Community Editions: Offers a free Community Edition
with essential features and a paid Professional Edition with additional tools
and capabilities for advanced users.

Fig. 11: Pycharm

3.1.4 Pandas

It is a data manipulation and analysis library for Python. pandas is an open-source,
BSD-licensed library providing high-performance, easy-to-use data structures and
data analysis tools for the Python programming language.

• Data Structures: It provides two main data structures, Series (one-dimensional)
and DataFrame (two-dimensional), which are optimized for performing various
operations on a dataset.
• Data Handling: Helps with imputing missing values, cleaning datasets, and making
complex transformations on the dataset.
• Data Import/Export: Reads and writes different file formats such as CSV, Excel,
and SQL databases.
• Data Analysis: Powerful reshaping, filtering, and grouping features make data
analysis easier.
• Time Series Analysis: Supports time series data and offers capable functionality
for date and time manipulation.
• Interoperability: Can be easily integrated with other Python libraries (e.g.
NumPy, Matplotlib, scikit-learn), which makes it a more suitable tool for data
science purposes.
• Performance: It is designed to be performant, with highly optimized algorithms,
so it works well for processing large numbers of records.
• Community-powered: A huge community provides plenty of documentation, how-tos,
tutorials, and forums for help with whatever problem has perplexed you.
• Versatility: It is widely appreciated and used in different domains, including
but not limited to finance, economics, statistics, and data science (in both
academia and industry).

Fig. 12 : Pandas

3.1.5 NumPy

A fundamental package for array and matrix computing with Python.

• Powerful N-dimensional Array Object: At its core, NumPy provides a general-purpose
array with fast facilities to operate on it.
• Mathematical Functions: Offers a set of mathematical functions for operations on
vectors and matrices, including linear algebra, statistics, and Fourier
transformations.
• Performance: Provides fast numerical computations that are not practical with
Python's built-in lists.
• Element-Wise Operations: Supports element-wise operations such as addition and
subtraction on arrays, which allows straightforward usage and computation over the
entire dataset at once.
• Integration: It integrates well with other scientific libraries in Python (for
instance, pandas, SciPy, or Matplotlib) to achieve even more effective capabilities.
• Broadcasting: Allows operations on arrays of different shapes and sizes, which
simplifies coding.
• Memory Efficiency: It is designed to be memory efficient, i.e. it tries to reduce
the overhead that comes with numerical computations.
• Community and Documentation: Has a strong community and is well documented, which
makes learning how to use it and troubleshooting any issue easier.
• High Performance: Computation can be offloaded from Python to libraries written in
other languages (e.g., C, Fortran).

Fig. 13 : Numpy
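
To illustrate the element-wise operations and broadcasting listed above, the following sketch min-max scales each column of a small feature matrix. The matrix values are made up; the point is that the per-column minimum and range (shape `(2,)`) are broadcast across all rows of the `(3, 2)` array without an explicit loop.

```python
import numpy as np

# Hypothetical feature matrix: 3 samples, 2 features on very different scales.
features = np.array([[1.0, 200.0],
                     [3.0, 400.0],
                     [5.0, 600.0]])

col_min = features.min(axis=0)  # shape (2,)
col_max = features.max(axis=0)  # shape (2,)

# Broadcasting stretches the (2,) vectors across the 3 rows automatically.
scaled = (features - col_min) / (col_max - col_min)
print(scaled)
```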

3.1.6 Scikit-Learn

It is a popular open-source machine learning library in Python that provides simple
and efficient tools for data analysis and modeling.

• Wide Range of Algorithms: Includes a variety of machine learning
algorithms for tasks like classification, regression, clustering, and
dimensionality reduction.
• Ease of Use: Designed to be easy to use, with a consistent interface for
different algorithms, making it accessible for both beginners and experts.
• Built on NumPy and SciPy: Integrates well with NumPy and SciPy, ensuring
fast and efficient numerical computations.
• Model Selection: Offers tools for model selection and evaluation, including
cross-validation, grid search, and metrics to assess model performance.
• Preprocessing: Provides various preprocessing techniques to prepare data for
modeling, such as scaling, normalization, and encoding categorical variables.
• Feature Engineering: Supports feature extraction and selection to improve
model accuracy and performance.
• Community Support: Backed by a strong community and comprehensive
documentation, with many tutorials and examples available.
• Interoperability: Works well with other data science libraries like pandas
and Matplotlib, making it easy to integrate into your data analysis workflow.
• Versatile Applications: Used in various fields, including finance, healthcare,
marketing, and more, for tasks like predictive modeling, customer
segmentation, and recommendation systems.

Fig.14 : Scikit-Learn

CHAPTER 4

FEASIBILITY STUDY

This study aims to determine the technical, economic, operational, and legal
feasibility of implementing an all-encompassing framework that links software
performance evaluation and malware detection in a single system. The details gathered
here are treated as secondary information. The framework is intended to meet the
growing need for a holistic solution that can monitor system health across an ever
more complex software and control system landscape, in the face of increasingly
sophisticated cyber threats.

4.1 Technical Feasibility

We have critically analyzed the technical feasibility of the proposed framework,
including the availability of the necessary technologies and tools, integration with
existing systems and interoperability requirements, scalability and adaptability
demands, and the experience within the project team.

Technologies and Tools: The technologies and tools needed to implement the
framework, including performance monitoring solutions, malware detection algorithms,
and data analytics platforms, are available in the industry as established
standards. The project team can therefore build the framework from these existing
components.

Integration and Interoperability: The framework will be structured around a set of
well-defined interfaces and protocols that allow the performance evaluation and
malware detection modules to integrate with each other seamlessly. This in turn will
allow for more effective data exchange and the creation of a holistic view of system
health, overcoming the current market fragmentation.

Scalability and Adaptability: The framework is to be architected so that it can
scale to accommodate the increasing complexity of software systems and changes in
the cyber-attack landscape. Adaptability to future needs can be broadened by using a
modular design and incorporating cloud-based or distributed computing, among others,
which will ensure ease of reuse so that the framework remains effective over time.

Expertise & Resources: The project team will consist of subject matter experts in
software engineering, performance engineering, cyber security, and data analytics. As
a result, the team has the capability to build and implement this comprehensive
framework effectively.

4.2 Economic Feasibility

The economic feasibility has been assessed by considering cost-benefit analysis, the
pricing and revenue model, and funding and investment opportunities.

Cost-effectiveness: The framework is expected to yield substantial savings by
reducing the likelihood of data leakage, system downtime, and compliance breaches.
The performance improvements and stronger security posture are anticipated to offset
the cost of implementing and maintaining the system.

Pricing and Revenue Model: The framework can be offered to organizations either as a
subscription-based service or as licensed software. Pricing can be flexible,
depending on the requirements of target organizations at different scales. This
flexible model will help make the framework available to a large number of
customers, while also resulting in an ongoing revenue stream for the project.

Funding and Investment: This project can receive funding from a combination of
internal R&D budgets, government grants, and venture capital investments, owing to
the high market demand for such solutions and their strategic importance.

4.3 Operational Feasibility

The operational feasibility of the framework was considered with respect to
organizational readiness, training and support, and maintenance and update
considerations.

Organizational Readiness: The framework will be designed so that it can be
implemented seamlessly within the existing IT infrastructure and security practices,
without affecting ongoing operations, in order to maximize user adoption among
stakeholders.

Training and Support: Extensive user documentation, training schedules, and
dedicated technical support will be offered to ease the implementation and
application of the framework across target stakeholders.

Maintenance and Updates: The framework must define automated processes to regularly
update the detection signatures, performance optimization algorithms, and so on, so
that it can be maintained over time.

4.4 Legal And Regulatory Feasibility

The legal and regulatory feasibility of the framework has been evaluated, with
explicit emphasis on compliance with industry standards and regulations, data privacy
and security, and intellectual property rights.

Complying with Industry Standards and Regulations: The framework will be designed
with all the relevant industry standards in mind, such as the NIST SP 800 series,
and will be made compliant with data privacy and security regulations such as GDPR,
HIPAA, and PCI DSS.

Data Privacy and Security: The data processed by the framework will be subject to
robust measures that protect its confidentiality and integrity through encryption,
access controls, and audit logging.

Intellectual Property: The project team will make sure that the design and
implementation of this framework do not violate any existing patents or copyrights,
and will consider filing for patent protection of the innovative components of the
framework.

4.5 Market Feasibility

The market feasibility of the framework has been examined in terms of the target
market and customer segments, competitive advantage, and partnerships and strategic
alliances.

Target Market and Customer Segments: The framework targets organizations that
operate within the IT, financial, and health care sectors, among other critical
infrastructure industries. Screening will be compulsory in infected facilities.

Competitive Analysis: A competitive analysis for this framework will illustrate clear
differentiation from existing solutions, based on an integrated, scalable, and
adaptive approach to system health management designed specifically to meet the
needs of the intended target market.

Partnerships and Strategic Alliances: The project team will consider partnering with
one or more leading technology vendors, security solution providers, and industry
associations to increase the market penetration and credibility of the framework.

CHAPTER 5

METHODOLOGY

To detect malware, a machine learning method based on the one-sided perceptron will
be used. It will be applied to the required dataset to detect malware among the
different files present on a system.
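
The idea behind one-sided training can be illustrated with a simplified sketch. It is not the exact implementation used in this project: it assumes that the benign class is the side to be protected from false positives, so after each ordinary perceptron pass over all samples, extra passes are run over the benign samples only until none of them is flagged. The toy data, labels (+1 malware, -1 benign), and function names are hypothetical.

```python
def predict(weights, bias, x):
    """Return +1 (malware) if the score is positive, otherwise -1 (benign)."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else -1


def perceptron_epoch(samples, weights, bias, lr=1.0):
    """One pass of the classic perceptron update rule over the given samples."""
    mistakes = 0
    for x, y in samples:
        if predict(weights, bias, x) != y:
            weights = [w + lr * y * xi for w, xi in zip(weights, x)]
            bias += lr * y
            mistakes += 1
    return weights, bias, mistakes


def train_one_sided(samples, epochs=20, inner_cap=100):
    """Alternate full passes with benign-only passes (capped for safety)."""
    weights, bias = [0.0] * len(samples[0][0]), 0.0
    benign = [(x, y) for x, y in samples if y == -1]
    for _ in range(epochs):
        weights, bias, _ = perceptron_epoch(samples, weights, bias)
        for _ in range(inner_cap):
            weights, bias, mistakes = perceptron_epoch(benign, weights, bias)
            if mistakes == 0:  # no benign sample is flagged any more
                break
    return weights, bias


# Toy, linearly separable data with two non-negative features per file.
data = [((3, 1), 1), ((1, 3), -1), ((4, 2), 1), ((0, 2), -1)]
w, b = train_one_sided(data)
print(w, b, [predict(w, b, x) for x, _ in data])
```

The benign-only passes bias the decision boundary towards avoiding false alarms, at the possible cost of missing some malware on data that is not separable.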

5.1 Overview of the Methodology

The methodology uses various machine learning algorithms to address the problem of
malware present in computer systems in the form of files. These algorithms are
applied after designing a database according to the dataset. After this, analysis
and design are performed on the required dataset.

5.2 Implementation

Some of the security threats and challenges faced by the digital world have been
discussed in depth above. Keeping them in mind, we use a static analysis method to
achieve the desired outcome.

Pseudo Code for Converting PE files to CSV files:

IMPORT NECESSARY LIBRARIES

DEF RUNALGORITHM():
    # GET THE FOLDER PATH FROM THE USER
    FOLDER_PATH = INPUT("ENTER THE FOLDER PATH: ")
    # GET ALL THE FILES IN THE FOLDER
    FILES = OS.LISTDIR(FOLDER_PATH)
    # CREATE A LIST TO STORE THE MALICIOUS FILE PATHS
    MALICIOUS_FILES = []
    # ITERATE OVER EACH FILE IN THE FOLDER
    FOR FILE IN FILES:
        # CHECK IF THE FILE IS A PE FILE
        IF IS_PE_FILE(FILE):
            # EXTRACT FEATURES FROM THE PE FILE
            FEATURES = EXTRACT_FEATURES(FILE)
            # CHECK IF THE FEATURES INDICATE THAT THE FILE IS MALICIOUS
            IF IS_MALICIOUS(FEATURES):
                # ADD THE FILE PATH TO THE MALICIOUS FILES LIST
                MALICIOUS_FILES.APPEND(OS.PATH.JOIN(FOLDER_PATH, FILE))
    # RETURN THE LIST OF MALICIOUS FILES
    RETURN MALICIOUS_FILES

DEF EXTRACTFEATURE():
    # GET THE PE FILE PATH FROM THE USER
    PE_FILE_PATH = INPUT("ENTER THE PE FILE PATH: ")
    # EXTRACT FEATURES FROM THE PE FILE
    FEATURES = EXTRACT_FEATURES(PE_FILE_PATH)
    # RETURN THE FEATURES
    RETURN FEATURES

DEF BTOCSVALGORITHM():
    # GET THE MALICIOUS FILE PATHS FROM THE USER
    MALICIOUS_FILE_PATHS = INPUT("ENTER THE MALICIOUS FILE PATHS: ")
    # CONVERT THE MALICIOUS FILE PATHS TO A LIST
    MALICIOUS_FILE_PATHS = MALICIOUS_FILE_PATHS.SPLIT(",")
    # CREATE A LIST TO STORE THE CSV DATA
    CSV_DATA = []
    # ITERATE OVER EACH MALICIOUS FILE PATH
    FOR FILE_PATH IN MALICIOUS_FILE_PATHS:
        # EXTRACT FEATURES FROM THE PE FILE
        FEATURES = EXTRACT_FEATURES(FILE_PATH)
        # CREATE A CSV ROW FROM THE FEATURES
        CSV_ROW = FEATURES_TO_CSV_ROW(FEATURES)
        # ADD THE CSV ROW TO THE CSV DATA LIST
        CSV_DATA.APPEND(CSV_ROW)
    # SAVE THE CSV DATA TO A FILE
    SAVE_CSV_DATA(CSV_DATA)

Table 2: Pseudo Code for Conversion of PE Files to CSV Files
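
The conversion step in Table 2 can be sketched in runnable Python using only the standard library. This is an illustrative sketch, not the project's actual extractor: it reads just a handful of COFF header fields, the function names (`read_coff_header`, `extract_features`, `folder_to_csv`) and the CSV column set are assumptions, and the maliciousness check from the pseudo code is left out because it depends on the trained model.

```python
import csv
import os
import struct

# COFF file header layout: Machine, NumberOfSections, TimeDateStamp,
# PointerToSymbolTable, NumberOfSymbols, SizeOfOptionalHeader, Characteristics.
COFF_FORMAT = "<HHIIIHH"
FIELDS = ["Name", "Size", "Machine", "NumberOfSections",
          "TimeDateStamp", "Characteristics"]


def read_coff_header(path):
    """Return the unpacked COFF header, or None if the file is not a PE file."""
    with open(path, "rb") as handle:
        dos_header = handle.read(64)
        if len(dos_header) < 64 or dos_header[:2] != b"MZ":
            return None
        (pe_offset,) = struct.unpack_from("<I", dos_header, 0x3C)
        handle.seek(pe_offset)
        if handle.read(4) != b"PE\x00\x00":
            return None
        raw = handle.read(struct.calcsize(COFF_FORMAT))
        if len(raw) < struct.calcsize(COFF_FORMAT):
            return None
        return struct.unpack(COFF_FORMAT, raw)


def extract_features(path):
    """Build one feature row (a dict) for a PE file, or None for other files."""
    header = read_coff_header(path)
    if header is None:
        return None
    machine, sections, timestamp, _, _, _, characteristics = header
    return {"Name": os.path.basename(path), "Size": os.path.getsize(path),
            "Machine": machine, "NumberOfSections": sections,
            "TimeDateStamp": timestamp, "Characteristics": characteristics}


def folder_to_csv(folder, csv_path):
    """Write one CSV row per PE file in `folder`; return the number of rows."""
    rows = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            features = extract_features(path)
            if features is not None:
                rows.append(features)
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

A call such as `folder_to_csv("samples", "features.csv")` (both paths hypothetical) would produce one row per PE file found in the folder and skip everything else.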

Fig. 15: Flow Chart of the Proposed System

5.2.1 Preparing the Dataset

• Data Collection

The data collection process for malware detection typically involves gathering
samples from diverse sources, including malware repositories, security research
publications, and real-world incidents. The samples cover various malware types
and strains, providing a comprehensive dataset for analysis. The collected
specimens are curated to ensure that they are representative of, and relevant to,
current cybersecurity threats. Metadata attributes such as file properties,
behavioral patterns, and code signatures are extracted from the samples to form a
feature-rich dataset. A diverse and representative corpus of this kind is essential
for robust research and development in malware detection systems. For this project,
we took the dataset from Kaggle.

• Feature Extraction

Feature extraction in malware detection involves systematically extracting
pertinent attributes from the collected malware samples for subsequent analysis.
These include file characteristics such as size, type, and entropy, as well as
behavioral patterns such as system calls, network traffic, and file interactions.
Code-level attributes such as API calls, opcode sequences, and function call
graphs are also extracted to provide deeper insight into malware behavior and
functionality. Care is taken to ensure the integrity, relevance, and
representativeness of the extracted features, which facilitates accurate and
effective classification of malware instances.
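
As an illustration of one of the file characteristics mentioned above, the entropy of a file's bytes can be computed as in the following sketch. The function name `shannon_entropy` is an assumption made for this example; the report does not state how entropy was computed in the Kaggle dataset.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum of -p * log2(p) over the observed byte values.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Values close to 8 bits per byte indicate data that looks random, which is typical of compressed or encrypted content, while values near 0 indicate highly repetitive data.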

Our dataset contains 58 features.

Fig. 16: Features of the dataset

• Organizing Datasets
After importing the data and extracting the features, we organize the dataset
into legitimate and malware samples.

Fig. 17: Organizing Dataset

• Feature Selection
After organizing and dividing the dataset, we select the most important features.
The dataset consists of 58 features, but not all of them are equally important,
so we use tree-based feature selection to assign a weight to each feature and
select the most important ones.

Fig. 18: Feature Selection

From the above output we can see that, out of the 53 remaining features (after
removing the name, Filename, md5hash, Machine, and TimeDateStamp columns, which
are not necessary in our scenario), only 14 were selected as important by the
tree-based feature selection. The selected features and the weight assigned to
each of them are shown in Fig. 19.
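
Once a tree-based model has assigned a weight to every feature, the selection itself reduces to a thresholding step. The sketch below shows one simple rule, keeping the features whose weight is at least the mean weight. Both the rule and the weights are assumptions for illustration; the report does not state the threshold used, and the numbers are not those of Fig. 19.

```python
def select_by_weight(weights):
    """Keep features whose weight is at least the mean weight.

    `weights` maps feature name -> importance; returns (name, weight)
    pairs sorted from most to least important.
    """
    threshold = sum(weights.values()) / len(weights)
    kept = {name: w for name, w in weights.items() if w >= threshold}
    return sorted(kept.items(), key=lambda item: item[1], reverse=True)


# Illustrative weights only (hypothetical feature names and values).
example_weights = {
    "feature_a": 0.40,
    "feature_b": 0.35,
    "feature_c": 0.15,
    "feature_d": 0.10,
}
selected = select_by_weight(example_weights)
```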

Fig. 19: Weight of Different Features

• Splitting the DataSet
After feature selection, we split the dataset into training and testing sets. The
dataset can be divided in any ratio; here we use a 70:30 ratio, i.e., 70% of the
data for training and 30% for testing.
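
A 70:30 split can be sketched with the standard library as follows. The function name `train_test_split` and the fixed seed are assumptions for this example; it only shuffles and slices, and does not stratify by class.

```python
import random


def train_test_split(rows, test_size=0.3, seed=42):
    """Shuffle rows reproducibly and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]
```

Fixing the seed makes the split repeatable, so the accuracy of different algorithms is compared on the same test samples.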

Fig. 20: Splitting DataSet Size


5.2.2 Learning Algorithms and Their Performance
To compare and analyze the performance of the algorithms, we use the confusion
matrix and the accuracy of each algorithm.
The algorithms used in the project are:
Linear Regression
Linear Regression is a basic supervised machine learning algorithm that models
the relationship between a dependent variable and one or more independent
variables. The model assumes that the relationship between the input and output
variables is linear, so the algorithm tries to fit the line (or, in higher
dimensions, the hyperplane) that best represents the data points.
At its most basic level, linear regression does this by finding the coefficients
(slope and intercept) of a linear equation that minimize the error between the
actual and predicted values of the target variable. The most common way to do this
is least squares, which minimizes the sum of squared differences between observed
and predicted values.
The coefficients can be computed in closed form or, during training, adjusted
iteratively with techniques such as gradient descent until the model converges.
Once the model is fitted, a new value of the target is predicted by multiplying
the known feature values by the learned coefficients and adding the intercept.
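
For a single feature, the least-squares fit described above has a simple closed form. The following is a minimal sketch with hypothetical function names (`fit_line`, `predict`); it is not the code used to produce the results in this report.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Prediction for a new input: intercept plus slope times the feature."""
    return intercept + slope * x
```

For example, fitting the points (0, 1), (1, 3), (2, 5), (3, 7) recovers the line y = 2x + 1.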
Linear regression has many use cases in fields such as finance, economics, and
engineering, including predicting stock prices, estimating short-term or
long-term demand for a product, and analyzing the impact of independent variables
on a dependent variable. Although it is simple, linear regression is the building
block of more powerful regression models and remains a standard tool.

The accuracy obtained after implementation is shown in the following figure:

Fig. 21: Linear Regression

XGBoost
XGBoost stands for eXtreme Gradient Boosting. It is an advanced implementation of
gradient-boosted decision trees designed for speed and performance, and it has
proven to be one of the most effective supervised learning algorithms available
today. Like traditional gradient boosting, it adds weak learners one after another
to improve model performance, but it uses a more optimized and more strongly
regularized training procedure.
XGBoost builds an ensemble of decision trees in which each tree is built to
correct the errors of its predecessors. At the core of the process is an iterative
approach in which a loss function is optimized by sequentially adding trees to the
ensemble, with each new tree fitted to the residual errors of the trees before it.
Shrinkage (the learning rate) and tree pruning are used to avoid overfitting and
improve generalization.
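
The idea of sequentially fitting trees to residuals, with shrinkage, can be sketched in a few lines. Note that this is plain gradient boosting with one-split regression stumps on a single feature under squared error, not XGBoost itself: it has no regularization term, no second-order information, and no subsampling. All function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on one feature (squared error).

    Returns (threshold, left_value, right_value).
    Requires at least two distinct values in xs.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Fit an additive ensemble of stumps to the residuals of earlier stumps."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    trees = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_val, right_val = fit_stump(xs, residuals)
        trees.append((threshold, left_val, right_val))
        # Shrinkage: only a fraction of each stump's correction is applied.
        preds = [p + learning_rate * (left_val if x <= threshold else right_val)
                 for x, p in zip(xs, preds)]
    return base, trees


def boosted_predict(x, base, trees, learning_rate=0.1):
    """Sum the base value and the shrunken output of every stump."""
    return base + sum(learning_rate * (left if x <= t else right)
                      for t, left, right in trees)
```

With a small learning rate, each stump removes only part of the remaining error, which is why many trees are needed and why the learning rate acts as a regularizer.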
XGBoost also supports column subsampling and row subsampling to improve model
robustness and scalability. Its API supports custom loss functions and evaluation
metrics, which makes the library highly configurable for different use cases.
Because of its efficient implementation, scalability, and flexibility, XGBoost
has produced very effective solutions not only in machine learning competitions
but also in a variety of real-world problems across industries. Since it delivers
strong performance in a computationally efficient manner, XGBoost is one of the
most widely used ensemble learning techniques among data scientists and machine
learning practitioners.
The accuracy and confusion matrix obtained after implementation are shown in the
following figures:

Fig. 22: XGBoost

Fig. 23: Confusion Matrix of XGBoost
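
The accuracy and confusion matrix reported for each model in this section can be computed from the true and predicted labels as in the sketch below. The encoding 1 = malware and 0 = legitimate is an assumption made for this example, as are the function names.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels, assuming 1 = malware."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1   # malware correctly flagged
        elif truth == 0 and pred == 1:
            fp += 1   # legitimate file wrongly flagged
        elif truth == 1 and pred == 0:
            fn += 1   # malware missed
        else:
            tn += 1   # legitimate file correctly passed
    return tn, fp, fn, tp


def accuracy(tn, fp, fn, tp):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tn + fp + fn + tp)
```

Accuracy alone hides the difference between false positives and false negatives, which is why the confusion matrix is reported alongside it.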

Decision Tree
A Decision Tree is a powerful and intuitive machine learning algorithm that can
be used for both classification and regression. Without going too far into the
details, it constructs its model by recursively dividing the input feature space
into smaller subspaces until the points in each subspace are, as far as possible,
correctly classified. The result is a tree-like structure of decisions.
At every node, the algorithm chooses the best feature and the optimal split point
for that feature according to some criterion (e.g., Gini impurity or mean squared
error). This is done recursively until a stopping criterion is met (e.g., maximum
depth or minimum number of samples per leaf) or until no further gain can be
achieved by splitting.
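
The split selection at a single node can be sketched as follows for one numeric feature, using Gini impurity as the criterion. The function names are hypothetical, and a full tree would apply `best_split` recursively over all features.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Threshold on one feature that minimizes the weighted Gini impurity.

    Returns (threshold, impurity); threshold is None if no split improves
    on the impurity of the unsplit node.
    """
    n = len(values)
    best_threshold, best_score = None, gini(labels)
    for threshold in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score
```

Returning `None` when no split helps corresponds to the stopping condition described above, where no further gain can be achieved.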
Decision trees are easy to understand and to interpret visually, so the model can
be used to see how the decision for a particular input was made. However, they
tend to overfit: as a tree grows deeper, it picks up noisy idiosyncrasies of the
training instances. To avoid overfitting, techniques such as pruning can be used
to remove nodes that contribute little improvement.
Finally, decision trees are the building blocks of ensemble methods such as
Random Forests and Gradient Boosted Trees. In conclusion, decision trees are
flexible and highly interpretable, which makes them suitable for many
applications.
The accuracy and confusion matrix obtained after implementing the above model are
shown in the following figures:

Fig. 24: Decision Tree

Fig. 25: Confusion Matrix of Decision Tree

Random Forest
Random Forest is a flexible and easy-to-use ensemble learning algorithm that can
perform a wide variety of classification and regression tasks. It works by
constructing a multitude of decision trees at training time and outputting the
mode of the classes predicted by the individual trees (for classification) or
their mean prediction (for regression).

Each decision tree in the Random Forest is trained on a random subset of the
training data and a random subset of the features. This randomness decorrelates
the trees, which reduces overfitting and improves generalization performance. The
random subsets of the training data are obtained by bagging: each tree is trained
on a bootstrapped sample of the training data, which makes the model more diverse
and more robust.

During prediction, each tree in the forest makes an independent decision. The
final output is obtained by combining the predictions of all trees (mode or
mean). Ensembles of this kind have generally been found to be more stable and
accurate than individual decision trees.
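
The two ingredients described above, bootstrapped training samples and the combination of per-tree predictions, can be sketched as follows. The tree training itself is omitted, and the function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bagging sample)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Combine per-tree class predictions into one label (the mode)."""
    return Counter(predictions).most_common(1)[0][0]


def mean_prediction(predictions):
    """Combine per-tree numeric predictions for regression."""
    return sum(predictions) / len(predictions)
```

Because sampling is done with replacement, each bootstrap sample typically repeats some rows and leaves others out, which is what makes the individual trees differ from one another.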

Random Forest provides various hyperparameters for tuning the model, such as the
number of trees, the maximum depth of each tree, and the number of features
considered when making a decision at each node.

Random Forest is used in fields such as finance, healthcare, and bioinformatics
to solve problems such as credit scoring and disease diagnosis, because it copes
well with high-dimensional data, non-linear relations, and noisy data.
The accuracy obtained after implementing the above model is shown in the
following figure:

Fig. 26: Random Forest

Fig. 27: Confusion Matrix of Random Forest

K-Nearest Neighbors (KNN)


K-Nearest Neighbors (KNN) is a simple but powerful supervised machine learning
algorithm used for both classification and regression. KNN forms its prediction
for a new data point from the k nearest neighbors of that point in feature space:
the majority class (for classification) or the average value (for regression).
KNN has no explicit training phase in which weights are computed; instead, it
stores the entire training dataset. To classify or predict a new data point, it
calculates the distance between that point and every point in the training set
using a distance metric, most often the Euclidean distance, and then selects the
k training points with the smallest distances.
For classification, a majority vote among the k nearest neighbors assigns the
class label to the new data point. For regression, the predicted value is the
average of the target values of the k nearest neighbors.
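
The classification procedure described above fits in a few lines. This is a brute-force sketch using the Euclidean distance; the function name `knn_predict` is hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training point, smallest first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]
```

Every prediction scans the whole training set, which is the source of the computational cost that grows with dataset size.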
KNN therefore depends on the value of k and on the choice of distance metric. A
smaller value of k gives a more flexible model that may suffer from overfitting;
a larger value of k gives a smoother decision boundary but may underfit, because
local structure in the data is averaged away.
Although it is easy to understand, KNN can be computationally expensive and
time-consuming, particularly for large datasets, because the distances from the
query point to all training points must be calculated to find the nearest
neighbors. Nevertheless, its intuitiveness and simple implementation allow for
wide usage in machine learning projects, particularly those that involve small
datasets.
The accuracy and confusion matrix obtained after implementation are shown in the
following figures:

Fig. 28: KNN

Fig. 29: Confusion Matrix of KNN

Logistic Regression
Logistic Regression is a common parametric linear classification algorithm for
binary (two-class) classification. It estimates the probability that an event
will or will not happen. Despite what its name suggests, logistic regression is a
classification algorithm, not a regression one.
Logistic regression models the relationship between the independent variables
(features) and the probability that the target variable belongs to a certain
class using the logistic (sigmoid) function; for this reason it is also known as
logit regression. The sigmoid function maps any real-valued number into the range
(0, 1), which makes it a convenient way to represent probabilities.
During training, logistic regression learns a coefficient (weight) for each
feature through maximum likelihood estimation, using a method such as gradient
descent. The coefficient of a feature represents the change in the log-odds of
the positive class for a one-unit change in that feature.
The learned coefficients are used to make predictions by passing the input
features through the trained logistic function to obtain a predicted probability.
This probability is then mapped onto a binary class label: instances whose
predicted probability is greater than or equal to some threshold (often 0.5) are
assigned to the positive class.
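
The prediction step described above can be sketched as follows. The weights and bias are assumed to have been learned already, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function: maps a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_label(features, weights, bias, threshold=0.5):
    """Map the predicted probability onto a binary class label."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0
```

Lowering the threshold flags more samples as positive, trading more false positives for fewer false negatives.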
Logistic regression is easy to interpret, easy to implement, and works well when
the relationship between the features and the target variable is linear or close
to linear. It is widely used in many fields, e.g., predicting disease occurrence
(Weng et al. 2017), credit risk scoring (Crook et al. 2007), and customer churn
prediction (Chan and Baumgartner 2007). It is also commonly employed in industry.
The accuracy obtained after implementation is shown in the following figure:

Fig. 30: Logistic regression

Naive Bayes
Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem.
It is "naive" in the sense that, to keep things simple, it assumes conditional
independence between every pair of features given the class. Despite this
simplification, Naive Bayes often delivers robust performance, and it is widely
used on text data, for tasks such as spam filtering, even though the independence
assumption does not strictly hold there.

The algorithm uses Bayes' theorem, which states that the probability of a class
C given a set of features x_1, ..., x_n is proportional to the probability of
those features given the class multiplied by the prior probability of the class:

$$P(C \mid x_1, \dots, x_n) \propto P(x_1, \dots, x_n \mid C)\, P(C)$$

The "naive" assumption makes these probabilities practical to estimate from
labeled training data. Because the features are assumed to be independent given
the class, the product rule of probability reduces the joint conditional
probability to a product of per-feature terms:

$$P(x_1, \dots, x_n \mid C) = \prod_{i=1}^{n} P(x_i \mid C)$$

Although this is overly simplistic, the model performs well when the assumption
holds almost exactly, and often even when the independence assumption is violated
to some extent.
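
A Gaussian variant of this classifier can be sketched as follows. The function names are hypothetical, log-probabilities are used to avoid numerical underflow, and a small constant is added to each variance so that a feature with zero variance does not cause a division by zero.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Per class: prior probability and per-feature mean and variance."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        columns = list(zip(*members))
        means = [sum(col) / len(col) for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / len(col) + 1e-9  # variance floor
            for col, m in zip(columns, means)
        ]
        model[label] = (len(members) / len(rows), means, variances)
    return model


def predict_nb(model, row):
    """Return the class with the highest log prior plus log likelihoods."""
    def log_score(prior, means, variances):
        score = math.log(prior)
        # Independence assumption: per-feature log likelihoods simply add.
        for x, m, v in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return score

    return max(model, key=lambda label: log_score(*model[label]))
```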

Naive Bayes classifiers come in different variants, which differ mainly in the
assumptions they make about the distribution of the features. Gaussian Naive
Bayes assumes that continuous features follow a Gaussian distribution,
Multinomial Naive Bayes assumes that the features are generated from a
multinomial distribution, and Bernoulli Naive Bayes is intended for binary
data [4].

Naive Bayes is computationally fast, simple to implement, and works well on small
datasets. It is particularly suitable for high-dimensional data such as text
classification, where the number of dimensions is very large because of the huge
number of possible words.
The accuracy and confusion matrix obtained after implementation are shown in the
following figures:

Fig. 31: Naïve Bayes

Fig. 32: Confusion Matrix of Naïve Bayes

Stochastic Gradient Descent (SGD)


Stochastic Gradient Descent (SGD) is a widely used optimization algorithm for
training machine learning models, especially on large datasets. Unlike
traditional gradient descent, which computes the gradient over the entire dataset
at every iteration, SGD updates the parameters in small steps using small,
randomly selected samples of the data.
In every iteration, SGD computes the gradient of the loss function with respect
to the model parameters using only a single example or mini-batch. It then takes
a step in the direction opposite to that gradient, scaled by a learning rate.
This process is repeated until convergence or until some stopping criterion is
met.
By using random samples, SGD introduces randomness into the optimization process,
which can help it escape local minima and can lead to faster convergence,
especially in high-dimensional parameter spaces. However, this randomness also
gives rise to noisy updates, which means that choosing an appropriate learning
rate (and hyperparameters more generally) is crucial.
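
The update rule described above can be sketched for a one-feature linear model trained with squared error. The function name `sgd_linear` and the default hyperparameters are assumptions chosen for this toy example, not the settings used in the project.

```python
import random


def sgd_linear(xs, ys, learning_rate=0.05, epochs=500, batch_size=2, seed=0):
    """Fit y = w * x + b with mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Step against the gradient, scaled by the learning rate.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b
```

On the noise-free points (0, 1), (1, 3), (2, 5), (3, 7) this converges towards w = 2 and b = 1; with too large a learning rate the same loop would diverge, which illustrates why that hyperparameter matters.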
Although SGD itself is simple, it is the basis of a family of powerful
optimization methods. Variants include mini-batch SGD, SGD with momentum, and
algorithms that automatically adapt the learning rate for each weight, such as
Adagrad and Adam. SGD is commonly used to train deep learning models (especially
neural networks) because of its scalability and efficiency. It handles large
datasets and high-dimensional parameter spaces well, which makes it a standard
tool for machine learning practitioners.
The accuracy obtained after implementation is shown in the following figure:

Fig. 33: Stochastic gradient descent

Fig. 34: Confusion Matrix of SGD

Support Vector Machine (SVM)


Support Vector Machine (SVM) is a supervised machine learning algorithm that can
be used for classification and regression. It works by looking for the best
hyperplane that separates the data points of different classes in a
high-dimensional space. The main idea of SVM is to maximize the margin, defined
as the gap between the hyperplane and the closest data point of each class.
In classification, the aim is to find the hyperplane that maximizes the margin
while all data points lie outside the margin on the correct side (or, with soft
margins, with some tolerance for violations). SVM works with both linearly
separable and non-linearly separable data. For the latter, it uses the kernel
trick, in which the features are implicitly mapped into a higher-dimensional
space where the data become linearly separable.
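
The quantities involved can be sketched as follows for a linear SVM with labels +1 and -1. The weights and bias are assumed to be given, since training (the quadratic optimization) is omitted. The function names are hypothetical, and the RBF kernel is shown only as one common example of a kernel function.

```python
import math


def decision_value(x, weights, bias):
    """Signed score w.x + b; its sign is the predicted class (+1 or -1)."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def hinge_loss(x, y, weights, bias):
    """Zero when the point is on the correct side with margin at least 1."""
    return max(0.0, 1.0 - y * decision_value(x, weights, bias))


def margin_width(weights):
    """Distance between the two margin hyperplanes: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(w * w for w in weights))


def rbf_kernel(a, b, gamma=1.0):
    """Similarity that equals an inner product in an implicit feature space."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
```

Soft-margin training balances a wide margin (small weights) against the total hinge loss of the points that violate the margin.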
SVMs are resistant to overfitting, especially in high-dimensional spaces, and
generalize well to unseen data. They are also well suited to datasets with a
large number of features, which makes them versatile for applications such as
image classification, text classification, and bioinformatics.
However, SVMs are memory-intensive and may not be the best choice for very large
datasets, because training requires solving a quadratic optimization problem,
which is computationally expensive. Nevertheless, they remain widely used in a
variety of fields because of their strong theoretical foundation.

The accuracy obtained after implementation is shown in the following figure:

Fig. 35: Support Vector Machine

Fig. 36: Confusion Matrix of SVM

CHAPTER 6
CONCLUSIONS
6.1 Conclusions

Putting it all together, we found that malware attacks are a serious threat to
any personal computer system and cause a number of problems and interruptions.
Detecting and removing malware is the first step to keeping a computer clean and
functioning correctly. This research effort works towards the use of machine
learning techniques to minimize error and accurately identify malware within a
system. Ideally, a correctly implemented malware detection mechanism would
produce no false positives, making the system stronger against possible threats.
The empirical results show, however, that even though the research goals were
achieved to a large extent, a non-zero false positive rate remains, which
confirms the complexity of the problem. Still, the proposed framework emerges as
a viable basis for a commercial product that combines a number of related
features. The framework helps to detect malware in files, which in turn increases
the security level of a system.

Fig. 37: Accuracy of Different Machine Algorithm

Secondly, as seen in Fig. 37, a whole set of algorithms was compared. The
algorithms used are KNN, Decision Tree, Random Forest, Linear Regression, XGBoost
(eXtreme Gradient Boosting), LightGBM (Light Gradient Boosting Machine), Logistic
Regression, Naïve Bayes, Stochastic Gradient Descent, and Support Vector Machine
(SVM). It is important to notice that a few algorithms, namely Linear Regression,
XGBoost, LightGBM, and Decision Tree, are more accurate in malware detection than
the other algorithms.

Fig. 38: Graph of Different ML techniques and their Accuracy

Therefore, these algorithms are considered the most preferable for addressing the
malware threat. In the future, with further improvements and optimizations of
these algorithms, malware detection can be made more accurate still. To conclude,
the implementation of machine learning in cybersecurity is a major development
that provides an opportunity to increase protection against malware attacks.

6.2 Future Scope

Because technology is changing rapidly, there is a lot that can still be done to
improve the proposed model:

• The algorithms are good enough on their own, but nowadays neural networks have
a vital role to play in classification problems.
• Instead of classical machine learning algorithms, neural networks can be
implemented, as they are far better in unsupervised learning.
• Specific feature selection could also be done in order to remove the remaining
false positives.

6.3 Applications

The implications for security are wide and far-reaching. Because it can detect
malicious files, this technology can be integrated into the software of many
companies, as the industry as a whole is moving in that direction.

REFERENCES

[1] Sanjay Chakraborty and Lopamudra Dey. A rule-based probabilistic
technique for malware code detection. Multiagent and Grid Systems – An
International Journal, IOS Press, 12, 2016, pp. 271–286. DOI
10.3233/MGS-160254
[2] Y. Zhou, Z. Wang, W. Zhou, and X. Jiang. Hey, you, get off of my market:
Detecting malicious apps in official and alternative android markets. in NDSS,
vol. 25, no. 4, 2012, pp. 50–52.
[3] D. Keragala. Detecting malware and sandbox evasion techniques, SANS
Institute InfoSec Reading Room, 2016. URL:
https://www.sans.org/reading-room/whitepapers/forensics/detecting-malwaresandbox-evasion-techniques-36667
[4] Sharif, M., Yegneswaran, V., Saidi, H., Porras, P., and Lee, W. Eureka: A
framework for enabling static malware analysis. In Computer security-
ESORICS 2008, pages 481- 500. Springer.
[5] Moser, A., Kruegel, C., and Kirda, E. Limits of static analysis for malware
detection. In Computer security applications conference, ACSAC 2007.
Twenty-third annual, 2007, pages 421-430.
[6] Egele, M., Scholte, T., Kirda, E., and Kruegel, C. A survey on automated
dynamic malware-analysis techniques and tools. ACM Computing Surveys
(CSUR), 2012, 44(2):6.
[7] Ahmad, S., Ahmad, S., Xu, S., and Li, B. Next generation malware analysis
techniques and tools. In Electronics, Information Technology and
Intellectualization: Proceedings of the International Conference EITI 2014,
Shenzhen, 16-17 August 2015, page 17. CRC Press.
[8] Gorecki, C., Freiling, F. C., Kuhrer, M., and Holz, T. Trumanbox: Improving
dynamic malware analysis by emulating the internet. In Stabilization, Safety,
and Security of Distributed Systems, Springer, 2011, pages 208-222.
[9] Cisco(2015). What is the difference: Viruses, worms, trojans, and bots?
Online. http://www.cisco.com/web/about/security/intelligence/virus-worm-diffs.html
[10] Radu-Stefan Pirscoveanu. Clustering Analysis of Malware Behaviour.
Master Thesis, Department of Electronic Systems at Aalborg University, 139
pages, 2015.
[11] Symantec. What are malware, viruses, spyware, and cookies, and what
differentiates them? 2009. Online.
http://www.symantec.com/connect/articles/what-are-malware-viruses-spyware-andcookies-and-what-differentiates-them
[12] Zimba, A., Wang, Z., & Chen, H. Multi-stage crypto ransomware attacks: A
new emerging cyber threat to critical infrastructure and industrial control
systems. ICT Express, 2018.
[13] Kateryna Chumachenko. Machine Learning Methods for Malware Detection
and Classification. Bachelor's Thesis in Information Technology. University
of Applied Sciences, 93 pages, 2017.
[14] Babak Bashari Rad, Maslin Masrom, and Suhaimi Ibrahim. Evolution of
computer virus concealment and anti-virus techniques: a short survey. arXiv
preprint arXiv:1104.1070, 2011.
[15] Babak Bashari Rad, Maslin Masrom, and Suhaimi Ibrahim. Camouflage in
malware: from encryption to metamorphism. International Journal of
Computer Science and Network Security, 12(8):74–83, 2012.
[16] Kent Griffin, Scott Schneider, Xin Hu, and Tzi-Cker Chiueh. Automatic
generation of string signatures for malware detection. In Recent advances in
intrusion detection, pages 101–120. Springer, 2009
[17] Nhat-Phuong Tran and Myungho Lee. High performance string matching for
security applications. In ICT for Smart Society (ICISS), 2013 International
Conference on, pages 1–5. IEEE, 2013
[18] David Harley and Andrew Lee. Heuristic analysis–detecting unknown
viruses. Technical report, 2007 (Date last accessed 31-May-2018).
[19] Kirti Mathur and Saroj Hiranwal. A survey on techniques in detection and
analyzing malware executables. International Journal of Advanced Research
in Computer Science and Software Engineering, 3(4):422–428, 2013
[20] Ming Xu, Lingfei Wu, Shuhui Qi, Jian Xu, Haiping Zhang, Yizhi Ren, and
Ning Zheng. A similarity metric method of obfuscated malware using
function-call graph. Journal of Computer Virology and Hacking Techniques,
9(1):35–47, 2013.
[21] J-Y Xu, Andrew H Sung, Patrick Chavez, and Srinivas Mukkamala.
Polymorphic malicious executable scanner by api sequence analysis. In
Hybrid Intelligent Systems, 2004. HIS’04. Fourth International Conference
on, pages 378–383. IEEE, 2004.
[22] Tom M Mitchell. Machine learning. WCB, McGraw-Hill Boston, MA, 1997.
[23] Robert Moskovitch, Yuval Elovici, and Lior Rokach. Detection of unknown
computer worms based on behavioral classification of the host. Computational
Statistics & Data Analysis, 52(9):4544– 4566, 2008.
[24] Mamoun Alazab, Sitalakshmi Venkatraman, Paul Watters, and Moutaz
Alazab. Zero-day malware detection based on supervised learning algorithms
of api call signatures. In Proceedings of the Ninth Australasian Data Mining
Conference-Volume 121, pages 171–182. Australian Computer Society, Inc.,
2011.
[25] Muazzam Siddiqui, Morgan C Wang, and Joohan Lee. A survey of data
mining techniques for malware detection using file features. In Proceedings of
the 46th annual southeast regional conference on xx, pages 509–510. ACM,
2008.
[26] Grégoire Jacob, Hervé Debar, and Eric Filiol. Behavioral detection of
malware: from a survey towards an established taxonomy. Journal in
computer Virology, 4(3):251–266, 2008.
[27] Daly, M.K., “Advanced persistent threat”, Usenix, Nov. 2009.
[28] Quick heal quarterly threat report q2: Technical report, 2015. Quick Heal,
Feb. 2015.
[29] Aimoto, S., AlKhatib, T., Coogan, P., Corpin, M., DiMaggio, “Internet
security threat report”, Technical report, Symantec Corporation, 2017.
[30] A. K. Verma and S. K. Sharma, "Malware Detection Approaches using
Machine Learning Techniques - Strategic Survey," 2021 3rd International
Conference on Advances in Computing, Communication Control and
Networking (ICAC3N), Greater Noida, India, 2021, pp. 1958-1962
[31] Ye, Y., Wang, D., Li, T., Ye, D., “IMDS: intelligent malware detection
system”, in Proceedings of the 13th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pp. 1043-1047, 2010.
[32] M. Zubair Shafiq, Syed Ali Khayam, and Muddassar Farooq. “Embedded
Malware Detection Using Markov n-Grams”, in Detection of Intrusions and
Malware, and Vulnerability Assessment. Springer Berlin, Heidelberg, 2008.
[33] R. Moskovitch, C. Feher, N. Tzachar, E. Berger, M. Gitelman, “Unknown
malcode detection using OPCODE representation”, Intelligence and Security
Informatics, pp 204-215, 2008.
[34] K. Griffin, S. Schneider, X. Hu, T.-c. Chiueh, “Automatic generation of
string signatures for malware detection”, pp. 101–120, 2009.
[35] Nataraj, L., Karthikeyan, S., Jacob, G., Manjunath, B.S., “Malware images:
visualization and automatic classification”, in Proceedings of the 8th
International Symposium on Visualization for Cyber Security. ACM, New
York, USA, pp. 4:14, 2011.
[36] A. Shabtai, R. Moskovitch, C. Feher, S. Dolev, Y. Elovici, “Detecting
unknown malicious code by applying classification techniques on OpCode
patterns”, Security Information, 2012.
[37] Chandrasekar Ravi, R Manoharan, “Malware Detection using Windows Api
Sequence and Machine Learning”, International Journal of Computer
Applications, Volume 43, April 2012.
[38] I. Santos, J. Devesa, F. Brezo, J. Nieves, P. G. Bringas, “Opem: A static-
dynamic approach for machinelearning-based malware detection”, in CISIS
’12- ICEUTE´ , page 271-280, 2013.
[39] P. M. Comar, L. Liu, S. Saha, P. N. Tan, A. Nucci, “Combining supervised
and unsupervised learning for zero-day malware detection”, in: Proceedings
INFOCOM, IEEE, pp. 2022–2030, 2013.
[40] Uppal, D., Sinha, R., Mehra, V., Jain, V., “Malware detection and
classification based on extraction of API sequences”, in Proceedings of
International Conference on Advanced Computer Commun. Informatics, pp.
2337-2342, 2013.
[41] Salehi, Z., Sami, A., Ghiasi, M., “Using feature generation from api calls for
malware detection”. Computer Fraud & Security, pp. 9–18 2014.
[42] Joseph Sexton, Curtis Storlie, Blake Anderson, “Subroutine based detection
of APT malware”, Journal of Computer Virology Hacking Techniques, pp 1-
9, 2015.
[43] Saxe, J., Berlin, K., “Deep neural network based malware detection using
two dimensional binary program features”. In 10th International Conference
on Malicious and Unwanted Software (MALWARE). IEEE, pp. 11-20, 2015.
[44] U. Narra, F. Di, T. Visaggio, A. Corrado, T.H. Austin, M. Stamp,
“Clustering versus SVM for malware detection”, Journal of Computer
Virology Hacking Techniques, pp 213–224, 2016.
[45] Ahmadi, M., Ulyanov, D., Semenov, S., Trofimov, M., Giacinto, “Novel
feature extraction, selection and fusion for effective malware family
classification”, in proceedings of the Sixth ACM Conference on Data and
Application Security and Privacy, pp. 183–194, 2016.
[46] Kolosnjaji, B., Zarras, A., Webster, G., Eckert, “Deep learning for
classification of malware system call sequences”, in Lecture Notes of
Computer Science (Including Subseries Lecture Notes in Artificial
Intelligence and Lecture Notes in Bioinformatics), pp. 137-149, 2016.
[47] B.N. Narayanan, O. Djaneye-Boundjou, T.M. Kebede, “Performance
analysis of machine learning and pattern recognition algorithms for Malware
classification”, in IEEE National Aerospace and Electronics Conference
(NAECON) and Ohio Innovation Summit (OIS), pp. 338–342, 2016.
[48] S.D. Nikolopoulos, I. Polenakis, “A graph-based model for malware
detection and classification using system-call groups”, Journal of Computer
Virology and Hacking Techniques, pp. 29–46, 2017.
[49] Xu, Z., Ray, S., Subramanyan, P., Malik, “Malware detection using machine
learning based analysis of virtual memory access patterns”, Design,
Automation & Test in Europe Conference & Exhibition, IEEE, pp. 169–174,
2017.
[50] E. Raff, J. Sylvester, and C. Nicholas, “Learning the PE Header, Malware
Detection with Minimal Domain Knowledge”, in Proceedings of the 10th
ACM Workshop on Artificial Intelligence and Security, NY, USA: ACM, pp.
121–132, 2017.
[51] Kotov, V., Wojnowicz, “Towards generic deobfuscation of Windows API
calls”, arXiv preprint arXiv:1802.04466, 2018.
[52] Q. Le, O. Boydell, B. Mac Namee, M. Scanlon, “Deep learning at the shallow end:
Malware classification for non-domain experts”, Digital Investigation,
pp. S118–S126, 2018.
[53] Nguyen, M.H., Le Nguyen, D., Nguyen, X.M., Quan, T.T., “Autodetection
of sophisticated malware using lazy-binding control flow graph and deep
learning”, Computers & Security, pp. 128-155, 2018.

[54] M. Krcal, O. Svec, M. Balek, and O. Jasek, “Deep Convolutional Malware
Classifiers Can Learn from Raw Executables and Labels Only,” in ICLR
Workshop, 2018.
[55] Ni, S., Qian, Q., Zhang, R., “Malware identification using visualization
images and deep learning”, Computers & Security, 2018.
[56] Hemant Rathore, Swati Agarwal, Sanjay K. Sahay and Mohit Sewak,
“Malware Detection using Machine Learning and Deep Learning”,
International Conference on Big Data Analytics, Springer, LNCS, Vol. 11297,
pp. 402-411, 2018.
[57] O. Suciu, S. E. Coull, and J. Johns, “Exploring adversarial examples in
malware detection”, in IEEE Security and Privacy Workshops (SPW),
pp. 8–14, 2019.
[58] Yuxin, D., Siyi, Z., “Malware detection based on deep learning algorithm”,
Neural Comput. Appl. 31 (2), pp. 461–472, Feb 2019.
[59] M. Rabbani, Y.L. Wang, R. Khoshkangini, H. Jelodar, R. Zhao, P. Hu, “A
hybrid machine learning approach for malicious behaviour detection and
recognition in cloud computing”, Journal of Network and Computer
Applications, 2020.
[60] C. Yucel, A. Koltuksuz, “Imaging and evaluating the memory access for
malware”, Forensic Science International Digital Investigation, 2020.
[61] Vasan, D., Alazab, M., Wassan, S., Safaei, B., Zheng, Q., “Windows
malware detection using anomaly detection algorithms”, 2021.
[62] Kumar, A., Singh, B., Gupta, D., “Hybrid Malware Detection for
Android using Static and Dynamic Analysis with RNN”.
[63] D. Gibert, A. Smith, and M. Jones, “Enhanced Malware Detection through
Interpretable Deep Learning and Gradient Boosting”, in Proceedings of the
2023 IEEE Symposium on Security and Privacy (SP), pp. 123-137, 2023.
