Group 7 Report
BACHELOR OF TECHNOLOGY
IN
Signature Supervisor
ACKNOWLEDGEMENT
We would like to express our special thanks to our mentor, Mr. Ashwini Kumar Verma,
Assistant Professor in the Department of Computer Science & Engineering, for the
time and effort he provided throughout the semester. His advice and suggestions were
very helpful to us during the completion of the project, and we are deeply grateful
to him.
Finally, we would like to thank our parents, who supported us sincerely at every
stage of our careers and were always there for us in difficult times. We were able to
complete this project thanks in part to their efforts. Our parents are our real
support and our pillars.
We acknowledge that this project was completed entirely by us and not by anyone
else.
Aayush Kumar
Abhishek Saxena
Anjila Choudhary
Azharuddin Alam
TABLE OF CONTENTS
BONAFIDE CERTIFICATE
ACKNOWLEDGEMENT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS AND ABBREVIATIONS
ABSTRACT
1. INTRODUCTION
    1.1 Problem Definition
    1.2 Aim and Objective
    1.3 Malware
        1.3.1 Types of Malware
        1.3.2 Types of Malware Analysis
        1.3.3 Types of Malware Detection
    1.4 Structure of Project
2. LITERATURE REVIEW
    2.1 Background Study
        2.1.1 Comparative Analysis of Research Papers
3. SYSTEM ANALYSIS
    3.1 Tools and Technology Requirements
        3.1.1 Python
        3.1.2 Google Colab
        3.1.3 PyCharm
        3.1.4 Pandas
        3.1.5 NumPy
        3.1.6 Scikit-Learn
4. FEASIBILITY STUDY
    4.1 Technical Feasibility
    4.2 Economical Feasibility
    4.3 Operation Feasibility
    4.4 Legal and Regulatory Feasibility
    4.5 Market Feasibility
5. METHODOLOGY
    5.1 Overview of Methodology
    5.2 Implementation
        5.2.1 Preparing the Dataset
        5.2.2 Learning Algorithms and Their Performance
6. CONCLUSIONS
    6.1 Conclusions
    6.2 Future Scope
    6.3 Applications
REFERENCES
LIST OF FIGURES
Fig 9 - Python
Fig 11 - PyCharm
Fig 12 - Pandas
Fig 13 - NumPy
Fig 14 - Scikit-learn
Fig 22 - XGBoost
Fig 26 - Random forest
Fig 28 - KNN
LIST OF TABLES
LIST OF ACRONYMS AND ABBREVIATIONS
ML - Machine Learning
AI - Artificial Intelligence
Fig - Figure
Colab - Colaboratory (Google Colab)
OS - Operating System
ABSTRACT
Breakthroughs in internet technology and computer networking have made high-speed
shared internet access widely available. As a result, the number of computer systems
that are potential prey for malware attacks grows every day. Malware, short for
malicious software, is a program specifically engineered to cause damage to a
computer, server, or network. It can also send data to bad actors and let them access
a user's information or systems without permission. Common types of malware
include adware, fileless malware, viruses, worms, trojans (trojan horses), bots,
ransomware, and spyware. The first step in this project was to collect data on PE
files. PE stands for Portable Executable, the file format used for executables and
several other binary file types in 32-bit and 64-bit versions of Windows operating
systems. The PE format is based on the Common Object File Format (COFF) and covers
executables (.exe), dynamic-link libraries (.dll), and kernel-mode drivers (.sys).
The next phase was to extract important features from these PE files into CSV
format, such as file name, file size, file characteristics, image version
information, OS version information (where available), and MD5 hash. With this
enriched dataset in CSV form, we apply a variety of machine learning algorithms to
identify patterns and derive value from it. Notable algorithms used in this phase
include Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Random Forest.
Each algorithm was chosen because it is well suited to classifying and analysing the
features extracted from PE files. Through this iterative procedure, we aimed to build
a comprehensive understanding of the features and hazards of the PE files under
investigation, in order to improve practical malware detection and security on
Windows systems. The strategy is made more effective by ensemble methods, which
combine multiple machine learning models, such as random forests and support vector
machines, to improve prediction success. In addition, applying deep learning
methods, including convolutional neural networks (CNNs) and recurrent neural
networks (RNNs), to the PE file information can reveal intricate patterns and
connections and further improve the malware detection program.
Chapter 1
INTRODUCTION
Today it is no longer possible to imagine a world without the internet, and with the
pace at which technology is growing, even the concept of the cyborg appears more
real. We are surrounded by technology: mobile phones, laptops, smartwatches, and
transactions made in digital money rather than cash. The Internet of Things is being
worked on extensively, and a great deal of research, time, and energy is going into
building smart houses, smart cities, and the like. All of this is built on data; data
is the key. Technology is not flawless, however; it has its own flaws and
vulnerabilities. This project focuses only on laptops and desktops, and of these,
laptops account for the larger share of users. Of all the operating systems
available, Windows holds the largest share: NetMarketShare reports that 76.32% of
these users run Windows.
In our modern digital world, malicious software is one of the most acute problems,
and it is only getting worse. Malware is any software designed to cause damage to a
computer system or network; its uses include, but are not limited to, espionage and
profit. Recently, malware has started to target embedded computing platforms such as
IoT devices, medical equipment, and E&I control systems. This is a result of the
increasing prevalence of such platforms, which are built around embedded computing
devices. Today's malware is complex and highly diverse: numerous strains can modify
not only their code but also their behavior, which makes them difficult to identify.
Because such an enormous variety of malicious software exists, relying only on
signature-based protection is not enough, and a broader set of detection techniques
is needed.
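The limitation of signature-only protection can be illustrated with a minimal
sketch in which the signature database is modelled as a set of MD5 hashes. The
sample bytes and the names KNOWN_BAD_HASHES and is_known_malware are hypothetical
and are not taken from the project code; real antivirus signatures are more
elaborate than a whole-file hash.

```python
import hashlib


def md5_of(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical payload, used only to populate the signature set.
original_sample = b"MZ\x90\x00 hypothetical malicious payload"
KNOWN_BAD_HASHES = {md5_of(original_sample)}


def is_known_malware(data: bytes) -> bool:
    """Signature check: flag the file only if its hash is already known."""
    return md5_of(data) in KNOWN_BAD_HASHES


# A variant that differs by one appended byte has a different hash.
variant_sample = original_sample + b"\x00"

print(is_known_malware(original_sample))  # True
print(is_known_malware(variant_sample))   # False
```

Because a single changed byte yields a different hash, a strain that rewrites its
own code slips past this kind of check, which is the motivation for learning from
features instead of exact signatures.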
Malware families share common traits that can be observed either statically or at
run time across their many variants. These patterns can be used to distinguish
families of malware. Static analysis comprises techniques for examining the content
of malicious files without running them. Dynamic analysis, by contrast, observes the
behavior of malicious files while they execute, for example by tracking information
flows and function calls and by dynamic binary instrumentation. Machine learning
techniques can be applied to a combination of static and behavioral artifacts to
represent the continually changing structure of modern malware. This makes it
feasible to identify sophisticated malware attacks that are almost impossible to
detect with a signature-based approach. Current machine learning strategies
outperform signature-based ones at detecting newly released malware, also known as
zero-day malware. Furthermore, deep learning algorithms offer even better
opportunities for feature extraction and representation because they learn features
implicitly.
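As an illustration of static analysis, the sketch below reads a few fields from
the COFF file header of a PE file using only the Python standard library, without
executing the file. The function names, the choice of fields, and the synthetic
header built by build_minimal_pe_header are assumptions made for this example; the
project's own feature extraction may differ.

```python
import struct

IMAGE_FILE_DLL = 0x2000  # Characteristics flag that marks a DLL


def extract_static_features(data: bytes) -> dict:
    """Read a few COFF header fields from PE bytes without running them."""
    if len(data) < 64 or data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ header")
    # e_lfanew (offset 0x3C) holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("not a PE file: missing PE signature")
    # The 20-byte COFF file header follows the 4-byte signature.
    (machine, n_sections, timestamp, _symtab, _nsyms,
     opt_header_size, characteristics) = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4)
    return {
        "file_size": len(data),
        "machine": machine,
        "number_of_sections": n_sections,
        "timestamp": timestamp,
        "size_of_optional_header": opt_header_size,
        "characteristics": characteristics,
        "is_dll": bool(characteristics & IMAGE_FILE_DLL),
    }


def build_minimal_pe_header() -> bytes:
    """Build a synthetic header (not a runnable program) for demonstration."""
    dos_header = bytearray(64)
    dos_header[:2] = b"MZ"
    struct.pack_into("<I", dos_header, 0x3C, 64)  # PE signature at offset 64
    # Machine 0x014C is x86; 0x0102 = executable image + 32-bit machine.
    coff = struct.pack("<HHIIIHH", 0x014C, 3, 0, 0, 0, 224, 0x0102)
    return bytes(dos_header) + b"PE\x00\x00" + coff


features = extract_static_features(build_minimal_pe_header())
print(features["number_of_sections"], features["is_dll"])  # 3 False
```

Each returned dictionary corresponds to one row of a feature table, so a list of
such dictionaries can be written out as CSV for the learning stage.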
Machine learning is also important for dealing with potential cyber security risks
more broadly. Once a firm has gathered the relevant data and must compare millions
of data points about its clients, employees, and partners, it is evident that people
alone cannot spot every possible threat. Machine learning algorithms are useful
tools for examining large data sets and predicting possible future security
problems. Developing a dedicated ML tool can therefore be a beneficial measure, for
example for the purpose of identifying phishing websites.
Given the current rapid changes in cyber threats, machine learning could become a
significant part of proper preventive measures. When faced with the large amounts of
data generated by network traffic, system logs, and user activity, one can rely on
automated systems that make timely use of the available information.
Such an ML tool can help security engineers protect clients and the firm from the
cyber threats that occur every day and cannot be left to be addressed late. It can
also identify phishing websites that are new to cyberspace and provide information
about newly appearing addresses.
Machine learning models for phishing detection can, for example, use natural
language processing to determine whether a website contains suspicious phrases such
as "your account has been suspended", asks for private information, requests that
sensitive data be sent via email, or has poor grammar. Other features, such as the
age of the URL, the presence of an SSL certificate, and the reputation score of the
website, can also be used.
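A minimal sketch of how such signals could be turned into numeric features is
given below. The phrase list, the feature names, and the use of the https scheme
as a stand-in for an SSL certificate check are assumptions for illustration; URL
age and reputation scores are omitted because they require external lookups.

```python
from urllib.parse import urlparse

# Hypothetical phrase list; a real system would learn or curate these.
SUSPICIOUS_PHRASES = (
    "account has been suspended",
    "verify your password",
    "send via email",
)


def phishing_features(url: str, page_text: str) -> dict:
    """Turn a URL and its page text into simple features for a classifier."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    text = page_text.lower()
    return {
        # Assumption: the https scheme is used as a rough proxy for SSL.
        "uses_https": parsed.scheme == "https",
        # A host made only of digits and dots is a raw IPv4 address.
        "host_is_ip": host.replace(".", "").isdigit(),
        "url_length": len(url),
        "suspicious_phrase_count": sum(
            phrase in text for phrase in SUSPICIOUS_PHRASES
        ),
    }


example = phishing_features(
    "http://192.168.0.1/login",
    "Your account has been suspended. Verify your password now.",
)
print(example)
```

The resulting dictionary can be fed to any of the classifiers discussed in this
report once it is converted to a numeric vector.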
taking advantage of many features, rendering them even more powerful and robust
than the average classifier.
However, ML is not a cure-all and should be used alongside other security measures,
such as user education, secure coding practices, and regular software updates.
Moreover, according to Goodfellow et al., ML models are vulnerable to adversarial
attacks, in which malicious actors carefully design inputs to deceive the model.
Regular monitoring and retraining of the model may therefore be necessary. In
summary, employing machine learning in cybersecurity is essential given the complex
and ever-changing threat landscape. Phishing detection is only one example of how
these techniques can help in the fight against cyber threats. It also shows that
cybersecurity should be approached from multiple sides and implemented at several
levels.
As demand for the internet increases, so does the risk of cyberattacks, and most
organizations and individuals now face problems with malware files on their
systems. Protecting important data on personal computer systems has become very
difficult. Several techniques are used to detect malware present on a computer
system. However, some traditional detection techniques are not effective, so malware
that takes different forms can go undetected. Malware detectors are therefore needed
that can detect malware more easily and reliably. Machine learning algorithms can
detect malware present in a system and have provided better results in this area,
and they are likely to be used even more for this purpose in the future. Many kinds
of machine learning algorithms are used for malware detection; for this research, we
consider only a few important ones. Among these are the convolutional neural network
(CNN) and the recurrent neural network (RNN). CNNs and RNNs are deep learning models
that learn complex patterns and features from data, which makes them useful for
detecting variants of malware. Other important machine learning algorithms are
decision trees and random forests, which learn from data and make decisions based on
what they have learned.
The proposed research involves running each of the specified algorithms on the
required data to determine how efficiently it can detect malware. The performance of
each model will be measured using accuracy, precision, recall, and F1-score. The
integrity and validity of the results will be ensured by techniques such as
cross-validation and stratified sampling, and by carrying out data collection in a
defined sequence. Data collection will be performed at the outset of the research
and will consist of acquiring the selected benign files and the set of malware
samples. Preprocessing is the next phase and includes data cleaning, normalization,
and any other techniques required to prepare the data, including encoding it for the
tasks that follow. Feature extraction is the third key step: to achieve the
anticipated effect, it is necessary to identify the features that best differentiate
malware from benign files. Static analysis of features in the headers, sections, and
code segments will be employed alongside dynamic analysis, which tests files in a
controlled environment. Both approaches will be used to ensure that the most
effective features are chosen. The next step is training the machine learning
algorithms and testing how accurately they operate. Hyperparameter tuning,
regularization, and ensemble methods will be integral to the training process.
Finally, performance will be assessed on the defined test data set. In conclusion,
the report will compare the machine learning models, underscoring the benefits and
shortfalls of each.
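The evaluation metrics and the stratified splitting mentioned above can be sketched
in a few lines of dependency-free Python. This is an illustrative sketch, not the
project's evaluation code (which may rely on scikit-learn instead); the function
names and the toy labels, where 1 means malware and 0 means benign, are assumptions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def stratified_folds(labels, k):
    """Deal sample indices into k folds class by class (round-robin),
    so every fold keeps roughly the original class ratio."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Toy labels: 1 = malware, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

print(classification_metrics(y_true, y_pred))
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
print(stratified_folds(y_true, 2))
# [[0, 2, 4, 6], [1, 3, 5, 7]]
```

In k-fold cross-validation each fold serves once as the test set while the
remaining folds are used for training, and the metrics are averaged over the k
runs.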
I have read the abstract and find the research quite interesting as it aims to improve
the existing malware detection systems by employing different machine learning
classifiers. The research findings appear to be important since they will provide more
insights and enable readers to protect themselves and their business entities from the
malware threat. I think the research has good prospects to succeed.
The aim of this research is to detect malware and evaluate performance by initiating
different machine learning algorithms.
1.3 Malware
Advances in internet technology and computer networking have made high-speed shared
internet access widely available. As a result, the number of computer systems prone
to malware attacks is increasing daily[1][2]. The term malware is an abbreviation of
malicious software. Researchers define it as any software specifically designed to
disrupt a computer, server, client, or computer network, steal private information,
gain unauthorized access to information or systems, deprive users of access to
information, or unknowingly compromise the user's security and privacy[3].
Researchers normally describe malware in terms of one or more sub-types. Malware is
regarded as harmful to all individuals and businesses that use the internet. The
amount of newly produced malware on the internet continues to escalate, while
antivirus companies work strenuously to limit this trend so that the majority of
computer users can be free from these threats[4][5]. Nevertheless, over the past
couple of years malware has become more sophisticated and therefore harder to
detect. Consequently, newer and more effective ways of detecting and averting these
attacks have to be developed. Malware attacks the confidentiality, integrity, and
availability of a computer system's data. This has moved both academics and industry
practitioners away from older, static detection techniques toward more dynamic and
sophisticated methods that rely on accumulated malware behaviour to monitor systems
continuously for malware attacks[6][7][8].
Malware is created by various actors with different motivations. Cybercriminals
typically do it for money, so malware is often built to steal banking details,
credit card numbers, and passcodes, or to demand a ransom for data it has illegally
locked. Malware can also be developed by nation-states willing to wage cyber warfare
and obtain essential information about critical infrastructure, government systems,
or state assets. Hacktivists may design malware to advance their ideological or
political goals and to gain access to data about their opponents or any organization
they disagree with. Finally, malware can be created by company insiders who hold a
grudge and want to take revenge on their previous employer. The consequences of an
infection can be divided into individual and organizational ones. For an individual,
personal data is unlikely to remain secure once malware is on the computer: someone
running a home business could have bank account details, credit card numbers, and
program details stolen, and private information could be exposed and exploited for
criminal gain. A workplace system can also be at risk. If malware infects a system
at work, the employee may be held responsible for leaking project details and
damaging the company's reputation, and may be dismissed as a result. Data loss can
mean that documents and important papers are deleted, corrupted, or left in
disarray. The PC itself can be locked, and the user may have to spend a considerable
amount on experts to try to restore the saved data, or may lose completed projects
that were being kept for an important future event.
One direction researchers can take is to leverage machine learning for malware
detection. Machine learning involves programming a computer to draw conclusions from
data by identifying patterns or features through training on large datasets. This
data-driven approach to pattern recognition brings several benefits, including
adaptability, generalization, automatic feature selection, and scalability. Several
common machine learning methods can be used for malware detection, such as decision
trees, random forests, support vector machines, and neural networks. In addition,
unsupervised learning methods such as clustering or anomaly detection can be used to
identify different malware types, whereas deep learning architectures such as
convolutional or recurrent neural networks can serve as more sophisticated and
powerful ML tools. Using machine learning to detect malware means that programs can
be analyzed in terms of static features, such as their binary code, headers, and
other metadata, or in terms of dynamic characteristics.
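To make the idea of learning from static features concrete, the following minimal
sketch classifies a file with a k-nearest-neighbours vote written in plain Python.
The two normalized features and all sample values are hypothetical, chosen only so
that the two classes are easy to tell apart; the sketch does not reproduce the
project's dataset or results.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical, already-normalized feature vectors:
# (section entropy score, share of suspicious imports).
train_X = [
    (0.10, 0.20), (0.15, 0.25), (0.20, 0.10),   # benign samples
    (0.80, 0.90), (0.85, 0.75), (0.90, 0.95),   # malware samples
]
train_y = ["benign"] * 3 + ["malware"] * 3

print(knn_predict(train_X, train_y, (0.82, 0.88)))  # malware
print(knn_predict(train_X, train_y, (0.12, 0.18)))  # benign
```

Because KNN relies on distances, features measured on very different scales (for
example file size in bytes versus a flag) should be normalized before they are
compared.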
Although machine learning methods can offer solutions to the current malware threat,
they are not a silver bullet. They provide good results only as part of a broader
approach that combines them with other security measures and best practices. The
effectiveness of machine learning models depends heavily on the diversity and
quality of the training samples and on the definition and proper use of informative
features. Furthermore, as mentioned above, adversarial machine learning is becoming
a considerable problem, and malware authors will continue to construct samples that
machine learning methods fail to recognize for a long time. In conclusion, malware
remains a major threat in the current century. As the number of devices people use
and their interconnectedness increase, the risk of malware infection will only grow.
Thus, high-quality security measures and incident response plans must be
complemented by user awareness and by R&D focused on new techniques, including
machine learning methods, that reduce the risk and efficiently prevent and fight
malware attacks.
permission. Despite the different purposes these malware have, they share similar
aspects of their goal and design.
trojans cause more harm than they should. Since trojans are able to get past the
first security measures and remain undetected for so long, their effects are serious
indeed. Hence, it is evident that a trojan deserves to be labelled a major threat to
computers.
Virus
Adware
Adware is a relatively low-danger type of malware whose impact on the infected
device varies in degree. Its main task is to show advertising to the user. Unlike
more dangerous malware such as viruses or worms, which are designed to damage the
system or misuse the infected device, adware collects information from the device,
such as the history of searches and visited webpages. It then uses this information
to display advertisements in which the user might be interested. That is why ads
sometimes appear on screen that relate to a product the person has recently searched
for or looked at. Such software is not as dangerous as many other types of malware,
since the information collected is unlikely to be highly personal and mostly
reflects the user's interests. However, it is still annoying: it causes unwanted ads
and similar content to appear automatically, which disrupts the user's work.
Furthermore, the device may still be at risk because the information is collected
without the awareness and permission of the user, so no one can tell how carefully
or where it is used. In addition, if too much data is collected, it accumulates on
the infected device, overloading it with unnecessary information and slowing it
down. At first glance, adware is not really dangerous, but it can be an annoying and
disruptive problem during work. The key factor that makes it dangerous is that it
gathers information about the user, not all of which is appropriate to collect, with
no control over how that information is used. Unlike other types of malware, it is
not destructive, yet it still affects the device's performance[10].
Spyware
Spyware is a kind of application that executes without the user's approval and often
without the user knowing that it is running. It installs itself on the target system
and is used to gather and track information about a person and their browsing
history. What it records varies and can include, but is not restricted to, the types
of websites the user visits, statistics about their computer usage, and even copies
of their transactions. Spyware is generally bundled with freeware, and this
unsuspected distribution has made it common for users to unknowingly install spyware
along with legitimate free applications. Spyware functions without the user's
awareness, collecting data that includes, but is not limited to, what the person
types on the keyboard, the websites they frequent, and their private information.
The information can be sent back to a third party that uses it for purposes
including advertising, identity theft, or unlawful access to information. Because of
its ability to function without the user knowing, spyware is a very common problem
for people surfing the net. It is occasionally referred to as a rootkit, both
because of the way it is distributed with freeware and because it can hide its
traces on the user's system[11]. Spyware differs from other kinds of malware in that
it neither damages nor disrupts the user's computer system. Instead, its sole
operation is surveillance that allows a third party to spy on the host. This makes
spyware one of the most dangerous kinds of malware, as it can sit on a user's
computer for an extended period without the user knowing that the information on it
is being harvested. To avoid spyware, users should combine reliable anti-spyware
software with general caution when downloading freeware from unknown websites.
Worm
A worm is a form of malware that does not attach itself to any other software, a
characteristic that sets it apart from a virus. Viruses require a host program to
which they attach themselves in order to spread. Worms, however, spread without
needing a host program. They move independently and quickly across systems and
networks by taking advantage of vulnerabilities in systems, networks, and software.
Typically, a worm finds a point of vulnerability, such as an open port or a weak
password policy, and moves through it to infect the target. Once inside a system,
the worm uses various strategies to spread to other systems, such as sending a copy
of itself by email, by instant message, or across network shares. Alternatively,
worms exploit software vulnerabilities automatically to spread, rather than relying
on any action by the user. This ability to spread and exploit automatically allows a
worm to trigger a highly damaging attack. The threat worms pose to information
systems and networks lies in their capacity to corrupt system configuration, reduce
network speed, and generate excessive traffic, which can lead to a DDoS condition.
To curb worms, a system should be kept highly secure through regular software
updates and a well-managed network that can detect and mitigate such an attack.
Bot
A bot, also referred to as a web robot, is application software designed to perform
automated tasks over the internet. Malicious bots are a type of malware that gives
the operator access to the infected computer system. Unlike viruses or worms, which
are standalone pieces of software, bots operate within a larger network of infected
computers called a botnet. Bots are typically propagated through backdoors that are
installed on a victim computer by a virus or a worm. These backdoors allow the
operator to use a central control server for adding and controlling bots. Once
installed on a system, bots are capable of sending spam emails, conducting
Distributed Denial of Service (DDoS) attacks, stealing valuable data, and spreading
other malware. Bots are often employed by criminals to orchestrate large-scale
operations, using compromised systems for financial gain and other malevolent
purposes. Because they operate fully automatically in the background, bots often go
unnoticed by the owners of infected systems for long periods. Protecting against bot
infection requires robust cybersecurity measures, such as antivirus software,
firewalls, intrusion detection systems, and timely software updates. Additionally,
user education is imperative, including avoiding suspicious links and unexpected
email attachments.
Ransomware
payment will even lead to the return of the ransomed files, and victims can end up
losing twice: the money handed over in Bitcoin or other cryptocurrencies, and the
valuable data itself. Ransomware infections can be fought and averted by
implementing solid cybersecurity strategies such as safe data backups, good
antivirus preventative measures, and end-user training to help users recognize
suspicious emails or downloads. Organizations should also take appropriate network
security action, including firewalls and intrusion detection systems, to identify
ransomware threats and contain them before they cause harm [12].
Rootkits
Backdoor
entry-point into your system and use this access to cause malicious activities.
Although backdoors can be used alone, they are often deployed in combination with
other types of malware or as part of other attack packages. A backdoor gives
attackers a covert way to break into the system and access it without going through
the standard procedure. From that vantage point they can launch additional attacks
or perform any kind of malicious operation on the compromised machine. Backdoors
might be used, among other things, to steal sensitive information, to install a
range of malware beyond the backdoor itself in order to spy on the victim, or to
perpetrate DDoS attacks. Because of their covert operation, backdoors can sometimes
go undiscovered in breached devices for weeks or even years, providing the
attackers with long-term access and control. Advanced cybersecurity is required to
detect and mitigate backdoor infections, which typically includes monitoring
systems regularly, conducting vulnerability assessments, and deploying intrusion
detection systems. Moreover, user education and awareness are needed so that
backdoor infections are eliminated and harm to systems and networks is reduced [12].
Keylogger
Keyloggers can be difficult to detect and remove, as they are intentionally designed
to avoid detection, are module-based, and are often written with rootkit
functionality. To avoid infection from keyloggers, it is important to follow
cybersecurity best practices such as updating software regularly, installing
reliable antivirus software, and educating users about the dangers of following
unsolicited links or downloading email attachments. Furthermore, security
applications such as a firewall and an intrusion detection system (IDS) provide an
extra layer of defense and help ensure timely identification of, and protection
from, keylogger attacks.
While this trend in the malware creation lifecycle gives malware more capabilities,
it poses a difficult challenge for cybersecurity professionals, who need to identify
new ways of detecting adaptive malware. Traditional methods may not keep pace
because malware programs keep evolving. Attackers use different obfuscation methods
to hide the actual code of malware, so security solutions do not detect and stop
threats as well as they should. Consequently, to detect and immediately respond to
these dynamic threats, organizations need proactive security solutions such as
behaviour-based analysis coupled with threat intelligence.
“Furthermore, the importance of user education and awareness cannot be understated
in helping to avoid these types of infections,” SafeBreach Labs added. “Users need to
actively protect themselves against social engineering attacks orchestrated by threat
actors attempting to infect endpoints with malware-based payloads.” With insight
into the latest threat trends and strong cybersecurity solutions, organizations can
better defend themselves against the new threats posed by second-generation malware.
Encrypted Malware
At the first level of concealment, we have encryption. Here the malware consists of
an encrypted malicious body along with a key and an encryption/decryption algorithm
[Fig. 3]. The body of the malware is XORed with a dynamically generated key, which
makes it harder for security solutions to detect and analyze. The primary goal of
developing malware with encryption was to bypass static code analysis and classical
signature-based detection. Once it infects a system, the malware decrypts itself
using the decryption algorithm and the key, at which point the malicious code can
be executed. It then encrypts itself again with a newly generated key, so that each
new variant differs from the last and continues to bypass the detection mechanism.
Encryption is thus the primary obfuscation method in malware: the malicious code
resides in a shell in which its body is encrypted and is accompanied by the
algorithm and key for encrypting/decrypting. With this in place, the actual nature
of the malware is not directly visible, which defeats static code analysis as well
as traditional signature-based detection. This continual cycle of encryption and
decryption results in a constantly changing structure that is extremely difficult
for security researchers to parse or analyze. It allows malware authors to hide
their illicit actions, so that compromised systems can be used long after the
malware is initially installed. As a result, the detection and mitigation strategies
of cybersecurity professionals must change to address the range of threats posed by
encrypted malware. This highlights the importance of advanced security solutions
that leverage behavior-based analysis, machine learning, and threat intelligence to
detect and neutralize encrypted malware threats in real time. In addition, regularly
updating antivirus signatures and intrusion detection systems is essential to keep
ahead of any new encryption techniques used by malware authors. By taking these
measures, organizations can protect themselves from the harmful impact of encrypted
malware.
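The effect of XOR-ing a body with different keys can be illustrated with a minimal
sketch. It uses a harmless placeholder string instead of any real payload, and the
names xor_bytes, body, key_a, and key_b are assumptions made for this illustration
only; they do not come from any actual sample.

```python
import hashlib

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR every byte of data with the repeating key (the same call encrypts and decrypts)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

# Harmless placeholder standing in for an encrypted body (assumption for the demo).
body = b"PLACEHOLDER-BODY: harmless text used only for illustration"

# Two different keys produce two different "variants" of the same body.
key_a = bytes.fromhex("1337c0de")
key_b = bytes.fromhex("0badf00d")
enc_a = xor_bytes(body, key_a)
enc_b = xor_bytes(body, key_b)

# A hash or byte signature taken from one variant does not match the other.
sig_a = hashlib.sha256(enc_a).hexdigest()
sig_b = hashlib.sha256(enc_b).hexdigest()
print(sig_a == sig_b)                    # the encrypted variants differ
print(xor_bytes(enc_a, key_a) == body)   # decryption recovers the same body
```

Note that xor_bytes itself is identical for both variants; only the encrypted body
changes from one key to the next.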
The limitation of encrypted malware is that the decryptor remains invariant across
the variants of a given malware, so an anti-malware solution can easily detect it
by searching for the signature of the decryptor itself. Therefore, various
concealment techniques have evolved to escape this detection mechanism. In the next
variant (Fig. 4), the decryptors are mutated, which means that each variant draws
on a set of obfuscated decryptors. In response to decryptor signatures, malware
authors continued to evolve their designs by obscuring the decryptors, mutating
their generic elements across different variants (see for example [27]). Figure 4
shows an example of this advanced variant, in which the decryptors are mutated
(drawn from a variant-wide pool of obfuscated decryptors) and differ across
variants. This is especially important because, by constantly mutating the
decryptors, malware developers can blur their “signatures”, making it harder for
anti-malware solutions to detect and destroy the threat. The malware also
dynamically encrypts and decrypts its body, making it even harder for ordinary
security mechanisms to detect. Employing obfuscated decryptors adds an additional
level of complexity to overall malware analysis. Understanding these behaviors,
together with proper toolsets and techniques for detecting this activity, enables
defenders to analyze such malware properly and develop detection measures that can
be deployed quickly. In conclusion, the arms race between malware authors and those
who are trying to protect their networks, whether CIOs or IoT operators, has become
more intense, and there is no winner yet. Against mutated decryptors and other more
sophisticated obfuscation methods, organizations must rely on modern solutions
backed by artificial intelligence (AI), machine learning, and proactive behavioral
analytics. In addition to this, the security community must work together and share
intelligence about cyber threats to get ahead of recurring attacks and implement
defenses for new strains of malware as quickly as possible.
Polymorphic Malware
mutator both are encrypted. In this way we get a new variant of that particular
malware. Polymorphic malware, which consists of a body and a decryptor, is somewhat
similar to oligomorphic malware, but it takes obfuscation considerably further.
Polymorphic variants can create millions of decryptors by mutating the instructions
used in each malware variant. This dynamic mutation process is designed to defeat
pattern-based detection techniques, such as the signature matching used in
antivirus software and other filters. In contrast with static malware, polymorphic
malware uses a mutation engine to produce each variant: the engine generates a new
decryption algorithm for a given encryption algorithm, and the algorithms are then
used to encrypt not only the malware code but also the mutation engine, thus
producing a new variant of that malware. The structure of polymorphic malware is
shown in Fig. 5. It generally has two main parts: the decryptor and the body of the
malware. The function of the decryptor is to decrypt the encrypted malware code so
that it can run on the infected computer and perform its malicious purpose. The
body consists of the payload and functionality of the malicious software. By
constantly changing their instructions and producing new strains with different
decryptors, polymorphic variants evade static signature or pattern-based security
solutions that rely on the same instructions being present in all received samples.
As a result, polymorphic malware behaves dynamically, which makes it difficult to
detect and resistant to traditional approaches such as static signature-based
analysis. Organizations are best served defending against polymorphic malware by
implementing sophisticated cybersecurity strategies that combine behavior-based
analysis, machine learning, and threat intelligence to identify and stop new
threats as soon as they appear. Antivirus signatures and intrusion detection
systems should also be updated periodically. Ultimately, staying ahead of
polymorphic malware means outpacing its authors. Remaining watchful and prepared
allows organizations to protect themselves from polymorphic malware, as well as
from the many other threats that continuously emerge.
Fig. 5: Polymorphic Malware
Metamorphic Malware
In metamorphic malware, shown in Fig. 6, the malware body itself is mutated rather
than the decryptors, creating a new variant whose actions remain the same, in order
to avoid detection. The metamorphic behavior is achieved by using several
obfuscation methods, similar to those polymorphic malware uses for creating
variants, such as dead-code insertion, data modification, control/data flow
modification, register renaming, subroutine permutation, equivalent code
substitution, and so forth. Mutating the body in this way is called a
body-polymorphic transformation. The underlying actions of the malware are
preserved, so the new variant behaves like its predecessor, as with polymorphic
malware; however, the malware code is different in structure and appearance from
the previous variant. Each metamorphic variant is different because the obfuscation
techniques erase the earlier structure of the malware and create a new one as a
mutation. The code is rewritten every cycle, so that every iteration is unique in
appearance. Because each variant contains new lines of code and multiple
workarounds, the signature method is rendered futile. Since the malware changes
dynamically, it presents no fixed image, and alternative methods of detection are
needed. An example is the use of a body-polymorphic transformation as portrayed in
the figure. A new variant is always anticipated, as one code is always mutating
into another that deviates from all previous copies. Because security software may
not report any infection in some scenarios, this mutation makes static analysis
very difficult and leaves the malware free to operate. To conclude, body-polymorphic
transformation and related tactics allow metamorphic malware to persist for a long
time without being identified. The transformation makes the malware difficult to
detect and mitigate because many different bodies derive from the first malware.
Hyuk describes the measures to prevent this type of malware. The three first steps
are intrusion prevention, intrusion detection, and, last but not least, keeping the
antivirus updated. Following these steps consistently greatly reduces the risk from
such malware.
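To make dead-code insertion concrete, the following sketch mutates a harmless toy
instruction listing and shows that the fingerprint changes while the meaningful
instructions do not. Everything here is an illustrative assumption: the
instructions are plain strings that are never executed, and DEAD_CODE,
insert_dead_code, fingerprint, and strip_dead_code are hypothetical names.

```python
import hashlib
import random

# Toy "instructions" that have no effect; plain strings, never executed.
DEAD_CODE = ["nop", "mov eax, eax", "xchg ebx, ebx"]

def insert_dead_code(program, rng):
    """Return a copy of program with one do-nothing instruction before each real one."""
    mutated = []
    for instruction in program:
        mutated.append(rng.choice(DEAD_CODE))
        mutated.append(instruction)
    return mutated

def fingerprint(program):
    """Hash of the listing, standing in for a static signature."""
    return hashlib.sha256("\n".join(program).encode()).hexdigest()

def strip_dead_code(program):
    """Remove the known do-nothing instructions again."""
    return [i for i in program if i not in DEAD_CODE]

base = ["push ebp", "mov ebp, esp", "pop ebp", "ret"]
variant = insert_dead_code(base, random.Random(1))

print(fingerprint(base) == fingerprint(variant))  # signatures no longer match
print(strip_dead_code(variant) == base)           # same meaningful instructions
```

Real metamorphic engines combine several of the transformations listed above, so
undoing them is far harder than this filtering step suggests.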
When attempting to detect malware, the malware’s behavior must first be understood.
Analysis helps to visualize the malware’s behavior and its tasks. There are a
number of methods for reaching a conclusion, each having its own advantages.
However, the amount of time and knowledge needed to use them varies vastly.
Event-driven systems can use either event loggers or hooks to track the behavior of
malware. Host event log monitoring can also be used to check the behavior of
malware by reviewing the event logs regularly. There are a number of benefits to
using an event-driven system to detect malware; it may rely on DLL or API
interception and has little or no effect on the timing of the events being
observed. However, a lot of time and effort is spent on the analysis of telemetry.
Static Analysis
Static analysis, also referred to as code analysis, is performed by going
through the code of the malware to determine its potential behaviour
and properties. Static analysis in malware analysis involves examining the
code and structure of a malicious program without executing it. This
technique provides valuable insights into the characteristics and behavior of
the malware, helping cybersecurity analysts identify potential threats and
vulnerabilities. During static analysis, analysts dissect the code to uncover
indicators of compromise, such as suspicious function calls, obfuscated
strings, and hidden payloads. Additionally, static analysis can reveal the
presence of known malware signatures and patterns, allowing analysts to
classify the malware and understand its potential impact. By leveraging static
analysis tools and techniques, cybersecurity professionals can gain a deeper
understanding of the malware's capabilities and intent, enabling them to
develop effective mitigation strategies and security measures. However, static
analysis has its limitations, as sophisticated malware may employ techniques
to evade detection, such as code obfuscation and polymorphism. Despite these
challenges, static analysis remains an essential component of malware
analysis, providing valuable insights into the inner workings of malicious
software and helping organizations bolster their defenses against cyber
threats. This reverse engineering can be achieved in one of the following
ways:
o File Format Inspection
File format inspection plays a crucial role in malware analysis,
offering valuable insights into the nature and characteristics of
suspicious files. Metadata, in particular, serves as a treasure trove of
information, providing analysts with essential details such as file type,
date of creation, compile time, and functions that have been imported
and exported. By examining metadata, analysts can determine the
origin and purpose of a file, helping to identify potentially malicious
content. For instance, the file type can reveal whether the file is an
executable, a document, or a script, providing clues about its intended
functionality. The date of creation and compile time can offer insights
into the timeline of the file's development, shedding light on its
potential relevance and significance in the context of a security
incident. Furthermore, metadata can unveil the functions that have
been imported and exported within the file, providing clues about its
behavior and capabilities. For example, imported functions may
indicate dependencies on external libraries or system resources, while
exported functions may suggest the presence of hooks or callbacks
used for malicious activities. By analyzing metadata, cybersecurity
analysts can gain a deeper understanding of the files under
investigation, enabling them to make informed decisions about the
level of risk and appropriate response actions. However, it is essential
to note that metadata inspection is just one component of a
comprehensive malware analysis strategy. Sophisticated malware may
employ techniques to manipulate or obfuscate metadata, making it
challenging to extract reliable information. Therefore, analysts must
complement metadata inspection with other analysis techniques, such
as dynamic analysis and code reverse engineering, to gain a
comprehensive understanding of the threat landscape and effectively
mitigate cyber risks.
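As a minimal sketch of metadata inspection, the code below reads the machine type,
section count, and compile timestamp from the header of a Windows PE file using only
the standard library. The function names read_pe_header and make_sample are
assumptions for this illustration, and make_sample only builds a synthetic header
(not a working executable) so that the example is self-contained.

```python
import struct
from datetime import datetime, timezone

def read_pe_header(data: bytes) -> dict:
    """Extract basic metadata from the DOS and COFF headers of a PE file."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # The 4-byte field at offset 0x3C points to the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("PE signature not found")
    # COFF header begins with: Machine, NumberOfSections, TimeDateStamp.
    machine, sections, timestamp = struct.unpack_from("<HHI", data, pe_offset + 4)
    return {
        "machine": hex(machine),
        "sections": sections,
        "compiled": datetime.fromtimestamp(timestamp, tz=timezone.utc),
    }

def make_sample(timestamp: int) -> bytes:
    """Build a synthetic header for the demo (assumption: not a runnable file)."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)
    coff = struct.pack("<HHIIIHH", 0x014C, 3, timestamp, 0, 0, 0, 0)
    return bytes(dos) + b"PE\x00\x00" + coff

info = read_pe_header(make_sample(0))
print(info["machine"], info["sections"], info["compiled"].year)
```

As noted above, the timestamp and other header fields can be forged, so values
read this way are a hint rather than proof of when a file was built.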
o String Extraction
String extraction is a fundamental technique in malware analysis,
crucial for deciphering the behavior and functionality of suspicious
code. This method involves extracting strings from the code, including
error messages, status indicators, and other textual data embedded
within the malware. By analyzing these strings, analysts can glean
valuable insights into the workings of the malware, infer its intended
purpose, and uncover potential indicators of compromise. Error
messages, for example, may reveal clues about the malware's
functionality, providing insights into its behavior and potential impact
on the infected system. Status indicators can also offer valuable
information, indicating the progress of specific operations or the
success of certain actions performed by the malware. Additionally,
strings extracted from the code may include hardcoded URLs, IP
addresses, encryption keys, or command-and-control server addresses
used by the malware to communicate with remote servers or execute
malicious activities. By identifying and analyzing these strings,
analysts can gain a deeper understanding of the malware's capabilities,
command-and-control infrastructure, and potential attack vectors.
Furthermore, string extraction can aid in the development of detection
signatures and mitigation strategies, enabling cybersecurity
professionals to better defend against emerging threats and protect
sensitive systems and data. However, it's essential to note that string
extraction is just one component of a comprehensive malware analysis
process. Analysts must complement string analysis with other
techniques, such as dynamic analysis, code reverse engineering, and
behavioral analysis, to gain a holistic understanding of the malware's
behavior and effectively mitigate cyber risks. By leveraging a multi-
faceted approach to malware analysis, organizations can enhance their
cybersecurity posture and better defend against evolving cyber threats.
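String extraction can be sketched in a few lines: scan the raw bytes for runs of
printable ASCII characters and then pick out the ones that look like URLs. The
function names, the minimum length of 4, and the sample bytes below are assumptions
chosen for this illustration.

```python
import re

def extract_strings(data: bytes, min_len: int = 4) -> list:
    """Return runs of printable ASCII characters at least min_len long."""
    pattern = rb"[\x20-\x7e]{" + str(min_len).encode() + rb",}"
    return [match.decode("ascii") for match in re.findall(pattern, data)]

def find_urls(strings: list) -> list:
    """Keep only the extracted strings that contain an http(s) URL."""
    url = re.compile(r"https?://\S+")
    return [s for s in strings if url.search(s)]

# Made-up bytes standing in for a binary file (assumption for the demo).
sample = b"\x00\x01http://example.test/c2\x00ab\x00error: failed\xff"

strings = extract_strings(sample)
print(strings)
print(find_urls(strings))
```

Short runs such as "ab" are dropped by the length filter, which keeps random byte
noise out of the results at the cost of missing very short strings.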
o AV Scanning
AV scanning, short for antivirus scanning, is a widely employed
technique in malware detection and prevention, utilized by both
individual users and organizations alike. The process involves
subjecting suspected files to scrutiny by antivirus software, which
compares their signatures, behaviors, and characteristics against a
database of known malware signatures. If a file matches a signature in
the database, it is flagged as malicious and appropriate action is taken
to quarantine, delete, or neutralize the threat. This method is
particularly effective against well-known malware strains, as most
antivirus programs maintain extensive databases of signatures for
common malware variants. However, it's important to note that AV
scanning is not foolproof, as it relies heavily on signature-based
detection and may fail to detect zero-day threats or polymorphic
malware that constantly alters its code to evade detection.
Additionally, AV scanning may produce false positives, flagging
legitimate files as malicious due to similarities with known malware
signatures. To mitigate these limitations, modern antivirus solutions
often incorporate heuristic analysis, sandboxing, and machine learning
algorithms to detect and block emerging threats based on their
behavior and characteristics rather than relying solely on signature
matching. Despite its limitations, AV scanning remains an essential
component of a layered cybersecurity strategy, providing a critical line
of defense against a wide range of known malware threats. By
regularly updating antivirus databases, conducting routine scans, and
supplementing AV scanning with other detection and mitigation
techniques, organizations can enhance their resilience against malware
attacks and safeguard their digital assets from harm.
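The database lookup at the heart of AV scanning can be sketched as a hash
comparison. This is a simplified illustration, not how any particular antivirus
product works: the KNOWN_BAD database, its single made-up entry, and the av_scan
function are assumptions for the example.

```python
import hashlib

# Hypothetical signature database: SHA-256 digest -> detection name.
KNOWN_BAD = {
    hashlib.sha256(b"demo-malicious-sample").hexdigest(): "Demo.Sample.A",
}

def av_scan(data: bytes) -> dict:
    """Look up the SHA-256 of the file contents in the signature database."""
    digest = hashlib.sha256(data).hexdigest()
    name = KNOWN_BAD.get(digest)
    return {
        "sha256": digest,
        "verdict": "malicious" if name else "unknown",
        "name": name,
    }

print(av_scan(b"demo-malicious-sample")["verdict"])   # known sample is flagged
print(av_scan(b"demo-malicious-sample!")["verdict"])  # one extra byte, no match
```

The second call shows the limitation described above: changing a single byte
yields a different hash, so an exact-match database alone misses altered variants.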
o Disassembly
Disassembly is a fundamental technique in malware analysis,
providing analysts with deep insights into the inner workings of
malicious code. This method involves converting machine code, which
is the binary representation of software, into assembly language, a
human-readable format that represents the instructions executed by the
CPU. By disassembling malware, analysts can examine the logic and
functionality of the program at a granular level, allowing them to
identify malicious behaviors, uncover hidden functionality, and
understand the malware's attack techniques. Popular disassemblers
such as IDA Pro and Ghidra provide powerful tools for analyzing
executable files, offering features like interactive disassembly, graph
views, and function analysis to aid in the reverse engineering process.
These tools enable analysts to navigate through the disassembled code,
trace the execution flow, and identify key functions and routines
within the malware. Furthermore, disassembly allows analysts to
uncover obfuscated or encrypted code, revealing the true intentions of
the malware and providing valuable insights into its capabilities and
behavior. By leveraging disassembly techniques, cybersecurity
professionals can gain a deeper understanding of malware threats,
develop effective detection and mitigation strategies, and enhance their
organization's overall security posture. However, it's essential to note
that disassembly can be a complex and time-consuming process,
requiring specialized skills and expertise in reverse engineering and
assembly language programming. Additionally, malware authors often
employ anti-disassembly techniques to thwart analysis efforts, making
it challenging for analysts to extract meaningful information from the
disassembled code. Despite these challenges, disassembly remains a
critical tool in the arsenal of malware analysts, enabling them to
uncover the secrets hidden within malicious software and protect
against cyber threats effectively.
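The idea of translating machine code into assembly can be shown with a deliberately
tiny, table-driven decoder. It is a toy that recognises only a handful of
single-byte x86-64 opcodes and treats every byte independently; real disassemblers
such as IDA Pro and Ghidra handle the full variable-length instruction set. The
names REGS, SINGLE, and disassemble are assumptions for this sketch.

```python
# Register order used by the one-byte push/pop encodings in 64-bit mode.
REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]

# A few opcodes that are complete instructions on their own.
SINGLE = {0x90: "nop", 0xC3: "ret", 0xCC: "int3"}

def disassemble(code: bytes) -> list:
    """Map each byte to a mnemonic; unknown bytes are shown as raw data."""
    listing = []
    for byte in code:
        if byte in SINGLE:
            listing.append(SINGLE[byte])
        elif 0x50 <= byte <= 0x57:
            listing.append("push " + REGS[byte - 0x50])
        elif 0x58 <= byte <= 0x5F:
            listing.append("pop " + REGS[byte - 0x58])
        else:
            listing.append("db 0x%02x" % byte)
    return listing

for line in disassemble(bytes([0x55, 0x90, 0x5D, 0xC3])):
    print(line)
```

For the four bytes above the listing is push rbp, nop, pop rbp, ret. The fallback
to raw data bytes hints at why anti-disassembly tricks work: bytes that a decoder
cannot interpret correctly hide the instructions that follow them.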
Dynamic Analysis
Also known as behavioral analysis, dynamic analysis is one of the most
powerful techniques in malware analysis, providing visibility into the real-time
activity exhibited by a piece of malware. Dynamic analysis entails running
the malware in a controlled environment (such as a sandbox or virtual
machine) and is quite different from static analysis, which does not involve
execution but instead involves disassembling and inspecting the file's code and
contents. It involves closely monitoring the malware’s behavior and recording
all the activities the file performs. With this approach, analysts can
watch the malware run and learn about its function, purpose, and intent
to harm or interfere with a system. Dynamic analysis is also helpful in
assessing how malware spreads, because it allows us to run the malware in an
isolated environment and observe what it does without putting our production
systems at risk. Moreover, dynamic analysis is generally much faster than
static analysis because it gives analysts immediate results about the malware's
behavior and helps them assess whether it is dangerous or not. During dynamic
analysis, malware can also exhibit new and previously unknown behaviors, which
helps analysts understand it better and learn to detect new threats when
creating countermeasures. Altogether, dynamic analysis is an essential part
of malware identification and prevention: it gives cybersecurity professionals
a direct view of how such threats act in the wild and helps organizations
block today’s increasingly sophisticated attackers.
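Once a sandbox run has recorded the sample's activity, the review step can be
sketched as matching the recorded events against a list of behaviours of interest.
The event format, the category and action names, and the SUSPICIOUS rule table
below are all hypothetical; real sandboxes each define their own report format.

```python
from collections import Counter

# Hypothetical rules: (category, action) -> description of the behaviour.
SUSPICIOUS = {
    ("registry", "set_run_key"): "persistence via autorun key",
    ("file", "mass_rename"): "bulk file modification",
    ("network", "connect_unknown_host"): "outbound connection to unlisted host",
}

def review_trace(events: list) -> dict:
    """Count how often each suspicious behaviour appears in a recorded trace."""
    findings = Counter()
    for event in events:
        label = SUSPICIOUS.get((event["category"], event["action"]))
        if label:
            findings[label] += 1
    return dict(findings)

# Made-up trace standing in for a sandbox report (assumption for the demo).
trace = [
    {"category": "file", "action": "read"},
    {"category": "registry", "action": "set_run_key"},
    {"category": "network", "action": "connect_unknown_host"},
    {"category": "network", "action": "connect_unknown_host"},
]

print(review_trace(trace))
```

Because the rules describe what the sample does rather than what its bytes look
like, the same table applies to variants whose code has been re-encrypted or mutated.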
Hybrid Analysis
Hybrid analysis is a combined way of dealing with malware: it allows us to
leverage the most useful features that both static and dynamic analysis
bring to the table, helping to gain better insight into malicious
software. During the early steps of hybrid analysis, relevant signatures and
features of the malware are examined closely using static and dynamic approaches.
This includes taking apart the code and examining the metadata, obfuscation
techniques, and similarities to known patterns (indicators of compromise).
Using static analysis, an analyst can find known threats and identify a
particular malware according to a set of established criteria.
1.4.3 Types Of Malware Detection
Signature-based detection (Fig. 7), the traditional technique, is an easy and
efficient way to detect known malware [16]. After a malware sample is identified,
this technique extracts a unique short sequence or pattern of bytes that
distinguishes the malicious program from benign programs [17]. Signature-based
detection, which detects known malware by identifying matching patterns, is the
most common practice in cybersecurity. It involves looking for signatures in files,
applications, or network traffic; antivirus software and intrusion detection
systems try to match these against identified malware variants. In most cases,
signatures consist of one or more bytes or sequences of code, and they can also be
built from behavior that resembles a known malicious component. Once a match is
found, the software can react by quarantining or deleting the infected file,
blocking network communication from specific IP addresses, or warning system
administrators. Although signature-based detection does well at recognizing and
blocking known threats, it has its limitations. It cannot always detect zero-day
attacks or polymorphic malware, which changes its code or behavior upon infection
to evade defense mechanisms. Another drawback is that the signature database must
be updated constantly to cover the latest malware strains, and a system becomes
vulnerable if updates lapse, since new threats keep arising. Nevertheless,
signature-based detection continues to be a building block of any cybersecurity
defense layer and is necessary to counteract the long tail of existing malware in
the world.
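Matching a short byte pattern inside a file can be sketched as a substring search.
The signature names and byte patterns below are invented for illustration and do
not correspond to any real malware family; SIGNATURES and scan are hypothetical
names.

```python
# Hypothetical signature set: detection name -> byte pattern.
SIGNATURES = {
    "Demo.Alpha": bytes.fromhex("deadbeef4142"),
    "Demo.Beta": b"\x90\x90\xcc\xcc\x13\x37",
}

def scan(data: bytes) -> list:
    """Return the names of all signatures whose pattern occurs in data."""
    return sorted(name for name, pattern in SIGNATURES.items() if pattern in data)

infected = b"header" + bytes.fromhex("deadbeef4142") + b"trailer"
mutated = b"header" + bytes.fromhex("deadbeef4143") + b"trailer"  # last byte changed

print(scan(infected))  # pattern present, so Demo.Alpha is reported
print(scan(mutated))   # one altered byte, so nothing matches
```

The mutated buffer illustrates the limitation discussed above: a variant that
alters even one byte of the pattern is no longer recognised until the database
receives a new signature.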
From 2000 to 2010, heuristic-based detection was used alongside signature-based
detection as a significant mechanism for malware mitigation, and it was the most
promising method for detecting novel or previously unseen malware [18]. The
approach relies on two methods of identification. The first is static: the
suspicious program is examined for characteristic patterns of occurrence, and
the file is declared infected if the resulting score goes beyond a threshold
(Cheng et al., 2014) [19].
B. Hybrid Heuristic and Signature-Based Detection: In this period (2000-2010),
threats were countered mainly by heuristic assessment of malware in addition to
the signature-based method. Heuristic techniques appeared to be the best method
for identifying new or unseen malware threats, bridging the gaps left by
signature-based tools. The method uses two main schemes. In the static scheme,
the suspicious program's code is analysed for defined patterns within the
structure of the program. If irregularities or suspicious patterns are
identified in the file and they cross a predefined threshold value, the file is
marked as possibly infected. The advantage of this type of static analysis is
that cybersecurity systems can still recognize common characteristics or
behaviour patterns exhibited by malware even if its exact signature is unknown
and not yet deployed in the databases. Applying these heuristic approaches helps
cybersecurity analysts improve their malware detection and response
capabilities, making it harder for attackers to carry out malicious activities.
Heuristic detection helps identify unknown malware variants, but it also has
limitations: it can produce false positives or miss complex threats that evade
analysis through, for example, obfuscation or polymorphism. As a result, the
typical strategy is a balanced one that relies on both heuristic and
signature-based detection to defend adequately against the variety of malware
threats.
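The static scheme described above can be sketched as a weighted pattern scan. This is only an illustration: the byte patterns, weights and threshold below are hypothetical values chosen for the example, not ones taken from any real scanner or from this project.

```python
# Hypothetical suspicious byte patterns and weights (illustrative assumptions only).
SUSPICIOUS_PATTERNS = {
    b"CreateRemoteThread": 3,
    b"VirtualAllocEx": 2,
    b"WriteProcessMemory": 2,
    b"UPX0": 1,
}
THRESHOLD = 4  # assumed cut-off; a real engine would tune this value


def heuristic_score(data: bytes) -> int:
    """Sum the weights of every suspicious pattern found in the raw bytes."""
    return sum(weight for pattern, weight in SUSPICIOUS_PATTERNS.items() if pattern in data)


def is_possibly_infected(data: bytes, threshold: int = THRESHOLD) -> bool:
    """Flag the file when its heuristic score goes beyond the threshold."""
    return heuristic_score(data) > threshold


sample = b"...VirtualAllocEx...WriteProcessMemory...CreateRemoteThread..."
print(heuristic_score(sample), is_possibly_infected(sample))
```

A file containing only one low-weight pattern stays below the threshold, which is how such a scheme trades off missed detections against false positives.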
Malware Normalization
As malware has grown more sophisticated, malware authors have adopted automated
advanced malware generation toolkits [20] that use highly sophisticated
obfuscation techniques (e.g. Zeus, Ultimate Packer for Executables and
Mistfall). These kits can create on the order of a couple of thousand malware
samples a day, which is practically impossible to catch through signature- or
heuristic-based malware detection techniques. A normalized executable is
obtained by removing the obfuscation from a given program, which helps an
existing antimalware product reach higher detection accuracy (Fig. 8) [21].
Because more and more malware developers use such toolkits, it has become very
important to normalize malware. Once deployed, these toolkits let attackers
quickly create tens to hundreds of thousands of unique malware variations in a
day, each one seemingly different enough to bypass traditional signature- or
heuristic-based detection methods. Malware normalization attempts to mitigate
this explosion of obfuscated malware by removing the layers of obfuscation
wrapped around the malicious code. Normalizing executable files or malware
samples allows analysts to sidestep the obfuscation and understand their
structure and functionality. The normalized form of the malware can then be used
to improve detection performance in currently installed antimalware products. By
analysing normalized samples, security practitioners can create stronger
signatures through a better understanding of how the malware is constructed,
thereby enhancing their defenses against advanced cyber threats. Moreover,
according to Malviya, malware normalization helps security researchers
understand the adversary tactics, techniques and procedures in use, which
supports both direct disruptive action against adversaries and proactive
defensive strategies. Overall, malware normalization remains a highly useful
approach for strengthening cybersecurity defenses against automated malware
generation toolkits in a constantly developing threat landscape.
Fig. 8 : Malware Normalization
and training them, the algorithms are able to make predictions about similar data
based on the patterns or characteristics found in each feature. Once trained, the
models can analyze new samples and classify them as either benign or malicious
based on what they have seen before. Machine learning has many advantages for
malware detection, including adaptation to unforeseen threats, high scalability
that allows it to be applied to large datasets, and the ability to generalize
common features so that millions of variants can be detected. By learning from
new data and from feedback on how they perform in practice, and revising their
ability to detect malware accordingly, these techniques make detection systems
more accurate and efficient over time. This helps organizations strengthen their
networks and systems against cyberattacks.
1.5 Structure of the Project
This report is organized into the following chapters:
Chapter 1 gives a basic idea of the project and familiarizes the reader with the
necessary technical and theoretical aspects of the project.
Chapter 2 consists of a review of other journals and research papers.
Chapter 3 describes the system design and the various tools and techniques needed
to achieve it.
Chapter 4 presents the feasibility study, which covers technical, economic,
operational, legal, and market viability.
Chapter 5 gives a comparative study of the performance of each model, making it
easier to choose the most suitable one.
Chapter 6 provides the conclusion and discusses the future scope of
the project.
CHAPTER 2
LITERATURE REVIEW
As early as 2009, Daly published research showing the potential for planned
and coordinated attacks designed to gain long-term network control within a
targeted company [27]. For example, the Quick Heal Threat Research Lab received
over 350,000,000 malicious files, which targeted tens of thousands of
workstations, in only the first quarter of 2016 (Aimoto et al. [28]). Ito et al.
note that in May 2017 Symantec revealed it had uncovered the Banswift cybercrime
ring behind the theft of US $81 million from Bangladesh Bank [29]. In another
incident, incorrectly configured Elasticsearch and Logstash instances caused Kid
Security, an app designed to help parents monitor their children's online
safety, to leak user records for more than a month. The breach was discovered in
mid-September by expert Bob Diachenko, who said it affected 300 million records
(including 21,000 phone numbers and 31,000 email addresses), along with some
credit card information.
The primary goals of malware detection techniques are to identify malicious
software, protect the system on which it is found, and ensure that other linked
networks or computers are also protected. Malware samples can be detected and
classified into their respective families in many ways, and many authors have
proposed methods to recognize or categorize a malware file together with its
variant. A detailed summary of the major research publications is given in
Table 1. We have also observed malware exported as a graphical representation
along with feature vectors. Note that this is fundamentally different from
classifying images, where labeling data can be particularly simple (each image
belongs to exactly one label) and it is easy to obtain a lot of it. Instead,
recognizing malware and clustering it into families are labor-intensive tasks
that require expertise in the subject. In another study, the authors suggested a
learning technique to analyze the dangerous code and classify it according to
its malware family [2]. The first step in malware family identification is
grouping all portable executables (PE) and choosing the traits common among
them. These features determine the specific category of malware and the
activities involved in defining the executable's organization; the suggested
approach reached an accuracy of approximately 99.8% [30].
The evolution of malware detection has kept pace with the increasing
sophistication of malware itself. Early methods of protection depended on
signature-based detection, which uses specific patterns or signatures of known
malware to detect malicious files. Although it was successful against threats
that were known at the time, this method did not fare well against new or as yet
unseen malware (zero-day attacks).
Deep learning (a class of machine learning) has expanded these capabilities even
further. Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs)
and similar models can extract features from raw data on their own, so valid
features do not have to be extracted through manual effort. This approach has
been particularly popular for image-based models that treat binary files as
though they were gray-scale images.
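The idea of treating a binary file as a gray-scale image can be sketched in a few lines. This is a minimal illustration, not the preprocessing used in any of the cited works: the row width of 16 and the zero-padding of the last row are assumptions made for the example.

```python
def bytes_to_grayscale(data: bytes, width: int = 16) -> list[list[int]]:
    """Reshape raw bytes into rows of `width` pixels; each byte (0-255) is one gray level."""
    rows = [list(data[i:i + width]) for i in range(0, len(data), width)]
    if rows and len(rows[-1]) < width:
        rows[-1] += [0] * (width - len(rows[-1]))  # zero-pad the final row (assumed choice)
    return rows


# Stand-in for the contents of an executable: 40 bytes with values 0..39.
image = bytes_to_grayscale(bytes(range(40)), width=16)
print(len(image), len(image[0]))
```

The resulting two-dimensional grid of intensities is the kind of input an image classifier such as a CNN would then consume.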
The literature for malware detection has been diverse with numerous studies
proposing novel techniques and obtaining high accuracy rates. A comparative analysis
is outlined in Table 1, which points to the diversity of approaches and their respective
performance measures. The feature engineering approach appears relevant, as
effective feature extraction plays a significant role in achieving high-accuracy
malware detection. Among the most common features are API calls, opcode
sequences, and patterns of byte code. All the reviewed articles demonstrate that well-
engineered features improve the performance of the model. As for algorithms,
different models have distinct advantages. For instance, Random Forests and
Support Vector Machines are recognized for their robustness and high accuracy.
At the same time, deep learning models such as CNNs and RNNs can learn to
extract complex features automatically. Their weakness, however, is their
inability to learn abstract models, and oversimplification may complicate the
tasks. It is also apparent that hybrid models are more effective as they combine
static and dynamic approaches. The static approach examines the code, that is,
the appearance of the application without running it, while the dynamic approach
monitors the software's activity while it is running. Both approaches are among
the most effective in their respective areas, and the hybrid model has an
advantage because it combines their traits. The rise of malware attacks has led
to concerns that a single machine is insufficient to address the evolving needs.
One way to ensure the scalability and efficiency of an approach is to train
models on datasets and deploy them on distributed systems, so that detection is
accurate and timely. Although
substantial progress has been observed, the field retains its appeal by addressing the
challenges posed by evolving types of malware.
Adversarial attacks pose the biggest concern. These are attacks where the malware is
specifically generated not to be detected. With such threats and their popularity in the
current world, the researchers still need to focus on developing models that cannot be
attacked so easily.
Challenges of the Future
Another challenge is posed by the emergence of new IoT devices and cloud
computing. These platforms allow attackers to develop new strategies, as the
environment undergoes substantial changes that require new malware detection
methods. Old methods need to be adapted and combined with new and innovative
methods to address the emerging threats. All in all, the field of malware
detection has made significant progress, driven by advances in machine learning
and AI. The comparative analysis of the reviewed papers indicates the relevance
of feature engineering, the significance of algorithm selection, and the value
of the hybrid approach. As cyber threats continue to grow and evolve, the
importance of these aspects cannot be overstated.
Griffin et al. (2009) [34] | 48-byte string signature | 5-gram Markov chain model | Signatures with one or more components were used to train the classifiers. Having several component signatures increased the chance of a satisfying accuracy outcome when compared to the equivalent. Less than 0.1 percent false positive rate (FPR) was achieved.
Salehi et al. (2014) [41] | API calls | RF, J48, Rotation RF, FT, and NB | 94.6% was the greatest true positive rate of any classifier used, and random forest produced the highest results.
Sexton et al. (2015) [42] | Byte code sequences and opcodes | Naive Bayes, rule-based classifier, logistic regression, SVM | The Markov chain approach to SVM revealed an 84.9% true positive rate.
Saxe et al. (2015) [43] | The string histogram, the byte sequence, and the 2D PE properties | Deep feed-forward neural network | The suggested model yielded a 95% true positive rate.
Narayanan et al. (2016) [47] | Image of polymorphic malware file | KNN, ANN, and SVM | Over the others, linear KNN provided an accuracy of 96.6%.
Nikolopoulos and Polenakis (2017) [48] | ScD graph created using system calls | SaMe-similarity and NP-similarity metrics | The suggested model has a detection rate of 83.42%.
Zhixing Xu et al. (2017) [49] | System calls for memory access | Logistic regression and random forest classifier | The random forest classifier performed better, with a true positive rate of 99%.
Yuxin et al. (2019) [58] | n-gram | Deep Belief Network | When trained on unlabeled data, DBN outperformed KNN, SVM, and Decision Trees in terms of classification accuracy.
CHAPTER 3
SYSTEM ANALYSIS
The model currently being followed is the WATERFALL MODEL, in which the phases
are arranged in a well-defined linear sequence. First, the feasibility study is
done. Once that is complete, requirement gathering and project planning follow.
If a system already exists and only modification or the addition of a new module
is necessary, analysis of the present system may be taken as the basic model.
Design is done after requirement analysis and before the coding phase. Coding
does not start until design is complete. After coding, the next phase of the
software development life cycle is testing. In this model, the following
sequence of activities is performed in a software development project:
Requirement
Project Planning
System design
Detail design
Coding
Unit testing
System integration & testing
The important point is that these activities are linearly ordered. At the end of
each phase, the output of one phase becomes the input of the next. The output of
each phase should be such that, when the whole is assembled, it fulfills the
system's overall requirement. A few features of the spiral model are also
included: after each phase is completed, the work done is reviewed by all the
stakeholders involved in the project. As all the requirements were known to us
in advance and our software is made with the objective of
computerizing/automating a pre-existing manual working system, we have chosen
the WATERFALL MODEL.
We have used many tools and technologies for our project, and all are discussed
here:
3.1.1 Python
Simple and Readable Syntax: Python's syntax is designed to be simple and easy to
read, making it an ideal language for beginners and experienced developers alike. Its
clean and straightforward syntax emphasizes readability and reduces the cost of
program maintenance.
Dynamic Typing: Python is dynamically typed, meaning that variable types are
determined at runtime rather than at compile time. This provides flexibility and
allows for more concise code, but can also lead to potential runtime errors if not
handled carefully.
Rich Standard Library: Python comes with a comprehensive standard library that
includes modules and packages for a wide range of tasks, from web development and
data manipulation to networking and GUI programming. This extensive library
reduces the need for developers to write code from scratch, speeding up development
time.
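The dynamic typing point can be illustrated with a short sketch. The function name and the sample values are made up for the example; the behaviour shown (a type error surfacing only at runtime) is standard Python.

```python
def total_size(sizes):
    """Add up file sizes; nothing checks the element types until the code runs."""
    return sum(sizes)


print(total_size([120, 340, 56]))  # works: all elements are integers

try:
    total_size([120, "340", 56])   # a string slipped into the list
except TypeError as exc:
    print("runtime error:", exc)

value = 42           # the name `value` is bound to an int...
value = "forty-two"  # ...and can later be rebound to a str
```

The mistake in the second call is only detected when the line executes, which is the flexibility-versus-safety trade-off described above.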
Fig. 9 : Python
3.1.2 Google Colab
It’s a free tool for writing and running Python right in your browser.
3.1.3 PyCharm
3.1.4 Pandas
Fig. 12 : Pandas
3.1.5 NumPy
Fig. 13 : NumPy
3.1.6 Scikit-Learn
It's a popular open-source machine learning library in Python that provides simple
and efficient tools for data analysis and modeling.
Wide Range of Algorithms: Includes a variety of machine learning
algorithms for tasks like classification, regression, clustering, and
dimensionality reduction.
Ease of Use: Designed to be easy to use, with a consistent interface for
different algorithms, making it accessible for both beginners and experts.
Built on NumPy and SciPy: Integrates well with NumPy and SciPy, ensuring
fast and efficient numerical computations.
Model Selection: Offers tools for model selection and evaluation, including
cross-validation, grid search, and metrics to assess model performance.
Preprocessing: Provides various preprocessing techniques to prepare data for
modeling, such as scaling, normalization, and encoding categorical variables.
Feature Engineering: Supports feature extraction and selection to improve
model accuracy and performance.
Community Support: Backed by a strong community and comprehensive
documentation, with many tutorials and examples available.
Interoperability: Works well with other data science libraries like pandas
and Matplotlib, making it easy to integrate into your data analysis workflow.
Versatile Applications: Used in various fields, including finance, healthcare,
marketing, and more, for tasks like predictive modeling, customer
segmentation, and recommendation systems.
Fig.14 : Scikit-Learn
CHAPTER 4
FEASIBILITY STUDY
This study aims to determine the technical feasibility, as well as the economic,
operational and legal feasibility, of implementing an all-encompassing framework
that links software performance evaluation and malware detection in a single
system. The details gathered here are treated as secondary information. The
framework will help meet the increased need for a holistic solution that can
provide system health monitoring across the ever more complex software and
control system landscape in the face of the increasing sophistication of cyber
threats.
Technologies and Tools: The technologies and tools needed to implement the
framework, including performance monitoring solutions, malware detection
algorithms and data analytics platforms, are available in the industry as
established standards. The project team therefore does not need any other
components to integrate the framework.
computing among others, which will ensure ease of reusability so that the
framework remains effective over time.
Expertise & Resources: The project team will consist of subject matter experts in
software engineering, performance engineering, cyber security, and data
analytics. As a result, the team has the capability to build and implement this
full framework effectively.
Cost-effectiveness: The framework is expected to deliver significant savings by
reducing the probability of data leakage, system downtime and compliance
breaches. The performance improvements and stronger security posture are
anticipated to offset the cost of implementing and maintaining the system.
Pricing and Revenue Model: The framework can be offered to organizations either
as a subscription-based service or as licensed software. Pricing can be
flexible, depending on the requirements of target organizations at different
scales. This flexible model will help make the framework available to a large
number of customers while also producing an ongoing revenue stream for the
project.
Funding and Investment: The project can receive funding from a combination of
internal R&D budgets, government grants, and venture capital investments, owing
to high demand in the security market and the strategic importance of the field.
The operational feasibility of the framework was considered with respect to
organizational readiness, training and support, and maintenance and update
considerations.
Organizational Readiness: The solution is designed to be implemented seamlessly
with the existing IT infrastructure and security practices, without disrupting
ongoing operations, to maximize user adoption among stakeholders.
Training and Support: The project will offer extensive user documentation,
training schedules, and dedicated technical support to ease the implementation
and use of the framework across target stakeholders.
The legal and regulatory feasibility of the framework has been evaluated, with
explicit emphasis on compliance with industry standards and regulations, data
privacy and security, and intellectual property rights.
Data Privacy and Security: Data processed by the framework will be subject to
robust measures that protect its confidentiality and integrity through
encryption, access controls and audit logging.
Intellectual Property: The project team will make sure that the framework's
design and implementation do not violate any existing patents or copyrights, and
will consider filing for patent protection for the innovative components of the
framework.
Market feasibility has been examined along the following dimensions: target
market and customer segments, competitive advantage, and partnerships and
strategic alliances.
Competitive Analysis: A competitive analysis for this framework will illustrate
clear differentiation from existing solutions, based on an integrated, scalable
and adaptive approach to system health management designed specifically to meet
the needs of the intended target market.
Partnerships and Strategic Alliances: The Project Team will consider partnering
with one or more leading technology vendors, security solution providers, and
industry associations to increase the market penetration and credibility of the
framework.
CHAPTER 5
METHODOLOGY
For detecting malware, a machine learning method with a one-sided perceptron will
be used. It will be applied to the dataset to detect malware among the files
present on a system.
The methodology uses various machine learning algorithms to address the problem
of malware present as files on computer systems. These algorithms are applied
after designing a database according to the dataset. After this, analysis and
design are performed on the dataset.
5.2 Implementation
The preceding chapters discussed in depth some of the security threats and
challenges faced by the digital world. Keeping them in mind, we use a static
analysis method to achieve the desired outcome.
        # ... malicious
        if is_malicious(features):
            # Add the file path to the malicious files list
            malicious_files.append(os.path.join(folder_path, file))
    # Return the list of malicious files
    return malicious_files

def extractfeature():
    # Get the PE file path from the user
    pe_file_path = input("Enter the PE file path: ")
    # Extract features from the PE file
    features = extract_features(pe_file_path)
    # Return the features
    return features

def btocsvalgorithm():
    # Get the malicious file paths from the user
    malicious_file_paths = input("Enter the malicious file paths: ")
    # Convert the malicious file paths to a list
    malicious_file_paths = malicious_file_paths.split(",")
    # Create a list to store the CSV data
    csv_data = []
Fig. 15: Flow Chart of the Proposed System
5.2.1 Preparing the DataSet
Data Collection
The data collection process for malware detection typically involves gathering
samples from diverse sources, including malware repositories, security research
publications, and real-world incidents. Samples encompass various malware types
and strains, providing a comprehensive dataset for analysis. These collected
specimens are meticulously curated to ensure representativeness and relevance to
current cybersecurity threats. Metadata attributes such as file properties,
behavioral patterns, and code signatures are extracted from the samples to form a
feature-rich dataset. This process adheres to academic rigor, employing stringent
methodologies to assemble a diverse and representative corpus essential for robust
research and development in malware detection systems. In this project, we took
the dataset from Kaggle.
Feature Extraction
Fig 16: Features of the dataset
Organizing Datasets
After importing the data and extracting the features, we organize the dataset
into legitimate and malware samples.
Feature Selection
After organizing and dividing the dataset, we select the most important
features. The dataset consists of 58 features, but not all of them are equally
important, so we use tree-based feature selection to assign a weight to each
feature and select the most important ones out of the 58.
From the above output image we can see that out of the total 53 features (after
removing the name, Filename, md5hash, Machine and TimeDateStamp columns, as
they are not necessary in our scenario) only 14 were important and selected
using tree-based feature selection. We can also see the features selected and
the weight assigned to each one of them (Fig. 19).
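The selection step, keeping only features whose weight is high enough, can be sketched as follows. This assumes the weights have already been produced by a tree ensemble; the feature names and numbers are made up, and the "above the mean weight" rule is an assumed threshold, not necessarily the one used in this project.

```python
# Hypothetical feature weights, as a tree ensemble might report them (made-up numbers).
weights = {
    "feature_a": 0.30,
    "feature_b": 0.05,
    "feature_c": 0.40,
    "feature_d": 0.15,
    "feature_e": 0.10,
}


def select_features(weights: dict[str, float]) -> list[str]:
    """Keep features whose weight is above the mean weight, highest first."""
    mean_weight = sum(weights.values()) / len(weights)
    kept = [name for name, w in weights.items() if w > mean_weight]
    return sorted(kept, key=weights.get, reverse=True)


print(select_features(weights))
```

With real data, the dictionary would hold one weight per remaining column, and the selected names would be used to subset the dataset before training.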
Fig. 19: Weight of Different Features
Splitting the DataSet
After feature selection, we split the dataset into training and testing sets.
The dataset can be divided in any ratio, but here we use a 70:30 ratio,
i.e. 70% training size and 30% test size.
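A 70:30 split can be sketched with the standard library alone. This is a minimal illustration rather than the project's actual code: the function name, the fixed seed and the stand-in data are assumptions made for the example.

```python
import random


def split_dataset(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and cut it into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 labelled samples
train_rows, test_rows = split_dataset(rows)
print(len(train_rows), len(test_rows))
```

Shuffling before cutting matters when the file is ordered by class, since otherwise one class could end up entirely in the training part.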
The accuracy obtained after successful implementation is shown in the figure:
XGBoost
XGBoost stands for eXtreme Gradient Boosting. It is an advanced implementation
of gradient boosted decision trees designed for speed and performance, and it
has proven to be one of the most effective algorithms available today in the
supervised learning domain. While traditional gradient boosting serially adds
weak learners to improve model performance, XGBoost uses an optimized and more
regularized form of gradient boosting.
XGBoost forms an ensemble of decision trees in which each tree is built to
correct the errors of its predecessors. At the core of the process lies an
iterative approach in which a loss function is optimized by sequentially adding
trees to the ensemble, with each new tree focusing on the residuals, or errors,
of the trees before it. Other techniques used to avoid overfitting and improve
generalization are shrinkage (learning rate) and tree pruning.
Moreover, XGBoost integrates additional features such as column subsampling and
row subsampling (in the spirit of bootstrap aggregation) to improve model
robustness and, at the same time, address scalability. The API also supports
custom loss functions and evaluation metrics, making the library highly
configurable for different use cases.
Because of XGBoost's well-designed algorithm, implementation, scalability and
flexibility, it has enabled very efficient solutions not only in machine
learning competitions but also in a variety of real-world problems across
industries. As all of this allows it to provide cutting-edge performance in a
computationally efficient manner, XGBoost is one of the most widely used
ensemble learning techniques among data scientists and machine learning
practitioners.
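The core boosting loop, in which each new tree is fitted to the residuals of the current ensemble and scaled by a learning rate, can be sketched from scratch. This is plain gradient boosting with one-split trees (stumps) and squared error, not XGBoost itself: regularization, second-order gradients, subsampling and pruning are all omitted, and the toy data is made up.

```python
def fit_stump(xs, residuals):
    """Find the threshold on x that best splits the residuals (least squared error)."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, n_trees=50, learning_rate=0.3):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    trees = []

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)

    for _ in range(n_trees):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        trees.append(fit_stump(xs, residuals))
    return predict


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9]
model = boost(xs, ys)
print(round(model(2), 2), round(model(5), 2))
```

The learning rate plays the role of shrinkage: each stump corrects only part of the remaining error, so many small corrections accumulate instead of one large one.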
The accuracy and confusion matrix obtained after successful implementation are
shown in the figure:
Decision Tree
Decision Tree is a powerful and intuitive machine learning algorithm that can be
used for classification as well as regression. Without going too far into the
details, a decision tree constructs its model by recursively dividing the input
feature space into smaller subspaces until the points in each subspace are, as
far as possible, correctly classified. The resulting structure branches like a
tree.
The algorithm is trained on the available features, and at every node it chooses
the best feature together with the optimal split point for that feature
according to some criterion (e.g. Gini impurity, mean squared error). This is
done recursively until a stopping criterion is met (e.g. maximum depth, minimum
number of samples per leaf) or there is no further gain to be achieved by
splitting.
Decision trees are easy to understand and to interpret visually, so the model
can be used to see how the decision for a particular input was made. However,
trees tend to overfit the data: as they grow deeper in order to separate the
classes, they pick up on noisy idiosyncrasies in the training instances. To
avoid overfitting, techniques such as pruning can be used to remove nodes that
contribute little improvement.
Finally, decision trees are also the building blocks of ensemble methods such as
Random Forests and Gradient Boosted Trees. In conclusion, decision trees are a
straightforward method whose splitting criterion can be analyzed and optimized
directly. They are flexible and have good interpretability, which makes them
suitable for many applications.
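Choosing a split point by Gini impurity, as described above, can be sketched for a single numeric feature. The toy values and labels are made up for illustration; a full tree would repeat this search over every feature and recurse into each child node.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Try each candidate threshold on one feature; return the one with the
    lowest weighted Gini impurity of the two child nodes."""
    best_threshold, best_impurity = None, float("inf")
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_threshold, best_impurity = t, impurity
    return best_threshold, best_impurity


# Toy data: one numeric feature, labels 0 = legitimate, 1 = malware (made up).
values = [2, 3, 5, 7, 8, 9]
labels = [0, 0, 0, 1, 1, 1]
print(best_split(values, labels))
```

An impurity of zero means both child nodes are pure, which is also a natural stopping condition for that branch.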
The accuracy and confusion matrix obtained after implementing the above model
are shown in the following figure:
Fig. 25 Confusion Matrix of Decision Tree
Random Forest
Random Forest is a flexible and easy-to-use ensemble learning algorithm that can
perform a wide variety of classification and regression tasks. It works by
constructing a multitude of decision trees at training time and outputting the
mode of the classes predicted by the individual trees (for classification) or
their mean prediction (for regression).
Each decision tree in the Random Forest is trained on a random subset of the
training data and a random subset of the features. This randomness decorrelates
the trees, which reduces overfitting and improves generalization performance.
The Random Forest also makes use of a method called bagging, in which multiple
trees are trained on bootstrapped samples of the training data. This makes the
model more diverse and more robust.
Prediction: During the prediction stage, each tree in the random forest makes an
independent decision and gives a result. The final output is decided by
combining the predictions of all the trees (mode or mean). Ensemble methods have
generally been found to be more stable and accurate than individual decision
trees.
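Bagging, random feature choice and majority voting can be sketched in a simplified form. To keep the example short, each "tree" here is a single-split stump thresholded at the feature's mean, which is an assumption made for illustration and far simpler than the trees in a real Random Forest; the toy data is made up.

```python
import random
from collections import Counter


def train_stump(rows, feature):
    """A one-split 'tree': threshold at the feature's mean, majority label per side."""
    threshold = sum(r[0][feature] for r in rows) / len(rows)
    sides = {True: [], False: []}
    for features, label in rows:
        sides[features[feature] <= threshold].append(label)
    fallback = Counter(label for _, label in rows).most_common(1)[0][0]
    vote = {side: Counter(ls).most_common(1)[0][0] if ls else fallback
            for side, ls in sides.items()}
    return lambda features: vote[features[feature] <= threshold]


def train_forest(rows, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # bootstrap: draw with replacement
        feature = rng.randrange(n_features)        # random feature for this tree
        forest.append(train_stump(sample, feature))
    return forest


def predict(forest, features):
    """Majority vote (mode) over the individual tree predictions."""
    return Counter(tree(features) for tree in forest).most_common(1)[0][0]


# Toy data (made up): ([feature_0, feature_1], label), 0 = legitimate, 1 = malware.
rows = [([1, 10], 0), ([2, 12], 0), ([3, 11], 0),
        ([8, 30], 1), ([9, 28], 1), ([10, 31], 1)]
forest = train_forest(rows)
print(predict(forest, [2, 11]), predict(forest, [9, 29]))
```

An individual stump trained on an unlucky bootstrap sample may be wrong, but the vote across many decorrelated stumps is what makes the ensemble stable.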
Random Forest is a popular ensemble learning method used in fields such as
finance, healthcare and bioinformatics to solve problems such as credit scoring
and disease diagnosis, because of its ability to work with high-dimensional
data, non-linear relations and noisy data.
The accuracy obtained after implementing the above model is shown in the
following figure:
predicting, the algorithm calculates the distance between the new data point and
every point in the training dataset to measure how closely it resembles each
training example, using a specific distance metric; the one most frequently
used is Euclidean distance. After computing the Euclidean distance from the
unknown point to each point in the training set, it takes the k points with the
smallest distances.
In the case of classification, a majority vote among the k nearest neighbours is
used to assign a class label to the new data point. For regression tasks, the
predicted value is the average of the target values of the k nearest neighbours.
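The classification procedure just described fits in a few lines. This is a minimal sketch with made-up two-dimensional points; k = 3 and the Euclidean metric are the assumptions stated in the code.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort the training points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy training set (made up): (feature vector, label).
train = [((1.0, 1.0), "legitimate"), ((1.5, 2.0), "legitimate"), ((2.0, 1.5), "legitimate"),
         ((6.0, 6.5), "malware"), ((7.0, 6.0), "malware"), ((6.5, 7.0), "malware")]
print(knn_predict(train, (1.8, 1.7)), knn_predict(train, (6.4, 6.2)))
```

Note that every prediction sorts the whole training set, which is why the cost grows with the dataset size.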
From the above, we can conclude that KNN depends on the value of k and on the
distance metric used. A smaller value of k gives a more flexible model but may
suffer from overfitting; on the other hand, a larger value of k gives a smoother
decision boundary but may underfit, because local detail is averaged away.
Although it is simple to understand, KNN can be computationally expensive and
time-consuming (particularly for large datasets), as all distances from the
query point need to be calculated in order to find the nearest neighbours.
Nevertheless, its intuitiveness and simple implementation allow for wide usage
in various machine learning projects, particularly those that involve a small
dataset.
The accuracy and confusion matrix obtained after successful implementation are
shown in the figure:
Fig. 29 Confusion Matrix Of KNN
Logistic Regression
Logistic Regression is a common parametric linear classification algorithm for binary
(two-class) classification. It estimates the probability of an outcome and is used to
predict whether something will happen or not. Despite what the name may
suggest, logistic regression is a classification algorithm, not a regression one.
The logistic regression algorithm models the relationship between the independent
variables (or features) and the probability of the target variable belonging to a
certain class. It does so through the logistic (sigmoid) function, whose inverse, the
logit, acts as the link function, which is why the method is also known as logit
regression. The sigmoid function maps any real-valued number into the range (0, 1),
so it is a convenient way to represent probabilities.
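A small sketch of the sigmoid function and its inverse, the logit, may make this concrete. The code below is an illustration written for this explanation and uses only the Python standard library; the two-branch form is a common way to avoid overflow for large negative inputs.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), computed in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # z < 0, so e^z cannot overflow
    return ez / (1.0 + ez)

def logit(p):
    """Inverse of the sigmoid: the log-odds of a probability p with 0 < p < 1."""
    return math.log(p / (1.0 - p))

for z in (-4.0, 0.0, 4.0):
    print(z, sigmoid(z))
```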
When training, logistic regression learns the coefficients (which can be
interpreted as weights) of each feature through maximum likelihood estimation, using a
method such as gradient descent. Each resulting coefficient represents the change in
the log-odds of the positive class for a one-unit increase in that feature, with the
other features held fixed.
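As an illustration of maximum likelihood estimation by gradient descent, the sketch below fits a logistic regression model with plain batch gradient descent on the log-loss. It is a simplified example under stated assumptions, not the training code used in this project: the learning rate, the number of epochs, and the one-feature toy data are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the negative log-likelihood."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the log-loss w.r.t. the linear score
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Move against the average gradient.
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical one-feature data: larger values are more likely to be class 1.
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic(X, y)
print(w, b)
```

A positive learned weight here means that increasing the feature raises the log-odds of class 1.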
We can use the learned coefficients to make predictions by passing the input features
through the trained logistic function to obtain a predicted probability. This predicted
probability is then mapped onto a binary class label: an instance is assigned to the
positive class if its predicted probability is greater than or equal to some threshold
(often 0.5).
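The prediction step can be sketched as follows. The coefficient values are hypothetical and chosen only for illustration; the point of the example is how the threshold turns a probability into a class label.

```python
import math

def predict_proba(weights, bias, features):
    """Predicted probability of the positive class for one sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(weights, bias, features, threshold=0.5):
    """Map the probability to a binary label using the decision threshold."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

# Hypothetical learned coefficients, not taken from this project's model.
weights, bias = [2.0, -1.0], -0.5

sample = [1.0, 0.5]
print(predict_proba(weights, bias, sample))
print(predict_label(weights, bias, sample))
print(predict_label(weights, bias, sample, threshold=0.8))
```

Raising the threshold makes the classifier more conservative about predicting the positive class, which trades missed detections for fewer false positives.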
This type of regression is easily interpretable, easy to implement, and works well
when the relationship between the features and the log-odds of the target variable is
linear or close to linear. It is widely used in many fields, e.g., predicting disease
occurrence (Weng et al. 2017), credit risk scoring (Crook et al. 2007), and customer
churn prediction (Chan and Baumgartner 2007). It is also commonly employed at the
industry level.
The accuracy obtained after implementation is shown in the following figure:
Naive Bayes
Naive Bayes is a probabilistic classification algorithm based on Bayes’ Theorem. It is
"naive" in the sense that, to keep things simple, it assumes conditional
independence between every pair of features given the class. Despite this
simplification (or perhaps because of it), Naive Bayes is able to deliver robust
performance, and for this reason it is often used on text data, where the
independence assumption does not strictly hold; examples include spam filtering.
How does the algorithm do this? It uses Bayes’ theorem, which states that the
probability of each class C given a set of features x_1, ..., x_n is proportional to
the probability of those features given the class, multiplied by our prior belief
about this class:
$$P(C \mid x_1, \dots, x_n) \propto P(x_1, \dots, x_n \mid C)\, P(C)$$
The “naive” assumption makes these probabilities tractable to estimate. Instead of
estimating the full joint distribution of the features given the class, which would
need an impractically large amount of labeled training data, we estimate each
feature's distribution separately (by MLE or MAP). Because we assume independence
between the features x_1, ..., x_n given the class C, the product rule of probability
reduces the conditional probability term to:
$$P(x_1, \dots, x_n \mid C) = \prod_{i=1}^{n} P(x_i \mid C)$$
Although it is overly simplistic, the model still performs very well in practice,
both when the assumption holds almost exactly (when dependencies do not exist) and
when the independence assumption is violated to some extent.
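The following sketch shows how the prior and the per-feature conditional probabilities can be estimated from labeled data and combined as a product (computed as a sum of logarithms for numerical stability). It is a minimal illustration for discrete features with Laplace smoothing; the function names, the smoothing parameter alpha, and the toy spam data are assumptions introduced for this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # feature_counts[c][i][v] = number of class-c samples whose feature i equals v
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for x, c in zip(samples, labels):
        for i, v in enumerate(x):
            feature_counts[c][i][v] += 1
    return class_counts, feature_counts

def log_posteriors(x, class_counts, feature_counts, n_values=2, alpha=1.0):
    """Unnormalised log P(c) + sum_i log P(x_i | c), with Laplace smoothing alpha."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)                       # log prior
        for i, v in enumerate(x):
            likelihood = (feature_counts[c][i][v] + alpha) / (n_c + alpha * n_values)
            score += math.log(likelihood)                   # independence: add logs
        scores[c] = score
    return scores

def predict_nb(x, class_counts, feature_counts):
    """Pick the class with the highest (unnormalised) posterior."""
    scores = log_posteriors(x, class_counts, feature_counts)
    return max(scores, key=scores.get)

# Hypothetical binary features, e.g. (contains_link, contains_word_free).
samples = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0)]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
model = train_nb(samples, labels)
print(predict_nb((1, 1), *model))
print(predict_nb((0, 0), *model))
```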
This class of Bayesian classifiers (Naive Bayes) comes in different flavors; the
variants differ mainly in the assumptions they make about the distribution of the
features X. Gaussian Naive Bayes assumes that continuous features follow a Gaussian
distribution, Multinomial Naive Bayes assumes that features are generated from a
simple multinomial distribution, and Bernoulli Naive Bayes is used for binary
data [4].
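To illustrate how the variants differ, the sketch below shows the per-feature likelihood used by the Gaussian and Bernoulli variants. Only the likelihood term changes between variants; the rest of the classifier stays the same. The helper names and sample values are hypothetical, and the multinomial case is omitted for brevity.

```python
import math

def fit_gaussian(values, eps=1e-9):
    """Mean and variance of one continuous feature within one class.

    eps is a small constant added to avoid a zero variance.
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values) + eps
    return mean, var

def gaussian_pdf(x, mean, var):
    """Gaussian likelihood P(x | class) for a continuous feature."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def bernoulli_likelihood(x, p):
    """Bernoulli likelihood of a binary feature x, given P(x = 1 | class) = p."""
    return p if x == 1 else 1.0 - p

# Hypothetical values of one continuous feature for two classes.
benign = fit_gaussian([1.0, 2.0, 3.0])
malware = fit_gaussian([6.0, 7.0, 8.0])
print(gaussian_pdf(2.5, *benign) > gaussian_pdf(2.5, *malware))
```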
Naive Bayes is computationally fast, simple to implement, and works well on small
datasets. It is particularly suitable for high-dimensional data such as text
classification, where the huge number of possible words leads to a very large number
of features.
The accuracy and confusion matrix obtained after implementation are shown in the
following figure:
In every iteration, SGD computes the gradient of the loss function with respect to the
model parameters using only a single sample or a small batch of data. It then takes a
step in the direction opposite to that gradient, scaled by a learning rate. This
process is repeated iteratively until convergence or until some stopping criterion is
met.
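The update loop described above can be sketched as follows for a simple linear model trained with mean squared error. This is an illustrative example rather than the classifier used in this project; the learning rate, batch size, number of epochs, and the noiseless toy data are assumptions chosen so that the example converges quickly.

```python
import random

def sgd_linear(data, lr=0.05, batch_size=2, epochs=500, seed=0):
    """Fit y ~ w*x + b by mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a new random order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the batch loss with respect to w and b.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            # Step in the direction opposite to the gradient.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Hypothetical noiseless data generated from y = 2x + 1.
points = [(x, 2.0 * x + 1.0) for x in [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]]
w, b = sgd_linear(points)
print(w, b)
```

Here the stopping criterion is simply a fixed number of epochs; in practice one might instead stop when the loss no longer improves.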
By using mini-batches, SGD introduces randomness into the optimization process, which
can help it escape local minima and can also lead to faster convergence, especially
when dealing with a high-dimensional parameter space. However, this randomness also
gives rise to noisy updates, which means that choosing an appropriate learning rate
(and hyper-parameters more generally) is crucial.
Although it is simple, SGD is the basis of a family of powerful optimization methods.
Some advanced versions are mini-batch SGD, momentum SGD, and algorithms that
automatically adapt the learning rate for each weight, such as Adagrad and Adam.
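As one example of these variants, the sketch below contrasts a plain gradient step with a momentum step on the toy objective f(x) = x^2. It is only an illustration of the update rules; the function names, learning rate, and momentum coefficient beta are assumptions, and adaptive methods such as Adagrad and Adam are not shown.

```python
def plain_step(param, grad, state, lr=0.1):
    """Plain gradient step; the state is unused and passed through."""
    return param - lr * grad, state

def sgd_momentum_step(param, grad, velocity, lr=0.1, beta=0.9):
    """Momentum step: the velocity is a decaying sum of past gradient steps."""
    velocity = beta * velocity - lr * grad
    return param + velocity, velocity

def minimise(step, start=5.0, iters=200):
    """Minimise f(x) = x^2 (gradient 2x) with the given update rule."""
    x, state = start, 0.0
    for _ in range(iters):
        x, state = step(x, 2.0 * x, state)
    return x

print(minimise(plain_step))
print(minimise(sgd_momentum_step))
```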
SGD is commonly used to train deep learning models (especially neural
networks) because of its scalability and efficiency. It is a powerful, capable
algorithm that every machine learning practitioner should have in their toolbox,
because it is good at handling large datasets and operating in high-dimensional
parameter spaces.
The accuracy obtained after implementation is shown in the following figure:
Fig. 34: Confusion Matrix Of SGD
The accuracy obtained after implementation is shown in the following figure:
CHAPTER 6
CONCLUSIONS
6.1 Conclusions
Putting it all together, we found that malware attacks are a serious threat to any
individual’s personal computer system and cause a number of problems and
interruptions. Removing malware from a system is the first step to keeping the
computer clean and functioning correctly. This research effort works towards the use
of machine learning techniques to minimize error and accurately identify malware
objects within a system. Ideally, if malware detection mechanisms are implemented
correctly, there should not be a single false positive, making the system stronger
against possible threats. Yet the empirical results show that, even though the
research goals were achieved to a large extent, some non-zero false positive rate
remains, and these findings confirm the complexity of the problem. Still, the
suggested framework emerges as a viable basis for a commercial product that unifies
various related detection features. This framework is helpful for detecting malware
in files, which in turn increases the security level of a system.
Secondly, as seen in Fig. 30, we evaluated a whole set of algorithms, with only a few
standing out. The algorithms we used are KNN, Decision Tree, Random Forest, Linear
Regression, XGBoost (eXtreme Gradient Boosting), LightGBM (Light Gradient
Boosting Machine), Logistic Regression, Naïve Bayes, Stochastic Gradient Descent, and
Support Vector Machines (SVM). It is important to note that a few algorithms, namely
Linear Regression, XGBoost, LightGBM, and Decision Tree, were more accurate in
malware detection than the other algorithms.
Therefore, given the other advantages associated with these algorithms, they are
considered the most preferable for addressing the malware threat. In the
future, with further improvements and optimizations of these algorithms, we expect to
come closer to our goals in malware detection. To conclude, the
implementation of machine learning in cybersecurity is a major development that
provides an opportunity to increase protection against malware attacks.
Much more can be done with this proposed model, as technology is changing rapidly.
With that note, we end our scope here and list possible further improvements:
The algorithms are good enough on their own, but nowadays neural networks have a
vital role to play in classification problems.
Instead of classical machine learning algorithms, neural networks can be
implemented, as they can be far better at unsupervised learning.
One could also perform targeted feature selection in order to reduce false
positives.
6.3 Applications
The implications for security are wide and far-reaching. Because it is capable of
detecting malicious files, this technology can be integrated into many companies'
software as they increasingly move in this direction.
REFERENCES
Hybrid Intelligent Systems, 2004. HIS’04. Fourth International Conference
on, pages 378–383. IEEE, 2004.
[22] Tom M Mitchell. Machine learning. WCB, McGraw-Hill Boston, MA, 1997.
[23] Robert Moskovitch, Yuval Elovici, and Lior Rokach. Detection of unknown
computer worms based on behavioral classification of the host. Computational
Statistics & Data Analysis, 52(9):4544–4566, 2008.
[24] Mamoun Alazab, Sitalakshmi Venkatraman, Paul Watters, and Moutaz
Alazab. Zero-day malware detection based on supervised learning algorithms
of api call signatures. In Proceedings of the Ninth Australasian Data Mining
Conference-Volume 121, pages 171–182. Australian Computer Society, Inc.,
2011.
[25] Muazzam Siddiqui, Morgan C Wang, and Joohan Lee. A survey of data
mining techniques for malware detection using file features. In Proceedings of
the 46th annual southeast regional conference on xx, pages 509–510. ACM,
2008.
[26] Grégoire Jacob, Hervé Debar, and Eric Filiol. Behavioral detection of
malware: from a survey towards an established taxonomy. Journal in
Computer Virology, 4(3):251–266, 2008.
[27] Daly, M.K., “Advanced persistent threat”, Usenix, Nov. 2009.
[28] Quick heal quarterly threat report q2: Technical report, 2015. Quick Heal,
Feb. 2015.
[29] Aimoto, S., AlKhatib, T., Coogan, P., Corpin, M., DiMaggio, “Internet
security threat report”, Technical report, Symantec Corporation, 2017.
[30] A. K. Verma and S. K. Sharma, "Malware Detection Approaches using
Machine Learning Techniques - Strategic Survey," 2021 3rd International
Conference on Advances in Computing, Communication Control and
Networking (ICAC3N), Greater Noida, India, 2021, pp. 1958-1962.
[31] Ye, Y., Wang, D., Li, T., Ye, D., “IMDS: intelligent malware detection
system”, In Proceedings of 13th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pp. 1043-1047, 2010.
[32] M. Zubair Shafiq, Syed Ali Khayam, and Muddassar Farooq. “Embedded
Malware Detection Using Markov n-Grams”, in Detection of Intrusions and
Malware, and Vulnerability Assessment. Springer Berlin, Heidelberg, 2008.
[33] R. Moskovitch, C. Feher, N. Tzachar, E. Berger, M. Gitelman, “Unknown
malcode detection using OPCODE representation”, Intelligence and Security
Informatics, pp 204-215, 2008.
[34] K. Griffin, S. Schneider, X. Hu, T.-c. Chiueh, “Automatic generation of
string signatures for malware detection”, pp. 101–120, 2009.
[35] Nataraj, L., Karthikeyan, S., Jacob, G., Manjunath, B.S., “Malware images:
visualization and automatic classification”, in Proceedings of the 8th
International Symposium on Visualization for Cyber Security. ACM, New
York, USA, pp. 4:14, 2011.
[36] A. Shabtai, R. Moskovitch, C. Feher, S. Dolev, Y. Elovici, “Detecting
unknown malicious code by applying classification techniques on OpCode
patterns”, Security Information, 2012.
[37] Chandrasekar Ravi, R Manoharan, “Malware Detection using Windows Api
Sequence and Machine Learning”, International Journal of Computer
Applications, Volume 43, April 2012.
[38] I. Santos, J. Devesa, F. Brezo, J. Nieves, P. G. Bringas, “Opem: A static-dynamic
approach for machine-learning-based malware detection”, in CISIS
’12-ICEUTE, pp. 271-280, 2013.
[39] P. M. Comar, L. Liu, S. Saha, P. N. Tan, A. Nucci, “Combining supervised
and unsupervised learning for zero-day malware detection”, in: Proceedings
INFOCOM, IEEE, pp. 2022–2030, 2013.
[40] Uppal, D., Sinha, R., Mehra, V., Jain, V., “Malware detection and
classification based on extraction of API sequences”, in Proceedings of
International Conference on Advanced Computer Commun. Informatics, pp.
2337-2342, 2013.
[41] Salehi, Z., Sami, A., Ghiasi, M., “Using feature generation from api calls for
malware detection”. Computer Fraud & Security, pp. 9–18 2014.
[42] Joseph Sexton, Curtis Storlie, Blake Anderson, “Subroutine based detection
of APT malware”, Journal of Computer Virology Hacking Techniques, pp 1-
9, 2015.
[43] Saxe, J., Berlin, K., “Deep neural network based malware detection using
two dimensional binary program features”. In 10th International Conference
on Malicious and Unwanted Software (MALWARE). IEEE, pp. 11-20, 2015.
[44] U. Narra, F. Di, T. Visaggio, A. Corrado, T.H. Austin, M. Stamp,
“Clustering versus SVM for malware detection”, Journal of Computer
Virology Hacking Techniques, pp 213–224, 2016.
[45] Ahmadi, M., Ulyanov, D., Semenov, S., Trofimov, M., Giacinto, “Novel
feature extraction, selection and fusion for effective malware family
classification”, in proceedings of the Sixth ACM Conference on Data and
Application Security and Privacy, pp. 183–194, 2016.
[46] Kolosnjaji, B., Zarras, A., Webster, G., Eckert, “Deep learning for
classification of malware system call sequences”, in Lecture Notes of
Computer Science (Including Subseries Lecture Notes in Artificial
Intelligence and Lecture Notes in Bioinformatics), pp. 137-149, 2016.
[47] B.N. Narayanan, O. Djaneye-Boundjou, T.M. Kebede, “Performance
analysis of machine learning and pattern recognition algorithms for Malware
classification”, in IEEE National Aerospace and Electronics Conference
(NAECON) and Ohio Innovation Summit (OIS), pp. 338–342, 2016.
[48] S.D. Nikolopoulos, I. Polenakis, “A graph-based model for malware
detection and classification using system-call groups”, Journal of Computer
Virology Hacking Techniques, pp 29–46, 2017.
[49] Xu, Z., Ray, S., Subramanyan, P., Malik, “Malware detection using machine
learning based analysis of virtual memory access patterns”, Design,
Automation & Test in Europe Conference & Exhibition, IEEE, pp. 169–174,
2017.
[50] E. Raff, J. Sylvester, and C. Nicholas, “Learning the PE Header, Malware
Detection with Minimal Domain Knowledge”, in Proceedings of the 10th
ACM Workshop on Artificial Intelligence and Security, NY, USA: ACM, pp.
121–132, 2017.
[51] Kotov, V., Wojnowicz, “Towards generic deobfuscation of windows api
calls”, arXiv preprint arXiv:1802.04466, 2018.
[52] Q. Le, O. Boydell, B. Mac, M. Scanlon, “Deep learning at the shallow end:
Malware classification for non-domain experts”, Digital Investigation, pp
S118–S126, 2018.
[53] Nguyen, M.H., Le Nguyen, D., Nguyen, X.M., Quan, T.T., “Autodetection
of sophisticated malware using lazy-binding control flow graph and deep
learning”. Computer Security, pages 128-155, 2018.
[54] M. Krcal, O. Svec, M. Balek, and O. Jasek, “Deep Convolutional Malware
Classifiers Can Learn from Raw Executables and Labels Only,” in ICLR
Workshop, 2018.
[55] Ni, S., Qian, Q., Zhang, R., “Malware identification using visualization
images and deep learning”. Computer Security, 2018.
[56] Hemant Rathore, Swati Agarwal, Sanjay K. Sahay and Mohit Sewak,
“Malware Detection using Machine Learning and Deep Learning”,
International Conference on Big Data Analytics, Springer, LNCS, Vol. 11297,
pp. 402-411, 2018.
[57] O. Suciu, S. E. Coull, and J. Johns, “Exploring adversarial examples in
malware detection”, in IEEE Security and Privacy Workshops (SPW), pp 8–
14, 2019.
[58] Yuxin, D., Siyi, Z., “Malware detection based on deep learning algorithm”,
Neural Comput. Appl. 31 (2), pp 461–472, Feb 2019.
[59] M. Rabbani, Y.L. Wang, R. Khoshkangini, H. Jelodar, R. Zhao, P. Hu, “A
hybrid machine learning approach for malicious behaviour detection and
recognition in cloud computing”, Journal of Network and Computer
Applications, 2020
[60] C. Yucel, A. Koltuksuz, “Imaging and evaluating the memory access for
malware”, Forensic Science International Digital Investigation, 2020.
[61] Vasan, D., Alazab, M., Wassan, S., Safaei, B., & Zheng, Q. (2021).
Windows malware detection using anomaly detection algorithms
[62] Kumar, A. and Singh, B. and Gupta, D. Hybrid Malware Detection for
Android using Static and Dynamic Analysis with RNN
[63] D. Gibert, A. Smith, and M. Jones. "Enhanced Malware Detection through
Interpretable Deep Learning and Gradient Boosting." In Proceedings of the
2023 IEEE Symposium on Security and Privacy (SP), pp. 123-137, 2023