ABSTRACT
One of the most significant issues facing internet users today is malware. Malware is any
software intentionally designed to cause damage to a computer, server, client, or computer
network. A wide variety of malware types exist, including computer viruses, worms, Trojan
horses, ransomware, spyware, adware, rogue software, wipers, and scareware. Polymorphic
malware is a newer type of malicious software that is more adaptable than previous generations
of viruses: it constantly modifies its signature traits to avoid being identified by
traditional signature-based malware detection models. To identify malicious threats, we will
use a number of machine learning techniques, which can detect malware by identifying its
behaviour and other characteristics. The proposed approach is based on computing the
difference in correlation symmetry integrals, and it demonstrates that machine learning
algorithms can be used to effectively detect malware, even polymorphic malware. This benefits
internet users, as it can help to improve the security of computer systems and networks.
CONTENTS
Chapter No. Description
1 Introduction
2 Requirements Specification
2.1 Hardware Requirements
2.2 Software Requirements
2.3 Functional Requirements
2.4 Non-Functional Requirements
3 Literature Survey
3.1 Existing System
3.2 Drawbacks
3.3 Proposed System
3.4 Working
4 Problem Statement
5 Objectives
6 Methodology
7 Functional Modules
8 Conclusion
References
LIST OF FIGURES
Figure No. Description
4.1 System Architecture
Project Title
Chapter 1
INTRODUCTION
Malware is a major threat to the security of computer systems and networks. Cyberattacks are
currently one of the most pressing concerns in the realm of modern technology. A cyberattack
exploits a system’s flaws for malicious purposes, such as stealing from it, changing it, or
destroying it, and malware is one means of carrying out such an attack. Malware is any program
or set of instructions that is designed to harm a computer, user, business, or computer system.
The term “malware” encompasses a wide range of threats, including viruses, Trojan horses,
ransomware, spyware, adware, rogue software, wipers, and scareware. Malicious
software, by definition, is any piece of code that is run without the user’s knowledge or
consent. Traditional signature-based malware detection methods are becoming increasingly
ineffective against new and emerging malware strains. Machine learning (ML) algorithms
have the potential to overcome these limitations by detecting malware based on its behaviour
and other characteristics. Both static and dynamic analysis may be used to identify
behavioural similarities between members of the same family of malware. Static analysis
examines the contents of suspicious files without actually running them, whereas dynamic
analysis takes their behaviour into account by tracking data flows, recording function calls, and
adding monitoring code to dynamic binaries. Machine learning algorithms may leverage such
static and behavioural artefacts to describe the ever-evolving structure of contemporary
malware, allowing them to identify increasingly complex malware attacks that could otherwise
avoid detection by signature-based techniques. As machine learning-based solutions do not
rely on signatures, they are more successful against newly released malware. Deep learning
algorithms, which can perform feature engineering on their own, can be used to obtain and
represent features more accurately.
This synopsis presents a survey of ML algorithms for malware analysis and detection. We discuss
the different types of ML algorithms that can be used for malware detection, as well as the
different features that can be extracted from malware samples for classification. We also
review state-of-the-art ML-based malware detection systems and their performance.
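The weakness of signature matching against polymorphic malware can be illustrated with a minimal sketch. It assumes that a signature is simply the SHA-256 digest of a file's bytes; real products use richer signatures, and the byte strings and function names below are hypothetical placeholders rather than actual samples or an actual scanner API.

```python
import hashlib


def sha256_signature(payload: bytes) -> str:
    """Return the SHA-256 hex digest, used here as a stand-in 'signature'."""
    return hashlib.sha256(payload).hexdigest()


# Hypothetical byte strings standing in for two variants of the same sample.
original_sample = b"MZ\x90\x00 example payload body"
mutated_sample = original_sample + b"\x00"  # a single padding byte appended

# Signature database that only contains the variant seen so far.
known_signatures = {sha256_signature(original_sample)}


def is_known(payload: bytes) -> bool:
    """Signature-style lookup: exact match against the known digests."""
    return sha256_signature(payload) in known_signatures


print(is_known(original_sample))  # True
print(is_known(mutated_sample))   # False: one changed byte defeats the lookup
```

Because any change to the bytes produces a different digest, an exact-match lookup misses the mutated variant, which is the gap that behaviour-based ML detection is intended to close.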
Dept. of CSE,SJCIT 2023-2024 Page 1
Chapter 2
REQUIREMENT SPECIFICATIONS
A Software Requirements Specification (SRS) is a central document that forms the
foundation of the software development process. It records the requirements of a system
and describes its major features. An SRS is essentially an organization's
understanding (in writing) of a customer's or potential client's system requirements and
dependencies at a particular point in time, usually before any actual design or
development work. It is a two-way insurance policy that assures that both the client
and the organization understand the other's requirements from that perspective at a given
point in time. The SRS discusses the product but not the project that developed it; hence the
SRS serves as a basis for later enhancement of the finished product. The SRS may need to
be changed, but it does provide a foundation for continued production evaluation. In
simple words, the software requirements specification is the starting point of the
software development activity.
The SRS translates the ideas in the minds of the clients (the input) into
a formal document (the output of the requirements phase). The output of this phase is thus a
set of formally specified requirements, which ideally are complete and consistent, while the
input has none of these properties.
2.1 Hardware Requirements
• Processor Type : Intel Core i5
• Speed : 2.4 GHz
• RAM : 8 GB
• Hard disk : 80 GB HDD
2.2 Software Requirements
• Operating System : Windows 64-bit
• Technology : Python
• IDE : Python IDLE
• Tools : Anaconda
• Python Version : Python 3.6
2.3 Functional Requirements
1. Malware Detection Algorithm: The system must employ machine learning
techniques for malware detection, including Naive Bayes, SVM, J48, RF, and a
proposed approach.
2. High Detection Accuracy: The selected algorithm must achieve a high
detection rate, ensuring accurate identification of malicious threats.
3. Confusion Matrix: The system should generate a confusion matrix to
measure false positives and false negatives, providing additional performance
insights.
4. Comparison of Classifiers: The system must compare the performance of the
DT, CNN, and SVM algorithms in terms of detection accuracy, particularly at
a low False Positive Rate (FPR).
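The metrics named in these requirements can be derived from the four confusion matrix counts. The following is a minimal sketch in plain Python, assuming a binary task in which label 1 means malware and 0 means benign; the function names and the sample labels are illustrative and are not taken from the project code.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for a binary task; `positive` marks malware."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def summarize(y_true, y_pred):
    """Derive accuracy, detection rate (recall) and false positive rate."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "detection_rate": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Illustrative labels only: 1 = malware, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

print(confusion_counts(y_true, y_pred))  # (3, 1, 3, 1)
print(summarize(y_true, y_pred))
```

Here one benign file is flagged as malware (a false positive) and one malware file is missed (a false negative), giving an accuracy of 0.75, a detection rate of 0.75, and an FPR of 0.25.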
2.4 Non-Functional Requirements
These are requirements that are not functional in nature; that is, they are constraints
within which the system must work.
The program must be self-contained so that it can easily be moved from one computer
to another. It is assumed that a network connection will be available on the computer on
which the program resides.
Capacity, scalability and availability.
The system shall achieve 100 per cent availability at all times.
The system shall be scalable to support additional clients and volunteers.
Maintainability.
The system should be optimized for supportability, or ease of maintenance, as far as
possible. This may be achieved through the use and documentation of coding standards,
naming conventions, class libraries and abstraction.
Randomness, verifiability and load balancing.
The system should check nodes at random and should be load balanced.
Chapter 3
LITERATURE SURVEY
A literature survey, or literature review, in a project report presents the
analyses and research carried out in the field of interest and the results already published,
taking into account the parameters and the scope of the project. A literature
survey is carried out mainly to analyse the background of the current project; this
helps to find flaws in the existing system and indicates which unsolved problems we can
work on. The following topics therefore not only illustrate the background of the project but
also uncover the problems and flaws that motivated us to propose solutions and work on this
project.
A literature survey is the part of a scholarly paper that presents the current
knowledge on a particular topic, including substantive findings as well as theoretical and
methodological contributions. Literature reviews use secondary sources and do not
report new or original experimental work. Most often associated with academic-oriented
literature, such as a thesis, dissertation or peer-reviewed journal article, a literature
review usually precedes the methodology and results sections, although this is not always
the case. Literature reviews are also common in a research proposal or prospectus (the
document that is approved before a student formally begins a dissertation or thesis). Its
main goals are to situate the current study within the body of literature and to provide
context for the reader. Literature reviews are a basis for research in nearly every
academic field. A literature survey includes the following:
• Existing theories about the topic which are accepted universally.
• Books written on the topic, both generic and specific.
• Research done in the field, usually in order from oldest to latest.
• Challenges being faced and ongoing work, if available.
A literature survey describes the existing work related to the given project. It deals with the
problems associated with the existing system and gives the reader a clear understanding of how
to address those problems and provide solutions to them.
Objectives of Literature Survey
• Learning the definitions of the concepts.
• Access to latest approaches, methods and theories.
• Discovering research topics based on the existing research.
• Concentrate on your own field of expertise. Even if another field uses the same
words, they usually mean something completely different.
• Excluding sidetracks improves the quality of the literature survey. Remember to
state explicitly what is excluded.
Before building our application, the following works were taken into consideration:
Malware Analysis and Detection Using Machine Learning Algorithms, Muhammad
Shoaib Akhtar and Tao Feng [1]
Malware is a major threat to the security of computer systems
and networks. Traditional signature-based malware detection methods are becoming
increasingly ineffective against new and emerging malware strains. Machine learning (ML)
algorithms have the potential to overcome these limitations by detecting malware based on its
behaviour and other characteristics. This paper presents a comprehensive survey of ML
algorithms for malware analysis and detection. The authors discuss the different types of ML
algorithms that can be used for malware detection, as well as the different features that can be
extracted from malware samples for classification. They also review state-of-the-art ML-based
malware detection systems and their performance. The authors conclude that ML
algorithms are a promising approach for malware detection. However, they also highlight
some of the challenges that need to be addressed before ML-based malware detection systems
can be widely deployed. These challenges include the need for large and well-labelled
malware datasets, the need to develop ML algorithms that are robust to adversarial attacks, and
the need to develop ML algorithms that can be deployed in real time. Overall, this paper
provides a valuable overview of ML algorithms for malware analysis and detection. It is a
must-read for anyone interested in this area of research.
A State-of-the-Art Survey of Malware Detection Approaches Using Data Mining Techniques,
Alireza Souri and Rahil Hosseini [4]
Malware detection is a challenging task, especially in the
face of new and emerging malware strains. Data mining techniques have the potential to
overcome the limitations of traditional malware detection methods by detecting malware based
on its behaviour and other characteristics. This paper presents a state-of-the-art survey of
malware detection approaches using data mining techniques. The authors discuss the different
types of malware detection approaches, as well as the different data mining techniques that can
be used for malware detection. They also review state-of-the-art malware detection systems
and their performance. The authors conclude that data mining techniques are a promising
approach for malware detection. However, they also highlight some of the challenges that need
to be addressed before data mining-based malware detection systems can be widely deployed.
These challenges include the need for large and well-labelled malware datasets, the need to
develop data mining algorithms that are robust to adversarial attacks, and the need to develop
data mining algorithms that can be deployed in real time. Overall, this paper provides a
valuable overview of data mining techniques for malware detection. It is a must-read for
anyone interested in this area of research.
3.1 Existing System
A high-performance malware detection system using deep learning and feature selection
methodologies is introduced. Two different malware datasets are used to detect malware and
differentiate it from benign activities. The datasets are preprocessed, and then correlation-
based feature selection is applied to produce different feature-selected datasets. Dense and
LSTM-based deep learning models are then trained on these different versions of the feature-
selected datasets.
Techniques Used:
The LSTM technique is used.
3.2 Drawbacks of the Existing System:
Because of its deep learning architecture, it requires a long training time.
Accuracy is less than 90%.
It can detect whether an attack is present, but it cannot distinguish between different types of
malware.
3.3 Proposed System
This project aims to use a multi-classifier approach to detect and classify malware.
Malware classification is treated as both a binary and a multi-class problem.
Binary classification differentiates between the malicious and benign classes,
whereas multi-class classification assigns malicious samples to the Virus, Trojan,
Spyware, Worm, Ransomware, and Adware types. A supervised learning approach is used, with
machine learning models such as Random Forest, Decision Tree, Support Vector Machine,
Naïve Bayes, and K-Nearest Neighbour applied to the classification of
malware. The results show that Random Forest performs well on both the binary
and the multi-class problem, with accuracies of 95% and 91% respectively.
Advantages:
Lower implementation time.
Accuracy is above 90%.
It also detects the type of attack.
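As a rough illustration of the binary and multi-class classification described in this section, the sketch below implements one of the listed models, K-Nearest Neighbour, in plain Python and derives the binary decision from the predicted class. The two-dimensional feature vectors, the labels and the function names are hypothetical; the project itself is not confirmed to use this code or these features.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest neighbours."""
    distances = sorted(
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(x, query))), label)
        for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


def to_binary(label):
    """Collapse a multi-class prediction into the benign/malicious decision."""
    return "benign" if label == "Benign" else "malicious"


# Hypothetical, already-scaled feature vectors (two features per sample).
train_X = [
    (0.10, 0.20), (0.20, 0.10), (0.15, 0.15),   # Benign
    (0.90, 0.80), (0.80, 0.90), (0.85, 0.85),   # Trojan
    (0.90, 0.10), (0.80, 0.20), (0.85, 0.15),   # Ransomware
]
train_y = ["Benign"] * 3 + ["Trojan"] * 3 + ["Ransomware"] * 3

family = knn_predict(train_X, train_y, (0.82, 0.88))
print(family, to_binary(family))  # Trojan malicious
```

Deriving the binary label from the multi-class prediction is only one possible design; training a separate binary classifier first and classifying the family afterwards is an equally plausible reading of the text.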
3.4 Working
The main objective of this malware detection system is to detect the different types of
attacks efficiently.
Chapter 5
OBJECTIVES
To investigate how to apply machine learning to malware detection in order to
detect unknown malware.
To develop malware detection software that applies machine learning to detect
unknown malware.
To validate that malware detection based on machine learning can
achieve a high accuracy rate with a low false positive rate.
To effectively detect malware in specific types of files, such as executable files,
PDFs, or images.
Chapter 4
PROBLEM STATEMENT
The term malware is a contraction of malicious software. Put simply, malware is any piece of software
that was written with the intent of doing harm to data, devices or people. A major part of
protecting a computer system from a malware attack is to identify whether a given file or piece of
software is malware and, if so, to identify its family. Malware detection and malware classification
are two separate tasks performed by anti-malware and cybersecurity companies. Once
detected, malware needs to be categorized into a specific family for further analysis.
In general, classification accuracy and classification time tend to trade off against each other.
For example, a decision tree (DT) classifies quickly because it uses a single tree, but its
classification performance may be degraded by instability and dependency on a particular set of
features. On the other hand, random forest (RF) classifies more accurately than DT in most cases,
but its classification speed is much slower than that of DT. Thus, by combining the fast but less
accurate classifier with the slow but more accurate classifier, we can develop a classifier that is
both fast and accurate.
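One way to combine a fast classifier with a slower, more accurate one is a confidence-based cascade: the fast model decides when it is confident, and only uncertain samples are passed to the slow model. The text does not state how the two classifiers are combined, so the cascade, the threshold of 0.9 and the two stand-in models below are assumptions made purely for illustration.

```python
def cascade_predict(sample, fast_model, slow_model, threshold=0.9):
    """Use the fast model when it is confident; otherwise defer to the slow one.

    Both models are callables returning (label, confidence in [0, 1]).
    Returns (label, name of the stage that made the decision).
    """
    label, confidence = fast_model(sample)
    if confidence >= threshold:
        return label, "fast"
    label, _ = slow_model(sample)
    return label, "slow"


# Hypothetical stand-ins for a trained DT (fast) and RF (slow).
def fast_model(sample):
    score = sample["score"]
    label = "malicious" if score >= 0.5 else "benign"
    return label, max(score, 1 - score)  # confidence grows away from 0.5


def slow_model(sample):
    combined = sample["score"] + sample["extra"]
    return ("malicious" if combined >= 0.6 else "benign"), 1.0


print(cascade_predict({"score": 0.95, "extra": 0.0}, fast_model, slow_model))
# ('malicious', 'fast')
print(cascade_predict({"score": 0.55, "extra": -0.1}, fast_model, slow_model))
# ('benign', 'slow')
```

With such a cascade the average classification time approaches that of the fast model when most samples are easy, while the hard samples still receive the more accurate decision.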
Chapter 6
METHODOLOGY
Figure 1: Flowchart
The system architecture illustrates the workflow from start to finish.
Chapter 7
FUNCTIONAL MODULES
7.1 Dataset
This study relied entirely on data provided by the Canadian Institute for Cybersecurity. The
collection has many data files that include log data for various types of malware. The
features recovered from these logs may be used to train a broad variety of models.
Approximately 51 distinct malware families were found in the samples. The dataset contained
17,394 data points (rows) from different locations, each described by 279 columns.
7.2 Pre-Processing
The data were stored in the file system as binary code, and the files themselves were unprocessed
executables, so we prepared them before analysis. Unpacking the executables
required a protected environment, namely a virtual machine (VM). PEiD software was used to
automate the unpacking of compressed executables.
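A common heuristic for deciding whether an executable is likely to be packed or compressed, and therefore needs unpacking, is the Shannon entropy of its bytes. This heuristic is not described in the text above and is independent of PEiD; the sketch below is a hedged illustration, and the threshold of 7.0 bits per byte and the function name `looks_packed` are assumptions.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def looks_packed(data: bytes, threshold: float = 7.0) -> bool:
    """Flag data whose entropy exceeds an assumed threshold (a heuristic only)."""
    return byte_entropy(data) > threshold


# Synthetic inputs: uniformly distributed bytes versus highly repetitive bytes.
uniform_bytes = bytes(range(256)) * 4
repetitive_bytes = b"A" * 1024

print(byte_entropy(uniform_bytes))     # 8.0
print(byte_entropy(repetitive_bytes))  # 0.0
```

Compressed or encrypted sections tend towards the upper end of this range, so a high value only suggests packing; it does not prove it, and legitimate compressed resources can trigger the same flag.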
7.3 Feature Extraction
Modern datasets frequently contain tens of thousands of features. In recent years, as
feature counts have grown, it has become clear that the resulting machine learning models are
prone to overfitting. To address this problem, we built a smaller set of features from the larger
set; this technique is commonly used to maintain the same degree of accuracy while using fewer
features. The goal of this step was to refine the existing dataset of dynamic and static features
by keeping those that were most helpful and eliminating those that were not valuable for data
analysis.
7.4 Feature Selection
After feature extraction, which involved the discovery of additional features, feature
selection was performed. Feature selection was a crucial step for enhancing accuracy,
simplifying the model, and reducing overfitting, as it involved choosing the most useful features
from the pool of newly extracted ones. Researchers have used many feature classification
strategies in the past in an effort to identify malicious code in software.
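The text does not specify which selection criterion was applied. As one possible illustration, the sketch below filters features by the absolute Pearson correlation between each feature column and the class label, keeping only those at or above a threshold. The column names, values and the threshold of 0.5 are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, labels, threshold=0.5):
    """Return the names of columns whose |r| with the label is >= threshold."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, labels)) >= threshold
    ]


# Hypothetical feature columns; labels: 1 = malware, 0 = benign.
columns = {
    "api_calls": [10, 12, 11, 40, 42, 45],
    "file_size_kb": [100, 300, 200, 300, 100, 200],
    "constant_flag": [1, 1, 1, 1, 1, 1],
}
labels = [0, 0, 0, 1, 1, 1]

print(select_features(columns, labels))  # ['api_calls']
```

A filter of this kind is cheap to compute but only captures linear relationships with the label; wrapper or embedded methods are alternatives when features interact.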
7.5 Results and Discussion
The two main phases of the classification process were training and testing. In the training
phase, the system was given both malicious and benign files, and the classifiers were trained
using a learning algorithm. Each classifier (KNN, CNN, NB, RF, SVM, or DT) improved with each
set of labelled data it processed. In the testing phase, a classifier was given a collection of
new files, some malicious and some not, and determined whether each file was malicious or clean.
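The separation into training and testing phases can be sketched as a stratified hold-out split, in which each class keeps roughly the same share in both sets. The split ratio, the random seed, the function name and the placeholder samples are assumptions; the text does not state how the data were divided.

```python
import random


def stratified_split(samples, labels, test_fraction=0.25, seed=0):
    """Split (sample, label) pairs into train and test sets, class by class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    train, test = [], []
    for label, group in by_class.items():
        rng.shuffle(group)  # shuffle within the class before cutting
        n_test = max(1, round(len(group) * test_fraction))
        test += [(s, label) for s in group[:n_test]]
        train += [(s, label) for s in group[n_test:]]
    return train, test


# Placeholder sample identifiers: 12 malicious files and 8 benign files.
samples = list(range(20))
labels = ["malicious"] * 12 + ["benign"] * 8

train_set, test_set = stratified_split(samples, labels)
print(len(train_set), len(test_set))  # 15 5
```

The classifiers would then be fitted on `train_set` only and scored on `test_set`, so that the reported accuracy reflects files the model has not seen during training.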
Chapter 8
CONCLUSION
This project reflects the growing interest that researchers have recently shown in ML-based
solutions for malware identification. We presented a protective mechanism that
evaluated three ML approaches to malware detection and chose the most
appropriate one. The results show that, compared with other classifiers, DT (99%), LR
(98.76%), and SVM (96.41%) performed well in terms of detection accuracy.
REFERENCES
[1] Akhtar, M.S.; Feng, T., “Malware Analysis and Detection Using Machine Learning
Algorithms” (2022), DOI: 10.3390/sym14112304.
[2] Kamboj, A.; Kumar, P.; Bairwa, A.K., “Detection of Malware in Downloaded Files
Using Various Machine Learning Models” (2022), DOI: 10.1016/j.eij.2022.12.002.
[3] Sinha, R., “Study of Malware Detection Using Machine Learning”, DOI:
10.13140/RG.2.2.11478.16963.
[4] Souri, A.; Hosseini, R., “A State-of-the-Art Survey of Malware Detection Approaches
Using Data Mining Techniques” (2018), Hum. Cent. Comput. Inf. Sci., DOI: 10.1186/s13673.