Malware - Me Project Document
ABSTRACT:
Although Android apps are expanding rapidly throughout the mobile ecosystem, Android
malware continues to emerge. Malware operations are on the rise, particularly on
Android phones, which make up 72.2 percent of all smartphone sales. Credential theft,
eavesdropping, and malicious advertising are just some of the methods hackers use to
attack cell phones. Many researchers have examined Android malware detection from
various perspectives and presented hypotheses and methodologies. Machine learning
(ML)-based techniques have proven effective in identifying these attacks because they
can build a classifier from a set of training cases, eliminating the need for explicit
signature definitions in malware detection. This project provides a detailed examination
of machine-learning-based Android malware detection approaches. According to current
research, machine learning is a powerful and promising solution for identifying Android
malware.
CHAPTER 1
INTRODUCTION
Malware uses three main types of infiltration mechanisms to infect Android systems:
repackaging, updating, and downloading.
• Repackaging: Intruders lure users by creating apps that resemble some of the
most popular ones, tempting them with a variety of useful features. These apps are
repackaged, cloned versions of the original apps that contain malicious code.
• Updating: Many malware authors include an update component in their app that can
later download malicious code while the app is in use.
• Downloading: Malware developers lure users into downloading attractive apps that
falsely promise a variety of amazing features and have malicious code inserted
inside them.
Many malware detection mechanisms are based on content signatures, which compare an
app’s signature to a database of known malware signature definitions. This mechanism
can detect only known malware, not new variants. Research has shown that
signature-based approaches struggle to keep up with the speed of new malware
development. Hence, there is an urgent need to research and develop solutions that
reduce the non-detection of malware on the Android platform. Effective solutions can
include characteristic-based and behavioral-based (static or dynamic) methods. One of
the most popular static behavioral-based methods of malware detection analyzes an app’s
requested permission list and resource usage, e.g. Location Services, Contact
Information, WiFi, etc. In our approach, we extracted data from more than one thousand
user-defined permissions. A more powerful approach is dynamic behavioral-based
detection, which observes the behavior of applications while they run, e.g. dynamic API
calls, capturing the run-time activities of the application. However, such an approach
is complex and time-consuming. We therefore propose a new framework for classifying
Android applications that applies machine learning (ML) techniques to large and diverse
datasets. Our framework compiles a set of more than one thousand different
app-requested permissions. Permissions are encoded to train ML classifiers to detect
malicious applications. Experiments are run with ML input datasets of differing quality
and quantity.
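The permission encoding mentioned above can be pictured as a binary feature vector with one position per known permission. The following is only a minimal sketch: the four-entry vocabulary and the `encode_permissions` helper are hypothetical stand-ins for the project's much larger permission set and its actual encoder.

```python
# Hypothetical vocabulary for illustration; the project compiles more than
# one thousand app-requested permissions.
PERMISSION_VOCAB = [
    "android.permission.INTERNET",
    "android.permission.READ_CONTACTS",
    "android.permission.ACCESS_FINE_LOCATION",
    "android.permission.SEND_SMS",
]


def encode_permissions(requested, vocab=PERMISSION_VOCAB):
    """Return a 0/1 list marking which vocabulary permissions the app requests."""
    requested_set = set(requested)
    return [1 if perm in requested_set else 0 for perm in vocab]


# Permissions requested by one (made-up) application.
app_permissions = ["android.permission.INTERNET", "android.permission.SEND_SMS"]
feature_vector = encode_permissions(app_permissions)
print(feature_vector)  # [1, 0, 0, 1]
```

Each application then becomes one fixed-length row, which is the input shape most ML classifiers expect.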
knowledge. Degradation of the overall performance of the system and causing its
instability.
The first Android smartphone was launched in September 2008, and shortly thereafter,
smartphones powered by the new open-source operating system were everywhere. By
2021, almost 12 enhanced versions of Android had been released, and it is the most
widely used mobile operating system in the world, with an 84% share of the global
smartphone market [1].
With this level of adoption coupled with the open-source nature of Android applications,
security attacks are becoming more and more ubiquitous and seriously threaten the
integrity of Android applications. Statistics show that more than 50 million malware and
potentially unwanted applications (PUA) have been identified for Android [2].
Researchers have been studying the nature of malware applications for many years and
have categorized them into different families [3]:
Trojans: These appear as benign apps and aim to steal the user's confidential
information without the user's knowledge.
Backdoors: These exploit root grant privileges and aim to gain control over the
device and perform any operation without the user's knowledge.
Worms: This malware creates copies of itself and distributes them over the
mobile device's networks.
Spyware: These appear as benign apps designed to monitor the user's confidential
information, such as messages, contacts, location, bank information, etc., for
undesirable consequences.
Botnets: A botnet is a network of compromised Android devices controlled by a
remote server.
Ransomware: This malware prevents users from accessing their data by locking
the mobile phone until a ransom amount is paid.
Riskware: These are legitimate programs that malicious authors exploit to reduce the
device's performance or harm the user's data.
Standard approaches to detecting malware can be of two types – Static and Dynamic.
Static Approach:
Dynamic Approach:
In this approach, the application is examined during execution, which can help identify
malware that static analysis techniques fail to detect because of code obfuscation and
encryption of the malware.
These approaches can be further sub-divided based on the method of anomaly detection.
Some of these sub-divisions are shown in the diagram below –
application the permission to send information over the internet or access contacts. These
permissions are assumed to be needed for the application to perform its designed
functions. However, many times, applications request permissions that are not required
for their functionality.
By using publicly available labeled data sources, different classification models are built
to distinguish between malware and benign applications. The first data source contains
information on permissions granted to the applications, while the second data source
contains information on API call signatures of the applications.
Various data mining models are trained, and their performance metrics, such as precision
and recall, are analyzed and compared. This comparison is first made for different
classification models within an approach. Then, the best results for each approach are
compared to understand which of the two systems is better for detecting malware.
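As a small illustration of the two metrics named above, precision and recall can be computed directly from true and predicted labels. This is a sketch on made-up labels; the `precision_recall` helper and the choice of "malware" as the positive class are assumptions for the example, not the project's evaluation code.

```python
def precision_recall(y_true, y_pred, positive="malware"):
    """Compute precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true = ["malware", "malware", "benign", "benign", "malware"]
y_pred = ["malware", "benign", "benign", "malware", "malware"]
precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Precision penalizes benign apps flagged as malware, while recall penalizes malware that slips through, so comparing models on both gives a fuller picture than accuracy alone.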
Dimensionality Reduction:
The datasets used in the project had a large number of features, so it was necessary to
reduce the dimensions before using classification algorithms.
Frequency Counts:
Features that contain only one unique value are removed because they cannot help
differentiate between the classes.
Correlations:
If there are any features that are highly correlated to each other, then only one of those
features is retained.
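The two reduction steps above could be sketched with pandas roughly as follows. The helper names, the toy feature table, and the 0.95 correlation threshold are assumptions made for illustration; the report does not state which threshold was used.

```python
import numpy as np
import pandas as pd


def drop_constant_features(df):
    """Remove columns that hold a single unique value."""
    keep = [col for col in df.columns if df[col].nunique(dropna=False) > 1]
    return df[keep]


def drop_correlated_features(df, threshold=0.95):
    """For each pair of highly correlated columns, keep only the first one."""
    corr = df.corr().abs()
    # Keep the upper triangle (without the diagonal) so each pair is seen once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Toy feature table (values are illustrative only).
features = pd.DataFrame({
    "tcp_packets": [33, 95, 27, 40],
    "tcp_packets_copy": [66, 190, 54, 80],  # perfectly correlated with tcp_packets
    "udp_packets": [0, 0, 0, 0],            # constant, carries no information
    "dns_query_times": [1, 10, 5, 2],
})
reduced = drop_correlated_features(drop_constant_features(features))
print(list(reduced.columns))  # ['tcp_packets', 'dns_query_times']
```

Constant columns are dropped first because their correlation with other columns is undefined.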
Exploratory Analysis:
In this step, visualizations are used to explore the data in depth.
CHAPTER 2
LITERATURE REVIEW
Botnet detection has been an active research area over recent decades.
Researchers have been working hard to develop effective techniques to detect
botnets. A review of existing approaches shows that many of them target
specific botnets. Many other approaches try to identify any botnet activity by
analysing network traffic. They achieve this by concatenating existing botnet
datasets to obtain larger datasets, building predictive models using these
datasets, and then employing these models to predict whether network traffic
is safe or harmful. The problem with the first kind of approach is that data is
usually scarce and costly to obtain. With small amounts of data, the quality of
predictive models will always be questionable. The problem with the second
kind of approach, on the other hand, is that it is not always correct to
concatenate datasets containing network traffic from different botnets. Datasets
can have different distributions, which means they can degrade the predictive
performance of machine learning models. Instead of concatenating datasets, we
propose using transfer learning approaches to carefully decide what data to
use. Our hypothesis is "Predictive performance can be improved by
using transfer learning techniques across datasets containing network traffic
captures the packet and monitors the contents of the packet. The collector
captures the flow traffic, and the analyzer component initiates an automated
analysis of traffic with the captured packet information. The packet flow
information is collected by a virtual interface and a physical probe. The virtual
interface collects malicious traffic information between the Virtual Machines
(VMs), and the physical probe gathers malicious traffic information between
the network bridges connecting VMs. The information collected by these
techniques is analyzed to detect botnets in inter-VM and intra-VM traffic.
Compared to the existing Dendritic Cell Algorithm (DCA), the proposed
VM-based botnet detection system has lower time consumption, higher
detection speed, and a higher attack prevention ratio.
Among various network attacks, botnet-led attacks are considered the most
serious threats. A botnet, i.e., a network of compromised computers, is able to
perform large-scale illegal activities such as Distributed Denial of Service
attacks, click fraud, bitcoin mining, etc. These attacks are a major concern
nowadays. In this paper, we present a comprehensive review of botnets, their
lifecycle, and their types. We also discuss the behavior of peer-to-peer
botnets and the latest techniques for detecting them.
labeled botnet traffic data, compared the SoNSTAR system with three leading
machine learning-based traffic classifiers in a botnet activity detection test.
SoNSTAR demonstrated greater accuracy (99.92%), precision (97.1%), and
recall (99.5%) and much lower false positive rates (0.007%) than the other
techniques. The knowledge generated about characteristic botnet behaviors
could be used in the development of future IDSs.
and a bar chart. Finally, some challenges and future scope of the machine
learning techniques in the improvement of IDS will be discussed.
CHAPTER 3
SYSTEM DESIGN
3.1 OBJECTIVE
For effective and efficient malware detection, the use of feature extraction is
recommended (Ahmadi, M. et al., 2016).
There are various types of detection methods; the method we use detects malware
through its hex and assembly files. Features are extracted from both the hex view
and the assembly view of malware files.
After features are extracted into their categories, all categories are combined into
one feature vector for the classifier to run on (Ahmadi, M. et al., 2016).
For feature selection, the binary file is separated into blocks so that the
similarities between malware binaries can be compared. This reduces the analysis
overhead, which makes the process faster.
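One common way to turn a hex view into numeric features is to count byte n-grams and normalize them to term frequencies. The sketch below assumes a space-separated hex dump and uses a hypothetical `byte_bigram_tf` helper; it illustrates the general idea and is not the extraction code used in this project.

```python
from collections import Counter


def byte_bigram_tf(hex_view):
    """Term frequency of consecutive byte pairs in a space-separated hex dump."""
    tokens = hex_view.split()
    bigrams = [tokens[i] + tokens[i + 1] for i in range(len(tokens) - 1)]
    total = len(bigrams)
    if total == 0:
        return {}
    counts = Counter(bigrams)
    return {gram: n / total for gram, n in counts.items()}


# Made-up six-byte hex view; "4D 5A" appears twice.
hex_view = "4D 5A 90 00 4D 5A"
tf = byte_bigram_tf(hex_view)
print(tf["4D5A"])  # 0.4
```

Features from other categories, such as the assembly view, can be appended to the same vector before it is handed to a classifier.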
Disadvantages
A number of experiments were evaluated, and the setting of 2-grams with TF, using
300 features selected by the DF measure, was found to perform best, but the approach
lacks ML-specific techniques.
This project presents a static analysis method based on data mining that extends
the general heuristic detection techniques, using a variable-length instruction
sequence mining approach for the purpose of scareware detection, but metric-specific
and unsupervised techniques are not included, and the method can be broken.
The research will cover features obtained from the application code. Three methods
of feature selection will be tested. Then, for each classifier, its selected parameters
will be tested, and with the parameters thus determined, the classification will be
evaluated as a function of the number of features taken into account.
Next, the most common features in malware will be listed. The best results in terms
of the number of correctly classified instances will be compiled for the 5 tested
classifiers, along with the running time of the algorithm and the time taken by
preprocessing and feature extraction.
Machine learning can identify malware in the datasets using the following algorithms:
K-nearest neighbors
Decision tree
Random forest
SVM.
Naïve Bayes
PROPOSED ARCHITECTURE
[Figure: Android malware dataset -> preprocessing -> training data and testing data ->
training with ML algorithms -> build model -> trained model -> metric evaluation ->
final model; a Flask front end (HTML) takes user input and returns the prediction
(malware: y/n)]
Fig: Architecture
3.4 MODULES:
Data Collection
Data Analysis
Data Preprocessing
Model training
Model testing
Comparative analysis
Flask framework prediction
MODULE:1
Dataset: Kaggle
Attributes
name,tcp_packets,dist_port_tcp,external_ips,vulume_bytes,udp_packets,
tcp_urg_packet,source_app_packets,remote_app_packets,source_app_bytes,
remote_app_bytes,duracion,avg_local_pkt_rate,avg_remote_pkt_rate,
source_app_packets,dns_query_times,type
SAMPLE DATA :
AntiVirus;33;0;1;4502;0;0;34;29;2888;4580;NA;NA;NA;34;1;benign
AntiVirus;95;0;9;22495;0;0;105;90;20823;23268;NA;NA;NA;105;10;benign
AntiVirus;27;6;4;4731;0;0;32;27;3888;5134;NA;NA;NA;32;5;benign
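The sample rows above are semicolon-separated, so they could be loaded with pandas along the following lines. This is a sketch: the rows are embedded as a string instead of being read from the Kaggle file, and because `source_app_packets` appears twice in the attribute list, the second occurrence is renamed `source_app_packets.1` here as an assumption to keep column names unique.

```python
import io

import pandas as pd

# Column names from the attribute list; the ".1" suffix is an assumption
# made here because pandas requires unique names.
COLUMNS = [
    "name", "tcp_packets", "dist_port_tcp", "external_ips", "vulume_bytes",
    "udp_packets", "tcp_urg_packet", "source_app_packets", "remote_app_packets",
    "source_app_bytes", "remote_app_bytes", "duracion", "avg_local_pkt_rate",
    "avg_remote_pkt_rate", "source_app_packets.1", "dns_query_times", "type",
]

SAMPLE = """AntiVirus;33;0;1;4502;0;0;34;29;2888;4580;NA;NA;NA;34;1;benign
AntiVirus;95;0;9;22495;0;0;105;90;20823;23268;NA;NA;NA;105;10;benign
AntiVirus;27;6;4;4731;0;0;32;27;3888;5134;NA;NA;NA;32;5;benign
"""

# "NA" is parsed as a missing value by default.
df = pd.read_csv(io.StringIO(SAMPLE), sep=";", header=None, names=COLUMNS)
print(df.shape)                        # (3, 17)
print(df["duracion"].isna().all())     # True: the column is empty in the sample
```

In these three rows `duracion`, `avg_local_pkt_rate`, and `avg_remote_pkt_rate` are entirely missing, so they would fall out in the frequency-count step of preprocessing.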
MODULE:2
Data Analysis
MODULE:3
DATA PREPROCESSING
MODULE:4
MODEL TRAINING
K-nearest neighbors
Decision tree
Random forest
SVM.
Naïve Bayes
KNN:
n_neighbors: Number of neighbors to use for prediction. In this case, the value is
set to 1.
The classifier is then fit on the training data using the fit method and the trained
classifier is used to make predictions on the test data using the predict method. The
predicted class labels are stored in the pred variable.
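The KNN setup described above might look like the following in scikit-learn. This is a sketch only: the synthetic data from `make_classification` and the 75/25 split are stand-ins assumed for the example, since the report does not show the project's preprocessed feature matrix or split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed traffic features and labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# n_neighbors=1: each test sample takes the label of its single nearest neighbor.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
print(pred[:10])
```

With a single neighbor the model memorizes the training set, so comparing test metrics against the other classifiers is what shows whether it generalizes.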
Decision tree:
A decision tree classifier is built using the DecisionTreeClassifier class from
scikit-learn's tree module. The classifier is initialized with default parameters.
The classifier is then fit on the training data using the fit method and the trained
classifier is used to make predictions on the test data using the predict method. The
predicted class labels are stored in the pred variable.
RANDOM FOREST :
n_estimators: Number of trees in the forest. The default value is 100. In this
case, the value is set to 650.
max_depth: Maximum depth of the tree. The default value is None, which
means the nodes are expanded until all leaves are pure or until all leaves
contain fewer than min_samples_split samples. In this case, the value is set to 60.
random_state: Seed for the random number generator. In this case, the value is
set to 25.
The classifier is then fit on the training data using the fit method and the
trained classifier is used to make predictions on the test data using the predict
method.
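The random forest configuration listed above can be sketched as follows. The hyperparameter values are the ones stated in the text; the synthetic data and the train/test split are assumptions standing in for the project's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed traffic features and labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

rf = RandomForestClassifier(
    n_estimators=650,  # number of trees in the forest
    max_depth=60,      # cap on the depth of each tree
    random_state=25,   # fixed seed so results are reproducible
)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)
print(len(rf.estimators_))  # 650
```

Fixing `random_state` matters here because each tree is trained on a random bootstrap sample; without it, repeated runs would give slightly different predictions.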
SVM:
A linear support vector machine classifier is built using the LinearSVC class from
scikit-learn's svm module. The classifier is initialized with default parameters. The
classifier is then fit on the training data using the fit method.
NAÏVE BAYES:
A Gaussian Naive Bayes classifier is built using the GaussianNB class from
scikit-learn's naive_bayes module. The classifier is initialized with default parameters.
The classifier is then fit on the training data using the fit method. Finally, the
trained classifier is used to make predictions on the test data using the predict
method. The predicted class labels are stored in the pred variable.
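To show how the five classifiers in this module could be trained and compared side by side, the sketch below fits each one on the same split and records its test accuracy. The synthetic data, the split, and the use of accuracy as the comparison metric are assumptions for illustration; the settings for each model follow the descriptions above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed traffic features and labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=1),
    "Decision tree": DecisionTreeClassifier(),
    "Random forest": RandomForestClassifier(
        n_estimators=650, max_depth=60, random_state=25
    ),
    "SVM": LinearSVC(),
    "Naive Bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)     # train on the training split
    pred = model.predict(X_test)    # predicted class labels for the test split
    scores[name] = accuracy_score(y_test, pred)

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

The best-scoring entry in such a comparison is the natural candidate for the final model served through the Flask front end.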
BLOCK DIAGRAM
[Figure: malware dataset and data input -> preprocessing -> ML algorithm -> model
training and testing -> final best model -> Flask framework web application ->
prediction -> malware detected / malware not detected]
Fig: Block diagram
RAM: 2 GB
CHAPTER 4
RESULT
CHAPTER 5
CONCLUSION
Our main target was to come up with a machine learning framework that generically
detects as many malware samples as it can, with the tough constraint of having a zero
false positive rate. We came very close to our goal, although we still have a non-zero
false positive rate. For this framework to become part of a highly competitive
commercial product, a number of deterministic exception mechanisms have to be added.
In our opinion, malware detection via machine learning will not replace the standard
detection methods used by anti-virus vendors, but will complement them. Any
commercial anti-virus product is subject to certain speed and memory limitations,
therefore the most reliable algorithms. This paper focuses on malware detection for
Android, currently the most popular mobile system, using static analysis. In this thesis,
an overview of Android malware analysis was presented, and a unique set of features was
chosen that was later used in the study of malware classification. Five classification
algorithms (Random Forest, SVM, K-NN, Naive Bayes, Logistic Regression) and three
attribute selection algorithms were examined in order to choose those that would provide
the most effective malware detection. The characteristics of malicious software were
identified based on a collected set of applications. This analysis was conducted for
features extracted from Java class code. It was determined which source of features
provides the higher quality of classification.
FUTURE ENHANCEMENT:
References
Almin, S. B., & Chatterjee, M. (2015). A novel approach to detect Android malware.
Procedia Computer Science, 45, 407-417.
Arshad et al. (2016). Android malware detection & protection: a survey. International
Journal of Advanced Computer Science and Applications, 7(2), 463-475.
Assisi, A., Abhijith, A., Babu, A., & Nair, A. M. Significant permission identification
for machine learning based android malware detection: a review.
Christiana, A., Gyunka, B., & Noah, A. (2020). Android Malware Detection through
Machine Learning Techniques: A Review.
Daoudi et al. (2021, August). Dexray: A simple, yet effective deep learning approach
to android malware detection based on image representation of bytecode. In
International Workshop on Deployable Machine Learning for Security Defense, 81-
106.
Fallah, S., & Bidgoly, A. J. (2019). Benchmarking machine learning algorithms for
android malware detection. Jordanian Journal of Computers and Information
Technology (JJCIT), 5(03).
Jiang et al. (2020). Android malware detection using fine-grained features. Scientific
Programming.
Kumar, R., Xiaosong, Z., Khan, R. U., Kumar, J., & Ahad, I. (2018, March). Effective
and explainable detection of android malware based on machine learning algorithms.
In Proceedings of the 2018 International Conference on Computing and Artificial
Intelligence (pp. 35-40).
Kyaw, M. T., & Kham, N. S. M. (2019). Machine Learning Based Android Malware
Detection using Significant Permission Identification (Doctoral dissertation, MERAL
Portal).
Li, J., Sun, L., Yan, Q., Li, Z., Srisa-An, W., & Ye, H. (2018). Significant permission
identification for machine-learning-based android malware detection. IEEE
Transactions on Industrial Informatics, 14(7), 3216-3225.
Needham, M. (2021). Smartphone market share: Supply chain constraints finally catch
up to the global smartphone market, contributing to a 6.7% decline in third quarter
shipments, according to IDC. Retrieved Oct, 2021 from
https://www.idc.com/promo/smartphone-market-share/os
Rana, M. S., Gudla, C., & Sung, A. H. (2018, December). Evaluating machine
learning models for Android malware detection: A comparison study. In Proceedings
of the 2018 VII International Conference on Network, Communication and Computing
(pp. 17-21).
Shao, K., Xiong, Q., & Cai, Z. (2021). FB2Droid: A Novel Malware Family-Based
Bagging Algorithm for Android Malware Detection. Security and Communication
Networks.
Sharma, S., Krishna, C. R., & Kumar, R. (2020, November). Android Ransomware
Detection using Machine Learning Techniques: A Comparative Analysis on GPU and
CPU. In 2020 21st International Arab Conference on Information Technology (ACIT)
(pp. 1-6). IEEE.
Singh, D., Karpa, S., & Chawla, I. (2022). “Emerging trends in computational
intelligence to solve real-world problems” Android Malware Detection Using
Machine Learning. In the International Conference on Innovative Computing and
Communications (329-341). Springer, Singapore.
Wen, L., & Yu, H. (2017, August). An Android malware detection system based on
machine learning. In AIP Conference Proceedings (Vol. 1864, No. 1, p. 020136). AIP
Publishing LLC.
Yerima, S. Y., & Sezer, S. (2018). Droidfusion: A novel multilevel classifier fusion
approach for android malware detection. IEEE Transactions on Cybernetics, 49(2),
453-466.