International Journal of Intelligent Information Systems
2014; 3(6-1): 33-37
Published online October 20, 2014 (http://www.sciencepublishinggroup.com/j/ijiis)
doi: 10.11648/j.ijiis.s.2014030601.16
ISSN: 2328-7675 (Print); ISSN: 2328-7683 (Online)
Malware detection using data mining techniques
Sara Najari, Iman Lotfi
Computer Department, Payam Noor University, Tehran, Iran
Email address:
[email protected] (S. Najari), [email protected] (I. Lotfi)
To cite this article:
Sara Najari, Iman Lotfi. Malware Detection Using Data Mining Techniques. International Journal of Intelligent Information Systems. Special
Issue: Research and Practices in Information Systems and Technologies in Developing Countries. Vol. 3, No. 6-1, 2014, pp. 33-37.
doi: 10.11648/j.ijiis.s.2014030601.16
Abstract: Malicious software attacks and threats against data and information security have become increasingly complex. The variety and number of these attacks and threats have led to many kinds of defenses against them, but current detection technologies cope poorly with the new techniques that malware designers use to escape from anti-malware tools. In this research, we present a combination of static and dynamic methods to accelerate and improve the malware detection process, to enable malware detection systems to detect malware with high precision in less time, and to help network security experts react well, since timely detection of security threats is highly important in dealing with attacks.
Keywords: Malware, Malware Detection, Escape Techniques, Data Mining
1. Introduction

The continuous growth of malware has created enormous threats to information and security, so that cyber defense centers have become highly important in many countries. Just as a country's borders can be attacked in different ways, for example by smugglers and thieves, virtual space also suffers from such attacks [1].

Figure 1. Increased volume of malware from 2003 to 2010.

Experience has shown that most of these attacks come from malware. Timely detection of security attacks in virtual space is of significant importance in protecting resources. In order to detect such malware before its malicious effects appear, we should employ methods that distinguish good from bad software behavior, so that we can tell which software is problematic and which is not. To this end, we should investigate both types of software in order to avoid problems in the detection process [2].

Figure 1 shows the increase in the volume of malware from 2003 to 2010, as reported by Panda laboratory. It is predicted that this increasing trend will continue in the next few years at a much faster pace, so that the mean number of new threats exceeds 55000 per day. These attacks are usually directed at the computer networks of sensitive agencies such as security entities, banks, economic centers, information storage centers and so on.

2. Malware Definition and Analysis

Computer applications that have destructive content and are applied to a system by an invader are called malware, and the system to which they are applied is called the victim system [3]. The word malware covers viruses, worms, Trojans and any other program created for destructive goals and for abusing users' privacy.

But what is the difference between a virus and a worm? What is the difference between these two and a Trojan? Do antivirus programs apply against worms and Trojans or only
against the viruses? All of these questions originate from one source: the complex and complicated world of destructive code [1].

The enormous number of destructive programs available has made their classification difficult. Generally, malware is classified into several kinds based on behavior and attack method. One such classification, for example, is virus, worm, spyware and rootkit; each has its own behavior, as described below.

2.1. Virus

A virus is code that inserts itself into other programs, such as operating systems, and needs to run within the host program [4].

2.2. Worm

Worms are malware that transfer themselves from one system to another by self-propagating across a network of connected computers. Generally, viruses try to spread via a program, while worms, unlike viruses, place themselves on a single computer only and from there try to infect a computer network [1].

2.3. Trojan Horse

A Trojan horse is a type of malware that appears in the form of pieces of software code intended for useful purposes. It performs the functions the user wants but covertly runs a series of other actions alongside them. It can even destroy the integrity of a system [3].

2.4. Logic Bomb

A logic bomb does not propagate itself. It is installed on a system and waits for an external event, such as data input, the arrival of a particular date, or the creation, deletion or modification of a particular file, and then damages the system [2].

2.5. Backdoors

A backdoor is a kind of software that enters the computer system without authorization and achieves its goals without entering the system in the normal way [1].

2.6. Spy

Spyware is a term for a collection of software that collects the user's personal information, such as most visited pages, email addresses and keys pressed by the user [5].

2.7. Rootkit

A rootkit is malware that is able to hide itself and its activities on the target system. The owner of the rootkit can run files and apply settings on the victim system without the owner of the system being aware of it. A rootkit usually attaches itself to the original files of the operating system core and runs with them.

Rootkits target the original structures and programs of the operating system and the integrity of their contents in order to change the way they run and the results they produce. Rootkits can hide themselves from users through the following methods:
a) The rootkit integrates its code with low-level operating system code and can accordingly access all system requests, such as reading files, running processes and so on.
b) The rootkit transfers its malicious code into healthy processes and, by doing so, can use their memory to run its malicious programs [6].

Traditional malware detection methods are based on signatures: part of the malware code is held as a signature in a database, and detection is carried out using the signatures available in that database. Because these older methods have failed in recent years to detect new, unrecognized or polymorphic malware, researchers have tried to present more reliable detection methods that use unchanging characteristics of the malware [6].

Nowadays, an antivirus signature is created manually. Before writing a signature, the analyst must determine how to deal with the unknown sample as a threat to users.

The process of examining malware is called analysis. The more analysis tools and techniques there are, the harder attackers try to use concealment techniques and to generate dynamic code that is hidden from the user's perspective. Analysts use two types of analysis to detect malware: static analysis and dynamic analysis.

2.8. Static Analysis

Analyzing software without executing it is called static analysis. Without running the program, it investigates the code, can detect malicious code and can place it in one of the available groups based on different learning methods [7].

Since such methods deal with the real code, they can be used where polymorphic malware is present. One problem of static analysis is that the source code of the program is usually not available. This restricts the use of static analysis techniques and forces the analysis of binary code, which in turn is very complicated.

In the static method, binary code is checked and viruses are detected on the basis of it. In fact, this is the key part of the static method. It is worth mentioning that extracting binary code is relatively complex work [5].

3. Dynamic Analysis

To overcome these shortcomings, several dynamic detection methods have been proposed. Unlike the static method, which relies on the malware's binary code, the dynamic method is completely different: it does not use the code but relies on runtime behavior [3].

Although promising, this method is unfortunately too slow to serve as a real-time detector on the end host and often needs virtual machine technology [1]. Analyzing a program while it is running is called dynamic analysis, also referred to as behavior analysis; it includes running the software and watching its behavior, system interaction and its
effects on the host system [6]. Dynamic analysis needs to run infected files in a virtual environment, such as a virtual machine, a simulator or a sandbox, in order to analyze them there [2].

Different techniques have been applied to analyze programs dynamically. So far, the most common methods and techniques include [8]:
a) Checking the called functions.
b) Following the flow of information.
c) Following the order in which functions run.

4. Malware Detection Techniques

There are different methods for detecting malware, but since malware has become more complicated through the use of concealment techniques, we need more advanced methods to detect it.

Generally, common malware detection techniques are divided into two categories:
a) Detection methods based on signature
b) Detection methods based on behavior

4.1. Signature-Based Detection

The main goal of this method is to extract a unique byte sequence from the code as the signature. Searching for a signature in suspicious files is part of the task [8].

Most of today's commercial anti-malware products use a set of signatures to detect malicious programs: suspicious code is compared with a unique sequence of program structures or bytes [7].

If no matching signature is available in the dataset, the file is considered benign rather than malicious [9].

The main problem of such approaches is that anti-malware experts must wait until new malware has harmed several computers in order to define a signature for it [8].

The use of polymorphic models with encryption has neutralized the signature-based method, making such polymorphic malware undetectable by it.

In order to overcome these problems, the behavior-based method is used.

4.2. Behavior-Based Detection

Behavioral parameters include many factors, such as the source or destination of the malware, kinds of attachments and other statistical properties [8]. Dynamic behaviors are used directly in evaluating the damage to the system and also help us to detect and classify new malware. Malware clustering based on dynamic analysis relies on running the malware in a real, controlled environment [7].

4.3. Comparison between Detection Methods

Given the polymorphism and transformation techniques currently used by malware designers, signature-based methods are inherently prone to errors [9].

Signature-based methods are unable to detect more complex malware and can hardly detect malware that uses polymorphism and transformation methods. In addition, one of the limitations of signature-based detection methods is that they require human knowledge to update the signature database with new signatures [8].

Furthermore, a number of research studies have shown that some writers of polymorphic software can easily defeat signature-based methods by obfuscation [9].

Given these problems, it is better to use analysis at runtime. However, behavior-based methods also have a major problem: they are too slow to serve as real-time detectors on the end host, and they often need virtual machine technology.

5. Methods Used for Escaping from Anti-Malware

Since signature-based antivirus systems try to find viral code by searching for a character sequence in the executable file, virus programmers apply various techniques to hide malware and such sequences, some of which are described below.

5.1. Cryptography

Encrypting the virus code with different encryption keys produces different ciphertexts. As a result, it can be ensured that signature-based scanners cannot detect the virus. To run the virus, these texts must first be decoded. Detailed analysis of the decoding algorithm is only possible if we know these keys [10].

5.2. Polymorphic Generator

Malware uses a polymorphic generator to change its code while the original algorithm remains intact. However, in the end, all samples generated from one piece of malware do the same work.

This is performed by inserting many commands that have no impact on execution or its effects. For example, each copy of the virus may contain a neutral group of commands, such as incrementing and then decrementing the same operand, a left shift followed by a right shift, or pushing a value and popping it again.

All these methods effectively hide virus code from signature-based antivirus programs [10].

5.3. Obfuscation

Malware uses different evasion approaches to hide malicious code from external anti-malware scanners, such as code obfuscation, decrypting encryption and so on.

In code obfuscation, the main goal is to hide the underlying logic of the program so as to prevent others from gaining any knowledge of the code [8].

The malicious code remains incomprehensible and keeps all its harmful functionality whenever activated. When we apply
some obfuscation transformations to code, the result is a kind of self-decrypting encryption.

Packing, by contrast, refers to encrypting or compressing the executable file. In packing, the original code remains hidden until runtime or until the executable code is unpacked, which makes the code immune to static analysis [7].

Packed malware code can be treated as a subset of obfuscated code that is compressed and cannot be analyzed; consequently, an unpacking phase is necessary to reveal the overall semantics of the code [9].

6. Problem Definition

One of the most important and most serious problems facing the internet world is the existence of malware.

According to studies conducted in this field, we have concluded that 80 percent of the damage to systems has come from malware and only 20 percent from other factors [9].

However, unfortunately, most work has focused on that 20%, while malware has received less attention, and thus we face many security problems every day [5].

In the early days of viruses, there were only static and simple viruses [3]. Therefore, simple signature-based methods were able to overcome them. But these methods were only useful as long as there were not many variations in the types of malware and malware writers did not use obfuscation techniques to make them more sophisticated [5].

However, rapid developments in malware activity convinced researchers to explore new methods, and after some time researchers were forced to use data mining methods to detect malware. By employing data mining, they could add a large amount of malware to anti-malware tools without having to investigate every sample, because checking all of them requires enormous time and cost [2].

One such work was a method called n-grams. At that time, Geraldn et al. [3] developed the n-grams analysis method to detect boot sector viruses using neural networks. The n-grams detection method was based on occurrence frequencies in benign and malicious programs [3].

After that, Hofmeyr [10] used simple sequences of system calls as a guide for evaluating malicious code. These API call sequences revealed the hidden dependencies between code sequences.

Thereafter, Shultz et al. [7] tried to use the names of DLLs as a useful feature in file categorization. In the more recent work by Ye [7], a system (IMDS) was built that uses system call patterns, to which a data mining process is then applied. The study includes 12214 healthy files and 17366 malicious files, of which only 200 files were used to test the system [7].

Although the accuracy and learning rate of this method are relatively good, it has a fundamental problem: the test data are unbalanced, whereas the learning data are balanced.

Our study uses a very large data set covering various types of benign and malicious software. In total, the extracted calls yield about 5000 different features from 420 different files and 890 libraries, and the set includes different types of malware such as Trojans, backdoors, worms, exploits, flooders, sniffers, spoofers and viruses.

7. Research Methodology

This research was performed in the following basic steps:
a) data collection
b) data processing
c) analysis of results
In the following, we discuss each of these steps.

7.1. Data Collection

To collect data related to malware, we examined the Anubis database [11].

Each sample in this set provides its executable code. These codes are used to train the proposed model. For evaluation and testing, a set of 3131 collected malware samples was tested, of which more than 90% are rootkits.

We selected this malware set because the goal of this study is the detection rate of malware, especially rootkits.

7.2. Data Processing and Preparation

In this section, we deal with data processing using three reverse engineering tools, namely HDasm [12], IDA Pro [13] and W32dsm89 [14], as well as the PEiD anti-packing tool [15].

First, we apply the PEiD tool to the malware executable file, but only if the file has been packed by a packing tool. Otherwise, there is no need to apply this tool to it.

The unpacking step removes any packing that has been applied to the file, because otherwise the file cannot be processed by the reverse engineering tools and we cannot see the system functions called in it.

Afterwards, we give the file as input to the three above-mentioned disassemblers, which produce the assembly code of these files and return the list of called system functions from it. We then save the list as an XML file. Later, we apply our algorithm to this stored file to detect whether it is malware or not, and finally we obtain our success rate in detecting malware using the Weka data mining tool.

7.3. Analysis of Results

Malware of the same category usually shares the same general patterns; for example, a number of system function names are common to all members of the family. We aim to analyze and detect malware by examining these shared patterns using machine learning techniques.

In fact, we want to use the so-called API calls in malware to overcome the limitations of traditional signature-based
methods, to cope with the techniques used by malware writers and to increase the malware detection rate.

This method, which is based on the system functions called in the malware's executable code, uses a reverse engineering tool for static analysis and a monitoring tool for dynamic analysis. That is, we obtain the assembly code by disassembling the files and extract the system functions called in it, and we obtain the API call list of the malware executable file by monitoring the file with the monitoring tool.

Finally, we detect malware using the sequence of calls that is shared among malware samples and that can serve as a signature to detect and identify them.

The advantages of this method include its high success rate in malware detection, because it works directly with the malware's binary code. In addition, there is no need to run the samples: we can determine whether a file is malware using only its code and the shared sequence of called system functions.

Furthermore, we apply the prepared algorithm to the log file of each file to obtain our database.

After that, we transfer the information in this database to a data mining tool (here, Weka) to obtain the success rate of the detection task. Figure 2 shows a graph of the results of the data mining operation with Weka on the database. As Figure 2 shows, the success rate of this method in rootkit detection is over 97%, which is a remarkable rate.

Figure 2. Success rate of our method in rootkit detection.

8. Discussion and Conclusion

Malware is becoming more widespread and more complex every day. Examples of this complexity are its use of polymorphism, transformation and encryption techniques. Traditional methods, such as matching code strings against malware signatures, are not efficient enough.

However, dynamic methods also have problems, the most important of which is their slowness.

This is why we need a more intelligent detection method.

This type of detection, which is based on the static method, relies on the system functions called in each malware executable, and its goal is to detect versions of malware that have not been seen before or that are new versions of old malware families.

References

[1] Ravi, C. & Manoharan, R. Malware Detection using Windows Api Sequence and Machine Learning. International Journal of Computer Application, Vol. 43, No. 17, 2012.
[2] Ravi, C. & Chetia, G. Malware Threats and Mitigation Strategies: A Survey. Journal of Theoretical and Applied Information Technology, Vol. 29, No. 2, pp. 69-73, 2011.
[3] Egele, M. S. A Survey on Automated Dynamic Malware-Analysis. ACM Computing Surveys, Vol. 44, No. 2, 2012.
[4] Herath, H. M. P. S. & Wijayanayake, W. M. J. I. Computer Misuse in the Workplace. Journal of Business Continuity & Emergency Planning, Vol. 3, No. 3, pp. 259-270, 2009.
[5] Mathur, K. and Saroj, H. A Survey on Techniques in Detection and Analyzing Malware Executables. International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 44, No. 2, 2012.
[6] Doherty, N. F., Anastasakis, L. & Fulford, H. The Information Security Policy Unpacked: A Critical Study of the Content of University Policies. International Journal of Information Management, Vol. 29, No. 6, pp. 449-457, 2009.
[7] G. Tahan, L.R.Y. Automatic Malware Detection Using Common Segment Analysis and Meta-Features. Journal of Machine Learning Research, Vol. 13, pp. 949-979, 2012.
[8] I. Gurrutxaga. Evaluation of Malware Clustering Based on its Dynamic Behaviour. Seventh Australasian Data Mining Conference, Australia, pp. 163-170, 2008.
[9] Rieck, K., Willems, T., Düssel, P. and Laskov, P. Learning and Classification of Malware Behavior. 5th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Berlin, Heidelberg: Springer-Verlag, pp. 108-125, 2008.
[10] Patel, S. C., Graham, J. H. & Ralston, P. A. Qualitatively Assessing the Vulnerability of Critical Information Systems: A New Method for Evaluating Security Enhancements. International Journal of Information Management, Vol. 28, pp. 483-491, 2008.
[11] http://www.anubis.org
[12] http://hdasm.software.informer.com
[13] www.hex-rays.com
[14] processchecker.com/file/W32dsm89.exe.html
[15] https://boveda.banamex.com.mx/englishdir/ayudas/masinfoahnlab.htm