Behavioral Malware Classification Using Convolutional Recurrent Neural Networks
Type       | Size   | Malware Family Samples
Adware     | 0.79%  | MultiPlug, SoftPulse, DomaIQ
Backdoor   | 2.25%  | Advml, Fynloski, Cycbot, Hlux
Trojan     | 89.18% | AntiFW, Buzus, Invader, Kovter
Virus      | 1.44%  | Lamer, Parite, Nimnul, Virut
Worm       | 4.28%  | AutoIt, Socks, VBNA, Generic
Ransomware | 2.07%  | Xorist, Zerber, Blocker, Bitman

Table 1: Malware types, their share of the dataset, and examples of malware families.
learning problems [33].

The second layer is a convolutional layer. This layer applies a one-dimensional (1D) sequential filter of a particular size and slides the filter over the entire list to extract adjacent file names. This helps the model learn the local relations between embedding vectors. 1D convolutional layers have been used successfully in sequence classification and text classification problems [23].

The third layer is Max Pooling. This layer reduces the size of the data produced by the previous layer. It is designed to improve the computational performance and the accuracy of our model and its training process. We use the maximum function to select the most important representation of the data.

The fourth layer is a Bidirectional LSTM. Bidirectional LSTM (BiLSTM) is a recurrent neural network architecture [16]. Recurrent neural networks learn long-term dependencies between the embedding vectors; in our context, they model the relationships between the resource file names loaded in each Prefetch file. The bidirectional structure consists of a forward and a reversed LSTM, a structure that has been successful in NLP and sequence classification problems [45, 17].

The fifth layer is Global Max Pooling. This layer propagates only the relevant information from the sequence of outputs of the BiLSTM and reduces the size of the BiLSTM layer's output.

The sixth, and final, layer is Softmax. This layer outputs the probability that a malware sample belongs to a specific malware family.

To improve the generalization of our model, we apply several regularization techniques. First, we apply dropout between the model's layers. Dropout is a technique commonly used in training large neural networks to reduce overfitting [39], and it has been shown to improve both the training and the classification performance of large neural networks. The goal is to learn hidden patterns without merely memorizing the samples in the training data; this improves the robustness of the model on unseen (i.e., zero-day) malware samples.

Figure 1: Example of a Prefetch file for the CMD.EXE program

Figure 2: 1D-Conv-BiLSTM model architecture
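The embedding layer consumes integer indices rather than raw file names. The following is a minimal preprocessing sketch, not the authors' released code: it assumes a precomputed vocab dictionary mapping resource file names to indices, index 1 reserved for out-of-vocabulary names, and an illustrative maximum sequence length.

    # Map each Prefetch sample's list of resource file names to a
    # fixed-length integer sequence for the embedding layer.
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    MAX_LEN = 200  # assumed maximum number of file names per sample

    def encode_samples(samples, vocab):
        # samples: list of lists of file names, e.g. [["NTDLL.DLL", ...], ...]
        seqs = [[vocab.get(name, 1) for name in names] for names in samples]
        return pad_sequences(seqs, maxlen=MAX_LEN, padding="post")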
5 Experimental Setup

This section describes how the dataset and the ground truth labeling used in our experiments were created.

5.1 Dataset Collection

We successfully executed around 100,000 malware samples obtained from the public malware repository VirusShare (http://www.virusshare.com). Malware samples were deployed on a freshly installed Windows 7 system running in a virtual machine; after each Prefetch file is collected, the virtual machine is reset to a clean (non-infected) state. In order for Windows to generate a Prefetch file for a malware sample, the sample needs to be executed; once the sample is loaded, Windows generates the Prefetch file automatically. This simplifies the task of extracting the Prefetch files for malicious programs. Our experiments only include malware samples that produced Prefetch files and were identified by major anti-virus engines, such as Kaspersky, EsetNod32, Microsoft, and McAfee.
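The collection loop can be automated with any hypervisor that supports snapshots. The sketch below is an illustration under stated assumptions, not the authors' harness: VirtualBox's VBoxManage CLI, a VM named win7 with a clean snapshot named clean, and hypothetical helpers for deploying a sample and copying C:\Windows\Prefetch\*.pf out of the guest.

    import subprocess
    import time

    def vbox(*args):
        # Thin wrapper around the VBoxManage command-line tool.
        subprocess.run(["VBoxManage", *args], check=True)

    for sample in malware_samples:                    # hypothetical iterable
        vbox("snapshot", "win7", "restore", "clean")  # roll back to clean state
        vbox("startvm", "win7", "--type", "headless")
        deploy_and_execute(sample)                    # hypothetical helper
        time.sleep(120)                               # let the sample run
        copy_prefetch_files(sample)                   # hypothetical helper
        vbox("controlvm", "win7", "poweroff")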
5.2 Ground Truth Labeling

Ground truth labels for malware were obtained through VirusTotal (http://www.virustotal.com), an online third-party virus scanning service. Given the MD5, SHA1, or SHA256 hash of a malware file, VirusTotal provides the detection results of popular anti-virus engines. This information also includes metadata such as target platforms, malware types, and malware families for each anti-virus scan engine. Table 1 illustrates malware types, sample sizes, and examples of malware families according to EsetNod32, Kaspersky, Microsoft, and McAfee.
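As an illustration of this lookup (not the authors' tooling), the sketch below queries the file-report endpoint of VirusTotal's public v2 API, which was current at the time; the API key and the result parsing are assumptions.

    import requests

    API_KEY = "..."  # assumption: a VirusTotal API key

    def detection_labels(file_hash):
        # file_hash: an MD5, SHA1, or SHA256 of the malware sample.
        resp = requests.get("https://www.virustotal.com/vtapi/v2/file/report",
                            params={"apikey": API_KEY, "resource": file_hash})
        report = resp.json()
        # Per-engine detection strings, e.g. report["scans"]["Kaspersky"]["result"]
        return {engine: scan.get("result")
                for engine, scan in report.get("scans", {}).items()}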
6 Evaluation

This section describes the experimental evaluation of our model against a model from previous work.

6.1 Performance Measurements

The classification accuracy of our model is measured by the F1 score. The F1 score captures the trade-off between Recall and Precision and combines them into a single metric that ranges from 0.0 to 1.0. Recall is the fraction of all relevant examples that are retrieved; Precision is the fraction of retrieved examples that are relevant. The F1 score formula is:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

A classifier is superior when its F1 score is higher. We choose the F1 score because it is less prone to unbalanced classes in the training data [11]. Malware training datasets often contain unbalanced samples for different malware families; the ratio between malware family sizes can reach 1:100. Table 1 shows each malware type, its size, and a few examples of malware families.
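In the multi-family setting, a per-family F1 can be averaged across families. A one-line sketch with scikit-learn; the macro averaging shown here is our own choice for illustration.

    from sklearn.metrics import f1_score

    # y_true, y_pred: malware family labels for the test samples.
    # average="macro" weights every family equally, common and rare alike.
    score = f1_score(y_true, y_pred, average="macro")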
6.2 Classification Performance with Common Malware Families

We evaluate our malware classification model against the model of previous work on behavioral malware classification [9]. The previous work examined multiple types of feature extraction, feature selection, and classification models based on large datasets extracted from sequences of OS system calls. The top-performing models were Logistic Regression (LR) and Random Forests (RF). LR and RF were used with n-gram feature extraction and Term Frequency-Inverse Document Frequency (TF-IDF) feature transformation [10]. RF also used Singular Value Decomposition (SVD) for feature dimensionality reduction [22].

We implemented our new model using the Keras and TensorFlow [12, 1] deep learning frameworks. We configured our model with the following parameters (a sketch of this configuration appears after the list):

• Embedding layer: 100 hidden units
• 1D Convolutional layer: 250 filters, kernel size of five, stride of one, and ReLU activation function
• 1D Max Pooling: pool size of four
• Bidirectional LSTM: 250 hidden units
• L2 regularization: 0.02
• Dropout regularization: 0.5
• Recurrent Dropout regularization: 0.2
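In the Keras sketch below, the vocabulary size, sequence length, number of output families, and the exact placement of the L2 penalty and dropout are our own assumptions; it illustrates the layer stack described above, not the authors' exact code.

    from tensorflow.keras import Sequential, regularizers
    from tensorflow.keras.layers import (Embedding, Dropout, Conv1D,
                                         MaxPooling1D, Bidirectional, LSTM,
                                         GlobalMaxPooling1D, Dense)

    VOCAB_SIZE = 20000    # assumed number of distinct resource file names
    MAX_LEN = 200         # assumed sequence length (see the encoding sketch)
    NUM_FAMILIES = 53     # e.g., the EsetNod32 labeling

    model = Sequential([
        Embedding(VOCAB_SIZE, 100, input_length=MAX_LEN),
        Dropout(0.5),                                      # dropout between layers
        Conv1D(250, kernel_size=5, strides=1, activation="relu",
               kernel_regularizer=regularizers.l2(0.02)),  # assumed L2 placement
        MaxPooling1D(pool_size=4),
        Bidirectional(LSTM(250, return_sequences=True,
                           dropout=0.5, recurrent_dropout=0.2)),
        GlobalMaxPooling1D(),                              # keep strongest activations
        Dense(NUM_FAMILIES, activation="softmax"),         # per-family probabilities
    ])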
Model          | Kaspersky (50) | EsetNod32 (53) | Microsoft (38) | McAfee (55) | F1 mean
1D-Conv-BiLSTM | 0.734          | 0.854          | 0.754          | 0.765       | 0.777
LR 2-grams     | 0.711          | 0.821          | 0.734          | 0.756       | 0.756
LR 3-grams     | 0.718          | 0.822          | 0.726          | 0.756       | 0.756
RF 2-grams     | 0.702          | 0.792          | 0.731          | 0.755       | 0.745
RF 3-grams     | 0.671          | 0.699          | 0.720          | 0.724       | 0.704

Table 2: F1 scores for the 1D-Conv-BiLSTM, LR (2,3)-grams, and RF (2,3)-grams models using the Kaspersky, EsetNod32, Microsoft, and McAfee labelings; the number of malware families under each labeling is shown in parentheses.
We implemented the LR and RF models from previous work using Scikit-learn [32]. We applied a grid search to select the best hyperparameters for the LR and RF models.
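Below is a minimal scikit-learn sketch of one such baseline under our own assumptions: each sample's file-name sequence is joined into a whitespace-separated string, the hyperparameter grid is illustrative, and the SVD step applies to the RF pipeline only, per the text.

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # RF baseline: word 2-grams -> TF-IDF -> SVD -> Random Forest.
    rf_pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(analyzer="word", ngram_range=(2, 2))),
        ("svd", TruncatedSVD(n_components=100)),    # assumed dimensionality
        ("clf", RandomForestClassifier()),
    ])
    param_grid = {"clf__n_estimators": [100, 300]}  # illustrative grid
    search = GridSearchCV(rf_pipeline, param_grid, scoring="f1_macro", cv=3)
    # search.fit(train_docs, y_train)  # train_docs: " ".join(names) per sample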
We train our model using Stochastic Gradient Descent (SGD) with a batch size of 32 samples and 300 epochs [47]. SGD is an iterative optimization algorithm commonly used to train large neural networks. SGD can operate on large training sets using one sample, or a small batch of samples, at a time; it is therefore efficient for large training sets and for online training [5].
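Continuing the earlier Keras sketch, the corresponding compile-and-fit step might look as follows; the loss function and validation split are our assumptions.

    from tensorflow.keras.optimizers import SGD

    model.compile(optimizer=SGD(), loss="categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X_train, y_train,           # encoded sequences, one-hot labels
              batch_size=32, epochs=300,  # values from the text
              validation_split=0.1)       # assumed held-out fraction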
We use 10-fold cross-validation with stratified sampling to create a balanced distribution of malware samples across malware families in each fold. We train the models on 9 splits of our dataset and test on the remaining split. We repeat this experiment 10 times and take the average metric score as the final output. We include any malware family that has a minimum of 50 malware samples.
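A sketch of this protocol with scikit-learn's stratified splitter; build_model() is a hypothetical factory returning an object with sklearn-style fit/predict, and X and y are assumed to be NumPy arrays produced by the earlier encoding snippets.

    import numpy as np
    from sklearn.metrics import f1_score
    from sklearn.model_selection import StratifiedKFold

    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in skf.split(X, y):   # X: sequences, y: families
        clf = build_model()                       # hypothetical factory
        clf.fit(X[train_idx], y[train_idx])
        y_pred = clf.predict(X[test_idx])
        scores.append(f1_score(y[test_idx], y_pred, average="macro"))
    print(np.mean(scores))                        # average over the 10 folds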
Table 2 shows the F1 score results of our experiment using four major anti-virus scan engines: Kaspersky, EsetNod32, Microsoft, and McAfee. The results show that our model outperforms all other models under every anti-virus engine labeling. The second best are the LR models, which outperform the RF models on all anti-virus scan engines and reproduce the results described in [9]. It is noteworthy that 3-gram feature extraction usually provides better results than 2-gram features in the LR models, whereas the 2-gram features outperform the 3-gram features in the RF models.

As shown, the performance of behavioral classification models depends on the anti-virus engine labeling scheme used during training. LR 3-grams show better performance under the Kaspersky and EsetNod32 labelings, but worse performance under the Microsoft labeling scheme. Moreover, RF 2-grams underperform all LR models except under the Microsoft naming scheme. The inconsistency of these results leads researchers to use the anti-virus engine that produces the highest classification score. Our model, however, shows consistent performance across all major anti-virus engines and outperforms previous work on each of them.

6.3 Classification Performance with Rare Malware Families

Rare malware families with small sample sizes represent a significant percentage of all malware families. This makes it difficult for models to extract useful behavioral patterns, since there are insufficient samples during training. In this experiment, we include any malware family that has at least 10 malware samples. This presents a challenge for classification models because the number of malware families increases greatly while, at the same time, the number of malware samples per family decreases. We aim to show the robustness of our classification model when applied to rare malware families.

Table 3 shows the classification performance of our model against the LR and RF models using the four anti-virus labeling schemes. The table shows that our model consistently outperforms all other models despite the increased number of malware families with small sample sizes. For example, under the EsetNod32 labeling scheme, our model's performance decreases by only 1.0% when the number of families increases from 53 to 180, while the other models exhibit larger degradations in classification performance. Our model shows the smallest decrease in classification performance under every anti-virus labeling scheme.

Figure 3 shows the average F1 scores of malware families for LR 3-grams, RF 2-grams, and 1D-Conv-BiLSTM using the EsetNod32 ground truth labels. We study the performance of the behavioral classification models on individual malware families to demonstrate their strength on common and rare malware families. As shown, the LR model struggles with rare malware families; however, it outperforms the RF model as the number of malware samples in a family increases. Conversely, the RF model performs reasonably on rare malware families, but it underperforms the LR models on common malware families. Ultimately, our 1D-Conv-BiLSTM model outperforms both the LR and RF models on almost all common and rare malware families.
Model          | Kaspersky (192) | EsetNod32 (180) | Microsoft (137) | McAfee (209)   | F1 mean
1D-Conv-BiLSTM | 0.647 (-0.088)  | 0.844 (-0.010)  | 0.727 (-0.027)  | 0.720 (-0.045) | 0.735 (-4.25%)
LR 2-grams     | 0.586 (-0.124)  | 0.790 (-0.032)  | 0.656 (-0.078)  | 0.652 (-0.104) | 0.671 (-8.45%)
LR 3-grams     | 0.594 (-0.124)  | 0.790 (-0.032)  | 0.651 (-0.075)  | 0.656 (-0.100) | 0.673 (-8.28%)
RF 2-grams     | 0.588 (-0.114)  | 0.760 (-0.031)  | 0.664 (-0.067)  | 0.658 (-0.097) | 0.668 (-7.73%)
RF 3-grams     | 0.527 (-0.144)  | 0.650 (-0.049)  | 0.627 (-0.093)  | 0.587 (-0.137) | 0.598 (-10.58%)

Table 3: F1 scores for the 1D-Conv-BiLSTM, LR (2,3)-grams, and RF (2,3)-grams models using the Kaspersky, EsetNod32, Microsoft, and McAfee labelings after adding rare malware families; the number of malware families under each labeling is shown in parentheses in the header. The parenthesized value in each cell is the change in F1 score from the previous section; in the F1 mean column, the change is given as a percentage.
models during training. The experiment shows that the incrementally re-trained model achieves a higher F1 score at early stages of training than the newly trained model. Therefore, the training process can be shortened to reduce the overhead of training on new malware samples. Moreover, incremental re-training of our model is efficient and is recommended over fully re-training the model.
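A sketch of the incremental re-training idea under stated assumptions: a previously saved model file, new labeled samples encoded as before, and an illustrative reduced epoch budget.

    from tensorflow.keras.models import load_model

    # Warm-start from the previously trained model instead of a fresh one.
    model = load_model("1d_conv_bilstm.h5")  # assumed saved-model path
    model.fit(X_new, y_new,                  # newly collected samples
              batch_size=32, epochs=50)      # fewer epochs than from scratch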
7 Conclusion
References

[7] www.symantec.com/content/dam/symantec/docs/security-center/white-papers/increased-use-of-powershell-in-attacks-16-en.pdf, 2016. [Online; accessed 10-Jan-2017].
[8] J. Canto, M. Dacier, E. Kirda, and C. Leita. Large scale malware collection: lessons learned. In IEEE SRDS Workshop on Sharing Field Data and Experiment Measurements on Resilience of Distributed Computing Systems. Citeseer, 2008.
[9] R. Canzanese, S. Mancoridis, and M. Kam. Run-time classification of malicious processes using system call analysis. In Malicious and Unwanted Software (MALWARE), 2015 10th International Conference on, pages 21–28. IEEE, 2015.
[10] W. Cavnar. Using an n-gram-based document representation with a vector processing retrieval model. NIST Special Publication SP, pages 269–269, 1995.
[11] N. V. Chawla. Data mining for imbalanced datasets: An overview. In Data Mining and Knowledge Discovery Handbook, pages 853–867. Springer, 2005.
[12] F. Chollet et al. Keras (2015), 2017.
[13] A. Dove. Fileless malware – a behavioural analysis of Kovter persistence. 2016.
[14] M. Egele, T. Scholte, E. Kirda, and C. Kruegel. A survey on automated dynamic malware-analysis techniques and tools. ACM Computing Surveys (CSUR), 44(2):6, 2012.
[15] E. Filiol. Malware pattern scanning schemes secure against black-box analysis. Journal in Computer Virology, 2(1):35–50, 2006.
[16] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning, volume 1. MIT Press, Cambridge, 2016.
[17] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610, 2005.
[18] K. Heller, K. Svore, A. D. Keromytis, and S. Stolfo. One class support vector machines for detecting anomalous Windows registry accesses. In Workshop on Data Mining for Computer Security (DMSEC), Melbourne, FL, November 19, 2003, pages 2–9, 2003.
[19] B. S. R. R. U. Inocencio. Doing more with less: A study of fileless infection attacks. https://www.virusbulletin.com/uploads/pdf/conference_slides/2015/RiveraInocencio-VB2015.pdf, September 30, 2015. [Online; accessed 19-Jan-2017].
[20] G. Jacob, H. Debar, and E. Filiol. Behavioral detection of malware: from a survey towards an established taxonomy. Journal in Computer Virology, 4(3):251–266, 2008.
[21] A. Kantchelian, M. C. Tschantz, S. Afroz, B. Miller, V. Shankar, R. Bachwani, A. D. Joseph, and J. D. Tygar. Better malware ground truth: Techniques for weighting anti-virus vendor labels. In Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security, pages 45–56. ACM, 2015.
[22] H. Kim, P. Howland, and H. Park. Dimension reduction in text classification with support vector machines. In Journal of Machine Learning Research, pages 37–53, 2005.
[23] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[24] J. Z. Kolter and M. A. Maloof. Learning to detect and classify malicious executables in the wild. Journal of Machine Learning Research, 7(Dec):2721–2744, 2006.
[25] C. Kruegel, E. Kirda, D. Mutz, W. Robertson, and G. Vigna. Polymorphic worm detection using structural information of executables. In International Workshop on Recent Advances in Intrusion Detection, pages 207–226. Springer, 2005.
[26] W. Lee, S. J. Stolfo, and K. W. Mok. A data mining framework for building intrusion detection models. In Security and Privacy, 1999. Proceedings of the 1999 IEEE Symposium on, pages 120–132. IEEE, 1999.
[27] P. Li, L. Liu, D. Gao, and M. K. Reiter. On challenges in evaluating malware clustering. In International Workshop on Recent Advances in Intrusion Detection, pages 238–255. Springer, 2010.
[28] C. H. Malin, E. Casey, and J. M. Aquilina. Malware Forensics Field Guide for Windows Systems: Digital Forensics Field Guides. Elsevier, 2011.
[29] J. A. Marpaung, M. Sain, and H.-J. Lee. Survey on malware evasion techniques: State of the art and challenges. In Advanced Communication Technology (ICACT), 2012 14th International Conference on, pages 744–749. IEEE, 2012.
[30] D. Molina, M. Zimmerman, G. Roberts, M. Eaddie, and G. Peterson. Timely rootkit detection during live response. In IFIP International Conference on Digital Forensics, pages 139–148. Springer, 2008.
[31] R. Moskovitch, C. Feher, N. Tzachar, E. Berger, M. Gitelman, S. Dolev, and Y. Elovici. Unknown malcode detection using opcode representation. In Intelligence and Security Informatics, pages 204–215. Springer, 2008.
[32] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.
[33] J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[34] R. Perdisci et al. VAMO: towards a fully automated malware clustering validity analysis. In Proceedings of the 28th Annual Computer Security Applications Conference, pages 329–338. ACM, 2012.
[35] C. Raiu. A virus by any other name: Virus naming practices. Security Focus, 2002.
[36] K. Rieck, P. Trinius, C. Willems, and T. Holz. Automatic analysis of malware behavior using machine learning. Journal of Computer Security, 19(4):639–668, 2011.
[37] A.-D. Schmidt, R. Bye, H.-G. Schmidt, J. Clausen, O. Kiraz, K. A. Yuksel, S. A. Camtepe, and S. Albayrak. Static analysis of executables for collaborative malware detection on Android. In Communications, 2009. ICC'09. IEEE International Conference on, pages 1–5. IEEE, 2009.
[38] M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo. Data mining methods for detection of new malicious executables. In Security and Privacy, 2001. S&P 2001. Proceedings. 2001 IEEE Symposium on, pages 38–49. IEEE, 2001.
[39] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[40] S. J. Stolfo, F. Apap, E. Eskin, K. Heller, S. Hershkop, A. Honig, and K. Svore. A comparative evaluation of two algorithms for Windows registry anomaly detection. Journal of Computer Security, 13(4):659–693, 2005.
[41] P. Szor. The Art of Computer Virus Research and Defense. Pearson Education, 2005.
[42] M. Venable, A. Walenstein, M. Hayes, C. Thompson, and A. Lakhotia. VILO: a shield in the malware variation battle. Virus Bulletin, pages 5–10, 2007.
[43] K. Wang and S. Stolfo. One-class training for masquerade detection. 2003.
[44] M. Woźniak, M. Graña, and E. Corchado. A survey of multiple classifier systems as hybrid systems. Information Fusion, 16:3–17, 2014.
[45] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[46] N. Ye and Q. Chen. An anomaly detection technique based on a chi-square statistic for detecting intrusions into information systems. Quality and Reliability Engineering International, 17(2):105–112, 2001.
[47] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, page 116. ACM, 2004.