Malware classification project using static analysis: application binaries are read as raw byte sequences, the bytes are rendered as grayscale images, and a deep learning image classifier assigns each sample to one of 31 malware subclasses without executing the file.

tharunsridhar/Static-Malware-Classification

Multi-Class Malware Family Classification using EfficientNetV2-S

Deep learning–based static malware analysis system that classifies executable binaries into 31 malware families using image-based representations and transfer learning.


Key Results

  • Total Classes: 31 Malware Families
  • Training Samples: 8,388
  • Validation Samples: 1,480
  • Test Samples: 3,879
  • Input Size: 384 × 384
  • Batch Size: 16
  • Model: EfficientNetV2-S (Pretrained on ImageNet)
  • Optimizer: Adam
  • Loss: Categorical Cross-Entropy (with Label Smoothing)
  • Test Accuracy: 95%
  • Macro F1-Score: 0.96
  • Weighted F1-Score: 0.95

Class-wise precision and recall are strong across most malware families.
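
The two F1 figures above average the per-class F1 scores differently, which is why they need not match. The sketch below shows the arithmetic on a made-up three-class example; the function names and the toy numbers are illustrative assumptions, not values from this project.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_and_weighted_f1(per_class):
    """per_class: list of (precision, recall, support) tuples, one per class."""
    scores = [f1(p, r) for p, r, _ in per_class]
    supports = [s for _, _, s in per_class]
    macro = sum(scores) / len(scores)  # every class counts equally
    weighted = sum(f * s for f, s in zip(scores, supports)) / sum(supports)
    return macro, weighted


# Hypothetical data: two small, well-classified classes and one large, harder class.
toy = [(1.0, 1.0, 10), (1.0, 1.0, 10), (0.8, 0.8, 80)]
macro, weighted = macro_and_weighted_f1(toy)
print(f"macro={macro:.3f} weighted={weighted:.3f}")
```

In this toy case the macro score is higher than the weighted one because the small classes score best, which is one way a macro F1 can sit slightly above a weighted F1.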


Problem Statement

Traditional signature-based antivirus systems struggle against obfuscated and zero-day malware.

Idea

This project implements a static, image-based malware classification pipeline that:

  • Detects malicious binaries
  • Classifies them into 31 malware families
  • Avoids execution of untrusted files
  • Uses transfer learning instead of handcrafted features

Methodology

1. Malware Binary to Image Conversion

Executable binaries are converted into grayscale/RGB images by mapping raw byte values (0–255) to pixel intensities.
This preserves structural patterns such as entropy regions, packed sections, and instruction repetition.
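
As a rough illustration of the byte-to-pixel mapping, the sketch below lays a byte string out row by row as a 2D grid of intensities. It is not the project's actual converter: the function name, the fixed row width, and the zero padding of the last row are all assumptions, and resizing to the network input size or writing an image file would be a separate step.

```python
import math


def bytes_to_grayscale(data, width=256):
    """Return a list of rows; each byte (0-255) becomes one pixel intensity.

    `width` is an assumed row length; the README does not state one.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    # Zero-pad so the final row is complete (an assumption of this sketch).
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


# Tiny stand-in for a binary file's contents.
sample = bytes(range(10))
image = bytes_to_grayscale(sample, width=4)
for row in image:
    print(row)
```

For a real file, `data` could come from `open(path, "rb").read()`, which reads the file without executing it.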

2. Image Preprocessing

  • Resized to 384 × 384
  • RGB format (3 channels)
  • Normalized using EfficientNetV2 preprocess_input
  • Data augmentation applied:
    • Random flip
    • Random rotation
    • Random zoom
    • Random contrast

3. Model Architecture

  • EfficientNetV2-S (include_top=False)
  • Global Average Pooling
  • Batch Normalization
  • Dense Layer (Swish activation)
  • Dropout
  • Final Dense (Softmax – 31 classes)
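
The layer stack above could be assembled in Keras roughly as follows. This is a sketch under assumptions rather than the repository's code: the README gives the layer order, class count, and input size, but the Dense width (`HEAD_UNITS`) and dropout rate (`DROPOUT_RATE`) used here are placeholder values, and `build_classifier` is a hypothetical name.

```python
NUM_CLASSES = 31             # from the README
INPUT_SHAPE = (384, 384, 3)  # from the README
HEAD_UNITS = 256             # assumption: not stated in the README
DROPOUT_RATE = 0.3           # assumption: not stated in the README


def build_classifier(weights="imagenet"):
    """Stack the listed head on a frozen EfficientNetV2-S backbone."""
    # Imported here so this file can be loaded without TensorFlow installed.
    from tensorflow import keras
    from tensorflow.keras import layers

    base = keras.applications.EfficientNetV2S(
        include_top=False, weights=weights, input_shape=INPUT_SHAPE
    )
    base.trainable = False  # frozen backbone: only the new head is trained

    inputs = keras.Input(shape=INPUT_SHAPE)
    x = base(inputs, training=False)  # keep backbone BatchNorm in inference mode
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dense(HEAD_UNITS, activation="swish")(x)
    x = layers.Dropout(DROPOUT_RATE)(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return keras.Model(inputs, outputs)
```

Calling `build_classifier(weights=None)` skips the ImageNet weight download, which is convenient for a quick check of the output shape.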

4. Training Strategy

  • Warm-up phase (base frozen)
  • Head training
  • Fine-tuning selected deeper layers
  • Class weights applied for imbalance
  • Early stopping + ReduceLR callbacks
  • Model saved in .keras and .h5 formats
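
The README does not say how the class weights were derived. One common choice is the "balanced" heuristic, where each class gets weight n_samples / (n_classes × class_count); the sketch below implements that formula on hypothetical labels, and the function name is an assumption.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Map each class to n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalanced label list: 6, 3, and 1 samples for classes 0, 1, 2.
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(toy_labels)
for cls in sorted(weights):
    print(cls, round(weights[cls], 3))
```

A dictionary of this form, keyed by integer class index, is what the `class_weight` argument of Keras `Model.fit` accepts, so rarer families contribute more to the loss per sample.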

Training Performance

Training vs Validation Accuracy & Loss

[Figure: training curves]

The model converges stably without severe overfitting.


Confusion Matrix

[Figure: confusion matrix]

Strong diagonal dominance indicates accurate class-wise prediction across malware families.


Classification Report

[Figure: classification report]

Precision and recall are balanced across both frequent and minority classes.


Test Set Evaluation

[Figure: test evaluation]

The final evaluation was performed on a held-out test set that was not used during training or validation.


Tech Stack

  • Python
  • TensorFlow / Keras
  • EfficientNetV2
  • NumPy
  • Scikit-learn
  • Matplotlib
  • GPU Acceleration (if available)

Dataset Overview

Multi-Class Classification

  • 31 Malware Families
  • Directory-based labeling
  • Train / Validation / Test split
  • Class imbalance handled using class weights

Image Properties

  • Resolution: 384 × 384
  • RGB format
  • EfficientNet preprocessing applied

Training Configuration

  • Loss: Categorical Cross-Entropy with Label Smoothing
  • Optimizer: Adam
  • Batch Size: 16
  • Transfer Learning Enabled
  • Fine-Tuning Applied
  • Early Stopping & Learning Rate Reduction
  • Class Weights Integrated
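
To make the loss setting concrete, the sketch below shows what label smoothing does to a one-hot target before cross-entropy is computed: each target becomes y × (1 − ε) + ε / K for K classes. The smoothing factor ε = 0.1, the four-class example, and the function names are assumptions for illustration; the README does not give the value used.

```python
import math


def smooth_labels(one_hot, epsilon):
    """Blend a one-hot target with the uniform distribution over K classes."""
    k = len(one_hot)
    return [y * (1 - epsilon) + epsilon / k for y in one_hot]


def cross_entropy(target, probs):
    """Categorical cross-entropy between a target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


one_hot = [0.0, 1.0, 0.0, 0.0]      # true class is index 1
probs = [0.05, 0.85, 0.05, 0.05]    # hypothetical model output
smoothed = smooth_labels(one_hot, epsilon=0.1)

print("smoothed target:", [round(v, 3) for v in smoothed])
print("loss, hard target:    ", round(cross_entropy(one_hot, probs), 4))
print("loss, smoothed target:", round(cross_entropy(smoothed, probs), 4))
```

With a smoothed target, a confident prediction still incurs some loss from the small mass placed on the other classes, which discourages over-confident outputs. In Keras the same blending is applied by passing `label_smoothing` to `keras.losses.CategoricalCrossentropy`.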

How to Run

Install Dependencies

pip install -r requirements.txt

Train Model

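The training code lives in Jupyter notebooks, which the python command cannot run directly. Assuming Jupyter is installed, open each notebook and run all cells (the quotes are needed because the file names contain spaces and parentheses):

jupyter notebook "(classification)efficientnet v2s.ipynb"
jupyter notebook "(detection)efficientnet v2 s.ipynb"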

Run Inference

python launch_streamlit_tunnel.py

Design Strengths

  • Static analysis (no malware execution)
  • Transfer learning for faster convergence
  • Handles class imbalance
  • Modular deep learning pipeline
  • Deployment-ready model formats

Limitations

  • Static analysis only (no behavioral features)
  • Performance depends on dataset diversity
  • Requires periodic retraining for evolving malware

Full Technical Report

The complete 34-page documentation is available in Reports.pdf.

Conclusion

This project demonstrates that image-based malware representation combined with EfficientNetV2 transfer learning achieves high multi-class classification performance (95% accuracy) while maintaining balanced precision and recall across malware families.

It provides a scalable, reproducible, and research-backed deep learning pipeline for static malware analysis.


Author

Tharun Sridhar
