Paper ID: 182
Title: Guarding the Inbox: Enhancing Email
Security with Firefly Optimization Algorithm and
Fuzzy Logic for Phishing Detection
Authors: Aditi Katiyar, G Aditya Kumar, Kannan
Arputharaj
Organization: VIT Vellore
2024 Second International Conference on Emerging Trends in Information Technology and Engineering (ICETITE)
Abstract
Phishing emails are a serious hazard on the internet because they trick people into disclosing personal information
by impersonating recognized corporations. Identifying these attacks swiftly and accurately is indispensable for
protecting users' data and privacy. This article presents a novel method of email phishing detection based on a
hybrid model that blends deep learning with firefly optimization and fuzzy logic. We use a bidirectional Long
Short-Term Memory (Bi-LSTM) deep learning model, which is well regarded for its performance on sequential
data, and enhance it with the firefly optimization algorithm to fine-tune the model parameters. This integration
improves recognition of the complex, constantly changing phishing patterns found in emails. We then pair the
Bi-LSTM with temporal fuzzy logic, which analyzes and refines the results by addressing the uncertainties in the
content. The method proves effective in comparison with traditional techniques, including in detection cases
where its behavior differs markedly from theirs. We evaluate the presented model by benchmarking it against
well-known deep learning algorithms such as RNN, GRU, and LSTM. The comparison indicates that the hybrid
deep learning model consistently performs better than these baselines when measured by accuracy, precision, and
recall. By combining LSTM with firefly optimization and temporal fuzzy logic, the approach has a good chance
of countering the increasingly serious phishing problem, which is a key step toward a secure cyber
environment.
Introduction
Email phishing is a prevalent cyberattack method that deceives recipients into divulging sensitive information
or downloading malware. Detecting these attacks is crucial to prevent identity theft, financial loss, and data
breaches. Current detection methods face challenges due to phishing's dynamic nature and sophistication.
Traditional approaches like rule-based filters are often ineffective against evolving tactics.
To address these challenges, we propose integrating Long Short-Term Memory (LSTM) networks with the
Firefly optimization algorithm. This enhances the deep learning framework's ability to detect phishing emails by
automatically adjusting parameters. Additionally, we employ fuzzy logic to handle ambiguities, refining
decision-making. Furthermore, our approach emphasizes the importance of continuous adaptation and learning
to stay ahead of phishing threats. By integrating these advanced technologies, we create a more adaptive and
responsive detection system that can identify new and evolving phishing tactics effectively.
Our approach outperforms traditional and current state-of-the-art models in terms of accuracy, precision, and
recall. The integration of LSTM, firefly optimization, and fuzzy logic offers a dynamic and robust solution to
combat phishing threats, advancing cybersecurity measures.
Background/Related Work
Title: Systematic Literature Review on Phishing Email Detection
Methodology Proposed: Focuses on machine learning classifiers and feature selection techniques. Suggests
incorporating extra features for better relevance and efficiency.
Limitations: Limited scope to specific databases, potentially missing relevant studies. Future research suggested
to include a wider array of databases for comprehensive coverage.

Title: Comparison of Machine Learning Techniques for Spam Detection
Methodology Proposed: Analyzes and classifies spam emails using various machine learning algorithms.
Limitations: Potential non-representativeness of chosen datasets for all spam email types; high computational
resources required by the best-performing classifier. Future research to focus on scalability and efficiency
enhancements.

Title: LSTM Based Phishing Detection for Big Email Data
Methodology Proposed: Utilizes LSTM networks for phishing email detection, showcasing its superior accuracy
and efficiency.
Limitations: Limitations due to reliance on specific datasets, possibly not encompassing the diversity of phishing
threats. LSTM's computational demand and processing speed could affect real-time applications.

Title: URL-based Phishing Attack Detection Using BiLSTM
Methodology Proposed: Introduces a novel approach for detecting phishing URLs using BiLSTM and CNN.
Limitations: High complexity of the model might lead to longer training times and challenges in real-time
application. Generalizability of the model across diverse phishing scenarios not covered in the datasets used
could be further explored.
Research Objectives
• Develop an Advanced Phishing Detection System: Create a sophisticated email phishing detection system
that integrates Long Short-Term Memory (LSTM) networks with the Firefly optimization algorithm and
fuzzy logic. This system aims to enhance the accuracy, precision, and recall of phishing detection.
• Address Current Phishing Detection Challenges: Overcome the limitations of traditional phishing
detection methods, such as rule-based filters, by implementing a more adaptive and responsive approach.
This involves automatic adjustment of parameters and handling ambiguities in decision-making using fuzzy
logic.
• Benchmark and Compare with Existing Models: Rigorously compare the proposed model with traditional
and state-of-the-art deep learning models such as RNN, GRU, LSTM, and Bi-LSTM. Evaluate performance
metrics like accuracy, precision, and recall to demonstrate the superiority of the proposed approach.
• Advance Cybersecurity Measures: Contribute to advancing cybersecurity measures by improving the
effectiveness of email phishing detection systems. The goal is to detect and mitigate phishing email threats
more efficiently, thereby enhancing overall cybersecurity.
• Contribute to Phishing Detection Research: Contribute to the field of phishing detection research by
proposing an innovative approach that combines LSTM, firefly optimization, and fuzzy logic. This approach
aims to address the evolving nature of phishing tactics and improve the adaptability of detection systems.
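The three metrics named in the objectives can be sketched as follows. This is a minimal illustration that treats phishing as the positive class; the function name, the label strings, and the example predictions are assumptions, not results from this work.

```python
def binary_metrics(y_true, y_pred, positive="phishing"):
    """Return (accuracy, precision, recall) for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # phishing caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # safe mail flagged
    fn = sum(t == positive and p != positive for t, p in pairs)  # phishing missed
    tn = sum(t != positive and p != positive for t, p in pairs)  # safe mail passed
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Made-up labels, purely to show the call.
truth = ["phishing", "phishing", "phishing", "safe", "safe", "safe", "safe", "safe"]
preds = ["phishing", "phishing", "safe", "phishing", "safe", "safe", "safe", "safe"]
accuracy, precision, recall = binary_metrics(truth, preds)
```

Precision penalizes safe emails flagged as phishing, while recall penalizes phishing emails that slip through, so the two are worth reporting alongside accuracy.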
Model Architecture for the System
Fig. 2. Architecture for the Proposed Model
Methodology
1. Data Preprocessing
Our dataset, tailored for email phishing detection and classification, contains 17,538 rows and two columns:
'Email Text' and 'Email Type.' It provides raw email body text for linguistic and semantic analysis, together with
a binary label: 'safe email' or 'phishing email.' Processing begins by loading the dataset, removing irrelevant
columns, and eliminating duplicate and null values to ensure data integrity. The text is then cleaned: hyperlinks
and punctuation are removed and all letters are converted to lowercase. The categorical 'Email Type' labels are
mapped to numerical values (0 for phishing, 1 for safe emails). The class balance is visualized, and TF-IDF
converts the cleaned text into numerical vectors, with features limited to the top 10,000 terms for efficiency.
Finally, the feature matrix and labels are split into training and testing sets (80-20 split). This preprocessing
gives the model clean, consistent input for phishing email detection and classification.
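The preprocessing steps above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the authors' code: the label strings 'Phishing Email' and 'Safe Email', all function names, and the textbook TF-IDF weighting are assumed, and library implementations of TF-IDF differ in smoothing and normalization.

```python
import math
import random
import re
from collections import Counter

# Assumed label strings; the text gives only the encoding 0 = phishing, 1 = safe.
LABELS = {"Phishing Email": 0, "Safe Email": 1}


def clean_text(text):
    """Lowercase, then strip hyperlinks and punctuation and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"[^\w\s]|_", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def prepare(rows):
    """Drop null and duplicate rows, clean the text, and encode the labels."""
    seen, texts, labels = set(), [], []
    for text, label in rows:
        if not text or label not in LABELS or (text, label) in seen:
            continue
        seen.add((text, label))
        texts.append(clean_text(text))
        labels.append(LABELS[label])
    return texts, labels


def tfidf(texts, max_features=10_000):
    """Textbook TF-IDF restricted to the max_features most frequent terms."""
    docs = [Counter(t.split()) for t in texts]
    totals, doc_freq = Counter(), Counter()
    for doc in docs:
        totals.update(doc)           # corpus-wide term counts
        doc_freq.update(doc.keys())  # number of documents containing each term
    vocab = [term for term, _ in totals.most_common(max_features)]
    idf = {t: math.log(len(docs) / doc_freq[t]) + 1.0 for t in vocab}
    return vocab, [[doc[t] * idf[t] for t in vocab] for doc in docs]


def split_80_20(features, labels, seed=42):
    """Shuffle row indices and split them 80/20 into train and test sets."""
    order = list(range(len(labels)))
    random.Random(seed).shuffle(order)
    cut = int(0.8 * len(order))
    train, test = order[:cut], order[cut:]
    pick = lambda seq, idx: [seq[i] for i in idx]
    return pick(features, train), pick(features, test), pick(labels, train), pick(labels, test)
```

In practice a vectorizer from a machine learning library would replace `tfidf`; the sketch only makes the order of operations explicit.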
2. Model Training
In our study, we employed a Bi-LSTM (Bidirectional Long Short-Term Memory) model to enhance the accuracy of phishing email detection. The
Bi-LSTM model stands out due to its ability to capture context effectively by processing text data from both forward and backward directions,
thereby outperforming traditional RNN, GRU, and standard LSTM models.
i. Pre-processing and Embedding:
Each email is converted to a sequence of integer word indices using the Tokenizer from Keras, and the sequences are padded or truncated to a
fixed length of 150 so that the model receives uniformly shaped input. The Embedding layer maps each index to a dense vector of a fixed size
(50 dimensions). These word representations are learned during training and capture semantic relationships between words.
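The integer encoding and padding step can be illustrated without any deep learning library. The sketch below is a simplified stand-in for the Keras utilities, not their implementation: the function names are hypothetical, index 0 is reserved for padding, and both padding and truncation are applied at the front of the sequence as an assumed choice.

```python
from collections import Counter


def fit_vocab(texts, num_words=None):
    """Assign indices 1, 2, ... to words by descending frequency; 0 is padding."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in counts.most_common(num_words)]
    return {word: index + 1 for index, word in enumerate(ranked)}


def encode_and_pad(text, vocab, maxlen=150):
    """Map known words to indices, keep the last maxlen, and left-pad with zeros."""
    ids = [vocab[word] for word in text.split() if word in vocab]
    ids = ids[-maxlen:]
    return [0] * (maxlen - len(ids)) + ids


vocab = fit_vocab(["a a a b b c"])
sequence = encode_and_pad("c a unknown b", vocab, maxlen=6)
```

Words that were not seen when the vocabulary was built are dropped here; a real pipeline might map them to a dedicated out-of-vocabulary index instead.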
ii. Bidirectional LSTM Layer:
The core of our model comprises a Bidirectional LSTM layer with 100 LSTM units, processing input in both directions. This bidirectional
approach allows the model to capture dependencies and context effectively, enhancing its ability to understand the sequential nature of text data.
The forward LSTM layer processes text from the beginning to the end, while the backward LSTM layer processes text from the end to the
beginning. This dual processing enables the model to extract and leverage contextual information from both preceding and succeeding elements in
the text sequence, providing a comprehensive view of each data point's context.
iii. Dropout Layer:
A Dropout layer with a dropout rate of 0.5 is positioned after the Bidirectional LSTM layer to mitigate overfitting. During training, this layer
randomly disables a fraction of input units, preventing the model from relying too much on specific features and promoting robustness.
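A minimal sketch of what a dropout layer does, assuming the common "inverted dropout" formulation in which surviving activations are scaled by 1 / (1 - rate) during training and the layer is an identity at inference time. The function name and arguments are illustrative.

```python
import random


def dropout(values, rate=0.5, training=True, rng=random):
    """Zero each value with probability `rate`; rescale survivors when training."""
    if not training or rate == 0:
        return list(values)  # inference: pass activations through unchanged
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


activations = [1.0] * 8
dropped = dropout(activations, rate=0.5, rng=random.Random(0))
```

The rescaling keeps the expected value of each activation the same in training and inference, so no adjustment is needed when the model is used for prediction.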
iv. Dense Output Layer:
The final layer of our model is a Dense layer with a sigmoid activation function, providing binary classification (phishing
or safe email). This layer finalizes the prediction process, producing output probabilities that indicate the likelihood of an
email being phishing or safe.
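As a worked check on the layer sizes described above, the following sketch counts trainable parameters using the standard LSTM formula (four gates, each with input weights, recurrent weights, and a bias). The vocabulary size of 10,000 is an assumption made only for illustration, since the text does not state it, and the forward and backward LSTM outputs are assumed to be concatenated before the Dense layer.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def bilstm_params(input_dim, units):
    """A forward and a backward LSTM with separate weights."""
    return 2 * lstm_params(input_dim, units)


def dense_params(inputs, outputs):
    """Weight matrix plus one bias per output."""
    return inputs * outputs + outputs


VOCAB_SIZE = 10_000  # hypothetical; not given in the text
EMBED_DIM, LSTM_UNITS = 50, 100

total = (
    embedding_params(VOCAB_SIZE, EMBED_DIM)
    + bilstm_params(EMBED_DIM, LSTM_UNITS)
    + dense_params(2 * LSTM_UNITS, 1)  # concatenated outputs give 200 inputs
)
```

The Dropout layer adds no parameters. Under these assumptions the Bidirectional LSTM contributes 120,800 parameters and the Dense layer 201, so the Embedding layer dominates the total for any realistic vocabulary.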
Significance of Each Layer:
Pre-processing and Embedding: Standardizing text sequences and converting them into dense vectors allows the model
to understand the semantic relationships between words, facilitating effective feature learning.
Bidirectional LSTM Layer: Processing input from both directions enables the model to capture context and dependencies
effectively, enhancing its ability to understand the sequential nature of text data.
Dropout Layer: Mitigates overfitting by preventing the model from relying too heavily on specific features during
training, promoting generalization to unseen data.
Dense Output Layer: Provides binary classification, allowing the model to make final predictions about whether an email
is phishing or safe.
This detailed architectural approach underscores our model's enhanced ability to detect phishing attempts accurately,
contributing to more secure email communication environments.
4. Firefly Optimization Algorithm
The Firefly Algorithm (FA) is employed to fine-tune critical hyperparameters of a Bi-LSTM network for the binary
classification task of identifying phishing emails.
Key Methods in FA
1. Initialization: Initial population of fireflies representing hyperparameter values.
2. Evaluation: Assessing effectiveness through validation accuracy.
3. Attraction and Movement: Fireflies move towards brighter counterparts, updating hyperparameters.
4. Selection: Identifying the most effective hyperparameter set.
Why FA?
- Optimization: FA seeks to maximize validation accuracy, enhancing the model's proficiency in identifying phishing
patterns.
- Biological Inspiration: Inspired by fireflies' communication patterns, FA navigates the hyperparameter space
effectively.
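The four steps listed under Key Methods can be sketched as a small, generic Firefly Algorithm. This is not the authors' implementation. The toy objective stands in for validation accuracy, since training a Bi-LSTM per firefly is outside the scope of an example, and the two search dimensions, the parameter values (alpha, beta0, gamma), and all names are assumptions.

```python
import math
import random


def firefly_search(objective, bounds, n_fireflies=15, n_iters=40,
                   alpha=0.1, beta0=1.0, gamma=1.0, seed=0):
    """Maximize `objective` over box `bounds`; return (best_position, best_value)."""
    rng = random.Random(seed)
    # 1. Initialization: each firefly is one candidate hyperparameter set.
    swarm = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_fireflies)]
    # 2. Evaluation: brightness is the objective value.
    light = [objective(x) for x in swarm]
    for _ in range(n_iters):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if light[j] > light[i]:
                    # 3. Attraction and movement: i moves toward the brighter j,
                    # with attractiveness decaying with squared distance.
                    r2 = sum((a - b) ** 2 for a, b in zip(swarm[i], swarm[j]))
                    beta = beta0 * math.exp(-gamma * r2)
                    for d, (lo, hi) in enumerate(bounds):
                        noise = alpha * (rng.random() - 0.5) * (hi - lo)
                        moved = swarm[i][d] + beta * (swarm[j][d] - swarm[i][d]) + noise
                        swarm[i][d] = min(hi, max(lo, moved))  # stay inside bounds
                    light[i] = objective(swarm[i])
        alpha *= 0.97  # gradually reduce the random step
    # 4. Selection: return the brightest firefly.
    best = max(range(n_fireflies), key=light.__getitem__)
    return swarm[best], light[best]


def toy_validation_score(params):
    """Hypothetical stand-in for validation accuracy, peaking at (0.3, 0.7)."""
    x, y = params
    return -((x - 0.3) ** 2 + (y - 0.7) ** 2)


BOUNDS = [(0.0, 1.0), (0.0, 1.0)]  # two hyperparameters scaled to [0, 1]
best_params, best_score = firefly_search(toy_validation_score, BOUNDS)
```

In this variant the brightest firefly never moves, so the best score found cannot get worse from one iteration to the next. To tune a real model, `toy_validation_score` would be replaced by a function that trains the network with the given hyperparameters and returns its validation accuracy.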
Novelty of Approach
• Integration with DL: Novel application of FA with a Bi-LSTM model enhances phishing email detection.
• Synergy of Biology and Computation: Leveraging biological inspiration showcases the adaptability of
metaheuristic methods in cybersecurity.
Significance:
• Complex Model Tuning: Addresses optimization challenges in machine learning environments.
• Improving Cybersecurity Measures: Enhances machine learning-driven cybersecurity, particularly for
binary classification tasks.
5. Fuzzy Logic Integration
In the phishing email detection model, fuzzy logic is applied to the LSTM-generated probabilities to assign
each email to a 'Low', 'Medium', or 'High' risk category. Fuzzy logic excels in handling
the uncertainty and vagueness inherent in email threat detection by allowing partial memberships to multiple risk
categories based on probability values. This method stands out because it reflects the gradation and complexity
of real-world scenarios more accurately than binary classifications. It does so by using specific membership
values that transition smoothly between categories; for example, a phishing probability of 0.2 results in equal
memberships in 'Low' and 'Medium' risk categories, highlighting zones of uncertainty where an email does not
clearly belong to one category. This overlap in membership values at certain probabilities captures the nuanced
nature of threat detection, where the indicators of phishing are often ambiguous. Incorporating fuzzy logic
reduces the risk of misclassification and adapts dynamically to evolving phishing tactics, making the system
more responsive and effective at identifying potential threats.
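The overlapping memberships described above can be sketched with piecewise-linear membership functions. The breakpoints (0.4 and 0.6) are hypothetical, chosen only so that a probability of 0.2 yields equal 'Low' and 'Medium' memberships as in the example; the actual membership functions are not given in the text. Resolving ties toward the higher risk category is likewise an assumed design choice.

```python
def clamp(value):
    """Restrict a membership degree to the interval [0, 1]."""
    return max(0.0, min(1.0, value))


def memberships(p):
    """Return fuzzy membership degrees for a phishing probability p in [0, 1]."""
    return {
        "Low": clamp((0.4 - p) / 0.4),                     # 1 at p = 0, 0 from p = 0.4
        "Medium": clamp(min(p / 0.4, (1.0 - p) / 0.4)),    # plateau between 0.4 and 0.6
        "High": clamp((p - 0.6) / 0.4),                    # 0 up to p = 0.6, 1 at p = 1
    }


def risk_label(p):
    """Pick the category with the largest membership; ties go to the higher risk."""
    order = ["Low", "Medium", "High"]
    degrees = memberships(p)
    return max(order, key=lambda name: (degrees[name], order.index(name)))


example = memberships(0.2)   # 'Low' and 'Medium' overlap here
label = risk_label(0.2)
```

Keeping the full membership dictionary, rather than only the final label, preserves the information that an email sits in a zone of uncertainty between two categories.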
Fuzzy Logic Table
References
1. Q. Li, M. Cheng, J. Wang and B. Sun, "LSTM Based Phishing Detection for Big Email Data," in IEEE
Transactions on Big Data, vol. 8, no. 1, pp. 278-288, 1 Feb. 2022, doi: 10.1109/TBDATA.2020.2978915.
2. S. Salloum, T. Gaber, S. Vadera and K. Shaalan, "A Systematic Literature Review on Phishing Email Detection
Using Natural Language Processing Techniques," in IEEE Access, vol. 10, pp. 65703-65727, 2022, doi:
10.1109/ACCESS.2022.3183083.
3. B. Sun, T. Ban, C. Han, T. Takahashi, K. Yoshioka, J. Takeuchi, A. Sarrafzadeh, M. Qiu and D. Inoue,
"Leveraging Machine Learning Techniques to Identify Deceptive Decoy Documents Associated With Targeted
Email Attacks," in IEEE Access, 2021, doi: 10.1109/ACCESS.2021.3082000.
4. L. R. Kalabarige, R. S. Rao, A. R. Pais, and L. A. Gabralla, "A Boosting-Based Hybrid Feature Selection and
Multi-Layer Stacked Ensemble Learning Model to Detect Phishing Websites," in IEEE Access, vol. 11, pp.
71180-71193, 2023, doi: 10.1109/ACCESS.2023.3293649.
5. S. Asiri, Y. Xiao, S. Alzahrani, S. Li and T. Li, "A Survey of Intelligent Detection Designs of HTML URL
Phishing Attacks," in IEEE Access, vol. 11, pp. 6421-6443, 2023, doi: 10.1109/ACCESS.2023.3237798.