SPAM DETECTION PROJECT REPORT
MACHINE LEARNING AND DATA MINING
CS 6735
GROUP: 08
SUBMITTED BY
Bubacarr Jobarteh 3773328
Modou K Touray 3734028
Md Mohaiminul Islam 3759013
Ansumana F Jadama 3749943
INTRODUCTION
Email is a primary means of communication, and attackers use it as one of the main ways to access individual and organizational information and systems. This has driven a rapid increase in spam emails, also called junk emails. The "spam" concept is diverse: advertisements for products and websites, make-money-fast schemes, chain letters, pornography, and so on. To classify spam emails, we apply machine learning classification algorithms that label emails as SPAM or NON-SPAM, and we compare three methods: XGBoost, Naive Bayes, and Logistic Regression, assessing each algorithm's recall, accuracy, precision, and F1-score. Spam classification is an important task in machine learning projects because it helps filter out irrelevant and potentially harmful content. While all three algorithms were used to classify messages as spam or non-spam, we focused primarily on Naïve Bayes, which we implemented from scratch without libraries.
PROBLEM STATEMENT
Spam emails remain a common problem, and detecting and removing them requires strong machine learning models. To create a reliable spam email classifier, this study compares the performance of three classification algorithms on the Spambase [2] dataset.
DESCRIPTION OF DATASET
The data used for this project was taken from the Spambase [2] repository. The dataset is numerical and continuous, with 4,601 instances and 57 features. The features were extracted from a collection of emails, both spam and non-spam, and include word frequencies, character frequencies, and other attributes that help with classification.
DATA PROCESSING
Data Cleaning: The dataset was inspected for missing values, and none were found. Numerical features were standardized using feature scaling; a sketch of this step is shown below.
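A minimal sketch of this step in Python, assuming the UCI spambase.data file (no header row, with the 0/1 spam label in the last column); the file name and column layout are assumptions, not details given in this report:

    # Minimal sketch of the cleaning step, assuming the UCI "spambase.data"
    # file: 57 numeric feature columns followed by a 0/1 spam label.
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("spambase.data", header=None)   # UCI file ships without a header row
    X = df.iloc[:, :-1].to_numpy()                   # 57 numeric features
    y = df.iloc[:, -1].to_numpy()                    # 1 = spam, 0 = non-spam

    print("missing values:", df.isna().sum().sum())  # expected: 0

    scaler = StandardScaler()                        # zero mean, unit variance per feature
    X_scaled = scaler.fit_transform(X)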
Feature Engineering: We did not perform feature engineering, i.e., we did not modify the features to create better, more useful ones, but we did check which columns carry greater importance. As the chart below shows, features such as word_freq_george, char_freq_$, word_freq_000, and word_freq_free are far more important for determining whether an email is spam than features such as word_freq_table and word_freq_all. Tokens like '000' and '$' typically appear in spam emails (in amounts such as $1,000 or $10,000) more often than in regular emails, so assessing feature importance is a necessary step. A sketch of one way to compute such scores appears after Chart 1.
[Chart: horizontal bar chart of feature importance scores (x-axis from 0 to 0.07); the highest-ranked features include word_freq_george, word_freq_remove, char_freq_$, word_freq_000, word_freq_hp, word_freq_free, and word_freq_money.]
Chart 1: Feature Importance
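This report does not state how the importance scores in Chart 1 were produced. One common approach, sketched below under that assumption, is to read them off a fitted XGBoost model (X_scaled and y are the arrays prepared in the cleaning sketch above):

    # Hypothetical sketch: rank features by XGBoost's built-in importance scores.
    import numpy as np
    from xgboost import XGBClassifier

    model = XGBClassifier(n_estimators=100, eval_metric="logloss")
    model.fit(X_scaled, y)

    order = np.argsort(model.feature_importances_)[::-1]
    for i in order[:7]:                              # top 7 features, as in Chart 1
        print(f"feature {i}: {model.feature_importances_[i]:.4f}")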
Sampling: No sampling was done, since the distribution of the spam and non-spam classes was balanced enough: the dataset contains 39% SPAM and 61% NON-SPAM emails. We would have considered sampling if the NON-SPAM class had held an overwhelming majority, around 80% or more of the data. A quick check of this ratio is sketched below.
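A quick sketch of that balance check, reusing the label array from the loading sketch above:

    # Class-balance check: Spambase is roughly 39% spam / 61% non-spam.
    import numpy as np

    spam_ratio = float(np.mean(y))                   # labels: 1 = spam, 0 = non-spam
    print(f"SPAM: {spam_ratio:.0%}  NON-SPAM: {1 - spam_ratio:.0%}")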
MACHINE LEARNING ALGORITHMS
Selected Algorithms:
We used and compared the following algorithms:
- Naive Bayes (main algorithm implemented by us)
- XGBoost
- Logistic Regression
Model Training:
The dataset is divided into testing, validation, and training sets in proportions of 30%, 17.5%, and 52.5%, respectively. For each algorithm, models are trained on the training data, and hyperparameters are fine-tuned as needed using the validation data. One way to realize this split is sketched below.
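A sketch of the split with scikit-learn (stratification and the random seed are assumptions): first hold out 30% for testing, then split the remaining 70% as 75% training / 25% validation:

    # 30% test, then 25% of the remaining 70% as validation -> 52.5/17.5/30 split.
    from sklearn.model_selection import train_test_split

    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X_scaled, y, test_size=0.30, stratify=y, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42)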
OUTCOMES
Below are the results from the different algorithms used for spam detection. We implemented Naïve Bayes, along with its cross-validation, ourselves; for XGBoost and Logistic Regression we used library implementations. Overall, XGBoost performs well, balancing precision and recall.
Comparison of Feature Importance and Smoothing Factor in Naïve Bayes

Smoothing Factor   Top Features   Precision   Recall   F1-Score   Accuracy
0                  10             0.884       0.892    0.887      0.889
0                  25             0.894       0.908    0.895      0.896
0                  50             0.834       0.845    0.836      0.839
0.6                10             0.902       0.903    0.903      0.906
0.6                25             0.904       0.917    0.906      0.907
0.6                50             0.859       0.869    0.852      0.852
0.8                10             0.902       0.904    0.903      0.906
0.8                25             0.902       0.915    0.904      0.906
0.8                50             0.857       0.867    0.848      0.848
1                  10             0.902       0.904    0.903      0.906
1                  25             0.903       0.916    0.906      0.907
1                  50             0.845       0.853    0.831      0.831
Table 1: Naïve Bayes on Validation Data
The above are the results obtained by running Naïve Bayes on the validation data. It can be seen that too low a feature count leaves out important information, while too many features clutter the input and hinder learning; among the 57 features, taking the top 25 proved suitable. The smoothing factor is a hyperparameter of Naïve Bayes, and higher smoothing factors were found to perform similarly. Based on the best smoothing factor (0.8) and a feature count of 25, the following is the result of running Naïve Bayes on the test data; a from-scratch sketch of such a classifier follows Table 2.
Smoothing Factor   Top Features   Precision   Recall   F1-Score   Accuracy
0.8                25             0.8827      0.8928   0.8832     0.8841
Table 2: Naïve Bayes on Test Data With Best Parameters
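For reference, below is a minimal illustrative sketch of a Naive Bayes classifier with a Laplace-style smoothing factor. This report does not specify the likelihood model, so the sketch assumes a multinomial-style likelihood over the raw, non-negative frequency features (one common choice for Spambase); it illustrates the smoothing hyperparameter, not the full implementation.

    # Sketch: multinomial-style Naive Bayes with smoothing factor `alpha`.
    # ASSUMPTION: the likelihood model is not stated in the report; this version
    # treats the raw non-negative frequency features as pseudo-counts.
    import numpy as np

    class SimpleNaiveBayes:
        def __init__(self, alpha=0.8):
            self.alpha = alpha                       # smoothing factor from Table 1

        def fit(self, X, y):
            X, y = np.asarray(X, float), np.asarray(y)
            self.classes_ = np.unique(y)
            self.log_prior_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
            counts = np.array([X[y == c].sum(axis=0) for c in self.classes_])
            smoothed = counts + self.alpha           # smoothing avoids zero probabilities
            self.log_lik_ = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
            return self

        def predict(self, X):
            scores = np.asarray(X, float) @ self.log_lik_.T + self.log_prior_
            return self.classes_[np.argmax(scores, axis=1)]

Restricting the input to the top-k columns of the importance ranking reproduces the "Top Features" dimension of Table 1.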
Running 10-fold cross-validation and test evaluation for XGBoost and Logistic Regression
We ran k-fold cross-validation with k = 10 for both XGBoost and Logistic Regression, and then evaluated both on the test data. The results are as follows; a library-based sketch of the procedure appears after Table 4.
Mean Macro Avg     Precision   Recall   F1-Score
Cross-Validation   0.949       0.948    0.948
Test               0.95        0.95     0.95

Table 3: Result from XGBoost for 10-fold Cross Validation and Test Data
Mean Macro Avg     Precision   Recall   F1-Score
Cross-Validation   0.926       0.92     0.923
Test               0.92        0.91     0.91

Table 4: Result from Logistic Regression for 10-fold Cross Validation and Test Data
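A sketch of this evaluation using scikit-learn and the xgboost package (the specific library calls, metrics, and parameters below are assumptions for illustration):

    # 10-fold cross-validation on the train+validation portion, then a final
    # evaluation on the held-out test set, for both library models.
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import cross_validate
    from xgboost import XGBClassifier

    models = {"XGBoost": XGBClassifier(eval_metric="logloss"),
              "Logistic Regression": LogisticRegression(max_iter=1000)}

    for name, model in models.items():
        cv = cross_validate(model, X_trainval, y_trainval, cv=10,
                            scoring=("precision_macro", "recall_macro", "f1_macro"))
        print(name, {k: round(v.mean(), 3) for k, v in cv.items() if k.startswith("test_")})
        model.fit(X_trainval, y_trainval)            # refit on all training data
        print(classification_report(y_test, model.predict(X_test), digits=3))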
RESULTS: COMPARATIVE ANALYSIS OF ALGORITHM PERFORMANCE
[Chart: grouped bar chart of precision, recall, and F1-score per model: Naïve Bayes 0.88 / 0.89 / 0.88; XGBoost 0.95 / 0.95 / 0.95; Logistic Regression 0.92 / 0.91 / 0.91.]
Chart 2: Comparison of Performance of Models
PERFORMANCE COMPARISON FROM SPAMBASE [2]
Image 1: Baseline model performance from Spambase [2]
The above shows the baseline model performance reported for Spambase [2], in which XGBoost classification performs better than the other models. Our selected algorithms show similar relative performance, with XGBoost outperforming the rest of our models; our results are therefore consistent with the findings of the dataset's authors.
IMPLEMENTATION AND PRACTICAL APPLICATIONS
Spam detection has numerous practical applications such as filtering unwanted emails,
identifying potential phishing attempts, and reducing the spread of malicious content.
By accurately detecting spam, users can have a cleaner and more organized inbox,
allowing them to focus on important messages. Implementing spam detection algorithms
can significantly reduce the risk of falling victim to scams and cyber-attacks.
CONCLUSION
Spam detection is an important task in machine learning to identify and filter out
unwanted messages. We compared the performance of three algorithms, Naive Bayes,
XG Boost, and Logistic Regression, for spam detection. Based on the results, the
XGBoost algorithm outperformed the rest with higher accuracy.
REFERENCES
[1] M. Hopkins, E. Reeber, G. Forman, and J. Suermondt, "Spambase," UCI Machine Learning Repository, 1999. [Online]. Available: https://doi.org/10.24432/C53G6X
[2] UCI Machine Learning Repository, "Spambase," [Online]. Available: https://archive.ics.uci.edu/dataset/94/spambase