SPAM DETECTION PROJECT REPORT
MACHINE LEARNING AND DATA MINING
CS 6735
GROUP: 08
SUBMITTED BY
Bubacarr Jobarteh 3773328
Modou K Touray 3734028
Md Mohaiminul Islam 3759013
Ansumana F Jadama 3749943
INTRODUCTION
Email is a primary means of communication, and attackers use it as one of the main ways to access individual and organizational information and systems. This has driven a rapid increase in spam emails, also called junk emails. The "spam" concept is diverse: advertisements for products and websites, make-money-fast schemes, chain letters, pornography, and so on. To classify spam emails, we apply machine learning classification algorithms that label emails as SPAM or NON-SPAM, and we compare three methods: XGBoost, Naive Bayes, and Logistic Regression, assessing each algorithm's recall, accuracy, precision, and F1-score. Spam classification is an important task in machine learning projects because it helps filter out irrelevant and potentially harmful content. While all three algorithms were used to classify messages as spam or non-spam, we focused primarily on Naïve Bayes, which we implemented from scratch without libraries.
PROBLEM STATEMENT
Spam emails remain a common problem, and detecting and removing them requires strong machine learning models. To create a reliable spam email classifier, this study compares the performance of three classification algorithms on the Spambase [2] dataset.
DESCRIPTION OF DATASET
The data used for this project was taken from the Spambase [2] repository. The dataset is numerical and continuous, with 4,601 instances and 57 features. The features were extracted from a collection of emails, both spam and non-spam, and include word frequencies, character frequencies, and other attributes that help with classification.
DATA PROCESSING
Data Cleaning: The dataset was inspected for missing values, and none were found. Numerical features were standardized using feature scaling; a sketch of this step is shown below.
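A minimal sketch of this step in Python, assuming the UCI spambase.data file (no header row, with the 0/1 spam label in the last column); the file name and column layout are assumptions, not details given in this report:

    # Minimal sketch of the cleaning step, assuming the UCI "spambase.data"
    # file: 57 numeric feature columns followed by a 0/1 spam label.
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("spambase.data", header=None)   # UCI file ships without a header row
    X = df.iloc[:, :-1].to_numpy()                   # 57 numeric features
    y = df.iloc[:, -1].to_numpy()                    # 1 = spam, 0 = non-spam

    print("missing values:", df.isna().sum().sum())  # expected: 0

    scaler = StandardScaler()                        # zero mean, unit variance per feature
    X_scaled = scaler.fit_transform(X)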
Feature Engineering: We did not perform feature engineering, i.e., we did not modify the features to create better, more useful ones, but we did check which columns carry greater importance. As the chart below shows, features such as word_freq_george, char_freq_$, word_freq_000, and word_freq_free are far more important for determining whether an email is spam than features such as word_freq_table and word_freq_all. Tokens like '000' and '$' typically appear in spam emails (in amounts such as $1,000 or $10,000) more often than in regular emails, so assessing feature importance is a necessary step. A sketch of one way to compute such scores appears after Chart 1.
[Chart: horizontal bar chart of feature importance scores (x-axis from 0 to 0.07); the highest-ranked features include word_freq_george, word_freq_remove, char_freq_$, word_freq_000, word_freq_hp, word_freq_free, and word_freq_money.]
Chart 1: Feature Importance
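This report does not state how the importance scores in Chart 1 were produced. One common approach, sketched below under that assumption, is to read them off a fitted XGBoost model (X_scaled and y are the arrays prepared in the cleaning sketch above):

    # Hypothetical sketch: rank features by XGBoost's built-in importance scores.
    import numpy as np
    from xgboost import XGBClassifier

    model = XGBClassifier(n_estimators=100, eval_metric="logloss")
    model.fit(X_scaled, y)

    order = np.argsort(model.feature_importances_)[::-1]
    for i in order[:7]:                              # top 7 features, as in Chart 1
        print(f"feature {i}: {model.feature_importances_[i]:.4f}")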
Sampling: No sampling was done, since the distribution of the spam and non-spam classes was balanced enough: the dataset contains 39% SPAM and 61% NON-SPAM emails. We would have considered sampling if the NON-SPAM class had held an overwhelming majority, around 80% or more of the data. A quick check of this ratio is sketched below.
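A quick sketch of that balance check, reusing the label array from the loading sketch above:

    # Class-balance check: Spambase is roughly 39% spam / 61% non-spam.
    import numpy as np

    spam_ratio = float(np.mean(y))                   # labels: 1 = spam, 0 = non-spam
    print(f"SPAM: {spam_ratio:.0%}  NON-SPAM: {1 - spam_ratio:.0%}")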
MACHINE LEARNING ALGORITHMS
Selected Algorithms:
We used and compared the following algorithms:
- Naive Bayes (main algorithm implemented by us)
- XGBoost
- Logistic Regression
Model Training:
The dataset is divided into testing, validation, and training sets in proportions of 30%, 17.5%, and 52.5%, respectively. For each algorithm, models are trained on the training data, and hyperparameters are fine-tuned as needed using the validation data. One way to realize this split is sketched below.
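A sketch of the split with scikit-learn (stratification and the random seed are assumptions): first hold out 30% for testing, then split the remaining 70% as 75% training / 25% validation:

    # 30% test, then 25% of the remaining 70% as validation -> 52.5/17.5/30 split.
    from sklearn.model_selection import train_test_split

    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X_scaled, y, test_size=0.30, stratify=y, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42)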
OUTCOMES
Below are the results from the different algorithms used for spam detection. We implemented Naïve Bayes, along with its cross-validation, ourselves; for XGBoost and Logistic Regression we used library implementations. Overall, XGBoost performs well, balancing precision and recall.
Comparison of Feature Importance and Smoothing Factor in Naïve Bayes

Smoothing Factor   Top Features   Precision   Recall   F1-Score   Accuracy
0                  10             0.884       0.892    0.887      0.889
0                  25             0.894       0.908    0.895      0.896
0                  50             0.834       0.845    0.836      0.839
0.6                10             0.902       0.903    0.903      0.906
0.6                25             0.904       0.917    0.906      0.907
0.6                50             0.859       0.869    0.852      0.852
0.8                10             0.902       0.904    0.903      0.906
0.8                25             0.902       0.915    0.904      0.906
0.8                50             0.857       0.867    0.848      0.848
1                  10             0.902       0.904    0.903      0.906
1                  25             0.903       0.916    0.906      0.907
1                  50             0.845       0.853    0.831      0.831
Table 1: Naïve Bayes on Validation Data
The above are the results obtained by running Naïve Bayes on the validation data. It can be seen that too low a feature count leaves out important information, while too many features clutter the input and hinder learning; among the 57 features, taking the top 25 proved suitable. The smoothing factor is a hyperparameter of Naïve Bayes, and higher smoothing factors were found to perform similarly. Based on the best smoothing factor (0.8) and a feature count of 25, the following is the result of running Naïve Bayes on the test data; a from-scratch sketch of such a classifier follows Table 2.
Smoothing Factor   Top Features   Precision   Recall   F1-Score   Accuracy
0.8                25             0.8827      0.8928   0.8832     0.8841
Table 2: Naïve Bayes on Test Data With Best Parameters
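For reference, below is a minimal illustrative sketch of a Naive Bayes classifier with a Laplace-style smoothing factor. This report does not specify the likelihood model, so the sketch assumes a multinomial-style likelihood over the raw, non-negative frequency features (one common choice for Spambase); it illustrates the smoothing hyperparameter, not the full implementation.

    # Sketch: multinomial-style Naive Bayes with smoothing factor `alpha`.
    # ASSUMPTION: the likelihood model is not stated in the report; this version
    # treats the raw non-negative frequency features as pseudo-counts.
    import numpy as np

    class SimpleNaiveBayes:
        def __init__(self, alpha=0.8):
            self.alpha = alpha                       # smoothing factor from Table 1

        def fit(self, X, y):
            X, y = np.asarray(X, float), np.asarray(y)
            self.classes_ = np.unique(y)
            self.log_prior_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
            counts = np.array([X[y == c].sum(axis=0) for c in self.classes_])
            smoothed = counts + self.alpha           # smoothing avoids zero probabilities
            self.log_lik_ = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
            return self

        def predict(self, X):
            scores = np.asarray(X, float) @ self.log_lik_.T + self.log_prior_
            return self.classes_[np.argmax(scores, axis=1)]

Restricting the input to the top-k columns of the importance ranking reproduces the "Top Features" dimension of Table 1.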
Running 10-fold cross-validation and test evaluation for XGBoost and Logistic Regression
We ran k-fold cross-validation with k = 10 for both XGBoost and Logistic Regression, and then evaluated both on the test data. The results are as follows; a library-based sketch of the procedure appears after Table 4.
Mean Macro Avg     Precision   Recall   F1-Score
Cross-Validation   0.949       0.948    0.948
Test               0.95        0.95     0.95

Table 3: Result from XGBoost for 10-fold Cross Validation and Test Data
Mean Macro Avg     Precision   Recall   F1-Score
Cross-Validation   0.926       0.92     0.923
Test               0.92        0.91     0.91

Table 4: Result from Logistic Regression for 10-fold Cross Validation and Test Data
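A sketch of this evaluation using scikit-learn and the xgboost package (the specific library calls, metrics, and parameters below are assumptions for illustration):

    # 10-fold cross-validation on the train+validation portion, then a final
    # evaluation on the held-out test set, for both library models.
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import cross_validate
    from xgboost import XGBClassifier

    models = {"XGBoost": XGBClassifier(eval_metric="logloss"),
              "Logistic Regression": LogisticRegression(max_iter=1000)}

    for name, model in models.items():
        cv = cross_validate(model, X_trainval, y_trainval, cv=10,
                            scoring=("precision_macro", "recall_macro", "f1_macro"))
        print(name, {k: round(v.mean(), 3) for k, v in cv.items() if k.startswith("test_")})
        model.fit(X_trainval, y_trainval)            # refit on all training data
        print(classification_report(y_test, model.predict(X_test), digits=3))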
RESULTS: COMPARATIVE ANALYSIS OF ALGORITHM PERFORMANCE
[Chart: grouped bar chart of precision, recall, and F1-score per model: Naïve Bayes 0.88 / 0.89 / 0.88; XGBoost 0.95 / 0.95 / 0.95; Logistic Regression 0.92 / 0.91 / 0.91.]
Chart 2: Comparison of Performance of Models
PERFORMANCE COMPARISON FROM SPAMBASE [2]
Image 1: Baseline model performance from Spambase [2]
The above shows the baseline model performance reported for Spambase [2], in which XGBoost classification performs better than the other models. Our selected algorithms show similar relative performance, with XGBoost outperforming the rest of our models; our results are therefore consistent with the findings of the dataset's authors.
IMPLEMENTATION AND PRACTICAL APPLICATIONS
Spam detection has numerous practical applications such as filtering unwanted emails,
identifying potential phishing attempts, and reducing the spread of malicious content.
By accurately detecting spam, users can have a cleaner and more organized inbox,
allowing them to focus on important messages. Implementing spam detection algorithms
can significantly reduce the risk of falling victim to scams and cyber-attacks.
CONCLUSION
Spam detection is an important task in machine learning to identify and filter out
unwanted messages. We compared the performance of three algorithms, Naive Bayes,
XG Boost, and Logistic Regression, for spam detection. Based on the results, the
XGBoost algorithm outperformed the rest with higher accuracy.
REFERENCES
[1] M. Hopkins, E. Reeber, G. Forman, and J. Suermondt, "Spambase," UCI Machine Learning Repository, 1999. [Online]. Available: https://doi.org/10.24432/C53G6X
[2] UCI Machine Learning Repository, "Spambase," [Online]. Available: https://archive.ics.uci.edu/dataset/94/spambase