TE MINIPROJECT
PROJECT TITLE- SPAM
EMAIL CLASSIFIER
GROUP MEMBERS-
VINEET IYER 118A1029
ABHISHEK JOSHI 118A1030
VISHAK KODETHUR 118A1033
TUSHANT GOKHE 118A1024
ABOUT OUR PROJECT
In this project we classify whether an email is spam or not using Machine
Learning. We evaluated several algorithms: XGBoost, a manually coded Naive
Bayes, Random Forest, Multinomial Naive Bayes (sklearn) and Support Vector
Machine. The final model used in our project is XGBoost, which achieved the
best balance of precision, recall and F1 scores.
What is Machine Learning?
Machine Learning involves computers discovering how to
perform tasks without being explicitly programmed to do so.
It gives systems the ability to learn and improve automatically
from previous experience.
In our project, this lets us predict whether an email is spam
or not.
CLASSIFICATION OF MACHINE LEARNING
ALGORITHMS
1] Supervised Machine Learning - The machine is trained on well-labelled
training data, and on that basis it predicts the output for new inputs.
2] Unsupervised Machine Learning - The model is not supervised with a labelled
training dataset. Instead, the model itself finds hidden patterns and insights
in the given data.
3] Reinforcement Learning - The output depends on the state of the current
input, and the next input depends on the output of the previous one.
Random Forest Classifier(Supervised Learning)
It is a classifier consisting of a number of decision trees
trained on various subsets of the given dataset; their votes
are combined into a final prediction.
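As a minimal sketch (not the project's actual training code), fitting a Random Forest on a toy bag-of-words matrix with sklearn might look like this; the feature matrix and labels below are made up for illustration:

```python
from sklearn.ensemble import RandomForestClassifier

# Toy bag-of-words counts for 6 emails over a 3-word vocabulary,
# e.g. counts of ["free", "win", "meeting"]; labels: 1 = spam, 0 = non-spam.
X = [[3, 2, 0], [2, 1, 0], [4, 3, 0], [0, 0, 2], [0, 1, 3], [0, 0, 1]]
y = [1, 1, 1, 0, 0, 0]

# An ensemble of decision trees, each fit on a bootstrap sample of the data
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

# Spam-looking and non-spam-looking test rows
preds = clf.predict([[2, 2, 0], [0, 0, 2]])
```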
XGBoost Classifier(Supervised Learning Algorithm)
It is one of the most popular and efficient implementations of the gradient
boosted trees algorithm.
Why is XGBoost Fast?
It uses the CPU cache to store calculated gradients so that the necessary
computations are fast.
Multinomial Naive Bayes(Supervised Learning)
Using sklearn
The multinomial Naive Bayes classifier is
suitable for classification with discrete
features (e.g., word counts for text
classification). The multinomial distribution
normally requires integer feature counts.
However, in practice, fractional counts such
as tf-idf may also work.
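A minimal sklearn usage sketch, with a made-up four-email training set; `CountVectorizer` produces the integer word counts the multinomial model expects:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "win a free prize now",       # spam
    "claim your free money",      # spam
    "meeting notes attached",     # non-spam
    "project report for review",  # non-spam
]
labels = [1, 1, 0, 0]

# Turn raw text into integer word counts, then fit the multinomial model
vec = CountVectorizer()
X = vec.fit_transform(emails)
clf = MultinomialNB()
clf.fit(X, labels)

pred = clf.predict(vec.transform(["free prize money"]))
```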
Manually Coded Naive Bayes(Supervised Learning)
We created a vocabulary of the 10,000 most commonly occurring words
after data cleaning was done.
Then we calculated the probabilities of these words in the complete
dataset, in spam emails, and in non-spam emails separately.
Then we found the posterior probability of each word for spam and
non-spam emails using Bayes' formula.
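The formula itself is not reproduced in this text; for a word w it is presumably the standard Bayes rule:

```latex
P(\text{spam} \mid w) = \frac{P(w \mid \text{spam})\, P(\text{spam})}{P(w)},
\qquad
P(\text{non-spam} \mid w) = \frac{P(w \mid \text{non-spam})\, P(\text{non-spam})}{P(w)}
```

Naive Bayes then multiplies these per-word likelihoods over all words of an email, assuming the words are conditionally independent given the class.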
DataSet Details
Source: SpamAssassin Public Corpus (
https://spamassassin.apache.org/old/publiccorpus/)
Data Format:
Separate folders for spam and non-spam emails.
The emails are documents consisting of the sender's information and the
mail history of replies/forwards.
Some emails also contain HTML, which has to be cleaned.
DataSet Cleaning (5 steps)
1) Removing HTML tags (using BeautifulSoup)
2) Converting words to lowercase and tokenising them into a list of
separate words
3) Removing all stop words, numbers, special characters and
punctuation marks
4) Stemming every word to its root form (using PorterStemmer)
5) Creating the vocabulary (10,000 words)
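The five steps above can be sketched as a single pipeline. The project uses BeautifulSoup for step 1 and PorterStemmer for step 4; to keep this sketch self-contained in the standard library, a regex stands in for the HTML stripping and a crude suffix rule stands in for the Porter stemmer, and the stop-word list is abbreviated:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "you"}  # abbreviated

def clean_email(raw):
    # 1) Strip HTML tags (the project uses BeautifulSoup; regex stand-in here)
    text = re.sub(r"<[^>]+>", " ", raw)
    # 2) Lowercase and tokenise into a list of separate words
    tokens = re.findall(r"[a-z]+", text.lower())
    # 3) Drop stop words (numbers/punctuation already excluded by the regex)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 4) Stem to a root form (the project uses PorterStemmer; crude rule here)
    tokens = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]
    return tokens

# 5) Build the vocabulary from the most common words across all cleaned emails
corpus = ["<p>Win FREE money now!!!</p>", "Meeting notes attached, see replies"]
counts = Counter(w for mail in corpus for w in clean_email(mail))
vocab = [w for w, _ in counts.most_common(10000)]
```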
Training the Model
We split the dataset into training and testing data in a 7:3 ratio.
Then we find the probability of every word in our vocabulary in three
different contexts:
1) Probability of the word throughout the dataset.
2) Probability of the word in spam emails.
3) Probability of the word in non-spam emails.
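The three per-word probabilities can be computed by simple counting. The token lists below are made up for illustration, and Laplace (+1) smoothing is an assumption on our part so unseen words never get probability zero:

```python
from collections import Counter

# Hypothetical cleaned, tokenised training emails: (label, tokens)
train = [
    (1, ["win", "free", "money"]),
    (1, ["free", "prize"]),
    (0, ["meeting", "notes"]),
    (0, ["project", "meeting"]),
]

all_counts  = Counter(w for _, toks in train for w in toks)
spam_counts = Counter(w for lbl, toks in train if lbl == 1 for w in toks)
ham_counts  = Counter(w for lbl, toks in train if lbl == 0 for w in toks)  # ham = non-spam

total      = sum(all_counts.values())
total_spam = sum(spam_counts.values())
total_ham  = sum(ham_counts.values())
vocab = set(all_counts)

# Laplace smoothing: add 1 to each count, add |vocab| to each denominator
p_word      = {w: (all_counts[w]  + 1) / (total      + len(vocab)) for w in vocab}
p_word_spam = {w: (spam_counts[w] + 1) / (total_spam + len(vocab)) for w in vocab}
p_word_ham  = {w: (ham_counts[w]  + 1) / (total_ham  + len(vocab)) for w in vocab}
```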
Testing the Model
We test the model on the test set by finding the probability of each email
being spam and non-spam using the Naive Bayes algorithm.
An email is classified as spam or non-spam by comparing these
two probabilities.
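The comparison can be sketched as follows. The per-word likelihoods and priors below are hypothetical stand-ins for the trained values, and working in log space (so the products become sums and do not underflow) is our assumption, not necessarily the project's exact implementation:

```python
import math

# Hypothetical smoothed per-word likelihoods from training (ham = non-spam)
p_word_spam = {"free": 0.25, "money": 0.10, "meeting": 0.02}
p_word_ham  = {"free": 0.01, "money": 0.03, "meeting": 0.20}
p_spam, p_ham = 0.4, 0.6  # class priors from the training split

def classify(tokens):
    # Sum of log-probabilities == log of the product, without underflow;
    # unseen words fall back to a tiny probability
    log_spam = math.log(p_spam) + sum(math.log(p_word_spam.get(t, 1e-6)) for t in tokens)
    log_ham  = math.log(p_ham)  + sum(math.log(p_word_ham.get(t, 1e-6)) for t in tokens)
    return log_spam > log_ham   # True -> spam
```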
Scores of various models
Algorithm               Accuracy   Recall Score   Precision Score   F1 Score
XGBoost                 98.62%     97.47%         98.18%            97.83%
Random Forest           97.64%     93.86%         98.67%            96.21%
Naive Bayes (manual)    98.14%     98.80%         96.8%             97%
Naive Bayes (sklearn)   94.36%     83.03%         99.14%            90.37%
SVM                     88.73%     65.70%         98.38%            78.79%
Python Function
Parameters:
data: a string containing the contents of the email.
mode: default mode=2, used when data contains only the email content;
otherwise data is considered to contain the sender information and mail history as well.
classifier:
classifier='manual' (default): only the manual Naive Bayes is used to classify.
classifier='xgb': only XGBoost is used to classify.
Returns: Boolean: True if the email is spam, False otherwise.
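From the parameter list above, the function's interface might look like the sketch below. The signature follows the slide; the body is only a placeholder keyword check standing in for the trained models, and the marker set is invented for illustration:

```python
def is_spam(data, mode=2, classifier='manual'):
    """Return True if the email is spam, False otherwise.

    data:       string with the email contents
    mode:       2 (default) when data contains only the email content;
                any other value when sender info / mail history is included
    classifier: 'manual' (default) for the hand-coded Naive Bayes,
                'xgb' for XGBoost
    """
    # Placeholder logic only -- the real project would dispatch to the
    # trained Naive Bayes or XGBoost model here.
    spam_markers = {"free", "win", "prize", "money"}
    tokens = data.lower().split()
    return any(t in spam_markers for t in tokens)
```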
Future Scope
1] Our project can help filter out spam messages received by email.
2] It can help maintain proper business communication.
3] It can also be used in various education sectors.
THANK YOU