TE MINIPROJECT
PROJECT TITLE- SPAM
EMAIL CLASSIFIER
GROUP MEMBERS-
VINEET IYER 118A1029
ABHISHEK JOSHI 118A1030
VISHAK KODETHUR 118A1033
TUSHANT GOKHE 118A1024
ABOUT OUR PROJECT
In this project we classify whether an email is spam or not using Machine
Learning. We evaluated several algorithms: XGBoost, a manually coded Naive
Bayes, Random Forest, Multinomial Naive Bayes (sklearn) and Support Vector
Machine. The final model used in our project is XGBoost, which achieved the
best balance of precision, recall and F1 scores.
What is Machine Learning?
Machine Learning involves computers discovering how to
perform tasks without being explicitly programmed to do so.
It gives systems the ability to learn and improve automatically
from previous experience.
In our project, this lets us predict whether an email is spam
or not.
CLASSIFICATION OF MACHINE LEARNING
ALGORITHMS
1] Supervised Machine Learning - The machine is trained on well-labelled
training data, and on that basis it predicts the output for new inputs.
2] Unsupervised Machine Learning - The model is not supervised with a labelled
training dataset. Instead, the model itself finds hidden patterns and insights
in the given data.
3] Reinforcement Learning - The output depends on the state of the current
input, and the next input depends on the output of the previous one.
Random Forest Classifier(Supervised Learning)
It is a classifier consisting of a number of decision trees
trained on various subsets of the given dataset; their votes
are combined into a final prediction.
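As a minimal sketch (not the project's actual training code), fitting a Random Forest on a toy bag-of-words matrix with sklearn might look like this; the feature matrix and labels below are made up for illustration:

```python
from sklearn.ensemble import RandomForestClassifier

# Toy bag-of-words counts for 6 emails over a 3-word vocabulary,
# e.g. counts of ["free", "win", "meeting"]; labels: 1 = spam, 0 = non-spam.
X = [[3, 2, 0], [2, 1, 0], [4, 3, 0], [0, 0, 2], [0, 1, 3], [0, 0, 1]]
y = [1, 1, 1, 0, 0, 0]

# An ensemble of decision trees, each fit on a bootstrap sample of the data
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

# Spam-looking and non-spam-looking test rows
preds = clf.predict([[2, 2, 0], [0, 0, 2]])
```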
XGBoost Classifier(Supervised Learning Algorithm)
It is one of the most popular and efficient implementations of the gradient
boosted trees algorithm.
Why is XGBoost Fast?
It uses the CPU cache to store calculated gradients so that the necessary
computations are fast.
Multinomial Naive Bayes(Supervised Learning)
Using sklearn
The multinomial Naive Bayes classifier is
suitable for classification with discrete
features (e.g., word counts for text
classification). The multinomial distribution
normally requires integer feature counts.
However, in practice, fractional counts such
as tf-idf may also work.
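A minimal sklearn usage sketch, with a made-up four-email training set; `CountVectorizer` produces the integer word counts the multinomial model expects:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "win a free prize now",       # spam
    "claim your free money",      # spam
    "meeting notes attached",     # non-spam
    "project report for review",  # non-spam
]
labels = [1, 1, 0, 0]

# Turn raw text into integer word counts, then fit the multinomial model
vec = CountVectorizer()
X = vec.fit_transform(emails)
clf = MultinomialNB()
clf.fit(X, labels)

pred = clf.predict(vec.transform(["free prize money"]))
```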
Manually Coded Naive Bayes(Supervised Learning)
We created a vocabulary of the 10,000 most commonly occurring words
after data cleaning was done.
Then we calculated the probabilities of these words in the complete
dataset, in spam emails, and in non-spam emails separately.
Then we found the posterior probability of each word for spam and
non-spam emails using Bayes' formula.
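The formula itself is not reproduced in this text; for a word w it is presumably the standard Bayes rule:

```latex
P(\text{spam} \mid w) = \frac{P(w \mid \text{spam})\, P(\text{spam})}{P(w)},
\qquad
P(\text{non-spam} \mid w) = \frac{P(w \mid \text{non-spam})\, P(\text{non-spam})}{P(w)}
```

Naive Bayes then multiplies these per-word likelihoods over all words of an email, assuming the words are conditionally independent given the class.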
DataSet Details
Source: SpamAssassin Public Corpus (
https://spamassassin.apache.org/old/publiccorpus/)
Data Format:
Separate folders for spam and non-spam emails.
The emails are documents consisting of the sender's information and the
mail history of replies/forwards.
Some emails also contain HTML, which has to be cleaned.
DataSet Cleaning (5 steps)
1) Removing HTML tags (using BeautifulSoup)
2) Converting words to lowercase and tokenising them into a list of
separate words
3) Removing all stop words, numbers, special characters and
punctuation marks
4) Stemming every word to its root form (using PorterStemmer)
5) Creating the vocabulary (10,000 words)
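The five steps above can be sketched as a single pipeline. The project uses BeautifulSoup for step 1 and PorterStemmer for step 4; to keep this sketch self-contained in the standard library, a regex stands in for the HTML stripping and a crude suffix rule stands in for the Porter stemmer, and the stop-word list is abbreviated:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "you"}  # abbreviated

def clean_email(raw):
    # 1) Strip HTML tags (the project uses BeautifulSoup; regex stand-in here)
    text = re.sub(r"<[^>]+>", " ", raw)
    # 2) Lowercase and tokenise into a list of separate words
    tokens = re.findall(r"[a-z]+", text.lower())
    # 3) Drop stop words (numbers/punctuation already excluded by the regex)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 4) Stem to a root form (the project uses PorterStemmer; crude rule here)
    tokens = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]
    return tokens

# 5) Build the vocabulary from the most common words across all cleaned emails
corpus = ["<p>Win FREE money now!!!</p>", "Meeting notes attached, see replies"]
counts = Counter(w for mail in corpus for w in clean_email(mail))
vocab = [w for w, _ in counts.most_common(10000)]
```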
Training the Model
We split the dataset into training and testing data in a 7:3 ratio.
Then we find the probability of every word in our vocabulary in three
different contexts:
1) Probability of the word throughout the dataset.
2) Probability of the word in spam emails.
3) Probability of the word in non-spam emails.
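The three per-word probabilities can be computed by simple counting. The token lists below are made up for illustration, and Laplace (+1) smoothing is an assumption on our part so unseen words never get probability zero:

```python
from collections import Counter

# Hypothetical cleaned, tokenised training emails: (label, tokens)
train = [
    (1, ["win", "free", "money"]),
    (1, ["free", "prize"]),
    (0, ["meeting", "notes"]),
    (0, ["project", "meeting"]),
]

all_counts  = Counter(w for _, toks in train for w in toks)
spam_counts = Counter(w for lbl, toks in train if lbl == 1 for w in toks)
ham_counts  = Counter(w for lbl, toks in train if lbl == 0 for w in toks)  # ham = non-spam

total      = sum(all_counts.values())
total_spam = sum(spam_counts.values())
total_ham  = sum(ham_counts.values())
vocab = set(all_counts)

# Laplace smoothing: add 1 to each count, add |vocab| to each denominator
p_word      = {w: (all_counts[w]  + 1) / (total      + len(vocab)) for w in vocab}
p_word_spam = {w: (spam_counts[w] + 1) / (total_spam + len(vocab)) for w in vocab}
p_word_ham  = {w: (ham_counts[w]  + 1) / (total_ham  + len(vocab)) for w in vocab}
```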
Testing the Model
We test the model on the test set by finding the probability of each email
being spam and non-spam using the Naive Bayes algorithm.
An email is classified as spam or non-spam by comparing these
two probabilities.
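The comparison can be sketched as follows. The per-word likelihoods and priors below are hypothetical stand-ins for the trained values, and working in log space (so the products become sums and do not underflow) is our assumption, not necessarily the project's exact implementation:

```python
import math

# Hypothetical smoothed per-word likelihoods from training (ham = non-spam)
p_word_spam = {"free": 0.25, "money": 0.10, "meeting": 0.02}
p_word_ham  = {"free": 0.01, "money": 0.03, "meeting": 0.20}
p_spam, p_ham = 0.4, 0.6  # class priors from the training split

def classify(tokens):
    # Sum of log-probabilities == log of the product, without underflow;
    # unseen words fall back to a tiny probability
    log_spam = math.log(p_spam) + sum(math.log(p_word_spam.get(t, 1e-6)) for t in tokens)
    log_ham  = math.log(p_ham)  + sum(math.log(p_word_ham.get(t, 1e-6)) for t in tokens)
    return log_spam > log_ham   # True -> spam
```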
Scores of various models
Algorithm               Accuracy   Recall Score   Precision Score   F1 Score
XGBoost                 98.62%     97.47%         98.18%            97.83%
Random Forest           97.64%     93.86%         98.67%            96.21%
Naive Bayes (manual)    98.14%     98.80%         96.8%             97%
Naive Bayes (sklearn)   94.36%     83.03%         99.14%            90.37%
SVM                     88.73%     65.70%         98.38%            78.79%
Python Function
Parameters:
data: a string containing the contents of the email.
mode: default mode=2, used when data contains only the email content;
otherwise data is considered to contain the sender information and mail history as well.
classifier:
classifier='manual' (default): only the manual Naive Bayes is used to classify.
classifier='xgb': only XGBoost is used to classify.
Returns: Boolean: True if the email is spam, False otherwise.
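From the parameter list above, the function's interface might look like the sketch below. The signature follows the slide; the body is only a placeholder keyword check standing in for the trained models, and the marker set is invented for illustration:

```python
def is_spam(data, mode=2, classifier='manual'):
    """Return True if the email is spam, False otherwise.

    data:       string with the email contents
    mode:       2 (default) when data contains only the email content;
                any other value when sender info / mail history is included
    classifier: 'manual' (default) for the hand-coded Naive Bayes,
                'xgb' for XGBoost
    """
    # Placeholder logic only -- the real project would dispatch to the
    # trained Naive Bayes or XGBoost model here.
    spam_markers = {"free", "win", "prize", "money"}
    tokens = data.lower().split()
    return any(t in spam_markers for t in tokens)
```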
Future Scope
1] Our project can help filter out spam messages received by email.
2] It can help maintain proper business communication.
3] It can also be used in various education sectors.
THANK YOU