Spam Detection

The document describes a dataset containing SMS messages that are labeled as either ham (legitimate) or spam. It performs preprocessing of the text data, including encoding, vectorization, and train-test splitting. Several machine learning models are trained on the data, including Naive Bayes classification, and their performance is evaluated.

Uploaded by

Himanshu Kautkar


4/9/24, 8:45 PM spam_detector

SMS and Email Spam Classifier

About dataset
The SMS Spam Collection is a set of tagged SMS messages that have been collected for SMS spam
research. It contains one set of 5,574 SMS messages in English, each tagged as either ham
(legitimate) or spam.

In [ ]: # Import needed libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import warnings
warnings.filterwarnings('ignore')


In [2]: # Data reading with read_csv function

data = pd.read_csv('/content/drive/MyDrive/Docs for collab/Spam /spam.csv',
encoding="ISO-8859-1")
data.head()

Out[2]:
   v1    v2                                                   Unnamed: 2  Unnamed: 3  Unnamed: 4
0  ham   Go until jurong point, crazy.. Available only ...    NaN         NaN         NaN
1  ham   Ok lar... Joking wif u oni...                        NaN         NaN         NaN
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...    NaN         NaN         NaN
3  ham   U dun say so early hor... U c already then say...    NaN         NaN         NaN
4  ham   Nah I don't think he goes to usf, he lives aro...    NaN         NaN         NaN

In [3]: data.rename(columns={'v1':'Type','v2':'Content'},inplace=True)

In [4]: # Getting quick info

df = data[['Type','Content']].copy()  # copy, so new columns are not added to a slice of data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Type 5572 non-null object
1 Content 5572 non-null object
dtypes: object(2)
memory usage: 87.2+ KB

In [5]: # viewing first 5 data points

df.head()

Out[5]:
Type Content

0 ham Go until jurong point, crazy.. Available only ...

1 ham Ok lar... Joking wif u oni...

2 spam Free entry in 2 a wkly comp to win FA Cup fina...

3 ham U dun say so early hor... U c already then say...

4 ham Nah I don't think he goes to usf, he lives aro...

In [6]: # Checking null values

df.isnull().sum()

Out[6]: Type 0
Content 0
dtype: int64


In [7]: # Description of dataset

df.describe()

Out[7]:
Type Content

count 5572 5572

unique 2 5169

top ham Sorry, I'll call later

freq 4825 30

In [8]: # Distribution of type of messages

ax = sns.countplot(x='Type',data=df,palette='gist_rainbow').set(title='Distribution of Type of the message')
plt.show()

In [9]: # Percentage of Spam and Ham

counts = df['Type'].value_counts()      # index by label rather than by position
ham = (counts['ham']/len(df))*100
spam = (counts['spam']/len(df))*100
print(f'Percentage of Ham in this dataset {ham.round(2)}%')
print(f'Percentage of Spam in this dataset {spam.round(2)}%')

Percentage of Ham in this dataset 86.59%

Percentage of Spam in this dataset 13.41%

This shows that the data is clearly imbalanced.
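
With roughly 87% ham and 13% spam, a purely random split can leave the training and test sets with slightly different class ratios. As a minimal sketch, assuming a small made-up label array (not the real dataset), passing `stratify` to `train_test_split` keeps the class proportions the same in both subsets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 80 ham (0) and 20 spam (1), standing in for the real data
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder features

# stratify=y_toy preserves the 80/20 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print('Spam share in train:', y_tr.mean())
print('Spam share in test :', y_te.mean())
```

The notebook itself uses an unstratified split; this is only one possible adjustment.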


In [10]: # Length of the Content

df['Content Length'] = df['Content'].apply(len)
df.head()

Out[10]:
Type Content Content Length

0 ham Go until jurong point, crazy.. Available only ... 111

1 ham Ok lar... Joking wif u oni... 29

2 spam Free entry in 2 a wkly comp to win FA Cup fina... 155

3 ham U dun say so early hor... U c already then say... 49

4 ham Nah I don't think he goes to usf, he lives aro... 61

In [11]: # Content Length vs Type

figsize = (8, 3)
plt.figure(figsize=figsize)
sns.barplot(df, x='Content Length', y='Type', palette='gist_rainbow').set(title='Content Length vs Type')
plt.show()

From the plot above, we can see that spam messages are longer on average than ham messages.
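
The same comparison can be checked numerically. The sketch below uses a few made-up messages (not rows from the dataset) and computes the mean length per class, which is the quantity a seaborn `barplot` draws by default:

```python
import pandas as pd

# Hypothetical toy messages, standing in for the real dataset
toy = pd.DataFrame({
    'Type': ['ham', 'ham', 'spam', 'spam'],
    'Content': ['Ok', 'See you', 'WIN a FREE prize now', 'Call now to claim cash'],
})
toy['Content Length'] = toy['Content'].str.len()

# Mean message length for each class
mean_length = toy.groupby('Type')['Content Length'].mean()
print(mean_length)
```

On the real data, the same `groupby` call on `df` would give the exact averages behind the bars.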

Text Preprocessing
In [ ]: # Encoding of Type Column
le = LabelEncoder()
le.fit(df['Type'])
df['Encoded Type'] = le.transform(df['Type'])
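
`LabelEncoder` assigns integer codes to the class names in sorted order, so `ham` becomes 0 and `spam` becomes 1. A small sketch on a made-up list (the name `le_demo` is only for illustration) shows the mapping and how to reverse it:

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(['ham', 'spam', 'ham'])

print(list(le_demo.classes_))                    # class names in sorted order
print(list(encoded))                             # integer code for each input label
print(list(le_demo.inverse_transform([1, 0])))   # codes mapped back to names
```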

In [ ]: # Splitting the data into features (X) and target (y)

X = df['Content']
y = df['Encoded Type']


In [ ]: # Vectorization of the Content column using TfidfVectorizer

vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(X)
x_vector = x.toarray()

In [ ]: # DataFrame after Vectorization

pd.DataFrame(data=x_vector,columns=vectorizer.get_feature_names_out()).head()

Out[ ]:
00 000 000pes 008704050406 0089 0121 01223585236 01223585334 0125698789 02

0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 8672 columns
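
Almost every entry in this matrix is zero, and the dense array from `.toarray()` holds 5572 × 8672 float64 values (roughly 387 MB). The scikit-learn estimators used in this notebook also accept the sparse matrix returned by `fit_transform` directly. A minimal sketch on a made-up corpus, not the notebook's data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: 1 = spam, 0 = ham
corpus = ['free prize now', 'call me later', 'win free cash', 'see you later']
labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
X_sparse = vec.fit_transform(corpus)   # SciPy sparse matrix, no .toarray() call

# The classifier is trained on the sparse matrix as-is
clf = MultinomialNB().fit(X_sparse, labels)
print(X_sparse.shape, 'stored non-zeros:', X_sparse.nnz)
print(clf.predict(vec.transform(['free cash prize'])))
```

Only the non-zero TF-IDF weights are stored, which keeps memory use proportional to the number of words actually present in the messages.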

In [ ]: # Train Test split

X_train,X_test,y_train,y_test = train_test_split(x_vector,y,test_size=0.2,random_state=0)
X_train.shape

Out[ ]: (4457, 8672)
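
Note that the vectorizer in this notebook is fitted on all messages before the split, so its vocabulary and IDF weights also reflect the test messages. One common alternative, sketched here on a made-up corpus, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline, so that the vectorizer only ever sees training text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = spam, 0 = ham
texts = ['free prize now', 'win free cash', 'claim your free prize', 'urgent cash prize',
         'call me later', 'see you later', 'are you home', 'lunch at noon']
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw strings first, then fit everything on the training part only
text_train, text_test, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(text_train, y_tr)

vocab = pipe.named_steps['tfidfvectorizer'].vocabulary_
print('Vocabulary size learned from training text:', len(vocab))
print('Test accuracy:', pipe.score(text_test, y_te))
```

The scores reported in this notebook come from the original fit-then-split order, so they are not directly comparable with a pipeline fitted this way.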

Model Building

1) Naive Bayes
In [ ]: model_MNB = MultinomialNB()
model_MNB.fit(X_train,y_train)
print('Training set Score :',model_MNB.score(X_train,y_train))
print('Test set Score :',model_MNB.score(X_test,y_test))

Training set Score : 0.9699349338119811

Test set Score : 0.9488789237668162


In [ ]: # Confusion Matrix
y_pred = model_MNB.predict(X_test)
cm = confusion_matrix(y_test,y_pred)
sns.heatmap(cm,annot=True,fmt='.0f').set(title='Confusion Matrix Heatmap')
plt.xlabel('Predicted')   # confusion_matrix columns are predicted labels
plt.ylabel('Actual')      # confusion_matrix rows are actual labels
plt.show()

In [ ]: # Classification Report
cr = classification_report(y_test,y_pred)
print(cr)

precision recall f1-score support

0 0.94 1.00 0.97 949

1 1.00 0.66 0.79 166

accuracy 0.95 1115

macro avg 0.97 0.83 0.88 1115
weighted avg 0.95 0.95 0.94 1115
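
The precision and recall values in the report follow directly from the confusion matrix. The sketch below uses made-up labels (not the notebook's predictions) that mimic the pattern seen here: no ham is flagged as spam, so spam precision is perfect, but some spam is missed, so spam recall is lower.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 6 ham (0) and 4 spam (1); two spam messages are missed
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat  = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of actual spam that was caught
print('precision:', precision, 'recall:', recall)
```

For a spam filter, high precision with lower recall means legitimate messages are rarely blocked, at the cost of letting some spam through.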


2) Logistic Regression
In [ ]: model_lr = LogisticRegression()
model_lr.fit(X_train,y_train)
print('Training set Score :',model_lr.score(X_train,y_train))
print('Test set Score :',model_lr.score(X_test,y_test))

Training set Score : 0.9741978909580435

Test set Score : 0.9533632286995516

In [ ]: # Confusion Matrix
y_pred = model_lr.predict(X_test)
cm = confusion_matrix(y_test,y_pred)
sns.heatmap(cm,annot=True,fmt='.0f').set(title='Confusion Matrix Heatmap')
plt.xlabel('Predicted')   # confusion_matrix columns are predicted labels
plt.ylabel('Actual')      # confusion_matrix rows are actual labels
plt.show()


In [ ]: # Classification Report
cr = classification_report(y_test,y_pred)
print(cr)

precision recall f1-score support

0 0.94 1.00 0.97 949

1 1.00 0.66 0.79 166

accuracy 0.95 1115

macro avg 0.97 0.83 0.88 1115
weighted avg 0.95 0.95 0.94 1115

3) SVM
In [ ]: model_svm = SVC()
model_svm.fit(X_train,y_train)
print('Training set Score :',model_svm.score(X_train,y_train))
print('Test set Score :',model_svm.score(X_test,y_test))

Training set Score : 0.9973076060130133

Test set Score : 0.968609865470852


In [ ]: # Confusion Matrix
y_pred = model_svm.predict(X_test)
cm = confusion_matrix(y_test,y_pred)
sns.heatmap(cm,annot=True,fmt='.0f').set(title='Confusion Matrix Heatmap')
plt.xlabel('Predicted')   # confusion_matrix columns are predicted labels
plt.ylabel('Actual')      # confusion_matrix rows are actual labels
plt.show()

In [ ]: # Classification Report
cr = classification_report(y_test,y_pred)
print(cr)

precision recall f1-score support

0 0.95 1.00 0.97 949

1 1.00 0.69 0.81 166

accuracy 0.95 1115

macro avg 0.97 0.84 0.89 1115
weighted avg 0.96 0.95 0.95 1115

Prediction
In [ ]: text = ['Bored housewives! Chat n date now! 0871750.77.11! BT-national rate 10p/min only from landlines!',
                'Let Ur Heart Be Ur Compass Ur Mind Ur Map Ur Soul Ur Guide And U Will Never loose in world....gnun - Sent via WAY2SMS.COM']
# Actual ===> text = [1(spam),0(ham)]


In [ ]: test = vectorizer.transform(text)
test_dense = test.toarray()

In [ ]: # MultinomialNB
model_MNB.predict(test_dense)

Out[ ]: array([1, 0])

In [ ]: # Logistic Regression
model_lr.predict(test_dense)

Out[ ]: array([1, 0])

In [ ]: # SVM
model_svm.predict(test_dense)

Out[ ]: array([1, 0])
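
The three prediction cells repeat the same steps: transform the raw text with the already fitted vectorizer, call `predict`, and read the integer codes as 0 = ham and 1 = spam. As a sketch, these steps can be wrapped in a small helper. The function `classify_messages` and the `demo_` objects are hypothetical names, and the toy training data stands in for the real dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data: 1 = spam, 0 = ham
train_texts = ['free prize now', 'win free cash', 'call me later', 'see you later']
train_labels = [1, 1, 0, 0]

demo_vectorizer = TfidfVectorizer().fit(train_texts)
demo_model = LogisticRegression().fit(demo_vectorizer.transform(train_texts), train_labels)

def classify_messages(messages, vectorizer, model, names=('ham', 'spam')):
    """Return (message, label name) pairs for a list of raw strings."""
    features = vectorizer.transform(messages)   # reuse the fitted vocabulary
    codes = model.predict(features)
    return [(msg, names[int(code)]) for msg, code in zip(messages, codes)]

results = classify_messages(['free cash prize', 'see you later'], demo_vectorizer, demo_model)
for message, label in results:
    print(label, '<-', message)
```

With the notebook's own objects, the call would take `vectorizer` and any of the three fitted models in place of the demo objects.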
