Spam Detection

The document describes a dataset containing SMS messages that are labeled as either ham (legitimate) or spam. It performs preprocessing of the text data, including encoding, vectorization, and train-test splitting. Several machine learning models are trained on the data, including Naive Bayes classification, and their performance is evaluated.

Uploaded by

Himanshu Kautkar

4/9/24, 8:45 PM spam_detector

SMS and Email Spam Classifier

About dataset
The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam
research. It contains one set of 5,574 SMS messages in English, each tagged as either ham
(legitimate) or spam.

In [ ]: # Import needed libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import warnings
warnings.filterwarnings('ignore')


In [2]: # Data reading with read_csv function

data = pd.read_csv('/content/drive/MyDrive/Docs for collab/Spam /spam.csv',
encoding="ISO-8859-1")
data.head()

Out[2]:
     v1                                                 v2 Unnamed: 2 Unnamed: 3 Unnamed: 4
0   ham  Go until jurong point, crazy.. Available only ...        NaN        NaN        NaN
1   ham                      Ok lar... Joking wif u oni...        NaN        NaN        NaN
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN        NaN        NaN
3   ham  U dun say so early hor... U c already then say...        NaN        NaN        NaN
4   ham  Nah I don't think he goes to usf, he lives aro...        NaN        NaN        NaN

In [3]: data.rename(columns={'v1':'Type','v2':'Content'},inplace=True)

In [4]: # Getting quick info

df = data[['Type','Content']].copy()  # copy so later column assignments do not write to a slice of data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Type 5572 non-null object
1 Content 5572 non-null object
dtypes: object(2)
memory usage: 87.2+ KB
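
The rename and column selection can also be folded into the read step. The following is a minimal sketch, assuming the same `v1`/`v2` column layout; the inline CSV text and the name `df_demo` are made up for illustration, and with the real file `encoding="ISO-8859-1"` would still be passed to `read_csv`.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for spam.csv, including the empty trailing columns.
csv_text = (
    "v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4\n"
    "ham,Ok lar... Joking wif u oni...,,,\n"
    "spam,Free entry in 2 a wkly comp,,,\n"
)

# Read only the two useful columns and rename them in one chain.
df_demo = (
    pd.read_csv(io.StringIO(csv_text), usecols=["v1", "v2"])
    .rename(columns={"v1": "Type", "v2": "Content"})
)
print(df_demo.columns.tolist())
```

Reading only the needed columns avoids carrying the three `Unnamed` columns around at all.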

In [5]: # viewing first 5 data points

df.head()

Out[5]:
Type Content

0 ham Go until jurong point, crazy.. Available only ...

1 ham Ok lar... Joking wif u oni...

2 spam Free entry in 2 a wkly comp to win FA Cup fina...

3 ham U dun say so early hor... U c already then say...

4 ham Nah I don't think he goes to usf, he lives aro...

In [6]: # Checking null values

df.isnull().sum()

Out[6]: Type 0
Content 0
dtype: int64


In [7]: # Description of dataset

df.describe()

Out[7]:
Type Content

count 5572 5572

unique 2 5169

top ham Sorry, I'll call later

freq 4825 30
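
The description reports 5169 unique messages among 5572 rows, so some messages appear more than once. Whether to drop them is a modelling choice the notebook does not make; the sketch below only shows how duplicates could be counted and removed, using a made-up toy frame.

```python
import pandas as pd

# Toy frame: the second and third rows share the same label and text.
toy = pd.DataFrame({
    "Type": ["ham", "ham", "ham", "spam"],
    "Content": ["Ok", "Sorry, I'll call later", "Sorry, I'll call later", "WIN a prize"],
})

n_dupes = int(toy.duplicated().sum())                   # rows identical to an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)  # keep the first occurrence only
print(n_dupes, len(deduped))
```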

In [8]: # Distribution of type of messages

ax = sns.countplot(x='Type',data=df,palette='gist_rainbow')
ax.set(title='Distribution of Type of the message')
plt.show()

In [9]: # Percentage of Spam and Ham

type_share = df['Type'].value_counts(normalize=True) * 100  # indexed by label, not by position
ham = type_share['ham']
spam = type_share['spam']
print(f'Percentage of Ham in this dataset {ham:.2f}%')
print(f'Percentage of Spam in this dataset {spam:.2f}%')

Percentage of Ham in this dataset 86.59%

Percentage of Spam in this dataset 13.41%

This shows that the dataset is clearly imbalanced.
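
With roughly 13% spam, a plain random split can leave the train and test sets with slightly different class ratios. One option, which this notebook does not use, is the `stratify` argument of `train_test_split`. A minimal sketch on synthetic labels (all names and data here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a similar imbalance: 866 ham (0) and 134 spam (1).
y_toy = np.array([0] * 866 + [1] * 134)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)   # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Both shares stay close to the overall spam share of 0.134.
spam_share_train = y_tr.mean()
spam_share_test = y_te.mean()
print(round(spam_share_train, 3), round(spam_share_test, 3))
```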


In [10]: # Length of the Content

df['Content Length'] = df['Content'].apply(len)
df.head()

Out[10]:
Type Content Content Length

0 ham Go until jurong point, crazy.. Available only ... 111

1 ham Ok lar... Joking wif u oni... 29

2 spam Free entry in 2 a wkly comp to win FA Cup fina... 155

3 ham U dun say so early hor... U c already then say... 49

4 ham Nah I don't think he goes to usf, he lives aro... 61

In [11]: # Content Length vs Type

figsize = (8, 3)
plt.figure(figsize=figsize)
sns.barplot(df, x='Content Length', y='Type', palette='gist_rainbow').set(title='Content Length vs Type')
plt.show()

The plot above shows that spam messages are longer on average than ham messages.
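
The bar plot draws the mean length per class, and the same numbers can be read off directly with a `groupby`. A small sketch on made-up messages:

```python
import pandas as pd

toy = pd.DataFrame({
    "Type": ["ham", "ham", "spam"],
    "Content": ["Ok lar", "See you at 5", "Free entry! Text WIN to claim your prize now"],
})
toy["Content Length"] = toy["Content"].str.len()

# Mean message length per class, the quantity the bar plot draws.
mean_len = toy.groupby("Type")["Content Length"].mean()
print(mean_len)
```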

Text Preprocessing
In [ ]: # Encoding of Type Column
le = LabelEncoder()
le.fit(df['Type'])
df['Encoded Type'] = le.transform(df['Type'])
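
`LabelEncoder` assigns integer codes in sorted label order, so here `ham` becomes 0 and `spam` becomes 1. The sketch below, using a separate illustrative encoder `le_demo`, shows how the mapping can be inspected:

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
codes = le_demo.fit_transform(["ham", "spam", "ham"])

# classes_ holds the labels in the order of their integer codes.
mapping = {str(label): code for code, label in enumerate(le_demo.classes_)}
print(mapping)
```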

In [ ]: # Separating the features (X) and the target (y)

X = df['Content']
y = df['Encoded Type']


In [ ]: # Vectorization of the Content column using TfidfVectorizer

vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(X)
x_vector = x.toarray()
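
`TfidfVectorizer.fit_transform` returns a SciPy sparse matrix, and `.toarray()` turns it into a dense array in which almost every entry is zero. The three estimators used later accept sparse input, so the dense copy is optional. A minimal sketch on a made-up three-message corpus:

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["free entry win cup", "ok lar joking", "win free prize now"]
vec = TfidfVectorizer()
X_sparse = vec.fit_transform(corpus)   # SciPy sparse matrix, one row per message

print(X_sparse.shape)                  # (number of messages, vocabulary size)
print(sparse.issparse(X_sparse))

# Fraction of entries that are actually stored (non-zero).
density = X_sparse.nnz / (X_sparse.shape[0] * X_sparse.shape[1])
print(round(density, 3))
```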

In [ ]: # DataFrame after Vectorization

pd.DataFrame(data=x_vector,columns=vectorizer.get_feature_names_out()).head()

Out[ ]:
00 000 000pes 008704050406 0089 0121 01223585236 01223585334 0125698789 02

0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 8672 columns

In [ ]: # Train Test split

X_train,X_test,y_train,y_test = train_test_split(x_vector,y,test_size=0.2,random_state=0)
X_train.shape

Out[ ]: (4457, 8672)
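
In this notebook the vectorizer is fitted on all messages before the split, so the vocabulary and IDF weights also reflect the test messages. An alternative, sketched here on made-up data and not what the notebook does, is to split the raw text first and wrap the vectorizer and the model in a scikit-learn `Pipeline`, so the vectorizer only sees training text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative toy messages: 1 = spam, 0 = ham.
texts = ["win free prize now", "free entry win cash", "claim your free prize",
         "ok see you later", "call me when home", "are you coming today",
         "urgent win cash now", "lunch at noon ok"]
labels = [1, 1, 1, 0, 0, 0, 1, 0]

# Split the raw text first, then fit vectorizer and classifier on the training part only.
txt_train, txt_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(txt_train, lab_train)
preds = pipe.predict(txt_test)   # raw strings in, labels out
print(preds)
```

The fitted pipeline can also be applied to new raw messages directly, without a separate `transform` call.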

Model Building

1) Naive Bayes
In [ ]: model_MNB = MultinomialNB()
model_MNB.fit(X_train,y_train)
print('Training set Score :',model_MNB.score(X_train,y_train))
print('Test set Score :',model_MNB.score(X_test,y_test))

Training set Score : 0.9699349338119811

Test set Score : 0.9488789237668162


In [ ]: # Confusion Matrix
y_pred = model_MNB.predict(X_test)
cm = confusion_matrix(y_test,y_pred)
sns.heatmap(cm,annot=True,fmt='.0f').set(title='Confusion Matrix Heatmap')
plt.xlabel('Predicted')  # columns of cm are predicted labels
plt.ylabel('Actual')     # rows of cm are true labels
plt.show()
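
scikit-learn's `confusion_matrix(y_true, y_pred)` puts the true labels on the rows and the predicted labels on the columns, so a heatmap of it has the actual classes on the y-axis and the predicted classes on the x-axis. A small sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1]
y_hat = [0, 0, 1, 1, 0]

cm_demo = confusion_matrix(y_true, y_hat)
# Row i = true class i, column j = predicted class j.
tn, fp, fn, tp = cm_demo.ravel()
print(cm_demo)
```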

In [ ]: # Classification Report
cr = classification_report(y_test,y_pred)
print(cr)

              precision    recall  f1-score   support

           0       0.94      1.00      0.97       949
           1       1.00      0.66      0.79       166

    accuracy                           0.95      1115
   macro avg       0.97      0.83      0.88      1115
weighted avg       0.95      0.95      0.94      1115
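
For the spam class (1), the report shows precision 1.00 and recall 0.66: almost nothing flagged as spam is actually ham, but about a third of the spam messages are missed. The sketch below reproduces that pattern on made-up labels to show how the three numbers relate:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 0, 0, 0]   # one spam message missed, no false alarms

precision = precision_score(y_true, y_hat)   # TP / (TP + FP) = 2 / 2
recall = recall_score(y_true, y_hat)         # TP / (TP + FN) = 2 / 3
f1 = f1_score(y_true, y_hat)                 # harmonic mean of precision and recall
print(precision, round(recall, 2), round(f1, 2))
```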


2) Logistic Regression
In [ ]: model_lr = LogisticRegression()
model_lr.fit(X_train,y_train)
print('Training set Score :',model_lr.score(X_train,y_train))
print('Test set Score :',model_lr.score(X_test,y_test))

Training set Score : 0.9741978909580435

Test set Score : 0.9533632286995516

In [ ]: # Confusion Matrix
y_pred = model_lr.predict(X_test)
cm = confusion_matrix(y_test,y_pred)
sns.heatmap(cm,annot=True,fmt='.0f').set(title='Confusion Matrix Heatmap')
plt.xlabel('Predicted')  # columns of cm are predicted labels
plt.ylabel('Actual')     # rows of cm are true labels
plt.show()


In [ ]: # Classification Report
cr = classification_report(y_test,y_pred)
print(cr)

              precision    recall  f1-score   support

           0       0.94      1.00      0.97       949
           1       1.00      0.66      0.79       166

    accuracy                           0.95      1115
   macro avg       0.97      0.83      0.88      1115
weighted avg       0.95      0.95      0.94      1115

3) SVM
In [ ]: model_svm = SVC()
model_svm.fit(X_train,y_train)
print('Training set Score :',model_svm.score(X_train,y_train))
print('Test set Score :',model_svm.score(X_test,y_test))

Training set Score : 0.9973076060130133

Test set Score : 0.968609865470852


In [ ]: # Confusion Matrix
y_pred = model_svm.predict(X_test)
cm = confusion_matrix(y_test,y_pred)
sns.heatmap(cm,annot=True,fmt='.0f').set(title='Confusion Matrix Heatmap')
plt.xlabel('Predicted')  # columns of cm are predicted labels
plt.ylabel('Actual')     # rows of cm are true labels
plt.show()

In [ ]: # Classification Report
cr = classification_report(y_test,y_pred)
print(cr)

              precision    recall  f1-score   support

           0       0.95      1.00      0.97       949
           1       1.00      0.69      0.81       166

    accuracy                           0.95      1115
   macro avg       0.97      0.84      0.89      1115
weighted avg       0.96      0.95      0.95      1115
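
The SVM report lists an accuracy of 0.95 while the SVM test set score printed earlier is 0.9686, and the Logistic Regression report is identical to the Naive Bayes one. This may indicate that some cells were run out of order with a stale `y_pred`, though the export alone cannot confirm that. One way to rule it out is to compute every metric from a single prediction inside one function. The sketch below uses a hypothetical helper named `evaluate` and synthetic data in place of the TF-IDF features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split


def evaluate(model, X_test, y_test):
    """Predict once and derive every metric from that same prediction."""
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "report": classification_report(y_test, y_pred),
    }


# Synthetic data stands in for the TF-IDF features (an assumption for this sketch).
X_syn, y_syn = make_classification(n_samples=200, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(X_syn, y_syn, random_state=0)

model_demo = LogisticRegression(max_iter=1000).fit(Xs_train, ys_train)
result = evaluate(model_demo, Xs_test, ys_test)
print(result["report"])
```

Calling the same helper for each fitted model keeps the score, confusion matrix, and report tied to the same predictions.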

Prediction
In [ ]: text = ['Bored housewives! Chat n date now! 0871750.77.11! BT-national rate 10p/min only from landlines!',
        'Let Ur Heart Be Ur Compass Ur Mind Ur Map Ur Soul Ur Guide And U Will Never loose in world....gnun - Sent via WAY2SMS.COM']
# Actual ===> text = [1(spam),0(ham)]


In [ ]: test = vectorizer.transform(text)
test_dense = test.toarray()

In [ ]: # MultinomialNB
model_MNB.predict(test_dense)

Out[ ]: array([1, 0])

In [ ]: # Logistic Regression
model_lr.predict(test_dense)

Out[ ]: array([1, 0])

In [ ]: # SVM
model_svm.predict(test_dense)

Out[ ]: array([1, 0])
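
The predicted codes can be mapped back to the original labels with `LabelEncoder.inverse_transform`, and `MultinomialNB.predict_proba` gives a class probability alongside the hard prediction. A self-contained sketch on made-up training messages (the `_demo` names are illustrative and are not the fitted objects from this notebook):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

train_texts = ["win free prize now", "free cash claim now",
               "see you at lunch", "call me later ok"]
train_types = ["spam", "spam", "ham", "ham"]

le_demo = LabelEncoder()
y_demo = le_demo.fit_transform(train_types)   # ham -> 0, spam -> 1
vec_demo = TfidfVectorizer()
clf_demo = MultinomialNB().fit(vec_demo.fit_transform(train_texts), y_demo)

# Use transform (not fit_transform) on new text so the training vocabulary is reused.
new_vec = vec_demo.transform(["free prize now"])
label = le_demo.inverse_transform(clf_demo.predict(new_vec))[0]
spam_prob = clf_demo.predict_proba(new_vec)[0, 1]   # column 1 corresponds to 'spam'
print(label, round(spam_prob, 3))
```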
