Sms Spam Using Machine Learning 4

The project report by Rahul Sharma focuses on using machine learning to classify SMS messages as spam or ham, utilizing natural language processing techniques. It details the process of data collection, preprocessing, feature engineering, and the implementation of various machine learning models, including Naive Bayes, Logistic Regression, and Random Forest. The results indicate that the Random Forest model achieved the highest accuracy and AUC, and the report concludes with suggestions for future improvements such as deploying an API and using deep learning methods.

Uploaded by

tejaswimathur3

SMS Spam Using Machine Learning

A project report submitted in training program on

Artificial Intelligence – Machine Learning

by
Rahul Sharma
23071003516
(2023-2026)
Dr.Virendra Swarup Institute Of Computer Studies
CANDIDATE’S DECLARATION

I, Rahul Sharma, hereby certify that the work presented in this report,
titled SMS Spam Using Machine Learning submitted in partial
fulfillment of the requirements for the training programme on Artificial
Intelligence – Machine Learning, is an authentic record of my own
work. I have also duly cited all references for any data, text(s), figure(s),
table(s), or equation(s) that have been taken from other sources.

Date: Signature of the Candidate


ABSTRACT
The main goal of this work is the implementation of Matsui’s linear cryptanalysis of DES and a
statistical and theoretical analysis of its complexity and success probability. In order to achieve
this goal, we first implement a very fast DES routine on the Intel Pentium III MMX architecture
which is fully optimized for linear cryptanalysis. New implementation concepts are applied,
resulting in a speed increase of almost 50% over the best-known classical implementation.
The experimental results strongly suggest that the attack is on average about 10 times faster
(O(2^39) DES computations) than expected with O(2^43) known plaintext-ciphertext pairs at disposal;
furthermore, we have achieved a complexity of O(2^43) by using only 2^42.5 known pairs. Last, we
propose a new analytical expression which approximates success probabilities; it gives slightly
better results than Matsui’s experimental ones.

Keywords: Encryption, Decryption, Key Distribution, Secure Technique.
Table of Contents
1. Introduction
2. Objective
3. Libraries and Tools Used
4. Data Collection
5. Data Preprocessing
6. Exploratory Data Analysis (EDA)
7. Feature Engineering
8. Text Vectorization
9. Model Building
10. Model Evaluation
11. Cross Validation
12. Model Comparison
13. ROC Curve Analysis
14. Misclassification Analysis
15. Feature Importance
16. Saving and Deployment
17. Sample Prediction
18. Conclusion
19. References
Introduction

Spam messages are unsolicited texts sent for advertising, phishing, or fraud.
They are a major concern for users and companies alike.
This project focuses on automatically classifying SMS
messages as either:
1.Ham (Not Spam)
2.Spam
We will use:
1.Natural Language Processing (NLP)
2.Machine Learning classifiers
Why this problem is important:
1.Improves user safety
2.Filters unwanted communication
3.Reduces chances of scams
Objective

The main goals of this project are:


1. Collect and explore SMS data.
2. Clean and preprocess text for modeling.
3. Extract relevant features.
4. Transform text to numeric representation.
5. Train multiple machine learning models.
6. Evaluate and compare models.
7. Identify the best performing model.
8. Make predictions on new messages.
9. Save the trained model for later use.
Libraries And Tools Used

The following Python libraries are used:

Data Manipulation
pandas: Load and manipulate data tables.
numpy: Numerical computations.

Visualization
matplotlib: Plot graphs.
seaborn: Advanced visualization.
WordCloud: Visual word frequencies.

NLP
nltk: Preprocessing text.
Stopword removal
Lemmatization

Machine Learning
Scikit-learn: ML models and utilities.
Naive Bayes
Logistic Regression

Random Forest
Train-test splitting
Cross-validation
TF-IDF Vectorizer

Utilities
string, re: String constants and regular expressions for cleaning text.
pickle: Save trained model.
Data Collection

Dataset Source:
https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv

Loading:
df = pd.read_csv(url, sep="\t", names=["label", "message"])

Data Columns:
label: ham or spam
message: actual SMS text

Sample Records:
| Label | Message |
|---|---|
| ham | Go until jurong point, crazy... |
| spam | Free entry in 2 weekly competition... |

Class Distribution:
Ham: 4825 messages
Spam: 747 messages
This is an imbalanced dataset, so accuracy alone is not sufficient;
AUC and the confusion matrix will be important.
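
To see why accuracy alone can mislead here, the short sketch below computes the accuracy of a trivial classifier that always answers "ham", using only the class counts reported above. The variable names are illustrative and not part of the project code.

```python
# Class counts reported above for the SMS dataset.
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
# A trivial classifier that always predicts the majority class ("ham").
baseline_accuracy = max(counts.values()) / total
spam_share = counts["spam"] / total

print(f"Total messages: {total}")                        # 5572
print(f"Spam share: {spam_share:.3f}")                   # 0.134
print(f"Always-ham accuracy: {baseline_accuracy:.3f}")   # 0.866
```

Any model should therefore be judged against roughly 87% accuracy, not against 50%.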
Data Preprocessing

Text preprocessing ensures the model sees clean, normalized text:
Steps:
1. Convert to lowercase.
2. Remove URLs.
3. Remove digits.
4. Remove punctuation.
5. Remove stopwords.
6. Lemmatize words.

Why this is important:


Reduces noise.
Standardizes vocabulary.
Improves model accuracy.
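
The first five steps can be sketched with the standard library alone. This is a simplified illustration, not the project's function: the stopword set below is a tiny hypothetical stand-in for NLTK's English list, the URL pattern is an assumption, and lemmatization (step 6) is omitted because it needs the WordNet data.

```python
import re
import string

# Hypothetical, tiny stopword list for illustration only; the report uses
# NLTK's English stopword list and WordNet lemmatization instead.
STOP_WORDS = {"a", "in", "to", "the", "is", "at", "now"}

def clean_text_sketch(text):
    text = text.lower()                                  # 1. lowercase
    text = re.sub(r"https?://\S+|www\.\S+", "", text)    # 2. remove URLs
    text = re.sub(r"\d+", "", text)                      # 3. remove digits
    # 4. remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 5. remove stopwords (step 6, lemmatization, is not shown here)
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)

example = "WIN a FREE prize now! Visit http://example.com or call 80086."
print(clean_text_sketch(example))  # win free prize visit or call
```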
Exploratory Data Analysis (EDA)

EDA helps understand the dataset before modeling.


Visualizations:
Message Length Distribution: Spam messages are often longer.
Word Count Distribution: Highlights difference in verbosity.

Pie Chart:
Shows class imbalance.

Correlation Matrix:
Relationship between message length, word count, punctuation, unique
words.

WordClouds:
Spam WordCloud: shows words like free, win, claim.
Ham WordCloud: words related to casual conversation.

Insights:
Spam tends to have promotional keywords.
Spam tends to use more punctuation and longer text.
Feature Engineering

We create numerical features that improve predictive power:
Features Added:
msg_length: Total character count.
word_count: Number of words.
punct_count: Number of punctuation marks.
unique_words: Number of unique words.

Why this helps:


Spam often uses excessive length and punctuation.
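
As a worked example, the four features can be computed for a single message with plain Python. The helper name message_features is hypothetical; the message is the second record of the dataset, and the values match the engineered-feature output printed in the code section of this report.

```python
import string

def message_features(message):
    """Compute the four engineered features for one SMS message."""
    words = message.split()
    return {
        "msg_length": len(message),                                   # characters
        "word_count": len(words),                                     # whitespace-separated tokens
        "punct_count": sum(1 for ch in message if ch in string.punctuation),
        "unique_words": len(set(words)),                              # distinct tokens
    }

features = message_features("Ok lar... Joking wif u oni...")
print(features)
# {'msg_length': 29, 'word_count': 6, 'punct_count': 6, 'unique_words': 6}
```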
Text Vectorization

To train ML models, we convert text into numeric vectors using TF-IDF
(Term Frequency-Inverse Document Frequency):
Steps:
1. Initialize vectorizer:
vectorizer = TfidfVectorizer(max_features=3000)

2. Fit and transform cleaned text:


X = vectorizer.fit_transform(df['cleaned'])

Output:
A sparse matrix with one row per message and up to 3000 columns, one per retained term (max_features keeps the most frequent terms in the corpus).
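
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. It assumes the smoothed IDF formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The three documents are invented for illustration, and the sketch does not reproduce TfidfVectorizer's tokenization.

```python
import math
from collections import Counter

docs = ["free prize call", "call me later", "free free entry"]  # toy corpus

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = Counter(w for doc in tokenized for w in set(doc))
# Smoothed inverse document frequency.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    """Raw term counts times IDF, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
print(vocab)
for doc, row in zip(docs, matrix):
    print(doc, [round(v, 2) for v in row])
```

Rare terms such as "entry" receive a higher IDF than terms such as "free" that appear in several documents.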
Model Building

We use 3 classifiers:
Naive Bayes:
Probabilistic model good for text.

Logistic Regression:
Linear classifier.

Random Forest:
Ensemble of decision trees.

Train-Test Split:
80% training
20% testing
X_train, X_test, y_train, y_test = train_test_split(...)
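
To illustrate why Naive Bayes is described as a probabilistic model that suits text, the sketch below implements a tiny multinomial Naive Bayes with add-one (Laplace) smoothing on word counts. The five training messages are invented, and the code is a teaching sketch, not the MultinomialNB implementation used in the project (which is trained on TF-IDF values).

```python
import math
from collections import Counter

# Hypothetical toy training set: (message, label) with 1 = spam, 0 = ham.
train = [("free prize claim now", 1), ("win free cash", 1),
         ("are you free today", 0), ("see you at lunch", 0), ("call me later", 0)]

vocab = sorted({w for text, _ in train for w in text.split()})
word_counts = {0: Counter(), 1: Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

def log_posterior(text, label, alpha=1.0):
    """Unnormalized log P(label | text) = log prior + sum of log word likelihoods."""
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(train))
    for w in text.split():
        if w in vocab:  # words never seen in training are ignored
            score += math.log((word_counts[label][w] + alpha)
                              / (total + alpha * len(vocab)))
    return score

def predict(text):
    return max((0, 1), key=lambda label: log_posterior(text, label))

print(predict("claim free cash"))  # 1 (spam)
print(predict("see you later"))    # 0 (ham)
```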
Model Evaluation

For each model:

1. Fit on training data.
2. Predict on test data.
3. Calculate:

- Accuracy
- AUC
- Confusion Matrix
- Classification Report

Metrics Meaning:
- Accuracy: Proportion of correct predictions.
- AUC: Area under the ROC curve; measures how well the model's scores separate the two classes.
- Precision/Recall/F1: More informative than accuracy for imbalanced data.
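
The metrics can be computed by hand from the four cells of a confusion matrix. The counts below are hypothetical (chosen only so that the class totals match a test set of 966 ham and 149 spam messages); they are not results from this project.

```python
# Hypothetical confusion-matrix counts, with spam as the positive class.
tp, fp, fn, tn = 120, 5, 29, 961

accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # of messages flagged as spam, how many are spam
recall = tp / (tp + fn)                      # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
```

With these counts accuracy stays near 0.97 even though about one in five spam messages is missed, which is why recall and F1 are reported alongside it.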
Cross Validation

To ensure model stability, we used 5-fold cross-validation:
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')

Reported:
Mean Accuracy
Standard Deviation

Why:
Gives a more reliable performance estimate than a single train-test split and helps detect overfitting.
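
For a classifier and an integer cv, scikit-learn's cross_val_score uses stratified folds, so each fold keeps roughly the same spam ratio. The sketch below shows the idea with a simple round-robin assignment per class. It is a simplification for illustration, not scikit-learn's exact algorithm, and the toy labels are invented.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)  # round-robin within each class
    return folds

labels = [0] * 20 + [1] * 5  # hypothetical toy labels: 20 ham, 5 spam
folds = stratified_folds(labels, k=5)
for i, test_idx in enumerate(folds, 1):
    spam = sum(labels[j] for j in test_idx)
    print(f"Fold {i}: {len(test_idx)} samples, {spam} spam")
```

Each fold serves once as the held-out set while the model is trained on the other four.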
Model Comparison

A bar chart compares the models on:
Accuracy
AUC

Observation:
Random Forest had the highest test accuracy (0.9803) and AUC (0.9935).
Naive Bayes was fast and reached similar accuracy (0.9785), but had the lowest AUC (0.9794).

ROC Curve Analysis

Plotted ROC curve:


X-axis: False Positive Rate
Y-axis: True Positive Rate

Interpretation:
A curve closer to the top-left corner (larger area under the curve) indicates a better model.
Cross-validated ROC Curves showed stability.
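
A single ROC point and the AUC can be computed directly from labels and predicted scores. The sketch below uses four hypothetical predictions. It computes AUC as the fraction of (spam, ham) pairs in which the spam message gets the higher score, which equals the area under the ROC curve.

```python
def roc_point(y_true, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    return fp / n_neg, tp / n_pos

def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]            # hypothetical labels (1 = spam)
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted spam probabilities
print(roc_point(y_true, scores, threshold=0.35))  # (0.5, 1.0)
print(roc_auc(y_true, scores))                    # 0.75
```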
Misclassification Analysis

Extracted examples where predictions were wrong:


Example:
Actual: Spam
Predicted: Ham
Message:
"Congratulations! You have won..."
Purpose:
Understand limitations and improve future models

Feature Importance

Naive Bayes:
Top spam words by log probability:
call
free
mobile
Random Forest:
Top words by importance:
free
call
text

Visualized in bar charts.

Saving And Deployment

The model and the vectorizer are saved with pickle:

with open("best_model.pkl", "wb") as f:
    pickle.dump(best_model, f)

with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

Benefit:
Load model later without retraining.
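
The save-and-load round trip can be sketched as follows. The dictionary is a stand-in for the fitted model and vectorizer, and the file name spam_artifacts.pkl is hypothetical; any picklable Python object is handled the same way.

```python
import os
import pickle
import tempfile

# Stand-ins for the fitted model and vectorizer (illustrative only).
artifacts = {
    "model": {"name": "demo-model"},
    "vectorizer": {"vocabulary": ["free", "win"]},
}

# Write to a temporary directory so the example leaves the project folder untouched.
path = os.path.join(tempfile.mkdtemp(), "spam_artifacts.pkl")
with open(path, "wb") as f:
    pickle.dump(artifacts, f)

# Later, for example in another session: load without retraining.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == artifacts)  # True
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.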
Sample Prediction
Tested on unseen messages:
| Message | Predicted |
|---|---|
| Free entry in weekly draw... | Spam |
| Hey are you free today? | Ham |

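Function:
predict_sms(message) cleans the message, transforms it with the fitted TF-IDF vectorizer, and returns "Spam" or "Ham" based on the best model's prediction.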

Conclusion
Achievements:
Built SMS classifier.
High accuracy and AUC.
Extracted important features.
Future Improvements:
Use deep learning (RNN, LSTM).

Deploy API.
Handle multilingual spam.
References
scikit-learn Documentation
NLTK Documentation
WordCloud Documentation
SMS Spam Dataset
Code-

Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import re
import nltk
from wordcloud import WordCloud
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    accuracy_score,
    roc_auc_score,
    roc_curve
)

Download NLTK data


nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
Load dataset
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
df = pd.read_csv(url, sep="\t", names=["label", "message"])

Basic info
print("\n=== Head of Dataset ===")
print(df.head())

print("\n=== Class Distribution ===")


print(df['label'].value_counts())

=== Head of Dataset ===


label message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...

=== Class Distribution ===


label
ham 4825
spam 747
Name: count, dtype: int64

Encode target
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})

Preprocessing
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", '', text)
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)

df['cleaned'] = df['message'].apply(clean_text)

print("\n=== Sample Cleaned Messages ===")


print(df[['message', 'cleaned']].head())

=== Sample Cleaned Messages ===


message \
0 Go until jurong point, crazy.. Available only ...
1 Ok lar... Joking wif u oni...
2 Free entry in 2 a wkly comp to win FA Cup fina...
3 U dun say so early hor... U c already then say...
4 Nah I don't think he goes to usf, he lives aro...

cleaned
0 go jurong point crazy available bugis n great ...
1 ok lar joking wif u oni
2 free entry wkly comp win fa cup final tkts st ...
3 u dun say early hor u c already say
4 nah dont think go usf life around though

Feature Engineering
df['msg_length'] = df['message'].apply(len)
df['word_count'] = df['message'].apply(lambda x: len(x.split()))
df['punct_count'] = df['message'].apply(lambda x: sum(1 for char in x if char in
string.punctuation))
df['unique_words'] = df['message'].apply(lambda x: len(set(x.split())))

print("\n=== Engineered Features (first 3 rows) ===")


print(df[['msg_length','word_count','punct_count','unique_words']].head(3))

=== Engineered Features (first 3 rows) ===


msg_length word_count punct_count unique_words
0 111 20 9 20
1 29 6 6 6
2 155 28 6 24

Visualizations
plt.figure(figsize=(12,5))
sns.histplot(data=df, x='msg_length', hue='label', bins=40, kde=True)
plt.title("Message Length Distribution")
plt.show()

Word Clouds
spam_text = ' '.join(df[df['label'] == 'spam']['cleaned'])
ham_text = ' '.join(df[df['label'] == 'ham']['cleaned'])

plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
plt.imshow(WordCloud(width=500, height=300,
background_color='white').generate(spam_text))
plt.axis('off')
plt.title("Spam Words")
plt.subplot(1,2,2)
plt.imshow(WordCloud(width=500, height=300,
background_color='white').generate(ham_text))
plt.axis('off')
plt.title("Ham Words")
plt.tight_layout()
plt.show()

Vectorization
vectorizer = TfidfVectorizer(max_features=3000)
X = vectorizer.fit_transform(df['cleaned'])
y = df['label_num']

Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\n=== Train-Test Split ===")
print(f"Training samples: {X_train.shape[0]}, Test samples: {X_test.shape[0]}")

=== Train-Test Split ===


Training samples: 4457, Test samples: 1115

Models
models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=150, random_state=42)
}

Training & Evaluation


results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:,1])
    results[name] = {'Accuracy': acc, 'AUC': auc}
    print(f"\n=== {name} ===")
    print(f"Accuracy: {acc:.4f}")
    print(f"AUC: {auc:.4f}")
    print(classification_report(y_test, y_pred))

=== Naive Bayes ===
Accuracy: 0.9785
AUC: 0.9794
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       966
           1       1.00      0.84      0.91       149

    accuracy                           0.98      1115
   macro avg       0.99      0.92      0.95      1115
weighted avg       0.98      0.98      0.98      1115

=== Logistic Regression ===
Accuracy: 0.9650
AUC: 0.9866
              precision    recall  f1-score   support

           0       0.96      1.00      0.98       966
           1       0.99      0.74      0.85       149

    accuracy                           0.97      1115
   macro avg       0.98      0.87      0.92      1115
weighted avg       0.97      0.97      0.96      1115

=== Random Forest ===
Accuracy: 0.9803
AUC: 0.9935
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       966
           1       1.00      0.85      0.92       149

    accuracy                           0.98      1115
   macro avg       0.99      0.93      0.95      1115
weighted avg       0.98      0.98      0.98      1115

Cross-validation
print("\n=== Cross-Validation Scores ===")
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name}: Mean = {scores.mean():.4f}, Std = {scores.std():.4f}")

=== Cross-Validation Scores ===


Naive Bayes: Mean = 0.9743, Std = 0.0042
Logistic Regression: Mean = 0.9607, Std = 0.0051
Random Forest: Mean = 0.9743, Std = 0.0046

Confusion Matrix of Best Model


best_model_name = max(results, key=lambda x: results[x]['Accuracy'])
best_model = models[best_model_name]
y_pred_best = best_model.predict(X_test)

cm = confusion_matrix(y_test, y_pred_best)
plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Purples', xticklabels=['Ham','Spam'],
yticklabels=['Ham','Spam'])
plt.title(f"{best_model_name} Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

ROC Curve
fpr, tpr, _ = roc_curve(y_test, best_model.predict_proba(X_test)[:,1])
plt.figure(figsize=(6,4))
plt.plot(fpr, tpr, label=f'{best_model_name} (AUC={results[best_model_name]["AUC"]:.2f})')
plt.plot([0,1],[0,1],'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.grid()
plt.show()
Most informative words
tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
tfidf_df['label_num'] = y.values

spam_means = tfidf_df[tfidf_df['label_num']==1].iloc[:,:-1].mean().sort_values(ascending=False).head(15)
ham_means = tfidf_df[tfidf_df['label_num']==0].iloc[:,:-1].mean().sort_values(ascending=False).head(15)

plt.figure(figsize=(12,5))
spam_means.plot(kind='barh', color='red')
plt.title("Top 15 Spam Words by TF-IDF Score")
plt.gca().invert_yaxis()
plt.show()

plt.figure(figsize=(12,5))
ham_means.plot(kind='barh', color='green')
plt.title("Top 15 Ham Words by TF-IDF Score")
plt.gca().invert_yaxis()
plt.show()
Custom prediction function
def predict_sms(message):
    cleaned = clean_text(message)
    vector = vectorizer.transform([cleaned])
    prediction = best_model.predict(vector)[0]
    return "Spam" if prediction == 1 else "Ham"
Test predictions
test_msgs = [
    "Free entry in 2 a weekly competition! Just text WIN to 80086 now!",
    "Hey, are we still going to class tomorrow?",
    "You've been selected for a cash prize of $5000!",
    "Reminder: your electricity bill is due tomorrow.",
]

print("\n=== Sample Predictions ===")


for msg in test_msgs:
    pred = predict_sms(msg)
    print(f"\nMessage: {msg}\nPrediction: {pred}")

=== Sample Predictions ===

Message: Free entry in 2 a weekly competition! Just text WIN to 80086 now!
Prediction: Spam

Message: Hey, are we still going to class tomorrow?


Prediction: Ham

Message: You've been selected for a cash prize of $5000!


Prediction: Ham

Message: Reminder: your electricity bill is due tomorrow.


Prediction: Ham

Check Missing Values


print("Missing Values:\n", df.isnull().sum())

Missing Values:
label 0
message 0
label_num 0
cleaned 0
msg_length 0
word_count 0
punct_count 0
unique_words 0
dtype: int64

Class Distribution Pie Chart


plt.figure(figsize=(6,6))
df['label'].value_counts().plot.pie(autopct='%1.1f%%', colors=['lightgreen', 'lightcoral'])
plt.title("Class Distribution")
plt.ylabel("")
plt.show()
Correlation Matrix of Engineered Features
## Correlation matrix of numerical features
plt.figure(figsize=(8,6))
sns.heatmap(df[['msg_length', 'word_count', 'punct_count', 'unique_words']].corr(),
annot=True, cmap='Blues')
plt.title("Feature Correlation Matrix")
plt.show()
Visualize Top N-Grams
from sklearn.feature_extraction.text import CountVectorizer

# Bigram visualization
cv = CountVectorizer(ngram_range=(2,2), max_features=20)
X_bigrams = cv.fit_transform(df['cleaned'])
bigrams_df = pd.DataFrame(X_bigrams.toarray(), columns=cv.get_feature_names_out())
bigrams_sum = bigrams_df.sum().sort_values(ascending=False)

plt.figure(figsize=(10,6))
bigrams_sum.plot(kind='bar', color='purple')
plt.title("Top 20 Bigrams")
plt.ylabel("Frequency")
plt.show()
Show Most Informative Features for Naive Bayes
# Most informative features for Naive Bayes
feature_names = vectorizer.get_feature_names_out()
log_probs = models["Naive Bayes"].feature_log_prob_
top_spam_idx = np.argsort(log_probs[1])[::-1][:15]
top_spam_words = feature_names[top_spam_idx]
print("Top 15 Spam Predictive Words (Naive Bayes):\n", top_spam_words)

Top 15 Spam Predictive Words (Naive Bayes):


['call' 'free' 'mobile' 'text' 'txt' 'claim' 'stop' 'reply' 'ur' 'prize'
'service' 'new' 'tone' 'urgent' 'cash']

Feature Importance for Random Forest


# Feature importance
importances = models["Random Forest"].feature_importances_
indices = np.argsort(importances)[-15:]

plt.figure(figsize=(10,6))
plt.barh(range(len(indices)), importances[indices], color='teal')
plt.yticks(range(len(indices)), [vectorizer.get_feature_names_out()[i] for i in indices])
plt.title("Top 15 Feature Importances (Random Forest)")
plt.xlabel("Importance Score")
plt.show()
Model Comparison Bar Chart
# Compare model accuracy and AUC visually
acc_values = [results[m]['Accuracy'] for m in results]
auc_values = [results[m]['AUC'] for m in results]

x = np.arange(len(results))
width = 0.35

plt.figure(figsize=(8,5))
plt.bar(x - width/2, acc_values, width, label='Accuracy')
plt.bar(x + width/2, auc_values, width, label='AUC')
plt.xticks(x, results.keys())
plt.ylabel("Score")
plt.title("Model Performance Comparison")
plt.legend()
plt.show()
K-Fold Cross-Validation ROC
# ROC curve via cross-validation for best model
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

plt.figure(figsize=(8,6))
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), 1):
    # Fit a fresh copy per fold so best_model keeps its original training
    fold_model = clone(best_model)
    fold_model.fit(X[train_idx], y.iloc[train_idx])
    probs = fold_model.predict_proba(X[test_idx])[:,1]
    fpr, tpr, _ = roc_curve(y.iloc[test_idx], probs)
    plt.plot(fpr, tpr, label=f"Fold {fold}")

plt.plot([0,1], [0,1], 'k--')


plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Cross-Validated ROC Curves")
plt.legend()
plt.grid()
plt.show()
Misclassified Examples
# Show misclassified examples
misclassified_idx = np.where(y_test != y_pred_best)[0]
print(f"Total Misclassified: {len(misclassified_idx)}\n")

for idx in misclassified_idx[:5]: # Show first 5
    print(f"Actual: {'Spam' if y_test.iloc[idx] else 'Ham'}")
    print(f"Predicted: {'Spam' if y_pred_best[idx] else 'Ham'}")
    print(f"Message: {df.iloc[y_test.index[idx]]['message']}\n")

Total Misclassified: 22

Actual: Spam
Predicted: Ham
Message: Reminder: You have not downloaded the content you have already paid for. Goto
http://doit. mymoby. tv/ to collect your content.

Actual: Spam
Predicted: Ham
Message: Oh my god! I've found your number again! I'm so glad, text me back xafter this
msgs cst std ntwk chg £1.50

Actual: Spam
Predicted: Ham
Message: Your next amazing xxx PICSFREE1 video will be sent to you enjoy! If one vid is not
enough for 2day text back the keyword PICSFREE1 to get the next video.

Actual: Spam
Predicted: Ham
Message: Rock yr chik. Get 100's of filthy films &XXX pics on yr phone now. rply FILTH to
69669. Saristar Ltd, E14 9YT 08701752560. 450p per 5 days. Stop2 cancel

Actual: Spam
Predicted: Ham
Message: Babe: U want me dont u baby! Im nasty and have a thing 4 filthyguys. Fancy a
rude time with a sexy bitch. How about we go slo n hard! Txt XXX SLO(4msgs)

Confusion Matrix Normalized


# Normalized confusion matrix
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
plt.figure(figsize=(5,4))
sns.heatmap(cm_norm, annot=True, fmt='.2f', cmap='Blues', xticklabels=['Ham', 'Spam'],
yticklabels=['Ham', 'Spam'])
plt.title(f"{best_model_name} Normalized Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
Save the Model
# Save model and vectorizer
import pickle

with open("best_model.pkl", "wb") as f:
    pickle.dump(best_model, f)

with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

print("Best model and vectorizer saved successfully!")

Best model and vectorizer saved successfully!
