SMS Spam Using Machine Learning
A project report submitted in training program on
Artificial Intelligence – Machine Learning
by
Rahul Sharma
23071003516
(2023-2026)
Dr. Virendra Swarup Institute of Computer Studies
CANDIDATE’S DECLARATION
I, Rahul Sharma, hereby certify that the work presented in this report,
titled SMS Spam Using Machine Learning, submitted in partial
fulfillment of the requirements for the training programme on Artificial
Intelligence – Machine Learning, is an authentic record of my own
work. I have also duly cited all references for any data, text(s), figure(s),
table(s), or equation(s) that have been taken from other sources.
Date: Signature of the Candidate
ABSTRACT
This work implements and compares machine learning classifiers that label SMS messages as
ham (not spam) or spam. The dataset contains 5,572 labelled messages (4,825 ham and 747 spam).
Each message is lowercased, stripped of URLs, digits, and punctuation, filtered for stopwords,
and lemmatized, then converted to a TF-IDF representation limited to 3,000 features. Three
classifiers are trained on an 80/20 train-test split: Multinomial Naive Bayes, Logistic
Regression, and Random Forest. On the test set, Random Forest gives the highest accuracy
(0.9803) and AUC (0.9935), compared with Naive Bayes (accuracy 0.9785, AUC 0.9794) and
Logistic Regression (accuracy 0.9650, AUC 0.9866). Five-fold cross-validation gives mean
accuracies between 0.9607 and 0.9743. The best model and the vectorizer are saved with pickle
so that new messages can be classified without retraining.
Keywords: SMS Spam, Natural Language Processing, TF-IDF, Naive Bayes, Logistic Regression, Random Forest.
Table of Contents
1. Introduction
2. Objective
3. Libraries and Tools Used
4. Data Collection
5. Data Preprocessing
6. Exploratory Data Analysis (EDA)
7. Feature Engineering
8. Text Vectorization
9. Model Building
10. Model Evaluation
11. Cross Validation
12. Model Comparison
13. ROC Curve Analysis
14. Misclassification Analysis
15. Feature Importance
16. Saving and Deployment
17. Sample Prediction
18. Conclusion
19. References
Introduction
Spam messages are unsolicited texts sent for advertising,
phishing, or fraud. They are a major concern for users and
companies alike.
This project focuses on automatically classifying SMS
messages as either:
1. Ham (Not Spam)
2. Spam
We will use:
1. Natural Language Processing (NLP)
2. Machine Learning classifiers
Why this problem is important:
1. Improves user safety
2. Filters unwanted communication
3. Reduces chances of scams
Objective
The main goals of this project are:
1. Collect and explore SMS data.
2. Clean and preprocess text for modeling.
3. Extract relevant features.
4. Transform text to numeric representation.
5. Train multiple machine learning models.
6. Evaluate and compare models.
7. Identify the best performing model.
8. Make predictions on new messages.
9. Save the trained model for later use.
Libraries And Tools Used
The following Python libraries are used:
Data Manipulation
pandas: Load and manipulate data tables.
numpy: Numerical computations.
Visualization
matplotlib: Plot graphs.
seaborn: Advanced visualization.
WordCloud: Visual word frequencies.
NLP
nltk: Preprocessing text.
Stopword removal
Lemmatization
Machine Learning
Scikit-learn: ML models and utilities.
Naive Bayes
Logistic Regression
Random Forest
Train-test splitting
Cross-validation
TF-IDF Vectorizer
Utilities
string, re: Punctuation constants and regular expressions for cleaning text.
pickle: Save the trained model and vectorizer.
Data Collection
Dataset Source:
https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv
Loading:
df = pd.read_csv(url, sep="\t", names=["label", "message"])
Data Columns:
label: ham or spam
message: actual SMS text
Sample Records:
| Label | Message |
|---|---|
| ham | Go until jurong point, crazy... |
| spam | Free entry in 2 weekly competition... |
Class Distribution:
Ham: 4825 messages
Spam: 747 messages
This is an imbalanced dataset (about 13% spam), so accuracy alone is not sufficient;
AUC and the confusion matrix are also important.
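To see why accuracy alone can mislead here, consider a hypothetical baseline that labels every message as ham. The sketch below uses only the class counts listed above; the "always ham" classifier is an illustration and is not one of the models trained in this report.

```python
# Class counts reported in the Class Distribution above.
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

# Hypothetical baseline: label every message as ham.
baseline_accuracy = ham_count / total
baseline_spam_recall = 0 / spam_count  # no spam message is ever caught

print(f"Total messages: {total}")
print(f"Spam share: {spam_count / total:.4f}")
print(f"Always-ham accuracy: {baseline_accuracy:.4f}")
print(f"Always-ham spam recall: {baseline_spam_recall:.4f}")
```

The baseline reaches roughly 87% accuracy while catching no spam at all, which is why recall and AUC are reported alongside accuracy.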
Data Preprocessing
Text preprocessing ensures the model sees clean,
normalized text:
Steps:
1. Convert to lowercase.
2. Remove URLs.
3. Remove digits.
4. Remove punctuation.
5. Remove stopwords.
6. Lemmatize words.
Why this is important:
Reduces noise.
Standardizes vocabulary.
Improves model accuracy.
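The steps above can be sketched with the standard library alone. This is a simplified illustration rather than the report's exact routine: `SMALL_STOPWORDS` is a tiny hypothetical stand-in for the NLTK English stopword list, `basic_clean` is a name introduced here, and lemmatization (step 6) is left out because it needs the WordNet data.

```python
import re
import string

# Hypothetical, deliberately tiny stopword set (the report uses NLTK's English list).
SMALL_STOPWORDS = {"a", "an", "the", "to", "in", "is", "you", "your", "for", "at"}

def basic_clean(text):
    """Apply steps 1-5: lowercase, drop URLs, digits, punctuation, and stopwords."""
    text = text.lower()                                   # 1. lowercase
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # 2. remove URLs
    text = re.sub(r"\d+", " ", text)                      # 3. remove digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # 4. punctuation
    words = [w for w in text.split() if w not in SMALL_STOPWORDS]     # 5. stopwords
    return " ".join(words)

example = "WIN a FREE prize!!! Call 80086 now or visit http://example.com today."
cleaned_example = basic_clean(example)
print(cleaned_example)
```

URLs are removed before punctuation so that the dots and slashes inside a link do not leave fragments behind.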
Exploratory Data Analysis (EDA)
EDA helps understand the dataset before modeling.
Visualizations:
Message Length Distribution: Spam messages are often longer.
Word Count Distribution: Highlights difference in verbosity.
Pie Chart:
Shows class imbalance.
Correlation Matrix:
Shows the relationship between message length, word count, punctuation count, and
unique word count.
WordClouds:
Spam WordCloud: shows words like free, win, claim.
Ham WordCloud: words related to casual conversation.
Insights:
Spam tends to have promotional keywords.
Spam tends to use more punctuation and longer text.
Feature Engineering
We create numerical features that improve predictive
power:
Features Added:
msg_length: Total character count.
word_count: Number of words.
punct_count: Number of punctuation marks.
unique_words: Number of unique words.
Why this helps:
Spam often uses excessive length and punctuation.
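The four features can be computed for a single message as in the sketch below, which uses only the standard library. The function name `message_features` and the sample text are hypothetical and introduced only for illustration.

```python
import string

def message_features(message):
    """Return the four engineered features for one raw SMS message."""
    words = message.split()
    return {
        "msg_length": len(message),                  # total characters
        "word_count": len(words),                    # whitespace-separated tokens
        "punct_count": sum(1 for ch in message if ch in string.punctuation),
        "unique_words": len(set(words)),             # case-sensitive distinct tokens
    }

sample = "Free entry!!! Text WIN to win a free prize."
features = message_features(sample)
print(features)
```

Because the counts are taken on the raw text, `Free` and `free` count as different words; computing `unique_words` on lowercased text would be an alternative design.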
Text Vectorization
To train ML models, we convert text into numeric
vectors using TF-IDF (Term Frequency-Inverse
Document Frequency):
Steps:
1. Initialize vectorizer:
vectorizer = TfidfVectorizer(max_features=3000)
2. Fit and transform cleaned text:
X = vectorizer.fit_transform(df['cleaned'])
Output:
A sparse matrix with one row per message and at most 3000 columns, one per retained term.
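To make the weighting concrete, the sketch below computes TF-IDF by hand for a toy corpus. It assumes the smoothed formula that scikit-learn documents as the TfidfVectorizer default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The three toy documents and the helper `tfidf_row` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-cleaned messages.
docs = ["free prize call", "call me later", "free free entry"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each term.
df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}
# Smoothed IDF (assumed to match scikit-learn's default smooth_idf=True).
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Raw term counts times IDF, then L2-normalized."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
for doc, row in zip(docs, matrix):
    print(doc, [round(v, 3) for v in row])
```

In the first toy document, "prize" receives a higher weight than "free" because it appears in fewer documents, which is the effect TF-IDF is meant to capture.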
Model Building
We use 3 classifiers:
Naive Bayes:
Probabilistic model good for text.
Logistic Regression:
Linear classifier.
Random Forest:
Ensemble of decision trees.
Train-Test Split:
80% training
20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Model Evaluation
For each model:
1. Fit on training data.
2. Predict on test data.
3. Calculate:
Accuracy
AUC
Confusion Matrix
Classification Report
Metrics Meaning:
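Accuracy: Proportion of correct predictions.
AUC: Area under the ROC curve; measures how well the model ranks spam above ham across all thresholds.
Precision/Recall/F1: Per-class metrics that are more informative than accuracy on imbalanced data.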
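As a worked example, the metrics can be computed directly from confusion-matrix counts. The counts below are inferred from the Random Forest results printed in the code output (966 ham and 149 spam test messages, 22 misclassifications, spam precision 1.00, spam recall 0.85); they are a reconstruction under those assumptions, not a printed confusion matrix.

```python
# Confusion-matrix counts inferred from the reported Random Forest results
# (an assumption consistent with the printed report, not a printed matrix).
tn, fp = 966, 0    # actual ham:  predicted ham, predicted spam
fn, tp = 22, 127   # actual spam: predicted ham, predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of messages flagged as spam, how many are spam
recall = tp / (tp + fn)      # of actual spam messages, how many are caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1-score:  {f1:.2f}")
```

With these counts the accuracy is about 98%, yet roughly 15% of spam messages still get through, which only the recall figure reveals.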
Cross Validation
To ensure model stability, we used 5-fold cross-validation:
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
Reported:
Mean Accuracy
Standard Deviation
Why:
Gives a more reliable performance estimate than a single train-test split and helps reveal overfitting.
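The idea behind k-fold splitting can be sketched in plain Python: every sample is used for testing exactly once and for training in the other folds. The helper `k_fold_indices` is hypothetical and produces simple contiguous folds; for classifiers, scikit-learn's `cross_val_score` with an integer `cv` uses stratified folds instead, which also preserve the ham/spam ratio in each fold.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# 5572 is the dataset size (4825 ham + 747 spam).
folds = list(k_fold_indices(5572, k=5))
for i, (train_idx, test_idx) in enumerate(folds, 1):
    print(f"Fold {i}: train={len(train_idx)}, test={len(test_idx)}")
```

The mean and standard deviation reported for each model are then taken over the five per-fold accuracy scores.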
Model Comparison
Bar chart comparing:
Accuracy
AUC
Observation:
Random Forest had the highest accuracy (0.9803) and AUC (0.9935).
Naive Bayes is fast and came close in accuracy (0.9785) but had the lowest AUC (0.9794).
ROC Curve Analysis
Plotted ROC curve:
X-axis: False Positive Rate
Y-axis: True Positive Rate
Interpretation:
A curve closer to the top-left corner indicates a better model.
Cross-validated ROC curves were similar across folds, indicating stable performance.
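The construction of an ROC curve and its AUC can be sketched in plain Python. The labels and scores below are hypothetical, and `roc_points` and `auc_trapezoid` are illustrative helpers, not the scikit-learn functions used in this report.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs obtained by sweeping the threshold over each distinct score."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical labels (1 = spam) and predicted spam probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

points = roc_points(y_true, scores)
auc = auc_trapezoid(points)
print(points)
print(f"AUC = {auc:.3f}")
```

The AUC equals the fraction of (spam, ham) pairs in which the spam message receives the higher score; in this toy data that is 8 of 9 pairs.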
Misclassification Analysis
Extracted examples where predictions were wrong:
Example:
Actual: Spam
Predicted: Ham
Message:
"Congratulations! You have won..."
Purpose:
Understand limitations and improve future models. In the code output, the first five
misclassified messages are all spam predicted as ham.
Feature Importance
Naive Bayes:
Top words by log probability:
call
free
mobile
Random Forest:
Top words by importance:
free
call
text
Visualized in bar charts.
Saving And Deployment
Saved model and vectorizer using pickle:
with open("best_model.pkl", "wb") as f:
    pickle.dump(best_model, f)
Benefit:
Load model later without retraining.
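Loading is the mirror image of saving. The sketch below shows the round trip with a stand-in object and a temporary directory so that it runs on its own; with the real files, the same `pickle.load` call would be applied to the saved model and vectorizer files. The `stand_in_model` dictionary is hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for the trained objects; any picklable Python object behaves the same way.
stand_in_model = {"name": "Random Forest", "n_estimators": 150}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_model.pkl")

    # Save, as in the report.
    with open(path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Load later without retraining.
    with open(path, "rb") as f:
        loaded_model = pickle.load(f)

print(loaded_model == stand_in_model)
```

The vectorizer has to be loaded together with the model, since new messages must be transformed with the same fitted vocabulary. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.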
Sample Prediction
Tested on unseen messages:
| Message | Predicted |
|---|---|
| Free entry in weekly draw... | Spam |
| Hey are you free today? | Ham |
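Function:
predict_sms(message) cleans the message with clean_text, transforms it with the fitted
vectorizer, and returns "Spam" or "Ham" according to the best model's prediction.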
Conclusion
Achievements:
Built an SMS spam classifier.
Achieved high accuracy and AUC (Random Forest: 0.9803 accuracy and 0.9935 AUC on the test set).
Extracted important features.
Future Improvements:
Use deep learning (RNN, LSTM).
Deploy the model as an API.
Handle multilingual spam.
References
scikit-learn Documentation
NLTK Documentation
WordCloud Documentation
SMS Spam Dataset
Code-
Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import re
import nltk
from wordcloud import WordCloud
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    accuracy_score,
    roc_auc_score,
    roc_curve,
)
Download NLTK data
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
Load dataset
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
df = pd.read_csv(url, sep="\t", names=["label", "message"])
Basic info
print("\n=== Head of Dataset ===")
print(df.head())
print("\n=== Class Distribution ===")
print(df['label'].value_counts())
=== Head of Dataset ===
label message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
=== Class Distribution ===
label
ham 4825
spam 747
Name: count, dtype: int64
Encode target
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})
Preprocessing
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
def clean_text(text):
    text = text.lower()                                   # lowercase
    text = re.sub(r"http\S+|www\S+|https\S+", '', text)   # remove URLs
    text = re.sub(r'\d+', '', text)                       # remove digits
    text = re.sub(r'[^\w\s]', '', text)                   # remove punctuation
    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)
df['cleaned'] = df['message'].apply(clean_text)
print("\n=== Sample Cleaned Messages ===")
print(df[['message', 'cleaned']].head())
=== Sample Cleaned Messages ===
message \
0 Go until jurong point, crazy.. Available only ...
1 Ok lar... Joking wif u oni...
2 Free entry in 2 a wkly comp to win FA Cup fina...
3 U dun say so early hor... U c already then say...
4 Nah I don't think he goes to usf, he lives aro...
cleaned
0 go jurong point crazy available bugis n great ...
1 ok lar joking wif u oni
2 free entry wkly comp win fa cup final tkts st ...
3 u dun say early hor u c already say
4 nah dont think go usf life around though
Feature Engineering
df['msg_length'] = df['message'].apply(len)
df['word_count'] = df['message'].apply(lambda x: len(x.split()))
df['punct_count'] = df['message'].apply(
    lambda x: sum(1 for char in x if char in string.punctuation))
df['unique_words'] = df['message'].apply(lambda x: len(set(x.split())))
print("\n=== Engineered Features (first 3 rows) ===")
print(df[['msg_length','word_count','punct_count','unique_words']].head(3))
=== Engineered Features (first 3 rows) ===
msg_length word_count punct_count unique_words
0 111 20 9 20
1 29 6 6 6
2 155 28 6 24
Visualizations
plt.figure(figsize=(12,5))
sns.histplot(data=df, x='msg_length', hue='label', bins=40, kde=True)
plt.title("Message Length Distribution")
plt.show()
Word Clouds
spam_text = ' '.join(df[df['label'] == 'spam']['cleaned'])
ham_text = ' '.join(df[df['label'] == 'ham']['cleaned'])
plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
plt.imshow(WordCloud(width=500, height=300,
                     background_color='white').generate(spam_text))
plt.axis('off')
plt.title("Spam Words")
plt.subplot(1,2,2)
plt.imshow(WordCloud(width=500, height=300,
                     background_color='white').generate(ham_text))
plt.axis('off')
plt.title("Ham Words")
plt.tight_layout()
plt.show()
Vectorization
vectorizer = TfidfVectorizer(max_features=3000)
X = vectorizer.fit_transform(df['cleaned'])
y = df['label_num']
Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("\n=== Train-Test Split ===")
print(f"Training samples: {X_train.shape[0]}, Test samples: {X_test.shape[0]}")
=== Train-Test Split ===
Training samples: 4457, Test samples: 1115
Models
models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=150, random_state=42)
}
Training & Evaluation
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:,1])
    results[name] = {'Accuracy': acc, 'AUC': auc}
    print(f"\n=== {name} ===")
    print(f"Accuracy: {acc:.4f}")
    print(f"AUC: {auc:.4f}")
    print(classification_report(y_test, y_pred))
=== Naive Bayes ===
Accuracy: 0.9785
AUC: 0.9794
              precision    recall  f1-score   support
           0       0.98      1.00      0.99       966
           1       1.00      0.84      0.91       149
    accuracy                           0.98      1115
   macro avg       0.99      0.92      0.95      1115
weighted avg       0.98      0.98      0.98      1115
=== Logistic Regression ===
Accuracy: 0.9650
AUC: 0.9866
              precision    recall  f1-score   support
           0       0.96      1.00      0.98       966
           1       0.99      0.74      0.85       149
    accuracy                           0.97      1115
   macro avg       0.98      0.87      0.92      1115
weighted avg       0.97      0.97      0.96      1115
=== Random Forest ===
Accuracy: 0.9803
AUC: 0.9935
              precision    recall  f1-score   support
           0       0.98      1.00      0.99       966
           1       1.00      0.85      0.92       149
    accuracy                           0.98      1115
   macro avg       0.99      0.93      0.95      1115
weighted avg       0.98      0.98      0.98      1115
Cross-validation
print("\n=== Cross-Validation Scores ===")
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name}: Mean = {scores.mean():.4f}, Std = {scores.std():.4f}")
=== Cross-Validation Scores ===
Naive Bayes: Mean = 0.9743, Std = 0.0042
Logistic Regression: Mean = 0.9607, Std = 0.0051
Random Forest: Mean = 0.9743, Std = 0.0046
Confusion Matrix of Best Model
best_model_name = max(results, key=lambda x: results[x]['Accuracy'])
best_model = models[best_model_name]
y_pred_best = best_model.predict(X_test)
cm = confusion_matrix(y_test, y_pred_best)
plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Purples',
            xticklabels=['Ham','Spam'], yticklabels=['Ham','Spam'])
plt.title(f"{best_model_name} Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
ROC Curve
fpr, tpr, _ = roc_curve(y_test, best_model.predict_proba(X_test)[:,1])
plt.figure(figsize=(6,4))
plt.plot(fpr, tpr, label=f'{best_model_name} (AUC={results[best_model_name]["AUC"]:.2f})')
plt.plot([0,1],[0,1],'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.grid()
plt.show()
Most informative words
tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
tfidf_df['label_num'] = y.values
spam_means = tfidf_df[tfidf_df['label_num']==1].iloc[:,:-1].mean().sort_values(ascending=False).head(15)
ham_means = tfidf_df[tfidf_df['label_num']==0].iloc[:,:-1].mean().sort_values(ascending=False).head(15)
plt.figure(figsize=(12,5))
spam_means.plot(kind='barh', color='red')
plt.title("Top 15 Spam Words by TF-IDF Score")
plt.gca().invert_yaxis()
plt.show()
plt.figure(figsize=(12,5))
ham_means.plot(kind='barh', color='green')
plt.title("Top 15 Ham Words by TF-IDF Score")
plt.gca().invert_yaxis()
plt.show()
Custom prediction function
def predict_sms(message):
    cleaned = clean_text(message)
    vector = vectorizer.transform([cleaned])
    prediction = best_model.predict(vector)[0]
    return "Spam" if prediction == 1 else "Ham"
Test predictions
test_msgs = [
    "Free entry in 2 a weekly competition! Just text WIN to 80086 now!",
    "Hey, are we still going to class tomorrow?",
    "You've been selected for a cash prize of $5000!",
    "Reminder: your electricity bill is due tomorrow.",
]
print("\n=== Sample Predictions ===")
for msg in test_msgs:
    pred = predict_sms(msg)
    print(f"\nMessage: {msg}\nPrediction: {pred}")
=== Sample Predictions ===
Message: Free entry in 2 a weekly competition! Just text WIN to 80086 now!
Prediction: Spam
Message: Hey, are we still going to class tomorrow?
Prediction: Ham
Message: You've been selected for a cash prize of $5000!
Prediction: Ham
Message: Reminder: your electricity bill is due tomorrow.
Prediction: Ham
Check Missing Values
print("Missing Values:\n", df.isnull().sum())
Missing Values:
label 0
message 0
label_num 0
cleaned 0
msg_length 0
word_count 0
punct_count 0
unique_words 0
dtype: int64
Class Distribution Pie Chart
plt.figure(figsize=(6,6))
df['label'].value_counts().plot.pie(autopct='%1.1f%%', colors=['lightgreen', 'lightcoral'])
plt.title("Class Distribution")
plt.ylabel("")
plt.show()
Correlation Matrix of Engineered Features
# Correlation matrix of numerical features
plt.figure(figsize=(8,6))
sns.heatmap(df[['msg_length', 'word_count', 'punct_count', 'unique_words']].corr(),
            annot=True, cmap='Blues')
plt.title("Feature Correlation Matrix")
plt.show()
Visualize Top N-Grams
from sklearn.feature_extraction.text import CountVectorizer
# Bigram visualization
cv = CountVectorizer(ngram_range=(2,2), max_features=20)
X_bigrams = cv.fit_transform(df['cleaned'])
bigrams_df = pd.DataFrame(X_bigrams.toarray(), columns=cv.get_feature_names_out())
bigrams_sum = bigrams_df.sum().sort_values(ascending=False)
plt.figure(figsize=(10,6))
bigrams_sum.plot(kind='bar', color='purple')
plt.title("Top 20 Bigrams")
plt.ylabel("Frequency")
plt.show()
Show Most Informative Features for Naive Bayes
# Most informative features for Naive Bayes
feature_names = vectorizer.get_feature_names_out()
log_probs = models["Naive Bayes"].feature_log_prob_
top_spam_idx = np.argsort(log_probs[1])[::-1][:15]
top_spam_words = feature_names[top_spam_idx]
print("Top 15 Spam Predictive Words (Naive Bayes):\n", top_spam_words)
Top 15 Spam Predictive Words (Naive Bayes):
['call' 'free' 'mobile' 'text' 'txt' 'claim' 'stop' 'reply' 'ur' 'prize'
'service' 'new' 'tone' 'urgent' 'cash']
Feature Importance for Random Forest
# Feature importance
importances = models["Random Forest"].feature_importances_
indices = np.argsort(importances)[-15:]
plt.figure(figsize=(10,6))
plt.barh(range(len(indices)), importances[indices], color='teal')
plt.yticks(range(len(indices)), [vectorizer.get_feature_names_out()[i] for i in indices])
plt.title("Top 15 Feature Importances (Random Forest)")
plt.xlabel("Importance Score")
plt.show()
Model Comparison Bar Chart
# Compare model accuracy and AUC visually
acc_values = [results[m]['Accuracy'] for m in results]
auc_values = [results[m]['AUC'] for m in results]
x = np.arange(len(results))
width = 0.35
plt.figure(figsize=(8,5))
plt.bar(x - width/2, acc_values, width, label='Accuracy')
plt.bar(x + width/2, auc_values, width, label='AUC')
plt.xticks(x, results.keys())
plt.ylabel("Score")
plt.title("Model Performance Comparison")
plt.legend()
plt.show()
K-Fold Cross-Validation ROC
# ROC curve via cross-validation for best model
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
plt.figure(figsize=(8,6))
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), 1):
    # Fit a fresh copy per fold so the already-trained best_model is not overwritten
    fold_model = clone(best_model)
    fold_model.fit(X[train_idx], y.iloc[train_idx])
    probs = fold_model.predict_proba(X[test_idx])[:,1]
    fpr, tpr, _ = roc_curve(y.iloc[test_idx], probs)
    plt.plot(fpr, tpr, label=f"Fold {fold}")
plt.plot([0,1], [0,1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Cross-Validated ROC Curves")
plt.legend()
plt.grid()
plt.show()
Misclassified Examples
# Show misclassified examples
misclassified_idx = np.where(y_test != y_pred_best)[0]
print(f"Total Misclassified: {len(misclassified_idx)}\n")
for idx in misclassified_idx[:5]:  # Show first 5
    print(f"Actual: {'Spam' if y_test.iloc[idx] else 'Ham'}")
    print(f"Predicted: {'Spam' if y_pred_best[idx] else 'Ham'}")
    # Look up the original message by index label rather than by position
    print(f"Message: {df.loc[y_test.index[idx], 'message']}\n")
Total Misclassified: 22
Actual: Spam
Predicted: Ham
Message: Reminder: You have not downloaded the content you have already paid for. Goto
http://doit. mymoby. tv/ to collect your content.
Actual: Spam
Predicted: Ham
Message: Oh my god! I've found your number again! I'm so glad, text me back xafter this
msgs cst std ntwk chg £1.50
Actual: Spam
Predicted: Ham
Message: Your next amazing xxx PICSFREE1 video will be sent to you enjoy! If one vid is not
enough for 2day text back the keyword PICSFREE1 to get the next video.
Actual: Spam
Predicted: Ham
Message: Rock yr chik. Get 100's of filthy films &XXX pics on yr phone now. rply FILTH to
69669. Saristar Ltd, E14 9YT 08701752560. 450p per 5 days. Stop2 cancel
Actual: Spam
Predicted: Ham
Message: Babe: U want me dont u baby! Im nasty and have a thing 4 filthyguys. Fancy a
rude time with a sexy bitch. How about we go slo n hard! Txt XXX SLO(4msgs)
Confusion Matrix Normalized
# Normalized confusion matrix
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
plt.figure(figsize=(5,4))
sns.heatmap(cm_norm, annot=True, fmt='.2f', cmap='Blues',
            xticklabels=['Ham', 'Spam'], yticklabels=['Ham', 'Spam'])
plt.title(f"{best_model_name} Normalized Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
Save the Model
# Save model and vectorizer
import pickle
with open("best_model.pkl", "wb") as f:
    pickle.dump(best_model, f)
with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)
print("Best model and vectorizer saved successfully!")
Best model and vectorizer saved successfully!