
MICRO PROJECT REPORT

ADVANCED MACHINE LEARNING


SMS Spam Detection
TABLE OF CONTENTS

ABSTRACT

INTRODUCTION

MULTINOMIAL NB

IMPLEMENTATION

RESULTS AND DISCUSSION

CONCLUSION

REFERENCES
ABSTRACT

In today's digital landscape, mobile phones serve as an integral communication tool, receiving a multitude of messages regularly. However, amidst this influx, a significant portion comprises spam messages, posing threats and fraudulent attempts to extract sensitive personal information. The prevalence of these deceptive tactics underscores the necessity for robust spam detection systems. This project focuses on leveraging advanced deep learning techniques within the TensorFlow framework to develop robust models for SMS spam detection. Using the SMS Spam Detection Dataset, a comprehensive repository of SMS texts paired with labels marking each message as 'ham' (authentic) or 'spam', the project first constructs a foundational model. Subsequently, through the exploration of deep learning architectures such as embeddings, LSTMs, and more, the main objective is to surpass the performance of that baseline model. One of the primary methodologies involves the MultinomialNB() classifier, renowned for its efficacy in text classification tasks. This project endeavors not only to enhance the efficiency of spam detection but also to contribute vital insights into the comparative performance of diverse deep learning models. Ultimately, the proposed methodology aims to boost mobile security and empower users with heightened vigilance against malicious SMS communications.
INTRODUCTION

In an era where mobile phones serve as indispensable companions in our daily lives, the deluge of incoming messages, a large share of which is spam, has become an escalating concern. Think about those texts or emails trying to trick you into sharing personal information, like passwords or bank details. Scammers are adept at making these messages look genuine, which can put our important accounts at risk.

To tackle this problem, a project was implemented to make our phones safer using machine learning tools. Here, TensorFlow is used to build models that can spot the bad messages among the good ones. A dataset with messages labeled as either normal ('ham') or spam is used. The goal is to build a good baseline model and then improve on it using ML algorithms.

The work focuses on different types of models, from those that understand words well to those that learn sequential patterns in messages, to make them better at telling the real messages from the spam ones.

One key component is MultinomialNB(), Multinomial Naive Bayes, a type of machine learning algorithm often used for text classification tasks, especially when dealing with features that represent counts or frequencies, like word counts in documents. This helps identify which messages might be spam.

The aim is not just to make better tools for spotting spam, but also to help people understand how well these different tools work. Ultimately, the project aims to give people better ways to protect themselves from these deceptive messages and make our inboxes a safer place.

MULTINOMIAL NB

Multinomial Naive Bayes (MultinomialNB) operates as a statistical classifier by assessing the frequency of words within messages, distinguishing between spam and legitimate content based solely on word occurrence, without considering word order or relationships. In contrast, alternative models such as custom vector embeddings, LSTMs (Long Short-Term Memory), bidirectional LSTMs, and USE (Universal Sentence Encoder) transfer learning offer distinct methodologies for spam detection.
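To make the word-frequency idea concrete, here is a minimal sketch of the multinomial Naive Bayes decision rule (an illustration only, not the project's code; scikit-learn's MultinomialNB implements this scoring with smoothing):

import math
from collections import Counter

def nb_log_score(words, log_prior, log_likelihood, unseen=math.log(1e-6)):
    # score(class) = log P(class) + sum over words of count(w) * log P(w | class)
    counts = Counter(words)
    return log_prior + sum(c * log_likelihood.get(w, unseen)
                           for w, c in counts.items())

The class with the higher score wins; note that only word counts enter the computation, never word order.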

Custom vector embeddings, like word2vec or GloVe, possess semantic understanding, enabling them to grasp word meanings and associations, enhancing their comprehension of message content beyond word frequency analysis.
LSTMs and bidirectional LSTMs, designed with memory capabilities, excel
in understanding sequential data by considering the order of words. This
contextual understanding aids in discerning nuanced variations in message
intent.

In contrast, USE transfer learning draws from extensive pre-training on diverse textual data, leveraging this broad linguistic knowledge to swiftly interpret and classify new messages based on patterns learned from previous texts.

While MultinomialNB relies on word frequency, these alternative models offer specialized strengths in semantic comprehension, contextual analysis, and broader linguistic insights, providing a more nuanced and comprehensive approach to detecting deceptive content within messages.
IMPLEMENTATION

Initially, import all the required libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

Load the dataset using the pandas function .read_csv():

df = pd.read_csv("/content/spam.csv",encoding='latin-1')
df.head()

Output:
The dataset contains three unnamed columns with null values, so drop those columns and rename the columns v1 and v2 to label and Text, respectively. Since the target variable is in string form, encode it numerically using the pandas .map() function.

df = df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
df = df.rename(columns={'v1': 'label', 'v2': 'Text'})
df['label_enc'] = df['label'].map({'ham': 0, 'spam': 1})
df.head()

Output after the above data preprocessing:

Now, visualize the distribution of Ham and Spam data:

sns.countplot(x=df['label'])
plt.show()

Output:
There is considerably more ham data than spam data. Since the deep learning models here use embeddings, the data does not need to be balanced.
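A quick check of the exact class counts (the numbers depend on the CSV used):

print(df['label'].value_counts())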

Now, split the data into training and testing parts using the train_test_split() function.

# Splitting data for Training and testing
from sklearn.model_selection import train_test_split

X, y = np.asanyarray(df['Text']), np.asanyarray(df['label_enc'])
new_df = pd.DataFrame({'Text': X, 'label': y})
X_train, X_test, y_train, y_test = train_test_split(
    new_df['Text'], new_df['label'], test_size=0.2, random_state=42)
X_train.shape, y_train.shape, X_test.shape, y_test.shape
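For reference, with the widely used 5,572-row version of this dataset, the 80/20 split yields about 4,457 training and 1,115 test messages.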
Building the models:

First, build a baseline model and then try to beat its performance using deep learning models (embeddings, LSTM, etc.).

Here, MultinomialNB() is used, which performs well for text classification when the features are discrete, like word counts or tf-idf vectors. The tf-idf score is a measure of how important or relevant a word is to a document.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

tfidf_vec = TfidfVectorizer().fit(X_train)
X_train_vec, X_test_vec = tfidf_vec.transform(X_train), tfidf_vec.transform(X_test)

baseline_model = MultinomialNB()
baseline_model.fit(X_train_vec, y_train)
Confusion matrix for the baseline model:
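The confusion matrix appears as a figure in the original report; a minimal sketch that reproduces it, assuming the variables defined above:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Predict on the tf-idf test vectors and plot the confusion matrix
y_pred_baseline = baseline_model.predict(X_test_vec)
cm = confusion_matrix(y_test, y_pred_baseline)
ConfusionMatrixDisplay(cm, display_labels=['ham', 'spam']).plot()
plt.show()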

Model 1: Creating custom Text vectorization and embedding layers:

Text vectorization is the process of converting text into a numerical representation, for example bag-of-words frequency, binary term frequency, etc. A word embedding is a learned representation of text in which words with related meanings have similar representations. Each word is assigned to a single vector, and the vector values are learned in the same way as the weights of a neural network.
Now, create a custom text vectorization layer using TensorFlow:
from tensorflow.keras.layers import TextVectorization

MAXTOKENS=total_words_length
OUTPUTLEN=avg_words_len
text_vec = TextVectorization(
max_tokens=MAXTOKENS,
standardize='lower_and_strip_punctuation',
output_mode='int',
output_sequence_length=OUTPUTLEN
)
text_vec.adapt(X_train)

● MAXTOKENS is the maximum size of the vocabulary, which was found earlier.
● OUTPUTLEN is the length to which the sentences should be padded, irrespective of the original sentence length.
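total_words_length and avg_words_len are computed in an earlier cell that is not shown here; a plausible sketch of how they could be derived from the training texts (an assumption, not the original code):

# Assumed derivation: vocabulary size and average sentence length
total_words_length = len(set(" ".join(X_train).split()))
avg_words_len = round(sum(len(t.split()) for t in X_train) / len(X_train))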
Output of a sample sentence using text vectorization is shown below:
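The sample output appears as a screenshot in the original report; an equivalent check (the integer token ids depend on the adapted vocabulary):

# Vectorize one training sentence to inspect its integer token ids
sample = X_train.iloc[0]
print(sample)
print(text_vec([sample]))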

Now, create an embedding layer:

embedding_layer = layers.Embedding(
input_dim=MAXTOKENS,
output_dim=128,
embeddings_initializer='uniform',
input_length=OUTPUTLEN
)

● input_dim is the size of the vocabulary
● output_dim is the dimension of the embedding layer, i.e., the size of the vector in which the words will be embedded
● input_length is the length of the input sequences

Build and compile model 1 using the TensorFlow Functional API:

input_layer = layers.Input(shape=(1,), dtype=tf.string)
vec_layer = text_vec(input_layer)
embedding_layer_model = embedding_layer(vec_layer)
x = layers.GlobalAveragePooling1D()(embedding_layer_model)
x = layers.Flatten()(x)
x = layers.Dense(32, activation='relu')(x)
output_layer = layers.Dense(1, activation='sigmoid')(x)
model_1 = keras.Model(input_layer, output_layer)

model_1.compile(optimizer='adam',
                loss=keras.losses.BinaryCrossentropy(label_smoothing=0.5),
                metrics=['accuracy'])

Training model-1 and plotting its history:
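These cells appear as screenshots in the original; a minimal equivalent sketch (the epoch count of 5 is an assumption, matching the other models):

# Fit model-1 on the raw text and plot its loss/accuracy curves
history_1 = model_1.fit(X_train, y_train, epochs=5,
                        validation_data=(X_test, y_test))

pd.DataFrame(history_1.history).plot()
plt.xlabel('epoch')
plt.show()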

Create helper functions for compiling, fitting, and evaluating the model
performance:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def compile_model(model):
    '''
    simply compile the model with the adam optimizer
    '''
    model.compile(optimizer=keras.optimizers.Adam(),
                  loss=keras.losses.BinaryCrossentropy(),
                  metrics=['accuracy'])

def fit_model(model, epochs, X_train=X_train, y_train=y_train,
              X_test=X_test, y_test=y_test):
    '''
    fit the model with the given epochs, train
    and test data
    '''
    # validate on the full test set each epoch
    history = model.fit(X_train,
                        y_train,
                        epochs=epochs,
                        validation_data=(X_test, y_test))
    return history

def evaluate_model(model, X, y):
    '''
    evaluate the model and return accuracy,
    precision, recall and f1-score
    '''
    y_preds = np.round(model.predict(X))
    accuracy = accuracy_score(y, y_preds)
    precision = precision_score(y, y_preds)
    recall = recall_score(y, y_preds)
    f1 = f1_score(y, y_preds)
    model_results_dict = {'accuracy': accuracy,
                          'precision': precision,
                          'recall': recall,
                          'f1-score': f1}
    return model_results_dict

Model-2: Bidirectional LSTM:

A bidirectional LSTM (Long Short-Term Memory) is made up of two LSTMs, one processing the input in the forward direction and the other in the backward direction. BiLSTMs effectively increase the information available to the network, improving the context available to the algorithm (e.g., knowing which words immediately follow and precede a word in a sentence).
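A quick illustrative check (not from the original report): wrapping an LSTM in a Bidirectional layer concatenates the forward and backward states, doubling the output feature dimension.

# LSTM(64) yields 64 features per sequence; Bidirectional(LSTM(64)) yields 128
demo = layers.Bidirectional(layers.LSTM(64))
print(demo(tf.random.normal((1, 10, 128))).shape)  # (1, 128)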

Building and compiling the model-2:

input_layer = layers.Input(shape=(1,), dtype=tf.string)
vec_layer = text_vec(input_layer)
embedding_layer_model = embedding_layer(vec_layer)
bi_lstm = layers.Bidirectional(layers.LSTM(
    64, activation='tanh',
    return_sequences=True))(embedding_layer_model)
lstm = layers.Bidirectional(layers.LSTM(64))(bi_lstm)
flatten = layers.Flatten()(lstm)
dropout = layers.Dropout(.1)(flatten)
x = layers.Dense(32, activation='relu')(dropout)
output_layer = layers.Dense(1, activation='sigmoid')(x)
model_2 = keras.Model(input_layer, output_layer)

compile_model(model_2)  # compile the model
history_2 = fit_model(model_2, epochs=5)  # fit the model

Model-3: Transfer Learning with the USE Encoder:

Transfer Learning:

Transfer learning is a machine learning approach in which a model developed for one task is reused as the foundation for a model on a different task.

USE Layer (Universal Sentence Encoder):

The Universal Sentence Encoder converts text into high-dimensional vectors that can be used for text categorization, semantic similarity, and other natural language applications.

The USE can be downloaded from tensorflow_hub and used as a layer via the hub.KerasLayer() function.

import tensorflow_hub as hub

# model with Sequential api
model_3 = keras.Sequential()

# universal-sentence-encoder layer
# directly from tfhub
use_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                           trainable=False,
                           input_shape=[],
                           dtype=tf.string,
                           name='USE')
model_3.add(use_layer)
model_3.add(layers.Dropout(0.2))
model_3.add(layers.Dense(64, activation=keras.activations.relu))
model_3.add(layers.Dense(1, activation=keras.activations.sigmoid))

compile_model(model_3)

history_3 = fit_model(model_3, epochs=5)

Analyzing our Model Performance:

Use the helper functions created earlier to evaluate each model's performance:

baseline_model_results = evaluate_model(baseline_model, X_test_vec, y_test)
model_1_results = evaluate_model(model_1, X_test, y_test)
model_2_results = evaluate_model(model_2, X_test, y_test)
model_3_results = evaluate_model(model_3, X_test, y_test)

total_results = pd.DataFrame({
    'MultinomialNB Model': baseline_model_results,
    'Custom-Vec-Embedding Model': model_1_results,
    'Bidirectional-LSTM Model': model_2_results,
    'USE-Transfer learning Model': model_3_results}).transpose()

total_results

Output: a table of accuracy, precision, recall, and f1-score for each of the four models.
RESULTS AND DISCUSSION

Metrics:

All four models deliver excellent results (all achieve greater than 96 percent accuracy), so comparing them on accuracy alone is difficult.

Plotting the results:

● False negatives and false positives are both significant in this problem.
● Precision and recall are the metrics that let us account for them, along with one more, the f1-score.
● The f1-score is the harmonic mean of precision and recall, so it captures both in a single number.
● The USE-Transfer learning model gives the best accuracy and f1-score (see the plotting sketch after this list).
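The comparison plot appears as a figure in the original report; a minimal sketch that reproduces it from total_results (exact styling may differ):

# Bar chart comparing accuracy, precision, recall and f1-score across models
total_results.plot(kind='bar', figsize=(10, 6)).legend(loc='lower right')
plt.ylabel('score')
plt.show()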
CONCLUSION

In the realm of combating spam messages, this micro project has been an insightful journey into leveraging the power of TensorFlow and deep learning for SMS spam detection. The implementation and exploration of various models, including custom embeddings, bidirectional Long Short-Term Memory (LSTM) networks, and Universal Sentence Encoder transfer learning, have provided a comprehensive understanding of their efficacy in classifying SMS texts into 'spam' or 'ham' categories.

The practical application of these models on the SMS Spam Collection Dataset has unveiled the promise and potential of utilizing advanced machine learning techniques in addressing real-world challenges, particularly in filtering out unwanted and deceptive messages. The iterative process of model training, evaluation, and refinement has underscored the significance of fine-tuning parameters and architecture choices to achieve optimal performance.

Furthermore, the project's comprehensive walkthrough and code implementation have served as a valuable guide, offering practical insights into implementing these models using TensorFlow in Python. By embracing these techniques, this micro project aims to contribute to the broader understanding of deep learning methodologies for SMS spam detection, ultimately paving the way for enhanced mobile security and user protection against fraudulent communications.
REFERENCES

1. Almeida, T. A., & Hidalgo, J. M. G. (2012). Contribution to the study of SMS spam filtering: New collection and results. Proceedings of the 13th ACM Conference on Information and Knowledge Management.

2. GeeksforGeeks. (n.d.). "SMS Spam Detection using TensorFlow in Python." Retrieved from: [Link to the GeeksforGeeks article]

3. TensorFlow Documentation. (n.d.). TensorFlow: An open-source machine learning framework for everyone. Retrieved from: [Link to TensorFlow official documentation]

4. Géron, A. (2017). Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media.

5. Brownlee, J. (2019). Deep Learning for Natural Language Processing: Develop Deep Learning Models for Your Natural Language Problems. Machine Learning Mastery.

6. Abadi, M., et al. (2016). TensorFlow: A System for Large-Scale Machine Learning. Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation.
