SMS Spam Detection Using Python
Programming Language
CHAPTER 1
INTRODUCTION
This report summarizes my one-month internship at Evolet Technologies.
During this period, I was trained in Python programming and learned to write simple
machine learning programs using Python libraries. The internship helped me learn how to
build simple projects in Python, a language that is increasingly popular among
data analysts for its ease of use.
1.1 ABOUT THE COMPANY
Evolet Technologies, a division of Red18Tech, is a software development (global IT
solutions) company founded by a team with a combined 67 man-years of experience and the business
acumen to provide user-friendly, customized solutions for small, medium and
large businesses.
It provides full-cycle services in the areas of
Customized software development
Web designing, web development
Search Engine Optimization
Software solutions and services.
Combining its solid business domain experience, technical expertise, profound knowledge of
the latest industry trends, and a quality-driven delivery model, it offers progressive end-to-end web
solutions.
DEPT. OF CSE, SVCE 2019-2020 Page 1
CHAPTER 2
INTRODUCTION TO PROJECT
2.1 OVERVIEW
The goal is to predict whether an SMS is ham (legitimate) or spam. Using Python libraries, we develop a program
that classifies SMS messages with a reasonable level of accuracy and study its scope in modern-day
applications.
2.2 PURPOSE
Short Message Service (SMS) is one of the most widely used communication services. The
advancement of mobile technology has been accompanied by the emergence of SMS spam. The
focus of this program is to predict, at the user's end, whether an SMS is ham or spam. Its objective is to study
the discriminative power of the features, considering how informative or influential each one is in
predicting the class of an SMS.
2.3 OBJECTIVES
Filter and clean datasets to be used for prediction.
Implement algorithms and models for prediction of an SMS with high accuracy.
Evaluate the algorithms and tune them if necessary.
Build and host an end-to-end tool that takes an SMS as input and labels it as
spam or ham.
2.4 RELATED WORK
The most common formulation of spam filtering is as a classification task: for a piece of
text, the aim is to predict whether or not it is spam. However, past work varies in terms of
what kind of data sets are used for training and testing. For example,
Kim et al. proposed a method based on frequency calculation that measures the
lightness and quickness of filtering methods.
Neural networks have not been widely used for SMS classification, as they take
longer to train.
A Bayesian classifier, by contrast, is easy to construct, though it was reported to be less
accurate. Researchers have gone further and analyzed SMS traffic: messaging behavior
similar to that of a spammer was observed for some networked
applications, because a few machine-to-machine systems transmit a huge number of texts per day.
It was suggested that such systems be considered when designing SMS spam detection models.
2.5 ADVANTAGES OF PROPOSED SYSTEM
Naive Bayes is one of the earliest and most widely used techniques in SMS spam
filtering and predicts results with high accuracy.
Naive Bayes (NB) and support vector machine (SVM) algorithms show good results on the data set.
Although neural networks can perform better at classification, they require complicated deep
learning techniques and a lot of training time.
The system is simple to design and implement. It requires few system
resources and works across configurations.
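To illustrate why a Naive Bayes filter is simple to build, the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing, written with only the Python standard library. The class name TinyNaiveBayes, the whitespace tokenizer and the four training messages are illustrative assumptions, not the code or data used in the project.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Hypothetical multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, messages, labels):
        self.class_counts = Counter(labels)       # number of messages per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for text, label in zip(messages, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total)      # log prior of the class
            for word in text.lower().split():
                if word in self.vocab:            # words never seen are ignored
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Made-up training messages, for illustration only.
train_texts = [
    "win a free prize now",
    "free cash claim now",
    "are we meeting for lunch",
    "see you at home tonight",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = TinyNaiveBayes().fit(train_texts, train_labels)
print(model.predict("claim your free prize"))
print(model.predict("lunch at home"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term keeps a word that was seen in only one class from forcing the other class's probability to zero.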
CHAPTER 3
IMPLEMENTATION
3.1 SOFTWARE AND HARDWARE REQUIREMENTS
SOFTWARE REQUIREMENTS
Operating system : Windows XP/7/10
Language used : Python
Software required : IDLE, Jupyter or any IDE supporting Python
Required modules : Flask, scikit-learn (sklearn), Pandas
HARDWARE REQUIREMENTS
Processor : Pentium-IV
Speed : 1.1 GHz
RAM : 256 MB
Hard disk : 20 GB
3.2 EXECUTION
3.2.1 DATASET DESCRIPTION
SMS Spam Collection
This is a collection of 5,574 spam and legitimate (ham) English text messages gathered from
the following free research sources:
National University of Singapore SMS Corpus (3,375 ham SMS), the Grumbletext website
(425 spam SMS), Caroline Tagg's PhD thesis (450 ham SMS) and SMS Spam Corpus v.0.1 Big
(1,002 ham SMS and 322 spam SMS).
Spam SMS Dataset
The SMS database contains 1,000 spam and 1,000 ham SMS. To collect SMS spam
data, Yadav et al. ran an incentivized crowd-sourcing scheme on their campus. Because of the
large influence of regional words, SMS containing both Hindi and English words were
collected from 43 participants.
A. Preparing Readable Datasets
The datasets have been prepared as comma-separated values (CSV) files. These files
contain one text message per line. Each line has two columns - v1 is the label (ham or
spam) and v2 is the raw text.
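Reading a file in this layout can be sketched as follows. This example uses the standard library's csv module rather than pandas so that it runs without extra packages. The function name load_sms_csv, the presence of a v1,v2 header row and the three sample rows are assumptions made for illustration.

```python
import csv
import io

# Made-up sample in the layout described above: v1 = label, v2 = raw text.
SAMPLE_CSV = '''v1,v2
ham,"Ok, see you at 5"
spam,"WINNER! Claim your free prize now"
ham,Call me when you reach home
'''


def load_sms_csv(handle):
    """Return (labels, texts) from a CSV file object with columns v1 and v2."""
    labels, texts = [], []
    for row in csv.DictReader(handle):
        labels.append(row["v1"].strip().lower())
        texts.append(row["v2"].strip())
    return labels, texts


labels, texts = load_sms_csv(io.StringIO(SAMPLE_CSV))
print(labels)
print(texts[0])
```

For a file on disk, the same function can be given a handle from open(path, newline=""), which is how the csv module expects files to be opened. Messages that contain commas must be quoted, as in the first sample row.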
B. Data Preprocessing
Different preprocessing approaches have been applied to different classifiers based on
their input requirements. A brief description of these approaches follows.
1) Using Term Frequency-Inverse Document Frequency (TF-IDF): The number of times a word
appears in a given document is called its term frequency. The inverse document frequency
lowers the weight of words that appear in many documents, so the TF-IDF weight highlights
words that are frequent in one message but rare across the collection.
2) Using Tokenizer: When working with text, it is good practice to start by splitting the
text into words, which are known as tokens. Tokenization is the process of splitting text
into words or tokens. The Keras Tokenizer is a class for vectorizing texts; it is used to turn
texts into sequences.
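The two preprocessing steps above can be sketched with the standard library alone. This is a simplified illustration, not the Keras Tokenizer or the project's code: the function names tokenize and tf_idf, the regular expression and the three sample messages are assumptions, and the weight used is the plain textbook form tf * log(N / df).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tf_idf(documents):
    """Return one {token: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each token occurs.
    df = Counter(token for tokens in tokenized for token in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Made-up messages, for illustration only.
docs = ["free prize free entry", "see you at lunch", "free lunch today"]
for doc, w in zip(docs, tf_idf(docs)):
    print(doc, "->", w)
```

Library implementations such as scikit-learn's TfidfVectorizer use a smoothed variant of the inverse document frequency and normalize each document vector, so their numbers will differ from this sketch.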
Fig 3.1: Flow Diagram
CHAPTER 4
TECHNOLOGIES USED
4.1 PYTHON
Python is a high-level, interpreted, interactive and object-oriented scripting language.
Python is designed to be highly readable. It is a dynamic scripting language similar to Perl and
Ruby. The principal author of Python is Guido van Rossum. Python supports dynamic typing
and has a garbage collector for automatic memory management. Another important feature of
Python is dynamic name resolution, which binds the names of functions and variables during
execution.
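Dynamic typing and dynamic name resolution can be seen in a few lines. The names value, greet and helper are hypothetical and chosen only for this illustration.

```python
# Dynamic typing: the same name can refer to values of different types.
value = 42
first_type = type(value).__name__
value = "forty-two"
second_type = type(value).__name__
print(first_type, second_type)


# Dynamic name resolution: 'helper' is looked up when greet() runs,
# not when greet() is defined, so it can be defined afterwards.
def greet():
    return helper("SMS")


def helper(word):
    return word.lower()


print(greet())
```

Calling greet() before helper is defined would raise a NameError, which shows that the binding happens at execution time rather than at definition time.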
ADVANTAGES:
Presence of third-party modules
Extensive support libraries (NumPy for numerical calculations, Pandas for data analysis,
etc.)
Open source and community development
Easy to learn
User-friendly data structures
High-level language
Dynamically typed language (no need to declare a data type; the type is taken from the
value assigned)
Object-oriented language
Interactive
Portable across operating systems
APPLICATIONS:
GUI based desktop applications (Games, Scientific Applications)
Web frameworks and applications
Enterprise and Business applications
Operating Systems
Language Development
4.1.1 PANDAS
Pandas has so many uses that it might make sense to list the things it can't do instead of
what it can do.
This tool is essentially your data home. Through pandas, you get acquainted with your data
by cleaning, transforming, and analysing it.
For example, say you want to explore a dataset stored in a CSV on your computer. Pandas
will load the data from that CSV into a DataFrame (essentially a table) and then let you do
things like:
Calculate statistics and answer questions about the data, like
What's the average, median, max, or min of each column?
Does column A correlate with column B?
What does the distribution of data in column C look like?
Clean the data by doing things like removing missing values and filtering rows or
columns by some criteria
Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and
more.
Store the cleaned, transformed data back into a CSV, other file or database
Before you jump into modelling or complex visualizations, you need a good
understanding of the nature of your dataset, and pandas is a convenient way to
gain it.
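The load, clean, analyse and store steps described above might look like the following sketch. It assumes pandas is installed. The four rows are made up, and the column names v1 and v2 follow the dataset layout described in Chapter 3.

```python
import io

import pandas as pd

# Made-up rows in the v1 (label) / v2 (text) layout; one text is missing.
csv_text = """v1,v2
ham,See you at lunch
spam,Free prize claim now
ham,
spam,Win cash now
"""

df = pd.read_csv(io.StringIO(csv_text))

# Clean: drop rows whose text is missing.
df = df.dropna(subset=["v2"])

# Transform: add a column with the message length in characters.
df["length"] = df["v2"].str.len()

# Analyse: message count and mean length per label.
summary = df.groupby("v1")["length"].agg(["count", "mean"])
print(summary)

# Store: write the cleaned table back out as CSV text.
cleaned_csv = df.to_csv(index=False)
```

Passing a file path instead of the StringIO object to read_csv, and a path to to_csv, gives the same workflow on files stored on disk.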
4.1.2 SKLEARN
scikit-learn (imported as sklearn) is an open source Python library that implements a range of machine learning,
pre-processing, cross-validation and visualization algorithms using a unified interface.
Important features of sklearn:
Simple and efficient tools for data mining and data analysis. It features various
classification, regression and clustering algorithms including support vector
machines, random forests, gradient boosting, k-means, etc.
Accessible to everybody and reusable in various contexts.
Built on top of NumPy, SciPy, and matplotlib.
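The unified interface means that a preprocessing step and a classifier can be chained and then used through the same fit and predict calls. The sketch below assumes scikit-learn is installed and uses six made-up messages. It shows one plausible arrangement (TF-IDF features fed to a multinomial Naive Bayes model), not necessarily the exact configuration used in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up messages, for illustration only.
texts = [
    "win a free prize now",
    "free cash claim now",
    "urgent free entry win cash",
    "are we meeting for lunch",
    "see you at home tonight",
    "call me when you reach home",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# The pipeline exposes the same fit/predict interface as a single estimator.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

predictions = model.predict(["free prize cash", "see you at lunch"])
print(list(predictions))
```

Because every estimator follows the same interface, MultinomialNB could be swapped for another classifier, such as a support vector machine, without changing the rest of the code.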
4.1.3 NumPy
NumPy, which stands for Numerical Python, is a library consisting of multidimensional array
objects and a collection of routines for processing those arrays. Using NumPy, mathematical
and logical operations on arrays can be performed.
TensorFlow and other libraries interoperate with NumPy arrays when performing operations on
tensors. The array interface is the most important feature of NumPy.
Features of NumPy:
Interactive: NumPy is very interactive and easy to use.
Mathematics: Makes complex mathematical implementations very simple.
Intuitive: Makes coding really easy and grasping the concepts is easy.
Lot of Interaction: Widely used, hence a lot of open source contribution.
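A short sketch of mathematical and logical operations on a NumPy array follows. It assumes NumPy is installed. The message lengths and the threshold of 100 characters are made-up values.

```python
import numpy as np

# Hypothetical message lengths (in characters) for four SMS messages.
lengths = np.array([16, 120, 45, 155])

# Mathematical operation applied to the whole array at once.
mean_length = lengths.mean()

# Logical operation: a boolean mask marking the long messages.
is_long = lengths > 100

# Boolean indexing keeps only the elements where the mask is True.
long_lengths = lengths[is_long]

print(mean_length, is_long, long_lengths)
```

No explicit loop is needed: the comparison and the mean are applied element-wise across the array.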
4.1.4 FLASK
Flask is a web application framework written in Python. Flask is based on the Werkzeug WSGI
toolkit and Jinja2 template engine.
To understand what Flask is, you have to understand a few general terms:
WSGI: Web Server Gateway Interface (WSGI) has been adopted as a standard for Python
web application development. WSGI is a specification for a universal interface between
the web server and the web applications.
Werkzeug: It is a WSGI toolkit, which implements requests, response objects, and other
utility functions. This enables building a web framework on top of it. The Flask
framework uses Werkzeug as one of its bases.
Jinja2: Jinja2 is a popular templating engine for Python. A web templating system combines a
template with a certain data source to render dynamic web pages.
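The WSGI interface that Flask builds on can be shown without Flask itself. The sketch below uses only the standard library's wsgiref package. The response text and the helper names are assumptions made for illustration, and the application is called directly with a fake request instead of being served.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A bare WSGI application: a callable taking environ and start_response."""
    body = b"SMS spam detector is running"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]  # WSGI responses are iterables of bytes


# Call the application directly with a fake request, as a server would.
environ = {}
setup_testing_defaults(environ)
captured = {}


def start_response(status, headers):
    """Record what the application reports instead of sending it anywhere."""
    captured["status"] = status
    captured["headers"] = dict(headers)


response_body = b"".join(application(environ, start_response))
print(captured["status"], response_body)
```

A Flask application object is itself a WSGI callable of this shape, which is why any WSGI-compliant server can host it. To serve the sketch locally, wsgiref.simple_server.make_server("", 8000, application).serve_forever() would do.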
CHAPTER 5
TESTING
5.1 TESTING
Software testing is a critical element of the ultimate review of
specification, design and coding. Testing leads to the discovery of errors in the
software's functions and verifies whether the performance requirements are met. Testing also
provides a good indication of software reliability and software quality as a whole. The results
of the different phases of testing are evaluated and compared with the expected results. If
errors are uncovered, they are debugged and corrected. A standard approach to software
testing has these generic characteristics:
Various testing techniques are appropriate at different points in time.
Testing and debugging are different activities, but debugging must be accommodated
in the testing strategy.
Following three approaches of debugging were used:
Debugging by Induction
Debugging by Deduction
Backtracking
5.2 TEST PLANS
The major activities in this test plan are:
Unit testing.
Integration testing.
Validation testing.
System testing.
Sno. | Test case description                     | Expected result             | Actual result               | Remarks
1.   | Input text file is given to be summarized | Summary of the text         | Summary of the text         | Pass
2.   | Article link is given to be summarized    | Summary of the article      | Summary of the article      | Pass
3.   | Input text file is absent                 | Alert: Error, No such file. | Alert: Error, No such file. | Pass
Table 5.1 Test case table
5.3 TESTING AND EVALUATING
To compute the evaluation metrics, the following parameters were used, relative to the main (positive) class:
a. True Positive (TP) - the number of test cases that are correctly classified into the main class;
b. True Negative (TN) - the number of test cases that are correctly rejected from the main
class;
c. False Positive (FP) - the number of test cases that are incorrectly classified into the main
class;
d. False Negative (FN) - the number of test cases that are incorrectly rejected from the main
class.
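The four counts and the metrics usually derived from them can be sketched as follows. The sketch follows the standard definitions and assumes that spam is the main (positive) class. The function names and the six label pairs are made up for illustration and are not results from the project.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Count TP, TN, FP and FN, treating `positive` as the main class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # spam correctly flagged as spam
        elif p != positive and a != positive:
            tn += 1   # ham correctly left alone
        elif p == positive and a != positive:
            fp += 1   # ham wrongly flagged as spam
        else:
            fn += 1   # spam that slipped through as ham
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Derive common evaluation metrics from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels, for illustration only.
actual = ["spam", "ham", "spam", "ham", "ham", "spam"]
predicted = ["spam", "ham", "ham", "ham", "spam", "spam"]

counts = confusion_counts(actual, predicted)
print(counts)
print(metrics(*counts))
```

For a spam filter, precision matters because a false positive hides a legitimate message from the user, while recall measures how much spam is actually caught.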
CHAPTER 6
SNAPSHOTS
Input Text Field:
Fig 6.1: Input Text Field
Input of Text 1:
Fig 6.2: Input of Text 1
Output 1:
Fig 6.3: Output Text 1
Output 2:
Fig 6.4: Summary of Input Text 2
CONCLUSION
To sum up, I believe the skills I learned during the internship will help me
in the long run to face the challenges of a working environment. I gained a better understanding of
what it means to be an IT developer and a programmer, and I can prepare myself to become a
responsible and innovative developer in the future.
During my training period, I realized that observation is critical to finding the root cause
of a problem, not just in my project but also in my daily routine. During my internship, I
worked with my colleagues to identify problems. The project also helped
me learn independently, discipline myself, be considerate and patient, build self-confidence,
take initiative, and grasp the difficulty of the problems.
This whole journey improved my communication skills and gave me a sense of how the corporate
world works. During my internship, I received constructive criticism and advice from
my colleagues, which has helped me grow.
I would like to thank everyone who made my internship a fruitful experience.