This project uses the Naive Bayes machine learning algorithm to build a model that classifies the SMS messages in a dataset as spam or not spam, based on the training data we give it. It helps to have some intuition about what a spammy text message looks like.
Spam messages usually contain words like 'free', 'win', 'winner', 'cash', or 'prize', because they are designed to catch your eye and tempt you to open them. They also tend to use words written in all capitals and a lot of exclamation marks. A human recipient can usually spot a spam text quite easily, and our objective here is to train a model to do that for us!
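The intuition above can be written down as a few hand-crafted signals. This is only an illustrative sketch, not part of the project's model: the function name `spam_signals` and the keyword set are hypothetical, with the keywords taken from the examples in the text.

```python
import re

# Hypothetical keyword list, copied from the example words mentioned above.
SPAM_KEYWORDS = {"free", "win", "winner", "cash", "prize"}


def spam_signals(message):
    """Count three simple, hand-picked signals in a single message."""
    words = re.findall(r"[A-Za-z']+", message)
    return {
        # Words that match the keyword list, ignoring case.
        "keyword_hits": sum(w.lower() in SPAM_KEYWORDS for w in words),
        # Words of two or more letters written entirely in capitals.
        "all_caps_words": sum(len(w) > 1 and w.isupper() for w in words),
        # Raw count of exclamation marks.
        "exclamation_marks": message.count("!"),
    }


example = spam_signals("WINNER!! Claim your FREE cash prize now!")
print(example)
```

Counts like these are not a classifier on their own; they only make the intuition concrete. The Naive Bayes model learns which words matter from the labelled data instead of relying on a fixed list.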
Identifying spam messages is a binary classification problem, as each message is classified as either 'Spam' or 'Not Spam' and nothing else. We will feed a labelled dataset into the model so that it can learn from it and make predictions on future messages.
This project uses the sklearn implementation of Naive Bayes and also discusses approaches to building Naive Bayes from scratch.
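To give a feel for the from-scratch approach, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing, using only the standard library. It is not necessarily how the notebook does it: the function names, the whitespace tokenizer, and the four toy messages are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real text needs more care.
    return text.lower().split()


def train(messages, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(messages, labels):
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen word/class pairs from zeroing the score.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy, made-up training data (not taken from the real dataset).
messages = [
    "win a free prize now",
    "free cash winner",
    "are we meeting for lunch",
    "see you at home tonight",
]
labels = ["spam", "spam", "ham", "ham"]

model = train(messages, labels)
print(predict(model, "free prize"))
print(predict(model, "lunch at home"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together, which is the usual reason for summing logs instead of multiplying likelihoods.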
This project requires Python 3.7 and the following Python libraries installed:
The code is provided in the spam-detection.ipynb notebook file.
In a terminal or command window, navigate to the top-level project directory spam-detection/ (that contains this README) and run one of the following commands:
ipython notebook spam-detection.ipynb
or
jupyter notebook spam-detection.ipynb
This will open the Jupyter Notebook software and project file in your browser.
This project uses a dataset from the UCI Machine Learning repository which has a very good collection of datasets for experimental research purposes. The direct data link is here.
Here's a preview of the data:
The columns in the dataset are not yet named and, as you can see, there are 2 of them.
The first column takes two values, 'ham' which signifies that the message is not spam, and 'spam' which signifies that the message is spam.
The second column is the text content of the SMS message that is being classified.
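The two-column layout described above could be read and given names along these lines. This is a hedged sketch: it assumes the file is tab-separated, uses two made-up rows in place of the real file, and the column names `label` and `sms_message`, the `is_spam` field, and the function `load_messages` are hypothetical choices rather than the project's own.

```python
import csv
import io

# Two made-up rows standing in for the real data file (assumed tab-separated).
sample = "ham\tSee you at lunch?\nspam\tWINNER!! Claim your free prize now\n"

# Hypothetical names for the two unnamed columns.
COLUMN_NAMES = ["label", "sms_message"]


def load_messages(handle):
    """Read tab-separated rows and attach column names plus a numeric label."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for record in reader:
        if len(record) != 2:
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMN_NAMES, record))
        # Map 'ham' to 0 and 'spam' to 1 so a classifier can use the label.
        row["is_spam"] = 1 if row["label"] == "spam" else 0
        rows.append(row)
    return rows


rows = load_messages(io.StringIO(sample))
for row in rows:
    print(row["is_spam"], row["sms_message"])
```

To use the real data, the `io.StringIO(sample)` handle would be replaced with an open file handle for the downloaded dataset.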