AI-Based Online Spam Detection
Table of Contents
1. Abstract
2. Chapter 1: Introduction
○ 1.1 Background
○ 1.2 Problem Statement
○ 1.3 Objectives
○ 1.4 Scope of the Project
3. Chapter 2: Literature Survey
○ 2.1 Overview of Spam and Its Impact
○ 2.2 Traditional Filtering Techniques
○ 2.3 Machine Learning in Spam Detection
○ 2.4 Related Work and Studies
4. Chapter 3: System Analysis
○ 3.1 Existing System
○ 3.2 Proposed System
○ 3.3 Feasibility Study
○ 3.4 Requirements Analysis
■ 3.4.1 Functional Requirements
■ 3.4.2 Non-Functional Requirements
5. Chapter 4: System Design
○ 4.1 System Architecture
○ 4.2 Use Case Diagram
○ 4.3 Data Flow Diagram
○ 4.4 Database Schema
○ 4.5 UI Design
6. Chapter 5: Implementation
○ 5.1 Technology Stack
○ 5.2 Django Model and Views
○ 5.3 ML Model Integration
○ 5.4 Input/Output Flow
7. Chapter 6: Results and Discussion
○ 6.1 Evaluation Metrics
○ 6.2 Model Accuracy
○ 6.3 Sample Predictions
○ 6.4 Advantages and Limitations
1.1 Background
In today’s digital communication age, spam messages and phishing emails have become a
significant threat, especially with the rapid growth of online platforms. These malicious contents
aim to trick users into sharing sensitive data or clicking on harmful links. Efficient spam detection
has thus become critical for ensuring user safety, privacy, and secure communication.
To counter these threats, machine learning (ML) models have emerged as powerful tools for
classifying messages as spam or not based on content, structure, and keywords. This project
explores a web-based solution to detect spam in emails and online messages using a Django
framework integrated with a pre-trained machine learning model.
1.2 Problem Statement
Problem Definition: To design and develop a web-based AI-powered spam detection system
that classifies incoming messages or emails into spam or not spam using natural language
processing (NLP) and machine learning models.
1.3 Objectives
● To implement a machine learning model capable of accurately classifying spam
messages.
● To develop a Django-based web interface for users to submit messages and receive
predictions.
● To integrate a simple and extensible model-view-controller (MVC) structure for spam
classification.
● To store and view classified messages with their labels for review and analysis.
2.2 Traditional Filtering Techniques
Before machine learning, spam detection relied mainly on:
● Keyword-based filtering
● Blacklisting known sources
● Rule-based systems
While these methods are simple, they lack adaptability and are prone to high false-positive rates.
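A minimal sketch illustrates both the simplicity and the weakness of keyword-based filtering; the keyword list below is illustrative only, not part of this project:

```python
# Naive keyword-based spam filter: flags a message when it contains
# any term from a fixed blocklist. Keywords are illustrative.
SPAM_KEYWORDS = {"free", "winner", "prize", "click here", "urgent"}

def keyword_filter(message: str) -> bool:
    """Return True if the message looks like spam under the keyword rule."""
    text = message.lower()
    return any(kw in text for kw in SPAM_KEYWORDS)

# The rule catches obvious spam:
print(keyword_filter("Click here to claim your FREE prize!"))      # True
# ...but also mislabels legitimate text (a false positive):
print(keyword_filter("The gym offers a free trial for members"))   # True
# ...and misses spam that avoids the listed words:
print(keyword_filter("Y0u have w0n a g1ft"))                       # False
```

The last two calls show exactly the adaptability problem: a fixed word list cannot distinguish context, and trivial misspellings evade it entirely.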
2.4 Related Work and Studies
● The Enron Spam Dataset has been a benchmark dataset for training models.
● Researchers have achieved accuracy upwards of 95% using ensemble learning.
● Gmail, Outlook, and other platforms use hybrid models combining rule-based and AI
filters.
This project builds upon these foundations, implementing a logistic regression classifier
integrated into a Django web app for real-world testing.
This chapter outlines the major components of the system, describing how each module
contributes to the overall functionality. The system architecture is modular, allowing for future
expansion and easy maintenance. The four primary components are: Frontend, Backend,
Database, and Machine Learning Service.
Frontend
The user interface is built using Django templates and styled with Tailwind CSS, a utility-first
CSS framework that allows for rapid design and consistent aesthetics across pages. The
frontend is designed to be clean, responsive, and accessible, ensuring usability on both
desktop and mobile devices.
The frontend interacts with the backend using standard form submissions and can be extended
to include AJAX or API-based data submission in future versions. Mobile responsiveness is
achieved using Tailwind's responsive breakpoints, ensuring optimal user experience on smaller
screens.
Backend
The backend is developed in Django 4.x, a high-level Python web framework known for its
scalability, security, and built-in ORM. It handles:
● Routing and Views: URL mappings are created for all major functionalities such as data
entry, prediction, and dashboard access. Views control the logic for rendering templates
or processing prediction requests.
● Authentication and Authorization: Built-in Django auth is used for login, logout, and
password protection. Users are grouped into roles using Django's Group model, enabling
role-based access control.
● Model Integration: The machine learning model is integrated into Django as a Python
module. It is invoked during form processing to generate real-time predictions.
● Data Validation and Security: The backend ensures input validation, uses CSRF
protection on forms, and handles exceptions gracefully to prevent system crashes.
The architecture follows the Model-View-Template (MVT) design pattern, which cleanly
separates data, logic, and presentation layers.
Database
The system uses PostgreSQL as its primary relational database management system.
PostgreSQL was selected due to its robustness, ACID compliance, and support for advanced
features such as JSON storage, indexing, and role management.
● User Accounts: Stores information about registered users, their roles, and credentials.
● Prediction Logs: Captures submitted message text, predicted labels, timestamps,
and user IDs for traceability.
● Training Dataset Management (Admin Only): A provision to upload and manage
datasets for retraining the ML model in the future.
● Feedback Records: Planned for future implementation, where users can submit feedback
on prediction quality.
The Django ORM abstracts SQL queries and handles migrations seamlessly, which simplifies
database operations and schema evolution.
The core intelligence of the system lies in the Machine Learning service, implemented as a
standalone Python module integrated into the Django backend. The service uses a logistic
regression classifier trained on a labeled dataset of message subjects and bodies.
● Model Training Script: Written in Python using pandas, scikit-learn, and NumPy.
The model is trained offline and validated using test data before deployment.
● Model Serialization: The trained model is serialized using joblib for efficient storage
and fast loading at runtime.
● Runtime Inference: When a user submits a message, the Django view loads the
serialized model and passes the input to generate predictions in real time.
● Result Output: The prediction result is returned to the user interface along with optional
metadata like confidence score (planned in future).
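The planned confidence score could come directly from the classifier's predicted probabilities. A minimal sketch with scikit-learn (toy data standing in for the project's dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training rows standing in for the real labeled dataset.
texts = ["win a free prize now", "claim your free iphone",
         "meeting agenda attached", "lunch at noon tomorrow"]
labels = ["Spam", "Spam", "Not Spam", "Not Spam"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)

# predict_proba yields one probability per class; the maximum is a
# natural confidence score to return alongside the predicted label.
x_new = vectorizer.transform(["free prize inside"])
label = clf.predict(x_new)[0]
confidence = clf.predict_proba(x_new)[0].max()
print(label, round(confidence, 2))
```

Returning `confidence` with the label would require no change to the serialized model, only to the view that formats the result.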
The design ensures that the model can be updated independently of the web app. Admins can
retrain the model offline and replace the serialized .pkl file without needing to redeploy the
entire system.
Integration Workflow
1. User Interaction: The user fills out the prediction input form via the frontend
interface.
2. Form Submission: The input is sent to a Django view through a secure POST request.
3. Model Prediction: The backend view invokes the ML service, loads the model, and
performs inference.
4. Result Storage and Display: The prediction is stored in the database and returned to
the frontend for display.
5. Admin Access (Optional): Admins can view all prediction logs and performance
analytics.
This modular and loosely coupled architecture ensures that each part of the system is
independently testable, replaceable, and scalable.
3.3 Feasibility Study
Technical Feasibility: The system is built using widely available open-source tools (Python,
Django, Scikit-learn). No proprietary software or specialized hardware is required. The ML
model is trained offline and loaded into memory during runtime using joblib.
Operational Feasibility: The system is easy to use and deploy. Once trained, the model
can serve multiple prediction requests in real time without needing re-training unless
explicitly required.
Economic Feasibility: Since the system uses open-source tools and publicly available
datasets, there are no direct costs involved. It is feasible for individual developers and
academic institutions.
1. Presentation Layer
Responsibilities:
Collect message data (subject and body) from users
2. Application Layer
Responsibilities:
Handle HTTP requests and form submissions
3. Data Layer
Components:
Database: SQLite (db.sqlite3)
Responsibilities:
Store input and prediction results
Provide data persistence for admin view and record-keeping
Store and load the trained ML model used for real-time predictions
Workflow Summary
1. User Accesses the Form
User navigates to the form page and enters required information.
Planned Extensions:
API Layer: a REST API using Django REST Framework for external integration.
Model Retraining: periodic retraining with new data entries to improve prediction
accuracy.
1. User Input
The user submits a message through a web form.
Inputs include:
o Subject (e.g., “Win a free iPhone”)
o Body text (e.g., “Click here to claim your prize!”)
2. Form Submission
The form data is sent to the Django view via a POST request.
3. Preprocessing
The subject and body are combined and vectorized using the trained TF-IDF vectorizer.
4. Model Prediction
The loaded model classifies the combined text as spam or not spam.
5. Prediction Output
The prediction (Spam or Not Spam) is saved in the database with the
message.
The user is redirected to a result page showing the prediction.
4.5 UI Design
(Screenshots of the main pages, each with a brief explanation, are to be inserted here.)
Chapter 5: Implementation
1. Programming Languages
Python
Used for data processing, model training (machine learning), and back-end
development (Django framework).
HTML/CSS
Used for designing the web front-end interface (forms, templates).
2. Python Libraries
Pandas, NumPy, Scikit-learn – For data processing and model training.
Joblib – For saving the trained model as a .pkl file for reuse in the web app.
3. Web Framework
Django (Python-based Web Framework)
o Django Admin panel is used for managing data models through a web
UI.
4. Database
SQLite
5. Front-End
Django Templates (HTML with template tags)
Used to render web forms, display results, and interact with users dynamically.
7. Development Tools
Jupyter Notebook / Python scripts (for model development)
Key Features
1. Email or Message Analysis:
o Takes subject and body text as input.
o Combines them for spam detection processing.
2. Phishing Link Detection:
o utils.py contains a rule-based phishing detection using:
Blacklisted domains (e.g., bit.ly, fakebank.ru).
Pattern-based detection (e.g., presence of .ru, free, win, etc.).
3. Spam Classification:
o Uses a trained machine learning model to classify whether a message is
spam or not.
4. Web Interface:
o Built with Django (home and scam apps).
o Forms to submit messages and receive predictions.
o Uses scam_model.pkl for real-time prediction.
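The rule-based phishing check described for utils.py might look like the following sketch; the blacklist and patterns are the examples listed above, while the function name and exact matching logic are assumptions:

```python
import re
from urllib.parse import urlparse

# Illustrative blacklist and patterns, mirroring the examples above.
BLACKLISTED_DOMAINS = {"bit.ly", "fakebank.ru"}
SUSPICIOUS_PATTERNS = [r"\.ru\b", r"\bfree\b", r"\bwin\b"]

def looks_like_phishing(url: str) -> bool:
    """Flag a URL whose domain is blacklisted or that matches a suspicious pattern."""
    domain = urlparse(url).netloc.lower()
    if domain in BLACKLISTED_DOMAINS:
        return True
    return any(re.search(p, url.lower()) for p in SUSPICIOUS_PATTERNS)

print(looks_like_phishing("http://bit.ly/a1b2"))         # True (blacklisted domain)
print(looks_like_phishing("http://win-big-free.com"))    # True (pattern match)
print(looks_like_phishing("https://example.com/report")) # False
```

Because this check is independent of the ML classifier, it can run before or alongside the model and be extended without retraining anything.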
Machine Learning Model
Algorithm: Logistic Regression
Vectorization: TF-IDF (TfidfVectorizer)
Pipeline: TF-IDF vectorization followed by logistic regression classification.
Training Data: scam_dataset.csv which contains:
subject
body
label (target)
Storage: Trained model saved as scam_model.pkl using joblib.
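The pipeline described above can be bundled into a single scikit-learn Pipeline object, so that one serialized .pkl file carries both the vectorizer and the classifier. A sketch on toy data (the real project trains on scam_dataset.csv):

```python
import joblib
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy rows standing in for scam_dataset.csv (subject + body combined).
texts = ["win a free iphone click here", "claim your prize today",
         "meeting agenda for review", "attached is the project file"]
labels = ["Spam", "Spam", "Not Spam", "Not Spam"]

# A Pipeline keeps the vectorizer and classifier together, so
# inference only needs one object loaded from disk.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
pipeline.fit(texts, labels)

joblib.dump(pipeline, "scam_model.pkl")  # what the web app later loads
loaded = joblib.load("scam_model.pkl")
print(loaded.predict(["free prize click here"])[0])
```

Bundling both steps also removes the risk of the web app pairing a model with the wrong vectorizer.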
Admin Panel:
@admin.register(ScamRecord)
class ScamRecordAdmin(admin.ModelAdmin):
    list_display = ('id', 'subject', 'body', 'label', 'created_at')
    search_fields = ('subject', 'body', 'label')
@admin.register(ScamRecord)
A decorator that registers the ScamRecord model with the Django admin
site.
Equivalent to:
admin.site.register(ScamRecord, ScamRecordAdmin)
Cleaner and more modern approach.
class ScamRecordAdmin(admin.ModelAdmin):
A custom admin class that lets you control how ScamRecord appears in the
admin interface.
Inherits from admin.ModelAdmin.
list_display = (...)
Specifies which fields to show in the list view of the admin dashboard.
You’ll see a table with columns like:
o id: Primary key
o created_at: Timestamp
search_fields = (...)
Enables search functionality in the admin panel.
Admins can search across subject, body, or label text.
Create Superuser:
python manage.py createsuperuser
Output
In the admin panel, under the Scam section, you’ll now see a model
named Scam Records
ID | Subject | Body Text | Label | Created At
1 | “Win a prize today” | “Click this link to claim...” | Spam | 2025-05-10 12:00
2 | “Meeting Agenda” | “Attached is the file for review” | Not Spam | 2025-05-10 12:05
Explanation
The admin.py file connects the backend model to Django's default
admin interface.
list_display shows fields in a table.
search_fields enables keyword search.
This allows admins to view, edit, or delete spam predictions
directly from the dashboard.
Django Web Application in the AI-Based Spam Detection Project
Django is a Python web framework used here to:
1. Render the web interface for inputting message data.
2. Handle user submissions through forms.
3. Call the ML model to make predictions.
4. Store results in a database.
5. Display results and admin panel views.
Working Flow of Django in This Project
1. User Enters Message
Route: /scam/
Form: ScamRecordForm
App: scam/views.py
2. Form is Processed
Django collects subject and body.
Combines them and sends them to scam_model.pkl for prediction.
3. ML Model Predicts Spam or Not
Using joblib.load(...), the trained model makes a prediction.
4. Database Stores Input & Prediction
Django ORM saves subject, body, and predicted label in the ScamRecord model.
5. Admin Views or Result Page Shows Output
Admin panel or user result route shows whether the message is spam.
scam/models.py – Defines Data Structure
from django.db import models

class ScamRecord(models.Model):
    subject = models.TextField()
    body = models.TextField()
    label = models.CharField(max_length=20)
    created_at = models.DateTimeField(auto_now_add=True)
scam/forms.py – Defines the Input Form
from django import forms
from .models import ScamRecord

class ScamRecordForm(forms.ModelForm):
    class Meta:
        model = ScamRecord
        fields = ['subject', 'body']
Purpose: Renders an input form on the web page.
scam/views.py – Web Logic and Prediction
import joblib
from django.shortcuts import render

# Load the serialized model once when the module is imported.
model = joblib.load('scam_model.pkl')

def scam_view(request):
    if request.method == 'POST':
        form = ScamRecordForm(request.POST)
        if form.is_valid():
            instance = form.save(commit=False)
            text = f"{instance.subject} {instance.body}"
            prediction = model.predict([text])[0]
            instance.label = prediction
            instance.save()
    else:
        form = ScamRecordForm()
    return render(request, 'scam/form.html', {'form': form})  # template name illustrative
This function:
Displays a form to the user.
Processes the form when submitted.
Uses a trained ML model to predict if the message is spam.
Displays the result to the user.
scam/urls.py – URL Routing
from django.urls import path
from .views import scam_view

urlpatterns = [
    path('scam/', scam_view, name='scam'),
]
Template (form page):
<form method="post">
{% csrf_token %}
{{ form.as_p }}
<button type="submit">Check</button>
</form>
<h2>Prediction Result</h2>
Step | Description
Training | Done offline in a Python script (e.g., Jupyter Notebook or Python file).
Saving | The trained model is stored using joblib as scam_model.pkl.
Loading | Django loads this .pkl file and uses it to make predictions.
Prediction | New messages are passed to the model and classified as Spam or Not Spam.
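The four steps above reduce to a few joblib calls. This sketch (toy model, illustrative file name) also checks that the reloaded model predicts exactly like the original, which is what makes offline training safe:

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Training (offline): fit a small stand-in for the real model.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(["free prize now", "win money fast",
           "see attached report", "notes from meeting"],
          ["Spam", "Spam", "Not Spam", "Not Spam"])

# Saving: serialize the fitted model to disk.
joblib.dump(model, "scam_model_demo.pkl")

# Loading: what Django would do at startup.
reloaded = joblib.load("scam_model_demo.pkl")

# Prediction: the reloaded model behaves identically to the original.
messages = ["win a free prize", "agenda attached"]
assert list(model.predict(messages)) == list(reloaded.predict(messages))
```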
import pandas as pd
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load dataset
df = pd.read_csv('scam_dataset.csv')  # Contains 'subject', 'body', 'label'
X = df['subject'] + ' ' + df['body']
y = df['label']

# Vectorize text
vectorizer = TfidfVectorizer()
X_vect = vectorizer.fit_transform(X)

# Hold out a test split for evaluation
X_train, X_test, y_train, y_test = train_test_split(X_vect, y, test_size=0.2)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Save model and vectorizer for the web app (vectorizer file name illustrative)
joblib.dump(model, 'scam_model.pkl')
joblib.dump(vectorizer, 'vectorizer.pkl')
def scam_view(request):
    ...
    if form.is_valid():
        instance = form.save(commit=False)
        text = f"{instance.subject} {instance.body}"
        vect_text = vectorizer.transform([text])  # Convert input to numerical form
        prediction = model.predict(vect_text)[0]  # Predict
Explanation:
The model is loaded once when Django starts.
The user's input is transformed using the same TF-IDF logic used during
training.
The model gives a prediction ("Spam" or "Not Spam").
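The point about reusing the same TF-IDF logic matters: refitting a fresh vectorizer on the incoming message would build a different vocabulary and feature width than the one the model was trained on. A sketch on toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["win a free prize", "meeting agenda attached"]
vectorizer = TfidfVectorizer().fit(train_texts)

new_message = ["claim your free prize"]

# Correct: transform with the fitted vectorizer -> same feature
# width as training (6 terms: win, free, prize, meeting, agenda, attached).
x_ok = vectorizer.transform(new_message)

# Wrong: refitting builds a different vocabulary (claim, your, free,
# prize), which the trained model cannot consume.
x_bad = TfidfVectorizer().fit_transform(new_message)

print(x_ok.shape[1], x_bad.shape[1])
```

This is why the project must persist (or pipeline) the vectorizer alongside the model rather than recreating it at request time.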
Complete Web-to-ML Flow Diagram
(Diagram: user form → Django view → TF-IDF vectorizer → model → database → result page.)
Advantages:
Conclusion
The AI-Based Online Spam Detection System developed in this project is a
web-based application that integrates machine learning techniques with the Django
web framework to identify and classify potentially harmful or unwanted messages.
The primary goal of this project was to develop a robust and user-friendly interface
where users can input text content — specifically email subjects and bodies — and
receive real-time feedback on whether the content is classified as spam or not.
At the heart of this application lies a machine learning model trained on a
labeled dataset containing spam and non-spam messages. The project employed a
logistic regression classifier in conjunction with TF-IDF vectorization to extract and
quantify text features. These features were then used to train the model to
distinguish between spam and legitimate messages. The trained model was
serialized and deployed within the Django web application using the joblib library,
enabling seamless interaction between the front-end user interface and the back-end
predictive engine.
Despite its simplicity and successful integration into the Django framework,
the current version of the model achieved an accuracy of approximately 50%. This
highlights several important considerations and limitations. First, the training data
may not have been diverse or balanced enough, limiting the model's ability to
generalize effectively. Second, the logistic regression model, while efficient and
interpretable, may not be powerful enough to capture complex spam patterns that
more advanced models like Random Forests or deep learning techniques could
handle. Lastly, the feature set — which only includes raw text from the subject and
body — may be insufficient. More sophisticated features such as the presence of
URLs, common spam keywords, or metadata like sender reputation could
significantly improve prediction accuracy.
From a software development perspective, the Django framework proved to
be a reliable and scalable choice for building the application. It facilitated rapid
development of the user interface, form handling, database integration, and dynamic
content rendering. By coupling Django with the trained machine learning model, the
application was able to process user input in real time and return actionable results,
demonstrating a practical example of ML deployment in a web environment.
In conclusion, while the project successfully demonstrates the integration of
machine learning with a web application for spam detection, there is considerable
scope for improvement. Enhancing the dataset, refining the preprocessing pipeline,
and experimenting with more advanced algorithms are the next logical steps. With
these improvements, the system can evolve into a highly reliable spam detection
platform suitable for production environments or enterprise deployment.
Future Enhancements
As technology advances and spam detection methods become increasingly
sophisticated, several future enhancements can be applied to improve the accuracy,
usability, and scalability of this project. These enhancements focus on both the
machine learning model and the web application components:
1. Use of Advanced Machine Learning Models
Replace the logistic regression model with more powerful algorithms like:
o Random Forest
o Support Vector Machine (SVM)
o XGBoost
o Deep Learning (LSTM, BERT) for contextual understanding.
These models can better capture complex spam patterns and contextual
nuances.
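With the pipeline pattern, swapping in one of these models is essentially a one-line change; a sketch using Random Forest on toy data (hyperparameters untuned and illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

texts = ["win a free prize now", "claim your free iphone",
         "meeting agenda attached", "quarterly report for review"]
labels = ["Spam", "Spam", "Not Spam", "Not Spam"]

# Only the final estimator changes; the TF-IDF step and the rest of
# the Django integration stay exactly as before.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
pipeline.fit(texts, labels)
print(pipeline.predict(["free prize inside"])[0])
```

The same substitution works for SVM or XGBoost; deep learning models (LSTM, BERT) would instead replace the whole pipeline, since they learn their own text representations.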
References
1. Metsis, V., Androutsopoulos, I., & Paliouras, G. (2006). Spam Filtering with Naive Bayes – Which Naive Bayes? Third Conference on Email and Anti-Spam (CEAS).
2. Zhang, L., Zhu, J., & Yao, T. (2004). An Evaluation of Statistical Spam Filtering Techniques. ACM Transactions on Asian Language Information Processing.
3. Sahami, M., Dumais, S., Heckerman, D., & Horvitz, E. (1998). A Bayesian Approach to Filtering Junk E-Mail. AAAI Workshop on Learning for Text Categorization.
TfidfVectorizer – Scikit-learn documentation. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
Logistic Regression – Scikit-learn documentation. https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
accuracy_score – Scikit-learn documentation. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
Working with Text Data – Scikit-learn tutorial. https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
16. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL.
17. SMOTE – imbalanced-learn documentation. https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html
18. GridSearchCV – Scikit-learn documentation. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
Heroku Python Support. https://devcenter.heroku.com/categories/python-support