NSAI Notes Unit3

The document discusses email spamming, including its definition, incidents, and datasets for spam detection. It outlines the application of Natural Language Processing (NLP) techniques for detecting spam emails, detailing steps from data collection to model evaluation. Additionally, it emphasizes the importance of various metrics like accuracy, precision, and recall in assessing machine learning models for spam detection.


Network Security Application

Using AI
Course Code: BTAIML 705 20
Unit 3: Email Spamming and Spam Detection
Prepared By:
Anju Bala
CSE Department
Outline
• What is Email Spamming?
• Incidents of email spamming
• Datasets related to e-mail spamming.
• Natural language processing of e-mails for spam detection
• Empirical analysis of machine learning models in terms of
Accuracy
Precision
Recall
F1-Score
AUC
Email Spamming
• Email spam refers to unsolicited or unwanted
emails sent in bulk to a large number of
recipients.
• These messages are typically sent by
individuals or organizations with the intention
of promoting products, services, or malicious
content.
• Email spam can also be used to spread
malware or viruses, and it can be a nuisance
to users who receive a large volume of
unwanted emails.
Incidents of e-mail spamming
• This email claims to be from
Netflix, warning the recipient
that their password is due to
expire soon and urging them to
reset it immediately due to a
purported increase in account
compromises. It includes a
"Reset Password" link, which is,
in reality, a phishing attempt to
steal Netflix login credentials.
Incidents of e-mail spamming
• This email purports to be a
notification from American
Express, indicating that either
the recipient or an
authorized user has
requested a new card for
their account. It includes links
to confirm the card request
or to indicate that something
is wrong, as well as a general
link to sign into the account.
Incidents of e-mail spamming
• This email imitates a notification
from GitHub, informing the
recipient that a third-party
OAuth application (AWS
CodeBuild) has been authorized
to access their account with
specific permissions. It includes
phishing links leading to GitHub
settings and support pages,
aimed at stealing GitHub
credentials or deploying
malware.
Incidents of e-mail spamming
• This email impersonates
a security alert from
Instagram, notifying the
recipient about a
suspicious login attempt
from an unusual
location and device. It
provides a fake phishing
link for the recipient to
secure their account.
Incidents of e-mail spamming
• Adopting Zoom's
familiar branding, this
email announces a
"Quarterly All Hands"
meeting, urging users
to confirm their
account. It directs
them to a button,
which leads to a
fraudulent Zoom
login page.
Incidents of e-mail spamming
• A seemingly
innocent email
poses as Google
Drive, inviting you to
an "Office Holiday
Party" with a simple
"Open" button.
Clicking leads to a
phishing site aiming
to snatch your Gmail
credentials.
Incidents of e-mail spamming
• Disguised as an
offer from HR, this
email dangles a job
with a hefty salary
and generic
requirements and
asks you to fill out
and return an
attached file.
Datasets related to e-mail
spamming
• For developing and training models to detect email spam, using high-
quality datasets is crucial. Here are some commonly used and publicly
available datasets related to email spamming:
• 1. Enron Spam Dataset
• Description: Contains emails from the Enron Corporation, labeled as
spam or ham (non-spam). It’s a well-known dataset for spam detection
and contains a mix of both legitimate and spam emails.
• Access: Available on Kaggle and other data repositories.
Datasets related to e-mail
spamming
2. SpamAssassin Public Corpus
• Description: A collection of emails labeled as spam or ham, maintained
by the SpamAssassin project. This dataset includes a variety of spam
types and is used extensively for spam filtering research.
• Access: Available on the SpamAssassin website.
Datasets related to e-mail
spamming
3. Ling-Spam Dataset
• Description: A dataset used for spam filtering research, containing
emails labeled as spam or non-spam. It’s often used in academic studies
and benchmarks.
• Access: Available at the UCI Machine Learning Repository.
Datasets related to e-mail
spamming
4. TREC Public Spam Corpus
• Description: Part of the Text REtrieval Conference (TREC), this dataset
includes spam and non-spam emails. It is used for evaluating spam
detection systems.
• Access: Available on the TREC website.
Datasets related to e-mail
spamming
5. Kaggle’s Spam SMS Dataset
• Description: A dataset consisting of SMS messages labeled as spam or
ham, often used for mobile spam detection but also applicable to
general spam filtering.
• Access: Available on Kaggle.

• These datasets can be used to train, validate, and test spam detection
models. Each dataset varies in size and complexity, so choosing the one
that best fits your needs depends on the specific requirements of your
spam detection project.
Natural language processing of
e-mails for spam detection
• Natural Language Processing (NLP) is a powerful
technique for detecting spam in emails. Here's a step-
by-step guide to applying NLP for email spam detection:
1. Data Collection and Preprocessing
• Collect Data
• Gather Datasets: Use publicly available datasets such
as the Enron Spam Dataset, SpamAssassin Public
Corpus, or others as mentioned previously.
Natural language processing of
e-mails for spam detection
• Preprocess Data
• Text Cleaning: Remove unwanted characters, HTML tags, or
email headers.
• Tokenization: Split email text into individual words or tokens.
• Normalization: Convert text to lowercase to ensure
consistency.
• Remove Stop Words: Remove common words (e.g., "the,"
"and") that don't contribute to spam classification.
• Stemming/Lemmatization: Reduce words to their base or
root forms (e.g., "running" to "run").
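The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration only: the stop-word list is a tiny hand-rolled sample and the "stemmer" is a crude suffix strip; a real pipeline would use a library such as NLTK or spaCy for both.

```python
import re

# Tiny illustrative stop-word list; real pipelines use NLTK's or spaCy's.
STOP_WORDS = {"the", "and", "is", "a", "an", "to", "of"}

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)       # text cleaning: strip HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # drop punctuation and digits
    tokens = text.lower().split()              # normalization + tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Crude stand-in for stemming: strip an "-ing" suffix
    # (a real pipeline would use e.g. NLTK's PorterStemmer).
    return [t[:-3] if t.endswith("ing") else t for t in tokens]

print(preprocess("<p>WIN the FREE prize by running now!</p>"))
```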
Natural language processing of
e-mails for spam detection
2. Feature Extraction
Convert Text to Numerical Features
• Bag of Words (BoW): Represents text data as a matrix of
word counts. Example: "This is spam" → [1, 1, 1, 0, 0] (where each
position represents the count of a specific word).
• Term Frequency-Inverse Document Frequency (TF-IDF): Adjusts
word counts by how often they appear in the entire
dataset. Example: words that appear frequently in one document
but rarely in others are given higher weights.
Natural language processing of
e-mails for spam detection
2. Feature Extraction
Convert Text to Numerical Features
• Word Embeddings: Use pre-trained embeddings like Word2Vec,
GloVe, or BERT to capture the semantic meanings of words. Example:
"spam" and "junk" will have similar embeddings.
Natural language processing of
e-mails for spam detection
• 3. Model Selection and Training
• Choose a Model
• Naive Bayes: Often used for text classification due to its simplicity
and effectiveness in handling text data.
• Logistic Regression: A common choice for binary classification
tasks.
• Support Vector Machines (SVM): Effective in high-dimensional
spaces and for text classification.
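As an illustration, a Naive Bayes spam classifier can be trained in a few lines with scikit-learn (assumed installed). The six emails and labels below are a hypothetical toy corpus, not a real dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical six-email training set (1 = spam, 0 = ham).
emails = ["win a free prize now", "claim your free reward today",
          "urgent offer click now", "meeting agenda for tomorrow",
          "lunch at noon?", "project status report attached"]
labels = [1, 1, 1, 0, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(emails)           # bag-of-words features
clf = MultinomialNB().fit(X, labels)    # per-class word likelihoods with smoothing

# Classify two new messages using the same fitted vocabulary.
new = vec.transform(["free prize offer", "status report for the meeting"])
print(clf.predict(new))
```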
Natural language processing of
e-mails for spam detection
• 3. Model Selection and Training
• Deep Learning Models:
• Recurrent Neural Networks (RNNs): Good for sequential data like
text.
• Long Short-Term Memory (LSTM): An advanced RNN variant that
handles long-range dependencies.
• Transformers: Models like BERT or GPT can understand context
and semantics better.
Natural language processing of
e-mails for spam detection
• 4. Evaluation
• Assess Model Performance
• Accuracy: Measure the overall correctness of the model.
• Precision: The proportion of true spam emails out of all emails classified
as spam.
• Recall: The proportion of true spam emails out of all actual spam emails.
• F1-Score: The harmonic mean of precision and recall, useful for
balancing the two metrics.
• Confusion Matrix: Visualizes the performance of the model, showing
true positives, false positives, true negatives, and false negatives.
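All of these metrics are available in scikit-learn (assumed installed). The sketch below computes them for a hypothetical set of seven predictions:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical actual and predicted labels for seven emails (1 = spam, 0 = ham).
y_true = [0, 1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 0, 1, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))  # rows = actual class, columns = predicted
print("accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / total
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))         # harmonic mean of the two
```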
Natural language processing of
e-mails for spam detection
5. Deployment and Monitoring
• Deploy the Model
• Integrate: Embed the model into your email system to classify
incoming emails in real time or in batch mode.
• Set Thresholds: Determine how aggressive the spam filter should
be (e.g., what percentage likelihood triggers a spam flag).
Natural language processing of
e-mails for spam detection
5. Deployment and Monitoring
• Monitor and Update
• Track Performance: Regularly monitor the model’s performance to
ensure it remains effective.
• Retrain: Update the model with new data to adapt to evolving
spam techniques.
• Handle Feedback: Adjust based on user feedback to reduce false
positives and false negatives.
Natural language processing of
e-mails for spam detection
• Example Workflow
1. Data Preprocessing: Clean and prepare the email text data.
2. Feature Extraction: Convert emails into numerical features using
TF-IDF or word embeddings.
3. Model Training: Train a Naive Bayes classifier with the processed
data.
Natural language processing of
e-mails for spam detection
• Example Workflow
4. Evaluation: Assess the model using metrics such as F1-score.
5. Deployment: Implement the model in an email system to
classify incoming messages.
6. Monitoring: Regularly update the model to handle new spam
tactics.
By following these steps, you can effectively use NLP to build a
robust spam detection system for emails.
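The workflow above can be sketched end to end with scikit-learn (assumed installed). The corpus below is a hypothetical stand-in for a real dataset such as Enron Spam:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for a real dataset (1 = spam, 0 = ham).
spam = ["win a free prize now", "claim your free reward", "urgent offer click here",
        "free money guaranteed", "you won the lottery", "cheap pills online"]
ham = ["meeting agenda for tomorrow", "lunch at noon", "project status report",
       "see attached invoice", "notes from the call", "schedule for next week"]
X, y = spam + ham, [1] * len(spam) + [0] * len(ham)

# Steps 1-2: split the data; TF-IDF feature extraction happens inside the pipeline.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Step 3: train a Naive Bayes classifier on the TF-IDF features.
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(X_train, y_train)

# Step 4: evaluate on the held-out emails.
print("F1 on test set:", f1_score(y_test, pipe.predict(X_test)))
```

In step 5, the fitted `pipe` object would be saved and called on each incoming email.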
Empirical analysis of machine learning models in terms of
Accuracy, Precision, Recall, F1-Score, AUC
Model Evaluation
• For a classification problem, we should be equipped with
several assessment metrics to analyze the classification
algorithm. They are:
1. Confusion Matrix
2. Precision
3. Recall
4. Accuracy
5. Area under the ROC curve (AUC)
Model Evaluation
• CONFUSION MATRIX
• The confusion matrix is a table that summarizes how
successful the classification model is at predicting
examples belonging to various classes.
• One axis of the confusion matrix is the label that the
model predicted, and the other axis is the actual label.
• Let's take an example of classifying whether a person
has heart disease or not:
Model Evaluation
• CONFUSION MATRIX
In this confusion matrix
table, the green diagonal
shows the cases where the
actual class and the
model's prediction agree,
while the red diagonal
shows the cases where the
model predicted a
different class from the
actual one.
We can use a confusion
matrix when comparing
different models by
looking at how well each
one predicts the true
classes.
Model Evaluation
• CONFUSION MATRIX Example
Consider seven examples with actual labels Y1 and predicted labels Y2:

X1   X2   Y1 (Actual)   Y2 (Predicted)
-    -    0             1
-    -    1             1
-    -    0             0
-    -    1             1
-    -    1             1
-    -    0             1
-    -    1             0

Tallying each (Actual, Predicted) pair one row at a time builds up the
confusion matrix:

              Actual 1   Actual 0
Predicted 1   3 (TP)     2 (FP)
Predicted 0   1 (FN)     1 (TN)
Model Evaluation
• CONFUSION MATRIX Example
• Accuracy
• It is defined as the number of correctly classified examples divided by
the total number of classified examples. In terms of the confusion matrix:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
         = (3 + 1) / (3 + 1 + 2 + 1)
         = 4/7
         ≈ 57%
Model Evaluation
• CONFUSION MATRIX Example
• Precision
• It is defined as the number of correct positive predictions divided by the total
number of positive predictions made by the model:

Precision = TP / (TP + FP)
          = 3 / (3 + 2)
          = 3/5
          = 60%

• Precision is a helpful metric when you want to minimize False Positives.
• For example, imagine you’re the loan officer at a bank. You don’t want to approve a loan (Positive)
for someone who won’t be able to repay. In reality, their loan should not be approved (Negative).
• In such cases, you should aim for a model with high Precision.
Model Evaluation
• CONFUSION MATRIX Example
• Recall
• It is calculated as the number of true positives divided by the total number of true
positives and false negatives:

Recall = TP / (TP + FN)
       = 3 / (3 + 1)
       = 3/4
       = 75%

• Recall is a helpful metric when you want to minimize False Negatives.
• For example, imagine you’re testing whether a patient is infected with a dangerous virus. You
don’t want a model that predicts a patient is not infected (Negative) when the patient
actually has the virus (Positive).
• In such cases, you should aim for a model with high Recall.
Model Evaluation
• CONFUSION MATRIX Example
• F1 score is a weighted average of precision and recall.
• Since precision involves false positives and recall involves false
negatives, the F1 score accounts for both.
• F1 score is usually more useful than accuracy, especially if you have
an uneven class distribution.
• Accuracy works best if false positives and false negatives have similar
cost. If the costs of false positives and false negatives are very
different, it’s better to look at both Precision and Recall.
Model Evaluation
• CONFUSION MATRIX Example
• F1-score:
• Ranges from 0 to 1, with 1 being the best score.
• Combines the strengths of precision and recall into a single metric.
• Useful when a balanced evaluation of both aspects is needed.
Model Evaluation
• CONFUSION MATRIX Example
• F1 Score = 2 * (Recall * Precision) / (Recall + Precision)
           = 2 * (3/4 * 3/5) / (3/4 + 3/5)
           = 2 * 0.45 / 1.35
           = 0.9 / 1.35
           ≈ 0.67
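The arithmetic for all four metrics can be checked in plain Python, starting from the four cells of the worked confusion matrix (TP = 3, FP = 2, FN = 1, TN = 1):

```python
# Cell counts from the worked confusion-matrix example above.
TP, FP, FN, TN = 3, 2, 1, 1

accuracy = (TP + TN) / (TP + TN + FP + FN)          # 4/7 ≈ 0.57
precision = TP / (TP + FP)                          # 3/5 = 0.60
recall = TP / (TP + FN)                             # 3/4 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # 2/3 ≈ 0.67

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```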
Model Evaluation
• AREA UNDER THE ROC CURVE (AUC)
• It can only be used to assess classifiers that return some confidence score (or a
probability) for each prediction. For example, logistic regression, neural networks, and
decision trees (and ensemble models based on decision trees) can be assessed
using ROC curves.
• The ROC curve plots the true positive rate (TPR) against the false
positive rate (FPR), which are given as:

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
Model Evaluation
• AREA UNDER THE ROC CURVE (AUC)
• The higher the area under the ROC curve (AUC), the better the
classifier. A classifier with an AUC higher than 0.5 is better than
a random classifier. If AUC is lower than 0.5, then something is
wrong with your model. A perfect classifier would have an AUC
of 1.
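As a sketch, AUC can be computed with scikit-learn's `roc_auc_score` from a classifier's confidence scores (scikit-learn assumed installed; the labels and scores below are hypothetical):

```python
from sklearn.metrics import roc_auc_score

# Hypothetical spam labels and classifier confidence scores for six emails.
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.2, 0.7, 0.25, 0.3, 0.1]

# AUC equals the probability that a randomly chosen spam email
# receives a higher score than a randomly chosen non-spam email.
print("AUC:", roc_auc_score(y_true, y_score))
```

Here 8 of the 9 (spam, non-spam) pairs are ranked correctly, so the AUC is 8/9 ≈ 0.89, i.e., better than a random classifier (0.5) but not perfect (1.0).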
Thanks!!!
