NSAI Notes Unit3

The document discusses email spamming, including its definition, incidents, and datasets for spam detection. It outlines the application of Natural Language Processing (NLP) techniques for detecting spam emails, detailing steps from data collection to model evaluation. Additionally, it emphasizes the importance of various metrics like accuracy, precision, and recall in assessing machine learning models for spam detection.


Network Security Application

Using AI
Course Code: BTAIML 705 20
Unit 3: Email Spamming and Spam Detection
Prepared By:
Anju Bala
CSE Department
Outline
• What is Email Spamming?
• Incidents of email spamming
• Datasets related to e-mail spamming.
• Natural language processing of e-mails for spam detection
• Empirical analysis of machine learning models in terms of
Accuracy
Precision
Recall
F1-Score
AUC
Email Spamming
• Email spam refers to unsolicited or unwanted
emails sent in bulk to a large number of
recipients.
• These messages are typically sent by
individuals or organizations with the intention
of promoting products, services, or malicious
content.
• Email spam can also be used to spread
malware or viruses, and it can be a nuisance
to users who receive a large volume of
unwanted emails.
Incidents of e-mail spamming
• This email claims to be from
Netflix, warning the recipient
that their password is due to
expire soon and urging them to
reset it immediately due to a
purported increase in account
compromises. It includes a
"Reset Password" link, which is,
in reality, a phishing attempt to
steal Netflix login credentials.
Incidents of e-mail spamming
• This email purports to be a
notification from American
Express, indicating that either
the recipient or an
authorized user has
requested a new card for
their account. It includes links
to confirm the card request
or to indicate that something
is wrong, as well as a general
link to sign into the account.
Incidents of e-mail spamming
• This email imitates a notification
from GitHub, informing the
recipient that a third-party
OAuth application (AWS
CodeBuild) has been authorized
to access their account with
specific permissions. It includes
phishing links leading to GitHub
settings and support pages,
aimed at stealing GitHub
credentials or deploying
malware.
Incidents of e-mail spamming
• This email impersonates
a security alert from
Instagram, notifying the
recipient about a
suspicious login attempt
from an unusual
location and device. It
provides a fake phishing
link for the recipient to
secure their account.
Incidents of e-mail spamming
• Adopting Zoom's
familiar branding, this
email announces a
"Quarterly All Hands"
meeting, urging users
to confirm their
account. It directs
them to a button,
which leads to a
fraudulent Zoom
login page.
Incidents of e-mail spamming
• A seemingly
innocent email
poses as Google
Drive, inviting you to
an "Office Holiday
Party" with a simple
"Open" button.
Clicking leads to a
phishing site aiming
to snatch your Gmail
credentials.
Incidents of e-mail spamming
• Disguised as an
offer from HR, this
email dangles a job
with a hefty salary
and generic
requirements and
asks you to fill out
and return an
attached file.
Datasets related to e-mail
spamming
• For developing and training models to detect email spam, using high-
quality datasets is crucial. Here are some commonly used and publicly
available datasets related to email spamming:
• 1. Enron Spam Dataset
• Description: Contains emails from the Enron Corporation, labeled as
spam or ham (non-spam). It’s a well-known dataset for spam detection
and contains a mix of both legitimate and spam emails.
• Access: Available on Kaggle and other data repositories.
Datasets related to e-mail
spamming
2. SpamAssassin Public Corpus
• Description: A collection of emails labeled as spam or ham, maintained
by the SpamAssassin project. This dataset includes a variety of spam
types and is used extensively for spam filtering research.
• Access: Available on the SpamAssassin website.
Datasets related to e-mail
spamming
3. Ling-Spam Dataset
• Description: A dataset used for spam filtering research, containing
emails labeled as spam or non-spam. It’s often used in academic studies
and benchmarks.
• Access: Available at the UCI Machine Learning Repository.
Datasets related to e-mail
spamming
4. TREC Public Spam Corpus
• Description: Part of the Text REtrieval Conference (TREC), this dataset
includes spam and non-spam emails. It is used for evaluating spam
detection systems.
• Access: Available on the TREC website.
Datasets related to e-mail
spamming
5. Kaggle’s Spam SMS Dataset
• Description: A dataset consisting of SMS messages labeled as spam or
ham, often used for mobile spam detection but also applicable to
general spam filtering.
• Access: Available on Kaggle.

• These datasets can be used to train, validate, and test spam detection
models. Each dataset varies in size and complexity, so choosing the one
that best fits your needs depends on the specific requirements of your
spam detection project.
Natural language processing of
e-mails for spam detection
• Natural Language Processing (NLP) is a powerful
technique for detecting spam in emails. Here's a step-
by-step guide to applying NLP for email spam detection:
1. Data Collection and Preprocessing
• Collect Data
• Gather Datasets: Use publicly available datasets such
as the Enron Spam Dataset, SpamAssassin Public
Corpus, or others as mentioned previously.
Natural language processing of
e-mails for spam detection
• Preprocess Data
• Text Cleaning: Remove unwanted characters, HTML tags, or
email headers.
• Tokenization: Split email text into individual words or tokens.
• Normalization: Convert text to lowercase to ensure
consistency.
• Remove Stop Words: Remove common words (e.g., "the,"
"and") that don't contribute to spam classification.
• Stemming/Lemmatization: Reduce words to their base or
root forms (e.g., "running" to "run").
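The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration only: the stop-word list is a tiny hand-rolled sample and the "stemmer" is a crude suffix strip; a real pipeline would use a library such as NLTK or spaCy for both.

```python
import re

# Tiny illustrative stop-word list; real pipelines use NLTK's or spaCy's.
STOP_WORDS = {"the", "and", "is", "a", "an", "to", "of"}

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)       # text cleaning: strip HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # drop punctuation and digits
    tokens = text.lower().split()              # normalization + tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Crude stand-in for stemming: strip an "-ing" suffix
    # (a real pipeline would use e.g. NLTK's PorterStemmer).
    return [t[:-3] if t.endswith("ing") else t for t in tokens]

print(preprocess("<p>WIN the FREE prize by running now!</p>"))
```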
Natural language processing of
e-mails for spam detection
2. Feature Extraction
Convert Text to Numerical Features
• Bag of Words (BoW): Represents text data as a matrix of
word counts. Example: "This is spam" → [1, 1, 1, 0, 0] (where each
position represents the count of a specific word).
• Term Frequency-Inverse Document Frequency (TF-IDF): Adjusts
word counts by how often they appear in the entire
dataset. Example: words that appear frequently in one document
but rarely in others are given higher weights.
Natural language processing of
e-mails for spam detection
2. Feature Extraction
Convert Text to Numerical Features
• Word Embeddings: Use pre-trained embeddings like Word2Vec,
GloVe, or BERT to capture the semantic meanings of words. Example:
"spam" and "junk" will have similar embeddings.
Natural language processing of
e-mails for spam detection
• 3. Model Selection and Training
• Choose a Model
• Naive Bayes: Often used for text classification due to its simplicity
and effectiveness in handling text data.
• Logistic Regression: A common choice for binary classification
tasks.
• Support Vector Machines (SVM): Effective in high-dimensional
spaces and for text classification.
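As an illustration, a Naive Bayes spam classifier can be trained in a few lines with scikit-learn (assumed installed). The six emails and labels below are a hypothetical toy corpus, not a real dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical six-email training set (1 = spam, 0 = ham).
emails = ["win a free prize now", "claim your free reward today",
          "urgent offer click now", "meeting agenda for tomorrow",
          "lunch at noon?", "project status report attached"]
labels = [1, 1, 1, 0, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(emails)           # bag-of-words features
clf = MultinomialNB().fit(X, labels)    # per-class word likelihoods with smoothing

# Classify two new messages using the same fitted vocabulary.
new = vec.transform(["free prize offer", "status report for the meeting"])
print(clf.predict(new))
```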
Natural language processing of
e-mails for spam detection
• 3. Model Selection and Training
• Deep Learning Models:
• Recurrent Neural Networks (RNNs): Good for sequential data like
text.
• Long Short-Term Memory (LSTM): An advanced RNN variant that
handles long-range dependencies.
• Transformers: Models like BERT or GPT can understand context
and semantics better.
Natural language processing of
e-mails for spam detection
• 4. Evaluation
• Assess Model Performance
• Accuracy: Measure the overall correctness of the model.
• Precision: The proportion of true spam emails out of all emails classified
as spam.
• Recall: The proportion of true spam emails out of all actual spam emails.
• F1-Score: The harmonic mean of precision and recall, useful for
balancing the two metrics.
• Confusion Matrix: Visualizes the performance of the model, showing
true positives, false positives, true negatives, and false negatives.
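All of these metrics are available in scikit-learn (assumed installed). The sketch below computes them for a hypothetical set of seven predictions:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical actual and predicted labels for seven emails (1 = spam, 0 = ham).
y_true = [0, 1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 0, 1, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))  # rows = actual class, columns = predicted
print("accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / total
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))         # harmonic mean of the two
```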
Natural language processing of
e-mails for spam detection
5. Deployment and Monitoring
• Deploy the Model
• Integrate: Embed the model into your email system to classify
incoming emails in real time or in batch mode.
• Set Thresholds: Determine how aggressive the spam filter should
be (e.g., what percentage likelihood triggers a spam flag).
Natural language processing of
e-mails for spam detection
5. Deployment and Monitoring
• Monitor and Update
• Track Performance: Regularly monitor the model’s performance to
ensure it remains effective.
• Retrain: Update the model with new data to adapt to evolving
spam techniques.
• Handle Feedback: Adjust based on user feedback to reduce false
positives and false negatives.
Natural language processing of
e-mails for spam detection
• Example Workflow
1. Data Preprocessing: Clean and prepare the email text data.
2. Feature Extraction: Convert emails into numerical features using
TF-IDF or word embeddings.
3. Model Training: Train a Naive Bayes classifier with the processed
data.
Natural language processing of
e-mails for spam detection
• Example Workflow
4. Evaluation: Assess the model using metrics such as F1-score.
5. Deployment: Implement the model in an email system to
classify incoming messages.
6. Monitoring: Regularly update the model to handle new spam
tactics.
By following these steps, you can effectively use NLP to build a
robust spam detection system for emails.
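The workflow above can be sketched end to end with scikit-learn (assumed installed). The corpus below is a hypothetical stand-in for a real dataset such as Enron Spam:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for a real dataset (1 = spam, 0 = ham).
spam = ["win a free prize now", "claim your free reward", "urgent offer click here",
        "free money guaranteed", "you won the lottery", "cheap pills online"]
ham = ["meeting agenda for tomorrow", "lunch at noon", "project status report",
       "see attached invoice", "notes from the call", "schedule for next week"]
X, y = spam + ham, [1] * len(spam) + [0] * len(ham)

# Steps 1-2: split the data; TF-IDF feature extraction happens inside the pipeline.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Step 3: train a Naive Bayes classifier on the TF-IDF features.
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(X_train, y_train)

# Step 4: evaluate on the held-out emails.
print("F1 on test set:", f1_score(y_test, pipe.predict(X_test)))
```

In step 5, the fitted `pipe` object would be saved and called on each incoming email.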
Empirical analysis of machine learning models in terms of
Accuracy, Precision, Recall, F1-Score, AUC
Model Evaluation
• For a classification problem, we should be equipped with
several assessment metrics to analyze the classification
algorithm. They are:
1. Confusion Matrix
2. Precision
3. Recall
4. Accuracy
5. Area under the ROC curve (AUC)
Model Evaluation
• CONFUSION MATRIX
• The confusion matrix is a table that summarizes how
successful the classification model is at predicting
examples belonging to various classes.
• One axis of the confusion matrix is the label that the
model predicted, and the other axis is the actual label.
• Let's take an example of classifying whether a person
has heart disease or not:
Model Evaluation
• CONFUSION MATRIX
In this confusion matrix
table, the green diagonal
shows the cases where the
actual class and the
model's prediction agree,
while the red diagonal
shows the cases where the
model predicted a
different class from the
actual one.
We can use a confusion
matrix when comparing
different models by
looking at how well each
one predicts the true
classes.
Model Evaluation
• CONFUSION MATRIX Example
Consider seven examples with actual labels Y1 and predicted labels Y2:

X1   X2   Y1 (Actual)   Y2 (Predicted)
-    -    0             1
-    -    1             1
-    -    0             0
-    -    1             1
-    -    1             1
-    -    0             1
-    -    1             0

Tallying each (Actual, Predicted) pair one row at a time builds up the
confusion matrix:

              Actual 1   Actual 0
Predicted 1   3 (TP)     2 (FP)
Predicted 0   1 (FN)     1 (TN)
Model Evaluation
• CONFUSION MATRIX Example
• Accuracy
• It is defined as the number of correctly classified examples divided by
the total number of classified examples. In terms of the confusion matrix:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
         = (3 + 1) / (3 + 1 + 2 + 1)
         = 4/7
         ≈ 57%
Model Evaluation
• CONFUSION MATRIX Example
• Precision
• It is defined as the number of correct positive predictions divided by the total
number of positive predictions made by the model:

Precision = TP / (TP + FP)
          = 3 / (3 + 2)
          = 3/5
          = 60%

• Precision is a helpful metric when you want to minimize False Positives.
• For example, imagine you’re the loan officer at a bank. You don’t want to approve a loan (Positive)
for someone who won’t be able to repay. In reality, their loan should not be approved (Negative).
• In such cases, you should aim for a model with high Precision.
Model Evaluation
• CONFUSION MATRIX Example
• Recall
• It is calculated as the number of true positives divided by the total number of true
positives and false negatives:

Recall = TP / (TP + FN)
       = 3 / (3 + 1)
       = 3/4
       = 75%

• Recall is a helpful metric when you want to minimize False Negatives.
• For example, imagine you’re testing whether a patient is infected with a dangerous virus. You
don’t want a model that predicts a patient is not infected (Negative) when the patient
actually has the virus (Positive).
• In such cases, you should aim for a model with high Recall.
Model Evaluation
• CONFUSION MATRIX Example
• F1 score is a weighted average of precision and recall.
• Since precision involves false positives and recall involves false
negatives, the F1 score accounts for both.
• F1 score is usually more useful than accuracy, especially if you have
an uneven class distribution.
• Accuracy works best if false positives and false negatives have similar
cost. If the costs of false positives and false negatives are very
different, it’s better to look at both Precision and Recall.
Model Evaluation
• CONFUSION MATRIX Example
• F1-score:
• Ranges from 0 to 1, with 1 being the best score.
• Combines the strengths of precision and recall into a single metric.
• Useful when a balanced evaluation of both aspects is needed.
Model Evaluation
• CONFUSION MATRIX Example
• F1 Score = 2 * (Recall * Precision) / (Recall + Precision)
           = 2 * (3/4 * 3/5) / (3/4 + 3/5)
           = 2 * 0.45 / 1.35
           = 0.9 / 1.35
           ≈ 0.67
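The arithmetic for all four metrics can be checked in plain Python, starting from the four cells of the worked confusion matrix (TP = 3, FP = 2, FN = 1, TN = 1):

```python
# Cell counts from the worked confusion-matrix example above.
TP, FP, FN, TN = 3, 2, 1, 1

accuracy = (TP + TN) / (TP + TN + FP + FN)          # 4/7 ≈ 0.57
precision = TP / (TP + FP)                          # 3/5 = 0.60
recall = TP / (TP + FN)                             # 3/4 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # 2/3 ≈ 0.67

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```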
Model Evaluation
• AREA UNDER THE ROC CURVE (AUC)
• It can only be used to assess classifiers that return some confidence score (or a
probability) for each prediction. For example, logistic regression, neural networks, and
decision trees (and ensemble models based on decision trees) can be assessed
using ROC curves.
• The ROC curve plots the true positive rate (TPR) against the false
positive rate (FPR), which are given as:

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
Model Evaluation
• AREA UNDER THE ROC CURVE (AUC)
• The higher the area under the ROC curve (AUC), the better the
classifier. A classifier with an AUC higher than 0.5 is better than
a random classifier. If AUC is lower than 0.5, then something is
wrong with your model. A perfect classifier would have an AUC
of 1.
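As a sketch, AUC can be computed with scikit-learn's `roc_auc_score` from a classifier's confidence scores (scikit-learn assumed installed; the labels and scores below are hypothetical):

```python
from sklearn.metrics import roc_auc_score

# Hypothetical spam labels and classifier confidence scores for six emails.
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.2, 0.7, 0.25, 0.3, 0.1]

# AUC equals the probability that a randomly chosen spam email
# receives a higher score than a randomly chosen non-spam email.
print("AUC:", roc_auc_score(y_true, y_score))
```

Here 8 of the 9 (spam, non-spam) pairs are ranked correctly, so the AUC is 8/9 ≈ 0.89, i.e., better than a random classifier (0.5) but not perfect (1.0).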
Thanks!!!
