Machine Learning: Text Classification
Learning Objectives

Upon completing this topic, students will:

• Learn about text classification

• Understand the basic theory underlying machine learning

• Be introduced to how a naive Bayes text classifier works
What is Text Classification?

• Text classification is the process of assigning a labeled category, known as a class, to text.

• Examples:
  - Email → Spam, Not Spam
  - Product review → Positive, Negative
  - Business article → Accounting, Finance, Economics, Marketing
Why Use Machine Learning?

• We have more text data than ever before.

• Programming a rules-based system for text classification (using a series of IF-THEN statements) could require hundreds of rules. This approach works well only if you know all the situations under which decisions can be made.

• For large data sets, machine learning systems are often very efficient at classifying text.
Overview of Machine Learning

• Machine learning uses statistical techniques, allowing computers to “learn” from data to identify patterns and make predictions without being explicitly programmed.

• Machine learning can be supervised, unsupervised, semi-supervised, or reinforcement-based.
Supervised or Unsupervised?

Machine learning can be:

• Supervised: a predictive model is developed based on labeled training data.

• Unsupervised: the discovery of an internal pattern or structure is based on unlabeled training data.
Supervised or Unsupervised? (cont.)

• With supervised machine learning, the model is trained using examples of labeled input-output pairs.

• In other words, the computer is first shown a set of correctly labeled examples. It creates a predictive model based on those examples.
Supervised Machine Learning Example

• For example, a set of news articles could be pre-labeled as being about sports, entertainment, or politics.

• The supervised machine learning algorithm would “learn” from that labeled data to create a predictive model for classifying a new, unlabeled article into one of the three classes.
Unsupervised Machine Learning Example

• An example of unsupervised machine learning would use unlabeled news articles as the training data. The output classes in this case would not be known in advance.

• The algorithm could then, for example, be used to organize the articles into “clusters” based on similarity of content.
Importance of Labeled Data to Supervised Learning

• Good quality labeled data is vitally important for training and testing supervised machine learning systems.

• Human annotators are often used to classify data that has not been previously labeled. Creating a set of training data in this way can be expensive and time-consuming.
Importance of Labeled Data to Supervised Learning (cont.)

• The data used in machine learning analysis may change frequently, so it is important to consider whether the training data is still relevant.

• A large amount of training data may be needed for the system to come up with a generalizable model.

• More training data will typically lead to better results.
Supervised Machine Learning Overview – Step 1

Labeled Data → Training Data + Testing Data

• You begin by dividing your labeled data into a set of training data and a set of testing data.
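Below is a minimal sketch of this step using Scikit-learn, the Python library named later in these slides; the example texts and labels reuse the spam example introduced later, and the 80/20 split ratio is an illustrative assumption:

```python
# A minimal sketch of Step 1: splitting labeled data with scikit-learn.
from sklearn.model_selection import train_test_split

texts = ["Follow-up meeting", "Free cash. Get money.",
         "Money! Money! Money!", "Dinner plans", "GET CASH NOW"]
labels = ["not spam", "spam", "spam", "not spam", "spam"]

# Hold out 20% of the labeled data as the testing set (an assumed ratio).
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42)
```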
Supervised Machine Learning Overview – Step 2

Training Data → Machine Learning Algorithm → Predictive Model

• Using the training data, the machine learning algorithm creates a predictive model. In our case, this predictive model is a text classifier.
Supervised Machine Learning Overview – Step 3

Testing Data → Predictive Model

• You use the testing data to validate the predictive model.
Supervised Machine Learning Overview – Step 4

Testing Data → Predictive Model → Test Results → Modifications

• Based on the test results, you may wish to make modifications to improve how the machine learning algorithm is working. Or, you may want to train on more data.
Supervised Machine Learning Overview – Step 5

Unlabeled Data → Predictive Model → Labeled Data

• When you are satisfied with your results, you use the predictive model to label new, unclassified data.

• For example: News articles → Text Classifier → Politics, Sports, or Entertainment
Evaluating Performance

• One metric for evaluating a text classifier is its accuracy:

  Accuracy = Number of Correct Predictions / Total Number of Predictions
Accuracy

• For binary classification, accuracy is calculated from the counts of true and false positives and negatives:

  Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)

• Accuracy may be misleading when there is a large imbalance between the number of positive and negative labels.
Precision

• Precision is another measure for evaluating classification models. Precision is the proportion of positive identifications that were actually correct:

  Precision = True Positives / (True Positives + False Positives)

• When there are no false positives, precision = 1.0.
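As a minimal sketch of both calculations in Python, with invented confusion-matrix counts as placeholders:

```python
# Accuracy and precision from confusion-matrix counts.
# The four counts below are invented placeholders for illustration.
tp, tn, fp, fn = 40, 45, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 85 / 100 = 0.85
precision = tp / (tp + fp)                  # 40 / 45 ≈ 0.889

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}")
```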
Using a Naive Bayes Classifier

• To create our text classification tool, we used Scikit-learn, a free software machine learning library for the Python programming language.

• Our classification tool uses the multinomial naive Bayes algorithm.

• There are many other potential algorithms you could use for text classification.

• For simplicity and effectiveness, naive Bayes is a good candidate for an introduction to machine learning.
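The slides do not include the tool's actual source code, so the following is only a minimal sketch of a multinomial naive Bayes text classifier in Scikit-learn; the tiny training corpus reuses the spam example introduced later in these slides:

```python
# A minimal multinomial naive Bayes text classifier with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["Follow-up meeting", "Free cash. Get money.",
               "Money! Money! Money!", "Dinner plans", "GET CASH NOW"]
train_labels = ["not spam", "spam", "spam", "not spam", "spam"]

# CountVectorizer turns each document into word counts (a bag of words);
# MultinomialNB is then trained on those word frequencies.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["Get money now"]))  # -> ['spam']
```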
Introduction to Naive Bayes

• Using Bayesian probability, you reason backwards to find out the events—or random variables—that most likely led to a specific outcome.

• In this text classification model, these random variables will be the words in the document and their frequencies. With a multinomial naive Bayes classifier, word frequency is the feature that the algorithm is trained on. The outcome is the class, or category, to which the model will assign the document.
Bayes’ Theorem

  P(A|B) = P(B|A) × P(A) / P(B)

• P(A|B) is the probability of event A, given B. It equals the probability of event B, given A, multiplied by the prior probability of A, divided by the prior probability of B.
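A quick numeric illustration of the theorem in Python, with invented probabilities:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
def bayes(p_b_given_a, p_a, p_b):
    return p_b_given_a * p_a / p_b

# Invented numbers: P(B|A) = 0.8, P(A) = 0.3, P(B) = 0.4.
print(round(bayes(0.8, 0.3, 0.4), 3))  # 0.6
```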
Naive Bayes Assumption

• The naive Bayes algorithm is based on Bayes’ theorem of conditional probability, applied with the “naive” assumption that there is independence between features.

• In other words, the effect of the value of a predictor (x) on a given class (c) is independent of the values of other predictors.
Naive Bayes Assumption (cont.)

• Is the occurrence of a particular feature really independent of the occurrence of other features?

• Specifically, thinking in terms of text classification, is the occurrence of a particular word really independent of the occurrence of other words?
Naive Bayes Assumption (cont.)

• No. In text, the appearance of certain words obviously depends upon the use of other words. The naive Bayes algorithm, however, always assumes independence.

• Using a naive Bayes classifier, we ignore word order and sentence construction. We treat each document as if it is a “bag of words,” as the short sketch below illustrates.
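A tiny sketch of the bag-of-words idea in plain Python; the two phrases below differ only in word order, so their bags are identical:

```python
# Bag of words: only word identities and counts matter, not their order.
from collections import Counter

bag_a = Counter("get money now".split())
bag_b = Counter("now get money".split())

print(bag_a)           # Counter({'get': 1, 'money': 1, 'now': 1})
print(bag_a == bag_b)  # True: word order is ignored
```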
Naive Bayes Classification Example

• For the purposes of illustrating how a naive Bayes classifier works, imagine five extremely short email documents, each only a few words long.

• In this example, the text classification problem is to classify each email as being either spam (unsolicited anonymous bulk advertising) or not spam.
Naive Bayes Classification (cont.)

• An annotator has already labeled each of the five emails as being either spam or not spam. This is considered our training set of data.

  Doc 1   Follow-up meeting       Not Spam
  Doc 2   Free cash. Get money.   SPAM
  Doc 3   Money! Money! Money!    SPAM
  Doc 4   Dinner plans            Not Spam
  Doc 5   GET CASH NOW            SPAM
Naive Bayes Classification (cont.)

• Consider the following new, unclassified example:

  Doc 6   Get Money Now           Spam or Not Spam?

• The naive Bayes algorithm will calculate the probability, based on the previous results from the training data, that Doc 6 is spam. It will also calculate the probability that it is not spam.
Applying Bayes’ Theorem

  P(spam|words) = P(words|spam) × P(spam) / P(words)

• The probability of an email being spam, given the words in the email, equals the probability of the words in the email, given that the email is spam, multiplied by the prior probability of the email being spam, divided by the prior probability of the words used in the email.
Prior Probability

• First, we need to calculate the prior probability of each class.

• In other words, in our training set, what was the overall probability that a document was spam? What was the overall probability that a document was not spam?

  Doc 1   Not Spam
  Doc 2   SPAM
  Doc 3   SPAM
  Doc 4   Not Spam
  Doc 5   SPAM

  P(Spam) = 3/5 = .6
  P(Not Spam) = 2/5 = .4
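The same priors, computed in a short Python sketch that mirrors the training labels above:

```python
# Class priors from the five training labels on the slide.
from collections import Counter

labels = ["not spam", "spam", "spam", "not spam", "spam"]
priors = {cls: n / len(labels) for cls, n in Counter(labels).items()}
print(priors)  # {'not spam': 0.4, 'spam': 0.6}
```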
Naive Bayes Classification

• Applying Bayes’ theorem, with P(spam) = .6 and P(not spam) = .4:

  P(spam|get money now) = P(get money now|spam) × (.6) / P(get money now)

  P(not spam|get money now) = P(get money now|not spam) × (.4) / P(get money now)
Naive Bayes Classification (cont.)

• To classify the text as spam or not spam, we will look for the label having the bigger probability. Therefore, we can eliminate the divisor P(get money now), which is the same for both labels (∝ below means “is proportional to”):

  P(spam|get money now) ∝ P(get money now|spam) × (.6)

  P(not spam|get money now) ∝ P(get money now|not spam) × (.4)
Naive Bayes Classification (cont.)

• Recall, in naive Bayes each feature is seen as being independent.

• In other words, within each class, each word contributes its own probability:

  P(get money now|class) = P(get|class) × P(money|class) × P(now|class)
Word Frequency Table

• How many times each word occurs in each class is counted in the training data.

  Word        Not Spam   Spam
  follow-up   1          0
  meeting     1          0
  free        0          1
  cash        0          2
  money       0          4
  dinner      1          0
  plans       1          0
  get         0          2
  now         0          1
  Total       4          9
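A sketch that reproduces these counts; the lowercasing and the simple regex tokenizer are assumptions made for illustration:

```python
# Per-class word counts for the five training emails.
# Lowercasing and the regex tokenizer are simplifying assumptions.
import re
from collections import Counter

docs = [("Follow-up meeting", "not spam"), ("Free cash. Get money.", "spam"),
        ("Money! Money! Money!", "spam"), ("Dinner plans", "not spam"),
        ("GET CASH NOW", "spam")]

counts = {"not spam": Counter(), "spam": Counter()}
for text, label in docs:
    counts[label].update(re.findall(r"[a-z]+(?:-[a-z]+)?", text.lower()))

print(counts["spam"]["money"])       # 4
print(counts["not spam"]["plans"])   # 1
print(sum(counts["spam"].values()))  # 9
```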
Calculating Conditional Probabilities

• The conditional probability of each word occurring in a given class is calculated.

  Word        Not Spam     Spam
  follow-up   1/4 = .25    0/9 = .00
  meeting     1/4 = .25    0/9 = .00
  free        0/4 = .00    1/9 = .11
  cash        0/4 = .00    2/9 = .22
  money       0/4 = .00    4/9 = .44
  dinner      1/4 = .25    0/9 = .00
  plans       1/4 = .25    0/9 = .00
  get         0/4 = .00    2/9 = .22
  now         0/4 = .00    1/9 = .11

• However, what happens when words do not appear in a class of the training data?
Avoiding Zeros in the Calculation

• As the table above shows, words not occurring in a class have a conditional probability of 0.

• Multiplying by zero would nullify the naive Bayes calculation. We need to smooth the data.
Laplace Smoothing

• Using Laplace smoothing, we go back and add 1 to every word count.

• For balance, we add the number of all unique possible words (the total vocabulary, here 9) to the divisor, so the resulting fraction can never be greater than 1.

  Word        Not Spam      Spam
  follow-up   1 + 1         0 + 1
  meeting     1 + 1         0 + 1
  free        0 + 1         1 + 1
  cash        0 + 1         2 + 1
  money       0 + 1         4 + 1
  dinner      1 + 1         0 + 1
  plans       1 + 1         0 + 1
  get         0 + 1         2 + 1
  now         0 + 1         1 + 1
  Total       4 + 9 = 13    9 + 9 = 18
Class Conditional Probabilities

• Now that we have smoothed the data, the conditional probability of each word occurring in a given class is recalculated.

  Word        Not Spam      Spam
  follow-up   2/13 = .153   1/18 = .055
  meeting     2/13 = .153   1/18 = .055
  free        1/13 = .077   2/18 = .111
  cash        1/13 = .077   3/18 = .167
  money       1/13 = .077   5/18 = .278
  dinner      2/13 = .153   1/18 = .055
  plans       2/13 = .153   1/18 = .055
  get         1/13 = .077   3/18 = .167
  now         1/13 = .077   2/18 = .111
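A sketch reproducing the smoothed values; the spam counts and the vocabulary of 9 unique words come from the tables above:

```python
# Laplace-smoothed conditional probabilities, P(word|class).
VOCAB_SIZE = 9  # unique words across the whole training set
spam_counts = {"free": 1, "cash": 2, "money": 4, "get": 2, "now": 1}
spam_total = 9  # total word occurrences in the spam class

def smoothed(word, class_counts, class_total):
    # Add 1 to the word count; add the vocabulary size to the divisor.
    return (class_counts.get(word, 0) + 1) / (class_total + VOCAB_SIZE)

print(round(smoothed("money", spam_counts, spam_total), 3))   # 0.278
print(round(smoothed("dinner", spam_counts, spam_total), 3))  # 0.056
```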
Naive Bayes Classification

• We calculate the probability of an email containing the words “get money now” in each of the two classes:

  P(spam|get money now) ∝ P(get|spam) × P(money|spam) × P(now|spam) × (.6)
                        = .167 × .278 × .111 × .6 = .0031

  P(not spam|get money now) ∝ P(get|not spam) × P(money|not spam) × P(now|not spam) × (.4)
                            = .077 × .077 × .077 × .4 = .0002
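The same arithmetic as a short check in Python, using the rounded table values from above:

```python
# Final naive Bayes scores for "get money now" (divisor dropped),
# using the rounded, smoothed word probabilities and class priors.
score_spam     = 0.167 * 0.278 * 0.111 * 0.6  # spam words, times P(spam)
score_not_spam = 0.077 * 0.077 * 0.077 * 0.4  # not-spam words, times P(not spam)

print(round(score_spam, 4), round(score_not_spam, 4))         # 0.0031 0.0002
print("spam" if score_spam > score_not_spam else "not spam")  # spam
```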
Naive Bayes Classification (cont.)

• Comparing the two values calculated above, a higher value indicates a higher probability. In this case, there is a higher probability of these words occurring in an email that is spam. The classifier would therefore label the email as spam.
Features for Spam Classification

This was a simplified example. In reality, spam detection systems use more than just individual words to classify email as spam or not spam. Spam classifiers use many additional features, including:

• Sender’s address
• IP address
• Use of capitalization
• Specific phrases
• Whether or not the text contains a link
Term Frequency - Inverse Document Frequency (tf-idf)

• Some words appear much more often in the English language than other words. In a text classifier, a word like “the” can easily overshadow the frequencies of less common words that are more meaningful.

• In text analysis, a weighting process called tf-idf, for “term frequency times inverse document frequency,” is therefore often applied to the raw term frequency.
TF-IDF

• The classification tool we built for the assignment uses the following tf-idf weighting method:

  weight = tf × idf

• Term frequency (tf) is the number of times the term appears in the document.
TF-IDF (cont.)

• Inverse document frequency (idf) is the log of the total number of documents divided by the number of documents containing the term. One is added to the denominator to avoid zero division:

  idf = log( Number of documents / (1 + Number of documents containing the term) )

• The higher the tf-idf weight, the rarer the term, and vice versa.
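A sketch of this weighting written out by hand, since library implementations such as Scikit-learn's TfidfVectorizer use slightly different smoothing conventions; the example counts are invented:

```python
# tf-idf per the slide's formula: tf * log(N / (1 + df)).
import math  # math.log is the natural logarithm

def tf_idf(tf, n_docs, n_docs_with_term):
    idf = math.log(n_docs / (1 + n_docs_with_term))
    return tf * idf

# Invented counts: a term appearing 3 times in one document,
# found in 4 of 100 documents in the collection.
print(round(tf_idf(3, 100, 4), 3))  # 8.987
```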
Conclusion

• Machine learning uses statistical techniques to give computers the ability to “learn” from data to identify patterns and make predictions without having to be manually programmed using a rules-based system.

• Good quality labeled data is vitally important for training and testing supervised machine learning systems. In general, using more training data yields better results.
Conclusion (cont.)

• Two measures used to evaluate the performance of a text classifier are accuracy and precision.

• Naive Bayes is one type of algorithm that can be used in machine learning. Using Bayesian probability, you reason backwards to find out the events—or random variables—that most likely led to a specific outcome.

• In naive Bayes text classification, the random variables used are the specific words in the text.
Conclusion (cont.)

• The naive Bayes algorithm is based on Bayes’ theorem of conditional probability, applied with the “naive” assumption that there is independence between every pair of features.

• This assumption of independence means the classifier ignores word order and sentence construction, treating each document as if it is a “bag of words.”
