Machine Learning: Text Classification
Learning Objectives

Upon completing this topic, students will:

• Learn about text classification

• Understand the basic theory underlying machine learning

• Be introduced to how a naive Bayes text classifier works
What is Text Classification?

• Text classification is the process of assigning a labeled category, known as a class, to text.

• Examples:
  - Email → Spam, Not Spam
  - Product review → Positive, Negative
  - Business article → Accounting, Finance, Economics, Marketing
Why Use Machine Learning?

• We have more text data than ever before.

• Programming a rules-based system for text classification (using a series of IF-THEN statements) could require hundreds of rules. This approach works well only if you know all the situations under which decisions can be made.

• For large data sets, machine learning systems are often very efficient at classifying text.
Overview of Machine Learning

• Machine learning uses statistical techniques, allowing computers to “learn” from data to identify patterns and make predictions without being explicitly programmed.

• Machine learning can be supervised, unsupervised, semi-supervised, or reinforcement-based.
Supervised or Unsupervised?

Machine learning can be:

• Supervised: a predictive model is developed based on labeled training data.

• Unsupervised: the discovery of an internal pattern or structure is based on unlabeled training data.
Supervised or Unsupervised? (cont.)

• With supervised machine learning, the model is trained using examples of labeled input-output pairs.

• In other words, the computer is first shown a set of correctly labeled examples. It creates a predictive model based on those examples.
Supervised Machine Learning Example

• For example, a set of news articles could be pre-labeled as being about sports, entertainment, or politics.

• The supervised machine learning algorithm would “learn” from that labeled data to create a predictive model for classifying a new, unlabeled article into one of the three classes.
Unsupervised Machine Learning Example

• An example of unsupervised machine learning would use unlabeled news articles as the training data. The output classes in this case would not be known in advance.

• The algorithm could then, for example, be used to organize the articles into “clusters” based on similarity of content.
Importance of Labeled Data to Supervised Learning

• Good quality labeled data is vitally important for training and testing supervised machine learning systems.

• Human annotators are often used to classify data that has not been previously labeled. Creating a set of training data in this way can be expensive and time-consuming.
Importance of Labeled Data to Supervised Learning (cont.)

• The data used in machine learning analysis may change frequently, so it is important to consider whether the training data is still relevant.

• A large amount of training data may be needed for the system to come up with a generalizable model.

• More training data will typically lead to better results.
Supervised Machine Learning Overview – Step 1

Labeled Data → Training Data + Testing Data

• You begin by dividing your labeled data into a set of training data and a set of testing data.
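Below is a minimal sketch of this step using Scikit-learn, the Python library named later in these slides; the example texts and labels reuse the spam example introduced later, and the 80/20 split ratio is an illustrative assumption:

```python
# A minimal sketch of Step 1: splitting labeled data with scikit-learn.
from sklearn.model_selection import train_test_split

texts = ["Follow-up meeting", "Free cash. Get money.",
         "Money! Money! Money!", "Dinner plans", "GET CASH NOW"]
labels = ["not spam", "spam", "spam", "not spam", "spam"]

# Hold out 20% of the labeled data as the testing set (an assumed ratio).
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42)
```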
Supervised Machine Learning Overview – Step 2

Training Data → Machine Learning Algorithm → Predictive Model

• Using the training data, the machine learning algorithm creates a predictive model. In our case, this predictive model is a text classifier.
Supervised Machine Learning Overview – Step 3

Testing Data → Predictive Model

• You use the testing data to validate the predictive model.
Supervised Machine Learning Overview – Step 4

Testing Data → Predictive Model → Test Results → Modifications

• Based on the test results, you may wish to make modifications to improve how the machine learning algorithm is working. Or, you may want to train on more data.
Supervised Machine Learning Overview – Step 5

Unlabeled Data → Predictive Model → Labeled Data

• When you are satisfied with your results, you use the predictive model to label new, unclassified data.

• For example: News articles → Text Classifier → Politics, Sports, or Entertainment
Evaluating Performance

• One metric for evaluating a text classifier is its accuracy:

  Accuracy = Number of Correct Predictions / Total Number of Predictions
Accuracy

• For binary classification, accuracy is calculated from the counts of true and false positives and negatives:

  Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)

• Accuracy may be misleading when there is a large imbalance between the number of positive and negative labels.
Precision

• Precision is another measure for evaluating classification models. Precision is the proportion of positive identifications that were actually correct:

  Precision = True Positives / (True Positives + False Positives)

• When there are no false positives, precision = 1.0.
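As a minimal sketch of both calculations in Python, with invented confusion-matrix counts as placeholders:

```python
# Accuracy and precision from confusion-matrix counts.
# The four counts below are invented placeholders for illustration.
tp, tn, fp, fn = 40, 45, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 85 / 100 = 0.85
precision = tp / (tp + fp)                  # 40 / 45 ≈ 0.889

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}")
```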
Using a Naive Bayes Classifier

• To create our text classification tool, we used Scikit-learn, a free software machine learning library for the Python programming language.

• Our classification tool uses the multinomial naive Bayes algorithm.

• There are many other potential algorithms you could use for text classification.

• For simplicity and effectiveness, naive Bayes is a good candidate for an introduction to machine learning.
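The slides do not include the tool's actual source code, so the following is only a minimal sketch of a multinomial naive Bayes text classifier in Scikit-learn; the tiny training corpus reuses the spam example introduced later in these slides:

```python
# A minimal multinomial naive Bayes text classifier with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["Follow-up meeting", "Free cash. Get money.",
               "Money! Money! Money!", "Dinner plans", "GET CASH NOW"]
train_labels = ["not spam", "spam", "spam", "not spam", "spam"]

# CountVectorizer turns each document into word counts (a bag of words);
# MultinomialNB is then trained on those word frequencies.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["Get money now"]))  # -> ['spam']
```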
Introduction to Naive Bayes

• Using Bayesian probability, you reason backwards to find out the events—or random variables—that most likely led to a specific outcome.

• In this text classification model, these random variables will be the words in the document and their frequencies. With a multinomial naive Bayes classifier, word frequency is the feature that the algorithm is trained on. The outcome is the class, or category, to which the model will assign the document.
Bayes’ Theorem

  P(A|B) = P(B|A) × P(A) / P(B)

• P(A|B) is the probability of event A, given B. It equals the probability of event B, given A, multiplied by the prior probability of A, divided by the prior probability of B.
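A quick numeric illustration of the theorem in Python, with invented probabilities:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B).
def bayes(p_b_given_a, p_a, p_b):
    return p_b_given_a * p_a / p_b

# Invented numbers: P(B|A) = 0.8, P(A) = 0.3, P(B) = 0.4.
print(round(bayes(0.8, 0.3, 0.4), 3))  # 0.6
```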
Naive Bayes Assumption

• The naive Bayes algorithm is based on Bayes’ theorem of conditional probability, applied with the “naive” assumption that there is independence between features.

• In other words, the effect of the value of a predictor (x) on a given class (c) is independent of the values of other predictors.
Naive Bayes Assumption (cont.)

• Is the occurrence of a particular feature really independent of the occurrence of other features?

• Specifically, thinking in terms of text classification, is the occurrence of a particular word really independent of the occurrence of other words?
Naive Bayes Assumption (cont.)

• No. In text, the appearance of certain words obviously depends upon the use of other words. The naive Bayes algorithm, however, always assumes independence.

• Using a naive Bayes classifier, we ignore word order and sentence construction. We treat each document as if it is a “bag of words,” as the short sketch below illustrates.
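A tiny sketch of the bag-of-words idea in plain Python; the two phrases below differ only in word order, so their bags are identical:

```python
# Bag of words: only word identities and counts matter, not their order.
from collections import Counter

bag_a = Counter("get money now".split())
bag_b = Counter("now get money".split())

print(bag_a)           # Counter({'get': 1, 'money': 1, 'now': 1})
print(bag_a == bag_b)  # True: word order is ignored
```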
Naive Bayes Classification Example

• For the purposes of illustrating how a naive Bayes classifier works, imagine five extremely short email documents, each only a few words long.

• In this example, the text classification problem is to classify each email as being either spam (unsolicited anonymous bulk advertising) or not spam.
Naive Bayes Classification (cont.)

• An annotator has already labeled each of the five emails as being either spam or not spam. This is considered our training set of data.

  Doc 1   Follow-up meeting       Not Spam
  Doc 2   Free cash. Get money.   SPAM
  Doc 3   Money! Money! Money!    SPAM
  Doc 4   Dinner plans            Not Spam
  Doc 5   GET CASH NOW            SPAM
Naive Bayes Classification (cont.)

• Consider the following new, unclassified example:

  Doc 6   Get Money Now           Spam or Not Spam?

• The naive Bayes algorithm will calculate the probability, based on the previous results from the training data, that Doc 6 is spam. It will also calculate the probability that it is not spam.
Applying Bayes’ Theorem

  P(spam|words) = P(words|spam) × P(spam) / P(words)

• The probability of an email being spam, given the words in the email, equals the probability of the words in the email, given that the email is spam, multiplied by the prior probability of the email being spam, divided by the prior probability of the words used in the email.
Prior Probability

• First, we need to calculate the prior probability of each class.

• In other words, in our training set, what was the overall probability that a document was spam? What was the overall probability that a document was not spam?

  Doc 1   Not Spam
  Doc 2   SPAM
  Doc 3   SPAM
  Doc 4   Not Spam
  Doc 5   SPAM

  P(Spam) = 3/5 = .6
  P(Not Spam) = 2/5 = .4
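The same priors, computed in a short Python sketch that mirrors the training labels above:

```python
# Class priors from the five training labels on the slide.
from collections import Counter

labels = ["not spam", "spam", "spam", "not spam", "spam"]
priors = {cls: n / len(labels) for cls, n in Counter(labels).items()}
print(priors)  # {'not spam': 0.4, 'spam': 0.6}
```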
Naive Bayes Classification

• Applying Bayes’ theorem, with P(spam) = .6 and P(not spam) = .4:

  P(spam|get money now) = P(get money now|spam) × (.6) / P(get money now)

  P(not spam|get money now) = P(get money now|not spam) × (.4) / P(get money now)
Naive Bayes Classification (cont.)

• To classify the text as spam or not spam, we will look for the label having the bigger probability. Therefore, we can eliminate the divisor P(get money now), which is the same for both labels (∝ below means “is proportional to”):

  P(spam|get money now) ∝ P(get money now|spam) × (.6)

  P(not spam|get money now) ∝ P(get money now|not spam) × (.4)
Naive Bayes Classification (cont.)

• Recall, in naive Bayes each feature is seen as being independent.

• In other words, within each class, each word contributes its own probability:

  P(get money now|class) = P(get|class) × P(money|class) × P(now|class)
Word Frequency Table

• How many times each word occurs in each class is counted in the training data.

  Word        Not Spam   Spam
  follow-up   1          0
  meeting     1          0
  free        0          1
  cash        0          2
  money       0          4
  dinner      1          0
  plans       1          0
  get         0          2
  now         0          1
  Total       4          9
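A sketch that reproduces these counts; the lowercasing and the simple regex tokenizer are assumptions made for illustration:

```python
# Per-class word counts for the five training emails.
# Lowercasing and the regex tokenizer are simplifying assumptions.
import re
from collections import Counter

docs = [("Follow-up meeting", "not spam"), ("Free cash. Get money.", "spam"),
        ("Money! Money! Money!", "spam"), ("Dinner plans", "not spam"),
        ("GET CASH NOW", "spam")]

counts = {"not spam": Counter(), "spam": Counter()}
for text, label in docs:
    counts[label].update(re.findall(r"[a-z]+(?:-[a-z]+)?", text.lower()))

print(counts["spam"]["money"])       # 4
print(counts["not spam"]["plans"])   # 1
print(sum(counts["spam"].values()))  # 9
```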
Calculating Conditional Probabilities

• The conditional probability of each word occurring in a given class is calculated.

  Word        Not Spam     Spam
  follow-up   1/4 = .25    0/9 = .00
  meeting     1/4 = .25    0/9 = .00
  free        0/4 = .00    1/9 = .11
  cash        0/4 = .00    2/9 = .22
  money       0/4 = .00    4/9 = .44
  dinner      1/4 = .25    0/9 = .00
  plans       1/4 = .25    0/9 = .00
  get         0/4 = .00    2/9 = .22
  now         0/4 = .00    1/9 = .11

• However, what happens when words do not appear in a class of the training data?
Avoiding Zeros in the Calculation

• As the table above shows, words not occurring in a class have a conditional probability of 0.

• Multiplying by zero would nullify the naive Bayes calculation. We need to smooth the data.
Laplace Smoothing

• Using Laplace smoothing, we go back and add 1 to every word count.

• For balance, we add the number of all unique possible words (the total vocabulary, here 9) to the divisor, so the resulting fraction can never be greater than 1.

  Word        Not Spam      Spam
  follow-up   1 + 1         0 + 1
  meeting     1 + 1         0 + 1
  free        0 + 1         1 + 1
  cash        0 + 1         2 + 1
  money       0 + 1         4 + 1
  dinner      1 + 1         0 + 1
  plans       1 + 1         0 + 1
  get         0 + 1         2 + 1
  now         0 + 1         1 + 1
  Total       4 + 9 = 13    9 + 9 = 18
Class Conditional Probabilities

• Now that we have smoothed the data, the conditional probability of each word occurring in a given class is recalculated.

  Word        Not Spam      Spam
  follow-up   2/13 = .153   1/18 = .055
  meeting     2/13 = .153   1/18 = .055
  free        1/13 = .077   2/18 = .111
  cash        1/13 = .077   3/18 = .167
  money       1/13 = .077   5/18 = .278
  dinner      2/13 = .153   1/18 = .055
  plans       2/13 = .153   1/18 = .055
  get         1/13 = .077   3/18 = .167
  now         1/13 = .077   2/18 = .111
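A sketch reproducing the smoothed values; the spam counts and the vocabulary of 9 unique words come from the tables above:

```python
# Laplace-smoothed conditional probabilities, P(word|class).
VOCAB_SIZE = 9  # unique words across the whole training set
spam_counts = {"free": 1, "cash": 2, "money": 4, "get": 2, "now": 1}
spam_total = 9  # total word occurrences in the spam class

def smoothed(word, class_counts, class_total):
    # Add 1 to the word count; add the vocabulary size to the divisor.
    return (class_counts.get(word, 0) + 1) / (class_total + VOCAB_SIZE)

print(round(smoothed("money", spam_counts, spam_total), 3))   # 0.278
print(round(smoothed("dinner", spam_counts, spam_total), 3))  # 0.056
```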
Naive Bayes Classification

• We calculate the probability of an email containing the words “get money now” in each of the two classes:

  P(spam|get money now) ∝ P(get|spam) × P(money|spam) × P(now|spam) × (.6)
                        = .167 × .278 × .111 × .6 = .0031

  P(not spam|get money now) ∝ P(get|not spam) × P(money|not spam) × P(now|not spam) × (.4)
                            = .077 × .077 × .077 × .4 = .0002
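The same arithmetic as a short check in Python, using the rounded table values from above:

```python
# Final naive Bayes scores for "get money now" (divisor dropped),
# using the rounded, smoothed word probabilities and class priors.
score_spam     = 0.167 * 0.278 * 0.111 * 0.6  # spam words, times P(spam)
score_not_spam = 0.077 * 0.077 * 0.077 * 0.4  # not-spam words, times P(not spam)

print(round(score_spam, 4), round(score_not_spam, 4))         # 0.0031 0.0002
print("spam" if score_spam > score_not_spam else "not spam")  # spam
```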
Naive Bayes Classification (cont.)

• Comparing the two values calculated above, a higher value indicates a higher probability. In this case, there is a higher probability of these words occurring in an email that is spam. The classifier would therefore label the email as spam.
Features for Spam Classification

This was a simplified example. In reality, spam detection systems use more than just individual words to classify email as spam or not spam. Spam classifiers use many additional features, including:

• Sender’s address
• IP address
• Use of capitalization
• Specific phrases
• Whether or not the text contains a link
Term Frequency - Inverse Document Frequency (tf-idf)

• Some words appear much more often in the English language than other words. In a text classifier, a word like “the” can easily overshadow the frequencies of less common words that are more meaningful.

• In text analysis, a weighting process called tf-idf, for “term frequency times inverse document frequency,” is therefore often applied to the raw term frequency.
TF-IDF

• The classification tool we built for the assignment uses the following tf-idf weighting method:

  weight = tf × idf

• Term frequency (tf) is the number of times the term appears in the document.
TF-IDF (cont.)

• Inverse document frequency (idf) is the log of the total number of documents divided by the number of documents containing the term. One is added to the denominator to avoid zero division:

  idf = log( Number of documents / (1 + Number of documents containing the term) )

• The higher the tf-idf weight, the rarer the term, and vice versa.
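A sketch of this weighting written out by hand, since library implementations such as Scikit-learn's TfidfVectorizer use slightly different smoothing conventions; the example counts are invented:

```python
# tf-idf per the slide's formula: tf * log(N / (1 + df)).
import math  # math.log is the natural logarithm

def tf_idf(tf, n_docs, n_docs_with_term):
    idf = math.log(n_docs / (1 + n_docs_with_term))
    return tf * idf

# Invented counts: a term appearing 3 times in one document,
# found in 4 of 100 documents in the collection.
print(round(tf_idf(3, 100, 4), 3))  # 8.987
```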
Conclusion

• Machine learning uses statistical techniques to give computers the ability to “learn” from data to identify patterns and make predictions without having to be manually programmed using a rules-based system.

• Good quality labeled data is vitally important for training and testing supervised machine learning systems. In general, using more training data yields better results.
Conclusion (cont.)

• Two measures used to evaluate the performance of a text classifier are accuracy and precision.

• Naive Bayes is one type of algorithm that can be used in machine learning. Using Bayesian probability, you reason backwards to find out the events—or random variables—that most likely led to a specific outcome.

• In naive Bayes text classification, the random variables used are the specific words in the text.
Conclusion (cont.)

• The naive Bayes algorithm is based on Bayes’ theorem of conditional probability, applied with the “naive” assumption that there is independence between every pair of features.

• This assumption of independence means the classifier ignores word order and sentence construction, treating each document as if it is a “bag of words.”
