Text Classification

These slides cover Naive Bayes classification in the context of text sentiment analysis, spam detection, and authorship identification. They explain the basic principles of the Naive Bayes classifier, including the bag of words representation, maximum likelihood estimation, and the challenges of zero probabilities and unknown words, and they cover techniques such as Laplace smoothing and the handling of stop words.


Naive Bayes and Sentiment Classification
The Task of Text Classification
Is this spam?
Who wrote which Federalist papers?
1787-8: anonymous essays tried to convince New York to ratify the U.S. Constitution. Written by Jay, Madison, and Hamilton.
Authorship of 12 of the essays was in dispute.
1963: solved by Mosteller and Wallace using Bayesian methods.

[Portraits: James Madison and Alexander Hamilton]


What is the subject of this medical article?
MEDLINE Article → MeSH Subject Category Hierarchy:
◦ Antagonists and Inhibitors
◦ Blood Supply
◦ Chemistry
◦ Drug Therapy
◦ Embryology
◦ Epidemiology
Positive or negative movie review?

+ ...zany characters and richly applied satire, and some great plot twists
− It was pathetic. The worst part about it was the boxing scenes...
+ ...awesome caramel sauce and sweet toasty almonds. I love this place!
− ...awful pizza and ridiculously overpriced...
Why sentiment analysis?

Movie: is this review positive or negative?
Products: what do people think about the new iPhone?
Public sentiment: how is consumer confidence?
Politics: what do people think about this candidate or issue?
Prediction: predict election outcomes or market trends from sentiment
Scherer Typology of Affective States
Emotion: brief organically synchronized … evaluation of a major event
◦ angry, sad, joyful, fearful, ashamed, proud, elated
Mood: diffuse non-caused low-intensity long-duration change in subjective feeling
◦ cheerful, gloomy, irritable, listless, depressed, buoyant
Interpersonal stances: affective stance toward another person in a specific interaction
◦ friendly, flirtatious, distant, cold, warm, supportive, contemptuous
Attitudes: enduring, affectively colored beliefs, dispositions towards objects or persons
◦ liking, loving, hating, valuing, desiring
Personality traits: stable personality dispositions and typical behavior tendencies
◦ nervous, anxious, reckless, morose, hostile, jealous
Basic Sentiment Classification

Sentiment analysis is the detection of attitudes.
Simple task we focus on in this chapter:
◦ Is the attitude of this text positive or negative?
We return to affect classification in later chapters.
Summary: Text Classification

Sentiment analysis
Spam detection
Authorship identification
Language Identification
Assigning subject categories, topics, or genres

Text Classification: definition

Input:
◦ a document d
◦ a fixed set of classes C = {c1, c2,…, cJ}

Output: a predicted class c ∈ C


Classification Methods: Hand-coded rules

Rules based on combinations of words or other features
◦ spam: black-list-address OR (“dollars” AND “you have been selected”)
Accuracy can be high
◦ If rules carefully refined by expert
But building and maintaining these rules is expensive
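To make the idea concrete, here is a minimal sketch of such a hand-coded rule classifier. The blacklist and the example messages are hypothetical; only the “dollars” / “you have been selected” rule comes from the slide above.

```python
# Hypothetical blacklist for illustration only.
BLACKLIST = {"spammer@example.com"}

def is_spam(sender: str, text: str) -> bool:
    text = text.lower()
    # Rule from the slide: blacklisted address OR ("dollars" AND "you have been selected")
    return sender in BLACKLIST or (
        "dollars" in text and "you have been selected" in text
    )

print(is_spam("spammer@example.com", "hi"))                              # True
print(is_spam("a@b.com", "You have been selected to receive dollars!"))  # True
print(is_spam("friend@mail.com", "Lunch tomorrow?"))                     # False
```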
Classification Methods:
Supervised Machine Learning
Input:
◦ a document d
◦ a fixed set of classes C = {c1, c2,…, cJ}
◦ a training set of m hand-labeled documents (d1,c1), ..., (dm,cm)
Output:
◦ a learned classifier γ: d → c
Classification Methods:
Supervised Machine Learning
Any kind of classifier
◦ Naïve Bayes
◦ Logistic regression
◦ Neural networks
◦ k-Nearest Neighbors
◦ …
Naive Bayes (I)
Naive Bayes Intuition

Simple (“naive”) classification method based on Bayes rule
Relies on a very simple representation of the document
◦ Bag of words
The Bag of Words Representation

γ(document) = c, where the document is reduced to a bag of word counts, e.g.:
seen 2, sweet 1, whimsical 1, recommend 1, happy 1, ...
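As a rough illustration (not part of the slides), a bag-of-words representation can be built as a simple word-count table; the tokenizer here is a deliberately crude lowercase, punctuation-stripping split:

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Very simple tokenization: drop punctuation, lowercase, split on whitespace.
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    return Counter(cleaned.lower().split())

print(bag_of_words("I love this movie! It is sweet and whimsical. I recommend it."))
# Counter({'i': 2, 'it': 2, 'love': 1, 'this': 1, 'movie': 1, ...})
```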
Formalizing the Naive Bayes Classifier
Bayes’ Rule Applied to Documents and Classes

• For a document d and a class c:

P(c | d) = P(d | c) P(c) / P(d)
Naive Bayes Classifier (I)

c_MAP = argmax_{c ∈ C} P(c | d)                  (MAP is “maximum a posteriori” = most likely class)

      = argmax_{c ∈ C} P(d | c) P(c) / P(d)      (Bayes rule)

      = argmax_{c ∈ C} P(d | c) P(c)             (dropping the denominator)
Naive Bayes Classifier (II)

c_MAP = argmax_{c ∈ C} P(d | c) P(c)                       (“likelihood” × “prior”)

      = argmax_{c ∈ C} P(x_1, x_2, …, x_n | c) P(c)        (document d represented as features x_1 .. x_n)
Naïve Bayes Classifier (IV)

c_MAP = argmax_{c ∈ C} P(x_1, x_2, …, x_n | c) P(c)

P(x_1, x_2, …, x_n | c): O(|X|^n · |C|) parameters; could only be estimated if a very, very large number of training examples was available.
P(c): how often does this class occur? We can just count the relative frequencies in a corpus.
Multinomial Naive Bayes Independence Assumptions

P(x_1, x_2, …, x_n | c)

Bag of Words assumption: assume position doesn’t matter.
Conditional Independence: assume the feature probabilities P(x_i | c_j) are independent given the class c:

P(x_1, …, x_n | c) = P(x_1 | c) · P(x_2 | c) · P(x_3 | c) · ... · P(x_n | c)

Multinomial Naive Bayes Classifier

c_MAP = argmax_{c ∈ C} P(x_1, x_2, …, x_n | c) P(c)

c_NB = argmax_{c ∈ C} P(c_j) ∏_{x ∈ X} P(x | c)
Applying Multinomial Naive Bayes Classifiers to Text Classification

positions ← all word positions in the test document

c_NB = argmax_{c_j ∈ C} P(c_j) ∏_{i ∈ positions} P(x_i | c_j)
Example
Let’s walk through a Multinomial Naive Bayes classifier for filtering out spam messages. Initially, we consider eight normal messages and four spam messages.

Histogram of all the words that occur in the normal messages from family and friends (17 word tokens in total):
Dear – 8, Friend – 5, Lunch – 3, Money – 1

The probability of the word Dear, given that we saw it in a normal message, is:
P(Dear | Normal) = 8/17 = 0.47
Similarly:
P(Friend | Normal) = 5/17 = 0.29
P(Lunch | Normal) = 3/17 = 0.18
P(Money | Normal) = 1/17 = 0.06
Histogram of all the words that occur in the spam messages (7 word tokens in total):
Dear – 2, Friend – 1, Lunch – 0, Money – 4

The probability of the word Dear, given that we saw it in a spam message, is:
P(Dear | Spam) = 2/7 = 0.29
Similarly:
P(Friend | Spam) = 1/7 = 0.14
P(Lunch | Spam) = 0/7 = 0.00
P(Money | Spam) = 4/7 = 0.57
What is the probability that “Dear Friend” is a Normal message?
What is the probability that “Dear Friend” is a Spam message?
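As a quick check, here is a small sketch (my own illustration, not part of the slides) that reproduces the maximum likelihood estimates above from the word histograms:

```python
# Word counts from the histograms above.
normal_counts = {"Dear": 8, "Friend": 5, "Lunch": 3, "Money": 1}
spam_counts = {"Dear": 2, "Friend": 1, "Lunch": 0, "Money": 4}

def mle_likelihoods(counts):
    # P(w | class) = count(w, class) / total word count in the class.
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(mle_likelihoods(normal_counts))  # Dear ~0.47, Friend ~0.29, Lunch ~0.18, Money ~0.06
print(mle_likelihoods(spam_counts))    # Dear ~0.29, Friend ~0.14, Lunch 0.00, Money ~0.57
```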
Problems with multiplying lots of probs

There’s a problem with this:

c_NB = argmax_{c_j ∈ C} P(c_j) ∏_{i ∈ positions} P(x_i | c_j)

Multiplying lots of probabilities can result in floating-point underflow!
Luckily, log(ab) = log(a) + log(b)
Let’s sum logs of probabilities instead of multiplying probabilities!
We actually do everything in log space

Instead of this:

c_NB = argmax_{c_j ∈ C} P(c_j) ∏_{i ∈ positions} P(x_i | c_j)

This:

c_NB = argmax_{c_j ∈ C} [ log P(c_j) + Σ_{i ∈ positions} log P(x_i | c_j) ]

This is ok since log doesn’t change the ranking of the classes (the class with the highest probability still has the highest log probability).
The model is now just a max of a sum of weights: a linear function of the inputs.
So Naive Bayes is a linear classifier.
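A minimal sketch of log-space scoring, using made-up toy probabilities purely for illustration:

```python
import math

# Toy per-class parameters (illustrative numbers only).
log_prior = {"pos": math.log(0.5), "neg": math.log(0.5)}
log_lik = {
    "pos": {"fun": math.log(0.05), "boring": math.log(0.001)},
    "neg": {"fun": math.log(0.005), "boring": math.log(0.05)},
}

def log_score(words, c):
    # Sum of logs replaces the product of probabilities:
    # avoids underflow and preserves the ranking of the classes.
    return log_prior[c] + sum(log_lik[c][w] for w in words)

doc = ["boring", "boring", "fun"]
print(max(("pos", "neg"), key=lambda c: log_score(doc, c)))  # 'neg'
```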
Naive Bayes: Learning
Learning the Multinomial Naive Bayes Model (Sec. 13.3)

First attempt: maximum likelihood estimates
◦ simply use the frequencies in the data

Compute the prior probability of a class c_j:

P̂(c_j) = doccount(C = c_j) / N_doc

e.g., in the spam example: P(Normal) = 8/12

Compute the probability of a word given a class:

P̂(w_i | c_j) = count(w_i, c_j) / Σ_{w ∈ V} count(w, c_j)
Parameter estimation

P̂(w_i | c_j) = count(w_i, c_j) / Σ_{w ∈ V} count(w, c_j)
    = fraction of times word w_i appears among all words in documents of topic c_j

Create a mega-document for topic j by concatenating all docs in this topic
◦ Use the frequency of w in the mega-document

In the spam example: 12 documents in total, 8 Normal and 4 Spam
P(Normal) = 8/12
P(Spam) = 4/12

Word counts in the two mega-documents:

Normal (17 words)    Spam (7 words)
Dear – 8             Dear – 2
Friend – 5           Friend – 1
Lunch – 3            Lunch – 0
Money – 1            Money – 4

Probability that “Dear Friend” belongs to each class:
P(Normal | “Dear Friend”) ∝ (8/17) · (5/17) · (8/12) ≈ 0.092
P(Spam | “Dear Friend”) ∝ (2/7) · (1/7) · (4/12) ≈ 0.014
so “Dear Friend” is classified as Normal.

Probability that “Lunch Money” belongs to each class:
P(Normal | “Lunch Money”) ∝ (3/17) · (1/17) · (8/12) ≈ 0.007
P(Spam | “Lunch Money”) ∝ (0/7) · (4/7) · (4/12) = 0
Problem with Maximum Likelihood

What if we have seen no training documents with the word fantastic classified in the topic positive (thumbs-up)?

P̂(“fantastic” | positive) = count(“fantastic”, positive) / Σ_{w ∈ V} count(w, positive) = 0

Zero probabilities cannot be conditioned away, no matter the other evidence!

c_MAP = argmax_c P̂(c) ∏_i P̂(x_i | c)
Laplace (add-1) smoothing for Naïve Bayes

P̂(w_i | c) = (count(w_i, c) + 1) / Σ_{w ∈ V} (count(w, c) + 1)

           = (count(w_i, c) + 1) / ( (Σ_{w ∈ V} count(w, c)) + |V| )
Applying add-1 smoothing to “Lunch Money”:

P̂(w_i | c) changes from count(w_i, c) / Σ_{w ∈ V} count(w, c) to (count(w_i, c) + 1) / ( (Σ_{w ∈ V} count(w, c)) + |V| )

Unique words: |V| = 4. Total word occurrences: Normal = 17, Spam = 7.

P(Normal | “Lunch Money”) ∝ ((3+1)/(17+4)) · ((1+1)/(17+4)) · (8/12) = (4/21) · (2/21) · (8/12) ≈ 0.012
P(Spam | “Lunch Money”) ∝ ((0+1)/(7+4)) · ((4+1)/(7+4)) · (4/12) = (1/11) · (5/11) · (4/12) ≈ 0.014

With smoothing, “Lunch Money” is now classified as Spam.
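The same calculation as a short sketch (my own illustration; the counts and priors are the ones from the spam example):

```python
normal = {"Dear": 8, "Friend": 5, "Lunch": 3, "Money": 1}
spam = {"Dear": 2, "Friend": 1, "Lunch": 0, "Money": 4}
priors = {"Normal": 8 / 12, "Spam": 4 / 12}
vocab = set(normal) | set(spam)

def smoothed_likelihood(word, counts):
    # Add-1 (Laplace) smoothing: (count + 1) / (total + |V|)
    return (counts.get(word, 0) + 1) / (sum(counts.values()) + len(vocab))

def score(words, counts, prior):
    p = prior
    for w in words:
        p *= smoothed_likelihood(w, counts)
    return p

print(score(["Lunch", "Money"], normal, priors["Normal"]))  # ~0.012
print(score(["Lunch", "Money"], spam, priors["Spam"]))      # ~0.014 -> Spam wins
```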
Unknown words
What about unknown words
◦ that appear in our test data
◦ but not in our training data or vocab
We ignore them
◦ Remove them from the test document!
◦ Pretend they weren't there!
◦ Don't include any probability for them at all.
Why don't we build an unknown word model?
◦ It doesn't help: knowing which class has more unknown words is
not generally a useful thing to know!
Stop words
Some systems ignore another class of words:
Stop words: very frequent words like the and a.
◦ Sort the whole vocabulary by frequency in the training set, and call the top 10 or 50 words the stopword list.
◦ Now we remove all stop words from the training and test sets as if they were never there.
But in most text classification applications, removing stop words doesn’t help, so it’s more common not to use stopword lists and to use all the words in Naive Bayes.
Multinomial Naive Bayes: Learning

• From the training corpus, extract the Vocabulary

• Calculate the P(c_j) terms:
  For each c_j in C do
    docs_j ← all docs with class = c_j
    P(c_j) ← |docs_j| / |total # documents|

• Calculate the P(w_k | c_j) terms:
  Text_j ← single doc containing all docs_j
  For each word w_k in Vocabulary
    n_k ← # of occurrences of w_k in Text_j
    P(w_k | c_j) ← (n_k + α) / (n + α|Vocabulary|)
  (where n is the total number of word tokens in Text_j and α is the smoothing constant, e.g. α = 1)
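A compact sketch of this training procedure in Python (function and variable names are my own; add-α smoothing with α = 1 by default):

```python
from collections import Counter, defaultdict
import math

def train_multinomial_nb(docs, labels, alpha=1.0):
    # docs: list of tokenized documents (lists of words); labels: parallel class labels.
    vocab = {w for doc in docs for w in doc}
    log_prior, log_likelihood = {}, defaultdict(dict)
    for c in set(labels):
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for doc in class_docs for w in doc)  # "mega-document" counts
        total = sum(counts.values())
        for w in vocab:
            # Add-alpha smoothed likelihood, stored in log space.
            log_likelihood[c][w] = math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
    return log_prior, log_likelihood, vocab

def classify(doc, log_prior, log_likelihood, vocab):
    # Unknown test words (not in vocab) are simply ignored, as discussed above.
    scores = {c: log_prior[c] + sum(log_likelihood[c][w] for w in doc if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)
```

This mirrors the pseudocode above: priors from document counts, likelihoods from mega-document word counts with add-α smoothing, and log-space scoring at classification time.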
Sentiment and Binary Naive Bayes
Let's do a worked sentiment example!
A worked sentiment example

Training set:
− just plain boring
− entirely predictable and lacks energy
− no surprises and very few laughs
+ very powerful
+ the most fun film of the summer
Test document: predictable with no fun

Prior from training:
P(−) = 3/5
P(+) = 2/5
Drop “with” (it does not occur in the training data).

Likelihoods from training (with add-1 smoothing; 14 word tokens in −, 9 word tokens in +, |V| = 20):
P(predictable | −) = (1+1)/(14+20) = 2/34     P(predictable | +) = (0+1)/(9+20) = 1/29
P(no | −) = (1+1)/(14+20) = 2/34              P(no | +) = (0+1)/(9+20) = 1/29
P(fun | −) = (0+1)/(14+20) = 1/34             P(fun | +) = (1+1)/(9+20) = 2/29

Scoring the test set:
P(−) · P(“predictable no fun” | −) = 3/5 · (2/34) · (2/34) · (1/34) ≈ 6.1 × 10⁻⁵
P(+) · P(“predictable no fun” | +) = 2/5 · (1/29) · (1/29) · (2/29) ≈ 3.2 × 10⁻⁵
so the model predicts the class negative for “predictable with no fun”.
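A standalone sketch (my own illustration) that checks these numbers:

```python
from math import prod

neg_docs = ["just plain boring", "entirely predictable and lacks energy",
            "no surprises and very few laughs"]
pos_docs = ["very powerful", "the most fun film of the summer"]

neg_tokens = " ".join(neg_docs).split()    # 14 tokens
pos_tokens = " ".join(pos_docs).split()    # 9 tokens
vocab = set(neg_tokens) | set(pos_tokens)  # 20 word types

def smoothed(word, tokens):
    # Add-1 smoothed likelihood: (count + 1) / (total tokens + |V|)
    return (tokens.count(word) + 1) / (len(tokens) + len(vocab))

test = [w for w in "predictable with no fun".split() if w in vocab]  # "with" is dropped

neg_score = (3 / 5) * prod(smoothed(w, neg_tokens) for w in test)
pos_score = (2 / 5) * prod(smoothed(w, pos_tokens) for w in test)
print(f"neg: {neg_score:.1e}  pos: {pos_score:.1e}")  # ~6.1e-05 vs ~3.2e-05 -> negative
```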
Optimizing for sentiment analysis
For tasks like sentiment, word occurrence is more
important than word frequency.
◦ The occurrence of the word fantastic tells us a lot
◦ The fact that it occurs 5 times may not tell us much more.
Binary multinomial naive Bayes, or binary NB
◦ Clip our word counts at 1
◦ Note: this is different from Bernoulli naive Bayes; see the textbook at the end of the chapter.
Binary Multinomial Naive Bayes: Learning

• From the training corpus, extract the Vocabulary

• Calculate the P(c_j) terms:
  For each c_j in C do
    docs_j ← all docs with class = c_j
    P(c_j) ← |docs_j| / |total # documents|

• Calculate the P(w_k | c_j) terms:
  Remove duplicates in each doc:
    For each word type w in doc_j, retain only a single instance of w
  Text_j ← single doc containing all docs_j
  For each word w_k in Vocabulary
    n_k ← # of occurrences of w_k in Text_j
    P(w_k | c_j) ← (n_k + α) / (n + α|Vocabulary|)
Binary Multinomial Naive Bayes on a test document d

First remove all duplicate words from d
Then compute NB using the same equation:

c_NB = argmax_{c_j ∈ C} P(c_j) ∏_{i ∈ positions} P(w_i | c_j)
Binary multinomial naive Bayes

Counts can still be 2! Binarization is within-doc!
(Each document contributes each word type at most once, but a word can still be counted once per document across the documents of a class.)
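A quick sketch of within-document binarization (my own illustration):

```python
from collections import Counter

def binarized_class_counts(docs):
    # Each document contributes each word type at most once, so a word's count
    # equals the number of documents (in this class) that contain it.
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))
    return counts

docs_in_class = ["it was great great great", "great acting", "boring plot"]
print(binarized_class_counts(docs_in_class))
# 'great' -> 2 : counts can still exceed 1 across documents
```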
Naive Bayes: Relationship to Language Modeling
Generative Model for Multinomial Naive Bayes

The class generates each word of the document independently:

c = +
X1 = I   X2 = love   X3 = this   X4 = fun   X5 = film
Naïve Bayes and Language Modeling
Naive Bayes classifiers can use any sort of feature
◦ URL, email address, dictionaries, network features
But if, as in the previous slides,
◦ we use only word features
◦ we use all of the words in the text (not a subset)
then
◦ Naive Bayes has an important similarity to language modeling.
Each class = a unigram language model (Sec. 13.2.1)

Assigning each word: P(word | c)
Assigning each sentence: P(s | c) = ∏ P(word | c)

Class pos:
0.1   I
0.1   love
0.01  this
0.05  fun
0.1   film

I     love   this   fun    film
0.1   0.1    0.01   0.05   0.1

P(s | pos) = 0.1 × 0.1 × 0.01 × 0.05 × 0.1 = 0.0000005
Naive Bayes as a Language Model (Sec. 13.2.1)

Which class assigns the higher probability to s?

Model pos      Model neg
0.1    I       0.2     I
0.1    love    0.001   love
0.01   this    0.01    this
0.05   fun     0.005   fun
0.1    film    0.1     film

       I      love    this   fun     film
pos:   0.1    0.1     0.01   0.05    0.1
neg:   0.2    0.001   0.01   0.005   0.1

P(s | pos) > P(s | neg)
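A tiny sketch reproducing this comparison with the toy probabilities from the table above:

```python
from math import prod

pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

sentence = "I love this fun film".split()
p_pos = prod(pos[w] for w in sentence)  # 5e-07
p_neg = prod(neg[w] for w in sentence)  # 1e-09
print(p_pos > p_neg)  # True: the pos class assigns the higher probability
```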
Precision, Recall, and F measure
Evaluation
Let's consider just binary text classification tasks
Imagine you're the CEO of Delicious Pie Company
You want to know what people are saying about
your pies
So you build a "Delicious Pie" tweet detector
◦ Positive class: tweets about Delicious Pie Co
◦ Negative class: all other tweets
The 2-by-2 confusion matrix (example counts):

                    gold positive   gold negative
system positive     TP = 10         FP = 2
system negative     FN = 3          TN = 34
Evaluation: Accuracy
Why don't we use accuracy as our metric?
Imagine we saw 1 million tweets
◦ 100 of them talked about Delicious Pie Co.
◦ 999,900 talked about something else
We could build a dumb classifier that just labels every
tweet "not about pie"
◦ It would get 99.99% accuracy!!! Wow!!!!
◦ But useless! Doesn't return the comments we are looking for!
◦ That's why we use precision and recall instead
Evaluation: Precision
% of items the system detected (i.e., items the system labeled as positive) that are in fact positive (according to the human gold labels):
Precision = TP / (TP + FP)
Evaluation: Recall
% of items actually present in the input that were correctly identified by the system:
Recall = TP / (TP + FN)
Why Precision and recall
Our dumb pie-classifier
◦ Just label nothing as "about pie"
Accuracy=99.99%
but
Recall = 0
◦ (it doesn't get any of the 100 Pie tweets)
Precision and recall, unlike accuracy, emphasize true
positives:
◦ finding the things that we are supposed to be looking for.
A combined measure: F
F measure: a single number that combines P and R:

F_β = (β² + 1) P R / (β² P + R)

We almost always use balanced F1 (i.e., β = 1):

F1 = 2 P R / (P + R)
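A short sketch (my own illustration) computing these metrics from the example confusion matrix above:

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the 2-by-2 confusion matrix above: TP=10, FP=2, FN=3
p, r, f1 = precision_recall_f1(10, 2, 3)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
# precision=0.83 recall=0.77 F1=0.80
```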
