CS463 - Natural Language Processing
Text Categorization
Text Categorization (Text Classification)
• Text categorization (text classification) is the task of
assigning categories to free-text documents.
• For example, news stories are organized by subject
categories (topics); academic papers are classified by
technical areas; patient reports in hospitals are indexed
using disease categories, etc.
• Another application of text categorization is spam
filtering, where email messages are classified into the
two categories, spam and non-spam.
Text Classification
• Text classification (text categorization):
– assign documents to one or more predefined categories.
[Figure: documents are assigned to one of the predefined classes class1, class2, …, classn.]
NLP, Data Mining and Machine Learning techniques
work together to automatically classify the different types
of documents.
Introduction
• Text classification (TC) is an important part of
text mining.
• An example of classification:
– automatically labeling news stories with a topic
like “sports”, “politics” or “art”
• The classification task:
– starts with a training set of documents labelled with a class;
– then determines a classification model to assign the correct class to a new document of the domain (see the sketch below).
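As an illustration of this train-then-classify workflow (not part of the original slides), here is a minimal sketch using scikit-learn with an invented toy corpus; the TF-IDF vectorizer and Naive Bayes classifier stand in for the model-learning step:

```python
# Minimal sketch of the train-then-classify workflow (toy data, for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Training set: documents labelled with a class.
train_docs = [
    "the team won the final match",            # sports
    "parliament passed the new budget law",    # politics
    "the gallery opened a sculpture exhibit",  # art
    "the striker scored two goals",            # sports
]
train_labels = ["sports", "politics", "art", "sports"]

# Determine a classification model from the labelled training documents.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

# Assign a class to a new document of the domain.
print(model.predict(["the team scored in the match"]))  # expected: ['sports']
```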
Learning for Text Categorization
• Manual development of text categorization
functions is difficult.
• Learning Algorithms:
– Bayesian (naïve)
– Neural network
– Relevance Feedback (Rocchio)
– Rule based (Ripper)
– Nearest Neighbor (case based)
– Support Vector Machines (SVM)
The Vector-Space Model
• Assume t distinct terms remain after preprocessing;
call them index terms or the vocabulary.
• These “orthogonal” terms form a vector space.
Dimension = t = |vocabulary|
• Each term i in a document or query j is given a
real-valued weight, wij.
• Both documents and queries are expressed as
t-dimensional vectors:
dj = (w1j, w2j, …, wtj)
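A minimal sketch of this representation (not from the slides), using an assumed toy vocabulary of t = 5 index terms and raw term counts as the weights wij:

```python
# Represent a document as a t-dimensional vector of term weights over a fixed
# vocabulary of index terms (here: raw counts, toy vocabulary of t = 5 terms).
import re

vocabulary = ["election", "goal", "match", "budget", "painting"]

def to_vector(text, vocab):
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tokens.count(term) for term in vocab]  # wij = count of term i in document j

doc = "the match ended with a late goal, a great match overall"
print(to_vector(doc, vocabulary))  # [0, 1, 2, 0, 0]
```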
What is a Vector Space Model?
• The document vectors are rendered as points in a plane.
Example: this vector space is divided into 3 classes. The boundaries between the classes are called decision boundaries. To classify a new document, we determine the region it occurs in and assign it the class of that region.
Graphic Representation
Example (two documents and a query over index terms T1, T2, T3):
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
• Is D1 or D2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle? Projection?
Term Weights: Term Frequency
• More frequent terms in a document are more important, i.e. more indicative of the topic.
fij = frequency of term i in document j
• May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:
tfij = fij / maxi{fij}
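A small sketch of this normalization (added for this handout, with an assumed toy document):

```python
# Normalized term frequency: tfij = fij / maxi{fij}
import re
from collections import Counter

def term_frequencies(text):
    counts = Counter(re.findall(r"[a-z]+", text.lower()))  # fij for each term i
    max_f = max(counts.values())                           # frequency of the most frequent term
    return {term: f / max_f for term, f in counts.items()}

print(term_frequencies("to be or not to be"))
# {'to': 1.0, 'be': 1.0, 'or': 0.5, 'not': 0.5}
```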
Term Weights: Inverse Document Frequency
• Terms that appear in many different documents
are less indicative of overall topic.
dfi = document frequency of term i
    = number of documents containing term i
idfi = inverse document frequency of term i
     = log2 (N / dfi)
(N: total number of documents)
• An indication of a term’s discrimination power.
• Log used to dampen the effect relative to tf.
TF-IDF Weighting
• A typical combined term importance indicator is
tf-idf weighting:
wij = tfij · idfi = tfij · log2 (N / dfi)
• A term occurring frequently in the document but
rarely in the rest of the collection is given high
weight.
• Many other ways of determining term weights
have been proposed.
• Experimentally, tf-idf has been found to work well.
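A short sketch of tf-idf weighting over an assumed toy collection of N = 3 documents, combining the tf and idf formulas above:

```python
# TF-IDF weighting: wij = tfij * log2(N / dfi), with tf normalized by the
# most frequent term in each document (toy collection of N = 3 documents).
import math
import re
from collections import Counter

docs = ["cats chase mice", "dogs chase cats", "mice eat cheese"]
N = len(docs)

tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
df = Counter(term for tokens in tokenized for term in set(tokens))  # dfi

def tf_idf(tokens):
    counts = Counter(tokens)
    max_f = max(counts.values())
    return {t: (f / max_f) * math.log2(N / df[t]) for t, f in counts.items()}

for j, tokens in enumerate(tokenized):
    print("document", j, tf_idf(tokens))
```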
Similarity Measure
• A similarity measure is a function that computes
the degree of similarity between two vectors.
• Using a similarity measure between the query and
each document:
– It is possible to rank the retrieved documents in the
order of presumed relevance.
– It is possible to enforce a certain threshold so that the
size of the retrieved set can be controlled.
Cosine Similarity Measure
• Cosine similarity measures the cosine of the angle between two vectors.
• It is the inner product normalized by the vector lengths:
CosSim(dj, q) = (dj · q) / (|dj| · |q|) = Σi=1..t (wij · wiq) / sqrt( (Σi=1..t wij^2) · (Σi=1..t wiq^2) )
Example:
D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / sqrt((4+9+25)(0+0+4)) ≈ 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / sqrt((9+49+1)(0+0+4)) ≈ 0.13
Q = 0T1 + 0T2 + 2T3
D1 is 6 times better than D2 using cosine similarity but only 5 times better using inner product.
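The numbers in this example can be reproduced with a short sketch of the cosine formula (pure Python, added for this handout):

```python
# Cosine similarity: CosSim(d, q) = (d . q) / (|d| * |q|)
import math

def cos_sim(d, q):
    dot = sum(wd * wq for wd, wq in zip(d, q))
    return dot / (math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q)))

D1 = [2, 3, 5]  # D1 = 2T1 + 3T2 + 5T3
D2 = [3, 7, 1]  # D2 = 3T1 + 7T2 + 1T3
Q  = [0, 0, 2]  # Q  = 0T1 + 0T2 + 2T3

print(round(cos_sim(D1, Q), 2))  # 0.81
print(round(cos_sim(D2, Q), 2))  # 0.13
```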
Using Relevance Feedback (Rocchio)
• Relevance feedback methods can be adapted for
text categorization.
• Use standard TF/IDF weighted vectors to
represent text documents (normalized by
maximum term frequency).
• For each category, compute a prototype vector by
summing the vectors of the training documents in
the category.
• Assign test documents to the category with the
closest prototype vector based on cosine
similarity.
Explaining Rocchio Method
What is a Vector Space Model?
• The document vectors are rendered as points in a plane.
This vector space is divided into 3 classes. The boundaries are called decision boundaries. To classify a new document, we determine the region it occurs in and assign it the class of that region.
Rocchio Classification
• Rocchio classification uses centroids to define the boundaries.
• The centroid of a class is computed as the center of mass of its members:
μ(c) = (1 / |Dc|) · Σd∈Dc v(d)
Dc is the set of all documents with class c.
v(d) is the vector space representation of d.
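A minimal sketch of Rocchio classification based on the centroid formula above; the training vectors and class labels are invented for illustration, and cosine similarity is used to pick the closest centroid:

```python
# Rocchio classification: centroid mu(c) = (1/|Dc|) * sum of v(d) over d in Dc;
# a test document is assigned to the class with the most similar centroid.
import math
from collections import defaultdict

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy training vectors v(d) with class labels (assumed for illustration).
training = [([2, 3, 5], "c1"), ([1, 4, 4], "c1"),
            ([3, 7, 1], "c2"), ([4, 6, 0], "c2")]

# Compute each class centroid as the mean of its member vectors.
by_class = defaultdict(list)
for vec, label in training:
    by_class[label].append(vec)
centroids = {c: [sum(col) / len(vecs) for col in zip(*vecs)]
             for c, vecs in by_class.items()}

# Assign a test document to the class with the closest centroid (cosine similarity).
test = [0, 1, 3]
print(max(centroids, key=lambda c: cos_sim(test, centroids[c])))  # -> c1
```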
Rocchio Classification
• Centroids of classes are shown as solid circles.
• The boundary between two classes is the set of points with equal distance from the two centroids.
Rocchio Properties
• Does not guarantee a consistent hypothesis.
• Forms a simple generalization of the
examples in each class (a prototype).
• Prototype vector does not need to be
averaged or otherwise normalized for length
since cosine similarity is insensitive to
vector length.
• Classification is based on similarity to class
prototypes.
Nearest-Neighbor Learning Algorithm
• Learning is just storing the representations of the
training examples in D.
• Testing instance x:
– Compute similarity between x and all examples in D.
– Assign x the category of the most similar example in D.
• Does not explicitly compute a generalization or
category prototypes.
• Also called:
– Case-based
– Memory-based
– Lazy learning
Nearest-Neighbor Learning Algorithm
• Using only the closest example to determine
categorization is subject to errors due to:
– A single atypical example.
– Noise (i.e. error) in the category label of a
single training example.
• More robust alternative is to find the k
most-similar examples and return the
majority category of these k examples.
• Value of k is typically odd; 3 and 5 are most common.
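A short sketch of the k-nearest-neighbour rule with majority voting (k = 3); the training vectors, labels, and the use of cosine similarity are assumptions for illustration:

```python
# k-NN classification: find the k most similar training examples and
# return the majority class among them (k = 3, cosine similarity).
import math
from collections import Counter

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def knn_classify(x, training, k=3):
    neighbours = sorted(training, key=lambda ex: cos_sim(x, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

training = [([2, 3, 5], "sports"), ([1, 4, 4], "sports"), ([0, 2, 6], "sports"),
            ([3, 7, 1], "politics"), ([4, 6, 0], "politics")]
print(knn_classify([0, 1, 3], training, k=3))  # -> sports
```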
Examples of Nearest-Neighbor Algorithm
[Figures: Example 1 and Example 2 of nearest-neighbor classification.]
Naïve Bayes for Text Classification
• Modeled as generating a bag of words for a
document in a given category by repeatedly
sampling with replacement from a
vocabulary V = {w1, w2, …, wm} based on the
probabilities P(wj | ci).
• Smooth probability estimates with Laplace
m-estimates assuming a uniform distribution
over all words (p = 1/|V|) and m = |V|
– Equivalent to a virtual sample of seeing each word in
each category exactly once.
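A minimal sketch of this multinomial Naive Bayes model with Laplace (add-one) smoothing; the tiny training corpus is invented for illustration:

```python
# Multinomial Naive Bayes with Laplace (add-one) smoothing:
# P(w | c) = (count of w in class c + 1) / (total words in c + |V|)
import math
from collections import Counter, defaultdict

train = [("the team won the match", "sports"),
         ("parliament passed the budget", "politics"),
         ("the striker scored a goal", "sports")]

word_counts = defaultdict(Counter)  # word counts per class
doc_counts = Counter()              # document counts per class (for the prior)
for text, c in train:
    doc_counts[c] += 1
    word_counts[c].update(text.lower().split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    scores = {}
    for c in doc_counts:
        total = sum(word_counts[c].values())
        score = math.log(doc_counts[c] / len(train))  # log prior
        for w in text.lower().split():
            if w in vocab:                            # ignore out-of-vocabulary words
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(predict("the team scored a goal"))  # expected: sports
```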
Example of Naive Bayes Classifying
[Figure: worked example of Naive Bayes classification with a vocabulary of size |V| = 6.]
Sample Learning Curve
(Yahoo Science Data)