CS463 - Natural Language Processing
Text Categorization
Text Categorization (Text Classification)
• Text categorization (text classification) is the task of
assigning categories to free-text documents.
• For example, news stories are organized by subject
categories (topics); academic papers are classified by
technical areas; patient reports in hospitals are indexed
using disease categories, etc.
• Another application of text categorization is spam
filtering, where email messages are classified into the
two categories, spam and non-spam.
Text Classification
• Text classification (text categorization):
– assign documents to one or more predefined categories.
[Figure: documents are assigned to one of the predefined classes class1, class2, …, classn.]
NLP, Data Mining and Machine Learning techniques
work together to automatically classify the different types
of documents.
Introduction
• Text classification (TC) is an important part of
text mining.
• An example of classification:
– automatically labeling news stories with a topic
like “sports”, “politics” or “art”
• The classification task:
– starts with a training set of documents labelled with a class;
– then determines a classification model to assign the correct class to a new document of the domain (see the sketch below).
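As an illustration of this train-then-classify workflow (not part of the original slides), here is a minimal sketch using scikit-learn with an invented toy corpus; the TF-IDF vectorizer and Naive Bayes classifier stand in for the model-learning step:

```python
# Minimal sketch of the train-then-classify workflow (toy data, for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Training set: documents labelled with a class.
train_docs = [
    "the team won the final match",            # sports
    "parliament passed the new budget law",    # politics
    "the gallery opened a sculpture exhibit",  # art
    "the striker scored two goals",            # sports
]
train_labels = ["sports", "politics", "art", "sports"]

# Determine a classification model from the labelled training documents.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

# Assign a class to a new document of the domain.
print(model.predict(["the team scored in the match"]))  # expected: ['sports']
```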
Learning for Text Categorization
• Manual development of text categorization
functions is difficult.
• Learning Algorithms:
– Bayesian (naïve)
– Neural network
– Relevance Feedback (Rocchio)
– Rule based (Ripper)
– Nearest Neighbor (case based)
– Support Vector Machines (SVM)
The Vector-Space Model
• Assume t distinct terms remain after preprocessing;
call them index terms or the vocabulary.
• These “orthogonal” terms form a vector space.
Dimension = t = |vocabulary|
• Each term i in a document or query j is given a
real-valued weight, wij.
• Both documents and queries are expressed as
t-dimensional vectors:
dj = (w1j, w2j, …, wtj)
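A minimal sketch of this representation (not from the slides), using an assumed toy vocabulary of t = 5 index terms and raw term counts as the weights wij:

```python
# Represent a document as a t-dimensional vector of term weights over a fixed
# vocabulary of index terms (here: raw counts, toy vocabulary of t = 5 terms).
import re

vocabulary = ["election", "goal", "match", "budget", "painting"]

def to_vector(text, vocab):
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tokens.count(term) for term in vocab]  # wij = count of term i in document j

doc = "the match ended with a late goal, a great match overall"
print(to_vector(doc, vocabulary))  # [0, 1, 2, 0, 0]
```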
What is a Vector Space Model?
• The document vectors are rendered as points in a plane.
Example: this vector space is divided into 3 classes. The boundaries between the classes are called decision boundaries. To classify a new document, we determine the region it occurs in and assign it the class of that region.
Graphic Representation
Example (two documents and a query over index terms T1, T2, T3):
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
• Is D1 or D2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle? Projection?
Term Weights: Term Frequency
• More frequent terms in a document are more important, i.e. more indicative of the topic.
fij = frequency of term i in document j
• May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:
tfij = fij / maxi{fij}
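A small sketch of this normalization (added for this handout, with an assumed toy document):

```python
# Normalized term frequency: tfij = fij / maxi{fij}
import re
from collections import Counter

def term_frequencies(text):
    counts = Counter(re.findall(r"[a-z]+", text.lower()))  # fij for each term i
    max_f = max(counts.values())                           # frequency of the most frequent term
    return {term: f / max_f for term, f in counts.items()}

print(term_frequencies("to be or not to be"))
# {'to': 1.0, 'be': 1.0, 'or': 0.5, 'not': 0.5}
```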
Term Weights: Inverse Document Frequency
• Terms that appear in many different documents
are less indicative of overall topic.
dfi = document frequency of term i
    = number of documents containing term i
idfi = inverse document frequency of term i
     = log2 (N / dfi)
(N: total number of documents)
• An indication of a term’s discrimination power.
• Log used to dampen the effect relative to tf.
TF-IDF Weighting
• A typical combined term importance indicator is
tf-idf weighting:
wij = tfij · idfi = tfij · log2 (N / dfi)
• A term occurring frequently in the document but
rarely in the rest of the collection is given high
weight.
• Many other ways of determining term weights
have been proposed.
• Experimentally, tf-idf has been found to work well.
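A short sketch of tf-idf weighting over an assumed toy collection of N = 3 documents, combining the tf and idf formulas above:

```python
# TF-IDF weighting: wij = tfij * log2(N / dfi), with tf normalized by the
# most frequent term in each document (toy collection of N = 3 documents).
import math
import re
from collections import Counter

docs = ["cats chase mice", "dogs chase cats", "mice eat cheese"]
N = len(docs)

tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
df = Counter(term for tokens in tokenized for term in set(tokens))  # dfi

def tf_idf(tokens):
    counts = Counter(tokens)
    max_f = max(counts.values())
    return {t: (f / max_f) * math.log2(N / df[t]) for t, f in counts.items()}

for j, tokens in enumerate(tokenized):
    print("document", j, tf_idf(tokens))
```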
Similarity Measure
• A similarity measure is a function that computes
the degree of similarity between two vectors.
• Using a similarity measure between the query and
each document:
– It is possible to rank the retrieved documents in the
order of presumed relevance.
– It is possible to enforce a certain threshold so that the
size of the retrieved set can be controlled.
Cosine Similarity Measure
• Cosine similarity measures the cosine of the angle between two vectors.
• It is the inner product normalized by the vector lengths:
CosSim(dj, q) = (dj · q) / (|dj| · |q|) = Σi=1..t (wij · wiq) / sqrt( (Σi=1..t wij^2) · (Σi=1..t wiq^2) )
Example:
D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / sqrt((4+9+25)(0+0+4)) ≈ 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / sqrt((9+49+1)(0+0+4)) ≈ 0.13
Q = 0T1 + 0T2 + 2T3
D1 is 6 times better than D2 using cosine similarity but only 5 times better using inner product.
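The numbers in this example can be reproduced with a short sketch of the cosine formula (pure Python, added for this handout):

```python
# Cosine similarity: CosSim(d, q) = (d . q) / (|d| * |q|)
import math

def cos_sim(d, q):
    dot = sum(wd * wq for wd, wq in zip(d, q))
    return dot / (math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q)))

D1 = [2, 3, 5]  # D1 = 2T1 + 3T2 + 5T3
D2 = [3, 7, 1]  # D2 = 3T1 + 7T2 + 1T3
Q  = [0, 0, 2]  # Q  = 0T1 + 0T2 + 2T3

print(round(cos_sim(D1, Q), 2))  # 0.81
print(round(cos_sim(D2, Q), 2))  # 0.13
```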
Using Relevance Feedback (Rocchio)
• Relevance feedback methods can be adapted for
text categorization.
• Use standard TF/IDF weighted vectors to
represent text documents (normalized by
maximum term frequency).
• For each category, compute a prototype vector by
summing the vectors of the training documents in
the category.
• Assign test documents to the category with the
closest prototype vector based on cosine
similarity.
Explaining Rocchio Method
What is a Vector Space Model?
• The document vectors are rendered as points in a plane.
This vector space is divided into 3 classes. The boundaries are called decision boundaries. To classify a new document, we determine the region it occurs in and assign it the class of that region.
Rocchio Classification
• Rocchio classification uses centroids to define the boundaries.
• The centroid of a class is computed as the center of mass of its members:
μ(c) = (1 / |Dc|) · Σd∈Dc v(d)
Dc is the set of all documents with class c.
v(d) is the vector space representation of d.
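A minimal sketch of Rocchio classification based on the centroid formula above; the training vectors and class labels are invented for illustration, and cosine similarity is used to pick the closest centroid:

```python
# Rocchio classification: centroid mu(c) = (1/|Dc|) * sum of v(d) over d in Dc;
# a test document is assigned to the class with the most similar centroid.
import math
from collections import defaultdict

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy training vectors v(d) with class labels (assumed for illustration).
training = [([2, 3, 5], "c1"), ([1, 4, 4], "c1"),
            ([3, 7, 1], "c2"), ([4, 6, 0], "c2")]

# Compute each class centroid as the mean of its member vectors.
by_class = defaultdict(list)
for vec, label in training:
    by_class[label].append(vec)
centroids = {c: [sum(col) / len(vecs) for col in zip(*vecs)]
             for c, vecs in by_class.items()}

# Assign a test document to the class with the closest centroid (cosine similarity).
test = [0, 1, 3]
print(max(centroids, key=lambda c: cos_sim(test, centroids[c])))  # -> c1
```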
Rocchio Classification
• Centroids of classes are shown as solid circles.
• The boundary between two classes is the set of points with equal distance from the two centroids.
Rocchio Properties
• Does not guarantee a consistent hypothesis.
• Forms a simple generalization of the
examples in each class (a prototype).
• Prototype vector does not need to be
averaged or otherwise normalized for length
since cosine similarity is insensitive to
vector length.
• Classification is based on similarity to class
prototypes.
Nearest-Neighbor Learning Algorithm
• Learning is just storing the representations of the
training examples in D.
• Testing instance x:
– Compute similarity between x and all examples in D.
– Assign x the category of the most similar example in D.
• Does not explicitly compute a generalization or
category prototypes.
• Also called:
– Case-based
– Memory-based
– Lazy learning
Nearest-Neighbor Learning Algorithm
• Using only the closest example to determine
categorization is subject to errors due to:
– A single atypical example.
– Noise (i.e. error) in the category label of a
single training example.
• More robust alternative is to find the k
most-similar examples and return the
majority category of these k examples.
• Value of k is typically odd; 3 and 5 are most common.
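A short sketch of the k-nearest-neighbour rule with majority voting (k = 3); the training vectors, labels, and the use of cosine similarity are assumptions for illustration:

```python
# k-NN classification: find the k most similar training examples and
# return the majority class among them (k = 3, cosine similarity).
import math
from collections import Counter

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def knn_classify(x, training, k=3):
    neighbours = sorted(training, key=lambda ex: cos_sim(x, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

training = [([2, 3, 5], "sports"), ([1, 4, 4], "sports"), ([0, 2, 6], "sports"),
            ([3, 7, 1], "politics"), ([4, 6, 0], "politics")]
print(knn_classify([0, 1, 3], training, k=3))  # -> sports
```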
Examples of Nearest-Neighbor Algorithm
[Figures: Example 1 and Example 2 of nearest-neighbor classification.]
Naïve Bayes for Text Classification
• Modeled as generating a bag of words for a
document in a given category by repeatedly
sampling with replacement from a
vocabulary V = {w1, w2, …, wm} based on the
probabilities P(wj | ci).
• Smooth probability estimates with Laplace
m-estimates assuming a uniform distribution
over all words (p = 1/|V|) and m = |V|
– Equivalent to a virtual sample of seeing each word in
each category exactly once.
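A minimal sketch of this multinomial Naive Bayes model with Laplace (add-one) smoothing; the tiny training corpus is invented for illustration:

```python
# Multinomial Naive Bayes with Laplace (add-one) smoothing:
# P(w | c) = (count of w in class c + 1) / (total words in c + |V|)
import math
from collections import Counter, defaultdict

train = [("the team won the match", "sports"),
         ("parliament passed the budget", "politics"),
         ("the striker scored a goal", "sports")]

word_counts = defaultdict(Counter)  # word counts per class
doc_counts = Counter()              # document counts per class (for the prior)
for text, c in train:
    doc_counts[c] += 1
    word_counts[c].update(text.lower().split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    scores = {}
    for c in doc_counts:
        total = sum(word_counts[c].values())
        score = math.log(doc_counts[c] / len(train))  # log prior
        for w in text.lower().split():
            if w in vocab:                            # ignore out-of-vocabulary words
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(predict("the team scored a goal"))  # expected: sports
```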
Example of Naive Bayes Classifying
[Figure: worked example of Naive Bayes classification with a vocabulary of size |V| = 6.]
Sample Learning Curve
(Yahoo Science Data)