Text Classification
Nakul Dave
Assistant Professor
Computer Engineering Department
Vishwakarma Government Engineering College - Ahmedabad
Table of contents
1 Introduction
Examples of classification
2 Text Classification Phases
Feature Extraction
3 Problem Formulation
4 Methods of Text Classification
Naïve Bayes classification
Smoothing
A Worked Example
Model Training and Testing
5 Types and Evaluation
Types
Evaluation
6 Questions
Introduction
Text Classification
Text Classification is a fundamental task in NLP that involves
categorizing text documents into predefined classes or categories.
It enables automated analysis and organization of large amounts
of textual data.
Examples of text classification tasks:
Sentiment Analysis: Classifying text as positive, negative, or
neutral.
Topic Classification: Assigning documents to specific topics or
themes.
Spam Detection: Identifying spam emails or messages.
What is Text Classification?
Figure: Text Classification use cases (source: https://www.sketchbubble.com/en/presentation-text-classification.html)
Text Classification Applications
Figure: Text Classification applications (source: https://www.sketchbubble.com/en/presentation-text-classification.html)
Examples of Movie Reviews
“Heartfelt and emotional, this is a must-watch.”
“Disappointing entry in the franchise, lacking the thrill and
intrigue of its predecessors. The plot is convoluted, and the
dialogues are uninspiring.”
“This movie is a weak romantic comedy that fails to make a
lasting impression.”
“This is the greatest comedy movie ever filmed.”
What is the subject of this document?
Figure: Document classifier (source: https://www.pericent.com/products/docedge-dms/hot-features/classification/)
Text Classification Phases
Text Classification Pipeline
1 Data Preprocessing: Cleaning and preparing the text data.
2 Feature Extraction: Converting text into numerical features.
3 Model Training: Building a classifier using labeled training data.
4 Model Evaluation: Assessing the performance of the trained
model.
5 Prediction: Applying the trained model to classify new, unseen
text data.
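These phases map naturally onto a vectorize-then-classify pipeline. A minimal sketch, assuming scikit-learn and a made-up two-review training set (data and variable names are illustrative, not from the slides):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up labeled data: 1 = positive, 0 = negative.
train_texts = ["amazing and captivating movie", "boring movie, did not like it"]
train_labels = [1, 0]

# Preprocessing + feature extraction (CountVectorizer lowercases and
# tokenizes, then builds bag-of-words counts), followed by model training.
model = Pipeline([("bow", CountVectorizer()), ("nb", MultinomialNB())])
model.fit(train_texts, train_labels)

# Prediction: apply the trained pipeline to new, unseen text.
print(model.predict(["what a captivating story"]))  # [1] on this toy data
```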
Feature Extraction: Bag-of-Words
Bag-of-Words (BoW) is a popular technique for feature extraction
in text classification.
It represents text as a collection of word counts or
presence/absence indicators.
Each unique word in the corpus becomes a feature or dimension in
the vector space.
BoW ignores word order and only considers the frequency or
presence of words.
Feature Extraction: Example
Let’s consider two movie reviews:
Review 1: ”The movie was amazing and captivating.”
Review 2: ”I didn’t like the movie. It was boring.”
After preprocessing, the reviews become:
Review 1: ”movie amazing captivating”
Review 2: ”didn’t like movie boring”
Using BoW with the shared vocabulary {movie, amazing, captivating, didn’t, like, boring}, we can represent the reviews as feature vectors:
Review 1: [1, 1, 1, 0, 0, 0] (indicating the presence of ”movie”, ”amazing”, and ”captivating”)
Review 2: [1, 0, 0, 1, 1, 1] (indicating the presence of ”movie”, ”didn’t”, ”like”, and ”boring”)
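A quick sketch of this encoding in plain Python (the vocabulary order is fixed to match the vectors above; the helper name is illustrative):

```python
docs = ["movie amazing captivating", "didn't like movie boring"]

# Fix a vocabulary order so vectors are comparable across documents.
vocab = ["movie", "amazing", "captivating", "didn't", "like", "boring"]

def bow_vector(doc):
    words = doc.split()
    # Presence/absence indicators; use words.count(w) instead for raw counts.
    return [1 if w in words else 0 for w in vocab]

print(bow_vector(docs[0]))  # [1, 1, 1, 0, 0, 0]
print(bow_vector(docs[1]))  # [1, 0, 0, 1, 1, 1]
```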
Problem Formulation
Text Classification: A Problem Formulation
Input
A document d
A fixed set of classes C = {c1, c2, . . . , cn}
Output
A predicted class c ∈ C
Methods of Text Classification
Classification Methods: Hand-coded rules
Rules are framed based on the features or words that occur in the text.
The number of rules depends on the features and the size of the text.
Spam
black-list-address OR (“dollars” AND “have been selected”)
Pros and Cons
Accuracy can be high if rules are carefully refined by experts, but
building and maintaining these rules is expensive.
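As code, the spam rule above might look like the following sketch (the blacklist contents and function name are made up for illustration):

```python
BLACKLIST = {"known.spammer@example.com"}  # hypothetical black-list addresses

def is_spam(sender, body):
    # black-list-address OR ("dollars" AND "have been selected")
    if sender in BLACKLIST:
        return True
    text = body.lower()
    return "dollars" in text and "have been selected" in text

print(is_spam("friend@example.com",
              "You have been selected to receive 1000 dollars!"))  # True
```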
Classification Methods: Supervised Machine Learning
Naive Bayes
Decision Tree Induction
Support Vector Machine
Logistic Regression
···
Naive Bayes Classification
Naive Bayes is a probabilistic classification algorithm based on
Bayes’ theorem.
It assumes that the features are conditionally independent given
the class label.
Naive Bayes is commonly used for text classification tasks.
Bayes’ Theorem
Bayes’ theorem is defined as:

P(c|d) = P(d|c) · P(c) / P(d)

Where:
P(c|d) is the posterior probability of class c given the document d.
P(d|c) is the likelihood of the document d given class c.
P(c) is the prior probability of class c.
P(d) is the probability of the document d (the evidence).
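As a quick numeric illustration (the numbers are made up): if P(positive) = 0.6, P(negative) = 0.4, P(d|positive) = 0.02, and P(d|negative) = 0.01, then P(d) = 0.6 · 0.02 + 0.4 · 0.01 = 0.016, so P(positive|d) = (0.02 · 0.6) / 0.016 = 0.75.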
Naive Bayes Algorithm
1 Calculate the prior probability P(c) for each class c from the training data.
2 For each feature in the input document d, calculate the likelihood P(d|c) for each class c based on the training data.
3 Multiply the prior probabilities and likelihoods to obtain the joint probability P(d, c) for each class c.
4 Normalize the joint probabilities by dividing them by the evidence P(d) to obtain the posterior probabilities P(c|d).
5 Assign the input to the class with the highest posterior probability.
Argmax Notation
Bayes’ theorem

P(c|d) = P(d|c) · P(c) / P(d)

For Naive Bayes classification, the argmax notation can be expressed as:

Naïve Bayes Classifier

ĉ = argmax_{c∈C} P(c|d)
  = argmax_{c∈C} P(d|c) P(c)
  = argmax_{c∈C} P(x1, x2, . . . , xn|c) P(c)

(The denominator P(d) is dropped because it is the same for every class.)
Naïve Bayes classification assumptions
P(x1, x2, . . . , xn|c)

Bag of Words Assumption
Assume that the position of a word in the document doesn’t matter.

Conditional Independence
P(x1, x2, . . . , xn|c) = P(x1|c) · P(x2|c) · · · P(xn|c)

ĉ_NB = argmax_{c∈C} P(c) ∏_{x∈X} P(x|c)
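In practice, multiplying many small probabilities underflows floating-point arithmetic, so implementations usually sum log-probabilities instead. A minimal sketch (the table names log_prior and log_likelihood are illustrative):

```python
import math

def predict(doc_tokens, classes, log_prior, log_likelihood):
    # Score each class by summing logs: log P(c) + sum over x of log P(x|c).
    best_class, best_score = None, -math.inf
    for c in classes:
        score = log_prior[c]  # log P(c)
        for w in doc_tokens:
            # One common choice: simply skip out-of-vocabulary words.
            score += log_likelihood.get((w, c), 0.0)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```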
Learning the model parameters
Maximum Likelihood Estimate

P̂(c_j) = count(C = c_j) / N_doc

P̂(w_i|c_j) = count(w_i, c_j) / Σ_{w∈V} count(w, c_j)

Problem with MLE
Suppose that in the training data we have never seen the word “Awesome” in a document labeled ‘positive’. Then

P̂(Awesome|positive) = 0

and ĉ_NB = argmax_{c∈C} P(c) ∏_{x∈X} P(x|c) becomes zero for the class ‘positive’, no matter how strong the remaining evidence is.
Laplace (add-1) Smoothing
Laplace Smoothing

P̂(w_i|c) = (count(w_i, c) + 1) / Σ_{w∈V} (count(w, c) + 1)
          = (count(w_i, c) + 1) / (Σ_{w∈V} count(w, c) + |V|)
A Worked Example
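A small end-to-end sketch with a made-up five-document sentiment corpus, applying the prior, the add-1 smoothed likelihood, and the argmax rule from the previous slides (the corpus and helper names are illustrative):

```python
from collections import Counter

# Made-up training corpus: (tokens, label).
train = [
    ("fun couple love love".split(), "positive"),
    ("fast furious shoot".split(), "negative"),
    ("couple fly fast fun fun".split(), "positive"),
    ("furious shoot shoot fun".split(), "negative"),
    ("fly fast shoot love".split(), "negative"),
]

classes = {"positive", "negative"}
vocab = {w for tokens, _ in train for w in tokens}

# Priors: P(c) = count(C = c) / N_doc
prior = {c: sum(1 for _, y in train if y == c) / len(train) for c in classes}

# Per-class word counts for the add-1 smoothed likelihoods.
counts = {c: Counter() for c in classes}
for tokens, y in train:
    counts[y].update(tokens)

def likelihood(w, c):
    # P(w|c) = (count(w, c) + 1) / (sum over w of count(w, c) + |V|)
    return (counts[c][w] + 1) / (sum(counts[c].values()) + len(vocab))

def classify(tokens):
    scores = {}
    for c in classes:
        score = prior[c]
        for w in tokens:
            if w in vocab:  # ignore out-of-vocabulary words
                score *= likelihood(w, c)
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("fast couple shoot fly".split()))  # "negative" on this data
```

Working through one likelihood by hand reproduces what the code computes: count(fast, negative) = 2, the negative class has 11 tokens in total, and |V| = 7, so P̂(fast|negative) = (2 + 1) / (11 + 7) = 1/6.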
Model Training
Once the text data is transformed into numerical features, we can
train a text classification model.
Popular algorithms for text classification include:
Naive Bayes: Based on Bayes’ theorem, assumes independence
between features.
Support Vector Machines (SVM): Constructs hyperplanes to
separate different classes.
Decision Trees: Hierarchical structure of if-else rules for
classification.
Neural Networks: Deep learning models with multiple layers for
feature learning.
We train the model on a labeled dataset, where each review is
associated with its sentiment label.
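Each of the algorithms listed above can fill the same slot after feature extraction. A hedged sketch with a made-up four-review dataset (the score shown is training accuracy, for illustration only):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

texts = ["amazing movie", "boring plot", "loved it", "waste of time"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

X = CountVectorizer().fit_transform(texts)  # BoW features

for name, clf in [("Naive Bayes", MultinomialNB()),
                  ("SVM", LinearSVC()),
                  ("Decision Tree", DecisionTreeClassifier()),
                  ("Logistic Regression", LogisticRegression())]:
    clf.fit(X, labels)
    print(name, clf.score(X, labels))  # accuracy on the training data itself
```

(Neural networks need more data and a different toolkit, so they are omitted from the sketch.)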
Model Evaluation
To assess the performance of the text classification model, we
evaluate it using appropriate metrics.
Common evaluation metrics for text classification include
accuracy, precision, recall, and F1-score.
We split our labeled dataset into training and test sets, and
evaluate the model on the test set.
The evaluation results provide insights into the model’s ability to
generalize and classify new, unseen data.
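A minimal sketch of computing these metrics, assuming scikit-learn and hypothetical label vectors for a held-out test set:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical gold labels and model predictions on a test set.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
```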
Prediction
After training and evaluating the model, we can use it to predict
the sentiment of new, unseen movie reviews.
We preprocess the new text data, extract features using the same
technique (e.g., BoW), and feed it to the trained model.
The model assigns a sentiment label (positive or negative) to each
new review based on its learned patterns.
The predictions can be used for various applications, such as
recommendation systems or sentiment analysis dashboards.
Types and Evaluation
Types of classification
Binary classification - Two Classes only
Yes, No
Positive, Negative
Multinomial classification - More than two classes
Example - High, Low, Medium
Example - Very Poor, Poor, Average, Good, Very Good, Excellent
Multi-value (multi-label) classification - a document can belong to 0, 1,
or more than one class
Naïve Bayes: More than Two Classes
Multi-value classification
A document can belong to 0, 1 or > 1 classes
Handling Multi-value classification
For each class c ∈ C, build a classifier γc to distinguish c from all
other classes c′ ∈ C
Given a test document d, evaluate it for membership in each class using
each γc
d belongs to every class for which γc returns true (see the sketch below)
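A sketch of this one-classifier-per-class scheme, assuming scikit-learn (the documents and topic labels are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up documents, each carrying zero or more topic labels.
texts = ["stock prices fell sharply", "the match ended in a draw",
         "stocks rally after the election", "election results are in"]
labels = [["finance"], ["sports"], ["finance", "politics"], ["politics"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)            # one binary column per class
vec = CountVectorizer().fit(texts)
X = vec.transform(texts)

# OneVsRestClassifier builds one binary classifier (a gamma_c) per class,
# trained to separate that class from all the others.
clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)

pred = clf.predict(vec.transform(["stocks and election news"]))
print(mlb.inverse_transform(pred))       # possibly several (or zero) labels
```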
Evaluation of Text Classification
Accuracy measures the proportion of correctly classified instances
out of all instances in the dataset.
Precision measures the proportion of true positive instances out of
all instances classified as positive.
Recall (Sensitivity) measures the proportion of true positive
instances out of all actual positive instances.
Confusion Matrix - Two Classes
The confusion matrix displays the number of true positive, true
negative, false positive, and false negative predictions.
It provides a more detailed view of the model’s performance for
each class.
Figure: Two-class confusion matrix (image courtesy: https://www.arxiv-vanity.com/papers/2008.05756/)
Precision
Precision
The Precision is the fraction of True Positive elements divided by the
total number of positively predicted units (the column sum of the
predicted positives).
In particular, True Positives are the elements that have been labeled as
positive by the model and are actually positive, while False Positives are
the elements that have been labeled as positive by the model but are
actually negative.
Let’s Calculate Precision
Precision = TP / (TP + FP) = 20 / (20 + 10) ≈ 0.67
Recall
Recall
The Recall is the fraction of True Positive elements divided by the total
number of actual positive units (the row sum of the actual positives).
In particular, False Negatives are the elements that have been labeled as
negative by the model but are actually positive.
Let’s Calculate Recall
Recall = TP / (TP + FN) = 20 / (20 + 5) = 0.80
Accuracy
Accuracy
Accuracy is one of the most popular metrics in multi-class classification
and it is directly computed from the confusion matrix.
Let’s Calculate Accuracy
Accuracy = (TP + TN) / (TP + FP + TN + FN)
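Plugging in the slide’s counts (TP = 20, FP = 10, FN = 5, plus an assumed TN = 65, which the slides do not specify):

```python
TP, FP, FN, TN = 20, 10, 5, 65  # TN is an assumption for illustration

precision = TP / (TP + FP)                   # 20 / 30  ≈ 0.67
recall    = TP / (TP + FN)                   # 20 / 25  = 0.80
accuracy  = (TP + TN) / (TP + FP + TN + FN)  # 85 / 100 = 0.85
print(precision, recall, accuracy)
```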
Confusion Matrix - More than two classes
Figure: Multi-class confusion matrix (image courtesy: https://www.arxiv-vanity.com/papers/2008.05756/)
Micro- vs. Macro-Average
If we have more than one class, how do we combine multiple
performance measures into one quantity?
Macro-averaging
Compute performance for each class, then average
Micro-averaging
Collect decisions for all the classes, compute contingency table, and
evaluate.
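A small sketch of the difference, assuming scikit-learn and made-up three-class labels:

```python
from sklearn.metrics import precision_score

y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]

# Macro: compute precision per class, then average (classes weigh equally).
print(precision_score(y_true, y_pred, average="macro"))
# Micro: pool all decisions into one contingency table first
# (frequent classes dominate).
print(precision_score(y_true, y_pred, average="micro"))
```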
Figure: Macro vs. micro-averaged precision
Questions?
Thank You All.......