COMP 3361 Natural Language Processing
Lecture 3: Text Classification
Spring 2024
Many materials are adapted from CS224n@Stanford and COS484@Princeton, with special thanks!
Announcements
● Assignment 1 has been released (due in 4 weeks: 9:00 am, Feb 20)
● Once more, please sign up for the course's Slack workspace. This is included in your
class participation grade.
https://join.slack.com/t/slack-fdv4728/shared_invite/zt-2asgddr0h-6wIXbRndwKhBw2IX2~ZrJQ
● You should be able to access the course Moodle page now.
● The course page has updated details on the tentative schedule
Google form survey
https://forms.gle/FMQvFCuzUyJ3pB93A
Lecture plan
● Recap of language modeling
● Naive Bayes and sentiment classification
● Logistic Regression for text classification
Generating from language models
● Deterministic approach (greedy decoding): with temperature=0, the model always selects the
highest-probability word at each step
How ChatGPT completes a sentence with temperature=0
https://www.atmosera.com/ai/understanding-chatgpt/
Generating from language models
● Stochastic approach: with a temperature such as 0.7, the next word is sampled from the
probability distribution over the possible words. More creative!
How ChatGPT completes a sentence with temperature=0.7
https://www.atmosera.com/ai/understanding-chatgpt/
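The two decoding strategies above can be sketched in a few lines. This is a minimal illustration, not how ChatGPT is implemented: the vocabulary, the scores, and the function name `sample_next_word` are all hypothetical.

```python
import math
import random

def sample_next_word(logits, temperature, rng=random):
    """Pick the next word from a dict of word -> unnormalized score (logit)."""
    if temperature == 0:
        # Greedy decoding: always return the highest-scoring word.
        return max(logits, key=logits.get)
    # Divide the scores by the temperature, then apply a numerically stable softmax.
    scaled = {w: s / temperature for w, s in logits.items()}
    top = max(scaled.values())
    weights = {w: math.exp(s - top) for w, s in scaled.items()}
    words = list(weights)
    # Sample one word in proportion to its softmax weight.
    return rng.choices(words, weights=[weights[w] for w in words], k=1)[0]

# Hypothetical scores for the word following "The cat sat on the".
logits = {"mat": 2.0, "sofa": 1.5, "roof": 0.5}
print(sample_next_word(logits, temperature=0))    # always "mat"
print(sample_next_word(logits, temperature=0.7))  # varies from run to run
```

Lower temperatures concentrate probability on the top word; higher temperatures flatten the distribution and make less likely words more frequent.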
Why text classification?
● Spam email detection
● Sentiment analysis
Q: any other examples?
Text classification
Prompting ChatGPT for text classification
Parse ChatGPT’s output
Rule-based text classification
IF there exists a word w in document d such that w in [good, great, extraordinary, …],
THEN output Positive
IF email address ends in [ithelpdesk.com, makemoney.com, spinthewheel.com, …]
THEN output SPAM
● + Can be very accurate (if rules carefully refined by expert)
● - Rules may be hard to define (and some even unknown to us!)
● - Labor intensive and expensive
● - Hard to generalize and keep up-to-date
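The two rules on this slide can be written as a short sketch. The word list and domain list are the illustrative, incomplete ones from the slide; the function names are hypothetical, and returning `None` when no rule fires is an assumption, since the slide does not say what to output in that case.

```python
POSITIVE_WORDS = {"good", "great", "extraordinary"}  # incomplete, illustrative list
SPAM_DOMAINS = ("ithelpdesk.com", "makemoney.com", "spinthewheel.com")

def rule_based_sentiment(document):
    """Return "Positive" if any word of the document is in the positive list."""
    words = document.lower().split()
    if any(w.strip(".,!?") in POSITIVE_WORDS for w in words):
        return "Positive"
    return None  # assumption: no rule fired, so no label is produced

def rule_based_spam(sender):
    """Return "SPAM" if the sender's address ends in a blocklisted domain."""
    if sender.lower().endswith(SPAM_DOMAINS):
        return "SPAM"
    return None

print(rule_based_sentiment("The plot was great!"))  # Positive
print(rule_based_sentiment("This was not good."))   # Positive
print(rule_based_spam("alice@makemoney.com"))       # SPAM
```

The second call shows why such rules are hard to get right: "not good" is labelled Positive because the rule ignores negation.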
Supervised Learning: Let’s use statistics!
Let the machine figure out the best patterns using data
Key questions:
● What is the form of F?
● How do we learn F?
Types of supervised classifiers
Logistic regression
Naive Bayes
Support vector machines
Neural networks
Naive Bayes
Naive Bayes classifier
Simple classification model making use of Bayes rule
● Bayes rule:
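Written for a document $d$ and a class $c$ (notation assumed here to match the rest of the lecture), Bayes rule and the resulting classification rule are sketched below. The denominator $P(d)$ is the same for every class, so it can be dropped when taking the argmax.

```latex
P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)}
\qquad\Longrightarrow\qquad
\hat{c} = \arg\max_{c} P(c \mid d) = \arg\max_{c} P(d \mid c)\, P(c)
```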
How to represent P(d | c)?
● Option 1: represent the entire sequence of words
○ Too many sequences!
● Option 2: Bag of words
○ Assume position of each word doesn’t matter
○ Probability of each word is conditionally independent of the other words given
class c
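Under the two bag-of-words assumptions above, the likelihood of a document factorizes over its words. A sketch, assuming the document is written as $d = w_1, w_2, \ldots, w_n$:

```latex
P(d \mid c) = P(w_1, w_2, \ldots, w_n \mid c) \approx \prod_{i=1}^{n} P(w_i \mid c)
```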
Bag of words (BoW)
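A minimal sketch of a bag-of-words representation, assuming whitespace tokenization and lowercasing (both simplifications; the helper name `bag_of_words` is hypothetical):

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to word counts, discarding word order."""
    return Counter(text.lower().split())

a = bag_of_words("the movie was good not bad")
b = bag_of_words("the movie was bad not good")
print(a == b)  # True: both sentences have the same bag of words
```

The two sentences express opposite sentiments yet receive identical representations, which is exactly the information the "position doesn't matter" assumption throws away.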
Predicting with Naive Bayes
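A sketch of the prediction rule for a document $d = w_1, \ldots, w_n$. Working in log space is an assumption made here; it is a common choice because multiplying many small probabilities underflows, and it does not change the argmax.

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
        = \arg\max_{c} \; \Big[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \Big]
```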
How to estimate probabilities?
Data sparsity problem
This sounds familiar…
Solution: Smoothing!
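A sketch of the estimates involved, assuming add-one (Laplace) smoothing, which is one common choice. Here $N_c$ is the number of training documents with class $c$, $N$ the total number of documents, $\mathrm{count}(w, c)$ the number of times word $w$ occurs in documents of class $c$, and $V$ the vocabulary; these symbols are introduced for this sketch.

```latex
\hat{P}(c) = \frac{N_c}{N}
\qquad
\hat{P}_{\text{MLE}}(w \mid c) = \frac{\mathrm{count}(w, c)}{\sum_{w' \in V} \mathrm{count}(w', c)}
\qquad
\hat{P}_{\text{add-1}}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\sum_{w' \in V} \mathrm{count}(w', c) + |V|}
```

Without smoothing, a single word never seen with class $c$ gives $\hat{P}_{\text{MLE}}(w \mid c) = 0$ and zeroes out the whole product for that class.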
Overall process
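The training and prediction steps can be sketched end to end as follows. This is a minimal illustration under stated assumptions: whitespace tokenization, add-one smoothing, unknown test words ignored, and a toy training set invented for this sketch. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (text, label). Returns log priors, log likelihoods, vocabulary."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for text, label in docs:
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        # Add-one smoothing: every vocabulary word gets one extra count.
        total = sum(word_counts[c].values()) + len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}
    return log_prior, log_lik, vocab

def predict(text, log_prior, log_lik, vocab):
    """Return the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for c in log_prior:
        # Words never seen in training are skipped (an assumption of this sketch).
        scores[c] = log_prior[c] + sum(
            log_lik[c][w] for w in text.lower().split() if w in vocab)
    return max(scores, key=scores.get)

# Toy training set invented for this sketch.
train = [
    ("great fun film", "pos"),
    ("great acting", "pos"),
    ("boring plot", "neg"),
    ("boring and predictable film", "neg"),
]
model = train_naive_bayes(train)
print(predict("great film", *model))   # pos
print(predict("boring film", *model))  # neg
```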
A worked example for sentiment analysis
Naive Bayes vs. language models
Naive Bayes: pros and cons
Naive Bayes can use any features!
● In general, Naive Bayes can use
any set of features, not just
words:
○ URLs, email addresses,
Capitalization, …
○ Domain knowledge crucial
to performance
Top features for spam detection
Wait, we already have ChatGPT, why still NB?
Naive Bayes vs. Transformers, neural networks, and many others (e.g., ChatGPT)
● Computational efficiency, cost
● Simplicity and interpretability
● Small data performance
● Out of domain
○ Requires domain experts to design
features
● …
Logistic regression
Study this on your own!
https://machine-learning.paperspace.com/wiki/logistic-regression
Generative vs. discriminative models
Generative classifiers
Discriminative classifiers
Overall process: Discriminative classifiers
1. Feature representation
Bag of words
Example: Sentiment classification
2. Classification function
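For binary logistic regression, the classification function is the sigmoid of a weighted sum of the features. A minimal sketch; the two features, the weights, and the bias below are hypothetical values chosen for illustration.

```python
import math

def sigmoid(z):
    """Squash a real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """P(y = 1 | x) for feature vector x, weight vector w, and bias b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Hypothetical features: [count of positive words, count of negative words].
x = [3, 1]
w = [1.0, -1.5]
b = 0.1
p = predict_proba(x, w, b)
print(p > 0.5)  # True: the document is classified as positive
```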
Example: Sentiment classification
3. Loss function
Example: Computing CE loss
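A small sketch of the binary cross-entropy computation, $L_{CE} = -[\,y \log p + (1 - y) \log(1 - p)\,]$, where $p$ is the predicted probability of the positive class. The probability 0.9 is a made-up value for illustration.

```python
import math

def cross_entropy(p, y):
    """Binary cross-entropy for predicted P(y=1|x) = p and gold label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(round(cross_entropy(0.9, 1), 3))  # 0.105: confident and correct, small loss
print(round(cross_entropy(0.9, 0), 3))  # 2.303: confident and wrong, large loss
```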
Properties of CE loss
4. Optimization
Gradient for logistic regression
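With a sigmoid output and cross-entropy loss, the gradient with respect to weight $w_j$ is $(\sigma(w \cdot x + b) - y)\, x_j$, and with respect to the bias it is $\sigma(w \cdot x + b) - y$. One stochastic gradient descent update on a single example can be sketched as below; the learning rate and the example values are assumptions for illustration.

```python
import math

def sgd_step(x, y, w, b, lr=0.1):
    """One stochastic gradient descent update on a single example (x, y)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    error = p - y  # derivative of the cross-entropy loss with respect to z
    new_w = [wi - lr * error * xi for wi, xi in zip(w, x)]
    new_b = b - lr * error
    return new_w, new_b

# Starting from all-zero parameters, a positive example pushes the weights up.
w, b = sgd_step(x=[3, 1], y=1, w=[0.0, 0.0], b=0.0)
print(w, b)
```

Each weight moves in proportion to its feature value and to how far the prediction was from the gold label.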
Regularization
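A sketch of a regularized training objective, assuming L2 regularization (one common choice). Here $m$ is the number of training examples, $\theta$ collects the weights, and $\alpha$ is a hyperparameter controlling the strength of the penalty; these symbols are introduced for this sketch.

```latex
\hat{\theta} = \arg\min_{\theta} \; \frac{1}{m} \sum_{i=1}^{m} L_{CE}\big(y^{(i)}, x^{(i)}; \theta\big) + \alpha \sum_{j} \theta_j^{2}
```

The penalty term discourages large weights, which helps keep the model from fitting the training set too closely.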
Multinomial logistic regression
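With more than two classes, the sigmoid is replaced by a softmax over one score per class. A minimal sketch; the three scores below are hypothetical.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into a probability distribution."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three classes, e.g. positive, neutral, negative.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
```

The outputs are positive, sum to one, and preserve the ordering of the input scores.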
Features in multinomial LR
Learning
Next lecture: word embeddings