Text Classification & Naive Bayes

Text classification is an NLP technique that assigns categories to text data, commonly used for spam detection, sentiment analysis, and topic labeling. Naive Bayes is a probabilistic classifier that operates on the assumption of feature independence and is effective in text classification tasks. It involves training on word frequencies and calculating probabilities to predict the class of new documents.


What is Text Classification?

Text classification is a Natural Language Processing (NLP) technique used to automatically assign categories or tags to textual data. It's widely used in:

● Spam detection (spam or not spam)
● Sentiment analysis (positive, neutral, negative)
● Topic labeling (e.g., sports, politics, tech)
● Language detection

In a typical text classification pipeline, the text is first transformed into numerical features, and a machine learning model is then trained on those features to learn patterns and make predictions.
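For instance, here is a minimal sketch of the feature-extraction step using scikit-learn's CountVectorizer (the tiny corpus below is invented purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer

# A tiny illustrative corpus (made up for this example)
docs = [
    "win a free prize now",
    "meeting rescheduled to friday",
    "free tickets claim your prize",
]

# Bag of Words: each document becomes a vector of word counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary learned from the corpus
print(X.toarray())                         # one row of word counts per document

Each row of X can then be fed to a classifier such as Naive Bayes, introduced next.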

What is Naive Bayes?

Naive Bayes is a probabilistic classifier based on Bayes' Theorem, and it assumes:

1. Features (words in text) are independent of each other (the naive assumption).

2. Each word contributes equally and independently to the probability of a class.

Despite its simplicity, it works surprisingly well in many text classification problems.

Bayes' Theorem Refresher:

P(C|X) = [P(X|C) · P(C)] / P(X)

Where:

● P(C|X): Posterior probability of class C given features X (e.g., probability it's spam given the words)
● P(X|C): Likelihood of features X given class C
● P(C): Prior probability of class C
● P(X): Probability of features X
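As a quick worked example (all numbers invented purely for illustration), suppose 30% of emails are spam, the word "free" appears in 60% of spam emails and in 5% of non-spam emails. Bayes' theorem then gives the probability that an email containing "free" is spam:

# Hypothetical numbers, chosen only to illustrate Bayes' theorem
p_spam = 0.30              # P(C): prior probability an email is spam
p_ham = 0.70               # prior probability it is not spam
p_free_given_spam = 0.60   # P(X|C): chance a spam email contains "free"
p_free_given_ham = 0.05    # chance a non-spam email contains "free"

# P(X): total probability of seeing "free" (law of total probability)
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

# P(C|X): posterior probability the email is spam given it contains "free"
p_spam_given_free = (p_free_given_spam * p_spam) / p_free
print(round(p_spam_given_free, 3))   # ≈ 0.837

So a single word can shift the posterior well above the 30% prior.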

How Naive Bayes Works in Text Classification:

1. Training Phase:

○ Go through the training text documents and count how often each word appears in each category.
○ Estimate probabilities for each word in each class.
○ Use smoothing techniques like Laplace Smoothing to handle rare or unseen words.

2. Prediction Phase:

○ For a new document, calculate the probability of each class given the words in the document.
○ The class with the highest probability is the predicted category.
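A compact from-scratch sketch of both phases, counting words per class and applying Laplace smoothing (the toy documents, labels, and function names below are invented for illustration; log-probabilities are used to avoid numerical underflow):

import math
from collections import Counter, defaultdict

# Toy training data (invented for illustration)
train = [
    ("win free prize now", "spam"),
    ("free money win big", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("project notes for the meeting", "ham"),
]

# Training phase: count word frequencies per class and class frequencies
word_counts = defaultdict(Counter)   # class -> word frequency counts
class_counts = Counter()             # class -> number of documents
vocab = set()
for text, label in train:
    words = text.split()
    word_counts[label].update(words)
    class_counts[label] += 1
    vocab.update(words)

def predict(text):
    """Prediction phase: pick the class with the highest posterior (in log space)."""
    best_class, best_score = None, float("-inf")
    for label in class_counts:
        # log prior: fraction of training documents in this class
        score = math.log(class_counts[label] / len(train))
        total_words = sum(word_counts[label].values())
        for w in text.split():
            # Laplace smoothing: add 1 so unseen words never get zero probability
            likelihood = (word_counts[label][w] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_class, best_score = label, score
    return best_class

print(predict("free prize meeting"))   # comes out as "spam" on this toy data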

Naive Bayes for Text Classification

Here's how it works:

1. Text Preprocessing:

○ Tokenize the text (split into words).
○ Remove stopwords, punctuation.
○ Lowercase everything.
○ Optional: Stemming/Lemmatization.

2. Feature Extraction:

○ Convert text into numerical features using:

■ Bag of Words (BoW): Counts word occurrences.
■ TF-IDF: Weights words based on frequency in a document vs. all documents.

3. Model Training:

○ Count word frequencies per class (e.g., word "free" appears 40 times in spam emails and 2 times in non-spam).
○ Calculate:

■ Prior probabilities (e.g., % of emails that are spam).
■ Likelihoods (probability of a word given a class).

○ Apply Laplace smoothing to handle words not seen in training data.

4. Prediction:

○ For a new email/document, calculate the probability it belongs to each class.
○ Choose the class with the highest probability.
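Putting the four steps together, here is a minimal end-to-end sketch using scikit-learn (the tiny corpus and labels are invented for illustration; a real system would use a much larger training set):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled corpus (invented for illustration)
texts = [
    "Claim your FREE prize now!!!",
    "Win money fast, click here",
    "Agenda for tomorrow's project meeting",
    "Lecture notes on probability attached",
]
labels = ["spam", "spam", "ham", "ham"]

# Steps 1-2: TfidfVectorizer lowercases, tokenizes, drops English stopwords,
# and weights words by TF-IDF.
# Step 3: MultinomialNB learns priors and likelihoods, with Laplace smoothing (alpha=1.0).
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    MultinomialNB(alpha=1.0),
)
model.fit(texts, labels)

# Step 4: the class with the highest posterior probability wins
print(model.predict(["free prize, claim now"]))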

Pros of Naive Bayes:

● Simple and fast to train and predict.
● Scales well to large datasets.
● Performs well even with a small amount of training data.
● Robust to noisy and irrelevant features (e.g., common words like "the", "is").

Cons:

● Assumes independence among features (not true in real language).
● If a word wasn't seen during training, its probability becomes zero (handled by smoothing).

Applications:

● Email spam detection
● News categorization
● Twitter sentiment analysis
● Document tagging (legal, medical, academic domains)

Variants of Naive Bayes in Text:

1. Multinomial Naive Bayes (most common for text) – Uses word frequencies.
2. Bernoulli Naive Bayes – Uses binary features (whether a word appears or not).
3. Gaussian Naive Bayes – Used for continuous data (not common in NLP).
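A brief sketch of how the first two variants differ in scikit-learn (toy data invented for illustration): MultinomialNB consumes the word counts directly, while BernoulliNB only looks at whether each word appears at all.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

texts = ["free free prize", "meeting notes", "free meeting today"]  # toy documents
labels = ["spam", "ham", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(texts)             # word-count features

multi = MultinomialNB().fit(X, labels)   # uses how many times each word occurs
bern = BernoulliNB().fit(X, labels)      # binarizes counts to presence/absence

X_new = vec.transform(["free prize today"])
print(multi.predict(X_new), bern.predict(X_new))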
