Name – Amit Shukla
Roll No. – 2200971640010
Branch – AIML
Subject – Machine Learning Technique Lab
Practical-06
AIM - Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier model to
perform this task. Built-in Java classes/API can be used to write the program. Calculate the accuracy,
precision, and recall for your data set.
Theory:- The Naïve Bayesian Classifier is a probabilistic machine learning model used for text classification
tasks, such as spam detection or sentiment analysis. It is based on Bayes' Theorem, with the "naïve" assumption
that all features (words in a document) are independent of each other given the class label. Despite this
simplification, it performs remarkably well in practical applications.
Key Concepts:
Bayes’ Theorem:
It provides a way to calculate the probability of a hypothesis given the evidence (see the formula after this list).
Prior Probability P(H):
Probability of a class (e.g., positive or negative) before seeing the data.
Likelihood P(E∣H):
Probability of observing a word in a document, given the class.
Posterior Probability P(H∣E):
Final probability of the class given the observed features (words).
Feature Independence Assumption:
Assumes each word in the document contributes independently to the class probability.
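In symbols, Bayes' Theorem ties these quantities together:
P(H∣E) = P(E∣H) × P(H) / P(E)
As a worked example with purely illustrative numbers: if P(pos) = 0.5, P("great" ∣ pos) = 0.2, and
P("great") = 0.12, then P(pos ∣ "great") = (0.2 × 0.5) / 0.12 ≈ 0.83.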
How the Naïve Bayesian Classifier Works for Document Classification:
1. Preprocess the Text:
Convert documents into tokens (words), remove stopwords, and vectorize the data using techniques
like Bag of Words or TF-IDF.
2. Training Phase:
Use the training documents and their labels to calculate the prior and likelihood probabilities for
each class (a minimal hand-rolled sketch of this step appears after this list).
3. Prediction Phase:
For a new/unseen document, compute the posterior probability for each class, and assign the class
with the highest probability.
4. Evaluation:
Use metrics such as Accuracy, Precision, and Recall to evaluate model performance.
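To make steps 2 and 3 concrete, here is a minimal hand-rolled sketch in Python (the toy corpus and
its word counts are invented for illustration): it estimates the class prior and Laplace-smoothed
word likelihoods from the training documents, then assigns the most probable class.

from collections import Counter

# Toy labelled corpus (invented examples for illustration)
train_docs = [("good great movie", "pos"),
              ("great acting", "pos"),
              ("bad boring movie", "neg")]

# Training phase: count class frequencies and per-class word frequencies
class_counts = Counter(label for _, label in train_docs)
word_counts = {c: Counter() for c in class_counts}
for text, label in train_docs:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def posterior(text, c):
    # Prior P(c) times Laplace-smoothed likelihoods P(w|c)
    p = class_counts[c] / sum(class_counts.values())
    n_c = sum(word_counts[c].values())
    for w in text.split():
        p *= (word_counts[c][w] + 1) / (n_c + len(vocab))
    return p

# Prediction phase: choose the class with the highest posterior
print(max(class_counts, key=lambda c: posterior("great movie", c)))  # -> pos

Laplace (add-one) smoothing keeps a single unseen word from zeroing out the whole product;
scikit-learn's MultinomialNB, used in the source code below, applies the same smoothing through
its alpha parameter (alpha=1.0 by default).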
Assumptions of Naïve Bayesian Classifier:
• The features (words) are conditionally independent given the class.
• The training dataset is representative of the real-world distribution.
• The input text is already preprocessed (cleaned and vectorized).
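Formally, the conditional independence assumption lets the likelihood of a document with words
w1, w2, …, wn factorise into a product of per-word likelihoods:
P(w1, w2, …, wn ∣ c) = P(w1 ∣ c) × P(w2 ∣ c) × … × P(wn ∣ c)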
Source Code:-
import pandas as pd

# Load the dataset: each row holds a message and its label ('pos' or 'neg')
msg = pd.read_csv('/content/sample_data/document.csv', names=['message', 'label'])
print("Total Instances of Dataset: ", msg.shape[0])

# Map the text labels to numbers: 'pos' -> 1, 'neg' -> 0
msg['labelnum'] = msg.label.map({'pos': 1, 'neg': 0})

X = msg.message
y = msg.labelnum

# Split into training and test sets (default 75% / 25% split)
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)

# Convert the text into a Bag-of-Words document-term matrix
from sklearn.feature_extraction.text import CountVectorizer
count_v = CountVectorizer()
Xtrain_dm = count_v.fit_transform(Xtrain)
Xtest_dm = count_v.transform(Xtest)
# Inspect the first few rows of the document-term matrix
df = pd.DataFrame(Xtrain_dm.toarray(), columns=count_v.get_feature_names_out())
print(df[0:5])

# Train a Multinomial Naive Bayes classifier and predict on the test set
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(Xtrain_dm, ytrain)
pred = clf.predict(Xtest_dm)
# Show each test document alongside its predicted label
# (zip over Xtest, not Xtrain: pred was computed on the test set)
for doc, p in zip(Xtest, pred):
    p = 'pos' if p == 1 else 'neg'
    print("%s -> %s" % (doc, p))
# Evaluate the classifier: accuracy, recall, precision, and confusion matrix
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
print('Accuracy Metrics: \n')
print('Accuracy: ', accuracy_score(ytest, pred))
print('Recall: ', recall_score(ytest, pred))
print('Precision: ', precision_score(ytest, pred))
print('Confusion Matrix: \n', confusion_matrix(ytest, pred))
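For reference, the reported metrics follow from the confusion-matrix counts, with 'pos' (label 1)
treated as the positive class (scikit-learn's default pos_label=1 for precision_score and
recall_score):
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)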