Ex. No: 4
DATE:
NAIVE BAYESIAN CLASSIFIER FOR A SAMPLE TRAINING DATA SET STORED AS A CSV FILE
AIM:
The aim of this experiment is to study the Naive Bayes classifier and its
application to classification tasks, and to implement the Naive Bayes algorithm
using a sample training dataset stored in a CSV file.
HARDWARE SPECIFICATION:
Processor : AMD Ryzen 5 3450U with Radeon Vega Mobile Gfx 2.10GHz
Installed RAM : 8.00 GB (5.89 GB usable)
Device ID : A20FB902-090D-43B7-B646-A365E8586922
Product ID : 00356-24550-53284-AAOEM
System Type : 64-bit operating system, x64-based processor
SOFTWARE SPECIFICATION:
Python IDLE (3.12.1, 64-bit)
LIBRARIES:
NumPy
Pandas
Sklearn
ALGORITHM:
1. Data Acquisition and Preprocessing
This initial phase focuses on gathering and preparing the data for model
development.
1.1 Library Imports: The libraries needed for data manipulation and
machine learning are imported at the outset: pandas for data handling,
NumPy for numerical arrays, and scikit-learn for machine learning algorithms.
NIDHIN S
21EE076
1.2 Data Loading: The social network ad dataset, typically stored in a
comma-separated values (CSV) file format, is loaded into a pandas
DataFrame. This structure facilitates efficient data exploration and
manipulation.
1.3 Feature Separation: The DataFrame is examined to identify the
independent variables (features), which are stored in a variable named
'X'. These features are user attributes (e.g., age, gender, income,
interests) that may influence click behavior.
Target Variable Isolation: The dependent variable, representing the
target outcome (i.e., whether a user clicked on an ad), is isolated and
stored in a separate variable named 'y'.
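The loading and separation steps can be sketched as follows. This is a minimal illustration, not the program used in this experiment: the inline CSV text and its column names (User ID, Gender, Age, EstimatedSalary, Purchased) are assumptions standing in for the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file; the column names are assumed.
csv_text = """User ID,Gender,Age,EstimatedSalary,Purchased
1,Male,19,19000,0
2,Female,35,20000,0
3,Female,26,43000,0
4,Male,47,25000,1
"""

# pd.read_csv accepts a file path or any file-like object.
dataset = pd.read_csv(io.StringIO(csv_text))

# Features: every column except the identifier and the target.
X = dataset.drop(columns=["User ID", "Purchased"])
# Target: the outcome column only.
y = dataset["Purchased"]

print(X.shape, y.shape)
```

Selecting columns by name rather than by position makes it harder to include the target column among the features by accident.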
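Categorical Data Encoding (if applicable): If any of the features
contain categorical data (e.g., gender with categories like "Male" and
"Female"), these categories are encoded as numerical values. The program
uses LabelEncoder from scikit-learn, which maps each category to an
integer; this is adequate for a two-category column such as gender.
LabelEncoder is intended primarily for target labels, so OrdinalEncoder
(for ordered categories) or OneHotEncoder (for unordered categories with
more than two values) is usually the better choice for features.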
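As a small illustration of the encoding step, the sketch below encodes a hypothetical gender column in two ways. The sample values are made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical column.
gender = pd.Series(["Male", "Female", "Female", "Male"], name="Gender")

# LabelEncoder sorts the categories and maps them to 0..n-1.
le = LabelEncoder()
codes = le.fit_transform(gender)
print(le.classes_)  # ['Female' 'Male']
print(codes)        # [1 0 0 1]

# OneHotEncoder creates one indicator column per category, which avoids
# implying an order between categories.
onehot = OneHotEncoder().fit_transform(gender.to_frame()).toarray()
print(onehot)
```

With only two categories both encodings carry the same information; the difference matters once a feature has three or more unordered categories.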
Train-Test Split: The dataset is split into training and testing sets
using the train_test_split function from the scikit-learn library. The
training data is used to fit the model, while the testing data is held
back and used only to assess the model's performance on unseen data.
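The split can be sketched on synthetic data as below. The arrays are placeholders, and the stratify argument is an optional addition that is not part of the program in this experiment; it keeps the class proportions equal in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 20 samples, 2 features, balanced binary target.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Hold out 20% of the rows; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)  # (16, 2) (4, 2)
```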
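Feature Scaling: Features are standardized with StandardScaler from
scikit-learn. The scaler is fitted on the training set only, and the same
transformation is then applied to the testing set, so that no information
from the test data enters the preprocessing. Standardization puts all
features on a comparable scale. Gaussian Naive Bayes estimates a separate
mean and variance for each feature, so it is much less sensitive to
feature scale than distance-based models; scaling is kept here as a
routine preprocessing step.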
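A minimal sketch of the scaling step, using made-up age and salary values, shows the fit-on-train, transform-on-test pattern:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (age, salary) rows.
X_train = np.array([[20.0, 20000.0], [30.0, 40000.0], [40.0, 60000.0]])
X_test = np.array([[30.0, 80000.0]])

sc = StandardScaler()
# Learn the mean and standard deviation from the training data only.
X_train_s = sc.fit_transform(X_train)
# Reuse the training statistics for the test data.
X_test_s = sc.transform(X_test)

print(X_train_s.mean(axis=0))  # approximately 0 for each feature
print(X_train_s.std(axis=0))   # approximately 1 for each feature
print(X_test_s)
```

The test row is expressed in units of the training standard deviation, so its scaled values need not fall within the range seen in the training set.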
2. Model Development and Evaluation
This phase focuses on building and evaluating the Gaussian Naive Bayes model
to predict whether a user clicks on an ad.
Model Instantiation: A Gaussian Naive Bayes classifier is instantiated
from the scikit-learn library. This probabilistic model applies Bayes'
theorem under two assumptions: the features are conditionally independent
given the class, and each feature follows a normal (Gaussian) distribution
within each class.
Model Training: The prepared training data (the scaled training features
and the corresponding target values) is passed to the Gaussian Naive
Bayes classifier. During this phase, the model estimates the prior
probability of each class and, for every feature, the mean and variance
within each class.
Prediction: Once trained, the model is used to make predictions on
the unseen test data. Each prediction is a class label (click or no
click) for a user in the test set, based on the patterns learned from
the training data.
Evaluation Metrics: The model's performance is evaluated using two
metrics:
- Confusion Matrix: This table counts the correct and incorrect
predictions made by the model on the test data. It shows the number of
true positives (correctly predicted clicks), true negatives (correctly
predicted non-clicks), false positives (incorrectly predicted clicks),
and false negatives (incorrectly predicted non-clicks).
- Accuracy Score: This metric is the proportion of correct predictions
made on the test data. A higher accuracy score suggests that the model
generalizes better to unseen data, although accuracy alone can be
misleading when the classes are imbalanced.
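Both metrics can be illustrated on a small hand-made example. The label vectors below are hypothetical and chosen so that each cell of the confusion matrix is non-zero.

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical true and predicted labels (1 = click, 0 = no click).
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)              # [[4 1]
                       #  [1 2]]
print(tn, fp, fn, tp)  # 4 1 1 2

# Accuracy is (TP + TN) divided by the total number of predictions.
acc = accuracy_score(y_true, y_pred)
print(acc)             # 0.75
```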
3. Result Display (Optional)
The confusion matrix and the accuracy score are printed, giving a clear
picture of the model's performance in predicting user clicks on social
network advertisements. This information can help in assessing the
model's suitability for real-world deployment.
4. Additional Considerations (Optional)
This section highlights some optional steps that can be incorporated to further
enhance the model's performance and understanding.
Hyperparameter Tuning:
o Techniques like GridSearchCV from scikit-learn can be employed
to tune the hyperparameters of the Gaussian Naive Bayes model.
Hyperparameters are settings that are fixed before training and
influence the model's behavior. For GaussianNB the main one is
var_smoothing, a small value added to the feature variances for
numerical stability; tuning it can potentially improve performance.
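A possible tuning setup is sketched below. It is not part of the program in this experiment: the data comes from make_classification as a placeholder, and the grid of var_smoothing values is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Placeholder data standing in for the prepared training set.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Candidate values for var_smoothing, spaced on a log scale.
param_grid = {"var_smoothing": np.logspace(-12, -3, 10)}

# 5-fold cross-validated search over the grid, scored by accuracy.
search = GridSearchCV(GaussianNB(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Because the search uses cross-validation on the training data, the test set remains untouched until the final evaluation.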
PROGRAM:
#import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
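# Load the dataset
dataset = pd.read_csv("Social_Network_Ads.csv")
# Features: columns 1 and 4; target: the last column.
# NOTE: if the CSV has only five columns, index 4 is also the last column,
# so the target would be included in x (label leakage); check the layout.
x = dataset.iloc[:, [1, 4]].values
y = dataset.iloc[:, -1].values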
print("X values:")
print(x)
# Encoding categorical data (the Gender column)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
x[:, 0] = le.fit_transform(x[:, 0])
# Train-test splitting
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20,
random_state=0)
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
# Training the naive bayes model on the training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_test)
# Printing the actual and predicted values
print("\nActual y_test values:")
print(y_test)
print("\nPredicted y_pred values:")
print(y_pred)
# Calculating and printing confusion matrix and accuracy score
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
ac = accuracy_score(y_test, y_pred)
print("\nConfusion matrix:")
print(cm)
print("\nAccuracy score:")
print(ac)
OUTPUT:
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0
 0 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0 1 1
 0 0 1 0 0 1 0 1 0 1 0 0 0 0 1 0 0 1
 0 0 0 0 1 1]
Predicted y_pred values:
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
 0 0 1 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0
 0 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0 1 1
 0 0 1 0 0 1 0 1 0 1 0 0 0 0 1 0 0 1
 0 0 0 0 1 1]
Confusion matrix:
[[58  0]
 [ 0 22]]
Accuracy score:
1.0
INFERENCE:
A Gaussian Naive Bayes model is used in this program to classify whether a
user clicks on a social network advertisement. The program does this through
data preprocessing, model training, and performance evaluation. Such a model
can help advertisers identify user traits that correlate with ad clicks.
RUBRICS:
RESULT:
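Thus, the program built and trained a Gaussian Naive Bayes model to classify
user clicks on social network ads. It achieved an accuracy score of 1.0 on the
test set for this specific data split. Because x is taken from columns 1 and 4
and y from the last column, the target is included among the features if the
CSV has only five columns; in that case the perfect score reflects label
leakage rather than the model's predictive ability.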
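A perfect score is unusual enough to be worth a sanity check. One possible check, sketched here on a small hypothetical DataFrame rather than the actual dataset, is to test whether any positionally selected feature column is the target itself or an exact copy of it.

```python
import pandas as pd

# Hypothetical three-column table; the last column is the target.
df = pd.DataFrame({
    "Gender": [1, 0, 0, 1, 0],
    "Age": [19, 35, 26, 47, 52],
    "Purchased": [0, 0, 0, 1, 1],
})

target = df.columns[-1]
# Positional selection, analogous to iloc[:, [0, 2]].
selected = df.columns[[0, 2]]

# Flag any selected column that is the target or duplicates its values.
leaks = [c for c in selected if c == target or df[c].equals(df[target])]
print(leaks)  # ['Purchased']
```

An empty list would indicate that none of the selected features duplicates the target exactly; it does not rule out subtler forms of leakage.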