CHAPTER 3: CLASSIFICATION
1. DECISION TREE CLASSIFIER:
A decision tree classifier predicts a target by learning simple if-then split rules from the feature values. The program below label-encodes the categorical columns of a salary dataset and fits a tree to predict whether the salary exceeds 100k.
PROGRAM:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn import tree
# Read the CSV file
df = pd.read_csv("salaries.csv")
print(df)
# Prepare inputs and target
inputs = df.drop('salary_more_then_100k', axis='columns')
target = df['salary_more_then_100k']
# Label encode categorical features
le_company = LabelEncoder()
le_job = LabelEncoder()
le_degree = LabelEncoder()
inputs['company_n'] = le_company.fit_transform(inputs['company'])
inputs['job_n'] = le_job.fit_transform(inputs['job'])
inputs['degree_n'] = le_degree.fit_transform(inputs['degree'])
print(inputs)
inputs_n = inputs.drop(['company', 'job', 'degree'], axis='columns')
# Create and fit the decision tree model
model = tree.DecisionTreeClassifier()
model.fit(inputs_n, target)
# Print model score and make predictions
print("Model Score:", model.score(inputs_n, target))
# Is the salary for (Google, Computer Engineer, Bachelors degree) more than 100k?
print("Prediction for [2, 1, 0]:", model.predict([[2, 1, 0]]))
# Is the salary for (Google, Computer Engineer, Masters degree) more than 100k?
print("Prediction for [2, 1, 1]:", model.predict([[2, 1, 1]]))
Output:
Model Score: 1.0
Prediction for [2, 1, 0]: [0]
Prediction for [2, 1, 1]: [1]
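As a follow-up, the fitted tree's split rules can be printed as text, which shows why [2, 1, 0] and [2, 1, 1] receive different predictions. A minimal sketch using scikit-learn's export_text:
from sklearn.tree import export_text
# Print the learned if-then rules, using the encoded column names from above
print(export_text(model, feature_names=['company_n', 'job_n', 'degree_n']))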
2. NAIVE BAYES CLASSIFICATION:
Naïve Bayes is a supervised learning algorithm, based on Bayes' theorem, that is used for solving classification problems. It is mainly used in text classification, where the training data is typically high-dimensional.
The Bayesian method of calculating conditional probabilities is used in machine learning applications that involve classification tasks. Naive Bayes classification simplifies the full Bayesian calculation by assuming that the features are conditionally independent given the class, which reduces computation time and cost.
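In symbols, this is the standard Naive Bayes formulation: for a class y and features x1, ..., xn, Bayes' theorem gives
P(y | x1, ..., xn) = P(y) · P(x1, ..., xn | y) / P(x1, ..., xn)
and the naive independence assumption factorizes the likelihood, P(x1, ..., xn | y) ≈ P(x1 | y) · ... · P(xn | y), so the classifier predicts the class y that maximizes P(y) · P(x1 | y) · ... · P(xn | y).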
Applications of the Naïve Bayes classifier:
It is used for credit scoring.
It is used in medical data classification.
It can be used for real-time predictions, because the Naïve Bayes classifier is an eager learner.
It is used in text classification, such as spam filtering and sentiment analysis.
PROGRAM:
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
age = ['youth', 'youth', 'middle-aged', 'senior', 'senior',
'senior', 'middle-aged', 'youth', 'youth', 'senior', 'youth',
'middle-aged', 'middle-aged', 'senior']
income = ['high', 'high', 'high', 'medium', 'low', 'low',
'low', 'medium', 'low', 'medium', 'medium', 'medium',
'high', 'medium']
student = ['no', 'no', 'no', 'no', 'yes', 'yes', 'yes',
'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no']
credit_rating = ['fair', 'excellent', 'fair', 'fair', 'fair',
'excellent', 'excellent', 'fair', 'fair', 'fair',
'excellent', 'excellent', 'fair', 'excellent']
buys_computer = ['no', 'no', 'yes', 'yes', 'yes', 'no',
'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
# Create a LabelEncoder object (each fit_transform call below refits it on a new column)
le = preprocessing.LabelEncoder()
# Converting string labels into numbers
age_encoded = le.fit_transform(age)
print(age_encoded)
income_encoded = le.fit_transform(income)
print(income_encoded)
student_encoded = le.fit_transform(student)
print(student_encoded)
credit_encoded = le.fit_transform(credit_rating)
print(credit_encoded)
# Encode the target labels
label = le.fit_transform(buys_computer)
print(label)
# Combining age, income, student, and credit rating into a single list of tuples
features = list(zip(age_encoded, income_encoded, student_encoded, credit_encoded))
# Create a Gaussian Naive Bayes model
model = GaussianNB()
# Train the model using the training sets
model.fit(features, label)
# Predict output
predicted = model.predict([[2, 2, 1, 1]]) # 2: youth, 2: medium, 1: yes, 1: fair
print("Predicted Value:", predicted)
Output:
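The numeric prediction can be decoded back into a 'yes'/'no' label. A minimal sketch, relying on the fact that le was last fitted on buys_computer (using one encoder per column, or sklearn's CategoricalNB, would be the cleaner choice for label-encoded categorical features):
# Map the numeric prediction back to the original class label
print("Decoded prediction:", le.inverse_transform(predicted))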
3. MULTINOMIAL NAIVE BAYES CLASSIFICATION:
Multinomial Naive Bayes is one of the most popular supervised learning classifiers for categorical and count data, and is widely used to analyze text. Text classification is gaining popularity because there is an enormous amount of information available in email, documents, websites, etc. that needs to be analyzed.
Examples of categorical variables are race, sex, age group, and educational level. While the latter two
variables may also be considered in a numerical manner by using exact values for age and highest
grade completed, it is often more informative to categorize such variables into a relatively small
number of groups.
This dataset is the result of a chemical analysis of wines grown in the same region in Italy but derived from three different plant varieties. The dataset comprises 13 features (alcohol, malic_acid, ash, alcalinity_of_ash, magnesium, total_phenols, flavanoids, nonflavanoid_phenols, proanthocyanins, color_intensity, hue, od280/od315_of_diluted_wines, proline) and the type of wine plant variety as the target. The data has three types of wine: class_0, class_1, and class_2.
PROGRAM:
# Import scikit-learn dataset library
from sklearn import datasets
# Load dataset
wine = datasets.load_wine()
# Print the names of the 13 features
print("Features:", wine.feature_names)
# Print the label type of wine (class 0, class 1, class 2)
print("Labels:", wine.target_names)
# Print data (feature) shape
print(wine.data.shape)
# Print the wine data features (top 5)
print(wine.data[:5])
print(wine.target)
# Import train test split function from sklearn.model_selection
from sklearn.model_selection import train_test_split
# Split dataset into training set and test set (70% training and 30% test)
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.3, random_state=109)
# Import the Multinomial Naive Bayes model (the classifier this section describes)
from sklearn.naive_bayes import MultinomialNB
# Create a Multinomial Naive Bayes classifier
mnb = MultinomialNB()
# Train the model using the training sets
mnb.fit(X_train, y_train)
# Predict the response for the test dataset
y_pred = mnb.predict(X_test)
print("Predicted Labels:", y_pred)
# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model accuracy: how often the classifier is correct
print("Accuracy:", metrics.accuracy_score(y_test, y_pred) * 100)
Output:
Note: Wine itself can also be classified by other criteria: (1) according to color, (2) according to carbon dioxide pressure, and (3) according to sugar content.
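Because the 13 wine measurements are continuous, Gaussian Naive Bayes is also a natural candidate for this data. A minimal side-by-side sketch, reusing X_train, X_test, y_train, y_test, and metrics from the program above:
from sklearn.naive_bayes import GaussianNB, MultinomialNB
for nb in (GaussianNB(), MultinomialNB()):
    nb.fit(X_train, y_train)
    # Report each model's test accuracy on the same 70/30 split
    print(type(nb).__name__, metrics.accuracy_score(y_test, nb.predict(X_test)) * 100)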
4. LINEAR KERNEL:
Kernels, also known as kernel methods or kernel functions, are a family of pattern-analysis techniques that allow a linear classifier to solve a non-linear problem. SVMs (Support Vector Machines) use kernel methods to solve classification and regression problems.
The linear kernel is used when the data is linearly separable, that is, when it can be separated by a single line (or hyperplane). It is one of the most common kernels, and it is mostly used when a dataset has a large number of features.
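For reference, scikit-learn's SVC computes the linear kernel as the plain inner product of two samples:
K(x_i, x_j) = x_i · x_j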
PROGRAM:
# Load libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.svm import SVC
# Assign column names to the dataset
column_names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
# Load dataset
df = pd.read_csv("iris.csv", names=column_names)
# Split dataset into features and target
X = df.drop('Class', axis=1) # Features
y = df['Class'] # Target variable
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.3, random_state=1)
# Create a support vector classifier with a linear kernel (SVC was imported above)
clf = SVC(kernel='linear')
# Fit the classifier to the training data
clf.fit(X_train, y_train)
# Predict the classes on test set
y_pred = clf.predict(X_test)
print(y_pred)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy * 100)
# Print classification report and confusion matrix
print(classification_report(y_test, y_pred))
# Generate and display confusion matrix heatmap
cm = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
ax = sn.heatmap(cm, annot=True)
plt.show()
Output:
5. POLYNOMIAL KERNEL:
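A polynomial kernel lets the SVM fit polynomial (curved) decision boundaries. In scikit-learn's SVC it is computed as
K(x_i, x_j) = (gamma · x_i · x_j + coef0) ^ degree
where degree is the polynomial degree set in the program below, and gamma and coef0 keep their defaults ('scale' and 0.0) unless specified.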
PROGRAM:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sn
# Assign column names to the dataset
colnames = ['sepal-length', 'sepal-width', 'petal-length',
'petal-width', 'Class']
df = pd.read_csv("iris.csv", names=colnames)
# Split dataset into features and target variable
X = df.drop('Class', axis=1) # Features
y = df['Class'] # Target variable
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.3, random_state=1)
# Import support vector classifier from sklearn
from sklearn.svm import SVC
clf = SVC(kernel='poly', degree=8) # Polynomial kernel with degree 8
# Fit the classifier to the training data
clf.fit(X_train, y_train)
# Make predictions on the test data
y_pred = clf.predict(X_test)
# Print the predicted labels
print(y_pred)
# Calculate and print the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy*100)
# Print the classification report
print(classification_report(y_test, y_pred))
# Generate and display the confusion matrix as a heatmap
# Use a name that does not shadow the confusion_matrix function imported above
cm = pd.crosstab(y_test, y_pred,
                 rownames=['Actual'], colnames=['Predicted'])
ax = sn.heatmap(cm, annot=True)
plt.show()
Output:
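Rather than fixing the degree at 8, the degree hyperparameter can be tuned. A minimal sketch with scikit-learn's GridSearchCV, reusing X_train and y_train from the program above:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# Try several polynomial degrees with 5-fold cross-validation
grid = GridSearchCV(SVC(kernel='poly'), {'degree': [2, 3, 4, 5, 8]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)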
6. RADIAL BASIS FUNCTION KERNEL:
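The radial basis function (RBF) kernel measures similarity by the distance between samples. In scikit-learn's SVC it is computed as
K(x_i, x_j) = exp(-gamma · ||x_i - x_j||^2)
where gamma controls how far the influence of a single training example reaches; the program below fixes gamma=0.1.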
PROGRAM:
# Load libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn.svm import SVC
# Assign column names to the dataset
col_names = ['sepal-length', 'sepal-width', 'petal-length',
'petal-width', 'Class']
# Load the dataset
dataset = pd.read_csv("iris.csv", names=col_names)
# Separate features and target variable
X = dataset.drop('Class', axis=1)
y = dataset['Class']
# Split dataset into training set and test set (70% training and 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.3, random_state=1)
# Create a support vector classifier with the RBF kernel (SVC was imported above)
clf = SVC(kernel='rbf', gamma=0.1)
# Fit the classifier to the training data
clf.fit(X_train, y_train)
# Predict the labels for the test set
y_pred = clf.predict(X_test)
# Print the predicted labels and accuracy
print("Predicted labels:", y_pred)
print("Accuracy:", accuracy_score(y_test, y_pred))
# Print the classification report
print(classification_report(y_test, y_pred))
# Generate and display confusion matrix as a heatmap
# Use a name that does not shadow the confusion_matrix function imported above
cm = pd.crosstab(y_test, y_pred,
                 rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(cm, annot=True)
plt.show()
Output:
7. K-NEAREST NEIGHBOURS:
The k-nearest neighbors algorithm, also known as KNN or k-NN, is a non-parametric, supervised
learning classifier, which uses proximity to make classifications or predictions about the grouping of
an individual data point.
The make_blobs function from sklearn.datasets is used to generate a synthetic dataset with 500 samples, 2 features, 4 centers, and a cluster standard deviation of 1.5. The X variable contains the feature vectors, and the y variable contains the corresponding labels.
PROGRAM:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
X, y = make_blobs(n_samples=500, n_features=2, centers=4, cluster_std=1.5, random_state=4)
plt.style.use('seaborn-v0_8')  # the bare 'seaborn' style name was removed in newer Matplotlib
plt.figure(figsize=(10, 10))
plt.scatter(X[:, 0], X[:, 1], c=y, marker='*', s=100, edgecolors='black')
plt.show()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn5 = KNeighborsClassifier(n_neighbors=5)
knn1 = KNeighborsClassifier(n_neighbors=1)
knn5.fit(X_train, y_train)
knn1.fit(X_train, y_train)
y_pred_5 = knn5.predict(X_test)
y_pred_1 = knn1.predict(X_test)
from sklearn.metrics import accuracy_score
print("Accuracy with k=5", accuracy_score(y_test, y_pred_5)*100)
print("Accuracy with k=1", accuracy_score(y_test, y_pred_1)*100)
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred_5, marker='*', s=100, edgecolors='black')
plt.title("Predicted values with k=5", fontsize=20)
plt.subplot(1, 2, 2)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred_1, marker='*', s=100, edgecolors='black')
plt.title("Predicted values with k=1", fontsize=20)
plt.show()
Output:
Accuracy with k=5 93.60000000000001
Accuracy with k=1 90.4
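Instead of comparing only k=1 and k=5 on a single split, k can be chosen by cross-validation. A minimal sketch, reusing X_train and y_train from the program above:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
# Evaluate odd values of k with 5-fold cross-validation on the training set
for k in range(1, 16, 2):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5)
    print("k =", k, "mean CV accuracy =", scores.mean() * 100)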
8. RANDOM FOREST:
Random forest is an ensemble method: it builds decision trees on different bootstrap samples of the data and takes their majority vote for classification (or their average for regression).
PROGRAM:
# Load libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sn
# Assign column names to the dataset
colnames = ['sepal-length', 'sepal-width', 'petal-length',
            'petal-width', 'Class']
# Load dataset (iris.csv has no header row, so pass the names explicitly,
# as in the earlier programs)
df = pd.read_csv("iris.csv", names=colnames)
# Split dataset into features and target
X = df.drop('Class', axis=1) # Features
y = df['Class'] # Target variable
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.3, random_state=1)
# Import random forest classifier and fit the model
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
# Make predictions on the test set
y_pred = clf.predict(X_test)
# Print predictions and accuracy score
print("Predictions:", y_pred)
print("Accuracy:", accuracy_score(y_test, y_pred) * 100)
# Print classification report and confusion matrix
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(cm)
# Generate heatmap and display it
plt.figure(figsize=(8, 6))
sn.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
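As a follow-up, a fitted RandomForestClassifier exposes feature_importances_, which shows how much each measurement contributes to the predictions. A minimal sketch, reusing clf and X from the program above:
# Pair each importance score with its column name and sort
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))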