
Cambridge Institute of Technology

K.R. Puram, Bangalore - 560036


Department of Information Science & Engineering

BCSL606 – Machine Learning Lab


(VTU - 22 Scheme)
6th Semester ISE

Lab Manual

Prepared by
Dr Loganathan D, Professor/ISE,
Dr Sapna, Associate Professor/ISE
Prof. Kavya, Assistant Professor/ISE,

Academic Year: 2024-25


BCSL606 – Machine Learning Lab - 6th Semester ISE

Syllabus

Machine Learning lab Semester 6


Course Code BCSL606 CIE Marks 50
Teaching Hours/Week (L:T:P: S) 0:0:2:0 SEE Marks 50
Credits 01 Exam Hours 100
Examination type (SEE) Practical
Course objectives:
1. To become familiar with data and visualize univariate, bivariate, and multivariate data using statistical
techniques and dimensionality reduction.
2. To understand various machine learning algorithms such as similarity-based learning, regression, decision trees,
and clustering.
3. To familiarize with learning theories, probability-based models and developing the skills required for
decision-making in dynamic environments.

Sl.NO Experiments
1 Develop a program to create histograms for all numerical features and analyze the distribution of each feature.
Generate box plots for all numerical features and identify any outliers. Use California Housing dataset.
Book 1: Chapter 2
2 Develop a program to Compute the correlation matrix to understand the relationships between pairs of features.
Visualize the correlation matrix using a heatmap to know which variables have strong positive/negative
correlations. Create a pair plot to visualize pairwise relationships between features. Use California Housing
dataset.
Book 1: Chapter 2
3 Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of the Iris
dataset from 4 features to 2.
Book 1: Chapter 2
4 For a given set of training data examples stored in a .CSV file, implement and demonstrate the Find-S algorithm to
output a description of the set of all hypotheses consistent with the training examples.
Book 1: Chapter 3
5 Develop a program to implement k-Nearest Neighbour algorithm to classify the randomly generated 100 values of
x in the range of [0,1]. Perform the following based on dataset generated.

1. Label the first 50 points {x1,……,x50} as follows: if (xi ≤ 0.5), then xi ∊ Class1, else xi ∊ Class2
2. Classify the remaining points, x51,……,x100 using KNN. Perform this for k=1,2,3,4,5,20,30
Book 2: Chapter – 2
6 Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select
appropriate data set for your experiment and draw graphs
Book 1: Chapter – 4
7 Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use Boston
Housing Dataset for Linear Regression and Auto MPG Dataset (for vehicle fuel efficiency prediction) for
Polynomial Regression.
Book 1: Chapter – 5
8 Develop a program to demonstrate the working of the decision tree algorithm. Use Breast Cancer Data set for
building the decision tree and apply this knowledge to classify a new sample.
Book 2: Chapter – 3


9 Develop a program to implement the Naive Bayesian classifier considering Olivetti Face Data set for training.
Compute the accuracy of the classifier, considering a few test data sets.
Book 2: Chapter – 4

10 Develop a program to implement k-means clustering using Wisconsin Breast Cancer data set and visualize the
clustering result.
Book 2: Chapter – 4

Course outcomes (Course Skill Set):

At the end of the course the student will be able to:


1. Illustrate the principles of multivariate data and apply dimensionality reduction techniques.
2. Demonstrate similarity-based learning methods and perform regression analysis.
3. Develop decision trees for classification and regression problems, and Bayesian models for probabilistic
learning.
4. Implement the clustering algorithms to share computing resources.

COURSE OUTCOMES (COS):


At the end of the course, the student will be able to:
C366.1: Illustrate the principles of multivariate data and apply dimensionality reduction techniques.
C366.2: Demonstrate similarity-based learning methods and perform regression analysis.
C366.3: Develop decision trees for classification and regression problems, and Bayesian models for probabilistic
learning.
C366.4: Implement the clustering algorithms to share computing resources.

CO-PO Mapping
CO/PO PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3
CO1 3 3 3 2 3 3 - 1 - - - 1 3 3 -
CO2 2 3 3 2 3 3 - 1 - - - 1 3 3 -
CO3 2 3 3 2 3 3 - 1 - - - 1 3 3 -
CO4 2 3 3 2 3 3 - 1 - - - 1 3 3 -
2 3 3 2 3 3 - 1 - - - 1 3 3 -


List of Experiments

Sl.No Experiments

1. Develop a program to create histograms for all numerical features and analyze the distribution of each feature. Generate box plots for all numerical features and identify any outliers. Use California Housing dataset.
   Book 1: Chapter 2
2. Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to know which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use California Housing dataset.
   Book 1: Chapter 2
3. Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of the Iris dataset from 4 features to 2.
   Book 1: Chapter 2
4. For a given set of training data examples stored in a .CSV file, implement and demonstrate the Find-S algorithm to output a description of the set of all hypotheses consistent with the training examples.
   Book 1: Chapter 3
5. Develop a program to implement k-Nearest Neighbor algorithm to classify the randomly generated 100 values of x in the range of [0,1]. Perform the following based on the dataset generated.
   1. Label the first 50 points {x1,…,x50} as follows: if (xi ≤ 0.5), then xi ∊ Class1, else xi ∊ Class2
   2. Classify the remaining points, x51,…,x100 using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30
   Book 2: Chapter 2
6. Implement the non-parametric Locally Weighted Regression algorithm in order to fit data points. Select an appropriate data set for your experiment and draw graphs.
   Book 1: Chapter 4
7. Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use Boston Housing Dataset for Linear Regression and Auto MPG Dataset (for vehicle fuel efficiency prediction) for Polynomial Regression.
   Book 1: Chapter 5
8. Develop a program to demonstrate the working of the decision tree algorithm. Use Breast Cancer Data set for building the decision tree and apply this knowledge to classify a new sample.
   Book 2: Chapter 3
9. Develop a program to implement the Naive Bayesian classifier considering Olivetti Face Data set for training. Compute the accuracy of the classifier, considering a few test data sets.
   Book 2: Chapter 4
10. Develop a program to implement k-means clustering using Wisconsin Breast Cancer data set and visualize the clustering result.
    Book 2: Chapter 4


Ex. No: 1                                                                  Date:
Develop a program to create histograms for all numerical features and analyze the distribution of each feature. Generate box plots for all numerical features and identify any outliers. Use California Housing dataset.

Aim: To create histograms and box plots for all numerical features in the California Housing dataset, analyze the distribution of each feature, and identify any outliers. This experiment provides a step-by-step algorithm, the software required, sample code, and the procedure to execute the program using Python and Jupyter Notebook.

Software Required:
1. Python 3.x
2. Anaconda
3. Jupyter Notebook
4. Dataset file – housing.csv

Algorithm:
Step 1: Import Required Libraries
Step 2: Load the Dataset - Read the California Housing dataset from the local CSV file
using pandas.
Step 3: Data Exploration
Display dataset information & Identify numerical features
Step 4: Create Histograms - Iterate through each numerical feature and generate histograms
Step 5: Create Box Plots - Iterate through each numerical feature and generate box plots
Step 6: Identify Outliers (IQR Method) - Calculate outliers using the Interquartile Range
(IQR) method
Step 7: Analysis and Insights from the output
Step 8: Exit
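
To make Step 6 concrete, here is a small worked example (an illustrative sketch only, using the median_income quartiles shown in the dataset summary later in this experiment):

# Worked example of the IQR rule for median_income (quartiles taken from the summary output below)
Q1, Q3 = 2.5634, 4.7433
IQR = Q3 - Q1                    # about 2.18
lower_bound = Q1 - 1.5 * IQR     # about -0.71
upper_bound = Q3 + 1.5 * IQR     # about 8.01
# Any median_income value below lower_bound or above upper_bound is flagged as an outlier.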

Sample Coding:
# import required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#from sklearn.datasets import fetch_california_housing
# Step 1: Load the California Housing dataset
#data = fetch_california_housing(as_frame=True)
data=pd.read_csv("C:\\Users\\ISE\\Desktop\\ML LAB\\housing.csv",delimiter=',')
data


# Step 2: Create histograms for numerical features


numerical_features = data.select_dtypes(include=[np.number]).columns

# Plot histograms
plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    sns.histplot(data[feature], kde=True, bins=30, color='blue')
    plt.title(f'Distribution of {feature}')
plt.tight_layout()
plt.show()

# Step 3: Generate box plots for numerical features


plt.figure(figsize=(15, 10))
for i, feature in enumerate(numerical_features):
    plt.subplot(3, 3, i + 1)
    sns.boxplot(x=data[feature], color='orange')
    plt.title(f'Box Plot of {feature}')
plt.tight_layout()
plt.show()

# Step 4: Identify outliers using the IQR method


print("Outliers Detection:")
outliers_summary = {}
for feature in numerical_features:
    Q1 = data[feature].quantile(0.25)
    Q3 = data[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[feature] < lower_bound) | (data[feature] > upper_bound)]
    outliers_summary[feature] = len(outliers)
    print(f"{feature}: {len(outliers)} outliers")

# Optional: Print a summary of the dataset


print("\nDataset Summary:")
print(data.describe())


Sample Output:
1. Histograms:
Each histogram will show the distribution of a numerical feature.
2. Box Plots:
Each box plot will show the spread of the data and any outliers.


Outliers Detection:
longitude: 0 outliers
latitude: 0 outliers
housing_median_age: 0 outliers
total_rooms: 1287 outliers
total_bedrooms: 1271 outliers
population: 1196 outliers
households: 1220 outliers
median_income: 681 outliers
median_house_value: 1071 outliers

Dataset Summary:
longitude latitude housing_median_age total_rooms \
count 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081
std 2.003532 2.135952 12.585558 2181.615252
min -124.350000 32.540000 1.000000 2.000000


25% -121.800000 33.930000 18.000000 1447.750000


50% -118.490000 34.260000 29.000000 2127.000000
75% -118.010000 37.710000 37.000000 3148.000000
max -114.310000 41.950000 52.000000 39320.000000

total_bedrooms population households median_income \


count 20433.000000 20640.000000 20640.000000 20640.000000
mean 537.870553 1425.476744 499.539680 3.870671
std 421.385070 1132.462122 382.329753 1.899822
min 1.000000 3.000000 1.000000 0.499900
25% 296.000000 787.000000 280.000000 2.563400
50% 435.000000 1166.000000 409.000000 3.534800
75% 647.000000 1725.000000 605.000000 4.743250
max 6445.000000 35682.000000 6082.000000 15.000100

median_house_value
count 20640.000000
mean 206855.816909
std 115395.615874
min 14999.000000
25% 119600.000000
50% 179700.000000
75% 264725.000000
max 500001.000000

Result:
The program was executed successfully, and the output has been recorded.


Ex. No: 2                                                                  Date:
Develop a program to compute the correlation matrix to understand the relationships between pairs of features. Visualize the correlation matrix using a heatmap to know which variables have strong positive/negative correlations. Create a pair plot to visualize pairwise relationships between features. Use California Housing dataset.

Aim: Write a program in Jupyter Notebook (using Anaconda) that performs the following tasks
using the California Housing dataset stored as a CSV file:
1. Compute the Correlation Matrix to understand the relationships between pairs of
numerical features.
2. Visualize the Correlation Matrix using a heatmap to identify strong positive and
negative correlations.
3. Create a Pair Plot to visualize pairwise relationships between numerical features and
observe patterns, trends, and potential outliers.

Software Required:
1. Python 3.x
2. Anaconda
3. Jupyter Notebook
4. Dataset file – housing.csv

Algorithm:
Step 1: Import Required Libraries
Step 2: Load the Dataset
Read the California Housing dataset from the local CSV file using pandas
Step 3: Data Exploration - Display dataset information to understand its structure &
Identify numerical features for correlation analysis
Step 4:Compute the Correlation Matrix - Compute the correlation matrix using .corr()
function
Step 5:Visualize the Correlation Matrix Using a Heatmap - Use Seaborn's sns.heatmap()
to create the heatmap
Step 6:Create a Pair Plot - Use Seaborn’s sns.pairplot() to visualize pairwise relationships
Step 7:Analysis and Insights from the output

Sample Coding:
# import required libraries
import pandas as pd
import seaborn as sns
#import numpy as np
import matplotlib.pyplot as plt
#from sklearn.datasets import fetch_california_housing


# Step 1: Load the California Housing Dataset


#california_data = fetch_california_housing(as_frame=True)
#data = california_data.frame
data=pd.read_csv("C:\\Users\\ISE\\Desktop\\ML LAB\\housing1.csv",delimiter=',')
data

#numerical_features = data.select_dtypes(include=[np.number]).columns
# Step 2: Compute the correlation matrix
correlation_matrix = data.corr()

# Step 3: Visualize the correlation matrix using a heatmap


plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix of California Housing Features')
plt.show()

# Step 4: Create a pair plot to visualize pairwise relationships


sns.pairplot(data, diag_kind='kde', plot_kws={'alpha': 0.5})
plt.suptitle('Pair Plot of California Housing Features', y=1.02)
plt.show()

Sample Output:


 Correlation Matrix - understand the relationships between pairs of numerical features.
 Correlation Heatmap - identify which variables have strong positive and negative correlations.
 Pair Plot - visualize pairwise relationships between numerical features and observe patterns, trends, and potential outliers.
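
As an optional extension (not part of the prescribed program), the strongest correlations with the target can be listed directly; this assumes the median_house_value column from housing.csv:

# Optional: rank features by their correlation with median_house_value
print(correlation_matrix['median_house_value'].sort_values(ascending=False))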

Result:
The program was executed successfully, and the output has been recorded.


Ex. No: 3                                                                  Date:
Develop a program to implement Principal Component Analysis (PCA) for reducing the dimensionality of the Iris dataset from 4 features to 2.

Aim: Write a program to implement Principal Component Analysis (PCA) for reducing the
dimensionality of the Iris dataset from 4 features to 2.

Software Required:
 Python 3.x
 Anaconda
 Jupyter Notebook

Reference for reducing the dimensionality of the Iris dataset from 4 features to 2.
The Iris dataset consists of four features (dimensions), which represent the physical
characteristics of iris flowers. These features are:
1. Sepal Length (cm)
2. Sepal Width (cm)
3. Petal Length (cm)
4. Petal Width (cm)

These four features are used to classify three different species of iris flowers:
1. Setosa (Iris setosa is known for having smaller petal and sepal dimensions compared to the
other two species)
2. Versicolor (Iris versicolor has features that fall between those of Iris setosa and Iris
virginica in terms of size and other characteristics)
3. Virginica (Iris virginica is generally larger in size, with notable differences in sepal and
petal dimensions compared to the other two species)
In the PCA implementation, we reduced these four-dimensional features into two principal
components to visualize the data in a 2D space (4D reduced to 2D)

Algorithm:
Step 1. Import necessary libraries.
Step 2. Load the Iris dataset.
Step 3. Convert it into a Pandas DataFrame for better visualization.
Step 4. Perform Principal Component Analysis (PCA) to reduce the dimensionality from 4
to 2.
Step 5. Create a new DataFrame with the reduced components.
Step 6. Visualize the reduced data using a scatter plot.


Sample Coding:
# Import required libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load the Iris dataset


iris = load_iris()
data = iris.data
labels = iris.target
label_names = iris.target_names

# Convert to a DataFrame for better visualization


iris_df = pd.DataFrame(data, columns=iris.feature_names)

# Perform PCA to reduce dimensionality to 2


pca = PCA(n_components=2)
data_reduced = pca.fit_transform(data)

# Create a DataFrame for the reduced data


reduced_df = pd.DataFrame(data_reduced, columns=['Principal Component 1', 'Principal Component 2'])
reduced_df['Label'] = labels

# Plot the reduced data


plt.figure(figsize=(8, 6))
colors = ['r', 'g', 'b']
for i, label in enumerate(np.unique(labels)):
    plt.scatter(
        reduced_df[reduced_df['Label'] == label]['Principal Component 1'],
        reduced_df[reduced_df['Label'] == label]['Principal Component 2'],
        label=label_names[label],
        color=colors[i]
    )

plt.title('PCA on Iris Dataset')


plt.xlabel('Principal Component 1')


plt.ylabel('Principal Component 2')


plt.legend()
plt.grid()
plt.show()

Sample Output:

In the PCA plot, the four original features (Petal Length, Petal Width, Sepal Length and Sepal Width) are reduced to two principal components:
1. PC1 is highly influenced by Petal Length and Petal Width
2. PC2 is influenced more by Sepal Length and Sepal Width

 Setosa (red points) cluster around (-3, 0) → this means that Setosa species have distinct feature patterns compared to the others.
 Versicolor (green points) cluster around (0, 0) and Virginica (blue points) towards (2, 0) → these species are more similar but still separated based on their feature variations.
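
To quantify how much of the original variance the two components retain, the sample code above can be extended with scikit-learn's explained_variance_ratio_ attribute (an optional, illustrative addition):

# Optional: proportion of variance captured by each principal component
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance retained:", pca.explained_variance_ratio_.sum())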

Result:
The program was executed successfully, and the output has been recorded.


Ex. No: 4                                                                  Date:
For a given set of training data examples stored in a .CSV file, implement and demonstrate the Find-S algorithm to output a description of the set of all hypotheses consistent with the training examples.

Aim: To write a Python program that, for a given set of training data examples stored in a .CSV file, implements and demonstrates the Find-S algorithm to output a description of the set of all hypotheses consistent with the training examples.

Software Required:
 Python 3.x
 Anaconda
 Jupyter Notebook
 Dataset file – training_data.csv

Algorithm (Find-S with a CSV dataset):
 Input: CSV file containing training data with multiple attributes and a binary class label (Yes/No).
 Output: The most specific hypothesis that fits the positive examples (i.e., class label = 'Yes').
Step 1: Read Dataset
Step 2: Initialize Hypothesis
Step 3: Iterate Over Training Data
Step 4: Return Final Hypothesis

Sample Coding:
import pandas as pd

def find_s_algorithm(file_path):
    data = pd.read_csv(file_path)

    print("Training data:")
    print(data)

    attributes = data.columns[:-1]
    class_label = data.columns[-1]
    hypothesis = ['?' for _ in attributes]

    for index, row in data.iterrows():
        if row[class_label] == 'Yes':
            for i, value in enumerate(row[attributes]):
                if hypothesis[i] == '?' or hypothesis[i] == value:
                    hypothesis[i] = value
                else:
                    hypothesis[i] = '?'

    return hypothesis

file_path = 'C:\\Users\\ISE\\Desktop\\training_data.csv'
hypothesis = find_s_algorithm(file_path)
print("\nThe final hypothesis is:", hypothesis)

Sample Output:

Training data:
Outlook Temperature Humidity Windy PlayTennis
0 Sunny Hot High False No
1 Sunny Hot High True No
2 Overcast Hot High False Yes
3 Rain Cold High False Yes
4 Rain Cold High True No
5 Overcast Hot High True Yes
6 Sunny Hot High False No

The final hypothesis is: ['?', '?', 'High', '?']
(Humidity is 'High' in every positive example, while Outlook, Temperature and Windy each take more than one value among the positive rows, so Find-S generalizes them to '?'.)

Result:
The program was executed successfully, and the output has been recorded.


Ex. No: 5                                                                  Date:
Develop a program to implement k-Nearest Neighbour algorithm to classify the randomly generated 100 values of x in the range of [0, 1]. Perform the following based on the dataset generated.
1. Label the first 50 points {x1,…,x50} as follows: if (xi ≤ 0.5), then xi ∊ Class1, else xi ∊ Class2
2. Classify the remaining points, x51,…,x100 using KNN. Perform this for k = 1, 2, 3, 4, 5, 20, 30

Aim: To write a Python program to implement the k-Nearest Neighbour algorithm to classify 100 randomly generated values of x in the range [0, 1].

Software Required:
 Python 3.x
 Anaconda
 Jupyter Notebook

Algorithm:
Step 1: Generate Random Data
Generate 100 random values of x in the range [0, 1].
Step 2: Assign Labels to the First 50 Points
Assign Class1 if xi ≤ 0.5
Assign Class2 if xi > 0.5
Step 3: Prepare the KNN Classifier
Store the first 50 labelled points as the training dataset.
Store the remaining 50 unlabelled points as the test dataset.
Step 4: Implement the KNN Algorithm
For each test point xtest (i.e., x51 to x100):
Compute the Euclidean distance between xtest and all training points.
Select the k nearest neighbours and assign the majority class label to xtest.
Step 5: Classify Test Data for Multiple k Values
Repeat Step 4 for different values of k = 1, 2, 3, 4, 5, 20, 30.
Store and compare classification results for different k values.
Step 6: Visualize the Results
Plot the training points and classified test points using a scatter plot.
Use different colors and symbols for the two classes.
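
Before the manual implementation in the sample code, note that the same experiment can be sanity-checked with scikit-learn's built-in KNeighborsClassifier (an optional sketch, not the prescribed code):

# Optional cross-check using scikit-learn's KNN on the same 1-D data
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

data = np.random.rand(100)
X_train = data[:50].reshape(-1, 1)                          # first 50 points as training features
y_train = np.where(data[:50] <= 0.5, "Class1", "Class2")    # labels from the rule x <= 0.5 -> Class1
X_test = data[50:].reshape(-1, 1)                           # remaining 50 points to classify

for k in [1, 2, 3, 4, 5, 20, 30]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(k, knn.predict(X_test)[:5])                       # show the first few predictions for each k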


Sample Coding:
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

data = np.random.rand(100)

labels = ["Class1" if x <= 0.5 else "Class2" for x in data[:50]]

def euclidean_distance(x1, x2):
    return abs(x1 - x2)

def knn_classifier(train_data, train_labels, test_point, k):
    distances = [(euclidean_distance(test_point, train_data[i]), train_labels[i]) for i in range(len(train_data))]
    distances.sort(key=lambda x: x[0])
    k_nearest_neighbors = distances[:k]
    k_nearest_labels = [label for _, label in k_nearest_neighbors]
    return Counter(k_nearest_labels).most_common(1)[0][0]

train_data = data[:50]
train_labels = labels

test_data = data[50:]

k_values = [1, 2, 3, 4, 5, 20, 30]

print("--- k-Nearest Neighbors Classification ---")
print("Training dataset: First 50 points labeled based on the rule (x <= 0.5 -> Class1, x > 0.5 -> Class2)")
print("Testing dataset: Remaining 50 points to be classified\n")

results = {}

for k in k_values:
    print(f"Results for k = {k}:")
    classified_labels = [knn_classifier(train_data, train_labels, test_point, k) for test_point in test_data]
    results[k] = classified_labels

    for i, label in enumerate(classified_labels, start=51):
        print(f"Point x{i} (value: {test_data[i - 51]:.4f}) is classified as {label}")
    print("\n")


print("Classification complete.\n")

for k in k_values:
classified_labels = results[k]
class1_points = [test_data[i] for i in range(len(test_data)) if classified_labels[i] == "Class1"]
class2_points = [test_data[i] for i in range(len(test_data)) if classified_labels[i] == "Class2"]

plt.figure(figsize=(10, 6))
plt.scatter(train_data, [0] * len(train_data), c=["blue" if label == "Class1" else "red" for label in
train_labels],
label="Training Data", marker="o")
plt.scatter(class1_points, [1] * len(class1_points), c="blue", label="Class1 (Test)", marker="x")
plt.scatter(class2_points, [1] * len(class2_points), c="red", label="Class2 (Test)", marker="x")

plt.title(f"k-NN Classification Results for k = {k}")


plt.xlabel("Data Points")
plt.ylabel("Classification Level")
plt.legend()
plt.grid(True)
plt.show()

Sample Output:
1. Training dataset:
First 50 points labeled based on the rule (x <= 0.5 -> Class1, x > 0.5 -> Class2)
2. Testing dataset: Remaining 50 points to be classified

Results for k = 1:
Point x51 (value: 0.7976) is classified as Class2
Point x52 (value: 0.3163) is classified as Class1
Point x53 (value: 0.2959) is classified as Class1
Point x54 (value: 0.7594) is classified as Class2
Point x55 (value: 0.7926) is classified as Class2
Point x56 (value: 0.9752) is classified as Class2
Point x57 (value: 0.8851) is classified as Class2
Point x58 (value: 0.7563) is classified as Class2
Point x59 (value: 0.9945) is classified as Class2
Point x60 (value: 0.0421) is classified as Class1
Point x61 (value: 0.1763) is classified as Class1
Point x62 (value: 0.8064) is classified as Class2
Point x63 (value: 0.3906) is classified as Class1
Point x64 (value: 0.4836) is classified as Class1
Point x65 (value: 0.7161) is classified as Class2
Point x66 (value: 0.9742) is classified as Class2
Point x67 (value: 0.6985) is classified as Class2
Point x68 (value: 0.7710) is classified as Class2
Point x69 (value: 0.6594) is classified as Class2
Point x70 (value: 0.7821) is classified as Class2


Point x71 (value: 0.4572) is classified as Class1


Point x72 (value: 0.7967) is classified as Class2
Point x73 (value: 0.7942) is classified as Class2
Point x74 (value: 0.3467) is classified as Class1
Point x75 (value: 0.1917) is classified as Class1
Point x76 (value: 0.1025) is classified as Class1
Point x77 (value: 0.2484) is classified as Class1
Point x78 (value: 0.1633) is classified as Class1
Point x79 (value: 0.6691) is classified as Class2
Point x80 (value: 0.4346) is classified as Class1
Point x81 (value: 0.7434) is classified as Class2
Point x82 (value: 0.6923) is classified as Class2
Point x83 (value: 0.5965) is classified as Class2
Point x84 (value: 0.6371) is classified as Class2
Point x85 (value: 0.1417) is classified as Class1
Point x86 (value: 0.7006) is classified as Class2
Point x87 (value: 0.5158) is classified as Class2
Point x88 (value: 0.9622) is classified as Class2
Point x89 (value: 0.1696) is classified as Class1
Point x90 (value: 0.5922) is classified as Class2
Point x91 (value: 0.9350) is classified as Class2
Point x92 (value: 0.8776) is classified as Class2
Point x93 (value: 0.5550) is classified as Class2
Point x94 (value: 0.0005) is classified as Class1
Point x95 (value: 0.4488) is classified as Class1
Point x96 (value: 0.1293) is classified as Class1
Point x97 (value: 0.9218) is classified as Class2
Point x98 (value: 0.9849) is classified as Class2
Point x99 (value: 0.2912) is classified as Class1
Point x100 (value: 0.1391) is classified as Class1

Results for k = 2:
Point x51 (value: 0.7976) is classified as Class2
Point x52 (value: 0.3163) is classified as Class1
Point x53 (value: 0.2959) is classified as Class1
Point x54 (value: 0.7594) is classified as Class2
Point x55 (value: 0.7926) is classified as Class2
Point x56 (value: 0.9752) is classified as Class2
Point x57 (value: 0.8851) is classified as Class2
Point x58 (value: 0.7563) is classified as Class2
Point x59 (value: 0.9945) is classified as Class2
Point x60 (value: 0.0421) is classified as Class1
Point x61 (value: 0.1763) is classified as Class1
Point x62 (value: 0.8064) is classified as Class2
Point x63 (value: 0.3906) is classified as Class1
Point x64 (value: 0.4836) is classified as Class1
Point x65 (value: 0.7161) is classified as Class2
Point x66 (value: 0.9742) is classified as Class2
Point x67 (value: 0.6985) is classified as Class2
Point x68 (value: 0.7710) is classified as Class2
Point x69 (value: 0.6594) is classified as Class2
Point x70 (value: 0.7821) is classified as Class2


Point x71 (value: 0.4572) is classified as Class1


Point x72 (value: 0.7967) is classified as Class2
Point x73 (value: 0.7942) is classified as Class2
Point x74 (value: 0.3467) is classified as Class1
Point x75 (value: 0.1917) is classified as Class1
Point x76 (value: 0.1025) is classified as Class1
Point x77 (value: 0.2484) is classified as Class1
Point x78 (value: 0.1633) is classified as Class1
Point x79 (value: 0.6691) is classified as Class2
Point x80 (value: 0.4346) is classified as Class1
Point x81 (value: 0.7434) is classified as Class2
Point x82 (value: 0.6923) is classified as Class2
Point x83 (value: 0.5965) is classified as Class2
Point x84 (value: 0.6371) is classified as Class2
Point x85 (value: 0.1417) is classified as Class1
Point x86 (value: 0.7006) is classified as Class2
Point x87 (value: 0.5158) is classified as Class2
Point x88 (value: 0.9622) is classified as Class2
Point x89 (value: 0.1696) is classified as Class1
Point x90 (value: 0.5922) is classified as Class2
Point x91 (value: 0.9350) is classified as Class2
Point x92 (value: 0.8776) is classified as Class2
Point x93 (value: 0.5550) is classified as Class2
Point x94 (value: 0.0005) is classified as Class1
Point x95 (value: 0.4488) is classified as Class1
Point x96 (value: 0.1293) is classified as Class1
Point x97 (value: 0.9218) is classified as Class2
Point x98 (value: 0.9849) is classified as Class2
Point x99 (value: 0.2912) is classified as Class1
Point x100 (value: 0.1391) is classified as Class1

Results for k = 3:
Point x51 (value: 0.7976) is classified as Class2
Point x52 (value: 0.3163) is classified as Class1
Point x53 (value: 0.2959) is classified as Class1
Point x54 (value: 0.7594) is classified as Class2
Point x55 (value: 0.7926) is classified as Class2
Point x56 (value: 0.9752) is classified as Class2
Point x57 (value: 0.8851) is classified as Class2
Point x58 (value: 0.7563) is classified as Class2
Point x59 (value: 0.9945) is classified as Class2
Point x60 (value: 0.0421) is classified as Class1
Point x61 (value: 0.1763) is classified as Class1
Point x62 (value: 0.8064) is classified as Class2
Point x63 (value: 0.3906) is classified as Class1
Point x64 (value: 0.4836) is classified as Class1
Point x65 (value: 0.7161) is classified as Class2
Point x66 (value: 0.9742) is classified as Class2
Point x67 (value: 0.6985) is classified as Class2
Point x68 (value: 0.7710) is classified as Class2
Point x69 (value: 0.6594) is classified as Class2
Point x70 (value: 0.7821) is classified as Class2


Point x71 (value: 0.4572) is classified as Class1


Point x72 (value: 0.7967) is classified as Class2
Point x73 (value: 0.7942) is classified as Class2
Point x74 (value: 0.3467) is classified as Class1
Point x75 (value: 0.1917) is classified as Class1
Point x76 (value: 0.1025) is classified as Class1
Point x77 (value: 0.2484) is classified as Class1
Point x78 (value: 0.1633) is classified as Class1
Point x79 (value: 0.6691) is classified as Class2
Point x80 (value: 0.4346) is classified as Class1
Point x81 (value: 0.7434) is classified as Class2
Point x82 (value: 0.6923) is classified as Class2
Point x83 (value: 0.5965) is classified as Class2
Point x84 (value: 0.6371) is classified as Class2
Point x85 (value: 0.1417) is classified as Class1
Point x86 (value: 0.7006) is classified as Class2
Point x87 (value: 0.5158) is classified as Class2
Point x88 (value: 0.9622) is classified as Class2
Point x89 (value: 0.1696) is classified as Class1
Point x90 (value: 0.5922) is classified as Class2
Point x91 (value: 0.9350) is classified as Class2
Point x92 (value: 0.8776) is classified as Class2
Point x93 (value: 0.5550) is classified as Class2
Point x94 (value: 0.0005) is classified as Class1
Point x95 (value: 0.4488) is classified as Class1
Point x96 (value: 0.1293) is classified as Class1
Point x97 (value: 0.9218) is classified as Class2
Point x98 (value: 0.9849) is classified as Class2
Point x99 (value: 0.2912) is classified as Class1
Point x100 (value: 0.1391) is classified as Class1

Results for k = 4:
Point x51 (value: 0.7976) is classified as Class2
Point x52 (value: 0.3163) is classified as Class1
Point x53 (value: 0.2959) is classified as Class1
Point x54 (value: 0.7594) is classified as Class2
Point x55 (value: 0.7926) is classified as Class2
Point x56 (value: 0.9752) is classified as Class2
Point x57 (value: 0.8851) is classified as Class2
Point x58 (value: 0.7563) is classified as Class2
Point x59 (value: 0.9945) is classified as Class2
Point x60 (value: 0.0421) is classified as Class1
Point x61 (value: 0.1763) is classified as Class1
Point x62 (value: 0.8064) is classified as Class2
Point x63 (value: 0.3906) is classified as Class1
Point x64 (value: 0.4836) is classified as Class1
Point x65 (value: 0.7161) is classified as Class2
Point x66 (value: 0.9742) is classified as Class2
Point x67 (value: 0.6985) is classified as Class2
Point x68 (value: 0.7710) is classified as Class2
Point x69 (value: 0.6594) is classified as Class2
Point x70 (value: 0.7821) is classified as Class2


Point x71 (value: 0.4572) is classified as Class1


Point x72 (value: 0.7967) is classified as Class2
Point x73 (value: 0.7942) is classified as Class2
Point x74 (value: 0.3467) is classified as Class1
Point x75 (value: 0.1917) is classified as Class1
Point x76 (value: 0.1025) is classified as Class1
Point x77 (value: 0.2484) is classified as Class1
Point x78 (value: 0.1633) is classified as Class1
Point x79 (value: 0.6691) is classified as Class2
Point x80 (value: 0.4346) is classified as Class1
Point x81 (value: 0.7434) is classified as Class2
Point x82 (value: 0.6923) is classified as Class2
Point x83 (value: 0.5965) is classified as Class2
Point x84 (value: 0.6371) is classified as Class2
Point x85 (value: 0.1417) is classified as Class1
Point x86 (value: 0.7006) is classified as Class2
Point x87 (value: 0.5158) is classified as Class2
Point x88 (value: 0.9622) is classified as Class2
Point x89 (value: 0.1696) is classified as Class1
Point x90 (value: 0.5922) is classified as Class2
Point x91 (value: 0.9350) is classified as Class2
Point x92 (value: 0.8776) is classified as Class2
Point x93 (value: 0.5550) is classified as Class2
Point x94 (value: 0.0005) is classified as Class1
Point x95 (value: 0.4488) is classified as Class1
Point x96 (value: 0.1293) is classified as Class1
Point x97 (value: 0.9218) is classified as Class2
Point x98 (value: 0.9849) is classified as Class2
Point x99 (value: 0.2912) is classified as Class1
Point x100 (value: 0.1391) is classified as Class1

Results for k = 5:
Point x51 (value: 0.7976) is classified as Class2
Point x52 (value: 0.3163) is classified as Class1
Point x53 (value: 0.2959) is classified as Class1
Point x54 (value: 0.7594) is classified as Class2
Point x55 (value: 0.7926) is classified as Class2
Point x56 (value: 0.9752) is classified as Class2
Point x57 (value: 0.8851) is classified as Class2
Point x58 (value: 0.7563) is classified as Class2
Point x59 (value: 0.9945) is classified as Class2
Point x60 (value: 0.0421) is classified as Class1
Point x61 (value: 0.1763) is classified as Class1
Point x62 (value: 0.8064) is classified as Class2
Point x63 (value: 0.3906) is classified as Class1
Point x64 (value: 0.4836) is classified as Class1
Point x65 (value: 0.7161) is classified as Class2
Point x66 (value: 0.9742) is classified as Class2
Point x67 (value: 0.6985) is classified as Class2
Point x68 (value: 0.7710) is classified as Class2
Point x69 (value: 0.6594) is classified as Class2
Point x70 (value: 0.7821) is classified as Class2


Point x71 (value: 0.4572) is classified as Class1


Point x72 (value: 0.7967) is classified as Class2
Point x73 (value: 0.7942) is classified as Class2
Point x74 (value: 0.3467) is classified as Class1
Point x75 (value: 0.1917) is classified as Class1
Point x76 (value: 0.1025) is classified as Class1
Point x77 (value: 0.2484) is classified as Class1
Point x78 (value: 0.1633) is classified as Class1
Point x79 (value: 0.6691) is classified as Class2
Point x80 (value: 0.4346) is classified as Class1
Point x81 (value: 0.7434) is classified as Class2
Point x82 (value: 0.6923) is classified as Class2
Point x83 (value: 0.5965) is classified as Class2
Point x84 (value: 0.6371) is classified as Class2
Point x85 (value: 0.1417) is classified as Class1
Point x86 (value: 0.7006) is classified as Class2
Point x87 (value: 0.5158) is classified as Class2
Point x88 (value: 0.9622) is classified as Class2
Point x89 (value: 0.1696) is classified as Class1
Point x90 (value: 0.5922) is classified as Class2
Point x91 (value: 0.9350) is classified as Class2
Point x92 (value: 0.8776) is classified as Class2
Point x93 (value: 0.5550) is classified as Class2
Point x94 (value: 0.0005) is classified as Class1
Point x95 (value: 0.4488) is classified as Class1
Point x96 (value: 0.1293) is classified as Class1
Point x97 (value: 0.9218) is classified as Class2
Point x98 (value: 0.9849) is classified as Class2
Point x99 (value: 0.2912) is classified as Class1
Point x100 (value: 0.1391) is classified as Class1

Results for k = 20:


Point x51 (value: 0.7976) is classified as Class2
Point x52 (value: 0.3163) is classified as Class1
Point x53 (value: 0.2959) is classified as Class1
Point x54 (value: 0.7594) is classified as Class2
Point x55 (value: 0.7926) is classified as Class2
Point x56 (value: 0.9752) is classified as Class2
Point x57 (value: 0.8851) is classified as Class2
Point x58 (value: 0.7563) is classified as Class2
Point x59 (value: 0.9945) is classified as Class2
Point x60 (value: 0.0421) is classified as Class1
Point x61 (value: 0.1763) is classified as Class1
Point x62 (value: 0.8064) is classified as Class2
Point x63 (value: 0.3906) is classified as Class1
Point x64 (value: 0.4836) is classified as Class1
Point x65 (value: 0.7161) is classified as Class2
Point x66 (value: 0.9742) is classified as Class2
Point x67 (value: 0.6985) is classified as Class2
Point x68 (value: 0.7710) is classified as Class2
Point x69 (value: 0.6594) is classified as Class2
Point x70 (value: 0.7821) is classified as Class2


Point x71 (value: 0.4572) is classified as Class1


Point x72 (value: 0.7967) is classified as Class2
Point x73 (value: 0.7942) is classified as Class2
Point x74 (value: 0.3467) is classified as Class1
Point x75 (value: 0.1917) is classified as Class1
Point x76 (value: 0.1025) is classified as Class1
Point x77 (value: 0.2484) is classified as Class1
Point x78 (value: 0.1633) is classified as Class1
Point x79 (value: 0.6691) is classified as Class2
Point x80 (value: 0.4346) is classified as Class1
Point x81 (value: 0.7434) is classified as Class2
Point x82 (value: 0.6923) is classified as Class2
Point x83 (value: 0.5965) is classified as Class2
Point x84 (value: 0.6371) is classified as Class2
Point x85 (value: 0.1417) is classified as Class1
Point x86 (value: 0.7006) is classified as Class2
Point x87 (value: 0.5158) is classified as Class2
Point x88 (value: 0.9622) is classified as Class2
Point x89 (value: 0.1696) is classified as Class1
Point x90 (value: 0.5922) is classified as Class2
Point x91 (value: 0.9350) is classified as Class2
Point x92 (value: 0.8776) is classified as Class2
Point x93 (value: 0.5550) is classified as Class2
Point x94 (value: 0.0005) is classified as Class1
Point x95 (value: 0.4488) is classified as Class1
Point x96 (value: 0.1293) is classified as Class1
Point x97 (value: 0.9218) is classified as Class2
Point x98 (value: 0.9849) is classified as Class2
Point x99 (value: 0.2912) is classified as Class1
Point x100 (value: 0.1391) is classified as Class1

Results for k = 30:


Point x51 (value: 0.7976) is classified as Class2
Point x52 (value: 0.3163) is classified as Class1
Point x53 (value: 0.2959) is classified as Class1
Point x54 (value: 0.7594) is classified as Class2
Point x55 (value: 0.7926) is classified as Class2
Point x56 (value: 0.9752) is classified as Class2
Point x57 (value: 0.8851) is classified as Class2
Point x58 (value: 0.7563) is classified as Class2
Point x59 (value: 0.9945) is classified as Class2
Point x60 (value: 0.0421) is classified as Class1
Point x61 (value: 0.1763) is classified as Class1
Point x62 (value: 0.8064) is classified as Class2
Point x63 (value: 0.3906) is classified as Class1
Point x64 (value: 0.4836) is classified as Class1
Point x65 (value: 0.7161) is classified as Class2
Point x66 (value: 0.9742) is classified as Class2
Point x67 (value: 0.6985) is classified as Class2
Point x68 (value: 0.7710) is classified as Class2
Point x69 (value: 0.6594) is classified as Class2
Point x70 (value: 0.7821) is classified as Class2


Point x71 (value: 0.4572) is classified as Class1


Point x72 (value: 0.7967) is classified as Class2
Point x73 (value: 0.7942) is classified as Class2
Point x74 (value: 0.3467) is classified as Class1
Point x75 (value: 0.1917) is classified as Class1
Point x76 (value: 0.1025) is classified as Class1
Point x77 (value: 0.2484) is classified as Class1
Point x78 (value: 0.1633) is classified as Class1
Point x79 (value: 0.6691) is classified as Class2
Point x80 (value: 0.4346) is classified as Class1
Point x81 (value: 0.7434) is classified as Class2
Point x82 (value: 0.6923) is classified as Class2
Point x83 (value: 0.5965) is classified as Class2
Point x84 (value: 0.6371) is classified as Class2
Point x85 (value: 0.1417) is classified as Class1
Point x86 (value: 0.7006) is classified as Class2
Point x87 (value: 0.5158) is classified as Class1
Point x88 (value: 0.9622) is classified as Class2
Point x89 (value: 0.1696) is classified as Class1
Point x90 (value: 0.5922) is classified as Class2
Point x91 (value: 0.9350) is classified as Class2
Point x92 (value: 0.8776) is classified as Class2
Point x93 (value: 0.5550) is classified as Class2
Point x94 (value: 0.0005) is classified as Class1
Point x95 (value: 0.4488) is classified as Class1
Point x96 (value: 0.1293) is classified as Class1
Point x97 (value: 0.9218) is classified as Class2
Point x98 (value: 0.9849) is classified as Class2
Point x99 (value: 0.2912) is classified as Class1
Point x100 (value: 0.1391) is classified as Class1


Result:
The program was executed successfully, and the output has been recorded.


Ex. No: 6                                                                  Date:
Implement the non-parametric Locally Weighted Regression (LWR) algorithm in order to fit data points. Select an appropriate data set for your experiment and draw graphs.

Aim: To write a Python program to implement the non-parametric Locally Weighted Regression algorithm in order to fit data points, using an appropriate data set and drawing the corresponding graphs.

Application:
 Roads, curves, and driving patterns vary locally.
 In autonomous vehicles, LWR can be used to predict the steering angle from sensor data (such as camera images, LiDAR, and GPS) in real time, especially on complex or curved roads.

Software Required:
 Python 3.x
 Anaconda
 Jupyter Notebook

Algorithm:

Step 1: Add a bias term to the training data and to the query point.

Step 2: Compute the weights using the Gaussian kernel. For each i = 1 to m:
        wi = exp( -||x - xi||^2 / (2 τ^2) )
        where x is the query point, xi is the i-th training point, and τ > 0 is the bandwidth parameter.

Step 3: Construct the diagonal weight matrix W = diag(w1, …, wm) ∈ R^(m×m).

Step 4: Compute the parameter vector θ ∈ R^(n+1) from the target values y = {y1, y2, …, ym} ∈ R^m:
        θ = (XᵀWX)⁻¹ XᵀWy

Step 5: Predict the output at the query point x ∈ R^(n+1):
        ŷ = xᵀθ


Sample Coding:

# Import required libraries
import numpy as np
import matplotlib.pyplot as plt

# Local component - handles the neighborhood definition.
# Bandwidth tau controls the "zoom level" of the regression — smaller values focus tightly on
# nearby data, while larger values widen the view.
def gaussian_kernel(x, xi, tau):
    return np.exp(-np.sum((x - xi) ** 2) / (2 * tau ** 2))

# Weight component - computes the weights for each data point,
# then the regression component performs the weighted regression.
def locally_weighted_regression(x, X, y, tau):
    m = X.shape[0]
    weights = np.array([gaussian_kernel(x, X[i], tau) for i in range(m)])

    # diag: construct the diagonal weight matrix and solve the weighted linear regression equation
    W = np.diag(weights)
    X_transpose_W = X.T @ W
    theta = np.linalg.inv(X_transpose_W @ X) @ X_transpose_W @ y
    return x @ theta

# Data generation and plotting (main program)
np.random.seed(42)
X = np.linspace(0, 2 * np.pi, 100)
y = np.sin(X) + 0.1 * np.random.randn(100)
X_bias = np.c_[np.ones(X.shape), X]

x_test = np.linspace(0, 2 * np.pi, 200)
x_test_bias = np.c_[np.ones(x_test.shape), x_test]
tau = 0.5
y_pred = np.array([locally_weighted_regression(xi, X_bias, y, tau) for xi in x_test_bias])

plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='red', label='Training Data', alpha=0.7)
plt.plot(x_test, y_pred, color='blue', label=f'LWR Fit (tau={tau})', linewidth=2)
plt.xlabel('X', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Locally Weighted Regression', fontsize=14)
plt.legend(fontsize=10)
plt.grid(alpha=0.3)
plt.show()
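
A small optional extension (assuming the variables X, y, X_bias, x_test and x_test_bias defined in the sample code above) is to overlay fits for several bandwidths and observe how tau controls the smoothness of the curve:

# Optional: compare several bandwidths on the same data
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='red', s=10, alpha=0.5, label='Training Data')
for tau in [0.1, 0.5, 1.0, 5.0]:
    y_pred = np.array([locally_weighted_regression(xi, X_bias, y, tau) for xi in x_test_bias])
    plt.plot(x_test, y_pred, label=f'tau={tau}')
plt.legend()
plt.show()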


Sample Output:

Result:
The program was executed successfully, and the output has been recorded.


Ex. No: 7                                                                  Date:
Develop a program to demonstrate the working of Linear Regression and Polynomial Regression. Use Boston Housing Dataset for Linear Regression and Auto MPG Dataset (for vehicle fuel efficiency prediction) for Polynomial Regression.

Aim: To write a Python program to demonstrate the working of Linear Regression and Polynomial Regression, using the Boston Housing Dataset for Linear Regression and the Auto MPG Dataset (for vehicle fuel efficiency prediction) for Polynomial Regression.

Software Required:
 Python 3.x
 Anaconda
 Jupyter Notebook

Concept Note:
The Boston Housing Dataset and the Auto MPG Dataset are two classic datasets widely used for
regression tasks and machine learning model evaluation, especially for educational and
benchmarking purposes.
 Mean Squared Error (MSE) is one of the most commonly used metrics to evaluate the
performance of regression models (like Linear or Polynomial Regression). It measures how
close the predicted values are to the actual observed values.

 The R² Score (also called the Coefficient of Determination) is a key metric used to evaluate how well a regression model (like Linear or Polynomial Regression) fits the observed data:

R² = 1 − SSE / SST

 SSE = ∑ (Actual – Predicted)² → Sum of Squared Errors: how much error the model has.
 SST = ∑ (Actual – Mean of Actual)² → Total Sum of Squares: total variance in the data.
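
To make the two metrics concrete, here is a small illustrative sketch (not part of the prescribed lab program) that computes MSE and R² directly from their definitions and checks them against scikit-learn; the y_true and y_pred values are hypothetical:

# Illustration: MSE and R² computed from their definitions
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 2.5, 4.0, 5.5])       # hypothetical actual values
y_pred = np.array([2.8, 2.7, 4.2, 5.0])       # hypothetical predicted values

sse = np.sum((y_true - y_pred) ** 2)          # Sum of Squared Errors
sst = np.sum((y_true - y_true.mean()) ** 2)   # Total Sum of Squares

mse = sse / len(y_true)                       # Mean Squared Error
r2 = 1 - sse / sst                            # Coefficient of Determination

print(mse, mean_squared_error(y_true, y_pred))   # the two MSE values should match
print(r2, r2_score(y_true, y_pred))              # the two R² values should match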

Linear Regression on Boston Housing Dataset:

 Purpose: Predicting the median value of owner-occupied homes in Boston suburbs.
 Target (MEDV): Median value of owner-occupied homes in $1000s.
 Model Complexity: Linear - straight-line relationship (y = mx + b).

Polynomial Regression on Auto MPG Dataset


 Purpose: Predicting the fuel efficiency (miles per gallon) of cars.
 Target: Calculate MPG: Miles per Gallon (fuel efficiency)
 Model Complexity: Polynomial: Curved relationship (y = ax² + bx + c)


Linear Regression Algorithm Steps:


1. Data Loading: Fetch the Boston Housing dataset using scikit-learn's fetch_openml function.
2. Feature Selection: Use only "RM" (average number of rooms) as the independent variable.
3. Data Splitting: Split data into training (80%) and testing (20%) sets.
4. Model Training: Create and fit a Linear Regression model on the training data.
5. Prediction: Use the trained model to predict house prices on the test set.
6. Visualization: Plot actual vs predicted values.
7. Evaluation: Calculate and print MSE and R² score metrics.

Polynomial Regression Algorithm Steps:


1. Data Loading: Download and load the Auto MPG dataset from UCI repository.
2. Data Cleaning: Handle missing values (NA) by dropping rows with missing data.
3. Feature Selection: Use "displacement" as the independent variable.
4. Data Splitting: Split data into training (80%) and testing (20%) sets.
5. Model Creation: Build a pipeline that:
o Creates polynomial features (degree=2)
o Scales the features (StandardScaler)
o Performs linear regression
6. Model Training: Fit the pipeline model on the training data.
7. Prediction: Use the trained model to predict MPG values on the test set.
8. Visualization: Plot actual vs predicted values.
9. Evaluation: Calculate and print MSE and R² score metrics.
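
To see what the degree-2 step in the pipeline produces, a tiny illustrative transform (not part of the lab program) is shown below; by default PolynomialFeatures also adds the bias column of ones:

# Illustration: PolynomialFeatures(degree=2) expands x into [1, x, x^2]
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0], [3.0]])
print(PolynomialFeatures(degree=2).fit_transform(X))
# [[1. 2. 4.]
#  [1. 3. 9.]]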

Sample Coding:

# Import required libraries


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score

# Implement Linear Regression


def linear_regression_boston():
    # Load Boston Housing dataset using fetch_openml
    housing = fetch_openml(name="boston", version=1, as_frame=True)
    X = housing.data[["RM"]]              # RM = average number of rooms
    y = housing.target.astype(float)      # Ensure target is float for regression

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = LinearRegression()
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    plt.figure(figsize=(10, 6))
    plt.scatter(X_test, y_test, color="blue", label="Actual")
    plt.plot(X_test, y_pred, color="red", label="Predicted")
    plt.xlabel("Average number of rooms (RM)")
    plt.ylabel("Median value of homes ($1000s)")
    plt.title("Linear Regression - Boston Housing Dataset")
    plt.legend()
    plt.show()

    print("Linear Regression - Boston Housing Dataset")
    print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
    print("R^2 Score:", r2_score(y_test, y_pred))

# Implement Polynomial Regression


def polynomial_regression_auto_mpg():
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
    column_names = ["mpg", "cylinders", "displacement", "horsepower", "weight",
                    "acceleration", "model_year", "origin"]
    data = pd.read_csv(url, sep=r'\s+', names=column_names, na_values="?")
    data = data.dropna()

    X = data["displacement"].values.reshape(-1, 1)
    y = data["mpg"].values

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    poly_model = make_pipeline(PolynomialFeatures(degree=2), StandardScaler(), LinearRegression())
    poly_model.fit(X_train, y_train)

    y_pred = poly_model.predict(X_test)

    plt.figure(figsize=(10, 6))
    plt.scatter(X_test, y_test, color="blue", label="Actual")
    plt.scatter(X_test, y_pred, color="red", label="Predicted")
    plt.xlabel("Displacement")
    plt.ylabel("Miles per gallon (mpg)")
    plt.title("Polynomial Regression - Auto MPG Dataset")
    plt.legend()
    plt.show()

    print("Polynomial Regression - Auto MPG Dataset")
    print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
    print("R^2 Score:", r2_score(y_test, y_pred))

# Run both demonstrations
print("Demonstrating Linear Regression and Polynomial Regression\n")
linear_regression_boston()
polynomial_regression_auto_mpg()

Sample Output:
 Mean Squared Error (MSE): Lower = Better fit.
 R² Score: Closer to 1 = Better model.

Linear Regression - Boston Housing Dataset


Mean Squared Error: 46.144775347317264
R^2 Score: 0.3707569232254778


Polynomial Regression - Auto MPG Dataset


Mean Squared Error: 0.7431490557205862
R^2 Score: 0.7505650609469626

Result:
The program was executed successfully, and the output has been recorded.


Ex. No: 8                                                                  Date:
Develop a program to demonstrate the working of the decision tree algorithm. Use Breast Cancer Data set for building the decision tree and apply this knowledge to classify a new sample.

Aim: To write a Python program to demonstrate the working of the decision tree algorithm, using the Breast Cancer Data set for building the decision tree and applying this knowledge to classify a new sample.

Software Required:
 Python 3.x
 Anaconda
 Jupyter Notebook

Concept Note:
Decision trees are supervised machine learning algorithms that can be used for both
classification and regression tasks. They work by recursively partitioning the data into
subsets based on feature values, creating a tree-like structure of decisions using Gini
impurity.
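
For reference, the Gini impurity of a node with class proportions p1, …, pC is Gini = 1 − Σ pi²; a pure node has Gini = 0, while an evenly split two-class node has Gini = 0.5.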
1. Data Loading: The Breast Cancer dataset contains features computed from digitized
images of fine needle aspirates of breast masses, with target labels indicating malignant or
benign tumors.
2. Data Splitting: The data is divided into training (80%) and testing (20%) sets to evaluate
model performance.
3. Model Training: A decision tree classifier is trained using the default Gini impurity
criterion.
4. Evaluation: The model's accuracy is calculated on the test set.
5. Prediction: The trained model is used to classify a new sample (in this case, the first test
sample).
6. Visualization: The decision tree is plotted, showing the hierarchical decision-making
process.

5 Main Features to identify Breast Cancer:


1. worst radius: 0.6473
2. worst concave points: 0.1421
3. worst perimeter: 0.0789
4. mean concave points: 0.0532
5. worst texture: 0.0210


How the Decision Tree Uses These Features


The tree splits samples based on thresholds like:
 worst radius ≤ 16.8 → Benign
 worst concave points > 0.051 → Malignant

Algorithm:
Step 1: Select the best attribute to split the dataset using a metric (Gini impurity in this
case)
Step 2: Split the dataset into subsets based on the selected attribute
Step 3: Repeat recursively for each subset until:
 All instances in a node belong to the same class
 No remaining attributes to split on
 A predefined stopping condition is met (max depth, min samples, etc.)

Sample Coding:
# Importing necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train decision tree classifier


clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions on test set


y_pred = clf.predict(X_test)

# Calculate accuracy


accuracy = accuracy_score(y_test, y_pred)


print(f"Model Accuracy: {accuracy * 100:.2f}%")

# Classify a new sample (using first test sample as example)


new_sample = np.array([X_test[0]])
prediction = clf.predict(new_sample)
prediction_class = "Benign" if prediction == 1 else "Malignant"
print(f"Predicted Class for the new sample: {prediction_class}")

# Visualize the decision tree


plt.figure(figsize=(12,8))
tree.plot_tree(clf, filled=True, feature_names=data.feature_names,
class_names=data.target_names)
plt.title("Decision Tree - Breast Cancer Dataset")
plt.show()

Sample Output: Predicted Class for the new sample is Benign.

The model (clf) was trained on the Breast Cancer dataset, where:
o Class 0 = Malignant (Cancerous)
o Class 1 = Benign (Non-cancerous)

Each node in the output contains the following information:


1. Splitting condition (which feature and threshold)
2. Gini impurity (measure of node purity)
3. Samples count (how many instances reach that node)
4. Class distribution (how many malignant/benign in that node)
5. Predicted class (for leaf nodes)
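
The same node information can also be inspected as plain text instead of a plot. A minimal sketch, assuming clf and data from the sample coding above (max_depth=3 is chosen here only to keep the printout short):

# Sketch: text view of the tree's splitting conditions, impurities and leaf classes
from sklearn.tree import export_text

rules = export_text(clf, feature_names=list(data.feature_names), max_depth=3)
print(rules)   # each line shows a feature threshold; leaf lines show the predicted class index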


Output:
Model Accuracy: 94.74%
Predicted Class for the new sample: Benign

Result:
The program was executed successfully, and the output has been recorded.


Ex. No: 9 Develop a program to implement the Naive Bayesian classifier considering Olivetti
Date: Face Data set for training. Compute the accuracy of the classifier, considering a few
test data sets.

Aim: Write a python coding to implement the Naive Bayesian classifier considering Olivetti Face Data
set for training. Compute the accuracy of the classifier, considering a few test data sets.

Software Required:
 Python 3.x
 Anaconda
 Jupyter Notebook
 Based on Olivetti face dataset

Concept Note:
Naive Bayes is a probabilistic classification algorithm based on Bayes' Theorem with the
assumption of feature independence. It is simple yet effective, particularly for high-dimensional
datasets such as images. The Olivetti Face Dataset consists of 400 grayscale images of 40
individuals, with each image having a resolution of 64×64 pixels. It is widely used in facial
recognition research and machine learning model evaluation.

Methodology:
1. The Olivetti Face Dataset is loaded using scikit-learn’s fetch_olivetti_faces() function.
2. The dataset is split into training and testing sets.
3. A Gaussian Naive Bayes classifier is trained on the training data.
4. Predictions are made on the test set.
5. The accuracy of the classifier is calculated and performance metrics such as confusion
matrix and classification report are generated.
6. A few sample test images are visualized along with their predicted and actual labels.
7. Cross-validation is also performed to evaluate model reliability.

Algorithm
Step 1: Import Required Libraries
Step 2: Load Dataset
Step 3: Preprocess the Data
Step 4: Split the Dataset
Divide the dataset into training and testing sets (70% train, 30% test).
Step 5: Train the Classifier
Initialize and train a Gaussian Naive Bayes classifier using the training data.
Step 6: Predict on Test Data
Use the trained model to predict labels for the test set.
Step 7: Evaluate the Model
Step 8: Compute the accuracy of the model.
Step 9: Display the classification report (precision, recall, F1-score).


Step 10: Show the confusion matrix.


Step 11: Cross-Validation
Perform 5-fold cross-validation to get a better estimate of model accuracy.
Step 12: Visualize Predictions
Display a few test face images along with their true and predicted labels.

Sample Coding:
# Step 1: Import Required Libraries
import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt

# Step 2: Load the Olivetti Face Dataset


data = fetch_olivetti_faces(shuffle=True, random_state=42)
X = data.data # Flattened image data (n_samples, 64*64 = 4096)
y = data.target # Labels (0 to 39)

# Step 3: Split Data into Training and Testing Sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: Train Gaussian Naive Bayes Classifier


gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Step 5: Make Predictions on Test Data


y_pred = gnb.predict(X_test)

# Step 6: Evaluate the Classifier


accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

# Step 7: Display Performance Metrics


print("\nClassification Report:")
print(classification_report(y_test, y_pred, zero_division=1))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))


# Step 8: Perform Cross-Validation


cross_val_accuracy = cross_val_score(gnb, X, y, cv=5, scoring='accuracy')
print(f'\nCross-validation accuracy: {cross_val_accuracy.mean() * 100:.2f}%')

# Step 9: Visualize Sample Predictions


fig, axes = plt.subplots(3, 5, figsize=(12, 8))
for ax, image, label, prediction in zip(axes.ravel(), X_test, y_test, y_pred):
    ax.imshow(image.reshape(64, 64), cmap=plt.cm.gray)
    ax.set_title(f"True: {label}, Pred: {prediction}")
    ax.axis('off')

plt.tight_layout()
plt.show()


Sample Output:

Accuracy: 80.83%

Classification Report:
precision recall f1-score support

0 0.67 1.00 0.80 2


1 1.00 1.00 1.00 2
2 0.33 0.67 0.44 3
3 1.00 0.00 0.00 5
4 1.00 0.50 0.67 4
5 1.00 1.00 1.00 2
7 1.00 0.75 0.86 4
8 1.00 0.67 0.80 3
9 1.00 0.75 0.86 4
10 1.00 1.00 1.00 3
11 1.00 1.00 1.00 1
12 0.40 1.00 0.57 4
13 1.00 0.80 0.89 5
14 1.00 0.40 0.57 5
15 0.67 1.00 0.80 2
16 1.00 0.67 0.80 3
17 1.00 1.00 1.00 3
18 1.00 1.00 1.00 3
19 0.67 1.00 0.80 2
20 1.00 1.00 1.00 3
21 1.00 0.67 0.80 3
22 1.00 0.60 0.75 5
23 1.00 0.75 0.86 4
24 1.00 1.00 1.00 3
25 1.00 0.75 0.86 4
26 1.00 1.00 1.00 2
27 1.00 1.00 1.00 5
28 0.50 1.00 0.67 2
29 1.00 1.00 1.00 2
30 1.00 1.00 1.00 2
31 1.00 0.75 0.86 4
32 1.00 1.00 1.00 2
34 0.25 1.00 0.40 1
35 1.00 1.00 1.00 5
36 1.00 1.00 1.00 3
37 1.00 1.00 1.00 1
38 1.00 0.75 0.86 4
39 0.50 1.00 0.67 5

accuracy 0.81 120


macro avg 0.89 0.85 0.83 120
weighted avg 0.91 0.81 0.81 120


Confusion Matrix:
[[2 0 0 ... 0 0 0]
[0 2 0 ... 0 0 0]
[0 0 2 ... 0 0 1]
...
[0 0 0 ... 1 0 0]
[0 0 0 ... 0 3 0]
[0 0 0 ... 0 0 5]]

Cross-validation accuracy: 87.25%


The output image shows two labels for each face:

 True: the actual label (ground truth)
 Pred: the predicted label from the Gaussian Naive Bayes (GNB) classifier.

Let's check each one:


Image TRUE Prediction Correct Prediction?
1 18 18 ✅ True
2 0 0 ✅ True
3 5 5 ✅ True
4 22 22 ✅ True
5 22 22 ✅ True
6 27 27 ✅ True
7 16 16 ✅ True
8 18 18 ✅ True
9 31 31 ✅ True
10 35 35 ✅ True
11 12 12 ✅ True
12 5 5 ✅ True
13 22 22 ✅ True
14 0 0 ✅ True
15 25 25 ✅ True

✅ All 15 predictions are correct!

 True Positives: 15
 False Positives / False Negatives: 0

Conclusion:
The GNB classifier has perfectly predicted the labels for all the test images shown in this
particular output.

Result:
The program was executed successfully, and the output has been recorded.


Ex. No: 10 Develop a program to implement k-means clustering using Wisconsin Breast Cancer
Date: data set and visualize the clustering result.

Aim: Write a python coding to implement k-means clustering using Wisconsin Breast Cancer data set
and visualize the clustering result.

Software Required:
 Python 3.x
 Anaconda
 Jupyter Notebook

Concept Note: This program provides a hands-on example of unsupervised learning,
showcasing how K-Means and PCA can reveal hidden structures in medical data, even without
labeled training data. The visualizations bridge the gap between high-dimensional data and
human interpretability.
1. K-Means Clustering & PCA
A. K-Means Clustering:
 An unsupervised algorithm that partitions data into *k* clusters.
 Each cluster is represented by its centroid (mean of points in the cluster).
Steps:
 Initialize *k* centroids randomly (k=2)
 Assign each data point to the nearest centroid.
 Recalculate centroids based on assigned points.
 Repeat until convergence.

B. Dimensionality Reduction (PCA):


 Projects high-dimensional data into 2D/3D for visualization.
 Retains maximum variance using orthogonal components (PC1, PC2).

C. Evaluation Metrics
 Though unsupervised, we validate clusters using:
 Confusion Matrix: Compares predicted clusters vs. true labels.
 Classification Report: Precision, recall, F1-score (if ground truth exists).

2. Dataflow
Data Preparation
 Load dataset (30 features, e.g., radius, texture, symmetry).
 Standardize features to mean=0, variance=1 (critical for K-Means).
Clustering

 Apply K-Means (k=2 to match benign/malignant classes).


 Predict cluster labels.
Evaluation
 Compare clusters with true labels using supervised metrics.


Visualization
 Reduce 30D data to 2D using PCA.
Plot:
 Clusters (colored by K-Means labels).
 True labels (ground truth).
 Centroids (marked as "X").

Sample Coding
# 1. Import Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, classification_report
# 2. Load and Preprocess Data
data = load_breast_cancer()
X = data.data
y = data.target
## 2.1 Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# 3. Apply K-Means Clustering
kmeans = KMeans(n_clusters=2, random_state=42)
y_kmeans = kmeans.fit_predict(X_scaled)
## 3.1 Evaluate performance
print("Confusion Matrix:")
print(confusion_matrix(y, y_kmeans))
print("\nClassification Report:")
print(classification_report(y, y_kmeans))
# 4. Dimensionality Reduction (PCA) for Visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
## 4.1 Create DataFrame for plotting
df = pd.DataFrame(X_pca, columns=['PC1', 'PC2'])
df['Cluster'] = y_kmeans
df['True Label'] = y


# 5. Visualize Results
## 5.1 Cluster Assignments
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1',
s=100, edgecolor='black', alpha=0.7)
plt.title('K-Means Clustering of Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="Cluster")
plt.show()
### 5.2 True Labels (Ground Truth)
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='True Label', palette='coolwarm',
s=100, edgecolor='black', alpha=0.7)
plt.title('True Labels of Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="True Label")
plt.show()
#### 5.3 Clusters with Centroids
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='PC1', y='PC2', hue='Cluster', palette='Set1',
s=100, edgecolor='black', alpha=0.7)
centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(centers[:, 0], centers[:, 1], s=200, c='red', marker='X', label='Centroids')
plt.title('K-Means Clustering with Centroids')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title="Cluster")
plt.show()


Sample Output:

Confusion Matrix:
[[175 37]
[13 344]]

Classification Report:
precision recall f1-score support

0 0.93 0.83 0.88 212


1 0.90 0.96 0.93 357

accuracy 0.91 569


macro avg 0.92 0.89 0.90 569
weighted avg 0.91 0.91 0.91 569


Result:
The program was executed successfully, and the output has been recorded.


Viva-Voce Questions

Experiment 1: Histograms & Box Plots (California Housing Dataset)

1. Q: What is a histogram?
A: A histogram is a graphical representation of the distribution of numerical data, showing
frequencies of values in bins.
2. Q: How do you identify outliers in a box plot?
A: Outliers are points outside the whiskers (typically 1.5×IQR from the quartiles).
3. Q: What does the median line in a box plot represent?
A: The median (50th percentile) of the dataset.
4. Q: Which Python libraries are used for histograms and box plots?
A: matplotlib.pyplot and seaborn.
5. Q: What is skewness in a histogram?
A: Skewness measures asymmetry; positive skew means a longer tail on the right.
6. Q: How is a box plot useful in EDA?
A: It shows quartiles, median, and outliers, helping understand spread and skew.
7. Q: What is the purpose of plt.show()?
A: It displays the plotted figure.
8. Q: How do you customize bin size in a histogram?
A: Use the bins parameter in plt.hist() (e.g., bins=20).
9. Q: What does IQR stand for in box plots?
A: Interquartile Range (Q3 − Q1).
10. Q: How would you handle outliers in this dataset?
A: Options: Remove them, cap them, or use robust statistical methods.
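
A minimal sketch of the 1.5×IQR rule from Q2 and Q9, using the MedInc column of the California Housing dataset (the column choice is illustrative):

# Sketch: flagging box-plot outliers with the 1.5*IQR rule
import pandas as pd
from sklearn.datasets import fetch_california_housing

df = fetch_california_housing(as_frame=True).frame
q1, q3 = df["MedInc"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["MedInc"] < q1 - 1.5 * iqr) | (df["MedInc"] > q3 + 1.5 * iqr)]
print(f"MedInc outliers: {len(outliers)} of {len(df)} rows")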


Experiment 2: Correlation Matrix & Heatmap (California Housing Dataset)

1. Q: What does a correlation value of 0.9 mean?


A: A strong positive linear relationship between two variables.
2. Q: How is a heatmap useful?
A: It visualizes correlation strengths using colors.
3. Q: What is the range of correlation values?
A: −1 (perfect negative) to +1 (perfect positive).
4. Q: How do you create a correlation matrix in Pandas?
A: df.corr().
5. Q: What does a pair plot show?
A: Scatter plots for all pairwise feature relationships and histograms diagonally.
6. Q: What is multicollinearity?
A: High correlation between independent variables, which can affect regression models.
7. Q: How to interpret a dark cell in a heatmap?
A: It indicates strong correlation (positive or negative).
8. Q: What is the difference between Pearson and Spearman correlation?
A: Pearson measures linear relationships; Spearman measures monotonic relationships.
9. Q: How would you handle highly correlated features?
A: Drop one or use PCA.
10. Q: Name two features likely to correlate in the California dataset.
A: AveRooms and AveBedrms (both relate to house size).
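
A minimal sketch tying Q4, Q7 and Q10 together; the feature pair from Q10 is used only as an example, and df.corr() computes Pearson correlation over all numeric columns:

# Sketch: correlation matrix, heatmap, and one specific feature pair
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing

df = fetch_california_housing(as_frame=True).frame
corr = df.corr()                              # Pearson by default
print(corr.loc["AveRooms", "AveBedrms"])      # correlation of the pair from Q10
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.show()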


Experiment 3: PCA on Iris Dataset

1. Q: What is PCA used for?


A: Dimensionality reduction by transforming features into uncorrelated principal components.
2. Q: How do you select the number of components?
A: Use explained variance (e.g., retain 95% variance).
3. Q: What is the role of eigenvalues in PCA?
A: They indicate the variance explained by each principal component.
4. Q: Why scale data before PCA?
A: PCA is sensitive to feature scales; scaling ensures equal weighting.
5. Q: What does a scatter plot of PC1 vs. PC2 show?
A: Data projected onto the first two principal components.
6. Q: Is PCA supervised or unsupervised?
A: Unsupervised (no labels are used).
7. Q: Can PCA reconstruct original data?
A: Yes, but with loss if components are reduced.
8. Q: What is the difference between PCA and LDA?
A: PCA maximizes variance; LDA maximizes class separation (supervised).
9. Q: What is the "curse of dimensionality"?
A: High-dimensional data sparsity, which PCA mitigates.
10. Q: Name a limitation of PCA.
A: Loses interpretability of original features.
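
A minimal sketch of Q2 and Q4 on the Iris dataset: scale first, then read the explained-variance ratio to decide how many components to keep:

# Sketch: scaling + PCA, then checking how much variance two components retain
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_scaled)
print(pca.explained_variance_ratio_)          # variance explained by PC1 and PC2
print(pca.explained_variance_ratio_.sum())    # total variance retained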


Experiment 4: Find-S Algorithm

1. Q: What is the output of Find-S?


A: The most specific hypothesis consistent with positive training examples.
2. Q: What is a hypothesis in Find-S?
A: A conjunction of constraints on attributes (e.g., (Sunny, Warm,?,?)).
3. Q: How does Find-S handle negative examples?
A: It ignores them (only generalizes from positives).
4. Q: What is the initial hypothesis in Find-S?
A: The most specific (e.g., (∅, ∅, ∅) for all attributes).
5. Q: What are the limitations of Find-S?
A: Cannot handle noise or multiple consistent hypotheses.
6. Q: What is the difference between Find-S and Candidate Elimination?
A: Candidate Elimination tracks both general and specific hypotheses.
7. Q: Can Find-S handle continuous-valued attributes?
A: No, it works only with discrete values.
8. Q: What is a "consistent" hypothesis?
A: One that correctly classifies all training examples.
9. Q: Give an example where Find-S fails.
A: If data has contradictions (e.g., same features, different labels).
10. Q: How would you modify Find-S for noisy data?
A: Use probabilistic methods or vote over multiple hypotheses.
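
A minimal sketch of the Find-S update rule from Q1–Q4; the toy examples and the find_s helper are illustrative (not the lab's CSV data), and the hypothesis is initialised from the first positive example, which has the same effect as starting from the most specific hypothesis:

# Sketch: Find-S generalizes only on positive examples
def find_s(examples):
    hypothesis = None
    for attributes, label in examples:
        if label != "Yes":                  # negative examples are ignored
            continue
        if hypothesis is None:
            hypothesis = list(attributes)   # start from the first positive example
        else:
            hypothesis = [h if h == a else "?" for h, a in zip(hypothesis, attributes)]
    return hypothesis

toy_data = [(("Sunny", "Warm", "Normal"), "Yes"),
            (("Sunny", "Warm", "High"), "Yes"),
            (("Rainy", "Cold", "High"), "No")]
print(find_s(toy_data))                     # ['Sunny', 'Warm', '?']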


Experiment 5: k-Nearest Neighbors (k-NN)

1. Q: What is the basic principle of k-NN?


A: Classify a point based on the majority class of its *k* nearest neighbors.
2. Q: How does *k* affect the model?
A: Small *k* = high variance (overfitting); large *k* = high bias (underfitting).
3. Q: What distance metric is commonly used?
A: Euclidean distance.
4. Q: Why scale features in k-NN?
A: To prevent features with larger scales from dominating.
5. Q: What happens if *k* = 1?
A: The model becomes highly sensitive to noise (overfitting).
6. Q: Is k-NN parametric or non-parametric?
A: Non-parametric (no assumptions about data distribution).
7. Q: How do you handle ties in k-NN voting?
A: Random selection or reduce *k* by 1.
8. Q: What is the curse of dimensionality in k-NN?
A: In high dimensions, distance metrics lose meaning, reducing accuracy.
9. Q: Can k-NN be used for regression?
A: Yes, by averaging the values of the *k* neighbors.
10. Q: How do you choose the best *k*?
A: Use cross-validation to test different *k* values.
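
A minimal sketch of Q10, comparing a few k values with 5-fold cross-validation; the Iris dataset and the list of k values are chosen only for illustration:

# Sketch: choosing k by cross-validated accuracy
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 3, 5, 7, 15):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k}: mean accuracy = {scores.mean():.3f}")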


Experiment 6: Locally Weighted Regression (LWR)

1. Q: How is LWR different from linear regression?


A: LWR fits a local model for each prediction, weighting nearby points more.
2. Q: What is the bandwidth parameter?
A: It controls the "neighborhood size" (how far points influence the prediction).
3. Q: Why is LWR non-parametric?
A: It doesn’t assume a fixed functional form; fits adaptively to data.
4. Q: What kernel is commonly used in LWR?
A: Gaussian kernel.
5. Q: When would you use LWR over linear regression?
A: When the data has non-linear patterns or varying relationships.
6. Q: What happens if bandwidth is too small?
A: Overfitting (noisy predictions).
7. Q: What happens if bandwidth is too large?
A: Underfitting (approaches linear regression).
8. Q: How does LWR handle outliers?
A: Outliers have less impact if they’re far from the prediction point.
9. Q: Is LWR computationally expensive?
A: Yes, because it recomputes weights for each prediction.
10. Q: Can LWR be used for classification?
A: Yes, via logistic regression locally.
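
A minimal sketch of Q2 and Q4: Gaussian-kernel weights and one locally weighted prediction on synthetic data (the lwr_predict helper, the data and τ = 0.5 are all illustrative choices):

# Sketch: one locally weighted prediction with a Gaussian kernel
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = np.sin(x) + 0.1 * rng.standard_normal(100)

def lwr_predict(x_query, x, y, tau=0.5):
    w = np.exp(-(x - x_query) ** 2 / (2 * tau ** 2))   # weights shrink with distance
    X = np.column_stack([np.ones_like(x), x])          # design matrix with intercept
    W = np.diag(w)
    beta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y   # weighted least squares
    return np.array([1.0, x_query]) @ beta

print(lwr_predict(5.0, x, y))    # should be close to sin(5) ≈ -0.96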


Experiment 7: Linear & Polynomial Regression

1. Q: What is the equation of linear regression?


A: y=β0+β1x1+ϵ.
2. Q: How do you interpret the coefficient β1?
A: For a unit increase in x1, y changes by β1.
3. Q: What is the difference between linear and polynomial regression?
A: Polynomial regression adds higher-degree terms (e.g., x2) to fit curves.
4. Q: How do you avoid overfitting in polynomial regression?
A: Use cross-validation, regularization, or limit the degree.
5. Q: What is RMSE?
A: Root Mean Squared Error; measures prediction error.
6. Q: Why scale features in polynomial regression?
A: To prevent numerical instability with high-degree terms.
7. Q: What is heteroscedasticity?
A: Non-constant variance in residuals; violates linear regression assumptions.
8. Q: How does Ridge regression differ from ordinary linear regression?
A: Ridge adds L2 penalty to coefficients to reduce overfitting.
9. Q: What is the role of the intercept term (β0)?
A: It represents the expected value of y when all xi=0.
10. Q: How do you check if a regression model is good?
A: Use R2, adjusted R2, and residual plots.
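
A minimal sketch of Q4 and Q8: the same polynomial pipeline fitted with ordinary least squares and with the Ridge L2 penalty (the synthetic data, degree 8 and alpha=1.0 are illustrative choices, not values from the lab):

# Sketch: ordinary vs. Ridge polynomial regression
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)
X = rng.uniform(0, 3, size=(60, 1))
y = 2 * X.ravel() ** 2 - X.ravel() + rng.normal(0, 0.5, 60)

for name, reg in [("Linear", LinearRegression()), ("Ridge", Ridge(alpha=1.0))]:
    model = make_pipeline(PolynomialFeatures(degree=8), StandardScaler(), reg)
    model.fit(X, y)
    print(name, "train R^2:", round(model.score(X, y), 3))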


Experiment 8: Decision Tree (Breast Cancer Dataset)

1. Q: What is the splitting criterion in decision trees?


A: Gini impurity or entropy (information gain).
2. Q: How does a decision tree handle overfitting?
A: Pruning, limiting depth (max_depth), or setting a minimum samples per leaf.
3. Q: What is the difference between Gini and entropy?
A: Both measure impurity; Gini is faster, entropy is more sensitive.
4. Q: How do you interpret feature importance?
A: Features used at the top splits have higher importance.
5. Q: Can decision trees handle missing values?
A: Yes, by surrogate splits or ignoring missing values.
6. Q: What is pruning?
A: Removing non-critical branches to simplify the tree.
7. Q: Why are decision trees called "greedy" algorithms?
A: They make locally optimal splits without backtracking.
8. Q: What is the advantage of decision trees over linear models?
A: They capture non-linear relationships and are interpretable.
9. Q: How would you convert a tree into rules?
A: Each path from root to leaf is an IF-THEN rule.
10. Q: What is min_samples_split?
A: The minimum number of samples required to split a node.
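
A minimal sketch of Q1 and Q3, computing Gini impurity and entropy for the same class distribution (the 70/30 split and the gini/entropy helpers are illustrative):

# Sketch: Gini impurity vs. entropy for one node
import numpy as np

def gini(p):
    return 1.0 - np.sum(np.square(p))

def entropy(p):
    p = p[p > 0]                             # avoid log(0)
    return -np.sum(p * np.log2(p))

p = np.array([0.7, 0.3])                     # e.g. 70% benign, 30% malignant in a node
print("Gini:", round(gini(p), 3))            # 0.42
print("Entropy:", round(entropy(p), 3))      # 0.881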


Experiment 9: Naive Bayes (Olivetti Faces)

1. Q: Why is Naive Bayes "naive"?


A: It assumes features are independent (often untrue but simplifies computation).
2. Q: What is the role of Laplace smoothing?
A: Prevents zero probabilities for unseen feature values.
3. Q: How does Gaussian Naive Bayes work?
A: Assumes features follow a normal distribution.
4. Q: What is the equation for Bayes’ theorem?
A: P(A|B) = P(B|A)·P(A) / P(B); equivalently, for classification, P(y|x) = P(x|y)·P(y) / P(x).

5. Q: How do you handle image data in Naive Bayes?


A: Flatten pixels into feature vectors (may require dimensionality reduction).
6. Q: What are the advantages of Naive Bayes?
A: Fast, works well with high dimensions, and handles small datasets.
7. Q: What is a limitation of Naive Bayes?
A: Poor performance if features are correlated.
8. Q: How do you evaluate a classifier’s accuracy?
A: Using metrics like precision, recall, F1-score, and confusion matrix.
9. Q: Can Naive Bayes handle continuous features?
A: Yes, using Gaussian NB.
10. Q: Name a real-world use case of Naive Bayes.
A: Spam detection (text classification).
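
A minimal sketch of Q3 and Q8: Gaussian Naive Bayes on flattened image vectors with the usual evaluation metrics. The digits dataset is used here only because it is small; the lab experiment itself uses the Olivetti faces:

# Sketch: Gaussian Naive Bayes on flattened image vectors
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

X, y = load_digits(return_X_y=True)          # 8x8 images already flattened to 64 features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
y_pred = GaussianNB().fit(X_train, y_train).predict(X_test)
print("Accuracy:", round(accuracy_score(y_test, y_pred), 3))
print(classification_report(y_test, y_pred, zero_division=1))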


Experiment 10: k-Means Clustering (Breast Cancer Dataset)

1. Q: What is the goal of k-means?


A: To partition data into k clusters by minimizing within-cluster variance.
2. Q: How do you choose k?
A: Elbow method (plot WCSS vs. k and look for a "kink").
3. Q: What is a centroid?
A: The mean of all points in a cluster.
4. Q: Why is k-means sensitive to initial centroids?
A: Random initialization may lead to suboptimal clusters (solved by k-means++).
5. Q: What is WCSS?
A: Within-Cluster Sum of Squares (measure of cluster compactness).
6. Q: How does k-means handle categorical data?
A: Poorly; it requires numerical data (use k-modes for categorical).
7. Q: What is the difference between k-means and hierarchical clustering?
A: k-means requires predefined k; hierarchical builds a tree of clusters.
8. Q: Can k-means work with non-spherical clusters?
A: No, it assumes spherical clusters of similar size.
9. Q: How do you visualize clustering results?
A: Scatter plots with points colored by cluster labels.
10. Q: What is silhouette score?
A: A metric measuring how well points fit their clusters (range: −1 to 1).
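
A minimal sketch of Q2, Q5 and Q10: WCSS (exposed as inertia_ in scikit-learn) for the elbow method and the silhouette score, on the standardized Breast Cancer features used in the experiment (the range of k values is illustrative):

# Sketch: elbow method (WCSS) and silhouette score for several k
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = StandardScaler().fit_transform(load_breast_cancer().data)
for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    print(f"k={k}: WCSS={km.inertia_:.0f}, silhouette={silhouette_score(X, km.labels_):.3f}")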


Key formulas

1. Histograms & Box Plots

 Histogram Bins (bin width): h = (max(x) − min(x)) / k, where k = number of bins.

 Box Plot Whiskers: Lower = Q1 − 1.5×IQR, Upper = Q3 + 1.5×IQR,
  where IQR = Q3 − Q1.

2. Correlation Matrix & Heatmap

 Pearson Correlation (r):
  r = Σ(xi − x̄)(yi − ȳ) / √( Σ(xi − x̄)² · Σ(yi − ȳ)² )

 Spearman Correlation:
  ρ = 1 − 6Σdi² / (n(n² − 1)), where di = rank difference for point i;
  i.e., ranks the data points before applying Pearson’s formula.

3. PCA (Dimensionality Reduction)

 Eigenvalue Equation: C·v = λ·v, where C = covariance matrix, v = eigenvector, λ = eigenvalue.

 Explained Variance (component i): λi / Σj λj


4. Find-S Algorithm

 Hypothesis update (positive examples only): for each attribute, if the constraint in h
  disagrees with the example, replace it with '?'; negative examples are ignored.

5. k-Nearest Neighbors (k-NN)

 Euclidean Distance: d(x, y) = √( Σi (xi − yi)² )

 Majority Voting: ŷ = argmax_c Σ I(yi = c), summed over the k nearest neighbours of x,
  where I is the indicator function.

6. Locally Weighted Regression (LWR)

 Weight Function (Gaussian Kernel): w(i) = exp( −(x − x(i))² / (2τ²) ),
  where τ = bandwidth.

 Weighted Least Squares: β = (XᵀWX)⁻¹ XᵀWy, where W = diag(w(1), …, w(m)).


7. Linear & Polynomial Regression

 Linear Regression: y = β0 + β1x1 + … + βnxn + ϵ
  Solved via Normal Equations: β = (XᵀX)⁻¹ Xᵀy

 Polynomial Regression (degree d): y = β0 + β1x + β2x² + … + βdx^d + ϵ

 RMSE: √( (1/n) Σi (yi − ŷi)² )

8. Decision Tree

 Gini Impurity: G = 1 − Σc pc², where pc = proportion of class c.

 Entropy: H = −Σc pc·log2(pc)
  Information Gain: IG = H(parent) − Σk (nk/n)·H(childk)


9. Naive Bayes

 Bayes’ Theorem: P(y|x) = P(x|y)·P(y) / P(x)

 Gaussian Likelihood: P(xi|y) = (1 / √(2πσy²)) · exp( −(xi − μy)² / (2σy²) )

 Laplace Smoothing: P(xi|y) = (count(xi, y) + 1) / (count(y) + k),
  where k = number of possible values of xi.

10. k-Means Clustering

 Within-Cluster Sum of Squares (WCSS): Σi Σ(x ∈ Ci) ‖x − μi‖²,
  where μi = centroid of cluster Ci.

 Centroid Update: μi = (1/|Ci|) Σ(x ∈ Ci) x


Thank You!
