Imports
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
Load IRIS dataset
iris = datasets.load_iris()
print(iris)
As you can see, the dataset is a dictionary-like object (a scikit-learn Bunch). What are the keys of the
dictionary?
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names',
'filename', 'data_module'])
What is the value of the key data? Assign the value to a variable X
What is the shape of X?
What is the value of the key target? Assign the value to a variable y
What is the shape of y?
What is the value of the key target_names? Assign the value to a variable
target_names
What is the value of the key feature_names? Assign the value to a variable
feature_names
#Solution
X = iris['data']
y = iris['target']
feature_names = iris['feature_names']
target_names = iris['target_names']
# Note: you can also access the elements with the dot (.) operator, e.g.,
# X = iris.data
print(type(X))
print(type(y))
print(X.shape)
print(y.shape)
print(feature_names)
print(target_names)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
(150, 4)
(150,)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width
(cm)']
['setosa' 'versicolor' 'virginica']
The figure below illustrates the features and target labels of the Iris
dataset.
Print the 5th datapoint in your dataset X
Print the features and target label of flower 1 to 5.
Iterate over all datapoints in X and calculate the area of Sepal and Petal for each
flower in the dataset.
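One possible way to approach the three tasks above, shown on a small stand-in array rather than the full dataset. The array X_demo holds the first five rows of the Iris data, and the "area" is approximated as length times width, which is an assumption about what the exercise intends (real sepals and petals are not rectangles).

```python
import numpy as np

# Stand-in for X: the first five Iris rows
# columns: sepal length, sepal width, petal length, petal width (cm)
X_demo = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
])
y_demo = np.array([0, 0, 0, 0, 0])  # all five are setosa

# The 5th datapoint has index 4 because indexing starts at 0
fifth = X_demo[4]
print(fifth)

# Features and labels of flowers 1 to 5 (indices 0 to 4)
for features, label in zip(X_demo[:5], y_demo[:5]):
    print(features, label)

# Loop version: area approximated as length * width (assumption)
sepal_areas = []
petal_areas = []
for row in X_demo:
    sepal_areas.append(row[0] * row[1])
    petal_areas.append(row[2] * row[3])

# Vectorized version: same result without an explicit loop
sepal_areas_vec = X_demo[:, 0] * X_demo[:, 1]
petal_areas_vec = X_demo[:, 2] * X_demo[:, 3]
```

On the real data you would replace X_demo and y_demo with X and y; the vectorized form is usually preferred in NumPy code, but the loop matches the wording of the exercise.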
Exploratory Data Analysis
Box plot of all features
plt.figure()
plt.boxplot(X)
plt.ylabel("[cm]")
plt.xticks(range(1, len(feature_names) + 1), feature_names)  # label each box with its feature name
plt.show()
Scatter plot for each pair of features
Plot the scatter plot for the first and second features,
(X[:,0], X[:,1]).
Don't forget to label your axes.
Hint: use c=y inside the scatter plot to color the points by their
target labels.
#your code here
Write a function called plot_pair that takes a pair of features and their
labels and draws the scatter plot.
def plot_pair(X1, X2, x1_label, x2_label, y):
    ...
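A minimal sketch of what such a function could look like. It is only one possible solution; returning the Axes object is a design choice made here for illustration, and the sample points are made up rather than taken from the Iris data.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_pair(X1, X2, x1_label, x2_label, y):
    """Scatter plot of two features, colored by target label."""
    fig, ax = plt.subplots()
    ax.scatter(X1, X2, c=y)  # c=y maps each class label to a color
    ax.set_xlabel(x1_label)
    ax.set_ylabel(x2_label)
    return ax

# Small made-up sample (not the Iris data) to exercise the function
x1 = np.array([5.1, 4.9, 6.3, 5.8])
x2 = np.array([3.5, 3.0, 3.3, 2.7])
labels = np.array([0, 0, 2, 2])
ax = plot_pair(x1, x2, "Sepal Length", "Sepal Width", labels)
```

In a notebook the figure is displayed automatically; in a script, call plt.show() after the function.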
Use the plot_pair function to plot the scatter plot for all pairs of features.
X[:,0], X[:,1], 'Sepal Length', 'Sepal Width'
X[:,0], X[:,2], 'Sepal Length', 'Petal Length'
X[:,0], X[:,3], 'Sepal Length', 'Petal Width'
X[:,1], X[:,2], 'Sepal Width', 'Petal Length'
X[:,1], X[:,3], 'Sepal Width', 'Petal Width'
X[:,2], X[:,3], 'Petal Length', 'Petal Width'
#your code here
(Optional) The plots shown above do not have a legend. To add a legend to
the plot, you can use the following code snippet.
def plot_pair_with_legend(x1, x2, x1_label, x2_label, y):
    plt.figure()
    # Draw one scatter series per class so each gets its own legend entry
    for i, target_name in enumerate(iris.target_names):
        plt.scatter(x1[y == i], x2[y == i], label=target_name)
    plt.xlabel(x1_label)
    plt.ylabel(x2_label)
    plt.legend()
    plt.show()
plot_pair_with_legend(X[:,0], X[:,1], feature_names[0], feature_names[1], y)
Histogram of each feature
Plot the histogram of each feature.
#your code here
K Nearest Neighbors (KNN)
Euclidean Distance (2D)
In geometry, the Euclidean distance is the straight-line distance
between two points.
Given two points $ P(x_1, y_1) $ and $ Q(x_2, y_2)$ in a 2D plane, the
Euclidean distance between them is calculated as follows:
$ d(P, Q) = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} $
Example (2D)
Let's say we have two points:
- $ P(2, 2) $
- $ Q(5, 5) $
$ d(P, Q) = \sqrt{(5 - 2)^2 + (5 - 2)^2} = \sqrt{18} \approx 4.24 $
We can calculate the distance between these two points with NumPy:
P = np.array([2, 2])
Q = np.array([5, 5])
distance = np.sqrt(np.sum((P - Q)**2))
distance
np.float64(4.242640687119285)
Example (3 Dimensions)
Consider two points in 3D space:
- $ P_1(1, 2, 3) $
- $ P_2(4, 0, 8) $
We can calculate the Euclidean distance as follows:
$ d(P_1, P_2) = \sqrt{(4 - 1)^2 + (0 - 2)^2 + (8 - 3)^2} $
$ d(P_1, P_2) = \sqrt{3^2 + (-2)^2 + 5^2} = \sqrt{9 + 4 + 25} = \sqrt{38} \approx 6.16 $
# Define two points in 3D space
P1 = np.array([1, 2, 3])
P2 = np.array([4, 0, 8])
# Calculate the Euclidean distance
distance = np.sqrt(np.sum((P2 - P1)**2))
print(f'The Euclidean distance between P1 and P2 is: {distance:.2f}')
The Euclidean distance between P1 and P2 is: 6.16
Write a function that takes two NumPy arrays P and Q and returns the Euclidean distance
between them.
def straight_line_distance(P, Q):
    ...
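A possible sketch of this function, reusing the expression from the two worked examples. The conversion with np.asarray is an addition made here so that plain lists also work; it is not required by the exercise.

```python
import numpy as np

def straight_line_distance(P, Q):
    """Euclidean distance between two points of the same dimension."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    return np.sqrt(np.sum((P - Q) ** 2))

d2 = straight_line_distance(np.array([2, 2]), np.array([5, 5]))
d3 = straight_line_distance(np.array([1, 2, 3]), np.array([4, 0, 8]))
print(f"{d2:.2f} {d3:.2f}")  # the two worked examples: 4.24 and 6.16
```

np.linalg.norm(P - Q) computes the same quantity and can be used to check your own implementation.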
KNN Algorithm
KNN from scratch
0 - Look at the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
random_state=42)
Explain each term in the cell above: what are X_train, X_test, y_train, and y_test?
????
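To make the four names concrete, here is a simplified stand-in for what a 50/50 split does. It is not how train_test_split is implemented and it will not select the same rows as random_state=42; the helper name simple_split and the data are made up for illustration.

```python
import numpy as np

def simple_split(X, y, test_size=0.5, seed=42):
    """Hypothetical helper: shuffle row indices, then cut them in two."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Features and labels use the same indices so they stay aligned
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Made-up data with the same shape as Iris: 150 samples, 4 features
X_fake = np.arange(600, dtype=float).reshape(150, 4)
y_fake = np.repeat([0, 1, 2], 50)

X_tr, X_te, y_tr, y_te = simple_split(X_fake, y_fake)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

With test_size=0.5 and 150 samples, each half holds 75 rows, which is why the training indices in the outputs further down run from 0 to 74.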
1 - Calculate distances
Take one sample from the test set and find the distance between this sample and all
samples in the training set. In addition to the distance, you need to store the
index of the sample in the training set.
For example, if the distance between the test sample and the 5th sample in the
training set is 3.5, you need to store (5, 3.5).
test_instance = X_test[0]
distances = [] # append the (index, distance) tuples to this list
# your code here
Write a function called calculate_distances that takes the test sample and the
training set and returns a list of (index, distance) tuples, one per training sample.
def calculate_distances(test_instance, X_train):
    # return distances
    ...
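One way this function could be written, as a sketch rather than the reference solution. It returns a list of (index, distance) tuples, which is the form the sorting step below expects; the tiny training set is made up.

```python
import numpy as np

def calculate_distances(test_instance, X_train):
    """Return a list of (index, distance) tuples, one per training sample."""
    distances = []
    for index, train_instance in enumerate(X_train):
        distance = np.sqrt(np.sum((test_instance - train_instance) ** 2))
        distances.append((index, distance))
    return distances

# Tiny made-up training set with two features
X_train_demo = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
test_demo = np.array([0.0, 0.0])
result = calculate_distances(test_demo, X_train_demo)
print(result)
```

The loop mirrors the description in the text. A vectorized alternative is np.linalg.norm(X_train - test_instance, axis=1), which returns only the distances, with the array position serving as the index.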
What do you pass as input to the function calculate_distances? What do you get as output
when you call this function?
????
What is the shape of the input arrays to the function calculate_distances? What is the
shape of the output?
???
2 - Find neighbors
Step 1: Sort the (index, distance) tuples by distance in
ascending order.
distances = calculate_distances(test_instance, X_train)
distances.sort(key=lambda x: x[1])
distances
[(34, np.float64(0.22360679774997896)),
(45, np.float64(0.30000000000000027)),
(28, np.float64(0.5099019513592785)),
(35, np.float64(0.5099019513592788)),
(66, np.float64(0.5196152422706639)),
(47, np.float64(0.5291502622129183)),
(17, np.float64(0.5830951894845297)),
(36, np.float64(0.6164414002968978)),
(65, np.float64(0.6244997998398398)),
(41, np.float64(0.6480740698407859)),
(48, np.float64(0.6999999999999995)),
(70, np.float64(0.7071067811865478)),
(63, np.float64(0.728010988928052)),
(23, np.float64(0.741619848709566)),
(14, np.float64(0.754983443527075)),
(68, np.float64(0.774596669241483)),
(73, np.float64(0.7874007874011811)),
(0, np.float64(0.8124038404635955)),
(50, np.float64(0.8124038404635965)),
(9, np.float64(0.8602325267042631)),
(60, np.float64(0.9273618495495711)),
(18, np.float64(0.9433981132056598)),
(67, np.float64(0.9643650760992956)),
(20, np.float64(0.9746794344808962)),
(5, np.float64(0.9746794344808963)),
(37, np.float64(1.0049875621120894)),
(42, np.float64(1.0440306508910553)),
(2, np.float64(1.0535653752852738)),
(64, np.float64(1.0954451150103324)),
(62, np.float64(1.1045361017187258)),
(8, np.float64(1.1575836902790226)),
(44, np.float64(1.224744871391589)),
(43, np.float64(1.296148139681572)),
(11, np.float64(1.2999999999999998)),
(71, np.float64(1.3490737563232036)),
(38, np.float64(1.3490737563232043)),
(31, np.float64(1.407124727947029)),
(40, np.float64(1.4247806848775015)),
(1, np.float64(1.438749456993816)),
(52, np.float64(1.5556349186104048)),
(56, np.float64(1.6186414056238647)),
(29, np.float64(1.6278820596099706)),
(58, np.float64(1.6431676725154982)),
(16, np.float64(1.7349351572897476)),
(74, np.float64(1.8138357147217057)),
(55, np.float64(1.8165902124584952)),
(24, np.float64(1.8493242008906932)),
(4, np.float64(1.8601075237738276)),
(54, np.float64(1.8973665961010275)),
(32, np.float64(1.9157244060668017)),
(15, np.float64(1.997498435543818)),
(61, np.float64(2.0346989949375804)),
(51, np.float64(2.090454496036687)),
(19, np.float64(2.4020824298928627)),
(69, np.float64(3.2939338184001206)),
(3, np.float64(3.3674916480965473)),
(13, np.float64(3.4161381705077445)),
(39, np.float64(3.551056180912941)),
(49, np.float64(3.5623026261113755)),
(53, np.float64(3.5623026261113755)),
(10, np.float64(3.5735136770411273)),
(12, np.float64(3.5791060336346563)),
(26, np.float64(3.6318039594669758)),
(6, np.float64(3.6537651812890224)),
(59, np.float64(3.6565010597564442)),
(25, np.float64(3.685105154537656)),
(57, np.float64(3.765634076752546)),
(30, np.float64(3.782856063875548)),
(7, np.float64(3.823610858861032)),
(33, np.float64(3.8314488121336034)),
(72, np.float64(3.844476557348217)),
(21, np.float64(3.845776904605882)),
(46, np.float64(3.8961519477556315)),
(27, np.float64(3.9357337308308855)),
(22, np.float64(4.177319714841085))]
Step 2: Select the first k elements of the sorted list and store the
indices of these k elements in a list.
k = 5
distances[:k]
[(34, np.float64(0.22360679774997896)),
(45, np.float64(0.30000000000000027)),
(28, np.float64(0.5099019513592785)),
(35, np.float64(0.5099019513592788)),
(66, np.float64(0.5196152422706639))]
Extract the indices of the k nearest neighbors from the (index, distance) tuples.
neighbor_index = []
# your code here
Step 3: Find the labels of these top k samples from y_train array.
neighbor_label = []
#your code here
Now write a function find_neighbors that performs all the steps above, from 1 to 3.
def find_neighbors(test_instance, X_train, y_train, k):
    """
    Inputs
        test_instance: one data point from the test set
        X_train: training dataset
        y_train: training labels
        k: number of neighbors
    Output
        neighbor_label: list of the k neighbors' labels
    """
    # your code here
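A compact sketch of find_neighbors. Instead of building and sorting a list of tuples, it uses NumPy broadcasting and np.argsort, which is a swapped-in technique and not the approach spelled out in the steps above. The demo arrays are made up.

```python
import numpy as np

def find_neighbors(test_instance, X_train, y_train, k):
    """Return the labels of the k training samples closest to test_instance."""
    # Step 1: distance from the test sample to every training sample
    distances = np.sqrt(np.sum((X_train - test_instance) ** 2, axis=1))
    # Step 2: indices of the k smallest distances
    neighbor_index = np.argsort(distances, kind="stable")[:k]
    # Step 3: look up the labels of those neighbors
    neighbor_label = [y_train[i] for i in neighbor_index]
    return neighbor_label

X_train_demo = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y_train_demo = np.array([0, 0, 1, 1])
labels = find_neighbors(np.array([4.9, 5.0]), X_train_demo, y_train_demo, k=2)
print(labels)
```

The stable sort keeps the lower training index first when two distances are equal, which matches the behavior of list.sort on the (index, distance) tuples.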
What do you pass as input to the function find_neighbors? What do you get as output when
you call this function?
???
What is the shape of the input arrays to the function find_neighbors? What is the shape of
the output?
???
Explain which operations are performed inside the function find_neighbors to obtain
the labels of the k nearest neighbors.
???
3 - Vote on labels
The following function votes on the labels of the k nearest neighbors.
def vote_on_labels(neighbor_label):
    # Count how many times each label appears among the neighbors
    prediction_dict = {}
    for label in neighbor_label:
        if label in prediction_dict:
            prediction_dict[label] += 1
        else:
            prediction_dict[label] = 1
    # The predicted label is the one with the highest count
    prediction = max(prediction_dict, key=prediction_dict.get)
    return prediction
y_pred = vote_on_labels(neighbor_label)
y_pred
np.int64(1)
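The same majority vote can be sketched with collections.Counter from the standard library. The function name vote_with_counter is hypothetical and used only to keep it apart from vote_on_labels.

```python
from collections import Counter

def vote_with_counter(neighbor_label):
    """Hypothetical alternative to vote_on_labels using collections.Counter."""
    counts = Counter(neighbor_label)
    # most_common(1) returns [(label, count)] for the most frequent label
    return counts.most_common(1)[0][0]

print(vote_with_counter([1, 1, 2, 1, 0]))  # label 1 has three of five votes
print(vote_with_counter([2, 1, 1, 2]))     # tie: the label seen first wins
```

Ties can only happen when no single label has a strict majority, for instance with an even k. How a tie is resolved is an implementation detail, so it is worth keeping in mind when two KNN implementations disagree.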
What do you pass as input to the function vote_on_labels? What do you get as output when
you call this function?
????
What is the shape of the input arrays to the function vote_on_labels? What is the shape of
the output?
???
4 - Put it all together
Now iterate over all datapoints of X_test and predict their labels.
y_pred = []
#your code here
Turn your code into a function KNN that takes the training set, the target labels of the
training set, the test set, and the value of k and returns the predicted labels of
the test set.
def KNN(X_train, y_train, X_test, k):
    ...
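A self-contained sketch of how the pieces could fit together. To keep the block runnable on its own, the distance, neighbor, and voting steps are inlined rather than calling the functions from the earlier exercises, and the data are made up.

```python
import numpy as np
from collections import Counter

def KNN(X_train, y_train, X_test, k):
    """Predict a label for every row of X_test by majority vote of k neighbors."""
    y_pred = []
    for test_instance in X_test:
        # Distances from this test sample to all training samples
        distances = np.sqrt(np.sum((X_train - test_instance) ** 2, axis=1))
        # Labels of the k closest training samples
        nearest = np.argsort(distances, kind="stable")[:k]
        neighbor_label = y_train[nearest].tolist()
        # Majority vote; on a tie the label seen first wins
        y_pred.append(Counter(neighbor_label).most_common(1)[0][0])
    return np.array(y_pred)

# Made-up data: two well-separated clusters
X_train_demo = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                         [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train_demo = np.array([0, 0, 0, 1, 1, 1])
X_test_demo = np.array([[0.1, 0.1], [5.1, 5.0]])
print(KNN(X_train_demo, y_train_demo, X_test_demo, k=3))
```

Returning a NumPy array rather than a list makes the element-wise comparison y_test == y_pred in the next step work directly.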
5 - Evaluate the model
Finally, calculate the accuracy of the KNN algorithm.
y_test == y_pred
array([ True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, False, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, False, True, True, True, True, True, True, True,
True, True, True, True, True, True, False, True, True,
True, True, True, True, True, True, True, True, True,
False, True, True])
accuracy = sum(y_test == y_pred) / len(y_test) #takes True as 1 and False as 0
print(f"accuracy: {accuracy * 100} %")
accuracy: 94.66666666666667 %
Turn your code into a function evaluate that takes the true labels and the
predicted labels and returns the accuracy of the model.
def evaluate(y_test, y_pred):
    # your code here
    ...
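A short sketch of evaluate, using np.mean on the boolean comparison. This is equivalent to the sum-divided-by-length expression above; the labels in the demo are made up.

```python
import numpy as np

def evaluate(y_test, y_pred):
    """Fraction of predictions that match the true labels."""
    y_test = np.asarray(y_test)
    y_pred = np.asarray(y_pred)
    return np.mean(y_test == y_pred)

# Made-up labels: 3 of 4 predictions are correct
acc = evaluate([0, 1, 2, 1], [0, 1, 2, 2])
print(f"accuracy: {acc * 100:.2f} %")
```

For the run shown above, 71 of the 75 comparisons are True, which gives 71 / 75 ≈ 94.67 %.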
KNN in Scikit-Learn
knn_model = KNeighborsClassifier(n_neighbors=4)  # You can change the value of 'k' as needed.
knn_model.fit(X_train, y_train)
y_pred = knn_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
Accuracy: 93.33%
(Optional) 6 - Hyperparameter tuning
So far we have used fixed values of k (k = 5 in the neighbor-finding steps above and
n_neighbors=4 with scikit-learn). Now we are going to find the best value of k
for the KNN algorithm.
K = [1, 2, 3, 4, 5, 6, 7, 8]
my_accs = []
# your code here
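The loop over k could look like the sketch below. To stay self-contained it uses a compact stand-in classifier (knn_predict, a hypothetical name) and made-up two-class data instead of the Iris split, so the accuracies it prints say nothing about the results you should expect on Iris.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k):
    """Compact from-scratch KNN used only for this illustration."""
    preds = []
    for x in X_test:
        d = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
        nearest = np.argsort(d, kind="stable")[:k]
        preds.append(Counter(y_train[nearest].tolist()).most_common(1)[0][0])
    return np.array(preds)

# Made-up two-class data (not Iris): two Gaussian blobs
rng = np.random.default_rng(0)
X_a = rng.normal(loc=0.0, scale=1.0, size=(40, 2))
X_b = rng.normal(loc=2.5, scale=1.0, size=(40, 2))
X_all = np.vstack([X_a, X_b])
y_all = np.array([0] * 40 + [1] * 40)

# Shuffle, then use half for training and half for testing
order = rng.permutation(len(X_all))
train_idx, test_idx = order[:40], order[40:]

K = [1, 2, 3, 4, 5, 6, 7, 8]
my_accs = []
for k in K:
    y_hat = knn_predict(X_all[train_idx], y_all[train_idx], X_all[test_idx], k)
    my_accs.append(np.mean(y_hat == y_all[test_idx]))

best_k = K[int(np.argmax(my_accs))]
print(best_k, my_accs)
```

np.argmax returns the first maximum, so if several values of k reach the same accuracy, the smallest one is reported. In the notebook you would call your own KNN and evaluate functions with X_train, y_train, X_test, and y_test inside the loop.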
Plot the accuracy of the model for different values of k with
scikit-learn and compare it with the results of the from-scratch
implementation.
K = [1, 2, 3, 4, 5, 6, 7, 8]
sklearn_accs = []
#your code here
Can you justify the difference between the results of the two
implementations?