Machine Learning Algorithms (Classification)
Q4: Explain the K-Nearest Neighbors (K-NN) classification algorithm with the help
of a suitable example. Also, discuss its advantages and disadvantages.
A4: K-Nearest Neighbors (K-NN) Classification Algorithm:
K-Nearest Neighbors (K-NN) is a non-parametric, lazy learning algorithm used for
classification and regression. In K-NN classification, the output is a class
membership. An object is classified by a majority vote of its neighbors, with the
object being assigned to the class most common among its K nearest neighbors (K is
a positive integer, typically small).
How it works:
Training Phase: There is essentially no "training" phase in the traditional sense.
The algorithm simply stores all available training data points and their
corresponding class labels. This is why it's called a "lazy" algorithm – it defers
computation until classification time.
Prediction Phase (for a new, unseen data point):
Choose K: Select a positive integer K, which represents the number of nearest
neighbors to consider.
Calculate Distance: Calculate the distance (e.g., Euclidean distance, Manhattan
distance) between the new data point and all training data points.
Find K-Nearest Neighbors: Identify the K training data points that are closest to
the new data point based on the calculated distances.
Vote for Class: For classification, count the number of data points in each class
among these K neighbors.
Assign Class: Assign the new data point to the class that has the majority vote
among the K nearest neighbors. In case of a tie, various tie-breaking rules can be
used (e.g., choose randomly, choose the class of the closest neighbor).
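A minimal code sketch of these prediction steps in plain Python is shown below; the function name knn_classify and the training-data layout (a list of (features, label) pairs) are purely illustrative and not tied to any particular library.

import math
from collections import Counter

def knn_classify(new_point, training_data, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # 1. Calculate the Euclidean distance from the new point to every stored point.
    distances = [(math.dist(new_point, features), label)
                 for features, label in training_data]
    # 2. Sort by distance and keep the k closest neighbors.
    distances.sort(key=lambda pair: pair[0])
    k_nearest = distances[:k]
    # 3. Count the class votes among these k neighbors.
    votes = Counter(label for _, label in k_nearest)
    # 4. Return the majority class (a tie in votes falls to the class seen first).
    return votes.most_common(1)[0][0]

With the fruit table from the example below passed in as training_data and k = 3, this function reproduces the "Apple" result that is derived by hand in the worked example.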
Suitable Example:
Imagine you have a dataset of fruits, classified as either "Apple" or "Orange,"
based on their "Sweetness" and "Crunchiness" scores (on a scale of 1-10).
Sweetness | Crunchiness | Fruit Type
    7     |      8      | Apple
    6     |      7      | Apple
    3     |      2      | Orange
    4     |      3      | Orange
    8     |      6      | Apple
    2     |      4      | Orange
Now, a new fruit arrives with Sweetness = 5 and Crunchiness = 5. We want to
classify it using K-NN. Let's choose K = 3.
Steps:
Calculate Distance (Euclidean Distance):
Distance = √((x₂ − x₁)² + (y₂ − y₁)²)
New Fruit (5, 5) to Apple (7, 8): √((7−5)² + (8−5)²) = √(2² + 3²) = √(4 + 9) = √13 ≈ 3.61
New Fruit (5, 5) to Apple (6, 7): √((6−5)² + (7−5)²) = √(1² + 2²) = √(1 + 4) = √5 ≈ 2.24
New Fruit (5, 5) to Orange (3, 2): √((3−5)² + (2−5)²) = √((−2)² + (−3)²) = √(4 + 9) = √13 ≈ 3.61
New Fruit (5, 5) to Orange (4, 3): √((4−5)² + (3−5)²) = √((−1)² + (−2)²) = √(1 + 4) = √5 ≈ 2.24
New Fruit (5, 5) to Apple (8, 6): √((8−5)² + (6−5)²) = √(3² + 1²) = √(9 + 1) = √10 ≈ 3.16
New Fruit (5, 5) to Orange (2, 4): √((2−5)² + (4−5)²) = √((−3)² + (−1)²) = √(9 + 1) = √10 ≈ 3.16
Find 3-Nearest Neighbors (ordered by distance):
Apple (6, 7) - Distance ≈2.24
Orange (4, 3) - Distance ≈2.24
Apple (8, 6) - Distance ≈3.16 (Orange (2, 4) is also at distance ≈3.16; the third spot is a tie, so for this walkthrough we break it in favor of Apple (8, 6))
Vote for Class:
Among the 3 nearest neighbors: 2 are "Apple" and 1 is "Orange".
Assign Class:
The new fruit is classified as an Apple.
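The same walkthrough can be reproduced with a library implementation. The sketch below assumes scikit-learn is installed; it fits a KNeighborsClassifier on the six fruits from the table above and classifies the new fruit with K = 3.

from sklearn.neighbors import KNeighborsClassifier

# (Sweetness, Crunchiness) pairs and labels copied from the table above.
X_train = [[7, 8], [6, 7], [3, 2], [4, 3], [8, 6], [2, 4]]
y_train = ["Apple", "Apple", "Orange", "Orange", "Apple", "Orange"]

knn = KNeighborsClassifier(n_neighbors=3)  # K = 3; Euclidean distance is the default metric
knn.fit(X_train, y_train)

new_fruit = [[5, 5]]
distances, indices = knn.kneighbors(new_fruit)  # distances and indices of the 3 nearest neighbors
print(distances, indices)
print(knn.predict(new_fruit))  # "Apple" when the tied third neighbor resolves to (8, 6), as above

Note that because the third-nearest neighbor is a tie at √10, different implementations may break the tie differently; the hand calculation above chose Apple (8, 6).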
Advantages of K-NN:
Simple and Intuitive: Easy to understand and implement.
No Training Phase: As a lazy learner, there is no explicit training step; updating the model with new data is as simple as adding it to the stored dataset.
Non-parametric: Makes no assumptions about the underlying data distribution, which
can be useful for complex datasets.
Flexible: Can be used for both classification and regression.
Effective for Small Datasets: Can perform well on small datasets, and because each decision is made locally from nearby points it can capture irregular decision boundaries.
Disadvantages of K-NN:
Computationally Expensive at Prediction Time: For large datasets, calculating the
distance to every training data point for each new prediction can be very slow.
Sensitive to the Choice of K: The performance of K-NN is highly dependent on the
value of K. A small K can be noisy, while a large K can blur decision boundaries.
Sensitive to Feature Scaling: Features with larger ranges will have a disproportionate impact on distance calculations, so feature scaling (normalization/standardization) is crucial (see the sketch after this list).
Curse of Dimensionality: Performance degrades significantly with a high number of
features (dimensions) as distances become less meaningful in high-dimensional
spaces.
Storage Requirement: Needs to store the entire training dataset.
Imbalanced Data: Can be biased towards the majority class if the dataset is
imbalanced.
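To make the feature-scaling point concrete, here is a hedged sketch, again assuming scikit-learn is available; the small age/income dataset is invented purely for illustration. It contrasts K-NN on raw features, where the wide-range income column dominates the Euclidean distance, with K-NN after standardization.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Invented toy data: the two features live on very different scales (age, income).
X = [[25, 20_000], [30, 22_000], [45, 80_000], [50, 85_000]]
y = ["young", "young", "senior", "senior"]

# Without scaling, the income column dominates the distance calculation.
unscaled = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# With scaling, both features contribute comparably to the distance.
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)).fit(X, y)

print(unscaled.predict([[26, 79_000]]))  # ['senior']: income alone drives the result
print(scaled.predict([[26, 79_000]]))    # ['young']: age now carries comparable weight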