
After reading about regression analysis in the last chapter, let us learn about some crucial classification techniques in this chapter.

11.1 K-Nearest Neighbor algorithm
The k-nearest neighbor (KNN) algorithm is a supervised learning algorithm in which the output value of the data is known, but how to get the output is not known. For example, if we have several groups of labeled samples and the items within each group are homogeneous, then to find which group an item with an unknown label belongs to, we find the similarity between the item at hand and the items in each group. Finally, we conclude that the item belongs to the group to which it is most similar. The KNN algorithm works in exactly the same way.
The KNN algorithm stores all available cases and classifies new cases by a majority vote of its k neighbors. The algorithm thus assigns unlabeled data values to well-defined groups.
Technically speaking, KNN is a non-parametric supervised learning algorithm which classifies data into a particular category with the help of a training set. Here, the word non-parametric means that it makes no assumptions about the underlying data distribution. Non-parametric methods do not have a fixed number of parameters in the model; the number of parameters grows with the training dataset. Note that KNN uses all of the data for training while classifying a new data point or instance.
In the KNN algorithm, the value for a new instance (x) is predicted by searching the training set for the k most similar cases (neighbors) and summarizing the output values for those k cases. For classification, this summary is the mode (or most common) value.
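To make the idea concrete, here is a minimal sketch in Python of the "distance plus majority vote" procedure described above. The function name and the use of plain Euclidean distance on two numeric features are my illustrative assumptions, not code from this chapter:

    from collections import Counter
    from math import sqrt

    def knn_predict(training_data, new_point, k=3):
        # training_data: list of ((x, y), label) pairs; new_point: (x, y)
        # Step 1: compute the Euclidean distance from the new point to every stored case.
        distances = []
        for (x, y), label in training_data:
            d = sqrt((x - new_point[0]) ** 2 + (y - new_point[1]) ** 2)
            distances.append((d, label))
        # Step 2: sort by distance and keep the k nearest neighbours.
        distances.sort(key=lambda pair: pair[0])
        nearest_labels = [label for _, label in distances[:k]]
        # Step 3: the prediction is the mode (majority vote) of the k labels.
        return Counter(nearest_labels).most_common(1)[0][0]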

11.1.1 Choosing an appropriate k value

The most important thing to do in the KNN algorithm is to determine an appropriate value of k, that is, the number of nearest neighbors. A large value of k reduces the effect of noisy data, but if we have a larger number of data points in one group, then we may ignore the smaller patterns which may hold useful insights.

*Note: Please refer to the OLC QR code given on the TOC for coloured pictorial representation of graphs and charts of this chapter.
11.1.2 Example of KNN algorithm
Suppose we have n students rated on two parameters, Academic Score and EC Score, on a scale of 1 to 10, as given in Table I.

Table I. Student's Records

Name        Academic Score   EC Score   Grade
Ria              -               8      Outstanding
Khushank         -               1      Academically Sound
Mehar            4               -      Sporty
Nagma            2               1      Below Average

Here, Academic Score is an indication of how well the student performs academically and EC Score is the score obtained by the student in extra-curricular activities.

Euclidean distance (d) = √((x₂ − x₁)² + (y₂ − y₁)²)

Equation for Euclidean Distance
Now, if we have a new student to categorize, then we will calculate the distance between 'New Student' and its nearest neighbors ('Outstanding', 'Sporty', 'Academically Sound' and 'Below Average') using the Euclidean distance formula.
Assuming that the co-ordinates of the New Student are (8, 2), we calculate the distance between Below Average (2, 1) and New Student as:

Dist (Below Average, New Student) = √((8 − 2)² + (2 − 1)²) = √37 ≈ 6.08

Similarly, we calculate the distance of New Student from each of the other neighbors, as shown in Table II.
Table II. Updated Student's Records

Name        Academic Score   EC Score   Grade                Distance to New Student
Ria              -               8      Outstanding          6.08
Khushank         -               1      Academically Sound   1.41
Mehar            4               -      Sporty               7.21
Nagma            2               1      Below Average        6.08
We see that the distance between New Student and Academically Sound is the least, so the New Student belongs to the group of academically sound students.
The KNN algorithm can also be used for prediction. It is extensively used in the pharmaceutical industry to detect the growth of oncogenic (cancer) cells or the presence of a disease. We will read more about it in Chapter 12, where the implementation of the algorithm is discussed in detail.
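This classification can be reproduced with the knn_predict sketch given earlier. Some score pairs are not fully legible in Table I, so the coordinates below are assumptions chosen only to be consistent with the distances in Table II; Nagma's (2, 1) and the New Student's (8, 2) are the values given in the text:

    students = [
        ((9, 8), 'Outstanding'),         # Ria (assumed scores)
        ((9, 1), 'Academically Sound'),  # Khushank (assumed scores)
        ((4, 8), 'Sporty'),              # Mehar (assumed scores)
        ((2, 1), 'Below Average'),       # Nagma (given in the text)
    ]
    # The nearest neighbour is Khushank at distance 1.41, so the
    # New Student is classified as 'Academically Sound'.
    print(knn_predict(students, (8, 2), k=1))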

11.1.3 Choosing the k value

Consider the given Figure 11.1. If we have two classes, class A denoted by red stars and class B indicated by blue triangles, then to classify a new green square object, the value of k plays an important role. If k = 3, then according to our figure the 3 nearest neighbors belong to class B, so the new object is placed in class B. But if we take k = 7, then the new object will be a member of class A. If we take k = 1, then the new object will belong to class B. Hence, the value of k greatly influences the classification.

Figure 11.1 Choosing the k value

The k-nearest neighbor algorithm can also be used for regression when the dependent variable is continuous. In this case, the predicted value is the average of the values of its k nearest neighbors.
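For regression, only the final summarizing step changes: instead of a majority vote, we average the neighbours' values. A minimal sketch under the same assumptions as before (two numeric features, Euclidean distance, sqrt imported as in the earlier sketch):

    def knn_regress(training_data, new_point, k=3):
        # training_data: list of ((x, y), value) pairs, where value is continuous.
        distances = sorted(
            (sqrt((x - new_point[0]) ** 2 + (y - new_point[1]) ** 2), value)
            for (x, y), value in training_data
        )
        # The predicted value is the average of the k nearest neighbours' values.
        nearest = [value for _, value in distances[:k]]
        return sum(nearest) / len(nearest)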

11.1.4 Pros and Cons of KNN algorithm

Pros:
> The KNN algorithm is a very simple and effective algorithm.
> Users can easily implement the algorithm, which has made it quite popular with data professionals.
> The algorithm is highly unbiased in nature and does not make any prior assumption about the underlying data. (KNN is a non-parametric learning algorithm, which means that it doesn't assume anything about the underlying data.)
> The KNN algorithm works well with multi-class problems.
> The algorithm can be applied to both classification and regression.
Cons:
> Usually, simple things are not very powerful, and the same is the case with the KNN algorithm.
> Although the algorithm trains the model really fast, the prediction time is high.
> Many a time, useful insights may get ignored.
> The algorithm is sensitive to the scale of the data, so the data must first be standardized.
> The performance of the algorithm deteriorates when there are multiple independent variables.
> The algorithm requires more memory and is computationally expensive.
> It does not work well if the target variable is skewed.
> The accuracy of the algorithm depends on the value of k. For any given problem, a small value of k will lead to a large variance in predictions, and setting k to a large value may lead to a large model bias.
Handling categorical variables in KNN
If you have categorical variables, then create dummy variables, one per level, out of these categorical variables. For example, if a categorical variable named "Designation" has 3 unique levels/categories, then create 3 dummy variables. Each dummy variable has 1 against its designation and 0 otherwise.
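In Python, the pandas library's get_dummies function performs exactly this transformation. The 'Designation' column and its three levels below are hypothetical values, following the example in the text:

    import pandas as pd

    df = pd.DataFrame({'Designation': ['Manager', 'Analyst', 'Clerk', 'Analyst']})
    # One dummy column per level: 1 against the matching designation, 0 otherwise.
    dummies = pd.get_dummies(df, columns=['Designation'], dtype=int)
    print(dummies)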
7.5.1 k-Nearest Neighbour (kNN)

The kNN algorithm is a simple but extremely powerful classification algorithm. The name of the algorithm originates from the underlying philosophy of kNN, i.e. people having a similar background or mindset tend to stay close to each other. In other words, neighbours in a locality have a similar background. In the same way, as a part of the kNN algorithm, the unknown and unlabelled data which comes for a prediction problem is judged on the basis of the training data set elements which are similar to the unknown element. So, the class label of the unknown element is assigned on the basis of the class labels of the similar training data set elements (metaphorically, these can be considered as neighbours of the unknown element). Let us try to understand the algorithm with a simple data set.
7.5.1.1 How kNN Works

Let us consider a very simple Student data set as depicted in Figure 7.4. It consists of 15 students studying in a class. Each of the students has been assigned a score on a scale of 10 on two performance parameters, 'Aptitude' and 'Communication'. Also, a class value is assigned to each student based on the following criteria:
1. Students having good communication skills as well as a good level of aptitude have been classified as 'Leader'
2. Students having good communication skills but a not so good level of aptitude have been classified as 'Speaker'
3. Students having not so good communication skills but a good level of aptitude have been classified as 'Intel'

Name        Aptitude   Communication   Class
Karuna         2           5           Speaker
Bhuvna         2           6           Speaker
Gaurav         7           6           Leader
Parul          7           2.5         Intel
Dinesh         8           6           Leader
Jani           4           7           Speaker
Bobby          5           3           Intel
Parimal        3           5.5         Speaker
Govind         8           3           Intel
Susant         6           5.5         Leader
Gouri          6           4           Intel
Bharat         6           7           Leader
Ravi           6           2           Intel
Pradeep        9           7           Leader
Josh           5           4.5         Intel

FIG. 7.4 Student data set
As we have already seen in Chapter 3, while building a classification model, a part of the labelled input data is retained as test data. The remaining portion of the input data is used to train the model, hence known as training data. The motivation to retain a part of the data as test data is to evaluate the performance of the model. As we have seen, the performance of the classification model is measured by the number of correct classifications made by the model when applied to an unknown data set. However, it is not possible during model testing to know the actual label value of an unknown data. Therefore, the test data, which is a part of the labelled input data, is used for this purpose. If the class value predicted for most of the test data elements matches with the actual class value that they have, then we say that the classification model possesses a good accuracy. In the context of the Student data set, to keep things simple, we assume one data element of the input data set as the test data. As depicted in Figure 7.5, the record of the student named Josh is assumed to be the test data. Now that we have the training data and test data identified, we can start with the modelling.
As we have already discussed, in the kNN algorithm, the class label of the test data elements is decided by the class label of the training data elements which are neighbouring, i.e. similar in nature. But there are two challenges:
1. What is the basis of this similarity, or when can we say that two data elements are similar?
2. How many similar elements should be considered for deciding the class label of each test data element?

To answer the first question, though there are many measures of similarity, the most common approach adopted by kNN to measure similarity between two data elements is the Euclidean distance.

Name        Aptitude   Communication   Class
Training Data:
Karuna         2           5           Speaker
Bhuvna         2           6           Speaker
Gaurav         7           6           Leader
Parul          7           2.5         Intel
Dinesh         8           6           Leader
Jani           4           7           Speaker
Bobby          5           3           Intel
Parimal        3           5.5         Speaker
Govind         8           3           Intel
Susant         6           5.5         Leader
Gouri          6           4           Intel
Bharat         6           7           Leader
Ravi           6           2           Intel
Pradeep        9           7           Leader
Test Data:
Josh           5           4.5         Intel

FIG. 7.5 Segregated student data set
Considering a very simple data set having two features (say f₁ and f₂), the Euclidean distance between two data elements d₁ and d₂ can be calculated as

    distance = √((f₁₁ − f₂₁)² + (f₁₂ − f₂₂)²)

where
    f₁₁ = value of feature f₁ for data element d₁
    f₁₂ = value of feature f₂ for data element d₁
    f₂₁ = value of feature f₁ for data element d₂
    f₂₂ = value of feature f₂ for data element d₂

The training data points of the Student data set can be represented as dots in a two-dimensional feature space, considering the two features 'Aptitude' and 'Communication'. The value of the 'Name' feature is ignored because it is different for every student and plays no role in the classification. As depicted in Figure 7.6, the test data element, i.e. the record of the student Josh, is represented as an asterisk in the same feature space. Deciding the class of the test data element then boils down to finding out which training data points, i.e. which dots, lie closest to the asterisk, because the nearest neighbours help in assigning a class value to it.

FIG. 7.6 Two-dimensional representation of the Student data set

The answer to the second question, i.e. how many similar data elements need to be considered, lies in the value of 'k', which is a
parameter given as an input to the algorithm. In the kNN algorithm, the value of k indicates the number of neighbours that need to be considered. For example, if the value of k is 3, only the three nearest neighbours, i.e. the three training data elements closest to the test data element, are considered. Out of the three data elements, the class which is predominant is considered as the class label to be assigned to the test data.
In case the value of k is 1, only the closest training data element is considered. The class label of that data element is directly assigned to the test data element. This is depicted in Figure 7.7.

Name        Aptitude   Communication   Class     Distance   k=1     k=2     k=3
Karuna         2           5           Speaker    3.041
Bhuvna         2           6           Speaker    3.354
Parimal        3           5.5         Speaker    2.236
Jani           4           7           Speaker    2.693
Bobby          5           3           Intel      1.500                      1.500
Ravi           6           2           Intel      2.693
Gouri          6           4           Intel      1.118      1.118   1.118   1.118
Parul          7           2.5         Intel      2.828
Govind         8           3           Intel      3.354
Susant         6           5.5         Leader     1.414              1.414   1.414
Bharat         6           7           Leader     2.693
Gaurav         7           6           Leader     2.500
Dinesh         8           6           Leader     3.354
Pradeep        9           7           Leader     4.717
Josh           5           4.5         (test)

FIG. 7.7 Distance calculation between test and training points
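The distance column of Figure 7.7 can be reproduced directly from the Euclidean distance formula above. A short Python sketch (the variable names are mine; the scores are the ones from Figure 7.5):

    from math import sqrt

    training = {
        'Karuna': (2, 5), 'Bhuvna': (2, 6), 'Gaurav': (7, 6), 'Parul': (7, 2.5),
        'Dinesh': (8, 6), 'Jani': (4, 7), 'Bobby': (5, 3), 'Parimal': (3, 5.5),
        'Govind': (8, 3), 'Susant': (6, 5.5), 'Gouri': (6, 4), 'Bharat': (6, 7),
        'Ravi': (6, 2), 'Pradeep': (9, 7),
    }
    josh = (5, 4.5)  # the test data point

    # Euclidean distance of Josh from every training record, nearest first.
    dists = sorted(
        (sqrt((apt - josh[0]) ** 2 + (comm - josh[1]) ** 2), name)
        for name, (apt, comm) in training.items()
    )
    for d, name in dists:
        print(f'{name}: {d:.3f}')  # Gouri: 1.118, Susant: 1.414, Bobby: 1.500, ...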

Let us now try to find out the outcome of the algorithm for the Student data set we have. In other words, we want to see what class value kNN will assign for the test data for student Josh. Again, let us refer back to Figure 7.7. As is evident, when the value of k is taken as 1, only one training data point needs to be considered. The training record for student Gouri comes out as the closest one to the test record of Josh, with a distance value of 1.118. Gouri has the class value 'Intel', so the test data point is also assigned the class label 'Intel'. When the value of k is assumed as 3, the closest neighbours of Josh in the training data set are Gouri, Susant, and Bobby, with distances of 1.118, 1.414, and 1.5, respectively. Gouri and Bobby have the class value 'Intel', while Susant has the class value 'Leader'. In this case, the class value of Josh is decided by majority voting. Because the class value 'Intel' is held by the majority of the neighbours, the class value of Josh is assigned as 'Intel'. The same process can be extended for any value of k.
But it is often a tricky decision to decide the value of k. The reasons are as follows:
• If the value of k is very large (in the extreme case, equal to the total number of records in the training data), the class label of the majority class of the training data set will be assigned to the test data regardless of the class labels of the neighbours nearest to the test data.
• If the value of k is very small (in the extreme case, equal to 1), the class value of a noisy data point or outlier in the training data set, which happens to be the nearest neighbour to the test data, will be assigned to the test data.
The best k value is somewhere between these two extremes.
A few strategies, highlighted below, are adopted by machine learning practitioners to arrive at a value for k.
• One common practice is to set k equal to the square root of the number of training records.
• An alternative approach is to test several k values on a variety of test data sets and choose the one that delivers the best performance.
• Another interesting approach is to choose a larger value of k, but apply a weighted voting process in which the vote of close neighbours is considered more influential than the vote of distant neighbours (a sketch of this is given after this list).
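The third strategy, distance-weighted voting, can be sketched as follows. The inverse-distance weight is one common choice, assumed here for illustration (reusing the sqrt import from the sketch above):

    def weighted_knn_predict(training_data, new_point, k=7):
        # training_data: list of ((x, y), label) pairs.
        distances = sorted(
            (sqrt((x - new_point[0]) ** 2 + (y - new_point[1]) ** 2), label)
            for (x, y), label in training_data
        )
        # Each of the k nearest neighbours votes with weight 1/distance, so
        # close neighbours count for more than distant ones. The small constant
        # guards against division by zero when a training point coincides
        # exactly with the test point.
        votes = {}
        for d, label in distances[:k]:
            votes[label] = votes.get(label, 0.0) + 1.0 / (d + 1e-9)
        return max(votes, key=votes.get)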

7.5.1.2 kNN algorithm


Input: Training data set, test data set (or data points), value of 'k' (i.e. number of nearest neighbours to be considered)
Steps:
Do for all test data points
    Calculate the distance (usually Euclidean distance) of the test data point from the different training data points.
    Find the closest 'k' training data points, i.e. training data points whose distances are least from the test data point.
    If k = 1
        Then assign the class label of the training data point to the test data point
    Else
        Whichever class label is predominantly present in the training data points, assign that class label to the test data point
End do
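A direct Python rendering of these steps, as a sketch under the assumption of numeric features and Euclidean distance (the function and variable names are mine, not the book's):

    from math import sqrt

    def knn(training_set, test_points, k):
        predictions = []
        for test_point in test_points:  # Do for all test data points
            # Calculate the Euclidean distance of the test data point
            # from the different training data points.
            distances = sorted(
                (sqrt(sum((a - b) ** 2 for a, b in zip(features, test_point))), label)
                for features, label in training_set
            )
            # Find the closest 'k' training data points.
            neighbours = [label for _, label in distances[:k]]
            if k == 1:
                # Assign the class label of the single closest training data point.
                predictions.append(neighbours[0])
            else:
                # Assign the predominantly present class label.
                predictions.append(max(set(neighbours), key=neighbours.count))
        return predictions

With the fourteen (features, label) training records from Figure 7.5, knn(training, [(5, 4.5)], k=3) returns ['Intel'], matching the result derived above for Josh.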
7.5.1.3 Why the kNN algorithm is called a lazy learner
We have already discussed in Chapter 3 that eager learners follow the general steps of machine learning, i.e. they perform an abstraction of the information obtained from the input data and then follow it through by a generalization step. However, as we have seen in the case of the kNN algorithm, these steps are completely skipped. It stores the training data and directly applies the philosophy of nearest neighbourhood finding to arrive at the classification. So, for kNN, there is no learning happening in the real sense. Therefore, kNN falls under the category of lazy learner.

7.5.1.4 Strengths of the kNN algorithm

• Extremely simple algorithm, easy to understand
• Very effective in certain situations, e.g. for recommender system design
• Very fast, with almost no time required for the training phase

7.5.1.5 Weaknesses of the kNN algorithm

• Does not learn anything in the real sense. Classification is done completely on the basis of the training data, so it has a heavy reliance on the training data. If the training data does not represent the problem domain comprehensively, the algorithm fails to make an effective classification.
• Because there is no model trained in the real sense and the classification is done completely on the basis of the training data, the classification process is very slow.
• Also, a large amount of computational space is required to load the training data for classification.

7.5.1.6 Application of the kNN algorithm

One of the most popular areas in machine learning where the kNN algorithm is widely adopted is recommender systems. As we know, recommender systems recommend users different items which are similar to a particular item that the user seems to like. The liking pattern may be revealed from past purchases or browsing history, and the similar items are identified using the kNN algorithm.
Another area where there is widespread adoption of kNN is searching for documents/contents similar to a given document/content. This is a core area under information retrieval and is known as concept search.
