K-Nearest Neighbour (KNN) Algorithm
The k-nearest neighbour (KNN) algorithm is a supervised learning algorithm in which the output value of the data is known, but how to get the output is not known. For example, suppose we have several groups of labelled samples, and the items within each group are homogeneous. Then, to find which group an item with an unknown label belongs to, we find the similarity between the item at hand and the items in each group. Finally, we conclude that the item belongs to the group to which it is most similar. The KNN algorithm works in exactly the same way.
The KNN algorithm stores all available cases and classifies new cases by a majority vote of its k nearest neighbours. The algorithm thus segregates unlabelled data values into well-defined groups.
Technically speaking, KNN is a non-parametric supervised learning algorithm which classifies data into a particular category with the help of the training set. Here, the word non-parametric means that it makes no assumptions about the underlying data distribution. Non-parametric methods do not have a fixed number of parameters in the model; the number of parameters grows with the training dataset. Note that KNN uses all of the data for training while classifying a new data point or instance.
In the KNN algorithm, the value for a new instance (x) is predicted by searching the training set for the k most similar cases (neighbours) and summarizing the output values for those k cases. For classification, this summary is the mode (or most common) value.
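To make this concrete, here is a minimal sketch of the majority-vote step in Python. The sample points, labels and the function name are illustrative, not taken from the text:

```python
from collections import Counter
import math

def knn_classify(training, new_point, k=3):
    """Classify new_point by a majority vote of its k nearest neighbours.

    training: list of ((x, y), label) tuples; distance is Euclidean.
    """
    # Sort the stored cases by distance to the new point
    by_distance = sorted(training, key=lambda item: math.dist(item[0], new_point))
    # Labels of the k closest cases
    k_labels = [label for _, label in by_distance[:k]]
    # The predicted class is the mode (most common) of those labels
    return Counter(k_labels).most_common(1)[0][0]

# Illustrative labelled samples forming two well-separated groups
samples = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
           ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify(samples, (2, 2), k=3))  # the three nearest cases are all "A"
```

Note that the algorithm does no work at training time beyond storing the cases; all the computation happens when a new point is classified.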
Here, Academic Score is an indication of how well the student performs academically, and EC Score is the score obtained by the student in extra-curricular activities.
Similarly, calculate the distance of New Student from each of the existing students, as shown in Table II.
Table II. Updated Student's Records (columns: Name, Academic Score, EC Score, Grade, Distance to New Student; rows for Ria, Khushank, Mehar, Nagma and New Student, with grades Outstanding, Academically Sound, Sporty and Below Average)
We see that the distance between New Student and the Academically Sound student is the least, so New Student belongs to the group of academically sound students.
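The distance calculation behind Table II can be sketched as follows. The (Academic Score, EC Score) values below are hypothetical, chosen only for illustration; the grades follow the text:

```python
import math

# Hypothetical (Academic Score, EC Score) records; only the grades come
# from the text, the numeric scores are made up for illustration.
students = {
    "Ria":      ((9, 2), "Outstanding"),
    "Khushank": ((8, 4), "Academically Sound"),
    "Mehar":    ((3, 9), "Sporty"),
    "Nagma":    ((2, 2), "Below Average"),
}
new_student = (7, 5)

# Euclidean distance from New Student to every labelled student
distances = {name: math.dist(scores, new_student)
             for name, (scores, _grade) in students.items()}

# The group of the closest student is the group of New Student
nearest = min(distances, key=distances.get)
print(nearest, students[nearest][1])
```

With these stand-in scores, the Academically Sound student is the closest, matching the conclusion drawn from Table II.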
The KNN algorithm can also be used for prediction. It is extensively used in the pharmaceutical industry to detect the growth of oncogenic (cancer) cells or the presence of a disease. We will read more about it in Chapter 12, where the implementation of the algorithm is discussed in detail.
Let us consider a very simple Student data set as depicted in Figure 7.4. It consists of 15 students studying in a class. Each of the students has been assigned a score on a scale of 10 on two performance parameters, 'Aptitude' and 'Communication'. Also, a class value is assigned to each student based on the following criteria:
1. Students having good communication skills as well as a good level of aptitude have been classified as 'Leader'
2. Students having good communication skills but not so good level of aptitude have been classified as 'Speaker'
3. Students having not so good communication skill but a good level of aptitude have been classified as 'Intel'
FIG. 7.4 Student data set

Chapter 7 Supervised Learning: Classification
As we have already seen in Chapter 3, while building a classification model, part of the labelled input data is used to train the model - hence known as training data. The remaining portion of the input data is retained as test data. The motivation to retain a part of the data as test data is to evaluate the performance of the model. As we have seen, the performance of the classification model is measured by the number of correct classifications made by the model when applied to an unknown data set. However, it is not possible during model testing to know the actual label value of an unknown data. Therefore, the test data, which is a part of the labelled input data, is used for this purpose. If the class value predicted for most of the test data elements matches with the actual class value that they have, then we say that the classification model possesses a good accuracy.
In the context of the Student data set, to keep things simple, we assume one data element of the input data set as the test data. As depicted in Figure 7.5, the record of the student named Josh is assumed to be the test data. Now that we have the training data and test data identified, we can start with the modelling.
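The hold-out evaluation described above can be sketched in a few lines of Python. The classifier here is a stand-in 1-nearest-neighbour rule, and the points and labels are made up for illustration:

```python
import math

# Labelled input data: a part is kept aside as test data, the rest trains
# the model. The (Aptitude, Communication)-style points below are invented.
labelled = [((2, 5), "Speaker"), ((8, 6), "Leader"), ((6, 2), "Intel"),
            ((3, 6), "Speaker"), ((9, 7), "Leader"), ((7, 2), "Intel")]
training, test = labelled[:4], labelled[4:]   # simple split, no shuffling

def predict(point):
    # 1-NN: return the label of the single closest training element
    return min(training, key=lambda item: math.dist(item[0], point))[1]

# Accuracy = fraction of test elements whose predicted class matches
# their actual class
correct = sum(predict(p) == label for p, label in test)
accuracy = correct / len(test)
print(f"accuracy = {accuracy:.2f}")
```

Because the test elements carry their true labels, the predicted and actual class values can be compared, which is exactly why a part of the labelled data is held back.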
As we have already discussed, in the kNN algorithm, the class label of the test data elements is decided by the class label of the training data elements which are neighbouring, i.e. similar in nature. But there are two challenges:
1. What is the basis of this similarity or when can we say that two data elements are similar?
2. How many similar elements should be considered for deciding the class label of each test data element?
To answer the first question, though there are many measures of similarity, the most common approach adopted by kNN to measure similarity between two data elements is Euclidean distance. Considering a very simple data set having two features (say f1 and f2), the Euclidean distance between two data elements d1 and d2 can be measured by

distance = sqrt((f1_d1 - f1_d2)^2 + (f2_d1 - f2_d2)^2)

where
f1_d1 = value of feature f1 for data element d1
f1_d2 = value of feature f1 for data element d2
f2_d1 = value of feature f2 for data element d1
f2_d2 = value of feature f2 for data element d2
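This two-feature distance computation can be sketched directly in Python. The student names and (f1, f2) scores below are assumptions for illustration, not the book's figures:

```python
import math

def euclidean(d1, d2):
    """Distance between elements d1 = (f1, f2) and d2 = (f1, f2),
    following sqrt((f1_d1 - f1_d2)**2 + (f2_d1 - f2_d2)**2)."""
    return math.sqrt((d1[0] - d2[0]) ** 2 + (d1[1] - d2[1]) ** 2)

# Hypothetical (Aptitude, Communication) points for two students
josh = (5, 4.5)
susant = (6, 5.5)
print(round(euclidean(josh, susant), 2))
```

The same formula generalizes to more than two features by summing one squared difference per feature before taking the square root.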
In the Student data set, the two features 'Aptitude' and 'Communication' are considered. The feature 'Name' is ignored because it plays no role in deciding the class, and the class value is the very thing we are trying to find for the test data element. So, each data element can be represented as a point in a two-dimensional feature space, with 'Aptitude' along one dimension and 'Communication' along the other. As shown in Figure 7.6, the training data points are represented as dots, while the test data point is represented as an asterisk. Now, to decide the class of the test data element (the student Josh), we need to find its nearest neighbours, i.e. the training data points lying closest to it in the feature space.

Figure 7.4 lists the 15 students (Pradeep, Bharat, Parimal, Govind, Bhuvna, Dinesh, Gaurav, Karuna, Ravi, Gouri, Susant, Bobby, Jani, Parul and Josh) along with their 'Aptitude' and 'Communication' scores and their class values 'Leader', 'Speaker' or 'Intel'.

To answer the second question, i.e. how many similar elements should be considered for doing the classification, comes the role of the value of 'k', which is a
parameter given as an input to the algorithm. In the kNN algorithm, the value of k indicates the number of neighbours that need to be considered. For example, if the value of k is 3, only the three nearest neighbours, i.e. the three training data elements closest to the test data element, are considered. Out of the three data elements, the class which is predominant is considered as the class label to be assigned to the test data.
In case the value of k is 1, only the closest training data element is considered. The class label of that data element is directly assigned to the test data element. This is depicted in Figure 7.7.
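The effect of k can be sketched as follows. The points are illustrative and deliberately chosen so that k = 1 and k = 3 give different answers:

```python
from collections import Counter
import math

def knn_label(training, point, k):
    # The k training elements nearest to the test point, by Euclidean distance
    nearest = sorted(training, key=lambda t: math.dist(t[0], point))[:k]
    # The predominant class among those k neighbours
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Illustrative points: the single closest element belongs to class "B",
# but two of the three closest belong to class "A".
training = [((1.0, 1.0), "B"), ((1.5, 1.5), "A"), ((2.0, 1.0), "A"),
            ((9.0, 9.0), "B")]
point = (1.1, 1.1)
print(knn_label(training, point, k=1))  # class of the single closest element
print(knn_label(training, point, k=3))  # majority class of three neighbours
```

With k = 1 the test point simply inherits the label of its closest neighbour, while with k = 3 the predominant class among the three neighbours wins, which is why the choice of k can change the classification.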