Chapter 13
Clustering Algorithms
“Wherever you see a successful business, someone once
made a courageous decision.”
— Peter Drucker
Cluster analysis is a technique of partitioning a collection of unlabelled objects, with
many attributes, into meaningful disjoint groups or clusters. This chapter aims to
provide the basic concepts of clustering algorithms.
Learning Objectives
• Introduce the concepts of clustering
• Highlight the role of distance measures in the clustering process
• Provide a taxonomy of clustering algorithms
• Explain hierarchical clustering algorithms
• Explain partitional clustering algorithms
• Briefly explain density-based, grid-based, and probabilistic model-based clustering techniques
• Discuss the validation techniques for clustering algorithms
13.1 INTRODUCTION TO CLUSTERING APPROACHES
Cluster analysis is the fundamental task of unsupervised learning. Unsupervised learning
involves exploring the given dataset. Cluster analysis is a technique of partitioning a collection
of unlabelled objects that have many attributes into meaningful disjoint groups or clusters. This
is done using a trial and error approach, as there are no supervisors available as in classification.
The characteristic of clustering is that the objects in the clusters or groups are similar to each
other within the clusters, while differing significantly from the objects in other clusters.
The input for cluster analysis is examples or samples. These are known as objects, data
points or data instances. All these terms are the same and are used interchangeably in this chapter. All
the samples or objects with no labels associated with them are called unlabelled. The output is
the set of clusters (or groups) of similar data, if it exists in the input. For example,
Figure 13.1(a) shows data points or samples with two features shown in different shades,
and Figure 13.1(b) shows the manually drawn ellipse that indicates the clusters formed.
Figure 13.1: (a) Data Samples (b) Clusters’ Description
Visual identification of clusters in this case is easy, as the examples have only two features.
But when examples have more features, say 100, then clustering cannot be done manually and
automatic clustering algorithms are required. Also, automating the clustering process is desirable,
as such tasks are considered difficult and almost impossible for humans. All clusters are
represented by centroids. For example, if the input examples or data points are (3, 3), (2, 6) and (7, 9), then
the centroid is given as $\left(\frac{3+2+7}{3}, \frac{3+6+9}{3}\right) = (4, 6)$. The clusters should not overlap and every
cluster should represent only one class. Therefore, clustering algorithms use a trial and error method
to form clusters that can be converted to labels. The important differences between classification
and clustering are given in Table 13.1.
Table 13.1: Differences between Classification and Clustering

Clustering | Classification
Unsupervised learning; cluster formation is done by trial and error as there is no supervisor | Supervised learning with the presence of a supervisor to provide training and testing data
Unlabelled data | Labelled data
No prior knowledge in clustering | Knowledge of the domain is a must to label the samples of the dataset
Cluster results are dynamic | Once a label is assigned, it does not change
Applications of Clustering
1. Grouping based on customer buying patterns
2. Profiling of customers based on lifestyle
3. Information retrieval applications (like retrieval of a document from a collection of documents)
4. Identifying the groups of genes that influence a disease
5. Identification of organs that are similar in physiology functions
6. Taxonomy of animals and plants in Biology
7. Clustering based on purchasing behaviour and demography
8. Document indexing
9. Data compression by grouping similar objects and finding duplicate objects
Challenges of Clustering Algorithms
A huge collection of data with higher dimensions (i.e., features or attributes) can pose a
problem for clustering algorithms. With the arrival of the Internet, billions of data items are available
for clustering algorithms. This is a difficult task, as scaling is always an issue with clustering
algorithms. Scaling is an issue where some algorithms work with lower dimension data
but do not perform well for higher dimension data. Also, the units of data can pose a problem;
for example, the same weight expressed in kilograms and in pounds can pose a problem in clustering.
Designing a proximity measure is also a big challenge.
The advantages and disadvantages of the cluster analysis algorithms are given in Table 13.2.
Table 13.2: Advantages and Disadvantages of Clustering Algorithms

S. No. | Advantages | Disadvantages
1. | Cluster analysis algorithms can handle missing data and outliers. | Cluster analysis algorithms are sensitive to initialization and the order of the input data.
2. | Can help classifiers in labelling the unlabelled data. Semi-supervised algorithms use cluster analysis algorithms to label the unlabelled data and then use classifiers to classify them. | Often, the number of clusters present in the data has to be specified by the user.
3. | It is easy to explain cluster analysis algorithms and to implement them. | Scaling is a problem.
4. | Clustering is the oldest technique in statistics and it is easy to explain. It is also relatively easy to implement. | Designing a proximity measure for the given data is an issue.
13.2 PROXIMITY MEASURES
Clustering algorithms need a measure to find the similarity or dissimilarity among the objects
to group them. Similarity and dissimilarity are collectively known as proximity measures.
Often, distance measures are used to find the similarity between two objects, say i and j.
Distance measures are known as dissimilarity measures, as these indicate how one object
is different from another. Measures like cosine similarity indicate the similarity among objects.
Distance measures and similarity measures are two sides of the same coin: a smaller distance
indicates more similarity and vice versa. The distance between two objects, say i and j, is denoted by
the symbol $D_{ij}$.
The properties of the distance measures are:
1. $D_{ij}$ is always positive or zero.
2. $D_{ii} = 0$, i.e., the distance between an object and itself is 0.
3. $D_{ij} = D_{ji}$. This property is called symmetry.
4. $D_{ij} \le D_{ik} + D_{kj}$. This property is called the triangle inequality.
The attributes of the objects can be qualitative or quantitative. Qualitative ordinal attributes have an inherent order, for example, high > medium > low and low < medium. Quantitative variables are real or integer numbers or binary data. In binary data, the attributes of the object can take a Boolean value. Objects whose attributes take binary data are called binary objects.
Let us review some of the proximity measures.
Quantitative Variables
Some of the proximity measures for quantitative variables are discussed below.
Euclidean Distance  It is one of the most important and common distance measures. It is also
called the $L_2$ norm. It can be defined as the square root of the squared differences between the
coordinates of a pair of objects.
The Euclidean distance between objects $x_i$ and $x_j$ with k features is given as follows:

Distance$(x_i, x_j) = \sqrt{\sum_{m=1}^{k} (x_{im} - x_{jm})^2}$   (13.1)
The advantage of Euclidean distance is that the distance does not change with the addition of
new objects. But the disadvantage is that if the units change, the resulting Euclidean or squared
Euclidean distance changes drastically. Another disadvantage is that, as the Euclidean distance involves
a square root and a square, the computational cost is high when millions or billions of distance
computations are involved.
City Block Distance  City block distance is also known as Manhattan distance. It is also known as
boxcar distance, absolute value distance, taxicab distance or the $L_1$ norm. The formula for finding
the distance is given as follows:

Distance$(x_i, x_j) = \sum_{m=1}^{k} |x_{im} - x_{jm}|$   (13.2)
Chebyshev Distance  Chebyshev distance is known as maximum value distance. It is the maximum
absolute magnitude of the differences between the coordinates of a pair of objects. This distance is also
called the supremum distance, $L_{max}$ or $L_\infty$ norm. The formula for computing the Chebyshev distance is
given as follows:

Distance$(x_i, x_j) = \max_{m} |x_{im} - x_{jm}|$   (13.3)
Example: Suppose the coordinates of the objects are (0, 3) and (5, 8). What are the Euclidean, Manhattan and Chebyshev distances?
Solution: The Euclidean distance using Eq. (13.1) is given as follows:
Distance$(x_1, x_2) = \sqrt{(0-5)^2 + (3-8)^2} = \sqrt{50} = 7.07$
The Manhattan distance using Eq. (13.2) is given as follows:
Distance$(x_1, x_2) = |0-5| + |3-8| = 10$
The Chebyshev distance using Eq. (13.3) is given as follows:
Max{|0 − 5|, |3 − 8|} = Max{5, 5} = 5
Minkowski Distance  In general, all the above distance measures can be generalized as:

Distance$(x_i, x_j) = \left(\sum_{m=1}^{k} |x_{im} - x_{jm}|^{r}\right)^{1/r}$   (13.4)

This is called the Minkowski distance. Here, r is a parameter. When the value of r is 1, the distance
measure is called city block distance. When the value of r is 2, the distance measure is called
Euclidean distance. When r is $\infty$, this is the Chebyshev distance.
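To make these formulas concrete, the short NumPy sketch below evaluates Eqs. (13.1)–(13.4) on the pair of points (0, 3) and (5, 8) from the example above; the function names are illustrative and not from any particular library.

```python
# A minimal sketch of the distance measures of Eqs. (13.1)-(13.4) using NumPy.
import numpy as np

def minkowski(x, y, r):
    # Eq. (13.4); r = 1 gives city block distance, r = 2 gives Euclidean distance
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

def chebyshev(x, y):
    # Eq. (13.3); the limit of the Minkowski distance as r tends to infinity
    return np.max(np.abs(x - y))

x, y = np.array([0, 3]), np.array([5, 8])
print(minkowski(x, y, 2))   # Euclidean: sqrt(50) ~ 7.07
print(minkowski(x, y, 1))   # City block (Manhattan): 10
print(chebyshev(x, y))      # Chebyshev: 5
```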
Binary Attributes
Binary attributes have only two values. The distance measures discussed above cannot be applied to
find the distance between objects that have binary attributes. For finding the distance among objects
with binary attributes, the contingency Table 13.3 can be used. Let x and y be objects consisting
of N binary attributes. Then, the contingency table can be constructed by counting the number of
matches of the combinations 0-0, 0-1, 1-0 and 1-1.

Table 13.3: Contingency Table

      | y = 0 | y = 1
x = 0 |   a   |   b
x = 1 |   c   |   d

In other words, 'a' is the number of attributes where the x attribute is 0 and the y attribute is 0,
'b' is the number of attributes where the x attribute is 0 and the y attribute is 1, 'c' is the number of
attributes where the x attribute is 1 and the y attribute is 0, and 'd' is the number of attributes where
the x attribute is 1 and the y attribute is 1.
Simple Matching Coefficient (SMC)  SMC is a simple distance measure and is defined as the
ratio of the number of matching attributes and the total number of attributes. The formula is given as:

SMC $= \dfrac{a + d}{a + b + c + d}$   (13.5)

Jaccard Coefficient  The Jaccard coefficient ignores the 0-0 matches and is given as:

$J = \dfrac{d}{b + c + d}$   (13.6)

The values of a, b, c, and d can be observed from Table 13.3.
Example: If the given vectors are x = (1, 0, 0) and y = (1, 1, 1), then find the SMC and Jaccard
coefficient.
Solution: It can be seen from Table 13.3 that a = 0, b = 2, c = 0 and d = 1.
The SMC using Eq. (13.5) is given as $\dfrac{a + d}{a + b + c + d} = \dfrac{0 + 1}{3} = 0.33$
The Jaccard coefficient using Eq. (13.6) is given as $J = \dfrac{d}{b + c + d} = \dfrac{1}{3} = 0.33$
Hamming Distance  Hamming distance is another useful measure that can be used for
comparing sequences of characters or binary values. It indicates the number of positions at
which the characters or binary bits are different.
For example, the Hamming distance between x = (1 0 1) and y = (1 1 0) is 2, as x and y differ
in two positions. The distance between the two words 'wood' and 'hood' is 1, as they differ in only
one character. Sometimes, more complex distance measures like edit distance can also be used.
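The binary measures above are easy to compute directly. The following sketch, with illustrative helper names, reproduces the SMC, Jaccard and Hamming values of the examples in this subsection using NumPy.

```python
# A small sketch of the binary proximity measures of Eqs. (13.5)-(13.6).
import numpy as np

def smc(x, y):
    # Eq. (13.5): matching attributes (0-0 and 1-1) over all attributes
    return np.mean(x == y)

def jaccard(x, y):
    # Eq. (13.6): 1-1 matches over attributes that are 1 in at least one object
    d = np.sum((x == 1) & (y == 1))
    b_plus_c = np.sum(x != y)
    return d / (b_plus_c + d)

def hamming(x, y):
    # Number of positions where the two vectors differ
    return int(np.sum(x != y))

x, y = np.array([1, 0, 0]), np.array([1, 1, 1])
print(smc(x, y), jaccard(x, y))                           # 0.33, 0.33
print(hamming(np.array([1, 0, 1]), np.array([1, 1, 0])))  # 2
```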
Categorical Variables
In many cases, categorical values are used. A categorical value is just a code or symbol to represent the values.
For example, for the attribute Gender, a code 1 can be given to female and 0 can be given to male.
To calculate the distance between two objects represented by categorical variables, we need to find only
whether they are equal or not. This is given as:

Distance$(x, y) = \begin{cases} 0 & \text{if } x = y \\ 1 & \text{if } x \neq y \end{cases}$   (13.7)
Ordinal Variables
Ordinal variables are like categorical values but with an inherent order. For example, designation
is an ordinal variable. If a job designation is 1 or 2 or 3, it means code 1 is higher than 2 and code 2
is higher than 3. It is ranked as 1 > 2 > 3.
Let us assume the designations of office employees are clerk, supervisor, manager and
general manager. These can be designated by numbers as clerk = 1, supervisor = 2, manager = 3
and general manager = 4. Then, the distance between employee X who is a clerk and Y who is a
manager can be obtained as:

Distance$(X, Y) = \dfrac{|\text{position}(X) - \text{position}(Y)|}{n - 1}$   (13.8)

Here, position(X) and position(Y) indicate the designated numerical values and n is the number of levels. Thus, the distance
between X (Clerk = 1) and Y (Manager = 3) using Eq. (13.8) is given as:

Distance$(X, Y) = \dfrac{|\text{position}(X) - \text{position}(Y)|}{n - 1} = \dfrac{|1 - 3|}{4 - 1} = \dfrac{2}{3} = 0.67$
Vector Type Distance Measures
For text classification, vectors are normally used. Cosine similarity is a metric used to measure
how similar two documents are, irrespective of their size. Cosine similarity measures the cosine of
the angle between two vectors projected into a multi-dimensional space. The similarity function for
vector objects can be defined as:

sim$(X, Y) = \dfrac{X \cdot Y}{\|X\| \, \|Y\|} = \dfrac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$   (13.9)

The numerator is the dot product of the vectors X and Y; the denominator is the product of the
norms of the vectors X and Y.
Example: If the given vectors are A = (1, 1, 0) and B = (0, 1, 1), then what is the cosine
similarity?
Solution: The dot product of the vectors is 1 × 0 + 1 × 1 + 0 × 1 = 1. The norm of each of the vectors A and
B is $\sqrt{2}$.
So, the cosine similarity using Eq. (13.9) is given as $\dfrac{1}{\sqrt{2} \times \sqrt{2}} = \dfrac{1}{2} = 0.5$
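The cosine similarity of Eq. (13.9) can be computed in a couple of lines; the sketch below, assuming NumPy is available, reproduces the result of the example above.

```python
# A sketch of the cosine similarity of Eq. (13.9) using NumPy.
import numpy as np

def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

a, b = np.array([1, 1, 0]), np.array([0, 1, 1])
print(cosine_similarity(a, b))   # 1 / (sqrt(2) * sqrt(2)) = 0.5
```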
Now, let us discuss the types of clustering algorithms, which include hierarchical, partitional,
density-based and grid-based algorithms.
13.3 HIERARCHICAL CLUSTERING ALGORITHMS
Hierarchical methods produce a nested partition of objects with hierarchical relationships among
objects. Often, the hierarchy relationship is shown in the form of a dendrogram.
Hierarchical methods include two categories: agglomerative methods and divisive methods.
In agglomerative methods, initially each individual sample is considered as a cluster, that is,
a cluster with a single element. Then, clusters are merged and the process is continued to get a
single cluster. Divisive methods use the opposite philosophy: a single cluster containing all the
samples of the dataset is chosen initially and then partitioned. This partitioning process is
continued until the cluster is split into smaller clusters.
Agglomerative methods merge clusters to reduce the number of clusters. This is repeated each
time while merging the two closest clusters to get a single cluster. The procedure of agglomerative
clustering is given as follows:
Agglomerative Clustering Algorithm
1. Place each of the N samples or data instances into a separate cluster. So, initially, N clusters are
available.
2. Repeat the following steps until a single cluster is formed:
   (a) Determine the two most similar clusters.
   (b) Merge the two clusters into a single cluster, reducing the number of clusters by one.
3. Choose the resultant clusters of step 2 as the result.
All the clusters that are produced by hierarchical algorithms have equal diameters. The main
disadvantage of this approach is that once the cluster is formed, it is an irreversible decision.
13.3.1 Single Linkage or MIN Algorithm
Hierarchical clustering algorithms produce nested clusters, which can be visualized as a hierarchical
tree or dendrogram. The idea behind this approach is proximity among clusters. In the single linkage
algorithm, the smallest distance d(x, y), where x is from one cluster and y is from another cluster,
among all possible pairs of the two groups or clusters (or simply the smallest distance between two
points lying in different clusters) is used for merging the clusters.
This corresponds to finding the minimum spanning tree (MST) of a graph.
The distance measures between individual samples or data points have already been demonstrated in
the previous Section 13.2. To understand the single linkage algorithm, go through the following
numerical problem that involves finding the distance between clusters.
Example: Consider the array of points shown in the following Table 13.4.

Table 13.4: Sample Data

Item | X  | Y
0    | 1  | 4
1    | 2  | 8
2    | 5  | 10
3    | 12 | 18
4    | 14 | 28
Solution: The similarity table among the items is computed using the Euclidean distance, as shown in the following Table 13.5.

Table 13.5: Euclidean Distance Table

  | 0 | 1     | 2     | 3      | 4
0 | 0 | 4.123 | 7.211 | 17.804 | 27.295
1 |   | 0     | 3.606 | 14.142 | 23.324
2 |   |       | 0     | 10.630 | 20.124
3 |   |       |       | 0      | 10.198
4 |   |       |       |        | 0
The minimum distance is 3.606. Therefore, the items 1 and 2 are clustered together. The
resultant is shown in Table 13.6.

Table 13.6: After Iteration 1

       | {1, 2} | {0}    | {3}    | {4}
{1, 2} | 0      | 4.123  | 10.630 | 20.124
{0}    |        | 0      | 17.804 | 27.295
{3}    |        |        | 0      | 10.198
{4}    |        |        |        | 0
The distance between the group {1, 2} and the items {0}, {3} and {4} is computed using the formula:

$D_{SL}(C_i, C_j) = \min_{a \in C_i,\, b \in C_j} d(a, b)$   (13.10)

Here, $D_{SL}$ is the single linkage distance, $C_i$ and $C_j$ are clusters and d(a, b) is the distance between
the elements a and b.
Thus, the distance between {1, 2} and {0} is:
Minimum {d(1, 0), d(2, 0)} = Minimum {4.123, 7.211} = 4.123
The distance between {1, 2} and {3} is given as:
Minimum {d(1, 3), d(2, 3)} = Minimum {14.142, 10.630} = 10.630
The distance between {1, 2} and {4} is given as:
Minimum {d(1, 4), d(2, 4)} = Minimum {23.324, 20.124} = 20.124
This is shown in Table 13.6. The minimum distance in Table 13.6 is 4.123.
Therefore, {0} is merged with {1, 2} to form the cluster {0, 1, 2}. This result is shown in Table 13.7.

Table 13.7: After Iteration 2

          | {0, 1, 2} | {3}    | {4}
{0, 1, 2} | 0         | 10.630 | 20.124
{3}       |           | 0      | 10.198
{4}       |           |        | 0
Thus, the distance between {0, 1, 2} and {3} using Eq. (13.10) is given as:
Minimum {d(0, 3), d(1, 3), d(2, 3)} = Minimum {17.804, 14.142, 10.630} = 10.630
The distance between {0, 1, 2} and {4} is:
Minimum {d(0, 4), d(1, 4), d(2, 4)} = Minimum {27.295, 23.324, 20.124} = 20.124
This is shown in Table 13.7. The minimum in Table 13.7 is 10.198.
Therefore, the items {3, 4} are merged. The resultant is shown in Table 13.8.
Table 13.8: After Iteration 3

          | {0, 1, 2} | {3, 4}
{0, 1, 2} | 0         | 10.630
{3, 4}    |           | 0

Further computation of distances in Table 13.8 is not needed, as no other items are left. Therefore, the
clusters {0, 1, 2} and {3, 4} are merged into a single cluster.
Dendrograms are used to plot the hierarchy of clusters. The dendrogram for the above clustering
is shown in Figure 13.2.

Figure 13.2: Dendrogram for Table 13.4
13.3.2 Complete Linkage or MAX or Clique
In the complete linkage algorithm, the distance d(x, y), where x is from one cluster and y is from
another cluster, is the largest distance among all possible pairs of the two groups or clusters
(or simply the largest distance between two points lying in different clusters), as given
below. It is used for merging the clusters.

$D_{CL}(C_i, C_j) = \max_{a \in C_i,\, b \in C_j} d(a, b)$   (13.11)
Dendrogram for the above clustering is shown in Figure 13.3.
Figure 13.3: Dendrogram for Table 13.4
13.3.3 Average Linkage
In the case of the average linkage algorithm, the average distance of all pairs of points across the
clusters is used to form clusters. The average distance between clusters $C_i$ and $C_j$ is given
as follows:

$D_{AL}(C_i, C_j) = \dfrac{1}{m_i \times m_j} \sum_{a \in C_i,\, b \in C_j} d(a, b)$   (13.12)

Here, $m_i$ and $m_j$ are the sizes of the clusters.
The dendrogram for Table 13.4 is given in Figure 13.4.
Figure 13.4: Dendrogram for Table 13.4
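As a working illustration of Sections 13.3.1–13.3.3, the sketch below runs single, complete and average linkage on the Table 13.4 points using SciPy's hierarchical clustering routines; SciPy and Matplotlib are assumed to be available, and the first single-linkage merge reproduces the 3.606 distance between items 1 and 2 computed above.

```python
# A sketch of hierarchical clustering on the Table 13.4 data using SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.array([[1, 4], [2, 8], [5, 10], [12, 18], [14, 28]])

for method in ("single", "complete", "average"):   # Sections 13.3.1-13.3.3
    Z = linkage(X, method=method, metric="euclidean")
    print(method)
    print(Z)   # each row: the two clusters merged, the merge distance, cluster size

# Plot the dendrogram for the single linkage (MIN) algorithm
dendrogram(linkage(X, method="single"))
plt.show()
```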
13.3.4 Mean-Shift Clustering Algorithm
Mean-shift is a non-parametric and hierarchical clustering algorithm. This algorithm is also
known as a mode-seeking algorithm or a sliding-window algorithm. It has many applications in
image processing and computer vision.
There is no need for any prior knowledge of the number of clusters or the shape of the clusters present in the
dataset. The algorithm slowly moves from its initial position towards the dense regions.
The algorithm uses a window, which is basically a weighting function. A Gaussian window is
a good example of a window. The entire window is called a kernel, and the radius of the kernel
is called the bandwidth. The window is based on the concept of the kernel density function and its aim
is to find the underlying data distribution. The method of calculation of the mean depends on
the choice of window. If a Gaussian window is chosen, then every point is assigned a weight
that decreases as the distance from the kernel centre increases. The algorithm is given below.
Mean-Shift Algorithm
Step 1: Design a window.
Step 2: Place the window on a set of data points.
Step 3: Compute the mean of all the points that come under the window.
Step 4: Move the centre of the window to the mean computed in step 3. Thus, the window
moves towards the dense regions. The movement towards the dense region is controlled by a
mean shift vector. The mean shift vector is given as:

$v = \dfrac{1}{K} \sum_{x_i \in S_k} (x_i - x)$   (13.13)

Here, K is the number of points and $S_k$ is the set of data points $x_i$ whose distance from the
centroid of the kernel x is within the radius of the sphere. Then, the
centroid is updated as $x = x + v$.
Step 5: Repeat steps 3–4 until convergence. Once convergence is achieved, the window stops moving and no further points
are accommodated.
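As an illustration of the procedure above, the following sketch uses scikit-learn's MeanShift implementation (assumed to be available); it is not the book's own implementation, and the small data array and the estimated bandwidth are only for demonstration.

```python
# A sketch of mean-shift clustering with scikit-learn.
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

X = np.array([[1, 4], [2, 8], [5, 10], [12, 18], [14, 28]], dtype=float)

# The bandwidth (radius of the window) is the only important parameter
bandwidth = estimate_bandwidth(X, quantile=0.5)
ms = MeanShift(bandwidth=bandwidth).fit(X)

print(ms.cluster_centers_)   # modes found by shifting windows to dense regions
print(ms.labels_)            # cluster assignment of each sample
```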
Advantages
1. No model assumptions
2. Suitable for all non-convex shapes
3. Only one parameter of the window, that is, the bandwidth, is required
4. Robust to noise
5. No issues of local minima or premature termination
Disadvantages
1. Selecting the bandwidth is a challenging task. If it is too large, then many clusters are missed.
If it is too small, then many points are missed and convergence becomes a problem.
2. The number of clusters cannot be specified, and the user has no control over this parameter.
13.4 PARTITIONAL CLUSTERING ALGORITHM
‘kemeans’ algorithm is a straightforward iterative partitional algorithm. Here, k stands for the
user specified requested clusters as users are not aware of the clusters that are present in the
dataset. The k-means algorithm assumes that the clusters do not overlap. Therefore, a sample or
data point can belong to only one cluster in the end. Also, this algorithm can detect clusters of
shapes like circular or spherical,
Initially, the algorithm needs to be initialized. The algorithm can select k data points
randomly or use the prior knowledge of the data. In most cases, in k-means algorithm setup, prior
knowledge is absent. The composition of the cluster is based on the initial condition, therefore,
initialization is am important task. The sample or data points need to be normalized for better,
performance. The concepts of normalization are covered in Chapter 3.
The core process of the k-mean algorithm is assigning a sample to a cluster, that is, assigning
each sample or data point to the k cluster centers based on its distance and the centroid of the
dusters. This distance should be minimum. As a new sample is added, new computation of
mean veetora of the points for tual luster to which sample is assigned is required. Therefore, this
iterative process is continued until no change of instances to clusters is noticed. This algorithm
then terminates and the termination is guaranteed.
Step 1: Determine the number of clusters before the algorithm is started. This is called k.
Step 2: Choose k instances randomly. These are the initial cluster centres.
Step 3: Compute the mean of the initial clusters and assign each remaining sample to the
closest cluster based on the Euclidean distance (or any other distance measure) between
the instance and the centroids of the clusters.
Step 4: Compute the new centroids again, considering the newly added samples.
Step 5: Perform steps 3–4 till the algorithm becomes stable with no more changes in
the assignment of instances to clusters.
k-means can also be viewed as a greedy algorithm, as it involves partitioning n samples into
k clusters so as to minimize the Sum of Squared Error (SSE). SSE is a metric that gives the sum of the
squared Euclidean distances of each data point to its closest centroid. It is
given as:

SSE $= \sum_{i=1}^{k} \sum_{x \in C_i} \text{dist}(c_i, x)^2$   (13.14)

Here, $c_i$ is the centroid of the i-th cluster, x is the sample or data point and dist is the Euclidean
distance. The aim of the k-means algorithm is to minimize the SSE.
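The following sketch shows k-means in practice using scikit-learn (assumed to be available); the attribute inertia_ reported by the library corresponds to the SSE of Eq. (13.14), and the data array reuses the Table 13.9 points purely as an illustration.

```python
# A sketch of k-means with scikit-learn; inertia_ is the SSE of Eq. (13.14).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 4], [4, 6], [6, 8], [10, 4], [12, 4]], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # centroids c_i
print(km.labels_)            # cluster assignment of each sample
print(km.inertia_)           # SSE: sum of squared distances to the closest centroid
```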
Advantages
1. Simple
2. Easy to implement
Disadvantages
1. It is sensitive to the initialization process, as a change of the initial points leads to different clusters.
2. If the number of samples is large, then the algorithm takes a lot of time.
How to Choose the Value of k?
It is obvious that k is the user-specified value specifying the number of clusters that are present.
There are no standard rules available to pick the value of k. Normally, the k-means
algorithm is run with multiple values of k, and the within-group variance (the sum of squared distances of the samples
from their centroid) is plotted as a line graph. This plot is called the Elbow curve. The optimal or best
value of k can be determined from the graph. The optimal value of k is identified by the flat or
horizontal part of the Elbow curve.
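A minimal sketch of the Elbow curve idea, assuming scikit-learn and Matplotlib are available: run k-means for several values of k and plot the within-group SSE against k.

```python
# A sketch of the Elbow curve: within-group SSE versus k.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 4], [4, 6], [6, 8], [10, 4], [12, 4]], dtype=float)

ks = range(1, 5)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
plt.plot(ks, sse, marker="o")
plt.xlabel("k")
plt.ylabel("within-group SSE")
plt.show()   # look for the point where the curve flattens out
```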
Complexity
The complexity of the k-means algorithm depends on parameters like n, the number of
samples; k, the number of clusters; I, the number of iterations; and d, the number of
attributes. The complexity of the k-means algorithm is O(nkId).
Example: Consider the set of data given in Table 13.9. Cluster it using the k-means
algorithm, with objects 2 and 5, having the coordinate values (4, 6) and (12, 4), as the
initial seeds.

Table 13.9: Sample Data

Object | X  | Y
1      | 2  | 4
2      | 4  | 6
3      | 6  | 8
4      | 10 | 4
5      | 12 | 4
Solution: As per the problem, choose objects 2 and 5 with their coordinate values. Hereafter,
the object ids are not important. The samples or data points (4, 6) and (12, 4) are taken as the starting
points of two clusters, as shown in Table 13.10.
Initially, the centroid and the data point are the same, as only one sample is involved in each cluster.

Table 13.10: Initial Cluster Table

Cluster 1           | Cluster 2
(4, 6)              | (12, 4)
Centroid 1: (4, 6)  | Centroid 2: (12, 4)
Iteration 1: Compare all the data points or samples with the centroids and assign each to the
nearest one. The seed objects 2 and 5 are at distance 0 from their own centroids and therefore remain in
their clusters. Now consider the remaining samples. For the object 1 (2, 4) from Table 13.9, the Euclidean
distance between it and the centroids of the clusters in Table 13.10 is given as:

Dist(1, centroid 1) $= \sqrt{(2-4)^2 + (4-6)^2} = \sqrt{8}$
Dist(1, centroid 2) $= \sqrt{(2-12)^2 + (4-4)^2} = \sqrt{100} = 10$

Object 1 is closer to the centroid of cluster 1 and hence is assigned to cluster 1. This is shown in
Table 13.11.
For the object 3 (6, 8), the Euclidean distance between it and the centroid points is given as:

Dist(3, centroid 1) $= \sqrt{(6-4)^2 + (8-6)^2} = \sqrt{8}$
Dist(3, centroid 2) $= \sqrt{(6-12)^2 + (8-4)^2} = \sqrt{52}$

Object 3 is closer to the centroid of cluster 1 and hence is assigned to cluster 1.
Proceed with the next point, object 4 (10, 4), and again compare it with the centroids in
Table 13.10.

Dist(4, centroid 1) $= \sqrt{(10-4)^2 + (4-6)^2} = \sqrt{40}$
Dist(4, centroid 2) $= \sqrt{(10-12)^2 + (4-4)^2} = \sqrt{4} = 2$

Object 4 is closer to the centroid of cluster 2 and hence is assigned to cluster 2. Obviously,
object 5 (12, 4) stays in cluster 2, as it is the seed of that cluster. The final cluster table for this
iteration is shown in Table 13.11. Now recompute the new centroids of cluster 1 and cluster 2.
They are (4, 6) and (11, 4), respectively.
Table 13.11: Cluster Table After Iteration 1

Cluster 1           | Cluster 2
(4, 6)              | (10, 4)
(2, 4)              | (12, 4)
(6, 8)              |
Centroid 1: (4, 6)  | Centroid 2: (11, 4)
The second iteration is started again with Table 13.11.
Obviously, the point (4, 6) remains in cluster 1, as its distance to itself is 0. The
remaining objects can be checked. Take the sample object 1 (2, 4) and compare it with the centroids
of the clusters in Table 13.11.

Dist(1, centroid 1) $= \sqrt{(2-4)^2 + (4-6)^2} = \sqrt{8}$
Dist(1, centroid 2) $= \sqrt{(2-11)^2 + (4-4)^2} = \sqrt{81} = 9$

Object 1 is closer to the centroid of cluster 1 and hence remains in the same cluster. Take the
sample object 3 (6, 8) and compare it with the centroid values of cluster 1 (4, 6) and cluster
2 (11, 4) of Table 13.11.

Dist(3, centroid 1) $= \sqrt{(6-4)^2 + (8-6)^2} = \sqrt{8}$
Dist(3, centroid 2) $= \sqrt{(6-11)^2 + (8-4)^2} = \sqrt{41}$
Object 3 is closer to the centroid of cluster 1 and hence remains in the same cluster. Take the
sample object 4 (10, 4) and compare it with the centroid values of cluster 1 (4, 6) and cluster 2 (11, 4)
of Table 13.11:

Dist(4, centroid 1) $= \sqrt{(10-4)^2 + (4-6)^2} = \sqrt{40}$
Dist(4, centroid 2) $= \sqrt{(10-11)^2 + (4-4)^2} = \sqrt{1} = 1$

Object 4 is closer to the centroid of cluster 2 and hence remains in the same cluster. Obviously,
the sample object 5 (12, 4) is also closer to the centroid of cluster 2, as shown below:

Dist(5, centroid 1) $= \sqrt{(12-4)^2 + (4-6)^2} = \sqrt{68}$
Dist(5, centroid 2) $= \sqrt{(12-11)^2 + (4-4)^2} = \sqrt{1} = 1$

Therefore, it remains in the same cluster.
The final cluster Table 13.12 is given below:

Table 13.12: Cluster Table After Iteration 2

Cluster 1           | Cluster 2
(4, 6)              | (10, 4)
(2, 4)              | (12, 4)
(6, 8)              |
Centroid 1: (4, 6)  | Centroid 2: (11, 4)

There is no change in the cluster assignments in Table 13.12; it is exactly the same as Table 13.11.
Therefore, the k-means algorithm terminates with two clusters, with the data points as shown in Table 13.12.
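The worked example can be reproduced with scikit-learn (assumed available) by passing the two seeds explicitly as the initial centroids; the resulting centroids match the (4, 6) and (11, 4) obtained above.

```python
# A sketch that reproduces the worked example: k-means started from the seeds
# (4, 6) and (12, 4) on the Table 13.9 data.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 4], [4, 6], [6, 8], [10, 4], [12, 4]], dtype=float)
seeds = np.array([[4, 6], [12, 4]], dtype=float)

km = KMeans(n_clusters=2, init=seeds, n_init=1).fit(X)
print(km.labels_)            # first three objects together; last two together
print(km.cluster_centers_)   # (4, 6) and (11, 4), matching the worked example
```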
13.5 DENSITY-BASED METHODS
Density-based spatial clustering of applications with noise (DBSCAN) is one of the density-based
algorithms. A dense region is a region where the number of points is above a specified
threshold. In a density-based approach, the clusters are regarded as dense regions
of objects that are separated by regions of low density, such as noise. This is the same as a human's
intuitive way of observing clusters.
The concept of density and connectivity is based on the local distances of neighbours.
The functioning of this algorithm is based on two parameters, the size of the neighbourhood (ε)
and the minimum number of points (m).
1. Core point – A point is called a core point if it has more than the specified number of points
(m) within its ε-neighbourhood.
2. Border point – A point is called a border point if it has fewer than m points but is a
neighbour of a core point.
3. Noise point – A point that is neither a core point nor a border point.
The main idea is that every data point or sample should have at least a minimum number
of neighbours in its neighbourhood; the neighbourhood of radius ε should have at least m points.
The notion of density connectedness determines the quality of the algorithm.
The following connectedness measures are used for this algorithm.
1. Directly density reachable – The point X is directly density reachable from Y if:
   (a) X is in the ε-neighbourhood of Y
   (b) Y is a core point
2. Density reachable – The point X is density reachable from Y if there is a chain of core points
that leads from Y to X.
3. Density connected – X and Y are density connected if there is a core point Z such that both
X and Y are density reachable from Z.
Step 1: Randomly select a point p. Compute the distance between p and all other points.
Step 2: Find all points within the neighbourhood of p and check whether it contains the
minimum number of points m. If so, p is marked as a core point.
Step 3: If it is a core point, then a new cluster is formed, or an existing cluster is enlarged.
Step 4: If it is a border point, then the algorithm moves to the next point and marks it as visited.
Step 5: If it is a noise point, it is removed.
Step 6: Merge two clusters if they are mergeable, that is, if dist($c_i$, $c_j$) < ε.
Step 7: Repeat steps 3–6 till all points are processed.
Advantages
1. No need for specifying the number of clusters beforehand
2. The algorithm can detect clusters of any shape
3. Robust to noise
4. Few parameters are needed
The complexity of this algorithm is O(n log n).
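A brief sketch of DBSCAN using scikit-learn (assumed available); eps plays the role of the neighbourhood size ε and min_samples the minimum number of points m. The data array and parameter values are illustrative only.

```python
# A sketch of DBSCAN with scikit-learn.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 4], [2, 8], [5, 10], [12, 18], [14, 28]], dtype=float)

db = DBSCAN(eps=5.0, min_samples=2).fit(X)
print(db.labels_)   # cluster index per sample; -1 marks noise points
```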
13.6 GRID-BASED APPROACH
The grid-based approach is a space-based approach. It partitions the space into cells, and the given data is fitted
onto the cells for cluster formation.
There are three important concepts that need to be mastered for understanding grid-based
schemes. They are:
1. Subspace clustering
2. Concept of dense cells
3. Monotonicity property
Let us discuss each of them.
Subspace Clustering
Grid-based algorithms are useful for clustering high-dimensional data, that is, data with many
attributes. Some data, like gene data, may have millions of attributes. Every attribute is called a
dimension. But not all the attributes are needed, as in many applications one may not require all
of them. For example, an employee's address may not be required for profiling his diseases,
whereas age may be required in that case. So, one can conclude that only a subset of features is required.
For example, one may be interested in grouping gene data with similar characteristics or organs
that have similar functions.
Finding subspaces is difficult. For example, N dimensions may have $2^N$ subspaces, and exploring
all of them is a difficult task. Here, the CLIQUE algorithm is useful for exploring
the subspaces. CLIQUE (Clustering in Quest) is a grid-based method for finding clusters in
subspaces. CLIQUE uses a multiresolution grid data structure.
Concept of Dense Cells
CLIQUE partitions each dimension into several intervals and thereby divides the space into cells.
Then, the algorithm determines whether a cell is dense or sparse. A cell is considered dense if the
number of points in it exceeds a threshold value, say τ. Density is defined as the ratio of the number of points to the volume of the
region. In one pass, the algorithm finds the number of cells, the number of points, etc. and then combines
the dense cells. For that, the algorithm uses contiguous intervals and a set of dense cells.

Step 1: Define a set of grid cells and assign the given data points to the grid.
Step 2: Determine the dense and sparse cells. If the number of points in a cell exceeds the threshold
value τ, the cell is categorized as a dense cell. Sparse cells are removed from the list.
Step 3: Merge the dense cells if they are adjacent.
Step 4: Form a list of grid cells for every subspace as output.
Monotonicity Property
CLIQUE uses the anti-monotonicity property, or apriori property, of the famous apriori algorithm.
It means that all the subsets of a frequent itemset should be frequent. Similarly, if a subset is
infrequent, then all its supersets are infrequent as well. Based on the apriori property, one can
conclude that a k-dimensional cell has r points only if every (k − 1)-dimensional projection
of this cell has at least r points. So, like association rule mining that uses the apriori rule, candidate
dense cells of higher dimensions are generated from lower-dimensional dense cells and then pruned.
The algorithm works in two stages, as shown below.
Stage 1:
Step 1: Identify the dense cells.
Step 2: Merge dense cells $c_1$ and $c_2$ if they share the same interval.
Step 3: Use the apriori rule to generate (k + 1)-dimensional candidate cells from the k-dimensional
dense cells. Then, check whether the number of points in each candidate crosses the threshold. This is
repeated till there are no dense cells or no new dense cells are generated.
Stage 2:
Step 1: Merging of dense cells into a cluster is carried out in each subspace using maximal
regions to cover the dense cells. A maximal region is a hyperrectangle into which all the covered
cells fall.
Step 2: The maximal regions try to cover all the dense cells to form clusters.
In stage two, CLIQUE starts from dimension 2 and starts merging. This process is
continued till the n-th dimension.
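The sketch below illustrates only the dense-cell counting idea behind grid-based methods, not the full CLIQUE algorithm; the grid resolution and the threshold τ (tau) are hypothetical values chosen for demonstration.

```python
# An illustrative sketch of the dense-cell idea: bin the data into grid cells
# and keep only the cells whose point counts exceed a threshold tau.
import numpy as np

X = np.random.rand(200, 2)          # 200 points in a 2-D subspace
bins = 5                            # grid resolution per dimension (hypothetical)
tau = 12                            # density threshold (hypothetical)

counts, edges = np.histogramdd(X, bins=bins)
dense_cells = np.argwhere(counts > tau)
print(dense_cells)                  # indices of dense grid cells to be merged
```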
Advantages of CLIQUE
1. Insensitive to the input order of objects
2. No assumptions about the underlying data distribution
3. Finds subspaces of higher dimensions such that high-density clusters exist in those subspaces
Disadvantage
The disadvantage of CLIQUE is that tuning the grid parameters, such as the grid size, and finding
the optimal threshold for deciding whether a cell is dense or not are a challenge.
13.7 PROBABILITY MODEL-BASED METHODS
In the earlier clustering algorithms, the samples were assigned to the clusters permanently.
Also, the samples were not allowed to be present in two clusters. In short, the clusters were
non-overlapping. In model-based schemes, the sample is associated with a probability of
membership. This assignment based on probability is called soft assignment. Soft assignments
are dynamic as compared to hard assignments, which are static. Also, a sample can belong to
more than one cluster. This is acceptable, as a person X can be a father, a manager as well as a
member of a prestigious club. In short, person X has different roles in life.
A 'model' means a statistical method, such as a probability distribution together with its
parameters. In the EM algorithm, the model assumes the data is generated by a process, and the focus is
to find the distribution that best describes the observed data. There are two probability-based soft assignment
schemes that are discussed here. One is the fuzzy C-means (FCM) clustering algorithm and the other is the
EM algorithm. The EM algorithm is discussed in detail in Chapter 2.
13.7.1 Fuzzy Clustering
Fuzzy C-means is one of the most widely used algorithms for implementing the fuzzy clustering
concept. In fuzzy clustering, an object can belong to more than one cluster. Let us assume two
clusters $c_1$ and $c_2$; then an element, say x, can belong to both the clusters. The strength of association
of an object with a cluster is given as $w_{ij}$. The value of $w_{ij}$ lies between zero and one. The
weights of an object, if added, give 1.
Like in the k-means algorithm, the centroid of the cluster, $c_j$, is computed. The membership weight
is inversely proportional to the distance between the object and the centroid computed in the earlier pass.
FCM Algorithm
Step 1: Choose the number of clusters, k.
Step 2: Assign the weights $w_{ij}$ of objects to clusters randomly.
Step 3: Compute the centroids:

$c_j = \dfrac{\sum_{i} w_{ij}^{p} x_i}{\sum_{i} w_{ij}^{p}}$   (13.15)

Step 4: The Sum of Squared Error (SSE) is computed as:

SSE $= \sum_{j=1}^{k} \sum_{i} w_{ij}^{p} \, \text{dist}(x_i, c_j)^2$   (13.16)

Step 5: Minimize the SSE to update the membership weights. Here, p is a fuzzifier whose value
ranges from 1 to ∞. This parameter determines the influence of the weights. If p is 1,
then fuzzy C-means acts like the k-means algorithm. A larger value of p results in smaller values of the
memberships and hence more fuzziness. Typically, the value of p is 2.
Step 6: Repeat steps 3–5 till convergence is reached, which means there is no change in
weights exceeding the threshold value.
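A minimal NumPy sketch of the FCM update cycle implied by Eqs. (13.15) and (13.16); the function and variable names are illustrative, the fuzzifier p must be greater than 1 in this sketch, and the code is not an optimized implementation.

```python
# A sketch of the fuzzy C-means updates of Eqs. (13.15)-(13.16).
import numpy as np

def fcm(X, k, p=2, iters=100, seed=0):
    # p is the fuzzifier; it must be > 1 in this sketch
    rng = np.random.default_rng(seed)
    W = rng.random((len(X), k))
    W /= W.sum(axis=1, keepdims=True)              # memberships of each object sum to 1
    for _ in range(iters):
        Wp = W ** p
        C = (Wp.T @ X) / Wp.sum(axis=0)[:, None]   # Eq. (13.15): centroids
        D = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
        W = 1.0 / (D ** (2 / (p - 1)))             # membership update from minimizing SSE
        W /= W.sum(axis=1, keepdims=True)
    sse = np.sum((W ** p) * D ** 2)                # Eq. (13.16)
    return C, W, sse

X = np.array([[2, 4], [4, 6], [6, 8], [10, 4], [12, 4]], dtype=float)
C, W, sse = fcm(X, k=2)
print(C)    # fuzzy centroids
print(W)    # membership weights of each object in each cluster
```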
Advantages and Disadvantages of FCM
The advantages and disadvantages of the FCM algorithm are given in Table 13.13.

Table 13.13: Advantages and Disadvantages of the FCM Algorithm

S. No. | Advantages | Disadvantages
1. | Minimum intra-cluster variance | The quality depends on the initial choice of weights
2. | Robust to noise | Converges to a local minimum rather than the global minimum
13.7.2 Expectation-Maximization (EM) Algorithm
Like the FCM algorithm, in the EM algorithm too there are no hard assignments, but there are
overlapping clusters.
In this scheme, clustering is done by statistical models. What is a model? A statistical model is
described in terms of a distribution and a set of parameters. The data is assumed to be generated
by a process, and the focus is to describe the data by finding a model that fits the data. In fact, the data
is assumed to be generated by multiple distributions – a mixture model. As mostly Gaussian distributions
are used, it is also called a GMM (Gaussian Mixture Model).
Given a mixture of distributions, data can be generated by randomly picking a distribution and
generating a point from it. The basics of the Gaussian distribution are given in Chapter 3. One can recollect
that the Gaussian distribution is a bell-shaped curve. The function of the Gaussian distribution is given as
follows:

$N(x \mid \mu, \sigma^2) = \dfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

Two parameters characterize this function: the mean and the standard deviation. Sometimes, the variance
can also be used, as it is the square of the standard deviation. The peak of the
bell-shaped curve occurs at the mean, and the standard deviation determines the spread of the shape. The above function is
called the probability density function; it tells how to find the probability of the observed point x.
The same Gaussian function can be extended to the multivariate case too. In 2D, the mean is also a
vector, and the variance takes the form of a covariance matrix. Chapter 3 discusses these important
concepts.
Let us assume that:
k = the number of distributions
n = the number of samples
$\theta = \{\theta_1, \theta_2, \ldots, \theta_k\}$, a set of parameters associated with the distributions, where
$\theta_j$ is the parameter of the j-th probability distribution.
Then, $p(x_i \mid \theta_j)$ is the probability of the i-th object coming from the j-th distribution. The probability
that the j-th distribution is chosen is given by the weight $w_j$.
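The EM algorithm for a Gaussian mixture model is available in scikit-learn as GaussianMixture; the sketch below (with an illustrative data array) shows how the soft membership weights discussed above can be obtained.

```python
# A sketch of EM-based soft clustering with a Gaussian mixture model.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[2, 4], [4, 6], [6, 8], [10, 4], [12, 4]], dtype=float)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.means_)             # means of the k Gaussian distributions
print(gmm.weights_)           # mixing weights w_j of the distributions
print(gmm.predict_proba(X))   # soft assignment: p(distribution j | sample i)
```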