[Figure: Overlapping Clustering]

[Fig. 4: A Hierarchical Clustering of Four Points p1, p2, p3, p4: (a) Dendrogram, (b) Nested Clusters]

[Fig. 5: Cluster Distance in Nearest Neighbour Method]
Logically, several approaches are possible to find a hierarchy associated with the data. The popular approach is to construct the hierarchy level-by-level, from bottom to top (agglomerative clustering) or from top to bottom (divisive clustering). Let us discuss hierarchical clustering methods one by one in detail.

Agglomerative Hierarchical Clustering

Agglomerative hierarchical techniques are the more commonly used methods for clustering. Each object initially represents a cluster of its own. Then clusters are successively merged until the desired cluster structure is obtained. In divisive hierarchical clustering, all objects initially belong to one cluster. Then the cluster is divided into sub-clusters, which are successively divided into their own sub-clusters. This process continues until the desired cluster structure is obtained. The result of the hierarchical methods is a dendrogram, representing the nested grouping of objects and the similarity levels at which groupings change. A clustering of the data objects is obtained by cutting the dendrogram at the desired similarity level. The merging or division of clusters is performed according to some similarity measure, chosen so as to optimize some criterion (such as a sum of squares).

In single-link clustering (the nearest neighbour method), the distance between two clusters is taken to be the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, the similarity between a pair of clusters is considered to be equal to the greatest similarity from any member of one cluster to any member of the other cluster. This method has a tendency to cluster together at an early stage objects that are distant from each other in the same cluster because of a chain of intermediate objects.

Example 4: Let us suppose that Euclidean distance is the appropriate measure of proximity. Consider the five observations a, b, c, d and e shown in Fig. 6(b), each forming its own cluster. The distance between each pair of observations is shown in Fig. 6(a). For example, the distance between a and b is

√((2 − 8)² + (4 − 2)²) = √(36 + 4) = 6.325.

Observations b and e are nearest (most similar) and, as shown in Fig. 6(b), are grouped in the same cluster. Assuming the nearest neighbour method is used, the distance between the cluster (be) and another observation is the smaller of the distances between that observation, on the one hand, and b and e, on the other.

Fig. 6(a): Distances between the five observations

Cluster     a        b        c        d        e
a           0        6.325    7.071    1.414    7.159
b                    0        1.414    7.616    1.118
c                             0        8.246    2.062
d                                      0        8.500
e                                               0

For example, D(be, a) = min{D(b, a), D(e, a)} = min{6.325, 7.159} = 6.325.
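The entries of Fig. 6(a) and the nearest-neighbour distance D(be, a) can be checked with a small Python sketch, assuming the coordinates listed for these same observations in Fig. 20(a): a(2,4), b(8,2), c(9,3), d(1,5), e(8.5,1).

```python
from math import hypot

# Coordinates of the five observations (as listed in Fig. 20(a)).
points = {"a": (2, 4), "b": (8, 2), "c": (9, 3), "d": (1, 5), "e": (8.5, 1)}

def euclidean(p, q):
    """Euclidean distance between two 2-D points."""
    return hypot(p[0] - q[0], p[1] - q[1])

# Pairwise distances, as tabulated in Fig. 6(a).
labels = sorted(points)
for i, u in enumerate(labels):
    for v in labels[i + 1:]:
        print(f"D({u},{v}) = {euclidean(points[u], points[v]):.3f}")

# Nearest-neighbour (single-link) distance between cluster (be) and observation a.
d_be_a = min(euclidean(points["b"], points["a"]),
             euclidean(points["e"], points["a"]))
print(f"D(be,a) = {d_be_a:.3f}")   # 6.325
```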
The four clusters remaining at the end of this step and the distances between these clusters are shown in Fig. 7(a).

Fig. 7(a): Distances after merging b and e

Cluster    (be)     a        c        d
(be)       0        6.325    1.414    7.616
a                   0        7.071    1.414
c                            0        8.246
d                                     0

[Fig. 7(b): scatter plot of the five observations in the (X1, X2) plane, with b and e grouped]

Fig. 7: Nearest Neighbour Method (Step 2)

Two pairs of clusters are closest to one another at distance 1.414; these are (ad) and (bce). We arbitrarily select (ad) as the new cluster, as shown in Fig. 7(b).

The distance between (be) and (ad) is

D(be, ad) = min{D(be, a), D(be, d)} = min{6.325, 7.616} = 6.325,

while that between c and (ad) is

D(c, ad) = min{D(c, a), D(c, d)} = min{7.071, 8.246} = 7.071.

The three clusters remaining at this step and the distances between these clusters are shown in Fig. 8(a). We merge (be) with c to form the cluster (bce), shown in Fig. 8(b).

The distance between the two remaining clusters is

D(ad, bce) = min{D(ad, be), D(ad, c)} = min{6.325, 7.071} = 6.325.

The grouping of these two clusters, it will be noted, occurs at a distance of 6.325, a much greater distance than that at which the earlier groupings took place. Fig. 9 shows the final grouping.

The groupings and the distances between the clusters are also shown in the tree diagram (dendrogram) of Fig. 10. One usually searches the dendrogram for large jumps in the grouping distance as guidance in arriving at the number of groups. In this example, it is clear that the elements in each of the clusters (ad) and (bce) are close (they were merged at a small distance), but the clusters are distant (the distance at which they merge is large).

[Fig. 10: Nearest Neighbour Method (Dendrogram); vertical axis: Distance (0 to 6), horizontal axis: observations c, b, e, a, d]

***

The nearest neighbour is not the only method for measuring the distance between clusters. Complete-link clustering (also called the diameter method, the maximum method or the furthest neighbour method) considers the distance between two clusters to be equal to the longest distance from any member of one cluster to any member of the other cluster, that is, the distance between their two most distant members. This method tends to produce clusters at the early stages that have objects within a narrow range of distances from each other. If we visualize them as objects in space, the objects in such clusters would have a more spherical shape, as shown in Fig. 11.
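The same merges can be reproduced programmatically. The sketch below, assuming SciPy is available, runs single-link and complete-link agglomerative clustering on the five observations; linkage() reports, for every merge, the distance at which it happens, which is what the dendrogram of Fig. 10 plots.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# The five observations a, b, c, d, e (coordinates from Fig. 20(a)).
X = np.array([[2, 4], [8, 2], [9, 3], [1, 5], [8.5, 1]])
names = ["a", "b", "c", "d", "e"]

for method in ("single", "complete"):
    # Each row of Z describes one merge: [cluster_i, cluster_j, merge_distance, size].
    Z = linkage(X, method=method)
    print(method, "link merge distances:", np.round(Z[:, 2], 3))

    # Cutting the tree at height 3.0 gives the two clusters (ad) and (bce) for both linkages.
    labels = fcluster(Z, t=3.0, criterion="distance")
    print({n: int(l) for n, l in zip(names, labels)})
```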
Fig. 18(a): x-y Coordinates for 5 Points

Sample    x     y
1         4     4
2         8     4
3        15     8
4        24     4
5        24    12

Fig. 18(b): First Iteration

Sample     Nearest cluster centroid
(4,4)      (4,4)
(8,4)      (8,4)
(15,8)     (8,4)
(24,4)     (8,4)
(24,12)    (8,4)

Fig. 18(c): Second Iteration

Sample     Nearest cluster centroid
(4,4)      (4,4)
(8,4)      (4,4)
(15,8)     (17.75,7)
(24,4)     (17.75,7)
(24,12)    (17.75,7)

Fig. 18(d): Third Iteration

Sample     Nearest cluster centroid
(4,4)      (6,4)
(8,4)      (6,4)
(15,8)     (21,8)
(24,4)     (21,8)
(24,12)    (21,8)

Fig. 18

For Step 2, find the nearest cluster centroid for each sample. Fig. 18(b) shows the results. The clusters {(4,4)} and {(8,4), (15,8), (24,4), (24,12)} are produced.

For Step 4, we compute the centroid of each cluster. The centroid of the first cluster is (4,4). The centroid of the second cluster is (17.75, 7), since (8 + 15 + 24 + 24)/4 = 17.75 and (4 + 8 + 4 + 12)/4 = 7. As samples change clusters, go to Step 2.

Find the cluster centroid nearest each sample. Fig. 18(c) shows the results. The clusters {(4,4), (8,4)} and {(15,8), (24,4), (24,12)} are produced. For Step 4, we compute the centroids (6,4) and (21,8) of the clusters. As sample (8,4) changed cluster, return to Step 2.

Find the cluster centroid nearest each sample. Fig. 18(d) shows the results. The clusters {(4,4), (8,4)} and {(15,8), (24,4), (24,12)} are produced. For Step 4, we compute the centroids (6,4) and (21,8) of the clusters. As no sample changed clusters, the algorithm terminates.
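As a cross-check of the iterations in Fig. 18, here is a minimal Python sketch of the same procedure: assign every sample to its nearest centroid, recompute each centroid as the mean of its cluster, and stop when nothing changes. The seeds are the first two samples, as in the figure.

```python
from math import dist  # Euclidean distance (Python 3.8+)

samples = [(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)]  # Fig. 18(a)
centroids = [samples[0], samples[1]]                    # seeds (4,4) and (8,4)

while True:
    # Assignment step: index of the nearest centroid for each sample.
    assign = [min(range(len(centroids)), key=lambda j: dist(s, centroids[j]))
              for s in samples]

    # Update step: each centroid becomes the mean of its assigned samples
    # (assumes every centroid keeps at least one sample, which holds here).
    new_centroids = []
    for j in range(len(centroids)):
        members = [s for s, a in zip(samples, assign) if a == j]
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))

    print("assignments:", assign, "centroids:", new_centroids)
    if new_centroids == centroids:   # no centroid moved, so no sample changes cluster
        break
    centroids = new_centroids
```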
***

Try an exercise.

E5) Consider the data

    (i) k = 2 and use the first two samples in the list as seed points.
    (ii) k = 3 and use the first three samples in the list as seed points.

In the following section, we discuss k-means clustering.

12.7 K-MEANS CLUSTERING

The K-means clustering technique is simple: we first choose k initial centroids, where k is a user-specified parameter, namely the number of clusters desired. Each point is then assigned to the closest centroid, and each collection of points assigned to a centroid is a cluster. The centroid of each cluster is then updated based on the points assigned to it. We repeat the assignment and update steps until no point changes clusters or, equivalently, until the centroids remain the same. In its simplest form, the k-means method follows these steps.

Step 1: Specify the number of clusters and, arbitrarily or deliberately, the members of each cluster.

Step 2: Calculate each cluster's "centroid" (explained below) and the distances between each observation and each centroid. If an observation is nearer the centroid of a cluster other than the one to which it currently belongs, re-assign it to the nearer cluster.

Step 3: Repeat Step 2 until all observations are nearest the centroid of the cluster to which they belong.

Step 4: If the number of clusters cannot be specified with confidence in advance, repeat Steps 1 to 3 with a different number of clusters and evaluate the results.

The operation of K-means is shown in Fig. 19, which shows how, starting from three centroids, the final clusters are found in four assignment-update steps. In these and other figures displaying K-means clustering, each subfigure shows (1) the centroids at the start of the iteration and (2) the assignment of the points to those centroids. The centroids are indicated by the "+" symbol; all points belonging to the same cluster have the same marker shape.

In the first step, shown in Fig. 19(a), points are assigned to the initial centroids, which are all in the larger group of points. For this example, we use the mean as the centroid. After points are assigned to a centroid, the centroid is updated; in the second step, points are assigned to the updated centroids, and the centroids are updated again. In steps 2, 3, and 4, which are shown in Fig. 19(b), (c), and (d), respectively, two of the centroids move to the two small groups of points
at the bottom of the figures. When the K-means algorithm terminates in Fig. 19(d), because no more changes occur, the centroids have identified the natural groupings of points.

Let us understand this in the following example.

Example 7: Suppose two clusters are to be formed for the observations listed in Fig. 20(a). We begin by arbitrarily assigning a, b and d to Cluster 1, and c and e to Cluster 2. The cluster centroids are calculated as shown in Fig. 20(a).

Fig. 20(a):

Cluster 1                         Cluster 2
Observation   X1     X2           Observation   X1     X2
a              2      4           c              9      3
b              8      2           e              8.5    1
d              1      5
Average        3.67   3.67        Average        8.75   2

[Fig. 20(b): scatter plot of the observations in the (X1, X2) plane with the centroids marked C1 and C2]

Fig. 20: Means Method (Step 1)

The cluster centroid is the point with coordinates equal to the average values of the variables for the observations in that cluster. Thus, the centroid of Cluster 1 is the point (X1 = 3.67, X2 = 3.67), and that of Cluster 2 the point (8.75, 2). The two centroids are marked by C1 and C2 in Fig. 20(b). The cluster's centroid, therefore, can be considered the centre of the observations in the cluster, as shown in Fig. 20(b). We now calculate the distance between a and the two centroids:

D(a, abd) = √((2 − 3.67)² + (4 − 3.67)²) = 1.702,
D(a, ce)  = √((2 − 8.75)² + (4 − 2)²) = 7.040.

Since a is nearer the centroid of Cluster 1, it stays in Cluster 1. The distances of the other observations are calculated in the same way. Since b is closer to Cluster 2's centroid than to that of Cluster 1, it is reassigned to Cluster 2. The new cluster centroids are calculated as shown in Fig. 21(a). The new centroids are plotted in Fig. 21(b). The distances of the observations from the new cluster centroids are shown in Fig. 21(c) (an asterisk indicates the nearest centroid):

Fig. 21(a):

Cluster 1                         Cluster 2
Observation   X1     X2           Observation   X1     X2
a              2      4           c              9      3
d              1      5           e              8.5    1
                                  b              8      2
Average        1.5    4.5         Average        8.5    2

[Fig. 21(b): scatter plot of the observations with the new centroids marked C1 and C2]

Fig. 21(c):
              Distance from
Observation   Cluster 1   Cluster 2
a             0.707*      6.801
b             6.964       0.500*
c             7.649       1.118*
d             0.707*      8.078
e             7.826       1.000*

Fig. 21: Means Method (Step 2)
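The new centroids in Fig. 21(a) and the distance table in Fig. 21(c) can be verified with a few lines of Python; the sketch below recomputes both after b has been reassigned to Cluster 2.

```python
from math import dist

obs = {"a": (2, 4), "b": (8, 2), "c": (9, 3), "d": (1, 5), "e": (8.5, 1)}
clusters = {1: ["a", "d"], 2: ["c", "e", "b"]}   # membership after b is reassigned

# New centroids: coordinate-wise mean of each cluster's members (Fig. 21(a)).
centroids = {k: tuple(sum(obs[m][i] for m in members) / len(members) for i in range(2))
             for k, members in clusters.items()}
print(centroids)   # {1: (1.5, 4.5), 2: (8.5, 2.0)}

# Distance of each observation from the two centroids (Fig. 21(c)).
for name, p in obs.items():
    d1, d2 = dist(p, centroids[1]), dist(p, centroids[2])
    nearest = 1 if d1 < d2 else 2
    print(f"{name}: {d1:.3f}  {d2:.3f}  -> nearest cluster {nearest}")
```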
Every observation belongs to the cluster to the centroid of which it
is nearest, and the k-means method stops. The elements of the two clusters are shown in Fig. 21(c).

***

Now, we list the benefits and drawbacks of the k-means method.

Benefits:
1) Very fast algorithm (O(k · d · N), if we limit the number of iterations)
2) Convenient centroid vector for every cluster
3) Can be run multiple times to get different results

Limitations:
1) Difficult to choose the number of clusters, k
2) Cannot be used with arbitrary distances
3) Sensitive to scaling – requires careful preprocessing
4) Does not produce the same result every time
5) Sensitive to outliers (squared errors emphasize outliers)
6) Cluster sizes can be quite unbalanced (e.g., one-element outlier clusters)

Try an exercise.

Now let us summarise what we have learnt in this unit.

12.8 SUMMARY

We have discussed the following points:
1) Concept of clustering.
2) Various distance measures.
3) Various clustering methods.
4) Analyzed various hierarchical clustering algorithms in detail.
5) Analyzed various partitional clustering and k-NN clustering algorithms.

Special Cases:
• p = 2: Euclidean distance
• p = 1: Manhattan distance

The commonly used Euclidean distance between two objects is achieved when p = 2:

d_ij = ((x_i1 − x_j1)² + (x_i2 − x_j2)² + ⋯ + (x_id − x_jd)²)^(1/2)

Another well-known measure is the Manhattan distance, which is obtained when p = 1:

d_ij = |x_i1 − x_j1| + |x_i2 − x_j2| + ⋯ + |x_id − x_jd|

d_ij = (μ_i − μ_j)^T Σ_ij (μ_i − μ_j)
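As a quick illustration of these special cases, the sketch below computes the Minkowski distance of order p between two vectors for p = 1 (Manhattan) and p = 2 (Euclidean), using observations a and b from the earlier examples.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (2, 4), (8, 2)           # observations a and b from the examples above
print(minkowski(x, y, p=1))     # Manhattan distance: |2-8| + |4-2| = 8
print(minkowski(x, y, p=2))     # Euclidean distance: sqrt(36 + 4) ≈ 6.325
```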
iii) Agglomerative clustering
iv) Divisive clustering
v) Probabilistic clustering

E3)
[Taxonomy diagram: Clustering, with branches Hierarchical and Bayesian]

dist({3,6}, {4}) = max(dist(3,4), dist(6,4))
                 = max(0.15, 0.22)
                 = 0.22

dist({3,6}, {2,5}) = max(dist(3,2), dist(6,2), dist(3,5), dist(6,5))
                   = max(0.15, 0.25, 0.28, 0.39)
                   = 0.39

dist({3,6}, {1}) = max(dist(3,1), dist(6,1))
                 = max(0.22, 0.23)
                 = 0.23

(iii) Average link clustering
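The complete-link distances computed above can be reproduced with a short sketch; the same helper also shows what the corresponding average-link distances would be for part (iii), assuming the same pairwise distances.

```python
from itertools import product

# Pairwise distances quoted in the answer above (only the pairs that are used).
d = {(3, 4): 0.15, (6, 4): 0.22,
     (3, 2): 0.15, (6, 2): 0.25, (3, 5): 0.28, (6, 5): 0.39,
     (3, 1): 0.22, (6, 1): 0.23}

def pair_dist(i, j):
    return d.get((i, j), d.get((j, i)))

def cluster_dist(A, B, combine):
    """Distance between clusters A and B; combine is max (complete link) or a mean (average link)."""
    return combine([pair_dist(i, j) for i, j in product(A, B)])

mean = lambda vals: sum(vals) / len(vals)
for B in ({4}, {2, 5}, {1}):
    print(B,
          "complete:", round(cluster_dist({3, 6}, B, max), 3),
          "average:", round(cluster_dist({3, 6}, B, mean), 4))
```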