Adaptive clustering algorithm for determining the
optimal partition size
Eliézer Béczi
November 17, 2020
In this paper we present the hierarchical agglomerative clustering algorithm.
We introduce concepts such as the metric and the linkage criterion, which form the
basis of the algorithm. Although the algorithm does not need any predefined
number of clusters, as it generates a dendrogram, we discuss various
stopping criteria. This is important because in some cases one might want
to stop at a prespecified number of clusters. For these cases we present two
methods from the literature (the elbow and the silhouette methods), which help
us determine the optimal partition size for a given input.
Contents
1 Introduction
2 Hierarchical agglomerative clustering
2.1 Metric
2.2 Linkage criteria
2.3 Algorithm
2.4 Stopping criteria
3 Optimal partition size
4 Conclusion
1 Introduction
Clustering is a machine learning technique for grouping similar objects together. Given a set of
objects, we can use a clustering algorithm to classify each object into a specific group. Objects
that belong to the same group should have similar properties, while objects in different groups
should have highly dissimilar properties.
Clustering is a method of unsupervised learning, where the data we want to describe is not
labeled. We have little information about the expected outcome: the algorithm only has the
data, and it should cluster it in the best way possible.
There are many well-known clustering algorithms (K-Means, DBSCAN, Mean-Shift, etc.),
but we will focus only on hierarchical agglomerative clustering.
2 Hierarchical agglomerative clustering
Agglomerative clustering [2] is the most common type of hierarchical clustering used to
group objects into clusters based on their similarity. It is a bottom-up approach that treats each
object as a singleton cluster, and then successively merges pairs of clusters until all clusters
have been merged into a single cluster that contains all objects.
In order to decide which clusters should be combined, a measure of dissimilarity between
groups of objects is required. This is achieved by using an appropriate metric (a measure of
distance between two objects), and a linkage criterion which specifies the distance between two
clusters.
2.1 Metric
The metric is a function that measures the distance between pairs of objects. Next we present
some commonly used metric functions between two n-dimensional objects a and b:
• Euclidean distance: $\| a - b \|_2 = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$.
• Squared Euclidean distance: $\| a - b \|_2^2 = \sum_{i=1}^{n} (a_i - b_i)^2$.
• Manhattan distance: $\| a - b \|_1 = \sum_{i=1}^{n} |a_i - b_i|$.
• Maximum distance: $\| a - b \|_\infty = \max_i |a_i - b_i|$.
The choice of an appropriate metric function is important because it will influence the shape
of the clusters. Some elements may be closer to each other under one metric than under another.
For non-numeric data the Hamming or Levenshtein distances can be used.
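To make the definitions above concrete, the following is a minimal sketch of the four metrics in Python with NumPy; the function and variable names are illustrative and not part of the paper.

```python
import numpy as np

def euclidean(a, b):
    # ||a - b||_2 = sqrt(sum_i (a_i - b_i)^2)
    return np.sqrt(np.sum((a - b) ** 2))

def squared_euclidean(a, b):
    # ||a - b||_2^2 = sum_i (a_i - b_i)^2
    return np.sum((a - b) ** 2)

def manhattan(a, b):
    # ||a - b||_1 = sum_i |a_i - b_i|
    return np.sum(np.abs(a - b))

def maximum(a, b):
    # ||a - b||_inf = max_i |a_i - b_i|
    return np.max(np.abs(a - b))

a, b = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(euclidean(a, b), squared_euclidean(a, b), manhattan(a, b), maximum(a, b))
```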
2.2 Linkage criteria
The linkage criterion determines the distance between two clusters as a function of the pairwise
distances between objects. Next we present some commonly used linkage criteria between two
clusters A and B, where d is the chosen metric (a code sketch follows the list):
• Complete-linkage clustering: $D(A, B) = \max \{ d(a, b) : a \in A, b \in B \}$.
– Distance between farthest elements in clusters.
• Single-linkage clustering: $D(A, B) = \min \{ d(a, b) : a \in A, b \in B \}$.
– Distance between closest elements in clusters.
• Average linkage clustering: $D(A, B) = \frac{1}{|A| \cdot |B|} \sum_{a \in A} \sum_{b \in B} d(a, b)$.
– Average of all pairwise distances.
• Ward's method [3]: $D(A, B) = \frac{2 \cdot |A| \cdot |B|}{|A| + |B|} \cdot \| c_A - c_B \|_2^2$, where $c_A$ and $c_B$ are the centroids
of clusters A and B.
– Minimizes the total within-cluster variance.
– The two clusters whose merge results in the smallest information loss are combined.

Figure 1: Merge strategies: (a) complete linkage, (b) single linkage, (c) average linkage, (d) Ward's method.
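The linkage criteria above translate directly into code. Below is an illustrative sketch, assuming each cluster is given as a list of NumPy vectors and d is a metric such as the euclidean function from the previous listing; none of these names come from the paper.

```python
import numpy as np

def complete_linkage(A, B, d):
    # Distance between the farthest pair of elements.
    return max(d(a, b) for a in A for b in B)

def single_linkage(A, B, d):
    # Distance between the closest pair of elements.
    return min(d(a, b) for a in A for b in B)

def average_linkage(A, B, d):
    # Average of all pairwise distances.
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

def ward_linkage(A, B, d=None):
    # 2|A||B| / (|A| + |B|) * ||c_A - c_B||_2^2, with centroids c_A and c_B.
    # The metric d is unused: Ward's method is defined via the centroids.
    c_A, c_B = np.mean(A, axis=0), np.mean(B, axis=0)
    return 2 * len(A) * len(B) / (len(A) + len(B)) * np.sum((c_A - c_B) ** 2)
```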
2.3 Algorithm
The following steps describe how the hierarchical agglomerative clustering algorithm works (a minimal code sketch follows the list):
1. We treat each object as a single cluster.
2. At each iteration, we merge two clusters together. The two clusters to be combined are
those that optimize the function defined by the linkage criterion.
3. Step 2 is repeated until we only have one cluster that contains all the objects.
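A naive rendering of these three steps might look as follows. This is only an illustrative sketch (it rescans all cluster pairs at every iteration, which real libraries avoid); it expects a linkage callable with the signature of the hypothetical functions from the previous listing, e.g. single_linkage with euclidean.

```python
def naive_agglomerative(points, linkage, metric):
    """Naive version of the three steps above; `linkage(A, B, metric)` is any
    of the criteria sketched earlier (e.g. single_linkage with euclidean)."""
    # Step 1: treat each object as a singleton cluster.
    clusters = [[p] for p in points]
    merges = []
    # Step 3: repeat until only one cluster remains.
    while len(clusters) > 1:
        # Step 2: merge the pair of clusters that minimizes the linkage function.
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]], metric))
        merges.append((list(clusters[i]), list(clusters[j])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # The recorded merge order encodes the dendrogram.
    return merges
```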
Example
Figure 2 shows an example of how a hierarchical agglomerative clustering algorithm works.
We see that our input data consists of 6 objects labeled from {a} to {f}. The first step is to
determine which two clusters to merge. This is done according to the linkage criterion.
Figure 2: An example of the hierarchical agglomerative clustering.
Let's say that we have chosen single-linkage clustering as our linkage criterion. This means that
we merge the two clusters that contain the two closest elements. We assume that
in this example elements {b} and {c} are the closest and merge them into a single cluster {b, c}.
We now have the clusters {a}, {b, c}, {d}, {e} and {f}, and we want to merge them
further. The next closest pair is {d} and {e}, so we also merge them into a single
cluster {d, e}. We continue this process until all the clusters have been merged into a
single cluster that contains all the elements {a, b, c, d, e, f}.
When several pairs of clusters are at the same minimum distance, one of them is chosen at
random; depending on this choice, structurally different dendrograms can be generated. As an
alternative, all tied pairs can be merged at the same time, generating a multidendrogram [1].
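The walk-through above can be reproduced with SciPy's implementation (plus Matplotlib for the plot). The one-dimensional coordinates below are invented so that b/c and d/e are the closest pairs, as in the example.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

labels = ["a", "b", "c", "d", "e", "f"]
X = np.array([[0.0], [2.0], [2.1], [5.0], [5.2], [9.0]])  # made-up coordinates

Z = linkage(X, method="single")   # single-linkage merge strategy
print(Z)                          # each row: the two merged clusters, their distance, new size

dendrogram(Z, labels=labels)      # draw the resulting tree
plt.show()
```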
2.4 Stopping criteria
Hierarchical clustering does not require a prespecified number of clusters. Since the algorithm
builds a tree, we can later select the number of clusters that best fits our input data.
However, in some cases we may want to stop merging clusters at a given point
to save computational resources.
In these cases, a number of stopping criteria can be used to determine the cutting point [2] (a code sketch follows the list):
• Number criterion: we stop clustering when we reach the desired number of clusters.
• Distance criterion: we stop clustering when the clusters are too far apart to be merged.
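Both stopping criteria can be expressed with SciPy's fcluster applied to a linkage matrix; the toy data and thresholds below are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0], [2.0], [2.1], [5.0], [5.2], [9.0]])  # same toy data as above
Z = linkage(X, method="single")

# Number criterion: cut the tree at a prespecified number of clusters.
labels_by_number = fcluster(Z, t=3, criterion="maxclust")

# Distance criterion: stop merging once clusters are farther apart than a threshold.
labels_by_distance = fcluster(Z, t=1.5, criterion="distance")

print(labels_by_number)
print(labels_by_distance)
```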
3 Optimal partition size
It is a rather difficult problem to determine the optimal number of clusters in an input data set,
regardless of the clustering method we choose to use. There are several methods in the literature
that address this problem; we present two of them below (a code sketch follows the list):
• The elbow method [5]: we plot the average within-cluster sum of squared distances
against the number of clusters and look for a visual "elbow" (the point where the slope
changes from steep to shallow), which marks the optimal number of clusters. The average
within-cluster sum of squares is the average distance between objects inside a cluster.
• The silhouette method [4]: we calculate the average silhouette score of all the objects for
different values of k. The optimal number of clusters k is the one that maximizes the
average silhouette score.
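A sketch of both methods with scikit-learn is shown below; the synthetic two-blob data, the range of k values, and the variable names are all illustrative.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])  # two synthetic blobs

for k in range(2, 8):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    # Elbow method: average within-cluster sum of squared distances to the centroid.
    wss = np.mean([np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
                   for c in range(k)])
    # Silhouette method: average silhouette score over all objects.
    sil = silhouette_score(X, labels)
    print(f"k={k}  avg within-cluster SS={wss:.2f}  avg silhouette={sil:.3f}")
```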
Figure 3 shows an example of both the elbow and the silhouette methods. Both methods identify
k = 2 as the optimal partition size. In the case of the elbow method we can see that exactly
at k = 2 the slope changes from steep to shallow. Meanwhile, in the case of the silhouette method
we can see that k = 2 yields the highest average silhouette score.
Figure 3: A visual representation of the elbow method on the left and of the silhouette method
on the right.
4 Conclusion
In conclusion, we presented an adaptive clustering algorithm, namely hierarchical agglomerative
clustering, an unsupervised machine learning technique for grouping similar objects together,
along with two methods for determining the optimal partition size: the elbow and the silhouette
methods. Hierarchical agglomerative clustering is a useful technique, but there is always room
for improvement.
References
[1] Alberto Fernández and Sergio Gómez. “Solving non-uniqueness in agglomerative hierarchical clustering using multidendrograms”. In: Journal of Classification 25.1 (2008), pp. 43–65.
[2] Christopher D. Manning, Hinrich Schütze, and Prabhakar Raghavan. Introduction to Information Retrieval. Cambridge University Press, 2008.
[3] Fionn Murtagh and Pierre Legendre. “Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion?” In: Journal of Classification 31.3 (2014), pp. 274–295.
[4] Tippaya Thinsungnoena et al. “The clustering validity with silhouette and sum of squared errors”. In: learning 3.7 (2015).
[5] Antoine E. Zambelli. “A data-driven approach to estimating the number of clusters in hierarchical clustering”. In: F1000Research 5 (2016).