Hierarchical Clustering in Machine Learning
Clustering is an unsupervised learning technique that groups data points based on their similarity to one another. There are several types of clustering methods in machine learning, outlined below (a short code sketch covering each family follows the list).
Connectivity-based clustering: This type of clustering algorithm builds clusters based on how data points are connected to one another. Hierarchical clustering is an example.
Centroid-based clustering: This type of clustering algorithm groups data points around cluster centroids. Examples include K-Means and K-Modes clustering.
Distribution-based clustering: This clustering process is modelled with statistical distributions. It assumes that the data points in a cluster are generated from a particular probability distribution, and the method estimates the parameters of that distribution in order to group comparable data points into clusters. Example: Gaussian Mixture Models (GMM).
Density-based clustering: This kind of clustering method groups together data points that lie in high-density regions and separates points that lie in low-density regions. The main idea is to find dense regions of points in the data space and cluster those points together. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is one example.
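As a quick illustration of these four families, the sketch below is a minimal example, assuming scikit-learn is installed; the toy data and parameter values (for instance the DBSCAN eps) are arbitrary choices made only for demonstration.

import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans, DBSCAN
from sklearn.mixture import GaussianMixture

# small toy dataset used only for illustration
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# connectivity-based: hierarchical (agglomerative) clustering
print(AgglomerativeClustering(n_clusters=2).fit_predict(X))
# centroid-based: K-Means
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))
# distribution-based: Gaussian Mixture Model
print(GaussianMixture(n_components=2, random_state=0).fit_predict(X))
# density-based: DBSCAN (eps chosen by hand for this toy data)
print(DBSCAN(eps=2.5, min_samples=2).fit_predict(X))

Each call prints one cluster label per data point; only the hierarchical estimator also gives access to the merge tree discussed in the rest of this article.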
Hierarchical clustering
Hierarchical clustering is a connectivity-based clustering method that groups data points according to their similarity or distance. Data points that are closer together are considered more similar or connected than those that are farther apart.
The hierarchical relationships between clusters are shown by a dendrogram, a tree-like diagram produced by hierarchical clustering. In the dendrogram, individual data points appear at the bottom, while the largest cluster, which contains all of the data points, appears at the top. The dendrogram can be cut at different heights to produce different numbers of clusters.
Figure: Dendrogram
What is a Dendrogram?
A tree diagram showing the arrangement of clusters produced by
hierarchical clustering.
Each merge of two clusters is drawn as a line connecting them, and the height of that merge on the distance axis indicates how far apart the merged clusters were.
The height at which clusters merge can guide in determining the optimal number of clusters.
To form the dendrogram, clusters are iteratively merged or split according to a distance or similarity metric between data points. Merging or splitting continues until every data point is contained in a single cluster or until the target number of clusters is reached.
To determine a suitable number of clusters, we can examine the dendrogram and find the height at which the branches separate into distinct clusters. Cutting the dendrogram at that height yields the corresponding clusters.
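To make this cutting step concrete, the following sketch is a minimal example assuming SciPy is available; the toy data and the cut height of 5 are arbitrary choices for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy dataset used only for illustration
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# build the merge tree (linkage matrix) with Ward's method
Z = linkage(X, method='ward')

# cut the dendrogram at height 5: merges that happen above this distance
# are ignored, and the groups remaining below it become the clusters
labels = fcluster(Z, t=5, criterion='distance')
print(labels)  # two clusters for this toy data, e.g. [1 1 1 2 2 2]

Raising or lowering the cut height t changes how many clusters are returned, which is exactly the idea of cutting the dendrogram at different heights described above.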
Types of Hierarchical Clustering
There are two main types of hierarchical clustering:
1. Agglomerative Clustering
2. Divisive Clustering
Hierarchical Agglomerative Clustering
It is sometimes referred to as hierarchical agglomerative clustering (HAC) or the bottom-up technique. It produces an organized hierarchy of clusters that yields more information than the unstructured set of clusters obtained from flat clustering, and the number of clusters does not need to be specified in advance. Bottom-up algorithms start by treating each data point as a singleton cluster and then repeatedly merge pairs of clusters until all of the data is combined into a single cluster.
Algorithm:
given a dataset (d1, d2, d3, ....dN) of size N
# compute the distance matrix
for i = 1 to N:
    # as the distance matrix is symmetric about
    # the primary diagonal, we compute only the lower
    # part of the primary diagonal
    for j = 1 to i:
        dis_mat[i][j] = distance(di, dj)
each data point is a singleton cluster
repeat
    merge the two clusters having minimum distance
    update the distance matrix
until only a single cluster remains
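The pseudocode above can be turned into a short, naive implementation. The sketch below is an illustrative single-linkage version using only NumPy; names such as dis_mat mirror the pseudocode and are not part of any library.

import numpy as np

# toy dataset used only for illustration
data = np.array([[1, 2], [1, 4], [1, 0],
                 [4, 2], [4, 4], [4, 0]], dtype=float)
N = len(data)

# compute the full N x N distance matrix (symmetric, zero diagonal)
dis_mat = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)

# every data point starts as a singleton cluster (stored as index lists)
clusters = [[i] for i in range(N)]

# single-linkage distance between two clusters: minimum pairwise distance
def cluster_distance(a, b):
    return min(dis_mat[i][j] for i in a for j in b)

# repeatedly merge the two closest clusters until one cluster remains
while len(clusters) > 1:
    p, q = min(
        ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
        key=lambda pq: cluster_distance(clusters[pq[0]], clusters[pq[1]]),
    )
    print("merging", clusters[p], "and", clusters[q])
    merged = clusters[p] + clusters[q]
    clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]

This version recomputes cluster distances from the full matrix at every step, which is simple but slow; the complexity discussion at the end of this article covers faster variants.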
Figure: Hierarchical Agglomerative Clustering of points A-F
Steps:
Consider each letter (A, B, C, D, E, F) as a single cluster and calculate the distance of each cluster from all the other clusters.
In the second step, comparable clusters are merged to form a single cluster. Suppose cluster (B) and cluster (C) are very similar to each other, so we merge them; similarly, clusters (D) and (E) are merged. We are left with the clusters [(A), (BC), (DE), (F)].
We recalculate the proximities according to the algorithm and merge the two nearest clusters, (DE) and (F), to obtain the clusters [(A), (BC), (DEF)].
Repeating the same process, the clusters (DEF) and (BC) are comparable and are merged to form a new cluster. We are now left with the clusters [(A), (BCDEF)].
Finally, the two remaining clusters are merged to form a single cluster [(ABCDEF)].
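This merge order can also be traced programmatically. The sketch below uses hypothetical 1-D coordinates for points A-F, chosen only so that B/C and D/E are the closest pairs as in the steps above, and prints each merge recorded in SciPy's linkage matrix.

import numpy as np
from scipy.cluster.hierarchy import linkage

# hypothetical coordinates for A-F (illustrative values only)
points = {'A': 0.0, 'B': 5.0, 'C': 5.4, 'D': 9.0, 'E': 9.5, 'F': 11.0}
names = list(points)
X = np.array(list(points.values())).reshape(-1, 1)

# single linkage = merge the pair of clusters with minimum distance
Z = linkage(X, method='single')

# each row of Z is (cluster_i, cluster_j, distance, new_cluster_size);
# cluster ids >= len(X) refer to clusters created by earlier merges
labels = list(names)
for i, j, dist, size in Z:
    merged = labels[int(i)] + labels[int(j)]
    print(f"merge {labels[int(i)]} + {labels[int(j)]} -> {merged} at distance {dist:.1f}")
    labels.append(merged)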
Python implementation of the above algorithm using the scikit-learn
library:
from sklearn.cluster import AgglomerativeClustering
import numpy as np
# randomly chosen dataset
X = np.array([[1, 2], [1, 4], [1, 0],
[4, 2], [4, 4], [4, 0]])
# specify the desired number of clusters
# (here we ask for two clusters)
clustering = AgglomerativeClustering(n_clusters=2).fit(X)
# print the class labels
print(clustering.labels_)
Output:
[1 1 1 0 0 0]
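If the full hierarchy is wanted rather than a fixed number of clusters, a minimal variation of the example above (assuming a recent scikit-learn version) sets n_clusters=None with distance_threshold=0, which makes the estimator build the complete merge tree.

from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# build the full hierarchy instead of stopping at a fixed number of clusters
full_tree = AgglomerativeClustering(n_clusters=None, distance_threshold=0).fit(X)

print(full_tree.children_)   # which clusters were merged at each step
print(full_tree.distances_)  # the distance at which each merge happened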
Hierarchical Divisive Clustering
Another name for it is the top-down approach. With this algorithm, too, the number of clusters does not need to be specified in advance. Top-down clustering starts with a single cluster that contains all of the data and recursively splits it until every data point ends up in its own singleton cluster.
Algorithm:
given a dataset (d1, d2, d3, ....dN) of size N
at the top we have all the data in one cluster
the cluster is split using a flat clustering method, e.g. K-Means
repeat
    choose the best cluster among all the clusters to split
    split that cluster with the flat clustering algorithm
until each data point is in its own singleton cluster
Figure: Hierarchical Divisive Clustering
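A minimal sketch of this top-down idea is shown below, assuming scikit-learn's KMeans as the flat "subroutine"; splitting the largest remaining cluster first is just one simple heuristic chosen for illustration, not part of any standard routine.

import numpy as np
from sklearn.cluster import KMeans

# toy dataset used only for illustration
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# start with all points in a single cluster (stored as index arrays)
clusters = [np.arange(len(X))]

# keep splitting until every point is in its own singleton cluster
while any(len(c) > 1 for c in clusters):
    # heuristic: split the largest remaining cluster next
    idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
    members = clusters.pop(idx)

    # use a flat clustering method (2-means) to split it in two
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[members])
    clusters.append(members[labels == 0])
    clusters.append(members[labels == 1])
    print([list(c) for c in clusters])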
Computing Distance Matrix
While merging clusters, we measure the distance between each pair of clusters and combine the pair with the greatest similarity, i.e. the smallest distance. The question is how that distance between clusters is calculated. The distance (or similarity) between clusters can be defined in a variety of ways. Some of them are:
1. Min Distance (single linkage): the minimum distance between any point of one cluster and any point of the other cluster.
2. Max Distance (complete linkage): the maximum distance between any point of one cluster and any point of the other cluster.
3. Group Average (average linkage): the average distance over all pairs of points, one from each cluster.
4. Ward's Method: the similarity of two clusters is based on the increase in squared error when the two clusters are merged.
For example, if we cluster the same data using different linkage methods, we may get different results:
Figure: Distance Matrix Comparison in Hierarchical Clustering
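The effect of the linkage choice can be checked with SciPy; the sketch below is a minimal example on the same toy data used elsewhere in this article, building the linkage matrix with each of the four criteria and printing the resulting two-cluster assignment.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# 'single' = min distance, 'complete' = max distance,
# 'average' = group average, 'ward' = Ward's method
for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion='maxclust')  # ask for two clusters
    print(method, labels)

On well-separated toy data all four criteria tend to agree; on noisier or elongated clusters the choice of linkage can change the result noticeably.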
Implementation code
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
# randomly chosen dataset
X = np.array([[1, 2], [1, 4], [1, 0],
[4, 2], [4, 4], [4, 0]])
# Perform hierarchical clustering
Z = linkage(X, 'ward')
# Plot dendrogram
dendrogram(Z)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Data point')
plt.ylabel('Distance')
plt.show()
Output:
Figure: Hierarchical Clustering Dendrogram
Hierarchical Agglomerative vs Divisive Clustering
Divisive clustering is more complex than agglomerative clustering, because divisive clustering needs a flat clustering method as a "subroutine" to split each cluster until every data point is in its own singleton cluster.
Divisive clustering is more efficient if we do not generate a complete hierarchy all the way down to individual data points. The time complexity of naive agglomerative clustering is O(n³), because we exhaustively scan the N x N distance matrix dis_mat for the lowest distance in each of the N-1 iterations. Using a priority queue data structure this can be reduced to O(n² log n), and with further optimizations it can be brought down to O(n²). For divisive clustering, on the other hand, given a fixed number of top levels and an efficient flat algorithm such as K-Means, the running time is linear in the number of patterns and clusters.
A divisive algorithm can also be more accurate. Agglomerative clustering makes decisions based on local patterns or neighbouring points without initially taking the global distribution of the data into account, and these early decisions cannot be undone. Divisive clustering, in contrast, takes the global distribution of the data into account when making its top-level partitioning decisions.