CLUSTERING
Module-3
What is Parametric Density Estimation?
In parametric density estimation, we assume that the data we are working with
comes from a specific type of distribution, for example a Gaussian (normal)
distribution.
Why is this useful?
If we know the type of distribution, we only need to estimate a few
parameters to describe the entire dataset.
Example:
If the data is Gaussian, we just need:
• The mean (center of the data)
• The covariance (shape/spread of the data)
• This is called a parametric approach, because the entire model is
defined by a few parameters.
Limitations of Parametric Models
• However, assuming that all data fits nicely into one type of
distribution (like Gaussian) can sometimes cause errors or bias.
What if the data doesn't form a single group?
• Example:
• In optical character recognition, people write the digit 7 in different
ways:
• American style: just a plain 7.
• European style: 7 with a horizontal bar in the middle.
• These two styles form two different groups, but belong to the same
class (digit 7).
• If we use just one Gaussian, it will not represent both styles properly.
Solution: Use Mixtures – Semiparametric Density Estimation
• We can solve this using semiparametric models.
What is Semiparametric Estimation?
• We still assume a parametric model (like Gaussian),
• But we allow multiple groups (or components) within a class.
• This means we represent the data as a mixture of Gaussians — one
for each group.
• Example:
• The digit ‘7’ class = Gaussian for American style + Gaussian for
European style.
• This is called a mixture model (like Gaussian Mixture Models
(GMMs)).
What Is a Mixture Model?
• In semiparametric density estimation, we model the data as coming
from multiple subgroups, not just one.
• This is useful when your data doesn’t form a single cluster but several.
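As an illustration, here is a minimal sketch of fitting a Gaussian mixture model, assuming scikit-learn is available and using synthetic two-group data (the coordinates and parameters below are hypothetical):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two "styles" of the same class, e.g. two ways of writing the digit 7.
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
group_b = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(100, 2))
X = np.vstack([group_a, group_b])

# Fit a mixture of two Gaussians; each component has its own mean and covariance.
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)
print(gmm.means_)    # one mean per component
print(gmm.weights_)  # mixing proportions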
Choosing the Number of Clusters
One method for choosing the number of clusters is the elbow method.
For each candidate number of clusters, it runs the clustering and computes the
within-cluster sum of squares (WCSS): the sum of squared Euclidean distances
between each data point and its cluster center.
WCSS, the total variance within the clusters, is plotted against the number of
clusters, and the chosen value is the point where the decrease in WCSS levels
off, as sketched below.
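A minimal sketch of the elbow computation, assuming scikit-learn and synthetic data (the dataset and the range of k values are illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # hypothetical data

# WCSS (KMeans' inertia_) for k = 1..10; choose the k where the curve levels off.
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)
print(wcss)  # plotting wcss against k reveals the "elbow"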
Hierarchical clustering
• Hierarchical clustering refers to a clustering process that
organizes the data into large groups, which contain smaller
groups and so on.
• A hierarchical clustering may be drawn as a tree or dendrogram.
• The finest grouping is at the bottom of the dendrogram, where each sample
forms a cluster by itself.
• The coarsest grouping is at the top of the dendrogram, where all samples are
grouped into one cluster.
Hierarchical clustering
• The figure illustrates hierarchical clustering: at the top level we have
Animals, followed by subgroups, and so on.
• We do not have to assume any particular number of clusters.
• The representation is called a dendrogram.
• Any desired number of clusters can be obtained by 'cutting' the dendrogram
at the proper level, as in the sketch below.
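A minimal sketch of building a dendrogram and cutting it to get a chosen number of clusters, assuming SciPy and a few hypothetical 2-D points:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1, 2], [2, 2], [8, 9], [9, 8], [25, 30]])  # hypothetical samples
Z = linkage(X, method='single')  # bottom of the tree: each sample is its own cluster

# dendrogram(Z) draws the tree; 'cutting' it gives any desired number of clusters,
# e.g. exactly 2 clusters here:
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)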
Types of clustering:
• Hierarchical Clustering:
– Agglomerative Clustering Algorithm
• The single Linkage Algorithm
• The Complete Linkage Algorithm
• The Average – Linkage Algorithm
– Divisive approach
• Polythetic: the division is based on more than one feature.
• Monothetic: only one feature is considered at a time.
Two types of Hierarchical Clustering
– Agglomerative:
• It is the more commonly used approach, more popular than the divisive approach.
• It follows a bottom-up approach: start with each point as an individual cluster.
• At each step, merge the closest pair of clusters, until only one cluster (or k clusters) is left.
• Examples: the single-linkage, complete-linkage, and average-linkage algorithms.
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster, until each cluster contains a single point (or
there are k clusters)
Example: Agglomerative
• 100 students from India join MS program in some particular
university in USA.
• Initially, each of them looks like a single cluster.
• After some time, 2 students from SJCE, Mysuru form a cluster.
• Similarly, another cluster of 3 students (patterns/samples) from RVCE joins
the SJCE students.
• Now these two clusters make a bigger cluster of Karnataka students.
• Later, a South Indian student cluster forms, and so on.
Example : Divisive approach
• In a large gathering of engineering students:
– First separate the JSS S&TU students,
– then, within them, the Computer Science students,
– then the 7th-semester students,
– and, as the final divisive subgroup, the C-section students.
Agglomerative Clustering Algorithm
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
Key operation is the computation of the proximity of two clusters
– Different approaches to defining the distance between
clusters distinguish the different algorithms
Data Points: 18,22,25,27,42,43
(Steps 1 to 6, shown as figures, merge the closest clusters one at a time until a single cluster remains; a code sketch follows.)
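A minimal sketch of these steps on the data points above, assuming single linkage and Euclidean distance (SciPy):

import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.array([[18], [22], [25], [27], [42], [43]])  # the data points, as 1-D samples
Z = linkage(points, method='single')

# Each row of Z is one merge: (cluster i, cluster j, merge distance, new cluster size).
# With these points the merges happen at distances 1, 2, 3, 4 and 15.
print(Z)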
Some commonly used criteria in Agglomerative clustering Algorithms
(The most popular distance measure used is Euclidean distance)
Single Linkage:
Distance between two clusters is the smallest pairwise distance between two
observations/nodes, each belonging to different clusters.
Complete Linkage:
Distance between two clusters is the largest pairwise distance between two
observations/nodes, each belonging to different clusters.
Mean or average linkage clustering:
Distance between two clusters is the average of all the pairwise distances,
each node/observation belonging to different clusters.
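Written as formulas, with d the pairwise distance between individual samples:
Single linkage: d(A, B) = min{d(x, y) : x ∈ A, y ∈ B}
Complete linkage: d(A, B) = max{d(x, y) : x ∈ A, y ∈ B}
Average linkage: d(A, B) = Σ d(x, y) / (|A| |B|), where the sum runs over x ∈ A, y ∈ B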
Single linkage… Continued
• The single linkage algorithm is also known as the minimum
method and the nearest neighbor method.
• Consider two clusters Ci and Cj, and let 'a' and 'b' be samples from Ci and
Cj respectively, with d(a, b) the distance between 'a' and 'b'.
• The single-linkage distance between the clusters is then
d(Ci, Cj) = min{d(a, b) : a ∈ Ci, b ∈ Cj}
Find the clusters using a single link technique. Use
Euclidean distance and draw the dendrogram.
The dataset contains 6 samples and 2 attributes.
Steps:
• Step 1: Compute the distance matrix
• So we have to find the Euclidean distance between every pair of points.
• Let A(x1, y1) and B(x2, y2) be two points.
• Then the Euclidean distance between them is
d(A, B) = √((x2 − x1)² + (y2 − y1)²)
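As a sketch, the whole distance matrix of Step 1 can be computed at once with SciPy; the coordinates below are hypothetical, since the original sample table is not reproduced here:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical 2-D samples P1..P6 (the actual feature values were given in a table).
P = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
              [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])
D = squareform(pdist(P, metric='euclidean'))  # 6 x 6 symmetric Euclidean distance matrix
print(np.round(D, 2))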
Step 2: Merging the two closest members.
• Here the minimum value is 0.10 and hence we combine P3 and P6
(as 0.10 came in the P6 row and P3 column).
• Now, form a cluster of the elements corresponding to the minimum value and
update the distance matrix.
• Repeating the merge-and-update steps yields the nested clustering:
[{(P3, P6), P4}, (P2, P5)], P1
Single linkage algorithm
• Consider the following scatter plot points.
• In single link hierarchical clustering, we merge in each step the
two clusters, whose two closest members have the smallest
distance
First level of distance computation D1
(Euclidean distance used)
• Use Euclidean distance for distance between samples.
• The table shown in the previous slide gives feature values for
each sample and the distance d between each pair of samples.
• The algorithm begins with five clusters, each consisting of one
sample.
• The two nearest clusters are then merged.
• The smallest number is 4 which is the distance between (1 and
2), so they are merged. Merged matrix is as shown in next slide.
D2 matrix
• In the next level, the smallest number in the matrix is 8
• It is between 4 and 5.
• Now the cluster 4 and 5 are merged.
• With this we will have 3 clusters: {1,2}, {3},{4,5}
• The matrix is as shown in the next slide.
D3 distance
• In the next step {1,2} will be merged with {3}.
• Now we will have two cluster {1,2,3} and {4,5}
• In the next step.. these two are merged to have single cluster.
• Dendrogram is as shown here.
• The height at which two clusters join in the dendrogram is the distance at
which they were merged.
For example, samples 1 and 2 are merged at the smallest distance, 4, so that
junction is drawn at height 4.
The complete linkage Algorithm
• It is also called the maximum method or the farthest neighbor
method.
• It is obtained by defining the distance between two clusters to be the
largest distance between a sample in one cluster and a sample in the other
cluster.
• If Ci and Cj are clusters, we define:
d(Ci, Cj) = max{d(a, b) : a ∈ Ci, b ∈ Cj}
Example Problem
• Given the dataset {a, b, c, d, e} and
the following distance matrix,
• Construct a dendrogram by complete
linkage hierarchical clustering using
the agglomerative method.
•The complete-linkage clustering uses
the "maximum formula", that is, the
following formula to compute the
distance between two clusters A and B:
d(A, B) = max{d(x, y) : x ∈ A, y ∈ B}
Dataset {a, b, c, d, e}.
Initial clustering (singleton sets)
C1: {a}, {b}, {c}, {d}, {e}.
From the table, the minimum distance
is the distance between the clusters
{c} and {e}.
Also, d({c}, {e}) = 2.
We merge {c} and {e} to form the
cluster {c, e}.
The new set of clusters C2:
{a}, {b}, {d}, {c, e}.
• Let us compute the distance of {c, e} from other clusters.
• d({c, e}, {a}) = max{d(c, a), d(e, a)} = max{3, 11} = 11
• d({c, e}, {b}) = max{d(c, b), d(e,b)} = max{7, 10} = 10
• d({c, e}, {d}) = max{d(c, d), d(e, d)} = max{9,8} = 9
• The following table gives the distances between the
various clusters in C2.
• From the table, the minimum distance is the distance
between the clusters {b} and {d}.
• Also, d({b}, {d}) = 5.
• We merge {b} and {d} to form the cluster {b, d}.
• The new set of clusters C3: {a}, {b, d}, {c, e}.
• Let us compute the distance of {b, d} from
other clusters.
• d({b,d}, {a}) = max{d(b, a), d(d, a)} =
max{9,6} =9
• d({b, d}, {c, e}) = max{d(b, c), d(b, e), d(d, c), d(d, e)}
= max{7, 10, 9, 8} = 10
From the table, the minimum distance
is the distance between the clusters
{a} and {b, d}.
Also, d({a}, {b, d}) = 9
We merge {a} and {b, d} to form the
cluster {a, b, d}.
The new set of clusters C4:
{a, b, d}, {c, e}
• d({a, b, d}, {c, e}) =
max{d(a, c), d(a, e), d(b, c), d(b, e), d(d, c), d(d, e)}
= max{3, 11, 7, 10, 9, 8} = 11
• Only two clusters are left. We merge them to form a single cluster
containing all the data points.
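The merge sequence above can be checked with SciPy, using the pairwise distances quoted in the computations (d(a,b)=9, d(a,c)=3, d(a,d)=6, d(a,e)=11, d(b,c)=7, d(b,d)=5, d(b,e)=10, d(c,d)=9, d(c,e)=2, d(d,e)=8); a minimal sketch:

from scipy.cluster.hierarchy import linkage

# Condensed distance matrix for {a, b, c, d, e}, ordered
# (a,b), (a,c), (a,d), (a,e), (b,c), (b,d), (b,e), (c,d), (c,e), (d,e).
dists = [9, 3, 6, 11, 7, 5, 10, 9, 2, 8]
Z = linkage(dists, method='complete')
print(Z)  # merges at heights 2 {c,e}, 5 {b,d}, 9 {a,b,d}, 11 {a,b,c,d,e}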
Example : Complete linkage algorithm
• Consider the same samples used in single linkage:
• Apply Euclidean distance and compute the distance.
• Algorithm starts with 5 clusters.
• As earlier samples 1 and 2 are the closest, they are merged first.
• While merging, the maximum distance is used as the new distance/cost value.
• For example, the distance between 1 & 3 is 11.7 and between 2 & 3 is 8.1;
the algorithm records 11.7 as the distance between {1, 2} and 3.
• In complete linkage hierarchical clustering, the distance
between two clusters is defined as the longest distance
between two points in each cluster.
• In the next level, the smallest distance in the matrix is 8.0
between 4 and 5. Now merge 4 and 5.
• In the next step, the smallest distance is 9.8 between 3 and {4,5},
they are merged.
• At this stage we will have two clusters {1,2} and {3,4,5}.
• Notice that these clusters are different from those obtained from
single linkage algorithm.
• At the next step, the two remaining clusters will be merged.
• The hierarchical clustering will be complete.
• The dendrogram is as shown in the figure.
The Average Linkage Algorithm
• The average linkage algorithm is an attempt to compromise
between the extremes of the single and complete linkage
algorithms.
• It is also known as the unweighted pair group method using
arithmetic averages (UPGMA).
The average-linkage clustering uses the "average formula", that is, the
following formula to compute the distance between two clusters A and B:
d(A, B) = avg{d(x, y) : x ∈ A, y ∈ B} = Σ d(x, y) / (|A| |B|),
where the sum runs over x ∈ A, y ∈ B.
Dataset {a, b, c, d, e}.
Initial clustering (singleton sets)
C1: {a}, {b}, {c}, {d}, {e}.
From the table, the minimum distance
is the distance between the clusters
{c} and {e}.
Also, d({c}, {e}) = 2.
We merge {c} and {e} to form the
cluster {c, e}.
The new set of clusters C2:
{a}, {b}, {d}, {c, e}.
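A sketch of average linkage on the same {a, b, c, d, e} distances used earlier, again with SciPy:

from scipy.cluster.hierarchy import linkage

dists = [9, 3, 6, 11, 7, 5, 10, 9, 2, 8]  # same condensed distance matrix as before
Z = linkage(dists, method='average')
print(Z)
# The first merge is again {c, e} at distance 2; afterwards the cluster
# distances are averages, e.g. d({c, e}, {a}) = (3 + 11) / 2 = 7.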
Example: Average linkage clustering algorithm
• Consider the same samples: compute the Euclidean distance
between the samples.
• In the next step, cluster 1 and 2 are merged, as the distance
between them is the least.
• The distance values are computed based on the average values.
• For example, the distance between 1 & 3 is 11.7 and between 2 & 3 is 8.1;
their average, 9.9, replaces the entry between {1, 2} and 3 in the matrix.
• In the next stage 4 and 5 are merged:
Example 2: Single Linkage
(The updated distance matrices after each merge were shown as figures.)
Example 3: Single linkage
As we are using single linkage, we choose the minimum distance; therefore, we choose 4.97
and take it as the distance between D1 and the cluster D4, D5. If we were using complete
linkage, the maximum value, 6.09, would have been selected as the distance between D1 and
D4, D5. If we were using average linkage, the average of these two distances would have been
taken, so the distance between D1 and D4, D5 would have come out to be 5.53
((4.97 + 6.09) / 2).
We repeat steps 2 and 3 until we are left with one cluster. We again look
for the minimum value, which comes out to be 1.78,
indicating that the next cluster to be formed
is obtained by merging the data points D1 and D2.
Similar to what we did in Step 3, we again recalculate the distances, this time
for the cluster D1, D2, and come up with the following updated distance matrix.
We repeat what we did in Step 2 and find the minimum value in the distance
matrix. The minimum value comes out to be 1.78, which indicates that we have to
merge D3 into the cluster D1, D2.
Single Link method:
• Find the minimum distance in the matrix.
• Merge the data points accordingly and form another cluster.
• Update the distance matrix using the single-link rule.
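A minimal sketch of the single-link update step, assuming a symmetric NumPy distance matrix D (the numbers in the usage lines are illustrative):

import numpy as np

def merge_single_link(D, i, j):
    """Merge clusters i and j: the new cluster's distances are the elementwise
    minimum of rows i and j; the merged rows/columns are removed."""
    keep = [k for k in range(len(D)) if k not in (i, j)]
    new_row = np.minimum(D[i], D[j])[keep]
    D2 = D[np.ix_(keep, keep)]                          # drop rows/columns i and j
    D2 = np.vstack([D2, new_row])                       # append the merged cluster's row
    D2 = np.column_stack([D2, np.append(new_row, 0.0)]) # ...and its column (0 on diagonal)
    return D2

D = np.array([[0.00, 1.78, 6.09],
              [1.78, 0.00, 4.97],
              [6.09, 4.97, 0.00]])
print(merge_single_link(D, 0, 1))  # distance from {0, 1} to 2 becomes min(6.09, 4.97) = 4.97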
Expectation-Maximization (EM) algorithm
• The Expectation-Maximization (EM) algorithm is an iterative method used
in unsupervised machine learning to estimate unknown values in statistical
models.
• It helps to find the best values for unknown parameters, especially
when some data is missing or hidden.
It works in two steps:
• E-step (Expectation Step): Estimates missing or hidden values using
current parameter estimates.
• M-step (Maximization Step): Updates model parameters to maximize
the likelihood based on the estimated values from the E-step.
E M Algorithm
• In real-world applications of machine learning, it is very common
that many relevant features are available for learning, but only a
small subset of them is observable.
• The Expectation-Maximization algorithm can be used for the latent
variables (variables that are not directly observable and are actually
inferred from the values of the other observed variables).
• This algorithm is actually the base for many unsupervised clustering
algorithms in the field of machine learning.
E M Algorithm
• Let us understand the EM algorithm in detail.
• Initially, a set of starting values for the parameters is chosen. A set of
incomplete observed data is given to the system, with the assumption that
the observed data comes from a specific model.
• The next step is known as "Expectation"-step or E-step. In this step, we use
the observed data in order to estimate or guess the values of the missing or
incomplete data. It is basically used to update the variables.
• The next step is known as "Maximization"-step or M-step. In this step, we
use the complete data generated in the preceding "Expectation" - step in
order to update the values of the parameters. It is basically used to update
the hypothesis.
• Now, in the fourth step, we check whether the values are converging; if they
are, we stop, otherwise we repeat step-2 and step-3, i.e. the "Expectation"
step and the "Maximization" step, until convergence occurs.
Algorithm Flowchart
EM Algorithm- Problem
• Expectation-Maximization (EM) – a very popular technique for
estimating parameters of probabilistic models.
• Many popular methods, such as Hidden Markov Models, Gaussian
Mixtures, Kalman Filters, and others, use the EM technique.
• It is beneficial when working with data that is incomplete, has missing
data points, or has unobserved latent variables.
• Assume that we have two coins, C₁ and C₂.
• Assume the bias of C₁ is θ₁ (i.e., the probability of getting heads with C₁).
• Assume the bias of C₂ is θ₂ (i.e., the probability of getting heads with C₂).
• We want to find θ₁ and θ₂ by performing a number of trials (i.e., coin
tosses).
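A minimal EM sketch for this two-coin setting, assuming each trial consists of 10 tosses of one (unknown) coin and that only the head counts are observed; the head counts and starting values below are hypothetical:

import numpy as np
from scipy.stats import binom

heads = np.array([5, 9, 8, 4, 7])   # heads observed in each trial (hypothetical)
n = 10                               # tosses per trial
theta1, theta2 = 0.6, 0.5            # initial guesses for the two biases

for _ in range(20):
    # E-step: estimate how likely each trial is to have come from coin 1 or coin 2.
    like1 = binom.pmf(heads, n, theta1)
    like2 = binom.pmf(heads, n, theta2)
    w1 = like1 / (like1 + like2)     # responsibility of coin 1 for each trial
    w2 = 1.0 - w1

    # M-step: re-estimate each bias as a responsibility-weighted fraction of heads.
    theta1 = np.sum(w1 * heads) / np.sum(w1 * n)
    theta2 = np.sum(w2 * heads) / np.sum(w2 * n)

print(theta1, theta2)  # the estimates settle down after a few iterations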
Example
Advantages of EM algorithm
• Always improves results – With each step, the algorithm improves the
likelihood (chances) of finding a good solution.
• Simple to implement – The two steps (E-step and M-step) are often easy to
code for many problems.
• Quick math solutions – In many cases, the M-step has a direct
mathematical (closed-form) solution, making it efficient.
Disadvantages of EM algorithm
• Takes time to finish: It converges slowly meaning it may take many
iterations to reach the best solution.
• Gets stuck in local best: Instead of finding the absolute best solution, it
might settle for a "good enough" one.
• Needs extra probabilities: Unlike some optimization methods that only
need forward probability, EM requires both forward and backward
probabilities making it slightly more complex.