Unit-4

Clustering in Machine Learning


The task of grouping data points based on their similarity to each other is called
Clustering or Cluster Analysis. This method falls under the branch of unsupervised
learning, which aims at gaining insights from unlabelled data points.
Think of it this way: you have a dataset of customers' shopping habits. Clustering can help you
group customers with similar purchasing behaviours, which can then be used for targeted
marketing, product recommendations, or customer segmentation.
For example, a clustering system would group similar kinds of data points together into
different clusters.

Types of Clustering
There are 2 types of clustering that can be performed to group similar data points:
• Hard Clustering: In this type of clustering, each data point either belongs to a cluster
completely or not at all. For example, let's say there are 4 data points and we have to cluster
them into 2 clusters. Each data point will either belong to cluster 1 or cluster 2.

Data Point    Cluster
A             C1
B             C2
C             C2
D             C1
• Soft Clustering: In this type of clustering, instead of assigning each data point to exactly
one cluster, a probability or likelihood of that point belonging to each cluster is evaluated.
For example, let's say there are 4 data points and we have to cluster them into 2 clusters.
We will evaluate a probability of each data point belonging to both clusters. This
probability is calculated for all data points.

Data Point    Probability of C1    Probability of C2
A             0.91                 0.09
B             0.3                  0.7
C             0.17                 0.83
D             1                    0
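
The difference can also be seen in code. Below is a minimal sketch, assuming scikit-learn is available, that contrasts a hard assignment (KMeans labels) with a soft assignment (GaussianMixture membership probabilities); the four 2-D points are made up for illustration.

```python
# Minimal sketch: hard vs. soft assignments with scikit-learn.
# The four 2-D points below are made up for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.2], [8.0, 8.5], [7.5, 9.0], [0.8, 0.9]])  # points A, B, C, D

# Hard clustering: each point receives exactly one cluster label.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Hard labels:", kmeans.labels_)

# Soft clustering: each point receives a probability for every cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("Soft memberships:\n", gmm.predict_proba(X).round(2))
```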

Uses of Clustering
Before we look at the types of clustering algorithms, let's go through the main use cases of
clustering. Clustering algorithms are mainly used for:
• Market Segmentation: Businesses use clustering to group their customers and use
targeted advertisements to attract a wider audience.
• Market Basket Analysis: Shop owners analyse their sales to figure out which items
are most often bought together by customers. For example, in the USA, a study found
that diapers and beer were often bought together by fathers.
• Social Network Analysis: Social media sites use your data to understand your
browsing behaviour and provide you with targeted friend recommendations or content
recommendations.
• Medical Imaging: Doctors use Clustering to find out diseased areas in diagnostic
images like X-rays.
• Anomaly Detection: To find outliers in a stream of real-time dataset or forecasting
fraudulent transactions we can use clustering to identify them.
• Simplify working with large datasets: Each cluster is given a cluster ID after
clustering is complete. You can then reduce a data point's entire feature set to its
cluster ID. Clustering is effective when it can represent a complicated case with a
straightforward cluster ID. Following the same principle, clustering can make complex
datasets simpler.
Cluster Formation Methods
Clusters are not necessarily formed in a spherical shape. The following are some common
cluster formation methods:
Density-based
In these methods, clusters are formed as dense regions of data points. The advantage of these
methods is that they have good accuracy as well as a good ability to merge two clusters.
Examples: Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Ordering
Points To Identify the Clustering Structure (OPTICS), etc.
Hierarchical-based
In these methods, clusters are formed in a tree-like structure based on a hierarchy. They
have two categories, namely Agglomerative (bottom-up approach) and Divisive (top-down
approach). Examples: Clustering Using Representatives (CURE), Balanced Iterative Reducing
and Clustering using Hierarchies (BIRCH), etc.
Partitioning
In these methods, clusters are formed by partitioning the objects into k clusters, so the number
of clusters equals the number of partitions. Examples: K-means, Clustering Large Applications
based upon Randomized Search (CLARANS).
Grid
In these methods, clusters are formed over a grid-like structure. The advantage of these
methods is that all the clustering operations done on these grids are fast and independent of the
number of data objects. Examples: Statistical Information Grid (STING), Clustering In Quest
(CLIQUE).
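
As an illustration of the density-based approach, here is a small sketch using scikit-learn's DBSCAN on synthetic crescent-shaped data; the eps and min_samples values are illustrative choices, not tuned settings.

```python
# Sketch of density-based cluster formation with scikit-learn's DBSCAN.
# The eps and min_samples values are illustrative, not tuned.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # two crescent-shaped groups
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("Clusters found:", sorted(set(labels) - {-1}))  # label -1 marks noise points
print("Noise points:", int(np.sum(labels == -1)))
```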

Types of Clustering Methods


1. Hierarchical clustering
Hierarchical clustering is used to group similar data points together based on their
similarity, creating a hierarchy or tree-like structure. The key idea is to begin with
each data point as its own separate cluster and then progressively merge or split clusters
based on their similarity. Let's understand this with the help of an example.
Imagine you have four fruits with different weights: an apple (100g), a banana (120g), a cherry
(50g) and a grape (30g). Hierarchical clustering starts by treating each fruit as its own group.
• It then merges the closest groups based on their weights.
• First the cherry and grape are grouped together because they are the lightest.
• Next the apple and banana are grouped together.
Finally, all the fruits are merged into one large group, showing how hierarchical clustering
progressively combines the most similar data points.
Dendrogram
A dendrogram is like a family tree for clusters. It shows how individual data points or groups
of data merge together. The bottom shows each data point as its own group, and as you move
up, similar groups are combined. The lower the merge point, the more similar the groups are.
It helps you see how things are grouped step by step. The working of a dendrogram can be
explained with the following example.

Suppose there are five points labelled P, Q, R, S and T. These represent individual data points
that are being clustered. A dendrogram drawn for these points shows how they are grouped
together step by step.
• At the bottom of the dendrogram the points P, Q, R, S and T are all separate.
• As you move up, the closest points are merged into a single group.
• The lines connecting the points show how they are progressively merged based on
similarity.
• The height at which they are connected shows how similar the points are to each other;
the shorter the line, the more similar they are.
Types of Hierarchical Clustering
Now that we understand the basics of hierarchical clustering, let's look at its two main types.
1. Agglomerative Clustering
2. Divisive clustering
Hierarchical Agglomerative Clustering
It is also known as the bottom-up approach or hierarchical agglomerative clustering
(HAC). Unlike flat clustering, hierarchical clustering provides a structured way to group data.
This clustering algorithm does not require us to prespecify the number of clusters. Bottom-up
algorithms treat each data point as a singleton cluster at the outset and then successively
agglomerate pairs of clusters until all clusters have been merged into a single cluster that
contains all the data.
Workflow for Hierarchical Agglomerative clustering
1. Start with individual points: Each data point is its own cluster. For example if you
have 5 data points you start with 5 clusters each containing just one data point.
2. Calculate distances between clusters: Calculate the distance between every pair of
clusters. Initially since each cluster has one point this is the distance between the two
data points.
3. Merge the closest clusters: Identify the two clusters with the smallest distance and
merge them into a single cluster.
4. Update distance matrix: After merging you now have one less cluster. Recalculate the
distances between the new cluster and the remaining clusters.
5. Repeat steps 3 and 4: Keep merging the closest clusters and updating the distance
matrix until you have only one cluster left.
6. Create a dendrogram: As the process continues you can visualize the merging of
clusters using a tree-like diagram called a dendrogram. It shows the hierarchy of how
clusters are merged.
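
The workflow above can be sketched with SciPy, which performs the repeated merging (steps 1-5) in linkage and draws the dendrogram (step 6). This is a minimal sketch reusing the fruit-weight example; the choice of "average" linkage is an assumption for illustration.

```python
# Minimal sketch of the agglomerative workflow with SciPy,
# reusing the fruit-weight example (apple, banana, cherry, grape).
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

weights = np.array([[100.0], [120.0], [50.0], [30.0]])   # one feature: weight in grams
names = ["apple", "banana", "cherry", "grape"]

# linkage() performs steps 1-5: it repeatedly merges the closest pair of clusters.
Z = linkage(weights, method="average")   # "average" linkage chosen for illustration

# Step 6: draw the dendrogram of the merges.
dendrogram(Z, labels=names)
plt.ylabel("merge distance")
plt.show()
```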
Hierarchical Divisive clustering
It is also known as the top-down approach. This algorithm also does not require us to prespecify
the number of clusters. Top-down clustering requires a method for splitting a cluster that
contains the whole data set and proceeds by splitting clusters recursively until individual data
points have been split into singleton clusters.
Workflow for Hierarchical Divisive clustering :
1. Start with all data points in one cluster: Treat the entire dataset as a single large
cluster.
2. Split the cluster: Divide the cluster into two smaller clusters. The division is typically
done by finding the two most dissimilar points in the cluster and using them to separate
the data into two parts.
3. Repeat the process: For each of the new clusters, repeat the splitting process:
1. Choose the cluster with the most dissimilar points.
2. Split it again into two smaller clusters.
4. Stop when each data point is in its own cluster: Continue this process until every
data point is its own cluster, or the stopping condition (such as a predefined number of
clusters) is met.
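
One simple way to illustrate this top-down workflow is to split clusters recursively with 2-means, in the spirit of bisecting k-means. The helper divisive_split and its rule of always splitting the largest cluster are assumptions made for this sketch, not a standard library API.

```python
# Illustrative top-down (divisive) sketch: recursively split clusters with
# 2-means until a target number of clusters is reached. The helper name and
# the "split the largest cluster" rule are assumptions for this sketch.
import numpy as np
from sklearn.cluster import KMeans

def divisive_split(X, n_clusters=4):
    clusters = [np.arange(len(X))]                     # start: all points in one cluster
    while len(clusters) < n_clusters:
        # pick the largest remaining cluster to split next
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

X = np.random.RandomState(0).rand(40, 2)               # toy data
print([len(c) for c in divisive_split(X)])             # sizes of the resulting clusters
```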

Computing Distance Matrix


While merging two clusters, we check the distance between every pair of clusters and
merge the pair with the least distance (i.e., the most similarity). But how is that
distance determined? There are different ways of defining inter-cluster distance/similarity.
Some of them are:
1. Min Distance: Find the minimum distance between any two points of the cluster.
2. Max Distance: Find the maximum distance between any two points of the cluster.
3. Group Average: Find the average distance between every two points of the clusters.
4. Ward's Method: The similarity of two clusters is based on the increase in squared error
when two clusters are merged.
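
These four definitions correspond to the "single", "complete", "average" and "ward" linkage methods in SciPy. The sketch below, on made-up data, shows how swapping the method can change the clusters obtained when the tree is cut.

```python
# Sketch: the four rules above map to SciPy's linkage methods
# "single" (min), "complete" (max), "average" and "ward".
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.RandomState(1).rand(10, 2)               # toy data for illustration

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into 2 clusters
    print(method, labels)
```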
K-Means Clustering
K-Means Clustering is an unsupervised machine learning algorithm which groups an unlabelled
dataset into different clusters. It is used to organize data into groups based on their similarity.
For example, an online store can use K-Means to group customers based on purchase frequency
and spending, creating segments like Budget Shoppers, Frequent Buyers and Big Spenders for
personalised marketing.
The algorithm works by first randomly picking some central points called centroids; each
data point is then assigned to the closest centroid, forming a cluster. After all the points are
assigned to a cluster, the centroids are updated by finding the average position of the points in
each cluster. This process repeats until the centroids stop changing. The goal of clustering is to
divide the data points into clusters so that similar data points belong to the same group.
How does k-means clustering work?
We are given a data set of items with certain features and values for these features (like a vector).
The task is to categorize those items into groups. To achieve this we will use the K-means
algorithm. 'K' in the name of the algorithm represents the number of groups/clusters we want
to classify our items into.

The algorithm will categorize the items into k groups or clusters of similarity. To calculate that
similarity we will use the Euclidean distance as a measurement. The algorithm works as
follows:
1. First, we randomly initialize k points called means or cluster centroids.
2. We categorize each item to its closest mean and we update the mean's coordinates,
which are the averages of the items categorized in that cluster so far.
3. We repeat the process for a given number of iterations and at the end, we have our
clusters.
The "points" mentioned above are called means because they are the mean values of the items
categorized in them. To initialize these means, we have a lot of options. An intuitive method is
to initialize the means at random items in the data set. Another method is to initialize the means
at random values between the boundaries of the data set. For example, if for a feature x the items
have values in [0,3], we will initialize the means with values for x in [0,3].
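
The three steps above can be written as a short NumPy sketch. The function below initializes the means at random items, assigns each point to its closest mean using Euclidean distance, and recomputes the means; the toy two-blob data and the fixed iteration count are assumptions for illustration.

```python
# Minimal NumPy sketch of the K-means loop described above.
# The two-blob toy data and the fixed iteration count are illustrative.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]        # init at random items
    for _ in range(n_iters):
        # assignment step: index of the closest mean (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each mean becomes the average of its assigned items
        means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
                          for j in range(k)])
    return labels, means

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
labels, means = kmeans(X, k=2)
print("cluster means:\n", means.round(2))
```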
Expectation-Maximization (EM) algorithm
The Expectation-Maximization (EM) algorithm is an iterative method used in unsupervised
machine learning to find unknown values in statistical models. It helps find the best values
for unknown parameters, especially when some data is missing or hidden. It works in two
steps:
• E-step (Expectation Step): Estimates missing or hidden values using current
parameter estimates.
• M-step (Maximization Step): Updates model parameters to maximize the likelihood
based on the estimated values from the E-step.
This process repeats until the model reaches a stable solution, as it improves accuracy with each
iteration. EM is widely used in clustering, for example in Gaussian Mixture Models, and in
handling missing data.

By iteratively repeating these steps the EM algorithm seeks to maximize the likelihood of the
observed data.
Key Terms in Expectation-Maximization (EM) Algorithm
Let's understand some of the most commonly used key terms in the Expectation-
Maximization (EM) Algorithm:
• Latent Variables: These are hidden parts of the data that we can’t see directly but they
still affect what we do see. We try to guess their values using the visible data.
• Likelihood: This refers to the probability of seeing the data we have based on certain
assumptions or parameters. The EM algorithm tries to find the best parameters that
make the data most likely.
• Log-Likelihood: This is just the natural log of the likelihood function. It's used to make
calculations easier and measure how well the model fits the data. The EM algorithm
tries to maximize the log-likelihood to improve the model fit.
• Maximum Likelihood Estimation (MLE): This is a method to find the best values for
a model’s settings called parameters. It looks for the values that make the data we
observed most likely to happen.
• Posterior Probability: In Bayesian methods this is the probability of the parameters
given both prior knowledge and the observed data. In EM it helps estimate the "best"
parameters when there's uncertainty about the data.
• Expectation (E) Step: In this step the algorithm estimates the missing or hidden
information (latent variables) based on the observed data and current parameters. It
calculates probabilities for the hidden values given what we can see.
• Maximization (M) Step: This step updates the parameters by finding the values that
maximize the likelihood based on the estimates from the E-step.
• Convergence: Convergence happens when the algorithm has reached a stable point.
This is checked by seeing if the changes in the model's parameters or the log-likelihood
are small enough to stop the process.

Working of Expectation-Maximization (EM) Algorithm


So far, we've discussed the key terms in the EM algorithm. Now, let's dive into how the EM
algorithm works. Here's a step-by-step breakdown of the process:

1. Initialization: The algorithm starts with initial parameter values and assumes the observed
data comes from a specific model.
2. E-Step (Expectation Step):
• Find the missing or hidden data based on the current parameters.
• Calculate the posterior probability of each latent variable based on the observed data.
• Compute the log-likelihood of the observed data using the current parameter estimates.
3. M-Step (Maximization Step):
• Update the model parameters by maximizing the log-likelihood.
• The better the model, the higher this value.
4. Convergence:
• Check if the model parameters are stable and converging.
• If the changes in log-likelihood or parameters are below a set threshold, stop. If not,
repeat the E-step and M-step until convergence is reached.
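
As a concrete example, scikit-learn's GaussianMixture fits its parameters with exactly this E-step/M-step loop. The sketch below, on made-up 1-D data drawn from two hidden Gaussians, shows the converged parameters and the E-step responsibilities; the component count and tolerance are illustrative choices.

```python
# Sketch: scikit-learn's GaussianMixture fits its parameters with the
# E-step / M-step loop described above. The 1-D toy data is made up.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, max_iter=200, tol=1e-4, random_state=0).fit(X)

print("estimated means:", gmm.means_.ravel().round(2))
print("mixing weights:", gmm.weights_.round(2))
print("converged:", gmm.converged_, "after", gmm.n_iter_, "iterations")
print("E-step responsibilities for the first point:", gmm.predict_proba(X[:1]).round(2))
```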
Advantages of EM algorithm
• Always improves results – With each step, the algorithm improves the likelihood
(chances) of finding a good solution.
• Simple to implement – The two steps (E-step and M-step) are often easy to code for
many problems.
• Quick math solutions – In many cases, the M-step has a direct mathematical
(closed-form) solution, making it efficient.
Disadvantages of EM algorithm
• Takes time to finish: It converges slowly, meaning it may take many iterations to reach
the best solution.
• Gets stuck in a local best: Instead of finding the absolute best solution, it might settle
for a "good enough" one.
• Needs extra probabilities: Unlike some optimization methods that only need forward
probabilities, EM requires both forward and backward probabilities, making it slightly
more complex.

Distance measures or Distance Metrics


Distance measures are the backbone of clustering algorithms. Distance measures are
mathematical functions that determine how similar or different two data points are. The
choice of distance measure can significantly impact the clustering results, as it influences
the shape and structure of the clusters.
Common Distance Measures
There are several types of distance measures, each with its strengths and weaknesses. Here are
some of the most commonly used distance measures in clustering:
1. Euclidean Distance
The Euclidean distance is the most widely used distance measure in clustering. It calculates the
straight-line distance between two points in n-dimensional space. The formula for Euclidean
distance is:

d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}

where p and q are two data points and n is the number of dimensions.
2. Manhattan Distance
The Manhattan distance, sometimes referred to as the L1 distance or city block distance, is the
total of the absolute differences between the Cartesian coordinates of two points. Envision
maneuvering across a city grid in which your only directions are horizontal and vertical. The
Manhattan distance, which computes the total distance travelled along each dimension to
reach another data point, represents this movement. When it comes to categorical data, this
metric is more effective than Euclidean distance since it is less susceptible to outliers. The
formula is:

d(p, q) = \sum_{i=1}^{n} |p_i - q_i|

3. Cosine Similarity
Instead of concentrating on the exact distance between data points, the cosine similarity
measure looks at their orientation. It calculates the cosine of the angle between two
data points, with a higher cosine value indicating greater similarity. This measure is often used
for text data analysis, where the order of features (words in a sentence) might not be as crucial
as their presence. It is used to determine how similar the vectors are, irrespective of their
magnitude:

\cos(\theta) = \frac{p \cdot q}{\|p\| \, \|q\|}

4. Minkowski Distance
Minkowski distance is a generalized form of both Euclidean and Manhattan distances,
controlled by a power parameter p. When p = 1 it is equivalent to Manhattan distance; when
p = 2 it is Euclidean distance:

d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}

where x and y are two data points and p is the power parameter.

5. Jaccard Index
This measure is ideal for binary data, where features can only take values of 0 or 1. It calculates
the ratio of the number of features shared by two data points to the total number of features.
The Jaccard Index measures the similarity between two sets A and B by comparing the size of
their intersection and union:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}
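
For reference, all five measures can be computed with SciPy on toy vectors, as in the sketch below. Note that SciPy returns cosine and Jaccard dissimilarities, so they are subtracted from 1; the vectors are made up for illustration.

```python
# Sketch: computing the five measures with SciPy on toy vectors.
# SciPy returns cosine and Jaccard dissimilarities, so subtract from 1.
from scipy.spatial import distance

x, y = [1, 0, 2, 3], [2, 1, 0, 3]
a, b = [1, 0, 1, 1], [1, 1, 0, 1]        # binary vectors for the Jaccard index

print("Euclidean:", distance.euclidean(x, y))
print("Manhattan:", distance.cityblock(x, y))
print("Cosine similarity:", 1 - distance.cosine(x, y))
print("Minkowski (p=3):", distance.minkowski(x, y, p=3))
print("Jaccard index:", 1 - distance.jaccard(a, b))
```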
Choosing the Right Distance Measure
The choice of distance measure depends on the nature of the data and the clustering algorithm
being used. Here are some general guidelines:
• Euclidean distance is suitable for continuous data with a Gaussian distribution.
• Manhattan distance is suitable for data with a uniform distribution or when the
dimensions are not equally important.
• Minkowski distance is suitable when you want to generalize the Euclidean and
Manhattan distances.
• Cosine similarity is suitable for text data or when the angle between vectors is more
important than the magnitude.
• Jaccard similarity is suitable for categorical data or when the intersection and union
of sets are more important than the individual elements.

Clustering validation

Clustering validation is a crucial step in unsupervised learning, where the goal is to evaluate
how well your clustering algorithm has grouped the data—without ground-truth labels to
directly compare against. Since clustering often aims to uncover hidden patterns or structure in
data, validation helps answer: Did we actually find something meaningful or just random
groupings?

Why Clustering Validation Matters


Clustering algorithms (like K-Means, DBSCAN, or Hierarchical Clustering) can produce very
different results depending on:
• The number of clusters chosen
• The scale of the data
• Initialization or randomness in the algorithm
So, validation methods help determine:
• Are the clusters well-separated and meaningful?
• Is the chosen number of clusters optimal?
• How stable are the clusters across different runs?
Types of Clustering Validation
Validation techniques are broadly categorized into Internal, External, and Relative methods:
1. Internal Validation
Evaluates clustering quality based on the data itself, without external information.
• Silhouette Coefficient
Measures how similar a point is to its own cluster vs. other clusters. Ranges from -1 to
1. Closer to 1 is better.
• Davies-Bouldin Index
Lower values indicate better clustering by comparing intra-cluster distance to inter-
cluster distance.
• Dunn Index
Higher values are better. Compares the smallest distance between observations in
different clusters to the largest intra-cluster distance.
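
A brief sketch of internal validation with scikit-learn, assuming a K-Means result on synthetic blob data; the silhouette and Davies-Bouldin scores are computed from the data and labels alone.

```python
# Sketch: internal validation of a K-Means result with scikit-learn.
# The synthetic blob data is illustrative; no ground-truth labels are used.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Silhouette (closer to 1 is better):", round(silhouette_score(X, labels), 3))
print("Davies-Bouldin (lower is better):", round(davies_bouldin_score(X, labels), 3))
```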
2. External Validation
Requires ground truth labels (which is rare in real unsupervised settings) to compare
predicted clusters with actual categories.
• Rand Index & Adjusted Rand Index (ARI)
Measures the similarity between the clustering results and the ground-truth
classification.
• Mutual Information (MI) & Normalized Mutual Information (NMI)
Quantifies the amount of information shared between the actual classes and the
predicted clusters.
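
A brief sketch of external validation with scikit-learn, assuming ground-truth labels are available (here they come from a synthetic blob generator); ARI and NMI compare the predicted clusters against those labels.

```python
# Sketch: external validation with scikit-learn, assuming ground-truth
# labels are available (here from the synthetic blob generator).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Adjusted Rand Index:", round(adjusted_rand_score(y_true, y_pred), 3))
print("Normalized Mutual Information:", round(normalized_mutual_info_score(y_true, y_pred), 3))
```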

3. Relative Validation
Used to compare multiple clustering results to find the best one—usually by running the
same algorithm with different parameters (like the number of clusters).
• Elbow Method
Plots the explained variance vs. the number of clusters. The “elbow” point is often a
good choice.
• Gap Statistic
Compares total intra-cluster variation with a reference null distribution to determine if
the clustering is better than random.
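
A minimal sketch of the elbow method using scikit-learn and matplotlib: K-Means inertia (within-cluster sum of squares) is plotted against k on synthetic data, and the bend in the curve suggests a reasonable k. The data and the range of k are illustrative.

```python
# Sketch of the elbow method: plot K-Means inertia (within-cluster sum of
# squares) against k and look for the bend. Data and k range are illustrative.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 9)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()
```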
Limitations of Clustering Validation Methods
1. Lack of Ground Truth
• Unsupervised nature of clustering means there’s no “correct” answer to compare with.
• External metrics require labelled data—which defeats the purpose in many real-world
clustering tasks.
2. Sensitivity to Cluster Shape & Density
• Many internal metrics (like Silhouette Score) assume spherical, evenly sized clusters.
• Algorithms like DBSCAN, which find arbitrarily shaped clusters, may perform poorly
under these metrics even if results are meaningful.
3. Dependence on Distance Metrics
• Most validation techniques rely on distance-based measures (e.g., Euclidean
distance).
• If the data isn’t naturally distributed in such a way, the scores can be misleading.
4. Scalability Issues
• Calculating some metrics (like Dunn Index or Mutual Information) can be
computationally expensive for large datasets.
5. Ambiguous Interpretability
• Scores from internal indices (e.g., Davies-Bouldin or Silhouette) do not have an
intuitive interpretation—what qualifies as a “good” score may vary across datasets.
6. Over-Optimizing for a Metric
• It’s easy to fall into the trap of “tuning the model to look good on the score” rather
than reflecting the underlying structure in the data.
• This can lead to clusters that are technically separated but not meaningful from a
domain perspective.
7. Varying Results Across Initializations
• Algorithms like K-Means depend on random initialization. Different runs can yield
different results, and so can the corresponding validation scores.
• This leads to instability unless multiple runs are averaged.
8. Not Always Aligned with Business Goals
• Even if the clustering looks great statistically, it might not yield actionable insights or
align with your application’s needs—especially in customer segmentation or marketing
tasks.
