Lecture 6

The document discusses clustering as an unsupervised learning method that groups similar data points into clusters, highlighting its goal of organizing data for better insights. It differentiates clustering from classification, outlines types of clustering methods, and details hierarchical clustering approaches, particularly agglomerative clustering. Additionally, it explains the Silhouette Score as a metric for evaluating clustering performance and introduces the R-index for comparing distances within and between clusters.

AIML

Dr. Nitin A. Shelke


Clustering
• Clustering (an unsupervised learning method): a technique in which a set of
objects or points with similar characteristics are grouped together into
clusters.
• Clustering is the task of dividing the population or data points into a number
of groups such that data points in the same group are more similar to each
other than to data points in other groups.
• The aim of cluster analysis is to organize observed data into meaningful
structures in order to gain further insight from it.
Difference between Clustering and Classification
• Clustering uses unsupervised machine learning (classification is supervised)
• Clustering uses unlabeled data as input
• The output (the groups) is not known in advance
• There is no target variable in clustering
Goal of Clustering
Types of Clustering Methods
• Centroid-based Clustering (Partitioning methods) (Already Covered in SML)
• Connectivity-based Clustering (Hierarchical clustering)
• Density-based Clustering (e.g., DBSCAN)
What is Hierarchical Clustering?
Hierarchical clustering is another unsupervised learning
algorithm that is used to group together unlabeled data
points that have similar characteristics.
Hierarchical Clustering Approaches
• Agglomerative hierarchical algorithms − In agglomerative hierarchical
algorithms, each data point starts as its own cluster, and pairs of clusters are
then successively merged or agglomerated (bottom-up approach). The
hierarchy of the clusters is represented as a dendrogram or tree structure
(a minimal sketch follows this list).
• Divisive hierarchical algorithms − On the other hand, in divisive hierarchical
algorithms, all the data points start as one big cluster, and clustering proceeds
by dividing (top-down approach) that one big cluster into successively smaller
clusters. Divisive clustering is mainly of theoretical use.
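Since the bottom-up merge process is easiest to see in code, here is a minimal
sketch using SciPy's hierarchical clustering utilities; the six 2-D points are
invented purely for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six toy 2-D points forming two visually obvious groups
X = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 0.8],
              [5.0, 5.0], [5.3, 4.8], [4.8, 5.2]])

# Each point starts as its own cluster; linkage() records the merge order
Z = linkage(X, method='single')      # bottom-up merges with single linkage
print(Z)                             # each row: cluster i, cluster j, distance, new size

# Cut the dendrogram to obtain a flat clustering with 2 clusters
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)                        # e.g. [1 1 1 2 2 2]

# scipy.cluster.hierarchy.dendrogram(Z) can be plotted to visualise the tree.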
Agglomerative Clustering Algorithm
Typical Alternatives to Calculate the
Distance Between Clusters
Example of Agglomerative Clustering with Single Linkage Method
Example of Agglomerative Clustering with Complete Linkage Method
Comparison
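To make the comparison concrete, the sketch below runs single and complete
linkage on the same toy one-dimensional points and prints the resulting flat
cluster labels; the data and the choice of two clusters are assumptions made
purely for illustration, not values from the slides.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# 1-D toy points with steadily increasing gaps between neighbours
X = np.array([[0.0], [1.0], [2.2], [3.6], [5.2], [7.0], [9.0]])

for method in ('single', 'complete'):
    Z = linkage(X, method=method)                    # merge history for this linkage
    labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree at 2 clusters
    print(method, labels)

# Single linkage chains through the small gaps and only splits off the last
# point, while complete linkage (farthest-pair distance) splits the data into
# two more balanced groups.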
Linkage Criteria Supported in Sklearn
• The AgglomerativeClustering object performs a hierarchical clustering using a bottom-up approach: each
observation starts in its own cluster, and clusters are successively merged together. The linkage criterion
determines the metric used for the merge strategy (a usage sketch follows this list):
• Maximum or complete linkage minimizes the maximum distance between observations of pairs of
clusters.
• Single linkage minimizes the distance between the closest observations of pairs of clusters.
• Average linkage minimizes the average of the distances between all observations of pairs of clusters.
• Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach
and in this sense is similar to the k-means objective function, but tackled with an agglomerative hierarchical
approach.
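A minimal usage sketch of the scikit-learn API described above; the toy data
and the choice of two clusters are illustrative assumptions, not taken from the
slides.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated toy groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)

for linkage in ('ward', 'complete', 'average', 'single'):
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    labels = model.fit_predict(X)              # cluster label for each observation
    print(linkage, labels)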
What is the Silhouette Score?
• The silhouette coefficient is a metric that measures how well each
data point fits into its assigned cluster.
• It combines information about both the cohesion (how close a data
point is to other points in its own cluster) and the separation (how far
a data point is from points in other clusters) of the data point.
Silhouette Score

Silhouette Coefficient = (b - a) / max(a, b)

• a denotes the mean intra-cluster distance
• b denotes the mean nearest-cluster distance for each sample
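For example (with invented numbers): if a point's mean intra-cluster distance
is a = 2 and its mean nearest-cluster distance is b = 5, its silhouette
coefficient is (5 - 2) / max(2, 5) = 3 / 5 = 0.6.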
What is the Silhouette Score?
• The Silhouette Score evaluates clustering performance by measuring
how similar a sample is to its own cluster (cohesion) compared to
other clusters (separation). It ranges from -1 to 1:
• 1 → Perfectly clustered (well-separated clusters).
• 0 → Overlapping clusters (not well-defined).
• Negative → Incorrect clustering (samples assigned to the wrong cluster).
Calculating the Silhouette Coefficient
1. For each data point, calculate two values:
— Average distance to all other data points within the same cluster
(cohesion).
— Average distance to all data points in the nearest neighboring cluster
(separation).
2. Compute the silhouette coefficient for each data point using the formula:
silhouette coefficient = (separation - cohesion) / max(separation, cohesion)
3. Calculate the average silhouette coefficient across all data points to obtain
the overall silhouette score for the clustering result (a minimal sketch of this
procedure follows below).
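A minimal sketch of this procedure, first computed by hand with NumPy and then
checked against scikit-learn's silhouette_score; the toy data and labels are
invented for illustration.

import numpy as np
from sklearn.metrics import silhouette_score, pairwise_distances

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

D = pairwise_distances(X)                  # all pairwise Euclidean distances
scores = []
for i in range(len(X)):
    same = (labels == labels[i])
    same[i] = False                        # exclude the point itself
    a = D[i, same].mean()                  # cohesion: mean intra-cluster distance
    b = min(D[i, labels == c].mean()       # separation: mean distance to the
            for c in set(labels) if c != labels[i])  # nearest other cluster
    scores.append((b - a) / max(a, b))

print(np.mean(scores))                     # overall silhouette score
print(silhouette_score(X, labels))         # should match (up to rounding)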
R Index
The R-index is calculated by comparing the average distance between points in
the same cluster to the average distance to points in the nearest other
cluster.
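The slides do not give an exact formula, so the sketch below is one plausible
reading of this description: the ratio of the mean within-cluster distance to
the mean distance to the nearest other cluster, averaged over all points (a
smaller value would then indicate tighter, better-separated clusters). The
function name and the direction of the ratio are assumptions, not from the
source.

import numpy as np
from sklearn.metrics import pairwise_distances

def r_index(X, labels):                              # hypothetical helper name
    D = pairwise_distances(X)
    within, nearest = [], []
    for i in range(len(X)):
        own = (labels == labels[i])
        own[i] = False                               # ignore the point itself
        within.append(D[i, own].mean())              # distance to own cluster
        nearest.append(min(D[i, labels == c].mean()  # distance to nearest other cluster
                           for c in set(labels) if c != labels[i]))
    return np.mean(within) / np.mean(nearest)        # assumed ratio: within / between

# Same toy data as the silhouette sketch above
X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])
print(r_index(X, labels))                            # small value -> compact, well-separated clusters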
