ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN
VNU-University of Engineering and Technology
INT3405 - Machine Learning
Lecture 8: Unsupervised Learning
Hanoi, 4/2025
Outline
◎ Part I: Clustering - General Concepts
○ Real-life Applications
○ Types of Clusterings
◎ Part II: Typical Clustering Algorithms
1. Clustering - General Concepts
Main idea, real-life applications, and types
Motivating Example: Customer Segmentation
Clusters:
● Demographic
● Geographic
● Behavioral
● Psychographic
What is Cluster Analysis or Clustering?
Given a set of objects, place them in groups such that the objects in
a group are similar (or related) to one another and different from
(or unrelated to) the objects in other groups
Inter-cluster distances are maximized; intra-cluster distances are minimized.
Real-life Applications: Google News
Real-life Applications: Anomaly Detection
● Fake News Detection
● Fraud Detection
● Spam Email Detection
Source: https://towardsdatascience.com/unsupervised-anomaly-detection-on-spotify-data-k-means-vs-local-outlier-factor-f96ae783d7a7
Real-life Applications: Sport Science
Find players with similar styles
Source: https://www.americansocceranalysis.com/home/2019/3/11/using-k-means-to-learn-what-soccer-passing-tells-us-about-playing-styles
Real-life Applications: Image Segmentation
Source: http://pixelsciences.blogspot.com/2017/07/image-segmentation-k-means-clustering.html
Real-life Applications: Recommendation
● Cluster-based ranking
● Group recommendation
● …
What Affects Cluster Analysis?
Data + Algorithm → Clustering → Clusters
Characteristics of the Input Data Are Important
● High dimensionality
○ Dimensionality reduction
● Types of attributes
○ Binary, discrete, continuous, asymmetric
○ Mixed attribute types (e.g., continuous & nominal)
● Differences in attribute scales
○ Normalization techniques
● Size of data set
● Noise and Outliers
● Properties of the data space
Characteristics of Clusters
● Data distribution
○ Parametric models
● Shape
○ Globular or arbitrary shape
● Differing sizes
● Differing densities
● Level of separation among clusters (distance metrics)
● Relationship among clusters
● Subspace clusters
How to Measure the Similarity/Distance?
Source: https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa
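As a quick illustration (not from the slides), several of these distance measures can be computed with SciPy; the two example vectors are made up:

import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(a, b))   # L2 (Euclidean) distance
print(distance.cityblock(a, b))   # L1 (Manhattan) distance
print(distance.chebyshev(a, b))   # L-infinity (Chebyshev) distance
print(distance.cosine(a, b))      # cosine distance = 1 - cosine similarity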
Notion of a Cluster can be Ambiguous
Types of Clusterings
Source: https://www.datanovia.com/en/blog/types-of-clustering-methods-overview-and-quick-start-r-code/
Partitional Clustering
Data objects are separated into
non-overlapping subsets, i.e.,
clusters
Hierarchical Clustering
Data objects are separated into
nested clusters as a hierarchical
tree
(Figure: clustering dendrogram)
Fuzzy Clustering
Fuzzy clustering, also called soft clustering, is a form of clustering in which each data point can belong to more than one cluster, with a membership weight for each cluster.
Density-based Clustering
A cluster is a dense region of points, separated from other high-density regions by regions of low density.
Non-linear separation
Model-based Clustering
Model-based clustering assumes
that the data were generated by
a model and tries to recover the
original model from the data.
Gaussian Mixture Model
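A minimal model-based clustering sketch with a Gaussian mixture in scikit-learn; the toy data and parameter values are illustrative assumptions, not from the slides:

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # toy data
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)      # hard assignment to the most likely component
probs = gmm.predict_proba(X)     # soft (probabilistic) cluster memberships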
2. Typical Clustering Algorithms
Intuition, main idea, and limitations
Typical Clustering Algorithms
◎ Partitional Clustering
○ K-Means & Variants
◎ Hierarchical Clustering
○ HAC
◎ Density-based Clustering
○ DBSCAN
K-Means Clustering: An Example
K-Means Clustering
● Main idea: Each point is assigned to the cluster with the closest centroid
● Number of clusters, K, must be specified
● Objective: the Sum of Squared Error (SSE), see the formula below
● Complexity: O(n * K * I * d)
○ n = number of points, K = number of clusters,
○ I = number of iterations, d = number of attributes
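For reference, the SSE objective mentioned above can be written as

SSE = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2

where \mu_k is the centroid of cluster C_k. A minimal scikit-learn sketch (toy data, illustrative parameters):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # toy data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])   # cluster index of each point
print(km.inertia_)       # SSE (within-cluster sum of squared distances)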
Elbow Method for Optimal Value of K
WCSS is the sum of squared distance between each point and the centroid in a cluster
The curve drops quickly and then flattens; the point where it bends is called the elbow point and suggests a suitable value of K
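A minimal sketch of the elbow method (toy data; the range of K values is an arbitrary choice):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # toy data
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)                 # WCSS for this value of K

plt.plot(range(1, 11), wcss, marker="o")     # look for the bend (the elbow)
plt.xlabel("K")
plt.ylabel("WCSS")
plt.show()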
Two different K-means Clusterings
(Figure: original points, an optimal clustering, and a sub-optimal clustering)
Importance of Choosing Initial Centroids
Solutions to Initial Centroids Problem
● Multiple runs
○ Helps, but probability is not on your side
● Use some strategies to select the k initial centroids and then
select among these initial centroids
○ Select most widely separated, e.g., K-means++
○ Use hierarchical clustering to determine initial centroids
● Bisecting K-Means
○ Not as susceptible to initialization issues
K-Means++
1. Choose one center uniformly at random among the data points.
2. For each data point x not chosen yet, compute D(x), the distance
between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability
distribution where a point x is chosen with probability proportional to D(x)².
4. Repeat Steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed using
standard K-Means clustering
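A small NumPy sketch of this seeding procedure, written directly from the steps above (function and variable names are my own, not a library API):

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: first center chosen uniformly at random
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Step 2: squared distance D(x)^2 to the nearest already-chosen center
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        # Step 3: sample a new center with probability proportional to D(x)^2
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)   # Step 5: pass these centers to standard K-Means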
Bisecting K-Means
It is a variant of K-means that can produce a
partitional or a hierarchical clustering
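A rough sketch of the bisecting idea, under the assumption that the largest cluster is split at each step (choosing the cluster with the highest SSE is another common criterion):

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    labels = np.zeros(len(X), dtype=int)               # start with a single cluster
    while labels.max() + 1 < k:
        target = np.bincount(labels).argmax()          # split the largest cluster
        mask = labels == target
        sub = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[mask])
        labels[mask] = np.where(sub == 0, target, labels.max() + 1)
    return labels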
Limitations of K-means: Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means: Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means: Non-globular Shapes
Original Points K-means (2 Clusters)
Hierarchical Agglomerative Clustering
(Figure: dendrogram)
● Main Idea:
○ Start with the points as individual clusters
○ At each step, merge the closest pair of clusters until only one
cluster (or K clusters) remains
● Key operation is the computation of the proximity of two clusters
○ Worst-case Complexity: O(N³)
HAC: Algorithm
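The algorithm figure is not reproduced here; a standard agglomerative run with SciPy looks roughly like this (toy data, and "ward" is just one possible linkage choice):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)   # toy data
Z = linkage(X, method="ward")      # bottom-up merges; proximities handled internally
dendrogram(Z)                      # visualize the merge tree
plt.show()
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters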
Closest Pair of Clusters
● Many variants to defining closest pair of clusters
● Single-link
○ Similarity of the closest elements
● Complete-link
○ Similarity of the “furthest” points
● Average-link
○ Average similarity (e.g., average cosine) between pairs of elements
● Ward’s Method
○ The increase in squared error when two clusters are merged
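With SciPy's linkage (as in the sketch earlier), these variants map to the method argument; a minimal illustration on made-up data:

import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(0).normal(size=(20, 2))   # toy data
Z_single   = linkage(X, method="single")     # single-link (MIN)
Z_complete = linkage(X, method="complete")   # complete-link (MAX)
Z_average  = linkage(X, method="average")    # average-link
Z_ward     = linkage(X, method="ward")       # Ward's method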
HAC - Single-link (MIN)
(Figure: nested clusters and the corresponding dendrogram)
HAC - Complete-link (MAX)
(Figure: nested clusters and the corresponding dendrogram)
HAC - Average-link
(Figure: nested clusters and the corresponding dendrogram)
HAC: Limitations
● Once two clusters are merged, the decision cannot be undone
● No global objective function is directly minimized
● Typical Problems:
○ Sensitivity to noise
○ Difficulty handling clusters of different
sizes and non-globular shapes
○ Breaking large clusters
Density-based Clustering - DBSCAN
● Main Idea: Clusters are regions of high
density that are separated from one
another by regions of low density.
● Density = number of points within
a specified radius (Eps)
○ Core point
○ Border point
○ Noise point
DBSCAN: Algorithm
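The algorithm figure is not reproduced here; a minimal scikit-learn run (toy non-globular data, illustrative Eps and MinPts):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # toy data
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                    # cluster ids; -1 marks noise points
core_idx = db.core_sample_indices_     # indices of the core points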
How to Determine Points?
MinPts = 7
● Core point: Has at least a specified number of points (MinPts) within Eps
● Border point: not a core point, but is in the neighborhood of a core point
● Noise point: any point that is not a core point or a border point
DBSCAN: Core, Border and Noise Points
(Figure: original points and point types: core, border, and noise; Eps = 10, MinPts = 4)
DBSCAN: How to Determine Eps, MinPts?
Intuition:
● Core point: the k-th nearest
neighbors are at a close distance.
● Noise point: the k-th nearest
neighbors are at a far distance.
Plot the sorted distance of every point to its k-th nearest neighbor.
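A minimal sketch of this k-distance plot (toy data; note that in scikit-learn the query point counts as its own first neighbor, hence the k + 1):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # toy data
k = 4                                                         # e.g. k = MinPts
dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
kth = np.sort(dist[:, -1])       # each point's distance to its k-th nearest neighbor
plt.plot(kth)                    # the knee of this curve suggests a value for Eps
plt.ylabel("k-NN distance")
plt.show()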
DBSCAN: Limitations
● Varying densities
● High-dimensional data
(Figure: original points and DBSCAN results with MinPts=4, Eps=9.92 and MinPts=4, Eps=9.75)
Which Clustering Algorithm?
● Type of Clustering
● Type of Cluster
○ Prototype vs connected regions vs density-based
● Characteristics of Clusters
● Characteristics of Data Sets and Attributes
● Noise and Outliers
● Number of Data Objects
● Number of Attributes
● Algorithmic Considerations
A Comparison on Clustering Algorithms
Source: Text Clustering Algorithms: A Review
Exercise
Using the L1 (Manhattan) distance:
◎ Run K-Means with initial centroids C1=(0, 0) and C2=(2, 2) for 3 iterations
◎ Run DBSCAN with Eps = 1.95 and MinPts = 3. Find the number of clusters and identify
the border points, core points, and noise points
A B C D E F G H I J K L
x 1 0 0 3 2 1 5 4 3 3 2 2
y 1 1 6 6 1 2 2 5 4 5 4 6
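To check the DBSCAN part afterwards, a hedged sketch with scikit-learn (DBSCAN accepts the Manhattan/L1 metric directly; scikit-learn's KMeans only uses Euclidean distance, so the K-Means part is best done by hand as asked):

import numpy as np
from sklearn.cluster import DBSCAN

# Points A..L from the table above
X = np.array([[1, 1], [0, 1], [0, 6], [3, 6], [2, 1], [1, 2],
              [5, 2], [4, 5], [3, 4], [3, 5], [2, 4], [2, 6]])
db = DBSCAN(eps=1.95, min_samples=3, metric="manhattan").fit(X)
print(db.labels_)                  # -1 = noise, other values = cluster ids
print(db.core_sample_indices_)     # core points; clustered points that are not core are border points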
Unsupervised Metrics
How do we determine good vs. bad clustering?
Davies–Bouldin index
◎ The Davies-Bouldin index combines the intracluster (within-cluster) variance of each
cluster with the distance between cluster centroids. For each cluster, its nearest
neighboring cluster is identified, and the sum of their intracluster variances is divided
by the distance between their centroids; the Davies-Bouldin index is the mean of these
per-cluster values.
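For reference, the usual formula, with s_i the average distance of the points in cluster i to its centroid c_i:

DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{s_i + s_j}{d(c_i, c_j)}

Lower values indicate better clustering.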
Dunn index
◎ The Dunn index quantifies the ratio between the smallest distance between cases in
different clusters and the largest distance within a cluster.
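In formula form (one common version), with d(C_i, C_j) the minimum distance between points of different clusters and diam(C_k) the largest within-cluster distance:

Dunn = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_{k} \operatorname{diam}(C_k)}

Higher values indicate more compact, better-separated clusters.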
Silhouette Coefficient
A common metric that combines cohesion and separation
3 steps, for each sample i (formulas below):
◎ Step 1: compute a(i), the average distance to the other points in i's own cluster
◎ Step 2: compute b(i), the smallest average distance from i to the points of any other cluster
◎ Step 3: compute the silhouette coefficient s(i)
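The standard definitions, for reference (C(i) denotes the cluster containing sample i):

a(i) = \frac{1}{|C(i)| - 1} \sum_{j \in C(i), j \neq i} d(i, j), \quad
b(i) = \min_{C \neq C(i)} \frac{1}{|C|} \sum_{j \in C} d(i, j), \quad
s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}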
What is the range of s(i)?
What is the run-time complexity?
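These metrics can also be computed with scikit-learn (the Dunn index has no built-in function); a minimal usage sketch on toy data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))       # higher is better
print(davies_bouldin_score(X, labels))   # lower is better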
Summary
◎ General Concepts of Clustering
○ Definition
○ Real-life Applications
○ Types of Clustering
◎ Typical Clustering Algorithms
○ K-Means
○ HAC
○ DBSCAN