
UET

Since 2004

ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN


VNU-University of Engineering and Technology

INT3405 - Machine Learning


Lecture 8: Unsupervised Learning

CSUET

Hanoi, 4/2025
Outline

◎ Part I: Clustering - General Concepts
○ Real-life Applications
○ Types of Clusterings
◎ Part II: Typical Clustering Algorithms
◎ Part III: Unsupervised Metrics
1.
Clustering - General Concepts
Main idea, real-life applications, types
Motivating Example: Customer Segmentation

Common segmentation criteria:
● Demographic
● Geographic
● Behavioral
● Psychographic
What is Cluster Analysis or Clustering?
Given a set of objects, place them in groups such that the objects in
a group are similar (or related) to one another and different from
(or unrelated to) the objects in other groups

(Figure: inter-cluster distances are maximized; intra-cluster distances are minimized.)
Real-life Applications: Google News

Real-life Applications: Anomaly Detection

● Fake News Detection
● Fraud Detection
● Spam Email Detection

Source: https://towardsdatascience.com/unsupervised-anomaly-detection-on-spotify-data-k-means-vs-local-outlier-factor-f96ae783d7a7
Real-life Applications: Sport Science

Find players with similar styles.

Source: https://www.americansocceranalysis.com/home/2019/3/11/using-k-means-to-learn-what-soccer-passing-tells-us-about-playing-styles
Real-life Applications: Image Segmentation

Source: http://pixelsciences.blogspot.com/2017/07/image-segmentation-k-means-clustering.html

Real-life Applications: Recommendation

● Cluster-based ranking
● Group recommendation
● …

What Affects Cluster Analysis?

The clustering result depends on two factors: the input data and the clustering algorithm.
Characteristics of the Input Data Are Important
● High dimensionality
○ Dimensionality reduction
● Types of attributes
○ Binary, discrete, continuous, asymmetric
○ Mixed attribute types, e.g., continuous & nominal
● Differences in attribute scales
○ Normalization techniques
● Size of data set
● Noise and Outliers
● Properties of the data space

Characteristics of Cluster
● Data distribution
○ Parametric models
● Shape
○ Globular or arbitrary shape
● Differing sizes
● Differing densities
● Level of separation among clusters
● Relationship among clusters
● Subspace clusters

How to Measure the Similarity/Distance?

Source: https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa
Notion of a Cluster can be Ambiguous

Types of Clusterings

Source: https://www.datanovia.com/en/blog/types-of-clustering-methods-overview-and-quick-start-r-code/

Partitional Clustering

Data objects are separated into non-overlapping subsets, i.e., clusters.
Hierarchical Clustering

Data objects are separated into nested clusters, forming a hierarchical tree (visualized as a clustering dendrogram).
Fuzzy Clustering

Fuzzy clustering, i.e., soft clustering, is a form of clustering in which each data point can belong to more than one cluster, with membership weights.
Density-based Clustering

A cluster is a dense region of points, separated from other regions of high density by regions of low density. This enables non-linear separation between clusters.
Model-based Clustering

Model-based clustering assumes that the data were generated by a model and tries to recover the original model from the data. A typical example is the Gaussian Mixture Model.
2.
Typical Clustering Algorithms
Intuition, main idea, limitations
Typical Clustering Algorithms

◎ Partitional Clustering
○ K-Means & Variants
◎ Hierarchical Clustering
○ HAC
◎ Density-based Clustering
○ DBSCAN

K-Means Clustering: An Example

K-Means Clustering

● Main idea: each point is assigned to the cluster with the closest centroid; centroids are then recomputed as the mean of their assigned points, and the two steps repeat until assignments stop changing
● Number of clusters, K, must be specified
● Objective: minimize the Sum of Squared Error (SSE), SSE = Σ_k Σ_{x in C_k} dist(x, c_k)²
● Complexity: O(n * K * I * d)
○ n = number of points, K = number of clusters,
○ I = number of iterations, d = number of attributes
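The assign-update loop above can be written directly in NumPy. The following is a minimal sketch of Lloyd's algorithm (random initialization, no handling of empty clusters; names and defaults are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means (Lloyd's algorithm): assign points, then update centroids."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for each point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its assigned points
        # (an empty cluster would produce NaNs; not handled in this sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids (and hence assignments) stopped changing
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()  # Sum of Squared Error
    return labels, centroids, sse
```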
Elbow Method for Optimal Value of K

WCSS (within-cluster sum of squares) is the sum of squared distances between each point and the centroid of its cluster. As K increases, WCSS decreases; the point where the curve stops dropping sharply and flattens out is called the elbow point, and its K is taken as a good number of clusters.
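A sketch of the elbow computation with scikit-learn, assuming X is an (n, d) NumPy array; KMeans exposes the WCSS as inertia_:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_elbow(X, k_max=10):
    """Plot WCSS for K = 1..k_max; look for the bend (elbow point)."""
    wcss = []
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        wcss.append(km.inertia_)  # inertia_ is the within-cluster sum of squares
    plt.plot(range(1, k_max + 1), wcss, marker="o")
    plt.xlabel("Number of clusters K")
    plt.ylabel("WCSS")
    plt.show()
```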
Two different K-means Clusterings

(Figure: the same original points clustered in two ways, an optimal clustering and a sub-optimal clustering, depending on the initial centroids.)
Importance of Choosing Initial Centroids

Solutions to Initial Centroids Problem
● Multiple runs
○ Helps, but probability is not on your side
● Use some strategies to select the k initial centroids and then
select among these initial centroids
○ Select most widely separated, e.g., K-means++
○ Use hierarchical clustering to determine initial centroids
● Bisecting K-Means
○ Not as susceptible to initialization issues

K-Means++
1. Choose one center uniformly at random among the data points.
2. For each data point x not chosen yet, compute D(x), the distance
between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)².
4. Repeat Steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed using standard K-Means clustering.
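These steps translate almost line by line into NumPy; a sketch (illustrative only, since in practice scikit-learn's KMeans already uses init="k-means++" by default):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-Means++ seeding: spread the initial centers out via D(x)^2 weighting."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # Step 1: first center uniformly at random
    for _ in range(k - 1):
        # Step 2: D(x)^2 = squared distance from each point to its nearest chosen center
        C = np.array(centers)
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Step 3: sample the next center with probability proportional to D(x)^2
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)  # Step 5: feed these into standard K-Means
```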
Bisecting K-Means

Bisecting K-Means is a variant of K-Means that repeatedly splits one cluster into two, and can produce either a partitional or a hierarchical clustering.
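For reference, scikit-learn (version 1.1 and later) ships a BisectingKMeans estimator; a minimal usage sketch on synthetic data:

```python
from sklearn.cluster import BisectingKMeans  # requires scikit-learn >= 1.1
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
# Start with all points in one cluster, then repeatedly split a selected
# cluster in two with K-Means until n_clusters clusters are obtained.
model = BisectingKMeans(n_clusters=3, random_state=0).fit(X)
print(model.labels_[:10], model.cluster_centers_.shape)
```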
Limitations of K-means: Differing Sizes

(Figure: original points vs. K-means result with 3 clusters.)

Limitations of K-means: Differing Density

(Figure: original points vs. K-means result with 3 clusters.)

Limitations of K-means: Non-globular Shapes

(Figure: original points vs. K-means result with 2 clusters.)
Hierarchical Agglomerative Clustering

● Main Idea:
○ Start with the points as individual clusters
○ At each step, merge the closest pair of clusters until only one cluster (or K clusters) is left
● Key operation is the computation of the proximity of two clusters
○ Worst-case Complexity: O(N³)

(Figure: example dendrogram.)
HAC: Algorithm

(Slide shows the pseudocode: compute the proximity matrix; then repeatedly merge the two closest clusters and update the proximity matrix until only one cluster remains.)
Closest Pair of Clusters
● Many variants for defining the closest pair of clusters
● Single-link
○ Similarity of the closest elements
● Complete-link
○ Similarity of the "furthest" points
● Average-link
○ Average similarity (e.g., cosine) between all pairs of elements
● Ward's Method
○ The increase in squared error when two clusters are merged
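The linkage criteria above can be compared with SciPy's hierarchy module; a small sketch on toy data (the data shape and the cut level t=3 are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))  # toy data

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)  # (n-1, 4) merge history of the HAC run
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
    print(method, np.bincount(labels)[1:])  # cluster sizes under each criterion

# dendrogram(Z) plots the merge tree for visual inspection
```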
HAC - Single-link (MIN)

(Figure: nested clusters over six points and the corresponding dendrogram.)

HAC - Complete-link (MAX)

(Figure: nested clusters over six points and the corresponding dendrogram.)

HAC - Average-link

(Figure: nested clusters over six points and the corresponding dendrogram.)
HAC: Limitations
● Once two clusters are combined, the merge cannot be undone
● No global objective function is directly minimized
● Typical Problems:
○ Sensitivity to noise
○ Difficulty handling clusters of different sizes and non-globular shapes
○ Breaking large clusters
Density-based Clustering - DBSCAN

● Main Idea: Clusters are regions of high density that are separated from one another by regions of low density.
● Density = number of points within a specified radius (Eps)
○ Core point
○ Border point
○ Noise point
DBSCAN: Algorithm

(Slide shows the pseudocode: label all points as core, border, or noise; discard noise points; connect core points that are within Eps of each other; make each connected group of core points a cluster; and assign each border point to a cluster of one of its associated core points.)
How to Determine Points?

(Figure: example neighborhood with MinPts = 7.)

● Core point: has at least a specified number of points (MinPts) within Eps
● Border point: is not a core point, but is in the neighborhood of a core point
● Noise point: any point that is neither a core point nor a border point
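These point types map directly onto scikit-learn's DBSCAN, which marks noise with label -1 and exposes core points via core_sample_indices_ (the eps and min_samples values below are illustrative; note that sklearn's min_samples counts the point itself):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))  # toy data

db = DBSCAN(eps=0.5, min_samples=4).fit(X)  # min_samples plays the role of MinPts
labels = db.labels_  # cluster ids; -1 marks noise points
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True  # core points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
n_border = int(((labels != -1) & ~core_mask).sum())  # clustered but not core
print(n_clusters, n_noise, n_border)
```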
DBSCAN: Core, Border and Noise Points

(Figure: original points labeled as core, border, and noise points, with Eps = 10 and MinPts = 4.)
DBSCAN: How to Determine Eps, MinPts?

Intuition:
● Core point: its k-th nearest neighbor is at a close distance.
● Noise point: its k-th nearest neighbor is at a far distance.
Plot the sorted distance of every point to its k-th nearest neighbor; the sharp bend ("knee") in this curve suggests a good Eps, with MinPts = k.
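A sketch of this k-distance plot with scikit-learn's NearestNeighbors (choosing k = MinPts is the usual convention; names are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distance_plot(X, k=4):
    """Sorted distance of every point to its k-th nearest neighbor."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: nearest "neighbor" is the point itself
    dists, _ = nn.kneighbors(X)
    kth = np.sort(dists[:, -1])  # k-th neighbor distance, in ascending order
    plt.plot(kth)
    plt.xlabel("Points sorted by k-distance")
    plt.ylabel(f"Distance to {k}-th nearest neighbor")
    plt.show()  # the sharp bend suggests a good Eps (with MinPts = k)
```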
DBSCAN: Limitations

● Varying densities
● High-dimensional data

(Figure: original points and two DBSCAN runs with MinPts = 4, at Eps = 9.92 and at Eps = 9.75.)
Which Clustering Algorithm?
● Type of Clustering
● Type of Cluster
○ Prototype vs connected regions vs density-based
● Characteristics of Clusters
● Characteristics of Data Sets and Attributes
● Noise and Outliers
● Number of Data Objects
● Number of Attributes
● Algorithmic Considerations

A Comparison on Clustering Algorithms

Source: Text Clustering Algorithms: A Review
Exercise
Using L1-Distance
◎ Run K-Means with C1=(0, 0) and C2=(2, 2) for 3 iterations
◎ Run DBSCAN with Eps = 1.95 and MinPts = 3. Find the number of clusters and identify the border points, core points, and noise points (see the sketch below)

   A  B  C  D  E  F  G  H  I  J  K  L
x  1  0  0  3  2  1  5  4  3  3  2  2
y  1  1  6  6  1  2  2  5  4  5  4  6
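The DBSCAN part can be sanity-checked with scikit-learn, which supports the Manhattan (L1) metric; note that sklearn's min_samples counts the point itself, so adjust it if the lecture's MinPts convention differs. A sketch (the point dictionary mirrors the table above):

```python
import numpy as np
from sklearn.cluster import DBSCAN

pts = {"A": (1, 1), "B": (0, 1), "C": (0, 6), "D": (3, 6), "E": (2, 1), "F": (1, 2),
       "G": (5, 2), "H": (4, 5), "I": (3, 4), "J": (3, 5), "K": (2, 4), "L": (2, 6)}
names = list(pts)
X = np.array([pts[n] for n in names])

db = DBSCAN(eps=1.95, min_samples=3, metric="manhattan").fit(X)  # L1 distance
core = set(db.core_sample_indices_)
for i, name in enumerate(names):
    kind = "core" if i in core else ("noise" if db.labels_[i] == -1 else "border")
    print(name, db.labels_[i], kind)
```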
3.
Unsupervised Metrics
How to determine good vs. bad clustering?
Davies-Bouldin index

◎ The Davies-Bouldin index uses the intracluster (within-cluster) variance and the distance between the centroids of each pair of clusters. For each cluster, its nearest neighboring cluster is identified, and the sum of their intracluster variances is divided by the distance between their centroids. This value is calculated for each cluster, and the Davies-Bouldin index is the mean of these values; lower values indicate a better clustering.
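scikit-learn provides this metric directly; a quick sketch on synthetic blobs:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(davies_bouldin_score(X, labels))  # lower values = better-separated clusters
```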
Dunn index

◎ The Dunn index quantifies the ratio between the smallest distance between cases in different clusters and the largest distance within a cluster; higher values indicate a better clustering.
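scikit-learn has no built-in Dunn index, so here is a minimal O(n²) NumPy sketch using the convention above (for illustration, not optimized):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dunn_index(X, labels):
    """Smallest between-cluster distance divided by largest within-cluster distance."""
    labels = np.asarray(labels)
    D = squareform(pdist(X))  # full pairwise Euclidean distance matrix
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)  # ignore each point's distance to itself
    diff = labels[:, None] != labels[None, :]
    return D[diff].min() / D[same].max()  # higher is better
```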
Silhouette Coefficient
A common metric that combines cohesion and separation.
3 steps, for each sample i:
◎ Step 1: compute a(i), the average distance from i to all other points in its own cluster
◎ Step 2: compute b(i), the minimum over all other clusters of the average distance from i to that cluster's points
◎ Step 3: calculate the silhouette coefficient: s(i) = (b(i) - a(i)) / max(a(i), b(i))

What is the range of s(i)?
Run-time complexity?
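A usage sketch with scikit-learn, whose silhouette_score returns the mean s(i) over all samples (synthetic data for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # mean silhouette coefficient, in [-1, 1]
```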
Summary
◎ General Concepts of Clustering
○ Definition
○ Real-life Applications
○ Types of Clustering
◎ Typical Clustering Algorithms
○ K-Means
○ HAC
○ DBSCAN
◎ Unsupervised Metrics
○ Davies-Bouldin index, Dunn index, Silhouette coefficient
