ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN
VNU-University of Engineering and Technology
INT3405 - Machine Learning
Lecture 8: Unsupervised Learning
Hanoi, 4/2025
Outline
◎ Part I: Clustering - General Concepts
○ Real-life Applications
○ Types of Clusterings
◎ Part II: Typical Clustering Algorithms
1. Clustering - General Concepts
Main idea, real-life applications, and types
Motivating Example: Customer Segmentation
Clusters:
● Demographic
● Geographic
● Behavioral
● Psychographic
What is Cluster Analysis or Clustering?
Given a set of objects, place them in groups such that the objects in
a group are similar (or related) to one another and different from
(or unrelated to) the objects in other groups
Inter-cluster distances are maximized; intra-cluster distances are minimized.
Real-life Applications: Google News
Real-life Applications: Anomaly Detection
● Fake News Detection
● Fraud Detection
● Spam Email Detection
Source: https://towardsdatascience.com/unsupervised-anomaly-detection-on-spotify-data-k-means-vs-local-outlier-factor-f96ae783d7a7
Real-life Applications: Sport Science
Find players with similar styles
Source: https://www.americansocceranalysis.com/home/2019/3/11/using-k-means-to-learn-what-soccer-passing-tells-us-about-playing-styles
Real-life Applications: Image Segmentation
Source: http://pixelsciences.blogspot.com/2017/07/image-segmentation-k-means-clustering.html
Real-life Applications: Recommendation
● Cluster-based ranking
● Group recommendation
● …
What Affects Cluster Analysis?
Data + Algorithm → Clustering → Clusters
Characteristics of the Input Data Are Important
● High dimensionality
○ Dimensionality reduction
● Types of attributes
○ Binary, discrete, continuous, asymmetric
○ Mixed attribute types (e.g., continuous & nominal)
● Differences in attribute scales
○ Normalization techniques
● Size of data set
● Noise and Outliers
● Properties of the data space
Characteristics of Clusters
● Data distribution
○ Parametric models
● Shape
○ Globular or arbitrary shape
● Differing sizes
● Differing densities
● Level of separation among clusters (distance metrics)
● Relationship among clusters
● Subspace clusters
How to Measure the Similarity/Distance?
Source: https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa
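As a quick illustration (not from the slides), several of these distance measures can be computed with SciPy; the two example vectors are made up:

import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(a, b))   # L2 (Euclidean) distance
print(distance.cityblock(a, b))   # L1 (Manhattan) distance
print(distance.chebyshev(a, b))   # L-infinity (Chebyshev) distance
print(distance.cosine(a, b))      # cosine distance = 1 - cosine similarity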
Notion of a Cluster can be Ambiguous
Types of Clusterings
Source: https://www.datanovia.com/en/blog/types-of-clustering-methods-overview-and-quick-start-r-code/
Partitional Clustering
Data objects are separated into
non-overlapping subsets, i.e.,
clusters
Hierarchical Clustering
Data objects are separated into
nested clusters as a hierarchical
tree
(Figure: clustering dendrogram)
Fuzzy Clustering
Fuzzy clustering, also called soft clustering, is a form of clustering in which each data point can belong to more than one cluster, with a membership weight for each cluster.
Density-based Clustering
A cluster is a dense region of points, separated from other high-density regions by regions of low density.
Non-linear separation
Model-based Clustering
Model-based clustering assumes
that the data were generated by
a model and tries to recover the
original model from the data.
Gaussian Mixture Model
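A minimal model-based clustering sketch with a Gaussian mixture in scikit-learn; the toy data and parameter values are illustrative assumptions, not from the slides:

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # toy data
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)      # hard assignment to the most likely component
probs = gmm.predict_proba(X)     # soft (probabilistic) cluster memberships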
2. Typical Clustering Algorithms
Intuition, main idea, and limitations
Typical Clustering Algorithms
◎ Partitional Clustering
○ K-Means & Variants
◎ Hierarchical Clustering
○ HAC
◎ Density-based Clustering
○ DBSCAN
K-Means Clustering: An Example
K-Means Clustering
● Main idea: Each point is assigned to the cluster with the closest centroid
● Number of clusters, K, must be specified
● Objective: the Sum of Squared Error (SSE), see the formula below
● Complexity: O(n * K * I * d)
○ n = number of points, K = number of clusters,
○ I = number of iterations, d = number of attributes
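For reference, the SSE objective mentioned above can be written as

SSE = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2

where \mu_k is the centroid of cluster C_k. A minimal scikit-learn sketch (toy data, illustrative parameters):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # toy data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])   # cluster index of each point
print(km.inertia_)       # SSE (within-cluster sum of squared distances)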
Elbow Method for Optimal Value of K
WCSS is the sum of squared distance between each point and the centroid in a cluster
The curve drops quickly and then flattens; the point where it bends is called the elbow point and suggests a suitable value of K
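A minimal sketch of the elbow method (toy data; the range of K values is an arbitrary choice):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # toy data
wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)                 # WCSS for this value of K

plt.plot(range(1, 11), wcss, marker="o")     # look for the bend (the elbow)
plt.xlabel("K")
plt.ylabel("WCSS")
plt.show()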
Two different K-means Clusterings
(Figure: original points, an optimal clustering, and a sub-optimal clustering)
Importance of Choosing Initial Centroids
Solutions to Initial Centroids Problem
● Multiple runs
○ Helps, but probability is not on your side
● Use some strategies to select the k initial centroids and then
select among these initial centroids
○ Select most widely separated, e.g., K-means++
○ Use hierarchical clustering to determine initial centroids
● Bisecting K-Means
○ Not as susceptible to initialization issues
K-Means++
1. Choose one center uniformly at random among the data points.
2. For each data point x not chosen yet, compute D(x), the distance
between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability
distribution where a point x is chosen with probability proportional to D(x)².
4. Repeat Steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed using
standard K-Means clustering
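A small NumPy sketch of this seeding procedure, written directly from the steps above (function and variable names are my own, not a library API):

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: first center chosen uniformly at random
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Step 2: squared distance D(x)^2 to the nearest already-chosen center
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        # Step 3: sample a new center with probability proportional to D(x)^2
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)   # Step 5: pass these centers to standard K-Means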
Bisecting K-Means
It is a variant of K-means that can produce a
partitional or a hierarchical clustering
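A rough sketch of the bisecting idea, under the assumption that the largest cluster is split at each step (choosing the cluster with the highest SSE is another common criterion):

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    labels = np.zeros(len(X), dtype=int)               # start with a single cluster
    while labels.max() + 1 < k:
        target = np.bincount(labels).argmax()          # split the largest cluster
        mask = labels == target
        sub = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[mask])
        labels[mask] = np.where(sub == 0, target, labels.max() + 1)
    return labels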
Limitations of K-means: Differing Sizes
Original Points K-means (3 Clusters)
Limitations of K-means: Differing Density
Original Points K-means (3 Clusters)
Limitations of K-means: Non-globular Shapes
Original Points K-means (2 Clusters)
Hierarchical Agglomerative Clustering
(Figure: dendrogram)
● Main Idea:
○ Start with the points as individual clusters
○ At each step, merge the closest pair of clusters until only one
cluster (or K clusters) remains
● Key operation is the computation of the proximity of two clusters
○ Worst-case Complexity: O(N³)
HAC: Algorithm
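The algorithm figure is not reproduced here; a standard agglomerative run with SciPy looks roughly like this (toy data, and "ward" is just one possible linkage choice):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)   # toy data
Z = linkage(X, method="ward")      # bottom-up merges; proximities handled internally
dendrogram(Z)                      # visualize the merge tree
plt.show()
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters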
Closest Pair of Clusters
● Many variants to defining closest pair of clusters
● Single-link
○ Similarity of the closest elements
● Complete-link
○ Similarity of the “furthest” points
● Average-link
○ Average similarity (e.g., average cosine) between pairs of elements
● Ward’s Method
○ The increase in squared error when two clusters are merged
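With SciPy's linkage (as in the sketch earlier), these variants map to the method argument; a minimal illustration on made-up data:

import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(0).normal(size=(20, 2))   # toy data
Z_single   = linkage(X, method="single")     # single-link (MIN)
Z_complete = linkage(X, method="complete")   # complete-link (MAX)
Z_average  = linkage(X, method="average")    # average-link
Z_ward     = linkage(X, method="ward")       # Ward's method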
HAC - Single-link (MIN)
(Figure: nested clusters and the corresponding dendrogram)
HAC - Complete-link (MAX)
(Figure: nested clusters and the corresponding dendrogram)
HAC - Average-link
(Figure: nested clusters and the corresponding dendrogram)
HAC: Limitations
● Once two clusters are merged, the decision cannot be undone
● No global objective function is directly minimized
● Typical Problems:
○ Sensitivity to noise
○ Difficulty handling clusters of different
sizes and non-globular shapes
○ Breaking large clusters
Density-based Clustering - DBSCAN
● Main Idea: Clusters are regions of high
density that are separated from one
another by regions of low density.
● Density = number of points within
a specified radius (Eps)
○ Core point
○ Border point
○ Noise point
DBSCAN: Algorithm
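The algorithm figure is not reproduced here; a minimal scikit-learn run (toy non-globular data, illustrative Eps and MinPts):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # toy data
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                    # cluster ids; -1 marks noise points
core_idx = db.core_sample_indices_     # indices of the core points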
How to Determine Points?
MinPts = 7
● Core point: Has at least a specified number of points (MinPts) within Eps
● Border point: not a core point, but is in the neighborhood of a core point
● Noise point: any point that is not a core point or a border point
DBSCAN: Core, Border and Noise Points
(Figure: original points and point types: core, border, and noise; Eps = 10, MinPts = 4)
DBSCAN: How to Determine Eps, MinPts?
Intuition:
● Core point: the k-th nearest
neighbors are at a close distance.
● Noise point: the k-th nearest
neighbors are at a far distance.
Plot the sorted distance of every point to its k-th nearest neighbor.
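A minimal sketch of this k-distance plot (toy data; note that in scikit-learn the query point counts as its own first neighbor, hence the k + 1):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # toy data
k = 4                                                         # e.g. k = MinPts
dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
kth = np.sort(dist[:, -1])       # each point's distance to its k-th nearest neighbor
plt.plot(kth)                    # the knee of this curve suggests a value for Eps
plt.ylabel("k-NN distance")
plt.show()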
DBSCAN: Limitations
● Varying densities
● High-dimensional data
(Figure: original points and DBSCAN results with MinPts=4, Eps=9.92 and MinPts=4, Eps=9.75)
Which Clustering Algorithm?
● Type of Clustering
● Type of Cluster
○ Prototype vs connected regions vs density-based
● Characteristics of Clusters
● Characteristics of Data Sets and Attributes
● Noise and Outliers
● Number of Data Objects
● Number of Attributes
● Algorithmic Considerations
A Comparison on Clustering Algorithms
Source: Text Clustering Algorithms: A Review
Exercise
Using the L1 (Manhattan) distance:
◎ Run K-Means with initial centroids C1=(0, 0) and C2=(2, 2) for 3 iterations
◎ Run DBSCAN with Eps = 1.95 and MinPts = 3. Find the number of clusters and identify
the border points, core points, and noise points
A B C D E F G H I J K L
x 1 0 0 3 2 1 5 4 3 3 2 2
y 1 1 6 6 1 2 2 5 4 5 4 6
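To check the DBSCAN part afterwards, a hedged sketch with scikit-learn (DBSCAN accepts the Manhattan/L1 metric directly; scikit-learn's KMeans only uses Euclidean distance, so the K-Means part is best done by hand as asked):

import numpy as np
from sklearn.cluster import DBSCAN

# Points A..L from the table above
X = np.array([[1, 1], [0, 1], [0, 6], [3, 6], [2, 1], [1, 2],
              [5, 2], [4, 5], [3, 4], [3, 5], [2, 4], [2, 6]])
db = DBSCAN(eps=1.95, min_samples=3, metric="manhattan").fit(X)
print(db.labels_)                  # -1 = noise, other values = cluster ids
print(db.core_sample_indices_)     # core points; clustered points that are not core are border points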
Unsupervised Metrics
How do we determine good vs. bad clustering?
Davies–Bouldin index
◎ The Davies-Bouldin index combines the intracluster (within-cluster) variance of each
cluster with the distance between cluster centroids. For each cluster, its nearest
neighboring cluster is identified, and the sum of their intracluster variances is divided
by the distance between their centroids; the Davies-Bouldin index is the mean of these
per-cluster values.
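For reference, the usual formula, with s_i the average distance of the points in cluster i to its centroid c_i:

DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{s_i + s_j}{d(c_i, c_j)}

Lower values indicate better clustering.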
Dunn index
◎ The Dunn index quantifies the ratio between the smallest distance between cases in
different clusters and the largest distance within a cluster.
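In formula form (one common version), with d(C_i, C_j) the minimum distance between points of different clusters and diam(C_k) the largest within-cluster distance:

Dunn = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_{k} \operatorname{diam}(C_k)}

Higher values indicate more compact, better-separated clusters.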
Silhouette Coefficient
A common metric that combines cohesion and separation
3 steps, for each sample i (formulas below):
◎ Step 1: compute a(i), the average distance to the other points in i's own cluster
◎ Step 2: compute b(i), the smallest average distance from i to the points of any other cluster
◎ Step 3: compute the silhouette coefficient s(i)
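The standard definitions, for reference (C(i) denotes the cluster containing sample i):

a(i) = \frac{1}{|C(i)| - 1} \sum_{j \in C(i), j \neq i} d(i, j), \quad
b(i) = \min_{C \neq C(i)} \frac{1}{|C|} \sum_{j \in C} d(i, j), \quad
s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}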
What is the range of s(i)?
What is the run-time complexity?
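These metrics can also be computed with scikit-learn (the Dunn index has no built-in function); a minimal usage sketch on toy data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))       # higher is better
print(davies_bouldin_score(X, labels))   # lower is better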
Summary
◎ General Concepts of Clustering
○ Definition
○ Real-life Applications
○ Types of Clustering
◎ Typical Clustering Algorithms
○ K-Means
○ HAC
○ DBSCAN