Data Mining
Principal Component Analysis
Clustering
CS 584 :: Fall 2024
Ziwei Zhu
Department of Computer Science
George Mason University
Parts of these slides are from Drs. Tan, Steinbach, and Kumar; Dr. Jessica Lin; and Dr. Theodora Chaspari.
• HW3 is due on 11/04
• Quiz 3 next week (on clustering)
• 11/05: Election Day, no class
• 11/12: no class; watch the recorded video
• Final exam: 12/17, 7:30pm-9:30pm
What is Cluster Analysis
Given a set of objects, place them in groups such that objects in the same group are similar to each other and different from the objects in other groups (no labels are given).
• Unsupervised learning
• Descriptive task
Types of Clustering
• Partitional Clustering
• Hierarchical Clustering
• Density-based Clustering
• And more …
K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest
centroid
K-means Clustering: Formalization
[Figure: data points 𝑥1–𝑥5 and centroids 𝜇1, 𝜇2, illustrating the assignment of each point to its closest centroid]
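As a concrete illustration of the K-means idea (not from the original slides; the synthetic two-blob data and parameter choices below are made up for demonstration), a minimal sketch with scikit-learn:

```python
# Minimal K-means sketch: assign points to the closest centroid, minimize SSE.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic blobs around (0, 0) and (5, 5)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("centroids:\n", km.cluster_centers_)   # the learned centroids (mu_1, mu_2)
print("assignments:", km.labels_[:10])       # each point belongs to its closest centroid
print("SSE (inertia):", km.inertia_)         # sum of squared distances to assigned centroids
```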
Hierarchical Clustering
• Produces a set of nested clusters organized as a
hierarchical tree
• Can be visualized as a dendrogram
• A tree-like diagram that records the sequence of merges or splits
Hierarchical Clustering
Two main types of hierarchical clustering
• Agglomerative (bottom-up):
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) remains
• Divisive (top-down):
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster
contains an individual point (or there are k clusters)
• E.g., apply K-means recursively to split existing clusters into smaller ones
Bottom-up Clustering Algorithm
Key idea: successively merge closest clusters
Bottom-up Clustering Algorithm
Initialization: Start with each sample being a cluster
      1     2        3        4        5
1           D(1,2)   D(1,3)   D(1,4)   D(1,5)
2                    D(2,3)   D(2,4)   D(2,5)
3                             D(3,4)   D(3,5)
4                                      D(4,5)
5

[Figure: scatter plot of samples 1–5 and the dendrogram built so far]
Bottom-up Clustering Algorithm
Among current clusters, find the two with minimum distance and merge them (here, 2 and 3); then update the proximity matrix:

        1          {2,3}          4             5
1                  D(1,{2,3})     D(1,4)        D(1,5)
{2,3}                             D({2,3},4)    D({2,3},5)
4                                               D(4,5)
5

[Figure: samples 2 and 3 merged; dendrogram updated]
Bottom-up Clustering Algorithm
Among current clusters, find the two with minimum distance
        1          {2,3}             {4,5}
1                  D(1,{2,3})        D(1,{4,5})
{2,3}                                D({2,3},{4,5})
{4,5}

[Figure: samples 4 and 5 merged; dendrogram updated]
Bottom-up Clustering Algorithm
Among current clusters, find the two with minimum distance
          {1,2,3}          {4,5}
{1,2,3}                    D({1,2,3},{4,5})
{4,5}

[Figure: sample 1 merged into {2,3}; dendrogram updated]
Bottom-up Clustering Algorithm
Among current clusters, find the two with minimum distance
All samples now form a single cluster {1,2,3,4,5}.

[Figure: final merge; the complete dendrogram over samples 1–5]
Bottom-up Clustering Algorithm
The question is “How do we calculate the proximity
matrix?”
          {1,2,3}          {4,5}
{1,2,3}                    D({1,2,3},{4,5})
{4,5}

[Figure: the two remaining clusters {1,2,3} and {4,5}]
How to Define Inter-Cluster Distance
[Figures: several possible definitions of the distance between two clusters, previewing MIN, MAX, and group average]
MIN
The distance between two clusters is the distance between the two closest points in the different clusters (also known as single link).
Example: [figure: proximity/distance matrix and the resulting merges]
MIN
[Figures: nested clusters and the corresponding dendrogram produced by MIN (single link)]
Advantage of MIN
• Can handle non-globular shapes
[Figure: original points and the six clusters found by MIN]
Limitation of MIN
• Sensitive to noise
[Figure: original points and the two clusters found by MIN]
MAX
The distance between two clusters is the distance between the two most distant points in the different clusters (also known as complete link).
Example: [figure: proximity/distance matrix and the resulting merges]
MAX
[Figures: nested clusters and the corresponding dendrogram produced by MAX (complete link)]
Advantage of MAX
• Less susceptible to noise
[Figure: original points and the two clusters found by MAX]
Limitation of MAX
• Prefers globular clusters
• Tends to break large clusters
[Figure: original points and the two clusters found by MAX]
Group Average
The distance between two clusters is the average of the pairwise distances between points in the two clusters.
• Compromise between MIN and MAX
• Advantage: less susceptible to noise
• Limitation: prefers globular clusters
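To see these linkage definitions in practice, here is a small illustrative sketch (not from the slides) using SciPy's agglomerative clustering; the synthetic data and parameters are assumptions made for demonstration:

```python
# Bottom-up clustering with different linkage definitions:
# "single" = MIN, "complete" = MAX, "average" = group average.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

for method in ["single", "complete", "average"]:
    Z = linkage(X, method=method)                     # the sequence of merges (dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, "cluster sizes:", np.bincount(labels)[1:])
```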
Hierarchical Clustering: Comparison
[Figures: nested clusters and dendrograms produced by MIN, MAX, and Group Average on the same data, shown side by side]
Outline
• Introduction
• Partitional Clustering: K-means
• Hierarchical Clustering
• Density-based Clustering
Density Based Clustering
Clusters are regions of high density that are separated from
regions of low density.
[Figure: original points and the resulting clusters; dark blue points are noise]
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
• Density = number of points within a specified radius (denoted Eps)
• A point is a core point if it has at least a specified number of points (denoted MinPts) within Eps
• These are points in the interior of a cluster
• The count includes the point itself
• A border point is not a core point, but is in the neighborhood of a core point (i.e., within Eps of one)
• A noise point is any point that is neither a core point nor a border point
DBSCAN
[Figure: illustration of a core point, a border point, and a noise point (labeled A, B, and C) with MinPts = 7 and neighborhood radius Eps]
DBSCAN
[Figure: original points and their point types (core, border, and noise), using Eps = 10 and MinPts = 4]
DBSCAN Algorithm
Form clusters using core points, and assign border points to one of their neighboring clusters:
1: Label all points as core, border, or noise points.
2: Eliminate noise points.
3: Link all core points within a distance Eps of each other.
4: Make each group of connected core points into a separate cluster.
5: Assign each border point to one of the clusters of its associated core points.
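A minimal sketch of running DBSCAN with scikit-learn (illustrative; the crescent-shaped dataset and the eps/min_samples values are made-up examples, not taken from the slides):

```python
# DBSCAN via scikit-learn: eps plays the role of Eps, min_samples of MinPts.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # two non-globular clusters
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                       # cluster id per point; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("noise points:", int(np.sum(labels == -1)))
print("core points:", len(db.core_sample_indices_))
```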
When DBSCAN Works Well
• Can handle clusters of different shapes and sizes
• Resistant to noise
[Figure: original points and the discovered clusters; dark blue points are noise]
When DBSCAN Does NOT Work Well
• Varying densities
Scikit-learn: many options
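The clustering algorithms covered in this lecture are all available behind scikit-learn's common estimator interface; a brief illustrative sketch (synthetic blobs and arbitrary parameters, chosen only for demonstration):

```python
# K-means, agglomerative (hierarchical), and DBSCAN clustering in scikit-learn.
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

models = {
    "k-means":       KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3, linkage="average"),
    "dbscan":        DBSCAN(eps=0.8, min_samples=5),
}
for name, model in models.items():
    labels = model.fit_predict(X)
    print(name, "-> clusters found:", len(set(labels) - {-1}))  # -1 is DBSCAN noise
```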
What we have learnt so far
• Partitional Clustering: K-means
• Hierarchical Clustering: bottom-up approach
• Density-based Clustering: DBSCAN
Next big question: How to rigorously
evaluate clustering results?
Clustering Evaluation
Numerical measures can be classified into the following two
types:
• Unsupervised: to measure the goodness of a clustering
result without external information.
• Sum of Squared Error (SSE)
• Often called internal indices because they only use information in
the data
• Supervised: to measure the extent to which cluster labels
match externally supplied class labels.
• E.g., Entropy, V-measure
• Often called external indices because they use information
external to the data
Outline
➢ Unsupervised (Internal) Evaluation
• Supervised (External) Evaluation
Unsupervised: Correlation
• Two matrices
• Distance matrix
• “Incidence” Matrix
• One row and one column for each data point
• An entry is 1 if the associated pair of points belongs to the same cluster
• An entry is 0 if the associated pair of points belongs to different clusters
Unsupervised: Correlation – Toy Example
[Figure: a 4×4 proximity matrix and the corresponding incidence matrix for four points grouped into clusters {1,2} and {3,4}]
Unsupervised: Correlation
• Compute the correlation between the two matrices
• Since the matrices are symmetric, only the correlation
between n(n-1) / 2 entries needs to be calculated.
• High correlation indicates that points that belong to
the same cluster are close to each other.
• The higher, the better
Unsupervised: Correlation – Toy Example
First step: vectorize them (row by row)
• Proximity: 0,0.1,0.8,0.9,0.1,0,0.9,0.8,0.9,0.8,0,0.2,0.9,0.8,0.2,0
• Incidence: 1,1,0,0,1,1,0,0,0,0,1,1,0,0,1,1
Better yet, since the matrices are symmetric, we only need the upper triangle:
• Proximity: 0.1,0.8,0.9,0.9,0.8,0.2
• Incidence: 1,0,0,0,0,1
Unsupervised: Correlation – Toy Example
Now, calculate the correlation coefficient between
• Proximity: 0.1,0.8,0.9,0.9,0.8,0.2
• Incidence: 1,0,0,0,0,1
The resulting correlation coefficient is −0.9887.
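A quick way to reproduce this number (an illustrative check using NumPy, not part of the original slides):

```python
# Correlation between the upper-triangle proximity entries and incidence entries.
import numpy as np

proximity = np.array([0.1, 0.8, 0.9, 0.9, 0.8, 0.2])
incidence = np.array([1, 0, 0, 0, 0, 1])

r = np.corrcoef(proximity, incidence)[0, 1]
print(round(r, 4))  # -0.9887: pairs in the same cluster have small distances
```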
Unsupervised: Correlation – Toy Example
Now, calculate the correlation coefficient between
• Proximity: 0.1,0.8,0.9,0.9,0.8,0.2
• Incidence: 1,0,0,0,0,1
The resulting correlation coefficient is −0.9887.
• Correlation between a distance matrix and the incidence matrix: close to −1 is best
• Correlation between a similarity matrix and the incidence matrix: close to +1 is best
• A correlation near 0 is always the worst
Unsupervised: Correlation
Apply K-means to two datasets: one with well-separated clusters and one with random data.
[Figures: scatter plots of the two K-means clusterings; Corr = 0.9235 for the well-separated data, Corr = 0.5810 for the random data]
Unsupervised: Visualize Similarity Matrix
Order the similarity matrix with respect to cluster labels
and inspect visually.
[Figure: scatter plot of well-separated clusters and the reordered similarity matrix, which shows a crisp block structure]
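One minimal way to produce such a plot (an illustrative sketch; the synthetic blobs and the use of matplotlib are assumptions, not from the slides):

```python
# Reorder the pairwise similarity matrix by cluster label and display it.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

order = np.argsort(labels)               # group points belonging to the same cluster
dist = pairwise_distances(X)
sim = 1 - dist / dist.max()              # simple similarity in [0, 1]
plt.imshow(sim[np.ix_(order, order)], cmap="viridis")
plt.colorbar(label="similarity")
plt.show()                               # crisp diagonal blocks = well-separated clusters
```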
Unsupervised: Visualize Similarity Matrix
Clusters in random data are not so crisp.
[Figure: random data and its reordered similarity matrix; the block structure is much less distinct]
Unsupervised: Cohesion and Separation
[Figure: three clusters with centroids c1, c2, c3 and the overall centroid c, illustrating cohesion (within clusters) and separation (between clusters)]
Unsupervised: Cohesion and Separation
[Figure: points 𝑥1–𝑥5, cluster centroids 𝑐1 and 𝑐2, and the overall centroid 𝑐]
Total Sum of Squares (TSS)
Sum of Squares within groups (SSE)
Sum of Squares Between groups (SSB)
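The slide only names these quantities; their standard definitions (with overall centroid c, cluster centroids c_k, and clusters C_k of size n_k) are:

$\mathrm{TSS} = \sum_{x} \lVert x - c \rVert^2, \qquad \mathrm{SSE} = \sum_{k} \sum_{x \in C_k} \lVert x - c_k \rVert^2, \qquad \mathrm{SSB} = \sum_{k} n_k \lVert c_k - c \rVert^2$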
Unsupervised: Cohesion and Separation
TSS = SSE + SSB
• Given a data set, TSS is fixed
• A clustering with large SSE has small SSB, while one
with small SSE has large SSB
• Goal is to minimize SSE and maximize SSB
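A tiny numeric check of this identity (an illustrative sketch; the four 1-D points and the two clusters below are made-up example data, not from the slides):

```python
# Verify TSS = SSE + SSB for points {1, 2, 4, 5} split into clusters {1, 2} and {4, 5}.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
clusters = [np.array([1.0, 2.0]), np.array([4.0, 5.0])]

c = x.mean()                                                   # overall centroid = 3.0
tss = np.sum((x - c) ** 2)                                     # 4 + 1 + 1 + 4 = 10
sse = sum(np.sum((ck - ck.mean()) ** 2) for ck in clusters)    # 0.5 + 0.5 = 1
ssb = sum(len(ck) * (ck.mean() - c) ** 2 for ck in clusters)   # 2*2.25 + 2*2.25 = 9
print(tss, sse, ssb)   # 10.0 1.0 9.0  ->  TSS = SSE + SSB
```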
Unsupervised: Cohesion and Separation
Example: TSS = SSE + SSB
[Figures: worked examples verifying TSS = SSE + SSB]
Outline
• Unsupervised (Internal) Evaluation
• Supervised (External) Evaluation
Supervised: Entropy
Requires external class labels.
Table: K-means clustering results for the LA Times document data set.
Idea: Measure the degree to which each cluster
consists of samples of a single class.
Supervised: Entropy
m_ij: # of points in cluster j belonging to class i
m_j: # of points in cluster j
m: # of all points
L: # of classes
• Probability that a point in cluster j belongs to class i: $p_{ij} = m_{ij} / m_j$
• Entropy of cluster j: $e_j = -\sum_{i=1}^{L} p_{ij} \log_2 p_{ij}$
If the probability distribution is extremely imbalanced, the entropy is low (≈ 0); if the distribution is uniform, the entropy is high.
Supervised: Entropy
• Entropy of cluster j: $e_j = -\sum_{i=1}^{L} p_{ij} \log_2 p_{ij}$
If the probability distribution is extremely imbalanced, the entropy is low (≈ 0); if the distribution is uniform, the entropy is high.
Case 1: [1/4, 1/4, 1/4, 1/4] → entropy = 2 (maximal uncertainty)
Case 2: [1, 0, 0, 0] → entropy = 0 (no uncertainty)
Entropy measures the level of uncertainty in a system: the higher the entropy, the more uncertain.
Supervised: Entropy
• Probability that a point in cluster j belongs to class i: $p_{ij} = m_{ij} / m_j$
• Entropy of cluster j: $e_j = -\sum_{i=1}^{L} p_{ij} \log_2 p_{ij}$

            Sports   Financial   Movies
Cluster 1     2          1          7

$e_1 = -\left( \frac{2}{10}\log_2\frac{2}{10} + \frac{1}{10}\log_2\frac{1}{10} + \frac{7}{10}\log_2\frac{7}{10} \right) \approx 1.157$

• Total entropy: $e = \sum_{j=1}^{K} \frac{m_j}{m} e_j$

So the lower the entropy, the more each cluster is dominated by a single class, and the better the clustering result.
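A short illustrative computation of these quantities in Python (the first row of the contingency table matches the toy numbers above; the other two rows are made up for demonstration):

```python
# Entropy of a clustering, given a cluster-by-class count table.
import numpy as np

# Rows = clusters, columns = classes (Sports, Financial, Movies).
counts = np.array([[2, 1, 7],    # Cluster 1 from the slide
                   [8, 1, 1],    # made-up cluster
                   [0, 9, 1]])   # made-up cluster

def cluster_entropy(row):
    p = row / row.sum()
    p = p[p > 0]                 # treat 0 * log2(0) as 0
    return -np.sum(p * np.log2(p))

e_j = np.array([cluster_entropy(r) for r in counts])
weights = counts.sum(axis=1) / counts.sum()        # m_j / m
print("per-cluster entropy:", np.round(e_j, 3))    # first entry ≈ 1.157
print("total entropy:", round(float(weights @ e_j), 3))
```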
Supervised: Entropy
Table: K-means clustering results for the LA Times document data set.
So, the lower entropy we measure, the more certain a
cluster is aligned to a class label, the better clustering
result we get.
Supervised: Entropy
Entropy measures homogeneity: A perfectly homogeneous
clustering is one where each cluster has data points
belonging to the same class label.
0 entropy, but not ideal: e.g., putting every point in its own cluster gives 0 entropy yet is useless as a clustering.
Supervised: V-Measure
Combines two terms:
• Homogeneity: A perfectly homogeneous clustering is
one where each cluster has data points belonging to
the same class label.
• Completeness: A perfectly complete clustering is one
where all data points belonging to the same class are
clustered into the same cluster.
Supervised: V-Measure
[Figure: a clustering that is perfectly homogeneous but not complete]
[Figure: a clustering that is perfectly complete but not homogeneous]
Supervised: V-Measure
Table: K-means clustering results for the LA Times document data set.
Intuition:
Homogeneity: measure entropy for each row
Completeness: measure entropy for each column
Suppose we have N data samples, C different class labels, K clusters, and let m_ck be the number of data points belonging to class c and cluster k.

Homogeneity (a normalized, 0–1 quantity):
$h = 1 - \frac{H(C|K)}{H(C)}$
$H(C|K) = -\sum_{k=1}^{K} \sum_{c=1}^{C} \frac{m_{ck}}{N} \log \frac{m_{ck}}{\sum_{c'=1}^{C} m_{c'k}}$  (the inner ratio is the probability that a sample in cluster k has class c)
$H(C) = -\sum_{c=1}^{C} \frac{\sum_{k=1}^{K} m_{ck}}{N} \log \frac{\sum_{k=1}^{K} m_{ck}}{N}$  (entropy of the whole dataset w.r.t. class labels)
Completeness (with the same N, C, K, and m_ck as above):
$c = 1 - \frac{H(K|C)}{H(K)}$
$H(K|C) = -\sum_{c=1}^{C} \sum_{k=1}^{K} \frac{m_{ck}}{N} \log \frac{m_{ck}}{\sum_{k'=1}^{K} m_{ck'}}$  (the inner ratio is the probability that a sample of class c is in cluster k)
$H(K) = -\sum_{k=1}^{K} \frac{\sum_{c=1}^{C} m_{ck}}{N} \log \frac{\sum_{c=1}^{C} m_{ck}}{N}$  (entropy of the whole dataset w.r.t. cluster labels)
Supervised: V-Measure
Homogeneity: $h = 1 - \frac{H(C|K)}{H(C)}$
Completeness: $c = 1 - \frac{H(K|C)}{H(K)}$
V-measure: $V_\beta = \frac{(1+\beta)\, h \, c}{\beta h + c}$
The larger (closer to 1), the better.
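All three scores are available in scikit-learn; a brief illustrative sketch (the label vectors below are made-up examples):

```python
# Homogeneity, completeness, and V-measure from true class labels vs. cluster labels.
from sklearn.metrics import homogeneity_completeness_v_measure

labels_true = [0, 0, 0, 1, 1, 1]   # external class labels
labels_pred = [0, 0, 1, 1, 2, 2]   # cluster assignments (made-up example)

h, c, v = homogeneity_completeness_v_measure(labels_true, labels_pred)
print(f"homogeneity={h:.3f} completeness={c:.3f} v-measure={v:.3f}")
```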
What we have learnt so far
• Unsupervised (Internal) Evaluation
• Correlation
• Visualizing Similarity Matrix
• Cohesion and Separation
• Supervised (External) Evaluation
• Entropy
• V-Measure: Homogeneity and Completeness