Data Mining

Principal Component Analysis

Clustering

CS 584 :: Fall 2024


Ziwei Zhu
Department of Computer Science
George Mason University
Parts of these slides are from Drs. Tan, Steinbach, and Kumar.
Parts of these slides are from Dr. Jessica Lin.
Parts of these slides are from Dr. Theodora Chaspari.
1
• HW3 is due on 11/04
• Quiz 3 will be given next week (about clustering)
• 11/05: Election Day, no class
• 11/12: no class; watch the recorded video
• Final exam: 12/17, 7:30pm-9:30pm

2
What is Cluster Analysis?
Given a set of objects, place them into groups such that objects in the same group are similar to each other and different from objects in other groups (no labels are provided).
• Unsupervised learning
• Descriptive task

3
Types of Clustering

• Partitional Clustering
• Hierarchical Clustering
• Density-based Clustering
• And more …

4
K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest
centroid
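
A minimal K-means sketch with scikit-learn (the data points are made up for illustration; KMeans is the standard sklearn estimator):

# Fit K-means with two clusters on a tiny toy dataset.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment of each point
print(km.cluster_centers_)  # the two centroids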

5
K-means Clustering: Formalization

[Figure: five points x1 through x5 with two centroids μ1 and μ2]
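
For reference, the standard K-means objective: with centroids \mu_1, ..., \mu_K and clusters C_1, ..., C_K, K-means minimizes the sum of squared errors

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \| x - \mu_i \|^2

The algorithm alternates between assigning each point to its closest centroid and recomputing each centroid \mu_i as the mean of the points assigned to C_i.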

6
Hierarchical Clustering
• Produces a set of nested clusters organized as a
hierarchical tree
• Can be visualized as a dendrogram
• A tree like diagram that records the sequences of
merges or splits

7
Hierarchical Clustering

Two main types of hierarchical clustering


• Agglomerative (bottom-up):
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until
only one cluster (or k clusters) left

• Divisive (top-down):

8
Hierarchical Clustering

Two main types of hierarchical clustering


• Agglomerative (bottom-up):
• Divisive (top-down):
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster
contains an individual point (or there are k clusters)
• E.g., apply K-means recursively to split existing clusters into smaller ones
9
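
A minimal bottom-up (agglomerative) clustering sketch with SciPy (standard scipy.cluster.hierarchy calls; the data is made up). The linkage choices correspond to the inter-cluster distances discussed on the following slides:

# Build and cut a hierarchical clustering on five toy points.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[0.10, 0.20], [0.15, 0.22], [0.90, 0.80],
              [0.85, 0.82], [0.50, 0.95]])

Z = linkage(X, method='single')   # 'single' = MIN, 'complete' = MAX, 'average' = group average
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into 2 clusters
print(labels)
# dendrogram(Z) plots the recorded sequence of merges as a tree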
Bottom-up Clustering Algorithm
Key idea: successively merge closest clusters

10
Bottom-up Clustering Algorithm
Initialization: Start with each sample being a cluster
        1        2        3        4        5
  1           D(1,2)   D(1,3)   D(1,4)   D(1,5)
  2                    D(2,3)   D(2,4)   D(2,5)
  3                             D(3,4)   D(3,5)
  4                                      D(4,5)
  5

[Figure: scatter plot of the five points (each its own cluster) and the dendrogram leaves 1 2 3 4 5]
11
Bottom-up Clustering Algorithm
Merge the two closest clusters (here, points 2 and 3) and update the distance matrix
           1           {2,3}           4            5
  1                 D(1,{2,3})       D(1,4)       D(1,5)
  {2,3}                            D({2,3},4)   D({2,3},5)
  4                                              D(4,5)
  5

[Figure: points 2 and 3 now form one cluster; the dendrogram shows their merge]
12
Bottom-up Clustering Algorithm
Among current clusters, find the two with minimum distance
           1           {2,3}             {4,5}
  1                 D(1,{2,3})        D(1,{4,5})
  {2,3}                             D({2,3},{4,5})
  {4,5}

[Figure: clusters {2,3} and {4,5} have been formed; the dendrogram shows both merges]
13
Bottom-up Clustering Algorithm
Among current clusters, find the two with minimum distance
              {1,2,3}            {4,5}
  {1,2,3}                   D({1,2,3},{4,5})
  {4,5}

[Figure: point 1 has been merged into {2,3}; the dendrogram shows three merges]
14
Bottom-up Clustering Algorithm
Among current clusters, find the two with minimum distance
                 {1,2,3,4,5}
  {1,2,3,4,5}

[Figure: all points merged into a single cluster; the dendrogram over 1 2 3 4 5 is complete]
15
Bottom-up Clustering Algorithm
The question is: how do we calculate the proximity matrix, i.e., the distance between two clusters such as {1,2,3} and {4,5}?

              {1,2,3}            {4,5}
  {1,2,3}                   D({1,2,3},{4,5})
  {4,5}

[Figure: the two current clusters {1,2,3} and {4,5} in the scatter plot]
16
How to Define Inter-Cluster Distance

[Figure series, slides 17-20: the same pair of clusters shown with different candidate definitions of inter-cluster distance; the options examined next are MIN, MAX, and Group Average]

20
MIN
The distance between two clusters is the distance between their two closest points (one point from each cluster).

Example:

Proximity/Distance Matrix:

21
MIN

Nested Clusters Dendrogram

22
MIN

Nested Clusters Dendrogram

23
Advantage of MIN
• Can handle non-globular shapes

Original Points Six Clusters

24
Limitation of MIN
• Sensitive to noise

Original Points Two Clusters

25
MAX
The distance between two clusters is the distance between their two most distant points (one point from each cluster).

Example:

Proximity/Distance Matrix:

26
MAX

Nested Clusters Dendrogram

27
MAX

Nested Clusters Dendrogram

28
Advantage of MAX
• Less susceptible to noise

Original Points Two Clusters

29
Limitation of MAX
• Prefers globular clusters
• Tends to break large clusters

Original Points Two Clusters

30
Group Average
The distance between two clusters is the average pairwise distance between points in the two clusters.

• Compromise between MIN and MAX


• Advantage: less susceptible to noise
• Limitation: prefers globular clusters

31
Hierarchical Clustering: Comparison
[Figure: the same six points (1-6) clustered with MIN, MAX, and Group Average linkage; each linkage choice produces a different nesting of clusters]
32
Outline

• Introduction
• Partitional Clustering: K-means
• Hierarchical Clustering
• Density-based Clustering

33
Density Based Clustering
Clusters are regions of high density that are separated from
regions of low density.

[Figure: original points (left) and the density-based clusters (right); dark blue points are noise]

34
DBSCAN: Density-based spatial clustering of applications
with noise.

• Density = number of points within a specified radius (denoted as Eps)
• A point is a core point if it has at least a specified number
of points (denoted as MinPts) within Eps
• These are points that are at the interior of a cluster
• Counts the point itself
• A border point is not a core point, but is in the
neighborhood of a core point (i.e., within Eps)
• A noise point is any point that is not a core point or a
border point
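
A small sketch of these definitions (a hypothetical helper, not from the lecture, using brute-force NumPy distance computations):

# Label each point as core, border, or noise for a given Eps and MinPts.
import numpy as np

def classify_points(X, eps, min_pts):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    neighbor_counts = (d <= eps).sum(axis=1)                    # counts the point itself
    core = neighbor_counts >= min_pts
    border = ~core & ((d <= eps) & core[None, :]).any(axis=1)   # near some core point
    noise = ~core & ~border
    return core, border, noise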
37
DBSCAN
MinPts = 7

[Figure: points A, B, and C with neighborhoods of radius Eps, illustrating a core point, a border point, and a noise point]

38
DBSCAN
MinPts=7

39
DBSCAN

[Figure: original points (left) and point types (core, border, and noise) (right)]

Eps = 10, MinPts = 4


40
DBSCAN
Form clusters using core points, and assign each border point to one of its neighboring clusters
1: Label all points as core, border, or noise points.

41
DBSCAN
Form clusters using core points, and assign each border point to one of its neighboring clusters
2: Eliminate noise points.

42
DBSCAN
Form clusters using core points, and assign each border point to one of its neighboring clusters
3: Link all core points within a distance Eps of each
other.

43
DBSCAN
Form clusters using core points, and assign each border point to one of its neighboring clusters
4: Make each group of connected core points into a
separate cluster.

44
DBSCAN
Form clusters using core points, and assign each border point to one of its neighboring clusters
5: Assign each border point to one of the clusters of
its associated core points
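
This procedure is what scikit-learn's DBSCAN estimator implements; a minimal sketch (the data and parameter values are made up):

# Run DBSCAN; points labeled -1 are noise.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.RandomState(0).rand(200, 2)

db = DBSCAN(eps=0.1, min_samples=4).fit(X)
print(set(db.labels_))           # cluster ids; -1 marks noise points
print(db.core_sample_indices_)   # indices of the core points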

45
When DBSCAN Works Well
• Can handle clusters of different shapes and sizes
• Resistant to noise

[Figure: original points (left) and the clusters found by DBSCAN (right); dark blue points are noise]

46
When DBSCAN Does NOT Work Well
• Varying densities

47
Scikit-learn: many options
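
Scikit-learn's clustering module offers many estimators behind a similar interface; a minimal sketch comparing three of the methods from this lecture on the same made-up data:

# Fit three different clustering algorithms and report the number of clusters found.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

X = np.random.RandomState(0).rand(300, 2)

for model in (KMeans(n_clusters=3, n_init=10, random_state=0),
              AgglomerativeClustering(n_clusters=3, linkage='average'),
              DBSCAN(eps=0.1, min_samples=5)):
    labels = model.fit_predict(X)
    print(type(model).__name__, len(set(labels) - {-1}), "clusters")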

48
What we have learnt so far

• Partitional Clustering: K-means


• Hierarchical Clustering: bottom-up approach
• Density-based Clustering: DBSCAN

Next big question: How to rigorously evaluate clustering results?

49
Clustering Evaluation
Numerical measures can be classified into the following two
types:
• Unsupervised: to measure the goodness of a clustering
result without external information.
• Sum of Squared Error (SSE)
• Often called internal indices because they only use information in
the data
• Supervised: to measure the extent to which cluster labels
match externally supplied class labels.
• E.g., Entropy, V-measure
• Often called external indices because they use information
external to the data
51
Outline

➢ Unsupervised (Internal) Evaluation


• Supervised (External) Evaluation

52
Unsupervised: Correlation
• Two matrices
• Distance matrix
• “Incidence” Matrix
• One row and one column for each data point
• An entry is 1 if the associated pair of points belongs to the same
cluster
• An entry is 0 if the associated pair of points belongs to different
clusters

53
Unsupervised: Correlation – Toy Example

54
Unsupervised: Correlation
• Compute the correlation between the two matrices
• Since the matrices are symmetric, only the correlation
between n(n-1) / 2 entries needs to be calculated.
• A correlation that is large in magnitude indicates that points in the same cluster are close to each other
• With a distance matrix, a correlation near -1 is best; with a similarity matrix, a correlation near +1 is best (see the example below)
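
A minimal sketch of this evaluation (a hypothetical helper built from standard NumPy and scikit-learn calls):

# Correlate the pairwise distance matrix with the cluster incidence matrix.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

def cluster_correlation(X, labels):
    labels = np.asarray(labels)
    dist = euclidean_distances(X, X)
    incidence = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices(len(X), k=1)      # the n(n-1)/2 upper-triangle entries
    return np.corrcoef(dist[iu], incidence[iu])[0, 1]   # near -1 is best for distances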

55
Unsupervised: Correlation – Toy Example

56
Unsupervised: Correlation – Toy Example
First step: vectorize them (row by row)
• Proximity: 0, 0.1, 0.8, 0.9, 0.1, 0, 0.9, 0.8, 0.8, 0.9, 0, 0.2, 0.9, 0.8, 0.2, 0
• Incidence: 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1
Better yet, since the matrices are symmetric, we only need the upper triangle:
• Proximity: 0.1, 0.8, 0.9, 0.9, 0.8, 0.2
• Incidence: 1, 0, 0, 0, 0, 1

57
Unsupervised: Correlation – Toy Example
Now, calculate the correlation coefficient between
• Proximity: 0.1,0.8,0.9,0.9,0.8,0.2
• Incidence: 1,0,0,0,0,1

-0.9887
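
The same number can be reproduced with NumPy (values copied from the slide; np.corrcoef is the standard call):

# Pearson correlation between the upper-triangle distance and incidence entries.
import numpy as np

proximity = np.array([0.1, 0.8, 0.9, 0.9, 0.8, 0.2])
incidence = np.array([1, 0, 0, 0, 0, 1])
print(np.corrcoef(proximity, incidence)[0, 1])   # about -0.9887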

58
Unsupervised: Correlation – Toy Example

59
Unsupervised: Correlation – Toy Example
Now, calculate the correlation coefficient between
• Proximity: 0.1,0.8,0.9,0.9,0.8,0.2
• Incidence: 1,0,0,0,0,1

-0.9887

Correlation between distance and incidence: ~ -1 is the best
Correlation between similarity and incidence: ~ +1 is the best
~0 is always the worst

60
Unsupervised: Correlation
Apply K-means to the following two datasets

[Figure: K-means clusters on a dataset with well-separated clusters and on random (uniform) data]

Corr = 0.9235 (well-separated data)      Corr = 0.5810 (random data)

61
Unsupervised: Visualize Similarity Matrix
Order the similarity matrix with respect to cluster labels
and inspect visually.

[Figure: clustered data (left) and its point-to-point similarity matrix ordered by cluster label (right); well-separated clusters show up as bright diagonal blocks]

62
Unsupervised: Visualize Similarity Matrix
Clusters in random data are not so crisp.

[Figure: random data (left) and its similarity matrix ordered by cluster label (right); the diagonal blocks are much less distinct]
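
A sketch of how such a plot can be produced (an assumed recipe using standard NumPy, scikit-learn, and Matplotlib calls; the data is made up):

# Sort a similarity matrix by cluster label and display it as an image.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import euclidean_distances

X = np.random.RandomState(0).rand(100, 2)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

order = np.argsort(labels)              # group the points by cluster label
sim = 1.0 - euclidean_distances(X, X)   # a simple similarity derived from distance
plt.imshow(sim[np.ix_(order, order)], cmap='viridis')
plt.colorbar(label='Similarity')
plt.show()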

63
Unsupervised: Cohesion and Separation

[Figure: three clusters with centroids c1, c2, and c3, and the overall centroid c]

64
Unsupervised: Cohesion and Separation

[Figure: points x1 through x5 with cluster centroids c1 and c2 and the overall centroid c]

Total Sum of Squares (TSS)
Sum of Squares Within groups (SSE)
Sum of Squares Between groups (SSB)
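
For reference, the standard definitions behind these three quantities (c_i is the centroid of cluster C_i, c is the overall centroid, and K is the number of clusters):

TSS = \sum_{x} \| x - c \|^2
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \| x - c_i \|^2
SSB = \sum_{i=1}^{K} |C_i| \, \| c - c_i \|^2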
65
Unsupervised: Cohesion and Separation

𝑻𝑺𝑺 = 𝑺𝑺𝑬 + 𝑺𝑺𝑩

• Given a data set, TSS is fixed


• A clustering with large SSE has small SSB, while one
with small SSE has large SSB
• Goal is to minimize SSE and maximize SSB

66
Unsupervised: Cohesion and Separation
Example:

𝑻𝑺𝑺 = 𝑺𝑺𝑬 + 𝑺𝑺𝑩

67
Unsupervised: Cohesion and Separation
Example:

𝑻𝑺𝑺 = 𝑺𝑺𝑬 + 𝑺𝑺𝑩

68
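
A small worked example of the identity (hypothetical points chosen for easy arithmetic, not necessarily the ones pictured on the slides): take the 1-D points 1, 2, 4, 5, whose overall centroid is c = 3.

TSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10

With K = 1 (a single cluster centered at 3): SSE = 10 and SSB = 0, so TSS = 10.

With K = 2, clusters {1, 2} (centroid 1.5) and {4, 5} (centroid 4.5):
SSE = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
SSB = 2(3-1.5)^2 + 2(3-4.5)^2 = 9
Again TSS = SSE + SSB = 10.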
Outline

• Unsupervised (Internal) Evaluation


➢ Supervised (External) Evaluation

69
Supervised: Entropy
External labels

Table: K-means clustering results for the LA Times document data set.

Idea: Measure the degree to which each cluster consists of samples of a single class.
70
Supervised: Entropy
m_{ij}: # of points in cluster j belonging to class i
m_{j}: # of points in cluster j
m: # of all points
L: # of classes

71
Supervised: Entropy
m_{ij}: # of points in cluster j belonging to class i
m_{j}: # of points in cluster j
m: # of all points
L: # of classes

• Probability that a point in cluster j belongs to class i: p_{ij} = m_{ij} / m_{j}

• Entropy of cluster j: e_j = -\sum_{i=1}^{L} p_{ij} \log_2 p_{ij}

If the probability distribution is extremely imbalanced, the entropy is low (~0); if the distribution is uniform, the entropy is high.

72
Supervised: Entropy
• Entropy of cluster j: e_j = -\sum_{i=1}^{L} p_{ij} \log_2 p_{ij}
If the probability distribution is extremely imbalanced, the entropy is low (~0); if the distribution is uniform, the entropy is high.

Case 1: [1/4, 1/4, 1/4, 1/4] (uniform: entropy = 2, the maximum for four classes)
Case 2: [1, 0, 0, 0] (extremely imbalanced: entropy = 0)

Entropy measures the level of uncertainty in a system: the higher, the more uncertain.
73
Supervised: Entropy
• Probability that a point in cluster j belongs to class i: p_{ij} = m_{ij} / m_{j}
• Entropy of cluster j: e_j = -\sum_{i=1}^{L} p_{ij} \log_2 p_{ij}

            Sports   Financial   Movies
Cluster 1      2         1          7

e_1 = -( (2/10) \log_2(2/10) + (1/10) \log_2(1/10) + (7/10) \log_2(7/10) ) ≈ 1.16

• Total entropy: e = \sum_{j=1}^{K} (m_{j} / m) e_j

So the lower the measured entropy, the more closely each cluster aligns with a single class label, and the better the clustering result.
74
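
A small sketch that reproduces these entropy computations (a hypothetical helper; the counts are the Cluster 1 row from the table above):

# Per-cluster entropy and the size-weighted total entropy.
import numpy as np

def cluster_entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

clusters = {1: [2, 1, 7]}                 # class counts per cluster
m = sum(sum(c) for c in clusters.values())
print(cluster_entropy(clusters[1]))       # about 1.16
print(sum(sum(c) / m * cluster_entropy(c) for c in clusters.values()))  # total entropy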
Supervised: Entropy

Table: K-means clustering results for the LA Times document data set.

So the lower the measured entropy, the more closely each cluster aligns with a single class label, and the better the clustering result.

75
Supervised: Entropy
Entropy measures homogeneity: A perfectly homogeneous
clustering is one where each cluster has data points
belonging to the same class label.

0 entropy, but not ideal (e.g., making every point its own cluster is perfectly homogeneous, yet not a useful clustering)


76
Supervised: V-Measure

Combines two terms:


• Homogeneity: A perfectly homogeneous clustering is
one where each cluster has data points belonging to
the same class label.
• Completeness: A perfectly complete clustering is one
where all data points belonging to the same class are
clustered into the same cluster.

77
Supervised: V-Measure

Perfectly Homogeneous but not Complete

78
Supervised: V-Measure

Perfectly Complete but not Homogeneous

79
Supervised: V-Measure

Table: K-means clustering results for the LA Times document data set.

Intuition (in a table with one row per cluster and one column per class):
• Homogeneity: measure the entropy of each row (how mixed each cluster is)
• Completeness: measure the entropy of each column (how spread out each class is)
80
Suppose we have N data samples, C different class labels, K clusters, and m_{ck} data points belonging to class c and cluster k.

Homogeneity (normalized to 0~1 using entropy):

h = 1 - H(C|K) / H(C)

H(C|K) = -\sum_{k=1}^{K} \sum_{c=1}^{C} \frac{m_{ck}}{N} \log \frac{m_{ck}}{\sum_{c'=1}^{C} m_{c'k}}
(the ratio inside the log is the probability that a sample in cluster k belongs to class c)

H(C) = -\sum_{c=1}^{C} \frac{\sum_{k=1}^{K} m_{ck}}{N} \log \frac{\sum_{k=1}^{K} m_{ck}}{N}
(the entropy of the whole dataset w.r.t. the class labels)


81
Suppose we have N data samples, C different class labels, K clusters, and m_{ck} data points belonging to class c and cluster k.

Completeness:

c = 1 - H(K|C) / H(K)

H(K|C) = -\sum_{c=1}^{C} \sum_{k=1}^{K} \frac{m_{ck}}{N} \log \frac{m_{ck}}{\sum_{k'=1}^{K} m_{ck'}}
(the ratio inside the log is the probability that a sample of class c lies in cluster k)

H(K) = -\sum_{k=1}^{K} \frac{\sum_{c=1}^{C} m_{ck}}{N} \log \frac{\sum_{c=1}^{C} m_{ck}}{N}
(the entropy of the whole dataset w.r.t. the cluster labels)


82
Supervised: V-Measure
Homogeneity:  h = 1 - H(C|K) / H(C)

Completeness: c = 1 - H(K|C) / H(K)

V-measure:    V_\beta = \frac{(1 + \beta) \, h \, c}{\beta h + c}

The larger (closer to 1), the better.
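
scikit-learn provides these measures directly; a minimal sketch (the label vectors are made up: classes 1 and 2 end up merged into a single cluster, so completeness is 1 while homogeneity is below 1):

# Compute homogeneity, completeness, and V-measure from true classes and cluster ids.
from sklearn.metrics import homogeneity_completeness_v_measure

labels_true = [0, 0, 1, 1, 2, 2]   # class labels
labels_pred = [0, 0, 1, 1, 1, 1]   # cluster assignments
h, c, v = homogeneity_completeness_v_measure(labels_true, labels_pred)
print(h, c, v)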

83
What we have learnt so far

• Unsupervised (Internal) Evaluation


• Correlation
• Visualizing Similarity Matrix
• Cohesion and Separation
• Supervised (External) Evaluation
• Entropy
• V-Measure: Homogeneity and Completeness

84
