Clustering
Clustering Houses
[Figure: example of clustering houses, e.g., a size-based grouping]
Examples of Clustering Applications
◼ Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
◼ Land use: Identification of areas of similar land use in an earth
observation database
◼ Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost
◼ City-planning: Identifying groups of houses according to their house
type, value, and geographical location
◼ Earthquake studies: observed earthquake epicenters should be clustered along continental faults
Major Clustering Approaches (I)
◼ Partitioning approach:
◼ Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
◼ Typical methods: k-means, k-medoids, CLARANS
◼ Hierarchical approach:
◼ Create a hierarchical decomposition of the set of data (or objects) using
some criterion
◼ Typical methods: Diana, Agnes, BIRCH, ROCK, CAMELEON
◼ Density-based approach:
◼ Based on connectivity and density functions
◼ Typical methods: DBSCAN, OPTICS, DenClue
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
◼ Centroid: the “middle” of a cluster
  C_m = ( Σ_{i=1..N} t_ip ) / N
◼ Radius: square root of the average squared distance from any point of the cluster to its centroid
  R_m = sqrt( Σ_{i=1..N} (t_ip − C_m)² / N )
◼ Diameter: square root of the average squared distance between all pairs of points in the cluster
  D_m = sqrt( Σ_{i=1..N} Σ_{j=1..N} (t_ip − t_jq)² / (N(N−1)) )
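As a quick illustration, here is a minimal NumPy sketch of these three quantities; the sample cluster points are made up for the example:

import numpy as np

# A small, made-up cluster of 2-D points
cluster = np.array([[2.0, 3.0], [3.0, 3.0], [2.5, 4.0], [3.5, 4.5]])
N = len(cluster)

# Centroid: component-wise mean of the member points
centroid = cluster.mean(axis=0)

# Radius: sqrt of the average squared distance from members to the centroid
radius = np.sqrt(((cluster - centroid) ** 2).sum(axis=1).mean())

# Diameter: sqrt of the average squared distance over all ordered pairs of
# distinct points, i.e. divide by N * (N - 1); diagonal terms are zero anyway
diff = cluster[:, None, :] - cluster[None, :, :]
sq_dists = (diff ** 2).sum(axis=-1)
diameter = np.sqrt(sq_dists.sum() / (N * (N - 1)))

print(centroid, radius, diameter)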
The K-Means Clustering Method
◼ Example
[Figure: K-means with K = 2 — arbitrarily choose K objects as the initial cluster means; assign each object to the most similar center; update the cluster means; reassign and repeat until no change]
K-Means example
◼ Cluster the following items into 2 clusters: {2, 4, 10, 12, 3, 20, 30, 11, 25}
◼ Distance: d(C_i, t_j) = (C_i − t_j)²
◼ Assign t_j to the cluster K_i with the minimum d(C_i, t_j)

First iteration (initial means M1 = 2, M2 = 4):

M1   M2   K1       K2
2    4    {2, 3}   {4, 10, 12, 20, 30, 11, 25}
Stopping Criteria:
• No new assignment
• No change in cluster means
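Below is a minimal plain-Python sketch of this 1-D run, using the same items and the same initial means M1 = 2 and M2 = 4; the variable names are my own:

# Minimal 1-D k-means run for the example above (K = 2).
items = [2, 4, 10, 12, 3, 20, 30, 11, 25]
means = [2.0, 4.0]                      # initial means M1, M2

while True:
    # Assignment step: each item goes to the nearest current mean
    clusters = [[], []]
    for x in items:
        k = min(range(len(means)), key=lambda i: (means[i] - x) ** 2)
        clusters[k].append(x)

    # Update step: recompute each mean from its members
    new_means = [sum(c) / len(c) for c in clusters]

    # Stop when the means (and hence the assignments) no longer change
    if new_means == means:
        break
    means = new_means

print(clusters, means)   # expected clusters: {2, 3, 4, 10, 11, 12} and {20, 25, 30}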
K-means 2D example
◼ Apply k-means to the following dataset to form 2 clusters:

X    Y
185  72
170  56
168  60
179  68
182  72
188  77

Step 1: Assume initial centroids C1 = (185, 72), C2 = (170, 56)
Step 2: Calculate the Euclidean distance of each point to each centroid:
d[(x, y), (a, b)] = √((x − a)² + (y − b)²)

For t1 = (168, 60):
d[(185, 72), (168, 60)] = √((185 − 168)² + (72 − 60)²) = 20.808
d[(170, 56), (168, 60)] = √((170 − 168)² + (56 − 60)²) = 4.472
Since d(C2, t1) < d(C1, t1), assign t1 to C2.

Step 3: For t2 = (179, 68):
d[(185, 72), (179, 68)] = √((185 − 179)² + (72 − 68)²) = 7.211
d[(170, 56), (179, 68)] = √((170 − 179)² + (56 − 68)²) = 15
Since d(C1, t2) < d(C2, t2), assign t2 to C1.

Step 4: For t3 = (182, 72):
d[(185, 72), (182, 72)] = √((185 − 182)² + (72 − 72)²) = 3
d[(170, 56), (182, 72)] = √((170 − 182)² + (56 − 72)²) = 20
Since d(C1, t3) < d(C2, t3), assign t3 to C1.
K-means 2D example (continued)
Step 5: For t4 = (188, 77):
d[(185, 72), (188, 77)] = √((185 − 188)² + (72 − 77)²) = 5.83
d[(170, 56), (188, 77)] = √((170 − 188)² + (56 − 77)²) = 27.66
Since d(C1, t4) < d(C2, t4), assign t4 to C1.

Final clusters:
C1 = {(185, 72), (179, 68), (182, 72), (188, 77)}
C2 = {(170, 56), (168, 60)}
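A small NumPy check of the assignment steps above (same six points and the same initial centroids; the ordering of `points` is mine):

import numpy as np

# Dataset and the assumed initial centroids from the worked example
points = np.array([[185, 72], [170, 56], [168, 60],
                   [179, 68], [182, 72], [188, 77]], dtype=float)
centroids = np.array([[185, 72], [170, 56]], dtype=float)   # C1, C2

# Euclidean distance of every point to every centroid, then nearest centroid
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
assign = dists.argmin(axis=1)            # 0 -> C1, 1 -> C2

for p, d, a in zip(points, dists, assign):
    print(p, d.round(2), "-> C%d" % (a + 1))
# t1 = (168, 60) goes to C2; t2, t3, t4 go to C1, matching the steps above.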
The K-Medoids Clustering Method
◼ Find representative objects, called medoids, in clusters
◼ PAM (Partitioning Around Medoids, 1987)
◼ starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the
total distance of the resulting clustering
◼ PAM works effectively for small data sets, but does not scale
well for large data sets
◼ CLARA (Kaufmann & Rousseeuw, 1990) – Clustering LARge
Applications
◼ CLARANS (Ng & Han, 1994): Clustering Large Applications based
upon RANdomized Search
A Typical K-Medoids Algorithm (PAM)
[Figure: PAM iteration (total cost = 20) — arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid; then repeatedly compute the total cost of swapping a medoid with a non-medoid object and perform the swap if it improves the quality, until no change]
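The following is a compact, brute-force sketch of the swap-based PAM idea illustrated above; the sample points and helper names are illustrative, not from the slides:

import numpy as np

def total_cost(points, medoid_idx):
    """Sum of distances from every point to its nearest medoid."""
    d = np.linalg.norm(points[:, None, :] - points[medoid_idx][None, :, :], axis=-1)
    return d.min(axis=1).sum()

def pam(points, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(points), size=k, replace=False))  # arbitrary initial medoids
    best = total_cost(points, medoids)
    improved = True
    while improved:                        # loop until no swap improves the total cost
        improved = False
        for m in list(medoids):
            for o in range(len(points)):
                if o in medoids:
                    continue
                trial = [o if x == m else x for x in medoids]   # swap medoid m with non-medoid o
                cost = total_cost(points, trial)
                if cost < best:            # keep the swap only if quality improves
                    best, medoids, improved = cost, trial, True
    return medoids, best                   # medoid indices and total cost

pts = np.array([[1, 1], [1.5, 2], [2, 1], [8, 8], [8.5, 9], [9, 8]], dtype=float)
print(pam(pts, k=2))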
Chapter 5. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
Hierarchical Clustering
AGNES (Agglomerative Nesting)
[Figure: AGNES merges clusters step by step, starting from individual objects and joining them into larger and larger clusters]
Hierarchical Clustering Single Link
example
Distance matrix (the same matrix is used in the complete-link and average-link examples below):

    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0

Step 1: At level 0 there are 5 clusters: {A}, {B}, {C}, {D}, {E}
When merging, single link uses the minimum pairwise distance between clusters, e.g., d({A, B}, {D}) = min(A→D, B→D) = min(2, 4) = 2
Hierarchical Clustering Complete Link
example
Step 1: At level 0 there are 5 clusters: {A}, {B}, {C}, {D}, {E} (same distance matrix as above)
When merging, complete link uses the maximum pairwise distance between clusters, e.g., d({A, B}, {D}) = max(A→D, B→D) = max(2, 4) = 4
Hierarchical Clustering Average Link
example
    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0

When merging, average link uses the average of all pairwise distances between the two clusters.
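To compare how the three linkage criteria behave on this matrix, a short SciPy sketch (it assumes the matrix above, fed to `scipy.cluster.hierarchy.linkage` in condensed form):

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram

labels = ["A", "B", "C", "D", "E"]
D = np.array([
    [0, 1, 2, 2, 3],
    [1, 0, 2, 4, 3],
    [2, 2, 0, 1, 5],
    [2, 4, 1, 0, 3],
    [3, 3, 5, 3, 0],
], dtype=float)

condensed = squareform(D)                  # condensed form expected by linkage()
for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)  # each row: clusters merged, distance, new size
    print(method, "\n", Z)
# dendrogram(Z, labels=labels) can then be used to plot the resulting hierarchy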
DIANA (Divisive Analysis)
[Figure: DIANA works in the reverse direction of AGNES — starting from one all-inclusive cluster and splitting it step by step until each object forms its own cluster]
Recent Hierarchical Clustering Methods
◼ Major weakness of agglomerative clustering methods
◼ do not scale well: time complexity of at least O(n²), where n is the total number of objects
◼ can never undo what was done previously
Evaluation of Clustering
Evaluation of Clustering – assessing clustering tendency
◼ Use statistical tests for spatial randomness to measure the probability that the data set was generated by a uniform data distribution.
◼ Hopkins statistic (a sketch follows the hypotheses below)
◼ Null hypothesis: D is uniformly distributed and contains no meaningful clusters
◼ Alternative hypothesis: D is not uniformly distributed and contains meaningful clusters
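A rough sketch of one common form of the Hopkins statistic; the sample size, the uniform sampling box, and the convention that values near 1 indicate clustering tendency are assumptions of this particular formulation:

import numpy as np

def hopkins(X, m=None, seed=0):
    """Hopkins statistic: ~0.5 for uniformly distributed data; values close
    to 1 suggest a clustering tendency (this convention puts the
    uniform-sample distances in the numerator)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)                   # small sample, e.g. 10% of the data

    def nn_dist(points, data, exclude_self=False):
        dist = np.linalg.norm(points[:, None, :] - data[None, :, :], axis=-1)
        if exclude_self:
            dist[dist == 0] = np.inf           # crude way to skip the point itself
        return dist.min(axis=1)

    # u: nearest-neighbour distances of m uniform random points to the data
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn_dist(uniform, X)
    # w: nearest-neighbour distances of m sampled real points to the rest of the data
    sample = X[rng.choice(n, size=m, replace=False)]
    w = nn_dist(sample, X, exclude_self=True)

    return u.sum() / (u.sum() + w.sum())

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 6])  # two blobs
print(hopkins(X))   # should be noticeably above 0.5 for clustered data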
Evaluation of Clustering – determining number of clusters
◼ Elbow method
◼ Increasing the number of clusters reduces the sum of within-cluster variance, since it creates finer groups, i.e., groups whose members are more similar to each other.
◼ However, beyond some point adding more clusters gives only a marginal reduction in variance; the "elbow" in the plot of within-cluster variance versus the number of clusters suggests the right number of clusters (see the sketch below).
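A minimal scikit-learn sketch of the elbow heuristic; the blob data is synthetic, and `inertia_` is scikit-learn's name for the within-cluster sum of squares:

import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with three obvious groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

# Within-cluster sum of squares for k = 1..8; look for the "elbow" where
# additional clusters stop giving a large reduction
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))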
◼ How good is the clustering generated by a method, and how can we compare the clusterings generated by different methods?
◼ Ground truth - the ideal clustering that is often built using human experts
◼ Two methods depending on the availability of ground truth
◼ Extrinsic method
◼ Supervised
◼ ground truth is available, i.e., the true category labels are known
◼ compare the clustering against the ground truth and measure the agreement
◼ Intrinsic method
◼ Unsupervised
◼ ground truth is not available
◼ evaluate the goodness of a clustering by considering how well the clusters are separated.
Evaluation of Clustering – measuring clustering quality
◼ Extrinsic method
◼ Cluster homogeneity
◼ Checks for cluster purity – the purer the clusters, the better the clustering
◼ E.g.:
◼ ground truth – L1, …, Ln are the categories of the data in a dataset D
◼ Clustering method C1 places data points from two different categories Li and Lj in one cluster
◼ Clustering method C2 places data points from two different categories Li and Lj in different clusters
◼ Here C2 creates purer clusters than C1, so C2 scores higher on cluster homogeneity
◼ Cluster completeness
◼ Counterpart of cluster homogeneity
◼ Cluster completeness requires that, if any two objects belong to the same category according to the ground truth, they should be assigned to the same cluster.
◼ E.g.: if clustering method C1 places all objects of the same category in one cluster, while C2 splits them across two clusters, then C1 is better with respect to cluster completeness (see the sketch below)
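scikit-learn provides scores along these lines; a tiny illustration with made-up ground-truth categories and two hypothetical clusterings C1 and C2:

from sklearn.metrics import homogeneity_score, completeness_score

truth = [0, 0, 0, 1, 1, 1]          # ground-truth categories Li
c1    = [0, 0, 0, 0, 1, 1]          # mixes two categories in cluster 0
c2    = [0, 0, 0, 1, 1, 1]          # keeps each category in its own cluster

for name, labels in (("C1", c1), ("C2", c2)):
    print(name,
          "homogeneity=", round(homogeneity_score(truth, labels), 2),
          "completeness=", round(completeness_score(truth, labels), 2))
# C2 scores 1.0 on both; C1 is penalized for the mixed cluster.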
◼ Extrinsic method
◼ Rag bags
◼ A rag bag is a cluster of miscellaneous objects that cannot be merged with the other categories
◼ The rag bag criterion states that putting a heterogeneous object into a pure cluster
should be penalized more than putting it into a rag bag.
◼ E.g.:
◼ Clustering method C1 places all objects of one category in one cluster, together with one extra object o that belongs to a different category
◼ Clustering method C2 instead places object o in a separate cluster C' that contains miscellaneous objects from different categories, i.e., C' is a noisy cluster, a rag bag
◼ So, according to the rag bag criterion, the score of C2 > C1
Evaluation of Clustering – measuring clustering quality
◼ Extrinsic method
◼ Small cluster preservation
◼ If a small category is split into small pieces in a clustering, those pieces are likely to become noise, and the small category can then no longer be discovered from the clustering.
◼ The small cluster preservation criterion states that splitting a small category
into pieces is more harmful than splitting a large category into pieces.
Evaluation of Clustering – BCubed precision and recall
BCubed precision: how many other objects in the same cluster belong to the same category as the object
BCubed recall: how many objects of the same category are assigned to the same cluster as the object
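A short sketch of per-object BCubed precision and recall averaged over all objects; the helper and the example labels are my own:

def bcubed(categories, clusters):
    """Average BCubed precision and recall over all objects."""
    n = len(categories)
    precision = recall = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if clusters[j] == clusters[i]]
        same_category = [j for j in range(n) if categories[j] == categories[i]]
        correct = [j for j in same_cluster if categories[j] == categories[i]]
        precision += len(correct) / len(same_cluster)   # same cluster, same category
        recall += len(correct) / len(same_category)     # same category, same cluster
    return precision / n, recall / n

# Ground-truth categories and a clustering of six objects (made-up labels)
cats = ["x", "x", "x", "y", "y", "y"]
clus = [ 0,   0,   1,   1,   1,   1 ]
print(bcubed(cats, clus))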
Evaluation of Clustering – Intrinsic method
◼ Intrinsic method
◼ Unsupervised – ground truth is not available
◼ evaluate a clustering by examining how well the clusters are separated and how compact the
clusters are
◼ Silhouette coefficient
◼ For a data set, D, of n objects, suppose D is partitioned into k clusters, C1, …. ,Ck.
◼ For each object o ϵ D,
◼ a(o) is the average distance between o and all other objects in the cluster to which o belongs; it reflects the compactness of that cluster.
◼ b(o) is the minimum average distance from o to all clusters to which o does not belong; it captures the degree to which o is separated from other clusters.
◼ The silhouette coefficient of o is s(o) = (b(o) − a(o)) / max{a(o), b(o)} (computed in the sketch below).
◼ The smaller the value of a(o), the more compact the cluster, and the larger b(o) is, the more separated o is from other clusters. Therefore, when the silhouette coefficient of o approaches 1, the cluster containing o is compact and o is far away from other clusters, which is the preferable case.
◼ However, when the silhouette coefficient is negative (i.e., b(o) < a(o)), o is, in expectation, closer to the objects of another cluster than to the objects of its own cluster. In many cases this is a bad situation and should be avoided.
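A small sketch that computes a(o), b(o), and s(o) directly from raw points and labels; the example points are made up (scikit-learn's silhouette_samples / silhouette_score offer the same functionality):

import numpy as np

def silhouette(X, labels):
    """Per-object silhouette coefficients from raw points and cluster labels."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i in range(len(X)):
        same = (labels == labels[i])
        same[i] = False                                   # exclude o itself from a(o)
        a = dist[i, same].mean() if same.any() else 0.0
        # b(o): minimum average distance to any cluster o does not belong to
        b = min(dist[i, labels == c].mean()
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return np.array(scores)

X = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
labels = [0, 0, 0, 1, 1, 1]
print(silhouette(X, labels).round(2), silhouette(X, labels).mean().round(2))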