
Clustering


What is Cluster Analysis?


◼ Cluster: a collection of data objects
◼ Similar to one another within the same cluster
◼ Dissimilar to the objects in other clusters
◼ Cluster analysis
◼ Finding similarities between data according to the characteristics
found in the data and grouping similar data objects into clusters
◼ Unsupervised learning: no predefined classes
◼ Typical applications
◼ As a stand-alone tool to get insight into data distribution
◼ As a preprocessing step for other algorithms

Clustering Houses

[Figure slides: the same set of houses clustered in two different ways – size based and geographic distance based.]

Clustering: Rich Applications and Multidisciplinary Efforts

◼ Pattern Recognition
◼ Clustering methods group similar patterns into clusters whose members are more similar to each other than to members of other clusters
◼ Spatial Data Analysis
◼ Create thematic maps in GIS by clustering feature spaces
◼ Detect spatial clusters or support other spatial mining tasks
◼ Image Processing
◼ Clustering is used in image segmentation to separate image objects for further analysis
◼ Economic Science (especially market research)
◼ Cluster analysis classifies objects into relatively homogeneous groups based on a set of variables such as demographics, psychographics, buying behaviour, attitudes, and preferences
◼ WWW
◼ Document classification
◼ Cluster Web log data to discover groups of similar access patterns

Examples of Clustering Applications
◼ Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
◼ Land use: Identification of areas of similar land use in an earth
observation database
◼ Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost
◼ City-planning: Identifying groups of houses according to their house
type, value, and geographical location
◼ Earthquake studies: observed earthquake epicenters should be clustered along continental faults


Quality: What Is Good Clustering?

◼ A good clustering method will produce high-quality clusters with
◼ high intra-class similarity
◼ low inter-class similarity
◼ The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
◼ The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns

Major Clustering Approaches (I)

◼ Partitioning approach:
◼ Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
◼ Typical methods: k-means, k-medoids, CLARANS
◼ Hierarchical approach:
◼ Create a hierarchical decomposition of the set of data (or objects) using
some criterion
◼ Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
◼ Density-based approach:
◼ Based on connectivity and density functions
◼ Typical methods: DBSCAN, OPTICS, DenClue


Typical Alternatives to Calculate the Distance between Clusters

◼ Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min(dist(tip, tjq))
◼ Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max(dist(tip, tjq))
◼ Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg(dist(tip, tjq))
◼ Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dist(Ci, Cj)
◼ Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dist(Mi, Mj)
◼ Medoid: one chosen, centrally located object in the cluster (a code sketch of these measures follows below)
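The five inter-cluster distance measures above translate directly into a few lines of code. The following is a minimal NumPy sketch (the function names are illustrative, not from the slides), assuming each cluster is stored as an array of points:

# Illustrative inter-cluster distance measures (NumPy assumed available)
import numpy as np

def pairwise(Ki, Kj):
    """All pairwise Euclidean distances between the points of two clusters."""
    return np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=-1)

def single_link(Ki, Kj):   return pairwise(Ki, Kj).min()     # smallest pairwise distance
def complete_link(Ki, Kj): return pairwise(Ki, Kj).max()     # largest pairwise distance
def average_link(Ki, Kj):  return pairwise(Ki, Kj).mean()    # average pairwise distance
def centroid_dist(Ki, Kj): return np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0))

def medoid(K):
    """Medoid: the point of K with the smallest total distance to the other points of K."""
    return K[pairwise(K, K).sum(axis=1).argmin()]

def medoid_dist(Ki, Kj):   return np.linalg.norm(medoid(Ki) - medoid(Kj))

For example, single_link(np.array([[0, 0], [1, 0]]), np.array([[3, 0]])) returns 2.0, while complete_link on the same two clusters returns 3.0.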

Centroid, Radius and Diameter of a Cluster (for numerical data sets)

◼ Centroid: the “middle” of a cluster
Cm = (Σ_{i=1}^{N} tip) / N

◼ Radius: square root of the average distance from any point of the cluster to its centroid
Rm = sqrt( Σ_{i=1}^{N} (tip − Cm)² / N )

◼ Diameter: square root of the average mean squared distance between all pairs of points in the cluster
Dm = sqrt( Σ_{i=1}^{N} Σ_{j=1}^{N} (tip − tjq)² / (N(N − 1)) )
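For a numeric cluster stored as an array of points, the three quantities above can be computed directly. A short NumPy sketch (function names are illustrative):

# Illustrative centroid, radius, and diameter of a cluster of numeric points
import numpy as np

def centroid(K):
    return K.mean(axis=0)                                       # Cm: mean point of the cluster

def radius(K):
    d2 = ((K - centroid(K)) ** 2).sum(axis=1)                   # squared distance of each point to Cm
    return np.sqrt(d2.mean())                                   # Rm

def diameter(K):
    N = len(K)
    d2 = ((K[:, None, :] - K[None, :, :]) ** 2).sum(axis=-1)    # squared distances over all pairs (i = j adds 0)
    return np.sqrt(d2.sum() / (N * (N - 1)))                    # Dm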


Partitioning Algorithms: Basic Concept

◼ Partitioning method: Construct a partition of a database D of n objects into a set of k clusters, s.t. the sum of squared distances is minimized:

E = Σ_{m=1}^{k} Σ_{tmi ∈ Km} (Cm − tmi)²


◼ Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion
◼ Global optimal: exhaustively enumerate all partitions
◼ Heuristic methods: k-means and k-medoids algorithms
◼ k-means (MacQueen’67): Each cluster is represented by the center
of the cluster
◼ k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster

The K-Means Clustering Method

◼ Given k, the k-means algorithm is implemented in four steps:
◼ Partition objects into k nonempty subsets
◼ Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
◼ Assign each object to the cluster with the nearest seed point
◼ Go back to Step 2; stop when no object changes its cluster assignment (a minimal sketch follows below)
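A minimal from-scratch sketch of these four steps, assuming Euclidean distance and NumPy arrays (the function name, the random seeding, and the assumption that no cluster becomes empty are illustrative choices, not part of the slides):

# Minimal k-means sketch following the four steps above
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]       # steps 1-2: pick initial seed points
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)                            # step 3: assign to nearest seed point
        # step 2 (again): recompute each centroid (assumes no cluster becomes empty)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):                    # step 4: stop when nothing changes
            break
        centers = new_centers
    return centers, labels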


The K-Means Clustering Method

◼ Example (K = 2):
[Figure: arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign the objects; repeat updating and reassigning until no assignment changes.]

K-Means example
◼ Cluster the following items into 2 clusters: {2, 4, 10, 12, 3, 20, 30, 11, 25}
◼ d(Ci, tj) = (Ci − tj)², the squared distance of item tj to cluster mean Ci
◼ Assign tj to the cluster Ki with the minimum d(Ci, tj)

M1     M2     K1                       K2
2      4      {2, 3}                   {4, 10, 12, 20, 30, 11, 25}
2.5    16     {2, 3, 4}                {10, 12, 20, 30, 11, 25}
3      18     {2, 3, 4, 10}            {12, 20, 30, 11, 25}
4.75   19.6   {2, 3, 4, 10, 12, 11}    {20, 30, 25}
7      25     {2, 3, 4, 10, 12, 11}    {20, 30, 25}

Stopping Criteria:
• No new assignment
• No change in cluster means
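As a quick cross-check of the 1-D example above, the same items can be run through a library implementation. This sketch assumes scikit-learn is available and seeds it with the same initial means, 2 and 4, used in the table:

# Hypothetical cross-check of the 1-D example with scikit-learn (assumed installed)
import numpy as np
from sklearn.cluster import KMeans

data = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float).reshape(-1, 1)

km = KMeans(n_clusters=2, init=np.array([[2.0], [4.0]]), n_init=1).fit(data)
print(km.labels_)           # cluster index of each item
print(km.cluster_centers_)  # expected final means: 7 and 25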


K-means 2D example
◼ Apply k-means to the following dataset to form 2 clusters:

X     Y
185   72
170   56
168   60
179   68
182   72
188   77

Step 1: Assume initial centroids C1 = (185, 72), C2 = (170, 56)
Step 2: Calculate the Euclidean distance to each centroid:
d[(x, y), (a, b)] = sqrt((x − a)² + (y − b)²)
For t1 = (168, 60):
d[(185, 72), (168, 60)] = sqrt((185 − 168)² + (72 − 60)²) = 20.808
d[(170, 56), (168, 60)] = sqrt((170 − 168)² + (56 − 60)²) = 4.472
Since d(C2, t1) < d(C1, t1), assign t1 to C2
Step 3: For t2 = (179, 68):
d[(185, 72), (179, 68)] = sqrt((185 − 179)² + (72 − 68)²) = 7.211
d[(170, 56), (179, 68)] = sqrt((170 − 179)² + (56 − 68)²) = 15
Since d(C1, t2) < d(C2, t2), assign t2 to C1
Step 4: For t3 = (182, 72):
d[(185, 72), (182, 72)] = sqrt((185 − 182)² + (72 − 72)²) = 3
d[(170, 56), (182, 72)] = sqrt((170 − 182)² + (56 − 72)²) = 20
Since d(C1, t3) < d(C2, t3), assign t3 to C1

K-means 2D example (continued)
◼ Same dataset as above; continue assigning the remaining point:

Step 5: For t4 = (188, 77):
d[(185, 72), (188, 77)] = sqrt((185 − 188)² + (72 − 77)²) = 5.83
d[(170, 56), (188, 77)] = sqrt((170 − 188)² + (56 − 77)²) = 27.66
Since d(C1, t4) < d(C2, t4), assign t4 to C1

Step 6: Clusters after 1 iteration:
D1 = {(185, 72), (179, 68), (182, 72), (188, 77)}
D2 = {(170, 56), (168, 60)}

Step 7: New cluster centroids: C1 = (183.5, 72.25), C2 = (169, 58)

Repeat the above steps for all samples until convergence.

Final clusters:
D1 = {(185, 72), (179, 68), (182, 72), (188, 77)}
D2 = {(170, 56), (168, 60)}
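The 2-D example can be reproduced the same way. This is a sketch assuming NumPy and scikit-learn are available, started from the same initial centroids C1 = (185, 72) and C2 = (170, 56):

# Hypothetical reproduction of the 2-D worked example (scikit-learn assumed installed)
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[185, 72], [170, 56], [168, 60],
              [179, 68], [182, 72], [188, 77]], dtype=float)

init = np.array([[185.0, 72.0], [170.0, 56.0]])   # same starting centroids as above
km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)           # expected split: D1 = rows 0, 3, 4, 5 and D2 = rows 1, 2
print(km.cluster_centers_)  # expected: about (183.5, 72.25) and (169, 58)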


What Is the Problem of the K-Means Method?

◼ The k-means algorithm is sensitive to outliers!
◼ An object with an extremely large value may substantially distort the distribution of the data.

◼ K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster.

The K-Medoids Clustering Method
◼ Find representative objects, called medoids, in clusters
◼ PAM (Partitioning Around Medoids, 1987)
◼ starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the
total distance of the resulting clustering
◼ PAM works effectively for small data sets, but does not scale
well for large data sets
◼ CLARA (Kaufmann & Rousseeuw, 1990) – Clustering LARge
Applications
◼ CLARANS (Ng & Han, 1994): Clustering Large Applications based
upon RANdomized Search


PAM (Partitioning Around Medoids) (1987)

◼ PAM (Kaufman and Rousseeuw, 1987), built into the S-PLUS statistical package


◼ Use real object to represent the cluster
1. Select k representative objects arbitrarily
2. For each pair of non-selected object h and selected
object i, calculate the total swapping cost TCih
3. For each pair of i and h,
◼ If TCih < 0, i is replaced by h
◼ Then assign each non-selected object to the most
similar representative object
4. repeat steps 2-3 until there is no change
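The swap search in steps 2–4 can be written compactly. Below is a minimal k-medoids sketch in the spirit of PAM (the names pam and total_cost, the precomputed pairwise distance matrix, and the greedy acceptance of any cost-reducing swap are assumptions of this sketch, not the exact textbook procedure):

# Minimal PAM-style k-medoids sketch (illustrative)
import numpy as np

def total_cost(dist, medoids):
    """Sum of distances from every object to its nearest medoid."""
    return dist[:, medoids].min(axis=1).sum()

def pam(dist, k, seed=0):
    """dist: (n, n) pairwise distance matrix; returns medoid indices and labels."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = list(rng.choice(n, size=k, replace=False))    # step 1: arbitrary medoids
    best = total_cost(dist, medoids)
    improved = True
    while improved:                                          # step 4: repeat until no change
        improved = False
        for i in range(k):                                   # steps 2-3: try every (medoid, non-medoid) swap
            for h in range(n):
                if h in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = h
                cost = total_cost(dist, candidate)
                if cost < best:                              # swap only if the total cost decreases (TCih < 0)
                    medoids, best, improved = candidate, cost, True
    labels = dist[:, medoids].argmin(axis=1)                 # assign each object to its nearest medoid
    return medoids, labels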
A Typical K-Medoids Algorithm (PAM)

[Figure, K = 2: arbitrarily choose k objects as initial medoids and assign each remaining object to the nearest medoid (total cost = 20); randomly select a nonmedoid object Orandom and compute the total cost of swapping (total cost = 26); loop, swapping a medoid with Orandom whenever the swap improves quality, until no change.]


What Is the Problem with PAM?

◼ PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean
◼ PAM works efficiently for small data sets but does not scale well to large data sets:
◼ O(k(n − k)²) for each iteration, where n is the number of data objects and k is the number of clusters
➔ Sampling-based method: CLARA (Clustering LARge Applications)

Chapter 5. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods


Hierarchical Clustering

◼ Use a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition

[Figure: agglomerative clustering (AGNES) runs from Step 0 to Step 4, merging {a}, {b}, {c}, {d}, {e} into {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e}; divisive clustering (DIANA) performs the same steps in reverse order.]
AGNES (Agglomerative Nesting)

◼ Introduced in Kaufmann and Rousseeuw (1990)


◼ Implemented in statistical analysis packages, e.g., Splus
◼ Use the Single-Link method and the dissimilarity matrix.
◼ Merge nodes that have the least dissimilarity
◼ Go on in a non-descending fashion
◼ Eventually all nodes belong to the same cluster


Dendrogram: Shows How the Clusters are Merged

◼ Decompose data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram.

◼ A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

Hierarchical Clustering: Single Link Example

Distance matrix:
    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0

Step 1: At level 0, 5 clusters: {A}, {B}, {C}, {D}, {E}
Step 2: At level 1, min_dist = 1. Find the distance between each pair; if min_dist(ti, tj) <= 1, merge the clusters:
{A, B}, {C, D}, {E}
Step 3: At level 2, min_dist = 2. Find the distance between the clusters formed in Step 2:
A–C = 2, B–C = 2, A–D = 2, B–D = 4, hence min_dist({A, B}, {C, D}) = 2
A–E = 3, B–E = 3, so min_dist({A, B}, {E}) = 3
C–E = 5, D–E = 3, so min_dist({C, D}, {E}) = 3
Since the threshold is 2, we merge {A, B} and {C, D}: {A, B, C, D}, {E}


Hierarchical Clustering: Single Link Example (continued)

(Same distance matrix as above.)
Step 4: At level 3, min_dist = 3. Find the distance between the clusters formed in Step 3:
A–E = 3, B–E = 3, C–E = 5, D–E = 3, so min_dist({A, B, C, D}, {E}) = 3
Since the threshold is 3, we merge both clusters: {A, B, C, D, E}
(A library cross-check of this merge sequence follows below.)
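To sanity-check the merge order above, the same distance matrix can be fed to SciPy's hierarchical clustering (an assumed dependency). method="single" reproduces single link; swapping in "complete" or "average" covers the next examples:

# Hypothetical cross-check of the single-link example with SciPy (assumed installed)
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)   # objects A, B, C, D, E

Z = linkage(squareform(D), method="single")    # try "complete" or "average" as well
print(Z)                                       # each row: the two clusters merged and the merge distance

print(fcluster(Z, t=2, criterion="distance"))  # cut at distance 2: expect {A, B, C, D} and {E}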

Hierarchical Clustering: Complete Link Example

(Same distance matrix as the single-link example.)
Step 1: At level 0, 5 clusters: {A}, {B}, {C}, {D}, {E}
Step 2: At level 1, max_dist = 1. Find the distance between each pair; if max_dist(ti, tj) <= 1, merge the clusters:
{A, B}, {C, D}, {E}
Step 3: At level 2, max_dist = 2. Find the distance between the clusters formed in Step 2:
A–C = 2, B–C = 2, A–D = 2, B–D = 4, hence max_dist({A, B}, {C, D}) = 4
A–E = 3, B–E = 3, so max_dist({A, B}, {E}) = 3
C–E = 5, D–E = 3, so max_dist({C, D}, {E}) = 5
Since the threshold is 2, no merge at this level


Hierarchical Clustering: Complete Link Example (continued)

Step 4: At level 3, max_dist = 3. Find the distance between the clusters formed in Step 2:
max_dist({A, B}, {C, D}) = 4, max_dist({A, B}, {E}) = 3, max_dist({C, D}, {E}) = 5
Since the threshold is 3 and max_dist({A, B}, {E}) = 3, we merge them: {A, B, E}, {C, D}
Step 5: At level 4, max_dist = 4. Find the distance between the clusters formed in Step 4:
A–C = 2, B–C = 2, A–D = 2, B–D = 4, C–E = 5, D–E = 3, so max_dist({C, D}, {A, B, E}) = 5
Since the threshold is 4, no merge
Step 6: At level 5, merge both clusters: {A, B, C, D, E}

Hierarchical Clustering: Average Link Example

Distance matrix (as in the previous examples):
    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0

DIANA (Divisive Analysis)

◼ Introduced in Kaufmann and Rousseeuw (1990)


◼ Implemented in statistical analysis packages, e.g., Splus
◼ Inverse order of AGNES
◼ Eventually each node forms a cluster on its own


Hierarchical Clustering: Divisive Example

Distance matrix (as in the previous examples):
    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0

Recent Hierarchical Clustering Methods
◼ Major weakness of agglomerative clustering methods
◼ do not scale well: time complexity of at least O(n²), where n is the total number of objects
◼ can never undo what was done previously

◼ Integration of hierarchical with distance-based clustering


◼ BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-
clusters
◼ ROCK (1999): clustering categorical data by neighbor and link analysis
◼ CHAMELEON (1999): hierarchical clustering using dynamic modeling


Evaluation of Clustering

◼ Major clustering evaluation tasks


◼ Assessing Clustering Tendency
◼ Determining Number of Clusters
◼ Measuring Cluster Quality
◼ Extrinsic methods
◼ Intrinsic methods

Evaluation of Clustering – assessing clustering tendency

◼ Assessing Clustering Tendency
◼ A non-random data structure indicates meaningful clusters; a random structure means a uniform distribution
◼ Determine whether the data set has any non-random structure
◼ Use statistical tests for spatial randomness to measure the probability that the data set is generated by a uniform data distribution
◼ Hopkins statistic test


Evaluation of Clustering – assessing clustering tendency

Hopkins statistic calculation

Null hypothesis: D is uniformly distributed and contains no meaningful clusters
Alternative hypothesis: D is not uniformly distributed and contains meaningful clusters

To test the hypothesis, compute the Hopkins statistic iteratively with threshold 0.5:
If H >= 0.5, then D is uniformly distributed;
if H < 0.5, then D is not uniformly distributed and has statistically significant clusters.
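The calculation itself did not survive extraction, so the following is only a sketch of one common formulation of the Hopkins statistic (an assumption, chosen so that its convention matches the threshold interpretation above): compare nearest-neighbor distances of points sampled uniformly over the data space with nearest-neighbor distances of points sampled from D itself.

# Hedged sketch of the Hopkins statistic; the exact formulation used in the slides is assumed
import numpy as np

def hopkins(D, m=None, seed=0):
    """H close to 0.5 suggests uniform data; H well below 0.5 suggests meaningful clusters."""
    rng = np.random.default_rng(seed)
    n, d = D.shape
    m = m or max(1, n // 10)                     # sample size, commonly a small fraction of n

    # x_i: nearest-neighbor distance in D from points drawn uniformly over D's bounding box
    uniform_pts = rng.uniform(D.min(axis=0), D.max(axis=0), size=(m, d))
    x = np.array([np.linalg.norm(D - p, axis=1).min() for p in uniform_pts])

    # y_i: nearest-neighbor distance in D from points sampled from D itself (excluding the point)
    idx = rng.choice(n, size=m, replace=False)
    y = np.array([np.sort(np.linalg.norm(D - D[i], axis=1))[1] for i in idx])

    return y.sum() / (x.sum() + y.sum())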

Evaluation of Clustering – determining number of clusters

◼ Determining the number of clusters


◼ Not very easy – right number is always ambiguous
◼ Depends on
◼ distribution’s shape of data
◼ Scale of data
◼ Required number of clusters by the user
◼ Simple method
◼ Set the number of clusters to about √(n/2) for a data set of n points; each cluster then has about √(2n) points on average
◼ Elbow method (sketched below)
◼ Increasing the number of clusters reduces the sum of within-cluster variance, since it creates finer groups whose members are more similar
◼ However, beyond some point adding clusters yields only a marginal reduction in variance; the "elbow" where the curve flattens suggests the right number of clusters
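A minimal sketch of the elbow heuristic, assuming scikit-learn and matplotlib are available (the toy data set is illustrative): plot the sum of within-cluster variance (inertia) against k and pick the k where the curve stops dropping sharply.

# Illustrative elbow-method sketch (scikit-learn and matplotlib assumed installed)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2)) + rng.choice([0.0, 8.0], size=(300, 1))   # toy data with two loose groups

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("sum of within-cluster variance (inertia)")
plt.show()   # choose k at the 'elbow' where the curve flattens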


Evaluation of Clustering – measuring clustering quality

◼ How good is the clustering generated by a method, and how can we compare clusterings generated by different methods?
◼ Ground truth: the ideal clustering, often built using human experts
◼ Two methods depending on the availability of ground truth

◼ Extrinsic method
◼ Supervised
◼ ground truth is available i.e. cluster labels are available
◼ compare the clustering against the ground truth and measure.

◼ Intrinsic method
◼ Unsupervised
◼ ground truth is not available
◼ evaluate the goodness of a clustering by considering how well the clusters are separated.

Evaluation of Clustering – measuring clustering quality
◼ Extrinsic method
◼ Cluster homogeneity
◼ Checks for cluster purity – the purer the clusters, the better the clustering
◼ E.g.:
◼ Ground truth: L1, …, Ln are the categories of the data for a dataset D
◼ Clustering method C1 places data points from two different categories Li and Lj in one cluster
◼ Clustering method C2 places data points from the two different categories Li and Lj in different clusters
◼ Here C2 creates purer clusters than C1
◼ So the cluster homogeneity score of C2 > C1
◼ Cluster completeness
◼ Counterpart of cluster homogeneity
◼ Cluster completeness requires that for a clustering, if any two objects belong to the same category according to the ground truth, they should be assigned to the same cluster
◼ Clustering method C1 splits data points belonging to the same category across two clusters
◼ Clustering method C2 merges the two clusters into one
◼ So the cluster completeness score of C2 > C1 (a code check of both criteria follows below)
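Assuming scikit-learn is available, homogeneity and completeness can be computed directly from ground-truth categories and cluster labels. The toy labelings below mirror the C1 vs. C2 scenarios above and are illustrative only:

# Illustrative homogeneity/completeness check (scikit-learn assumed installed)
from sklearn.metrics import homogeneity_score, completeness_score

truth = [0, 0, 0, 1, 1, 1]           # ground-truth categories Li, Lj

c1 = [0, 0, 0, 0, 0, 0]              # C1: mixes both categories in one cluster
c2 = [0, 0, 0, 1, 1, 1]              # C2: keeps the categories in separate clusters
print(homogeneity_score(truth, c1), homogeneity_score(truth, c2))    # C2 > C1

d1 = [0, 0, 1, 2, 2, 2]              # C1: one category split across two clusters
d2 = [0, 0, 0, 2, 2, 2]              # C2: the split clusters merged into one
print(completeness_score(truth, d1), completeness_score(truth, d2))  # C2 > C1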


Evaluation of Clustering – measuring clustering quality

◼ Extrinsic method
◼ Rag bags
◼ A rag bag is a cluster of miscellaneous objects that cannot be merged with other categories
◼ The rag bag criterion states that putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag
◼ E.g.:
◼ Clustering method C1 places all objects belonging to the same category in one cluster, except for one object o that belongs to a different label
◼ Clustering method C2 places object o in a different cluster C' that contains miscellaneous objects from different categories, i.e., C' is a noisy cluster, a rag bag
◼ So the score of C2 > C1 as per the rag bag criterion

Evaluation of Clustering – measuring clustering quality

◼ Extrinsic method
◼ Small cluster preservation
◼ If a small category is split into small pieces in a clustering, those small pieces
may likely become noise and thus the small category cannot be discovered
from the clustering.
◼ The small cluster preservation criterion states that splitting a small category
into pieces is more harmful than splitting a large category into pieces.


Evaluation of Clustering

Metrics that satisfy all four criteria of the extrinsic methods:

BCubed precision: how many other objects in the same cluster belong to the same category as the object

BCubed recall: how many objects of the same category are assigned to the same cluster
(a sketch follows below)
Evaluation of Clustering – Intrinsic method

◼ Intrinsic method
◼ Unsupervised – ground truth is not available
◼ evaluate a clustering by examining how well the clusters are separated and how compact the
clusters are
◼ Silhouette coefficient
◼ For a data set, D, of n objects, suppose D is partitioned into k clusters, C1, …. ,Ck.
◼ For each object o ∈ D,
◼ a(o) is the average distance between o and all other objects in the cluster to which o belongs, i.e., it reflects the compactness of the cluster containing o.
◼ Similarly, b(o) is the minimum average distance from o to all clusters to which o does not belong, i.e., it captures the degree to which o is separated from other clusters.
◼ The silhouette coefficient of o is s(o) = (b(o) − a(o)) / max{a(o), b(o)}.
◼ The smaller the value of a(o), the more compact the cluster, and the larger b(o) is, the more separated o is from other clusters. Therefore, when the silhouette coefficient value of o approaches 1, the cluster containing o is compact and o is far away from other clusters, which is the preferable case.
◼ However, when the silhouette coefficient value is negative (i.e., b(o) < a(o)), this means
that, in expectation, o is closer to the objects in another cluster than to the objects in the
same cluster as o. In many cases, this is a bad situation and should be avoided.
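Assuming scikit-learn is available, the average silhouette coefficient of a clustering can be computed in a few lines; it is also a common companion to the elbow method when choosing k (the toy data below is illustrative):

# Illustrative silhouette evaluation (scikit-learn assumed installed)
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),     # two well-separated blobs
               rng.normal(8, 1, size=(50, 2))])

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    print(k, silhouette_score(X, labels))           # closer to 1: compact, well-separated clusters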
