Clustering
Clustering Houses
[Figure: example of clustering houses, e.g., a size-based grouping]
Examples of Clustering Applications
◼ Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
◼ Land use: Identification of areas of similar land use in an earth
observation database
◼ Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost
◼ City-planning: Identifying groups of houses according to their house
type, value, and geographical location
◼ Earthquake studies: observed earthquake epicenters should be clustered along continental faults
Major Clustering Approaches (I)
◼ Partitioning approach:
◼ Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
◼ Typical methods: k-means, k-medoids, CLARANS
◼ Hierarchical approach:
◼ Create a hierarchical decomposition of the set of data (or objects) using
some criterion
◼ Typical methods: Diana, Agnes, BIRCH, ROCK, CAMELEON
◼ Density-based approach:
◼ Based on connectivity and density functions
◼ Typical methods: DBSCAN, OPTICS, DenClue
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
◼ Centroid: the “middle” of a cluster
  C_m = ( Σ_{i=1..N} t_ip ) / N
◼ Radius: square root of the average squared distance from any point of the cluster to its centroid
  R_m = sqrt( Σ_{i=1..N} (t_ip − C_m)² / N )
◼ Diameter: square root of the average squared distance between all pairs of points in the cluster
  D_m = sqrt( Σ_{i=1..N} Σ_{j=1..N} (t_ip − t_jq)² / (N(N−1)) )
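As a quick illustration, here is a minimal NumPy sketch of these three quantities; the sample cluster points are made up for the example:

import numpy as np

# A small, made-up cluster of 2-D points
cluster = np.array([[2.0, 3.0], [3.0, 3.0], [2.5, 4.0], [3.5, 4.5]])
N = len(cluster)

# Centroid: component-wise mean of the member points
centroid = cluster.mean(axis=0)

# Radius: sqrt of the average squared distance from members to the centroid
radius = np.sqrt(((cluster - centroid) ** 2).sum(axis=1).mean())

# Diameter: sqrt of the average squared distance over all ordered pairs of
# distinct points, i.e. divide by N * (N - 1); diagonal terms are zero anyway
diff = cluster[:, None, :] - cluster[None, :, :]
sq_dists = (diff ** 2).sum(axis=-1)
diameter = np.sqrt(sq_dists.sum() / (N * (N - 1)))

print(centroid, radius, diameter)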
The K-Means Clustering Method
◼ Example
[Figure: K-means with K = 2 — arbitrarily choose K objects as the initial cluster means; assign each object to the most similar center; update the cluster means; reassign and repeat until no change]
K-Means example
◼ Cluster the following items into 2 clusters: {2, 4, 10, 12, 3, 20, 30, 11, 25}
◼ Distance: d(C_i, t_j) = (C_i − t_j)²
◼ Assign t_j to the cluster K_i with the minimum d(C_i, t_j)

First iteration (initial means M1 = 2, M2 = 4):

M1   M2   K1       K2
2    4    {2, 3}   {4, 10, 12, 20, 30, 11, 25}
Stopping Criteria:
• No new assignment
• No change in cluster means
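Below is a minimal plain-Python sketch of this 1-D run, using the same items and the same initial means M1 = 2 and M2 = 4; the variable names are my own:

# Minimal 1-D k-means run for the example above (K = 2).
items = [2, 4, 10, 12, 3, 20, 30, 11, 25]
means = [2.0, 4.0]                      # initial means M1, M2

while True:
    # Assignment step: each item goes to the nearest current mean
    clusters = [[], []]
    for x in items:
        k = min(range(len(means)), key=lambda i: (means[i] - x) ** 2)
        clusters[k].append(x)

    # Update step: recompute each mean from its members
    new_means = [sum(c) / len(c) for c in clusters]

    # Stop when the means (and hence the assignments) no longer change
    if new_means == means:
        break
    means = new_means

print(clusters, means)   # expected clusters: {2, 3, 4, 10, 11, 12} and {20, 25, 30}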
K-means 2D example
◼ Apply k-means to the following dataset to form 2 clusters:

X    Y
185  72
170  56
168  60
179  68
182  72
188  77

Step 1: Assume initial centroids C1 = (185, 72), C2 = (170, 56)
Step 2: Calculate the Euclidean distance of each point to each centroid:
d[(x, y), (a, b)] = √((x − a)² + (y − b)²)

For t1 = (168, 60):
d[(185, 72), (168, 60)] = √((185 − 168)² + (72 − 60)²) = 20.808
d[(170, 56), (168, 60)] = √((170 − 168)² + (56 − 60)²) = 4.472
Since d(C2, t1) < d(C1, t1), assign t1 to C2.

Step 3: For t2 = (179, 68):
d[(185, 72), (179, 68)] = √((185 − 179)² + (72 − 68)²) = 7.211
d[(170, 56), (179, 68)] = √((170 − 179)² + (56 − 68)²) = 15
Since d(C1, t2) < d(C2, t2), assign t2 to C1.

Step 4: For t3 = (182, 72):
d[(185, 72), (182, 72)] = √((185 − 182)² + (72 − 72)²) = 3
d[(170, 56), (182, 72)] = √((170 − 182)² + (56 − 72)²) = 20
Since d(C1, t3) < d(C2, t3), assign t3 to C1.
K-means 2D example (continued)
Step 5: For t4 = (188, 77):
d[(185, 72), (188, 77)] = √((185 − 188)² + (72 − 77)²) = 5.83
d[(170, 56), (188, 77)] = √((170 − 188)² + (56 − 77)²) = 27.66
Since d(C1, t4) < d(C2, t4), assign t4 to C1.

Final clusters:
C1 = {(185, 72), (179, 68), (182, 72), (188, 77)}
C2 = {(170, 56), (168, 60)}
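A small NumPy check of the assignment steps above (same six points and the same initial centroids; the ordering of `points` is mine):

import numpy as np

# Dataset and the assumed initial centroids from the worked example
points = np.array([[185, 72], [170, 56], [168, 60],
                   [179, 68], [182, 72], [188, 77]], dtype=float)
centroids = np.array([[185, 72], [170, 56]], dtype=float)   # C1, C2

# Euclidean distance of every point to every centroid, then nearest centroid
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
assign = dists.argmin(axis=1)            # 0 -> C1, 1 -> C2

for p, d, a in zip(points, dists, assign):
    print(p, d.round(2), "-> C%d" % (a + 1))
# t1 = (168, 60) goes to C2; t2, t3, t4 go to C1, matching the steps above.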
The K-Medoids Clustering Method
◼ Find representative objects, called medoids, in clusters
◼ PAM (Partitioning Around Medoids, 1987)
◼ starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the
total distance of the resulting clustering
◼ PAM works effectively for small data sets, but does not scale
well for large data sets
◼ CLARA (Kaufmann & Rousseeuw, 1990) – Clustering LARge
Applications
◼ CLARANS (Ng & Han, 1994): Clustering Large Applications based
upon RANdomized Search
A Typical K-Medoids Algorithm (PAM)
[Figure: PAM iteration (total cost = 20) — arbitrarily choose k objects as initial medoids; assign each remaining object to the nearest medoid; then repeatedly compute the total cost of swapping a medoid with a non-medoid object and perform the swap if it improves the quality, until no change]
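The following is a compact, brute-force sketch of the swap-based PAM idea illustrated above; the sample points and helper names are illustrative, not from the slides:

import numpy as np

def total_cost(points, medoid_idx):
    """Sum of distances from every point to its nearest medoid."""
    d = np.linalg.norm(points[:, None, :] - points[medoid_idx][None, :, :], axis=-1)
    return d.min(axis=1).sum()

def pam(points, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(points), size=k, replace=False))  # arbitrary initial medoids
    best = total_cost(points, medoids)
    improved = True
    while improved:                        # loop until no swap improves the total cost
        improved = False
        for m in list(medoids):
            for o in range(len(points)):
                if o in medoids:
                    continue
                trial = [o if x == m else x for x in medoids]   # swap medoid m with non-medoid o
                cost = total_cost(points, trial)
                if cost < best:            # keep the swap only if quality improves
                    best, medoids, improved = cost, trial, True
    return medoids, best                   # medoid indices and total cost

pts = np.array([[1, 1], [1.5, 2], [2, 1], [8, 8], [8.5, 9], [9, 8]], dtype=float)
print(pam(pts, k=2))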
Chapter 5. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
Hierarchical Clustering
AGNES (Agglomerative Nesting)
[Figure: AGNES merges clusters step by step, starting from individual objects and joining them into larger and larger clusters]
Hierarchical Clustering Single Link
example
Distance matrix (the same matrix is used in the complete-link and average-link examples below):

    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0

Step 1: At level 0 there are 5 clusters: {A}, {B}, {C}, {D}, {E}
When merging, single link uses the minimum pairwise distance between clusters, e.g., d({A, B}, {D}) = min(A→D, B→D) = min(2, 4) = 2
Hierarchical Clustering Complete Link
example
Step 1: At level 0 there are 5 clusters: {A}, {B}, {C}, {D}, {E} (same distance matrix as above)
When merging, complete link uses the maximum pairwise distance between clusters, e.g., d({A, B}, {D}) = max(A→D, B→D) = max(2, 4) = 4
Hierarchical Clustering Average Link
example
    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0

When merging, average link uses the average of all pairwise distances between the two clusters.
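To compare how the three linkage criteria behave on this matrix, a short SciPy sketch (it assumes the matrix above, fed to `scipy.cluster.hierarchy.linkage` in condensed form):

import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram

labels = ["A", "B", "C", "D", "E"]
D = np.array([
    [0, 1, 2, 2, 3],
    [1, 0, 2, 4, 3],
    [2, 2, 0, 1, 5],
    [2, 4, 1, 0, 3],
    [3, 3, 5, 3, 0],
], dtype=float)

condensed = squareform(D)                  # condensed form expected by linkage()
for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)  # each row: clusters merged, distance, new size
    print(method, "\n", Z)
# dendrogram(Z, labels=labels) can then be used to plot the resulting hierarchy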
DIANA (Divisive Analysis)
[Figure: DIANA works in the reverse direction of AGNES — starting from one all-inclusive cluster and splitting it step by step until each object forms its own cluster]
Recent Hierarchical Clustering Methods
◼ Major weakness of agglomerative clustering methods
◼ do not scale well: time complexity of at least O(n²), where n is the total number of objects
◼ can never undo what was done previously
Evaluation of Clustering
Evaluation of Clustering – assessing clustering tendency
◼ Use statistical tests for spatial randomness to measure the probability that the data set was generated by a uniform data distribution.
◼ Hopkins statistic (a sketch follows the hypotheses below)
◼ Null hypothesis: D is uniformly distributed and contains no meaningful clusters
◼ Alternative hypothesis: D is not uniformly distributed and contains meaningful clusters
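A rough sketch of one common form of the Hopkins statistic; the sample size, the uniform sampling box, and the convention that values near 1 indicate clustering tendency are assumptions of this particular formulation:

import numpy as np

def hopkins(X, m=None, seed=0):
    """Hopkins statistic: ~0.5 for uniformly distributed data; values close
    to 1 suggest a clustering tendency (this convention puts the
    uniform-sample distances in the numerator)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)                   # small sample, e.g. 10% of the data

    def nn_dist(points, data, exclude_self=False):
        dist = np.linalg.norm(points[:, None, :] - data[None, :, :], axis=-1)
        if exclude_self:
            dist[dist == 0] = np.inf           # crude way to skip the point itself
        return dist.min(axis=1)

    # u: nearest-neighbour distances of m uniform random points to the data
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn_dist(uniform, X)
    # w: nearest-neighbour distances of m sampled real points to the rest of the data
    sample = X[rng.choice(n, size=m, replace=False)]
    w = nn_dist(sample, X, exclude_self=True)

    return u.sum() / (u.sum() + w.sum())

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 6])  # two blobs
print(hopkins(X))   # should be noticeably above 0.5 for clustered data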
Evaluation of Clustering – determining number of clusters
◼ Elbow method
◼ Increasing the number of clusters reduces the sum of within-cluster variance, since it creates finer groups, i.e., groups whose members are more similar to each other.
◼ However, beyond some point adding more clusters gives only a marginal reduction in variance; the "elbow" in the plot of within-cluster variance versus the number of clusters suggests the right number of clusters (see the sketch below).
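A minimal scikit-learn sketch of the elbow heuristic; the blob data is synthetic, and `inertia_` is scikit-learn's name for the within-cluster sum of squares:

import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with three obvious groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

# Within-cluster sum of squares for k = 1..8; look for the "elbow" where
# additional clusters stop giving a large reduction
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))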
◼ How good is the clustering generated by a method, and how can we compare the clusterings generated by different methods?
◼ Ground truth - the ideal clustering that is often built using human experts
◼ Two methods depending on the availability of ground truth
◼ Extrinsic method
◼ Supervised
◼ ground truth is available, i.e., the true category labels are known
◼ compare the clustering against the ground truth and measure the agreement
◼ Intrinsic method
◼ Unsupervised
◼ ground truth is not available
◼ evaluate the goodness of a clustering by considering how well the clusters are separated.
Evaluation of Clustering – measuring clustering quality
◼ Extrinsic method
◼ Cluster homogeneity
◼ Checks for cluster purity – the purer the clusters, the better the clustering
◼ E.g.:
◼ ground truth – L1, …, Ln are the categories of the data in a dataset D
◼ Clustering method C1 places data points from two different categories Li and Lj in one cluster
◼ Clustering method C2 places data points from two different categories Li and Lj in different clusters
◼ Here C2 creates purer clusters than C1, so C2 scores higher on cluster homogeneity
◼ Cluster completeness
◼ Counterpart of cluster homogeneity
◼ Cluster completeness requires that, if any two objects belong to the same category according to the ground truth, they should be assigned to the same cluster.
◼ E.g.: if clustering method C1 places all objects of the same category in one cluster, while C2 splits them across two clusters, then C1 is better with respect to cluster completeness (see the sketch below)
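scikit-learn provides scores along these lines; a tiny illustration with made-up ground-truth categories and two hypothetical clusterings C1 and C2:

from sklearn.metrics import homogeneity_score, completeness_score

truth = [0, 0, 0, 1, 1, 1]          # ground-truth categories Li
c1    = [0, 0, 0, 0, 1, 1]          # mixes two categories in cluster 0
c2    = [0, 0, 0, 1, 1, 1]          # keeps each category in its own cluster

for name, labels in (("C1", c1), ("C2", c2)):
    print(name,
          "homogeneity=", round(homogeneity_score(truth, labels), 2),
          "completeness=", round(completeness_score(truth, labels), 2))
# C2 scores 1.0 on both; C1 is penalized for the mixed cluster.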
◼ Extrinsic method
◼ Rag bags
◼ A rag bag is a cluster of miscellaneous objects that cannot be merged with the other categories
◼ The rag bag criterion states that putting a heterogeneous object into a pure cluster
should be penalized more than putting it into a rag bag.
◼ E.g.:
◼ Clustering method C1 places all objects of one category in one cluster, together with one extra object o that belongs to a different category
◼ Clustering method C2 instead places object o in a separate cluster C' that contains miscellaneous objects from different categories, i.e., C' is a noisy cluster, a rag bag
◼ So, according to the rag bag criterion, the score of C2 > C1
Evaluation of Clustering – measuring clustering quality
◼ Extrinsic method
◼ Small cluster preservation
◼ If a small category is split into small pieces in a clustering, those pieces are likely to become noise, and the small category can then no longer be discovered from the clustering.
◼ The small cluster preservation criterion states that splitting a small category
into pieces is more harmful than splitting a large category into pieces.
Evaluation of Clustering – BCubed precision and recall
BCubed precision: how many other objects in the same cluster belong to the same category as the object
BCubed recall: how many objects of the same category are assigned to the same cluster as the object
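A short sketch of per-object BCubed precision and recall averaged over all objects; the helper and the example labels are my own:

def bcubed(categories, clusters):
    """Average BCubed precision and recall over all objects."""
    n = len(categories)
    precision = recall = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if clusters[j] == clusters[i]]
        same_category = [j for j in range(n) if categories[j] == categories[i]]
        correct = [j for j in same_cluster if categories[j] == categories[i]]
        precision += len(correct) / len(same_cluster)   # same cluster, same category
        recall += len(correct) / len(same_category)     # same category, same cluster
    return precision / n, recall / n

# Ground-truth categories and a clustering of six objects (made-up labels)
cats = ["x", "x", "x", "y", "y", "y"]
clus = [ 0,   0,   1,   1,   1,   1 ]
print(bcubed(cats, clus))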
Evaluation of Clustering – Intrinsic method
◼ Intrinsic method
◼ Unsupervised – ground truth is not available
◼ evaluate a clustering by examining how well the clusters are separated and how compact the
clusters are
◼ Silhouette coefficient
◼ For a data set, D, of n objects, suppose D is partitioned into k clusters, C1, …. ,Ck.
◼ For each object o ϵ D,
◼ a(o) is the average distance between o and all other objects in the cluster to which o belongs; it reflects the compactness of that cluster.
◼ b(o) is the minimum average distance from o to all clusters to which o does not belong; it captures the degree to which o is separated from other clusters.
◼ The silhouette coefficient of o is s(o) = (b(o) − a(o)) / max{a(o), b(o)} (computed in the sketch below).
◼ The smaller the value of a(o), the more compact the cluster, and the larger b(o) is, the more separated o is from other clusters. Therefore, when the silhouette coefficient of o approaches 1, the cluster containing o is compact and o is far away from other clusters, which is the preferable case.
◼ However, when the silhouette coefficient is negative (i.e., b(o) < a(o)), o is, in expectation, closer to the objects of another cluster than to the objects of its own cluster. In many cases this is a bad situation and should be avoided.
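A small sketch that computes a(o), b(o), and s(o) directly from raw points and labels; the example points are made up (scikit-learn's silhouette_samples / silhouette_score offer the same functionality):

import numpy as np

def silhouette(X, labels):
    """Per-object silhouette coefficients from raw points and cluster labels."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i in range(len(X)):
        same = (labels == labels[i])
        same[i] = False                                   # exclude o itself from a(o)
        a = dist[i, same].mean() if same.any() else 0.0
        # b(o): minimum average distance to any cluster o does not belong to
        b = min(dist[i, labels == c].mean()
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return np.array(scores)

X = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
labels = [0, 0, 0, 1, 1, 1]
print(silhouette(X, labels).round(2), silhouette(X, labels).mean().round(2))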