Presentation on
Clustering Algorithms
23 May 2024
Presented to:
Dr. Tushar Kanti Saha
Professor
Dept. of Computer Science and Engineering
Jatiya Kabi Kazi Nazrul Islam University
Presented by:
A.K.M Mahfuzur Rahman
Roll: 231020104
Reg: 7724
Session: MS:2022-2023
Dept. of Computer Science and Engineering
Jatiya Kabi Kazi Nazrul Islam University
Overview
BIRCH
CURE
CHAMELEON
1. BIRCH
Agendas:
• What is BIRCH?
• Data Clustering
• BIRCH Goal
• Examples 1, 2, 3, 4
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
What is BIRCH?
BIRCH is based on the notion of a Clustering Feature (CF) tree.
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering.
Data Clustering
It is the partitioning of a database into clusters: closely packed groups, i.e., collections of data objects that are similar to one another and treated collectively as a group.
BIRCH Goals
• Clustering decisions are made without scanning the whole data.
• Treat dense areas as one cluster and reduce noise.
• Minimize running time and data scans.
• Exploit the non-uniformity of the data.
In the tree, a cluster of data points is summarized by a CF, represented by three numbers (N, LS, SS), where
• N = number of items in the sub-cluster,
• LS = vector sum of the data points,
• SS = sum of the squared data points.
Algorithm
Phase 1: Scan all data and build an initial in-memory CF tree, using the given amount of memory and recycling space on disk.
Phase 2: Condense the tree to a desirable size by building a smaller CF tree.
Phase 3: Global clustering.
Phase 4: Cluster refining. This phase is optional and requires more passes over the data to refine the results.
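For orientation, the sketch below runs these four phases end to end using scikit-learn's Birch estimator on the small 2-D dataset of Example-4 further below. The parameter values are illustrative choices, not prescribed by these slides, and scikit-learn applies the threshold to a scalar sub-cluster radius, so its tree can differ slightly from the hand-worked example.

```python
# Minimal sketch using scikit-learn's Birch (assumes scikit-learn is installed).
# threshold        ~ limit on the size of sub-clusters stored in the leaf nodes
# branching_factor ~ maximum number of CF entries per node
# n_clusters       ~ number of clusters produced by the global clustering phase
import numpy as np
from sklearn.cluster import Birch

X = np.array([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8),
              (6, 2), (7, 2), (7, 4), (8, 4), (7, 9)], dtype=float)

model = Birch(threshold=1.5, branching_factor=2, n_clusters=3)
labels = model.fit_predict(X)
print(labels)   # cluster label assigned to each of the ten points
```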
LS = Σᵢ xᵢ and SS = Σᵢ xᵢ² (sums over the n points in the cluster)

Example-1
Consider a cluster C1 = {3, 5, 2, 8, 9, 1}. Then
CF(C1) = (6, 28, 184),
where n = 6,
LS = 3+5+2+8+9+1 = 28 and
SS = 3²+5²+2²+8²+9²+1² = 184.
Example-2
Another example, with 2-D objects. Given that
C2 = {(1,1), (2,1), (3,2)},
CF(C2) = (3, (6,4), (14,6)),
where n = 3,
LS = (1+2+3, 1+1+2) = (6, 4) and
SS = (1²+2²+3², 1²+1²+2²) = (14, 6).
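A minimal sketch, assuming NumPy, that computes a CF triple directly from a list of points and reproduces the numbers in Examples 1 and 2 (the helper name clustering_feature is illustrative):

```python
import numpy as np

def clustering_feature(points):
    """Return the CF triple (N, LS, SS) of a cluster of scalar or d-dimensional points."""
    pts = np.asarray(points, dtype=float)
    pts = pts.reshape(len(pts), -1)          # treat scalars as 1-D points
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)

print(clustering_feature([3, 5, 2, 8, 9, 1]))        # Example-1: (6, [28.], [184.])
print(clustering_feature([(1, 1), (2, 1), (3, 2)]))  # Example-2: (3, [6. 4.], [14. 6.])
```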
Another important property of CFs is that they are additive. That is, for two disjoint clusters C1 and C2 with CFs CF1 = (n1, LS1, SS1) and CF2 = (n2, LS2, SS2) respectively, the CF of the cluster formed by merging C1 and C2 is given as
CF1 + CF2 = (n1+n2, LS1+LS2, SS1+SS2).
Example-3
C1 = {(2,5), (3,2), (4,3)} and
C2 = {(1,1), (2,1), (3,1)}.
Then
CF1 = (3, (2+3+4, 5+2+3), (2²+3²+4², 5²+2²+3²)) = (3, (9,10), (29,38))
and CF2 = (3, (1+2+3, 1+1+1), (1²+2²+3², 1²+1²+1²)) = (3, (6,3), (14,3)).
Now, if C3 = C1 ∪ C2, then
CF3 = CF1 + CF2 = (6, (15,13), (43,41)).
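A small self-contained sketch, assuming NumPy, that verifies the additivity property on Example-3 (the helper name cf is illustrative):

```python
import numpy as np

def cf(points):
    """CF triple (N, LS, SS) of a cluster of 2-D points."""
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)

C1 = [(2, 5), (3, 2), (4, 3)]
C2 = [(1, 1), (2, 1), (3, 1)]

n1, ls1, ss1 = cf(C1)                      # (3, [9. 10.], [29. 38.])
n2, ls2, ss2 = cf(C2)                      # (3, [6. 3.], [14. 3.])

# Additivity: the CF of C3 = C1 U C2 is the component-wise sum of CF1 and CF2 ...
print((n1 + n2, ls1 + ls2, ss1 + ss2))     # (6, [15. 13.], [43. 41.])
# ... and it matches the CF computed directly from all six points.
print(cf(C1 + C2))
```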
Formula
Cluster's Centroid: X0 = LS / n
Cluster's Radius: R = √( Σᵢ (xᵢ - X0)² / n ) = √( (n·SS - LS²) / n² )
Cluster's Diameter: D = √( Σᵢ Σⱼ (xᵢ - xⱼ)² / (n(n-1)) ) = √( (2n·SS - 2·LS²) / (n(n-1)) )
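A short sketch, assuming NumPy, of these three formulas applied to the CF of Example-1 (the function names are illustrative):

```python
import numpy as np

def centroid(n, ls, ss):
    return ls / n

def radius(n, ls, ss):
    # R = sqrt(SS/n - (LS/n)^2), evaluated per dimension for vector-valued LS, SS
    return np.sqrt(ss / n - (ls / n) ** 2)

def diameter(n, ls, ss):
    # D = sqrt((2n*SS - 2*LS^2) / (n(n-1)))
    return np.sqrt((2 * n * ss - 2 * ls ** 2) / (n * (n - 1)))

n, ls, ss = 6, np.array([28.0]), np.array([184.0])   # CF(C1) from Example-1
print(centroid(n, ls, ss), radius(n, ls, ss), diameter(n, ls, ss))
# approximately [4.67], [2.98], [4.62]
```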
Example-4
Apply BIRCH to cluster the given dataset
D = {(3,4), (2,6), (4,5), (4,7), (3,8), (6,2), (7,2), (7,4), (8,4), (7,9)}.
The branching factor is B = 2, the maximum number of sub-clusters at each leaf node is L = 2, and the threshold on the diameter of sub-clusters stored in the leaf nodes is T = 1.5.
• For each data point, find the Radius and CF.
• Consider data point x1 = (3,4).
• It is alone in the feature space, so Radius = 0.
• Cluster Feature CF1<N, LS, SS> = <1, (3,4), (9,16)>.
• Create a leaf node with data point x1 = (3,4) and branch CF1.

CF1<1, (3,4), (9,16)>: Leaf {x1=(3,4)}
For each data point, find the Radius and CF.
Consider data point x2 = (2,6):
1. Linear Sum, LS = (3,4) + (2,6) = (5,10)
2. Squared Sum, SS = (2²+9, 6²+16) = (13,52); N = 2
• Radius = √(SS/N - (LS/N)²) = √((13,52)/2 - ((5,10)/2)²) = (0.5, 1)
• R = (0.5, 1) < (T, T) → True
• So, x2 = (2,6) will be clustered into the leaf with x1 = (3,4).
3. Cluster Feature CF1<N, LS, SS> = <2, (5,10), (13,52)>

CF1<2, (5,10), (13,52)>: Leaf {x1=(3,4), x2=(2,6)}
For each data point, find the Radius and CF.
Consider data point x3 = (4,5):
1. Linear Sum, LS = (5,10) + (4,5) = (9,15)
2. Squared Sum, SS = (4²+13, 5²+52) = (29,77); N = 3
• Radius = √(SS/N - (LS/N)²) = √((29,77)/3 - ((9,15)/3)²) = (0.82, 0.82)
• R = (0.82, 0.82) < (T, T) → True
• So, x3 = (4,5) will be clustered into the same leaf as x1 and x2.
3. Cluster Feature CF1<N, LS, SS> = <3, (9,15), (29,77)>

CF1<3, (9,15), (29,77)>: Leaf {x1=(3,4), x2=(2,6), x3=(4,5)}
Similarly, after inserting x4 = (4,7) and x5 = (3,8):

CF1<5, (16,30), (54,190)>: Leaf {x1=(3,4), x2=(2,6), x3=(4,5), x4=(4,7), x5=(3,8)}
Consider data point x6 = (6,2):
1. Linear Sum, LS = (16,30) + (6,2) = (22,32)
2. Squared Sum, SS = (6²+54, 2²+190) = (90,194); N = 6
• Radius = √(SS/N - (LS/N)²) = √((90,194)/6 - ((22,32)/6)²) = (1.24, 1.97)
• R = (1.24, 1.97) < (T, T) → False
• So, x6 = (6,2) is placed in a different branch.
3. Cluster Feature CF2<N, LS, SS> = <1, (6,2), (36,4)>

CF1<5, (16,30), (54,190)>: Leaf {x1=(3,4), x2=(2,6), x3=(4,5), x4=(4,7), x5=(3,8)}
CF2<1, (6,2), (36,4)>: Leaf {x6=(6,2)}
For data point x7 = (7,2), two branches exist: B1 for CF1 and B2 for CF2. Find whether x7 is closer to the centroid of CF1 or of CF2; then find the Radius.
Centroid of CF1 = LS/N = (16,30)/5 = (3.2, 6); centroid of CF2 = LS/N = (6,2)/1 = (6, 2) → closer to x7.
So CF2 is considered.
1. Linear Sum, LS = (6,2) + (7,2) = (13,4)
2. Squared Sum, SS = (7²+36, 2²+4) = (85,8); N = 2
• Radius = √(SS/N - (LS/N)²) = √((85,8)/2 - ((13,4)/2)²) = (0.5, 0)
• R = (0.5, 0) < (T, T) → True
• So, x7 = (7,2) will be clustered with x6.
3. Cluster Feature CF2<N, LS, SS> = <2, (13,4), (85,8)>

CF1<5, (16,30), (54,190)>: Leaf {x1=(3,4), x2=(2,6), x3=(4,5), x4=(4,7), x5=(3,8)}
CF2<2, (13,4), (85,8)>: Leaf {x6=(6,2), x7=(7,2)}
Similarly, after inserting x8 = (7,4) and x9 = (8,4):

CF1<5, (16,30), (54,190)>: Leaf {x1=(3,4), x2=(2,6), x3=(4,5), x4=(4,7), x5=(3,8)}
CF2<4, (28,12), (198,40)>: Leaf {x6=(6,2), x7=(7,2), x8=(7,4), x9=(8,4)}
For data point x10 = (7,9), two branches exist: B1 for CF1 and B2 for CF2. Find whether x10 is closer to the centroid of CF1 or of CF2; then find the Radius.
Centroid of CF1 = LS/N = (16,30)/5 = (3.2, 6); centroid of CF2 = LS/N = (28,12)/4 = (7, 3).
CF1 is closer, so CF1 is considered.
1. Linear Sum, LS = (16,30) + (7,9) = (23,39)
2. Squared Sum, SS = (7²+54, 9²+190) = (103,271); N = 6
• Radius = √(SS/N - (LS/N)²) = √((103,271)/6 - ((23,39)/6)²) = (1.57, 1.7)
• R = (1.57, 1.7) < (T, T) → False, and the leaf already holds 5 points.
• So, x10 = (7,9) cannot be clustered with CF1.
As the branching factor is 2, another branch cannot be added at this level, so a new parent level is introduced.

Root: CF12<9, (44,42), (252,230)>, CF3<1, (7,9), (49,81)>
CF12: CF1<5, (16,30), (54,190)>, CF2<4, (28,12), (198,40)>
CF1: Leaf {x1=(3,4), x2=(2,6), x3=(4,5), x4=(4,7), x5=(3,8)}
CF2: Leaf {x6=(6,2), x7=(7,2), x8=(7,4), x9=(8,4)}
CF3: Leaf {x10=(7,9)}
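The sketch below, assuming NumPy, replays the insertion logic of this worked example: each point is absorbed into the closest leaf CF if the resulting per-dimension radius stays below the threshold T = 1.5, otherwise a new leaf CF is started. Node splitting and the branching-factor bookkeeping of a full CF tree are omitted, so this is a simplification rather than the complete BIRCH Phase 1.

```python
import numpy as np

T = 1.5  # threshold used in Example-4

class CF:
    """Clustering Feature: (N, LS, SS) with per-dimension sums."""
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n, self.ls, self.ss = 1, p.copy(), p ** 2

    def radius_if_added(self, point):
        """Per-dimension radius sqrt(SS/N - (LS/N)^2) after absorbing `point`."""
        p = np.asarray(point, dtype=float)
        n, ls, ss = self.n + 1, self.ls + p, self.ss + p ** 2
        return np.sqrt(ss / n - (ls / n) ** 2)

    def add(self, point):
        p = np.asarray(point, dtype=float)
        self.n += 1
        self.ls += p
        self.ss += p ** 2

data = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8),
        (6, 2), (7, 2), (7, 4), (8, 4), (7, 9)]

leaves = [CF(data[0])]
for point in data[1:]:
    # choose the leaf whose centroid LS/N is closest to the point
    closest = min(leaves, key=lambda cf: np.linalg.norm(cf.ls / cf.n - point))
    if np.all(closest.radius_if_added(point) < T):
        closest.add(point)          # absorb into the existing sub-cluster
    else:
        leaves.append(CF(point))    # start a new sub-cluster

for i, cf in enumerate(leaves, 1):
    print(f"CF{i}: N={cf.n}, LS={cf.ls}, SS={cf.ss}")
```

Running it reproduces the three sub-clusters found above: CF1 holding x1 to x5, CF2 holding x6 to x9, and CF3 holding x10.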
2. CURE
Agendas:
• What is CURE?
• CURE Structure
• Algorithm
• Example
(Figure: clusters and outliers)
CURE: Clustering Using REpresentatives
What is CURE?
It is a hierarchical clustering technique that adopts a middle ground between the centroid-based and the all-point extremes.
It is useful for discovering groups and identifying interesting distributions in the underlying data.
Instead of using a single centroid point, as most data mining algorithms do, CURE uses a set of well-scattered representative points to efficiently handle the clusters and eliminate the outliers.
Structure
Data → Draw random sample → Partition sample → Partially cluster partitions → Eliminate outliers → Cluster partial clusters → Label data on disk → Clusters
Algorithm
Phase 1: Begin with a large dataset D consisting of n data points.
Phase 2: Randomly select a sample of c points from the dataset D, where c << n. The sample should be representative of the entire dataset.
Phase 3: Use a hierarchical clustering method (e.g., single-link, complete-link, or average-link) on the sample to form an initial set of clusters. This is typically done until a desired number of clusters k is reached.
Phase 4: For each cluster obtained, select a fixed number of representative points r. These points are chosen to be as far apart as possible to capture the shape and extent of the cluster.
Phase 5: For each cluster, move the representative points towards the mean of the cluster by a fraction α. This step helps to reduce the influence of outliers.
Phase 6: Repeat the merging process for the remaining clusters until the desired number of clusters is achieved.
Phase 7: Assign the remaining non-sampled points in D to the nearest cluster using the representative points (see the sketch below).
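A minimal sketch of Phases 3 to 7 on synthetic data, assuming NumPy and SciPy. Partitioning of the sample, outlier elimination, and the disk-based labelling pass are omitted, and the farthest-point heuristic used to pick representatives is one reasonable reading of Phase 4 rather than the paper's exact procedure.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
sample = rng.normal(size=(60, 2))          # stand-in for the random sample of c points
k, r, alpha = 3, 4, 0.5                    # clusters, representatives per cluster, shrink factor

# Phase 3: hierarchical clustering of the sample down to k clusters
labels = fcluster(linkage(sample, method="average"), t=k, criterion="maxclust")

representatives = {}
for c in np.unique(labels):
    pts = sample[labels == c]
    mean = pts.mean(axis=0)
    # Phase 4: pick up to r well-scattered points (farthest-point heuristic)
    reps = [pts[np.argmax(np.linalg.norm(pts - mean, axis=1))]]
    while len(reps) < min(r, len(pts)):
        dists = np.min([np.linalg.norm(pts - q, axis=1) for q in reps], axis=0)
        reps.append(pts[np.argmax(dists)])
    # Phase 5: shrink the representatives towards the cluster mean by fraction alpha
    representatives[c] = [q + alpha * (mean - q) for q in reps]

# Phase 7: assign any point to the cluster with the nearest representative
def assign(point):
    return min(representatives,
               key=lambda c: min(np.linalg.norm(point - q) for q in representatives[c]))

print(assign(np.array([0.2, -0.1])))
```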
Example
Advantages
• CURE is designed to efficiently process large datasets.
• CURE reduces the computational complexity without significantly compromising the quality of the clustering.
• CURE is relatively straightforward to implement.
• Flexibility in cluster shapes.
Disadvantages
• Although subsampling helps reduce complexity, the initial phase of clustering a large sample can still be computationally intensive, especially for very large datasets.
• Too few representative points may not capture cluster shape accurately, while too many points can increase computational costs.
• Although CURE is designed to handle large datasets, extremely large-scale applications might still face scalability issues.
3. CHAMELEON
Agendas:
• What is CHAMELEON?
• Framework of CHAMELEON
• Phases of CHAMELEON
• Advantages
What is Chameleon?
Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to decide the similarity among pairs of clusters.
It was developed based on the observed weaknesses of two hierarchical clustering algorithms, ROCK and CURE.
In Chameleon, cluster similarity is assessed based on how well-connected objects are inside a cluster and on the proximity of clusters.
Framework
Data Set → Construct a sparse k-NN graph → Partition the graph → Final Clusters
Fig: Overall framework of CHAMELEON.
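A minimal sketch of the first framework step, constructing the sparse k-NN graph, assuming scikit-learn; the graph-partitioning step itself (CHAMELEON uses a multilevel partitioner such as hMETIS) is not shown.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # stand-in data set

# Sparse k-NN graph; mode="distance" stores edge lengths, which are converted to
# similarity weights so that heavier edges connect closer points.
knn = kneighbors_graph(X, n_neighbors=5, mode="distance")
knn.data = 1.0 / (knn.data + 1e-12)
print(knn.shape, knn.nnz)              # (100, 100) sparse matrix with 100*5 weighted edges
```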
A Two-Phase Clustering Algorithm
Phase 1: Finding Initial Sub-clusters. The first phase is graph partitioning, which allows data items to be clustered into a large number of sub-clusters.
Fig: An example of the bisections produced by multilevel graph partitioning algorithms on two spatial data sets (panels a and b).
Phase 2: Merging Sub-clusters using a Dynamic Framework. It employs an agglomerative hierarchical clustering method to look for genuine clusters that can be formed by merging the generated sub-clusters.
Two different schemes have been implemented in CHAMELEON to employ the agglomerative hierarchical clustering method:
1. Merge those pairs of clusters whose relative inter-connectivity and relative closeness are both above some user-specified thresholds.
2. Combine the relative inter-connectivity and relative closeness into a single function, then select and merge the pair of clusters that maximizes this function.
Relative Inter-Connectivity
Let Ci and Cj be two clusters. Then
RI(Ci, Cj) = Absolute IC(Ci, Cj) / ((Internal IC(Ci) + Internal IC(Cj)) / 2),
where Absolute IC(Ci, Cj) = sum of the weights of the edges that connect Ci with Cj, and
Internal IC(Ci) = weighted sum of the edges that partition the cluster into two roughly equal parts.
Relative Closeness
Absolute closeness normalized with respect to the internal closeness of the two clusters.
Absolute closeness is obtained as the average similarity between the points in Ci that are connected to the points in Cj.
Internal Closeness
Internal closeness of a cluster is obtained as the average of the weights of the edges in the cluster.
Using these two measures, Chameleon decides which pairs of clusters to merge.
Merging the Clusters
If the relative inter-connectivity measure and the relative closeness measure are the same, choose inter-connectivity.
It can also use the threshold criterion
RI(Ci, Cj) ≥ T(RI) and RC(Ci, Cj) ≥ T(RC),
as sketched below.
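A minimal sketch of this threshold test, assuming the relevant edge-weight sums have already been extracted from the k-NN graph. The plain averages used for normalization simplify the size-weighted averages of the original CHAMELEON paper, and all numbers below are hypothetical.

```python
def merge_test(abs_ic, int_ic_i, int_ic_j,
               abs_close, int_close_i, int_close_j,
               t_ri=1.0, t_rc=1.0):
    """Return True if clusters Ci and Cj pass the RI/RC merge thresholds."""
    ri = abs_ic / ((int_ic_i + int_ic_j) / 2.0)            # relative inter-connectivity
    rc = abs_close / ((int_close_i + int_close_j) / 2.0)   # relative closeness
    return ri >= t_ri and rc >= t_rc

# Hypothetical edge-weight statistics for two sub-clusters:
print(merge_test(abs_ic=12.0, int_ic_i=10.0, int_ic_j=14.0,
                 abs_close=0.8, int_close_i=0.7, int_close_j=0.9))
```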
Advantages
• Dynamic modeling allows Chameleon to adapt to the natural shapes and densities of clusters in the data.
• It can handle large datasets effectively.
• The merging and refinement processes enhance the quality of the clusters.
Any Questions?
Thank You