Clustering Methods
Information Management course
Teacher: Alberto Ceselli

— Chapter 10 —
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Cluster Analysis: Concepts and
Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
What is Cluster Analysis?
Cluster: A collection of data objects
similar (or related) to one another within the same
group
dissimilar (or unrelated) to the objects in other groups
Typical applications:
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
Applications of Cluster Analysis
Data reduction
Summarization: Preprocessing for regression, PCA,
classification, and association analysis
Compression: Image processing: vector quantization
Hypothesis generation and testing
Prediction based on groups
Cluster & find characteristics/patterns for each group
Clustering: Application Examples
Biology: taxonomy of living things: kingdom, phylum, class,
order, family, genus and species
Information retrieval: document clustering
Land use: Identification of areas of similar land use in an
earth observation database
Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
City-planning: Identifying groups of houses according to their
house type, value, and geographical location
Earthquake studies: observed earthquake epicenters should be clustered along continental faults
Climate: understanding Earth's climate; finding patterns in atmospheric and ocean data
Economic Science: market research
Basic Steps to Develop a Clustering
Task
Feature selection
Select info concerning the task of interest
Minimal information redundancy
Proximity measure
Similarity of two feature vectors
Clustering criterion
Expressed via a cost function or some rules
Clustering algorithms
Choice of algorithms
Validation of the results
Validation test (also, clustering tendency test)
Interpretation of the results
Integration with applications
Quality: What Is Good Clustering?
A good clustering method will produce high quality
clusters
high intra-class similarity: cohesive within clusters
low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on
the similarity measure used by the method
its implementation (optimality guarantees +
computational effectiveness), and
its ability to discover some or all of the hidden
patterns (practical behavior)
Measure the Quality of Clustering
Dissimilarity/Similarity metric
Similarity is expressed in terms of a (typically metric)
pairwise distance function d(i, j)
The definitions of distance functions are usually
rather different for interval-scaled, boolean,
categorical, ordinal, ratio, and vector variables
Weights should be associated with different
variables based on applications and data semantics
Quality of clustering:
There is usually a separate global quality function
that measures the “goodness” of a cluster.
It is hard to define “similar enough” or “good
enough” (need to stick to the application!)
The answer is typically highly subjective (i.e.,
don't blame the algorithm for modeling errors)
Major Clustering Approaches
Partitioning approach:
Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using some criterion
Density-based approach:
Based on connectivity and density functions (keep growing a cluster as long as the density in its neighbourhood exceeds some threshold)
Grid-based approach:
Quantize the object space into a grid structure
Partitioning Algorithms: Basic Concept
Partitioning method: partition a database D of n objects into a set of k clusters such that the sum of squared distances of each object p to its cluster representative c_i (the centroid or medoid of cluster C_i) is minimized:

E = Σ_{i=1}^{k} Σ_{p ∈ C_i} d(p, c_i)²
The K-Means Clustering Method
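A minimal sketch of the standard Lloyd iteration for k-means, which heuristically minimizes the criterion E above by alternating assignment and mean-update steps; the random initialization, tolerance, and iteration cap are illustrative choices, not part of the original slides.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Lloyd's k-means: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # Initialize the k means with randomly chosen data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each center as the mean of its cluster.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    sse = ((X - centers[labels]) ** 2).sum()   # the criterion E
    return labels, centers, sse
```

Each iteration costs O(nkd); the result depends on the initialization, which is why k-means is usually restarted several times.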
Variations of the K-Means Method
Most variants of k-means differ in:
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
Handling categorical data: k-modes
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
Using a frequency-based method to update modes of clusters
A mixture of categorical and numerical data: k-prototype method
What Is the Problem of the K-Means Method?
The k-means algorithm is sensitive to outliers: an object with an extremely large value can substantially distort the distribution of the data and drag a cluster mean away from the bulk of its points.
K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid (the most centrally located object in the cluster) can be used.
[Figure: two 10 × 10 scatter plots contrasting the mean-based and the medoid-based representative of the same cluster]
PAM: A Typical K-Medoids Algorithm
[Figure: PAM on a 10 × 10 point set, total cost = 20]
Arbitrarily choose k objects as the initial medoids
Assign each remaining object to the nearest medoid
Do loop: randomly select a non-medoid object O_random, compute the total cost of swapping a medoid O with O_random, and perform the swap if the quality is improved
Until no change
The K-Medoid Clustering Method
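A compact sketch of the swap-based PAM loop illustrated above, assuming Euclidean distance; the function names and the exhaustive swap search are illustrative, and the nested search over all (medoid, non-medoid) pairs is exactly what makes PAM expensive on large data sets.

```python
import numpy as np

def total_cost(X, medoid_idx):
    """Sum of distances from each object to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))  # arbitrary initial medoids
    best = total_cost(X, medoids)
    improved = True
    while improved:                      # "do loop until no change"
        improved = False
        for i in range(k):               # try swapping each medoid ...
            for o in range(len(X)):      # ... with each non-medoid object o
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                cost = total_cost(X, candidate)
                if cost < best:          # keep the swap only if quality improves
                    best, medoids, improved = cost, candidate, True
    # Final assignment of each object to its nearest medoid.
    labels = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2).argmin(axis=1)
    return medoids, labels, best
```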
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
Hierarchical Clustering
Use distance matrix as clustering criteria. This
method does not require the number of clusters k
as an input, but needs a termination condition
[Figure: five objects a, b, c, d, e. Agglomerative (AGNES) merges bottom-up over steps 0–4: {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e}; divisive (DIANA) splits the same hierarchy top-down in the reverse order (steps 4–0).]
AGNES (AGglomerative
NESting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical packages, e.g., Splus
Use the single-link method and the dissimilarity
matrix
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
[Figure: three 10 × 10 scatter plots showing successive AGNES merge steps on the same point set]
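A short usage sketch of single-link agglomerative clustering with SciPy (instead of the Splus implementation mentioned above); the toy data set and the cut into two flat clusters are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# A small illustrative 2-D data set with two visible groups.
X = np.array([[1, 1], [1.5, 1.2], [1.2, 0.8],
              [8, 8], [8.3, 7.7], [7.8, 8.4]])

# Single-link AGNES: repeatedly merge the two clusters with the
# least dissimilarity (smallest single-link distance).
Z = linkage(X, method="single", metric="euclidean")

# Cut the resulting hierarchy into 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) would plot the merge tree
# (see the dendrogram slide below).
```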
DIANA (DIvisive ANAlysis)
DIANA works in the inverse order of AGNES: it starts from one all-inclusive cluster and splits it until, eventually, each object forms a cluster of its own.
[Figure: three 10 × 10 scatter plots showing successive DIANA split steps on the same point set]
Dendrogram: Shows How Clusters are
Merged
Decompose data objects into several levels of nested partitionings (a tree of clusters), called a dendrogram. A clustering of the data objects is obtained by cutting the dendrogram at the desired level: each connected component then forms a cluster.
Distance between Clusters
Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = min_{p ∈ K_i, q ∈ K_j} d(p, q)
Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = max_{p ∈ K_i, q ∈ K_j} d(p, q)
Average: average distance between an element in one cluster and an element in the other, i.e., dist(K_i, K_j) = Σ_{p ∈ K_i, q ∈ K_j} d(p, q) / (|K_i| |K_j|)
Centroid: distance between the centroids of the two clusters, i.e., dist(K_i, K_j) = d(c_i, c_j), where c_i = mean(K_i)
Medoid: distance between the medoids of the two clusters, i.e., dist(K_i, K_j) = d(m_i, m_j), where m_i is the most centrally located object of K_i
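The five definitions above, written out for two clusters stored as NumPy arrays; treating the medoid as the object with minimal total distance to the rest of its cluster is an illustrative choice.

```python
import numpy as np

def pairwise(Ki, Kj):
    """Matrix of Euclidean distances d(p, q), p in Ki, q in Kj."""
    return np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=2)

def single_link(Ki, Kj):   return pairwise(Ki, Kj).min()
def complete_link(Ki, Kj): return pairwise(Ki, Kj).max()
def average_link(Ki, Kj):  return pairwise(Ki, Kj).mean()   # sum / (|Ki| * |Kj|)

def centroid_dist(Ki, Kj):
    return np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0))

def medoid(K):
    """Most centrally located object: minimal total distance to the rest."""
    return K[pairwise(K, K).sum(axis=1).argmin()]

def medoid_dist(Ki, Kj):
    return np.linalg.norm(medoid(Ki) - medoid(Kj))
```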
Distance between Clusters
Algorithms using minimum distance are also called
nearest-neighbor clustering algorithms
they build minimum spanning trees
if clustering is terminated when the minimum inter-
cluster distance exceeds a given threshold they are
called single-linkage
Algorithms using maximum distance are also called
farthest-neighbor clustering algorithms
If clustering is terminated when the maximum inter-
cluster distance between nearest clusters exceeds a
given threshold they are called complete-linkage
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
Centroid: the “middle” of a cluster

C_m = ( Σ_{i=1}^{N} t_{ip} ) / N

Radius: square root of the average squared distance from any point of the cluster to its centroid

R_m = √( Σ_{i=1}^{N} (t_{ip} − c_m)² / N )

Diameter: square root of the average squared distance between all pairs of points in the cluster

D_m = √( Σ_{i=1}^{N} Σ_{j=1}^{N} (t_{ip} − t_{jq})² / (N (N − 1)) )
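A sketch that computes the three measures directly from a cluster stored as an N × d NumPy array; the toy cluster at the end is illustrative.

```python
import numpy as np

def centroid(T):
    return T.mean(axis=0)                                    # C_m

def radius(T):
    c = centroid(T)
    return np.sqrt(((T - c) ** 2).sum(axis=1).mean())        # R_m

def diameter(T):
    n = len(T)
    d2 = ((T[:, None, :] - T[None, :, :]) ** 2).sum(axis=2)  # all pairwise squared distances
    return np.sqrt(d2.sum() / (n * (n - 1)))                 # D_m (i = j terms contribute 0)

T = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8]], dtype=float)
print(centroid(T), radius(T), diameter(T))
```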
Extensions to Hierarchical
Clustering
Major weakness of agglomerative clustering methods
Can never undo what was done previously
Do not scale well: time complexity of at least O(n²),
where n is the total number of objects
Integration of hierarchical & distance-based clustering
BIRCH (1996): uses CF-tree and incrementally adjusts
the quality of sub-clusters
CHAMELEON (1999): hierarchical clustering using
dynamic modeling
BIRCH (Balanced Iterative Reducing
and Clustering Using Hierarchies)
Zhang, Ramakrishnan & Livny, SIGMOD’96
Clustering Feature (CF): <n, LS, SS>
n: number of points, LS: their sum, SS: their sum of squares
Easy to compute centroid, radius and diameter from CF
CFs are additive
Incrementally construct a CF tree, a hierarchical data structure
for multiphase clustering
Phase 1: scan DB to build an initial in-memory CF tree (a
multi-level compression of the data that tries to preserve its
inherent clustering structure)
Phase 2: use an arbitrary clustering algorithm to cluster the
leaf nodes of the CF-tree
BIRCH (Balanced Iterative Reducing
and Clustering Using Hierarchies)
Scales linearly: finds a good clustering with a single scan and
improves the quality with a few additional scans
Weakness: handles only numeric data and is sensitive to the
order of the data records
Clustering Feature Vector in
BIRCH
Clustering Feature (CF): CF = (N, LS, SS)
N: Number of data points
LS: linear sum of the N points: LS = Σ_{i=1}^{N} X_i
SS: square sum of the N points: SS = Σ_{i=1}^{N} X_i²
Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8), the clustering feature is CF = (5, (16, 30), (54, 190))
[Figure: the five points plotted on a 10 × 10 grid]
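A sketch of the clustering feature as a tuple, its additivity, and how centroid, radius, and diameter can be recovered from it alone; storing SS as per-coordinate square sums matches the (54, 190) form of the example above.

```python
import numpy as np

def cf(points):
    """Clustering feature CF = (N, LS, SS) of a set of d-dimensional points."""
    P = np.asarray(points, dtype=float)
    return len(P), P.sum(axis=0), (P ** 2).sum(axis=0)

def cf_merge(cf1, cf2):
    """CFs are additive: merging two sub-clusters just adds the components."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

def cf_centroid(c):
    n, ls, ss = c
    return ls / n

def cf_radius(c):
    n, ls, ss = c                      # R^2 = SS/N - ||LS/N||^2
    return np.sqrt(ss.sum() / n - ((ls / n) ** 2).sum())

def cf_diameter(c):
    n, ls, ss = c                      # D^2 = (2N*SS - 2*||LS||^2) / (N(N-1))
    return np.sqrt((2 * n * ss.sum() - 2 * (ls ** 2).sum()) / (n * (n - 1)))

print(cf([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))   # N = 5, LS = (16, 30), SS = (54, 190)
```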
CF-Tree in BIRCH
Clustering feature:
Summary of the statistics for a given subcluster: the 0-th,
1st, and 2nd moments of the subcluster from the statistical
point of view
Registers crucial measurements for computing clusters and
utilizes storage efficiently
A CF tree is a height-balanced tree that stores the clustering
features for a hierarchical clustering
A nonleaf node in a tree has descendants or “children”
The nonleaf nodes store sums of the CFs of their children
A CF tree has two parameters
Branching factor: max # of children
Threshold: max diameter of sub-clusters stored at the leaf
nodes
The CF Tree Structure
[Figure: CF tree structure. The root and non-leaf nodes hold CF entries (CF1, CF2, …, up to the branching factor) summarizing their children; leaf nodes hold CF entries for sub-clusters and are chained by prev/next pointers.]
The Birch Algorithm
Cluster diameter: D = √( (1 / (n(n−1))) Σ_{i≠j} (x_i − x_j)² )
For each incoming point: find the closest leaf entry; add the point to that entry and update its CF; if the entry diameter exceeds the threshold, split the leaf node (and possibly its parents)
Algorithm is O(n)
Concerns
Sensitive to the insertion order of the data points
Since the size of leaf nodes is fixed, the resulting clusters may not be so natural
Clusters tend to be spherical, given the radius and diameter measures
CHAMELEON: Hierarchical Clustering
Using Dynamic Modeling (1999)
CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999
Measures the similarity based on a dynamic model
Two clusters are merged only if the interconnectivity
and closeness (proximity) between two clusters are
high relative to the internal interconnectivity of the
clusters and closeness of items within the clusters
Graph-based, and two-phase algorithm
1. Use a graph-partitioning algorithm: cluster objects
into a large number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering
algorithm: find the genuine clusters by repeatedly
combining these sub-clusters
KNN Graphs & Interconnectivity
k-nearest-neighbor graphs built from the original 2-D data:
[Figure: data set → k-NN graph → partition into sub-clusters → merge → final clusters]
In the k-NN graph, p and q are connected if q is among the k closest neighbors of p
Relative interconnectivity: interconnectivity of c1 and c2 over their internal connectivity
Relative closeness: closeness of c1 and c2 over their internal closeness
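A minimal sketch of the k-NN graph that CHAMELEON's first phase starts from; the symmetric connection rule and the value of k are illustrative, and the graph-partitioning and merge phases are not shown.

```python
import numpy as np

def knn_graph(X, k):
    """Adjacency matrix: p and q are connected if q is among the k
    nearest neighbors of p (or vice versa, to keep the graph symmetric)."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbor
    A = np.zeros((n, n), dtype=bool)
    nearest = np.argsort(d, axis=1)[:, :k]      # indices of the k closest points
    for p in range(n):
        A[p, nearest[p]] = True
    return A | A.T                              # symmetrize

X = np.random.default_rng(0).random((20, 2))
A = knn_graph(X, k=3)
print(A.sum() // 2)                             # number of undirected edges
```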
CHAMELEON (Clustering Complex Objects)
Hierarchical Clustering Summary
Hierarchical clustering strengths
Produce, in a single run, clustering solutions for all values of k,
and link them, highlighting regularities in the data
Hierarchical clustering weaknesses
Nontrivial to choose a good distance measure
Hard to handle missing attribute values
Algorithmically (besides theoretically) hard: mainly heuristics in
practical settings
Probabilistic Hierarchical
Clustering
Hierarchical (distance-based) clustering strengths ...
Hierarchical (distance-based) clustering weaknesses ...
Probabilistic (“fitting”) hierarchical clustering
Use probabilistic models to measure distances between clusters
Generative model: Regard the set of data objects to be clustered
as a sample of the underlying data generation mechanism to be
analyzed
Easy to understand, same efficiency as algorithmic agglomerative
clustering method, can handle partially observed data
In practice, the generative models are assumed to adopt common
distribution functions, e.g., the Gaussian or the Bernoulli
distribution, governed by parameters
Generative Model
Given a set of 1-D points X = {x1, …, xn} for clustering analysis, assume they are generated by a Gaussian distribution N(μ, σ²), so that

P(x_i | μ, σ²) = (1 / √(2πσ²)) · e^( −(x_i − μ)² / (2σ²) )

The likelihood that X is generated by the model is L(N(μ, σ²) : X) = Π_{i=1}^{n} P(x_i | μ, σ²)
The learning task is to find the parameters μ and σ² that maximize this likelihood (maximum likelihood estimation)
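A sketch of the maximum-likelihood step for the 1-D Gaussian model above: the likelihood is maximized by the sample mean and the (biased) sample variance; names and the toy data are illustrative.

```python
import numpy as np

def gaussian_mle(x):
    """ML estimates of (mu, sigma^2) for 1-D points assumed ~ N(mu, sigma^2)."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()                       # argmax over mu of the likelihood
    sigma2 = ((x - mu) ** 2).mean()     # argmax over sigma^2 (biased MLE, divides by n)
    return mu, sigma2

def log_likelihood(x, mu, sigma2):
    x = np.asarray(x, dtype=float)
    return (-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)).sum()

x = [1.0, 1.2, 0.8, 1.1, 0.9]
mu, s2 = gaussian_mle(x)
print(mu, s2, log_likelihood(x, mu, s2))
```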
A Probabilistic Hierarchical Clustering
Algorithm
Cluster Analysis: Basic Concepts and
Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
Density-Based Clustering
Methods
Clustering based on density (local cluster criterion),
such as density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as a termination condition
Well-known methods: DBSCAN (Ester et al., KDD'96), OPTICS (Ankerst et al., SIGMOD'99), DENCLUE (Hinneburg & Keim, KDD'98), and CLIQUE (Agrawal et al., SIGMOD'98; more grid-based)
Density-Based Clustering: Basic
Concepts
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-
neighbourhood of that point
N_Eps(q) = {p ∈ D | dist(p, q) ≤ Eps}
Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. (Eps, MinPts) if
p belongs to N_Eps(q), and
q satisfies the core point condition: |N_Eps(q)| ≥ MinPts
[Figure: p inside the Eps = 1 cm neighbourhood of the core point q, with MinPts = 5]
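The two definitions above translate directly into code; a tiny sketch with Euclidean distance and illustrative parameter names:

```python
import numpy as np

def eps_neighborhood(X, q, eps):
    """Indices of the points in N_Eps(q) = {p in D | dist(p, q) <= Eps}."""
    return np.where(np.linalg.norm(X - X[q], axis=1) <= eps)[0]

def is_core(X, q, eps, min_pts):
    """Core point condition: |N_Eps(q)| >= MinPts (q counts itself)."""
    return len(eps_neighborhood(X, q, eps)) >= min_pts

def directly_density_reachable(X, p, q, eps, min_pts):
    """p is directly density-reachable from q iff p is in N_Eps(q) and q is a core point."""
    return p in eps_neighborhood(X, q, eps) and is_core(X, q, eps, min_pts)
```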
Density-Reachable and Density-
Connected
Density-reachable:
A point p is density-reachable from a point q w.r.t. (Eps, MinPts) if there is a chain of points p1, …, pn with p1 = q and pn = p such that p_{i+1} is directly density-reachable from p_i
Density-connected:
A point p is density-connected to a point q w.r.t. (Eps, MinPts) if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
[Figure: a chain p1, …, pn linking q to p (density-reachability); p and q both density-reachable from o (density-connectivity)]
DBSCAN: Density-Based Spatial
Clustering of Applications with Noise
Relies on a density-based notion of cluster: A
cluster is defined as a maximal set of density-
connected points
Experimentally, discovers clusters of arbitrary
shape in spatial databases with noise
[Figure: core, border, and outlier (noise) points for Eps = 1 cm and MinPts = 5]
DBSCAN: The Algorithm
Arbitrarily select a point p
Retrieve all points density-reachable from p w.r.t. Eps
and MinPts
If p is a core point, a cluster is formed
If p is a border point, no points are density-reachable
from p and DBSCAN visits the next point of the
database
Continue the process until all of the points have been
processed
If a spatial index is used, the computational complexity
of DBSCAN is O(n log n), where n is the number of
database objects. Otherwise, the complexity is O(n²)
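A compact sketch of the procedure just described: grow a cluster from each unvisited core point until no more density-reachable points can be added. It uses a linear-scan neighborhood query (hence O(n²) overall), and the noise label −1 and variable names are illustrative choices.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)            # -1 marks noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    cluster = 0

    def neighbors(i):                  # O(n) scan; a spatial index would give O(log n)
        return list(np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0])

    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        seeds = neighbors(p)
        if len(seeds) < min_pts:       # border or noise point: move on
            continue
        labels[p] = cluster            # p is a core point: start a new cluster
        while seeds:
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster    # q is density-reachable from p
            if not visited[q]:
                visited[q] = True
                q_neigh = neighbors(q)
                if len(q_neigh) >= min_pts:
                    seeds.extend(q_neigh)   # q is also core: keep expanding
        cluster += 1
    return labels
```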
DBSCAN: Sensitive to Parameters
The clusters found by DBSCAN depend strongly on the choice of Eps and MinPts.
OPTICS (Ankerst et al., SIGMOD'99) addresses this by producing a special order of the database w.r.t. its density-based clustering structure, which can be represented graphically or using visualization techniques
DENCLUE: Using Statistical Density
Functions
DENsity-based CLUstEring by Hinneburg & Keim (KDD’98)
Using statistical density functions:
Influence of y on x: f_Gaussian(x, y) = e^( −d(x, y)² / (2σ²) )
Total influence of the data set D on x: f_Gaussian^D(x) = Σ_{i=1}^{N} e^( −d(x, x_i)² / (2σ²) )
Gradient of x in the direction of x_i: ∇f_Gaussian^D(x, x_i) = Σ_{i=1}^{N} (x_i − x) · e^( −d(x, x_i)² / (2σ²) )
Major features
Uses a Gaussian kernel density approximation: f(x) = (1 / (n s)) Σ_{i=1}^{n} K( (x − x_i) / s )
Clusters can be determined mathematically by identifying density attractors (local maxima of the overall density function)
Center-defined clusters: assign to each density attractor the points density-attracted to it (pick each point and follow the gradient)
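A sketch of the Gaussian influence/density function and a plain gradient (hill-climbing) step toward a density attractor, following the formulas above; the fixed step size and stopping rule are illustrative simplifications, not DENCLUE's actual update schedule.

```python
import numpy as np

def density(x, X, sigma):
    """Total Gaussian influence of all data points on x."""
    d2 = ((X - x) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * sigma ** 2)).sum()

def gradient(x, X, sigma):
    """Sum of (x_i - x) * exp(-d(x, x_i)^2 / (2 sigma^2))."""
    d2 = ((X - x) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return ((X - x) * w[:, None]).sum(axis=0)

def climb_to_attractor(x0, X, sigma, step=0.1, tol=1e-5, max_iter=500):
    """Follow the gradient from x0 until a local maximum (density attractor)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = gradient(x, X, sigma)
        if np.linalg.norm(g) < tol:
            break
        x = x + step * g
    return x    # points whose climbs end at the same attractor form one cluster
```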
Denclue: Technical Essence
Density Attractor
Center-Defined and Arbitrary
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
Grid-Based Clustering Method
The STING Clustering Method
Each cell at a high level is partitioned into a number of
smaller cells in the next lower level
Statistical info of each cell is calculated and stored
beforehand and is used to answer queries
Parameters of higher level cells can be easily calculated
from parameters of lower level cell
count, mean, standard deviation (s), min, max
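A sketch of how a higher-level cell's parameters can be derived from the stored statistics of its children, as stated above; representing a cell as a (count, mean, s, min, max) tuple is an illustrative choice.

```python
import math

def merge_cells(children):
    """Combine (count, mean, s, min, max) statistics of child cells
    into the statistics of their parent cell."""
    n = sum(c[0] for c in children)
    mean = sum(c[0] * c[1] for c in children) / n
    # E[x^2] of the parent from each child's s and mean (population std. dev.)
    ex2 = sum(c[0] * (c[2] ** 2 + c[1] ** 2) for c in children) / n
    s = math.sqrt(max(ex2 - mean ** 2, 0.0))
    return (n,
            mean,
            s,
            min(c[3] for c in children),
            max(c[4] for c in children))

# Example: two child cells with 10 and 30 points.
print(merge_cells([(10, 2.0, 0.5, 1.0, 3.0), (30, 4.0, 1.0, 2.0, 7.0)]))
# -> (40, 3.5, 1.25, 1.0, 7.0)
```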
STING Algorithm and Its
Analysis
Remove the irrelevant cells from further consideration
When finish examining the current layer, proceed to
the next lower level
Repeat this process until the bottom layer is reached
Advantages:
Query-independent, easy to parallelize, incremental
update
O(K), where K is the number of grid cells at the
lowest level
Disadvantages:
All the cluster boundaries are either horizontal or
vertical; no diagonal boundary is detected
CLIQUE (Clustering In QUEst)
CLIQUE (Agrawal, Gehrke, Gunopulos & Raghavan, SIGMOD'98) is a density- and grid-based subspace clustering method: it partitions each dimension into intervals and keeps the grid units whose point count reaches a density threshold τ.
[Figure: dense grid units, with density threshold τ = 3, in the Salary vs. age and Vacation (week) vs. age subspaces; their intersection identifies candidate dense regions in the (age, salary, vacation) space]
Strength and Weakness of CLIQUE
Strength
automatically finds subspaces of the highest dimensionality in which high-density clusters exist
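A sketch of the dense-unit identification idea in a single 2-D subspace: partition each dimension into intervals, count the points per grid unit, and keep the units whose count reaches the density threshold τ (τ = 3 as in the figure above); the number of intervals and the toy data are illustrative.

```python
import numpy as np
from collections import Counter

def dense_units_2d(X, intervals=7, tau=3):
    """Grid units in a 2-D subspace containing at least tau points."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Map each point to the index of its grid unit along both dimensions.
    idx = np.floor((X - lo) / (hi - lo + 1e-12) * intervals).astype(int)
    idx = np.clip(idx, 0, intervals - 1)
    counts = Counter(map(tuple, idx))
    return {unit for unit, c in counts.items() if c >= tau}

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((2, 2), 0.3, (20, 2)), rng.normal((6, 6), 0.3, (20, 2))])
print(dense_units_2d(X, intervals=7, tau=3))
```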