Introduction to Clustering
TOP: Data Clustering
Instructor: Sayan Bandyapadhyay
Portland State University
Outline
1 Introduction
2 A Preliminary Model of Clustering
3 Metric Space
4 Our First Model of Clustering
5 Center-based Clustering
6 Complexity of Clustering Problems
Clustering of a Social Network
Dividing customers into similar groups
Applications
grouping of genes and proteins, and cancer and tumor
detection in Biology
speech recognition, and text summarization in Natural
Language Processing
grouping images, and image segmentation in Computer
Vision
Collaborative filtering
Data summarization
Dynamic trend detection
Social network analysis
Unsupervised learning
Building a classifier to identify images of cats and dogs
The ML Pipeline
Figure: the ML pipeline; training samples feed an ML algorithm, which produces a model that makes output predictions (training stage)
Training the classifier: feature engineering/labeling of samples
Outline
1 Introduction
2 A Preliminary Model of Clustering
3 Metric Space
4 Our First Model of Clustering
5 Center-based Clustering
6 Complexity of Clustering Problems
The Mapping
Each profile maps to a feature vector (F1, F2, . . . , Fd), a point in real space
Figure: points in real space, and a partition of the points into clusters
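As a concrete illustration of this mapping, here is a minimal Python sketch (not from the slides; the records and field names are invented for the example) that turns raw customer records into points in R^3:

    import numpy as np

    # Hypothetical customer records with three numerical features
    records = [
        {"age": 34, "visits": 12, "spend": 250.0},
        {"age": 51, "visits": 3,  "spend": 90.0},
        {"age": 29, "visits": 15, "spend": 310.0},
    ]

    # Profile (F1, F2, ..., Fd): here d = 3, one coordinate per feature
    X = np.array([[r["age"], r["visits"], r["spend"]] for r in records])
    print(X.shape)  # (3, 3): three profiles as points in R^3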
Drawback: Abstract Data Types
Not all data can be represented in numerical form
Categorical data: Gender, Address
Text data
Biological data: Gene expressions, Gene ontology
annotations
We will try to represent data in an abstract way
Outline
1 Introduction
2 A Preliminary Model of Clustering
3 Metric Space
4 Our First Model of Clustering
5 Center-based Clustering
6 Complexity of Clustering Problems
Metric
X is a set of points
A function d : X × X → R+ is said to be a distance metric if it
has the following properties:
Reflexivity: ∀x, y ∈ X , d(x, y) = 0 ⇐⇒ x = y
Symmetry: ∀x, y ∈ X , d(x, y) = d(y, x)
Triangle Inequality: ∀x, y, z ∈ X , d(x, y) ≤ d(x, z) + d(z, y)
Figure: triangle on x, y, z with d(x, z) = d1, d(z, y) = d2, and d(x, y) ≤ d1 + d2
X along with the metric d is called a metric space (X , d)
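The axioms are easy to check mechanically on a finite set. Below is a minimal Python sketch (not from the slides) that brute-forces the three properties; it also shows that squared distance is not a metric, because squaring breaks the triangle inequality.

    import itertools

    def is_metric(X, d, tol=1e-9):
        """Check reflexivity, symmetry, and the triangle inequality of d
        on the finite point set X (a brute-force sanity check)."""
        for x, y in itertools.product(X, repeat=2):
            if (d(x, y) == 0) != (x == y):          # d(x, y) = 0  iff  x = y
                return False
            if abs(d(x, y) - d(y, x)) > tol:        # symmetry
                return False
        for x, y, z in itertools.product(X, repeat=3):
            if d(x, y) > d(x, z) + d(z, y) + tol:   # triangle inequality
                return False
        return True

    X = [0.0, 1.5, 4.0]
    print(is_metric(X, lambda x, y: abs(x - y)))   # True: |x - y| is a metric
    print(is_metric(X, lambda x, y: (x - y)**2))   # False: d(0,4) = 16 exceeds
                                                   # d(0,1.5) + d(1.5,4) = 8.5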
The Idea of Metric Spaces
Figure: the ball B(v , r ) with center v and radius r
For any w, w′ ∈ B(v , r ), the triangle inequality gives d(w, w′) ≤ d(w, v ) + d(v , w′) ≤ 2r
Diameter of B(v , r ) is ≤ 2r
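A quick empirical sanity check of this bound, sketched in Python under the assumption of random points in the plane:

    import numpy as np

    rng = np.random.default_rng(0)
    pts = rng.normal(size=(500, 2))          # sample points in the plane
    v, r = np.zeros(2), 1.0

    # The ball B(v, r): all sampled points within distance r of the center v
    ball = pts[np.linalg.norm(pts - v, axis=1) <= r]

    # Pairwise distances inside the ball; the triangle inequality forces
    # every one of them to be <= 2r
    diffs = ball[:, None, :] - ball[None, :, :]
    diam = np.linalg.norm(diffs, axis=2).max()
    print(diam <= 2 * r)   # True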
Examples
Euclidean distance: X = R^d , d(x, y) = √(∑_{i=1}^{d} (x_i − y_i)²)
Manhattan distance: X = R^d , d(x, y) = ∑_{i=1}^{d} |x_i − y_i|
X = Σ∗ is the set of finite-length strings over an alphabet Σ; d is the edit distance
X is a set of vertices in a graph G; d is the shortest-path distance
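The last two metrics are less standard than the first two, so here are small self-contained sketches: the classical dynamic-programming edit distance, and shortest-path distance via Floyd–Warshall on a toy graph (both are textbook algorithms, not code from the course).

    def edit_distance(s, t):
        """Levenshtein distance: a metric on finite strings."""
        m, n = len(s), len(t)
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1): D[i][0] = i
        for j in range(n + 1): D[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                D[i][j] = min(D[i-1][j] + 1, D[i][j-1] + 1,
                              D[i-1][j-1] + (s[i-1] != t[j-1]))
        return D[m][n]

    print(edit_distance("kitten", "sitting"))  # 3

    # Shortest-path distance in an undirected weighted graph (Floyd-Warshall)
    INF = float("inf")
    V = ["a", "b", "c", "d"]
    w = {("a", "b"): 1, ("b", "c"): 2, ("a", "d"): 7, ("c", "d"): 1}
    dist = {(u, v): 0 if u == v else INF for u in V for v in V}
    for (u, v), wt in w.items():
        dist[u, v] = dist[v, u] = wt
    for k in V:                      # relax through each intermediate vertex
        for u in V:
            for v in V:
                dist[u, v] = min(dist[u, v], dist[u, k] + dist[k, v])
    print(dist["a", "d"])  # 4, via a-b-c-d, shorter than the direct edge of 7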
Outline
1 Introduction
2 A Preliminary Model of Clustering
3 Metric Space
4 Our First Model of Clustering
5 Center-based Clustering
6 Complexity of Clustering Problems
Distances to clustering
C is a set of points
Diameter of C: ∆(C) = max_{x,y∈C} d(x, y)
A measure of the goodness of a cluster
A partition of X , Π(X ) = {C1 , C2 , . . . , Ck }, such that
Ci ⊂ X ∀i
Ci ∩ Cj = ∅ ∀i ≠ j
Π(X ) is a cover: ∪_{i=1}^{k} Ci = X
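These definitions translate directly into code. A small Python sketch (illustrative, using a toy one-dimensional metric) that computes ∆(C) and checks the partition conditions:

    import itertools

    def diameter(C, d):
        """Delta(C) = max over pairs x, y in C of d(x, y)."""
        return max((d(x, y) for x, y in itertools.combinations(C, 2)), default=0)

    def is_partition(X, parts):
        """Check the partition conditions: nonempty, disjoint parts covering X."""
        seen = set()
        for C in parts:
            if not C or not set(C) <= set(X) or seen & set(C):
                return False
            seen |= set(C)
        return seen == set(X)

    X = [0, 1, 2, 10, 11]
    d = lambda x, y: abs(x - y)
    parts = [{0, 1, 2}, {10, 11}]
    print(is_partition(X, parts))            # True
    print([diameter(C, d) for C in parts])   # [2, 1]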
Measuring Goodness via Cost
Cost of a partition Π(X ): Cost(Π(X )) = max_{1≤i≤k} ∆(Ci )
k -partition problem: Given a metric space (X , d), find a
partition Π(X ) of size k that minimizes Cost(Π(X ))
Why is k needed?
It is an example of model selection
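The slide defines the problem but no algorithm, so here is a brute-force sketch that tries every assignment of points to k clusters. It reuses diameter() from the previous sketch and is only feasible for tiny inputs, foreshadowing the hardness discussion later.

    import itertools

    def best_k_partition(X, d, k):
        """Brute force over all k^n cluster assignments (tiny inputs only),
        minimizing the maximum cluster diameter."""
        best, best_parts = float("inf"), None
        for labels in itertools.product(range(k), repeat=len(X)):
            parts = [[x for x, l in zip(X, labels) if l == i] for i in range(k)]
            if any(not P for P in parts):
                continue            # require k nonempty clusters
            cost = max(diameter(P, d) for P in parts)
            if cost < best:
                best, best_parts = cost, parts
        return best, best_parts

    X = [0, 1, 2, 10, 11]
    print(best_k_partition(X, lambda x, y: abs(x - y), 2))
    # (2, [[0, 1, 2], [10, 11]])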
Cluster Representatives
The center of a cluster serves as its representative,
enabling data compression/summarization
Should the center be in X ?
Key patterns of variation in brain scans: Yes
Words that capture different topics: No (continuous)
We use a universe U: X ⊂ U, and centers are also in U
(discrete) centers are from X = U
(continuous) centers are from U and not necessarily in X
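The discrete/continuous distinction is easy to see numerically. The sketch below (illustrative, assuming a squared Euclidean cost) compares the continuous optimum, the centroid, with the best discrete center chosen from X itself; the discrete choice can never be cheaper.

    import numpy as np

    pts = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])

    def sq_cost(center):
        """Sum of squared Euclidean distances from all points to the center."""
        return float(((pts - center) ** 2).sum())

    # Continuous: the centroid minimizes this cost over all of U = R^2,
    # but it need not be a data point
    centroid = pts.mean(axis=0)

    # Discrete: restrict the center to X itself (a "medoid")
    medoid = min(pts, key=sq_cost)

    print(centroid, sq_cost(centroid))   # [0.667 0.667], cost ~ 5.33
    print(medoid, sq_cost(medoid))       # [0. 0.], cost 8.0 (never better)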
Outline
1 Introduction
2 A Preliminary Model of Clustering
3 Metric Space
4 Our First Model of Clustering
5 Center-based Clustering
6 Complexity of Clustering Problems
Center-Based Clustering: Voronoi property
Figure: 3-cluster example with centers c1, c2, c3; each point is assigned to its nearest center
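The Voronoi property says each point joins the cluster of its nearest center. A minimal vectorized sketch (the centers c1, c2, c3 below are hypothetical):

    import numpy as np

    def assign(points, centers):
        """Voronoi assignment: each point joins its nearest center's cluster."""
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)   # index of the nearest center per point

    rng = np.random.default_rng(1)
    points = rng.normal(size=(8, 2))
    centers = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 1.5]])  # c1, c2, c3
    print(assign(points, centers))  # cluster label in {0, 1, 2} per point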
k -means Clustering
Figure: five points p1, . . . , p5 at distances 10, 12, 25, 15, 13 from a single center c1
Cost(p1) = 10², Cost(p2) = 12², . . . , Cost(p5) = 13²
Choose a set of cluster centers to minimize the sum of
point costs
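Reading the five distances off the figure, the single-center cost is the sum of the squared distances:

cost({c1}) = 10² + 12² + 25² + 15² + 13² = 100 + 144 + 625 + 225 + 169 = 1263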
k -means Clustering
Given a set X of n points in the metric space (U, d)
Find a set C of k points (cluster centers) in U that
minimizes
cost(C) = ∑_p d(p, NearestCenter(p))²
Popular Clustering Objectives
Find a set C of k points (cluster centers) in U that minimizes
k -means: cost(C) = ∑_p d(p, NearestCenter(p))²
k -median: cost(C) = ∑_p d(p, NearestCenter(p))
k -center: cost(C) = max_p d(p, NearestCenter(p))
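For a fixed center set C, the three objectives differ only in how the per-point distances are aggregated: sum of squares, sum, or max. A sketch assuming Euclidean distance and an arbitrary choice of centers:

    import numpy as np

    def costs(points, centers):
        """k-means, k-median, and k-center objectives for a fixed center set C."""
        d = np.sqrt(((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
        nearest = d.min(axis=1)           # d(p, NearestCenter(p)) per point
        return {"k-means":  float((nearest ** 2).sum()),
                "k-median": float(nearest.sum()),
                "k-center": float(nearest.max())}

    rng = np.random.default_rng(2)
    points = rng.normal(size=(20, 2))
    centers = points[:3]                  # any fixed choice of k = 3 centers
    print(costs(points, centers))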
Outline
1 Introduction
2 A Preliminary Model of Clustering
3 Metric Space
4 Our First Model of Clustering
5 Center-based Clustering
6 Complexity of Clustering Problems
Finding the Best Clustering
All these problems are NP-hard
Solving exactly
Discrete: |X | = |U| = n; pick the best k centers in X in
n^{O(k)} time
Continuous: pick the best k centers in U in |U|^{O(k)} time
We can solve more efficiently if we are allowed to have some
error in our solution
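The discrete n^{O(k)} brute force is short to write down: enumerate all C(n, k) center subsets and keep the cheapest. This sketch (illustrative code, not from the slides) uses the k-means objective:

    import itertools
    import numpy as np

    def exact_discrete_kmeans(points, k):
        """Exact discrete k-means: try all C(n, k) center subsets of X,
        the n^O(k) brute force mentioned above."""
        best, best_C = float("inf"), None
        for C in itertools.combinations(range(len(points)), k):
            centers = points[list(C)]
            d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            cost = float(d2.min(axis=1).sum())
            if cost < best:
                best, best_C = cost, C
        return best, best_C

    rng = np.random.default_rng(3)
    points = rng.normal(size=(12, 2))
    print(exact_discrete_kmeans(points, k=2))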
Coping with NP-hardness
Heuristics
Simple (easy to implement)
Very time-efficient
Work well in practice (a classic example is sketched below)
No quality control
Approximation Algorithms
Simple most of the time
Time-efficient
Work well in general
Quality control
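The slide names no specific heuristic; the classic example for k-means is Lloyd's algorithm, sketched below. It is simple and fast but converges only to a local optimum, which is exactly the "no quality control" caveat.

    import numpy as np

    def lloyd(points, k, iters=50, seed=0):
        """Lloyd's heuristic for k-means: alternate nearest-center
        assignment and centroid recomputation until convergence.
        Only a local optimum is guaranteed."""
        rng = np.random.default_rng(seed)
        centers = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iters):
            # assignment step: each point goes to its nearest center
            d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # update step: each center moves to the mean of its cluster
            new = np.array([points[labels == i].mean(axis=0)
                            if (labels == i).any() else centers[i]
                            for i in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
        return centers, labels

    rng = np.random.default_rng(5)
    points = rng.normal(size=(30, 2))
    centers, labels = lloyd(points, k=3)
    print(centers)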
Approximation Algorithms
α-approximation algorithm: the cost is within an α factor of the
minimum cost M, i.e., our cost ≤ α · M
A 1-approximation is optimal
A 2-approximation allows up to 100% error
Approximation scheme
Approximation to any desired precision: our cost
≤ (1 + ε) · M, for any ε > 0
The error is controlled
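To make the α factor concrete: Gonzalez's farthest-first traversal is a well-known 2-approximation for k-center (not covered on this slide). The sketch below runs it on random points and compares against the brute-force discrete optimum; the ratio never exceeds 2.

    import itertools
    import numpy as np

    def gonzalez(points, k):
        """Farthest-first traversal: a classical 2-approximation for k-center."""
        centers = [0]                          # start from an arbitrary point
        d = np.linalg.norm(points - points[0], axis=1)
        for _ in range(k - 1):
            nxt = int(d.argmax())              # farthest point from chosen centers
            centers.append(nxt)
            d = np.minimum(d, np.linalg.norm(points - points[nxt], axis=1))
        return points[centers]

    def kcenter_cost(points, centers):
        """max over points of the distance to the nearest center."""
        d = np.sqrt(((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
        return float(d.min(axis=1).max())

    rng = np.random.default_rng(4)
    points = rng.normal(size=(15, 2))
    approx = kcenter_cost(points, gonzalez(points, 3))
    opt = min(kcenter_cost(points, points[list(C)])
              for C in itertools.combinations(range(len(points)), 3))
    print(approx / opt <= 2.0)   # True: the empirical ratio alpha is at most 2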