
Introduction to Clustering

TOP: Data Clustering


Instructor: Sayan Bandyapadhyay
Portland State University
Outline

1 Introduction

2 A Preliminary Model of Clustering

3 Metric Space

4 Our First Model of Clustering

5 Center-based Clustering

6 Complexity of Clustering Problems


Clustering of a Social Network

Dividing customers into similar groups


Applications

Grouping of genes and proteins, and cancer and tumor detection in Biology
Speech recognition and text summarization in Natural Language Processing
Grouping images and image segmentation in Computer Vision
Collaborative filtering
Data summarization
Dynamic trend detection
Social network analysis
Unsupervised learning
Unsupervised learning

Building a classifier to identify images of cats and dogs


The ML Pipeline

[Figure: the ML pipeline; training samples enter the training stage (ML model / clustering algorithm), and the trained model produces predictions as output]

Training the classifier: feature engineering/labeling of samples


2 A Preliminary Model of Clustering


The Mapping

(F1, F2, . . . , Fd)

Each profile is mapped to a point in real space; a clustering is then a partition of the points in real space.
Drawback: Abstract Data Types

Not all data can be represented in numerical form

Categorical data: gender, address
Text data
Biological data: gene expressions, Gene Ontology annotations

We will try to represent data in an abstract way

3 Metric Space


Metric

X is a set of points
A function d : X × X → R+ is said to be a distance metric if it has the following properties:

Reflexivity: ∀x, y ∈ X, d(x, y) = 0 ⇔ x = y
Symmetry: ∀x, y ∈ X, d(x, y) = d(y, x)
Triangle Inequality: ∀x, y, z ∈ X, d(x, y) ≤ d(x, z) + d(z, y)

[Figure: triangle on x, y, z with d1 = d(x, z) and d2 = d(z, y); the direct distance satisfies d(x, y) ≤ d1 + d2]

X along with the metric d is called a metric space (X, d)
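To make the definition concrete, here is a minimal Python sketch (the function name is ours, not from the lecture) that empirically checks the three axioms for a candidate distance function on a finite point set:

```python
from itertools import product

def is_metric(points, d, tol=1e-9):
    """Empirically check the three metric axioms on a finite point set."""
    for x, y in product(points, repeat=2):
        # Reflexivity: d(x, y) = 0 if and only if x = y
        if (abs(d(x, y)) < tol) != (x == y):
            return False
        # Symmetry: d(x, y) = d(y, x)
        if abs(d(x, y) - d(y, x)) > tol:
            return False
    for x, y, z in product(points, repeat=3):
        # Triangle inequality: d(x, y) <= d(x, z) + d(z, y)
        if d(x, y) > d(x, z) + d(z, y) + tol:
            return False
    return True

# Absolute difference on the reals satisfies all three axioms
print(is_metric([0.0, 1.0, 3.5, 7.0], lambda a, b: abs(a - b)))  # True
```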


The Idea of Metric Spaces

[Figure: ball B(v, r) with center v and radius r]

For any two points w, w′ ∈ B(v, r), the triangle inequality gives d(w, w′) ≤ d(w, v) + d(v, w′) ≤ 2r.
Diameter of B(v, r) is ≤ 2r
Examples

Euclidean distance: X = R^d, d(x, y) = √( Σ_{i=1}^{d} (x_i − y_i)² )
Manhattan distance: X = R^d, d(x, y) = Σ_{i=1}^{d} |x_i − y_i|
X = Σ* is the set of finite-length strings over an alphabet Σ, and d is the edit distance
X is the set of vertices of a graph G, and d is the shortest-path distance
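The two R^d metrics are direct to implement; a small sketch with our own helper names:

```python
import math

def euclidean(x, y):
    """Euclidean (l2) distance between two points in R^d."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    """Manhattan (l1) distance between two points in R^d."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```
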
4 Our First Model of Clustering


Distances to Clustering

C is a set of points
Diameter of C: ∆(C) = max_{x,y ∈ C} d(x, y)
The diameter is a measure of the goodness of a cluster

A partition of X, Π(X) = {C1, C2, . . . , Ck}, is such that

Ci ⊆ X for all i
Ci ∩ Cj = ∅ for all i ≠ j
Π(X) is a cover: ∪_{i=1}^{k} Ci = X
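A minimal sketch (helper names are ours) of the diameter ∆(C) and a validity check for a partition of a finite set X:

```python
from itertools import combinations

def diameter(C, d):
    """∆(C): the largest pairwise distance within cluster C."""
    return max((d(x, y) for x, y in combinations(C, 2)), default=0.0)

def is_partition(X, clusters):
    """Check that the clusters are pairwise disjoint and cover X."""
    seen = set()
    for C in clusters:
        if seen & C:       # overlaps an earlier cluster
            return False
        seen |= C
    return seen == X       # the clusters form a cover of X

X = {1, 2, 5, 6}
Pi = [{1, 2}, {5, 6}]
d = lambda a, b: abs(a - b)
print(is_partition(X, Pi))           # True
print([diameter(C, d) for C in Pi])  # [1, 1]
```
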
Measuring Goodness via Cost

Cost of a partition Π(X): Cost(Π(X)) = max_{i=1}^{k} ∆(Ci)

k-partition problem: Given a metric space (X, d), find a partition Π(X) of size k that minimizes Cost(Π(X))

Why is k needed? Choosing k is an example of model selection
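Building on the sketch above (the diameter helper is repeated so the snippet stands alone), the cost of a partition is its worst cluster diameter:

```python
from itertools import combinations

def diameter(C, d):
    """∆(C): the largest pairwise distance within cluster C."""
    return max((d(x, y) for x, y in combinations(C, 2)), default=0.0)

def partition_cost(clusters, d):
    """Cost(Π(X)): the maximum cluster diameter over the partition."""
    return max(diameter(C, d) for C in clusters)

d = lambda a, b: abs(a - b)
print(partition_cost([{1, 2}, {5, 6, 9}], d))  # 4
```
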
Cluster Representatives

The center of a cluster serves as its representative, and is useful for data compression/summarization.

Should the center be in X?

Key patterns of variation in brain scans: yes
Words that capture different topics: no (continuous)

We use a universe U with X ⊆ U, and centers are also chosen from U:

Discrete: centers are from X = U
Continuous: centers are from U and not necessarily in X
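To illustrate the discrete/continuous distinction, a small sketch (our own helper names, assuming Euclidean distance) contrasting a center restricted to X with a mean center that need not lie in X:

```python
def discrete_center(X, d):
    """Center restricted to X: the point of X minimizing total distance to X."""
    return min(X, key=lambda c: sum(d(c, p) for p in X))

def continuous_center(X):
    """Center from all of R^d: the coordinate-wise mean (may lie outside X)."""
    n = len(X)
    return tuple(sum(p[i] for p in X) / n for i in range(len(X[0])))

d = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
X = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
print(discrete_center(X, d))   # (0.0, 0.0): a point of X
print(continuous_center(X))    # (0.666..., 0.666...): not a point of X
```
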
5 Center-based Clustering


Center-Based Clustering: Voronoi property

[Figure: 3-cluster example; each point is assigned to its nearest center among c1, c2, c3]


k-means Clustering

[Figure: center c1 with five points p1, . . . , p5 at distances 10, 12, 25, 15, 13 from c1]

Cost(p1) = 10², Cost(p2) = 12², . . . , Cost(p5) = 13²

Choose a set of cluster centers to minimize the sum of point costs.

Given a set X of n points in the metric space (U, d), find a set C of k points (cluster centers) in U that minimizes

cost(C) = Σ_p d(p, NearestCenter(p))²
Popular Clustering Objectives

Find a set C of k points (cluster centers) in U that minimizes:

k-means: cost(C) = Σ_p d(p, NearestCenter(p))²
k-median: cost(C) = Σ_p d(p, NearestCenter(p))
k-center: cost(C) = max_p d(p, NearestCenter(p))
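A minimal sketch (our names) that evaluates all three objectives for a fixed center set C in R^d under Euclidean distance:

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def nearest_center_dist(p, C, d):
    """Distance from p to its nearest center in C."""
    return min(d(p, c) for c in C)

def kmeans_cost(X, C, d):
    return sum(nearest_center_dist(p, C, d) ** 2 for p in X)

def kmedian_cost(X, C, d):
    return sum(nearest_center_dist(p, C, d) for p in X)

def kcenter_cost(X, C, d):
    return max(nearest_center_dist(p, C, d) for p in X)

X = [(0, 0), (1, 0), (10, 0), (11, 0)]
C = [(0.5, 0), (10.5, 0)]
print(kmeans_cost(X, C, euclidean),
      kmedian_cost(X, C, euclidean),
      kcenter_cost(X, C, euclidean))  # 1.0 2.0 0.5
```
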
6 Complexity of Clustering Problems


Finding the Best Clustering

All these problems are NP-hard

Solving exactly:

Discrete: |X| = |U| = n. Pick the best k centers in X, in n^{O(k)} time
Continuous: Pick the best k centers in U, in |U|^{O(k)} time

We can solve more efficiently if we are allowed to have some error in our solution
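For intuition on the n^{O(k)} bound in the discrete case, a brute-force sketch (our names, using the k-center objective as an example) that tries every size-k subset of X as the center set:

```python
from itertools import combinations

def kcenter_cost(X, C, d):
    return max(min(d(p, c) for c in C) for p in X)

def best_discrete_centers(X, k, d):
    """Exhaustively try all (n choose k) center sets: n^{O(k)} time."""
    return min(combinations(X, k), key=lambda C: kcenter_cost(X, C, d))

d = lambda a, b: abs(a - b)
X = [0, 1, 2, 10, 11, 12]
print(best_discrete_centers(X, 2, d))  # (1, 11), with cost 1
```
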
Coping with NP-hardness

Heuristics:

Simple (easy to implement)
Very time-efficient
Work well in practice
No quality control

Approximation Algorithms:

Simple most of the time
Time-efficient
Work well in general
Quality control
Approximation Algorithms

α-approximation algorithm: the cost is within an α factor of the minimum: for minimum cost M, our cost ≤ α · M

A 1-approximation is optimal
A 2-approximation allows up to 100% error

Approximation scheme: approximation to any desired precision: our cost ≤ (1 + ε) · M, for any ε > 0; the error is controlled
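As a concrete example of an approximation algorithm with a quality guarantee, here is a sketch of farthest-first traversal (Gonzalez' algorithm, not covered on these slides), which is known to give a 2-approximation for the k-center objective:

```python
def farthest_first(X, k, d):
    """Gonzalez' farthest-first traversal: a 2-approximation for k-center.

    Start from an arbitrary point, then repeatedly add the point
    farthest from the current set of centers.
    """
    centers = [X[0]]
    while len(centers) < k:
        # Pick the point whose nearest-center distance is largest
        farthest = max(X, key=lambda p: min(d(p, c) for c in centers))
        centers.append(farthest)
    return centers

d = lambda a, b: abs(a - b)
X = [0, 1, 2, 10, 11, 12]
print(farthest_first(X, 2, d))  # [0, 12]: cost 2, within twice the optimum of 1
```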
