Introduction to Clustering
TOP: Data Clustering
Instructor: Sayan Bandyapadhyay
Portland State University
Outline
1 Introduction
2 A Preliminary Model of Clustering
3 Metric Space
4 Our First Model of Clustering
5 Center-based Clustering
6 Complexity of Clustering Problems
Clustering of a Social Network
Dividing customers into similar groups
Applications
grouping of genes and proteins, and cancer and tumor
detection in Biology
speech recognition, and text summarization in Natural
Language Processing
grouping images, and image segmentation in Computer
Vision
Collaborative filtering
Data summarization
Dynamic trend detection
Social network analysis
Unsupervised learning
Building a classifier to identify images of cats and dogs
The ML Pipeline
Figure: the ML pipeline; training samples feed an ML algorithm, which produces a model that makes output predictions (training stage)
Training the classifier: feature engineering/labeling of samples
Outline
1 Introduction
2 A Preliminary Model of Clustering
3 Metric Space
4 Our First Model of Clustering
5 Center-based Clustering
6 Complexity of Clustering Problems
The Mapping
Each profile maps to a feature vector (F1, F2, . . . , Fd), a point in real space
Figure: points in real space, and a partition of the points into clusters
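As a concrete illustration of this mapping, here is a minimal Python sketch (not from the slides; the records and field names are invented for the example) that turns raw customer records into points in R^3:

    import numpy as np

    # Hypothetical customer records with three numerical features
    records = [
        {"age": 34, "visits": 12, "spend": 250.0},
        {"age": 51, "visits": 3,  "spend": 90.0},
        {"age": 29, "visits": 15, "spend": 310.0},
    ]

    # Profile (F1, F2, ..., Fd): here d = 3, one coordinate per feature
    X = np.array([[r["age"], r["visits"], r["spend"]] for r in records])
    print(X.shape)  # (3, 3): three profiles as points in R^3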
Drawback: Abstract Data Types
Not all data can be represented in numerical form
Categorical data: Gender, Address
Text data
Biological data: Gene expressions, Gene ontology
annotations
We will try to represent data in an abstract way
Outline
1 Introduction
2 A Preliminary Model of Clustering
3 Metric Space
4 Our First Model of Clustering
5 Center-based Clustering
6 Complexity of Clustering Problems
Metric
X is a set of points
A function d : X × X → R+ is said to be a distance metric if it
has the following properties:
Reflexivity: ∀x, y ∈ X , d(x, y) = 0 ⇐⇒ x = y
Symmetry: ∀x, y ∈ X , d(x, y) = d(y, x)
Triangle Inequality: ∀x, y, z ∈ X , d(x, y) ≤ d(x, z) + d(z, y)
Figure: triangle on x, y, z with d(x, z) = d1, d(z, y) = d2, and d(x, y) ≤ d1 + d2
X along with the metric d is called a metric space (X , d)
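The axioms are easy to check mechanically on a finite set. Below is a minimal Python sketch (not from the slides) that brute-forces the three properties; it also shows that squared distance is not a metric, because squaring breaks the triangle inequality.

    import itertools

    def is_metric(X, d, tol=1e-9):
        """Check reflexivity, symmetry, and the triangle inequality of d
        on the finite point set X (a brute-force sanity check)."""
        for x, y in itertools.product(X, repeat=2):
            if (d(x, y) == 0) != (x == y):          # d(x, y) = 0  iff  x = y
                return False
            if abs(d(x, y) - d(y, x)) > tol:        # symmetry
                return False
        for x, y, z in itertools.product(X, repeat=3):
            if d(x, y) > d(x, z) + d(z, y) + tol:   # triangle inequality
                return False
        return True

    X = [0.0, 1.5, 4.0]
    print(is_metric(X, lambda x, y: abs(x - y)))   # True: |x - y| is a metric
    print(is_metric(X, lambda x, y: (x - y)**2))   # False: d(0,4) = 16 exceeds
                                                   # d(0,1.5) + d(1.5,4) = 8.5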
The Idea of Metric Spaces
Figure: the ball B(v , r ) with center v and radius r
For any w, w′ ∈ B(v , r ), the triangle inequality gives d(w, w′) ≤ d(w, v ) + d(v , w′) ≤ 2r
Diameter of B(v , r ) is ≤ 2r
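A quick empirical sanity check of this bound, sketched in Python under the assumption of random points in the plane:

    import numpy as np

    rng = np.random.default_rng(0)
    pts = rng.normal(size=(500, 2))          # sample points in the plane
    v, r = np.zeros(2), 1.0

    # The ball B(v, r): all sampled points within distance r of the center v
    ball = pts[np.linalg.norm(pts - v, axis=1) <= r]

    # Pairwise distances inside the ball; the triangle inequality forces
    # every one of them to be <= 2r
    diffs = ball[:, None, :] - ball[None, :, :]
    diam = np.linalg.norm(diffs, axis=2).max()
    print(diam <= 2 * r)   # True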
Examples
Euclidean distance: X = R^d , d(x, y) = √(∑_{i=1}^{d} (x_i − y_i)²)
Manhattan distance: X = R^d , d(x, y) = ∑_{i=1}^{d} |x_i − y_i|
X = Σ∗ is the set of finite-length strings over an alphabet Σ; d is the edit distance
X is a set of vertices in a graph G; d is the shortest-path distance
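The last two metrics are less standard than the first two, so here are small self-contained sketches: the classical dynamic-programming edit distance, and shortest-path distance via Floyd–Warshall on a toy graph (both are textbook algorithms, not code from the course).

    def edit_distance(s, t):
        """Levenshtein distance: a metric on finite strings."""
        m, n = len(s), len(t)
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1): D[i][0] = i
        for j in range(n + 1): D[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                D[i][j] = min(D[i-1][j] + 1, D[i][j-1] + 1,
                              D[i-1][j-1] + (s[i-1] != t[j-1]))
        return D[m][n]

    print(edit_distance("kitten", "sitting"))  # 3

    # Shortest-path distance in an undirected weighted graph (Floyd-Warshall)
    INF = float("inf")
    V = ["a", "b", "c", "d"]
    w = {("a", "b"): 1, ("b", "c"): 2, ("a", "d"): 7, ("c", "d"): 1}
    dist = {(u, v): 0 if u == v else INF for u in V for v in V}
    for (u, v), wt in w.items():
        dist[u, v] = dist[v, u] = wt
    for k in V:                      # relax through each intermediate vertex
        for u in V:
            for v in V:
                dist[u, v] = min(dist[u, v], dist[u, k] + dist[k, v])
    print(dist["a", "d"])  # 4, via a-b-c-d, shorter than the direct edge of 7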
Outline
1 Introduction
2 A Preliminary Model of Clustering
3 Metric Space
4 Our First Model of Clustering
5 Center-based Clustering
6 Complexity of Clustering Problems
Distances to clustering
C is a set of points
Diameter of C: ∆(C) = max_{x,y∈C} d(x, y)
A measure of the goodness of a cluster
A partition of X , Π(X ) = {C1 , C2 , . . . , Ck }, such that
Ci ⊂ X ∀i
Ci ∩ Cj = ∅ ∀i ≠ j
Π(X ) is a cover: ∪_{i=1}^{k} Ci = X
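These definitions translate directly into code. A small Python sketch (illustrative, using a toy one-dimensional metric) that computes ∆(C) and checks the partition conditions:

    import itertools

    def diameter(C, d):
        """Delta(C) = max over pairs x, y in C of d(x, y)."""
        return max((d(x, y) for x, y in itertools.combinations(C, 2)), default=0)

    def is_partition(X, parts):
        """Check the partition conditions: nonempty, disjoint parts covering X."""
        seen = set()
        for C in parts:
            if not C or not set(C) <= set(X) or seen & set(C):
                return False
            seen |= set(C)
        return seen == set(X)

    X = [0, 1, 2, 10, 11]
    d = lambda x, y: abs(x - y)
    parts = [{0, 1, 2}, {10, 11}]
    print(is_partition(X, parts))            # True
    print([diameter(C, d) for C in parts])   # [2, 1]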
Measuring Goodness via Cost
Cost of a partition Π(X ): Cost(Π(X )) = max_{1≤i≤k} ∆(Ci )
k -partition problem: Given a metric space (X , d), find a
partition Π(X ) of size k that minimizes Cost(Π(X ))
Why is k needed?
It is an example of model selection
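The slide defines the problem but no algorithm, so here is a brute-force sketch that tries every assignment of points to k clusters. It reuses diameter() from the previous sketch and is only feasible for tiny inputs, foreshadowing the hardness discussion later.

    import itertools

    def best_k_partition(X, d, k):
        """Brute force over all k^n cluster assignments (tiny inputs only),
        minimizing the maximum cluster diameter."""
        best, best_parts = float("inf"), None
        for labels in itertools.product(range(k), repeat=len(X)):
            parts = [[x for x, l in zip(X, labels) if l == i] for i in range(k)]
            if any(not P for P in parts):
                continue            # require k nonempty clusters
            cost = max(diameter(P, d) for P in parts)
            if cost < best:
                best, best_parts = cost, parts
        return best, best_parts

    X = [0, 1, 2, 10, 11]
    print(best_k_partition(X, lambda x, y: abs(x - y), 2))
    # (2, [[0, 1, 2], [10, 11]])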
Cluster Representatives
The center of a cluster serves as its representative,
enabling data compression/summarization
Should the center be in X ?
Key patterns of variation in brain scans: Yes
Words that capture different topics: No (continuous)
We use a universe U: X ⊂ U, and centers are also in U
(discrete) centers are from X = U
(continuous) centers are from U and not necessarily in X
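The discrete/continuous distinction is easy to see numerically. The sketch below (illustrative, assuming a squared Euclidean cost) compares the continuous optimum, the centroid, with the best discrete center chosen from X itself; the discrete choice can never be cheaper.

    import numpy as np

    pts = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])

    def sq_cost(center):
        """Sum of squared Euclidean distances from all points to the center."""
        return float(((pts - center) ** 2).sum())

    # Continuous: the centroid minimizes this cost over all of U = R^2,
    # but it need not be a data point
    centroid = pts.mean(axis=0)

    # Discrete: restrict the center to X itself (a "medoid")
    medoid = min(pts, key=sq_cost)

    print(centroid, sq_cost(centroid))   # [0.667 0.667], cost ~ 5.33
    print(medoid, sq_cost(medoid))       # [0. 0.], cost 8.0 (never better)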
Outline
1 Introduction
2 A Preliminary Model of Clustering
3 Metric Space
4 Our First Model of Clustering
5 Center-based Clustering
6 Complexity of Clustering Problems
Center-Based Clustering: Voronoi property
Figure: 3-cluster example with centers c1, c2, c3; each point is assigned to its nearest center
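The Voronoi property says each point joins the cluster of its nearest center. A minimal vectorized sketch (the centers c1, c2, c3 below are hypothetical):

    import numpy as np

    def assign(points, centers):
        """Voronoi assignment: each point joins its nearest center's cluster."""
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)   # index of the nearest center per point

    rng = np.random.default_rng(1)
    points = rng.normal(size=(8, 2))
    centers = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 1.5]])  # c1, c2, c3
    print(assign(points, centers))  # cluster label in {0, 1, 2} per point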
k -means Clustering
Figure: five points p1, . . . , p5 at distances 10, 12, 25, 15, 13 from a single center c1
Cost(p1) = 10², Cost(p2) = 12², . . . , Cost(p5) = 13²
Choose a set of cluster centers to minimize the sum of
point costs
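Reading the five distances off the figure, the single-center cost is the sum of the squared distances:

cost({c1}) = 10² + 12² + 25² + 15² + 13² = 100 + 144 + 625 + 225 + 169 = 1263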
k -means Clustering
Given a set X of n points in the metric space (U, d)
Find a set C of k points (cluster centers) in U that
minimizes
cost(C) = ∑_p d(p, NearestCenter(p))²
Popular Clustering Objectives
Find a set C of k points (cluster centers) in U that minimizes
k -means: cost(C) = ∑_p d(p, NearestCenter(p))²
k -median: cost(C) = ∑_p d(p, NearestCenter(p))
k -center: cost(C) = max_p d(p, NearestCenter(p))
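For a fixed center set C, the three objectives differ only in how the per-point distances are aggregated: sum of squares, sum, or max. A sketch assuming Euclidean distance and an arbitrary choice of centers:

    import numpy as np

    def costs(points, centers):
        """k-means, k-median, and k-center objectives for a fixed center set C."""
        d = np.sqrt(((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
        nearest = d.min(axis=1)           # d(p, NearestCenter(p)) per point
        return {"k-means":  float((nearest ** 2).sum()),
                "k-median": float(nearest.sum()),
                "k-center": float(nearest.max())}

    rng = np.random.default_rng(2)
    points = rng.normal(size=(20, 2))
    centers = points[:3]                  # any fixed choice of k = 3 centers
    print(costs(points, centers))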
Outline
1 Introduction
2 A Preliminary Model of Clustering
3 Metric Space
4 Our First Model of Clustering
5 Center-based Clustering
6 Complexity of Clustering Problems
Finding the Best Clustering
All these problems are NP-hard
Solving exactly
Discrete: |X | = |U| = n; pick the best k centers in X in
n^{O(k)} time
Continuous: pick the best k centers in U in |U|^{O(k)} time
We can solve more efficiently if we are allowed to have some
error in our solution
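The discrete n^{O(k)} brute force is short to write down: enumerate all C(n, k) center subsets and keep the cheapest. This sketch (illustrative code, not from the slides) uses the k-means objective:

    import itertools
    import numpy as np

    def exact_discrete_kmeans(points, k):
        """Exact discrete k-means: try all C(n, k) center subsets of X,
        the n^O(k) brute force mentioned above."""
        best, best_C = float("inf"), None
        for C in itertools.combinations(range(len(points)), k):
            centers = points[list(C)]
            d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            cost = float(d2.min(axis=1).sum())
            if cost < best:
                best, best_C = cost, C
        return best, best_C

    rng = np.random.default_rng(3)
    points = rng.normal(size=(12, 2))
    print(exact_discrete_kmeans(points, k=2))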
Coping with NP-hardness
Heuristics
Simple (easy to implement)
Very time-efficient
Work well in practice (a classic example is sketched below)
No quality control
Approximation Algorithms
Simple most of the time
Time-efficient
Work well in general
Quality control
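The slide names no specific heuristic; the classic example for k-means is Lloyd's algorithm, sketched below. It is simple and fast but converges only to a local optimum, which is exactly the "no quality control" caveat.

    import numpy as np

    def lloyd(points, k, iters=50, seed=0):
        """Lloyd's heuristic for k-means: alternate nearest-center
        assignment and centroid recomputation until convergence.
        Only a local optimum is guaranteed."""
        rng = np.random.default_rng(seed)
        centers = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iters):
            # assignment step: each point goes to its nearest center
            d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # update step: each center moves to the mean of its cluster
            new = np.array([points[labels == i].mean(axis=0)
                            if (labels == i).any() else centers[i]
                            for i in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
        return centers, labels

    rng = np.random.default_rng(5)
    points = rng.normal(size=(30, 2))
    centers, labels = lloyd(points, k=3)
    print(centers)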
Approximation Algorithms
α-approximation algorithm: the cost is within an α factor of the
minimum cost M, i.e., our cost ≤ α · M
A 1-approximation is optimal
A 2-approximation allows up to 100% error
Approximation scheme
Approximation to any desired precision: our cost
≤ (1 + ε) · M, for any ε > 0
The error is controlled
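To make the α factor concrete: Gonzalez's farthest-first traversal is a well-known 2-approximation for k-center (not covered on this slide). The sketch below runs it on random points and compares against the brute-force discrete optimum; the ratio never exceeds 2.

    import itertools
    import numpy as np

    def gonzalez(points, k):
        """Farthest-first traversal: a classical 2-approximation for k-center."""
        centers = [0]                          # start from an arbitrary point
        d = np.linalg.norm(points - points[0], axis=1)
        for _ in range(k - 1):
            nxt = int(d.argmax())              # farthest point from chosen centers
            centers.append(nxt)
            d = np.minimum(d, np.linalg.norm(points - points[nxt], axis=1))
        return points[centers]

    def kcenter_cost(points, centers):
        """max over points of the distance to the nearest center."""
        d = np.sqrt(((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2))
        return float(d.min(axis=1).max())

    rng = np.random.default_rng(4)
    points = rng.normal(size=(15, 2))
    approx = kcenter_cost(points, gonzalez(points, 3))
    opt = min(kcenter_cost(points, points[list(C)])
              for C in itertools.combinations(range(len(points)), 3))
    print(approx / opt <= 2.0)   # True: the empirical ratio alpha is at most 2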