Clustering

Some material borrowed from course materials of


Andrew Ng
Unsupervised learning
• Given a set of unlabeled data points / items
• Find patterns or structure in the data
• Clustering: automatically group the data points /
items into groups of ‘similar’ or ‘related’ points
Motivations for Clustering
• Understanding the data better
– Grouping Web search results into clusters, each of which
captures a particular aspect of the query
– Segment the market or customers of a service
• As precursor for some other application
– Summarization and data compression
– Recommendation
Clustering for Data Understanding and Applications
• Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus
and species
• Information retrieval: document clustering
• Land use: Identification of areas of similar land use in an earth observation
database
• Marketing: Help marketers discover distinct groups in their customer bases, and
then use this knowledge to develop targeted marketing programs
• City-planning: Identifying groups of houses according to their house type, value,
and geographical location
• Earthquake studies: Observed earthquake epicenters should be clustered
along continent faults
• Climate: understanding Earth's climate, finding patterns in atmospheric and
ocean data
• Economic Science: market research
Clustering as a Preprocessing Tool (Utility)
• Summarization:
– Preprocessing for regression, PCA, classification, and association
analysis
• Compression:
– Image processing: vector quantization
• Finding K-nearest Neighbors
– Localizing search to one or a small number of clusters
• Outlier detection
– Outliers are often viewed as those “far away” from any cluster
Different types of clustering
• Partitional (Hard Clustering)
– Divide set of items into non-overlapping subsets
– Each item will be a member of exactly one subset

• Overlapping (Soft Clustering)


– Divide set of items into potentially overlapping subsets
– Each item can simultaneously belong to multiple subsets
Partitional Clustering (Hard)

Original Points → A Partitional Clustering

Overlapping (Soft Clustering)
Different types of clustering
• Fuzzy (Soft Clustering)
– Every item belongs to every cluster with a membership
weight between 0 (absolutely does not belong) and 1
(absolutely belongs)
– Usual constraint: the sum of weights for each individual item
should be 1
– Convert to partitional clustering: assign every item to the
cluster for which its membership weight is highest (a minimal
sketch follows below)
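A minimal sketch of this conversion, assuming NumPy; the membership-weight matrix W below is purely illustrative (rows are items, columns are clusters):

import numpy as np

# Hypothetical fuzzy membership matrix; each row sums to 1, per the usual constraint.
W = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
])

# Convert to a partitional (hard) clustering: each item goes to the cluster
# with the highest membership weight (ties broken by the first maximum).
hard_labels = W.argmax(axis=1)
print(hard_labels)   # [0 1 0]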
Different types of clustering
• Hierarchical
– Set of nested clusters, where one larger cluster can contain
smaller clusters
– Organized as a tree (dendrogram): leaf nodes are singleton
clusters containing individual items, each intermediate
node is union of its children sub-clusters
– A sequence of partitional clusterings – cut the dendrogram
at a certain level to get a partitional clustering
Hierarchical Clustering

Traditional Hierarchical Clustering of points p1–p4, and the corresponding
Traditional Dendrogram

Non-traditional Hierarchical Clustering and Non-traditional Dendrogram
Different types of clustering
• Complete vs. partial
– A complete clustering assigns every item to one or more
clusters
– A partial clustering may not assign some items to any
cluster (e.g., outliers, items that are not sufficiently similar
to any other item)
Types of clustering methods
• Prototype-based
– Each cluster defined by a prototype (centroid or medoid),
i.e., the most representative point in the cluster
– A cluster is the set of items in which each item is closer
(more similar) to the prototype of this cluster, than to the
prototype of any other cluster
– Example method: K-means
Types of clustering methods
• Density-based
– Assumes items distributed in a space where ‘similar’ items
are placed close to each other (e.g., feature space)
– A cluster is a dense region of items, that is surrounded by a
region of low density
– Example method: DBSCAN
Density based clustering methods
• Locates regions of high density that are separated
from one another by regions of low density
Types of Clusters: Density-Based

• Density-based
– A cluster is a dense region of points, separated from other regions of
high density by regions of low density.
– Used when the clusters are irregular or intertwined, and when noise
and outliers are present.

6 density-based clusters
Why DBSCAN?
Partitioning methods (K-means, PAM clustering) and hierarchical
clustering work for finding spherical-shaped clusters or convex
clusters. In other words, they are suitable only for compact and
well-separated clusters. Moreover, they are also severely affected
by the presence of noise and outliers in the data.
Real-life data may contain irregularities, such as:
i) Clusters can be of arbitrary shape.
ii) Data may contain noise.
DBSCAN
• DBSCAN: Density Based Spatial Clustering of Applications
with Noise
– Proposed by Ester et al. in SIGKDD 1996
– First algorithm for detecting density-based clusters
• Advantages (e.g., over K-means)
– Can detect clusters of arbitrary shapes (while clusters detected
by K-means are usually globular (globe-shaped; spherical))
– Robust to outliers
DBSCAN
The DBSCAN algorithm requires two parameters:
1. eps: Defines the neighborhood around a data point, i.e., if the distance between
two points is lower than or equal to 'eps', they are considered neighbors. If eps is
chosen too small, a large part of the data will be treated as outliers; if it is chosen
too large, clusters will merge and the majority of the data points will end up in the
same cluster. One way to choose eps is based on the k-distance graph (see the
sketch after this list).
2. MinPts: The minimum number of neighbors (data points) within the eps radius.
The larger the dataset, the larger the value of MinPts that should be chosen. As a
general rule, the minimum MinPts can be derived from the number of dimensions
D in the dataset as MinPts >= D + 1, and MinPts should be at least 3.

In this algorithm, there are three types of data points.

Core point: A point is a core point if it has more than MinPts points within eps.
Border point: A point which has fewer than MinPts points within eps but is in the
neighborhood of a core point.
Noise or outlier: A point which is neither a core point nor a border point.
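A minimal sketch of the k-distance heuristic mentioned above for choosing eps, assuming scikit-learn and NumPy; the data and the value of k are illustrative:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 2-D data; in practice X is your feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

k = 4   # often set equal to MinPts
# n_neighbors = k + 1 because each point is returned as its own nearest neighbor.
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nbrs.kneighbors(X)
k_distances = np.sort(distances[:, -1])    # each point's distance to its k-th neighbor

# Plotting k_distances and reading off the value at the "elbow" of the curve
# gives a reasonable eps; here we just print a few candidate values.
print(k_distances[[49, 99, 149, 199]])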
DBSCAN algorithm steps:
1. Find all the neighboring points within eps of each point, and identify the core
points, i.e., those with more than MinPts neighbors.
2. For each core point, if it is not already assigned to a cluster, create a new cluster.
3. Recursively find all of its density-connected points and assign them to the same
cluster as the core point.
Two points a and b are said to be density connected if there exists a point c which
has a sufficient number of points in its neighborhood, and both a and b are within
eps distance of it. This is a chaining process: if b is a neighbor of c, c is a neighbor
of d, and d is a neighbor of e, which in turn is a neighbor of a, then a and b end up
in the same cluster.
4. Iterate through the remaining unvisited points in the dataset. Those points that do
not belong to any cluster are noise.
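A minimal sketch of running DBSCAN on toy data, assuming scikit-learn is available; the data, eps and min_samples values are illustrative only:

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a few scattered points (hypothetical data).
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
    rng.uniform(low=-2, high=7, size=(10, 2)),    # likely noise
])

db = DBSCAN(eps=0.5, min_samples=4).fit(X)        # eps and MinPts as described above
labels = db.labels_                               # cluster ids; -1 marks noise points
print("clusters found:", len(set(labels) - {-1}))
print("noise points:", int(np.sum(labels == -1)))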
Types of clustering methods
• Graph-based
– Assumes items represented as a graph/network where
items are nodes, and ‘similar’ items are linked via edges
– A cluster is a group of nodes having more and / or better
connections among its members, than between its
members and the rest of the network
– Also called ‘community structure’ in networks
– Example method: Algorithm by Girvan and Newman
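A minimal sketch of detecting community structure with the Girvan-Newman algorithm, assuming the networkx library; the toy graph is illustrative:

import networkx as nx
from networkx.algorithms.community import girvan_newman

# Toy network: two tightly connected triangles joined by a single bridge edge.
G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (1, 3),     # community A
                  (4, 5), (5, 6), (4, 6),     # community B
                  (3, 4)])                    # bridge between the communities

# girvan_newman repeatedly removes the highest-betweenness edge and yields
# successively finer partitions; take the first split.
communities = next(girvan_newman(G))
print([sorted(c) for c in communities])       # e.g. [[1, 2, 3], [4, 5, 6]]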
K-means clustering
K-means
• Prototype-based, partitioning technique
• Finds a user-specified number of clusters (K)
• Each cluster is represented by its centroid
• There have been extensions where the number of
clusters is not needed as input
K-means algorithm
Given k
1. Randomly choose k data points (seeds) to be the initial cluster
centres
2. Assign each data point to the closest cluster centre
3. Re-compute the cluster centres using the current cluster
memberships.
4. If a convergence criterion is not met, go to 2.
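A minimal from-scratch sketch of these four steps, assuming NumPy; the function name kmeans, the seeds argument and the convergence test are illustrative choices, not part of the course material:

import numpy as np

def kmeans(X, k, seeds=None, max_iter=100):
    """Plain K-means following steps 1-4 above; returns (centres, labels).

    X is an (n, d) array; seeds optionally gives the indices of the initial
    centres. The sketch assumes every cluster keeps at least one point.
    """
    rng = np.random.default_rng(0)
    if seeds is None:
        seeds = rng.choice(len(X), size=k, replace=False)    # step 1: random seeds
    centres = X[np.asarray(seeds)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Step 2: assign each data point to the closest cluster centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: re-compute the cluster centres from the current memberships.
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centres no longer move; otherwise go back to step 2.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels

For the one-dimensional example that follows, the data can be passed as a column vector, e.g. kmeans(np.array(data, dtype=float).reshape(-1, 1), k=3, seeds=[0, 7, 8]), where indices 0, 7 and 8 correspond to the initial centres 2, 16 and 38.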

K-means algorithm
Example
• One-dimensional
• Apply the K-means algorithm to the given data for
k = 3. Use C1(2), C2(16) and C3(38) as the initial
cluster centres. Data: 2, 4, 6, 3,
31, 12, 15, 16, 38, 35, 14, 21, 23, 25, 30
Solution
Assign each point to the nearest initial centre:

C1(2)            C2(16)                            C3(38)
m1 = 2           m2 = 16                           m3 = 38
{2, 3, 4, 6}     {12, 14, 15, 16, 21, 23, 25}      {30, 31, 35, 38}

Solution
Re-compute the centres and re-assign the points:

C1(2)            C2(16)                            C3(38)
m1 = 2           m2 = 16                           m3 = 38
{2, 3, 4, 6}     {12, 14, 15, 16, 21, 23, 25}      {30, 31, 35, 38}

m1 = 3.75        m2 = 18                           m3 = 33.5
{2, 3, 4, 6}     {12, 14, 15, 16, 21, 23, 25}      {30, 31, 35, 38}

The assignments no longer change, so the algorithm has converged.
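A short, self-contained check of this one-dimensional run in plain Python, using only the data and initial centres given above:

data = [2, 4, 6, 3, 31, 12, 15, 16, 38, 35, 14, 21, 23, 25, 30]
centres = [2.0, 16.0, 38.0]   # C1, C2, C3

for _ in range(10):
    # Assign each value to the nearest centre.
    clusters = [[] for _ in centres]
    for x in data:
        j = min(range(len(centres)), key=lambda i: abs(x - centres[i]))
        clusters[j].append(x)
    # Re-compute the centres as the cluster means.
    new_centres = [sum(c) / len(c) for c in clusters]
    if new_centres == centres:    # converged
        break
    centres = new_centres

print(clusters)   # [[2, 4, 6, 3], [12, 15, 16, 14, 21, 23, 25], [31, 38, 35, 30]]
print(centres)    # [3.75, 18.0, 33.5]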


Problem
We are given the following four data points in
two dimensions: X1 = (2,2), X2 = (8,6), X3 = (6,8), X4
= (2,4). We want to cluster the data points into
two clusters C1 and C2 using the K-means
algorithm. Euclidean distance is used for
clustering. To initialize the algorithm we take
C1 = {X1, X3} and C2 = {X2, X4}. After two iterations of
the K-means algorithm, find the cluster
memberships.
For the dataset below, find the final clusters using the K-means algorithm.

Note: i) K = 2; ii) use Euclidean distance.


Advantages
• Fast, robust, and easy to understand.
• Relatively efficient.
• Gives the best results when the clusters in the data set are distinct or
well separated from each other.
Disadvantages
• Requires a priori specification of the number of
cluster centres.
• Hard assignment of data points to clusters.
• Euclidean distance measures can unequally weight underlying
factors.
• Applicable only when a mean is defined, i.e., it fails for categorical
data.
• Converges only to a local optimum.
Hierarchical clustering
Hierarchical clustering
• Bottom-up or Agglomerative clustering
– Start considering each data point as a singleton cluster
– Successively merge clusters if similarity is sufficiently high
– Until all points have been merged into a single cluster
• Top-down or Divisive clustering
– Start with all data points in a single cluster
– Iteratively split clusters into smaller sub-clusters if the
similarity between two sub-parts is low
Both Divisive and Agglomerative clustering can
be represented as a Dendrogram
Basic agglomerative hierarchical
clustering algorithm
• Start with each item in a singleton cluster
• Compute the proximity/similarity matrix between clusters
• Repeat
– Merge the closest/most similar two clusters
– Update the proximity matrix to reflect proximity between
the new cluster and the other clusters
• Until only one cluster remains
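A minimal sketch of this loop using SciPy's hierarchical clustering routines (an assumption; the algorithm above does not prescribe any library); the toy points are illustrative, and method='single' or method='complete' selects the linkage variants discussed later:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points: two well-separated groups of three.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])

# linkage starts from singleton clusters and repeatedly merges the closest pair,
# returning the full merge history (the dendrogram) as an (n - 1) x 4 matrix.
Z = linkage(X, method='single', metric='euclidean')

# Cut the dendrogram to obtain a partitional clustering with two clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)    # e.g. [1 1 1 2 2 2]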
K-means: Numerical Problem
• Consider the following data points
• Apply K-means with k = 2
• Distance metric = Euclidean

X1    X2     Cluster Assigned
4     4      None
0     1      None
2     3      None
4     1      None
3     3      None
1     1.5    None
0     2      None
K-means: Numerical Problem
• Initial Centroids
– C1 = (4, 4)
– C2 = (3, 3)
• The two seed points are assigned to clusters ‘1’ and ‘2’ respectively

X1    X2     Cluster Assigned
4     4      1
0     1      None
2     3      None
4     1      None
3     3      2
1     1.5    None
0     2      None
K-means: Numerical Problem
• Distance between C1 and (0, 1)
  = √((4 − 0)² + (4 − 1)²) = √(4² + 3²) = 5
• Distance between C2 and (0, 1)
  = √((3 − 0)² + (3 − 1)²) = √(3² + 2²) ≈ 3.6
• Point (0, 1) is nearer to C2 and is thus
  assigned to cluster ‘2’
• Repeat the process for all data points

X1    X2     Cluster Assigned
4     4      1
0     1      2
2     3      None
4     1      None
3     3      2
1     1.5    None
0     2      None
K-means: Numerical Problem
• After the first iteration:
– C1 = (4, 4)
– C2 = (1.66, 1.91)
• How do we calculate C2?
– By taking the average of all points assigned to cluster ‘2’:
  C2 = ((0 + 2 + 4 + 3 + 1 + 0)/6 , (1 + 3 + 1 + 3 + 1.5 + 2)/6)
     = (10/6 , 11.5/6) ≈ (1.66, 1.91)

X1    X2     Cluster Assigned
4     4      1
0     1      2
2     3      2
4     1      2
3     3      2
1     1.5    2
0     2      2
K-means: Numerical Problem
• After the second iteration:
– C1 = (3.5, 3.5)
– C2 = (1.4, 1.7)

X1    X2     Cluster Assigned
4     4      1
0     1      2
2     3      2
4     1      2
3     3      1
1     1.5    2
0     2      2
K-means: Numerical Problem
• After the third iteration:
– C1 = (3.66, 2.66)
– C2 = (0.75, 1.88)

X1    X2     Cluster Assigned
4     4      1
0     1      2
2     3      2
4     1      1
3     3      1
1     1.5    2
0     2      2
K-means: Numerical Problem
• After the fourth iteration:
– C1 = (3.66, 2.66)
– C2 = (0.75, 1.88)
• The algorithm stops, as there is no change in the centroids

X1    X2     Cluster Assigned
4     4      1
0     1      2
2     3      2
4     1      1
3     3      1
1     1.5    2
0     2      2
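A quick cross-check of this run with scikit-learn (an assumption; any K-means implementation seeded with the same two centroids should agree):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[4, 4], [0, 1], [2, 3], [4, 1], [3, 3], [1, 1.5], [0, 2]])
init = np.array([[4, 4], [3, 3]])     # the initial centroids C1 and C2

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)                     # expected: [0 1 1 0 0 1 1]  (0 = C1, 1 = C2)
print(km.cluster_centers_)            # expected: approx. [[3.67, 2.67], [0.75, 1.88]]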
Proximity/similarity between clusters
• MIN (single link): the proximity between two clusters is the proximity
(similarity) between their closest (most similar) pair of points, one from
each cluster (minimum pairwise distance)
• MAX (complete link): the proximity between two clusters is the proximity
between their farthest (least similar) pair of points, one from each cluster
(maximum pairwise distance)
Types of hierarchical clustering
• Complete linkage
– At each step, merge the two clusters with the smallest
maximum pairwise distance (MAX)
• Single linkage
– At each step, merge the two clusters with the smallest
minimum pairwise distance (MIN)
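In symbols, writing d(x, y) for the pairwise distance used throughout (Euclidean in the examples above), the single- and complete-linkage distances between clusters A and B are:

\[
d_{\text{single}}(A, B) = \min_{a \in A,\ b \in B} d(a, b),
\qquad
d_{\text{complete}}(A, B) = \max_{a \in A,\ b \in B} d(a, b)
\]

At each agglomerative step, the pair of clusters with the smallest linkage distance is merged.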
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will be
similar (or related) to one another and different from (or unrelated
to) the objects in other groups

Intra-cluster distances are minimized; inter-cluster distances are maximized
Quality: What Is Good Clustering?
• A good clustering method will produce high quality clusters
– high intra-class similarity: cohesive within clusters
– low inter-class similarity: distinctive between clusters
• The quality of a clustering method depends on
– the similarity measure used by the method
– its implementation, and
– its ability to discover some or all of the
hidden patterns
(Dis)similarity measures

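A commonly used dissimilarity measure, and the one used in the K-means examples above, is the Euclidean distance between two d-dimensional points x and y:

\[
d(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}
\]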
Quality of Clustering
