Clustering

Some material borrowed from course materials of


Andrew Ng
Unsupervised learning
• Given a set of unlabeled data points / items
• Find patterns or structure in the data
• Clustering: automatically group the data points /
items into groups of ‘similar’ or ‘related’ points
Motivations for Clustering
• Understanding the data better
– Grouping Web search results into clusters, each of which
captures a particular aspect of the query
– Segment the market or customers of a service
• As precursor for some other application
– Summarization and data compression
– Recommendation
Clustering for Data Understanding and Applications
• Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus
and species
• Information retrieval: document clustering
• Land use: Identification of areas of similar land use in an earth observation
database
• Marketing: Help marketers discover distinct groups in their customer bases, and
then use this knowledge to develop targeted marketing programs
• City-planning: Identifying groups of houses according to their house type, value,
and geographical location
• Earthquake studies: Observed earthquake epicenters should be clustered
along continent faults
• Climate: understanding Earth's climate, finding patterns in atmospheric and
ocean data
• Economic Science: market research
Clustering as a Preprocessing Tool (Utility)
• Summarization:
– Preprocessing for regression, PCA, classification, and association
analysis
• Compression:
– Image processing: vector quantization
• Finding K-nearest Neighbors
– Localizing search to one or a small number of clusters
• Outlier detection
– Outliers are often viewed as those “far away” from any cluster
Different types of clustering
• Partitional (Hard Clustering)
– Divide set of items into non-overlapping subsets
– Each item will be a member of exactly one subset

• Overlapping (Soft Clustering)


– Divide set of items into potentially overlapping subsets
– Each item can simultaneously belong to multiple subsets
Partitional Clustering (Hard)

Original Points → A Partitional Clustering

Overlapping (Soft Clustering)
Different types of clustering
• Fuzzy (Soft Clustering)
– Every item belongs to every cluster with a membership
weight between 0 (absolutely does not belong) and 1
(absolutely belongs)
– Usual constraint: the sum of weights for each individual item
should be 1
– Convert to partitional clustering: assign every item to the
cluster for which its membership weight is highest (a minimal
sketch follows below)
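A minimal sketch of this conversion, assuming NumPy; the membership-weight matrix W below is purely illustrative (rows are items, columns are clusters):

import numpy as np

# Hypothetical fuzzy membership matrix; each row sums to 1, per the usual constraint.
W = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
])

# Convert to a partitional (hard) clustering: each item goes to the cluster
# with the highest membership weight (ties broken by the first maximum).
hard_labels = W.argmax(axis=1)
print(hard_labels)   # [0 1 0]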
Different types of clustering
• Hierarchical
– Set of nested clusters, where one larger cluster can contain
smaller clusters
– Organized as a tree (dendrogram): leaf nodes are singleton
clusters containing individual items, each intermediate
node is union of its children sub-clusters
– A sequence of partitional clusterings – cut the dendrogram
at a certain level to get a partitional clustering
Hierarchical Clustering

Traditional Hierarchical Clustering of points p1–p4, and the corresponding
Traditional Dendrogram

Non-traditional Hierarchical Clustering and Non-traditional Dendrogram
Different types of clustering
• Complete vs. partial
– A complete clustering assigns every item to one or more
clusters
– A partial clustering may not assign some items to any
cluster (e.g., outliers, items that are not sufficiently similar
to any other item)
Types of clustering methods
• Prototype-based
– Each cluster defined by a prototype (centroid or medoid),
i.e., the most representative point in the cluster
– A cluster is the set of items in which each item is closer
(more similar) to the prototype of this cluster, than to the
prototype of any other cluster
– Example method: K-means
Types of clustering methods
• Density-based
– Assumes items distributed in a space where ‘similar’ items
are placed close to each other (e.g., feature space)
– A cluster is a dense region of items, that is surrounded by a
region of low density
– Example method: DBSCAN
Density based clustering methods
• Locates regions of high density that are separated
from one another by regions of low density
Types of Clusters: Density-Based

• Density-based
– A cluster is a dense region of points, separated from other regions of
high density by regions of low density.
– Used when the clusters are irregular or intertwined, and when noise
and outliers are present.

6 density-based clusters
Why DBSCAN?
Partitioning methods (K-means, PAM clustering) and hierarchical
clustering work for finding spherical-shaped clusters or convex
clusters. In other words, they are suitable only for compact and
well-separated clusters. Moreover, they are also severely affected
by the presence of noise and outliers in the data.
Real-life data may contain irregularities, such as:
i) Clusters can be of arbitrary shape.
ii) Data may contain noise.
DBSCAN
• DBSCAN: Density Based Spatial Clustering of Applications
with Noise
– Proposed by Ester et al. in SIGKDD 1996
– First algorithm for detecting density-based clusters
• Advantages (e.g., over K-means)
– Can detect clusters of arbitrary shapes (while clusters detected
by K-means are usually globular (globe-shaped; spherical))
– Robust to outliers
DBSCAN
The DBSCAN algorithm requires two parameters:
1. eps: Defines the neighborhood around a data point, i.e., if the distance between
two points is lower than or equal to 'eps', they are considered neighbors. If eps is
chosen too small, a large part of the data will be treated as outliers; if it is chosen
too large, clusters will merge and the majority of the data points will end up in the
same cluster. One way to choose eps is based on the k-distance graph (see the
sketch after this list).
2. MinPts: The minimum number of neighbors (data points) within the eps radius.
The larger the dataset, the larger the value of MinPts that should be chosen. As a
general rule, the minimum MinPts can be derived from the number of dimensions
D in the dataset as MinPts >= D + 1, and MinPts should be at least 3.

In this algorithm, there are three types of data points.

Core point: A point is a core point if it has more than MinPts points within eps.
Border point: A point which has fewer than MinPts points within eps but is in the
neighborhood of a core point.
Noise or outlier: A point which is neither a core point nor a border point.
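A minimal sketch of the k-distance heuristic mentioned above for choosing eps, assuming scikit-learn and NumPy; the data and the value of k are illustrative:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 2-D data; in practice X is your feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

k = 4   # often set equal to MinPts
# n_neighbors = k + 1 because each point is returned as its own nearest neighbor.
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nbrs.kneighbors(X)
k_distances = np.sort(distances[:, -1])    # each point's distance to its k-th neighbor

# Plotting k_distances and reading off the value at the "elbow" of the curve
# gives a reasonable eps; here we just print a few candidate values.
print(k_distances[[49, 99, 149, 199]])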
DBSCAN algorithm steps:
1. Find all the neighboring points within eps of each point, and identify the core
points, i.e., those with more than MinPts neighbors.
2. For each core point, if it is not already assigned to a cluster, create a new cluster.
3. Recursively find all of its density-connected points and assign them to the same
cluster as the core point.
Two points a and b are said to be density connected if there exists a point c which
has a sufficient number of points in its neighborhood, and both a and b are within
eps distance of it. This is a chaining process: if b is a neighbor of c, c is a neighbor
of d, and d is a neighbor of e, which in turn is a neighbor of a, then a and b end up
in the same cluster.
4. Iterate through the remaining unvisited points in the dataset. Those points that do
not belong to any cluster are noise.
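A minimal sketch of running DBSCAN on toy data, assuming scikit-learn is available; the data, eps and min_samples values are illustrative only:

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a few scattered points (hypothetical data).
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
    rng.uniform(low=-2, high=7, size=(10, 2)),    # likely noise
])

db = DBSCAN(eps=0.5, min_samples=4).fit(X)        # eps and MinPts as described above
labels = db.labels_                               # cluster ids; -1 marks noise points
print("clusters found:", len(set(labels) - {-1}))
print("noise points:", int(np.sum(labels == -1)))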
Types of clustering methods
• Graph-based
– Assumes items represented as a graph/network where
items are nodes, and ‘similar’ items are linked via edges
– A cluster is a group of nodes having more and / or better
connections among its members, than between its
members and the rest of the network
– Also called ‘community structure’ in networks
– Example method: Algorithm by Girvan and Newman
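A minimal sketch of detecting community structure with the Girvan-Newman algorithm, assuming the networkx library; the toy graph is illustrative:

import networkx as nx
from networkx.algorithms.community import girvan_newman

# Toy network: two tightly connected triangles joined by a single bridge edge.
G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (1, 3),     # community A
                  (4, 5), (5, 6), (4, 6),     # community B
                  (3, 4)])                    # bridge between the communities

# girvan_newman repeatedly removes the highest-betweenness edge and yields
# successively finer partitions; take the first split.
communities = next(girvan_newman(G))
print([sorted(c) for c in communities])       # e.g. [[1, 2, 3], [4, 5, 6]]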
K-means clustering
K-means
• Prototype-based, partitioning technique
• Finds a user-specified number of clusters (K)
• Each cluster is represented by its centroid
• There have been extensions where the number of
clusters is not needed as input
K-means algorithm
Given k
1. Randomly choose k data points (seeds) to be the initial cluster
centres
2. Assign each data point to the closest cluster centre
3. Re-compute the cluster centres using the current cluster
memberships.
4. If a convergence criterion is not met, go to 2.
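A minimal from-scratch sketch of these four steps, assuming NumPy; the function name kmeans, the seeds argument and the convergence test are illustrative choices, not part of the course material:

import numpy as np

def kmeans(X, k, seeds=None, max_iter=100):
    """Plain K-means following steps 1-4 above; returns (centres, labels).

    X is an (n, d) array; seeds optionally gives the indices of the initial
    centres. The sketch assumes every cluster keeps at least one point.
    """
    rng = np.random.default_rng(0)
    if seeds is None:
        seeds = rng.choice(len(X), size=k, replace=False)    # step 1: random seeds
    centres = X[np.asarray(seeds)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Step 2: assign each data point to the closest cluster centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: re-compute the cluster centres from the current memberships.
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centres no longer move; otherwise go back to step 2.
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels

For the one-dimensional example that follows, the data can be passed as a column vector, e.g. kmeans(np.array(data, dtype=float).reshape(-1, 1), k=3, seeds=[0, 7, 8]), where indices 0, 7 and 8 correspond to the initial centres 2, 16 and 38.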

K-means algorithm
Example
• One-dimensional
• Apply the K-means algorithm to the given data for
k = 3. Use C1(2), C2(16) and C3(38) as the initial
cluster centres. Data: 2, 4, 6, 3,
31, 12, 15, 16, 38, 35, 14, 21, 23, 25, 30
Solution
Assign each point to the nearest initial centre:

C1(2)            C2(16)                            C3(38)
m1 = 2           m2 = 16                           m3 = 38
{2, 3, 4, 6}     {12, 14, 15, 16, 21, 23, 25}      {30, 31, 35, 38}

Solution
Re-compute the centres and re-assign the points:

C1(2)            C2(16)                            C3(38)
m1 = 2           m2 = 16                           m3 = 38
{2, 3, 4, 6}     {12, 14, 15, 16, 21, 23, 25}      {30, 31, 35, 38}

m1 = 3.75        m2 = 18                           m3 = 33.5
{2, 3, 4, 6}     {12, 14, 15, 16, 21, 23, 25}      {30, 31, 35, 38}

The assignments no longer change, so the algorithm has converged.
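A short, self-contained check of this one-dimensional run in plain Python, using only the data and initial centres given above:

data = [2, 4, 6, 3, 31, 12, 15, 16, 38, 35, 14, 21, 23, 25, 30]
centres = [2.0, 16.0, 38.0]   # C1, C2, C3

for _ in range(10):
    # Assign each value to the nearest centre.
    clusters = [[] for _ in centres]
    for x in data:
        j = min(range(len(centres)), key=lambda i: abs(x - centres[i]))
        clusters[j].append(x)
    # Re-compute the centres as the cluster means.
    new_centres = [sum(c) / len(c) for c in clusters]
    if new_centres == centres:    # converged
        break
    centres = new_centres

print(clusters)   # [[2, 4, 6, 3], [12, 15, 16, 14, 21, 23, 25], [31, 38, 35, 30]]
print(centres)    # [3.75, 18.0, 33.5]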


Problem
We are given the following four data points in
two dimensions: X1 = (2,2), X2 = (8,6), X3 = (6,8), X4
= (2,4). We want to cluster the data points into
two clusters C1 and C2 using the K-means
algorithm. Euclidean distance is used for
clustering. To initialize the algorithm we take
C1 = {X1, X3} and C2 = {X2, X4}. After two iterations of
the K-means algorithm, find the cluster
memberships.
For the dataset below, find the final clusters using the K-means algorithm.

Note: i) K = 2; ii) use Euclidean distance.


Advantages
• Fast, robust, and easy to understand.
• Relatively efficient.
• Gives the best results when the clusters in the data set are distinct or
well separated from each other.
Disadvantages
• Requires a priori specification of the number of
cluster centres.
• Hard assignment of data points to clusters.
• Euclidean distance measures can unequally weight underlying
factors.
• Applicable only when a mean is defined, i.e., it fails for categorical
data.
• Converges only to a local optimum.
Hierarchical clustering
Hierarchical clustering
• Bottom-up or Agglomerative clustering
– Start considering each data point as a singleton cluster
– Successively merge clusters if similarity is sufficiently high
– Until all points have been merged into a single cluster
• Top-down or Divisive clustering
– Start with all data points in a single cluster
– Iteratively split clusters into smaller sub-clusters if the
similarity between two sub-parts is low
Both Divisive and Agglomerative clustering can
be represented as a Dendrogram
Basic agglomerative hierarchical
clustering algorithm
• Start with each item in a singleton cluster
• Compute the proximity/similarity matrix between clusters
• Repeat
– Merge the closest/most similar two clusters
– Update the proximity matrix to reflect proximity between
the new cluster and the other clusters
• Until only one cluster remains
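A minimal sketch of this loop using SciPy's hierarchical clustering routines (an assumption; the algorithm above does not prescribe any library); the toy points are illustrative, and method='single' or method='complete' selects the linkage variants discussed later:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points: two well-separated groups of three.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])

# linkage starts from singleton clusters and repeatedly merges the closest pair,
# returning the full merge history (the dendrogram) as an (n - 1) x 4 matrix.
Z = linkage(X, method='single', metric='euclidean')

# Cut the dendrogram to obtain a partitional clustering with two clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)    # e.g. [1 1 1 2 2 2]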
K-means: Numerical Problem
• Consider the following data points
• Apply K-means with k = 2
• Distance metric = Euclidean

X1    X2     Cluster Assigned
4     4      None
0     1      None
2     3      None
4     1      None
3     3      None
1     1.5    None
0     2      None
K-means: Numerical Problem
• Initial Centroids
– C1 = (4, 4)
– C2 = (3, 3)
• The two seed points are assigned to clusters ‘1’ and ‘2’ respectively

X1    X2     Cluster Assigned
4     4      1
0     1      None
2     3      None
4     1      None
3     3      2
1     1.5    None
0     2      None
K-means: Numerical Problem
• Distance between C1 and (0, 1)
  = √((4 − 0)² + (4 − 1)²) = √(4² + 3²) = 5
• Distance between C2 and (0, 1)
  = √((3 − 0)² + (3 − 1)²) = √(3² + 2²) ≈ 3.6
• Point (0, 1) is nearer to C2 and is thus
  assigned to cluster ‘2’
• Repeat the process for all data points

X1    X2     Cluster Assigned
4     4      1
0     1      2
2     3      None
4     1      None
3     3      2
1     1.5    None
0     2      None
K-means: Numerical Problem
• After the first iteration:
– C1 = (4, 4)
– C2 = (1.66, 1.91)
• How do we calculate C2?
– By taking the average of all points assigned to cluster ‘2’:
  C2 = ((0 + 2 + 4 + 3 + 1 + 0)/6 , (1 + 3 + 1 + 3 + 1.5 + 2)/6)
     = (10/6 , 11.5/6) ≈ (1.66, 1.91)

X1    X2     Cluster Assigned
4     4      1
0     1      2
2     3      2
4     1      2
3     3      2
1     1.5    2
0     2      2
K-means: Numerical Problem
• After the second iteration:
– C1 = (3.5, 3.5)
– C2 = (1.4, 1.7)

X1    X2     Cluster Assigned
4     4      1
0     1      2
2     3      2
4     1      2
3     3      1
1     1.5    2
0     2      2
K-means: Numerical Problem
• After the third iteration:
– C1 = (3.66, 2.66)
– C2 = (0.75, 1.88)

X1    X2     Cluster Assigned
4     4      1
0     1      2
2     3      2
4     1      1
3     3      1
1     1.5    2
0     2      2
K-means: Numerical Problem
• After the fourth iteration:
– C1 = (3.66, 2.66)
– C2 = (0.75, 1.88)
• The algorithm stops, as there is no change in the centroids

X1    X2     Cluster Assigned
4     4      1
0     1      2
2     3      2
4     1      1
3     3      1
1     1.5    2
0     2      2
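A quick cross-check of this run with scikit-learn (an assumption; any K-means implementation seeded with the same two centroids should agree):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[4, 4], [0, 1], [2, 3], [4, 1], [3, 3], [1, 1.5], [0, 2]])
init = np.array([[4, 4], [3, 3]])     # the initial centroids C1 and C2

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)                     # expected: [0 1 1 0 0 1 1]  (0 = C1, 1 = C2)
print(km.cluster_centers_)            # expected: approx. [[3.67, 2.67], [0.75, 1.88]]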
Proximity/similarity between clusters
• MIN (single link): the proximity between two clusters is the proximity
(similarity) between their closest (most similar) pair of points, one from
each cluster (minimum pairwise distance)
• MAX (complete link): the proximity between two clusters is the proximity
between their farthest (least similar) pair of points, one from each cluster
(maximum pairwise distance)
Types of hierarchical clustering
• Complete linkage
– At each step, merge the two clusters with the smallest
maximum pairwise distance (MAX)
• Single linkage
– At each step, merge the two clusters with the smallest
minimum pairwise distance (MIN)
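In symbols, writing d(x, y) for the pairwise distance used throughout (Euclidean in the examples above), the single- and complete-linkage distances between clusters A and B are:

\[
d_{\text{single}}(A, B) = \min_{a \in A,\ b \in B} d(a, b),
\qquad
d_{\text{complete}}(A, B) = \max_{a \in A,\ b \in B} d(a, b)
\]

At each agglomerative step, the pair of clusters with the smallest linkage distance is merged.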
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group will be
similar (or related) to one another and different from (or unrelated
to) the objects in other groups

Intra-cluster distances are minimized; inter-cluster distances are maximized
Quality: What Is Good Clustering?
• A good clustering method will produce high quality clusters
– high intra-class similarity: cohesive within clusters
– low inter-class similarity: distinctive between clusters
• The quality of a clustering method depends on
– the similarity measure used by the method
– its implementation, and
– its ability to discover some or all of the
hidden patterns
(Dis)similarity measures

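A commonly used dissimilarity measure, and the one used in the K-means examples above, is the Euclidean distance between two d-dimensional points x and y:

\[
d(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}
\]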
Quality of Clustering
