Chapter 5. Clustering Algorithm
Abebe B, PhD

* Clustering is a technique for finding similarity groups in data, called clusters, i.e.,
  * it groups data instances that are similar to (near) each other into one cluster and data instances that are very different (far away) from each other into different clusters.
* Clustering is often called an unsupervised learning task as
no class values denoting an a priori grouping of the data
instances are given, which is the case in supervised learning.
* Due to historical reasons, clustering is often considered
synonymous with unsupervised learning.
* In fact, association rule mining is also unsupervised.
* Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other and dissimilar to the data points in other groups. It is basically a grouping of objects on the basis of similarity and dissimilarity between them.

Let us see some real-life examples:
* Example 1: group people of similar sizes together to make "small", "medium" and "large" T-shirts.
  * Tailor-made for each person: too expensive.
  * One-size-fits-all: does not fit all.
* Example 2: In marketing, segment customers according to their similarities,
  * to do targeted marketing.
* Example 3: Given a collection of text documents, we want to organize them according to their content similarities,
  * to produce a topic hierarchy.
* In fact, clustering is one of the most utilized data mining techniques.
  * It has a long history and is used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, libraries, etc.
  * In recent years, due to the rapid increase of online documents, text clustering has become important.

Clustering Methods:
* Density-Based Methods: These methods consider the clusters
as the dense region having some similarities and differences
from the lower dense region of the space. These methods
have good accuracy and the ability to merge two clusters.
* Example:
* DBSCAN (Density-Based Spatial Clustering of Applications with Noise),
* OPTICS (Ordering Points to Identify Clustering Structure), etc.
* Hierarchical-Based Methods: The clusters formed in this method form a tree-type structure based on the hierarchy. New clusters are formed using the previously formed ones. It is divided into two categories:
  * Agglomerative (bottom-up approach)
  * Divisive (top-down approach)
  * Examples: CURE (Clustering Using Representatives), BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), etc.
* Partitioning Methods: These methods partition the objects into k clusters, and each partition forms one cluster. This method is used to optimize an objective criterion (similarity function), e.g., when distance is a major parameter. Examples: K-means, CLARANS (Clustering Large Applications based upon Randomized Search), etc.
* Grid-Based Methods: In this method, the data space is formulated into a finite number of cells that form a grid-like structure. All the clustering operations done on these grids are fast and independent of the number of data objects. Examples: STING (Statistical Information Grid), WaveCluster, CLIQUE (CLustering In QUEst), etc.

Applications of Clustering in different fields
* Marketing: It can be used to characterize & discover customer
segments for marketing purposes.
* Biology: It can be used for classification among different
species of plants and animals.
* Libraries: It is used in clustering different books on the basis of
topics and information.
* Insurance: It is used to understand customers and their policies, and to identify frauds.

The K-Means Clustering Method
* Given k, the k-means algorithm is implemented in four
steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of
the current partitioning (the centroid is the center, i.e.,
mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed
point
4. Go back to Step 2; stop when the assignment does not change.
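A minimal sketch of these four steps using scikit-learn's KMeans (assuming scikit-learn and NumPy are available; the toy data X is made up for illustration):

  import numpy as np
  from sklearn.cluster import KMeans

  # Toy 2-D data: two loose groups of points
  X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
                [8.0, 8.0], [8.5, 7.7], [7.8, 8.3]])

  kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)  # k must be given in advance
  labels = kmeans.fit_predict(X)   # steps 2-4 run internally until assignments stop changing

  print(labels)                    # cluster index for each point
  print(kmeans.cluster_centers_)   # final centroids (mean points of the clusters)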
An Example of K-Means Clustering

[Figure: the initial data set is partitioned into k non-empty subsets; in each loop the cluster centroids are updated and each object is re-assigned to its nearest centroid, until no change is needed.]

* Partition objects into k nonempty subsets
* Repeat
  * Compute the centroid (i.e., mean point) of each partition
  * Assign each object to the cluster of its nearest centroid
* Until no change

Comments on the K-Means Method
* Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is #
iterations. Normally, k, t << n.
* Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
* Comment: Often terminates at a local optimum.
* Weaknesses
  * Applicable only to objects in a continuous n-dimensional space
    * Use the k-modes method for categorical data
    * In comparison, k-medoids can be applied to a wide range of data
  * Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009)
  * Sensitive to noisy data and outliers
  * Not suitable for discovering clusters with non-convex shapes

Variations of the K-Means Method
* Most of the variants of k-means differ in
  * Selection of the initial k means
  * Dissimilarity calculations
  * Strategies to calculate cluster means
* Handling categorical data: k-modes
  * Replacing means of clusters with modes
  * Using new dissimilarity measures to deal with categorical objects
  * Using a frequency-based method to update modes of clusters
* A mixture of categorical and numerical data: the k-prototype method

Clustering:
+ Clustering is the task of gathering samples into groups of similar samples according to some predefined similarity or dissimilarity measure.

[Figure: samples grouped into clusters/groups]

* A grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
+ Outliers are objects that do not belong to any cluster or form clusters of very small cardinality.

[Figure: a large cluster and a few outliers]
+ In some applications we are interested in discovering
outliers, not clusters (outlier analysis)
+ Clustering: given a collection of data objects, group them so that they are
  — Similar to one another within the same cluster
  — Dissimilar to the objects in other clusters
+ Clustering results are used:
  — As a stand-alone tool to get insight into the data distribution
    + Visualization of clusters may unveil important information
  — As a preprocessing step for other algorithms
    + Efficient indexing or compression often relies on clustering

Applications of clustering?
« Image Processing
— cluster images based on their visual content
* Web
— Cluster groups of users based on their access
patterns on webpages
— Cluster webpages based on their content
* Bioinformatics
— Cluster similar proteins together (similarity w.r.t. chemical structure and/or functionality, etc.)

What Is the Problem of the K-Means Method?
* The k-means algorithm is sensitive to outliers !
* Since an object with an extremely large value may substantially distort the
distribution of the data
* K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

The k-means problem
* Given a set X of n points in a d-dimensional space
and an integer k
* Task: choose a set of k points {c_1, c_2, ..., c_k} in the d-dimensional space to form clusters {C_1, C_2, ..., C_k} such that

  Cost(C) = \sum_{i=1}^{k} \sum_{x \in C_i} \| x - c_i \|^2

  is minimized.
* Some special cases: k = 1, k = n

Algorithmic properties of the k-means problem
* NP-hard if the dimensionality of the data is at least 2
(d>=2)
* Finding the best solution in polynomial time is
infeasible
* For d=1 the problem is solvable in polynomial time
(how?)
* A simple iterative algorithm works quite well in practice.

K-means algorithm
* Given k, the k-means algorithm works as follows:
1)Randomly choose k data points (seeds) to be the initial centroids, cluster
centers
2)Assign each data point to the closest centroid
3)Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).
Algorithm k-means(k, D)
  Choose k data points as the initial centroids (cluster centers)
  repeat
    for each data point x ∈ D do
      compute the distance from x to each centroid;
      assign x to the closest centroid  // a centroid represents a cluster
    endfor
    re-compute the centroids using the current cluster memberships
  until the stopping criterion is met
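The pseudocode above maps almost line for line onto a small NumPy implementation. This is a sketch for illustration (empty clusters are not handled, and the data D and k are assumed to be supplied by the caller), not an optimized library routine:

  import numpy as np

  def kmeans(D, k, max_iter=100, seed=0):
      """Plain k-means following the pseudocode: seed centroids, assign, re-compute."""
      rng = np.random.default_rng(seed)
      centroids = D[rng.choice(len(D), size=k, replace=False)]  # choose k data points as seeds
      labels = np.zeros(len(D), dtype=int)
      for _ in range(max_iter):
          # distance from every point to every centroid, then assign each point to the closest one
          dists = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
          labels = dists.argmin(axis=1)
          # re-compute the centroids using the current cluster memberships
          new_centroids = np.array([D[labels == j].mean(axis=0) for j in range(k)])
          if np.allclose(new_centroids, centroids):   # stopping criterion: centroids unchanged
              break
          centroids = new_centroids
      return centroids, labels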
Stopping/convergence criterion
1. no (or minimum) re-assignments of data points to different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared error (SSE),

  SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2     (1)

where C_j is the jth cluster, m_j is the centroid of cluster C_j (the mean vector of all the data points in C_j), and dist(x, m_j) is the distance between data point x and centroid m_j.
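Equation (1) can be computed directly once the labels and centroids are known; a small NumPy sketch (the argument names are illustrative):

  import numpy as np

  def sse(D, labels, centroids):
      """Sum of squared errors: for each cluster, squared distances of its points to its centroid."""
      return sum(((D[labels == j] - centroids[j]) ** 2).sum() for j in range(len(centroids)))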
An example

[Figure: Iteration 1 — (B) cluster assignment, (C) re-compute centroids]

An example (cont ...)

[Figure: Iteration 3 — (F) cluster assignment, (G) re-compute centroids]

Strengths of k-means
Strengths:
* Simple: easy to understand and to implement.
* Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.
  * Since both k and t are small, k-means is considered a linear algorithm.
* K-means is the most popular clustering algorithm.
* Note: it terminates at a local optimum if SSE is used. The global optimum is hard to find due to complexity.

Weaknesses of k-means
* The algorithm is only applicable if the mean is defined.
  * For categorical data, use k-modes: the centroid is represented by the most frequent values.
* The user needs to specify k, i.e., choose k manually. For a low k, you can mitigate this dependence by running k-means several times with different initial values and picking the best result (see the sketch after this list). As k increases, you need advanced versions of k-means to pick better values of the initial centroids (called k-means seeding).
* The algorithm is sensitive to outliers.
  * Outliers are data points that are very far away from other data points.
  * Outliers could be errors in the data recording or some special data points with very different values.
* Clustering data of varying sizes and density: k-means has trouble clustering data where clusters are of varying sizes and density.
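One common way to soften the "choose k" and "bad seeds" problems mentioned above is to run k-means several times per candidate k and keep the run with the lowest SSE (scikit-learn exposes the final SSE as inertia_). A sketch, with the candidate range of k and the placeholder data assumed:

  import numpy as np
  from sklearn.cluster import KMeans

  X = np.random.default_rng(0).normal(size=(200, 2))    # placeholder data

  results = {}
  for k in range(2, 7):                                  # candidate values of k (assumed range)
      km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)  # 10 restarts, best run kept
      results[k] = km.inertia_                           # SSE of the best of the 10 runs

  for k, err in results.items():
      print(k, round(err, 1))    # inspect the SSE curve, e.g. look for an "elbow"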
Weaknesses of k-means: Problems with outliers

[Figure: (A) clusters distorted by two outliers vs. (B) ideal clusters]

Weaknesses of k-means: To deal with outliers
* One method is to remove some data points in the clustering process that are much further away from the centroids than other data points (see the sketch at the end of this list).
  * To be safe, we may want to monitor these possible outliers over a few iterations and then decide to remove them.
* Another method is to perform random sampling. Since in sampling we only choose a small subset of the data points, the chance of selecting an outlier is very small.
  * Assign the rest of the data points to the clusters by distance or similarity comparison, or by classification.
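A sketch of the first method (dropping points that are much further from their centroid than the rest, then re-clustering). The data, the value of k, and the 98th-percentile cut-off are all assumptions for illustration:

  import numpy as np
  from sklearn.cluster import KMeans

  X = np.random.default_rng(1).normal(size=(300, 2))     # placeholder data

  km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
  # distance of every point to its own (assigned) centroid
  d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
  keep = d < np.percentile(d, 98)        # drop the farthest 2% as possible outliers (threshold assumed)

  km2 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[keep])  # re-cluster without them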
Weaknesses of k-means (cont ...)
* The algorithm is sensitive to initial seeds.

[Figure: (B) Iteration 1, (C) Iteration 2 with poorly chosen seeds]

Weaknesses of k-means (cont ...)
* If we use different seeds: good results.
  * There are some methods to help choose good seeds.

[Figure: (B) Iteration 1, (C) Iteration 2 with better seeds]

Weaknesses of k-means (cont ...)
* The k-means algorithm is not suitable for discovering
clusters that are not hyper-ellipsoids (or hyper-spheres).
[Figure: (A) two natural clusters, (B) k-means clusters]

K-means summary
* Despite weaknesses, k-means is still the most popular algorithm due to its simplicity and efficiency.
  * Other clustering algorithms have their own lists of weaknesses.
* No clear evidence that any other clustering algorithm performs better in general,
  * although they may be more suitable for some specific types of data or applications.
* Comparing different clustering algorithms is a difficult task.
  * No one knows the correct clusters!

* Use the centroid of each cluster to represent the cluster.
* Compute the radius and standard deviation of the cluster to determine its spread in each dimension.
* The centroid representation alone works well if the clusters are of hyper-spherical shape.
* If clusters are elongated or are of other shapes, centroids are not sufficient.

Clustering Approaches
1. Using a classification model
* All the data points in a cluster are regarded to have the same class label, e.g., the cluster ID.
* Run a supervised learning algorithm on the data to find a classification model (a sketch follows below).

[Figure: decision boundaries learned over three clusters, e.g.:]
  x ≤ 2 → cluster 1
  x > 2, y > 1.5 → cluster 2
  x > 2, y ≤ 1.5 → cluster 3
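A sketch of the idea: cluster the data, treat the cluster IDs as class labels, and train a decision tree whose rules (like the x/y thresholds above) describe each cluster. The library calls are standard scikit-learn; the data and the tree depth are made up for illustration:

  import numpy as np
  from sklearn.cluster import KMeans
  from sklearn.tree import DecisionTreeClassifier, export_text

  X = np.random.default_rng(2).uniform(0, 4, size=(150, 2))    # placeholder 2-D data

  labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # cluster IDs as labels
  tree = DecisionTreeClassifier(max_depth=2).fit(X, labels)     # supervised model on the cluster IDs

  print(export_text(tree, feature_names=["x", "y"]))            # human-readable rules per cluster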
2. Use frequent values to represent the cluster
* This method is mainly for clustering of categorical data (e.g., k-modes clustering).
* Main method used in text clustering, where a small set of frequent words in each cluster is selected to represent the cluster.

Clusters of arbitrary shapes
* Hyper-elliptical and hyper-spherical clusters are usually easy to represent, using their centroid together with spreads.
* Irregular-shape clusters are hard to represent. They may not be useful in some applications.
  * Using centroids is not suitable (upper figure) in general.
  * K-means clusters may be more useful (lower figure), e.g., for making T-shirts in 2 sizes.

Hierarchical clustering
* Produce a nested sequence of clusters, a tree, also called a Dendrogram.
Different approaches
© Top-down divisive.
» Start with assigning all data points to one (or a few coarse) cluster.
» Recursively split each cluster.
» Uses a flat clustering algorithm as a subroutine.
* Bottom-up agglomerative.
» Start with assigning each data point to its own cluster.
» Iteratively find pairs of clusters to merge.
» Clusters found by finding pairs with maximum similarity.
* Dominant approach is bottom-up: better search landscape, more flexible algorithms, but more myopic.

Types of hierarchical clustering
*Agglomerative (bottom up) clustering: It builds
the dendrogram (tree) from the bottom level,
and
*merges the most similar (or nearest) pair of clusters
*stops when all the data points are merged into a
single cluster (i.e., the root cluster).
*Divisive (top down) clustering: It starts with all
data points in one cluster, the root.
*Splits the root into a set of child clusters. Each child
cluster is recursively divided further
* stops when only singleton clusters of individual data points remain, i.e., each cluster contains only a single point.

Agglomerative clustering
It is more popular than divisive methods.
*At the beginning, each data point forms a cluster (also
called a node).
* Merge nodes/clusters that have the least distance.
*Go on merging
* Eventually all nodes belong to one cluster
Algorithm Agglomerative(D)
1  Make each data point in the data set D a cluster;
2  Compute all pair-wise distances of x1, x2, ..., xn ∈ D;
3  repeat
4    find the two clusters that are nearest to each other;
5    merge the two clusters to form a new cluster c;
6    compute the distance from c to all other clusters;
7  until there is only one cluster left
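A sketch of bottom-up (agglomerative) clustering with SciPy, assuming SciPy is available: linkage repeatedly merges the nearest clusters into a tree, and fcluster cuts that tree into flat clusters (scipy.cluster.hierarchy.dendrogram can plot the tree). The data and the choice of single-link distance are illustrative:

  import numpy as np
  from scipy.cluster.hierarchy import linkage, fcluster

  X = np.random.default_rng(3).normal(size=(20, 2))     # placeholder data points

  Z = linkage(X, method="single")                   # merge nearest clusters repeatedly (single link)
  labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into (at most) 3 flat clusters
  print(labels)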
An example: working of the algorithm

[Figure: (A) nested clusters, (B) dendrogram]

Distance functions for numeric attributes
* Most commonly used functions are
* Euclidean distance and
* Manhattan (city block) distance
+ We denote distance with dist(x_i, x_j), where x_i and x_j are data points (vectors).
* They are special cases of the Minkowski distance, where h is a positive integer:

  dist(x_i, x_j) = \left( |x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ir} - x_{jr}|^h \right)^{1/h}

Euclidean distance and Manhattan distance
* If h = 2, it is the Euclidean distance:

  dist(x_i, x_j) = \sqrt{ (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ir} - x_{jr})^2 }

* If h = 1, it is the Manhattan distance:

  dist(x_i, x_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ir} - x_{jr}|

* Weighted Euclidean distance:

  dist(x_i, x_j) = \sqrt{ w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \cdots + w_r (x_{ir} - x_{jr})^2 }

Squared distance and Chebychev distance
* Squared Euclidean distance: to place progressively greater weight on
data points that are further apart.
* Chebychev distance: one wants to define two data points as "different"
if they are different on any one of the attributes.
  dist(x_i, x_j) = (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ir} - x_{jr})^2

  dist(x_i, x_j) = \max( |x_{i1} - x_{j1}|, |x_{i2} - x_{j2}|, \ldots, |x_{ir} - x_{jr}| )
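All of the distances above are available in SciPy's distance module; a sketch comparing them on two small vectors (the values and the weight vector are arbitrary):

  import numpy as np
  from scipy.spatial.distance import euclidean, cityblock, chebyshev, minkowski

  x = np.array([1.0, 2.0, 3.0])
  y = np.array([4.0, 0.0, 3.5])

  print(euclidean(x, y))                      # h = 2
  print(cityblock(x, y))                      # h = 1 (Manhattan)
  print(minkowski(x, y, p=3))                 # general Minkowski with h = 3
  print(euclidean(x, y, w=[0.5, 1.0, 2.0]))   # weighted Euclidean (weights are illustrative)
  print(euclidean(x, y) ** 2)                 # squared Euclidean distance
  print(chebyshev(x, y))                      # maximum coordinate difference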
How DBSCAN Clustering Works
* DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
* DBSCAN clustering algorithm works by assuming that the clusters are
regions with high-density data points separated by regions of low-density.
[Figure: core points, border points, noise points, and the epsilon (ε) radius around each point (source: jechouinard.com)]

* The DBSCAN algorithm requires only two parameters from the user:
* The radius of the circle to be created around each data point, also known as ‘epsilon’
* minPoints which defines the minimum number of data points required inside that
circle for that data point to be classified as a Core point.
* Some of the common use-cases for DBSCAN clustering algorithm are:
+ It performs great at separating clusters of high density versus low density;
+ It works great on non-linear datasets; and
* It can be used for anomaly detection, as it separates out the noise points and does not assign them to any cluster.
* Comparing DBSCAN with the K-Means algorithm, the most common differences are:
  * K-Means clusters all the instances in the dataset, whereas DBSCAN does not assign noise points (outliers) to any valid cluster.
  * K-Means has difficulty with non-globular clusters, whereas DBSCAN can handle them smoothly.
  * K-Means assumes that all data points in the dataset come from a Gaussian distribution, whereas DBSCAN makes no assumption about the data.
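A sketch of DBSCAN on a non-linear (two-moons) dataset, assuming scikit-learn; eps and min_samples correspond to the epsilon and minPoints parameters above, and the particular values are illustrative:

  from sklearn.cluster import DBSCAN
  from sklearn.datasets import make_moons

  X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)   # two crescent-shaped clusters

  db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps = circle radius, min_samples = minPoints
  labels = db.labels_                          # -1 marks noise points (not assigned to any cluster)

  print(set(labels))                           # e.g. two clusters plus the noise label -1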
Gaussian Mixture Clustering Models
* Gaussian Mixture Models, or GMMs, are probabilistic models that use Gaussian distributions, also known as normal distributions, to cluster data points together.
* By looking at a certain number of Gaussian distributions, the models
assume that each distribution is a separate cluster.
[Figure: three overlapping Gaussian components labelled Cluster 1, Cluster 2, and Cluster 3]
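A sketch of Gaussian mixture clustering with scikit-learn: fit three Gaussian components and read both the hard assignments and the soft (probabilistic) memberships. The blob data is made up for illustration:

  import numpy as np
  from sklearn.mixture import GaussianMixture

  rng = np.random.default_rng(4)
  # three made-up Gaussian blobs centred at different locations
  X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in ([0, 0], [4, 0], [2, 3])])

  gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
  hard = gmm.predict(X)          # most likely cluster for each point
  soft = gmm.predict_proba(X)    # probability of belonging to each of the 3 clusters

  print(hard[:5])
  print(soft[:5].round(2))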