0% found this document useful (0 votes)

13 views31 pages

Cluster Analysis for Data Scientists

Uploaded by

tinaktm2004

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

13 views31 pages

Cluster Analysis for Data Scientists

Uploaded by

tinaktm2004

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 31

UNIT-6

Cluster Analysis: Basic Concepts and Algorithms

Cluster Analysis: Basic Concepts and Algorithms: Overview: What
Is Cluster Analysis? Different Types of Clustering, Different Types of
Clusters; K-means: The Basic K-means Algorithm, K-means Additional
Issues, Bisecting K-means, Strengths and Weaknesses; Agglomerative
Hierarchical Clustering: Basic Agglomerative Hierarchical Clustering
Algorithm DBSCAN: Traditional Density Center-Based Approach, DBSCAN
Algorithm, Strengths and Weaknesses.

Cluster Analysis
Cluster analysis groups data objects based on information found only in the data
that describes the objects and their relationships. The goal is that the objects
within a group be similar (or related) to one another and different from (or
unrelated to) the objects in other groups. The greater the similarity (or
homogeneity) within a group and the greater the difference between groups, the
better or more distinct the clustering.
consider Figure which shows 20 points and three different ways of dividing them
into clusters. The shapes of the markers indicate cluster membership.

Figures (b) and (d) divide the data into two and six parts, respectively. However,
the apparent division of each of the two larger clusters into three sub clusters
may simply be an artifact of the human visual system.
Also, it may not be unreasonable to say that the points form four clusters, as
shown in Figure (c). This figure illustrates that the definition of a cluster is
imprecise and that the best definition depends on the nature of data and the
desired results. Cluster analysis is related to other techniques that are used to
divide data objects into groups.
For instance, clustering can be regarded as a form of classification in that it
creates a labelling of objects with class (cluster) labels. However, it derives
these labels only from the data. In contrast, classification is supervised
classification; i.e., new, unlabelled objects are assigned a class label using a
model developed from objects with known class labels. For this reason,
cluster analysis is sometimes referred to as unsupervised classification. When
the term classification is used without any qualification within data mining, it
typically refers to supervised classification.
Also, while the terms segmentation and partitioning are sometimes used as
synonyms for clustering, these terms are frequently used for approaches outside
the traditional bounds of cluster analysis. For example, the term partitioning is
often used in connection with techniques that divide graphs into subgraphs and
that are not strongly connected to clustering. Segmentation often refers to the
division of data into groups using simple
techniques.
Example: An image can be split into segments based only on pixel intensity and
color, or people can be divided into groups based on their income.
Types of Clusterings:
Definition: An entire collection of clusters is commonly referred to as a
clustering.

Types of clusterings:
Hierarchical (nested) versus partitional (unnested)
Exclusive versus overlapping versus fuzzy
Complete versus partial.

Hierarchical versus Partitional:

Partitional Clustering: A partitional clustering is simply a division of the
set
of data objects into non-overlapping subsets (clusters) such that each data
object is in exactly one subset. Taken individually, each collection of
clusters in above Figures (b–d) is a partitional clustering.

Hierarchial Clustering: If we permit clusters to have sub clusters, then

we
obtain a hierarchical clustering, which is a set of nested clusters that are
organized as a tree. Each node (cluster) in the tree (except for the leaf nodes) is
the union of its children (sub clusters), and the root of the tree is the cluster
containing all the objects. Often, but not always, the leaves of the tree are
singleton clusters of individual data objects.
If we allow clusters to be nested, then one interpretation of Figure (a) is that it
has two sub clusters (b), each of which, in turn, has three sub clusters Figure (d).
The clusters shown in Figures (a–d), when taken in that order, also form a
hierarchical (nested) clustering with, respectively, 1, 2, 4, and 6 clusters on each
level.
Note: Finally, note that a hierarchical clustering can be viewed as a
sequence of
partitional clusterings and a partitional clustering can be obtained by taking any
member of that sequence; i.e., by cutting the hierarchical tree at a particular
level.

Exclusive versus Overlapping versus Fuzzy: Exclusive Clustering:

The clusterings shown in above Figure are all
exclusive, as they assign each object to a single cluster. There are many
situations in which a point could reasonably be placed in more than one cluster
and these situations are better addressed by non-exclusive clustering.
Overlapping Clustering: In the most general sense, an overlapping or

non-
exclusive clustering is used to reﬂect the fact that an object can simultaneously
belong to more than one group (class).For instance, a person at a university can
be both an enrolled student and an employee of the university.
A non-exclusive clustering is also often used when, for example, an object is
“between” two or more clusters and could reasonably be assigned to any of
these clusters. Imagine a point halfway between two of the clusters of Figure.
Rather than make a somewhat arbitrary assignment of the object to a single
cluster, it is placed in all of the “equally good” clusters.

Fuzzy clustering:
In a fuzzy clustering, every object belongs to every cluster with a membership
weight that is between 0 (absolutely doesn’t belong) and 1 (absolutely
belongs). In other words, clusters are treated as fuzzy sets. (Mathematically, a
fuzzy set is one in which an object belongs to every set with a weight that is
between 0 and 1. In fuzzy clustering, we often impose the additional constraint
that the sum of the weights for each object must equal 1.)
Similarly, probabilistic clustering techniques compute the probability with which
each point belongs to each cluster, and these probabilities must also sum to 1.
Because the membership weights or probabilities for any object sum to 1, a
fuzzy or probabilistic clustering does not address true multiclass situations, such
as the case of a student employee, where an object belongs to multiple classes.
Instead, these approaches are most appropriate for avoiding the arbitrariness of
assigning an object to only one cluster when it is close to several. In practice, a
fuzzy or probabilistic clustering is often converted to an exclusive clustering by
assigning each object to the cluster in which its membership weight or
probability is highest.
Complete versus Partial: Complete Clustering: A complete clustering
assigns every object to a cluster,
whereas a partial clustering does not.
partial clustering: Some objects in a data set may not belong to well

deﬁned
groups. Many times objects in the data set represent noise, outliers, or
“uninteresting background.”

For example: some newspaper stories share a common theme, such as

global
warming, while other stories are more generic or one-of-a-kind. Thus, to ﬁnd the
important topics in last month’s stories, we often want to search only for clusters
of documents that are tightly related by a common theme.
In other cases, a complete clustering of the objects is desired. For example, an
application that uses clustering to organize documents for browsing needs to
Types of
guarantee thatClusters:
all documents can be browsed.

Well-Separated Cluster:
A cluster is a set of objects in which each object is closer (or more similar) to
every other object in the cluster than to any object not in the cluster.
Sometimes a threshold is used to specify that all the objects in a cluster must
be suﬃciently close (or similar) to one another. Figure(a) gives an example of
well separated clusters that consists of two groups of points in a two-
dimensional space. The distance between any two points in different groups is
larger than the distance between any two points within a group. Well-
separated clusters do not need to be globular, but can have any shape.

Prototype-Based Cluster: A cluster is a set of objects in which each object

is closer (more similar) to the prototype that deﬁnes the cluster than to the
prototype of any other cluster. For data with continuous attributes, the prototype
of a cluster is often a centroid, i.e., the average (mean) of all the points in the
cluster. When a centroid is not meaningful, such as when the data has
categorical attributes, the prototype is often a medoid, i.e., the most
representative point of a cluster. For many types of data, the prototype can be
regarded as the most central point, and in such instances, we commonly refer to
prototype based clusters as center-based clusters.

Graph-Based Cluster: If the data is represented as a graph, where the

nodes
are objects and the links represent connections among objects, then a cluster
can be defined as a connected component; i.e., a group of objects that are
connected to one another, but that have no connection to objects outside the
group. Example: Example of graph-based clusters is a contiguity-based cluster,
where
two objects are connected only if they are within a specified distance of each
other. This implies that each object in a contiguity-based cluster is closer to
some other object in the cluster than to any point in a different cluster.
Figure (c) shows an example of such clusters for two-dimensional points. This
definition of a cluster is useful when clusters are irregular or intertwined.
However, this approach can have trouble when noise is present since, as
illustrated by the two spherical clusters of Figure(c), a small bridge of points can
merge two distinct clusters.

Density-Based Cluster: A cluster is a dense region of objects that is

surrounded by a region of low density. Figure(d) shows some density-based
clusters for data created by adding noise to the data of Figure (c). The two
circular clusters are not merged, as in Figure(c), because the bridge between
them fades into the noise. Likewise, the curve that is present in Figure(c) also
fades into the noise and does not form a cluster in Figure(d).
A density based deﬁnition of a cluster is often employed when the clusters are
irregular or intertwined, and when noise and outliers are present. By contrast, a
contiguity based deﬁnition of a cluster would not work well for the data of
Figure(d) because the noise would tend to form bridges between clusters.

Shared-Property(Conceptual Clusters):
More generally, we can define a cluster as a set of objects that share some
property. This definition encompasses all the previous definitions of a cluster;
e.g., objects in a center based cluster share the property that they are all
closest to the same centroid or medoid. However, the shared-property
approach also includes new types of clusters.
Consider the clusters shown in Figure(e). A triangular area (cluster) is adjacent to
a rectangular one, and there are two intertwined circles (clusters). In both cases,
a clustering algorithm would need a very specific concept of a cluster to
successfully detect these clusters. The process of finding such clusters is called
conceptual clustering.

K-means:
Prototype-based clustering techniques create a one-level partitioning of the data
objects.
The Basic K-means Algorithm:
Procedure: We ﬁrst choose K initial centroids, where K is a user speciﬁed
parameter, namely, the number of clusters desired. Each point is then assigned
to the closest centroid, and each collection of points assigned to a centroid is a
cluster. The centroid of each cluster is then updated based on the points
assigned to the cluster. We repeat the assignment and update steps until no
point changes clusters, or equivalently, until the centroids remain the same.

Time and Space Complexity:

The space requirements for K-means are modest because only the data points
and centroids are stored. Specifically, the storage required is O((m + K)n),
where m is the number of points and n is the number of attributes. The time
requirements for K-means are also modest—basically linear in the number of
data points. In particular, the time required is O(I×K×m×n), where I is the
number of iterations required for convergence. As mentioned, I is often small
and can usually be safely bounded, as most changes typically occur in the first
few iterations. Therefore, K-means is linear in m, the number of points, and is
efficient as well as simple provided that K, the number of clusters, is
significantly less than m.

Method:
1.Randomly assign K objects from the dataset(D) as cluster
centres(C)
2.(Re) Assign each object to which object is most similar based
upon mean values.
3.Update Cluster means, i.e., Recalculate the mean of each
cluster with the updated values.
4.Repeat Step 2 until no change occurs.
Figure – K-mean
Clustering
Flowchart:

Figure
– K-mean Clustering

\
Hierarchial Clustering
What is Hierarchical Clustering?
Hierarchical clustering is a method of cluster analysis in data
mining that creates a hierarchical representation of the clusters
in a dataset. The method starts by treating each data point as a
separate cluster and then iteratively combines the closest
clusters until a stopping criterion is reached. The result of
hierarchical clustering is a tree-like structure, called a
dendrogram, which illustrates the hierarchical relationships
among the clusters

Types of Hierarchical Clustering

Basically, there are two types of hierarchical Clustering:
1.Agglomerative Clustering
2.Divisive clustering
1. Agglomerative Clustering
Initially consider every data point as an individual Cluster and
at every step, merge the nearest pairs of the cluster. (It is a
bottom-up method). At first, every dataset is considered an
individual entity or cluster. At every iteration, the clusters merge
with different clusters until one cluster is formed.
The algorithm for Agglomerative Hierarchical Clustering is:


 Calculate the similarity of one cluster with all the other
 clusters (calculate proximity matrix)
 Consider every data point as an individual cluster
 Merge the clusters which are highly similar or close to each
other.
Recalculate the proximity matrix for each cluster
Repeat Steps 3 and 4 until only a single cluster remains.
 Step-1: Consider each alphabet as a single cluster and
calculate the distance of one cluster from all the other
clusters.
 Step-2: In the second step comparable clusters are merged
together to form a single cluster. Let’s say cluster (B) and
cluster (C) are very similar to each other therefore we merge
them in the second step similarly to cluster (D) and (E) and at
last, we get the clusters [(A), (BC), (DE), (F)]

Step-3: We recalculate the proximity according to the
algorithm and merge the two nearest clusters([(DE), (F)])
together to form new clusters as [(A), (BC), (DEF)]
Step-4: Repeating the same process; The clusters DEF and

BC are comparable and merged together to form a new
cluster. We’re now left with clusters [(A), (BCDEF)].
Step-5: At last, the two remaining clusters are merged
 together to form a single cluster [(ABCDEF)].

Agglomerative Clustering Algorithm

Following are the steps in agglomerative clustering.
1.We start by assigning each data point to its own cluster.
2.Next, we compute the distance between each pair of clusters
and select the pair of clusters with the smallest distance.
3.Then, we merge the pair of clusters with the smallest distance
into a single cluster and update the distance between the
newly formed cluster and every other cluster.
4.We repeat steps 2 and 3 until all data points are in one cluster.

Calculation of Distance Between Two

Clusters
The distance between clusters in agglomerative clustering can be
calculated using three approaches namely single linkage, complete
linkage, and average linkage.

 In the single linkage approach, we take the distance between

the nearest points in two clusters as the distance between the
clusters.
 In the complete linkage approach, we take the distance
between the farthest points in two clusters as the distance
between the clusters.
 In the average linkage approach, we take the average distance
between each pair of points in two given clusters as the
distance between the clusters. You can also take the distance
between the centroids of the clusters as their distance from
each other.

A (1, 1), B(2, 3), C(3, 5), D(4,5), E(6,6), and F(7,5) and try to cluster
them.
To perform clustering, we will ﬁrst create a distance matrix
consisting of the distance between each point in the dataset. The
distance matrix looks as follows.

Using Single Linkage approach

Step 1: First, we will consider each data point as a single cluster.
After this, we will start combining the clusters.

Step 2: To combine the individual clusters, we can consider the

following points.
1.Point A is closest to point B.
2.Point B is at a similar distance to points A and C.
3.Point C is closest to point D.
4.Point D is closest to point C.
5.Point E is closest to F.
6.Point F is closest to E.

From the above points, we can combine (C, D) and (E, F) as clusters.
We will name them CD and EF respectively. As we have ambiguity
for points A and B, let us not combine them and treat them as
individual clusters.
Step 3: Now, we will calculate the minimum distance between
clusters A, B, CD, and EF. You can observe that

1.Cluster A is closest to B.
2.Cluster B is closest to A as well as CD.
3.Cluster EF is closest to CD.

Using the above information let us combine B and CD and name the
cluster BCD.
Step 4: After this, The cluster BCD is at the same distance from A
and EF. Hence, we will ﬁrst merge BCD and EF to form BCDEF.
Step 5: Finally, we will merge A and BCDEF to form the cluster
ABCDEF. Using the above steps, we will get the following
dendrogram.
In the above example, if we combine (A, B), (C, D), and (E, F)
together in step 2 above, we will get clusters AB, CD, and EF.
Now, The minimum distance of CD is the same from AB and EF. Let
us merge AB and CD to form ABCD ﬁrst. Next, we will combine ABCD
and EF to obtain the cluster ABCDEF.

As a result, we will get the following dendrogram.

Instead of combining AB and CD in the previous example, let us ﬁrst
combine CD and EF to form CDEF. Then, we can combine AB with
CDEF to form the cluster ABCDEF. As a result, we will get the
following dendrogram.

K-means: Additional Issues

Handling Empty Clusters:
One of the problems with the basic K-means algorithm is that empty clusters can
be obtained if no points are allocated to a cluster during the assignment step. If
this happens, then a strategy is needed to choose a replacement centroid, since
otherwise, the squared error will be larger than necessary. Approach 1:
One
approach is to choose the point that is farthest away from any current centroid. If
nothing else, this eliminates the point that currently contributes most to the total
squared error.

Approach 2: Another approach is to choose the replacement centroid

at
random from the cluster that has the highest SSE. This will typically split the
cluster and reduce the overall SSE of the clustering. If there are several empty
clusters, then this process can be repeated several times.
Outliers:
When the squared error criterion is used, outliers can unduly influence the
clusters that are found. In particular, when outliers are present, the
resulting cluster centroids (prototypes) are typically not as representative
as they otherwise would be and thus, the SSE will be higher. Because of
this, it is often
useful to discover outliers and eliminate them beforehand. It is important,
however, to appreciate that there are certain clustering applications for
which outliers should not be eliminated.
When clustering is used for data compression, every point must be clustered,
and in some cases, such as financial analysis, apparent outliers, e.g., unusually
profitable customers, can be the most interesting points. An obvious issue is how to
identify outliers. There are a number of techniques for
identifying outliers. If we use approaches that remove outliers before clustering,
we avoid clustering points that will not cluster well. Alternatively, outliers can also
be identified in a postprocessing step. For
instance, we can keep track of the SSE contributed by each point, and eliminate
those points with unusually high contributions, especially over multiple runs.
Also, we often want to eliminate small clusters because they frequently
represent groups of outliers.

Reducing the SSE with Postprocessing:

An obvious way to reduce the SSE is to ﬁnd more clusters, i.e., to use a larger K.
In many cases, we would like to improve the SSE, but don’t want to increase the
number of clusters. This is often possible because K-means typically converges to
a local minimum.
Various techniques are used to “ﬁx up” the resulting clusters in order to produce
a clustering that has lower SSE. The strategy is to focus on individual clusters
since the total SSE is simply the sum of the SSE contributed by each cluster. (We
will use the terms total SSE and cluster SSE, respectively, to avoid any potential
confusion.) We can change the total SSE by performing various operations on the
clusters, such as splitting or merging clusters.
One commonly used approach is to employ alternate cluster splitting and
merging phases. During a splitting phase, clusters are divided, while during a
merging phase, clusters are combined. In this way, it is often possible to escape
local SSE minima and still produce a clustering solution with the desired number
of clusters. The following are some techniques used in the splitting and merging
phases.

Two strategies that decrease the total SSE by increasing the

number of clusters are the following:
Split a cluster: The cluster with the largest SSE is usually chosen, but

we
could also split the cluster with the largest standard deviation for one particular
attribute.

Introduce a new cluster centroid: Often the point that is farthest from
any
cluster center is chosen. We can easily determine this if we keep track of the SSE
contributed by each point. Another approach is to choose randomly from all
points or from the points with the highest SSE with respect to their closest
centroids.
Two strategies that decrease the number of clusters, while trying
to minimize the increase in total SSE, are the following:
Disperse a cluster:by removing the centroid that
This is accomplished
corresponds to the cluster and reassigning the points to other clusters. Ideally,
the cluster that is dispersed should be the one that increases the total SSE the
least.

Merge two clusters: The clusters with the closest centroids are typically
chosen, although another, perhaps better, approach is to merge the two clusters
that result in the smallest increase in total SSE. These two merging strategies
are the same ones that are used in the hierarchical clustering techniques known
as the centroid method and Ward’s method, respectively.

Updating Centroids Incrementally:

Instead of updating cluster centroids after all points have been assigned to a
cluster, the centroids can be updated incrementally, after each assignment of
a point to a cluster. Notice that this requires either zero or two updates to
cluster centroids at each step, since a point either moves to a new cluster (two
updates) or stays in its current cluster (zero updates).
Using an incremental update strategy guarantees that empty clusters are not
produced because all clusters start with a single point, and if a cluster ever has
only one point, then that point will always be reassigned to the same cluster. In
addition, if incremental updating is used, the relative weight of the point
being added can be adjusted; e.g., the weight of points is often decreased as the
clustering proceeds. While this can result in better accuracy and faster
convergence, it can be difficult to make a good choice for the relative weight,
especially in a wide variety of situations. These update issues are similar to those
involved in updating weights for artificial neural networks. Yet another benefit of
incremental updates has to do with using objectives other than “minimize SSE.”
Suppose that we are given an arbitrary objective function to measure the
goodness of a set of clusters. When we process an individual point, we can
compute the value of the objective function for each possible cluster assignment,
and then choose the one that optimizes the objective.
On the negative side, updating centroids incrementally introduces an order
dependency. In other words, the clusters produced usually depend on the order
in which the points are processed. Although this can be addressed by
randomizing the order in which the points are processed, the basic K-means
approach of updating the centroids after all points have been assigned to
clusters has no order dependency. Also, incremental updates are slightly more
expensive. However, K-means converges rather quickly, and therefore, the
number of points switching clusters quickly becomes relatively small.
Advantages:
• Easy to implement.
 can be used for a wide variety of data types.

 It is also quite eﬃcient, even though multiple runs are often performed.
•
With a large number of variables, K-Means may be computationally faster
than hierarchical clustering (if K is small).

• k-Means may produce tighter clusters than hierarchical clustering.

• An instance can change cluster (move to another cluster) when the centroids
are recomputed.
Disadvantages:
• Difficult to predict the number of clusters (K-Value)
• Initial seeds have a strong impact on the final results
• The order of the data has an impact on the final results
• Sensitive to scale: rescaling your datasets (normalization or standardization)
will completely change results. While this itself is not bad, not realizing that you
have to spend extra attention to scaling your data might be bad.

 K-means is not suitable for all types of data.

 K-means also has trouble clustering data that contains outliers. It cannot handle

 non-globular clusters or clusters of different sizes and

densities, although it can typically ﬁnd pure sub clusters if a large enough
number of clusters is speciﬁed.

 k-means is restricted to data for which there is a notion of a center (centroid).

Bisecting K-means
Idea: The bisecting K-means algorithm is a straightforward extension of the
basic K-means algorithm that is based on a simple idea: to obtain K clusters, split
the set of all points into two clusters, select one of these clusters to split, and so
on, until K clusters have been produced.
There are a number of different ways to choose which cluster to split. We can
choose the largest cluster at each step, choose the one with the largest SSE, or
use a criterion based on both size and SSE. Different choices result in different
clusters. Because we are using the K-means algorithm “locally,” i.e., to bisect
individual clusters, the ﬁnal set of clusters does not represent a clustering that is
a local minimum with respect to the total SSE. Thus, we often reﬁne the resulting
clusters by using their cluster centroids as the initial centroids for the standard
K-means algorithm.
Algorithm:

K-means as an Optimization Problem:

Given an objective function such as “minimize SSE,” clustering can be treated as
an optimization problem. One way to solve this problem—to find a global
optimum—is to enumerate all possible ways of dividing the points into clusters
and then choose the set of clusters that best satisfies the objective function, e.g.,
that minimizes the total SSE.
This strategy is computationally infeasible and as a result, a more practical
approach is needed, even if such an approach finds solutions that are not
guaranteed to be optimal. One technique, which is known as gradient descent, is
based on picking an initial solution and then repeating the following two steps:
compute the change to the solution that best optimizes the objective function
and then update the solution. We assume that the data is one-dimensional, i.e.,
This does not change anything essential, but greatly

simpliﬁes the notation.

Derivation of K-means as an Algorithm to Minimize the SSE:

The centroid for the K-means algorithm can be mathematically derived when the
proximity function is Euclidean distance and the objective is to minimize the SSE.
Speciﬁcally, we investigate how we can best update a cluster centroid so that the
cluster SSE is minimized.

Here, Ci is the ith cluster, x is a point in Ci, and ci is the mean of the ith cluster.
We can solve for the kth centroid ck, which minimizes Equation, by
differentiating the SSE, setting it equal to 0, and solving, as indicated below.
Thus, as previously indicated, the best centroid for minimizing the SSE of a
cluster is the mean of the points in the cluster.

Derivation of K-means for SAE:

To demonstrate that the K-means algorithm can be applied to a variety of
different objective functions, we consider how to partition the data into K
clusters such that the sum of the Manhattan (L1) distances of points from
the center of their clusters is minimized. We are seeking to minimize the
sum of the L1 absolute errors (SAE) as given by the following equation,
where distL1 is the L1 distance. Again, for notational simplicity, we use
one-dimensional data, i.e., distL1 = |ci −x|.

We can solve for the kth centroid ck, which minimizes Equation 7.5, by
differentiating the SAE, setting it equal to 0, and solving.
If we solve for ck, we ﬁnd that ck = median{x ∈ Ck}, the median of the points in
the cluster. The median of a group of points is straightforward to compute and
less susceptible to distortion by outliers.

Agglomerative Hierarchical Clustering

There are two basic approaches for generating a hierarchical clustering:

Agglomerative: Start with the points as individual clusters and, at each

step,
merge the closest pair of clusters. This requires deﬁning a notion of cluster
proximity.

Divisive: Start with one, all-inclusive cluster and, at each step, split a
cluster
until only singleton clusters of individual points remain. In this case, we need to
decide which cluster to split at each step and how to do the splitting.
Agglomerative hierarchical clustering techniques are by far the most common. A
hierarchical clustering is often displayed graphically using a tree-like diagram
called a dendrogram, which displays both the cluster-sub cluster relationships
and the order in which the clusters were merged (agglomerative view) or split
(divisive view). For sets of two-dimensional points, a hierarchical clustering can
also be
graphically represented using a nested cluster diagram. Figure shows an
example of these two types of ﬁgures for a set of four two-dimensional points.
Basic Agglomerative Hierarchical Clustering Algorithm:
Many agglomerative hierarchical clustering techniques are variations on a single
approach: starting with individual points as clusters, successively merge the two
closest clusters until only one cluster remains.

Time and Space Complexity

The basic agglomerative hierarchical clustering algorithm just presented uses a
proximity matrix. This requires the storage of proximities (assuming the
proximity matrix is symmetric) where m is the number of data points. The space
needed to keep track of the clusters is proportional to the number of clusters,
which is m−1, excluding singleton clusters. Hence, the total space complexity is

The analysis of the basic agglomerative hierarchical clustering algorithm is also

straightforward with respect to computational complexity. time is required
to compute the proximity matrix. After that step, there are m−1 iterations
involving steps 3 and 4 because there are m clusters at the start and two
clusters are merged during each iteration. If performed as a linear search of the
proximity matrix, then for the ith iteration, Step 3 requires time,
which is proportional to the current number of clusters squared. Step 4 requires
O(m−i + 1) time to update the proximity matrix after the merger of two clusters.
(A cluster merger affects O(m − i + 1) proximities for the techniques that we
consider.) Without modiﬁcation, this would yield a time complexity of . If
the distances from each cluster to all other clusters are stored as a sorted list (or
heap), it is possible to reduce the cost of ﬁnding the two closest clusters to
O(m−i + 1). However, because of the additional complexity of keeping data in a
sorted list or heap, the overall time required for a hierarchical clustering based
on Algorithm is

Specific Techniques
Sample Data

To illustrate the behaviour of the various hierarchical clustering algorithms, we

will use sample data that consists of six two-dimensional points, which are
shown in Figure. The x and y coordinates of the points and the Euclidean
distances between them are shown in Tables , respectively.

Single Link or MIN

For the single link or MIN version of hierarchical clustering, the proximity of two
clusters is deﬁned as the minimum of the distance (maximum of the similarity)
between any two points in the two different clusters. Using graph terminology, if
you start with all points as singleton clusters and add links between points one at
a time, shortest links ﬁrst, then these single links combine the points into
clusters. The single link technique is good at handling non-elliptical shapes, but is
sensitive to noise and outliers.

Example (Single Link): Figure shows the result of applying the single

link
technique to our example data set of six points. Figure (a) shows the nested
clusters as a sequence of nested ellipses, where the numbers associated with the
ellipses indicate the order of the clustering. Figure(b) shows the same
information, but as a dendrogram. The height at which two clusters are merged
in the dendrogram reﬂects the distance of the two clusters. For instance, from
Table, we see that the distance between points 3 and 6 is 0.11, and that is the
height at which they are joined into one cluster in the dendrogram. As another
example, the distance between clusters{3,6}and {2,5} is given by dist({3,6},
{2,5}) = min(dist(3,2),dist(6,2),dist(3,5),dist(6,5)) = min(0 .15,0.25,0.28,0.39)
=0 .15.

Complete Link or MAX or CLIQUE

For the complete link or MAX version of hierarchical clustering, the proximity of
two clusters is deﬁned as the maximum of the distance (minimum of the
similarity) between any two points in the two different clusters. Using graph
terminology, if you start with all points as singleton clusters and add links
between points one at a time, shortest links ﬁrst, then a group of points is not a
cluster until all the points in it are completely linked, i.e., form a clique.
Complete link is less susceptible to noise and outliers, but it can break large
clusters and it favours globular shapes.

Example (Complete Link): Figure shows the results of applying MAX to

the
sample data set of six points. As with single link, points 3 and 6 are merged ﬁrst.
However, {3,6} is merged with {4}, instead of {2,5} or {1} because
Group Average:
For the group average version of hierarchical clustering, the proximity of two
clusters is deﬁned as the average pairwise proximity among all pairs of points
in the different clusters. This is an intermediate approach between the single
and complete link approaches. Thus, for group average, the cluster proximity
proximity(Ci, C j) of clusters Ci and Cj, which are of size mi and mj,
respectively, is expressed by the following equation:

Example (Group Average). Figure shows the results of applying the group
average approach to the sample data set of six points. To illustrate how group
average works, we calculate the distance between some clusters.
Because dist({3,6,4},{2,5}) is smaller than dist({3,6,4},{1}) and dist({2,5},
{1}), clusters {3,6,4} and {2,5} are merged at the fourth stage.

Ward’s Method :
For Ward’s method, the proximity between two clusters is deﬁned as the
increase in the squared error that results when two clusters are merged. Thus,
this method uses the same objective function as K-means clustering. While it
might seem that this feature makes Ward’s method somewhat distinct from
other hierarchical techniques, it can be shown mathematically that Ward’s
method is very similar to the group average method when the proximity between
two points is taken to be the square of the distance between them.

Example (Ward’s Method): Figure shows the results of applying Ward’s

method to the sample data set of six points. The clustering that is produced is
different from those produced by single link, complete link, and group average.
Centroid method:
Centroid methods calculate the proximity between two clusters by calculating
the distance between the centroids of clusters. These techniques may seem
similar to K-means, but as we have remarked, Ward’s method is the correct
hierarchical analog. Centroid methods also have a characteristic—often
considered bad—that is not possessed by the other hierarchical clustering
techniques that we have discussed: the possibility of inversions. Speciﬁcally, two
clusters that are merged can be more similar (less distant) than the pair of
clusters that were merged in a previous step. For the other methods, the
distance between merged clusters monotonically increases (or is, at worst, non-
increasing) as we proceed from singleton clusters to one all-inclusive cluster.
Advantages:
• Hierarchical clustering outputs a hierarchy, i.e a structure that is more
informative than the unstructured set of ﬂat clusters returned by k-means.
Therefore, it is easier to decide on the number of clusters by looking at the
dendrogram
. • Easy to implement.

Dis Advantages:

 Agglomerative hierarchical clustering algorithms are expensive in terms of

their computational and storage requirements.
All merges are ﬁnal can also cause trouble for noisy, high-dimensional data,

such as document data.
It is not possible to undo the previous step: once the instances have been
 assigned to a cluster, they can no longer be moved around.
Time complexity: not suitable for large datasets.


 Initial seeds have a strong impact on the ﬁnal results.

 The order of the data has an impact on the ﬁnal results.

 Very sensitive to outliers.

DBSCAN
Density-based clustering locates regions of high density that are separated from
one another by regions of low density. DBSCAN is a simple and effective density-
based clustering algorithm that illustrates a number of important concepts that
are important for any density-based clustering approach. In this section, we focus
solely on DBSCAN after ﬁrst considering the key notion of density.

Traditional Density: Center-Based Approach:

In the center-based approach, density is estimated for a particular point in the
data set by counting the number of points within a speciﬁed radius, Eps, of that
point. This includes the point itself. This technique is graphically illustrated by
Figure. The number of points within a radius of Eps of point A is 7, including A
itself.
This method is simple to implement, but the density of any point will depend on
the speciﬁed radius. For instance, if the radius is large enough, then all points
will have a density of m, the number of points in the data set. Likewise, if the
radius is too small, then all points will have a density of 1. An approach for
deciding on the appropriate radius for low-dimensional data is given in the next
section in the context of our discussion of DBSCAN.

Center based density

Classification of Points According to Center-Based Density:
The center-based approach to density allows us to classify a point as being (1) in
the interior of a dense region (a core point), (2) on the edge of a dense region (a
border point), or (3) in a sparsely occupied region (a noise or background point).
Figure graphically illustrates the concepts of core, border, and noise points using
a collection of two-dimensional points. The following text provides a more precise
description.

Core points: These points are in the interior of a density-based cluster. A

point
is a core point if there are at leastMinPts within a distance of Eps, where MinPts
and Eps are user-speciﬁed parameters. In Figure, point A is a core point for the
radius (Eps) if MinPts≥ 7.
Border points: A border point is not a core point, but falls within the
neighbourhood of a core point. In Figure, point B is a border point. A border point
can fall within the neighbourhoods of several core points.

Noise points: A noise point is any point that is neither a core point
nor a
border point. In Figure, point C is a noise point.
The DBSCAN Algorithm
Given the previous definitions of core points, border points, and noise
points, the DBSCAN algorithm can be informally described as follows.
Any two core points that are close enough—within a distance Eps of one
another—are put in the same cluster. Likewise, any border point that is
close enough to a core point is put in the same cluster as the core point.
(Ties need to be resolved if a border point is close to core points from
different clusters.) Noise points are discarded. The formal details are given
in Algorithm. This algorithm uses the same
concepts and finds the same clusters as the original DBSCAN, but is
optimized for simplicity, not efficiency.
Time and Space Complexity:
The basic time complexity of the DBSCAN algorithm is O(m × time to find points in
the Eps-neighbourhood), where m is the number of points. In the worst case,
this complexity is However, in low-dimensional spaces (especially 2D
space), data structures such as kd-trees allow efficient retrieval of all points
within a given distance of a specified point, and the time complexity can be as
low as O(mlogm) in the average case. The space requirement of DBSCAN, even
for high-dimensional data, is O(m) because it is necessary to keep only a small
amount of data for each point, i.e., the cluster label and the identification of each
point as a core, border, or noise point.

Strengths and Weaknesses:

Strengths:
 DBSCAN uses a density-based deﬁnition of a cluster, it is relatively
resistant to noise and can handle clusters of arbitrary shapes and sizes.
DBSCAN can ﬁnd many clusters that could not be found using K-means.


Weaknesses:

 DBSCAN has trouble when the clusters have widely varying densities. It also
has trouble with high-dimensional data because density is more diﬃcult to
 deﬁne for such data. DBSCAN can be expensive when the computation of
nearest neighbours requires computing all pairwise proximities, as is usually
the case for high-dimensional data.
 DBSCAN

DBSCAN is the abbreviation for Density-Based Spatial Clustering

of Applications with Noise. It is an unsupervised clustering algorithm.DBSCAN
clustering can work with clusters of any size from huge amounts of data and
can work with datasets containing a signiﬁcant amount of noise. It is basically
based on the criteria of a minimum number of points within a region.

DBSCAN algorithm can cluster densely grouped points eﬃciently into one
cluster. It can identify local density in the data points among large datasets.
DBSCAN can very effectively handle outliers. An advantage of DBSACN over
the K-means algorithm is that the number of centroids need not be known
beforehand in the case of DBSCAN. DBSCAN algorithm depends upon two

parameters epsilon and minPoints. Epsilon is deﬁned as the radius of each data

point around which the density is

considered.
minPoints is the number of points required within the radius so that the data
point becomes a core point.

In the above ﬁgure, we can see that point A has no points inside epsilon(e)
radius. Hence it is a Noise Point. Point B has minPoints(=4) number of points
with epsilon e radius , thus it is a Core Point. While the point has only 1 (
less than minPoints) point, hence it is a Border Point.
Steps Involved in DBSCAN Algorithm.
 First, all the points within epsilon radius are found and the core points
are identiﬁed with number of points greater than or equal to minPoints.
 Next, for each core point, if not assigned to a particular cluster, a new
cluster is created for it. All the densely connected points related to the
 core point are found and assigned to the same cluster. Two points are
called densely connected points if they have a neighbor point that has
both the points within epsilon distance. Then all the points in the data
are iterated, and the points that do not belong to any cluster are
marked as noise.


Grid Based Clustering Methods

We can use the grid-based clustering method for multi-resolution of grid-
based data structure. It is used to quantize the area of the object into a
ﬁnite number of cells, which is stored in the grid system where all the
operations of Clustering are implemented. We can use this method for its
quick processing time, which is generally independent of the number of
data objects, still dependent on only the multiple cells in each dimension
in the quantized space. 1.STING 2.CLIQUE

Clustering Techniques Overview
No ratings yet
Clustering Techniques Overview
52 pages
Cluster Analysis: Basic Concepts and Algorithms
No ratings yet
Cluster Analysis: Basic Concepts and Algorithms
141 pages
Chapter-8 (Cluster Analysis Basic Concepts and Algorithms)
No ratings yet
Chapter-8 (Cluster Analysis Basic Concepts and Algorithms)
73 pages
Data Mining: Cluster Analysis Guide
No ratings yet
Data Mining: Cluster Analysis Guide
40 pages
Cluster Analysis Concepts & Algorithms
No ratings yet
Cluster Analysis Concepts & Algorithms
93 pages
Cluster Analysis
No ratings yet
Cluster Analysis
15 pages
Cluster Analysis Set 01: Types of Clustering
No ratings yet
Cluster Analysis Set 01: Types of Clustering
18 pages
Improved K-Means Clustering Algorithm by Getting Initial Cenroids
No ratings yet
Improved K-Means Clustering Algorithm by Getting Initial Cenroids
9 pages
Cluster Analysis & Methods Guide
No ratings yet
Cluster Analysis & Methods Guide
11 pages
Clustering Techniques Overview
No ratings yet
Clustering Techniques Overview
21 pages
Unit 5 DWM by DR KSR Cluster Analysis
No ratings yet
Unit 5 DWM by DR KSR Cluster Analysis
72 pages
Comparative Analysis of Clustering Techniques
No ratings yet
Comparative Analysis of Clustering Techniques
13 pages
YEAH
No ratings yet
YEAH
2 pages
Dmbi Unit-4
No ratings yet
Dmbi Unit-4
18 pages
Two-Pass Assembler Guide
No ratings yet
Two-Pass Assembler Guide
4 pages
DWDM 5
No ratings yet
DWDM 5
12 pages
Data Mining Cluster Analysis: Basic Concepts and Algorithms
No ratings yet
Data Mining Cluster Analysis: Basic Concepts and Algorithms
26 pages
Sai 2016 7555988
No ratings yet
Sai 2016 7555988
5 pages
21csc101t Oodp Unit-3
No ratings yet
21csc101t Oodp Unit-3
147 pages
Ethiopian CS&E Exit Exam Guidelines
No ratings yet
Ethiopian CS&E Exit Exam Guidelines
11 pages
Slide Set 6 Strings and Pointers
No ratings yet
Slide Set 6 Strings and Pointers
50 pages
Prasanna Hebbar @govt First Grade College Honnavar
No ratings yet
Prasanna Hebbar @govt First Grade College Honnavar
11 pages
012 - AWT MATERIAL For Cs Students
No ratings yet
012 - AWT MATERIAL For Cs Students
230 pages
Cluster Lecture-1
No ratings yet
Cluster Lecture-1
20 pages
Unit 4
No ratings yet
Unit 4
40 pages
Data Clustering: 50 Years Beyond K-Means
No ratings yet
Data Clustering: 50 Years Beyond K-Means
35 pages
Cluster Analysis for Data Scientists
No ratings yet
Cluster Analysis for Data Scientists
30 pages
DWDM Unit 5
No ratings yet
DWDM Unit 5
43 pages
DWDM FINAL6
No ratings yet
DWDM FINAL6
28 pages
DWDM Unit Vi
No ratings yet
DWDM Unit Vi
23 pages
Data Mining Unit 5
No ratings yet
Data Mining Unit 5
30 pages
Clustering
No ratings yet
Clustering
29 pages
GE3151 Unit V QB
No ratings yet
GE3151 Unit V QB
30 pages
Clustering: Methods and Applications
No ratings yet
Clustering: Methods and Applications
69 pages
Aditya Singh: CS & Math Graduate Resume
No ratings yet
Aditya Singh: CS & Math Graduate Resume
1 page
DM Unit 5
No ratings yet
DM Unit 5
15 pages
DM Unit-5 Notes
No ratings yet
DM Unit-5 Notes
16 pages
Unit 4 Clustering
No ratings yet
Unit 4 Clustering
18 pages
AI
0% (1)
AI
7 pages
A New Hierarchical Clustering Algorithm
No ratings yet
A New Hierarchical Clustering Algorithm
5 pages
CG Lop
No ratings yet
CG Lop
24 pages
Understanding C++ Structures
No ratings yet
Understanding C++ Structures
5 pages
Microcontroller Na D Embedded System PPT Module-1
No ratings yet
Microcontroller Na D Embedded System PPT Module-1
51 pages
Year 11 Math Scope & Sequence
No ratings yet
Year 11 Math Scope & Sequence
1 page
Unit-IV Cluster Outlier Analysis
No ratings yet
Unit-IV Cluster Outlier Analysis
21 pages
Cluster Analysis & DBSCAN Guide
No ratings yet
Cluster Analysis & DBSCAN Guide
26 pages
Data Mining, Vipin Kumar, Pang-Ning Tan, Michael Steinback, Anuj Karpatne - Introduction To Data Mining-Pearson
No ratings yet
Data Mining, Vipin Kumar, Pang-Ning Tan, Michael Steinback, Anuj Karpatne - Introduction To Data Mining-Pearson
81 pages
DM UNIT-4 Part2
No ratings yet
DM UNIT-4 Part2
18 pages
Unit-V (Dmwh6em)
No ratings yet
Unit-V (Dmwh6em)
30 pages
Clustering
No ratings yet
Clustering
6 pages
Accept and Display Numbers
No ratings yet
Accept and Display Numbers
11 pages
Cluster Analysis Explained
No ratings yet
Cluster Analysis Explained
22 pages
Algorithm Analysis for Students
No ratings yet
Algorithm Analysis for Students
95 pages
DW & DM Unit 4 Notes
No ratings yet
DW & DM Unit 4 Notes
40 pages
Data Mining Unit-4
No ratings yet
Data Mining Unit-4
38 pages
001 - Clustering - Jain - Dubes (1) - 69-103
No ratings yet
001 - Clustering - Jain - Dubes (1) - 69-103
40 pages
Clustering
No ratings yet
Clustering
8 pages
DM Module 4
No ratings yet
DM Module 4
17 pages
Clustering New
No ratings yet
Clustering New
6 pages
Big Data Processing for Students
No ratings yet
Big Data Processing for Students
2 pages
Unit 4
No ratings yet
Unit 4
106 pages
Fuzzy Meaning
No ratings yet
Fuzzy Meaning
6 pages
Sheets, Banks, Written
No ratings yet
Sheets, Banks, Written
42 pages
Cluster Analysis (1) - RMM
No ratings yet
Cluster Analysis (1) - RMM
17 pages
Unit-5 DWDM Material
No ratings yet
Unit-5 DWDM Material
19 pages
DWDM Unit-5
No ratings yet
DWDM Unit-5
17 pages
Clustering Jain Dubes (1) - 69-103
No ratings yet
Clustering Jain Dubes (1) - 69-103
5 pages
Unit-4 New
No ratings yet
Unit-4 New
36 pages
Computer Science and Engineering (Data Science)
No ratings yet
Computer Science and Engineering (Data Science)
447 pages
Thesis Title in Computer Science
100% (3)
Thesis Title in Computer Science
7 pages
SE-FDS-Sample Questions For All Units
No ratings yet
SE-FDS-Sample Questions For All Units
4 pages
Processor Organization & Pipelining
No ratings yet
Processor Organization & Pipelining
74 pages
Appc 2.7b Packet
No ratings yet
Appc 2.7b Packet
4 pages
Java Exception Handling Q&A Guide
No ratings yet
Java Exception Handling Q&A Guide
6 pages
L06 - Binary Search Tree - Xem Lại Lần Làm Thử - BK-LMS
No ratings yet
L06 - Binary Search Tree - Xem Lại Lần Làm Thử - BK-LMS
21 pages
Final Lab Experiments
No ratings yet
Final Lab Experiments
17 pages
Dokumen - Pub - Introduction To Data Mining 2nbsped 2017048641 9780133128901 0133128903 846 858
No ratings yet
Dokumen - Pub - Introduction To Data Mining 2nbsped 2017048641 9780133128901 0133128903 846 858
13 pages
11th Computer Science EM Half Yearly Exam 2023 Question Paper Erode District English Medium PDF Download
No ratings yet
11th Computer Science EM Half Yearly Exam 2023 Question Paper Erode District English Medium PDF Download
2 pages
Documentation O3MCANReceiveLibrary Codesys23 Codesys35
No ratings yet
Documentation O3MCANReceiveLibrary Codesys23 Codesys35
54 pages
Clustering U 5
No ratings yet
Clustering U 5
2 pages
ML Syllabus
No ratings yet
ML Syllabus
3 pages
What Is The Role of Algorithm Analysis in Data Structures?: Computer Science
No ratings yet
What Is The Role of Algorithm Analysis in Data Structures?: Computer Science
10 pages
Popc Superimp
No ratings yet
Popc Superimp
44 pages
Clustering
No ratings yet
Clustering
118 pages
Unt III (DS)
No ratings yet
Unt III (DS)
49 pages
LAB211 Assignment: Title Background Program Specifications
No ratings yet
LAB211 Assignment: Title Background Program Specifications
5 pages
DMT Unit-5
No ratings yet
DMT Unit-5
10 pages
Chapter-8 (Cluster Analysis Basic Concepts and Algorithms)
No ratings yet
Chapter-8 (Cluster Analysis Basic Concepts and Algorithms)
75 pages

Cluster Analysis for Data Scientists

Uploaded by

Cluster Analysis for Data Scientists

Uploaded by

UNIT-6

Cluster Analysis: Basic Concepts and Algorithms

Hierarchical versus Partitional:

Hierarchial Clustering: If we permit clusters to have sub clusters, then

Exclusive versus Overlapping versus Fuzzy: Exclusive Clustering:

For example: some newspaper stories share a common theme, such as

Prototype-Based Cluster: A cluster is a set of objects in which each object

Graph-Based Cluster: If the data is represented as a graph, where the

Density-Based Cluster: A cluster is a dense region of objects that is

Time and Space Complexity:

Types of Hierarchical Clustering

Agglomerative Clustering Algorithm

Calculation of Distance Between Two

 In the single linkage approach, we take the distance between

Using Single Linkage approach

Step 2: To combine the individual clusters, we can consider the

As a result, we will get the following dendrogram.

K-means: Additional Issues

Approach 2: Another approach is to choose the replacement centroid

Reducing the SSE with Postprocessing:

Two strategies that decrease the total SSE by increasing the

Updating Centroids Incrementally:

• k-Means may produce tighter clusters than hierarchical clustering.

 K-means is not suitable for all types of data.

 non-globular clusters or clusters of different sizes and

 k-means is restricted to data for which there is a notion of a center (centroid).

K-means as an Optimization Problem:

simpliﬁes the notation.

Derivation of K-means as an Algorithm to Minimize the SSE:

Derivation of K-means for SAE:

Agglomerative Hierarchical Clustering

Agglomerative: Start with the points as individual clusters and, at each

Time and Space Complexity

The analysis of the basic agglomerative hierarchical clustering algorithm is also

To illustrate the behaviour of the various hierarchical clustering algorithms, we

Single Link or MIN

Complete Link or MAX or CLIQUE

Example (Complete Link): Figure shows the results of applying MAX to

Example (Ward’s Method): Figure shows the results of applying Ward’s

 Agglomerative hierarchical clustering algorithms are expensive in terms of

 The order of the data has an impact on the ﬁnal results.

 Very sensitive to outliers.

Traditional Density: Center-Based Approach:

Center based density

Core points: These points are in the interior of a density-based cluster. A

Strengths and Weaknesses:

DBSCAN is the abbreviation for Density-Based Spatial Clustering

point around which the density is

Grid Based Clustering Methods

You might also like