
PE 833 CS ML DCET

MACHINE LEARNING
UNIT -V
Clustering
1. Introduction,
2. Similarity and Distance Measures,
3. Outliers,
4. Hierarchical Methods,
5. Partitional Algorithms,
6. Clustering Large Databases,
7. Clustering with Categorical Attributes,
8. Comparison

1. INTRODUCTION
Clustering is similar to classification in that data are grouped. However, unlike classification,
the groups are not predefined.
Instead, the grouping is accomplished by finding similarities between data according to
characteristics found in the actual data.
The groups are called clusters. Many definitions for clusters have been proposed:
• Set of like elements. Elements from different clusters are not alike.
• The distance between points in a cluster is less than the distance between a point in the
cluster and any point outside it.
A term similar to clustering is database segmentation, where like tuples (records) in a database
are grouped together. This is done to partition or segment the database into components that
then give the user a more general view of the data.
A simple example of clustering is found in Example 5.1. This example illustrates the fact that
determining how to do the clustering is not straightforward.
EXAMPLE 5.1
An International Online Catalog company wishes to group its customers based on common
features. Company management does not have any predefined labels for these groups.
Based on the outcome of the grouping, they will target marketing and advertising campaigns
to the different groups. The information they have about the customers includes


Depending on the type of advertising, not all attributes are important. For example, suppose the advertisement is for a special sale on children's clothes. We could target the advertising only to persons with children.

As illustrated in Figure 5.1, a given set of data may be clustered on different attributes.
Here a group of homes in a geographic area is shown.
The first type of clustering is based on the location of the home. Homes that are
geographically close to each other are clustered together.
In the second clustering, homes are grouped based on the size of the house.
Clustering has been used in many application domains, including biology, medicine,
anthropology, marketing, and economics.
Clustering applications include plant and animal classification, disease classification,
image processing, pattern recognition, and document retrieval. One of the first domains in
which clustering was used was biological taxonomy. Recent uses include examining Web
log data to detect usage patterns.

When clustering is applied to a real-world database, many interesting problems occur:


• Outlier handling is difficult. Here the elements do not naturally fall into any cluster.
They can be viewed as solitary clusters. However, if a clustering algorithm attempts to
find larger clusters, these outliers will be forced to be placed in some cluster. This
process may result in the creation of poor clusters by combining two existing clusters
and leaving the outlier in its own cluster.
• Dynamic data in the database implies that cluster membership may change over time.
• Interpreting the semantic meaning of each cluster may be difficult. With classification,
the labeling of the classes is known ahead of time. However, with clustering, this may
not be the case. Thus, when the clustering process finishes creating a set of clusters,
the exact meaning of each cluster may not be obvious.
▪ Here is where a domain expert is needed to assign a label or interpretation for each cluster.


• There is no one correct answer to a clustering problem. In fact, many answers may be
found.
▪ The exact number of clusters required is not easy to determine. Again, a domain expert may be required.
▪ For example, suppose we have a set of data about plants that have been collected during a field trip. Without any prior knowledge of plant classification, if we attempt to divide this set of data into similar groupings, it would not be clear how many groups should be created.
• Another related issue is what data should be used for clustering. Unlike learning during
a classification process, where there is some a priori knowledge concerning what the
attributes of each classification should be, in clustering we have no supervised learning
to aid the process.

We can then summarize some basic features of clustering (as opposed to classification):
• The (best) number of clusters is not known.
• There may not be any a priori knowledge concerning the clusters.
• Cluster results are dynamic.
The clustering problem is stated as shown in Definition 5.1. Here we assume that the number of clusters to be created is an input value, k. The actual content (and interpretation) of each cluster, Kj, 1 ≤ j ≤ k, is determined as a result of the function definition.
We view the result of solving a clustering problem as a set of clusters K = {K1, K2, ..., Kk}.

A classification of the different types of clustering algorithms is shown in Figure 5.2.


Clustering algorithms themselves may be viewed as hierarchical or partitional.
With hierarchical clustering, a nested set of clusters is created. Each level in the hierarchy has
a separate set of clusters.
At the lowest level, each item is in its own unique cluster. At the highest level, all items belong
to the same cluster.

With hierarchical clustering, the desired number of clusters is not input.


With partitional clustering, the algorithm creates only one set of clusters. These approaches use
the desired number of clusters to drive how the final set is created.


Traditional clustering algorithms tend to be targeted to small numeric databases that fit into
memory. There are, however, more recent clustering algorithms that look at categorical data
and are targeted to larger, perhaps dynamic, databases.
Algorithms targeted to larger databases may adapt to memory constraints by either sampling
the database or using data structures, which can be compressed or pruned to fit into memory
regardless of the size of the database.
Clustering algorithms may also differ based on whether they produce overlapping or nonoverlapping clusters. With overlapping clusters, an item may be placed in multiple clusters; here we consider only nonoverlapping clusters.
In turn, nonoverlapping clusters can be viewed as extrinsic or intrinsic.
▪ Extrinsic techniques use labeling of the items to assist in the classification process. These algorithms are the traditional classification supervised learning algorithms in which a special input training set is used.
▪ Intrinsic algorithms do not use any a priori category labels, but depend only on the adjacency matrix containing the distance between objects. All algorithms we examine in this chapter fall into the intrinsic class.

The types of clustering algorithms can be further classified based on the implementation technique used.
Hierarchical algorithms can be categorized as agglomerative or divisive.
▪ "Agglomerative" implies that the clusters are created in a bottom-up fashion, while "divisive" algorithms work in a top-down fashion.

Another descriptive tag indicates whether each individual element is handled one by one, serial
(sometimes called incremental), or whether all items are examined together, simultaneous.
▪ If a specific tuple is viewed as having attribute values for all attributes in the schema, then clustering algorithms could differ as to how the attribute values are examined. Monothetic algorithms examine attribute values one at a time, as is usually done with decision tree classification techniques.
▪ Polythetic algorithms consider all attribute values at one time.

Finally, clustering algorithms can be labeled based on the mathematical formulation used by the algorithm:
▪ Graph theoretic or matrix algebra.
▪ In this chapter we generally use the graph approach and describe the input to the clustering algorithm as an adjacency matrix labeled with distance measures.

2. SIMILARITY AND DISTANCE MEASURES


There are many desirable properties for the clusters created by a solution to a specific clustering
problem.
The most important one is that a tuple within one cluster is more like tuples within that cluster
than it is similar to tuples outside it.
As with classification, then, we assume the definition of a similarity measure, sim(ti, tl), defined between any two tuples ti, tl ∈ D.


Here the Centroid is the "middle" of the cluster; it need not be an actual point in the cluster.
Some clustering algorithms alternatively assume that the cluster is represented by one centrally
located object in the cluster called a Medoid.
The radius is the square root of the average squared distance from any point in the cluster to the centroid, and
the diameter is the square root of the average squared distance between all pairs of points in the cluster.
We use the notation Mm to indicate the medoid for cluster Km.
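These definitions can be made concrete with a small sketch (Python with NumPy; the sample cluster and its two-dimensional tuples are illustrative assumptions, not data from the text):

import numpy as np

def centroid(points):
    # The "middle" of the cluster: the attribute-wise mean (need not be an actual point).
    return points.mean(axis=0)

def radius(points):
    # Square root of the average squared distance from the points to the centroid.
    c = centroid(points)
    return np.sqrt(np.mean(np.sum((points - c) ** 2, axis=1)))

def diameter(points):
    # Square root of the average squared distance between all distinct pairs of points.
    n = len(points)
    sq = [np.sum((points[i] - points[j]) ** 2) for i in range(n) for j in range(n) if i != j]
    return np.sqrt(np.mean(sq))

cluster = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
print(centroid(cluster), radius(cluster), diameter(cluster))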


3. OUTLIERS
As mentioned earlier, outliers are sample points with values much different from those of the
remaining set of data.
Outliers may represent errors in the data (perhaps a malfunctioning sensor recorded an incorrect
data value) or could be correct data values that are simply much different from the remaining
data.
Some clustering techniques do not perform well with the presence of outliers. This problem is
illustrated in Figure 5.3.
▪ Here if three clusters are found (solid line), the outlier will occur in a cluster by itself. However, if two clusters are found (dashed line), the two (obviously) different sets of data will be placed in one cluster because they are closer together than the outlier.
▪ This problem is complicated by the fact that many clustering algorithms actually have as input the number of desired clusters to be found.


▪ Clustering algorithms may actually find and remove outliers to ensure that they perform better. However, care must be taken in actually removing outliers.
▪ For example, suppose that the data mining problem is to predict flooding. Extremely high water-level values occur very infrequently, and when compared with the normal water-level values they may seem to be outliers.
▪ However, removing these values may not allow the data mining algorithms to work effectively because there would be no data that showed that floods ever actually occurred.
▪ Outlier Detection, or Outlier Mining, is the process of identifying outliers in a set of data. Clustering, or other data mining, algorithms may then choose to remove or treat these values differently.
▪ Some outlier detection techniques are based on statistical techniques. These usually assume that the set of data follows a known distribution and that outliers can be detected by well-known tests such as discordancy tests.
• However, these tests are not very realistic for real-world data because real-world data values may not follow well-defined data distributions.
• Also, most of these tests assume a single attribute value, while many attributes are involved in real-world datasets.
Alternative detection techniques may be based on distance measures.
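As a hedged illustration of the distance-based idea, the sketch below flags a point as an outlier when fewer than a chosen fraction p of the other points lie within a chosen distance d of it; both thresholds and the sample data are assumptions made for the example, not values from the text.

import numpy as np

def distance_based_outliers(data, d, p):
    # Flag a point when fewer than a fraction p of the remaining points
    # lie within distance d of it.
    data = np.asarray(data, dtype=float)
    n = len(data)
    flags = []
    for i in range(n):
        dists = np.linalg.norm(data - data[i], axis=1)
        close = (dists <= d).sum() - 1          # exclude the point itself
        flags.append(close < p * (n - 1))
    return flags

points = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.0], [8.0, 8.0]]
print(distance_based_outliers(points, d=1.0, p=0.5))   # only the last point is flagged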

4. HIERARCHICAL ALGORITHMS
As mentioned earlier, hierarchical clustering algorithms actually create sets of clusters.
Example 5.2 illustrates the concept. Hierarchical algorithms differ in how the sets are created.
A tree data structure, called a Dendrogram, can be used to illustrate the hierarchical clustering
technique and the sets of different clusters.
The root in a dendrogram tree contains one cluster where all elements are together. The leaves
in the dendrogram each consist of a single element cluster. Internal nodes in the dendrogram
represent new clusters formed by merging the clusters that appear as its children in the tree.
Each level in the tree is associated with the distance measure that was used to merge the
clusters.
All clusters created at a particular level were combined because the children clusters
had a distance between them less than the distance value associated with this level in the tree.


A dendrogram for Example 5.2 is seen in Figure 5.4.


EXAMPLE 5.2
Figure 5.5 shows six elements, {A, B, C, D, E, F}, to be clustered. Parts (a) to (e) of the figure
show five different sets of clusters. In part (a) each cluster is viewed to consist of a single
element.

Part (b) illustrates four clusters. Here there are two sets of two-element clusters.
These clusters are formed at this level because these two elements are closer to each other than
any of the other elements.
Part (c) shows a new cluster formed by adding a close element to one of the two-element clusters.
In part (d) the two-element and three-element clusters are merged to give a five-element cluster.
This is done because these two clusters are closer to each other than to the remote element
cluster, {F}.
At the last stage, part (e), all six elements are merged.

The space complexity for hierarchical algorithms is O(n²) because this is the space required for the adjacency matrix.
The space required for the dendrogram is O(kn), which is much less than O(n²).
The time complexity for hierarchical algorithms is O(kn²) because there is one iteration for each level in the dendrogram.


Depending on the specific algorithm, however, this could actually be O(maxd · n²), where maxd is the maximum distance between points.
Different algorithms may actually merge the closest clusters from the next lowest level or
simply create new clusters at each level with progressively larger distances.
Hierarchical techniques are well suited for many clustering applications that naturally exhibit
a Nesting relationship between clusters.
For example, in biology, plant and animal taxonomies could easily be viewed as a hierarchy of
clusters.

4.1 Agglomerative Algorithms


Agglomerative algorithms start with each individual item in its own cluster and iteratively
merge clusters until all items belong in one cluster.
Different agglomerative algorithms differ in how the clusters are merged at each level.
Algorithm 5.1 illustrates the typical agglomerative clustering algorithm. It assumes that a set
of elements and distances between them is given as input.
We use an n × n vertex adjacency matrix, A, as input. Here the adjacency matrix, A, contains a distance value rather than a simple Boolean value: A[i, j] = dis(ti, tj).
The output of the algorithm is a dendrogram, DE, which we represent as a set of ordered
triples (d, k, K) where d is the threshold distance, k is the number of clusters, and K is the set
of clusters.
The dendrogram in Figure 5.7(a) would be represented by a set of such triples. Outputting the dendrogram produces a set of clusterings rather than just one clustering.
The user can determine which of the clusters (based on distance threshold) he or she wishes to
use.


This algorithm uses a procedure called NewClusters to determine how to create the next level
of clusters from the previous level.
This is where the different types of agglomerative algorithms differ. It is possible that only two clusters from the prior level are merged or that multiple clusters are merged.
Algorithms also differ in terms of which clusters are merged when there are several clusters with identical distances.
In addition, the technique used to determine the distance between clusters may vary.
Single link, complete link, and average link techniques are perhaps the best-known agglomerative techniques, based on well-known graph theory concepts.
• All agglomerative approaches experience excessive time and space constraints.
• The space required for the adjacency matrix is O(n²) where there are n items to cluster.
• Because of the iterative nature of the algorithm, the matrix (or a subset of it) must be
accessed multiple times.
• The simplistic algorithm provided in Algorithm 5.1 performs at most maxd
examinations of this matrix, where maxd is the largest distance between any two points.
• In addition, the complexity of the NewClusters procedure could be expensive. This is a
potentially severe problem in large databases.
• Another issue with the agglomerative approach is that it is not incremental. Thus, when
new elements are added or old ones are removed or changed, the entire algorithm must
be rerun.

4.1.1 Single Link Technique


The single link technique is based on the idea of finding maximal connected components in
a graph.
A connected component is a graph in which there exists a path between any two vertices.
With the single link approach, two clusters are merged if there is at least one edge that
connects the two clusters;
that is, if the minimum distance between any two points is less than or equal to the
threshold distance being considered.
For this reason, it is often called the Nearest Neighbor Clustering Technique.
Example 5.3 illustrates this process.

EXAMPLE 5.3
Table 5.2 contains five sample data items with the distance between the elements indicated
in the table entries.
When viewed as a graph problem, Figure 5.6(a) shows the general graph With all edges
labeled with the respective distances.
To understand the idea behind the hierarchical approach, we show several graph variations
in Figures 5.6(b), (c), (d), and (e).


Figure 5.6(b) shows only those edges with a distance of 1 or less. There are only two edges.
The first level of single link clustering then will combine the connected clusters (single
elements from the first phase), giving three clusters: {A, B}, {C, D}, and {E}.
During the next level of clustering, we look at edges with a length of 2 or less. The
graph representing this threshold distance is shown in Figure 5.6(c). Note that we now have
an edge (actually three) between the two clusters {A,B} and {C,D}. Thus, at this level of
the single link clustering algorithm, we merge these two clusters to obtain a total of two
clusters: {A, B, C, D} and {E}.

The graph that is created with a threshold distance of 3 is shown in Figure 5.6(d). Here the
graph is connected, so the two clusters from the last level are merged into one large cluster
that contains all elements.
The dendrogram for this single link example is shown in Figure 5.7(a). The labeling on the right-hand side shows the threshold distance used to merge the clusters at each level.


The single link algorithm is obtained by replacing the NewClusters procedure in the
agglomerative algorithm with a procedure to find connected components of a graph.
We assume that this connected component procedure has as input a graph (actually
represented by a vertex adjacency matrix and set of vertices) and as outputs a set of connected
components defined by a number (indicating the number of components) and an array
containing the membership of each component.
Note that this is exactly what the last two entries in the ordered triple are used for by the
dendrogram data structure.
The single link approach is quite simple, but it suffers from several problems.
This algorithm is not very efficient because the connected components procedure, which is an O(n²) space and time algorithm, is called at each iteration.
A more efficient algorithm could be developed by looking at which clusters from an earlier
level can be merged at each step.
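A minimal, unoptimized sketch of the basic approach is shown below. The distance matrix is a hypothetical one chosen to be consistent with the narrative of Example 5.3 (it is not necessarily the book's Table 5.2); each threshold graph is built and its connected components become the clusters, yielding the dendrogram as a list of (d, k, K) triples.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Hypothetical distances for items A..E, consistent with Example 5.3.
items = ["A", "B", "C", "D", "E"]
dist = np.array([[0, 1, 2, 2, 3],
                 [1, 0, 2, 4, 3],
                 [2, 2, 0, 1, 5],
                 [2, 4, 1, 0, 3],
                 [3, 3, 5, 3, 0]], dtype=float)

dendrogram = []                                   # ordered triples (d, k, K)
for d in range(int(dist.max()) + 1):
    graph = csr_matrix((dist <= d) & (dist > 0))  # threshold graph: edges with distance <= d
    k, labels = connected_components(graph, directed=False)
    K = [{items[i] for i in range(len(items)) if labels[i] == c} for c in range(k)]
    dendrogram.append((d, k, K))

for triple in dendrogram:
    print(triple)   # e.g. (1, 3, [{'A', 'B'}, {'C', 'D'}, {'E'}])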

Another problem is that the clustering creates clusters with long chains.
An alternative view to merging clusters in the single link approach is that two clusters are
merged at a stage where the threshold distance is d if the minimum distance between any
vertex in one cluster and any vertex in the other cluster is at most d.
There have been other variations of the single link algorithm. One variation, based on the
use of a minimum spanning tree (MST), is shown in Algorithm 5.2.
• Here we assume that a procedure, MST, produces a minimum spanning tree given
an adjacency matrix as input.
• The clusters are merged in increasing order of the distance found in the MST.
• In the algorithm we show that once two clusters are merged, the distance between them in the tree becomes ∞.
• Alternatively, we could have replaced the two nodes and edge with one node.


We illustrate this algorithm using the data in Example 5.3. Figure 5.8 shows one MST for the example.
The algorithm will merge A and B and then C and D (or the reverse). These two clusters will
then be merged at a threshold of 2.
Finally, E will be merged at a threshold of 3. Note that we get exactly the same dendrogram
as in Figure 5.7(a).

The time complexity of this algorithm is O(n²) because the procedure to create the minimum spanning tree is O(n²) and it dominates the time of the algorithm.
Once the MST, with its n - 1 edges, is created, the repeat loop is executed only n - 1 times.
The single linkage approach is infamous for its chain effect; that is, two clusters are merged
if only two of their points are close to each other. There may be points in the respective
clusters to be merged that are far apart, but this has no impact on the algorithm.
Thus, resulting clusters may have points that are not related to each other at all, but simply
happen to be near (perhaps via a transitive relationship) points that are close to each other.

4.1.2 Complete Link Algorithm.


Although the complete link algorithm is similar to the single link algorithm, it looks for
cliques rather than connected components.
A Clique is a maximal graph in which there is an edge between any two vertices. Here a
procedure is used to find the maximum distance between any clusters so that two clusters
are merged if the maximum distance is less than or equal to the distance threshold.


In this algorithm, we assume the existence of a procedure, clique, which finds all cliques
in a graph. As with the single link algorithm, this is expensive because it is an O(n²) algorithm.
Clusters found with the complete link method tend to be more compact than those found
using the single link technique.
Using the data found in Example 5.3, Figure 5.7(b) shows the dendrogram created.
A variation of the complete link algorithm is called the Farthest Neighbor Algorithm.
Here the closest clusters are merged, where the distance between two clusters is measured by the maximum distance between any two of their points and the pair of clusters with the smallest such distance is merged.

4.1.3 Average Link.


The average link technique merges two clusters if the average distance between any two
points in the two target clusters is below the distance threshold.
The algorithm used here is slightly different from that found in single and complete link
algorithms because we must examine the complete graph (not just the threshold graph) at
each stage. Thus, we restate this Algorithm :5.3.

Note that in this algorithm we increment d by 0.5 rather than by 1.


This is a rather arbitrary decision based on an understanding of the data. Certainly, we could have used an increment of 1, but we would then have had a dendrogram different from that seen in Figure 5.7(c).

4.2 Divisive Clustering


With divisive clustering, all items are initially placed in one cluster and clusters are repeatedly
split in two until all items are in their own cluster.
The idea is to split up clusters where some elements are not sufficiently close to other
elements.


One simple example of a divisive algorithm is based on the MST version of the single link
algorithm.
Here, however, we cut out edges from the MST from the largest to the smallest.

Looking at Figure 5.8, we would start with a cluster containing all items: {A, B, C, D, E}.
Looking at the MST, we see that the largest edge is between D and E. Cutting this out of
the MST, we then split the one cluster into two: {E} and {A, B, C, D}.
Next, we remove the edge between B and C. This splits the one large cluster into two: {A,
B} and {C, D}.
These will then be split at the next step. The order depends on how a specific
implementation would treat identical values.
Looking at the dendrogram in Figure 5.7(a), we see that we have created the same set of
clusters as with the agglomerative approach, but in reverse order.

5. PARTITIONAL ALGORITHMS
Nonhierarchical or partitional clustering creates the clusters in one step as opposed to
several steps.
Only one set of clusters is created, although several different sets of clusters may be created
internally within the various algorithms.
Since only one set of clusters is output, the user must input the desired number, k, of
clusters.
In addition, some metric or criterion function is used to determine the goodness of any
proposed solution.
This measure of quality could be the average distance between clusters or some other
metric.
The solution with the best value for the criterion function is the clustering solution used.
One common measure is a squared error metric, which measures the squared distance from
each point to the centroid for the associated cluster:

A problem with partitional algorithms is that they suffer from a combinatorial explosion
due to the number of possible solutions.
Clearly, searching all possible clustering alternatives usually would not be feasible.
For example, given a measurement criterion, a naive approach could look at all possible
sets of k clusters. There are S(n, k) possible combinations to examine, where S(n, k) is the Stirling number of the second kind (the number of ways to partition n items into k non-empty clusters).


There are 11,259,666,950 different ways to cluster 19 items into 4 clusters.
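S(n, k) can be evaluated by inclusion-exclusion; a quick sketch that reproduces the count above:

from math import comb, factorial

def stirling2(n, k):
    # Stirling number of the second kind: partitions of n items into k non-empty clusters.
    return sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1)) // factorial(k)

print(stirling2(19, 4))   # 11259666950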


Thus, most algorithms look only at a small subset of all the clusters using some strategy to
identify sensible clusters.

5.1 Minimum Spanning Tree (MST)


Since we have agglomerative and divisive algorithms based on the use of an MST, we also
present a partitional MST algorithm.
This is a very simplistic approach, but it illustrates how partitional algorithms work. The
algorithm is shown in Algorithm 5.4.
Since the clustering problem is to define a mapping, the output of this algorithm shows the clusters as a set of ordered pairs (ti, j) where f(ti) = Kj.
The problem is how to define "inconsistent." It could be defined as in the earlier divisive MST algorithm, based on distance.
This would remove the largest k - 1 edges from the starting completely connected graph
and yield the same results as this corresponding level in the dendrogram.

Zahn proposes more reasonable inconsistent measures based on the weight (distance) of an
edge as compared to those close to it.
For example, an inconsistent edge would be one whose weight is much larger than the
average of the adjacent edges.
The time complexity of this algorithm is again dominated by the MST procedure, which is O(n²).
At most k - 1 edges will be removed, so the last three steps of the algorithm, assuming each step takes a constant time, are only O(k - 1).
Although determining the inconsistent edges in M may be quite complicated, it will not require a time greater than the number of edges in M.
When looking at edges adjacent to one edge, there are at most k - 2 of these edges.
In this case, then, the last three steps are O(k²), and the total algorithm is still O(n²).
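A sketch of the partitional MST idea under the simple notion of inconsistency discussed above (remove the k - 1 largest MST edges); the sample points and the value of k are illustrative assumptions:

import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_partition(points, k):
    dist = cdist(points, points)                  # complete weighted graph
    mst = minimum_spanning_tree(dist).toarray()   # n - 1 edges, one triangle filled
    edges = np.argwhere(mst > 0)
    weights = mst[mst > 0]
    # "Inconsistent" edges here are simply the k - 1 largest ones.
    for idx in np.argsort(weights)[::-1][: k - 1]:
        i, j = edges[idx]
        mst[i, j] = 0
    n_clusters, labels = connected_components(mst > 0, directed=False)
    return labels

points = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [20, 0]])
print(mst_partition(points, k=3))   # cluster label for each point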

5.2 Squared Error Clustering Algorithm


The squared error clustering algorithm minimizes the squared error.
The squared error for a cluster is the sum of the squared Euclidean distances between each
element in the cluster and the cluster centroid, Ck.
Given a cluster Ki, let the set of items mapped to that cluster be {ti1, ti2, ..., tim}.
The squared error is defined as


In actuality, there are many different examples of squared error clustering algorithms. They
all follow the basic algorithm structure shown in Algorithm 5.5.
For each iteration in the squared error algorithm, each tuple is assigned to the cluster with
the closest center.
Since there are k clusters and n items, this is an O(kn) operation. Assuming t iterations, this becomes an O(tkn) algorithm.
The amount of space may be only O(n) because an adjacency matrix is not needed, as the
distance between all items is not used.

5.3 K-Means Clustering


K-means is an iterative clustering algorithm in which items are moved among sets of
clusters until the desired set is reached.
As such, it may be viewed as a type of squared error algorithm, although the convergence
criteria need not be defined based on the squared error.
A high degree of similarity among elements in clusters is obtained, while a high degree of
dissimilarity among elements in different clusters is achieved simultaneously.
The cluster mean of Ki = {ti1, ti2, ..., tim} is defined as mi = (1/m)(ti1 + ti2 + ... + tim).

This definition assumes that each tuple has only one numeric value as opposed to a tuple
with many attribute values.
The K-means algorithm requires that some definition of cluster mean exists, but it does not
have to be this particular one.
Here the mean is defined identically to our earlier definition of centroid.
This algorithm assumes that the desired number of clusters, k, is an input parameter.


Note that the initial values for the means are arbitrarily assigned. These could be assigned
randomly or perhaps could use the values from the first k input items themselves.

The convergence criteria could be based on the squared error, but they need not be.

❖ For example, the algorithm could stop when no (or a very small) number of tuples are
assigned to different clusters.
❖ Other termination techniques have simply looked at a fixed number of iterations. A
maximum number of iterations may be included to ensure stopping even without
convergence.

EXAMPLE 5.4
Suppose that we are given the following items to cluster: {2, 4, 10, 12, 3, 20, 30, 11, 25}
and suppose that k = 2.
We initially assign the means to the first two values: m1 = 2 and m2 = 4.
Using Euclidean distance, we find that initially K1 = {2, 3} and K2 = {4, 10, 12, 20, 30, 11, 25}.
We then recalculate the means to get m1 = 2.5 and m2 = 16.
We again make assignments to clusters to get K1 = {2, 3, 4} and K2 = {10, 12, 20, 30, 11, 25}.
Continuing in this fashion, we obtain the following:
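The remaining iterations can be reproduced with a minimal one-dimensional K-means sketch (means initialised from the first two items, assignment by absolute distance, stopping when cluster membership no longer changes); this is a sketch, not the book's pseudocode:

def kmeans_1d(items, k, max_iter=20):
    means = list(items[:k])                        # initial means: first k input items
    assignments = None
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in items:
            j = min(range(k), key=lambda m: abs(x - means[m]))   # closest mean
            clusters[j].append(x)
        if [tuple(c) for c in clusters] == assignments:          # convergence test
            break
        assignments = [tuple(c) for c in clusters]
        means = [sum(c) / len(c) if c else means[i] for i, c in enumerate(clusters)]
        print(clusters, means)
    return clusters, means

kmeans_1d([2, 4, 10, 12, 3, 20, 30, 11, 25], k=2)
# converges to K1 = {2, 3, 4, 10, 11, 12} and K2 = {20, 30, 25}, with means 7 and 25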

The time complexity of K-means is O(tkn) where t is the number of iterations.


K-means finds a local optimum and may actually miss the global optimum.
K-means does not work on categorical data because the mean must be defined on the
attribute type.
Only convex-shaped clusters are found.
It also does not handle outliers well.


One variation of K-means, K-modes, does handle categorical data. Instead of using means,
it uses modes. A typical value for k is 2 to 10.
Although the K-means algorithm often produces good results, it is not time-efficient and
does not scale well.
Some K-means variations examine ways to improve the chances of finding the global
optimum. This often involves careful selection of the initial clusters and means.
Another variation is to allow clusters to be split and merged. The variance within a cluster
is examined, and if it is too large, a cluster is split. Similarly, if the distance between two
cluster centroids is less than a predefined threshold, they will be combined.

5.4 Nearest Neighbor Algorithm


An algorithm similar to the single link technique is called the Nearest Neighbor Algorithm.
With this serial algorithm, items are iteratively merged into the existing clusters that are
closest.
In this algorithm a threshold, t, is used to determine if items will be added to existing
clusters or if a new cluster is created.
Example 5.5 applies this algorithm to the data in Table 5.2, assuming a threshold of 2. Notice that the results are the same as those seen in Figure 5.7(a) at the level of 2.


EXAMPLE 5.5

Initially, A is placed in a cluster by itself, so we have K1 = {A}.


We then look at B to decide if it should be added to K1 or be placed in a new cluster. Since dis(A, B) = 1, which is less than the threshold of 2, we place B in K1 to get K1 = {A, B}.
When looking at C, we see that its distance to both A and B is 2, so we add it to the cluster to get K1 = {A, B, C}. Then dis(D, C) = 1 < 2, so we get K1 = {A, B, C, D}.
Finally, looking at E, we see that the closest item in K1 has a distance of 3, which is greater than 2, so we place it in its own cluster: K2 = {E}.

The complexity of the nearest neighbor algorithm actually depends on the number of items.
For each loop, each item must be compared to each item already in a cluster. Obviously, this is n in the worst case.
Thus, the time complexity is O(n²).
Since we do need to examine the distances between items often, we assume that the space requirement is also O(n²).
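A sketch of this serial procedure is given below; the distance table is the same hypothetical one used earlier, chosen to be consistent with Example 5.5 (it is not necessarily the book's Table 5.2):

def nearest_neighbor_clustering(items, dist, t):
    # Each item joins the cluster containing its nearest already-clustered item,
    # unless that distance exceeds the threshold t, in which case it starts a new cluster.
    clusters = []
    for item in items:
        best_cluster, best_dist = None, None
        for cluster in clusters:
            d = min(dist[item][member] for member in cluster)
            if best_dist is None or d < best_dist:
                best_cluster, best_dist = cluster, d
        if best_cluster is not None and best_dist <= t:
            best_cluster.append(item)
        else:
            clusters.append([item])
    return clusters

# Hypothetical distances consistent with Example 5.5.
dist = {"A": {"A": 0, "B": 1, "C": 2, "D": 2, "E": 3},
        "B": {"A": 1, "B": 0, "C": 2, "D": 4, "E": 3},
        "C": {"A": 2, "B": 2, "C": 0, "D": 1, "E": 5},
        "D": {"A": 2, "B": 4, "C": 1, "D": 0, "E": 3},
        "E": {"A": 3, "B": 3, "C": 5, "D": 3, "E": 0}}

print(nearest_neighbor_clustering(["A", "B", "C", "D", "E"], dist, t=2))
# [['A', 'B', 'C', 'D'], ['E']]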

5.5 PAM Algorithm or K-medoids


The PAM (Partitioning Around Medoids) algorithm, also called the K-medoids
algorithm, represents a cluster by a medoid.
Using a medoid is an approach that handles outliers well. The PAM algorithm is shown in
Algorithm 5.8.


• Initially, a random set of k items is taken to be the set of medoids.


• Then at each step, all items from the input dataset that are not currently medoids are
examined one by one to see if they should be medoids.
• That is, the algorithm determines whether there is an item that should replace one of
the existing medoids.
• By looking at all pairs of medoid, non-medoid objects, the algorithm chooses the pair
that improves the overall quality of the clustering the best and exchanges them.
• Quality here is measured by the sum of all distances from a non-medoid object to the
medoid for the cluster it is in.
• An item is assigned to the cluster represented by the medoid to which it is closest
(minimum distance).
• We assume that Ki is the cluster represented by medoid ti. Suppose ti is a current medoid and we wish to determine whether it should be exchanged with a non-medoid th.
• We wish to do this swap only if the overall impact to the cost (sum of the distances to cluster medoids) represents an improvement.

We use Cjih to be the cost change for an item tj associated with swapping medoid ti with non-medoid th.
The cost is the change to the sum of all distances from items to their cluster medoids.
There are four cases that must be examined when calculating this cost:

The total impact to quality of a medoid change, TCih, is then given by the sum of these individual changes: TCih = Σj Cjih.


EXAMPLE 5.6
Suppose that the two medoids that are initially chosen are A and B.
Based on the distances shown in Table 5.2

and randomly placing items when distances are identical to the two medoids,
we obtain the clusters {A, C, D} and {B, E}. The three non-medoids, {C, D, E}, are then examined to see which (if any) should be used to replace A or B.
We thus have six costs to determine: TCAC, TCAD, TCAE, TCBC, TCBD, and TCBE, where TCih is the total impact to quality of swapping medoid ti with non-medoid th.
Here we use the name of the item instead of a numeric subscript value. We obtain the following.

To compute TCAC, replacing medoid A with non-medoid C, we start from the clusters {A, C, D} and {B, E}:

Here A is no longer a medoid, and since it is closer to B, it will be placed in the cluster with B as medoid; thus its cost is CAAC = 1.
The cost for B is CBAC = 0 because it stays a cluster medoid.
C is now a medoid, so it has a negative cost based on its distance to the old medoid; that is, CCAC = -2.
D is closer to C than it was to A by a distance of 1, so its cost is CDAC = -1.
Finally, E stays in the same cluster at the same distance, so its cost change is CEAC = 0.
Thus, we have that the overall cost change is TCAC = 1 + 0 - 2 - 1 + 0 = -2, a reduction of 2.

Figure 5.9 illustrates the calculation of these six costs. Looking at these, we see that the minimum cost change is -2 (a reduction of 2) and that there are several swaps that achieve this reduction.
Arbitrarily choosing the first such swap, we get C and B as the new medoids, with the clusters being {C, D} and {B, A, E}.
This concludes the first iteration of PAM. At the next iteration,
we examine changing medoids again and pick the choice that best reduces the cost.
The iterations stop when no change will reduce the cost. We leave the rest of this problem to the reader as an exercise.
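A sketch of one PAM iteration is given below. Rather than the four per-item cases, it computes each TCih directly as the difference in total cost before and after the swap, which is equivalent; the distance table is the same hypothetical one used in the earlier sketches.

from itertools import product

dist = {"A": {"A": 0, "B": 1, "C": 2, "D": 2, "E": 3},
        "B": {"A": 1, "B": 0, "C": 2, "D": 4, "E": 3},
        "C": {"A": 2, "B": 2, "C": 0, "D": 1, "E": 5},
        "D": {"A": 2, "B": 4, "C": 1, "D": 0, "E": 3},
        "E": {"A": 3, "B": 3, "C": 5, "D": 3, "E": 0}}
items = ["A", "B", "C", "D", "E"]

def total_cost(medoids):
    # Sum of distances from every item to its closest medoid.
    return sum(min(dist[t][m] for m in medoids) for t in items)

def best_swap(medoids):
    # Examine every (medoid, non-medoid) pair and return the swap with the
    # most negative cost change TCih.
    base = total_cost(medoids)
    best = None
    for m, h in product(medoids, [t for t in items if t not in medoids]):
        candidate = [h if x == m else x for x in medoids]
        tc = total_cost(candidate) - base
        if best is None or tc < best[0]:
            best = (tc, m, h)
    return best

print(best_swap(["A", "B"]))   # (-2, 'A', 'C'): swapping A for C reduces the cost by 2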


PAM does not scale well to large datasets because of its computational complexity.
For each iteration, we have k(n - k) pairs of objects i, h for which a cost, TCih, should be
determined.
Calculating the cost during each iteration requires that the cost be calculated for all other
non-medoids tj .
There are n - k of these. Thus, the total complexity per iteration is n(n - k)².
The total number of iterations can be quite large, so PAM is not an alternative for large
databases.
However, there are some clustering algorithms based on PAM that are targeted to large
datasets.

CLARA
CLARA (Clustering LARge Applications) improves on the time complexity of PAM by
using samples of the dataset.
The basic idea is that it applies PAM to a sample of the underlying database and then uses
the medoids found as the medoids for the complete clustering.
Each item from the complete database is then assigned to the cluster with the medoid to
which it is closest.
To improve the CLARA accuracy, several samples can be drawn with PAM applied to
each. The sample chosen as the final clustering is the one that performs the best.
Because of the sampling, CLARA is more efficient than PAM for large databases.
However, it may not be as effective, depending on the sample size.

CLARANS
CLARANS (clustering large applications based upon randomized search) improves on
CLARA by using multiple different samples.
In addition to the normal input to PAM, CLARANS requires two additional parameters:
maxneighbor and numlocal.
Maxneighbor is the number of neighbors of a node to which any specific node can be
compared.
As maxneighbor increases, CLARANS looks more and more like PAM because all nodes
will be examined.
Numlocal indicates the number of samples to be taken.
Since a new clustering is performed on each sample, this also indicates the number of clusterings to be made.
Performance studies indicate that numlocal = 2 and


maxneighbor = max(0.0125 × k(n - k), 250) are good choices.


CLARANS is shown to be more efficient than either PAM or CLARA for any size dataset.
CLARANS assumes that all data are in main memory.
This certainly is not a valid assumption for large databases.

5.6 Bond Energy Algorithm


The bond energy algorithm (BEA) was developed and has been used in the database design
area to determine how to group data and how to physically place data on a disk.
It can be used to cluster attributes based on usage and then perform logical or physical design
accordingly.
With BEA, the affinity (bond) between database attributes is based on common usage.
This bond is used by the clustering algorithm as a similarity measure.
The actual measure counts the number of times the two attributes are used together in a
given time.
To find this, all common queries must be identified.
The idea is that attributes that are used together form a cluster and should be stored together.
In a distributed database, each resulting cluster is called a vertical fragment and may be
stored at different sites from other fragments.

The basic steps of this clustering algorithm are:


1. Create an attribute affinity matrix in which each entry indicates the affinity between the
two associate attributes. The entries in the similarity matrix are based on the frequency of
common usage of attribute pairs.
2. The BEA then converts this similarity matrix to a BOND matrix in which the entries
represent a type of nearest neighbor bonding based on probability of co-access. The BEA
algorithm rearranges rows or columns so that similar attributes appear close together in the
matrix.
3. Finally, the designer draws boxes around regions in the matrix with high similarity.

• The two shaded boxes represent the attributes that have been grouped together into
two clusters.
• Two attributes Ai and Aj have a high affinity if they are frequently used together in database applications.
• At the heart of the BEA algorithm is the global affinity measure.
• Suppose that a database schema consists of n attributes {A1, A2, ... , An}. The global
affinity measure, AM, is defined as


5.7 Clustering with Genetic Algorithms


• There have been clustering techniques based on the use of genetic algorithms.
• To determine how to perform clustering with genetic algorithms, we first must determine
how to represent each cluster.
• One simple approach would be to use a bit-map representation for each possible cluster.
• So, given a database with four items, {A, B, C, D}, we would represent one solution to
creating two clusters as 1001 and 0110.
• This represents the two clusters {A, D} and {B, C}.
Clustering with Genetic Algorithms
Algorithm 5.9 shows one possible iterative refinement technique for clustering that uses
a genetic algorithm.

The approach is similar to that in the squared error approach in that an initial random
solution is given and successive changes to this converge on a local optimum.
A new solution is generated from the previous solution using crossover and mutation
operations.
Our algorithm shows only crossover. The use of crossover to create a new solution from a
previous solution is shown in Example 5.7.
The new "solution" must be created in such a way that it represents a valid k-clustering.
A fitness function must be used and may be defined based on an inverse of the squared
error.
Because of the manner in which crossover works, genetic clustering algorithms perform a
global search rather than a local search of potential solutions.

EXAMPLE 5.7
Suppose a database contains the following eight items {A, B, C, D, E, F, G, H}, which are
to be placed into three clusters. We could initially place the items into
the three clusters {A, C, E}, {B, F}, and {D, G, H},
which are represented by 10101000, 01000100, and 00010011, respectively. Suppose we choose the first and third individuals as parents and do a simple crossover at point 4.
This yields the new solution: 00011000, 01000100, and 10100011.
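A short sketch of the bit-map representation and the single-point crossover used in this example (the helper name is illustrative):

def crossover(parent1, parent2, point):
    # Single-point crossover: exchange the prefixes of the two bit strings.
    child1 = parent2[:point] + parent1[point:]
    child2 = parent1[:point] + parent2[point:]
    return child1, child2

solution = ["10101000", "01000100", "00010011"]   # {A, C, E}, {B, F}, {D, G, H}
print(crossover(solution[0], solution[2], 4))     # ('00011000', '10100011')

Note that the resulting individuals, 00011000, 01000100, and 10100011, still assign each of the eight items to exactly one cluster, so the new solution is a valid k-clustering.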


5.8 Clustering with Neural Networks


Neural networks (NNs) that use unsupervised learning attempt to find features in the data
that characterize the desired output.
They look for clusters of like data.
These types of NNs are often called Self-Organizing Neural Networks.
There are two basic types of unsupervised learning: noncompetitive and competitive.
With noncompetitive, or Hebbian, learning, the weight between two nodes is changed to be proportional to both output values. That is,

With competitive learning, nodes are allowed to compete and the winner takes all. This
approach usually assumes a two-layer NN in which all nodes from one layer are connected
to all nodes in the other layer.
As training occurs, nodes in the output layer become associated with certain tuples in the
input dataset. Thus, this provides a grouping of these tuples together into a cluster.
Imagine every input tuple having each attribute value input to a specific input node in the
NN.
The number of input nodes is the same as the number of attributes.
We can thus associate each weight to each output node with one of the attributes from the
input tuple.
When a tuple is input to the NN, all output nodes produce an output value.
The node with the weights more similar to the input tuple is declared the winner.
Its weights are then adjusted.
This process continues with each tuple input from the training set.
The input weights to the node are then close to an average of the tuples in this cluster.

A Self-Organizing Feature Map (SOFM), or Self-Organizing Map (SOM), is an NN approach that uses competitive unsupervised learning.
Learning is based on the concept that the behavior of a node should impact only those nodes
and arcs near it.
Weights are initially assigned randomly and adjusted during the learning process to produce
better results.
During this learning process, hidden features or patterns in the data are uncovered and the weights are adjusted accordingly.
SOFMs were developed by observing how neurons work in the brain and in ANNs.

• The firing of one neuron impacts the firing of other neurons that are near it.
• Neurons that are far apart seem to inhibit (restrain) each other.
• Neurons seem to have specific nonoverlapping tasks.

The term self-organizing indicates the ability of these NNs to organize the nodes into
clusters based on the similarity between them.
Those nodes that are closer together are more similar than those that are far apart. This
hints at how the actual clustering is performed.
Over time, nodes in the output layer become matched to input nodes, and patterns of nodes
in the output layer emerge.


Perhaps the most common example of a SOFM is the Kohonen self-organizing map,
which is used extensively in commercial data mining products to perform clustering.
▪ There is one input layer and one special layer, which produces output values that compete.
▪ In effect, multiple outputs are created and the best one is chosen.
▪ This extra layer is not technically either a hidden layer or an output layer, so we refer to it here as the competitive layer.
▪ Nodes in this layer are viewed as a two-dimensional grid of nodes, as seen in Figure 5.11.

Self-Organizing Feature Maps (SOFM).


▪ Each input node is connected to each node in this grid.
▪ Propagation occurs by sending the input value for each input node to each node in the competitive layer.
▪ As with regular NNs, each arc has an associated weight and each node in the competitive layer has an activation function.
▪ Thus, each node in the competitive layer produces an output value, and the node with the best output wins the competition and is determined to be the output for that input.
▪ An attractive feature of Kohonen nets is that the data can be fed into the multiple competitive nodes in parallel.
▪ Training occurs by adjusting weights so that the best output is even better the next time this input is used.
▪ "Best" is determined by computing a distance measure.
A common approach is to initialize the weights on the input arcs to the competitive layer with normalized values.
The similarity between output nodes and input vectors is then determined by the dot product of the two vectors. Given an input tuple X = (x1, ..., xh) and weights w1i, ..., whi on the arcs input to a competitive node i, the similarity between X and i can be calculated by the dot product x1·w1i + ... + xh·whi.
The competitive node most similar to the input tuple wins the competition. Based on this, the weights coming into i, as well as those for the nodes immediately surrounding it in the matrix, are increased. This is the learning phase. Given a node i, we use the notation Ni to represent the union of i and the nodes near it in the matrix. Thus, the learning process uses


In this formula, c indicates the learning rate and may actually vary based on the node
rather than being a constant.

The basic idea of SOM learning is that, after each input tuple in the training set is presented, the winner and its neighbors have their weights changed to be closer to those of the tuple.
Over time, a pattern on the output nodes emerges, which is close to that of the training
data.
At the beginning of the training process, the neighborhood of a node may be defined to be
large.
However, the neighborhood may decrease during the processing.
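A minimal sketch of this competitive learning scheme, assuming Euclidean distance for choosing the winner, a square grid neighborhood, and a decaying learning rate (the grid size, rate, and random data are illustrative assumptions, not parameters from the text):

import numpy as np

rng = np.random.default_rng(0)
grid_rows, grid_cols, n_features = 4, 4, 2
weights = rng.random((grid_rows, grid_cols, n_features))   # random initial weights
data = rng.random((100, n_features))                        # training tuples

learning_rate, radius = 0.5, 1            # the neighborhood may also shrink over time
for epoch in range(20):
    for x in data:
        # Winner: the competitive-layer node whose weight vector is closest to x.
        dists = np.linalg.norm(weights - x, axis=2)
        wi, wj = np.unravel_index(np.argmin(dists), dists.shape)
        # Move the winner and its grid neighbours closer to the input tuple.
        for i in range(max(0, wi - radius), min(grid_rows, wi + radius + 1)):
            for j in range(max(0, wj - radius), min(grid_cols, wj + radius + 1)):
                weights[i, j] += learning_rate * (x - weights[i, j])
    learning_rate *= 0.9                   # decay the learning rate over time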

6. CLUSTERING LARGE DATABASES


When clustering is used with dynamic databases, these algorithms may not be appropriate.
First, they all assume (because most are O(n²)) that sufficient main memory exists to hold the data to be clustered and the data structures needed to support them.
With large databases containing thousands of items (or more), these assumptions are not
realistic.
In addition, performing I/Os continuously through the multiple iterations of an algorithm is too expensive.
Because of these main memory restrictions, the algorithms do not scale up to large
databases.
Another issue is that some assume that the data are present all at once. These techniques
are not appropriate for dynamic databases.
Clustering techniques should be able to adapt as the database changes.

It has been argued that to perform effectively on large databases, a clustering


algorithm should
1. Require no more (preferably less) than one scan of the database.
2. Have the ability to provide status and "best" answer so far during the algorithm
execution. This is sometimes referred to as the ability to be online.
3. Be suspendable, stoppable, and resumable.
4. Be able to update the results incrementally as data are added or removed from the
database.
5. Work with limited main memory .
6. Be capable of performing different techniques for scanning the database. This may
include sampling.
7. Process each tuple only once.
Recent research at Microsoft has examined how to efficiently perform the clustering
algorithms with large databases .

The basic idea of this scaling approach is as follows:


1. Read a subset of the database into main memory .
2. Apply clustering technique to data in memory.


3. Combine results with those from prior samples.


4. The in-memory data are then divided into three different types:
those items that will always be needed even when the next sample is brought in,
those that can be discarded with appropriate updates to data being kept in order to
answer the problem, and
those that will be saved in a compressed format.
Based on the type, each data item is then kept, deleted, or compressed in memory.
5. If termination criteria are not met, then repeat from step 1 .

6.1 BIRCH
BIRCH (balanced iterative reducing and clustering using hierarchies) is designed for
clustering a large amount of metric data.
It assumes that there may be a limited amount of main memory and achieves linear I/O time, requiring only one database scan.
It is incremental and hierarchical, and it uses an outlier handling technique. Here points that
are found in sparsely populated areas are removed.
The basic idea of the algorithm is that a tree is built that captures needed information to
perform clustering.
The clustering is then performed on the tree itself, where labeling of nodes in the tree
contain the needed information to calculate distance values.
A major characteristic of the BIRCH algorithm is the use of the clustering feature, which
is a triple that contains information about a cluster.
The clustering feature provides a summary of the information about one cluster.
By this definition, it is clear that BIRCH applies only to numeric data.

This algorithm uses a tree called a CF tree, as defined in Definition 5.4.


The size of the tree is determined by a threshold value, T, associated with each leaf node.
T is the maximum diameter allowed for any leaf.
Here diameter is the average of the pairwise distances between all points in the cluster.
Each internal node corresponds to a cluster that is composed of the sub clusters represented
by its children.

DEFINITION 5.3.
A clustering feature (CF) is a triple (N, LS, SS), where the number of the points in the
cluster is N, LS is the sum of the points in the cluster and SS is the sum of the squares of
the points in the cluster.

DEFINITION 5.4.
A CF tree is a balanced tree with a branching factor (maximum number of children a node may have) B. Each internal node contains a CF triple for each of its children. Each leaf node also represents a cluster and contains a CF entry for each sub cluster in it. A sub cluster in a leaf node must have a diameter no greater than a given threshold value T.

Unlike a dendrogram, a CF tree is searched in a top-down fashion.


Each node in the CF tree contains clustering feature information about its sub clusters.
As points are added to the clustering problem, the CF tree is built.


A point is inserted into the cluster (represented by a leaf node) to which it is closest.
If the diameter for the leaf node is greater than T, then a splitting and balancing of the tree
is performed (similar to that used in a B-tree).
The algorithm adapts to main memory size by changing the threshold value.
A larger threshold, T, yields a smaller CF tree.
This process can be performed without rereading the data.
The clustering feature data provides enough information to perform this condensation.
The complexity of the algorithm is O(n).

Not shown in this algorithm are the parameters needed for the CF tree construction, such
as its branching factor, the page block size, and memory size.
Based on size, each node has room for a fixed number, B, of clusters (i.e., CF triples).
The first step creates the CF tree in memory.
The threshold value can be modified if necessary to ensure that the tree fits into the
available memory space.

BIRCH Insertion
Insertion into the CF tree requires scanning the tree from the root down, choosing the node
closest to the new point at each level.
The distance here is calculated by looking at the distance between the new point and the
centroid of the cluster. This can be easily calculated with most distance measures (e.g.,
Euclidean or Manhattan) using the CF triple.
When the new item is inserted, the CF triple is appropriately updated, as is each triple on
the path from the root down to the leaf. It is then added to the closest leaf node found by
adjusting the CF value for that node.
When an item is inserted into a cluster at the leaf node of the tree, the cluster must satisfy
the threshold value.
▪ If it does, then the CF entry for that cluster is modified. If it does not, then that item is added to that node as a single-item cluster.
Node splits occur if no space exists in a given node.
This is based on the size of the physical page because each node size is determined by the
page size.


An attractive feature of the CF values is that they are additive; that is, if two clusters are
merged, the resulting CF is the addition of the CF values for the starting clusters.
Once the tree is built, the leaf nodes of the CF tree represent the current clusters.
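A sketch of the clustering feature for one-dimensional points (for multi-dimensional data, LS becomes a vector and SS a sum of squared norms), showing the additivity of CF values and how the centroid and radius fall out of the triple:

import math

def make_cf(points):
    # Clustering feature (N, LS, SS) for a set of one-dimensional points.
    return (len(points), sum(points), sum(x * x for x in points))

def merge_cf(cf1, cf2):
    # CF values are additive: merging two clusters just adds the triples.
    return tuple(a + b for a, b in zip(cf1, cf2))

def centroid(cf):
    n, ls, ss = cf
    return ls / n

def radius(cf):
    # Square root of the average squared distance from the points to the centroid.
    n, ls, ss = cf
    return math.sqrt(ss / n - (ls / n) ** 2)

cf_a = make_cf([1.0, 2.0, 3.0])
cf_b = make_cf([10.0, 12.0])
merged = merge_cf(cf_a, cf_b)
print(merged, centroid(merged), radius(merged))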

In reality, this algorithm, Algorithm 5.10, is only the first of several steps proposed for the
use of BIRCH with large databases. The complete outline of steps is:
1. Create initial CF tree using a modified version of Algorithm 5. 10. This in effect "loads"
the database into memory.
If there is insufficient memory to construct the CF tree with a given threshold,
the threshold value is increased and a new smaller CF tree is constructed. This can be
done by inserting the leaf nodes of the previous tree into the new small tree.
2. The clustering represented by the CF tree may not be natural because each entry has a
limited size.
In addition, the input order can negatively impact the results. These problems can be overcome by another global clustering approach applied to the leaf nodes in the CF tree. Here each leaf node is treated as a single point for clustering.
Although the original work proposes a centroid-based agglomerative
hierarchical clustering algorithm to cluster the sub clusters, other clustering
algorithms could be used.
3. The last phase (which is optional) reclusters all points by placing them in the cluster that has the closest centroid. Outliers, points that are too far from any centroid, can be removed during this phase.

BIRCH is linear in both space and I/O time.
The choice of threshold value is critical to an efficient execution of the algorithm.
Otherwise, the tree may have to be rebuilt many times to ensure that it can be memory-resident.
This gives a worst-case time complexity of O(n²).

6.2 DBSCAN
The approach used by DBSCAN (density-based spatial clustering of applications with noise) is to create clusters with a minimum size and density.
Density is defined as a minimum number of points within a certain distance of each other.
This handles the outlier problem by ensuring that an outlier (or a small set of outliers) will
not create a cluster.
One input parameter, MinPts, indicates the minimum number of points in any cluster.
In addition, for each point in a cluster there must be another point in the cluster whose
distance from it is less than a threshold input value, Eps.
The Eps-neighborhood or neighborhood of a point is the set of points within a distance of
Eps.
The desired number of clusters, k, is not input but rather is determined by the algorithm
itself.


The accompanying figure (not reproduced in these notes) illustrates these ideas:
a) p is a core point because it has 4 (the MinPts value) points within its neighborhood.
b) There are 5 core points in the figure.
c) Even though point r is not a core point, it is density-reachable from q.

A point is directly density-reachable from a core point if it lies within that core point's
Eps-neighborhood.
The first part of the definition ensures that the second point is "close enough" to the first
point.
The second portion of the definition ensures that there are enough core points close
enough to each other. These core points form the main portion of a cluster in that they
are all close to each other.
A directly density-reachable point must be close to one of these core points, but it need
not be a core point itself. In that case, it is called a border point.

A point is said to be density-reachable from another point if there is a chain from one to
the other that contains only points that are directly density-reachable from the previous
point.
This guarantees that any cluster will have a core set of points very close to a large number
of other points (core points) and then some other points (border points) that are
sufficiently close to at least one core point.


The expected time complexity of DBSCAN is O(n log n).


It is possible that a border point could belong to two clusters.
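The following is a minimal sketch of the DBSCAN idea (illustrative only; it uses a brute-force O(n²) neighborhood search rather than the spatial index that gives the O(n log n) behavior quoted above). The parameters eps and min_pts stand for Eps and MinPts; points left labeled -1 are treated as noise/outliers:

```python
# Minimal DBSCAN sketch over 2-D points using Euclidean distance.
from math import dist

def dbscan(points, eps, min_pts):
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Eps-neighborhood of point i (includes the point itself)
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:          # not a core point (may become a border point later)
            labels[i] = NOISE
            continue
        labels[i] = cluster_id            # start a new cluster from this core point
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:        # noise reachable from a core point => border point
                labels[j] = cluster_id
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:   # j is itself a core point: expand the cluster
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=3))    # two clusters plus one noise point (-1)
```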

6.3 CURE Algorithm


The CURE (Clustering Using Representatives) algorithm is designed to handle outliers well.
It has both a hierarchical component and a partitioning component.
First, a constant number of points, c, is chosen from each cluster.
These well-scattered points are then shrunk toward the cluster's centroid by applying a
shrinkage factor, α; when α is 1, all points are shrunk to a single point, the centroid.
These points represent the cluster better than a single point (such as a medoid or centroid)
could.
With multiple representative points, clusters of unusual shapes (not just a sphere) can be
better represented.
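A small sketch of the shrinking step (assumed code; the function name is illustrative): each representative point is moved a fraction α of the way toward the cluster centroid, so α = 1 collapses all representatives onto the centroid.

```python
# Shrink c well-scattered representative points toward the centroid by alpha.
def shrink_representatives(reps, centroid, alpha):
    """Move each representative point a fraction alpha of the way to the centroid."""
    return [
        tuple(r + alpha * (c - r) for r, c in zip(rep, centroid))
        for rep in reps
    ]

reps = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
centroid = (2.0, 1.0)
print(shrink_representatives(reps, centroid, alpha=0.5))
# [(1.0, 0.5), (3.0, 0.5), (2.0, 2.0)]
print(shrink_representatives(reps, centroid, alpha=1.0))
# every representative collapses to (2.0, 1.0)
```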
CURE then uses a hierarchical clustering algorithm.
At each step in the agglomerative algorithm, clusters with the closest pair of
representative points are chosen to be merged.
The distance between them is defined as the minimum distance between any pair of
points in the representative sets from the two clusters.


The basic approach used by CURE is shown in Figure 5.13.


The first step shows a sample of the data.
A set of clusters with its representative points exists at each step in the processing.
In Figure 5.13(b) there are three clusters, each with two representative points.
The representative points are shown as darkened circles.
These representative points are chosen to be far from each other as well as from the mean
of the cluster.
In part (c), two of the clusters are merged and two new representative points are chosen.
Finally, in part (d), these points are shrunk toward the mean of the cluster.
Notice that if a single representative point (the centroid) had been used for each cluster,
the smaller cluster would have been merged with the bottom cluster instead of with the top cluster.
CURE handles limited main memory by obtaining a random sample to find the initial
clusters.
The random sample is partitioned, and each partition is then partially clustered.
These resulting clusters are then completely clustered in a second pass.

The basic steps of CURE for large databases are:


1. Obtain a sample of the database.
2. Partition the sample into p partitions of size n/p. This is done to speed up the algorithm
because clustering is first performed on each partition.
3. Partially cluster the points in each partition using the hierarchical algorithm (see
Algorithm 5.12). This provides a first guess at what the clusters should be. Each partition
is clustered until the number of clusters is n/(pq) for some constant q.
4. Remove outliers. Outliers are eliminated by the use of two different techniques. The
first technique eliminates clusters that grow very slowly. The second technique removes
very small clusters toward the end of the clustering phase.
5. Completely cluster all data in the sample using Algorithm 5.12.
6. Cluster the entire database on disk using c points to represent each cluster. An item in
the database is placed in the cluster that has the representative point closest to it (a sketch
of this labeling step follows the list). These sets of representative points are small enough
to fit into main memory, so each of the n points need only be compared to ck representative points.
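A minimal sketch of the labeling step in step 6 (assumed code): with the c·k representative points held in memory, each database item is assigned to the cluster owning its closest representative.

```python
# Assign each database item to the cluster with the nearest representative point.
from math import dist

def label_database(items, cluster_reps):
    """cluster_reps: {cluster_id: [representative points]} kept in main memory."""
    labels = []
    for item in items:
        best_cluster, best_d = None, float("inf")
        for cid, reps in cluster_reps.items():
            d = min(dist(item, r) for r in reps)   # closest representative of this cluster
            if d < best_d:
                best_cluster, best_d = cid, d
        labels.append(best_cluster)
    return labels

reps = {0: [(0.0, 0.0), (1.0, 1.0)], 1: [(9.0, 9.0), (10.0, 8.0)]}
print(label_database([(0.5, 0.2), (9.5, 8.4)], reps))   # [0, 1]
```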


The time complexity of CURE is O(n² log n), while space is O(n). This is worst-case
behavior.

• A heap and k-d tree data structure are used to ensure this performance.
• One entry in the heap exists for each cluster.
• Each cluster has not only its representative points, but also the cluster that is closest to
it.
• Entries in the heap are stored in increasing order of the distances between clusters.
• We assume that each entry u in the heap contains the set of representative points, u.rep;
the mean of the points in the cluster, u.mean; and the cluster closest to it, u.closest.
• We use the heap operations: heapify to create the heap, min to extract the minimum
entry in the heap, insert to add a new entry, and delete to delete an entry.
• A merge procedure is used to merge two clusters. It determines the new representative
points for the new cluster.
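A small sketch of how such heap entries might be kept ordered by the distance to each cluster's closest cluster, using Python's standard heapq module (the entries shown are made-up cluster identifiers, not data from the text):

```python
import heapq

# Each entry: (distance to u.closest, cluster id, id of u.closest).
heap = [(2.5, "C1", "C3"), (0.8, "C2", "C4"), (1.7, "C3", "C1")]
heapq.heapify(heap)                        # "heapify" builds the heap
d, cluster, closest = heapq.heappop(heap)  # "min" extracts the smallest distance
print(cluster, "should be merged with", closest, "at distance", d)
heapq.heappush(heap, (1.1, "C5", "C2"))    # "insert" adds the merged cluster's entry
```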

The basic idea of this process is to first find the point that is farthest from the mean.
Subsequent points are then chosen based on being the farthest from those points that were
previously chosen. A predefined number of points is picked.
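A minimal sketch of this farthest-point selection (assumed code; the function name is illustrative):

```python
# Pick c well-scattered representative points for one cluster.
from math import dist

def pick_scattered_points(points, mean, c):
    chosen = [max(points, key=lambda p: dist(p, mean))]   # farthest from the mean
    while len(chosen) < c and len(chosen) < len(points):
        # Next point maximizes its distance to the nearest already-chosen point.
        candidate = max(
            (p for p in points if p not in chosen),
            key=lambda p: min(dist(p, q) for q in chosen),
        )
        chosen.append(candidate)
    return chosen

cluster = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
mean = (2.4, 2.2)
print(pick_scattered_points(cluster, mean, c=3))
```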
A k-D tree is a balanced binary tree that can be thought of as a generalization of a binary
search tree.
It is used to index data of k dimensions where the ith level of the tree indexes the ith
dimension.


In CURE, a k-D tree is used to assist in the merging of clusters. It stores the representative
points for each cluster. Initially, there is only one representative point for each cluster,
the sole item in it.
Operations performed on the tree are: delete to remove an entry from the tree, insert to
insert an entry into it, and build to initially create it.
The hierarchical clustering algorithm itself is shown in Algorithm 5.12.
The quality of the clusters found by CURE is better than that of the other algorithms
studied (such as BIRCH).
While the value of the shrinking factor α does impact the results, any value between 0.2
and 0.7 still finds the correct clusters.
When the number of representative points per cluster is greater than five, the correct
clusters are always found.
A random sample size of about 2.5% and a number of partitions greater than one to two
times k seem to work well.
The results with large datasets indicate that CURE scales well and outperforms BIRCH.

7.1 ROCK
Traditional clustering algorithms, which rely on numeric distance measures, do not always work with categorical data.
The ROCK (RObust Clustering using linKs) clustering algorithm is targeted to both
Boolean data and categorical data.
A novel approach to identifying similarity is based on the number of links between items.
A pair of items are said to be neighbors if their similarity exceeds some threshold.
This need not be defined based on a precise metric, but rather a more intuitive approach
using domain experts could be used.
The number of links between two items is defined as the number of common neighbors
they have.
The objective of the clustering algorithm is to group together points that have more links.
The algorithm is a hierarchical agglomerative algorithm using the number of links as the
similarity measure rather than a distance-based measure.
Instead of a Euclidean distance, a different measure, such as the Jaccard coefficient, has
been proposed. One proposed similarity measure based on the Jaccard coefficient is
defined as

sim(ti, tj) = |ti ∩ tj| / |ti ∪ tj|

If the tuples are viewed as sets of items purchased (i.e., market basket data), then we
look at the number of items they have in common divided by the total number of distinct
items in both. The denominator is used to normalize the value to be between 0 and 1.
The number of links between a pair of points can be viewed as the number of unique
paths of length 2 between them.
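A small sketch of the link computation for market basket data (assumed code): transactions are neighbors when their Jaccard similarity meets a threshold θ, and the link count between two transactions is their number of common neighbors, i.e. an entry of S × S for the Boolean neighbor matrix S. Whether a point counts as its own neighbor is a convention choice; here it does.

```python
# ROCK-style link counts: neighbors via a Jaccard threshold, links via S x S.
def jaccard(a, b):
    return len(a & b) / len(a | b)

def link_matrix(transactions, theta):
    n = len(transactions)
    # Boolean neighbor matrix S (a point is considered its own neighbor here).
    s = [[1 if jaccard(transactions[i], transactions[j]) >= theta else 0
          for j in range(n)] for i in range(n)]
    # links[i][j] = number of common neighbors = (S x S)[i][j]
    return [[sum(s[i][m] * s[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

baskets = [{"bread", "milk"}, {"bread", "milk", "eggs"}, {"beer", "chips"}]
print(link_matrix(baskets, theta=0.5))
```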
Example 5.9 illustrates the use of links by the ROCK algorithm, applying the Jaccard
coefficient to the data from Example 5.8.
Note that different threshold values for neighbors could be used to get different results.
Also note that a hierarchical approach could be used with different threshold values for
each level in the dendrogram.


The ROCK algorithm is divided into three general parts:


1. Obtaining a random sample of the data.
2. Performing clustering on the sample using the link-based agglomerative approach. A goodness
measure is used to determine which pair of clusters is merged at each step.
3. Using these clusters, the remaining data on disk are assigned to them.
The goodness measure used to merge clusters is:
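The equation itself (Equation 5.17 in the source text) did not survive extraction into these notes; as commonly stated in the ROCK literature, the goodness measure for merging clusters Ci and Cj containing ni and nj points is

g(Ci, Cj) = link(Ci, Cj) / [ (ni + nj)^(1 + 2f(θ)) − ni^(1 + 2f(θ)) − nj^(1 + 2f(θ)) ],  where f(θ) = (1 − θ) / (1 + θ)

Here link(Ci, Cj) is the total number of links between points of the two clusters, and the denominator estimates the expected number of such links, so the measure favors merges that produce more links than expected.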


The first step in the algorithm converts the adjacency matrix into a Boolean matrix S, where
an entry is 1 if the two corresponding points are neighbors.
As the adjacency matrix contains n² entries, this is an O(n²) step.
The next step converts this into a matrix indicating the links.
This can be found by calculating S × S, which can be done in O(n^2.37).
The hierarchical clustering portion of the algorithm then starts by placing each point in
the sample in a separate cluster. It then successively merges clusters until k clusters are
found.
To facilitate this processing, both local and global heaps are used.
A local heap, q, is created for each cluster; q contains every cluster that has
a nonzero link to the cluster it represents.
Initially, a cluster is created for each point ti. The heap for ti, q[ti], contains every
cluster that has a nonzero link to {ti}.
The global heap contains information about each cluster. All information in the heap is
ordered based on the goodness measure, which is shown in Equation 5.17.

8. Comparison
The different clustering algorithms discussed in this chapter are compared in Table 5.3.
Here we include a classification of the type of algorithm, space and time complexity, and
general notes concerning applicability.

• The single link, complete link, and average link techniques are all hierarchical
techniques with O(n²) time and space complexity.
• Both K-means and the squared error techniques are iterative, requiring O(tkn) time.
• The nearest neighbor technique is not iterative, but the number of clusters is not
predetermined; thus, the worst-case complexity can be O(n²).
• BIRCH appears to be quite efficient, but remember that the CF-tree may need to be
rebuilt. The time complexity in the table assumes that the tree is not rebuilt.
• CURE is an improvement on these by using sampling and partitioning to handle
scalability well and uses multiple points rather than just one point to represent each
cluster.


• Using multiple representative points allows CURE to detect nonspherical clusters and
also makes it more resistant to the negative impact of outliers. With sampling, CURE
obtains an O(n) time complexity.
• However, CURE does not handle categorical data well.
• K-means and PAM work by iteratively reassigning items to clusters, which may not
find a globally optimal assignment.
• The results of the K-means algorithm are quite sensitive to the presence of outliers.
• Through the use of the CF-tree, BIRCH is both dynamic and scalable. However, it
detects only spherical-shaped clusters.
• DBSCAN is a density-based approach. The time complexity of DBSCAN can be
improved to O(n log n) with appropriate spatial indices.
• Genetic algorithms are not included in this table because their performance totally
depends on the technique chosen to represent individuals, how crossover is done, and
the termination condition used.

