Unit V Notes
MACHINE LEARNING
UNIT -V
Clustering
1. Introduction,
2. Similarity and Distance Measures,
3. Outliers,
4. Hierarchical Methods,
5. Partitional Algorithms,
6. Clustering Large Databases,
7. Clustering with Categorical Attributes,
8. Comparison
1. INTRODUCTION
Clustering is similar to classification in that data are grouped. However, unlike classification,
the groups are not predefined.
Instead, the grouping is accomplished by finding similarities between data according to
characteristics found in the actual data.
The groups are called clusters. Many definitions for clusters have been proposed:
• Set of like elements. Elements from different clusters are not alike.
• The distance between points in a cluster is less than the distance between a point in the
cluster and any point outside it.
A term similar to clustering is database segmentation, where like tuples (records) in a database
are grouped together. This is done to partition or segment the database into components that
then give the user a more general view of the data.
A simple example of clustering is found in Example 5.1. This example illustrates the fact that
determining how to do the clustering is not straightforward.
EXAMPLE 5.1
An International Online Catalog company wishes to group its customers based on common
features. Company management does not have any predefined labels for these groups.
Based on the outcome of the grouping, they will target marketing and advertising campaigns
to the different groups. The information they have about the customers includes
Depending on the type of advertising, not all attributes are important. For example, suppose the
advertisement is for a special sale on children's clothes. We could target the advertising only
to the persons with children.
As illustrated in Figure 5.1, a given set of data may be clustered on different attributes.
Here a group of homes in a geographic area is shown.
The first type of clustering is based on the location of the home. Homes that are
geographically close to each other are clustered together.
In the second clustering, homes are grouped based on the size of the house.
Clustering has been used in many application domains, including biology, medicine,
anthropology, marketing, and economics.
Clustering applications include plant and animal classification, disease classification,
image processing, pattern recognition, and document retrieval. One of the first domains in
which clustering was used was biological taxonomy. Recent uses include examining Web
log data to detect usage patterns.
• There is no one correct answer to a clustering problem. In fact, many answers may be
found.
The exact number of clusters required is not easy to determine. Again, a domain
expert may be required.
For example, suppose we have a set of data about plants that have been collected
during a field trip. Without any prior knowledge of plant classification, if we
attempt to divide this set of data into similar groupings, it would not be clear
how many groups should be created.
• Another related issue is what data should be used for clustering. Unlike learning during
a classification process, where there is some a priori knowledge concerning what the
attributes of each classification should be, in clustering we have no supervised learning
to aid the process.
We can then summarize some basic features of clustering (as opposed to classification):
• The (best) number of clusters is not known.
• There may not be any a priori knowledge concerning the clusters.
• Cluster results are dynamic.
The clustering problem is stated as shown in Definition 5.1. Here we assume that the number
of clusters to be created is an input value, k. The actual content (and interpretation) of each
cluster, Kj, 1 ≤ j ≤ k, is determined as a result of the function definition.
We will view the result of solving a clustering problem as the creation of a set of clusters:
K = {K1, K2, ..., Kk}.
Traditional clustering algorithms tend to be targeted to small numeric databases that fit into
memory. There are, however, more recent clustering algorithms that look at categorical data
and are targeted to larger, perhaps dynamic, databases.
Algorithms targeted to larger databases may adapt to memory constraints by either sampling
the database or using data structures, which can be compressed or pruned to fit into memory
regardless of the size of the database.
Clustering algorithms may also differ based on whether they produce overlapping or
nonoverlapping clusters. Even though we consider only nonoverlapping clusters, it is possible to
place an item in multiple clusters.
In turn, nonoverlapping clusters can be viewed as extrinsic or intrinsic.
Extrinsic techniques use labeling of the items to assist in the classification process. These
algorithms are the traditional classification supervised learning algorithms in which a
special input training set is used.
Intrinsic algorithms do not use any a priori category labels, but depend only on the
adjacency matrix containing the distance between objects. All algorithms we examine in
this chapter fall into the intrinsic class.
The types of clustering algorithms can be further classified based on the implementation
technique used.
Hierarchical algorithms can be categorized as agglomerative or divisive.
"Agglomerative" implies that the clusters are created in a bottom-up fashion, while
“Divisive” algorithms work in a top-down fashion.
Another descriptive tag indicates whether each individual element is handled one by one, serial
(sometimes called incremental), or whether all items are examined together, simultaneous.
If a specific tuple is viewed as having attribute values for all attributes in the schema,
then clustering algorithms could differ as to how the attribute values are examined.
As is usually done with decision tree classification techniques, some algorithms
examine attribute values one at a time; these are called monothetic.
Polythetic algorithms consider all attribute values at one time.
Finally, clustering algorithms can be labeled based on the mathematical formulation given to
the algorithm:
graph theoretic or matrix algebra.
In this chapter we generally use the graph approach and describe the input to the
clustering algorithm as an adjacency matrix labeled with distance measures.
2. SIMILARITY AND DISTANCE MEASURES
The Centroid is the "middle" of the cluster; it need not be an actual point in the cluster.
Some clustering algorithms alternatively assume that the cluster is represented by one centrally
located object in the cluster called a Medoid.
The radius is the square root of the mean squared distance from any point in the cluster
to the centroid, and
the diameter is the square root of the mean squared distance between all pairs of points
in the cluster.
We use the notation Mm to indicate the Medoid for cluster Km.
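The following short sketch (not from the text) shows how these representative values could be computed for a small numeric cluster; the formulas follow directly from the definitions above.

    import math

    def centroid(cluster):
        # Centroid: attribute-wise mean of the points; need not be an actual point.
        n, dims = len(cluster), len(cluster[0])
        return tuple(sum(p[d] for p in cluster) / n for d in range(dims))

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def medoid(cluster):
        # Medoid: the actual point with the smallest total distance to the others.
        return min(cluster, key=lambda p: sum(euclidean(p, q) for q in cluster))

    def radius(cluster):
        # Square root of the mean squared distance from the points to the centroid.
        c = centroid(cluster)
        return math.sqrt(sum(euclidean(p, c) ** 2 for p in cluster) / len(cluster))

    def diameter(cluster):
        # Square root of the mean squared distance between all pairs of points.
        n = len(cluster)
        total = sum(euclidean(p, q) ** 2 for p in cluster for q in cluster)
        return math.sqrt(total / (n * (n - 1)))

    points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (5.0, 5.0)]
    print(centroid(points), medoid(points), radius(points), diameter(points))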
3. OUTLIERS
As mentioned earlier, outliers are sample points with values much different from those of the
remaining set of data.
Outliers may represent errors in the data (perhaps a malfunctioning sensor recorded an incorrect
data value) or could be correct data values that are simply much different from the remaining
data.
Some clustering techniques do not perform well with the presence of outliers. This problem is
illustrated in Figure 5.3.
Here if three clusters are found (solid line), the outlier will occur in a cluster by itself.
However, if two clusters are found (dashed line), the two (obviously) different sets of
data will be placed in one cluster because they are closer together than the outlier.
This problem is complicated by the fact that many clustering algorithms actually have
as input the number of desired clusters to be found.
Clustering algorithms may actually find and remove outliers to ensure that they perform
better. However, care must be taken in actually removing outliers.
For example, suppose that the data mining problem is to predict flooding. Extremely
high-water level values occur very infrequently, and when compared with the normal
water level values may seem to be outliers.
However, removing these values may not allow the data mining algorithms to work
effectively because there would be no data that showed that floods ever actually
occurred.
Outlier detection, or outlier mining, is the process of identifying outliers in a set of
data. Clustering, or other data mining, algorithms may then choose to remove or treat
these values differently.
Some outlier detection techniques are based on statistical techniques. These usually
assume that the set of data follows a known distribution and that outliers can be detected
by well-known tests such as discordancy tests.
• However, these tests are not very realistic for real-world data because real-world
data values may not follow well-defined data distributions.
• Also, most of these tests assume a single attribute value, and many attributes
are involved in real-world datasets.
Alternative detection techniques may be based on distance measures.
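As an illustration of the distance-based idea, the sketch below flags a point as an outlier when too few of the other points lie within a chosen distance of it. The threshold d and the fraction min_frac are illustrative parameters, not values from the text.

    import math

    def distance_outliers(points, d, min_frac=0.1):
        # Flag points that have fewer than min_frac of the other points
        # within distance d (a simple distance-based outlier test).
        outliers = []
        for i, p in enumerate(points):
            close = sum(1 for j, q in enumerate(points)
                        if i != j and math.dist(p, q) <= d)
            if close < min_frac * (len(points) - 1):
                outliers.append(p)
        return outliers

    data = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10)]
    print(distance_outliers(data, d=2.0, min_frac=0.5))  # -> [(10, 10)]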
4. HIERARCHICAL ALGORITHMS
As mentioned earlier, hierarchical clustering algorithms actually create sets of clusters.
Example 5.2 illustrates the concept. Hierarchical algorithms differ in how the sets are created.
A tree data structure, called a Dendrogram, can be used to illustrate the hierarchical clustering
technique and the sets of different clusters.
The root in a dendrogram tree contains one cluster where all elements are together. The leaves
in the dendrogram each consist of a single element cluster. Internal nodes in the dendrogram
represent new clusters formed by merging the clusters that appear as its children in the tree.
Each level in the tree is associated with the distance measure that was used to merge the
clusters.
All clusters created at a particular level were combined because the children clusters
had a distance between them less than the distance value associated with this level in the tree.
Part (b) illustrates four clusters. Here there are two sets of two-element clusters.
These clusters are formed at this level because these two elements are closer to each other than
any of the other elements.
Part (c) shows a new cluster formed by adding a close element to one of the two-element
clusters.
In part (d) the two-element and three-element clusters are merged to give a five-element cluster.
This is done because these two clusters are closer to each other than to the remote element
cluster, {F}.
At the last stage, part (e), all six elements are merged.
The space complexity for hierarchical algorithms is O(n²) because this is the space required for
the adjacency matrix.
The space required for the dendrogram is O(kn), which is much less than O(n²).
The time complexity for hierarchical algorithms is O(kn²) because there is one iteration for each
level in the dendrogram.
Depending on the specific algorithm, however, this could actually be O(maxd · n²), where maxd
is the maximum distance between points.
Different algorithms may actually merge the closest clusters from the next lowest level or
simply create new clusters at each level with progressively larger distances.
Hierarchical techniques are well suited for many clustering applications that naturally exhibit
a Nesting relationship between clusters.
For example, in biology, plant and animal taxonomies could easily be viewed as a hierarchy of
clusters.
This algorithm uses a procedure called NewClusters to determine how to create the next level
of clusters from the previous level.
This is where the different types of agglomerative algorithms differ. It is possible that only two
clusters from the prior level are merged or that multiple clusters are merged.
Algorithms also differ in terms of which clusters are merged when there are several clusters
with identical distances.
In addition, the technique used to determine the distance between clusters may vary.
Single link, complete link, and average link techniques are perhaps the most well-known
agglomerative techniques, based on well-known graph theory concepts.
• All agglomerative approaches experience excessive time and space constraints.
• The space required for the adjacency matrix is O(n²) where there are n items to cluster.
• Because of the iterative nature of the algorithm, the matrix (or a subset of it) must be
accessed multiple times.
• The simplistic algorithm provided in Algorithm 5.1 performs at most maxd
examinations of this matrix, where maxd is the largest distance between any two points.
• In addition, the complexity of the NewClusters procedure could be expensive. This is a
potentially severe problem in large databases.
• Another issue with the agglomerative approach is that it is not incremental. Thus, when
new elements are added or old ones are removed or changed, the entire algorithm must
be rerun.
EXAMPLE 5.3
Table 5.2 contains five sample data items with the distance between the elements indicated
in the table entries.
When viewed as a graph problem, Figure 5.6(a) shows the general graph with all edges
labeled with the respective distances.
To understand the idea behind the hierarchical approach, we show several graph variations
in Figures 5.6(b), (c), (d), and (e).
Figure 5.6(b) shows only those edges with a distance of 1 or less. There are only two edges.
The first level of single link clustering then will combine the connected clusters (single
elements from the first phase), giving three clusters: {A, B}, {C, D}, and {E}.
During the next level of clustering, we look at edges with a length of 2 or less. The
graph representing this threshold distance is shown in Figure 5.6(c). Note that we now have
an edge (actually three) between the two clusters {A,B} and {C,D}. Thus, at this level of
the single link clustering algorithm, we merge these two clusters to obtain a total of two
clusters: {A, B, C, D} and {E}.
The graph that is created with a threshold distance of 3 is shown in Figure 5.6(d). Here the
graph is connected, so the two clusters from the last level are merged into one large cluster
that contains all elements.
The dendrogram for this single link example is shown in Figure 5.7(a). The labeling on the
right-hand side shows the threshold distance used to merge the clusters at each level.
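A small sketch of threshold-based single link clustering as walked through above. The distance matrix is reconstructed from the worked example; entries not pinned down by the text (such as B-D and C-E) are illustrative assumptions.

    import itertools

    # Distances reconstructed from the worked example; values not implied by the
    # text (e.g. B-D, C-E) are illustrative assumptions.
    dist = {('A','B'): 1, ('C','D'): 1, ('A','C'): 2, ('A','D'): 2, ('B','C'): 2,
            ('A','E'): 3, ('B','E'): 3, ('D','E'): 3, ('B','D'): 4, ('C','E'): 5}
    items = ['A', 'B', 'C', 'D', 'E']

    def d(x, y):
        return dist[(x, y)] if (x, y) in dist else dist[(y, x)]

    parent = {x: x for x in items}
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    # Single link: at each threshold level, merge clusters joined by any edge
    # whose distance is at most the threshold.
    for threshold in [1, 2, 3]:
        for x, y in itertools.combinations(items, 2):
            if d(x, y) <= threshold:
                parent[find(x)] = find(y)
        clusters = {}
        for x in items:
            clusters.setdefault(find(x), []).append(x)
        print(threshold, sorted(clusters.values()))
    # threshold 1 -> {A,B}, {C,D}, {E}; threshold 2 -> {A,B,C,D}, {E}; threshold 3 -> all merged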
The single link algorithm is obtained by replacing the NewClusters procedure in the
agglomerative algorithm with a procedure to find connected components of a graph.
We assume that this connected component procedure has as input a graph (actually
represented by a vertex adjacency matrix and a set of vertices) and outputs a set of connected
components defined by a number (indicating the number of components) and an array
containing the membership of each component.
Note that this is exactly what the last two entries in the ordered triple are used for by the
dendrogram data structure.
The single link approach is quite simple, but it suffers from several problems.
This algorithm is not very efficient because the connected components procedure, which is
an O(n²) space and time algorithm, is called at each iteration.
A more efficient algorithm could be developed by looking at which clusters from an earlier
level can be merged at each step.
Another problem is that the clustering creates clusters with long chains.
An alternative view to merging clusters in the single link approach is that two clusters are
merged at a stage where the threshold distance is d if the minimum distance between any
vertex in one cluster and any vertex in the other cluster is at most d.
There have been other variations of the single link algorithm. One variation, based on the
use of a minimum spanning tree (MST), is shown in Algorithm 5.2.
• Here we assume that a procedure, MST, produces a minimum spanning tree given
an adjacency matrix as input.
• The clusters are merged in increasing order of the distance found in the MST.
• In the algorithm we show that once two clusters are merged, the distance between
them in the tree becomes ∞.
• Alternatively, we could have replaced the two nodes and edge with one node.
Consider this algorithm using the data in Example 5.3. Figure 5.8 shows one MST for the example.
The algorithm will merge A and B and then C and D (or the reverse). These two clusters will
then be merged at a threshold of 2.
Finally, E will be merged at a threshold of 3. Note that we get exactly the same dendrogram
as in Figure 5.7(a).
The time complexity of this algorithm is O(n²) because the procedure to create the minimum
spanning tree is O(n²) and it dominates the time of the algorithm.
Once the MST is created, having n-1 edges, the repeat loop will be repeated only n-1 times.
The single linkage approach is infamous for its chain effect; that is, two clusters are merged
if only two of their points are close to each other. There may be points in the respective
clusters to be merged that are far apart, but this has no impact on the algorithm.
Thus, resulting clusters may have points that are not related to each other at all, but simply
happen to be near (perhaps via a transitive relationship) points that are close to each other.
In this algorithm, we assume the existence of a procedure, clique, which finds all cliques
in a graph. As with the single link algorithm, this is expensive because it is an O(n²)
algorithm.
Clusters found with the complete link method tend to be more compact than those found
using the single link technique.
Using the data found in Example 5.3, Figure 5.7(b) shows the dendrogram created.
A variation of the complete link algorithm is called the farthest neighbor algorithm.
Here the closest clusters are merged, where the distance between two clusters is measured
by looking at the maximum distance between any two of their points.
One simple example of a divisive algorithm is based on the MST version of the single link
algorithm.
Here, however, we cut out edges from the MST from the largest to the smallest.
Looking at Figure 5.8, we would start with a cluster containing all items: {A, B, C, D, E}.
Looking at the MST, we see that the largest edge is between D and E. Cutting this out of
the MST, we then split the one cluster into two: {E} and {A, B, C, D}.
Next, we remove the edge between B and C. This splits the one large cluster into two: {A,
B} and {C, D}.
These will then be split at the next step. The order depends on how a specific
implementation would treat identical values.
Looking at the dendrogram in Figure 5.7(a), we see that we have created the same set of
clusters as with the agglomerative approach, but in reverse order.
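A sketch of this divisive, MST-based procedure on the same example; the MST edges and weights are taken from the description of Figure 5.8 and the reconstructed distances used earlier.

    # Divisive single link via the MST: repeatedly cut the largest remaining edge.
    mst = [('A', 'B', 1), ('C', 'D', 1), ('B', 'C', 2), ('D', 'E', 3)]
    items = {'A', 'B', 'C', 'D', 'E'}

    def components(edges, nodes):
        # Connected components of the remaining forest.
        comps, seen = [], set()
        adj = {n: set() for n in nodes}
        for a, b, _ in edges:
            adj[a].add(b)
            adj[b].add(a)
        for n in sorted(nodes):
            if n not in seen:
                stack, comp = [n], set()
                while stack:
                    v = stack.pop()
                    if v not in comp:
                        comp.add(v)
                        stack.extend(adj[v] - comp)
                seen |= comp
                comps.append(sorted(comp))
        return comps

    edges = sorted(mst, key=lambda e: e[2], reverse=True)
    while edges:
        edges = edges[1:]          # cut the largest remaining edge
        print(components(edges, items))
    # First cut (D-E) gives {A,B,C,D}, {E}; the next cut (B-C) gives {A,B}, {C,D}, {E}; and so on.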
5. PARTITIONAL ALGORITHMS
Nonhierarchical or partitional clustering creates the clusters in one step as opposed to
several steps.
Only one set of clusters is created, although several different sets of clusters may be created
internally within the various algorithms.
Since only one set of clusters is output, the user must input the desired number, k, of
clusters.
In addition, some metric or criterion function is used to determine the goodness of any
proposed solution.
This measure of quality could be the average distance between clusters or some other
metric.
The solution with the best value for the criterion function is the clustering solution used.
One common measure is a squared error metric, which measures the squared distance from
each point to the centroid of its associated cluster:
SE = Σ (j = 1..k) Σ (t in Kj) ||t - Cj||², where Cj is the centroid of cluster Kj.
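The criterion can be computed directly, as in the sketch below, which compares the squared error of two candidate clusterings of the same points (the data values are illustrative).

    def squared_error(clusters):
        # Sum of squared Euclidean distances from each point to the centroid
        # of its cluster (the criterion sketched above).
        total = 0.0
        for cluster in clusters:
            dims = len(cluster[0])
            centroid = [sum(p[d] for p in cluster) / len(cluster) for d in range(dims)]
            for p in cluster:
                total += sum((p[d] - centroid[d]) ** 2 for d in range(dims))
        return total

    # Two candidate 2-clusterings of the same five points; the one with the
    # smaller squared error would be preferred by this criterion.
    a = [[(1,), (2,), (3,)], [(10,), (11,)]]
    b = [[(1,), (2,)], [(3,), (10,), (11,)]]
    print(squared_error(a), squared_error(b))   # a has the smaller error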
A problem with partitional algorithms is that they suffer from a combinatorial explosion
due to the number of possible solutions.
Clearly, searching all possible clustering alternatives usually would not be feasible.
For example, given a measurement criterion, a naive approach could look at all possible
sets of k clusters. There are S(n, k) possible combinations to examine. Here S(n, k) is a
Stirling number of the second kind, which counts the ways to partition n items into k
nonempty subsets and grows far too quickly for exhaustive search to be practical.
Zahn proposes more reasonable inconsistent measures based on the weight (distance) of an
edge as compared to those close to it.
For example, an inconsistent edge would be one whose weight is much larger than the
average of the adjacent edges.
The time complexity of this algorithm is again dominated by the MST procedure, which is
O(n²).
At most k-1 edges will be removed, so the last three steps of the algorithm, assuming each
step takes a constant time, are only O(k-1).
Although determining the inconsistent edges in M may be quite complicated, it will not
require a time greater than the number of edges in M.
When looking at edges adjacent to one edge, there are at most k-2 of these edges.
In this case, then, the last three steps are O(k²), and the total algorithm is still O(n²).
In actuality, there are many different examples of squared error clustering algorithms. They
all follow the basic algorithm structure shown in Algorithm 5.5.
For each iteration in the squared error algorithm, each tuple is assigned to the cluster with
the closest center.
Since there are k clusters and n items, this is an O(kn) operation. Assuming t iterations, this
becomes an O(tkn) algorithm.
The amount of space may be only O(n) because an adjacency matrix is not needed, as the
distance between all items is not used.
This definition assumes that each tuple has only one numeric value as opposed to a tuple
with many attribute values.
The K-means algorithm requires that some definition of cluster mean exists, but it does not
have to be this particular one.
Here the mean is defined identically to our earlier definition of centroid.
This algorithm assumes that the desired number of clusters, k, is an input parameter.
Note that the initial values for the means are arbitrarily assigned. These could be assigned
randomly or perhaps could use the values from the first k input items themselves.
The convergence criteria could be based on the squared error, but they need not be.
❖ For example, the algorithm could stop when no tuples (or only a very small number) are
assigned to different clusters.
❖ Other termination techniques have simply looked at a fixed number of iterations. A
maximum number of iterations may be included to ensure stopping even without
convergence.
EXAMPLE 5.4
Suppose that we are given the following items to cluster: {2, 4, 10, 12, 3, 20, 30, 11, 25}
and suppose that k = 2.
We initially assign the means to the first two values: m1 = 2 and m2 = 4.
Using Euclidean distance, we find that initially K1 = {2, 3} and K2 = {4, 10, 12, 20, 30, 11, 25}.
We then recalculate the means to get m1 = 2.5 and m2 = 16.
We again make assignments to clusters to get K1 = {2, 3, 4} and K2 = {10, 12, 20, 30, 11, 25}.
Continuing in this fashion, we obtain the following:
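The truncated iteration table can be reproduced with a short sketch such as the one below (assignment ties go to the first cluster, as in the example). Starting from means 2 and 4 it converges to K1 = {2, 3, 4, 10, 11, 12} and K2 = {20, 25, 30}, with means 7 and 25.

    def kmeans_1d(items, means, max_iter=100):
        # Simple 1-D K-means: assign each item to the nearest mean, recompute
        # the means, and repeat until the assignments stop changing.
        for _ in range(max_iter):
            clusters = [[] for _ in means]
            for x in items:
                j = min(range(len(means)), key=lambda j: abs(x - means[j]))
                clusters[j].append(x)
            new_means = [sum(c) / len(c) for c in clusters]
            if new_means == means:
                return clusters, means
            means = new_means
        return clusters, means

    items = [2, 4, 10, 12, 3, 20, 30, 11, 25]
    clusters, means = kmeans_1d(items, means=[2, 4])   # initial means from Example 5.4
    print(clusters, means)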
One variation of K-means, K-modes, does handle categorical data. Instead of using means,
it uses modes. A typical value for k is 2 to 10.
Although the K-means algorithm often produces good results, it is not time-efficient and
does not scale well.
Some K-means variations examine ways to improve the chances of finding the global
optimum. This often involves careful selection of the initial clusters and means.
Another variation is to allow clusters to be split and merged. The variance within a cluster
is examined, and if it is too large, a cluster is split. Similarly, if the distance between two
cluster centroids is less than a predefined threshold, they will be combined.
EXAMPLE 5.5
The complexity of the nearest neighbor algorithm actually depends on the number of items.
For each loop, each item must be compared to each item already in a cluster. Obviously,
this is n in the worst case.
Thus, the time complexity is O(n²).
Since we do need to examine the distances between items often, we assume that the space
requirement is also O(n²).
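The nearest neighbor algorithm itself is not reproduced in these notes; the sketch below shows one common serial formulation (an assumption): each new item joins the cluster of its closest already-clustered item if that distance is within a threshold t, and otherwise starts a new cluster. Comparing each item against all earlier items is what gives the O(n²) worst case.

    def nearest_neighbor_clustering(items, dist, t):
        # Serial (incremental) clustering: each new item joins the cluster of its
        # closest already-clustered item if that distance is <= t, otherwise it
        # starts a new cluster.
        clusters = [[items[0]]]
        for x in items[1:]:
            best, best_d = None, None
            for cluster in clusters:
                for y in cluster:
                    d = dist(x, y)
                    if best_d is None or d < best_d:
                        best, best_d = cluster, d
            if best_d <= t:
                best.append(x)
            else:
                clusters.append([x])
        return clusters

    data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
    print(nearest_neighbor_clustering(data, dist=lambda a, b: abs(a - b), t=3))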
We use Cjih to denote the cost change for an item tj associated with swapping medoid ti with
non-medoid th.
The cost is the change to the sum of all distances from items to their cluster medoids.
There are four cases that must be examined when calculating this cost:
EXAMPLE 5.6
Suppose that the two medoids that are initially chosen are A and B.
Based on the distances shown in Table 5.2
and randomly placing items when distances are identical to the two medoids,
we obtain the clusters {A, C, D} and {B, E}. The three non-medoids, {C, D, E},
are then examined to see which (if any) should be used to replace A or B.
We thus have six costs to determine: TCAC, TCAD, TCAE, TCBC, TCBD, and TCBE.
TCih is the total impact on quality of swapping medoid ti with non-medoid th.
Here we use the name of the item instead of a numeric subscript value. We obtain the
following: replacing medoid A with non-medoid C is given by TCAC.
Here A is no longer a medoid, and since it is closer to B, it will be placed in the cluster
with B as medoid, and thus its cost is CAAC = 1.
The cost for B is 0 because it stays a cluster medoid: CBAC = 0.
C is now a medoid, so it has a negative cost based on its distance to the old medoid;
that is, CCAC = -2.
D is closer to C than it was to A by a distance of 1, so its cost is CDAC = -1.
Finally, E stays in the same cluster with the same distance, so its cost change is CEAC = 0.
Thus, we have that the overall cost is a reduction of 2.
Figure 5.9 illustrates the calculation of these six costs. Looking at these, we see that the
minimum cost is 2 and that there are several ways to reduce this cost.
Arbitrarily choosing the first swap, we get C and B as the new medoids with the clusters
being {C, D} and {B, A, E}.
This concludes the first iteration of PAM. At the next iteration,
we examine changing medoids again and pick the choice that best reduces the cost.
The iterations stop when no changes will reduce the cost. We leave the rest of this
problem to the reader as an exercise.
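The swap costs of Example 5.6 can be recomputed as below. The distance table is reconstructed from the values used in the worked example (e.g., A-B = 1, A-C = 2, C-D = 1); entries not pinned down by the text are illustrative assumptions.

    items = ['A', 'B', 'C', 'D', 'E']
    # Reconstructed from Example 5.6 / Table 5.2; some entries are assumptions.
    dist = {('A','B'): 1, ('A','C'): 2, ('A','D'): 2, ('A','E'): 3, ('B','C'): 2,
            ('B','D'): 4, ('B','E'): 3, ('C','D'): 1, ('C','E'): 5, ('D','E'): 3}

    def d(x, y):
        if x == y:
            return 0
        return dist[(x, y)] if (x, y) in dist else dist[(y, x)]

    def total_cost(medoids):
        # Sum of distances from every item to its closest medoid.
        return sum(min(d(x, m) for m in medoids) for x in items)

    medoids = ['A', 'B']
    base = total_cost(medoids)
    non_medoids = [x for x in items if x not in medoids]
    for old in medoids:
        for new in non_medoids:
            swapped = [m for m in medoids if m != old] + [new]
            print('TC_%s%s = %d' % (old, new, total_cost(swapped) - base))
    # With these distances, TC_AC comes out as -2, the reduction of 2 derived above.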
PAM does not scale well to large datasets because of its computational complexity.
For each iteration, we have k(n - k) pairs of objects i, h for which a cost, TCih, should be
determined.
Calculating the cost during each iteration requires that the cost be calculated for all other
non-medoids tj.
There are n - k of these. Thus, the total complexity per iteration is on the order of k(n - k)².
The total number of iterations can be quite large, so PAM is not an alternative for large
databases.
However, there are some clustering algorithms based on PAM that are targeted to large
datasets.
CLARA
CLARA (Clustering LARge Applications) improves on the time complexity of PAM by
using samples of the dataset.
The basic idea is that it applies PAM to a sample of the underlying database and then uses
the medoids found as the medoids for the complete clustering.
Each item from the complete database is then assigned to the cluster with the medoid to
which it is closest.
To improve CLARA's accuracy, several samples can be drawn, with PAM applied to
each. The sample chosen as the final clustering is the one that performs the best.
Because of the sampling, CLARA is more efficient than PAM for large databases.
However, it may not be as effective, depending on the sample size.
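A rough sketch of the CLARA idea. The pam() routine here is a toy exhaustive stand-in, feasible only because it is applied to a small sample, which is exactly the point of CLARA; the sample size and number of samples are illustrative parameters.

    import random
    from itertools import combinations

    def pam(points, k, dist):
        # Toy stand-in for PAM: exhaustive search for the best k medoids of a
        # small sample (real PAM uses the iterative swap procedure above).
        best, best_cost = None, None
        for medoids in combinations(points, k):
            cost = sum(min(dist(p, m) for m in medoids) for p in points)
            if best_cost is None or cost < best_cost:
                best, best_cost = medoids, cost
        return list(best)

    def clara(data, k, dist, sample_size=40, num_samples=5):
        best_medoids, best_cost = None, None
        for _ in range(num_samples):
            sample = random.sample(data, min(sample_size, len(data)))
            medoids = pam(sample, k, dist)
            # Evaluate the sample's medoids against the complete data set.
            cost = sum(min(dist(p, m) for m in medoids) for p in data)
            if best_cost is None or cost < best_cost:
                best_medoids, best_cost = medoids, cost
        return best_medoids

    data = [random.gauss(0, 1) for _ in range(500)] + [random.gauss(10, 1) for _ in range(500)]
    print(clara(data, k=2, dist=lambda a, b: abs(a - b)))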
CLARANS
CLARANS (clustering large applications based upon randomized search) improves on
CLARA by using multiple different samples.
In addition to the normal input to PAM, CLARANS requires two additional parameters:
maxneighbor and numlocal.
maxneighbor is the number of neighbors of a node to which any specific node can be
compared.
As maxneighbor increases, CLARANS looks more and more like PAM because all nodes
will be examined.
numlocal indicates the number of samples to be taken.
Since a new clustering is performed on each sample, this also indicates the number of
clusterings to be made.
Performance studies indicate that numlocal = 2 and
• The two shaded boxes represent the attributes that have been grouped together into
two clusters.
• Two attributes Ai and Aj have a high affinity if they are frequently used together in
database applications.
• At the heart of the BEA algorithm is the global affinity measure.
• Suppose that a database schema consists of n attributes {A1, A2, ... , An}. The global
affinity measure, AM, is defined as
The approach is similar to that in the squared error approach in that an initial random
solution is given and successive changes to this converge on a local optimum.
A new solution is generated from the previous solution using crossover and mutation
operations.
Our algorithm shows only crossover. The use of crossover to create a new solution from a
previous solution is shown in Example 5.7.
The new "solution “must be created in such a way that it represents a valid k clustering.
A fitness function must be used and may be defined based on an inverse of the squared
error.
Because of the manner in which crossover works, genetic clustering algorithms perform a
global search rather than a local search of potential solutions.
EXAMPLE 5.7
Suppose a database contains the following eight items {A, B, C, D, E, F, G, H}, which are
to be placed into three clusters. We could initially place the items into
the three clusters {A, C, E}, {B, F}, and {D, G, H},
which are represented by 10101000, 01000100, and 00010011, respectively. Suppose we
choose the first and third individuals as parents and do a simple crossover at point 4.
This yields the new solution: 00011000, 01000100, and 10100011.
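The crossover in Example 5.7 can be reproduced on the bit-string representation (one string per cluster, one bit per item):

    def crossover(parent1, parent2, point):
        # Single-point crossover on the cluster-membership bit strings.
        child1 = parent1[:point] + parent2[point:]
        child2 = parent2[:point] + parent1[point:]
        return child1, child2

    clusters = ['10101000',   # {A, C, E}
                '01000100',   # {B, F}
                '00010011']   # {D, G, H}

    c1, c2 = crossover(clusters[0], clusters[2], point=4)
    print(c2, clusters[1], c1)   # -> 00011000 01000100 10100011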
With competitive learning, nodes are allowed to compete and the winner takes all. This
approach usually assumes a two-layer NN in which all nodes from one layer are connected
to all nodes in the other layer.
As training occurs, nodes in the output layer become associated with certain tuples in the
input dataset. Thus, this provides a grouping of these tuples together into a cluster.
Imagine every input tuple having each attribute value input to a specific input node in the
NN.
The number of input nodes is the same as the number of attributes.
We can thus associate each weight to each output node with one of the attributes from the
input tuple.
When a tuple is input to the NN, all output nodes produce an output value.
The node with the weights most similar to the input tuple is declared the winner.
Its weights are then adjusted.
This process continues with each tuple input from the training set.
The input weights to the node are then close to an average of the tuples in this cluster.
• The firing of a neuron impacts the firing of other neurons that are near it.
• Neurons that are far apart seem to inhibit (restrain) each other.
• Neurons seem to have specific nonoverlapping tasks.
The term self-organizing indicates the ability of these NNs to organize the nodes into
clusters based on the similarity between them.
Those nodes that are closer together are more similar than those that are far apart. This
hints at how the actual clustering is performed.
Over time, nodes in the output layer become matched to input nodes, and patterns of nodes
in the output layer emerge.
Perhaps the most common example of a SOFM is the Kohonen self-organizing map,
which is used extensively in commercial data mining products to perform clustering.
There is one input layer and one special layer, which produces output values
that compete.
In effect, multiple outputs are created and the best one is chosen.
This extra layer is not technically either a hidden layer or an output layer, so we
refer to it here as the competitive layer.
Nodes in this layer are viewed as a two-dimensional grid of nodes, as seen in
Figure 5.11.
In this formula, c indicates the learning rate and may actually vary based on the node
rather than being a constant.
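The referenced formula is not reproduced in these notes. The sketch below uses the standard competitive-learning update (an assumption about the form of the rule): the winning node's weights are moved toward the input by the learning rate c, and in a Kohonen map the winner's grid neighbors are updated as well.

    import random

    def train_som(data, num_nodes, c=0.3, epochs=20):
        # One-dimensional competitive layer; each node has one weight per attribute.
        dims = len(data[0])
        weights = [[random.random() for _ in range(dims)] for _ in range(num_nodes)]
        for _ in range(epochs):
            for x in data:
                # Winner: the node whose weight vector is closest to the input tuple.
                win = min(range(num_nodes),
                          key=lambda k: sum((x[d] - weights[k][d]) ** 2 for d in range(dims)))
                # Update the winner (and, in a Kohonen map, its neighbors on the grid).
                for k in (win - 1, win, win + 1):
                    if 0 <= k < num_nodes:
                        for d in range(dims):
                            weights[k][d] += c * (x[d] - weights[k][d])
        return weights

    data = [(0.1, 0.2), (0.15, 0.25), (0.9, 0.8), (0.85, 0.95)]
    print(train_som(data, num_nodes=4))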
6. CLUSTERING LARGE DATABASES
6.1 BIRCH
BIRCH (balanced iterative reducing and clustering using hierarchies) is designed for
clustering a large amount of metric data.
It assumes that there may be a limited amount of main memory and achieves linear I/O
time, requiring only one database scan.
It is incremental and hierarchical, and it uses an outlier handling technique. Here points that
are found in sparsely populated areas are removed.
The basic idea of the algorithm is that a tree is built that captures needed information to
perform clustering.
The clustering is then performed on the tree itself, where the labeling of nodes in the tree
contains the needed information to calculate distance values.
A major characteristic of the BIRCH algorithm is the use of the clustering feature, which
is a triple that contains information about a cluster.
The clustering feature provides a summary of the information about one cluster.
By this definition it is clear that BIRCH applies only to numeric data.
DEFINITION 5.3.
A clustering feature (CF) is a triple (N, LS, SS), where the number of the points in the
cluster is N, LS is the sum of the points in the cluster and SS is the sum of the squares of
the points in the cluster.
DEFINITION 5.4.
A CF tree is a balanced tree with a branching factor (maximum number of children a node
may have) B. Each internal node contains a CF triple for each of its children. Each leaf
node also represents a cluster and contains a CF entry for each subcluster in it. A
subcluster in a leaf node must have a diameter no greater than a given threshold value T.
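A sketch of the clustering feature for one-dimensional data, showing its additivity and how the centroid and radius can be recovered from (N, LS, SS) alone; multi-dimensional data works the same way, with LS and SS kept per attribute.

    import math

    def cf(points):
        # Clustering feature of a set of 1-D points: (N, linear sum, square sum).
        return (len(points), sum(points), sum(p * p for p in points))

    def cf_merge(cf1, cf2):
        # CF values are additive: the CF of a merged cluster is the
        # component-wise sum of the two CFs.
        return tuple(a + b for a, b in zip(cf1, cf2))

    def centroid(cf_triple):
        n, ls, ss = cf_triple
        return ls / n

    def radius(cf_triple):
        # Square root of the mean squared distance of the points from the
        # centroid, computable from the CF alone as SS/N - (LS/N)^2.
        n, ls, ss = cf_triple
        return math.sqrt(max(ss / n - (ls / n) ** 2, 0.0))

    a, b = cf([1.0, 2.0, 3.0]), cf([10.0, 12.0])
    merged = cf_merge(a, b)
    print(merged, centroid(merged), radius(merged))
    print(cf([1.0, 2.0, 3.0, 10.0, 12.0]))   # identical to the merged CF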
A point is inserted into the cluster (represented by a leaf node) to which it is closest.
If the diameter for the leaf node is greater than T, then a splitting and balancing of the tree
is performed (similar to that used in a B-tree).
The algorithm adapts to main memory size by changing the threshold value.
A larger threshold, T, yields a smaller CF tree.
This process can be performed without rereading the data.
The clustering feature data provides enough information to perform this condensation.
The complexity of the algorithm is O(n).
Not shown in this algorithm are the parameters needed for the CF tree construction, such
as its branching factor, the page block size, and memory size.
Based on size, each node has room for a fixed number, B, of clusters (i.e., CF triples).
The first step creates the CF tree in memory.
The threshold value can be modified if necessary to ensure that the tree fits into the
available memory space.
BIRCH Insertion
Insertion into the CF tree requires scanning the tree from the root down, choosing the node
closest to the new point at each level.
The distance here is calculated by looking at the distance between the new point and the
centroid of the cluster. This can be easily calculated with most distance measures (e.g.,
Euclidean or Manhattan) using the CF triple.
When the new item is inserted, the CF triple is appropriately updated, as is each triple on
the path from the root down to the leaf. It is then added to the closest leaf node found by
adjusting the CF value for that node.
When an item is inserted into a cluster at the leaf node of the tree, the cluster must satisfy
the threshold value.
If it does, then the CF entry for that cluster is modified. If it does not, then that
item is added to that node as a single-item cluster.
Node splits occur if no space exists in a given node.
This is based on the size of the physical page because each node size is determined by the
page size.
An attractive feature of the CF values is that they are additive; that is, if two clusters are
merged, the resulting CF is the addition of the CF values for the starting clusters.
Once the tree is built, the leaf nodes of the CF tree represent the current clusters.
In reality, this algorithm, Algorithm 5.10, is only the first of several steps proposed for the
use of BIRCH with large databases. The complete outline of steps is:
1. Create initial CF tree using a modified version of Algorithm 5.10. This in effect "loads"
the database into memory.
If there is insufficient memory to construct the CF tree with a given threshold,
the threshold value is increased and a new smaller CF tree is constructed. This can be
done by inserting the leaf nodes of the previous tree into the new small tree.
2. The clustering represented by the CF tree may not be natural because each entry has a
limited size.
In addition, the input order can negatively impact the results. These problems
can be overcome by another global clustering approach applied to the leaf nodes in
the CF tree. Here each leaf node is treated as a single point for clustering.
Although the original work proposes a centroid-based agglomerative
hierarchical clustering algorithm to cluster the sub clusters, other clustering
algorithms could be used.
3. The last phase (which is optional) reclusters all points by placing them in the cluster that
has the closest centroid. Outliers, points that are too far from any centroid, can be removed
during this phase.
6.2 DBSCAN
The approach used by DBSCAN (density-based spatial clustering of applications with
noise) is to create clusters with a minimum size and density.
Density is defined as a minimum number of points within a certain distance of each other.
This handles the outlier problem by ensuring that an outlier (or a small set of outliers) will
not create a cluster.
One input parameter, MinPts, indicates the minimum number of points in any cluster.
In addition, for each point in a cluster there must be another point in the cluster whose
distance from it is less than a threshold input value, Eps.
The Eps-neighborhood or neighborhood of a point is the set of points within a distance of
Eps.
The desired number of clusters, k, is not input but rather is determined by the algorithm
itself.
a) p is a core point because it has 4 (MinPts value) points within its neighborhood.
b) shows the 5 core points in the figure.
c) shows that even though point r is not a core point, it is density-reachable from q.
The first part of the definition ensures that the second point is "close enough" to the first
point.
The second portion of the definition ensures that there are enough core points close
enough to each other. These core points form the main portion of a cluster in that they
are all close to each other.
A directly density-reachable point must be close to one of these core points but it need
not be a core point itself. In that case, it is called a border point.
A point is said to be density-reachable from another point if there is a chain from one to
the other that contains only points that are directly density-reachable from the previous
point.
This guarantees that any cluster will have a core set of points very close to a large number
of other points (core points) and then some other points (border points) that are
sufficiently close to at least one core point.
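A compact sketch of this density-based idea (not the published DBSCAN pseudocode): grow a cluster from each unvisited core point by absorbing the Eps-neighborhoods of core points; border points join a cluster but do not expand it, and points reachable from no core point are left as noise.

    import math

    def dbscan(points, eps, min_pts):
        def neighbors(i):
            # Eps-neighborhood, including the point itself.
            return [j for j in range(len(points))
                    if math.dist(points[i], points[j]) <= eps]

        labels = [None] * len(points)          # None = unvisited, -1 = noise
        cluster = 0
        for i in range(len(points)):
            if labels[i] is not None:
                continue
            nbrs = neighbors(i)
            if len(nbrs) < min_pts:            # not a core point
                labels[i] = -1
                continue
            cluster += 1
            labels[i] = cluster
            queue = list(nbrs)
            while queue:
                j = queue.pop()
                if labels[j] == -1:            # noise becomes a border point
                    labels[j] = cluster
                if labels[j] is not None:
                    continue
                labels[j] = cluster
                j_nbrs = neighbors(j)
                if len(j_nbrs) >= min_pts:     # only core points expand the cluster
                    queue.extend(j_nbrs)
        return labels

    data = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (20, 20)]
    print(dbscan(data, eps=1.5, min_pts=3))    # two clusters plus one noise point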
6.3 CURE
CURE (Clustering Using REpresentatives) represents each cluster by multiple representative
points rather than a single one. The time complexity of CURE is O(n² lg n), while space is
O(n). This is worst-case behavior.
• A heap and k-d tree data structure are used to ensure this performance.
• One entry in the heap exists for each cluster.
• Each cluster has not only its representative points, but also the cluster that is closest to
it.
• Entries in the heap are stored in increasing order of the distances between clusters.
• We assume that each entry u in the heap contains the set of representative points, u.rep;
the mean of the points in the cluster, u.mean; and the cluster closest to it, u.closest.
• We use the heap operations: heapify to create the heap, min to extract the minimum
entry in the heap, insert to add a new entry, and delete to delete an entry.
• A merge procedure is used to merge two clusters. It determines the new representative
points for the new cluster
The basic idea of this process is to first find the point that is farthest from the mean.
Subsequent points are then chosen based on being the farthest from those points that were
previously chosen. A predefined number of points is picked.
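A sketch of this representative-point selection together with the shrinking step that CURE applies afterwards; the number of representatives and the shrinking factor alpha = 0.3 are illustrative values.

    import math

    def representatives(cluster, num_points=4, alpha=0.3):
        dims = len(cluster[0])
        mean = tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))

        # First representative: the point farthest from the mean; afterwards,
        # the point farthest from the representatives chosen so far.
        reps = [max(cluster, key=lambda p: math.dist(p, mean))]
        while len(reps) < min(num_points, len(cluster)):
            reps.append(max((p for p in cluster if p not in reps),
                            key=lambda p: min(math.dist(p, r) for r in reps)))

        # Shrink each representative toward the mean by the factor alpha.
        return [tuple(r[d] + alpha * (mean[d] - r[d]) for d in range(dims)) for r in reps]

    cluster = [(0, 0), (0, 2), (2, 0), (2, 2), (1, 1)]
    print(representatives(cluster))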
A k-D tree is a balanced binary tree that can be thought of as a generalization of a binary
search tree.
It is used to index data of k dimensions where the ith level of the tree indexes the ith
dimension.
In CURE, a k-D tree is used to assist in the merging of clusters. It stores the representative
points for each cluster. Initially, there is only one representative point for each cluster,
the sole item in it.
Operations performed on the tree are: delete to delete an entry form the tree, insert to
insert an entry into it, and build to initially create it.
The hierarchical clustering algorithm itself is shown in Algorithm 5.12.
The quality of the clusters found by CURE is better than that found by earlier algorithms
such as BIRCH.
While the value of the shrinking factor α does impact results, with a value between 0.2
and 0.7 the correct clusters are still found.
When the number of representative points per cluster is greater than five, the correct
clusters are still always found.
A random sample size of about 2.5% and a number of partitions greater than one or
two times k seem to work well.
The results with large datasets indicate that CURE scales well and outperforms BIRCH.
7. CLUSTERING WITH CATEGORICAL ATTRIBUTES
7.1 ROCK
Traditional algorithms do not always work with categorical data.
The ROCK (RObust Clustering using linKs) clustering algorithm is targeted to both
Boolean data and categorical data.
A novel approach to identifying similarity is based on the number of links between items.
A pair of items are said to be neighbors if their similarity exceeds some threshold.
This need not be defined based on a precise metric, but rather a more intuitive approach
using domain experts could be used.
The number of links between two items is defined as the number of common neighbors
they have.
The objective of the clustering algorithm is to group together points that have more links.
The algorithm is a hierarchical agglomerative algorithm using the number of links as the
similarity measure rather than a measure based on distance.
Instead of using a Euclidean distance, a different distance, such as the Jaccard coefficient,
has been proposed. One proposed similarity measure based on the Jaccard coefficient is
defined as sim(ti, tj) = |ti ∩ tj| / |ti ∪ tj|.
If the tuples are viewed to be sets of items purchased (i.e., market basket data), then we
look at the number of items they have in common divided by the total number in both.
The denominator is used to normalize the value to be between 0 and 1.
The number of links between a pair of points can be viewed as the number of unique
paths of length 2 between them.
Example 5.9 illustrates the use of links by the ROCK algorithm using the data from
Example 5.8 using the Jaccard coefficient.
Note that different threshold values for neighbors could be used to get different results.
Also note that a hierarchical approach could be used with different threshold values for
each level in the dendrogram.
The first step in the algorithm converts the adjacency matrix into a Boolean matrix, where
an entry is 1 if the two corresponding points are neighbors.
As the adjacency matrix is of size n², this is an O(n²) step.
The next step converts this into a matrix indicating the links.
This can be found by calculating S × S, which can be done in O(n^2.37).
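A sketch of the neighbor matrix S and the link computation just described, using the Jaccard coefficient on small market-basket style tuples; the transactions and the 0.5 threshold are illustrative assumptions, not the data of Example 5.8.

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    # Illustrative market-basket tuples (not the data of Example 5.8).
    tuples = [{'bread', 'milk'}, {'bread', 'milk', 'eggs'},
              {'milk', 'eggs'}, {'beer', 'chips'}]
    theta = 0.5   # neighbor threshold on the similarity

    n = len(tuples)
    # Boolean neighbor matrix S: S[i][j] = 1 if the two tuples are neighbors.
    S = [[1 if i != j and jaccard(tuples[i], tuples[j]) >= theta else 0
          for j in range(n)] for i in range(n)]

    # link(i, j) = number of common neighbors = entry (i, j) of S x S,
    # i.e. the number of distinct paths of length 2 between i and j.
    links = [[sum(S[i][k] * S[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
    for row in links:
        print(row)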
The hierarchical clustering portion of the algorithm then starts by placing each point in
the sample in a separate cluster. It then successively merges clusters until k clusters are
found.
To facilitate this processing, both local and global heaps are used.
A local heap, q, is created to represent each cluster. Here q contains every cluster that has
a nonzero link to the cluster that q corresponds to.
Initially, a cluster is created for each point ti. The heap for ti, q[ti], contains every
cluster that has a nonzero link to {ti}.
The global heap contains information about each cluster. All information in the heap is
ordered based on the goodness measure, which is shown in Equation 5.17.
8. COMPARISON
The different clustering algorithms discussed in this chapter are compared in Table 5.3.
Here we include a classification of the type of algorithm, space and time complexity, and
general notes concerning applicability.
• The single link, complete link, and average link techniques are all hierarchical
techniques with O(n²) time and space complexity.
• Both K-means and the squared error techniques are iterative, requiring O(tkn) time.
• The nearest neighbor is not iterative, but the number of clusters is not predetermined.
Thus, the worst-case complexity can be O(n²).
• BIRCH appears to be quite efficient, but remember that the CF-tree may need to be
rebuilt. The time complexity in the table assumes that the tree is not rebuilt.
• CURE is an improvement on these by using sampling and partitioning to handle
scalability well and uses multiple points rather than just one point to represent each
cluster.
• Using multiple points allows the approach to detect nonspherical clusters and also makes
it more resistant to the negative impact of outliers. With sampling, CURE obtains an O(n)
time complexity.
• However, CURE does not handle categorical data well.
• K-means and PAM work by iteratively reassigning items to clusters, which may not
find a global optimal assignment.
• The results of the K-means algorithm are quite sensitive to the presence of outliers.
• Through the use of the CF-tree, BIRCH is both dynamic and scalable. However, it
detects only spherical-type clusters.
• DBSCAN is a density-based approach. The time complexity of DBSCAN can be
improved to O(n log n) with appropriate spatial indices.
• Genetic algorithms are not included in this table because their performance totally
depends on the technique chosen to represent individuals, how crossover is done, and
the termination condition used.