Data mining and data warehouse
Data Mining - Cluster
Analysis
A cluster is a group of objects that belong
to the same class. In other words, similar
objects are grouped in one cluster and
dissimilar objects are grouped in another
cluster.
What is Clustering?
Clustering is the process of grouping a set
of abstract objects into classes of similar
objects.
Points to Remember
* A cluster of data objects can be treated
as one group.
* While doing cluster analysis, we first
partition the set of data into groups
based on data similarity and then assign
labels to the groups.
* The main advantage of clustering over
classification is that it is adaptable to
changes and helps single out useful
features that distinguish different
groups.
Applications of Cluster
Analysis
* Cluster analysis is broadly used in
many applications such as market
research, pattern recognition, data
analysis, and image processing.
* Clustering can also help marketers
discover distinct groups in their
customer base and characterize those
groups based on purchasing patterns.
* In the field of biology, it can be used to
derive plant and animal taxonomies,
categorize genes with similar
functionalities, and gain insight into
structures inherent to populations.
* Clustering also helps in the
identification of areas of similar land
use in an earth observation database. It
also helps in the identification of groups
of houses in a city according to house
type, value, and geographic location.
* Clustering also helps in classifying
documents on the web for information
discovery.
* Clustering is also used in outlier
detection.
* Cluster analysis serves as a tool to gain
insight into the distribution of data and
to observe the characteristics of each
cluster.
Requirements of Clustering in
Data Mining
The following points describe what a
clustering algorithm must provide to be
useful in data mining -
* Scalability - We need highly scalable
clustering algorithms to deal with large
databases.
* Ability to deal with different kinds of
attributes - Algorithms should be
applicable to any kind of data, such as
interval-based (numerical), categorical,
and binary data.
* Discovery of clusters with arbitrary
shape - The clustering algorithm should
be capable of detecting clusters of
arbitrary shape. It should not be limited
to distance measures that tend to find
only small spherical clusters.
* High dimensionality - The clustering
algorithm should be able to handle not
only low-dimensional data but also
high-dimensional spaces.
* Ability to deal with noisy data -
Databases contain noisy, missing, or
erroneous data. Some algorithms are
sensitive to such data and may produce
poor-quality clusters.
* Interpretability - The clustering results
should be interpretable, comprehensible,
and usable.
Clustering Methods
Clustering methods can be classified into the
following categories -
* Partitioning Method
* Hierarchical Method
* Density-based Method
* Grid-based Method
* Model-based Method
* Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’
objects and the partitioning method
constructs ‘k’ partitions of the data. Each
partition will represent a cluster and k ≤ n.
It means that the method will classify the
data into k groups, which satisfy the
following requirements -
* Each group contains at least one object.
* Each object must belong to exactly one
group.
Points to remember -
* For a given number of partitions (say k),
the partitioning method will create an
initial partitioning.
* Then it uses the iterative relocation
technique to improve the partitioning by
moving objects from one group to
another.
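The iterative relocation idea can be sketched with a k-means style loop. This is only an illustration under stated assumptions: the function name `kmeans`, the choice of the first k points as the initial partitioning, Euclidean distance, and the sample points are all hypothetical and not taken from the text above.

```python
import math

def kmeans(points, k, max_iter=100):
    """Partition `points` (tuples of floats) into k groups by iterative relocation."""
    # Initial partitioning (assumption): use the first k points as starting centers.
    centers = [points[i] for i in range(k)]
    assignment = None
    for _ in range(max_iter):
        # Relocation step: move each object to the group with the nearest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed group, so the partitioning is stable
        assignment = new_assignment
        # Update step: recompute each center as the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a group lost all its objects
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centers

# Hypothetical sample data: two loose groups in the plane.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 9.5), (8.5, 9.0)]
labels, centers = kmeans(points, k=2)
print(labels)  # [0, 0, 1, 0, 1, 1]
```

Each object ends up with exactly one label, which matches the requirement that every object belongs to exactly one group. A production implementation would also need a strategy for groups that become empty.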
Hierarchical Methods
This method creates a hierarchical
decomposition of the given set of data
objects. We can classify hierarchical
methods on the basis of how the hierarchical
decomposition is formed. There are two
approaches here -
* Agglomerative Approach
* Divisive Approach
Agglomerative Approach
This approach is also known as the
bottom-up approach. In this, we start with
each object forming a separate group. The
method keeps on merging the objects or
groups that are close to one another. It
keeps on doing so until all of the groups are
merged into one or until the termination
condition holds.
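A minimal sketch of the bottom-up procedure follows. The text does not say how closeness between groups is measured or what the termination condition is, so this example assumes single linkage (distance between the closest pair of members) and stops at a requested number of groups. The names `agglomerative` and `target_clusters` are hypothetical.

```python
import math

def agglomerative(points, target_clusters=1):
    """Merge the closest groups until `target_clusters` groups remain."""
    # Start with each object forming a separate group.
    clusters = [[p] for p in points]

    def linkage(a, b):
        # Single linkage (assumption): distance between the closest pair of members.
        return min(math.dist(p, q) for p in a for q in b)

    # Termination condition (assumption): a requested number of groups.
    while len(clusters) > target_clusters:
        # Find the pair of groups that are closest to one another.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        # Merge group j into group i; a merge is never undone.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical sample data.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
groups = agglomerative(points, target_clusters=3)
print(groups)
```

With `target_clusters=1` the loop runs until all groups are merged into one. This naive version recomputes every pairwise linkage on each pass, so it is only suitable for small inputs.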
Divisive Approach
This approach is also known as the top-down
approach. In this, we start with all of the
objects in the same cluster. In each
successive iteration, a cluster is split up into
smaller clusters. This is done until each
object is in its own cluster or the
termination condition holds.
The hierarchical method is rigid, i.e., once a
merge or split is done, it can never be
undone.
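The top-down direction can be sketched in the same spirit. The text does not specify how a cluster is split, so this example uses a simple heuristic chosen for illustration only: split the cluster with the largest diameter, using its two farthest points as seeds and sending every object to the nearer seed. The names `divisive`, `diameter`, and `target_clusters` are hypothetical.

```python
import math

def diameter(cluster):
    """Largest distance between any two objects in the cluster."""
    return max((math.dist(p, q) for p in cluster for q in cluster), default=0.0)

def divisive(points, target_clusters):
    """Split clusters top-down until `target_clusters` clusters exist."""
    clusters = [list(points)]  # start with all objects in the same cluster
    # Termination condition (assumption): a requested number of clusters.
    while len(clusters) < target_clusters:
        widest = max(clusters, key=diameter)
        if diameter(widest) == 0.0:
            break  # only identical objects remain, nothing left to split
        # Seeds (assumption): the two farthest objects in the chosen cluster.
        a, b = max(
            ((p, q) for p in widest for q in widest),
            key=lambda pq: math.dist(pq[0], pq[1]),
        )
        left = [p for p in widest if math.dist(p, a) <= math.dist(p, b)]
        right = [p for p in widest if math.dist(p, a) > math.dist(p, b)]
        # Replace the cluster by its two parts; a split is never undone.
        clusters.remove(widest)
        clusters += [left, right]
    return clusters

# Hypothetical sample data.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
parts = divisive(points, target_clusters=3)
print(sorted(parts))
```

Because each seed is always closer to itself than to the other seed, both parts of a split are non-empty.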
Approaches to Improve Quality of
Hierarchical Clustering
Here are the two approaches that are used to
improve the quality of hierarchical
clustering -
* Perform careful analysis of object
linkages at each hierarchical
partitioning.
* Integrate hierarchical agglomeration by
first using a hierarchical agglomerative
algorithm to group objects into
micro-clusters, and then performing
macro-clustering on the micro-clusters.
Density-based Method
This method is based on the notion of
density. The basic idea is to continue
growing a given cluster as long as the
density in the neighborhood exceeds some
threshold, i.e., for each data point within a
given cluster, the neighborhood of a given
radius has to contain at least a minimum
number of points.
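The growth rule described above is the core of DBSCAN-style algorithms. The sketch below is a simplified illustration, not a tuned implementation. The parameter names `eps` (the radius) and `min_pts` (the minimum number of points), the label `-1` for noise, and the sample data are assumptions made for this example.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise."""
    NOISE = -1
    labels = [None] * len(points)

    def neighbors(i):
        # All points within radius eps of point i (the point itself included).
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # neighborhood is not dense enough
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        # Keep growing the cluster while dense neighborhoods are found.
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a dense point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)
        cluster_id += 1
    return labels

# Hypothetical sample data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

The number of clusters is not given in advance here. It follows from the density threshold, and isolated points are left as noise.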
Grid-based Method
In this, the objects together form a grid. The
object space is quantized into a finite
number of cells that form a grid structure.
Advantages
* The major advantage of this method is
its fast processing time.
* The processing time depends only on
the number of cells in each dimension of
the quantized space, not on the number
of data objects.
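The quantization step can be sketched as follows. This is a minimal illustration that assumes square cells of a fixed size and treats a cell as dense when it holds at least `min_count` objects. The names `grid_cells`, `dense_cells`, `cell_size`, and `min_count` are hypothetical.

```python
from collections import Counter

def grid_cells(points, cell_size):
    """Quantize the object space into cells and count the objects in each cell."""
    return Counter(
        tuple(int(coord // cell_size) for coord in p) for p in points
    )

def dense_cells(points, cell_size, min_count):
    """Return the index tuples of cells holding at least `min_count` objects."""
    counts = grid_cells(points, cell_size)
    return {cell for cell, n in counts.items() if n >= min_count}

# Hypothetical sample data.
points = [(0.2, 0.3), (0.8, 0.1), (0.5, 0.9), (3.1, 3.2), (3.7, 3.9), (9.5, 0.5)]
print(grid_cells(points, cell_size=1.0))
print(dense_cells(points, cell_size=1.0, min_count=2))
```

After this single pass over the data, any further clustering step works on the cell counts rather than on the individual objects, which is where the speed advantage comes from.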
Model-based Method
In this method, a model is hypothesized for
each cluster to find the best fit of data for a
given model. This method locates the
clusters by clustering the density function. It
reflects spatial distribution of the data
points.
This method also provides a way to
automatically determine the number of
clusters based on standard statistics, taking
outlier or noise into account. It therefore
yields robust clustering methods.
Constraint-based Method
In this method, the clustering is performed by