Cluster Analysis

AJEENA T
M23CSCS01
TKMCE
Contents
Introduction
Desired features of cluster analysis
Types of data in cluster analysis
Overview of basic clustering methods
DBSCAN - Density-Based Clustering Based on
Connected Regions with High Density
Introduction
Cluster analysis is the process of partitioning a set of data objects into subsets.
The set of clusters resulting from a cluster analysis can be referred to as a clustering.
A cluster is a collection of data objects that are similar to one another within the cluster and dissimilar to objects in other clusters.
Applications include business intelligence, image pattern recognition, web search, biology, and security.
Clustering is sometimes called automatic classification.

Clustering can automatically find the groupings.
Clustering is known as unsupervised learning; the class label information is not present.
Clustering is a form of learning by observation, rather than learning by example.
Desired features of cluster analysis
Scalability: highly scalable algorithms are needed to cluster large data sets.
Ability to deal with different types of attributes: applications may require clustering data of various types, such as binary, nominal, and ordinal data, or mixtures of these types.
Discovery of clusters with arbitrary shape: algorithms based on Euclidean or Manhattan distance measures tend to find spherical clusters of similar size and density. It is important to develop algorithms that can detect clusters of arbitrary shape.
Capability of clustering high-dimensional data: most clustering algorithms are good at handling low-dimensional data; finding clusters of data objects in a high-dimensional space is challenging.
Constraint-based clustering: to find data groups with good clustering behaviour that satisfy specified constraints.
Interpretability and usability: users want clustering results to be interpretable, comprehensible, and usable.
Requirements for domain knowledge to determine input parameters: many clustering algorithms require users to provide domain knowledge in the form of input parameters, such as the desired number of clusters.
Ability to deal with noisy data: clustering methods that are robust to noise are needed.
Incremental clustering and insensitivity to input order: incremental clustering algorithms and algorithms that are insensitive to the input order are needed.
Types of data in cluster analysis

 Interval-scaled variables - continuous measurements on a roughly linear scale, e.g., weight and height, latitude and longitude coordinates
 Binary variables - a variable that can take only 2 values, e.g., a gender variable with the two values male and female
 Nominal or categorical variables - a generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
 Ordinal variables - an ordinal variable can be discrete or continuous; here the order is important, e.g., rank
 Ratio-scaled variables - positive measurements on a nonlinear scale
 Variables of mixed type - a database may contain all the types of variables (binary, nominal, ordinal, interval, and ratio); these are collectively called mixed-type variables (a short sketch of common dissimilarity measures follows below)
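The dissimilarity measure used for clustering depends on the variable type. The following is a minimal sketch, not from the slides and assuming NumPy is available, of two common choices: Euclidean distance for interval-scaled variables and simple matching dissimilarity for binary or nominal variables.

import numpy as np

def euclidean(x, y):
    # interval-scaled variables: straight-line distance between two objects
    return np.sqrt(np.sum((np.asarray(x, float) - np.asarray(y, float)) ** 2))

def simple_matching(x, y):
    # binary / nominal variables: fraction of attributes on which the objects differ
    return float(np.mean(np.asarray(x) != np.asarray(y)))

print(euclidean([1.70, 65.0], [1.80, 80.0]))             # height (m), weight (kg)
print(simple_matching(["red", "yes"], ["blue", "yes"]))  # 0.5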
Overview of basic clustering methods
Partitioning method
 Given a set of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n
 i.e., it divides the data into k groups such that each group must contain at least one object
 The basic partitioning methods typically adopt exclusive cluster separation, i.e., each object must belong to exactly one group
 Commonly used partitioning methods are k-means and k-medoids (a sketch of k-means follows below)
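A minimal k-means sketch, assuming scikit-learn is available; the toy data and the choice k = 2 are illustrative, not from the slides.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],    # one compact group
              [8.0, 8.2], [7.9, 7.8], [8.3, 8.1]])   # another compact group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # exclusive assignment: each object belongs to exactly one group
print(km.cluster_centers_)  # one centroid per partition, with k <= n

k-medoids works in the same spirit but uses actual objects (medoids) as cluster representatives, which makes it less sensitive to outliers than centroid-based k-means.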
Hierarchical methods
 Creates a hierarchical decomposition of the given set of data objects
 Can be classified as being either agglomerative or divisive
 The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group and merges the objects (groups) close to one another, until a termination condition holds
 The divisive approach, also known as the top-down approach, starts with all the objects in the same cluster and splits it into smaller clusters, until each object forms its own cluster or a termination condition holds
 Can be distance-based or density- and continuity-based (a sketch of agglomerative clustering follows below)
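A minimal agglomerative (bottom-up) sketch, assuming SciPy is available; the linkage method and the cut at two clusters are illustrative choices.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9]])

Z = linkage(X, method="average")                  # repeatedly merge the closest groups
labels = fcluster(Z, t=2, criterion="maxclust")   # terminate when 2 clusters remain
print(labels)

A divisive method would instead start from a single cluster containing all five objects and split it top-down.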
Grid-based methods
 Quantize the object space into a finite number of cells that form a grid structure
 All the operations are performed on the grid structure
 Fast processing time (typically independent of the number of data objects, yet dependent on grid size); see the sketch below
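An illustrative sketch of the grid idea only, not a specific named grid-based algorithm: objects are quantized into cells once, and subsequent operations work on cell counts rather than on the individual objects.

import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
X = rng.random((1000, 2))            # 1000 objects in the unit square
cell_size = 0.1                      # grid resolution (illustrative)

cells = [tuple(c) for c in np.floor(X / cell_size).astype(int)]
counts = Counter(cells)              # later operations use cell counts, not raw objects

print(len(counts), "occupied cells for", len(X), "objects")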
Density-based methods
 Density-based methods can divide a set of objects into multiple exclusive clusters
 Can find arbitrarily shaped clusters
 Clusters are dense regions of objects in space that are separated by low-density regions
 Cluster density: each point must have a minimum number of points within its "neighbourhood"
 Can be used to filter out noise or outliers
 Can be extended from full space to subspace clustering
Partitioning and hierarchical methods are designed to find spherical-shaped clusters.
When noise or outliers are present, they may inaccurately identify convex regions that include the noise or outliers in the clusters.
To find arbitrarily shaped clusters, we can instead model clusters as dense regions in the data space separated by sparse regions.
This is the main strategy behind density-based clustering methods, which can discover clusters of non-spherical shape, as the comparison below illustrates.
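A small illustration of that point, assuming scikit-learn is available: on two interleaving half-moons, k-means cuts across the non-spherical shapes, while a density-based method (DBSCAN, discussed next) recovers them. The eps and min_samples values are illustrative.

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print(set(km))   # {0, 1}: two roughly spherical partitions that split each moon
print(set(db))   # the two dense, arbitrarily shaped clusters (-1 would mark noise)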
DBSCAN: Density-Based Clustering Based on
Connected Regions with High Density

How can we find dense regions in density-based clustering?
 The density of an object o can be measured by the number of objects close to o.
 DBSCAN (density-based spatial clustering of applications with noise) finds core objects, that is, objects that have dense neighbourhoods.
 Core objects and their neighbourhoods together form dense regions as clusters.
 How DBSCAN finds the neighbourhood of an object:
o A user-specified parameter ε > 0 is used to specify the radius of a neighbourhood
o The ε-neighbourhood of an object o is the space within a radius ε centered at o
o The density of a neighbourhood can be measured by the number of objects in the neighbourhood
o DBSCAN uses another user-specified parameter, MinPts, which specifies the density threshold of dense regions
o An object is a core object if the ε-neighbourhood of the object contains at least MinPts objects (see the code sketch after the example below)
 An object p is directly density-reachable from another object q if and only if q is a core object and p is in the ε-neighbourhood of q
 Two objects p1, p2 are density-connected with respect to ε and MinPts if there is an object q such that both p1 and p2 are density-reachable from q with respect to ε and MinPts
 Consider an example of density-reachability and density-connectivity. Let MinPts = 3.
o Here m, p, o, and r are core objects, since each is in an ε-neighbourhood containing at least 3 points
o Object q is directly density-reachable from m, and m is directly density-reachable from p and vice versa
o q is (indirectly) density-reachable from p, because q is directly density-reachable from m and m is directly density-reachable from p
o p is not density-reachable from q, because q is not a core object
o Similarly, r and s are density-reachable from o, and o is density-reachable from r. Thus, o, r, and s are all density-connected
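A minimal sketch of the two basic tests defined above, the ε-neighbourhood query and the core-object test. The function names are mine, not part of any DBSCAN API, and a brute-force distance computation over a NumPy array of objects is assumed.

import numpy as np

def eps_neighbourhood(X, i, eps):
    # indices of all objects within radius eps of object i (including i itself)
    dist = np.linalg.norm(X - X[i], axis=1)
    return np.where(dist <= eps)[0]

def is_core(X, i, eps, min_pts):
    # an object is a core object if its eps-neighbourhood contains at least MinPts objects
    return len(eps_neighbourhood(X, i, eps)) >= min_pts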
DBSCAN Algorithm

mark all objects as unvisited;
do
    randomly select an unvisited object p;
    mark p as visited;
    if the ε-neighbourhood of p has at least MinPts objects
        create a new cluster C, and add p to C;
        let N be the set of objects in the ε-neighbourhood of p;
        for each point p' in N
            if p' is unvisited
                mark p' as visited;
                if the ε-neighbourhood of p' has at least MinPts points,
                    add those points to N;
            if p' is not yet a member of any cluster, add p' to C;
        end for
        output C;
    else mark p as noise;
until no object is unvisited;
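A minimal, illustrative Python translation of the pseudocode above, assuming NumPy; it uses brute-force O(n²) neighbourhood queries and is meant only to mirror the logic, not to be an efficient implementation.

import numpy as np

def region_query(X, i, eps):
    # the eps-neighbourhood of object i: indices of all objects within radius eps
    return list(np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0])

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)        # -1 = noise / not yet assigned to any cluster
    visited = np.zeros(n, dtype=bool)
    cluster_id = -1

    for p in range(n):             # select an unvisited object p
        if visited[p]:
            continue
        visited[p] = True
        neighbours = region_query(X, p, eps)
        if len(neighbours) < min_pts:
            continue               # mark p as noise (its label stays -1)
        cluster_id += 1            # create a new cluster C and add p to it
        labels[p] = cluster_id
        seeds = list(neighbours)   # N, the set of objects in the eps-neighbourhood of p
        k = 0
        while k < len(seeds):      # for each point p' in N
            q = seeds[k]
            k += 1
            if not visited[q]:
                visited[q] = True
                q_neighbours = region_query(X, q, eps)
                if len(q_neighbours) >= min_pts:
                    seeds.extend(q_neighbours)   # p' is a core object: add its points to N
            if labels[q] == -1:
                labels[q] = cluster_id           # p' not yet in any cluster: add it to C
        # cluster C is now complete: all objects with labels == cluster_id
    return labels

labels = dbscan(np.random.rand(200, 2), eps=0.1, min_pts=5)
print(np.unique(labels))   # cluster ids found, with -1 for noise objects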
THE END
