Density Based Clustering
Algorithm (Expected Features)
A cluster is defined as a connected dense
component.
A cluster grows in a direction along which
density attains its maximum.
The number of clusters is determined from the
input dataset.
Density-based algorithms are capable of
discovering clusters of arbitrary shapes.
Provides a natural protection against outliers.
Usually works best with low-dimensional data.
DBSCAN
Density-based Clustering locates regions of high density
that are separated from one another by regions of low
density.
Density = number of points within a specified radius (Eps)
DBSCAN is a density-based algorithm.
A point is a core point if it has at least a specified number of
points (MinPts) within Eps
These are points that are at the interior of a cluster
A border point has fewer than MinPts within Eps, but is in the
neighborhood of a core point
A noise point is any point that is not a core
point or a border point.
Any two core points that are within a
distance Eps of one another are put in the
same cluster
Any border point that is close enough to a
core point is put in the same cluster as the
core point
Noise points are discarded
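The three point types above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `classify_points` and the sample data are hypothetical, and the sketch assumes the common convention that a point counts toward its own neighborhood and that a core point needs at least MinPts points within Eps.

```python
from math import dist

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' (illustrative helper)."""
    # Eps-neighborhood of every point; a point is counted as its own neighbor.
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_pts for n in neighbors]

    labels = []
    for i in range(len(points)):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            # Not dense itself, but inside the neighborhood of a core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Hypothetical sample data: a small dense square, one fringe point, one outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (5, 5)]
labels = classify_points(points, eps=1.0, min_pts=3)
print(labels)
```

With these assumed values, the four square corners come out as core points, (2, 1) as a border point, and (5, 5) as noise.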
[Figure: example points labeled core, border, and outlier (noise), with Eps = 1 unit and MinPts = 5]
Concepts: ε-Neighborhood
ε-Neighborhood: objects within a radius of ε
from an object (epsilon-neighborhood).
Core object: an object whose ε-Neighborhood
contains at least MinPts objects.
[Figure: ε-Neighborhoods of p and q. With MinPts = 4, p is a core object and q is not a core object.]
Concepts: Reachability
Directly density-reachable
An object q is directly density-reachable from
object p if q is within the ε-Neighborhood of p
and p is a core object.
[Figure: q is directly density-reachable from p. Is p directly density-reachable from q? No, because q is not a core object.]
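The asymmetry of direct density-reachability can be checked with a short sketch. The helper names and the sample points are hypothetical, chosen so that p is a core object and q is not for MinPts = 4; a point is assumed to count toward its own neighborhood.

```python
from math import dist

def eps_neighborhood(points, p, eps):
    """All objects within distance eps of p (p itself included)."""
    return [x for x in points if dist(p, x) <= eps]

def is_core(points, p, eps, min_pts):
    return len(eps_neighborhood(points, p, eps)) >= min_pts

def directly_density_reachable(points, q, p, eps, min_pts):
    """True if q is directly density-reachable from p:
    q lies in the eps-neighborhood of p AND p is a core object."""
    return dist(p, q) <= eps and is_core(points, p, eps, min_pts)

# Hypothetical layout: p has 4 objects within eps (core), q has only 3.
points = [(0, 0), (0, 1), (1, 0), (-1, 0), (2, 0)]
p, q = (0, 0), (1, 0)

forward = directly_density_reachable(points, q, p, eps=1.0, min_pts=4)
backward = directly_density_reachable(points, p, q, eps=1.0, min_pts=4)
print(forward, backward)
```

The distance test is symmetric, so the only way the relation can fail in one direction is through the core-object condition.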
Density-reachable:
An object p is density-reachable from q w.r.t. ε and
MinPts if there is a chain of objects p1, ..., pn, with
p1 = q and pn = p, such that pi+1 is directly density-reachable
from pi w.r.t. ε and MinPts for all 1 <= i < n.
[Figure: q is density-reachable from p, but p is not density-reachable from q. Density-reachability is asymmetric.]
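Density-reachability follows chains of directly density-reachable steps, so it can be sketched as a breadth-first search that only expands core objects. The function name and the sample points below are hypothetical, and a point is assumed to count toward its own neighborhood.

```python
from collections import deque
from math import dist

def density_reachable(points, source, target, eps, min_pts):
    """True if target is density-reachable from source (illustrative sketch)."""
    def neighborhood(p):
        return [x for x in points if dist(p, x) <= eps]

    visited = {source}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        nbrs = neighborhood(current)
        if len(nbrs) < min_pts:
            continue  # not a core object: no chain can continue from here
        for x in nbrs:
            if x == target:
                return True
            if x not in visited:
                visited.add(x)
                queue.append(x)
    return False

# Hypothetical data: five points on a line; the two end points are not core.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]
from_core = density_reachable(points, (1, 0), (4, 0), eps=1.0, min_pts=3)
from_edge = density_reachable(points, (4, 0), (1, 0), eps=1.0, min_pts=3)
print(from_core, from_edge)
```

Here the end point (4, 0) is reached from the core point (1, 0) through the chain (1, 0), (2, 0), (3, 0), but nothing is reachable from (4, 0) because it is not a core object.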
Concepts: Connectivity
Density-connectivity
Object p is density-connected to object q w.r.t.
ε and MinPts if there is an object o such that
both p and q are density-reachable from o
w.r.t. ε and MinPts.
[Figure: p and q are density-connected to each other through r.]
Density-connectivity is symmetric.
Concepts: cluster & noise
Cluster: a cluster C in a set of objects D w.r.t.
ε and MinPts is a non-empty subset of D satisfying:
Maximality: for all p, q, if p ∈ C and q is density-reachable from p w.r.t. ε and MinPts, then also q ∈ C.
Connectivity: for all p, q ∈ C, p is density-connected
to q w.r.t. ε and MinPts in D.
Note: a cluster contains core objects as well as border
objects.
Noise: objects which are not directly density-reachable from any core object.
[Figure: (indirectly) density-reachable objects p, p1, q; density-connected objects q, o]
DBSCAN: The Algorithm
Select a point p.
Retrieve all points density-reachable from p w.r.t. ε and
MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from
p, and DBSCAN visits the next point of the database.
Continue the process until all of the points have been
processed.
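The result is largely independent of the order of processing the points;
only border points that lie within Eps of core points from more than one
cluster may be assigned differently.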
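The steps above can be sketched as follows. This is a minimal O(n²) illustration under stated assumptions, not an optimized or authoritative implementation: the names `dbscan`, `region_query`, and `NOISE` are chosen here for illustration, a point counts toward its own neighborhood, and a core point needs at least MinPts points within ε.

```python
from math import dist

NOISE = -1  # label used for noise points in this sketch

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks noise points."""
    def region_query(i):
        # Indices of all points within eps of points[i] (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None = not yet processed
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later turn out to be a border point
            continue
        # points[i] is a core point: start a new cluster and grow it.
        cluster_id += 1
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is core: keep growing through it
    return labels

# Hypothetical data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1),
          (5, 5),
          (8, 8), (8, 9), (9, 8)]
labels = dbscan(points, eps=1.0, min_pts=3)
print(labels)
```

In this sketch a border point keeps the label of the first cluster that reaches it, which is where the dependence on processing order comes from.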
An Example
[Figure: DBSCAN example with MinPts = 4; cluster C1 is labeled]
How to assign the values of ε and MinPts
For DBSCAN, the parameters ε and MinPts are
needed. The parameters must be specified by the
user.
As a rule of thumb, minPts can be derived from
the number of dimensions D in the data set, as
minPts >= D+1.
MinPts=1 does not make sense, as then every
point on its own will already be a cluster.
With MinPts=2, the result will be the same as that of
hierarchical clustering with the single-link metric,
with the dendrogram cut at height ε.
However, larger values are usually better for
data sets with noise and will yield more
significant clusters.
The value for ε can be chosen by using a minimum
spanning tree (MST) of the data points.
If ε is chosen too small, a large part of the data
will not be clustered, whereas for too high a
value of ε, clusters will merge and the majority of
objects will be in the same cluster.
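The slides do not spell out how the MST is used, so the following is only one possible reading, stated as an assumption: compute the edge lengths of a Euclidean MST and look for a gap between the short edges inside dense regions and the long edges that bridge them. The function name `mst_edge_lengths` and the sample data are hypothetical; the MST is built with Prim's algorithm in O(n²).

```python
from math import dist

def mst_edge_lengths(points):
    """Sorted edge lengths of a Euclidean minimum spanning tree (Prim's algorithm)."""
    n = len(points)
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known edge linking each point to the tree
    if n:
        best[0] = 0.0
    lengths = []
    for step in range(n):
        # Pick the point outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if step > 0:  # the first point starts the tree and adds no edge
            lengths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], dist(points[u], points[v]))
    return sorted(lengths)

# Hypothetical data: two tight groups far apart.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10)]
lengths = mst_edge_lengths(points)
print(lengths)
```

For this data all edges inside the groups have length 1 and a single long edge bridges the groups, so under this assumed heuristic a value of ε slightly above 1 would keep the groups separate, while an ε above the bridging edge length would merge them.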
The Merits of DBSCAN
Algorithm
DBSCAN does not require one to specify the
number of clusters in the data a priori, as
opposed to k-means.
DBSCAN can find arbitrarily shaped clusters. It
can even find a cluster completely surrounded
by (but not connected to) a different cluster. Due
to the MinPts parameter, the so-called single-link
effect (different clusters being connected by a
thin line of points) is reduced.
DBSCAN has a notion of noise.
DBSCAN requires just two parameters and is
mostly insensitive to the ordering of the points in
the database. (However, points sitting on the
edge of two different clusters might swap cluster
membership if the ordering of the points is
changed, so the cluster assignment of such border
points is not unique.)
The De-Merits of DBSCAN
Algorithm
The quality of DBSCAN depends on the distance
measure used in the function regionQuery(P, ε). The
most common distance metric used is Euclidean
distance. Especially for high-dimensional data, this
metric can be rendered almost useless due to the
so-called "curse of dimensionality", making it
difficult to find an appropriate value for ε. This
effect, however, is also present in any other
algorithm based on Euclidean distance.
DBSCAN cannot cluster data sets well
with large differences in densities, since
the MinPts-ε combination cannot then be
chosen appropriately for all clusters.