Density Based Methods
Density-based clustering is one of the most popular unsupervised learning approaches used in
model building and machine learning. It treats clusters as dense regions of data points; points
that lie in the low-density regions separating the clusters are considered noise. The region
within a radius ε of a given object is known as the ε-neighborhood of the object. If the
ε-neighborhood of an object contains at least a minimum number of objects, MinPts, then the
object is called a core object.
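The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, Euclidean distance is assumed, and the object itself is counted as a member of its own ε-neighborhood.

```python
import math

def eps_neighborhood(points, i, eps):
    """Indices of all points within distance eps of points[i] (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def is_core_object(points, i, eps, min_pts):
    """True if the eps-neighborhood of points[i] holds at least min_pts objects."""
    return len(eps_neighborhood(points, i, eps)) >= min_pts

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
print(eps_neighborhood(points, 0, 1.0))    # [0, 1, 2]
print(is_core_object(points, 0, 1.0, 3))   # True
print(is_core_object(points, 3, 1.0, 3))   # False: the point is isolated
```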
Major features:
1. It discovers clusters of arbitrary shape.
2. It handles noise in the data.
3. It is a one-scan method.
4. It needs density parameters as a termination condition.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
Clusters are dense regions in the data space, separated by regions of lower point density.
The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”. The key
idea is that, for each point of a cluster, the neighborhood of a given radius has to contain at
least a minimum number of points.
Parameters Required for DBSCAN Algorithm
1. eps: It defines the neighborhood around a data point, i.e. if the distance between two
points is lower than or equal to ‘eps’, they are considered neighbors. If the eps value is
chosen too small, a large part of the data will be treated as outliers. If it is
chosen very large, the clusters will merge and the majority of the data points will
end up in the same cluster. One way to find the eps value is based on the k-distance graph.
2. MinPts: Minimum number of neighbors (data points) within the eps radius. The larger the
dataset, the larger the value of MinPts that should be chosen. As a general rule, the minimum
MinPts can be derived from the number of dimensions D in the dataset as MinPts >=
D+1. In any case, MinPts should be at least 3.
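The k-distance graph mentioned for eps can be computed as below. This is a minimal sketch assuming Euclidean distance and a brute-force neighbor search; the function name k_distances is hypothetical. The sorted values are what would be plotted, and a sharp jump (the "knee") suggests a candidate eps.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted in ascending order."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
kd = k_distances(points, k=2)
print(kd)  # four values of 1.0 followed by one value above 13
```

In this toy data the jump from 1.0 to more than 13 separates the dense square from the isolated point, so an eps slightly above 1.0 would be a reasonable starting choice.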
In this algorithm, there are three types of data points.
Core Point: A point is a core point if it has at least MinPts points within eps.
Border Point: A point that has fewer than MinPts points within eps but lies in the neighborhood
of a core point.
Noise or outlier: A point that is neither a core point nor a border point.
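The three point types can be told apart with a short routine. The sketch below assumes Euclidean distance and assumes a point counts as core when its eps-neighborhood, including the point itself, holds at least MinPts points; classify_points is a hypothetical name.

```python
import math

def classify_points(points, eps, min_pts):
    """Label every point 'core', 'border', or 'noise'."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors[i]):
            labels.append("border")   # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels

points = [(0, 0), (0, 1), (1, 0), (2, 0), (8, 8)]
labels = classify_points(points, eps=1.0, min_pts=3)
print(labels)  # ['core', 'border', 'core', 'border', 'noise']
```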
Steps used in DBSCAN Algorithm
1. Find all the neighbor points within eps of every point and identify the core points,
i.e. the points with at least MinPts neighbors.
2. For each core point if it is not already assigned to a cluster, create a new cluster.
3. Find recursively all its density-connected points and assign them to the same cluster
as the core point.
Two points a and b are said to be density-connected if there exists a core point c (a
point with a sufficient number of points in its neighborhood) from which both a and b
can be reached through a chain of core points, each within eps distance of the next.
This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a
neighbor of e, which in turn is a neighbor of a, then b is density-connected to a,
provided the intermediate points c, d and e are core points.
4. Iterate through the remaining unvisited points in the dataset. Those points that do not
belong to any cluster are noise.
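The steps above can be sketched as a compact, brute-force implementation. It is an illustration under stated assumptions rather than a reference implementation: Euclidean distance, a full O(n^2) neighbor search, a queue in place of recursion, and the label -1 for noise are all choices made here.

```python
import math
from collections import deque

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks points in no cluster."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n          # None means "not assigned yet"
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None or len(neighbors[i]) < min_pts:
            continue             # already assigned, or not a core point
        labels[i] = cluster_id   # start a new cluster at core point i
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] is None:
                labels[j] = cluster_id
                if len(neighbors[j]) >= min_pts:   # only core points expand the cluster
                    queue.extend(neighbors[j])
        cluster_id += 1
    return [NOISE if lab is None else lab for lab in labels]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

A border point that lies near two clusters is assigned to whichever cluster reaches it first, so its label can depend on the order of the points.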
Ordering Points to Identify the Clustering Structure (OPTICS)
OPTICS (Ordering Points to Identify the Clustering Structure) is a density-based clustering
algorithm that can extract clusters of varying densities and shapes. It is useful for identifying
clusters of different densities in large, high-dimensional datasets. The main idea behind
OPTICS is to extract the clustering structure of a dataset by identifying the density-connected
points. The algorithm builds a density-based representation of the data by creating an ordered
list of points, which is visualized as the reachability plot. Each point in the list is associated
with a reachability distance, which is a measure of how easy it is to reach that point from other
points in the dataset. Points that are adjacent in the ordering and have similar, low reachability
distances are likely to be in the same cluster.
OPTICS Algorithm
1. Define a density threshold parameter, Eps, which controls the minimum density of
clusters.
2. For each point in the dataset, calculate the distance to its k-nearest neighbors.
3. Starting with an arbitrary point, calculate the reachability distance of each point in the
dataset, based on the density of its neighbors.
4. Order the points by always processing next the unprocessed point with the smallest
reachability distance, and create the reachability plot from this ordering.
5. Extract clusters from the reachability plot by grouping points that are close to each
other in the ordering and have similar reachability distances.
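A simplified version of this procedure is sketched below. It is not a full OPTICS implementation: it assumes Euclidean distance, precomputes all pairwise distances, defaults Eps to infinity, counts a point as one of its own min_samples neighbors, and stops at the reachability plot without extracting clusters. The names optics_order and core_distance are hypothetical.

```python
import heapq
import math

def optics_order(points, min_samples, eps=math.inf):
    """Return (processing order, reachability distance of each point in that order)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    def core_distance(i):
        # Distance to the min_samples-th nearest point, counting i itself.
        d = sorted(dist[i])[min_samples - 1]
        return d if d <= eps else None    # None: i is not a core point

    reach = [math.inf] * n
    processed = [False] * n
    order = []
    for start in range(n):
        if processed[start]:
            continue
        heap = [(math.inf, start)]        # min-heap keyed on reachability distance
        while heap:
            _, i = heapq.heappop(heap)
            if processed[i]:
                continue                  # stale heap entry
            processed[i] = True
            order.append(i)
            cd = core_distance(i)
            if cd is None:
                continue
            for j in range(n):
                if processed[j] or dist[i][j] > eps:
                    continue
                new_reach = max(cd, dist[i][j])
                if new_reach < reach[j]:
                    reach[j] = new_reach
                    heapq.heappush(heap, (new_reach, j))
    return order, [reach[i] for i in order]

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
order, plot = optics_order(points, min_samples=2)
print(order)  # [0, 1, 2, 3, 4, 5]
print(plot)   # [inf, 1.0, 1.0, 8.0, 1.0, 1.0]
```

In the printed plot, the two runs of low values are the two groups of points, and the peak of 8.0 marks the gap between them.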
OPTICS takes several parameters, including the minimum density threshold (Eps), the number of
nearest neighbors to consider (min_samples), and a reachability distance cutoff. It also adds
two distance definitions to the concepts used in DBSCAN:
1. Core Distance: It is the minimum radius required to classify a given point as
a core point. If the given point is not a core point, then its core distance is undefined.
2. Reachability Distance: It is defined with respect to another data point. The
reachability distance of a point q with respect to a point p is the maximum of the core
distance of p and the Euclidean distance (or some other distance metric) between p and q.
It is undefined if p is not a core point.
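The two definitions translate directly into code. The sketch below assumes Euclidean distance, treats a point as one of its own min_samples neighbors, and returns None where the text says "undefined"; both function names are hypothetical.

```python
import math

def core_distance(points, p, min_samples, eps=math.inf):
    """Smallest radius that makes points[p] a core point, or None if it exceeds eps."""
    dists = sorted(math.dist(points[p], q) for q in points)  # includes 0.0 for p itself
    d = dists[min_samples - 1]
    return d if d <= eps else None

def reachability_distance(points, q, p, min_samples, eps=math.inf):
    """Reachability distance of points[q] with respect to points[p]."""
    cd = core_distance(points, p, min_samples, eps)
    if cd is None:
        return None                       # undefined: p is not a core point
    return max(cd, math.dist(points[p], points[q]))

points = [(0, 0), (1, 0), (0, 2), (6, 0)]
print(core_distance(points, 0, min_samples=3))             # 2.0
print(reachability_distance(points, 1, 0, min_samples=3))  # 2.0 (core distance dominates)
print(reachability_distance(points, 3, 0, min_samples=3))  # 6.0 (actual distance dominates)
print(core_distance(points, 0, min_samples=3, eps=1.5))    # None
```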
Discussion about OPTICS and DBSCAN Clustering:
1. Memory Cost: The OPTICS clustering technique requires more memory, as it
maintains a priority queue (min-heap) to determine the next data point, which is the one
closest to the point currently being processed in terms of reachability distance. It also
requires more computational power because the nearest neighbor queries are more complicated
than the radius queries in DBSCAN.
2. Fewer Parameters: The OPTICS clustering technique does not strictly need the
epsilon parameter; it is supplied in the steps above only to reduce the time taken.
This reduces the effort spent on parameter tuning. The technique does not itself
segregate the given data into clusters. It merely produces a reachability
distance plot, and it is up to the programmer to interpret the plot and cluster the points
accordingly.
3. Handling varying densities: DBSCAN clustering can struggle to handle datasets with
varying densities, as it requires a single value of epsilon to define the neighborhood
size for all points. In contrast, OPTICS can handle varying densities by using the
concept of reachability distance, which adapts to the local density of the data. This
means that OPTICS can identify clusters of different sizes and shapes more effectively
than DBSCAN in datasets with varying densities.
4. Cluster extraction: While both OPTICS and DBSCAN can identify clusters, OPTICS
produces a reachability distance plot that can be used to extract clusters at different
levels of granularity. This allows for more flexible clustering and can reveal clusters
that may not be apparent with a fixed epsilon value in DBSCAN. However, this also
requires more manual interpretation and decision-making on the part of the
programmer.
5. Noise handling: DBSCAN explicitly distinguishes between core points, border
points, and noise points, while OPTICS does not explicitly identify noise points.
Instead, points with high reachability distances can be considered potential noise
points. However, this also means that OPTICS may be less effective at identifying
small clusters that are surrounded by noise points, as these clusters may be merged
with the noise points in the reachability distance plot.
6. Runtime complexity: The runtime complexity of OPTICS is generally higher than
that of DBSCAN, due to the use of a priority queue to maintain the reachability
distances. However, recent research has proposed optimizations to reduce the
computational complexity of OPTICS, making it more scalable for large datasets.
3.2.7 Grid Based Methods
The grid-based clustering method uses a multi-resolution grid data structure. It
quantizes the object space into a finite number of cells, which form a grid structure
on which all the clustering operations are performed. The main advantage of this method is
its quick processing time, which is generally independent of the number of data objects and
depends only on the number of cells in each dimension of the quantized space.
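The quantization step can be illustrated as follows. This is a minimal sketch that assumes a known bounding box and the same number of cells in every dimension; the function name quantize and its parameters are hypothetical.

```python
from collections import Counter

def quantize(points, lower, upper, cells_per_dim):
    """Map each point to the index tuple of its grid cell and count the points per cell."""
    counts = Counter()
    for p in points:
        cell = []
        for x, lo, hi in zip(p, lower, upper):
            idx = int((x - lo) / (hi - lo) * cells_per_dim)
            cell.append(min(max(idx, 0), cells_per_dim - 1))  # keep boundary values inside the grid
        counts[tuple(cell)] += 1
    return counts

points = [(0.1, 0.2), (0.15, 0.22), (0.9, 0.95), (0.5, 0.5)]
grid = quantize(points, lower=(0.0, 0.0), upper=(1.0, 1.0), cells_per_dim=4)
print(dict(grid))  # {(0, 0): 2, (3, 3): 1, (2, 2): 1}
```

Once the counts are stored, later operations work on at most cells_per_dim ** d cells, however many points were read, which is where the quick processing time comes from.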
Examples of the grid-based approach include STING, which explores statistical
information stored in the grid cells; WaveCluster, which clusters objects using a wavelet
transform; and CLIQUE, which is a grid- and density-based approach for
clustering in high-dimensional data space.
STING (Statistical Information Grid)
STING is a grid-based clustering technique. It uses a multidimensional grid data
structure that quantizes the space into a finite number of cells. The main focus of this
technique is the value space surrounding the data points. The spatial area is divided into
rectangular cells, with several levels of cells corresponding to different levels of
resolution. Each cell at a high level is partitioned into several cells at the next lower level.
STING stores statistical information about the attributes in each cell, such as the mean,
maximum, and minimum values, which is precomputed and stored as statistical parameters.
These statistical parameters are useful for query processing and other data analysis tasks.
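The sketch below illustrates how such parameters might be precomputed for bottom-level cells and then combined for a higher-level cell without rereading the data. The dictionary layout and the function names are assumptions made for this example, and every bottom-level cell is assumed to be non-empty.

```python
def leaf_stats(values):
    """Statistical parameters stored for one bottom-level cell."""
    return {
        "count": len(values),
        "mean": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }

def merge_stats(children):
    """Parameters of a higher-level cell, derived only from its child cells."""
    n = sum(c["count"] for c in children)
    return {
        "count": n,
        "mean": sum(c["mean"] * c["count"] for c in children) / n,  # count-weighted mean
        "min": min(c["min"] for c in children),
        "max": max(c["max"] for c in children),
    }

leaves = [leaf_stats(v) for v in ([1.0, 3.0], [5.0], [2.0, 2.0, 2.0], [10.0, 20.0])]
parent = merge_stats(leaves)
print(parent)  # {'count': 8, 'mean': 5.625, 'min': 1.0, 'max': 20.0}
```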
Steps:
Step 1: First, determine a layer at which to begin the process.
Step 2: For each cell of this layer, calculate the confidence interval or estimated probability
range that this cell is relevant to the query.
Step 3: Then, label the cell as relevant or irrelevant based on the interval calculated.
Step 4: If the layer is the bottom layer, go to Step 6; otherwise, go to Step 5.
Step 5: Go down the hierarchy structure by one level. Go to Step 2 for those cells that
form the relevant cells of the higher-level layer.
Step 6: If the specification required for the query is met, go to Step 8;
otherwise, go to Step 7.
Step 7: Retrieve the data that fall into the relevant cells and do further processing. Return
the result that meets the requirement of the query. Go to Step 9.
Step 8: Find the regions of relevant cells. Return those regions that meet the query's
requirements. Go to Step 9.
Step 9: Stop or terminate.
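The top-down part of these steps (Steps 1 to 5) can be sketched as a recursive traversal. This is a deliberately simplified illustration: the confidence-interval test of Step 2 is replaced by a deterministic check on the stored maximum, the query is assumed to be "find cells that may contain values of at least a threshold", and the nested-dictionary layout is hypothetical.

```python
def relevant_cells(cell, threshold, path=()):
    """Return the paths of bottom-level cells that may contain a value >= threshold."""
    if cell["max"] < threshold:
        return []                      # label the cell irrelevant and prune its subtree
    if not cell["children"]:
        return [path]                  # bottom layer reached: the cell is relevant
    found = []
    for k, child in enumerate(cell["children"]):
        # Descend one level, but only below relevant cells.
        found.extend(relevant_cells(child, threshold, path + (k,)))
    return found

root = {
    "max": 20.0,
    "children": [
        {"max": 5.0, "children": [
            {"max": 3.0, "children": []},
            {"max": 5.0, "children": []},
        ]},
        {"max": 20.0, "children": [
            {"max": 20.0, "children": []},
            {"max": 2.0, "children": []},
        ]},
    ],
}
print(relevant_cells(root, threshold=10.0))  # [(1, 0)]
```

Because the first child of the root is labeled irrelevant, neither of its two sub-cells is examined, which is the saving the hierarchy is meant to provide.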
CLIQUE Algorithm
The CLIQUE algorithm is a subspace clustering algorithm that combines density-based and
grid-based techniques. It finds clusters by taking a density threshold and a number of grid
intervals as input parameters. It is specially designed to handle datasets with a large number
of dimensions. CLIQUE scales well with the number of records and the number
of dimensions in the dataset because it is grid-based and uses the Apriori property effectively.
Working of CLIQUE Algorithm
The CLIQUE algorithm first divides the data space into grids. It does this by dividing each
dimension into equal intervals called units. After that, it identifies dense units. A unit is dense
if the number of data points in it exceeds the threshold value.
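As an illustration of this first pass, the sketch below finds the dense one-dimensional units. It assumes a known value range per dimension and the same number of intervals in every dimension; the function name dense_units_1d is hypothetical.

```python
from collections import Counter

def dense_units_1d(points, lower, upper, intervals, threshold):
    """Return {(dimension, interval index): count} for units with more than threshold points."""
    counts = Counter()
    for p in points:
        for d, x in enumerate(p):
            idx = int((x - lower[d]) / (upper[d] - lower[d]) * intervals)
            counts[(d, min(idx, intervals - 1))] += 1   # upper boundary goes to the last unit
    return {unit: c for unit, c in counts.items() if c > threshold}

points = [(0.1, 0.1), (0.2, 0.15), (0.15, 0.8), (0.9, 0.85)]
dense = dense_units_1d(points, lower=(0, 0), upper=(1, 1), intervals=2, threshold=1)
print(dense)  # {(0, 0): 3, (1, 0): 2, (1, 1): 2}
```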
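Once the algorithm finds dense cells along one dimension, it tries to find dense
cells along two dimensions, and it continues in this way until the dense cells in all
subspaces have been found. By the Apriori property, a cell can be dense in k dimensions only
if its projections onto k-1 dimensions are dense, so candidates are built only from dense
lower-dimensional cells.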
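The step from one to two dimensions can be sketched as below. The example assumes the points have already been mapped to interval indices and that the dense one-dimensional units are known; the function name and data layout are hypothetical. Candidates are formed only from pairs of dense one-dimensional units, since a two-dimensional cell cannot be dense unless both of its one-dimensional projections are.

```python
from itertools import combinations

def dense_units_2d(cells, dense_1d, threshold):
    """cells: interval-index tuple per point; dense_1d: set of (dimension, index) units."""
    # Candidate 2-D units: pairs of dense 1-D units taken from different dimensions.
    candidates = [
        (u, v) for u, v in combinations(sorted(dense_1d), 2) if u[0] != v[0]
    ]
    counts = dict.fromkeys(candidates, 0)
    for cell in cells:
        for (d1, i1), (d2, i2) in candidates:
            if cell[d1] == i1 and cell[d2] == i2:
                counts[((d1, i1), (d2, i2))] += 1
    return {unit: c for unit, c in counts.items() if c > threshold}

cells = [(0, 0), (0, 0), (0, 1), (1, 1)]          # four points on a 2 x 2 grid
dense_1d = {(0, 0), (1, 0), (1, 1)}               # unit (0, 1) holds a single point
dense_2d = dense_units_2d(cells, dense_1d, threshold=1)
print(dense_2d)  # {((0, 0), (1, 0)): 2}
```

Only two of the four possible two-dimensional cells are ever counted here, because the other two involve the non-dense unit (0, 1).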
After finding all dense cells in all dimensions, the algorithm proceeds to find the maximal sets
(“clusters”) of connected dense cells.
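Grouping connected dense cells amounts to finding connected components. The sketch below assumes that two cells of the same subspace are connected when they share a face, i.e. their index tuples differ by one in exactly one dimension; the function name is hypothetical.

```python
def connected_clusters(dense_cells):
    """Group dense cells (index tuples within one subspace) into face-connected sets."""
    remaining = set(dense_cells)
    clusters = []
    while remaining:
        stack = [remaining.pop()]
        cluster = set(stack)
        while stack:
            cell = stack.pop()
            for d in range(len(cell)):
                for step in (-1, 1):
                    neighbor = cell[:d] + (cell[d] + step,) + cell[d + 1:]
                    if neighbor in remaining:
                        remaining.remove(neighbor)
                        cluster.add(neighbor)
                        stack.append(neighbor)
        clusters.append(cluster)
    return clusters

clusters = connected_clusters([(0, 0), (0, 1), (1, 1), (4, 4)])
print(sorted(sorted(c) for c in clusters))  # [[(0, 0), (0, 1), (1, 1)], [(4, 4)]]
```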
Clusters are generated in this way from all dense subspaces found using the Apriori approach.
Finally, the CLIQUE algorithm generates a minimal description of each cluster.