RFC: Introduce DBCV cluster validity metric #28243

Closed
@luis261

Description

Dear scikit-learn core developers/maintainers,

I am opening this issue to make the case for the inclusion of DBCV as a scikit-learn cluster metric. I went ahead and tried to address possible concerns (according to your contribution guidelines) upfront (see below). The effort required to get the actual functionality up and running should be manageable if we opt for transferring the version of DBCV that ships with scikit-learn-contrib/HDBSCAN, which is what I'd suggest (for technical details see #28244). Looking forward to your feedback!

DBCV is the name of a clustering validity metric defined in this paper (direct link to a PDF hosted on a German university server; alternative ref). It stands for Density-Based Clustering Validation and is characterized by not relying on cluster centroids in its validity calculations. Instead, it evaluates the quality of clusters based on the concepts of density separation (the highest-density area between clusters) and density sparseness (the lowest-density area within a cluster), built on the underlying construct of minimum spanning trees, much like the HDBSCAN clustering algorithm. (The three authors of the original 2013 HDBSCAN paper also (co-)authored the DBCV paper linked above.)
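
To give a rough idea of what computing DBCV looks like in practice, here is a minimal sketch using the validity_index function from the scikit-learn-contrib/hdbscan package mentioned above (assuming hdbscan is installed; the dataset and parameter values are made up purely for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# validity_index is the DBCV implementation shipped with
# scikit-learn-contrib/hdbscan (`pip install hdbscan`)
from hdbscan.validity import validity_index

# toy dataset with two elongated, non-spherical clusters
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)
labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

# DBCV ranges from -1 (worst) to 1 (best); the hdbscan implementation
# expects double precision input
score = validity_index(X.astype(np.float64), labels)
print(f"DBCV: {score:.3f}")
```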

I'd argue DBCV meets the algorithm inclusion criteria (https://scikit-learn.org/dev/faq.html#new-algorithms-inclusion-criteria)

"at least 3 years since publication"

  • the paper was published in 2014

"200+ citations"

  • according to Google Scholar, this criterion is met as well:
    (screenshot of the Google Scholar citation count)

I believe I have thoroughly addressed the further inclusion criteria (see below). If my proposal is not up to par in that regard in any way, please let me know.

"The contributor should support the importance of the proposed addition with research papers and/or implementations in other similar packages, demonstrate its usefulness via common use-cases/applications and corroborate performance improvements, if any, with benchmarks and/or plots. It is expected that the proposed algorithm should outperform the methods that are already implemented in scikit-learn at least in some areas."

scikit-learn currently provides the following clustering validity metrics that do not require ground-truth labels (source, documentation):

  • Silhouette Coefficient
  • Calinski-Harabasz index
  • Davies-Bouldin index

All these metrics implicitly assume clusters with spherical shapes based on their mathematical formulations.

scikit-learn ships additional metrics for the purpose of clustering evaluation. However, these all require knowledge of the underlying ground truth, which makes them irrelevant for practical purposes such as driving optimization processes: when working with non-synthetic, real-world data the true labels are not available - hence the need for clustering in the first place.
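
To make that distinction concrete, here is a small, hypothetical sketch contrasting the two kinds of metrics: the internal ones need only the data and the predicted labels, while a supervised metric like the adjusted Rand index additionally needs the ground truth (dataset and parameters below are arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, davies_bouldin_score, silhouette_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
y_pred = KMeans(n_clusters=3, random_state=0).fit_predict(X)

# internal metrics: only the data and the predicted labels are needed
print(silhouette_score(X, y_pred))
print(davies_bouldin_score(X, y_pred))

# supervised metric: requires the ground truth, which real-world
# clustering tasks by definition do not have
print(adjusted_rand_score(y_true, y_pred))
```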

When trying to optimize the hyperparameters of a clustering algorithm for a given dataset, it does not make sense to drive the optimization process with a metric that does not align with (or is even opposed to) the algorithm's internal, relative metric - its implicit notion of what a cluster is and thus of what makes a cluster "better" or "worse".
That notion materializes from the manner in which the clustering algorithm operates, e.g.:

  • the way KMeans iterates based on quadratic distances from cluster centers
  • DBSCAN assigning noise labels based on proximity to other points

If one is not careful about which metric is paired with which algorithm (or how it is applied), the underlying mathematical constructs can "clash" in a way that leads to subpar or nonsensical results, e.g. applying SWC (the Silhouette Width Criterion) to DBSCAN output on a dataset with elongated clusters such as the one shown below, or without filtering out noise first; a sketch of that second failure mode follows.
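
Here is a minimal, illustrative sketch of the noise-related failure mode (dataset and parameter values are made up): scoring raw DBSCAN output, where noise is treated as a regular "cluster", versus scoring only the points that were actually assigned to a cluster:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

# elongated clusters; DBSCAN marks outliers as noise (label -1)
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.1, min_samples=10).fit_predict(X)

# naive: noise points are scored as if they formed a regular cluster
print("raw labels:     ", silhouette_score(X, labels))

# more sensible: only score points actually assigned to a cluster
mask = labels != -1
print("noise filtered: ", silhouette_score(X[mask], labels[mask]))
```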

Since there is currently no density-based metric, there is no suitable option available when trying to optimize a scikit-learn clustering method with scikit-learn metrics. Including DBCV would address that.

Below, I have included an example that illustrates the shortcomings of centroid-based metrics when applied to density-based clusters such as those emitted by established algorithms like DBSCAN, HDBSCAN or OPTICS. For details regarding the example, please have a look at this notebook.

If you're looking for something more thorough/empirical, I suggest having a look at this repo: https://github.com/luis261/clustering-validity-comparison - (performance benchmark in the sense of measuring outcome quality, not runtime performance)

The plot below shows a synthetic dataset containing two elongated clusters with some background noise.
The points were labeled based on the underlying ground truth; just for the sake of argument, you can imagine the labelling being the result of running a density-based clustering algorithm instead.
Afterwards, the Silhouette Coefficient is computed for all non-noise points according to that labelling, and a color palette is applied based on the score of each point.
The result is a visualization showing the high internal disparity of the scores within each cluster, depending on each point's distance to the other cluster's centroid. This incongruity can arise whenever clusters emitted by an algorithm are judged by a metric ill-suited to the algorithm/dataset, and it will affect the optimization process, with outputs of arguably higher quality losing against alternatives that are less "ideal".

https://gist.github.com/luis261/5ad7f13d8b8ebe6762e424cc1d430f30
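
For context, here is a rough sketch of how such a per-point visualization could be produced with scikit-learn's silhouette_samples (the actual data and code live in the linked notebook/gist; the toy dataset below merely stands in for them):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_samples

# stand-in for the dataset used in the linked gist/notebook:
# two elongated clusters, labeled via the known ground truth
X, labels = make_moons(n_samples=600, noise=0.06, random_state=1)

# per-point Silhouette Coefficients, used as the color dimension
scores = silhouette_samples(X, labels)

plt.scatter(X[:, 0], X[:, 1], c=scores, cmap="viridis", s=15)
plt.colorbar(label="Silhouette Coefficient")
plt.title("Per-point silhouette scores on two elongated clusters")
plt.show()
```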
