ENH Introduces `mst_algorithm` keyword for HDBSCAN, alongside two new Boruvka MST algorithms #27572

Micky774 · 2023-10-11T18:27:22Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Introduces mst_algorithm
Adds Boruvka algorithm (both exact/inexact)
Streamlines algorithm selection logic for "auto" options
Improves tests to account for new MST algorithms
Provides a "warn" option for mst_algorithm as a backwards-compatible default which will provide a FutureWarning to users, encouraging them to opt-in to using mst_algorithm="auto"
Includes a deprecation for mst_algorithm="warn"
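
The `"warn"` default described above follows a common deprecation pattern: a sentinel default emits a `FutureWarning` and keeps the old behavior, while any explicit value opts in silently. Below is a minimal sketch in plain Python. The helper name `resolve_mst_algorithm` is hypothetical, and treating `"prims"` as the legacy choice is an assumption made for illustration; the PR's actual logic lives in `HDBSCAN.fit` and may differ.

```python
import warnings


def resolve_mst_algorithm(mst_algorithm="warn"):
    """Hypothetical helper: map the user-facing value to the one actually used."""
    if mst_algorithm == "warn":
        # Sentinel default: keep the old behavior, but tell users it will change.
        warnings.warn(
            "The default value of `mst_algorithm` will change; pass "
            "mst_algorithm='auto' to opt in and silence this warning.",
            FutureWarning,
        )
        return "prims"  # assumed legacy choice, for illustration only
    return mst_algorithm


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = resolve_mst_algorithm()          # warns, then falls back
    opted_in = resolve_mst_algorithm("auto")  # explicit choice, no warning
```

Only the call that relies on the default ends up in `caught`, which is how tests can check that explicit values avoid the warning.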

Any other comments?

Apologies for the gross commit log, the vast majority are a symptom of me keeping this branch open in parallel with mainstream HDBSCAN efforts in order to prevent it from getting "too out of sync". Not sure how to rectify history well here.

Benchmarks coming soon.

- Added support for `n_features_in_`
- Improved validation and added support for `feature_names_in_`
- Renamed `kwargs` to `metric_params` and added safety check for an empty dict
- Removed attributes set in init and deferred to properties
- Raised error if tree query is performed with too few samples
- Cleaned up some list/dict comprehension logic

- Removed internal minkowski metric parameter validation in favor of `sklearn.metrics` built-in handling
- Removed default argument and presence of `p` in hdbscan functions
- Now users must pass `p` in through `metric_params`, consistent w/ other metrics such as `wminkowski` and `mahalanobis`
- Removed vestigial estimator check -- now supported via common tests
- Fixed bug where `boruvka_kdtree` algorithm's accepted metrics were based off of `BallTree` not `KDTree`
- Cleaned up lines with unused returns by indexing output of `hdbscan`
- Greatly expanded scope of algorithm/metric compatibility tests
- Streamlined some other tests
- Deleted commented out tests

github-actions · 2023-10-11T18:28:37Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: fedeb90. Link to the linter CI: here

Micky774 · 2023-10-12T13:39:28Z

sklearn/cluster/_hdbscan/_boruvka.pyx

+    return metric._rdist_to_dist(rdist)
+
+
+cdef class BoruvkaUnionFind:


Note for reviewers: I wasn't sure of a "nice" way to reuse the UnionFind structure from _hierarchical_fast since this one explicitly keeps track of initial nodes as components without leveraging virtual nodes created through union operations, and thus the UnionFind structure would need to be modified with non-trivial additions. I think it is better to leave them as two similar-yet-different structures. Open to suggestions though!

jjerphan

Thank you a lot, @Micky774.

Here is a first review on everything but sklearn/cluster/_hdbscan/_boruvka.pyx.

jjerphan · 2023-10-14T07:16:19Z

sklearn/utils/_typedefs.pxd

@@ -14,6 +14,7 @@
 # TODO: Stop defining custom types locally or globally like DTYPE_t and friends and
 # use these consistently throughout the codebase.
 # NOTE: Extend this list as needed when converting more cython extensions.
+ctypedef char int8_t


Where is this used exactly?

jjerphan · 2023-10-14T07:42:23Z

doc/whats_new/v1.4.rst

@@ -217,6 +217,19 @@ Changelog
  `kdtree` and `balltree` values will be removed in 1.6.
  :pr:`26744` by :user:`Shreesha Kumar Bhat <Shreesha3112>`.

+- |Enhancement| : The `mst_algorithm` argument is introduced, allowing for the user to


Suggested change

-- |Enhancement| : The `mst_algorithm` argument is introduced, allowing for the user to
+- |Enhancement| :class:`cluster.HDBSCAN` has a new `mst_algorithm` argument, allowing for the user to

jjerphan · 2023-10-28T09:22:16Z

doc/whats_new/v1.4.rst

+  This instead introduces `"boruvka_exact", "boruvka_approx"` which are both faster
+  MST building algorithms than the current `"prims"`.


Suggested change

-  This instead introduces `"boruvka_exact", "boruvka_approx"` which are both faster
-  MST building algorithms than the current `"prims"`.
+  This instead introduces `"boruvka_exact"` and `"boruvka_approx"` which are both
+  faster MST building algorithms than the current `"prims"`.

jjerphan · 2023-10-28T09:27:18Z

sklearn/cluster/_hdbscan/_boruvka.pyx

+    return metric._rdist_to_dist(rdist)
+
+
+cdef class BoruvkaUnionFind:


I think it is OK to maintain two structures for UnionFind rather than to try factorizing their logic.
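
For readers unfamiliar with the distinction drawn in the note above, here is a minimal pure-Python sketch of a union-find that keeps every component root among the initial nodes and tracks them with a flag array. The class name `ComponentUnionFind` and its layout are hypothetical (only `is_component` and `components` echo names visible in the diff); it is not the `BoruvkaUnionFind` from this PR.

```python
class ComponentUnionFind:
    """Union-find over n initial nodes; roots are always original nodes."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n
        self.is_component = [True] * n  # True while the node is a root

    def find(self, x):
        # Path halving: point each visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.is_component[rb] = False  # rb stops being a component root
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return ra

    def components(self):
        """Return the ids of all current component roots."""
        return [i for i, flag in enumerate(self.is_component) if flag]


uf = ComponentUnionFind(5)
uf.union(0, 1)
uf.union(3, 4)
roots = uf.components()
```

No new node is created by `union`, so the ids returned by `components()` can index directly into per-point arrays.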

jjerphan · 2023-10-28T09:33:50Z

sklearn/cluster/_hdbscan/hdbscan.py

                kwargs["leaf_size"] = self.leaf_size
+                # We prefer KDTree unless otherwise specified


Can you document the reason behind this choice?

jjerphan · 2023-10-28T09:58:11Z

sklearn/cluster/_hdbscan/hdbscan.py

-                # Metric is a valid BallTree metric
-                mst_func = _hdbscan_prims
-                kwargs["algo"] = "ball_tree"
+                # Boruvka is always preferable


Why?

Suggested change

-                # Boruvka is always preferable
+                # The exact Boruvka algorithm is always preferable.

jjerphan · 2023-10-28T10:00:06Z

sklearn/cluster/_hdbscan/hdbscan.py

                kwargs["leaf_size"] = self.leaf_size
+                kwargs["algo"] = (
+                    "kd_tree" if self.metric in KDTree.valid_metrics else "ball_tree"


Are we sure that the union of {KDTree,BallTree}.valid_metrics overlaps the entire set of values for self.metric?
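
To make the concern concrete: with a two-way conditional, a metric that is in neither set silently ends up on the `ball_tree` path. The sketch below uses made-up metric sets (they are illustrative subsets, not the real `valid_metrics` lists) and a hypothetical `choose_tree` helper that makes the third case explicit.

```python
# Illustrative subsets only; the real lists come from KDTree and BallTree.
KD_TREE_METRICS = {"euclidean", "manhattan", "chebyshev", "minkowski"}
BALL_TREE_METRICS = KD_TREE_METRICS | {"haversine", "hamming"}


def choose_tree(metric):
    """Hypothetical dispatch: prefer kd_tree, then ball_tree, else fail loudly."""
    if metric in KD_TREE_METRICS:
        return "kd_tree"
    if metric in BALL_TREE_METRICS:
        return "ball_tree"
    raise ValueError(f"{metric!r} is not supported by either tree")
```

Whether the error branch is reachable in `fit` depends on the validation performed earlier, which is exactly what the question above asks.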

jjerphan · 2023-10-28T10:04:23Z

sklearn/cluster/tests/test_hdbscan.py

+
+EXACT_MST_ALGORITHMS = {"prims", "boruvka_exact"}
+MST_ALGORITHMS = {"boruvka_approx"} | EXACT_MST_ALGORITHMS | BRUTE_COMPATIBLE
+


I guess we might want to sort them once and for all. What do you think?

Suggested change

+# Converting sets into sorted list to have a reproducible set suite
+# and to prevent errors with test dispatch to pytest-xdist workers.
+BRUTE_COMPATIBLE = sorted(BRUTE_COMPATIBLE)
+ALGORITHMS = sorted(ALGORITHMS)
+EXACT_MST_ALGORITHMS = sorted(EXACT_MST_ALGORITHMS)
+MST_ALGORITHMS = sorted(MST_ALGORITHMS)
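
The motivation, as far as I understand it: iteration order over a set of strings can change between processes because of string hash randomization, so workers that parametrize tests from a set may collect them in different orders. A small sketch of the fix, leaving out `BRUTE_COMPATIBLE` because its contents are not shown in this thread:

```python
EXACT_MST_ALGORITHMS = {"prims", "boruvka_exact"}
MST_ALGORITHMS = {"boruvka_approx"} | EXACT_MST_ALGORITHMS

# sorted() returns a list whose order depends only on the values,
# not on the per-process string hash seed.
MST_ALGORITHMS = sorted(MST_ALGORITHMS)
```

The resulting list can be passed to `pytest.mark.parametrize` and yields the same test ids in every process.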

jjerphan · 2023-10-28T10:09:51Z

sklearn/cluster/_hdbscan/hdbscan.py

@@ -811,40 +905,55 @@ def fit(self, X, y=None):
                " Please select a different metric."
            )

-        if self.algorithm != "auto":
+        if algorithms != {"auto"}:


I think I would rather treat the default mst_algorithm == algorithm == "auto" case first and then branch on the remaining cases.

What do you think?

jjerphan · 2023-10-28T10:14:55Z

sklearn/cluster/tests/test_hdbscan.py

+def _validate_algorithms(algorithm, mst_algorithm):
+    algos = {algorithm, mst_algorithm}
+    if "brute" in algos and not algos.issubset(BRUTE_COMPATIBLE):
+        pytest.xfail("Incompatible algorithm configuration")


Suggested change

-        pytest.xfail("Incompatible algorithm configuration")
+        pytest.xfail(f"Incompatible algorithm configuration: {(algorithm, mst_algorithm)=}")
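
The suggested message relies on the `=` specifier of f-strings (Python 3.8+), which prints the expression text followed by the `repr` of its value. A quick illustration with arbitrary values (not a claim about which combinations are actually incompatible):

```python
algorithm, mst_algorithm = "brute", "boruvka_exact"

# The `=` specifier echoes the expression itself, then `=`, then its repr.
message = f"Incompatible algorithm configuration: {(algorithm, mst_algorithm)=}"
# message ends with: (algorithm, mst_algorithm)=('brute', 'boruvka_exact')
```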

jjerphan · 2023-10-29T10:29:02Z

sklearn/cluster/_hdbscan/_boruvka.pyx

Is it the original implementation? Was it included with modifications?

I guess we could proceed with either:

including the original implementation for reference and to have the history identical to the original project and modify it in subsequent places

including the original implementation with modifications and documenting the changes made in between (probably simpler)

What do you think? 🙂

I think a few minor elements can be improved for matching scikit-learn's conventions, but we can keep that for later.

jjerphan

Another review on the core.

I think we could merge this PR soon and perform optimization in other PRs.

jjerphan · 2023-11-12T18:39:54Z

sklearn/cluster/_hdbscan/_boruvka.pyx

+# Define a function giving the minimum distance between two
+# nodes of a ball tree
+cdef inline float64_t ball_tree_min_dist_dual(
+    float64_t radius1,
+    float64_t radius2,
+    intp_t node1,
+    intp_t node2,
+    float64_t[:, ::1] centroid_dist
+) noexcept nogil:
+
+    cdef float64_t dist_pt = centroid_dist[node1, node2]
+    return max(0, (dist_pt - radius1 - radius2))
+
+
+# Define a function giving the minimum distance between two
+# nodes of a kd-tree
+cdef inline float64_t kd_tree_min_dist_dual(
+    DistanceMetric64 metric,
+    intp_t node1,
+    intp_t node2,
+    float64_t[:, :, ::1] node_bounds,
+    intp_t num_features
+) noexcept nogil:
+
+    cdef float64_t d, d1, d2, rdist = 0.0
+    cdef intp_t j
+
+    if metric.p == INF:
+        for j in range(num_features):
+            d1 = node_bounds[0, node1, j] - node_bounds[1, node2, j]
+            d2 = node_bounds[0, node2, j] - node_bounds[1, node1, j]
+            d = (d1 + fabs(d1)) + (d2 + fabs(d2))
+
+            rdist = max(rdist, 0.5 * d)
+    else:
+        # here we'll use the fact that x + abs(x) = 2 * max(x, 0)
+        for j in range(num_features):
+            d1 = node_bounds[0, node1, j] - node_bounds[1, node2, j]
+            d2 = node_bounds[0, node2, j] - node_bounds[1, node1, j]
+            d = (d1 + fabs(d1)) + (d2 + fabs(d2))
+
+            rdist += pow(0.5 * d, metric.p)
+
+    return metric._rdist_to_dist(rdist)


Those functions already are defined in KDTree and BallTree translation units:

scikit-learn/sklearn/neighbors/_ball_tree.pyx.tp

Line 218 in eb08740

cdef inline float64_t min_dist_dual{{name_suffix}}(

scikit-learn/sklearn/neighbors/_kd_tree.pyx.tp

Line 279 in eb08740

cdef inline float64_t min_dist_dual{{name_suffix}}(

Would it be possible to reuse them?
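
For reference, the geometry computed by the two quoted helpers can be sketched in pure Python. This is a simplified illustration with hypothetical function names and plain lists instead of the `node_bounds` and `centroid_dist` arrays; it returns the true distance directly rather than going through a reduced distance.

```python
import math


def ball_min_dist(centroid_dist, radius1, radius2):
    """Lower bound on the distance between points of two balls."""
    # The bound is zero once the balls touch or overlap.
    return max(0.0, centroid_dist - radius1 - radius2)


def box_min_dist(lower1, upper1, lower2, upper2, p=2.0):
    """Minimum Minkowski-p distance between two axis-aligned boxes."""
    gaps = []
    for lo1, hi1, lo2, hi2 in zip(lower1, upper1, lower2, upper2):
        d1 = lo1 - hi2  # positive when box 1 lies entirely above box 2
        d2 = lo2 - hi1  # positive when box 2 lies entirely above box 1
        # x + |x| == 2 * max(x, 0); at most one of d1, d2 is positive.
        gaps.append(0.5 * ((d1 + abs(d1)) + (d2 + abs(d2))))
    if math.isinf(p):
        return max(gaps)
    return sum(g ** p for g in gaps) ** (1.0 / p)


# Boxes [0, 1] x [0, 1] and [2, 3] x [3, 5]: per-axis gaps are 1 and 2.
d_euclid = box_min_dist([0, 0], [1, 1], [2, 3], [3, 5])
d_cheb = box_min_dist([0, 0], [1, 1], [2, 3], [3, 5], p=math.inf)
```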

jjerphan · 2023-11-12T21:00:18Z

sklearn/cluster/_hdbscan/_boruvka.pyx

+    tree : KDTree
+        The kd-tree to run Dual Tree Boruvka over.


Suggested change

-    tree : KDTree
-        The kd-tree to run Dual Tree Boruvka over.
+    tree : KDTree or BallTree
+        The binary tree to run Dual Tree Boruvka over.

jjerphan · 2023-11-12T21:00:51Z

sklearn/cluster/_hdbscan/_boruvka.pyx

+
+    leaf_size : int, optional (default=20)
+        The Boruvka algorithm benefits from a smaller leaf size than
+        standard kd-tree nearest neighbor searches. The tree passed in


Suggested change

-        standard kd-tree nearest neighbor searches. The tree passed in
+        standard binary tree nearest neighbor searches. The tree passed in

jjerphan · 2023-11-12T21:01:37Z

sklearn/cluster/_hdbscan/_boruvka.pyx

+        readonly const float64_t[:, ::1] raw_data
+        float64_t[:, :, ::1] node_bounds
+        float64_t alpha
+        int8_t approx_min_span_tree


I think bint is used for bool.

Suggested change

-        int8_t approx_min_span_tree
+        bint approx_min_span_tree

jjerphan · 2023-11-12T21:02:19Z

sklearn/cluster/_hdbscan/_boruvka.pyx

+        cdef cnp.ndarray[float64_t, ndim=2] knn_dist
+        cdef cnp.ndarray[intp_t, ndim=2] knn_indices


Is it possible to use memoryviews, here?

jjerphan · 2023-11-12T21:05:03Z

sklearn/cluster/_hdbscan/_boruvka.pyx

+        for n in range(self.num_nodes):
+            self.bounds[n] = DBL_MAX
+
+    cdef _initialize_components(self):


Suggested change

-    cdef _initialize_components(self):
+    cdef _initialize_components(self) noexcept nogil:

jjerphan · 2023-11-12T21:07:42Z

sklearn/cluster/_hdbscan/_boruvka.pyx

+    cdef cnp.ndarray[intp_t, ndim=1] components(self):
+        """Return an array of all component roots/identifiers"""
+        return np.array(self.is_component).nonzero()[0]


I am wondering whether we could make this lazy/online so that we do not need to allocate a new array in a gil-context every-time it is called.

For instance, we could:

allocate a single array which could be partitioned in two halves with $\mathcal{O}(1)$ swaps, with the first half holding the components and being returned as a view. This would be efficient due to data locality and would preserve the logic of callers operating on arrays, but would not respect ordering.

create two linked lists, one for identifying components and the other for the remaining nodes, which we could update in $\mathcal{O}(1)$. This would not be as efficient since it breaks data locality and would require adapting the logic of callers so that it operates on a linked list, but it would respect ordering.

What do you think?
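
The first option above (a single array partitioned with constant-time swaps) could look roughly like the following pure-Python sketch. The class `ComponentPartition` and its attribute names are hypothetical; a Cython version would presumably use typed memoryviews so that the live prefix can be handed out without copying or holding the GIL.

```python
class ComponentPartition:
    """Keep live component ids packed in ids[:n_live]; order is not preserved."""

    def __init__(self, n):
        self.ids = list(range(n))  # ids[:n_live] are the live components
        self.pos = list(range(n))  # pos[c] is the index of component c in ids
        self.n_live = n

    def remove(self, c):
        """Retire component c by swapping it with the last live entry."""
        i = self.pos[c]
        if i >= self.n_live:
            return  # already retired
        last = self.n_live - 1
        other = self.ids[last]
        self.ids[i], self.ids[last] = other, c
        self.pos[other], self.pos[c] = i, last
        self.n_live = last

    def live(self):
        # A list slice copies; an array-backed version could return a view.
        return self.ids[: self.n_live]


part = ComponentPartition(5)
part.remove(1)
after_first = part.live()
part.remove(0)
after_second = part.live()
```

Each removal touches two slots of each array, which is the constant-time update the comment refers to; the price is that the live ids are no longer sorted.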

jjerphan · 2023-11-12T21:17:57Z

sklearn/cluster/_hdbscan/_boruvka.pyx

+                self.dual_tree_traversal(0, 0)
+                num_components = self.update_components()
+
+        return np.array(self.edges, dtype=MST_edge_dtype)


Suggested change

-        return np.array(self.edges, dtype=MST_edge_dtype)
+        return np.asarray(self.edges)

jjerphan · 2023-11-12T21:24:49Z

sklearn/cluster/_hdbscan/_boruvka.pyx

+    n_jobs : int, optional (default=4)
+        The number of parallel jobs used to compute core distances.


n_jobs is not used in practice.

jjerphan · 2023-11-12T21:29:04Z

sklearn/cluster/_hdbscan/_boruvka.pyx

+        self.component_of_point = np.empty(self.num_points, dtype=np.intp)
+        self.component_of_node = np.empty(self.num_nodes, dtype=np.intp)
+        self.candidate_neighbor = np.empty(self.num_points, dtype=np.intp)
+        self.candidate_point = np.empty(self.num_points, dtype=np.intp)
+        self.candidate_distance = np.empty(self.num_points, dtype=np.float64)


Instead of using _initialize_components to fill them with for loops, we could use np.full here instead.

jjerphan · 2023-11-27T17:51:58Z

We discussed this PR during this month's meeting, mentioning that:

we need to perform benchmarks and comparisons of the various algorithms for MST and understand tradeoffs before merging this PR
improvements can be done in subsequent PRs

Micky774 · 2023-11-27T18:03:50Z

I'll try to clean up this PR per your review sometime this week. I'll hopefully also be able to publish some MST benchmarks. Sorry for stepping away from this -- time has gotten really scarce 😅

jjerphan · 2023-11-27T18:05:15Z

No problem, Meekail. :)

Micky774 added 30 commits February 24, 2022 23:58

Initial addition of hdbscan

1c61429

Added wraparound wrappers where needed

c5240b7

Updated documentation

74bd0b3

Merge branch 'main' into hdbscan

15793b2

Added a new batch of doc updates for passing docstring tests

faa06b5

Improved metric_params handling

2a7cc22

Propogated metric_params change to tests and other functions

97f036f

Removed plotting, to_pandas, to_networkx infrastructure

8aa297a

Removed plotting, to_pandas, to_networkx infrastructure

fe362b5

Merge branch 'hdbscan' of https://github.com/Micky774/scikit-learn in…

dd44dbc

…to hdbscan

Renamed plots.py-->_trees.py

fda9350

Fixed package namespace in cluster/__init__.py

7478586

Drop-in replaced private dist_metrics with metrics.dist_metrics

cd1edc4

Revert "Drop-in replaced private dist_metrics with `metrics.dist_me…

0802504

…trics`" This reverts commit cd1edc4.

Docstring compliance for flat.py

ce94591

Renamed flat.py --> _flat.py

e93bfe1

Renamed flat.py-->_flat.py

028e98f

Renamed validity.py-->_validity.py

a1ac99a

Renamed robust_single_linkage_.py

788d4bc

Merge branch 'main' into hdbscan

5fba5e0

Removed _flat_.py and associated tests

cf4f239

Made memview readonly constant

1ceac43

Removed experimental/extra API -- may reenable in future PRs

6f20a08

Merge branch 'main' into hdbscan

6705fa7

WIP docstring improvements for RSL

9e9be81

Trimmed and removed unnecessary RSL estimator

0cd08f3

Updated sqrt2 default in robust_single_linkage

7b73dd8

Updated alpha arg for rsl functions

62cf09e

Micky774 added 2 commits October 11, 2023 14:30

Updated changelog

3dd7c5a

Changed default to preserve backwards compatability

5a9aebe

Micky774 marked this pull request as ready for review October 11, 2023 18:37

Micky774 changed the title ~~ENH Introduces mst_algorithm keyword for HDBSCAN, alongside two new MST algorithms~~ ENH Introduces mst_algorithm keyword for HDBSCAN, alongside two new Boruvka MST algorithms Oct 12, 2023

Micky774 mentioned this pull request Oct 12, 2023

HDBSCAN Ongoing Work #26801

Open

13 tasks

Micky774 added 3 commits October 12, 2023 08:50

Merge branch 'main' into hdbscan_boruvka

90d9403

Improved tests, and adjusted auto option for backwards compatability

3b40f8a

Corrected changelog entry

04e4007

Micky774 commented Oct 12, 2023

Micky774 added 7 commits October 12, 2023 09:40

Removed extraneous function

e06188f

Stabalized tests by using sorted lists

6ef1668

Updated to include deprecation for auto heuristic

76713ff

Updated example in docstring

de5d041

Updated centers test to use less adversarial data

8ba71e8

Corrected test by making hdb model more noise-tolerant

6a592f4

Avoid FutureWarning in tests

68d4fd1

jjerphan self-requested a review October 14, 2023 07:07

Micky774 added 2 commits October 14, 2023 10:06

Fixed remaining FutureWarning

39aa992

Merge branch 'main' into hdbscan_boruvka

fedeb90

Micky774 mentioned this pull request Oct 18, 2023

MNT Improved rigor of HDBSCAN tests using Fowlkes-Mallows score #27571

Merged

jjerphan reviewed Oct 28, 2023

jjerphan reviewed Oct 29, 2023

jjerphan reviewed Nov 12, 2023

lorentzenchr added the Stalled label Mar 1, 2024

jjerphan mentioned this pull request Mar 7, 2024

FEAT Introduce DBCV as new cluster metric #28244

Closed

13 tasks

lesteve mentioned this pull request Jun 10, 2025

HDBSCAN performance issues compared to original hdbscan implementation (likely because Boruvka algorithm is not implemented) #31503

Open
