ENH Introduces `mst_algorithm` keyword for HDBSCAN, alongside two new Boruvka MST algorithms #27572
base: main
Conversation
- Added support for `n_features_in_`
- Improved validation and added support for `feature_names_in_`
- Renamed `kwargs` to `metric_params` and added safety check for an empty dict
- Removed attributes set in init and deferred to properties
- Raised error if tree query is performed with too few samples
- Cleaned up some list/dict comprehension logic
…trics`" This reverts commit cd1edc4.
- Removed internal minkowski metric parameter validation in favor of `sklearn.metrics` built-in handling
- Removed default argument and presence of `p` in hdbscan functions
- Now users must pass `p` in through `metric_params`, consistent w/ other metrics such as `wminkowski` and `mahalanobis`
- Removed vestigial estimator check -- now supported via common tests
- Fixed bug where `boruvka_kdtree` algorithm's accepted metrics were based off of `BallTree` not `KDTree`
- Cleaned up lines with unused returns by indexing output of `hdbscan`
- Greatly expanded scope of algorithm/metric compatibility tests
- Streamlined some other tests
- Deleted commented-out tests
return metric._rdist_to_dist(rdist)


cdef class BoruvkaUnionFind:
Note for reviewers: I wasn't sure of a "nice" way to reuse the `UnionFind` structure from `_hierarchical_fast`, since this one explicitly keeps track of initial nodes as components without leveraging virtual nodes created through union operations, and thus the `UnionFind` structure would need to be modified with non-trivial additions. I think it is better to leave them as two similar-yet-different structures. Open to suggestions though!
I think it is OK to maintain two structures for `UnionFind` rather than to try factorizing their logic.
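For readers following along, here is a rough pure-Python sketch (not the PR's Cython `BoruvkaUnionFind`) of a union-find in which only the original points can act as component representatives, which is the behaviour described above; the class and attribute names are illustrative:

```python
import numpy as np


class SimpleBoruvkaUnionFind:
    """Illustrative union-find over the original points only: no virtual nodes,
    and an `is_component` flag marks the surviving component representatives."""

    def __init__(self, n_points):
        self.parent = np.arange(n_points, dtype=np.intp)
        self.rank = np.zeros(n_points, dtype=np.intp)
        self.is_component = np.ones(n_points, dtype=bool)

    def find(self, x):
        # Path halving keeps the trees shallow.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        x_root, y_root = self.find(x), self.find(y)
        if x_root == y_root:
            return
        # Union by rank; the absorbed root stops being a component representative.
        if self.rank[x_root] < self.rank[y_root]:
            x_root, y_root = y_root, x_root
        self.parent[y_root] = x_root
        self.is_component[y_root] = False
        if self.rank[x_root] == self.rank[y_root]:
            self.rank[x_root] += 1

    def components(self):
        return np.nonzero(self.is_component)[0]
```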
Thank you a lot, @Micky774. Here is a first review on everything but `sklearn/cluster/_hdbscan/_boruvka.pyx`.
@@ -14,6 +14,7 @@
# TODO: Stop defining custom types locally or globally like DTYPE_t and friends and
# use these consistently throughout the codebase.
# NOTE: Extend this list as needed when converting more cython extensions.
ctypedef char int8_t
Where is this used exactly?
@@ -217,6 +217,19 @@ Changelog
  `kdtree` and `balltree` values will be removed in 1.6.
  :pr:`26744` by :user:`Shreesha Kumar Bhat <Shreesha3112>`.

- |Enhancement| : The `mst_algorithm` argument is introduced, allowing for the user to
Suggested change:
- - |Enhancement| : The `mst_algorithm` argument is introduced, allowing for the user to
+ - |Enhancement| :class:`cluster.HDBScan` has a new `mst_algorithm` argument, allowing for the user to
This instead introduces `"boruvka_exact", "boruvka_approx"` which are both faster
MST building algorithms than the current `"prims"`.
Suggested change:
- This instead introduces `"boruvka_exact", "boruvka_approx"` which are both faster
- MST building algorithms than the current `"prims"`.
+ This instead introduces `"boruvka_exact"` and `"boruvka_approx"` which are both
+ faster MST building algorithms than the current `"prims"`.
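For context, a hypothetical usage of the keyword this changelog entry describes; `mst_algorithm` and the `"boruvka_exact"` value come from this PR and are not available in released scikit-learn:

```python
import numpy as np
from sklearn.cluster import HDBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# `mst_algorithm` is the keyword proposed by this PR (only on the PR branch).
model = HDBSCAN(min_cluster_size=10, mst_algorithm="boruvka_exact").fit(X)
print(np.unique(model.labels_))
```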
kwargs["leaf_size"] = self.leaf_size
# We prefer KDTree unless otherwise specified
Can you document the reason behind this choice?
# Metric is a valid BallTree metric
mst_func = _hdbscan_prims
kwargs["algo"] = "ball_tree"
# Boruvka is always preferable
Why?
Suggested change:
- # Boruvka is always preferable
+ # The exact Boruvka algorithm is always preferable.
kwargs["leaf_size"] = self.leaf_size
kwargs["algo"] = (
    "kd_tree" if self.metric in KDTree.valid_metrics else "ball_tree"
Are we sure that the union of `{KDTree,BallTree}.valid_metrics` overlaps the entire set of values for `self.metric`?
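One quick way to check, sketched below; this only inspects the tree classes' advertised metrics and assumes HDBSCAN's accepted metric strings are compared against exactly these lists:

```python
from sklearn.neighbors import BallTree, KDTree

# Metrics covered by the kd_tree/ball_tree fallback in the snippet above.
tree_metrics = set(KDTree.valid_metrics) | set(BallTree.valid_metrics)
print(sorted(tree_metrics))

# Any metric string accepted by HDBSCAN but absent from `tree_metrics` would
# fall through the `else` branch to BallTree and fail at tree construction time.
```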
EXACT_MST_ALGORITHMS = {"prims", "boruvka_exact"}
MST_ALGORITHMS = {"boruvka_approx"} | EXACT_MST_ALGORITHMS | BRUTE_COMPATIBLE
I guess we might want to sort them once and for all. What do you think?
Suggested change:
+ # Converting sets into sorted lists to have a reproducible test suite
+ # and to prevent errors with test dispatch to pytest-xdist workers.
+ BRUTE_COMPATIBLE = sorted(BRUTE_COMPATIBLE)
+ ALGORITHMS = sorted(ALGORITHMS)
+ EXACT_MST_ALGORITHMS = sorted(EXACT_MST_ALGORITHMS)
+ MST_ALGORITHMS = sorted(MST_ALGORITHMS)
@@ -811,40 +905,55 @@ def fit(self, X, y=None):
            " Please select a different metric."
        )

-       if self.algorithm != "auto":
+       if algorithms != {"auto"}:
I think I would rather treat the default `mst_algorithm == algorithm == "auto"` case first and then handle the remaining cases separately. What do you think?
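A rough sketch of what that reordering could look like; `_select_mst_backend` and the returned backend names are placeholders, not the PR's actual helpers:

```python
from sklearn.neighbors import KDTree


def _select_mst_backend(algorithm, mst_algorithm, metric):
    """Placeholder sketch: resolve the all-default case first, then branch."""
    kwargs = {}
    if algorithm == "auto" and mst_algorithm == "auto":
        # Default path: exact Boruvka on a KDTree when the metric allows it.
        kwargs["algo"] = "kd_tree" if metric in KDTree.valid_metrics else "ball_tree"
        return "boruvka_exact", kwargs
    if "brute" in (algorithm, mst_algorithm):
        return "brute", kwargs
    # Explicit, non-brute choices: honour the requested tree and MST algorithm.
    kwargs["algo"] = algorithm if algorithm != "auto" else "kd_tree"
    return mst_algorithm if mst_algorithm != "auto" else "boruvka_exact", kwargs
```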
def _validate_algorithms(algorithm, mst_algorithm):
    algos = {algorithm, mst_algorithm}
    if "brute" in algos and not algos.issubset(BRUTE_COMPATIBLE):
        pytest.xfail("Incompatible algorithm configuration")
Suggested change:
- pytest.xfail("Incompatible algorithm configuration")
+ pytest.xfail(f"Incompatible algorithm configuration: {(algorithm, mst_algorithm)=}")
Is it the original implementation? Was it included with modifications?
I guess we could proceed with either:
- including the original implementation for reference and to have the history identical to the original project and modify it in subsequent places
- including the original implementation with modifications and documenting the changes made in between (probably simpler)
What do you think? 🙂
I think a few minor elements can be improved for matching scikit-learn's conventions, but we can keep that for later.
Another review on the core.
I think we could merge this PR soon and perform optimization in other PRs.
# Define a function giving the minimum distance between two
# nodes of a ball tree
cdef inline float64_t ball_tree_min_dist_dual(
    float64_t radius1,
    float64_t radius2,
    intp_t node1,
    intp_t node2,
    float64_t[:, ::1] centroid_dist
) noexcept nogil:

    cdef float64_t dist_pt = centroid_dist[node1, node2]
    return max(0, (dist_pt - radius1 - radius2))


# Define a function giving the minimum distance between two
# nodes of a kd-tree
cdef inline float64_t kd_tree_min_dist_dual(
    DistanceMetric64 metric,
    intp_t node1,
    intp_t node2,
    float64_t[:, :, ::1] node_bounds,
    intp_t num_features
) noexcept nogil:

    cdef float64_t d, d1, d2, rdist = 0.0
    cdef intp_t j

    if metric.p == INF:
        for j in range(num_features):
            d1 = node_bounds[0, node1, j] - node_bounds[1, node2, j]
            d2 = node_bounds[0, node2, j] - node_bounds[1, node1, j]
            d = (d1 + fabs(d1)) + (d2 + fabs(d2))

            rdist = max(rdist, 0.5 * d)
    else:
        # here we'll use the fact that x + abs(x) = 2 * max(x, 0)
        for j in range(num_features):
            d1 = node_bounds[0, node1, j] - node_bounds[1, node2, j]
            d2 = node_bounds[0, node2, j] - node_bounds[1, node1, j]
            d = (d1 + fabs(d1)) + (d2 + fabs(d2))

            rdist += pow(0.5 * d, metric.p)

    return metric._rdist_to_dist(rdist)
Those functions are already defined in the `KDTree` and `BallTree` translation units:
cdef inline float64_t min_dist_dual{{name_suffix}}(
cdef inline float64_t min_dist_dual{{name_suffix}}(
Would it be possible to reuse them?
tree : KDTree
    The kd-tree to run Dual Tree Boruvka over.
Suggested change:
- tree : KDTree
-     The kd-tree to run Dual Tree Boruvka over.
+ tree : KDTree or BallTree
+     The binary tree to run Dual Tree Boruvka over.
leaf_size : int, optional (default=20)
    The Boruvka algorithm benefits from a smaller leaf size than
    standard kd-tree nearest neighbor searches. The tree passed in
Suggested change:
-     standard kd-tree nearest neighbor searches. The tree passed in
+     standard binary tree nearest neighbor searches. The tree passed in
readonly const float64_t[:, ::1] raw_data
float64_t[:, :, ::1] node_bounds
float64_t alpha
int8_t approx_min_span_tree
I think `bint` is used for `bool`.
Suggested change:
- int8_t approx_min_span_tree
+ bint approx_min_span_tree
cdef cnp.ndarray[float64_t, ndim=2] knn_dist
cdef cnp.ndarray[intp_t, ndim=2] knn_indices
Is it possible to use memoryviews here?
for n in range(self.num_nodes):
    self.bounds[n] = DBL_MAX

cdef _initialize_components(self):
Suggested change:
- cdef _initialize_components(self):
+ cdef _initialize_components(self) noexcept nogil:
cdef cnp.ndarray[intp_t, ndim=1] components(self):
    """Return an array of all component roots/identifiers"""
    return np.array(self.is_component).nonzero()[0]
I am wondering whether we could make this lazy/online so that we do not need to allocate a new array in a gil-context every time it is called.
For instance, we could:
- allocate a single array which could be partitioned in two halves with O(1) swaps, with the first half (the components) being returned as a view. This would be efficient due to data locality and would preserve the logic of callers operating on arrays, but would not respect ordering.
- create two linked lists, one for identifying components and the other for the remaining nodes, which we could update in O(1). This would not be as efficient since it breaks data locality and would need adapting the logic of callers so that it operates on a linked list, but would respect ordering.
What do you think?
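To make the first option concrete, a small pure-Python sketch of the swap-partition idea; the class and method names are made up for illustration and do not appear in the PR:

```python
import numpy as np


class ComponentSet:
    """Keep live component ids in the prefix of one array: `components()` is an
    O(1) view and removing a merged component is an O(1) swap (order not kept)."""

    def __init__(self, n_points):
        self._ids = np.arange(n_points, dtype=np.intp)  # live ids occupy the prefix
        self._pos = np.arange(n_points, dtype=np.intp)  # position of each id in _ids
        self._n_live = n_points

    def components(self):
        # A view over the live prefix; no new allocation.
        return self._ids[: self._n_live]

    def remove(self, component):
        # Swap the removed id with the last live id, then shrink the prefix.
        i = self._pos[component]
        last = self._ids[self._n_live - 1]
        self._ids[i], self._ids[self._n_live - 1] = last, component
        self._pos[last], self._pos[component] = i, self._n_live - 1
        self._n_live -= 1
```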
self.dual_tree_traversal(0, 0)
num_components = self.update_components()

return np.array(self.edges, dtype=MST_edge_dtype)
Suggested change:
- return np.array(self.edges, dtype=MST_edge_dtype)
+ return np.asarray(self.edges)
n_jobs : int, optional (default=4)
    The number of parallel jobs used to compute core distances.
`n_jobs` is not used in practice.
self.component_of_point = np.empty(self.num_points, dtype=np.intp)
self.component_of_node = np.empty(self.num_nodes, dtype=np.intp)
self.candidate_neighbor = np.empty(self.num_points, dtype=np.intp)
self.candidate_point = np.empty(self.num_points, dtype=np.intp)
self.candidate_distance = np.empty(self.num_points, dtype=np.float64)
Instead of using `_initialize_components` to fill them with for loops, we could use `np.full` here instead.
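A minimal sketch of that suggestion, using plain variables instead of the attributes above; only the `DBL_MAX` fill for `bounds` is visible in the quoted diffs, so the other initial values here are assumptions for illustration:

```python
import numpy as np

num_points, num_nodes = 1000, 511  # placeholder sizes

# Allocate and fill in one step instead of np.empty + explicit Python-level loops.
bounds = np.full(num_nodes, np.finfo(np.float64).max)
candidate_distance = np.full(num_points, np.finfo(np.float64).max)
candidate_neighbor = np.full(num_points, -1, dtype=np.intp)   # assumed sentinel
candidate_point = np.full(num_points, -1, dtype=np.intp)      # assumed sentinel
component_of_point = np.arange(num_points, dtype=np.intp)     # assumed: each point starts as its own component
```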
We discussed this PR during this month's meeting, mentioning that:

I'll try to clean up this PR per your review sometime this week. I'll hopefully also be able to publish some MST benchmarks. Sorry for stepping away from this -- time has gotten really scarce 😅

No problem, Meekail. :)
Reference Issues/PRs
Towards #26801
What does this implement/fix? Explain your changes.
- Introduces the `mst_algorithm` keyword for HDBSCAN, alongside two new Boruvka MST algorithms.
- Adds a `"warn"` option for `mst_algorithm` as a backwards-compatible default which will provide a `FutureWarning` to users, encouraging them to opt in to using `mst_algorithm="auto"` rather than relying on `mst_algorithm="warn"`.
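For illustration, the kind of backwards-compatible `"warn"` default described above could look roughly like this (a sketch, not the PR's code; the historical default is assumed to map to `"prims"`):

```python
import warnings


def _resolve_mst_algorithm(mst_algorithm):
    """Sketch of a 'warn' default that nudges users toward an explicit choice."""
    if mst_algorithm == "warn":
        warnings.warn(
            "The default MST algorithm for HDBSCAN will change in a future "
            "release. Set `mst_algorithm='auto'` to adopt the new behaviour "
            "and silence this warning.",
            FutureWarning,
        )
        return "prims"  # assumed historical behaviour
    return mst_algorithm
```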
Any other comments?
Apologies for the gross commit log; the vast majority of the commits are a symptom of me keeping this branch open in parallel with mainstream HDBSCAN efforts in order to prevent it from getting "too out of sync". Not sure how to rectify history well here.
Benchmarks coming soon.