ENH Introduces mst_algorithm keyword for HDBSCAN, alongside two new Boruvka MST algorithms #27572

Open
wants to merge 196 commits into
base: main
from

Conversation

Micky774
Contributor

@Micky774 Micky774 commented Oct 11, 2023

Reference Issues/PRs

Towards #26801

What does this implement/fix? Explain your changes.

  • Introduces mst_algorithm
  • Adds Boruvka algorithm (both exact/inexact)
  • Streamlines algorithm selection logic for "auto" options
  • Improves tests to account for new MST algorithms
  • Provides a "warn" option for mst_algorithm as a backwards-compatible default which will provide a FutureWarning to users, encouraging them to opt-in to using mst_algorithm="auto"
  • Includes a deprecation for mst_algorithm="warn"
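The "warn" default described above follows a common sentinel pattern. Here is a minimal sketch of that pattern, not the PR's actual code: the function name `resolve_mst_algorithm` and the warning text are hypothetical, and mapping the sentinel to "prims" is an assumption based on the changelog calling "prims" the current algorithm.

```python
import warnings


# Hypothetical helper for illustration; not taken from the PR's implementation.
def resolve_mst_algorithm(mst_algorithm="warn"):
    """Translate the backwards-compatible "warn" sentinel into a real choice."""
    if mst_algorithm == "warn":
        warnings.warn(
            'The default of mst_algorithm will change; pass mst_algorithm="auto" '
            "to opt in and silence this warning.",
            FutureWarning,
        )
        # Assumption: the legacy behaviour corresponds to Prim's algorithm.
        return "prims"
    # Any explicit value is passed through untouched and emits no warning.
    return mst_algorithm
```

Users who pass any explicit value never see the warning, so only code relying on the default is nudged to opt in.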

Any other comments?

Apologies for the gross commit log, the vast majority are a symptom of me keeping this branch open in parallel with mainstream HDBSCAN efforts in order to prevent it from getting "too out of sync". Not sure how to rectify history well here.

Benchmarks coming soon.

- Added support for `n_features_in_`
- Improved validation and added support for `feature_names_in_`
- Renamed `kwargs` to `metric_params` and added safety check
  for an empty dict
- Removed attributes set in init and deferred to properties
- Raised error if tree query is performed with too few samples
- Cleaned up some list/dict comprehension logic
- Removed internal minkowski metric parameter validation in favor
  of `sklearn.metrics` built-in handling
- Removed default argument and presence of `p` in hdbscan functions
- Now users must pass `p` in through `metric_params`, consistent w/
  other metrics such as `wminkowski` and `mahalanobis`
- Removed vestigial estimator check -- now supported via common tests
- Fixed bug where `boruvka_kdtree` algorithm's accepted metrics were
  based off of `BallTree` not `KDTree`
- Cleaned up lines with unused returns by indexing output of `hdbscan`
- Greatly expanded scope of algorithm/metric compatibility tests
- Streamlined some other tests
- Deleted commented out tests
@github-actions

github-actions bot commented Oct 11, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: fedeb90. Link to the linter CI: here

@Micky774 Micky774 marked this pull request as ready for review October 11, 2023 18:37
@Micky774 Micky774 changed the title from "ENH Introduces mst_algorithm keyword for HDBSCAN, alongside two new MST algorithms" to "ENH Introduces mst_algorithm keyword for HDBSCAN, alongside two new Boruvka MST algorithms" Oct 12, 2023
@Micky774 Micky774 mentioned this pull request Oct 12, 2023
13 tasks
return metric._rdist_to_dist(rdist)


cdef class BoruvkaUnionFind:
Contributor Author

Note for reviewers: I wasn't sure of a "nice" way to reuse the UnionFind structure from _hierarchical_fast since this one explicitly keeps track of initial nodes as components without leveraging virtual nodes created through union operations, and thus the UnionFind structure would need to be modified with non-trivial additions. I think it is better to leave them as two similar-yet-different structures. Open to suggestions though!

Member

I think it is OK to maintain two structures for UnionFind rather than to try factorizing their logic.
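To make the distinction concrete, here is a rough pure-Python sketch of a union-find that only ever refers to the initial nodes and flags which of them are still component roots. It is an illustration under assumptions, not the Cython `BoruvkaUnionFind` from the PR: the class name `ComponentUnionFind` and the rank heuristic are choices made for this example, while `is_component` and `components` echo names quoted elsewhere in the review.

```python
class ComponentUnionFind:
    """Union-find over n initial nodes that also flags which nodes are roots."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n
        self.is_component = [True] * n  # every node starts as its own root

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node straight at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        # The absorbed root stops being a component; no virtual node is created.
        self.parent[rb] = ra
        self.is_component[rb] = False
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1

    def components(self):
        """Return the ids of all current component roots."""
        return [i for i, flag in enumerate(self.is_component) if flag]
```

Because a merge never allocates a new node id, the component list is always a subset of the original node ids, which is the property the author describes.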

@jjerphan jjerphan self-requested a review October 14, 2023 07:07
Member

@jjerphan jjerphan left a comment

Thank you a lot, @Micky774.

Here is a first review on everything but sklearn/cluster/_hdbscan/_boruvka.pyx.

@@ -14,6 +14,7 @@
# TODO: Stop defining custom types locally or globally like DTYPE_t and friends and
# use these consistently throughout the codebase.
# NOTE: Extend this list as needed when converting more cython extensions.
ctypedef char int8_t
Member

Where is this used exactly?

@@ -217,6 +217,19 @@ Changelog
`kdtree` and `balltree` values will be removed in 1.6.
:pr:`26744` by :user:`Shreesha Kumar Bhat <Shreesha3112>`.

- |Enhancement| : The `mst_algorithm` argument is introduced, allowing for the user to
Member

Suggested change
- |Enhancement| : The `mst_algorithm` argument is introduced, allowing for the user to
- |Enhancement| :class:`cluster.HDBSCAN` has a new `mst_algorithm` argument, allowing for the user to

Comment on lines +225 to +226
This instead introduces `"boruvka_exact", "boruvka_approx"` which are both faster
MST building algorithms than the current `"prims"`.
Member

@jjerphan jjerphan Oct 28, 2023

Suggested change
This instead introduces `"boruvka_exact", "boruvka_approx"` which are both faster
MST building algorithms than the current `"prims"`.
This instead introduces `"boruvka_exact"` and `"boruvka_approx"` which are both
faster MST building algorithms than the current `"prims"`.

kwargs["leaf_size"] = self.leaf_size
# We prefer KDTree unless otherwise specified
Member

Can you document the reason behind this choice?

# Metric is a valid BallTree metric
mst_func = _hdbscan_prims
kwargs["algo"] = "ball_tree"
# Boruvka is always preferable
Member

Why?

Suggested change
# Boruvka is always preferable
# The exact Boruvka algorithm is always preferable.

kwargs["leaf_size"] = self.leaf_size
kwargs["algo"] = (
    "kd_tree" if self.metric in KDTree.valid_metrics else "ball_tree"
Member

Are we sure that the union of {KDTree,BallTree}.valid_metrics overlaps the entire set of values for self.metric?
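One way to make the concern explicit is to fail loudly when a metric belongs to neither tree instead of silently falling through to "ball_tree". The sketch below is only an illustration: `choose_tree` is a hypothetical helper, and the two metric sets are small placeholders standing in for the real `KDTree.valid_metrics` and `BallTree.valid_metrics` values.

```python
# Placeholder metric sets for illustration only; the real values come from
# KDTree.valid_metrics and BallTree.valid_metrics in scikit-learn.
KD_TREE_METRICS = {"euclidean", "manhattan", "chebyshev", "minkowski"}
BALL_TREE_METRICS = KD_TREE_METRICS | {"haversine", "hamming"}


def choose_tree(metric):
    """Prefer the kd-tree, fall back to the ball tree, otherwise fail loudly."""
    if metric in KD_TREE_METRICS:
        return "kd_tree"
    if metric in BALL_TREE_METRICS:
        return "ball_tree"
    # Without this branch, an unsupported metric would be routed to the
    # ball tree and fail later with a less helpful error.
    raise ValueError(f"{metric!r} is not supported by either tree")
```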


EXACT_MST_ALGORITHMS = {"prims", "boruvka_exact"}
MST_ALGORITHMS = {"boruvka_approx"} | EXACT_MST_ALGORITHMS | BRUTE_COMPATIBLE

Member

I guess we might want to sort them once and for all. What do you think?

Suggested change
# Converting sets into sorted list to have a reproducible set suite
# and to prevent errors with test dispatch to pytest-xdist workers.
BRUTE_COMPATIBLE = sorted(BRUTE_COMPATIBLE)
ALGORITHMS = sorted(ALGORITHMS)
EXACT_MST_ALGORITHMS = sorted(EXACT_MST_ALGORITHMS)
MST_ALGORITHMS = sorted(MST_ALGORITHMS)
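The motivation is that the iteration order of a set of strings can differ between Python processes because of hash randomisation, while parallel test workers need to collect parametrized tests in the same order. A small sketch of the effect of sorting, using only the three MST names quoted above (`BRUTE_COMPATIBLE` is left out because its contents are not shown here):

```python
# Only the three MST names below appear in the quoted test module.
EXACT_MST_ALGORITHMS = {"prims", "boruvka_exact"}
MST_ALGORITHMS = {"boruvka_approx"} | EXACT_MST_ALGORITHMS

# sorted() returns a list whose order depends only on the values, so every
# process sees the same sequence regardless of string hash randomisation.
MST_ALGORITHMS = sorted(MST_ALGORITHMS)
print(MST_ALGORITHMS)  # ['boruvka_approx', 'boruvka_exact', 'prims']
```

Note that after this conversion the names are lists, so later set operations such as `|` or `issubset` would need the original sets.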

@@ -811,40 +905,55 @@ def fit(self, X, y=None):
" Please select a different metric."
)

if self.algorithm != "auto":
if algorithms != {"auto"}:
Member

I think I would rather treat the default mst_algorithm == algorithm == "auto" case first and then branch on the remaining cases.

What do you think?

def _validate_algorithms(algorithm, mst_algorithm):
    algos = {algorithm, mst_algorithm}
    if "brute" in algos and not algos.issubset(BRUTE_COMPATIBLE):
        pytest.xfail("Incompatible algorithm configuration")
Member

Suggested change
pytest.xfail("Incompatible algorithm configuration")
pytest.xfail(f"Incompatible algorithm configuration: {(algorithm, mst_algorithm)=}")
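The suggestion relies on the `=` specifier for f-strings (Python 3.8+), which echoes the expression text followed by its repr. A quick standalone check of what the message would look like, with example values chosen for illustration:

```python
# Example values; any pair of algorithm names would render the same way.
algorithm, mst_algorithm = "brute", "boruvka_exact"

# The "=" specifier keeps the expression text, then appends its repr.
message = f"Incompatible algorithm configuration: {(algorithm, mst_algorithm)=}"
print(message)
# Incompatible algorithm configuration: (algorithm, mst_algorithm)=('brute', 'boruvka_exact')
```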

Member

Is it the original implementation? Was it included with modifications?

I guess we could proceed with either:

  • including the original implementation for reference and to have the history identical to the original project and modify it in subsequent places
  • including the original implementation with modifications and documenting the changes made in between (probably simpler)

What do you think? 🙂

I think a few minor elements can be improved for matching scikit-learn's conventions, but we can keep that for later.

Member

@jjerphan jjerphan left a comment

Another review on the core.

I think we could merge this PR soon and perform optimization in other PRs.

Comment on lines +70 to +113
# Define a function giving the minimum distance between two
# nodes of a ball tree
cdef inline float64_t ball_tree_min_dist_dual(
    float64_t radius1,
    float64_t radius2,
    intp_t node1,
    intp_t node2,
    float64_t[:, ::1] centroid_dist
) noexcept nogil:

    cdef float64_t dist_pt = centroid_dist[node1, node2]
    return max(0, (dist_pt - radius1 - radius2))


# Define a function giving the minimum distance between two
# nodes of a kd-tree
cdef inline float64_t kd_tree_min_dist_dual(
    DistanceMetric64 metric,
    intp_t node1,
    intp_t node2,
    float64_t[:, :, ::1] node_bounds,
    intp_t num_features
) noexcept nogil:

    cdef float64_t d, d1, d2, rdist = 0.0
    cdef intp_t j

    if metric.p == INF:
        for j in range(num_features):
            d1 = node_bounds[0, node1, j] - node_bounds[1, node2, j]
            d2 = node_bounds[0, node2, j] - node_bounds[1, node1, j]
            d = (d1 + fabs(d1)) + (d2 + fabs(d2))

            rdist = max(rdist, 0.5 * d)
    else:
        # here we'll use the fact that x + abs(x) = 2 * max(x, 0)
        for j in range(num_features):
            d1 = node_bounds[0, node1, j] - node_bounds[1, node2, j]
            d2 = node_bounds[0, node2, j] - node_bounds[1, node1, j]
            d = (d1 + fabs(d1)) + (d2 + fabs(d2))

            rdist += pow(0.5 * d, metric.p)

    return metric._rdist_to_dist(rdist)
Member

Those functions already are defined in KDTree and BallTree translation units:

cdef inline float64_t min_dist_dual{{name_suffix}}(

cdef inline float64_t min_dist_dual{{name_suffix}}(

Would it be possible to reuse them?
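For readers unfamiliar with these bounds, the following pure-Python sketch mirrors the arithmetic of the two quoted Cython functions. It is not scikit-learn's `min_dist_dual` implementation: the function names and the explicit lower/upper box arguments are assumptions made for this example, and the final root stands in for `metric._rdist_to_dist` under a Minkowski-p metric.

```python
import math


def ball_tree_min_dist(center_dist, radius1, radius2):
    """Lower bound on the distance between points of two balls."""
    return max(0.0, center_dist - radius1 - radius2)


def kd_tree_min_dist(lower1, upper1, lower2, upper2, p=2.0):
    """Lower bound on the Minkowski-p distance between two boxes."""
    rdist = 0.0
    for lo1, hi1, lo2, hi2 in zip(lower1, upper1, lower2, upper2):
        d1 = lo1 - hi2  # positive only when box 1 lies entirely above box 2
        d2 = lo2 - hi1  # positive only when box 2 lies entirely above box 1
        # x + |x| == 2 * max(x, 0), so 0.5 * d is the per-axis gap (0 on overlap).
        d = (d1 + abs(d1)) + (d2 + abs(d2))
        gap = 0.5 * d
        if math.isinf(p):
            rdist = max(rdist, gap)  # Chebyshev: largest per-axis gap
        else:
            rdist += gap ** p        # accumulate the reduced distance
    # Convert the reduced distance back to a true distance.
    return rdist if math.isinf(p) else rdist ** (1.0 / p)
```

For example, the unit square and the box spanning [4, 5] x [5, 6] are separated by gaps of 3 and 4 along the two axes, so the Euclidean bound is 5 and the Chebyshev bound is 4.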

Comment on lines +193 to +194
tree : KDTree
The kd-tree to run Dual Tree Boruvka over.
Member

Suggested change
tree : KDTree
The kd-tree to run Dual Tree Boruvka over.
tree : KDTree or BallTree
The binary tree to run Dual Tree Boruvka over.


leaf_size : int, optional (default=20)
The Boruvka algorithm benefits from a smaller leaf size than
standard kd-tree nearest neighbor searches. The tree passed in
Member

Suggested change
standard kd-tree nearest neighbor searches. The tree passed in
standard binary tree nearest neighbor searches. The tree passed in

readonly const float64_t[:, ::1] raw_data
float64_t[:, :, ::1] node_bounds
float64_t alpha
int8_t approx_min_span_tree
Member

I think bint is used for bool.

Suggested change
int8_t approx_min_span_tree
bint approx_min_span_tree

Comment on lines +305 to +306
cdef cnp.ndarray[float64_t, ndim=2] knn_dist
cdef cnp.ndarray[intp_t, ndim=2] knn_indices
Member

Is it possible to use memoryviews, here?

for n in range(self.num_nodes):
    self.bounds[n] = DBL_MAX

cdef _initialize_components(self):
Member

Suggested change
cdef _initialize_components(self):
cdef _initialize_components(self) noexcept nogil:

Comment on lines +181 to +183
cdef cnp.ndarray[intp_t, ndim=1] components(self):
    """Return an array of all component roots/identifiers"""
    return np.array(self.is_component).nonzero()[0]
Member

I am wondering whether we could make this lazy/online so that we do not need to allocate a new array in a GIL context every time it is called.

For instance, we could:

  • allocate a single array which could be partitioned in two halves with $\mathcal{O}(1)$ swaps, with the first half holding the components and being returned as a view. This would be efficient due to data locality and would preserve the logic of callers operating on arrays, but would not respect ordering.
  • create two linked lists, one for identifying components and the other for the remaining nodes, which we could update in $\mathcal{O}(1)$. This would not be as efficient since it breaks data locality and would need adapting the logic of callers so that it operates on a linked list, but it would respect ordering.

What do you think?
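The first option can be sketched in a few lines. This is a hypothetical pure-Python illustration of the partitioned-array idea, not code from the PR: `ComponentPartition`, `deactivate`, and `position` are names invented for the example, and a real Cython version would hold the ids in a typed buffer so that the active prefix is a true zero-copy view.

```python
class ComponentPartition:
    """Keep active component ids in items[:n_active] with O(1) removal."""

    def __init__(self, n):
        self.items = list(range(n))     # permutation of all node ids
        self.position = list(range(n))  # position[i] is the index of i in items
        self.n_active = n

    def deactivate(self, node):
        """Swap `node` with the last active id and shrink the active prefix."""
        i = self.position[node]
        if i >= self.n_active:
            return  # already inactive
        last = self.n_active - 1
        other = self.items[last]
        self.items[i], self.items[last] = other, node
        self.position[other], self.position[node] = i, last
        self.n_active = last

    def components(self):
        # With a NumPy array or memoryview this slice would be a view;
        # a Python list copies, which is acceptable for a sketch.
        return self.items[: self.n_active]
```

Deactivating node 1 out of five yields the active prefix `[0, 4, 2, 3]`, which shows the trade-off named in the comment: removal is constant time, but the original ordering is lost.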

self.dual_tree_traversal(0, 0)
num_components = self.update_components()

return np.array(self.edges, dtype=MST_edge_dtype)
Member

Suggested change
return np.array(self.edges, dtype=MST_edge_dtype)
return np.asarray(self.edges)

Comment on lines +218 to +219
n_jobs : int, optional (default=4)
The number of parallel jobs used to compute core distances.
Member

n_jobs is not used in practice.

Comment on lines +280 to +284
self.component_of_point = np.empty(self.num_points, dtype=np.intp)
self.component_of_node = np.empty(self.num_nodes, dtype=np.intp)
self.candidate_neighbor = np.empty(self.num_points, dtype=np.intp)
self.candidate_point = np.empty(self.num_points, dtype=np.intp)
self.candidate_distance = np.empty(self.num_points, dtype=np.float64)
Member

Instead of using _initialize_components to fill them with for loops, we could use np.full here instead.

@jjerphan
Member

We discussed this PR during this month's meeting, mentioning that:

  • we need to perform benchmarks and comparisons of the various algorithms for MST and understand tradeoffs before merging this PR
  • improvements can be done in subsequent PRs

@Micky774
Contributor Author

I'll try to clean up this PR per your review sometime this week. I'll hopefully also be able to publish some MST benchmarks. Sorry for stepping away from this -- time has gotten really scarce 😅

@jjerphan
Member

No problem, Meekail. :)

3 participants