MNT Improved rigor of HDBSCAN tests using Fowlkes-Mallows score #27571
LGTM
What would you think of keeping the n_clusters check and adding the Fowlkes-Mallows check on top of it? I can imagine that you can have a close to 1 Fowlkes-Mallows score while not having the same number of clusters, e.g.:
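As a rough illustration of that concern (made-up labels, not taken from the PR), the sketch below counts point pairs by hand with a hypothetical helper, `pairwise_fowlkes_mallows`, rather than calling `sklearn.metrics.fowlkes_mallows_score`. Splitting a single point off into its own cluster changes the cluster count while barely moving the score:

```python
from itertools import combinations
from math import sqrt


def pairwise_fowlkes_mallows(labels_true, labels_pred):
    """Fowlkes-Mallows score via brute-force pair counting (illustrative helper)."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1  # pair grouped together in both labelings
        elif same_pred:
            fp += 1  # pair grouped together in the prediction only
        elif same_true:
            fn += 1  # pair grouped together in the reference only
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))


# Two reference clusters of 50 points each (made-up data).
labels_true = [0] * 50 + [1] * 50
# The prediction splits one point off into a third cluster.
labels_pred = [0] * 50 + [1] * 49 + [2]

score = pairwise_fowlkes_mallows(labels_true, labels_pred)
n_true = len(set(labels_true))
n_pred = len(set(labels_pred))
print(f"score={score:.4f}, clusters: {n_true} vs {n_pred}")
```

Here the score stays at roughly 0.99 even though the prediction has three clusters instead of two, so a score threshold alone would not catch the extra cluster.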
Sure, I think it wouldn't hurt! I'll update the PR.
@@ -208,28 +210,14 @@ def test_dbscan_clustering_outlier_data(cut_distance):
    assert_array_equal(clean_labels, labels[clean_idx])


def test_hdbscan_high_dimensional():
Just curious, can you say a bit about why you removed this test? Is this case already covered by other tests?
This test was left over from the original HDBSCAN implementation, when algorithm="best" would result in automatically dispatching to a BallTree-based Prim's algorithm routine if:
- The metric was not supported by KDTree (this ensures a BallTree-based method), and
- The dimensionality of the data was >60 (this ensures a Prim's MST routine, as opposed to Boruvka)
Our current implementation does not have any dispatching between Prim's and Boruvka that would require this test. Even in my open PR adding Boruvka algorithms, the dispatching mechanism is different. Hence this test doesn't cover anything unique/interesting in our implementation and is a vestige of the original.
In reality, it should have been removed before the initial scikit-learn release of HDBSCAN when I had originally factored out the Boruvka algorithms for simplicity, but I missed it back then 😅
OK thanks for the details, merging this one!
Reference Issues/PRs
What does this implement/fix? Explain your changes.
Instead of relying on the number of clusters as a score for correctness, we now leverage the Fowlkes-Mallows score, which is label-permutation invariant and allows us greater rigor.
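A minimal sketch of why permutation invariance matters here, using made-up labels and a pure-Python score built from contingency counts. The helper names (`fowlkes_mallows`, `check_label_quality`) and the 0.99 threshold are assumptions for illustration, not necessarily what the test suite uses. In scikit-learn itself the score is available as `sklearn.metrics.fowlkes_mallows_score`.

```python
from collections import Counter
from math import sqrt


def fowlkes_mallows(labels_true, labels_pred):
    """Fowlkes-Mallows score from contingency counts (pure-Python sketch)."""

    def n_pairs(counts):
        # Number of unordered point pairs that share a group.
        return sum(c * (c - 1) // 2 for c in counts)

    tp = n_pairs(Counter(zip(labels_true, labels_pred)).values())
    same_true = n_pairs(Counter(labels_true).values())
    same_pred = n_pairs(Counter(labels_pred).values())
    if tp == 0:
        return 0.0
    return tp / sqrt(same_true * same_pred)


def check_label_quality(labels_true, labels_pred, n_clusters, threshold=0.99):
    """Hypothetical test helper: cluster-count check plus score check."""
    # HDBSCAN marks noise with -1; it is not counted as a cluster here.
    found = len(set(labels_pred) - {-1})
    assert found == n_clusters
    assert fowlkes_mallows(labels_true, labels_pred) > threshold


labels_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
# Same partition of the points, but the cluster ids are permuted.
labels_pred = [2, 2, 2, 0, 0, 0, 1, 1, 1]

# An element-wise comparison fails even though the grouping is identical.
exact_match = labels_true == labels_pred
score = fowlkes_mallows(labels_true, labels_pred)
check_label_quality(labels_true, labels_pred, n_clusters=3)
```

Because the score depends only on which points are grouped together, the relabeled prediction still scores a perfect match, while a direct comparison of the label lists would report a mismatch.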
This also removes test_hdbscan_high_dimensional, which is not necessary with the current API.
Any other comments?