MNT Improved rigor of HDBSCAN tests using Fowlkes-Mallows score #27571

Merged 2 commits on Oct 19, 2023

@Micky774 Micky774 commented Oct 11, 2023

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Instead of relying on the number of clusters as a measure of correctness, we now use the Fowlkes-Mallows score, which is invariant to label permutation and allows for greater rigor.
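
To make the label-permutation invariance concrete, here is a minimal pure-Python sketch of the pair-counting definition of the Fowlkes-Mallows index. It is not scikit-learn's implementation (the tests call `sklearn.metrics.fowlkes_mallows_score`); the helper name `pairwise_fowlkes_mallows` and the toy label lists are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pairwise_fowlkes_mallows(labels_true, labels_pred):
    """Fowlkes-Mallows index from explicit pair counts: TP / sqrt((TP+FP)(TP+FN))."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1  # pair grouped together in both labelings
        elif same_pred:
            fp += 1  # grouped together only in the prediction
        elif same_true:
            fn += 1  # grouped together only in the ground truth
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))


truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
renamed = [2, 2, 2, 0, 0, 0, 1, 1, 1]   # same partition, different label names
shuffled = [0, 1, 2, 0, 1, 2, 0, 1, 2]  # also three clusters, but wrong grouping

score_renamed = pairwise_fowlkes_mallows(truth, renamed)
score_shuffled = pairwise_fowlkes_mallows(truth, shuffled)
```

Both `renamed` and `shuffled` contain exactly three clusters, so a cluster-count check alone cannot tell them apart, while the pair-based score only looks at which points are grouped together.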

This also removes test_hdbscan_high_dimensional, which is not necessary with the current API.

Any other comments?

@Micky774 Micky774 added the "Quick Review" label (for PRs that are quick to review) Oct 11, 2023
github-actions bot commented Oct 11, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: b9a0534.

@OmarManzoor OmarManzoor left a comment

LGTM

@Micky774 Micky774 added the "Waiting for Second Reviewer" label (first reviewer is done, need a second one) Oct 12, 2023
lesteve commented Oct 16, 2023

What would you think of keeping the n_clusters check and adding the Fowlkes-Mallows check on top of it?

I can imagine getting a Fowlkes-Mallows score close to 1 without having the same number of clusters, e.g.:

In [12]: from sklearn.metrics import fowlkes_mallows_score

In [13]: fowlkes_mallows_score([0]*100, [0]*99 + [1])
Out[13]: 0.9899494936611666

@Micky774 (Contributor Author) commented

> What would you think of keeping the n_clusters check and adding the Fowlkes-Mallows check on top of it?

Sure, I think it wouldn't hurt! I'll update the PR.
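
A combined check along these lines might look like the sketch below. It is an illustration under stated assumptions, not the code that was merged: `labels_look_right`, its parameters, the 0.97 threshold, and the use of -1 as the noise label are all hypothetical, and `fmi_from_counts` is a dependency-free stand-in for `sklearn.metrics.fowlkes_mallows_score`.

```python
from collections import Counter
from math import comb, sqrt


def fmi_from_counts(labels_true, labels_pred):
    """Stand-in for sklearn.metrics.fowlkes_mallows_score, built from contingency counts."""
    # Pairs grouped together in both labelings, in the truth, and in the prediction.
    tk = sum(comb(n, 2) for n in Counter(zip(labels_true, labels_pred)).values())
    pk = sum(comb(n, 2) for n in Counter(labels_true).values())
    qk = sum(comb(n, 2) for n in Counter(labels_pred).values())
    return tk / sqrt(pk * qk) if tk else 0.0


def labels_look_right(labels, expected, n_clusters, threshold=0.97, noise_label=-1):
    """Hypothetical combined check: exact cluster count AND a high Fowlkes-Mallows score."""
    found = len(set(labels) - {noise_label})  # assumption: noise is labelled -1
    return found == n_clusters and fmi_from_counts(expected, labels) > threshold


# The example from the review comment: one stray point forms a second cluster.
expected = [0] * 100
predicted = [0] * 99 + [1]

score = fmi_from_counts(expected, predicted)
passes_combined = labels_look_right(predicted, expected, n_clusters=1)
```

Here the score alone would clear the threshold, but the cluster-count condition rejects the labeling, which is the failure mode the review comment points out.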

@@ -208,28 +210,14 @@ def test_dbscan_clustering_outlier_data(cut_distance):
    assert_array_equal(clean_labels, labels[clean_idx])


def test_hdbscan_high_dimensional():
Member

Just curious, can you say a bit why you removed this test, is this case already covered by other tests?

Contributor Author

This test was left over from the original HDBSCAN implementation, where algorithm="best" would automatically dispatch to a BallTree-based Prim's algorithm routine if:

  1. The metric was not supported by KDTree (this ensures a BallTree based method) and
  2. The dimensionality of the data was >60 (this ensures a Prim's MST routine, as opposed to Boruvka)

Our current implementation does not have any dispatching between Prim's and Boruvka that would require this test. Even in my open PR adding Boruvka algorithms, the dispatching mechanism is different. Hence this test doesn't cover anything unique/interesting in our implementation and is a vestige of the original.
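
For readers unfamiliar with the legacy behaviour, the two conditions above can be sketched roughly as follows. This is a hypothetical reconstruction based only on this comment, not code from either library: the function name, the returned strings, and the `KD_TREE_METRICS` set are assumptions, as is applying the same dimensionality threshold in the KDTree case.

```python
# Assumed, illustrative subset of metrics a KD-tree can handle (not an official list).
KD_TREE_METRICS = frozenset({"euclidean", "manhattan", "chebyshev", "minkowski"})
HIGH_DIM_THRESHOLD = 60  # from the comment above: dimensionality > 60 selects Prim's


def legacy_best_routine(metric, n_features):
    """Hypothetical sketch of the old algorithm="best" dispatch."""
    # Condition 1: a metric the KD-tree cannot handle forces a BallTree-based method.
    tree = "kd_tree" if metric in KD_TREE_METRICS else "ball_tree"
    # Condition 2: high-dimensional data selects Prim's MST rather than Boruvka.
    mst = "prims" if n_features > HIGH_DIM_THRESHOLD else "boruvka"
    return f"{mst}_{tree}"


choice_high_dim = legacy_best_routine("seuclidean", n_features=64)
choice_low_dim = legacy_best_routine("euclidean", n_features=2)
```

The removed test targeted the first case (a non-KDTree metric with more than 60 features), a branch that no longer exists once the dispatch is gone.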

In reality, it should have been removed before the initial scikit-learn release of HDBSCAN when I had originally factored out the Boruvka algorithms for simplicity, but I missed it back then 😅

Member

OK thanks for the details, merging this one!

@lesteve lesteve merged commit 0a31d59 into scikit-learn:main Oct 19, 2023
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Oct 31, 2023
REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023
@Micky774 Micky774 deleted the hdbscan_test_rigor branch August 13, 2024 01:40
Labels
module:cluster, No Changelog Needed, Quick Review, Waiting for Second Reviewer