MNT Improved rigor of HDBSCAN tests using Fowlkes-Mallows score #27571

Micky774 · 2023-10-11T18:00:57Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Instead of relying on the number of clusters as the measure of correctness, the tests now use the Fowlkes-Mallows score, which is invariant to label permutation and gives a more rigorous check.
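For readers unfamiliar with the score, here is a minimal from-scratch sketch of a pair-counting Fowlkes-Mallows computation. It is only an illustration and not scikit-learn's implementation; the function name `pairwise_fowlkes_mallows` and the example labels are made up for this sketch. It shows why renaming cluster labels leaves the score unchanged: only "same cluster or not" is compared for each pair of points.

```python
from itertools import combinations
from math import sqrt


def pairwise_fowlkes_mallows(labels_true, labels_pred):
    """Illustrative O(n^2) Fowlkes-Mallows score based on pair counting."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1  # pair grouped together in both labelings
        elif same_pred:
            fp += 1  # grouped together only in the prediction
        elif same_true:
            fn += 1  # grouped together only in the reference
    if tp == 0:
        return 0.0
    return tp / sqrt((tp + fp) * (tp + fn))


# Same partition, but with the cluster ids renamed.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_renamed = [2, 2, 2, 0, 0, 0, 1, 1, 1]
permuted_score = pairwise_fowlkes_mallows(y_true, y_renamed)  # 1.0
```

Because the score only looks at pairs, any relabeling of an identical partition scores exactly 1.0.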

This also removes test_hdbscan_high_dimensional, which is not necessary with the current API.

Any other comments?

github-actions · 2023-10-11T18:02:35Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: b9a0534. Link to the linter CI: here

OmarManzoor

LGTM

lesteve · 2023-10-16T14:38:05Z

What would you think of keeping the n_clusters check and adding the Fowlkes-Mallows check on top of it?

I can imagine that you can have a close to 1 Fowlkes-Mallows score while not having the same number of clusters, e.g.:

In [12]: from sklearn.metrics import fowlkes_mallows_score

In [13]: fowlkes_mallows_score([0]*100, [0]*99 + [1])
Out[13]: 0.9899494936611666

Micky774 · 2023-10-16T14:43:45Z

> What would you think of keeping the n_clusters check and adding the Fowlkes-Mallows check on top of it?

Sure, I think it wouldn't hurt! I'll update the PR.
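A rough sketch of what combining both checks could look like is below. The helper name `labels_look_right`, its parameters, and the threshold values are hypothetical and are not taken from the PR diff; the sketch also assumes that `-1` marks noise points. Only `fowlkes_mallows_score` is a real scikit-learn function.

```python
from sklearn.metrics import fowlkes_mallows_score


def labels_look_right(labels, y_true, expected_n_clusters, threshold=0.99):
    """Hypothetical helper: require both the cluster count and the score."""
    # Assumption: -1 is the noise label and does not count as a cluster.
    n_clusters = len(set(labels) - {-1})
    if n_clusters != expected_n_clusters:
        return False
    return fowlkes_mallows_score(y_true, labels) > threshold


# The counterexample from the review: a high score but the wrong cluster count.
y_true = [0] * 100
labels = [0] * 99 + [1]
score_only = fowlkes_mallows_score(y_true, labels) > 0.98
both_checks = labels_look_right(labels, y_true, expected_n_clusters=1, threshold=0.98)
```

With a threshold of 0.98, the score alone accepts this labeling while the combined check rejects it, which is the situation the extra n_clusters assertion guards against.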

lesteve · 2023-10-17T14:36:51Z

sklearn/cluster/tests/test_hdbscan.py

@@ -208,28 +210,14 @@ def test_dbscan_clustering_outlier_data(cut_distance):
    assert_array_equal(clean_labels, labels[clean_idx])


-def test_hdbscan_high_dimensional():


Just curious, can you say a bit about why you removed this test? Is this case already covered by other tests?

Micky774 · 2023-10-18

This test was left over from the original HDBSCAN implementation, where algorithm="best" would automatically dispatch to a BallTree-based Prim's algorithm routine if:

1. The metric was not supported by KDTree (this ensures a BallTree-based method), and

2. The dimensionality of the data was >60 (this ensures a Prim's MST routine, as opposed to Boruvka).

Our current implementation does not have any dispatching between Prim's and Boruvka that would require this test. Even in my open PR adding Boruvka algorithms, the dispatching mechanism is different. Hence this test doesn't cover anything unique/interesting in our implementation and is a vestige of the original.

In reality, it should have been removed before the initial scikit-learn release of HDBSCAN when I had originally factored out the Boruvka algorithms for simplicity, but I missed it back then 😅
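The legacy dispatch rule described above can be sketched roughly as follows. This is purely illustrative: the function name, the constant names, and the set of KDTree-supported metrics are assumptions made for this sketch and are not code from either the original HDBSCAN library or scikit-learn. Only the two conditions themselves come from the explanation above.

```python
# Assumed, incomplete subset of metrics a KD-tree can handle (illustration only).
ASSUMED_KD_TREE_METRICS = {"euclidean", "manhattan", "chebyshev", "minkowski"}

# Dimensionality above which the legacy code reportedly preferred Prim's.
HIGH_DIM_CUTOFF = 60


def legacy_best_picks_balltree_prims(metric, n_features):
    """Hypothetical predicate for the old algorithm="best" dispatch."""
    # Condition 1: metric unsupported by KDTree, so a BallTree is needed.
    needs_ball_tree = metric not in ASSUMED_KD_TREE_METRICS
    # Condition 2: more than 60 features, so Prim's is used instead of Boruvka.
    needs_prims = n_features > HIGH_DIM_CUTOFF
    return needs_ball_tree and needs_prims


high_dim_case = legacy_best_picks_balltree_prims("seuclidean", 64)
low_dim_case = legacy_best_picks_balltree_prims("seuclidean", 10)
```

Under these assumptions, the removed test exercised only the branch where both conditions hold, a branch that no longer exists once the dispatching is gone.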

lesteve · 2023-10-19

OK, thanks for the details, merging this one!

Improved rigor of tests usings folkes-mallowes score

624d897

Micky774 added the Quick Review label Oct 11, 2023

github-actions bot added the module:cluster label Oct 11, 2023

Micky774 added the No Changelog Needed label Oct 11, 2023

OmarManzoor approved these changes Oct 12, 2023

Micky774 added the Waiting for Second Reviewer label Oct 12, 2023

Updated test to reintroduce cluster count

b9a0534

lesteve reviewed Oct 17, 2023

lesteve merged commit 0a31d59 into scikit-learn:main Oct 19, 2023

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Oct 31, 2023

MNT Improved rigor of HDBSCAN tests using Fowlkes-Mallows score (scik…

c57e7ea

…it-learn#27571)

REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023

MNT Improved rigor of HDBSCAN tests using Fowlkes-Mallows score (scik…

18b8018

…it-learn#27571)

Micky774 deleted the hdbscan_test_rigor branch August 13, 2024 01:40
