Conversation

@viclafargue
Contributor

Closes #7389

@viclafargue requested a review from a team as a code owner October 27, 2025 11:19
@viclafargue requested a review from dantegd October 27, 2025 11:19
@github-actions bot added the Cython / Python label Oct 27, 2025

# Skip this check if running in multigpu mode. In that case we don't care if
# a single partition has fewer rows than clusters
if not multigpu and n_rows < self.n_clusters:
Member

While you're here - I'm not 100% sure if this check should be skipped in multi-gpu execution. When refactoring I excluded it in multi-gpu since we weren't running it there before.

Is the multi-gpu implementation robust to a single node having fewer rows than the requested n_clusters? It doesn't seem to error when invoked in that setup, but I'm also not sure if it provides good results.

Contributor Author

@viclafargue Oct 27, 2025

I think the check was missing in the multi-GPU implementation. I'm not sure either, but this should be a rare case, and it probably would not yield good results, especially with scalable/parallel kmeans++ initialization. Better safe than sorry: we should probably alert the user in this case.
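
For illustration, here is a minimal sketch of what applying the row-count check in every mode could look like, i.e. the condition from the snippet under review without the `not multigpu` guard. The helper name `check_enough_rows` and the error message are assumptions made for this example; they are not taken from the cuML code base.

```python
def check_enough_rows(n_rows: int, n_clusters: int) -> None:
    """Hypothetical helper: reject inputs with fewer rows than clusters."""
    # Applied unconditionally, so a multi-GPU partition with too few rows
    # raises an error instead of silently proceeding.
    if n_rows < n_clusters:
        raise ValueError(
            f"n_rows={n_rows} must be >= n_clusters={n_clusters}"
        )


# A partition with enough rows passes silently.
check_enough_rows(n_rows=100, n_clusters=8)
```

Whether an error or a warning is the right way to alert the user is a design choice this sketch does not settle.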

Member

@divyegala left a comment

Maybe also add a check that oversampling_factor > 0?
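
A sketch of the suggested validation, assuming it is a standalone check on the `oversampling_factor` parameter. The helper name and message are hypothetical and only illustrate the idea.

```python
def check_oversampling_factor(oversampling_factor: float) -> None:
    """Hypothetical helper: require a strictly positive oversampling factor."""
    # `not x > 0` also rejects NaN, which `x <= 0` would let through.
    if not oversampling_factor > 0:
        raise ValueError(
            f"oversampling_factor={oversampling_factor} must be > 0"
        )


# A positive factor passes silently.
check_oversampling_factor(2.0)
```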

@viclafargue force-pushed the init-checks-dask-kmeans branch from aaa89f5 to be8c66b October 31, 2025 13:33
@viclafargue added the bug and non-breaking labels Oct 31, 2025
@viclafargue
Contributor Author

/merge

@rapids-bot bot merged commit 550aba7 into rapidsai:main Oct 31, 2025
103 checks passed
@jcrist mentioned this pull request Oct 31, 2025
rapids-bot bot pushed a commit that referenced this pull request Oct 31, 2025
During a recent refactor we removed the `KMeansMG` class, viewing it as internal. It turns out this class was used by a few external projects.

Since we still need to support external users accessing the non-dask multi-gpu implementation, we'll want a public way to do so that isn't the private `_fit` method. Additionally, since we want to special case the `MG` case a little more, making it a separate class (even if as a thin shim) makes sense.

This PR:

- Brings back the `KMeansMG` class
- Adds a check that `random_state` is non-None in the `KMeansMG` case, ensuring external users also set `random_state` properly
- Removes mutation of kwargs in the dask `KMeans` case (as suggested [here](#7417 (comment)))
- Simplifies and moves the multi-gpu `kmeans++`/`oversampling_factor` check (as suggested [here](#7391 (comment)))

Fixes #7387.
Fixes #7389.
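
As a rough sketch of the "thin shim" idea with the `random_state` check described in the list above: the classes below are simplified stand-ins written for this example, not the real cuML estimators. The `_fit(X, multigpu=...)` signature and the `fitted_multigpu_` attribute are assumptions.

```python
class KMeansSketch:
    """Hypothetical stand-in for the single-GPU estimator."""

    def __init__(self, n_clusters=8, random_state=None):
        self.n_clusters = n_clusters
        self.random_state = random_state

    def _fit(self, X, multigpu=False):
        # Placeholder for the real fitting logic; only records the mode used.
        self.fitted_multigpu_ = multigpu
        return self

    def fit(self, X):
        return self._fit(X, multigpu=False)


class KMeansMGSketch(KMeansSketch):
    """Hypothetical thin shim exposing the multi-GPU path publicly."""

    def fit(self, X):
        # Require an explicit seed in the multi-GPU case.
        if self.random_state is None:
            raise ValueError("random_state must be set for multi-GPU KMeans")
        return self._fit(X, multigpu=True)


model = KMeansMGSketch(n_clusters=2, random_state=42).fit([[0.0], [1.0]])
```

Keeping the shim as a subclass gives external callers a public entry point without reaching into the private `_fit` method.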

Authors:
  - Jim Crist-Harif (https://github.com/jcrist)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #7420

Labels

bug: Something isn't working
Cython / Python: Cython or Python issue
non-breaking: Non-breaking change

Development

Successfully merging this pull request may close these issues.

[BUG] Multi Node KMeans result doesn't match Single Node

3 participants