-
Notifications
You must be signed in to change notification settings - Fork 599
Init checks for Dask KMeans #7391
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Init checks for Dask KMeans #7391
Conversation
python/cuml/cuml/cluster/kmeans.pyx
Outdated
|
|
||
| # Skip this check if running in multigpu mode. In that case we don't care if | ||
| # a single partition has fewer rows than clusters | ||
| if not multigpu and n_rows < self.n_clusters: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While you're here - I'm not 100% sure if this check should be skipped in multi-gpu execution. When refactoring I excluded it in multi-gpu since we weren't running it there before.
Is the multi-gpu implementation robust to a single node having fewer rows than the requested n_clusters? It doesn't seem to error when invoked in that setup, but I'm also not sure if it provides good results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the check was missing in the multi-GPU implementation. I do not know for sure either, but I guess that this is a rare case that should probably not yield very good results especially for scalable/parallel kmeans++ initialization. Better safe than sorry, we should probably alert the user in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also add a check that oversampling_factor > 0?
aaa89f5 to
be8c66b
Compare
|
/merge |
During a recent refactor we removed the `KMeansMG` class, viewing it as internal. It turns out this class was used by a few external projects. Since we still need to support external users accessing the non-dask multi-gpu implementation, we'll want a public way to do so that isn't the private `_fit` method. Additionally, since we want to special case the `MG` case a little more, making it a separate class (even if as a thin shim) makes sense. This PR: - Brings back the `KMeansMG` class - Adds a check that `random_state` is non-None in the `KMeansMG` case, ensuring external users also set `random_state` properly - Removes mutation of kwargs in the dask `KMeans` case (as suggested [here](#7417 (comment))) - Simplifies and moves the multi-gpu `kmeans++`/`oversampling_factor` check (as suggested [here](#7391 (comment))) Fixes #7387. Fixes #7389. Authors: - Jim Crist-Harif (https://github.com/jcrist) Approvers: - Dante Gama Dessavre (https://github.com/dantegd) URL: #7420
Closes #7389