[MRG] Fix: SkLearn .score() method generating error with Dask DataFrames
#12462
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Fix: SkLearn .score() method generating error with Dask DataFrames
#12462
Conversation
Looks good thanks.
Please add a unit test under sklearn/utils/tests/test_validation.py to test _num_samples on some mock object where obj.shape[0] is not an int, and with a __len__ attribute to test this.
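A sketch of the requested test, assuming a pytest-style layout; the mock class and the stand-in `num_samples()` below are illustrative, not the actual `sklearn.utils.validation._num_samples` implementation:

```python
import numbers


class MockNonIntShape:
    """Mimics a Dask DataFrame: shape[0] is not a plain integer."""

    def __init__(self, n):
        self._n = n
        # Dask reports an unknown row count as float('nan'), so shape[0]
        # is not an instance of numbers.Integral.
        self.shape = (float('nan'), 3)

    def __len__(self):
        return self._n


def num_samples(x):
    # Simplified stand-in for the validation helper under discussion.
    if hasattr(x, 'shape') and x.shape is not None:
        if isinstance(x.shape[0], numbers.Integral):
            return x.shape[0]
        return len(x)  # fall back when shape[0] is not an integer
    return len(x)


def test_num_samples_non_integer_shape():
    assert num_samples(MockNonIntShape(5)) == 5
```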
Not sure if we should add a what's new entry with something about better compatibility with dask for the estimators affected.
```python
            raise TypeError("Singleton array %r cannot be considered"
                            " a valid collection." % x)
        return x.shape[0]
        # Check that shape is returning an integer or default to len
```
Please add a comment that shapes may in particular not be integer in dask.
So we are not officially supporting dask arrays as input, and a what's new entry may imply that they are supported; at the same time I'm not sure how to let users know that compatibility has experimentally improved (or if we want to do that).
I think if we want to add any support for dask DataFrame we should detect it in
@amueller I think that's a good suggestion as well. My main point was that Dask claims to play well with scikit-learn, and then the recent update broke things. Being more explicit with the user base that scikit-learn will try to work with Dask, but may be defeating the purpose, is really what needs to happen. I will attempt to add the warning to this PR as well.
To add the warning we would need to hard-code the dask dataframe type into
Ahh, good point. We don't want that dependency. In that case, I still think allowing scikit-learn to attempt to do something with Dask is worthwhile, even if it's not the ideal way to do things. So if no one objects, I'll work on adding a test and fixing the Python 2.7 failure (the failed continuous-integration test), which is choking on a sparse array case that I'm still trying to replicate locally.
You can check against the name of the class.
Maybe duck-typing could work:

```python
if hasattr(obj, '__dask_graph__'):
    # we have a dask object?
```

In the end it shouldn't matter too much whether it's dask DataFrames or arrays. There are a bunch of attributes that are exposed, though I'm not sure how robust that is, or whether there are any other attributes we could use to reliably detect dask objects without importing it. cc @TomAugspurger
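For reference, both detection strategies floated here can be sketched without importing dask; the helper names below are made up for illustration:

```python
def looks_like_dask(obj):
    # Duck-typing: dask collections expose the __dask_graph__ protocol
    # attribute, which pandas and numpy objects do not.
    return hasattr(obj, '__dask_graph__')


def is_dask_by_name(obj):
    # Name-based check, as suggested above: inspect the class's module
    # path rather than the type itself, so no dask import is needed.
    return type(obj).__module__.startswith('dask.')
```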
sklearn/utils/validation.py
Outdated
| " a valid collection." % x) | ||
| return x.shape[0] | ||
| # Check that shape is returning an integer or default to len | ||
| if isinstance(x.shape[0], int): |
So it looks like on Python 2.7 Windows 32-bit, we have `isinstance(x.shape[0], int)` is False for sparse arrays. Could it be `np.int` there for some reason?

```python
>>> isinstance(np.int, int)
False
```
No, actually it works:

```python
>>> isinstance(np.int(42), int)
True
```

but I'm not sure if it does for all numpy versions and environments.
who cares about legacy python? Or did you want to backport to 0.20.1? Though I guess there is an argument for doing that?
I'd recommend `isinstance(x.shape[0], numbers.Integral)`.
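For context, a quick illustration of why `numbers.Integral` is the safer check here (assuming NumPy is available; whether a NumPy scalar also passes a bare `int` check varies by platform and version):

```python
import numbers

import numpy as np

n = np.int64(42)

# NumPy integer scalars are registered with the numbers.Integral ABC:
print(isinstance(n, numbers.Integral))  # True

# Whether they also pass a plain `int` check depends on the platform,
# Python version, and NumPy version, which is why the bare check failed
# on the 32-bit Windows / Python 2.7 CI build.
print(isinstance(n, int))
```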
who cares about legacy python?
I'm not sure if it's due to Python 2, the numpy version, or the fact that the system is 32-bit; `numbers.Integral` is probably indeed safer.
dask checks with:

```python
def is_dask_collection(x):
    """Returns ``True`` if ``x`` is a dask collection"""
    try:
        return x.__dask_graph__() is not None
    except (AttributeError, TypeError):
        return False
```

I'd recommend avoiding warnings / checking for dask objects though. If the user has a small dask dataframe, then it'll behave just like a pandas DataFrame. If a user has a large dask dataframe, well, they'll find out soon enough anyway :)
Forgive my neediness; I just want to make sure I'm following (first time PRing on a larger project) and that what I'm doing is useful. Before marking for MRG I should:

Correct?
…d add test for getting object length if shape is non-numeric
.score() method generating error with Dask DataFrames
amueller left a comment
looks good.
A brief changelog entry in doc/whats_new/v0.20.rst under 0.20.1? |
@jnothman Is this something I need to do? Sorry, but I don't understand what is being asked here. Thanks! |
Please add an entry to the change log at |
doc/whats_new/v0.20.rst
Outdated
```rst
   happens immediately (i.e., without a deprecation cycle).
   :issue:`11741` by `Olivier Grisel`_.

- |Fix| Fixed a bug in validation.py where :func:`_num_samples()` would not properly
```
We usually avoid referencing private functions here. How about just "a bug in validation helpers where passing a Dask DataFrame would result in an error"?
doc/whats_new/v0.20.rst
Outdated
```rst
- |Fix| Fixed a bug in validation.py where :func:`_num_samples()` would not properly
  handle checking the shape of Dask DataFrames after an update to Dask.
  :issue:`12462` by :user:`zwmiller`
```
Usually we'd state your real name too
Got it. Sorry for the hand-holding needed.
Thanks @ZWMiller!
* upstream/master:
  - joblib 0.13.0 (scikit-learn#12531)
  - DOC tweak KMeans regarding cluster_centers_ convergence (scikit-learn#12537)
  - DOC (0.21) Make sure plot_tree docs are generated and fix link in whatsnew (scikit-learn#12533)
  - ALL Add HashingVectorizer to __all__ (scikit-learn#12534)
  - BLD we should ensure continued support for joblib 0.11 (scikit-learn#12350)
  - fix typo in whatsnew
  - Fix dead link to numpydoc (scikit-learn#12532)
  - [MRG] Fix segfault in AgglomerativeClustering with read-only mmaps (scikit-learn#12485)
  - MNT (0.21) OPTiCS change the default `algorithm` to `auto` (scikit-learn#12529)
  - FIX SkLearn `.score()` method generating error with Dask DataFrames (scikit-learn#12462)
  - MNT KBinsDiscretizer.transform should not mutate _encoder (scikit-learn#12514)
…ybutton * upstream/master:
  - FIX YeoJohnson transform lambda bounds (scikit-learn#12522)
  - [MRG] Additional Warnings in case OpenML auto-detected a problem with dataset (scikit-learn#12541)
  - ENH Prefer threads for IsolationForest (scikit-learn#12543)
  - joblib 0.13.0 (scikit-learn#12531)
  - DOC tweak KMeans regarding cluster_centers_ convergence (scikit-learn#12537)
  - DOC (0.21) Make sure plot_tree docs are generated and fix link in whatsnew (scikit-learn#12533)
  - ALL Add HashingVectorizer to __all__ (scikit-learn#12534)
  - BLD we should ensure continued support for joblib 0.11 (scikit-learn#12350)
  - fix typo in whatsnew
  - Fix dead link to numpydoc (scikit-learn#12532)
  - [MRG] Fix segfault in AgglomerativeClustering with read-only mmaps (scikit-learn#12485)
  - MNT (0.21) OPTiCS change the default `algorithm` to `auto` (scikit-learn#12529)
  - FIX SkLearn `.score()` method generating error with Dask DataFrames (scikit-learn#12462)
  - MNT KBinsDiscretizer.transform should not mutate _encoder (scikit-learn#12514)
…Frames (scikit-learn#12462)" This reverts commit 2f5907d.
update _num_samples to check the type of shape before using it as reference for size of objects
Reference Issues/PRs
Fixes #12461
What does this implement/fix? Explain your changes.
In utils/validation.py, `_num_samples()` attempts to use the `.shape` attribute as a stand-in for the size of an array if `.shape` is available. Dask has updated their dataframes to have a `.shape` attribute, but that attribute does not conform to the format `(int, int, int,)`, and as such any sklearn method that relies on `_num_samples()` crashes. This PR simply adds a check to see whether the output of `.shape[0]` is an instance of int, and if not defaults to computing the `len()` of the object.

Any other comments?
Upon running `make` and `make flake8-diff` I see no failed tests (though several skipped tests do occur). The [WIP] is because I'm willing to add a test to make sure that the behavior passes with Dask DataFrames in future updates, but am unsure how to add a test properly in the scikit-learn schema. If no test is needed, this should be ready for merge.
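The check described in this summary can be sketched as follows (using `numbers.Integral`, as recommended in review, rather than a bare `int` check; the function name is illustrative, not the actual `_num_samples`):

```python
import numbers


def num_samples_sketch(x):
    # Illustrative stand-in for the validation helper: trust shape[0]
    # only when it is a real integer. Dask DataFrames report a
    # non-integer shape[0] (e.g. float('nan') for an unknown row count),
    # in which case we fall back to len().
    if hasattr(x, 'shape') and x.shape is not None:
        first = x.shape[0]
        if isinstance(first, numbers.Integral):
            return first
        return len(x)
    return len(x)
```

For example, an object whose `shape[0]` is `7` returns `7` directly, while one whose `shape[0]` is `float('nan')` but that defines `__len__` falls back to its length.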