[MRG] Fix: SkLearn `.score()` method generating error with Dask DataFrames #12462

ZWMiller · 2018-10-25T21:53:00Z

update _num_samples to check the type of shape before using it as reference for size of objects

Reference Issues/PRs

What does this implement/fix? Explain your changes.

In utils/validation.py _num_samples() attempts to use the .shape attribute as a stand in for the size of an array if .shape is available. Dask has updated their dataframes to have a .shape attribute, but that attribute does not conform to the format (int, int, int,) and as such any sklearn method that relies on _num_samples() crashes. This PR simply adds a check to see if the output of .shape[0] contains an instance of int, and if not defaults to computing the len() of the object.

Any other comments?

Upon running make and make flake8-diff I see no failed tests (though several skipped tests do occur). The [WIP] is because I'm willing to add a test to make sure that the behavior passes with Dask DataFrames in future updates, but am unsure how to add a test properly in the SkLearn schema. If no test is needed, this should be ready for merge.

…erence for size of objects

rth

Looks good thanks.

Please add a unit test under sklearn/utils/tests/test_validation.py to test _num_samples on some mock object where obj.shape[0] is not an int, and with a __len__ attribute to test this.

Not sure if we should a what's new with something about a better compatibly with dask for the estimators affected..

rth · 2018-10-26T09:18:00Z

sklearn/utils/validation.py

            raise TypeError("Singleton array %r cannot be considered"
                            " a valid collection." % x)
-        return x.shape[0]
+        # Check that shape is returning an integer or default to len


Please add a comment that shapes may in particular not be integer in dask.

rth · 2018-10-26T09:24:17Z

Not sure if we should a what's new with something about a better compatibly with dask for the estimators affected..

So we are not officially supporting dask arrays as input and a what's new may imply that it is supported, at the same time I'm not sure how to let users know that experimentally compatibility improved (or if we want to do that).

amueller · 2018-10-26T15:14:19Z

I think if we want to add any support for dask DataFrame we should detect it in check_array and throw a warning saying that we're materializing it and there's no point in this.

ZWMiller · 2018-10-26T15:18:52Z

@amueller I think that's a good suggestions as well. My main point was that Dask claims to play with SkLearn and then the recent update broke things; being more explicit that SkLearn will try to work with Dask, but is defeating the purpose, is really what needs to happen to be more explicit to the user base. I will attempt to add the warning to this PR as well.

amueller · 2018-10-26T15:20:58Z

To add the warning we would need to hard-code the dask dataframe type into check_array though :-/

ZWMiller · 2018-10-26T15:24:55Z

Ahh good point. Don't want that dependency. In that case, I still think allowing SkLearn to attempt to do something with Dask is worthwhile even if it's not the ideal way to do things. So if no one objects, I'll work on adding a test and fixing the Python2.7 version (the failed continuous-integration test) which is choking on a sparse array case (which I'm still trying to replicate locally).

amueller · 2018-10-26T15:26:43Z

you can check against the name of the class str(type(X)) I think or X.__class__.__name__ without adding a dependency. It's not what I'd call elegant though ;)

rth · 2018-10-26T15:31:42Z

Maybe duck-typing could work,

if hasattr(obj, '__dask_graph__'):
   # we have a dask object

? In the end it shouldn't matter too much if it's dask DataFrames or arrays.

There is a bunch of attributes that are exposed,

df.DataFrame.__dask_graph__(
df.DataFrame.__dask_keys__(
df.DataFrame.__dask_optimize__( 
df.DataFrame.__dask_postcompute__(
df.DataFrame.__dask_postpersist__(
df.DataFrame.__dask_scheduler__(
df.DataFrame.__dask_tokenize__(

though I'm not sure how robust that is, or if there is any other attributes we could use to reliably detect dask objects without importing it cc @TomAugspurger

rth · 2018-10-26T15:37:01Z

sklearn/utils/validation.py

                            " a valid collection." % x)
-        return x.shape[0]
+        # Check that shape is returning an integer or default to len
+        if isinstance(x.shape[0], int):


So it looks like on Python 2.7 Windows 32 bit, we have isinstance(x.shape[0], int) is False for sparse arrays. Could it be np.int there for some reason?

>>> isinstance(np.int, int) False

No actually it works,

>>> isinstance(np.int(42), int) True

but I'm not sure if does for all numpy versions and environements..

who cares about legacy python? Or did you want to backport to 0.20.1? Though I guess there is an argument for doing that?

I'd recommend isinstance(x.shape[0], numbers.Integral).

who cares about legacy python?

I'm not sure if it's due to python 2, numpy version or the fact that the system is 32 bit: numbers.Integral is probably indeed safer..

TomAugspurger · 2018-10-26T15:41:07Z

dask checks with dask.is_dask_collection

def is_dask_collection(x):
    """Returns ``True`` if ``x`` is a dask collection"""
    try:
        return x.__dask_graph__() is not None
    except (AttributeError, TypeError):
        return False

I'd recommend avoiding warnings / checking for dask objects though. If the user has a small dask dataframe, then it'll behave just like a pandas DataFrame. If a user has a large dask dataframe, well they'll find out soon enough anyway :)

…dask_dataframes

ZWMiller · 2018-10-26T15:55:05Z

Forgive my neediness, just want to make sure I'm following (first time PRing on a larger project) want to make sure what I'm doing is useful:

Before marking for MRG I should:

Update the int check to use numbers.Integral
Add a test that builds a Test Object that has non-numeric shape attribute and a __len__ method, and make sure that if defaults to len if shape is non-numeric with an assert
Make sure that it passes the test locally, then update my fork so that this picks up the new version and re-runs the tests

Correct?

…d add test for getting object length if shape is non-numeric

amueller

looks good.

jnothman · 2018-10-30T11:18:33Z

A brief changelog entry in doc/whats_new/v0.20.rst under 0.20.1?

ZWMiller · 2018-11-04T15:42:28Z

@jnothman Is this something I need to do? Sorry, but I don't understand what is being asked here. Thanks!

jnothman · 2018-11-05T22:39:28Z

Please add an entry to the change log at doc/whats_new/v0.20.rst. Put it in the version 0.20.1 section, under Miscellaneous, and prefixed by |Fix|. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:

jnothman · 2018-11-06T03:03:22Z

doc/whats_new/v0.20.rst

  happens immediately (i.e., without a deprecation cycle).
  :issue:`11741` by `Olivier Grisel`_.

+- |Fix| Fixed a bug in validation.py where :func: `_num_samples()` would not properly


We usually avoid referencing private functions here. How about just "a bug in validation helpers where passing a Dask DataFrame would result in an error"?

jnothman · 2018-11-06T03:03:34Z

doc/whats_new/v0.20.rst


+- |Fix| Fixed a bug in validation.py where :func: `_num_samples()` would not properly
+  handle checking the shape of Dask DataFrames after an update to Dask. 
+  :issue:`12462` by :user:`zwmiller`


Usually we'd state your real name too

Got it. Sorry for the hand-holding needed.

jnothman · 2018-11-06T10:46:16Z

Thanks @ZWMiller!

* upstream/master: joblib 0.13.0 (scikit-learn#12531) DOC tweak KMeans regarding cluster_centers_ convergence (scikit-learn#12537) DOC (0.21) Make sure plot_tree docs are generated and fix link in whatsnew (scikit-learn#12533) ALL Add HashingVectorizer to __all__ (scikit-learn#12534) BLD we should ensure continued support for joblib 0.11 (scikit-learn#12350) fix typo in whatsnew Fix dead link to numpydoc (scikit-learn#12532) [MRG] Fix segfault in AgglomerativeClustering with read-only mmaps (scikit-learn#12485) MNT (0.21) OPTiCS change the default `algorithm` to `auto` (scikit-learn#12529) FIX SkLearn `.score()` method generating error with Dask DataFrames (scikit-learn#12462) MNT KBinsDiscretizer.transform should not mutate _encoder (scikit-learn#12514)

…ybutton * upstream/master: FIX YeoJohnson transform lambda bounds (scikit-learn#12522) [MRG] Additional Warnings in case OpenML auto-detected a problem with dataset (scikit-learn#12541) ENH Prefer threads for IsolationForest (scikit-learn#12543) joblib 0.13.0 (scikit-learn#12531) DOC tweak KMeans regarding cluster_centers_ convergence (scikit-learn#12537) DOC (0.21) Make sure plot_tree docs are generated and fix link in whatsnew (scikit-learn#12533) ALL Add HashingVectorizer to __all__ (scikit-learn#12534) BLD we should ensure continued support for joblib 0.11 (scikit-learn#12350) fix typo in whatsnew Fix dead link to numpydoc (scikit-learn#12532) [MRG] Fix segfault in AgglomerativeClustering with read-only mmaps (scikit-learn#12485) MNT (0.21) OPTiCS change the default `algorithm` to `auto` (scikit-learn#12529) FIX SkLearn `.score()` method generating error with Dask DataFrames (scikit-learn#12462) MNT KBinsDiscretizer.transform should not mutate _encoder (scikit-learn#12514)

…cikit-learn#12462)

…Frames (scikit-learn#12462)" This reverts commit 2f5907d.

…cikit-learn#12462)

update _num_samples to check the type of shape before using it as ref…

cbec1dc

…erence for size of objects

rth reviewed Oct 26, 2018

View reviewed changes

Merge remote-tracking branch 'upstream/master' into fix_scoring_with_…

f4a393e

…dask_dataframes

ZWMiller added 3 commits October 26, 2018 11:23

update _num_samples to use numbers.Integral when checking shape[0] an…

bad30cd

…d add test for getting object length if shape is non-numeric

fix declaration in test to match PEP8 standard

f9def0a

add appropriate white space between functions to match pep

78cde4d

ZWMiller changed the title ~~[WIP] Fix: SkLearn .score() method generating error with Dask DataFrames~~ [MRG] Fix: SkLearn .score() method generating error with Dask DataFrames Oct 26, 2018

amueller approved these changes Oct 29, 2018

View reviewed changes

update doc/whats_new/0.20 with Fix notification for issue 12462

68338b3

jnothman reviewed Nov 6, 2018

View reviewed changes

update doc/whats_new/0.20 with Fix notification for issue 12462

2130ef9

jnothman approved these changes Nov 6, 2018

View reviewed changes

jnothman merged commit 5d8dfc9 into scikit-learn:master Nov 6, 2018

thoo pushed a commit to thoo/scikit-learn that referenced this pull request Nov 14, 2018

FIX SkLearn .score() method generating error with Dask DataFrames (s…

812ea7f

…cikit-learn#12462)

thoo pushed a commit to thoo/scikit-learn that referenced this pull request Nov 14, 2018

FIX SkLearn .score() method generating error with Dask DataFrames (s…

4777943

…cikit-learn#12462)

jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Nov 14, 2018

FIX SkLearn .score() method generating error with Dask DataFrames (s…

7f27781

…cikit-learn#12462)

jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Nov 14, 2018

FIX SkLearn .score() method generating error with Dask DataFrames (s…

aba21e8

…cikit-learn#12462)

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

FIX SkLearn .score() method generating error with Dask DataFrames (s…

2f5907d

…cikit-learn#12462)

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

Revert "FIX SkLearn .score() method generating error with Dask Data…

6f97902

…Frames (scikit-learn#12462)" This reverts commit 2f5907d.

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

Revert "FIX SkLearn .score() method generating error with Dask Data…

590fcb1

…Frames (scikit-learn#12462)" This reverts commit 2f5907d.

koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019

FIX SkLearn .score() method generating error with Dask DataFrames (s…

f2b804d

…cikit-learn#12462)

Uh oh!

[MRG] Fix: SkLearn .score() method generating error with Dask DataFrames #12462

[MRG] Fix: SkLearn .score() method generating error with Dask DataFrames #12462

Uh oh!

Conversation

ZWMiller commented Oct 25, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

Uh oh!

rth left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rth commented Oct 26, 2018

Uh oh!

amueller commented Oct 26, 2018

Uh oh!

ZWMiller commented Oct 26, 2018

Uh oh!

amueller commented Oct 26, 2018

Uh oh!

ZWMiller commented Oct 26, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

amueller commented Oct 26, 2018

Uh oh!

rth commented Oct 26, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TomAugspurger commented Oct 26, 2018

Uh oh!

ZWMiller commented Oct 26, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

amueller left a comment

Choose a reason for hiding this comment

Uh oh!

jnothman commented Oct 30, 2018

Uh oh!

ZWMiller commented Nov 4, 2018

Uh oh!

jnothman commented Nov 5, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jnothman commented Nov 6, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

[MRG] Fix: SkLearn `.score()` method generating error with Dask DataFrames #12462

[MRG] Fix: SkLearn `.score()` method generating error with Dask DataFrames #12462

ZWMiller commented Oct 25, 2018 •

edited

Loading

rth left a comment •

edited

Loading

ZWMiller commented Oct 26, 2018 •

edited

Loading

rth commented Oct 26, 2018 •

edited

Loading

ZWMiller commented Oct 26, 2018 •

edited

Loading