Conversation

@ZWMiller (Contributor) commented Oct 25, 2018

Update _num_samples to check the type of .shape before using it as a reference for the size of objects.

Reference Issues/PRs

Fixes #12461

What does this implement/fix? Explain your changes.

In utils/validation.py, _num_samples() uses the .shape attribute as a stand-in for the size of an array whenever .shape is available. Dask has updated its DataFrames to expose a .shape attribute, but that attribute does not conform to the expected tuple-of-ints format, so any sklearn method that relies on _num_samples() crashes. This PR simply adds a check of whether .shape[0] is an integer and, if not, falls back to computing len() of the object.
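
For illustration, here is a minimal sketch of the check described above (not the exact diff; it mirrors the logic of _num_samples in utils/validation.py and uses numbers.Integral, which is what the review below converges on, rather than a plain int check):

import numbers

def _num_samples_sketch(x):
    # Sketch of the size check described above (illustrative only).
    if hasattr(x, 'shape'):
        if len(x.shape) == 0:
            raise TypeError("Singleton array %r cannot be considered"
                            " a valid collection." % x)
        # Dask DataFrames expose .shape, but shape[0] may not be an
        # integer, so only trust it when it actually is one.
        if isinstance(x.shape[0], numbers.Integral):
            return x.shape[0]
    return len(x)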

Any other comments?

Upon running make and make flake8-diff I see no failed tests (though several tests are skipped). The [WIP] is because I'm willing to add a test that verifies the behavior with Dask DataFrames in a future update, but I am unsure how to add a test properly within the scikit-learn conventions. If no test is needed, this should be ready for merge.

@rth (Member) left a comment

Looks good, thanks.

Please add a unit test under sklearn/utils/tests/test_validation.py that exercises _num_samples on some mock object where obj.shape[0] is not an int and which has a __len__ attribute.
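
A rough sketch of the kind of test being requested (the mock class and test names below are placeholders and may differ from what the PR actually adds):

from sklearn.utils.validation import _num_samples


class MockNonIntegerShape:
    # Mock whose shape[0] is not an int but which defines __len__.

    def __init__(self, n):
        self._n = n
        self.shape = ("not-an-int", 1)  # shape[0] is deliberately non-numeric

    def __len__(self):
        return self._n


def test_num_samples_falls_back_to_len():
    # With the fix, _num_samples should ignore the bogus shape[0]
    # and fall back to len().
    assert _num_samples(MockNonIntegerShape(3)) == 3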

Not sure if we should add a what's new entry about better compatibility with dask for the affected estimators.

raise TypeError("Singleton array %r cannot be considered"
" a valid collection." % x)
return x.shape[0]
# Check that shape is returning an integer or default to len
Member

Please add a comment noting that, in particular, shapes may not be integers in dask.

@rth (Member) commented Oct 26, 2018

Not sure if we should add a what's new entry about better compatibility with dask for the affected estimators.

We are not officially supporting dask arrays as input, and a what's new entry may imply that they are supported; at the same time, I'm not sure how to let users know that compatibility has experimentally improved (or whether we want to do that).

@amueller (Member)

I think if we want to add any support for dask DataFrame we should detect it in check_array and throw a warning saying that we're materializing it and there's no point in this.

@ZWMiller (Contributor, Author)

@amueller I think that's a good suggestion as well. My main point was that Dask claims to work with scikit-learn, and the recent update broke things; being explicit to the user base that scikit-learn will try to work with Dask, but that doing so defeats the purpose, is really what needs to happen. I will attempt to add the warning to this PR as well.

@amueller (Member)

To add the warning we would need to hard-code the dask dataframe type into check_array though :-/

@ZWMiller (Contributor, Author) commented Oct 26, 2018

Ahh, good point. We don't want that dependency. In that case, I still think allowing scikit-learn to attempt to do something with Dask is worthwhile, even if it's not the ideal way to do things. So if no one objects, I'll work on adding a test and fixing the Python 2.7 failure (the failed continuous-integration test), which is choking on a sparse array case that I'm still trying to replicate locally.

@amueller (Member)

You can check against the name of the class, str(type(X)) I think, or X.__class__.__name__, without adding a dependency. It's not what I'd call elegant though ;)
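
A minimal sketch of that class-name approach combined with the warning proposed above (the helper names are hypothetical, the module prefix checked is an assumption about dask's layout, and the thread below ultimately leans against adding such a warning):

import warnings

def _looks_like_dask_dataframe(X):
    # Heuristic: match on the class name so dask never has to be imported.
    klass = type(X)
    return klass.__name__ == "DataFrame" and klass.__module__.startswith("dask")

def _maybe_warn_about_dask(X):
    if _looks_like_dask_dataframe(X):
        warnings.warn("A dask DataFrame was passed; scikit-learn will "
                      "materialize it in memory, which defeats the purpose "
                      "of using dask.", UserWarning)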

@rth (Member) commented Oct 26, 2018

Maybe duck-typing could work,

if hasattr(obj, '__dask_graph__'):
   # we have a dask object

? In the end it shouldn't matter too much if it's dask DataFrames or arrays.

There are a bunch of attributes that are exposed:

df.DataFrame.__dask_graph__(
df.DataFrame.__dask_keys__(
df.DataFrame.__dask_optimize__( 
df.DataFrame.__dask_postcompute__(
df.DataFrame.__dask_postpersist__(
df.DataFrame.__dask_scheduler__(
df.DataFrame.__dask_tokenize__(

though I'm not sure how robust that is, or if there are any other attributes we could use to reliably detect dask objects without importing it. cc @TomAugspurger
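
For completeness, a tiny sketch of that duck-typed check (the helper name is hypothetical); it could replace the class-name comparison in the warning sketch above without importing dask:

def _is_dask_collection_like(X):
    # Duck typing: dask collections expose __dask_graph__, so checking for
    # that attribute avoids both an import and hard-coded class names.
    return hasattr(X, "__dask_graph__") and callable(X.__dask_graph__)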

" a valid collection." % x)
return x.shape[0]
# Check that shape is returning an integer or default to len
if isinstance(x.shape[0], int):
Member

So it looks like on Python 2.7 on 32-bit Windows, isinstance(x.shape[0], int) is False for sparse arrays. Could it be np.int there for some reason?

>>> isinstance(np.int, int)
False

Member

No, actually it works,

>>> isinstance(np.int(42), int)
True

but I'm not sure if it does for all numpy versions and environments.

Member

who cares about legacy python? Or did you want to backport to 0.20.1? Though I guess there is an argument for doing that?

Contributor

I'd recommend isinstance(x.shape[0], numbers.Integral).
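
For context, a quick illustration (not from the PR) of why numbers.Integral is the safer check: recent numpy versions register their integer scalar types with the numbers ABCs, while a plain int check can fail depending on the platform, Python version, and numpy version.

import numbers
import numpy as np

x = np.int32(42)
print(isinstance(x, int))               # may be False, e.g. on Python 3
print(isinstance(x, numbers.Integral))  # True with recent numpy versions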

Member

who cares about legacy python?

I'm not sure if it's due to Python 2, the numpy version, or the fact that the system is 32-bit; numbers.Integral is probably indeed safer.

@TomAugspurger (Contributor)

Dask checks with dask.is_dask_collection:

def is_dask_collection(x):
    """Returns ``True`` if ``x`` is a dask collection"""
    try:
        return x.__dask_graph__() is not None
    except (AttributeError, TypeError):
        return False

I'd recommend avoiding warnings / checking for dask objects though. If the user has a small dask dataframe, then it'll behave just like a pandas DataFrame. If a user has a large dask dataframe, well they'll find out soon enough anyway :)

@ZWMiller (Contributor, Author) commented Oct 26, 2018

Forgive my neediness; I just want to make sure I'm following (it's my first time PRing on a larger project) and that what I'm doing is useful:

Before marking for MRG I should:

  1. Update the int check to use numbers.Integral.
  2. Add a test that builds a mock object with a non-numeric shape attribute and a __len__ method, and assert that it falls back to len when shape is non-numeric.
  3. Make sure it passes the tests locally, then update my fork so that this PR picks up the new version and re-runs the tests.

Correct?

@ZWMiller changed the title from "[WIP] Fix: SkLearn .score() method generating error with Dask DataFrames" to "[MRG] Fix: SkLearn .score() method generating error with Dask DataFrames" on Oct 26, 2018
@amueller (Member) left a comment

Looks good.

@jnothman (Member)

A brief changelog entry in doc/whats_new/v0.20.rst under 0.20.1?

@ZWMiller (Contributor, Author) commented Nov 4, 2018

@jnothman Is this something I need to do? Sorry, but I don't understand what is being asked here. Thanks!

@jnothman (Member) commented Nov 5, 2018

Please add an entry to the change log at doc/whats_new/v0.20.rst. Put it in the version 0.20.1 section, under Miscellaneous, and prefixed by |Fix|. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:.

happens immediately (i.e., without a deprecation cycle).
:issue:`11741` by `Olivier Grisel`_.

- |Fix| Fixed a bug in validation.py where :func: `_num_samples()` would not properly
Member

We usually avoid referencing private functions here. How about just "a bug in validation helpers where passing a Dask DataFrame would result in an error"?


- |Fix| Fixed a bug in validation.py where :func: `_num_samples()` would not properly
handle checking the shape of Dask DataFrames after an update to Dask.
:issue:`12462` by :user:`zwmiller`
Member

Usually we'd state your real name too.

Contributor (Author)

Got it. Sorry for the hand-holding needed.

@jnothman merged commit 5d8dfc9 into scikit-learn:master on Nov 6, 2018
@jnothman (Member) commented Nov 6, 2018

Thanks @ZWMiller!
