MNT KBinsDiscretizer.transform should not mutate _encoder #12514

Merged
merged 6 commits into from
Nov 6, 2018

Conversation

qinhanmin2014
Member

@qinhanmin2014 qinhanmin2014 commented Nov 4, 2018

Fixes #12490
Also fix a bug in the common test, which can serve as the regression test of the PR.
I'm unable to figure out a way to correct the common test; see the comment below.

@qinhanmin2014
Member Author

Hmm, I'm unable to figure out a way to correct the common test. I don't think it's good to use .copy() to copy a dictionary, because the copy still shares its values with the original. But if we use deepcopy (or extend base.clone to support dictionaries), we're unable to compare the two OneHotEncoder instances.

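from copy import deepcopy

from sklearn.base import clone
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils.testing import assert_dict_equal  # location as of scikit-learn 0.20

# Shallow copy: both dicts hold the same OneHotEncoder object.
ohe = OneHotEncoder()
d1 = {"ohe": ohe}
d2 = d1.copy()
assert_dict_equal(d1, d2)
# Passes (but d1 and d2 might change simultaneously)

# Deep copy: d2 holds a new OneHotEncoder object.
ohe = OneHotEncoder()
d1 = {"ohe": ohe}
d2 = deepcopy(d1)
assert_dict_equal(d1, d2)
# Fails (raises AssertionError)

# clone: d2 again holds a new OneHotEncoder object.
ohe = OneHotEncoder()
d1 = {"ohe": ohe}
d2 = {"ohe": clone(ohe)}
assert_dict_equal(d1, d2)
# Fails (raises AssertionError)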
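
The three outcomes above come down to equality semantics: a class that does not define `__eq__` compares by identity, so a shallow copy (same object inside) compares equal while a deep copy or clone (new object inside) does not. Below is a minimal sketch of that behaviour using a hypothetical `ToyEstimator` stand-in rather than OneHotEncoder itself. Comparing constructor parameters is shown only as one possible workaround; it is an assumption for illustration, not what the common test does.

```python
from copy import deepcopy


class ToyEstimator:
    """Hypothetical stand-in for an estimator; it defines no __eq__."""

    def __init__(self, sparse=True):
        self.sparse = sparse

    def get_params(self):
        # Same idea as an estimator's get_params: constructor arguments only.
        return {"sparse": self.sparse}


est = ToyEstimator()
d1 = {"est": est}

shallow = d1.copy()   # new dict, same ToyEstimator object
deep = deepcopy(d1)   # new dict, new ToyEstimator object

shallow_equal = d1 == shallow   # values are the very same object
deep_equal = d1 == deep         # default equality is identity-based

# Comparing parameters instead of objects avoids identity-based equality.
params_equal = {k: v.get_params() for k, v in d1.items()} == {
    k: v.get_params() for k, v in deep.items()
}

print(shallow_equal, deep_equal, params_equal)
```

Note that a parameter comparison only covers constructor arguments; it would not notice changes to fitted attributes such as the state held by `_encoder`.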

@qinhanmin2014
Member Author

I'll take it after we figure out the solution. See #12490 (comment)

@qinhanmin2014 qinhanmin2014 reopened this Nov 4, 2018
Member

@ogrisel ogrisel left a comment

This solution looks good to me.

I don't see how we could improve the common test (check_estimators_overwrite_params). I think it's fine as it is, even though it could not detect this specific issue.

If I recall correctly, this problem (with the _encoder attribute being mutated by KBinsDiscretizer.transform) was originally found by @pierreglaser while making concurrent calls to transform with the threading backend of joblib: the calls were not thread-safe, contrary to what one would expect of transform.
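
To make the kind of problem described here concrete, below is a rough sketch with toy classes. All names (`ToyEncoder`, `MutatingDiscretizer`, `ReadOnlyDiscretizer`) are hypothetical and this is not the scikit-learn code. It assumes the mutation comes from refitting a shared helper inside transform, and contrasts that with fitting the helper once in fit so that transform only reads state.

```python
class ToyEncoder:
    """Hypothetical encoder: fit() stores state, transform() only reads it."""

    def fit(self, rows):
        self.categories_ = sorted({v for row in rows for v in row})
        return self

    def transform(self, rows):
        index = {c: i for i, c in enumerate(self.categories_)}
        return [[index[v] for v in row] for row in rows]

    def fit_transform(self, rows):
        return self.fit(rows).transform(rows)


class MutatingDiscretizer:
    """transform() refits the shared encoder, rewriting its state on each call."""

    def fit(self, rows):
        self._encoder = ToyEncoder()
        return self

    def transform(self, rows):
        return self._encoder.fit_transform(rows)  # mutates self._encoder


class ReadOnlyDiscretizer:
    """The encoder is fitted once in fit(); transform() leaves it untouched."""

    def fit(self, rows):
        self._encoder = ToyEncoder().fit(rows)
        return self

    def transform(self, rows):
        return self._encoder.transform(rows)


train = [[0, 1], [2, 1]]

bad = MutatingDiscretizer().fit(train)
bad.transform(train)
first = bad._encoder.categories_
bad.transform([[0, 0]])
mutated = bad._encoder.categories_ is not first   # state was replaced

good = ReadOnlyDiscretizer().fit(train)
before = good._encoder.categories_
good.transform([[0, 0]])
unchanged = good._encoder.categories_ is before   # same object as before
```

With the mutating variant, two threads calling transform on the same instance could each overwrite the encoder state the other is about to read, which is why a transform that only reads fitted state is easier to reason about under concurrency.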

I am not sure we want to add thread-safety checks for the transform method of transformers. This is not officially part of our public "API" as far as I know.

@qinhanmin2014 qinhanmin2014 added this to the 0.20.1 milestone Nov 4, 2018
Member

@jnothman jnothman left a comment

Yes, I think we know that the common test is a weak one. It does not check that the objects do not change. We could potentially make it tighter by using some kind of hash, but that could make it a rather heavy test.
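
As a rough illustration of the hash idea, here is a sketch of a before/after check. It is not the existing common test, and the helper names (`state_digest`, `transform_mutates`) and toy transformers are hypothetical. It assumes the estimator's attributes are picklable and hashes them with the standard library; in practice something like joblib's hashing utility might be a better fit for NumPy arrays.

```python
import hashlib
import pickle


def state_digest(obj):
    """Return a SHA-256 digest of the pickled attribute dict of obj."""
    return hashlib.sha256(pickle.dumps(vars(obj))).hexdigest()


def transform_mutates(estimator, X):
    """Hypothetical check: True if calling transform changes the estimator's state."""
    before = state_digest(estimator)
    estimator.transform(X)
    return state_digest(estimator) != before


class CountingTransformer:
    """Toy transformer that updates an attribute inside transform."""

    def __init__(self):
        self.n_calls_ = 0

    def transform(self, X):
        self.n_calls_ += 1
        return X


class PureTransformer:
    """Toy transformer whose transform only reads its attributes."""

    def __init__(self):
        self.scale_ = 2

    def transform(self, X):
        return [self.scale_ * x for x in X]


counting_changed = transform_mutates(CountingTransformer(), [1, 2, 3])
pure_changed = transform_mutates(PureTransformer(), [1, 2, 3])
print(counting_changed, pure_changed)
```

The cost concern raised above shows up here directly: every check pickles the whole estimator twice. Pickle output is also not guaranteed to be byte-identical for objects that are merely equal, so a check like this could report false positives.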

Yes, this is the right fix IMO. I'm not sure that we need a non-regression test.

@ogrisel regarding requiring prediction methods to be thread-safe, I like the idea in general. I think we may elsewhere have prediction methods that update a random state. Maybe they don't... and maybe they shouldn't!

@qinhanmin2014
Member Author

I agree that we can keep the common test, since it's still useful even with these limitations, and it's not so easy to improve it.
@ogrisel @jnothman Is there anything else I need to do? I guess we don't need a test or a what's new entry.

@jnothman
Member

jnothman commented Nov 6, 2018 via email

@jnothman jnothman merged commit 6b4e00d into scikit-learn:master Nov 6, 2018
@jnothman
Member

jnothman commented Nov 6, 2018

Thanks @qinhanmin2014

thoo added a commit to thoo/scikit-learn that referenced this pull request Nov 7, 2018
* upstream/master:
  joblib 0.13.0 (scikit-learn#12531)
  DOC tweak KMeans regarding cluster_centers_ convergence (scikit-learn#12537)
  DOC (0.21) Make sure plot_tree docs are generated and fix link in whatsnew (scikit-learn#12533)
  ALL Add HashingVectorizer to __all__ (scikit-learn#12534)
  BLD we should ensure continued support for joblib 0.11 (scikit-learn#12350)
  fix typo in whatsnew
  Fix dead link to numpydoc (scikit-learn#12532)
  [MRG] Fix segfault in AgglomerativeClustering with read-only mmaps (scikit-learn#12485)
  MNT (0.21) OPTiCS change the default `algorithm` to `auto` (scikit-learn#12529)
  FIX SkLearn `.score()` method generating error with Dask DataFrames (scikit-learn#12462)
  MNT KBinsDiscretizer.transform should not mutate _encoder (scikit-learn#12514)
@qinhanmin2014 qinhanmin2014 deleted the KBinsDiscretizer branch November 8, 2018 15:29
thoo added a commit to thoo/scikit-learn that referenced this pull request Nov 9, 2018
…ybutton

* upstream/master:
  FIX YeoJohnson transform lambda bounds (scikit-learn#12522)
  [MRG] Additional Warnings in case OpenML auto-detected a problem with dataset  (scikit-learn#12541)
  ENH Prefer threads for IsolationForest (scikit-learn#12543)
  joblib 0.13.0 (scikit-learn#12531)
  DOC tweak KMeans regarding cluster_centers_ convergence (scikit-learn#12537)
  DOC (0.21) Make sure plot_tree docs are generated and fix link in whatsnew (scikit-learn#12533)
  ALL Add HashingVectorizer to __all__ (scikit-learn#12534)
  BLD we should ensure continued support for joblib 0.11 (scikit-learn#12350)
  fix typo in whatsnew
  Fix dead link to numpydoc (scikit-learn#12532)
  [MRG] Fix segfault in AgglomerativeClustering with read-only mmaps (scikit-learn#12485)
  MNT (0.21) OPTiCS change the default `algorithm` to `auto` (scikit-learn#12529)
  FIX SkLearn `.score()` method generating error with Dask DataFrames (scikit-learn#12462)
  MNT KBinsDiscretizer.transform should not mutate _encoder (scikit-learn#12514)
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Nov 14, 2018
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Nov 14, 2018
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Development

Successfully merging this pull request may close these issues.

KBinsDiscretizer.transform mutates the _encoder attribute
3 participants