Thanks to visit codestin.com
Credit goes to github.com

Skip to content

Conversation

amueller
Copy link
Member

Simplifies the check in clone that checks if __init__ and get_params play nicely together.

Previously, we could clone objects that copy numpy arrays in __init__. Afterwards we can not.

Doing anything in init is not a good idea.
I have an additional check for == because Pipeline is evil.
Working on it.

Note: this is a fix for #5522.

@amueller
Copy link
Member Author

Pipeline is only evil if steps is a numpy array, in which case it copies it. Or if something is passed that is not a sequence, in which case it calls list() on it.

The fact that tests pass means that tosequence here always ends up in the second branch.

We could remove the call to tosequence, which will likely not change anything, at least nothing that is tested.

Or, if we want to continue to support steps being non-sequences [I have no idea what that would mean] or make sure that if steps is an array, that it is copied in __init__, then we need a more complicated check in clone.

My intuition is that we should remove the call to tosequenceand we'll be fine.

Feedback very welcome, in particular @GaelVaroquaux and @jnothman.

@amueller amueller changed the title [WIP] remove complicated equality checks in clone as __init__ shouldn't touch anything. [MRG] remove complicated equality checks in clone as __init__ shouldn't touch anything. Oct 23, 2015
@amueller
Copy link
Member Author

Opened #5547 for VBGMM and #5549 for GP.

@amueller
Copy link
Member Author

I'm wondering if we should keep the old stuff and raise a deprecation warning if param1 is not param2. Otherwise we could break a lot of users estimators.

@amueller
Copy link
Member Author

Ok after fighting with myself, I left all the old stuff in and adapted the test from #5525.
We should raise a deprecation warning if param1 is not param2 once we fixed our own stuff.

@ogrisel
Copy link
Member

ogrisel commented Oct 23, 2015

I opened a PR to your PR here: amueller#29

@ogrisel
Copy link
Member

ogrisel commented Oct 23, 2015

Then +1.

@ogrisel
Copy link
Member

ogrisel commented Oct 23, 2015

I opened a new issue to discuss the need for deepcopy in clone: #5563.

@amueller
Copy link
Member Author

This should be ready now. @jnothman ?

@amueller amueller changed the title [MRG] remove complicated equality checks in clone as __init__ shouldn't touch anything. [MRG + 1] remove complicated equality checks in clone as __init__ shouldn't touch anything. Aug 17, 2016
equality_test = (new_obj_val == params_set_val or
new_obj_val is params_set_val)
# fall back on standard equality
equality_test = param1 == param2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this not going to fail for np.nan? These are not numpy arrays, hence this part of the if/else clause is explored, and:

In [3]: np.nan.__class__.__mro__
Out[3]: (float, object)

In [4]: np.nan == np.nan
Out[4]: False

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get it: the "is" test has already been perform above, right?

I am slow.

Maybe a test case with a nan would be a good thing.

@GaelVaroquaux
Copy link
Member

Minor bug that breaks CI, but is probably trivial to fix.

I have a request: could a test be added with a nan as an attribute, so that we are certain that this iteration of the code, or the others, does not break clone with nans (which is a bug that's easy to generate).

@jnothman
Copy link
Member

This is a good idea. I gather the deprecation warning is not happening during testing, except in this case where the test failure is. Can you add an assert_warns there?

@amueller
Copy link
Member Author

Sorry for the slow reply, I'll add the test and fix CI.

@amueller
Copy link
Member Author

amueller commented Aug 24, 2016

There is already a test for estimators with nan here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tests/test_base.py#L140

And the deprecation warning is actually raised in a test where clone is supposed to fail. So if your clone fails you will now also always get a deprecation warning in addition to the error. I just pushed a fix for that, and a test.

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Aug 24, 2016 via email

@agramfort
Copy link
Member

@amueller you need to rebase.

then +1 for merge

@amueller
Copy link
Member Author

should be good to merge once lights are green

@jnothman
Copy link
Member

./sklearn/base.py:119:80: E501 line too long (81 > 79 characters)

@amueller
Copy link
Member Author

amueller commented Sep 6, 2016

done

@jnothman
Copy link
Member

jnothman commented Sep 6, 2016

Test failures!

@jnothman
Copy link
Member

jnothman commented Sep 6, 2016

Failure due to whitespace inconsistency in warning message vs test.

@amueller
Copy link
Member Author

amueller commented Sep 7, 2016

should work now.

@jnothman
Copy link
Member

jnothman commented Sep 8, 2016

LGTM

@jnothman jnothman merged commit 680ab51 into scikit-learn:master Sep 8, 2016
@amueller
Copy link
Member Author

amueller commented Sep 8, 2016

thanks for the reviews. This is soooo much more robust now :)

@amueller amueller mentioned this pull request Sep 13, 2016
rsmith54 pushed a commit to rsmith54/scikit-learn that referenced this pull request Sep 14, 2016
…uldn't touch anything. (scikit-learn#5540)

* BF: issue 5522 (cloning objects with pandas.Dataframe attributes)

* super conservative fix, pending GP and VBGMM fixes.

* TST improved test for df param

* add deprecation warning, add whatsnew entry

* fixed place of deprecation warning, added test for deprecation warning.

* pep8

* fix whitespace error
TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
…uldn't touch anything. (scikit-learn#5540)

* BF: issue 5522 (cloning objects with pandas.Dataframe attributes)

* super conservative fix, pending GP and VBGMM fixes.

* TST improved test for df param

* add deprecation warning, add whatsnew entry

* fixed place of deprecation warning, added test for deprecation warning.

* pep8

* fix whitespace error
@stdkoehler
Copy link

I'm not sure if this is the right place to ask this question. I always receive the

DeprecationWarning: Estimator KerasRegressor modifies parameters in init. This behavior is deprecated as of 0.18 and support for this behavior will be removed in 0.20. % type(estimator).name, DeprecationWarning)

when a parameter of my estimator is a list or an np.array, because the "param1 is param2" gives a false in such case. Is this intended behavior? My example is with KerasRegressor:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.datasets import load_boston
bostondata = load_boston()

X = bostondata.data
y = bostondata.target

# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

def dynamic_model(nInputLayer=10, hiddenLayers=[10]):
    # create model
    model = Sequential()
    model.add(Dense(nInputLayer, input_dim=13, init='normal', activation='relu'))
    for hLayer in hiddenLayers:
        model.add(Dense(hLayer, init='normal', activation='relu'))
    model.add(Dense(1, init='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

nInputLayerList = [8, 32]
hiddenLayers = [np.array([5]), np.array([5,5])]
for nlayer in nInputLayerList:
    for hlayer in hiddenLayers:
        krmlp = KerasRegressor(build_fn=dynamic_model, nb_epoch=100, batch_size=5, verbose=0)
        krmlp.set_params(**{'nInputLayer': nlayer, 'hiddenLayers': hlayer})
        estimators = []
        estimators.append(('standardize', StandardScaler()))
        estimators.append(('mlp', krmlp))
        pipeline = Pipeline(estimators)
        kfold = KFold(n_splits=10, random_state=seed)
        results = cross_val_score(pipeline, X, y, cv=kfold)
        print("Dynamic %d input, %s hidden: %.2f (%.2f) MSE" % (nlayer, str(hlayer).strip('[]'), results.mean(), results.std()))

Sorry if this is the wrong place to ask the question. Please delete if so.

@jnothman
Copy link
Member

I presume the issue is not a list or array being passed, but your Sequence instance. Yes, I think that after doing a deepcopy, it's wrong for the code to claim that param1 is param2 "should always be true" :\ @amueller, is the motivation of this PR correct?

@dukebody
Copy link

Hi! Not sure if this is the right place to ask about this - if not, please point me to the right location.

I'm the maintainer of the sklearn-pandas library. Due to the changes in this PR the DataFrameMapper now emits warnings whenever it's used in some kind of cross-validation: scikit-learn-contrib/sklearn-pandas#76

It is true that the DataFrameMapper constructor modifies parameters in the init method (see https://github.com/paulgb/sklearn-pandas/blob/master/sklearn_pandas/dataframe_mapper.py#L40), but those parameters are not numpy ndarrays or sparse matrices, therefore I don't really understand how the current implementation of the clone function (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/base.py#L29) can not raise the mentioned warning, even if the parameters of the old and the new (cloned) dataframe mapper are equal.

Where can I get some guidance about how to make sklearn-pandas compatible with sklearn>=0.20? Thanks.

@jnothman
Copy link
Member

I'm a little confused. Do you object to the special handling of numpy arrays and sparse matrices because these are not checked for equality, and you have another special case for that? Or do you object to not modifying parameters in __init__.

The principle is to avoid modifying parameters in __init__, and delaying this until fit. A key reason to do so is that those same parameters might be specified through set_params (or even explicit setting of attributes), not through __init__ and hence should only undergo transformation when they need to be interpreted during fit. This could be achieved through properties, but validating/transforming only in fit is the simpler solution which scikit-learn has chosen for a convention.

@dukebody
Copy link

@jnothman sorry, I was not objecting to special handling of numpy arrays and sparse matrices or not modifying parameters in __init__, I was just confused about how the check was implemented and didn't know how to modify the library I'm maintaining to comply with the new requisites.

Now I see that delaying the modification of parameters until the fit method is the way to go and why it was decided this way. Thanks for the explanation! :)

@jnothman
Copy link
Member

jnothman commented Jun 11, 2017 via email

dukebody added a commit to scikit-learn-contrib/sklearn-pandas that referenced this pull request Jun 24, 2017
This is to comply with sklearn>=0.20 which requires that
no parameters are mutated in __init__ to be able to make
cross-validation wrappers work easily.

See scikit-learn/scikit-learn#5540

Resolves #76.

Code based on PR #105 by @jimmywan
@dukebody
Copy link

@jnothman Where can I find the documentation about what has to be done to fit these new requirements? We finally figured out how to fix it in scikit-learn-contrib/sklearn-pandas#76 but having some docs explaining what you explained here would surely help other developers.

@jnothman
Copy link
Member

jnothman commented Jun 24, 2017 via email

paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
…uldn't touch anything. (scikit-learn#5540)

* BF: issue 5522 (cloning objects with pandas.Dataframe attributes)

* super conservative fix, pending GP and VBGMM fixes.

* TST improved test for df param

* add deprecation warning, add whatsnew entry

* fixed place of deprecation warning, added test for deprecation warning.

* pep8

* fix whitespace error
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants