Polymorphic clone #5080

Closed
jnothman opened this issue Aug 3, 2015 · 20 comments

@jnothman
Member

jnothman commented Aug 3, 2015

sklearn.base.clone is defined to reconstruct an object of the argument's type with its constructor parameters (from get_params(deep=False)) recursively cloned and other attributes removed.
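
For concreteness, a rough sketch of that contract (simplified; not the actual sklearn.base.clone source):

import copy

def clone_sketch(estimator):
    # Rebuild from constructor params only, recursing into nested estimators;
    # fitted attributes (coef_, classes_, ...) are simply discarded.
    params = estimator.get_params(deep=False)
    cloned = {name: clone_sketch(value) if hasattr(value, 'get_params')
              else copy.deepcopy(value)
              for name, value in params.items()}
    return type(estimator)(**cloned)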

There are cases where I think the One Obvious Way to provide an API entails allowing polymorphic overriding of clone behaviour. In particular, my longstanding implementation of wrappers for memoized and frozen estimators relies on this, and I would like that library of utilities not to depend on a locally modified sklearn.base, so the change needs to be made in scikit-learn itself.

Let me try to explain. Let's say we want a way to freeze a model. That is, cloning it should not flush its fit attributes, and calling fit again should not affect it. A syntax like the following seems far and away the clearest:

est = freeze_model(MyEstimator().fit(special_X, special_Y))

It should be obvious that the standard definition of clone won't support this easily: we need to keep more state than get_params returns, unless MyEstimator().__dict__ becomes a param of the freeze_model instance, which is pretty hacky.

Alternative syntaxes could be class decoration (freeze_model(MyEstimator)()) or a mixin (class MyFrozenEstimator(MyEstimator, FrozenModel): pass) such that the first call to fit then sets a frozen model. These are not only uglier, but they encounter the same problems.

Ideally this sort of estimator wrapper should pass through {set,get}_params of the wrapped estimator without adding underscored prefixes (not that this is so pertinent for a frozen model, but for other applications of similar wrappers). It should also delegate all attributes to the wrapped estimator. Without making a mess of freeze_model.__init__ this is also not possible, IMO, without redefining clone.

So. Can we agree:

  • that it would not be a Bad Thing to allow polymorphism in cloning?
  • on a name for the polymorphic clone method: clone, clone_params or sklearn_clone? (one possible shape of such a hook is sketched just below)
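
For illustration, a hypothetical sketch only; sklearn_clone is just a placeholder for whichever name is chosen, and the fallback simply reuses the existing sklearn.base.clone:

from sklearn import base

def clone(estimator, safe=True):
    # Polymorphic dispatch: an estimator that defines the hook controls its own cloning.
    if hasattr(estimator, 'sklearn_clone'):
        return estimator.sklearn_clone()
    return base.clone(estimator, safe=safe)   # standard get_params-based behaviour

class FrozenModel(object):
    def __init__(self, estimator):
        self.estimator = estimator   # an already fitted estimator
    def sklearn_clone(self):
        return self                  # cloning preserves the fitted state
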
@GaelVaroquaux
Member

GaelVaroquaux commented Aug 3, 2015 via email

@jnothman
Member Author

jnothman commented Aug 3, 2015

The question is how to make it easy to freeze / memoize any estimator (and potentially other wrapper-type behaviours), because it is usually properties of the data/application, rather than properties of the estimator alone that make this appropriate. I'm not sure how your comments address that, although a memoization mixin/decorator may be possible, if ugly. Are you suggesting every estimator should have a memory param like AgglomerativeClustering does?

I continue to think the usable solution requires some kind of transparent wrapper, which entails being able to redefine clone in exceptional cases. set_params also has a clear contract, with exceptions.

@GaelVaroquaux
Member

GaelVaroquaux commented Aug 3, 2015 via email

@jnothman
Member Author

jnothman commented Aug 3, 2015

It seems to me that it's a use case for copy.deepcopy, and not clone.

I don't understand the relevance of this to the case where a wrapped estimator needs to be used in CV tools that will attempt to clone it.

The nice thing about doing it this way is that it is possible to choose what's being cached and what is not. As a result, it is possible to cache only the parts that are expensive to compute.

Fair, but it doesn't give the user the choice to memoize all models by the same means until we implement it for all models, which frankly is a big task that clutters up the documentation and API.

The goal of clone is to make sure that there is no leakage of data across fits that should be statistically independent. I am afraid that if we change its semantics, there will be leakage, at some point, by someone making a mistake in the codebase.

If a user explicitly wants to freeze a model (e.g. an unsupervised component learnt from large unlabelled data, or a feature extractor trained externally for a different task), they are accountable for any statistical assumptions as a result. I don't expect clone should be overridden in anything but special functionality-adding wrappers. But it's necessary that it can be overridden in order to implement those things in a way I find aesthetically pleasing.

I'm not sure how your comments address that, although a memoization mixin/decorator may be possible, if ugly.

I disagree it's ugly. It's explicit, simple, and can be made robust.

At some point I may try to rewrite these so we can play with them. But I don't think freezing can be made to use these API constructs and be as clear.

@arjoly
Member

arjoly commented Aug 3, 2015

For the memoization, I have done something similar. However, I don't need to have my own version of clone: https://gist.github.com/arjoly/8386742c66cbd6cf6c89

@GaelVaroquaux
Member

It seems to me that it's a use case for copy.deepcopy, and not clone.

I don't understand the relevance of this to the case where a wrapped estimator needs to be used in CV tools that will attempt to clone it.

So you do want to change that :).

Fair, but it doesn't give the user the choice to memoize all models by the same means

Agreed, but I don't think that modifying clone is the right way to do this. I'd much rather have a 'caching' meta-estimator that uses a memory object to cache the fit of the estimator.

The goal of clone is to make sure that there is no leakage of data across fits that should be statistically independent. I am afraid that if we change its semantics, there will be leakage, at some point, by someone making a mistake in the codebase.

If a user explicitly wants to freeze a model (e.g. an unsupervised component learnt from large unlabelled data, or a feature extractor trained externally for a different task), they are accountable for any statistical assumptions as a result.

How about parallel computing? And also, users make mistakes all the time. Developers also. I do too. Clone is a safety net. Over time we have actually made it more strict in its semantics, because we had bugs in our codebase that the lax semantics enabled.

If a user really knows what he or she is doing, they can write their own feature extractor.

I don't expect clone should be overridden in anything but special functionality-adding wrappers. But it's necessary that it can be overridden in order to implement those things in a way I find aesthetically pleasing.

The problem of modifying clone is that it is modifying constraints on the whole codebase, and adding loopholes everywhere. A modification should be as local as possible. Otherwise we get tightly coupled codebases that are not tractable.

@arjoly: that's exactly what I had in mind: code that has only local impacts.

@jnothman
Member Author

jnothman commented Aug 3, 2015

One reason I selected freezing above is that it's much easier to explain the need for a changed clone.

For the memoization case, you're right, it's possible to do if we're willing to suffer indirection:

  • additional base_estimator__ prefix
  • additional .base_estimator before accessing attributes
  • need to create delegator methods for things like kneighbors(), statically or dynamically

Traditionally, decorators such as @memory.cache are intended to be transparent from an API perspective, and can be added or removed as needed. Try to modify your BaseLazy so that it wraps more-or-less transparently, and do so without affecting clone. Perhaps I want too much magic.
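
To make the indirection concrete, here is a small sketch (Memoized is a hypothetical stand-in wrapper that exposes its target as a base_estimator constructor param):

from sklearn.base import BaseEstimator
from sklearn.neighbors import KNeighborsClassifier

class Memoized(BaseEstimator):                       # hypothetical stand-in wrapper
    def __init__(self, base_estimator=None):
        self.base_estimator = base_estimator

knn = KNeighborsClassifier()
wrapped = Memoized(base_estimator=knn)

knn.set_params(n_neighbors=3)                        # direct
wrapped.set_params(base_estimator__n_neighbors=3)    # extra prefix required
# and method calls need an extra hop:
# knn.kneighbors(X)  vs.  wrapped.base_estimator.kneighbors(X)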

@jnothman
Member Author

jnothman commented Aug 3, 2015

The problem of modifying clone is that it is modifying constraints on the whole codebase, and adding loopholes everywhere.

It provides a loophole, yes, but one that need only be used rarely.

@amueller
Member

amueller commented Aug 3, 2015

It seems to me that something like BaseLazy should be able to achieve what you want... Let us know how it goes :)

@jnothman
Member Author

jnothman commented Aug 4, 2015

@amueller, it does not -- and cannot readily -- achieve model freezing. Just think about what happens when base_estimator contains a model you want to keep. As long as base_estimator is a parameter, it will undergo cloning, removing its frozen model. As long as it's not a parameter, cloning the meta-estimator will not restore the base estimator.
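
A minimal illustration of that first point (Wrapper is a hypothetical stand-in for any meta-estimator holding a base_estimator param):

from sklearn.base import BaseEstimator, clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

class Wrapper(BaseEstimator):                  # hypothetical meta-estimator
    def __init__(self, base_estimator=None):
        self.base_estimator = base_estimator

X, y = load_iris(return_X_y=True)
inner = LogisticRegression().fit(X, y)         # the model we want to keep
meta = Wrapper(base_estimator=inner)

cloned = clone(meta)                           # base_estimator is itself re-cloned
assert hasattr(inner, 'coef_')
assert not hasattr(cloned.base_estimator, 'coef_')   # the "frozen" model is gone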

And it does not achieve memoization transparently. It requires the user to modify their code to add indirection for parameter names, method calls, etc. I can't just modify my code by introducing the wrapper and expect everything else to work the same, and that's bad. (I've suggested elsewhere an alternative way of expressing nested parameters that gets around the parameter name indirection, which might placate me on that front.)

@amueller
Member

amueller commented Aug 4, 2015

Hm.... I see...
You could pass a serialized version as a parameter ;)

@arjoly gets around this by not freezing but building an estimator that remembers... but that's not cloneable either, right?

@GaelVaroquaux
Member

GaelVaroquaux commented Aug 4, 2015 via email

@arjoly
Member

arjoly commented Aug 4, 2015

@arjoly gets around this by not freezing but building an estimator that remembers... but that's not cloneable either, right?

The idea of my lazy estimator is to delay fitting until prediction time, to save memory in the context of ensemble methods. To avoid the time cost, you can memoize method calls. I could also memoize the estimator fitting, but that wasn't interesting in my use case.

@jnothman
Member Author

Your LazyClassifier remains opaque in the sense that you need to prefix all references to attributes and methods. You can't just wrap an existing estimator with LazyClassifier and in most contexts expect it to work. Maybe if we could get around the parameter name prefixing along the lines of #5082, that wouldn't matter, as we could implement __{get,set}attr__ to make it transparent for everything but __init__ and {get,set}_params.
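
Roughly what I mean by transparent, as a sketch (this glosses over __init__ and {get,set}_params, which is exactly where clone gets in the way):

class TransparentWrapper(object):              # hypothetical sketch
    def __init__(self, estimator):
        self.estimator = estimator

    def __getattr__(self, name):
        # Only called for attributes not found on the wrapper itself, so
        # fit, predict, kneighbors, coef_, ... all fall through to the estimator.
        return getattr(self.estimator, name)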

Here's something that works as a mixin in terms of transparency, without needing to reimplement __init__ manually. Its major drawback is that memory must be global (or in the class), again because of clone.

import joblib
from sklearn import linear_model
from sklearn.base import clone
from sklearn.datasets import load_iris

memory = joblib.Memory('/tmp')

class MemoizedEstimatorMixin(object):
    def fit(self, *args, **kwargs):
        # Cache the parent class's fit and copy the fitted attributes back onto self.
        fit = memory.cache(super(MemoizedEstimatorMixin, self).fit)
        new_self = fit(*args, **kwargs)
        self.__dict__.update(new_self.__dict__)
        return self

class MemoLR(MemoizedEstimatorMixin, linear_model.LogisticRegression):
    pass

iris = load_iris()
clone(MemoLR()).fit(iris.data, iris.target)  # hits the cache the second time

(I haven't considered freezing with a similar technique.)

Making memory a parameter would mean reimplementing _get_param_names or equivalent as it would need to be a constructor param (due to clone), leading to an __init__ with *args (or else the user has to write out all params, making it a pretty useless mixin).

While I don't think the above is terrible, are you getting my point?

@jnothman
Member Author

(Some minor thoughts on this: if a mixin is deemed appropriate, a syntax like memoized_estimator(SomeEstimator()) could be achieved by dynamically defining the mixed-in class. While already magic, making this pickleable would require __reduce__ magic, too.)
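
For instance, something along these lines (a sketch reusing MemoizedEstimatorMixin and the linear_model import from the earlier snippet; the pickling caveat applies because the generated class has no importable module path):

def memoized_estimator(estimator):
    # Dynamically mix MemoizedEstimatorMixin into the estimator's own class,
    # then rebuild the instance from its constructor params.
    cls = type('Memoized' + type(estimator).__name__,
               (MemoizedEstimatorMixin, type(estimator)), {})
    return cls(**estimator.get_params(deep=False))

memo_lr = memoized_estimator(linear_model.LogisticRegression())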

@jnothman
Member Author

I note that other wrappers, such as print_times(my_estimator), a solution to #5298, could be implemented similarly...

@amueller
Member

amueller commented Oct 8, 2016

I actually think we should go the opaque way for now, simply so we can move forward in fixing things like #6451.
The meta-estimator only accesses the standard sklearn interface, and the FrozenEstimator can wrap that. Btw, we can just "freeze" an estimator by deleting its get_params, right? Then clone will use a deepcopy on it when it encounters it as a parameter of some meta-estimator.
That's pretty reliant on our current implementation of clone, though.
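
A quick check of that fallback (Blob and Holder are hypothetical; the point is only that clone deep-copies parameter values that lack get_params, while rejecting them at the top level under safe=True):

from sklearn.base import BaseEstimator, clone

class Blob(object):                   # no get_params at all
    def __init__(self):
        self.state = [1, 2, 3]

class Holder(BaseEstimator):          # hypothetical meta-estimator
    def __init__(self, thing=None):
        self.thing = thing

meta = Holder(thing=Blob())
cloned = clone(meta)
assert cloned.thing is not meta.thing and cloned.thing.state == [1, 2, 3]
# whereas clone(Blob()) on its own raises TypeError (safe=True by default)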

@jnothman
Member Author

jnothman commented Oct 8, 2016

How do you intend to delete an estimator's get_params? Or do you just mean that the wrapper will not implement get_params? That won't work for memoization where you need the ability to get and set parameters... It also won't work with clone(est, safe=True), I think.

I still think opacity/indirection, not just in terms of parameter underscore prefixing but in terms of getting fitted attributes, makes things unnecessarily hard for users, really...

@jnothman
Member Author

Closing for now following discussion on freezing at the sprint (see #8370 (comment))

@adrinjalali
Member

Not sure why this is not closed; we have __sklearn_clone__ now.
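
For reference, a minimal freezing sketch using that hook (assuming scikit-learn >= 1.3, where clone defers to __sklearn_clone__ when an estimator defines it):

from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

class FrozenLogisticRegression(LogisticRegression):
    def __sklearn_clone__(self):
        return self                    # cloning keeps the fitted model

X, y = load_iris(return_X_y=True)
frozen = FrozenLogisticRegression().fit(X, y)
assert clone(frozen) is frozen         # fitted attributes survive cloning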
