Polymorphic clone #5080
The value of a codebase, and of an object model, is to be able to understand what's going on. That's what enables users to have expectations about it, and developers to track it.

`clone` has a simple contract: flush everything and make the object new again. What you're talking about would open the door to violations of this contract. I haven't really understood why it is necessary for `clone` to violate this contract.
For the memoization pattern, it is not necessary. As a matter of fact, there is already one object which implements memoization:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/hierarchical.py#L762

The way you can avoid hacking `clone` is by having a parameter that keeps the shared state. The simplest option is that this object has a global shared state, as with the joblib memory. To make it slightly less global (globals are nasty), you can give it an 'id', where if two instances have the same id they share the state, and if they have different ids nothing is shared.
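To make this concrete, here is a minimal sketch of the parameter-based approach described above (the class, helper, and cache path are my own illustrative names, not code from this thread): two instances constructed around the same `joblib.Memory` location share cached results, and since `memory` is a constructor parameter, the stock `clone` preserves it without any change to `clone`'s semantics.

```python
import numpy as np
from joblib import Memory
from sklearn.base import BaseEstimator, clone


def _expensive_part(X):
    # Stand-in for the costly computation worth caching.
    return X.sum(axis=0)


class CachedEstimator(BaseEstimator):
    def __init__(self, memory=None):
        self.memory = memory

    def fit(self, X, y=None):
        memory = self.memory if self.memory is not None else Memory(None)
        # Only the expensive step goes through the cache.
        self.stat_ = memory.cache(_expensive_part)(X)
        return self


X = np.arange(10.0).reshape(5, 2)
shared = Memory('/tmp/shared_cache', verbose=0)
est = CachedEstimator(memory=shared)
est.fit(X)
# The clone's `memory` points at the same cache location, so this fit
# is served from the shared cache rather than recomputed.
clone(est).fit(X)
```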
The question is how to make it easy to freeze / memoize any estimator (and potentially other wrapper-type behaviours), because it is usually properties of the data/application, rather than properties of the estimator alone, that make this appropriate. I'm not sure how your comments address that, although a memoization mixin/decorator may be possible, if ugly. Are you suggesting every estimator should have a `memory` param like `AgglomerativeClustering` does?

I continue to think the usable solution requires some kind of transparent wrapper, which entails being able to redefine `clone` in exceptional cases.
> because it is usually properties of the data/application, rather than properties of the estimator alone, that make this appropriate.

It seems to me that it's a use case for `copy.deepcopy`, and not `clone`.

> I'm not sure how your comments address that, although a memoization mixin/decorator may be possible, if ugly.

I disagree that it's ugly. It's explicit, simple, and can be made robust.

> Are you suggesting every estimator should have a `memory` param like `AgglomerativeClustering` does?

The nice thing about doing it this way is that it is possible to choose what's being cached and what is not. As a result, it is possible to cache only the parts that are expensive to compute.

> I continue to think the usable solution requires some kind of transparent wrapper, which entails being able to redefine `clone` in exceptional cases.

The goal of `clone` is to make sure that there is no leakage of data across fits that should be statistically independent. I am afraid that if we change its semantics, there will be leakage, at some point, by someone making a mistake in the codebase.
I don't understand the relevance of this to the case where a wrapped estimator needs to be used in CV tools that will attempt to clone it.
Fair, but it doesn't give the user the choice to memoize all models by the same means until we implement it for every model, which frankly is a big task that clutters up the documentation and API.
If a user explicitly wants to freeze a model (e.g. an unsupervised component learnt from large unlabelled data, or a feature extractor trained externally for a different task), they are accountable for any statistical assumptions that result. I don't expect `clone` to be overridden in anything but special functionality-adding wrappers. But it's necessary that it can be overridden in order to implement those things in a way I find aesthetically pleasing.
At some point I may try to rewrite these so we can play with them. But I don't think freezing can be made to use these API constructs and be as clear.
For the memoization, I have done something similar. However, I don't need to have my own version of clone: https://gist.github.com/arjoly/8386742c66cbd6cf6c89
So you do want to change that :).
Agreed, but I don't think that modifying clone is the right way to do it.

How about parallel computing? And also, users make mistakes all the time. If a user really knows what he or she is doing, he writes his own feature

The problem of modifying clone is that it is modifying constraints on the

@arjoly: that's exactly what I had in mind: code that has only local
One reason I selected freezing above is that it's much easier to explain the need for a changed `clone`.

For the memoization case, you're right, it's possible to do if we're willing to suffer indirection:
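For concreteness, here is one way the indirection could look (a sketch; `Memoize`, `_fit` and the cache path are my own illustrative names, not the snippet from the thread): the wrapper holds the real estimator as a constructor parameter, so the stock `clone` works unchanged, but parameter names gain a prefix and fitted attributes live one level down.

```python
from joblib import Memory
from sklearn.base import BaseEstimator, clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

memory = Memory('/tmp/memo_cache', verbose=0)


def _fit(estimator, X, y):
    # Module-level helper so joblib can key the cache on its arguments.
    return estimator.fit(X, y)


class Memoize(BaseEstimator):
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        # The whole fit is cached; an identical later call is a disk read.
        self.estimator_ = memory.cache(_fit)(clone(self.estimator), X, y)
        return self

    def predict(self, X):
        return self.estimator_.predict(X)


X, y = load_iris(return_X_y=True)
memo = Memoize(LogisticRegression())
memo.set_params(estimator__C=0.1)   # indirection: prefixed parameter names
memo.fit(X, y)
print(memo.estimator_.coef_)        # indirection: fitted attributes one hop away
```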
Traditionally, decorators such as using
It provides a loophole, yes, but one that need only be used rarely.
It seems to me that something like
@amueller, it does not -- and cannot readily -- achieve model freezing. Just think about what happens when

And it does not achieve memoization transparently. It requires the user to modify their code to add indirection for parameter names, method calls, etc. I can't just modify my code by introducing the wrapper and expect everything else to work the same, and that's bad. (I've suggested elsewhere an alternative way of expressing nested parameters that gets around the parameter name indirection, which might placate me on that front.)
Hm.... I see... @arjoly gets around this by not freezing but building an estimator that remembers... but that's not cloneable either, right?
You could pass a serialized version as a parameter ;)

I would much prefer a solution built around this design to a modification of `clone` (I am aware that `clone` will flush the serialized model by default, so it would need an extra level of indirection to protect it).
The idea of my lazy estimator is to delay fitting until prediction time, to save memory in the context of ensemble methods. To avoid the time cost, you can memoize the method calls. I could also memoize the estimator fitting, but this wasn't interesting in my use case.
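A rough sketch of that idea, under my own assumptions about the gist (names here are illustrative, not arjoly's): `fit` only records the training data, and the actual fitting plus prediction happens inside a memoized, module-level helper when `predict` is called.

```python
from joblib import Memory
from sklearn.base import BaseEstimator, clone

memory = Memory('/tmp/lazy_cache', verbose=0)


def _fit_predict(estimator, X_train, y_train, X_test):
    # Memoized: repeated identical calls are served from the cache.
    return estimator.fit(X_train, y_train).predict(X_test)


class LazyEstimator(BaseEstimator):
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        # Defer the expensive work; only remember the training data.
        self.X_, self.y_ = X, y
        return self

    def predict(self, X):
        return memory.cache(_fit_predict)(clone(self.estimator), self.X_, self.y_, X)
```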
Your

Here's something that works as a mixin in terms of transparency, and not needing to reimplement:

```python
import joblib
from sklearn import datasets, linear_model
from sklearn.base import clone

memory = joblib.Memory('/tmp')


class MemoizedEstimatorMixin(object):
    def fit(self, *args, **kwargs):
        # Cache the parent class's fit; note that only the call arguments
        # (not the instance's own parameters) form the cache key here.
        fit = memory.cache(super(MemoizedEstimatorMixin, self).fit)
        new_self = fit(*args, **kwargs)
        self.__dict__.update(new_self.__dict__)
        return self


class MemoLR(MemoizedEstimatorMixin, linear_model.LogisticRegression):
    pass


iris = datasets.load_iris()
clone(MemoLR()).fit(iris.data, iris.target)  # hits the cache the second time
```

(I haven't considered freezing with a similar technique.)

Making

While I don't think the above is terrible, are you getting my point?
(Some minor thoughts on this: if a mixin is deemed appropriate, a syntax like
I note that other wrappers, such as
I actually think we should go the opaque way for now, simply so we can move forward in fixing things like #6451.
How do you intend to delete an estimator's

I still think opacity/indirection, not just in terms of parameter underscore prefixing, but in terms of getting at fitted attributes, is giving users an unnecessarily hard time, really...
Closing for now following discussion on freezing at the sprint (see #8370 (comment)).
Not sure why this is not closed, we have the
`sklearn.base.clone` is defined to reconstruct an object of the argument's type with its constructor parameters (from `get_params(deep=False)`) recursively cloned and other attributes removed.

There are cases where I think the One Obvious Way to provide an API entails allowing polymorphic overriding of clone behaviour. In particular, my longstanding implementation of wrappers for memoized and frozen estimators relies on this, and I would like to have that library of utilities not depend on a change to `sklearn.base`. So we need to patch the latter.

Let me try to explain. Let's say we want a way to freeze a model. That is, cloning it should not flush its fit attributes, and calling `fit` again should not affect it. A syntax like the following seems far and away the clearest:
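(The original code sample did not survive extraction; the sketch below is a reconstruction of the kind of `freeze_model` wrapper being discussed. The wrapper shown is hypothetical and deliberately naive, and it also illustrates why the stock `clone` defeats it.)

```python
# Hypothetical reconstruction -- a naive freeze_model wrapper whose fit()
# is a no-op on an already-fitted estimator.
from sklearn.base import BaseEstimator, clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression


class freeze_model(BaseEstimator):
    def __init__(self, estimator):
        self.estimator = estimator   # an already-fitted estimator

    def fit(self, X=None, y=None):
        return self                  # frozen: refuse to refit

    def predict(self, X):
        return self.estimator.predict(X)


X, y = load_iris(return_X_y=True)
frozen = freeze_model(LogisticRegression().fit(X, y))
frozen.fit()                         # no-op, as desired
cloned = clone(frozen)
# clone() rebuilt the wrapper from get_params() and recursively cloned the
# inner estimator, so cloned.estimator is now *unfitted*: the frozen state
# did not survive cloning.
```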
It should be obvious that the standard definition of `clone` won't make this operate very easily: we need to keep more than will be returned by `get_params`, unless `MyEstimator().__dict__` becomes a param of the `freeze_model` instance, which is pretty hacky.

Alternative syntax could be class decoration (`freeze_model(MyEstimator)()`) or a mixin (`class MyFrozenEstimator(MyEstimator, FrozenModel): pass`) such that the first call to `fit` then sets a frozen model. These are not only uglier, but encounter the same problems.

Ideally this sort of estimator wrapper should pass through `{set,get}_params` of the wrapped estimator without adding underscored prefixes (not that this is so pertinent for a frozen model, but for other applications of similar wrappers). It should also delegate all attributes to the wrapped estimator. Without making a mess of `freeze_model.__init__`, this is also not possible, IMO, without redefining `clone`.

So. Can we agree: `clone` or `clone_params` or `sklearn_clone`?