WIP allow Pipeline to memoize partial results #2086
Conversation
I should clarify the mechanism a bit: for the … Thus the cache is checked recursively from the end to the beginning of the pipeline (and filled from the beginning to the end).
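A minimal sketch of that lookup order (the name `transform_upto` and the dict-backed cache are mine, standing in for the PR's `Memory`-based machinery): the cache is probed at the deepest step first, and filled as the recursion unwinds from the first step forward.

```python
# Hypothetical sketch of the lookup order described above, using a plain dict
# in place of joblib.Memory. To get the output of steps[0..i], probe the cache
# at step i first; on a miss, recurse for step i-1's output, transform, store.
def transform_upto(steps, i, X, cache):
    key = tuple(repr(step) for step in steps[:i + 1])  # stand-in cache key
    if key in cache:                    # checked from the end backwards...
        return cache[key]
    X_prev = X if i == 0 else transform_upto(steps, i - 1, X, cache)
    Xt = steps[i].fit_transform(X_prev)
    cache[key] = Xt                     # ...filled from the beginning forwards
    return Xt
```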
I guess one thing I would like to know is: which …

As this is estimator dependent, this could/should be sorted by adding a G…
```python
return self.steps[:len(step_states)]

@cache
def _transform(X, other_args, fit_args, step_states, …
```
Please don't define functions in closures. It makes code hard to debug.
I couldn't work out another way to sensibly use `Memory.cache` to perform this operation. In particular, I do not want the pipeline object itself to be part of the cache key. I considered an approach that uses `Memory.cache`'s `ignore` argument, but I can't remember why I decided against it.

If you have a neat alternative, let me know. But this code is currently intended only as a prototype of the functionality.
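For reference, a rough sketch of what the `ignore`-based alternative could look like (the function name and argument layout here are assumptions, not the PR's code): joblib's `Memory.cache` accepts an `ignore` list of argument names excluded from the cache key, so the hard-to-hash pipeline object can be passed through without keying on it.

```python
from joblib import Memory

memory = Memory("/tmp/pipeline_cache", verbose=0)

# `pipeline` is excluded from the cache key; only step_params, step_idx
# and X determine whether a cached result is reused.
@memory.cache(ignore=["pipeline"])
def _fit_transform_step(step_params, step_idx, X, pipeline):
    name, est = pipeline.steps[step_idx]
    est.set_params(**step_params)  # the params in the key are the ones applied
    return est.fit(X).transform(X)
```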
I thought this might be what you intended. As far as I'm concerned, configuring (and providing) a memory parameter for each transformer is an annoyance, as would be a caching meta-estimator. I think there is a great usability advantage in providing an easy solution for avoiding unnecessary recomputation. And if fine-grained control is actually necessary, one could implement …
You are right in theory, but in practice, you will always get better …

I agree that generic memory in pipelines would be useful. I would think …
I'm considering what you wrote here:

My way of handling this was providing a closure whose arguments corresponded exactly to the cache key. Such closures are indeed not ideal, but I also think cloning the estimator is merely a workaround, not a clean solution. This is a limitation of … The other difficulty I see here is that …

Your … As far as I'm concerned, that eliminates …
My experience of software development (which is somewhat substantial) … Simpler code should be preferred to elegant code. Features that add a lot … These rules, I believe, are excellent guidelines to making a project …
Yes, that's correct. I don't want to build a full pipeline with parameter …
I understand where you're coming from, but removing redundant work in a grid search is a frequent, and sensible, request. If you can think of a way to do it without modifying each underlying estimator (should users really be expected to do so?) and without adding complexity to the code or the interface, do let me know.
I have been thinking about this use case and I think it should be possible to generate clean cache keys recursively to support nested estimators; see: https://gist.github.com/ogrisel/7091781
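In the spirit of that gist (this sketch is my paraphrase under assumptions, not the gist's actual code), a key can be derived from `get_params`, recursing whenever a parameter value is itself an estimator:

```python
from sklearn.base import BaseEstimator

def estimator_cache_key(estimator):
    """Build a hashable, deterministic key from an estimator's parameters."""
    params = estimator.get_params(deep=False)
    items = []
    for name in sorted(params):
        value = params[name]
        if isinstance(value, BaseEstimator):
            value = estimator_cache_key(value)  # recurse into sub-estimators
        items.append((name, repr(value)))
    return (type(estimator).__name__, tuple(items))
```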
My feature extraction step takes up the vast bulk of my pipeline execution time. This feature would really help me out.
One option you could use is here. Unfortunately it requires a small change to the scikit-learn codebase, after which you can just wrap the models whose fitting you want to memoize with … However, you might be better off just performing feature extraction as a separate preprocessing step.
I ended up writing a wrapper estimator that pickles the fitted estimator to a file on the first run and just unpickles it on subsequent runs, but a more general solution does seem like it would be useful in many situations.
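A minimal sketch of such a wrapper (the class name, file handling, and estimator API subset are my assumptions; a real version would also need to key the file on the parameters and training data):

```python
import os
import pickle

from sklearn.base import BaseEstimator

class PickledFitWrapper(BaseEstimator):
    """Fit once and pickle the result; later runs unpickle instead of refitting."""

    def __init__(self, estimator, path):
        self.estimator = estimator
        self.path = path

    def fit(self, X, y=None):
        if os.path.exists(self.path):
            with open(self.path, "rb") as f:
                self.estimator_ = pickle.load(f)  # reuse the earlier fit
        else:
            self.estimator_ = self.estimator.fit(X, y)
            with open(self.path, "wb") as f:
                pickle.dump(self.estimator_, f)
        return self

    def transform(self, X):
        return self.estimator_.transform(X)
```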
Ah, @agramfort, I intended that for another thread! :)
Perhaps the cloning behaviour only broke when you wanted to memoize the fit, but there were parameters etc. that didn't depend on fit. There may be better ways to do …
Sorry, I didn't follow the discussion closely; could you summarize why you closed this one?
I have PRs that are older than this one and I actually expect to receive …
@jnothman I'm after this feature as well. So what was the problem with your other method jnothman@de0f86d that kept it from being merged? It looks like something I could already use in my project. The only problem I see is the possibly slow cache comparisons when the inputs (…
I think we should actually investigate https://github.com/ContinuumIO/dask for doing this, though it would be a bit of a dependency.
This PR adds a `memory` parameter to `Pipeline` which allows it to memoize the results of partial pipeline evaluations (`fit`s, `transform`s). See [a request for this feature](http://www.mail-archive.com/[email protected]/msg07402.html).

Currently:

- it stores the `Pipeline`'s training data on the instance (perhaps a hash would suffice)
- `fit` and `transform` are called separately even where `fit_transform` is implemented
- … (`fit`, `transform`, `score` or `predict`, etc.), perhaps unnecessarily
- … `fit` methods as `Pipeline` currently does (this is the failing test)
- … (e.g. for `SelectKBest`, only `score_func` should be a key for `fit`, while (`score_func`, `k`) affect the result of `transform`)

@GaelVaroquaux, is this what you had in mind?

@amueller, this isn't as much a generalised CV solution (#1626) as #2000 was; to what extent does it satisfy your use-cases?
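For readers landing here later, a sketch of the usage this PR proposes (the exact signature is an assumption based on the description above; a `memory` parameter along these lines did eventually ship in scikit-learn 0.19):

```python
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Cached partial results live under this joblib.Memory location, so repeated
# fits (e.g. in a grid search varying only the SVC parameters) can reuse the
# already-computed SelectKBest fit/transform.
pipe = Pipeline(
    [("select", SelectKBest(k=10)), ("svc", SVC())],
    memory="/tmp/pipeline_cache",
)
```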