WIP allow Pipeline to memoize partial results #2086

Closed
wants to merge 1 commit

Conversation

jnothman
Member

This PR adds a memory parameter to Pipeline which allows it to memoize the results of partial pipeline evaluations (fits, transforms). See [a request for this feature](http://www.mail-archive.com/[email protected]/msg07402.html).

Currently:

  • it is a prototype, untested and not exactly sparkling clean code
  • it requires storing the Pipeline's training data on the instance (perhaps a hash would suffice)
  • the implementation perhaps caches data more frequently than necessary
  • fit and transform are called separately even where fit_transform is implemented
  • it memoizes the last step (whether it be fit, transform, score or predict, etc.), perhaps unnecessarily
  • it does not support passing keyword arguments to steps' fit methods as Pipeline currently does (this is the failing test)
  • it does not take advantage of the fact that for some estimators only a subset of parameters produce distinct models (e.g. for SelectKBest, only score_func should be a key for fit, while (score_func, k) affect the result of transform)
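
For illustration, a rough sketch of the kind of usage this would enable. The exact signature is hypothetical (not necessarily what this PR implements); joblib's Memory is assumed as the caching backend, and the step choices are arbitrary:

```python
from joblib import Memory
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# Hypothetical: hand the Pipeline a joblib.Memory; refitting with identical
# upstream steps and data would then reuse cached partial results instead of
# recomputing them (e.g. across grid search candidates).
memory = Memory('/tmp/pipeline_cache', verbose=0)
pipe = Pipeline([('reduce', PCA(n_components=10)),
                 ('clf', LinearSVC())],
                memory=memory)
pipe.fit(X, y)
```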

@GaelVaroquaux, is this what you had in mind?
@amueller, this isn't as much a generalised CV solution (#1626) as #2000 was; to what extent does it satisfy your use-cases?

@jnothman
Member Author

I should clarify the mechanism a bit: for the ith step, the following are in the cache key:

  • the step names, classes and parameters up to step i
  • the most recent arguments to fit or fit_transform
  • the current arguments to whatever method is being called

For fit, the cache stores all models from the beginning of the Pipeline up to step i. For transform etc., it stores the output at step i.

Thus the cache is checked recursively from the end to the beginning of the pipeline (and filled from the beginning to the end).
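
To make that concrete, here is a minimal, hypothetical sketch of how such a key might be assembled. The function `partial_key` is made up for illustration and is not the code in this PR; `get_params` is the standard estimator API, and joblib's own hashing would normally replace the pickle/sha1 step:

```python
import hashlib
import pickle


def partial_key(steps, i, fit_args, current_args):
    """Hypothetical cache key for the i-th pipeline step.

    Covers (a) names, classes and constructor parameters of steps 0..i,
    (b) the arguments most recently passed to fit/fit_transform, and
    (c) the arguments of the method currently being called.
    """
    step_part = [(name, type(est).__name__,
                  sorted(est.get_params(deep=False).items()))
                 for name, est in steps[:i + 1]]
    # Sketch only: pickling assumes all parameters and arguments are picklable.
    payload = pickle.dumps((step_part, fit_args, current_args))
    return hashlib.sha1(payload).hexdigest()
```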

@jnothman
Member Author

I guess one thing I would like to know is: which transform methods in scikit-learn are substantially more expensive than loading a cached result?

@GaelVaroquaux
Member

> I guess one thing I would like to know is: which transform methods in
> scikit-learn are substantially more expensive than loading a cached result?

As this is estimator dependent, this could/should be sorted by adding a
'memory' keyword to the transformers, as in the case of the feature
agglomeration. The nice aspect is that it can then be chosen to be used
in the best place (think the SVD in the PCA for instance).

G
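
A minimal sketch of the per-transformer approach being described, assuming joblib.Memory and a made-up CachedProjector transformer (scikit-learn's feature agglomeration accepts a memory argument in a similar spirit; nothing here is that actual code):

```python
import numpy as np
from joblib import Memory
from sklearn.base import BaseEstimator, TransformerMixin


def _expensive_decomposition(X, n_components):
    # Stand-in for an expensive internal step (think the SVD inside a PCA).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:n_components]


class CachedProjector(BaseEstimator, TransformerMixin):
    """Hypothetical transformer taking a `memory` constructor argument."""

    def __init__(self, n_components=2, memory=None):
        self.n_components = n_components
        self.memory = memory

    def fit(self, X, y=None):
        # Memory(None) disables caching, so the parameter is optional.
        memory = self.memory if self.memory is not None else Memory(None, verbose=0)
        # Only the expensive inner computation is memoized, so the author of
        # the estimator decides where caching actually pays off.
        self.components_ = memory.cache(_expensive_decomposition)(X, self.n_components)
        return self

    def transform(self, X):
        return X @ self.components_.T
```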

return self.steps[:len(step_states)]

@cache
def _transform(X, other_args, fit_args, step_states,
Member

Please don't define functions in closures. It makes code hard to debug.

Member Author

I couldn't work out another way to sensibly use Memory.cache to perform this operation. In particular, I do not want the pipeline object itself to be part of the cache key. I considered an approach that uses Memory.cache's ignore argument, but I can't remember why I decided against it.

If you have a neat alternative, let me know. But this code is currently intended only as a prototype of the functionality.

@jnothman
Member Author

> As this is estimator dependent, this could/should be sorted by adding a 'memory' keyword to the transformers, as in the case of the feature agglomeration.

I thought this might be what you intended. As far as I'm concerned, configuring (and providing) a memory parameter for each transformer is an annoyance, as would be a caching MetaEstimator.

I think there is a great usability advantage in providing an easy solution for avoiding unnecessary Pipeline subsequence fits and perhaps transforms (the value of both is estimator-dependent). The question I was getting at is whether there's much value saved by caching transforms as well as fits.

And if fine-grained control is actually necessary, one could implement cache_fits and cache_transforms parameters that are boolean or an array of booleans corresponding to the Pipeline steps. But I think this solution is overkill for a convenience implementation.

@GaelVaroquaux
Member

> As far as I'm concerned, configuring (and providing) a memory parameter for each transformer is an annoyance, as would be a caching MetaEstimator.

You are right in theory, but in practice, you will always get better
performance with more specific code.

> I think there is a great usability advantage in providing an easy
> solution for avoiding unnecessary Pipeline subsequence fits and perhaps
> transforms (the value of both is estimator-dependent). The question I
> was getting at is whether there's much value saved by caching
> transforms as well as fits.

I agree that generic memory in pipelines would be useful. I would think
that we want to cache transformers' fit, but probably not the transform
method.

@jnothman
Member Author

I'm considering what you wrote here:

> One problem that you face is that an estimator object can have estimated parameters. These may render the cache invalid, while they really shouldn't affect the fit.

My way of handling this was providing a closure whose arguments corresponded exactly to the cache key. Such closures are indeed not ideal, but I also think cloning the estimator is merely a workaround, not a clean solution. This is a limitation of Memory. Perhaps just as Memory.cache has an argument ignore, it should also have a way to get additional cache keys.

The other difficulty here is that fit updates the state of the object, and we don't actually want to cache its return value so much as the final state (even though they should generally be the same). This is the reason fit_transform cannot be cached.
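
To illustrate the first point, the tension between the two approaches might look roughly like this. The fit_step functions below are made up; only Memory.cache and its documented ignore argument are real joblib API, and neither option is the code in this PR:

```python
from joblib import Memory
from sklearn.base import clone

memory = Memory('/tmp/cache_demo', verbose=0)


# Option 1: a module-level function whose arguments *are* exactly the
# intended cache key. The estimator is rebuilt from its class and
# parameters, so its estimated attributes never leak into the key.
@memory.cache
def fit_step(estimator_class, params, X, y):
    return estimator_class(**params).fit(X, y)


# Option 2: pass the estimator in, but tell joblib to ignore it when
# hashing; the explicit params argument then stands in for it in the key.
@memory.cache(ignore=['estimator'])
def fit_step_ignore(estimator, params, X, y):
    return clone(estimator).set_params(**params).fit(X, y)
```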

@jnothman
Member Author

Your cache(fit)(clone(estimator), X, y) solution also will not handle the case of excluding parameters that don't affect fit and its learnt attributes. For example, using a LinearSVC as a feature selector, but wanting to play with the threshold in different grid search candidates, we don't want to refit when that threshold parameter changes.

As far as I'm concerned, that eliminates clone as an option. Would you rather a closure, or some extra argument to Memory.cache?
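
Purely to illustrate the kind of distinction meant here (nothing below exists in scikit-learn; the registry and helper are hypothetical):

```python
# Hypothetical registry: for each estimator class, which constructor
# parameters actually influence the fitted model. Parameters applied only
# at transform time (k for SelectKBest, a selection threshold for a
# LinearSVC-based selector) are left out of the fit cache key.
FIT_RELEVANT_PARAMS = {
    'SelectKBest': ['score_func'],
    'LinearSVC': ['C', 'penalty', 'loss'],
}


def fit_key_params(estimator):
    """Return only the parameters that should key the cached fit."""
    params = estimator.get_params(deep=False)
    relevant = FIT_RELEVANT_PARAMS.get(type(estimator).__name__)
    if relevant is None:
        return params  # fall back to keying on all parameters
    return {name: params[name] for name in relevant if name in params}
```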

@GaelVaroquaux
Member

> My way of handling this was providing a closure whose arguments corresponded
> exactly to the cache key. Such closures are indeed not ideal, but I also think
> cloning the estimator is merely a workaround, not a clean solution. This is a
> limitation of Memory. Perhaps just as Memory.cache has an argument ignore, it
> should also have a way to get additional cache keys.

My experience of software development (which is somewhat substantial, having worked on and followed many projects) is that complexity is our worst enemy, to a point which should not be underestimated. Indeed, as a project grows more and more complex, development slows down as adding each feature becomes harder, and fewer and fewer people are qualified to modify it.

Simpler code should be preferred to elegant code. Features that add a lot
of complexity should be included only if they are mission critical.

These rules, I believe, are excellent guidelines to making a project successful in the long run. And indeed, I review code with them in mind: if there is a simpler solution (fewer abstractions, fewer lines of code), I will always push for it.

@GaelVaroquaux
Member

> Your cache(fit)(clone(estimator), X, y) solution also will not handle
> the case of excluding parameters that don't affect fit and its learnt
> attributes. For example, using a LinearSVC as a feature selector, but
> wanting to play with the threshold in different grid search candidates,
> we don't want to refit when that threshold parameter changes.

Yes, that's correct. I don't want to build a full pipeline with parameter tracking. Experience shows that it is a very costly enterprise that really slows down package development. I would suggest that people with such needs implement a 'mem' inside the estimator object: a simpler solution that is not generic.

@jnothman
Member Author

I understand where you're coming from, but removing redundant work in a grid search is a frequent, and sensible, request. If you can consider a way to do it without modifying each underlying estimator (should users really be expected to do so?) that does not add complexity to the code or the interface, do let me know.

@ogrisel
Member

ogrisel commented Oct 31, 2013

I have been thinking about this use case and I think it should be possible to generate clean cache keys recursively to support nested estimators, see: https://gist.github.com/ogrisel/7091781
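
The gist is the actual proposal; very roughly, a recursive key over nested estimators could look like this sketch (illustrative only, not the gist's code):

```python
def estimator_key(obj):
    """Recursively describe an estimator by class and constructor parameters.

    Nested estimators (e.g. Pipeline steps or a base_estimator parameter)
    are expanded the same way, so the key reflects the whole construction
    rather than any fitted state.
    """
    if hasattr(obj, 'get_params'):
        params = obj.get_params(deep=False)
        return (type(obj).__name__,
                tuple(sorted((k, estimator_key(v)) for k, v in params.items())))
    if isinstance(obj, (list, tuple)):
        return tuple(estimator_key(v) for v in obj)
    return repr(obj)
```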

@briandastous

My feature extraction step takes up the vast bulk of my pipeline execution time. This feature would really help me out.

@jnothman
Member Author

One option you could use is here. Unfortunately it requires a small change to the scikit-learn codebase, after which you can just wrap those models you want to memoize the fitting of with remember_model (or remember_transform).

However, you might be better off just performing feature extraction as a preprocessing step.

@briandastous

I ended up writing a wrapper estimator that pickles the fitted estimator to a file on the first run and on subsequent runs just unpickles it, but a more general solution does seem like it would be useful in many situations.
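
For anyone wanting a similar stop-gap, a minimal sketch of such a wrapper (hypothetical, not part of scikit-learn). Unlike a keyed cache, it does not invalidate itself when the data or parameters change; the file has to be removed by hand:

```python
import os
import pickle

from sklearn.base import BaseEstimator


class PickledFitWrapper(BaseEstimator):
    """Fit the wrapped estimator once; later runs reload the pickled result."""

    def __init__(self, estimator, path):
        self.estimator = estimator
        self.path = path

    def fit(self, X, y=None):
        if os.path.exists(self.path):
            # Reuse the previously fitted estimator, ignoring X and y.
            with open(self.path, 'rb') as f:
                self.estimator_ = pickle.load(f)
        else:
            self.estimator_ = self.estimator.fit(X, y)
            with open(self.path, 'wb') as f:
                pickle.dump(self.estimator_, f)
        return self

    def transform(self, X):
        return self.estimator_.transform(X)
```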

@agramfort
Member

> Well, if there are no other comments, I'll merge in a day or so. The CI
> failures appear spurious.

please don't merge before getting two +1s from others.

@jnothman
Member Author

Ah, @agramfort, I intended that for another thread! :)

@jnothman
Member Author

> I ended up writing a wrapper estimator that pickles the fitted estimator to a file on the first run and on subsequent runs just unpickles it, but a more general solution does seem like it would be useful in many situations.

remember_model does basically that; however, in that case I found that the cloning behaviour in cross-validation broke it, so I had to handle it specially.

@jnothman
Member Author

Perhaps the cloning behaviour only broke when you wanted to memoize the fit, but there were parameters etc. that didn't depend on fit. There may be better ways to do remember_model in any case.

@amueller
Member

Sorry I didn't follow the discussion closely, could you summarize why you closed this one?

@jnothman
Member Author

I have PRs that are older than this one which I actually expect to receive reviews and be merged; this one I don't. The issue is still an issue: too much work is being redone in cross-validated pipelines by default, but I don't think this is the right solution. I think being able to tell any estimator to cache its model with respect to all or a subset of its parameters and training data (by way of a metaestimator, a mixin, an inbuilt parameter in BaseEstimator, or whatever) is a good way to keep the feature modular. Supporting a metaestimator form only requires a change to sklearn.base.clone to support polymorphism (see jnothman@de0f86d).


@simonzack

@jnothman I'm after this feature as well. So what was the problem with your other method (jnothman@de0f86d) that kept it from being merged? It looks like something I can already use in my project. The only problem I see is the possibly slow cache comparisons when the inputs (X) are large, since in pipelines only the initial input really needs to be compared.

@amueller
Member

I think we should actually investigate https://github.com/ContinuumIO/dask for doing this, though this would be a bit of a dependency.
