
ENH add joblib caching to Pipeline #3951


Closed
dblalock wants to merge 2 commits into scikit-learn:master from dblalock:pipeline-caching

Conversation

dblalock

Hello,

I recently added optional joblib caching to Pipeline for a personal project, to avoid redundant computations and save the intermediate outputs. I found it useful, so I figured I'd offer it up as a pull request. It passes make test, pep8, and pyflakes, but I'm new to open source and trying to learn (fresh out of college...), so feel free to tell me this needs more tests, is a bad idea, should change stylistically, etc.

Also, I read the contributing page, but apologies if I'm nonetheless following this process improperly.

-Dave

This adds optional caching to the results of each stage of a pipeline so
that computations can be executed once and transparently persisted across
executions. The motivation is to avoid writing large amounts of code
to store and load intermediate outputs. This can already be done easily
for one's own functions via joblib, but is not possible for existing
estimators within a pipeline without modifications such as these.
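
For context, this is the kind of joblib memoization the paragraph above refers to; a minimal sketch, with the function name and cache directory purely illustrative:

```python
# What joblib already makes easy for one's own functions: memoize a
# plain function to disk. Directory and function names are illustrative.
import numpy as np
from joblib import Memory

memory = Memory("/tmp/joblib_demo", verbose=0)

@memory.cache
def expensive_transform(X):
    # Stand-in for a costly computation.
    return np.asarray(X) * 2

expensive_transform([1, 2, 3])  # computed and persisted to disk
expensive_transform([1, 2, 3])  # loaded from the cache, not recomputed
```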

Changes include implementation of this functionality within pipeline.py,
updated documentation, and additional unit tests.
@coveralls

Coverage Status

Coverage remained the same when pulling f4771ba on dblalock:pipeline-caching into 56ee99c on scikit-learn:master.

@jnothman
Member

It's not a bad idea, but it has been discussed before extensively (e.g. #2086). I'm interested to see how you have implemented it, but expect some friction before any such feature is merged!

@jnothman
Member

As an exercise, could you please set cache=True as the default and report whether existing tests fail? Thanks.

@coveralls

Coverage Status

Coverage remained the same when pulling 724573d on dblalock:pipeline-caching into 56ee99c on scikit-learn:master.

@dblalock
Author

Good suggestion. It initially failed two tests in the "working with text" doctests because the named_steps dict wasn't being updated, but everything passes now (branch with this experiment here). The only snag I observed when running it with this setting was that the caching version took much longer when manipulating the 20 Newsgroups dataset in the text analysis tutorial:

> ~/Desktop/code/scikit-learn$ nosetests --doctest-tests doc/tutorial/text_analytics/working_with_text_data.rst
.
----------------------------------------------------------------------
Ran 1 test in 5.864s        # no caching

OK
> ~/Desktop/code/scikit-learn$ nosetests --doctest-tests doc/tutorial/text_analytics/working_with_text_data.rst
.
----------------------------------------------------------------------
Ran 1 test in 64.028s       # cache=True, first run

OK
> ~/Desktop/code/scikit-learn$ nosetests --doctest-tests doc/tutorial/text_analytics/working_with_text_data.rst
.
----------------------------------------------------------------------
Ran 1 test in 29.057s       # cache=True, subsequent runs; +/- 1s

OK

I.e., the text analysis runs significantly slower with cache=True than cache=False. This is unsurprising given that it's probably faster to count tokens than hash giant strings, as joblib is forced to do in this case. Clearly, caching should be off by default.

@jnothman
Member

Thanks for that. I think it would be a good idea to modify all pipeline tests to ensure they work both with and without caching, but it's a lot of work for little glory.

> Clearly, caching should be off by default.

Or this is the wrong model, and the user should be able to cache any particular estimator rather than the whole pipeline. Yet in terms of usability for the majority of cases, I think the present model excels.
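
To illustrate that alternative model, a hypothetical per-estimator caching wrapper might look roughly like this (nothing of the sort exists in scikit-learn; all names here are made up):

```python
# Hypothetical per-estimator caching wrapper, for contrast with the
# Pipeline-level cache flag in this PR. Nothing here exists in sklearn.
from joblib import Memory
from sklearn.base import BaseEstimator, TransformerMixin, clone

def _fit_one(transformer, X, y):
    # Module-level helper so joblib can hash and pickle the call.
    return transformer.fit(X, y)

class CachedTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, transformer, memory=None):
        self.transformer = transformer
        self.memory = memory

    def fit(self, X, y=None):
        memory = self.memory
        if isinstance(memory, str):
            memory = Memory(memory, verbose=0)
        fit_one = memory.cache(_fit_one) if memory is not None else _fit_one
        # Clone first so fitted state never enters the cache key.
        self.transformer_ = fit_one(clone(self.transformer), X, y)
        return self

    def transform(self, X):
        return self.transformer_.transform(X)
```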

@@ -46,6 +49,17 @@ class Pipeline(BaseEstimator):
chained, in the order in which they are chained, with the last object
an estimator.

cache: boolean
Member

Please see other instances where memoization is performed, e.g. https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/hierarchical.py#L710, and structure your usage similarly: accept a path or a Memory instance, defaulting to None. Rather than a global Memory instance and the decorator form, just use memory.cache(my_func)(estimator, ...)
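
A rough sketch of the suggested convention, assuming joblib's Memory API (helper names are illustrative, not the actual hierarchical.py code):

```python
# Sketch of the suggested pattern: a `memory` argument accepting a path
# or a joblib.Memory, defaulting to None, wrapped at call time.
from joblib import Memory

def _fit_transform_one(transformer, X, y):
    # Module-level so joblib can cache it; the name is illustrative.
    return transformer.fit(X, y).transform(X)

def _apply_step(transformer, X, y, memory=None):
    if memory is None:
        memory = Memory(None, verbose=0)  # location=None: caching disabled
    elif isinstance(memory, str):
        memory = Memory(memory, verbose=0)
    return memory.cache(_fit_transform_one)(transformer, X, y)
```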

Author

I originally did this, but found that it got broken by GridSearchCV's cloning of the Pipeline object; hence the pre-defined cached and non-cached functions below. I'll revisit this, though.

Member

Not sure why this would happen...

@jnothman
Member

I'd like to see tests with a mock estimator that counts the number of calls to each method so that we can actually tell when it's being called.
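
Something along these lines, presumably; a hypothetical mock, not an existing test fixture:

```python
# Hypothetical mock for such tests: count method calls so a test can
# assert that a cache hit skips refitting.
from sklearn.base import BaseEstimator, TransformerMixin

class CountingTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.fit_count = 0
        self.transform_count = 0

    def fit(self, X, y=None):
        self.fit_count += 1
        return self

    def transform(self, X):
        self.transform_count += 1
        return X
```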

@jnothman
Member

Apart from hashing text, I'm a little concerned that we're using entire models as cache keys (#2086 instead used the training data + the estimator parameters). Can you benchmark caching a data-heavy model like RandomTreesEmbedding? How long does the caching itself take?

Apart from efficiency, one reason for using parameters rather than the whole estimator object is that we might be able to annotate some estimator parameters as not affecting the model fitting (e.g. feature selection thresholds in most cases) and then exclude them from the cache key. But this certainly falls under the class of pie-in-the-sky enhancements, and could, if the user really cares to, be achieved in other ways.
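
For concreteness, a sketch of the #2086-style keying, where only the inputs enter the hash (the cache_key helper is made up):

```python
# Sketch of keying on training data + parameters instead of the fitted
# model: fitted attributes (e.g. a forest's trees) never enter the hash.
import joblib

def cache_key(estimator, X, y=None):
    # `cache_key` is a hypothetical helper, not part of this PR.
    return joblib.hash((type(estimator).__name__,
                        estimator.get_params(deep=True), X, y))
```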

@hnykda
Contributor

hnykda commented Aug 24, 2016

Hi.

Is there any progress on this? There is also something on #2086. Is anyone currently working on this limitation of pipelines?

@jnothman
Member

No one's working on this limitation of Pipelines at the moment, unfortunately. And my attempts (#5080) to make clone work better for generic memoising wrappers have come to a standstill.

@hnykda
Contributor

hnykda commented Aug 24, 2016

What a pity. This would make a huge impact on many of my projects (and surely on projects of others). But reading through all related issues and PRs, I don't believe I could do this myself.

@GaelVaroquaux
Member

I would use a subclass of Pipeline, rather than Pipeline itself. The reason being that it is probably a good idea to clone the objects before memoizing the fit (in order to maximize the cache hits). This is not what the standard Pipeline does, hence memoizing would have a slightly different behavior (though one fully consistent with scikit-learn's philosophy of not modifying input parameters).
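
A rough sketch of that clone-before-memoize idea, assuming a joblib Memory backend (names and paths are illustrative):

```python
# Clone-before-memoize, sketched: clone() yields an unfitted copy with
# the same parameters, so equal (params, data) pairs share a cache entry.
from joblib import Memory
from sklearn.base import clone

memory = Memory("/tmp/pipeline_cache", verbose=0)  # path is illustrative

def _fit(step, X, y):
    return step.fit(X, y)

def fit_step_cached(step, X, y):
    # Cloning discards fitted state that would otherwise pollute the key.
    return memory.cache(_fit)(clone(step), X, y)
```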

@amueller
Member

amueller commented Nov 29, 2016

@GaelVaroquaux yeah, the not-cloning is an interesting feature. Name suggestions? Cloneline? ;) [Actually, we should be pretty careful with naming now; our names get adopted quickly.]

@hnykda
Contributor

hnykda commented Nov 29, 2016

I have to say that cloneline sounds awesome :-D

@amueller
Member

should actually be Cacheline ;)

@GaelVaroquaux
Member

GaelVaroquaux commented Nov 29, 2016 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Nov 29, 2016 via email

@amueller
Member

yeah why not. That's easy to do once we have the class....

@glemaitre
Member

glemaitre commented Dec 5, 2016

I will propose a PR based on the above comments and the discussion from the mailing list.

@jnothman
Member

Fixed by #7990. Thanks @dblalock!

@jnothman jnothman closed this Feb 13, 2017