
ENH add joblib caching to Pipeline #3951


Closed
dblalock wants to merge 2 commits into scikit-learn:master from dblalock:pipeline-caching

Conversation

dblalock

Hello,

I recently added optional joblib caching to Pipeline for a personal project, to avoid redundant computations and save the intermediate outputs. I found it useful, so I figured I'd offer it up as a pull request. It passes make test, pep8, and pyflakes, but I'm new to open source and trying to learn (fresh out of college...), so feel free to tell me this needs more tests, is a bad idea, should change stylistically, etc.

Also, I read the contributing page, but apologies if I'm nonetheless following this process improperly.

-Dave

This adds optional caching to the results of each stage of a pipeline so
that computations can be executed once and transparently persisted across
executions. The motivation is to avoid writing large amounts of code
to store and load intermediate outputs. This can already be done easily
for one's own functions via joblib, but is not possible for existing
estimators within a pipeline without modifications such as these.
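
For context, this is the kind of joblib memoization the paragraph above refers to; a minimal sketch, with the function name and cache directory purely illustrative:

```python
# What joblib already makes easy for one's own functions: memoize a
# plain function to disk. Directory and function names are illustrative.
import numpy as np
from joblib import Memory

memory = Memory("/tmp/joblib_demo", verbose=0)

@memory.cache
def expensive_transform(X):
    # Stand-in for a costly computation.
    return np.asarray(X) * 2

expensive_transform([1, 2, 3])  # computed and persisted to disk
expensive_transform([1, 2, 3])  # loaded from the cache, not recomputed
```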

Changes include implementation of this functionality within pipeline.py,
updated documentation, and additional unit tests.
@coveralls

Coverage Status

Coverage remained the same when pulling f4771ba on dblalock:pipeline-caching into 56ee99c on scikit-learn:master.

@jnothman
Member

It's not a bad idea, but it has been discussed before extensively (e.g. #2086). I'm interested to see how you have implemented it, but expect some friction before any such feature is merged!

@jnothman
Member

As an exercise, could you please set cache=True as the default and report whether existing tests fail? Thanks.

@coveralls

Coverage Status

Coverage remained the same when pulling 724573d on dblalock:pipeline-caching into 56ee99c on scikit-learn:master.

@dblalock
Author

Good suggestion. It initially failed two tests in the "working with text" doctests because the named_steps dict wasn't being updated, but everything passes now (branch with this experiment here). The only snag I observed when running it with this setting was that the caching version took much longer when manipulating the 20 Newsgroups dataset in the text analysis tutorial:

> ~/Desktop/code/scikit-learn$ nosetests --doctest-tests doc/tutorial/text_analytics/working_with_text_data.rst
.
----------------------------------------------------------------------
Ran 1 test in 5.864s        # no caching

OK
> ~/Desktop/code/scikit-learn$ nosetests --doctest-tests doc/tutorial/text_analytics/working_with_text_data.rst
.
----------------------------------------------------------------------
Ran 1 test in 64.028s       # cache=True, first run

OK
> ~/Desktop/code/scikit-learn$ nosetests --doctest-tests doc/tutorial/text_analytics/working_with_text_data.rst
.
----------------------------------------------------------------------
Ran 1 test in 29.057s       # cache=True, subsequent runs; +/- 1s

OK

I.e., the text analysis runs significantly slower with cache=True than cache=False. This is unsurprising given that it's probably faster to count tokens than hash giant strings, as joblib is forced to do in this case. Clearly, caching should be off by default.

@jnothman
Member

Thanks for that. I think it would be a good idea to modify all pipeline tests to ensure they work both with and without caching, but it's a lot of work for little glory.

> Clearly, caching should be off by default.

Or this is the wrong model, and the user should be able to cache any particular estimator rather than the whole pipeline. Yet in terms of usability for the majority of cases, I think the present model excels.
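
To illustrate that alternative model, a hypothetical per-estimator caching wrapper might look roughly like this (nothing of the sort exists in scikit-learn; all names here are made up):

```python
# Hypothetical per-estimator caching wrapper, for contrast with the
# Pipeline-level cache flag in this PR. Nothing here exists in sklearn.
from joblib import Memory
from sklearn.base import BaseEstimator, TransformerMixin, clone

def _fit_one(transformer, X, y):
    # Module-level helper so joblib can hash and pickle the call.
    return transformer.fit(X, y)

class CachedTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, transformer, memory=None):
        self.transformer = transformer
        self.memory = memory

    def fit(self, X, y=None):
        memory = self.memory
        if isinstance(memory, str):
            memory = Memory(memory, verbose=0)
        fit_one = memory.cache(_fit_one) if memory is not None else _fit_one
        # Clone first so fitted state never enters the cache key.
        self.transformer_ = fit_one(clone(self.transformer), X, y)
        return self

    def transform(self, X):
        return self.transformer_.transform(X)
```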

@@ -46,6 +49,17 @@ class Pipeline(BaseEstimator):
chained, in the order in which they are chained, with the last object
an estimator.

cache: boolean
Member

Please see other instances where memoization is performed, e.g. https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/hierarchical.py#L710, and structure your usage similarly: accept a path or a Memory instance, defaulting to None. Rather than a global Memory instance and the decorator form, just use memory.cache(my_func)(estimator, ...)
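
A rough sketch of the suggested convention, assuming joblib's Memory API (helper names are illustrative, not the actual hierarchical.py code):

```python
# Sketch of the suggested pattern: a `memory` argument accepting a path
# or a joblib.Memory, defaulting to None, wrapped at call time.
from joblib import Memory

def _fit_transform_one(transformer, X, y):
    # Module-level so joblib can cache it; the name is illustrative.
    return transformer.fit(X, y).transform(X)

def _apply_step(transformer, X, y, memory=None):
    if memory is None:
        memory = Memory(None, verbose=0)  # location=None: caching disabled
    elif isinstance(memory, str):
        memory = Memory(memory, verbose=0)
    return memory.cache(_fit_transform_one)(transformer, X, y)
```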

Author

I originally did this, but found that it got broken by GridSearchCV's cloning of the Pipeline object; hence the pre-defined cached and non-cached functions below. I'll revisit this, though.

Member

Not sure why this would happen...

@jnothman
Member

I'd like to see tests with a mock estimator that counts the number of calls to each method so that we can actually tell when it's being called.
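
Something along these lines, presumably; a hypothetical mock, not an existing test fixture:

```python
# Hypothetical mock for such tests: count method calls so a test can
# assert that a cache hit skips refitting.
from sklearn.base import BaseEstimator, TransformerMixin

class CountingTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.fit_count = 0
        self.transform_count = 0

    def fit(self, X, y=None):
        self.fit_count += 1
        return self

    def transform(self, X):
        self.transform_count += 1
        return X
```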

@jnothman
Member

Apart from hashing text, I'm a little concerned that we're using entire models as cache keys (#2086 instead used the training data + the estimator parameters). Can you benchmark caching a data-heavy model like RandomTreesEmbedding? How long does the caching itself take?

Apart from efficiency, one reason for using parameters rather than the whole estimator object is that we might be able to annotate some estimator parameters as not affecting the model fitting (e.g. feature selection thresholds in most cases) and then exclude them from the cache key. But this certainly falls under the class of pie-in-the-sky enhancements, and could, if the user really cares to, be achieved in other ways.
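
For concreteness, a sketch of the #2086-style keying, where only the inputs enter the hash (the cache_key helper is made up):

```python
# Sketch of keying on training data + parameters instead of the fitted
# model: fitted attributes (e.g. a forest's trees) never enter the hash.
import joblib

def cache_key(estimator, X, y=None):
    # `cache_key` is a hypothetical helper, not part of this PR.
    return joblib.hash((type(estimator).__name__,
                        estimator.get_params(deep=True), X, y))
```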

@hnykda
Contributor

hnykda commented Aug 24, 2016

Hi.

Is there any progress on this? There is also something on #2086. Is anyone currently working on this limitation of pipelines?

@jnothman
Member

No one's working on this limitation of Pipelines at the moment, unfortunately. And my attempts (#5080) to make clone work better for generic memoising wrappers have come to a standstill.

@hnykda
Contributor

hnykda commented Aug 24, 2016

What a pity. This would make a huge impact on many of my projects (and surely on projects of others). But reading through all related issues and PRs, I don't believe I could do this myself.

@GaelVaroquaux
Member

I would use a subclass of Pipeline, rather than Pipeline itself. The reason being that it is probably a good idea to clone the objects before memoizing the fit (in order to maximize the cache hits). This is not what the standard Pipeline does, hence memoizing would have a slightly different behavior (though one fully consistent with scikit-learn's philosophy of not modifying input parameters).
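
A rough sketch of that clone-before-memoize idea, assuming a joblib Memory backend (names and paths are illustrative):

```python
# Clone-before-memoize, sketched: clone() yields an unfitted copy with
# the same parameters, so equal (params, data) pairs share a cache entry.
from joblib import Memory
from sklearn.base import clone

memory = Memory("/tmp/pipeline_cache", verbose=0)  # path is illustrative

def _fit(step, X, y):
    return step.fit(X, y)

def fit_step_cached(step, X, y):
    # Cloning discards fitted state that would otherwise pollute the key.
    return memory.cache(_fit)(clone(step), X, y)
```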

@amueller
Member

amueller commented Nov 29, 2016

@GaelVaroquaux yeah, the not-cloning is an interesting feature. Name suggestions? Cloneline? ;) [Actually, we should be pretty careful with naming now; our names get adopted quickly.]

@hnykda
Contributor

hnykda commented Nov 29, 2016

I have to say that cloneline sounds awesome :-D

@amueller
Member

should actually be Cacheline ;)

@GaelVaroquaux
Member

GaelVaroquaux commented Nov 29, 2016 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Nov 29, 2016 via email

@amueller
Member

yeah why not. That's easy to do once we have the class....

@glemaitre
Member

glemaitre commented Dec 5, 2016

I will propose a PR based on the above comments and the discussion from the mailing list.

@jnothman
Member

Fixed by #7990. Thanks @dblalock!

@jnothman jnothman closed this Feb 13, 2017