ENH add joblib caching to Pipeline #3951
Conversation
This adds optional caching to the results of each stage of a pipeline so that computations can be executed once and transparently persisted across executions. The motivation is to avoid writing large amounts of code to store and load intermediate outputs. This can already be done easily for one's own functions via joblib, but is not possible for existing estimators within a pipeline without modifications such as these. Changes include implementation of this functionality within pipeline.py, updated documentation, and additional unit tests.
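For illustration, here is a minimal usage sketch assuming the cache boolean flag this PR adds to Pipeline (the exact signature is per the diff discussed below, and may change):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# `cache=True` is the boolean flag proposed in this PR (not released API).
pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', MultinomialNB())],
                cache=True)

data = fetch_20newsgroups(subset='train')
pipe.fit(data.data, data.target)  # computes and persists intermediate outputs
pipe.fit(data.data, data.target)  # transform stages are reloaded from the cache
```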
It's not a bad idea, but it has been discussed extensively before (e.g. #2086). I'm interested to see how you have implemented it, but expect some friction before any such feature is merged!
As an exercise, could you please set cache=True as the default and check that the full test suite still passes?
Good suggestion. It initially failed two doctests in the "working with text" tutorial because the named_steps dict wasn't being updated, but everything passes now (branch with this experiment here). The only snag I observed when running with this setting was that the caching version took much longer on the 20 Newsgroups dataset in the text analysis tutorial: the analysis runs significantly slower with cache=True than with cache=False. This is unsurprising, given that it's probably faster to count tokens than to hash giant strings, as joblib is forced to do in this case. Clearly, caching should be off by default.
Thanks for that. I think it would be a good idea to modify all pipeline tests to ensure they work both with and without caching, but it's a lot of work for little glory.
Or perhaps this is the wrong model, and the user should be able to cache any particular estimator rather than the whole pipeline. Still, I think that for the majority of use cases, the present model excels in terms of usability.
@@ -46,6 +49,17 @@ class Pipeline(BaseEstimator):
        chained, in the order in which they are chained, with the last object
        an estimator.

    cache: boolean
Please see other instances where memoization is performed, like https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/hierarchical.py#L710. Structure your usage similarly: accept a path or a Memory instance, defaulting to None. Also, instead of a global Memory instance, don't use the decorator form; just use memory.cache(my_func)(estimator, ...)
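A minimal sketch of the suggested pattern, assuming the vendored joblib of that era (sklearn.externals.joblib) and an illustrative module-level helper named _fit_transform_one; none of these names are fixed by this PR:

```python
import numpy as np
from sklearn.externals.joblib import Memory  # joblib was vendored in sklearn then
from sklearn.preprocessing import StandardScaler

def _fit_transform_one(transformer, X, y):
    # A plain module-level function: joblib memoizes the function together
    # with its arguments, so the estimator and the data form the cache key.
    return transformer.fit(X, y).transform(X)

memory = '/tmp/sklearn_cache'  # the parameter accepts a path or a Memory instance
if not isinstance(memory, Memory):
    memory = Memory(cachedir=memory, verbose=0)  # None disables caching

X, y = np.random.rand(10, 3), np.arange(10)
# No decorator form: wrap the function at call time, as suggested above.
Xt = memory.cache(_fit_transform_one)(StandardScaler(), X, y)
```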
I originally did this but found that it got broken by GridSearchCV's cloning of the Pipeline object; hence the pre-defined cached and non-cached functions below. I'll revisit this, though.
Not sure why this would happen...
I'd like to see tests with a mock estimator that counts the number of calls to each method, so that we can actually tell when it's being called.
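A sketch of such a test, assuming the cache boolean flag proposed in this PR and that the pipeline does not clone the supplied steps (the mock's counters would otherwise be reset):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline

class CountingTransformer(BaseEstimator, TransformerMixin):
    """Mock transformer recording how often fit and transform are invoked."""
    def __init__(self):
        self.fit_calls = 0
        self.transform_calls = 0

    def fit(self, X, y=None):
        self.fit_calls += 1
        return self

    def transform(self, X):
        self.transform_calls += 1
        return X

def test_cached_pipeline_does_not_refit():
    X, y = np.random.rand(20, 4), np.arange(20) % 2
    counter = CountingTransformer()
    pipe = Pipeline([('count', counter), ('clf', DummyClassifier())],
                    cache=True)  # hypothetical flag from this PR
    pipe.fit(X, y)
    pipe.fit(X, y)  # identical input: the transform step should hit the cache
    assert counter.fit_calls == 1
```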
Apart from hashing text, I'm a little concerned that we're using entire models as cache keys (#2086 instead used the training data plus the estimator parameters). Can you benchmark caching a data-heavy model?

Apart from efficiency, one reason for using parameters rather than the whole estimator object is that we might be able to annotate some estimator parameters as not affecting the model fitting (e.g. feature selection thresholds in most cases), and then exclude them from the cache key. But this certainly falls under the class of pie-in-the-sky enhancements, and could, if the user really cares to, be achieved in other ways.
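For concreteness, a sketch of the parameters-plus-data cache key, with a hypothetical ignore argument standing in for the pie-in-the-sky parameter annotation; nothing here is part of this PR:

```python
import numpy as np
from sklearn.externals.joblib import hash as joblib_hash
from sklearn.feature_selection import SelectKBest

def cache_key(estimator, X, y=None, ignore=()):
    # Hash the training data plus the estimator's parameters, skipping any
    # parameters annotated as not affecting the fitted model.
    params = {k: v for k, v in estimator.get_params(deep=True).items()
              if k not in ignore}
    return joblib_hash((type(estimator).__name__, params, X, y))

X, y = np.random.rand(50, 8), np.arange(50) % 2
# SelectKBest computes all feature scores during fit; k only changes which
# features are kept at transform time, so two selectors differing only in k
# can share a cache entry once k is excluded from the key.
assert (cache_key(SelectKBest(k=2), X, y, ignore=('k',)) ==
        cache_key(SelectKBest(k=5), X, y, ignore=('k',)))
```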
Hi. Is there any progress on this? There is also some related discussion on #2086. Is anyone currently working on this limitation of pipelines?
No one is working on this limitation of Pipelines at the moment, unfortunately. And my attempts (#5080) to make …
What a pity. This would make a huge impact on many of my projects (and surely on others' projects too). But having read through all the related issues and PRs, I don't believe I could do this myself.
I would use a subclass of Pipeline rather than Pipeline itself. The reason is that it is probably a good idea to clone the objects before memoizing the fit (in order to maximize cache hits). This is not what the standard Pipeline does, hence memoizing would have slightly different behavior (though one fully consistent with scikit-learn's philosophy of not modifying input parameters).
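A sketch of the clone-before-memoize idea, using illustrative names and the era's vendored joblib; this is not code from this PR:

```python
import numpy as np
from sklearn.base import clone
from sklearn.externals.joblib import Memory
from sklearn.linear_model import LogisticRegression

memory = Memory(cachedir='/tmp/sklearn_cache', verbose=0)

def _fit_one(estimator, X, y):
    return estimator.fit(X, y)

X, y = np.random.rand(30, 4), np.arange(30) % 2
est = LogisticRegression()
# clone() strips any fitted state, so the memoized call sees only the
# estimator's parameters: an already-fitted estimator with the same
# parameters then maps to the same cache entry instead of missing it.
fitted = memory.cache(_fit_one)(clone(est), X, y)
```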
@GaelVaroquaux yeah, the not-cloning is an interesting feature. Name suggestions? Cloneline? ;) [Actually, we should be pretty careful with naming now; our names get adopted quickly]
I have to say that Cloneline should actually be Cacheline ;)
MemoLine. Don't forget it's memoize. I can picture myself saying: "didn't you get the MemoLine?"
On a serious note, "CachedPipe" or "CachedPipeline". We could add a "cache" option to make_pipeline (although in "legacy Python" it makes for a horrible signature, with *steps, **options).
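What that signature would look like in legacy Python (no keyword-only arguments); the cache flag passed to Pipeline is the hypothetical one proposed in this PR:

```python
from sklearn.pipeline import Pipeline, _name_estimators

def make_pipeline(*steps, **kwargs):
    # Intended signature: make_pipeline(*steps, cache=False). Python 2 has
    # no keyword-only arguments, so options must be popped out of **kwargs.
    cache = kwargs.pop('cache', False)
    if kwargs:
        raise TypeError('Unknown keyword arguments: %r' % sorted(kwargs))
    return Pipeline(_name_estimators(steps), cache=cache)  # hypothetical flag
```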
yeah, why not. That's easy to do once we have the class...
I will propose a PR based on the above comments and the discussion from the mailing list.
Hello,
I recently added optional joblib caching to Pipeline for a personal project, to avoid redundant computations and save the intermediate outputs. I found it useful, so I figured I'd offer it up as a pull request. It passes make test, pep8, and pyflakes, but I'm new to open source and trying to learn (fresh out of college...), so feel free to tell me this needs more tests, is a bad idea, should change stylistically, etc. Also, I read the contributing page, but apologies if I'm nonetheless following this process improperly.
-Dave