[MRG+3] ENH Caching Pipeline by memoizing transformer #7990
Conversation
@jnothman @GaelVaroquaux @amueller From #3951, this is what I could come up with. I will add the testing promptly. However, I get into trouble while using it:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import samples_generator
from sklearn.decomposition import PCA
from sklearn.pipeline import CachedPipeline
from sklearn.model_selection import GridSearchCV

# generate some data to play with
X, y = samples_generator.make_classification(
    n_samples=100,
    n_informative=5, n_redundant=0, random_state=42)

pca = PCA()
clf = RandomForestClassifier()
pipeline = CachedPipeline([('pca', pca), ('rf', clf)],
                          memory='./')

parameters = {'pca__n_components': (.25, .5, .75),
              'rf__n_estimators': (10, 20, 30)}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1)
grid_search.fit(X, y)
```

After joblib loads the cached PCA, the transformer is seen as not fitted:

```
NotFittedError: This PCA instance is not fitted yet.
Call 'fit' with appropriate arguments before using this method.
```

I'm going to check why, but if you already have an obvious answer, I would be happy to hear it.
Do you get the same problem if you take a fitted PCA, pickle it, unpickle it and try to transform on the data?
Nope, this is working fine:

```python
pca.fit(X, y)
joblib.dump(pca, 'pca.p')
pickled_pca = joblib.load('pca.p')
pickled_pca.transform(X)
```
sklearn/pipeline.py (outdated)

```python
memory = self.memory
if isinstance(memory, six.string_types) or memory is None:
    memory = Memory(cachedir=memory, verbose=10)
self._fit_transform_one = memory.cache(_fit_transform_one)
```
Don't do that: don't decorate methods. Use the pattern described in https://github.com/joblib/joblib/blob/master/doc/memory.rst, under the bullet point "caching methods".
For the first configuration of the grid search, only the first PCA is dumped and fitted. For the 2 other configurations, the PCA is dumped but not fitted.
sklearn/pipeline.py (outdated)

```python
Xt, transform = memory.cache(_fit_transform_one)(
    transform, name,
    None, Xt, y,
    **fit_params_steps[name])
```
I would refactor Pipeline with a private class to avoid the code dupe.
> I would refactor Pipeline with a private class to avoid the code dupe.

+1
Stupid mistakes fixed. I forgot to assign the loaded pipeline. |
@agramfort @GaelVaroquaux Speaking of a private class, does the last commit implement what you had in mind?
I must be missing something regarding the cloning of the objects. Could you elaborate on what is required in the implementation?
I think I got it. The cloning is there to strip the fitted attributes, so that the same estimator is not cached twice, once unfitted and once fitted.
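A toy illustration of why the clone matters, with a hypothetical dict-based cache standing in for joblib (all names here are made up):

```python
class Estimator:
    def __init__(self, k=3):
        self.k = k  # constructor parameter

def clone(est):
    # fresh, unfitted copy: only the constructor parameters survive
    return type(est)(k=est.k)

def cache_key(est):
    # key on the estimator's parameters only; hashing the fitted
    # attributes too would make the unfitted and fitted copies of the
    # same estimator two distinct cache entries
    return (type(est).__name__, est.k)

cache = {}
est = Estimator(k=5)
est.mean_ = 42.0                       # pretend fit() set this attribute
cache[cache_key(clone(est))] = 'fitted result'

# an unfitted estimator with the same parameters hits the same entry
print(cache_key(Estimator(k=5)) in cache)  # True
```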
I added the tests, which are similar to the ones proposed in #3951. Is there anything in
For now, I am checking that the timestamp of a |
@agramfort Do you see additional changes to do? |
You should probably change your title to [MRG]. Can you trigger a rebuild (e.g. by doing |
sklearn/pipeline.py
Outdated
with its name to another estimator, or a transformer removed by setting | ||
to None. | ||
|
||
Read more in the :ref:`User Guide <pipeline>`. |
Should you not link to the cached_pipeline section?
+1 to add it to See Also
doc/modules/pipeline.rst (outdated)

```rst
Usage
-----

Similarly to :class:`Piepeline`, the pipeline is built on the same manner. However,
```
Actually you have a typo here in Pipeline. No need to do `git commit --amend`: changing this should trigger a CircleCI doc build.
@@ -0,0 +1,77 @@

```python
#!/usr/bin/python
```
Looks like the wrong "#!" line to me: it will always use the system Python, which is often not the right thing. The right way is "#!/usr/bin/env python"
The shebang line is only used if the example is executable right? No example has the executable flag AFAICT. I would be in favour of removing the shebang line.
Any way that the new example could be integrated with an existing one? The danger is that we have too many examples. The way that I would suggest doing it is by extending an existing one (if there is one that is relevant), and use "notebook-style examples" of sphinx-gallery to add extra cells at the bottom (with an extra title) without making the initial example more complicated. That aspect is important: the initial example should be left as simple, without additional lines of code. And sphinx-gallery notebook-style formatting can be used to add discussion and code at the end. |
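The notebook-style mechanism referred to: sphinx-gallery treats a comment block introduced by a rule of `#` characters as a new rST text cell, so discussion and code can be appended at the bottom of an existing example without complicating its opening. A minimal sketch (the section title is made up):

```python
###############################################################################
# Caching the pipeline
# --------------------
# In sphinx-gallery, this comment block (introduced by the rule of "#"
# characters above) is rendered as a new rST text cell in the built
# example; the code below it becomes a new code cell.

result = 2 + 2  # code after the block runs as an ordinary cell
print(result)
```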
This example is a modified version of |
@glemaitre the Travis failure in the doctest is because of the default of [...]. I can reproduce the failure locally. The reason we could not reproduce it locally earlier was that we were trying on your branch rather than on the result of merging your branch into master, d'oh ...
@lesteve d'oh indeed ... |
@GaelVaroquaux Does the example look like what you had in mind?
examples/plot_compare_reduction.py (outdated)

```python
                  memory=memory)

# This time, a cached pipeline will be used within the grid search
grid = GridSearchCV(cached_pipe, cv=3, n_jobs=2, param_grid=param_grid)
```
don't use n_jobs != 1 in examples
This must still be changed.
sklearn/pipeline.py (outdated)

```python
    with its name to another estimator, or a transformer removed by setting
    to None.

    Read more in the :ref:`User Guide <pipeline>`.
```
+1 to add it to See Also
sklearn/tests/test_pipeline.py (outdated)

```python
# Don't make the cachedir, Memory should be able to do that on the fly
print(80 * '_')
print('test_memory setup (%s)' % env['dir'])
print(80 * '_')
```
why all these print statements?
In fact, that was the way it was done in joblib. However, I am having second thoughts about that. A simple try/finally statement, as here, would probably be enough for the purpose of the test.
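Such a try/finally could look like the following sketch (the sentinel file is made up; the real test would exercise the pipeline against the cache directory instead):

```python
import os
import shutil
import tempfile

cachedir = tempfile.mkdtemp(prefix='test_memory_')
try:
    # ... the pipeline test would run against `cachedir` here ...
    marker = os.path.join(cachedir, 'sentinel.txt')
    with open(marker, 'w') as f:
        f.write('cached')
    assert os.path.exists(marker)
finally:
    # the cache directory is removed even if the test body fails
    shutil.rmtree(cachedir, ignore_errors=True)
```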
doc/modules/pipeline.rst (outdated)

```rst
@@ -124,6 +125,40 @@ i.e. if the last estimator is a classifier, the :class:`Pipeline` can be used
as a classifier. If the last estimator is a transformer, again, so is the
pipeline.

.. _cached_pipeline:

CachedPipeline: memoizing transformers
```
I think that I would put a title that is more focused towards the problem that it solves rather than the technique used. Something like "CachedPipeline: avoiding to repeat computation"
doc/modules/pipeline.rst (outdated)

```rst
.. currentmodule:: sklearn.pipeline

:class:`CachedPipeline` can be used instead of :class:`Pipeline` to avoid to fit
```
I would say "to avoid to compute the fit"
examples/plot_compare_reduction.py (outdated)

```rst
Selecting dimensionality reduction with Pipeline and GridSearchCV
=================================================================
=======================================================================
Selecting dimensionality reduction with Pipeline, CachedPipeline, and \
```
I don't think that I would change the title here. It makes it too long, IMHO.
examples/plot_compare_reduction.py
Outdated
import numpy as np | ||
import matplotlib.pyplot as plt | ||
from sklearn.datasets import load_digits | ||
from sklearn.model_selection import GridSearchCV | ||
from sklearn.pipeline import Pipeline | ||
from sklearn.pipeline import Pipeline, CachedPipeline |
I would import this only later: trying to separate a bit the two parts of the example.
We have to keep in mind that every added piece of information to an example makes it harder to understand.
examples/plot_compare_reduction.py (outdated)

```python
from sklearn.svm import LinearSVC
from sklearn.decomposition import PCA, NMF
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.externals.joblib import Memory
```
Same comment.
examples/plot_compare_reduction.py (outdated)

```python
@@ -73,3 +90,29 @@
plt.ylim((0, 1))
plt.legend(loc='upper left')
plt.show()
```
I think that "show" must be moved to the end.
With the current scheme, it will generate the figure right after the code snippet and before the second section. It looks fine to me.
It will block the execution of the script before the end. The convention is really to have it only at the end.
Codecov Report

```
@@           Coverage Diff            @@
##            master   #7990   +/-  ##
========================================
  Coverage         ?   94.74%
========================================
  Files            ?      342
  Lines            ?    60739
  Branches         ?        0
========================================
  Hits             ?    57546
  Misses           ?     3193
  Partials         ?        0
```

Continue to review the full report at Codecov.
Are we good for merge? Should we wait for the joblib fixes? |
@ogrisel wanted to wait for the joblib fixes. |
okay.
…On 9 February 2017 at 11:10, Guillaume Lemaitre ***@***.***> wrote:
@ogrisel <https://github.com/ogrisel> wanted to wait for the joblib fixes.
Actually, as the default setting will not trigger the joblib race condition, I think we can merge this as is.
Hurray. Merging. Good job, @glemaitre ! |
@jnothman w.r.t. your comment on GS loop ordering in #7990 (review): this would not impact the optimal design of this PR, right?
Hurray!! 🎉 |
I don't know what you're asking, @ogrisel. The ordering issue doesn't stop this change from being useful, but it makes this (and other memoisation) less useful in parallel, because the cache will be missed unnecessarily. Maximal cache hits is
Agreed. Once the joblib race condition is fixed on Windows, we can reinvestigate that issue in GS.
OMG AMAZING! |
Thanks for the nice work on this. May I suggest (optionally) also caching the pipeline's last step? The last step could itself be a transformer (e.g., in a pipeline of pipelines). In fact, I'm not even sure what the downside of caching all steps by default is. If you want the current behaviour, you could simply create a cached pipeline of all the steps you want cached, and insert that into a non-caching pipeline for the steps you don't want cached. Conversely, it is much more tedious to create a fully cached pipeline with the current implementation.
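The nesting idea described here can be sketched with the `memory` parameter that this PR's caching shipped as in released scikit-learn (the step names and estimators below are illustrative, not from the PR):

```python
import shutil
import tempfile

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=100, n_features=20, random_state=42)

cachedir = tempfile.mkdtemp()

# inner pipeline: its transformers are memoized on disk
cached_part = Pipeline([('scale', StandardScaler()),
                        ('pca', PCA(n_components=5))],
                       memory=cachedir)

# outer pipeline: the cached sub-pipeline is just another transformer
# step, and the final classifier stays uncached
full_pipe = Pipeline([('cached', cached_part),
                      ('svc', LinearSVC())])

full_pipe.fit(X, y)
score = full_pipe.score(X, y)
print(score)

shutil.rmtree(cachedir, ignore_errors=True)  # clean up the cache
```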
@lsorber can you maybe open a new issue for that? |
All right, will do. |
I know there's promise of a further issue, but I don't see why we would want to cache the last step alone, and not just the entire pipeline. Caching the entire pipeline as a whole seems useful to me when that pipeline is in a Pipeline or a FeatureUnion, in which case it should be cached by them as one of the constituent transformers.

…On 6 June 2017 at 23:18, Laurent Sorber ***@***.***> wrote:

> All right, will do.
Reference Issue

Address the discussions in #3951.
Other related issues and PRs: #2086 #5082 #5080

What does this implement/fix? Explain your changes.

It implements a version of `Pipeline` which allows for caching transformers.