
[MRG+3] ENH Caching Pipeline by memoizing transformer #7990


Merged
merged 5 commits into scikit-learn:master on Feb 13, 2017

Conversation

glemaitre
Member

Reference Issue

Address the discussions in #3951
Other related issues and PR: #2086 #5082 #5080

What does this implement/fix? Explain your changes.

It implements a version of Pipeline that allows caching transformers.

@glemaitre
Member Author

@jnothman @GaelVaroquaux @amueller

From #3951, this is what I could come up with. I will promptly add the tests from Pipeline and adapt them. From what I could check, this works for those tests.

However, I run into trouble when using the CachedPipeline in a GridSearchCV:

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import samples_generator
from sklearn.decomposition import PCA
from sklearn.pipeline import CachedPipeline
from sklearn.model_selection import GridSearchCV

# generate some data to play with
X, y = samples_generator.make_classification(
    n_samples=100,
    n_informative=5, n_redundant=0, random_state=42)

pca = PCA()
clf = RandomForestClassifier()
pipeline = CachedPipeline([('pca', pca), ('rf', clf)],
                          memory='./')
parameters = {'pca__n_components': (.25, .5, .75),
              'rf__n_estimators': (10, 20, 30)}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1)

grid_search.fit(X, y)

After joblib loads the cached PCA, the transformer is seen as not fitted:

NotFittedError: This PCA instance is not fitted yet.
Call 'fit' with appropriate arguments before using this method.

I'm going to check why but if you have already an obvious answer, I would be happy to hear it.

@GaelVaroquaux
Member

GaelVaroquaux commented Dec 6, 2016 via email

@glemaitre
Member Author

Do you get the same problem if you take a fitted PCA, pickle it, unpickle it and try to transform on the data?

Nope, this works fine:

from sklearn.externals import joblib

pca.fit(X, y)
joblib.dump(pca, 'pca.p')
pickled_pca = joblib.load('pca.p')
pickled_pca.transform(X)

memory = self.memory
if isinstance(memory, six.string_types) or memory is None:
    memory = Memory(cachedir=memory, verbose=10)
self._fit_transform_one = memory.cache(_fit_transform_one)
Member

Don't do that: don't decorate methods. Use the pattern described in https://github.com/joblib/joblib/blob/master/doc/memory.rst, under the bullet point "caching methods".
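A minimal sketch of that pattern (the class and helper names here are illustrative, not from the PR): cache a module-level function and wrap it at call time, rather than storing a decorated method on the instance, so that self never becomes part of the cache key.

from sklearn.externals.joblib import Memory

def _fit_transform_one(transformer, X, y):
    # A plain module-level function hashes cleanly; a bound method would
    # drag the whole instance (self) into the cache key.
    return transformer.fit_transform(X, y), transformer

class CachedStep(object):
    def __init__(self, transformer, cachedir):
        self.transformer = transformer
        self.memory = Memory(cachedir=cachedir, verbose=0)

    def fit_transform(self, X, y=None):
        # Wrap the free function at call time instead of decorating a method.
        fit_transform_cached = self.memory.cache(_fit_transform_one)
        Xt, self.transformer = fit_transform_cached(self.transformer, X, y)
        return Xt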

@glemaitre
Member Author

For the first configuration of the grid search, the only PCA that is both dumped and fitted is the first one. For the two others, the PCA is dumped but not fitted.

Xt, transform = memory.cache(_fit_transform_one)(
    transform, name,
    None, Xt, y,
    **fit_params_steps[name])
Member

I would refactor Pipeline with a private class to avoid the code dupe.

@GaelVaroquaux
Member

GaelVaroquaux commented Dec 6, 2016 via email

@glemaitre
Member Author

glemaitre commented Dec 6, 2016

Do you get the same problem if you take a fitted PCA, pickle it, unpickle it and try to transform on the data?

Stupid mistake fixed: I forgot to assign the loaded pipeline.
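An illustrative guess at that kind of mistake, in a self-contained form (not the actual diff):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.externals.joblib import Memory

memory = Memory(cachedir='./cachedir', verbose=0)
X = np.random.RandomState(0).rand(20, 5)

def fit_one(estimator, X):
    return estimator.fit(X)

pca = PCA(n_components=2)
# Buggy: works on a cache miss (fit mutates pca in-process), but on a cache
# hit joblib returns a reloaded copy and the local pca stays unfitted.
memory.cache(fit_one)(pca, X)
# Fixed: assign the memoized result back.
pca = memory.cache(fit_one)(pca, X)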

@glemaitre
Member Author

@agramfort @GaelVaroquaux Speaking of the private class, does the last commit implement what you had in mind?

@GaelVaroquaux

I would use a subclass of Pipeline, rather than Pipeline itself. The reason being that it is probably a good idea to clone the objects before memoizing the fit (in order to maximize the cache hits).

I must be missing something regarding the cloning of the objects. Could you elaborate on what is required in the implementation?

@glemaitre
Member Author

I would use a subclass of Pipeline, rather than Pipeline itself. The reason being that it is probably a good idea to clone the objects before memoizing the fit (in order to maximize the cache hits).

I must be missing something regarding the cloning of the objects. Could you elaborate on what is required in the implementation?

I think I got it. The cloning is to strip the fitting info so as not to cache the same estimator twice: once unfitted and once fitted.
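A rough sketch of that idea (the helper names are mine, not the PR's):

from sklearn.base import clone
from sklearn.externals.joblib import Memory

memory = Memory(cachedir='./cachedir', verbose=0)

def _fit_transform_one(transformer, X, y):
    return transformer.fit_transform(X, y), transformer

def cached_fit_transform(transformer, X, y):
    # clone() strips the fitted attributes, so a previously fitted estimator
    # hashes the same as a fresh one and hits the same cache entry.
    return memory.cache(_fit_transform_one)(clone(transformer), X, y)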

@glemaitre
Member Author

I added tests similar to the ones proposed in #3951.
However, these tests do not check that the cache has been loaded.
They only check the resulting array, which is kind of permissive.

Is there anything in joblib which tells whether the cache has been read?

@glemaitre
Member Author

For now, I am checking that the timestamp of a DummyTransformer, assigned at the first fit within the pipeline, is recovered across the subsequent fits.
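Roughly how such a check could look; this DummyTransformer is a sketch of the idea, not the actual test code:

import time

from sklearn.base import BaseEstimator, TransformerMixin

class DummyTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Set only when fit actually runs; a cache hit reloads the pickled
        # instance and therefore carries the original timestamp along.
        self.timestamp_ = time.time()
        return self

    def transform(self, X):
        return X

Fitting the pipeline twice on the same data and asserting that timestamp_ is unchanged then shows that the second fit came from the cache.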

@glemaitre
Member Author

@agramfort Do you see additional changes to do?

@lesteve
Member

lesteve commented Dec 20, 2016

You should probably change your title to [MRG].

Can you trigger a rebuild (e.g. by doing git commit --amend and force-pushing) to trigger a documentation build? It looks like at one point last week CircleCI stopped running on PRs.

with its name to another estimator, or a transformer removed by setting
to None.

Read more in the :ref:`User Guide <pipeline>`.
Member

Should you not link to the cached_pipeline section?

Member

+1 to add it to See Also

Usage
-----

Similarly to :class:`Piepeline`, the pipeline is built on the same manner. However,
Member

Actually you have a typo here in Pipeline. No need to do git commit --amend; changing this should trigger a CircleCI doc build.

@@ -0,0 +1,77 @@
#!/usr/bin/python
Member

Looks like the wrong "#!" line to me: it will always use the system Python, which is often not the right thing. The right way is "#!/usr/bin/env python"

Member

The shebang line is only used if the example is executable right? No example has the executable flag AFAICT. I would be in favour of removing the shebang line.

@GaelVaroquaux
Member

Any way that the new example could be integrated with an existing one? The danger is that we have too many examples.

The way that I would suggest doing it is by extending an existing one (if there is one that is relevant), and using sphinx-gallery's "notebook-style examples" to add extra cells at the bottom (with an extra title) without making the initial example more complicated. That aspect is important: the initial example should be left as simple as it is, without additional lines of code. Sphinx-gallery's notebook-style formatting can then be used to add discussion and code at the end.
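For reference, sphinx-gallery's notebook-style convention marks a new cell with a long comment separator, so extra discussion and code can be appended without touching the original example (an illustrative skeleton):

###############################################################################
# Caching transformers within the grid search
# -------------------------------------------
# Everything in this comment block is rendered as a new text cell with its
# own title, so the discussion of caching lives below the original example.

# ... code for the additional cell goes here ...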

@glemaitre
Member Author

Any way that the new example could be integrated with an existing one? The danger is that we have too many examples.

This example is a modified version of examples/plot_compare_reduction.py.
I'll try to merge both as suggested.

@lesteve
Member

lesteve commented Dec 20, 2016

@glemaitre the Travis failure in the doctest is because the default of decision_function_shape is None and not 'ovr' in master.

I can reproduce the failure locally. The reason we could not reproduce locally earlier was because we were trying on your branch rather than on the result of the merge of your branch into master, d'oh ...

@glemaitre
Member Author

@lesteve d'oh indeed ...

@glemaitre
Member Author

@GaelVaroquaux Does the example look like what you had in mind?

memory=memory)

# This time, a cached pipeline will be used within the grid search
grid = GridSearchCV(cached_pipe, cv=3, n_jobs=2, param_grid=param_grid)
Member

don't use n_jobs != 1 in examples

Member

This must still be changed.

with its name to another estimator, or a transformer removed by setting
to None.

Read more in the :ref:`User Guide <pipeline>`.
Member

+1 to add it to See Also

# Don't make the cachedir, Memory should be able to do that on the fly
print(80 * '_')
print('test_memory setup (%s)' % env['dir'])
print(80 * '_')
Member

why all these print statements?

Member Author

In fact, this is the way it was done in joblib.
However, I am having second thoughts about that. A simple try-finally statement, as here, would probably be enough for the purpose of the test.
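A minimal sketch of what such a try-finally based test could look like (the test name and assertions here are placeholders):

import shutil
import tempfile

def test_cached_pipeline():
    cachedir = tempfile.mkdtemp()
    try:
        # build a CachedPipeline with memory=cachedir, fit it twice, and
        # assert that the second fit reused the cached transformers
        pass
    finally:
        # remove the cache directory even if an assertion fails
        shutil.rmtree(cachedir)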

@@ -124,6 +125,40 @@ i.e. if the last estimator is a classifier, the :class:`Pipeline` can be used
as a classifier. If the last estimator is a transformer, again, so is the
pipeline.

.. _cached_pipeline:

CachedPipeline: memoizing transformers
Member

I think that I would put a title that is more focused towards the problem that it solves rather than the technique used. Something like "CachedPipeline: avoiding to repeat computation"


.. currentmodule:: sklearn.pipeline

:class:`CachedPipeline` can be used instead of :class:`Pipeline` to avoid to fit
Member

I would say "to avoid to compute the fit"

Selecting dimensionality reduction with Pipeline and GridSearchCV
=================================================================
=======================================================================
Selecting dimensionality reduction with Pipeline, CachedPipeline, and \
Member

I don't think that I would change the title here. It makes it too long, IMHO.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.pipeline import Pipeline, CachedPipeline
Member

I would import this only later: trying to separate a bit the two parts of the example.

We have to keep in mind that every added piece of information to an example makes it harder to understand.

from sklearn.svm import LinearSVC
from sklearn.decomposition import PCA, NMF
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.externals.joblib import Memory
Member

Same comment.

@@ -73,3 +90,29 @@
plt.ylim((0, 1))
plt.legend(loc='upper left')
plt.show()
Member

I think that "show" must be moved to the end.

Member Author

With the current scheme, it will generate the figure right after the code snippet and before the second section. It looks fine to me.

Member

It will block the execution of the script before the end. The convention is really to have it only at the end.

@codecov

codecov bot commented Feb 9, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@dfcf632).

@@            Coverage Diff            @@
##             master    #7990   +/-   ##
=========================================
  Coverage          ?   94.74%           
=========================================
  Files             ?      342           
  Lines             ?    60739           
  Branches          ?        0           
=========================================
  Hits              ?    57546           
  Misses            ?     3193           
  Partials          ?        0
Impacted Files                   Coverage   Δ
sklearn/tests/test_pipeline.py   99.61%     <100%>    (ø)
sklearn/pipeline.py              99.26%     <95.23%>  (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dfcf632...1afb47f.

@jnothman jnothman changed the title [MRG+2] ENH Caching Pipeline by memoizing transformer [MRG+3] ENH Caching Pipeline by memoizing transformer Feb 9, 2017
@jnothman
Member

jnothman commented Feb 9, 2017

Are we good for merge? Should we wait for the joblib fixes?

@glemaitre
Member Author

@ogrisel wanted to wait for the joblib fixes.

@jnothman
Member

jnothman commented Feb 9, 2017 via email

Member

@ogrisel ogrisel left a comment

Actually, as the default setting will not trigger the joblib race condition, I think we can merge this as is.

@GaelVaroquaux
Member

Hurray. Merging. Good job, @glemaitre!

@GaelVaroquaux GaelVaroquaux merged commit b3a639f into scikit-learn:master Feb 13, 2017
@ogrisel
Member

ogrisel commented Feb 13, 2017

@jnothman w.r.t. your comment on GS loop ordering in #7990 (review), this would not impact the optimal design of this PR, right?

@raghavrv
Member

Hurray!! 🎉

@jnothman
Member

@jnothman w.r.t. your comment on GS loop ordering in #7990 (review), this would not impact the optimal design of this PR, right?

I don't know what you're asking, @ogrisel. The ordering issue doesn't stop this change being useful, but it makes this (and other memoisation) less useful in parallel, because the cache will be missed unnecessarily. The maximal number of cache hits is (n_candidates - 1) * n_splits, but under the current ordering it is more like (n_candidates - n_jobs) * n_splits; for instance, with 9 candidates, 3 splits and 2 jobs, that is 24 possible hits versus roughly 21.

@ogrisel
Member

ogrisel commented Feb 13, 2017

Agreed. Once the joblib race condition is fixed on Windows, we can reinvestigate that issue in GS.

@amueller
Member

OMG AMAZING!

sergeyf pushed a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017

* ENH Caching Pipeline by memoizing transformer

* Fix lesteve changes

* Fix comments

* Fix doc

* Fix jnothman comments
@Przemo10 Przemo10 mentioned this pull request Mar 17, 2017
@lsorber

lsorber commented Jun 6, 2017

Thanks for the nice work on this. May I suggest (optionally) also caching the pipeline's last step? The last step could itself be a transformer (e.g., in a pipeline of pipelines).

In fact, I'm not even sure what the downside of caching all steps by default is. If you want the current behaviour, you could simply create a cached pipeline of all the steps you want cached, and insert that into a non-caching pipeline for the steps you don't want cached. Conversely, it is much more tedious to create a fully cached pipeline with the current implementation.
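For instance, the workaround described above might look like this (a sketch; the step names and estimators are arbitrary):

from sklearn.pipeline import Pipeline, CachedPipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

# Steps whose fits should be memoized go into the CachedPipeline...
cached_part = CachedPipeline([('scale', StandardScaler()), ('pca', PCA())],
                             memory='./cachedir')
# ...which is then nested as a single step of an ordinary Pipeline.
pipe = Pipeline([('cached', cached_part), ('svc', LinearSVC())])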

@amueller
Member

amueller commented Jun 6, 2017

@lsorber can you maybe open a new issue for that?

@lsorber

lsorber commented Jun 6, 2017

All right, will do.

@jnothman
Member

jnothman commented Jun 6, 2017 via email

@lsorber

lsorber commented Jun 6, 2017

@jnothman I was in fact suggesting to cache the full pipeline, including the last step. Some motivation in #9007.

Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
lemonlaug pushed a commit to lemonlaug/scikit-learn that referenced this pull request Jan 6, 2021