[MRG] Pass original dataset to Stacking final estimator #15138
Conversation
Something's still off with my local build while trying to import.

Perhaps you compiled with version 0.21 but then pulled changes from 0.22 including that file. Do `pip install -e .`
sklearn/ensemble/_stacking.py (Outdated)

```diff
     def __init__(self, estimators, final_estimator=None, cv=None,
-                 stack_method='auto', n_jobs=None, verbose=0):
+                 stack_method='auto', n_jobs=None, verbose=0,
+                 pass_through=False):
```
When we elsewhere use "passthrough" I don't think pass_through is a good idea
Ah good catch, I'll update to be consistent with passthrough.
Thanks for the advice on reinstalling with the editable command -- that fixed things on my end!
```diff
     X_trans = reg.transform(X_test)
-    assert X_trans.shape[1] == 2
+    expected_column_count = 12 if passthrough else 2
+    assert X_trans.shape[1] == expected_column_count
```
You can assert that the original values are the same as the passthrough values.
Same goes for the other asserts.
Sounds good. Should that be a distinct test or should that be part of the existing test_stacking_regressor_diabetes and test_stacking_classifier_iris?
I think this can be included here:

```python
if passthrough:
    assert_allclose(...)
```

```diff
             else:
                 X_meta.append(preds)
         return np.concatenate(X_meta, axis=1)
+        if self.passthrough:
```
Nit:

```python
if self.passthrough:
    X_meta.append(X)
return np.concatenate(X_meta, axis=1)
```
jnothman
left a comment
Please test behaviour with sparse matrix X
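A test along these lines might look like the following. This is only a hedged sketch: the estimator choices and the use of the diabetes dataset are assumptions, not the fixtures the PR's test suite actually uses.

```python
import numpy as np
from scipy import sparse
from sklearn.datasets import load_diabetes
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Dense toy data wrapped in a CSR matrix to exercise the sparse path.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    sparse.csr_matrix(X), y, random_state=42
)

reg = StackingRegressor(
    estimators=[('lr', LinearRegression()),
                ('tree', DecisionTreeRegressor(random_state=42))],
    final_estimator=LinearRegression(),
    passthrough=True,
)
reg.fit(X_train, y_train)
X_trans = reg.transform(X_test)

# With passthrough and sparse input, the stacked matrix stays sparse:
# 10 original features plus one prediction column per base estimator.
assert sparse.issparse(X_trans)
assert X_trans.shape[1] == X.shape[1] + 2
```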
Sounds good -- the original

@jnothman, @thomasjpfan I just wanted to follow up on this -- let me know when you guys have a chance for another review, thanks!
thomasjpfan
left a comment
We should document the behavior in transform when X is sparse, i.e. currently:

- If X is sparse and passthrough=False, the output of `transform` will be dense (the predictions).
- If X is sparse and passthrough=True, the output of `transform` will be sparse.
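That behavior can be illustrated with a small sketch against the post-merge API (the synthetic dataset and estimator choices here are arbitrary assumptions, not from the PR):

```python
import numpy as np
from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_sp = sparse.csr_matrix(X)

def transformed(passthrough):
    # Same sparse input either way; only the passthrough flag differs.
    clf = StackingClassifier(
        estimators=[('lr', LogisticRegression()),
                    ('tree', DecisionTreeClassifier(random_state=0))],
        final_estimator=LogisticRegression(),
        passthrough=passthrough,
    ).fit(X_sp, y)
    return clf.transform(X_sp)

dense_out = transformed(False)   # only predictions -> dense ndarray
sparse_out = transformed(True)   # predictions + X  -> sparse matrix
assert isinstance(dense_out, np.ndarray)
assert sparse.issparse(sparse_out)
```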
```python
)
clf.fit(X_train, y_train)
X_trans = clf.transform(X_test)
assert_allclose(X_test.toarray(), X_trans[:, -10:].toarray())
```
Nit: This can use:

```python
from sklearn.utils.testing import assert_allclose_dense_sparse
assert_allclose_dense_sparse(X_test, X_trans[:, -10:])
```

```python
)
clf.fit(X_train, y_train)
X_trans = clf.transform(X_test)
assert_allclose(X_test.toarray(), X_trans[:, -4:].toarray())
```
Nit: This can use:

```python
from sklearn.utils.testing import assert_allclose_dense_sparse
assert_allclose_dense_sparse(X_test, X_trans[:, -4:])
```

```diff
     def _concatenate_predictions(self, X, predictions):
-        """Concatenate the predictions of each first layer learner.
+        """Concatenate the predictions of each first layer learner and
+        possibly the input dataset `X`.
```
Please document the sparse behavior here.
Just added the new changes -- thanks for pointing out
thomasjpfan
left a comment
Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
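Such an entry might look like the following. This is a hypothetical sketch: the wording and the |Enhancement| badge mimic the style of existing v0.22.rst entries, and the credit is inferred from the handles in this thread.

```rst
- |Enhancement| :class:`ensemble.StackingClassifier` and
  :class:`ensemble.StackingRegressor` now support a ``passthrough``
  parameter that passes the original dataset ``X``, alongside the base
  estimators' predictions, to the final estimator.
  :pr:`15138` by :user:`jcusick13`.
```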
sklearn/ensemble/_stacking.py (Outdated)

```python
if issparse(X):
    return sparse_hstack(X_meta).tocsr()
```
Nit: It would most likely be clearer for future devs:

```python
import scipy.sparse as sparse
...
if self.passthrough:
    X_meta.append(X)
hstack = sparse.hstack if sparse.issparse(X) else np.hstack
return hstack(X_meta)
```
thomasjpfan
left a comment
Otherwise LGTM
sklearn/ensemble/_stacking.py (Outdated)

```python
if self.passthrough:
    X_meta.append(X)
if sparse.issparse(X):
    return sparse.hstack(X_meta).tocsr()
```
More simplifications, nit:

```python
if self.passthrough:
    X_meta.append(X)
hstack = sparse.hstack if sparse.issparse(X) else np.hstack
return hstack(X_meta)
```
I tried a few ways to get this to work but kept running into errors since sparse.hstack returns a coo_matrix instead of a csr_matrix.
Is there a clever way to embed the .tocsr() function within the if/else line? I didn't want to add it as return hstack(X_meta).tocsr() since it would try to force dense matrices into csr format also.
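The behavior in question can be seen in a couple of lines. This is a standalone sketch (nothing here is from the PR itself), showing both the default COO output and the `format` keyword of `scipy.sparse.hstack`, which avoids calling `.tocsr()` on what might be a dense result:

```python
import numpy as np
from scipy import sparse

a = sparse.csr_matrix(np.eye(3))
b = np.ones((3, 2))

# With mixed dense/sparse inputs, sparse.hstack goes through
# sparse.bmat and hands back a COO matrix by default...
stacked = sparse.hstack([a, b])
assert stacked.format == 'coo'

# ...but the `format` keyword selects the output format directly,
# so no separate conversion call is needed.
stacked_csr = sparse.hstack([a, b], format='csr')
assert stacked_csr.format == 'csr'
assert stacked_csr.shape == (3, 5)
```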
should this be toformat(X.format)??
Not sure what you mean by embedding .tocsr in the if/else line.
(I am surprised that sparse.hstack works with a mix of dense and sparse input... I'm sure the docs only mention sparse.)
> (I am surprised that sparse.hstack works with a mix of dense and sparse input... I'm sure the docs only mention sparse.)

It is surprising. I am unsure if we should rely on this behavior.
@jcusick13 Maybe something like this will work:

```python
if self.passthrough:
    X_meta.append(X)
if sparse.issparse(X):
    return sparse.hstack(X_meta, format=X.format)
return np.hstack(X_meta)
```
Good catch on using X.format instead of hardcoding csr.
The sparse.hstack documentation is a bit worrying with the explicit mention of sparse inputs (and the sparse statements are just as explicit in the sparse.bmat docs, which is called directly by sparse.hstack).
I took a look at the sparse.bmat source and it looks like it's converting all of its inputs into a COO matrix in a way that would work regardless of sparse vs. dense input (by just using sparse.coo_matrix()). I ran some simple tests locally and it never seemed to raise a problem:

```python
>>> sparse.hstack([np.ones(10), sparse.csr_matrix(np.ones(10))])
<1x20 sparse matrix of type '<class 'numpy.float64'>'
    with 20 stored elements in COOrdinate format>
```
The current setup was passing all tests locally but definitely open to hearing your guys' thoughts on the matter. One suggestion is that we could first convert X_meta to be sparse before appending the sparse X to it and then running sparse.hstack. Something like

```python
if self.passthrough:
    if sparse.issparse(X):
        X_meta = sparse.coo_matrix(X_meta)
        return sparse.hstack(X_meta.append(X), format=X.format)
    X_meta.append(X)
return np.hstack(X_meta)
```
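As written, that sketch would fail: `list.append` returns None, so its result can't be passed to `sparse.hstack`. A runnable version of the convert-everything-to-sparse idea, with hypothetical stand-in values in place of the real estimator outputs, might be:

```python
import numpy as np
from scipy import sparse

# Stand-ins: X as a CSC input, X_meta as dense per-estimator predictions.
X = sparse.csc_matrix(np.arange(12.0).reshape(4, 3))
X_meta = [np.ones((4, 2)), np.zeros((4, 1))]

# Convert each dense block up front so that every input handed to
# sparse.hstack is sparse, matching what the scipy docs promise.
blocks = [sparse.coo_matrix(b) for b in X_meta] + [X]
out = sparse.hstack(blocks, format=X.format)
assert out.format == 'csc'
assert out.shape == (4, 6)
```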
jnothman
left a comment
Otherwise looking good!
```python
estimators=estimators, final_estimator=rf, cv=5, passthrough=True
)
clf.fit(X_train, y_train)
X_trans = clf.transform(X_test)
```
You don't actually confirm that the output is sparse (nor that its format matches that of X)
Good call -- I added tests to check both a sparse output and matching format. I also adjusted one of the sparse inputs to be CSC format (instead of CSR) so that the tests run against multiple sparse format types.
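In miniature, the two added checks look like this. This is a standalone sketch; `X_test` and `X_trans` here are hand-built stand-ins, not the test's actual fixtures:

```python
import numpy as np
from scipy import sparse

X_test = sparse.csc_matrix(np.ones((5, 4)))
preds = sparse.csc_matrix(np.zeros((5, 2)))
X_trans = sparse.hstack([preds, X_test], format=X_test.format)

# Check both that the output is sparse and that its format
# matches the input's (here CSC rather than CSR).
assert sparse.issparse(X_trans)
assert X_trans.format == X_test.format
```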
Testing with multiple sparse formats would be clearer to the reader with pytest.mark.parametrize, where here it's hard for the reader to see the differences.

All set -- parameterized CSR, CSC, and COO matrices for both stacking regressor and classifier tests.
```python
@pytest.mark.parametrize(
    'X',
    [sparse.csc_matrix(scale(X_diabetes)),
```
Cleaner if you just use format as the param: ['csc', 'csr', 'coo']. Then use `asformat` to convert from an initial format.
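The suggestion amounts to something like this (a sketch; in the real test the loop body would instead be a function decorated with `@pytest.mark.parametrize('fmt', ['csc', 'csr', 'coo'])`):

```python
import numpy as np
from scipy import sparse

X_dense = np.arange(12.0).reshape(4, 3)
X_base = sparse.csr_matrix(X_dense)

seen = []
for fmt in ['csc', 'csr', 'coo']:
    # One canonical matrix, converted per case, so the parameter list
    # shows only what actually differs between cases: the format.
    X = X_base.asformat(fmt)
    seen.append(X.format)
    assert np.allclose(X.toarray(), X_dense)

assert seen == ['csc', 'csr', 'coo']
```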
jnothman
left a comment
Sorry to be nit-picking but I think we are closer to a neat and idiomatic style.
No worries -- I just adjusted it now, let me know what you think.
jnothman
left a comment
Happy now :)
Thanks @jcusick13!
Reference Issues/PRs

Fixes #15076.

What does this implement/fix? Explain your changes.

Adds an argument, `pass_through`, to the base Stacking estimator. This allows for concatenating the original dataset `X` with the output from all individual `estimators` for use in training the `final_estimator`.

Any other comments?

I added tests but haven't been able to check them thoroughly. I haven't been able to get even the existing tests to run, since calling them (even on the `master` branch) returns the below error. I've looked around for a little and am not too sure how to resolve this -- does anyone have any suggestions?
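For reference, a minimal usage sketch of the feature as it was merged (with the final spelling `passthrough`, per the review above; the estimator choices are arbitrary and scikit-learn >= 0.22 is assumed):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

reg = StackingRegressor(
    estimators=[('ridge', Ridge()),
                ('tree', DecisionTreeRegressor(random_state=0))],
    final_estimator=LinearRegression(),
    passthrough=True,  # final estimator sees [predictions, original X]
).fit(X, y)

# 10 original diabetes features plus one prediction column per
# base estimator end up in the meta-features.
n_meta_features = reg.transform(X).shape[1]
assert n_meta_features == X.shape[1] + 2
```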