[MRG] Pass original dataset to Stacking final estimator #15138
Conversation
Something's still off with my local build while trying to import.

Perhaps you compiled with version 0.21 but then pulled changes from 0.22 including that file. Do `pip install -e .`
sklearn/ensemble/_stacking.py (Outdated)

```diff
     def __init__(self, estimators, final_estimator=None, cv=None,
-                 stack_method='auto', n_jobs=None, verbose=0):
+                 stack_method='auto', n_jobs=None, verbose=0,
+                 pass_through=False):
```
When we elsewhere use "passthrough" I don't think pass_through is a good idea
Ah good catch, I'll update to be consistent with passthrough.
Thanks for the advice on reinstalling with the editable command -- that fixed things on my end!
```diff
     X_trans = reg.transform(X_test)
-    assert X_trans.shape[1] == 2
+    expected_column_count = 12 if passthrough else 2
+    assert X_trans.shape[1] == expected_column_count
```
You can assert that the original values are the same as the passthrough values.
Same goes for the other asserts.
Sounds good. Should that be a distinct test or should that be part of the existing test_stacking_regressor_diabetes and test_stacking_classifier_iris?
I think this can be included here:

```python
if passthrough:
    assert_allclose(...)
```

```diff
             else:
                 X_meta.append(preds)
         return np.concatenate(X_meta, axis=1)
+        if self.passthrough:
```
Nit:

```python
if self.passthrough:
    X_meta.append(X)
return np.concatenate(X_meta, axis=1)
```
jnothman
left a comment
Please test behaviour with sparse matrix X
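A test along these lines might look like the following. This is only a hedged sketch: the estimator choices and the use of the diabetes dataset are assumptions, not the fixtures the PR's test suite actually uses.

```python
import numpy as np
from scipy import sparse
from sklearn.datasets import load_diabetes
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Dense toy data wrapped in a CSR matrix to exercise the sparse path.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    sparse.csr_matrix(X), y, random_state=42
)

reg = StackingRegressor(
    estimators=[('lr', LinearRegression()),
                ('tree', DecisionTreeRegressor(random_state=42))],
    final_estimator=LinearRegression(),
    passthrough=True,
)
reg.fit(X_train, y_train)
X_trans = reg.transform(X_test)

# With passthrough and sparse input, the stacked matrix stays sparse:
# 10 original features plus one prediction column per base estimator.
assert sparse.issparse(X_trans)
assert X_trans.shape[1] == X.shape[1] + 2
```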
Sounds good -- the original

@jnothman, @thomasjpfan I just wanted to follow up on this -- let me know when you guys have a chance for another review, thanks!
thomasjpfan
left a comment
We should document the behavior in transform when X is sparse, i.e. currently:

- If X is sparse and passthrough=False, the output of `transform` will be dense (the predictions).
- If X is sparse and passthrough=True, the output of `transform` will be sparse.
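That behavior can be illustrated with a small sketch against the post-merge API (the synthetic dataset and estimator choices here are arbitrary assumptions, not from the PR):

```python
import numpy as np
from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_sp = sparse.csr_matrix(X)

def transformed(passthrough):
    # Same sparse input either way; only the passthrough flag differs.
    clf = StackingClassifier(
        estimators=[('lr', LogisticRegression()),
                    ('tree', DecisionTreeClassifier(random_state=0))],
        final_estimator=LogisticRegression(),
        passthrough=passthrough,
    ).fit(X_sp, y)
    return clf.transform(X_sp)

dense_out = transformed(False)   # only predictions -> dense ndarray
sparse_out = transformed(True)   # predictions + X  -> sparse matrix
assert isinstance(dense_out, np.ndarray)
assert sparse.issparse(sparse_out)
```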
```python
)
clf.fit(X_train, y_train)
X_trans = clf.transform(X_test)
assert_allclose(X_test.toarray(), X_trans[:, -10:].toarray())
```
Nit: This can use:

```python
from sklearn.utils.testing import assert_allclose_dense_sparse
assert_allclose_dense_sparse(X_test, X_trans[:, -10:])
```

```python
)
clf.fit(X_train, y_train)
X_trans = clf.transform(X_test)
assert_allclose(X_test.toarray(), X_trans[:, -4:].toarray())
```
Nit: This can use:

```python
from sklearn.utils.testing import assert_allclose_dense_sparse
assert_allclose_dense_sparse(X_test, X_trans[:, -4:])
```

```diff
     def _concatenate_predictions(self, X, predictions):
-        """Concatenate the predictions of each first layer learner.
+        """Concatenate the predictions of each first layer learner and
+        possibly the input dataset `X`.
```
Please document the sparse behavior here.
Just added the new changes -- thanks for pointing out
thomasjpfan
left a comment
Please add an entry to the change log at doc/whats_new/v0.22.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
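Such an entry might look like the following. This is a hypothetical sketch: the wording and the |Enhancement| badge mimic the style of existing v0.22.rst entries, and the credit is inferred from the handles in this thread.

```rst
- |Enhancement| :class:`ensemble.StackingClassifier` and
  :class:`ensemble.StackingRegressor` now support a ``passthrough``
  parameter that passes the original dataset ``X``, alongside the base
  estimators' predictions, to the final estimator.
  :pr:`15138` by :user:`jcusick13`.
```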
sklearn/ensemble/_stacking.py (Outdated)

```python
if issparse(X):
    return sparse_hstack(X_meta).tocsr()
```
Nit: It would most likely be clearer for future devs:

```python
import scipy.sparse as sparse
...
if self.passthrough:
    X_meta.append(X)
hstack = sparse.hstack if sparse.issparse(X) else np.hstack
return hstack(X_meta)
```
thomasjpfan
left a comment
Otherwise LGTM
sklearn/ensemble/_stacking.py (Outdated)

```python
if self.passthrough:
    X_meta.append(X)
if sparse.issparse(X):
    return sparse.hstack(X_meta).tocsr()
```
More simplifications, nit:

```python
if self.passthrough:
    X_meta.append(X)
hstack = sparse.hstack if sparse.issparse(X) else np.hstack
return hstack(X_meta)
```
I tried a few ways to get this to work but kept running into errors since sparse.hstack returns a coo_matrix instead of a csr_matrix.
Is there a clever way to embed the .tocsr() function within the if/else line? I didn't want to add it as return hstack(X_meta).tocsr() since it would try to force dense matrices into csr format also.
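The behavior in question can be seen in a couple of lines. This is a standalone sketch (nothing here is from the PR itself), showing both the default COO output and the `format` keyword of `scipy.sparse.hstack`, which avoids calling `.tocsr()` on what might be a dense result:

```python
import numpy as np
from scipy import sparse

a = sparse.csr_matrix(np.eye(3))
b = np.ones((3, 2))

# With mixed dense/sparse inputs, sparse.hstack goes through
# sparse.bmat and hands back a COO matrix by default...
stacked = sparse.hstack([a, b])
assert stacked.format == 'coo'

# ...but the `format` keyword selects the output format directly,
# so no separate conversion call is needed.
stacked_csr = sparse.hstack([a, b], format='csr')
assert stacked_csr.format == 'csr'
assert stacked_csr.shape == (3, 5)
```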
should this be toformat(X.format)??
Not sure what you mean by embedding .tocsr in the if/else line.
(I am surprised that sparse.hstack works with a mix of dense and sparse input... I'm sure the docs only mention sparse.)
> (I am surprised that sparse.hstack works with a mix of dense and sparse input... I'm sure the docs only mention sparse.)

It is surprising. I am unsure if we should rely on this behavior.
@jcusick13 Maybe something like this will work:

```python
if self.passthrough:
    X_meta.append(X)
if sparse.issparse(X):
    return sparse.hstack(X_meta, format=X.format)
return np.hstack(X_meta)
```
Good catch on using X.format instead of hardcoding csr.
The sparse.hstack documentation is a bit worrying with the explicit mention of sparse inputs (and the sparse statements are just as explicit in the sparse.bmat docs, which is called directly by sparse.hstack).
I took a look at the sparse.bmat source and it looks like it's converting all of its inputs into a COO matrix in a way that would work regardless of sparse vs. dense input (by just using sparse.coo_matrix()). I ran some simple tests locally and it never seemed to raise a problem:

```python
>>> sparse.hstack([np.ones(10), sparse.csr_matrix(np.ones(10))])
<1x20 sparse matrix of type '<class 'numpy.float64'>'
    with 20 stored elements in COOrdinate format>
```
The current setup was passing all tests locally but definitely open to hearing your guys' thoughts on the matter. One suggestion is that we could first convert X_meta to be sparse before appending the sparse X to it and then running sparse.hstack. Something like

```python
if self.passthrough:
    if sparse.issparse(X):
        X_meta = sparse.coo_matrix(X_meta)
        return sparse.hstack(X_meta.append(X), format=X.format)
    X_meta.append(X)
return np.hstack(X_meta)
```
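As written, that sketch would fail: `list.append` returns None, so its result can't be passed to `sparse.hstack`. A runnable version of the convert-everything-to-sparse idea, with hypothetical stand-in values in place of the real estimator outputs, might be:

```python
import numpy as np
from scipy import sparse

# Stand-ins: X as a CSC input, X_meta as dense per-estimator predictions.
X = sparse.csc_matrix(np.arange(12.0).reshape(4, 3))
X_meta = [np.ones((4, 2)), np.zeros((4, 1))]

# Convert each dense block up front so that every input handed to
# sparse.hstack is sparse, matching what the scipy docs promise.
blocks = [sparse.coo_matrix(b) for b in X_meta] + [X]
out = sparse.hstack(blocks, format=X.format)
assert out.format == 'csc'
assert out.shape == (4, 6)
```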
jnothman
left a comment
Otherwise looking good!
```python
estimators=estimators, final_estimator=rf, cv=5, passthrough=True
)
clf.fit(X_train, y_train)
X_trans = clf.transform(X_test)
```
You don't actually confirm that the output is sparse (nor that its format matches that of X)
Good call -- I added tests to check both a sparse output and matching format. I also adjusted one of the sparse inputs to be CSC format (instead of CSR) so that the tests run against multiple sparse format types.
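In miniature, the two added checks look like this. This is a standalone sketch; `X_test` and `X_trans` here are hand-built stand-ins, not the test's actual fixtures:

```python
import numpy as np
from scipy import sparse

X_test = sparse.csc_matrix(np.ones((5, 4)))
preds = sparse.csc_matrix(np.zeros((5, 2)))
X_trans = sparse.hstack([preds, X_test], format=X_test.format)

# Check both that the output is sparse and that its format
# matches the input's (here CSC rather than CSR).
assert sparse.issparse(X_trans)
assert X_trans.format == X_test.format
```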
Testing with multiple sparse formats would be clearer to the reader with pytest.mark.parametrize, where here it's hard for the reader to see the differences.

All set -- parameterized CSR, CSC, and COO matrices for both stacking regressor and classifier tests.
```python
@pytest.mark.parametrize(
    'X',
    [sparse.csc_matrix(scale(X_diabetes)),
```
Cleaner if you just use format as the param: ['csc', 'csr', 'coo']. Then use `asformat` to convert from an initial format.
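The suggestion amounts to something like this (a sketch; in the real test the loop body would instead be a function decorated with `@pytest.mark.parametrize('fmt', ['csc', 'csr', 'coo'])`):

```python
import numpy as np
from scipy import sparse

X_dense = np.arange(12.0).reshape(4, 3)
X_base = sparse.csr_matrix(X_dense)

seen = []
for fmt in ['csc', 'csr', 'coo']:
    # One canonical matrix, converted per case, so the parameter list
    # shows only what actually differs between cases: the format.
    X = X_base.asformat(fmt)
    seen.append(X.format)
    assert np.allclose(X.toarray(), X_dense)

assert seen == ['csc', 'csr', 'coo']
```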
jnothman
left a comment
Sorry to be nit-picking but I think we are closer to a neat and idiomatic style.
No worries -- I just adjusted it now, let me know what you think.
jnothman
left a comment
Happy now :)
Thanks @jcusick13!
Reference Issues/PRs

Fixes #15076.

What does this implement/fix? Explain your changes.

Adds an argument, `pass_through`, to the base Stacking estimator. This allows for concatenating the original dataset `X` with the output from all individual `estimators` for use in training the `final_estimator`.

Any other comments?

I added tests but haven't been able to check them thoroughly. I haven't been able to get even the existing tests to run, since calling them (even on the `master` branch) returns the below error. I've looked around for a little and am not too sure how to resolve this -- does anyone have any suggestions?
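For reference, a minimal usage sketch of the feature as it was merged (with the final spelling `passthrough`, per the review above; the estimator choices are arbitrary and scikit-learn >= 0.22 is assumed):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

reg = StackingRegressor(
    estimators=[('ridge', Ridge()),
                ('tree', DecisionTreeRegressor(random_state=0))],
    final_estimator=LinearRegression(),
    passthrough=True,  # final estimator sees [predictions, original X]
).fit(X, y)

# 10 original diabetes features plus one prediction column per
# base estimator end up in the meta-features.
n_meta_features = reg.transform(X).shape[1]
assert n_meta_features == X.shape[1] + 2
```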