EHN Add multilabel classification support for StackingClassifier
#20424
Conversation
doc/whats_new/v1.0.rst
Outdated
- |Enhancement| multilabel-indicator target is supported by
  :class:`ensemble.StackingClassifier`.
  :pr:`20424` by :user:`Nicolas Peretti <nicoperetti>`,
  :user: `Nestor Navarro <nestornav>`, :user `Nati Tomattis <natitomattis>`.
- :user: `Nestor Navarro <nestornav>`, :user `Nati Tomattis <natitomattis>`.
+ :user:`Nestor Navarro <nestornav>`, :user:`Nati Tomattis <natitomattis>`.
I'll work on it
|
Sent email reminder to @nicoperetti and @nestornav. |
|
Hi @nicoperetti and @nestornav, do you still have time to finish this PR? |
|
Hi @jjerphan, actually I started looking at the PR again yesterday, so I'll be working on it. |
|
Hi @nestornav |
|
@reshamas Hi! I'm a little bit delayed with the work, but I'm trying to apply the changes. |
|
@glemaitre Just to understand: do you suggest that we need to handle the situation defined above?
|
yes |
|
@nicoperetti & @nestornav, are you working on this PR? Do you need help? Is it ok if I work on this? |
|
@jmloyola: I would recommend waiting for their response first. |
|
@jmloyola Yes, I'm currently working on it, but I'm really delayed. @glemaitre I have a question about the output of predict_proba. The current output is as below: |
|
For |
|
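For context on this question, here is a short, illustrative sketch (not the PR's code) of what predict_proba returns for a multilabel-indicator target with a typical base estimator such as a random forest: a list with one probability array per output column, rather than a single 2D array.

```python
# Illustrative sketch: for a multilabel-indicator target,
# RandomForestClassifier.predict_proba returns a list with one
# per-output probability array (typically of shape (n_samples, 2)),
# not a single 2D array.
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_multilabel_classification(n_samples=50, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

proba = clf.predict_proba(X)
print(type(proba), len(proba))  # a list, one entry per output column of y
```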
@nestornav Hi all, some of us will be online on the scikit-learn Discord on Monday, December 6. First-time and occasional contributors are welcome to join us. The focus of these office-hour sessions is to answer questions. We can also try to assist you in crafting minimal reproduction cases. Please note, our Code of Conduct applies. |
…o KNeighborsClassifier in test
Force-pushed d9705bd to cc3fe44.
…pe n_samples, n_outputs * n_classes - 1. Update test.
|
Could you fix the CI failure before I give a review? |
|
@glemaitre Yes, I'm working on it. |
|
@glemaitre I ran the stacking test and the whole test suite locally. The tests passed, so I'm not sure why this test is failing in CI. On the other hand, in the stacking test suite we set up the … Have you got any intuitions? |
# (passthrough & stack_method == "predict") => (columns == 29)
assert not passthrough or stack_method != "predict" or expected_column == 29

# (passthrough & stack_method != "predict") => (sc == 38)
assert not passthrough or stack_method == "predict" or expected_column == 38

# (not passthrough & stack_method == "predict") => (sc == 9)
assert passthrough or stack_method != "predict" or expected_column == 9

# (not passthrough & stack_method != "predict") => (sc == 18)
assert passthrough or stack_method == "predict" or expected_column == 18

if passthrough:
    assert_allclose(X_test, X_trans[:, -20:])
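The magic numbers 29, 38, 9, and 18 are mutually consistent under one reading of the setup. The sketch below reconstructs them assuming 3 base estimators, a 3-output binary multilabel target, and 20 input features; these assumptions are mine and are not confirmed by the diff.

```python
# Hypothetical reconstruction of the expected column counts above.
# Assumptions (mine, not the PR's): 3 base estimators, a 3-output
# binary multilabel target, and 20 input features that are appended
# to the meta-features when passthrough=True.
n_estimators, n_outputs, n_features = 3, 3, 20

cols_predict = n_estimators * n_outputs    # "predict": one column per output
cols_proba = n_estimators * n_outputs * 2  # "predict_proba": two classes per output

assert cols_predict == 9
assert cols_proba == 18
assert cols_predict + n_features == 29     # with passthrough
assert cols_proba + n_features == 38       # with passthrough
```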
@jmloyola
Do you think the test errors are coming from this code here? @nestornav and I are trying to figure out where and why the tests are failing.
Yes. When I run the test locally the line assert passthrough or stack_method != "predict" or sc == pytest.approx(0.2) fails with:
assert passthrough or stack_method != "predict" or sc == pytest.approx(0.2)
E AssertionError: assert (False or 'predict' != 'predict' or 0.12 == 0.2 ± 2.0e-07)
E + where 0.2 ± 2.0e-07 = <function approx at 0x7f26c3af9430>(0.2)
E + where <function approx at 0x7f26c3af9430> = pytest.approx
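As a side note on that failure message: pytest.approx uses a relative tolerance of 1e-6 by default, which is where the 0.2 ± 2.0e-07 in the output comes from (0.2 × 1e-6 = 2e-7).

```python
# pytest.approx compares floats within a relative tolerance of 1e-6
# by default, so 0.2 is matched within 0.2 * 1e-6 = 2e-7.
import pytest

assert 0.2 + 1e-9 == pytest.approx(0.2)   # within tolerance
assert not (0.12 == pytest.approx(0.2))   # why the assertion above fails
```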
If I comment that line, the test passes correctly. How did you choose those values?
@jmloyola The tests were written by my partner. I'm not sure why he chose those values; I tried to understand, but I'm not certain either.
Given how long this PR has been open, what do you think about skipping just the failing assertion?
In the meantime, I'll create a fresh virtual env to try to reproduce the error.
self.stack_method_[est_idx] == "predict_proba"
and self._type_of_target == "multilabel-indicator"
and isinstance(preds, list)
):
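A minimal sketch of the situation this condition guards against (illustrative only, not the PR's code): for multilabel targets, predict_proba yields a list of per-output arrays, which must be concatenated column-wise before it can serve as a single 2D block of meta-features.

```python
# Sketch of handling a list-valued predict_proba for multilabel targets.
# np.hstack is my illustrative choice here; the PR's actual handling may differ.
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_multilabel_classification(n_samples=40, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

preds = clf.predict_proba(X)   # list of per-output probability arrays
assert isinstance(preds, list)
stacked = np.hstack(preds)     # single 2D array usable as meta-features
print(stacked.shape)
```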
It would be good to put a comment here to explain what type we expect from predict_proba.
I assume that we get a list of NumPy arrays of a given shape.
return super().fit(X, self._le.transform(y), sample_weight)
self._type_of_target = type_of_target(y)
self.classes_ = []
if self._type_of_target != "multilabel-indicator":
We should also set the classes_ attribute in the "multilabel-indicator" case.
For instance, with a random forest we would get:
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_multilabel_classification()
model = RandomForestClassifier().fit(X, y)
model.classes_
# [array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1])]
So we should have the same behaviour here, and the tests should be aligned with this case.
The best here might be to create a LabelEncoder for each column of y. This way we can encode y in the meta-estimator and then decode the target y.
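A minimal sketch of this per-column LabelEncoder idea (an illustration, not the code merged in the PR): fit one encoder per output column, then encode and decode column by column.

```python
# Sketch of one LabelEncoder per column of a multilabel-indicator y.
import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.array([[0, 1], [1, 0], [1, 1]])

# Fit one encoder per output column.
encoders = [LabelEncoder().fit(y[:, i]) for i in range(y.shape[1])]
y_enc = np.column_stack(
    [enc.transform(y[:, i]) for i, enc in enumerate(encoders)]
)
# Decoding restores the original labels column by column.
y_dec = np.column_stack(
    [enc.inverse_transform(y_enc[:, i]) for i, enc in enumerate(encoders)]
)
assert (y_dec == y).all()
```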
stacker.n_features_in_

@pytest.mark.parametrize("stack_method", ["auto", "predict"])
We should have a test for predict_proba
assert not passthrough or stack_method != "predict" or sc == pytest.approx(0.44)

# (passthrough & stack_method != "predict") => (sc == 0.6)
assert not passthrough or stack_method == "predict" or sc == pytest.approx(0.6)
Such a check will be flaky. We should not test against a specific performance metric here.
I assume that checking the shape of the prediction, the attribute classes_ for the classifier and the shape of the transform method would be a good start.
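Such shape-and-attributes checks could look like the sketch below, which uses a plain random forest as a stand-in because the released StackingClassifier does not yet accept multilabel targets (illustrative only, not the PR's test):

```python
# Shape-based checks instead of asserting on a performance metric.
# RandomForestClassifier stands in for the multilabel-capable stacker.
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_multilabel_classification(n_samples=60, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

y_pred = clf.predict(X)
assert y_pred.shape == y.shape           # prediction shape matches the target
assert len(clf.classes_) == y.shape[1]   # one classes_ entry per output
```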
It should then be parametrized with passthrough=True/False as well.
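Parametrizing over passthrough as suggested could look like this (a sketch; the test body is a placeholder, not the PR's actual test):

```python
# Illustrative pytest parametrization over both passthrough values,
# stacked with the existing stack_method parametrization.
import pytest

@pytest.mark.parametrize("passthrough", [True, False])
@pytest.mark.parametrize("stack_method", ["auto", "predict"])
def test_stacking_multilabel(passthrough, stack_method):
    # A real test would fit a StackingClassifier on multilabel data here
    # and check prediction shapes, classes_, and transform output width.
    assert isinstance(passthrough, bool)
    assert stack_method in ("auto", "predict")
```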
Co-authored-by: Guillaume Lemaitre <[email protected]>
|
Hey @nicoperetti, @nestornav, @jmloyola @reshamas, @jjerphan and @glemaitre, this PR seems stalled. Can I try to finish it? |
|
Hi @Vincent-Maladiere, I would wait for @nicoperetti's response before pursuing this work. 🙂 |
|
@Vincent-Maladiere Yes, it's ok to complete this PR, and you're most welcome to. I have reached out to both original authors of this PR multiple times. One author has been unresponsive for about a year; the other has said they would work on it for going on 6 months, without any progress. This is the last PR from the Data Umbrella LATAM sprint, which was in June 2021. Thank you. This is officially stalled according to our guidelines here: |
|

Reference Issues/PRs
Fixes #17979
What does this implement/fix? Explain your changes.
Any other comments?
#DataUmbrella sprint
This PR was developed by @nestornav and myself