
Conversation

@nicoperetti

Reference Issues/PRs

Fixes #17979

What does this implement/fix? Explain your changes.

  • StackingClassifier is able to support multilabel classification.

Any other comments?

#DataUmbrella sprint
This PR was developed by @nestornav and myself

@nicoperetti nicoperetti changed the title MRG stacking classifier support multilabel classification. WIP stacking classifier support multilabel classification. Jun 29, 2021
@nicoperetti nicoperetti changed the title WIP stacking classifier support multilabel classification. MRG stacking classifier support multilabel classification. Jul 6, 2021
- |Enhancement| multilabel-indicator target is supported by
:class:`ensemble.StackingClassifier`.
:pr:`20424` by :user:`Nicolas Peretti <nicoperetti>`,
:user: `Nestor Navarro <nestornav>`, :user `Nati Tomattis <natitomattis>`.
Member

Suggested change
:user: `Nestor Navarro <nestornav>`, :user `Nati Tomattis <natitomattis>`.
:user:`Nestor Navarro <nestornav>`, :user:`Nati Tomattis <natitomattis>`.

Contributor

I'll work on it

@reshamas
Member

Sent email reminder to @nicoperetti and @nestornav.

@jjerphan
Member

Hi @nicoperetti and @nestornav, do you still have time to finish this PR?

@nestornav
Contributor

Hi @jjerphan, actually I started looking at the PR again yesterday, so I'll be working on it.

@reshamas
Member

reshamas commented Sep 1, 2021

Hi @nestornav
How is this PR going? Do you still need to work on it, or are you waiting for a reviewer?

@nestornav
Contributor

@reshamas Hi! I'm a little bit delayed with the work, but I'm trying to apply the changes.

@nestornav
Contributor

@glemaitre, to understand: do you suggest that we need to handle y_pred from the predict_proba function?

Taking into account the situation described above, where the predict_proba output can come:

  • as a list of n_targets arrays of 2 columns each, such as with KNeighborsClassifier or RandomForestClassifier;
  • as an ndarray of n_targets columns, such as with MLPClassifier (see the sketch below).
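
A minimal sketch of the two output shapes on a toy multilabel dataset (everything here is illustrative, not from the PR):

from sklearn.datasets import make_multilabel_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_multilabel_classification(n_classes=3, random_state=0)

# KNeighborsClassifier: a list of n_targets arrays, each of shape (n_samples, 2).
knn_proba = KNeighborsClassifier().fit(X, y).predict_proba(X)
print(type(knn_proba), len(knn_proba), knn_proba[0].shape)  # list, 3, (100, 2)

# MLPClassifier: a single ndarray of shape (n_samples, n_targets).
mlp_proba = MLPClassifier(max_iter=1000, random_state=0).fit(X, y).predict_proba(X)
print(type(mlp_proba), mlp_proba.shape)  # ndarray, (100, 3)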

@glemaitre
Member

Do you suggest that we need to handle y_pred from the predict_proba function?

yes

@jmloyola
Member

jmloyola commented Nov 8, 2021

@nicoperetti & @nestornav, are you working on this PR? Do you need help?

Is it ok if I work on this?

@jjerphan
Member

jjerphan commented Nov 8, 2021

@jmloyola: I would recommend waiting for their response first.

@nestornav
Contributor

@jmloyola yes, I'm currently working on it, but I'm really delayed.

@glemaitre I have a question about the output of predict_proba. Currently the output looks like this:

[
    array([[1. , 0. ],
           [0.6, 0.4],
           [0.2, 0.8],
           [0.4, 0.6]]),
    array([[0. , 1. ],
           [0.4, 0.6],
           [0.2, 0.8],
           [0. , 1. ]]),
]

Each array is the output of one of the stacked estimators. My doubt: when the target is a multilabel-indicator, do I need to transform this output (described in the previous comments) taking into account the final estimator?

@glemaitre
Member

For multilabel-indicator, you will get a list of n_outputs arrays of shape (n_samples, n_classes). The idea will be to concatenate this list into an array of shape (n_samples, n_outputs * (n_classes - 1)). Indeed, we need to drop one column from each array because the last column can be computed from all the others and is thus collinear. If I recall correctly, this is something that we already do for binary and multiclass problems.
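
A minimal NumPy sketch of that concatenation, using the toy predict_proba output from the previous comment (shapes here are illustrative):

import numpy as np

# n_outputs = 2 arrays, each of shape (n_samples=4, n_classes=2).
probas = [
    np.array([[1.0, 0.0], [0.6, 0.4], [0.2, 0.8], [0.4, 0.6]]),
    np.array([[0.0, 1.0], [0.4, 0.6], [0.2, 0.8], [0.0, 1.0]]),
]

# Drop the last (collinear) column of each array, then stack horizontally;
# the result has shape (n_samples, n_outputs * (n_classes - 1)).
X_meta = np.hstack([p[:, :-1] for p in probas])
print(X_meta.shape)  # (4, 2)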

@reshamas
Member

reshamas commented Dec 2, 2021

@nestornav
In case you have questions on the PR, can you make it to the next scikit-learn office hours? Info is below.


Hi all,

Some of us will be online on the scikit-learn Discord on Monday, December 6 at
10:00 PT / 13:00 ET / 18:00 UTC / 19:00 CET for about an hour or so.

First-time and occasional contributors are welcome to join us on
Discord using this invitation link:
https://discord.gg/YyYRXMju

The focus of these office hour sessions is to answer questions about
contributing to scikit-learn. We can also split into breakout
audio/text channels and do pair programming or live reviewing of
forgotten pull requests with screen sharing.

We can also try to assist you in crafting minimal reproduction cases
for bug reports to get a higher likelihood of resolution (e.g.
https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports).

Please note, our Code of Conduct applies:
https://github.com/scikit-learn/scikit-learn/blob/main/CODE_OF_CONDUCT.md

@nestornav nestornav force-pushed the 17979_stacking_classifier_multilabel branch from d9705bd to cc3fe44 Compare January 12, 2022 20:38
…pe n_samples, n_outputs * n_classes - 1. Update test.
@glemaitre
Member

glemaitre commented Jan 24, 2022

Could you fix the CI failure before I give a review?

@nestornav
Contributor

@glemaitre Yes, I'm working on it.

@nestornav
Contributor

@glemaitre I ran the stacking test and the whole test suite locally. The tests pass, so I'm not sure why this test is failing in CI.
Here are the outputs:

[Screenshot: local test run output, 2022-01-26]

On the other hand, in the stacking test suite we set the random_state value for both the estimators and make_multilabel_classification (to get the same sample data).

Do you have any intuition?

Comment on lines +610 to +623
# (passthrough & stack_method == "predict") => (expected_column == 29)
assert not passthrough or stack_method != "predict" or expected_column == 29

# (passthrough & stack_method != "predict") => (expected_column == 38)
assert not passthrough or stack_method == "predict" or expected_column == 38

# (not passthrough & stack_method == "predict") => (expected_column == 9)
assert passthrough or stack_method != "predict" or expected_column == 9

# (not passthrough & stack_method != "predict") => (expected_column == 18)
assert passthrough or stack_method == "predict" or expected_column == 18

if passthrough:
    assert_allclose(X_test, X_trans[:, -20:])
Member

@jmloyola
Do you think the test errors are coming from this code here? @nestornav and I are trying to figure out where and why the tests are failing.

Member

Yes. When I run the test locally the line assert passthrough or stack_method != "predict" or sc == pytest.approx(0.2) fails with:

assert passthrough or stack_method != "predict" or sc == pytest.approx(0.2)
E AssertionError: assert (False or 'predict' != 'predict' or 0.12 == 0.2 ± 2.0e-07)
E + where 0.2 ± 2.0e-07 = <function approx at 0x7f26c3af9430>(0.2)
E + where <function approx at 0x7f26c3af9430> = pytest.approx

Member

@jmloyola jmloyola Feb 2, 2022

If I comment out that line, the test passes. How did you choose those values?

Contributor

@nestornav nestornav Feb 2, 2022

@jmloyola The tests were written by my partner, and I'm not sure why he chose those values. I tried to understand, but I'm still not sure.

Taking into account how long this PR has been open, what do you think about skipping just the failing assertion?

On the other hand, I'll create a fresh virtual env to try to reproduce the error.

@glemaitre glemaitre self-requested a review February 10, 2022 15:42
if (
    self.stack_method_[est_idx] == "predict_proba"
    and self._type_of_target == "multilabel-indicator"
    and isinstance(preds, list)
):
Member

It would be good to put a comment here explaining what type we expect from predict_proba.

Member

I assume that we get a list of NumPy arrays of a given shape.

return super().fit(X, self._le.transform(y), sample_weight)
self._type_of_target = type_of_target(y)
self.classes_ = []
if self._type_of_target != "multilabel-indicator":
Member

We should also set the classes_ attribute in the "multilabel-indicator" case.
For instance, with a random forest we would get:

from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
X, y = make_multilabel_classification()
model = RandomForestClassifier().fit(X, y)
model.classes_
[array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1])]

So we should have the same behaviour here, and the tests should be aligned with this case.

Member

The best approach here might be to create a LabelEncoder for each column of y. This way we can encode y in the meta-estimator and then decode the predicted target.
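
A minimal sketch of per-column label encoding, assuming y is an (n_samples, n_outputs) indicator array (names here are illustrative, not from the PR):

import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.array([[0, 1], [1, 1], [1, 0]])

# One encoder per output column.
encoders = [LabelEncoder().fit(y[:, i]) for i in range(y.shape[1])]
y_encoded = np.column_stack(
    [enc.transform(y[:, i]) for i, enc in enumerate(encoders)]
)
classes_ = [enc.classes_ for enc in encoders]  # [array([0, 1]), array([0, 1])]

# Decoding back to the original labels.
y_decoded = np.column_stack(
    [enc.inverse_transform(y_encoded[:, i]) for i, enc in enumerate(encoders)]
)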

stacker.n_features_in_


@pytest.mark.parametrize("stack_method", ["auto", "predict"])
Member

We should have a test for predict_proba

assert not passthrough or stack_method != "predict" or sc == pytest.approx(0.44)

# (passthrough & stack_method != "predict") => (sc == 0.6)
assert not passthrough or stack_method == "predict" or sc == pytest.approx(0.6)
Member

Such a check will be flaky; we should not test against a specific performance metric here.

I assume that checking the shape of the predictions, the classes_ attribute of the classifier, and the shape of the transform output would be a good start.
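
A hypothetical shape-based test along those lines (the estimators and sizes are illustrative, and it assumes the multilabel support this PR adds):

from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_multilabel_classification(n_classes=3, random_state=42)

stacker = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier())],
    final_estimator=RandomForestClassifier(random_state=42),
).fit(X, y)

# Predictions should keep the multilabel-indicator shape of y.
y_pred = stacker.predict(X)
assert y_pred.shape == y.shape

# The stacked features should have one row per sample.
X_trans = stacker.transform(X)
assert X_trans.shape[0] == X.shape[0]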

Member

It should then be parametrized with passthrough=True/False as well.

nestornav and others added 2 commits February 21, 2022 14:40
Co-authored-by: Guillaume Lemaitre <[email protected]>
Co-authored-by: Guillaume Lemaitre <[email protected]>
@Vincent-Maladiere
Contributor

Vincent-Maladiere commented Aug 5, 2022

Hey @nicoperetti, @nestornav, @jmloyola, @reshamas, @jjerphan and @glemaitre, this PR seems stalled. Can I try to finish it?

@jjerphan
Member

jjerphan commented Aug 5, 2022

Hi @Vincent-Maladiere, I would wait for @nicoperetti's response before pursuing this work. 🙂

@jjerphan jjerphan changed the title MRG stacking classifier support multilabel classification. EHN Add multilabel classification support for StackingClassifier Aug 5, 2022
@reshamas
Member

reshamas commented Aug 5, 2022

@Vincent-Maladiere Yes, it's ok to complete this PR and most welcome.

I have reached out to both original authors of this PR multiple times. One author has been unresponsive for about a year. The other author has said they would work on it, going on 6 months, without any progress. This is the last PR from the Data Umbrella LATAM sprint which was in June 2021. Thank you.

This is officially stalled according to our guidelines here:
https://scikit-learn.org/stable/developers/contributing.html#stalled-pull-requests

If a contributor comments on an issue to say they are working on it, a pull request is expected within 2 weeks (new contributor) or 4 weeks (contributor or core dev), unless a larger time frame is explicitly given. Beyond that time, another contributor can take the issue and make a pull request for it. We encourage contributors to comment directly on the stalled or unclaimed issue to let community members know that they will be working on it.


Labels

module:ensemble · Sprint · Superseded (PR has been replaced by a newer PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

StackingClassifier to support multilabel classification

8 participants