
Conversation

@nicoperetti

Reference Issues/PRs

Fixes #17979

What does this implement/fix? Explain your changes.

  • StackingClassifier is able to support multilabel classification.

Any other comments?

#DataUmbrella sprint
This PR was developed by @nestornav and myself

@nicoperetti nicoperetti changed the title MRG stacking classifier support multilabel classification. WIP stacking classifier support multilabel classification. Jun 29, 2021
@nicoperetti nicoperetti changed the title WIP stacking classifier support multilabel classification. MRG stacking classifier support multilabel classification. Jul 6, 2021
- |Enhancement| multilabel-indicator target is supported by
:class:`ensemble.StackingClassifier`.
:pr:`20424` by :user:`Nicolas Peretti <nicoperetti>`,
:user: `Nestor Navarro <nestornav>`, :user `Nati Tomattis <natitomattis>`.
Member

Suggested change
:user: `Nestor Navarro <nestornav>`, :user `Nati Tomattis <natitomattis>`.
:user:`Nestor Navarro <nestornav>`, :user:`Nati Tomattis <natitomattis>`.

Contributor

I'll work on it

@reshamas
Member

Sent email reminder to @nicoperetti and @nestornav.

@jjerphan
Member

Hi @nicoperetti and @nestornav, do you still have time to finish this PR?

@nestornav
Contributor

Hi @jjerphan, actually I started looking at the PR again yesterday, so I'll be working on it.

@reshamas
Member

reshamas commented Sep 1, 2021

Hi @nestornav
How is this PR going? Do you still need to work on it, or are you waiting for a reviewer?

@nestornav
Contributor

@reshamas Hi! I'm a little bit delayed with the work, but I'm trying to apply the changes.

@nestornav
Contributor

@glemaitre, to understand: do you suggest that we need to handle y_pred from the predict_proba function?

Taking into account the situation described above, where the predict_proba output can come:

  • as a list of n_targets arrays of 2 columns each, such as with KNeighborsClassifier or RandomForestClassifier;
  • as an ndarray of n_targets columns, such as with MLPClassifier (see the sketch below).
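
A minimal sketch of the two output shapes on a toy multilabel dataset (everything here is illustrative, not from the PR):

from sklearn.datasets import make_multilabel_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_multilabel_classification(n_classes=3, random_state=0)

# KNeighborsClassifier: a list of n_targets arrays, each of shape (n_samples, 2).
knn_proba = KNeighborsClassifier().fit(X, y).predict_proba(X)
print(type(knn_proba), len(knn_proba), knn_proba[0].shape)  # list, 3, (100, 2)

# MLPClassifier: a single ndarray of shape (n_samples, n_targets).
mlp_proba = MLPClassifier(max_iter=1000, random_state=0).fit(X, y).predict_proba(X)
print(type(mlp_proba), mlp_proba.shape)  # ndarray, (100, 3)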

@glemaitre
Member

Do you suggest that we need to handle y_pred from the predict_proba function?

yes

@jmloyola
Member

jmloyola commented Nov 8, 2021

@nicoperetti & @nestornav, are you working on this PR? Do you need help?

Is it ok if I work on this?

@jjerphan
Member

jjerphan commented Nov 8, 2021

@jmloyola: I would recommend waiting for their response first.

@nestornav
Contributor

@jmloyola yes, I'm currently working on it, but I'm really delayed.

@glemaitre I have a question about the output of predict_proba. Currently the output looks like this:

[
    array([[1. , 0. ],
           [0.6, 0.4],
           [0.2, 0.8],
           [0.4, 0.6]]),
    array([[0. , 1. ],
           [0.4, 0.6],
           [0.2, 0.8],
           [0. , 1. ]]),
]

Each array is the output of one of the stacked estimators. My doubt: when the target is a multilabel-indicator, do I need to transform this output (described in the previous comments) taking into account the final estimator?

@glemaitre
Member

For multilabel-indicator, you will get a list of n_outputs arrays of shape (n_samples, n_classes). The idea will be to concatenate this list into an array of shape (n_samples, n_outputs * (n_classes - 1)). Indeed, we need to drop one column from each array because the last column can be computed from all the others and is thus collinear. If I recall correctly, this is something that we already do for binary and multiclass problems.
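
A minimal NumPy sketch of that concatenation, using the toy predict_proba output from the previous comment (shapes here are illustrative):

import numpy as np

# n_outputs = 2 arrays, each of shape (n_samples=4, n_classes=2).
probas = [
    np.array([[1.0, 0.0], [0.6, 0.4], [0.2, 0.8], [0.4, 0.6]]),
    np.array([[0.0, 1.0], [0.4, 0.6], [0.2, 0.8], [0.0, 1.0]]),
]

# Drop the last (collinear) column of each array, then stack horizontally;
# the result has shape (n_samples, n_outputs * (n_classes - 1)).
X_meta = np.hstack([p[:, :-1] for p in probas])
print(X_meta.shape)  # (4, 2)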

@reshamas
Member

reshamas commented Dec 2, 2021

@nestornav
In case you have questions on the PR, can you make it to the next scikit-learn office hours? Info is below.


Hi all,

Some of us will be online on the scikit-learn Discord on Monday, December 6 at
10:00 PT / 13:00 ET / 18:00 UTC / 19:00 CET for about an hour or so.

First-time and occasional contributors are welcome to join us on
Discord using this invitation link:
https://discord.gg/YyYRXMju

The focus of these office hour sessions is to answer questions about
contributing to scikit-learn. We can also split into breakout
audio/text channels and do pair programming or live reviewing of
forgotten pull requests with screen sharing.

We can also try to assist you in crafting minimal reproduction cases
for bug reports to get a higher likelihood of resolution (e.g.
https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports).

Please note, our Code of Conduct applies:
https://github.com/scikit-learn/scikit-learn/blob/main/CODE_OF_CONDUCT.md

@nestornav nestornav force-pushed the 17979_stacking_classifier_multilabel branch from d9705bd to cc3fe44 Compare January 12, 2022 20:38
…pe n_samples, n_outputs * n_classes - 1. Update test.
@glemaitre
Member

glemaitre commented Jan 24, 2022

Could you fix the CI failure before I give a review?

@nestornav
Contributor

@glemaitre Yes, I'm working on it.

@nestornav
Contributor

@glemaitre I ran the stacking test and the whole test suite locally. The tests pass, so I'm not sure why this test is failing in CI.
Here are the outputs:

[Screenshot: local test run output, 2022-01-26]

On the other hand, in the stacking test suite we set the random_state value for both the estimators and make_multilabel_classification (to get the same sample data).

Do you have any intuition?

Comment on lines +610 to +623
# (passthrough & stack_method == "predict") => (expected_column == 29)
assert not passthrough or stack_method != "predict" or expected_column == 29

# (passthrough & stack_method != "predict") => (expected_column == 38)
assert not passthrough or stack_method == "predict" or expected_column == 38

# (not passthrough & stack_method == "predict") => (expected_column == 9)
assert passthrough or stack_method != "predict" or expected_column == 9

# (not passthrough & stack_method != "predict") => (expected_column == 18)
assert passthrough or stack_method == "predict" or expected_column == 18

if passthrough:
    assert_allclose(X_test, X_trans[:, -20:])
Member

@jmloyola
Do you think the test errors are coming from this code here? @nestornav and I are trying to figure out where and why the tests are failing.

Member

Yes. When I run the test locally the line assert passthrough or stack_method != "predict" or sc == pytest.approx(0.2) fails with:

assert passthrough or stack_method != "predict" or sc == pytest.approx(0.2)
E AssertionError: assert (False or 'predict' != 'predict' or 0.12 == 0.2 ± 2.0e-07)
E + where 0.2 ± 2.0e-07 = <function approx at 0x7f26c3af9430>(0.2)
E + where <function approx at 0x7f26c3af9430> = pytest.approx

Member

@jmloyola jmloyola Feb 2, 2022

If I comment out that line, the test passes. How did you choose those values?

Contributor

@nestornav nestornav Feb 2, 2022

@jmloyola The tests were written by my partner, and I'm not sure why he chose those values. I tried to understand, but I'm still not sure.

Taking into account how long this PR has been open, what do you think about skipping just the failing assertion?

On the other hand, I'll create a fresh virtual env to try to reproduce the error.

@glemaitre glemaitre self-requested a review February 10, 2022 15:42
if (
    self.stack_method_[est_idx] == "predict_proba"
    and self._type_of_target == "multilabel-indicator"
    and isinstance(preds, list)
):
Member

It would be good to put a comment here explaining what type we expect from predict_proba.

Member

I assume that we get a list of NumPy arrays of a given shape.

return super().fit(X, self._le.transform(y), sample_weight)
self._type_of_target = type_of_target(y)
self.classes_ = []
if self._type_of_target != "multilabel-indicator":
Member

We should also set the classes_ attribute in the "multilabel-indicator" case.
For instance, with a random forest we would get:

from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
X, y = make_multilabel_classification()
model = RandomForestClassifier().fit(X, y)
model.classes_
[array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1]), array([0, 1])]

So we should have the same behaviour here, and the tests should be aligned with this case.

Member

The best approach here might be to create a LabelEncoder for each column of y. This way we can encode y in the meta-estimator and then decode the predicted target.
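
A minimal sketch of per-column label encoding, assuming y is an (n_samples, n_outputs) indicator array (names here are illustrative, not from the PR):

import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.array([[0, 1], [1, 1], [1, 0]])

# One encoder per output column.
encoders = [LabelEncoder().fit(y[:, i]) for i in range(y.shape[1])]
y_encoded = np.column_stack(
    [enc.transform(y[:, i]) for i, enc in enumerate(encoders)]
)
classes_ = [enc.classes_ for enc in encoders]  # [array([0, 1]), array([0, 1])]

# Decoding back to the original labels.
y_decoded = np.column_stack(
    [enc.inverse_transform(y_encoded[:, i]) for i, enc in enumerate(encoders)]
)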

stacker.n_features_in_


@pytest.mark.parametrize("stack_method", ["auto", "predict"])
Member

We should have a test for predict_proba

assert not passthrough or stack_method != "predict" or sc == pytest.approx(0.44)

# (passthrough & stack_method != "predict") => (sc == 0.6)
assert not passthrough or stack_method == "predict" or sc == pytest.approx(0.6)
Member

Such a check will be flaky; we should not test against a specific performance metric here.

I assume that checking the shape of the predictions, the classes_ attribute of the classifier, and the shape of the transform output would be a good start.
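
A hypothetical shape-based test along those lines (the estimators and sizes are illustrative, and it assumes the multilabel support this PR adds):

from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_multilabel_classification(n_classes=3, random_state=42)

stacker = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier())],
    final_estimator=RandomForestClassifier(random_state=42),
).fit(X, y)

# Predictions should keep the multilabel-indicator shape of y.
y_pred = stacker.predict(X)
assert y_pred.shape == y.shape

# The stacked features should have one row per sample.
X_trans = stacker.transform(X)
assert X_trans.shape[0] == X.shape[0]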

Member

It should then be parametrized with passthrough=True/False as well.

nestornav and others added 2 commits February 21, 2022 14:40
Co-authored-by: Guillaume Lemaitre <[email protected]>
Co-authored-by: Guillaume Lemaitre <[email protected]>
@Vincent-Maladiere
Contributor

Vincent-Maladiere commented Aug 5, 2022

Hey @nicoperetti, @nestornav, @jmloyola, @reshamas, @jjerphan and @glemaitre, this PR seems stalled. Can I try to finish it?

@jjerphan
Member

jjerphan commented Aug 5, 2022

Hi @Vincent-Maladiere, I would wait for @nicoperetti's response before pursuing this work. 🙂

@jjerphan jjerphan changed the title MRG stacking classifier support multilabel classification. EHN Add multilabel classification support for StackingClassifier Aug 5, 2022
@reshamas
Member

reshamas commented Aug 5, 2022

@Vincent-Maladiere Yes, it's ok to complete this PR and most welcome.

I have reached out to both original authors of this PR multiple times. One author has been unresponsive for about a year. The other author has said they would work on it, going on 6 months, without any progress. This is the last PR from the Data Umbrella LATAM sprint which was in June 2021. Thank you.

This is officially stalled according to our guidelines here:
https://scikit-learn.org/stable/developers/contributing.html#stalled-pull-requests

If a contributor comments on an issue to say they are working on it, a pull request is expected within 2 weeks (new contributor) or 4 weeks (contributor or core dev), unless a larger time frame is explicitly given. Beyond that time, another contributor can take the issue and make a pull request for it. We encourage contributors to comment directly on the stalled or unclaimed issue to let community members know that they will be working on it.


Labels

module:ensemble · Sprint · Superseded (PR has been replaced by a newer PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

StackingClassifier to support multilabel classification

8 participants