ENH Add multilabel classification support for StackingClassifier
#24146
Conversation
…o KNeighborsClassifier in test
…pe n_samples, n_outputs * n_classes - 1. Update test.
Co-authored-by: Guillaume Lemaitre <[email protected]>
Co-authored-by: Guillaume Lemaitre <[email protected]>
…classifier_multilabel
jjerphan
left a comment
Thank you for this follow-up, @Vincent-Maladiere.
Here are a few comments.
Co-authored-by: Julien Jerphanion <[email protected]>
Co-authored-by: Julien Jerphanion <[email protected]>
glemaitre
left a comment
A couple of additional comments.
I would keep this PR only on the multilabel case and postpone the multiclass multioutput for later.

    "stack_method", ["auto", "predict", "predict_proba", "decision_function"]
)
@pytest.mark.parametrize("passthrough", [False, True])
def test_stacking_classifier_multilabel(stack_method, passthrough):
I find that we test too many things at once (handled by the if-else).
I would instead split the test into several smaller tests.
@pytest.mark.parametrize(
"estimator",
[
# output a 2D array of the probability of the positive class for each output
MLPClassifier(random_state=42),
# output a list of 2D array containing the probability of each class for each output
RandomForestClassifier(random_state=42),
],
ids=["MLPClassifier", "RandomForestClassifier"],
)
def test_stacking_classifier_multilabel_predict_proba(estimator):
"""Check the behaviour for the multilabel classification case and the `predict_proba`
stacking method.
Estimators are not consistent with the output arrays and we need to ensure that
we handle all cases.
"""
X_train, X_test, y_train, y_test = train_test_split(
X_multilabel, y_multilabel, stratify=y_multilabel, random_state=42
)
n_outputs = 3
# MLPClassifier will return a 2D array where each column is the probability
# of the positive class for each output. We stack this array directly without
# any further processing.
estimators = [("est", estimator)]
stacker = StackingClassifier(
estimators=estimators, final_estimator=KNeighborsClassifier(), stack_method="predict_proba",
).fit(X_train, y_train)
X_trans = stacker.transform(X_test)
assert X_trans.shape == (X_test.shape[0], n_outputs)
# we should not have any collinear classes and thus nothing should sum to 1
assert not any(np.isclose(X_trans.sum(axis=1), 1.0))
y_pred = stacker.predict(X_test)
assert y_pred.shape == y_test.shape
def test_stacking_classifier_multilabel_decision_function():
"""Check the behaviour for the multilabel classification case and the
`decision_function` stacking method. Only `RidgeClassifier` supports this
case.
"""
X_train, X_test, y_train, y_test = train_test_split(
X_multilabel, y_multilabel, stratify=y_multilabel, random_state=42
)
n_outputs = 3
estimators = [("est", RidgeClassifier())]
stacker = StackingClassifier(
estimators=estimators, final_estimator=KNeighborsClassifier(), stack_method="decision_function",
).fit(X_train, y_train)
X_trans = stacker.transform(X_test)
assert X_trans.shape == (X_test.shape[0], n_outputs)
# check the shape consistency of the prediction
y_pred = stacker.predict(X_test)
assert y_pred.shape == y_test.shape
@pytest.mark.parametrize("stack_method", ["auto", "predict"])
@pytest.mark.parametrize("passthrough", [False, True])
def test_stacking_classifier_multilabel_auto_predict(stack_method, passthrough):
"""Check the behaviour for the multilabel classification case for stack methods
supported for all estimators or automatically picked up.
"""
X_train, X_test, y_train, y_test = train_test_split(
X_multilabel, y_multilabel, stratify=y_multilabel, random_state=42
)
y_train_before_fit = y_train.copy()
n_outputs = 3
estimators = [
("mlp", MLPClassifier(random_state=42)),
("rf", RandomForestClassifier(random_state=42)),
("ridge", RidgeClassifier()),
]
final_estimator = KNeighborsClassifier()
clf = StackingClassifier(
estimators=estimators,
final_estimator=final_estimator,
passthrough=passthrough,
stack_method=stack_method,
).fit(X_train, y_train)
# make sure we don't change `y_train` inplace
assert_array_equal(y_train_before_fit, y_train)
y_pred = clf.predict(X_test)
assert y_pred.shape == y_test.shape
if stack_method == "auto":
expected_stack_methods = ["predict_proba", "predict_proba", "decision_function"]
else:
expected_stack_methods = ["predict"] * len(estimators)
assert clf.stack_method_ == expected_stack_methods
n_features_X_trans = n_outputs * len(estimators)
if passthrough:
n_features_X_trans += X_train.shape[1]
X_trans = clf.transform(X_test)
assert X_trans.shape == (X_test.shape[0], n_features_X_trans)
assert_array_equal(clf.classes_, [np.array([0, 1])] * n_outputs)

    "stack_method", ["auto", "predict", "predict_proba", "decision_function"]
)
@pytest.mark.parametrize("passthrough", [False, True])
def test_stacking_classifier_binary(stack_method, passthrough):
We already covered the binary case in some tests above.
We could drop this test.
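For context on the parametrization in the suggested test above: for multilabel targets, estimators disagree on the return type of `predict_proba`. The following is a minimal standalone sketch (not part of the PR's diff) illustrating the two shapes the stacker must handle:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

X, Y = make_multilabel_classification(n_samples=100, n_classes=3, random_state=42)

# MLPClassifier handles multilabel targets natively and returns one single
# (n_samples, n_outputs) array: the positive-class probability per output.
mlp_proba = MLPClassifier(random_state=42, max_iter=100).fit(X, Y).predict_proba(X)
assert mlp_proba.shape == (100, 3)

# RandomForestClassifier instead returns a list with one array per output,
# each holding the probabilities of both classes for that output.
rf_proba = RandomForestClassifier(random_state=42).fit(X, Y).predict_proba(X)
assert isinstance(rf_proba, list) and len(rf_proba) == 3
assert all(p.shape == (100, 2) for p in rf_proba)
```

This is exactly the inconsistency the two parametrized cases above exercise.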
done, thanks for the review @glemaitre
…cent-Maladiere/scikit-learn into 17979_stacking_classifier_multilabel
glemaitre
left a comment
Only a couple of outdated items that I previously proposed :).
LGTM otherwise
sklearn/ensemble/_stacking.py
Outdated
it will drop one of the probability columns when using probabilities
in the binary case. Indeed, p(y|c=0) = 1 - p(y|c=1)
When `y` type is `"multilabel-indicator"` or `"multiclass-multioutput"`
we can probably remove the multiclass-multioutput mention for the moment.
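The docstring under discussion refers to the binary collinearity that motivates dropping a column. A small standalone sketch of why one probability column is redundant (using `LogisticRegression` purely for illustration, not the PR's code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=42)
proba = LogisticRegression().fit(X, y).predict_proba(X)

# The two binary columns always sum to 1, i.e. p(y=0|x) = 1 - p(y=1|x),
# so keeping only the second column loses no information for the stacker.
assert np.allclose(proba.sum(axis=1), 1.0)
kept = proba[:, 1:]
assert kept.shape == (100, 1)
```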
sklearn/ensemble/_stacking.py
Outdated
in the binary case. Indeed, p(y|c=0) = 1 - p(y|c=1)
When `y` type is `"multilabel-indicator"` or `"multiclass-multioutput"`
and the method used is `predict_proba`, `preds` can be either a ndarray
- and the method used is `predict_proba`, `preds` can be either a ndarray
+ and the method used is `predict_proba`, `preds` can be either a `ndarray`
sklearn/ensemble/_stacking.py
Outdated
When `y` type is `"multilabel-indicator"` or `"multiclass-multioutput"`
and the method used is `predict_proba`, `preds` can be either a ndarray
of shape (n_samples, n_class) or for some estimators a list of ndarray.
- of shape (n_samples, n_class) or for some estimators a list of ndarray.
+ of shape `(n_samples, n_class)` or for some estimators a list of `ndarray`.
sklearn/ensemble/_stacking.py
Outdated
self._le = LabelEncoder().fit(y)
self.classes_ = self._le.classes_
return super().fit(X, self._le.transform(y), sample_weight)
if type_of_target(y) in ("multilabel-indicator", "multiclass-multioutput"):
So we should restrict this to multilabel-indicator here.
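As a standalone sketch of the per-column encoding approach being discussed (a simplified illustration, not the PR's actual implementation), one `LabelEncoder` can be fitted per output column of a `multilabel-indicator` target and later used to decode predictions:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.multiclass import type_of_target

y = np.array([[0, 1, 1], [1, 0, 1], [0, 0, 1], [1, 1, 0]])
assert type_of_target(y) == "multilabel-indicator"

# One LabelEncoder per output column.
encoders = [LabelEncoder().fit(col) for col in y.T]
y_encoded = np.array(
    [enc.transform(col) for enc, col in zip(encoders, y.T)]
).T

# Decoding column by column restores the original indicator matrix.
y_decoded = np.array(
    [enc.inverse_transform(col) for enc, col in zip(encoders, y_encoded.T)]
).T
assert np.array_equal(y_decoded, y)
```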
# MLPClassifier will return a 2D array where each column is the probability
# of the positive class for each output. We stack this array directly without
# any further processing.
Actually we can remove this comment (it was before I parametrized the test). The info is already in the parametrization.
X_trans = stacker.transform(X_test)
assert X_trans.shape == (X_test.shape[0], n_outputs)
# check the shape consistency of the prediction
Actually the assert is self-explanatory here.
# check the shape consistency of the prediction
Something looks wrong with CircleCI

You can ignore it ;)

Random timeout/disconnection happens from time to time. You can push an empty commit to re-trigger the CI if needed. To create an empty commit with this clear purpose, you can use: `git commit -m "Trigger CI" --allow-empty`
…cent-Maladiere/scikit-learn into 17979_stacking_classifier_multilabel
LGTM. Thank you, @Vincent-Maladiere!
Before merging, I just have one question regarding a conversation which I think can be resolved.
PS: once again, we can ignore the unrelated error on Circle CI.
if isinstance(self._label_encoder, list):
    # Handle the multilabel-indicator and multiclass-multioutput cases
    y_pred = np.array([preds[:, 0] for preds in y_pred]).T
Can we resolve this conversation?
Hey @jjerphan, I simply removed the multiclass-multioutput mention from the comment, so that we don't account for it at the moment.
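For reference, the snippet under discussion selects the first column of each per-output prediction array and transposes the result; a standalone sketch of that reshaping on toy data (not the PR's code):

```python
import numpy as np

# Predictions arriving as a list of per-output arrays: n_outputs entries,
# each of shape (n_samples, n_columns), as some estimators return them.
preds_per_output = [
    np.array([[0, 1], [1, 0], [0, 1]]),  # output 0
    np.array([[1, 0], [1, 0], [0, 1]]),  # output 1
]

# Taking column 0 of each output and transposing yields the
# (n_samples, n_outputs) matrix expected for a multilabel y_pred.
y_pred = np.array([preds[:, 0] for preds in preds_per_output]).T
assert y_pred.shape == (3, 2)
assert y_pred.tolist() == [[0, 1], [1, 1], [0, 0]]
```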
OK, merging then. The Circle CI fail is unrelated.
Reference Issues/PRs
Follow up #20424 and close #17979
Thank you @nicoperetti, @nestornav, @jmloyola and @reshamas for the original PR!
What does this implement/fix? Explain your changes.
Enable `multilabel-indicator` targets for `StackingClassifier` by:
- Dropping one of the `predict_proba` columns, which are collinear in the `multilabel-indicator` context for models like `KNeighborsClassifier` and `RandomForestClassifier`, contrary to `MLPClassifier` that directly outputs an array of shape `(n_samples, n_classes)`.
- Setting `classes_` during `fit` in the `multilabel-indicator` case.
- Fitting a `LabelEncoder` for each column of `y` during `fit`, then decoding the target in `predict`.
- Testing `KNeighborsClassifier`, `MLPClassifier` and `DummyClassifier` as base estimators with parametrized `passthrough` and `stack_method`.

Any other comments?
Should we go further in the tests by including more checks, like sparsity? @jjerphan, @glemaitre