ENH Add multilabel classification support for StackingClassifier
#24146
Conversation
…o KNeighborsClassifier in test
…pe n_samples, n_outputs * n_classes - 1. Update test.
Co-authored-by: Guillaume Lemaitre <[email protected]>
Co-authored-by: Guillaume Lemaitre <[email protected]>
…classifier_multilabel
jjerphan
left a comment
Thank you for this follow-up, @Vincent-Maladiere.
Here are a few comments.
Co-authored-by: Julien Jerphanion <[email protected]>
Co-authored-by: Julien Jerphanion <[email protected]>
glemaitre
left a comment
A couple of additional comments.
I would keep this PR only on the multilabel case and postpone the multiclass multioutput for later.

    "stack_method", ["auto", "predict", "predict_proba", "decision_function"]
)
@pytest.mark.parametrize("passthrough", [False, True])
def test_stacking_classifier_multilabel(stack_method, passthrough):
I find that we test too many things at once (handled by the if-else).
I would instead split the test into several smaller tests.
@pytest.mark.parametrize(
"estimator",
[
# output a 2D array of the probability of the positive class for each output
MLPClassifier(random_state=42),
# output a list of 2D array containing the probability of each class for each output
RandomForestClassifier(random_state=42),
],
ids=["MLPClassifier", "RandomForestClassifier"],
)
def test_stacking_classifier_multilabel_predict_proba(estimator):
"""Check the behaviour for the multilabel classification case and the `predict_proba`
stacking method.
Estimators are not consistent with the output arrays and we need to ensure that
we handle all cases.
"""
X_train, X_test, y_train, y_test = train_test_split(
X_multilabel, y_multilabel, stratify=y_multilabel, random_state=42
)
n_outputs = 3
# MLPClassifier will return a 2D array where each column is the probability
# of the positive class for each output. We stack this array directly without
# any further processing.
estimators = [("est", estimator)]
stacker = StackingClassifier(
estimators=estimators, final_estimator=KNeighborsClassifier(), stack_method="predict_proba",
).fit(X_train, y_train)
X_trans = stacker.transform(X_test)
assert X_trans.shape == (X_test.shape[0], n_outputs)
# we should not have any collinear classes and thus nothing should sum to 1
assert not any(np.isclose(X_trans.sum(axis=1), 1.0))
y_pred = stacker.predict(X_test)
assert y_pred.shape == y_test.shape
def test_stacking_classifier_multilabel_decision_function():
"""Check the behaviour for the multilabel classification case and the
`decision_function` stacking method. Only `RidgeClassifier` supports this
case.
"""
X_train, X_test, y_train, y_test = train_test_split(
X_multilabel, y_multilabel, stratify=y_multilabel, random_state=42
)
n_outputs = 3
estimators = [("est", RidgeClassifier())]
stacker = StackingClassifier(
estimators=estimators, final_estimator=KNeighborsClassifier(), stack_method="decision_function",
).fit(X_train, y_train)
X_trans = stacker.transform(X_test)
assert X_trans.shape == (X_test.shape[0], n_outputs)
# check the shape consistency of the prediction
y_pred = stacker.predict(X_test)
assert y_pred.shape == y_test.shape
@pytest.mark.parametrize("stack_method", ["auto", "predict"])
@pytest.mark.parametrize("passthrough", [False, True])
def test_stacking_classifier_multilabel_auto_predict(stack_method, passthrough):
"""Check the behaviour for the multilabel classification case for stack methods
supported for all estimators or automatically picked up.
"""
X_train, X_test, y_train, y_test = train_test_split(
X_multilabel, y_multilabel, stratify=y_multilabel, random_state=42
)
y_train_before_fit = y_train.copy()
n_outputs = 3
estimators = [
("mlp", MLPClassifier(random_state=42)),
("rf", RandomForestClassifier(random_state=42)),
("ridge", RidgeClassifier()),
]
final_estimator = KNeighborsClassifier()
clf = StackingClassifier(
estimators=estimators,
final_estimator=final_estimator,
passthrough=passthrough,
stack_method=stack_method,
).fit(X_train, y_train)
# make sure we don't change `y_train` inplace
assert_array_equal(y_train_before_fit, y_train)
y_pred = clf.predict(X_test)
assert y_pred.shape == y_test.shape
if stack_method == "auto":
expected_stack_methods = ["predict_proba", "predict_proba", "decision_function"]
else:
expected_stack_methods = ["predict"] * len(estimators)
assert clf.stack_method_ == expected_stack_methods
n_features_X_trans = n_outputs * len(estimators)
if passthrough:
n_features_X_trans += X_train.shape[1]
X_trans = clf.transform(X_test)
assert X_trans.shape == (X_test.shape[0], n_features_X_trans)
assert_array_equal(clf.classes_, [np.array([0, 1])] * n_outputs)

    "stack_method", ["auto", "predict", "predict_proba", "decision_function"]
)
@pytest.mark.parametrize("passthrough", [False, True])
def test_stacking_classifier_binary(stack_method, passthrough):
We already covered the binary case in some tests above.
We could drop this test.
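For context on the parametrization in the suggested test above: for multilabel targets, estimators disagree on the return type of `predict_proba`. The following is a minimal standalone sketch (not part of the PR's diff) illustrating the two shapes the stacker must handle:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

X, Y = make_multilabel_classification(n_samples=100, n_classes=3, random_state=42)

# MLPClassifier handles multilabel targets natively and returns one single
# (n_samples, n_outputs) array: the positive-class probability per output.
mlp_proba = MLPClassifier(random_state=42, max_iter=100).fit(X, Y).predict_proba(X)
assert mlp_proba.shape == (100, 3)

# RandomForestClassifier instead returns a list with one array per output,
# each holding the probabilities of both classes for that output.
rf_proba = RandomForestClassifier(random_state=42).fit(X, Y).predict_proba(X)
assert isinstance(rf_proba, list) and len(rf_proba) == 3
assert all(p.shape == (100, 2) for p in rf_proba)
```

This is exactly the inconsistency the two parametrized cases above exercise.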
done, thanks for the review @glemaitre
…cent-Maladiere/scikit-learn into 17979_stacking_classifier_multilabel
glemaitre
left a comment
Only a couple of outdated items that I previously proposed :).
LGTM otherwise
sklearn/ensemble/_stacking.py
Outdated
it will drop one of the probability columns when using probabilities
in the binary case. Indeed, p(y|c=0) = 1 - p(y|c=1)
When `y` type is `"multilabel-indicator"` or `"multiclass-multioutput"`
we can probably remove the multiclass-multioutput mention for the moment.
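The docstring under discussion refers to the binary collinearity that motivates dropping a column. A small standalone sketch of why one probability column is redundant (using `LogisticRegression` purely for illustration, not the PR's code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, random_state=42)
proba = LogisticRegression().fit(X, y).predict_proba(X)

# The two binary columns always sum to 1, i.e. p(y=0|x) = 1 - p(y=1|x),
# so keeping only the second column loses no information for the stacker.
assert np.allclose(proba.sum(axis=1), 1.0)
kept = proba[:, 1:]
assert kept.shape == (100, 1)
```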
sklearn/ensemble/_stacking.py
Outdated
in the binary case. Indeed, p(y|c=0) = 1 - p(y|c=1)
When `y` type is `"multilabel-indicator"` or `"multiclass-multioutput"`
and the method used is `predict_proba`, `preds` can be either a ndarray
- and the method used is `predict_proba`, `preds` can be either a ndarray
+ and the method used is `predict_proba`, `preds` can be either a `ndarray`
sklearn/ensemble/_stacking.py
Outdated
When `y` type is `"multilabel-indicator"` or `"multiclass-multioutput"`
and the method used is `predict_proba`, `preds` can be either a ndarray
of shape (n_samples, n_class) or for some estimators a list of ndarray.
- of shape (n_samples, n_class) or for some estimators a list of ndarray.
+ of shape `(n_samples, n_class)` or for some estimators a list of `ndarray`.
sklearn/ensemble/_stacking.py
Outdated
self._le = LabelEncoder().fit(y)
self.classes_ = self._le.classes_
return super().fit(X, self._le.transform(y), sample_weight)
if type_of_target(y) in ("multilabel-indicator", "multiclass-multioutput"):
So we should restrict this to multilabel-indicator here.
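As a standalone sketch of the per-column encoding approach being discussed (a simplified illustration, not the PR's actual implementation), one `LabelEncoder` can be fitted per output column of a `multilabel-indicator` target and later used to decode predictions:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.utils.multiclass import type_of_target

y = np.array([[0, 1, 1], [1, 0, 1], [0, 0, 1], [1, 1, 0]])
assert type_of_target(y) == "multilabel-indicator"

# One LabelEncoder per output column.
encoders = [LabelEncoder().fit(col) for col in y.T]
y_encoded = np.array(
    [enc.transform(col) for enc, col in zip(encoders, y.T)]
).T

# Decoding column by column restores the original indicator matrix.
y_decoded = np.array(
    [enc.inverse_transform(col) for enc, col in zip(encoders, y_encoded.T)]
).T
assert np.array_equal(y_decoded, y)
```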
# MLPClassifier will return a 2D array where each column is the probability
# of the positive class for each output. We stack this array directly without
# any further processing.
Actually we can remove this comment (it was before I parametrized the test). The info is already in the parametrization.
X_trans = stacker.transform(X_test)
assert X_trans.shape == (X_test.shape[0], n_outputs)
# check the shape consistency of the prediction
Actually the assert is self-explanatory here.
# check the shape consistency of the prediction
Something looks wrong with CircleCI

You can ignore it ;)

Random timeout/disconnection happens from time to time. You can push an empty commit to re-trigger the CI if needed. To create an empty commit with this clear purpose, you can use: `git commit -m "Trigger CI" --allow-empty`
…cent-Maladiere/scikit-learn into 17979_stacking_classifier_multilabel
LGTM. Thank you, @Vincent-Maladiere!
Before merging, I just have one question regarding a conversation which I think can be resolved.
PS: once again, we can ignore the unrelated error on Circle CI.
if isinstance(self._label_encoder, list):
    # Handle the multilabel-indicator and multiclass-multioutput cases
    y_pred = np.array([preds[:, 0] for preds in y_pred]).T
Can we resolve this conversation?
Hey @jjerphan, I simply removed the multiclass-multioutput mention from the comment, so that we don't account for it at the moment.
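For reference, the snippet under discussion selects the first column of each per-output prediction array and transposes the result; a standalone sketch of that reshaping on toy data (not the PR's code):

```python
import numpy as np

# Predictions arriving as a list of per-output arrays: n_outputs entries,
# each of shape (n_samples, n_columns), as some estimators return them.
preds_per_output = [
    np.array([[0, 1], [1, 0], [0, 1]]),  # output 0
    np.array([[1, 0], [1, 0], [0, 1]]),  # output 1
]

# Taking column 0 of each output and transposing yields the
# (n_samples, n_outputs) matrix expected for a multilabel y_pred.
y_pred = np.array([preds[:, 0] for preds in preds_per_output]).T
assert y_pred.shape == (3, 2)
assert y_pred.tolist() == [[0, 1], [1, 1], [0, 0]]
```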
OK, merging then. The Circle CI fail is unrelated.
Reference Issues/PRs
Follow up #20424 and close #17979
Thank you @nicoperetti, @nestornav, @jmloyola and @reshamas for the original PR!
What does this implement/fix? Explain your changes.
Enable `multilabel-indicator` targets for `StackingClassifier` by:
- Dropping one of the `predict_proba` columns, which are collinear in the `multilabel-indicator` context for models like `KNeighborsClassifier` and `RandomForestClassifier`, contrary to `MLPClassifier` that directly outputs an array of shape `(n_samples, n_classes)`.
- Setting `classes_` during `fit` in the `multilabel-indicator` case.
- Fitting a `LabelEncoder` for each column of `y` during `fit`, then decoding the target in `predict`.
- Testing `KNeighborsClassifier`, `MLPClassifier` and `DummyClassifier` as base estimators with parametrized `passthrough` and `stack_method`.

Any other comments?
Should we go further in the tests by including more checks, like sparsity? @jjerphan, @glemaitre