RFC SLEP006: allow users to enable a "strict" mode in metadata routing #23920

adrinjalali · 2022-07-16T11:36:29Z

ref: https://github.com/scikit-learn/scikit-learn/pull/22986/files#r862344847

Along the way we've had discussions on whether we should raise if a metadata is requested but not provided.

Mirroring the existing behavior, SLEP006 and the implementation won't raise in the following case:

pipeline = make_pipeline(LogisticRegression().set_fit_request(sample_weight=True)).fit(X, y)

Should we allow a strict mode, either globally or on the estimator level, which would make the above code raise?
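To make the question concrete, here is a minimal toy model of the two behaviours. It is a sketch only: `route_metadata` and its `strict` flag are hypothetical names invented for illustration, not part of scikit-learn or of the SLEP006 implementation.

```python
# Toy model of metadata routing; none of these names are scikit-learn API.

def route_metadata(requests, provided, strict=False):
    """Return the metadata that would be forwarded to a sub-estimator's fit.

    requests: dict mapping metadata name -> bool (True means "requested").
    provided: dict of metadata actually passed by the caller.
    strict:   hypothetical flag; when True, requested-but-missing raises.
    """
    routed = {}
    for name, requested in requests.items():
        if not requested:
            continue
        if name in provided:
            routed[name] = provided[name]
        elif strict:
            raise ValueError(f"{name!r} was requested but not provided")
    return routed


requests = {"sample_weight": True}

# Non-strict, as described above: nothing is routed and nothing is raised.
lenient = route_metadata(requests, provided={})

# Strict: the same call raises instead of silently routing nothing.
try:
    route_metadata(requests, provided={}, strict=True)
    strict_raised = False
except ValueError:
    strict_raised = True
```

In this model the only difference between the two modes is what happens when a requested key is missing. When the metadata is provided, both modes route it identically.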

thomasjpfan · 2022-07-21T17:46:27Z

I would prefer the final API to raise an error. In your example, LogisticRegression is configured to request sample_weight, but it is not provided one. If we do nothing, then the following two pipelines have the same behavior:

pipeline = make_pipeline(LogisticRegression().set_fit_request(sample_weight=True)).fit(X, y)

pipeline = make_pipeline(LogisticRegression()).fit(X, y)

That feels weird to me.

Do you think the final API should be strict by default?

adrinjalali · 2022-07-22T11:20:18Z

I don't mind having it strict by default, but I might be missing some edge cases. Would you have any objection to a default strict mode, @jnothman?

jnothman · 2022-07-24T12:51:37Z

Basically, by not being strict we allow someone to provide us with a canned composite estimator in which sample_weight is supported but not required. If an estimator truly requires sample_weight it will raise in any case; but as long as we're not being strict, we allow for default behaviours in reusable composite estimators.

By saying set_fit_request(sample_weight=True) I don't think the user is asking you to validate their input differently, but rather that if sample_weight is provided it should go to the requested place. I acknowledge that these semantics are somewhat confusing, but I'm not sure how to fix them.

adrinjalali · 2022-07-25T09:23:07Z

Some options out there:

Rename RequestType.REQUESTED to RequestType.EXPECTED and add RequestType.ACCEPTED. The current True value can be the proxy for either of them.
Introduce a "strict" mode, whose default can be either True or False.
Rename set_fit_request to set_fit_metadata_acceptance (or similar).
Do nothing for now.

I'm personally happy with any of the above solutions.
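The first option could look roughly like the sketch below. The enum here is a hypothetical stand-in written for illustration (including the `UNREQUESTED` member and the `should_route` helper); it is not the actual `RequestType` from the implementation.

```python
from enum import Enum


class RequestType(Enum):
    # Hypothetical split of the current "requested" state into two values.
    UNREQUESTED = "unrequested"  # never routed
    ACCEPTED = "accepted"        # routed if provided, ignored if missing
    EXPECTED = "expected"        # routed if provided, raises if missing


def should_route(name, request, provided):
    """Return True if `name` should be forwarded to the consumer.

    Raises ValueError only when the metadata is EXPECTED but missing.
    """
    if request is RequestType.UNREQUESTED:
        return False
    if name in provided:
        return True
    if request is RequestType.EXPECTED:
        raise ValueError(f"{name!r} is expected but was not provided")
    return False


accepted_missing = should_route("sample_weight", RequestType.ACCEPTED, {})

try:
    should_route("sample_weight", RequestType.EXPECTED, {})
    expected_raised = False
except ValueError:
    expected_raised = True
```

Under this sketch, ACCEPTED matches the optional semantics jnothman describes, while EXPECTED gives the raising behaviour thomasjpfan prefers.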

jnothman · 2022-08-01T13:24:06Z

set_fit_receives? I don't see a lot of risk (albeit some user confusion) from these staying as optional.

eddiebergman · 2024-01-31T16:37:03Z

Quoting @thomasjpfan:

For me, I prefer the final API to raise an error. In your example, LogisticRegression is configured to have a sample_weight, but it is not provided one. If we do nothing, then the following two pipelines have the same behavior:

pipeline = make_pipeline(LogisticRegression().set_fit_request(sample_weight=True)).fit(X, y)

pipeline = make_pipeline(LogisticRegression()).fit(X, y)

That feels weird to me.

Do you think the final API should be strict by default?

I agree that this example is weird, and to make the point even sharper, the two pipelines below also behave identically.

pipeline = make_pipeline(LogisticRegression().set_fit_request(sample_weight=True)).fit(X, y)

pipeline = make_pipeline(LogisticRegression().set_fit_request(sample_weight=False)).fit(X, y)

I personally would prefer some explicit parameter that will raise. To illustrate, suppose someone is investigating the hypothesis "does sample_weight help my estimator?"

Research code is often hasty, and you might expect that setting sample_weight=True means the estimator will use sample_weight and that sample_weight=False means it will not. Then, in the process of writing the evaluation code, a mistake is made and sample_weight is never passed in.
At the end you look at your numbers and conclude it does not help your estimator.

Of course this is an example of user error, but sklearn is very good at validating correctness and the expectations for this new feature would be no different. From experience, experimentation code in machine learning research is riddled with these subtle bugs and sklearn is great at informing users about them.

As for the exact semantics of specifying "strict" or True or some enum value, I don't think it matters too much.

One API recommendation, to limit the number of new methods or parameters, is to simply introduce a new value that determines the behaviour, namely .set_fit_request(sample_weight="strict") or .set_fit_request(sample_weight="required"). I don't think this would conflict with anything, but there might be existing code that does something like:

if param["sample_weight"]:  # Relies on implicit "truthiness"
    ...  # do something with the weights

The above proposal would not work in this case, as bool("strict") == True.
This can be fixed internally in sklearn, but depending on the current adoption of metadata routing, it could introduce a silent bug in existing code.
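The truthiness pitfall is easy to reproduce in plain Python. The helper names below are hypothetical and only illustrate the difference between an implicit truthiness check and explicit value checks.

```python
def uses_metadata(request):
    # Mirrors the pattern above: relies on implicit truthiness.
    return bool(request)


def describe_request(request):
    # Explicit checks keep True and the hypothetical "strict" value apart.
    if request is True:
        return "optional"
    if request == "strict":
        return "required"
    return "off"


# Both values look the same to a truthiness check...
truthy_true = uses_metadata(True)
truthy_strict = uses_metadata("strict")

# ...but an explicit check can tell them apart.
kind_true = describe_request(True)
kind_strict = describe_request("strict")
```

Code written in the first style would keep running if "strict" were introduced, but it would silently treat "strict" exactly like True.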

EDIT: The "strict" proposal above would technically conflict with the semantics defined in 1.1.4. Advanced: Different scoring and fitting weights, where a string value is used as an alias. It would be odd for someone to use "strict" as an alias, but it means that simply checking for a str would not work internally. I don't have any other ideas beyond those already mentioned in that case.
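A toy illustration of that conflict, assuming a string request is interpreted as an alias, that is, as the key under which the caller passes the metadata. The `resolve` function is a hypothetical sketch, not scikit-learn's resolution logic.

```python
def resolve(name, request, provided):
    """Toy resolution of a single metadata request.

    A str request is read as an alias: the key to look up in `provided`.
    True means "look up under the metadata's own name".
    """
    if isinstance(request, str):
        return provided.get(request)
    if request is True:
        return provided.get(name)
    return None


provided = {"sample_weight": [1.0, 2.0], "strict": [0.5, 0.5]}

# With alias semantics, "strict" is just another key name, so a string
# check alone cannot tell "strict mode" from "an alias called strict".
aliased = resolve("sample_weight", "strict", provided)
plain = resolve("sample_weight", True, provided)
```

Here `aliased` picks up the values passed under the key "strict", which is why a reserved string value would be ambiguous under alias semantics.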

github-actions bot added the Needs Triage Issue requires triage label Jul 16, 2022

This was referenced Jul 16, 2022

SLEP006 - Metadata Routing task list #22893

Open

FEAT multioutput routes metadata #22986

Merged

adrinjalali changed the title ~~RFC allow users to enable a "strict" mode in metadata routing~~ RFC SLEP006: allow users to enable a "strict" mode in metadata routing Jul 16, 2022

thomasjpfan added API RFC and removed Needs Triage Issue requires triage labels Jul 21, 2022

adrinjalali mentioned this issue Sep 7, 2023

ENH Add metadata routing to OneVsRestClassifier, OneVsOneClassifier and OutputCodeClassifier #27308

Merged

adrinjalali mentioned this issue Jan 31, 2024

[Question, Documentation] Metadata Routing, indicate metadata is required by a method #28324

Closed

glemaitre added this to Metadata routing May 16, 2024
