RFC SLEP006: allow users to enable a "strict" mode in metadata routing #23920


Open
Tracked by #22893
adrinjalali opened this issue Jul 16, 2022 · 6 comments


@adrinjalali
Member

ref: https://github.com/scikit-learn/scikit-learn/pull/22986/files#r862344847

Along the way we've had discussions on whether we should raise if metadata is requested but not provided.

Mirroring the existing behavior, SLEP006 and the implementation won't raise in the following case:

pipeline = make_pipeline(LogisticRegression().set_fit_request(sample_weight=True)).fit(X, y)

Should we allow a strict mode, either globally or on the estimator level, which would make the above code raise?
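
To make the question concrete, here is a minimal, library-independent sketch of what a strict check could do. The function name `validate_metadata` and its `strict` flag are hypothetical and not part of scikit-learn; requests are modelled as a plain dict mapping metadata names to True or False.

```python
def validate_metadata(requests, provided, strict=False):
    """Return the metadata to forward; in strict mode, raise if any is missing.

    `requests` maps metadata names to True (requested) or False (not requested).
    `provided` maps metadata names to the values passed by the caller.
    Both the function and the `strict` flag are hypothetical illustrations.
    """
    requested = {name for name, wanted in requests.items() if wanted is True}
    missing = sorted(requested - set(provided))
    if strict and missing:
        raise ValueError(f"Requested metadata not provided: {missing}")
    return {name: provided[name] for name in requested if name in provided}


requests = {"sample_weight": True}

# Non-strict: the missing sample_weight is silently ignored.
lenient_result = validate_metadata(requests, provided={})

# Strict: the same call raises.
try:
    validate_metadata(requests, provided={}, strict=True)
    strict_raised = False
except ValueError:
    strict_raised = True
```

Whether such a flag would live globally or on each estimator is exactly the open question here; the sketch only shows the difference in behaviour.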

@github-actions github-actions bot added the Needs Triage Issue requires triage label Jul 16, 2022
@adrinjalali adrinjalali changed the title RFC allow users to enable a "strict" mode in metadata routing RFC SLEP006: allow users to enable a "strict" mode in metadata routing Jul 16, 2022
@thomasjpfan thomasjpfan added API RFC and removed Needs Triage Issue requires triage labels Jul 21, 2022
@thomasjpfan
Member

For me, I prefer the final API to raise an error. In your example, LogisticRegression is configured to request sample_weight, but it is not provided one. If we do nothing, then the following two pipelines have the same behavior:

pipeline = make_pipeline(LogisticRegression().set_fit_request(sample_weight=True)).fit(X, y)

pipeline = make_pipeline(LogisticRegression()).fit(X, y)

That feels weird to me.

Do you think the final API should be strict by default?

@adrinjalali
Member Author

I don't mind having it strict by default, but I might be missing some edge cases. Would you have something in mind to object to strict mode being the default, @jnothman?

@jnothman
Member

Basically, by not being strict we're allowing someone to provide us with a canned composite estimator in which sample_weight is supported but not required. If an estimator truly requires sample_weight it will raise in any case; but as long as we're not strict, we allow for default behaviours in reusable composite estimators.

By saying set_fit_request(sample_weight=True) I don't think the user is asking you to validate their input differently, but rather that if sample_weight is provided it should go to the requested place. I acknowledge that these semantics are somewhat confusing, but I'm not sure how to fix them.
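
The "route it if it is provided" reading can be sketched in plain Python as follows. This is a simplified model under stated assumptions, not scikit-learn's router: `route_fit_params` and the step names are hypothetical, and requests are plain booleans.

```python
def route_fit_params(step_requests, fit_params):
    """Split fit_params among steps according to each step's requests.

    `step_requests` maps a step name to {metadata_name: bool}.
    Metadata that is requested but absent is skipped, not treated as an error.
    """
    routed = {}
    for step, requests in step_requests.items():
        routed[step] = {
            name: fit_params[name]
            for name, wanted in requests.items()
            if wanted and name in fit_params
        }
    return routed


step_requests = {
    "scaler": {"sample_weight": False},
    "clf": {"sample_weight": True},
}

# When sample_weight is passed, only the requesting step receives it.
with_weights = route_fit_params(step_requests, {"sample_weight": [0.5, 2.0]})

# When it is not passed, every step falls back to its default behaviour.
without_weights = route_fit_params(step_requests, {})
```

Under this reading a request controls where metadata goes, not whether the caller must supply it, which is why a reusable composite estimator keeps working without weights.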

@adrinjalali
Member Author

Some options out there:

  • Rename RequestType.REQUESTED to RequestType.EXPECTED and add RequestType.ACCEPTED. The current True value can be the proxy for either of them.
  • Introduce a "strict" mode, whose default can be either True or False.
  • Rename set_fit_request to set_fit_metadata_acceptance (or similar).
  • Do nothing for now.

I'm personally happy with any of the above solutions.
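
A rough sketch of how the first option could behave, assuming an enum with the two proposed members. Every name here is hypothetical (`UNREQUESTED`, `normalize`, `resolve`, and the string values are illustration only), and the `true_means` argument stands in for the decision about which member True should be a proxy for.

```python
from enum import Enum


class RequestType(Enum):
    """Hypothetical request values for the first option in the list above."""

    UNREQUESTED = "unrequested"  # never route this metadata to the estimator
    ACCEPTED = "accepted"        # route it if provided; stay silent otherwise
    EXPECTED = "expected"        # route it if provided; raise otherwise


def normalize(value, true_means=RequestType.ACCEPTED):
    """Map a user-facing value (bool or RequestType) to a RequestType."""
    if isinstance(value, RequestType):
        return value
    if value is True:
        return true_means
    if value is False:
        return RequestType.UNREQUESTED
    raise TypeError(f"Unsupported request value: {value!r}")


def resolve(request, name, provided):
    """Return the metadata to forward for `name` under the given request."""
    if request is RequestType.UNREQUESTED:
        return {}
    if name in provided:
        return {name: provided[name]}
    if request is RequestType.EXPECTED:
        raise ValueError(f"{name} is expected but was not provided")
    return {}
```

With `true_means=RequestType.ACCEPTED` the current lenient behaviour is kept; switching it to `RequestType.EXPECTED` would make True strict.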

@jnothman
Member

jnothman commented Aug 1, 2022

set_fit_receives? I don't see a lot of risk (albeit some user confusion) from these staying as optional.

@eddiebergman
Contributor

eddiebergman commented Jan 31, 2024

> For me, I prefer the final API to raise an error. In your example, LogisticRegression is configured to request sample_weight, but it is not provided one. If we do nothing, then the following two pipelines have the same behavior:
>
> pipeline = make_pipeline(LogisticRegression().set_fit_request(sample_weight=True)).fit(X, y)
>
> pipeline = make_pipeline(LogisticRegression()).fit(X, y)
>
> That feels weird to me.
>
> Do you think the final API should be strict by default?

I agree that this example is weird, and to make it even more pointed, the two pipelines below behave identically.

pipeline = make_pipeline(LogisticRegression().set_fit_request(sample_weight=True)).fit(X, y)

pipeline = make_pipeline(LogisticRegression().set_fit_request(sample_weight=False)).fit(X, y)

I personally would prefer some explicit parameter that will raise. To illustrate, suppose someone is investigating the hypothesis "does sample_weight help my estimator?"

Research code is often written hastily, and one might expect that setting sample_weight=True means that sample_weight will be used and that sample_weight=False means it will not. Then, in the process of writing the evaluation code, a mistake is made and sample_weight is never passed in.
At the end I look at my numbers and conclude that it does not help my estimator.

Of course this is an example of user error, but sklearn is very good at validating correctness and the expectations for this new feature would be no different. From experience, experimentation code in machine learning research is riddled with these subtle bugs and sklearn is great at informing users about them.

As for the exact semantics of specifying "strict" or True or some enum value, I don't think it matters too much.


One API recommendation, to limit new methods or new parameters, is to simply introduce a new value that determines the behaviour, namely .set_fit_request(sample_weight="strict") or .set_fit_request(sample_weight="required"). I don't think this would conflict with anything, but there might be existing code that does something like:

if param["sample_weight"]:  # Relies on implicit "truthiness"
    # do...

The above proposal would not work in this case, as bool("strict") == True.
This can be fixed internally in sklearn, but depending on the current adoption of metadata routing, it could introduce a silent bug in existing code.
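
The pitfall is easy to reproduce in plain Python. The two helper functions below are hypothetical and only contrast a truthiness check with an explicit one:

```python
def uses_weights_truthy(request):
    # Pattern from the snippet above: relies on implicit truthiness,
    # so any non-empty string is treated the same as True.
    return bool(request)


def uses_weights_explicit(request):
    # Safer: accept only the exact values this code understands
    # and fail loudly on anything else.
    if request is True:
        return True
    if request is False or request is None:
        return False
    raise ValueError(f"Unsupported request value: {request!r}")


truthy_result = uses_weights_truthy("strict")

try:
    uses_weights_explicit("strict")
    explicit_raised = False
except ValueError:
    explicit_raised = True
```

Code written in the first style would silently treat "strict" as an ordinary request, while code in the second style would surface the new value as an error.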

EDIT: The "strict" value proposed above would technically conflict with the semantics defined in 1.1.4. Advanced: Different scoring and fitting weights, where a string value is interpreted as an alias for the metadata name. It would be odd for someone to use "strict" as an alias, but it means that simply checking for a str would not work internally. In that case, I don't have any other ideas beyond those already mentioned.
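
The clash can be illustrated with a small model in which any string is read as an alias. The `interpret` function is hypothetical, and the `STRICT` sentinel is an assumption added purely for contrast; it was not proposed in this thread.

```python
class _Strict:
    """Hypothetical sentinel type; a unique object cannot collide with an alias."""

    def __repr__(self):
        return "STRICT"


STRICT = _Strict()


def interpret(name, request):
    """Classify a request value in a model where strings are aliases."""
    if request is STRICT:
        return ("strict", name)
    if isinstance(request, str):
        # Any string, including "strict", is read as an alias name.
        return ("alias", request)
    if request is True:
        return ("request", name)
    return ("ignore", name)


as_string = interpret("sample_weight", "strict")
as_sentinel = interpret("sample_weight", STRICT)
```

In this model the string "strict" is indistinguishable from an alias of that name, whereas a dedicated object is unambiguous, at the cost of one more public name to import.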
