
SLEP006: globally setting request values #26050

adrinjalali opened this issue Apr 1, 2023 · 4 comments

@adrinjalali
Member

One of the issues raised in #25776 and #23928 is about making code such as the following work under SLEP6:

est = AdaBoostClassifier(LogisticRegression())
est.fit(X, y, sample_weight=sw)

or

GridSearchCV(
	LogisticRegression(), 
	scorer=a_scorer_supporting_sample_weight, ...
).fit(..., sample_weight=sw)

without having to change the code, write too much boilerplate, or explicitly set request values for sample_weight for these usual use cases.

In order to allow the above pattern, the proposal is to let users set request values globally. These values can be set either as a configuration or via dedicated methods. The user code would then look like:

sklearn.set_fit_request(sample_weight=True)
sklearn.set_score_request(sample_weight=True)

GridSearchCV(
	LogisticRegression(), 
	scorer=a_scorer_supporting_sample_weight, ...
).fit(..., sample_weight=sw)

What the above code does is set the default value to REQUESTED instead of ERROR_IF_PASSED for sample_weight, on fit and score methods / scorers everywhere.

API

A generic way to achieve this would be to set global default request values, on any method, for any metadata, which can generalize to groups and other metadata present in third party libraries.

Alternatively, we could allow this only for sample_weight (I'd rather not do this), with something like:

sklearn.set_config(fit_requests_sample_weight=True)

or if we don't want to distinguish between methods:

sklearn.set_config(sample_weight_request=True)

My proposal would be:

sklearn.set_{method}_request(metadata=value)

which follows the API on the estimator level, except that here we're setting it on the library scope.
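To make the proposal concrete, here is a minimal, self-contained sketch of what a library-scope request registry could look like. This is plain Python, not actual scikit-learn code; all names here (`_global_requests`, `resolve_request`) are hypothetical, and the real implementation would hook into the existing per-estimator request machinery:

```python
# Hypothetical sketch of library-level default request values.
# _global_requests maps (method, metadata_name) -> requested value;
# estimators would fall back to it when no per-estimator value is set.
_global_requests = {}

def set_fit_request(**metadata):
    """Set library-wide default request values for ``fit``."""
    for name, value in metadata.items():
        _global_requests[("fit", name)] = value

def set_score_request(**metadata):
    """Set library-wide default request values for ``score``."""
    for name, value in metadata.items():
        _global_requests[("score", name)] = value

def resolve_request(method, name, local=None):
    """A per-estimator value wins; otherwise fall back to the global
    default, and finally to ERROR_IF_PASSED."""
    if local is not None:
        return local
    return _global_requests.get((method, name), "ERROR_IF_PASSED")

set_fit_request(sample_weight=True)
print(resolve_request("fit", "sample_weight"))    # True
print(resolve_request("score", "sample_weight"))  # ERROR_IF_PASSED
# An estimator-level setting still overrides the global default:
print(resolve_request("fit", "sample_weight", local=False))  # False
```

The key design point is the precedence order: estimator-level settings beat the global default, which beats the built-in ERROR_IF_PASSED.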

Context Manager

In addition to the above API, we can provide a context manager for the same purpose. If we expose this through a config option, we get it automatically via config_context. If we follow another API, we can provide the corresponding context managers.

Effect on SLEP6's examples

Here we look at how the examples in the SLEP would change.

Nested Grouped Cross Validation

weighted_acc = make_scorer(accuracy_score).set_score_request(sample_weight=True)
log_reg = LogisticRegressionCV(cv=GroupKFold(), scoring=weighted_acc).set_fit_request(
    sample_weight=True
)
cv_results = cross_validate(
    log_reg,
    X,
    y,
    cv=GroupKFold(),
    metadata={"sample_weight": my_weights, "groups": my_groups},
    scoring=weighted_acc,
)

becomes:

sklearn.set_fit_request(sample_weight=True)
sklearn.set_score_request(sample_weight=True)

weighted_acc = make_scorer(accuracy_score)
log_reg = LogisticRegressionCV(cv=GroupKFold(), scoring=weighted_acc)
cv_results = cross_validate(
    log_reg,
    X,
    y,
    cv=GroupKFold(),
    metadata={"sample_weight": my_weights, "groups": my_groups},
    scoring=weighted_acc,
)

And users can still override the global default value at the estimator level. For example, this:

weighted_acc = make_scorer(accuracy_score).set_score_request(sample_weight=True)
log_reg = LogisticRegressionCV(cv=group_cv, scoring=weighted_acc).set_fit_request(
    sample_weight=False
)
cross_validate(
    log_reg,
    X,
    y,
    cv=GroupKFold(),
    metadata={"sample_weight": weights, "groups": groups},
    scoring=weighted_acc,
)

becomes

sklearn.set_fit_request(sample_weight=True)
sklearn.set_score_request(sample_weight=True)

weighted_acc = make_scorer(accuracy_score)
log_reg = LogisticRegressionCV(cv=group_cv, scoring=weighted_acc).set_fit_request(
    sample_weight=False
)
cross_validate(
    log_reg,
    X,
    y,
    cv=GroupKFold(),
    metadata={"sample_weight": weights, "groups": groups},
    scoring=weighted_acc,
)

Unweighted Feature selection

Here, no request value is set for SelectKBest since it doesn't support sample_weight:

weighted_acc = make_scorer(accuracy_score).set_score_request(sample_weight=True)
log_reg = LogisticRegressionCV(cv=GroupKFold(), scoring=weighted_acc).set_fit_request(
    sample_weight=True
)
sel = SelectKBest(k=2)
pipe = make_pipeline(sel, log_reg)
pipe.fit(X, y, sample_weight=weights, groups=groups)

it becomes:

sklearn.set_fit_request(sample_weight=True)
sklearn.set_score_request(sample_weight=True)

weighted_acc = make_scorer(accuracy_score)
log_reg = LogisticRegressionCV(cv=GroupKFold(), scoring=weighted_acc)
sel = SelectKBest(k=2)
pipe = make_pipeline(sel, log_reg)
pipe.fit(X, y, sample_weight=weights, groups=groups)

More advanced use cases with aliased routing

Aliased routing is not supported by this mechanism at the moment; if the user wants to pass different sample weights to different objects, they'll need to use the verbose per-estimator API.
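To make "aliased routing" concrete, here is a toy illustration of the idea (hypothetical names, not scikit-learn code): a request value of True routes the metadata under its canonical name, while a string value acts as an alias, so different consumers can receive different values from one metadata dict:

```python
def route_metadata(requests, metadata):
    """Toy resolver: map each requested parameter to the metadata it
    should receive, honoring string aliases."""
    routed = {}
    for param, request in requests.items():
        if request is True:
            # Route the canonical key, if provided.
            if param in metadata:
                routed[param] = metadata[param]
        elif isinstance(request, str):
            # The request value names the key to route from.
            routed[param] = metadata[request]
    return routed

# Two consumers receive different weights from the same metadata dict:
metadata = {"clf_weights": [1, 2], "scorer_weights": [3, 4]}
clf_kwargs = route_metadata({"sample_weight": "clf_weights"}, metadata)
scorer_kwargs = route_metadata({"sample_weight": "scorer_weights"}, metadata)
print(clf_kwargs)     # {'sample_weight': [1, 2]}
print(scorer_kwargs)  # {'sample_weight': [3, 4]}
```

A single global default value can't express this, which is why aliasing stays in the per-estimator API.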

Open Questions

  • What exactly should the API look like? Do we like the proposal here?
  • Do we want to also implement the context manager?

sample_weight="auto"

We also talked about good default request values for sample_weight specifically (#25776 (comment)), to allow users to follow our recommendations. This would be enabled by doing something like:

sklearn.set_config(auto_sample_weight_routing=True)

or more explicitly

sklearn.set_{method}_request(sample_weight='auto')

This would enable recommended routing, whatever we decide that to be. This is NOT a blocker for releasing SLEP6 and can be implemented after the release; the recommendations themselves are subject to change, and we tell users as much.

There will be a separate issue to discuss the default values and what the end-user API should look like. It is only mentioned here to give context on what we plan to do.

cc @scikit-learn/core-devs

@betatim
Member

betatim commented Apr 3, 2023

What is the thinking behind requiring users to add code like

sklearn.set_fit_request(sample_weight=True)
sklearn.set_score_request(sample_weight=True)

to enable this, compared to it being the default with an option to turn it off via:

sklearn.set_fit_request(sample_weight=False)
# and/or
sklearn.set_score_request(sample_weight=False)

The reason I am asking is that I'd expect the case of "I have weights and want them to just work" to be more frequent, and hence default-on instead of default-off.

@adrinjalali
Member Author

Having sample_weight requested everywhere by default results in wrong practices. For instance, CalibratedClassifierCV can both consume and route sample_weight; the same is true for Bagging.

We will work on a sane default value for requests, which we'll develop as part of the sample_weight="auto" routing strategy; once that's done, we can discuss whether we want it on by default or not. But since that strategy is subject to change, we might not want to have it as the default.

@betatim
Member

betatim commented Apr 3, 2023

What are these wrong practices? Why would I not want CalibratedClassifierCV or bagging classifiers to route sample weights if I provide them?

@adrinjalali
Member Author

You probably don't want them to both consume and route them. And by default they'd do both with this setting.
