Base sample-prop implementation and docs #21284
Conversation
To skip `doc/metadata_routing.rst`, you can add the following to `doc/conftest.py`:
def pytest_runtest_setup(item):
    ...
    # Skip metadata routing because it is not fully implemented yet
    if fname.endswith("metadata_routing.rst"):
        raise SkipTest(
            "Skipping doctest for metadata_routing.rst because it "
            "is not fully implemented yet"
        )
First pass through the rst. Feel free to make adjustments.
>>> n_samples, n_features = 100, 4
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.randint(0, 2, size=n_samples)
>>> my_groups = np.random.randint(0, 10, size=n_samples)
>>> my_weights = np.random.rand(n_samples)
>>> my_other_weights = np.random.rand(n_samples)
For reproducibility, we can define a `RandomState` object.
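A minimal sketch of that suggestion, reusing the example's variable names (the seed value is an arbitrary assumption):

```python
import numpy as np

# A seeded RandomState makes the example reproducible: re-creating the
# generator with the same seed yields exactly the same draws.
rng = np.random.RandomState(42)
n_samples, n_features = 100, 4
X = rng.rand(n_samples, n_features)
y = rng.randint(0, 2, size=n_samples)

# Same seed, same sequence of values.
rng2 = np.random.RandomState(42)
X2 = rng2.rand(n_samples, n_features)
assert (X == X2).all()
```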
We will need to rework this rst file once things are implemented anyway; this example doesn't run the code for now.
>>> lr = LogisticRegressionCV(
...     cv=GroupKFold(), scoring=weighted_acc,
... ).fit_requests(sample_weight=True)
>>> sel = SelectKBest(k=2)
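For readers unfamiliar with the proposed API, here is a rough pure-Python sketch of what a `fit_requests`-style method does: it records which metadata the estimator requests for `fit` and returns `self` so calls can be chained. This is a simplified illustration of the mechanism, not the PR's actual implementation:

```python
class RequestingEstimator:
    """Toy estimator that records which metadata it requests for ``fit``."""

    def __init__(self):
        # Maps metadata name -> request value (True, False, or an alias).
        self._fit_requests = {}

    def fit_requests(self, **kwargs):
        # Record the request and return self so calls can be chained,
        # e.g. ``RequestingEstimator().fit_requests(sample_weight=True)``.
        self._fit_requests.update(kwargs)
        return self

    def get_metadata_request(self):
        return {"fit": dict(self._fit_requests)}


est = RequestingEstimator().fit_requests(sample_weight=True)
print(est.get_metadata_request())  # → {'fit': {'sample_weight': True}}
```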
Would this workflow break if `SelectKBest` were to accept weights in the future? In other words, let's say in 1.3 `SelectKBest` accepts weights; would we need to call `fit_requests(sample_weight=False)` to have the same behavior?
I guess we would need a deprecation cycle migrating from `RequestType.UNREQUESTED` to `RequestType.ERROR_IF_PASSED` as the default.
Yes, all you say is true, and the deprecation cycle is not hard to implement, since we have the mechanism of having `UNREQUESTED` as the default.
We can also implement a `RequestType.DEPRECATED` or something, to make the deprecation easier if necessary.
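A minimal sketch of how such request states and a deprecation transition could look. Only `UNREQUESTED`, `ERROR_IF_PASSED`, and `DEPRECATED` are named in the discussion; the concrete enum values and the `resolve_request` helper are illustrative assumptions:

```python
import warnings
from enum import Enum


class RequestType(Enum):
    UNREQUESTED = False          # metadata is silently ignored
    REQUESTED = True             # metadata is routed to the estimator
    ERROR_IF_PASSED = None       # passing the metadata raises an error
    DEPRECATED = "$DEPRECATED$"  # behaves like UNREQUESTED, but warns


def resolve_request(request, metadata_passed):
    """Decide what to do for one metadata key given its request state."""
    if request is RequestType.DEPRECATED and metadata_passed:
        warnings.warn(
            "Relying on the default request is deprecated; call "
            "fit_requests() explicitly.",
            FutureWarning,
        )
        return "ignore"
    if request is RequestType.ERROR_IF_PASSED and metadata_passed:
        raise ValueError("metadata passed but not explicitly requested")
    return "route" if request is RequestType.REQUESTED else "ignore"
```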
Hmmm... The case of a metaestimator adding support for a prop that is requested by its child is indeed a tricky one. I can't yet see a way to make this generally backwards compatible within the SLEP006 proposal. This makes me sad.
Indeed, generally a metaestimator supporting the same prop name as one of its children is tricky. I.e. if the metaestimator supports metadata `x` and its child requests metadata `x`, the metaestimator should only work where either:

- the child's request aliases `x` to another name without such a clash; or
- the child's request and the metaestimator's request for `x` imply being passed the same metadata.

In other cases, this must raise an error. This is something, I'm pretty sure, we've not yet covered in SLEP006 (and it's a pretty messy and intricate consequence of having the caller responsible for delivering metadata in accordance with the request).
Deprecation would be pretty tricky as far as I can tell.
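A rough sketch of the clash check described above. The function name and the dict representation (metadata name mapped to `True` for "requested under its own name" or to a string alias) are illustrative assumptions, not the PR's API:

```python
def check_prop_clash(meta_request, child_request, prop="x"):
    """Validate a shared metadata name between a metaestimator and its child.

    Each request maps a metadata name to True (requested under its own name)
    or to a string alias under which the caller should pass the value.
    """
    child = child_request.get(prop)
    meta = meta_request.get(prop)
    if child is None or meta is None:
        return  # no clash: only one of the two uses the name
    if isinstance(child, str) and child != prop:
        return  # child aliased the prop to another name, avoiding the clash
    if child is True and meta is True:
        return  # both requests imply being passed the same metadata
    raise ValueError(
        f"Conflicting requests for metadata {prop!r} between the "
        "metaestimator and its child estimator."
    )
```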
It is also the obligation of a user to know which estimator supports which property. This could be confusing.
@lorentzenchr Are you referring to a case where the user does not already need to know which estimator supports which property? I'm not sure what burden you are referring to.
I just realised my response didn't really relate to @thomasjpfan's question, which wasn't about a metaestimator adding support. Anyway, I've opened this issue as scikit-learn/enhancement_proposals#58

Also pinging @lorentzenchr and @agramfort, since you gave feedback on the SLEP.
Just finished reading the example. I would say that it would be better not to show it to our end users but to put it in the "developer" guide.
examples/metadata_routing.py
Outdated
"""
# %%

import numpy as np
I usually like to delay the imports to the cell of their first usage.
The issue with that is that the imports are then only visible in the first cell using the module/method/class, and they are not repeated in later cells.
examples/metadata_routing.py
Outdated
# Please note that as long as the above estimator is not used in another
# meta-estimator, the user does not need to set any requests for the metadata.
# A simple usage of the above estimator would work as expected. Remember that
# ``{foo, bar}_is_none`` are for testing/demonstration purposes and don't have
I did not finish reading the example yet, but would it be OK to just print some information or directly raise an error, without adding `foo_is_none` and `bar_is_none` in the constructor?
The thing is, they're sometimes used and sometimes not; if I remove them, then I'd need a separate class for each combination.
examples/metadata_routing.py
Outdated
def fit(self, X, y, foo=None):
    if (foo is None) != self.foo_is_none:
        raise ValueError("foo's value and foo_is_none disagree!")
    # all classifiers need to expose a classes_ attribute once they're fit.
Would it be simpler to use a regressor, where we don't need the extra `classes_` attribute?
We tend to simplify our examples and therefore they usually don't show the more common use cases. I intentionally left this one as a classifier, just to have an example of how a classifier should be implemented.
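A stripped-down sketch of such a demo classifier: `foo_is_none` comes from the example under review, while the rest (the dummy `predict`, the `sorted(set(y))` trick for `classes_`) is purely illustrative:

```python
class ExampleClassifier:
    """Toy classifier consuming an optional ``foo`` metadata in ``fit``."""

    def __init__(self, foo_is_none=True):
        # Used only to assert, in tests, whether ``foo`` was routed here.
        self.foo_is_none = foo_is_none

    def fit(self, X, y, foo=None):
        if (foo is None) != self.foo_is_none:
            raise ValueError("foo's value and foo_is_none disagree!")
        # All classifiers need to expose a classes_ attribute once fit.
        self.classes_ = sorted(set(y))
        return self

    def predict(self, X):
        # Dummy prediction: always the first class.
        return [self.classes_[0] for _ in X]


clf = ExampleClassifier(foo_is_none=False).fit([[1], [2]], [0, 1], foo="meta")
print(clf.classes_)  # → [0, 1]
```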
If :class:`~linear_model.LogisticRegressionCV` did not call ``fit_requests``, :func:`~model_selection.cross_validate` would raise an error, because weights are passed in but :class:`~linear_model.LogisticRegressionCV` was not configured to recognize them.
I guess this error would be raised no matter whether the scoring is weighted or unweighted?
Yes.

@jnothman, @thomasjpfan WDYT of b5c962c or a similar solution to deprecation?
What's unclear to me in that last change (b5c962c) – and sorry for the slow response – is what …

And while it does not seem to be the most major risk to be concerned about, it does make me think that maybe we've got it all wrong, and the only way to achieve consistency is by having the estimator be aware of its own request, and to handle its own aliasing and validation. The reason we didn't do this is because it would involve modifying every estimator accepting metadata, changing it to accept …
I'm not sure what you mean, @jnothman. The …

On the other point, it's true that a consumer is usually not concerned with what the request value for a metadata is, but in the case of a meta-estimator, I think it's slightly different, since the sub-estimator can also request the same thing, and if the meta-estimator goes from not consuming metadata x to consuming it, then there's a behavior change. Right now in sklearn we don't have a backward compatible way of doing it, and this proposal now gives us a way to handle the situation, albeit maybe not a very clean one.
It may not be perfect, but we (and probably mostly you) have thought about it for quite some time. You wrote down several alternatives, and this is the one that I found the cleanest of all the proposals.

As for consumers not validating their input: to me it's not about not having to touch all estimators, it's more about supporting third-party estimators w/o forcing them to do something to be supported in our new metadata routing mechanism. If we think consumers should validate and handle aliases, I'm happy to go down that path, but that means all third parties also need to do the same thing; which is not entirely a no-go for me, since I really think from their perspective it'd be just a single line of code, or a single function call. We can even give them a decorator to handle all of it, and they'd only need to use that decorator.
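A sketch of what such a consumer-side decorator could look like. Everything here (the decorator name, the request representation, the third-party class) is an illustrative assumption; nothing like this exists in the PR as-is:

```python
import functools


def validates_metadata(**requests):
    """Decorate a ``fit``-like method to validate and alias its metadata kwargs.

    ``requests`` maps parameter names to True (accept under the parameter's
    own name) or to a string alias under which the caller passes the value.
    """
    def decorator(fit):
        @functools.wraps(fit)
        def wrapper(self, X, y, **kwargs):
            resolved = {}
            for param, request in requests.items():
                # Pull the value from the alias if one was declared.
                key = request if isinstance(request, str) else param
                if key in kwargs:
                    resolved[param] = kwargs.pop(key)
            if kwargs:
                raise ValueError(f"Unrequested metadata passed: {sorted(kwargs)}")
            return fit(self, X, y, **resolved)
        return wrapper
    return decorator


class ThirdPartyEstimator:
    @validates_metadata(sample_weight="my_weights")
    def fit(self, X, y, sample_weight=None):
        self.sample_weight_ = sample_weight
        return self


est = ThirdPartyEstimator().fit([[1]], [0], my_weights=[2.0])
print(est.sample_weight_)  # → [2.0]
```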
I am posting a couple of comments that I have, since I could concentrate a bit more on the PR. Some comments are really general regarding the design, and they are probably due to some limitations that we discussed in the past.

I still have to understand exactly the `overwrite` and `mask` parameters.
args = {arg for arg, value in kwargs.items() if value is not None}
if not ignore_extras and args - set(self.requests.keys()):
    raise ValueError(
        "Metadata passed which is not understood: "
Could be more specific than "understood", I think.
We might want to add which metadata would be accepted for this particular method.
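For illustration, a sketch of an error message that also lists the metadata accepted by the method; the function signature and wording are made up for the sketch:

```python
def validate_metadata(requests, kwargs, method="fit"):
    """Raise with an informative message when unknown metadata is passed.

    ``requests`` is the set of metadata names the method accepts;
    ``kwargs`` is what the caller actually passed.
    """
    passed = {k for k, v in kwargs.items() if v is not None}
    extras = passed - set(requests)
    if extras:
        raise ValueError(
            f"Metadata passed to {method} which is not requested or "
            f"accepted: {sorted(extras)}. Accepted metadata for {method} "
            f"are: {sorted(requests)}."
        )


validate_metadata({"sample_weight"}, {"sample_weight": [1.0]})  # OK, no error
```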
raise ValueError("estimator cannot be None!")

# meta-estimators are responsible for validating the given metadata
metadata_request_factory(self).fit.validate_metadata(
(I can already imagine @adrinjalali rolling his eyes :))
I find it quite heavy to have to call the factory each time. My first thought would have been to do:
class MetaClassifier(MetaEstimatorMixin, ClassifierMixin, BaseEstimator):
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y, **fit_params):
        if self.estimator is None:
            raise ValueError("estimator cannot be None!")
        # create the metadata requests for the meta-classifier and the
        # underlying estimator.
        self._router_metadata_request = metadata_request_factory(self)
        self._consumer_metadata_request = metadata_request_factory(self.estimator)
        # validate the metadata provided to the meta-classifier
        self._router_metadata_request.fit.validate_metadata(
            ignore_extras=False, kwargs=fit_params
        )
        # get the metadata requested by the underlying estimator
        metadata_to_route = self._consumer_metadata_request.fit.get_method_input(
            ignore_extras=False, kwargs=fit_params
        )
        self.estimator_ = clone(self.estimator).fit(X, y, **metadata_to_route)
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X, **predict_params):
        check_is_fitted(self)
        # validate the metadata provided to the meta-classifier
        self._router_metadata_request.predict.validate_metadata(
            ignore_extras=False, kwargs=predict_params
        )
        # get the metadata requested by the underlying estimator
        metadata_to_route = self._consumer_metadata_request.predict.get_method_input(
            ignore_extras=False, kwargs=predict_params
        )
        return self.estimator_.predict(X, **metadata_to_route)
So basically, calling the factory once and storing it. There are 2 questions there:

- Can something alter the request between `fit` and `predict`? (Then we cannot store it in this way.)
- Is there anything that should alter the request of a fitted attribute (e.g. `estimator_`) but not the unfitted attribute? (Then we would need to call the factory on the fitted one.)

The above points could be bypassed by attaching the requests directly to the estimators and modifying `clone` such that it knows how to attach them to a cloned object (seems like some déjà vu story there).
See #22083, it should satisfy your requirements.
def get_metadata_request(self):
    router = MetadataRouter().add(
        self.estimator, mapping="one-to-one", overwrite=False, mask=True
Something related to my previous comment: here we only expose considering the unfitted estimator. Would it be an issue in the long term, since we sometimes pass `estimator=None` as default and create the estimator in `fit`?
Yes, I've seen that; in my other PR, where I was naively trying to do everything at once 😁, what I did was to abstract the estimator creation away from `fit` and use that same logic in `get_metadata_request`.
Here's a similar example:

scikit-learn/sklearn/ensemble/_gb.py
Lines 914 to 921 in 7341975

    if self.init is None:
        # we pass n_classes=2 since the estimators are the same regardless.
        init = (
            self._get_loss(n_classes=2)
            .init_estimator()
            .fit_requests(sample_weight=True)
        )
    # here overwrite="ignore" because we should not expose a
examples/plot_metadata_routing.py
Outdated
# As you can see, the only metadata requested for method ``fit`` is
# ``"aliased_foo"``. This information is enough for another
# meta-estimator/router to know what needs to be passed to ``est``. In other
# words, ``foo`` is *masked*. The ``MetadataRouter`` class enables us to
Still not convinced about the masking naming, but I need to give it more thought.
Is there a way to say that you should have `masked=True` if you are the innermost part of the routing?
router = (
    MetadataRouter()
    .add(super(), mapping="one-to-one", overwrite=False, mask=False)
    .add(self.estimator, mapping="one-to-one", overwrite="smart", mask=True)
I didn't read the documentation yet, but at this point `overwrite="smart"` is magical and smart :)
The other PR now doesn't have overwrite and mask, should be much easier to review/understand.
        "predict": "transform",
    },
    overwrite="smart",
    mask=True,
OK so I need to think a bit more about this masking because I did not really get it.
Preamble: should I better ask on the SLEP or in this PR?

IIUC, this solution of passing routing metadata would not help to solve #18894? Is that correct? The problem there is metadata that is "produced" in the pipeline and that is not available right from the start, as in the following.

From #18894: passing ordinal encoded indices to …

    X, y = ...
    ct = make_column_transformer(
        (OrdinalEncoder(),
         make_column_selector(dtype_include='category')),
        remainder='passthrough')
    hist_native = make_pipeline(
        ct,
        HistGradientBoostingRegressor(categorical_features=???)
    )

How to fill the `categorical_features=???` argument?
Co-authored-by: Christian Lorentzen <[email protected]>
As you wish 🤷🏼

No, SLEP15 is the one which would fix that issue. However, there are ways to hack around it with this PR and get that information through.

It's true that this implementation/SLEP doesn't address this issue. But it's also quite an advanced one, and I'm kind of okay with the hack that people can pass an array, a step changes the array in place, and the next step has access to the "generated" new values.

The indices are never passed around and never used. The *CV objects will properly split the data and pass only the relevant parts to the underlying estimator.

These are more of a "column metadata", and they are not the main focus of this SLEP.
    X, y = ...
    ct = make_column_transformer(
        (OrdinalEncoder(),
         make_column_selector(dtype_include='category')),
        remainder='passthrough')
    hist_native = make_pipeline(
        ct,
        HistGradientBoostingRegressor(categorical_features=???)
    )

With SLEP15, they can be the names which are generated and passed to HGBT.
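The in-place "hack" mentioned above — an earlier step writes "generated" values into a shared mutable container that a later step reads — can be sketched as follows. The class and attribute names are purely illustrative, stdlib-only stand-ins for pipeline steps:

```python
class AddGeneratedMetadata:
    """Pipeline-like step that writes 'generated' values into a shared dict."""

    def fit_transform(self, X, shared):
        # Pretend columns 0 and 2 were detected as categorical; write that
        # into the mutable container the caller also hands to later steps.
        shared["categorical_features"] = [0, 2]
        return X


class ConsumesGeneratedMetadata:
    """Later step that reads what the earlier step produced."""

    def fit(self, X, shared):
        self.categorical_features_ = shared["categorical_features"]
        return self


shared = {}  # mutable container threaded through the steps
X = [[1, 2, 3]]
X = AddGeneratedMetadata().fit_transform(X, shared)
model = ConsumesGeneratedMetadata().fit(X, shared)
print(model.categorical_features_)  # → [0, 2]
```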
Some of the issues raised here required some major changes. For us to be able to compare, I've created a separate PR with those proposed changes: #22083. I personally like the other PR a lot more than this one.
Superseded by #22083
This PR is into the `sample-props` branch and not `main`. The idea is to break #20350 into smaller PRs for easier review and discussion rounds.

This PR adds the base implementation, some documentation, and a few tests. The tests are re-done from the previous PR. You can probably start with `examples/metadata_routing.py` to get a sense of how things work, and then check the implementation.

This PR does NOT touch splitters and scorers; those and all meta-estimators will be done in future PRs.

cc @jnothman @glemaitre @thomasjpfan

EDIT: superseded by #22083