Base sample-prop implementation and docs #21284


Closed

Conversation

@adrinjalali (Member) commented Oct 8, 2021:

This PR targets the sample-props branch, not main. The idea is to break #20350 into smaller PRs for easier review and discussion rounds.

This PR adds the base implementation, some documentation, and a few tests. The tests are reworked from the previous PR. You can probably start with examples/metadata_routing.py to get a sense of how things work, and then check the implementation.

This PR does NOT touch splitters and scorers; those, along with all meta-estimators, will be handled in future PRs.

cc @jnothman @glemaitre @thomasjpfan

EDIT: superseded by #22083

@thomasjpfan (Member) left a comment:

To skip doc/metadata_routing.rst, you can add the following to doc/conftest.py:

# SkipTest is assumed to be imported at the top of doc/conftest.py
# (e.g. from sklearn.utils._testing import SkipTest).
def pytest_runtest_setup(item):
    ...

    # Skip metadata routing because it is not fully implemented yet
    if fname.endswith("metadata_routing.rst"):
        raise SkipTest(
            "Skipping doctest for metadata_routing.rst because it "
            "is not fully implemented yet"
        )

@thomasjpfan (Member) left a comment:

First pass through the rst. Feel free to make adjustments.

Comment on lines +31 to +36
>>> n_samples, n_features = 100, 4
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.randint(0, 2, size=n_samples)
>>> my_groups = np.random.randint(0, 10, size=n_samples)
>>> my_weights = np.random.rand(n_samples)
>>> my_other_weights = np.random.rand(n_samples)
Member:

For reproducibility, we can define a RandomState object.
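For instance, a minimal sketch of that suggestion, reusing the variable names from the snippet above (the seed is arbitrary):

>>> rng = np.random.RandomState(42)
>>> X = rng.rand(n_samples, n_features)
>>> y = rng.randint(0, 2, size=n_samples)
>>> my_groups = rng.randint(0, 10, size=n_samples)
>>> my_weights = rng.rand(n_samples)
>>> my_other_weights = rng.rand(n_samples)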

@adrinjalali (Member Author):

We will need to rework this rst file once things are implemented anyway; this example doesn't run the code for now.

>>> lr = LogisticRegressionCV(
...     cv=GroupKFold(), scoring=weighted_acc,
... ).fit_requests(sample_weight=True)
>>> sel = SelectKBest(k=2)
Member:

Would this workflow break if SelectKBest were to accept weights in the future?

In other words, let's say that in 1.3 SelectKBest accepts weights; would we need to call fit_requests(sample_weight=False) to keep the same behavior?

I guess we would need a deprecation cycle migrating the default from RequestType.UNREQUESTED to RequestType.ERROR_IF_PASSED.

@adrinjalali (Member Author):

Yes, all you say is true, and the deprecation cycle is not hard to implement, since we have the mechanism of having UNREQUESTED as the default.

@adrinjalali (Member Author):

We can also implement a RequestType.DEPRECATED or something, to make the deprecation easier if necessary.
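Purely for illustration, such an extension could look like the sketch below; the DEPRECATED member and all the concrete values are assumptions, since only UNREQUESTED and ERROR_IF_PASSED appear in this discussion:

from enum import Enum

class RequestType(Enum):
    UNREQUESTED = False        # metadata is not requested
    REQUESTED = True           # metadata is requested
    ERROR_IF_PASSED = None     # raise if the metadata is passed anyway
    DEPRECATED = "deprecated"  # hypothetical: warn instead of raise during a cycle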

@jnothman (Member) commented Oct 21, 2021:

Hmmm... The case of a metaestimator adding support for a prop that is requested by its child is indeed a tricky one. I can't yet see a way to make this generally backwards compatible within the SLEP006 proposal. This makes me sad.

Indeed, generally a metaestimator supporting the same prop name as one of its children is tricky. I.e. if the metaestimator supports metadata x and its child requests metadata x, the metaestimator should only work where either:

  • the child's request aliases x to another name without such a clash;
  • the child's request and the metaestimator's request for x imply being passed the same metadata.

In other cases, this must raise an error. This is something, I'm pretty sure, we've not yet covered in SLEP006 (and it's a pretty messy and intricate consequence of having the caller responsible for delivering metadata in accordance with the request).

Deprecation would be pretty tricky as far as I can tell.
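To illustrate the aliasing case from the first bullet above (a hedged sketch: MetaEstimator, Child, and the metadata key x are hypothetical names, and string-valued fit_requests arguments are assumed to declare an alias as in SLEP006):

# hypothetical names throughout; the child asks for its metadata under the
# key "child_x", so it no longer clashes with the metaestimator's own "x".
child = Child().fit_requests(x="child_x")
meta = MetaEstimator(estimator=child)
meta.fit(X, y, x=x_for_meta, child_x=x_for_child)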

@lorentzenchr (Member):

It is also the user's obligation to know which estimator supports which property. This could be confusing.

Member:

@lorentzenchr Are you referring to a case where the user does not already need to know which estimator supports which property? I'm not sure what burden you are referring to.

Member:

I just realised my response didn't really relate to @thomasjpfan's question, which wasn't about a metaestimator adding support. Anyway, I've opened this issue as scikit-learn/enhancement_proposals#58.

@glemaitre glemaitre self-requested a review October 14, 2021 14:14
@adrinjalali (Member Author):

Also pinging @lorentzenchr and @agramfort since you gave feedback on the SLEP.

@glemaitre (Member) left a comment:

Just finished reading the example. I would say it would be better not to show it to our end users, but to move it to the "developer" guide.

"""
# %%

import numpy as np
Member:

I usually like to delay an import to the first cell that uses it.

@adrinjalali (Member Author):

The issue with that is that the imports are then only visible in the first cell using the module/method/class, and are not repeated in later cells.

# Please note that as long as the above estimator is not used in another
# meta-estimator, the user does not need to set any requests for the metadata.
# A simple usage of the above estimator would work as expected. Remember that
# ``{foo, bar}_is_none`` are for testing/demonstration purposes and don't have
Member:

I haven't finished reading the example yet, but would it be OK to just print some information or directly raise an error, without adding foo_is_none and bar_is_none to the constructor?

@adrinjalali (Member Author):

The thing is, they're sometimes used and sometimes not; if I remove them, I'd need a separate class for each combination.

def fit(self, X, y, foo=None):
    if (foo is None) != self.foo_is_none:
        raise ValueError("foo's value and foo_is_none disagree!")
    # all classifiers need to expose a classes_ attribute once they're fit.
Member:

Would it be simpler to use a regressor where we don't need the extra classes_ attribute?

@adrinjalali (Member Author):

We tend to simplify our examples, and therefore they usually don't show the more common use cases. I intentionally left this one as a classifier, just to have an example of how a classifier should be implemented.
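For reference, a minimal self-contained sketch of such a demo classifier (a reconstruction for illustration, not the PR's exact code):

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class ExampleClassifier(ClassifierMixin, BaseEstimator):
    # a toy consumer; ``foo_is_none`` exists only for testing/demonstration
    def __init__(self, foo_is_none=True):
        self.foo_is_none = foo_is_none

    def fit(self, X, y, foo=None):
        if (foo is None) != self.foo_is_none:
            raise ValueError("foo's value and foo_is_none disagree!")
        # all classifiers need to expose a classes_ attribute once they're fit.
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        # a trivial constant prediction is enough for the demonstration
        return np.full(len(X), self.classes_[0])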

Comment on lines +88 to +91
If :class:`~linear_model.LogisticRegressionCV` does not call ``fit_requests``,
:func:`~model_selection.cross_validate` will raise an error because weights are
passed in but :class:`~linear_model.LogisticRegressionCV` was not configured to
recognize them.
Member:

I guess this error would be raised regardless of whether the scoring is weighted or unweighted?

@adrinjalali (Member Author):

Yes.

@adrinjalali (Member Author):

@jnothman, @thomasjpfan WDYT of b5c962c, or a similar approach, as a solution to deprecation?

@jnothman (Member) commented Nov 21, 2021:

What's unclear to me in that last change (b5c962c) – and sorry for the slow response – is what SampledMetaRegressor does to use sample_weight, but only if it is not UNREQUESTED. An estimator is usually concerned about downstream metadata requests in fit, not its own request, which would be necessary for it to ignore an UNREQUESTED key. That's the awkwardness of this belated issue (scikit-learn/enhancement_proposals#58).

And while it does not seem to be the most major risk to be concerned about, it does make me think that maybe we've got it all wrong, and the only way to achieve consistency is by having the estimator be aware of its own request, and to handle its own aliasing and validation. The reason we didn't do this is because it would involve modifying every estimator accepting metadata, changing it to accept **kw and to parse that input with respect to a request.

@adrinjalali (Member Author):

> What's unclear to me in that last change (b5c962c) – and sorry for the slow response – is what SampledMetaRegressor does to use sample_weight, but only if it is not UNREQUESTED. An estimator is usually concerned about downstream metadata requests in fit, not its own request, which would be necessary for it to ignore an UNREQUESTED key. That's the awkwardness of this belated issue (scikit-learn/enhancement_proposals#58).

I'm not sure what you mean, @jnothman. The SampledMetaRegressor is not really using sample_weight; I was trying to say that hypothetically it could use it, hence having it explicit in fit's signature. I could print the sum of sample_weight to "use" it, if that helps?

On the other point, it's true that a consumer is usually not concerned with the request value for a given metadata. But in the case of a meta-estimator I think it's slightly different, since the sub-estimator can also request the same thing, and if the meta-estimator goes from not consuming metadata x to consuming it, then there's a behavior change. Right now in sklearn we don't have a backward-compatible way of handling that, and this proposal gives us one, albeit maybe not a very clean one.

> And while it does not seem to be the most major risk to be concerned about, it does make me think that maybe we've got it all wrong, and the only way to achieve consistency is by having the estimator be aware of its own request, and to handle its own aliasing and validation. The reason we didn't do this is because it would involve modifying every estimator accepting metadata, changing it to accept **kw and to parse that input with respect to a request.

It may not be perfect, but we (and probably mostly you) have thought about it for quite some time. You wrote down several alternatives, and this is the one I found to be the cleanest of all the proposals.

And as for consumers not validating their input: to me it's not about not having to touch all estimators, it's more about supporting third-party estimators w/o forcing them to do anything to be supported by our new metadata routing mechanism.

If we think consumers should validate and handle aliases, I'm happy to go down that path, but that means all third parties also need to do the same thing; which is not entirely a no-go for me, since I really think from their perspective it'd be just a single line of code, or a single function call.

We can even give them a decorator to handle all of it, and they'd only need to use that decorator.
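Sketching what such a decorator could look like (entirely hypothetical; it builds on metadata_request_factory from this PR and assumes the per-method request object exposes a requests mapping, as in the validate_metadata snippets below):

import functools

def validates_metadata(fit_method):
    # hypothetical helper a third-party estimator could apply to its ``fit``
    @functools.wraps(fit_method)
    def wrapper(self, X, y=None, **kwargs):
        # reject any metadata that the estimator did not request
        requested = metadata_request_factory(self).fit.requests
        extra = set(kwargs) - set(requested)
        if extra:
            raise ValueError(
                f"Metadata passed which is not requested: {sorted(extra)}"
            )
        return fit_method(self, X, y, **kwargs)

    return wrapper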

@glemaitre glemaitre self-requested a review December 1, 2021 16:54
@glemaitre (Member) left a comment:

I am posting a couple of comments now that I could concentrate a bit more on the PR. Some of them are quite general regarding the design, and are probably due to limitations that we discussed in the past.

I still have to understand exactly how the overwrite and mask parameters work.

args = {arg for arg, value in kwargs.items() if value is not None}
if not ignore_extras and args - set(self.requests.keys()):
    raise ValueError(
        "Metadata passed which is not understood: "
Member:

The message could be more specific than "understood", I think.

Member:

We might want to add which metadata would be accepted for this particular method.
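For instance, a sketch reusing args and self.requests from the snippet above (the exact wording is only a suggestion):

raise ValueError(
    "Metadata passed which is not requested by this method: "
    f"{sorted(args - set(self.requests))}. "
    f"Metadata accepted by this method: {sorted(self.requests)}."
)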

raise ValueError("estimator cannot be None!")

# meta-estimators are responsible for validating the given metadata
metadata_request_factory(self).fit.validate_metadata(
Member:

(I can already imagine @adrinjalali rolling his eyes :))

I find it quite heavy to have to call the factory each time. My first thought would have been to do:

class MetaClassifier(MetaEstimatorMixin, ClassifierMixin, BaseEstimator):
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y, **fit_params):
        if self.estimator is None:
            raise ValueError("estimator cannot be None!")

        # create the metadata requests for the meta-classifier and the underlying
        # estimator.
        self._router_metadata_request = metadata_request_factory(self)
        self._consumer_metadata_request = metadata_request_factory(self.estimator)

        # validate the metadata provided to the meta-classifier
        self._router_metadata_request.fit.validate_metadata(
            ignore_extras=False, kwargs=fit_params
        )
        # get the metadata requested by the underlying estimator
        metadata_to_route = self._consumer_metadata_request.fit.get_method_input(
            ignore_extras=False, kwargs=fit_params
        )

        self.estimator_ = clone(self.estimator).fit(X, y, **metadata_to_route)
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X, **predict_params):
        check_is_fitted(self)
        # validate the metadata provided to the meta-classifier
        self._router_metadata_request.predict.validate_metadata(
            ignore_extras=False, kwargs=predict_params
        )
        # get the metadata requested by the underlying estimator
        metadata_to_route = self._consumer_metadata_request.predict.get_method_input(
            ignore_extras=False, kwargs=predict_params
        )
        return self.estimator_.predict(X, **metadata_to_route)

So basically: call the factory once and store it. There are two questions there:

  • can something alter the request between fit and predict? (then we cannot store it in this way)
  • is there anything that should alter the request of a fitted attribute (e.g. estimator_) but not the unfitted attribute? (then we would need to call the factory on the fitted one)

The above points could be bypassed by attaching the requests directly to the estimators and modifying clone such that it knows how to attach them to a cloned object (seems like some déjà vu story there).

@adrinjalali (Member Author):

See #22083, it should satisfy your requirements.


def get_metadata_request(self):
    router = MetadataRouter().add(
        self.estimator, mapping="one-to-one", overwrite=False, mask=True
Member:

Something related to my previous comment: here we only consider the unfitted estimator. Would that be an issue in the long term, since we sometimes pass estimator=None as the default and create the actual estimator in fit?

@adrinjalali (Member Author) commented Dec 23, 2021:

Yes, I've seen that. In my other PR, where I was naively trying to do everything at once 😁, what I did was to abstract the estimator creation away from fit, and use that same logic in get_metadata_request.

Here's a similar example:

if self.init is None:
    # we pass n_classes=2 since the estimators are the same regardless.
    init = (
        self._get_loss(n_classes=2)
        .init_estimator()
        .fit_requests(sample_weight=True)
    )
# here overwrite="ignore" because we should not expose a

# As you can see, the only metadata requested for method ``fit`` is
# ``"aliased_foo"``. This information is enough for another
# meta-estimator/router to know what needs to be passed to ``est``. In other
# words, ``foo`` is *masked*. The ``MetadataRouter`` class enables us to
Member:

Still not convinced about the "masking" naming, but I need to give it more thought.

Member:

Is there a way to say that you should have mask=True if you are the innermost part of the routing?

router = (
    MetadataRouter()
    .add(super(), mapping="one-to-one", overwrite=False, mask=False)
    .add(self.estimator, mapping="one-to-one", overwrite="smart", mask=True)
Member:

I haven't read the documentation yet, but at this point overwrite="smart" is magical and smart :)

@adrinjalali (Member Author):

The other PR doesn't have overwrite and mask anymore, so it should be much easier to review/understand.

"predict": "transform",
},
overwrite="smart",
mask=True,
Member:

OK so I need to think a bit more about this masking because I did not really get it.

@glemaitre glemaitre requested review from glemaitre and lorentzenchr and removed request for lorentzenchr December 1, 2021 21:11
@lorentzenchr (Member):

Preamble: should I rather ask on the SLEP or in this PR?

IIUC, this solution for passing routing metadata would not help to solve #18894? Is that correct?

The problem is metadata that is "produced" in the pipeline and is not available right from the start like X, y and sample_weight. Possible such metadata could be:

  • split indices produced by a splitter
  • column indices and names of ordinal encoded columns
  • column indices and names of one-hot encoded columns
  • etc.

From #18894: Passing ordinal encoded indices to categorical_features in HGBT

X, y = ...

ct = make_column_transformer(
    (OrdinalEncoder(),
     make_column_selector(dtype_include='category')),
    remainder='passthrough')

hist_native = make_pipeline(
    ct,
    HistGradientBoostingRegressor(categorical_features=???)
)

How to fill the ???? It is the ColumnTransformer in the Pipeline that creates the knowledge about which columns are ordinal encoded. How can it pass this information on to the HistGradientBoostingRegressor?

@adrinjalali (Member Author):

> Preamble: should I rather ask on the SLEP or in this PR?

As you wish 🤷🏼

> IIUC, this solution for passing routing metadata would not help to solve #18894? Is that correct?

No; SLEP15 is the one that would fix that issue. However, there are ways to hack around it with this PR and get that information through.

> The problem is metadata that is "produced" in the pipeline and is not available right from the start like X, y and sample_weight. Possible such metadata could be:

It's true that this implementation/SLEP doesn't address that issue. But it's also quite an advanced one, and I'm kind of okay with the hack that people can pass an array, have a step change the array in place, and give the next step access to the "generated" new values.

>   • split indices produced by a splitter

The indices are never passed around and never used. The *CV objects will properly split the data and pass only the relevant part to the underlying estimator.
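For instance (a sketch reusing the names from the rst snippet above; est stands in for any estimator):

>>> for train, test in GroupKFold().split(X, y, groups=my_groups):
...     est.fit(X[train], y[train])  # only the selected rows are passed on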

>   • column indices and names of ordinal encoded columns
>   • column indices and names of one-hot encoded columns

These are more of a "column metadata", and they are not the main focus of this SLEP.

> From #18894: Passing ordinal encoded indices to categorical_features in HGBT
>
> X, y = ...
>
> ct = make_column_transformer(
>     (OrdinalEncoder(),
>      make_column_selector(dtype_include='category')),
>     remainder='passthrough')
>
> hist_native = make_pipeline(
>     ct,
>     HistGradientBoostingRegressor(categorical_features=???)
> )
>
> How to fill the ???? It is the ColumnTransformer in the Pipeline that creates the knowledge about which columns are ordinal encoded. How can it pass this information on to the HistGradientBoostingRegressor?

With SLEP15, they can be the names which are generated and passed to HGBT.
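In the meantime, one illustrative workaround is to compute the indices outside the pipeline; this sketch relies on make_column_transformer placing transformed columns before the remainder:

cat_selector = make_column_selector(dtype_include="category")
n_cat = len(cat_selector(X))  # the ordinal-encoded columns come first
hist_native = make_pipeline(
    ct,
    HistGradientBoostingRegressor(categorical_features=list(range(n_cat))),
)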

@adrinjalali (Member Author):

Some of the issues raised here required major changes. So that we can compare, I've created a separate PR with those proposed changes: #22083.

I personally like the other PR a lot more than this one.

@adrinjalali (Member Author):

Superseded by #22083
