Base sample-prop implementation and docs #21284
Conversation
To skip `doc/metadata_routing.rst`, you can add the following to `doc/conftest.py`:
def pytest_runtest_setup(item):
    ...
    # Skip metadata routing because it is not fully implemented yet
    if fname.endswith("metadata_routing.rst"):
        raise SkipTest(
            "Skipping doctest for metadata_routing.rst because it "
            "is not fully implemented yet"
        )
First pass through the rst. Feel free to make adjustments.
>>> n_samples, n_features = 100, 4
>>> X = np.random.rand(n_samples, n_features)
>>> y = np.random.randint(0, 2, size=n_samples)
>>> my_groups = np.random.randint(0, 10, size=n_samples)
>>> my_weights = np.random.rand(n_samples)
>>> my_other_weights = np.random.rand(n_samples)
For reproducibility, we can define a `RandomState` object.
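A minimal sketch of that suggestion, reusing the example's variable names (the seed value is an arbitrary assumption):

```python
import numpy as np

# A seeded RandomState makes the example reproducible: re-creating the
# generator with the same seed yields exactly the same draws.
rng = np.random.RandomState(42)
n_samples, n_features = 100, 4
X = rng.rand(n_samples, n_features)
y = rng.randint(0, 2, size=n_samples)

# Same seed, same sequence of values.
rng2 = np.random.RandomState(42)
X2 = rng2.rand(n_samples, n_features)
assert (X == X2).all()
```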
We will need to rework this rst file once things are implemented anyway; this example doesn't run the code for now.
>>> lr = LogisticRegressionCV(
...     cv=GroupKFold(), scoring=weighted_acc,
... ).fit_requests(sample_weight=True)
>>> sel = SelectKBest(k=2)
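For readers unfamiliar with the proposed API, here is a rough pure-Python sketch of what a `fit_requests`-style method does: it records which metadata the estimator requests for `fit` and returns `self` so calls can be chained. This is a simplified illustration of the mechanism, not the PR's actual implementation:

```python
class RequestingEstimator:
    """Toy estimator that records which metadata it requests for ``fit``."""

    def __init__(self):
        # Maps metadata name -> request value (True, False, or an alias).
        self._fit_requests = {}

    def fit_requests(self, **kwargs):
        # Record the request and return self so calls can be chained,
        # e.g. ``RequestingEstimator().fit_requests(sample_weight=True)``.
        self._fit_requests.update(kwargs)
        return self

    def get_metadata_request(self):
        return {"fit": dict(self._fit_requests)}


est = RequestingEstimator().fit_requests(sample_weight=True)
print(est.get_metadata_request())  # → {'fit': {'sample_weight': True}}
```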
Would this workflow break if `SelectKBest` were to accept weights in the future? In other words, let's say in 1.3 `SelectKBest` accepts weights; would we need to call `fit_requests(sample_weight=False)` to have the same behavior?
I guess we would need a deprecation cycle migrating from `RequestType.UNREQUESTED` to `RequestType.ERROR_IF_PASSED` as the default.
Yes, all you say is true, and the deprecation cycle is not hard to implement, since we have the mechanism of having `UNREQUESTED` as the default.
We can also implement a `RequestType.DEPRECATED` or something, to make the deprecation easier if necessary.
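A minimal sketch of how such request states and a deprecation transition could look. Only `UNREQUESTED`, `ERROR_IF_PASSED`, and `DEPRECATED` are named in the discussion; the concrete enum values and the `resolve_request` helper are illustrative assumptions:

```python
import warnings
from enum import Enum


class RequestType(Enum):
    UNREQUESTED = False          # metadata is silently ignored
    REQUESTED = True             # metadata is routed to the estimator
    ERROR_IF_PASSED = None       # passing the metadata raises an error
    DEPRECATED = "$DEPRECATED$"  # behaves like UNREQUESTED, but warns


def resolve_request(request, metadata_passed):
    """Decide what to do for one metadata key given its request state."""
    if request is RequestType.DEPRECATED and metadata_passed:
        warnings.warn(
            "Relying on the default request is deprecated; call "
            "fit_requests() explicitly.",
            FutureWarning,
        )
        return "ignore"
    if request is RequestType.ERROR_IF_PASSED and metadata_passed:
        raise ValueError("metadata passed but not explicitly requested")
    return "route" if request is RequestType.REQUESTED else "ignore"
```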
Hmmm... The case of a metaestimator adding support for a prop that is requested by its child is indeed a tricky one. I can't yet see a way to make this generally backwards compatible within the SLEP006 proposal. This makes me sad.
Indeed, generally a metaestimator supporting the same prop name as one of its children is tricky. I.e. if the metaestimator supports metadata `x` and its child requests metadata `x`, the metaestimator should only work where either:

- the child's request aliases `x` to another name without such a clash; or
- the child's request and the metaestimator's request for `x` imply being passed the same metadata.

In other cases, this must raise an error. This is something, I'm pretty sure, we've not yet covered in SLEP006 (and it's a pretty messy and intricate consequence of having the caller responsible for delivering metadata in accordance with the request).
Deprecation would be pretty tricky as far as I can tell.
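A rough sketch of the clash check described above. The function name and the dict representation (metadata name mapped to `True` for "requested under its own name" or to a string alias) are illustrative assumptions, not the PR's API:

```python
def check_prop_clash(meta_request, child_request, prop="x"):
    """Validate a shared metadata name between a metaestimator and its child.

    Each request maps a metadata name to True (requested under its own name)
    or to a string alias under which the caller should pass the value.
    """
    child = child_request.get(prop)
    meta = meta_request.get(prop)
    if child is None or meta is None:
        return  # no clash: only one of the two uses the name
    if isinstance(child, str) and child != prop:
        return  # child aliased the prop to another name, avoiding the clash
    if child is True and meta is True:
        return  # both requests imply being passed the same metadata
    raise ValueError(
        f"Conflicting requests for metadata {prop!r} between the "
        "metaestimator and its child estimator."
    )
```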
It is also the obligation of a user to know which estimator supports which property. This could be confusing.
@lorentzenchr Are you referring to a case where the user does not already need to know which estimator supports which property? I'm not sure what burden you are referring to.
I just realised my response didn't really relate to @thomasjpfan's question, which wasn't about a metaestimator adding support. Anyway, I've opened this issue as scikit-learn/enhancement_proposals#58

Also pinging @lorentzenchr and @agramfort, since you gave feedback on the SLEP.
Just finished reading the example. I would say that it would be better not to show it to our end users but to put it in the "developer" guide.
examples/metadata_routing.py
Outdated
"""
# %%

import numpy as np
I usually like to delay the imports to the cell of their first usage.
The issue with that is that the imports are then only visible in the first cell using the module/method/class, and they are not repeated in later cells.
examples/metadata_routing.py
Outdated
# Please note that as long as the above estimator is not used in another
# meta-estimator, the user does not need to set any requests for the metadata.
# A simple usage of the above estimator would work as expected. Remember that
# ``{foo, bar}_is_none`` are for testing/demonstration purposes and don't have
I did not finish reading the example yet, but would it be OK to just print some information or directly raise an error, without adding `foo_is_none` and `bar_is_none` in the constructor?
The thing is, they're sometimes used and sometimes not; if I remove them, then I'd need a separate class for each combination.
examples/metadata_routing.py
Outdated
def fit(self, X, y, foo=None):
    if (foo is None) != self.foo_is_none:
        raise ValueError("foo's value and foo_is_none disagree!")
    # all classifiers need to expose a classes_ attribute once they're fit.
Would it be simpler to use a regressor, where we don't need the extra `classes_` attribute?
We tend to simplify our examples and therefore they usually don't show the more common use cases. I intentionally left this one as a classifier, just to have an example of how a classifier should be implemented.
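A stripped-down sketch of such a demo classifier: `foo_is_none` comes from the example under review, while the rest (the dummy `predict`, the `sorted(set(y))` trick for `classes_`) is purely illustrative:

```python
class ExampleClassifier:
    """Toy classifier consuming an optional ``foo`` metadata in ``fit``."""

    def __init__(self, foo_is_none=True):
        # Used only to assert, in tests, whether ``foo`` was routed here.
        self.foo_is_none = foo_is_none

    def fit(self, X, y, foo=None):
        if (foo is None) != self.foo_is_none:
            raise ValueError("foo's value and foo_is_none disagree!")
        # All classifiers need to expose a classes_ attribute once fit.
        self.classes_ = sorted(set(y))
        return self

    def predict(self, X):
        # Dummy prediction: always the first class.
        return [self.classes_[0] for _ in X]


clf = ExampleClassifier(foo_is_none=False).fit([[1], [2]], [0, 1], foo="meta")
print(clf.classes_)  # → [0, 1]
```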
If :class:`~linear_model.LogisticRegressionCV` did not call ``fit_requests``, :func:`~model_selection.cross_validate` would raise an error, because weights are passed in but :class:`~linear_model.LogisticRegressionCV` was not configured to recognize them.
I guess this error would be raised no matter whether the scoring is weighted or unweighted?
Yes.

@jnothman, @thomasjpfan WDYT of b5c962c or a similar solution to deprecation?
What's unclear to me in that last change (b5c962c) – and sorry for the slow response – is what …

And while it does not seem to be the most major risk to be concerned about, it does make me think that maybe we've got it all wrong, and the only way to achieve consistency is by having the estimator be aware of its own request, and to handle its own aliasing and validation. The reason we didn't do this is because it would involve modifying every estimator accepting metadata, changing it to accept …
I'm not sure what you mean, @jnothman. The …

On the other point, it's true that a consumer is usually not concerned with what the request value for a metadata is, but in the case of a meta-estimator, I think it's slightly different, since the sub-estimator can also request the same thing, and if the meta-estimator goes from not consuming metadata x to consuming it, then there's a behavior change. Right now in sklearn we don't have a backward compatible way of doing it, and this proposal now gives us a way to handle the situation, albeit maybe not a very clean one.
It may not be perfect, but we (and probably mostly you) have thought about it for quite some time. You wrote down several alternatives, and this is the one that I found the cleanest of all the proposals.

As for consumers not validating their input: to me it's not about not having to touch all estimators, it's more about supporting third-party estimators w/o forcing them to do something to be supported in our new metadata routing mechanism. If we think consumers should validate and handle aliases, I'm happy to go down that path, but that means all third parties also need to do the same thing; which is not entirely a no-go for me, since I really think from their perspective it'd be just a single line of code, or a single function call. We can even give them a decorator to handle all of it, and they'd only need to use that decorator.
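A sketch of what such a consumer-side decorator could look like. Everything here (the decorator name, the request representation, the third-party class) is an illustrative assumption; nothing like this exists in the PR as-is:

```python
import functools


def validates_metadata(**requests):
    """Decorate a ``fit``-like method to validate and alias its metadata kwargs.

    ``requests`` maps parameter names to True (accept under the parameter's
    own name) or to a string alias under which the caller passes the value.
    """
    def decorator(fit):
        @functools.wraps(fit)
        def wrapper(self, X, y, **kwargs):
            resolved = {}
            for param, request in requests.items():
                # Pull the value from the alias if one was declared.
                key = request if isinstance(request, str) else param
                if key in kwargs:
                    resolved[param] = kwargs.pop(key)
            if kwargs:
                raise ValueError(f"Unrequested metadata passed: {sorted(kwargs)}")
            return fit(self, X, y, **resolved)
        return wrapper
    return decorator


class ThirdPartyEstimator:
    @validates_metadata(sample_weight="my_weights")
    def fit(self, X, y, sample_weight=None):
        self.sample_weight_ = sample_weight
        return self


est = ThirdPartyEstimator().fit([[1]], [0], my_weights=[2.0])
print(est.sample_weight_)  # → [2.0]
```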
I am posting a couple of comments that I have, since I could concentrate a bit more on the PR. Some comments are really general regarding the design, and they are probably due to some limitations that we discussed in the past.

I still have to understand exactly the `overwrite` and `mask` parameters.
args = {arg for arg, value in kwargs.items() if value is not None}
if not ignore_extras and args - set(self.requests.keys()):
    raise ValueError(
        "Metadata passed which is not understood: "
Could be more specific than "understood", I think.
We might want to add which metadata would be accepted for this particular method.
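For illustration, a sketch of an error message that also lists the metadata accepted by the method; the function signature and wording are made up for the sketch:

```python
def validate_metadata(requests, kwargs, method="fit"):
    """Raise with an informative message when unknown metadata is passed.

    ``requests`` is the set of metadata names the method accepts;
    ``kwargs`` is what the caller actually passed.
    """
    passed = {k for k, v in kwargs.items() if v is not None}
    extras = passed - set(requests)
    if extras:
        raise ValueError(
            f"Metadata passed to {method} which is not requested or "
            f"accepted: {sorted(extras)}. Accepted metadata for {method} "
            f"are: {sorted(requests)}."
        )


validate_metadata({"sample_weight"}, {"sample_weight": [1.0]})  # OK, no error
```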
raise ValueError("estimator cannot be None!")

# meta-estimators are responsible for validating the given metadata
metadata_request_factory(self).fit.validate_metadata(
(I can already imagine @adrinjalali rolling his eyes :))
I find it quite heavy to have to call the factory each time. My first thought would have been to do:
class MetaClassifier(MetaEstimatorMixin, ClassifierMixin, BaseEstimator):
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y, **fit_params):
        if self.estimator is None:
            raise ValueError("estimator cannot be None!")
        # create the metadata requests for the meta-classifier and the
        # underlying estimator.
        self._router_metadata_request = metadata_request_factory(self)
        self._consumer_metadata_request = metadata_request_factory(self.estimator)
        # validate the metadata provided to the meta-classifier
        self._router_metadata_request.fit.validate_metadata(
            ignore_extras=False, kwargs=fit_params
        )
        # get the metadata requested by the underlying estimator
        metadata_to_route = self._consumer_metadata_request.fit.get_method_input(
            ignore_extras=False, kwargs=fit_params
        )
        self.estimator_ = clone(self.estimator).fit(X, y, **metadata_to_route)
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X, **predict_params):
        check_is_fitted(self)
        # validate the metadata provided to the meta-classifier
        self._router_metadata_request.predict.validate_metadata(
            ignore_extras=False, kwargs=predict_params
        )
        # get the metadata requested by the underlying estimator
        metadata_to_route = self._consumer_metadata_request.predict.get_method_input(
            ignore_extras=False, kwargs=predict_params
        )
        return self.estimator_.predict(X, **metadata_to_route)
So basically, calling the factory once and storing it. There are 2 questions there:

- Can something alter the request between `fit` and `predict`? (Then we cannot store it in this way.)
- Is there anything that should alter the request of a fitted attribute (e.g. `estimator_`) but not the unfitted attribute? (Then we would need to call the factory on the fitted one.)

The above points could be bypassed by attaching the requests directly to the estimators and modifying `clone` such that it knows how to attach them to a cloned object (seems like some déjà vu story there).
See #22083, it should satisfy your requirements.
def get_metadata_request(self):
    router = MetadataRouter().add(
        self.estimator, mapping="one-to-one", overwrite=False, mask=True
Something related to my previous comment: here we only expose considering the unfitted estimator. Would it be an issue in the long term, since we sometimes pass `estimator=None` as default and create the estimator in `fit`?
Yes, I've seen that; in my other PR, where I was naively trying to do everything at once 😁, what I did was to abstract the estimator creation away from `fit` and use that same logic in `get_metadata_request`.
Here's a similar example:

scikit-learn/sklearn/ensemble/_gb.py
Lines 914 to 921 in 7341975

    if self.init is None:
        # we pass n_classes=2 since the estimators are the same regardless.
        init = (
            self._get_loss(n_classes=2)
            .init_estimator()
            .fit_requests(sample_weight=True)
        )
    # here overwrite="ignore" because we should not expose a
examples/plot_metadata_routing.py
Outdated
# As you can see, the only metadata requested for method ``fit`` is
# ``"aliased_foo"``. This information is enough for another
# meta-estimator/router to know what needs to be passed to ``est``. In other
# words, ``foo`` is *masked*. The ``MetadataRouter`` class enables us to
Still not convinced about the masking naming, but I need to give it more thought.
Is there a way to say that you should have `masked=True` if you are the innermost part of the routing?
router = (
    MetadataRouter()
    .add(super(), mapping="one-to-one", overwrite=False, mask=False)
    .add(self.estimator, mapping="one-to-one", overwrite="smart", mask=True)
I didn't read the documentation yet, but at this point `overwrite="smart"` is magical and smart :)
The other PR now doesn't have overwrite and mask, should be much easier to review/understand.
        "predict": "transform",
    },
    overwrite="smart",
    mask=True,
OK so I need to think a bit more about this masking because I did not really get it.
Preamble: should I better ask on the SLEP or in this PR?

IIUC, this solution of passing routing metadata would not help to solve #18894? Is that correct? The problem there is metadata that is "produced" in the pipeline and that is not available right from the start, as in the following.

From #18894: passing ordinal encoded indices to …

    X, y = ...
    ct = make_column_transformer(
        (OrdinalEncoder(),
         make_column_selector(dtype_include='category')),
        remainder='passthrough')
    hist_native = make_pipeline(
        ct,
        HistGradientBoostingRegressor(categorical_features=???)
    )

How to fill the `categorical_features=???` argument?
Co-authored-by: Christian Lorentzen <[email protected]>
As you wish 🤷🏼

No, SLEP15 is the one which would fix that issue. However, there are ways to hack around it with this PR and get that information through.

It's true that this implementation/SLEP doesn't address this issue. But it's also quite an advanced one, and I'm kind of okay with the hack that people can pass an array, a step changes the array in place, and the next step has access to the "generated" new values.

The indices are never passed around and never used. The *CV objects will properly split the data and pass only the relevant parts to the underlying estimator.

These are more of a "column metadata", and they are not the main focus of this SLEP.
    X, y = ...
    ct = make_column_transformer(
        (OrdinalEncoder(),
         make_column_selector(dtype_include='category')),
        remainder='passthrough')
    hist_native = make_pipeline(
        ct,
        HistGradientBoostingRegressor(categorical_features=???)
    )

With SLEP15, they can be the names which are generated and passed to HGBT.
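The in-place "hack" mentioned above — an earlier step writes "generated" values into a shared mutable container that a later step reads — can be sketched as follows. The class and attribute names are purely illustrative, stdlib-only stand-ins for pipeline steps:

```python
class AddGeneratedMetadata:
    """Pipeline-like step that writes 'generated' values into a shared dict."""

    def fit_transform(self, X, shared):
        # Pretend columns 0 and 2 were detected as categorical; write that
        # into the mutable container the caller also hands to later steps.
        shared["categorical_features"] = [0, 2]
        return X


class ConsumesGeneratedMetadata:
    """Later step that reads what the earlier step produced."""

    def fit(self, X, shared):
        self.categorical_features_ = shared["categorical_features"]
        return self


shared = {}  # mutable container threaded through the steps
X = [[1, 2, 3]]
X = AddGeneratedMetadata().fit_transform(X, shared)
model = ConsumesGeneratedMetadata().fit(X, shared)
print(model.categorical_features_)  # → [0, 2]
```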
Some of the issues raised here required some major changes. For us to be able to compare, I've created a separate PR with those proposed changes: #22083. I personally like the other PR a lot more than this one.
Superseded by #22083
This PR is into the `sample-props` branch and not `main`. The idea is to break #20350 into smaller PRs for easier review and discussion rounds.

This PR adds the base implementation, some documentation, and a few tests. The tests are re-done from the previous PR. You can probably start with `examples/metadata_routing.py` to get a sense of how things work, and then check the implementation.

This PR does NOT touch splitters and scorers; those and all meta-estimators will be done in future PRs.

cc @jnothman @glemaitre @thomasjpfan

EDIT: superseded by #22083