TST introduce _safe_tags for estimator not inheriting from BaseEstimator #18797
Conversation
We would need a test to check that |
I remember we removed |
#16950 but it looks like there was no test indeed. |
The goal would be to honor our old duck typing contract, which I find fundamental to the spirit of the scikit-learn API design: https://scikit-learn.org/dev/developers/develop.html#rolling-your-own-estimator
|
IMHO, having merged #16950 does not violate our duck typing contract: estimators implementing fit, predict etc will work perfectly with the internal tools like What #16950 does is that it forces estimators to inherit from BaseEstimator in order to run check_estimator. Which is quite different, and I would say it's a reasonable constraint. We don't force a dependency on sklearn for third-party libraries, we just force a dependency on sklearn for their developers. So I'd be +0 to bring this back. If we do, let's please explicitly write a comment indicating that this should not be removed, possibly with a link here |
A lot of projects wouldn't want to add scikit-learn as a dependency (e.g. lightgbm, skorch), which in turn means they have no way to check that their API is compliant. +1 to put it back, especially since it's a minor change on our side. |
@NicolasHug here (https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator) it says that in order to be a scikit-learn compatible estimator, an estimator must pass |
Though yes, they could actually define a Edit: indeed, |
We could add a check that fails if the estimator has a predict method but both |
I understand that but I'm not sure what your point is @jeremiedbb ? Mine is that scikit-learn is not a dependency for users of third-party libraries. It's only a dependency for their developers. And as the comments above suggest, I doubt that one can get anything meaningful out of But I'm not opposing to adding |
+1 for pushing more 3rd party libraries to use |
@NicolasHug I feel that neither we nor our docs are sure about what is required for a third-party estimator to be scikit-learn compatible. The docs say it needs to pass check_estimator. But check_estimator more or less requires a dependency on sklearn, which we don't want to enforce. My comment was just that I'm a bit confused here :) |
I'm -0 on this one. I see the tags as a part of our API, and estimators should implement them. I'd say if users want to pass This PR adds IMO unnecessary complexity to our codebase. This means we should always use |
This is a very good point. If we were limiting the use of tags to the common tests, we would not have this issue. However, we started to use the tags elsewhere, in the estimators themselves. |
As I mentioned there, #18798 (comment), if we solely want to make tags part of our API, we might want to isolate this functionality in a Mixin. I am not sure this is a good solution though; it would force multiple inheritance even in the common case (i.e. |
Do you have a case where the user would want to inherit |
I don't think we should suggest that they inherit from scikit-learn. Being compatible with scikit-learn API is and should be unrelated to depending on scikit-learn. However yes, So in that sense maybe
Not everywhere, just in common tests and meta-estimators. We did it before. It's certainly not ideal, but also not such a big deal maybe? Though the issue there is that it can start to be used in contrib projects as well, since they would find it as the way of getting tags in our code base :/ |
It would be someone that needs to redefine |
Even if we were to accept a third-party estimator as scikit-learn compatible without it inheriting from the right classes, I don't see why we'd need to do extra work for it to be compatible when it doesn't implement the API we require. This puts a severe burden on us moving forward while designing our API. |
I don't understand why we introduce this PR: the goal seems to be that we don't want to force third-party libraries to depend on scikit-learn to pass the check suite. But to pass
class MyWrappedEst(BaseEstimator, MyEst):
    pass
and call
BTW, I agree with most of what @adrinjalali said but I don't think this is true:
This is only related to the check suite, not the rest of the code-base |
Test dependencies are not necessarily runtime dependencies. |
Yes, this is what I've been trying to say since the beginning: currently in This PR doesn't change anything w.r.t. dependency. All it does is remove the need for estimators to inherit from |
My understanding is that On the other hand, if a third party estimator doesn't want to inherit from BaseEstimator, it needs to implement all important parts of the scikit-learn api. tags is a part of that (as listed here https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator). So I join Nicolas and Adrin and I don't think this PR is necessary since it's clear in the docs that implementing tags is mandatory to be a compatible estimator. |
For scikit-learn built-in estimators, we should still rely on | ||
`self._get_tags()`. `_safe_tags(est)` should be used when we are not sure | ||
where `est` comes from: typically `_safe_tags(self.base_estimator)` where | ||
`self` is a meta-estimator, or in the common checks. |
Nit but using backquotes on doc that will never be rendered as html adds noise imho
sklearn/utils/_tags.py
Outdated
default : list of {str, dtype} or bool, default=None | ||
When `esimator.get_tags()` is not implemented, default` allows to | ||
define the default value of a tag if it is not present in | ||
`_DEFAULT_TAGS` or to overwrite the value in `_DEFAULT_TAGS` if it the |
`_DEFAULT_TAGS` or to overwrite the value in `_DEFAULT_TAGS` if it the | |
`_DEFAULT_TAGS` or to overwrite the value in `_DEFAULT_TAGS` if the |
sklearn/utils/_tags.py
Outdated
define the default value of a tag if it is not present in | ||
`_DEFAULT_TAGS` or to overwrite the value in `_DEFAULT_TAGS` if it the |
This is not needed at the moment IMO. Maybe we can leave this feature for later, in order to keep things minimal here?
sklearn/utils/_tags.py
Outdated
if hasattr(estimator, "_get_tags"): | ||
if key is not None: | ||
try: | ||
return estimator._get_tags().get(key, _DEFAULT_TAGS[key]) |
This goes against our docs: "Note however that all tags must be present in the dict".
What bothers me here is that there's no difference anymore between _get_tags() and _more_tags() from a third-party point of view.
I think it's coherent with the proposed changes to the doc:
|
I don't think I agree: when doing
No error will ever be raised if a tag isn't returned by Strictly following the docs would mean doing |
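To make the two behaviours discussed above concrete, here is a minimal sketch contrasting a permissive lookup (the `.get(key, default)` pattern from the reviewed hunk) with a strict one. `DEFAULT_TAGS` is a small hypothetical subset standing in for `_DEFAULT_TAGS`, and `PartialTags`, `permissive_lookup` and `strict_lookup` are illustrative names, not scikit-learn API.

```python
# Hypothetical subset of default tags, for illustration only.
DEFAULT_TAGS = {"requires_y": False, "allow_nan": False}


class PartialTags:
    """Duck-typed estimator whose _get_tags() omits some tags."""

    def _get_tags(self):
        return {"allow_nan": True}  # "requires_y" is deliberately missing


def permissive_lookup(estimator, key):
    # Silently falls back to the default when the tag is missing.
    return estimator._get_tags().get(key, DEFAULT_TAGS[key])


def strict_lookup(estimator, key):
    # Raises KeyError when the estimator does not return the tag.
    return estimator._get_tags()[key]


est = PartialTags()
permissive_result = permissive_lookup(est, "requires_y")

try:
    strict_lookup(est, "requires_y")
    strict_raised = False
except KeyError:
    strict_raised = True
```

With the permissive version, an incomplete `_get_tags()` goes unnoticed; with the strict version, it fails loudly, which is closer to "all tags must be present in the dict".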
Right, I did not understand what you were referring to in the first place. I agree that the code does not reflect the doc currently. Guillaume is working on it :) |
@@ -1985,15 +1985,3 @@ def _more_tags(self): | |||
"Set the estimator tags of your estimator instead") | |||
with pytest.warns(FutureWarning, match=msg): | |||
cross_validate(svm, linear_kernel, y, cv=2) | |||
|
|||
# the _pairwise attribute is present and set to True while the pairwise |
@NicolasHug by not being permissive (getting the default with _get_tags), we need to remove this test. What do you think about this?
We are not sure that this case is actually possible in practice.
I haven't followed the introduction of the pairwise tag, but since the test assumes that the tag doesn't exist and since we're telling 3rd parties that all tags should exist, I'd say it makes sense to remove the test.
@@ -558,24 +558,9 @@ class IncorrectTagPCA(KernelPCA): | |||
with pytest.warns(FutureWarning, match=msg): | |||
assert not _is_pairwise(pca) | |||
|
|||
# the _pairwise attribute is present and set to False while the pairwise | |||
# tag is not present | |||
class FalsePairwise(BaseEstimator): |
We have a second test with the same issue.
I like the new version of _safe_tags that takes _more_tags into account if _get_tags is not present.
The code is now simpler, and it's more natural for third-party estimators that do not inherit from scikit-learn base classes to incrementally define new tags without having to re-implement the _get_tags machinery from scratch.
The documentation is now simpler to follow as well.
I like the new version of _safe_tags that takes _more_tags into account if _get_tags is not present.
I guess I'm fine with the code, but regarding the docs: with that in place, a non-inheriting third-party lib has no reason to ever implement _get_tags(), does it? Unless they want to use the tags in their own code... In which case they'll have to switch from _more_tags() to _get_tags(), which will be annoying for them. But why would a library use the tags machinery in its code while still not inheriting...?
In other words, do we even want to document "you can also override _get_tags()"? We could just say "you need to define _more_tags() if you want to override the defaults, and if you want to access tag values that you don't override (i.e. that are not in your own _more_tags()), you'll need to inherit from BaseEstimator."
To override the tags of a child class, one must define the `_more_tags()` | ||
method and return a dict with the desired tags, e.g:: | ||
It is unlikely that the default values for each tag will suit the needs of your | ||
specific estimator. Additional tags can be created or default tags can be |
"Additional tags can be created"
I thought we agreed not to support that #18797 (comment)? (or that's how I interpret @ogrisel's +1)
There is a difference between supporting it in _safe_tags and people creating their own tags within their libraries using _more_tags. This is a real need here:
We have something similar in imbalanced-learn since the introduction of tags.
My +1 was to remove the default param of _safe_tags. I think third-party implementers are free to add other tags in their own estimators if they wish. cuML is already doing it in their master branch apparently:
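As an illustration of a library adding its own tag through `_more_tags`, the sketch below collects tags along the class hierarchy so that subclasses override their parents. `TagsBase` is a hypothetical stand-in written for this example (it is not scikit-learn's `BaseEstimator`), and `on_device` is a made-up library-specific tag.

```python
import inspect


class TagsBase:
    """Hypothetical stand-in base class, used only for this sketch."""

    def _more_tags(self):
        return {"allow_nan": False, "requires_y": False}

    def _get_tags(self):
        collected = {}
        # Walk from the most generic class to the most specific one so
        # that tags defined in subclasses override those of their parents.
        for klass in reversed(inspect.getmro(type(self))):
            if "_more_tags" in vars(klass):
                collected.update(klass._more_tags(self))
        return collected


class GpuEstimator(TagsBase):
    def _more_tags(self):
        # "on_device" is a made-up tag that only this library knows about.
        return {"allow_nan": True, "on_device": True}


tags = GpuEstimator()._get_tags()
```

The custom tag simply rides along in the returned dict; code that does not know about `on_device` never looks it up.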
How do you deal with CuML case: inheriting is not an option. If they want to use tags (for new checks for instance) internally, we are forcing them to call IMO, it is not a burden to mention that if you want to access your tags by implementing |
@NicolasHug would it be fine with you if we merge this PR, since you are fine with the code? This would allow us to branch 0.24.X and start the release PR for 0.24.0rc1. We can always fine-tune the doc before 0.24.0 final if needed. |
if hasattr(estimator, "_get_tags"): | ||
tags_provider = "_get_tags()" | ||
tags = estimator._get_tags() | ||
elif hasattr(estimator, "_more_tags"): | ||
tags_provider = "_more_tags()" | ||
tags = {**_DEFAULT_TAGS, **estimator._more_tags()} |
Now that we rely on _more_tags regardless of inheritance, what's the rationale for defaulting to _DEFAULT_TAGS with _more_tags but not with _get_tags?
I admit I'm a bit lost on all the possible code paths and use-cases here. It seems that we're overly permissive in some cases while being restrictive in others, with no obvious reason. Things were clearer to me when the logic was "with inheritance -> define _more_tags, no inheritance -> define _get_tags".
But anyway, feel free to merge if we need to move with the release. This is still experimental after all.
I think the message is simpler if we always recommend defining _more_tags, whether or not you inherit from BaseEstimator.
The idea now is to always implement _more_tags: it will work and should cover 99% of the use cases.
The remaining 1% is people who do not inherit and still want to use tags: they implement _get_tags, with strong requirements on our side regarding defaults.
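The resolution order discussed in this thread can be sketched as follows. This is a simplified illustration based on the hunk quoted above, not the merged implementation: `DEFAULT_TAGS` is a small hypothetical subset of `_DEFAULT_TAGS`, and the final fallback branch and the `ValueError` on a missing key are assumptions made for the example.

```python
# Hypothetical subset of default tags, for illustration only.
DEFAULT_TAGS = {"allow_nan": False, "requires_y": False}


def safe_tags_sketch(estimator, key=None):
    """Resolve tags for an estimator that may not inherit from any base."""
    if hasattr(estimator, "_get_tags"):
        # Trust _get_tags() as-is: it must return every tag itself.
        tags = estimator._get_tags()
    elif hasattr(estimator, "_more_tags"):
        # Overlay the estimator's overrides on top of the defaults.
        tags = {**DEFAULT_TAGS, **estimator._more_tags()}
    else:
        # Assumption: with no tag method at all, use the defaults.
        tags = dict(DEFAULT_TAGS)

    if key is None:
        return tags
    if key not in tags:
        # Assumption: a missing tag is reported rather than defaulted.
        raise ValueError(f"The key {key!r} is not defined for this estimator.")
    return tags[key]


class OnlyMoreTags:
    """Duck-typed estimator overriding a single default."""

    def fit(self, X, y=None):
        return self

    def _more_tags(self):
        return {"allow_nan": True}


class OnlyGetTags:
    """Duck-typed estimator that takes full responsibility for its tags."""

    def _get_tags(self):
        return {"allow_nan": True}


class NoTags:
    """Duck-typed estimator that defines no tag method."""
```

In this sketch, `OnlyMoreTags` gets the defaults filled in for free, while `OnlyGetTags` is held to the stricter contract and fails on any tag it forgot to return.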
Merged thanks all! |
closes #18820
This PR reintroduces _safe_tags, so that third-party libraries do not have to either inherit from BaseEstimator or implement the tags.