TST introduce _safe_tags for estimator not inheriting from BaseEstimator #18797

glemaitre · 2020-11-09T16:00:29Z

This PR reintroduces _safe_tags so that third-party libraries do not have to either inherit from BaseEstimator or implement the tags.


ogrisel · 2020-11-09T16:04:46Z

We would need a test to check that check_estimator passes on a minimal estimator class that does not inherit from the scikit-learn base classes (at least for mode="compatible" in #18582).
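For readers following along, here is a rough sketch of what such a "minimal estimator class" could look like. It is only an illustration in plain Python: the class name `MinimalClassifier`, its `param` argument, and the majority-vote logic are hypothetical, and nothing here claims that this exact class passes check_estimator.

```python
from collections import Counter


class MinimalClassifier:
    """Hypothetical duck-typed classifier that imports nothing from scikit-learn."""

    # Assumption: helpers such as is_classifier look at this attribute.
    _estimator_type = "classifier"

    def __init__(self, param=None):
        self.param = param

    def get_params(self, deep=True):
        # Expose constructor arguments so the estimator can be cloned.
        return {"param": self.param}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        # Remember the sorted labels and the most frequent one.
        counts = Counter(y)
        self.classes_ = sorted(counts)
        self.majority_ = counts.most_common(1)[0][0]
        return self

    def predict(self, X):
        # Always predict the majority class seen during fit.
        return [self.majority_ for _ in X]


clf = MinimalClassifier().fit([[0], [1], [2]], ["a", "b", "b"])
predictions = clf.predict([[3], [4]])
```

The point of the sketch is only that fit, predict, get_params and set_params can be provided without any scikit-learn import; tags are deliberately left out, which is the case this PR is about.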

rth · 2020-11-09T16:06:50Z

I remember we removed _safe_tags some time ago and I was really sure we had a test for an estimator not inheriting from BaseEstimator in sklearn/utils/tests/test_estimator_checks.py. Maybe we "fixed" the test at the same time, not sure...

rth · 2020-11-09T16:09:01Z

#16950 but it looks like there was no test indeed.

ogrisel · 2020-11-09T16:09:49Z

The goal would be to honor our old duck typing contract, which I find fundamental to the spirit of the scikit-learn API design:

https://scikit-learn.org/dev/developers/develop.html#rolling-your-own-estimator

BaseEstimator and mixins:

We tend to use “duck typing”, so building an estimator which follows the API suffices for compatibility, without needing to inherit from or even import any scikit-learn classes.

However, if a dependency on scikit-learn is acceptable in your code, you can prevent a lot of boilerplate code by deriving a class from BaseEstimator and optionally the mixin classes in sklearn.base. For example, below is a custom classifier, with more examples included in the scikit-learn-contrib project template.

NicolasHug · 2020-11-09T16:17:44Z

The goal would be to honor our old duck typing contract, which I find fundamental to the spirit of the scikit-learn API design

IMHO, having merged #16950 does not violate our duck typing contract: estimators implementing fit, predict etc will work perfectly with the internal tools like cross_validate, etc.

What #16950 does is that it forces estimators to inherit from BaseEstimator in order to run check_estimator. Which is quite different, and I would say it's a reasonable constraint.

We don't force a dependency on sklearn for third-party libraries, we just force a dependency on sklearn for their developers.

So I'd be +0 to bring this back. If we do, let's please explicitly write a comment indicating that this should not be removed, possibly with a link here

rth · 2020-11-09T16:22:25Z

is that it forces estimators to inherit from BaseEstimator in order to run check_estimator.

A lot of projects wouldn't want to add scikit-learn as a dependency (e.g. lightgbm, scorch), which in turn means they have no way to check that their API is compliant.

+1 to put it back, especially since it's a minor change on our side.

jeremiedbb · 2020-11-09T16:23:39Z

@NicolasHug here (https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator) it says that in order to be a scikit-learn compatible estimator, an estimator must pass check_estimator

rth · 2020-11-09T16:24:49Z

Though yes, they could actually define a _get_tags manually. If they don't set tags and don't inherit from sklearn, the determination of regressor vs classifier for tests won't work anyway, would it?

Edit: indeed, is_classifier or is_regressor helpers won't produce anything meaningful with arbitrary third party python classes at present. And then the risk is that check_estimator might pass, but not actually run any of the relevant checks, which is another reason I think check_estimator should be avoided in favor of parametrize_with_checks #18750 (comment)

ogrisel · 2020-11-09T16:34:39Z

Edit: indeed, is_classifier or is_regressor helpers won't produce anything meaningful with arbitrary third party python classes at present. And then the risk is that check_estimator might pass, but not actually run any of the relevant checks...

We could add a check that fails if the estimator has a predict method but both is_classifier and is_regressor return False, WDYT?
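A minimal sketch of the check proposed above, written without importing scikit-learn. The helper names `looks_like_classifier`, `looks_like_regressor` and `check_predictor_type_is_declared` are hypothetical, and the helpers assume the `_estimator_type` attribute convention rather than calling the real is_classifier / is_regressor.

```python
def looks_like_classifier(estimator):
    # Assumption: mimics the `_estimator_type` convention, not the sklearn helper.
    return getattr(estimator, "_estimator_type", None) == "classifier"


def looks_like_regressor(estimator):
    return getattr(estimator, "_estimator_type", None) == "regressor"


def check_predictor_type_is_declared(estimator):
    """Hypothetical check: a predictor must be a classifier or a regressor."""
    if not hasattr(estimator, "predict"):
        return  # not a predictor, nothing to verify
    if not (looks_like_classifier(estimator) or looks_like_regressor(estimator)):
        raise AssertionError(
            f"{type(estimator).__name__} has a predict method but is neither "
            "a classifier nor a regressor; type-specific checks would be skipped."
        )


class UntypedPredictor:
    def predict(self, X):
        return X


class TypedRegressor(UntypedPredictor):
    _estimator_type = "regressor"


check_predictor_type_is_declared(TypedRegressor())  # passes silently
```

With something like this in the common checks, an estimator such as `UntypedPredictor` would fail loudly instead of silently skipping the classifier- and regressor-specific checks.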

NicolasHug · 2020-11-09T17:21:55Z

@NicolasHug here (scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator) it says that in order to be a scikit-learn compatible estimator, an estimator must pass check_estimator

I understand that but I'm not sure what your point is @jeremiedbb ?

Mine is that scikit-learn is not a dependency for users of third-party libraries. It's only a dependency for their developers. And as the comments above suggest, I doubt that one can get anything meaningful out of check_estimator without inheriting from at least some of our Mixins. (It's the same for parametrize_with_checks BTW)

But I'm not opposed to adding _safe_tags back anyway.

ogrisel · 2020-11-09T17:22:05Z

which is another reason I think check_estimator should be avoided in favor of parametrize_with_checks

+1 for pushing more 3rd party libraries to use parametrize_with_checks if they use pytest but check_estimator is still useful for interactive checks / demos in ipython sessions or jupyter notebooks.

jeremiedbb · 2020-11-09T17:33:31Z

I understand that but I'm not sure what your point is @jeremiedbb ?

@NicolasHug I have the feeling that neither we nor our docs are sure about what is required for a third party estimator to be scikit-learn compatible. The doc says it needs to pass check_estimator. But running check_estimator more or less requires a dependency on sklearn, which we don't want to enforce. My comment was just that I'm a bit confused here :)

adrinjalali · 2020-11-10T08:33:35Z

I'm -0 on this one. I see the tags as a part of our API, and estimators should implement them. I'd say if users want to pass check_estimator, they should also inherit from the right classes. They can choose not to pass these tests and their estimators may work w/o an issue in most use cases, but they are not completely scikit-learn compatible.

This PR adds IMO unnecessary complexity to our codebase. This means we should always use _safe_tags instead of getting the tags, everywhere in the codebase, and I don't see why that's necessary when we can avoid it if we require users to inherit from the right class or implement everything they should, themselves.

glemaitre · 2020-11-10T08:39:22Z

This means we should always use _safe_tags instead of getting the tags, everywhere in the codebase

This is a very good point. If we were limiting the use of tags to the common tests, we would not have this issue. However, we started to use the tags elsewhere, in the estimators themselves.

glemaitre · 2020-11-10T08:42:24Z

As I mentioned there, #18798 (comment), if we want solely to make tags part of our API, we might want to isolate this functionality in a Mixin. I am not sure this is a good solution though; it would force multiple inheritance even in the common case (i.e. BaseEstimator, TagsMixin).

adrinjalali · 2020-11-10T08:47:22Z

Do you have a case where the user would want to inherit TagsMixin but not BaseEstimator?

rth · 2020-11-10T08:53:38Z

I'd say if users want to pass check_estimator, they should also inherit from the right classes
[..] or implement everything they should, themselves.

I don't think we should suggest that they inherit from scikit-learn. Being compatible with scikit-learn API is and should be unrelated to depending on scikit-learn.

However yes, _safe_tags also has its issues. We can provide default tags, but will they be appropriate for the estimator in question? Possibly but not sure until someone actually looks into tags, so they might as well re-implement them. However then for contrib projects the issue is that they would need to also implement tag inheritance via _get_tags, if they have many estimators to avoid having repeated N tags in each estimator.

So in that sense maybe _safe_tags is still useful as it would minimize the amount of work a contrib project would need to do (only tags that are different from defaults).
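To make the "tag inheritance" burden concrete, here is a rough sketch of what a contrib project would have to write itself if it neither inherits from scikit-learn nor relies on _safe_tags. The class names and the three-entry `DEFAULT_TAGS` dict are assumptions for illustration (a small subset of tag names, not the full set of defaults).

```python
import inspect

# Illustrative subset of default tags; the real defaults contain more entries.
DEFAULT_TAGS = {"requires_y": False, "allow_nan": False, "poor_score": False}


class TagsBase:
    """Hypothetical base class of a contrib project (not sklearn.base.BaseEstimator)."""

    def _get_tags(self):
        tags = dict(DEFAULT_TAGS)
        # Walk the MRO from the most generic class to the most specific one,
        # so that subclasses override the tags declared by their parents.
        for klass in reversed(inspect.getmro(type(self))):
            more_tags = klass.__dict__.get("_more_tags")
            if more_tags is not None:
                tags.update(more_tags(self))
        return tags


class ProjectEstimator(TagsBase):
    def _more_tags(self):
        return {"allow_nan": True}


class SupervisedEstimator(ProjectEstimator):
    def _more_tags(self):
        # Only the tag that differs from the parent needs to be declared.
        return {"requires_y": True}


tags = SupervisedEstimator()._get_tags()
```

Each estimator only lists the tags that differ from its parents, but the project has to own and maintain the MRO-walking `_get_tags` itself, which is the extra work mentioned above.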

This means we should always use _safe_tags instead of getting the tags, everywhere in the codebase,

Not everywhere, just in common tests and meta-estimators. We did it before. It's certainly not ideal, but also not such a big deal maybe? Though the issue there is that it can start to be used in contrib projects as well, since they would find it as the way of getting tags in our code base :/

glemaitre · 2020-11-10T08:56:35Z

It would be someone that needs to redefine set/get/params/states and does not want to check_X_y within _validate_data. Of course, they could inherit their base class from it and overwrite these methods but the easiest way would be to implement your base from scratch for your use case. This looks like the use-case of cuml but I agree that it is not the most common and expected use case.

adrinjalali · 2020-11-10T09:35:44Z

Even if we were to accept a third-party estimator as scikit-learn compatible w/o them inheriting from the right classes, I don't see why we'd need to do extra work for them to be compatible while they don't implement the API we require. This puts a severe burden on us moving forward while designing our API.

NicolasHug · 2020-11-10T11:26:49Z

I don't understand why we introduce this PR: the goal seems to be that we don't want to force third-party libraries to depend on scikit-learn to pass the check suite.

But to pass check_estimator, one needs to call check_estimator, so they need to have scikit-learn, right? Now if their estimator MyEst doesn't inherit from BaseEstimator, all they need to do is

class MyWrappedEst(BaseEstimator, MyEst):
    pass

and call check_estimator on MyWrappedEst instead of MyEst

BTW, I agree with most of what @adrinjalali said but I don't think this is true:

This means we should always use _safe_tags instead of getting the tags, everywhere in the codebase

This is only related to the check suite, not the rest of the code-base

ogrisel · 2020-11-10T12:32:05Z

But to pass check_estimator, one needs to call check_estimator, so they need to have scikit-learn, right?

test dependencies are not necessarily runtime dependencies.

NicolasHug · 2020-11-10T12:42:25Z

test dependencies are not necessarily runtime dependencies.

Yes this is what I'm trying to say since the beginning: currently in master, we don't force third party libraries to have scikit-learn as a runtime dependency.

This PR doesn't change anything w.r.t. dependency. All it does is remove the need for estimators to inherit from BaseEstimator in order to pass check_estimator.

jeremiedbb · 2020-11-10T13:20:09Z

My understanding is that check_estimator is a tool that can be used by third party estimators to check that such an estimator follows the scikit-learn api and can be used in scikit-learn model evaluation and model selection tools. So we don't want to force an estimator to inherit from BaseEstimator to pass check_estimator.

On the other hand, if a third party estimator doesn't want to inherit from BaseEstimator, it needs to implement all important parts of the scikit-learn api. tags is a part of that (as listed here https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator).

So I join Nicolas and Adrin and I don't think this PR is necessary since it's clear in the docs that implementing tags is mandatory to be a compatible estimator.

sklearn/tests/test_pipeline.py

NicolasHug · 2020-12-01T13:22:46Z

sklearn/utils/_tags.py

+    For scikit-learn built-in estimators, we should still rely on
+    `self._get_tags()`. `_safe_tags(est)` should be used when we are not sure
+    where `est` comes from: typically `_safe_tags(self.base_estimator)` where
+    `self` is a meta-estimator, or in the common checks.


Nit but using backquotes on doc that will never be rendered as html adds noise imho

NicolasHug · 2020-12-01T13:23:55Z

sklearn/utils/_tags.py

+    default : list of {str, dtype} or bool, default=None
+        When `esimator.get_tags()` is not implemented, default` allows to
+        define the default value of a tag if it is not present in
+        `_DEFAULT_TAGS` or to overwrite the value in `_DEFAULT_TAGS` if it the


Suggested change

-        `_DEFAULT_TAGS` or to overwrite the value in `_DEFAULT_TAGS` if it the
+        `_DEFAULT_TAGS` or to overwrite the value in `_DEFAULT_TAGS` if the

NicolasHug · 2020-12-01T13:24:59Z

sklearn/utils/_tags.py

+        define the default value of a tag if it is not present in
+        `_DEFAULT_TAGS` or to overwrite the value in `_DEFAULT_TAGS` if it the


This is not needed at the moment IMO. Maybe we can leave this feature for later, in order to keep things minimal here?

NicolasHug · 2020-12-01T13:31:27Z

sklearn/utils/_tags.py

+    if hasattr(estimator, "_get_tags"):
+        if key is not None:
+            try:
+                return estimator._get_tags().get(key, _DEFAULT_TAGS[key])


this goes against our docs:

Note however that all tags must be present in the dict

What bothers me here is that there's no difference anymore between _get_tags() and _more_tags() from a third-party point of view.

jeremiedbb · 2020-12-01T15:18:20Z

this goes against our docs:
> Note however that all tags must be present in the dict

I think it's coherent with the proposed changes to the doc:

 you will need to implement a `_get_tags()` method which returns a dict that
  should contains all the necessary tags for that estimator, including the
  default tags typically defined in :class:`~sklearn.base.BaseEstimator` and
  other scikit-learn mixin classes. Note however that **all tags must be
  present in the dict**. If any of the keys documented above is not present in
  the output of `_get_tags()`, an error might occur.

NicolasHug · 2020-12-01T15:23:07Z

I don't think I agree: when doing estimator._get_tags().get(key, _DEFAULT_TAGS[key]), the following isn't true anymore:

Note however that all tags must be
present in the dict. If any of the keys documented above is not present in
the output of _get_tags(), an error might occur

No error will ever be raised if a tag isn't returned by _get_tags(), and so we don't require all keys to exist. In effect, implementing _get_tags() is exactly the same as implementing _more_tags() for 3rd parties

Strictly following the docs would mean doing estimator._get_tags()[key] (which is what I would prefer).
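The difference between the two lookups can be seen on a toy example. This is only a sketch: `DEFAULT_TAGS` is a two-entry stand-in for `_DEFAULT_TAGS`, and the class and function names are hypothetical.

```python
# Stand-in for _DEFAULT_TAGS, reduced to two entries for illustration.
DEFAULT_TAGS = {"allow_nan": False, "requires_y": False}


class IncompleteTags:
    """Hypothetical third-party estimator whose _get_tags() omits a tag."""

    def _get_tags(self):
        return {"allow_nan": True}  # "requires_y" is missing


def permissive_lookup(estimator, key):
    # Falls back to the default silently when the tag is missing.
    return estimator._get_tags().get(key, DEFAULT_TAGS[key])


def strict_lookup(estimator, key):
    # Raises KeyError when the tag is missing, as the docs imply.
    return estimator._get_tags()[key]


est = IncompleteTags()
fallback_value = permissive_lookup(est, "requires_y")
```

With the permissive version the missing "requires_y" tag goes unnoticed and the default is returned; with the strict version the same call raises a KeyError, so an incomplete `_get_tags()` is detected.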


jeremiedbb · 2020-12-01T16:20:35Z

Right, I did not understand what you were referring to in the first place. I agree that the code does not reflect the doc currently. Guillaume is working on it :)

sklearn/model_selection/_search.py

glemaitre · 2020-12-01T17:49:44Z

sklearn/model_selection/tests/test_validation.py

@@ -1985,15 +1985,3 @@ def _more_tags(self):
           "Set the estimator tags of your estimator instead")
    with pytest.warns(FutureWarning, match=msg):
        cross_validate(svm, linear_kernel, y, cv=2)
-
-    # the _pairwise attribute is present and set to True while the pairwise


@NicolasHug by not being permissive (getting default with _get_tags), we need to remove this test. What do you think about this?

We are not sure that this case is actually possible in practice.

I haven't followed the introduction of the pairwise tag but since the test assumes that the tag doesn't exist and since we're telling 3rd parties that all tags should exist, I'd say it makes sense to remove the test

glemaitre · 2020-12-01T17:50:22Z

sklearn/tests/test_base.py

@@ -558,24 +558,9 @@ class IncorrectTagPCA(KernelPCA):
    with pytest.warns(FutureWarning, match=msg):
        assert not _is_pairwise(pca)

-    # the _pairwise attribute is present and set to False while the pairwise
-    # tag is not present
-    class FalsePairwise(BaseEstimator):


We have a second test with the same issue.

sklearn/tests/test_pipeline.py

sklearn/utils/_tags.py

ogrisel

I like the new version of _safe_tags that takes _more_tags into account if _get_tags is not present.

The code is now simpler and it's more natural for third party estimators that do not inherit from scikit-learn base classes to incrementally define new tags without having to re-implement the _get_tags machinery from scratch.

The documentation is now simpler to follow as well.

sklearn/utils/_testing.py

sklearn/utils/estimator_checks.py

NicolasHug

I like the new version of _safe_tags that takes _more_tags into account if _get_tags is not present.

I guess I'm fine with the code but regarding the docs: with that in place, a non-inheriting 3rd party lib has no reason to ever implement _get_tags(), does it? Unless they want to use the tags in their own code... In which case they'll have to switch from _more_tags() to _get_tags(), which will be annoying to them. But why would a library use the tags machinery in its code while still not inheriting...?

In other words, do we even want to document "you can also override _get_tags()"? We could just say "you need to define _more_tags() if you want to override the defaults, and if you want to access tags values that you don't override (i.e. that are not in your own-defined _more_tags()), you'll need to inherit from BaseEstimator."

NicolasHug · 2020-12-02T08:24:38Z

doc/developers/develop.rst

-To override the tags of a child class, one must define the `_more_tags()`
-method and return a dict with the desired tags, e.g::
+It is unlikely that the default values for each tag will suit the needs of your
+specific estimator. Additional tags can be created or default tags can be


"Additionnal tags can be created"

I thought we agreed not to support that #18797 (comment)? (or that's how I interpret @ogrisel's +1)

There is a difference between supporting in _safe_tags and people creating their own tags within their libraries using _more_tags. This is a real need here:

https://github.com/rapidsai/cuml/pull/3113/files#diff-e4bd6eee2eca2b0619b03a5f6ba7b471b4ca03080a6619d0079105d5f13c2165R34-R35

We have something similar in imbalanced-learn since the introduction of tags.

My +1 was to remove the default param of _safe_tags. I think third-party implementers are free to add other tags in their own estimators if they wish. cuML is already doing it in their master branch apparently:

https://github.com/rapidsai/cuml/pull/3113/files

glemaitre · 2020-12-02T08:51:34Z

In other words, do we even want to document "you can also override _get_tags()"? We could just say "you need to define _more_tags() if you want to override the defaults, and if you want to access tags values that you don't override (i.e. that are not in your own-defined _more_tags()), you'll need to inherit from BaseEstimator."

How do you deal with the cuML case, where inheriting is not an option? If they want to use tags internally (for new checks, for instance), we are forcing them to call _more_tags instead of their own implementation of _get_tags.

IMO, it is not a burden to mention that if you want to access your tags by implementing _get_tags, you need to have all scikit-learn defaults because we are going to raise an error otherwise.

ogrisel · 2020-12-02T09:46:14Z

@NicolasHug would it be fine with you if we merge this PR, since you are fine with the code? This would allow us to branch 0.24.X and start the release PR for 0.24.0rc1.

We can always fine tune the doc before 0.24.0 final if needed.

NicolasHug · 2020-12-02T10:13:20Z

sklearn/utils/_tags.py

+    if hasattr(estimator, "_get_tags"):
+        tags_provider = "_get_tags()"
+        tags = estimator._get_tags()
+    elif hasattr(estimator, "_more_tags"):
+        tags_provider = "_more_tags()"
+        tags = {**_DEFAULT_TAGS, **estimator._more_tags()}


Now that we rely on _more_tags regardless of inheritance, what's the rationale for defaulting to _DEFAULT_TAGS with _more_tags but not with _get_tags?

I admit I'm a bit lost on all the possible code paths and use-cases here. It seems that we're overly permissive in some cases while being restrictive in others, with no obvious reason. Things were clearer to me when the logic was "with inheritance -> define _more_tags, no inheritance -> define _get_tags".

But anyway, feel free to merge if we need to move with the release. This is still experimental after all.

I think the message is simpler: always recommend defining _more_tags, whether or not you inherit from BaseEstimator.

The idea now is to always implement _more_tags: it will work and it should cover 99% of the use cases.

The remaining 1% is no inheritance and people that want to use tags -> implement _get_tags with strong requirements on our side regarding defaults.
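Putting the hunk quoted above and this explanation together, the resolution order can be sketched as follows. This is a simplified illustration, not the merged function: the name `safe_tags_sketch`, the reduced `_DEFAULT_TAGS` dict, the final fallback branch and the error message are assumptions.

```python
# Reduced stand-in for the real defaults, for illustration only.
_DEFAULT_TAGS = {"allow_nan": False, "requires_y": False}


def safe_tags_sketch(estimator, key=None):
    """Sketch of the lookup order: _get_tags, then _more_tags, then defaults."""
    if hasattr(estimator, "_get_tags"):
        # Full responsibility is on the estimator: no defaults are merged in.
        tags_provider = "_get_tags()"
        tags = estimator._get_tags()
    elif hasattr(estimator, "_more_tags"):
        # The estimator only declares what differs from the defaults.
        tags_provider = "_more_tags()"
        tags = {**_DEFAULT_TAGS, **estimator._more_tags()}
    else:
        # Assumed fallback: nothing is declared, so use the defaults as-is.
        tags_provider = "_DEFAULT_TAGS"
        tags = dict(_DEFAULT_TAGS)

    if key is None:
        return tags
    if key not in tags:
        raise ValueError(f"The key {key} is not defined in {tags_provider}.")
    return tags[key]


class OnlyMoreTags:
    def _more_tags(self):
        return {"allow_nan": True}


class PartialGetTags:
    def _get_tags(self):
        return {"allow_nan": True}  # incomplete on purpose


class NoTags:
    pass


merged = safe_tags_sketch(OnlyMoreTags())
```

In this sketch `OnlyMoreTags` gets the defaults for free (the 99% case), while `PartialGetTags` triggers an error as soon as a missing tag such as "requires_y" is requested (the 1% case with strong requirements).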

ogrisel · 2020-12-02T10:19:08Z

Merged thanks all!

TST reintroduce _safe_tags for estimator not inheriting from BaseEsti…

a68194b

…mator

glemaitre marked this pull request as draft November 9, 2020 16:00

github-actions bot added the module:utils label Nov 9, 2020

typo

36f1c5c

thomasjpfan mentioned this pull request Nov 9, 2020

Do we have a compelling reason to enforce tags? #18798

Open

TST implement minimal classifier

9e54014

glemaitre mentioned this pull request Nov 11, 2020

[NoMRG] evaluate minimal implementation for sklearn estimator #18811

Closed

PEP8

eb9c41b

ogrisel reviewed Dec 1, 2020

sklearn/tests/test_pipeline.py

Rephrase test comment [ci skip]

8227075

NicolasHug reviewed Dec 1, 2020

glemaitre added 2 commits December 1, 2020 16:49

iter

07dace1

Merge remote-tracking branch 'glemaitre/reintroduce_safe_tags' into r…

f3c2b02

…eintroduce_safe_tags

glemaitre added 2 commits December 1, 2020 17:33

iter

d76b684

iter

e8fa827

ogrisel reviewed Dec 1, 2020

sklearn/model_selection/_search.py

glemaitre added 2 commits December 1, 2020 18:11

update doc

ed26968

fix test

b8ecc41

glemaitre commented Dec 1, 2020

doc

bb10791

ogrisel reviewed Dec 1, 2020

sklearn/tests/test_pipeline.py

ogrisel reviewed Dec 1, 2020

sklearn/utils/_tags.py

ogrisel reviewed Dec 1, 2020

glemaitre added 2 commits December 2, 2020 09:16

answer ogrisel comments

b19137d

more coverage

754539f

NicolasHug reviewed Dec 2, 2020

ogrisel merged commit 255718b into scikit-learn:master Dec 2, 2020

ogrisel mentioned this pull request Dec 2, 2020

Error raised during grid search on pipeline with None for transformer step #18815

Closed

lorentzenchr mentioned this pull request Dec 5, 2020

[WIP] sample props (proposal 4) #16079

Closed
