[MRG] Estimator tags #8022
You might want to note that these tags may depend on estimator parameters and even system architecture, and hence are a method on an instance rather than a property of the class.
You should probably also define the default implementation and `_more_tags`, or do you consider that even more experimental?
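A minimal sketch of why the tags live on the instance rather than the class. Only the `_more_tags` name comes from the discussion; the class `ToyImputer`, its `strategy` parameter and the `allow_nan` tag are hypothetical, and this is not scikit-learn's implementation.

```python
class ToyImputer:
    """Hypothetical estimator whose tags depend on a constructor parameter."""

    def __init__(self, strategy="mean"):
        self.strategy = strategy

    def _more_tags(self):
        # Computed per instance, so the answer can reflect the parameters
        # (or anything else that is only known at runtime).
        return {"allow_nan": self.strategy != "drop"}


lenient = ToyImputer(strategy="mean")._more_tags()
strict = ToyImputer(strategy="drop")._more_tags()
```

A class-level attribute could not express this, because two instances of the same class give different answers.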
(should we deprecate the need for a call to fit for initialisation in stateless estimators?)
I don't think we can, because "stateless" can still mean it depends on `n_features`. `RBFSampler`, for example, needs to sample a random matrix of shape `(n_features, n_components)`. I'm not sure how to do it unless we call `fit`.
Do we error if `Normalizer` is called with a different `n_features` during transform? Do we want to?
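The point about `n_features` can be sketched with a toy transformer. `ToyRandomProjection` and its parameters are hypothetical and written in pure Python; it only mimics the idea of sampling a `(n_features, n_components)` matrix and is not how `RBFSampler` is implemented.

```python
import random


class ToyRandomProjection:
    """Hypothetical transformer: learns nothing from the values in X,
    but still needs the width of X to build its projection matrix."""

    def __init__(self, n_components=2, seed=0):
        self.n_components = n_components
        self.seed = seed

    def fit(self, X):
        n_features = len(X[0])
        rng = random.Random(self.seed)
        # Shape (n_features, n_components): depends on the shape of X only.
        self.weights_ = [
            [rng.gauss(0.0, 1.0) for _ in range(self.n_components)]
            for _ in range(n_features)
        ]
        return self

    def transform(self, X):
        # Plain matrix product X @ weights_, one output row per input row.
        return [
            [sum(x * w[j] for x, w in zip(row, self.weights_))
             for j in range(self.n_components)]
            for row in X
        ]
```

Fitting on two datasets with the same width but different values yields the same `weights_`, which is the sense in which such an estimator is data independent without being free of state.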
I'm not sure "stateless" is the right word then? We mean "data independent"?
do we want to have separate tags for data independent and no state at all?
Dunno. What's the use case? If so, we could consider a ternary tag...
I guess the main use-case here was that some estimators didn't complain if the number of features was different in fit and transform, possibly only AdditiveChi2Sampler.
Though this estimator still required calling fit.
I guess we can define stateless as "doesn't require calling fit" and everything that requires calling fit should check the shape?
Hm, AdditiveChi2Sampler requires calling `fit` for no reason actually...
Opened #12616 to follow up. I don't think there's a good reason for a ternary tag. Right now this is used for testing two things: checking that calling `transform` before `fit` will raise a nice error, and checking that the number of features needs to be consistent between `fit` and `transform`.
These checks are somewhat independent, but both related to being stateless.
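The two checks described above could look roughly like this. Everything here is a self-contained sketch: `ToyScaler`, the local `NotFittedError` stand-in and both `check_*` helpers are hypothetical names, not the functions used in the test suite.

```python
class NotFittedError(ValueError):
    """Local stand-in for a 'not fitted' error, so the sketch is self-contained."""


class ToyScaler:
    """Hypothetical transformer that records the width of X in fit."""

    def fit(self, X):
        self.n_features_ = len(X[0])
        return self

    def transform(self, X):
        if not hasattr(self, "n_features_"):
            raise NotFittedError("Call fit before transform.")
        if any(len(row) != self.n_features_ for row in X):
            raise ValueError("X has a different number of features than in fit.")
        return [list(row) for row in X]


def check_transform_before_fit_raises(estimator_cls):
    # Check 1: an unfitted estimator must fail with a clear error.
    try:
        estimator_cls().transform([[1.0, 2.0]])
    except NotFittedError:
        return True
    return False


def check_n_features_consistent(estimator_cls):
    # Check 2: a fitted estimator must reject X of a different width.
    est = estimator_cls().fit([[1.0, 2.0]])
    try:
        est.transform([[1.0, 2.0, 3.0]])
    except ValueError:
        return True
    return False
```

An estimator tagged as stateless would be exempt from both checks, which is why a single boolean tag can cover them even though they test different things.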
There may be some subtlety to this. What if it supports NaN at transform but not at fit (with some parameters)?
yes, or the other way around?
I don't think the other way around would be a case of interest. In #11635 we identified that you might have a feature selector that could not train on missing data (if only because the parameters weren't right) but there's no reason it shouldn't transform with missing data.
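A toy version of that feature-selector case, as a sketch only. `ToyVarianceSelector` is hypothetical and pure Python; it is meant to show why `fit` and `transform` can differ in NaN support, not to mirror any existing selector.

```python
import math


class ToyVarianceSelector:
    """Hypothetical selector: fit rejects NaN, transform passes NaN through."""

    def fit(self, X):
        if any(math.isnan(v) for row in X for v in row):
            raise ValueError("NaN is not supported in fit.")
        n = len(X)
        columns = list(zip(*X))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n
            for col, m in zip(columns, means)
        ]
        # Keep only the columns that are not constant.
        self.support_ = [i for i, var in enumerate(variances) if var > 0.0]
        return self

    def transform(self, X):
        # Column selection never inspects the values, so NaN is harmless here.
        return [[row[i] for i in self.support_] for row in X]
```

A single "supports NaN" tag cannot describe this estimator accurately, which is the subtlety raised above.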
if the _ doesn't mean private, perhaps we can use something like !
it kinda means private in the sense that no-one should ever use it ;)
X_types is undocumented at present, and is mysterious... should it not be a series of boolean tags instead of a list?
That would require us to define a list of possible input types now and it would be harder to change in the future though, right?
Why is a set of boolean tags harder than a list?
I felt it might be more natural to add new things to a set/list than add another boolean variable to a set/list of boolean variables.
If the set of boolean variables that are present might change, then the logic will be strictly more complex than with a list, right?
Dunno. A list is fine... and could have benefits if the objects in the list are not merely strings
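For comparison, the two designs discussed in this thread might be consumed as follows. The type names (`"2darray"`, `"sparse"`) and the `accepts_*` keys are illustrative assumptions, as are both helper functions.

```python
# Variant 1: one list-valued tag; a new input type is just a new element.
tags_as_list = {"X_types": ["2darray", "sparse"]}

# Variant 2: one boolean tag per input type; a new input type is a new key,
# and consumers must cope with the key being absent on older estimators.
tags_as_flags = {"accepts_2darray": True, "accepts_sparse": True}


def supports_list(tags, kind):
    return kind in tags.get("X_types", [])


def supports_flags(tags, kind):
    return tags.get("accepts_" + kind, False)
```

Both lookups end up being one line; the difference is mainly in how the set of known input types is allowed to grow over time.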
Can we not determine this automatically by inspecting the `Estimator.__init__` signature? Or are there cases where the two don't match?
@jnothman asked the same, so maybe my intentions are indeed unclear.
The point here is that we currently check that everything can be constructed without parameters in sklearn with a few exceptions. I'd like to keep checking that. This parameter is there to say "I'm really sure I want this to require parameters". I think otherwise we'd significantly weaken our tests.
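The difference between inferring this from the signature and declaring it explicitly can be sketched as below. Both helpers and the `declared_required` argument are hypothetical; the sketch only assumes the standard-library `inspect.signature`.

```python
import inspect


def has_required_init_params(cls):
    """Return True if cls.__init__ has a parameter without a default."""
    params = list(inspect.signature(cls.__init__).parameters.values())[1:]  # skip self
    return any(
        p.default is inspect.Parameter.empty
        and p.kind not in (p.VAR_POSITIONAL, p.VAR_KEYWORD)
        for p in params
    )


def check_default_constructible(cls, declared_required=False):
    # Pass only when the signature and the explicit declaration agree, so a
    # required parameter cannot slip in without someone opting in.
    return has_required_init_params(cls) == declared_required
```

Inspecting the signature alone would accept any estimator that happens to need arguments; the explicit declaration keeps the "constructible without parameters" test meaningful for everything else.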
Would the following be an appropriate substitute, shooting two birds with one stone:
?
This also has the potential to make most of `set_checking_parameters` disappear. (Although I suppose it does not then strictly test that it is default-constructible.)
`_get_instances_for_checking` can certainly be implemented as a separate PR. I think this would also allow us to use `check_estimator` in `test_common.py`.
+1 for possibly separate PR. I kinda don't want to mess with the default construction test too much...
binary only would be an important tag for external libraries (and came up in the context of the GP here).
Make sure you're clear that it's binary targets, not features.
Binary only is also relevant for calibration methods.
Right, but then this would error if the `BaseEstimator` and the final estimator class define the same tag? Maybe slightly re-ordering the MRO classes (cf. #8022 (comment)), then not overwriting existing tags, could be a way around it...
yes, which I solved by having the `BaseEstimator` not define anything. The current solution does what I want...
(I think)
Don't we need to reverse this list to give precedence to tags set earlier in the MRO? The precedence should be tested either way.
(I think the official idiom might be `type(self)` rather than `self.__class__`, but not sure)
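A small sketch of the precedence question, assuming tags are merged with `dict.update` while walking the MRO. The classes, the tag names and `collect_tags` are hypothetical; only `inspect.getmro` and the `_more_tags` name are taken as given.

```python
import inspect


class Base:
    def _more_tags(self):
        return {"requires_fit": True, "binary_only": False}


class BinaryMixin:
    def _more_tags(self):
        return {"binary_only": True}


class Toy(BinaryMixin, Base):
    pass


def collect_tags(obj):
    tags = {}
    # getmro lists the most specific class first, so walk it in reverse:
    # updates from classes earlier in the MRO are applied last and win.
    for klass in reversed(inspect.getmro(type(obj))):
        if "_more_tags" in vars(klass):
            tags.update(klass._more_tags(obj))
    return tags


toy_tags = collect_tags(Toy())
```

Without `reversed`, the `Base` defaults would be applied last and silently undo the mixin's `binary_only` value.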
hm, after thinking about this again, this looks like we're running into the same MRO issue that I was having earlier. I don't think @rth's solution actually works.
I was hoping we would be able to ignore the left-right MRO order with this approach and never overwrite tags. But we have the full tags defined in the `BaseEstimator`. One solution to make this work is to remove all the tags in the base estimator, not allow overwriting any tags, and then fill in missing tags with the defaults.
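The proposal in the last sentence could be sketched like this. It is one possible reading, not the code in this PR: the `_tags` class attribute, `DEFAULT_TAGS`, the toy classes and `collect_tags_strict` are all hypothetical.

```python
import inspect

DEFAULT_TAGS = {"requires_fit": True, "binary_only": False}  # hypothetical defaults


class ToyBase:
    """Defines no tags itself; defaults are filled in at collection time."""


class BinaryMixin:
    _tags = {"binary_only": True}


class ToyClassifier(ToyBase, BinaryMixin):
    _tags = {"requires_fit": False}


def collect_tags_strict(obj):
    tags = {}
    for klass in inspect.getmro(type(obj)):
        # vars() sees only what this class defines, not what it inherits.
        for name, value in vars(klass).get("_tags", {}).items():
            if name in tags:
                raise ValueError(f"Tag {name!r} is defined more than once.")
            tags[name] = value
    # Fill in whatever no class in the hierarchy declared.
    return {**DEFAULT_TAGS, **tags}


strict_tags = collect_tags_strict(ToyClassifier())
```

Because no tag may be set twice, the result no longer depends on whether the mixin appears to the left or right of the base class; the cost is that a subclass cannot override a tag set by one of its parents.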
Yes, you are right, it is method resolution order, e.g.,
So first, if a tag is defined in the first estimator, we don't want to overwrite it, i.e.
(or something similar), instead of
Then you are right that we want the tags from within the mixin to apply before the base estimators. Maybe we want to sort `inspect.getmro(self.__class__)` with some custom comparison function that would put mixins before `BaseEstimator`, e.g.,
it's a bit hackish, but might work. Here the output would be,
maybe this should set 'multilabel' if `is_classifier(self)`