[MRG+2] KBinsDiscretizer : inverse_transform for ohe encoder #11505
Conversation
This PR introduces inverse_transform in KBinsDiscretizer for encoders other than ordinal.

See #11489. I'm not totally convinced about the testing, though!

So the problem is that you assign
Creating encoder_ohe_ in fit and preventing test_non_meta_estimators from failing.

Sorry, this was actually very, very easy to solve :).
raise ValueError("inverse_transform only supports "
                 "'encode = ordinal'. Got encode={!r} instead."
                 .format(self.encode))
Xt = self.ohe_encoder_.inverse_transform(Xt)
For inverse_transform, we need to use check_array to do input validation and then check the number of features. So you'll have something like:

Xinv = check_array...
if self.encode != 'ordinal':
    if ...
        raise ValueError("Incorrect number of features....")
    Xinv = self.ohe_encoder_.inverse_transform(Xinv)
else:
    ...
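The suggested validation flow could be fleshed out as below. This is a minimal, self-contained sketch, not the PR's actual code: `_onehot_inverse` is a hypothetical stand-in for `OneHotEncoder.inverse_transform`, the per-feature bin counts are passed explicitly, and `np.asarray` stands in for `check_array`.

```python
import numpy as np


def inverse_transform_sketch(Xt, n_bins, encode='onehot-dense'):
    """Sketch of the suggested validation flow (not scikit-learn's code).

    n_bins lists the number of bins per feature; encode is 'ordinal'
    or a one-hot variant.
    """
    Xinv = np.asarray(Xt, dtype=float)  # stands in for check_array(...)
    if encode != 'ordinal':
        # One-hot input: expect one column per (feature, bin) pair.
        if Xinv.shape[1] != sum(n_bins):
            raise ValueError("Incorrect number of features. Expecting {}, "
                             "received {}.".format(sum(n_bins), Xinv.shape[1]))
        return _onehot_inverse(Xinv, n_bins)  # ohe_encoder_.inverse_transform
    # Ordinal input: expect one column per feature.
    if Xinv.shape[1] != len(n_bins):
        raise ValueError("Incorrect number of features. Expecting {}, "
                         "received {}.".format(len(n_bins), Xinv.shape[1]))
    return Xinv


def _onehot_inverse(Xt, n_bins):
    # Recover each feature's ordinal bin code from its one-hot column block.
    starts = np.concatenate(([0], np.cumsum(n_bins)))
    return np.column_stack([Xt[:, starts[j]:starts[j + 1]].argmax(axis=1)
                            for j in range(len(n_bins))])
```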
Also, maybe we should make ohe_encoder_ a private attribute; otherwise we need to document it.
Maybe a better way is to rely on the input validation provided by OneHotEncoder.inverse_transform, so you'll have something like:

if self.encode != 'ordinal':
    self.ohe_encoder_.inverse_transform
else:
    # existing input validation
Yes, OneHotEncoder already runs input validation, but doesn't the output of the OHE inverse_transform need to be validated as well? I mean, is there some case in which the input can pass the validation for a OneHotEncoder but not for a KBinsDiscretizer? As it is now, both validations are run.
I was thinking about the private variable as well. I have a doubt, though (it could be a very stupid question): how can I run all the tests if the ohe_encoder is private? In particular:
if encode != 'ordinal':
    Xt_tmp = kbd.ohe_encoder_.inverse_transform(X2t)
else:
    Xt_tmp = X2t
assert_array_equal(Xt_tmp.max(axis=0) + 1, kbd.n_bins_)
Should I write a different test?
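One way around the question above, sketched here rather than taken from the PR: if the encoder becomes private, the test can undo the dense one-hot encoding itself, so the assertion only touches public attributes like `n_bins_`. The helper name `codes_from_onehot` is hypothetical, and it assumes every feature uses the same number of bins.

```python
import numpy as np


def codes_from_onehot(Xt, n_features, n_bins):
    """Hypothetical test helper: recover ordinal codes from a dense one-hot
    matrix, avoiding the would-be private encoder attribute.

    Assumes each of the n_features features occupies a consecutive block of
    n_bins columns; argmax over each block undoes the one-hot encoding.
    """
    return np.asarray(Xt).reshape(len(Xt), n_features, n_bins).argmax(axis=-1)


# The assertion could then read, e.g. with 3 bins per feature:
#   codes = codes_from_onehot(X2t, X.shape[1], 3) if 'onehot' in encode else X2t
#   assert_array_equal(codes.max(axis=0) + 1, kbd.n_bins_)
```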
if self.encode != 'ordinal':
    encode_sparse = self.encode == 'onehot'
    self.ohe_encoder_ = OneHotEncoder(
I don't think that we want this attribute to be public.
So it should be named _ohe_encoder_
As mentioned, I agree and I was actually thinking of making it private. _ohe_encoder would not be a "real" private; would it be enough? If yes, that would actually solve my previous question.
self.bin_edges_ = bin_edges
self.n_bins_ = n_bins

if self.encode != 'ordinal':
This condition is weird.
I would expect something like:

if 'onehot' in self.encode:
    ...
This if statement was actually already in the code (see line 291); should I change it as well to keep consistency?
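For reference, KBinsDiscretizer's encode parameter accepts 'onehot', 'onehot-dense', and 'ordinal'. Over those three values the two spellings of the condition are equivalent, so the suggested change is about readability rather than behavior:

```python
# Both forms select exactly the two one-hot encode variants.
for encode in ('onehot', 'onehot-dense', 'ordinal'):
    assert ('onehot' in encode) == (encode != 'ordinal')
```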
Xt = kbd.fit_transform(X)
assert_array_equal(Xt.max(axis=0) + 1, kbd.n_bins_)
if encode != 'ordinal':
    Xt_tmp = kbd.ohe_encoder_.inverse_transform(Xt)
Xt_tmp is not a good name :)
You can also make it inline:

Xt = kbd.ohe_encoder_.inverse_transform(Xt) if 'onehot' in encode else Xt
I perfectly agree on the naming; it is actually bad. I'll probably need to refactor this code a bit, since Xt is required to compute X2.
I'm not a big fan of inline statements, though, in particular when they end up as long lines.
Is it really beneficial in this case?
X2 = kbd.inverse_transform(Xt)
X2t = kbd.fit_transform(X2)
assert_array_equal(X2t.max(axis=0) + 1, kbd.n_bins_)
if encode != 'ordinal':
Same as above.
ggc87 left a comment:
updated!
self.n_bins_ = n_bins

if 'onehot' in self.encode:
    self._ohe_encoder = OneHotEncoder(
let's just call it _encoder, so that we might use the same inverse_transform code when unary encoding is available.
qinhanmin2014 left a comment:
LGTM, please add an entry to what's new (maybe use the existing entry for KBinsDiscretizer)
+1
qinhanmin2014 left a comment:
Will merge when green.
thx @ggc87
Thank you for your review, guys :)