
Conversation

@ggc87 (Contributor) commented Jul 13, 2018

This PR introduces inverse_transform in KBinsDiscretizer for encoders other than ordinal.

@ggc87 (Contributor Author) commented Jul 13, 2018

See #11489.

I'm not totally convinced about the testing though!

@qinhanmin2014 (Member)

So the problem is that you assign self.ohe_encoder_ in transform, not in fit.

Creating encoder_ohe_ in fit and preventing test_non_meta_estimators from failing.
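
For context, the fix boils down to building and fitting the one-hot encoder at fit time, so it is available before transform is ever called. A minimal standalone sketch of that idea, using assumed bin counts rather than the exact code in this PR:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Assumed per-feature bin counts, standing in for self.n_bins_ after fit.
n_bins_ = np.array([3, 4])

# Build and fit the encoder at fit time: with explicit categories,
# a single dummy row of zeros is enough to fit it.
ohe_encoder_ = OneHotEncoder(categories=[np.arange(b) for b in n_bins_])
ohe_encoder_.fit(np.zeros((1, len(n_bins_)), dtype=int))

# The encoder can now decode one-hot output back to ordinal bin indices,
# which is what inverse_transform needs for encodings other than ordinal.
onehot = ohe_encoder_.transform([[2, 3]])
print(ohe_encoder_.inverse_transform(onehot))  # [[2 3]]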
@ggc87 (Contributor Author) commented Jul 14, 2018

Sorry, this was actually very easy to solve :).

raise ValueError("inverse_transform only supports "
                 "'encode = ordinal'. Got encode={!r} instead."
                 .format(self.encode))
Xt = self.ohe_encoder_.inverse_transform(Xt)

Member:

For inverse_transform, we need to use check_array to do input validation and then check the number of features. So you'll have something like

Xinv = check_array(...)
if self.encode != 'ordinal':
    if ...:
        raise ValueError("Incorrect number of features ...")
    Xinv = self.ohe_encoder_.inverse_transform(Xinv)
else:
    ...

Also, maybe we should make ohe_encoder_ a private attribute; otherwise we need to document it.

Member:

Maybe a better way is to rely on the input validation provided by OneHotEncoder.inverse_transform, so you'll have something like

if self.encode != 'ordinal':
    self.ohe_encoder_.inverse_transform(...)
else:
    ...  # existing input validation
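
For illustration, a fuller sketch of what inverse_transform could look like under this suggestion; the free function below is hypothetical (it takes the fitted discretizer as an argument) and only mirrors the structure being discussed, not necessarily the code that ends up merged:

import numpy as np
from sklearn.utils import check_array

def inverse_transform_sketch(kbd, Xt):
    # Hypothetical stand-in for KBinsDiscretizer.inverse_transform.
    if kbd.encode != 'ordinal':
        # Let OneHotEncoder.inverse_transform handle input validation and
        # recover the ordinal bin indices.
        Xt = kbd.ohe_encoder_.inverse_transform(Xt)
    else:
        # Ordinal path: validate here before mapping back.
        Xt = check_array(Xt, copy=True, dtype=np.float64)
        if Xt.shape[1] != kbd.n_bins_.shape[0]:
            raise ValueError("Incorrect number of features.")
    # Map each bin index back to the center of its bin.
    Xinv = Xt.astype(np.float64)
    for jj in range(Xinv.shape[1]):
        bin_edges = kbd.bin_edges_[jj]
        bin_centers = (bin_edges[1:] + bin_edges[:-1]) * 0.5
        Xinv[:, jj] = bin_centers[np.int_(Xinv[:, jj])]
    return Xinv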

Contributor Author:

Yes, OneHotEncoder already runs input validation, but doesn't the output of the OHE inverse_transform need to be validated as well? I mean, is there some case in which the input can pass the validation for a OneHotEncoder but not for a KBinsDiscretizer? As it is now, both validations are run.

I was thinking about the private attribute as well. I have a doubt, though; it could be a very stupid question, but how can I run all the tests if the ohe_encoder is private? In particular:

if encode != 'ordinal':
    Xt_tmp = kbd.ohe_encoder_.inverse_transform(X2t)
else:
    Xt_tmp = X2t
assert_array_equal(Xt_tmp.max(axis=0) + 1, kbd.n_bins_)

Should I write a different test?
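
One option that stays on the public API (a sketch only, with an assumed fixture X; not necessarily the test that was merged) is to check the round trip through inverse_transform itself:

import numpy as np
from numpy.testing import assert_array_equal
from sklearn.preprocessing import KBinsDiscretizer

# Assumed small fixture with distinct values per column.
X = np.array([[-2.0, 1.5, -4.0, -1.0],
              [-1.0, 2.5, -3.0, -0.5],
              [0.0, 3.5, -2.0, 0.5],
              [1.0, 4.5, -1.0, 2.0]])

kbd = KBinsDiscretizer(n_bins=3, encode='onehot-dense').fit(X)
Xt = kbd.transform(X)

# transform -> inverse_transform -> transform reproduces the same one-hot
# matrix without ever touching the (private) encoder attribute.
assert_array_equal(Xt, kbd.transform(kbd.inverse_transform(Xt)))

# The number of one-hot columns equals the total number of bins.
assert Xt.shape[1] == kbd.n_bins_.sum()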


if self.encode != 'ordinal':
    encode_sparse = self.encode == 'onehot'
    self.ohe_encoder_ = OneHotEncoder(

Member:

I don't think that we want this attribute to be public.
So it should be named _ohe_encoder_

Contributor Author:

As mentioned, I agree, and I was actually thinking of making it private. _ohe_encoder would not be a "real" private attribute; would that be enough? If yes, that would actually solve my previous question.

self.bin_edges_ = bin_edges
self.n_bins_ = n_bins

if self.encode != 'ordinal':

Member:

This condition is weird.

I would expect something like:

if 'onehot' in self.encode:
   ...

Contributor Author:

This if statement was actually already in the code (see line 291); should I change it as well to keep things consistent?

Xt = kbd.fit_transform(X)
assert_array_equal(Xt.max(axis=0) + 1, kbd.n_bins_)
if encode != 'ordinal':
    Xt_tmp = kbd.ohe_encoder_.inverse_transform(Xt)

Member:

Xt_tmp is not a good name :)

you can also make it inline

Xt = kbd.ohe_encoder_.inverse_transform(Xt) if 'onehot' in encode else Xt

Contributor Author:

I completely agree on the naming; it is indeed bad. I'll probably need to refactor this code a bit, since Xt is required to compute X2.
I'm not a big fan of inline statements, though, in particular when they end up as long lines.

Is it really beneficial in this case?

X2 = kbd.inverse_transform(Xt)
X2t = kbd.fit_transform(X2)
assert_array_equal(X2t.max(axis=0) + 1, kbd.n_bins_)
if encode != 'ordinal':

Member:

Same as above.

@ggc87 (Contributor Author) left a comment

updated!

self.n_bins_ = n_bins

if 'onehot' in self.encode:
    self._ohe_encoder = OneHotEncoder(

Member:

let's just call it _encoder, so that we might use the same inverse_transform code when unary encoding is available.

@qinhanmin2014 (Member) left a comment

LGTM, please add an entry to what's new (maybe use the existing entry for KBinsDiscretizer)

@qinhanmin2014 qinhanmin2014 changed the title KBinsDiscretizer : inverse_transform for ohe encoder [See #11489] [MRG+2] KBinsDiscretizer : inverse_transform for ohe encoder Jul 20, 2018
@qinhanmin2014 qinhanmin2014 added this to the 0.20 milestone Jul 20, 2018
@glemaitre (Member)

maybe use the existing entry for KBinsDiscretizer

+1

@qinhanmin2014 (Member) left a comment

Will merge when green.

@qinhanmin2014 qinhanmin2014 merged commit 61547de into scikit-learn:master Jul 21, 2018
@qinhanmin2014 (Member)

thx @ggc87

@ggc87 (Contributor Author) commented Jul 22, 2018

Thank you for your review, guys :)
