[WIP] Handle missing values in label._encode() #15009

TwsThomas · 2019-09-18T09:27:08Z

Reference Issues/PRs

Partially fixes #11996 (which has a current PR, #12045)
and partially fixes #11997 (this PR will supersede #13028).

What does this implement/fix? Explain your changes.

This minimal PR allows _encode() (in label.py) to handle np.nan.
That will help OneHotEncoder and OrdinalEncoder deal with np.nan values in follow-up PRs.

Any other comments?

jnothman

I'll have to look more at this later. Some CI checks are failing. Is there a reason you haven't continued the previous work, but rather seem to have started from scratch?

jnothman · 2019-09-22T00:46:48Z

sklearn/preprocessing/label.py



-def _encode_numpy(values, uniques=None, encode=False, check_unknown=True):
+def _nan_unique(ar, return_inverse=False, allow_nan=False):


I think elsewhere in the repo we have an allow_nans parameter. Should probably keep the naming consistent (plural or singular)

I do not see any "allow_nans" parameter elsewhere. Did I miss something?

jnothman · 2019-09-22T00:49:34Z

sklearn/preprocessing/label.py


-def _encode_numpy(values, uniques=None, encode=False, check_unknown=True):
+def _nan_unique(ar, return_inverse=False, allow_nan=False):
+    # mimic np.unique with allow_nan option


Please note that nan is removed from the unique values

Actually, _nan_unique keeps exactly one nan,
unlike np.unique, which keeps them all (since they are all different by definition).
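
To make the "keep exactly one nan" behaviour concrete, here is a minimal sketch. It is not the code from this PR: the name nan_unique_sketch is hypothetical, the input is assumed to be a 1-D float array, and NaN is filtered out before calling np.unique because np.unique's handling of NaN has differed between NumPy versions.

```python
import numpy as np


def nan_unique_sketch(ar, return_inverse=False):
    """Hypothetical helper: sorted uniques with at most one trailing NaN."""
    ar = np.asarray(ar, dtype=float)
    nan_mask = np.isnan(ar)

    # Deduplicate only the non-NaN entries, so the result does not depend
    # on how np.unique itself treats NaN.
    finite_uniques = np.unique(ar[~nan_mask])

    # Append a single NaN if the input contained any.
    if nan_mask.any():
        uniques = np.append(finite_uniques, np.nan)
    else:
        uniques = finite_uniques

    if not return_inverse:
        return uniques

    # Position of each input value in `uniques`; every NaN maps to the last slot.
    inverse = np.empty(ar.shape, dtype=np.intp)
    inverse[~nan_mask] = np.searchsorted(finite_uniques, ar[~nan_mask])
    inverse[nan_mask] = len(uniques) - 1
    return uniques, inverse


uniques, inverse = nan_unique_sketch(
    [1.0, np.nan, 2.0, np.nan, 1.0], return_inverse=True
)
```

With this input, uniques holds 1.0, 2.0 and a single nan, and both nan entries share the same integer code.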


jnothman · 2019-09-22T00:54:02Z

sklearn/preprocessing/label.py



-def _encode_python(values, uniques=None, encode=False):
+class _TableWithNan(object):


Does this slow down the encoding much?

The dictionary itself gets about 20x slower:

    def time_dict(d):
        for x in range(10000):
            d[x] = x + 1

    %timeit time_dict(dict())
    # 793 µs ± 13 µs per loop
    %timeit time_dict(_DictWithNan())
    # 15.3 ms ± 107 µs per loop

but encoding a vector is only about 1.4x slower:

    def time_encode():
        x = np.random.choice(range(10), size=10000)
        _ = _encode_python(x)

    %timeit time_encode()  # master
    # 799 µs ± 2.41 µs per loop
    %timeit time_encode()  # this PR
    # 1.14 ms ± 18.3 µs per loop
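
For readers who have not opened the diff, the following is a rough sketch of what a nan-aware lookup table can look like. It is not the _DictWithNan or _TableWithNan class from this PR: the class name and the sentinel are hypothetical. The idea is that NaN never compares equal to itself, so two distinct NaN objects would otherwise end up as two separate dict keys. Routing every access through Python-level methods is one plausible source of the overhead measured above, though that is an assumption rather than something profiled here.

```python
import math

# Hypothetical sentinel that stands in for every NaN key.
_NAN_SENTINEL = object()


def _normalize(key):
    """Map any float NaN to one shared sentinel; leave other keys untouched."""
    if isinstance(key, float) and math.isnan(key):
        return _NAN_SENTINEL
    return key


class DictWithNanSketch(dict):
    """Illustrative dict subclass in which all NaN keys share one slot."""

    def __setitem__(self, key, value):
        super().__setitem__(_normalize(key), value)

    def __getitem__(self, key):
        return super().__getitem__(_normalize(key))

    def __contains__(self, key):
        return super().__contains__(_normalize(key))


table = DictWithNanSketch()
table["a"] = 0
table[float("nan")] = 1

# A different NaN object still finds the same entry.
nan_code = table[float("nan")]
```

A plain dict would raise KeyError on the last lookup, because the second float("nan") is a different object and does not compare equal to the first.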

TwsThomas · 2019-09-23T16:23:26Z

Thanks for the feedback.

Is there a reason you haven't continued the previous work, but rather seem to have started from scratch?

I tried to make a minimal PR here, which might be easier to review. I could have started from #13028, but it was easier for me to understand the problem by rewriting it from scratch.


TwsThomas added 14 commits September 12, 2019 09:58

label encode with nan and mixed types

3f9c591

chg noly label.py

49448c1

iter, restore _encoders

302f3ae

iter (clean _encoders.py)

2053fb2

clean _encoders andtest_encoders

2d71efa

clean

8a66e43

clean

c613940

Merge remote-tracking branch 'upstream/master' into nan_in_OHE

75b4d8d

iter

894c0e5

typo

1a266b0

add functions

afad176

ad test

db05136

add more tests

b7284f6

add more tests

c4c4982

jnothman reviewed Sep 22, 2019


TwsThomas added 2 commits September 23, 2019 17:40

rename __DictWithNan

6d9d055

typo

c76b42b

TwsThomas added 2 commits September 25, 2019 11:00

Merge branch 'master' of https://github.com/scikit-learn/scikit-learn …

ed1fa82

…into nan_in_OHE

minor

f3a120d

github-actions bot added the module:preprocessing label Mar 2, 2020

nilichen mentioned this pull request Mar 23, 2020

[WIP] Handle NaNs in OneHotEncoder #16749

Closed

thomasjpfan mentioned this pull request May 23, 2020

ENH Adds missing value support to OneHotEncoder #17317

Merged

lorentzenchr closed this in #17317 Oct 9, 2020
