[WIP] Handle missing values in label._encode() #15009
I'll have to look at this more closely later. Some CI jobs are failing. Is there a reason you haven't continued the previous work, but rather seem to have started from scratch?
@@ -33,18 +34,47 @@
]


def _encode_numpy(values, uniques=None, encode=False, check_unknown=True):
def _nan_unique(ar, return_inverse=False, allow_nan=False):
I think elsewhere in the repo we have an allow_nans parameter. Should probably keep the naming consistent (plural or singular)
I do not see any "allow_nans" parameter elsewhere. Did I miss something?
sklearn/preprocessing/label.py
Outdated
@@ -33,18 +34,47 @@
]


def _encode_numpy(values, uniques=None, encode=False, check_unknown=True):
def _nan_unique(ar, return_inverse=False, allow_nan=False):
# mimic np.unique with allow_nan option
Please note that nan is removed from the unique values
Actually, _nan_unique keeps only one nan, unlike np.unique, which keeps them all (since they are all different by definition).
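The behavior described above, collapsing every NaN into a single entry, could be sketched roughly as follows. This is only an illustration, not the PR's `_nan_unique`: the name `nan_unique_sketch` is hypothetical, and the sketch assumes a 1-D float array.

```python
import numpy as np


def nan_unique_sketch(ar, allow_nan=False):
    """Return the sorted unique values of a 1-D float array with at most one NaN.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    ar = np.asarray(ar, dtype=float)
    nan_mask = np.isnan(ar)
    if nan_mask.any() and not allow_nan:
        raise ValueError("Input contains NaN")
    # Deduplicate the non-NaN part, where equality is well defined.
    uniques = np.unique(ar[~nan_mask])
    if nan_mask.any():
        # Append a single NaN to stand for all missing entries.
        uniques = np.append(uniques, np.nan)
    return uniques


result = nan_unique_sketch([1.0, np.nan, 2.0, np.nan, 1.0], allow_nan=True)
```

Masking the NaNs out before calling `np.unique` keeps the sketch independent of how a given NumPy version treats NaN inside `np.unique`.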
sklearn/preprocessing/label.py
Outdated
@@ -54,15 +84,45 @@ def _encode_numpy(values, uniques=None, encode=False, check_unknown=True):
return uniques


def _encode_python(values, uniques=None, encode=False):
class _TableWithNan(object):
Does this slow down the encoding much?
Dictionary insertion gets about 20x slower:
def time_dict(d):
    # Insert 10000 integer keys into the given mapping.
    for x in range(10000):
        d[x] = x + 1
%timeit time_dict(dict())
# 793 µs ± 13 µs per loop
%timeit time_dict(_DictWithNan())
# 15.3 ms ± 107 µs per loop
but encoding a vector is only about 1.4x slower:
import numpy as np
def time_encode():
    # Encode 10000 values drawn from 10 categories.
    x = np.random.choice(range(10), size=10000)
    _ = _encode_python(x)
%timeit time_encode() # master
# 799 µs ± 2.41 µs per loop
%timeit time_encode() # this PR
# 1.14 ms ± 18.3 µs per loop
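For context on why a NaN-aware table is needed at all: in a plain `dict`, two distinct NaN objects are two different keys, because NaN never compares equal to itself. The PR's table class is not reproduced in this thread, so the following is only a sketch of one way such a mapping could work; the class name `NanAwareDict` and its sentinel are hypothetical.

```python
import math


class NanAwareDict(dict):
    """Hypothetical dict that stores every float NaN key under one shared slot."""

    _NAN_KEY = object()  # sentinel standing in for any NaN key

    def _norm(self, key):
        # np.float64 subclasses float, so NumPy scalars are covered too.
        if isinstance(key, float) and math.isnan(key):
            return self._NAN_KEY
        return key

    def __setitem__(self, key, value):
        super().__setitem__(self._norm(key), value)

    def __getitem__(self, key):
        return super().__getitem__(self._norm(key))

    def __contains__(self, key):
        return super().__contains__(self._norm(key))


plain = {}
plain[float("nan")] = 0
plain[float("nan")] = 1   # a second, separate key in a plain dict

table = NanAwareDict()
table[float("nan")] = 0
table[float("nan")] = 1   # overwrites the single NaN slot
table[2.0] = 2
```

Every operation in this sketch goes through an extra Python-level method call and a NaN check, which is one plausible source of the per-item overhead measured above. Methods such as `get` and `update` are not overridden here and would bypass the normalization.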
Thanks for the feedback.
I am trying to keep this PR minimal, which might make it easier to review. I could have started from #13028, but it was easier for me to understand the problem by rewriting it from scratch.
…into nan_in_OHE
Reference Issues/PRs
Partially fixes #11996 (which currently has an open PR, #12045)
and partially fixes #11997 (this PR will supersede #13028).
What does this implement/fix? Explain your changes.
This minimal PR allows _encode() (in label.py) to handle np.nan. That will help OneHotEncoder and OrdinalEncoder deal with np.nan values in follow-up PRs.
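To make the goal concrete, here is a rough sketch of what "handling np.nan" can mean when encoding: every NaN is mapped to one shared integer code instead of raising an error or producing one category per NaN. This is not scikit-learn's `_encode`; the function name `encode_with_nan_sketch` and the choice to give NaN the last code are assumptions made for illustration.

```python
import numpy as np


def encode_with_nan_sketch(values):
    """Map a 1-D float array to integer codes, treating all NaNs as one category.

    Hypothetical illustration; not scikit-learn's `_encode`.
    """
    values = np.asarray(values, dtype=float)
    nan_mask = np.isnan(values)
    uniques = np.unique(values[~nan_mask])
    codes = np.empty(values.shape, dtype=np.intp)
    # Non-NaN values are coded by their position in the sorted uniques.
    codes[~nan_mask] = np.searchsorted(uniques, values[~nan_mask])
    if nan_mask.any():
        # Assumption: NaN gets the last code and is listed once.
        codes[nan_mask] = len(uniques)
        uniques = np.append(uniques, np.nan)
    return uniques, codes


categories, codes = encode_with_nan_sketch([3.0, np.nan, 1.0, 3.0, np.nan])
```

An encoder built on top of such a function would see NaN as just one more category.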
Any other comments?