[WIP] Handle missing values in label._encode() #15009

Closed
wants to merge 18 commits into from

Conversation

TwsThomas
Contributor

Reference Issues/PRs

Partially fixes #11996 (with a current PR, #12045)
and partially fixes #11997 (this PR will supersede #13028).

What does this implement/fix? Explain your changes.

This minimal PR allows _encode() (in label.py) to handle np.nan.
That will help OneHotEncoder and OrdinalEncoder deal with np.nan values in follow-up PRs.
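
As background on why np.nan needs special handling during encoding, here is a minimal sketch. It is illustrative only and is not code from this PR. It shows that NaN never compares equal to itself, so equality-based lookups can miss it.

```python
import numpy as np

nan_a = float("nan")
nan_b = float("nan")

# NaN is never equal to itself under IEEE 754 comparison rules.
print(nan_a == nan_a)  # False

# A dict finds a NaN key only through the identity shortcut (same object),
# so a different NaN object is reported as missing.
table = {nan_a: 0}
print(nan_a in table)  # True
print(nan_b in table)  # False

values = np.array([1.0, np.nan, 2.0, np.nan])
# An explicit mask is a reliable way to locate missing entries.
nan_mask = np.isnan(values)
print(nan_mask)
```

Any encoder that builds a value-to-code table therefore has to treat NaN as a separate case rather than rely on equality.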

Any other comments?

Member

@jnothman jnothman left a comment

I'll have to look more at this later. Some CI checks are failing. Is there a reason you haven't continued the previous work, but rather seem to have started from scratch?

@@ -33,18 +34,47 @@
]


def _encode_numpy(values, uniques=None, encode=False, check_unknown=True):
def _nan_unique(ar, return_inverse=False, allow_nan=False):
Member

I think elsewhere in the repo we have an allow_nans parameter. Should probably keep the naming consistent (plural or singular)

Contributor Author

I do not see any "allow_nans" parameter elsewhere. Did I miss something?

@@ -33,18 +34,47 @@
]


def _encode_numpy(values, uniques=None, encode=False, check_unknown=True):
def _nan_unique(ar, return_inverse=False, allow_nan=False):
# mimic np.unique with allow_nan option
Member

Please note that nan is removed from the unique values

Contributor Author

Actually, _nan_unique keeps only one nan,
unlike np.unique, which keeps them all (since they are all different by definition).
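
The behaviour described here can be sketched as follows. This is a hypothetical illustration, not the PR's `_nan_unique`: the name `nan_unique_sketch` is invented, the `allow_nan` option is omitted, and the input is assumed to be a 1-D float array. NaNs are masked out before `np.unique` is called, so the result does not depend on how the installed NumPy version treats NaN inside `np.unique` (newer releases may already collapse NaNs).

```python
import numpy as np


def nan_unique_sketch(ar, return_inverse=False):
    """Sorted unique values of a 1-D float array, with at most one NaN placed last."""
    ar = np.asarray(ar, dtype=float).ravel()
    nan_mask = np.isnan(ar)

    # Deduplicate only the non-missing entries.
    finite_uniques, finite_inverse = np.unique(ar[~nan_mask], return_inverse=True)

    # Append a single NaN if any was present in the input.
    if nan_mask.any():
        uniques = np.append(finite_uniques, np.nan)
    else:
        uniques = finite_uniques

    if not return_inverse:
        return uniques

    # Every NaN maps to the last position; other values keep their code.
    inverse = np.empty(ar.shape[0], dtype=np.intp)
    inverse[~nan_mask] = finite_inverse
    inverse[nan_mask] = len(finite_uniques)
    return uniques, inverse


uniques, inverse = nan_unique_sketch([3.0, np.nan, 1.0, np.nan, 3.0], return_inverse=True)
print(uniques)
print(inverse)
```

Placing the NaN last is a design choice of this sketch; the PR may order it differently.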

@@ -54,15 +84,45 @@ def _encode_numpy(values, uniques=None, encode=False, check_unknown=True):
return uniques


def _encode_python(values, uniques=None, encode=False):
class _TableWithNan(object):
Member

Does this slow down the encoding much?

Contributor Author

Dictionary insertion gets about 20x slower:

def time_dict(d):
    for x in range(10000):
        d[x] = x+1

%timeit time_dict(dict())
# 793 µs ± 13 µs per loop
%timeit time_dict(_DictWithNan())
# 15.3 ms ± 107 µs per loop

but encoding a vector is only about 1.4x slower:

def time_encode():
    x = np.random.choice(range(10), size=10000)
    _ = _encode_python(x)

%timeit time_encode() # master
# 799 µs ± 2.41 µs per loop 
%timeit time_encode() # this PR
# 1.14 ms ± 18.3 µs per loop
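
For readers who do not have the diff at hand, a NaN-aware lookup table could look roughly like the sketch below. It is hypothetical: the class name `NanAwareTable` and its sentinel approach are assumptions for illustration and are not the `_TableWithNan` implementation from this PR. Routing every access through a Python-level method is one plausible source of the per-item overhead measured above.

```python
class NanAwareTable:
    """Hypothetical dict-like mapping in which every NaN key shares one slot."""

    _NAN_KEY = object()  # private sentinel standing in for any NaN

    def __init__(self):
        self._data = {}

    def _normalize(self, key):
        # NaN is the only scalar value that is not equal to itself.
        return self._NAN_KEY if key != key else key

    def __setitem__(self, key, value):
        self._data[self._normalize(key)] = value

    def __getitem__(self, key):
        return self._data[self._normalize(key)]

    def __contains__(self, key):
        return self._normalize(key) in self._data

    def __len__(self):
        return len(self._data)


table = NanAwareTable()
table[float("nan")] = 0
table["a"] = 1
table[float("nan")] = 2  # overwrites the single NaN slot

print(float("nan") in table)
print(table[float("nan")])
print(len(table))
```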

@TwsThomas
Contributor Author

Thanks for the feedback.

> Is there a reason you haven't continued the previous work, but rather seem to have started from scratch?

I am trying to keep this PR minimal, which might make it easier to review. I could have started from #13028, but it was easier for me to understand the code by rewriting it from scratch.

Development

Successfully merging this pull request may close these issues.

Handle missing values in OrdinalEncoder
Handle missing values in OneHotEncoder
2 participants