BUG: always raise on NaN in OneHotEncoder for object dtype data #12033

Merged

jorisvandenbossche
Member

Closes #12018

This does not really solve the underlying issue, which is that the valid_mask logic (for handling unknown categories) needs to become NaN-aware. However, by checking for NaNs and raising, this PR prevents us from reaching that logic when there are NaNs in the data.

When we change the default from raising to passing the NaNs through, the masking logic will need to be reworked anyway.
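
The check-then-raise idea described above can be sketched in plain Python. The helper names `object_isnan` and `raise_on_nan` are hypothetical and this is not scikit-learn's implementation; it only illustrates that NaN is the one value that compares unequal to itself, which still works for mixed string and float data where `math.isnan` would raise a `TypeError` on the strings. The error message is the one quoted in the review discussion.

```python
def object_isnan(values):
    """Return one boolean per element: True where the element is NaN.

    NaN is the only value that is not equal to itself, so this also works
    for strings and None, on which math.isnan would raise a TypeError.
    """
    return [v != v for v in values]


def raise_on_nan(values):
    """Raise ValueError if any element is NaN (hypothetical helper)."""
    if any(object_isnan(values)):
        raise ValueError("Input contains NaN")
    return values


column = ["a", "b", float("nan"), None]
print(object_isnan(column))  # [False, False, True, False]
```

Raising early like this means the later category-matching logic never sees a NaN, which is the stop-gap behavior the description asks for.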

    X_temp = check_array(X, dtype=None)
    if not hasattr(X, 'dtype') and np.issubdtype(X_temp.dtype, np.str_):
        X = check_array(X, dtype=np.object)
    else:
        X = X_temp

    if X.dtype == np.dtype('object'):
        if _object_dtype_isnan(X).any():
Member

Technically this shouldn't happen if the assume_finite configuration is set.

Member Author

Added a check for that.
Is there a way to test this? I could pass data with NaNs with the config set to assume all finite, but then the implementation fails, which seems a bit strange to test?

    def test_one_hot_encoder_raise_missing(X, handle_unknown):
        ohe = OneHotEncoder(categories='auto', handle_unknown=handle_unknown)

        with pytest.raises(ValueError):
Member

Maybe add match="Input contains NaN" here and below to be more explicit about the error we are expecting
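
To illustrate why the `match` argument makes the test more explicit, here is a standard-library sketch that emulates the behavior. `raises_matching` is a hypothetical stand-in for `pytest.raises(..., match=...)` and `fit` is a toy placeholder for `ohe.fit(X)`; neither is taken from the PR.

```python
import re
from contextlib import contextmanager


@contextmanager
def raises_matching(exc_type, pattern):
    """Hypothetical stand-in for pytest.raises(exc_type, match=pattern)."""
    try:
        yield
    except exc_type as exc:
        # Like pytest's match, search the message with a regular expression.
        if not re.search(pattern, str(exc)):
            raise AssertionError(
                f"{pattern!r} not found in error message {str(exc)!r}"
            )
    else:
        raise AssertionError(f"{exc_type.__name__} was not raised")


def fit(values):
    """Toy function that rejects NaN; a placeholder for ohe.fit(X)."""
    if any(v != v for v in values):
        raise ValueError("Input contains NaN")


# Passes only because the ValueError carries the expected message.
with raises_matching(ValueError, "Input contains NaN"):
    fit(["a", float("nan")])
```

Without the pattern, any `ValueError` raised for an unrelated reason would also satisfy the test.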

        ohe = OneHotEncoder(categories='auto', handle_unknown=handle_unknown)

        with pytest.raises(ValueError):
            ohe.fit(X)
Member

I thought (reading through your issue) this error would be raised only for X.dtype == 'object', not e.g. 'float'?

Member Author

For float, it already raises correctly on master (in a first check_array call when calling fit); it just was not explicitly tested. So the bug was only for object dtype.

Member

jnothman left a comment

Is this not something check_array should be doing?

Member Author

jorisvandenbossche commented Sep 13, 2018

> Is this not something check_array should be doing?

Yes, I was also thinking that, but forgot to add a comment asking about it.
For now I added it here as a stop-gap, because we already have some special handling for object dtypes here that is not included in check_array.

Basically, check_array currently doesn't really have the option of "dtype should ideally be numeric or object, but if it already was a numpy string dtype, that is fine to keep", i.e. what we do here in the encoders. See also #11401 (comment)
And until now check_array did not need to handle object dtype data, so it could assume that only float data could ever contain NaNs.

So I can certainly move it there; I am just not fully sure whether it would have a wider impact (are there other estimators that accept object dtype data apart from imputers and encoders?), and we would need to agree on an API to specify this in check_array.

Member

jnothman left a comment

Okay with this for now, but we should probably put this into check_array soon

Member

rth left a comment

LGTM thanks!

rth merged commit dfdf605 into scikit-learn:master on Sep 13, 2018
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request on Sep 17, 2018
jorisvandenbossche deleted the onehotencoder-object-missing branch on September 24, 2018 14:21