BUG: always raise on NaN in OneHotEncoder for object dtype data #12033

jorisvandenbossche · 2018-09-07T17:54:01Z

Closes #12018

This does not solve the underlying issue, which is that the valid_mask logic (for handling unknown categories) needs to become NaN-aware. By checking for NaNs up front and raising, however, this PR ensures we never reach that logic when the data contains NaNs.

When we change the default from raising to passing the NaNs through, the masking logic will need to be reworked anyway.
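
The up-front check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `object_isnan` and `raise_on_object_nan` are hypothetical names, and the element-wise `x != x` test is just one way to detect NaN in an object array.

```python
import numpy as np


def object_isnan(X):
    """Boolean mask of NaN entries in an object-dtype array (hypothetical helper)."""
    # NaN compares unequal to itself, so the element-wise test `x != x`
    # flags it without calling np.isnan, which rejects object arrays.
    return np.frompyfunc(lambda x: x != x, 1, 1)(X).astype(bool)


def raise_on_object_nan(X):
    """Raise early so that later masking code never sees a NaN."""
    if X.dtype == np.dtype("object") and object_isnan(X).any():
        raise ValueError("Input contains NaN")
    return X


X = np.array([["a", 1], [np.nan, 2]], dtype=object)
mask = object_isnan(X)  # True only where the entry is NaN
```

Raising here does not make any later equality-based logic NaN-aware; it only guarantees that such logic is never reached with NaNs in the data.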

jnothman · 2018-09-08T10:45:07Z

sklearn/preprocessing/_encoders.py

        X_temp = check_array(X, dtype=None)
        if not hasattr(X, 'dtype') and np.issubdtype(X_temp.dtype, np.str_):
            X = check_array(X, dtype=np.object)
        else:
            X = X_temp

+        if X.dtype == np.dtype('object'):
+            if _object_dtype_isnan(X).any():


Technically this shouldn't happen if the assume_finite configuration is set.

jorisvandenbossche · 2018-09-12

Added a check for that.
Is there a way to test this? I could pass data containing NaNs with the config set to assume_finite, but then the implementation fails, which seems a bit strange to test.
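
A rough sketch of what skipping the check under that setting could look like. Here `assume_finite` is a plain function argument and `check_object_nan` is a hypothetical name; in the library the flag would presumably be read from the global configuration rather than passed in.

```python
import numpy as np


def check_object_nan(X, assume_finite=False):
    # `assume_finite` is an ordinary argument in this sketch (an assumption
    # made to keep the example self-contained).
    if assume_finite:
        # The caller vouches for the data, so the scan is skipped entirely.
        return X
    if X.dtype == np.dtype("object"):
        # `x != x` is True only for NaN among the values used here.
        if any(x != x for x in X.ravel()):
            raise ValueError("Input contains NaN")
    return X
```

With the flag set, data that does contain NaNs passes through unchecked, which is why a test for that path has nothing well-defined to assert about what happens afterwards.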

rth · 2018-09-09T19:04:34Z

sklearn/preprocessing/tests/test_encoders.py

+def test_one_hot_encoder_raise_missing(X, handle_unknown):
+    ohe = OneHotEncoder(categories='auto', handle_unknown=handle_unknown)
+
+    with pytest.raises(ValueError):


Maybe add match="Input contains NaN" here and below to be more explicit about the error we are expecting.
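
As an illustration of the suggestion, the sketch below uses `pytest.raises` with `match`. `fit_stub` is a hypothetical stand-in for `ohe.fit(X)` that performs only a NaN check; it is not the encoder itself.

```python
import numpy as np
import pytest


def fit_stub(X):
    # Hypothetical stand-in for `ohe.fit(X)`; it only performs the NaN check.
    X = np.asarray(X, dtype=object)
    if any(x != x for x in X.ravel()):
        raise ValueError("Input contains NaN")
    return X


@pytest.mark.parametrize("X", [
    [["a", np.nan]],
    [[1.0, np.nan]],
])
def test_raise_missing(X):
    # `match` is checked with re.search against the exception message, so a
    # ValueError raised for some unrelated reason would fail this test.
    with pytest.raises(ValueError, match="Input contains NaN"):
        fit_stub(X)
```

Without `match`, any `ValueError` raised during `fit` would satisfy the assertion, including ones unrelated to missing values.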

rth · 2018-09-09T19:09:09Z

sklearn/preprocessing/tests/test_encoders.py

+    ohe = OneHotEncoder(categories='auto', handle_unknown=handle_unknown)
+
+    with pytest.raises(ValueError):
+        ohe.fit(X)


I thought (reading through your issue) this error would be raised only for X.dtype == 'object', not e.g. 'float'?

jorisvandenbossche · 2018-09-10

For float it already raises correctly on master (in the first check_array call inside fit); it just was not explicitly tested. So the bug only affected object dtype.
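
One plausible reason the float path was already covered while the object path was not: NumPy's `np.isnan` works on float arrays but has no implementation for object dtype. The snippet below only demonstrates that NumPy behaviour; it does not reproduce what `check_array` does internally.

```python
import numpy as np

X_float = np.array([[1.0, np.nan]])
X_object = np.array([["a", np.nan]], dtype=object)

# Float data: np.isnan gives a boolean mask directly.
float_has_nan = bool(np.isnan(X_float).any())

# Object data: np.isnan has no loop for this dtype and raises TypeError.
try:
    np.isnan(X_object)
    object_supported = True
except TypeError:
    object_supported = False
```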

jnothman

Is this not something check_array should be doing?

jorisvandenbossche · 2018-09-13T07:28:44Z

> Is this not something check_array should be doing?

Yes, I was also thinking that, but forgot to add a comment asking about it.
For now I added it here as a stop-gap, because we already have some special handling for object dtypes here that is not included in check_array.

Basically, check_array currently doesn't really have the option of "dtype should ideally be numeric or object, but if it already was a numpy string dtype, that is fine to keep", i.e. what we do here in the encoders. See also #11401 (comment)
And until now check_array did not need to deal with object dtype data, so it could assume that only float data could ever contain NaNs.

So I can certainly move it there. I am just not sure whether it would have a wider impact (are there other estimators that accept object dtype data apart from imputers and encoders?), and we would need to agree on an API to specify this in check_array.
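
The "numeric or object, but keep an existing string dtype" rule can be sketched with plain NumPy. `to_encoder_array` is a hypothetical helper that mirrors the dtype handling quoted in the diff, with `np.asarray` standing in for `check_array`.

```python
import numpy as np


def to_encoder_array(X):
    """Hypothetical helper mirroring the dtype handling quoted in the diff."""
    X_temp = np.asarray(X)
    # Input without a dtype (e.g. a list) that NumPy turned into a
    # fixed-width string array is re-read as object dtype, so that
    # non-string entries keep their original type.
    if not hasattr(X, "dtype") and np.issubdtype(X_temp.dtype, np.str_):
        return np.asarray(X, dtype=object)
    # Inputs that already carry a dtype, including string arrays, are kept.
    return X_temp


from_list = to_encoder_array([["a", np.nan]])            # object dtype
from_str_array = to_encoder_array(np.array([["a", "b"]]))  # string dtype kept
```

The object conversion matters for missing values: in the intermediate string array the NaN has been coerced to the text `'nan'`, whereas the object array still holds a float NaN that a check can detect.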

jnothman

Okay with this for now, but we should probably put this into check_array soon

rth

LGTM thanks!

…t-learn#12033)

BUG: always raise on NaN in OneHotEncoder for object dtype data

7cb8ad4

jorisvandenbossche mentioned this pull request Sep 7, 2018

BUG: OneHotEncoder(string values) handles NaN as category on transform step #12018

Closed

pep8

770204c

jnothman approved these changes Sep 8, 2018

rth reviewed Sep 9, 2018

jorisvandenbossche added 2 commits September 12, 2018 10:41

add check for assume_finite config

891f244

add match string to error assert

2101566

jnothman reviewed Sep 13, 2018

jnothman approved these changes Sep 13, 2018

rth approved these changes Sep 13, 2018

rth merged commit dfdf605 into scikit-learn:master Sep 13, 2018

jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Sep 17, 2018

BUG always raise on NaN in OneHotEncoder for object dtype data (sciki…

bb0ab59

…t-learn#12033)

jorisvandenbossche deleted the onehotencoder-object-missing branch September 24, 2018 14:21

jorisvandenbossche mentioned this pull request Sep 24, 2018

Improvements to check_array to handle heterogenous / object data #12148

Open
