BUG: always raise on NaN in OneHotEncoder for object dtype data #12033
sklearn/preprocessing/_encoders.py
Outdated
X_temp = check_array(X, dtype=None)
if not hasattr(X, 'dtype') and np.issubdtype(X_temp.dtype, np.str_):
    X = check_array(X, dtype=np.object)
else:
    X = X_temp

if X.dtype == np.dtype('object'):
    if _object_dtype_isnan(X).any():
Technically, this check shouldn't run if the assume_finite configuration is set.
Added a check for that.
Is there a way to test this? I could pass data with NaNs with the config set to assume all finite, but then the implementation fails, which seems a bit strange to test?
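To make the check under discussion concrete, here is a minimal sketch of detecting NaN in object dtype data and skipping the check when finiteness is assumed. The names `object_isnan` and `raise_on_nan` and the `assume_finite` argument are hypothetical stand-ins for illustration; they are not the private `_object_dtype_isnan` helper or the PR's actual code. (In scikit-learn the corresponding global option is `assume_finite`, set through `sklearn.set_config`.)

```python
import numpy as np


def object_isnan(X):
    """Hypothetical helper: boolean mask that is True where an element is NaN."""
    X = np.asarray(X, dtype=object)
    # NaN is the only float that is not equal to itself; strings are never NaN.
    flat = [isinstance(v, float) and v != v for v in X.ravel()]
    return np.array(flat, dtype=bool).reshape(X.shape)


def raise_on_nan(X, assume_finite=False):
    """Hypothetical validator: raise on NaN unless the caller assumes finite input."""
    if not assume_finite and object_isnan(X).any():
        raise ValueError("Input contains NaN")
    return X
```

With the flag set, NaN values slip through to later code, which is why testing that path feels odd: the test would only confirm that validation was skipped, not that the estimator handles the data.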
def test_one_hot_encoder_raise_missing(X, handle_unknown):
    ohe = OneHotEncoder(categories='auto', handle_unknown=handle_unknown)

    with pytest.raises(ValueError):
Maybe add match="Input contains NaN" here and below to be more explicit about the error we are expecting.
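As an illustration of this suggestion, the sketch below uses the `match` argument of `pytest.raises`, which applies a regular-expression search to the exception message. The `fit_stub` function is a hypothetical stand-in for the encoder's `fit`, used only so the example runs without scikit-learn.

```python
import math

import pytest


def fit_stub(values):
    """Hypothetical stand-in for OneHotEncoder.fit: rejects NaN input."""
    if any(isinstance(v, float) and math.isnan(v) for v in values):
        raise ValueError("Input contains NaN")
    return values


def test_raises_with_match():
    # Fails if no ValueError is raised or if the message does not match.
    with pytest.raises(ValueError, match="Input contains NaN"):
        fit_stub(["a", float("nan")])
```

Without `match`, the test would also pass for an unrelated `ValueError` raised elsewhere in the code path.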
    ohe = OneHotEncoder(categories='auto', handle_unknown=handle_unknown)

    with pytest.raises(ValueError):
        ohe.fit(X)
I thought (reading through your issue) this error would be raised only for X.dtype == 'object', not e.g. 'float'?
For float it already raises correctly on master (in the first check_array call when calling fit); it just was not explicitly tested. So the bug was only for object dtype.
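The difference between the two dtypes can be sketched as follows, assuming a validator that only inspects float arrays for NaN. The function `float_only_nan_check` is hypothetical and written for illustration; it is not the actual `check_array` implementation.

```python
import numpy as np


def float_only_nan_check(X):
    """Hypothetical validator: raises on NaN, but only inspects float arrays."""
    X = np.asarray(X)
    # np.isnan has no loop for object arrays, so a float-only check skips them.
    if X.dtype.kind == "f" and np.isnan(X).any():
        raise ValueError("Input contains NaN")
    return X


float_X = np.array([[1.0, np.nan]])
object_X = np.array([["a", np.nan]], dtype=object)
```

Under this assumption, `float_only_nan_check(float_X)` raises while `float_only_nan_check(object_X)` returns the array unchanged, so a NaN in object dtype data reaches the fitting code.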
Is this not something check_array should be doing?
Yes, I was also thinking that, but forgot to add a comment asking about it. I can certainly move it there; I am just not fully sure whether it would have a wider impact (are there other estimators that accept object dtype data apart from imputers and encoders?), and we need to agree on an API to specify this in
Okay with this for now, but we should probably put this into check_array soon
LGTM thanks!
Closes #12018
This does not really solve the underlying issue that the valid_mask logic (for handling unknown categories) needs to become NaN-aware, but this PR (by checking for NaNs and then raising) prevents us from reaching that point if there are NaNs in the data. When we change the default from raising to passing through the NaNs, the masking logic will need to be reworked anyway.
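To show why an equality-based mask is not NaN-aware, here is a minimal sketch. The `valid_mask` function below is a hypothetical illustration of membership testing by `==`; it is not the encoder's actual masking code.

```python
import numpy as np


def valid_mask(values, categories):
    """Hypothetical mask: True where a value equals one of the known categories."""
    values = np.asarray(values, dtype=float)
    categories = np.asarray(categories, dtype=float)
    # Compare every value against every category; NaN == NaN is False.
    return (values[:, None] == categories[None, :]).any(axis=1)


categories = np.array([1.0, 2.0, np.nan])
mask = valid_mask([1.0, np.nan, 3.0], categories)
# mask is [True, False, False]: the NaN is flagged as unknown
# even though NaN is listed among the categories.
```

Any rework that lets NaN pass through would need to treat NaN as equal to NaN in this comparison, for example by handling the NaN positions separately.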