FIX preprocessing: Fix OneHotEncoder handle_unknown='warn' behavior #32592
Thanks for working on this. I think this looks good, except for the indentation change.
Force-pushed from ff7ee78 to ef52e36.
@betatim, can you please request a second reviewer? It has already been two weeks.

@betatim, can you please tell me what to do with the PR?

Unfortunately, there is not much you can do. The biggest bottleneck for projects like scikit-learn is reviewer time. As a result, it is not unusual for things to take a long time to get reviewed and merged. While this is sad, there is no easy solution to increasing the amount of time reviewers have.
Fixes #32589
Description
According to the documentation, `handle_unknown='warn'` in `OneHotEncoder` is supposed to behave identically to `handle_unknown='infrequent_if_exist'` (i.e., map unknown categories to the infrequent category) while also emitting a `UserWarning`.

This PR fixes a bug where `handle_unknown='warn'` was incorrectly behaving like `handle_unknown='ignore'`, causing unknown categories to be encoded as all zeros.

Additionally, the `UserWarning` itself was misleading. It incorrectly stated that unknown categories would be "encoded as all zeros" even when `handle_unknown='infrequent_if_exist'` was used.

Changes Made
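To make the bug concrete, here is a toy model of the intended mapping. It is not scikit-learn's implementation: `encode_one` and its arguments are hypothetical names used only for illustration, and the warning that `'warn'` should also emit is left out to keep the sketch short.

```python
# Toy model only: this is NOT scikit-learn code.
# `encode_one`, `frequent`, and `infrequent` are hypothetical names.

def encode_one(value, frequent, infrequent, handle_unknown):
    """One-hot encode a single value.

    Columns are the frequent categories, followed by one shared column
    for all infrequent categories (only if any infrequent category exists).
    """
    has_infrequent = bool(infrequent)
    row = [0] * (len(frequent) + int(has_infrequent))

    if value in frequent:
        row[frequent.index(value)] = 1
    elif value in infrequent:
        # Known but infrequent categories share the last column.
        row[-1] = 1
    elif handle_unknown == "error":
        raise ValueError(f"Found unknown category {value!r}")
    elif handle_unknown in ("infrequent_if_exist", "warn") and has_infrequent:
        # Intended behavior: unknown values go to the infrequent column.
        # Per the description, 'warn' was missing from this kind of check.
        row[-1] = 1
    # 'ignore' falls through and leaves the row as all zeros.
    return row


frequent = ["a", "b"]
infrequent = ["c", "d"]
for mode in ("ignore", "infrequent_if_exist", "warn"):
    print(mode, encode_one("z", frequent, infrequent, mode))
```

In this model the unknown value `"z"` yields `[0, 0, 0]` for `'ignore'` and `[0, 0, 1]` for both `'infrequent_if_exist'` and `'warn'`, which is the equivalence the documentation describes.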
This PR addresses the bug in two ways:

1. Corrected the Behavior: In `_BaseEncoder._map_infrequent_categories`, the logic that un-masks unknown values (to map them to the infrequent category) was updated to include `handle_unknown='warn'`. It previously only checked for `handle_unknown='infrequent_if_exist'`. This ensures the behavior of `'warn'` now matches `'infrequent_if_exist'`.
2. Corrected the Warning Message: In `_BaseEncoder._transform`, the warning-generation logic was updated to be conditional:
   - It emits the "...encoded as the infrequent category." message when `handle_unknown` is `'warn'` or `'infrequent_if_exist'`.
   - It emits the "...encoded as all zeros" message when `handle_unknown='ignore'`.

Testing
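As a rough illustration of the conditional warning message and of how a test can assert on it, the sketch below uses only the standard library. The function names are hypothetical, and because the start of the real message is elided in this description, a placeholder prefix stands in for it. Whether a warning is emitted at all is decided elsewhere in the encoder; this only models which message is chosen.

```python
import warnings

# Hypothetical stand-in, not scikit-learn code. The real message prefix is
# elided in the PR description, so a placeholder is used here.
PLACEHOLDER_PREFIX = "[message prefix elided]"


def unknown_warning_message(handle_unknown):
    """Pick the message suffix according to the handle_unknown mode."""
    if handle_unknown in ("warn", "infrequent_if_exist"):
        suffix = "encoded as the infrequent category."
    elif handle_unknown == "ignore":
        suffix = "encoded as all zeros"
    else:
        raise ValueError(f"mode not covered by this sketch: {handle_unknown!r}")
    return f"{PLACEHOLDER_PREFIX} {suffix}"


def captured_messages(handle_unknown):
    """Emit the warning and return the text of every warning recorded."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is filtered out
        warnings.warn(unknown_warning_message(handle_unknown), UserWarning)
    return [str(w.message) for w in caught]


for mode in ("warn", "infrequent_if_exist", "ignore"):
    print(mode, captured_messages(mode))
```

A test written with pytest would more typically use `pytest.warns(UserWarning, match=...)` for the same check; `warnings.catch_warnings` is used here only to keep the sketch dependency-free.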
- Added a new test, `test_onehotencoder_handle_unknown_warn_maps_to_infrequent`, to specifically verify that `'warn'` produces the same output as `'infrequent_if_exist'` and emits the new, correct warning.
- Updated the existing tests (`test_ohe_handle_unknown_warn` and `test_ohe_drop_first_handle_unknown_ignore_warns`) that were failing because they were asserting the old, incorrect warning message. They now expect the new, correct warning message.
- The tests in `sklearn/preprocessing/tests/test_encoders.py` now pass.

Checklist