@Nithurshen
Contributor

Fixes #32589

Description

According to the documentation, handle_unknown='warn' in OneHotEncoder is supposed to behave identically to handle_unknown='infrequent_if_exist' (i.e., map unknown categories to the infrequent category) while also emitting a UserWarning.

This PR fixes a bug where handle_unknown='warn' was incorrectly behaving like handle_unknown='ignore', causing unknown categories to be encoded as all zeros.

Additionally, the UserWarning itself was misleading. It incorrectly stated that unknown categories would be "encoded as all zeros" even when handle_unknown='infrequent_if_exist' was used.

Changes Made

This PR addresses the bug in two ways:

  1. Corrected the Behavior:
    In _BaseEncoder._map_infrequent_categories, the logic that un-masks unknown values (to map them to the infrequent category) was updated to include handle_unknown='warn'. It previously only checked for handle_unknown='infrequent_if_exist'. This ensures the behavior of 'warn' now matches 'infrequent_if_exist'.

  2. Corrected the Warning Message:
    In _BaseEncoder._transform, the warning-generation logic was updated to be conditional.

    • It now emits the correct message: "...encoded as the infrequent category." when handle_unknown is 'warn' or 'infrequent_if_exist'.
    • It continues to emit the "...encoded as all zeros" message when handle_unknown='ignore'.
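
The decision described in the two changes above can be summarized as a small toy model. This is a sketch under stated assumptions rather than the scikit-learn implementation: the function name unknown_outcome and the constant INFREQUENT_MODES are hypothetical, and only the message fragments quoted in this description are used.

```python
# Toy model of the decision described above; not scikit-learn's actual code.
INFREQUENT_MODES = ("warn", "infrequent_if_exist")


def unknown_outcome(handle_unknown):
    """Return (maps_to_infrequent, warning_fragment) for a handle_unknown value."""
    if handle_unknown in INFREQUENT_MODES:
        # Unknown values are un-masked and sent to the infrequent category.
        return True, "encoded as the infrequent category"
    if handle_unknown == "ignore":
        # Unknown values stay masked, so the encoded row is all zeros.
        return False, "encoded as all zeros"
    raise ValueError(f"unsupported handle_unknown: {handle_unknown!r}")


for mode in ("warn", "infrequent_if_exist", "ignore"):
    print(mode, unknown_outcome(mode))
```

The point of the sketch is that 'warn' and 'infrequent_if_exist' share one branch for both the mapping and the message, which is the invariant the new regression test is meant to protect.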

Testing

  • Added a new regression test, test_onehotencoder_handle_unknown_warn_maps_to_infrequent, to specifically verify that 'warn' produces the same output as 'infrequent_if_exist' and emits the new, correct warning.
  • Updated several existing tests (like test_ohe_handle_unknown_warn and test_ohe_drop_first_handle_unknown_ignore_warns) that were failing because they were asserting the old, incorrect warning message. They now expect the new, correct warning message.
  • All tests in sklearn/preprocessing/tests/test_encoders.py now pass.

Checklist

  • A short, descriptive title has been added.
  • Tests have been added to cover all changes.
  • All existing tests still pass.

@github-actions

github-actions bot commented Oct 28, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: a7a3225.

@betatim
Member

betatim commented Oct 28, 2025

Thanks for working on this. I think this looks good, except for the indentation change.

@Nithurshen Nithurshen force-pushed the bug/onehotencoder-handle_unknown=warn branch from ff7ee78 to ef52e36 Compare October 28, 2025 17:00
@betatim betatim added the Waiting for Second Reviewer First reviewer is done, need a second one! label Oct 29, 2025
@Nithurshen
Contributor Author

@betatim, can you please request a second reviewer? It has already been two weeks.

@Nithurshen
Contributor Author

@betatim, can you please tell me what to do with the PR?

@betatim
Member

betatim commented Nov 17, 2025

Unfortunately, there is not much you can do. The biggest bottleneck for projects like scikit-learn is reviewer time. As a result, it is not unusual for things to take a long time to get reviewed and merged. While this is sad, there is no easy way to increase the amount of time reviewers have.

Labels

module:preprocessing Waiting for Second Reviewer First reviewer is done, need a second one!

Development

Successfully merging this pull request may close these issues.

OneHotEncoder handle_unknown='warn' behaves like 'ignore' instead of 'infrequent_if_exist'

2 participants