FIX Fixes bug OneHotEncoder's drop_idx_ when there are infrequent categories #25589
I'm not very familiar with all the one-hot encoder internals, so I'm having a hard time reviewing the implementation :) I think it would be worth documenting more precisely the interaction between […]. Other than that, I can't see a way around having a separate attribute either, so I'd say LGTM. Pinging @glemaitre or @ogrisel, who might be more familiar with the internals of OHE.
glemaitre
left a comment
LGTM. I don't see a more straightforward way to handle the remapping. I propose only minor changes.
Since we are discussing a remapping, I am just wondering whether a private dict together with a property would allow storing a single attribute. I have not checked whether that is actually feasible, though.
Co-authored-by: Guillaume Lemaitre <[email protected]>
jjerphan
left a comment
LGTM, thank you @thomasjpfan.
Here are just two nitpicks.
Co-authored-by: Julien Jerphanion <[email protected]>
…egories (scikit-learn#25589) Co-authored-by: Guillaume Lemaitre <[email protected]> Co-authored-by: Jérémie du Boisberranger <[email protected]> Co-authored-by: Julien Jerphanion <[email protected]>
Reference Issues/PRs
Fixes #25550
What does this implement/fix? Explain your changes.
This PR adds a `_drop_idx_internal` to `OneHotEncoder` that is used to drop the categories. `_drop_idx_internal` was already precomputed to take into account the grouped infrequent categories.
The public `drop_idx_` attribute needs to be remapped to reference back to the category that was actually dropped. There are tests in this PR to assert this behavior.
Any other comments?
I was not able to think of a simpler way to do this without adding another attribute.
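
The remapping between the grouped index and the public index can be illustrated with a minimal pure-Python sketch. This is not the scikit-learn implementation: the helper names `build_grouped_mapping` and `remap_drop_idx` are hypothetical, and the sketch assumes that infrequent categories are merged into a single group placed after the frequent ones, and that a dropped infrequent group is reported as its first member.

```python
def build_grouped_mapping(n_categories, infrequent_indices):
    """Map each original category index to its index after grouping.

    Assumption of this sketch: frequent categories keep their relative
    order and all infrequent categories share one trailing index.
    """
    infrequent = set(infrequent_indices)
    infrequent_group_idx = n_categories - len(infrequent)
    mapping = []
    next_frequent_idx = 0
    for idx in range(n_categories):
        if idx in infrequent:
            mapping.append(infrequent_group_idx)
        else:
            mapping.append(next_frequent_idx)
            next_frequent_idx += 1
    return mapping


def remap_drop_idx(grouped_drop_idx, mapping):
    """Translate a drop index in grouped space back to the original space.

    Assumption of this sketch: when the infrequent group itself is
    dropped, the first original category in that group is returned.
    """
    if grouped_drop_idx is None:
        return None
    return mapping.index(grouped_drop_idx)


# Four categories; the ones at indices 1 and 3 are infrequent.
categories = ["a", "b", "c", "d"]
mapping = build_grouped_mapping(len(categories), infrequent_indices=[1, 3])
print(mapping)                     # [0, 2, 1, 2]
print(remap_drop_idx(1, mapping))  # 2, i.e. "c" was the dropped category
```

In this toy setup the grouped drop index `1` and the public index `2` differ, which is the kind of mismatch that motivates keeping a separate internal attribute alongside `drop_idx_`.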