[MRG] Allowing Virtual Category instead of error for OrdinalEncoder #14534
How does this compare to #13833?
This needs a test
|
Compared to #13833, this solution is simpler, is tailored specifically to the #13488 problem (so there's no need to take a side on whether or not it's useful to have infrequent_category implemented), and changes less of the codebase. Once I figure out how to write tests, I will do just that! Thank you for your help |
you're changing the estimator in |
Could you explain, @amueller? I thought this was an elegant solution, but maybe I've done something against the rules? |
calling |
I understand, that does make sense. I don't see a solution that doesn't change the estimator. Thank you for explaining though |
@nathanielmhld there are definitely ways to get the same output without modifying categories_ in transform.
You can certainly reopen the PR and try to change the implementation. |
…thanielmhld/scikit-learn into OrdinalEncoderVirtualCategory Merge
Hi! Just bumping this. Does this implementation look alright? |
I think the failing codecov check might be erroneous. I'm having trouble viewing the specifics; I get an error. |
Reference Issues/PRs
Addresses #13488.
What does this implement/fix? Explain your changes.
This allows the user to specify handle_unknown='virtual' when instantiating the OrdinalEncoder. With this setting, transforming data that contains values not seen during fit no longer raises an error. Instead, all unknown values are sent to one additional virtual category, placed last in the ordinal progression of categories: if the codes were 0, 1, 2 before, they are now 0, 1, 2, 3, where 3 maps to any value not previously encountered. Upon inverting the transformation, these values appear as None, which is consistent with the current behavior of OneHotEncoder when handle_unknown == 'ignore'.
The None category is created upon instantiation of the encoder, so the encoder does not have to be changed over the course of its use. A drawback of this approach is that any fitted encoder will show a None category in its category list.
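To make the behavior described above concrete, here is a minimal pure-Python sketch. It is not the code from this PR and not scikit-learn's API: the class name VirtualOrdinalEncoder is hypothetical, it handles a single column of values, and appending None to the category list during fit is an assumption made only to mirror the description.

```python
class VirtualOrdinalEncoder:
    """Toy single-column encoder (hypothetical; not scikit-learn's OrdinalEncoder)."""

    def __init__(self, handle_unknown="error"):
        if handle_unknown not in ("error", "virtual"):
            raise ValueError("handle_unknown must be 'error' or 'virtual'")
        self.handle_unknown = handle_unknown

    def fit(self, values):
        known = sorted(set(values))
        # Lookup table for values seen during fit.
        self._index = {cat: code for code, cat in enumerate(known)}
        # Assumption: in 'virtual' mode, None is appended as the last category,
        # so the fitted category list visibly contains it.
        self.categories_ = known + [None] if self.handle_unknown == "virtual" else known
        return self

    def transform(self, values):
        codes = []
        for value in values:
            if value in self._index:
                codes.append(self._index[value])
            elif self.handle_unknown == "virtual":
                # Every unseen value shares the last code; no state is modified here.
                codes.append(len(self.categories_) - 1)
            else:
                raise ValueError(f"Found unknown category {value!r} during transform")
        return codes

    def inverse_transform(self, codes):
        # The virtual code maps back to None because None is the last category.
        return [self.categories_[code] for code in codes]


enc = VirtualOrdinalEncoder(handle_unknown="virtual").fit(["a", "b", "c"])
encoded = enc.transform(["a", "c", "z"])
decoded = enc.inverse_transform(encoded)
print(encoded)
print(decoded)
print(enc.categories_)
```

In this sketch transform only reads the fitted state, so calling it repeatedly with new unseen values leaves the category list unchanged.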
This is important because it is a common use case to attempt to transform categorical data that may contain categories not in the original fitted data. The behavior of sending to an additional virtual category seems like a reasonable solution, and is in line with the solution currently implemented for OneHotEncoder.
Any other comments?
This is my first ever pull request; I'm very excited and nervous. I personally would really like to use this feature, and I hope others will find it useful as well.