[MRG] Allowing Virtual Category instead of error for OrdinalEncoder #14534


Closed
wants to merge 12 commits into from

Conversation


@nathanielmhld nathanielmhld commented Jul 31, 2019

Reference Issues/PRs

Addresses #13488.

What does this implement/fix? Explain your changes.

This allows the user to specify an error policy of "virtual" when instantiating the OrdinalEncoder. With handle_unknown == 'virtual', transforming data that contains values not seen in fit no longer throws an error; instead, all unknown values are sent to an additional virtual category, last in the ordinal progression of categories. For example, if the codes were 0, 1, 2 before, they are now 0, 1, 2, 3, where 3 maps to any value not previously encountered. Upon inverting the transformation, these values appear as None, which is consistent with the current behavior of OneHotEncoder when handle_unknown == 'ignore'.

The None category is created upon instantiation of the encoder, so the encoder does not have to be changed over the course of its use. A drawback of this approach is that any fitted encoder will show a None category in its category list.
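
To make the intended behavior concrete, here is a minimal sketch in plain Python. It is not the code in this PR and does not use scikit-learn; the helper names `fit_categories`, `encode`, and `decode` are made up for illustration. Unlike the approach described above, it derives the virtual code from the number of fitted categories instead of storing a None entry in the category list.

```python
# Illustrative sketch only: plain Python for a single column,
# not scikit-learn's OrdinalEncoder. All names here are hypothetical.

def fit_categories(values):
    """Return the sorted distinct values seen during fit."""
    return sorted(set(values))

def encode(values, categories):
    """Map known values to 0..n-1 and any unseen value to the virtual code n."""
    index = {cat: i for i, cat in enumerate(categories)}
    virtual_code = len(categories)
    return [index.get(v, virtual_code) for v in values]

def decode(codes, categories):
    """Invert encode; the virtual code comes back as None."""
    return [categories[c] if c < len(categories) else None for c in codes]

categories = fit_categories(["cat", "dog", "fish", "dog"])
codes = encode(["dog", "cat", "lizard"], categories)
decoded = decode(codes, categories)
print(codes)    # [1, 0, 3]
print(decoded)  # ['dog', 'cat', None]
```

Here "lizard" was not seen in fit, so it gets code 3, one past the fitted codes 0, 1, 2, and inverts to None.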

This is important because it is a common use case to attempt to transform categorical data that may contain categories not in the original fitted data. The behavior of sending to an additional virtual category seems like a reasonable solution, and is in line with the solution currently implemented for OneHotEncoder.

Any other comments?

This is my first ever pull request, and I'm very excited and nervous. I personally would really like to use this feature, and I hope others will find it useful as well.

@jnothman
Member

jnothman commented Aug 1, 2019 via email

@nathanielmhld nathanielmhld changed the title Allowing Virtual Category instead of error for OrdinalEncoder [MRG] Allowing Virtual Category instead of error for OrdinalEncoder Aug 1, 2019
@nathanielmhld
Author

Compared to #13833, this solution is simpler, specifically tailored to the #13488 problem (so there's no need to take a side on whether or not it's useful to have infrequent_category implemented), and changes less in the codebase. Once I figure out how to write tests, I will do just that! Thank you for your help.

@amueller
Member

amueller commented Aug 2, 2019

you're changing the estimator in transform, right? That's not allowed...

@nathanielmhld
Author

Could you explain, @amueller? I thought this was an elegant solution, but maybe I've done something against the rules?

@amueller
Member

amueller commented Aug 2, 2019

calling transform is not allowed to change the state of the estimator.
If you do that, it's hard to guarantee that if you call transform twice, it doesn't matter in which order you call it.
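
As an illustration of this point, here is a small sketch of the failure mode. `MutatingEncoder` is a hypothetical toy class written for this example, not the PR's code: its transform appends unseen values to the fitted state, so the code assigned to a value depends on what was transformed earlier.

```python
# Illustrative anti-pattern: a toy encoder whose transform mutates fitted state.
# MutatingEncoder and codes_for are hypothetical names for this sketch.

class MutatingEncoder:
    def fit(self, values):
        self.categories_ = sorted(set(values))
        return self

    def transform(self, values):
        for v in values:
            if v not in self.categories_:
                self.categories_.append(v)  # state changes during transform
        return [self.categories_.index(v) for v in values]

def codes_for(first, second):
    """Fit on a fixed sample, transform `first`, then return codes for `second`."""
    enc = MutatingEncoder().fit(["a", "b"])
    enc.transform(first)
    return enc.transform(second)

after_x = codes_for(["x"], ["y"])
alone = codes_for([], ["y"])
print(after_x)  # [3]: "y" is encoded after "x" was appended
print(alone)    # [2]: same input, different code
```

The same input `["y"]` yields different codes depending on the earlier call, which is exactly the order dependence described above.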

@nathanielmhld
Author

I understand, that does make sense. I don't see a solution that doesn't change the estimator. Thank you for explaining, though.

Member

@jnothman jnothman left a comment

@nathanielmhld there are definitely ways to get the same output without modifying categories_ in transform.
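
One possible way to do this, sketched in plain Python, is shown below. This is an assumption about what such an approach could look like, not necessarily what the reviewer had in mind; `StatelessEncoder` is a hypothetical toy class. The code for unknown values is derived from the length of `categories_` on each call and is never written back.

```python
# Illustrative sketch: transform only reads the fitted state.
# StatelessEncoder is a hypothetical name, not a scikit-learn class.

class StatelessEncoder:
    def fit(self, values):
        self.categories_ = sorted(set(values))
        return self

    def transform(self, values):
        index = {c: i for i, c in enumerate(self.categories_)}
        unknown_code = len(self.categories_)  # computed per call, never stored
        return [index.get(v, unknown_code) for v in values]

enc = StatelessEncoder().fit(["a", "b"])
before = list(enc.categories_)
first = enc.transform(["x", "a"])
second = enc.transform(["y", "b"])
print(first)                      # [2, 0]
print(second)                     # [2, 1]
print(enc.categories_ == before)  # True: fitted state is untouched
```

Because the fitted categories never change, both unknown values map to the same code regardless of call order.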

@jnothman
Member

jnothman commented Aug 4, 2019

You can certainly reopen the PR and try to change the implementation.

@nathanielmhld nathanielmhld reopened this Aug 7, 2019
@nathanielmhld
Author

Hi! Just bumping this. Does this implementation look alright?

@nathanielmhld
Author

@jnothman @amueller I implemented some of the requested changes. I'm wondering whether this still has issues, or whether it's just not a feature anyone is interested in. Please let me know.

@nathanielmhld
Author

I think the failing codecov test might be erroneous. I'm having trouble viewing the specifics; I get an error.

3 participants