[WIP] Handle NaNs in OneHotEncoder #16749


Closed
wants to merge 14 commits into from


nilichen
Contributor

@nilichen nilichen commented Mar 23, 2020

Reference Issues/PRs

Towards #11996. Fixed #12025. See also #13028 and #15009.

What does this implement/fix? Explain your changes.

Adds a handle_missing parameter that deals specifically with NaNs. handle_missing can be:

  • None: kept as the default for now, so that all existing tests still pass and behaviour stays consistent with previous versions. I am thinking of using 'warn' to alert users about the changes.
  • 'all-zero'
  • 'indicator'
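For readers unfamiliar with the two option names, here is a minimal sketch of how they could behave on a single numeric column. It assumes 'all-zero' means a NaN row gets zeros in every category column and 'indicator' means one extra column flags the NaN rows; that reading is an assumption, not something this PR description spells out. The helper encode_column is hypothetical and is not scikit-learn API or the code in this branch.

```python
import numpy as np


def encode_column(values, handle_missing):
    """Hypothetical helper (not scikit-learn API): one-hot encode one
    numeric column, treating NaN according to `handle_missing`."""
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values)
    # Categories are learned from the non-missing values only.
    categories = np.unique(values[~missing])
    # NaN == x is always False, so NaN rows come out as all zeros.
    encoded = (values[:, None] == categories[None, :]).astype(float)
    if handle_missing == "indicator":
        # Assumed meaning: append one column that marks the NaN rows.
        encoded = np.hstack([encoded, missing[:, None].astype(float)])
    elif handle_missing != "all-zero":
        raise ValueError(f"unsupported handle_missing: {handle_missing!r}")
    return categories, encoded


column = [1.0, np.nan, 2.0, 1.0]
_, all_zero = encode_column(column, "all-zero")      # shape (4, 2)
_, indicator = encode_column(column, "indicator")    # shape (4, 3)
```

Under this reading, the two options differ only in whether the missingness itself stays visible to the downstream model.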

Tests are implemented for all three options, including inverse_transform.

Pending: documentation
Suggestion: pd.isna could be used to handle both NaN and None
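A short illustration of why pd.isna is attractive here: on an object-dtype array it flags both NaN and None, whereas np.isnan raises a TypeError on such input. The sample array is made up for the example.

```python
import numpy as np
import pandas as pd

# Made-up categorical column mixing strings, NaN and None.
values = np.array(["a", np.nan, None, "b"], dtype=object)

# pd.isna works element-wise on object arrays and treats None like NaN.
mask = pd.isna(values)

# np.isnan does not support object-dtype input and raises TypeError.
try:
    np.isnan(values)
    isnan_failed = False
except TypeError:
    isnan_failed = True
```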

@nilichen nilichen force-pushed the ohe_missing branch 4 times, most recently from c8e29b3 to 8ab4ec8 Compare March 23, 2020 06:59
@jnothman
Member

How is this going, @nilichen?

@nilichen
Contributor Author

nilichen commented Apr 19, 2020

I managed to get fit_transform and inverse_transform working for general cases, but got stuck deciding how handle_unknown should interact with different options, especially drop and handle_unknown (e.g., whether NaN should be treated as another category or not depends on other options). And the logic in the code starts to get obscure.

My personal view is that this function has become a bit too complicated, and it might be more straightforward for users to deal with NaN/None in pandas as usual. Otherwise, #13028 might be a better option to move forward.
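As a rough sketch of what dealing with NaN/None in pandas can look like before any encoder is involved (the column name and the "missing" sentinel are made up for the example):

```python
import numpy as np
import pandas as pd

# Made-up column with both kinds of missing value.
s = pd.Series(["a", np.nan, "b", None], name="color")

# 1. Treat missing as its own category by filling a sentinel first.
as_category = pd.get_dummies(s.fillna("missing"))

# 2. Default get_dummies: missing rows get no active column.
all_zero = pd.get_dummies(s)

# 3. dummy_na=True appends one extra column that flags missing rows.
with_indicator = pd.get_dummies(s, dummy_na=True)
```

Each choice is explicit in user code, so no encoder option has to decide how missing values interact with drop or handle_unknown.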

@nilichen nilichen closed this Apr 19, 2020
@rth
Member

rth commented Apr 19, 2020

how handle_unknown should interact with different options, especially drop and handle_unknown (e.g., whether NaN should be treated as another category or not depends on other options). And the logic in the code starts to get obscure.

Thanks for sharing your experience with it, @nilichen. Long-term maintainability of these options is certainly a concern, particularly if we want to add other encoders (that support them) in the future.
