[WIP] NaN Support for OneHotEncoder #13028
Closed
52 commits
04e15c1
Added _nanencode, a nan preserving implementation of _encode
baluyotraf 9b49ec2
Added test to _nanencode similar to _encode
baluyotraf 2142dc2
Improved _encoding test for _nanencoding and added test for _nanencod…
baluyotraf 4595bb4
Fixed _get_mask from impute since string types does not support np.equal
baluyotraf 2b535bc
Added the option to provide missing values in _nanencode as given by …
baluyotraf 9e38523
Fixed _nanencode_python comment that went to 80 characters
baluyotraf 54e6ca3
Removed comma at the end of a one line list
baluyotraf 4cc5fc8
Renamed missing_value to missing_values
baluyotraf 6dd8195
Fixed deprecated warning on empty array comparison
baluyotraf 63b5ae8
Removed parentheses on boolean array creation
baluyotraf ff53604
Changed _nanencode_python to have a more robust nan checking
baluyotraf d4741b7
Added nan test for object arrays and replaced some np.nan with float(…
baluyotraf 938376f
Added assertion of the ValueError when an extra value is not in the u…
baluyotraf a45055c
Added assume_true to setdiff1d calls
baluyotraf a208464
Removed _sort_nankey function
baluyotraf cd85a9e
Values are now removed from unique values in a way that takes advanta…
baluyotraf ca76371
Refactored some implementation to prepare for unknowns implementation
baluyotraf 22cb885
Moved getting the unique classes of object to _nanunique_object
baluyotraf b3a5711
Moved creation of nan based mapping to another function
baluyotraf ffef454
Added preprocessing of unknown in _nanencoder_numpy
baluyotraf 585ba4d
Improved comment on _nanunique_object
baluyotraf 944c20c
Added encode_unknown for objects
baluyotraf d74c6aa
Added tests for unknown encoding
baluyotraf aad1750
Made _nanencode interface uniform
baluyotraf d2af973
Added length checking in _nanin1d
baluyotraf 391cd99
Removed extra new lines at the end
baluyotraf 3c12705
Improved test coverage of the _nanencode function
baluyotraf 7980655
Made the index checking in _nanin1d more robust
baluyotraf 03f2688
Implemented all-zeroes and categorical handling of missing values in …
90d907c
Implemented all-missing for OneHotEncoder
8b4dbd9
Updated OrdinalEncoder with the changes to BaseEncoder
d1a80cb
Moved import of _get_mask to prevent circular import
6e7f514
Fixed merge with the drop functionality
baluyotraf 2cbe071
Fixed message provided by the BaseEncoder
baluyotraf 5341d68
Updated some details in the OneHotEncoder docstring
baluyotraf b0b62b8
Removed exception expectation when OneHotEncoder and OrdinalEncoder a…
baluyotraf 7805695
Removed numpy vectorization in _nanencode to allow pickling
baluyotraf 113ec09
Allow nan values for encoders in pandas data frames
baluyotraf 8dbc1b4
Made the dtypes to come from the data frame itself rather than hard e…
baluyotraf 1e8a499
Added missing_values parameters in OrdinalEncoder
460f10d
Renamed test names for missing for clarity
baluyotraf 3eec7df
Added validation for handle_missing parameter
baluyotraf 0394cac
Added tests for the missing values encoding
baluyotraf f3fd740
Refactored generation of all-missing encoding
baluyotraf 2a431dd
Fixed the OrdinalEncoder docstring
baluyotraf 0c22806
Added tests for the inverse transform of missing values
baluyotraf 57c7e9f
Added implementation of the missing values in inverse transform
baluyotraf 7deb703
Removed category printing
baluyotraf 74fabd8
Added warning supression in _nanin1d
baluyotraf f0fd75b
Removed the old encode function
baluyotraf 01bba68
Updated doc test related results for OneHotEncoder and OrdinalEncoder
baluyotraf b7decc0
Normalized whitespace in OrdinalEncoder doc string
baluyotraf
What motivated this change? Separate PR? With a test, please?
See also numpy issue: numpy/numpy#5399
I checked the tests, and it seems `sklearn.impute.MissingIndicator` was only tested on numeric values. I'm not sure whether non-numeric values should be supported, since using a numpy array with object dtype will fail. The string type, on the other hand, raises an error.
The error is raised in the `_get_mask` function. Since I do pretty much the same thing as `MissingIndicator`, I feel it's better to just modify the `_get_mask` function to be more general.
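To make the idea concrete, here is a minimal sketch of what a more general mask helper could look like. It is not the actual `_get_mask` from scikit-learn; the names `get_missing_mask` and `is_scalar_nan` are hypothetical, and the sketch assumes that falling back to element-wise Python comparison for string and object arrays is acceptable, so that `np.equal` is never called on string data.

```python
import numpy as np


def is_scalar_nan(x):
    # Hypothetical helper: True only for float NaN (np.float64 subclasses float).
    return isinstance(x, float) and x != x


def get_missing_mask(X, missing_value):
    # Hypothetical, illustrative stand-in for a generalized mask function.
    X = np.asarray(X)
    if is_scalar_nan(missing_value):
        if X.dtype.kind == "f":
            return np.isnan(X)
        if X.dtype.kind in "iub":
            # Integer and boolean arrays cannot hold NaN.
            return np.zeros(X.shape, dtype=bool)
        # String/object arrays: check each element in plain Python.
        flat = [is_scalar_nan(v) for v in X.ravel()]
        return np.array(flat, dtype=bool).reshape(X.shape)
    if X.dtype.kind in "OUS":
        # Avoid the ufunc path for strings/objects; compare element-wise.
        flat = [v == missing_value for v in X.ravel()]
        return np.array(flat, dtype=bool).reshape(X.shape)
    return X == missing_value


str_mask = get_missing_mask(np.array(["a", "missing", "b"]), "missing")
obj_mask = get_missing_mask(np.array(["a", np.nan, "b"], dtype=object), np.nan)
num_mask = get_missing_mask(np.array([1.0, np.nan]), np.nan)
```

The element-wise branches are slower than a vectorized comparison, so a real implementation would probably keep the fast path for numeric dtypes, as this sketch does.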
When it was developed, SimpleImputer didn't support non-numerics either. But that's changed, so yes, that's probably an issue.
OK then. I'll make a PR after I've made some progress with the tests.
Maybe just open an issue for now?
Created #13035
I think we want to revert this change