[WIP] NaN Support for OneHotEncoder #13028

Closed
wants to merge 52 commits into from
04e15c1
Added _nanencode, a nan preserving implementation of _encode
baluyotraf Jan 22, 2019
9b49ec2
Added test to _nanencode similar to _encode
baluyotraf Jan 23, 2019
2142dc2
Improved _encoding test for _nanencoding and added test for _nanencod…
baluyotraf Jan 23, 2019
4595bb4
Fixed _get_mask from impute since string types do not support np.equal
baluyotraf Jan 23, 2019
2b535bc
Added the option to provide missing values in _nanencode as given by …
baluyotraf Jan 23, 2019
9e38523
Fixed _nanencode_python comment that went to 80 characters
baluyotraf Jan 23, 2019
54e6ca3
Removed comma at the end of a one line list
baluyotraf Jan 23, 2019
4cc5fc8
Renamed missing_value to missing_values
baluyotraf Jan 23, 2019
6dd8195
Fixed deprecated warning on empty array comparison
baluyotraf Jan 23, 2019
63b5ae8
Removed parentheses on boolean array creation
baluyotraf Jan 23, 2019
ff53604
Changed _nanencode_python to have a more robust nan checking
baluyotraf Jan 23, 2019
d4741b7
Added nan test for object arrays and replaced some np.nan with float(…
baluyotraf Jan 23, 2019
938376f
Added assertion of the ValueError when an extra value is not in the u…
baluyotraf Jan 23, 2019
a45055c
Added assume_true to setdiff1d calls
baluyotraf Jan 24, 2019
a208464
Removed _sort_nankey function
baluyotraf Jan 24, 2019
cd85a9e
Values are now removed from unique values in a way that takes advanta…
baluyotraf Jan 24, 2019
ca76371
Refactored some implementation to prepare for unknowns implementation
baluyotraf Jan 28, 2019
22cb885
Moved getting the unique classes of object to _nanunique_object
baluyotraf Jan 28, 2019
b3a5711
Moved creation of nan based mapping to another function
baluyotraf Jan 28, 2019
ffef454
Added preprocessing of unknown in _nanencoder_numpy
baluyotraf Jan 28, 2019
585ba4d
Improved comment on _nanunique_object
baluyotraf Jan 28, 2019
944c20c
Added encode_unknown for objects
baluyotraf Jan 28, 2019
d74c6aa
Added tests for unknown encoding
baluyotraf Jan 28, 2019
aad1750
Made _nanencode interface uniform
baluyotraf Jan 28, 2019
d2af973
Added length checking in _nanin1d
baluyotraf Jan 28, 2019
391cd99
Removed extra new lines at the end
baluyotraf Jan 28, 2019
3c12705
Improved test coverage of the _nanencode function
baluyotraf Jan 29, 2019
7980655
Made the index checking in _nanin1d more robust
baluyotraf Jan 29, 2019
03f2688
Implemented all-zeroes and categorical handling of missing values in …
Feb 15, 2019
90d907c
Implemented all-missing for OneHotEncoder
Feb 15, 2019
8b4dbd9
Updated OrdinalEncoder with the changes to BaseEncoder
Feb 15, 2019
d1a80cb
Moved import of _get_mask to prevent circular import
Mar 18, 2019
6e7f514
Fixed merge with the drop functionality
baluyotraf Mar 18, 2019
2cbe071
Fixed message provided by the BaseEncoder
baluyotraf Mar 18, 2019
5341d68
Updated some details in the OneHotEncoder docstring
baluyotraf Mar 18, 2019
b0b62b8
Removed exception expectation when OneHotEncoder and OrdinalEncoder a…
baluyotraf Mar 18, 2019
7805695
Removed numpy vectorization in _nanencode to allow pickling
baluyotraf Mar 18, 2019
113ec09
Allow nan values for encoders in pandas data frames
baluyotraf Mar 18, 2019
8dbc1b4
Made the dtypes to come from the data frame itself rather than hard e…
baluyotraf Mar 18, 2019
1e8a499
Added missing_values parameters in OrdinalEncoder
Mar 19, 2019
460f10d
Renamed test names for missing for clarity
baluyotraf Mar 24, 2019
3eec7df
Added validation for handle_missing parameter
baluyotraf Mar 24, 2019
0394cac
Added tests for the missing values encoding
baluyotraf Mar 24, 2019
f3fd740
Refactored generation of all-missing encoding
baluyotraf Mar 24, 2019
2a431dd
Fixed the OrdinalEncoder docstring
baluyotraf Mar 24, 2019
0c22806
Added tests for the inverse transform of missing values
baluyotraf Mar 24, 2019
57c7e9f
Added implementation of the missing values in inverse transform
baluyotraf Mar 24, 2019
7deb703
Removed category printing
baluyotraf Mar 25, 2019
74fabd8
Added warning suppression in _nanin1d
baluyotraf Mar 25, 2019
f0fd75b
Removed the old encode function
baluyotraf Mar 25, 2019
01bba68
Updated doc test related results for OneHotEncoder and OrdinalEncoder
baluyotraf Mar 26, 2019
b7decc0
Normalized whitespace in OrdinalEncoder doc string
baluyotraf Mar 26, 2019
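
The commit titles above describe `_nanencode` as a NaN-preserving variant of `_encode`, but its code is not shown on this page. The block below is only a rough sketch of what "NaN-preserving" could mean for a numeric column, assuming missing entries should stay NaN while every other value gets an integer code; the function name `nan_preserving_encode` is hypothetical and is not part of scikit-learn.

```python
import numpy as np


def nan_preserving_encode(values):
    """Hypothetical sketch: encode non-missing values as 0..n-1, keep NaN as NaN."""
    values = np.asarray(values, dtype=np.float64)
    missing = np.isnan(values)
    # Sorted unique categories, computed without the missing entries.
    uniques = np.unique(values[~missing])
    # Start from an all-NaN result so missing positions are preserved.
    codes = np.full(values.shape, np.nan)
    # uniques is sorted, so searchsorted returns each value's category index.
    codes[~missing] = np.searchsorted(uniques, values[~missing])
    return uniques, codes


uniques, codes = nan_preserving_encode([3.0, np.nan, 1.0, 3.0])
print(uniques)  # the two categories, 1.0 and 3.0
print(codes)    # codes 1, NaN, 0, 1
```

The result has to be a float array here because NaN has no integer representation, which may be one reason the encoders in the diff keep `dtype=numpy.float64`.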
23 changes: 13 additions & 10 deletions doc/modules/preprocessing.rst
@@ -481,8 +481,9 @@ new feature of integers (0 to n_categories - 1)::

     >>> enc = preprocessing.OrdinalEncoder()
     >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
-    >>> enc.fit(X) # doctest: +ELLIPSIS
-    OrdinalEncoder(categories='auto', dtype=<... 'numpy.float64'>)
+    >>> enc.fit(X) # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
+    OrdinalEncoder(categories='auto', dtype=<... 'numpy.float64'>,
+                   missing_values=nan)
     >>> enc.transform([['female', 'from US', 'uses Safari']])
     array([[0., 1., 1.]])

@@ -505,8 +506,9 @@ Continuing the example above::
     >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
     >>> enc.fit(X) # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
     OneHotEncoder(categorical_features=None, categories=None, drop=None,
-                  dtype=<... 'numpy.float64'>, handle_unknown='error',
-                  n_values=None, sparse=True)
+                  dtype=<... 'numpy.float64'>, handle_missing='all-zero',
+                  handle_unknown='error', missing_values=nan, n_values=None,
+                  sparse=True)
     >>> enc.transform([['female', 'from US', 'uses Safari'],
     ...                ['male', 'from Europe', 'uses Safari']]).toarray()
     array([[1., 0., 0., 1., 0., 1.],
@@ -530,10 +532,10 @@ dataset::
     >>> # feature
     >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
     >>> enc.fit(X) # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
-    OneHotEncoder(categorical_features=None,
-                  categories=[...], drop=None,
-                  dtype=<... 'numpy.float64'>, handle_unknown='error',
-                  n_values=None, sparse=True)
+    OneHotEncoder(categorical_features=None, categories=[...], drop=None,
+                  dtype=<... 'numpy.float64'>, handle_missing='all-zero',
+                  handle_unknown='error', missing_values=nan, n_values=None,
+                  sparse=True)
     >>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
     array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])

@@ -549,8 +551,9 @@ columns for this feature will be all zeros
     >>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
     >>> enc.fit(X) # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
     OneHotEncoder(categorical_features=None, categories=None, drop=None,
-                  dtype=<... 'numpy.float64'>, handle_unknown='ignore',
-                  n_values=None, sparse=True)
+                  dtype=<... 'numpy.float64'>, handle_missing='all-zero',
+                  handle_unknown='ignore', missing_values=nan, n_values=None,
+                  sparse=True)
     >>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
     array([[1., 0., 0., 0., 0., 0.]])
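
The updated reprs above show a proposed `handle_missing='all-zero'` default together with `missing_values=nan`. This PR was closed, so those parameters should not be assumed to exist in any released scikit-learn. As an illustration of what the option name suggests, here is a plain NumPy sketch in which a NaN entry produces a row of zeros instead of a category column; the helper `one_hot_all_zero` is hypothetical and is not the PR's implementation.

```python
import numpy as np


def one_hot_all_zero(column, categories):
    """Hypothetical sketch: one-hot encode a 1-D column, NaN rows stay all zero."""
    out = np.zeros((len(column), len(categories)), dtype=np.float64)
    index = {cat: i for i, cat in enumerate(categories)}
    for row, value in enumerate(column):
        # NaN is the only value that is not equal to itself.
        if value != value:
            continue
        # Unknown categories raise KeyError in this sketch.
        out[row, index[value]] = 1.0
    return out


column = np.array(['female', np.nan, 'male'], dtype=object)
encoded = one_hot_all_zero(column, ['female', 'male'])
print(encoded)
# [[1. 0.]
#  [0. 0.]
#  [0. 1.]]
```

In this sketch an all-zero row means "missing"; it cannot also stand for an ignored unknown category without making the two cases indistinguishable.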

11 changes: 8 additions & 3 deletions sklearn/impute.py
@@ -61,9 +61,14 @@ def _get_mask(X, value_to_mask):
             # np.isnan does not work on object dtypes.
             return _object_dtype_isnan(X)
     else:
-        # X == value_to_mask with object dytpes does not always perform
-        # element-wise for old versions of numpy
-        return np.equal(X, value_to_mask)
+        if X.dtype.kind in ["S", "U"]:
Member

What motivated this change? Separate PR? With a test, please?

Contributor Author

@baluyotraf baluyotraf Jan 23, 2019

See also numpy issue: numpy/numpy#5399

I checked the test and it seems like the sklearn.impute.MissingIndicator was only tested on numeric values. Not sure if non-numeric values should be supported since using a numpy array with object type will fail. The string type, on the other hand, raises the error below.

import numpy as np
from sklearn.impute import MissingIndicator

a = np.array([[c] for c in 'abcdea'], dtype='str')

MissingIndicator().fit_transform(a) # 2 FutureWarning
MissingIndicator(missing_values='a').fit_transform(a) # 1 FutureWarning and an error

Result

C:\Users\snowt\Python\scikit-learn\sklearn\utils\validation.py:558: FutureWarning: Beginning in version 0.22, arrays of bytes/strings will be converted to decimal numbers if dtype='numeric'. It is recommended that you convert the array to a float dtype before using it in scikit-learn, for example by using your_array = your_array.astype(np.float64).
  FutureWarning)
C:\Users\snowt\Python\scikit-learn\sklearn\utils\validation.py:558: FutureWarning: Beginning in version 0.22, arrays of bytes/strings will be converted to decimal numbers if dtype='numeric'. It is recommended that you convert the array to a float dtype before using it in scikit-learn, for example by using your_array = your_array.astype(np.float64).
  FutureWarning)
C:\Users\snowt\Python\scikit-learn\sklearn\utils\validation.py:558: FutureWarning: Beginning in version 0.22, arrays of bytes/strings will be converted to decimal numbers if dtype='numeric'. It is recommended that you convert the array to a float dtype before using it in scikit-learn, for example by using your_array = your_array.astype(np.float64).
  FutureWarning)
Traceback (most recent call last):
  File "C:/Users/snowt/Python/scikit-learn/test.py", line 7, in <module>
    MissingIndicator(missing_values='a').fit_transform(a)
  File "C:\Users\snowt\Python\scikit-learn\sklearn\impute.py", line 634, in fit_transform
    return self.fit(X, y).transform(X)
  File "C:\Users\snowt\Python\scikit-learn\sklearn\impute.py", line 570, in fit
    if self.features == 'missing-only'
  File "C:\Users\snowt\Python\scikit-learn\sklearn\impute.py", line 528, in _get_missing_features_info
    imputer_mask = _get_mask(X, self.missing_values)
  File "C:\Users\snowt\Python\scikit-learn\sklearn\impute.py", line 52, in _get_mask
    return np.equal(X, value_to_mask)
TypeError: ufunc 'equal' did not contain a loop with signature matching types dtype('<U1') dtype('<U1') dtype('bool')

Process finished with exit code 1

As you can see, the error comes from the _get_mask function. Since I do pretty much the same thing as MissingIndicator, I feel it's better to just modify _get_mask to be more general.
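
For readers following along, the branching under discussion can be sketched as a standalone function. This is an approximation under stated assumptions, not the code in `sklearn/impute.py`: the name `get_mask_sketch` is hypothetical, and the NaN check for object arrays uses a simple self-inequality test in place of the library's private helper. Whether `np.equal` actually raises on string arrays depends on the NumPy version, so the sketch simply routes string dtypes through `==`.

```python
import numpy as np


def get_mask_sketch(X, value_to_mask):
    """Hypothetical sketch: boolean mask of entries equal to value_to_mask."""
    X = np.asarray(X)
    if isinstance(value_to_mask, float) and np.isnan(value_to_mask):
        if X.dtype.kind == "f":
            return np.isnan(X)
        if X.dtype.kind in ("i", "u"):
            # Integer arrays cannot hold NaN.
            return np.zeros(X.shape, dtype=bool)
        # Object or string data: only NaN is unequal to itself.
        flat = [x != x for x in X.ravel()]
        return np.array(flat, dtype=bool).reshape(X.shape)
    if X.dtype.kind in ("S", "U"):
        # Byte-string and unicode arrays: compare with ==, as the diff does.
        return X == value_to_mask
    return np.equal(X, value_to_mask)


a = np.array([[c] for c in 'abcdea'], dtype=str)
print(get_mask_sketch(a, 'a').ravel())
# [ True False False False False  True]
```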

Member

> I checked the test and it seems like the sklearn.impute.MissingIndicator was only tested on numeric values. Not sure if non-numeric values should be supported since using a numpy array with object type will fail.

When it was developed, SimpleImputer didn't support non-numerics either. But that's changed, so yes, that's probably an issue.

Contributor Author

OK then. I'll open a PR after I've made some progress with the tests.

Member

Maybe just open an issue for now?

Contributor Author

Created #13035

Member

I think we want to revert this change.

+            # np.equal does not work for byte string and unicode types.
+            # However the == sign works fine.
+            return X == value_to_mask
+        else:
+            # X == value_to_mask with object dytpes does not always perform
+            # element-wise for old versions of numpy
+            return np.equal(X, value_to_mask)


 def _most_frequent(array, extra_value, n_repeat):