[WIP] NaN Support for OneHotEncoder #13028


Closed
wants to merge 52 commits

Conversation

@baluyotraf (Contributor) commented Jan 22, 2019

Reference Issues/PRs

Fixes #11996

What does this implement/fix? Explain your changes.

I'll list the things I have implemented so far:

  • Draft implementation of a NaN-compatible version of _encode, named _nanencode
  • Improved the implementation and added a missing_value parameter to _nanencode
  • Added tests for _nanencode

Any other comments?

Work in progress, but feel free to comment on or suggest improvements to the implementation.

@baluyotraf (Contributor Author)

@jnothman in case the missing value is defined as something other than np.nan, do we categorize np.nan as another label, or do we send an error to the user?

@jnothman (Member)

I would be happy with always making NaN our missing value marker here (at least initially)... But if not, let's reject NaN.

@jnothman (Member)

It's very hard to know if your code is working reasonably without tests.

@baluyotraf (Contributor Author)

I'm not that used to pytest, so I still need to convert my local tests. I'll add one before I commit another change.

@baluyotraf (Contributor Author) commented Jan 23, 2019

Oops, sorry, the current code and test will encode NaN if another value is defined as missing. Specifically, what do you mean by reject NaN? An exception saying that there is NaN despite another value being defined as missing?

@jnothman (Member) left a comment

Well this functionality is more general than the other, so it's okay: the caller can just raise an error if is_scalar_nan(uniques[-1]).

Is there any strong reason for keeping both _encode and _nanencode?
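For illustration, the caller-side rejection described above could look like the following sketch. is_scalar_nan exists in sklearn.utils; the _nanencode usage is assumed from this PR:

from sklearn.utils import is_scalar_nan

# np.unique sorts NaN to the end, so when NaN is present it is the last
# entry of uniques; rejecting NaN is then a single-element check.
uniques = _nanencode(values)
if len(uniques) and is_scalar_nan(uniques[-1]):
    raise ValueError("Input contains NaN")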



def _nanencode_numpy(values, uniques=None, encode=False,
                     missing_value=np.nan):
Member:

For good or bad, this should be missing_values, not missing_value, for consistency.

None),
(np.array(['b', 'a', None, 'a', None, 'c', 'b', 'c', ], dtype=object),
np.array(['a', 'b', None], dtype=object),
'c'),
Member:

Please also test object dtype with missing_values=np.nan.

# for indexing an empty array but so is any other index.
nan_index = -1
table = {val: i for i, val in enumerate(uniques)}
table[missing_value] = nan_index
Member:

perhaps assert missing_value not in table?
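In context, the suggested guard would look roughly like this (a sketch based on the diff hunk above):

table = {val: i for i, val in enumerate(uniques)}
# Guard against a missing_value that collides with an observed category,
# which would otherwise be silently remapped to nan_index.
assert missing_value not in table
table[missing_value] = nan_index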

table[missing_value] = nan_index
try:
    encoded = np.array([table[v] for v in values])
    encoded_nan = (encoded == nan_index)
Member:

rm parentheses

Member:

But I suspect the caller should do this instead of it being returned from here.

@baluyotraf (Contributor Author), Jan 23, 2019:

_nanencode_numpy returns a NaN indicator mask so that it has a uniform interface with _nanencode_python.

Member:

I suppose I'm okay with it, but I think we wouldn't need it if both _nanencode_numpy and _nanencode_python used -1 to represent missingness?

Contributor Author:

Since I am using searchsorted for the numpy version, there is no way to easily make it -1. I can get the mask and set the masked entries to -1 afterwards. However, I feel like the consumer of the function will probably have more use for the mask, since they will use it to change how the NaNs are encoded.
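Concretely, searchsorted only yields insertion positions, so missing entries have to be rewritten through the mask afterwards. A small standalone illustration (the values are hypothetical):

import numpy as np

uniques = np.array([0, 1, 3, 4])            # categories with the missing value (2) removed
values = np.array([0, 1, 2, 3, 4])
encoded = np.searchsorted(uniques, values)  # [0 1 2 2 3]; the code at the missing entry is an artifact
missing_mask = values == 2                  # mask of missing entries
encoded[missing_mask] = -1                  # rewrite them as -1 if desired
print(encoded)  # [ 0  1 -1  2  3]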

Member:

Yeah, I realise there's no way to easily make it -1 (or NaN) other than setting the value given the mask. I really would need to see them being used in OneHotEncoder/OrdinalEncoder to understand whether this is beneficial.

table = {val: i for i, val in enumerate(uniques)}
table[missing_value] = nan_index
try:
    encoded = np.array([table[v] for v in values])
@jnothman (Member), Jan 23, 2019:

If missing_values=nan this won't work fully because float('nan') is neither identical nor equal to another float('nan'). Messy!

A relatively efficient way to handle this case might be to use table.setdefault(v, -1) in place of table[v]. Then after the encoding do something like:

unseen = [k for k, v in table.items() if v == -1]
if missing_values is None:
    unseen = [k for k in unseen if k is not None]
elif is_scalar_nan(missing_values):
    unseen = ... similar ...
else:
    unseen = ... similar ...
if unseen:
    raise ValueError("y contains previously unseen labels: %s" % unseen)
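One way the elided branches could be filled in (a sketch; is_scalar_nan is in sklearn.utils, the rest follows the snippet above):

from sklearn.utils import is_scalar_nan

# Keys still mapped to -1 after encoding were never seen at fit time,
# except for the designated missing marker, which must be filtered out.
unseen = [k for k, v in table.items() if v == -1]
if missing_values is None:
    unseen = [k for k in unseen if k is not None]
elif is_scalar_nan(missing_values):
    # Distinct NaN objects compare unequal, so several NaN keys may exist.
    unseen = [k for k in unseen if not is_scalar_nan(k)]
else:
    unseen = [k for k in unseen if k != missing_values]
if unseen:
    raise ValueError("y contains previously unseen labels: %s" % unseen)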

Contributor Author:

It should work, since dictionary key comparisons are done by hashes and not by values. That being said, np.nan, float('nan'), and np.float('nan') have different hashes.

Though I think using np.nan is a more robust solution, so I'll try to change it anyway.

Member:

Dict lookups are done by hash, identity and equality. By their nature, a hash is not sufficient to match.
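A quick standalone illustration of that lookup behavior (not part of the PR):

nan = float('nan')
d = {nan: 0}
print(d[nan])  # 0: the identity check in dict lookup short-circuits equality
try:
    d[float('nan')]  # a distinct NaN object; NaN != NaN, so equality fails
except KeyError:
    print("a fresh NaN does not match the stored NaN key")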

# X == value_to_mask with object dytpes does not always perform
# element-wise for old versions of numpy
return np.equal(X, value_to_mask)
if X.dtype.kind in ["S", "U"]:
Member:

What motivated this change? Separate PR? With a test, please?

@baluyotraf (Contributor Author), Jan 23, 2019:

See also numpy issue: numpy/numpy#5399

I checked the test and it seems like sklearn.impute.MissingIndicator was only tested on numeric values. I'm not sure if non-numeric values should be supported, since using a numpy array with object dtype will fail. The string type, on the other hand, gives the error below.

import numpy as np
from sklearn.impute import MissingIndicator

a = np.array([[c] for c in 'abcdea'], dtype='str')

MissingIndicator().fit_transform(a) # 2 FutureWarning
MissingIndicator(missing_values='a').fit_transform(a) # 1 FutureWarning and an error

Result

C:\Users\snowt\Python\scikit-learn\sklearn\utils\validation.py:558: FutureWarning: Beginning in version 0.22, arrays of bytes/strings will be converted to decimal numbers if dtype='numeric'. It is recommended that you convert the array to a float dtype before using it in scikit-learn, for example by using your_array = your_array.astype(np.float64).
  FutureWarning)
C:\Users\snowt\Python\scikit-learn\sklearn\utils\validation.py:558: FutureWarning: Beginning in version 0.22, arrays of bytes/strings will be converted to decimal numbers if dtype='numeric'. It is recommended that you convert the array to a float dtype before using it in scikit-learn, for example by using your_array = your_array.astype(np.float64).
  FutureWarning)
C:\Users\snowt\Python\scikit-learn\sklearn\utils\validation.py:558: FutureWarning: Beginning in version 0.22, arrays of bytes/strings will be converted to decimal numbers if dtype='numeric'. It is recommended that you convert the array to a float dtype before using it in scikit-learn, for example by using your_array = your_array.astype(np.float64).
  FutureWarning)
Traceback (most recent call last):
  File "C:/Users/snowt/Python/scikit-learn/test.py", line 7, in <module>
    MissingIndicator(missing_values='a').fit_transform(a)
  File "C:\Users\snowt\Python\scikit-learn\sklearn\impute.py", line 634, in fit_transform
    return self.fit(X, y).transform(X)
  File "C:\Users\snowt\Python\scikit-learn\sklearn\impute.py", line 570, in fit
    if self.features == 'missing-only'
  File "C:\Users\snowt\Python\scikit-learn\sklearn\impute.py", line 528, in _get_missing_features_info
    imputer_mask = _get_mask(X, self.missing_values)
  File "C:\Users\snowt\Python\scikit-learn\sklearn\impute.py", line 52, in _get_mask
    return np.equal(X, value_to_mask)
TypeError: ufunc 'equal' did not contain a loop with signature matching types dtype('<U1') dtype('<U1') dtype('bool')

Process finished with exit code 1

As you can see, the error is in the _get_mask function. Since I pretty much do the same thing as MissingIndicator, I feel it's better to just modify the _get_mask function to be more general.
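A sketch of the kind of generalization being discussed, assuming is_scalar_nan from sklearn.utils (simplified relative to whatever the PR actually does):

import numpy as np
from sklearn.utils import is_scalar_nan

def _get_mask(X, value_to_mask):
    # Return a boolean mask of entries equal to value_to_mask,
    # handling NaN and string/object dtypes explicitly.
    if is_scalar_nan(value_to_mask):
        if X.dtype.kind == "f":
            return np.isnan(X)
        # object arrays can mix NaN with non-floats
        return np.array([is_scalar_nan(v) for v in X.ravel()],
                        dtype=bool).reshape(X.shape)
    if X.dtype.kind in ("S", "U", "O"):
        # == broadcasts element-wise for strings, where np.equal raises
        return X == value_to_mask
    return np.equal(X, value_to_mask)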

Member:

I checked the test and it seems like sklearn.impute.MissingIndicator was only tested on numeric values. I'm not sure if non-numeric values should be supported, since using a numpy array with object dtype will fail.

When it was developed, SimpleImputer didn't support non-numerics either. But that's changed, so yes, that's probably an issue.

Contributor Author:

OK then. I'll make a PR after I've made some progress with the tests.

Member:

Maybe just open an issue for now?

Contributor Author:

Created #13035

Member:

I think we want to revert this change

@baluyotraf (Contributor Author)

I'm just using _encode as a reference. Though it is faster than _nanencode, so I'm not sure if people want it gone.

@jnothman (Member)

Though it is faster than _nanencode, so I'm not sure if people want it gone.

Please benchmark the differences if you think it's a concern.

@baluyotraf (Contributor Author)

Got this error on CircleCI:

CondaHTTPError: HTTP 502 BAD GATEWAY for url <https://repo.anaconda.com/pkgs/main/linux-64/cython-0.29.2-py36he6710b0_0.tar.bz2>
Elapsed: 00:00.022996
CF-RAY: 49de382b3b83c1bd-IAD

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.

@jorisvandenbossche (Member)

(I didn't catch up with other PRs / discussion around this, so I might have missed some things)

General remark: this seems to be adding a lot of additional code/complexity. It might be needed, but I was wondering: how would it compare to first imputing and keeping track of the imputed value, so we can handle it in different ways?

@baluyotraf (Contributor Author) commented Jan 24, 2019

That is how it is done. It's just that I did some NaN/missing-value preprocessing before encoding, so that the encoding is done correctly, rather than adjusting the encoded values after encoding. For example:

from sklearn.impute import _get_mask
from sklearn.preprocessing.label import _encode, _nanencode
import numpy as np


a = np.arange(0, 5)
mv = 2

# Encode and fix
_, encoded = _encode(a, encode=True)
missing_e = _get_mask(a, mv)
print(encoded)  # [0 1 2 3 4]

# Nanencode
_, nanencoded, missing_ne = _nanencode(a, encode=True, missing_values=mv)
print(nanencoded)  # [0 1 2 2 3]

With the plain _encode approach you still need to adjust the encoding afterwards, since the missing value was included in the encoding.
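By contrast, the mask returned by _nanencode can be applied directly to give missing entries whatever code is wanted (a sketch continuing the example above):

# Four non-missing categories remain (0, 1, 3, 4), so code 4 is free
# for the missing entries.
nanencoded[missing_ne] = 4
print(nanencoded)  # [0 1 4 2 3]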

The changes might look like a lot since I haven't deleted the previous implementation. I'm planning to write a timing benchmark to see how it compares to the old one, so I left it in for now. I also want to check if there might be faster implementations (like your suggestion).

@baluyotraf (Contributor Author)

@jnothman given that the legacy part is deprecated, there is no need to change the legacy part of the OneHotEncoder, right?

@baluyotraf (Contributor Author)

Just a few questions, @jnothman:

Currently there is a FutureWarning inside the numpy function I am using, which is causing the tests to fail. The function uses another numpy function whose behavior will change, though the change won't have an impact on the function I am using. What is your preferred approach to making the tests pass?

In inverse_transform, the all-zero encoding can be reversed into 3 different values: None if handle_unknown is 'ignore', the dropped value if drop is not None, and missing_values if handle_missing is 'all-zero'. What should the priority be here?

@jnothman (Member)

Probably the right approach to handle the FutureWarning is to have some helper function (perhaps in sklearn.utils.fixes) which handles the case where the FutureWarning might be raised and avoids calling in1d, or at least silences the warning.
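A minimal sketch of such a helper (the name and placement are hypothetical, not an existing sklearn.utils.fixes function):

import warnings

import numpy as np


def in1d_silenced(ar1, ar2):
    # Call np.in1d while suppressing the FutureWarning that some numpy
    # versions emit for certain dtype combinations.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", FutureWarning)
        return np.in1d(ar1, ar2)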

Regarding inverse_transform, in the interim I don't mind if inverse_transform just breaks if this case arises. But inverting into None given handle_unknown='ignore' doesn't really make sense. Why None? My preference is to invert into missing_values if we handle this at all.

@jnothman (Member) commented Mar 26, 2019 via email

@jnothman (Member)

Btw, should this be MRG now, or is there more work you feel needs to be done before it is mergeable?

@baluyotraf (Contributor Author)

It's fine if it won't make the release.

So far I've only edited the docs to pass the tests. I haven't given them a good read.

I might do some benchmarks, but I might change this to MRG after I've reviewed the docs.

@jnothman (Member) left a comment

Started looking through the implementation. This certainly feels like a beast. Do you think it would be easier to review if we just implemented one missing value handling approach initially?

How do you feel about the state of the tests? Are they reasonably complete?

# X == value_to_mask with object dytpes does not always perform
# element-wise for old versions of numpy
return np.equal(X, value_to_mask)
if X.dtype.kind in ["S", "U"]:
Member:

I think we want to revert this change

    _nanencode(Xi, cats, encode=True,
               missing_values=missing_values)
except ValueError as e:
    diff = e.args[0]
Member:

I don't think we can assume that the ValueError being raised is the one you explicitly raise. ValueErrors are the most common unintended error in case of logical failure.
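One common pattern for that (illustrative; not what the PR implements) is a dedicated exception subclass, so the caller only catches the intended failure:

class UnseenLabelsError(ValueError):
    """Raised when encoding meets categories absent at fit time."""

# caller side: any other ValueError now propagates instead of being
# misread as an unseen-labels report
try:
    _nanencode(Xi, cats, encode=True, missing_values=missing_values)
except UnseenLabelsError as e:
    diff = e.args[0]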

encode=True,
encode_unknown=encode_unknown)

if len(encode_results) == 4:
Member:

This is not very readable. (It's also a bit strange that encode_results should change the shape of its return value without more explicit request)

@@ -673,37 +685,69 @@ def _legacy_transform(self, X):

return out if self.sparse else out.toarray()

def _make_onehot_sparse_matrix(self, labels, mask, cat_ns):
Member:

please document what these parameters are

dtype=self.dtype)
return out

def _make_nan_sparse_matrix(self, Xi_missing, n_categories):
Member:

please document what these parameters are and the purpose/return of this function

Xi_missing = X_missing[:, i]
Xi_int = X_int[:, i]
Xi_int[Xi_missing] = c
cats_ns = [c+1 for c in cats_ns]
Member:

spaces around +, please

X_int[X_int > to_drop] -= 1
n_values = [len(cats) - 1 for cats in self.categories_]
cats_ns = [len(cats) - 1 for cats in self.categories_]
Member:

I don't like the name cats_ns. Please make it more explicit. n_values was okay

if self.handle_missing == 'category':
for i, c in enumerate(cats_ns):
Xi_missing = X_missing[:, i]
Xi_int = X_int[:, i]
Member:

I don't think this intermediate variable needs a name. Put this inline

return uniques


def _nanencode(values, uniques=None, encode=False,
Member:

again, it would be nice to not need this as separate from _encode



# Since nan comes in multiple forms, hash is not enough to identify it
def _dict_to_mapper(d, **kwargs):
Member:

FWIW, unless I'm much mistaken, this is similar to the functionality provided by:

class nanmapper(dict):
    def __missing__(self, key):
        return nan_value
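As written, nan_value is a free variable in that sketch; a self-contained variant (names illustrative) could be:

class NanMapper(dict):
    """dict whose lookups fall back to a sentinel for unseen keys,
    including fresh NaN objects that never compare equal to the
    stored NaN key."""

    def __init__(self, mapping, nan_value=-1):
        super().__init__(mapping)
        self.nan_value = nan_value

    def __missing__(self, key):
        # dict.__getitem__ calls this on any lookup miss
        return self.nan_value

table = NanMapper({'a': 0, 'b': 1})
print(table['a'])           # 0
print(table[float('nan')])  # -1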

@baluyotraf (Contributor Author) commented Apr 7, 2019

@jnothman I made a comment in #12045 since there's also a concern about the complexity of the implementation. You might want to check it out, since I laid out the trade-offs we can make to simplify the implementation.

@TwsThomas (Contributor)

Hello @baluyotraf, are you still working on the encoding with missing values?
I'd be happy to try to help here.

@baluyotraf (Contributor Author)

Hello @baluyotraf, are you still working on the encoding with missing values?
I'd be happy to try to help here.

Go ahead. I was busy with work-related stuff, so I haven't found time to work on this again. xD

Development

Successfully merging this pull request may close these issues.

Handle missing values in OneHotEncoder
6 participants