[WIP] NaN Support for OneHotEncoder #13028
Conversation
@jnothman in case the missing value is defined as something other than np.nan, do we categorize np.nan as another label, or do we raise an error to the user?
I would be happy with always making NaN our missing value marker here (at least initially)... But if not, let's reject NaN.
It's very hard to know if your code is working reasonably without tests.
I'm not that used to pytest, so I still need to convert my local tests. I'll add one before I commit another change.
Oops, sorry, the current code and test will encode NaN if another value is defined as missing. Specifically, what do you mean by reject NaN? An exception saying that there is NaN despite another value being defined as missing?
Well, this functionality is more general than the other, so it's okay: the caller can just raise an error if is_scalar_nan(uniques[-1]).
Is there any strong reason for keeping both _encode and _nanencode?
sklearn/preprocessing/label.py
Outdated
def _nanencode_numpy(values, uniques=None, encode=False,
                     missing_value=np.nan):
For good or bad, this should be missing_values, not missing_value, for consistency.
None),
(np.array(['b', 'a', None, 'a', None, 'c', 'b', 'c'], dtype=object),
 np.array(['a', 'b', None], dtype=object),
 'c'),
Please also test object dtype with missing_values=np.nan.
sklearn/preprocessing/label.py
Outdated
# for indexing an empty array but so is any other index.
nan_index = -1
table = {val: i for i, val in enumerate(uniques)}
table[missing_value] = nan_index
Perhaps assert missing_value not in table?
sklearn/preprocessing/label.py
Outdated
table[missing_value] = nan_index
try:
    encoded = np.array([table[v] for v in values])
    encoded_nan = (encoded == nan_index)
Remove the parentheses.
But I suspect the caller should do this instead of it being returned from here.
The _nanencode_numpy returns a NaN indicator mask so as to have a uniform interface between it and _nanencode_python.
I suppose I'm okay with it, but I think we wouldn't need it if both _nanencode_numpy and _nanencode_python used -1 to represent missingness?
Since I am using searchsorted for the numpy version, there is no easy way to make it -1. I can get the mask and set it to -1 afterwards. However, I feel like the consumer of the function will probably have more use for the mask, since they will use it to change how the NaNs are encoded.
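For illustration, a minimal sketch of that approach (hypothetical names, not the PR's actual code):
import numpy as np

def _nanencode_numpy_sketch(values, uniques):
    # searchsorted can never yield -1, so compute a NaN mask first
    nan_mask = np.isnan(values)
    encoded = np.searchsorted(uniques, values)
    # fix up the missing entries afterwards using the mask
    encoded[nan_mask] = -1
    return encoded, nan_mask

values = np.array([1.0, np.nan, 2.0, 1.0])
uniques = np.array([1.0, 2.0])
print(_nanencode_numpy_sketch(values, uniques))
# (array([ 0, -1,  1,  0]), array([False,  True, False, False]))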
Yeah, I realise there's no way to easily make it -1 (or nan) other than setting the value given the mask. I really would need to see them being used in OneHotEncoder/OrdinalEncoder to understand whether this is beneficial.
sklearn/preprocessing/label.py
Outdated
table = {val: i for i, val in enumerate(uniques)}
table[missing_value] = nan_index
try:
    encoded = np.array([table[v] for v in values])
If missing_values=nan, this won't work fully because float('nan') is neither identical nor equal to another float('nan'). Messy!
A relatively efficient way to handle this case might be to use table.setdefault(v, -1) in place of table[v]. Then after the encoding do something like:
unseen = [k for k, v in table.items() if v == -1]
if missing_values is None:
    unseen = [k for k in unseen if k is not None]
elif is_scalar_nan(missing_values):
    unseen = ... similar ...
else:
    unseen = ... similar ...
if unseen:
    raise ValueError("y contains previously unseen labels: %s" % unseen)
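A filled-in reading of the "... similar ..." branches might look like this (a hypothetical completion, assuming is_scalar_nan from sklearn.utils, not jnothman's exact intent):
import numpy as np
from sklearn.utils import is_scalar_nan

values = np.array(['b', 'a', np.nan, 'd'], dtype=object)
uniques = ['a', 'b']
missing_values = np.nan

table = {val: i for i, val in enumerate(uniques)}
table[missing_values] = -1  # nan_index, as in the snippet above
encoded = np.array([table.setdefault(v, -1) for v in values])
unseen = [k for k, v in table.items() if v == -1]
# the missing marker itself also maps to -1, so filter it out
if missing_values is None:
    unseen = [k for k in unseen if k is not None]
elif is_scalar_nan(missing_values):
    unseen = [k for k in unseen if not is_scalar_nan(k)]
else:
    unseen = [k for k in unseen if k != missing_values]
if unseen:
    raise ValueError("y contains previously unseen labels: %s" % unseen)
# raises: y contains previously unseen labels: ['d']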
It should work since dictionary key comparisons are done by hashes and not by values. That being said, np.nan, float('nan'), and np.float('nan') have different hashes.
Though I think using np.nan is a more robust solution, so I'll try to change it anyway.
Dict lookups are done by hash, identity and equality. By their nature, a hash is not sufficient to match.
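A quick demonstration of the distinction:
nan = float('nan')
d = {nan: 0}
print(d[nan])             # 0: the identity check matches the stored key
print(float('nan') in d)  # False: a fresh NaN object is neither identical
                          # nor equal to the stored one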
# X == value_to_mask with object dtypes does not always perform
# element-wise for old versions of numpy
return np.equal(X, value_to_mask)
if X.dtype.kind in ["S", "U"]:
What motivated this change? Separate PR? With a test, please?
See also numpy issue: numpy/numpy#5399
I checked the test and it seems like sklearn.impute.MissingIndicator was only tested on numeric values. Not sure if non-numeric values should be supported, since using a numpy array with object dtype will fail. The string dtype, on the other hand, gives the error below.
import numpy as np
from sklearn.impute import MissingIndicator
a = np.array([[c] for c in 'abcdea'], dtype='str')
MissingIndicator().fit_transform(a)  # 2 FutureWarning
MissingIndicator(missing_values='a').fit_transform(a)  # 1 FutureWarning and an error
Result
C:\Users\snowt\Python\scikit-learn\sklearn\utils\validation.py:558: FutureWarning: Beginning in version 0.22, arrays of bytes/strings will be converted to decimal numbers if dtype='numeric'. It is recommended that you convert the array to a float dtype before using it in scikit-learn, for example by using your_array = your_array.astype(np.float64).
  FutureWarning)
(the same FutureWarning is emitted twice more)
Traceback (most recent call last):
  File "C:/Users/snowt/Python/scikit-learn/test.py", line 7, in <module>
    MissingIndicator(missing_values='a').fit_transform(a)
  File "C:\Users\snowt\Python\scikit-learn\sklearn\impute.py", line 634, in fit_transform
    return self.fit(X, y).transform(X)
  File "C:\Users\snowt\Python\scikit-learn\sklearn\impute.py", line 570, in fit
    if self.features == 'missing-only'
  File "C:\Users\snowt\Python\scikit-learn\sklearn\impute.py", line 528, in _get_missing_features_info
    imputer_mask = _get_mask(X, self.missing_values)
  File "C:\Users\snowt\Python\scikit-learn\sklearn\impute.py", line 52, in _get_mask
    return np.equal(X, value_to_mask)
TypeError: ufunc 'equal' did not contain a loop with signature matching types dtype('<U1') dtype('<U1') dtype('bool')
Process finished with exit code 1
As you can see, the error is in the _get_mask function. Since I pretty much do the same thing as the MissingIndicator, I feel like it's better to just modify the _get_mask function to be more general.
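To sketch what "more general" could mean here (a hypothetical illustration, not the PR's actual diff):
import numpy as np

def _get_mask_sketch(X, value_to_mask):
    if isinstance(value_to_mask, float) and np.isnan(value_to_mask):
        if X.dtype.kind == "f":
            return np.isnan(X)
        # object arrays: np.isnan rejects object dtype, so test element-wise
        return np.array([v != v for v in X.ravel()]).reshape(X.shape)
    if X.dtype.kind in ("S", "U"):
        # np.equal has no loop for string dtypes, but == compares element-wise
        return X == value_to_mask
    return np.equal(X, value_to_mask)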
I checked the test and it seems like the sklearn.impute.MissingIndicator was only tested on numeric values. Not sure if non-numeric values should be supported since using a numpy array with object type will fail.
When it was developed, SimpleImputer didn't support non-numerics either. But that's changed, so yes, that's probably an issue.
Ok then. I'll make a PR after I make some progress with the tests.
Maybe just open an issue for now?
Created #13035
I think we want to revert this change
Just using the …
Please benchmark the differences if you think it's a concern.
Got this error on circleci: …
(I didn't catch up with the other PRs / discussion around this, so I might have missed some things.) General remark: this seems to be adding a lot of additional code / complexity. It might be needed, but I was wondering: how would it compare to first imputing and keeping track of the imputed values so we can handle them in different ways?
That is how it is done. It's just that I did some NaN/missing value preprocessing before encoding so that the encoding is done correctly, rather than adjusting the encoded values after encoding. For example:
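A hypothetical sketch of the two steps (not the PR's actual code):
import numpy as np

Xi = np.array([1.0, np.nan, 2.0, np.nan, 1.0])
cats = np.array([1.0, 2.0])

# preprocess: mask the NaNs and substitute an encodable placeholder
nan_mask = np.isnan(Xi)
Xi_clean = np.where(nan_mask, cats[0], Xi)
encoded = np.searchsorted(cats, Xi_clean)

# adjust afterwards: give missing values their own code past the real ones
encoded[nan_mask] = len(cats)
print(encoded)  # [0 2 1 2 0]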
You still need to adjust the encoding afterwards, since the missing value was considered in the encoding. The changes might look like a lot since I haven't deleted the previous implementation. I'm planning to write a timing benchmark to see how it compares to the old one, so I left it for now. I also want to check if there might be faster implementations (like your suggestion).
@jnothman given that the legacy part is deprecated, there is no need to change the legacy part of the OneHotEncoder, right?
Just a few questions @jnothman. Currently there is a … In …
Probably the right approach to handle the FutureWarning is to have some helper function (perhaps in sklearn.utils.fixes) which handles the case where the FutureWarning might be raised and avoids calling … Regarding …
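Something along these lines, perhaps (a hypothetical helper, not sklearn's actual fix):
import warnings
import numpy as np

def _safe_equal(X, value):
    # try the fast ufunc first; fall back to a Python-level comparison
    # when numpy warns about (or cannot handle) the dtype combination
    with warnings.catch_warnings():
        warnings.simplefilter("error", FutureWarning)
        try:
            return np.equal(X, value)
        except (FutureWarning, TypeError):
            return np.array([x == value for x in X.ravel()]).reshape(X.shape)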
Sorry for the slow attention... Busy! We will get there, perhaps not for the upcoming release, sadly.
Btw, should this be MRG now, or is there more work you feel needs to be done before it is mergeable?
It's fine if it won't make the release. So far I have only edited the docs to pass the tests; I haven't given them a good read. I might do some benchmarks, and I might change this to MRG after I review the docs.
Started looking through the implementation. This certainly feels like a beast. Do you think it would be easier to review if we just implement one missing value handling approach initially?
How do you feel about the state of the tests? Are they reasonably complete?
# X == value_to_mask with object dtypes does not always perform
# element-wise for old versions of numpy
return np.equal(X, value_to_mask)
if X.dtype.kind in ["S", "U"]:
I think we want to revert this change
_nanencode(Xi, cats, encode=True,
           missing_values=missing_values)
except ValueError as e:
    diff = e.args[0]
I don't think we can assume that the ValueError being raised is the one you explicitly raise. ValueErrors are the most common unintended error in case of logical failure.
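One way around that (a sketch, not the PR's code) is a dedicated subclass, so only the intended error is caught:
class UnseenLabelsError(ValueError):
    # hypothetical: carry the offending labels explicitly instead of
    # making the caller parse e.args[0]
    def __init__(self, diff):
        self.diff = diff
        super().__init__("y contains previously unseen labels: %s" % str(diff))

try:
    raise UnseenLabelsError({'d'})
except UnseenLabelsError as e:
    diff = e.diff  # an incidental ValueError would propagate instead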
encode=True,
encode_unknown=encode_unknown)

if len(encode_results) == 4:
This is not very readable. (It's also a bit strange that encode_results should change the shape of its return value without a more explicit request.)
@@ -673,37 +685,69 @@ def _legacy_transform(self, X):

    return out if self.sparse else out.toarray()

def _make_onehot_sparse_matrix(self, labels, mask, cat_ns):
Please document what these parameters are.
dtype=self.dtype)
return out

def _make_nan_sparse_matrix(self, Xi_missing, n_categories):
Please document what these parameters are and the purpose/return of this function.
Xi_missing = X_missing[:, i]
Xi_int = X_int[:, i]
Xi_int[Xi_missing] = c
cats_ns = [c+1 for c in cats_ns]
Spaces around +, please.
X_int[X_int > to_drop] -= 1
n_values = [len(cats) - 1 for cats in self.categories_]
cats_ns = [len(cats) - 1 for cats in self.categories_]
I don't like the name cats_ns. Please make it more explicit; n_values was okay.
if self.handle_missing == 'category':
    for i, c in enumerate(cats_ns):
        Xi_missing = X_missing[:, i]
        Xi_int = X_int[:, i]
I don't think this intermediate variable needs a name. Put this inline.
return uniques


def _nanencode(values, uniques=None, encode=False,
Again, it would be nice to not need this as separate from _encode.
# Since nan comes in multiple forms, hash is not enough to identify it
def _dict_to_mapper(d, **kwargs):
FWIW, unless I'm much mistaken, this is similar to the functionality provided by:
class nanmapper(dict):
    def __missing__(self, key):
        return nan_value
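For example, with the fallback hard-coded to -1 (names hypothetical):
class NanMapper(dict):
    def __missing__(self, key):
        # any key not found by hash/identity/equality lands here,
        # including a NaN object other than the one stored
        return -1

table = NanMapper({'a': 0, 'b': 1})
print(table['a'])           # 0
print(table[float('nan')])  # -1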
Hello @baluyotraf, are you still working on the encoding with missing values?
Go ahead. I was busy with work-related stuff, so I haven't found time to work on this again. xD
Reference Issues/PRs
Fixes #11996
What does this implement/fix? Explain your changes.
I'll list the things I have implemented so far:
- A version of _encode named _nanencode …
- _nanencode …
- _nanencode …
Any other comments?
Work in progress, but feel free to comment on or suggest improvements to the implementation.