[WIP] NaN Support for OneHotEncoder #13028
Conversation
@jnothman in case the missing value is defined as something other than np.nan, do we categorize np.nan as another label, or do we raise an error to the user?
I would be happy with always making NaN our missing value marker here (at least initially)... But if not, let's reject NaN.
It's very hard to know if your code is working reasonably without tests.
I'm not that used to pytest, so I still need to convert my local tests. I'll add one before I commit another change.
Oops, sorry, the current code and test will encode NaN if another value is defined as missing. Specifically, what do you mean by reject NaN? An exception saying that there is NaN despite another value being defined as missing?
Well, this functionality is more general than the other, so it's okay: the caller can just raise an error if is_scalar_nan(uniques[-1]).
Is there any strong reason for keeping both _encode and _nanencode?
sklearn/preprocessing/label.py
Outdated
def _nanencode_numpy(values, uniques=None, encode=False,
                     missing_value=np.nan):
For good or bad, this should be missing_values, not missing_value, for consistency.
None),
(np.array(['b', 'a', None, 'a', None, 'c', 'b', 'c'], dtype=object),
 np.array(['a', 'b', None], dtype=object),
 'c'),
Please also test object dtype with missing_values=np.nan.
sklearn/preprocessing/label.py
Outdated
# for indexing an empty array but so is any other index.
nan_index = -1
table = {val: i for i, val in enumerate(uniques)}
table[missing_value] = nan_index
Perhaps assert missing_value not in table?
sklearn/preprocessing/label.py
Outdated
table[missing_value] = nan_index
try:
    encoded = np.array([table[v] for v in values])
    encoded_nan = (encoded == nan_index)
Remove the parentheses.
But I suspect the caller should do this instead of it being returned from here.
The _nanencode_numpy returns a NaN indicator mask so as to have a uniform interface between it and _nanencode_python.
I suppose I'm okay with it, but I think we wouldn't need it if both _nanencode_numpy and _nanencode_python used -1 to represent missingness?
Since I am using searchsorted for the numpy version, there is no easy way to make it -1. I can get the mask and set it to -1 afterwards. However, I feel like the consumer of the function will probably have more use for the mask, since they will use it to change how the NaNs are encoded.
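For illustration, a minimal sketch of that approach (hypothetical names, not the PR's actual code):
import numpy as np

def _nanencode_numpy_sketch(values, uniques):
    # searchsorted can never yield -1, so compute a NaN mask first
    nan_mask = np.isnan(values)
    encoded = np.searchsorted(uniques, values)
    # fix up the missing entries afterwards using the mask
    encoded[nan_mask] = -1
    return encoded, nan_mask

values = np.array([1.0, np.nan, 2.0, 1.0])
uniques = np.array([1.0, 2.0])
print(_nanencode_numpy_sketch(values, uniques))
# (array([ 0, -1,  1,  0]), array([False,  True, False, False]))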
Yeah, I realise there's no way to easily make it -1 (or nan) other than setting the value given the mask. I really would need to see them being used in OneHotEncoder/OrdinalEncoder to understand whether this is beneficial.
sklearn/preprocessing/label.py
Outdated
table = {val: i for i, val in enumerate(uniques)}
table[missing_value] = nan_index
try:
    encoded = np.array([table[v] for v in values])
If missing_values=nan, this won't work fully because float('nan') is neither identical nor equal to another float('nan'). Messy!
A relatively efficient way to handle this case might be to use table.setdefault(v, -1) in place of table[v]. Then after the encoding do something like:
unseen = [k for k, v in table.items() if v == -1]
if missing_values is None:
    unseen = [k for k in unseen if k is not None]
elif is_scalar_nan(missing_values):
    unseen = ... similar ...
else:
    unseen = ... similar ...
if unseen:
    raise ValueError("y contains previously unseen labels: %s" % unseen)
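A filled-in reading of the "... similar ..." branches might look like this (a hypothetical completion, assuming is_scalar_nan from sklearn.utils, not jnothman's exact intent):
import numpy as np
from sklearn.utils import is_scalar_nan

values = np.array(['b', 'a', np.nan, 'd'], dtype=object)
uniques = ['a', 'b']
missing_values = np.nan

table = {val: i for i, val in enumerate(uniques)}
table[missing_values] = -1  # nan_index, as in the snippet above
encoded = np.array([table.setdefault(v, -1) for v in values])
unseen = [k for k, v in table.items() if v == -1]
# the missing marker itself also maps to -1, so filter it out
if missing_values is None:
    unseen = [k for k in unseen if k is not None]
elif is_scalar_nan(missing_values):
    unseen = [k for k in unseen if not is_scalar_nan(k)]
else:
    unseen = [k for k in unseen if k != missing_values]
if unseen:
    raise ValueError("y contains previously unseen labels: %s" % unseen)
# raises: y contains previously unseen labels: ['d']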
It should work since dictionary key comparisons are done by hashes and not by values. That being said, np.nan, float('nan'), and np.float('nan') have different hashes.
Though I think using np.nan is a more robust solution, so I'll try to change it anyway.
Dict lookups are done by hash, identity and equality. By their nature, a hash is not sufficient to match.
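A quick demonstration of the distinction:
nan = float('nan')
d = {nan: 0}
print(d[nan])             # 0: the identity check matches the stored key
print(float('nan') in d)  # False: a fresh NaN object is neither identical
                          # nor equal to the stored one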
# X == value_to_mask with object dtypes does not always perform
# element-wise for old versions of numpy
return np.equal(X, value_to_mask)
if X.dtype.kind in ["S", "U"]:
What motivated this change? Separate PR? With a test, please?
See also numpy issue: numpy/numpy#5399
I checked the test and it seems like sklearn.impute.MissingIndicator was only tested on numeric values. Not sure if non-numeric values should be supported, since using a numpy array with object dtype will fail. The string dtype, on the other hand, gives the error below.
import numpy as np
from sklearn.impute import MissingIndicator
a = np.array([[c] for c in 'abcdea'], dtype='str')
MissingIndicator().fit_transform(a)  # 2 FutureWarning
MissingIndicator(missing_values='a').fit_transform(a)  # 1 FutureWarning and an error
Result
C:\Users\snowt\Python\scikit-learn\sklearn\utils\validation.py:558: FutureWarning: Beginning in version 0.22, arrays of bytes/strings will be converted to decimal numbers if dtype='numeric'. It is recommended that you convert the array to a float dtype before using it in scikit-learn, for example by using your_array = your_array.astype(np.float64).
  FutureWarning)
(the same FutureWarning is emitted twice more)
Traceback (most recent call last):
  File "C:/Users/snowt/Python/scikit-learn/test.py", line 7, in <module>
    MissingIndicator(missing_values='a').fit_transform(a)
  File "C:\Users\snowt\Python\scikit-learn\sklearn\impute.py", line 634, in fit_transform
    return self.fit(X, y).transform(X)
  File "C:\Users\snowt\Python\scikit-learn\sklearn\impute.py", line 570, in fit
    if self.features == 'missing-only'
  File "C:\Users\snowt\Python\scikit-learn\sklearn\impute.py", line 528, in _get_missing_features_info
    imputer_mask = _get_mask(X, self.missing_values)
  File "C:\Users\snowt\Python\scikit-learn\sklearn\impute.py", line 52, in _get_mask
    return np.equal(X, value_to_mask)
TypeError: ufunc 'equal' did not contain a loop with signature matching types dtype('<U1') dtype('<U1') dtype('bool')
Process finished with exit code 1
As you can see, the error is in the _get_mask function. Since I pretty much do the same thing as the MissingIndicator, I feel like it's better to just modify the _get_mask function to be more general.
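To sketch what "more general" could mean here (a hypothetical illustration, not the PR's actual diff):
import numpy as np

def _get_mask_sketch(X, value_to_mask):
    if isinstance(value_to_mask, float) and np.isnan(value_to_mask):
        if X.dtype.kind == "f":
            return np.isnan(X)
        # object arrays: np.isnan rejects object dtype, so test element-wise
        return np.array([v != v for v in X.ravel()]).reshape(X.shape)
    if X.dtype.kind in ("S", "U"):
        # np.equal has no loop for string dtypes, but == compares element-wise
        return X == value_to_mask
    return np.equal(X, value_to_mask)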
I checked the test and it seems like the sklearn.impute.MissingIndicator was only tested on numeric values. Not sure if non-numeric values should be supported since using a numpy array with object type will fail.
When it was developed, SimpleImputer didn't support non-numerics either. But that's changed, so yes, that's probably an issue.
Ok then. I'll make a PR after I make some progress with the tests.
Maybe just open an issue for now?
Created #13035
I think we want to revert this change
Just using the …
Please benchmark the differences if you think it's a concern.
Got this error on circleci: …
(I didn't catch up with the other PRs / discussion around this, so I might have missed some things.) General remark: this seems to be adding a lot of additional code / complexity. It might be needed, but I was wondering: how would it compare to first imputing and keeping track of the imputed values so we can handle them in different ways?
That is how it is done. It's just that I did some NaN/missing value preprocessing before encoding so that the encoding is done correctly, rather than adjusting the encoded values after encoding. For example:
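A hypothetical sketch of the two steps (not the PR's actual code):
import numpy as np

Xi = np.array([1.0, np.nan, 2.0, np.nan, 1.0])
cats = np.array([1.0, 2.0])

# preprocess: mask the NaNs and substitute an encodable placeholder
nan_mask = np.isnan(Xi)
Xi_clean = np.where(nan_mask, cats[0], Xi)
encoded = np.searchsorted(cats, Xi_clean)

# adjust afterwards: give missing values their own code past the real ones
encoded[nan_mask] = len(cats)
print(encoded)  # [0 2 1 2 0]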
You still need to adjust the encoding afterwards, since the missing value was considered in the encoding. The changes might look like a lot since I haven't deleted the previous implementation. I'm planning to write a timing benchmark to see how it compares to the old one, so I left it for now. I also want to check if there might be faster implementations (like your suggestion).
@jnothman given that the legacy part is deprecated, there is no need to change the legacy part of the OneHotEncoder, right?
Just a few questions @jnothman. Currently there is a … In …
Probably the right approach to handle the FutureWarning is to have some helper function (perhaps in sklearn.utils.fixes) which handles the case where the FutureWarning might be raised and avoids calling … Regarding …
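Something along these lines, perhaps (a hypothetical helper, not sklearn's actual fix):
import warnings
import numpy as np

def _safe_equal(X, value):
    # try the fast ufunc first; fall back to a Python-level comparison
    # when numpy warns about (or cannot handle) the dtype combination
    with warnings.catch_warnings():
        warnings.simplefilter("error", FutureWarning)
        try:
            return np.equal(X, value)
        except (FutureWarning, TypeError):
            return np.array([x == value for x in X.ravel()]).reshape(X.shape)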
Sorry for the slow attention... Busy! We will get there, perhaps not for the upcoming release, sadly.
Btw, should this be MRG now, or is there more work you feel needs to be done before it is mergeable?
It's fine if it won't make the release. So far I have only edited the docs to pass the tests; I haven't given them a good read. I might do some benchmarks, and I might change this to MRG after I review the docs.
Started looking through the implementation. This certainly feels like a beast. Do you think it would be easier to review if we just implement one missing value handling approach initially?
How do you feel about the state of the tests? Are they reasonably complete?
# X == value_to_mask with object dtypes does not always perform
# element-wise for old versions of numpy
return np.equal(X, value_to_mask)
if X.dtype.kind in ["S", "U"]:
I think we want to revert this change
_nanencode(Xi, cats, encode=True,
           missing_values=missing_values)
except ValueError as e:
    diff = e.args[0]
I don't think we can assume that the ValueError being raised is the one you explicitly raise. ValueErrors are the most common unintended error in case of logical failure.
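One way around that (a sketch, not the PR's code) is a dedicated subclass, so only the intended error is caught:
class UnseenLabelsError(ValueError):
    # hypothetical: carry the offending labels explicitly instead of
    # making the caller parse e.args[0]
    def __init__(self, diff):
        self.diff = diff
        super().__init__("y contains previously unseen labels: %s" % str(diff))

try:
    raise UnseenLabelsError({'d'})
except UnseenLabelsError as e:
    diff = e.diff  # an incidental ValueError would propagate instead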
encode=True,
encode_unknown=encode_unknown)

if len(encode_results) == 4:
This is not very readable. (It's also a bit strange that encode_results should change the shape of its return value without a more explicit request.)
@@ -673,37 +685,69 @@ def _legacy_transform(self, X):

    return out if self.sparse else out.toarray()

def _make_onehot_sparse_matrix(self, labels, mask, cat_ns):
Please document what these parameters are.
dtype=self.dtype)
return out

def _make_nan_sparse_matrix(self, Xi_missing, n_categories):
Please document what these parameters are and the purpose/return of this function.
Xi_missing = X_missing[:, i]
Xi_int = X_int[:, i]
Xi_int[Xi_missing] = c
cats_ns = [c+1 for c in cats_ns]
Spaces around +, please.
X_int[X_int > to_drop] -= 1
n_values = [len(cats) - 1 for cats in self.categories_]
cats_ns = [len(cats) - 1 for cats in self.categories_]
I don't like the name cats_ns. Please make it more explicit; n_values was okay.
if self.handle_missing == 'category':
    for i, c in enumerate(cats_ns):
        Xi_missing = X_missing[:, i]
        Xi_int = X_int[:, i]
I don't think this intermediate variable needs a name. Put this inline.
return uniques


def _nanencode(values, uniques=None, encode=False,
Again, it would be nice to not need this as separate from _encode.
# Since nan comes in multiple forms, hash is not enough to identify it
def _dict_to_mapper(d, **kwargs):
FWIW, unless I'm much mistaken, this is similar to the functionality provided by:
class nanmapper(dict):
    def __missing__(self, key):
        return nan_value
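For example, with the fallback hard-coded to -1 (names hypothetical):
class NanMapper(dict):
    def __missing__(self, key):
        # any key not found by hash/identity/equality lands here,
        # including a NaN object other than the one stored
        return -1

table = NanMapper({'a': 0, 'b': 1})
print(table['a'])           # 0
print(table[float('nan')])  # -1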
Hello @baluyotraf, are you still working on the encoding with missing values?
Go ahead. I was busy with work-related stuff, so I haven't found time to work on this again. xD
Reference Issues/PRs
Fixes #11996
What does this implement/fix? Explain your changes.
I'll list the things I have implemented so far:
- A version of _encode named _nanencode …
- _nanencode …
- _nanencode …
Any other comments?
Work in progress, but feel free to comment on or suggest improvements to the implementation.