[WIP] Handle missing values in OrdinalEncoder #12045

Conversation
Apologies. I must've not switched branches for the first three commits. Commits for this issue start with
sklearn/preprocessing/_encoders.py
Outdated
- 'ignore' : Allow missing values to pass through ``transform()``
  and be returned as np.nan
- 'largest' : Encode missing values as largest ordinal category

It's unclear whether "largest" refers to category frequency or to its order. "greatest" is better, but I don't much like that either.

What about using maximum/minimum or some subset thereof? max_cat/min_cat?
sklearn/preprocessing/_encoders.py
Outdated
  and be returned as np.nan
- 'largest' : Encode missing values as largest ordinal category
- 'smallest' : Encode missing values as 0
- 'separate' : Encode missing values as -1

Maybe just allow the user to give an integer. I think handle_missing=-1 is clearer... Or maybe not?

Definitely. I could see that being useful.
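As a rough illustration of the integer option under discussion — the function and parameter names below are hypothetical, not scikit-learn API — known categories could map to 0..n-1 while missing values take the user-supplied integer:

```python
import numpy as np

# Hypothetical sketch of handle_missing as a user-supplied integer code
# (names are illustrative only, not the real scikit-learn API).
def ordinal_encode(values, categories, missing_code=-1):
    """Encode values against sorted categories; NaNs get missing_code."""
    values = np.asarray(values, dtype=float)
    categories = np.asarray(categories, dtype=float)
    missing = np.isnan(values)
    encoded = np.empty(len(values), dtype=float)
    # searchsorted maps each known value to its ordinal position
    encoded[~missing] = np.searchsorted(categories, values[~missing])
    encoded[missing] = missing_code
    return encoded

print(ordinal_encode([1.0, np.nan, 3.0], [1.0, 3.0]))  # [ 0. -1.  1.]
```

With this shape, 'smallest' and 'separate' stop being special string options: they are just missing_code=0 and missing_code=-1.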
sklearn/preprocessing/_encoders.py
Outdated
- 'smallest' : Encode missing values as 0
- 'separate' : Encode missing values as -1

handle_unknown : string, default 'error'

I think this should precede the missing values params.

That makes sense -- I'll put handle_unknown before missing_value.
sklearn/preprocessing/_encoders.py
Outdated
@@ -699,6 +699,26 @@ class OrdinalEncoder(_BaseEncoder):
dtype : number type, default np.float64
    Desired dtype of output.

missing_value : object type or a list of object types, default "nan"

Called missing_values (plural) elsewhere. Let's be consistent.

We can't really accept a string "nan", which is ambiguous when X can be strings.

And I'd rather support NaN by default for now and not make it user-configurable.

And I'd rather support NaN by default for now and not make it user-configurable

+1
Also, if we keep it as an option, I would not make it a list (e.g. in SimpleImputer it is also only a single value; we should be consistent with that, IMO).

We can't really accept a string "nan" which is ambiguous when X can be strings

@maxcopeland to give some context: in imputers currently it is a string "nan", but in master (and in soon-to-be-released 0.20) this was changed to actually be np.nan.
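For context on the np.nan convention mentioned above, here is a minimal example with SimpleImputer (available from scikit-learn 0.20 onwards):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [np.nan], [3.0]])
# missing_values is the actual np.nan object, not the string "nan"
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
print(imp.fit_transform(X))  # NaN is replaced by the column mean, 2.0
```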
Thanks for working on this.

Added some additional comments.
I think it would be good to first write some tests (they are required anyhow) to see what behaviour we are targeting.
sklearn/preprocessing/_encoders.py
Outdated
Which value(s) passed to ``transform()`` are to be considered missing

handle_missing : string, default 'ignore'
    How to handle missing values passed to ``transform()``

Is it only for transform?

Yes, an assumption I'm making is that the data fit to the encoder will not have missing values. Is this incorrect?
With this version, if data with missing values is fit to the encoder, the missing value will be appended to the categories of the instance. Missing values would still get encoded appropriately in transform, but this means the transformed data will have one missing category.
Should I expand this implementation to make fit tolerant to data with missing values?

Yes, data passed to fit can contain NaNs. So fit and transform should be consistent with regard to this handle_missing keyword. E.g. if it is set to 'ignore' (preserving them in the output), NaNs should not be added to the categories during fit.

Okay, good to know. I'll make it consistent.
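A minimal numpy sketch of the agreed fit behaviour — no scikit-learn internals, just the idea that fitting should not record NaN as a category when it is to be preserved in the output:

```python
import numpy as np

Xi = np.array([2.0, np.nan, 1.0, 2.0])
uniques = np.unique(Xi)                    # NaN sorts last: [ 1.  2. nan]
categories = uniques[~np.isnan(uniques)]   # drop NaN so it never becomes a category
print(categories)  # [1. 2.]
```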
sklearn/preprocessing/_encoders.py
Outdated
handle_missing : string, default 'ignore'
    How to handle missing values passed to ``transform()``

    - 'ignore' : Allow missing values to pass through ``transform()`` and

Maybe we should think of another name here? Because e.g. in #10465 I used the 'ignore' term in the context of OneHotEncoder to mean all-zero (like an unknown category), which probably is also not the best term for that use case, but it still might indicate possible confusion.
Maybe something like 'passthrough' or 'preserve-nan' or ..

That makes sense. I can do that.
sklearn/preprocessing/_encoders.py
Outdated
Handle values not passed to ``fit()`` and not specified as missing

- 'error' : Unknown values raise error
- 'ignore' : Unknown values encoded as 0

The encoded categories already start at 0, so I don't see how this is possible?

BTW, the reason this keyword was not added to OrdinalEncoder initially (but it is added to OneHotEncoder) is exactly because there is no clear meaning of 'ignore', I think (at least when categories are encoded as 0...n).
I also think this is quite independent from the actual missing-value handling, so maybe leave this for a separate issue/PR to discuss?

Okay, you're right. I'll leave this one alone.
sklearn/preprocessing/_encoders.py
Outdated
@@ -761,20 +786,45 @@ def fit(self, X, y=None):

def transform(self, X):
    """Transform X to ordinal codes.

Please do not remove blank lines (or maybe an editor setting does this automatically?)

Whoops -- I'd love to blame my editor, but I think that one was me. Sorry about that.
I really appreciate the input @jnothman and @jorisvandenbossche. Thanks for taking the time. I'll get working on these.

Hi @jnothman and @jorisvandenbossche -- I'm having some issues with tests passing in py2.7 builds, but they pass for py3. Was hoping you guys could lend some insight.

I think it might be best if you make changes to _encode.
sklearn/preprocessing/_encoders.py
Outdated
- 'passthrough' : Allow missing values to pass through
  ``transform()`` and be returned
- 'max_cat' : Encode missing values as maximum ordinal category
- 'min_cat' : Encode missing values as 0

This doesn't need to be a separate option from numeric.
sklearn/preprocessing/_encoders.py
Outdated
  ``transform()`` and be returned
- 'max_cat' : Encode missing values as maximum ordinal category
- 'min_cat' : Encode missing values as 0
- numeric : Category to be encoded to missing values

"number to encode missing values with"
sklearn/preprocessing/_encoders.py
Outdated
n_samples, n_features = X.shape

if self.categories != 'auto':
    if X.dtype != object:

This is not covered by tests.
sklearn/preprocessing/_encoders.py
Outdated
except TypeError:
    cats = _encode(Xi.astype('<U3'))
else:
    cats = np.array(self.categories[i], dtype=X.dtype)

This is not covered by tests.
sklearn/preprocessing/_encoders.py
Outdated
if self.categories == 'auto':
    try:
        cats = _encode(Xi)
    except TypeError:

What case does this handle? It looks very suspicious, but if we're to do it, it at least needs a comment.
sklearn/preprocessing/_encoders.py
Outdated
# Removing missing values from each column
for i, cats in enumerate(self.categories_):
    try:  # find missing for numeric dtype
        missing_i = np.isnan(cats.astype(float), casting='unsafe')

Shouldn't casting apply to astype?
sklearn/preprocessing/_encoders.py
Outdated
missing_i = np.isnan(cats.astype(float), casting='unsafe')
self.categories_[i] = cats[~missing_i]
except (ValueError, TypeError):  # find missing for object dtype
    missing_i = (cats == 'nan')

No () needed.

I don't think we should be treating the string 'nan' as signifying np.nan... we might need to do this checking before or in _encode.
I think we also need to discuss more which options we want:
For me, passing through and a separate category seem like the most important options.
Yes, max is not as useful as new_cat

This pull request introduces 2 alerts when merging b5dd08c into 39bd736 - view on LGTM.com.
Comment posted by LGTM.com
sklearn/preprocessing/_encoders.py
Outdated
@@ -770,8 +782,15 @@ def fit(self, X, y=None):
    """
    # base classes uses _categories to deal with deprecations in
    # OneHotEncoder: can be removed once deprecations are removed
    _set_config(assume_finite=True)

Firstly, this configuration will remain set after fit is finished; secondly, it will incorrectly pass through inf without error; thirdly, it is not thread safe. We should instead use force_all_finite in check_array.

I see... definitely not a good option then. I added a force_all_finite argument passed to _check_X to give optionality for NaNs.
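The force_all_finite route looks roughly like this. The parameter name matches the 0.20-era API; recent scikit-learn releases rename it to ensure_all_finite, which the sketch falls back to:

```python
import numpy as np
from sklearn.utils import check_array

X = np.array([[1.0, np.nan], [3.0, 4.0]])
try:
    # 'allow-nan' permits NaN in the input but still rejects inf
    X_checked = check_array(X, force_all_finite='allow-nan')
except TypeError:
    # newer scikit-learn renamed the parameter
    X_checked = check_array(X, ensure_all_finite='allow-nan')
print(np.isnan(X_checked).sum())  # 1
```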
def test_encode_util_encode_missing(values, expected):
    uniques = _encode(values, encode_missing=True)
    assert_array_equal(uniques, expected)
    uniques, encoded = _encode(values, encode=True, encode_missing=True)

This tests _encode. Why is it in test_label.py?

Since the _encode function is in label.py, I assumed the tests would go in test_label.py. Should I move them to test_encoders.py?
sklearn/preprocessing/_encoders.py
Outdated
_set_config(assume_finite=True)

if self.handle_missing == 'separate':
    self._encode_missing = True

This can be written as self._encode_missing = self.handle_missing == 'separate'.
But why is this useful to do, rather than check self.handle_missing directly?

I agree -- this doesn't make sense. I removed the handle_missing attribute and replaced it with an encode_missing attribute that accepts boolean input rather than a string.
@jnothman Thanks so much for the comments. I really apologize for how long this has taken me -- I had some work travel. I hope this feature is still desired.

This pull request introduces 1 alert when merging 5a4dee8 into 1c88b3c - view on LGTM.com.
Comment posted by LGTM.com

No worries at all about the time. I hope I can review this again soon.
sklearn/preprocessing/_encoders.py
Outdated
- False : Do not categorize missing values and return
  transformed data with NaN's
- True : Categorize missing values sorted last

Perhaps "Represent NaNs as the highest ordinal category."

Yes, much clearer.
sklearn/preprocessing/_encoders.py
Outdated
encode_missing : boolean, default False
    How to represent NaN's

    - False : Do not categorize missing values and return

I think "categorize" as a verb is awkward here. "Retain NaNs in transformed data" would be simpler, no?

Agreed. My wording is a little awkward here.
sklearn/preprocessing/_encoders.py
Outdated
@@ -795,10 +809,17 @@ def transform(self, X):
    -------
    X_out : sparse matrix or a 2-d array
        Transformed input.

Please remove this added whitespace.
sklearn/preprocessing/label.py
Outdated
missing_mask = np.isnan(values)
values_masked_nans = values.copy()
# mask missing values as existing non-null value
values_masked_nans[missing_mask] = values_masked_nans[~missing_mask][0]

If all values are NaN this will result in an IndexError. Is this case handled and tested? An IndexError is not the most friendly way to say "all values are NaN", and when encoding, the all-values-NaN case should be acceptable.
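The failure mode is easy to reproduce with a standalone version of the masking trick (a sketch of the idea, not the PR's exact code):

```python
import numpy as np

def mask_nans(values):
    # replace NaNs with an arbitrary existing non-NaN value, as in the patch
    missing_mask = np.isnan(values)
    masked = values.copy()
    masked[missing_mask] = masked[~missing_mask][0]  # IndexError if all NaN
    return masked, missing_mask

masked, mask = mask_nans(np.array([np.nan, 2.0, 3.0]))
print(masked)  # [2. 2. 3.]

try:
    mask_nans(np.array([np.nan, np.nan]))
except IndexError:
    print("all-NaN input raises IndexError")
```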
np.unique actually handles NaNs (it puts NaN as the last unique value, which I think is fine for us), so it might be that all of the above is not needed and we can directly pass the data to np.unique.
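That sorting behaviour is easy to check. One caveat: numpy versions before 1.21 keep one entry per NaN when several are present, since NaN != NaN:

```python
import numpy as np

cats = np.unique(np.array([3.0, np.nan, 1.0]))
# NaN compares greater than any number in numpy's sort, so it lands last
print(cats)  # [ 1.  3. nan]
```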
sklearn/preprocessing/label.py
Outdated
if encode_missing and np.any(missing_mask):
    # add nan to categories and set as largest category
    uniques = np.append(uniques, np.nan)
    encoded[missing_mask] = np.max(encoded) + 1

Is this RHS not the same as len(uniques) - 1?
sklearn/preprocessing/label.py
Outdated
@@ -74,7 +106,7 @@ def _encode_python(values, uniques=None, encode=False):
    return uniques

-def _encode(values, uniques=None, encode=False):
+def _encode(values, uniques=None, encode=False, encode_missing=False):

I find some of the complexity of the implementation of encode_missing quite perplexing. Whether True or False, it is being encoded, either as -1 or as the number of non-NaN categories. In _encode, what is the benefit of supporting both of these? Is it not simpler to establish the convention that NaN will always be represented as the highest category and exploit the fact that np.unique does this anyway?

Maybe we can first focus on reviewing the implementation, but I think we need to discuss your change of
sklearn/preprocessing/_encoders.py
Outdated
self._categories = self.categories
-self._fit(X)
+self._fit(X, encode_missing=self.encode_missing, force_all_finite=True)

Why force_all_finite=True here and not False? I would think that in the OrdinalEncoder we now allow NaNs?
sklearn/preprocessing/_encoders.py
Outdated
transformed = X_int.astype(self.dtype, copy=False)

if not self._encode_missing:
    transformed[transformed == -1] = np.nan

This will not work for e.g. dtype='int64'.
But I'm not sure what best to do about this. We could raise an informative error if dtype is not floating and the encoding is set to pass NaNs through.
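The int64 problem is a hard numpy error, which is why an informative message is needed when the requested dtype is not floating:

```python
import numpy as np

transformed = np.array([0, 1, -1], dtype='int64')
try:
    # integer arrays cannot represent NaN
    transformed[transformed == -1] = np.nan
except ValueError as exc:
    print(exc)  # cannot convert float NaN to integer
```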
exp = np.array([[0, 1, 0],
                [1, 0, 0],
                [2, 2, 1]], dtype='float64')
assert_array_equal(enc.fit_transform(X), exp)

Can you also test enc.fit(X).transform(X) here?
                [1, 0, 0],
                [np.nan, np.nan, np.nan]], dtype='float64')
assert_array_equal(enc.fit_transform(X), exp)

Can you also add a case where you have different data for fit and transform? E.g. the case where there are no missing values during fitting, but only during transform (this will then need to interact with handle_unknown, I think?)
sklearn/preprocessing/label.py
Outdated
if encode:
    diff = _encode_check_unknown(values, uniques)
    if diff:
        raise ValueError("y contains previously unseen labels: %s"
                         % str(diff))
-    encoded = np.searchsorted(uniques, values)
+    encoded = np.searchsorted(uniques, values).astype(float)

This astype(float) is what is causing all the failures on Travis, because the LabelEncoder currently returns integer data:

In [26]: from sklearn.preprocessing import LabelEncoder
In [27]: LabelEncoder().fit_transform([1, 2, 1, 2])
Out[27]: array([0, 1, 0, 1])

and I think we should keep the options consistent between both.
Well, we know they can't be entirely identical...
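The dtype contract is easy to verify with plain numpy, which is essentially what LabelEncoder does internally:

```python
import numpy as np

uniques = np.array([1, 2])
encoded = np.searchsorted(uniques, [1, 2, 1, 2])
print(encoded)             # [0 1 0 1]
print(encoded.dtype.kind)  # i -- integer indices; .astype(float) breaks this contract
```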
CI unhappy. Please avoid force pushing. Just add commits where possible. This makes it easier to track changes.

Okay. Sorry about that, @jnothman. I wanted to rebase. Again, so sorry about my delay on this issue. I'll have a workable solution soon.
Hi Max,
Firstly, you might want to swap notes with #13028, where @baluyotraf has implemented NaN-handling changes to the
Secondly, we have a helper
I think setting
Hi @jnothman, thanks so much for your comment. Two questions:
I know there will be a release soon -- let me know how I can be most efficient with time to get this feature ready.

I think it would be wise to agree with each other on the _encode internal API, or you adopt @baluyotraf's. I don't mind if the two PRs are merged and you collaborate, though there are aspects of the public-facing API that will differ.

I've not yet checked the test failure and estimator check context, sorry.

I think agreeing on the

@baluyotraf totally. If the aim is that all encoders handle NaN's similarly, I think updating the
Would love to work together on this. Would it be better to merge PRs or should I just rebase on yours?

Either way works for me.
Is it simplest if @maxcopeland just merges @rafbaluyot's branch and works on top of that? This assumes that the _encode API is stable.

I can't remember the code that well, but I think the

@baluyotraf that's correct --

That's fine then. No plans to change what the

Hi @baluyotraf -- in merging your branch, I'm having some troubles setting up tests. Running into an ImportError in the impute module (traceback below). Looks like your commits in #13028 have a similar issue. Any advice?
Looks like there could be a circular import :(

Sorry for my late response. It seems it's with the
Sorry for the slow follow-up from my side. @maxcopeland it seems some of the updating to master did not go fully OK, as the changes of #13253 (which will have conflicted with your changes here) are removed again in the diff here. And also the changes of #12908 seem to be removed again. This also makes the diff a bit hard to review right now. On the public API side:

On the implementation side (and the overlap with #13028): I would not necessarily directly merge that PR into your branch here.

@jorisvandenbossche I think masking out the
@jnothman if I want to submit a different implementation, do I create a new PR? I might try this so I can also compare. I just need to change the ``BaseEncoder`` anyway.

I just realized my code is quite long because it can also encode the
Of course, it's perfectly reasonable to say the
if they specified a different
The same should be done to

We don't currently have explicit checks for None, but except for classification targets, data represented as something other than finite numbers is still new to scikit-learn. I don't know what we should think about None. In general, I'm much happier to merge something that tells users their data is not yet supported than to try to handle every case from the outset. It is much easier for us to add capability later than to take it away. So I'd be happy with raising an error if the data contains NaN and missing_values=-1... I would certainly not go out of my way to support encoding NaN as a category unless the user explicitly said that's how they want missing values handled. Similarly, I don't mind breaking with a "not comparable" TypeError when the user passes in None as if it were a category for now. As I've said before, I think, I would be just as happy to only support missing_values=NaN, which is the convention elsewhere in preprocessing; the goal is to make it reasonably easy to express certain problems in scikit-learn, not to give the users many ways to express themselves. HTH
Fixes #11997. See also #10465.

- ``transform()`` sees an unknown value that hasn't been specified as missing

Still to-do:

- (``_encode_python``)

Notes:

- ``_BaseEncoder`` to ``OrdinalEncoder`` if preferable