[WIP] Handle missing values in OrdinalEncoder #12045


Closed
wants to merge 26 commits into from

Conversation

@maxcopeland (Contributor) commented Sep 9, 2018:

Fixes #11997. See also #10465.

  • This implementation lets values specified as missing pass through the transformer, converting them to np.nan by default.
  • Still raises error if transform() sees an unknown value that hasn't been specified as missing
  • Allows for one missing value or list of missing values
  • User can specify transforming missing values as smallest (0), largest, or separate (-1) ordinal category.
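The proposed mapping could be sketched roughly as follows. This is a hypothetical illustration of the options listed above, not the PR's actual code; the function name `encode_with_missing` is invented, and 'largest'/'smallest' are interpreted as re-using the extreme existing codes (an ambiguity the review itself raises):

```python
import numpy as np

def encode_with_missing(values, categories, handle_missing="separate"):
    """Hypothetical sketch of the options above; not the PR's actual code.

    Known categories get ordinal codes 0..n-1; NaNs are mapped
    according to handle_missing.
    """
    values = np.asarray(values, dtype=float)
    cats = np.asarray(categories, dtype=float)
    missing = np.isnan(values)
    codes = np.full(values.shape, np.nan)
    codes[~missing] = np.searchsorted(cats, values[~missing])
    if handle_missing == "largest":
        codes[missing] = len(cats) - 1  # re-use the largest existing code
    elif handle_missing == "smallest":
        codes[missing] = 0              # re-use code 0
    elif handle_missing == "separate":
        codes[missing] = -1             # dedicated marker value
    # "ignore": NaNs stay NaN in the output
    return codes
```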

Still to-do:

  • Need to handle np.nan as missing (has issues with _encode_python)

Notes:

  • This implementation can be moved out of _BaseEncoder to OrdinalEncoder if preferable

@maxcopeland (Contributor, Author):

Apologies. I must not have switched branches for the first three commits. Commits for this issue start with "add masking/encoding for values specified missing".

@maxcopeland changed the title [WIP] Ordinal enc handle missing → [WIP] Handle missing values in OrdinalEncoder (Sep 9, 2018)

- 'ignore' : Allow missing values to pass through ``transform()`` and
and be returned as np.nan
- 'largest' : Encode missing values as largest ordinal category
Member:

It's unclear whether largest refers to category frequency or to its order. "greatest" is better but I don't much like that either

Contributor (Author):

What about using maximum/minimum or some subset thereof? max_cat/min_cat?

and be returned as np.nan
- 'largest' : Encode missing values as largest ordinal category
- 'smallest' : Encode missing values as 0
- 'separate' : Encode missing values as -1
Member:

Maybe just allow the user to give an integer. I think handle_missing=-1 is clearer... Or maybe not?

Contributor (Author):

Definitely. I could see that being useful.

- 'smallest' : Encode missing values as 0
- 'separate' : Encode missing values as -1

handle_unknown : string, default 'error'
Member:

I think this should precede the missing values params

@maxcopeland (Contributor, Author), Sep 13, 2018:

That makes sense-- I'll put handle_unknown before missing_value.

@@ -699,6 +699,26 @@ class OrdinalEncoder(_BaseEncoder):
dtype : number type, default np.float64
Desired dtype of output.

missing_value : object type or a list of object types, default "nan"
Member:

Called missing_values (plural) elsewhere. Let's be consistent

Member:

We can't really accept a string "nan" which is ambiguous when X can be strings

Member:

And I'd rather support NaN by default for now and not make it user-configurable

Member:

And I'd rather support NaN by default for now and not make it user-configurable

+1
Also, if we keep it as an option, I would not make it a list (eg in SimpleImputer it is also only a single value, we should be consistent with that IMO)

Member:

We can't really accept a string "nan" which is ambiguous when X can be strings

@maxcopeland to give some context: in imputers currently it is a string "nan", but in master (and in soon to be released 0.20), this was changed to actually be np.nan.

@jnothman (Member):

Thanks for working on this

@jorisvandenbossche (Member) left a comment:

Added some additional comments.

I think it would be good to first write some tests (they are required anyhow) to see what behaviour we are targeting.


Which value(s) passed to ``transform()`` to be considered missing

handle_missing : string, default 'ignore'
How to handle missing values passed to ``transform()``
Member:

is it only for transform?

Contributor (Author):

Yes, an assumption I'm making is that the data fit to the encoder will not have missing values. Is this incorrect?

With this version, if data with missing values is fit to the encoder, the missing value is appended to the instance's categories. Missing values would still get encoded appropriately in transform, but this means the transformed data will have one missing category.

Should I expand this implementation to make fit tolerant to data with missing values?

Member:

Yes, data passed to fit can contain NaNs. So fit and transform should be consistent with regard to this handle_missing keyword. Eg if it is set to 'ignore' (preserving them in the output), NaNs should not be added to the categories during fit.
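The requirement described here (NaNs seen during fit must not become learned categories when they are meant to pass through) could be sketched, per column, as something like this (an illustration only, not the PR's implementation):

```python
import numpy as np

# Sketch of the reviewer's requirement: when missing values are
# passed through, NaNs seen during fit are excluded from the
# learned categories, so categories_ contains only real categories.
col = np.array([3.0, 1.0, np.nan, 2.0])
cats = np.unique(col[~np.isnan(col)])  # NaN never enters categories_
```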

Contributor (Author):

Okay, good to know. I'll make it consistent.

handle_missing : string, default 'ignore'
How to handle missing values passed to ``transform()``

- 'ignore' : Allow missing values to pass through ``transform()`` and
Member:

Maybe we should think of another name here? Because eg in #10465 I used the 'ignore' term in context of OneHotEncoder to be all zero (like unknown category), which probably is also not the best term for that use case, but still might indicate possible confusion.

Maybe something like 'passthrough' or 'preserve-nan' or ..

Contributor (Author):

That makes sense. I can do that.

Handle values not passed to ``fit()`` and not specified as missing

- 'error' : Unknown values raise error
- 'ignore' : Unknown values encoded as 0
Member:

The encoded categories already start at 0, so I don't see how this is possible?

Member:

BTW, the reason this keyword was not added to OrdinalEncoder initially (but it is added to OneHotEncoder) is exactly because there is no clear meaning of 'ignore' I think (at least when categories are encoded as 0...n).

I also think this is quite independent from the actual missing value handling, so maybe leave this for separate issue/PR to discuss?

Contributor (Author):

Okay, you're right. I'll leave this one alone.

@@ -761,20 +786,45 @@ def fit(self, X, y=None):

def transform(self, X):
"""Transform X to ordinal codes.

Member:

please do not remove blank lines (or maybe it's an editor setting that does this automatically?)

Contributor (Author):

Whoops-- I'd love to blame my editor but I think that one was me. Sorry about that.

@maxcopeland (Contributor, Author):

I really appreciate the input @jnothman and @jorisvandenbossche. Thanks for taking the time. I'll get working on these.

@maxcopeland (Contributor, Author):

Hi @jnothman and @jorisvandenbossche --

I'm having some issues with tests failing on the py2.7 builds, though they pass on Python 3. Was hoping you could lend some insight.

@jnothman (Member) left a comment:

I think it might be best if you make changes to _encode.

- 'passthrough' : Allow missing values to pass through
``transform()`` and be returned
- 'max_cat' : Encode missing values as maximum ordinal category
- 'min_cat' : Encode missing values as 0
Member:

this doesn't need to be a separate option from numeric

``transform()`` and be returned
- 'max_cat' : Encode missing values as maximum ordinal category
- 'min_cat' : Encode missing values as 0
- numeric : Category to be encoded to missing values
Member:

"number to encode missing values with"

n_samples, n_features = X.shape

if self.categories != 'auto':
if X.dtype != object:
Member:

This is not covered by tests

except TypeError:
cats = _encode(Xi.astype('<U3'))
else:
cats = np.array(self.categories[i], dtype=X.dtype)
Member:

This is not covered by tests

if self.categories == 'auto':
try:
cats = _encode(Xi)
except TypeError:
Member:

What case does this handle? It looks very suspicious, but if we're to do it, it at least needs a comment.

# Removing missing values from each column
for i, cats in enumerate(self.categories_):
try: # find missing for numeric dtype
missing_i = np.isnan(cats.astype(float), casting='unsafe')
Member:

shouldn't casting apply to astype?

missing_i = np.isnan(cats.astype(float), casting='unsafe')
self.categories_[i] = cats[~missing_i]
except (ValueError, TypeError): # find missing for object dtype
missing_i = (cats == 'nan')
Member:

no () needed

Member:

I don't think we should be treating the string 'nan' as signifying np.nan... we might need to do this checking before or in _encode

@jorisvandenbossche (Member):

I think we also need to discuss more which options we want:

  • You now added 'min_cat' and 'max_cat', where they "re-use" the integer for an already existing category. Is that what we want? Or should it rather take the min or max integer but as a separate category?
    In any case, if we keep the behaviour now in the PR (re-use an existing category), I think we should also add the option to have NaN as a separate category (eg sorted last).

For me, passing through and separate category seem like the most important options.

@jnothman (Member):

jnothman commented Sep 20, 2018 via email

@sklearn-lgtm:

This pull request introduces 2 alerts when merging b5dd08c into 39bd736 - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for Syntax error

Comment posted by LGTM.com

@@ -770,8 +782,15 @@ def fit(self, X, y=None):
"""
# base classes uses _categories to deal with deprecations in
# OneHotEncoder: can be removed once deprecations are removed
_set_config(assume_finite=True)
Member:

Firstly, this configuration will remain set after fit is finished, secondly it will incorrectly pass through inf without error, thirdly it is not thread safe. We should instead use force_all_finite in check_array.
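For context, `check_array(X, force_all_finite='allow-nan')` has been available since scikit-learn 0.20: NaN passes through, inf is still rejected, and the effect is local to the call (no global config, thread safe). A NumPy-only sketch of those semantics:

```python
import numpy as np

def allow_nan_check(X):
    # NumPy-only sketch of what force_all_finite='allow-nan' means in
    # sklearn.utils.check_array: NaN passes, but inf is still rejected.
    X = np.asarray(X, dtype=float)
    if np.isinf(X).any():
        raise ValueError("Input contains infinity.")
    return X
```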

Contributor (Author):

I see.. definitely not a good option then. I added a force_all_finite passed to _check_X to give optionality for NaN's.

def test_encode_util_encode_missing(values, expected):
uniques = _encode(values, encode_missing=True)
assert_array_equal(uniques, expected)
uniques, encoded = _encode(values, encode=True, encode_missing=True)
Member:

This tests _encode. Why is it in test_label.py?

Contributor (Author):

Since the _encode function is in label.py, I assumed the tests would go in test_label.py. Should I move them to test_encoders.py?

_set_config(assume_finite=True)

if self.handle_missing == 'separate':
self._encode_missing = True
Member:

This can be written as self._encode_missing = self.handle_missing == 'separate'

But why is this useful to do, rather than check self.handle_missing directly?

Contributor (Author):

I agree-- this doesn't make sense. I removed the handle_missing attribute and replaced it with an encode_missing attribute that accepts boolean input rather than a string.

@maxcopeland (Contributor, Author):

@jnothman Thanks so much for the comments. I really apologize for how long this has taken me-- I had some work travel. I hope this feature is still desired.

@sklearn-lgtm:

This pull request introduces 1 alert when merging 5a4dee8 into 1c88b3c - view on LGTM.com

new alerts:

  • 1 for Unused import

Comment posted by LGTM.com

@jnothman (Member):

jnothman commented Nov 3, 2018 via email


- False : Do not categorize missing values and return
transformed data with NaN's
- True : Categorize missing values sorted last
Member:

Perhaps "Represent NaNs as the highest ordinal category."

Contributor (Author):

Yes, much clearer.

encode_missing : boolean, default False
How to represent NaN's

- False : Do not categorize missing values and return
Member:

I think "categorize" as a verb is awkward here. "Retain NaNs in transformed data" would be simpler, no?

Contributor (Author):

Agreed. My wording was a little awkward here.

@@ -795,10 +809,17 @@ def transform(self, X):
-------
X_out : sparse matrix or a 2-d array
Transformed input.

Member:

please remove this added whitespace

missing_mask = np.isnan(values)
values_masked_nans = values.copy()
# mask missing values as existing non-null value
values_masked_nans[missing_mask] = values_masked_nans[~missing_mask][0]
Member:

If all values are NaN this will result in an IndexError. is this case handled and tested? An IndexError is not the most friendly way to say "all values are NaN", and when encoding, the all values are NaN case should be acceptable.

Member:

np.unique actually handles NaNs (it puts it as the last unique value, which I think is fine for us), so it might be that all of the above is not needed, and that we can directly pass the data to np.unique
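A quick check of that np.unique behaviour, with one caveat: because NaN != NaN, older NumPy does not collapse multiple NaNs into one unique value (NumPy 1.21 later added an `equal_nan` flag for this), which is relevant to the "multiple missing values" concern raised elsewhere in this review:

```python
import numpy as np

# NaN sorts after every finite float, so np.unique places it last.
u = np.unique(np.array([2.0, np.nan, 1.0]))
assert u[0] == 1.0 and u[1] == 2.0 and np.isnan(u[2])
```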

if encode_missing and np.any(missing_mask):
# add nan to categories and set as largest category
uniques = np.append(uniques, np.nan)
encoded[missing_mask] = np.max(encoded) + 1
Member:

Is this RHS not the same as len(uniques) - 1?

@@ -74,7 +106,7 @@ def _encode_python(values, uniques=None, encode=False):
return uniques


def _encode(values, uniques=None, encode=False):
def _encode(values, uniques=None, encode=False, encode_missing=False):
Member:

I find some of the complexity of the implementation of encode_missing to be quite perplexing. Whether True or False, it is being encoded, either as -1 or as the number of non-NaN categories. In _encode, what is the benefit of supporting both of these? Is it not simpler to establish the convention that NaN will always be represented as the highest category and exploit the fact that np.unique does this anyway?
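The convention suggested here could look roughly like the following. This is a hypothetical helper (`encode_nan_highest` is an invented name, not the PR's `_encode`); it masks NaNs out first rather than relying on np.unique's NaN handling, to sidestep the duplicate-NaN quirk of older NumPy:

```python
import numpy as np

def encode_nan_highest(values):
    # Hypothetical sketch: NaN, if present, always becomes the
    # highest code, and appears once at the end of the uniques.
    values = np.asarray(values, dtype=float)
    nan_mask = np.isnan(values)
    uniques = np.unique(values[~nan_mask])
    out = np.empty(values.shape, dtype=float)
    out[~nan_mask] = np.searchsorted(uniques, values[~nan_mask])
    if nan_mask.any():
        uniques = np.append(uniques, np.nan)
        out[nan_mask] = len(uniques) - 1  # NaN gets the highest code
    return uniques, out
```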

@jorisvandenbossche (Member):

Maybe we can first focus on reviewing the implementation, but I think we need to discuss your change of handle_missing='passthrough'|'separate' to encode_missing=True|False in a broader context of eg the OneHotEncoder. The True/False might not make sense there (need to think about it more), and I think we should keep the options consistent between both.

self._categories = self.categories
self._fit(X)
self._fit(X, encode_missing=self.encode_missing, force_all_finite=True)
Member:

Why force_all_finite=True here and not False? I would think that in the OrdinalEncoder we now allow NaNs?

transformed = X_int.astype(self.dtype, copy=False)

if not self._encode_missing:
transformed[transformed == -1] = np.nan
Member:

This will not work for eg dtype='int64'.

But I'm not sure what is best to do about this. We could raise an informative error if dtype is not floating while the encoding is set to pass NaNs through.
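The dtype problem is concrete: NumPy refuses to store NaN in an integer array, so some informative-error path is needed:

```python
import numpy as np

codes = np.array([0, 1, -1], dtype='int64')
try:
    # NaN has no int64 representation, so this assignment raises
    codes[codes == -1] = np.nan
except ValueError as exc:
    message = str(exc)  # e.g. "cannot convert float NaN to integer"
```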


exp = np.array([[0, 1, 0],
[1, 0, 0],
[2, 2, 1]], dtype='float64')
assert_array_equal(enc.fit_transform(X), exp)
Member:

can you also test enc.fit(X).transform(X) here ?

[1, 0, 0],
[np.nan, np.nan, np.nan]], dtype='float64')
assert_array_equal(enc.fit_transform(X), exp)

Member:

Can you also add a case where you have different data for fit and transform? Eg the case where there are no missing values during fitting, but only during transform (this will then need to interact with handle_unknown I think ?)

if encode:
diff = _encode_check_unknown(values, uniques)
if diff:
raise ValueError("y contains previously unseen labels: %s"
% str(diff))
encoded = np.searchsorted(uniques, values)
encoded = np.searchsorted(uniques, values).astype(float)
Member:

This astype(float) is what is causing all the failures on travis, because the LabelEncoder currently returns integer data:

In [26]: from sklearn.preprocessing import LabelEncoder

In [27]: LabelEncoder().fit_transform([1, 2, 1, 2])
Out[27]: array([0, 1, 0, 1])

@jnothman (Member):

jnothman commented Nov 5, 2018 via email

@jnothman (Member):

CI unhappy.

Please avoid force pushing. Just add commits where possible. This makes it easier to track changes.

@maxcopeland (Contributor, Author):

Okay. Sorry about that, @jnothman. I wanted to rebase. Again, so sorry about my delay on this issue. I'll have a workable solution soon.

@jnothman (Member):

Hi Max,

Firstly, you might want to swap notes with #13028 where @baluyotraf has implemented NaN-handling changes to the _encode helper used both in OrdinalEncoder and OneHotEncoder.

Secondly, we have a helper _object_dtype_isnan in sklearn.utils.fixes. It's fine to use this in a solution that avoids assert_array_equal.

I think setting force_all_finite='allow-nan' in BaseEncoder is okay, as long as the encoding will retain the NaNs, hence causing it to be rejected by the next stage in the pipeline.
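The helper jnothman mentions relies on the fact that NaN is the only value unequal to itself, which works elementwise on object arrays where np.isnan would raise:

```python
import numpy as np

# NaN is the only value for which x != x, so elementwise
# self-comparison detects NaNs in object arrays, where np.isnan
# would raise a TypeError. This is the idea behind
# sklearn.utils.fixes._object_dtype_isnan at the time of this PR.
X = np.array(['a', np.nan, 'b'], dtype=object)
mask = X != X
```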

@maxcopeland (Contributor, Author):

Hi @jnothman,

Thanks so much for your comment. Two questions:

  1. In the estimator checks, it looks like the azure build fails in check_estimators_nan_inf and check_fit_idempotent for not checking for/detecting NaN's in fit. Is this something I should add to pass the build?
  2. I've reached out to @baluyotraf on his work on _encoder and BaseEncoder. Since there are pretty big changes to helper functions and base class, should I align OrdinalEncoder work to his implementation/wait for his branch to merge? Otherwise I can keep working on NaN handling from OrdinalEncoder-side.

I know there will be a release soon-- let me know how I can be most efficient with time to get this feature ready.

@jnothman (Member):

I think it would be wise to agree with each other on the _encode internal API, or you adopt @baluyotraf's. I don't mind if the two PRs are merged and you collaborate, though there are aspects of the public-facing API that will differ.

@jnothman (Member):

I've not yet checked the test failure and estimator check context, sorry.

@baluyotraf (Contributor):

I think agreeing on the _encode implementation is better for the maintainability. I'm not sure if there are other implementations for this at the moment. If there are, I would be happy if my code can help improve them. If you want to try out mine, you can try to rebase with my code and see how it goes.

@maxcopeland (Contributor, Author):

@baluyotraf totally. If the aim is that all encoders handle NaN's similarly, I think updating the _encode helper and BaseEncoder, as you've done, makes the most sense. I was too nervous to touch the base class in my approach.

Would love to work together on this. Would it be better to merge PR's or should I just rebase on yours?

@baluyotraf (Contributor):

Either way works for me.

@jnothman (Member):

jnothman commented Mar 13, 2019 via email

@baluyotraf (Contributor):

I can't remember the code that well but I think the OrdinalEncoder only depends on the BaseEncoder implementation. If that's the case that's fine. Though we probably should merge after just to make sure the tests will still work after our own changes.

@maxcopeland (Contributor, Author):

@baluyotraf that's correct-- OrdinalEncoder just depends on the BaseEncoder. Any work I do will be assuming the base class interface stays the same as your current implementation.

@baluyotraf (Contributor):

That's fine then. No plans to change what the BaseEncoder returns.

@maxcopeland (Contributor, Author):

Hi @baluyotraf -- in merging your branch, I'm having some trouble setting up tests. Running into an ImportError in the impute module (traceback below). Looks like your commits in #13028 have a similar issue. Any advice?

________ ERROR collecting sklearn/preprocessing/tests/test_encoders.py ________
ImportError while importing test module 'c:\Users\MCopeland155816\Documents\Python Scripts\Open Source Contributions\SK-Learn\scikit-learn\sklearn\preprocessing\tests\test_encoders.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
sklearn\preprocessing\__init__.py:8: in <module>
    from .data import Binarizer
sklearn\preprocessing\data.py:35: in <module>
    from ._encoders import OneHotEncoder
sklearn\preprocessing\_encoders.py:22: in <module>
    from .label import _encode, _encode_check_unknown
sklearn\preprocessing\label.py:26: in <module>
    from ..impute import _get_mask
sklearn\impute.py:23: in <module>
    from .preprocessing import normalize
E   ImportError: cannot import name 'normalize'

@jnothman (Member):

jnothman commented Mar 18, 2019 via email

@baluyotraf (Contributor):

Sorry for my late response. It seems it's with the _get_mask call. Let me fix it in a bit.

@jorisvandenbossche (Member):

Sorry for the slow follow-up from my side.

@maxcopeland it seems some of the updating to master did not go fully OK, as the changes of #13253 (which will have conflicted with your changes here) are removed again in the diff here. The changes of #12908 also seem to be removed again. This makes the diff a bit hard to review right now.

On the public API side:

  • I see you now added force_all_finite (in addition to encode_missing) keyword. Is it needed to have this keyword? I think for now in sklearn, this force_all_finite is only used in check_array and not in actual transformers. I am not sure it is needed that the user has this fine grained option of allowing NaNs but not np.inf (scalers also don't have this option).

  • You update the fitted categories_ (to make sure they don't have multiple missing values, which is good). I was only wondering if we want to use another marker than np.nan for inside the categories. Those will eg be used for features names (in the OneHotEncoder case, but we should be consistent), and then having np.nan might be annoying. Another option would be to add a keyword like missing_category='missing_value' with which you can specify the name to use for missing values in the categories (so not the value that is considered as missing in the input).

On the implementation side (and the overlap with #13028): I would not necessarily directly merge that PR in your branch here.
@maxcopeland @baluyotraf can you try to check what the differences are (in approach, in what it achieves) between both implementations? Does the PR here already do what we want? (it does it for OrdinalEncoder, but could the same approach be used for the OneHotEncoder as well?)
I am mainly asking, because the implementation here seems simpler than the one in #13028 (but maybe also because it does not cover everything). In any case, you might be able to learn from both attempts and pick the good things of both (I didn't look enough in detail at the other PR to be able to assess this right now).

@baluyotraf (Contributor):

@jorisvandenbossche I think masking out the np.nan or the missing_values before encoding is a simpler solution. Internally the OneHotEncoder uses the ordinal encoding, so updating the BaseEncoder should work. Similar to my implementation, it should still return the missing mask.

@jnothman if I want to submit a different implementation, do I create a new PR? I might try this so I can also compare. I just need to change the ``BaseEncoder`` anyway.

@baluyotraf (Contributor):

I just realized my code is quite long because it can also encode the np.nan value correctly. For example, missing_values = -1.0 and data = [-1.0, 0.0, 1.0, np.nan, np.nan] will fail with the current encoder even if we masked out the missing values.

Of course, it's perfectly reasonable to say that if they specified a different missing_values, there should not be an np.nan in the data. So if we choose the masking implementation, we accept np.nan only if it is the specified missing_values.

The same should be done for None, though I haven't checked whether check_array also checks for it when allow-nan is False.

@jnothman (Member):

jnothman commented Apr 7, 2019

We don't currently have explicit checks for None, but except for classification targets, data represented as something other than finite numbers is still new to scikit-learn. I don't know what we should think about None.

In general, I'm much happier to merge something that tells users their data is not yet supported than to try to handle every case from the outset. It is much easier for us to add capability later than to take it away. So I'd be happy with raising an error if the data contains NaN and missing_values=-1... I would certainly not go out of my way to support encoding NaN as a category unless the user explicitly said that's how they want missing values handled. Similarly, I don't mind breaking with a "not comparable" TypeError when the user passes in None as if it were a category, for now.

As I've said before, I think, I would be just as happy to only support missing_values=NaN, which is the convention elsewhere in preprocessing; the goal is to make it reasonably easy to express certain problems in scikit-learn, not to give users many ways to express themselves. HTH


Successfully merging this pull request may close these issues.

Handle missing values in OrdinalEncoder
6 participants