[WIP] Handle missing values in OrdinalEncoder #12045

Conversation
Apologies. I must've not switched branches for the first three commits. Commits for this issue start with
sklearn/preprocessing/_encoders.py
Outdated
- 'ignore' : Allow missing values to pass through ``transform()``
  and be returned as np.nan
- 'largest' : Encode missing values as largest ordinal category

It's unclear whether "largest" refers to category frequency or to its order. "greatest" is better, but I don't much like that either.

What about using maximum/minimum or some subset thereof? max_cat/min_cat?
sklearn/preprocessing/_encoders.py
Outdated
  and be returned as np.nan
- 'largest' : Encode missing values as largest ordinal category
- 'smallest' : Encode missing values as 0
- 'separate' : Encode missing values as -1

Maybe just allow the user to give an integer. I think handle_missing=-1 is clearer... Or maybe not?

Definitely. I could see that being useful.
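As a rough illustration of the integer option under discussion — the function and parameter names below are hypothetical, not scikit-learn API — known categories could map to 0..n-1 while missing values take the user-supplied integer:

```python
import numpy as np

# Hypothetical sketch of handle_missing as a user-supplied integer code
# (names are illustrative only, not the real scikit-learn API).
def ordinal_encode(values, categories, missing_code=-1):
    """Encode values against sorted categories; NaNs get missing_code."""
    values = np.asarray(values, dtype=float)
    categories = np.asarray(categories, dtype=float)
    missing = np.isnan(values)
    encoded = np.empty(len(values), dtype=float)
    # searchsorted maps each known value to its ordinal position
    encoded[~missing] = np.searchsorted(categories, values[~missing])
    encoded[missing] = missing_code
    return encoded

print(ordinal_encode([1.0, np.nan, 3.0], [1.0, 3.0]))  # [ 0. -1.  1.]
```

With this shape, 'smallest' and 'separate' stop being special string options: they are just missing_code=0 and missing_code=-1.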
sklearn/preprocessing/_encoders.py
Outdated
- 'smallest' : Encode missing values as 0
- 'separate' : Encode missing values as -1

handle_unknown : string, default 'error'

I think this should precede the missing values params.

That makes sense -- I'll put handle_unknown before missing_value.
sklearn/preprocessing/_encoders.py
Outdated
@@ -699,6 +699,26 @@ class OrdinalEncoder(_BaseEncoder):
dtype : number type, default np.float64
    Desired dtype of output.

missing_value : object type or a list of object types, default "nan"

Called missing_values (plural) elsewhere. Let's be consistent.

We can't really accept a string "nan", which is ambiguous when X can be strings.

And I'd rather support NaN by default for now and not make it user-configurable.

And I'd rather support NaN by default for now and not make it user-configurable

+1
Also, if we keep it as an option, I would not make it a list (e.g. in SimpleImputer it is also only a single value; we should be consistent with that, IMO).

We can't really accept a string "nan" which is ambiguous when X can be strings

@maxcopeland to give some context: in imputers currently it is a string "nan", but in master (and in soon-to-be-released 0.20) this was changed to actually be np.nan.
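For context on the np.nan convention mentioned above, here is a minimal example with SimpleImputer (available from scikit-learn 0.20 onwards):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [np.nan], [3.0]])
# missing_values is the actual np.nan object, not the string "nan"
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
print(imp.fit_transform(X))  # NaN is replaced by the column mean, 2.0
```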
Thanks for working on this.

Added some additional comments.
I think it would be good to first write some tests (they are required anyhow) to see what behaviour we are targeting.
sklearn/preprocessing/_encoders.py
Outdated
Which value(s) passed to ``transform()`` are to be considered missing

handle_missing : string, default 'ignore'
    How to handle missing values passed to ``transform()``

Is it only for transform?

Yes, an assumption I'm making is that the data fit to the encoder will not have missing values. Is this incorrect?
With this version, if data with missing values is fit to the encoder, the missing value will be appended to the categories of the instance. Missing values would still get encoded appropriately in transform, but this means the transformed data will have one missing category.
Should I expand this implementation to make fit tolerant to data with missing values?

Yes, data passed to fit can contain NaNs. So fit and transform should be consistent with regard to this handle_missing keyword. E.g. if it is set to 'ignore' (preserving them in the output), NaNs should not be added to the categories during fit.

Okay, good to know. I'll make it consistent.
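A minimal numpy sketch of the agreed fit behaviour — no scikit-learn internals, just the idea that fitting should not record NaN as a category when it is to be preserved in the output:

```python
import numpy as np

Xi = np.array([2.0, np.nan, 1.0, 2.0])
uniques = np.unique(Xi)                    # NaN sorts last: [ 1.  2. nan]
categories = uniques[~np.isnan(uniques)]   # drop NaN so it never becomes a category
print(categories)  # [1. 2.]
```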
sklearn/preprocessing/_encoders.py
Outdated
handle_missing : string, default 'ignore'
    How to handle missing values passed to ``transform()``

    - 'ignore' : Allow missing values to pass through ``transform()`` and

Maybe we should think of another name here? Because e.g. in #10465 I used the 'ignore' term in the context of OneHotEncoder to mean all-zero (like an unknown category), which probably is also not the best term for that use case, but it still might indicate possible confusion.
Maybe something like 'passthrough' or 'preserve-nan' or ..

That makes sense. I can do that.
sklearn/preprocessing/_encoders.py
Outdated
Handle values not passed to ``fit()`` and not specified as missing

- 'error' : Unknown values raise error
- 'ignore' : Unknown values encoded as 0

The encoded categories already start at 0, so I don't see how this is possible?

BTW, the reason this keyword was not added to OrdinalEncoder initially (but it is added to OneHotEncoder) is exactly because there is no clear meaning of 'ignore', I think (at least when categories are encoded as 0...n).
I also think this is quite independent from the actual missing-value handling, so maybe leave this for a separate issue/PR to discuss?

Okay, you're right. I'll leave this one alone.
sklearn/preprocessing/_encoders.py
Outdated
@@ -761,20 +786,45 @@ def fit(self, X, y=None):

def transform(self, X):
    """Transform X to ordinal codes.

Please do not remove blank lines (or maybe an editor setting does this automatically?)

Whoops -- I'd love to blame my editor, but I think that one was me. Sorry about that.
I really appreciate the input @jnothman and @jorisvandenbossche. Thanks for taking the time. I'll get working on these.

Hi @jnothman and @jorisvandenbossche -- I'm having some issues with tests passing in py2.7 builds, but they pass for py3. Was hoping you guys could lend some insight.

I think it might be best if you make changes to _encode.
sklearn/preprocessing/_encoders.py
Outdated
- 'passthrough' : Allow missing values to pass through
  ``transform()`` and be returned
- 'max_cat' : Encode missing values as maximum ordinal category
- 'min_cat' : Encode missing values as 0

This doesn't need to be a separate option from numeric.
sklearn/preprocessing/_encoders.py
Outdated
  ``transform()`` and be returned
- 'max_cat' : Encode missing values as maximum ordinal category
- 'min_cat' : Encode missing values as 0
- numeric : Category to be encoded to missing values

"number to encode missing values with"
sklearn/preprocessing/_encoders.py
Outdated
n_samples, n_features = X.shape

if self.categories != 'auto':
    if X.dtype != object:

This is not covered by tests.
sklearn/preprocessing/_encoders.py
Outdated
except TypeError:
    cats = _encode(Xi.astype('<U3'))
else:
    cats = np.array(self.categories[i], dtype=X.dtype)

This is not covered by tests.
sklearn/preprocessing/_encoders.py
Outdated
if self.categories == 'auto':
    try:
        cats = _encode(Xi)
    except TypeError:

What case does this handle? It looks very suspicious, but if we're to do it, it at least needs a comment.
sklearn/preprocessing/_encoders.py
Outdated
# Removing missing values from each column
for i, cats in enumerate(self.categories_):
    try:  # find missing for numeric dtype
        missing_i = np.isnan(cats.astype(float), casting='unsafe')

Shouldn't casting apply to astype?
sklearn/preprocessing/_encoders.py
Outdated
missing_i = np.isnan(cats.astype(float), casting='unsafe')
self.categories_[i] = cats[~missing_i]
except (ValueError, TypeError):  # find missing for object dtype
    missing_i = (cats == 'nan')

No () needed.

I don't think we should be treating the string 'nan' as signifying np.nan... we might need to do this checking before or in _encode.
I think we also need to discuss more which options we want:
For me, passing through and a separate category seem like the most important options.
Yes, max is not as useful as new_cat

This pull request introduces 2 alerts when merging b5dd08c into 39bd736 - view on LGTM.com.
Comment posted by LGTM.com
sklearn/preprocessing/_encoders.py
Outdated
@@ -770,8 +782,15 @@ def fit(self, X, y=None):
    """
    # base classes uses _categories to deal with deprecations in
    # OneHotEncoder: can be removed once deprecations are removed
    _set_config(assume_finite=True)

Firstly, this configuration will remain set after fit is finished; secondly, it will incorrectly pass through inf without error; thirdly, it is not thread safe. We should instead use force_all_finite in check_array.

I see... definitely not a good option then. I added a force_all_finite argument passed to _check_X to give optionality for NaNs.
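The force_all_finite route looks roughly like this. The parameter name matches the 0.20-era API; recent scikit-learn releases rename it to ensure_all_finite, which the sketch falls back to:

```python
import numpy as np
from sklearn.utils import check_array

X = np.array([[1.0, np.nan], [3.0, 4.0]])
try:
    # 'allow-nan' permits NaN in the input but still rejects inf
    X_checked = check_array(X, force_all_finite='allow-nan')
except TypeError:
    # newer scikit-learn renamed the parameter
    X_checked = check_array(X, ensure_all_finite='allow-nan')
print(np.isnan(X_checked).sum())  # 1
```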
def test_encode_util_encode_missing(values, expected):
    uniques = _encode(values, encode_missing=True)
    assert_array_equal(uniques, expected)
    uniques, encoded = _encode(values, encode=True, encode_missing=True)

This tests _encode. Why is it in test_label.py?

Since the _encode function is in label.py, I assumed the tests would go in test_label.py. Should I move them to test_encoders.py?
sklearn/preprocessing/_encoders.py
Outdated
_set_config(assume_finite=True)

if self.handle_missing == 'separate':
    self._encode_missing = True

This can be written as self._encode_missing = self.handle_missing == 'separate'.
But why is this useful to do, rather than check self.handle_missing directly?

I agree -- this doesn't make sense. I removed the handle_missing attribute and replaced it with an encode_missing attribute that accepts boolean input rather than a string.
@jnothman Thanks so much for the comments. I really apologize for how long this has taken me -- I had some work travel. I hope this feature is still desired.

This pull request introduces 1 alert when merging 5a4dee8 into 1c88b3c - view on LGTM.com.
Comment posted by LGTM.com

No worries at all about the time. I hope I can review this again soon.
sklearn/preprocessing/_encoders.py
Outdated
- False : Do not categorize missing values and return
  transformed data with NaN's
- True : Categorize missing values sorted last

Perhaps "Represent NaNs as the highest ordinal category."

Yes, much clearer.
sklearn/preprocessing/_encoders.py
Outdated
encode_missing : boolean, default False
    How to represent NaN's

    - False : Do not categorize missing values and return

I think "categorize" as a verb is awkward here. "Retain NaNs in transformed data" would be simpler, no?

Agreed. My wording is a little awkward here.
sklearn/preprocessing/_encoders.py
Outdated
@@ -795,10 +809,17 @@ def transform(self, X):
    -------
    X_out : sparse matrix or a 2-d array
        Transformed input.

Please remove this added whitespace.
sklearn/preprocessing/label.py
Outdated
missing_mask = np.isnan(values)
values_masked_nans = values.copy()
# mask missing values as existing non-null value
values_masked_nans[missing_mask] = values_masked_nans[~missing_mask][0]

If all values are NaN this will result in an IndexError. Is this case handled and tested? An IndexError is not the most friendly way to say "all values are NaN", and when encoding, the all-values-NaN case should be acceptable.
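The failure mode is easy to reproduce with a standalone version of the masking trick (a sketch of the idea, not the PR's exact code):

```python
import numpy as np

def mask_nans(values):
    # replace NaNs with an arbitrary existing non-NaN value, as in the patch
    missing_mask = np.isnan(values)
    masked = values.copy()
    masked[missing_mask] = masked[~missing_mask][0]  # IndexError if all NaN
    return masked, missing_mask

masked, mask = mask_nans(np.array([np.nan, 2.0, 3.0]))
print(masked)  # [2. 2. 3.]

try:
    mask_nans(np.array([np.nan, np.nan]))
except IndexError:
    print("all-NaN input raises IndexError")
```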
np.unique actually handles NaNs (it puts NaN as the last unique value, which I think is fine for us), so it might be that all of the above is not needed and we can directly pass the data to np.unique.
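That sorting behaviour is easy to check. One caveat: numpy versions before 1.21 keep one entry per NaN when several are present, since NaN != NaN:

```python
import numpy as np

cats = np.unique(np.array([3.0, np.nan, 1.0]))
# NaN compares greater than any number in numpy's sort, so it lands last
print(cats)  # [ 1.  3. nan]
```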
sklearn/preprocessing/label.py
Outdated
if encode_missing and np.any(missing_mask):
    # add nan to categories and set as largest category
    uniques = np.append(uniques, np.nan)
    encoded[missing_mask] = np.max(encoded) + 1

Is this RHS not the same as len(uniques) - 1?
sklearn/preprocessing/label.py
Outdated
@@ -74,7 +106,7 @@ def _encode_python(values, uniques=None, encode=False):
    return uniques

-def _encode(values, uniques=None, encode=False):
+def _encode(values, uniques=None, encode=False, encode_missing=False):

I find some of the complexity of the implementation of encode_missing quite perplexing. Whether True or False, it is being encoded, either as -1 or as the number of non-NaN categories. In _encode, what is the benefit of supporting both of these? Is it not simpler to establish the convention that NaN will always be represented as the highest category and exploit the fact that np.unique does this anyway?

Maybe we can first focus on reviewing the implementation, but I think we need to discuss your change of
sklearn/preprocessing/_encoders.py
Outdated
self._categories = self.categories
-self._fit(X)
+self._fit(X, encode_missing=self.encode_missing, force_all_finite=True)

Why force_all_finite=True here and not False? I would think that in the OrdinalEncoder we now allow NaNs?
sklearn/preprocessing/_encoders.py
Outdated
transformed = X_int.astype(self.dtype, copy=False)

if not self._encode_missing:
    transformed[transformed == -1] = np.nan

This will not work for e.g. dtype='int64'.
But I'm not sure what best to do about this. We could raise an informative error if dtype is not floating and the encoding is set to pass NaNs through.
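The int64 problem is a hard numpy error, which is why an informative message is needed when the requested dtype is not floating:

```python
import numpy as np

transformed = np.array([0, 1, -1], dtype='int64')
try:
    # integer arrays cannot represent NaN
    transformed[transformed == -1] = np.nan
except ValueError as exc:
    print(exc)  # cannot convert float NaN to integer
```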
exp = np.array([[0, 1, 0],
                [1, 0, 0],
                [2, 2, 1]], dtype='float64')
assert_array_equal(enc.fit_transform(X), exp)

Can you also test enc.fit(X).transform(X) here?
                [1, 0, 0],
                [np.nan, np.nan, np.nan]], dtype='float64')
assert_array_equal(enc.fit_transform(X), exp)

Can you also add a case where you have different data for fit and transform? E.g. the case where there are no missing values during fitting, but only during transform (this will then need to interact with handle_unknown, I think?)
sklearn/preprocessing/label.py
Outdated
if encode:
    diff = _encode_check_unknown(values, uniques)
    if diff:
        raise ValueError("y contains previously unseen labels: %s"
                         % str(diff))
-    encoded = np.searchsorted(uniques, values)
+    encoded = np.searchsorted(uniques, values).astype(float)

This astype(float) is what is causing all the failures on Travis, because the LabelEncoder currently returns integer data:

In [26]: from sklearn.preprocessing import LabelEncoder
In [27]: LabelEncoder().fit_transform([1, 2, 1, 2])
Out[27]: array([0, 1, 0, 1])

and I think we should keep the options consistent between both.
Well, we know they can't be entirely identical...
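The dtype contract is easy to verify with plain numpy, which is essentially what LabelEncoder does internally:

```python
import numpy as np

uniques = np.array([1, 2])
encoded = np.searchsorted(uniques, [1, 2, 1, 2])
print(encoded)             # [0 1 0 1]
print(encoded.dtype.kind)  # i -- integer indices; .astype(float) breaks this contract
```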
CI unhappy. Please avoid force pushing. Just add commits where possible. This makes it easier to track changes.

Okay. Sorry about that, @jnothman. I wanted to rebase. Again, so sorry about my delay on this issue. I'll have a workable solution soon.
Hi Max,
Firstly, you might want to swap notes with #13028, where @baluyotraf has implemented NaN-handling changes to the
Secondly, we have a helper
I think setting
Hi @jnothman, thanks so much for your comment. Two questions:
I know there will be a release soon -- let me know how I can be most efficient with time to get this feature ready.

I think it would be wise to agree with each other on the _encode internal API, or you adopt @baluyotraf's. I don't mind if the two PRs are merged and you collaborate, though there are aspects of the public-facing API that will differ.

I've not yet checked the test failure and estimator check context, sorry.

I think agreeing on the

@baluyotraf totally. If the aim is that all encoders handle NaN's similarly, I think updating the
Would love to work together on this. Would it be better to merge PRs or should I just rebase on yours?

Either way works for me.
Is it simplest if @maxcopeland just merges @rafbaluyot's branch and works on top of that? This assumes that the _encode API is stable.

I can't remember the code that well, but I think the

@baluyotraf that's correct --

That's fine then. No plans to change what the

Hi @baluyotraf -- in merging your branch, I'm having some troubles setting up tests. Running into an ImportError in the impute module (traceback below). Looks like your commits in #13028 have a similar issue. Any advice?
Looks like there could be a circular import :(

Sorry for my late response. It seems it's with the
Sorry for the slow follow-up from my side. @maxcopeland it seems some of the updating to master did not go fully OK, as the changes of #13253 (which will have conflicted with your changes here) are removed again in the diff here. And also the changes of #12908 seem to be removed again. This also makes the diff a bit hard to review right now. On the public API side:

On the implementation side (and the overlap with #13028): I would not necessarily directly merge that PR into your branch here.

@jorisvandenbossche I think masking out the
@jnothman if I want to submit a different implementation, do I create a new PR? I might try this so I can also compare. I just need to change the ``BaseEncoder`` anyway.

I just realized my code is quite long because it can also encode the
Of course, it's perfectly reasonable to say the
if they specified a different
The same should be done to

We don't currently have explicit checks for None, but except for classification targets, data represented as something other than finite numbers is still new to scikit-learn. I don't know what we should think about None. In general, I'm much happier to merge something that tells users their data is not yet supported than to try to handle every case from the outset. It is much easier for us to add capability later than to take it away. So I'd be happy with raising an error if the data contains NaN and missing_values=-1... I would certainly not go out of my way to support encoding NaN as a category unless the user explicitly said that's how they want missing values handled. Similarly, I don't mind breaking with a "not comparable" TypeError when the user passes in None as if it were a category for now. As I've said before, I think, I would be just as happy to only support missing_values=NaN, which is the convention elsewhere in preprocessing; the goal is to make it reasonably easy to express certain problems in scikit-learn, not to give the users many ways to express themselves. HTH
Fixes #11997. See also #10465.

- ``transform()`` sees an unknown value that hasn't been specified as missing

Still to-do:

- (``_encode_python``)

Notes:

- ``_BaseEncoder`` to ``OrdinalEncoder`` if preferable