Handling of missing values in the CategoricalEncoder #10465
Comments
I think leave as NaN in ordinal encoding. Unless there is an option to switch on this behaviour, I think it is problematic to leave a row of zeros or add a new category in one-hot. An additional column would likely be most useful in this case (it's the same as a row of zeros, but with more information). |
@jorisvandenbossche were you hoping to implement the changes, or should we mark it help wanted? |
Preserving NaN seems that it could be consistent with what is going in #10404 with the preprocessing methods. |
Would you do this as the default behaviour? (or as an option with erroring as the default behaviour)
For me it is fine to also add an option to switch the behaviour (that was the proposal in the top post). So the question is twofold: which options (see overview above), and which as the default.
It is something I could work on, yes, if there is agreement on what to do, but you can also mark it as help wanted in case there are other people (I also have work to finish ColumnTransformer, make a better example there, the issue about better performance for the encoder, ...). |
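The "leave as NaN in ordinal encoding" behaviour discussed above could be sketched as follows. This is a hypothetical illustration built on `pd.factorize`, not the actual scikit-learn implementation; the function name is made up:

```python
import numpy as np
import pandas as pd

def ordinal_encode_passthrough(values):
    # pd.factorize assigns each category an integer code and marks
    # missing values with the sentinel -1.
    codes, _ = pd.factorize(np.asarray(values, dtype=object), sort=True)
    out = codes.astype(float)
    out[codes == -1] = np.nan  # keep the missingness visible downstream
    return out

print(ordinal_encode_passthrough(["a", "b", np.nan, "a"]))
```

The NaN survives the transform, so a downstream estimator (or the user) still gets to decide how to handle it.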
You can make it the default behaviour where the transform will then output NaNs, since it should be picked up downstream. It's problematic to make it the default behaviour in the one-hot case for the same reason.
|
I don't understand what you mean by the "same reason". |
@jnothman the default is one thing, but can you also give your opinion about the different options outlined above? |
'error' might be useful for users of … For encoding='onehot', I don't know if you can just output a row of NaNs. Is it helpful to do a feature-wise mean or median, as you suggest? Treating it as a separate category seems useful, but so might be outputting a row of zeros, since the sample is not in any of the categories. Is being able to provide an imputer a reasonable option too?
I merely mean that if the default one-hot encoding of a NaN is a row of zeros, then the downstream classifier etc. will not throw an error to say that the data includes NaNs, so it's not as safe as ordinal encoding. |
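To make that safety argument concrete, here is a toy illustration (assuming a two-category feature) of how an all-zeros one-hot row hides missingness from downstream NaN checks, while a NaN in an ordinal column does not:

```python
import numpy as np

# A missing value one-hot encoded as a row of zeros looks like valid data:
X_onehot = np.array([[1.0, 0.0],   # category "a"
                     [0.0, 0.0]])  # missing value, silently all zeros
print(np.isnan(X_onehot).any())    # False: downstream estimators won't complain

# The same missing value kept as NaN in an ordinal encoding stays visible:
X_ordinal = np.array([[0.0], [np.nan]])
print(np.isnan(X_ordinal).any())   # True: downstream input validation will raise
```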
If no one's working on this, I can take it up. |
@maykulkarni, unless @jorisvandenbossche is working on it, I think it's open for the taking |
@jorisvandenbossche let me know if you're busy/working on something else so I could take it up |
@maykulkarni Go ahead! I am not yet working on it, and can work on other things. However, I am not fully sure we have decided yet what the API should look like. I think the direction (also for the other preprocessors, see #10404) is to 'preserve' NaNs. For the ordinal case that is clear, but for the one-hot case there is still discussion about whether this makes sense. @maykulkarni I think you can already start with the above for passing through NaNs; we can later still see if we want to add a keyword for different options. |
In LabelEncoder, the missing value also gets converted. |
@anuraglahon16 LabelEncoder is for labels, missing values there don't really make sense. |
@jnothman @amueller opinions on how to move forward here? I think it would be nice to at least have basic handling of missing values in the OneHotEncoder / OrdinalEncoder in the short term. Given that we are moving towards "passing through NaNs" for other transformers, it might make sense to do the same for the encoders? Since we will split them (#10523), we can also have slightly different behaviour. OneHotEncoder:
OrdinalEncoder:
|
Those sound sensible to me! But sort out documentation at #10523 first,
please.
|
Oh, you did!
|
I think for OneHotEncoder the "treating as own category" option makes the most sense, in particular because we have no good imputation strategies for now. Adding the NaN row might be useful in the future as an option. How would you impute after OrdinalEncoder? As a continuous value? Does that make more sense than imputing as a continuous value after one-hot encoding? I'm not entirely opposed, but the two seem pretty similar to me. |
do you mean you would do this as the default behaviour?
I personally don't know. But the question is still what should be the default? Passing through the NaNs will mean in practice mostly the same as the erroring now, as the final sklearn model in the pipeline will also raise on the presence of NaNs. But passing it through at least gives the flexibility to the user in case it is needed (and is consistent with other transformers). |
Yes. I'm not sure what a good default would be for OrdinalEncoder. Simply add a new category at the end? If the feature is actually ordinal that doesn't make sense. I'm not sure what the typical use-case for OrdinalEncoder is. |
If you use it for tree-based models, because those work well with such features and don't necessarily need one-hot encoding, then for such a use case having it as a separate category makes sense, I think? |
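The "NaN as its own category" option for tree-based models could look like this. It is only a sketch: the function name, and the choice to append the missing-value code after all observed categories, are assumptions, not decided API:

```python
import numpy as np
import pandas as pd

def ordinal_encode_nan_as_category(values):
    # Hypothetical option: give missing values their own trailing code,
    # which a tree-based model can then split on like any other category.
    codes, uniques = pd.factorize(np.asarray(values, dtype=object), sort=True)
    codes = codes.copy()
    codes[codes == -1] = len(uniques)  # factorize marks missing values as -1
    return codes

print(ordinal_encode_nan_as_category(["low", "high", np.nan, "low"]))
# With sorted categories high=0, low=1, the NaN becomes code 2: [1 0 2 1]
```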
I think the principle should be that the default will output NaNs if
they're in the input. We shouldn't let them go unchecked
|
So for now the easiest for the user with dataframes would be to do a |
Also see #11379 (not sure if that's open somewhere else as well) |
So have the note here as well: this is now possible in the sklearn pipeline as well with |
Hey all, feel free to completely ignore this suggestion as I am just a practical ML user. I really like the new changes coming in 0.20, but wanted to comment on … When training, there is no option to ignore missing values. But if …

Going further, I like to ignore (make missing) categories with low frequency. I wrote a gist that makes a … You can even think about having a max count (max proportion) for categoricals, as having a string column with only 1 unique value isn't useful either. In the Kaggle housing dataset used in the gist, there are several string columns that have counts for values under 5. These are encoded as all 0's.

I realize this is just a giant hammer applied to the whole dataset, and typically you want to be more nuanced about this, but the idea remains. I also wrote a blog on the new workflow for Pandas users going to scikit-learn. |
Handling missing values is likely to happen soon.
Frequency thresholding? I suppose that could be useful, and we allow for it
in CountVectorizer. It can, of course be performed after encoding in a
separate feature selection transformer. if it is commonly beneficial I
would not be against providing it in OneHotEncoder
|
@jnothman Thank you for the response. I wasn't aware missing-value handling was going to happen. That was my main issue. If there is an option to just not encode missing values (not make a new column), then we are good there.

The frequency threshold is just a fun idea that I wanted to bring to light. It's probably too much to add to OneHotEncoder. It's also a bit dangerous, giving that much power to eliminate values that easily. It's probably better to inspect the columns manually for low counts and then make a decision to keep them or make them missing.

Edit: Though having some check for low counts, or a check for all-unique values (which will explode the array), via a method or separate function might be useful. I see this is possible with CountVectorizer but that would be more of a workaround. |
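The frequency-thresholding idea above can be prototyped as a small preprocessing step before encoding. A sketch, not an existing API: the function name and the `min_count` default are made up:

```python
import numpy as np
import pandas as pd

def collapse_rare_categories(s, min_count=5):
    # Turn categories seen fewer than `min_count` times into missing values,
    # so a later encoder can apply its missing-value strategy to them.
    counts = s.value_counts()
    rare = counts[counts < min_count].index
    return s.mask(s.isin(rare), np.nan)

s = pd.Series(["a"] * 6 + ["b"] * 2)
print(collapse_rare_categories(s).tolist())  # "b" occurs only twice -> NaN
```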
Currently the CategoricalEncoder doesn't handle missing values (you get an error about unorderable types, or about NaN being an unknown category for numerical types).
This came up in an issue about imputing missing values for categoricals (#2888), but independently of whether we add such abilities to Imputer, we should also discuss how CategoricalEncoder itself could handle missing values in different ways.
Possible ways to deal with missing values (np.nan or None):

- Error on them (roughly the current behaviour, but with a more informative error message).
- Ignore them, i.e. encode a missing value as all zeros in the one-hot case. This is similar to `handle_unknown='ignore'`, apart from the fact that it can also occur in the training data.
- Encode them as a separate category / an additional column. This is what `pd.get_dummies` does if you specify the `dummy_na=True` keyword. This option could actually also be a way to deal with the "imputing categorical features" problem (see also the next bullet), as it allows an easier and more flexible combination of encoding / imputing.
- Impute them. Such functionality might also fit in `Imputer` itself, but adding it here instead could limit the scope of `Imputer` to numerical features.

Those options (or a subset of them) could be added as an additional keyword to the CategoricalEncoder. Possible names: `handle_missing`, `handle_na`, `missing_values`.
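For reference, the `pd.get_dummies(dummy_na=True)` behaviour mentioned above, which implements the "separate category" option:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan])

# dummy_na=True adds an extra indicator column for missing values,
# so the NaN row is not all zeros but has its own category.
print(pd.get_dummies(s, dummy_na=True))
```

The resulting frame has three columns (one per category plus one for NaN), and the third row is nonzero only in the NaN column.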
Related to discussions in #2888 and #9012 (comment)
Example notebook on a toy dataframe showing the current problem with missing data in categorical features: http://nbviewer.jupyter.org/gist/jorisvandenbossche/736cead26ab65116ff4de18015b0b324