Handling of missing values in the CategoricalEncoder #10465


Closed · jorisvandenbossche opened this issue Jan 12, 2018 · 29 comments

@jorisvandenbossche
Member

jorisvandenbossche commented Jan 12, 2018

Currently the CategoricalEncoder doesn't handle missing values (you get an error about unorderable types or about nan being an unknown category for numerical types).
This came up in an issue about imputing missing values for categoricals (#2888), but independently of whether we add such abilities to Imputer, we should also discuss how CategoricalEncoder itself could handle missing values in different ways.

Possible ways to deal with missing values (np.nan or None):

  • Raise an error when missing values are present:
    • This is still a good default, I think (we should just make sure to provide a better error message than is currently raised)
  • Ignore missing values (treat as unknown):
    • This would give a row of all zeros for dummy encoding, and would not be implemented for ordinal encoding.
    • In this way, it is similar in behaviour to unknown categories with handle_unknown='ignore', apart from the fact that it can also occur in the training data.
  • Regard missing value as a separate category
    • For ordinal encoding this would give an additional integer, for dummy encoding an additional column.
    • Something similar is available in pd.get_dummies if you specify dummy_na=True keyword.
    • Implementation-wise, a problem that would occur is that if your categories consist of a couple of string values and a missing value (np.nan or None), they become unorderable, while in the CategoricalEncoder we normally sort the unique categories. As a possible solution, we could fall back in such a case to sorting the non-missing ones first and then adding np.nan at the end (see the sketch after this list).
    • This would be similar to an indicator feature.
  • Preserve as NaN:
    • From @amueller's comment (Improve Imputer 'most_frequent' strategy #2888 (comment)), I suppose the idea would be to first treat it as a separate category, but then, before returning the result, replace that category again with NaN (so it can be imputed after encoding).
    • This might make sense only for ordinal encoding, unless we want a full row of NaNs for the dummy case.
      This option could actually also be a way to deal with the "imputing categorical features" problem (see also next bullet), as it allows an easier and more flexible combination of encoding / imputing.
  • Impute missing values (eg with 'most_frequent' option):
    • Personally I think this one should be left to Imputer itself, but adding it here instead could limit the scope of Imputer to numerical features.
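
A minimal sketch of that sorting fallback (sort_with_missing_last is a hypothetical helper for illustration, not existing scikit-learn code):

```python
import numpy as np

def sort_with_missing_last(values):
    # Hypothetical helper: sort the non-missing values, then append NaN at the end.
    non_missing = sorted(v for v in values if v is not None and v == v)  # NaN != NaN
    return non_missing + [np.nan]

sort_with_missing_last(['b', np.nan, 'a'])  # -> ['a', 'b', nan]
```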

Those options (or a subset of them) could be added as an additional keyword to the CategoricalEncoder. Possible names: handle_missing, handle_na, missing_values

Related to discussions in #2888 and #9012 (comment)

Example notebook on a toy dataframe showing the current problem with missing data in categorical features: http://nbviewer.jupyter.org/gist/jorisvandenbossche/736cead26ab65116ff4de18015b0b324
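
For concreteness, a minimal sketch reproducing the failure (assuming the 0.20 development branch, where CategoricalEncoder lived in sklearn.preprocessing before being split up):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import CategoricalEncoder  # 0.20 dev branch only

X = pd.DataFrame({'city': ['London', 'Paris', np.nan, 'London']})
enc = CategoricalEncoder(encoding='onehot')
enc.fit(X)  # raises a TypeError about unorderable str/float types
```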

@jnothman
Member

I think leave as NaN in ordinal encoding.

Unless there is an option to switch on this behaviour, I think it is problematic to leave a row of zeros or add a new category in one-hot. An additional column would likely be most useful in this case (it's the same as a row of zeros, but with more information).

@jnothman
Member

@jorisvandenbossche were you hoping to implement the changes, or should we mark it help wanted?

@glemaitre
Member

Preserving NaN seems like it could be consistent with what is going on in #10404 with the preprocessing methods. missing_values could be used across the preprocessing methods and the imputation methods.

@jorisvandenbossche
Member Author

jorisvandenbossche commented Jan 15, 2018

I think leave as NaN in ordinal encoding.

Would you do this as the default behaviour? (or as an option with erroring as the default behaviour)
Looking at #10404 on NaNs in preprocessing that Guillaume linked, I suppose you mean it as the default?
Reading that issue, that seems to make sense. In most cases the final estimator will still error on encountering NaNs, so the combined behaviour does not really change.

Unless there is an option to switch on this behaviour, I think it is problematic to leave a row of zeros or add a new category in one-hot.

For me it is fine to also add an option to switch the behaviour (that was the proposal in the top post). So the question is twofold: which options (see overview above), and which as the default.

@jorisvandenbossche were you hoping to implement the changes, or should we mark it help wanted?

It is something I could work on, yes, if there is agreement on what to do, but you can also mark it as help wanted in case other people are interested (I also still have work to finish ColumnTransformer, make a better example there, and the issue about better performance for the encoder, ...).

@jnothman
Member

jnothman commented Jan 15, 2018 via email

@jorisvandenbossche
Member Author

it's problematic to make it default behaviour in the one hot case for the same reason.

I don't understand what you mean by the "same reason".
But is it that if you have a row of NaNs, there is no good imputation strategy possible? (you could do a feature-wise mean or median, but not one that considers the different one-hot features together, like most-frequent)

@jorisvandenbossche
Member Author

@jnothman the default is one thing, but can you also give your opinion about the different options outlined above?
The "preserve NaNs" and "treat as separate category" seem useful, "error" and "ignore" would be easy to implement, but not sure myself how useful it would be.

@jnothman
Member

'error' might be useful for users of pd.read_csv! I'm not sure what ignore means.

For encoding=onehot, I don't know if you can just output a row of NaNs. Is it helpful to do a featurewise mean or median as you suggest? Treating as a separate category seems useful, but so might be outputting a row of zeros, since the sample is not in any of the categories.

Is being able to provide an imputer a reasonable option too?

I don't understand what you mean by the "same reason".

I merely mean that if the default one-hot encoding of NaN is a row of zeros, then the downstream classifier etc. will not throw an error to say that the data includes NaNs, so it's not as safe as ordinal encoding.

@maykulkarni
Contributor

If no one's working on this, I can take it up.

@jnothman
Member

@maykulkarni, unless @jorisvandenbossche is working on it, I think it's open for the taking

@maykulkarni
Contributor

@jorisvandenbossche let me know if you're busy/working on something else so I could take it up

@jorisvandenbossche
Member Author

@maykulkarni Go ahead! I am not yet working on it, and can work on other things.

I am not fully sure, however, that we have decided on what the API should look like. I think the direction (also for the other preprocessors, see #10404) is to 'preserve' NaNs. For the ordinal case that is clear, but for the onehot case there is still discussion about whether this makes sense.
Maybe it's not the most useful to output a full row of NaNs in the onehot case, but I think I would be in favor of that for consistency, and with the idea that users will need to do some kind of imputation before (or after) the encoding anyhow if they want to get rid of the NaNs.

@maykulkarni I think you can already start with the above for passing through NaNs; we can later still see if we want to add a keyword for different options.

@anuraglahon16

In LabelEncoder, the missing value also gets converted.

@amueller
Member

@anuraglahon16 LabelEncoder is for labels, missing values there don't really make sense.

@jorisvandenbossche
Member Author

@jnothman @amueller opinions on how to move forward here?

I think it would be nice to at least have basic handling of missing values in the OneHotEncoder / OrdinalEncoder in the short term. Given that we are moving towards "passing through NaNs" for other transformers, it might make sense to do the same for the encoders?

Since we will split them (#10523), we can also have slightly different behaviour.

OneHotEncoder:

  • "passing through" would mean a full rows of NaNs ? (for the corresponding one hot columns)
    I am not sure how 'useful' this is in itself (because I don't think it makes much sense to impute after one-hot encoding), but at least it will still error in the estimator afterwards and is consistent with the other transformers. So go for this as the default?
  • For the OneHotEncoder I think other optional behaviours might makes sense, for example to ignore missing values (treat them as unknown categories), which would result in all 0's.

OrdinalEncoder:

  • Here a default of "passing through the NaNs" is fine I think
  • One optional behaviour I can think of that could be useful would be to handle the missing value as a separate category ('ignoring' is not an option here, as 0 is already the first category)
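
To make the proposed behaviours concrete, here is a hand-written illustration of the outputs for each encoder (hypothetical; none of these options exist yet, the arrays below are simply written out by hand):

```python
import numpy as np

# categories: ['a', 'b']; input column: ['a', NaN, 'b']

# OneHotEncoder, "pass through": the whole one-hot row becomes NaN
onehot_passthrough = np.array([[1.0, 0.0],
                               [np.nan, np.nan],
                               [0.0, 1.0]])

# OneHotEncoder, "ignore" (treat as unknown): all zeros, as with handle_unknown='ignore'
onehot_ignore = np.array([[1.0, 0.0],
                          [0.0, 0.0],
                          [0.0, 1.0]])

# OrdinalEncoder, "pass through": the NaN stays a NaN
ordinal_passthrough = np.array([0.0, np.nan, 1.0])

# OrdinalEncoder, "separate category": NaN gets the next integer code
ordinal_separate = np.array([0.0, 2.0, 1.0])
```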

@jnothman
Member

jnothman commented Jun 4, 2018 via email

@jnothman
Member

jnothman commented Jun 4, 2018 via email

@amueller
Member

amueller commented Jun 4, 2018

I think for OneHotEncoder the "treating as own category" makes the most sense, in particular because we have no good imputation strategies for now. Adding the NaN row might be useful in the future as an option.
Is there a good reason to encode "missing" as all zeros instead of adding a column? If it's informative it will be much harder for a tree to split on. Though NOCATS would fix that maybe?

How would you impute after OrdinalEncoder? As a continuous value? Does that make more sense than imputing as a continuous value after one-hot-encoding? I'm not entirely opposed but the two seem pretty similar to me.

@jorisvandenbossche
Member Author

I think for OneHotEncoder the "treating as own category" makes the most sense,

do you mean you would do this as the default behaviour?
(the consistent option with the other transformers would be the NaN row, but "the most sensible default" is of course also a good reason to deviate)

How would you impute after OrdinalEncoder? As a continuous value? Does that make more sense than imputing as a continuous value after one-hot-encoding?

I personally don't know. But the question is still: what should be the default? Passing through the NaNs will in practice mean mostly the same as the erroring now, since the final sklearn model in the pipeline will also raise on the presence of NaNs. But passing them through at least gives the user flexibility in case it is needed (and is consistent with other transformers).
But also for the OrdinalEncoder, treating NaN as a separate category might be a good default as well (or at least useful to have available as an option).

@amueller
Member

amueller commented Jun 4, 2018

do you mean you would do this as the default behaviour?

Yes.

I'm not sure what a good default would be for OrdinalEncoder. Simply add a new category at the end? If the feature is actually ordinal that doesn't make sense. I'm not sure what the typical use-case for OrdinalEncoder is.

@jorisvandenbossche
Member Author

I'm not sure what the typical use-case for OrdinalEncoder is.

If you use it for tree-based models, because those work well with such features and don't necessarily need one-hot encoding, then for such a use case having it as a separate category makes sense, I think?

@jnothman
Member

jnothman commented Jun 5, 2018 via email

@amueller
Member

So for now the easiest for the user with dataframes would be to do a .fillna("missing") if they are all strings and .fillna(-1) if they are all integers, and everything works, right?
Not very elegant, though.
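
A minimal sketch of that workaround (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', None, 'blue'],
                   'size': [1.0, None, 3.0]})
df['color'] = df['color'].fillna('missing')  # string column: add a sentinel category
df['size'] = df['size'].fillna(-1)           # numeric column: add a sentinel value
# The encoders then treat 'missing' and -1 as ordinary categories.
```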

@amueller
Member

Also see #11379 (not sure if that's open somewhere else as well)

@jorisvandenbossche
Member Author

So for now the easiest for the user with dataframes would be to do a .fillna("missing") if they are all strings and .fillna(-1) if they are all integers, and everything works, right?

To have the note here as well: this is now also possible in a sklearn pipeline, with SimpleImputer(strategy='constant'), which will fill with -1 / "missing_value" depending on the dtype of X.
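
A minimal sketch of that (assuming scikit-learn >= 0.20, and relying on the default fill value for object data):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X = np.array([['red'], [np.nan], ['blue']], dtype=object)
pipe = make_pipeline(
    SimpleImputer(strategy='constant'),  # object data is filled with 'missing_value'
    OneHotEncoder(),
)
Xt = pipe.fit_transform(X)  # 'missing_value' simply becomes one more category
```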

@tdpetrou

tdpetrou commented Sep 3, 2018

Hey all, feel free to completely ignore this suggestion as I am just a practical ML user. I really like the new changes coming to 0.20, but wanted to comment on OneHotEncoder.

When training, there is no option to ignore missing values. But if handle_unknown is set to 'ignore', then missing values in the test set will be encoded as a row of all 0's. Personally, I like to have the option to encode missing values in the training set as a row of 0's as well. This simplifies the process (no need to impute) and is what get_dummies defaults to (sketched below).
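
For reference, a minimal sketch of the pandas behaviour being compared against:

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', np.nan, 'b'])
pd.get_dummies(s)                 # NaN row becomes all zeros (the default)
pd.get_dummies(s, dummy_na=True)  # adds an explicit NaN indicator column instead
```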

Going further, I like to ignore (make missing) categories with low frequency. I wrote a gist that makes a BasicEstimator which one-hot encodes categoricals like this and gives you the option of ignoring categories with low frequency counts. I used pd.value_counts for this (which is a bit slow and could be optimized in Cython). For numeric columns, it fills missing values and standardizes them.

You can even think about having a max count (max proportion) for categoricals, as having a string column with only 1 unique value isn't useful either.

In the Kaggle housing dataset used in the gist, there are several string columns that have values with counts under 5. These are encoded as all 0's. I realize this is just a giant hammer applied to the whole dataset, and typically you want to be more nuanced about this, but the idea remains.

I also wrote a blog on the new workflow for Pandas users going to scikit-learn.

@jnothman
Member

jnothman commented Sep 3, 2018 via email

@tdpetrou

tdpetrou commented Sep 3, 2018

@jnothman Thank you for the response. I wasn't aware missing value handling was going to happen. That was my main issue. If there is an option to just not encode missing values (not make a new column) then we are good there.

The frequency threshold is just a fun idea that I wanted to bring to light. It's probably too much to add to OneHotEncoder. It's also a bit dangerous to give that much power to eliminate values that easily. It's probably better to inspect the columns manually for low counts and then make a decision to keep them or make them missing.

Edit: Though having some check for low counts, or a check for all-unique values (which will explode the array), as a method or separate function might be useful. I see this is possible with CountVectorizer, but that would be more of a workaround.

@jnothman
Member

jnothman commented Sep 4, 2018

Replacing with #11996, #11997
