Handling of missing values in the CategoricalEncoder #10465
Comments
I think leave as NaN in ordinal encoding. Unless there is an option to switch on this behaviour, I think it is problematic to leave a row of zeros or add a new category in one-hot. An additional column would likely be most useful in this case (it's the same as a row of zeros, but with more information). |
@jorisvandenbossche were you hoping to implement the changes, or should we mark it help wanted? |
Preserving NaN seems that it could be consistent with what is going in #10404 with the preprocessing methods. |
Would you do this as the default behaviour? (or as an option with erroring as the default behaviour)
For me it is fine to also add an option to switch the behaviour (that was the proposal in the top post). So the question is twofold: which options (see overview above), and which as the default.
It is something I could work on, yes, if there is agreement on what to do, but you can also mark it as help wanted in case there are other people (I also have work to finish ColumnTransformer, make a better example there, the issue about better performance for the encoder, ...). |
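The "leave as NaN in ordinal encoding" behaviour discussed above could be sketched as follows. This is a hypothetical illustration built on `pd.factorize`, not the actual scikit-learn implementation; the function name is made up:

```python
import numpy as np
import pandas as pd

def ordinal_encode_passthrough(values):
    # pd.factorize assigns each category an integer code and marks
    # missing values with the sentinel -1.
    codes, _ = pd.factorize(np.asarray(values, dtype=object), sort=True)
    out = codes.astype(float)
    out[codes == -1] = np.nan  # keep the missingness visible downstream
    return out

print(ordinal_encode_passthrough(["a", "b", np.nan, "a"]))
```

The NaN survives the transform, so a downstream estimator (or the user) still gets to decide how to handle it.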
You can make it the default behaviour where the transform will then output NaNs, since it should be picked up downstream. It's problematic to make it the default behaviour in the one-hot case for the same reason.
|
I don't understand what you mean by the "same reason". |
@jnothman the default is one thing, but can you also give your opinion about the different options outlined above? |
'error' might be useful for users of … For encoding='onehot', I don't know if you can just output a row of NaNs. Is it helpful to do a feature-wise mean or median, as you suggest? Treating it as a separate category seems useful, but so might be outputting a row of zeros, since the sample is not in any of the categories. Is being able to provide an imputer a reasonable option too?
I merely mean that if the default one-hot encoding of a NaN is a row of zeros, then the downstream classifier etc. will not throw an error to say that the data includes NaNs, so it's not as safe as ordinal encoding. |
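To make that safety argument concrete, here is a toy illustration (assuming a two-category feature) of how an all-zeros one-hot row hides missingness from downstream NaN checks, while a NaN in an ordinal column does not:

```python
import numpy as np

# A missing value one-hot encoded as a row of zeros looks like valid data:
X_onehot = np.array([[1.0, 0.0],   # category "a"
                     [0.0, 0.0]])  # missing value, silently all zeros
print(np.isnan(X_onehot).any())    # False: downstream estimators won't complain

# The same missing value kept as NaN in an ordinal encoding stays visible:
X_ordinal = np.array([[0.0], [np.nan]])
print(np.isnan(X_ordinal).any())   # True: downstream input validation will raise
```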
If no one's working on this, I can take it up. |
@maykulkarni, unless @jorisvandenbossche is working on it, I think it's open for the taking |
@jorisvandenbossche let me know if you're busy/working on something else so I could take it up |
@maykulkarni Go ahead! I am not yet working on it, and can work on other things. However, I am not fully sure we have decided yet what the API should look like. I think the direction (also for the other preprocessors, see #10404) is to 'preserve' NaNs. For the ordinal case that is clear, but for the one-hot case there is still discussion about whether this makes sense. @maykulkarni I think you can already start with the above for passing through NaNs; we can later still see if we want to add a keyword for different options. |
In LabelEncoder, the missing value also gets converted. |
@anuraglahon16 LabelEncoder is for labels, missing values there don't really make sense. |
@jnothman @amueller opinions on how to move forward here? I think it would be nice to at least have basic handling of missing values in the OneHotEncoder / OrdinalEncoder in the short term. Given that we are moving towards "passing through NaNs" for other transformers, it might make sense to do the same for the encoders? Since we will split them (#10523), we can also have slightly different behaviour. OneHotEncoder:
OrdinalEncoder:
|
Those sound sensible to me! But sort out documentation at #10523 first,
please.
|
Oh, you did!
|
I think for OneHotEncoder the "treating as own category" option makes the most sense, in particular because we have no good imputation strategies for now. Adding the NaN row might be useful in the future as an option. How would you impute after OrdinalEncoder? As a continuous value? Does that make more sense than imputing as a continuous value after one-hot encoding? I'm not entirely opposed, but the two seem pretty similar to me. |
do you mean you would do this as the default behaviour?
I personally don't know. But the question is still what should be the default? Passing through the NaNs will mean in practice mostly the same as the erroring now, as the final sklearn model in the pipeline will also raise on the presence of NaNs. But passing it through at least gives the flexibility to the user in case it is needed (and is consistent with other transformers). |
Yes. I'm not sure what a good default would be for OrdinalEncoder. Simply add a new category at the end? If the feature is actually ordinal that doesn't make sense. I'm not sure what the typical use-case for OrdinalEncoder is. |
If you use it for tree-based models, because those work well with such features and don't necessarily need one-hot encoding, then for such a use case having it as a separate category makes sense, I think? |
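The "NaN as its own category" option for tree-based models could look like this. It is only a sketch: the function name, and the choice to append the missing-value code after all observed categories, are assumptions, not decided API:

```python
import numpy as np
import pandas as pd

def ordinal_encode_nan_as_category(values):
    # Hypothetical option: give missing values their own trailing code,
    # which a tree-based model can then split on like any other category.
    codes, uniques = pd.factorize(np.asarray(values, dtype=object), sort=True)
    codes = codes.copy()
    codes[codes == -1] = len(uniques)  # factorize marks missing values as -1
    return codes

print(ordinal_encode_nan_as_category(["low", "high", np.nan, "low"]))
# With sorted categories high=0, low=1, the NaN becomes code 2: [1 0 2 1]
```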
I think the principle should be that the default will output NaNs if
they're in the input. We shouldn't let them go unchecked
|
So for now the easiest for the user with dataframes would be to do a |
Also see #11379 (not sure if that's open somewhere else as well) |
So have the note here as well: this is now possible in the sklearn pipeline as well with |
Hey all, feel free to completely ignore this suggestion as I am just a practical ML user. I really like the new changes coming in 0.20, but wanted to comment on … When training, there is no option to ignore missing values. But if …

Going further, I like to ignore (make missing) categories with low frequency. I wrote a gist that makes a … You can even think about having a max count (max proportion) for categoricals, as having a string column with only 1 unique value isn't useful either. In the Kaggle housing dataset used in the gist, there are several string columns that have counts for values under 5. These are encoded as all 0's.

I realize this is just a giant hammer applied to the whole dataset, and typically you want to be more nuanced about this, but the idea remains. I also wrote a blog on the new workflow for Pandas users going to scikit-learn. |
Handling missing values is likely to happen soon.
Frequency thresholding? I suppose that could be useful, and we allow for it
in CountVectorizer. It can, of course be performed after encoding in a
separate feature selection transformer. if it is commonly beneficial I
would not be against providing it in OneHotEncoder
|
@jnothman Thank you for the response. I wasn't aware missing-value handling was going to happen. That was my main issue. If there is an option to just not encode missing values (not make a new column), then we are good there.

The frequency threshold is just a fun idea that I wanted to bring to light. It's probably too much to add to OneHotEncoder. It's also a bit dangerous, giving that much power to eliminate values that easily. It's probably better to inspect the columns manually for low counts and then make a decision to keep them or make them missing.

Edit: Though having some check for low counts, or a check for all-unique values (which will explode the array), via a method or separate function might be useful. I see this is possible with CountVectorizer but that would be more of a workaround. |
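The frequency-thresholding idea above can be prototyped as a small preprocessing step before encoding. A sketch, not an existing API: the function name and the `min_count` default are made up:

```python
import numpy as np
import pandas as pd

def collapse_rare_categories(s, min_count=5):
    # Turn categories seen fewer than `min_count` times into missing values,
    # so a later encoder can apply its missing-value strategy to them.
    counts = s.value_counts()
    rare = counts[counts < min_count].index
    return s.mask(s.isin(rare), np.nan)

s = pd.Series(["a"] * 6 + ["b"] * 2)
print(collapse_rare_categories(s).tolist())  # "b" occurs only twice -> NaN
```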
Currently the CategoricalEncoder doesn't handle missing values (you get an error about unorderable types, or about NaN being an unknown category for numerical types).
This came up in an issue about imputing missing values for categoricals (#2888), but independently of whether we add such abilities to Imputer, we should also discuss how CategoricalEncoder itself could handle missing values in different ways.
Possible ways to deal with missing values (np.nan or None):

- Error on them (roughly the current behaviour, but with a more informative error message).
- Ignore them, i.e. encode a missing value as all zeros in the one-hot case. This is similar to `handle_unknown='ignore'`, apart from the fact that it can also occur in the training data.
- Encode them as a separate category / an additional column. This is what `pd.get_dummies` does if you specify the `dummy_na=True` keyword. This option could actually also be a way to deal with the "imputing categorical features" problem (see also the next bullet), as it allows an easier and more flexible combination of encoding / imputing.
- Impute them. Such functionality might also fit in `Imputer` itself, but adding it here instead could limit the scope of `Imputer` to numerical features.

Those options (or a subset of them) could be added as an additional keyword to the CategoricalEncoder. Possible names: `handle_missing`, `handle_na`, `missing_values`.
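For reference, the `pd.get_dummies(dummy_na=True)` behaviour mentioned above, which implements the "separate category" option:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", np.nan])

# dummy_na=True adds an extra indicator column for missing values,
# so the NaN row is not all zeros but has its own category.
print(pd.get_dummies(s, dummy_na=True))
```

The resulting frame has three columns (one per category plus one for NaN), and the third row is nonzero only in the NaN column.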
Related to discussions in #2888 and #9012 (comment)
Example notebook on a toy dataframe showing the current problem with missing data in categorical features: http://nbviewer.jupyter.org/gist/jorisvandenbossche/736cead26ab65116ff4de18015b0b324