META OHE / OrdinalEncoder: NaN support, unfrequent cat. and pd.Categorical #15796

rth · 2019-12-04T22:08:39Z

There have been 3 significant improvements proposed for OneHotEncoder (and to a lesser extent OrdinalEncoder), often with an associated PR:

- NaN handling (issue Handle missing values in OneHotEncoder #11996, PR [WIP] NaN Support for OneHotEncoder #13028)
- support of pd.Categorical dtype (issue Handle pd.Categorical in encoders #14953, PR [MRG] ENH Adds categories='dtypes' option to OrdinalEncoder and OneHotEncoder #15396)
- handling of infrequent categories (issue Add "other" / min_frequency option to OneHotEncoder #12153, PR [MRG] Add support for infrequent categories in OneHotEncoder and OrdinalEncoder #13833)

The goal of this issue is to reach a high-level agreement on the desired solution, one that is consistent/compatible across the available encoders and the encoders that we may want to add in the near future (e.g. target encoder, #5853). Some of the possible solutions for the above 3 features are mutually exclusive. Also, putting aside backward compatibility constraints for a start, what default options would we ideally want?

I have not followed in detail all past discussions about encoders (in particular about ordering concerns #15050). Following are some of the observations / open questions I have, please add more if I missed something/link with existing comments.

NaN support

Mainly if we want to implement support directly or suggest to use an imputer for the pre-processing step (as one can do now).
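
For reference, the imputer route that already works today can be sketched roughly as follows. This is a minimal example, assuming a single string column; the placeholder label "missing" is chosen here purely for illustration and does not come from the linked issues.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# One categorical column with a missing entry.
X = np.array([["a"], [np.nan], ["b"], ["a"]], dtype=object)

# Step 1 replaces NaN with a placeholder string (illustrative label).
# Step 2 one-hot encodes, so the placeholder becomes a regular category.
pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(),
)
encoded = pipe.fit_transform(X).toarray()
categories = pipe.named_steps["onehotencoder"].categories_[0].tolist()
```

With this approach the missing values end up as one more column, but the encoder itself never sees NaN, which is the trade-off being discussed here.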

pd.Categorical support

Do we say that categories='dtype' would make OHE categories match categories in the dtype? Including the ordering? But then, this means only at fit since, for transform, the test set could have unknown categories.
Actually, if one does a train test split, some categories from the dtype can be missing from the train set as well. Do we then create a column with 0s in fit_transform, or disregard this column breaking the assumption of conforming to dtype categories? Solution proposed by @thomasjpfan in [MRG] ENH Adds categories='dtypes' option to OrdinalEncoder and OneHotEncoder #15396 (comment)
Finally, if we do not conform to the dtype categories, and only use pd.Categorical for computational efficiency internally, what is the point of defining a categories='dtype' in the first place (or of warning that the categories order doesn't match the order of categories in dtype [MRG] ENH Adds warning with pandas category does not match lexicon ordering #15050)?

Infrequent categories

Overall the plan seems fairly clear in #12153 (comment)

NaN & pd.Categorical

Handling NaN as a separate category means we are no longer using purely the categories from the dtype (even with dtype='category'), which can be fine as long as we agree on it.
Another possibility is to implement, say, a CategoricalPreprocessor that adds NaN as a new dtype category in a separate preprocessing step:
```
# add_categories and fillna return new objects, so assign the results back
df[column] = df[column].cat.add_categories("NaN")
df[column] = df[column].fillna("NaN")
```
which can make interpretation simpler. Say df[column].value_counts() would then show NaN properly and one might want to do it for exploratory analysis in any case.

NaN & infrequent categories

Do infrequent categories rules apply to NaN, or is it always a separate category (even if it passes the infrequent criteria)?

pd.Categorical and infrequent categories

Similarly we could consider a preprocessor that would add infrequent as a category to dtype, instead of doing that internally in encoders. Say to evaluate how many infrequent elements one has, I find that doing (approximately),
```
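# categories='dtype' is the option proposed in #15396, not an existing parameter
ohe = OneHotEncoder(categories='dtype', min_frequency=5)
X = ohe.fit_transform(df[['col']])  # encoders expect 2D input
infrequent_idx = ohe.get_feature_names().tolist().index("infrequent")
print(X.sum(axis=0).A1[infrequent_idx])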
```
very awkward as opposed to,
```
df['col'].value_counts()["infrequent"]
```
where it was properly added to df['col'].cat.categories previously and we are using all the nice features of pd.Categorical.
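
To make the pandas-side alternative concrete, here is a rough sketch of what such a preprocessing step could look like. The helper name `group_infrequent` and the label "infrequent" are hypothetical, and a real transformer would need to learn the rare categories on the training set only and reuse them at transform time.

```python
import pandas as pd


def group_infrequent(series, min_frequency, label="infrequent"):
    """Hypothetical helper: merge rare categories into a single label."""
    counts = series.value_counts()
    rare = counts[counts < min_frequency].index.tolist()
    # Work on object dtype so the new label can be assigned freely,
    # then convert back so the label becomes a proper dtype category.
    grouped = series.astype(object).where(~series.isin(rare), label)
    return grouped.astype("category")


col = pd.Series(["a"] * 6 + ["b"] * 5 + ["c", "d", "d"], dtype="category")
grouped = group_infrequent(col, min_frequency=5)

# The count is now available with plain pandas tooling.
n_infrequent = int(grouped.value_counts()["infrequent"])
```

Here "c" and "d" fall below the threshold, so `grouped.cat.categories` contains "a", "b" and "infrequent".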

So there is some tension here between adding these features to scikit-learn and keeping exploratory analysis with pandas user friendly (and not asking users to implement the same thing twice).

Given the complexity of this interaction, maybe separating "Imputer + Infrequent categories conversion with pd.Categorical support" and "OneHot, Ordinal, Target etc. encoder" into 2 or 3 estimators might be easier to understand? Not sure about usability though. The alternative would mean that we also plan to add these features (and enforce consistency) for any future encoder.

cc @thomasjpfan @NicolasHug @glemaitre @jorisvandenbossche @amueller @jnothman @ogrisel


jnothman · 2019-12-05T23:30:17Z

handling of infrequent categories (issue #12153, PR #13833)

I note that OrdinalEncoder does not currently have any handling of categories not seen at fit.

Mainly if we want to implement support directly or suggest to use an imputer for the pre-processing step (as one can do now).

If we want to support estimators, here and elsewhere, that learn directly from data with missing values, we're best propagating the missingness rather than imputing.

Do we say that categories='dtype' would make OHE categories match categories in the dtype? Including the ordering?

I think so. I think ordering should be respected even when ordered=False, since ordered=False is the default, and since it would be easier to use s.codes than to remap them. But I could be persuaded to respect the order only when ordered=True.
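
A small illustration of the difference at stake, assuming a made-up column with categories declared in the order low, medium, high: the pandas codes follow the dtype order, while OrdinalEncoder with its default categories='auto' sorts the observed values.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Unordered categorical whose declared order is not lexicographic.
dtype = pd.CategoricalDtype(["low", "medium", "high"], ordered=False)
s = pd.Series(["high", "low", "medium", "low"], dtype=dtype)

# Codes follow the order declared in the dtype: low=0, medium=1, high=2.
dtype_codes = s.cat.codes.tolist()

# The default encoder sorts the values it sees: high=0, low=1, medium=2.
lexical_codes = (
    OrdinalEncoder()
    .fit_transform(s.astype(object).to_frame())
    .ravel()
    .tolist()
)
```

Respecting the dtype order would make the encoder output agree with `dtype_codes` rather than `lexical_codes`.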

But then, this means only at fit since, for transform, the test set could have unknown categories.

I think if there is a categorical dtype at fit time, then unknown categories appearing at transform time should raise an error. This should be included in _validate_data across the board.
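
For comparison, this is how OneHotEncoder already behaves with its default handle_unknown='error' when categories are learned from the data. The sketch below uses toy values and only shows the existing behaviour, not the proposed _validate_data check.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_train = np.array([["a"], ["b"]], dtype=object)
X_test = np.array([["c"]], dtype=object)  # "c" was never seen at fit

encoder = OneHotEncoder()  # handle_unknown="error" is the default
encoder.fit(X_train)

try:
    encoder.transform(X_test)
    raised = False
except ValueError:
    # Unknown categories are rejected at transform time.
    raised = True
```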

some categories from the dtype can be missing from the train set as well. Do we then create a column with 0s in fit_transform, or disregard this column breaking the assumption of conforming to dtype categories?

I'm not certain, but maybe just make the column with 0s, for correspondence with when one explicitly sets the categories.
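
The explicit-categories behaviour referred to here can be sketched as follows, assuming a toy "size" column whose dtype declares a category ("medium") that never appears in the training data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

dtype = pd.CategoricalDtype(categories=["small", "medium", "large"])
train = pd.DataFrame(
    {"size": pd.Series(["small", "large", "small"], dtype=dtype)}
)

# Pass the dtype categories explicitly instead of inferring them from data.
encoder = OneHotEncoder(categories=[list(dtype.categories)])
encoded = encoder.fit_transform(train[["size"]].astype(object)).toarray()

# "medium" still gets a column, and that column is all zeros.
medium_column = encoded[:, 1]
```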

Infrequent categories: Overall the plan seems fairly clear in #12153 (comment)

This does not pertain to OrdinalEncoder, where we don't even know how to handle unknowns. But infrequent category handling is more straightforward than unknowns when ordinal encoding, because at least in the case of infrequents, we "know" the order where the value should appear, and either conflate with the greater or lesser category.

Handling NaN

I think in OrdinalEncoder, the only encoding of NaN should be as NaN.

I don't think that handling NaN as a separate category is a problem in OHE (and I don't see why we need to modify the dtype). When absent from training, I would just output a row of zeros.

Do infrequent categories rules apply to NaN, or is it always a separate category (even if it passes the infrequent criteria)?

My instinct is to treat it as special, not an infrequent cat, but I am not wedded to this position.

Similarly we could consider a preprocessor that would add infrequent as a category to dtype, instead of doing that internally in encoders.

I don't get why we're modifying the dtype, except for efficiency. And if for efficiency, then we do whatever is pragmatic, as long as it's hidden from the user, i.e. creates no API stability issues.

ohe.get_feature_names().tolist()

The convention is to return a list from get_feature_names()

as opposed to, df['col'].value_counts()["infrequent"] where it was properly added to df['col'].cat.categories previously and we are using all the nice features of pd.Categorical.

It's fine to ask the user to handle infrequents themselves (with the caveat that they'll likely mess up the train-test split)... but if we want to handle it in our encoders, then I don't think we can do that by modifying the input... so I don't understand this suggestion, until I read the next paragraph where I see you are considering handling infrequent categories in a different transformer.

If we can design an InfrequentCategoryTransformer, I'm okay with that... I'm not yet convinced that it will work without duplicating a lot of the stuff (either the implementation, or the need to specify parameters redundantly) from OHE or OrdinalEncoder, nor am I convinced that we can decouple the handling of pd.Categorical from encoders entirely.

I'd be keen to see this framed around some test cases as @thomasjpfan has done for one small piece of the pie.

mtorabirad · 2020-08-21T13:19:50Z

Are the improvements proposed above now implemented in OneHotEncoder?

glemaitre · 2020-08-21T13:54:16Z

Still in progress

ogrisel · 2023-12-01T16:52:42Z

There is also another feature request for a small usability improvement: making it possible to use a shared encoding value for unknowns and infrequent categories in OrdinalEncoder (as done in OneHotEncoder):

Please provide option to set unknown_values during test time to same as encoded min_frequency in OrdinalEncoder(Infrequent categories) #27629

rth mentioned this issue Dec 4, 2019

Add "other" / min_frequency option to OneHotEncoder #12153

Closed

thomasjpfan mentioned this issue May 23, 2020

ENH Adds missing value support to OneHotEncoder #17317

Merged

mfeurer mentioned this issue Jul 22, 2020

Added pandas support automl/auto-sklearn#889

Merged

mfeurer mentioned this issue Nov 16, 2020

Execution fails when data contains missing values automl/auto-sklearn#990

Closed

cmarmo added the module:preprocessing label Mar 29, 2022

kbattocchi mentioned this issue Mar 16, 2023

ufunc 'isnan' not supported for the input types in DML "effect()" function py-why/EconML#745

Open

glemaitre added this to Support for categorical variable May 17, 2024

glemaitre moved this to Discussion in Support for categorical variable May 17, 2024

adrinjalali added this to Missing value and nan support Jun 5, 2024
