Handle pd.Categorical in encoders

In `sklearn.preprocessing._encoders._BaseEncoder`, columns with a `pd.Categorical` dtype are converted to plain arrays, so the category order stored in the dtype is not carried along:

https://github.com/scikit-learn/scikit-learn/blob/03ea20db0f9585fa0d44f4d3cae4b4c4a7c7f235/sklearn/preprocessing/_encoders.py#L60

If the user explicitly passes the `categories` ordering to the constructor of `OneHotEncoder` or `OrdinalEncoder`, this is fine. But if `categories='auto'` is used, lexicographic ordering is assumed, disregarding the category order defined by the Categorical dtype.
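
To make the mismatch concrete, here is a minimal sketch of the behaviour described above. The column name `size` and the values `low`/`medium`/`high` are made up for illustration, and the result assumes a scikit-learn version in which `'auto'` still sorts the observed values:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative data: the dtype declares the order low < medium < high.
sizes = pd.Categorical(
    ["low", "high", "medium", "low"],
    categories=["low", "medium", "high"],
    ordered=True,
)
X = pd.DataFrame({"size": sizes})

# Order declared by the Categorical dtype.
declared = list(X["size"].cat.categories)

# With categories='auto' the encoder sorts the observed values itself.
auto_encoder = OrdinalEncoder(categories="auto").fit(X)
learned = list(auto_encoder.categories_[0])

# Passing the declared order explicitly keeps the intended encoding.
explicit_encoder = OrdinalEncoder(categories=[declared]).fit(X)
explicit = list(explicit_encoder.categories_[0])

print("declared:", declared)
print("auto:    ", learned)
print("explicit:", explicit)
```

With `'auto'`, the learned categories come out sorted as `high`, `low`, `medium`, so `low` is encoded as 1 rather than 0. Passing `categories=[declared]` is the workaround available today.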

I propose that we raise a warning if:
* a feature passed to OneHotEncoder or OrdinalEncoder has a Pandas Categorical dtype (is there a way to duck type this?)
* the categories for that feature are 'auto'
* the lexicographically sorted categories do not match the order of the feature's `dtype.categories`.

The warning might be something like `UserWarning("'auto' categories is used, but the Categorical dtype provided is not consistent with the automatic lexicographic ordering")`... or else something more intelligible.
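
The proposed check could be sketched roughly as follows. The helper names `categorical_order_conflicts` and `warn_on_conflict` are hypothetical and are not part of scikit-learn. The duck typing relies on the assumption that a pandas Categorical dtype exposes a `categories` attribute while other dtypes do not. As a simplification, the sketch compares the declared categories against their own sorted order rather than against the values actually observed in the column.

```python
import warnings

import pandas as pd


def categorical_order_conflicts(column, categories="auto"):
    """Return True if the three proposed conditions all hold (hypothetical helper)."""
    # Condition 1 (duck typed): the column's dtype exposes `categories`.
    dtype_categories = getattr(getattr(column, "dtype", None), "categories", None)
    if dtype_categories is None:
        return False
    # Condition 2: the user left the categories for this feature as 'auto'.
    if not (isinstance(categories, str) and categories == "auto"):
        return False
    # Condition 3: sorted order differs from the order declared by the dtype.
    declared = list(dtype_categories)
    return sorted(declared) != declared


def warn_on_conflict(column, categories="auto"):
    """Emit the warning suggested in this issue when the check fires."""
    if categorical_order_conflicts(column, categories):
        warnings.warn(
            "'auto' categories is used, but the Categorical dtype provided "
            "is not consistent with the automatic lexicographic ordering",
            UserWarning,
        )


# Illustrative usage with made-up data.
column = pd.Series(
    pd.Categorical(["low", "high"], categories=["low", "medium", "high"])
)
warn_on_conflict(column)
```

Avoiding an `isinstance` check against a pandas class means the encoders would not need to import pandas, which may matter since pandas is an optional dependency.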

We may change this warning to a `FutureWarning` with `"From version 0.24 the category ordering specified by a Categorical dtype will be respected in encoders."`

Handle pd.Categorical in encoders #14953