Description
In sklearn.preprocessing._encoders._BaseEncoder, columns with pd.Categorical dtype are converted to arrays. If the user explicitly passes the category ordering to the constructor of OneHotEncoder or OrdinalEncoder, then this is fine. But if 'auto' is used, lexicographic ordering is assumed, disregarding the category order defined by the Categorical dtype.
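To make the mismatch concrete, here is a minimal pure-Python sketch. It does not call scikit-learn; it only mimics the sort-the-unique-values behaviour described above, and the category names are made up for illustration.

```python
# Order the user declared on the Categorical dtype (illustrative values).
declared_categories = ["low", "medium", "high"]

# With categories='auto', categories are derived by sorting the unique
# values, which for strings means lexicographic order (mimicked here).
auto_categories = sorted(set(declared_categories))

# Integer code each value would receive under each ordering.
declared_codes = {cat: i for i, cat in enumerate(declared_categories)}
auto_codes = {cat: i for i, cat in enumerate(auto_categories)}

print(auto_categories)  # ['high', 'low', 'medium']
print(declared_codes)   # {'low': 0, 'medium': 1, 'high': 2}
print(auto_codes)       # {'high': 0, 'low': 1, 'medium': 2}
```

Under the lexicographic ordering, 'high' receives the smallest code, so the ordinal meaning carried by the dtype is lost.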
I propose that we raise a warning if:
- a feature passed to OneHotEncoder or OrdinalEncoder has a pandas Categorical dtype (is there a way to duck type this?)
- the categories for that feature are 'auto'
- the lexicographically sorted categories do not match the ordering of the feature's dtype.categories.
The warning might be something like UserWarning("'auto' categories is used, but the Categorical dtype provided is not consistent with the automatic lexicographic ordering"), or else something more intelligible.
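The three conditions and the warning above could be sketched roughly as follows. This is only an illustration under stated assumptions: `warn_if_categorical_order_ignored` is a hypothetical helper, not an existing scikit-learn function, and the duck typing simply looks for a `dtype.categories` attribute, which a pandas column with a Categorical dtype exposes. The stand-in column built with `SimpleNamespace` is there so the sketch runs without pandas.

```python
import warnings
from types import SimpleNamespace


def warn_if_categorical_order_ignored(column, categories="auto"):
    """Hypothetical helper (not part of scikit-learn).

    Returns True and emits a UserWarning when all three proposed
    conditions hold for `column`; returns False otherwise.
    """
    # Condition 1 (duck typed): the column exposes `dtype.categories`.
    dtype = getattr(column, "dtype", None)
    dtype_categories = getattr(dtype, "categories", None)
    if dtype_categories is None:
        return False

    # Condition 2: the encoder was asked to infer categories.
    if not (isinstance(categories, str) and categories == "auto"):
        return False

    # Condition 3: the declared order differs from the sorted order.
    declared = list(dtype_categories)
    try:
        if declared == sorted(declared):
            return False
    except TypeError:
        # Mixed, non-comparable categories: no sorted order to compare.
        return False

    warnings.warn(
        "'auto' categories is used, but the Categorical dtype provided "
        "is not consistent with the automatic lexicographic ordering",
        UserWarning,
        stacklevel=2,
    )
    return True


# Stand-in for a pandas column with a Categorical dtype (assumption:
# only the `dtype.categories` attribute matters for the check).
column = SimpleNamespace(
    dtype=SimpleNamespace(categories=["low", "medium", "high"])
)
warn_if_categorical_order_ignored(column)  # emits the UserWarning
```

Because the check relies only on attribute lookup, it would not require importing pandas inside the encoder; objects without a `dtype.categories` attribute are skipped.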
We may change this warning to a FutureWarning with the message "From version 0.24 the category ordering specified by a Categorical dtype will be respected in encoders."