Handle pd.Categorical in encoders #14953

Open
@jnothman

Description

In sklearn.preprocessing._encoders._BaseEncoder, columns with pd.Categorical dtype are converted to arrays.

Xi = check_array(Xi, ensure_2d=False, dtype=None,

If the user explicitly passes the category ordering to the constructor of OneHotEncoder or OrdinalEncoder, then this is fine. But if 'auto' is used, lexicographic ordering is assumed, disregarding the category order defined by the Categorical dtype.
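
To make the mismatch concrete, here is a minimal plain-Python sketch. It mimics the sort-based behaviour described above and does not call scikit-learn or pandas; the data and the `encode` helper are hypothetical and only serve the illustration.

```python
# Hypothetical ordinal feature whose meaningful order is not alphabetical.
# This is the order a Categorical dtype would carry.
declared_categories = ["low", "medium", "high"]
values = ["medium", "low", "high", "low"]

# Stand-in for 'auto': sort the unique observed values lexicographically.
auto_categories = sorted(set(values))


def encode(values, categories):
    """Map each value to its position in `categories` (ordinal encoding)."""
    index = {cat: i for i, cat in enumerate(categories)}
    return [index[v] for v in values]


auto_codes = encode(values, auto_categories)
dtype_codes = encode(values, declared_categories)

print(auto_categories)  # ['high', 'low', 'medium']
print(auto_codes)       # [2, 1, 0, 1]
print(dtype_codes)      # [1, 0, 2, 0]
```

With the sorted ordering, "high" is encoded as 0 and "medium" as 2, so the ordinal meaning of the declared categories is lost.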

I propose that we raise a warning if:

  • a feature passed to OneHotEncoder or OrdinalEncoder has a Pandas Categorical dtype (is there a way to duck type this?)
  • the categories for that feature are 'auto'
  • the lexicographically sorted categories do not match the feature's dtype.categories ordering.

The warning might be something like UserWarning("'auto' categories is used, but the Categorical dtype provided is not consistent with the automatic lexicographic ordering")... or else something more intelligible.
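
The three conditions and the warning could be sketched roughly as below. This is only an illustration of the proposal, not scikit-learn code: `get_dtype_categories` and `warn_if_order_ignored` are hypothetical names, `categories` is treated per feature for simplicity, and the duck typing assumes that any object exposing `dtype.categories` should be handled like a pandas Categorical column.

```python
import warnings


def get_dtype_categories(column):
    """Return column.dtype.categories as a list, or None if not present.

    Duck typing: no pandas import, only attribute lookups.
    """
    categories = getattr(getattr(column, "dtype", None), "categories", None)
    return None if categories is None else list(categories)


def warn_if_order_ignored(column, categories="auto"):
    """Warn when 'auto' would discard a non-lexicographic category order.

    Returns True if a warning was issued, False otherwise.
    """
    dtype_categories = get_dtype_categories(column)
    # Condition 1: the feature looks like a Categorical.
    if dtype_categories is None:
        return False
    # Condition 2: the user did not specify categories explicitly.
    if not (isinstance(categories, str) and categories == "auto"):
        return False
    # Condition 3: sorting would change the declared order.
    # Assumes the categories are mutually comparable (e.g. all strings).
    if sorted(dtype_categories) == dtype_categories:
        return False
    warnings.warn(
        "'auto' categories is used, but the Categorical dtype provided is "
        "not consistent with the automatic lexicographic ordering",
        UserWarning,
    )
    return True
```

A plain NumPy array has a `dtype` without a `categories` attribute, so it falls through the first check and no warning is raised.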

We may change this warning to a FutureWarning with "From version 0.24 the category ordering specified by a Categorical dtype will be respected in encoders."

Metadata

Assignees

No one assigned

    Labels

    Breaking Change: Issue resolution would not be easily handled by the usual deprecation cycle.
    Moderate: Anything that requires some knowledge of conventions and best practices.
    module:preprocessing

    Projects

    Status

    Discussion