OrdinalEncoder: Deprecate automatically assuming lexicographic ordering #14954

jnothman · 2019-09-11T11:13:57Z

Currently, using OrdinalEncoder with a string-valued feature, and without categories explicitly specifying an order, means that OrdinalEncoder will number the categories according to their lexicographic ordering.

This is not appropriate if the categories have a natural ordering (e.g. ['Green', 'Amber', 'Red']) that can be harnessed by the downstream estimator.
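
To make the problem concrete, here is a minimal sketch using the traffic-light example above. It assumes a scikit-learn version in which OrdinalEncoder accepts an explicit `categories` list; the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# One string-valued feature whose natural order is Green < Amber < Red.
X = np.array([["Green"], ["Amber"], ["Red"]], dtype=object)

# Default behaviour: categories are sorted lexicographically,
# so the learned order is Amber, Green, Red.
auto = OrdinalEncoder().fit(X)
auto_codes = auto.transform(X).ravel()

# Explicit order: the codes now follow the natural ordering.
explicit = OrdinalEncoder(categories=[["Green", "Amber", "Red"]]).fit(X)
explicit_codes = explicit.transform(X).ravel()

print(list(auto.categories_[0]), auto_codes)
print(list(explicit.categories_[0]), explicit_codes)
```

With the default setting, 'Green' is encoded as 1 and 'Amber' as 0, which inverts part of the natural order; passing the order explicitly gives 0, 1, 2.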

Rather, we should allow the user to specify categories='arbitrary' or categories='lexicographic' or something to explicitly state that lexicographic ordering is okay. When the user specifies categories='auto' for a string-valued feature, OrdinalEncoder should raise a warning along the lines of DeprecationWarning("From version 0.24, OrdinalEncoder's categories='auto' setting will not work with string-valued features, and categories='arbitrary' or an explicit category order will be required.")
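
The proposed behaviour could be sketched roughly as follows. This is not scikit-learn code: `resolve_category_order` is a hypothetical standalone helper, `'arbitrary'` is only the option name floated above, and the sketch assumes all values in a feature share one sortable type.

```python
import warnings


def resolve_category_order(values, categories="auto"):
    """Hypothetical helper illustrating the proposal; not part of scikit-learn."""
    unique = sorted(set(values))
    has_strings = any(isinstance(v, str) for v in unique)

    if categories == "auto":
        if has_strings:
            # Warn that the implicit lexicographic order is going away.
            warnings.warn(
                "From version 0.24, OrdinalEncoder's categories='auto' setting "
                "will not work with string-valued features, and "
                "categories='arbitrary' or an explicit category order will be "
                "required.",
                DeprecationWarning,
                stacklevel=2,
            )
        return unique

    if categories == "arbitrary":
        # The user has stated that any order (here: sorted) is acceptable.
        return unique

    # Otherwise treat `categories` as an explicit, user-supplied order.
    order = list(categories)
    missing = set(unique) - set(order)
    if missing:
        raise ValueError(f"Values not covered by categories: {sorted(missing)}")
    return order
```

Under this sketch, `resolve_category_order(['Green', 'Amber', 'Red'])` warns and falls back to the sorted order, while passing `'arbitrary'` or an explicit list is silent.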


venkyyuvy · 2019-09-12T01:40:33Z

Can I work on this?

jnothman · 2019-09-12T02:31:36Z

Yes, but there might be some dispute among core developers about what the correct API looks like. If you give it a go and open a pull request, that at least gives us something tangible to consider.

Alexrand1 · 2019-09-12T15:16:42Z

Can you point me to where the file is located in the repo?

PyExtreme · 2019-09-22T05:49:45Z

Hi @venkyyuvy, are you still working on this? I am also interested in contributing.

Thanks

venkyyuvy · 2019-09-22T06:43:08Z

Yes, @PyExtreme.
I have already raised a PR and am waiting for reviews!

glemaitre · 2019-11-15T14:23:34Z

I think that #14984, #15050, and #15396 might not be blockers for 0.22, and I would move them to 0.23.

I think it would be great to have a single issue (superseding #14953 and #14954) to discuss the overall behaviour of categories in OneHotEncoder and OrdinalEncoder, and from there to have several PRs that follow the discussed proposals.

yahoyoungho · 2021-07-21T07:18:37Z

@glemaitre Hi, a random suggestion: what if the function had a parameter named mapping that takes a dictionary whose keys are strings (or some other type of data) and whose values are integers (in either float or int form)?
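
A rough sketch of what such a mapping could look like in plain Python. Note that `encode_with_mapping` is a hypothetical function written for illustration; OrdinalEncoder has no `mapping` parameter in this discussion, and raising on unknown values is an assumption of the sketch.

```python
def encode_with_mapping(values, mapping):
    """Hypothetical illustration of a dict-based ordinal encoding."""
    # Fail loudly on values that the mapping does not cover.
    unknown = sorted({v for v in values if v not in mapping})
    if unknown:
        raise ValueError(f"No code defined for: {unknown}")
    # Return floats so int-valued and float-valued mappings behave alike.
    return [float(mapping[v]) for v in values]


traffic_light = {"Green": 0, "Amber": 1, "Red": 2}
codes = encode_with_mapping(["Red", "Green", "Amber", "Red"], traffic_light)
print(codes)
```

Compared with an ordered list of categories, a dictionary also lets the user choose non-consecutive codes, at the cost of having to validate the values themselves.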

jnothman added the Easy and help wanted labels Sep 11, 2019

jnothman mentioned this issue Sep 11, 2019

OrdinalEncoder with string categories should force user to specify the order #14563

Closed

thomasjpfan mentioned this issue Sep 13, 2019

[WIP] CLN Encoder refactor #14972

Closed

venkyyuvy mentioned this issue Sep 14, 2019

[MRG] Sorting ordering option in OrdinalEncoder #14984

Closed

This was referenced Nov 15, 2019

Handle pd.Categorical in encoders #14953

Open

[MRG] ENH Adds categories='dtypes' option to OrdinalEncoder and OneHotEncoder #15396

Closed

glemaitre mentioned this issue Nov 15, 2019

[MRG] ENH Adds warning with pandas category does not match lexicon ordering #15050

Open

cmarmo removed the help wanted label Jun 23, 2020

jawadjawid added a commit to jawadjawid/scikit-learn that referenced this issue Mar 13, 2021

Add fix to scikit-learn#14954

eb528be

jawadjawid added a commit to jawadjawid/scikit-learn that referenced this issue Mar 13, 2021

Add unit tests to the fix of scikit-learn#14954

5a70e8a

cmarmo added the module:preprocessing label Mar 23, 2022

lucyleeow mentioned this issue May 15, 2025

TargetEncoder example code #31365

Closed
