LabelEncoder ignores pandas CategoricalDtype order #12086

avibrazil · 2018-09-15T09:29:07Z

The category order of pandas categorical features declared with CategoricalDtype(ordered=True) could be used by estimators. For example:

print(houses['quality'].unique())
[poor, fair, typical, good, excellent]
Categories (5, object): [poor < fair < typical < good < excellent]

Note how order is embedded in the data type above.

I was expecting codes like these:
0 poor
1 fair
2 typical
3 good
4 excellent
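For reference, pandas itself already exposes integer codes that follow the declared category order. The sketch below uses a small made-up `houses` frame (the real data is not shown in this report) and reads the codes through the `.cat.codes` accessor:

```python
import pandas as pd

# Hypothetical stand-in for the reporter's data: one ordered categorical column.
quality_dtype = pd.CategoricalDtype(
    categories=["poor", "fair", "typical", "good", "excellent"],
    ordered=True,
)
houses = pd.DataFrame(
    {"quality": ["good", "poor", "excellent", "fair", "typical"]}
).astype({"quality": quality_dtype})

# The integer code of each value is its position in the declared category order.
codes = houses["quality"].cat.codes
print(pd.DataFrame({"quality": houses["quality"], "code": codes}))
```

Here `poor` maps to 0 and `excellent` to 4, which matches the codes listed above.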

And I'm sure estimators would provide more meaningful results if such an order were used.

But LabelEncoder gives random integer codes, probably using data as it comes:
3 poor
1 fair
0 typical
4 good
2 excellent
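A minimal way to reproduce the mismatch, assuming a made-up `quality` column rather than the reporter's actual data: LabelEncoder builds `classes_` from the sorted unique values, so for string labels the codes follow alphabetical order and not the categorical order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical ordered categorical column (toy values, not the original dataset).
quality = pd.Series(
    pd.Categorical(
        ["good", "poor", "excellent", "fair", "typical"],
        categories=["poor", "fair", "typical", "good", "excellent"],
        ordered=True,
    )
)

encoder = LabelEncoder()
label_codes = encoder.fit_transform(quality)

# classes_ holds the sorted unique labels; the categorical order is not consulted.
print(list(encoder.classes_))
print(list(label_codes))          # codes assigned by LabelEncoder
print(list(quality.cat.codes))    # codes implied by the categorical dtype
```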

Thank you in advance

jnothman · 2018-09-15T23:30:19Z

This seems a tricky issue to solve, but yes, we should try to do so while improving pandas interoperability.

avibrazil · 2018-09-16T15:30:17Z

This snippet gets the job done:

categoricalOrderedFeatures = {
    'Alley': ['_UNAVAILABLE', 'Grvl', 'Pave'],
    'ExterCond': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'LotShape': ['IR3', 'IR2', 'IR1', 'Reg'],
}
...

# Incrementally add encoded ordered categorical features.
# Assumes each of these columns in `houses` is already an ordered categorical.
for feature in categoricalOrderedFeatures:
    X[feature] = -1  # initialize target feature

    # Categories are stored in their declared order, so the position is the code
    for i, category in enumerate(houses[feature].cat.categories):
        # Get row indexes for this category from source DataFrame
        indexes = houses.index[houses[feature] == category]

        # Write the integer code into the target DataFrame
        X.loc[indexes, feature] = i
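The same idea can be written as a self-contained sketch. The category lists are taken from the snippet above; the `houses` values are invented for illustration, and building the categoricals from the dictionary with `pd.Categorical` is an assumption about how the elided part sets up the dtypes.

```python
import pandas as pd

# Category orderings from the snippet above (subset, for brevity).
categorical_ordered_features = {
    "ExterCond": ["Po", "Fa", "TA", "Gd", "Ex"],
    "LotShape": ["IR3", "IR2", "IR1", "Reg"],
}

# Toy data standing in for the real `houses` frame (hypothetical values).
houses = pd.DataFrame(
    {
        "ExterCond": ["TA", "Gd", "Po", "Ex"],
        "LotShape": ["Reg", "IR1", "IR3", "Reg"],
    }
)

X = pd.DataFrame(index=houses.index)
for feature, categories in categorical_ordered_features.items():
    # Declare the order explicitly, then read the positional codes.
    ordered = pd.Categorical(houses[feature], categories=categories, ordered=True)
    X[feature] = ordered.codes  # values outside `categories` would become -1

print(X)
```

Reading `.codes` avoids the row-by-row assignment, at the cost of silently mapping unexpected values to -1.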

jnothman · 2018-09-16T23:47:10Z

I don't get what you're trying to do here. It's not a complete runnable snippet with assertions for correct behaviour. Have you not managed to use OrdinalEncoder(categories=...) to get the right ordering?

avibrazil · 2018-09-17T01:12:24Z

I was not aware of OrdinalEncoder(), I'll have to try it.

But my point with this bug report is that LabelEncoder, or even OrdinalEncoder, should automatically respect the order of CategoricalDtype(ordered=True) features, because order is important, rather than just incrementing integers as data is processed.

Again, note how the order semantics are embedded in data of CategoricalDtype(ordered=True):

print(houses['quality'].unique())
[poor, fair, typical, good, excellent]
Categories (5, object): [poor < fair < typical < good < excellent]

About the snippet, sorry for the concise code. It takes the row indexes from the source DF (houses) and uses them to know where to write the correct integer (i) in the target DF (X).

jnothman · 2018-09-17T02:57:14Z

Sorry, I got a bit confused here. Yes OrdinalEncoder (and extensions to OneHotEncoder) are new to 0.20 which is very soon to have its release. But it also looks like you're using LabelEncoder for something other than target labels. Yes, ideally, OneHotEncoder and OrdinalEncoder should get their order from a categorical dtype when applicable, especially if ordered=True. @jorisvandenbossche is this a feature we should throw into 0.20 to avoid deprecation hell?

jorisvandenbossche · 2018-09-17T06:55:56Z

For what people used LabelEncoder for in the past (encoding features), we have now added OrdinalEncoder (or OneHotEncoder if the final goal was to have dummy variables). And, in contrast to LabelEncoder, OrdinalEncoder has a categories keyword that at least allows a user to solve this manually:

OrdinalEncoder(categories=[houses['quality'].cat.categories])
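A fuller, runnable version of that one-liner might look as follows. The `houses` frame is a made-up example; only `OrdinalEncoder(categories=...)` and `.cat.categories` come from the comment above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data with one ordered categorical feature.
houses = pd.DataFrame(
    {
        "quality": pd.Categorical(
            ["good", "poor", "excellent", "fair", "typical"],
            categories=["poor", "fair", "typical", "good", "excellent"],
            ordered=True,
        )
    }
)

# One list of categories per encoded column, in the desired order.
encoder = OrdinalEncoder(categories=[list(houses["quality"].cat.categories)])
encoded = encoder.fit_transform(houses[["quality"]])

print(encoded.ravel())
```

Note that OrdinalEncoder works on 2D feature input, hence the double brackets when selecting the column.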

That said, I agree that scikit-learn should ideally handle pandas categorical dtypes automatically (making use of the categorical dtype information; right now we simply discard it, convert to object dtype, and then factorize the values again).

I think we decided to leave that for later, and I think it is also too late to get that into 0.20. But the annoying thing is that, if we add this properly, it will actually change the behaviour in a future release for anyone already using categorical dtypes. And I also don't see an easy way to make such a change with a deprecation cycle.

jnothman · 2018-09-17T07:43:46Z

you would just have a FutureWarning that says "In the future, the order of CategoricalDtype(ordered=True) will be respected. To retain current behaviour please specify the categories for ordered pandas Categoricals".
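To make the proposal concrete, here is a rough sketch of such a check. The helper name `warn_if_ordered_categorical` is hypothetical and is not part of scikit-learn; it only illustrates detecting an ordered categorical column and emitting the warning text quoted above.

```python
import warnings

import pandas as pd


def warn_if_ordered_categorical(column: pd.Series) -> bool:
    """Hypothetical helper: warn when a column is an ordered pandas Categorical.

    Returns True if a warning was issued, False otherwise.
    """
    dtype = column.dtype
    if isinstance(dtype, pd.CategoricalDtype) and dtype.ordered:
        warnings.warn(
            "In the future, the order of CategoricalDtype(ordered=True) will be "
            "respected. To retain current behaviour please specify the "
            "categories for ordered pandas Categoricals",
            FutureWarning,
            stacklevel=2,
        )
        return True
    return False


# Example inputs: one ordered categorical column and one plain object column.
ordered_col = pd.Series(
    pd.Categorical(["fair", "poor"], categories=["poor", "fair"], ordered=True)
)
plain_col = pd.Series(["fair", "poor"])
```

Calling the helper on `ordered_col` would emit the FutureWarning, while `plain_col` would pass silently.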

jnothman · 2018-09-17T07:43:57Z

The main problem would be excess noise.

jnothman · 2018-09-17T07:44:20Z

But you've given your opinion that you don't think we could get this right in a hurry.

jorisvandenbossche · 2018-09-17T07:53:15Z

Yes, I am indeed worried about excessive noise from an annoying warning.
It would be easy to add such a FutureWarning, but that further delays introducing the change, and it leaves the user without an easy way to already get the new behaviour and silence the warning. Passing the categories manually, as I did in the example above, can get quite cumbersome if you have multiple columns, in combination with a ColumnTransformer, etc.
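As an illustration of the multi-column case, the sketch below collects the categories of every ordered categorical column and passes them to an OrdinalEncoder inside a ColumnTransformer. The data and column names are invented, and deriving the list from the dtypes is just one possible workaround, not something scikit-learn does automatically.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical frame: two ordered categorical columns and one numeric column.
houses = pd.DataFrame(
    {
        "ExterCond": pd.Categorical(
            ["TA", "Gd", "Po"],
            categories=["Po", "Fa", "TA", "Gd", "Ex"],
            ordered=True,
        ),
        "LotShape": pd.Categorical(
            ["Reg", "IR1", "IR3"],
            categories=["IR3", "IR2", "IR1", "Reg"],
            ordered=True,
        ),
        "LotArea": [8450, 9600, 11250],
    }
)

# Find the ordered categorical columns and collect their declared categories.
ordered_columns = [
    name
    for name, dtype in houses.dtypes.items()
    if isinstance(dtype, pd.CategoricalDtype) and dtype.ordered
]
categories = [list(houses[name].cat.categories) for name in ordered_columns]

transformer = ColumnTransformer(
    [("ordinal", OrdinalEncoder(categories=categories), ordered_columns)],
    remainder="passthrough",  # keep the numeric column unchanged
)
X = transformer.fit_transform(houses)
print(X)
```

The `categories` list has to stay aligned with `ordered_columns`, which is the bookkeeping that becomes cumbersome as the number of columns grows.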

amueller · 2019-03-07T17:08:57Z

untagging as this is actually kinda complex and tied to many other things

jorisvandenbossche · 2019-10-28T13:30:01Z

Closing this in favor of #14953, which has some more recent discussion.

jnothman added the Bug, Enhancement, and Moderate labels Sep 15, 2018

jnothman added this to the 0.21 milestone Sep 15, 2018

jorisvandenbossche removed the Bug label Sep 17, 2018

jorisvandenbossche mentioned this issue Sep 24, 2018

ENH: support DataFrames in OneHot/OrdinalEncoder without converting to array #12147

Closed

jnothman modified the milestones: 0.21, 0.22 Apr 9, 2019

amueller mentioned this issue Aug 6, 2019

Pandas DataFrame Categories supported by OneHotEncoder #13351

Closed

adrinjalali mentioned this issue Aug 13, 2019

[MRG] DOC clarify LabelEncoder's docstring #14642

Merged

jnothman mentioned this issue Oct 27, 2019

[MRG] ENH Adds warning with pandas category does not match lexicon ordering #15050

Open

jorisvandenbossche mentioned this issue Oct 28, 2019

Handle pd.Categorical in encoders #14953

Open

jorisvandenbossche closed this as completed Oct 28, 2019
