
LabelEncoder ignores pandas CategoricalDtype order #12086


Closed
avibrazil opened this issue Sep 15, 2018 · 12 comments
Labels
Enhancement; Moderate (anything that requires some knowledge of conventions and best practices)

@avibrazil

avibrazil commented Sep 15, 2018

The category order of a pandas categorical feature declared with CategoricalDtype(ordered=True) could be used by estimators. For example:

print(houses['quality'].unique())
[poor, fair, typical, good, excellent]
Categories (5, object): [poor < fair < typical < good < excellent]

Note how order is embedded in the data type above.

I was expecting codes like these:
0 poor
1 fair
2 typical
3 good
4 excellent

And I'm sure estimators would give more meaningful results if that order were used.

But LabelEncoder assigns integer codes that ignore this order:
3 poor
1 fair
0 typical
4 good
2 excellent
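
For reference, LabelEncoder does not look at the categorical dtype at all: it sorts the distinct labels and numbers them in that sorted order. The sketch below illustrates this on a made-up `quality` column (the data is an assumption for illustration, not the reporter's dataset).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative ordered categorical column (values are made up).
quality_levels = ["poor", "fair", "typical", "good", "excellent"]
quality = pd.Series(
    ["good", "poor", "excellent", "fair", "typical"],
    dtype=pd.CategoricalDtype(categories=quality_levels, ordered=True),
)

encoder = LabelEncoder()
codes = encoder.fit_transform(quality)

# classes_ holds the labels in sorted order; a label's position is its code.
label_to_code = {label: code for code, label in enumerate(encoder.classes_)}
print(label_to_code)
```

With string labels the sorted order is alphabetical, so the resulting codes carry no information about the declared ranking.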

Thank you in advance

@jnothman
Member

This seems a tricky issue to solve, but yes, we should try to do so while improving pandas interoperability.

@jnothman added the Bug, Enhancement, and Moderate (anything that requires some knowledge of conventions and best practices) labels on Sep 15, 2018
@jnothman added this to the 0.21 milestone on Sep 15, 2018
@avibrazil
Author

This snippet gets the job done:

categoricalOrderedFeatures = {
    'Alley': ['_UNAVAILABLE', 'Grvl', 'Pave'],
    'ExterCond': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'LotShape': ['IR3', 'IR2', 'IR1', 'Reg']
}
...

# Incrementally add encoded ordered categorical features
for feature in categoricalOrderedFeatures:
    X[feature] = -1  # initialize target feature; -1 marks rows not yet encoded

    # Categories come back in their declared order, so position i is the ordinal code
    for i, category in enumerate(houses[feature].unique().categories):

        # Get row indexes for this category from source DataFrame
        indexes = houses.index[houses[feature] == category]

        # Write the ordinal code into the target DataFrame
        X.loc[indexes, feature] = i
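
A shorter route to similar codes, sketched here under the assumption that the column already carries the desired categorical dtype, is the pandas `.cat.codes` accessor. The `houses` frame below is a made-up stand-in, not the reporter's data.

```python
import pandas as pd

# Illustrative stand-in for the source DataFrame (values are made up).
exter_cond_dtype = pd.CategoricalDtype(
    categories=["Po", "Fa", "TA", "Gd", "Ex"], ordered=True
)
houses = pd.DataFrame(
    {"ExterCond": pd.Series(["TA", "Gd", "Po", "Ex", "TA"], dtype=exter_cond_dtype)}
)

# .cat.codes gives each value's position in the declared categories;
# missing values are encoded as -1.
X = pd.DataFrame(index=houses.index)
X["ExterCond"] = houses["ExterCond"].cat.codes
print(X)
```

One difference from the loop: `.cat.codes` numbers every declared category, including ones that never occur in the data.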

@jnothman
Member

jnothman commented Sep 16, 2018 via email

@avibrazil
Author

I was not aware of OrdinalEncoder(); I'll have to try it.

But my point with this bug report is that LabelEncoder, or even OrdinalEncoder, should automatically respect the order of CategoricalDtype(ordered=True) features, because order is important, rather than just incrementing integers as data is processed.

Again, note how order semantics are embedded in data of CategoricalDtype(ordered=True):

print(houses['quality'].unique())
[poor, fair, typical, good, excellent]
Categories (5, object): [poor < fair < typical < good < excellent]

About the snippet, sorry for the terse code. It takes the row indexes from the source DataFrame (houses) and uses them to know where to write the correct integer (i) in the target DataFrame (X).

@jnothman
Member

jnothman commented Sep 17, 2018 via email

@jorisvandenbossche
Member

For what people used LabelEncoder for in the past (encoding features), we have now added OrdinalEncoder (or OneHotEncoder if the final goal was to have dummy variables). In contrast to LabelEncoder, OrdinalEncoder has a categories keyword that at least allows a user to solve this manually:

OrdinalEncoder(categories=[houses['quality'].cat.categories])
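
A minimal end-to-end sketch of that workaround, assuming a toy `houses` frame with an ordered `quality` column (the data is made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for the `houses` DataFrame from the report (values are illustrative).
quality_dtype = pd.CategoricalDtype(
    categories=["poor", "fair", "typical", "good", "excellent"], ordered=True
)
houses = pd.DataFrame(
    {"quality": pd.Series(["good", "poor", "excellent", "fair"], dtype=quality_dtype)}
)

# Pass the declared category order explicitly: one list per encoded column.
encoder = OrdinalEncoder(categories=[list(houses["quality"].cat.categories)])
encoded = encoder.fit_transform(houses[["quality"]])

# `encoded` is a 2-D float array; each value is a position in the declared order.
print(encoded.ravel())
```

Note that OrdinalEncoder expects 2-D input, hence the double brackets around the column name.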

That said, I agree that scikit-learn should ideally handle pandas categorical dtypes automatically, making use of the categorical dtype information. Right now we simply discard it: we convert to object dtype and then factorize the values again.

I think we decided to leave that for later, and I think it is also too late to get it into 0.20. The annoying thing is that adding this properly will change behaviour in a future release for anyone already using categorical dtypes, and I don't see an easy way to make such a change with a deprecation cycle.

@jnothman
Member

You would just have a FutureWarning that says "In the future, the order of CategoricalDtype(ordered=True) will be respected. To retain current behaviour please specify the categories for ordered pandas Categoricals".
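
To make the proposal concrete, here is a rough sketch of such a check. The helper name `warn_if_order_ignored` and its `categories="auto"` parameter are hypothetical, chosen for illustration; this is not scikit-learn code.

```python
import warnings

import pandas as pd


def warn_if_order_ignored(column, categories="auto"):
    """Hypothetical check: warn when an ordered categorical would be re-sorted.

    Returns True if a warning was emitted, False otherwise.
    """
    dtype = column.dtype
    is_ordered = isinstance(dtype, pd.CategoricalDtype) and bool(dtype.ordered)
    # Only warn when the user has not pinned the categories explicitly.
    if is_ordered and isinstance(categories, str) and categories == "auto":
        warnings.warn(
            "In the future, the order of CategoricalDtype(ordered=True) will be "
            "respected. To retain current behaviour please specify the "
            "categories for ordered pandas Categoricals.",
            FutureWarning,
            stacklevel=2,
        )
        return True
    return False
```

In this sketch, passing the categories explicitly is the only way to silence the warning.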

@jnothman
Member

The main problem would be excess noise.

@jnothman
Member

But you've given your opinion that you don't think we could get this right in a hurry.

@jorisvandenbossche
Member

Yes, I am indeed worried about excessive noise from an annoying warning.
It would be easy to add such a FutureWarning, but that further delays introducing the change, and it gives the user no easy way to opt into the new behaviour and silence the warning. Passing the categories manually, as in my example above, can get quite cumbersome if you have multiple columns, in combination with a ColumnTransformer, and so on.
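
One way to reduce that manual work, sketched under the assumption that every ordered column already has its categorical dtype set, is to collect the category lists from the DataFrame itself. The helper `ordered_categorical_encoder` and the toy data are hypothetical, for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder


def ordered_categorical_encoder(df):
    """Hypothetical helper: ordinal-encode df's ordered categorical columns."""
    ordered_cols = [
        col
        for col in df.columns
        if isinstance(df[col].dtype, pd.CategoricalDtype) and df[col].cat.ordered
    ]
    # One category list per column, in each column's declared order.
    categories = [list(df[col].cat.categories) for col in ordered_cols]
    return ColumnTransformer(
        [("ordinal", OrdinalEncoder(categories=categories), ordered_cols)],
        remainder="passthrough",  # leave all other columns untouched
    )


# Toy data (values are illustrative).
houses = pd.DataFrame(
    {
        "ExterCond": pd.Categorical(
            ["TA", "Gd", "Po"], categories=["Po", "Fa", "TA", "Gd", "Ex"], ordered=True
        ),
        "LotShape": pd.Categorical(
            ["Reg", "IR1", "IR3"], categories=["IR3", "IR2", "IR1", "Reg"], ordered=True
        ),
        "LotArea": [8450, 9600, 11250],
    }
)

transformer = ordered_categorical_encoder(houses)
encoded = transformer.fit_transform(houses)
print(encoded)
```

The encoded ordinal columns come first in the output, followed by the passthrough columns.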

@amueller
Member

amueller commented Mar 7, 2019

untagging as this is actually kinda complex and tied to many other things

@jorisvandenbossche
Member

Closing this in favor of #14953, which has some more recent discussion.
