LabelEncoder ignores pandas CategoricalDtype order #12086
This seems a tricky issue to solve, but yes we should try to do so while improving pandas interoperability |
This snippet gets the job done:
|
I don't get what you're trying to do here. It's not a complete runnable
snippet with assertions for correct behaviour.
Have you not managed to use OrdinalEncoder(categories=...) to get the right
ordering?
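For reference, the suggestion above could look roughly like the following. This is a minimal sketch, assuming scikit-learn 0.20 or later; the five quality labels are taken from the issue description, and the sample rows are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Desired order, lowest to highest (labels from the issue description).
quality_order = ["poor", "fair", "typical", "good", "excellent"]

# Made-up sample column; OrdinalEncoder expects 2D input (n_samples, n_features).
X = np.array([["typical"], ["fair"], ["excellent"], ["poor"], ["good"]], dtype=object)

# One list of categories per feature; the position in the list becomes the code.
encoder = OrdinalEncoder(categories=[quality_order])
codes = encoder.fit_transform(X).ravel()

print(codes)  # [2. 1. 4. 0. 3.]
```

Note that the order has to be passed explicitly here; nothing is read from a pandas dtype.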
|
I was not aware of OrdinalEncoder(); I'll have to try it. But my point in this bug report is that LabelEncoder, or even OrdinalEncoder, should automatically respect the order of CategoricalDtype(ordered=True) features, because the order is important, rather than just incrementing integers as the data is processed. Again, note how the order semantics are embedded in CategoricalDtype(ordered=True) data:
About the snippet, sorry for the terse code. It takes the row indexes from the source DataFrame (houses) and uses them to know where to write the correct integer (i) in the target DataFrame (X). |
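The workaround described above might look something like this. It is only a guess at the idea, not the original snippet: the names houses, X and i come from the comment, while the column name "quality" and the sample rows are assumptions.

```python
import pandas as pd

# Assumed ordered dtype and sample data (not from the original snippet).
quality_dtype = pd.CategoricalDtype(
    ["poor", "fair", "typical", "good", "excellent"], ordered=True
)
houses = pd.DataFrame(
    {"quality": pd.Series(["typical", "fair", "excellent", "poor", "good"],
                          dtype=quality_dtype)}
)

# Target frame with the same index, filled with a placeholder integer.
X = pd.DataFrame({"quality": 0}, index=houses.index)

# For each category, in dtype order, write its position i into the matching rows.
for i, category in enumerate(quality_dtype.categories):
    rows = houses.index[houses["quality"] == category]
    X.loc[rows, "quality"] = i

print(X["quality"].tolist())  # [2, 1, 4, 0, 3]
```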
Sorry, I got a bit confused here. Yes, OrdinalEncoder (and the extensions to
OneHotEncoder) are new in 0.20, which is due for release very soon.
But it also looks like you're using LabelEncoder for something other than
target labels.
Yes, ideally, OneHotEncoder and OrdinalEncoder should get their order from
a categorical dtype when applicable, especially if ordered=True.
@jorisvandenbossche is this a feature we should throw into 0.20 to avoid
deprecation hell?
|
So what people used
That said, I agree that scikit-learn should ideally handle pandas categorical dtypes automatically, making use of the categorical dtype information. Currently we simply discard it and convert to object dtype, then factorize the values again. I think we decided to leave that for later, and I think it is also too late to get that into 0.20. But the annoying thing is that, if we add this properly, it will change the behaviour in a future release for anyone already using categorical dtypes. And I also don't see an easy way to make such a change with a deprecation cycle. |
You would just have a FutureWarning that says "In the future, the order of CategoricalDtype(ordered=True) will be respected. To retain current behaviour please specify the categories for ordered pandas Categoricals". |
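To make the proposal concrete, here is a rough sketch of such a check. The helper name warn_on_ordered_categorical and its categories parameter are hypothetical and are not part of scikit-learn; only the warning text is taken from the comment above.

```python
import warnings
import pandas as pd

def warn_on_ordered_categorical(column, categories="auto"):
    """Hypothetical helper, not a scikit-learn API.

    Warn when an ordered pandas categorical is passed without explicit categories.
    """
    dtype = column.dtype
    is_ordered = isinstance(dtype, pd.CategoricalDtype) and bool(dtype.ordered)
    if is_ordered and isinstance(categories, str) and categories == "auto":
        warnings.warn(
            "In the future, the order of CategoricalDtype(ordered=True) will be "
            "respected. To retain current behaviour please specify the categories "
            "for ordered pandas Categoricals",
            FutureWarning,
        )

# Made-up sample column for illustration.
quality = pd.Series(
    ["typical", "fair", "poor"],
    dtype=pd.CategoricalDtype(["poor", "fair", "typical"], ordered=True),
)
warn_on_ordered_categorical(quality)  # emits a FutureWarning
warn_on_ordered_categorical(quality, categories=[["poor", "fair", "typical"]])  # silent
```

Every user who fits an encoder on an ordered categorical without passing categories would see this warning, which is the noise concern raised in the replies.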
The main problem would be excess noise. |
But you've given your opinion that you don't think we could get this right in a hurry. |
Yes, I am indeed worried about excessive noise / annoying warning. |
untagging as this is actually kinda complex and tied to many other things |
Closing this in favor of #14953, which has some more recent discussion. |
The order of the labels of pandas categorical features declared with CategoricalDtype(ordered=True) could be used by estimators. For example:
Note how order is embedded in the data type above.
I was expecting codes like these:
0 poor
1 fair
2 typical
3 good
4 excellent
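The expected codes listed above are what pandas itself stores for an ordered categorical. A minimal sketch, assuming the five labels above and a made-up sample column:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Ordered dtype: the position in `categories` defines the rank.
quality_dtype = CategoricalDtype(
    categories=["poor", "fair", "typical", "good", "excellent"], ordered=True
)

# Made-up sample data for illustration.
quality = pd.Series(["typical", "fair", "excellent", "poor", "good"], dtype=quality_dtype)

# pandas already keeps integer codes that follow the declared order.
expected_codes = quality.cat.codes

print(expected_codes.tolist())  # [2, 1, 4, 0, 3]
print(quality.min(), quality.max())  # poor excellent
```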
And I'm sure estimators would provide more meaningful results if this order were used.
But LabelEncoder gives random integer codes, probably using data as it comes:
3 poor
1 fair
0 typical
4 good
2 excellent
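A small reproduction sketch, using the full label names from this report and made-up sample rows. As far as I know, LabelEncoder sorts the distinct labels rather than taking them in order of appearance, so the exact codes depend on the label strings and will differ from the ones listed above if the real data uses other labels.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample labels; the intended order is poor < fair < typical < good < excellent.
labels = ["typical", "fair", "excellent", "poor", "good"]

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(labels)

# The classes come out sorted alphabetically, not in the intended order.
print(list(label_encoder.classes_))  # ['excellent', 'fair', 'good', 'poor', 'typical']
print(codes.tolist())  # [4, 1, 0, 3, 2]
```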
Thank you in advance