OrdinalEncoder with string categories should force user to specify the order #14563
Comments
You may be right... For some estimators, ordinal encoding and OHE do not
provide much difference in terms of the learnt model, so we've previously
considered ordinal encoding a reasonable encoding choice even for unordered
categories. But yes, its power is that it *can* represent order, so maybe
we should enforce this.
|
@NicolasHug |
@venkyyuvy we haven't decided on a solution yet |
My proposal based on your comments is: let us make the categories as
when |
Why a dict and not a list?
I'm still not sure about this. There should be a way to use an ordinal
encoding without the categories being naturally ordered.
|
Sorry I wasn't clear @venkyyuvy, what I meant is we're still not sure whether we actually want this or not ;) |
Since we are forcing the user to specify the order for string-valued columns, note that only a few columns may contain string values. For example:
Here the ordering of the first column is not right. The user might want to specify the order for that column alone. Currently (based on my understanding) Proposal
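To make the "specify the order for that column alone" idea concrete, here is a minimal pure-Python sketch. It is not scikit-learn code and not the exact proposal: `resolve_categories`, `encode_rows`, the `order_by_column` dict, and the sample rows are all hypothetical names and data made up for illustration. Columns listed in the dict get the user-supplied order; every other column falls back to sorted unique values.

```python
def resolve_categories(rows, order_by_column=None):
    """Build one ordered category list per column (hypothetical helper).

    order_by_column maps a column index to an explicit category order;
    columns missing from it fall back to sorted unique values.
    """
    order_by_column = order_by_column or {}
    n_columns = len(rows[0])
    resolved = []
    for col in range(n_columns):
        if col in order_by_column:
            resolved.append(list(order_by_column[col]))
        else:
            resolved.append(sorted({row[col] for row in rows}))
    return resolved


def encode_rows(rows, categories):
    """Replace each value with its position in that column's category list."""
    lookups = [{cat: code for code, cat in enumerate(cats)} for cats in categories]
    return [[lookups[col][value] for col, value in enumerate(row)] for row in rows]


# Hypothetical data: column 0 has a natural order, column 1 does not.
rows = [["low", "red"], ["high", "blue"], ["medium", "red"]]
cats = resolve_categories(rows, {0: ["low", "medium", "high"]})
codes = encode_rows(rows, cats)
print(cats)   # column 0 keeps the explicit order, column 1 is sorted
print(codes)
```

A dict keyed by column lets the user override only the columns that need it, whereas a list would need one entry for every column.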
|
@jnothman I totally agree that we need to have a way to encode without the categories being naturally ordered because
If intent is to encode without the natural order then it sounds more like a nominal encoding but without one hot representation. Hence having this functionality inside |
I think we should decide which of the two this is supposed to implement. The name kind of suggests to me there is a natural ordering, or at least that the encoding somehow represents an ordering. The treatment of all the edge-cases kind of depends on whether we assume there is an ordering or not. I think the cases where there is an ordering are probably rare and the user could handle them themselves. Not assuming the ordering is semantic will make it so much easier to define sensible behavior in the edge-cases. |
hmm, I'd agree that if we want to force the user to specify the order, then we should have another encoder which just does |
OrdinalEncoder vs OrderedEncoder? When we gave up on the
CategoricalEncoder we had decided that we should group functionality by
output format rather than input semantics, but that obviously had its
limitations.
Maybe we need to just have a way for the user to explicitly say that the
ordering is arbitrary?
|
At the moment is the conflation of these functions confusing for users? Is the problem that users forget or fail to specify categories for string data, or that pd.Categorical is ignored? Otherwise I see practical difference in how we should handle dropped (e.g. infrequent) categories, and perhaps missing data... I'd say we should:
|
That sounds like a good plan to me. |
I think having I like your plan. |
For an unordered category, are we going to recommend |
That's what I'm proposing, yes: That we keep them combined into one class
(rather than creating more ambiguously named estimators), but make it
harder to do the wrong thing.
|
To my understanding, the main use case for So from a practical standpoint the current behavior (lexicographical ordering) of this class is fine, but I am not opposed to asking the user to make it explicit, to acknowledge that the order of the mapping is arbitrary (that is, lexicographical for string values). Besides, I am fine with @jnothman's plan (#14563 (comment)). Note that we can do the handling of unknown values in a dedicated PR as it feels like an independent problem. |
I'm replacing this by #14954 (and #14953) given some consensus around #14563 (comment). Hope that's okay @NicolasHug |
We use lexicographic ordering when categories='auto', which is the default. I think we should deprecate this behavior and force the user to specify the order of the categories when they are strings. Since the OrdinalEncoder is used in a context where order matters, lexicographic order doesn't make sense in general.
(This isn't an issue with e.g. OneHotEncoder where there is no order).
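As a rough illustration of the concern, the pure-Python sketch below mimics assigning integer codes from sorted unique values (what lexicographic ordering amounts to for strings) and compares it with a user-supplied order. The helpers `fit_categories` and `encode` and the sample data are hypothetical stand-ins, not scikit-learn's implementation.

```python
def fit_categories(values, categories="auto"):
    """Return the ordered category list used to assign integer codes.

    With "auto", categories are the sorted unique values, which for
    strings means lexicographic order. Otherwise the given order is kept.
    """
    if categories == "auto":
        return sorted(set(values))
    return list(categories)


def encode(values, categories):
    """Map each value to its position in the category list."""
    index = {cat: code for code, cat in enumerate(categories)}
    return [index[v] for v in values]


# Hypothetical column with a natural order: small < medium < large.
sizes = ["small", "large", "medium", "small"]

auto_cats = fit_categories(sizes)
auto_codes = encode(sizes, auto_cats)

explicit_cats = fit_categories(sizes, ["small", "medium", "large"])
explicit_codes = encode(sizes, explicit_cats)

print(auto_cats, auto_codes)          # lexicographic: "large" sorts first
print(explicit_cats, explicit_codes)  # codes follow the intended order
```

With the sorted default, "large" gets the smallest code and "small" the largest, so the integers invert the intended order unless the user spells it out.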