OrdinalEncoder with string categories should force user to specify the order #14563

NicolasHug · 2019-08-04T12:39:06Z

We use lexicographic ordering when categories='auto' which is the default.

I think we should deprecate this behavior and force the user to specify the order of the categories when they are strings. Since the OrdinalEncoder is used in a context where order matters, lexicographic order doesn't make sense in general.

(This isn't an issue with e.g. OneHotEncoder where there is no order).

jnothman · 2019-08-04T20:36:55Z

You may be right... For some estimators, ordinal encoding and OHE do not provide much difference in terms of the learnt model, so we've previously considered ordinal encoding a reasonable encoding choice even for unordered categories. But yes, its power is that it *can* represent order, so maybe we should enforce this.

venkyyuvy · 2019-08-13T05:41:00Z

@NicolasHug
Can I pick this one?

NicolasHug · 2019-08-13T10:21:04Z

@venkyyuvy we haven't decided on a solution yet

venkyyuvy · 2019-08-13T11:32:38Z

@NicolasHug

My proposal based on your comments is:

Let us make categories a dictionary instead of a list, so that the user can specify the ordering of the unique values only for the columns that hold string values.

Raise an error (or a deprecation warning) stating that the order has to be specified in the categories dictionary for columns with string values, when the Xi values are strings and either categories='auto' or key i is not present in categories.

jnothman · 2019-08-13T11:53:56Z

Why a dict and not a list? I'm still not sure about this. There should be a way to use an ordinal encoding without the categories being naturally ordered.

NicolasHug · 2019-08-13T13:44:04Z

Sorry I wasn't clear @venkyyuvy, what I meant is we're still not sure whether we actually want this or not ;)

venkyyuvy · 2019-08-13T14:15:50Z

> Why a dict and not a list? I'm still not sure about this. There should be a way to use an ordinal encoding without the categories being naturally ordered.

Because we would be forcing the user to specify the order only for the string-valued columns, and there may be only a few such columns.

For example:

>>> from sklearn.preprocessing import OrdinalEncoder
>>> enc = OrdinalEncoder()
>>> X = [['neutral', 1001], ['poor', 1003], ['great', 1002]]
>>> enc.fit(X)
>>> enc.categories_

[array(['great', 'neutral', 'poor'], dtype=object),
 array([1001, 1002, 1003], dtype=object)]

Here the ordering of the first column is not right. The user might want to specify the order for that column alone. Currently (based on my understanding), categories must have a length equal to the number of features.

Proposal

>>> enc = OrdinalEncoder(categories={0:['poor','neutral','great']})

>>> X = [['neutral', 1001], ['poor', 1003], ['great', 1002]]
>>> enc.fit(X)
>>> enc.categories_

[array(['poor', 'neutral', 'great'], dtype=object),
 array([1001, 1002, 1003], dtype=object)]
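For comparison, the list-based categories parameter as it exists today appears to give the same result if every feature's categories are listed, including the ones that would otherwise be inferred. A minimal sketch, assuming an object-dtype array so that the mixed string/integer columns keep their types (the dict form above remains a proposal, not an existing option):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Object dtype keeps the strings as str and the ids as int.
X = np.array(
    [["neutral", 1001], ["poor", 1003], ["great", 1002]],
    dtype=object,
)

# One list per feature: the order given here is the order of the codes.
enc = OrdinalEncoder(
    categories=[["poor", "neutral", "great"], [1001, 1002, 1003]]
)
codes = enc.fit_transform(X)

print(enc.categories_)
print(codes)
```

The second list only restates what 'auto' would have found, which is the redundancy the dict proposal is trying to remove.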

venkyyuvy · 2019-08-13T15:38:24Z

@jnothman
Coming to your second point:

I totally agree that we need a way to encode without the categories being naturally ordered, because:

On some occasions the user might not know the actual ordering.
Some models, such as DecisionTree, can work well even without a natural ordering (as you mentioned).

If the intent is to encode without a natural order, then it sounds more like a nominal encoding, just without a one-hot representation. Hence having this functionality inside OrdinalEncoder may be misleading. Can we have something called NumericalEncoder for this purpose alone?

amueller · 2019-08-13T17:54:16Z

> There should be a way to use an ordinal encoding without the categories being naturally ordered.

I think we should decide which of the two this is supposed to implement. The name kind of suggests to me there is a natural ordering, or at least that the encoding somehow represents an ordering.

The treatment of all the edge-cases kind of depends on whether we assume there is an ordering or not. I think the cases where there is an ordering are probably rare and the user could handle them themselves. Not assuming the ordering is semantic will make it SO much more easy to define sensible behavior in the edge-cases.

adrinjalali · 2019-08-14T13:00:53Z

hmm, I'd agree that if we want to force the user to specify the order, then we should have another encoder which just does {0, ..., n-1} integer encoding regardless of the order. To me that's a much more common use case than actually having a total order on the categories.
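As a rough illustration of what such an order-agnostic integer encoding amounts to, here is a minimal sketch using np.unique. It is not a proposed API, and the variable names are made up for the example:

```python
import numpy as np

labels = np.array(["neutral", "poor", "great", "poor"])

# return_inverse gives, for each label, the index of its category.
# np.unique happens to sort, so the ids come out lexicographic, but
# nothing downstream is meant to read an order into them.
categories, codes = np.unique(labels, return_inverse=True)

print(categories)  # ['great' 'neutral' 'poor']
print(codes)       # [1 2 0 2]
```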

jnothman · 2019-08-14T23:58:17Z

OrdinalEncoder vs OrderedEncoder? When we gave up on the CategoricalEncoder we had decided that we should group functionality by output format rather than input semantics, but that obviously had its limitations. Maybe we need to just have a way for the user to explicitly say that the ordering is arbitrary?

jnothman · 2019-08-15T01:31:55Z

At the moment is the conflation of these functions confusing for users? Is the problem that users forget or fail to specify categories for string data, or that pd.Categorical is ignored?

Otherwise I see practical difference in how we should handle dropped (e.g. infrequent) categories, and perhaps missing data...

I'd say we should:

Deprecate automatically assuming lexicographic ordering (for string features) without an explicit request from the user (categories='arbitrary').
In the future, adopt the ordering implied by a pd.Categorical.
Provide options for representing dropped categories including "merge-down", "merge-up", "extra" (better names welcome).
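To make the second point concrete: a pd.Categorical already carries an order that the encoder's default does not look at. A small sketch, assuming current behavior where that order has to be handed over explicitly through categories (categories='arbitrary' from the plan is a proposal and is not used here):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

ratings = pd.Categorical(
    ["neutral", "poor", "great"],
    categories=["poor", "neutral", "great"],
    ordered=True,
)

# pandas' own integer codes follow the declared order.
pandas_codes = ratings.codes

# The same values as a plain object column.
X = np.asarray(ratings, dtype=object).reshape(-1, 1)

# Default: categories are inferred and sorted lexicographically.
auto_codes = OrdinalEncoder().fit_transform(X).ravel()

# Explicit: reuse the order stored on the Categorical.
explicit = OrdinalEncoder(categories=[list(ratings.categories)])
explicit_codes = explicit.fit_transform(X).ravel()

print(pandas_codes)    # [1 0 2]
print(auto_codes)      # [1. 2. 0.]
print(explicit_codes)  # [1. 0. 2.]
```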

adrinjalali · 2019-08-15T07:46:44Z

That sounds like a good plan to me.

amueller · 2019-08-15T17:01:22Z

I think having pd.Categorical ignored probably created bugs no-one is aware of. I don't think the conflation is confusing for now since we basically punted on implementing anything that would actually distinguish the cases well.

I like your plan.

thomasjpfan · 2019-08-15T20:24:41Z

For an unordered category, are we going to recommend OrdinalEncoder as well?

jnothman · 2019-08-16T01:21:38Z

That's what I'm proposing, yes: That we keep them combined into one class (rather than creating more ambiguously named estimators), but make it harder to do the wrong thing.

ogrisel · 2019-08-22T11:17:42Z

> For an unordered category, are we going to recommend OrdinalEncoder as well?

To my understanding, the main use case for OrdinalEncoder is to map string category labels to arbitrary integers as a preprocessing step for models like gradient boosting, RF, xgboost / lightgbm. Those models are assumed to be expressive enough that the ordering has little impact on the predictive accuracy. (Of course impact/target coding would be even better, since it promotes fewer splits and smaller models, which is always good, but I don't expect the effect to be that important.)

So from a practical standpoint the current behavior (lexicographical ordering) of this class is fine, but I am not opposed to asking the user to make it explicit, to acknowledge that the order of the mapping is arbitrary (that is, lexicographical for string values).

Besides, I am fine with @jnothman's plan (#14563 (comment)).

Note that we can do the handling of unknown values in a dedicated PR as it feels like an independent problem.

jnothman · 2019-09-11T11:14:35Z

I'm replacing this by #14954 (and #14953) given some consensus around #14563 (comment). Hope that's okay @NicolasHug

NicolasHug mentioned this issue Aug 4, 2019

Handle Error Policy in OrdinalEncoder #13488

Closed

jnothman mentioned this issue Sep 11, 2019

[MRG] Add support for infrequent categories in OneHotEncoder and OrdinalEncoder #13833

Closed

jnothman closed this as completed Sep 11, 2019
