OrdinalEncoder with string categories should force user to specify the order #14563

Closed
NicolasHug opened this issue Aug 4, 2019 · 18 comments

Comments

@NicolasHug
Member

We use lexicographic ordering when categories='auto' which is the default.

I think we should deprecate this behavior and force the user to specify the order of the categories when they are strings. Since the OrdinalEncoder is used in a context where order matters, lexicographic order doesn't make sense in general.

(This isn't an issue with e.g. OneHotEncoder where there is no order).
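To illustrate the concern, here is a minimal sketch comparing the default categories='auto' with an explicitly ordered list. The sample values ('poor', 'neutral', 'great') are made up for illustration.

```python
from sklearn.preprocessing import OrdinalEncoder

# Illustrative data: a single string column with a natural order
# poor < neutral < great (this ordering is an assumption of the example).
X = [["neutral"], ["poor"], ["great"]]

# Default behaviour: categories='auto' sorts the strings lexicographically,
# giving great=0, neutral=1, poor=2.
auto_enc = OrdinalEncoder()
auto_codes = auto_enc.fit_transform(X).ravel().tolist()

# Explicit order: one list of categories per column, in the intended order,
# giving poor=0, neutral=1, great=2.
explicit_enc = OrdinalEncoder(categories=[["poor", "neutral", "great"]])
explicit_codes = explicit_enc.fit_transform(X).ravel().tolist()

print(auto_codes)      # [1.0, 2.0, 0.0]
print(explicit_codes)  # [1.0, 0.0, 2.0]
```

With the default, 'poor' receives the largest code simply because it sorts last alphabetically.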

@jnothman
Member

jnothman commented Aug 4, 2019 via email

@venkyyuvy
Contributor

@NicolasHug
Can I pick this one?

@NicolasHug
Member Author

@venkyyuvy we haven't decided on a solution yet

@venkyyuvy
Contributor

venkyyuvy commented Aug 13, 2019

@NicolasHug

My proposal based on your comments is:

Let us make categories a dictionary instead of a list, so that the user can specify the ordering of the unique values for the columns that contain string values.

  1. Raise an error / deprecation error

"order has to be specified in dictionary categories for the columns with string values"

when the Xi values are strings and either categories='auto' or key i is not present in categories.

@jnothman
Member

jnothman commented Aug 13, 2019 via email

@NicolasHug
Member Author

Sorry I wasn't clear @venkyyuvy, what I meant is we're still not sure whether we actually want this or not ;)

@venkyyuvy
Contributor

venkyyuvy commented Aug 13, 2019

Why a dict and not a list? I'm still not sure about this. There should be a way to use an ordinal encoding without the categories being naturally ordered.

Because we would be forcing the user to specify the order only for the string-valued columns, and there may be only a few columns that contain string values.

For example:

>>> from sklearn.preprocessing import OrdinalEncoder
>>> enc = OrdinalEncoder()
>>> X = [['neutral', 1001], ['poor', 1003], ['great', 1002]]
>>> enc.fit(X)
>>> enc.categories_

[array(['great', 'neutral', 'poor'], dtype=object),
 array([1001, 1002, 1003], dtype=object)]

Here the ordering of the first column is not right. The user might want to specify the order for that column alone. Currently (based on my understanding), categories must have a length equal to the number of features.

Proposal

>>> enc = OrdinalEncoder(categories={0: ['poor', 'neutral', 'great']})
>>> X = [['neutral', 1001], ['poor', 1003], ['great', 1002]]
>>> enc.fit(X)
>>> enc.categories_
[array(['poor', 'neutral', 'great'], dtype=object),
 array([1001, 1002, 1003], dtype=object)]
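The dict-based categories argument proposed above is not part of the existing API. As a rough sketch of how a similar effect might be obtained with the list form, the hypothetical helper below (expand_categories is an invented name) fills in the columns that have no explicit order with their sorted unique values, which is an assumption about what the 'auto' fallback would look like.

```python
def expand_categories(X, order_by_column):
    """Hypothetical helper, not part of scikit-learn.

    Builds one category list per column: the user-supplied order where a
    column index is present in `order_by_column`, and the sorted unique
    values of that column otherwise.
    """
    n_features = len(X[0])
    categories = []
    for j in range(n_features):
        if j in order_by_column:
            # Keep the order exactly as the user wrote it.
            categories.append(list(order_by_column[j]))
        else:
            # Fall back to sorted unique values for this column.
            categories.append(sorted({row[j] for row in X}))
    return categories


X = [['neutral', 1001], ['poor', 1003], ['great', 1002]]
categories = expand_categories(X, {0: ['poor', 'neutral', 'great']})
print(categories)
# [['poor', 'neutral', 'great'], [1001, 1002, 1003]]
```

The resulting list has one entry per feature, so it should be usable as OrdinalEncoder(categories=categories) without any change to the encoder itself.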

@venkyyuvy
Contributor

venkyyuvy commented Aug 13, 2019

@jnothman
Coming to your second point:

I totally agree that we need to have a way to encode without the categories being naturally ordered because

  • On some occasions the user might not know the actual ordering.
  • Some models, such as DecisionTree, can work well even without a natural ordering (as you mentioned).

If the intent is to encode without the natural order, then it sounds more like a nominal encoding, just without a one-hot representation. Hence having this functionality inside OrdinalEncoder may be misleading. Can we have something called NumericalEncoder for this purpose alone?

@amueller
Member

There should be a way to use an ordinal encoding without the categories being naturally ordered.

I think we should decide which of the two this is supposed to implement. The name kind of suggests to me there is a natural ordering, or at least that the encoding somehow represents an ordering.

The treatment of all the edge-cases kind of depends on whether we assume there is an ordering or not. I think the cases where there is an ordering are probably rare and the user could handle them themselves. Not assuming the ordering is semantic will make it SO much easier to define sensible behavior in the edge-cases.

@adrinjalali
Member

hmm, I'd agree that if we want to force the user to specify the order, then we should have another encoder which just does {0, ..., n-1} integer encoding regardless of the order. To me that's a much more common use case than actually having a total order on the categories.
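As a minimal sketch of what such an order-agnostic integer encoding could look like, the hypothetical function below (arbitrary_integer_encode is an invented name, not a scikit-learn API) assigns codes in order of first appearance. That choice is an assumption made for the example; any bijection onto {0, ..., n-1} would serve equally well.

```python
def arbitrary_integer_encode(values):
    """Hypothetical sketch: map each distinct value to an integer in
    {0, ..., n-1}, in order of first appearance. The integers carry no
    ordinal meaning."""
    mapping = {}
    # setdefault returns the existing code, or stores and returns the next
    # free integer (the current size of the mapping) for a new value.
    codes = [mapping.setdefault(value, len(mapping)) for value in values]
    return codes, mapping


codes, mapping = arbitrary_integer_encode(['neutral', 'poor', 'great', 'poor'])
print(codes)    # [0, 1, 2, 1]
print(mapping)  # {'neutral': 0, 'poor': 1, 'great': 2}
```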

@jnothman
Member

jnothman commented Aug 14, 2019 via email

@jnothman
Member

At the moment is the conflation of these functions confusing for users? Is the problem that users forget or fail to specify categories for string data, or that pd.Categorical is ignored?

Otherwise I see practical difference in how we should handle dropped (e.g. infrequent) categories, and perhaps missing data...

I'd say we should:

  • Deprecate automatically assuming lexicographic ordering (for string features) without an explicit request from the user (categories='arbitrary').
  • In the future, adopt the ordering implied by a pd.Categorical.
  • Provide options for representing dropped categories including "merge-down", "merge-up", "extra" (better names welcome).
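Until the ordering of a pd.Categorical is adopted automatically, it can be passed along by hand. The sketch below assumes a single illustrative column named quality and forwards its declared category order through the existing categories parameter.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative ordered categorical: poor < neutral < great.
quality = pd.Categorical(
    ["neutral", "poor", "great"],
    categories=["poor", "neutral", "great"],
    ordered=True,
)
df = pd.DataFrame({"quality": quality})

# Forward the pandas category order explicitly, one list per column.
enc = OrdinalEncoder(categories=[list(df["quality"].cat.categories)])
encoded = enc.fit_transform(df[["quality"]]).ravel().tolist()

# The encoder's output now agrees with the integer codes pandas stores.
pandas_codes = df["quality"].cat.codes.tolist()
print(encoded)       # [1.0, 0.0, 2.0]
print(pandas_codes)  # [1, 0, 2]
```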

@adrinjalali
Member

That sounds like a good plan to me.

@amueller
Member

I think having pd.Categorical ignored probably created bugs no-one is aware of. I don't think the conflation is confusing for now since we basically punted on implementing anything that would actually distinguish the cases well.

I like your plan.

@thomasjpfan
Member

For an unordered category, are we going to recommend OrdinalEncoder as well?

@jnothman
Member

jnothman commented Aug 16, 2019 via email

@ogrisel
Member

ogrisel commented Aug 22, 2019

For an unordered category, are we going to recommend OrdinalEncoder as well?

To my understanding, the main use case for OrdinalEncoder is to map string category labels to arbitrary integers as a preprocessing step for models like gradient boosting, RF, xgboost / lightgbm. Those models are assumed to be expressive enough that the ordering has little impact on the predictive accuracy. (Of course impact/target coding would be even better, by promoting fewer splits and smaller models, which is always good, but I don't expect the effect to be that important.)

So from a practical standpoint the current behavior (lexicographical ordering) of this class is fine, but I am not opposed to asking the user to make it explicit, to acknowledge that the order of the mapping is arbitrary (that is, lexicographical for string values).

Besides, I am fine with @jnothman's plan (#14563 (comment)).

Note that we can do the handling of unknown values in a dedicated PR as it feels like an independent problem.

@jnothman
Member

I'm replacing this by #14954 (and #14953) given some consensus around #14563 (comment). Hope that's okay @NicolasHug


7 participants