[MRG] Add support for infrequent categories in OneHotEncoder and OrdinalEncoder #13833

NicolasHug · 2019-05-08T17:36:41Z

Fixes: #12153
Closes #12264

Not strictly MRG but I need input.

This PR adds a max_levels parameter to OneHotEncoder and OrdinalEncoder.

Categories with the least amount of support are considered 'infrequent' and are mapped into a specific column.
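To make the idea concrete, here is a minimal pure-Python sketch of the grouping step, not the code in this PR. It assumes `max_levels` means "keep that many of the most frequent categories and send everything else to one extra bucket"; the helper name `group_infrequent`, the `other` label, and the alphabetical tie-breaking are all hypothetical choices made for illustration.

```python
from collections import Counter


def group_infrequent(values, max_levels, other="infrequent"):
    """Keep the `max_levels` most frequent categories; map the rest to `other`."""
    counts = Counter(values)
    # Rank by descending count, breaking ties alphabetically so the result
    # is deterministic (the tie-breaking rule is an assumption of this sketch).
    ranked = sorted(counts, key=lambda cat: (-counts[cat], cat))
    frequent = set(ranked[:max_levels])
    return [v if v in frequent else other for v in values]


colors = ["red", "red", "red", "blue", "blue", "green", "teal"]
grouped = group_infrequent(colors, max_levels=2)
print(grouped)
```

With `max_levels=2`, only `red` and `blue` survive as their own categories; `green` and `teal` share the single infrequent bucket, which a one-hot encoder would then expose as one column.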

TODO:

more thorough tests
docs
add min_frequency parameter
handle the interaction with handle_unknown to address Handle Error Policy in OrdinalEncoder #13488 (comment)

ping @jnothman @ogrisel @amueller @thomasjpfan, please LMK if this is going in the right direction.

(You can look at test_infrequent_categories_sanity() for some sort of guide to this new feature.)

jnothman

Is it appropriate to just add the infrequent categories to those dropped (in OHE at least)? Or is there some fundamental difference in their handling that I'm not considering now?

jnothman · 2019-05-19T23:14:29Z

sklearn/preprocessing/_encoders.py

@@ -16,7 +16,7 @@
 from ..utils.validation import check_is_fitted

 from .base import _transform_selected
-from .label import _encode, _encode_check_unknown
+from .label import _encode, _encode_check_unknown, _encode_numpy


can't we just use _encode rather than _encode_numpy?

jnothman · 2019-05-19T23:14:52Z

sklearn/preprocessing/_encoders.py

@@ -243,6 +279,10 @@ class OneHotEncoder(_BaseEncoder):
        be dropped for each feature. None if all the transformed features will
        be retained.

+    infrequent_indices_: list of arrays of shape(n_infrequent_categories)


Suggested change

infrequent_indices_: list of arrays of shape(n_infrequent_categories)

infrequent_indices_ : list of arrays of shape (n_infrequent_categories,)

jnothman · 2019-05-19T23:15:09Z

sklearn/preprocessing/_encoders.py

@@ -243,6 +279,10 @@ class OneHotEncoder(_BaseEncoder):
        be dropped for each feature. None if all the transformed features will
        be retained.

+    infrequent_indices_: list of arrays of shape(n_infrequent_categories)
+        ``infrequent_indices_[i]`` contains a list of indices in


Suggested change

``infrequent_indices_[i]`` contains a list of indices in

``infrequent_indices_[i]`` contains an array of indices in

jnothman · 2019-05-20T08:29:44Z

sklearn/preprocessing/_encoders.py

@@ -491,12 +532,28 @@ def fit(self, X, y=None):
        else:
            self._fit(X, handle_unknown=self.handle_unknown)
            self.drop_idx_ = self._compute_drop_idx()
+
+            # check if user wants to manually drop a feature that is
+            # infrequent: this is not allowed


it doesn't seem natural to drop some features that are infrequent, and not drop some.

it would also make the code even more complex.

NicolasHug · 2019-05-20T12:24:08Z

Is it appropriate to just add the infrequent categories to those dropped?

I don't think it is appropriate. With drop='first', that would result in treating the first category as infrequent.

FedericoV · 2019-06-04T21:11:58Z

Minor feedback: wouldn't it be more appropriate to always bucket the infrequent categories in 0? This way, regardless of the cardinality of a categorical variable, we always know in which bucket the infrequent categories end up.

NicolasHug · 2019-06-04T21:38:21Z

implementation wise, we would still need to remap the frequent categories anyway

as for user interface: then drop='first' becomes ambiguous

amueller · 2019-08-02T20:53:57Z

can you fix the merge conflicts?

thomasjpfan · 2019-08-02T21:53:36Z

We do not have a proper place to put the infrequent category for OrdinalEncoder. No matter where we put it, it doesn’t keep the lexical ordering of the original features.

NicolasHug · 2019-08-03T19:38:26Z

Can you please elaborate Thomas?

thomasjpfan · 2019-08-03T21:04:39Z

I have clothing size categories (which have a natural order): XS, S, M, L, ReallyLarge. Currently, our OrdinalEncoder will map:

L -> 0, M -> 1, ReallyLarge -> 2, S -> 3, and XS -> 4

Let's say we are okay with this. Now let ReallyLarge and XS be "infrequent": what integer value do we encode the infrequent category as? There have been ideas to do:

Infrequent -> 0, L -> 1, M -> 2, S -> 3

Or

L -> 0, M -> 1, S -> 2, Infrequent -> 3

"Infrequent" itself does not have an ordering when compared to the other categories since it is a combination of categories.
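The three mappings above can be reproduced with a short pure-Python sketch. It only relies on the lexicographic sort described in the comment; the variable names and the `"Infrequent"` label are illustrative and are not taken from the PR's code.

```python
sizes = ["XS", "S", "M", "L", "ReallyLarge"]
infrequent = {"ReallyLarge", "XS"}

# Current behaviour described above: categories are sorted lexicographically,
# so the natural size order is lost.
current = {cat: code for code, cat in enumerate(sorted(sizes))}

frequent = sorted(cat for cat in sizes if cat not in infrequent)

# Option 1: the infrequent bucket takes code 0 and frequent categories shift up.
infrequent_first = {"Infrequent": 0}
infrequent_first.update({cat: code for code, cat in enumerate(frequent, start=1)})

# Option 2: frequent categories keep codes from 0 and the bucket goes last.
infrequent_last = {cat: code for code, cat in enumerate(frequent)}
infrequent_last["Infrequent"] = len(frequent)

print(current)
print(infrequent_first)
print(infrequent_last)
```

In both options the bucket combines the smallest and the largest size, so neither end of the integer range is a meaningful position for it.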

jnothman · 2019-08-03T22:06:08Z

Yes, in the context of truly ordinal categories (whether or not we are supporting lexical ordering) it doesn't make a lot of sense. Absorbing that category into a neighbour is more appropriate.


NicolasHug · 2019-08-04T14:54:12Z

Conflicts should be resolved @amueller

jnothman

I'm a bit confused by the combination of the unchecked TODO list and the "MRG" designation. Can you clarify what work you still intend to do here?

jnothman · 2019-08-04T22:44:44Z

sklearn/preprocessing/_encoders.py

@@ -222,6 +258,10 @@ class OneHotEncoder(_BaseEncoder):
        be dropped for each feature. None if all the transformed features will
        be retained.

+    infrequent_indices_: list of arrays of shape(n_infrequent_categories)


Suggested change

infrequent_indices_: list of arrays of shape(n_infrequent_categories)

infrequent_indices_ : list of arrays of shape (n_infrequent_categories,)

NicolasHug · 2019-08-05T12:06:50Z

The TODO list is up to date.

Given that working on encoders is a) extremely complex and b) subject to never-ending discussions (#12893), I want to make sure I have initial approval on the design and features of the PR before putting more work into it.

jnothman

There don't seem to be parameter docstrings for max_levels. Makes it hard to review that it's correct.

NicolasHug · 2019-08-06T13:11:27Z

I added a docstring @jnothman . The general expected features are illustrated in test_infrequent_categories_sanity().

jnothman · 2019-09-11T10:50:47Z

Per #14563 (comment), OrdinalEncoder should provide options for representing dropped categories including "merge-down", "merge-up", "extra" (better names welcome).
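One possible reading of these three options is sketched below. The option names come from the comment above, but their semantics here are an assumption: "merge-down" maps an infrequent category to the nearest frequent category below it in the user-supplied order (falling back to the nearest one above when none exists), "merge-up" does the reverse, and "extra" gives all infrequent categories one additional code. The function `encode_with_merge` is hypothetical and is not part of scikit-learn.

```python
def encode_with_merge(ordered, infrequent, strategy):
    """Map each category in `ordered` (lowest to highest) to an integer code."""
    frequent = [cat for cat in ordered if cat not in infrequent]
    if not frequent:
        raise ValueError("at least one frequent category is required")
    if strategy not in ("merge-down", "merge-up", "extra"):
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Frequent categories keep their relative order.
    codes = {cat: code for code, cat in enumerate(frequent)}

    for pos, cat in enumerate(ordered):
        if cat not in infrequent:
            continue
        if strategy == "extra":
            # All infrequent categories share one code after the frequent ones.
            codes[cat] = len(frequent)
            continue
        below = [c for c in ordered[:pos] if c not in infrequent]
        above = [c for c in ordered[pos + 1:] if c not in infrequent]
        if strategy == "merge-down":
            target = below[-1] if below else above[0]
        else:  # "merge-up"
            target = above[0] if above else below[-1]
        codes[cat] = codes[target]
    return codes


sizes = ["XS", "S", "M", "L", "ReallyLarge"]
for strategy in ("merge-down", "merge-up", "extra"):
    print(strategy, encode_with_merge(sizes, {"M"}, strategy))
```

Unlike a single "Infrequent" bucket, the two merge strategies keep every encoded value at a position that respects the original order, at the cost of making an infrequent category indistinguishable from its neighbour.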

NicolasHug · 2019-09-20T13:32:23Z

I want to wait for #14972 before getting back to this. The code is way too complicated IMHO.

NicolasHug · 2020-02-23T15:07:33Z

Superseded

NicolasHug added 7 commits May 6, 2019 12:03

WIP

9dda919

WIP

758191f

WIP

4cff102

some tests

d2a1a06

added support for drop='infrequent'

0533761

comment

99352b6

pep8

8a3b827

rth changed the title ~~[MRG] Add supprot for infrequent categories in OneHotEncoder and OrdinalEncoder~~ [MRG] Add support for infrequent categories in OneHotEncoder and OrdinalEncoder May 8, 2019

jnothman reviewed May 20, 2019


NicolasHug mentioned this pull request Jun 3, 2019

[WIP] "other"/min_freq in OneHot and OrdinalEncoder #12264

Closed

7 tasks

jnothman mentioned this pull request Aug 1, 2019

[MRG] Allowing Virtual Category instead of error for OrdinalEncoder #14534

Closed

NicolasHug added 2 commits August 4, 2019 10:46

Merge branch 'master' of github.com:scikit-learn/scikit-learn into un…

14dc56a

…frequent_categories

pep8

e110419

jnothman reviewed Aug 4, 2019


jnothman reviewed Aug 5, 2019


Added docstring for max_levels

69b738f

rth mentioned this pull request Dec 4, 2019

META OHE / OrdinalEncoder: NaN support, unfrequent cat. and pd.Categorical #15796

Open

thomasjpfan mentioned this pull request Jan 3, 2020

ENH Adds infrequent categories to OneHotEncoder #16018

Merged

NicolasHug closed this Feb 23, 2020
