[MRG] Add support for infrequent categories in OneHotEncoder and OrdinalEncoder #13833

Closed

Conversation

NicolasHug
Member

Fixes: #12153
Closes #12264

Not strictly MRG but I need input.

This PR adds a max_levels parameter to OneHotEncoder and OrdinalEncoder.

Categories with the least amount of support are considered 'infrequent' and are mapped into a specific column.

TODO:

ping @jnothman @ogrisel @amueller @thomasjpfan, please LMK if this is going in the right direction.

(You can look at test_infrequent_categories_sanity() for some sort of guide to this new feature.)
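
To make the description above concrete, here is a minimal pure-Python sketch of the idea, not the PR's implementation. The helper names `split_by_frequency` and `one_hot` are hypothetical, and two details are assumptions: that `max_levels` counts the pooled infrequent column as one of the levels, and that ties in frequency are broken alphabetically.

```python
from collections import Counter


def split_by_frequency(values, max_levels):
    """Split the categories of one feature into (frequent, infrequent).

    Assumption: `max_levels` includes the single pooled infrequent level.
    """
    counts = Counter(values)
    categories = sorted(counts)
    if len(categories) <= max_levels:
        return categories, []
    # Rank by descending count; ties are broken alphabetically (assumption).
    ranked = sorted(categories, key=lambda c: (-counts[c], c))
    n_frequent = max_levels - 1
    return sorted(ranked[:n_frequent]), sorted(ranked[n_frequent:])


def one_hot(values, frequent, infrequent):
    """One-hot encode values; all infrequent categories share the last column."""
    columns = list(frequent) + (["infrequent"] if infrequent else [])
    column_of = {c: i for i, c in enumerate(frequent)}
    column_of.update({c: len(frequent) for c in infrequent})
    rows = []
    for value in values:
        row = [0] * len(columns)
        row[column_of[value]] = 1  # raises KeyError for unseen categories
        rows.append(row)
    return columns, rows


train = ["a"] * 5 + ["b"] * 4 + ["c", "d"]
frequent, infrequent = split_by_frequency(train, max_levels=3)
columns, rows = one_hot(["a", "c", "d", "b"], frequent, infrequent)
```

In this toy data, "c" and "d" each appear once, so both end up in the shared last column while "a" and "b" keep their own columns.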

@rth rth changed the title from "[MRG] Add supprot for infrequent categories in OneHotEncoder and OrdinalEncoder" to "[MRG] Add support for infrequent categories in OneHotEncoder and OrdinalEncoder" on May 8, 2019
Member

@jnothman jnothman left a comment

Is it appropriate to just add the infrequent categories to those dropped (in OHE at least)? Or is there some fundamental difference in their handling that I'm not considering now?

@@ -16,7 +16,7 @@
from ..utils.validation import check_is_fitted

from .base import _transform_selected
-from .label import _encode, _encode_check_unknown
+from .label import _encode, _encode_check_unknown, _encode_numpy
Member

Can't we just use _encode rather than _encode_numpy?

@@ -243,6 +279,10 @@ class OneHotEncoder(_BaseEncoder):
be dropped for each feature. None if all the transformed features will
be retained.

infrequent_indices_: list of arrays of shape(n_infrequent_categories)
Member

Suggested change
-infrequent_indices_: list of arrays of shape(n_infrequent_categories)
+infrequent_indices_ : list of arrays of shape (n_infrequent_categories,)

@@ -243,6 +279,10 @@ class OneHotEncoder(_BaseEncoder):
be dropped for each feature. None if all the transformed features will
be retained.

infrequent_indices_: list of arrays of shape(n_infrequent_categories)
``infrequent_indices_[i]`` contains a list of indices in
Member

Suggested change
-``infrequent_indices_[i]`` contains a list of indices in
+``infrequent_indices_[i]`` contains an array of indices in

@@ -491,12 +532,28 @@ def fit(self, X, y=None):
else:
self._fit(X, handle_unknown=self.handle_unknown)
self.drop_idx_ = self._compute_drop_idx()

# check if user wants to manually drop a feature that is
# infrequent: this is not allowed
Member

Why not?

Member Author

It doesn't seem natural to drop some infrequent categories and not others.

It would also make the code even more complex.

@NicolasHug
Member Author

Is it appropriate to just add the infrequent categories to those dropped?

I don't think it is appropriate. With drop='first', that would result in treating the first category as infrequent.
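
A small pure-Python sketch of why the two cases collide, assuming a dropped category is encoded as an all-zero row. The helper `one_hot_with_drop` and the toy categories are hypothetical and only illustrate the argument; this is not code from the PR.

```python
def one_hot_with_drop(value, categories, dropped):
    """Encode one value; dropped categories get no column (all-zero row)."""
    kept = [c for c in categories if c not in dropped]
    return [1 if value == c else 0 for c in kept]


categories = ["a", "b", "c", "d"]  # assume "d" is the infrequent category
first = categories[0]              # the category drop='first' would remove

# Treat the infrequent category as just another dropped category.
dropped = {first, "d"}
row_first = one_hot_with_drop("a", categories, dropped)
row_rare = one_hot_with_drop("d", categories, dropped)
row_kept = one_hot_with_drop("b", categories, dropped)
```

Here `row_first` and `row_rare` are the same all-zero row, so after encoding the first category can no longer be told apart from the infrequent one.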

@FedericoV
Contributor

Minor feedback: wouldn't it be more appropriate to always bucket the infrequent categories in 0? This way, regardless of the cardinality of a categorical variable, we always know in which bucket the infrequent categories end up.

@NicolasHug
Member Author

Implementation-wise, we would still need to remap the frequent categories anyway.

As for the user interface, drop='first' then becomes ambiguous.

@amueller
Member

amueller commented Aug 2, 2019

Can you fix the merge conflicts?

@thomasjpfan
Member

We do not have a proper place to put the infrequent category for OrdinalEncoder. No matter where we put it, it doesn’t keep the lexicographic ordering of the original features.

@NicolasHug
Member Author

Can you please elaborate, Thomas?

@thomasjpfan
Member

I have clothing size categories (which have a natural order): XS, S, M, L, ReallyLarge. Currently, our OrdinalEncoder will map:

L -> 0, M -> 1, ReallyLarge -> 2, S -> 3, and XS -> 4

Let's say we are okay with this. Now let ReallyLarge and XS be "infrequent": what integer value do we use to encode the infrequent category? There have been ideas to do:

Infrequent -> 0, L -> 1, M -> 2, S -> 3

Or

L -> 0, M -> 1, S -> 2, Infrequent -> 3

"Infrequent" itself does not have an ordering when compared to the other categories since it is a combination of categories.
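
The mappings in this comment can be reproduced with a few lines of plain Python. This is only a sketch of the two placements discussed in the thread; the variable names are illustrative and nothing here is taken from the PR's code.

```python
sizes = ["XS", "S", "M", "L", "ReallyLarge"]

# Sorted string order, which is what the current mapping is based on.
categories = sorted(sizes)
current = {c: i for i, c in enumerate(categories)}

infrequent = {"ReallyLarge", "XS"}
frequent = [c for c in categories if c not in infrequent]

# Option 1: the pooled infrequent level takes code 0.
infrequent_first = {"Infrequent": 0}
infrequent_first.update({c: i + 1 for i, c in enumerate(frequent)})

# Option 2: the pooled infrequent level takes the last code.
infrequent_last = {c: i for i, c in enumerate(frequent)}
infrequent_last["Infrequent"] = len(frequent)
```

In both options the frequent sizes stay in the order L, M, S, which is the sorted string order and not the natural order S, M, L, so the position of the pooled level carries no ordinal meaning either way.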

@jnothman
Member

jnothman commented Aug 3, 2019 via email

@NicolasHug
Member Author

Conflicts should be resolved, @amueller.

Member

@jnothman jnothman left a comment

I'm a bit confused by the combination of the unchecked TODO list and the "MRG" designation. Can you clarify what work you still intend to do here?

@@ -222,6 +258,10 @@ class OneHotEncoder(_BaseEncoder):
be dropped for each feature. None if all the transformed features will
be retained.

infrequent_indices_: list of arrays of shape(n_infrequent_categories)
Member

Suggested change
-infrequent_indices_: list of arrays of shape(n_infrequent_categories)
+infrequent_indices_ : list of arrays of shape (n_infrequent_categories,)

@NicolasHug
Member Author

The TODO list is up to date.

Given that working on encoders is a) extremely complex and b) subject to never-ending discussions (#12893), I want to make sure I have initial approval on the design and features of the PR before putting more work into it.

Member

@jnothman jnothman left a comment

There don't seem to be parameter docstrings for max_levels, which makes it hard to review whether the behaviour is correct.

@NicolasHug
Member Author

I added a docstring, @jnothman. The general expected features are illustrated in test_infrequent_categories_sanity().

@jnothman
Member

Per #14563 (comment), OrdinalEncoder should provide options for representing dropped categories including "merge-down", "merge-up", "extra" (better names welcome).

@NicolasHug
Member Author

I want to wait for #14972 before getting back to this. The code is way too complicated IMHO.

@NicolasHug
Member Author

Superseded

@NicolasHug NicolasHug closed this Feb 23, 2020
Development

Successfully merging this pull request may close these issues.

Add "other" / min_frequency option to OneHotEncoder
5 participants