-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] Add support for infrequent categories in OneHotEncoder and OrdinalEncoder #13833
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it appropriate to just add the infrequent categories to those dropped (in OHE at least)? Or is there some fundamental difference in their handling that I'm not considering now?
@@ -16,7 +16,7 @@ | |||
from ..utils.validation import check_is_fitted | |||
|
|||
from .base import _transform_selected | |||
from .label import _encode, _encode_check_unknown | |||
from .label import _encode, _encode_check_unknown, _encode_numpy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't we just use _encode
rather than _encode_numpy
?
@@ -243,6 +279,10 @@ class OneHotEncoder(_BaseEncoder): | |||
be dropped for each feature. None if all the transformed features will | |||
be retained. | |||
|
|||
infrequent_indices_: list of arrays of shape(n_infrequent_categories) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
infrequent_indices_: list of arrays of shape(n_infrequent_categories) | |
infrequent_indices_ : list of arrays of shape (n_infrequent_categories,) |
@@ -243,6 +279,10 @@ class OneHotEncoder(_BaseEncoder): | |||
be dropped for each feature. None if all the transformed features will | |||
be retained. | |||
|
|||
infrequent_indices_: list of arrays of shape(n_infrequent_categories) | |||
``infrequent_indices_[i]`` contains a list of indices in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
``infrequent_indices_[i]`` contains a list of indices in | |
``infrequent_indices_[i]`` contains an array of indices in |
sklearn/preprocessing/_encoders.py
Outdated
@@ -491,12 +532,28 @@ def fit(self, X, y=None): | |||
else: | |||
self._fit(X, handle_unknown=self.handle_unknown) | |||
self.drop_idx_ = self._compute_drop_idx() | |||
|
|||
# check if user wants to manually drop a feature that is | |||
# infrequent: this is not allowed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it doesn't seem natural to drop some features that are infrequent, and not drop some.
it would also make the code even more complex.
I don't think it is appropriate. with drop=first, that would result in treating the first category as infrequent. |
Minor feedback: wouldn't it be more appropriate to always bucket the infrequent categories in 0? This way, regardless of the cardinality of a categorical variable, we always know in which bucket the infrequent categories end up. |
implementation wise, we would still need to remap the frequent categories anyway as for user interface: then drop='first' becomes ambiguous |
can you fix the merge conflicts? |
We do not have a proper place to put the infrequent category for OrdinalEncoder. No matter where we put it, it doesn’t keep lexicon ordering of the original features. |
Can you please elaborate Thomas? |
I have clothing size categories (which have a natural order): XS, S, M, L, ReallyLarge. Currently, our OrdinalEncoder will map:
Lets say we are okay with this, now let ReallyLarge and XS be "infrequent", what integer value do we encode the infrequent category? There are been ideas to do:
Or
"Infrequent" itself does not have an ordering when compared to the other categories since it is a combination of categories. |
Yes, in the context of truly ordinal categories (whether or not we are
supporting lexical ordering) it doesn't make a lot of sense. Absorbing that
category into a neighbour is more appropriate.
|
…frequent_categories
Conflicts should be resolved @amueller |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused by the combination of the unchecked TODO list and the "MRG" designation. Can you clarify what work you still intend to do here?
@@ -222,6 +258,10 @@ class OneHotEncoder(_BaseEncoder): | |||
be dropped for each feature. None if all the transformed features will | |||
be retained. | |||
|
|||
infrequent_indices_: list of arrays of shape(n_infrequent_categories) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
infrequent_indices_: list of arrays of shape(n_infrequent_categories) | |
infrequent_indices_ : list of arrays of shape (n_infrequent_categories,) |
The TODO list is up to date. Given that working on encoders is a) extremely complex and b) subject to never-ending discussions (#12893), I want to make sure I have initial approval on the design and features of the PR before putting more work into it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There don't seem to be parameter docstrings for max_levels. Makes it hard to review that it's correct.
I added a docstring @jnothman . The general expected features are illustrated in |
Per #14563 (comment), OrdinalEncoder should provide options for representing dropped categories including "merge-down", "merge-up", "extra" (better names welcome). |
I want to wait for #14972 before getting back to this. The code is way too complicated IMHO. |
Superseded |
Fixes: #12153
Closes #12264
Not strictly MRG but I need input.
This PR adds a
max_levels
parameter to OneHotEncoder and OrdinalEncoder.Categories with the last amount of support are considered 'infrequent' and are mapped into a specific column.
TODO:
min_frequency
parameterhandle_unkown
to address Handle Error Policy in OrdinalEncoder #13488 (comment)ping @jnothman @ogrisel @amueller @thomasjpfan, please LMK if this is going in the right direction.
(You can look at
test_infrequent_categories_sanity()
for some sort of guide to this new feature.)