ENH Adds infrequent categories to OneHotEncoder #16018
Partial review. Nicely thought out, @thomasjpfan! I'm not sure about the inverse_transform behaviour, but I don't think it is so fundamentally important.
doc/modules/preprocessing.rst
Outdated
Infrequent categories
---------------------

:class:`OneHotEncoder` supports creating a category for infrequent categories
"creating a category" -> "outputting feature that conflates"???
doc/modules/preprocessing.rst
Outdated
in the training data. The parameters to enable the gathering of infrequent
categories are `min_frequency` and `max_levels`.

1. `min_frequency` can be a integer greater or equal to one or a float in
1. `min_frequency` can be a integer greater or equal to one or a float in
1. `min_frequency` can be a integer greater or equal to one, or a float in
maybe one -> 1 too.
doc/modules/preprocessing.rst
Outdated
total number of samples will be considered infrequent.

2. `max_levels` can be `None` or any integer greater than one. This parameter
sets an upper limit of the number of categories including the infrequent
"number of categories including the infrequent category" -> "output features for each input feature"??
doc/modules/preprocessing.rst
Outdated
sets an upper limit of the number of categories including the infrequent
category.

These parameters can be used together to filter out infrequent categories. In
The following example does not use these parameters together, so this sentence seems a bit out of place.
sklearn/preprocessing/_encoders.py
Outdated
encountered in transform:

1. If there was no infrequent category during training, the resulting
one-hot encoded columns for this feature will be be all zeros. In
Should this be indented?
yes, doc renders poorly. Same for the other parameter.
sklearn/preprocessing/_encoders.py
Outdated
'ignore' is deprecated in favor of 'auto'

min_frequency : int or float, default=1
    Specifics the categories to be considered infrequent.
Specifics the categories to be considered infrequent.
Specifies the categories to be considered infrequent.
sklearn/preprocessing/_encoders.py
Outdated
.. versionadded:: 0.23

max_levels : int, default=None
Not sure that we should adopt the term "levels". I don't know of it being used elsewhere in scikit-learn. What else could we call it? Output features? Categories??
max_output_features might become confusing when used together with drop
Currently there is no drop support even for handle_unknown='error'. If we want to have drop support, I would prefer to consider it in a follow-up PR.
Side note: Even with drop support, this places an upper limit on the number of output features.
sklearn/preprocessing/_encoders.py
Outdated
max_levels : int, default=None
    Specifies the categories to be considered infrequent. Sets an upper
    limit to the number of categories including the infrequent category.
"number of categories" is confusing. I suggested an alternative wording above.
sklearn/preprocessing/_encoders.py
Outdated
# validates infrequent category features
if self.drop is not None and self._infrequent_enabled:
    raise ValueError("infrequent categories are not supported when "
                     "drop is specified")
This does not appear to be documented
This is now documented in the docstring for drop.
sklearn/preprocessing/_encoders.py
Outdated
    return output if output.size > 0 else None

def _compute_infrequent_categories(self, category_counts, n_samples):
    """Compute infrequent categories.
The difference between this and the previous function is hard to discern from their name and summary. Maybe _fit_infrequent_category_mapping here, which implies that the result is stored rather than returned.
The above could be _identify_infrequent.
    assert_array_equal(['x0_b', 'x0_c', 'x0_a'], feature_names)


def test_ohe_infrequent_two_levels_user_cats():
I would prefer the relationship between the behaviour for user-specified cats and auto cats to be a bit clearer here. At the moment this tests the behaviour independently, rather than asserting that some relationship holds between the two behaviours. Is there a nice way to test that relationship instead?
All the other test cases use auto and check that infrequent_indices_ respects the lexicographic ordering from auto.
The biggest difference is that user-provided cats may not have the same order as 'auto'. What relationship do you see between them?
sklearn/preprocessing/_encoders.py
Outdated
def _map_to_infrequent_categories(self, X_int):
    """Map categories to infrequent categories.

    Note this will replace the encoding in X_int
"This modifies X_int in-place"
sklearn/preprocessing/_encoders.py
Outdated
if not self._infrequent_enabled:
    return

for col_idx, mapping in enumerate(
I'm happier if col_idx is called i and this fits on one line...
sklearn/preprocessing/_encoders.py
Outdated
"""Process the valid mask during `_transform`

This function is passed to `_transform` to adjust the mask depending
on if the infrequent column exist or not.
exist -> exists
sklearn/preprocessing/_encoders.py
Outdated
This function is passed to `_transform` to adjust the mask depending
on if the infrequent column exist or not.
"""
I think it would be helpful to the reader if you documented Parameters and Returns here
The Parameters and Returns were added.
sklearn/preprocessing/_encoders.py
Outdated
    return np.ones_like(valid_mask, dtype=bool)

def _compute_transformed_category(self, i):
    """Compute the transformed category used for column `i`.
I think this computes the categories for column i, not the category for column i.
sklearn/preprocessing/_encoders.py
Outdated
if infrequent_idx is None:
    return valid_mask

# infrequent column exist
exist -> exists
Brief review. Please describe the tests.
What is the interaction with drop?
def test_encode_util_uniques_unordered():
    # The return counts are ordered based on the order of uniques
# The return counts are ordered based on the order of uniques
# Make sure the returned counts are ordered based on the order of uniques
doc/modules/preprocessing.rst
Outdated
>>> enc.transform([['dog']]).toarray()
array([[0., 0., 1.]])
>>> enc.transform([['rabbit']]).toarray()
array([[0., 1., 0.]])
maybe also display the rest of the categories, to be exhaustive and for clarity
doc/modules/preprocessing.rst
Outdated
category.

These parameters can be used together to filter out infrequent categories. In
the following example, the categories, `'dog', 'cat'`, are considered infrequent::
snake instead of cat?
doc/modules/preprocessing.rst
Outdated
:class:`OneHotEncoder` supports creating a category for infrequent categories
in the training data. The parameters to enable the gathering of infrequent
categories are `min_frequency` and `max_levels`.
please indicate that the infrequent categories form a new column
doc/modules/preprocessing.rst
Outdated
1. `min_frequency` can be a integer greater or equal to one or a float in
`(0.0, 1.0)`. If `min_frequency` is an integer, categories with a cardinality
smaller than this value will be considered infrequent. If `min_frequency` is an
a float
sklearn/preprocessing/_encoders.py
Outdated
.. versionadded:: 0.23

max_levels : int, default=None
    Specifies the categories to be considered infrequent. Sets an upper
Doesn't max_levels specify the number of frequent categories instead?
In this implementation it specifies the total number of categories (including the infrequent). I have updated the docs to reflect this.
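To make the semantics discussed in this thread concrete, here is a minimal pure-Python sketch of how `min_frequency` and `max_levels` could jointly pick out infrequent categories for one feature. `identify_infrequent` is a hypothetical helper written for illustration, not the PR's `_compute_infrequent_categories`, and breaking ties by category name is an assumption.

```python
from collections import Counter


def identify_infrequent(values, min_frequency=1, max_levels=None):
    """Return (frequent, infrequent) category lists for one feature.

    Hypothetical helper for illustration; not scikit-learn code.
    """
    counts = Counter(values)
    n_samples = len(values)

    # A float min_frequency is read as a fraction of the number of samples.
    if isinstance(min_frequency, float):
        threshold = min_frequency * n_samples
    else:
        threshold = min_frequency
    infrequent = {cat for cat, count in counts.items() if count < threshold}

    # max_levels caps the total number of output columns, counting the
    # single combined infrequent column as one of them.
    if max_levels is not None and len(counts) > max_levels:
        # Most frequent first; ties broken by category name (an assumption).
        ranked = sorted(counts, key=lambda cat: (-counts[cat], cat))
        infrequent.update(ranked[max_levels - 1:])

    frequent = sorted(cat for cat in counts if cat not in infrequent)
    return frequent, sorted(infrequent)
```

With counts of cat=20, rabbit=10, dog=5 and snake=3, `min_frequency=6` marks dog and snake as infrequent, while `max_levels=2` on its own keeps only cat and folds the other three into the infrequent column.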
X_inv = ohe.inverse_transform(X_trans)
assert_array_equal(expected_inv, X_inv)

# The most frequent infrequent category becomes the feature name
is this documented too?
It is now in the user guide.
this is no longer the correct comment
Ok, this is not supported ATM (As Joel suggested, please document this). I am however concerned about potential future complexity. Supporting interactions with |
Currently, if
|
Sorry I wasn't clear, I was more concerned about the complexity of the code, from a maintainability point of view |
Implementation-wise, I do not see an issue with including the drop support. I would adjust |
I think the things left to discuss for this PR is:
|
I don't like it for get_feature_names. Don't mind for inverse_transform |
Updated PR with the following changes:
Remaining issue around dropping infrequent categories
The remaining issue is how to handle dropping infrequent categories when "drop" is explicitly passed in as an array of strings. The current behavior is that if the category is infrequent, then the whole infrequent category is dropped. Here is the discussion.
Some more ideas:
I still prefer option 3 (current behavior) as the best option, because it does not add more API and the semantics can be described in one sentence as we do in the docstring:
|
Hrm the discussion link doesn't work for me. Thanks GitHub. Oh wait, that's us... I agree we shouldn't add more interface and more complex behavior, but just erroring also can be described very quickly and is less surprising to me. We can always decide not to error later, but moving from not erroring to erroring is definitely a deprecation cycle at least. In other words, if we decide on option 1, all the other options are still available after we merged and we can see if anyone asks for it. |
Updated PR to error when explicitly dropping a category that is infrequent (using an array-like for |
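The "error on explicit drop of an infrequent category" option agreed on above can be sketched for a single feature as follows. This is only an illustration of the rule: `check_drop` and its error message are hypothetical and do not come from the PR.

```python
def check_drop(drop, infrequent):
    """Reject an explicit drop list that names an infrequent category.

    Hypothetical single-feature helper; not scikit-learn code.
    """
    clashes = [cat for cat in drop if cat in infrequent]
    if clashes:
        # Erroring now leaves room to relax the behaviour later
        # without a deprecation cycle.
        raise ValueError(
            f"cannot drop infrequent categories explicitly: {clashes}"
        )
    return list(drop)
```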
Looks good! @jnothman is your approval still valid lol? Or @adrinjalali did you want to have a look?
Reading your summary of changes, I'm happy to keep my approval :)
Congrats for getting this through, @thomasjpfan !!
Nice!!!!!
OMG this is exciting!!!
Reference Issues/PRs
Resolves #12153
Supersedes #13833
Closes #19286
What does this implement/fix? Explain your changes.
This PR adds the infrequent categories to OneHotEncoder by:
- Adding return_counts to _encode to return the counts of the categories.
- Updating _BaseEncoder._fit and _BaseEncoder._transform to accept callables, which can adjust the processing of categories and unknown categories.
- Deprecating 'ignore' in favor of 'auto'. 'auto' behaves like 'ignore' when there are no infrequent categories in training. If there are infrequent categories in training, then unknown categories are considered infrequent.
Update
As of ef86eb1 (#16018), the 'ignore' option remains and an infrequent_if_exist option was added to handle_unknown. infrequent_if_exist behaves just like 'ignore' when there are no infrequent categories for a feature. If a feature does have infrequent categories, then unknown categories will be mapped to the infrequent category.
#### Why deprecated 'ignore'?
The motivation to deprecate 'ignore' is connected to how it interacts with infrequent categories. In the following example, feature 0 has infrequent categories and feature 1 does not. This means that feature 0 will map the unknown category to be infrequent, while feature 1 "ignores" the unknown category:
If we were to keep 'ignore', we would run into the following behavior:
In this case, all unknown categories are mapped to zero, regardless of there being an infrequent category for the feature. A while ago @amueller and I discussed this and concluded that this behavior is not desirable.
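The two unknown-category behaviours described above can be contrasted with a small pure-Python sketch. It is not the PR's code: `encode_one` is a hypothetical helper that encodes one value of one feature, and placing the combined infrequent column last is an assumption made for the illustration.

```python
def encode_one(value, frequent, infrequent, handle_unknown="ignore"):
    """One-hot encode a single value for one feature (illustrative only).

    Columns are the frequent categories in the given order, followed by
    one combined infrequent column when the feature has infrequent
    categories.
    """
    has_infrequent = len(infrequent) > 0
    row = [0] * (len(frequent) + int(has_infrequent))
    if value in frequent:
        row[frequent.index(value)] = 1
    elif value in infrequent:
        # Known but infrequent: always goes to the combined column.
        row[-1] = 1
    elif has_infrequent and handle_unknown == "infrequent_if_exist":
        # Unknown value, and the feature has an infrequent column.
        row[-1] = 1
    # Otherwise the value is unknown and the row stays all zeros.
    return row
```

For a feature with frequent categories cat and rabbit and infrequent categories dog and snake, an unseen value such as lizard lands in the infrequent column under `"infrequent_if_exist"` and yields an all-zero row under `"ignore"`. For a feature with no infrequent categories, both settings give an all-zero row.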