ENH Adds infrequent categories to OneHotEncoder #16018

Merged
merged 125 commits into from
Mar 14, 2022

@thomasjpfan (Member) commented Jan 3, 2020

Reference Issues/PRs

Resolves #12153
Supersedes #13833
Closes #19286

What does this implement/fix? Explain your changes.

This PR adds infrequent categories to OneHotEncoder by:

  1. Adding return_counts to _encode to return the counts of the categories.
  2. Adjusting _BaseEncoder._fit and _BaseEncoder._transform to accept callables, which can adjust the processing of categories and unknown categories.
  3. Deprecating 'ignore' in favor of 'auto'. 'auto' behaves like 'ignore' when there are no infrequent categories in training. If there are infrequent categories in training, then unknown categories are considered infrequent.
  4. Using the infrequent category with the highest cardinality for the inverse transform and feature names.
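
As a rough illustration of items 1 and 4, the sketch below uses only the standard library. `unique_with_counts` and `representative_infrequent` are hypothetical helper names, not functions from this PR, and the tie-breaking rule is an assumption of the sketch.

```python
from collections import Counter


def unique_with_counts(values):
    """Return the sorted unique categories and their counts (hypothetical helper)."""
    counts = Counter(values)
    uniques = sorted(counts)
    return uniques, [counts[u] for u in uniques]


def representative_infrequent(uniques, counts, min_frequency):
    """Pick the infrequent category with the highest count, or None if there is none."""
    infrequent = [(c, u) for u, c in zip(uniques, counts) if c < min_frequency]
    if not infrequent:
        return None
    best_count = max(c for c, _ in infrequent)
    # Assumption: ties go to the first category in sorted order.
    return next(u for c, u in infrequent if c == best_count)


col = ["a"] * 5 + ["b"] * 2 + ["c"] * 3 + ["d"]
uniques, counts = unique_with_counts(col)
print(uniques, counts)
print(representative_infrequent(uniques, counts, min_frequency=4))
```

Here `b`, `c` and `d` fall below the threshold, and `c` has the highest count among them, so it would stand in for the infrequent group.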

Update

As of ef86eb1 (#16018), the 'ignore' option remains and an 'infrequent_if_exist' option was added to handle_unknown. 'infrequent_if_exist' behaves just like 'ignore' when there are no infrequent categories for a feature. If a feature does have infrequent categories, then unknown categories will be mapped to the infrequent category.

#### Why deprecate 'ignore'?

The motivation to deprecate 'ignore' is connected to how it interacts with infrequent categories. In the following example, feature 0 has infrequent categories and feature 1 does not. This means that feature 0 will map the unknown category to be infrequent, while feature 1 "ignores" the unknown category:

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

X = pd.DataFrame({
    "col1": ['a'] * 30 + ['b'] * 15 + ['c', 'd', 'e', 'f', 'g'],
    "col2": ['one'] * 25 + ['two'] * 25
})

ohe = OneHotEncoder(sparse=False, min_frequency=2, handle_unknown='auto').fit(X)
ohe.transform([['g', 'three']])
# array([[0., 0., 1., 0., 0.]])

If we were to keep 'ignore', we would run into the following behavior:

# This errors right now
ohe = OneHotEncoder(sparse=False, min_frequency=2, handle_unknown='ignore').fit(X)
ohe.transform([['g', 'three']])
# array([[0., 0., 0., 0., 0.]])

In this case, all unknown categories are mapped to zero, regardless of whether there is an infrequent category for the feature. A while ago @amueller and I discussed this and concluded that this behavior is not desirable.
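
The two behaviors can be compared with a small pure-Python sketch that does not depend on scikit-learn. `fit_feature` and `transform_row` are hypothetical helpers, and the flag `unknown_to_infrequent` stands in for the handle_unknown mode that sends unknown values to the infrequent column. The sketch uses `'z'`, a value absent from the training data, as the unknown category (`'g'` appears once in `col1`, so it is infrequent rather than unknown).

```python
from collections import Counter


def fit_feature(values, min_frequency):
    """Split one column's categories into frequent and infrequent ones."""
    counts = Counter(values)
    frequent = sorted(c for c, n in counts.items() if n >= min_frequency)
    infrequent = {c for c, n in counts.items() if n < min_frequency}
    return frequent, infrequent


def transform_row(row, fitted, unknown_to_infrequent):
    """Encode one row: frequent columns first, then an optional infrequent column."""
    out = []
    for value, (frequent, infrequent) in zip(row, fitted):
        block = [0.0] * (len(frequent) + (1 if infrequent else 0))
        if value in frequent:
            block[frequent.index(value)] = 1.0
        elif value in infrequent or (unknown_to_infrequent and infrequent):
            # Known infrequent values always land in the last column;
            # unknown values only do so when the flag is set.
            block[-1] = 1.0
        out.extend(block)
    return out


col1 = ["a"] * 30 + ["b"] * 15 + ["c", "d", "e", "f", "g"]
col2 = ["one"] * 25 + ["two"] * 25
fitted = [fit_feature(col1, 2), fit_feature(col2, 2)]

print(transform_row(["z", "three"], fitted, unknown_to_infrequent=True))
print(transform_row(["z", "three"], fitted, unknown_to_infrequent=False))
```

With the flag set, the unknown value in the first feature lands in that feature's infrequent column while the second feature stays all zeros; with the flag cleared, both features are all zeros.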

@jnothman (Member) left a comment

Partial review. Nicely thought out, @thomasjpfan! I'm not sure about the inverse_transform behaviour, but I don't think it is so fundamentally important.

Infrequent categories
---------------------

:class:`OneHotEncoder` supports creating a category for infrequent categories
Member

"creating a category" -> "outputting feature that conflates"???

in the training data. The parameters to enable the gathering of infrequent
categories are `min_frequency` and `max_levels`.

1. `min_frequency` can be a integer greater or equal to one or a float in
Member

Suggested change
- 1. `min_frequency` can be a integer greater or equal to one or a float in
+ 1. `min_frequency` can be a integer greater or equal to one, or a float in

maybe one -> 1 too.

total number of samples will be considered infrequent.

2. `max_levels` can be `None` or any integer greater than one. This parameter
sets an upper limit of the number of categories including the infrequent
Member

"number of categories including the infrequent category" -> "output features for each input feature"??

sets an upper limit of the number of categories including the infrequent
category.
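
A minimal sketch of the `min_frequency` rule from item 1, assuming an integer is read as an absolute count and a float in `(0.0, 1.0)` as a fraction of the total number of samples. The helper name and the animal data are made up for illustration.

```python
from collections import Counter


def infrequent_by_min_frequency(values, min_frequency):
    """Return the set of categories whose count falls below the threshold."""
    counts = Counter(values)
    if isinstance(min_frequency, float):
        # Fraction of the total number of samples.
        threshold = min_frequency * len(values)
    else:
        # Absolute count.
        threshold = min_frequency
    return {cat for cat, n in counts.items() if n < threshold}


animals = ["dog"] * 5 + ["cat"] * 20 + ["rabbit"] * 10 + ["snake"] * 3
print(sorted(infrequent_by_min_frequency(animals, 6)))
print(sorted(infrequent_by_min_frequency(animals, 0.2)))
```

Both calls flag `dog` and `snake`: their counts (5 and 3) are below 6, and also below 0.2 * 38 = 7.6.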

These parameters can be used together to filter out infrequent categories. In
Member

The following example does not use these parameters together, so this sentence seems a bit out of place.

encountered in transform:

1. If there was no infrequent category during training, the resulting
one-hot encoded columns for this feature will be be all zeros. In
Member

Should this be indented?

Member

yes, doc renders poorly. Same for the other parameter.

'ignore' is deprecated in favor of 'auto'

min_frequency : int or float, default=1
Specifics the categories to be considered infrequent.
Member

Suggested change
- Specifics the categories to be considered infrequent.
+ Specifies the categories to be considered infrequent.


.. versionadded:: 0.23

max_levels : int, default=None
Member

Not sure that we should adopt the term "levels". I don't know of it being used elsewhere in scikit-learn. What else could we call it? Output features? Categories??

Member

max_output_features might become confusing when used together with drop

Member Author

Currently there is no drop support even for handle_unknown='error'. If we want to have drop support, I would prefer to consider it in a follow-up PR.

Side note: Even with drop support, this places an upper limit to the number of output features.


max_levels : int, default=None
Specifies the categories to be considered infrequent. Sets an upper
limit to the number of categories including the infrequent category.
Member

"number of categories" is confusing. I suggested an alternative wording above.

# validates infrequent category features
if self.drop is not None and self._infrequent_enabled:
raise ValueError("infrequent categories are not supported when "
"drop is specified")
Member

This does not appear to be documented

Member Author

This is now documented in the docstring for drop.

return output if output.size > 0 else None

def _compute_infrequent_categories(self, category_counts, n_samples):
"""Compute infrequent categories.
Member

The difference between this and the previous function is hard to discern from their name and summary. Maybe _fit_infrequent_category_mapping here, which implies that the result is stored rather than returned.

The above could be _identify_infrequent

assert_array_equal(['x0_b', 'x0_c', 'x0_a'], feature_names)


def test_ohe_infrequent_two_levels_user_cats():
Member

I would prefer the relationship between the behaviour for user-specified cats and auto cats to be a bit clearer here. At the moment this tests the behaviour independently, rather than asserting that some relationship holds between the two behaviours. Is there a nice way to test that relationship instead?

Member Author

All the other test cases use 'auto' and check that infrequent_indices_ respects the lexicographic ordering from 'auto'.

The biggest difference is that user-provided cats may not have the same order as 'auto'. What relationship do you see between them?

def _map_to_infrequent_categories(self, X_int):
"""Map categories to infrequent categories.

Note this will replace the encoding in X_int
Member

"This modifies X_int in-place"

if not self._infrequent_enabled:
return

for col_idx, mapping in enumerate(
Member

I'm happier if col_idx is called i and this fits on one line...
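
For readers following the excerpt above, here is a rough pure-Python sketch of what an in-place remapping of integer codes could look like. It uses nested lists instead of NumPy arrays, and the `mappings` structure is an assumption for illustration, not the PR's actual data structure.

```python
def map_to_infrequent_inplace(X_int, mappings):
    """Remap integer codes column by column, modifying X_int in place.

    mappings[i] is None when column i has no infrequent categories;
    otherwise mappings[i][old_code] gives the new code.
    """
    for i, mapping in enumerate(mappings):
        if mapping is None:
            continue
        for row in X_int:
            row[i] = mapping[row[i]]


# Column 0 has four categories; codes 1 and 3 are infrequent and are
# both sent to the trailing code 2. Column 1 is left untouched.
X_int = [[0, 0], [1, 1], [2, 0], [3, 1]]
map_to_infrequent_inplace(X_int, [[0, 2, 1, 2], None])
print(X_int)
```

The function returns nothing; the caller observes the change through `X_int` itself, which is the point of the "in-place" wording suggested in the review.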

"""Process the valid mask during `_transform`

This function is passed to `_transform` to adjust the mask depending
on if the infrequent column exist or not.
Member

exist -> exists


This function is passed to `_transform` to adjust the mask depending
on if the infrequent column exist or not.
"""
Member

I think it would be helpful to the reader if you documented Parameters and Returns here

Member Author

The Parameters and Returns were added.

return np.ones_like(valid_mask, dtype=bool)

def _compute_transformed_category(self, i):
"""Compute the transformed category used for column `i`.
Member

I think this computes categories for column i, not category for column i.

if infrequent_idx is None:
return valid_mask

# infrequent column exist
Member

exist -> exists

@NicolasHug (Member) left a comment

Brief review. Please describe the tests.

What is the interaction with drop?



def test_encode_util_uniques_unordered():
# The return counts are ordered based on the order of uniques
Member

Suggested change
- # The return counts are ordered based on the order of uniques
+ # Make sure the returned counts are ordered based on the order of uniques

Comment on lines 603 to 606
>>> enc.transform([['dog']]).toarray()
array([[0., 0., 1.]])
>>> enc.transform([['rabbit']]).toarray()
array([[0., 1., 0.]])
Member

maybe also display the rest of the categories, to be exhaustive and for clarity

category.

These parameters can be used together to filter out infrequent categories. In
the following example, the categories, `'dog', 'cat'`, are considered infrequent::
Member

snake instead of cat?


:class:`OneHotEncoder` supports creating a category for infrequent categories
in the training data. The parameters to enable the gathering of infrequent
categories are `min_frequency` and `max_levels`.
Member

please indicate that the infrequent categories form a new column


1. `min_frequency` can be a integer greater or equal to one or a float in
`(0.0, 1.0)`. If `min_frequency` is an integer, categories with a cardinality
smaller than this value will be considered infrequent. If `min_frequency` is an
Member

a float

.. versionadded:: 0.23

max_levels : int, default=None
Specifies the categories to be considered infrequent. Sets an upper
Member

doesn't maxlevels specify the number of frequent categories instead?

Member Author

In this implementation it specifies the total number of categories (including the infrequent). I have updated the docs to reflect this.
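
A minimal sketch of that reading, where the cap counts the infrequent bucket as one of the output categories. The parameter is called `max_categories` in the sketch (the name used later in this thread), the helper is hypothetical, and the alphabetical tie-break is an assumption rather than documented behavior.

```python
from collections import Counter


def split_by_max_categories(values, max_categories):
    """Keep at most `max_categories` output categories, infrequent bucket included."""
    counts = Counter(values)
    if max_categories is None or len(counts) <= max_categories:
        return sorted(counts), []
    # Reserve one slot for the infrequent bucket.
    n_frequent = max_categories - 1
    # Rank by descending count; ties fall back to alphabetical order
    # (an assumption of this sketch).
    ranked = sorted(counts, key=lambda c: (-counts[c], c))
    return sorted(ranked[:n_frequent]), sorted(ranked[n_frequent:])


animals = ["dog"] * 5 + ["cat"] * 20 + ["rabbit"] * 10 + ["snake"] * 3
frequent, infrequent = split_by_max_categories(animals, 3)
print(frequent, infrequent)
```

With a cap of 3, two categories stay frequent and the remaining two share the third, infrequent slot.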

X_inv = ohe.inverse_transform(X_trans)
assert_array_equal(expected_inv, X_inv)

# The most frequent infrequent category becomes the feature name
Member

is this documented too?

Member Author

It is now in the user guide.

Member

this is no longer the correct comment

@NicolasHug
Member

> What is the interaction with drop?

Ok, this is not supported ATM (as Joel suggested, please document this).

I am however concerned about potential future complexity. Supporting interactions with drop is one of the reasons #13833 was so tricky. @thomasjpfan do you think things will be easier with the implementation proposed in this PR? I haven't looked at the code in detail, so I don't have an informed opinion yet.

@thomasjpfan
Member Author

> I am however concerned about potential future complexity.

Currently, if handle_unknown=='ignore', then drop must be None. Going with this, let's assume that handle_unknown=='error'; then we can try to support drop as follows:

  1. drop='first' will always drop a frequent column, because the infrequent column is always at the end of the encoding.
  2. drop='infrequent': the last column will be dropped (the infrequent column is always at the end of the encoding).
  3. drop is a list: if the dropped category is frequent, drop it. If the dropped category is infrequent AND it's the only infrequent category, drop it; otherwise keep the infrequent category.
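
The three rules proposed above can be sketched for a single feature as follows. This reflects the proposal in this comment only, not merged behavior. `columns_after_drop` and the `"<infrequent>"` label are hypothetical, `drop` is simplified to one category name instead of a list, and the sketch assumes at least one frequent category exists.

```python
def columns_after_drop(frequent, infrequent, drop):
    """Return the output column labels left after applying a proposed drop rule."""
    columns = list(frequent) + (["<infrequent>"] if infrequent else [])
    if drop == "first":
        # Rule 1: the first column is frequent because the infrequent
        # column sits at the end of the encoding.
        return columns[1:]
    if drop == "infrequent":
        # Rule 2: drop the trailing infrequent column, if there is one.
        return columns[:-1] if infrequent else columns
    # Rule 3: `drop` names a single category of this feature.
    if drop in frequent:
        return [c for c in columns if c != drop]
    if drop in infrequent and len(infrequent) == 1:
        return columns[:-1]
    return columns


print(columns_after_drop(["a", "b"], ["c", "d"], "first"))
print(columns_after_drop(["a", "b"], ["c", "d"], "infrequent"))
print(columns_after_drop(["a", "b"], ["c", "d"], "c"))
print(columns_after_drop(["a", "b"], ["c"], "c"))
```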

@NicolasHug
Member

Sorry I wasn't clear, I was more concerned about the complexity of the code, from a maintainability point of view.

See e.g. https://github.com/scikit-learn/scikit-learn/pull/13833/files#diff-d12408664448c94dbd880579e1b2e4d9R452

@thomasjpfan
Member Author

Implementation-wise, I do not see an issue with including drop support. I would adjust _compute_drop_idx to take the infrequent categories into account (if they exist).

@thomasjpfan
Member Author

I think the things left to discuss for this PR are:

  1. Is max_levels a good name?
  2. Is using the most frequent infrequent category a good default for get_feature_names and inverse_transform? I thought of using a string such as "infrequent" or, if that exists, a name-mangled string: "infrequent_sklearn". The issue with this is that if the categories were numerical, the output of inverse_transform would become object dtype.

@jnothman
Member

jnothman commented Feb 9, 2020

> Is using the most frequent infrequent category a good default for get_feature_names and inverse_transform? I thought of using a string such as "infrequent" or, if that exists, a name-mangled string: "infrequent_sklearn". The issue with this is that if the categories were numerical, the output of inverse_transform would become object dtype.

I don't like it for get_feature_names. Don't mind for inverse_transform.

@thomasjpfan
Member Author

thomasjpfan commented Mar 3, 2022

Updated PR with the following changes:

  1. infrequent_indices_ is now private; infrequent_categories_ already returns the infrequent categories.

  2. min_frequency default is None.

  3. max_categories=1 is now supported. If all categories are infrequent, OneHotEncoder returns a constant feature. This is the same behavior as passing in a constant categorical feature.

  4. Documented the behavior when there are ties at the max_categories cutoff.
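
The equivalence described in item 3 can be illustrated with a small pure-Python sketch. `one_hot_column` is a hypothetical helper, not the PR's code: values outside `frequent` share one trailing infrequent column.

```python
def one_hot_column(values, frequent):
    """Encode one column; values outside `frequent` share a trailing infrequent column."""
    has_infrequent = any(v not in frequent for v in values)
    width = len(frequent) + (1 if has_infrequent else 0)
    rows = []
    for v in values:
        row = [0.0] * width
        idx = frequent.index(v) if v in frequent else width - 1
        row[idx] = 1.0
        rows.append(row)
    return rows


# max_categories=1 leaves no slot for frequent categories, so every
# category falls into the single infrequent column.
all_infrequent = one_hot_column(["a", "b", "a", "c"], frequent=[])
# A constant categorical feature yields the same single column of ones.
constant = one_hot_column(["x", "x", "x", "x"], frequent=["x"])
print(all_infrequent == constant)
```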

Remaining issue around dropping infrequent categories

The remaining issue is how to handle dropping infrequent categories when "drop" is explicitly passed in as an array of strings. The current behavior is that if the category is infrequent, then the whole infrequent category is dropped. Here is the discussion. Some more ideas:

  1. Error when trying to drop a category that is infrequent.
  2. Allow "infrequent_sklearn" to be a placeholder in an array-like "drop". ["a", "b", "infrequent_sklearn"] means drop the "infrequent_sklearn" feature in feature 2. The downside is the case where feature 2 has no infrequent category: do we error then?
  3. Keep the current behavior of dropping the infrequent category if the category in "drop" is infrequent.
  4. drop="infrequent_if_exist" does not work, because what happens if a feature has no infrequent categories and the user still wants to drop?
  5. infrequent_drop=(True, False), which only applies to features with infrequent categories, basically overriding the default behavior of drop.

I still prefer option 3 (current behavior) as the best option, because it does not add more API and the semantics can be described in one sentence as we do in the docstring:

> If there are infrequent categories and drop selects any of the infrequent categories, then the category representing infrequent categories is dropped.

@amueller
Member

amueller commented Mar 4, 2022

> Error when trying to drop a category that is infrequent.

Hrm the discussion link doesn't work for me. Thanks GitHub. Oh wait, that's us...
I think erroring is best. We can add the placeholder thing later if someone asks for it? I would just like to get anything in and that seems like a super fringe case and I don't think it's that meaningful from a statistical perspective either?

I agree we shouldn't add more interface and more complex behavior, but just erroring also can be described very quickly and is less surprising to me. We can always decide not to error later, but moving from not erroring to erroring is definitely a deprecation cycle at least. In other words, if we decide on option 1, all the other options are still available after we merge and we can see if anyone asks for it.

@thomasjpfan
Member Author

thomasjpfan commented Mar 4, 2022

Updated PR to error when explicitly dropping a category that is infrequent (using an array-like for drop).
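
A rough sketch of such a check for one feature, assuming the frequent and infrequent categories are already known. `validate_drop` and both error messages are made up for illustration and are not the messages raised by scikit-learn.

```python
def validate_drop(drop_category, frequent, infrequent):
    """Return the column index to drop, or raise if the category is infrequent."""
    if drop_category in infrequent:
        raise ValueError(
            f"Cannot drop {drop_category!r}: it is an infrequent category"
        )
    if drop_category not in frequent:
        raise ValueError(f"{drop_category!r} is not a known category")
    return frequent.index(drop_category)


print(validate_drop("b", frequent=["a", "b"], infrequent={"c", "d"}))
try:
    validate_drop("c", frequent=["a", "b"], infrequent={"c", "d"})
except ValueError as exc:
    print(exc)
```

Erroring keeps the other options from the list above open, since relaxing an error later does not require a deprecation cycle.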

@amueller (Member) left a comment

Looks good! @jnothman is your approval still valid lol? Or @adrinjalali did you want to have a look?

@jnothman (Member) left a comment

Reading your summary of changes, I'm happy to keep my approval :)

@adrinjalali adrinjalali merged commit 7f0006c into scikit-learn:main Mar 14, 2022
@jnothman
Member

Congrats for getting this through, @thomasjpfan !!

@glemaitre
Member

Nice!!!!!

@amueller
Member

OMG this is exciting!!!

Successfully merging this pull request may close these issues:

Have handle_unknown="ignore" by default in OneHotEncoder
Add "other" / min_frequency option to OneHotEncoder