ENH Adds infrequent categories to OneHotEncoder #16018

thomasjpfan · 2020-01-03T19:30:04Z

Reference Issues/PRs

Resolves #12153
Supersedes #13833
Closes #19286

What does this implement/fix? Explain your changes.

This PR adds infrequent categories to OneHotEncoder by:

Adding return_counts to _encode to return the counts of the categories.
Adjusting _BaseEncoder._fit and _BaseEncoder._transform to accept callables, which can adjust the processing of categories and unknown categories.
Deprecating 'ignore' in favor of 'auto'. 'auto' behaves like 'ignore' when there are no infrequent categories in training. If there are infrequent categories in training, then unknown categories are considered infrequent.
Using the infrequent category with the highest cardinality for the inverse transform and feature names.
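
To make the first point concrete, here is a minimal plain-Python sketch of the idea behind return_counts. The helper name `encode_with_counts` is hypothetical and is not the private `_encode` function; it only shows the information (sorted unique categories plus their counts) that the infrequent-category logic needs.

```python
from collections import Counter


def encode_with_counts(values):
    """Return the sorted unique categories and how often each one occurs.

    Hypothetical stand-in for the `return_counts` idea; not scikit-learn code.
    """
    counter = Counter(values)
    uniques = sorted(counter)
    counts = [counter[u] for u in uniques]
    return uniques, counts


# Same shape of data as "col1" in the example further down.
uniques, counts = encode_with_counts(["a"] * 30 + ["b"] * 15 + list("cdefg"))
print(uniques)
print(counts)
```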

Update

As of ef86eb1 (#16018), the 'ignore' option remains and an 'infrequent_if_exist' option was added to handle_unknown. 'infrequent_if_exist' behaves just like 'ignore' when there are no infrequent categories for a feature. If a feature does have infrequent categories, then unknown categories will be mapped to the infrequent category.
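
The rule in this update can be sketched for a single feature in plain Python. This is only an illustration under assumptions: `encode_feature` is a hypothetical helper, the infrequent column is assumed to be the last one, and categories that were seen in training but are infrequent are assumed to map to that column for every handle_unknown setting.

```python
def encode_feature(value, frequent, infrequent, handle_unknown):
    """One-hot encode one value of one feature (illustrative sketch only).

    `frequent` and `infrequent` are the categories seen during fit. When
    `infrequent` is non-empty, one extra trailing column represents all
    infrequent categories together.
    """
    has_infrequent = bool(infrequent)
    width = len(frequent) + (1 if has_infrequent else 0)
    row = [0.0] * width
    if value in frequent:
        row[frequent.index(value)] = 1.0
    elif value in infrequent:
        row[-1] = 1.0
    elif handle_unknown == "infrequent_if_exist":
        if has_infrequent:
            # Unknown value is folded into the infrequent column.
            row[-1] = 1.0
        # Without an infrequent column the row stays all zeros, like 'ignore'.
    elif handle_unknown != "ignore":
        raise ValueError(f"Found unknown category {value!r}")
    return row


# Feature 0 has infrequent categories, feature 1 does not.
feature_0 = (["a", "b"], ["c", "d", "e", "f", "g"])
feature_1 = (["one", "two"], [])
for option in ("infrequent_if_exist", "ignore"):
    encoded = (encode_feature("z", *feature_0, option)
               + encode_feature("three", *feature_1, option))
    print(option, encoded)
```

With 'infrequent_if_exist' the unknown value in feature 0 lands in the infrequent column, while feature 1 falls back to all zeros because it has no infrequent column.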

~~#### Why deprecated 'ignore'?~~

The motivation to deprecate 'ignore' is connected to how it interacts with infrequent categories. In the following example, feature 0 has infrequent categories and feature 1 does not. This means that feature 0 will map the unknown category to be infrequent, while feature 1 "ignores" the unknown category:

    from sklearn.preprocessing import OneHotEncoder
    import pandas as pd

    X = pd.DataFrame({
        "col1": ['a'] * 30 + ['b'] * 15 + ['c', 'd', 'e', 'f', 'g'],
        "col2": ['one'] * 25 + ['two'] * 25
    })
    ohe = OneHotEncoder(sparse=False, min_frequency=2, handle_unknown='auto').fit(X)
    ohe.transform([['g', 'three']])
    # array([[0., 0., 1., 0., 0.]])

If we were to keep 'ignore', we would run into the following behavior:

    # This errors right now
    ohe = OneHotEncoder(sparse=False, min_frequency=2, handle_unknown='ignore').fit(X)
    ohe.transform([['g', 'three']])
    # array([[0., 0., 0., 0., 0.]])

In this case, all unknown categories are mapped to zero, regardless of there being an infrequent category for the feature. A while ago @amueller and I discussed this and concluded that this behavior is not desirable.

jnothman

Partial review. Nicely thought out, @thomasjpfan! I'm not sure about the inverse_transform behaviour, but I don't think it is so fundamentally important.

jnothman · 2020-01-20T22:19:14Z

doc/modules/preprocessing.rst

+Infrequent categories
+---------------------
+
+:class:`OneHotEncoder` supports creating a category for infrequent categories


"creating a category" -> "outputting feature that conflates"???

jnothman · 2020-01-20T22:19:33Z

doc/modules/preprocessing.rst

+in the training data. The parameters to enable the gathering of infrequent
+categories are `min_frequency` and `max_levels`.
+
+1. `min_frequency` can be a integer greater or equal to one or a float in


Suggested change

1. `min_frequency` can be a integer greater or equal to one or a float in

1. `min_frequency` can be a integer greater or equal to one, or a float in

maybe one -> 1 too.
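
For readers of this thread, the rule the documentation under review describes can be sketched as follows. This is a hedged illustration, not the encoder's code: `infrequent_by_min_frequency` is a hypothetical name, and the float case assumes the threshold is `min_frequency * n_samples`, following the "total number of samples" wording in the docs.

```python
def infrequent_by_min_frequency(counts, min_frequency):
    """Return the categories whose count falls below the threshold.

    `counts` maps each category to its number of occurrences. An integer
    `min_frequency` is an absolute count; a float in (0.0, 1.0) is taken as
    a fraction of the total number of samples (assumption based on the docs).
    """
    n_samples = sum(counts.values())
    if isinstance(min_frequency, float):
        threshold = min_frequency * n_samples
    else:
        threshold = min_frequency
    return sorted(cat for cat, count in counts.items() if count < threshold)


counts = {"a": 30, "b": 15, "c": 1, "d": 1, "e": 1, "f": 1, "g": 1}
print(infrequent_by_min_frequency(counts, 2))    # absolute threshold
print(infrequent_by_min_frequency(counts, 0.5))  # half of the 50 samples
```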

jnothman · 2020-01-20T22:21:05Z

doc/modules/preprocessing.rst

+total number of samples will be considered infrequent.
+
+2. `max_levels` can be `None` or any integer greater than one. This parameter
+sets an upper limit of the number of categories including the infrequent


"number of categories including the infrequent category" -> "output features for each input feature"??

jnothman · 2020-01-20T22:21:59Z

doc/modules/preprocessing.rst

+sets an upper limit of the number of categories including the infrequent
+category.
+
+These parameters can be used together to filter out infrequent categories. In


The following example does not use these parameters together, so this sentence seems a bit out of place.

jnothman · 2020-01-20T22:29:02Z

sklearn/preprocessing/_encoders.py

+        encountered in transform:
+
+        1. If there was no infrequent category during training, the resulting
+        one-hot encoded columns for this feature will be be all zeros. In


Should this be indented?

yes, doc renders poorly. Same for the other parameter.

jnothman · 2020-01-20T22:31:27Z

sklearn/preprocessing/_encoders.py

+            'ignore' is deprecated in favor of 'auto'
+
+    min_frequency : int or float, default=1
+        Specifics the categories to be considered infrequent.


Suggested change

Specifics the categories to be considered infrequent.

Specifies the categories to be considered infrequent.

jnothman · 2020-01-20T22:34:06Z

sklearn/preprocessing/_encoders.py

+
+        .. versionadded:: 0.23
+
+    max_levels : int, default=None


Not sure that we should adopt the term "levels". I don't know of it being used elsewhere in scikit-learn. What else could we call it? Output features? Categories??

max_output_features might become confusing when used together with drop

Currently there is no drop support even for handle_unknown='error'. If we want to have drop support, I would prefer to consider it in a follow-up PR.

Side note: Even with drop support, this places an upper limit to the number of output features.

jnothman · 2020-01-20T22:35:39Z

sklearn/preprocessing/_encoders.py

+
+    max_levels : int, default=None
+        Specifies the categories to be considered infrequent. Sets an upper
+        limit to the number of categories including the infrequent category.


"number of categories" is confusing. I suggested an alternative wording above.

jnothman · 2020-01-20T22:37:53Z

sklearn/preprocessing/_encoders.py

+        # validates infrequent category features
+        if self.drop is not None and self._infrequent_enabled:
+            raise ValueError("infrequent categories are not supported when "
+                             "drop is specified")


This does not appear to be documented

This is now documented in the docstring for drop.

jnothman · 2020-01-20T22:47:41Z

sklearn/preprocessing/_encoders.py

+        return output if output.size > 0 else None
+
+    def _compute_infrequent_categories(self, category_counts, n_samples):
+        """Compute infrequent categories.


The difference between this and the previous function is hard to discern from their name and summary. Maybe _fit_infrequent_category_mapping here, which implies that the result is stored rather than returned.

The above could be _identify_infrequent

jnothman · 2020-01-23T12:26:18Z

sklearn/preprocessing/tests/test_encoders.py

+    assert_array_equal(['x0_b', 'x0_c', 'x0_a'], feature_names)
+
+
+def test_ohe_infrequent_two_levels_user_cats():


I would prefer the relationship between the behaviour for user-specified cats and auto cats to be a bit clearer here. At the moment this tests the behaviour independently, rather than asserting that some relationship holds between the two behaviours. Is there a nice way to test that relationship instead?

All the other test cases use auto and check that infrequent_indices_ respects the lexicon ordering from auto.

The biggest difference is that user-provided cats may not have the same order as 'auto'. What relationship do you see between them?

jnothman · 2020-01-23T12:26:26Z

sklearn/preprocessing/_encoders.py

+    def _map_to_infrequent_categories(self, X_int):
+        """Map categories to infrequent categories.
+
+        Note this will replace the encoding in X_int


"This modifies X_int in-place"

jnothman · 2020-01-23T12:26:29Z

sklearn/preprocessing/_encoders.py

+        if not self._infrequent_enabled:
+            return
+
+        for col_idx, mapping in enumerate(


I'm happier if col_idx is called i and this fits on one line...
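
A rough sketch of what such an in-place mapping loop could look like, using plain lists instead of NumPy arrays. The function name, the shape of `mappings` (one list per column, or None when that column has no infrequent categories), and the sample codes are all assumptions made for illustration; they are not taken from the PR's code.

```python
def map_to_infrequent_in_place(X_int, mappings):
    """Replace integer codes in X_int with their infrequent-aware codes.

    This modifies X_int in-place and returns nothing. `mappings[i][code]` is
    the new code for `code` in column i; None means the column is unchanged.
    """
    for i, mapping in enumerate(mappings):
        if mapping is None:
            continue
        for row in X_int:
            row[i] = mapping[row[i]]


# Column 0: codes 0..6 for 'a'..'g', where codes 2..6 collapse to code 2.
# Column 1: no infrequent categories, so no mapping.
X_int = [[0, 1], [6, 0], [3, 1]]
map_to_infrequent_in_place(X_int, [[0, 1, 2, 2, 2, 2, 2], None])
print(X_int)
```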

jnothman · 2020-01-23T12:26:31Z

sklearn/preprocessing/_encoders.py

+        """Process the valid mask during `_transform`
+
+        This function is passed to `_transform` to adjust the mask depending
+        on if the infrequent column exist or not.


exist -> exists

jnothman · 2020-01-23T12:26:32Z

sklearn/preprocessing/_encoders.py

+
+        This function is passed to `_transform` to adjust the mask depending
+        on if the infrequent column exist or not.
+        """


I think it would be helpful to the reader if you documented Parameters and Returns here

The Parameters and Returns were added.
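
Based only on the fragments visible in the diffs of this review (`if infrequent_idx is None: return valid_mask` and `np.ones_like(valid_mask, dtype=bool)`), the mask adjustment can be sketched with plain lists. The standalone function and its signature are assumptions; the real method is private and may differ.

```python
def process_valid_mask(valid_mask, infrequent_idx):
    """Adjust the mask of known categories for one column (sketch).

    Parameters
    ----------
    valid_mask : list of bool
        True where the value was a category seen during fit.
    infrequent_idx : int or None
        Position of the infrequent column, or None if there is none.

    Returns
    -------
    list of bool
        The unchanged mask when no infrequent column exists; otherwise an
        all-True mask, since unknown values are sent to the infrequent column.
    """
    if infrequent_idx is None:
        return valid_mask
    # Infrequent column exists: every entry now has a valid target column.
    return [True] * len(valid_mask)


print(process_valid_mask([True, False, True], None))
print(process_valid_mask([True, False, True], 2))
```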

jnothman · 2020-01-23T12:26:38Z

sklearn/preprocessing/_encoders.py

+        return np.ones_like(valid_mask, dtype=bool)
+
+    def _compute_transformed_category(self, i):
+        """Compute the transformed category used for column `i`.


I think this computes categories for column i, not category for column i

jnothman · 2020-01-23T12:26:42Z

sklearn/preprocessing/_encoders.py

+        if infrequent_idx is None:
+            return valid_mask
+
+        # infrequent column exist


exist -> exists

NicolasHug

Brief review. Please describe the tests.

What is the interaction with drop?

NicolasHug · 2020-01-23T15:10:50Z

sklearn/preprocessing/tests/test_label.py

+
+
+def test_encode_util_uniques_unordered():
+    # The return counts are ordered based on the order of uniques


Suggested change

# The return counts are ordered based on the order of uniques

# Make sure the returned counts are ordered based on the order of uniques
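
The property this test comment describes can be illustrated without the private helper. In the sketch below, `counts_for_uniques` is a hypothetical function that returns counts aligned with a caller-supplied, possibly unsorted, list of uniques.

```python
from collections import Counter


def counts_for_uniques(values, uniques):
    """Return the count of each entry of `uniques`, in the order given."""
    counter = Counter(values)
    # Counter returns 0 for categories that never occur in `values`.
    return [counter[u] for u in uniques]


values = ["a", "b", "b", "c", "c", "c"]
print(counts_for_uniques(values, ["c", "a", "b"]))
```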

NicolasHug · 2020-01-23T15:15:15Z

doc/modules/preprocessing.rst

+   >>> enc.transform([['dog']]).toarray()
+   array([[0., 0., 1.]])
+   >>> enc.transform([['rabbit']]).toarray()
+   array([[0., 1., 0.]])


maybe also display the rest of the categories, to be exhaustive and for clarity

NicolasHug · 2020-01-23T15:15:27Z

doc/modules/preprocessing.rst

+category.
+
+These parameters can be used together to filter out infrequent categories. In
+the following example, the categories, `'dog', 'cat'`, are considered infrequent::


snake instead of cat?

NicolasHug · 2020-01-23T15:16:05Z

doc/modules/preprocessing.rst

+
+:class:`OneHotEncoder` supports creating a category for infrequent categories
+in the training data. The parameters to enable the gathering of infrequent
+categories are `min_frequency` and `max_levels`.


please indicate that the infrequent categories form a new column

NicolasHug · 2020-01-23T15:16:40Z

doc/modules/preprocessing.rst

+
+1. `min_frequency` can be a integer greater or equal to one or a float in
+`(0.0, 1.0)`. If `min_frequency` is an integer, categories with a cardinality
+smaller than this value will be considered infrequent. If `min_frequency` is an


NicolasHug · 2020-01-23T15:28:18Z

sklearn/preprocessing/_encoders.py

+        .. versionadded:: 0.23
+
+    max_levels : int, default=None
+        Specifies the categories to be considered infrequent. Sets an upper


Doesn't max_levels specify the number of frequent categories instead?

In this implementation it specifies the total number of categories (including the infrequent). I have updated the docs to reflect this.
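
A small sketch of that reading of max_levels, where the limit counts the infrequent category as one of the levels. `split_by_max_levels` is a hypothetical helper and the animal counts are made up for illustration; they are not the user guide's data.

```python
def split_by_max_levels(counts, max_levels):
    """Split categories into (frequent, infrequent) under a total-level cap.

    When there are more categories than `max_levels`, the
    `max_levels - 1` most frequent ones are kept and the rest share a single
    infrequent level, so the total number of levels equals `max_levels`.
    """
    if max_levels is None or len(counts) <= max_levels:
        return sorted(counts), []
    ranked = sorted(counts, key=lambda cat: (-counts[cat], cat))
    frequent = sorted(ranked[: max_levels - 1])
    infrequent = sorted(ranked[max_levels - 1:])
    return frequent, infrequent


counts = {"cat": 3, "dog": 5, "rabbit": 20, "snake": 10}
frequent, infrequent = split_by_max_levels(counts, 3)
print(frequent, infrequent)
print(len(frequent) + 1)  # output features for this input feature
```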

NicolasHug · 2020-01-23T15:36:29Z

sklearn/preprocessing/tests/test_encoders.py

+    X_inv = ohe.inverse_transform(X_trans)
+    assert_array_equal(expected_inv, X_inv)
+
+    # The most frequent infrequent category becomes the feature name


is this documented too?

It is now in the user guide.

this is no longer the correct comment

NicolasHug · 2020-01-27T20:11:52Z

What is the interaction with drop?

Ok, this is not supported ATM (As Joel suggested, please document this).

I am however concerned about potential future complexity. Supporting interactions with drop is one of the reasons #13833 was so tricky. @thomasjpfan do you think things will be easier with the implementation proposed in this PR? I haven't looked at the code in detail so I don't have an informed opinion yet.

thomasjpfan · 2020-01-27T20:34:00Z

I am however concerned about potential future complexity.

Currently, if handle_unknown=='ignore', then drop must be None. Going with this, let's assume that handle_unknown=='error'; then we can try to support drop as follows:

drop='first' will always drop a frequent column, because the infrequent column is always at the end of the encoding.
drop='infrequent': the last column will be dropped (the infrequent column is always at the end of the encoding).
drop is a list: if the dropped category is frequent, drop it. If the dropped category is infrequent AND it's the only infrequent category, drop it; otherwise keep the infrequent category.
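
The three rules proposed above can be sketched for a single feature as follows. This illustrates the proposal at this point in the thread only, not what was eventually merged; `resolve_drop_index` is a hypothetical helper, and it assumes at least one frequent category exists when drop='first' is used.

```python
def resolve_drop_index(drop, frequent, infrequent):
    """Return the output column index to drop for one feature, or None.

    `drop` is 'first', 'infrequent', or a one-element list holding the
    category to drop for this feature. The infrequent column, when present,
    is the last column of the encoding.
    """
    infrequent_col = len(frequent) if infrequent else None
    if drop == "first":
        return 0
    if drop == "infrequent":
        return infrequent_col
    (category,) = drop
    if category in frequent:
        return frequent.index(category)
    if category in infrequent:
        # Only drop the infrequent column if this is its sole member.
        return infrequent_col if len(infrequent) == 1 else None
    raise ValueError(f"Category {category!r} was not seen during fit")


print(resolve_drop_index("first", ["a", "b"], ["c", "d"]))
print(resolve_drop_index("infrequent", ["a", "b"], ["c", "d"]))
print(resolve_drop_index(["c"], ["a", "b"], ["c", "d"]))
print(resolve_drop_index(["c"], ["a", "b"], ["c"]))
```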

NicolasHug · 2020-01-27T20:38:42Z

Sorry I wasn't clear, I was more concerned about the complexity of the code, from a maintainability point of view

See e.g. https://github.com/scikit-learn/scikit-learn/pull/13833/files#diff-d12408664448c94dbd880579e1b2e4d9R452

thomasjpfan · 2020-01-28T17:24:14Z

Implementation-wise, I do not see an issue with including the drop support. I would adjust _compute_drop_idx to take the infrequent categories into account (if they exist).

thomasjpfan · 2020-02-07T01:58:54Z

I think the things left to discuss for this PR are:

Is max_levels a good name?
Is using the most frequent infrequent category a good default for get_feature_names and inverse_transform? I thought of using a string such as "infrequent" or, if that exists, a name-mangled str: "infrequent_sklearn". The issue with this is that if the categories were numerical, the output of inverse_transform will become object dtype.
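
The two naming options in the second question can be compared with a short sketch. Both helpers are hypothetical, and breaking count ties by lexicon order in `most_frequent_infrequent` is an assumption.

```python
def most_frequent_infrequent(counts, infrequent):
    """Option 1: represent the infrequent group by its most common member."""
    # sorted() first, so that count ties resolve to the lexicon-first name.
    return max(sorted(infrequent), key=lambda cat: counts[cat])


def infrequent_label(categories):
    """Option 2: a fixed string, mangled if it clashes with a real category."""
    return "infrequent_sklearn" if "infrequent" in categories else "infrequent"


counts = {"a": 30, "b": 15, "c": 3, "d": 2, "e": 1}
print(most_frequent_infrequent(counts, ["c", "d", "e"]))
print(infrequent_label(counts))
print(infrequent_label(["infrequent", "a"]))
```

With numerical categories such as 1, 2, 3, option 2 would mix integers with a string label, which is the object-dtype concern raised above.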

jnothman · 2020-02-09T05:25:59Z

Is using the most frequent infrequent category a good default for get_feature_names and inverse_transform? I thought of using a string such as "infrequent" or, if that exists, a name-mangled str: "infrequent_sklearn". The issue with this is that if the categories were numerical, the output of inverse_transform will become object dtype.

I don't like it for get_feature_names. Don't mind for inverse_transform

thomasjpfan · 2022-03-03T20:05:03Z

Updated PR with the following changes:

infrequent_indices_ is now private, infrequent_categories_ returns the infrequent categories already
min_frequency default is None
max_categories=1 is now supported. If all categories are infrequent, OneHotEncoder returns a constant feature. This is the same behavior as passing in a constant categorical feature.
Document behavior when there are ties at the max_categories cutoff
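
To illustrate the max_categories=1 case and the tie situation, here is a self-contained plain-Python sketch. It is not the encoder's implementation: `one_hot_with_max_categories` is a hypothetical function, and which side of a tie is kept as frequent (here, the category that comes first in lexicon order) is an assumption; the docstring is the reference for the actual rule.

```python
from collections import Counter


def one_hot_with_max_categories(values, max_categories):
    """One-hot encode a single feature with a cap on output columns."""
    counts = Counter(values)
    # Rank by descending count; ties fall back to lexicon order (assumption).
    ranked = sorted(counts, key=lambda cat: (-counts[cat], cat))
    if max_categories < len(ranked):
        frequent = sorted(ranked[: max_categories - 1])
        has_infrequent = True
    else:
        frequent = sorted(ranked)
        has_infrequent = False
    width = len(frequent) + int(has_infrequent)
    rows = []
    for value in values:
        row = [0.0] * width
        # Anything that is not frequent goes to the trailing infrequent column.
        idx = frequent.index(value) if value in frequent else width - 1
        row[idx] = 1.0
        rows.append(row)
    return rows


# All categories are infrequent: the output is one constant column.
print(one_hot_with_max_categories(["a", "b", "b", "c"], 1))
# 'a' and 'b' tie at the cutoff; only one of them can stay frequent.
print(one_hot_with_max_categories(["a", "a", "b", "b", "c"], 2))
```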

Remaining issue around dropping infrequent categories

The remaining issue is how to handle dropping infrequent categories when "drop" is explicitly passed in as an array of strings. The current behavior is that if the category is infrequent, then the whole infrequent category is dropped. Here is the discussion. Some more ideas:

Error when trying to drop a category that is infrequent.
Allow "infrequent_sklearn" to be a placeholder for an array-like "drop". ["a", "b", "infrequent_sklearn"] means drop the "infrequent_sklearn" feature in feature 2. The downside of this is if there is no infrequent category in feature 2. Then do we error?
Current behavior of dropping the infrequent category if the category in "drop" is infrequent.
drop="infrequent_if_exist", does not work because what happens if the feature has no infrequent categories and still wants to drop?
infrequent_drop=(True, False), which only applies to features with infrequent categories. Basically overriding the default behavior of drop.

I still prefer option 3 (current behavior) as the best option, because it does not add more API and the semantics can be described in one sentence as we do in the docstring:

If there are infrequent categories and drop selects any of the infrequent categories, then the category representing infrequent categories is dropped.

amueller · 2022-03-04T17:46:44Z

Error when trying to drop a category that is infrequent.

Hrm the discussion link doesn't work for me. Thanks GitHub. Oh wait, that's us...
I think erroring is best. We can add the placeholder thing later if someone asks for it? I would just like to get anything in and that seems like a super fringe case and I don't think it's that meaningful from a statistical perspective either?

I agree we shouldn't add more interface and more complex behavior, but just erroring also can be described very quickly and is less surprising to me. We can always decide not to error later, but moving from not erroring to erroring is definitely a deprecation cycle at least. In other words, if we decide on option 1, all the other options are still available after we merge and we can see if anyone asks for it.

thomasjpfan · 2022-03-04T18:51:07Z

Updated PR to error when explicitly dropping a category that is infrequent (using an array-like for drop).
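
A minimal sketch of the check described here, for one feature. The helper name and the error messages are hypothetical and do not reproduce the merged code.

```python
def check_explicit_drop(category, frequent, infrequent):
    """Return the column index of `category`, refusing infrequent ones."""
    if category in infrequent:
        raise ValueError(
            f"Cannot drop {category!r}: it is an infrequent category"
        )
    if category not in frequent:
        raise ValueError(f"Category {category!r} was not seen during fit")
    return frequent.index(category)


print(check_explicit_drop("b", ["a", "b"], ["c", "d"]))
try:
    check_explicit_drop("c", ["a", "b"], ["c", "d"])
except ValueError as exc:
    print(exc)
```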

amueller

Looks good! @jnothman is your approval still valid lol? Or @adrinjalali did you want to have a look?

jnothman

Reading your summary of changes, I'm happy to keep my approval :)

jnothman · 2022-03-15T12:40:54Z

Congrats for getting this through, @thomasjpfan !!

glemaitre · 2022-03-15T16:29:51Z

Nice!!!!!

amueller · 2022-03-25T20:14:26Z

OMG this is exciting!!!

* ENH Completely adds infrequent categories
* STY Linting
* STY Linting
* DOC Improves wording
* DOC Lint
* BUG Fixes
* CLN Address comments
* CLN Address comments
* DOC Uses math to description float min_frequency
* DOC Adds comment regarding drop
* BUG Fixes method name
* DOC Clearer docstring
* TST Adds more tests
* FIX Fixes mege
* CLN More pythonic
* CLN Address comments
* STY Flake8
* CLN Address comments
* DOC Fix
* MRG
* WIP
* ENH Address comments
* STY Fix
* ENH Use functiion call instead of property
* ENH Adds counts feature
* CLN Rename variables
* DOC More details
* CLN Remove unneeded line
* CLN Less lines is less complicated
* CLN Less diffs
* CLN Improves readiabilty
* BUG Fix
* CLN Address comments
* TST Fix
* CLN Address comments
* CLN Address comments
* CLN Move docstring to userguide
* DOC Better wrapping
* TST Adds test to handle_unknown='error'
* ENH Spelling error in docstring
* BUG Fixes counter with nan values
* BUG Removes unneeded test
* BUG Fixes issue
* ENH Sync with main
* DOC Correct settings
* DOC Adds docstring
* DOC Immprove user guide
* DOC Move to 1.0
* DOC Update docs
* TST Remove test
* DOC Update docstring
* STY Linting
* DOC Address comments
* ENH Neater code
* DOC Update explaination for auto
* Update sklearn/preprocessing/_encoders.py Co-authored-by: Roman Yurchak <[email protected]>
* TST Uses docstring instead of comments
* TST Remove call to fit
* TST Spelling error
* ENH Adds support for drop + infrequent categories
* ENH Adds infrequent_if_exist option
* DOC Address comments for user guide
* DOC Address comments for whats_new
* DOC Update docstring based on comments
* CLN Update test with suggestions
* ENH Adds computed property infrequent_categories_
* DOC Adds where the infrequent column is located
* TST Adds more test for infrequent_categories_
* DOC Adds docstring for _compute_drop_idx
* CLN Moves _convert_to_infrequent_idx into its own method
* TST Increases test coverage
* TST Adds failing test
* CLN Careful consideration of dropped and inverse_transform
* STY Linting
* DOC Adds docstrinb about dropping infrequent
* DOC Uses only
* DOC Numpydoc
* TST Includes test for get_feature_names_out
* DOC Move whats new
* DOC Address docstring comments
* DOC Docstring changes
* TST Better comments
* TST Adds check for handle_unknown='ignore' for infrequent
* CLN Make _infrequent_indices private
* CLN Change min_frequency default to None
* DOC Adds comments
* ENH adds support for max_categories=1
* ENH Describe lexicon ordering for ties
* DOC Better docstring
* STY Fix
* CLN Error when explicity dropping an infrequent category
* STY Grammar

Co-authored-by: Joel Nothman <[email protected]>
Co-authored-by: Roman Yurchak <[email protected]>
Co-authored-by: Guillaume Lemaitre <[email protected]>

thomasjpfan added 8 commits January 3, 2020 12:57

ENH Completely adds infrequent categories

9c5dec4

STY Linting

6613645

STY Linting

741bd10

DOC Improves wording

f1ba191

DOC Lint

ae3f873

Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…

d7eb2b6

…t_encoder_rb

BUG Fixes

dc4249b

Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…

5941d97

…t_encoder_rb

jnothman reviewed Jan 20, 2020

jnothman reviewed Jan 23, 2020

NicolasHug reviewed Jan 23, 2020

Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…

d539245

…t_encoder_rb

thomasjpfan added 3 commits January 28, 2020 10:45

CLN Address comments

c070f16

CLN Address comments

3400e07

DOC Uses math to description float min_frequency

5defa0b

thomasjpfan added 7 commits January 28, 2020 12:31

DOC Adds comment regarding drop

35d2470

Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…

f59f18f

…t_encoder_rb

BUG Fixes method name

aec1430

DOC Clearer docstring

a64ffdd

TST Adds more tests

f445018

Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…

c7c2fa9

…t_encoder_rb

Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…

0516482

…t_encoder_rb

thomasjpfan added 2 commits February 11, 2020 14:29

Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…

cac9d00

…t_encoder_rb

FIX Fixes mege

462b46c

thomasjpfan added 12 commits March 1, 2022 16:12

DOC Address docstring comments

23ae2e8

Merge remote-tracking branch 'upstream/main' into infrequent_one_hot_…

523625e

…encoder_rb

DOC Docstring changes

7980c6e

TST Better comments

552c983

TST Adds check for handle_unknown='ignore' for infrequent

4deb105

CLN Make _infrequent_indices private

ecb2a44

CLN Change min_frequency default to None

e7d8301

DOC Adds comments

0bc1fee

ENH adds support for max_categories=1

c802291

ENH Describe lexicon ordering for ties

10137a5

DOC Better docstring

0da2ee1

STY Fix

07b38bd

thomasjpfan added 2 commits March 4, 2022 13:06

Merge remote-tracking branch 'upstream/main' into infrequent_one_hot_…

2e28bb0

…encoder_rb

CLN Error when explicity dropping an infrequent category

cf73b27

STY Grammar

66306a4

amueller reviewed Mar 11, 2022

amueller approved these changes Mar 11, 2022

jnothman reviewed Mar 13, 2022

adrinjalali approved these changes Mar 14, 2022

adrinjalali merged commit 7f0006c into scikit-learn:main Mar 14, 2022

amueller mentioned this pull request Mar 25, 2022

[WIP] "other"/min_freq in OneHot and OrdinalEncoder #12264

Closed

7 tasks

This was referenced Nov 15, 2022

Update scikit learn 1.2 automl/auto-sklearn#1611

Closed

[Research] Use grouping of infrequent categories in OneHotEncoder automl/auto-sklearn#1614

Open
