ENH Adds infrequent categories to OneHotEncoder #16018

Merged on Mar 14, 2022

Commits (125)
9c5dec4
ENH Completely adds infrequent categories
thomasjpfan Dec 16, 2019
6613645
STY Linting
thomasjpfan Jan 3, 2020
741bd10
STY Linting
thomasjpfan Jan 3, 2020
f1ba191
DOC Improves wording
thomasjpfan Jan 3, 2020
ae3f873
DOC Lint
thomasjpfan Jan 4, 2020
d7eb2b6
Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…
thomasjpfan Jan 6, 2020
dc4249b
BUG Fixes
thomasjpfan Jan 6, 2020
5941d97
Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…
thomasjpfan Jan 6, 2020
d539245
Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…
thomasjpfan Jan 27, 2020
c070f16
CLN Address comments
thomasjpfan Jan 28, 2020
3400e07
CLN Address comments
thomasjpfan Jan 28, 2020
5defa0b
DOC Uses math to description float min_frequency
thomasjpfan Jan 28, 2020
35d2470
DOC Adds comment regarding drop
thomasjpfan Jan 28, 2020
f59f18f
Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…
thomasjpfan Jan 29, 2020
aec1430
BUG Fixes method name
thomasjpfan Jan 29, 2020
a64ffdd
DOC Clearer docstring
thomasjpfan Jan 29, 2020
f445018
TST Adds more tests
thomasjpfan Jan 29, 2020
c7c2fa9
Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…
thomasjpfan Feb 5, 2020
0516482
Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…
thomasjpfan Feb 7, 2020
cac9d00
Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…
thomasjpfan Feb 11, 2020
462b46c
FIX Fixes mege
thomasjpfan Feb 11, 2020
a920d37
CLN More pythonic
thomasjpfan Feb 11, 2020
9398229
CLN Address comments
thomasjpfan Feb 11, 2020
3a3eb5d
STY Flake8
thomasjpfan Feb 11, 2020
e5c4eef
CLN Address comments
thomasjpfan Feb 21, 2020
a249944
Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…
thomasjpfan Mar 7, 2020
78fa495
DOC Fix
thomasjpfan Mar 10, 2020
0c431ed
MRG
thomasjpfan Apr 13, 2020
8cf73fa
WIP
thomasjpfan Apr 13, 2020
ecf9e7b
ENH Address comments
thomasjpfan Apr 14, 2020
9a40eb7
STY Fix
thomasjpfan Apr 14, 2020
33a653f
Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…
thomasjpfan Apr 14, 2020
eb8b501
ENH Use functiion call instead of property
thomasjpfan Apr 14, 2020
56aec01
Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…
thomasjpfan May 1, 2020
dadaac2
Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…
thomasjpfan May 13, 2020
8660704
ENH Adds counts feature
thomasjpfan May 13, 2020
b8a883f
CLN Rename variables
thomasjpfan May 13, 2020
29005b1
DOC More details
thomasjpfan May 13, 2020
03c8d4d
CLN Remove unneeded line
thomasjpfan May 13, 2020
f669c54
CLN Less lines is less complicated
thomasjpfan May 15, 2020
23ba7fd
Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…
thomasjpfan May 15, 2020
ffe2976
CLN Less diffs
thomasjpfan May 15, 2020
8979f0b
CLN Improves readiabilty
thomasjpfan May 16, 2020
530a3fe
Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…
thomasjpfan May 18, 2020
a1ed299
Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…
thomasjpfan May 19, 2020
41a29b0
BUG Fix
thomasjpfan May 20, 2020
0d58dc3
Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…
thomasjpfan May 20, 2020
5183d3c
Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…
thomasjpfan May 25, 2020
ef2ebf6
Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…
thomasjpfan May 26, 2020
a1cff1f
CLN Address comments
thomasjpfan May 26, 2020
e222452
TST Fix
thomasjpfan May 27, 2020
0d5942c
Merge branch 'master' into infrequent_one_hot_encoder_rb
jnothman May 31, 2020
d5f85d4
Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…
thomasjpfan Jun 17, 2020
db96b44
CLN Address comments
thomasjpfan Jun 17, 2020
dc73894
CLN Address comments
thomasjpfan Jun 17, 2020
1a686b5
CLN Move docstring to userguide
thomasjpfan Jun 17, 2020
853f54d
DOC Better wrapping
thomasjpfan Jun 17, 2020
5ad5917
TST Adds test to handle_unknown='error'
thomasjpfan Jun 18, 2020
7414e26
ENH Spelling error in docstring
thomasjpfan Jun 18, 2020
99de0a6
Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…
thomasjpfan Oct 31, 2020
265d85e
BUG Fixes counter with nan values
thomasjpfan Nov 1, 2020
090c594
BUG Removes unneeded test
thomasjpfan Nov 1, 2020
998272d
Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…
thomasjpfan Nov 4, 2020
8411e3d
BUG Fixes issue
thomasjpfan Nov 4, 2020
213e3c3
Merge remote-tracking branch 'upstream/master' into infrequent_one_ho…
thomasjpfan Dec 29, 2020
ec6e23f
ENH Sync with main
thomasjpfan Dec 29, 2020
a730bce
DOC Correct settings
thomasjpfan Dec 29, 2020
97e9f7a
DOC Adds docstring
thomasjpfan Dec 29, 2020
e1f72d9
Merge remote-tracking branch 'upstream/main' into infrequent_one_hot_…
thomasjpfan Jan 30, 2021
433ccd7
DOC Immprove user guide
thomasjpfan Jan 30, 2021
ecb82df
DOC Move to 1.0
thomasjpfan Jan 30, 2021
35d0544
DOC Update docs
thomasjpfan Jan 30, 2021
274c090
TST Remove test
thomasjpfan Jan 30, 2021
abc504e
DOC Update docstring
thomasjpfan Jan 30, 2021
4df4b29
Merge remote-tracking branch 'upstream/main' into infrequent_one_hot_…
thomasjpfan Feb 22, 2021
484070a
STY Linting
thomasjpfan Feb 22, 2021
6088f9e
Merge remote-tracking branch 'upstream/main' into infrequent_one_hot_…
thomasjpfan Mar 3, 2021
c48ada2
DOC Address comments
thomasjpfan Mar 4, 2021
1922b32
ENH Neater code
thomasjpfan Mar 4, 2021
91fa58b
DOC Update explaination for auto
thomasjpfan Mar 4, 2021
a68ce31
Update sklearn/preprocessing/_encoders.py
thomasjpfan Apr 4, 2021
6f0c542
Merge remote-tracking branch 'upstream/main' into infrequent_one_hot_…
thomasjpfan Apr 5, 2021
3e305ef
TST Uses docstring instead of comments
thomasjpfan Apr 5, 2021
fec44b2
TST Remove call to fit
thomasjpfan Apr 5, 2021
e4ad665
TST Spelling error
thomasjpfan Apr 5, 2021
10b8aec
ENH Adds support for drop + infrequent categories
thomasjpfan Apr 5, 2021
ef86eb1
ENH Adds infrequent_if_exist option
thomasjpfan Apr 5, 2021
ad639e9
Merge remote-tracking branch 'upstream/main' into infrequent_one_hot_…
thomasjpfan Apr 19, 2021
61d1ddb
DOC Address comments for user guide
thomasjpfan Apr 19, 2021
2493223
DOC Address comments for whats_new
thomasjpfan Apr 19, 2021
a9f643f
DOC Update docstring based on comments
thomasjpfan Apr 19, 2021
1de557a
CLN Update test with suggestions
thomasjpfan Apr 19, 2021
058112e
ENH Adds computed property infrequent_categories_
thomasjpfan Apr 19, 2021
7ab2434
DOC Adds where the infrequent column is located
thomasjpfan Apr 19, 2021
aa7d5cf
TST Adds more test for infrequent_categories_
thomasjpfan Apr 19, 2021
939123c
DOC Adds docstring for _compute_drop_idx
thomasjpfan Apr 19, 2021
6a467ac
CLN Moves _convert_to_infrequent_idx into its own method
thomasjpfan Apr 19, 2021
f11ccff
TST Increases test coverage
thomasjpfan Apr 19, 2021
fac1f21
TST Adds failing test
thomasjpfan May 10, 2021
87a06fb
CLN Careful consideration of dropped and inverse_transform
thomasjpfan May 10, 2021
49aaa23
STY Linting
thomasjpfan May 10, 2021
cd3d29b
DOC Adds docstrinb about dropping infrequent
thomasjpfan May 10, 2021
01bc992
Merge remote-tracking branch 'upstream/main' into infrequent_one_hot_…
thomasjpfan May 11, 2021
06397b2
DOC Uses only
thomasjpfan May 11, 2021
9af51ba
Merge remote-tracking branch 'upstream/main' into infrequent_one_hot_…
thomasjpfan Aug 20, 2021
f7c8839
Merge branch 'main' into infrequent_one_hot_encoder_rb
glemaitre Aug 30, 2021
48a03ea
DOC Numpydoc
thomasjpfan Aug 30, 2021
388e2f3
Merge remote-tracking branch 'upstream/main' into infrequent_one_hot_…
thomasjpfan Nov 29, 2021
e36ca57
TST Includes test for get_feature_names_out
thomasjpfan Nov 29, 2021
6bbc6d4
DOC Move whats new
thomasjpfan Nov 29, 2021
23ae2e8
DOC Address docstring comments
thomasjpfan Mar 1, 2022
523625e
Merge remote-tracking branch 'upstream/main' into infrequent_one_hot_…
thomasjpfan Mar 1, 2022
7980c6e
DOC Docstring changes
thomasjpfan Mar 2, 2022
552c983
TST Better comments
thomasjpfan Mar 2, 2022
4deb105
TST Adds check for handle_unknown='ignore' for infrequent
thomasjpfan Mar 2, 2022
ecb2a44
CLN Make _infrequent_indices private
thomasjpfan Mar 2, 2022
e7d8301
CLN Change min_frequency default to None
thomasjpfan Mar 3, 2022
0bc1fee
DOC Adds comments
thomasjpfan Mar 3, 2022
c802291
ENH adds support for max_categories=1
thomasjpfan Mar 3, 2022
10137a5
ENH Describe lexicon ordering for ties
thomasjpfan Mar 3, 2022
0da2ee1
DOC Better docstring
thomasjpfan Mar 3, 2022
07b38bd
STY Fix
thomasjpfan Mar 3, 2022
2e28bb0
Merge remote-tracking branch 'upstream/main' into infrequent_one_hot_…
thomasjpfan Mar 4, 2022
cf73b27
CLN Error when explicity dropping an infrequent category
thomasjpfan Mar 4, 2022
66306a4
STY Grammar
thomasjpfan Mar 4, 2022
129 changes: 117 additions & 12 deletions doc/modules/preprocessing.rst
@@ -594,17 +594,19 @@ dataset::
array([[1., 0., 0., 1., 0., 0., 1., 0., 0., 0.]])

If there is a possibility that the training data might have missing categorical
features, it can often be better to specify ``handle_unknown='ignore'`` instead
of setting the ``categories`` manually as above. When
``handle_unknown='ignore'`` is specified and unknown categories are encountered
during transform, no error will be raised but the resulting one-hot encoded
columns for this feature will be all zeros
(``handle_unknown='ignore'`` is only supported for one-hot encoding)::

>>> enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
features, it can often be better to specify
`handle_unknown='infrequent_if_exist'` instead of setting the `categories`
manually as above. When `handle_unknown='infrequent_if_exist'` is specified
and unknown categories are encountered during transform, no error will be
raised. Instead, the one-hot encoded columns for this feature will be all
zeros or, if infrequent categories are enabled, the unknown category will be
treated as infrequent.
(`handle_unknown='infrequent_if_exist'` is only supported for one-hot
encoding)::

>>> enc = preprocessing.OneHotEncoder(handle_unknown='infrequent_if_exist')
>>> X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
>>> enc.fit(X)
OneHotEncoder(handle_unknown='ignore')
OneHotEncoder(handle_unknown='infrequent_if_exist')
>>> enc.transform([['female', 'from Asia', 'uses Chrome']]).toarray()
array([[1., 0., 0., 0., 0., 0.]])

@@ -621,7 +623,8 @@ since co-linearity would cause the covariance matrix to be non-invertible::
... ['female', 'from Europe', 'uses Firefox']]
>>> drop_enc = preprocessing.OneHotEncoder(drop='first').fit(X)
>>> drop_enc.categories_
[array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object), array(['uses Firefox', 'uses Safari'], dtype=object)]
[array(['female', 'male'], dtype=object), array(['from Europe', 'from US'], dtype=object),
array(['uses Firefox', 'uses Safari'], dtype=object)]
>>> drop_enc.transform(X).toarray()
array([[1., 1., 1.],
[0., 0., 0.]])
@@ -634,7 +637,8 @@ categories. In this case, you can set the parameter `drop='if_binary'`.
... ['female', 'Asia', 'Chrome']]
>>> drop_enc = preprocessing.OneHotEncoder(drop='if_binary').fit(X)
>>> drop_enc.categories_
[array(['female', 'male'], dtype=object), array(['Asia', 'Europe', 'US'], dtype=object), array(['Chrome', 'Firefox', 'Safari'], dtype=object)]
[array(['female', 'male'], dtype=object), array(['Asia', 'Europe', 'US'], dtype=object),
array(['Chrome', 'Firefox', 'Safari'], dtype=object)]
>>> drop_enc.transform(X).toarray()
array([[1., 0., 0., 1., 0., 0., 1.],
[0., 0., 1., 0., 0., 1., 0.],
@@ -699,6 +703,107 @@ separate categories::
See :ref:`dict_feature_extraction` for categorical features that are
represented as a dict, not as scalars.

.. _one_hot_encoder_infrequent_categories:

Infrequent categories
---------------------

:class:`OneHotEncoder` supports aggregating infrequent categories into a single
output for each feature. The parameters to enable the gathering of infrequent
categories are `min_frequency` and `max_categories`.

1. `min_frequency` is either an integer greater than or equal to 1, or a float in
the interval `(0.0, 1.0)`. If `min_frequency` is an integer, categories with
a cardinality smaller than `min_frequency` will be considered infrequent.
If `min_frequency` is a float, categories with a cardinality smaller than
this fraction of the total number of samples will be considered infrequent.
The default value is 1, which means every category is encoded separately.

2. `max_categories` is either `None` or any integer greater than 1. This
parameter sets an upper limit to the number of output features for each
input feature. `max_categories` includes the feature that combines
infrequent categories.

In the following example, the categories `'dog'` and `'snake'` are considered
infrequent::

>>> X = np.array([['dog'] * 5 + ['cat'] * 20 + ['rabbit'] * 10 +
... ['snake'] * 3], dtype=object).T
>>> enc = preprocessing.OneHotEncoder(min_frequency=6, sparse=False).fit(X)
>>> enc.infrequent_categories_
[array(['dog', 'snake'], dtype=object)]
>>> enc.transform(np.array([['dog'], ['cat'], ['rabbit'], ['snake']]))
array([[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])

By setting `handle_unknown` to `'infrequent_if_exist'`, unknown categories will
be considered infrequent::

>>> enc = preprocessing.OneHotEncoder(
... handle_unknown='infrequent_if_exist', sparse=False, min_frequency=6)
>>> enc = enc.fit(X)
>>> enc.transform(np.array([['dragon']]))
array([[0., 0., 1.]])

:meth:`OneHotEncoder.get_feature_names_out` uses 'infrequent_sklearn' as the
infrequent feature name::

>>> enc.get_feature_names_out()
array(['x0_cat', 'x0_rabbit', 'x0_infrequent_sklearn'], dtype=object)

When `handle_unknown` is set to `'infrequent_if_exist'` and an unknown
category is encountered in transform:

1. If infrequent category support was not configured or there was no
infrequent category during training, the resulting one-hot encoded columns
for this feature will be all zeros. In the inverse transform, an unknown
category will be denoted as `None`.

2. If there is an infrequent category during training, the unknown category
will be considered infrequent. In the inverse transform, 'infrequent_sklearn'
will be used to represent the infrequent category.

Infrequent categories can also be configured using `max_categories`. In the
following example, we set `max_categories=2` to limit the number of features in
the output. This results in all but the `'cat'` category being considered
infrequent, leading to two features: one for `'cat'` and one for the infrequent
categories, which are all the others::

>>> enc = preprocessing.OneHotEncoder(max_categories=2, sparse=False)
>>> enc = enc.fit(X)
>>> enc.transform([['dog'], ['cat'], ['rabbit'], ['snake']])
array([[0., 1.],
[1., 0.],
[0., 1.],
[0., 1.]])

If both `max_categories` and `min_frequency` are non-default values, then
categories are first selected based on `min_frequency`, and at most
`max_categories` output features are kept. In the following example,
`min_frequency=4` considers only `snake` to be infrequent, but
`max_categories=3` forces `dog` to also be infrequent::

>>> enc = preprocessing.OneHotEncoder(min_frequency=4, max_categories=3, sparse=False)
>>> enc = enc.fit(X)
>>> enc.transform([['dog'], ['cat'], ['rabbit'], ['snake']])
array([[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])

If there are infrequent categories with the same cardinality at the cutoff of
`max_categories`, then the first `max_categories` are taken based on lexicon
ordering. In the following example, "b", "c", and "d" have the same cardinality,
and with `max_categories=3`, "b" and "c" are infrequent because, among the tied
categories, they come first in lexicon order::

>>> X = np.asarray([["a"] * 20 + ["b"] * 10 + ["c"] * 10 + ["d"] * 10], dtype=object).T
>>> enc = preprocessing.OneHotEncoder(max_categories=3).fit(X)
>>> enc.infrequent_categories_
[array(['b', 'c'], dtype=object)]
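
The selection rule described in this section (apply `min_frequency` first, then `max_categories`, breaking ties by lexicon order) can be sketched in plain Python. The helper `select_infrequent` below is hypothetical and is not the encoder's implementation. It is a simplified model, written under the assumption that ties are resolved by a stable sort over lexicographically ordered categories, and it reproduces the two examples shown above.

```python
from collections import Counter


def select_infrequent(values, min_frequency=None, max_categories=None):
    """Hypothetical helper: return the sorted categories grouped as infrequent."""
    counts = Counter(values)
    categories = sorted(counts)  # lexicographic order
    infrequent = set()

    # Step 1: min_frequency is an absolute count (int) or a fraction (float).
    if min_frequency is not None:
        if isinstance(min_frequency, int):
            threshold = min_frequency
        else:
            threshold = min_frequency * len(values)
        infrequent = {c for c in categories if counts[c] < threshold}

    # Step 2: max_categories caps the output columns, where the grouped
    # infrequent column counts as one of them.
    if max_categories is not None:
        n_outputs = len(categories) - len(infrequent) + (1 if infrequent else 0)
        if n_outputs > max_categories:
            n_keep = max_categories - 1
            # Stable ascending sort by count: tied categories keep their
            # lexicographic order, so earlier names are grouped first.
            ranked = sorted(categories, key=lambda c: counts[c])
            kept = ranked[len(ranked) - n_keep:]
            infrequent = set(categories) - set(kept)

    return sorted(infrequent)


animals = ["dog"] * 5 + ["cat"] * 20 + ["rabbit"] * 10 + ["snake"] * 3
letters = ["a"] * 20 + ["b"] * 10 + ["c"] * 10 + ["d"] * 10

print(select_infrequent(animals, min_frequency=4, max_categories=3))
# ['dog', 'snake']
print(select_infrequent(letters, max_categories=3))
# ['b', 'c']
```

Edge cases, such as a category count exactly equal to `max_categories`, may be handled differently by the real encoder, so this sketch should be read as an aid to the prose rather than a specification.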

.. _preprocessing_discretization:

Discretization
@@ -981,7 +1086,7 @@ Interestingly, a :class:`SplineTransformer` of ``degree=0`` is the same as
Penalties <10.1214/ss/1038425655>`. Statist. Sci. 11 (1996), no. 2, 89--121.

* Perperoglou, A., Sauerbrei, W., Abrahamowicz, M. et al. :doi:`A review of
spline function procedures in R <10.1186/s12874-019-0666-3>`.
spline function procedures in R <10.1186/s12874-019-0666-3>`.
BMC Med Res Methodol 19, 46 (2019).

.. _function_transformer:
5 changes: 5 additions & 0 deletions doc/whats_new/v1.1.rst
@@ -688,6 +688,11 @@ Changelog
:mod:`sklearn.preprocessing`
............................

- |Feature| :class:`preprocessing.OneHotEncoder` now supports grouping
infrequent categories into a single feature. Grouping infrequent categories
is enabled by specifying how to select infrequent categories with
`min_frequency` or `max_categories`. :pr:`16018` by `Thomas Fan`_.

- |Enhancement| Adds a `subsample` parameter to :class:`preprocessing.KBinsDiscretizer`.
This allows specifying a maximum number of samples to be used while fitting
the model. The option is only available when `strategy` is set to `quantile`.