FIX Ensure dtype of categories is `object` for strings in `OneHotEncoder` #25174

betatim · 2022-12-12T10:47:20Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This makes the `enc.categories_` attribute of `OneHotEncoder` contain an array of dtype `object` when using predefined categories that are strings. This makes it consistent with the dtype of `enc.categories_` when the categories are determined during `fit`. In general, to compare a sequence of bytes to a string you need to assume an encoding; otherwise you can't really compare them. But I don't understand enough about the numpy type system to know whether it would already take care of this. Ideas and input welcome.

adrinjalali · 2022-12-12T10:50:54Z

Why do we not want to have <UXXX types?

>>> np.asarray(["test"]).dtype
dtype('<U4')

betatim · 2022-12-12T10:54:27Z

> Why do we not want to have <UXXX types?

I assumed there is a reason why, when you do something slightly different but similar (don't provide categories to the constructor, rely on `fit` to find them), you end up with `object` as dtype. But I've not (yet) found a comment explaining why the code does what it does :-/
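For illustration, here is a minimal NumPy-only sketch of why a fixed-width `<U` dtype can be a problem for predefined categories. It assumes that the truncation described in #25171 ("OneHotEncoder cuts predefined classes") can arise from casting categories to a narrower string dtype; that mechanism is an assumption made for this example, not a description of the encoder's code. The names `predefined` and `X_col` are hypothetical.

```python
import numpy as np

# Predefined categories as NumPy infers them: fixed-width unicode (kind "U").
predefined = np.asarray(["cat", "doggo"])

# Input data whose strings happen to be shorter than the categories.
X_col = np.asarray(["a", "b"])

# Casting the categories to the input's narrower dtype silently truncates them.
truncated = predefined.astype(X_col.dtype)

# With dtype=object every element is a full Python str, so nothing is cut.
as_object = np.asarray(["cat", "doggo"], dtype=object)

print(predefined.dtype.kind, X_col.dtype.kind)  # both are fixed-width unicode
print(truncated.tolist())   # category labels cut to one character
print(as_object.tolist())   # labels preserved
```

Storing string categories with dtype `object` sidesteps the width question entirely, at the cost of losing NumPy's compact fixed-width storage.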

sklearn/preprocessing/_encoders.py

ogrisel

Please also add an entry in the changelog, maybe for 1.2.1.

Once the following is addressed, LGTM:

sklearn/preprocessing/_encoders.py

sklearn/preprocessing/tests/test_encoders.py

It isn't obvious how to compare a sequence of bytes to a string, without knowing what encoding to use. Removing this test as it fails now that we use object as dtype for strings.
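The point about encodings can be seen in plain Python, independent of scikit-learn. This is a small sketch with made-up values (`raw`, `text`): a `bytes` object never compares equal to a `str`, and whether the decoded bytes match depends entirely on which encoding is assumed.

```python
# The same label as UTF-8 encoded bytes and as a unicode string.
raw = b"caf\xc3\xa9"
text = "caf\u00e9"

# bytes and str never compare equal, even when they "look" the same.
same_without_decoding = raw == text

# Decoding with the right encoding recovers the string ...
same_as_utf8 = raw.decode("utf-8") == text

# ... while a different (also valid) encoding gives a different string.
same_as_latin1 = raw.decode("latin-1") == text

print(same_without_decoding, same_as_utf8, same_as_latin1)
```

Because no single encoding can be assumed for arbitrary user data, refusing to mix bytes categories with string input is a defensible choice.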

sklearn/preprocessing/tests/test_encoders.py

betatim · 2022-12-13T09:13:52Z

I'll add a change log entry later.

For now I have two questions: (1) about the modified test (see comment), and (2) why the dtype is taken from the `X` passed to `fit` rather than from `self.categories` when the user has explicitly specified the known categories. My instinct would have been to take the type from the specified categories and require the input to `fit` to match or be compatible.

adrinjalali · 2022-12-13T17:03:01Z

> My instinct would have been to take the type from the specified category and require the input to fit to match/be compatible.

I agree with them having to be compatible, but they don't have to be the same.

sklearn/preprocessing/tests/test_encoders.py

sklearn/preprocessing/_encoders.py

sklearn/preprocessing/tests/test_encoders.py

A predefined category of bytes can not be compared to input data that is a unicode string.

adrinjalali

Otherwise LGTM.

Needs a changelog though.

sklearn/preprocessing/_encoders.py

glemaitre

Just nitpicking.

doc/whats_new/v1.3.rst

sklearn/preprocessing/tests/test_encoders.py

Co-authored-by: Guillaume Lemaitre <[email protected]>

thomasjpfan

Thank you for the PR!

sklearn/preprocessing/tests/test_encoders.py

sklearn/preprocessing/_encoders.py

thomasjpfan

Minor comment, otherwise LGTM

doc/whats_new/v1.3.rst

Co-authored-by: Thomas J. Fan <[email protected]>

…der` (scikit-learn#25174) Co-authored-by: Guillaume Lemaitre <[email protected]> Co-authored-by: Thomas J. Fan <[email protected]>

Ensure dtype of categories is object for strings

f46eee7

github-actions bot added the module:preprocessing label Dec 12, 2022

adrinjalali reviewed Dec 12, 2022

sklearn/preprocessing/_encoders.py (outdated)

ogrisel reviewed Dec 12, 2022

sklearn/preprocessing/_encoders.py (outdated)

sklearn/preprocessing/_encoders.py (outdated)

sklearn/preprocessing/tests/test_encoders.py

betatim added 2 commits December 13, 2022 10:00

Check for truncated category labels

e79836e

Remove test with bytes and unicode categories

4abf142

It isn't obvious how to compare a sequence of bytes to a string, without knowing what encoding to use. Removing this test as it fails now that we use object as dtype for strings.

betatim commented Dec 13, 2022

sklearn/preprocessing/tests/test_encoders.py

Add explicit test for mixed string-bytes categories

890ab9a

glemaitre reviewed Dec 14, 2022

betatim added 3 commits December 14, 2022 14:16

Add explicit error message for incompatible category values

8c9b391

A predefined category of bytes can not be compared to input data that is a unicode string.
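As a rough sketch of what such a check could look like, the helper below rejects predefined `bytes` categories when the input column is not a bytes array. The function name `check_category_compat`, its signature, and the message wording are hypothetical and are not taken from scikit-learn's implementation.

```python
import numpy as np


def check_category_compat(categories, column):
    """Hypothetical helper: reject bytes categories for non-bytes input."""
    # Keep categories as Python objects so that labels are never truncated.
    cats = np.asarray(categories, dtype=object)
    col = np.asarray(column)

    # NumPy dtype kind "S" means a fixed-width bytes array.
    if cats.size and isinstance(cats[0], bytes) and col.dtype.kind != "S":
        value_type = type(col.flat[0]).__name__ if col.size else col.dtype.name
        raise ValueError(
            "The predefined categories have type 'bytes', which is "
            f"incompatible with values of type '{value_type}'."
        )
    return cats


# String categories with string input pass and come back with dtype object.
ok = check_category_compat(["a", "b"], ["a", "b", "a"])

# Bytes categories with string input are rejected with an explicit message.
try:
    check_category_compat([b"a", b"b"], ["a", "b"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Including the type names in the message tells the user which side of the comparison to change.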

Fix typo

dd4a3df

Fix tests

b217323

adrinjalali reviewed Dec 14, 2022

sklearn/preprocessing/_encoders.py (outdated)

betatim added 3 commits December 15, 2022 10:47

Add what's new entry

6e96f2a

Improve error message by outputing type names

36f1a15

Merge branch 'main' into fix-categories-dtype

c51ad3f

glemaitre approved these changes Dec 15, 2022

doc/whats_new/v1.3.rst (outdated)

sklearn/preprocessing/tests/test_encoders.py (outdated)

betatim and others added 2 commits December 15, 2022 16:43

Remove blank line

ee8fa8e

Co-authored-by: Guillaume Lemaitre <[email protected]>

Move error message

0613991

betatim added the "Waiting for Second Reviewer" label Dec 15, 2022

adrinjalali approved these changes Dec 15, 2022

adrinjalali enabled auto-merge (squash) December 15, 2022 15:50

Micky774 removed the "Waiting for Second Reviewer" label Dec 16, 2022

betatim mentioned this pull request Dec 19, 2022

OneHotEncoder cuts predefined classes #25171

Closed

thomasjpfan reviewed Dec 19, 2022

sklearn/preprocessing/tests/test_encoders.py (outdated)

sklearn/preprocessing/tests/test_encoders.py

sklearn/preprocessing/_encoders.py (outdated)

Add regex escape

f4cf410

auto-merge was automatically disabled December 20, 2022 09:27
Head branch was pushed to by a user without write access

betatim added 2 commits December 20, 2022 10:50

Switch to easier to understand error message

4d374ce

Move change log entry as it is a breaking change

c29ef00

thomasjpfan approved these changes Dec 20, 2022

doc/whats_new/v1.3.rst (outdated)

Update doc/whats_new/v1.3.rst

445fd78

Co-authored-by: Thomas J. Fan <[email protected]>

thomasjpfan merged commit ecb9a70 into scikit-learn:main Dec 22, 2022

betatim deleted the fix-categories-dtype branch January 4, 2023 09:13

adrinjalali pushed a commit that referenced this pull request Jan 24, 2023

FIX Ensure dtype of categories is object for strings in `OneHotEnco…

b2ddc5a

…der` (#25174) Co-authored-by: Guillaume Lemaitre <[email protected]> Co-authored-by: Thomas J. Fan <[email protected]>
