FIX Ensure dtype of categories is object for strings in OneHotEncoder #25174

Merged
merged 16 commits into scikit-learn:main from fix-categories-dtype
Dec 22, 2022


@betatim betatim commented Dec 12, 2022

Reference Issues/PRs

Closes #25171

What does this implement/fix? Explain your changes.

This ensures that the enc.categories_ attribute of OneHotEncoder contains an array of dtype object when using predefined categories that are strings. This makes it consistent with the dtype of enc.categories_ when the categories are determined during fit. In general, to compare a sequence of bytes to a string you need to assume an encoding; otherwise you can't really compare them. But I don't understand enough about the numpy type system to know whether it already takes care of this. Ideas and input welcome.

@adrinjalali
Member

Why do we not want to have <UXXX types?

>>> np.asarray(["test"]).dtype
dtype('<U4')
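
As a minimal sketch of why a fixed-width `<U` dtype can matter here (this uses only NumPy, not scikit-learn's internals, and the variable names are illustrative): converting longer strings to a narrower fixed-width unicode dtype silently truncates them, while an object array keeps them whole.

```python
import numpy as np

# Input data whose longest string has one character: a fixed-width unicode dtype.
fixed = np.asarray(["a", "b"])

# Casting longer labels to that narrow dtype silently cuts them to one character.
truncated = np.array(["a", "bb", "ccc"], dtype=fixed.dtype)

# With dtype=object the Python strings are stored as-is, so nothing is cut.
as_object = np.array(["a", "bb", "ccc"], dtype=object)

print(fixed.dtype.kind, truncated.tolist(), as_object.tolist())
```

This is the kind of truncation the linked issue title ("OneHotEncoder cuts predefined classes") describes.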

@betatim
Member Author

betatim commented Dec 12, 2022

> Why do we not want to have <UXXX types?

I assumed there is a reason why doing something slightly different but similar (not providing categories to the constructor and relying on fit to find them) ends up with object as the dtype. But I've not (yet) found a comment explaining why the code does what it does :-/

Member

@ogrisel ogrisel left a comment

Please also add an entry in the changelog, maybe for 1.2.1.

Once the following is addressed, LGTM:

It isn't obvious how to compare a sequence of bytes to a string, without
knowing what encoding to use. Removing this test as it fails now that we
use object as dtype for strings.
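
The point about encodings can be illustrated with plain Python (a small sketch with made-up values, not the removed test): a bytes value never compares equal to a str, and whether the decoded value matches depends on which encoding is assumed.

```python
# A category given as bytes and an input value given as a unicode string.
category = b"caf\xc3\xa9"
value = "café"

# Comparing bytes to str directly is always False in Python 3.
naive_equal = category == value

# Decoding needs an assumed encoding, and the answer depends on that choice.
utf8_equal = category.decode("utf-8") == value
latin1_equal = category.decode("latin-1") == value

print(naive_equal, utf8_equal, latin1_equal)
```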
@betatim
Member Author

betatim commented Dec 13, 2022

I'll add a change log entry later.

For now I have two questions: (1) about the modified test (see comment) and (2) why is the dtype taken from X that is passed to fit instead of from self.categories, when the user explicitly specified the known categories. My instinct would have been to take the type from the specified category and require the input to fit to match/be compatible.
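
To make question (2) concrete, here is a rough sketch of the behaviour under discussion. `build_categories` is a hypothetical helper written for illustration, not scikit-learn's actual code: it takes the dtype from the data passed to fit, except that string data is mapped to object so that longer predefined labels are not truncated.

```python
import numpy as np


def build_categories(user_categories, X):
    """Hypothetical helper: pick the dtype for user-specified categories."""
    X = np.asarray(X)
    if np.issubdtype(X.dtype, np.str_):
        # Fixed-width unicode input: store categories as Python objects
        # so labels longer than the input's width survive intact.
        dtype = object
    else:
        # Otherwise follow the dtype of the data passed to fit.
        dtype = X.dtype
    return np.array(user_categories, dtype=dtype)


str_cats = build_categories(["a", "bb"], np.array([["a"], ["b"]]))
int_cats = build_categories([1, 2, 3], np.array([[1], [2]]))
print(str_cats.dtype, str_cats.tolist(), int_cats.dtype.kind)
```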

@adrinjalali
Member

> My instinct would have been to take the type from the specified category and require the input to fit to match/be compatible.

I agree with them having to be compatible, but they don't have to be the same.

A predefined category of bytes can not be compared to input data that is
a unicode string.
Member

@adrinjalali adrinjalali left a comment

Otherwise LGTM.

Needs a changelog though.

Member

@glemaitre glemaitre left a comment

Just nitpicking.

betatim and others added 2 commits December 15, 2022 16:43
@betatim betatim added the Waiting for Second Reviewer First reviewer is done, need a second one! label Dec 15, 2022
@adrinjalali adrinjalali enabled auto-merge (squash) December 15, 2022 15:50
@Micky774 Micky774 removed the Waiting for Second Reviewer First reviewer is done, need a second one! label Dec 16, 2022
Member

@thomasjpfan thomasjpfan left a comment

Thank you for the PR!

auto-merge was automatically disabled December 20, 2022 09:27

Head branch was pushed to by a user without write access

Member

@thomasjpfan thomasjpfan left a comment

Minor comment, otherwise LGTM

Co-authored-by: Thomas J. Fan <[email protected]>
@thomasjpfan thomasjpfan merged commit ecb9a70 into scikit-learn:main Dec 22, 2022
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 3, 2023
@betatim betatim deleted the fix-categories-dtype branch January 4, 2023 09:13
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 20, 2023
jjerphan pushed a commit to jjerphan/scikit-learn that referenced this pull request Jan 23, 2023
adrinjalali pushed a commit that referenced this pull request Jan 24, 2023
…der` (#25174)

Co-authored-by: Guillaume Lemaitre <[email protected]>
Co-authored-by: Thomas J. Fan <[email protected]>
Development

Successfully merging this pull request may close these issues.

OneHotEncoder cuts predefined classes
6 participants