FIX Ensure dtype of categories is object for strings in OneHotEncoder #25174
Why do we not want to have >>> np.asarray(["test"]).dtype
dtype('<U4') |
I assumed that there is a reason why, when you do something slightly different but similar (don't provide categories to the constructor and rely on fit to find them), you end up with |
Please also add an entry in the changelog, maybe for 1.2.1.
Once the following is addressed, LGTM:
It isn't obvious how to compare a sequence of bytes to a string, without knowing what encoding to use. Removing this test as it fails now that we use object as dtype for strings.
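To illustrate the point about encodings, here is a minimal plain-Python sketch (not code from this PR; the sample string and variable names are made up for illustration). It shows that a `bytes` value never compares equal to a `str`, and that whether the decoded text matches depends on which encoding you assume:

```python
# Hypothetical sample values, chosen only to show the effect of the encoding.
text = "café"
raw = text.encode("utf-8")  # the same text stored as a sequence of bytes

# Comparing bytes to str directly is always False in Python 3.
naive_match = raw == text

# Decoding with the encoding that produced the bytes recovers the string.
utf8_match = raw.decode("utf-8") == text

# Decoding with a different encoding yields different text.
latin1_match = raw.decode("latin-1") == text

print(naive_match, utf8_match, latin1_match)
```

Only the comparison that assumes the right encoding succeeds, which is why a test comparing byte categories against string input has no well-defined answer.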
I'll add a change log entry later. For now I have two questions: (1) about the modified test (see comment) and (2) why is the dtype taken from |
I agree with them having to be compatible, but they don't have to be the same. |
Otherwise LGTM.
Needs a changelog though.
Just nitpicking.
Co-authored-by: Guillaume Lemaitre <[email protected]>
Thank you for the PR!
Minor comment, otherwise LGTM
Co-authored-by: Thomas J. Fan <[email protected]>
…der` (scikit-learn#25174) Co-authored-by: Guillaume Lemaitre <[email protected]> Co-authored-by: Thomas J. Fan <[email protected]>
Reference Issues/PRs
Closes #25171
What does this implement/fix? Explain your changes.
This makes it so that the `enc.categories_` attribute of `OneHotEncoder` contains an array of dtype `object` when using predefined categories that are strings. This makes it consistent with the dtype of `enc.categories_` when the categories are determined during `fit`. In general, to compare a sequence of bytes to a string you need to assume an encoding, otherwise you can't really compare them. But I don't understand enough about the numpy type system to know if it would take care of this already. Ideas and input welcome.
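For context, the dtype difference discussed here can be reproduced with NumPy alone. The snippet below is a rough sketch, not the code changed by this PR, and the category values are made up: converting a list of strings without an explicit dtype gives a fixed-width unicode array, while passing `dtype=object` keeps each element as a Python `str`.

```python
import numpy as np

# Hypothetical category values, used only for illustration.
cats = ["cat", "dog", "test"]

# Without an explicit dtype, NumPy infers a fixed-width unicode dtype
# sized to the longest string (4 characters here).
fixed_width = np.asarray(cats)

# With dtype=object, each element stays an ordinary Python str.
as_object = np.asarray(cats, dtype=object)

print(fixed_width.dtype.kind)  # 'U' (unicode string)
print(as_object.dtype)         # object

# The stored values are the same; only the array dtype differs.
same_values = fixed_width.tolist() == as_object.tolist()
```

The values are identical either way, so the change concerns only which dtype `enc.categories_` reports.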