[MRG+1] Docs: refer users to the other encoders to do one hot encoding for labels. #7315
Conversation
LGTM. We could also mention it in the user guide or in the docstring, but this might be enough?
Hey @kchen17, thanks for the quick change.
That would be awesome if you did that.
I would love to put the nail in the coffin for issue #5930. I'm willing to accept the solution by @olologin as a way to perform one-hot encodings for
@amueller do you think this would be a good idea?
I think referencing
@jnothman actually many people currently use LabelEncoder on columns so they can feed them through OneHotEncoder... It's sort of tangential, but I wouldn't mind leaving it in. Not entirely sure what the example adds. That the output shape is always
Yeah, this is actually quite unfortunate. The LabelEncoder wasn't meant to be used on columns / features. (And the docs for LabelEncoder explicitly say that it's for labels.) I remember when I was getting acquainted with scikit-learn, I also used the LabelEncoder on features.
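For readers unfamiliar with the pattern being discussed, here is a minimal sketch of how people feed a categorical column through LabelEncoder and then OneHotEncoder. The `colors` data is made-up illustration data; this is not the pattern the docs recommend, just the one the comment describes.

```python
# Hedged sketch of the LabelEncoder-then-OneHotEncoder pattern on a feature
# column (the usage the LabelEncoder docs discourage). Example data is invented.
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = np.array(["red", "green", "red", "blue"])

# LabelEncoder assigns integer codes in sorted class order: blue=0, green=1, red=2
codes = LabelEncoder().fit_transform(colors)

# OneHotEncoder expects a 2-D array, so reshape the codes into a single column
onehot = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()
```

Modern scikit-learn's OneHotEncoder accepts string columns directly, which is why this two-step dance is considered unnecessary and tangential here.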
@hlin117 done!
fine with me
@@ -661,6 +668,20 @@ class MultiLabelBinarizer(BaseEstimator, TransformerMixin):
    >>> list(mlb.classes_)
    ['comedy', 'sci-fi', 'thriller']

    Perform a one-hot encoding for y labels
Isn't this something LabelBinarizer specialises in?
Relinking discussion here. The LabelBinarizer doesn't necessarily do what we expect for 2-label problems.
That being said, for 3-label problems, you'll get output like this:
>>> from sklearn.preprocessing import LabelBinarizer
>>> LabelBinarizer().fit_transform([1, 2, 3, 1])
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[1, 0, 0]])
I wasn't aware that the LabelBinarizer behaved differently for 2-label problems versus 3-label problems...
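To make the asymmetry being discussed concrete, here is a minimal sketch: with only two classes, LabelBinarizer collapses the output to a single 0/1 column instead of two one-hot columns, while three or more classes get a full one-hot matrix.

```python
# Minimal check of LabelBinarizer's binary-vs-multiclass behavior.
from sklearn.preprocessing import LabelBinarizer

# Two classes: a single indicator column, not two one-hot columns
two_class = LabelBinarizer().fit_transform([1, 2, 1])

# Three classes: a full one-hot matrix, one column per class
three_class = LabelBinarizer().fit_transform([1, 2, 3, 1])
```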
Oh right. But internal to scikit-learn, the flat vector version is a more efficient representation of binary labels...
I still find this a bit of an obscure example to be presenting for MultiLabelBinarizer. Its job is to deal with multilabel data, and to present a hacky use of it here seems inappropriate.
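For context, the "hacky use" under discussion appears to be wrapping each single label in a one-element list so MultiLabelBinarizer produces a one-hot matrix. A sketch of that equivalence (for a problem with three or more classes, where LabelBinarizer already returns one-hot output):

```python
# Sketch of the hacky trick: singleton-wrapping labels makes MultiLabelBinarizer
# mimic LabelBinarizer's one-hot output for a >= 3-class problem.
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

y = [1, 2, 3, 1]

# The trick: each label becomes a one-element "label set"
via_mlb = MultiLabelBinarizer().fit_transform([[label] for label in y])

# The straightforward way for single-valued targets
via_lb = LabelBinarizer().fit_transform(y)
```

The reviewer's point stands: LabelBinarizer does this directly, so documenting the wrapped form as a MultiLabelBinarizer example misrepresents what that class is for.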
@jnothman do you have any other suggestions before an LGTM?
My concern with this PR is that if its purpose is to clarify the subtly distinct purposes of each preprocessing tool, I'm not altogether certain it gets there. Overall, it is an improvement, but I'm certainly not persuaded by the
@jnothman I do like seeing the added links in the documentation, but the trick with the
How about removing the example with the
I think you mean
Force-pushed from e2cea14 to 53622a5
Hi, sorry about the delay in getting this commit updated! (Have been without a laptop this past week while it was getting replaced.) I removed the example from
LGTM. If you want to go the full mile, you can compile the documentation and show us how it looks. (Ask us if you don't know how to do this.)
Otherwise LGTM
@@ -1737,6 +1737,9 @@ class OneHotEncoder(BaseEstimator, TransformerMixin):
    This encoding is needed for feeding categorical data to many scikit-learn
    estimators, notably linear models and SVMs with the standard kernels.

    Note: a one-hot encoding of y labels should use a MultiLabelBinarizer
a LabelBinarizer, really, for the usual single-valued meaning of OHE.
Latest commit makes this correction.
Force-pushed from 53622a5 to 52bc72e
Here are the changes to the documentation @hlin117:
Hmm, is it just the screenshot, or is the "See also" section for LabelEncoder indented a level too much? (nitpick) Pro tip: you can save an entire HTML page by saving it as a PDF, and then uploading the PDF to GitHub.
Ah, sorry about that. I've realized the way I organized the screenshots made the indentations look inconsistent. All of the page formats match what is currently on the live site.
…g for labels. (scikit-learn#7315)
* refer users to the other encoders to do one hot encoding for labels.
* added to the 'see more' for labelbinarizer, multilabelbinarizer, and labelencoder' as well as an example to multilabel binarizer
* added note about y labels to the OneHotEncoder docstring
* removed example from MultiLabelBinarizer
* documentation should specify LabelBinarizer, not MultiLabelBinarizer in OHE
Reference Issue
This PR is a short fix in response to #5930, which concerns one-hot encoding for features (X) versus labels (y).
What does this implement/fix? Explain your changes.
It adds two lines to the See Also section of OneHotEncoder to link to LabelBinarizer and LabelEncoder.
Any other comments?
In turn, would it make sense to also add a link to OneHotEncoder in the corresponding "See Also" sections for LabelBinarizer and LabelEncoder?