[MRG+1] Docs: refer users to the other encoders to do one hot encoding for labels. #7315
Conversation
LGTM. We could also mention it in the user guide or in the docstring, but this might be enough?
Hey @kchen17, thanks for the quick change.
That would be awesome if you did that.
I would love to put the nail in the coffin for issue #5930. I'm willing to accept the solution by @olologin as a way to perform one-hot encodings for
@amueller do you think this would be a good idea?
I think referencing
@jnothman actually many people currently use LabelEncoder on columns so they can feed them through OneHotEncoder... It's sort of tangential, but I wouldn't mind leaving it in. Not entirely sure what the example adds. That the output shape is always
Yeah, this is actually quite unfortunate. The LabelEncoder wasn't meant to be used on columns / features. (And the docs for LabelEncoder explicitly say that it's for labels.) I remember when I was getting acquainted with scikit-learn, I also used the LabelEncoder on features.
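For readers unfamiliar with the pattern being discussed, here is a minimal sketch of how people feed a categorical column through LabelEncoder and then OneHotEncoder. The `colors` data is made-up illustration data; this is not the pattern the docs recommend, just the one the comment describes.

```python
# Hedged sketch of the LabelEncoder-then-OneHotEncoder pattern on a feature
# column (the usage the LabelEncoder docs discourage). Example data is invented.
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = np.array(["red", "green", "red", "blue"])

# LabelEncoder assigns integer codes in sorted class order: blue=0, green=1, red=2
codes = LabelEncoder().fit_transform(colors)

# OneHotEncoder expects a 2-D array, so reshape the codes into a single column
onehot = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()
```

Modern scikit-learn's OneHotEncoder accepts string columns directly, which is why this two-step dance is considered unnecessary and tangential here.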
@hlin117 done!
fine with me
@@ -661,6 +668,20 @@ class MultiLabelBinarizer(BaseEstimator, TransformerMixin):
    >>> list(mlb.classes_)
    ['comedy', 'sci-fi', 'thriller']

    Perform a one-hot encoding for y labels
Isn't this something LabelBinarizer specialises in?
Relinking discussion here. The LabelBinarizer doesn't necessarily do what we expect for 2-label problems.
That being said, for 3-label problems, you'll get output like this:
>>> from sklearn.preprocessing import LabelBinarizer
>>> LabelBinarizer().fit_transform([1, 2, 3, 1])
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[1, 0, 0]])
I wasn't aware that the LabelBinarizer behaved differently for 2-label problems versus 3-label problems...
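To make the asymmetry being discussed concrete, here is a minimal sketch: with only two classes, LabelBinarizer collapses the output to a single 0/1 column instead of two one-hot columns, while three or more classes get a full one-hot matrix.

```python
# Minimal check of LabelBinarizer's binary-vs-multiclass behavior.
from sklearn.preprocessing import LabelBinarizer

# Two classes: a single indicator column, not two one-hot columns
two_class = LabelBinarizer().fit_transform([1, 2, 1])

# Three classes: a full one-hot matrix, one column per class
three_class = LabelBinarizer().fit_transform([1, 2, 3, 1])
```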
Oh right. But internal to scikit-learn, the flat vector version is a more efficient representation of binary labels...
I still find this a bit of an obscure example to be presenting for MultiLabelBinarizer. Its job is to deal with multilabel data, and to present a hacky use of it here seems inappropriate.
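For context, the "hacky use" under discussion appears to be wrapping each single label in a one-element list so MultiLabelBinarizer produces a one-hot matrix. A sketch of that equivalence (for a problem with three or more classes, where LabelBinarizer already returns one-hot output):

```python
# Sketch of the hacky trick: singleton-wrapping labels makes MultiLabelBinarizer
# mimic LabelBinarizer's one-hot output for a >= 3-class problem.
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

y = [1, 2, 3, 1]

# The trick: each label becomes a one-element "label set"
via_mlb = MultiLabelBinarizer().fit_transform([[label] for label in y])

# The straightforward way for single-valued targets
via_lb = LabelBinarizer().fit_transform(y)
```

The reviewer's point stands: LabelBinarizer does this directly, so documenting the wrapped form as a MultiLabelBinarizer example misrepresents what that class is for.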
@jnothman do you have any other suggestions before an LGTM?
My concern with this PR is that if its purpose is to clarify the subtly distinct purposes of each preprocessing tool, I'm not altogether certain it gets there. Overall, it is an improvement, but I'm certainly not persuaded by the
@jnothman I do like seeing the added links in the documentation, but the trick with the
How about removing the example with the
I think you mean
Force-pushed from e2cea14 to 53622a5
Hi, sorry about the delay in getting this commit updated! (Have been without a laptop this past week while it was getting replaced.) I removed the example from
LGTM. If you want to go the full mile, you can compile the documentation and show us how it looks. (Ask us if you don't know how to do this.)
Otherwise LGTM
@@ -1737,6 +1737,9 @@ class OneHotEncoder(BaseEstimator, TransformerMixin):
    This encoding is needed for feeding categorical data to many scikit-learn
    estimators, notably linear models and SVMs with the standard kernels.

    Note: a one-hot encoding of y labels should use a MultiLabelBinarizer
a LabelBinarizer, really, for the usual single-valued meaning of OHE.
Latest commit makes this correction.
Force-pushed from 53622a5 to 52bc72e
Here are the changes to the documentation @hlin117:
Hmm, is it just the screenshot, or is the "See also" section for LabelEncoder indented a level too much? (nitpick) Pro tip: you can save an entire HTML page by saving it as a PDF, and then uploading the PDF to GitHub.
Ah, sorry about that. I've realized the way I organized the screenshots made the indentations look inconsistent. All of the page formats match what is currently on the live site.
…g for labels. (scikit-learn#7315)
* refer users to the other encoders to do one hot encoding for labels.
* added to the 'see more' for labelbinarizer, multilabelbinarizer, and labelencoder' as well as an example to multilabel binarizer
* added note about y labels to the OneHotEncoder docstring
* removed example from MultiLabelBinarizer
* documentation should specify LabelBinarizer, not MultiLabelBinarizer in OHE
Reference Issue
This PR is a short fix in response to #5930, which concerns one-hot encoding for features (X) versus labels (y).
What does this implement/fix? Explain your changes.
It adds two lines to the See Also section of OneHotEncoder to link to LabelBinarizer and LabelEncoder.
Any other comments?
In turn, would it make sense to also add a link to OneHotEncoder in the corresponding "See Also" sections for LabelBinarizer and LabelEncoder?