
[MRG+1] Docs: refer users to the other encoders to do one hot encoding for labels. #7315


Merged
merged 5 commits into scikit-learn:master
Oct 5, 2016

Conversation

kathyxchen
Contributor

Reference Issue

This PR is a short fix in response to #5930, which relates to one-hot encoding for features (X) versus labels (y).

What does this implement/fix? Explain your changes.

It adds two lines to the See Also section of OneHotEncoder to link to LabelBinarizer and LabelEncoder.

Any other comments?

In turn, would it make sense to also add a link to OneHotEncoder on the corresponding "See Also" sections for LabelBinarizer and LabelEncoder?

@amueller
Member

LGTM. We could also mention it in the user guide or in the docstring but this might be enough?
@hlin117 do you have opinions on this?

@hlin117
Contributor

hlin117 commented Aug 31, 2016

Hey @kchen17, thanks for the quick change.

In turn, would it make sense to also add a link to OneHotEncoder on the corresponding "See Also" sections for LabelBinarizer and LabelEncoder?

That would be awesome if you did that.

@hlin117
Contributor

hlin117 commented Aug 31, 2016

I would love to put the nail in the coffin for issue #5930. I'm willing to accept the solution by @olologin as a way to perform one-hot encodings for y labels. More specifically, it would be nice to

  1. Link the OneHotEncoder docs to the MultiLabelEncoder
  2. Add this example to the docstring of the MultiLabelEncoder:
>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> import numpy as np
>>>
>>> y = np.array([0, 1, 1, 0])
>>> MultiLabelBinarizer().fit_transform(y.reshape(-1, 1))
array([[1, 0],
       [0, 1],
       [0, 1],
       [1, 0]])

@amueller do you think this would be a good idea?

@jnothman
Member

jnothman commented Sep 1, 2016

I think referencing LabelBinarizer and perhaps MultiLabelEncoder is fair enough. I don't see the relevance of LabelEncoder to OneHotEncoder.

@amueller
Member

amueller commented Sep 1, 2016

@jnothman actually many people currently use LabelEncoder on columns so they can feed them through OneHotEncoder. It's sort of tangential, but I wouldn't mind leaving it in.

I'm not entirely sure what the example adds. That the output shape is always (n_samples, n_classes)?
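The column workaround described above can be sketched as follows. This is a hypothetical illustration, not code from this PR; the toy `colors` array is invented for the example.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Workaround discussed above: map string categories to integers with
# LabelEncoder, then expand the integer column with OneHotEncoder.
colors = np.array(["red", "green", "blue", "green"])
codes = LabelEncoder().fit_transform(colors)  # classes sorted: blue=0, green=1, red=2
onehot = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()
print(onehot.shape)  # (4, 3): one column per distinct category
```

Each row of `onehot` has exactly one 1, in the column of that sample's category.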

@hlin117
Contributor

hlin117 commented Sep 1, 2016

@amueller I was hoping it would show how one would perform a one hot encoding for y labels. (It addresses issue #5930)

@hlin117
Contributor

hlin117 commented Sep 1, 2016

@jnothman actually many people currently use LabelEncoder on columns so they can feed them through OneHotEncoder. It's sort of tangential, but I wouldn't mind leaving it in.

Yeah, this is actually quite unfortunate. The LabelEncoder wasn't meant to be used on columns / features. (And the docs for LabelEncoder explicitly say that it's for labels.)

I remember when I was getting acquainted with scikit-learn, I also used LabelEncoder on features.

@kathyxchen
Contributor Author

kathyxchen commented Sep 1, 2016

@amueller @hlin117: I added OneHotEncoder to the corresponding See Also sections. I also incorporated the example posted in the comment for MultiLabelBinarizer. Let me know if we need to make any further modifications!

@hlin117
Contributor

hlin117 commented Sep 1, 2016

@kchen17 I like the direction that this is going!

One last change, and you'll have my LGTM. Can you modify the docstring for OneHotEncoder here? It would be nice to mention that a user interested in a one-hot encoding of y labels should use a MultiLabelBinarizer instead.

@kathyxchen
Contributor Author

@hlin117 done!

@hlin117
Contributor

hlin117 commented Sep 1, 2016

LGTM! @kchen17, please prefix this pull request title with "[MRG]" so package owners (@amueller, @jnothman) can review it.

@kathyxchen kathyxchen changed the title refer users to the other encoders to do one hot encoding for labels. [MRG] Docs: refer users to the other encoders to do one hot encoding for labels. Sep 1, 2016
@amueller
Member

amueller commented Sep 6, 2016

fine with me

@@ -661,6 +668,20 @@ class MultiLabelBinarizer(BaseEstimator, TransformerMixin):
>>> list(mlb.classes_)
['comedy', 'sci-fi', 'thriller']

Perform a one-hot encoding for y labels
Member

Isn't this something LabelBinarizer specialises in?

Contributor

Relinking discussion here. The LabelBinarizer doesn't necessarily do what we expect for 2-label problems.

That being said, for 3-label problems, you'll get output like this:

>>> from sklearn.preprocessing import LabelBinarizer
>>> LabelBinarizer().fit_transform([1, 2, 3, 1])
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0]])

I wasn't aware that the LabelBinarizer behaved differently for 2-label problems versus 3-label problems...
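To make the two-class behaviour concrete — a quick sketch, not part of this PR's diff:

```python
from sklearn.preprocessing import LabelBinarizer

# With only two classes, LabelBinarizer emits a single 0/1 column
# rather than the (n_samples, n_classes) one-hot matrix above.
binary = LabelBinarizer().fit_transform([0, 1, 1, 0])
print(binary.shape)  # (4, 1): one column, not two
```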

Member

Oh right. But internal to scikit-learn, the flat vector version is a more efficient representation of binary labels...

Member

I still find this a bit of an obscure example to be presenting for MultiLabelBinarizer. Its job is to deal with multilabel data, and to present a hacky use of it here seems inappropriate.

@hlin117
Contributor

hlin117 commented Sep 8, 2016

@jnothman do you have any other suggestions for a LGTM?

@jnothman
Member

jnothman commented Sep 8, 2016

My concern with this PR is that if its purpose is to clarify the subtly distinct purposes of each preprocessing tool, I'm not altogether certain it gets there. Overall, it is an improvement, but I'm certainly not persuaded by the MultiLabelBinarizer example as advising best practice.

@hlin117
Contributor

hlin117 commented Sep 11, 2016

@jnothman I do like seeing the added links in the documentation, but the trick with the MultiLabelBinarizer does seem a bit hacky. At the same time, though, it is unfortunate that there is no way to do a one-hot encoding for a 2-class problem in scikit-learn. (But to be fair, I don't know how often people will encounter this use case.)

@hlin117
Contributor

hlin117 commented Sep 11, 2016

How about removing the example with the MultiLabelBinarizer and merging this PR with just the added links? (Comment updated)

@jnothman
Member

I think you mean MultiLabelBinarizer. yes, that would be my preference.

@kathyxchen
Contributor Author

Hi, sorry about the delay in getting this commit updated! (Have been without a laptop this past week while it was getting replaced.) I removed the example from MultiLabelBinarizer. Let me know if any other issues need to be addressed! @jnothman

@hlin117 hlin117 left a comment

LGTM. If you want to go the extra mile, you can compile the documentation and show us how it looks. (Ask us if you don't know how to do this.)

@jnothman jnothman left a comment

Otherwise LGTM

@@ -1737,6 +1737,9 @@ class OneHotEncoder(BaseEstimator, TransformerMixin):
This encoding is needed for feeding categorical data to many scikit-learn
estimators, notably linear models and SVMs with the standard kernels.

Note: a one-hot encoding of y labels should use a MultiLabelBinarizer
Member

a LabelBinarizer, really, for the usual single-valued meaning of OHE.

Contributor Author

Latest commit makes this correction.

@kathyxchen
Contributor Author

kathyxchen commented Sep 21, 2016

Here are the changes to the documentation @hlin117:

  • See Also sections for LabelEncoder, LabelBinarizer, and MultiLabelBinarizer (MultiLabelBinarizer also gained an example). [screenshots: labelencoder, labelbinarizer, multilabelbinarizer]
  • One line added to the docstring for the OneHotEncoder class, as well as an updated See Also section. [screenshots: onehotencoder_doc2, onehotencoder_doc1]

@hlin117
Contributor

hlin117 commented Sep 21, 2016

Hmm, is it just the screenshot, or is the "See also" section for LabelEncoder indented a level too much? (nit pick)

Pro tip: you can save an entire HTML page as a PDF, and then upload the PDF to GitHub.

@kathyxchen
Contributor Author

Ah, sorry about that. I've realized the way I organized the screenshots made the indentations look inconsistent. All of the page formats match what is currently on the live site.
(Thanks for the tip, I will do that next time, haha. :))

@jnothman jnothman changed the title [MRG] Docs: refer users to the other encoders to do one hot encoding for labels. [MRG+1] Docs: refer users to the other encoders to do one hot encoding for labels. Sep 21, 2016
@amueller amueller merged commit d5b66f9 into scikit-learn:master Oct 5, 2016
amueller pushed a commit to amueller/scikit-learn that referenced this pull request Oct 14, 2016
…g for labels. (scikit-learn#7315)

* refer users to the other encoders to do one hot encoding for labels.

* added to the 'see more' for labelbinarizer, multilabelbinarizer, and labelencoder' as well as an example to multilabel binarizer

* added note about y labels to the OneHotEncoder docstring

* removed example from MultiLabelBinarizer

* documentation should specify LabelBinarizer, not MultiLabelBinarizer in OHE
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
…g for labels. (scikit-learn#7315)

paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
…g for labels. (scikit-learn#7315)
