[WIP] "other"/min_freq in OneHot and OrdinalEncoder #12264

Closed
wants to merge 3 commits into from

Conversation

datajanko
Contributor

@datajanko datajanko commented Oct 3, 2018

Reference Issues/PRs

Fixes: #12153

What does this implement/fix? Explain your changes.

Currently, this adds an optional frequency threshold to OneHotEncoder and OrdinalEncoder.
All categories whose frequency falls below this threshold are determined, sorted, and mapped to the first category.
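
The behaviour described above can be sketched in plain Python. This is only an illustration, not the PR's code: the helper name `group_infrequent` is hypothetical, `min_freq` is assumed to be a fraction of the samples, and "the first category" is assumed to mean the first of the sorted low-frequency categories.

```python
from collections import Counter


def group_infrequent(values, min_freq):
    """Map each category rarer than min_freq to the first rare category.

    Hypothetical helper: min_freq is read as a relative frequency.
    """
    counts = Counter(values)
    n_samples = len(values)
    # Categories whose relative frequency is below the threshold, sorted.
    infrequent = sorted(c for c, k in counts.items() if k / n_samples < min_freq)
    if not infrequent:
        return list(values), {}
    representative = infrequent[0]
    mapping = {c: representative for c in infrequent}
    return [mapping.get(v, v) for v in values], mapping


values = ["a", "a", "a", "a", "b", "b", "b", "c", "d", "a"]
grouped, mapping = group_infrequent(values, min_freq=0.2)
print(grouped)
print(mapping)
```

Here "c" and "d" each make up 10% of the samples, so both end up represented by "c", while "a" and "b" are left untouched.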

What needs to be done?

  • Adds min_df with implementation to Ordinal- and OneHotEncoder
  • Example in examples/ folder
  • Documentation
  • Probably add more tests and remove some tests
  • Add an option to name the "other" group -> What to do if the values are not object/str? What happens if "other" is already there?
  • With a threshold, encoders are not "really" invertible anymore -> add at least documentation?
  • Align if function names are appropriate
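
The invertibility point in the list above can be made concrete with a small sketch. The grouping and the category list below are made-up data, and `encode`/`decode` are hypothetical stand-ins for an ordinal encoder's transform and inverse transform.

```python
# Hypothetical result of applying a frequency threshold: "d" was folded into "c".
grouping = {"d": "c"}
categories = ["a", "b", "c"]  # categories that remain after grouping
code_of = {cat: i for i, cat in enumerate(categories)}


def encode(values):
    # Apply the grouping first, then look up the ordinal code.
    return [code_of[grouping.get(v, v)] for v in values]


def decode(codes):
    # The inverse can only return the representative of a group.
    return [categories[c] for c in codes]


original = ["a", "d", "c", "b"]
round_trip = decode(encode(original))
print(round_trip)
```

The round trip returns "c" where the input had "d", so the inverse transform is only exact for categories above the threshold. This is the limitation that would need documenting.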

Any other comments?

Further points of extension:

  • Instead of min_freq, add top_n categories. Moreover, one could use integers instead of floats in min_freq. top_n and min_freq could interact
  • Allow an array of frequencies for each feature
  • One could provide a mapping to group certain values together in one category. It might be, though, that a different Encoder would be more suitable
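
One way the first extension could work is sketched below. Everything here is an assumption made for illustration: the function name `frequent_categories` is hypothetical, an integer `min_freq` is read as an absolute count and a float as a fraction of the samples, and the two options interact by requiring a category to satisfy both.

```python
from collections import Counter


def frequent_categories(values, min_freq=None, top_n=None):
    """Return the sorted categories that survive both optional filters."""
    counts = Counter(values)
    n_samples = len(values)
    # Rank by descending count; ties are broken by the category value.
    kept = sorted(counts, key=lambda c: (-counts[c], c))
    if top_n is not None:
        kept = kept[:top_n]
    if min_freq is not None:
        if isinstance(min_freq, int):
            # Integer threshold: minimum absolute count.
            kept = [c for c in kept if counts[c] >= min_freq]
        else:
            # Float threshold: minimum fraction of the samples.
            kept = [c for c in kept if counts[c] / n_samples >= min_freq]
    return sorted(kept)


values = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
print(frequent_categories(values, top_n=3))
print(frequent_categories(values, min_freq=2))
print(frequent_categories(values, min_freq=0.2, top_n=3))
```

With both options set, "c" makes the top three but is still dropped by the threshold. Other interaction rules (for example, taking the union) are equally conceivable.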

@amueller
Member

amueller commented Oct 3, 2018

thanks. Looks like you've got merge conflicts, though :-/

J42994 added 2 commits October 4, 2018 07:39
provide tests:
- tests for different frequency values
- otherwise tests similar to that of _encode
- adds min_freq keyword to ordinal and onehot encoder and adds the necessary calls to BaseEncoder
- improves tests on _group_values
- adds tests that ensure that fit does not alter the input array.
@datajanko
Contributor Author

So I was able to rebase, but encountered another error. Will push an update later

@jorisvandenbossche
Member

General design question: the docstring you added says "group low frequent categories together", but above you say "All categories below this threshold are ... mapped to the first category.".
I would understand the first as combining all low-frequency categories into a separate category, which is different from mapping them to the first category (which one is first also depends on the sort order of the categories)

Member

@jnothman jnothman left a comment

I'm not sure why you want to do this before encoding. Can it be done in the _encode functions in any case?

Although this does get trickier when you try to do it in an OrdinalEncoder context.

@@ -186,6 +190,9 @@ class OneHotEncoder(_BaseEncoder):
0.20 and will be removed in 0.22.
You can use the ``ColumnTransformer`` instead.

min_freq: float, default=0
Member

space before colon, please

@@ -186,6 +190,9 @@ class OneHotEncoder(_BaseEncoder):
0.20 and will be removed in 0.22.
You can use the ``ColumnTransformer`` instead.

min_freq: float, default=0
group low frequent categories together
Member

be more specific, please. This should describe what the parameter is.

@datajanko
Contributor Author

@jorisvandenbossche
Currently, because it's easier, I'm just selecting the first element as the key of the new groups (groups can be ints, so a string "other" will not always be feasible, the groups will be always sorted). That's why I chose this easy solution.

If one wants to add a new label, one has to check whether the label is already there or find a label automatically. This can become complex. I'm open to suggestions on what to do here.
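
To make the labelling problem concrete, here is a minimal sketch of one possible rule. It is not a proposal from this thread: the helper `pick_other_label` is hypothetical, and both the underscore suffix for strings and the "max + 1" choice for integers are assumptions.

```python
def pick_other_label(categories, preferred="other"):
    """Pick a label for the grouped category that does not collide."""
    cats = set(categories)
    if all(isinstance(c, str) for c in cats):
        label = preferred
        # Append underscores until the label is unused.
        while label in cats:
            label += "_"
        return label
    if all(isinstance(c, int) for c in cats):
        # For integer categories, use a value just above the largest one.
        return max(cats) + 1
    raise TypeError("mixed or unsupported category types")


print(pick_other_label(["a", "b"]))
print(pick_other_label(["a", "other"]))
print(pick_other_label([1, 2, 5]))
```

Even this small rule has side effects: for integers the new label always sorts last, and for other dtypes (floats, mixed objects) no obvious choice exists, which is part of why this can become complex.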

@jnothman
I think it can be done inside the _encode function as well. However, I thought it was good practice to separate the concerns, which is why I wanted to keep things separate here. Moreover, the logic in _encode would become more complex, specifically if one adds a top_n keyword. But in that case it might even make sense to encode the categories according to rank. So there are arguments for adding this to _encode.
If you prefer that, I'll just move everything there.

I'll update the documentation asap

@jorisvandenbossche
Member

I don't have the answer, but I do think that the choice should not be based on what is "easier" to implement. I think both ways are possible (although one is more complex than the other), and we should choose the behaviour we want based on what makes most sense from a machine learning point of view.

@NicolasHug
Member

@datajanko are you planning to work on this again soonish?

If not I'll give it a try ;)

@datajanko
Contributor Author

Currently, my schedule is quite rough, so please go for it. However, if I recall correctly, there was some helpful work on adding nan support in onehotencoder or ordinal encoder. I think using that would be the easiest way to implement the feature without changing too much. I don't know the status on the issue though and can't find it.

@datajanko
Contributor Author

You should not proceed here until missing values are added to the onehot encoder, see #13028 #11996
The idea was to treat the groups below the minimum frequencies as missing, thus reusing the code from there.

@NicolasHug
Member

Could you please expand a bit, @datajanko?

I don't understand why we need to wait for nan support here.

@datajanko
Contributor Author

So the simplest idea we had was: map all the low-frequency groups to NaN and then use the implementation with nan. This would mean a low implementation effort

Besides, I don't recall the details precisely, but I think I had some issues in my implementation related to nan values (could be related to non existing values in the test set). I just recall: wait for the nan implementation.

However, you are of course free to choose any approach you like, and maybe I just overlooked an obvious direct solution here.
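
The "map to NaN" idea can be sketched as follows. This is an illustration under stated assumptions, not the design of #13028 or #11996: `None` stands in for NaN, `min_freq` is read as a fraction of the samples, and the missing-value support is assumed to give missing entries their own trailing column. Both helper names are hypothetical.

```python
from collections import Counter

MISSING = None  # stand-in for NaN in this sketch


def mask_infrequent(values, min_freq):
    """Replace categories rarer than min_freq with the missing marker."""
    counts = Counter(values)
    n_samples = len(values)
    return [v if counts[v] / n_samples >= min_freq else MISSING for v in values]


def one_hot_with_missing(values):
    """One-hot encode, giving the missing marker its own last column."""
    categories = sorted({v for v in values if v is not MISSING})
    columns = categories + [MISSING]
    index = {c: i for i, c in enumerate(columns)}
    rows = []
    for v in values:
        row = [0] * len(columns)
        row[index[v]] = 1
        rows.append(row)
    return columns, rows


values = ["a", "a", "b", "c", "a", "b"]
masked = mask_infrequent(values, min_freq=0.2)
columns, rows = one_hot_with_missing(masked)
print(columns)
print(rows)
```

One consequence of this approach is that rare categories and genuinely missing values become indistinguishable after encoding, which may or may not be desirable.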

@FedericoV
Contributor

Hi @datajanko - are you planning to continue to work on this? I was trying to solve the exact same problem.

@NicolasHug
Member

I'm on it @FedericoV

@FedericoV
Contributor

Cool, let me know if you need me to test out a new branch @NicolasHug

@FedericoV
Contributor

Hi @NicolasHug - did you go ahead and make any headway on this or did you decide to abandon it for now?

@NicolasHug
Member

I implemented #13833. It's waiting for feedback. These things are much more complicated than they look

@FedericoV
Contributor

FedericoV commented Jun 3, 2019 via email

@NicolasHug
Member

NicolasHug commented Jun 3, 2019

Of course! that'd be very helpful

> These things are much more complicated than they look

And don't worry that wasn't directed to you, more like a rant to myself ;)

Base automatically changed from master to main January 22, 2021 10:50
@cmarmo added the Superseded label (PR has been replaced by a newer PR) and removed the Waiting for Reviewer label on Feb 5, 2022
@amueller
Member

Should we say closed via #16018? That one doesn't have OrdinalEncoder, though.

@thomasjpfan
Member

I am okay with closing. Only OneHotEncoder was requested in the original issue. If OrdinalEncoder is requested, it should be able to use the same code in #16018. (By moving the new methods for infrequent categories up to the parent _BaseEncoder class.)
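
The refactoring suggested above, sharing the infrequent-category logic through a parent class, might look roughly like the sketch below. The class names are hypothetical and this is not scikit-learn's code; `min_frequency` is assumed to be an absolute count, and the infrequent group is assumed to take one trailing output slot.

```python
from collections import Counter


class BaseEncoderSketch:
    """Hypothetical parent class holding the shared infrequent-category logic."""

    def __init__(self, min_frequency=1):
        self.min_frequency = min_frequency

    def _fit_categories(self, values):
        counts = Counter(values)
        self.frequent_ = sorted(c for c, k in counts.items() if k >= self.min_frequency)
        self.infrequent_ = sorted(c for c, k in counts.items() if k < self.min_frequency)
        # Reserve one trailing slot for the infrequent group, if there is one.
        self.n_outputs_ = len(self.frequent_) + bool(self.infrequent_)

    def _index(self, value):
        if value in self.frequent_:
            return self.frequent_.index(value)
        if value in self.infrequent_:
            return len(self.frequent_)
        raise ValueError(f"unknown category: {value!r}")


class OrdinalSketch(BaseEncoderSketch):
    def fit_transform(self, values):
        self._fit_categories(values)
        return [self._index(v) for v in values]


class OneHotSketch(BaseEncoderSketch):
    def fit_transform(self, values):
        self._fit_categories(values)
        return [
            [int(i == self._index(v)) for i in range(self.n_outputs_)]
            for v in values
        ]


values = ["a", "a", "b", "b", "c", "d"]
ordinal = OrdinalSketch(min_frequency=2).fit_transform(values)
one_hot = OneHotSketch(min_frequency=2).fit_transform(values)
print(ordinal)
print(one_hot)
```

Both subclasses only differ in how they turn the shared index into output, which is the reason the same grouping code could serve an ordinal encoder as well.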

Thank you @datajanko for looking into the issue!

Labels
module:preprocessing, Superseded (PR has been replaced by a newer PR)
Development

Successfully merging this pull request may close these issues.

Add "other" / min_frequency option to OneHotEncoder
8 participants