-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[WIP] "other"/min_freq in OneHot and OrdinalEncoder #12264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
thanks. Looks like you've got merge conflicts, though :-/ |
provide tests: - tests for different frequency values - otherwise tests similar to that of _encode
- adds min_freq keyword to ordinal and onehot encoder and adds the necessary calls to BaseEncoder - improves tests on _group_values - adds tests that ensure that fit does not alter the inputarray.
16113a7
to
512a567
Compare
So I was able to rebase, but encountered another error. Will push an update later |
General design question: the docstring you added says "group low frequent categories together", but above you say "All categories below this threshold are ... mapped to the first category.". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why you want to do this before encoding. Can it be done in the _encode
functions in any case?
Although this does get trickier when you try to do it in an OrdinalEncoder context.
@@ -186,6 +190,9 @@ class OneHotEncoder(_BaseEncoder): | |||
0.20 and will be removed in 0.22. | |||
You can use the ``ColumnTransformer`` instead. | |||
|
|||
min_freq: float, default=0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
space before colon, please
@@ -186,6 +190,9 @@ class OneHotEncoder(_BaseEncoder): | |||
0.20 and will be removed in 0.22. | |||
You can use the ``ColumnTransformer`` instead. | |||
|
|||
min_freq: float, default=0 | |||
group low frequent categories together |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
be more specific, please. This should describe what the parameter is.
@jorisvandenbossche If one wants to add a new label, one has to check if the label is there already of find a label automatically. This can become complex. I'm open to suggestion on what to do here. @jnothman I'll update the documentation asap |
I don't have the answer, but I only think that it should not be based on "easier" to implement. I think both ways are possible (although the one more complex than the other), and we should choose what behaviour we want based on what makes most sense from a machine learning point of view. |
@datajanko are you planning to work on this again soonish? If not I'll give it a try ;) |
Currently, my schedule is quite rough, so please go for it. However, if I recall correctly, there was some helpful work on adding nan support in onehotencoder or ordinal encoder. I think using that would be the easiest way to implement the feature without changing too much. I don't know the status on the issue though and can't find it. |
Could you please expand a bit @datajanko please? I don't understand why we need to wait for nan support here. |
So the simplest idea we had was: map all the low-frequency groups to NaN and then use the implementation with nan. This would mean a low implementation effort Besides, I don't recall the details precisely, but I think I had some issues in my implementation related to nan values (could be related to non existing values in the test set). I just recall: wait for the nan implementation. However, you are of course free to choose any approach you like and maybe I just oversaw an obvious direct solution here. |
Hi @datajanko - are you planning to continue to work on this? I was trying to solve the exact same problem. |
I'm on it @FedericoV |
Cool, let me know if you need me to test out a new branch @NicolasHug |
Hi @NicolasHug - did you go ahead and make any headway on this or did you decide to abandon it for now? |
I implemented #13833. It's waiting for feedback. These things are much more complicated than they look |
Oh awesome, thank you so much! I didn't mean to be impatient, I just
hadn't seen your PR. Do you mind if I play with it a bit? I might be able
to catch some bugs.
…On Mon, Jun 3, 2019 at 2:24 PM Nicolas Hug ***@***.***> wrote:
I implemented #13833
<#13833>. It's waiting
for feedback. These things are much more complicated than they look
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#12264?email_source=notifications&email_token=AAEZ25VXIXUCP4TJSION7Z3PYWDXDA5CNFSM4FY2ZX42YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODW2XLTY#issuecomment-498431439>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEZ25XUGSSGNMNF6IC44M3PYWDXDANCNFSM4FY2ZX4Q>
.
|
Of course! that'd be very helpful
And don't worry that wasn't directed to you, more like a rant to myself ;) |
Should we say closed via #16018? That one doesn't have OrdinalEncoder, though. |
I am okay with closing. Only Thank you @datajanko for looking into the issue! |
Reference Issues/PRs
Fixes: #12153
What does this implement/fix? Explain your changes.
Currently, adds the option to add a frequency threshold to OneHot- and OrdinalEncoder.
All categories below this threshold are determined, sorted and mapped to the first category.
What needs to be done?
min_df
with implementation to Ordinal- and OneHotEncoderexamples/
folderother
group -> What to do if not object/str? What happens ifother
already there?Any other comments?
Further points of extension:
min_freq
addtop_n
categories. Moreover, one could use integers instead of floats inmin_freq
.top_n
andmin_freq
could interact