Handle pd.Categorical in encoders #14953
Comments
@jnothman Hi, I am a bit new to open source and would like to work on this issue, if available. Let me know if I got things correct:
if any of the above points holds true, we need to raise a warning. A few queries,
A few queries in general, if you don't mind answering them -
Thanks. |
I think check_array treats pandas categoricals as strings for now. You can
try!
And the condition for warning is that all of the above hold, not any.
|
@jnothman Thanks.
Do we have a lexicographically sorted feature list? And what is feature's Thanks. |
I mean that if we use the current encoding logic, we sort the feature's
values. But if the feature has a categorical dtype, it also specifies an
ordering (x.dtype.categories where x is a feature column).
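A small sketch of the mismatch described above. The column values are made up for illustration, and plain `sorted` stands in for the encoder's sorting step (an assumption, not the actual scikit-learn code path):

```python
import pandas as pd

# Hypothetical feature column with a user-specified category order.
x = pd.Series(
    pd.Categorical(
        ["low", "high", "medium", "low"],
        categories=["low", "medium", "high"],
    )
)

# Stand-in for the current encoding logic: sort the observed values.
sorted_values = sorted(set(x.tolist()))

# Order carried by the categorical dtype itself.
dtype_order = list(x.dtype.categories)

print(sorted_values)  # ['high', 'low', 'medium']
print(dtype_order)    # ['low', 'medium', 'high']
```

The two orders disagree, which is the situation the proposed warning targets.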
|
@jnothman Hey, I thought of doing it as - Is it a feasible approach? Thanks. |
I agree with the proposed behavior, but I doubt this issue will be easy ;) [also this issue is tagged easy and moderate?] |
@amueller Thanks for the heads up, I won't be opening any PR until I make sure that the approach is correct and get it reviewed by you guys.
And from your reply I think that this approach won't be any good. Still working on it. Thanks. |
@jnothman I was going through the If, so I think we need to verify the three conditions to raise the warning in the Thanks. |
No, modifying check_array sounds like a bad idea. You might rather need to
do some checking (or setting flags) before the call to check_array.
|
It would be nice to have support for this now and not wait till 0.24. Is adding option in |
categories='dtype'?
|
This duplicates #12086 somewhat; there is some discussion there as well (but basically what @jnothman is proposing here is to introduce a warning for it). Will close the other issue. I agree with @thomasjpfan that it would be nice to already have the new behaviour now, but I was wondering if we cannot combine that with the changes in OrdinalEncoder for strings with BTW, I don't think that the actual implementation to use the categorical dtype's categories should be very hard (there is actually a PR for this: #13351), as the required preparatory work to handle a DataFrame column by column is already done (#13253). |
The question is also: what default behaviour do we want in the long run? EDIT: hmm, of course what I am forgetting is that users can already explicitly do |
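For reference, one way a user can hand the dtype's order to an encoder explicitly is the existing `categories` parameter. This is only a sketch with a made-up DataFrame; it is not necessarily the exact snippet the comment above had in mind:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: one categorical column with a non-lexicographic order.
df = pd.DataFrame(
    {
        "size": pd.Categorical(
            ["low", "high", "medium", "low"],
            categories=["low", "medium", "high"],
        )
    }
)

# Default: categories="auto", the behaviour under discussion in this thread.
auto_enc = OrdinalEncoder().fit(df)

# Explicit: pass the dtype's own category order to the constructor.
explicit_enc = OrdinalEncoder(
    categories=[list(df["size"].cat.categories)]
).fit(df)

print(auto_enc.categories_[0])
print(explicit_enc.categories_[0])
explicit_codes = explicit_enc.transform(df).ravel().tolist()
print(explicit_codes)
```

With the explicit list, the integer codes follow the dtype's order rather than the sorted order.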
In the long run I would want (This is a fairly long deprecation path) |
Yes, I was mainly trying to think if we can't do it with a shorter path, without having a new option that afterwards becomes obsolete. But yeah, as I noted in my edited comment above, it's difficult to do that in a fully backwards compatible way. |
We only want to warn when the Categorical dtype is ordered right? Categories aren't necessarily ordered and this is actually pandas' default. The current PR #15396 warns when categories aren't ordered which seems wrong to me. (Sorry if this has been previously discussed) |
pandas uses lexicographic order for its encoding when the categorical dtype is unordered, so it so happens that this is okay. Although I agree we should not rely on this, and warn when the category is ordered. |
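A quick check of the pandas behaviour referred to here, with made-up values. Note that this holds when pandas infers the categories from the data; categories passed explicitly keep the order they were given in:

```python
import pandas as pd

# Categories are not specified, so pandas infers them from the values.
cat = pd.Categorical(["b", "a", "c", "a"])

print(cat.ordered)           # False
print(list(cat.categories))  # ['a', 'b', 'c']
print(cat.codes.tolist())    # [1, 0, 2, 0]
```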
As I noted in the pr, I would be more comfortable warning in any case
precisely because the categoricals do not have the ordered flag set by
default.
|
Ordinal categories are far less common than pure nominal categories, so IMHO the pandas default makes sense, and we would be warning for no good reason in most cases. Why would you want to warn something about the order when there is no order in the first place? |
This comes back to what someone is using an OrdinalEncoder for... The
ordering obviously has an effect.
|
Sure, but then:
|
Sigh. Yes, people use the OrdinalEncoder just to turn strings into ints so
they can be fed to a forest.
|
... which makes absolutely no sense unless those strings are ordered |
I think that #14984, #15050, and #15396 might not be blockers for 0.22 and I would move them to 0.23. I think that it would be great to have a single issue (superseding #14953, #14954) to discuss the overall behaviour for |
@NicolasHug I would not be so sure about this. Trees can cope with |
True, but trees aren't always deep, typically in GB. To reproduce an OHE split using OE, you would need in the worst case C - 1 splits. That's not negligible when the OHE splits multiple times on the same feature during fitting. And because of the arbitrary order, you might just never consider such a split because the gain is too low. The right way to handle nominal categories in trees is still to use an OHE, or to natively support categories like the nocats PRs. In any case, going back to the original issue, my concern here is that the current proposal is to raise an order-related warning even when there is actually no order, which I think will just confuse / frustrate users. |
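One possible reading of the C - 1 figure, as a counting sketch only. The helper below is hypothetical and does not model any actual tree implementation: it counts how many threshold splits on ordinal codes are needed to separate an arbitrary grouping of categories.

```python
def thresholds_needed(goes_left):
    """Count threshold splits on ordinal codes needed to separate the
    categories flagged True from those flagged False.

    goes_left[i] is the side that the category with ordinal code i belongs to.
    A threshold is needed between every pair of adjacent codes whose sides differ.
    """
    return sum(a != b for a, b in zip(goes_left, goes_left[1:]))


C = 5  # hypothetical number of categories

first_only = [True] + [False] * (C - 1)       # isolate the category with code 0
middle_only = [i == 2 for i in range(C)]      # isolate the category with code 2
alternating = [i % 2 == 0 for i in range(C)]  # worst case for an arbitrary order

print(thresholds_needed(first_only))   # 1
print(thresholds_needed(middle_only))  # 2
print(thresholds_needed(alternating))  # 4
```

Under this reading, the worst case of C - 1 thresholds arises when the arbitrary code order interleaves the two groups.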
I would not raise a warning but maybe I assume that people know what they
are doing...
|
Also note that you can have a pandas Categorical with a specific order without having it "ordered" (in the sense of the Eg:
So when people are passing this to an OrdinalEncoder, they might actually be doing it "correctly" already: even though it is not an "ordered" categorical, it just happens to have its categories in the sensible (non-lexicographic) order. |
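To make the distinction concrete, here is a minimal sketch with made-up values: a Categorical whose categories are in a deliberate order while its `ordered` flag is left at the default.

```python
import pandas as pd

# Categories given in a deliberate, non-lexicographic order; ordered is not set.
sizes = pd.Categorical(
    ["medium", "small", "large"],
    categories=["small", "medium", "large"],
)

print(sizes.ordered)           # False
print(list(sizes.categories))  # ['small', 'medium', 'large']
print(sizes.codes.tolist())    # [1, 0, 2]

# Setting the flag does not change the categories or the codes.
sizes_ordered = sizes.as_ordered()
print(sizes_ordered.ordered)   # True
```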
@thomasjpfan two PRs of yours are meant to close this issue: #15050 has two approvals, so I have milestoned it against #15396. I'm unable to understand if they are both necessary. Do you mind clarifying and fixing conflicts (in one or both), if you think the milestone is still relevant? Thanks a lot for your collaboration. |
I think the end goal is to have "auto" == use the encoding provided by pandas, at least for #15050 is a warning to tell the user that the order we use does not match the pandas categorical. I can update this to say that "auto" will use the pandas ordering in 0.26. This warning can be a little annoying because the only way to avoid it is to use a Python warnings filter. #15396 would need to be updated to adjust 'auto' and then not merged until 0.26, because it will contain the implementation for using the pandas categorical. |
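For context, this is roughly what silencing such a warning with a Python warnings filter would look like. The function `fit_stand_in` is a hypothetical placeholder that only emits the message proposed in this issue; it is not an encoder and the message may differ from whatever is eventually merged.

```python
import warnings

# Message text taken from the proposal in this issue.
PROPOSED_MESSAGE = (
    "'auto' categories is used, but the Categorical dtype provided is not "
    "consistent with the automatic lexicographic ordering"
)


def fit_stand_in():
    """Hypothetical placeholder for an encoder's fit that emits the warning."""
    warnings.warn(PROPOSED_MESSAGE, UserWarning)
    return "fitted"


def fit_quietly():
    """Call the stand-in with the proposed warning filtered out."""
    with warnings.catch_warnings():
        # The message argument is a regex matched against the start of the text.
        warnings.filterwarnings(
            "ignore",
            message="'auto' categories is used",
            category=UserWarning,
        )
        return fit_stand_in()


print(fit_quietly())  # fitted
```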
Is this something where we should just break auto behaviour for pandas
categoricals in version 1.0??
|
Doesn't seem like we'd get this in time. Moving to 2.0 |
Is there a current/recent issue tracking the support for pd.Categorical? So it was added to hgb, but not anywhere else, right? |
We have a related PR that wants to leverage this feature: #27911 |
Thanks, that's quite interesting, but also somewhat orthogonal. I was thinking more about the API surface. Currently a user doesn't know whether categorical features are treated correctly or not in a model without reading the documentation of each model and default hyper-parameters. Passing categoricals encoded as integers and with categorical dtype to |
Indeed, we did not have the discussion. For linear model, we went kind of backward by deprecating The pattern of using the But I rather think that we should have the discussion on that topic. |
In sklearn.preprocessing._encoders._BaseEncoder, columns with pd.Categorical dtype are converted to arrays (scikit-learn/sklearn/preprocessing/_encoders.py, line 60 in 03ea20d).
If the categories ordering is explicitly specified by the user to the constructor of OneHotEncoder or OrdinalEncoder, then this is fine... but if 'auto' is used, lexicographic ordering will be assumed, disregarding the encoding order determined by the Categorical dtype.
I propose that we raise a warning if:
dtype.categories ordering.
The warning might be something like UserWarning("'auto' categories is used, but the Categorical dtype provided is not consistent with the automatic lexicographic ordering") ... or else something more intelligible.
We may change this warning to a FutureWarning with "From version 0.24 the category ordering specified by a Categorical dtype will be respected in encoders."
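A rough sketch of how such a check could look. The helper name is hypothetical and not part of scikit-learn, and the three conditions it tests ('auto' is used, the column has a Categorical dtype, and the sorted order differs from the dtype's order) are an assumption based on the discussion in the comments:

```python
import warnings

import pandas as pd


def warn_if_order_mismatch(column, categories="auto"):
    """Hypothetical helper: warn only when all assumed conditions hold."""
    if categories != "auto":
        return False  # the user specified the ordering explicitly
    if not isinstance(column.dtype, pd.CategoricalDtype):
        return False  # not a Categorical column
    dtype_order = list(column.dtype.categories)
    if dtype_order == sorted(dtype_order):
        return False  # lexicographic order already matches the dtype
    warnings.warn(
        "'auto' categories is used, but the Categorical dtype provided is "
        "not consistent with the automatic lexicographic ordering",
        UserWarning,
    )
    return True


col = pd.Series(
    pd.Categorical(["low", "high"], categories=["low", "medium", "high"])
)
```

Calling `warn_if_order_mismatch(col)` emits the warning, while a column whose categories are already sorted, or an explicit `categories` list, does not.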