FIX: deprecate integer valued numerical features for PDP #30409
I'm okay with warning and erroring in the future here.
Thanks @ogrisel, I should have shared my findings; that might have helped.
I agree with the general idea of the PR. I was leaning towards the same solution: either treat an integer dtype as a categorical feature, or otherwise require an explicit floating-point dtype.
I just saw that I merged this PR, and it does indeed target 1.6. We could always add it in 1.6.1 because this is kind of a fix + deprecation. @jeremiedbb do you think that we still have time to backport the commit?
I merged the release PR but I haven't pushed the tag yet, so I guess it's doable.
This is a tentative fix for #30315 and #30378.
The problem is that new pandas versions refuse to assign floating-point values into integer dtyped columns. In PDP, such fractional floating-point values can be generated when creating the grid, even if the original numerical feature is integer valued.
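To make the problem concrete, here is a minimal sketch using only NumPy. It is not the actual grid code from the PDP implementation: the feature values and the use of `np.linspace` are assumptions chosen for illustration. It shows how an evenly spaced grid over an integer-valued feature contains fractional values, and how plain NumPy silently truncates such a value when it is written back into an integer array, whereas, as described above, new pandas versions refuse the equivalent assignment.

```python
import numpy as np

# Illustrative integer-valued numerical feature (assumed data).
feature = np.array([1, 2, 3, 4, 10], dtype=np.int64)

# An evenly spaced grid between the min and max of the feature.
# Even though the feature is integer valued, the grid is floating point.
grid = np.linspace(feature.min(), feature.max(), num=5)
has_fractional = bool(np.any(grid != np.round(grid)))

# Writing a fractional grid value into an integer array: NumPy casts on
# assignment and drops the fractional part without any warning.
X_eval = feature.copy()
X_eval[:] = grid[1]

print(grid, has_fractional, X_eval)
```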
In a first (and later abandoned) version of this PR, I tried to transparently change the dtype of `X_eval` (or at least of the columns where we want to insert those values when `X` is a dataframe). However, this leads to very pandas- and numpy-specific code paths, and this would need to be expanded for polars later. I have the feeling that this will become unmanageable quickly.
Furthermore, changing the dtype of (some columns of) `X_eval` means that we are calling the response method of the estimator with different dtypes than the `X_train` used to fit the estimator. I have the feeling that this can cause weird bugs.
So this PR tries to explicitly deprecate the support of integer values (used as numerical features) in PDPs and later raise a `ValueError` instead: the error message should be explicit enough to let the user know what to do to update their code. I thought this would be simple, but as you can see in this PR, this is a bit more complex than anticipated.
After creating this PR I am thinking of a final alternative: instead of changing the dtype in `X_eval`, we could round the fractional values.
This last alternative will be less intrusive (and in particular will match the implicit behavior of numpy and old pandas versions). But it can be surprising: we generate a fine grid but then it is implicitly coarsened when computing the PDP value. This should not be too complex to implement in a somewhat container-agnostic way (using `_safe_indexing`/`_safe_assign` and `.astype`). However, this might be a bit magic. Not sure what is best.
Would love to get feedback from @glemaitre or @lesteve for instance.
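The rounding alternative could look roughly like the sketch below. This is only an illustration under stated assumptions: `coarsen_grid_to_dtype` is a hypothetical helper name, not something from this PR or from scikit-learn, and the sketch works on plain NumPy arrays rather than going through `_safe_indexing`/`_safe_assign`. It rounds to the nearest integer with `np.rint` before casting, which differs slightly from a bare `.astype` cast (that would truncate).

```python
import numpy as np


def coarsen_grid_to_dtype(grid, dtype):
    """Hypothetical helper: make grid values fit the dtype of the column.

    For integer dtypes, fractional values are rounded to the nearest
    integer and then cast. Other dtypes are only cast.
    """
    dtype = np.dtype(dtype)
    grid = np.asarray(grid)
    if np.issubdtype(dtype, np.integer):
        return np.rint(grid).astype(dtype)
    return grid.astype(dtype)


# A fine floating-point grid over an integer-valued feature (assumed data).
fine_grid = np.linspace(1, 10, num=5)

# The grid is coarsened to the integer dtype of the original column.
coarse_grid = coarsen_grid_to_dtype(fine_grid, np.int64)

# A floating-point column keeps the fine grid unchanged.
float_grid = coarsen_grid_to_dtype(fine_grid, np.float64)

print(fine_grid, coarse_grid, float_grid)
```

The surprising part mentioned above is visible here: the fine grid is requested, but the values actually evaluated are the coarsened ones, and distinct grid points can collapse onto the same integer when the grid is finer than the integer spacing.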