FIX: deprecate integer valued numerical features for PDP #30409


Conversation

ogrisel
Member

@ogrisel ogrisel commented Dec 4, 2024

This is a tentative fix for #30315 and #30378.

The problem is that new pandas versions refuse to assign floating-point values into integer dtyped columns. In PDP, such fractional floating-point values can be generated when creating the grid, even if the original numerical feature is integer valued.
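To illustrate the problem described above, here is a minimal numpy-only sketch (the feature values and grid size are illustrative, not taken from the PR): the evaluation grid built between the feature's extrema is generally fractional even when the column is integer dtyped, and assigning such values back into an integer container either silently truncates (numpy, old pandas) or is refused (new pandas).

```python
import numpy as np

# A hypothetical integer-valued numerical feature, e.g. "number of rooms".
X = np.array([[1], [2], [3], [8]], dtype=np.int64)

# PDP builds an evaluation grid over the feature's range; with np.linspace
# the grid points are generally fractional even though the original column
# is integer dtyped.
grid = np.linspace(X[:, 0].min(), X[:, 0].max(), num=5)
# grid values: 1.0, 2.75, 4.5, 6.25, 8.0

# Assigning a fractional value into an integer numpy array silently
# truncates it; new pandas versions refuse the equivalent assignment into
# an integer dtyped column instead of truncating.
X_eval = X.copy()
X_eval[:, 0] = grid[2]  # 4.5 is truncated to 4
```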

In a first (and later abandoned) version of this PR, I tried to transparently change the dtype of X_eval (or at least of the columns where we insert those values when X is a dataframe). However, this leads to very pandas- and numpy-specific code paths that would later need to be extended for polars. I have the feeling that this would quickly become unmanageable.

Furthermore, changing the dtype of (some columns of) X_eval means that we call the response method of the estimator with different dtypes than those of the X_train used to fit it. I have the feeling that this can cause subtle bugs.

So this PR explicitly deprecates support for integer-dtyped columns used as numerical features in PDPs, with a plan to raise a ValueError later: the error message should be explicit enough to tell users how to update their code. I thought this would be simple, but as you can see in this PR, it is a bit more complex than anticipated.
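A rough sketch of the deprecation pattern described above (the helper name `_check_numerical_feature_dtype` is hypothetical and not the actual code in this PR): emit a FutureWarning when an integer dtyped column is used as a numerical feature, announcing the future ValueError.

```python
import warnings
import numpy as np

def _check_numerical_feature_dtype(X_col):
    """Hypothetical helper sketching the deprecation: warn when an integer
    dtyped column is used as a numerical PDP feature."""
    if np.issubdtype(np.asarray(X_col).dtype, np.integer):
        warnings.warn(
            "Using an integer dtyped column as a numerical feature in "
            "partial dependence is deprecated and will raise a ValueError "
            "in the future. Convert the column to a floating point dtype, "
            "or declare the feature as categorical.",
            FutureWarning,
        )

# Warns for an integer column, stays silent for a floating point one.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    _check_numerical_feature_dtype(np.array([1, 2, 3]))
    _check_numerical_feature_dtype(np.array([1.0, 2.0, 3.0]))
```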

After creating this PR, I am considering a final alternative: instead of changing the dtype of X_eval, we could round the fractional grid values back to the feature's original integer dtype before assigning them into X_eval.

This last alternative would be less intrusive (and in particular would match the implicit behavior of numpy and old pandas versions). But it can be surprising: we generate a fine grid that is then implicitly coarsified when computing the PDP values. This should not be too complex to implement in a somewhat container-agnostic way (using _safe_indexing / _safe_assign and .astype). However, it might feel a bit magic. I am not sure what is best.
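The coarsification effect mentioned above can be shown in a few lines (grid bounds and size are illustrative): casting a fine fractional grid back to the original integer dtype collapses many grid points onto the same value, so the model is evaluated on duplicated points.

```python
import numpy as np

# Fine grid over an integer-valued feature: 0.0, 0.5, 1.0, ..., 5.0
grid = np.linspace(0, 5, num=11)

# Rounding alternative: cast the grid back to the feature's integer dtype
# before evaluating the model. Truncation collapses pairs of points:
# 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5
coarse = grid.astype(np.int64)

# Only 6 distinct evaluation points remain out of the 11 requested, which
# is what makes this option surprising to the user.
n_distinct = len(np.unique(coarse))
```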

Would love to get feedback from @glemaitre or @lesteve for instance.


github-actions bot commented Dec 4, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: ca7fb2e.

Member

@thomasjpfan thomasjpfan left a comment


I'm okay with warning and erroring in the future here.

Member

@glemaitre glemaitre left a comment


Thanks @ogrisel. I should have shared my findings earlier; that might have helped.

I agree with the general idea of the PR. I was leaning towards the same solution: either treat an integer dtype as a categorical feature, or otherwise require explicit floating-point values.

@glemaitre glemaitre merged commit 4a7f96e into scikit-learn:main Dec 8, 2024
30 checks passed
@glemaitre
Member

I just noticed that I merged this PR, but it indeed targets 1.6. We could still add it to 1.6.1 because it is part fix, part deprecation. @jeremiedbb, do you think we still have time to backport the commit?

@jeremiedbb
Member

I merged the release PR but I haven't pushed the tag yet so I guess it's doable.
