[MRG] ENH Add 'minimum' and 'maximum' strategies to SimpleImputer #27986
It looks to me like all branches of the new code get exercised. I can't tell why the coverage tool thinks some elif statements aren't hit when the blocks inside them are fully covered.
I wonder if we should instead accept a callable that takes a column and returns a constant to fill the values.
That's probably a somewhat bigger change, and it alters the SimpleImputer interface quite a bit more. Notably, it might be a bit awkward because:
Beyond that, I'm happy to field questions about the min/max implementation and usage from a support perspective, but I'd definitely hesitate to support pairs of arbitrary lambdas.
jnothman left a comment:
What are the use cases for this? Seems reasonable to me as an implementation.
Extending to custom callables would especially enable other quantiles. I think in the custom callable case, we would just pass the callable an array of non-missing values.
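To make the callable idea concrete, here is a minimal NumPy sketch of what it might look like. It is not scikit-learn code: `impute_with_callable` is a hypothetical helper, and it assumes missing values are encoded as NaN. For each column, the callable receives only the non-missing values and returns the constant used to fill that column.

```python
import numpy as np


def impute_with_callable(X, func):
    """Fill NaNs column by column with func(non-missing values of that column).

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        missing = np.isnan(col)
        observed = col[~missing]
        if observed.size == 0:
            # Nothing to learn from an all-missing column; leave it untouched.
            continue
        out[missing, j] = func(observed)
    return out


X = np.array([[1.0, np.nan],
              [np.nan, 4.0],
              [3.0, 8.0]])

# The callable sees only the observed values of each column.
filled_min = impute_with_callable(X, np.min)
filled_q90 = impute_with_callable(X, lambda v: np.quantile(v, 0.9))
```

With this shape, minimum, maximum, and arbitrary quantiles are all the same code path; only the callable changes.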
For our models, the imputed value for some missing features performs best when set to the minimum or maximum of the feature. Consider a feature that typically has a value of 0, provides a highly negative signal when it's any positive integer, and is missing for some of our inference-time executions. In this case, we want to drive this feature with a value of 0 (so that our model doesn't weigh this feature heavily). We have a number of features where the right missingness value isn't something as obvious as 0, but we know it's "the typical minimum" or "the typical maximum". Rather than pre-compute this value per feature and store the constant, we'd rather 'train' the imputer to impute these features along with all the others.
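A rough sketch of that "train the imputer" workflow, assuming NaN marks missing values. The class name `ExtremeValueImputer` is hypothetical and this is not the code in this PR; it only illustrates learning the per-feature minimum or maximum at fit time and reusing it at inference time.

```python
import numpy as np


class ExtremeValueImputer:
    """Hypothetical illustration: learn each column's min or max, then fill NaNs."""

    def __init__(self, strategy="minimum"):
        if strategy not in ("minimum", "maximum"):
            raise ValueError("strategy must be 'minimum' or 'maximum'")
        self.strategy = strategy

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        reducer = np.nanmin if self.strategy == "minimum" else np.nanmax
        # NaN-aware reduction per column; an all-NaN column would yield NaN.
        self.statistics_ = reducer(X, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Broadcast the learned per-column statistic into the missing slots.
        return np.where(np.isnan(X), self.statistics_, X)


X_train = np.array([[0.0, 10.0],
                    [0.0, 12.0],
                    [3.0, np.nan]])
X_new = np.array([[np.nan, 11.0],
                  [2.0, np.nan]])

imputer = ExtremeValueImputer(strategy="minimum").fit(X_train)
X_filled = imputer.transform(X_new)
```

Here the first feature's missing value is filled with 0, the minimum seen during training, without storing that constant by hand.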
What steps can I take to get this merged?
Thanks for the info, I'm happy for it to be included. I don't have the bandwidth to review it, but there's no blocker from my side. The next step would be for two reviewers to have a look. Unfortunately, that's a bottleneck, and we don't necessarily have enough reviewers for a timely review of all PRs. So we'll have to wait and see.
How hard do you think it would be to mock up a PR that takes a callable run over each array of non-missing values? Given how rarely the set of options here has had to change, I suspect that a generic solution will be more helpful for more users than this specific one...?
Reference Issues/PRs
none
What does this implement/fix? Explain your changes.
Adds 'minimum' and 'maximum' strategies to SimpleImputer to impute missing values with each feature's minimum or maximum value, respectively.
Any other comments?