Conversation

@mark-thm
Contributor

Reference Issues/PRs

none

What does this implement/fix? Explain your changes.

Adds 'minimum' and 'maximum' strategies to SimpleImputer to impute values based on minimum or maximum values, respectively.
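As a rough illustration of what the two strategies compute (not the code in this PR, and not scikit-learn's implementation), each missing entry is replaced by the smallest or largest observed value in its column. The helper names `column_fill_values` and `impute` are hypothetical, and NaN is assumed to be the missing-value marker:

```python
import math

def column_fill_values(rows, strategy):
    """Return one fill value per column: min or max of the non-missing entries."""
    if strategy not in ("minimum", "maximum"):
        raise ValueError(f"unsupported strategy: {strategy!r}")
    pick = min if strategy == "minimum" else max
    fills = []
    for column in zip(*rows):
        observed = [v for v in column if not math.isnan(v)]
        # A column with no observed values has no minimum or maximum; keep NaN.
        fills.append(pick(observed) if observed else math.nan)
    return fills

def impute(rows, strategy):
    """Replace every NaN with the fill value of its column."""
    fills = column_fill_values(rows, strategy)
    return [[fills[j] if math.isnan(v) else v for j, v in enumerate(row)]
            for row in rows]

X = [[1.0, math.nan],
     [math.nan, 5.0],
     [3.0, 2.0]]
X_min = impute(X, "minimum")  # NaNs become the column minimum
X_max = impute(X, "maximum")  # NaNs become the column maximum
```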

Any other comments?

@github-actions

github-actions bot commented Dec 19, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: b06eb7c. Link to the linter CI: here

@mark-thm changed the title from "Add 'minimum' and 'maximum' strategies to SimpleImputer" to "ENH Add 'minimum' and 'maximum' strategies to SimpleImputer" on Dec 19, 2023
@mark-thm
Contributor Author

It looks to me like all branches of the new code get exercised. I can't tell why the coverage tool thinks some elif statements aren't hit when the blocks inside them are fully covered.

@mark-thm force-pushed the me/min-max-imputation branch 3 times, most recently from 48c0277 to f0755ee, on December 21, 2023 15:25
@mark-thm changed the title from "ENH Add 'minimum' and 'maximum' strategies to SimpleImputer" to "[MRG] ENH Add 'minimum' and 'maximum' strategies to SimpleImputer" on Dec 21, 2023
@adrinjalali
Member

I wonder if we should instead accept a callable which takes a column and returns a constant to fill the values.

@mark-thm force-pushed the me/min-max-imputation branch from c381b5a to b06eb7c on December 26, 2023 23:43
@mark-thm
Contributor Author

> I wonder if we should instead accept a callable which takes a column and returns a constant to fill the values.

That's probably a bigger change, and it alters the SimpleImputer interface quite a bit more. Notably, it might be awkward because:

  • lambda authors would need to write both sparse and dense implementations
  • the strategies today are all different kinds of stats (mean, median, most_frequent); it seems awkward to suddenly expand this list with 'lambda' or 'custom'

Besides that, I'm happy to field questions about the min/max implementation and its usage from a support perspective, but I'd definitely hesitate to support pairs of arbitrary lambdas.
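To illustrate the first bullet, here is a toy sketch of why a single statistic may need separate dense and sparse code paths. It uses a plain dict as a stand-in for a sparse column (not scipy.sparse) and assumes that implicit zeros count as observed values while NaN marks a missing one. Both function names are hypothetical:

```python
import math

def dense_minimum(column):
    """Minimum over the non-missing entries of a dense column."""
    observed = [v for v in column if not math.isnan(v)]
    return min(observed)

def sparse_minimum(n_rows, stored):
    """Minimum of a sparse column.

    `stored` maps row index -> explicitly stored value; every other row
    is an implicit zero (an assumption of this toy representation).
    """
    observed = [v for v in stored.values() if not math.isnan(v)]
    if len(stored) < n_rows:
        # Implicit zeros are real observations and must take part in the minimum.
        observed.append(0.0)
    return min(observed)

dense = [0.0, 4.0, math.nan, 0.0, 7.0]
sparse = {1: 4.0, 2: math.nan, 4: 7.0}  # same column, zeros left implicit

dense_result = dense_minimum(dense)
sparse_result = sparse_minimum(len(dense), sparse)
# Applying the dense logic to the stored values alone would miss the zeros.
naive_result = min(v for v in sparse.values() if not math.isnan(v))
```

A callable written only for the dense case would give the `naive_result` answer on sparse input, which is the kind of pitfall the bullet refers to.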

Member

@jnothman left a comment

What are the use cases for this? Seems reasonable to me as an implementation.

Extending to custom callables would especially enable other quantiles. I think in the custom callable case, we would just pass the callable an array of non-missing values.
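A minimal sketch of that idea, assuming NaN marks missing values and every column has at least one observed value: the callable only ever sees the non-missing values of a column and returns a single scalar. `impute_with_callable` and `lower_quartile` are hypothetical names used for illustration, not a scikit-learn API:

```python
import math
import statistics

def impute_with_callable(rows, statistic):
    """Fill NaNs column by column with statistic(non-missing values)."""
    filled_columns = []
    for column in zip(*rows):
        observed = [v for v in column if not math.isnan(v)]
        fill = statistic(observed)  # assumes at least one observed value
        filled_columns.append([fill if math.isnan(v) else v for v in column])
    # Transpose back to row-major order.
    return [list(row) for row in zip(*filled_columns)]

def lower_quartile(values):
    # statistics.quantiles needs at least two data points.
    return statistics.quantiles(values, n=4, method="inclusive")[0]

X = [[1.0, math.nan],
     [math.nan, 5.0],
     [3.0, 2.0]]
X_med = impute_with_callable(X, statistics.median)
X_min = impute_with_callable(X, min)
X_q1 = impute_with_callable(X, lower_quartile)
```

With this shape, minimum, maximum, and arbitrary quantiles all become one-liners passed by the caller rather than new named strategies.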

@mark-thm
Contributor Author

> What are the use cases for this?

For our models, some missing features perform best when imputed with the minimum or maximum of the feature. Consider a feature which typically has a value of 0, provides a highly negative signal when it's any positive integer, and is missing for some of our inference-time executions: in this case, we want to drive this feature with a value of 0 (so that our model doesn't weigh this feature heavily). We have a number of features where the right missingness value isn't something as obvious as 0, but we know it's "the typical minimum" or "the typical maximum", and rather than pre-compute this value per feature and store the constant, we'd rather 'train' the imputer to impute these features along with all the others.
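The 'train the imputer' workflow described above can be sketched as a small fit/transform object: the per-feature minimum is learned once from training data and reused at inference time. `MinimumImputer` is a hypothetical stand-in, not part of scikit-learn, and it assumes NaN marks missing values and that each training column has at least one observed value:

```python
import math

class MinimumImputer:
    """Hypothetical helper: learns column minimums, then fills NaNs with them."""

    def fit(self, rows):
        # One learned statistic per column, computed from observed values only.
        self.statistics_ = [
            min(v for v in column if not math.isnan(v))
            for column in zip(*rows)
        ]
        return self

    def transform(self, rows):
        return [[self.statistics_[j] if math.isnan(v) else v
                 for j, v in enumerate(row)]
                for row in rows]

# Training data: the first feature is typically 0, the second typically >= 10.
train = [[0.0, 10.0],
         [0.0, 12.0],
         [3.0, math.nan],
         [0.0, 11.0]]
imputer = MinimumImputer().fit(train)

# At inference time both features are missing for this row.
filled = imputer.transform([[math.nan, math.nan]])
```

No constant is stored by hand: the "typical minimum" of each feature comes out of the fit step along with every other learned parameter.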

@mark-thm
Contributor Author

What steps can I take to get this merged?

@adrinjalali
Member

Thanks for the info. I'm happy for it to be included, but I don't have the bandwidth to review it. No blocker from my side, though.

So the next step would be for two reviewers to have a look. Unfortunately, that's a bottleneck, and we don't necessarily have enough reviewers for a timely review of all PRs. So we shall wait and see.

@jnothman
Member

jnothman commented Jan 2, 2024

How hard do you think it would be to mock up a PR that takes a callable run over each array of non-missing values? Given how rarely the set of options here has had to change, I suspect that a generic solution will be more helpful for more users than this specific one...?

@mark-thm
Contributor Author

mark-thm commented Jan 3, 2024

#28053

@mark-thm closed this on Jan 4, 2024
@mark-thm deleted the me/min-max-imputation branch on January 4, 2024 00:31