Add strategy="random" to SimpleImputer #11209
Comments
Maybe this should be named
This also requires not using statistics_ at transform time, so I suspect it should be a different class
Since I'm currently working on the SimpleImputer, I'd like to help on that.
I'd prefer
@RianneSchouten also volunteered to contribute this: #8478 (comment)
Ideas for tests. For any single-column training data
Behaviour at test time (i.e.
Is someone working actively on this? Otherwise can I take a look?
I thought @RianneSchouten was working on it, so I decided to step back, but I already have some code for that. Now that it seems that she's not actively working on that, I'll make a PR soon.
I don't mind whoever does it. I am quite busy, so if you are faster with making a PR, don't hesitate.
I made a PR to add this feature. Since I was not sure about new strategy vs new class, I made both. Tell me which one you think is best.
I agree on adding a different class and by default use binning.
Hi guys, is anyone still working on this issue? I want to help.
There is discussion surrounding this issue in #11368. At a high level, it looks like the solution needs benchmarking to see how it performs compared to other imputation strategies.
@thomasjpfan Thanks for letting me know, I'll try to catch up.
Any updates on this? I would love to use random imputation as a baseline for comparing the performance of other imputation methods. This would be equivalent to DummyClassifier with strategy="uniform". I don't think this is trivial, as the imputer should ideally:
Probably the easiest and fastest way to implement this and provide minimum functionality would be to sample from a uniform distribution whose bounds are the min/max of the non-missing values, column-wise. The end user can then round the values to discretize them, if needed.
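To make the min/max suggestion above concrete, here is a rough NumPy sketch. The function name impute_uniform_range and its random_state argument are made up for illustration and are not part of scikit-learn; it assumes a 2D numeric array with NaN marking missing values.

```python
import numpy as np


def impute_uniform_range(X, random_state=None):
    """Hypothetical helper: fill NaNs in each column with draws from
    U(min, max), where min/max are taken over that column's observed values."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float).copy()  # work on a copy, leave the input intact
    for j in range(X.shape[1]):
        col = X[:, j]  # a view, so writes below land in X
        missing = np.isnan(col)
        if missing.all() or not missing.any():
            continue  # nothing to sample from, or nothing to fill
        lo, hi = col[~missing].min(), col[~missing].max()
        col[missing] = rng.uniform(lo, hi, size=missing.sum())
    return X


X = [[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]]
filled = impute_uniform_range(X, random_state=0)
# Optional discretization step left to the caller, as suggested above.
rounded = np.round(filled)
```

Note that this draws from the range of the observed values, not from the observed values themselves, so it can produce values that never occur in the data.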
The purpose would be to provide an unbiased, stochastic, univariate imputer. The imputer would replace missing values by values sampled uniformly at random from the non-missing values of the sample column.
This would require adding an additional random_state constructor parameter to the class. As for #11208, this strategy should work on both numerical and non-numerical dtypes.