[MRG+2] Basic version of MICE Imputation #8478
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #8478      +/-  ##
==========================================
+ Coverage    96.1%   96.13%   +0.03%
==========================================
  Files         337      337
  Lines       63295    63179     -116
==========================================
- Hits        60828    60740      -88
+ Misses       2467     2439      -28
@raghavrv Do you have some suggestions for a minimal suite of tests that would cover MICE reasonably?

You could try to replicate all the tests that are currently there for other imputation techniques as a basic starting point... Further, hand-computed data could be tested against to avoid unintended regressions when modifying this code...

Also you could try to write tests that would improve the coverage...

Thanks @raghavrv. I'll take a crack at this when my workload simmers down a bit. Feel free to look at the core algorithm in the meanwhile (if you have time) and suggest improvements!

Yes. Let randomness influence the parameter draws and imputations, but make sure that the imputation models remain constant over the loops.

Ah, keeping the models constant is quite different. Hmm.
@RianneSchouten @stefvanbuuren just to confirm: can instances of MICE be run totally independently? That is, can the for loop over

In both cases, I think we can either make a

For example, let's say the pipeline is: A

I think this is pretty much what @glemaitre suggested, right?
I am a bit confused. What do you mean by a constant model? A couple of remarks which I would like to be clarified:

I would think that for the release it could be good to actually tackle (2). We can see afterwards how to better handle the multiple imputation. NB: be aware that I am not familiar with the stats behind MICE, pardon me if my statements are wrong. They are intended to have constructive discussions :)
I think so. In addition, I would expect the
Re:

As such, we should keep this parameter in my opinion. Do any of the more statistically knowledgeable folks have an opinion about this point? There is always the option to do something like

What's the difference between my suggestion of a

By the way, I don't think there is anything that can be done about the inefficiency that is pointed out in (3). It's the consequence of having multiple imputations - you have to multiply all your computation starting with the MICE step by
@sergeyf Yes, for a given imputation model, you may create

It is the same :)
We might have a workaround sharing the imputer with a

Thanks @stefvanbuuren for the confirmation. @glemaitre I think based on Stef's comments, we shouldn't be sharing any of the computation. It's
Averaging the data gives inflated correlations, too-short confidence intervals and too-low p-values. MSE and multiple imputation are not a happy marriage. If you try to select imputation methods that minimise MSE, then you end up imputing the same "best" value over and over again. But those single imputation methods do not appropriately incorporate the uncertainty of the imputation, and treat any imputed value as an observed value. See https://stefvanbuuren.name/fimd/sec-true.html for a short discussion.
@stefvanbuuren thanks for the additional warnings and readings. My take is that this "unhappy marriage" must last. No divorce =)

To clarify my point, here is how the current

(1) Run for

The "correct" thing to do is to set

But we also need to support the "wrong" MSE-minimizing case where

We CAN support both with only minor additional coding and/or some carefully written examples/documentation. That is, we should issue proper warnings about the problems with the ML use-case, and point out the theoretically correct ways to use MICE.
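The burn-in-then-average scheme under discussion can be sketched roughly as follows. This is a toy chained-equations loop in plain NumPy; `mice_impute`, the OLS-per-feature model, and all parameter names are illustrative assumptions, not this PR's actual API:

```python
import numpy as np

def mice_impute(X, n_burn_in=10, n_imputations=5, seed=None):
    """Toy chained-equations imputation: run n_burn_in round-robin
    passes, then average the results of n_imputations further passes."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    X_filled = X.copy()
    # Start from per-column means of the observed values.
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X_filled[missing[:, j], j] = col_means[j]

    def one_pass(Xf):
        # One round-robin sweep: regress each incomplete feature on the
        # others (OLS) and impute prediction plus residual-scale noise,
        # so randomness enters the imputations as discussed above.
        for j in range(Xf.shape[1]):
            miss_j = missing[:, j]
            if not miss_j.any():
                continue
            others = np.delete(np.arange(Xf.shape[1]), j)
            A = np.c_[np.ones(len(Xf)), Xf[:, others]]
            coef, *_ = np.linalg.lstsq(A[~miss_j], Xf[~miss_j, j], rcond=None)
            resid_std = np.std(Xf[~miss_j, j] - A[~miss_j] @ coef)
            Xf[miss_j, j] = A[miss_j] @ coef + rng.normal(0.0, resid_std, miss_j.sum())
        return Xf

    for _ in range(n_burn_in):          # burn-in rounds, results discarded
        X_filled = one_pass(X_filled)
    acc = np.zeros_like(X_filled)
    for _ in range(n_imputations):      # rounds whose results are averaged
        X_filled = one_pass(X_filled)
        acc += X_filled
    return acc / n_imputations          # the MSE-minimizing averaged output
```

Averaging the final passes is exactly the "wrong but useful" ML mode debated above; the statistically correct multiple-imputation use would keep the `n_imputations` completed datasets separate.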
@glemaitre With

I meant exactly what @sergeyf said:

It seems that we are speaking different languages :)

Yep but don't worry, this is clear now.
Oh, I misunderstood "constant" too... I thought you just meant drawing again from the final round robin.
@sergeyf I agree that the

My hesitation is that - because of its apparent easiness - it will invite users to apply it in cases for which it is not designed, and hence give incorrect statistical inferences without the user being aware of it. It is perhaps better not to sell it as multiple imputation for Python, but rather as a way to obtain predictions for the missing cells, and be very clear that the method is no replacement for multiple imputation.

Sounds like we're all agreed that the minimum that needs to be done is to change the name of

These options were mentioned:

I vote for
We have the following things to do:

I will open an issue and add a couple of missing things, but it seems like a plan.
A minor other thing to think about:

@sergeyf: I placed the loop over m as the innermost loop because it was the easiest way to implement it in the current code (X_filled is the same here for every m anyway). However, the result is no different from when you would make it the outer loop, as long as you continue with the right dataset. For example: impute var 1 in dataset 1, var 1 in dataset 2, var 1 in dataset 3, var 2 in dataset 1, var 2 in dataset 2, etc. is equivalent to impute var 1 in dataset 1, var 2 in dataset 1, var 1 in dataset 2, var 2 in dataset 2, var 1 in dataset 3, etc.

Another minor thing: it might be good to have a way to check whether the algorithm converges, for example by calculating and storing the means of every feature for every i_rnd.
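The convergence suggestion could look something like the helper below. It is a hypothetical sketch (`has_converged` and `mean_history` are made-up names, not part of this PR): track the per-feature means after each round robin and stop once they settle.

```python
import numpy as np

def has_converged(mean_history, tol=1e-3):
    """Simple MICE convergence diagnostic: compare the per-feature means
    of the two most recent round-robin passes against a tolerance."""
    if len(mean_history) < 2:
        return False
    prev = np.asarray(mean_history[-2])
    last = np.asarray(mean_history[-1])
    return bool(np.all(np.abs(last - prev) < tol))

# Inside the imputation loop one would append np.mean(X_filled, axis=0)
# after every pass and break out early once has_converged(...) is True.
```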
Ah, thanks @RianneSchouten. Currently, there is no way to start with a random imputation. Might be good to add this support as well. Is the randomness in the MICE R package dependent on the non-missing values?
I will have to check exactly, but I think what it does is it takes the observed values of each feature and imputes randomly from those observed values. So it is a 'per feature' thing, just like mean imputation, but it imputes observed values instead of the mean of the observed values. It might actually be a good additional possibility for SimpleImputer.
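That per-feature draw from observed values could be sketched like this (`sample_impute` is a hypothetical name for illustration; it is not SimpleImputer's API):

```python
import numpy as np

def sample_impute(X, seed=None):
    """Per-feature random imputation: fill each missing entry with a
    value drawn uniformly from that feature's observed values."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float).copy()
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        observed = X[~miss, j]
        if miss.any() and observed.size:
            # Draw with replacement from the observed values of column j.
            X[miss, j] = rng.choice(observed, size=miss.sum())
    return X
```

Unlike mean imputation, repeated calls give different completed datasets, which is what makes it a plausible random starting point for the MICE loop.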
That is indeed the default in

In
I think it deserves to be a separate object from SimpleImputer, because it does not use a single value per column, but I'd be in favour of seeing a RandomImputer or something do that.
I will make a draft for RandomImputer. I think I have most of the code lying here already anyway.
For reference, #11209 is the RandomImputer (maybe SamplingImputer?) issue.
The rename PR is here: #11314

It's probably also a good place to contribute additional documentation about the difference between the original MICE use-case and sklearn's main use-case. Feel free to contribute or make suggestions in the PR.
This reverts commit b97eda5.
Reference Issue
This is in reference to #7840, and builds on #7838.
Fixes #7840.
This code provides basic MICE imputation functionality. It currently only uses Bayesian linear regression as the prediction model. Once this is merged, I will add predictive mean matching (slower but sometimes better). See here for a reference: https://stat.ethz.ch/education/semesters/ss2012/ams/paper/mice.pdf
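For readers unfamiliar with predictive mean matching, the idea for a single column can be sketched as follows. This is an illustrative OLS-based version under assumed names (`pmm_impute_column`, complete `X_others`), not the implementation the PR will eventually gain:

```python
import numpy as np

def pmm_impute_column(X_others, y, miss_mask, k=5, seed=None):
    """Predictive mean matching for one target column.

    Fit OLS on the observed rows, predict for every row, and for each
    missing row copy the *observed* value of one of the k donors whose
    predictions are closest to that row's prediction. X_others is
    assumed complete (already imputed)."""
    rng = np.random.default_rng(seed)
    A = np.c_[np.ones(len(y)), X_others]
    coef, *_ = np.linalg.lstsq(A[~miss_mask], y[~miss_mask], rcond=None)
    pred = A @ coef
    y_out = y.copy()
    obs_idx = np.flatnonzero(~miss_mask)
    for i in np.flatnonzero(miss_mask):
        # k observed donors with the closest predicted values
        donors = obs_idx[np.argsort(np.abs(pred[obs_idx] - pred[i]))[:k]]
        y_out[i] = y[rng.choice(donors)]
    return y_out
```

Because imputed values are copied from observed donors rather than taken from the regression line, PMM never produces impossible values (e.g. negative counts), which is part of why it is "slower but sometimes better" than plain Bayesian linear regression.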