
[MRG+2] Basic version of MICE Imputation #8478


Merged
merged 170 commits into master on Apr 16, 2018

Conversation

sergeyf
Contributor

@sergeyf sergeyf commented Feb 28, 2017

Reference Issue

This is in reference to #7840, and builds on #7838.

Fixes #7840.

This code provides basic MICE imputation functionality. It currently only uses Bayesian linear regression as the prediction model. Once this is merged, I will add predictive mean matching (slower but sometimes better). See here for a reference: https://stat.ethz.ch/education/semesters/ss2012/ams/paper/mice.pdf
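For illustration, here is a minimal sketch of the chained-equations idea, not this PR's actual code: each feature with missing values is regressed on the other features with BayesianRidge, and its missing entries are redrawn from the Gaussian predictive distribution each round (this assumes BayesianRidge.predict accepts return_std=True; the helper name is hypothetical):

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

def chained_impute(X, n_rounds=10, random_state=0):
    rng = np.random.RandomState(random_state)
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    X_filled = X.copy()
    # Initialize missing entries with column means.
    col_means = np.nanmean(X, axis=0)
    X_filled[mask] = np.take(col_means, np.where(mask)[1])
    for _ in range(n_rounds):
        for j in range(X.shape[1]):
            miss_j = mask[:, j]
            if not miss_j.any():
                continue
            # Regress feature j on the current fill of all other features.
            others = np.delete(X_filled, j, axis=1)
            model = BayesianRidge()
            model.fit(others[~miss_j], X_filled[~miss_j, j])
            # Draw from the Gaussian predictive distribution rather than
            # using the point prediction, as in MICE.
            mean, std = model.predict(others[miss_j], return_std=True)
            X_filled[miss_j, j] = rng.normal(mean, std)
    return X_filled
```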

@codecov

codecov bot commented Mar 2, 2017

Codecov Report

Merging #8478 into master will increase coverage by 0.03%.
The diff coverage is 98.2%.


@@            Coverage Diff             @@
##           master    #8478      +/-   ##
==========================================
+ Coverage    96.1%   96.13%   +0.03%     
==========================================
  Files         337      337              
  Lines       63295    63179     -116     
==========================================
- Hits        60828    60740      -88     
+ Misses       2467     2439      -28
Impacted Files Coverage Δ
sklearn/dummy.py 97.74% <100%> (+0.05%) ⬆️
sklearn/utils/estimator_checks.py 93.29% <100%> (ø) ⬆️
sklearn/tests/test_dummy.py 100% <100%> (ø) ⬆️
sklearn/preprocessing/tests/test_imputation.py 100% <100%> (ø) ⬆️
sklearn/preprocessing/__init__.py 100% <100%> (ø) ⬆️
sklearn/preprocessing/imputation.py 95.25% <96.57%> (+1.13%) ⬆️
sklearn/linear_model/tests/test_bayes.py 90.42% <0%> (-0.76%) ⬇️
sklearn/linear_model/bayes.py 95.45% <0%> (-0.23%) ⬇️
sklearn/preprocessing/data.py 98.81% <0%> (-0.17%) ⬇️
sklearn/feature_selection/tests/test_rfe.py 98.57% <0%> (-0.08%) ⬇️
... and 12 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9808810...93d93cf.

@raghavrv raghavrv self-requested a review March 2, 2017 14:41
@sergeyf
Contributor Author

sergeyf commented Mar 6, 2017

@raghavrv Do you have some suggestions for a minimal suite of tests that would cover MICE reasonably?

@raghavrv raghavrv changed the title from "[WIP] Basic version of MICE Imputation (3rd try)" to "[WIP] Basic version of MICE Imputation" on Mar 7, 2017
@raghavrv
Member

raghavrv commented Mar 7, 2017

You could try to replicate all the tests that are currently there for other imputation techniques as a basic starting point... Further, hand-computed data could be tested against to avoid unintended regressions when modifying this code...

@raghavrv
Member

raghavrv commented Mar 7, 2017

Also you could try and write tests that would improve the coverage...

@sergeyf
Contributor Author

sergeyf commented Mar 7, 2017

Thanks @raghavrv. I'll take a crack at this when my workload simmers down a bit. Feel free to look at the core algorithm in the meanwhile (if you have time) and suggest improvements!

@gerkovink

Yes, if doing multiple imputation in a for loop with different random seeds is identical to the proposal, then I think an example on imputation uncertainty is the right way to go about it.

Yes. Let randomness influence the parameter draws and imputations, but make sure that the imputation models remain constant over the loops.

@jnothman
Member

jnothman commented Jun 14, 2018 via email

@sergeyf
Contributor Author

sergeyf commented Jun 14, 2018

@RianneSchouten @stefvanbuuren just to confirm: can instances of MICE be run totally independently? That is, can the for loop over the m imputations be the outermost for loop? Or does it have to be the way Rianne's code is written, where the for loop over m is the innermost for loop?

In both cases, I think we can either make a MICEPipeline or add functionality to the current Pipeline so that it knows how to properly do multiple imputations when MICEImputer is one of the pipeline's steps.

For example, let's say the pipeline is: A -> B -> MICEImputer -> C -> D.

A MICEPipeline would be given this pipeline and m. It would run steps A -> B as usual, and then run the MICEImputer portion m times. The m outputs would all go through C -> D and the m results are then returned (sketched below).
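A rough sketch of this hypothetical MICEPipeline behaviour; all names here (mice_pipeline_transform, the step lists) are illustrative, not an existing sklearn API:

```python
from sklearn.base import clone

def mice_pipeline_transform(pre_steps, imputer, post_steps, X, m=5):
    # Run the shared steps A -> B once.
    for _, step in pre_steps:
        X = step.fit_transform(X)
    results = []
    # Fork into m independently seeded imputations, each followed by C -> D.
    for seed in range(m):
        imp = clone(imputer).set_params(random_state=seed)
        X_i = imp.fit_transform(X)
        for _, step in post_steps:
            X_i = clone(step).fit_transform(X_i)
        results.append(X_i)
    return results  # m completed results, ready for pooling
```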

I think this is pretty much what @glemaitre suggested, right?

@glemaitre
Member

imputation models remain constant over the loops.

I am a bit confused. What do you mean by a constant model?

Couple of remarks that I would like to have clarified:

  1. What is the usefulness of n_imputations in the current implementation? Somehow, n_burn_in will allow convergence to a "stable" predictor; mu and sigma should more or less converge to the same solution, so taking the average over n_imputations then seems not really useful.

  2. If my previous intuition is not wrong, we could:

    • Rename MICEImputer to something else (MultivariateImputer). Having MICE in the name will cause issues when users expect the algorithm to perform the same way as in R, for instance.
    • We could remove n_imputations.
  3. In case we want multiple imputation, we can indeed combine the MICEImputer with another estimator and put it inside a BaggingClassifier/VotingClassifier (see the sketch below). However, this will be inefficient since it runs the n_burn_in iterations for each (imputer, estimator) pipeline instead of once.

  4. We could even think of having a generic object that could perform multiple imputation using any imputer.

I would think that for the release it could be good to actually tackle (2). We can see afterwards how to better handle multiple imputation.

NB: be aware that I am not familiar with the stats behind MICE; pardon me if my statements are wrong. They are intended to spark constructive discussion :)
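To make (3) concrete, here is a rough sketch of the composition, using a plain Imputer as a stand-in for this PR's MICEImputer. Note that meta-estimators validate their inputs and may reject NaNs, so this only illustrates the structure:

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer  # stand-in for the PR's MICEImputer

# Each ensemble member re-fits the whole pipeline (imputer included) on its
# own bootstrap sample, so every member pays the full burn-in cost noted above.
pipe = make_pipeline(Imputer(strategy="mean"), LogisticRegression())
ensemble = BaggingClassifier(base_estimator=pipe, n_estimators=10,
                             random_state=0)
# ensemble.fit(X_train, y_train)  # per-member imputation gives the variability
```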

@glemaitre
Member

I think this is pretty much what @glemaitre suggested, right?

I think so. In addition, I would expect the MICEPipeline to "pool" the results.

@sergeyf
Contributor Author

sergeyf commented Jun 14, 2018

Re: n_imputations, I found that taking the average over n_imputations reduced the MSE of the imputed values. As @stefvanbuuren pointed out, doing so actually artificially inflates the correlation between the imputed entries, but it's still better (empirically) in a mean-squared-error sense, which is how ML people think about imputation's effectiveness for continuous features. Just to provide more intuition here: a single imputation out of n_imputations is high-variance, so averaging them helps provide single stable imputations (see the numeric illustration below).

As such, we should keep this parameter, in my opinion. Do any of the more statistically knowledgeable folks have an opinion about this point? There is always the option to do something like n_burn_in=100, n_imputations=1, m=100 to replicate the way MICE was designed to be used. But for ML purposes, the most common use case will be n_burn_in=10, n_imputations=100, m=1.

What's the difference between my suggestion of a MICEPipeline and the idea in (3) that @glemaitre posted?

By the way, I don't think there is anything that can be done about the inefficiency that is pointed out in (3). It's the consequence of having multiple imputations - you have to multiply all your computation starting with the MICE step by m.
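A tiny numeric illustration of the variance argument above (pure numpy, hypothetical numbers):

```python
import numpy as np

rng = np.random.RandomState(0)
# 100 candidate imputations for each of 10000 missing cells, sigma = 1.
draws = rng.normal(loc=3.0, scale=1.0, size=(100, 10000))
print(draws[0].var())            # ~1.0: a single draw per cell is high-variance
print(draws.mean(axis=0).var())  # ~0.01: averaging 100 draws gives sigma^2/100
```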

@stefvanbuuren

@sergeyf Yes, for a given imputation model, you may create m completed datasets fully independently. It is entirely parallel. If you want to do generation in this way, just initialize each instance with its own seed, and store the resulting sets in a convenient form for repeated analyses and pooling.
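A sketch of this fully independent generation, assuming the imputer exposes a random_state parameter (joblib used to illustrate that the m runs can be parallelized):

```python
from joblib import Parallel, delayed
from sklearn.base import clone

def one_run(imputer, X, seed):
    # Identical imputer, differing only in its seed.
    return clone(imputer).set_params(random_state=seed).fit_transform(X)

def multiple_imputations(imputer, X, m=5, n_jobs=-1):
    # The m runs are fully independent, so they can run in parallel;
    # keep all m completed datasets for repeated analyses and pooling.
    return Parallel(n_jobs=n_jobs)(
        delayed(one_run)(imputer, X, seed) for seed in range(m))
```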

@glemaitre
Member

What's the difference between my suggestion of a MICEPipeline and the idea in (3) that @glemaitre posted?

It is the same :)

By the way, I don't think there is anything that can be done about the inefficiency that is pointed out in (3). It's the consequence of having multiple imputations - you have to multiply all your computation starting with the MICE step by m.

We might have a workaround: sharing the imputer with a warm_start.

@sergeyf
Contributor Author

sergeyf commented Jun 14, 2018

Thanks @stefvanbuuren for the confirmation.

@glemaitre I think based on Stef's comments, we shouldn't be sharing any of the computation. It's m totally separate/parallel imputers identical in every way except with different seeds.

@stefvanbuuren

Averaging the data gives inflated correlations, too short confidence intervals and too low P-values.
I generally advise against it, unless you have reason to do otherwise. (https://stefvanbuuren.name/fimd/workflow.html).

The MSE and multiple imputation is not a happy marriage. If you try to select imputation methods that minimise MSE, then you end up imputing the same "best" value over and over again. But those single imputation methods do not appropriately incorporate the uncertainty of the imputation, and treat any imputed value as an observed value. See https://stefvanbuuren.name/fimd/sec-true.html for a short discussion.

@sergeyf
Contributor Author

sergeyf commented Jun 14, 2018

@stefvanbuuren thanks for the additional warnings and readings. My take is that this "unhappy marriage" must last. No divorce =)

To clarify my point, here is how the current MICEImputer works:

(1) Run for n_burn_in iterations.
(2) Run for n_imputations iterations.
(3) Return the average of the last n_imputations iterations <- this is the "wrong" part.

The "correct" thing to do is to set n_imputations = 1 and run steps (1) to (3) m times with different seeds, thus preserving uncertainty and avoiding the problems of averaging that Stef described.

But we also need to support the "wrong" MSE-minimizing case where n_imputations > 1 and m = 1. Good single imputations are very useful for real-life work, despite ignoring inherent uncertainty in missing values and inflating correlations, etc.

We CAN support both with only minor additional coding and/or some carefully written examples/documentation. That is, we should issue proper warnings about the problems with the ML use-case, and point out the theoretically correct ways to use MICE.
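To make the "correct" workflow concrete: the m per-dataset analyses are usually combined with Rubin's rules. A minimal sketch, assuming each analysis yields a scalar estimate and its squared standard error:

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool a scalar estimate computed on each of m completed datasets."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()          # pooled point estimate
    w = variances.mean()              # average within-imputation variance
    b = estimates.var(ddof=1)         # between-imputation variance
    t = w + (1 + 1.0 / m) * b         # total variance of the pooled estimate
    return q_bar, t
```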

@gerkovink

@glemaitre With

Let randomness influence the parameter draws and imputations, but make sure that the imputation models remain constant over the loops.

I meant exactly what @sergeyf said:

totally separate/parallel imputers identical in every way except with different seeds.

It seems that we are speaking different languages :)

@glemaitre
Member

It seems that we are speaking different languages :)

Yep but don't worry, this is clear now.

@jnothman
Member

jnothman commented Jun 14, 2018 via email

@stefvanbuuren

@sergeyf I agree that the n_imputations > 1 and m = 1 could be useful in specific cases. In particular, it is valid for the case where the ONLY objective is to find the most likely value for each missing cell (e.g. image restoration where you want to show the most likely value for each missing pixel).

My hesitation is that, because of its apparent easiness, it will invite users to apply it in cases for which it is not designed, and hence give incorrect statistical inferences without the user being aware of it. It is perhaps better not to sell it as multiple imputation for Python, but rather as a way to obtain predictions for the missing cells, and to be very clear that the method is no replacement for multiple imputation.

@sergeyf
Contributor Author

sergeyf commented Jun 14, 2018

Sounds like we're all agreed that the minimum that needs to be done is to change the name of MICEImputer (and alter some documentation) before the next release.

These options were mentioned: ChainedImputer, MultivariateImputer

I vote for ChainedImputer to preserve some connection to MICE.

@jnothman
Member

jnothman commented Jun 14, 2018

We have the following things to do:

  1. Determine the most appropriate way to use individual imputation samples in predictive modelling, clustering, etc., which are scikit-learn's focus.
    a. is using a single draw acceptable?
    b. is averaging over multiple draws from the final fit appropriate?
    c. is ensembling multiple predictive estimators each trained on a different imputation most appropriate?
  2. Perhaps determine if, in a predictive modelling context, it is necessary to have the sophistication of MICE in sampling each imputation value rather than just using point predictions.
  3. Provide an example illustrating the inferential capabilities due to multiple imputation. I don't think there's anything limiting about our current interface, but it deserves an example.
  4. Rename MICEImputer to de-emphasise multiple imputation because it only performs a single one at a time.

@glemaitre
Member

I will open an issue and add a couple of things that are missing, but it seems like a plan.

@glemaitre glemaitre mentioned this pull request Jun 14, 2018
@RianneSchouten

A minor other thing to think about:
X_filled is currently a mean-imputed dataset. In the original idea of MICE, the first round of imputation happens with random imputation. With different seeds, this ensures that each parallel imputation sequence starts with a different dataset. I don't know to what extent it matters, though.

@sergeyf: I placed the loop over m as the innermost loop because it was the easiest way to implement it in the current code (X_filled is the same here for every m anyway). However, the result is no different from making it the outer loop, as long as you continue with the right dataset. For example, imputing var 1 in dataset 1, var 1 in dataset 2, var 1 in dataset 3, var 2 in dataset 1, var 2 in dataset 2, etc. is equivalent to imputing var 1 in dataset 1, var 2 in dataset 1, var 1 in dataset 2, var 2 in dataset 2, var 1 in dataset 3, etc.

Another minor thing: it might be good to have a way to check whether the algorithm converges, for example by calculating and storing the means of every feature for every i_rnd (see the sketch below).
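A sketch of such a convergence check, with impute_one_round standing in as a hypothetical helper for one pass of the chained imputation loop:

```python
import numpy as np

def run_with_trace(X, mask, impute_one_round, n_iter=20):
    trace = []
    X_filled = X.copy()
    for _ in range(n_iter):
        X_filled = impute_one_round(X_filled, mask)
        # Per-feature mean of the *imputed* entries only.
        trace.append([X_filled[mask[:, j], j].mean() if mask[:, j].any()
                      else np.nan
                      for j in range(X.shape[1])])
    # Plot the columns of `trace` against iteration to eyeball convergence.
    return X_filled, np.array(trace)
```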

@sergeyf
Contributor Author

sergeyf commented Jun 14, 2018

Ah, thanks @RianneSchouten. Currently, there is no way to start with a random imputation. It might be good to add this support as well. Is the randomness in the MICE R package dependent on the non-missing values?

@RianneSchouten

RianneSchouten commented Jun 14, 2018 via email

@gerkovink

it takes the observed values of each feature and imputes randomly from those observed values.

That is indeed the default in mice, although specific starting values can also be provided. There is no strict need for a better form of initialization. There is randomness induced in the mice algorithm: therefore the autocorrelation between the iterations is usually low and the algorithm converges rapidly.

it might be good to have a possibility to check whether the algorithm converges. For example by calculating and storing the means of every feature for every i_rnd.

In mice the means and sd's for the chains at every iteration are calculated on the imputed values only, because the observed values remain constant. This may be more efficient.
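A sketch of that default initialization (a hypothetical helper, plain numpy): fill each feature's missing entries by drawing, with replacement, from that feature's observed values, so that differently seeded chains start from different datasets.

```python
import numpy as np

def random_init(X, random_state=0):
    rng = np.random.RandomState(random_state)
    X_filled = X.copy()
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        if miss.any():
            # Sample starting values from this feature's observed entries.
            X_filled[miss, j] = rng.choice(X[~miss, j], size=miss.sum())
    return X_filled
```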

@jnothman
Member

jnothman commented Jun 15, 2018 via email

@RianneSchouten

RianneSchouten commented Jun 15, 2018 via email

@jnothman
Member

jnothman commented Jun 16, 2018 via email

@sergeyf
Contributor Author

sergeyf commented Jun 18, 2018

The rename PR is here: #11314

It's probably also a good place to contribute additional documentation about the difference between the original MICE use-case and sklearn's main use-case. Feel free to contribute or make suggestions in the PR.

jorisvandenbossche added a commit to jorisvandenbossche/scikit-learn that referenced this pull request Jul 17, 2018