MICE/Multiple Imputation branch #11259
Comments
@glemaitre can you clarify what you mean by your point about `imputation_sequences_`? |
Currently the documentation is not explicit regarding the size of the list. I find that it should be |
Ah, thanks! |
One comment about the to-do list. I don't think this item is necessary: "Perhaps determine if, in a predictive modelling context, it is necessary to have the sophistication of MICE in sampling each imputation value rather than just using point predictions." I think stats people already know that they want to understand uncertainty due to missing values. Why do we need to determine if it is "necessary"? The answer will inevitably be "sometimes" and will depend on which dataset is chosen to run some basic experiments on. I'll try to carve out some time next week to tackle at least some of these, but it would be great to have someone else contribute as it's a busy work season for me. |
I am willing to help, but I don't totally get what's been decided on `n_imputations` and `m`. |
We keep `n_imputations` as it is. For `m`, there will be an example showing how to do inference. Sent from my phone - sorry to be brief and for potential misspellings.
|
My understanding is that concrete action about
This can be accompanied by comments that discuss why the various values are set as they are. In addition, we will also need an ML example that looks like this:
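(The snippet originally embedded here was not preserved in this copy. Below is a minimal sketch of the kind of impute-then-predict example presumably meant; the dataset, estimator choices and parameter values are illustrative, not taken from the original comment.)

```python
# Sketch only (original snippet not preserved): impute with the chained imputer,
# then score a downstream model; estimator choices are purely illustrative.
import numpy as np
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import MICEImputer  # dev-branch name at the time
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_boston(return_X_y=True)

# Simulate missingness: knock out one value in 75% of the rows.
rng = np.random.RandomState(0)
X_missing = X.copy()
for i in np.where(rng.rand(X.shape[0]) < 0.75)[0]:
    X_missing[i, rng.randint(X.shape[1])] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X_missing, y, random_state=0)

pipeline = make_pipeline(
    MICEImputer(n_burn_in=10, n_imputations=50, random_state=0),
    RandomForestRegressor(n_estimators=100, random_state=0),
)
pipeline.fit(X_train, y_train)
print("held-out MSE:", mean_squared_error(y_test, pipeline.predict(X_test)))
```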
I think we may have something like this already actually here: https://github.com/scikit-learn/scikit-learn/blob/master/examples/plot_missing_values.py |
@glemaitre @jnothman I ran a quick experiment with Boston to demonstrate that averaging a longer and longer set of the last `n_imputations` is helpful for downstream tasks. See this gist for the code: https://gist.github.com/sergeyf/08e5af7674b4d2c6d36dcb7872745c40 The result: [image: https://user-images.githubusercontent.com/1874668/41428275-89919e7e-6fd7-11e8-9a22-6e73b7b24e24.png] Here Boston is missing data in 75% of the rows. I kept `n_burn_in + n_imputations` fixed to 100 and swept over `n_imputations`. At the very left of the plot is the average held-out MSE of a RandomForestRegressor on the Boston dataset after being filled in with MICE with `n_burn_in = 99` and `n_imputations = 1`. It's the highest MSE and the most variable. As we go to the right, the mean MSE goes down until about 50 and then flattens out. This entire experiment was re-run 100 times to get the bars. I think this provides at least a partial answer to:
Determine the most appropriate way to use individual imputation samples in predictive modelling, clustering, etc, which are scikit-learn's focus.
a. is using a single draw acceptable? NO
b. is averaging over multiple draws from the final fit appropriate? YES
|
Ah great! I can spend some time making a draft for the first example.
|
I am also thinking that we should remove @jnothman WDYT? |
@glemaitre I'll try to get this one done today: "Rename MICEImputer to de-emphasise multiple imputation because it only performs a single one at a time." I'm going to use `ChainedImputer`. |
Here is the renaming PR: #11314 |
@sergeyf possibly a naive question from somebody who has no statistics background: when the goal is to have a single best imputation (and not multiple imputations to do inference on the results), is it then really necessary to do the many iterations and take the average? (See scikit-learn/sklearn/impute.py, lines 624 to 632 at 007aa71.)
So naively I would think that doing this many times and averaging at the end will lead to almost the same result as just doing it once without adding the random noise (after the initial burn-in, i.e. relying on the mean of the model), but the latter would be much more efficient (1 instead of 100 imputations). |
@jorisvandenbossche I'm not a MICE expert, really. I just happened to meet Stef in real life, tried MICE out on some problems I had, and noticed it worked very well in ML contexts. So I wanted to make it available to ML people. That is to say: I'm not the best person to answer; @stefvanbuuren would know better. But my intuition is that it would not be the same: it's basically an MCMC sampling process. I could be wrong though. Maybe you could make a fork, quickly modify MICE and check your hypothesis, using the gist I posted as a base? |
Yes, I was planning to, and have now taken the time to do it; see the figure below. The figure is the same as the one above (using the same seeds as you), but the added red line gives the experiment of using the imputed values of the last iteration + using the model mean (without adding noise) in each iteration. I did this for a [...] The mean value is slightly lower (whether this is good or bad I don't know), the variation is clearly lower, and it is already stable after 10 burn-in iterations (for this example).
Yes, but when we don't add noise to the iterations and only use the mean of the model, it will not be jumping around but will converge to a stable (single best) imputation. So that is the reason that doing it like this gives a lower variance (and with many fewer iterations, so it is more efficient).

The quick and dirty patch to MICEImputer:

```diff
--- a/sklearn/impute.py
+++ b/sklearn/impute.py
@@ -625,11 +625,12 @@ class MICEImputer(BaseEstimator, TransformerMixin):
         X_test = safe_indexing(X_filled[:, neighbor_feat_idx],
                                missing_row_mask)
         mus, sigmas = predictor.predict(X_test, return_std=True)
-        good_sigmas = sigmas > 0
-        imputed_values = np.zeros(mus.shape, dtype=X_filled.dtype)
-        imputed_values[~good_sigmas] = mus[~good_sigmas]
-        imputed_values[good_sigmas] = self.random_state_.normal(
-            loc=mus[good_sigmas], scale=sigmas[good_sigmas])
+        imputed_values = mus
         # clip the values
         imputed_values = np.clip(imputed_values,
@@ -883,7 +884,8 @@ class MICEImputer(BaseEstimator, TransformerMixin):
                 '%d/%d, elapsed time %0.2f'
                 % (i_rnd + 1, n_rounds, time() - start_t))
-        Xt /= self.n_imputations
+        Xt = X_filled  # last filled values
         Xt[~mask_missing_values] = X[~mask_missing_values]
         return Xt
```
|
Thanks for the experiment. I think I had the wrong interpretation of what you were suggesting. It looks like not sampling during the chained process helps the downstream model quite a bit (and reduces variance), which is a cool finding! Have you by chance looked at whether the MICE tests that have to do with empirical correctness still pass? They're in [...] Without any kind of sampling, this is somewhat far from the original MICE algorithm. The empirical evidence you've provided is positive, but it's limited to this example and has not been explored as thoroughly as MICE has in the literature. This makes me a bit hesitant about adding it to sklearn. But maybe we can put the sampling behind a flag and set it to True by default? This would at least allow an end-user to try the version you're suggesting. We would probably also need an example of how to use it with the flag set to False and why one might want to. Anyone else have thoughts here? |
This is what I meant by "Perhaps determine if, in a predictive modelling
context, it is necessary to have the sophistication of MICE in sampling
each imputation value rather than just using point predictions."
I think there is literature on chaining without sampling from a
distribution around the candidate imputation, but I've not explored the
references in the MICE paper.
Before seeing this conversation, I was thinking it would be nice to support
regressors (or indeed classifiers) that do not give a predictive
distribution. Yes, I think we could control this with a parameter
('sample'?). The question is: which behaviour should we offer by default?
|
Also I believe that this is the right way to go if we're to make it a more
generic "ChainingImputer". But we can still support, and illustrate, the
inferences possible with Multiple Imputation and sampling.
|
In my opinion, the default should be the one that is empirically shown to work best across a variety of examples, based on the ones already available. |
Those two still pass (only |
Great, thank you for checking. It would definitely be good to have a flag to enable this. It would also mean we can easily toss in other regressors. It would be great to have a reference for the non-sampled version of MICE if anyone happens to know a good one. |
Hi, I'm very new here, but after talking with @jorisvandenbossche I would like to support him on the point that generating a variable's missing data by drawing from the conditional distribution is perhaps not necessary when the aim is prediction - we can just take the regression prediction instead. I'm not sure about such an implementation of MICE, but missForest does just that with random forests. |
The example showing how ChainedImputer can be used as a MICE imputer would include something like this:
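(The snippet originally embedded here was likewise not preserved. Below is a minimal sketch of the idea: run the imputer m times with different seeds, fit an analysis model on each completed dataset, and pool the estimates. The data setup, analysis model and the value of m are illustrative.)

```python
# Sketch only: ChainedImputer used in MICE fashion - m imputed datasets,
# one analysis model per dataset, pooled estimates at the end.
import numpy as np
from sklearn.impute import ChainedImputer  # dev-branch name at the time
from sklearn.linear_model import LinearRegression

# Illustrative data with values missing completely at random.
rng = np.random.RandomState(0)
X_full = rng.randn(200, 4)
y = X_full @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.randn(200)
X_missing = X_full.copy()
X_missing[rng.rand(*X_missing.shape) < 0.3] = np.nan

m = 5
coefs = []
for i in range(m):
    imputer = ChainedImputer(n_burn_in=10, n_imputations=1, random_state=i)
    X_completed = imputer.fit_transform(X_missing)
    coefs.append(LinearRegression().fit(X_completed, y).coef_)

pooled_coef = np.mean(coefs, axis=0)          # pooled point estimate
between_var = np.var(coefs, axis=0, ddof=1)   # between-imputation variance
print(pooled_coef, between_var)
```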
However, I have done some simulations testing the procedure, comparing the ChainedImputer with m = 1 and m = 5 and with n = 1 and n = 100, and the results are not as I would expect. I will post the script in a GitHub repo later, but I need some more time to think about these results. |
@RianneSchouten I haven't had time to think about this yet, but you probably want to set a different seed for each of the imputations in this line:
It should instead be:
|
Indeed, if your aim is to impute and predict the missing entries as well as possible, then single imputation is enough; multiple imputation is not required for that, and you could impute by taking the conditional expectation rather than by drawing from the conditional distribution. As far as prediction is concerned, there are not yet many results on this problem. Common practice and a few papers tend to suggest the following approach: perform multiple imputation and, on each imputed data set, apply your predictive algorithm to estimate the response, say Y. Then aggregate the different predictions. Best, |
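(A minimal sketch of the aggregate-predictions approach described above, assuming a regression task; the imputer settings, number of imputations and downstream model are illustrative choices, not taken from the comment.)

```python
# Sketch only: multiple imputation for prediction - impute the data m times,
# fit one predictive model per completed dataset, and average the predictions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import ChainedImputer  # dev-branch name at the time

def fit_multiple(X_train, y_train, m=5):
    """Return m (imputer, model) pairs, one per imputed dataset."""
    pairs = []
    for i in range(m):
        imputer = ChainedImputer(n_burn_in=10, n_imputations=1, random_state=i)
        X_completed = imputer.fit_transform(X_train)
        model = RandomForestRegressor(n_estimators=100, random_state=i)
        model.fit(X_completed, y_train)
        pairs.append((imputer, model))
    return pairs

def predict_aggregated(pairs, X_test):
    """Impute the test data with each fitted imputer and average the predictions."""
    preds = [model.predict(imputer.transform(X_test)) for imputer, model in pairs]
    return np.mean(preds, axis=0)
```

Usage would be `pairs = fit_multiple(X_train, y_train)` followed by `y_pred = predict_aggregated(pairs, X_test)`.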
@sergeyf |
I like the idea of the default being dependent on the mode. I think keeping the flag more function-oriented ( |
I'm fine with posterior_sample=True/False.
|
OK, I made the changes as discussed. And I last-minute went with `sample_posterior`. |
I didn't actually look at Guillaume's benchmarks until now. I'd be
comfortable using BayesianRidge as the default in both cases... but I
might be under-informed on that.
|
There is one thing that is not mentioned in this discussion yet and it is
important:
- the methodology of mice implies using the whole dataset including the
output variable in the imputation iterations. In other words: x1 is imputed
with x2, x3, ... and y. Then x2 is imputed with x1, x3, ... and y,
etcetera! This is because the imputation model should include at least the
analysis model.
@stefvanbuuren am I saying this correctly?
As said earlier, it is yet unknown what the difference is between the
outcome of a prediction model (when the output variable will not be used)
and a model built for statistical inference.
Would it be possible to include the possibility to add y in the imputation
process?
Something like `include_y=False` by default and `X = np.column_stack((X, y))` when `include_y=True`?
I will show the difference between the two in the example then.
|
This won't work because during validation/test time we don't have `y`. |
Good point Rianne. Potentially you could add |
so... should we refuse to draw from the posterior and refuse to do multiple
imputation? surely in a clustering context needing to have y as a predictor
is meaningless
|
Sure, in a clustering context there is no observed `y`. |
So maybe we can find a way to make this an option, or to illustrate it in
an example... eventually. I don't see it as a priority here.
|
My interest is mostly in making the API as stable, useful, well-informed
and well-documented as possible.
|
Before a release in the coming weeks, I should add.
|
I just used this and noticed that a) the model is called `predictor` not `estimator` (or `base_estimator`) as is usual for a meta-estimator, and b) that it's not the first argument. Ideally I'd prohibit positional arguments, but since we don't for now, I'd rather have the base estimator be the first argument. If you do `IterativeImputer(RandomForestRegressor(n_estimators=100))` you get a hard-to-debug error about nans in the training data. |
Thanks for the feedback. I'm happy to rename to `estimator`.
Regarding position: this is tricky because `SimpleImputer` doesn't have a
meta estimator so it will have a different first parameter from
`IterativeImputer` if we make the ordering change. Is that acceptable?
That's the reason we haven't made `estimator` first already.
|
Should we consider making `estimator` the first argument? |
It is the most important parameter... but I don't mind the current
consistency with SimpleImputer
|
I'll defer to sklearn full-timers. Let me know if you reach a consensus to move it. I'll change the name to `estimator`. |
This discussion brings some insights about adding a multivariate imputer in scikit-learn. Because of release time constraints, the development was moved into a specific branch (FIXME: give a specific name); see #11600.
From the discussion in #8478, we have to deal with the following issues in MICEImputer. We have the following things to do:
- Determine the most appropriate way to use individual imputation samples in predictive modelling, clustering, etc., which are scikit-learn's focus:
  a. is using a single draw acceptable?
  b. is averaging over multiple draws from the final fit appropriate?
  c. is ensembling multiple predictive estimators, each trained on a different imputation, most appropriate?
Minor things:
- Imputer is used instead of SimpleImputer.
- The documentation of imputation_sequences_ should be improved (length of the list mainly).