
MICE/Multiple Imputation branch #11259


Closed
4 tasks
glemaitre opened this issue Jun 14, 2018 · 68 comments

Comments

@glemaitre
Member

glemaitre commented Jun 14, 2018

This discussion brings together some insights about adding a multivariate imputer to scikit-learn. Because of release time constraints, the development was moved into a specific branch (FIXME: give a specific name); see #11600.

From the discussion in #8478, we have to deal with the following issues in MICEImputer:

We have the following things to do:

  • Determine the most appropriate way to use individual imputation samples in predictive modelling, clustering, etc., which are scikit-learn's focus (options (a) and (b) are sketched at the end of this comment).
    a. is using a single draw acceptable?
    b. is averaging over multiple draws from the final fit appropriate?
    c. is ensembling multiple predictive estimators, each trained on a different imputation, most appropriate?
  • Perhaps determine if, in a predictive modelling context, it is necessary to have the sophistication of MICE in sampling each imputation value rather than just using point predictions.
  • Provide an example illustrating the inferential capabilities due to multiple imputation. I don't think there's anything limiting about our current interface, but it deserves an example.
  • Rename MICEImputer to de-emphasise multiple imputation because it only performs a single one at a time.

Minor things:

  • The documentation refers to Imputer instead of SimpleImputer.
  • imputation_sequences_ should be improved (length of the list mainly).
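
A minimal sketch contrasting options (a) and (b), using the MICEImputer API discussed in this thread (X_missing stands for any feature matrix containing NaNs; this is purely illustrative, not a recommendation):

from sklearn.impute import MICEImputer

# Option (a): a single draw -- one imputed dataset from the final round of the chain.
single_draw = MICEImputer(n_burn_in=99, n_imputations=1, random_state=0)
X_single = single_draw.fit_transform(X_missing)

# Option (b): averaging over multiple draws from the final fit -- the imputer itself
# averages the last n_imputations rounds of the chain into one completed dataset.
averaged = MICEImputer(n_burn_in=50, n_imputations=50, random_state=0)
X_averaged = averaged.fit_transform(X_missing)

Option (c) would instead train a separate predictive estimator on each of several imputed datasets and aggregate their predictions; an example of that pattern appears further down in the thread.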
@sergeyf
Contributor

sergeyf commented Jun 14, 2018

@glemaitre can you clarify what you mean by your point about imputation_sequences_?

@glemaitre
Member Author

Currently the documentation is not explicit regarding the size of the list. I find that it should be n_burn_in * n_features_with_missing + n_imputations * n_features_with_missing.

@sergeyf
Contributor

sergeyf commented Jun 14, 2018

Ah, thanks!

@sergeyf
Contributor

sergeyf commented Jun 14, 2018

One comment about the to-do list. I don't think this is necessary to do:

"Perhaps determine if, in a predictive modelling context, it is necessary to have the sophistication of MICE in sampling each imputation value rather than just using point predictions."

I think stats people already know that they want to understand uncertainty due to missing values. Why do we need to determine whether it is "necessary"? The answer will inevitably be "sometimes" and will depend on what dataset is chosen to run some basic experiments on.

I'll try to carve some time out next week to tackle at least some of these, but it would be great to have someone else contribute as it's a busy work season for me.

@RianneSchouten

I am willing to help, but I don't totally get what's been decided about n_imputations and m?

@glemaitre
Member Author

glemaitre commented Jun 14, 2018 via email

@sergeyf
Contributor

sergeyf commented Jun 14, 2018

My understanding is that no concrete action about n_imputations and m is being taken now; instead, we can demonstrate how to perform MICE as the stats world intended with something like this:

from sklearn.impute import MICEImputer

n_burn_in = 50
n_imputations = 1
m = 10  # number of independently seeded imputations
multiple_imputations = []
for i in range(m):
    imputer = MICEImputer(n_burn_in=n_burn_in, n_imputations=n_imputations,
                          random_state=i)
    X_imputed = imputer.fit_transform(X_missing)
    multiple_imputations.append(X_imputed)

# insert some other downstream tasks that are done to each element so as to demonstrate the variability of the missing values

This can be accompanied by comments that discuss why the various values are set as they are.
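
For instance, one very small way to fill in that downstream step (purely illustrative; it reuses multiple_imputations and X_missing from the snippet above) is to look at the spread of the imputed values across the m completed datasets:

import numpy as np

stacked = np.stack(multiple_imputations)      # shape (m, n_samples, n_features)
between_imputation_std = stacked.std(axis=0)  # per-cell spread across the m imputations

# Observed cells have ~zero spread; missing cells show the uncertainty that the
# chained imputation assigns to them.
missing_mask = np.isnan(X_missing)
print("mean std over imputed cells:", between_imputation_std[missing_mask].mean())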

In addition, we will also need a ML example that looks like this:

n_burn_in = 50
n_imputations = 50
# m = 1
# etc etc

I think we may have something like this already actually here: https://github.com/scikit-learn/scikit-learn/blob/master/examples/plot_missing_values.py

@sergeyf
Contributor

sergeyf commented Jun 14, 2018

@glemaitre @jnothman I ran a quick experiment with Boston to demonstrate that averaging a longer set of the last n_imputations is helpful for downstream ML tasks.

See this gist for the code: https://gist.github.com/sergeyf/08e5af7674b4d2c6d36dcb7872745c40

The result:

[figure: held-out MSE of a RandomForestRegressor vs. n_imputations, with error bars]

Here Boston is missing data in 75% of the rows. I kept n_burn_in + n_imputations fixed at 100 and swept over n_imputations. At the very left of the plot is the average held-out MSE of a RandomForestRegressor after the data is imputed with MICE with n_burn_in = 99 and n_imputations = 1. It's the highest MSE and the most variable.

As we go to the right, the mean MSE goes down until about n_imputations = 50 and then flattens out.

This entire experiment was re-run 100 times to get the bars.

I think this provides at least a partial answer to:

Determine the most appropriate way to use individual imputation samples in predictive modelling, clustering, etc, which are Scikit-learn's focus.
a. is using a single draw acceptable?  NO
b. is averaging over multiple draws from the final fit appropriate?  YES
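
The full experiment is in the gist linked above; roughly, the sweep looks like the sketch below (the grid, masking scheme, and model settings here are placeholders standing in for the gist, not taken from it):

import numpy as np
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import MICEImputer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

total_iters = 100
results = {}
for n_imputations in (1, 10, 25, 50, 75, 99):       # placeholder grid
    mses = []
    for seed in range(100):                          # 100 repeats to get the error bars
        X, y = load_boston(return_X_y=True)
        rng = np.random.RandomState(seed)
        X_missing = X.copy()
        missing_rows = rng.rand(X.shape[0]) < 0.75   # 75% of rows get a missing value
        missing_cols = rng.randint(0, X.shape[1], X.shape[0])
        X_missing[missing_rows, missing_cols[missing_rows]] = np.nan
        X_train, X_test, y_train, y_test = train_test_split(X_missing, y,
                                                            random_state=seed)
        imputer = MICEImputer(n_burn_in=total_iters - n_imputations,
                              n_imputations=n_imputations, random_state=seed)
        X_train_imp = imputer.fit_transform(X_train)
        X_test_imp = imputer.transform(X_test)
        rf = RandomForestRegressor(random_state=seed).fit(X_train_imp, y_train)
        mses.append(mean_squared_error(y_test, rf.predict(X_test_imp)))
    results[n_imputations] = (np.mean(mses), np.std(mses))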

@RianneSchouten

RianneSchouten commented Jun 15, 2018 via email

@glemaitre
Member Author

I am also thinking that we should change initial_strategy to initial_imputer, since it will be a pain to update the docstring each time SimpleImputer gets updated. It would also allow accepting several imputers, for instance the future RandomImputer.

@jnothman WDYT?
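
A sketch of how that could look (initial_imputer is a proposal in this comment, not an existing parameter):

from sklearn.impute import SimpleImputer

# Hypothetical: seed the chain with any imputer object instead of a strategy string,
# e.g. SimpleImputer(strategy='median') or, later, a RandomImputer.
imputer = MICEImputer(initial_imputer=SimpleImputer(strategy='median'))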

@sergeyf
Contributor

sergeyf commented Jun 18, 2018

@glemaitre I'll try to get this one done today: "Rename MICEImputer to de-emphasise multiple imputation because it only performs a single one at a time."

I'm going to use ChainedImputer, unless anyone has any objections.

@sergeyf
Contributor

sergeyf commented Jun 18, 2018

Here is the renaming PR: #11314

@jorisvandenbossche
Member

@sergeyf possibly a naive question from somebody with no statistics background: when the goal is to have a single best imputation (and not multiple imputations to do inference on the results), is it really necessary to do the many iterations and take the average?
In your experiment above you showed that with the current code this gives a more stable result (which is indeed good if the goal is the single best imputation and you are not interested in the variation). But the current implementation also introduces random noise (based on the sigma of the model) in each iteration:

# get posterior samples
X_test = safe_indexing(X_filled[:, neighbor_feat_idx],
                       missing_row_mask)
mus, sigmas = predictor.predict(X_test, return_std=True)
good_sigmas = sigmas > 0
imputed_values = np.zeros(mus.shape, dtype=X_filled.dtype)
imputed_values[~good_sigmas] = mus[~good_sigmas]
imputed_values[good_sigmas] = self.random_state_.normal(
    loc=mus[good_sigmas], scale=sigmas[good_sigmas])

So naively I would think that doing this many times and averaging at the end will lead to almost the same results as just doing it once without adding this random noise (after the initial burn-in, i.e. relying on the mean of the model), but the latter would be much more efficient (1 instead of 100 imputations).

@sergeyf
Contributor

sergeyf commented Jun 21, 2018

@jorisvandenbossche I'm not a MICE expert, really. I just happened to meet Stef in real life, tried MICE out on some problems I had, and noticed it worked very well in ML contexts. So I wanted to make it available to ML people. That is to say: I'm not the best person to answer. @stefvanbuuren would know better, however.

But my intuition is that it would not be the same. It's basically an MCMC sampling process:

[figure: example MCMC trace plots]

I think of the individual imputed values as jumping around as in the upper-left panel. If we freeze the process at any point, it may not be near the mean of that plot (around 1.0). By taking the average across the last n_imputations, we are hedging against this and getting closer to the mean. I don't think simply not sampling at the very end would get you the mean, because refitting at each iteration is also part of the sampling process.

I could be wrong though. Maybe you could make a fork, quickly modify MICE and check your hypothesis, using the gist I posted as a base?

@jorisvandenbossche
Member

I could be wrong though. Maybe you could make a fork, quickly modify MICE and check your hypothesis, using the gist I posted as a base?

Yes, I was planning to, and have now taken the time to do it; see the figure below. The figure is the same as the one above (using the same seeds as you), but the added red line gives the experiment of using the imputed values of the last iteration and using the model mean (without adding noise) in each iteration. I did this for n_burn_in values of [10, 20, 50] (and with n_imputations=1). So you don't really need to look at the x values, as the number of iterations does not match the green line (the green line always has a total of 100 iterations divided between n_burn_in and n_imputations).
I also added error bars for the results with the original data without missing values (by running this 100 times as well, like the other experiments).

[figure: held-out MSE for the original sampling (green) vs. model mean without noise (red), with error bars for the complete data]

The mean value is slightly lower (whether this is good or bad I don't know), the variation is clearly lower, and it is already stable after 10 burn-in iterations (for this example).

I think of the individual imputed values as jumping around as in the upper left image. If we freeze the process at any point, it may not be near the mean of that plot (around 1.0).

Yes, but when we don't add noise in the iterations and only use the mean of the model, the values will not jump around but will converge to a stable (single best) imputation. That is why doing it like this gives a lower variance (and with far fewer iterations, so it is more efficient).
Of course, directly taking the output (prediction) of the imputation model without adding noise in each iteration will affect the next model in the chain, but whether this is a problem I don't know.

The quick and dirty patch to MICEImputer:
--- a/sklearn/impute.py
+++ b/sklearn/impute.py
@@ -625,11 +625,12 @@ class MICEImputer(BaseEstimator, TransformerMixin):
         X_test = safe_indexing(X_filled[:, neighbor_feat_idx],
                                missing_row_mask)
         mus, sigmas = predictor.predict(X_test, return_std=True)
-        good_sigmas = sigmas > 0
-        imputed_values = np.zeros(mus.shape, dtype=X_filled.dtype)
-        imputed_values[~good_sigmas] = mus[~good_sigmas]
-        imputed_values[good_sigmas] = self.random_state_.normal(
-            loc=mus[good_sigmas], scale=sigmas[good_sigmas])
+        imputed_values = mus
 
         # clip the values
         imputed_values = np.clip(imputed_values,
@@ -883,7 +884,8 @@ class MICEImputer(BaseEstimator, TransformerMixin):
                       '%d/%d, elapsed time %0.2f'
                       % (i_rnd + 1, n_rounds, time() - start_t))
 
-        Xt /= self.n_imputations
+        Xt = X_filled # last filled values
         Xt[~mask_missing_values] = X[~mask_missing_values]
         return Xt

@sergeyf
Contributor

sergeyf commented Jun 21, 2018

Thanks for the experiment. I think I had the wrong interpretation of what you were suggesting. It looks like not sampling during the chained process helps the downstream model quite a bit (and reduces variance), which is a cool finding!

Have you by chance looked at whether the MICE tests that have to do with empirical correctness still pass? They're in test_impute.py: test_chained_imputer_additive_matrix, test_chained_imputer_transform_recovery.

Without any kind of sampling, this is somewhat far from the original MICE algorithm. The empirical evidence you've provided is positive, but it's limited to this example and has not been thoroughly explored like MICE has in the literature. This makes me a bit hesitant about adding it to sklearn.

But maybe we can put the sampling behind a flag and set it to True by default? This would at least allow an end-user to try the version you're suggesting. We would probably also need an example of how to use it with the flag set to False and why one might want to.

Anyone else have thoughts here?

@jnothman
Member

jnothman commented Jun 22, 2018 via email

@jnothman
Member

jnothman commented Jun 22, 2018 via email

@sergeyf
Contributor

sergeyf commented Jun 22, 2018

In my opinion, the default should be the one that is empirically shown to work best across a variety of examples, based on already available ones.

@jorisvandenbossche
Member

Have you by chance looked at whether the MICE tests that have to do with empirical correctness still pass? They're in test_impute.py: test_chained_imputer_additive_matrix, test_chained_imputer_transform_recovery.

Those two still pass (only test_chained_imputer_transform_stochasticity is failing, for good reason as transform no longer adds the stochastic noise)

@sergeyf
Contributor

sergeyf commented Jun 22, 2018

Great, thank you for checking. It would definitely be good to have a flag to enable this. It would also mean we can easily toss in other regressors. It would be great to have a reference for the non-sampled version of MICE, if anyone happens to know a good one.
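
For example, with such a flag disabled, any regressor that only provides point predictions could be plugged into the chain. A sketch, assuming a hypothetical flag (the name sample_posterior was only settled later in this thread) and the predictor parameter mentioned further down:

from sklearn.ensemble import ExtraTreesRegressor

# Hypothetical usage: no posterior sampling, so return_std support is not required
# and any regressor can act as the chained model.
imputer = ChainedImputer(predictor=ExtraTreesRegressor(n_estimators=100),
                         sample_posterior=False)
X_completed = imputer.fit_transform(X_missing)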

@nprost

nprost commented Jun 22, 2018

Hi, I'm very new here, but after talking with @jorisvandenbossche I would like to support his point that drawing a variable's missing values from the conditional distribution is perhaps not necessary when the aim is prediction - we can just take the regression prediction instead. I'm not sure about such an implementation of MICE, but missForest does just that with random forests.
@julierennes would know more about that than me.

@RianneSchouten

The example to show how ChainedImputer can be used as a MICE Imputer would include something like this:

import numpy as np
from sklearn.impute import ChainedImputer
from sklearn.linear_model import LinearRegression


def calculate_data_variance(X):
    means = np.mean(X, axis=0)
    X_dif = X - means
    X_dif_squared = X_dif ** 2
    SSxx = np.sum(X_dif_squared, axis=0)
    return SSxx


def calculate_variance_of_beta_estimates(y_true, y_pred, SSxx):
    SSe = np.sum((y_true - y_pred) ** 2) / (len(y_true) - 2)
    vars = SSe / SSxx
    return vars


# Impute the incomplete data m times, using the ChainedImputer as a MICEImputer
m = 5
multiple_imputations = []
for i in range(m):
    imputer = ChainedImputer(n_burn_in=100, n_imputations=1)
    imputer.fit(X_incomplete)
    X_imputed = imputer.transform(X_incomplete)
    multiple_imputations.append(X_imputed)
    del imputer

# Fit a model on each of the m imputed datasets
# and collect the estimates for each model/dataset
m_coefs = []
m_vars = []
for i in range(m):
    estimator = LinearRegression()
    estimator.fit(multiple_imputations[i], y)
    y_predict = estimator.predict(multiple_imputations[i])
    SSxx = calculate_data_variance(multiple_imputations[i])
    m_coefs.append(estimator.coef_)
    m_vars.append(calculate_variance_of_beta_estimates(y, y_predict, SSxx))
    del estimator

# Calculate the final estimates by applying Rubin's rules
# Rubin's rules can be slightly different for different types of estimates
# In case of linear regression, these are the rules:
# The value of every estimate is the mean of the estimates over the m datasets
Qbar = np.mean(m_coefs, axis=0)
# The variance of these estimates combines the within-imputation variance (Ubar)
# and the between-imputation variance (B)
Ubar = np.mean(m_vars, axis=0)
B = np.sum((np.array(m_coefs) - Qbar) ** 2, axis=0) / (m - 1)
T = Ubar + B + (B / m)

# The beta estimates are stored in Qbar, the variance of these estimates in T

However, I have done some simulations testing the procedure, comparing the ChainedImputer with m = 1 and m = 5 and with n = 1 and n = 100, and the results are not as I would expect. I will post the script in a GitHub repo later, but I need some more time to think about these results.

[figure: simulation results comparing the ChainedImputer settings]

@sergeyf
Contributor

sergeyf commented Jun 22, 2018

@RianneSchouten I haven't had time to think about this yet, but you probably want to set a different seed for each of the imputations in this line:

ChainedImputer(n_burn_in=100, n_imputations=1)

It should instead be:

ChainedImputer(n_burn_in=100, n_imputations=1, random_state=i)

@julierennes

@jorisvandenbossche @nprost

Indeed, if your aim is to impute and predict the missing entries as well as possible, then single imputation is enough; multiple imputation is not required, and you can impute by taking the conditional expectation rather than drawing from the conditional distribution. As far as prediction is concerned, there are not yet many results on this problem. Common practice and a few papers tend to suggest the following approach: perform multiple imputation, apply your predictive algorithm on each imputed data set to estimate the response, say Y, and then aggregate the different predictions.

Best,
JJ
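
A minimal sketch of that aggregation approach (names are illustrative; X_train_missing, X_test_missing and y_train stand for any train/test split with missing feature values):

import numpy as np
from sklearn.impute import ChainedImputer
from sklearn.linear_model import LinearRegression

m = 5
test_predictions = []
for i in range(m):
    # one imputation per seed, applied consistently to train and test features
    imputer = ChainedImputer(n_burn_in=50, n_imputations=1, random_state=i)
    X_train_imp = imputer.fit_transform(X_train_missing)
    X_test_imp = imputer.transform(X_test_missing)
    model = LinearRegression().fit(X_train_imp, y_train)
    test_predictions.append(model.predict(X_test_imp))

# Aggregate the m predictions of the response Y into a single prediction.
y_test_pred = np.mean(test_predictions, axis=0)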

@RianneSchouten

@sergeyf
If the default random_state is None, and that means it picks a random number, then the imputer will be different for every i in range(m), don't you think? Because I delete the imputer after round i is finished with del imputer. Maybe changing it to random_state=i means you don't have to delete the imputer?

@sergeyf
Contributor

sergeyf commented Jun 26, 2018

I like the idea of the default being dependent on the mode.

I think keeping the flag more function-oriented (posterior_sample) rather than usage-oriented (fill or mode) is probably better.

@jnothman
Member

jnothman commented Jun 26, 2018 via email

@sergeyf
Contributor

sergeyf commented Jun 26, 2018

OK, I made the changes as discussed. And I last-minute went with sample_posterior =)
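
For reference, a sketch of the two modes behind the flag (ChainedImputer and sample_posterior as discussed here; in later scikit-learn releases the class became IterativeImputer):

# Inference / multiple-imputation mode: imputed values are drawn from the posterior,
# so different random_state values give genuinely different completed datasets.
mi_imputer = ChainedImputer(sample_posterior=True, random_state=0)

# Single-best-imputation mode: the chain uses the regression mean, no noise added.
point_imputer = ChainedImputer(sample_posterior=False, random_state=0)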

@jnothman
Member

jnothman commented Jun 26, 2018 via email

@RianneSchouten

RianneSchouten commented Jun 26, 2018 via email

@sergeyf
Contributor

sergeyf commented Jun 26, 2018

This won't work because during validation/test time we don't have y. Your option would only work in transductive settings.

@stefvanbuuren

Good point Rianne. Potentially you could add y as a predictor and impute. Whether that's OK to do depends on how you want to impute. If you want to impute a single best value, then DO NOT include y (as now implemented). If you draw from the posterior you must include y. See Little 1992 for more details.
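
Purely as an illustration of the "include y" route (this is not what the imputer implements; the stacking trick and the names here are hypothetical):

import numpy as np

# Append the observed training response as an extra column so the imputation
# models can condition on it when drawing from the posterior.
Xy_train = np.column_stack([X_train_missing, y_train])
imputer = ChainedImputer(n_burn_in=50, n_imputations=10, random_state=0)
Xy_train_imp = imputer.fit_transform(Xy_train)
X_train_imp = Xy_train_imp[:, :-1]   # drop the response column again

# At validation/test time y is unknown (the point raised above), so the test
# features would have to be imputed without that extra column.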

@jnothman
Member

jnothman commented Jun 27, 2018 via email

@stefvanbuuren

Sure, in a clustering context there is no observed y, so there is no question of whether you should include it or not. But for prediction there is an observed y, and we could go down two routes. It's then up to the software designer to decide which routes to support.

@jnothman
Member

jnothman commented Jun 27, 2018 via email

@jnothman
Member

jnothman commented Jun 27, 2018 via email

@jnothman
Member

jnothman commented Jun 27, 2018 via email

@amueller
Member

I just used this and noticed that a) the model is called predictor, not estimator (or base_estimator) as is usual for a meta-estimator, and b) it's not the first argument.
Ideally I'd prohibit positional arguments, but since we don't for now, I'd rather have the base estimator be the first argument.
If you do IterativeImputer(RandomForestRegressor(n_estimators=100)) you get a hard-to-debug error about NaNs in the training data.
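
To illustrate the pitfall (a sketch; the parameter was named predictor at the time of this comment and was later renamed estimator, and in released scikit-learn the class needs an experimental import):

from sklearn.ensemble import RandomForestRegressor

# Positional call: the regressor is bound to the first parameter, which is not the
# model, so the call does not do what it looks like and surfaces later as a
# confusing NaN error, as described above.
# imp = IterativeImputer(RandomForestRegressor(n_estimators=100))

# Explicit keyword call behaves as intended (parameter name as it stood back then):
imp = IterativeImputer(predictor=RandomForestRegressor(n_estimators=100))
X_completed = imp.fit_transform(X_missing)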

@sergeyf
Contributor

sergeyf commented Feb 11, 2019 via email

@jnothman
Member

jnothman commented Feb 12, 2019 via email

@sergeyf
Contributor

sergeyf commented Feb 12, 2019

Should we consider making predictor the first input parameter?

@jnothman
Member

jnothman commented Feb 12, 2019 via email

@sergeyf
Contributor

sergeyf commented Feb 12, 2019

I'll defer to sklearn full-timers. Let me know if you reach a consensus to move it. I'll change the name to estimator in the most recent PR.
