Random Forest Imputation #9591

ashimb9 · 2017-08-21T05:11:51Z

Description

Would there be interest in implementing a random forest based imputer in the vein of missForest? Something (somewhat) related is in the works here but it appears the PR is only planning to support RF estimation with NaN and not imputation per se. IMO, this would be a useful addition to the Imputer "suite", especially given RF-based imputation has been shown to perform quite competitively compared to other imputation methods in a number of empirical studies (see here and here). Anyway, if there is interest, I would like to start working on this.

Reference R package: missForest
Paper: MissForest—non-parametric missing value imputation for mixed-type data

PS: @jnothman if you are rolling your eyes, let me just say that I understand :)

UPDATE (+ shameless plug :D) - I have implemented the MissForest algorithm in the missingPy package. It can perform Random Forest based imputation of numerical, categorical, as well as mixed-type data.

jnothman · 2017-08-21T06:08:46Z

@ashimb9, it's not that I don't like imputation... :P But I do think we need to be selective here. As a project, our maintenance burden is growing. The availability of veteran contributors who know the majority of the codebase and its tacit conventions is reducing. Each new implementation we adopt opens a new can of worms in terms of maintaining its assured quality, efficiency, flexibility and interoperability. Not to mention user support and documentation. And that's only the cost *once* we've merged the code. As you can tell, there are high costs even before then. I appreciate the good work you are doing in investigating and implementing these imputation techniques. Scikit-learn for a very long time operated under the unrealistic assumption of complete data. Imputers are a valuable solution, despite their limitations. Imputers deserve to exist in Python and with Scikit-learn compatible interfaces. Tools like https://github.com/hammerlab/fancyimpute provide for Python, but focus on the transductive problem (rather than fitting and later transforming held-out data), and @sergeyf's work on porting MICE to a Scikit-learn-compatible API is very welcome. But to keep our scope tight and our codebase maintainable (to the extent that it is!), much of these will need to remain external to the core project. Working to make fancyimpute more scikit-learn compatible, more inductive and more extensive may be welcomed. Everything that does get added to the core project will need to prove that it complements our existing offerings well (and that users are informed as to when it is the right tool to use), and ideally reuses or enhances the tools we already have present in Scikit-learn as much as possible. So again, I ask: what's the difference between missForest and generally learning a regression problem on the samples where the feature is not missing, for each missing feature? When do I need it? Can we reuse the existing random forests implementation? Does it need #5974 to be merged to train the regressor on samples with values missing?

…

On 21 August 2017 at 15:11, ashimb9 ***@***.***> wrote: Description Would there be interest in implementing a random forest based imputer in the vein of missForest <https://cran.r-project.org/web/packages/missForest/missForest.pdf>? Something (somewhat) related is in the works here <#5974> but it appears the PR is only planning to support RF estimation with NaN and not imputation per se. IMO, this would be a useful addition to the Imputer "suite", especially given RF-based imputation has been shown to perform quite competitively compared to other imputation methods in a number of empirical studies (see here <https://academic.oup.com/aje/article/179/6/764/107562/Comparison-of-Random-Forest-and-Parametric> and here <https://academic.oup.com/bioinformatics/article/28/1/112/219101/MissForest-non-parametric-missing-value-imputation>). Anyway, if there is interest, I would like to start working on this. Reference R package: missForest <https://cran.r-project.org/web/packages/missForest/missForest.pdf> Paper: MissForest—non-parametric missing value imputation for mixed-type data <https://academic.oup.com/bioinformatics/article/28/1/112/219101/MissForest-non-parametric-missing-value-imputation> PS: @jnothman <https://github.com/jnothman> if you are rolling your eyes, let me just say that I understand :) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#9591>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz660QOezrWfcLw59TGee_vPZUMqpuks5saRGagaJpZM4O842G> .

ashimb9 · 2017-08-21T07:27:19Z

Thanks a lot for your thorough response, I really do appreciate it. First let me just say I totally agree with the concerns you have raised regarding long-term code maintenance issues and I obviously do not have enough knowledge to comment on that front. However, I will just try and respond to the questions you have raised at the very end regarding the conceptual and implementation aspects:

what's the difference between missForest and generally learning a regression problem on the samples where the feature is not missing, for each missing feature? When do I need it?

Did you mean difference between fitting the RF with missing values versus imputation followed by fitting the RF? If yes: I do not know the answer but I was thinking more of imputation for estimators that do not have an established "intrinsic" method for handling missing data.

Can we reuse the existing random forests implementation?

Yes! The process, at least as implemented in the referenced package, begins with a common imputation strategy (like mean imputation) and iteratively imputes each feature until a stopping criteria or max_iter is reached, so we can definitely go ahead with the existing implementation.

Does it need #5974 to be merged to train the regressor on samples with values missing?

As suggested by the response above, no it does not depend on that PR being merged.

Having said that, it would be totally understandable if you decide not to go ahead with this one.

jnothman · 2017-08-21T07:39:11Z

Yes! The process, at least as implemented in the referenced package,

begins with a common imputation strategy (like mean imputation) and iteratively imputes each feature until a stopping criteria or max_iter is reached, so we can definitely go ahead with the existing implementation. Then it sounds like it comes from the MICE family of techniques: apply an arbitrary Regressor iteratively until convergence.

…

On 21 August 2017 at 17:27, ashimb9 ***@***.***> wrote: Thanks a lot for your thorough response, I really do appreciate it. First let me just say I totally agree with the concerns you have raised regarding long-term code maintenance issues and I obviously do not have enough knowledge to comment on that front. However, I will just try and respond to the questions you have raised at the very end regarding the conceptual and implementation aspects: what's the difference between missForest and generally learning a regression problem on the samples where the feature is not missing, for each missing feature? When do I need it? Did you mean difference between fitting the RF with missing values versus imputation followed by fitting the RF? If yes: I do not know the answer but I was thinking more of imputation for estimators that do not have an established "intrinsic" method for handling missing data. Can we reuse the existing random forests implementation? Yes! The process, at least as implemented in the referenced package, begins with a common imputation strategy (like mean imputation) and iteratively imputes each feature until a stopping criteria or max_iter is reached, so we can definitely go ahead with the existing implementation. Does it need #5974 <#5974> to be merged to train the regressor on samples with values missing? As suggested by the response above, no it does not depend on that PR being merged. Having said that, it would be totally understandable if you decide not to go ahead with this one. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#9591 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz6-mZA0qxTpw_DW46yvgOoRh2rx0Kks5saTFagaJpZM4O842G> .

ashimb9 · 2017-08-21T07:51:21Z

The sequential aspect is similar between the two. One of the advantages over MICE in certain situations, as noted in one of the references, is when there are non-linearities/interactions that are not known a priori and would be better handled by the random forest. Also, the algorithm from the paper:

ashimb9 · 2017-08-22T03:01:25Z

@jnothman Here is a quickly whipped up implementation for continuous variables, albeit inefficient and only very lightly tested. As you can see, it is fairly straightforward. (And yes this might be a sales pitch :))

from __future__ import division
from __future__ import print_function
import numpy as np
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

def RFImputer(Ximp):
    mask = np.isnan(Ximp)
    missing_rows, missing_cols = np.where(mask)

    # MissForest Algorithm
    # 1. Make initial guess for missing values
    col_means = np.nanmean(Ximp, axis=0)
    Ximp[(missing_rows, missing_cols)] = np.take(col_means, missing_cols)

    # 2. k <- vector of sorted indices of columns in X
    col_missing_count = mask.sum(axis=0)
    k = np.argsort(col_missing_count)

    # 3. While not gamma_new < gamma_old and iter < max_iter  do:
    iter = 0
    max_iter = 100
    gamma_new = 0
    gamma_old = np.inf
    col_index = np.arange(Ximp.shape[1])
    model_rf = RandomForestRegressor(random_state=0, n_estimators=1000)
    # TODO: Update while condition for categorical vars
    while gamma_new < gamma_old and iter < max_iter:
        # added
        # 4. store previously imputed matrix
        Ximp_old = np.copy(Ximp)
        if iter != 0:
            gamma_old = gamma_new
        # 5. loop
        for s in k:
            s_prime = np.delete(col_index, s)
            obs_rows = np.where(~mask[:, s])[0]
            mis_rows = np.where(mask[:, s])[0]
            yobs = Ximp[obs_rows, s]
            xobs = Ximp[np.ix_(obs_rows, s_prime)]
            xmis = Ximp[np.ix_(mis_rows, s_prime)]
            # 6. Fit a random forest
            model_rf.fit(X=xobs, y=yobs)
            # 7. predict ymis(s) using xmis(x)
            ymis = model_rf.predict(xmis)
            Ximp[mis_rows, s] = ymis
            # 8. update imputed matrix using predicted matrix ymis(s)
        # 9. Update gamma
        gamma_new = np.sum((Ximp_old - Ximp) ** 2) / np.sum(
            (Ximp) ** 2)
        print("Iteration:", iter)
        iter += 1
    return Ximp_old

jnothman · 2017-08-22T03:32:22Z

But that's not really specialised to Random Forests is it? It's something you can do with any regressor. Which is great, but why restrict it? It's not the same as MICE, which requires its regressor to provide stds over its predictions (which of course RFR can do but does not atm), and I can't remember the details of it.

…

On 22 August 2017 at 13:01, ashimb9 ***@***.***> wrote: @jnothman <https://github.com/jnothman> Here is a quickly whipped up implementation for continuous variables, albeit inefficient and only very lightly tested. As you can see, it is fairly straightforward. (And yes this might be a sales pitch :)) from __future__ import division from __future__ import print_function import numpy as np from sklearn.ensemble import RandomForestRegressor import pandas as pd def RFImputer(Ximp): mask = np.isnan(Ximp) missing_rows, missing_cols = np.where(mask) # MissForest Algorithm # 1. Make initial guess for missing values col_means = np.nanmean(Ximp, axis=0) Ximp[(missing_rows, missing_cols)] = np.take(col_means, missing_cols) # 2. k <- vector of sorted indices of columns in X col_missing_count = mask.sum(axis=0) k = np.argsort(col_missing_count) # 3. While not gamma_new < gamma_old and iter < max_iter do: iter = 0 max_iter = 100 gamma_new = 0 gamma_old = np.inf col_index = np.arange(Ximp.shape[1]) model_rf = RandomForestRegressor(random_state=0, n_estimators=1000) # TODO: Update while condition for categorical vars while gamma_new < gamma_old and iter < max_iter: # added # 4. store previously imputed matrix Ximp_old = np.copy(Ximp) if iter != 0: gamma_old = gamma_new # 5. loop for s in k: s_prime = np.delete(col_index, s) obs_rows = np.where(~mask[:, s])[0] mis_rows = np.where(mask[:, s])[0] yobs = Ximp[obs_rows, s] xobs = Ximp[np.ix_(obs_rows, s_prime)] xmis = Ximp[np.ix_(mis_rows, s_prime)] # 6. Fit a random forest model_rf.fit(X=xobs, y=yobs) # 7. predict ymis(s) using xmis(x) ymis = model_rf.predict(xmis) Ximp[mis_rows, s] = ymis # 8. update imputed matrix using predicted matrix ymis(s) # 9. Update gamma gamma_new = np.sum((Ximp_old - Ximp) ** 2) / np.sum( (Ximp) ** 2) print("Iteration:", iter) iter += 1 return Ximp_old — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#9591 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz67CRgeu8kJZVH0_6AqbTsp33Ar8Rks5sakSHgaJpZM4O842G> .

ashimb9 · 2017-08-22T05:37:12Z

i suppose it does not have to be RF specific. you could technically use any algorithm (BoostedImputerMachine has a nice ring to it). on a more serious note, i take it that this is DoA then?

jnothman · 2017-08-22T06:39:08Z

I think it would be good to have something like this somewhere. Whether or not in scikit-learn. I would like to firstly see a version that is inductive. I think we've had concerns about making it inductive before.

…

On 22 August 2017 at 15:37, ashimb9 ***@***.***> wrote: i suppose it does not have to be RF specific. you could technically use any algorithm (BoostedImputerMachine has a nice ring to it). on a more serious note, i take it that this is DoA then? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#9591 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz6yDcnd-Wtk6ai94POGrIGBZdY5Nzks5samkJgaJpZM4O842G> .

sergeyf · 2017-08-22T13:18:01Z

Just chiming in here. As mentioned elsewhere in this thread, MICE pretty much does what BoostedImputerMachine would do, and it's inductive in the current PR #8478. If the only difference is a requirement for the model to have return_std, it might be easier to add basic return_std functionality to the RF regressor OR a flag that doesn't do posterior sampling to MICE (although the latter would be antithetical to how MICE works).

For adding return_std to regression forests, maybe it would be sufficient to just return the standard deviation of the individual predictions? I suspect it might work just fine.

As a reference: a fancy version of forest uncertainties is already implemented in https://github.com/scikit-learn-contrib/forest-confidence-interval.

jnothman · 2017-08-22T13:27:50Z

Yes, I'd thought standard deviation over individual predictions should suffice for RFR return_std, but perhaps one should account for the within-tree uncertainties too (and forest-confidence-interval incorporates a bias correction). I don't think we should consider that a priority, though it would be nice to say we have something comparable to missforest once MICE and RFR's predict_std are implemented, I suppose...

sergeyf · 2017-08-22T13:32:45Z

Makes sense.

One other simple option that's inspired by bootstrap Thompson Sampling [1] -> let's say you pass MICE an ensemble of T regressors. Instead of sampling from a Gaussian centered at the mean prediction and empirical predictive standard deviation, you just pick one of the T estimates uniformly at random. That's effectively sampling from the actual posterior.

[1] https://arxiv.org/abs/1410.4009

jnothman · 2017-08-22T13:35:01Z

Yes, but that involves special-casing ensembles that give equal weight to their constituent estimators.

…

On 22 August 2017 at 23:32, Sergey Feldman ***@***.***> wrote: Makes sense. One other simple option that's inspired by bootstrap Thompson Sampling [1] -> let's say you pass MICE an ensemble of T regressors. Instead of sampling from a Gaussian centered at the mean prediction and empirical predictive standard deviation, you just pick one of the T estimates uniformly at random. That's effectively sampling from the actual posterior. [1] https://arxiv.org/abs/1410.4009 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#9591 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAEz6wEuZBdh6BLBDPG-84AeDk7VQsPmks5satiBgaJpZM4O842G> .

sergeyf · 2017-08-22T13:44:43Z

Sorry, I didn't follow that entirely. Can't I just add the following to MICE:

(a) Ensure that if the estimator doesn't have return_std, it inherits from BaseEnsemble.
(b) Then, when it comes time to get a posterior sample:

bootstrap_ind = np.random.choice(range(n_trees), 1)[0]
individual_preds = [tree.predict(X_miss) for tree in ensemble.estimators_]
y_imputed = individual_preds[bootstrap_ind]

ashimb9 · 2017-08-22T19:24:17Z

Sorry I do not quite understand where this discussion is leading. Are we talking about creating a base class from which any sequential imputation algo can inherit (include mice and missforest)? Also, I guess I also am not clear on why exactly we need std deviations if we are using bootstrap resamples and at the end returning a user chosen 'n_imputations' number of imputed datasets that the user can use to pool estimates from.

sergeyf · 2017-08-22T19:37:42Z

@ashimb9 I think we are discussing whether we can easily add support for a RandomForestRegressor or any other ensemble regressor/classifier to the current MICE pr.

As implemented, MICE currently doesn't let you specify the actual model to do the imputation with. It says BayesianRidgeRegression or bust:

https://github.com/sergeyf/scikit-learn/blob/2ac25005353f3a51636b974d416822b374683758/sklearn/preprocessing/imputation.py#L466

So, one thing we could do, is modify the PR to accept a model as argument. As of now, it would have to have return_std implemented. Only two models do that currently in sklearn and they're linear.

So, my suggestion was to do one of the following:

(1) Add return_std option to all of the ensembles in sklearn.
(2) Add a bootstrap method to MICE as described in my previous post so you wouldn't need return_std. The second option is a bit experimental - I'd have to try it out first, but I think it should work fine.

Does that clarify it?

ashimb9 · 2017-08-22T20:10:39Z

Ahh yes, sounds good to me. Although to nitpick a little, we should probably call it (i.e., the class) something other than "MICE" then? And "mice" or "missforest" or "random forest" or whatever could be a user-chosen attribute for the class, for instance? For what it's worth, in my opinion we probably would not be doing justice to authors of missForest and other sequential algorithms if we subsumed everything under the banner of "mice" which was designed in the context of linear regression models.

sergeyf · 2017-08-22T20:23:06Z

My feeling is that we should keep it named MICE - that's a very established name for these types of algorithms. It's been around for a long time and people know what it is. MICE doesn't require a linear regressor in any case - it has many options. For example, see the docs:

https://cran.r-project.org/web/packages/mice/mice.pdf

It even has mice.impute.rf (Random Forest) already, which does sampling a bit differently than I described:

The procedure is as follows:
1. Fit `T` classification or regression tree by recursive partitioning;
2. For each ymis, find the terminal nodes they end up according to the fitted trees;
3. Make a random draw among the members in the nodes, and take the observed value from that
draw as the imputation.

So we could also implement this. Pretty easy.

ashimb9 · 2017-08-22T20:41:36Z

Yes, I am aware that it currently supports RF-based imputation :)

sergeyf · 2017-08-22T20:42:33Z

Ah, ok, cool =)

sergeyf · 2018-09-17T13:56:47Z

Just an update to this thread. We have something that can do what missForest does via specifying a model for the IterativeImputer: #11977

sergeyf · 2018-09-17T17:53:56Z

Here's a very simple example that demos how this works: #12100

sergeyf · 2019-02-20T17:07:55Z

I think we can close this issue now that ItervativeImputer is merged, right @jnothman?

jnothman · 2019-02-20T20:46:34Z

Yes, thanks! Covered by IterativeImputer

jnothman closed this as completed Jul 4, 2019

Uh oh!

Random Forest Imputation #9591

Random Forest Imputation #9591

Comments

ashimb9 commented Aug 21, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

jnothman commented Aug 21, 2017 via email

Uh oh!

ashimb9 commented Aug 21, 2017

Uh oh!

jnothman commented Aug 21, 2017 via email

Uh oh!

ashimb9 commented Aug 21, 2017

Uh oh!

ashimb9 commented Aug 22, 2017

Uh oh!

jnothman commented Aug 22, 2017 via email

Uh oh!

ashimb9 commented Aug 22, 2017

Uh oh!

jnothman commented Aug 22, 2017 via email

Uh oh!

sergeyf commented Aug 22, 2017

Uh oh!

jnothman commented Aug 22, 2017

Uh oh!

sergeyf commented Aug 22, 2017

Uh oh!

jnothman commented Aug 22, 2017 via email

Uh oh!

sergeyf commented Aug 22, 2017

Uh oh!

ashimb9 commented Aug 22, 2017

Uh oh!

sergeyf commented Aug 22, 2017

Uh oh!

ashimb9 commented Aug 22, 2017

Uh oh!

sergeyf commented Aug 22, 2017

Uh oh!

ashimb9 commented Aug 22, 2017

Uh oh!

sergeyf commented Aug 22, 2017

Uh oh!

sergeyf commented Sep 17, 2018

Uh oh!

sergeyf commented Sep 17, 2018

Uh oh!

sergeyf commented Feb 20, 2019

Uh oh!

jnothman commented Feb 20, 2019

Uh oh!

ashimb9 commented Aug 21, 2017 •

edited

Loading