Random Forest Imputation #9591

Closed
ashimb9 opened this issue Aug 21, 2017 · 23 comments

@ashimb9
Contributor

ashimb9 commented Aug 21, 2017

Description

Would there be interest in implementing a random-forest-based imputer in the vein of missForest? Something (somewhat) related is in the works here, but it appears that PR only plans to support fitting RFs in the presence of NaNs, not imputation per se. IMO, this would be a useful addition to the Imputer "suite", especially given that RF-based imputation has been shown to perform quite competitively against other imputation methods in a number of empirical studies (see here and here). Anyway, if there is interest, I would like to start working on this.

Reference R package: missForest
Paper: MissForest—non-parametric missing value imputation for mixed-type data

PS: @jnothman if you are rolling your eyes, let me just say that I understand :)

UPDATE (+ shameless plug :D) - I have implemented the MissForest algorithm in the missingPy package. It can perform random-forest-based imputation of numerical, categorical, and mixed-type data.

@jnothman
Member

jnothman commented Aug 21, 2017 via email

@ashimb9
Contributor Author

ashimb9 commented Aug 21, 2017

Thanks a lot for your thorough response, I really do appreciate it. First, let me say I totally agree with the concerns you raised regarding long-term code maintenance, and I do not have enough knowledge to comment on that front. However, I will try to respond to the questions you raised at the end regarding the conceptual and implementation aspects:

what's the difference between missForest and generally learning a regression problem on the samples where the feature is not missing, for each missing feature? When do I need it?

Did you mean the difference between fitting an RF directly on data with missing values versus imputing first and then fitting the RF? If so: I do not know the answer, but I was thinking more of imputation for estimators that do not have an established "intrinsic" way of handling missing data.

Can we reuse the existing random forests implementation?

Yes! The process, at least as implemented in the referenced package, begins with a simple initial imputation (such as column means) and then iteratively re-imputes each feature until a stopping criterion or max_iter is reached, so we can definitely build on the existing implementation.

Does it need #5974 to be merged to train the regressor on samples with values missing?

As the response above suggests, no, it does not depend on that PR being merged.

Having said that, it would be totally understandable if you decide not to go ahead with this one.

@jnothman
Member

jnothman commented Aug 21, 2017 via email

@ashimb9
Contributor Author

ashimb9 commented Aug 21, 2017

The sequential aspect is similar between the two. One of the advantages over MICE in certain situations, as noted in one of the references, is when there are non-linearities/interactions that are not known a priori and would be better handled by the random forest. Also, the algorithm from the paper:
[Image: the missForest algorithm (Algorithm 1) from the paper]
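
For reference, the stopping criterion used in the paper for continuous variables (my reading of it; this is the quantity computed as gamma in the code sketch further down the thread) is

\Delta_N = \frac{\sum_{j \in \mathbf{N}} \left(\mathbf{X}^{imp}_{new} - \mathbf{X}^{imp}_{old}\right)^2}{\sum_{j \in \mathbf{N}} \left(\mathbf{X}^{imp}_{new}\right)^2}

and the iteration stops the first time \Delta_N increases, with the previously imputed matrix returned.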

@ashimb9
Contributor Author

ashimb9 commented Aug 22, 2017

@jnothman Here is a quickly whipped up implementation for continuous variables, albeit inefficient and only very lightly tested. As you can see, it is fairly straightforward. (And yes this might be a sales pitch :))

from __future__ import division
from __future__ import print_function

import numpy as np
from sklearn.ensemble import RandomForestRegressor


def RFImputer(Ximp):
    """Impute missing (NaN) entries of a float array in place, MissForest-style."""
    mask = np.isnan(Ximp)
    missing_rows, missing_cols = np.where(mask)

    # MissForest Algorithm
    # 1. Make an initial guess for the missing values (column means)
    col_means = np.nanmean(Ximp, axis=0)
    Ximp[(missing_rows, missing_cols)] = np.take(col_means, missing_cols)

    # 2. k <- column indices sorted by increasing number of missing values
    col_missing_count = mask.sum(axis=0)
    k = np.argsort(col_missing_count)

    # 3. While the stopping criterion is not met (difference still decreasing)
    #    and n_iter < max_iter, do:
    n_iter = 0
    max_iter = 100
    gamma_new = 0
    gamma_old = np.inf
    col_index = np.arange(Ximp.shape[1])
    model_rf = RandomForestRegressor(random_state=0, n_estimators=1000)
    # TODO: Update while condition for categorical vars
    while gamma_new < gamma_old and n_iter < max_iter:
        # 4. Store the previously imputed matrix
        Ximp_old = np.copy(Ximp)
        if n_iter != 0:
            gamma_old = gamma_new
        # 5. Loop over the columns, least missing first
        for s in k:
            if col_missing_count[s] == 0:
                continue  # nothing to impute in this column
            s_prime = np.delete(col_index, s)
            obs_rows = np.where(~mask[:, s])[0]
            mis_rows = np.where(mask[:, s])[0]
            yobs = Ximp[obs_rows, s]
            xobs = Ximp[np.ix_(obs_rows, s_prime)]
            xmis = Ximp[np.ix_(mis_rows, s_prime)]
            # 6. Fit a random forest on the observed part of column s
            model_rf.fit(X=xobs, y=yobs)
            # 7. Predict ymis(s) using xmis(s)
            ymis = model_rf.predict(xmis)
            # 8. Update the imputed matrix with the predicted ymis(s)
            Ximp[mis_rows, s] = ymis
        # 9. Update gamma (relative change between successive imputations)
        gamma_new = np.sum((Ximp_old - Ximp) ** 2) / np.sum(Ximp ** 2)
        print("Iteration:", n_iter)
        n_iter += 1
    # Return the last imputation before the stopping criterion increased
    return Ximp_old
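
A minimal toy call (made-up data, just to show the expected input: a 2-D float array with np.nan marking the missing entries; note that RFImputer fills its argument in place):

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 4))
X[rng.rand(*X.shape) < 0.2] = np.nan  # knock out roughly 20% of the entries
X_imputed = RFImputer(np.copy(X))     # pass a copy since the input is modified in place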

@jnothman
Member

jnothman commented Aug 22, 2017 via email

@ashimb9
Contributor Author

ashimb9 commented Aug 22, 2017

I suppose it does not have to be RF-specific; you could technically use any algorithm (BoostedImputerMachine has a nice ring to it). On a more serious note, I take it this is DOA then?

@jnothman
Member

jnothman commented Aug 22, 2017 via email

@sergeyf
Contributor

sergeyf commented Aug 22, 2017

Just chiming in here. As mentioned elsewhere in this thread, MICE pretty much does what BoostedImputerMachine would do, and it's inductive in the current PR #8478. If the only difference is a requirement for the model to have return_std, it might be easier to either add basic return_std functionality to the RF regressor, or add a flag to MICE that skips posterior sampling (although the latter would be antithetical to how MICE works).

For adding return_std to regression forests, maybe it would be sufficient to just return the standard deviation of the individual predictions? I suspect it might work just fine.
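
A rough sketch of what that could look like (a hypothetical helper, not an existing sklearn API), just taking the mean and standard deviation over the per-tree predictions:

import numpy as np

def rf_predict_with_std(forest, X):
    # hypothetical helper: per-sample mean and spread of the individual tree predictions
    per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])  # (n_trees, n_samples)
    return per_tree.mean(axis=0), per_tree.std(axis=0)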

As a reference: a fancy version of forest uncertainties is already implemented in https://github.com/scikit-learn-contrib/forest-confidence-interval.

@jnothman
Member

Yes, I'd thought standard deviation over individual predictions should suffice for RFR return_std, but perhaps one should account for the within-tree uncertainties too (and forest-confidence-interval incorporates a bias correction). I don't think we should consider that a priority, though it would be nice to say we have something comparable to missforest once MICE and RFR's predict_std are implemented, I suppose...

@sergeyf
Contributor

sergeyf commented Aug 22, 2017

Makes sense.

One other simple option, inspired by bootstrapped Thompson sampling [1]: say you pass MICE an ensemble of T regressors. Instead of sampling from a Gaussian centered at the mean prediction with the empirical predictive standard deviation, you just pick one of the T estimates uniformly at random. That's effectively sampling from the actual posterior.

[1] https://arxiv.org/abs/1410.4009

@jnothman
Member

jnothman commented Aug 22, 2017 via email

@sergeyf
Contributor

sergeyf commented Aug 22, 2017

Sorry, I didn't follow that entirely. Can't I just add the following to MICE:

(a) Ensure that if the estimator doesn't have return_std, it inherits from BaseEnsemble.
(b) Then, when it comes time to get a posterior sample:

bootstrap_ind = np.random.choice(n_trees)  # pick one tree uniformly at random
y_imputed = ensemble.estimators_[bootstrap_ind].predict(X_miss)  # use its prediction as the posterior sample

@ashimb9
Contributor Author

ashimb9 commented Aug 22, 2017

Sorry, I do not quite understand where this discussion is leading. Are we talking about creating a base class from which any sequential imputation algorithm can inherit (including MICE and missForest)? Also, I am not clear on why exactly we need standard deviations if we are using bootstrap resamples and, at the end, returning a user-chosen n_imputations number of imputed datasets that the user can pool estimates from.

@sergeyf
Contributor

sergeyf commented Aug 22, 2017

@ashimb9 I think we are discussing whether we can easily add support for a RandomForestRegressor or any other ensemble regressor/classifier to the current MICE PR.

As implemented, MICE currently doesn't let you specify the actual model to do the imputation with. It says BayesianRidgeRegression or bust:

https://github.com/sergeyf/scikit-learn/blob/2ac25005353f3a51636b974d416822b374683758/sklearn/preprocessing/imputation.py#L466

So, one thing we could do is modify the PR to accept a model as an argument. As of now, that model would have to have return_std implemented; only two models in sklearn currently do, and they're both linear.

So, my suggestion was to do one of the following:

(1) Add return_std option to all of the ensembles in sklearn.
(2) Add a bootstrap method to MICE as described in my previous post so you wouldn't need return_std. The second option is a bit experimental - I'd have to try it out first, but I think it should work fine.

Does that clarify it?

@ashimb9
Contributor Author

ashimb9 commented Aug 22, 2017

Ahh yes, sounds good to me. Although, to nitpick a little, we should probably call it (i.e., the class) something other than "MICE" then? And "mice" or "missforest" or "random forest" or whatever could be a user-chosen option for the class, for instance? For what it's worth, in my opinion we would not be doing justice to the authors of missForest and other sequential algorithms if we subsumed everything under the banner of "MICE", which was designed in the context of linear regression models.

@sergeyf
Contributor

sergeyf commented Aug 22, 2017

My feeling is that we should keep it named MICE - that's a very established name for these types of algorithms. It's been around for a long time and people know what it is. MICE doesn't require a linear regressor in any case - it has many options. For example, see the docs:

https://cran.r-project.org/web/packages/mice/mice.pdf

It even has mice.impute.rf (Random Forest) already, which does sampling a bit differently than I described:

The procedure is as follows:
1. Fit `T` classification or regression trees by recursive partitioning;
2. For each ymis, find the terminal node it ends up in according to each fitted tree;
3. Make a random draw among the members of those nodes, and take the observed value from that draw as the imputation.

So we could also implement this. Pretty easy.
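
A hedged sketch of that idea on top of the existing forest API (forest.apply gives the leaf index of each sample per tree); the function name and arguments here are made up for illustration:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_leaf_draw_impute(X_obs, y_obs, X_mis, n_trees=10, random_state=0):
    # 1. Fit T trees; 2. find the leaf each missing sample lands in;
    # 3. draw an observed value from the same leaf of a randomly chosen tree.
    rng = np.random.RandomState(random_state)
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=random_state)
    forest.fit(X_obs, y_obs)
    obs_leaves = forest.apply(X_obs)  # shape (n_obs, n_trees): leaf id per tree
    mis_leaves = forest.apply(X_mis)  # shape (n_mis, n_trees)
    y_imp = np.empty(X_mis.shape[0])
    for i in range(X_mis.shape[0]):
        t = rng.randint(n_trees)  # pick one tree at random
        members = y_obs[obs_leaves[:, t] == mis_leaves[i, t]]  # observed values in that leaf
        y_imp[i] = rng.choice(members)
    return y_imp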

@ashimb9
Contributor Author

ashimb9 commented Aug 22, 2017

Yes, I am aware that it currently supports RF-based imputation :)

@sergeyf
Contributor

sergeyf commented Aug 22, 2017

Ah, ok, cool =)

@sergeyf
Contributor

sergeyf commented Sep 17, 2018

Just an update to this thread. We have something that can do what missForest does via specifying a model for the IterativeImputer: #11977
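
For anyone reading this later: with a released scikit-learn (0.21 or newer, where IterativeImputer still sits behind an experimental import), the missForest-style usage looks roughly like this:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [np.nan, 8.0, 9.0],
              [4.0, 5.0, 7.0]])
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)
X_filled = imputer.fit_transform(X)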

@sergeyf
Contributor

sergeyf commented Sep 17, 2018

Here's a very simple example that demos how this works: #12100

@sergeyf
Contributor

sergeyf commented Feb 20, 2019

I think we can close this issue now that IterativeImputer is merged, right @jnothman?

@jnothman
Member

Yes, thanks! Covered by IterativeImputer

@jnothman jnothman closed this as completed Jul 4, 2019