Random Forest Imputation #9591
@ashimb9, it's not that I don't like imputation... :P
But I do think we need to be selective here. As a project, our maintenance
burden is growing. The availability of veteran contributors who know the majority of the codebase and its tacit conventions is shrinking. Each new
implementation we adopt opens a new can of worms in terms of maintaining
its assured quality, efficiency, flexibility and interoperability. Not to
mention user support and documentation. And that's only the cost *once*
we've merged the code. As you can tell, there are high costs even before
then.
I appreciate the good work you are doing in investigating and implementing
these imputation techniques. Scikit-learn for a very long time operated
under the unrealistic assumption of complete data. Imputers are a valuable
solution, despite their limitations, and deserve to exist in Python with Scikit-learn-compatible interfaces. Tools like https://github.com/hammerlab/fancyimpute fill this gap, but focus on
the transductive problem (rather than fitting and later transforming
held-out data), and @sergeyf's work on porting MICE to a
Scikit-learn-compatible API is very welcome.
But to keep our scope tight and our codebase maintainable (to the extent
that it is!), most of these will need to remain external to the core project. Working to make fancyimpute more scikit-learn-compatible, more inductive, and more extensive would be welcome. Everything that does get
added to the core project will need to prove that it complements our
existing offerings well (and that users are informed as to when it is the
right tool to use), and ideally reuses or enhances the tools we already
have present in Scikit-learn as much as possible.
So again, I ask: what's the difference between missForest and generally
learning a regression problem on the samples where the feature is not
missing, for each missing feature? When do I need it? Can we reuse the
existing random forests implementation? Does it need #5974 to be merged to
train the regressor on samples with values missing?
Thanks a lot for your thorough response, I really do appreciate it. First let me just say I totally agree with the concerns you have raised regarding long-term code maintenance, and I obviously do not have enough knowledge to comment on that front. However, I will try to respond to the three questions you raised at the end regarding the conceptual and implementation aspects:

On the difference between missForest and generally learning a regression problem on the samples where the feature is not missing: did you mean the difference between fitting the RF with missing values versus imputation followed by fitting the RF? If yes: I do not know the answer, but I was thinking more of imputation for estimators that do not have an established "intrinsic" method for handling missing data.

On reusing the existing random forests implementation: yes! The process, at least as implemented in the referenced package, begins with a common imputation strategy (like mean imputation) and iteratively imputes each feature until a stopping criterion or max_iter is reached, so we can definitely go ahead with the existing implementation.

On whether it needs #5974 to be merged to train the regressor on samples with values missing: as suggested by the response above, no, it does not depend on that PR being merged.

Having said that, it would be totally understandable if you decide not to go ahead with this one.
Yes! The process, at least as implemented in the referenced package,
begins with a common imputation strategy (like mean imputation) and
iteratively imputes each feature until a stopping criterion or max_iter is
reached, so we can definitely go ahead with the existing implementation.
Then it sounds like it comes from the MICE family of techniques: apply an
arbitrary Regressor iteratively until convergence.
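A minimal sketch of that family of techniques, assuming only an ordinary scikit-learn regressor with fit/predict (the helper name, defaults, and stopping rule below are illustrative, not an existing scikit-learn API):

import numpy as np
from sklearn.linear_model import BayesianRidge

def iterative_impute(X, regressor=None, max_iter=10, tol=1e-3):
    # Illustrative only: fill NaNs by repeatedly regressing each
    # incomplete column on the remaining columns.
    X = np.asarray(X, dtype=float).copy()
    mask = np.isnan(X)
    # Initial guess: column means.
    X[mask] = np.take(np.nanmean(X, axis=0), np.where(mask)[1])
    if regressor is None:
        regressor = BayesianRidge()
    for _ in range(max_iter):
        X_prev = X.copy()
        for col in np.where(mask.any(axis=0))[0]:
            others = np.delete(np.arange(X.shape[1]), col)
            obs, mis = ~mask[:, col], mask[:, col]
            regressor.fit(X[np.ix_(obs, others)], X[obs, col])
            X[mis, col] = regressor.predict(X[np.ix_(mis, others)])
        # Stop once the imputed values barely change between sweeps.
        if np.sqrt(np.mean((X - X_prev) ** 2)) < tol:
            break
    return X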
@jnothman Here is a quickly whipped up implementation for continuous variables, albeit inefficient and only very lightly tested. As you can see, it is fairly straightforward. (And yes, this might be a sales pitch :))
But that's not really specialised to random forests, is it? It's something you can do with any regressor. Which is great, but why restrict it?
It's not the same as MICE, which requires its regressor to provide stds
over its predictions (which of course RFR can do but does not atm), and I
can't remember the details of it.
…On 22 August 2017 at 13:01, ashimb9 wrote:
@jnothman Here is a quickly whipped up implementation for continuous variables, albeit inefficient and only very lightly tested. As you can see, it is fairly straightforward. (And yes, this might be a sales pitch :))
from __future__ import division
from __future__ import print_function

import numpy as np
from sklearn.ensemble import RandomForestRegressor


def RFImputer(Ximp):
    """Impute missing values of a float array in place, missForest-style.

    Returns the imputation from the iteration before the change between
    successive imputations started increasing (continuous features only).
    """
    mask = np.isnan(Ximp)
    missing_rows, missing_cols = np.where(mask)

    # MissForest algorithm
    # 1. Make an initial guess for the missing values (column means).
    col_means = np.nanmean(Ximp, axis=0)
    Ximp[missing_rows, missing_cols] = np.take(col_means, missing_cols)

    # 2. k <- column indices sorted by increasing number of missing values.
    col_missing_count = mask.sum(axis=0)
    k = np.argsort(col_missing_count)

    # 3. Repeat while the change between successive imputations keeps
    #    decreasing (gamma_new < gamma_old) and iterations remain.
    n_iter = 0
    max_iter = 100
    gamma_new = 0
    gamma_old = np.inf
    col_index = np.arange(Ximp.shape[1])
    model_rf = RandomForestRegressor(random_state=0, n_estimators=1000)
    # TODO: update the stopping condition for categorical variables.
    while gamma_new < gamma_old and n_iter < max_iter:
        # 4. Store the previously imputed matrix.
        Ximp_old = np.copy(Ximp)
        if n_iter != 0:
            gamma_old = gamma_new
        # 5. Loop over the columns, fewest missing values first.
        for s in k:
            if col_missing_count[s] == 0:
                continue  # nothing to impute in this column
            s_prime = np.delete(col_index, s)
            obs_rows = np.where(~mask[:, s])[0]
            mis_rows = np.where(mask[:, s])[0]
            yobs = Ximp[obs_rows, s]
            xobs = Ximp[np.ix_(obs_rows, s_prime)]
            xmis = Ximp[np.ix_(mis_rows, s_prime)]
            # 6. Fit a random forest on the observed part of column s.
            model_rf.fit(X=xobs, y=yobs)
            # 7./8. Predict the missing entries and update the imputed matrix.
            ymis = model_rf.predict(xmis)
            Ximp[mis_rows, s] = ymis
        # 9. Update gamma (relative change between successive imputations).
        gamma_new = np.sum((Ximp_old - Ximp) ** 2) / np.sum(Ximp ** 2)
        print("Iteration:", n_iter)
        n_iter += 1
    return Ximp_old
I suppose it does not have to be RF-specific. You could technically use any algorithm (BoostedImputerMachine has a nice ring to it). On a more serious note, I take it that this is DOA then?
I think it would be good to have something like this somewhere, whether or not in scikit-learn. I would first like to see a version that is inductive. I think we've had concerns about making it inductive before.
Just chiming in here. As mentioned elsewhere in this thread, MICE pretty much does what missForest does if you give it a random forest as the regressor. For adding return_std to RandomForestRegressor, as a reference: a fancy version of forest uncertainties is already implemented in https://github.com/scikit-learn-contrib/forest-confidence-interval.
Yes, I'd thought standard deviation over individual predictions should suffice for RFR return_std, but perhaps one should account for the within-tree uncertainties too (and forest-confidence-interval incorporates a bias correction). I don't think we should consider that a priority, though it would be nice to say we have something comparable to missForest once MICE and RFR's predict_std are implemented, I suppose...
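A rough sketch of that simple version, taking the spread across the per-tree predictions of a fitted RandomForestRegressor (illustrative only; this is neither an actual return_std implementation nor the bias-corrected estimator from forest-confidence-interval):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Each tree's prediction, shape (n_trees, n_samples).
per_tree = np.stack([tree.predict(X) for tree in rf.estimators_])
mean = per_tree.mean(axis=0)  # matches rf.predict(X) up to floating point
std = per_tree.std(axis=0)    # naive between-tree spread, ignores within-tree uncertainty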
Makes sense. One other simple option that's inspired by bootstrap Thompson Sampling [1]: let's say you pass MICE an ensemble of T regressors. Instead of sampling from a Gaussian centered at the mean prediction and empirical predictive standard deviation, you just pick one of the T estimates uniformly at random. That's effectively sampling from the actual posterior.
[1] https://arxiv.org/abs/1410.4009
Yes, but that involves special-casing ensembles that give equal weight to
their constituent estimators.
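A small sketch of that pick-one-estimate idea for the equal-weight case, using the per-tree predictions of a fitted RandomForestRegressor (the helper name and interface are illustrative and not part of MICE or scikit-learn):

import numpy as np

def sample_one_tree_prediction(forest, X, random_state=None):
    # `forest` is assumed to be a fitted sklearn RandomForestRegressor,
    # whose individual trees are available via forest.estimators_.
    # For each row of X, return the prediction of one tree chosen uniformly
    # at random; the trees are treated as equally weighted posterior samples.
    rng = np.random.RandomState(random_state)
    per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
    picks = rng.randint(per_tree.shape[0], size=X.shape[0])
    return per_tree[picks, np.arange(X.shape[0])]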
Sorry, I didn't follow that entirely. Can't I just add the following: (a) ensure that if the estimator doesn't have return_std, …
Sorry, I do not quite understand where this discussion is leading. Are we talking about creating a base class from which any sequential imputation algorithm can inherit (including MICE and missForest)? Also, I am not clear on why exactly we need standard deviations if we are using bootstrap resamples and at the end returning a user-chosen 'n_imputations' number of imputed datasets that the user can use to pool estimates from.
@ashimb9 I think we are discussing whether we can easily add support for a missForest-style imputer via the MICE PR. As implemented, MICE requires the regressor to provide standard deviations over its predictions (return_std). So, one thing we could do is modify the PR to accept a regressor without return_std. My suggestion was to do one of the following: (1) add return_std to RandomForestRegressor, or … Does that clarify it?
Ahh yes, sounds good to me. Although to nitpick a little, we should probably call it (i.e., the class) something other than "MICE" then? And "mice" or "missforest" or "random forest" or whatever could be a user-chosen attribute of the class, for instance? For what it's worth, in my opinion we probably would not be doing justice to the authors of missForest and other sequential algorithms if we subsumed everything under the banner of "mice", which was designed in the context of linear regression models.
My feeling is that we should keep it named MICE: https://cran.r-project.org/web/packages/mice/mice.pdf It even has a random-forest-based imputation method, so we could also implement this. Pretty easy.
Yes, I am aware that it currently supports RF-based imputation :)
Ah, ok, cool =)
Just an update to this thread. We have something that can do what missForest does via specifying a model for the new IterativeImputer to use.
Here's a very simple example that demos how this works: #12100
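For later readers, a missForest-style setup with the merged API looks roughly like this (hyperparameters are arbitrary; assumes a scikit-learn version that ships IterativeImputer behind the enable_iterative_imputer experimental import):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 6.0, 9.0],
              [np.nan, 8.0, 12.0]])

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    random_state=0,
)
X_filled = imputer.fit_transform(X)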
I think we can close this issue now that IterativeImputer is merged, right @jnothman?
Yes, thanks! Covered by IterativeImputer |
Description
Would there be interest in implementing a random forest-based imputer in the vein of missForest (https://cran.r-project.org/web/packages/missForest/missForest.pdf)? Something (somewhat) related is in the works in #5974, but it appears the PR is only planning to support RF estimation with NaN and not imputation per se. IMO, this would be a useful addition to the Imputer "suite", especially given that RF-based imputation has been shown to perform quite competitively compared to other imputation methods in a number of empirical studies (see https://academic.oup.com/aje/article/179/6/764/107562/Comparison-of-Random-Forest-and-Parametric and https://academic.oup.com/bioinformatics/article/28/1/112/219101/MissForest-non-parametric-missing-value-imputation). Anyway, if there is interest, I would like to start working on this.
Reference R package: missForest (https://cran.r-project.org/web/packages/missForest/missForest.pdf)
Paper: MissForest—non-parametric missing value imputation for mixed-type data (https://academic.oup.com/bioinformatics/article/28/1/112/219101/MissForest-non-parametric-missing-value-imputation)
PS: @jnothman if you are rolling your eyes, let me just say that I understand :)
UPDATE (+ shameless plug :D) - I have implemented the MissForest algorithm in the missingPy package. It can perform random-forest-based imputation of numerical, categorical, and mixed-type data.