[MRG] ChainedImputer -> IterativeImputer, and documentation update #11350
Conversation
CIs failing, FYI
Fixed, thank you!
I will work on the example next Wednesday and make a PR then.
sklearn/impute.py
Outdated
    Number of initial imputation rounds to perform the results of which
    will not be returned.
n_iter : int, optional (default=10)
    Number of imputation rounds to perform before returning the final
This should probably clarify that a round indicates a single imputation of each feature
Remove final here.
sklearn/impute.py
Outdated
sample_after_predict : boolean, default=False
    Whether to sample from the predictive posterior of the fitted
    predictor for each imputation. Set to ``True`` if using
Assuming a Gaussian posterior. The predictor requires `return_std` support in its `predict` method.
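For illustration, a minimal sketch of the mechanism being described, using `BayesianRidge` (which does support `return_std`); the variable names here are illustrative, not the PR's code:

    # Sketch only: expectation vs. a draw from the (assumed Gaussian) posterior.
    import numpy as np
    from sklearn.linear_model import BayesianRidge

    rng = np.random.RandomState(0)
    X_train = rng.randn(100, 3)
    y_train = X_train @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.randn(100)

    predictor = BayesianRidge().fit(X_train, y_train)
    X_query = rng.randn(5, 3)
    mu, sigma = predictor.predict(X_query, return_std=True)

    imputed_expectation = mu                        # deterministic imputation
    imputed_draw = rng.normal(loc=mu, scale=sigma)  # one posterior sample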
sklearn/impute.py
Outdated
It must support ``return_std`` in its ``predict`` method if
``sample_after_predict`` option is set to ``True`` below.

sample_after_predict : boolean, default=False
I think this name is unnecessarily verbose.
Brainstorming.
- fill_with='expectation' vs 'sample'
- random_impute bool
- use_std
- sampled
- random_draw
- randomized (but there are other random components)
How about a `predict_posterior` bool?
@stefvanbuuren This PR has a lot of the documentation changes you suggested. Please take a look at what's here and let me know if we're addressing your concerns. What's left to do is to make an example that actually demonstrates how to use it.
@jnothman I'd like to raise an error if a predictor without `return_std` is passed when sampling is enabled. Or is it OK to rely on the default `TypeError`?
@sergeyf Thanks a lot for keeping up making changes based on the discussion!
Added some quick comments on the docs.
doc/modules/impute.rst
Outdated
Then, the regressor is used to predict the unknown values of `y`. This is repeated
for each feature in a chained fashion, and then is done for a number of imputation
rounds. Here is an example snippet::

A more sophisticated approach is to use the :class:`ChainedImputer` class, models
"model" -> "which models" ?
Whoops, thanks.
doc/modules/impute.rst
Outdated
estimate for imputation. It does so in an iterated round-robin fashion: at each step,
a feature column is designated as output `y` and the other feature columns are treated
as inputs `X`. A regressor is fit on `(X, y)` for known `y`. Then, the regressor is
used to predict the unknown values of `y`. This is repeated for each feature in a
"unknown" -> "missing"? (only a suggestion, the unknown is fine as well, but the missing might be more clear since it about imputing the missing values)
Changed, thanks.
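As a reader aid, here is a stripped-down sketch of the round-robin procedure this passage describes; it is illustrative only, not the PR's implementation:

    import numpy as np
    from sklearn.linear_model import BayesianRidge

    def round_robin_impute(X, n_iter=10):
        # initial imputation: fill missing entries with column means
        X = np.array(X, dtype=float)
        missing = np.isnan(X)
        X_filled = np.where(missing, np.nanmean(X, axis=0), X)
        for _ in range(n_iter):
            for j in range(X.shape[1]):
                if not missing[:, j].any():
                    continue
                obs = ~missing[:, j]
                other = np.delete(X_filled, j, axis=1)
                # fit on rows where feature j is known ...
                reg = BayesianRidge().fit(other[obs], X_filled[obs, j])
                # ... and predict only the originally-missing entries
                X_filled[missing[:, j], j] = reg.predict(other[missing[:, j]])
        return X_filled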
sklearn/impute.py
Outdated
of the features are Gaussian.

Basic implementation of chained mutual regressions to find replacement
values in multivariate missing data. This version assumes all features
are Gaussian.
Is it needed to mention this Gaussian restriction? I assume it depends on the predictor one uses.
So far yes. The posterior sampling is done via a call to predict that also returns the standard deviation and then we sample from a Gaussian posterior. It's a limitation at the moment.
doc/modules/impute.rst
Outdated
chained fashion, and then is done for a number of imputation rounds. The results
of the final imputation round are returned. Our implementation was inspired by the
R MICE package (Multivariate Imputation by Chained Equations), but differs from
it in setting single imputation to default instead of multiple imputation. This
I find the "setting single imputation to default" still confusing.
As AFAIK, the ChainedImputer
itself can only do single imputation, that's not a matter of its default settings. You can of course then use it multiple times (setting the appropriate arguments for this case) to do multiple imputation.
Yes, you're right, but by now users have come to expect that "chained equations" refers to multiple imputation, so the combination "chained + single" would confuse quite a few.
@sergeyf I think it is clear now that your method does single imputation by default. Perhaps for balance also mention that what you describe as "the most common use case" differs from what is generally recommended by the statistical community (you can refer to Little & Rubin (2002, Ch. 4) and Rubin (1987, Ch. 1)), and only then say it's an open problem.
@stefvanbuuren this is the current full paragraph, which starts with the point about it being different in the statistics community:
I'll add a parenthetical reference to it.
I don't really like `predict_posterior`, as the difference between the settings is not about whether prediction is from the posterior, but about whether we use the expectation or a random draw from that posterior.

I don't think it is essential to raise an explicit error for the case that the estimator lacks `return_std`: the default `TypeError` will be informative enough when it tries to predict, and users who want to use the sampling variant will likely have read its documentation, which mentions `return_std`.
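For reference, that "default TypeError" behaviour looks like this with a predictor lacking `return_std` (a sketch; `LinearRegression` stands in for any such predictor):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.arange(10.0).reshape(-1, 1)
    y = np.arange(10.0)
    reg = LinearRegression().fit(X, y)
    try:
        reg.predict(X, return_std=True)
    except TypeError as exc:
        # e.g. "predict() got an unexpected keyword argument 'return_std'"
        print(exc)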
OK, thanks. I'll change the name to `posterior_draw`? The only decision left is whether to change the name to `IterativeImputer`.
I'm +1 for IterativeImputer. But I was +1 for ChainedImputer too :) But I suppose then it was more urgent to make it anything but MICE :P
@glemaitre @jnothman I changed the default to be `BayesianRidge()` when `sample_posterior=True` and `RidgeCV` otherwise.

Let me know if there are any other gotchas I should be aware of in instantiating a default predictor. Thanks!
Please check that `sample_posterior=False` is deterministic if `n_nearest_features` is None.

I've also realised that the `n_nearest_features` documentation fails to mention that the features are not necessarily the nearest, but are drawn with probability proportional to correlation.

(And now that we're supporting non-linear regressors, I wonder if we should be using a rank correlation in `n_nearest_features`, or max(spearman_rho, pearson_rho), or something nasty like that. This `n_nearest_features` is useful but might get in the way of good results with non-linear prediction.)

Thanks for all this great work. I think when we're finished here, we'll have a very powerful and useful tool.
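For concreteness, drawing candidate features with probability proportional to (absolute) correlation could look roughly like this; a sketch with made-up names, not the PR's code:

    import numpy as np

    def draw_candidate_features(X_filled, target_col, n_nearest, rng):
        corr = np.corrcoef(X_filled, rowvar=False)
        p = np.abs(corr[target_col])
        p[target_col] = 0.0  # a feature never predicts itself
        p /= p.sum()
        # not necessarily the n_nearest most-correlated features:
        # a weighted draw, so weaker features can still be selected
        return rng.choice(X_filled.shape[1], size=n_nearest, replace=False, p=p)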
doc/modules/impute.rst
Outdated
As implemented, the :class:`IterativeImputer` class generates a single imputation
for each missing value because this is the most common use case for machine learning
applications. However, it can also be used for multiple imputations by applying it
repeatedly to the same dataset with different random seeds.
Unless you're using a posterior sample or perhaps a highly randomised regressor, I don't think there is quite enough randomness in IterativeImputer for it to be an appropriate method of multiple imputation.
Right, I should change this to say set `sample_posterior=True`.
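That usage would look something like the following sketch (the import path assumes the class under review in this PR):

    import numpy as np
    from sklearn.impute import IterativeImputer

    X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [8.0, 8.0]])
    # multiple imputation: repeated runs with posterior sampling, varying seeds
    imputations = [
        IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
        for seed in range(5)
    ]
    # each element is one completed dataset; downstream analyses are run on
    # each and then pooled, rather than averaging the datasets themselves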
doc/modules/impute.rst
Outdated
See Chapter 4 of "Statistical Analysis with Missing Data" by Little and Rubin for
more discussion on multiple vs. single imputations.

It is still an open problem as to how useful single vs. multiple imputation is in
"when the user is not interested in measuring uncertainty due to missing values"
Good addendum, thanks.
n_imputations : int, optional (default=100)
    Number of chained imputation rounds to perform, the results of which
    will be used in the final average.
n_iter : int, optional (default=10)
There was a suggestion to call this `max_iter` and consider early stopping in a later PR.
I'd prefer to leave it as-is until we actually support early stopping. The word `max` would be confusing as it is now.
If we agree that in the future we would like to do the early stopping, I would already rename it now. It's true that this will be a bit confusing, but having to rename it after it is released also doesn't seem a good option.
This will be one point to address in future questions about handling non-MICE cases
@@ -498,18 +505,24 @@ class ChainedImputer(BaseEstimator, TransformerMixin):

Attributes
----------
initial_imputer_ : object of class :class:`sklearn.preprocessing.Imputer`'
    The imputer used to initialize the missing values.
initial_imputer_ : object of type :class:`sklearn.impute.SimpleImputer`
Good catch! I was thinking we should make it possible to set `initial_strategy` to an imputer, to allow something non-parametric, like the `SamplingImputer`, which may or may not appear soon.
Another thing I'd like to defer to future Sergey =)
    self._predictor = BayesianRidge()
else:
    from .linear_model import RidgeCV
    # including a very small alpha to approximate OLS
I do wonder if we should just be using OLS. But for near-parity with the `sample_posterior` case, this makes sense.
sklearn/impute.py
Outdated
# then there is no need to do burn in and the result should be
# just the initial imputation (before clipping)
if self.n_imputations < 1:
# edge case: in case the user specifies 0 for n_burn_in,
`n_iter` (this comment and check should refer to `n_iter` now)
sklearn/impute.py
Outdated
    return X_filled

X_filled = np.clip(X_filled, self._min_value, self._max_value)
# clip only the initial filled-in values
I don't really get why we would expect the initial imputation to require clipping.
Huh, that's a good point. What I should probably do is this:

    if self.n_iter < 1:
        X_filled[mask_missing_values] = np.clip(X_filled[mask_missing_values],
                                                self._min_value,
                                                self._max_value)
        return X_filled
Or actually just not clip before imputing.
if self.verbose > 1:
    print('[ChainedImputer] Ending imputation round '
    print('[IterativeImputer] Ending imputation round '
This isn't covered in tests, FWIW. It should be covered to ensure there's no AttributeError or whatever
Will do.
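A smoke test of the kind being requested might look like this (a sketch, assuming the renamed `IterativeImputer` API):

    import numpy as np
    from sklearn.impute import IterativeImputer

    def test_iterative_imputer_verbose():
        rng = np.random.RandomState(0)
        X = rng.randn(20, 3)
        X[rng.rand(20, 3) < 0.3] = np.nan
        # exercise the verbose branch so typos there fail in CI
        imputer = IterativeImputer(verbose=2, random_state=0)
        assert not np.isnan(imputer.fit_transform(X)).any()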
@@ -486,44 +486,25 @@ def test_imputation_copy():
    # made, even if copy=False.


def test_chained_imputer_rank_one():
This should still be valid if we run multiple imputations and take the mean, shouldn't it? Let's try leave the tests in.
It still is: I changed the name to `test_iterative_imputer_rank_one` and moved it to be with the other functional tests.
@@ -614,17 +592,17 @@ def test_chained_imputer_missing_at_transform(strategy):
                        initial_imputer.transform(X_test)[:, 0])


def test_chained_imputer_transform_stochasticity():
This should be run with and without sample_posterior=True
Will do.
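One way to run it both ways is pytest parametrization; a sketch, checking only the reproducibility property that should hold in either mode:

    import numpy as np
    import pytest
    from sklearn.impute import IterativeImputer

    @pytest.mark.parametrize("sample_posterior", [True, False])
    def test_iterative_imputer_transform_stochasticity(sample_posterior):
        rng = np.random.RandomState(0)
        X = rng.randn(50, 4)
        X[rng.rand(50, 4) < 0.2] = np.nan
        # same seed => identical results, with or without posterior sampling
        imp1 = IterativeImputer(sample_posterior=sample_posterior, random_state=0)
        imp2 = IterativeImputer(sample_posterior=sample_posterior, random_state=0)
        assert np.allclose(imp1.fit_transform(X), imp2.fit_transform(X))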
A couple of minor merge conflicts will need resolution before we continue here. The what's new entry for ChainedImputer is now in v0.21.rst.
@jnothman This was a strange merge conflict. It looked like both of the conflicting code snippets were something to add, so I did that. Please take a look and see if you agree. What are next steps here?
I think you did an unfinished rename of a variable: flake8 failures.

Error in examples:
Thanks @jnothman. This is kind of weird: I remember all tests passing just fine before all the unmerging started happening.
@glemaitre it still says "requested changes" from you. Do you remember what these were?
I would need to review once more since a lot of changes have been made. For sure my previous comments have been addressed in some way.
Great, thanks.
@jnothman any ETA for further reviews? I'd like to get this merged and copy the code into fancyimpute.
Minor comments. I'd like to merge this soon, but if you could please pull together a list of remaining questions/issues/goals, @sergeyf, that would be great
sklearn/impute.py
Outdated
If ``sample_posterior`` is True, the predictor must support
``return_std`` in its ``predict`` method. Also, if
``sample_posterior=True`` the default predictor will be
``BayesianRidge()`` and ``RidgeCV`` otherwise.
Use () in both cases or neither
I think use a :class: reference in both
Done.
Whether to sample from the (Gaussian) predictive posterior of the
fitted predictor for each imputation. Predictor must support
``return_std`` in its ``predict`` method if set to ``True``. Set to
``True`` if using ``IterativeImputer`` for multiple imputations.
Although a small random forest with changing seeds might also be a good way to do multiple imputation.
Just to clarify: is this statement in response to me specifying "Gaussian"? If so, yes, but we need a `return_std` for the random forest and I'm not sure what the right way to do this is yet. I think a good way to sample the posterior of the random forest is to pick a prediction of ONE of the `n_trees` uniformly at random. What we actually care about is sampling from the posterior, not the `return_std`.
I'm not sure how to pick one of the trees uniformly at random in a sensible way. But the API could be extended to support different ways of sampling the posterior via a non-binary switch here, i.e. `sample_posterior='std'` or something.
OK, I can investigate, if we agree it actually makes sense. I am not even sure how to tell if it does.
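For the record, the "one tree uniformly at random" idea sketched in code (exploratory only, not part of this PR):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.RandomState(0)
    X = rng.randn(200, 3)
    y = X[:, 0] ** 2 + rng.randn(200)
    forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

    X_query = rng.randn(5, 3)
    # one posterior-ish draw per row: each row is scored by a randomly chosen tree
    tree_idx = rng.randint(len(forest.estimators_), size=len(X_query))
    draws = np.array([forest.estimators_[t].predict(row[None, :])[0]
                      for t, row in zip(tree_idx, X_query)])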
sklearn/impute.py
Outdated
additional fitting. We do this by storing each feature's predictor during
the round-robin ``fit`` phase, and predicting without refitting (in order)
during the ``transform`` phase.

This implementation was inspired by the R MICE package (Multivariate
I think this can be dropped here, and stay in the user guide
Done.
sklearn/impute.py
Outdated
Imputation by Chained Equations), but differs from it by returning a single
imputation instead of multiple imputations. However, multiple imputation
can be achieved with multiple instances of the imputer with different
random seeds run in parallel.
This would need the regressor to use different random seeds, which we don't easily support atm. Something to think about
I don't think so. As long as we can sample from the posterior predictive distribution of the regressor within `IterativeImputer` (with different random seeds), it's OK not to have different random seeds inside the regressor. Many regressors are deterministic, right?
sklearn/impute.py
Outdated
can be achieved with multiple instances of the imputer with different
random seeds run in parallel.

To support imputation in inductive mode we store each feature's predictor
I feel like, again, this might be User Guide material. But ambivalent
I'd like to keep as much verbosity as possible, so I'll take your ambivalence as acquiescence to keep it here =)
@jnothman thanks for the review. In terms of outstanding issues, I'm not that sure actually. I think there was one remaining concern. We also still need a solid multiple imputation example.

Gael asked for "examples that convey the compromises and help users (and library developers) to make the right choices." I think he meant "what are sane defaults?" We probably need to do a survey of the packages in R, figure out which ones have been battle-tested and are well-respected, and see if `IterativeImputer` can support their functionality.

Am I missing something obvious?
@jnothman lots of weird test errors:
I've merged in master where that CI issue was fixed.
Yes, I think you're right to make further changes driven by examples: we want an example that illustrates something along the lines of missforest functionality, and another along the lines of MICE, and use that to identify appropriate default behaviour.
sklearn/impute.py
Outdated
@@ -462,7 +462,8 @@ class IterativeImputer(BaseEstimator, TransformerMixin):
    If ``sample_posterior`` is True, the predictor must support
    ``return_std`` in its ``predict`` method. Also, if
    ``sample_posterior=True`` the default predictor will be
    ``:class:sklearn.linear_model:BayesianRidge()`` and
This is not the right syntax. `git grep :class:` for examples.
@jnothman I'm concerned that a few examples won't tell us about sensible defaults. What we actually need is a large-enough set of experiments where we sweep meaningful parameters to get a sense of reasonable defaults. What's big enough?
Here we know less about datasets than about existing implementations. But we could use Rianne's amputation to help apply this to assorted classification or regression tasks. But there are a lot of variables in the problem.

I actually think defaults are less important here than making sure it's easy to get commonly used functionality, like what you would expect of missforest.

One other thing that may be of interest is the ability to impute categorical variables, i.e. have a categorical_features mask and accept a classifier as well as a regressor for imputation. But that can certainly come with time.
OK, makes sense. All tests pass now. I'll wait for this to merge and then build a missForest example on top of what we currently have. Let me know if there's anything else for this PR.
Thanks!
Addresses two points as discussed in #11259:

(a) Removes the "average last `n_imputations`" behavior. Now there is just one parameter, `n_iter`, instead of the two: `n_burn_in` and `n_imputations`.

(b) New flag: `sample_after_predict`, which is `False` by default. If true, it will sample from the predictive posterior after predicting during each round-robin iteration. Turning it on will make `ChainedImputer` run a single MICE-style chain.

@jnothman @glemaitre: I think `ChainedImputer` needs a new check: if `sample_after_predict=True`, then the predictor needs to have `return_std` as a parameter of its `predict` method. What's best practice for checking something like that?
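One conventional answer is to inspect the `predict` signature up front; a sketch, not necessarily what this PR should do:

    import inspect
    from sklearn.linear_model import BayesianRidge, LinearRegression

    def supports_return_std(predictor):
        return 'return_std' in inspect.signature(predictor.predict).parameters

    assert supports_return_std(BayesianRidge())
    assert not supports_return_std(LinearRegression())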
Also, we may want a new example of how to use `ChainedImputer` for the purpose of a MICE-type analysis, and how to use `ChainedImputer` with a `RandomForest` as the predictor instead of `BayesianRidge` to demonstrate missForest functionality.

I think @RianneSchouten would be a good candidate for the MICE example as a contribution to this PR or a new one, and I can stick `ChainedImputer` + RF into `plot_missing_values.py` as another bar on the plot. Let me know how that sounds.
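A rough sketch of what that extra bar could be built from, missForest-style (this uses the post-rename `IterativeImputer`; the `predictor` argument follows this PR's API, and the names here are illustrative):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.impute import IterativeImputer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    def rf_imputation_score(X_missing, y, seed=0):
        # impute with a random forest predictor, then fit the final model
        pipeline = make_pipeline(
            IterativeImputer(predictor=RandomForestRegressor(n_estimators=50,
                                                             random_state=seed),
                             random_state=seed),
            RandomForestRegressor(n_estimators=100, random_state=seed),
        )
        return cross_val_score(pipeline, X_missing, y, cv=5).mean()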