
[WIP] Multiple Imputation: Example with IterativeImputer #11370


Closed
wants to merge 164 commits into from

Conversation

RianneSchouten

Reference Issues/PRs

As promised in #11259 and following on @sergeyf's work in #11314 and #11350, this PR adds an example that shows how to use IterativeImputer for Multiple Imputation.

What does this implement/fix? Explain your changes.

As discussed in #11259, the defaults of IterativeImputer are such that single imputation is performed. Because the method is also quite powerful for Multiple Imputation, we agreed to add an example that shows the user how to use IterativeImputer to perform Multiple Imputation.

I added the file examples/plot_multiple_imputation.py, which shows two things:

  1. Estimation of regression coefficients (betas) and their standard errors: comparing single imputation with IterativeImputer against using IterativeImputer as a MICE imputer.
  2. How to use IterativeImputer as a MICE imputer when building a prediction model (with train and test datasets).

The following two figures are created with the script:
[Figure: statistical inference comparison]
[Figure: prediction comparison]
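
In sketch form, the core of the example is a loop like the one below (shown here against the IterativeImputer API as it was eventually released, with sample_posterior=True; the script in this PR still targets the current ChainedImputer signature):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
X[rng.rand(200, 3) < 0.2] = np.nan  # introduce missing values at random

n_imputations = 5
imputed_datasets = []
for i in range(n_imputations):
    # sample_posterior=True draws imputations from the predictive posterior,
    # so each pass yields a different completed dataset
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    imputed_datasets.append(imputer.fit_transform(X))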

Any other comments?

The script uses the IterativeImputer as it is currently accepted in the master branch (under the name ChainedImputer). When PR #11350 is accepted, I will update the code with the newest arguments (e.g. no n_burn_in but just n_iter).

Are there suggestions for improvements?

@sklearn-lgtm

This pull request introduces 2 alerts when merging 965ae8e into 3b5abf7 - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for Suspicious unused loop iteration variable

Comment posted by LGTM.com

@sergeyf
Contributor

sergeyf commented Jun 27, 2018

Thanks @RianneSchouten!

Take a look at http://scikit-learn.org/stable/developers/contributing.html#contributing-pull-requests for info about fixing the flake errors. sklearn has some strict style requirements =)

@stefvanbuuren

Great. Two remarks.

  1. You say: "The strength of the method is that it allows for finding unbiased statistical estimates due to its chained character."

This is a broad statement, and only true if the imputations are drawn from the posterior.

  2. I am not familiar with Python's style, but perhaps some interpretation of the results would be useful. You show that the methods produce, as I expected, different standard errors. Is smaller better? How do we know which estimate is correct?

@jnothman
Member

Here is the rendered documentation. I hope to look over it soon.

@jnothman
Member

In the first instance, you'll need to make those plots bigger...

@RianneSchouten
Author

@sergeyf Thanks for the tip. I read the documents and think I changed the code according to the rules.
@stefvanbuuren I have changed the text to the following to address your remark 1. Do you think this is better?

The chained character of the method and the possibility to draw imputation values from the posterior distribution of a Bayesian imputation model allow for finding unbiased statistical estimates.

Regarding remark 2: I can add some explanation, of course, but I am not sure where to place it. @sergeyf, @jnothman, is there a place I can write some explanation/interpretation of the images?
@jnothman: I made the images twice as large. Is that okay?

@RianneSchouten RianneSchouten changed the title Multiple Imputation: Example with IterativeImputer [WIP] Multiple Imputation: Example with IterativeImputer Jun 28, 2018
@jnothman jnothman left a comment

In terms of where you can make comments/explanation/interpretation of plots: historically this has all been in the passage at the top. You can, however, have integrated text blocks with comments, as in examples/preprocessing/plot_all_scaling.py.
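
For concreteness, the sphinx-gallery convention used in that example looks roughly like this (a sketch, not the actual file's content): a long line of # characters opens an rst text block between two code cells.

import matplotlib.pyplot as plt

plt.plot([0, 1], [1, 0])

##############################################################################
# Interpreting the figure
# -----------------------
#
# Prose written in this comment block is rendered between the two code
# cells, so interpretation can sit right next to the plot it discusses.

plt.plot([0, 1], [0, 1])
plt.show()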

@jnothman
Member

The example took more than 10 minutes to run on Circle CI, causing it to halt. I don't know if this is a bug, but we need our examples to run a lot faster, even if that makes them less real-world.

@jnothman jnothman left a comment

I'm separately interested in your "amputation functions". We had lots of discussion about putting this kind of feature into the library at #7084, but we were not aware of a literature in this space, and were disappointed by some of the over-simplifications in that PR.

For MNAR, you seem to take an approach which more likely masks high values. Looks neat enough. In reality, should the base of the exponential be a parameter, to allow the removal to be more/less skewed towards the upper extreme? Are there other, more interesting amputators you would recommend? Perhaps comment at #6284.

if mech == "MNAR":
    for i in np.arange(n_features):
        data_values = -np.mean(X[:, i]) + X[:, i]
        weights = list(map(lambda x: math.exp(x) / (1 + math.exp(x)),
@jnothman
Member

why is this not just np.exp(data_values) / (1 + np.exp(data_values)) (or 1 / (1 + np.exp(-data_values)))?

@RianneSchouten
Author

Think you're absolutely right. Changed it.
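
For reference, one plausible vectorized completion of the snippet under review (how the weights feed into the masking is an assumption here, since the quoted lines are truncated):

import numpy as np

def mnar_amputation(X, prop=0.3, random_state=0):
    # Logistic weights make larger values more likely to be masked
    # (MNAR "RIGHT" missingness); here the weights are used as
    # sampling probabilities for choosing which cells to delete.
    rng = np.random.RandomState(random_state)
    X = X.copy()
    n_samples, n_features = X.shape
    for i in range(n_features):
        data_values = X[:, i] - np.mean(X[:, i])
        weights = 1 / (1 + np.exp(-data_values))  # the vectorized sigmoid
        weights /= weights.sum()
        idx = rng.choice(n_samples, size=int(prop * n_samples),
                         replace=False, p=weights)
        X[idx, i] = np.nan
    return X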

@RianneSchouten
Author

@jnothman I'm happy you mention the amputation function: I (we) developed an amputation procedure that generates sophisticated missing data, in different forms, and we put it in the mice package in R (function: ampute). You can read about the procedure here: https://rianneschouten.github.io/mice_ampute/vignette/ampute.html . I just got confirmation that my article about the procedure will be published in the Journal of Statistical Computation and Simulation. I will put the link to the online version here next week.

Apart from this article, no one has written about the generation of multivariate missingness (except Jaap Brand in 1999). The generation of univariate missingness is done often (deleting values feature by feature), but it is almost impossible to use the univariate approach to create a good overall missingness structure. That's why we developed the multivariate procedure in R.

It's been on my wish list for a few months now to have the same procedure available in Python, also because I need it for many of my simulation studies. The effect of missingness mechanisms on prediction models is not yet clear.

Note that I did not implement anything like the multivariate amputation procedure in this PR. Both the MCAR and MNAR approaches here are univariate (and you're right: I create MNAR RIGHT missingness; apart from RIGHT there are a lot of other types).
It is with MAR missingness that the multivariate approach is especially useful.

I will read the PRs you mention tomorrow.
Maybe I could take some days off from work and write the multivariate amputation procedure in Python on short notice, if that would be useful?

@RianneSchouten
Author

@jnothman I put a simulation aspect in the example. I will take it out tomorrow, and then it should pass the time test.

@jnothman
Member

jnothman commented Jun 28, 2018 via email

@jnothman
Member

jnothman commented Jun 28, 2018 via email

@RianneSchouten
Author

> I've not yet understood which feature you mask if a particular weighted sum is selected for masking

That has to do with the missingness patterns. You define patterns beforehand. Let's say you have a dataset with y1, y2 and y3. A pattern might look like 0 0 1, meaning: if a weighted sum is selected for masking, features y1 and y2 will become incomplete.

You can define different kinds of patterns, ending up with a dataset containing different patterns as well. Each missing data pattern is connected to one way of calculating the weighted sum scores. In other words, you define the weights for each pattern.
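
A toy sketch of that idea (my own illustration, not the mice::ampute implementation):

import numpy as np

rng = np.random.RandomState(42)
X = rng.randn(8, 3)  # columns y1, y2, y3

# Each pattern row marks which features become missing (0 = masked).
patterns = np.array([[0, 0, 1],   # y1 and y2 become incomplete
                     [1, 1, 0]])  # y3 becomes incomplete
# One weight vector per pattern defines its weighted-sum score.
weights = np.array([[0.0, 0.0, 1.0],
                    [0.5, 0.5, 0.0]])

assigned = rng.randint(len(patterns), size=len(X))  # a pattern per case
for row in range(len(X)):
    p = assigned[row]
    score = X[row] @ weights[p]      # weighted sum of this case's values
    prob = 1 / (1 + np.exp(-score))  # higher score -> more likely masked
    if rng.rand() < prob:
        X[row, patterns[p] == 0] = np.nan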

@stefvanbuuren

@RianneSchouten Text is fine.

@stefvanbuuren

Both examples are fairly advanced. At some point you might wish to add a minimal example, e.g., as in https://github.com/stefvanbuuren/mice to emphasise the basics.

@RianneSchouten
Author

Okay, I made the explanation at the beginning larger, with more explanation of Rubin's rules. I also put the rules in separate functions so that it is clear where the magic happens. In addition, I turned the number of simulations down from 10 to 2, so the runtime should now be fine. Let's see; otherwise I will take the simulation out.
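
As an illustration, such helper functions can look like this (names here are hypothetical, not necessarily the ones in the script):

import numpy as np

def pool_estimates(estimates):
    # Rubin's rules: the pooled point estimate is the mean of the
    # m per-dataset estimates.
    return np.mean(estimates, axis=0)

def pool_variances(estimates, variances):
    # Total variance = mean within-imputation variance plus the
    # between-imputation variance inflated by (1 + 1/m).
    m = len(estimates)
    within = np.mean(variances, axis=0)
    between = np.var(estimates, axis=0, ddof=1)
    return within + (1 + 1 / m) * between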

@stefvanbuuren you might be right that the examples are rather advanced, comparing different approaches while demonstrating the example.

Do others think I should make a more basic example as well?

@sklearn-lgtm

This pull request introduces 1 alert when merging ed1db8b into 3b5abf7 - view on LGTM.com

new alerts:

  • 1 for Encoding error

Comment posted by LGTM.com

@sklearn-lgtm

This pull request introduces 1 alert when merging 7c3fb97 into 3b5abf7 - view on LGTM.com

new alerts:

  • 1 for Encoding error

Comment posted by LGTM.com

robust summary measures such as medians or ranges instead of using Rubin’s
pooling rules. This applies to an estimate like explained variance.

In sum, Rubin’s pooling rules are as follows. The overall point estimate after
@jnothman
Member

lgtm.com is complaining about this curly apostrophe. Use '.

janvanrijn and others added 12 commits October 9, 2018 11:25
* modularized data column functionality

* small bugfix

* removes redundant line breaks

* added some documentation on the added fn

* added additional comment on advice of Nicholas Hug

* added test case

* merged master into branch, and added small comments by Joel

* added doc item
…attributes (#12324)

* rm criterion and max_features from __init__ and store them as class attrs instead

* make sure that the docstring comes first
Part of #11992.
These were all the things that seemed pretty straight-forward. It's actually a bit bulky but should still be easy to review, hopefully.
* add reingold tillford tree layout algorithm

* add first silly implementation of matplotlib based plotting for trees

* object oriented design for export_graphviz so it can be extended

* add class for mlp export

* add colors

* separately scale x and y, add arrowheads, fix strings

* implement max_depth

* don't use alpha for coloring because it makes boxes transparent

* remove unused variables

* vertical center of boxes

* fix/simplify newline trimming

* somewhere in the middle of stuff

trying to get rid of scalex, scaley

* remove "find_longest_child" for now, fix tests

* make scalex and scaley internal, and ax local.

render everything once to get the bbox sizes, then again to actually plot it with known extents.

* add some margin to the max bbox width

* add _BaseTreeExporter baseclass

* add docstring to plot_tree

* use data coordinates so we can put the plot in a subplot, remove some hacks.

* remove scalex, scaley, add automatic font size

* use rendered stuff for setting limits (well nearly there)

* import plot_tree into tree module

* set limits before font size adjustment?

* add tree plotting via matplotlib to iris example and to docs

* pep8 fix

* skip doctest on plot_tree because matplotlib is not installed on all CI machines

* redo everything in axis pixel coordinates

re-introduce scalex, scaley
add max_extents to tree to get tree size before plotting

* fix max-depth

parent node positioning and don't consider deep nodes in layouting

* consider height in fontsize computation

in case someone gave us a very flat figure

* fix error when max_depth is None

* add docstring for tree plotting fontsize

* starting on jnothman's review

* renaming fixes

* whatsnew for tree plotting

* clear axes prior to doing anything.

* fix doctests

* skip matplotlib doctest

* trying to debug circle failure

* trying to show full traceback

* more print debugging

* remove debugging crud

* hack around matplotlib <1.5 issues

* copy bbox args because old matplotlib is weird.

* pep8 fixes

* add explicit boxstyle

* more pep8

* even more pep8

* add comment about matplotlib version requirement

* remove redundant file

* add whatsnew entry that the merge lost

* fix merge issue

* more merge issues

* whitespace ...

* remove doctest skip to see what's happening

* added some simple invariance tests buchheim function

* refactor __init__ into superclass

* added some tests of plot_tree

* put skip back in, fix typo, fix versionadded number

* remove unused parameters special_characters and parallel_leaves from mpl plotting

* rename tests to test_reingold_tilford

* added license header from pymag-trees repo

* remove duplicate test file.
@RianneSchouten
Author

I wanted to bring my multiple-imputation branch up to date with the master branch of scikit-learn that I originally forked from.

I first brought my master branch up to date by:
git fetch upstream
git checkout master
git merge upstream/master

I then brought the multiple-imputation branch up to date by:
git fetch -p origin
git merge origin/master
git checkout multiple-imputation
git merge master

Now I see that at the top of this page it says: RianneSchouten wants to merge 164 commits into scikit-learn:iterativeimputer.

Can someone explain to me whether I did something wrong and/or how this works?

@jnothman
Member

If your master couldn't be fast-forwarded to upstream/master, that would have made a mess. Or the mess could have come from origin/master.

For another chance, git reset --hard last_good_commit and force push.
Then simply git merge upstream/master
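
In command form (last_good_commit and the branch name are placeholders for your own values):

git checkout multiple-imputation
git reset --hard last_good_commit
git push --force origin multiple-imputation
git merge upstream/master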

@RianneSchouten
Author

I made a mess.
Since my work was in only one file, I will start over.

@RianneSchouten
Author

As the developer info says, I will fork/clone the master branch, make a new local branch, work on the ampute files in that branch, and then make a pull request.

I am wondering, though: if I want to improve/update the multiple-imputation example, I have to use the info in the IterativeImputer branch. Should I wait until you update the master branch with that information, or should I fork/clone the IterativeImputer branch? What is the preferred way to do it?

Thanks in advance.

@sergeyf
Contributor

sergeyf commented Oct 31, 2018

@RianneSchouten The master will not be updated until we have all the pieces of IterativeImputer done in its own branch. That includes examples, etc.

@RianneSchouten
Author

RianneSchouten commented Oct 31, 2018 via email

@RianneSchouten
Author

Hey guys,

I am sorry, but it seems I don't have enough time to actively finish this.
The example is finished in the sense that it shows how to loop over IterativeImputer to create multiple imputations, how to calculate the variance of beta estimates, and how to pool the outcomes after multiple imputation.

What remains is applying the code to the newest version of IterativeImputer and removing the simulation aspect from the example (it is not necessary for showing how to perform MI and takes unnecessary time).

The difficult aspect is that the pooling rules change for different statistical analysis methods. In the example I use linear regression, which is the most used and most basic analysis method.
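
For whoever picks this up: for the linear regression case, the pooling can be sketched as follows (the function name is hypothetical, and statsmodels is used here only to get standard errors):

import numpy as np
import statsmodels.api as sm

def fit_and_pool(imputed_datasets, y):
    # Fit the analysis model (OLS) on each completed dataset, then
    # combine the results with Rubin's rules.
    betas, variances = [], []
    for X_imp in imputed_datasets:
        res = sm.OLS(y, sm.add_constant(X_imp)).fit()
        betas.append(res.params)
        variances.append(res.bse ** 2)  # squared standard errors
    m = len(betas)
    pooled = np.mean(betas, axis=0)
    within = np.mean(variances, axis=0)
    between = np.var(betas, axis=0, ddof=1)
    total_se = np.sqrt(within + (1 + 1 / m) * between)
    return pooled, total_se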

I hope someone else can finish this so that it comes along with IterativeImputer?

Kind regards,
Rianne

@sergeyf
Contributor

sergeyf commented Dec 12, 2018

Hi @RianneSchouten. When you say "The difficult aspect is that the pooling rules change for the different statistical analysis methods": does this mean we need to figure out other pooling methods for this particular example? Or is the current pooling sufficient to demonstrate how this generally works? Should we include a reference to other pooling rules for other models?

@RianneSchouten
Author

RianneSchouten commented Dec 12, 2018 via email

@sergeyf
Contributor

sergeyf commented Jan 18, 2019

Hi again @RianneSchouten. Just to confirm what you mean about removing the simulation aspect from the example: which lines should I remove? I assume 445 to 509 (plus the doc info about those lines). Is that correct?

@sergeyf
Contributor

sergeyf commented Jan 18, 2019

Also, @RianneSchouten, regarding this statement: "Due to the between-imputation variance, the standard errors of all regression coefficients are larger with multiple imputation than with single imputation. This allows for valid statistical inference."

I think ML people might not know what this means, as we don't really use standard errors much. Why do larger standard errors allow for valid statistical inference?

@jnothman
Member

jnothman commented Jan 19, 2019 via email

@sergeyf
Contributor

sergeyf commented Jan 19, 2019

Thanks @jnothman. I am wondering if the ML community will understand that point. "Standard errors" don't appear in the ML curriculum very much.

In any case, I rewrote this example a bit with fewer iterations and it's much faster now. Should I make a new PR? Or is there anything else you'd like to see changed here?

@jnothman
Member

jnothman commented Jan 19, 2019 via email

@jnothman
Member

Superseded by #13025

@jnothman jnothman closed this Jan 22, 2019