[WIP] Multiple Imputation: Example with IterativeImputer #11370
Conversation
This pull request introduces 2 alerts when merging 965ae8e into 3b5abf7 - view the new alerts on LGTM.com.
Comment posted by LGTM.com
Thanks @RianneSchouten! Take a look at http://scikit-learn.org/stable/developers/contributing.html#contributing-pull-requests for info about fixing the flake errors.
Great. Two remarks.
This is a broad statement, and only true if the imputations are drawn from the posterior.
Here is the rendered documentation. I hope to look over it soon.
In the first instance, you'll need to make those plots bigger...
@sergeyf Thanks for the tip, I read the documents and think I changed the code according to the rules.
Regarding remark 2: I can add some explanation of course, but I am not sure where to place it. @sergeyf, @jnothman, is there a place where I can write some explanation/interpretation of the images?
In terms of where you can make comments/explanation/interpretation of plots: historically this has all been in the passage at the top. You can, however, have integrated text blocks with comments, as in examples/preprocessing/plot_all_scaling.py (a sketch of that convention follows below).
The example took more than 10 minutes to run on Circle CI, causing it to halt. I don't know if this is a bug, but we need our examples to run a lot faster, even if that makes them less real-world.
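For reference, a minimal sketch of how such integrated text blocks look in a sphinx-gallery example like examples/preprocessing/plot_all_scaling.py (the plot itself is illustrative; only the separator convention matters here):

import matplotlib.pyplot as plt

# An ordinary code cell: this plot is just a stand-in.
plt.plot([0, 1], [0, 1])

##############################################################################
# Interpretation
# --------------
# A comment block preceded by a long separator line of # characters is
# rendered as prose between code cells, so plot-by-plot commentary can
# sit right next to the code that produces the plot.
plt.show()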
I'm separately interested in your "amputation functions". We had lots of discussion about putting this kind of feature into the library at #7084, but we were not aware of a literature in this space, and were disappointed by some of the over-simplifications in that PR.
For MNAR, you seem to take an approach which more likely masks high values. Looks neat enough. In reality, should the base of the exponential be a parameter, to allow the removal to be more/less skewed to the upper extreme? Are there other more interesting MNAR amputators you would recommend? Perhaps comment at #6284.
examples/plot_multiple_imputation.py
Outdated
if mech == "MNAR": | ||
for i in np.arange(n_features): | ||
data_values = -np.mean(X[:, i]) + X[:, i] | ||
weights = list(map(lambda x: math.exp(x) / (1 + math.exp(x)), |
Why is this not just np.exp(data_values) / (1 + np.exp(data_values)) (or 1 / (1 + np.exp(-data_values)))?
I think you're absolutely right. Changed it.
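For completeness, a minimal self-contained sketch of the vectorized form suggested above (rng, X and i here are illustrative stand-ins for the example's loop variables):

import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 4))                 # stand-in data matrix
i = 0                                         # stand-in feature index
data_values = X[:, i] - np.mean(X[:, i])      # center the feature
weights = 1 / (1 + np.exp(-data_values))      # logistic (sigmoid) weights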
@jnothman I'm happy you mention the amputation function: I (we) developed an amputation procedure that generates sophisticated missing data, in different forms, and we put it in the mice package in R (function: ampute). You can read about the procedure here: https://rianneschouten.github.io/mice_ampute/vignette/ampute.html. I just got the confirmation that my article about the procedure will be published in the Journal of Statistical Computation and Simulation; I will put the link to the online version here next week.

Apart from this article, no one has written about the generation of multivariate missingness (except Jaap Brand in 1999). The generation of univariate missingness is often done (deleting values feature by feature), but it is almost impossible to use the univariate approach to create a good overall missingness structure. That's why we developed the multivariate procedure in R. It has been on my wish list for a few months now to have the same procedure available in Python, also because I need it for many of my simulation studies. The effect of missingness mechanisms on prediction models is not yet clear.

Note that I did not implement anything like the multivariate amputation procedure in this PR. Both the MCAR and MNAR approaches here are univariate (and you're right: I create MNAR RIGHT missingness). Apart from RIGHT there are a lot of other types. I will read the PRs you mention tomorrow.
@jnothman I put a simulation aspect in the example. I will take it out tomorrow, and then it should pass the time test.
Ha :) I don't think an amputation function is a high priority. We usually have a high threshold for inclusion in terms of algorithm maturity, but I think our dataset generators have not met the same bar in the past, so we might be able to include it soon. But if we decide a multivariate amputer is of interest, we would happily try to find a contributor to port it (from the description, or from R if you license us to) rather than have you take time off work!
The vignette (I've only read section 1) looks great! We were onto a similar track in the univariate case, and having a weighted sum of features makes a lot of sense... I've not yet understood which feature you mask if a particular weighted sum is selected for masking... but I'll keep reading eventually.
That has to do with the missingness patterns. You define patterns beforehand. Let's say you have a dataset with y1, y2 and y3; a pattern might look like 0 0 1, meaning that if a weighted sum is selected for masking, features y1 and y2 will become incomplete. You can define different kinds of patterns, ending up with a dataset containing different patterns as well. Each missing data pattern is connected to one way of calculating the weighted sum scores; in other words, you define the weights for each pattern (see the sketch below).
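To illustrate the idea, a hypothetical and heavily simplified sketch (this is not the mice::ampute implementation; all names, weights and probabilities here are made up):

import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))                   # columns: y1, y2, y3

# 0 marks a feature that becomes missing for cases assigned to a pattern.
patterns = np.array([[0, 0, 1],                 # pattern 1: mask y1 and y2
                     [1, 1, 0]])                # pattern 2: mask y3
weights = np.array([[0.0, 0.0, 1.0],            # pattern 1: score from y3
                    [0.5, 0.5, 0.0]])           # pattern 2: score from y1, y2

X_amp = X.copy()
assignment = rng.randint(len(patterns), size=len(X))   # one pattern per case
for k, (pattern, w) in enumerate(zip(patterns, weights)):
    cases = np.where(assignment == k)[0]
    scores = X[cases] @ w                       # weighted sum score per case
    probs = 1 / (1 + np.exp(-scores))           # logistic selection probability
    masked = cases[rng.uniform(size=len(cases)) < probs]
    X_amp[np.ix_(masked, pattern == 0)] = np.nan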
@RianneSchouten Text is fine.
Both examples are fairly advanced. At some point you might wish to add a minimal example, e.g., as in https://github.com/stefvanbuuren/mice to emphasise the basics.
Okay, I made the explanation at the beginning larger, with more explanation of Rubin's rules (sketched below). Also, I put the rules in separate functions so that it is clear where the magic happens. In addition, I turned the number of simulations down from 10 to 2, so the running time should now be fine. Let's see; otherwise I will take the simulation out. @stefvanbuuren you might be right that the examples are high-level, comparing different approaches while showing the example. Do others think I should make a more basic example as well?
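For readers skimming the thread, a minimal sketch of Rubin's pooling rules for a scalar estimate (the function name and the toy numbers are illustrative, not the ones in the example):

import numpy as np

def rubin_pool(estimates, variances):
    # Pool m completed-data analyses with Rubin's rules.
    m = len(estimates)
    q_bar = np.mean(estimates, axis=0)       # pooled point estimate
    u_bar = np.mean(variances, axis=0)       # within-imputation variance
    b = np.var(estimates, axis=0, ddof=1)    # between-imputation variance
    t = u_bar + (1 + 1 / m) * b              # total variance
    return q_bar, np.sqrt(t)                 # estimate and pooled standard error

# e.g. pooling one regression coefficient over five imputations:
est, se = rubin_pool([0.52, 0.48, 0.55, 0.50, 0.49],
                     [0.010, 0.012, 0.009, 0.011, 0.010])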
This pull request introduces 1 alert when merging ed1db8b into 3b5abf7 - view the new alert on LGTM.com.
Comment posted by LGTM.com
This pull request introduces 1 alert when merging 7c3fb97 into 3b5abf7 - view the new alert on LGTM.com.
Comment posted by LGTM.com
examples/plot_multiple_imputation.py
Outdated
robust summary measures such as medians or ranges instead of using Rubin’s
pooling rules. This applies to an estimate like explained variance.

In sum, Rubin’s pooling rules are as follows. The overall point estimate after
lgtm.com is complaining about this curly apostrophe. Use '.
I wanted to bring my multiple-imputation branch up to date with the master branch of scikit-learn that I originally forked from. I first brought my master branch up to date, and then brought the multiple-imputation branch up to date. Now I see that at the top of this page it says: rianneschouten wants to push 164 commits into scikit-learn:iterativeimputer. Can someone explain to me whether I did something wrong and/or how this works?
If your master couldn't be fast-forwarded to upstream/master, that would have made a mess. Or it could have come from origin/master. For another chance,
I made a mess.
As the developer info says, I will fork/clone the master branch, make a new local branch, work on the ampute files in that branch, and then make a pull request. I am wondering, though: if I want to improve/update the multiple-imputation example, I have to use the code in the IterativeImputer branch. Should I wait until you update the master branch with that information, or should I fork/clone the IterativeImputer branch? What is the preferred way to do it? Thanks in advance.
@RianneSchouten The master will not be updated until we have all the pieces of
That makes sense. Thanks.
Hey guys, I am sorry, but it seems I don't have enough time to actively finish this. The example should be finished by applying the code to the newest version of the iterative imputer, and by removing the simulation aspect from the example (it is not necessary for showing how to perform MI and takes unnecessary time). The difficult aspect is that the pooling rules change for the different statistical analysis methods. In the example, I use linear regression, which is the most used and most basic analysis method. I hope someone else can finish this so that it comes along with the IterativeImputer? Kind regards,
Hi @RianneSchouten. When you say "The difficult aspect is that the pooling rules change for the different statistical analysis methods" - does this mean we need to figure out other pooling methods for this particular example? Or is the current pooling sufficient to demonstrate how this generally works? Should we include a reference for other pooling for other models?
These pooling rules are good for linear regression, so the example works and is good. In case of other statistical models you need other pooling rules. You could just mention that and then add a reference such as: Marshall, A., Altman, D. G., Holder, R. L. and Royston, P., "Combining estimates of interest in prognostic modelling studies after multiple imputation: current practice and guidelines".
Hi again @RianneSchouten. Just to confirm what you mean about
Also, @RianneSchouten, regarding this statement: I think ML people might not know what this means, as we don't really use standard errors much. Why do larger standard errors allow for making valid statistical inferences?
I think the point is that single imputation underestimates uncertainty. So if you want to know how reliable your prediction is, or what the credible interval is, you need to account for uncertainty due to missingness.
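To make that point concrete, a small sketch using the present-day scikit-learn API (the sample_posterior argument and the experimental import postdate this PR's ChainedImputer naming, so treat them as assumptions):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [8.0, 9.0]])

# Posterior draws with different seeds give several plausible completed
# datasets rather than one point estimate.
imputations = np.stack([
    IterativeImputer(sample_posterior=True, random_state=s).fit_transform(X)
    for s in range(5)
])
# The spread across draws is the extra uncertainty due to missingness
# that single imputation would hide.
print(imputations.std(axis=0))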
Thanks @jnothman. I am wondering if the ML community will understand that point. "Standard errors" don't appear in the ML curriculum very much. In any case, I rewrote this example a bit with fewer iterations and it's much faster now. Should I make a new PR? Or is there anything else you'd like to see changed here?
Best to make a new PR so I/others can see what it looks like.
Superseded by #13025
Reference Issues/PRs
As promised in #11259 and following on @sergeyf's work in #11314 and #11350: this PR adds an example that shows how to use IterativeImputer for Multiple Imputation.
What does this implement/fix? Explain your changes.
As discussed in #11259, the defaults of IterativeImputer are such that single imputation is performed. Because the method is also quite powerful for Multiple Imputation, we agreed to make an example that shows the user how to use IterativeImputer to perform Multiple Imputation.
I made the document examples/plot_multiple_imputation.py, and it shows two things:
The following two figures are created with the script:


Any other comments?
The script uses the IterativeImputer as it is currently accepted in the master branch (under the name ChainedImputer). When PR #11350 is accepted, I will update the code with the newest arguments (e.g. no n_burn_in, just n_iter).
Are there suggestions for improvements?