
[WIP] Multiple Imputation: Example with IterativeImputer #11370


Closed
wants to merge 164 commits into from

Conversation

RianneSchouten

Reference Issues/PRs

As promised in #11259 and following on @sergeyf's work in #11314 and #11350, this PR adds an example that shows how to use IterativeImputer for Multiple Imputation.

What does this implement/fix? Explain your changes.

As discussed in #11259, the defaults of IterativeImputer are such that single imputation is performed. Because the method is also quite powerful for Multiple Imputation, we agreed to add an example that shows the user how to use IterativeImputer to perform Multiple Imputation.

I added the file examples/plot_multiple_imputation.py, which shows two things:

  1. Estimation of regression coefficients (betas) and their standard errors: comparing single imputation with IterativeImputer against using IterativeImputer as a MICE imputer.
  2. How to use IterativeImputer as a MICE imputer when building a prediction model (with train and test datasets).

The following two figures are created with the script:
[Figure: statistical inference comparison]
[Figure: prediction comparison]
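
In sketch form, the core of the example is a loop like the one below (shown here against the IterativeImputer API as it was eventually released, with sample_posterior=True; the script in this PR still targets the current ChainedImputer signature):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
X[rng.rand(200, 3) < 0.2] = np.nan  # introduce missing values at random

n_imputations = 5
imputed_datasets = []
for i in range(n_imputations):
    # sample_posterior=True draws imputations from the predictive posterior,
    # so each pass yields a different completed dataset
    imputer = IterativeImputer(sample_posterior=True, random_state=i)
    imputed_datasets.append(imputer.fit_transform(X))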

Any other comments?

The script uses the IterativeImputer as it is currently accepted in the master branch (under the name ChainedImputer). When PR #11350 is accepted, I will update the code with the newest arguments (e.g. no n_burn_in but just n_iter).

Are there suggestions for improvements?

@sklearn-lgtm

This pull request introduces 2 alerts when merging 965ae8e into 3b5abf7 - view on LGTM.com

new alerts:

  • 1 for Unused import
  • 1 for Suspicious unused loop iteration variable

Comment posted by LGTM.com

@sergeyf
Contributor

sergeyf commented Jun 27, 2018

Thanks @RianneSchouten!

Take a look at http://scikit-learn.org/stable/developers/contributing.html#contributing-pull-requests for info about fixing the flake errors. sklearn has some strict style requirements =)

@stefvanbuuren

Great. Two remarks.

  1. You say: "The strength of the method is that it allows for finding unbiased statistical estimates due to its chained character."

This is a broad statement, and only true if the imputations are drawn from the posterior.

  2. I am not familiar with Python's style, but perhaps some interpretation of the results would be useful. You show that the methods produce, as I expected, different standard errors. Is smaller better? How do we know which estimate is correct?

@jnothman
Member

Here is the rendered documentation. I hope to look over it soon.

@jnothman
Member

In the first instance, you'll need to make those plots bigger...

@RianneSchouten
Author

@sergeyf Thanks for the tip. I read the documents and think I changed the code according to the rules.
@stefvanbuuren I have changed the text to the following to address your remark 1. Do you think this is better?

The chained character of the method and the possibility to draw imputation values from the posterior distribution of a Bayesian imputation model allow for finding unbiased statistical estimates.

Regarding remark 2: I can add some explanation, of course, but I am not sure where to place it. @sergeyf, @jnothman, is there a place I can write some explanation/interpretation of the images?
@jnothman: I made the images twice as large. Is that okay?

@RianneSchouten RianneSchouten changed the title Multiple Imputation: Example with IterativeImputer [WIP] Multiple Imputation: Example with IterativeImputer Jun 28, 2018
@jnothman jnothman left a comment

In terms of where you can make comments/explanation/interpretation of plots: historically this has all been in the passage at the top. You can, however, have integrated text blocks with comments, as in examples/preprocessing/plot_all_scaling.py.
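
For concreteness, the sphinx-gallery convention used in that example looks roughly like this (a sketch, not the actual file's content): a long line of # characters opens an rst text block between two code cells.

import matplotlib.pyplot as plt

plt.plot([0, 1], [1, 0])

##############################################################################
# Interpreting the figure
# -----------------------
#
# Prose written in this comment block is rendered between the two code
# cells, so interpretation can sit right next to the plot it discusses.

plt.plot([0, 1], [0, 1])
plt.show()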

@jnothman
Member

The example took more than 10 minutes to run on Circle CI, causing it to halt. I don't know if this is a bug, but we need our examples to run a lot faster, even if that makes them less real-world.

@jnothman jnothman left a comment

I'm separately interested in your "amputation functions". We had lots of discussion about putting this kind of feature into the library at #7084, but we were not aware of a literature in this space, and were disappointed by some of the over-simplifications in that PR.

For MNAR, you seem to take an approach which more likely masks high values. Looks neat enough. In reality, should the base of the exponential be a parameter, to allow the removal to be more/less skewed towards the upper extreme? Are there other, more interesting amputators you would recommend? Perhaps comment at #6284.

if mech == "MNAR":
    for i in np.arange(n_features):
        data_values = -np.mean(X[:, i]) + X[:, i]
        weights = list(map(lambda x: math.exp(x) / (1 + math.exp(x)),
@jnothman
Member

why is this not just np.exp(data_values) / (1 + np.exp(data_values)) (or 1 / (1 + np.exp(-data_values)))?

@RianneSchouten
Author

Think you're absolutely right. Changed it.
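
For reference, one plausible vectorized completion of the snippet under review (how the weights feed into the masking is an assumption here, since the quoted lines are truncated):

import numpy as np

def mnar_amputation(X, prop=0.3, random_state=0):
    # Logistic weights make larger values more likely to be masked
    # (MNAR "RIGHT" missingness); here the weights are used as
    # sampling probabilities for choosing which cells to delete.
    rng = np.random.RandomState(random_state)
    X = X.copy()
    n_samples, n_features = X.shape
    for i in range(n_features):
        data_values = X[:, i] - np.mean(X[:, i])
        weights = 1 / (1 + np.exp(-data_values))  # the vectorized sigmoid
        weights /= weights.sum()
        idx = rng.choice(n_samples, size=int(prop * n_samples),
                         replace=False, p=weights)
        X[idx, i] = np.nan
    return X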

@RianneSchouten
Author

@jnothman I'm happy you mention the amputation function: I (we) developed an amputation procedure that generates sophisticated missing data, in different forms, and we put it in the mice package in R (function: ampute). You can read about the procedure here: https://rianneschouten.github.io/mice_ampute/vignette/ampute.html . I just got confirmation that my article about the procedure will be published in the Journal of Statistical Computation and Simulation. I will put the link to the online version here next week.

Apart from this article, no one has written about the generation of multivariate missingness (except Jaap Brand in 1999). The generation of univariate missingness is done often (deleting values feature by feature), but it is almost impossible to use the univariate approach to create a good overall missingness structure. That's why we developed the multivariate procedure in R.

It's been on my wish list for a few months now to have the same procedure available in Python, also because I need it for many of my simulation studies. The effect of missingness mechanisms on prediction models is not yet clear.

Note that I did not implement anything like the multivariate amputation procedure in this PR. Both the MCAR and MNAR approaches here are univariate (and you're right: I create MNAR RIGHT missingness; apart from RIGHT there are a lot of other types).
It is with MAR missingness that the multivariate approach is especially useful.

I will read the PRs you mention tomorrow.
Maybe I could take some days off from work and write the multivariate amputation procedure in Python on short notice, if that would be useful?

@RianneSchouten
Author

@jnothman I put a simulation aspect in the example. I will take it out tomorrow, and then it should pass the time test.

@jnothman
Member

jnothman commented Jun 28, 2018 via email

@jnothman
Member

jnothman commented Jun 28, 2018 via email

@RianneSchouten
Author

> I've not yet understood which feature you mask if a particular weighted sum is selected for masking

That has to do with the missingness patterns. You define patterns beforehand. Let's say you have a dataset with y1, y2 and y3. A pattern might look like 0 0 1, meaning: if a weighted sum is selected for masking, features y1 and y2 will become incomplete.

You can define different kinds of patterns, ending up with a dataset containing different patterns as well. Each missing data pattern is connected to one way of calculating the weighted sum scores. In other words, you define the weights for each pattern.
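
A toy sketch of that idea (my own illustration, not the mice::ampute implementation):

import numpy as np

rng = np.random.RandomState(42)
X = rng.randn(8, 3)  # columns y1, y2, y3

# Each pattern row marks which features become missing (0 = masked).
patterns = np.array([[0, 0, 1],   # y1 and y2 become incomplete
                     [1, 1, 0]])  # y3 becomes incomplete
# One weight vector per pattern defines its weighted-sum score.
weights = np.array([[0.0, 0.0, 1.0],
                    [0.5, 0.5, 0.0]])

assigned = rng.randint(len(patterns), size=len(X))  # a pattern per case
for row in range(len(X)):
    p = assigned[row]
    score = X[row] @ weights[p]      # weighted sum of this case's values
    prob = 1 / (1 + np.exp(-score))  # higher score -> more likely masked
    if rng.rand() < prob:
        X[row, patterns[p] == 0] = np.nan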

@stefvanbuuren

@RianneSchouten Text is fine.

@stefvanbuuren

Both examples are fairly advanced. At some point you might wish to add a minimal example, e.g., as in https://github.com/stefvanbuuren/mice to emphasise the basics.

@RianneSchouten
Author

Okay, I made the explanation at the beginning larger, with more explanation of Rubin's rules. I also put the rules in separate functions so that it is clear where the magic happens. In addition, I turned the number of simulations down from 10 to 2, so the runtime should now be fine. Let's see; otherwise I will take the simulation out.
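
As an illustration, such helper functions can look like this (names here are hypothetical, not necessarily the ones in the script):

import numpy as np

def pool_estimates(estimates):
    # Rubin's rules: the pooled point estimate is the mean of the
    # m per-dataset estimates.
    return np.mean(estimates, axis=0)

def pool_variances(estimates, variances):
    # Total variance = mean within-imputation variance plus the
    # between-imputation variance inflated by (1 + 1/m).
    m = len(estimates)
    within = np.mean(variances, axis=0)
    between = np.var(estimates, axis=0, ddof=1)
    return within + (1 + 1 / m) * between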

@stefvanbuuren you might be right that the examples are rather advanced, comparing different approaches while demonstrating the example.

Do others think I should make a more basic example as well?

@sklearn-lgtm

This pull request introduces 1 alert when merging ed1db8b into 3b5abf7 - view on LGTM.com

new alerts:

  • 1 for Encoding error

Comment posted by LGTM.com

@sklearn-lgtm

This pull request introduces 1 alert when merging 7c3fb97 into 3b5abf7 - view on LGTM.com

new alerts:

  • 1 for Encoding error

Comment posted by LGTM.com

robust summary measures such as medians or ranges instead of using Rubin’s
pooling rules. This applies to an estimate like explained variance.

In sum, Rubin’s pooling rules are as follows. The overall point estimate after
@jnothman
Member

lgtm.com is complaining about this curly apostrophe. Use '.

janvanrijn and others added 12 commits October 9, 2018 11:25
* modularized data column functionality

* small bugfix

* removes redundant line breaks

* added some documentation on the added fn

* added additional comment on advice of Nicholas Hug

* added test case

* merged master into branch, and added small comments by Joel

* added doc item
…attributes (#12324)

* rm criterion and max_features from __init__ and store them as class attrs instead

* make sure that the docstring comes first
Part of #11992.
These were all the things that seemed pretty straight-forward. It's actually a bit bulky but should still be easy to review, hopefully.
* add reingold tillford tree layout algorithm

* add first silly implementation of matplotlib based plotting for trees

* object oriented design for export_graphviz so it can be extended

* add class for mlp export

* add colors

* separately scale x and y, add arrowheads, fix strings

* implement max_depth

* don't use alpha for coloring because it makes boxes transparent

* remove unused variables

* vertical center of boxes

* fix/simplify newline trimming

* somewhere in the middle of stuff

trying to get rid of scalex, scaley

* remove "find_longest_child" for now, fix tests

* make scalex and scaley internal, and ax local.

render everything once to get the bbox sizes, then again to actually plot it with known extents.

* add some margin to the max bbox width

* add _BaseTreeExporter baseclass

* add docstring to plot_tree

* use data coordinates so we can put the plot in a subplot, remove some hacks.

* remove scalex, scaley, add automatic font size

* use rendered stuff for setting limits (well nearly there)

* import plot_tree into tree module

* set limits before font size adjustment?

* add tree plotting via matplotlib to iris example and to docs

* pep8 fix

* skip doctest on plot_tree because matplotlib is not installed on all CI machines

* redo everything in axis pixel coordinates

re-introduce scalex, scaley
add max_extents to tree to get tree size before plotting

* fix max-depth

parent node positioning and don't consider deep nodes in layouting

* consider height in fontsize computation

in case someone gave us a very flat figure

* fix error when max_depth is None

* add docstring for tree plotting fontsize

* starting on jnothman's review

* renaming fixes

* whatsnew for tree plotting

* clear axes prior to doing anything.

* fix doctests

* skip matplotlib doctest

* trying to debug circle failure

* trying to show full traceback

* more print debugging

* remove debugging crud

* hack around matplotlib <1.5 issues

* copy bbox args because old matplotlib is weird.

* pep8 fixes

* add explicit boxstyle

* more pep8

* even more pep8

* add comment about matplotlib version requirement

* remove redundant file

* add whatsnew entry that the merge lost

* fix merge issue

* more merge issues

* whitespace ...

* remove doctest skip to see what's happening

* added some simple invariance tests buchheim function

* refactor __init__ into superclass

* added some tests of plot_tree

* put skip back in, fix typo, fix versionadded number

* remove unused parameters special_characters and parallel_leaves from mpl plotting

* rename tests to test_reingold_tilford

* added license header from pymag-trees repo

* remove duplicate test file.
@RianneSchouten
Author

I wanted to bring my multiple-imputation branch up to date with the master branch of scikit-learn that I originally forked from.

I first brought my master branch up to date by:
git fetch upstream
git checkout master
git merge upstream/master

I then brought the multiple-imputation branch up to date by:
git fetch -p origin
git merge origin/master
git checkout multiple-imputation
git merge master

Now I see that at the top of this page it says: RianneSchouten wants to merge 164 commits into scikit-learn:iterativeimputer.

Can someone explain to me whether I did something wrong and/or how this works?

@jnothman
Member

If your master couldn't be fast-forwarded to upstream/master, that would have made a mess. Or the mess could have come from origin/master.

For another chance, git reset --hard last_good_commit and force push.
Then simply git merge upstream/master
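
In command form (last_good_commit and the branch name are placeholders for your own values):

git checkout multiple-imputation
git reset --hard last_good_commit
git push --force origin multiple-imputation
git merge upstream/master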

@RianneSchouten
Author

I made a mess.
Since my work was in only one file, I will start over.

@RianneSchouten
Author

As the developer info says, I will fork/clone the master branch, make a new local branch, work on the ampute files in that branch, and then make a pull request.

I am wondering, though: if I want to improve/update the multiple-imputation example, I have to use the info in the IterativeImputer branch. Should I wait until you update the master branch with that information, or should I fork/clone the IterativeImputer branch? What is the preferred way to do it?

Thanks in advance.

@sergeyf
Contributor

sergeyf commented Oct 31, 2018

@RianneSchouten The master will not be updated until we have all the pieces of IterativeImputer done in its own branch. That includes examples, etc.

@RianneSchouten
Author

RianneSchouten commented Oct 31, 2018 via email

@RianneSchouten
Author

Hey guys,

I am sorry, but it seems I don't have enough time to actively finish this.
The example is finished in the sense that it shows how to loop over IterativeImputer to create multiple imputations, how to calculate the variance of beta estimates, and how to pool the outcomes after multiple imputation.

What remains is applying the code to the newest version of IterativeImputer and removing the simulation aspect from the example (it is not necessary for showing how to perform MI and takes unnecessary time).

The difficult aspect is that the pooling rules change for different statistical analysis methods. In the example I use linear regression, which is the most used and most basic analysis method.
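
For whoever picks this up: for the linear regression case, the pooling can be sketched as follows (the function name is hypothetical, and statsmodels is used here only to get standard errors):

import numpy as np
import statsmodels.api as sm

def fit_and_pool(imputed_datasets, y):
    # Fit the analysis model (OLS) on each completed dataset, then
    # combine the results with Rubin's rules.
    betas, variances = [], []
    for X_imp in imputed_datasets:
        res = sm.OLS(y, sm.add_constant(X_imp)).fit()
        betas.append(res.params)
        variances.append(res.bse ** 2)  # squared standard errors
    m = len(betas)
    pooled = np.mean(betas, axis=0)
    within = np.mean(variances, axis=0)
    between = np.var(betas, axis=0, ddof=1)
    total_se = np.sqrt(within + (1 + 1 / m) * between)
    return pooled, total_se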

I hope someone else can finish this so that it comes along with IterativeImputer?

Kind regards,
Rianne

@sergeyf
Contributor

sergeyf commented Dec 12, 2018

Hi @RianneSchouten. When you say "The difficult aspect is that the pooling rules change for the different statistical analysis methods": does this mean we need to figure out other pooling methods for this particular example? Or is the current pooling sufficient to demonstrate how this generally works? Should we include a reference to other pooling rules for other models?

@RianneSchouten
Author

RianneSchouten commented Dec 12, 2018 via email

@sergeyf
Contributor

sergeyf commented Jan 18, 2019

Hi again @RianneSchouten. Just to confirm what you mean about removing the simulation aspect from the example: which lines should I remove? I assume 445 to 509 (plus the doc info about those lines). Is that correct?

@sergeyf
Contributor

sergeyf commented Jan 18, 2019

Also, @RianneSchouten, regarding this statement: "Due to the between-imputation variance, the standard errors of all regression coefficients are larger with multiple imputation than with single imputation. This allows for valid statistical inference."

I think ML people might not know what this means, as we don't really use standard errors much. Why do larger standard errors allow for valid statistical inference?

@jnothman
Member

jnothman commented Jan 19, 2019 via email

@sergeyf
Contributor

sergeyf commented Jan 19, 2019

Thanks @jnothman. I am wondering if the ML community will understand that point. "Standard errors" don't appear in the ML curriculum very much.

In any case, I rewrote this example a bit with fewer iterations and it's much faster now. Should I make a new PR? Or is there anything else you'd like to see changed here?

@jnothman
Member

jnothman commented Jan 19, 2019 via email

@jnothman
Member

Superseded by #13025

@jnothman jnothman closed this Jan 22, 2019