MNT fix iterative imputer example's speed issue. #13379

Conversation
ed82764 to 8cb422c
Finally CircleCI is happy :)
FYI there's a successful build on master; this example takes 1.9e+02 sec (https://circleci.com/gh/scikit-learn/scikit-learn/50704). I'm wondering whether this is the reason for the recent Circle CI failures. Which part of this example is time-consuming? @adrinjalali
(Ignore the last part of the previous comment :)) And it seems that the current example is unable to demonstrate the advantages of IterativeImputer? @adrinjalali
I find the current plot weird: all the base estimators used with the IterativeImputer are collapsed together, which causes very large error bars. I am pretty sure that this was not the intent of the author of this example.
@@ -93,7 +97,7 @@
estimators = [
    BayesianRidge(),
    DecisionTreeRegressor(max_features='sqrt', random_state=0),
    ExtraTreesRegressor(n_estimators=10, n_jobs=-1, random_state=0),
Why is this needed? Does this cause a performance regression on Circle CI? If I am not mistaken, ExtraTreesRegressor uses thread-based parallelism, which is pretty lightweight, but maybe there is an over-subscription issue caused by a bad detection of the true number of available CPU cores on Circle CI?
At least on my local machine, n_jobs=-1 makes it much slower. Probably because forking the process is more expensive than what we gain.
My point is that no forking should happen if we use thread-based parallelism as I would expect.
Indeed there is something wrong. I increased the verbosity and the "loky" backend is picked up irrespective of the prefer="threads" backend hint. I will have a look.
Ok I get it, it's caused by a bug in the handling of nested parallel calls in joblib that causes the prefer="threads" hint to be ignored. Disabling n_jobs=-1 is fine as a workaround for now. I will report and fix upstream.
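The two workarounds discussed above can be sketched as follows. This is an illustrative snippet, not the PR's code: the toy data and estimator settings here are made up, and only the joblib/scikit-learn calls themselves are real API.

```python
# Sketch of the two workarounds discussed above (toy data, not the
# example's real setup).
from joblib import parallel_backend
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=200, n_features=8, random_state=0)

# Workaround adopted in this PR: simply drop n_jobs=-1, so the nested
# joblib call never reaches the process-based "loky" backend at all.
est = ExtraTreesRegressor(n_estimators=10, random_state=0).fit(X, y)

# Alternative: pin the threading backend from the outside; nested
# Parallel calls inherit it, so no worker processes are forked even
# though the buggy prefer="threads" hint is ignored.
with parallel_backend("threading", n_jobs=2):
    est_threaded = ExtraTreesRegressor(
        n_estimators=10, n_jobs=-1, random_state=0
    ).fit(X, y)
```

Pinning the backend via `parallel_backend` is coarser than the `prefer="threads"` hint (it applies to every nested joblib call in the block), which is why simply disabling `n_jobs=-1` is the lighter workaround here.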
I haven't worked on the iterative imputer at all. I'm simply fixing the timing issue. And the example still raises some convergence warnings which need to be fixed. 1.9e+2 is still the slowest example we'd have, isn't it?
No @adrinjalali, see the latest Circle CI log:
This is why I'm asking whether this is the reason for the recent Circle CI failures.
Interesting, that's different from what I observed. If we're happy and the CI is happy, we can close this. But the ExtraTreesRegressor is the slowest part.
I think we should try to simplify some time-consuming examples, see #13383
This change doesn't really change the order of the items in the bar plot, does it? I don't see how this changes anything in the example in that regard. I understand it may not be demonstrating the advantages of the iterative imputer, but I don't think this PR is responsible for that matter, or a related change.
The bar plot is broken by the 2-level index: it should highlight the impact of the nested estimator. But I am not sure what the idiomatic way is to do a bar plot with such a 2-level stacked-columns array that is turned into a 2-level index series when calling
Maybe @jorisvandenbossche has a suggestion with the above?
I think it is actually the dataframe for the results of the IterativeImputer that is incorrectly constructed. Each iteration overwrites the result of the previous one.
See #13384. Made a separate PR, as it is indeed somewhat unrelated to the speed / doc build issue here.
@jorisvandenbossche the dataframe and series look good:

>>> scores
    Original SimpleImputer  ...    IterativeImputer
   Full Data          mean  ...  ExtraTreesRegressor  KNeighborsRegressor
0  -0.408433     -0.581144  ...            -0.469225            -0.591546
1  -0.636009     -0.806046  ...            -0.686578            -0.815429
2  -0.614910     -0.764460  ...            -0.673648            -0.771269
3  -1.089616     -1.319445  ...            -1.180996            -1.322944
4  -0.407541     -0.663177  ...            -0.473872            -0.668056

[5 rows x 7 columns]

>>> -scores.mean()
Original          Full Data                0.631302
SimpleImputer     mean                     0.826854
                  median                   0.832756
IterativeImputer  BayesianRidge            0.701727
                  DecisionTreeRegressor    0.769014
                  ExtraTreesRegressor      0.696864
                  KNeighborsRegressor      0.833849
dtype: float64

It's the plotting code that is broken and does not handle the 2-level indexing of scores.
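For reference, one way to bar-plot a Series with a 2-level index like the one above is sketched below. This is not the example's actual plotting code: the scores here are fabricated, and only the index layout mirrors the real `scores` frame.

```python
# Hedged sketch: bar plot of a 2-level-indexed Series with error bars.
# The numbers are fabricated; only the index layout mirrors `scores`.
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_tuples([
    ("Original", "Full Data"),
    ("SimpleImputer", "mean"),
    ("SimpleImputer", "median"),
    ("IterativeImputer", "BayesianRidge"),
    ("IterativeImputer", "ExtraTreesRegressor"),
])
rng = np.random.RandomState(0)
scores = pd.DataFrame(-0.6 - 0.2 * rng.rand(5, 5), columns=columns)

means = -scores.mean()   # Series with a 2-level index, as shown above
errors = scores.std()

fig, ax = plt.subplots(figsize=(8, 4))
means.plot.barh(xerr=errors.values, ax=ax)
# Join both index levels in the tick labels so each nested estimator
# stays visible instead of being collapsed together.
ax.set_yticklabels([" / ".join(levels) for levels in means.index])
ax.set_xlabel("MSE (smaller is better)")
fig.tight_layout()
```

Joining both index levels into the tick labels is one simple way to keep the nested estimator visible; the same effect could also be achieved by grouping bars per top-level category.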
Actually you are right, I forgot I had already fixed that issue (#13384) in my local workspace, but I had commented out the plotting code, so I had not realized that it fixed it... Debugging too many things at the same time :)
+1 for merging this PR alongside #13384. Both together show that the iterative imputer still has value (on average), even on the subsampled dataset (with large error bars).
@@ -57,6 +57,10 @@
rng = np.random.RandomState(0)

X_full, y_full = fetch_california_housing(return_X_y=True)
# ~2k samples is enough for the purpose of the example.
# Remove the following two lines for a slower run with different error bars
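The subsampling step this hunk alludes to could look like the sketch below. The exact merged lines are not shown in the diff, so this is a plausible reconstruction, not the PR's code; a random array stands in for `fetch_california_housing(return_X_y=True)` so the sketch runs offline.

```python
# Plausible reconstruction (not the merged code) of subsampling the
# dataset down to ~2k rows to speed up the example.
import numpy as np

rng = np.random.RandomState(0)

# Stand-in for fetch_california_housing(return_X_y=True); the real
# dataset has 20640 samples and 8 features.
X_full = rng.randn(20640, 8)
y_full = rng.randn(20640)

n_samples = 2000  # ~2k samples is enough for the purpose of the example
subset = rng.choice(X_full.shape[0], n_samples, replace=False)
X_full, y_full = X_full[subset], y_full[subset]
```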
different -> similar? And a . at the end.
The error bars are different, not really similar.
Really? Could you please post these two figures?
So you think they are not similar?
Exactly. Especially for the last one and BayesianRidge.
CircleCI fails due to the example being extremely slow (2e+5 seconds!)
This fixes the issue, but the example still needs improvement to fix the warnings and maybe still improve the speed. It's far from ideal, but fixes the CI issue (I hope).
Resulting plot before the change:
Resulting plot after the change: