MNT fix iterative imputer example's speed issue. #13379

Conversation
ed82764 to 8cb422c
Finally CircleCI is happy :)
FYI there's a successful build on master; this example takes 1.9e+02 sec (https://circleci.com/gh/scikit-learn/scikit-learn/50704). I'm wondering whether this is the reason for the recent Circle CI failures. Which part of this example is time-consuming? @adrinjalali
(Ignore the last part of the previous comment :)) And it seems that the current example is unable to demonstrate the advantages of IterativeImputer? @adrinjalali
I find the current plot weird: all the base estimators used with the IterativeImputer are collapsed together, which causes very large error bars. I am pretty sure that this was not the intent of the author of this example.
@@ -93,7 +97,7 @@
estimators = [
    BayesianRidge(),
    DecisionTreeRegressor(max_features='sqrt', random_state=0),
    ExtraTreesRegressor(n_estimators=10, n_jobs=-1, random_state=0),
Why is this needed? Does this cause a performance regression on Circle CI? If I am not mistaken, ExtraTreesRegressor uses thread-based parallelism, which is pretty lightweight, but maybe there is an over-subscription issue caused by a bad detection of the true number of available CPU cores on Circle CI?
At least on my local machine, n_jobs=-1 makes it much slower. Probably because forking the process is more expensive than what we gain.
My point is that no forking should happen if we use thread-based parallelism as I would expect.
Indeed there is something wrong. I increased the verbosity and the "loky" backend is picked up irrespective of the prefer="threads" backend hint. I will have a look.
Ok I get it, it's caused by a bug in the handling of nested parallel calls in joblib that causes the prefer="threads" hint to be ignored. Disabling n_jobs=-1 is fine as a workaround for now. I will report and fix upstream.
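The two workarounds discussed above can be sketched as follows. This is an illustrative snippet, not the PR's code: the toy data and estimator settings here are made up, and only the joblib/scikit-learn calls themselves are real API.

```python
# Sketch of the two workarounds discussed above (toy data, not the
# example's real setup).
from joblib import parallel_backend
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=200, n_features=8, random_state=0)

# Workaround adopted in this PR: simply drop n_jobs=-1, so the nested
# joblib call never reaches the process-based "loky" backend at all.
est = ExtraTreesRegressor(n_estimators=10, random_state=0).fit(X, y)

# Alternative: pin the threading backend from the outside; nested
# Parallel calls inherit it, so no worker processes are forked even
# though the buggy prefer="threads" hint is ignored.
with parallel_backend("threading", n_jobs=2):
    est_threaded = ExtraTreesRegressor(
        n_estimators=10, n_jobs=-1, random_state=0
    ).fit(X, y)
```

Pinning the backend via `parallel_backend` is coarser than the `prefer="threads"` hint (it applies to every nested joblib call in the block), which is why simply disabling `n_jobs=-1` is the lighter workaround here.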
I haven't worked on the iterative imputer at all. I'm simply fixing the timing issue. And the example still raises some convergence warnings which need to be fixed. 1.9e+2 is still the slowest example we'd have, isn't it?
No @adrinjalali, see the latest Circle CI log:
This is why I'm asking whether this is the reason for the recent Circle CI failures.
Interesting, that's different from what I observed. If we're happy and the CI is happy, we can close this. But the ExtraTreesRegressor is the slowest part.
I think we should try to simplify some time-consuming examples, see #13383
This change doesn't really change the order of the items in the bar plot, does it? I don't see how this changes anything in the example in that regard. I understand it may not be demonstrating the advantages of the iterative imputer, but I don't think this PR is responsible for that matter, or a related change.
The bar plot is broken by the 2-level index: it should highlight the impact of the nested estimator. But I am not sure what the idiomatic way is to do a bar plot with such a 2-level stacked-columns array that is turned into a 2-level index series when calling
Maybe @jorisvandenbossche has a suggestion with the above?
I think it is actually the dataframe for the results of the IterativeImputer that is incorrectly constructed. Each iteration overwrites the result of the previous one.
See #13384. Made a separate PR, as it is indeed somewhat unrelated to the speed / doc build issue here.
@jorisvandenbossche the dataframe and series look good:

>>> scores
    Original SimpleImputer  ...    IterativeImputer
   Full Data          mean  ...  ExtraTreesRegressor  KNeighborsRegressor
0  -0.408433     -0.581144  ...            -0.469225            -0.591546
1  -0.636009     -0.806046  ...            -0.686578            -0.815429
2  -0.614910     -0.764460  ...            -0.673648            -0.771269
3  -1.089616     -1.319445  ...            -1.180996            -1.322944
4  -0.407541     -0.663177  ...            -0.473872            -0.668056

[5 rows x 7 columns]

>>> -scores.mean()
Original          Full Data                0.631302
SimpleImputer     mean                     0.826854
                  median                   0.832756
IterativeImputer  BayesianRidge            0.701727
                  DecisionTreeRegressor    0.769014
                  ExtraTreesRegressor      0.696864
                  KNeighborsRegressor      0.833849
dtype: float64

It's the plotting code that is broken and does not handle the 2-level indexing of scores.
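For reference, one way to bar-plot a Series with a 2-level index like the one above is sketched below. This is not the example's actual plotting code: the scores here are fabricated, and only the index layout mirrors the real `scores` frame.

```python
# Hedged sketch: bar plot of a 2-level-indexed Series with error bars.
# The numbers are fabricated; only the index layout mirrors `scores`.
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_tuples([
    ("Original", "Full Data"),
    ("SimpleImputer", "mean"),
    ("SimpleImputer", "median"),
    ("IterativeImputer", "BayesianRidge"),
    ("IterativeImputer", "ExtraTreesRegressor"),
])
rng = np.random.RandomState(0)
scores = pd.DataFrame(-0.6 - 0.2 * rng.rand(5, 5), columns=columns)

means = -scores.mean()   # Series with a 2-level index, as shown above
errors = scores.std()

fig, ax = plt.subplots(figsize=(8, 4))
means.plot.barh(xerr=errors.values, ax=ax)
# Join both index levels in the tick labels so each nested estimator
# stays visible instead of being collapsed together.
ax.set_yticklabels([" / ".join(levels) for levels in means.index])
ax.set_xlabel("MSE (smaller is better)")
fig.tight_layout()
```

Joining both index levels into the tick labels is one simple way to keep the nested estimator visible; the same effect could also be achieved by grouping bars per top-level category.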
Actually you are right, I forgot I had already fixed that issue (#13384) in my local workspace, but I had commented out the plotting code, so I had not realized that it fixed it... Debugging too many things at the same time :)
+1 for merging this PR alongside #13384. Both together show that the iterative imputer still has value (on average), even on the subsampled dataset (with large error bars).
@@ -57,6 +57,10 @@
rng = np.random.RandomState(0)

X_full, y_full = fetch_california_housing(return_X_y=True)
# ~2k samples is enough for the purpose of the example.
# Remove the following two lines for a slower run with different error bars
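The subsampling step this hunk alludes to could look like the sketch below. The exact merged lines are not shown in the diff, so this is a plausible reconstruction, not the PR's code; a random array stands in for `fetch_california_housing(return_X_y=True)` so the sketch runs offline.

```python
# Plausible reconstruction (not the merged code) of subsampling the
# dataset down to ~2k rows to speed up the example.
import numpy as np

rng = np.random.RandomState(0)

# Stand-in for fetch_california_housing(return_X_y=True); the real
# dataset has 20640 samples and 8 features.
X_full = rng.randn(20640, 8)
y_full = rng.randn(20640)

n_samples = 2000  # ~2k samples is enough for the purpose of the example
subset = rng.choice(X_full.shape[0], n_samples, replace=False)
X_full, y_full = X_full[subset], y_full[subset]
```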
different -> similar? And a . at the end.
The error bars are different, not really similar.
Really? Could you please post these two figures?
So you think they are not similar?
Exactly. Especially for the last one and BayesianRidge.
CircleCI fails due to the example being extremely slow (2e+5 seconds!)
This fixes the issue, but the example still needs improvement to fix the warnings and maybe still improve the speed. It's far from ideal, but fixes the CI issue (I hope).
Resulting plot before the change:
Resulting plot after the change: