MNT accelerate plot_iterative_imputer_variants_comparison.py #21748
Conversation
…raping to ETrees and changed folds to 3
MSE using 3 folds instead of 5 (plots of the 5-fold and 3-fold scores omitted): MSE is worse for all variants, but the differences between them are similar.
LGTM
We have a lot of
I'll try to find a way to avoid
The only other example of iterative imputation also uses the California housing dataset. If using other datasets for this one is an option, I can try to find the one with the minimum
Even after 250 iterations, DecisionTreeRegressor does not converge for 5 variables.
@adrinjalali xref: #14338 I was planning to have a look at this.
There are no warnings with the new implementation, but I had to change the Tree and set the tolerance for each estimator.
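The per-estimator tolerance mentioned above can be set through `IterativeImputer`'s `tol` parameter. A minimal sketch of that idea, on synthetic data (the dataset, the 20% missing rate, and the specific `tol` value here are illustrative assumptions, not the PR's actual settings):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
X_missing = X.copy()
X_missing[rng.rand(*X.shape) < 0.2] = np.nan  # knock out ~20% of entries

# A looser per-run tolerance lets tree-based round-robin imputation
# stop early instead of emitting convergence warnings.
imputer = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
    max_iter=25,
    tol=1e-3,  # illustrative value, not the one used in the PR
    random_state=0,
)
X_imputed = imputer.fit_transform(X_missing)
```

After fitting, `X_imputed` has the same shape as the input with every NaN replaced by a model prediction.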
Iterative imputation without scaling: Original Full Data 0.631302
With robust scaling: Original Full Data 0.630870
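The scaling comparison above can be reproduced in spirit by prepending a `RobustScaler` to the imputation pipeline; scikit-learn's scalers pass NaNs through, so the scaler can sit before the imputer. A sketch on synthetic data (the dataset, missing rate, and 3-fold CV are assumptions for illustration):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_regression

rng = np.random.RandomState(0)
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
X[rng.rand(*X.shape) < 0.1] = np.nan  # inject ~10% missing values

# RobustScaler ignores NaNs when fitting, so it can precede the imputer.
pipe = make_pipeline(
    RobustScaler(),
    IterativeImputer(random_state=0),
    BayesianRidge(),
)
scores = cross_val_score(pipe, X, y, cv=3)
```

As the numbers above suggest, the scaler makes little difference on this kind of data; the comparison is mostly a sanity check.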
Otherwise, plus @ogrisel's suggestion, LGTM.
The code looks good and the speed-up is nice, but the top-level docstring still needs to be adapted to reflect the content of the code:
- ExtraTreesRegressor needs to be replaced by RandomForestRegressor in several occurrences;
- mentions of DecisionTreeRegressor need to be removed;
- the pipeline with the expansion of a degree-2 polynomial kernel needs to be introduced.
And while we are at it, we could add a final comment emphasizing that while some methods seem better than others on average, the error bars observed on the cross-validated scores are still very wide in all cases.
We could finally emphasize that some estimators, such as HistGradientBoostingRegressor, can natively deal with missing features and are often recommended over building pipelines with complex and costly missing-value imputation strategies.
…ity to deal with missing values
Otherwise LGTM.
Co-authored-by: Guillaume Lemaitre <[email protected]>
LGTM
Time is now 4 sec instead of 16 sec. LGTM. Thanks @siavrez!
…cikit-learn#21748) Co-authored-by: Guillaume Lemaitre <[email protected]> Co-authored-by: Adrin Jalali <[email protected]> Co-authored-by: Jérémie du Boisberranger <[email protected]>
Adding bootstrapping to ExtraTrees with a 0.75 sample fraction improves runtime by 4.8 seconds with 5 folds and 3 seconds with 3 folds. Also changed the number of folds to 3. Total runtime is now 10.1 +/- 1.3 seconds, down from 24 +/- 3.3 seconds.
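In scikit-learn's API, that sample fraction corresponds (as far as I can tell) to the `max_samples` parameter, which only takes effect when `bootstrap=True` on `ExtraTreesRegressor`. A sketch of the speed-up idea on synthetic data (the dataset and tree count are illustrative assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

X, y = make_regression(n_samples=500, n_features=8, random_state=0)

# bootstrap=True with max_samples=0.75 trains each tree on a resample of
# 75% of the rows, cutting per-tree fit time at a small cost in accuracy.
est = ExtraTreesRegressor(
    n_estimators=100,
    bootstrap=True,      # required for max_samples to apply
    max_samples=0.75,    # fraction of rows per tree (assumed mapping of "0.75 sample fraction")
    random_state=0,
)
est.fit(X, y)
r2 = est.score(X, y)
```

Smaller per-tree training sets are the main source of the runtime reduction reported above.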
Reference Issues/PRs
#21598
What does this implement/fix? Explain your changes.
Any other comments?