MNT accelerate plot_iterative_imputer_variants_comparison.py #21748


Merged: 20 commits, Feb 23, 2022

Conversation

siavrez
Contributor

@siavrez siavrez commented Nov 22, 2021

Adding bootstrapping to ExtraTrees with a 0.75 sample fraction improves runtime by 4.8 seconds with 5 folds and 3 seconds with 3 folds. I also changed the number of folds to 3. Total runtime is now 10.1 +/- 1.3 seconds, down from 24 +/- 3.3 seconds.

Reference Issues/PRs

#21598

What does this implement/fix? Explain your changes.

Any other comments?
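The change described above can be sketched as follows. This is an illustrative reconstruction, not the PR's exact diff: apart from `bootstrap=True` and the 0.75 sample fraction, all parameter values and the toy data are assumptions.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Enabling bootstrapping with a 0.75 sample fraction means each tree is fit
# on a subsample of the rows, which is where the speed-up comes from.
estimator = ExtraTreesRegressor(
    n_estimators=10,     # illustrative; the example's value may differ
    bootstrap=True,
    max_samples=0.75,    # each tree sees 75% of the training rows
    random_state=0,
)
imputer = IterativeImputer(estimator=estimator, random_state=0)

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 8.0]])
X_imputed = imputer.fit_transform(X)
```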

@siavrez
Contributor Author

siavrez commented Nov 22, 2021

MSE using 3 folds instead of 5:

5 folds:
    Original Full Data                       0.631302
    SimpleImputer    mean                    0.826854
                     median                  0.832756
    IterativeImputer BayesianRidge           0.695367
                     DecisionTreeRegressor   0.764438
                     ExtraTreesRegressor     0.701408
                     KNeighborsRegressor     0.834774

3 folds:
    Original Full Data                       0.657900
    SimpleImputer    mean                    0.868160
                     median                  0.875723
    IterativeImputer BayesianRidge           0.676844
                     DecisionTreeRegressor   0.903947
                     ExtraTreesRegressor     0.744838
                     KNeighborsRegressor     0.869787

MSE is worse across the board, but the relative differences between methods are similar.
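The fold-count change can be reproduced in miniature with `cross_val_score`; the dataset and estimator below are stand-ins, not the example's actual California housing setup.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Same scoring, different numbers of CV folds: 3 folds means fewer fits
# (faster) but each score is computed on fewer, larger test splits.
scores_5 = cross_val_score(BayesianRidge(), X, y,
                           scoring="neg_mean_squared_error", cv=5)
scores_3 = cross_val_score(BayesianRidge(), X, y,
                           scoring="neg_mean_squared_error", cv=3)
```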

@glemaitre glemaitre changed the title accelerate plot_iterative_imputer_variants_comparison.py added bootst… MNT accelerate plot_iterative_imputer_variants_comparison.py Nov 23, 2021
@adrinjalali adrinjalali mentioned this pull request Nov 23, 2021
41 tasks
Member

@glemaitre glemaitre left a comment

LGTM

@adrinjalali
Member

We have a lot of ConvergenceWarning reported by IterativeImputer now, we should make sure examples don't have such warnings:

/home/circleci/project/sklearn/impute/_iterative.py:700: ConvergenceWarning: [IterativeImputer] Early stopping criterion not reached.
  warnings.warn(

@siavrez
Contributor Author

siavrez commented Nov 24, 2021

I'll try to find a way to avoid the ConvergenceWarning.

@siavrez
Contributor Author

siavrez commented Nov 24, 2021

The only other example of iterative imputation also uses the California housing dataset. If using another dataset for this one is an option, I can try to find one that raises the fewest ConvergenceWarnings.

@siavrez
Contributor Author

siavrez commented Nov 24, 2021

I checked the implementation of tol in IterativeImputer: `normalized_tol = self.tol * np.max(np.abs(X[~mask_missing_values]))`. One possible explanation is that the California housing dataset is full of outliers, so in this scenario the tolerance check is based only on outlier values (for 5 variables).
[attached image: CalH]
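The quoted line from the implementation can be illustrated in isolation; the toy data below is an assumption, chosen to show how a single outlier inflates the effective tolerance.

```python
import numpy as np

tol = 1e-3
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [500.0, 6.0]])  # 500.0 plays the role of an outlier
mask_missing_values = np.isnan(X)

# The stopping threshold scales with the largest absolute observed value,
# so one large entry dominates the convergence check: here the effective
# tolerance becomes 0.5 instead of something on the scale of typical values.
normalized_tol = tol * np.max(np.abs(X[~mask_missing_values]))
```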

@siavrez
Contributor Author

siavrez commented Nov 24, 2021

Even after 250 iterations, DecisionTreeRegressor does not converge for 5 variables.

@glemaitre
Member

We have a lot of ConvergenceWarning reported by IterativeImputer now, we should make sure examples don't have such warnings

@adrinjalali
This is an issue with the IterativeImputer itself. I don't think that we can do anything to remove these warnings; they are already raised on the original example on my computer.

xref: #14338

I was planning to have a look at this.

@siavrez
Contributor Author

siavrez commented Nov 24, 2021

There are no warnings with the new implementation, but I had to change the tree and set the tolerance for each estimator.
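A hedged sketch of that kind of fix: the actual estimator and tolerance values used in the PR are not shown here, so the numbers below are illustrative only.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Setting tol (and max_iter) explicitly gives the early-stopping criterion
# a realistic chance of being met, which avoids the ConvergenceWarning.
imputer = IterativeImputer(
    estimator=BayesianRidge(),
    max_iter=25,
    tol=1e-2,
    random_state=0,
)

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 8.0]])
X_imputed = imputer.fit_transform(X)
```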

@siavrez
Contributor Author

siavrez commented Nov 24, 2021

Iterative imputation without scaling:

    Original Full Data                       0.631302
    SimpleImputer    mean                    0.826854
                     median                  0.832756
    IterativeImputer BayesianRidge           0.696538
                     RandomForestRegressor   0.713138
                     ExtraTreesRegressor     0.714369
                     KNeighborsRegressor     0.837740

With robust scaling:

    Original Full Data                       0.630870
    SimpleImputer    mean                    0.830052
                     median                  0.835994
    IterativeImputer BayesianRidge           0.699316
                     RandomForestRegressor   0.712361
                     ExtraTreesRegressor     0.727760
                     KNeighborsRegressor     0.779323
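The "with robust scaling" variant presumably corresponds to a pipeline like the one below; this is an assumption about the wiring, not the PR's exact code. Scikit-learn's scalers disregard NaNs during fit and preserve them in transform, so scaling can precede imputation.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

pipeline = make_pipeline(
    RobustScaler(),  # NaNs are ignored in fit and kept in transform
    IterativeImputer(estimator=BayesianRidge(), random_state=0),
)

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]])
X_imputed = pipeline.fit_transform(X)
```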

Member

@adrinjalali adrinjalali left a comment

Otherwise, plus @ogrisel's suggestion, LGTM.

@glemaitre glemaitre self-requested a review November 25, 2021 17:25
@glemaitre glemaitre self-requested a review November 25, 2021 17:37
@glemaitre glemaitre self-requested a review November 25, 2021 18:05
@siavrez siavrez requested a review from ogrisel November 28, 2021 09:43
Member

@ogrisel ogrisel left a comment

The code looks good and the speed-up is nice but the top-level docstring still needs to be adapted to reflect the content of the code.

  • ExtraTreesRegressor needs to be replaced by RandomForestRegressor in several occurrences;
  • mentions of DecisionTreeRegressor need to be removed;
  • the pipeline with the expansion of a degree 2 polynomial kernel needs to be introduced.

While we are at it, we could add a final comment emphasizing that while some methods are seemingly better than others on average, the error bars observed on the cross-validated scores are still very wide in all cases.

We could finally emphasize that some estimators such as HistGradientBoostingRegressor can natively deal with missing features and are often recommended over building pipelines with complex and costly missing values imputation strategies.

@siavrez
Contributor Author

siavrez commented Dec 8, 2021

"Egress is over the account limit" seems to be the cause of the failing tests.

Member

@glemaitre glemaitre left a comment

Otherwise LGTM.

@siavrez siavrez requested a review from ogrisel January 7, 2022 15:17
Member

@glemaitre glemaitre left a comment

LGTM

Member

@jeremiedbb jeremiedbb left a comment

Time is now 4 sec instead of 16 sec. LGTM. Thanks @siavrez!

@jeremiedbb jeremiedbb merged commit 8286f02 into scikit-learn:main Feb 23, 2022
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Mar 1, 2022
…cikit-learn#21748)

Co-authored-by: Guillaume Lemaitre <[email protected]>
Co-authored-by: Adrin Jalali <[email protected]>
Co-authored-by: Jérémie du Boisberranger <[email protected]>
6 participants