
FEA Add missing-value support for ExtraTreeClassifier and ExtraTreeRegressor #27966


Merged: 151 commits into scikit-learn:main on Jul 9, 2024

Conversation

@adam2392 (Member) commented Dec 16, 2023

Reference Issues/PRs

Towards: #27931

Follow-up to: #26391 and #23595

What does this implement/fix? Explain your changes.

  • Adds missing-value support to RandomSplitter in _splitter.pyx
  • Enables the "random" splitter kwarg for ExtraTreeClassifier and ExtraTreeRegressor
  • Adds unit-tests for ExtraTreeClassifier and ExtraTreeRegressor
  • Makes unit-tests for DecisionTree* and ExtraTree* more numerically robust by increasing the tolerance for a GLOBAL_RANDOM_SEED check and using cross-validation scores rather than a single score
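For illustration, here is a minimal usage sketch of the new behavior (toy data, assuming a scikit-learn build that includes this change; not code from the PR itself):

```python
# Toy usage sketch: ExtraTreeClassifier fit/predict on data containing np.nan.
# Assumes a scikit-learn build that includes this PR's missing-value support.
import numpy as np
from sklearn.tree import ExtraTreeClassifier

X = np.array([[0.0, 1.0],
              [np.nan, 2.0],
              [3.0, np.nan],
              [4.0, 5.0]])
y = np.array([0, 0, 1, 1])

clf = ExtraTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[np.nan, 1.5]]))
```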

Any other comments?

Compared to BestSplitter, handling missing values in RandomSplitter involves a design choice (and a potential extra cost), since we can either:

  1. choose a random threshold and then randomly send missing-values to left/right, OR
  2. send all missing-values to left/right and all non-missing values to right/left

Missing values can be pushed down the tree randomly (the first option), or the second option can actually be evaluated as a candidate split. There is a computational cost to doing so, but more importantly there is an interpretation tradeoff, which IMO comes down to the assumed missing-data mechanism (a toy sketch of both strategies is included after this discussion):

  • if missing-completely-at-random (MCAR), then option 1 is ideal because one should simply ignore the missing values or impute them
  • if missing-at-random (MAR), then option 2 is nice because the missingness of the data can sometimes be informative.

However, I think the difference at the single-tree level is not that important. For example, #28268 demonstrates that ExtraTrees, when combined as a forest, remain resilient and predictive in the presence of missing values.
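As a toy, pure-NumPy illustration of the two strategies (the function names and logic here are illustrative only, not the actual Cython implementation in _splitter.pyx):

```python
import numpy as np

rng = np.random.default_rng(0)

def split_option_1(x):
    """Option 1: draw a random threshold on the observed values, then send
    all missing values to one randomly chosen side."""
    observed = x[~np.isnan(x)]
    threshold = rng.uniform(observed.min(), observed.max())
    missing_go_left = rng.random() < 0.5
    goes_left = np.where(np.isnan(x), missing_go_left, x <= threshold)
    return threshold, goes_left

def split_option_2(x):
    """Option 2: split on missingness itself; all missing values go to one
    side and all observed values to the other (which side could itself be
    evaluated as a candidate)."""
    return np.isnan(x)

x = np.array([0.3, np.nan, 1.2, np.nan, 2.5, 0.7])
print(split_option_1(x))
print(split_option_2(x))
```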

Benchmarks demonstrating no significant runtime performance degradation

There is some complexity involved in checking whether there are missing values. However, this check only happens at the Python level, as shown by the following benchmark; in the Cython code, there is no regression.

Benchmarks with and without Python Check
[benchmark result images omitted: runtimes with and without the Python-level check]
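The benchmark figures are not reproduced here, but a minimal timing sketch in this spirit (synthetic data; not the asv benchmark that produced the figures) would look like:

```python
from time import perf_counter

import numpy as np
from sklearn.tree import ExtraTreeRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 20))
y = rng.standard_normal(100_000)

X_nan = X.copy()
X_nan[rng.random(X.shape) < 0.1] = np.nan  # ~10% of values missing

# Compare fit times with and without missing values in the input.
for name, data in [("dense", X), ("10% NaN", X_nan)]:
    tree = ExtraTreeRegressor(random_state=0)
    tic = perf_counter()
    tree.fit(data, y)
    print(f"{name}: fit in {perf_counter() - tic:.3f}s")
```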

I also ran this benchmark for ExtraTrees, which demonstrates that this check is negligible at the forest level, since it only occurs once. See #28268, which has the short code to enable it for ExtraTrees.

Benchmarks on ExtraTrees
[benchmark result image omitted]
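For completeness, a hedged sketch of the forest-level usage (this assumes the ExtraTrees support proposed in #28268 is available; the data and parameters are illustrative only):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((1_000, 10))
y = X[:, 0] + 0.1 * rng.standard_normal(1_000)
X[rng.random(X.shape) < 0.2] = np.nan  # ~20% missing completely at random

# Cross-validate an ExtraTrees forest directly on data containing NaNs.
forest = ExtraTreesRegressor(n_estimators=100, random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())
```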


github-actions bot commented Dec 16, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 20e9d9f.

@adam2392 adam2392 marked this pull request as ready for review December 18, 2023 03:31
@adam2392 adam2392 marked this pull request as draft January 18, 2024 18:49
@adam2392 adam2392 requested review from OmarManzoor and glemaitre July 6, 2024 15:41
@OmarManzoor (Contributor) left a comment


Minor comments, otherwise this looks good now.

@adam2392 (Member, Author) commented Jul 8, 2024

Perhaps @glemaitre and @thomasjpfan can take one last look then?

I think this has changed a bit since @glemaitre approved changes, and I believe I've addressed most of the core issues raised.

Thanks everyone for reviewing!

@OmarManzoor (Contributor) left a comment


LGTM! Thanks a lot @adam2392.

@OmarManzoor OmarManzoor enabled auto-merge (squash) July 9, 2024 07:52
@OmarManzoor OmarManzoor disabled auto-merge July 9, 2024 08:56
@OmarManzoor OmarManzoor enabled auto-merge (squash) July 9, 2024 08:57
@glemaitre (Member) commented:
Sorry for the delay, I was a bit underwater. LGTM and I see the auto-merge was activated.
Thanks @adam2392

@OmarManzoor OmarManzoor merged commit dddf2f0 into scikit-learn:main Jul 9, 2024
28 checks passed