

Conversation

cakedev0
Contributor

@cakedev0 cakedev0 commented Sep 6, 2025

This PR refactors how missing values are handled in trees by:

  • removing missing-values handling from Criterion subclasses
  • making it the responsibility of the splitter and partitioner only

This greatly simplifies the logic and, as a side effect, enables missing-values support for MAE trees at no extra cost.

Reference Issues/PRs

This PR accidentally fixes #32178

Apart from that, I looked but did not find any issue requesting this feature. I think that's because MAE trees are currently too slow in sklearn to be widely used; people who want MAE probably look for other options in sklearn or in other libraries.

What does this implement/fix? Explain your changes.

Currently, part of the missing-values support is implemented in each subclass of Criterion. I don't think this is a great design because:

  • Criterion is "X-blind": it is not aware of X values. It only looks at y and sample_weights, in the order defined by the sorted indices (sample_indices). Yet by handling missing values, it ends up dealing with X values indirectly. Why not just use the ordering of sample_indices to take missing values into account, as we do for any other value (even inf/-inf)?
  • It requires each criterion to implement several methods.

So, to the question "Why not just use the ordering of sample_indices to take into account missing values?", my answer is: "yes, let's just do that". The result removes 200 lines from _criterion.pyx without increasing the complexity of the splitter and the partitioner (in fact, it also simplifies the splitter a bit).
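To illustrate the idea, here is a minimal NumPy sketch, not the PR's Cython code. The helper names `mae_impurity` and `best_split` are hypothetical, and the sketch assumes unweighted samples and skips the candidate split that separates all missing values from all non-missing ones. The point is that the impurity function only ever sees y in the order given by the sorted indices; missing values are handled purely by placing their indices at the start or the end of that order.

```python
import numpy as np


def mae_impurity(y):
    """Mean absolute deviation from the median; needs only y, never X."""
    return float(np.mean(np.abs(y - np.median(y))))


def best_split(x, y):
    """Return (weighted child impurity, threshold, missing_go_left).

    Hypothetical helper: the "splitter" side builds the ordering,
    the "criterion" side (mae_impurity) stays X-blind.
    """
    missing = np.isnan(x)
    present = np.flatnonzero(~missing)
    present = present[np.argsort(x[present], kind="stable")]
    nan_idx = np.flatnonzero(missing)
    n = x.size
    best = (np.inf, None, None)

    for missing_go_left in (False, True):
        # The only missing-value logic: where the NaN indices sit in the order.
        if missing_go_left:
            order = np.concatenate([nan_idx, present])
            offset = nan_idx.size
        else:
            order = np.concatenate([present, nan_idx])
            offset = 0
        y_sorted = y[order]  # everything the criterion gets to see

        # Candidate positions lie between consecutive distinct non-missing values.
        for k in range(1, present.size):
            lo, hi = x[present[k - 1]], x[present[k]]
            if lo == hi:
                continue
            pos = offset + k  # samples order[:pos] go to the left child
            left, right = y_sorted[:pos], y_sorted[pos:]
            score = (
                left.size * mae_impurity(left) + right.size * mae_impurity(right)
            ) / n
            if score < best[0]:
                best = (score, (lo + hi) / 2, missing_go_left)
    return best


x = np.array([1.0, 2.0, np.nan, 3.0, 4.0, np.nan])
y = np.array([0.0, 0.0, 0.0, 10.0, 10.0, 0.0])
print(best_split(x, y))
```

In this toy data the samples with missing x have targets that look like the low-x samples, so the sketch ends up sending missing values to the left child. Swapping `mae_impurity` for any other y-only impurity would not require touching the missing-value handling.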

Any other comments?

I think it might make supporting missing values + monotonic constraints easy, but I haven't looked into it yet.

It might also simplify support for missing values + sparse a bit, though that is still not easy to do.


github-actions bot commented Sep 6, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 7a5cfce. Link to the linter CI: here

@cakedev0
Contributor Author

Note: as of today, tests pass on my laptop and most CI unit-test pipelines succeed, but some are failing. I managed to reproduce one of the failing pipelines locally using a Docker image, but I still need to find the bug.

@cakedev0
Contributor Author

Tests pass! 🎊

Well, I learned the difference between memcpy and memmove the hard way 😂
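For readers who have not hit this before, here is a small sketch of the difference, using Python's `ctypes` rather than the PR's Cython code. The comment above does not say where the overlap occurred, so the buffer and the shift are purely illustrative, and both helper names are hypothetical. In C, calling memcpy on overlapping regions is undefined behavior; the forward element-by-element loop below shows just one way it can go wrong, while memmove is defined for overlapping regions.

```python
import ctypes

IntArray = ctypes.c_int * 6
ITEM_SIZE = ctypes.sizeof(ctypes.c_int)


def shift_right_forward_copy(values, count, shift):
    """Naive forward copy: one possible outcome of an overlapping memcpy."""
    buf = IntArray(*values)
    for i in range(count):
        # May read an element that an earlier iteration already overwrote.
        buf[i + shift] = buf[i]
    return list(buf)


def shift_right_memmove(values, count, shift):
    """Same shift with memmove, which handles overlapping source and destination."""
    buf = IntArray(*values)
    base = ctypes.addressof(buf)
    ctypes.memmove(base + shift * ITEM_SIZE, base, count * ITEM_SIZE)
    return list(buf)


data = [1, 2, 3, 4, 5, 6]
print(shift_right_forward_copy(data, count=4, shift=2))
print(shift_right_memmove(data, count=4, shift=2))
```

Shifting the first four elements right by two positions makes source and destination overlap: the forward copy repeats the first two elements, while memmove preserves the original four.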

@cakedev0 cakedev0 marked this pull request as ready for review September 12, 2025 21:44
@cakedev0 cakedev0 changed the title [draft] Trees: Add support for missing values with criterion="absolute_error" by greatly simplifying the logic Trees: Add support for missing values with criterion="absolute_error" by greatly simplifying the logic Sep 12, 2025
@cakedev0 cakedev0 changed the title Trees: Add support for missing values with criterion="absolute_error" by greatly simplifying the logic FEA Trees: Add support for missing values with criterion="absolute_error" by greatly simplifying the logic Sep 12, 2025
@adam2392 adam2392 self-requested a review September 13, 2025 04:08
Development

Successfully merging this pull request may close these issues.

Trees: impurity decrease calculation is buggy when there are missing values