
FEA Add missing-value support to sparse splitter in RandomForest and ExtraTrees #29542


Open
adam2392 opened this issue Jul 22, 2024 · 7 comments

Comments

@adam2392
Member

Summary

While missing-value support for decision trees has been added recently, it only works when X is encoded as a dense array. Since RandomForest* and ExtraTrees* both support sparse X, fitting on a sparse X that contains np.nan should still work.

Solution

Add missing-value logic to SparsePartitioner in _partitioner.pyx, and to BestSparseSplitter and RandomSparseSplitter in _splitter.pyx.

The logic is the same as in the dense case; it just has to handle the fact that X is now in sparse CSC format.
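As a quick illustration of what CSC format means for the splitter (a minimal scipy sketch, not scikit-learn's internal code): explicitly stored values, including any np.nan, live in the `data` array, grouped per column by `indptr`, with row indices in `indices`. Implicit zeros are never missing, so a NaN scan only needs to visit the stored values of a column.

```python
import numpy as np
from scipy.sparse import csc_matrix

X = csc_matrix(np.array([[1.0, 0.0], [np.nan, 2.0], [0.0, 3.0]]))

# Stored values, column by column; NaN is an explicitly stored entry
print(X.data)     # [ 1. nan  2.  3.]
print(X.indices)  # row index of each stored value: [0 1 1 2]
print(X.indptr)   # column boundaries into data/indices: [0 2 4]

# A sparse splitter would scan a column's stored values for NaN
col0 = X.data[X.indptr[0]:X.indptr[1]]
print(np.isnan(col0))  # [False  True]
```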

Misc.

FYI https://github.com/scikit-learn/scikit-learn/pull/27966 will introduce native support for missing values in the `ExtraTree*` models (i.e., the random splitter).

One thing I noticed as I went through the PR is that the current codebase still does not support missing values in the sparse splitter. I think this might be pretty easy to add, but should we technically re-open this issue?

Xref: #5870 (comment)

Originally posted by @adam2392 in #5870 (comment)

@github-actions github-actions bot added the Needs Triage Issue requires triage label Jul 22, 2024
@adam2392 adam2392 added module:tree cython help wanted and removed Needs Triage Issue requires triage labels Jul 22, 2024
@Higgs32584
Contributor

This might be related to #27993, although that issue is more about multi-task random forests.

@vitorpohlenz
Contributor

Hi @adam2392 ,

If you could help me, I have some questions about this issue.

  1. Is this issue still open, and is it worth working on? I followed the associated history, and it seems the first conversations about it started 10 years ago.

  2. Do you think this is "easy enough" to be a good "second issue"? My first and only contribution was a few weeks ago. cc @glemaitre here because of your comment on a related issue; I would also like a second opinion on this question.

@adam2392
Member Author

Hi @vitorpohlenz !

Thank you for your interest in contributing!

  1. Yes it is still open and it is worth working on it.
  2. Is it easy? That depends on your experience with Cython and/or C++, as well as your understanding of decision trees. It also depends on how deep you want to dive into the code. The main thing to add is the same tree logic for handling missing values, but for sparse array inputs.

Lmk what you think and if you have a proposed idea of how to proceed.

@vitorpohlenz
Contributor

> Hi @vitorpohlenz !
>
> Thank you for your interest in contributing!
>
>   1. Yes it is still open and it is worth working on it.
>   2. Is it easy? That depends on your experience with Cython and/or C++, as well as your understanding of decision trees. It also depends on how deep you want to dive into the code. The main thing to add is the same tree logic for handling missing values, but for sparse array inputs.
>
> Lmk what you think and if you have a proposed idea of how to proceed.

Thanks for replying @adam2392 !

About [2.]: I have no previous experience with Cython and just a little with C++ (I worked a little with C, but that was 10 years ago). However, I understand the theory of decision trees: I know the CART algorithm and the ways to measure impurity using metrics like entropy, Gini, etc.

So, I would like to ask your opinion on whether it is a good idea for me to work on this issue (based on your expertise and this piece of information), to get more acquainted with the tree-based code in sklearn.

Also, I would like to ask for some guidance/advice, if necessary, so that I do not struggle too much.

@adam2392
Member Author

adam2392 commented May 1, 2025

Hi @vitorpohlenz

> So, I would like to ask your opinion on whether it is a good idea for me to work on this issue (based on your expertise and this piece of information), to get more acquainted with the tree-based code in sklearn.

Cython / C++ knowledge is definitely needed within the scikit-learn dev/reviewer community, so anyone interested is definitely welcome! There will be a learning curve, and it will require you to explore how Cython works and what doesn't. I think it's a net positive for your skillset.

With that being said, yes, you can ask for guidance/advice, but note that my time is a bit limited these days, so this will definitely require a lot of self-learning.

As a first step, I would look at this PR: #27966 and review how Cython works. Then what's needed is to:

  1. understand how sparse arrays work
  2. understand how missing values are split in the sklearn RF algorithm
  3. apply that understanding to the sparse splitter logic
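For step 2, the dense strategy evaluates each candidate threshold twice, sending the missing samples to the left child and then to the right, and keeps the better of the two. Below is a minimal NumPy sketch of that idea (an illustration of the strategy, not sklearn's actual Cython implementation; the function names are hypothetical):

```python
import numpy as np

def gini(y):
    """Gini impurity of an integer label array."""
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split_with_missing(x, y):
    """Find the best threshold on feature x, trying missing-left and
    missing-right for each candidate; returns (impurity, threshold,
    missing_go_left)."""
    missing = np.isnan(x)
    xo, yo, ym = x[~missing], y[~missing], y[missing]
    best = (np.inf, None, None)
    for t in np.unique(xo)[:-1]:
        left, right = yo[xo <= t], yo[xo > t]
        for miss_left in (True, False):
            l = np.concatenate([left, ym]) if miss_left else left
            r = right if miss_left else np.concatenate([right, ym])
            imp = (len(l) * gini(l) + len(r) * gini(r)) / len(y)
            if imp < best[0]:
                best = (imp, t, miss_left)
    return best

x = np.array([np.nan, 1.0, 2.0, 3.0])
y = np.array([1, 0, 0, 1])
print(best_split_with_missing(x, y))  # (0.0, 2.0, False)
```

Here the sample with NaN has label 1, so sending it right (with the other label-1 sample above the threshold 2.0) yields a pure split; the sparse version would apply the same idea while reading column values out of the CSC arrays.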

Feel free to propose/iterate on this plan accordingly!

@vitorpohlenz
Contributor

Hi @adam2392, thanks a lot for replying!

> With that being said, yes, you can ask for guidance/advice, but note that my time is a bit limited these days, so this will definitely require a lot of self-learning.

Sure, I understand that this task will be more self-driven than based on external aid. But this kind of guidance, like you just gave, helps me a lot.

So I will /take it and start with the steps you suggested, and I will post progress updates here until I have a clear enhancement proposal.

@vitorpohlenz
Contributor

Update on the progress/steps that I'm working through:

  • Understanding how sparse arrays work
  • Understanding how missing values are split in the sklearn Random Forest algorithm
  • Defining how to apply the modifications to the sparse splitter logic
