
FEA Add missing-value support to sparse splitter in RandomForest and ExtraTrees #29542


Open
adam2392 opened this issue Jul 22, 2024 · 7 comments

Comments

@adam2392
Member

Summary

While missing-value support for decision trees has been added recently, it only works when X is encoded as a dense array. Since RandomForest* and ExtraTrees* both support sparse X, fitting on a sparse X that contains np.nan should still work.

Solution

Add missing-value logic to SparsePartitioner in _partitioner.pyx, and to BestSparseSplitter and RandomSparseSplitter in _splitter.pyx.

The logic is the same as in the dense case; it just has to handle the fact that X is now in sparse CSC format.
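As a quick illustration of what CSC format means for the splitter (a minimal scipy sketch, not scikit-learn's internal code): explicitly stored values, including any np.nan, live in the `data` array, grouped per column by `indptr`, with row indices in `indices`. Implicit zeros are never missing, so a NaN scan only needs to visit the stored values of a column.

```python
import numpy as np
from scipy.sparse import csc_matrix

X = csc_matrix(np.array([[1.0, 0.0], [np.nan, 2.0], [0.0, 3.0]]))

# Stored values, column by column; NaN is an explicitly stored entry
print(X.data)     # [ 1. nan  2.  3.]
print(X.indices)  # row index of each stored value: [0 1 1 2]
print(X.indptr)   # column boundaries into data/indices: [0 2 4]

# A sparse splitter would scan a column's stored values for NaN
col0 = X.data[X.indptr[0]:X.indptr[1]]
print(np.isnan(col0))  # [False  True]
```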

Misc.

FYI https://github.com/scikit-learn/scikit-learn/pull/27966 will introduce native support for missing values in the `ExtraTree*` models (i.e., the random splitter).

One thing I noticed as I went through the PR is that the current codebase still does not support missing values in the sparse splitter. I think this might be pretty easy to add, but should we technically re-open this issue?

Xref: #5870 (comment)

Originally posted by @adam2392 in #5870 (comment)

@github-actions github-actions bot added the Needs Triage Issue requires triage label Jul 22, 2024
@adam2392 adam2392 added module:tree cython help wanted and removed Needs Triage Issue requires triage labels Jul 22, 2024
@Higgs32584
Contributor

This might be related to #27993, although that issue is more about multi-task random forests.

@vitorpohlenz
Contributor

Hi @adam2392 ,

If you could help me, I have some questions about this issue.

  1. Is this issue still open, and is it worth working on? I followed the associated history, and it seems the first conversations about it started 10 years ago.

  2. Do you think this is "easy enough" to be a good "second issue"? My first and only contribution was a few weeks ago. cc @glemaitre here because of your comment on a related issue; I would also like a second opinion on this question.

@adam2392
Member Author

Hi @vitorpohlenz !

Thank you for your interest in contributing!

  1. Yes it is still open and it is worth working on it.
  2. Is it easy? That depends on your experience with Cython and/or C++, as well as your understanding of decision trees. It also depends on how deep you want to dive into the code. The main thing to add is the same tree logic for handling missing values, but for sparse array inputs.

Lmk what you think and if you have a proposed idea of how to proceed.

@vitorpohlenz
Contributor

> Hi @vitorpohlenz !
>
> Thank you for your interest in contributing!
>
>   1. Yes it is still open and it is worth working on it.
>   2. Is it easy? That depends on your experience with Cython and/or C++, as well as your understanding of decision trees. It also depends on how deep you want to dive into the code. The main thing to add is the same tree logic for handling missing values, but for sparse array inputs.
>
> Lmk what you think and if you have a proposed idea of how to proceed.

Thanks for replying @adam2392 !

About [2.]: I have no previous experience with Cython and just a little with C++ (I worked a little with C, but that was 10 years ago). However, I understand the theory of decision trees: I know the CART algorithm and the ways to measure impurity using metrics like entropy, Gini, etc.

So, I would like to ask your opinion on whether it is a good idea for me to work on this issue (based on your expertise and this piece of information), to get more acquainted with the tree-based code in sklearn.

Also, I would like to ask for some guidance/advice, if necessary, so that I do not struggle too much.

@adam2392
Member Author

adam2392 commented May 1, 2025

Hi @vitorpohlenz

> So, I would like to ask your opinion on whether it is a good idea for me to work on this issue (based on your expertise and this piece of information), to get more acquainted with the tree-based code in sklearn.

Cython / C++ knowledge is definitely needed within the scikit-learn dev/reviewer community, so anyone interested is definitely welcome! There will be a learning curve, and it will require you to explore how Cython works and what doesn't. I think it's a net positive for your skillset.

With that being said, yes, you can ask for guidance/advice, but note that my time is a bit limited these days, so this will definitely require a lot of self-learning.

As a first step, I would look at this PR: #27966 and review how Cython works. Then what's needed is to:

  1. understand how sparse arrays work
  2. understand how missing values are split in the sklearn RF algorithm
  3. apply that understanding to the sparse splitter logic
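For step 2, the dense strategy evaluates each candidate threshold twice, sending the missing samples to the left child and then to the right, and keeps the better of the two. Below is a minimal NumPy sketch of that idea (an illustration of the strategy, not sklearn's actual Cython implementation; the function names are hypothetical):

```python
import numpy as np

def gini(y):
    """Gini impurity of an integer label array."""
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split_with_missing(x, y):
    """Find the best threshold on feature x, trying missing-left and
    missing-right for each candidate; returns (impurity, threshold,
    missing_go_left)."""
    missing = np.isnan(x)
    xo, yo, ym = x[~missing], y[~missing], y[missing]
    best = (np.inf, None, None)
    for t in np.unique(xo)[:-1]:
        left, right = yo[xo <= t], yo[xo > t]
        for miss_left in (True, False):
            l = np.concatenate([left, ym]) if miss_left else left
            r = right if miss_left else np.concatenate([right, ym])
            imp = (len(l) * gini(l) + len(r) * gini(r)) / len(y)
            if imp < best[0]:
                best = (imp, t, miss_left)
    return best

x = np.array([np.nan, 1.0, 2.0, 3.0])
y = np.array([1, 0, 0, 1])
print(best_split_with_missing(x, y))  # (0.0, 2.0, False)
```

Here the sample with NaN has label 1, so sending it right (with the other label-1 sample above the threshold 2.0) yields a pure split; the sparse version would apply the same idea while reading column values out of the CSC arrays.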

Feel free to propose/iterate on this plan accordingly!

@vitorpohlenz
Contributor

Hi @adam2392, thanks a lot for replying!

> With that being said, yes, you can ask for guidance/advice, but note that my time is a bit limited these days, so this will definitely require a lot of self-learning.

Sure, I understand that this task will be more self-driven than based on external aid. But this kind of guidance, like you just gave, helps me a lot.

So I will /take it and start with the steps you suggested, and I will post progress updates here until I have a clear enhancement proposal.

@vitorpohlenz
Contributor

Update on the progress/steps that I'm working through:

  • Understanding how sparse arrays work
  • Understanding how missing values are split in the sklearn Random Forest algorithm
  • Defining how to apply the modifications to the sparse splitter logic
