FEA Add missing-value support to sparse splitter in RandomForest and ExtraTrees #29542
Comments
This might be related to #27993, although it is more closely related to multi-task random forests.
Hi @adam2392, I have some questions about this issue; could you help me?
Hi @vitorpohlenz! Thank you for your interest in contributing!
Lmk what you think and if you have a proposed idea of how to proceed.
Thanks for replying @adam2392! Regarding [2.], I have no previous experience with Cython and only a little experience with C++ (I worked a little with C, but that was 10 years ago). However, I understand the theory of decision trees: I know the CART algorithm and the ways to measure impurity using metrics like entropy, Gini, etc. So I would like to ask your opinion, based on your expertise and this information, on whether it is a good idea for me to work on this issue to get more acquainted with the tree-based code in sklearn. Also, may I ask for some guidance/advice, if necessary, so I don't struggle too much?
Cython / C++ knowledge is definitely needed within the scikit-learn dev/reviewer community, so anyone interested is definitely welcome! There will be a learning curve, and it will require you to explore how Cython works and what doesn't work; I think it's a net positive to your skillset. With that said, yes, you can ask for guidance/advice, but note that my time is a bit limited these days, so this will definitely require a lot of self-learning. As a first step, I would look at this PR: #27966, and review how Cython works. Then what's needed is for one to:
Feel free to propose/iterate on this plan accordingly!
Hi @adam2392, thanks a lot for replying!
Sure, I understand that this task will be more self-driven than based on external aid, but guidance like yours helps me a lot. So I will /take it, start with the steps you suggested, and post progress updates here until I have a clear enhancement proposal.
An update on the progress/steps I'm working through:
Summary

While missing-value support for decision trees has been added recently, it only works when `X` is encoded as a dense array. Since `RandomForest*` and `ExtraTrees*` both support sparse `X`, encoding `np.nan` inside a sparse `X` should still work.

Solution
Add missing-value logic to `SparsePartitioner` in `_partitioner.pyx`, and to `BestSparseSplitter` and `RandomSparseSplitter` in `_splitter.pyx`. The logic is the same as in the dense case, but it has to handle the fact that `X` is now in sparse CSC array format.

Misc.

One thing I noticed as I went through the PR is that the current codebase still does not support missing values in the sparse splitter. I think this might be fairly easy to add, but should we technically re-open this issue?
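For context on the CSC representation mentioned above, here is a small, hypothetical scipy sketch (not from the thread): in CSC format only explicitly stored entries live in the `.data` array, so a missing value has to be stored explicitly as `np.nan` rather than left as an implicit zero — which is why the sparse splitter needs dedicated handling.

```python
# Hypothetical illustration of how np.nan is represented in a CSC matrix.
# Only explicitly stored entries appear in `.data`; implicit entries are
# zeros, so missing values must be stored explicitly as np.nan.
import numpy as np
from scipy.sparse import csc_matrix

X_dense = np.array([[1.0, 0.0],
                    [np.nan, 2.0],
                    [0.0, np.nan]])

X_sparse = csc_matrix(X_dense)

# np.nan compares unequal to zero, so both NaNs are stored entries.
n_stored = X_sparse.nnz                       # 4 stored values
n_missing = int(np.isnan(X_sparse.data).sum())  # 2 stored NaNs
```

The round trip `X_sparse.toarray()` preserves the NaNs, so the information a splitter would need is all in `.data`, `.indices`, and `.indptr`.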
Xref: #5870 (comment)
Originally posted by @adam2392 in #5870 (comment)
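The dense-case missing-value logic referenced in the Solution section can be sketched in simplified, hypothetical pure Python (this is illustrative only, not scikit-learn's actual Cython implementation): missing values are grouped together and each candidate split is evaluated twice, once routing the NaNs to the left child and once to the right, keeping whichever routing gives lower impurity.

```python
# Simplified sketch of evaluating a threshold split with missing values,
# routing NaNs to one side or the other and keeping the better option.
import numpy as np

def gini(y):
    """Gini impurity of an integer label array (empty arrays count as pure)."""
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def split_impurity(x, y, threshold, missing_go_left):
    """Weighted child impurity for the split `x <= threshold`,
    sending NaN samples left or right according to `missing_go_left`."""
    missing = np.isnan(x)
    left = (x <= threshold) & ~missing
    right = ~left & ~missing
    if missing_go_left:
        left = left | missing
    else:
        right = right | missing
    n = len(y)
    return (left.sum() * gini(y[left]) + right.sum() * gini(y[right])) / n

x = np.array([0.1, 0.4, np.nan, 0.8, np.nan])
y = np.array([0, 0, 1, 1, 1])

# Evaluate both routings for a candidate threshold; keep the better one.
best = min(split_impurity(x, y, 0.5, True),
           split_impurity(x, y, 0.5, False))
```

In this toy example sending the NaNs right separates the classes perfectly; the sparse version of this logic would additionally have to distinguish explicitly stored NaNs from implicit zeros in the CSC arrays.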