Extending Criterion #10251

Closed
camilstaps opened this issue Dec 4, 2017 · 16 comments · Fixed by #10325
@camilstaps (Contributor)

camilstaps commented Dec 4, 2017

Unless I'm missing something, it's not completely trivial how one can use a custom sklearn.tree._criterion.Criterion for a decision tree. See my use case here.

Things I have tried include:

  • Import the ClassificationCriterion in Python and subclass it. It seems that node_impurity and children_impurity do not get called, the impurity is always 0 (perhaps because they are cdef and not cpdef?). I'm also unsure what the parameters to __new__ / __cinit__ should be (e.g. 1 and np.array([2], dtype='intp') for a binary classification problem?), or how to pass them properly: I have to create the Criterion object from outside the tree to circumvent the check on the criterion argument.

  • Extend ClassificationCriterion in a Cython file. This seems to work, but (a) it requires exporting ClassificationCriterion from _criterion.pxd and (b) it would be nice if it would be documented more extensively what should be done in node_impurity and children_impurity. I will post my code below once it seems to work correctly.
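To make the expectation for node_impurity concrete, here is a conceptual sketch in plain Python of what a Gini-style criterion computes for one node. The function name is mine, not scikit-learn's; the real method is a nogil cdef operating on internal count buffers, so this is only the formula, not the API:

```python
def gini_node_impurity(class_counts):
    """Gini impurity of a single node, given per-class sample counts.

    Conceptual sketch only: the real node_impurity is a cdef method
    working on the criterion's internal buffers; gini_node_impurity
    is an illustrative name, not part of scikit-learn.
    """
    total = float(sum(class_counts))
    if total == 0.0:
        return 0.0  # an empty node is conventionally pure
    # 1 - sum_k p_k^2, where p_k is the class-k proportion
    return 1.0 - sum((c / total) ** 2 for c in class_counts)
```

For a pure node such as counts (10, 0) this yields 0.0, and for a maximally mixed binary node (5, 5) it yields 0.5.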

May I propose one of the following to make this easier?

  • Document what should be done to extend the class in Cython or Python - if Python should be allowed: I am aware of the performance issue with that, but in some cases it may be OK to do this in Python - I don't know.
  • Make it possible to pass a function or other object not extending Criterion to the tree, similar to how it is very easy to implement a custom scorer for validation functions. That would require changing the checks here.
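The scorer-style proposal could amount to a relaxed validation step along these lines. Everything here is hypothetical (`CRITERIA_CLF`, `resolve_criterion`, and `MyCriterion` are illustrative names, not scikit-learn API); the point is only that a string keeps the current lookup while any other object is passed through:

```python
# Hypothetical sketch of the proposed relaxed check: accept either a
# registered criterion name or a user-supplied criterion object.
CRITERIA_CLF = {"gini": "GiniCriterion", "entropy": "EntropyCriterion"}

def resolve_criterion(criterion, registry=CRITERIA_CLF):
    if isinstance(criterion, str):
        return registry[criterion]  # built-in criteria, as today
    return criterion                # custom objects pass through

# A custom object would then reach the tree builder unchanged:
class MyCriterion:
    pass
```

This mirrors how scorers work: a known string selects a built-in, while anything else is assumed to implement the expected interface.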
@camilstaps (Contributor, Author)

As promised, my code to get it working as a Cython extension:

from sklearn.tree._utils cimport log

cimport sklearn.tree._criterion
from sklearn.tree._criterion cimport SIZE_t

cdef class MyGini(sklearn.tree._criterion.ClassificationCriterion):
    # Implementation of Gini as in the real module

This requires the following patch:

diff --git sklearn/tree/_criterion.pxd sklearn/tree/_criterion.pxd
index 229a6bc28..22c728cc3 100644
--- sklearn/tree/_criterion.pxd
+++ sklearn/tree/_criterion.pxd
@@ -65,3 +65,9 @@ cdef class Criterion:
     cdef void node_value(self, double* dest) nogil
     cdef double impurity_improvement(self, double impurity) nogil
     cdef double proxy_impurity_improvement(self) nogil
+
+cdef class ClassificationCriterion(Criterion):
+    """Abstract criterion for classification."""
+
+    cdef SIZE_t* n_classes
+    cdef SIZE_t sum_stride
diff --git sklearn/tree/_criterion.pyx sklearn/tree/_criterion.pyx
index 5187a5066..2073ad091 100644
--- sklearn/tree/_criterion.pyx
+++ sklearn/tree/_criterion.pyx
@@ -212,9 +212,6 @@ cdef class Criterion:
 cdef class ClassificationCriterion(Criterion):
     """Abstract criterion for classification."""
 
-    cdef SIZE_t* n_classes
-    cdef SIZE_t sum_stride
-
     def __cinit__(self, SIZE_t n_outputs,
                   np.ndarray[SIZE_t, ndim=1] n_classes):
         """Initialize attributes for this criterion.

@jnothman (Member)

I think we have been heretofore reluctant to officially declare this public API (thoughts, @glemaitre?), but I think putting ClassificationCriterion in the pxd makes sense even for users who want to extend a not-assured-stable API. PR welcome.

@glemaitre (Member)

I am +1 to move ClassificationCriterion and RegressionCriterion into the pxd. Contrary to my comment in #9947, we can keep the tree code as it is, since we can pass any criterion (I missed this at first).

I also agree that we should keep the API private to avoid any maintenance burden or constraint linked to this code. That might make it difficult to document how to make your own criterion, though, wouldn't it?

@glemaitre (Member)

Document what should be done to extend the class in Cython or Python - if Python should be allowed: I am aware of the performance issue with that, but in some cases it may be OK to do this in Python - I don't know.
Make it possible to pass a function or other object not extending Criterion to the tree, similar to how it is very easy to implement a custom scorer for validation functions. That would require changing the checks here.

I personally think that only Cython classes should be supported. I also think that accepting a Criterion class is the better option.

@jnothman (Member)

We can advertise how to DIY, but with a warning that this interface is not considered public, stable API and should be used at one's own risk.

@camilstaps, PR welcome to fix things up.

@camilstaps (Contributor, Author)

Yes, thanks. Since this is my first contribution, I need some time to read through the guidelines, but I will make a PR soonish!

@jkingsbery

@camilstaps Do you have any documentation for how to use this feature?

@camilstaps (Contributor, Author)

@jkingsbery nice to see you here, I hope you're doing fine.

Unfortunately, I never actually used this and I deleted the tests I had. What I did was start from an existing class, like Gini, and then reconstruct from there how the two methods node_impurity and children_impurity should be implemented. If you have a specific use case, the method can be a bit simpler. I remember that in my case n_outputs was always 1, so the for loops weren't necessary. The docblocks for the methods also give some hints, but other than that I'm afraid I cannot help, sorry!

@jkingsbery

OK, thanks for the pointers!

@simonprovost

simonprovost commented Apr 1, 2023

The progress on #10251 seems to have stalled, despite its promising start. It would be great if we could customise the sklearn Cython tree implementations a little more easily, such as the splitting procedure (not the criterion discussed here), but it is always difficult even to get a printf to work. Either the developer docs should state firmly not to attempt this, or work should begin on making it a little easier, I reckon.

The idea is that, for instance, neural networks are mathematically proven to be powerful, but tree learning algorithms and similar methods are powerful too: perhaps less precise, given the expressiveness that NNs come with, yet far more understandable, and appreciated in many industries (e.g. medicine), which proves their worth to the research community.

Given the lack of transparency these estimators provide (on the Cython side only, not the Python side, obviously), I am concerned that such techniques will be lost or gradually neglected by the community. Tree-based algorithms are simple to understand, straightforward to put into production, and their results are nowadays worthwhile; however, if we (researchers, developers) cannot dig deeper to investigate enhancements, now that so much has been accomplished, how will these estimators remain connected to the community?

Due to other Ph.D. priorities, I do not have the opportunity right now to experiment with complex improvements, but my goal for the past three days was to reproduce a great concept for a new decision tree that simply updates the best-splitter rule set; even this has been a struggle.

I continue to have faith in all of this, and I hope that over time researchers and developers will be able to extend the variety of sklearn estimators in a straightforward manner.

See #26031 for a summary of what has been expressed above. I would also argue that a guideline pointing out a starting point would be welcome, e.g. a quick tutorial on how to edit the splitter file for anyone who wants to build a variant of one of the decision trees in the scikit-learn community (shoutout to #25306 for the idea of a guideline to help newcomers get started).

Cheers,

@glemaitre (Member)

Plus, I would argue that a guideline pointing out a starting point would be grateful, e.g have a fast tutorial/guideline on how to edit the splitter file

This is not our priority. The discussion we had on this topic was about making extension possible by exposing some base classes. However, we do not intend to provide any support for it. We prefer to dedicate resources to implementing features such as handling missing values (i.e. #23595) or categorical data natively.

On this topic, @adam2392 has been working on making the implementation extensible: #24577 (and all the other PRs). I assume that once we have real support for both missing values and categorical data, it will be possible to open up the Cython API.

@adam2392 (Member)

adam2392 commented Apr 3, 2023

Hi @simonprovost,

I am currently working with a group on extending the trees in scikit-learn. As @glemaitre mentioned, this is unfortunately not possible in sklearn as of now due to resource constraints and the clashing complexity of adding new fundamental features (i.e. missing/categorical support). Fingers crossed it will be in the future, as he mentioned, after support for missing/categorical data lands.

Since the PRs are on hold for now, we (me and a group at JHU) are instead vendoring a light-weight scikit-learn fork that extends the Cython and Python API for the tree submodule: https://github.com/neurodata/scikit-learn. Installation right now still requires some involvement... but it operates as a stand-in for scikit-learn:main while providing a nice overridable API for some of the more complex tree models.

This all makes it possible for our third-party package to support complex tree models by extending the criterion, splitter, or tree classes: https://github.com/neurodata/scikit-tree. The code in that package explicitly demonstrates this for a number of different tree models. If this is of interest to you, I am happy to discuss further.

@simonprovost

Hi @adam2392 and @glemaitre

First and foremost, many thanks for the helpful information you both supplied; it is now clear and I will spread the word. Furthermore, @adam2392, thanks for pointing out that you and your team at JHU are attempting to address some of the concerns I raised. Your scikit-learn fork and scikit-tree appear to point in the right direction for our research :)

However, may I ask for quick guidance on how your third-party package allows the splitter to be changed? I skim-read the code and did not gather much information on how to do this. I will continue to explore both of the links provided in your answer; I just thought it best to reply first and ask the question at the same time.

I can provide additional details about how we would like to change the splitter, which is rather straightforward. I also regret not discovering scikit-tree earlier, as it is exciting. In any case, if I am able to address our research problems with your package, I will also resolve this ticket: #26031, which might be helpful for others.

I cannot wait to cite the scikit-learn and scikit-tree repositories!

Cheers,

@simonprovost

@glemaitre has explained why this is currently unavailable in the official sklearn library here. Yet, as @adam2392 demonstrated here, their scikit-tree package may serve as an excellent entry point to the ideas requested above. The discussion is moving to #26031 to avoid confusion with this original issue #10251.

@lorentzenchr (Member)

Out of curiosity, what kind of criteria do you have in mind that you want to try out?

@adam2392 (Member)

adam2392 commented May 8, 2023

Out of curiosity, what kind of criteria do you have in mind that you want to try out?

Chiming in here: different criteria that are on our list, either already implemented or that we eventually want to add:

  • unsupervised criterion (e.g. BIC and Means diff between splits)
  • survival criterion for survival trees
  • causal criterion (e.g. the Linear Moment Criterion inside EconML)
  • AUC criterion
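For intuition, the "means diff" idea on that list can be sketched as a toy unsupervised split score: on sorted one-dimensional data, pick the split maximizing the difference between the two children's means. This is only an illustration under my own naming, not scikit-tree's implementation:

```python
def best_means_diff_split(values):
    """Toy 'means diff' unsupervised criterion sketch: on sorted
    1-D data, return (split_index, score) maximizing
    |mean(left) - mean(right)|. Illustrative only; not the
    scikit-tree implementation.
    """
    xs = sorted(float(v) for v in values)
    best_i, best_score = None, float("-inf")
    for i in range(1, len(xs)):  # every non-empty left/right split
        left, right = xs[:i], xs[i:]
        score = abs(sum(left) / len(left) - sum(right) / len(right))
        if score > best_score:
            best_i, best_score = i, score
    return best_i, best_score
```

On clearly bimodal data such as three 0s and three 10s, this picks the boundary between the two clusters with a score of 10.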
