Extending Criterion #10251
As promised, my code to get it working as a Cython extension:

```cython
from sklearn.tree._utils cimport log
cimport sklearn.tree._criterion
from sklearn.tree._criterion cimport SIZE_t

cdef class MyGini(sklearn.tree._criterion.ClassificationCriterion):
    # Implementation of Gini as in the real module
```

This requires the following patch:

```diff
diff --git sklearn/tree/_criterion.pxd sklearn/tree/_criterion.pxd
index 229a6bc28..22c728cc3 100644
--- sklearn/tree/_criterion.pxd
+++ sklearn/tree/_criterion.pxd
@@ -65,3 +65,9 @@ cdef class Criterion:
     cdef void node_value(self, double* dest) nogil
     cdef double impurity_improvement(self, double impurity) nogil
     cdef double proxy_impurity_improvement(self) nogil
+
+cdef class ClassificationCriterion(Criterion):
+    """Abstract criterion for classification."""
+
+    cdef SIZE_t* n_classes
+    cdef SIZE_t sum_stride
diff --git sklearn/tree/_criterion.pyx sklearn/tree/_criterion.pyx
index 5187a5066..2073ad091 100644
--- sklearn/tree/_criterion.pyx
+++ sklearn/tree/_criterion.pyx
@@ -212,9 +212,6 @@ cdef class Criterion:
 cdef class ClassificationCriterion(Criterion):
     """Abstract criterion for classification."""

-    cdef SIZE_t* n_classes
-    cdef SIZE_t sum_stride
-
     def __cinit__(self, SIZE_t n_outputs,
                   np.ndarray[SIZE_t, ndim=1] n_classes):
         """Initialize attributes for this criterion.
```
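As a plain-Python reference for what the `node_impurity` method of a Gini criterion has to compute (this is only an illustrative sketch of the formula, 1 − Σ p_k², not the actual Cython implementation, and the function name is made up):

```python
# Illustrative plain-Python version of the Gini node impurity.
# The real sklearn code works on weighted per-class sums stored in
# internal buffers; this sketch just shows the formula.

def gini_node_impurity(class_counts):
    """Gini impurity of a node given per-class sample counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

# A pure node has impurity 0; a balanced binary node has impurity 0.5.
print(gini_node_impurity([10, 0]))  # 0.0
print(gini_node_impurity([5, 5]))   # 0.5
```

`children_impurity` applies the same formula separately to the samples on each side of the candidate split.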
I think we have been heretofore reluctant to officially declare this public API (thoughts, @glemaitre?), but I think putting …
I am +1 to move it. I also agree that we should keep the API private to avoid any maintenance burden or constraint linked to this code. That said, it might then be difficult to document how to make your own criterion, wouldn't it?
I personally think that only a Cython class should be implemented. I would also think that accepting …
We can advertise how to DIY, but with a warning that this interface is not considered public, stable API and should be used at one's own risk. @camilstaps, PR welcome to fix things up.

Yes, thanks. Since this is my first contribution, I need some time to read through the guidelines, but I will make a PR soonish!

@camilstaps Do you have any documentation for how to use this feature?

@jkingsbery nice to see you here, I hope you're doing fine. Unfortunately, I never actually used this and I deleted the tests I had. What I did was start from an existing class, like …

OK, thanks for the pointers!
All of a sudden, the progression of #10251 was halted, despite the fact that it had begun with such a great aim. It would be fantastic if we were able to customise the sklearn Cython tree implementations a little better, such as the splitting procedure (not the criterion as seen here), but it is difficult even to get a printf to work. Either it must be firmly stated in the developers' section not to even attempt it, or development must commence to make this a little easier, I reckon.

The idea is that, for instance, neural networks are mathematically proven to be powerful, but tree learning algorithms and other types are powerful too: perhaps less precise, yet far more understandable, and appreciated in many industries (e.g. medicine), thus proving their worthiness to the research community. Due to the lack of transparency that these estimators provide (on the Cython side only, not the Python side obviously), I am concerned that it will result in the loss of such techniques or their gradual neglect in the community. Such algorithms (e.g. tree-based) are simple to understand and straightforward to implement in production, and their results are nowadays worthwhile; however, if we (researchers, developers) are unable to delve deeper into the investigation for enhancement, now that a great deal has been accomplished, how will these estimators remain connected to the community?

Due to other Ph.D. priorities, I do not have the opportunity now to experiment with complex improvements, but my goal for the past three days was to reproduce a great concept for a new decision tree which simply conceptualises an update of the best-splitter rule set; even this has been such a difficulty. I continue to have faith in all of this, and I expect that over time, researchers and developers will be able to enhance the variety of sklearn estimators in a straightforward manner.
Ref to #26031 for a shoutout of what has been expressed above. Plus, I would argue that a guideline pointing out a starting point would be appreciated, e.g. a quick tutorial/guideline on how to edit the splitter file if we want to change something for a variant of one of the greatest decision trees in the scikit-learn community right now (shoutout to #25306 for the idea of a guideline to help newcomers get started). Cheers,
This is not our priority. The discussion that we had about this topic was to make it possible to extend by exposing some base classes. However, we do not intend to provide any support. We prefer to dedicate resources to implementing features such as handling missing values (i.e. #23595) or categorical data natively. On this topic, @adam2392 has worked so that the implementation can be extended: #24577 (and all other PRs). I assume that once we have real support for both missing and categorical data, it would then be possible to open the Cython API.
Hi @simonprovost, I am currently working with a group on extending the trees in scikit-learn. As @glemaitre mentioned, this is unfortunately not possible in sklearn as of now due to resource constraints and the clashing complexity of adding new fundamental features (i.e. missing/categorical). Fingers crossed it will be possible in the future, as he mentioned, after the support of missing/categorical data. Since the PRs are on hold for now, we (me and a group at JHU) are instead vendoring a light-weight scikit-learn fork that extends the Cython and Python API for the tree submodule: https://github.com/neurodata/scikit-learn. Installation right now still requires some involvement... but it operates as a stand-in for scikit-learn:main, while providing a nice overridable API for some of the more complex tree models. This all makes it possible for our 3rd-party package to support complex tree models by extending the criterion, splitter, or tree classes: https://github.com/neurodata/scikit-tree. The code in that package explicitly demonstrates this for a number of different tree models. If this is of interest to you, happy to discuss further.
Hi @adam2392 and @glemaitre, first and foremost, many thanks for the helpful information you both supplied; it is now clear and I will spread the word. Furthermore, @adam2392, thanks for pointing out that you and your team at JHU are attempting to address some of the concerns I raised. Your scikit-learn fork and scikit-tree appear to point in the right direction for our research :) However, may I get some quick guidance on how your 3rd-party package allows the splitter to be changed? I skimmed through it and did not gather much information on how to do so. Note that I will continue to explore both of the links provided in your answer; I just thought it best to reply first and take the opportunity to ask. I can provide additional details regarding our objective of how we'd like to change the splitter, which is rather straightforward. Additionally, I regret not having discovered scikit-tree earlier, as it is exciting; in any case, if I am able to address our research problems with your package, we will also resolve this ticket: #26031, which might be helpful for others. Cannot wait to cite the scikit-learn and scikit-tree repositories! Cheers,
@glemaitre has explained why this is currently unavailable in the official sklearn library here. Yet, as @adam2392 demonstrated here, their scikit-tree package may serve as an excellent entry point into the aforementioned ideas. The current discussion is being moved to #26031 to avoid confusion with this original #10251.
Out of curiosity, what kind of criteria do you have in mind that you want to try out? |
Chiming in here: different criteria that are on our list that are implemented, or that we want to eventually add: …
Unless I'm missing something, it's not completely trivial how one can use a custom `sklearn.tree._criterion.Criterion` for a decision tree. See my use case here. Things I have tried include:

- Import the `ClassificationCriterion` in Python and subclass it. It seems that `node_impurity` and `children_impurity` do not get called; the impurity is always 0 (perhaps because they are `cdef` and not `cpdef`?). I'm also unsure what the parameters to `__new__`/`__cinit__` should be (e.g. `1` and `np.array([2], dtype='intp')` for a binary classification problem?), or how to pass them properly: I have to create the `Criterion` object from outside the tree to circumvent the check on the `criterion` argument.
- Extend `ClassificationCriterion` in a Cython file. This seems to work, but (a) it requires exporting `ClassificationCriterion` from `_criterion.pxd`, and (b) it would be nice if it were documented more extensively what should be done in `node_impurity` and `children_impurity`. I will post my code below once it seems to work correctly.

May I propose one of the following to make this easier?

- Allow passing a `Criterion` to the tree, similar to how it is very easy to implement a custom scorer for validation functions. That would require changing the checks here.
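To make concrete what a custom criterion plugs into, the impurity improvement that drives split selection can be sketched in plain Python. This follows the formula stated in the docstring of `Criterion.impurity_improvement` in sklearn's `_criterion.pyx`; the function and argument names below are illustrative, not part of the Cython API:

```python
# Illustrative sketch of the weighted impurity improvement of a split:
#   (N_t / N) * (impurity - N_t_R/N_t * right_imp - N_t_L/N_t * left_imp)
# where N is the total sample count, N_t the samples at the node, and
# N_t_L / N_t_R the samples in the left / right child.

def impurity_improvement(n_total, n_node, n_left, n_right,
                         node_impurity, left_impurity, right_impurity):
    """Decrease in impurity from a split, weighted by node size."""
    return (n_node / n_total) * (
        node_impurity
        - (n_right / n_node) * right_impurity
        - (n_left / n_node) * left_impurity
    )

# A perfect split of a balanced binary root (Gini 0.5) into two pure
# children yields an improvement of 0.5.
print(impurity_improvement(100, 100, 50, 50, 0.5, 0.0, 0.0))  # 0.5
```

So a subclass only has to fill in `node_impurity` and `children_impurity` consistently; the improvement and the choice of the best split fall out of this weighting.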