MAINT use `_validate_params` in `DecisionTreeClassifier` and `DecisionTreeRegressor` #23499
Conversation
Thanks for the PR @moritzwilksch. Here are some remarks. You have linting issues that you can fix by installing and running `black`. There are also some failing checks in `test_gbdt_parameter_checks` because they test against the previous validation pattern. You can remove the checks corresponding to the parameters whose validation is now delegated to `DecisionTreeClassifier`/`Regressor`.
sklearn/tree/_classes.py (outdated)

```python
"max_depth": [Interval(Integral, 1, None, closed="left"), None],
"min_samples_split": [
    Interval(Integral, 2, None, closed="left"),
    Interval(Real, 0.0, 1.0, closed="neither"),
```
The valid interval for a float is (0, 1]:

```diff
-    Interval(Real, 0.0, 1.0, closed="neither"),
+    Interval(Real, 0.0, 1.0, closed="right"),
```
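For illustration, the open/closed semantics of such an interval constraint can be sketched with a small stand-in (`in_interval` is a hypothetical helper, not scikit-learn's implementation):

```python
def in_interval(x, low, high, closed):
    """Membership test for an interval with configurable closedness.

    closed is one of "left", "right", "both", "neither".
    """
    left_ok = x >= low if closed in ("left", "both") else x > low
    right_ok = x <= high if closed in ("right", "both") else x < high
    return left_ok and right_ok

# With closed="neither", min_samples_split=1.0 (i.e. 100% of the samples)
# would be wrongly rejected; closed="right" accepts it:
print(in_interval(1.0, 0.0, 1.0, "neither"))  # False
print(in_interval(1.0, 0.0, 1.0, "right"))    # True
```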
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hang on, after changing this the tests fail: `pytest -vl sklearn/tests/test_common.py -k check_param_validation` fails with `AssertionError: DecisionTreeRegressor does not raise an informative error message when the parameter min_samples_split does not have a valid type or value.`
Maybe this is also because of the int/float overlap? Although neither 1.0 (as in 100%) nor 1 (as in one sample) would be a valid value for min_samples_split, right?
There's a bug in the common test. Let me fix that in a separate PR.
PR opened: #23513. Let's wait for it to be merged.
Ok!
sklearn/tree/_classes.py (outdated)

```diff
@@ -905,6 +852,12 @@ class DecisionTreeClassifier(ClassifierMixin, BaseDecisionTree):
         0.93..., 0.93..., 1. , 0.93..., 1. ])
         """

+    _parameter_constraints = {
+        **BaseDecisionTree._parameter_constraints,
+        "criterion": [StrOptions({"gini", "entropy", "log_loss"})],
```
Looks like an instance of `Criterion` is also valid, but in a kind of undocumented way, maybe for internal purposes (it seems not even tested). Should we expose it in the error message, or instead make it possible to hide some valid options from the error message, like it's proposed for the string options in #23459? @thomasjpfan @ogrisel
Historically, I think `Criterion` instances were unofficially supported for users that want to provide their own `Criterion` objects with the same interface (from Cython).
If we want to maintain backward compatibility with this unofficially supported feature, then we need to allow for `Criterion` subclasses, i.e. check `isinstance(self.criterion, Criterion)`. Can the validation framework allow for such a constraint?
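One way such a type constraint could work, sketched here with hypothetical names (`satisfies` and the `Criterion` stand-in are illustrative, not the actual framework code): listing a class among the constraints falls back to an `isinstance` check, which naturally accepts subclasses.

```python
class Criterion:
    """Stand-in for sklearn.tree._criterion.Criterion (hypothetical here)."""

def satisfies(value, constraint):
    # A constraint is either a set of allowed strings or a type to
    # isinstance-check; the latter covers user-defined subclasses.
    if isinstance(constraint, set):
        return value in constraint
    return isinstance(value, constraint)

constraints = [{"gini", "entropy", "log_loss"}, Criterion]

class MyCriterion(Criterion):
    """A user-defined criterion, as the unofficial extension point allows."""

assert any(satisfies("gini", c) for c in constraints)
assert any(satisfies(MyCriterion(), c) for c in constraints)  # subclass accepted
assert not any(satisfies(0.5, c) for c in constraints)        # rejected
```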
I also think that we need to keep backward compat. For the constraint it's easy, we just need to add `Criterion` as a possibility: `"criterion": [StrOptions({"gini", "entropy", "log_loss"}), Criterion]`.
My concern was that it would then appear in the error message (`criterion must be a str among {"gini", "entropy", "log_loss"} or a Criterion instance`), which makes it kind of official, and I'm not sure we want to make it official. So do you think it's fine to expose it in the error message, or would it be better to hide it like we did for the "deprecated" string?
Ah I think it's better to hide it. Which means there needs to be another "internal" mechanism for types? 😅
Yes 😄
I have an idea that should make it simple. I'll submit a PR soon
@thomasjpfan I opened #23558 to make it possible to mark a constraint as internal.
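The mechanism could look roughly like this hypothetical `Hidden` wrapper (illustrative only; the actual design in #23558 may differ): a hidden constraint is still checked during validation but skipped when composing the error message.

```python
class Hidden:
    """Wrap a constraint so it is still validated but omitted from errors."""

    def __init__(self, constraint):
        self.constraint = constraint

def error_message(param, constraints):
    # Only non-hidden constraints are mentioned to the user.
    visible = [c for c in constraints if not isinstance(c, Hidden)]
    options = " or ".join(repr(c) for c in visible)
    return f"The {param!r} parameter must be {options}."

constraints = [{"gini", "entropy", "log_loss"}, Hidden(object)]
msg = error_message("criterion", constraints)
print(msg)  # mentions the string options only, not the hidden type
```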
fix float interval Co-authored-by: Jérémie du Boisberranger <[email protected]>
add mse & mae as deprecated Co-authored-by: Jérémie du Boisberranger <[email protected]>
You can now remove the tests from the previous param validation here.
The linting issue is due to an unused import (https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=42700&view=logs&jobId=dde5042c-7464-5d47-9507-31bdd2ee0a3a&j=32e2e1bb-a28f-5b18-6cfc-3f01273f5609&t=8a54543f-0728-5134-6642-bedd98e03dd0)
Co-authored-by: Jérémie du Boisberranger <[email protected]>
@moritzwilksch I directly pushed some changes to take into account recent improvements in the validation mechanism.
Great, thank you! Some tests failed the other day; I'll check later whether this resolves them.
Actually ExtraTreeClassifier/Regressor have the same constraints as their DecisionTree counterparts and inherit them.
LGTM
I pushed a small commit to remove some redundant tests for checking the value in
Merging since the CIs are green. Thanks @moritzwilksch.
…nTreeRegressor` (scikit-learn#23499) Co-authored-by: Jérémie du Boisberranger <[email protected]> Co-authored-by: jeremiedbb <[email protected]> Co-authored-by: Guillaume Lemaitre <[email protected]>
Reference Issues/PRs
towards #23462
What does this implement/fix? Explain your changes.
This makes `BaseDecisionTree` as well as `DecisionTreeClassifier` and `DecisionTreeRegressor` define `_parameter_constraints` and call `_validate_params()` in `.fit()`.
Any other comments?
First contribution, so happy to hear your suggestions!
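As a rough, simplified sketch of the overall pattern (not scikit-learn's actual implementation; all class and method names below are illustrative): each estimator declares its constraints in `_parameter_constraints` and validates them at the start of `fit`, so invalid parameters fail fast with a uniform error message.

```python
from numbers import Integral

class BaseEstimatorSketch:
    """Minimal stand-in for the validation pattern; not sklearn's real code."""

    _parameter_constraints: dict = {}

    def _validate_params(self):
        # Check every declared parameter against its list of constraints.
        for name, constraints in self._parameter_constraints.items():
            value = getattr(self, name)
            if not any(self._satisfies(value, c) for c in constraints):
                raise ValueError(
                    f"The {name!r} parameter must satisfy one of {constraints}, "
                    f"got {value!r} instead."
                )

    @staticmethod
    def _satisfies(value, constraint):
        # Two constraint kinds in this sketch: a set of allowed strings,
        # or a type used for an isinstance check.
        if isinstance(constraint, (set, frozenset)):
            return isinstance(value, str) and value in constraint
        return isinstance(value, constraint)

class TreeSketch(BaseEstimatorSketch):
    _parameter_constraints = {
        **BaseEstimatorSketch._parameter_constraints,
        "criterion": [{"gini", "entropy", "log_loss"}],
        "max_depth": [Integral, type(None)],
    }

    def __init__(self, criterion="gini", max_depth=None):
        self.criterion = criterion
        self.max_depth = max_depth

    def fit(self, X, y):
        self._validate_params()  # fail fast on invalid parameters
        return self

TreeSketch(criterion="gini", max_depth=3).fit([[0]], [0])  # passes validation
```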