MAINT use `_validate_params` in `DecisionTreeClassifier` and `DecisionTreeRegressor` #23499
Conversation
Thanks for the PR @moritzwilksch. Here are some remarks. You have linting issues that you can fix by installing and running `black`. There are also some failing checks in `test_gbdt_parameter_checks` because they test against the previous validation pattern. You can remove the checks corresponding to the parameters whose validation is now delegated to `DecisionTreeClassifier`/`Regressor`.
sklearn/tree/_classes.py (outdated)

```python
"max_depth": [Interval(Integral, 1, None, closed="left"), None],
"min_samples_split": [
    Interval(Integral, 2, None, closed="left"),
    Interval(Real, 0.0, 1.0, closed="neither"),
```
The valid interval for a float is (0, 1]:

```diff
-    Interval(Real, 0.0, 1.0, closed="neither"),
+    Interval(Real, 0.0, 1.0, closed="right"),
```
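For illustration, the open/closed semantics of such an interval constraint can be sketched with a small stand-in (`in_interval` is a hypothetical helper, not scikit-learn's implementation):

```python
def in_interval(x, low, high, closed):
    """Membership test for an interval with configurable closedness.

    closed is one of "left", "right", "both", "neither".
    """
    left_ok = x >= low if closed in ("left", "both") else x > low
    right_ok = x <= high if closed in ("right", "both") else x < high
    return left_ok and right_ok

# With closed="neither", min_samples_split=1.0 (i.e. 100% of the samples)
# would be wrongly rejected; closed="right" accepts it:
print(in_interval(1.0, 0.0, 1.0, "neither"))  # False
print(in_interval(1.0, 0.0, 1.0, "right"))    # True
```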
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hang on, after changing this the tests fail: `pytest -vl sklearn/tests/test_common.py -k check_param_validation` fails with `AssertionError: DecisionTreeRegressor does not raise an informative error message when the parameter min_samples_split does not have a valid type or value.`
Maybe this is also because of the int/float overlap? Although neither 1.0 (as in 100%) nor 1 (as in one sample) would be a valid value for min_samples_split, right?
There's a bug in the common test. Let me fix that in a separate PR.
PR opened: #23513. Let's wait for it to be merged.
Ok!
sklearn/tree/_classes.py (outdated)

```diff
@@ -905,6 +852,12 @@ class DecisionTreeClassifier(ClassifierMixin, BaseDecisionTree):
         0.93..., 0.93..., 1. , 0.93..., 1. ])
         """

+    _parameter_constraints = {
+        **BaseDecisionTree._parameter_constraints,
+        "criterion": [StrOptions({"gini", "entropy", "log_loss"})],
```
Looks like an instance of `Criterion` is also valid, but in a kind of undocumented way, maybe for internal purposes (it seems not even tested). Should we expose it in the error message, or instead make it possible to hide some valid options from the error message, like it's proposed for the string options in #23459? @thomasjpfan @ogrisel
Historically, I think `Criterion` instances were unofficially supported for users that want to provide their own `Criterion` objects with the same interface (from Cython).
If we want to maintain backward compatibility with this unofficially supported feature, then we need to allow for `Criterion` subclasses, i.e. check `isinstance(self.criterion, Criterion)`. Can the validation framework allow for such a constraint?
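One way such a type constraint could work, sketched here with hypothetical names (`satisfies` and the `Criterion` stand-in are illustrative, not the actual framework code): listing a class among the constraints falls back to an `isinstance` check, which naturally accepts subclasses.

```python
class Criterion:
    """Stand-in for sklearn.tree._criterion.Criterion (hypothetical here)."""

def satisfies(value, constraint):
    # A constraint is either a set of allowed strings or a type to
    # isinstance-check; the latter covers user-defined subclasses.
    if isinstance(constraint, set):
        return value in constraint
    return isinstance(value, constraint)

constraints = [{"gini", "entropy", "log_loss"}, Criterion]

class MyCriterion(Criterion):
    """A user-defined criterion, as the unofficial extension point allows."""

assert any(satisfies("gini", c) for c in constraints)
assert any(satisfies(MyCriterion(), c) for c in constraints)  # subclass accepted
assert not any(satisfies(0.5, c) for c in constraints)        # rejected
```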
I also think that we need to keep backward compat. For the constraint it's easy, we just need to add `Criterion` as a possibility: `"criterion": [StrOptions({"gini", "entropy", "log_loss"}), Criterion]`.
My concern was that it would then appear in the error message (`criterion must be a str among {"gini", "entropy", "log_loss"} or a Criterion instance`), which makes it kind of official, and I'm not sure we want to make it official. So do you think it's fine to expose it in the error message, or would it be better to hide it like we did for the "deprecated" string?
Ah I think it's better to hide it. Which means there needs to be another "internal" mechanism for types? 😅
Yes 😄
I have an idea that should make it simple. I'll submit a PR soon
@thomasjpfan I opened #23558 to make it possible to mark a constraint as internal.
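The mechanism could look roughly like this hypothetical `Hidden` wrapper (illustrative only; the actual design in #23558 may differ): a hidden constraint is still checked during validation but skipped when composing the error message.

```python
class Hidden:
    """Wrap a constraint so it is still validated but omitted from errors."""

    def __init__(self, constraint):
        self.constraint = constraint

def error_message(param, constraints):
    # Only non-hidden constraints are mentioned to the user.
    visible = [c for c in constraints if not isinstance(c, Hidden)]
    options = " or ".join(repr(c) for c in visible)
    return f"The {param!r} parameter must be {options}."

constraints = [{"gini", "entropy", "log_loss"}, Hidden(object)]
msg = error_message("criterion", constraints)
print(msg)  # mentions the string options only, not the hidden type
```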
fix float interval Co-authored-by: Jérémie du Boisberranger <[email protected]>
add mse & mae as deprecated Co-authored-by: Jérémie du Boisberranger <[email protected]>
You can now remove the tests from the previous param validation here.
The linting issue is due to an unused import (https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=42700&view=logs&jobId=dde5042c-7464-5d47-9507-31bdd2ee0a3a&j=32e2e1bb-a28f-5b18-6cfc-3f01273f5609&t=8a54543f-0728-5134-6642-bedd98e03dd0)
Co-authored-by: Jérémie du Boisberranger <[email protected]>
@moritzwilksch I directly pushed some changes to take into account recent improvements in the validation mechanism.
Great, thank you! Some tests failed the other day; I'll check later whether this resolves them.
Actually ExtraTreeClassifier/Regressor have the same constraints as their DecisionTree counterparts and inherit them.
LGTM
I pushed a small commit to remove some redundant tests for checking the value in
Merging since the CIs are green. Thanks @moritzwilksch.
…nTreeRegressor` (scikit-learn#23499) Co-authored-by: Jérémie du Boisberranger <[email protected]> Co-authored-by: jeremiedbb <[email protected]> Co-authored-by: Guillaume Lemaitre <[email protected]>
Reference Issues/PRs
towards #23462
What does this implement/fix? Explain your changes.
This makes `BaseDecisionTree` as well as `DecisionTreeClassifier` and `DecisionTreeRegressor` define `_parameter_constraints` and call `_validate_params()` in `.fit()`.
Any other comments?
First contribution, so happy to hear your suggestions!
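As a rough, simplified sketch of the overall pattern (not scikit-learn's actual implementation; all class and method names below are illustrative): each estimator declares its constraints in `_parameter_constraints` and validates them at the start of `fit`, so invalid parameters fail fast with a uniform error message.

```python
from numbers import Integral

class BaseEstimatorSketch:
    """Minimal stand-in for the validation pattern; not sklearn's real code."""

    _parameter_constraints: dict = {}

    def _validate_params(self):
        # Check every declared parameter against its list of constraints.
        for name, constraints in self._parameter_constraints.items():
            value = getattr(self, name)
            if not any(self._satisfies(value, c) for c in constraints):
                raise ValueError(
                    f"The {name!r} parameter must satisfy one of {constraints}, "
                    f"got {value!r} instead."
                )

    @staticmethod
    def _satisfies(value, constraint):
        # Two constraint kinds in this sketch: a set of allowed strings,
        # or a type used for an isinstance check.
        if isinstance(constraint, (set, frozenset)):
            return isinstance(value, str) and value in constraint
        return isinstance(value, constraint)

class TreeSketch(BaseEstimatorSketch):
    _parameter_constraints = {
        **BaseEstimatorSketch._parameter_constraints,
        "criterion": [{"gini", "entropy", "log_loss"}],
        "max_depth": [Integral, type(None)],
    }

    def __init__(self, criterion="gini", max_depth=None):
        self.criterion = criterion
        self.max_depth = max_depth

    def fit(self, X, y):
        self._validate_params()  # fail fast on invalid parameters
        return self

TreeSketch(criterion="gini", max_depth=3).fit([[0]], [0])  # passes validation
```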