
MAINT use _validate_params in DecisionTreeClassifier and DecisionTreeRegressor #23499


Merged
merged 15 commits into scikit-learn:main on Jun 24, 2022

Conversation

moritzwilksch (Contributor)

Reference Issues/PRs

towards #23462

What does this implement/fix? Explain your changes.

This makes BaseDecisionTree, as well as DecisionTreeClassifier and DecisionTreeRegressor, define _parameter_constraints and call _validate_params() in .fit().
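Roughly, the pattern looks like this (a minimal sketch of the mechanism on a toy estimator, not the actual diff; the real constraints live in sklearn/tree/_classes.py):

from numbers import Integral

from sklearn.base import BaseEstimator
from sklearn.utils._param_validation import Interval, StrOptions

class SketchTree(BaseEstimator):
    # Declarative description of what each constructor parameter may be.
    _parameter_constraints = {
        "criterion": [StrOptions({"gini", "entropy", "log_loss"})],
        "max_depth": [Interval(Integral, 1, None, closed="left"), None],
    }

    def __init__(self, criterion="gini", max_depth=None):
        self.criterion = criterion
        self.max_depth = max_depth

    def fit(self, X, y):
        # Raises an informative error if any parameter violates its constraint;
        # the sketch stops here instead of actually fitting anything.
        self._validate_params()
        return self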

Any other comments?

First contribution, so happy to hear your suggestions!

@jeremiedbb (Member) left a comment

Thanks for the PR @moritzwilksch. Here are some remarks. You have linting issues that you can fix by installing and running black.

There are also some failing checks in test_gbdt_parameter_checks because they test against the previous validation pattern. You can remove the checks corresponding to the parameters whose validation is delegated to DecisionTreeClassifier/Regressor.

"max_depth": [Interval(Integral, 1, None, closed="left"), None],
"min_samples_split": [
Interval(Integral, 2, None, closed="left"),
Interval(Real, 0.0, 1.0, closed="neither"),
Member

The valid interval for a float is (0, 1].

Suggested change:
-    Interval(Real, 0.0, 1.0, closed="neither"),
+    Interval(Real, 0.0, 1.0, closed="right"),

Contributor Author

Hang on, after changing this the tests fail:
pytest -vl sklearn/tests/test_common.py -k check_param_validation

fails with AssertionError: DecisionTreeRegressor does not raise an informative error message when the parameter min_samples_split does not have a valid type or value.
Maybe this is also because of the int/float overlap? Although neither 1.0 (as in 100% of samples) nor 1 (as in one sample) would be a valid value for min_samples_split, right?

Member

There's a bug in the common test. Let me fix that in a separate PR

Member

PR opened #23513. Let's wait for it to be merged

Contributor Author

Ok!

@@ -905,6 +852,12 @@ class DecisionTreeClassifier(ClassifierMixin, BaseDecisionTree):
    0.93..., 0.93..., 1. , 0.93..., 1. ])
    """

    _parameter_constraints = {
        **BaseDecisionTree._parameter_constraints,
        "criterion": [StrOptions({"gini", "entropy", "log_loss"})],
Member

Looks like an instance of Criterion is also valid, but in a kind of undocumented way, maybe for internal purposes (it seems not even tested). Should we expose it in the error message, or instead make it possible to hide some valid options from the error message, as proposed for the string options in #23459? @thomasjpfan @ogrisel

Member

Historically, I think Criterion instances were unofficially supported for users who want to provide their own Criterion objects with the same interface (from Cython).

If we want to maintain backward compatibility with this unofficially supported feature, then we need to allow for Criterion subclasses, i.e. check isinstance(self.criterion, Criterion). Can the validation framework allow for such a constraint?

Member

I also think that we need to keep backward compat. For the constraint it's easy, we just need to add Criterion as a possibility: "criterion": [StrOptions({"gini", "entropy", "log_loss"}), Criterion],.
My concern was that it would then appear in the error message as "criterion must be a str among {"gini", "entropy", "log_loss"} or a Criterion instance", which makes it kind of official, and I'm not sure we want to make it official. So do you think it's fine to expose it in the error message? Or do you think it would be better to hide it like we did for the "deprecated" string?

Member

Ah I think it's better to hide it. Which means there needs to be another "internal" mechanism for types? 😅

Member

Yes 😄
I have an idea that should make it simple. I'll submit a PR soon

Member

@thomasjpfan I opened #23558 to make it possible to mark a constraint as internal.
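For illustration, the kind of constraint this enables might look like the following (a hedged sketch, assuming the wrapper introduced by #23558 is called Hidden and that the Cython base class sklearn.tree._criterion.Criterion is used for the isinstance check; not part of this PR):

from sklearn.tree._criterion import Criterion
from sklearn.utils._param_validation import Hidden, StrOptions

# Criterion instances stay accepted for backward compatibility, but only the
# documented strings are listed in the validation error message.
_parameter_constraints = {
    "criterion": [
        StrOptions({"gini", "entropy", "log_loss"}),
        Hidden(Criterion),
    ],
}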

moritzwilksch and others added 4 commits May 31, 2022 18:19
fix float interval

Co-authored-by: Jérémie du Boisberranger <[email protected]>
add mse & mae as deprecated

Co-authored-by: Jérémie du Boisberranger <[email protected]>
@jeremiedbb (Member)

You can now remove the tests from the previous param validation here:

# The following parameters are checked in BaseDecisionTree

@jeremiedbb (Member)

The linting issue is due to an unused import (https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=42700&view=logs&jobId=dde5042c-7464-5d47-9507-31bdd2ee0a3a&j=32e2e1bb-a28f-5b18-6cfc-3f01273f5609&t=8a54543f-0728-5134-6642-bedd98e03dd0)
Seems like you can remove the import of check_scalar in _classes.py since you removed all its invocations
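For context, this is roughly the kind of imperative check that the declarative constraints replace (an illustrative sketch, not the exact code that was removed):

from numbers import Integral

from sklearn.utils.validation import check_scalar

# Old style: each parameter was checked by hand inside fit().
max_depth = 3  # example value
check_scalar(max_depth, name="max_depth", target_type=Integral, min_val=1)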

@jeremiedbb added the Validation label (related to input validation) on Jun 13, 2022
@jeremiedbb (Member)

@moritzwilksch I directly pushed some changes to take into account recent improvements in the validation mechanism.

@moritzwilksch (Contributor Author)

Great, thank you! I had some tests fail the other day; I'll check later whether this resolves it.

@jeremiedbb (Member)

Actually, ExtraTreeClassifier/Regressor have the same constraints as their DecisionTree counterparts and inherit _parameter_constraints from them, so we get validation for both for free. I enabled the common test for them too. Should be good now.
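A quick sketch of why the subclasses get this for free: _parameter_constraints is a plain class attribute, so a subclass that does not override it (or the inherited fit) is validated against exactly the same constraints (toy class names below, not the sklearn source):

class Base:
    _parameter_constraints = {"max_depth": "illustrative constraint"}

class Tree(Base):
    _parameter_constraints = {**Base._parameter_constraints, "criterion": "illustrative constraint"}

class ExtraTree(Tree):
    # No override: attribute lookup falls back to Tree, so _validate_params()
    # would apply the same constraints automatically.
    pass

print(ExtraTree._parameter_constraints is Tree._parameter_constraints)  # True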

@jeremiedbb (Member) left a comment

LGTM

@glemaitre self-requested a review on Jun 24, 2022
@glemaitre changed the title from "use _validate_params in DecisionTreeClassifier and DecisionTreeRegressor" to "MAINT use _validate_params in DecisionTreeClassifier and DecisionTreeRegressor" on Jun 24, 2022
@glemaitre (Member)

I pushed a small commit to remove some redundant tests checking the value of class_weight.

@glemaitre merged commit 549fbb8 into scikit-learn:main on Jun 24, 2022
@glemaitre (Member)

Merging since the CIs are green. Thanks @moritzwilksch

ogrisel pushed a commit to ogrisel/scikit-learn that referenced this pull request Jul 11, 2022
…nTreeRegressor` (scikit-learn#23499)

Co-authored-by: Jérémie du Boisberranger <[email protected]>
Co-authored-by: jeremiedbb <[email protected]>
Co-authored-by: Guillaume Lemaitre <[email protected]>