FIX TunedThresholdClassifierCV error or warn with informative message on invalid metrics #29082
Conversation
I would need a changelog entry for 1.5.1, but the changelog file needs to be updated in … EDIT: this section is being added in #29078, which is likely to be merged first.
For the first case, did you manage to trigger it on a specific use case? I thought we would cover it with constant predictions, but maybe in the case where we have probabilities of 0.0 and 1.0 we then get a constant score.
    <= np.finfo(objective_scores.dtype).eps
):
    warn(
        f"The objective metric {self.scoring!r} is constant at "
This is a good idea. Might help some users. Also I like that it is a warning not an error.
I updated the PR in a0d62c5 to keep the 0.5 threshold in that case. Using an extreme near-zero threshold would introduce very weird / unexpectedly biased behavior. Better to keep a more neutral behavior in such a pathological situation.
scoring = check_scoring(self.estimator, scoring=self.scoring)
scorer = check_scoring(self.estimator, scoring=self.scoring)
if scorer._response_method != "predict":
    raise ValueError(
Here, I’m not so sure whether we overly constrain the scorers users can pass. But I also do not 100% follow how the curve scorer works.
If we don't do that (as is the case in main) and users pass a non-thresholded metric like ROC AUC, average precision, log loss or the Brier score, the metric is evaluated on the thresholded (binary) predictions, which is really misleading. You still get a 'tuned' threshold but its meaning is really confusing.
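For illustration, a minimal sketch of that pitfall (the dataset and estimator are arbitrary placeholders, not from this PR):

```python
# Sketch: on main, "roc_auc" is silently evaluated on hard 0/1 predictions
# during threshold tuning; with this PR an informative ValueError would be
# raised instead.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TunedThresholdClassifierCV

X, y = make_classification(weights=[0.9, 0.1], random_state=0)

model = TunedThresholdClassifierCV(LogisticRegression(), scoring="roc_auc")
model.fit(X, y)  # misleading "tuned" threshold on main, error with this PR
```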
I think that we can start restrictive and be conservative. If a real use case is reported, we can rework it and make sure that the API is right and not just working as a side effect.
I don't like that here we're accessing the private _response_method as a proxy to see if it's the right kind of scorer or not. Scorers should have a public API for this (cc @StefanieSenger)
It seems to me that since response_method="predict" is the default in _BaseScorer, this error might not reach the users who are most vulnerable to being confused by wrong outputs. Is there a way to check whether the scorer fits the purpose that is computed rather than dependent on user input?
Edit: That wasn't very helpful. I think I got something wrong.
Let me know if the following helps address those concerns:
https://github.com/scikit-learn/scikit-learn/pull/29082/files#r1613325997
My main conclusion is that we should make an official distinction between scores/metrics: probabilistic ones and decision metrics.
There are also: …
LGTM
objective_scores.max() - objective_scores.min()
<= np.finfo(objective_scores.dtype).eps
I wonder if this is something we should be doing everywhere we "search" over a few models using a scorer. It is kind of arbitrary to have it here and not in other places.
I would also be in favor of raising a similar warning for *SearchCV meta-estimators when mean_test_score is constant for all hyper-parameters. Not sure how frequent this is though.
But it's true that it's actually not that frequent for TunedThresholdClassifierCV either.
The original problem I encountered that triggered the selection of an extreme threshold is actually of a different nature. Let me update this PR accordingly to discuss that further (maybe tomorrow).
Done in #29082 (comment).
I think that checking for the constant case is not really important, as I do not expect it to happen often in practice. The other cases (extreme thresholds due to a lack of trade-off) are more useful to warn against (and they do easily happen in practice, as shown in the tests).
But I would rather keep the constant warning, both to make the warning message more precise for this particular edge case and to fall back to the neutral 0.5 threshold and keep the estimator behavior symmetric.
Can we move this to a utility function which takes a bunch of scores and warns, like _warn_on_constant_metrics, and call it in a few places where it's relevant? (A single PR for this would be nice, which would include this use case.)
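A rough sketch of what such a helper could look like; the name comes from the comment above, while the signature and message are assumptions, not existing scikit-learn API:

```python
from warnings import warn

import numpy as np


def _warn_on_constant_metrics(scores, *, metric_name):
    """Warn when all candidate scores are numerically identical (hypothetical helper)."""
    scores = np.asarray(scores, dtype=np.float64)
    if scores.max() - scores.min() <= np.finfo(scores.dtype).eps:
        warn(
            f"The metric {metric_name!r} is constant across all candidates; "
            "the selected candidate is therefore arbitrary.",
            UserWarning,
        )
```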
I agree, but defining a good public API for scorers is way beyond the scope of this fix and would probably require a SLEP. We can always refactor later. Another thing that scorers need is a public string name that we could use to report results of model evaluations (e.g. data frame column names) or in error and warning messages. Also I would like to decouple …
LGTM. I agree with @ogrisel that the scorer API is out of the scope of the current PR, but it should be addressed.
I updated the PR to issue an informative warning for a pitfall Christian and I fell into when working on the release highlights for this new class. Arguably, passing a constant scoring metric is very unlikely, but it is far more likely to craft a scoring metric that does not have a trade-off between FP and FN, which would lead to dummy classifiers without noticing it.
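To make that pitfall concrete, a minimal sketch (illustrative data and estimator) of a metric without a trade-off between false positives and false negatives:

```python
# Recall is maximized by classifying every sample as positive, so tuning the
# threshold on it collapses to an extreme value and the resulting classifier
# behaves like a dummy model; this PR is expected to warn about that.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TunedThresholdClassifierCV

X, y = make_classification(weights=[0.7, 0.3], random_state=0)

model = TunedThresholdClassifierCV(LogisticRegression(), scoring="recall")
model.fit(X, y)  # expected to warn about a trivial classifier
print(model.best_threshold_)  # close to the minimum candidate threshold
```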
In 5a464e4, I also added a TODO comment on the line of code that would deserve to be updated once we improve the scoring API. Please let me know what you think.
I think we should fix incrementally and refactor along the way to trim code redundancy; otherwise the perfect is the enemy of the good.
In an IRL chat with @glemaitre, we figured it's the scorer that should be raising this error when it's applied to invalid data or an invalid estimator, since it would have access to … Guillaume said he already has an old PR open about it.
Going back to the old PRs, the scope is a bit different: the idea was to get the metric associated with the use case: #17889, #17930.
While we would benefit from having the discussion around the API of scores/scorers and decoupling the signs, I think that we could improve the situation with a kind of metric registry where we can define a callable, the supported … This would be limited to the scikit-learn metrics at first. Maybe there is a way in the future to expose something such that one can add their own metric to the registry. Maybe …
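For the sake of discussion, a purely hypothetical sketch of what such a registry could look like; none of these names or structures exist in scikit-learn:

```python
# Each entry records the metric callable and which estimator response methods
# it can consume, so meta-estimators such as TunedThresholdClassifierCV could
# validate a scorer without touching private attributes.
from sklearn.metrics import accuracy_score, roc_auc_score

_METRIC_REGISTRY = {
    "accuracy": {
        "callable": accuracy_score,
        "response_methods": ("predict",),
        "greater_is_better": True,
    },
    "roc_auc": {
        "callable": roc_auc_score,
        "response_methods": ("decision_function", "predict_proba"),
        "greater_is_better": True,
    },
}


def _accepts_hard_predictions(metric_name):
    """Return True if the registered metric consumes thresholded predictions."""
    return "predict" in _METRIC_REGISTRY[metric_name]["response_methods"]
```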
Not sure why a registry would be better than public methods / attributes on the scorer object itself.
I just thought about this exact question when I posted my previous message. If we don't want to touch the scorer API, then the registry is an option, but that does not mean it is a clean way. IMO, I would prefer this to be part of the scorer API. This is the reason I would merge this PR and open a new PR where we can discuss the right API and alternatives.
I don’t intend to sound harsh, still: could we incorporate Adrin’s comments, do a bit less magic with warnings here for the moment, and discuss the scorer API redesign elsewhere?
That's what I tried in #29082 (comment). Do you have more specific suggestions for things to improve?
So only keep the …
Note that the scoring objective should introduce a trade-off between false
negatives and false positives, otherwise the tuned threshold would be
trivial and the resulting classifier would be equivalent to constantly
classifying one of the two possible classes. This would be the case
when passing scoring="precision" or scoring="recall" for instance.
Furthermore, the scoring objective should evaluate thresholded
classifier predictions: as a result, metrics such as ROC AUC, Average
Precision, log loss or the Brier score are not valid scoring metrics in
this context since they are all designed to evaluate unthresholded
class-membership confidence scores.
Could this be made much shorter?
I am not sure what to remove besides maybe cutting the last sentence:
https://github.com/scikit-learn/scikit-learn/pull/29082/files#r1616901942
warn(
    f"The objective metric {self.scoring!r} is constant at "
I still like this warning.
# if a scorer expects thresholded binary classification predictions.
# TODO: update this condition when a better way is available.
if scorer._response_method != "predict":
    raise ValueError(
This is still the most controversial part. Without it, this PR would be almost merged, right?
What do you think of #29082 (comment) ?
I think what I'm suggesting now doesn't require a SLEP, and divides the PR into pieces which we should be able to tackle easily and quickly. WDYT @ogrisel
elif trivial_kind is not None:
    warn(
        f"Tuning the decision threshold on {self.scoring} "
        "leads to a trivial classifier that classifies all samples as "
        f"the {trivial_kind} class. Consider revising the scoring parameter "
        "to include a trade-off between false positives and false negatives.",
        UserWarning,
    )
I think we have a similar warning when the found solution uses a parameter at the edge of the bounds. Could they all be in the same utility function? It would make the messages more consistent. Could also be a separate PR.
I agree we would need a similar feature for hparam search, but I think both the code and the messages deserve to be specialized.
For the case of hparam tuning, we need to take the bounds from parameter validation into account (e.g. some hparams are naturally bounded, e.g. alpha=0 or l1_ratio=1.0, so we should not raise a warning if we reach those bounds).
For the case of threshold tuning, I think it's important to mention the positive / negative classifications, as I do in the two custom warnings of this PR, to get a more explicit and actionable message. The message for hparam tuning would be more generic.
Let me update the PR accordingly. I will split it in two so that we can have decoupled review discussions about the scorer object attribute and the warnings about trivial thresholds / constant thresholds.
This PR fixes two usability problems with the new TunedThresholdClassifierCV when using it with invalid values for the scoring parameter:
1. Scoring metrics that evaluate unthresholded predictions (e.g. ROC AUC, average precision, log loss or the Brier score) now raise a ValueError with a meaningful error message.
2. Scoring metrics that are constant or lack a trade-off between false positives and false negatives would otherwise silently lead to the selection of an extreme, trivial threshold.
For the second point we could instead warn the user and keep on using 0.5 as the threshold, which is probably less pathological/arbitrary.
Alternatively we could also raise a ValueError, but I am worried that this error could be triggered for bad reasons when doing a grid search or another resampling procedure around the TunedThresholdClassifierCV instance, hence I thought that a warning would be less disruptive.
/cc @glemaitre @lorentzenchr
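A sketch of the second case, assuming the behavior proposed in this PR (warn and keep the neutral 0.5 threshold); the constant scorer is a contrived placeholder:

```python
# A pathological constant scorer: with this PR it is expected to trigger a
# UserWarning and leave the decision threshold at 0.5 instead of selecting an
# arbitrary extreme value.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import TunedThresholdClassifierCV

X, y = make_classification(random_state=0)

constant_scorer = make_scorer(lambda y_true, y_pred: 0.5)  # always the same score
model = TunedThresholdClassifierCV(LogisticRegression(), scoring=constant_scorer)
model.fit(X, y)  # expected to warn that the objective metric is constant
print(model.best_threshold_)  # expected to stay at the neutral 0.5
```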