FIX TunedThresholdClassifierCV error or warn with informative message on invalid metrics #29082
Conversation
I would need a changelog entry for 1.5.1, but the changelog file needs to be updated in … EDIT: this section is being added in #29078, which is likely to be merged first.
For the first case, did you manage to trigger it on a specific use case? I thought we would cover it with constant predictions, but maybe in the case where we have probabilities of 0.0 and 1.0 we then get a constant score.
    <= np.finfo(objective_scores.dtype).eps
):
    warn(
        f"The objective metric {self.scoring!r} is constant at "
This is a good idea. Might help some users. Also I like that it is a warning not an error.
I updated the PR in a0d62c5 to keep the 0.5 threshold in that case. Using an extreme near-zero threshold would introduce very weird / unexpectedly biased behavior. Better to keep a more neutral behavior in such a pathological situation.
scoring = check_scoring(self.estimator, scoring=self.scoring)
scorer = check_scoring(self.estimator, scoring=self.scoring)
if scorer._response_method != "predict":
    raise ValueError(
Here, I’m not so sure whether we overly constrain the scorers users can pass. But I also do not 100% follow how the curve scorer works.
If we don't do that (as is the case in main) and users pass a non-thresholded metric like ROC AUC, average precision, log loss or the Brier score, the metric is evaluated on the thresholded (binary) predictions, which is really misleading. You still get a 'tuned' threshold but its meaning is really confusing.
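For illustration, a minimal sketch of that pitfall (the dataset and estimator are arbitrary placeholders, not from this PR):

```python
# Sketch: on main, "roc_auc" is silently evaluated on hard 0/1 predictions
# during threshold tuning; with this PR an informative ValueError would be
# raised instead.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TunedThresholdClassifierCV

X, y = make_classification(weights=[0.9, 0.1], random_state=0)

model = TunedThresholdClassifierCV(LogisticRegression(), scoring="roc_auc")
model.fit(X, y)  # misleading "tuned" threshold on main, error with this PR
```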
I think that we can start restrictive and be conservative. If a real use case is reported, we can rework it and make sure that the API is right and not just working as a side effect.
I don't like that here we're accessing the private _response_method as a proxy to see if it's the right kind of scorer or not. Scorers should have a public API for this (cc @StefanieSenger)
It seems to me that since response_method="predict" is the default in _BaseScorer, this error might not reach the users who are most vulnerable to being confused by wrong outputs. Is there a way to check whether the scorer fits the purpose that is computed rather than dependent on user input?
Edit: That wasn't very helpful. I think I got something wrong.
Let me know if the following helps address those concerns:
https://github.com/scikit-learn/scikit-learn/pull/29082/files#r1613325997
My main conclusion is that we should make an official distinction between scores/metrics: probabilistic ones and decision metrics.
There are also: …
LGTM
objective_scores.max() - objective_scores.min()
<= np.finfo(objective_scores.dtype).eps
I wonder if this is something we should be doing everywhere we "search" over a few models using a scorer. It is kind of arbitrary to have it here and not in other places.
I would also be in favor of raising a similar warning for *SearchCV meta-estimators when mean_test_score is constant for all hyper-parameters. Not sure how frequent this is though.
But it's true that it's actually not that frequent for TunedThresholdClassifierCV either.
The original problem I encountered that triggered the selection of an extreme threshold is actually of a different nature. Let me update this PR accordingly to discuss that further (maybe tomorrow).
Done in #29082 (comment).
I think that checking for the constant case is not really important, as I do not expect it to happen often in practice. The other cases (extreme thresholds due to a lack of trade-off) are more useful to warn against (and they do easily happen in practice, as shown in the tests).
But I would rather keep the constant warning, both to make the warning message more precise for this particular edge case and to fall back to the neutral 0.5 threshold and keep the estimator behavior symmetric.
Can we move this to a utility function which takes a bunch of scores and warns, like _warn_on_constant_metrics, and call it in a few places where it's relevant? (A single PR for this would be nice, which would include this use case.)
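A rough sketch of what such a helper could look like; the name comes from the comment above, while the signature and message are assumptions, not existing scikit-learn API:

```python
from warnings import warn

import numpy as np


def _warn_on_constant_metrics(scores, *, metric_name):
    """Warn when all candidate scores are numerically identical (hypothetical helper)."""
    scores = np.asarray(scores, dtype=np.float64)
    if scores.max() - scores.min() <= np.finfo(scores.dtype).eps:
        warn(
            f"The metric {metric_name!r} is constant across all candidates; "
            "the selected candidate is therefore arbitrary.",
            UserWarning,
        )
```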
I agree, but defining a good public API for scorers is way beyond the scope of this fix and would probably require a SLEP. We can always refactor later. Another thing that scorers need is a public string name that we could use to report results of model evaluations (e.g. data frame column names) or in error and warning messages. Also I would like to decouple …
LGTM. I agree with @ogrisel that the scorer API is out of the scope of the current PR, but it should be addressed.
I updated the PR to issue an informative warning for a pitfall Christian and I fell into when working on the release highlights for this new class. Arguably, passing a constant scoring metric is very unlikely, but it is far more likely to craft a scoring metric that does not have a trade-off between FP and FN, which would lead to dummy classifiers without noticing it.
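To make that pitfall concrete, a minimal sketch (illustrative data and estimator) of a metric without a trade-off between false positives and false negatives:

```python
# Recall is maximized by classifying every sample as positive, so tuning the
# threshold on it collapses to an extreme value and the resulting classifier
# behaves like a dummy model; this PR is expected to warn about that.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TunedThresholdClassifierCV

X, y = make_classification(weights=[0.7, 0.3], random_state=0)

model = TunedThresholdClassifierCV(LogisticRegression(), scoring="recall")
model.fit(X, y)  # expected to warn about a trivial classifier
print(model.best_threshold_)  # close to the minimum candidate threshold
```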
In 5a464e4, I also added a TODO comment on the line of code that would deserve to be updated once we improve the scoring API. Please let me know what you think.
I think we should fix incrementally and refactor along the way to trim code redundancy; otherwise the perfect is the enemy of the good.
In an IRL chat with @glemaitre, we figured it's the scorer that should be raising this error when it's applied to invalid data or an invalid estimator, since it would have access to … Guillaume said he already has an old PR open about it.
Going back to the old PRs, the scope is a bit different: the idea was to get the metric associated with the use case: #17889, #17930.
While we would benefit from having the discussion around the API of scores/scorers and decoupling the signs, I think that we could improve the situation with a kind of metric registry where we can define a callable, the supported … This would be limited to the scikit-learn metrics at first. Maybe there is a way in the future to expose something such that one can add their own metric to the registry. Maybe …
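For the sake of discussion, a purely hypothetical sketch of what such a registry could look like; none of these names or structures exist in scikit-learn:

```python
# Each entry records the metric callable and which estimator response methods
# it can consume, so meta-estimators such as TunedThresholdClassifierCV could
# validate a scorer without touching private attributes.
from sklearn.metrics import accuracy_score, roc_auc_score

_METRIC_REGISTRY = {
    "accuracy": {
        "callable": accuracy_score,
        "response_methods": ("predict",),
        "greater_is_better": True,
    },
    "roc_auc": {
        "callable": roc_auc_score,
        "response_methods": ("decision_function", "predict_proba"),
        "greater_is_better": True,
    },
}


def _accepts_hard_predictions(metric_name):
    """Return True if the registered metric consumes thresholded predictions."""
    return "predict" in _METRIC_REGISTRY[metric_name]["response_methods"]
```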
Not sure why a registry would be better than public methods / attributes on the scorer object itself.
I just thought about this exact question when I posted my previous message. If we don't want to touch the scorer API, then the registry is an option, but that does not mean it is a clean way. IMO, I would prefer this to be part of the scorer API. This is the reason I would merge this PR and open a new PR where we can discuss the right API and alternatives.
I don’t intend to sound harsh, still: could we incorporate Adrin’s comments, do a bit less magic with warnings here for the moment, and discuss the scorer API redesign elsewhere?
That's what I tried in #29082 (comment). Do you have more specific suggestions for things to improve?
So only keep the …
Note that the scoring objective should introduce a trade-off between false
negatives and false positives, otherwise the tuned threshold would be
trivial and the resulting classifier would be equivalent to constantly
classifying one of the two possible classes. This would be the case
when passing scoring="precision" or scoring="recall" for instance.
Furthermore, the scoring objective should evaluate thresholded
classifier predictions: as a result, metrics such as ROC AUC, Average
Precision, log loss or the Brier score are not valid scoring metrics in
this context since they are all designed to evaluate unthresholded
class-membership confidence scores.
Could this be made much shorter?
I am not sure what to remove besides maybe cutting the last sentence:
https://github.com/scikit-learn/scikit-learn/pull/29082/files#r1616901942
warn(
    f"The objective metric {self.scoring!r} is constant at "
I still like this warning.
# if a scorer expects thresholded binary classification predictions.
# TODO: update this condition when a better way is available.
if scorer._response_method != "predict":
    raise ValueError(
This is still the most controversial part. Without it, this PR would be almost merged, right?
What do you think of #29082 (comment) ?
I think what I'm suggesting now doesn't require a SLEP, and divides the PR into pieces which we should be able to tackle easily and quickly. WDYT @ogrisel
elif trivial_kind is not None:
    warn(
        f"Tuning the decision threshold on {self.scoring} "
        "leads to a trivial classifier that classifies all samples as "
        f"the {trivial_kind} class. Consider revising the scoring parameter "
        "to include a trade-off between false positives and false negatives.",
        UserWarning,
    )
I think we have a similar warning when the found solution uses a parameter at the edge of the bounds. Could they all be in the same utility function? It would make the messages more consistent. Could also be a separate PR.
I agree we would need a similar feature for hparam search, but I think both the code and the messages deserve to be specialized.
For the case of hparam tuning, we need to take the bounds from parameter validation into account (e.g. some hparams are naturally bounded, e.g. alpha=0 or l1_ratio=1.0, so we should not raise a warning if we reach those bounds).
For the case of threshold tuning, I think it's important to mention the positive / negative classifications, as I do in the two custom warnings of this PR, to get a more explicit and actionable message. The message for hparam tuning would be more generic.
Let me update the PR accordingly. I will split it in two so that we can have decoupled review discussions about the scorer object attribute and the warnings about trivial thresholds / constant thresholds.
This PR fixes two usability problems with the new TunedThresholdClassifierCV when using it with invalid values for the scoring parameter:
1. Scoring metrics that evaluate unthresholded predictions (e.g. ROC AUC, average precision, log loss or the Brier score) now raise a ValueError with a meaningful error message.
2. Scoring metrics that are constant or lack a trade-off between false positives and false negatives would otherwise silently lead to the selection of an extreme, trivial threshold.
For the second point we could instead warn the user and keep on using 0.5 as the threshold, which is probably less pathological/arbitrary.
Alternatively we could also raise a ValueError, but I am worried that this error could be triggered for bad reasons when doing a grid search or another resampling procedure around the TunedThresholdClassifierCV instance, hence I thought that a warning would be less disruptive.
/cc @glemaitre @lorentzenchr
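A sketch of the second case, assuming the behavior proposed in this PR (warn and keep the neutral 0.5 threshold); the constant scorer is a contrived placeholder:

```python
# A pathological constant scorer: with this PR it is expected to trigger a
# UserWarning and leave the decision threshold at 0.5 instead of selecting an
# arbitrary extreme value.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import TunedThresholdClassifierCV

X, y = make_classification(random_state=0)

constant_scorer = make_scorer(lambda y_true, y_pred: 0.5)  # always the same score
model = TunedThresholdClassifierCV(LogisticRegression(), scoring=constant_scorer)
model.fit(X, y)  # expected to warn that the objective metric is constant
print(model.best_threshold_)  # expected to stay at the neutral 0.5
```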