
FIX TunedThresholdClassifierCV error or warn with informative message on invalid metrics #29082


Open
wants to merge 22 commits into base: main

Conversation

ogrisel
Member

@ogrisel ogrisel commented May 22, 2024

This PR fixes two usability problems with the new TunedThresholdClassifierCV when using it with invalid values for the scoring parameter:

  • the first case is passing a scoring name or scorer object that expects metrics defined for unthresholded predictions (e.g. ROC AUC). This is clearly invalid and we can raise a ValueError with a meaningful error message.
  • the second case is passing an under-specified scoring function that would yield a constant objective score on a given dataset. In this case I chose to warn the user but keep the dummy threshold value that results from this case.

For the second point we could instead warn the user and keep using 0.5 as the threshold, which is probably less pathological/arbitrary.

Alternatively we could also raise a ValueError, but I am worried that this error could be triggered for bad reasons when doing a grid search or another resampling procedure around the TunedThresholdClassifierCV instance, hence I thought that a warning would be less disruptive.
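
For illustration, a minimal sketch of the first case, assuming the behavior proposed in this PR (the exact error message may differ):

```python
# Sketch, assuming this PR's behavior: ROC AUC expects unthresholded
# predictions, so passing it as `scoring` should raise a ValueError.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TunedThresholdClassifierCV

X, y = make_classification(n_samples=1_000, random_state=0)
tuned = TunedThresholdClassifierCV(LogisticRegression(), scoring="roc_auc")
try:
    tuned.fit(X, y)
except ValueError as exc:
    print(exc)  # informative message about non-thresholded metrics
```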

/cc @glemaitre @lorentzenchr

@ogrisel
Member Author

ogrisel commented May 22, 2024

I would need a changelog entry for 1.5.1 but the changelog file needs to be updated in main to add the missing section first. Let's do that in another PR.

EDIT: this section is being added in #29078, which is likely to be merged first.


github-actions bot commented May 22, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 0af3d32.

@ogrisel ogrisel added the Bug label May 22, 2024
@ogrisel ogrisel added this to the 1.5.1 milestone May 22, 2024
@glemaitre
Member

For the first case, did you manage to trigger it on a specific use case? I thought that we would cover it with the constant-score check, but maybe in the case where we have probabilities of exactly 0.0 and 1.0 we then get a constant score.

<= np.finfo(objective_scores.dtype).eps
):
warn(
f"The objective metric {self.scoring!r} is constant at "
Member

This is a good idea. Might help some users. Also I like that it is a warning not an error.

Member Author

@ogrisel ogrisel May 23, 2024

I updated the PR in a0d62c5 to keep the 0.5 threshold in that case. Using an extreme near-zero threshold would introduce a very weird / unexpectedly biased behavior. Better to keep a more neutral behavior in such a pathological situation.

scoring = check_scoring(self.estimator, scoring=self.scoring)
scorer = check_scoring(self.estimator, scoring=self.scoring)
if scorer._response_method != "predict":
raise ValueError(
Member

Here, I'm not so sure whether we overly constrain the scorers a user can pass. But I also do not 100% follow how the curve scorer works.

Member Author

@ogrisel ogrisel May 23, 2024

If we don't do that (as is the case in main) and the user passes a non-thresholded metric like ROC AUC, average precision, log loss or Brier, the metric is evaluated on the thresholded (binary) predictions, which is really misleading. You still get a 'tuned' threshold but its meaning is really confusing.

Member

I think that we can be restrictive and conservative at first. If a real use case is reported, we can then rework it and make sure that the API is right and not just working as a side effect.

Member

I don't like that here we're accessing private _response_method as a proxy to see if it's the right kind of scorer or not. Scorers should have a public API for this (cc @StefanieSenger )

Contributor

@StefanieSenger StefanieSenger May 24, 2024

It seems to me that since response_method="predict" is the default in _BaseScorer, this error might not reach the users who are most likely to be confused by wrong outputs. Is there a more robust way to check whether the scorer fits the purpose that does not depend on user input?

Edit: That wasn't very helpful. I think I got something wrong.

Member Author

Let me know if the following helps address those concerns:

https://github.com/scikit-learn/scikit-learn/pull/29082/files#r1613325997

@ogrisel
Member Author

ogrisel commented May 23, 2024

In main, passing scoring="roc_auc" does not result in constant scores as I would have intuitively expected: the ROC AUC metric is evaluated on binary predictions for all possible thresholds instead of on unthresholded predicted probabilities. The result is really hard to make any sense of. I think we should not allow this usage of TunedThresholdClassifierCV because it's really confusing. Computing ROC AUC or log loss on binary predictions should probably be considered a methodological mistake, and giving users tools that silently do that is a really bad usability pitfall.
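
A small illustration of the difference (a sketch on a toy dataset; exact numbers will vary):

```python
# ROC AUC on predicted probabilities vs. on hard 0/1 predictions:
# the latter collapses the ranking information the metric is meant to use.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
proba = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]

print(roc_auc_score(y_test, proba))         # genuine ROC AUC on probabilities
print(roc_auc_score(y_test, proba >= 0.5))  # "ROC AUC" of thresholded predictions
```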

@lorentzenchr
Member

My main conclusion is that we should make an official distinction of scores/metrics: probabilistic ones and decision metrics.

@ogrisel
Member Author

ogrisel commented May 23, 2024

My main conclusion is that we should make an official distinction of scores/metrics: probabilistic ones and decision metrics.

There are also:

  • non-probabilistic classification confidence scores (e.g. decision_function of an SVM classifier)
  • continuous/regression metrics
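
For the first kind, a minimal sketch (an SVC without probability=True exposes only decision_function):

```python
# decision_function returns signed distances to the decision boundary:
# confidence scores that are not probabilities (and are unbounded).
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
svm = SVC().fit(X, y)  # probability=False by default: no predict_proba
print(svm.decision_function(X[:5]))
```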

@ogrisel ogrisel added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label May 23, 2024
Member

@lorentzenchr lorentzenchr left a comment

LGTM

scoring = check_scoring(self.estimator, scoring=self.scoring)
scorer = check_scoring(self.estimator, scoring=self.scoring)
if scorer._response_method != "predict":
raise ValueError(
Member

I don't like that here we're accessing private _response_method as a proxy to see if it's the right kind of scorer or not. Scorers should have a public API for this (cc @StefanieSenger )

Comment on lines +936 to +937
objective_scores.max() - objective_scores.min()
<= np.finfo(objective_scores.dtype).eps
Member

I wonder if this is something we should be doing everywhere where we "search" over a few models using a scorer. Kind of arbitrary to have it here and not in other places.

Member Author

I would also be in favor of raising a similar warning for *SearchCV meta-estimators when mean_test_score is constant for all hyper-parameters. Not sure how frequent this is though.

But it's true that it's actually not that frequent for TunedThresholdClassifierCV either.

The original problem I encountered that triggered the selection of an extreme threshold is actually of a different nature. Let me update this PR accordingly to discuss that further (maybe tomorrow).

Member Author

Done in #29082 (comment).

Member Author

@ogrisel ogrisel May 24, 2024

I think that checking for the constant case is not really important, as I do not expect this to happen often in practice. The other cases (extreme thresholds due to a lack of trade-off) are more useful to warn against (and they do easily happen in practice, as shown in the tests).

But I would rather keep the constant-score warning, both to make the warning message more precise for this particular edge case and to fall back to the neutral 0.5 and keep the estimator behavior symmetric.

Member

Can we move this to a utility function which takes a bunch of scores and warns, like _warn_on_constant_metrics and call it in a few places where it's relevant? (A single PR for this would be nice, which would include this usecase)
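
For illustration only, a rough sketch of such a helper; the name _warn_on_constant_metrics comes from the comment above and the signature is hypothetical:

```python
# Hypothetical sketch of the suggested helper; not part of this PR.
import warnings

import numpy as np


def _warn_on_constant_metrics(scores, scoring_name):
    """Warn when all candidate scores are numerically identical."""
    scores = np.asarray(scores, dtype=np.float64)
    if scores.size and scores.max() - scores.min() <= np.finfo(scores.dtype).eps:
        warnings.warn(
            f"The scores obtained with {scoring_name!r} are constant, so the "
            "search cannot discriminate between candidates.",
            UserWarning,
        )
```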

@ogrisel
Member Author

ogrisel commented May 23, 2024

I don't like that here we're accessing private _response_method as a proxy to see if it's the right kind of scorer or not. Scorers should have a public API for this (cc @StefanieSenger )

I agree but defining a good public API for scorers is way beyond the scope of this fix and would probably require a SLEP. We can always refactor TunedThresholdClassifierCV to use that improved public API once it exists.

Another thing that scorers need is a public string name that we could use to report results of model evaluations (e.g. data frame column names) or in error and warning messages.

Also I would like to decouple the greater_is_better flag from negating the metric value. It is so confusing to cross-validate on MSE and have to multiply by -1 to be able to report meaningful MSE numbers in a figure, table or notebook cell output.
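
A small example of the sign-flipping annoyance with the current scorer convention:

```python
# The "neg_" convention: cross-validation maximizes the score, so MSE is
# negated, and users have to flip the sign back before reporting it.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, random_state=0)
neg_mse = cross_val_score(Ridge(), X, y, scoring="neg_mean_squared_error")
print(neg_mse)   # negative values
print(-neg_mse)  # the actual MSE values to put in a table or figure
```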

Member

@glemaitre glemaitre left a comment

LGTM. I agree with @ogrisel that the scorer API is out of the scope of the current PR, but it should be addressed.

@ogrisel
Member Author

ogrisel commented May 24, 2024

I updated the PR to issue an informative warning for a pitfall Christian and I fell into when working on the release highlights for this new class.

Arguably, passing a scoring metric that is constant is very unlikely, but it is far more likely to craft a scoring metric that has no trade-off between false positives and false negatives and, as a result, to end up with a dummy classifier without noticing it.
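
A sketch of that pitfall (with this PR, fitting is expected to emit the new warning):

```python
# Recall alone has no penalty for false positives: tuning the threshold on it
# pushes the classifier towards predicting the positive class everywhere.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TunedThresholdClassifierCV

X, y = make_classification(n_samples=1_000, random_state=0)
tuned = TunedThresholdClassifierCV(LogisticRegression(), scoring="recall").fit(X, y)
print(tuned.best_threshold_)    # pushed to an extreme value
print(tuned.predict(X).mean())  # close to 1.0: (almost) all samples predicted positive
```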

@ogrisel
Member Author

ogrisel commented May 24, 2024

In 5a464e4 I also added a TODO comment on the line of code that deserves to be updated once we improve the scoring API.

Please let me know what you think.

@ogrisel
Member Author

ogrisel commented May 24, 2024

Does this "bug" not exist anywhere else in the codebase? AFAIK we don't do a similar check in other places. My point is not that we shouldn't fix it. I'm saying I don't think an adhoc fix is the way to go.

I think we should fix incrementally and refactor along the way to trim code redundancy; otherwise the perfect is the enemy of the good.

@scikit-learn scikit-learn deleted a comment from lorentzenchr May 24, 2024
@adrinjalali
Member

In an IRL chat with @glemaitre we figured it's the scorer that should raise this error when it is applied to invalid data or an invalid estimator, since the scorer has access to the estimator as well and knows its own internals. That would also fix the issue in all the other cases that we have.

Guillaume said he already has an old PR open about it.

@glemaitre
Member

Guillaume said he already has an old PR open about it.

Going back to the old PRs, the scope is a bit different: the idea was to get the metric associated with the use case: #17889, #17930

While we would benefit from having the discussion around the API of scores/scorers and decoupling the signs, I think we could improve the situation with a kind of metric registry where we define, for each metric, the callable, the supported y_type (for y_true) and the compatible response_method. Such a registry alone would allow us to solve the problem of the original PRs (providing metrics depending on the problem) and the issue that we have here, because we could track the requirements linked to each callable.

This would be limited to the scikit-learn metrics at first. Maybe in the future there is a way to expose something such that one can add their own metric to the registry; make_scorer could have an option to register it automatically.
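
A purely hypothetical sketch of what such a registry could look like; none of these names exist in scikit-learn and the entries are only illustrative:

```python
# Hypothetical metric registry sketch: map a metric name to its callable,
# the supported y_true types and the required response_method.
from dataclasses import dataclass
from typing import Callable

from sklearn.metrics import accuracy_score, roc_auc_score


@dataclass(frozen=True)
class MetricSpec:
    func: Callable
    y_types: tuple        # e.g. ("binary", "multiclass")
    response_method: str  # "predict", "predict_proba" or "decision_function"


METRIC_REGISTRY = {
    "accuracy": MetricSpec(accuracy_score, ("binary", "multiclass"), "predict"),
    "roc_auc": MetricSpec(roc_auc_score, ("binary",), "predict_proba"),
}

# A threshold tuner could then reject metrics that are not computed on
# thresholded predictions:
print(METRIC_REGISTRY["roc_auc"].response_method != "predict")  # True -> reject
```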

@ogrisel
Member Author

ogrisel commented May 27, 2024

Not sure why a registry would be better than public methods / attributes on the scorer object itself.

@glemaitre
Member

glemaitre commented May 27, 2024

Not sure why a registry would be better than public methods / attributes on the scorer object itself.

I thought about this exact question when I posted my previous message. If we don't want to touch the scorer API then the registry is an option, but that does not mean it is the cleanest way. IMO, I would prefer this to be part of the scorer API.

This is the reason I would merge this PR and open a new one where we can discuss the right API and the alternatives.

@lorentzenchr
Member

I don't intend to sound harsh, still: Could we incorporate Adrin's comments, add fewer magic warnings here for the moment, and discuss the scorer API redesign elsewhere?

@ogrisel
Member Author

ogrisel commented May 27, 2024

Could we incorporate Adrin‘s comments

That's what I tried in #29082 (comment).

Do you have more specific suggestions for things to improve?

add fewer magic warnings here for the moment

So only keep the ValueError and remove the 2 warnings? I can at least split the PR in two if people prefer.

Comment on lines 646 to 655
Note that scoring objective should introduce a trade-off between false
negatives and false positives, otherwise the tuned threshold would be
trivial and the resulting classifier would be equivalent to constantly
classifying one of the two possible classes. This would be the case
when passing scoring="precision" or scoring="recall" for instance.
Furthermore, the scoring objective should evaluate thresholded
classifier predictions: as a result, metrics such as ROC AUC, Average
Precision, log loss or the Brier score are not valid scoring metrics in
this context since they are all designed to evaluate unthresholded
class-membership confidence scores.
Member

Could this be made much shorter?

Member Author

I am not sure what to remove besides maybe cutting the last sentence:

https://github.com/scikit-learn/scikit-learn/pull/29082/files#r1616901942

Comment on lines +958 to +959
warn(
f"The objective metric {self.scoring!r} is constant at "
Member

I still like this warning.

# if a scorer expects thresholded binary classification predictions.
# TODO: update this condition when a better way is available.
if scorer._response_method != "predict":
raise ValueError(
Member

This is still the most controversial part. Without it, this PR would be almost merged, right?

Member Author

What do you think of #29082 (comment) ?

Member

@adrinjalali adrinjalali left a comment

I think what I'm suggesting now doesn't require a SLEP, and divides the PR into pieces which we should be able to tackle easily and quickly. WDYT @ogrisel

Comment on lines +936 to +937
objective_scores.max() - objective_scores.min()
<= np.finfo(objective_scores.dtype).eps
Member

Can we move this to a utility function which takes a bunch of scores and warns, like _warn_on_constant_metrics and call it in a few places where it's relevant? (A single PR for this would be nice, which would include this usecase)

Comment on lines +966 to +973
elif trivial_kind is not None:
warn(
f"Tuning the decision threshold on {self.scoring} "
"leads to a trivial classifier that classifies all samples as "
f"the {trivial_kind} class. Consider revising the scoring parameter "
"to include a trade-off between false positives and false negatives.",
UserWarning,
)
Member

I think we have a similar warning when the found solution uses a parameter value from the edges of its bounds. Could they all be in the same utility function? It would make the messages more consistent. Could also be a separate PR.

Member Author

I agree we would need a similar feature for hyper-parameter search, but I think both the code and the messages deserve to be specialized.

For the case of hyper-parameter tuning, we need to take the bounds of parameter validation into account (some hyper-parameters are naturally bounded, e.g. alpha=0 or l1_ratio=1.0, so we should not raise a warning if we reach those bounds).

For the case of threshold tuning I think it's important to mention the positive / negative classifications, as I do in the two custom warnings of this PR, to get a more explicit and actionable message. The message for hyper-parameter tuning would be more generic.
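
For illustration, a hypothetical sketch of a bound-aware check for the hyper-parameter case (the helper name and the bound table are made up):

```python
# Hypothetical sketch: warn only when the best value sits on a grid edge
# that is not a natural bound of the parameter.
import warnings

NATURAL_BOUNDS = {"alpha": {0.0}, "l1_ratio": {0.0, 1.0}}  # illustrative only


def warn_if_best_on_grid_edge(param_name, grid, best_value):
    on_edge = best_value in (min(grid), max(grid))
    if on_edge and best_value not in NATURAL_BOUNDS.get(param_name, set()):
        warnings.warn(
            f"The best {param_name}={best_value} lies on the edge of the "
            "searched grid; consider extending the grid.",
            UserWarning,
        )


warn_if_best_on_grid_edge("C", [0.01, 0.1, 1.0, 10.0], 10.0)  # warns
warn_if_best_on_grid_edge("l1_ratio", [0.0, 0.5, 1.0], 1.0)   # natural bound: silent
```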

@ogrisel
Member Author

ogrisel commented May 30, 2024

I think what I'm suggesting now doesn't require a SLEP, and divides the PR into pieces which we should be able to tackle easily and quickly. WDYT @ogrisel

Let me update the PR accordingly. I will split it in two so that we can have decoupled review discussions about the scorer object attribute and the warnings about trivial thresholds / constant thresholds.

@ogrisel ogrisel modified the milestones: 1.5.1, 1.6 Jun 24, 2024
@glemaitre glemaitre modified the milestones: 1.6, 1.7 Nov 7, 2024
Labels
Bug, module:model_selection, To backport (PR merged in master that need a backport to a release branch defined based on the milestone)
Projects
None yet

5 participants