[MRG] Implement calibration loss metrics #11096
Conversation
You will also need to update:
doc/modules/classes.rst
doc/modules/model_evaluation.rst
sklearn/metrics/tests/test_common.py
Should this be available as a scorer for cross validation? Should it be available for CalibratedClassifierCV?
You don't need to test sample weights, as the common tests will cover that.
I'm not sure if it makes sense to have this as a ready-made scorer, so I might want to hold out on that. It should probably be mentioned in the calibration docs and examples.
Can you maybe also add https://www.math.ucdavis.edu/~saito/data/roc/ferri-class-perf-metrics.pdf to the references and include some of the discussion there about this loss? In particular, this is one of two calibration losses they discuss, so maybe we should be more specific with the name?
Actually I did not implement any of the losses from the paper you mention @amueller, as I take non-overlapping bins instead of a sliding window as in CalB.
I ended up implementing the CalB loss suggested by @amueller, but did not provide support for sample weights.
Sorry again for not being in a position to review the main content...
doc/modules/model_evaluation.rst
@@ -62,6 +62,7 @@ Scoring Function
'balanced_accuracy'   :func:`metrics.balanced_accuracy_score`  for binary targets
'average_precision'   :func:`metrics.average_precision_score`
'brier_score_loss'    :func:`metrics.brier_score_loss`
'calibration_loss'    :func:`metrics.calibration_loss`
To respect the convention that higher is better, we should maybe call it neg_calibration_error. Thoughts, anyone?
Yes: the grid search tools in scikit-learn always maximize the scoring criterion. If the criterion is derived from a loss to minimize, we instead maximize the negative criterion. It's important to name it accordingly so that the column names and column values of pd.DataFrame(grid_search.cv_results_) stay consistent.
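[Editorial sketch] For illustration only, such a sign-flipped scorer could be built with make_scorer; calibration_error is the function proposed in this PR, not an existing scikit-learn API, so treat the snippet as a sketch under that assumption.

```python
# Hedged sketch: expose the proposed calibration_error (not in scikit-learn)
# as a "higher is better" scorer. make_scorer with greater_is_better=False
# negates the loss, so grid search, which always maximizes, selects the least
# miscalibrated model and the mean_test_score column in cv_results_ matches
# the neg_* name.
from sklearn.metrics import make_scorer

neg_calibration_error = make_scorer(
    calibration_error,        # assumed importable from this PR's branch
    greater_is_better=False,  # loss -> the scorer reports the negated value
    needs_proba=True,         # the metric consumes predicted probabilities
)
```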
We might also want to provide standard aliases neg_max_calibration_error and neg_expected_calibration_error that set the norm parameter, to make it possible to grid search for the ECE or MCE metrics traditionally used in the literature.
I have not followed recent development, but this is related to this discussion: #17704
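[Editorial sketch] Following the same pattern as the sketch above, such aliases could simply pin the norm keyword; again, calibration_error and its norm values ('l1', 'max') are assumptions taken from this PR, not an existing API.

```python
# Hedged sketch: scorer aliases that pin `norm` so ECE and MCE can be
# requested by name in cross-validation. The metric itself is assumed.
from sklearn.metrics import make_scorer

neg_expected_calibration_error = make_scorer(
    calibration_error, greater_is_better=False, needs_proba=True, norm="l1")
neg_max_calibration_error = make_scorer(
    calibration_error, greater_is_better=False, needs_proba=True, norm="max")
```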
These are just some comments regarding the code itself. I still have to read more about the score itself and check whether we are missing anything in the tests. In particular, I see that we don't test anything when it is used as a scorer; I am not sure we have a common test for that.
if pos_label is None:
    pos_label = y_true.max()
y_true = np.array(y_true == pos_label, int)
If the score is not symmetric, I would make pos_label consistent with the scorers that raise an error for string labels. To be more specific, we should carry out this check (sklearn/metrics/_ranking.py, lines 553 to 572 at e89157e):
# ensure binary classification if pos_label is not specified
# classes.dtype.kind in ('O', 'U', 'S') is required to avoid
# triggering a FutureWarning by calling np.array_equal(a, b)
# when elements in the two arrays are not comparable.
classes = np.unique(y_true)
if (pos_label is None and (
        classes.dtype.kind in ('O', 'U', 'S') or
        not (np.array_equal(classes, [0, 1]) or
             np.array_equal(classes, [-1, 1]) or
             np.array_equal(classes, [0]) or
             np.array_equal(classes, [-1]) or
             np.array_equal(classes, [1])))):
    classes_repr = ", ".join(repr(c) for c in classes)
    raise ValueError("y_true takes value in {{{classes_repr}}} and "
                     "pos_label is not specified: either make y_true "
                     "take value in {{0, 1}} or {{-1, 1}} or "
                     "pass pos_label explicitly.".format(
                         classes_repr=classes_repr))
elif pos_label is None:
    pos_label = 1.
We could indeed make a small refactoring, so that pos_label would be in line with:

pos_label : int or str, default=None
    The label of the positive class. When pos_label=None, if y_true is in
    {-1, 1} or {0, 1}, pos_label is set to 1; otherwise an error will be raised.
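[Editorial sketch] A minimal sketch of that refactoring, assuming a hypothetical private helper name; the body just reuses the check quoted above so that calibration_error validates pos_label the same way as the ranking metrics.

```python
import numpy as np


def _check_pos_label(pos_label, y_true):
    """Resolve pos_label, raising if it cannot be inferred (hypothetical helper)."""
    classes = np.unique(y_true)
    if pos_label is None and (
            classes.dtype.kind in ('O', 'U', 'S') or
            not (np.array_equal(classes, [0, 1]) or
                 np.array_equal(classes, [-1, 1]) or
                 np.array_equal(classes, [0]) or
                 np.array_equal(classes, [-1]) or
                 np.array_equal(classes, [1]))):
        classes_repr = ", ".join(repr(c) for c in classes)
        raise ValueError(
            f"y_true takes value in {{{classes_repr}}} and pos_label is not "
            "specified: either make y_true take value in {0, 1} or {-1, 1} "
            "or pass pos_label explicitly.")
    if pos_label is None:
        pos_label = 1.0
    return pos_label
```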
raise ValueError("y_prob has values outside of [0, 1] range")

labels = np.unique(y_true)
if len(labels) > 2:
Here I think that we should use sklearn.utils.multiclass.type_of_target for consistency:
y_type = type_of_target(y_true)
if y_type != "binary":
    raise ValueError(...)
y_pred = np.array([0.25, 0.25, 0.25, 0.25] + [0.75, 0.75, 0.75, 0.75])
sample_weight = np.array([1, 1, 1, 1] + [3, 3, 3, 3])

assert_almost_equal(
We are using the following pattern instead of assert_almost_equal since we introduced pytest:
assert calibration_error(...) == pytest.approx(0.1875)
    0.2165063)


def test_calibration_error_raises():
We can parametrize this test and use pytest:
@pytest.mark.parametrize(
    "y_pred",
    [np.array([0.25, 0.25, 0.25, 0.25] + [0.75, 0.75, 0.75]),
     np.array([0.25, 0.25, 0.25, 0.25] + [0.75, 0.75, 0.75, 1.75]),
     ...],
)
def test_calibration_error_raises(y_pred):
    ...
    with pytest.raises(ValueError, match=err_msg):
        calibration_error(y_true, y_pred)
If the err_msg is changing, we can include it in the parametrization as well.
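[Editorial sketch] For example, the parametrization could carry (y_pred, err_msg) pairs. The y_true fixture values and the exact error messages below are assumptions; calibration_error is the function proposed in this PR.

```python
import numpy as np
import pytest


@pytest.mark.parametrize(
    "y_pred, err_msg",
    [
        # mismatched length vs. y_true (message assumed from check_consistent_length)
        (np.array([0.25, 0.25, 0.25, 0.25] + [0.75, 0.75, 0.75]),
         "inconsistent numbers of samples"),
        # probability outside [0, 1] (message taken from the PR diff above)
        (np.array([0.25, 0.25, 0.25, 0.25] + [0.75, 0.75, 0.75, 1.75]),
         "y_prob has values outside of"),
    ],
)
def test_calibration_error_raises(y_pred, err_msg):
    y_true = np.array([0, 0, 0, 0] + [1, 1, 1, 1])  # assumed fixture values
    with pytest.raises(ValueError, match=err_msg):
        calibration_error(y_true, y_pred)
```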
y_true = np.array([0, 1, 0, 1])
y_pred = np.array([0., 0.25, 0.5, 0.75])

assert_almost_equal(
use pytest.approx
y_true = np.array([1, 0, 0, 1])
y_pred = np.array([0.25, 0.25, 0.75, 0.75])

assert_almost_equal(
use pytest.approx
Sorry to add more work, but would it be possible to add a quick note to clarify that the calibration loss metric we implement here is not the same as the one referred to above, and maybe how they relate? (ref: #18051 (comment))
Thanks for this work! In terms of implementation, can't we re-use the calculation of ...?
Format fixes Co-authored-by: Guillaume Lemaitre <[email protected]>
The :func:`calibration_error` function computes the expected and maximum
calibration errors as defined in [1]_ for binary classes.
Would the general ECE for multiclass problems be more desirable, as used in Guo 2017 (Ref 1)? This implementation seems targeted at binary classification. I believe you could use it as-is for multiclass if you pass in y_pred as the maximum predicted probability and y_true as 1 if the predicted class equals the true class and 0 otherwise, but I would think it more desirable to just have the multiclass logic under the hood.
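[Editorial sketch] The reduction described above could look like the following; the helper name is hypothetical and calibration_error is assumed from this PR.

```python
import numpy as np


def top_label_calibration_inputs(y_true, y_proba, classes):
    """Return (correct, confidence) arrays for top-label calibration."""
    confidence = y_proba.max(axis=1)                       # max predicted probability
    y_pred = np.asarray(classes)[y_proba.argmax(axis=1)]   # predicted class label
    correct = (y_pred == np.asarray(y_true)).astype(int)   # 1 if prediction is right
    return correct, confidence


# Assumed usage:
# correct, confidence = top_label_calibration_inputs(
#     y_test, clf.predict_proba(X_test), clf.classes_)
# ece = calibration_error(correct, confidence, norm="l1")  # API assumed from this PR
```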
                sample_weight[i_start:i_end])
            / delta_count[i])
        if norm == "l2" and reduce_bias:
            delta_debias = (
Is this implementation of the bias term correct? I believe it should be (avg_pred_true[i] * (avg_pred_true[i] - 1) * delta_count[i]) / (count * (delta_count[i] - 1)), going off the definition here. The 1/count doesn't get distributed properly, I think. This will produce NaNs if there's only one element in a bin, so I'm not sure if this was changed for numerical stability reasons. The authors of the paper mention this in their implementation, so it seems like the "undefined if bin size is 1" issue is expected.
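[Editorial sketch] Spelling out the per-bin term the comment describes (Kumar et al.'s debiased L2 estimator). Variable names mirror those in the PR diff, but the surrounding loop and the handling of singleton bins are assumptions.

```python
def debiased_l2_term(avg_pred_true, avg_pred_prob, delta_count, count):
    """Per-bin contribution to the squared calibration error.

    avg_pred_true : mean of y_true in the bin
    avg_pred_prob : mean predicted probability in the bin
    delta_count   : number of samples in the bin
    count         : total number of samples
    """
    plugin = (avg_pred_true - avg_pred_prob) ** 2 * delta_count / count
    if delta_count > 1:
        # bias correction from the comment above; undefined for bins of size 1
        debias = (avg_pred_true * (avg_pred_true - 1) * delta_count
                  / (count * (delta_count - 1)))
    else:
        debias = 0.0  # assumption: skip the correction for singleton bins
    return plugin + debias
```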
We should have a look at https://cran.r-project.org/web/packages/reliabilitydiag/index.html and the accompanying paper https://doi.org/10.1073/pnas.2016191118. First, let us look at whether this PR corresponds to their miscalibration measure (MCB). Second, they propose a binning that might help us.
Folks, I've made a separate implementation in #22233 which has some specific benefits (e.g. binless). What do you think?
I would much prefer a more general solution for calibration scores via a general score decomposition as proposed in #23767.
norm : {'l1', 'l2', 'max'}, default='l2'
    Norm method. The l1-norm is the Expected Calibration Error (ECE),
    and the max-norm corresponds to Maximum Calibration Error (MCE).
and l2 corresponds to...?
strategy : {'uniform', 'quantile'}, default='uniform'
    Strategy used to define the widths of the bins.
Suggested change: replace "uniform" with "- 'uniform':" to meet the sklearn docstring format, like here.
uniform
    All bins have identical widths.
quantile
Suggested change: replace "quantile" with "- 'quantile':" to meet the sklearn docstring format, like here.
positive class.

reduce_bias : bool, default=True
    Add debiasing term as in Verified Uncertainty Calibration, A. Kumar.
If possible, add a link to the A. Kumar reference.
When are we going to see the calibration loss released?
See discussion on issue #10883.
Closes #18268.
This PR implements calibration losses for binary classifiers.
It also updates the doc about calibration, especially inaccurate references to the Brier score.
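[Editorial sketch] As a rough illustration of what the proposed metric computes, here is a minimal, self-contained sketch of a binned calibration error with uniform bins and the l1/l2/max norms discussed in this PR. It approximates, but is not, the PR's actual calibration_error implementation; in particular, the l2 definition (root of the weighted squared gaps) and the handling of empty bins are assumptions.

```python
import numpy as np


def calibration_error_sketch(y_true, y_prob, n_bins=10, norm="l2"):
    """Binned calibration error sketch for binary targets in {0, 1}."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    # assign each probability to a uniform-width bin; clip so 1.0 lands in the last bin
    bin_ids = np.clip(np.digitize(y_prob, bins) - 1, 0, n_bins - 1)
    gaps, weights = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # assumption: empty bins are simply skipped
        gaps.append(abs(y_prob[mask].mean() - y_true[mask].mean()))
        weights.append(mask.sum() / len(y_prob))
    gaps, weights = np.array(gaps), np.array(weights)
    if norm == "l1":    # Expected Calibration Error (ECE)
        return float(np.sum(weights * gaps))
    if norm == "l2":    # assumed here: root of the weighted squared gaps
        return float(np.sqrt(np.sum(weights * gaps ** 2)))
    if norm == "max":   # Maximum Calibration Error (MCE)
        return float(gaps.max())
    raise ValueError("norm must be 'l1', 'l2' or 'max'")


# Toy usage of the sketch:
print(calibration_error_sketch([0, 1, 1, 0], [0.1, 0.9, 0.8, 0.2],
                               n_bins=5, norm="l1"))
```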