
[MRG] Implement calibration loss metrics #11096


Open. Wants to merge 97 commits into base: main.

Conversation

@samronsin (Contributor) commented May 15, 2018

See discussion on issue #10883.
Closes #18268.

This PR implements calibration losses for binary classifiers.

It also updates the calibration documentation, in particular fixing inaccurate references to the Brier score.

@jnothman (Member) left a comment


You will also need to update:

  • doc/modules/classes.rst
  • doc/modules/model_evaluation.rst
  • sklearn/metrics/tests/test_common.py

@jnothman (Member) left a comment


Should this be available as a scorer for cross validation? Should it be available for CalibratedClassifierCV?

@jnothman (Member)

You don't need to test sample weights, as the common tests will cover that.

@amueller (Member)

I'm not sure it makes sense to have this as a ready-made scorer, so I might hold off on that. It should probably be mentioned in the calibration docs and examples.

@amueller (Member) commented May 19, 2018

Can you maybe also add https://www.math.ucdavis.edu/~saito/data/roc/ferri-class-perf-metrics.pdf to the references and maybe include some of the discussion there about this loss? In particular, this is one of two calibration losses they discuss, so maybe we should be more specific with the name?

@samronsin (Contributor, Author)

Actually, I did not implement any of the losses from the paper you mention, @amueller, since I use non-overlapping bins instead of the sliding window used in CalB.
CalB would totally make sense in this PR, although:

  • n_bins would have a very different meaning than, e.g., the number of non-overlapping bins
  • taking sample_weight into account seems non-trivial because of the sliding window
  • implementing CalB efficiently would be non-trivial in itself, since it is based on a sliding window
  • a greedy implementation would be quite a drag on performance compared to non-overlapping binning (see the sketch of the non-overlapping approach below)
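
For reference, here is a minimal sketch of the non-overlapping, equal-width binning approach mentioned above. It is illustrative only: the function name ece_nonoverlapping is hypothetical, and it ignores the norm, strategy, sample_weight and reduce_bias options of the metric in this PR.

    import numpy as np

    def ece_nonoverlapping(y_true, y_prob, n_bins=10):
        """Expected calibration error with equal-width, non-overlapping bins.

        y_true and y_prob are 1D numpy arrays of labels in {0, 1} and
        predicted probabilities in [0, 1].
        """
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        # Use the interior edges only, so bin ids run from 0 to n_bins - 1
        # and a probability of exactly 1.0 falls into the last bin.
        bin_ids = np.digitize(y_prob, edges[1:-1])
        ece = 0.0
        for b in range(n_bins):
            mask = bin_ids == b
            if not mask.any():
                continue
            weight = mask.mean()  # fraction of samples falling in this bin
            ece += weight * abs(y_true[mask].mean() - y_prob[mask].mean())
        return ece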

@samronsin (Contributor, Author)

I ended up implementing the CalB loss suggested by @amueller, but did not provide support for sample weights.
Also, I'll be happy to take suggestions regarding its implementation.

@jnothman (Member) left a comment


Sorry, again, for not being in a position to review the main content...

@@ -62,6 +62,7 @@ Scoring Function
'balanced_accuracy' :func:`metrics.balanced_accuracy_score` for binary targets
'average_precision' :func:`metrics.average_precision_score`
'brier_score_loss' :func:`metrics.brier_score_loss`
'calibration_loss' :func:`metrics.calibration_loss`
Member


To respect the "higher is better" convention, we should maybe call it

neg_calibration_error

Thoughts, anyone?

Member


Yes: the grid-search tools in scikit-learn always maximize the scoring criterion. If the criterion is derived from a loss to be minimized, we instead maximize the negated criterion. It's important to name it accordingly so that the column names and column values of pd.DataFrame(grid_search.cv_results_) are consistent.

Member


We might also want to provide standard aliases neg_max_calibration_error and neg_expected_calibration_error that set the norm parameter, to make it possible to grid-search on the ECE or MCE metrics traditionally used in the literature.
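
As a rough sketch of what such scorer aliases could look like (the metric name calibration_error and its norm parameter follow this PR; the alias names are only the proposals from this discussion, not an existing scikit-learn API):

    from sklearn.metrics import make_scorer
    # calibration_error is the metric added by this PR (not yet released);
    # norm='l1' is assumed to give the ECE and norm='max' the MCE.
    from sklearn.metrics import calibration_error

    # greater_is_better=False makes the scorer return the negated loss, so that
    # grid search, which always maximizes, selects the least miscalibrated model.
    neg_expected_calibration_error = make_scorer(
        calibration_error, greater_is_better=False, needs_proba=True, norm="l1"
    )
    neg_max_calibration_error = make_scorer(
        calibration_error, greater_is_better=False, needs_proba=True, norm="max"
    )

Such a scorer could then be passed as scoring=neg_expected_calibration_error to GridSearchCV, and the negated values would keep the "higher is better" convention in cv_results_.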

@ogrisel (Member) commented Aug 6, 2020

@ogrisel unless I'm missing something, the only remaining discussion point is what should happen when no pos_label is explicitly given by the user (especially for non-integer targets)? Is there a new consensus we should follow?

I have not followed recent development but this is related to this discussion: #17704

@lucyleeow (Member)

For housekeeping, I think this will close #10971 and #12479

@glemaitre (Member) left a comment


These are just some comments regarding the code itself. I still need to read more about the score itself and check that we don't miss anything in the tests.

In particular, I see that we don't test anything when it is used as a scorer. I am not sure we have a common test for that.


if pos_label is None:
    pos_label = y_true.max()
y_true = np.array(y_true == pos_label, int)
Member


If the score is not symmetric, I would make pos_label consistent with the scorers that raise an error on string labels.
To be more specific, we should carry out this check:

# ensure binary classification if pos_label is not specified
# classes.dtype.kind in ('O', 'U', 'S') is required to avoid
# triggering a FutureWarning by calling np.array_equal(a, b)
# when elements in the two arrays are not comparable.
classes = np.unique(y_true)
if (pos_label is None and (
        classes.dtype.kind in ('O', 'U', 'S') or
        not (np.array_equal(classes, [0, 1]) or
             np.array_equal(classes, [-1, 1]) or
             np.array_equal(classes, [0]) or
             np.array_equal(classes, [-1]) or
             np.array_equal(classes, [1])))):
    classes_repr = ", ".join(repr(c) for c in classes)
    raise ValueError("y_true takes value in {{{classes_repr}}} and "
                     "pos_label is not specified: either make y_true "
                     "take value in {{0, 1}} or {{-1, 1}} or "
                     "pass pos_label explicitly.".format(
                         classes_repr=classes_repr))
elif pos_label is None:
    pos_label = 1.

We could indeed do a small refactoring, so that pos_label would be in line with:

pos_label : int or str, default=None
    The label of the positive class. When pos_label=None, if y_true is in {-1, 1}
    or {0, 1}, pos_label is set to 1; otherwise an error will be raised.

raise ValueError("y_prob has values outside of [0, 1] range")

labels = np.unique(y_true)
if len(labels) > 2:
Member


Here I think that we should use sklearn.utils.multiclass.type_of_target for consistency:

y_type = type_of_target(y_true)
if y_type != "binary":
    raise ValueError(...)

y_pred = np.array([0.25, 0.25, 0.25, 0.25] + [0.75, 0.75, 0.75, 0.75])
sample_weight = np.array([1, 1, 1, 1] + [3, 3, 3, 3])

assert_almost_equal(
Member


We are using the following pattern

assert calibration_error(...) == pytest.approx(0.1875)

instead of assert_almost_equal since we introduced pytest

0.2165063)


def test_calibration_error_raises():
Member


we can parametrize this test and use pytest

@pytest.mark.parametrize(
    "y_pred",
    [np.array([0.25, 0.25, 0.25, 0.25] + [0.75, 0.75, 0.75]),
     np.array([0.25, 0.25, 0.25, 0.25] + [0.75, 0.75, 0.75, 1.75]),
     ...],
)
def test_calibration_error_raises(y_pred):
    ...
    with pytest.raises(ValueError, match=err_msg):
        calibration_error(y_true, y_pred)

Member


If the err_msg is changing, we can include it in the parametrization as well.

y_true = np.array([0, 1, 0, 1])
y_pred = np.array([0., 0.25, 0.5, 0.75])

assert_almost_equal(
Member


use pytest.approx

y_true = np.array([1, 0, 0, 1])
y_pred = np.array([0.25, 0.25, 0.75, 0.75])

assert_almost_equal(
Member


use pytest.approx

@lucyleeow (Member)

Sorry to add more work, but would it be possible to add a quick note in model_evaluation.rst under the Brier score section:

This is because the Brier score can be decomposed as the sum of calibration loss and refinement loss [Bella2012]. Calibration loss is defined as the mean squared deviation from empirical probabilities derived from the slope of ROC segments.

to clarify that the calibration loss metric we implement here is not the same as the one referred to above, and maybe how they relate?

(ref: #18051 (comment))
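
For context, one common binned (Murphy-style) form of that decomposition can be written as below. This is not taken from the PR and is not identical to the ROC-segment definition of Bella et al. quoted above, but it shows how a binned calibration term plus a refinement term add up to the Brier score when predictions take a single value per bin:

    % N samples; bin b holds n_b samples with mean predicted probability
    % \bar{p}_b and empirical frequency of positives \bar{o}_b.
    \mathrm{BS}
      = \underbrace{\frac{1}{N}\sum_{b} n_b\,(\bar{p}_b - \bar{o}_b)^2}_{\text{calibration loss}}
      + \underbrace{\frac{1}{N}\sum_{b} n_b\,\bar{o}_b\,(1 - \bar{o}_b)}_{\text{refinement loss}}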

@rth (Member) commented Sep 11, 2020

Thanks for this work! In terms of implementation, can't we re-use the calculation of prob_true and prob_pred done in calibration_curve, as suggested in #18268, instead of re-computing it from scratch? Or do we need a re-implementation for the reduce_bias=True case? If that works, it would be good to add a test comparing a manual calculation of e.g. the ECE from calibration_curve with the added function.
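
As an illustration of that suggestion, a manual ECE computed from calibration_curve might look like the sketch below. The helper name manual_ece is hypothetical, uniform binning is assumed, and the bin counts have to be recomputed because calibration_curve does not return them:

    import numpy as np
    from sklearn.calibration import calibration_curve

    def manual_ece(y_true, y_prob, n_bins=10):
        prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
        # Recompute per-bin counts with the same uniform edges that
        # calibration_curve uses, keeping only the non-empty bins it returns.
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        counts = np.histogram(y_prob, bins=edges)[0]
        counts = counts[counts > 0]
        weights = counts / counts.sum()
        return np.sum(weights * np.abs(prob_true - prob_pred))

A test could then compare, for example, calibration_error(y_true, y_prob, norm='l1') against manual_ece(y_true, y_prob) with pytest.approx, assuming norm='l1' with uniform bins corresponds to the ECE.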

dsleo and others added 2 commits November 5, 2020 18:15
Base automatically changed from master to main January 22, 2021 10:50
Comment on lines +1753 to +1754
The :func:`calibration_error` function computes the expected and maximum
calibration errors as defined in [1]_ for binary classes.


Would the general ECE for multiclass problems be more desirable, as used in Guo 2017 (Ref 1)? This implementation seems targeted at binary classification. I believe you could use it as-is for multiclass if you pass in y_pred as the maximum predicted probability and y_true as 1 if the predicted class equals the true class and 0 otherwise. But I would think it more desirable to just have the multiclass logic under the hood.
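
A rough sketch of the binary reduction described in this comment (the helper name is hypothetical, and y_true is assumed to be label-encoded as 0..K-1 so that it matches the columns of y_proba):

    import numpy as np

    def top_label_reduction(y_true, y_proba):
        # Confidence of the predicted class and whether the prediction is correct.
        confidence = y_proba.max(axis=1)
        correct = (y_proba.argmax(axis=1) == y_true).astype(int)
        return correct, confidence

    # correct, confidence = top_label_reduction(y_test, clf.predict_proba(X_test))
    # calibration_error(correct, confidence)  # top-label ECE/MCE as described above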

sample_weight[i_start:i_end])
/ delta_count[i])
if norm == "l2" and reduce_bias:
delta_debias = (
@edwardclem commented Jun 6, 2021

Is this implementation of the bias term correct? I believe it should be (avg_pred_true[i] * (avg_pred_true[i] - 1) * delta_count[i])/ (count * (delta_count[i] - 1)) going off the definition here. The 1/count doesn't get distributed properly, I think. This will produce nans if there's only one element in a bin, so I'm not sure if this was changed for numerical stability reasons. The authors of the paper mention this in their implementation, so it seems like the "undefined if bin size is 1" issue is expected.
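
Writing out the per-bin term this comment argues for, with \hat{p}_i = avg_pred_true[i], n_i = delta_count[i] and N = count (notation added here for clarity, not taken from the PR):

    \Delta_i^{\text{debias}}
      = \frac{\hat{p}_i\,(\hat{p}_i - 1)\,n_i}{N\,(n_i - 1)}
      = -\,\frac{n_i}{N}\cdot\frac{\hat{p}_i\,(1 - \hat{p}_i)}{n_i - 1},

which is indeed undefined when a bin contains a single sample (n_i = 1).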

@lorentzenchr (Member)

We should have a look at https://cran.r-project.org/web/packages/reliabilitydiag/index.html and the accompanying paper https://doi.org/10.1073/pnas.2016191118. First, let us check whether this PR corresponds to their miscalibration measure (MCB). Second, they propose a binning scheme that might help us.

@ColdTeapot273K commented Jan 17, 2022

Folks, I've made a separate implementation in #22233 which has some specific benefits (e.g. binless). What do you think?

@lorentzenchr (Member)

I would much prefer a more general solution for calibration scores via a general score decomposition, as proposed in #23767.


norm : {'l1', 'l2', 'max'}, default='l2'
    Norm method. The l1-norm is the Expected Calibration Error (ECE),
    and the max-norm corresponds to the Maximum Calibration Error (MCE).


and l2 corresponds to...?

strategy : {'uniform', 'quantile'}, default='uniform'
Strategy used to define the widths of the bins.

uniform


Suggested change:
-    uniform
+    - 'uniform':

Fix the format to meet the sklearn docstring format, like here.


uniform
All bins have identical widths.
quantile


Suggested change:
-    quantile
+    - 'quantile':

Fix the format to meet the sklearn docstring format, like here.

positive class.

reduce_bias : bool, default=True
Add debiasing term as in Verified Uncertainty Calibration, A. Kumar.


if possible, add link to A. Kumar

@Jerry-Master

When are we going to see the calibration loss released?

Successfully merging this pull request may close these issues.

Feature Request: function to calculate Expected Calibration Error (ECE)