Description
Describe the workflow you want to enable
Proposition
I'd like to decompose scores (at least those of consistent scoring functions for identifiable functionals) into meaningful additive components:
- MSC ≥ 0: miscalibration (for the Brier score also known as the reliability term)
- DSC ≥ 0: discrimination (for the Brier score also known as the resolution term)
- UNC ≥ 0: uncertainty or entropy

such that (lower is better) `score_loss = MSC - DSC + UNC`, see [1, 2, 3, 4]. An implementation in R is available at [5].
For classification with the Brier score:

```python
from sklearn.metrics import brier_score_loss, score_decompose

model.fit(X_train, y_train)
y_pred = model.predict_proba(X_test)[:, 1]
msc, dsc, unc = score_decompose(y_test, y_pred, brier_score_loss)
assert msc - dsc + unc == brier_score_loss(y_test, y_pred)
```
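Note that `score_decompose` above is the proposed API, not an existing one. As a proof of concept, the CORP decomposition of the Brier score from [2] can already be computed with scikit-learn's existing `IsotonicRegression`; the sketch below uses a hypothetical helper name `corp_brier_decomposition` and synthetic data:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss


def corp_brier_decomposition(y_true, y_prob):
    """CORP decomposition of the Brier score, following [2].

    Returns (msc, dsc, unc) such that msc - dsc + unc equals the
    Brier score exactly, with all three terms nonnegative.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # PAV-(re)calibrated probabilities: isotonic regression of y on the forecasts
    y_cal = IsotonicRegression(y_min=0.0, y_max=1.0).fit_transform(y_prob, y_true)
    score = np.mean((y_prob - y_true) ** 2)  # Brier score of the original forecast
    score_cal = np.mean((y_cal - y_true) ** 2)  # ... of the calibrated forecast
    score_ref = np.mean((y_true.mean() - y_true) ** 2)  # ... of the marginal frequency
    return score - score_cal, score_ref - score_cal, score_ref


rng = np.random.default_rng(0)
y_prob = rng.uniform(size=200)  # synthetic, miscalibrated forecasts
y_true = (rng.uniform(size=200) < y_prob**2).astype(int)
msc, dsc, unc = corp_brier_decomposition(y_true, y_prob)
assert np.isclose(msc - dsc + unc, brier_score_loss(y_true, y_prob))
assert msc >= 0 and dsc >= 0 and unc >= 0
```

Exactness holds by construction (the calibrated-score terms cancel), and nonnegativity of MSC and DSC follows because both the identity and any constant are feasible monotone fits for the isotonic regression.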
For quantile regression with the pinball loss:

```python
from sklearn.metrics import mean_pinball_loss, score_decompose

alpha = 0.8  # the 80% quantile
model.set_params(quantile=alpha).fit(X_train, y_train)
y_pred = model.predict(X_test)
msc, dsc, unc = score_decompose(y_test, y_pred, mean_pinball_loss, alpha=alpha)
assert msc - dsc + unc == mean_pinball_loss(y_test, y_pred, alpha=alpha)
```
References
[1] Pohle (2020) https://arxiv.org/abs/2005.01835
[2] Dimitriadis, Gneiting, Jordan (2020) https://arxiv.org/abs/2008.03033
[3] Gneiting & Resin (2021) https://arxiv.org/abs/2108.03210
[4] Fissler, Lorentzen & Mayer (2022) https://arxiv.org/abs/2202.12780
[5] reliabilitydiag (R package): https://github.com/aijordan/reliabilitydiag
Describe your proposed solution
From [2]:

> While there is a consensus on the character and intuitive interpretation of the decomposition terms, their exact form remains subject to debate, despite a half century quest in the wake of Murphy’s [1973] Brier score decomposition. In particular, Murphy’s decomposition is exact in the discrete case, but fails to be exact under continuous forecasts, which has prompted the development of increasingly complex types of decompositions (...).
>
> In the extant literature, it has been assumed implicitly or explicitly that the calibrated and reference forecasts can be chosen at researchers’ discretion (...). We argue otherwise and contend that the calibrated forecasts ought to be the PAV-(re)calibrated probabilities, as displayed in the CORP reliability diagram, whereas the reference forecast r ought to be the marginal event frequency (...). We refer to the resulting decomposition as the CORP score decomposition, which enjoys the following properties:
>
> - MCB ≥ 0 with equality if the original forecast is calibrated.
> - DSC ≥ 0 with equality if the PAV-calibrated forecast is constant.
> - The decomposition is exact.
>
> Perhaps surprisingly, the PAV algorithm and its appealing properties generalize from probabilistic classifiers to mean, quantile, and expectile assessments for real-valued outcomes.
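The generalization to real-valued outcomes can be illustrated for the mean functional, where the consistent score is the squared error and the PAV recalibration is plain isotonic regression. A sketch with synthetic data (variable names are illustrative):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
y_pred = rng.normal(size=300)  # synthetic mean forecasts
y_true = 0.5 * y_pred + rng.normal(scale=0.5, size=300)  # forecasts are miscalibrated

# PAV-(re)calibrated mean forecasts
y_cal = IsotonicRegression().fit_transform(y_pred, y_true)

score = mean_squared_error(y_true, y_pred)
score_cal = mean_squared_error(y_true, y_cal)
score_ref = mean_squared_error(y_true, np.full_like(y_true, y_true.mean()))

mcb, dsc, unc = score - score_cal, score_ref - score_cal, score_ref
assert np.isclose(mcb - dsc + unc, score)  # the decomposition is exact
assert mcb >= 0 and dsc >= 0  # both terms nonnegative, as in the quote above
```

The same three-term construction stays exact because the calibrated-score terms cancel algebraically, independently of the functional.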
Describe alternatives you've considered, if relevant
No response
Additional context
#23132 is related and might be a good first step towards `score_decompose`.
Solving this issue will also solve #21774.