UX CalibrationDisplay's naive use can lead to very confusing results #30664

Open
@ogrisel

Description


The naive use of CalibrationDisplay silently leads to degenerate, noisy results when some bins contain only a few data points.

For instance, look at the variability of the calibration curves of a single fitted model evaluated on several resamplings of only 50 data points each, using n_bins=10 and the default strategy="uniform":

[Figure: five overlaid calibration curves from the resampled test sets, showing wide variability]

# %%
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibrationDisplay
from sklearn.model_selection import train_test_split


X, y = make_classification(n_samples=10_000, n_features=200, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# %%
clf = LogisticRegression(C=10).fit(X_train, y_train)
# %%

fig, ax = plt.subplots()
for seed in range(5):
    # resample the test set
    rng = np.random.RandomState(seed)
    indices = rng.choice(X_test.shape[0], size=50, replace=True)
    X_test_sample = X_test[indices]
    y_test_sample = y_test[indices]
    CalibrationDisplay.from_estimator(
        clf, X_test_sample, y_test_sample, n_bins=10, ax=ax, label=None
    )

This problem can easily happen with the default strategy="uniform" if the test set is not large enough. I think this class should warn the user whenever it generates bins with fewer than 10 data points.

A typical user will only see one of the curves above and will not suspect that it is just noise unless they manually plot the others via random resampling, as done above. Note that I chose a minimal test set to make the problem catastrophic, but it can also happen with larger sample sizes, in particular with the uniform strategy and on imbalanced datasets.
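For reference, here is a quick user-side check of the per-bin sample counts, reusing clf and the last resampled test set from the snippet above; the threshold of 10 samples matches the suggestion above. This is only an illustration, not an existing CalibrationDisplay feature:

import numpy as np

y_prob = clf.predict_proba(X_test_sample)[:, 1]
counts, _ = np.histogram(y_prob, bins=10, range=(0.0, 1.0))
print(counts)  # with only 50 points, several of the 10 uniform bins are nearly empty
if (counts < 10).any():
    print(f"{(counts < 10).sum()} of {counts.size} bins hold fewer than 10 samples")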

Updated recommendations

EDIT: based on the discussion below, here are my recommendations to address this issue:

  • display a confidence interval by default to reflect our uncertainty on the calibration curve estimate (the analytical Clopper-Pearson CI available through scipy looks both reliable and cheap to compute); see the first sketch after this list;
  • set n_bins=15 by default but also introduce min_samples_per_bins=50 to automatically merge the smallest pair of neighboring bins until that constraint is met (a merging sketch also follows this list);
  • do not warn, unless the merging leads to fewer than 3 bins, for instance;
  • we could keep strategy="uniform" if we like, but I wouldn't mind switching to "quantile" either.
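As a rough illustration of the first point, here is a minimal sketch of per-bin Clopper-Pearson intervals computed with scipy.stats.binomtest; the helper name, the uniform binning and the 95% level are my own choices for the sketch, not an existing scikit-learn API. The intervals could be drawn with ax.errorbar or ax.fill_between around the usual curve.

import numpy as np
from scipy.stats import binomtest


def calibration_curve_with_ci(y_true, y_prob, n_bins=10, confidence_level=0.95):
    """Per-bin mean predicted probability, empirical frequency and Clopper-Pearson CI."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # assign each prediction to a uniform bin, clipping y_prob == 1.0 into the last bin
    bin_ids = np.clip(np.searchsorted(edges, y_prob, side="right") - 1, 0, n_bins - 1)
    centers, freqs, lowers, uppers = [], [], [], []
    for b in range(n_bins):
        mask = bin_ids == b
        n = int(mask.sum())
        if n == 0:
            continue  # skip empty bins entirely
        k = int(y_true[mask].sum())
        # method="exact" is the Clopper-Pearson interval
        ci = binomtest(k, n).proportion_ci(confidence_level=confidence_level, method="exact")
        centers.append(float(y_prob[mask].mean()))
        freqs.append(k / n)
        lowers.append(ci.low)
        uppers.append(ci.high)
    return np.asarray(centers), np.asarray(freqs), np.asarray(lowers), np.asarray(uppers)

And a sketch of the proposed merging behavior for the hypothetical min_samples_per_bins parameter: greedily merge the neighboring pair of bins with the smallest combined count until every bin holds enough samples, or only one bin is left.

import numpy as np


def merge_small_bins(y_prob, edges, min_samples_per_bins=50):
    """Greedily drop inner edges until every bin holds at least min_samples_per_bins points."""
    edges = list(edges)
    while len(edges) > 2:  # stop once a single bin remains
        counts, _ = np.histogram(y_prob, bins=edges)
        if counts.min() >= min_samples_per_bins:
            break
        # merge the neighboring pair of bins with the smallest combined count
        i = int(np.argmin(counts[:-1] + counts[1:]))
        del edges[i + 1]  # removing the shared inner edge merges bins i and i + 1
    return np.asarray(edges)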

Original description and suggestion

In such cases, this warning could recommend to:

  • use strategy="quantile" (which is a bit less likely to generate near-empty, unstable bins);
  • decrease the number of bins;
  • increase the number of data points in the test set.

Beyond warnings, we could also offer the user extra options to assess the stability of the calibration curve.

There are two cases:

  • for a fixed trained estimator, we could estimate the uncertainty in its calibration curve induced by the finite size of the evaluation set (irrespective of whether the training procedure is stable or not) by resampling the test set with replacement:
fig, ax = plt.subplots()
for seed in range(10):
    rng = np.random.RandomState(seed)
    indices = rng.choice(X_test.shape[0], size=X_test.shape[0], replace=True)
    X_test_resampled = X_test[indices]
    y_test_resampled = y_test[indices]

    CalibrationDisplay.from_estimator(
        clf,
        X_test_resampled,
        y_test_resampled,
        strategy="quantile",
        n_bins=10,
        ax=ax,
        label=None,
    )

We could also offer an option to aggregate those curves by displaying a shaded region derived from the vertical quantiles (e.g. 0.05-0.95) of the interpolated y-curve values at a fine grid of x values, as sketched below. See also the related issue on sampling uncertainties in performance curves: #25856.
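A possible way to build such a band, reusing clf, X_test and y_test from the snippets above; the 200-point grid, the 100 bootstrap replicates and the 0.05/0.95 quantiles are arbitrary choices for the sketch, not an existing Display option:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

grid = np.linspace(0.0, 1.0, 200)
y_prob = clf.predict_proba(X_test)[:, 1]
curves = []
for seed in range(100):
    rng = np.random.RandomState(seed)
    idx = rng.choice(y_test.shape[0], size=y_test.shape[0], replace=True)
    prob_true, prob_pred = calibration_curve(
        y_test[idx], y_prob[idx], n_bins=10, strategy="quantile"
    )
    # interpolate each bootstrap curve on a common grid of predicted probabilities
    curves.append(np.interp(grid, prob_pred, prob_true))

low, high = np.quantile(np.stack(curves), [0.05, 0.95], axis=0)
fig, ax = plt.subplots()
ax.fill_between(grid, low, high, alpha=0.3, label="0.05-0.95 bootstrap band")
ax.plot([0, 1], [0, 1], "k:", label="perfectly calibrated")
ax.set_xlabel("Mean predicted probability")
ax.set_ylabel("Fraction of positives")
ax.legend()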

  • to assess the uncertainty of the calibration of a full fitting procedure (uncertainty induced by the finite training set size, by the training algorithm itself (impacted by the choice of hyperparameters), and by the finite evaluation set size), we could also offer to plot calibration curves from CV results (see ENH use cv_results in the different curve display to add confidence intervals #21211 to implement CalibratedDisplayCV.from_cv_results), as in the sketch below.
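A manual sketch of what such a CV-based display could show, reusing X and y from the first snippet: one calibration curve per fold, each fold refitting the full procedure (the 5-fold split and the quantile strategy are illustrative choices).

import matplotlib.pyplot as plt
from sklearn.calibration import CalibrationDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

fig, ax = plt.subplots()
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # refit the full procedure on each training fold and evaluate on the held-out fold
    fold_clf = LogisticRegression(C=10).fit(X[train_idx], y[train_idx])
    CalibrationDisplay.from_estimator(
        fold_clf, X[test_idx], y[test_idx], n_bins=10, strategy="quantile", ax=ax
    )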

Conversely, if all the bins hold more than 1000 data points and n_bins <= 10, we could suggest that the user increase n_bins to get a finer estimate of the calibration curve. Maybe this recommendation belongs in a docstring rather than in a warning.

I think we could start simple by opening a first PR to issue warnings for the nearly empty bin problem (with an option to silence them).
