UX CalibrationDisplay's naive use can lead to very confusing results #30664

Open
@ogrisel

Description


The naive use of CalibrationDisplay silently leads to degenerate, noisy results when some bins contain only a few data points.

For instance, look at the variability of the calibration curves of a single fitted model evaluated on several resamplings of only 50 data points each, using n_bins=10 and the default strategy="uniform":

[Figure: five overlaid calibration curves from the resampled test sets, showing wide variability]

# %%
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibrationDisplay
from sklearn.model_selection import train_test_split


X, y = make_classification(n_samples=10_000, n_features=200, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# %%
clf = LogisticRegression(C=10).fit(X_train, y_train)
# %%

fig, ax = plt.subplots()
for seed in range(5):
    # resample the test set
    rng = np.random.RandomState(seed)
    indices = rng.choice(X_test.shape[0], size=50, replace=True)
    X_test_sample = X_test[indices]
    y_test_sample = y_test[indices]
    CalibrationDisplay.from_estimator(
        clf, X_test_sample, y_test_sample, n_bins=10, ax=ax, label=None
    )

This problem can easily happen with the default strategy="uniform" if the test set is not large enough. I think this class should warn the user whenever it generates bins with fewer than 10 data points.

A typical user will only see one of the curves above and will not suspect that it is just noise unless they manually plot the others via random resampling, as done above. Note that I chose a minimal test set to make the problem catastrophic, but it can also happen with larger sample sizes, in particular with the uniform strategy and on imbalanced datasets.
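For reference, here is a quick user-side check of the per-bin sample counts, reusing clf and the last resampled test set from the snippet above; the threshold of 10 samples matches the suggestion above. This is only an illustration, not an existing CalibrationDisplay feature:

import numpy as np

y_prob = clf.predict_proba(X_test_sample)[:, 1]
counts, _ = np.histogram(y_prob, bins=10, range=(0.0, 1.0))
print(counts)  # with only 50 points, several of the 10 uniform bins are nearly empty
if (counts < 10).any():
    print(f"{(counts < 10).sum()} of {counts.size} bins hold fewer than 10 samples")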

Updated recommendations

EDIT: based on the discussion below, here are my recommendations to address this issue:

  • display a confidence interval by default to reflect our uncertainty on the calibration curve estimate (the analytical Clopper-Pearson CI available through scipy looks both reliable and cheap to compute); see the first sketch after this list;
  • set n_bins=15 by default but also introduce min_samples_per_bins=50 to automatically merge the smallest pair of neighboring bins until that constraint is met (a merging sketch also follows this list);
  • do not warn, unless the merging leads to fewer than 3 bins, for instance;
  • we could keep strategy="uniform" if we like, but I wouldn't mind switching to "quantile" either.
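As a rough illustration of the first point, here is a minimal sketch of per-bin Clopper-Pearson intervals computed with scipy.stats.binomtest; the helper name, the uniform binning and the 95% level are my own choices for the sketch, not an existing scikit-learn API. The intervals could be drawn with ax.errorbar or ax.fill_between around the usual curve.

import numpy as np
from scipy.stats import binomtest


def calibration_curve_with_ci(y_true, y_prob, n_bins=10, confidence_level=0.95):
    """Per-bin mean predicted probability, empirical frequency and Clopper-Pearson CI."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # assign each prediction to a uniform bin, clipping y_prob == 1.0 into the last bin
    bin_ids = np.clip(np.searchsorted(edges, y_prob, side="right") - 1, 0, n_bins - 1)
    centers, freqs, lowers, uppers = [], [], [], []
    for b in range(n_bins):
        mask = bin_ids == b
        n = int(mask.sum())
        if n == 0:
            continue  # skip empty bins entirely
        k = int(y_true[mask].sum())
        # method="exact" is the Clopper-Pearson interval
        ci = binomtest(k, n).proportion_ci(confidence_level=confidence_level, method="exact")
        centers.append(float(y_prob[mask].mean()))
        freqs.append(k / n)
        lowers.append(ci.low)
        uppers.append(ci.high)
    return np.asarray(centers), np.asarray(freqs), np.asarray(lowers), np.asarray(uppers)

And a sketch of the proposed merging behavior for the hypothetical min_samples_per_bins parameter: greedily merge the neighboring pair of bins with the smallest combined count until every bin holds enough samples, or only one bin is left.

import numpy as np


def merge_small_bins(y_prob, edges, min_samples_per_bins=50):
    """Greedily drop inner edges until every bin holds at least min_samples_per_bins points."""
    edges = list(edges)
    while len(edges) > 2:  # stop once a single bin remains
        counts, _ = np.histogram(y_prob, bins=edges)
        if counts.min() >= min_samples_per_bins:
            break
        # merge the neighboring pair of bins with the smallest combined count
        i = int(np.argmin(counts[:-1] + counts[1:]))
        del edges[i + 1]  # removing the shared inner edge merges bins i and i + 1
    return np.asarray(edges)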

Original description and suggestion

In such cases, this warning could recommend to:

  • use strategy="quantile" (which is a bit less likely to generate near-empty, unstable bins);
  • decrease the number of bins;
  • increase the number of data points in the test set.

Beyond warnings, we could also offer the user extra options to assess the stability of the calibration curve.

There are two cases:

  • for a fixed trained estimator, we could estimate the uncertainty in its calibration curve induced by the finite size of the evaluation set (irrespective of whether the training procedure is stable or not) by resampling the test set with replacement:
fig, ax = plt.subplots()
for seed in range(10):
    rng = np.random.RandomState(seed)
    indices = rng.choice(X_test.shape[0], size=X_test.shape[0], replace=True)
    X_test_resampled = X_test[indices]
    y_test_resampled = y_test[indices]

    CalibrationDisplay.from_estimator(
        clf,
        X_test_resampled,
        y_test_resampled,
        strategy="quantile",
        n_bins=10,
        ax=ax,
        label=None,
    )

We could also offer an option to aggregate those curves by displaying a shaded region derived from the vertical quantiles (e.g. 0.05-0.95) of the interpolated y-curve values at a fine grid of x values, as sketched below. See also the related issue on sampling uncertainties in performance curves: #25856.
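A possible way to build such a band, reusing clf, X_test and y_test from the snippets above; the 200-point grid, the 100 bootstrap replicates and the 0.05/0.95 quantiles are arbitrary choices for the sketch, not an existing Display option:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

grid = np.linspace(0.0, 1.0, 200)
y_prob = clf.predict_proba(X_test)[:, 1]
curves = []
for seed in range(100):
    rng = np.random.RandomState(seed)
    idx = rng.choice(y_test.shape[0], size=y_test.shape[0], replace=True)
    prob_true, prob_pred = calibration_curve(
        y_test[idx], y_prob[idx], n_bins=10, strategy="quantile"
    )
    # interpolate each bootstrap curve on a common grid of predicted probabilities
    curves.append(np.interp(grid, prob_pred, prob_true))

low, high = np.quantile(np.stack(curves), [0.05, 0.95], axis=0)
fig, ax = plt.subplots()
ax.fill_between(grid, low, high, alpha=0.3, label="0.05-0.95 bootstrap band")
ax.plot([0, 1], [0, 1], "k:", label="perfectly calibrated")
ax.set_xlabel("Mean predicted probability")
ax.set_ylabel("Fraction of positives")
ax.legend()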

  • to assess the uncertainty of the calibration of a full fitting procedure (uncertainty induced by the finite training set size, by the training algorithm itself (impacted by the choice of hyperparameters), and by the finite evaluation set size), we could also offer to plot calibration curves from CV results (see ENH use cv_results in the different curve display to add confidence intervals #21211 to implement CalibratedDisplayCV.from_cv_results), as in the sketch below.
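A manual sketch of what such a CV-based display could show, reusing X and y from the first snippet: one calibration curve per fold, each fold refitting the full procedure (the 5-fold split and the quantile strategy are illustrative choices).

import matplotlib.pyplot as plt
from sklearn.calibration import CalibrationDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

fig, ax = plt.subplots()
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # refit the full procedure on each training fold and evaluate on the held-out fold
    fold_clf = LogisticRegression(C=10).fit(X[train_idx], y[train_idx])
    CalibrationDisplay.from_estimator(
        fold_clf, X[test_idx], y[test_idx], n_bins=10, strategy="quantile", ax=ax
    )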

Conversely, if all the bins hold more than 1000 data points and n_bins <= 10, we could suggest that the user increase n_bins to get a finer estimate of the calibration curve. Maybe this recommendation belongs in a docstring rather than in a warning.

I think we could start simple by opening a first PR to issue warnings for the nearly empty bin problem (with an option to silence them).
