RFC: stop using scikit-learn stable_cumsum and instead use np/xp.cumsum directly #31533


Open
ogrisel opened this issue Jun 12, 2025 · 8 comments

Comments

@ogrisel
Member

ogrisel commented Jun 12, 2025

As discussed in https://github.com/scikit-learn/scikit-learn/pull/30878/files#r2142562746, our current stable_cumsum function brings very little value to the user: it performs an extra np.sum(x) to check that np.allclose(np.sum(x), np.cumsum(x)[-1]) holds and raises a warning when it does not. However, in most cases, users can do nothing about that warning.

Furthermore, as seen in the CI of #30878, the array API compatible libraries we test against do not have the same numerical stability behavior for sum and cumsum, which makes it challenging to write a test for the occurrence of this warning that is consistent across libraries.

So I would rather avoid the overhead of computing np.sum(x), always call np.cumsum or xp.cumsum directly, and deprecate sklearn.utils.extmath.stable_cumsum.
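
For reference, the behavior under discussion amounts to something like the following sketch (not the exact scikit-learn implementation): accumulate in float64 and warn when the last cumulative value disagrees with a separately computed np.sum:

import warnings
import numpy as np

def stable_cumsum_sketch(arr, rtol=1e-05, atol=1e-08):
    # Accumulate in float64 regardless of the input dtype.
    out = np.cumsum(arr, dtype=np.float64)
    expected = np.sum(arr, dtype=np.float64)
    # Warn when the sequentially accumulated last value drifted away from
    # np.sum's (pairwise-summed, hence more accurate) result.
    if not np.allclose(out[-1], expected, rtol=rtol, atol=atol):
        warnings.warn(
            "cumsum was found to be unstable: its last element does not "
            "correspond to sum",
            RuntimeWarning,
        )
    return out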

@github-actions github-actions bot added the Needs Triage Issue requires triage label Jun 12, 2025
@ogrisel ogrisel added Needs Decision Requires decision and removed Needs Triage Issue requires triage labels Jun 12, 2025
@ogrisel
Member Author

ogrisel commented Jun 12, 2025

If I remember correctly, this is one of the reasons why our kmeans++ implementation is slower than alternative implementations in other libraries such as Intel oneDAL.

@OmarManzoor
Contributor

I agree with this proposal. Thanks @ogrisel

@lesteve
Member

lesteve commented Jun 12, 2025

stable_cumsum also makes sure that the cumsum computation is done with float64, i.e. np.cumsum(..., dtype=np.float64). What would the plan be on this aspect?

Doing a bit of archeology, the original motivation for stable_cumsum was that roc_auc_score could give "wrong" results with float32 weights, see #6842.

I tried quickly removing stable_cumsum in sklearn.metrics._ranking._binary_clf_curve (see diff below) and I can reproduce the problematic behaviour locally:

from sklearn.metrics import roc_auc_score
import numpy as np

rng = np.random.RandomState(42)
n_samples = 4 * 10 ** 7
y = rng.randint(2, size=n_samples)
prediction = rng.normal(size=n_samples) + y * 0.01
trivial_weight = np.ones(n_samples)
print('without weights', roc_auc_score(y, prediction))
print('float32 weights', roc_auc_score(y, prediction, sample_weight=trivial_weight.astype('float32')))
print('float64 weights', roc_auc_score(y, prediction, sample_weight=trivial_weight.astype('float64')))

Output:

without weights 0.5027392452566966
float32 weights 0.5835567712783813
float64 weights 0.5027392452566966

diff --git a/sklearn/metrics/_ranking.py b/sklearn/metrics/_ranking.py
index 2d0e5211c2..802279a812 100644
--- a/sklearn/metrics/_ranking.py
+++ b/sklearn/metrics/_ranking.py
@@ -898,11 +898,11 @@ def _binary_clf_curve(y_true, y_score, pos_label=None, sample_weight=None):
     threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]
 
     # accumulate the true positives with decreasing threshold
-    tps = stable_cumsum(y_true * weight)[threshold_idxs]
+    tps = np.cumsum(y_true * weight)[threshold_idxs]
     if sample_weight is not None:
         # express fps as a cumsum to ensure fps is increasing even in
         # the presence of floating point errors
-        fps = stable_cumsum((1 - y_true) * weight)[threshold_idxs]
+        fps = np.cumsum((1 - y_true) * weight)[threshold_idxs]
     else:
         fps = 1 + threshold_idxs - tps
     return fps, tps, y_score[threshold_idxs]
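
The underlying failure mode is easy to reproduce in isolation: a sequential float32 accumulation stalls once the running total reaches 2**24, because adding 1.0 to 16777216.0 rounds back to 16777216.0. A minimal sketch (assuming NumPy; the values in the comments are what IEEE float32 rounding gives):

import numpy as np

x = np.ones(4 * 10**7, dtype=np.float32)
# Sequential float32 accumulation saturates at 2**24 = 16777216.
print(np.cumsum(x)[-1])                    # 16777216.0
# Accumulating in float64, as stable_cumsum forces, keeps every increment.
print(np.cumsum(x, dtype=np.float64)[-1])  # 40000000.0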

@OmarManzoor
Contributor

OmarManzoor commented Jun 12, 2025

I think we could just use xp.cumulative_sum with the max float dtype supported wherever this function is being used. https://data-apis.org/array-api/latest/API_specification/generated/array_api.cumulative_sum.html#array_api.cumulative_sum

So the main part that would be removed is the warning-related code.
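
A rough sketch of what this could look like, assuming NumPy >= 2.0 as the array namespace (the dtype-selection helper below is purely illustrative; scikit-learn's own array API utilities would have to pick the dtype in a device-aware way, since e.g. float64 is unavailable on PyTorch's MPS device):

import numpy as np  # NumPy >= 2.0 exposes the array API's cumulative_sum

def _max_float_dtype(xp):
    # Hypothetical helper: prefer float64 when the namespace exposes it.
    return getattr(xp, "float64", xp.float32)

def cumsum(x, xp):
    # Accumulate in the widest supported float dtype, no warning check.
    return xp.cumulative_sum(x, dtype=_max_float_dtype(xp))

x = np.ones(10, dtype=np.float32)
print(cumsum(x, np).dtype)  # float64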

@ogrisel
Member Author

ogrisel commented Jun 12, 2025

+1 for explicit casting to max float dtype in functions where there is a proven case that we need it.

But I would rather not cast when we don't need it (in particular, I suspect this is not needed in k-means++).

@jeremiedbb
Member

+1 to remove stable_cumsum and cast wherever needed.

@lucyleeow
Member

+1 to remove stable_cumsum and cast.

@ogrisel ogrisel removed the Needs Decision Requires decision label Jun 13, 2025
@ogrisel
Member Author

ogrisel commented Jun 13, 2025

I removed the "Needs Decision" label because we seem to have reached a consensus. Thanks, @lesteve, for the archeological work.
