RFC: stop using scikit-learn stable_cumsum and instead use np/xp.cumsum directly #31533


Open
ogrisel opened this issue Jun 12, 2025 · 8 comments

Comments

@ogrisel
Member

ogrisel commented Jun 12, 2025

As discussed in https://github.com/scikit-learn/scikit-learn/pull/30878/files#r2142562746, our current stable_cumsum function brings very little value to the user: it performs an extra np.sum(x) to check that np.allclose(np.sum(x), np.cumsum(x)[-1]) holds and raises a warning when it does not. However, in most cases, users can do nothing about that warning.

Furthermore, as seen in the CI of #30878, the array API compatible libraries we test against do not have the same numerical stability behavior for sum and cumsum, which makes it challenging to write a test for the occurrence of this warning that is consistent across libraries.

So I would rather avoid the overhead of computing np.sum(x), always call np.cumsum or xp.cumsum directly, and deprecate sklearn.utils.extmath.stable_cumsum.
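
For reference, the behavior under discussion amounts to something like the following sketch (not the exact scikit-learn implementation): accumulate in float64 and warn when the last cumulative value disagrees with a separately computed np.sum:

import warnings
import numpy as np

def stable_cumsum_sketch(arr, rtol=1e-05, atol=1e-08):
    # Accumulate in float64 regardless of the input dtype.
    out = np.cumsum(arr, dtype=np.float64)
    expected = np.sum(arr, dtype=np.float64)
    # Warn when the sequentially accumulated last value drifted away from
    # np.sum's (pairwise-summed, hence more accurate) result.
    if not np.allclose(out[-1], expected, rtol=rtol, atol=atol):
        warnings.warn(
            "cumsum was found to be unstable: its last element does not "
            "correspond to sum",
            RuntimeWarning,
        )
    return out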

@github-actions github-actions bot added the Needs Triage Issue requires triage label Jun 12, 2025
@ogrisel ogrisel added Needs Decision Requires decision and removed Needs Triage Issue requires triage labels Jun 12, 2025
@ogrisel
Member Author

ogrisel commented Jun 12, 2025

If I remember correctly, this is one of the reasons why our kmeans++ implementation is slower than alternative implementations in other libraries such as Intel oneDAL.

@OmarManzoor
Contributor

I agree with this proposal. Thanks @ogrisel

@lesteve
Member

lesteve commented Jun 12, 2025

stable_cumsum also makes sure that the cumsum computation is done with float64, i.e. np.cumsum(..., dtype=np.float64). What would the plan be on this aspect?

Doing a bit of archeology, the original motivation for stable_cumsum was that roc_auc_score could give "wrong" results with float32 weights, see #6842.

I tried quickly removing stable_cumsum in sklearn.metrics._ranking._binary_clf_curve (see diff below) and I can reproduce the problematic behaviour locally:

from sklearn.metrics import roc_auc_score
import numpy as np

rng = np.random.RandomState(42)
n_samples = 4 * 10 ** 7
y = rng.randint(2, size=n_samples)
prediction = rng.normal(size=n_samples) + y * 0.01
trivial_weight = np.ones(n_samples)
print('without weights', roc_auc_score(y, prediction))
print('float32 weights', roc_auc_score(y, prediction, sample_weight=trivial_weight.astype('float32')))
print('float64 weights', roc_auc_score(y, prediction, sample_weight=trivial_weight.astype('float64')))

Output:

without weights 0.5027392452566966
float32 weights 0.5835567712783813
float64 weights 0.5027392452566966

diff --git a/sklearn/metrics/_ranking.py b/sklearn/metrics/_ranking.py
index 2d0e5211c2..802279a812 100644
--- a/sklearn/metrics/_ranking.py
+++ b/sklearn/metrics/_ranking.py
@@ -898,11 +898,11 @@ def _binary_clf_curve(y_true, y_score, pos_label=None, sample_weight=None):
     threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]
 
     # accumulate the true positives with decreasing threshold
-    tps = stable_cumsum(y_true * weight)[threshold_idxs]
+    tps = np.cumsum(y_true * weight)[threshold_idxs]
     if sample_weight is not None:
         # express fps as a cumsum to ensure fps is increasing even in
         # the presence of floating point errors
-        fps = stable_cumsum((1 - y_true) * weight)[threshold_idxs]
+        fps = np.cumsum((1 - y_true) * weight)[threshold_idxs]
     else:
         fps = 1 + threshold_idxs - tps
     return fps, tps, y_score[threshold_idxs]
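
The underlying failure mode is easy to reproduce in isolation: a sequential float32 accumulation stalls once the running total reaches 2**24, because adding 1.0 to 16777216.0 rounds back to 16777216.0. A minimal sketch (assuming NumPy; the values in the comments are what IEEE float32 rounding gives):

import numpy as np

x = np.ones(4 * 10**7, dtype=np.float32)
# Sequential float32 accumulation saturates at 2**24 = 16777216.
print(np.cumsum(x)[-1])                    # 16777216.0
# Accumulating in float64, as stable_cumsum forces, keeps every increment.
print(np.cumsum(x, dtype=np.float64)[-1])  # 40000000.0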

@OmarManzoor
Contributor

OmarManzoor commented Jun 12, 2025

I think we could just use xp.cumulative_sum with the max float dtype supported wherever this function is being used. https://data-apis.org/array-api/latest/API_specification/generated/array_api.cumulative_sum.html#array_api.cumulative_sum

So the main part that would be removed is the warning-related code.
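
A rough sketch of what this could look like, assuming NumPy >= 2.0 as the array namespace (the dtype-selection helper below is purely illustrative; scikit-learn's own array API utilities would have to pick the dtype in a device-aware way, since e.g. float64 is unavailable on PyTorch's MPS device):

import numpy as np  # NumPy >= 2.0 exposes the array API's cumulative_sum

def _max_float_dtype(xp):
    # Hypothetical helper: prefer float64 when the namespace exposes it.
    return getattr(xp, "float64", xp.float32)

def cumsum(x, xp):
    # Accumulate in the widest supported float dtype, no warning check.
    return xp.cumulative_sum(x, dtype=_max_float_dtype(xp))

x = np.ones(10, dtype=np.float32)
print(cumsum(x, np).dtype)  # float64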

@ogrisel
Member Author

ogrisel commented Jun 12, 2025

+1 for explicit casting to max float dtype in functions where there is a proven case that we need it.

But I would rather not cast when we don't need it (in particular, I suspect this is not needed in k-means++).

@jeremiedbb
Member

+1 to remove stable_cumsum and cast wherever needed.

@lucyleeow
Member

+1 to remove stable_cumsum and cast.

@ogrisel ogrisel removed the Needs Decision Requires decision label Jun 13, 2025
@ogrisel
Member Author

ogrisel commented Jun 13, 2025

I removed the "Needs Decision" label because we seem to have reached a consensus. Thanks, @lesteve, for the archeological work.
