RFC: stop using scikit-learn `stable_cumsum` and instead use `np/xp.cumsum` directly #31533
Comments
If I remember correctly, this is one of the reasons why our kmeans++ implementation is slower than alternative implementations in other libraries such as Intel oneDAL.
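For context, `stable_cumsum` performs a full extra `np.sum` pass plus an `np.allclose` check on top of the cumulative sum itself (see the issue description below). A rough micro-benchmark sketch of that overhead (the array size and timing harness are my own choices; absolute numbers vary by machine):

```python
from timeit import timeit

import numpy as np

from sklearn.utils.extmath import stable_cumsum

x = np.random.default_rng(0).random(10**7)

# Plain cumulative sum versus the checked variant being discussed here.
print("np.cumsum    :", timeit(lambda: np.cumsum(x), number=10))
print("stable_cumsum:", timeit(lambda: stable_cumsum(x), number=10))
```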
I agree with this proposal. Thanks @ogrisel
Doing a bit of archeology on the original motivation for `stable_cumsum`, I tried quickly removing it and comparing `roc_auc_score` results:

```python
from sklearn.metrics import roc_auc_score
import numpy as np
rng = np.random.RandomState(42)
n_samples = 4 * 10 ** 7
y = rng.randint(2, size=n_samples)
prediction = rng.normal(size=n_samples) + y * 0.01
trivial_weight = np.ones(n_samples)
print('without weights', roc_auc_score(y, prediction))
print('float32 weights', roc_auc_score(y, prediction, sample_weight=trivial_weight.astype('float32')))
print('float64 weights', roc_auc_score(y, prediction, sample_weight=trivial_weight.astype('float64')))
```

with the following patch applied:

```diff
diff --git a/sklearn/metrics/_ranking.py b/sklearn/metrics/_ranking.py
index 2d0e5211c2..802279a812 100644
--- a/sklearn/metrics/_ranking.py
+++ b/sklearn/metrics/_ranking.py
@@ -898,11 +898,11 @@ def _binary_clf_curve(y_true, y_score, pos_label=None, sample_weight=None):
     threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]

     # accumulate the true positives with decreasing threshold
-    tps = stable_cumsum(y_true * weight)[threshold_idxs]
+    tps = np.cumsum(y_true * weight)[threshold_idxs]
     if sample_weight is not None:
         # express fps as a cumsum to ensure fps is increasing even in
         # the presence of floating point errors
-        fps = stable_cumsum((1 - y_true) * weight)[threshold_idxs]
+        fps = np.cumsum((1 - y_true) * weight)[threshold_idxs]
     else:
         fps = 1 + threshold_idxs - tps
     return fps, tps, y_score[threshold_idxs]
```
I think we could just use `xp.cumulative_sum` with the max float dtype supported wherever this function is being used: https://data-apis.org/array-api/latest/API_specification/generated/array_api.cumulative_sum.html#array_api.cumulative_sum So the main part that would be removed is the warning-related code.
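For illustration, a minimal sketch of that idea with plain NumPy (the helper name `cumsum_max_float` is hypothetical, and float64 is assumed to be the widest supported float; an array API namespace on a device without float64 support, e.g. MPS, would need a capability check):

```python
import numpy as np

def cumsum_max_float(x):
    # Accumulate in the widest float dtype assumed available (float64 here);
    # with an array API namespace this would be
    # xp.cumulative_sum(x, dtype=max_float) instead.
    return np.cumsum(x, dtype=np.float64)

x = np.full(10**7, 0.1, dtype=np.float32)
print(np.cumsum(x)[-1])         # drifts from the expected 1e6 (sequential float32 accumulation)
print(cumsum_max_float(x)[-1])  # ~1e6: float64 accumulation keeps the rounding error small
```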
+1 for explicit casting to the max float dtype in functions where there is a proven case that we need it. But I would rather not cast when we don't need it (in particular, I suspect this is not needed in k-means++).
+1 to remove
+1 to remove
I removed the "Needs Decision" label because we seem to have reached a consensus. Thanks, @lesteve, for the archeological work.
Issue description
As discussed in https://github.com/scikit-learn/scikit-learn/pull/30878/files#r2142562746, our current `stable_cumsum` function brings very little value to the user: it does extra computation to check that `np.allclose(np.sum(x), np.cumsum(x)[-1])` holds and raises a warning otherwise. However, in most cases, users can do nothing about the warning.

Furthermore, as seen in the CI of #30878, the array API compatible libraries we test against do not have the same numerical stability behavior for `sum` and `cumsum`, which makes it challenging to write a test for the occurrence of this warning that is consistent across libraries.

So I would rather not waste the overhead of computing `np.sum(x)`, always call `np.cumsum` or `xp.cumsum` directly, and deprecate `sklearn.utils.extmath.stable_cumsum`.
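For reference, the function being deprecated behaves roughly like the sketch below (a paraphrase consistent with the description above, not the verbatim scikit-learn source):

```python
import warnings

import numpy as np

def stable_cumsum_sketch(arr, axis=None, rtol=1e-05, atol=1e-08):
    # Accumulate in float64, then compare the last cumsum entry against a
    # separately computed sum and warn when they disagree beyond tolerance.
    out = np.cumsum(arr, axis=axis, dtype=np.float64)
    expected = np.sum(arr, axis=axis, dtype=np.float64)
    if not np.allclose(out.take(-1, axis=axis), expected, rtol=rtol, atol=atol):
        warnings.warn(
            "cumsum was found to be unstable: its last element does not "
            "correspond to sum",
            RuntimeWarning,
        )
    return out
```

The proposal amounts to dropping the `expected`/`allclose`/warning lines and calling `np.cumsum` or `xp.cumulative_sum` directly.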