Weighted variance computation for sparse data is not numerically stable #19546
It would also be nice to add similar tests for non-constant but significantly non-centered random data, to check the consistency between the sparse and dense array variants (incremental or not); see the sketch below. The tests marked XFAIL in #19527 and the test referenced in #19527 (review) should also be updated to use constant values larger than 1.
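A rough sketch of such a consistency check (an illustration only: it uses the public sklearn.utils.sparsefuncs.mean_variance_axis helper and a plain NumPy reference in place of the private dense variants mentioned above):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.utils.sparsefuncs import mean_variance_axis

# Non-constant but strongly non-centered random data, with some sparsity.
rng = np.random.RandomState(0)
X_dense = rng.uniform(0, 1, size=(1000, 5)) + 100.0
X_dense[rng.uniform(size=X_dense.shape) < 0.5] = 0.0
weights = rng.uniform(0, 1, size=1000)

# Sparse helper vs. a dense two-pass NumPy reference.
means_sp, vars_sp = mean_variance_axis(
    sp.csr_matrix(X_dense), axis=0, weights=weights
)
means_ref = np.average(X_dense, axis=0, weights=weights)
vars_ref = np.average((X_dense - means_ref) ** 2, axis=0, weights=weights)

# Ideally both differences stay tiny relative to the scale of the data.
print(np.max(np.abs(means_sp - means_ref)))
print(np.max(np.abs(vars_sp - vars_ref)))
```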
Reading the code, I realized that for the first call, [...]. Then one would need to actually chunk the constant feature into small batches of, say, 10 samples at a time and check that calling [...]
Note that the dense implementation [...]
The main culprit seems to be these lines:

https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/sparsefuncs_fast.pyx#L155

and:

https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/sparsefuncs_fast.pyx#L259

variances[i] += (sum_weights[i] - sum_weights_nz[i]) * means[i]**2

A small rounding error in the computation of [...]
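Presumably the problem is that the total weight sum and the weight sum over the stored non-zero entries add the same numbers but with different accumulation orders, so they can disagree by a few ulps, and that tiny discrepancy is then multiplied by means[i]**2. A minimal NumPy sketch of this mechanism (an illustration only, not the actual Cython code):

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples = 100_000
weights = rng.uniform(0, 1, n_samples)

# The same weights summed with two different accumulation strategies,
# mimicking a total weight sum vs. a weight sum over the stored entries
# of a fully dense (no implicit zeros) constant column.
sum_weights = float(np.add.reduce(weights))  # pairwise summation
sum_weights_nz = 0.0
for w in weights:                            # sequential summation
    sum_weights_nz += w

mean = 100.0  # constant column value, so the exact correction term is 0.0
spurious = (sum_weights - sum_weights_nz) * mean ** 2
print(sum_weights - sum_weights_nz)  # usually a tiny but non-zero difference
print(spurious)                      # amplified by mean**2, far from 0.0
```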
I am not sure how to fix this. The easiest way would be to materialize a dense representation of the sparse columns, one at a time to avoid a memory explosion, and call the numerically stable [...] on each of them.

A slightly more efficient way would be to iterate explicitly over the zero-valued elements and multiply the weight by the squared mean before summing it into the partial unnormalized variance accumulator. That would still be significantly more expensive than the current code, but probably less so than materializing the columns; although I am not so sure, because of the non-uniform memory access patterns. Finally, the code to manipulate row and column indices w.r.t. the CSR or CSC data structures would probably be much more complex.
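A minimal sketch of the first idea (an illustration only: it uses NumPy's two-pass weighted formulas as a stand-in for the stable dense routine, which I assume refers to sklearn.utils.extmath._incremental_mean_and_var):

```python
import numpy as np
import scipy.sparse as sp

def mean_var_by_column(X, sample_weight):
    # Densify one column at a time so the extra memory stays at
    # O(n_samples) instead of O(n_samples * n_features).
    X = sp.csc_matrix(X)
    n_samples, n_features = X.shape
    means = np.empty(n_features)
    variances = np.empty(n_features)
    for j in range(n_features):
        col = X[:, j].toarray().ravel()
        means[j] = np.average(col, weights=sample_weight)
        # Two-pass formula: deviations are taken from the already computed
        # mean, so a constant column gives an essentially zero variance.
        variances[j] = np.average((col - means[j]) ** 2, weights=sample_weight)
    return means, variances

# Constant non-centered feature: the variance should be (numerically) 0.
rng = np.random.RandomState(0)
X = sp.csr_matrix(np.full((1000, 3), 100.0))
sample_weight = rng.uniform(0, 1, 1000)
print(mean_var_by_column(X, sample_weight))
```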
Interesting. Thoughts: [...]
I agree.
I agree also, but the code is going to be complex, I think; not sure it's worth it. Also, for the non-weighted case,

variances[i] += (sum_weights[i] - sum_weights_nz[i]) * means[i]**2

so if we specialize this function for the non-weighted cases to use [...]
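The end of this comment seems to have been lost; my reading (an assumption on my part, not necessarily what was actually proposed) is that in the non-weighted case the two weight sums are just sample counts, so the number of implicit zeros can be tracked as an exact integer and the subtraction above becomes exact. A hypothetical sketch of that idea:

```python
import numpy as np
import scipy.sparse as sp

def unweighted_mean_var_csc(X_csc):
    # Hypothetical specialization for the non-weighted case: use an exact
    # integer count of the implicit zeros instead of the difference of two
    # floating point weight sums.
    n_samples, n_features = X_csc.shape
    means = np.zeros(n_features)
    variances = np.zeros(n_features)
    for j in range(n_features):
        start, end = X_csc.indptr[j], X_csc.indptr[j + 1]
        data = X_csc.data[start:end]
        n_nonzero = end - start                   # exact integer count
        means[j] = data.sum() / n_samples
        # Contribution of the stored (non-zero) entries...
        variances[j] = np.sum((data - means[j]) ** 2)
        # ...plus the implicit zeros, counted exactly as integers.
        variances[j] += (n_samples - n_nonzero) * means[j] ** 2
        variances[j] /= n_samples
    return means, variances

X = sp.csc_matrix(np.full((1000, 2), 100.0))  # constant columns, no zeros
print(unweighted_mean_var_csc(X))             # variances come out exactly 0
```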
This issue was discovered when adding tests for #19527 (currently marked XFAIL).
Here is a minimal reproduction case using the underlying private API:
https://gist.github.com/ogrisel/bd2cf3350fff5bbd5a0899fa6baf3267
The results are the following (macOS / arm64 / Python 3.9.1 / Cython 0.29.21 / clang 11.0.1): [...]
So the sklearn.utils.extmath._incremental_mean_and_var function for dense numpy arrays is numerically stable both in float64 and float32 (~1e-11 is much less than np.finfo(np.float32).eps), but the sparse counterparts, whether incremental or not, are all wrong in the same way.

So the gist above should be adapted to write a new series of tests for these Cython functions, and the fix will probably involve adapting the algorithm implemented in sklearn.utils.extmath._incremental_mean_and_var to the sparse case.

Note: there is another issue open about the numerical stability of StandardScaler (#5602 / #11549), but it is related to the computation of the (unweighted) mean in incremental mode (in partial_fit) vs full batch mode (in fit).
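A sketch of what such a test could look like (a hypothetical example, not the gist above nor actual scikit-learn test code; it assumes the public sklearn.utils.sparsefuncs.mean_variance_axis helper with its weights parameter, and it is expected to fail until the sparse computation is fixed, like the XFAIL tests of #19527):

```python
import numpy as np
import pytest
import scipy.sparse as sp

from sklearn.utils.sparsefuncs import mean_variance_axis


@pytest.mark.parametrize("dtype", [np.float32, np.float64])
@pytest.mark.parametrize("sparse_constructor", [sp.csr_matrix, sp.csc_matrix])
def test_weighted_variance_constant_noncentered_column(dtype, sparse_constructor):
    rng = np.random.RandomState(42)
    n_samples = 1_000
    # Constant, significantly non-centered feature: the true variance is 0.
    X = sparse_constructor(np.full((n_samples, 1), 100, dtype=dtype))
    weights = rng.uniform(low=0, high=1, size=n_samples).astype(dtype)

    means, variances = mean_variance_axis(X, axis=0, weights=weights)

    assert means[0] == pytest.approx(100, rel=1e-4)
    # Expected to fail until the sparse computation is fixed.
    assert variances[0] == pytest.approx(0, abs=np.finfo(dtype).eps)
```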