Weighted variance computation for sparse data is not numerically stable #19546

Closed

ogrisel opened this issue Feb 24, 2021 · 7 comments · Fixed by #19766
Labels
Bug · Moderate · module:linear_model · module:preprocessing

Comments

ogrisel commented Feb 24, 2021

This issue was discovered when adding tests for #19527 (currently marked XFAIL).

Here is a minimal reproduction case using the underlying private API:

https://gist.github.com/ogrisel/bd2cf3350fff5bbd5a0899fa6baf3267

The results are as follows (macOS / arm64 / Python 3.9.1 / Cython 0.29.21 / clang 11.0.1):

    ## dtype=float64
    _incremental_mean_and_var    [100.] [0.]
    csr_mean_variance_axis0      [100.] [-2.18040566e-11]
    incr_mean_variance_axis0 csr [100.] [-2.18040566e-11]
    csc_mean_variance_axis0      [100.] [-2.18040566e-11]
    incr_mean_variance_axis0 csc [100.] [-2.18040566e-11]
    ## dtype=float32
    _incremental_mean_and_var    [100.00000577] [3.32692735e-11]
    csr_mean_variance_axis0      [99.99997] [0.00123221]
    incr_mean_variance_axis0 csr [99.99997] [0.00123221]
    csc_mean_variance_axis0      [99.99997] [0.00123221]
    incr_mean_variance_axis0 csc [99.99997] [0.00123221]

So the sklearn.utils.extmath._incremental_mean_and_var function for dense numpy arrays is numerically stable both in float64 and float32 (~1e-11 is much less than np.finfo(np.float32).eps), but the sparse counterparts, whether incremental or not, are all wrong in the same way.

The gist above should therefore be adapted into a new series of tests for these Cython functions, and the fix will probably involve adapting the algorithm implemented in sklearn.utils.extmath._incremental_mean_and_var to the sparse case.
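
As a starting point, here is a hedged sketch of what such a test could look like; it assumes the private Cython helpers accept a weights keyword as on the current development branch, and the actual gist and final tests may of course differ:

    import numpy as np
    import scipy.sparse as sp
    from numpy.testing import assert_allclose
    from sklearn.utils.sparsefuncs_fast import (
        csc_mean_variance_axis0,
        csr_mean_variance_axis0,
    )


    def check_sparse_weighted_variance_constant_feature(dtype):
        rng = np.random.RandomState(42)
        n_samples = 10_000
        # Constant feature with a large value: the true weighted variance is 0.
        X = np.full((n_samples, 1), 100.0, dtype=dtype)
        sample_weight = rng.uniform(0.5, 1.5, size=n_samples).astype(dtype)

        for X_sparse, func in (
            (sp.csr_matrix(X), csr_mean_variance_axis0),
            (sp.csc_matrix(X), csc_mean_variance_axis0),
        ):
            means, variances = func(X_sparse, weights=sample_weight)
            assert_allclose(means, [100.0], rtol=1e-6)
            # This is the assertion that currently fails: the variance comes
            # out slightly off (even negative) instead of (close to) zero.
            assert_allclose(variances, [0.0], atol=np.finfo(dtype).eps)


    for dtype in (np.float64, np.float32):
        check_sparse_weighted_variance_constant_feature(dtype)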

Note: there is another open issue about the numerical stability of StandardScaler (#5602 / #11549), but it is related to the computation of the (unweighted) mean in incremental mode (partial_fit) vs. full batch mode (fit).

ogrisel added the Bug, Moderate, module:linear_model, module:preprocessing and help wanted labels on Feb 24, 2021

ogrisel commented Feb 24, 2021

It would also be nice to add similar tests for non-constant but significantly non-centered random data to check the consistency between the sparse and dense array variants (incremental or not).

The tests marked XFAIL in #19527, as well as the test referenced in #19527 (review), should also be updated to use constant values larger than 1.

ogrisel commented Feb 24, 2021

Reading the code, I realized that on the first call, incr_mean_variance_axis0 delegates to _csr_mean_variance_axis0 and _csc_mean_variance_axis0, so those functions would need to be fixed first.

Then one would need to chunk the constant feature into small batches of, say, 10 samples at a time and check that calling incr_mean_variance_axis0 repeatedly on such data is also numerically stable.
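
A rough sketch of that chunked check, going through the public incr_mean_variance_axis wrapper; the zero-filled initial accumulators and the exact meaning of last_n under weights are assumptions that such a test would need to pin down:

    import numpy as np
    import scipy.sparse as sp
    from numpy.testing import assert_allclose
    from sklearn.utils.sparsefuncs import incr_mean_variance_axis

    rng = np.random.RandomState(0)
    n_samples, chunk_size = 10_000, 10
    X = sp.csr_matrix(np.full((n_samples, 1), 100.0))
    sample_weight = rng.uniform(0.5, 1.5, size=n_samples)

    # Zero-filled accumulators for the first call (assumed convention).
    last_mean, last_var, last_n = np.zeros(1), np.zeros(1), np.zeros(1)
    for start in range(0, n_samples, chunk_size):
        stop = start + chunk_size
        last_mean, last_var, last_n = incr_mean_variance_axis(
            X[start:stop],
            axis=0,
            last_mean=last_mean,
            last_var=last_var,
            last_n=last_n,
            weights=sample_weight[start:stop],
        )

    assert_allclose(last_mean, [100.0])
    # The point of the test: repeated incremental updates on a constant
    # feature should keep the variance (numerically) at zero.
    assert_allclose(last_var, [0.0], atol=1e-12)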

ogrisel commented Feb 24, 2021

Note that the dense implementation _incremental_mean_and_var uses np.float64 accumulators even when the input data is np.float32. This is not the case for the sparse implementation and could therefore explain part of the discrepancy in the np.float32 case. But it does not explain the discrepancy observed for np.float64 data.
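
A small standalone illustration of why the accumulator dtype matters (pure NumPy, not the actual sklearn code path):

    import numpy as np

    rng = np.random.RandomState(0)
    x = rng.uniform(99.0, 101.0, size=100_000).astype(np.float32)

    acc32 = np.float32(0.0)
    acc64 = np.float64(0.0)
    for v in x:      # element-wise accumulation, as a simple Cython loop would do
        acc32 += v   # float32 running sum: rounding error builds up
        acc64 += v   # float64 running sum: error stays far below float32 eps

    print(acc32 / np.float32(x.size))  # noticeably less accurate mean
    print(acc64 / x.size)              # close to the exact sample mean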

ogrisel commented Feb 24, 2021

The main culprit seems to be the lines:

https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/sparsefuncs_fast.pyx#L155

and:

https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/sparsefuncs_fast.pyx#L259

        variances[i] += (sum_weights[i] - sum_weights_nz[i]) * means[i]**2

A small rounding error in the computation of (sum_weights[i] - sum_weights_nz[i]) is multiplied by a large quantity: the squared mean value of the column.
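
Here is a standalone illustration of the amplification (pure NumPy, not the Cython code): summing the same weights in two different orders, which is one plausible way such a rounding difference arises, leaves a tiny discrepancy that then gets multiplied by means[i]**2 == 1e4:

    import numpy as np

    rng = np.random.RandomState(0)
    weights = rng.uniform(0.5, 1.5, size=10_000)

    sum_weights = weights.sum()     # pairwise summation inside NumPy
    sum_weights_nz = 0.0
    for w in weights:               # naive left-to-right accumulation
        sum_weights_nz += w

    mean = 100.0
    print(sum_weights - sum_weights_nz)                # tiny difference, harmless alone
    print((sum_weights - sum_weights_nz) * mean ** 2)  # amplified by a factor of 1e4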

ogrisel commented Feb 24, 2021

I am not sure how to fix this. The easiest way would be to materialize a dense representation of the sparse columns, one at a time to avoid a memory explosion, and call the numerically stable _incremental_mean_and_var on each such dense column.

A slightly more efficient way would be to iterate explicitly over the zero-valued elements and multiply the weight by the squared mean before adding it to the partial unnormalized variance accumulator. That would still be significantly more expensive than the current code, but probably less so than materializing the columns; I am not so sure, though, because of the non-uniform memory access patterns. Finally, the code to manipulate row and column indices w.r.t. the CSR or CSC data structures would probably be much more complex.
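
For illustration, a sketch of the column-at-a-time idea, using a two-pass NumPy weighted mean/variance as a stand-in for the numerically stable dense helper (the actual fix would call _incremental_mean_and_var itself):

    import numpy as np
    import scipy.sparse as sp


    def mean_variance_axis0_column_wise(X, sample_weight):
        """Densify one column at a time to keep memory bounded (sketch only)."""
        X = sp.csc_matrix(X)  # cheap column slicing
        n_samples, n_features = X.shape
        means = np.zeros(n_features)
        variances = np.zeros(n_features)
        for j in range(n_features):
            col = np.asarray(X[:, j].todense()).ravel()  # one dense column
            means[j] = np.average(col, weights=sample_weight)
            variances[j] = np.average((col - means[j]) ** 2, weights=sample_weight)
        return means, variances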

jnothman commented:

Interesting.

Thoughts:

  1. In the weighted case, are we already doing operations with O(n_samples) memory on the sample weights? (or is it possible that the sample weights are memmapped?) If so, maybe dense-column-at-a-time is okay.
  2. Your option to "iterate explicitly over the zero value elements and multiply the weight by the square mean" sounds like it might be reasonable in a CSC with sorted indices. Perhaps also a CSR with sorted indices.

ogrisel commented Feb 25, 2021

> In the weighted case, are we already doing operations with O(n_samples) memory on the sample weights? (or is it possible that the sample weights are memmapped?) If so, maybe dense-column-at-a-time is okay.

I agree.

> Your option to "iterate explicitly over the zero value elements and multiply the weight by the square mean" sounds like it might be reasonable in a CSC with sorted indices. Perhaps also a CSR with sorted indices.

I agree also, but the code is going to be complex, I think. Not sure it's worth it.
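
For reference, a rough sketch of what that explicit iteration could look like for a CSC matrix with sorted indices (a hypothetical helper, not the actual Cython code):

    import numpy as np


    def add_zero_contributions(X_csc, sample_weight, means, variances):
        """Walk each column's sorted row indices and add w[row] * means[j]**2
        for every row that is not explicitly stored, instead of computing
        (sum_weights - sum_weights_nz) * means[j]**2 after the fact."""
        n_samples, n_features = X_csc.shape
        for j in range(n_features):
            start, end = X_csc.indptr[j], X_csc.indptr[j + 1]
            nz_rows = X_csc.indices[start:end]  # sorted row indices of column j
            ptr = 0
            for row in range(n_samples):
                if ptr < len(nz_rows) and nz_rows[ptr] == row:
                    ptr += 1                    # explicitly stored entry: skip
                else:                           # implicit zero
                    variances[j] += sample_weight[row] * means[j] ** 2
        return variances

This keeps memory usage constant but makes each column cost O(n_samples), which is the performance concern mentioned above.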

Also for the non-weighted case, sum_weights and sum_weights_nz would be integers:

    variances[i] += (sum_weights[i] - sum_weights_nz[i]) * means[i]**2

so if we specialize this function for the unweighted case to use uint64 counters, for instance, then there is no catastrophic cancellation anymore and the code stays fast and memory efficient.
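
A tiny sketch of that idea (hypothetical, not the actual Cython code): with integer counters the subtraction is exact, so the correction term carries no cancellation error:

    import numpy as np

    n_samples = np.uint64(10_000)   # total number of rows
    n_nonzero = np.uint64(10_000)   # stored entries in the column (fully dense here)
    mean = 100.0

    n_zeros = n_samples - n_nonzero               # exact integer arithmetic
    correction = np.float64(n_zeros) * mean ** 2
    print(correction)                             # exactly 0.0 for a fully stored constant column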
