Weighted variance computation for sparse data is not numerically stable #19546

Closed

ogrisel opened this issue Feb 24, 2021 · 7 comments · Fixed by #19766
Labels
Bug · Moderate · module:linear_model · module:preprocessing

Comments

ogrisel commented Feb 24, 2021

This issue was discovered when adding tests for #19527 (currently marked XFAIL).

Here is a minimal reproduction case using the underlying private API:

https://gist.github.com/ogrisel/bd2cf3350fff5bbd5a0899fa6baf3267

The results are as follows (macOS / arm64 / Python 3.9.1 / Cython 0.29.21 / clang 11.0.1):

    ## dtype=float64
    _incremental_mean_and_var    [100.] [0.]
    csr_mean_variance_axis0      [100.] [-2.18040566e-11]
    incr_mean_variance_axis0 csr [100.] [-2.18040566e-11]
    csc_mean_variance_axis0      [100.] [-2.18040566e-11]
    incr_mean_variance_axis0 csc [100.] [-2.18040566e-11]
    ## dtype=float32
    _incremental_mean_and_var    [100.00000577] [3.32692735e-11]
    csr_mean_variance_axis0      [99.99997] [0.00123221]
    incr_mean_variance_axis0 csr [99.99997] [0.00123221]
    csc_mean_variance_axis0      [99.99997] [0.00123221]
    incr_mean_variance_axis0 csc [99.99997] [0.00123221]

So the sklearn.utils.extmath._incremental_mean_and_var function for dense numpy arrays is numerically stable both in float64 and float32 (~1e-11 is much less than np.finfo(np.float32).eps), but the sparse counterparts, whether incremental or not, are all wrong in the same way.

The gist above should therefore be adapted into a new series of tests for these Cython functions, and the fix will probably involve adapting the algorithm implemented in sklearn.utils.extmath._incremental_mean_and_var to the sparse case.
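
As a starting point, here is a hedged sketch of what such a test could look like; it assumes the private Cython helpers accept a weights keyword as on the current development branch, and the actual gist and final tests may of course differ:

    import numpy as np
    import scipy.sparse as sp
    from numpy.testing import assert_allclose
    from sklearn.utils.sparsefuncs_fast import (
        csc_mean_variance_axis0,
        csr_mean_variance_axis0,
    )


    def check_sparse_weighted_variance_constant_feature(dtype):
        rng = np.random.RandomState(42)
        n_samples = 10_000
        # Constant feature with a large value: the true weighted variance is 0.
        X = np.full((n_samples, 1), 100.0, dtype=dtype)
        sample_weight = rng.uniform(0.5, 1.5, size=n_samples).astype(dtype)

        for X_sparse, func in (
            (sp.csr_matrix(X), csr_mean_variance_axis0),
            (sp.csc_matrix(X), csc_mean_variance_axis0),
        ):
            means, variances = func(X_sparse, weights=sample_weight)
            assert_allclose(means, [100.0], rtol=1e-6)
            # This is the assertion that currently fails: the variance comes
            # out slightly off (even negative) instead of (close to) zero.
            assert_allclose(variances, [0.0], atol=np.finfo(dtype).eps)


    for dtype in (np.float64, np.float32):
        check_sparse_weighted_variance_constant_feature(dtype)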

Note: there is another open issue about the numerical stability of StandardScaler (#5602 / #11549), but it is related to the computation of the (unweighted) mean in incremental mode (partial_fit) vs. full batch mode (fit).

ogrisel added the Bug, Moderate, module:linear_model, module:preprocessing and help wanted labels on Feb 24, 2021

ogrisel commented Feb 24, 2021

It would also be nice to add similar tests for non-constant but significantly non-centered random data to check the consistency between the sparse and dense array variants (incremental or not).

The tests marked XFAIL in #19527, as well as the test referenced in #19527 (review), should also be updated to use constant values larger than 1.

ogrisel commented Feb 24, 2021

Reading the code, I realized that on the first call, incr_mean_variance_axis0 delegates to _csr_mean_variance_axis0 and _csc_mean_variance_axis0, so those functions would need to be fixed first.

Then one would need to chunk the constant feature into small batches of, say, 10 samples at a time and check that calling incr_mean_variance_axis0 repeatedly on such data is also numerically stable.
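
A rough sketch of that chunked check, going through the public incr_mean_variance_axis wrapper; the zero-filled initial accumulators and the exact meaning of last_n under weights are assumptions that such a test would need to pin down:

    import numpy as np
    import scipy.sparse as sp
    from numpy.testing import assert_allclose
    from sklearn.utils.sparsefuncs import incr_mean_variance_axis

    rng = np.random.RandomState(0)
    n_samples, chunk_size = 10_000, 10
    X = sp.csr_matrix(np.full((n_samples, 1), 100.0))
    sample_weight = rng.uniform(0.5, 1.5, size=n_samples)

    # Zero-filled accumulators for the first call (assumed convention).
    last_mean, last_var, last_n = np.zeros(1), np.zeros(1), np.zeros(1)
    for start in range(0, n_samples, chunk_size):
        stop = start + chunk_size
        last_mean, last_var, last_n = incr_mean_variance_axis(
            X[start:stop],
            axis=0,
            last_mean=last_mean,
            last_var=last_var,
            last_n=last_n,
            weights=sample_weight[start:stop],
        )

    assert_allclose(last_mean, [100.0])
    # The point of the test: repeated incremental updates on a constant
    # feature should keep the variance (numerically) at zero.
    assert_allclose(last_var, [0.0], atol=1e-12)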

ogrisel commented Feb 24, 2021

Note that the dense implementation _incremental_mean_and_var uses np.float64 accumulators even when the input data is np.float32. This is not the case for the sparse implementation and could therefore explain part of the discrepancy in the np.float32 case. But it does not explain the discrepancy observed for np.float64 data.
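
A small standalone illustration of why the accumulator dtype matters (pure NumPy, not the actual sklearn code path):

    import numpy as np

    rng = np.random.RandomState(0)
    x = rng.uniform(99.0, 101.0, size=100_000).astype(np.float32)

    acc32 = np.float32(0.0)
    acc64 = np.float64(0.0)
    for v in x:      # element-wise accumulation, as a simple Cython loop would do
        acc32 += v   # float32 running sum: rounding error builds up
        acc64 += v   # float64 running sum: error stays far below float32 eps

    print(acc32 / np.float32(x.size))  # noticeably less accurate mean
    print(acc64 / x.size)              # close to the exact sample mean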

ogrisel commented Feb 24, 2021

The main culprit seems to be the lines:

https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/sparsefuncs_fast.pyx#L155

and:

https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/sparsefuncs_fast.pyx#L259

        variances[i] += (sum_weights[i] - sum_weights_nz[i]) * means[i]**2

A small rounding error in the computation of (sum_weights[i] - sum_weights_nz[i]) is multiplied by a large quantity: the squared mean value of the column.
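
Here is a standalone illustration of the amplification (pure NumPy, not the Cython code): summing the same weights in two different orders, which is one plausible way such a rounding difference arises, leaves a tiny discrepancy that then gets multiplied by means[i]**2 == 1e4:

    import numpy as np

    rng = np.random.RandomState(0)
    weights = rng.uniform(0.5, 1.5, size=10_000)

    sum_weights = weights.sum()     # pairwise summation inside NumPy
    sum_weights_nz = 0.0
    for w in weights:               # naive left-to-right accumulation
        sum_weights_nz += w

    mean = 100.0
    print(sum_weights - sum_weights_nz)                # tiny difference, harmless alone
    print((sum_weights - sum_weights_nz) * mean ** 2)  # amplified by a factor of 1e4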

ogrisel commented Feb 24, 2021

I am not sure how to fix this. The easiest way would be to materialize a dense representation of the sparse columns, one at a time to avoid a memory explosion, and call the numerically stable _incremental_mean_and_var on each such dense column.

A slightly more efficient way would be to iterate explicitly over the zero-valued elements and multiply the weight by the squared mean before adding it to the partial unnormalized variance accumulator. That would still be significantly more expensive than the current code, but probably less so than materializing the columns; I am not so sure, though, because of the non-uniform memory access patterns. Finally, the code to manipulate row and column indices w.r.t. the CSR or CSC data structures would probably be much more complex.
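
For illustration, a sketch of the column-at-a-time idea, using a two-pass NumPy weighted mean/variance as a stand-in for the numerically stable dense helper (the actual fix would call _incremental_mean_and_var itself):

    import numpy as np
    import scipy.sparse as sp


    def mean_variance_axis0_column_wise(X, sample_weight):
        """Densify one column at a time to keep memory bounded (sketch only)."""
        X = sp.csc_matrix(X)  # cheap column slicing
        n_samples, n_features = X.shape
        means = np.zeros(n_features)
        variances = np.zeros(n_features)
        for j in range(n_features):
            col = np.asarray(X[:, j].todense()).ravel()  # one dense column
            means[j] = np.average(col, weights=sample_weight)
            variances[j] = np.average((col - means[j]) ** 2, weights=sample_weight)
        return means, variances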

jnothman commented:

Interesting.

Thoughts:

  1. In the weighted case, are we already doing operations with O(n_samples) memory on the sample weights? (or is it possible that the sample weights are memmapped?) If so, maybe dense-column-at-a-time is okay.
  2. Your option to "iterate explicitly over the zero value elements and multiply the weight by the square mean" sounds like it might be reasonable in a CSC with sorted indices. Perhaps also a CSR with sorted indices.

ogrisel commented Feb 25, 2021

> In the weighted case, are we already doing operations with O(n_samples) memory on the sample weights? (or is it possible that the sample weights are memmapped?) If so, maybe dense-column-at-a-time is okay.

I agree.

> Your option to "iterate explicitly over the zero value elements and multiply the weight by the square mean" sounds like it might be reasonable in a CSC with sorted indices. Perhaps also a CSR with sorted indices.

I agree also, but the code is going to be complex, I think. Not sure it's worth it.
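
For reference, a rough sketch of what that explicit iteration could look like for a CSC matrix with sorted indices (a hypothetical helper, not the actual Cython code):

    import numpy as np


    def add_zero_contributions(X_csc, sample_weight, means, variances):
        """Walk each column's sorted row indices and add w[row] * means[j]**2
        for every row that is not explicitly stored, instead of computing
        (sum_weights - sum_weights_nz) * means[j]**2 after the fact."""
        n_samples, n_features = X_csc.shape
        for j in range(n_features):
            start, end = X_csc.indptr[j], X_csc.indptr[j + 1]
            nz_rows = X_csc.indices[start:end]  # sorted row indices of column j
            ptr = 0
            for row in range(n_samples):
                if ptr < len(nz_rows) and nz_rows[ptr] == row:
                    ptr += 1                    # explicitly stored entry: skip
                else:                           # implicit zero
                    variances[j] += sample_weight[row] * means[j] ** 2
        return variances

This keeps memory usage constant but makes each column cost O(n_samples), which is the performance concern mentioned above.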

Also for the non-weighted case, sum_weights and sum_weights_nz would be integers:

    variances[i] += (sum_weights[i] - sum_weights_nz[i]) * means[i]**2

so if we specialize this function for the unweighted case to use uint64 counters, for instance, then there is no catastrophic cancellation anymore and the code stays fast and memory efficient.
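
A tiny sketch of that idea (hypothetical, not the actual Cython code): with integer counters the subtraction is exact, so the correction term carries no cancellation error:

    import numpy as np

    n_samples = np.uint64(10_000)   # total number of rows
    n_nonzero = np.uint64(10_000)   # stored entries in the column (fully dense here)
    mean = 100.0

    n_zeros = n_samples - n_nonzero               # exact integer arithmetic
    correction = np.float64(n_zeros) * mean ** 2
    print(correction)                             # exactly 0.0 for a fully stored constant column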
