[WIP] use more robust mean online computation in StandardScaler #11549
Conversation
@@ -609,7 +609,7 @@ def naive_mean_variance_update(x, last_mean, last_variance,
         _incremental_mean_and_var(A1[i, :].reshape((1, A1.shape[1])),
                                   mean, var, n)
     assert_array_equal(n, A.shape[0])
-    assert_array_almost_equal(A.mean(axis=0), mean)
+    assert_allclose(A.mean(axis=0), mean, rtol=1e-12)
     assert_greater(tol, np.abs(stable_var(A) - var).max())
this test comes from agramfort@6a5a2f7 by @giorgiop
        delta = new_mean - last_mean
        updated_mean = \
            last_mean + (delta * new_sample_count) / updated_sample_count  # fixes test
        # (last_mean * last_sample_count + new_sum) / updated_sample_count  # breaks stability test
The commented-out line is how it was before, and it is necessary for the stability test below to pass. However, that line is the cause of the reported issue, since last_mean * last_sample_count will overflow.
So I feel a bit stuck. Either I change the test or ... Any thoughts?
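To make the trade-off concrete, here is a minimal self-contained sketch of the two mean-update formulas discussed above (function names and the numeric values are illustrative, not scikit-learn internals):

```python
import numpy as np

def update_mean_delta(last_mean, last_count, new_sum, new_count):
    # Delta form: last_mean + delta * new_count / updated_count.
    # Never forms last_mean * last_count, so it cannot overflow there.
    updated_count = last_count + new_count
    new_mean = new_sum / new_count
    delta = new_mean - last_mean
    return last_mean + (delta * new_count) / updated_count

def update_mean_weighted(last_mean, last_count, new_sum, new_count):
    # Weighted-sum form: last_mean * last_count can overflow for
    # large running means and sample counts.
    updated_count = last_count + new_count
    return (last_mean * last_count + new_sum) / updated_count

# With moderate values both forms agree:
m1 = update_mean_delta(5.0, 1000, 600.0, 100)
m2 = update_mean_weighted(5.0, 1000, 600.0, 100)
assert np.isclose(m1, m2)

# With a huge running mean the weighted form overflows to inf,
# while the delta form stays finite:
big = np.float64(1e300)
assert np.isinf(update_mean_weighted(big, 1e10, big, 1))
assert np.isfinite(update_mean_delta(big, 1e10, big, 1))
```

The delta form trades that overflow safety for slightly different rounding, which is what perturbs the variance test.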
So you're saying last_mean + (delta * new_sample_count) / updated_sample_count
is unstable, but avoids overflow?
Some notes on this issue (in the context of our test_incremental_variance_numerical_stability test):
The mean in this PR causes the tolerance in the variance to go to ~458, up from the original 177. All it takes to get ~458 is factoring out last_sample_count:
(last_mean + new_sum / last_sample_count) * last_sample_count / updated_sample_count
This PR's updated_means differ from master on the order of 1e-9, which is enough to cause the instability in the variance. (The variance is on the order of 1e+15.)
Is this something you're still completing, @agramfort?
@jnothman please take over as it cannot be priority for me right now :(
feel free to push a fix directly on my branch
Hello folks, any workaround while this gets fixed? I do not have an online source, so I would be fine using the batch flavor, but I'm not sure if that is an option in StandardScaler ... or maybe I should just do the scaling manually in numpy meanwhile. Thanks.
This PR uses the parallel algorithm, which is less numerically stable, as described here: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
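For reference, a minimal sketch of the pairwise merge from the linked Wikipedia section (Chan et al.); the function name is mine, not scikit-learn's, and this is just the textbook formula, not the PR's code:

```python
import numpy as np

def merge_mean_var(n_a, mean_a, m2_a, n_b, mean_b, m2_b):
    # Merge two (count, mean, sum-of-squared-deviations) summaries.
    # This is the "parallel algorithm" from the Wikipedia article:
    # the delta**2 * n_a * n_b / n term is where precision is lost
    # when the two chunks are very unbalanced.
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2

rng = np.random.RandomState(0)
a, b = rng.randn(1000), rng.randn(500) + 3.0
stats_a = (a.size, a.mean(), ((a - a.mean()) ** 2).sum())
stats_b = (b.size, b.mean(), ((b - b.mean()) ** 2).sum())
n, mean, m2 = merge_mean_var(*stats_a, *stats_b)
x = np.concatenate([a, b])
assert np.isclose(mean, x.mean())
assert np.isclose(m2 / n, x.var())
```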
Indeed. As far as I remember, my fix was breaking some tests. Either we relax the tests, or I am not sure what to do here.
Reference Issues/PRs
Closes #5602
What does this implement/fix? Explain your changes.
use parallel algo from https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online_algorithm
I still need to check if the same problem occurs with sparse matrices and for the variance... which I suspect it does
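For context, a minimal sketch of the single-sample online update (Welford's algorithm) from the linked Wikipedia article; names are illustrative, and this is the reference formula rather than scikit-learn's implementation:

```python
import numpy as np

def welford(xs):
    # Welford's online algorithm: one pass, numerically stable
    # running mean and (population) variance.
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # uses the *updated* mean
    return mean, m2 / n

# Shifted data where a naive sum-of-squares formula loses precision:
data = 1e9 + np.arange(10.0)
mean, var = welford(data)
assert np.isclose(mean, data.mean())
assert np.isclose(var, data.var())
```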