[WIP] use more robust mean online computation in StandardScaler #11549

Open
wants to merge 2 commits into base: main
24 changes: 24 additions & 0 deletions sklearn/preprocessing/tests/test_data.py
@@ -873,6 +873,30 @@ def test_scaler_without_copy():
assert_array_equal(X_csc.toarray(), X_csc_copy.toarray())


def test_scaler_partial_fit_overflow():
# Test StandardScaler does not overflow in partial_fit #5602
rng = np.random.RandomState(0)

def gen_1d_uniform_batch(min_, max_, n):
return rng.uniform(min_, max_, size=(n, 1))

max_f = np.finfo(np.float64).max / 1e5
min_f = max_f / 1e2
stream_dim = 100
batch_dim = 500000

X = gen_1d_uniform_batch(min_=min_f, max_=min_f, n=batch_dim)

scaler = StandardScaler(with_std=False).fit(X)

iscaler = StandardScaler(with_std=False)
batch = gen_1d_uniform_batch(min_=min_f, max_=min_f, n=batch_dim)
for _ in range(stream_dim):
iscaler = iscaler.partial_fit(batch)

assert_allclose(iscaler.mean_, scaler.mean_)


def test_scale_sparse_with_mean_raise_exception():
rng = np.random.RandomState(42)
X = rng.randn(4, 5)
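To make the failure mode concrete, here is a minimal sketch (values chosen to mirror the magnitudes in the test above; it is not part of the PR) of why sum-based bookkeeping overflows on a constant stream near the float64 ceiling, while delta-based bookkeeping stays finite:

```python
import numpy as np

# Magnitudes mirroring the test: values near the float64 max (~1.8e308).
max_f = np.finfo(np.float64).max / 1e5   # ~1.8e303
n_seen = 100 * 500000                    # stream_dim * batch_dim samples

with np.errstate(over='ignore'):
    # Sum-based bookkeeping materializes mean * count, which leaves
    # the float64 range and becomes inf.
    last_sum = max_f * n_seen
print(np.isinf(last_sum))                # True

# Delta-based bookkeeping only handles quantities on the order of the
# data itself: on a constant stream the delta is exactly zero.
new_mean = max_f
delta = new_mean - max_f                 # 0.0
updated_mean = max_f + delta * 500000 / (n_seen + 500000)
print(np.isfinite(updated_mean))         # True
```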
8 changes: 6 additions & 2 deletions sklearn/utils/extmath.py
@@ -696,20 +696,24 @@ def _incremental_mean_and_var(X, last_mean, last_variance, last_sample_count):
     # old = stats until now
     # new = the current increment
     # updated = the aggregated stats
-    last_sum = last_mean * last_sample_count
     new_sum = np.nansum(X, axis=0)

     new_sample_count = np.sum(~np.isnan(X), axis=0)
     updated_sample_count = last_sample_count + new_sample_count

-    updated_mean = (last_sum + new_sum) / updated_sample_count
+    new_mean = new_sum / new_sample_count
+    delta = new_mean - last_mean
+    updated_mean = \
+        last_mean + (delta * new_sample_count) / updated_sample_count  # fixes test
+        # (last_mean * last_sample_count + new_sum) / updated_sample_count  # breaks stability test
Member Author:

This line is how the code was before, and it is necessary for the stability test below to pass. However, this line is also the cause of the reported issue, since last_mean * last_sample_count will overflow.

So I feel a bit stuck. Either I change the test or ... Any thoughts?

Member:

So you're saying last_mean + (delta * new_sample_count) / updated_sample_count is unstable, but avoids overflow?

@thomasjpfan (Member), Mar 28, 2019:

Some notes on this issue (in the context of our test_incremental_variance_numerical_stability test):

The mean in this PR causes the tolerance in the variance to go to ~458, up from the original 177. All it takes to get ~458 is factoring out last_sample_count:

(last_mean + new_sum/last_sample_count)*last_sample_count / updated_sample_count

This PR's updated_mean differs from master on the order of 1e-9, which is enough to cause the instability in the variance (the variance itself is on the order of 1e+15).
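The trade-off under discussion can be sketched with hypothetical magnitudes (this is an illustration, not the PR's code): master's sum-based update overflows on large data, while the PR's delta-based update stays finite. Powers of two are used so the delta-based arithmetic is exact:

```python
import numpy as np

def mean_update_sum_based(last_mean, last_count, new_sum, new_count):
    # Master's formula: materializes last_mean * last_count,
    # which can leave the float64 range for large data.
    return (last_mean * last_count + new_sum) / (last_count + new_count)

def mean_update_delta_based(last_mean, last_count, new_sum, new_count):
    # This PR's formula: only handles data-sized quantities,
    # so it cannot overflow (at the cost of some accuracy).
    new_mean = new_sum / new_count
    delta = new_mean - last_mean
    return last_mean + delta * new_count / (last_count + new_count)

# Hypothetical magnitudes near the float64 ceiling (~1.8e308).
big = np.finfo(np.float64).max / 1e5     # ~1.8e303
last_count = 1 << 20                     # big * last_count overflows
new_count = 512
new_sum = big * new_count                # exact: power-of-two scaling

with np.errstate(over='ignore'):
    print(np.isinf(
        mean_update_sum_based(big, last_count, new_sum, new_count)))  # True
print(
    mean_update_delta_based(big, last_count, new_sum, new_count) == big)  # True
```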


     if last_variance is None:
         updated_variance = None
     else:
         new_unnormalized_variance = np.nanvar(X, axis=0) * new_sample_count
         last_unnormalized_variance = last_variance * last_sample_count

+        last_sum = last_mean * last_sample_count
         with np.errstate(divide='ignore', invalid='ignore'):
             last_over_new_count = last_sample_count / new_sample_count
             updated_unnormalized_variance = (
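The variance branch above (truncated in this view) follows the standard pairwise combination of batch statistics (Chan et al.). A self-contained sketch of that combination, checked against NumPy on random data (this is illustrative code, not the PR's exact implementation):

```python
import numpy as np

def combine_mean_var(mean_a, var_a, n_a, mean_b, var_b, n_b):
    # Pairwise (Chan et al.) combination of two batches' statistics.
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # Unnormalized variances: M2 = var * count.
    m2 = var_a * n_a + var_b * n_b + delta ** 2 * n_a * n_b / n
    return mean, m2 / n, n

rng = np.random.RandomState(0)
a = rng.randn(1000)
b = rng.randn(500) + 3.0

mean, var, n = combine_mean_var(a.mean(), a.var(), a.size,
                                b.mean(), b.var(), b.size)
full = np.concatenate([a, b])
print(np.allclose([mean, var], [full.mean(), full.var()]))  # True
```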
2 changes: 1 addition & 1 deletion sklearn/utils/tests/test_extmath.py
@@ -609,7 +609,7 @@ def naive_mean_variance_update(x, last_mean, last_variance,
         _incremental_mean_and_var(A1[i, :].reshape((1, A1.shape[1])),
                                   mean, var, n)
     assert_array_equal(n, A.shape[0])
-    assert_array_almost_equal(A.mean(axis=0), mean)
+    assert_allclose(A.mean(axis=0), mean, rtol=1e-12)
     assert_greater(tol, np.abs(stable_var(A) - var).max())
Member Author:

This test comes from agramfort@6a5a2f7 by @giorgiop.
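The switch above from assert_array_almost_equal to assert_allclose replaces a fixed-decimal absolute check with a relative one, which is what large-magnitude means need. A small illustration with made-up numbers (not taken from the test suite):

```python
import numpy as np
from numpy.testing import assert_allclose, assert_array_almost_equal

mean_true = 1e15
mean_est = mean_true * (1.0 + 1e-13)   # 1e-13 relative error, ~100 absolute

# Relative check: 1e-13 <= rtol=1e-12, so this passes.
assert_allclose(mean_est, mean_true, rtol=1e-12)

# Fixed-decimal check (decimal=6 by default) requires |diff| < 1.5e-6
# regardless of scale, so an absolute error of ~100 fails even though
# it is tiny in relative terms.
try:
    assert_array_almost_equal(mean_est, mean_true)
except AssertionError:
    print("assert_array_almost_equal failed")   # this branch runs
```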


