

BUG: StandardScaler partial_fit overflows #5602



Open
giorgiop opened this issue Oct 27, 2015 · 8 comments · May be fixed by #11549
Labels: Bug, help wanted, Moderate (Anything that requires some knowledge of conventions and best practices), module:preprocessing

Comments

@giorgiop
Contributor

giorgiop commented Oct 27, 2015

The recent implementation of partial_fit for StandardScaler can overflow. A use case for it is transforming an indefinitely long stream of data, but that is problematic with the current implementation: to compute the running mean, we keep track of the sample sum.

Here is code to reproduce the behavior. Simulating a long stream of data would take a long time, so instead I use samples with a very large norm; the effect is the same. The same batch is presented to the transformer many times, so the mean should stay the same.

from sklearn.preprocessing import StandardScaler
import numpy as np

rng = np.random.RandomState(0)

def gen_1d_uniform_batch(min_, max_, n):
    return rng.uniform(min_, max_, size=(n, 1))

max_f = np.finfo(np.float64).max / 1e5
min_f = max_f / 1e2
stream_dim = 100
batch_dim = 500000
print("mean overflow: batch vs online on %d repetitions" % stream_dim)

X = gen_1d_uniform_batch(min_=min_f, max_=min_f, n=batch_dim)

scaler = StandardScaler(with_std=False).fit(X)
print(scaler.mean_)
# [  1.79769313e+301]

iscaler = StandardScaler(with_std=False)
batch = gen_1d_uniform_batch(min_=min_f, max_=min_f, n=batch_dim)
for _ in range(stream_dim):
    iscaler = iscaler.partial_fit(batch)
# RuntimeWarning: overflow encountered in add
#   updated_mean = (last_sum + new_sum) / updated_sample_count

print(iscaler.mean_)
# [ inf]
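
For intuition, here is a standalone sketch (not scikit-learn code) of the failure mode: accumulating the raw sum overflows even though every sample, and the true mean, fit comfortably in float64.

import numpy as np

max_f = np.finfo(np.float64).max / 1e5
batch = np.full((500000, 1), max_f / 1e2)  # every sample is ~1.8e+301

running_sum = np.zeros(1)
count = 0
for _ in range(100):
    running_sum = running_sum + batch.sum(axis=0)  # exceeds the float64 max after ~20 batches
    count += batch.shape[0]

print(running_sum / count)  # [ inf], although the true mean is ~1.8e+301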
@giorgiop
Contributor Author

A simple workaround is to keep track of the sample mean itself instead of the sample sum. Here is a gist. However, there is a caveat: the computation of the sample variance becomes less numerically accurate.

from sklearn.preprocessing import StandardScaler
import numpy as np

rng = np.random.RandomState(0)

def gen_1d_uniform_batch(min_, max_, n):
    return rng.uniform(min_, max_, size=(n, 1))

max_f = 1e15
min_f = max_f / 1e3
stream_dim = 10000
batch_dim = 1000
print("var divergence: batch vs online on %d repetitions" % stream_dim)

X = gen_1d_uniform_batch(min_=min_f, max_=min_f, n=batch_dim)

scaler = StandardScaler().fit(X)
iscaler = StandardScaler()

batch = gen_1d_uniform_batch(min_=min_f, max_=min_f, n=batch_dim)
for _ in range(stream_dim):
    iscaler = iscaler.partial_fit(batch)

print(scaler.var_)
# [  0.]
print(iscaler.var_)
# [  1.11203573e-05]

while on master the last print is 0.
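
For reference, a minimal sketch of the mean-based update the workaround describes (function name and structure are illustrative, not the actual gist): the running mean is combined with each batch mean, so an unbounded sum is never formed.

import numpy as np

def update_mean(last_mean, last_count, new_batch):
    # Combine the previous mean with the batch mean, weighted by sample counts;
    # the raw sum is never materialized, so it cannot overflow.
    new_count = last_count + new_batch.shape[0]
    batch_mean = new_batch.mean(axis=0)
    updated_mean = last_mean + (batch_mean - last_mean) * new_batch.shape[0] / new_count
    return updated_mean, new_count

mean, count = np.zeros(1), 0
batch = np.full((1000, 1), 1.79e+301)
for _ in range(100):
    mean, count = update_mean(mean, count, batch)
print(mean)  # stays finite at ~1.79e+301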

@giorgiop giorgiop reopened this Oct 27, 2015
@giorgiop
Contributor Author

Notice that while in the first code snippet I use large numbers to simulate the online stream, in the second case the length of the stream does not matter. It is the absolute value of the mean itself that is problematic; indeed, if you run the second script with a smaller max_f, everything is smooth. This is the offending step, which can trigger catastrophic cancellation.
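
To illustrate the cancellation with a generic example (not the scikit-learn code path): when the mean is huge relative to the spread, subtracting two nearly equal large quantities destroys the variance.

import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(1e9, 1.0, size=1000000)  # mean ~1e9, true variance ~1

naive_var = (x ** 2).mean() - x.mean() ** 2  # subtracts two nearly equal ~1e18 numbers
two_pass_var = ((x - x.mean()) ** 2).mean()  # stable two-pass formula

print(naive_var)     # off by orders of magnitude, can even be negative
print(two_pass_var)  # ~1.0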

I think this means that we want this change in the code, but I would like to hear other opinions first. Maybe @jakevdp ?

I have no idea whether other estimators with partial_fit suffer from the same issue. Maybe we could test for overflow in every partial_fit? #3907
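
A sketch of what such a common check could look like (purely illustrative, not an existing scikit-learn test; it assumes the estimator exposes mean_): feed huge but finite values through repeated partial_fit calls and assert that the learned statistics stay finite. On an affected version the assertion fails, which is the point.

import numpy as np
from sklearn.preprocessing import StandardScaler

def check_partial_fit_no_overflow(estimator, n_batches=100):
    # Huge but representable values: any sum-based accumulator blows up well
    # before n_batches * 1000 samples have been seen.
    huge = np.full((1000, 1), np.finfo(np.float64).max / 1e4)
    for _ in range(n_batches):
        estimator.partial_fit(huge)
    assert np.all(np.isfinite(estimator.mean_)), "statistics overflowed"

check_partial_fit_no_overflow(StandardScaler(with_std=False))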

@jakevdp
Member

jakevdp commented Dec 12, 2017

We'd probably want to use something like Welford's algorithm to avoid overflow and precision loss in the case of partial_fit.
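
For reference, a minimal sketch of Welford's online update (names are illustrative; this is not scikit-learn's implementation): it tracks the count, the running mean, and the sum of squared deviations from the current mean, so neither a raw sum nor a sum of squares is ever formed.

import numpy as np

def welford_update(count, mean, m2, new_value):
    # One Welford step: update the count, the mean, and
    # M2 = the running sum of squared deviations from the mean.
    count += 1
    delta = new_value - mean
    mean += delta / count
    m2 += delta * (new_value - mean)
    return count, mean, m2

count, mean, m2 = 0, 0.0, 0.0
for x in np.random.RandomState(0).normal(1e9, 1.0, size=10000):
    count, mean, m2 = welford_update(count, mean, m2, x)

print(mean)        # ~1e9
print(m2 / count)  # ~1.0 (population variance), no overflow or cancellation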

@jnothman added the Bug, help wanted, and Moderate (Anything that requires some knowledge of conventions and best practices) labels on Dec 13, 2017
@agramfort
Member

Started to take care of this in #11549

@jnothman
Member

Is part of the issue that previously we were using Python ints and now we're using numpy ints?
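
For context on that question, a small generic illustration (not tied to the scaler code): Python ints have arbitrary precision, while fixed-width NumPy integers wrap around on overflow.

import numpy as np

py_count = 2 ** 62
np_count = np.int64(2 ** 62)

print(py_count * 4)  # 18446744073709551616, exact arbitrary-precision result
print(np_count * 4)  # wraps modulo 2**64 and prints 0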

@agramfort
Member

agramfort commented Jul 18, 2018 via email

@antonl

antonl commented May 22, 2024

This has already been fixed. The partial_fit documentation says it has been stabilized: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/preprocessing/_data.py#L879-L889

@adrinjalali
Member

I can still reproduce this using the code in the OP's post.

@ogrisel was recently looking at something similar, I think. Would you happen to know if this is solvable now?
