SGDClassifier under/overflow #3040

worldveil · 2014-04-04T03:12:39Z

Example code:

from sklearn.linear_model import SGDClassifier
from sklearn.datasets import load_iris
from sklearn import cross_validation

iris = load_iris()

hyperparameter_choices = [ 

    # some examples, by no means exhausive
    {u'loss': 'modified_huber', u'shuffle': True, u'n_iter': 25.0, 
    u'l1_ratio': 0.5, u'learning_rate': 'constant', u'fit_intercept': 0.0, 
    u'penalty': 'l2', u'alpha': 1000.0, u'eta0': 0.1, u'class_weight': None},

    {u'loss': 'squared_hinge', u'shuffle': True, u'n_iter': 25.0, u'l1_ratio': 0.5, 
    u'learning_rate': 'optimal', u'fit_intercept': 0.0, u'penalty': 'elasticnet', 
    u'alpha': 0.001, u'eta0': 0.1, u'class_weight': None},

    {u'loss': 'squared_hinge', u'shuffle': True, u'n_iter': 100.0, u'l1_ratio': 0.5, 
    u'learning_rate': 'optimal', u'fit_intercept': 0.0, u'penalty': 'l2', u'alpha': 0.001, 
    u'eta0': 0.001, u'class_weight': None}
]

for params in hyperparameter_choices:
    try:
        clf = SGDClassifier(**params)
        scores = cross_validation.cross_val_score(clf, iris.data, iris.target, cv=5)
    except ValueError as ve:
        print "ValueError: %s" % ve

I'm not sure if these are just faulty hyperparameters for an SGD in general. Otherwise it seems to be a numerical stability bug.

The above under/overflow happens when the data is scaled first as well.

pprett · 2014-04-04T03:29:01Z

I assume it fails on alpha=1000 ?

worldveil · 2014-04-04T05:12:30Z

2/3 examples fail when alpha = 0.001, so not exclusively, no.

larsmans · 2014-04-06T11:58:32Z

Failures:

ValueError('Floating-point under-/overflow occurred at epoch #2. Scaling input data with StandardScaler or MinMaxScaler might help.',)
{u'alpha': 1000.0,
 u'class_weight': None,
 u'eta0': 0.1,
 u'fit_intercept': 0.0,
 u'l1_ratio': 0.5,
 u'learning_rate': 'constant',
 u'loss': 'modified_huber',
 u'n_iter': 25.0,
 u'penalty': 'l2',
 u'shuffle': True}

ValueError('Floating-point under-/overflow occurred at epoch #3. Scaling input data with StandardScaler or MinMaxScaler might help.',)
{u'alpha': 0.001,
 u'class_weight': None,
 u'eta0': 0.1,
 u'fit_intercept': 0.0,
 u'l1_ratio': 0.5,
 u'learning_rate': 'optimal',
 u'loss': 'squared_hinge',
 u'n_iter': 25.0,
 u'penalty': 'elasticnet',
 u'shuffle': True}

ValueError('Floating-point under-/overflow occurred at epoch #3. Scaling input data with StandardScaler or MinMaxScaler might help.',)
{u'alpha': 0.001,
 u'class_weight': None,
 u'eta0': 0.001,
 u'fit_intercept': 0.0,
 u'l1_ratio': 0.5,
 u'learning_rate': 'optimal',
 u'loss': 'squared_hinge',
 u'n_iter': 100.0,
 u'penalty': 'l2',
 u'shuffle': True}

worldveil · 2014-04-07T03:11:17Z

MinMaxScaler on the (0,1) range does help, but not always:

from sklearn.linear_model import SGDClassifier
from sklearn.datasets import load_iris
from sklearn import cross_validation
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler

iris = load_iris()

hyperparameter_choices = [ 

    # some examples, by no means exhausive
    {u'loss': 'modified_huber', u'shuffle': True, u'n_iter': 25.0, 
    u'l1_ratio': 0.5, u'learning_rate': 'constant', u'fit_intercept': 0.0, 
    u'penalty': 'l2', u'alpha': 1000.0, u'eta0': 0.1, u'class_weight': None},

    {u'loss': 'squared_hinge', u'shuffle': True, u'n_iter': 25.0, u'l1_ratio': 0.5, 
    u'learning_rate': 'optimal', u'fit_intercept': 0.0, u'penalty': 'elasticnet', 
    u'alpha': 0.001, u'eta0': 0.1, u'class_weight': None},

    {u'loss': 'squared_hinge', u'shuffle': True, u'n_iter': 100.0, u'l1_ratio': 0.5, 
    u'learning_rate': 'optimal', u'fit_intercept': 0.0, u'penalty': 'l2', u'alpha': 0.001, 
    u'eta0': 0.001, u'class_weight': None}
]

print "\nWith Standard scaling..."
for params in hyperparameter_choices:
    try:
        pipeline = Pipeline([
            ("standard_scaler", StandardScaler()),
            ("sgd", SGDClassifier(**params))
        ])
        scores = cross_validation.cross_val_score(pipeline, iris.data, iris.target, cv=5)
    except ValueError as ve:
        print "ValueError: %s" % ve

print "\nWith MinMax scaling..."     
for params in hyperparameter_choices:
    try:
        pipeline = Pipeline([
            ("minmax_scaler", MinMaxScaler()),
            ("sgd", SGDClassifier(**params))
        ])
        scores = cross_validation.cross_val_score(pipeline, iris.data, iris.target, cv=5)
        print "Success with: %s" % params
    except ValueError as ve:
        print "ValueError: %s" % ve

larsmans · 2014-05-10T14:13:24Z

I just did some testing with this example and I think the parameters are just not good. With too high a learning rate, gradient descent will overshoot its target; that's an inherent risk with this algorithm. Scaling with StandardScaler and using the optimal learning rate fixes the first problem.

What remains are the other two. Here, the squared hinge loss goes off into infinity and its dloss becomes huge; so do the updates. Regularizing more or using the vanilla hinge loss solve this issue.

ogrisel · 2014-11-24T13:59:07Z

I think many SGD practitioners from the deep learning community clip the gradient norms (or the norm of the weights) to e.g. [-100, 100] to avoid such numerical stability issues in practice. That might be worth trying.

pprett · 2014-11-24T14:22:55Z

true -- we should experiment with this -- its a major annoyance during grid
search

2014-11-24 14:59 GMT+01:00 Olivier Grisel [email protected]:

I think many SGD practitioners from the deep learning community clip the
gradient norms or coefficients to [-100, 100] to avoid such numerical
stability issues in practice. That might be worth trying.

—
Reply to this email directly or view it on GitHub
#3040 (comment)
.

Peter Prettenhofer

agramfort · 2014-11-24T14:24:54Z

we could add a param max_grad ? or clip_grad ?

ogrisel · 2014-11-24T15:28:46Z

true -- we should experiment with this -- its a major annoyance during grid search

I am hacking my copy of sgd_fast to clip dloss to [-100, 100], it helps for some cases on @worldveil's script but not all: the huber loss case is still unstable. Needs more investigation.

ogrisel · 2014-11-24T15:35:11Z

Note that the problem disappears when clipping dloss to [-100, 100 ] and preventing alpha to be larger than 10..

amueller · 2014-11-24T22:13:37Z

Do you need to check the gradient in every step then? Doesn't that impact performance quite a bit?

jnothman · 2014-11-24T22:23:15Z

true -- we should experiment with this -- its a major annoyance during
grid
search

But at least in dev version it's possible to ask grid search to catch the
error and return 0 score...

On 25 November 2014 at 09:13, Andreas Mueller [email protected]
wrote:

Do you need to check the gradient in every step then? Doesn't that
impact performance quite a bit?

—
Reply to this email directly or view it on GitHub
#3040 (comment)
.

pprett · 2014-11-24T22:31:29Z

But at least in dev version it's possible to ask grid search to catch the
error and return 0 score...

completely missed that -- that's great!

2014-11-24 23:23 GMT+01:00 jnothman [email protected]:

true -- we should experiment with this -- its a major annoyance during
grid
search

But at least in dev version it's possible to ask grid search to catch the
error and return 0 score...

On 25 November 2014 at 09:13, Andreas Mueller [email protected]
wrote:

Do you need to check the gradient in every step then? Doesn't that
impact performance quite a bit?

—
Reply to this email directly or view it on GitHub
<
https://github.com/scikit-learn/scikit-learn/issues/3040#issuecomment-64274690>

.

—
Reply to this email directly or view it on GitHub
#3040 (comment)
.

Peter Prettenhofer

ogrisel · 2014-11-25T17:03:13Z

#3883 has an implementations that seems to work. Please have a look. I have not yet run benchmarks to see the computational overhead vs master but I have to go now so I cannot run them right now.

ogrisel · 2014-11-25T17:04:10Z

@worldveil #3883 seems to fix all the problems you reported. Please feel free to test on more cases on your own data and report any remaining issues.

larsmans · 2014-11-26T09:16:30Z

Should be fixed by f5e0ea0, closing.

ogrisel mentioned this issue Nov 25, 2014

[MRG] Improve stability of SGDClassifier / SGDRegressor with gradient clipping #3883

Merged

larsmans closed this as completed Nov 26, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SGDClassifier under/overflow #3040

SGDClassifier under/overflow #3040

worldveil commented Apr 4, 2014

pprett commented Apr 4, 2014

worldveil commented Apr 4, 2014

larsmans commented Apr 6, 2014

worldveil commented Apr 7, 2014

larsmans commented May 10, 2014

ogrisel commented Nov 24, 2014

pprett commented Nov 24, 2014

agramfort commented Nov 24, 2014

ogrisel commented Nov 24, 2014

ogrisel commented Nov 24, 2014

amueller commented Nov 24, 2014

jnothman commented Nov 24, 2014

pprett commented Nov 24, 2014

ogrisel commented Nov 25, 2014

ogrisel commented Nov 25, 2014

larsmans commented Nov 26, 2014

SGDClassifier under/overflow #3040

SGDClassifier under/overflow #3040

Comments

worldveil commented Apr 4, 2014

pprett commented Apr 4, 2014

worldveil commented Apr 4, 2014

larsmans commented Apr 6, 2014

worldveil commented Apr 7, 2014

larsmans commented May 10, 2014

ogrisel commented Nov 24, 2014

pprett commented Nov 24, 2014

agramfort commented Nov 24, 2014

ogrisel commented Nov 24, 2014

ogrisel commented Nov 24, 2014

amueller commented Nov 24, 2014

jnothman commented Nov 24, 2014

pprett commented Nov 24, 2014

ogrisel commented Nov 25, 2014

ogrisel commented Nov 25, 2014

larsmans commented Nov 26, 2014