Warm start bug when fitting a LogisticRegression model on binary outcomes with `multi_class='multinomial'`. #10836

rwolst · 2018-03-19T19:24:33Z

Description

Bug when fitting a LogisticRegression model on binary outcomes with multi_class='multinomial' when using warm start. Note that it is similar to the issue here #9889 i.e. only using a 1D coef object on binary outcomes even when using multi_class='multinomial' as opposed to a 2D coef object.

Steps/Code to Reproduce

from sklearn.linear_model import LogisticRegression
import sklearn.metrics
import numpy as np

# Set up a logistic regression object
lr = LogisticRegression(C=1000000, multi_class='multinomial',
                    solver='sag', tol=0.0001, warm_start=True,
                    verbose=0)

# Set independent variable values
Z = np.array([
[ 0.        ,  0.        ],
[ 1.33448632,  0.        ],
[ 1.48790105, -0.33289528],
[-0.47953866, -0.61499779],
[ 1.55548163,  1.14414766],
[-0.31476657, -1.29024053],
[-1.40220786, -0.26316645],
[ 2.227822  , -0.75403668],
[-0.78170885, -1.66963585],
[ 2.24057471, -0.74555021],
[-1.74809665,  2.25340192],
[-1.74958841,  2.2566389 ],
[ 2.25984734, -1.75106702],
[ 0.50598996, -0.77338402],
[ 1.21968303,  0.57530831],
[ 1.65370219, -0.36647173],
[ 0.66569897,  1.77740068],
[-0.37088553, -0.92379819],
[-1.17757946, -0.25393047],
[-1.624227  ,  0.71525192]])

# Set dependant variable values
Y = np.array([1, 0, 0, 1, 0, 0, 0, 0, 
          0, 0, 1, 1, 1, 0, 0, 1, 
          0, 0, 1, 1], dtype=np.int32)

# First fit model normally
lr.fit(Z, Y)

p = lr.predict_proba(Z)
print(sklearn.metrics.log_loss(Y, p)) # ...

print(lr.intercept_)
print(lr.coef_)

# Now fit model after a warm start
lr.fit(Z, Y)

p = lr.predict_proba(Z)
print(sklearn.metrics.log_loss(Y, p)) # ...

print(lr.intercept_)
print(lr.coef_)

Expected Results

The predictions should be the same as the model converged the first time it was run.

Actual Results

The predictions are different. In fact the more times you re-run the fit the worse it gets. This is actually the only reason I was able to catch the bug. It is caused by the line here https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py#L678.

 w0[:, :coef.shape[1]] = coef

As coef is (1, n_features), but w0 is (2, n_features), this causes the coef value to be broadcast into the w0. This some sort of singularity issue when training resulting in worse performance. Note that had it not done exactly this i.e. w0 was simply initialised by some random values, this bug would be very hard to catch because of course each time the model would converge just not as fast as one would hope when warm starting.

Further Information

The fix I believe is very easy, just need to swap the previous line to

 if n_classes == 1:
     w0[0, :coef.shape[1]] = -coef  # Be careful to get these the right way around
     w0[1, :coef.shape[1]] = coef
 else:
     w0[:, :coef.shape[1]] = coef

Versions

Linux-4.13.0-37-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
NumPy 1.14.2
SciPy 1.0.0
Scikit-Learn 0.20.dev0 (built from latest master)

The text was updated successfully, but these errors were encountered:

jnothman · 2018-03-19T20:12:53Z

Thanks for the report. At a glance, that looks very plausible.. Test and patch welcome

rwolst · 2018-03-20T11:29:28Z

I'm happy to do this although would be interested in opinions on the test. I could do either

Test what causes the bug above i.e. the model doesn't converge when warm starting.
Test that the initial w0 used in logistic_regression_path is the same as the previous w0 after the function has been run i.e. that warm starting is happening as expected.

The pros of (1) are that its quick and easy however as mentioned previously it doesn't really get to the essence of what is causing the bug. The only reason it is failing is because the w0 is getting initialised so that the rows are exactly identical. If this were not the case but the rows also weren't warm started correctly (i.e. just randomly initialised), the model would still converge (just slower than one would hope if a good warm start had been used) and the test would unfortunately pass.

The pros of (2) are that it would correctly test that the warm starting occurred but the cons would be I don't know how I would do it as the w0 is not available outside of the logistic_regression_path function.

lesteve · 2018-03-20T13:49:32Z

Go for the simplest test first, open a PR and see where that leads you!

lesteve added Bug help wanted labels Mar 20, 2018

aishgrt1 mentioned this issue Apr 16, 2018

Fix LogisticRegression with warm_start, multinomial and binary data #10986

Merged

jnothman closed this as completed in #10986 Jun 23, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Warm start bug when fitting a LogisticRegression model on binary outcomes with `multi_class='multinomial'`. #10836

Warm start bug when fitting a LogisticRegression model on binary outcomes with `multi_class='multinomial'`. #10836

rwolst commented Mar 19, 2018 •

edited

Loading

jnothman commented Mar 19, 2018

Uh oh!

rwolst commented Mar 20, 2018

Uh oh!

lesteve commented Mar 20, 2018

Uh oh!

Uh oh!

Warm start bug when fitting a LogisticRegression model on binary outcomes with multi_class='multinomial'. #10836

Warm start bug when fitting a LogisticRegression model on binary outcomes with multi_class='multinomial'. #10836

Comments

rwolst commented Mar 19, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Steps/Code to Reproduce

Expected Results

Actual Results

Further Information

Versions

jnothman commented Mar 19, 2018

Uh oh!

rwolst commented Mar 20, 2018

Uh oh!

lesteve commented Mar 20, 2018

Uh oh!

Warm start bug when fitting a LogisticRegression model on binary outcomes with `multi_class='multinomial'`. #10836

Warm start bug when fitting a LogisticRegression model on binary outcomes with `multi_class='multinomial'`. #10836

rwolst commented Mar 19, 2018 •

edited

Loading