-
-
Notifications
You must be signed in to change notification settings - Fork 26.5k
Description
Description
Bug when fitting a LogisticRegression model on binary outcomes with multi_class='multinomial' when using warm start. Note that it is similar to the issue here #9889 i.e. only using a 1D coef object on binary outcomes even when using multi_class='multinomial' as opposed to a 2D coef object.
Steps/Code to Reproduce
from sklearn.linear_model import LogisticRegression
import sklearn.metrics
import numpy as np
# Set up a logistic regression object
lr = LogisticRegression(C=1000000, multi_class='multinomial',
solver='sag', tol=0.0001, warm_start=True,
verbose=0)
# Set independent variable values
Z = np.array([
[ 0. , 0. ],
[ 1.33448632, 0. ],
[ 1.48790105, -0.33289528],
[-0.47953866, -0.61499779],
[ 1.55548163, 1.14414766],
[-0.31476657, -1.29024053],
[-1.40220786, -0.26316645],
[ 2.227822 , -0.75403668],
[-0.78170885, -1.66963585],
[ 2.24057471, -0.74555021],
[-1.74809665, 2.25340192],
[-1.74958841, 2.2566389 ],
[ 2.25984734, -1.75106702],
[ 0.50598996, -0.77338402],
[ 1.21968303, 0.57530831],
[ 1.65370219, -0.36647173],
[ 0.66569897, 1.77740068],
[-0.37088553, -0.92379819],
[-1.17757946, -0.25393047],
[-1.624227 , 0.71525192]])
# Set dependant variable values
Y = np.array([1, 0, 0, 1, 0, 0, 0, 0,
0, 0, 1, 1, 1, 0, 0, 1,
0, 0, 1, 1], dtype=np.int32)
# First fit model normally
lr.fit(Z, Y)
p = lr.predict_proba(Z)
print(sklearn.metrics.log_loss(Y, p)) # ...
print(lr.intercept_)
print(lr.coef_)
# Now fit model after a warm start
lr.fit(Z, Y)
p = lr.predict_proba(Z)
print(sklearn.metrics.log_loss(Y, p)) # ...
print(lr.intercept_)
print(lr.coef_)
Expected Results
The predictions should be the same as the model converged the first time it was run.
Actual Results
The predictions are different. In fact the more times you re-run the fit the worse it gets. This is actually the only reason I was able to catch the bug. It is caused by the line here https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py#L678.
w0[:, :coef.shape[1]] = coef
As coef is (1, n_features), but w0 is (2, n_features), this causes the coef value to be broadcast into the w0. This some sort of singularity issue when training resulting in worse performance. Note that had it not done exactly this i.e. w0 was simply initialised by some random values, this bug would be very hard to catch because of course each time the model would converge just not as fast as one would hope when warm starting.
Further Information
The fix I believe is very easy, just need to swap the previous line to
if n_classes == 1:
w0[0, :coef.shape[1]] = -coef # Be careful to get these the right way around
w0[1, :coef.shape[1]] = coef
else:
w0[:, :coef.shape[1]] = coef
Versions
Linux-4.13.0-37-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
NumPy 1.14.2
SciPy 1.0.0
Scikit-Learn 0.20.dev0 (built from latest master)