LogisticRegressionCV does not handle sample weights as expected when using liblinear solver #29416

snath-xoc · 2024-07-04T15:33:25Z

Note: this is a special case of the wider problem described in:

RFC Sample weight invariance properties #15657

Describe the bug

_log_reg_scoring_path, used within LogisticRegressionCV, does not return the same coefficients with the liblinear solver when samples are weighted via sample_weight as when samples are repeated according to those weights.

NOTE: line 801 in _log_reg_scoring_path does not pass sample_weight to the scorer when no scorer is specified; this needs fixing.
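To illustrate why the scorer needs sample_weight, here is a minimal sketch that is independent of LogisticRegressionCV. It uses made-up labels and predictions (the arrays are illustrative assumptions, not data from this issue) and checks that a weighted accuracy matches the accuracy computed on the repeated samples, while the unweighted accuracy generally does not.

```python
import numpy as np
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(0)

# Illustrative labels, predictions and integer weights (not from the issue).
y_true = rng.randint(0, 2, size=20)
y_pred = rng.randint(0, 2, size=20)
sw = rng.randint(0, 5, size=20)

# Score with weights, and score on the data repeated according to the weights.
weighted = accuracy_score(y_true, y_pred, sample_weight=sw)
repeated = accuracy_score(np.repeat(y_true, sw), np.repeat(y_pred, sw))

# Score that ignores the weights, as a scorer called without sample_weight would.
unweighted = accuracy_score(y_true, y_pred)

print("weighted:", weighted, "repeated:", repeated, "unweighted:", unweighted)
```

If the CV scorer is called without the weights, the per-fold scores correspond to the unweighted value, so the selected C can differ from the one selected on the repeated data.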

Steps/Code to Reproduce

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import get_scorer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import LeaveOneGroupOut


import sklearn
sklearn.set_config(enable_metadata_routing=True)

rng = np.random.RandomState(0)

X, y = make_classification(
    n_samples=300000,
    n_features=8,
    random_state=10,
    n_informative=4,
    n_classes=2,
)

n_samples = X.shape[0] // 3
sw = np.ones_like(y)

# Give each sample in the first fold a random integer weight in [0, 5).
sw[:n_samples] = rng.randint(0, 5, size=n_samples)
groups_sw = np.r_[
    np.full(n_samples, 0), np.full(n_samples, 1), np.full(n_samples, 2)
]
splits_weighted = list(LeaveOneGroupOut().split(X, groups=groups_sw))

# Repeat each sample according to its weight (only the first fold changes)
# and build the matching splits ourselves.
X_resampled_by_weights = np.repeat(X, sw.astype(int), axis=0)

# Net change in the number of rows caused by the repetitions
n_reps = X_resampled_by_weights.shape[0] - X.shape[0]

y_resampled_by_weights = np.repeat(y, sw.astype(int), axis=0)
groups = np.r_[
    np.full(n_reps + n_samples, 0), np.full(n_samples, 1), np.full(n_samples, 2)
]
splits_repeated = list(LeaveOneGroupOut().split(X_resampled_by_weights, groups=groups))

est_weighted = LogisticRegression(solver="liblinear").fit(X, y, sample_weight=sw)
est_repeated = LogisticRegression(solver="liblinear").fit(
    X_resampled_by_weights, y_resampled_by_weights
)

np.testing.assert_allclose(est_weighted.coef_, est_repeated.coef_)

est_weighted = LogisticRegressionCV(cv=splits_weighted, solver="liblinear").fit(
    X, y, sample_weight=sw
)
est_repeated = LogisticRegressionCV(cv=splits_repeated, solver="liblinear").fit(
    X_resampled_by_weights, y_resampled_by_weights
)

np.testing.assert_allclose(est_weighted.coef_, est_repeated.coef_)

Expected Results

No error is thrown

Actual Results

AssertionError: 
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 8 / 8 (100%)
Max absolute difference among violations: 0.02352997
Max relative difference among violations: 10.49415031
 ACTUAL: array([[ 5.580057e-01,  1.455297e-01,  1.117538e-02,  9.940221e-04,
         2.078733e-05, -2.118241e-01, -2.361904e-01, -6.555003e-01]])
 DESIRED: array([[ 5.757953e-01,  1.541149e-01,  9.722671e-04,  1.094184e-03,
         1.143567e-04, -2.027509e-01, -2.405034e-01, -6.790303e-01]])

Versions

System:
    python: 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:13:44) [Clang 16.0.6 ]
executable: /Users/shrutinath/micromamba/envs/scikit-learn/bin/python
   machine: macOS-14.3-arm64-arm-64bit

Python dependencies:
      sklearn: 1.6.dev0
          pip: 24.0
   setuptools: 70.1.1
        numpy: 2.0.0
        scipy: 1.14.0
       Cython: 3.0.10
       pandas: None
   matplotlib: 3.9.0
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/shrutinath/micromamba/envs/scikit-learn/lib/libopenblas.0.dylib
        version: 0.3.27
threading_layer: openmp
   architecture: VORTEX

       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /Users/shrutinath/micromamba/envs/scikit-learn/lib/libomp.dylib
        version: None


ogrisel · 2024-07-05T08:01:50Z

Thanks for the report.

I agree that the reproducer makes sense: repeating the samples within the first CV group should be equivalent to weighting the samples within this same first CV group.

ogrisel · 2024-07-05T12:19:29Z

@snath-xoc do you understand why the reproducer fails with solver="lbfgs"? The fix you propose in #29419 seems to impact all solvers.

ogrisel · 2024-07-05T13:33:13Z

Maybe you could make the test harder with a larger number of values for the grid of Cs, e.g. Cs=100 instead of the default of Cs=10.

Then before checking the values of the coef_ attribute, you can check that the selected C is the same:

np.testing.assert_allclose(est_weighted.C_, est_repeated.C_)

ogrisel · 2024-07-05T13:36:24Z

Also to make the test faster to execute and harder to pass at the same time, it would make sense to generate a dataset with a lower number of data points, e.g. n_samples=300 instead of n_samples=300000.

EDIT: I tried to edit the reproducer to only use n_samples=300 and Cs=100 and the assertions pass both for lbfgs and liblinear on main. I am confused, I would have expected this setting to be stricter.
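For reference, a compact version of the reproducer with the suggested n_samples=300 and Cs=100 could look like the sketch below. The helper name fit_weighted_and_repeated is hypothetical, metadata routing is left at its default setting (unlike the original reproducer), and n_samples is assumed to be divisible by 3. Whether the two fits agree depends on the scikit-learn version, so no expected output is given.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import LeaveOneGroupOut


def fit_weighted_and_repeated(solver="liblinear", n_samples=300, Cs=100):
    """Fit LogisticRegressionCV on weighted data and on the repeated data."""
    rng = np.random.RandomState(0)
    X, y = make_classification(
        n_samples=n_samples, n_features=8, n_informative=4,
        n_classes=2, random_state=10,
    )

    # Three equally sized CV groups; only the first one gets random weights.
    n_fold = n_samples // 3
    groups = np.repeat([0, 1, 2], n_fold)
    sw = np.ones_like(y)
    sw[:n_fold] = rng.randint(0, 5, size=n_fold)
    cv_weighted = list(LeaveOneGroupOut().split(X, groups=groups))

    # Repeat rows (and their group labels) according to the weights.
    X_rep = np.repeat(X, sw, axis=0)
    y_rep = np.repeat(y, sw, axis=0)
    groups_rep = np.repeat(groups, sw)
    cv_repeated = list(LeaveOneGroupOut().split(X_rep, groups=groups_rep))

    est_weighted = LogisticRegressionCV(Cs=Cs, cv=cv_weighted, solver=solver)
    est_weighted.fit(X, y, sample_weight=sw)
    est_repeated = LogisticRegressionCV(Cs=Cs, cv=cv_repeated, solver=solver)
    est_repeated.fit(X_rep, y_rep)
    return est_weighted, est_repeated


est_weighted, est_repeated = fit_weighted_and_repeated()

# Check the selected C first, then the coefficients.
print("same C selected:", np.allclose(est_weighted.C_, est_repeated.C_))
print("max |coef diff|:", np.abs(est_weighted.coef_ - est_repeated.coef_).max())
```

Passing solver="lbfgs" to the helper runs the same comparison for the other solver discussed here.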

ogrisel · 2024-07-05T14:12:21Z

Actually we can detect the problem for "lbfgs" as well by comparing the values for the scores_ attribute:

np.testing.assert_allclose(est_weighted.scores_[1], est_repeated.scores_[1])

and this happens also for small values of n_samples=300.
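Since the mismatch can show up in scores_ while C_ and coef_ still agree, it may help to compare all three attributes at once. The helper below is a hypothetical sketch (compare_cv_fits is not a scikit-learn function). It assumes scores_ is a dict keyed by class label, as in the scores_[1] access above, and is demonstrated on stand-in objects with made-up values, not on real fits.

```python
from types import SimpleNamespace

import numpy as np


def compare_cv_fits(est_a, est_b, rtol=1e-7):
    """Return {attribute name: True/False} telling which fitted attributes agree."""
    same_scores = est_a.scores_.keys() == est_b.scores_.keys() and all(
        np.allclose(est_a.scores_[key], est_b.scores_[key], rtol=rtol)
        for key in est_a.scores_
    )
    return {
        "scores_": bool(same_scores),
        "C_": bool(np.allclose(est_a.C_, est_b.C_, rtol=rtol)),
        "coef_": bool(np.allclose(est_a.coef_, est_b.coef_, rtol=rtol)),
    }


# Stand-ins for two fitted estimators (made-up values): the CV scores differ
# slightly, but the selected C and the coefficients are identical.
fit_a = SimpleNamespace(
    scores_={1: np.array([[0.80, 0.82]])},
    C_=np.array([1.0]),
    coef_=np.array([[0.5, -0.2]]),
)
fit_b = SimpleNamespace(
    scores_={1: np.array([[0.79, 0.82]])},
    C_=np.array([1.0]),
    coef_=np.array([[0.5, -0.2]]),
)

report = compare_cv_fits(fit_a, fit_b)
print(report)
```

With real estimators, a report where only scores_ disagrees would match the "lbfgs" situation described above.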

ogrisel · 2024-07-05T14:13:45Z

And I confirm that the changes in #29419 do fix the assertion failure on the .scores_ values. Good job at finding the culprit @snath-xoc!

snath-xoc added the Bug and Needs Triage labels Jul 4, 2024

snath-xoc changed the title ~~LogisticRegressionCV does not handle sample weights as expected~~ LogisticRegressionCV does not handle sample weights as expected when using liblinear Jul 4, 2024

snath-xoc changed the title ~~LogisticRegressionCV does not handle sample weights as expected when using liblinear~~ LogisticRegressionCV does not handle sample weights as expected when using liblinear solver Jul 4, 2024

ogrisel removed the Needs Triage label Jul 5, 2024

lesteve added the Metadata Routing label Jul 5, 2024

snath-xoc mentioned this issue Jul 5, 2024

Fix sample weight handling in scoring _log_reg_scoring_path #29419

Merged

2 tasks

ogrisel closed this as completed in #29419 Sep 6, 2024
