LogisticRegressionCV does not handle sample weights as expected when using liblinear solver #29416

Closed
snath-xoc opened this issue Jul 4, 2024 · 6 comments · Fixed by #29419
Labels: Bug, Metadata Routing (all issues related to metadata routing, slep006, sample props)

Comments

Contributor

snath-xoc commented Jul 4, 2024

Note: this is a special case of the wider problem described in:

Describe the bug

When LogisticRegressionCV is used with the liblinear solver, _log_reg_scoring_path does not return the same coefficients when samples are weighted via sample_weight as when the samples are repeated according to those weights.

NOTE: L801 in _log_reg_scoring_path does not pass sample_weight to the scorer when no scorer is specified; this needs fixing.
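
To illustrate why it matters whether the scorer receives sample_weight, here is a minimal sketch in plain Python. It is not scikit-learn's implementation: `weighted_accuracy` and `repeat_by_weight` are hypothetical helpers and the labels and weights are toy values chosen for illustration. It shows that a weighted accuracy matches the accuracy computed on repeated samples, while an unweighted score on the same predictions does not.

```python
def weighted_accuracy(y_true, y_pred, sample_weight=None):
    """Fraction of correct predictions, optionally weighted per sample."""
    if sample_weight is None:
        sample_weight = [1] * len(y_true)
    correct = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return correct / sum(sample_weight)


def repeat_by_weight(values, weights):
    """Repeat each value `weight` times (a weight of 0 drops the value)."""
    return [v for v, w in zip(values, weights) for _ in range(w)]


# Toy labels, predictions and integer weights (illustrative only).
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 0]
sw = [3, 0, 2, 1, 4]

# Scoring with weights agrees with scoring the repeated samples.
score_weighted = weighted_accuracy(y_true, y_pred, sample_weight=sw)
score_repeated = weighted_accuracy(
    repeat_by_weight(y_true, sw), repeat_by_weight(y_pred, sw)
)

# Dropping the weights gives a different score for the same predictions.
score_unweighted = weighted_accuracy(y_true, y_pred)

print(score_weighted, score_repeated, score_unweighted)
```

If the cross-validation scores are computed without the weights, the weighted and repeated runs can therefore rank the candidate values of C differently.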

Steps/Code to Reproduce

import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import LeaveOneGroupOut

# Enable metadata routing (SLEP006), as in the original report.
sklearn.set_config(enable_metadata_routing=True)

rng = np.random.RandomState(0)

X, y = make_classification(
    n_samples=300000,
    n_features=8,
    random_state=10,
    n_informative=4,
    n_classes=2,
)

n_samples = X.shape[0] // 3
sw = np.ones_like(y)

# Give each sample in the first fold a random integer weight in [0, 5).
sw[:n_samples] = rng.randint(0, 5, size=n_samples)
groups_sw = np.r_[
    np.full(n_samples, 0), np.full(n_samples, 1), np.full(n_samples, 2)
]
splits_weighted = list(LeaveOneGroupOut().split(X, groups=groups_sw))

# Repeat each sample according to its integer weight (a weight of 0 drops
# the sample). The CV splits for the repeated data are built by hand below.
X_resampled_by_weights = np.repeat(X, sw.astype(int), axis=0)

# Net change in the number of rows caused by the repetition.
n_reps = X_resampled_by_weights.shape[0] - X.shape[0]

y_resampled_by_weights = np.repeat(y, sw.astype(int), axis=0)
groups = np.r_[
    np.full(n_reps + n_samples, 0), np.full(n_samples, 1), np.full(n_samples, 2)
]
splits_repeated = list(LeaveOneGroupOut().split(X_resampled_by_weights, groups=groups))

# Sanity check with plain LogisticRegression: weighted fit vs. repeated fit.
est_weighted = LogisticRegression(solver="liblinear").fit(X, y, sample_weight=sw)
est_repeated = LogisticRegression(solver="liblinear").fit(
    X_resampled_by_weights, y_resampled_by_weights
)

np.testing.assert_allclose(est_weighted.coef_, est_repeated.coef_)

# Same comparison with LogisticRegressionCV, which is the subject of this report.
est_weighted = LogisticRegressionCV(cv=splits_weighted, solver="liblinear").fit(
    X, y, sample_weight=sw
)
est_repeated = LogisticRegressionCV(cv=splits_repeated, solver="liblinear").fit(
    X_resampled_by_weights, y_resampled_by_weights
)

np.testing.assert_allclose(est_weighted.coef_, est_repeated.coef_)

Expected Results

No error is thrown

Actual Results

AssertionError: 
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 8 / 8 (100%)
Max absolute difference among violations: 0.02352997
Max relative difference among violations: 10.49415031
 ACTUAL: array([[ 5.580057e-01,  1.455297e-01,  1.117538e-02,  9.940221e-04,
         2.078733e-05, -2.118241e-01, -2.361904e-01, -6.555003e-01]])
 DESIRED: array([[ 5.757953e-01,  1.541149e-01,  9.722671e-04,  1.094184e-03,
         1.143567e-04, -2.027509e-01, -2.405034e-01, -6.790303e-01]])

Versions

System:
    python: 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:13:44) [Clang 16.0.6 ]
executable: /Users/shrutinath/micromamba/envs/scikit-learn/bin/python
   machine: macOS-14.3-arm64-arm-64bit

Python dependencies:
      sklearn: 1.6.dev0
          pip: 24.0
   setuptools: 70.1.1
        numpy: 2.0.0
        scipy: 1.14.0
       Cython: 3.0.10
       pandas: None
   matplotlib: 3.9.0
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/shrutinath/micromamba/envs/scikit-learn/lib/libopenblas.0.dylib
        version: 0.3.27
threading_layer: openmp
   architecture: VORTEX

       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /Users/shrutinath/micromamba/envs/scikit-learn/lib/libomp.dylib
        version: None
snath-xoc added the Bug and Needs Triage (Issue requires triage) labels on Jul 4, 2024
snath-xoc changed the title from "LogisticRegressionCV does not handle sample weights as expected" to "LogisticRegressionCV does not handle sample weights as expected when using liblinear" on Jul 4, 2024
snath-xoc changed the title from "LogisticRegressionCV does not handle sample weights as expected when using liblinear" to "LogisticRegressionCV does not handle sample weights as expected when using liblinear solver" on Jul 4, 2024
ogrisel removed the Needs Triage (Issue requires triage) label on Jul 5, 2024
Member

ogrisel commented Jul 5, 2024

Thanks for the report.

I agree that the reproducer makes sense: repeating the samples within the first CV group should be equivalent to weighting the samples within this same first CV group.
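
For this equivalence to hold, the group labels of the repeated data must stay aligned with the repeated rows. The following NumPy-only sketch checks that bookkeeping on a small stand-in for the reproducer. `n_per_fold` and `groups_repeated` are hypothetical names introduced here, and building the labels with `np.repeat(groups_sw, sw)` is offered as an alternative to the manual construction, not as what the reproducer does.

```python
import numpy as np

rng = np.random.RandomState(0)
n_per_fold = 5  # small stand-in for n_samples in the reproducer

# One group label per original sample: three folds of equal size.
groups_sw = np.repeat([0, 1, 2], n_per_fold)

# Integer weights: random in [0, 5) for the first fold, 1 elsewhere.
sw = np.ones(3 * n_per_fold, dtype=int)
sw[:n_per_fold] = rng.randint(0, 5, size=n_per_fold)

# Repeating the labels with the same counts keeps them aligned with
# the rows produced by np.repeat(X, sw, axis=0).
groups_repeated = np.repeat(groups_sw, sw)

# Manual construction, following the reproducer.
n_reps = sw.sum() - sw.shape[0]
groups_manual = np.r_[
    np.full(n_reps + n_per_fold, 0),
    np.full(n_per_fold, 1),
    np.full(n_per_fold, 2),
]

# Both constructions give the same labels.
assert np.array_equal(groups_repeated, groups_manual)
```

Since both constructions agree, the splits in the reproducer do compare the same folds, which supports reading the mismatch as a problem in the estimator rather than in the test setup.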

lesteve added the Metadata Routing (all issues related to metadata routing, slep006, sample props) label on Jul 5, 2024
Member

ogrisel commented Jul 5, 2024

@snath-xoc do you understand why the reproducer fails with solver="lbfgs"? The fix you propose in #29419 seems to impact all solvers.

Member

ogrisel commented Jul 5, 2024

Maybe you could make the test harder with a larger number of values for the grid of Cs, e.g. Cs=100 instead of the default of Cs=10.

Then before checking the values of the coef_ attribute, you can check that the selected C is the same:

np.testing.assert_allclose(est_weighted.C_, est_repeated.C_)

Member

ogrisel commented Jul 5, 2024

Also to make the test faster to execute and harder to pass at the same time, it would make sense to generate a dataset with a lower number of data points, e.g. n_samples=300 instead of n_samples=300000.

EDIT: I tried to edit the reproducer to only use n_samples=300 and Cs=100, and the assertions pass for both lbfgs and liblinear on main. I am confused: I would have expected this setting to be stricter.

Member

ogrisel commented Jul 5, 2024

Actually we can detect the problem for "lbfgs" as well by comparing the values for the scores_ attribute:

np.testing.assert_allclose(est_weighted.scores_[1], est_repeated.scores_[1])

This also happens for small datasets, e.g. n_samples=300.

Member

ogrisel commented Jul 5, 2024

And I confirm that the changes in #29419 do fix the assertion failure on the .scores_ values. Good job at finding the culprit @snath-xoc!


3 participants