
LinearRegression on sparse matrices is not sample weight consistent #30131


Closed
Tracked by #16298
antoinebaker opened this issue Oct 22, 2024 · 2 comments · Fixed by #30521

antoinebaker (Contributor) commented Oct 22, 2024

Part of #16298.

Describe the bug

When using a sparse container like csr_array for X, LinearRegression fails to give the same coefficients with unit sample weights as with no sample weights, and more generally fails the test_linear_regression_sample_weight_consistency checks. In that setting, the underlying solver is scipy.sparse.linalg.lsqr.
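For context, the property under test can be sketched with numpy alone: fitting with unit sample weights (or, more generally, rescaling rows by the square root of the weights) should reproduce the unweighted least-squares solution. The `weighted_lstsq` helper below is a hypothetical illustration, not scikit-learn code:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((100, 20))
y = X @ rng.standard_normal(20) + 0.1 * rng.standard_normal(100)

def weighted_lstsq(X, y, sample_weight=None):
    # Minimizing sum_i w_i * (y_i - x_i @ coef)**2 is equivalent to an
    # ordinary least-squares fit on rows rescaled by sqrt(w_i).
    if sample_weight is not None:
        sw = np.sqrt(sample_weight)
        X, y = X * sw[:, None], y * sw
    return np.linalg.lstsq(X, y, rcond=None)[0]

coef_none = weighted_lstsq(X, y)
coef_ones = weighted_lstsq(X, y, np.ones_like(y))
# With a direct dense solver, unit weights and no weights agree.
np.testing.assert_allclose(coef_none, coef_ones, rtol=1e-12)
```

This is the equivalence that the iterative sparse path breaks in practice.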

Steps/Code to Reproduce

import numpy as np
from sklearn.utils.fixes import csr_array
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.utils._testing import assert_allclose

X, y = make_regression(100, 100, random_state=42)
X = csr_array(X)
reg = LinearRegression(fit_intercept=True)
reg.fit(X, y)
coef1 = reg.coef_
reg.fit(X, y, sample_weight=np.ones_like(y))
coef2 = reg.coef_
assert_allclose(coef1, coef2, rtol=1e-7, atol=1e-9)

Expected Results

The assert_allclose should pass.

Actual Results

AssertionError: 
Not equal to tolerance rtol=1e-07, atol=1e-09

Mismatched elements: 100 / 100 (100%)
Max absolute difference among violations: 0.00165048
Max relative difference among violations: 0.02621317
 ACTUAL: array([-2.450778e-01,  2.917985e+01,  1.678916e+00,  7.534454e+01,
        1.241587e+01,  1.076716e+00, -4.975206e-01, -9.262295e-01,
       -1.373931e+00, -1.624112e-01, -8.644422e-01, -5.986218e-01,...
 DESIRED: array([-2.452359e-01,  2.918078e+01,  1.678681e+00,  7.534410e+01,
        1.241459e+01,  1.076624e+00, -4.962305e-01, -9.257701e-01,
       -1.373862e+00, -1.622824e-01, -8.652183e-01, -5.981715e-01,...

The test also fails for fit_intercept=False. Note that this test and other sample weight consistency checks pass if we do not wrap X in a sparse container.
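The dense-input claim is easy to check directly; a minimal sketch reusing the reproducer's data without the csr_array wrapper (on dense input, LinearRegression goes through scipy.linalg.lstsq rather than lsqr, and the check passes):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Same data as the reproducer, but X stays a dense ndarray.
X, y = make_regression(100, 100, random_state=42)
reg = LinearRegression(fit_intercept=True)
coef1 = reg.fit(X, y).coef_
coef2 = reg.fit(X, y, sample_weight=np.ones_like(y)).coef_
np.testing.assert_allclose(coef1, coef2, rtol=1e-7, atol=1e-9)
```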

Versions

System:
    python: 3.12.5 | packaged by conda-forge | (main, Aug  8 2024, 18:32:50) [Clang 16.0.6 ]
executable: /Users/abaker/miniforge3/envs/sklearn-dev/bin/python
   machine: macOS-14.5-arm64-arm-64bit

Python dependencies:
      sklearn: 1.6.dev0
          pip: 24.2
   setuptools: 73.0.1
        numpy: 2.1.0
        scipy: 1.14.1
       Cython: 3.0.11
       pandas: 2.2.2
   matplotlib: 3.9.2
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /Users/abaker/miniforge3/envs/sklearn-dev/lib/libopenblas.0.dylib
        version: 0.3.27
threading_layer: openmp
   architecture: VORTEX

       user_api: openmp
   internal_api: openmp
    num_threads: 8
         prefix: libomp
       filepath: /Users/abaker/miniforge3/envs/sklearn-dev/lib/libomp.dylib
        version: None

EDIT: discovered while working on #30040 for the case of dense inputs.

ogrisel (Member) commented Oct 22, 2024

Note that this is not the case for Ridge(solver="lsqr", alpha=0, tol=1e-12):

>>> import numpy as np
... from sklearn.utils.fixes import csr_array
... from sklearn.datasets import make_regression
... from sklearn.linear_model import Ridge
... from sklearn.utils._testing import assert_allclose
... 
... X, y = make_regression(100, 100, random_state=42)
... X = csr_array(X)
... reg = Ridge(solver="lsqr", alpha=0, fit_intercept=True, tol=1e-12)
... reg.fit(X, y)
... coef1 = reg.coef_
... reg.fit(X, y, sample_weight=np.ones_like(y))
... coef2 = reg.coef_
... assert_allclose(coef1, coef2, rtol=1e-7, atol=1e-9)

But the same check fails with a larger tol value, which is passed as the atol and btol parameters of the lsqr call:

>>> reg = Ridge(solver="lsqr", alpha=0, fit_intercept=True, tol=1e-4)
... reg.fit(X, y)
... coef1 = reg.coef_
... reg.fit(X, y, sample_weight=np.ones_like(y))
... coef2 = reg.coef_
... assert_allclose(coef1, coef2, rtol=1e-7, atol=1e-9)
Traceback (most recent call last):
  Cell In[9], line 13
    assert_allclose(coef1, coef2, rtol=1e-7, atol=1e-9)
  File ~/code/scikit-learn/sklearn/utils/_testing.py:232 in assert_allclose
    np_assert_allclose(
  File ~/miniforge3/envs/dev/lib/python3.12/site-packages/numpy/testing/_private/utils.py:1688 in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File ~/miniforge3/envs/dev/lib/python3.12/contextlib.py:81 in inner
    return func(*args, **kwds)
  File ~/miniforge3/envs/dev/lib/python3.12/site-packages/numpy/testing/_private/utils.py:889 in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=1e-07, atol=1e-09

Mismatched elements: 100 / 100 (100%)
Max absolute difference among violations: 0.00143505
Max relative difference among violations: 0.03345462
 ACTUAL: array([ 5.135931e-01,  3.037594e+01,  2.145667e+00,  7.394602e+01,
        1.324545e+01, -4.438229e-02, -1.202226e+00,  7.553014e-01,
       -1.661572e+00,  1.483041e-01, -3.624484e-01, -1.915631e+00,...
 DESIRED: array([ 5.125267e-01,  3.037609e+01,  2.145875e+00,  7.394601e+01,
        1.324581e+01, -4.334871e-02, -1.202492e+00,  7.555202e-01,
       -1.661108e+00,  1.495079e-01, -3.613935e-01, -1.916570e+00,...

I think we need to pass atol=self.tol / btol=self.tol to the inner call to lsqr and expose a tolerance parameter as we do in Ridge. Alternatively, we could set a strict, dtype-dependent default value, as we now do for the cond parameter in #30040.
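The effect of those stopping tolerances can be sketched with scipy alone; the matrix, right-hand side, and tolerance values below are illustrative, not taken from the estimator:

```python
import numpy as np
from scipy.sparse import csr_array
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(0)
A = csr_array(rng.standard_normal((100, 100)))
b = rng.standard_normal(100)

# Reference solution from a direct dense solver.
ref = np.linalg.lstsq(A.toarray(), b, rcond=None)[0]

# Loose stopping tolerances make lsqr stop early, away from `ref` ...
coef_loose = lsqr(A, b, atol=1e-4, btol=1e-4)[0]
# ... while tight tolerances essentially recover it.
coef_tight = lsqr(A, b, atol=1e-12, btol=1e-12)[0]

err_loose = np.linalg.norm(coef_loose - ref)
err_tight = np.linalg.norm(coef_tight - ref)
assert err_tight < err_loose
```

This matches the observation above that the check passes with tol=1e-12 but fails with tol=1e-4: with loose tolerances, two fits that differ only by floating-point noise can stop at visibly different iterates.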

lorentzenchr (Member) commented

I think we should expose tol as in Ridge.
