
Add tolerance tol to LinearRegression for sparse matrix solver lsqr #24601

Open
epplef opened this issue Oct 7, 2022 · 3 comments

@epplef

epplef commented Oct 7, 2022

===BEGIN EDIT===

Description

LinearRegression should have a parameter tol that is passed to the LSQR routine when solving with sparse matrices. This way, it would be (more or less) equivalent to Ridge(alpha=0, solver="lsqr", tol=..).
===END EDIT===

Description

linear_model.LinearRegression performs differently on sparse matrices than on numpy arrays. The built-in unit test test_linear_regression_sparse_equal_dense works well with two features, but not with other feature counts, e.g. n_features=14. Other combinations of n_samples and n_features lead to even larger discrepancies.

There was a similar issue (#13460) in 2019 that was fixed (#13279) and complemented by the mentioned unit test.

Steps/Code to Reproduce

Original code from test_linear_regression_sparse_equal_dense. Only modification: n_features = 14.

import pytest
import numpy as np
from scipy import sparse
from sklearn.utils._testing import assert_allclose
from sklearn.linear_model import LinearRegression

def test_linear_regression_sparse_equal_dense(normalize, fit_intercept):
    # Test that linear regression agrees between sparse and dense
    rng = np.random.RandomState(0)
    n_samples = 200
    n_features = 14
    X = rng.randn(n_samples, n_features)
    X[X < 0.1] = 0.0
    Xcsr = sparse.csr_matrix(X)
    y = rng.rand(n_samples)
    params = dict(normalize=normalize, fit_intercept=fit_intercept)
    clf_dense = LinearRegression(**params)
    clf_sparse = LinearRegression(**params)
    clf_dense.fit(X, y)
    clf_sparse.fit(Xcsr, y)
    assert clf_dense.intercept_ == pytest.approx(clf_sparse.intercept_)
    assert_allclose(clf_dense.coef_, clf_sparse.coef_)
    
test_linear_regression_sparse_equal_dense(False, False)

Expected Results

Coefficients from both regressions should be equal. The test shouldn't throw any error.

Actual Results

Test throws an AssertionError as follows:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-10-52fe7d4efb39> in <module>
     23     assert_allclose(clf_dense.coef_, clf_sparse.coef_)
     24 
---> 25 test_linear_regression_sparse_equal_dense(False, False)

<ipython-input-10-52fe7d4efb39> in test_linear_regression_sparse_equal_dense(normalize, fit_intercept)
     21     clf_sparse.fit(Xcsr, y)
     22     assert clf_dense.intercept_ == pytest.approx(clf_sparse.intercept_)
---> 23     assert_allclose(clf_dense.coef_, clf_sparse.coef_)
     24 
     25 test_linear_regression_sparse_equal_dense(False, False)

    [... skipping hidden 2 frame]

AssertionError: 
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 1 / 14 (7.14%)
Max absolute difference: 3.29286053e-09
Max relative difference: 2.61247094e-07
 x: array([ 0.119148,  0.138634,  0.045699,  0.09633 ,  0.095532,  0.052054,
        0.048282,  0.06737 ,  0.074815, -0.003684,  0.083319,  0.064834,
        0.124406,  0.070745])
 y: array([ 0.119148,  0.138634,  0.045699,  0.09633 ,  0.095532,  0.052054,
        0.048282,  0.06737 ,  0.074815, -0.003684,  0.083319,  0.064834,
        0.124406,  0.070745])

Versions

System:
    python: 3.6.8 (default, Aug 13 2020, 07:46:32)  [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]
executable: /home/envs/my_env/bin/python3
   machine: Linux-3.10.0-1160.62.1.el7.x86_64-x86_64-with-redhat-7.9-Maipo

Python dependencies:
          pip: 21.3.1
   setuptools: 39.2.0
      sklearn: 0.24.2
        numpy: 1.19.5
        scipy: 1.5.4
       Cython: None
       pandas: 1.1.5
   matplotlib: 3.3.4
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: True
@epplef epplef added the Bug and Needs Triage labels Oct 7, 2022
@epplef
Copy link
Author

epplef commented Oct 10, 2022

Further tests indicate that the problem has to do with the tolerances in scipy.sparse.linalg.lsqr. Take the following code and try different values for atol and btol. The standard tolerances of 1e-6 lead to a failed unit test, while 1e-9 leads to satisfying accuracy.

import numpy as np
import scipy.linalg
import scipy.sparse
import scipy.sparse.linalg
from sklearn.utils._testing import assert_allclose

atol=1e-6
btol=1e-6

rng = np.random.RandomState(0)
n_samples = 200
n_features = 14
X = rng.randn(n_samples, n_features)
X[X < 0.1] = 0.0
Xcsr = scipy.sparse.csr_matrix(X)
y = rng.rand(n_samples)

g1 = scipy.linalg.lstsq(X, y)
g2 = scipy.sparse.linalg.lsqr(Xcsr, y, atol=atol, btol=btol)

assert_allclose(g1[0], g2[0])

The problem, however, is how to determine atol and btol in a way that ensures the required accuracy of the solution.

Edit: The documentation of scipy.sparse.linalg.lsqr does not mention it, but its parameter iter_lim might be set to a certain value based on other parameters (that is my guess). To receive a solution that is determined only by those tolerances, and to prevent a "premature" termination, you have to set it explicitly to a high iteration count (e.g. 10000). This might not be relevant in the above example, though.
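
A minimal sketch (mine, not from the comment above) of checking whether lsqr stopped because a tolerance criterion was met rather than because of the iteration limit, on the same data as above; the tolerance values are illustrative:

import numpy as np
import scipy.sparse
import scipy.sparse.linalg

rng = np.random.RandomState(0)
X = rng.randn(200, 14)
X[X < 0.1] = 0.0
Xcsr = scipy.sparse.csr_matrix(X)
y = rng.rand(200)

# Tight tolerances plus an explicit, generous iteration limit, so the
# solver can only stop once a tolerance-based stopping rule fires.
result = scipy.sparse.linalg.lsqr(Xcsr, y, atol=1e-10, btol=1e-10, iter_lim=10000)
coef, istop, itn = result[0], result[1], result[2]

# istop of 1 or 2 means a tolerance-based stopping rule fired;
# istop of 7 means the iteration limit was reached first.
print("istop:", istop, "iterations:", itn)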

@lorentzenchr
Member

As you noticed, it's not a bug but a matter of precision/tolerance. The real problem is that there is no parameter tol for LinearRegression as other linear models have, e.g. Ridge(alpha=0, solver="lsqr", tol=1e-9).
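
A minimal sketch (not from the comment) of what this workaround looks like on the sparse data from the reproduction above; the tol value mirrors the comment, and to my understanding Ridge's lsqr solver passes it on to scipy.sparse.linalg.lsqr:

import numpy as np
from scipy import sparse
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.randn(200, 14)
X[X < 0.1] = 0.0
Xcsr = sparse.csr_matrix(X)
y = rng.rand(200)

# Dense fit with ordinary least squares.
dense = LinearRegression(fit_intercept=False).fit(X, y)

# Sparse fit via Ridge with no penalty and a tight lsqr tolerance,
# standing in for the proposed LinearRegression(tol=...).
sparse_reg = Ridge(alpha=0, solver="lsqr", tol=1e-9, fit_intercept=False).fit(Xcsr, y)

# Inspect how far apart the two coefficient vectors are.
print(np.max(np.abs(dense.coef_ - sparse_reg.coef_)))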

@lorentzenchr lorentzenchr changed the title Linear Regression performs different on sparse matrix Add tol to LinearRegression for sparse matrix solver Oct 24, 2022
@lorentzenchr
Member

This is a bit related to #14268.

@lorentzenchr lorentzenchr changed the title Add tol to LinearRegression for sparse matrix solver Add tolerance tol to LinearRegression for sparse matrix lsqr Oct 24, 2022
@lorentzenchr lorentzenchr changed the title Add tolerance tol to LinearRegression for sparse matrix lsqr Add tolerance tol to LinearRegression for sparse matrix solver lsqr Oct 24, 2022