
Add tolerance tol to LinearRegression for sparse matrix solver lsqr #24601

Open

@epplef

Description

LinearRegression should have a parameter tol that is passed to the LSQR routine when solving with sparse matrices. That way, it would be (more or less) equivalent to Ridge(alpha=0, solver="lsqr", tol=...).
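For reference, SciPy's LSQR routine already exposes such stopping tolerances (`atol`, `btol`). A minimal sketch of the effect a forwarded `tol` would have, using `scipy.sparse.linalg.lsqr` directly (this is an illustration of the underlying solver, not the proposed scikit-learn API):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import lsqr

# Small sparse least-squares problem, built like the unit test below.
rng = np.random.RandomState(0)
X = rng.randn(100, 14)
X[X < 0.1] = 0.0
Xcsr = sparse.csr_matrix(X)
y = rng.rand(100)

# Dense reference solution via ordinary least squares.
coef_dense, *_ = np.linalg.lstsq(X, y, rcond=None)

# LSQR with its default tolerances (atol = btol = 1e-6)
# versus a much tighter setting.
coef_default = lsqr(Xcsr, y)[0]
coef_tight = lsqr(Xcsr, y, atol=1e-12, btol=1e-12)[0]

err_default = np.max(np.abs(coef_default - coef_dense))
err_tight = np.max(np.abs(coef_tight - coef_dense))
print(f"default tol: {err_default:.2e}, tight tol: {err_tight:.2e}")
```

Tightening `atol`/`btol` brings the LSQR solution much closer to the dense least-squares result, which is exactly what a user-controllable `tol` on LinearRegression would allow.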
Original description

linear_model.LinearRegression performs differently on sparse matrices than on NumPy arrays. The built-in unit test test_linear_regression_sparse_equal_dense passes with two features, but not with other feature counts, e.g. n_features=14. Other combinations of n_samples and n_features lead to even larger discrepancies.

A similar issue (#13460) was reported in 2019 and fixed (#13279); the fix was complemented by the unit test mentioned above.

Steps/Code to Reproduce

Original code from test_linear_regression_sparse_equal_dense, with a single modification: n_features = 14.

import pytest
import numpy as np
from scipy import sparse
from sklearn.utils._testing import assert_allclose
from sklearn.linear_model import LinearRegression

def test_linear_regression_sparse_equal_dense(normalize, fit_intercept):
    # Test that linear regression agrees between sparse and dense
    rng = np.random.RandomState(0)
    n_samples = 200
    n_features = 14
    X = rng.randn(n_samples, n_features)
    X[X < 0.1] = 0.0
    Xcsr = sparse.csr_matrix(X)
    y = rng.rand(n_samples)
    params = dict(normalize=normalize, fit_intercept=fit_intercept)
    clf_dense = LinearRegression(**params)
    clf_sparse = LinearRegression(**params)
    clf_dense.fit(X, y)
    clf_sparse.fit(Xcsr, y)
    assert clf_dense.intercept_ == pytest.approx(clf_sparse.intercept_)
    assert_allclose(clf_dense.coef_, clf_sparse.coef_)
    
test_linear_regression_sparse_equal_dense(False, False)

Expected Results

Coefficients from both regressions should be equal. The test shouldn't throw any error.

Actual Results

Test throws an AssertionError as follows:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-10-52fe7d4efb39> in <module>
     23     assert_allclose(clf_dense.coef_, clf_sparse.coef_)
     24 
---> 25 test_linear_regression_sparse_equal_dense(False, False)

<ipython-input-10-52fe7d4efb39> in test_linear_regression_sparse_equal_dense(normalize, fit_intercept)
     21     clf_sparse.fit(Xcsr, y)
     22     assert clf_dense.intercept_ == pytest.approx(clf_sparse.intercept_)
---> 23     assert_allclose(clf_dense.coef_, clf_sparse.coef_)
     24 
     25 test_linear_regression_sparse_equal_dense(False, False)

    [... skipping hidden 2 frame]

AssertionError: 
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 1 / 14 (7.14%)
Max absolute difference: 3.29286053e-09
Max relative difference: 2.61247094e-07
 x: array([ 0.119148,  0.138634,  0.045699,  0.09633 ,  0.095532,  0.052054,
        0.048282,  0.06737 ,  0.074815, -0.003684,  0.083319,  0.064834,
        0.124406,  0.070745])
 y: array([ 0.119148,  0.138634,  0.045699,  0.09633 ,  0.095532,  0.052054,
        0.048282,  0.06737 ,  0.074815, -0.003684,  0.083319,  0.064834,
        0.124406,  0.070745])
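As a possible workaround until LinearRegression exposes `tol`, the equivalence suggested in the description can be exercised with Ridge directly, which already forwards `tol` to LSQR. A sketch (assuming alpha=0 is accepted by the lsqr solver; fit_intercept=False is used here so the sparse and dense paths go through the same solver without centering):

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import Ridge

# Same data shape as in the reproduction above.
rng = np.random.RandomState(0)
n_samples, n_features = 200, 14
X = rng.randn(n_samples, n_features)
X[X < 0.1] = 0.0
Xcsr = sparse.csr_matrix(X)
y = rng.rand(n_samples)

# Unpenalized Ridge with the lsqr solver and a tight tolerance.
params = dict(alpha=0, solver="lsqr", tol=1e-12, fit_intercept=False)
coef_dense = Ridge(**params).fit(X, y).coef_
coef_sparse = Ridge(**params).fit(Xcsr, y).coef_

print(np.max(np.abs(coef_dense - coef_sparse)))
```

With a tight `tol`, the sparse and dense fits agree far more closely than the default LinearRegression behavior shown above, which is the motivation for exposing the same knob there.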

Versions

System:
    python: 3.6.8 (default, Aug 13 2020, 07:46:32)  [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]
executable: /home/envs/my_env/bin/python3
   machine: Linux-3.10.0-1160.62.1.el7.x86_64-x86_64-with-redhat-7.9-Maipo

Python dependencies:
          pip: 21.3.1
   setuptools: 39.2.0
      sklearn: 0.24.2
        numpy: 1.19.5
        scipy: 1.5.4
       Cython: None
       pandas: 1.1.5
   matplotlib: 3.3.4
       joblib: 1.1.0
threadpoolctl: 3.1.0

Built with OpenMP: True
