Description
===BEGIN EDIT===
LinearRegression should have a parameter tol that is passed to the LSQR routine used for solving with sparse matrices. This way, it should be (more or less) equivalent to Ridge(alpha=0, solver="lsqr", tol=..).
===END EDIT===
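To illustrate why a tol parameter would help, here is a minimal sketch that calls scipy.sparse.linalg.lsqr directly on the same random data as the reproduction script in this report and compares the result with a dense numpy.linalg.lstsq solution. It bypasses LinearRegression entirely. The two tolerance values and the helper max_rel_diff are illustrative assumptions, not scikit-learn internals.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import lsqr

# Same data generation as in the reproduction script (no intercept).
rng = np.random.RandomState(0)
X = rng.randn(200, 14)
X[X < 0.1] = 0.0
y = rng.rand(200)
Xcsr = sparse.csr_matrix(X)

# Dense reference solution via a direct least-squares solver.
coef_dense = np.linalg.lstsq(X, y, rcond=None)[0]


def max_rel_diff(actual, desired):
    """Largest element-wise relative difference (hypothetical helper)."""
    return np.max(np.abs(actual - desired) / np.abs(desired))


# Iterative LSQR on the sparse matrix with a loose and a tight stopping tolerance.
coef_loose = lsqr(Xcsr, y, atol=1e-8, btol=1e-8)[0]
coef_tight = lsqr(Xcsr, y, atol=1e-12, btol=1e-12)[0]

print("loose tolerance:", max_rel_diff(coef_loose, coef_dense))
print("tight tolerance:", max_rel_diff(coef_tight, coef_dense))
```

If the discrepancy shrinks with the tighter tolerance, that would suggest the mismatch comes from LSQR stopping early rather than from a difference in the underlying model.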
linear_model.LinearRegression behaves differently on sparse matrices than on numpy arrays. The built-in unit test test_linear_regression_sparse_equal_dense works well with two features, but not with other feature counts, e.g. n_features=14. Other combinations of n_samples and n_features lead to even higher discrepancies.
There was a similar issue (#13460) in 2019 that was fixed (#13279) and complemented by the unit test mentioned above.
Steps/Code to Reproduce
Original code from
Only modification: n_features = 14.
import pytest
import numpy as np
from scipy import sparse

from sklearn.utils._testing import assert_allclose
from sklearn.linear_model import LinearRegression


def test_linear_regression_sparse_equal_dense(normalize, fit_intercept):
    # Test that linear regression agrees between sparse and dense
    rng = np.random.RandomState(0)
    n_samples = 200
    n_features = 14
    X = rng.randn(n_samples, n_features)
    X[X < 0.1] = 0.0
    Xcsr = sparse.csr_matrix(X)
    y = rng.rand(n_samples)
    params = dict(normalize=normalize, fit_intercept=fit_intercept)
    clf_dense = LinearRegression(**params)
    clf_sparse = LinearRegression(**params)
    clf_dense.fit(X, y)
    clf_sparse.fit(Xcsr, y)
    assert clf_dense.intercept_ == pytest.approx(clf_sparse.intercept_)
    assert_allclose(clf_dense.coef_, clf_sparse.coef_)


test_linear_regression_sparse_equal_dense(False, False)
Expected Results
Coefficients from both regressions should be equal. The test shouldn't throw any error.
Actual Results
Test throws an AssertionError as follows:
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-10-52fe7d4efb39> in <module>
     23     assert_allclose(clf_dense.coef_, clf_sparse.coef_)
     24
---> 25 test_linear_regression_sparse_equal_dense(False, False)

<ipython-input-10-52fe7d4efb39> in test_linear_regression_sparse_equal_dense(normalize, fit_intercept)
     21     clf_sparse.fit(Xcsr, y)
     22     assert clf_dense.intercept_ == pytest.approx(clf_sparse.intercept_)
---> 23     assert_allclose(clf_dense.coef_, clf_sparse.coef_)
     24
     25 test_linear_regression_sparse_equal_dense(False, False)

    [... skipping hidden 2 frame]

AssertionError:
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 1 / 14 (7.14%)
Max absolute difference: 3.29286053e-09
Max relative difference: 2.61247094e-07
 x: array([ 0.119148,  0.138634,  0.045699,  0.09633 ,  0.095532,  0.052054,
        0.048282,  0.06737 ,  0.074815, -0.003684,  0.083319,  0.064834,
        0.124406,  0.070745])
 y: array([ 0.119148,  0.138634,  0.045699,  0.09633 ,  0.095532,  0.052054,
        0.048282,  0.06737 ,  0.074815, -0.003684,  0.083319,  0.064834,
        0.124406,  0.070745])
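
The failure above depends on how assert_allclose applies rtol: with atol=0, every element must satisfy |actual - desired| <= rtol * |desired|, so a coefficient close to zero can fail the check even when its absolute error is tiny. Below is a minimal sketch of that behaviour using made-up values (not the coefficients from this report); the helper passes is hypothetical.

```python
import numpy as np
from numpy.testing import assert_allclose

# Made-up "dense" coefficients and a slightly perturbed "sparse" counterpart.
dense = np.array([0.1, -0.004])
sparse_like = dense + np.array([3e-9, 1e-9])


def passes(actual, desired, rtol):
    """Return True if assert_allclose accepts the pair at this rtol (atol=0)."""
    try:
        assert_allclose(actual, desired, rtol=rtol, atol=0)
    except AssertionError:
        return False
    return True


# The larger coefficient has the larger absolute error but a small relative
# error; the near-zero coefficient is the one that breaks rtol=1e-7.
strict = passes(dense, sparse_like, rtol=1e-7)
relaxed = passes(dense, sparse_like, rtol=1e-6)
print(strict, relaxed)
```

This only shows the comparison rule; whether relaxing rtol in the test or tightening the solver tolerance is the better fix is a separate question.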
Versions
System:
python: 3.6.8 (default, Aug 13 2020, 07:46:32) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]
executable: /home/envs/my_env/bin/python3
machine: Linux-3.10.0-1160.62.1.el7.x86_64-x86_64-with-redhat-7.9-Maipo
Python dependencies:
pip: 21.3.1
setuptools: 39.2.0
sklearn: 0.24.2
numpy: 1.19.5
scipy: 1.5.4
Cython: None
pandas: 1.1.5
matplotlib: 3.3.4
joblib: 1.1.0
threadpoolctl: 3.1.0
Built with OpenMP: True