Linear Regression performs worse on sparse matrix #13460

bminixhofer · 2019-03-17T11:52:13Z

Description

linear_model.LinearRegression seems to fit sparse matrices worse than regular numpy arrays. I noticed a significant difference on a private dataset of mine, and a small difference in mean squared error remains even in the linear regression example. On such a small dataset (422 samples x 1 feature), I believe the coefficients and intercept should be exactly the same.
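To illustrate why identical coefficients are a reasonable expectation, here is a minimal sketch on synthetic stand-in data (an assumption, not the diabetes dataset). With one feature and an intercept, ordinary least squares has a unique closed-form solution, so a dense solver and a sparse iterative solver should agree up to numerical tolerance. Using scipy.sparse.linalg.lsqr with tight tolerances for the sparse side is also an assumption made for illustration; it is not necessarily what scikit-learn does internally.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

# Synthetic stand-in data (an assumption, not the diabetes dataset):
# 422 samples, one small-scale feature, noisy linear target.
rng = np.random.default_rng(0)
x = rng.normal(scale=0.05, size=422)
y = 150.0 + 900.0 * x + rng.normal(scale=60.0, size=422)

# Closed-form OLS for a single feature with an intercept.
slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)
intercept = y.mean() - slope * x.mean()

# Dense least squares on the design matrix [x, 1].
A = np.column_stack([x, np.ones_like(x)])
dense_sol, *_ = np.linalg.lstsq(A, y, rcond=None)

# Sparse iterative least squares on the same design matrix,
# with tight stopping tolerances so it runs to convergence.
sparse_sol = lsqr(sp.csr_matrix(A), y, atol=1e-12, btol=1e-12, iter_lim=100)[0]

print("closed form:", slope, intercept)
print("dense lstsq:", dense_sol)
print("sparse lsqr:", sparse_sol)
```

If the three results differ noticeably, the gap comes from the solver (for example loose stopping tolerances or how the intercept is handled), not from the problem itself.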

Steps/Code to Reproduce

Original example:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = datasets.load_diabetes()


# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))

Example modified with a sparse matrix:

import matplotlib.pyplot as plt
import numpy as np
import scipy.sparse  # import the submodule explicitly; plain "import scipy" does not guarantee it is loaded
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression(fit_intercept=True)

# Train the model using the training sets
regr.fit(scipy.sparse.csr_matrix(diabetes_X_train), diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(scipy.sparse.csr_matrix(diabetes_X_test))

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))

Expected Results

The original example prints a mean squared error of 2548.07, so I would expect the example with a sparse matrix to produce the same error.

Actual Results

The modified example instead has an MSE of 2563.78. Note that the difference between the two is larger on a higher-dimensional dataset of mine.

I tried the same code using linear_model.Ridge instead of a linear regression. In that case, the MSE of the Ridge model is lower on the sparse matrix than on the regular numpy array. It's really weird.
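For a direct comparison, the two reproduction scripts can be condensed into a single run that fits the same estimator on dense and sparse copies of the data and prints the coefficient, intercept, and MSE side by side. This is a minimal sketch; the helper name fit_and_score is hypothetical and introduced here only for illustration.

```python
import numpy as np
import scipy.sparse as sp
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error


def fit_and_score(model, X_train, y_train, X_test, y_test):
    """Hypothetical helper: fit the model, return (coef, intercept, test MSE)."""
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    return np.ravel(model.coef_), float(model.intercept_), mse


# Same data and split as in the reproduction scripts.
diabetes = datasets.load_diabetes()
X = diabetes.data[:, np.newaxis, 2]
y = diabetes.target
X_train, X_test = X[:-20], X[-20:]
y_train, y_test = y[:-20], y[-20:]

dense = fit_and_score(linear_model.LinearRegression(),
                      X_train, y_train, X_test, y_test)
sparse = fit_and_score(linear_model.LinearRegression(),
                       sp.csr_matrix(X_train), y_train,
                       sp.csr_matrix(X_test), y_test)

for name, (coef, intercept, mse) in [("dense", dense), ("sparse", sparse)]:
    print("%-6s coef=%r intercept=%.6f mse=%.2f" % (name, coef, intercept, mse))
print("abs coef difference:", np.abs(dense[0] - sparse[0]))
```

Printing the coefficient difference alongside the MSE makes it easier to see whether the two fits diverge at the parameter level or only in the rounded score.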

Versions

System:
python: 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) [GCC 7.2.0]
executable: /home/bminixhofer/miniconda3/bin/python
machine: Linux-4.15.0-46-generic-x86_64-with-debian-buster-sid

BLAS:
macros:
lib_dirs:
cblas_libs: cblas

Python deps:
pip: 18.0
setuptools: 40.6.3
sklearn: 0.20.1
numpy: 1.16.1
scipy: 1.1.0
Cython: 0.29.4
pandas: 0.23.4


jnothman · 2019-03-17T12:05:49Z

Try the nightly wheel: https://scikit-learn.org/dev/developers/advanced_installation.html#installing-nightly-builds This might have been fixed by #13279

bminixhofer · 2019-03-17T13:59:35Z

Perfect! I can confirm that the latest nightly build fixes this issue for linear_model.LinearRegression.

However, I still notice differences when using linear_model.Ridge (Lasso and ElasticNet work fine). When replacing the linear regression with linear_model.Ridge() in the example from above, the MSE is different every time I run the code:

import matplotlib.pyplot as plt
import numpy as np
import scipy.sparse  # import the submodule explicitly; plain "import scipy" does not guarantee it is loaded
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create ridge regression object
regr = linear_model.Ridge()

# Train the model using the training sets
regr.fit(scipy.sparse.csr_matrix(diabetes_X_train), diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(scipy.sparse.csr_matrix(diabetes_X_test))

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))

When calling np.random.seed(1234) at the beginning I get the same result every time but it is of course still different from the result when using a regular numpy array. And it is not possible to set the random state in the Ridge constructor, so this looks like a bug to me.

qinhanmin2014 · 2019-03-18T04:17:44Z

> And it is not possible to set the random state in the Ridge constructor, so this looks like a bug to me.

You can set the random_state parameter, e.g., regr = linear_model.Ridge(random_state=0). Reopen if you still get different results.
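A minimal sketch of that suggestion follows. It assumes the run-to-run variation comes from a stochastic solver being selected for sparse input; to make that explicit, the sketch sets solver="sag" itself rather than relying on the default, which may pick a different solver depending on the scikit-learn version.

```python
import numpy as np
import scipy.sparse as sp
from sklearn import datasets, linear_model

# Same single feature and training split as in the examples above.
diabetes = datasets.load_diabetes()
X_train = sp.csr_matrix(diabetes.data[:-20, np.newaxis, 2])
y_train = diabetes.target[:-20]


def ridge_fit(seed):
    """Fit Ridge with the stochastic 'sag' solver and a fixed seed."""
    # solver="sag" is an assumption made here to force a stochastic solver.
    regr = linear_model.Ridge(solver="sag", random_state=seed)
    regr.fit(X_train, y_train)
    return regr.coef_.copy(), float(regr.intercept_)


coef_a, intercept_a = ridge_fit(0)
coef_b, intercept_b = ridge_fit(0)

print("run 1:", coef_a, intercept_a)
print("run 2:", coef_b, intercept_b)
print("identical:", np.allclose(coef_a, coef_b) and np.isclose(intercept_a, intercept_b))
```

Fixing random_state makes repeated runs reproducible; it does not by itself guarantee agreement with the dense fit, which is the separate question raised above.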

agramfort · 2019-03-18T10:53:52Z

there is more work coming here : #13350

…

jeromedockes · 2019-05-03T14:47:21Z

> there is more work coming here : #13350

#13350 solves a similar issue for RidgeCV, not for Ridge

qinhanmin2014 closed this as completed Mar 18, 2019

epplef mentioned this issue Oct 7, 2022

Add tolerance tol to LinearRegression for sparse matrix solver lsqr #24601

Open
