Linear Regression performs worse on sparse matrix #13460

Closed
bminixhofer opened this issue Mar 17, 2019 · 5 comments

@bminixhofer

Description

linear_model.LinearRegression seems to fit sparse matrices worse than regular numpy arrays. I noticed a significant difference on a private dataset of mine, and a small difference in mean squared error remains even in the linear regression example. Especially on such a small dataset (422 samples x 1 feature), I believe the coefficients and intercept should be exactly the same.

Steps/Code to Reproduce

Original example:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = datasets.load_diabetes()


# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))

Example modified with a sparse matrix:

import matplotlib.pyplot as plt
import numpy as np
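import scipy.sparse  # import the submodule explicitly; a bare `import scipy` does not guarantee scipy.sparse is loaded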
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression(fit_intercept=True)

# Train the model using the training sets
regr.fit(scipy.sparse.csr_matrix(diabetes_X_train), diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(scipy.sparse.csr_matrix(diabetes_X_test))

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))

Expected Results

The original example prints a mean squared error of 2548.07, so I would expect the example with a sparse matrix to produce the same error.

Actual Results

The modified example instead has an MSE of 2563.78. Note that the difference between the two is larger on a higher-dimensional dataset of mine.

I tried the same code using linear_model.Ridge instead of a linear regression. In that case, the MSE of the Ridge model is lower on the sparse matrix than on the regular numpy array. It's really weird.
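A side-by-side check makes the discrepancy easier to see than comparing two printed MSE values. The following is only a sketch of such a check: the helper name `fit_and_score` is made up for illustration, and whether the dense and sparse numbers agree depends on the installed scikit-learn version.

```python
import numpy as np
from scipy import sparse
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error


def fit_and_score(to_input):
    """Fit LinearRegression on one diabetes feature and score it.

    `to_input` (hypothetical helper argument) converts the dense feature
    matrix into the input format under test, e.g. a CSR matrix.
    """
    diabetes = datasets.load_diabetes()
    X = diabetes.data[:, np.newaxis, 2]
    y = diabetes.target

    # Same split as in the examples above: last 20 samples are the test set
    X_train, X_test = X[:-20], X[-20:]
    y_train, y_test = y[:-20], y[-20:]

    regr = linear_model.LinearRegression()
    regr.fit(to_input(X_train), y_train)
    y_pred = regr.predict(to_input(X_test))
    return regr.coef_, regr.intercept_, mean_squared_error(y_test, y_pred)


coef_dense, intercept_dense, mse_dense = fit_and_score(np.asarray)
coef_sparse, intercept_sparse, mse_sparse = fit_and_score(sparse.csr_matrix)

print("coef:      dense=%r sparse=%r" % (coef_dense, coef_sparse))
print("intercept: dense=%r sparse=%r" % (intercept_dense, intercept_sparse))
print("MSE:       dense=%.2f sparse=%.2f" % (mse_dense, mse_sparse))
print("coefficients match:", np.allclose(coef_dense, coef_sparse))
```

Comparing coefficients and intercept directly, rather than only the test MSE, shows whether the two fits found a different model or merely differ by rounding.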

Versions

System:
python: 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) [GCC 7.2.0]
executable: /home/bminixhofer/miniconda3/bin/python
machine: Linux-4.15.0-46-generic-x86_64-with-debian-buster-sid

BLAS:
macros:
lib_dirs:
cblas_libs: cblas

Python deps:
pip: 18.0
setuptools: 40.6.3
sklearn: 0.20.1
numpy: 1.16.1
scipy: 1.1.0
Cython: 0.29.4
pandas: 0.23.4

@jnothman
Member

jnothman commented Mar 17, 2019 via email

@bminixhofer
Author

Perfect! I can confirm that the latest nightly build fixes this issue for linear_model.LinearRegression.

However, I still notice differences when using linear_model.Ridge (Lasso and ElasticNet work fine). When replacing the linear regression with linear_model.Ridge() in the example from above, the MSE is different every time I run the code:

import matplotlib.pyplot as plt
import numpy as np
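import scipy.sparse  # import the submodule explicitly; a bare `import scipy` does not guarantee scipy.sparse is loaded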
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create ridge regression object
regr = linear_model.Ridge()

# Train the model using the training sets
regr.fit(scipy.sparse.csr_matrix(diabetes_X_train), diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(scipy.sparse.csr_matrix(diabetes_X_test))

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))

When calling np.random.seed(1234) at the beginning, I get the same result every time, but it is of course still different from the result when using a regular numpy array. And it is not possible to set the random state in the Ridge constructor, so this looks like a bug to me.

@qinhanmin2014
Member

And it is not possible to set the random state in the Ridge constructor, so this looks like a bug to me.

You can set the random_state parameter, e.g., regr = linear_model.Ridge(random_state=0). Reopen if you still get different results.
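A minimal sketch of that suggestion, for reference. It assumes the run-to-run variation comes from a stochastic solver and therefore requests solver="sag" explicitly; which solver Ridge picks by default for sparse input depends on the scikit-learn version, and the helper name `ridge_coef` is made up for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()

# One feature, same training split as in the examples above, as a CSR matrix
X_train = sparse.csr_matrix(diabetes.data[:-20, np.newaxis, 2])
y_train = diabetes.target[:-20]


def ridge_coef(seed):
    # Assumption: "sag" stands in for whichever stochastic solver was used;
    # random_state fixes the order in which it samples the training rows.
    regr = linear_model.Ridge(solver="sag", random_state=seed)
    regr.fit(X_train, y_train)
    return regr.coef_.copy()


first = ridge_coef(0)
second = ridge_coef(0)

print("run 1:", first)
print("run 2:", second)
print("identical across runs:", np.allclose(first, second))
```

Fixing random_state only makes repeated runs reproducible. It does not by itself make the sparse result equal to the dense one.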

@agramfort
Member

agramfort commented Mar 18, 2019 via email

@jeromedockes
Contributor

There is more work coming here: #13350

#13350 solves a similar issue for RidgeCV, not for Ridge.
