Linear Regression performs worse on sparse matrix #13460
Comments
Try the nightly wheel:
https://scikit-learn.org/dev/developers/advanced_installation.html#installing-nightly-builds
This might have been fixed by #13279
Perfect! I can confirm that the latest nightly build fixes this issue for LinearRegression. However, I still notice differences when using Ridge:

import numpy as np
import scipy.sparse
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create ridge regression object
regr = linear_model.Ridge()

# Train the model using the training set (as a CSR sparse matrix)
regr.fit(scipy.sparse.csr_matrix(diabetes_X_train), diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(scipy.sparse.csr_matrix(diabetes_X_test))

# The coefficients
print('Coefficients: \n', regr.coef_)

# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))

When calling …
You can set the random_state parameter, e.g., …
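The snippet that followed is not preserved in this page; a minimal sketch of what setting random_state on Ridge might look like (note that random_state only matters for the stochastic solvers 'sag' and 'saga'):

from sklearn import linear_model

# Hypothetical illustration: pin the seed of the stochastic 'sag' solver
# so that repeated fits give reproducible coefficients.
regr = linear_model.Ridge(solver='sag', random_state=0)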
Description

linear_model.LinearRegression seems to fit sparse matrices less well than regular numpy arrays. I noticed a significant difference on a private dataset of mine, and there is also a small difference in the mean squared error of the linear regression example. Especially on such a small dataset (422 samples x 1 feature), I believe the coefficients and intercept should be exactly the same.

Steps/Code to Reproduce
Original example:
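The code itself is not preserved in this page; it is presumably the scikit-learn diabetes ordinary least squares example, roughly:

import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error

# Load the diabetes dataset and keep a single feature
diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split data and targets into training/testing sets
diabetes_X_train, diabetes_X_test = diabetes_X[:-20], diabetes_X[-20:]
diabetes_y_train, diabetes_y_test = diabetes.target[:-20], diabetes.target[-20:]

# Fit ordinary least squares on dense numpy arrays
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)
diabetes_y_pred = regr.predict(diabetes_X_test)

print('Coefficients: \n', regr.coef_)
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))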
Example modified with a sparse matrix:
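This code is likewise not preserved; presumably it is the same example with the inputs wrapped in CSR sparse matrices, mirroring the Ridge snippet quoted earlier:

import numpy as np
import scipy.sparse
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error

diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]
diabetes_X_train, diabetes_X_test = diabetes_X[:-20], diabetes_X[-20:]
diabetes_y_train, diabetes_y_test = diabetes.target[:-20], diabetes.target[-20:]

# Identical to the dense example, except that fit and predict
# receive CSR sparse matrices instead of numpy arrays
regr = linear_model.LinearRegression()
regr.fit(scipy.sparse.csr_matrix(diabetes_X_train), diabetes_y_train)
diabetes_y_pred = regr.predict(scipy.sparse.csr_matrix(diabetes_X_test))

print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))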
Expected Results

The original example prints a mean squared error of 2548.07, so I would expect the example with a sparse matrix to have the same error.

Actual Results

The modified example instead has an MSE of 2563.78. Note that the difference between the two is larger on a higher-dimensional dataset of mine. I also tried the same code using linear_model.Ridge instead of a linear regression; in that case, the MSE of the Ridge model is lower on the sparse matrix than on the regular numpy array. It's really weird.

Versions
System:
python: 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) [GCC 7.2.0]
executable: /home/bminixhofer/miniconda3/bin/python
machine: Linux-4.15.0-46-generic-x86_64-with-debian-buster-sid
BLAS:
macros:
lib_dirs:
cblas_libs: cblas
Python deps:
pip: 18.0
setuptools: 40.6.3
sklearn: 0.20.1
numpy: 1.16.1
scipy: 1.1.0
Cython: 0.29.4
pandas: 0.23.4