Add option to use different solver to LinearRegression #14268


Open
ogrisel opened this issue Jul 5, 2019 · 8 comments

@ogrisel
Member

ogrisel commented Jul 5, 2019

As reported in #13923, the currently used scipy.linalg.lstsq can be significantly slower than Ridge(solver="cholesky", alpha=0) for tall, dense X.
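
For context, a minimal timing sketch contrasting the two code paths on a tall dense design matrix; the data shape and exact timings are illustrative, not from the issue:

```python
# Illustrative benchmark only: compares LinearRegression (which calls
# scipy.linalg.lstsq) against Ridge with a Cholesky solver and no penalty.
from time import perf_counter

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.randn(100_000, 100)  # tall: n_samples >> n_features
y = X @ rng.randn(100) + rng.randn(100_000)

for name, est in [
    ("LinearRegression (lstsq)", LinearRegression()),
    ('Ridge(solver="cholesky", alpha=0)', Ridge(solver="cholesky", alpha=0)),
]:
    tic = perf_counter()
    est.fit(X, y)
    print(f"{name}: {perf_counter() - tic:.3f}s")
```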

We should also call scipy.linalg.lstsq with check_finite=False as we already do input validation in fit.

Also, the scipy.linalg.lstsq function has an optional lapack_driver parameter that accepts the following options: 'gelsd', 'gelsy', 'gelss'. The default is 'gelsd'. Maybe we should expose the others in our API and benchmark them to see if it would make sense to change the default.
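
A micro-benchmark sketch of the three drivers, also passing check_finite=False as suggested above; the array sizes here are assumptions for illustration:

```python
# Sketch: time each LAPACK driver exposed by scipy.linalg.lstsq.
# check_finite=False skips the finiteness check that fit already performs.
from time import perf_counter

import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
X = rng.randn(50_000, 200)
y = X @ rng.randn(200) + rng.randn(50_000)

for driver in ("gelsd", "gelsy", "gelss"):
    tic = perf_counter()
    coef, residues, rank, singular_values = linalg.lstsq(
        X, y, check_finite=False, lapack_driver=driver
    )
    print(f"{driver}: {perf_counter() - tic:.3f}s")
```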

Related issue for Ridge: #14269

@rth
Member

rth commented Jul 5, 2019

Can't we use Ridge(solver="cholesky", alpha=0) underneath in LinearRegression, to avoid many similar but not identical implementations? Though last time I checked, there may have been an issue with alpha=0 exactly.

@ogrisel
Member Author

ogrisel commented Jul 5, 2019

I would be in favor of factoring the common code into a private helper function instead of having public estimators call into one another.
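
For illustration, a minimal sketch of that idea; the helper name and signature are assumptions for this example, not scikit-learn's actual private API:

```python
# Hypothetical private helper that both LinearRegression and Ridge could
# share, instead of one public estimator instantiating the other.
from scipy import linalg


def _solve_normal_equation(X, y, alpha=0.0):
    """Solve (X.T @ X + alpha * I) w = X.T @ y via Cholesky factorization."""
    A = X.T @ X
    A.flat[:: A.shape[0] + 1] += alpha  # add alpha to the diagonal in place
    return linalg.solve(A, X.T @ y, assume_a="pos")
```

LinearRegression would call it with alpha=0 and Ridge with its penalty, so there is a single implementation to maintain and benchmark.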

@agramfort
Member

agramfort commented Jul 5, 2019 via email

@rithvikrao
Contributor

To summarize, this would involve the following tasks?

  1. changing the scipy.linalg.lstsq call in LinearRegression to include the check_finite=False parameter
  2. benchmarking lapack_driver choices for scipy.linalg.lstsq (should there also be an optional user parameter to choose lapack_driver?)
  3. factoring out some Ridge code such that something like Ridge(solver="cholesky", alpha=0) can be used in LinearRegression without explicitly calling another estimator
  4. determining when to use scipy.linalg.lstsq vs. Cholesky; would this involve running experiments to find a good heuristic for tall/dense X? (see the sketch after this list)
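
On point 4, a hypothetical dispatch heuristic; the tall_ratio threshold is a placeholder that the benchmarks would have to determine:

```python
# Illustrative only: route tall dense problems to the Cholesky path and
# everything else to lstsq. The threshold of 10 is an arbitrary placeholder.
import scipy.sparse as sp


def _choose_solver(X, tall_ratio=10):
    n_samples, n_features = X.shape
    if not sp.issparse(X) and n_samples >= tall_ratio * n_features:
        return "cholesky"
    return "lstsq"
```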

@amueller
Member

Hi @rithvikrao, yes, I think that sounds right, apart from 1 maybe? Where did you get that from? We already have a global option to control that, and in general we want to check for finiteness. Only if there are redundant checks would we want to remove them.

It's not entirely clear how easy 4 is; maybe starting with 3 would be good?

@rithvikrao
Contributor

rithvikrao commented Jun 10, 2020

Hi @amueller, sounds good, I'll work on 3 first. 1 came from the first comment in this issue: I believe that fit in LinearRegression calls check_array, which is defined in utils/validation.py and by default raises an error on np.inf, np.nan, or pd.NA values in the array passed in. So I think the finiteness check is redundant, but I may be wrong there.
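
A quick illustration of that default behavior (a sketch, assuming check_array's default settings):

```python
# Demonstrates that check_array rejects non-finite input by default,
# which is why a second check inside lstsq would be redundant.
import numpy as np
from sklearn.utils import check_array

X = np.array([[1.0, 2.0], [np.nan, 4.0]])
try:
    check_array(X)
except ValueError as exc:
    print(exc)  # e.g. "Input contains NaN, ..."
```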

@amueller
Member

@rithvikrao sorry, I didn't see that in @ogrisel's original comment. You're right, we can remove that check; that might even be the easiest first step.

@ogrisel
Member Author

ogrisel commented Mar 16, 2022

This is related to #22855.
