[MRG] MNT: Use GEMV in enet_coordinate_descent #11507
Conversation
Seems like a good idea, but Travis is failing. Also, some benchmarks would be nice?
Yeah, there are still some issues I've yet to sort out. I was hoping that if I put the PR up, someone might be able to spot the cause.
So this is now fixed on Python 3. Weirdly, on Python 2 some memory is getting called with … Edit: On a closer look, it appears this is coming from … (ref: https://travis-ci.org/scikit-learn/scikit-learn/jobs/405429580)
As to benchmarking, I opted to just compare the DOT-in-a-loop and GEMV patterns directly. I also disabled threading in the BLAS (using OpenBLAS). However, this can probably do better with threading enabled, as the GEMV call can make use of multiple threads.

This comparison is a bit more generous to the existing implementation, as C pointers were used directly in all cases instead of indexing NumPy arrays (as is done for some variables currently), which likely cuts down on some overhead. As this pattern shows up a few times in the Cython code, it may be worth applying elsewhere as well.

Cython code:

```cython
cimport cython
import cython
cimport numpy as np
import numpy as np
np.import_array()
cdef extern from "cblas.h":
    enum CBLAS_ORDER:
        CblasRowMajor=101
        CblasColMajor=102
    enum CBLAS_TRANSPOSE:
        CblasNoTrans=111
        CblasTrans=112
        CblasConjTrans=113
    double ddot "cblas_ddot"(int N, double *X, int incX, double *Y, int incY) nogil
    void dgemv "cblas_dgemv"(CBLAS_ORDER Order, CBLAS_TRANSPOSE TransA,
                             int M, int N, double alpha, double *A, int lda,
                             double *X, int incX, double beta,
                             double *Y, int incY) nogil
    void dcopy "cblas_dcopy"(int N, double *X, int incX, double *Y, int incY) nogil
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
def use_dot(np.ndarray[double, ndim=1] w,
            np.ndarray[double, ndim=2, mode='fortran'] X,
            np.ndarray[double, ndim=1, mode='c'] y):
    cdef unsigned int m = X.shape[0]
    cdef unsigned int n = X.shape[1]
    cdef np.ndarray[double, ndim=1] r = np.empty((m,), dtype=np.float64)

    cdef double* w_data = <double*> w.data
    cdef double* X_data = <double*> X.data
    cdef double* y_data = <double*> y.data
    cdef double* r_data = <double*> r.data

    cdef unsigned int i = 0

    with nogil:
        # Row i of the Fortran-ordered X starts at X_data[i] with stride m.
        for i in range(m):
            r_data[i] = y_data[i] - ddot(n, &X_data[i], m, w_data, 1)

    return r
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
def use_gemv(np.ndarray[double, ndim=1] w,
             np.ndarray[double, ndim=2, mode='fortran'] X,
             np.ndarray[double, ndim=1, mode='c'] y):
    cdef unsigned int m = X.shape[0]
    cdef unsigned int n = X.shape[1]
    cdef np.ndarray[double, ndim=1] r = np.empty((m,), dtype=np.float64)

    cdef double* w_data = <double*> w.data
    cdef double* X_data = <double*> X.data
    cdef double* y_data = <double*> y.data
    cdef double* r_data = <double*> r.data

    with nogil:
        # r = y, then r = -1.0 * (X @ w) + r in a single GEMV call.
        dcopy(m, y_data, 1, r_data, 1)
        dgemv(CblasColMajor, CblasNoTrans,
              m, n, -1.0, X_data, m,
              w_data, 1,
              1.0, r_data, 1)
    return r
```

This benchmark was taken on Python 3.6 using Cython 0.28.3, NumPy 1.14.3, and OpenBLAS 0.2.20. The first few lines of the session (not shown) were spent importing NumPy and the functions above.

```
In [4]: m, n = (1000, 1100)
In [5]: w = np.random.random((n,))
In [6]: X = np.require(np.random.random((m, n)), requirements=["F"])
In [7]: y = np.random.random((m,))
In [8]: %timeit use_dot(w, X, y)
6.65 ms ± 69.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [9]: %timeit use_gemv(w, X, y)
341 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

So this is well over an order of magnitude improvement! Probably worth applying this pattern throughout this file.
After trying different installs on macOS and Linux with …
We might be able to help working through this after the 0.20 release...
Well, the question is: do we care about trying to fix the vendored BLAS, or do we want to bump the SciPy dependency and start using SciPy's Cython BLAS and LAPACK API (#11638)?
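For reference, here is a rough sketch of what the GEMV path could look like through SciPy's Cython BLAS bindings, `scipy.linalg.cython_blas` (added in SciPy 0.16.0). The function name `use_scipy_gemv` is just illustrative; note that the Fortran-style interface passes every argument by pointer and assumes column-major storage.

```cython
# Sketch only: the same "r = y - X @ w" computation as use_gemv above, but
# going through SciPy's Cython BLAS bindings instead of the vendored CBLAS.
cimport numpy as np
import numpy as np
from scipy.linalg.cython_blas cimport dcopy, dgemv

np.import_array()


def use_scipy_gemv(np.ndarray[double, ndim=1] w,
                   np.ndarray[double, ndim=2, mode='fortran'] X,
                   np.ndarray[double, ndim=1, mode='c'] y):
    cdef int m = X.shape[0]
    cdef int n = X.shape[1]
    cdef np.ndarray[double, ndim=1] r = np.empty((m,), dtype=np.float64)

    cdef double* w_data = <double*> w.data
    cdef double* X_data = <double*> X.data
    cdef double* y_data = <double*> y.data
    cdef double* r_data = <double*> r.data

    cdef double alpha = -1.0
    cdef double beta = 1.0
    cdef int inc = 1
    cdef char *trans = b'N'  # 'N' = no transpose; X is Fortran-ordered

    with nogil:
        # r = y
        dcopy(&m, y_data, &inc, r_data, &inc)
        # r = alpha * (X @ w) + beta * r
        dgemv(trans, &m, &n, &alpha, X_data, &m,
              w_data, &inc, &beta, r_data, &inc)

    return r
```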
We have tended to keep the NumPy/SciPy dependencies in line with Ubuntu LTS. We can look into this after the 0.20 release.
The Cython BLAS was added to SciPy in 0.16.0. Ubuntu Trusty has SciPy 0.13.3, which is too old, though its EOL is April 2019 (~8 months away). Ubuntu Xenial has SciPy 0.17.0, which is recent enough.
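For illustration, a minimal guard along those lines could look like the following; the exact message and where such a check would live are hypothetical, not scikit-learn's actual dependency handling.

```python
# Hypothetical import-time check for the SciPy Cython BLAS requirement.
from distutils.version import LooseVersion

import scipy

if LooseVersion(scipy.__version__) < LooseVersion("0.16.0"):
    raise ImportError(
        "scipy >= 0.16.0 is required for scipy.linalg.cython_blas, "
        "found scipy %s" % scipy.__version__
    )
```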
It seems that GEMV with a transposed array in particular has some issues, but only with the vendored BLAS in scikit-learn. I have not encountered this issue with other BLAS implementations, including OpenBLAS, MKL, and macOS's Accelerate. Given this, I suspect there is a bug in the vendored BLAS. As it comes from ATLAS's reference BLAS implementation, I looked to see whether newer versions contained a fix, but found none. Since ATLAS is not really used for its reference BLAS implementation, I expect this bug has been around for a long time.
For now, I have broken out a subset of this change that only uses GEMV on non-transposed arrays into PR #11896. That change passes with the reference BLAS as well as other BLAS implementations.
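To make the problematic call concrete, something along these lines can be compiled against a given CBLAS and compared with `np.dot(X.T, r)`. The function name `xt_dot_r` is illustrative, and this only shows the call pattern being discussed, not the exact code in the PR.

```cython
# Illustration of the transposed GEMV call in question: tmp = X.T @ r done in
# a single dgemv call with CblasTrans on a Fortran-ordered X of shape (m, n).
cimport cython
cimport numpy as np
import numpy as np

np.import_array()

cdef extern from "cblas.h":
    enum CBLAS_ORDER:
        CblasRowMajor=101
        CblasColMajor=102
    enum CBLAS_TRANSPOSE:
        CblasNoTrans=111
        CblasTrans=112
    void dgemv "cblas_dgemv"(CBLAS_ORDER Order, CBLAS_TRANSPOSE TransA,
                             int M, int N, double alpha, double *A, int lda,
                             double *X, int incX, double beta,
                             double *Y, int incY) nogil


@cython.boundscheck(False)
@cython.wraparound(False)
def xt_dot_r(np.ndarray[double, ndim=2, mode='fortran'] X,
             np.ndarray[double, ndim=1] r):
    cdef int m = X.shape[0]
    cdef int n = X.shape[1]
    cdef np.ndarray[double, ndim=1] tmp = np.zeros((n,), dtype=np.float64)

    cdef double* X_data = <double*> X.data
    cdef double* r_data = <double*> r.data
    cdef double* tmp_data = <double*> tmp.data

    with nogil:
        # tmp = 1.0 * (X.T @ r) + 0.0 * tmp
        dgemv(CblasColMajor, CblasTrans,
              m, n, 1.0, X_data, m,
              r_data, 1,
              0.0, tmp_data, 1)

    return tmp
```

If a build is affected, `np.allclose(xt_dot_r(X, r), X.T.dot(r))` should come back `False` for it while passing with OpenBLAS, MKL, or Accelerate.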
Make use of the BLAS GEMV operation in `enet_coordinate_descent` instead of using DOT in a `for`-loop. Go ahead and use GEMV with both non-transposed and transposed arrays. Previously we have had issues with the vendored BLAS and GEMV on transposed arrays, but this attempts to use GEMV on transposed arrays anyway, in the hope that we can make those work as well. As GEMV and DOT in a `for`-loop are semantically equivalent, this is a reasonable change to make. GEMV, however, likely uses a multithreaded approach, unlike our application of DOT in a serial loop here. In BLAS implementations that do use threads for DOT, we can expect that GEMV will make better use of those threads and avoid the unnecessary setup and teardown costs that DOT in a `for`-loop is likely to incur (possibly in each iteration of the `for`-loop).
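As a quick NumPy-level illustration of the semantic equivalence claimed here (array sizes and the seed below are arbitrary):

```python
import numpy as np

rng = np.random.RandomState(0)
m, n = 200, 300
X = np.asfortranarray(rng.rand(m, n))
w = rng.rand(n)
y = rng.rand(m)

# DOT in a for-loop: one dot product per row of X
r_loop = np.array([y[i] - np.dot(X[i], w) for i in range(m)])

# Single matrix-vector product, i.e. what one GEMV call computes
r_gemv = y - X.dot(w)

print(np.allclose(r_loop, r_gemv))  # True, up to floating point rounding
```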
thx @jakirkham
Nice !
This is awesome!
Add a what's new?
I did some quick benchmark with the following script:

```python
from time import time
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=300, n_features=int(1e4), n_informative=30,
                       random_state=0)
tic = time()
m = LassoCV(cv=5, max_iter=5000).fit(X, y)
print(f"duration: {time() - tic:0.3f}s")
print(np.sum(m.coef_ != 0))
```

With scikit-learn 0.20.2 and its embedded ATLAS files, the above benchmark would only use a single thread: …
On scikit-learn master, using SciPy's OpenBLAS and the BLAS level-2 GEMV calls, I see that the 4 hyperthreads of my laptop (with 2 physical cores) are used (although with a significant fraction of red in htop) and I get a good speed: …
Note that with conda-installed numpy, scipy, and scikit-learn from Anaconda, and therefore with MKL, I could always see two threads being used, both with scikit-learn 0.20.2 and with master (built in my conda env with the scipy from Anaconda), and in both cases it takes ~20s. So despite thrashing red on 4 hyperthreads in htop, OpenBLAS is faster than MKL on this one. Apparently the switch from BLAS level 1 to BLAS level 2 does not change anything for MKL.
How were you configuring threading with MKL when testing it?
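For reference, one common way to pin the BLAS thread count when benchmarking is via environment variables set before NumPy/SciPy are first imported; the value 1 below is only an example, and whether this matches the setup used above is exactly what the question is asking.

```python
import os

# These must be set before the first import of numpy/scipy in the process.
os.environ["MKL_NUM_THREADS"] = "1"        # MKL
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS
os.environ["OMP_NUM_THREADS"] = "1"        # OpenMP-based BLAS builds

import numpy as np  # noqa: E402  (imported after setting the variables)
```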
This reverts commit 76d0116.
Make use of the BLAS GEMV operation in `enet_coordinate_descent` instead of using DOT in a `for`-loop. They are both semantically equivalent, but the former is likely multithreaded in BLAS implementations, while here it is merely a serial loop.