[MRG] MNT: Use GEMV in enet_coordinate_descent #11507
Conversation
Seems like a good idea, but Travis is failing. Also, some benchmarks would be nice?
Yeah, there are still some issues I've yet to sort out. I was hoping that if I put the PR up, someone might be able to spot the cause.
So this is now fixed on Python 3. Weirdly, on Python 2 some memory is getting called with … Edit: On a closer look, it appears this is coming from … (ref: https://travis-ci.org/scikit-learn/scikit-learn/jobs/405429580)
As to benchmarking, I opted to just compare the DOT-in-a-loop and GEMV patterns directly. I also disabled threading in the BLAS (using OpenBLAS). However, this can probably do better with threading enabled, as the GEMV call can make use of multiple threads.

This comparison is a bit more generous to the existing implementation, as C pointers were used directly in all cases instead of indexing NumPy arrays (as is done for some variables currently), which likely cuts down on some overhead. As this pattern shows up a few times in the Cython code, it may be worth applying elsewhere as well.

Cython code:

```cython
cimport cython
import cython
cimport numpy as np
import numpy as np
np.import_array()
cdef extern from "cblas.h":
    enum CBLAS_ORDER:
        CblasRowMajor=101
        CblasColMajor=102
    enum CBLAS_TRANSPOSE:
        CblasNoTrans=111
        CblasTrans=112
        CblasConjTrans=113
    double ddot "cblas_ddot"(int N, double *X, int incX, double *Y, int incY) nogil
    void dgemv "cblas_dgemv"(CBLAS_ORDER Order, CBLAS_TRANSPOSE TransA,
                             int M, int N, double alpha, double *A, int lda,
                             double *X, int incX, double beta,
                             double *Y, int incY) nogil
    void dcopy "cblas_dcopy"(int N, double *X, int incX, double *Y, int incY) nogil
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
def use_dot(np.ndarray[double, ndim=1] w,
            np.ndarray[double, ndim=2, mode='fortran'] X,
            np.ndarray[double, ndim=1, mode='c'] y):
    cdef unsigned int m = X.shape[0]
    cdef unsigned int n = X.shape[1]
    cdef np.ndarray[double, ndim=1] r = np.empty((m,), dtype=np.float64)

    cdef double* w_data = <double*> w.data
    cdef double* X_data = <double*> X.data
    cdef double* y_data = <double*> y.data
    cdef double* r_data = <double*> r.data

    cdef unsigned int i = 0

    with nogil:
        # Row i of the Fortran-ordered X starts at X_data[i] with stride m.
        for i in range(m):
            r_data[i] = y_data[i] - ddot(n, &X_data[i], m, w_data, 1)

    return r
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
def use_gemv(np.ndarray[double, ndim=1] w,
             np.ndarray[double, ndim=2, mode='fortran'] X,
             np.ndarray[double, ndim=1, mode='c'] y):
    cdef unsigned int m = X.shape[0]
    cdef unsigned int n = X.shape[1]
    cdef np.ndarray[double, ndim=1] r = np.empty((m,), dtype=np.float64)

    cdef double* w_data = <double*> w.data
    cdef double* X_data = <double*> X.data
    cdef double* y_data = <double*> y.data
    cdef double* r_data = <double*> r.data

    with nogil:
        # r = y, then r = -1.0 * (X @ w) + r in a single GEMV call.
        dcopy(m, y_data, 1, r_data, 1)
        dgemv(CblasColMajor, CblasNoTrans,
              m, n, -1.0, X_data, m,
              w_data, 1,
              1.0, r_data, 1)
    return r
```

This benchmark was taken on Python 3.6 using Cython 0.28.3, NumPy 1.14.3, and OpenBLAS 0.2.20. The first few lines of the session (not shown) were spent importing NumPy and the functions above.

```
In [4]: m, n = (1000, 1100)
In [5]: w = np.random.random((n,))
In [6]: X = np.require(np.random.random((m, n)), requirements=["F"])
In [7]: y = np.random.random((m,))
In [8]: %timeit use_dot(w, X, y)
6.65 ms ± 69.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [9]: %timeit use_gemv(w, X, y)
341 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

So this is well over an order of magnitude improvement! Probably worth applying this pattern throughout this file.
After trying different installs on macOS and Linux with …
We might be able to help working through this after the 0.20 release...
Well, the question is: do we care about trying to fix the vendored BLAS, or do we want to bump the SciPy dependency and start using SciPy's Cython BLAS and LAPACK API (#11638)?
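For reference, here is a rough sketch of what the GEMV path could look like through SciPy's Cython BLAS bindings, `scipy.linalg.cython_blas` (added in SciPy 0.16.0). The function name `use_scipy_gemv` is just illustrative; note that the Fortran-style interface passes every argument by pointer and assumes column-major storage.

```cython
# Sketch only: the same "r = y - X @ w" computation as use_gemv above, but
# going through SciPy's Cython BLAS bindings instead of the vendored CBLAS.
cimport numpy as np
import numpy as np
from scipy.linalg.cython_blas cimport dcopy, dgemv

np.import_array()


def use_scipy_gemv(np.ndarray[double, ndim=1] w,
                   np.ndarray[double, ndim=2, mode='fortran'] X,
                   np.ndarray[double, ndim=1, mode='c'] y):
    cdef int m = X.shape[0]
    cdef int n = X.shape[1]
    cdef np.ndarray[double, ndim=1] r = np.empty((m,), dtype=np.float64)

    cdef double* w_data = <double*> w.data
    cdef double* X_data = <double*> X.data
    cdef double* y_data = <double*> y.data
    cdef double* r_data = <double*> r.data

    cdef double alpha = -1.0
    cdef double beta = 1.0
    cdef int inc = 1
    cdef char *trans = b'N'  # 'N' = no transpose; X is Fortran-ordered

    with nogil:
        # r = y
        dcopy(&m, y_data, &inc, r_data, &inc)
        # r = alpha * (X @ w) + beta * r
        dgemv(trans, &m, &n, &alpha, X_data, &m,
              w_data, &inc, &beta, r_data, &inc)

    return r
```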
We have tended to keep the NumPy/SciPy dependencies in line with Ubuntu LTS. We can look into this after the 0.20 release.
The Cython BLAS was added to SciPy in 0.16.0. Ubuntu Trusty has SciPy 0.13.3, which is too old, though its EOL is April 2019 (~8 months away). Ubuntu Xenial has SciPy 0.17.0, which is recent enough.
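For illustration, a minimal guard along those lines could look like the following; the exact message and where such a check would live are hypothetical, not scikit-learn's actual dependency handling.

```python
# Hypothetical import-time check for the SciPy Cython BLAS requirement.
from distutils.version import LooseVersion

import scipy

if LooseVersion(scipy.__version__) < LooseVersion("0.16.0"):
    raise ImportError(
        "scipy >= 0.16.0 is required for scipy.linalg.cython_blas, "
        "found scipy %s" % scipy.__version__
    )
```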
It seems that GEMV with a transposed array in particular has some issues, but only with the vendored BLAS in scikit-learn. I have not encountered this issue with other BLAS implementations, including OpenBLAS, MKL, and macOS's Accelerate. Given this, I suspect there is a bug in the vendored BLAS. As it comes from ATLAS's reference BLAS implementation, I looked to see whether newer versions contained a fix, but found none. Since ATLAS is not really used for its reference BLAS implementation, I expect this bug has been around for a long time.
For now, I have broken out a subset of this change that only uses GEMV on non-transposed arrays into PR #11896. That change passes with the reference BLAS as well as other BLAS implementations.
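To make the problematic call concrete, something along these lines can be compiled against a given CBLAS and compared with `np.dot(X.T, r)`. The function name `xt_dot_r` is illustrative, and this only shows the call pattern being discussed, not the exact code in the PR.

```cython
# Illustration of the transposed GEMV call in question: tmp = X.T @ r done in
# a single dgemv call with CblasTrans on a Fortran-ordered X of shape (m, n).
cimport cython
cimport numpy as np
import numpy as np

np.import_array()

cdef extern from "cblas.h":
    enum CBLAS_ORDER:
        CblasRowMajor=101
        CblasColMajor=102
    enum CBLAS_TRANSPOSE:
        CblasNoTrans=111
        CblasTrans=112
    void dgemv "cblas_dgemv"(CBLAS_ORDER Order, CBLAS_TRANSPOSE TransA,
                             int M, int N, double alpha, double *A, int lda,
                             double *X, int incX, double beta,
                             double *Y, int incY) nogil


@cython.boundscheck(False)
@cython.wraparound(False)
def xt_dot_r(np.ndarray[double, ndim=2, mode='fortran'] X,
             np.ndarray[double, ndim=1] r):
    cdef int m = X.shape[0]
    cdef int n = X.shape[1]
    cdef np.ndarray[double, ndim=1] tmp = np.zeros((n,), dtype=np.float64)

    cdef double* X_data = <double*> X.data
    cdef double* r_data = <double*> r.data
    cdef double* tmp_data = <double*> tmp.data

    with nogil:
        # tmp = 1.0 * (X.T @ r) + 0.0 * tmp
        dgemv(CblasColMajor, CblasTrans,
              m, n, 1.0, X_data, m,
              r_data, 1,
              0.0, tmp_data, 1)

    return tmp
```

If a build is affected, `np.allclose(xt_dot_r(X, r), X.T.dot(r))` should come back `False` for it while passing with OpenBLAS, MKL, or Accelerate.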
Make use of the BLAS GEMV operation in `enet_coordinate_descent` instead of using DOT in a `for`-loop. Go ahead and use GEMV with both non-transposed and transposed arrays. Previously we have had issues with the vendored BLAS and GEMV on transposed arrays, but this attempts to use GEMV on transposed arrays anyway, in the hope that we can make those work as well. As GEMV and DOT in a `for`-loop are semantically equivalent, this is a reasonable change to make. GEMV, however, likely uses a multithreaded approach, unlike our application of DOT in a serial loop here. In BLAS implementations that do use threads for DOT, we can expect that GEMV will make better use of those threads and avoid the unnecessary setup and teardown costs that DOT in a `for`-loop is likely to incur (possibly in each iteration of the `for`-loop).
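As a quick NumPy-level illustration of the semantic equivalence claimed here (array sizes and the seed below are arbitrary):

```python
import numpy as np

rng = np.random.RandomState(0)
m, n = 200, 300
X = np.asfortranarray(rng.rand(m, n))
w = rng.rand(n)
y = rng.rand(m)

# DOT in a for-loop: one dot product per row of X
r_loop = np.array([y[i] - np.dot(X[i], w) for i in range(m)])

# Single matrix-vector product, i.e. what one GEMV call computes
r_gemv = y - X.dot(w)

print(np.allclose(r_loop, r_gemv))  # True, up to floating point rounding
```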
thx @jakirkham
Nice !
This is awesome!
Add a what's new?
I did some quick benchmark with the following script:

```python
from time import time
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=300, n_features=int(1e4), n_informative=30,
                       random_state=0)
tic = time()
m = LassoCV(cv=5, max_iter=5000).fit(X, y)
print(f"duration: {time() - tic:0.3f}s")
print(np.sum(m.coef_ != 0))
```

With scikit-learn 0.20.2 and its embedded ATLAS files, the above benchmark would only use a single thread: …
On scikit-learn master, using SciPy's OpenBLAS and the BLAS level-2 GEMV calls, I see that the 4 hyperthreads of my laptop (with 2 physical cores) are used (although with a significant fraction of red in htop) and I get a good speed: …
Note that with conda-installed numpy, scipy, and scikit-learn from Anaconda, and therefore with MKL, I could always see two threads being used, both with scikit-learn 0.20.2 and with master (built in my conda env with the scipy from Anaconda), and in both cases it takes ~20s. So despite thrashing red on 4 hyperthreads in htop, OpenBLAS is faster than MKL on this one. Apparently the switch from BLAS level 1 to BLAS level 2 does not change anything for MKL.
How were you configuring threading with MKL when testing it?
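For reference, one common way to pin the BLAS thread count when benchmarking is via environment variables set before NumPy/SciPy are first imported; the value 1 below is only an example, and whether this matches the setup used above is exactly what the question is asking.

```python
import os

# These must be set before the first import of numpy/scipy in the process.
os.environ["MKL_NUM_THREADS"] = "1"        # MKL
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS
os.environ["OMP_NUM_THREADS"] = "1"        # OpenMP-based BLAS builds

import numpy as np  # noqa: E402  (imported after setting the variables)
```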
This reverts commit 76d0116.
Make use of the BLAS GEMV operation in `enet_coordinate_descent` instead of using DOT in a `for`-loop. They are both semantically equivalent, but the former is likely multithreaded in BLAS implementations, while here it is merely a serial loop.