[MRG] MNT: Use GEMV in enet_coordinate_descent #11507


Merged: 1 commit into scikit-learn:master on Feb 25, 2019

Conversation

jakirkham
Contributor

Make use of the BLAS GEMV operation in enet_coordinate_descent instead of using DOT in a for-loop. The two are semantically equivalent, but GEMV is typically multithreaded in BLAS implementations, whereas here the DOT calls run in a plain serial loop.
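
For intuition, here is a small NumPy sketch of the equivalence (illustrative only; the actual change is Cython code in cd_fast.pyx):

import numpy as np

# Illustrative shapes only.
m, n = 200, 150
X = np.asfortranarray(np.random.random((m, n)))
w = np.random.random(n)
y = np.random.random(m)

# DOT in a for-loop: one dot product per row of X.
r_dot = np.empty(m)
for i in range(m):
    r_dot[i] = y[i] - np.dot(X[i], w)

# The same residual as a single matrix-vector product (what one GEMV call computes).
r_gemv = y - X @ w

assert np.allclose(r_dot, r_gemv)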

@amueller
Member

Seems like a good idea, but Travis is failing. Also, some benchmarks would be nice?

@jakirkham
Contributor Author

Yeah, there are still some issues I've yet to sort out. I was hoping that if I put the PR up, someone might be able to spot the cause.

@jakirkham jakirkham force-pushed the opts_cd_fast branch 2 times, most recently from 8c85f38 to c3000db Compare July 18, 2018 16:12
@jakirkham
Contributor Author

jakirkham commented Jul 18, 2018

So this is now fixed on Python 3. Weirdly, on Python 2 some memory that was likely allocated by Python is getting passed to free, even though there doesn't seem to be any call to free in this function. 😕

Edit: On closer look, it appears this is coming from sklearn/feature_selection/tests/test_from_model.py. Not sure I see the relationship between these two. I should add that I'm unable to reproduce this failure locally, so I have repushed to trigger CI again and see whether it recurs.

ref: https://travis-ci.org/scikit-learn/scikit-learn/jobs/405429580

@jakirkham
Contributor Author

As for benchmarking, I opted to simply compare dot in a loop vs. gemv (no loop needed). I stuck to double precision for simplicity, though I expect single precision to behave similarly.

I also disabled threading in the BLAS (OpenBLAS). However, this can probably do even better with threading enabled, as the loop pushed into gemv is embarrassingly parallel; the original for-loop could have been parallelized, but it was not.

This comparison is a bit generous to the existing implementation, as C pointers were used directly in all cases instead of indexing NumPy arrays (as is currently done for some variables), which likely cuts down on some overhead. Since this pattern shows up a few times in cd_fast.pyx, the same change applies to those occurrences as well.

Cython Code:
cimport cython
import cython

cimport numpy as np
import numpy as np

np.import_array()

cdef extern from "cblas.h":
    enum CBLAS_ORDER:
        CblasRowMajor=101
        CblasColMajor=102
    enum CBLAS_TRANSPOSE:
        CblasNoTrans=111
        CblasTrans=112
        CblasConjTrans=113

    double ddot "cblas_ddot"(int N, double *X, int incX, double *Y, int incY) nogil
    void dgemv "cblas_dgemv"(CBLAS_ORDER Order, CBLAS_TRANSPOSE TransA,
                             int M, int N, double alpha, double *A, int lda,
                             double *X, int incX, double beta,
                             double *Y, int incY) nogil
    void dcopy "cblas_dcopy"(int N, double *X, int incX, double *Y, int incY) nogil


@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
def use_dot(np.ndarray[double, ndim=1] w,
            np.ndarray[double, ndim=2, mode='fortran'] X,
            np.ndarray[double, ndim=1, mode='c'] y):
    cdef unsigned int m = X.shape[0]
    cdef unsigned int n = X.shape[1]

    cdef np.ndarray[double, ndim=1] r = np.empty((m,), dtype=np.float64)

    cdef double* w_data = <double*> w.data
    cdef double* X_data = <double*> X.data
    cdef double* y_data = <double*> y.data

    cdef double* r_data = <double*> r.data

    cdef unsigned i = 0

    with nogil:
        # One BLAS DOT call per row of X: r[i] = y[i] - dot(X[i, :], w).
        # X is Fortran-ordered, so row i starts at X_data[i] with stride m.
        for i in range(m):
            r_data[i] = y_data[i] - ddot(n, &X_data[i], m, w_data, 1)

    return r


@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
def use_gemv(np.ndarray[double, ndim=1] w,
             np.ndarray[double, ndim=2, mode='fortran'] X,
             np.ndarray[double, ndim=1, mode='c'] y):
    cdef unsigned int m = X.shape[0]
    cdef unsigned int n = X.shape[1]

    cdef np.ndarray[double, ndim=1] r = np.empty((m,), dtype=np.float64)

    cdef double* w_data = <double*> w.data
    cdef double* X_data = <double*> X.data
    cdef double* y_data = <double*> y.data

    cdef double* r_data = <double*> r.data

    cdef unsigned i = 0

    with nogil:
        # r = y, then r = -1.0 * (X @ w) + r in a single BLAS GEMV call.
        dcopy(m, y_data, 1, r_data, 1)
        dgemv(CblasColMajor, CblasNoTrans,
              m, n, -1.0, X_data, m,
              w_data, 1,
              1.0, r_data, 1)

    return r

This benchmark was run on Python 3.6 using Cython 0.28.3, NumPy 1.14.3, and OpenBLAS 0.2.20. The first few lines, which imported numpy and compiled the Cython code with the %%cython magic, were snipped to keep the output below succinct.

In [4]: m, n = (1000, 1100)

In [5]: w = np.random.random((n,))

In [6]: X = np.require(np.random.random((m, n)), requirements=["F"])

In [7]: y = np.random.random((m,))

In [8]: %timeit use_dot(w, X, y)
6.65 ms ± 69.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit use_gemv(w, X, y)
341 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So this is well over an order of magnitude improvement! Probably worth applying this pattern throughout this file.

@jakirkham jakirkham force-pushed the opts_cd_fast branch 2 times, most recently from 36e5fbd to 4e7366e Compare July 19, 2018 04:09
@jakirkham
Contributor Author

After trying different installs on macOS and Linux with conda and pip, it seems that the Python 2.7 failure occurs when using scikit-learn's internal BLAS, but not when using OpenBLAS. I'm not entirely sure yet why that is, or why it only shows up on Python 2.7. The bug could still very well be my own, but it is a little strange that it does not show up in other places as well.

@jnothman
Member

jnothman commented Jul 22, 2018 via email

@jakirkham
Contributor Author

Well, the question is: do we care about trying to fix the vendored BLAS, or do we want to bump the SciPy dependency and start using SciPy's Cython BLAS and LAPACK API (#11638)?

@jnothman
Member

jnothman commented Jul 24, 2018 via email

@jakirkham jakirkham force-pushed the opts_cd_fast branch 3 times, most recently from 96ac2a3 to 9c20d0f Compare August 20, 2018 23:55
@jakirkham
Contributor Author

The Cython BLAS API was added in SciPy 0.16.0. Ubuntu Trusty has SciPy 0.13.3, which is too old, though its EOL is April 2019 (~8 months away). Ubuntu Xenial has SciPy 0.17.0, which is recent enough.

@jakirkham jakirkham force-pushed the opts_cd_fast branch 3 times, most recently from d128dd0 to 3f52fd7 Compare August 23, 2018 01:49
@jakirkham
Contributor Author

It seems that GEMV with a transposed array in particular has issues, and only with the vendored BLAS in scikit-learn. I have not encountered this issue with other BLAS implementations, including OpenBLAS, MKL, and macOS's Accelerate. Given this, I suspect there is a bug in the vendored BLAS. As it comes from ATLAS's reference BLAS implementation, I looked to see whether newer versions contain a fix, but found none. Since ATLAS is not really used for its reference BLAS implementation, I expect this bug has been around for a long time.
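
For reference, the two call patterns can be sketched at the Python level with SciPy's BLAS wrappers (scipy.linalg.blas, used here purely for illustration; it is not what the PR itself calls):

import numpy as np
from scipy.linalg.blas import dgemv

m, n = 400, 300
X = np.asfortranarray(np.random.random((m, n)))
w = np.random.random(n)
r = np.random.random(m)

# Non-transposed GEMV: X @ w (the subset split out into PR #11896).
assert np.allclose(dgemv(1.0, X, w, trans=0), X @ w)

# Transposed GEMV: X.T @ r without materializing X.T
# (the call pattern that misbehaves with the vendored reference BLAS).
assert np.allclose(dgemv(1.0, X, r, trans=1), X.T @ r)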

@jakirkham
Contributor Author

For now, I have broken out a subset of this change, which only uses GEMV on non-transposed arrays, into PR #11896. That change passes with the reference BLAS as well as other BLAS implementations.

Make use of the BLAS GEMV operation in `enet_coordinate_descent` instead
of using DOT in a `for`-loop. Go ahead and use GEMV with both
non-transposed and transposed arrays. Previously we had issues with the
vendored BLAS and GEMV on transposed arrays, but this attempts to use
GEMV on transposed arrays anyway, in the hope that those cases can be
made to work as well.

As GEMV and DOT in a `for`-loop are semantically equivalent, this is a
reasonable change to make. GEMV likely uses a multithreaded approach,
unlike our application of DOT in a serial loop here. In BLAS
implementations that do use threads for DOT, we can expect GEMV to make
better use of those threads and avoid the unnecessary setup and teardown
costs that DOT in a `for`-loop is likely to incur (possibly in each
iteration of the `for`-loop).
@jakirkham jakirkham changed the title [WIP] MNT: Use GEMV in enet_coordinate_descent MNT: Use GEMV in enet_coordinate_descent Feb 22, 2019
@jakirkham jakirkham changed the title MNT: Use GEMV in enet_coordinate_descent [MRG] MNT: Use GEMV in enet_coordinate_descent Feb 22, 2019
@jakirkham
Contributor Author

Thanks to PR #13203 and PR #13084, I was able to rewrite this using the Cython BLAS API after rebasing. This passes all checks now and should be good to merge on my end. Please let me know if anything else is needed.

Note: This follows PR ( #11896 ).

@agramfort agramfort (Member) left a comment

@jeremiedbb jeremiedbb (Member) left a comment

Nice !

@thomasjpfan thomasjpfan (Member) left a comment

This is awesome!

@jnothman jnothman (Member) left a comment

Add a what's new?

@agramfort agramfort merged commit b29a961 into scikit-learn:master Feb 25, 2019
@agramfort
Member

thx @jakirkham

@ogrisel
Member

ogrisel commented Feb 26, 2019

I did a quick benchmark with the following script:

from time import time
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=300, n_features=int(1e4), n_informative=30,
                       random_state=0)
tic = time()
m = LassoCV(cv=5, max_iter=5000).fit(X, y)
print(f"duration: {time() - tic:0.3f}s")
print(np.sum(m.coef_ != 0))

With scikit-learn 0.20.2, which uses the embedded ATLAS BLAS files, the above benchmark uses only a single thread:

$ python ~/tmp/bench_lasso.py
duration: 39.316s
75

On scikit-learn master, using SciPy's OpenBLAS and the BLAS level-2 GEMV calls, I see that the 4 hyperthreads of my laptop (2 physical cores) are used (although with a significant fraction of red in htop), and I get a good speed-up:

$ python ~/tmp/bench_lasso.py
duration: 16.390s
75

Note that with numpy, scipy, and scikit-learn conda-installed from Anaconda, and therefore with MKL, I always see two threads being used, both in scikit-learn 0.20.2 and on master (built in my conda env against the scipy from Anaconda), and in both cases it takes ~20s.

So despite thrashing (lots of red across 4 hyperthreads in htop), OpenBLAS is faster than MKL on this one.

Apparently the switch from BLAS level 1 to BLAS level 2 does not change anything for MKL.
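
As an aside (not from the original thread), one way to confirm which BLAS a given numpy/scipy build links against when comparing runs like these:

import numpy as np
import scipy

# Print the BLAS/LAPACK build information, which distinguishes an MKL
# build from an OpenBLAS build.
np.__config__.show()
scipy.__config__.show()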

@jakirkham jakirkham deleted the opts_cd_fast branch February 26, 2019 19:26
@jakirkham
Contributor Author

How were you configuring threading with MKL when testing it?
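
For what it's worth (not part of the original exchange), a common way to pin the BLAS thread count is to set the relevant environment variable before the BLAS is first loaded, i.e. before importing numpy/scipy; the values below are just examples:

import os

# These must be set before numpy/scipy (and hence the BLAS) are imported.
os.environ["MKL_NUM_THREADS"] = "1"        # Intel MKL
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS
os.environ["OMP_NUM_THREADS"] = "1"        # OpenMP-based BLAS builds

import numpy as np  # noqa: E402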

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019