Conversation

@ptocca (Contributor) commented on Sep 15, 2019:

Reference Issues/PRs

Fixes enhancement request #14304 "manhattan_distances for sparse matrices is slow"

What does this implement/fix? Explain your changes.

This PR primarily affects metrics/pairwise_fast.pyx.
The new version provides a faster Cython implementation of _sparse_manhattan(), but requires that the input matrices have sorted indices.
I also improved the implementation of the dense matrix case, making it less memory-demanding.
In both cases, the implementation is now multithreaded and uses all available cores (via Cython's prange).
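
For illustration, here is a minimal pure-Python sketch of the sorted-index merge that the Cython version performs for each row pair (the helper name manhattan_row_pair is mine, not part of the PR):

import numpy as np

def manhattan_row_pair(X, px, Y, py):
    # Merge the sorted column indices of row px of X and row py of Y,
    # accumulating |x_k - y_k| over the union of their nonzero columns.
    i, i_end = X.indptr[px], X.indptr[px + 1]
    j, j_end = Y.indptr[py], Y.indptr[py + 1]
    d = 0.0
    while i < i_end and j < j_end:
        ix, iy = X.indices[i], Y.indices[j]
        if ix == iy:
            d += abs(X.data[i] - Y.data[j])
            i += 1
            j += 1
        elif ix < iy:
            d += abs(X.data[i])  # column present only in X's row
            i += 1
        else:
            d += abs(Y.data[j])  # column present only in Y's row
            j += 1
    # One row is exhausted; the remainder of the other contributes directly.
    d += np.abs(X.data[i:i_end]).sum() + np.abs(Y.data[j:j_end]).sum()
    return d

The Cython implementation runs this scan for every (px, py) pair and distributes the outer loop over threads with prange.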

I ran the existing tests via:
pytest -v metrics/tests/test_pairwise.py
There are two failures, but they are unrelated to the code I touched.

Any other comments?

This is a first attempt, so I welcome comments and suggestions.
Questions

  1. Is it OK to make it multithreaded by default?
  2. How do I provide non-regression tests? Any tutorials/examples with recommended best practices?

@jnothman (Member) left a comment:

Mostly cosmetics, thanks!

if i < X_indptr[px+1]: ix = X_indices[i]
if j < Y_indptr[py+1]: iy = Y_indices[j]

if ix==iy:
@jnothman (Member):

Spaces around binary operators please

j = Y_indptr[py]
d = 0.0
while i < X_indptr[px+1] and j < Y_indptr[py+1]:
if i < X_indptr[px+1]: ix = X_indices[i]
@jnothman (Member):

Don't put two statements on one line; see PEP 8.

"""
cdef double[::1] row = np.empty(n_features)
cdef np.npy_intp ix, iy, j
"""Pairwise L1 distances for CSR matrices.
@jnothman (Member):

The algorithm may be simpler for CSC...?

Usage:
>>> D = np.zeros(X.shape[0], Y.shape[0])
>>> cython_manhattan(X.data, X.indices, X.indptr,
... Y.data, Y.indices, Y.indptr,
@jnothman (Member):

I think you're assuming that X.has_sorted_indices

s = 0
for k in range(x.shape[1]):
s = s + fabs(x[i,k]-y[j,k])
out[i,j]=s
@jnothman (Member):

The file should end with a newline character

j = j+1

if i== X_indptr[px+1]:
while j < Y_indptr[py+1]:
@jnothman (Member):

You can use for here

D = X[:, np.newaxis, :] - Y[np.newaxis, :, :]
D = np.abs(D, D)
return D.reshape((-1, X.shape[1]))
D = np.empty(shape=(X.shape[0],Y.shape[0]))
@jnothman (Member):

Spaces after commas please

cdef np.npy_intp px, py, i, j, ix, iy
cdef double d = 0.0

cdef int m = D.shape[0]
@jnothman (Member):

You don't need these variables.

@jeremiedbb (Member) commented:

Is it OK to make it multithreaded by default?

We haven't made a decision yet :/

How do I provide non-regression tests?

Tests related to this live in test_pairwise.py. We already test some cases, but you can add a test checking that dense and sparse inputs return the same result, on float32 and float64, with and without sorted indices. You should name it test_manhattan_distances. To write it, you can take inspiration from the other tests.
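
A sketch of what such a test could look like (the parametrization and the _unsort_indices helper are my suggestions, not prescribed code):

import numpy as np
import pytest
from numpy.testing import assert_allclose
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import manhattan_distances

def _unsort_indices(A):
    # Reverse each row's (index, value) pairs: A stays a valid CSR
    # matrix, but its column indices are no longer sorted.
    for k in range(A.shape[0]):
        start, end = A.indptr[k], A.indptr[k + 1]
        A.indices[start:end] = A.indices[start:end][::-1]
        A.data[start:end] = A.data[start:end][::-1]
    A.has_sorted_indices = False
    return A

@pytest.mark.parametrize("dtype", [np.float32, np.float64])
@pytest.mark.parametrize("sort_indices", [True, False])
def test_manhattan_distances(dtype, sort_indices):
    rng = np.random.RandomState(0)
    X = rng.random_sample((10, 5)).astype(dtype)
    Y = rng.random_sample((20, 5)).astype(dtype)
    X_sp, Y_sp = csr_matrix(X), csr_matrix(Y)
    if not sort_indices:
        X_sp, Y_sp = _unsort_indices(X_sp), _unsort_indices(Y_sp)

    # Dense and sparse inputs must give the same distances.
    assert_allclose(manhattan_distances(X_sp, Y_sp),
                    manhattan_distances(X, Y), rtol=1e-5)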

@jeremiedbb (Member) left a comment:

A few comments

#cython: boundscheck=False
#cython: cdivision=True
#cython: wraparound=False
# distutils: extra_compile_args=-fopenmp
@jeremiedbb (Member):

You don't need that; we take care of the OpenMP flags in the setup.

row[X_indices[j]] = X_data[j]
for j in range(Y_indptr[iy], Y_indptr[iy + 1]):
row[Y_indices[j]] -= Y_data[j]
for px in prange(m):
@jeremiedbb (Member):

You can avoid one level of indentation by grouping nogil and prange:

for px in prange(m, nogil=True):
    ...

iy = Y_indices[j]

if ix == iy:
d = d + fabs(X_data[i] - Y_data[j])
@jeremiedbb (Member):

An in-place operation is preferable:
d += fabs(X_data[i] - Y_data[j])

Have you tried using just abs?

Member:

+1 for in-place operations everywhere (i += 1, etc.).

@ptocca (Contributor, author):

When using prange, the in-place operator signals a reduction (at the level of the for statement where the prange is used), which is not what I want here.

Maybe I should add a comment in the code to make that clear.
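
A minimal Cython sketch of the semantics in question (illustrative, not the PR code): an in-place operation on a variable inside a prange body turns it into an OpenMP reduction, combined across iterations, whereas a plain assignment makes it thread-private.

from cython.parallel import prange

def prange_sum(double[::1] vals):
    cdef double total = 0.0
    cdef int i
    # `total += ...` marks `total` as a reduction variable: each thread
    # keeps a partial sum and the partials are combined after the loop.
    for i in prange(vals.shape[0], nogil=True):
        total += vals[i]
    return total

_sparse_manhattan needs the opposite behaviour: d is reset with a plain assignment (d = 0.0) at the top of each outer iteration, which makes it thread-private, and the accumulation is written d = d + ... rather than d += ... to keep it that way.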

Member:

You're right, we want a thread-local variable here.

cdef double s = 0.0
cdef np.npy_intp i, j, k
with nogil:
for i in prange(x.shape[0]):
@jeremiedbb (Member):

Could you benchmark this function, with the environment variable OMP_NUM_THREADS=1 set before any import, against scipy.spatial.distance.cdist(metric='cityblock')?
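
One way to set up that comparison (a sketch; shapes and repeat counts are arbitrary):

import os
os.environ["OMP_NUM_THREADS"] = "1"  # must be set before any OpenMP-using import

from timeit import timeit
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import manhattan_distances

rng = np.random.RandomState(0)
X = rng.random_sample((1000, 100))
Y = rng.random_sample((1500, 100))

print("scipy cdist:", timeit(lambda: cdist(X, Y, metric='cityblock'), number=10))
print("sklearn:    ", timeit(lambda: manhattan_distances(X, Y), number=10))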

@jeremiedbb (Member) commented:

It would be nice if you could also provide a small benchmark vs master (in both single- and multi-threaded modes).

row[X_indices[j]] = X_data[j]
for j in range(Y_indptr[iy], Y_indptr[iy + 1]):
row[Y_indices[j]] -= Y_data[j]
for px in prange(m):
@jeremiedbb (Member):

Maybe add a few comments describing in English what this implementation does, as was done previously.


@ptocca (Contributor, author) commented on Sep 18, 2019:

General comment: in case you had not noticed already, I am not quite up to speed with this development process. I read around and try to do the right thing, but please do not hesitate to set me on the right path.
For instance, is it OK to push new versions of the PR often?
Should I hit Resolve Conversation only when I have pushed a version that I think addresses the issue? Or is it up to whoever raised the issue to accept the resolution?

@ptocca (Contributor, author) commented on Sep 18, 2019:

@jeremiedbb, below is a preliminary benchmark/regression test.
I timed three cases, each with 10 repeats.
What spurred me to work on the dense matrix case as well is that in v0.21.2 it appears to run single-threaded, and I did not like that.
Anyway, on my laptop (4 vCPUs) there is not much of an advantage (multithreading overhead? slow memory subsystem?). But I often work on HPC systems with 24 to 32 cores and fast memory subsystems, and it seems to make a difference there.

Random sparse matrices, 10 repeats, shapes: (1000, 3000) (1500, 3000)
sklearn version: 0.21.2
real	0m14.808s
user	0m14.780s
sys	0m0.248s
Random sparse matrices, 10 repeats, shapes: (1000, 3000) (1500, 3000)
sklearn version: 0.22.dev0
real	0m3.283s
user	0m8.001s
sys	0m0.211s
np.all(np.isclose(.,.)) True
Max abs diff: 4.263256414560601e-14


Random sparse matrices, after train_test_split(), 10 repeats, shapes: (800, 3000) (1200, 3000)
sklearn version: 0.21.2
X.has_sorted_indices: 0
Y.has_sorted_indices: 0
real	0m9.819s
user	0m9.804s
sys	0m0.237s
Random sparse matrices, after train_test_split(), 10 repeats, shapes: (800, 3000) (1200, 3000)
sklearn version: 0.22.dev0
X.has_sorted_indices: 1
Y.has_sorted_indices: 1
real	0m2.321s
user	0m5.325s
sys	0m0.230s
np.all(np.isclose(.,.)) True
Max abs diff: 4.263256414560601e-14


Random dense matrices, 10 repeats, shapes: (1000, 1000) (1500, 1000)
sklearn version: 0.21.2
real	0m10.974s
user	0m10.971s
sys	0m0.213s
Random dense matrices, 10 repeats, shapes: (1000, 1000) (1500, 1000)
sklearn version: 0.22.dev0
real	0m7.572s
user	0m24.814s
sys	0m0.235s
np.all(np.isclose(.,.)) True
Max abs diff: 1.8189894035458565e-12
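
(The benchmark script itself was not posted; a reconstruction along these lines, with guessed density and seed, would produce a comparable run for the sparse case:)

import numpy as np
import scipy.sparse as sp
from sklearn.metrics.pairwise import manhattan_distances

rng = np.random.RandomState(0)
X = sp.random(1000, 3000, density=0.01, format='csr', random_state=rng)
Y = sp.random(1500, 3000, density=0.01, format='csr', random_state=rng)

for _ in range(10):
    D = manhattan_distances(X, Y)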

@jeremiedbb (Member) commented:

For instance, is it OK to push often new versions of the PR?

I have no problem with that.

Should I hit the Resolve Conversation only when I have pushed a version that I think addresses the issue? Or is it up to who raised the issue to accept the resolution?

If you made changes that impact the lines a conversation is about, the conversation will be marked as outdated; you don't need to mark it as resolved. And it's better to let the person who started the conversation confirm that it has been addressed correctly :)

@jeremiedbb (Member) commented:

Regarding the benchmarks: the sparse case is very convincing, the dense case less so. I'll take a look tomorrow to see what's going on with the dense case.

@jeremiedbb (Member) commented:

About the dense case:

It's a bit subtle to get the best performance. The reduction is not vectorized by gcc; the way to vectorize it is to manually unroll the loop. Here's how to do it:

from cython cimport floating
from cython.parallel import prange
from libc.math cimport fabs
cimport numpy as np


def _dense_manhattan(floating[:, :] X, floating[:, :] Y, floating[:, :] out):
    cdef:
        int n_samples_x = X.shape[0]
        int n_samples_y = Y.shape[0]
        int n_features = X.shape[1]
        np.npy_intp i, j

    for i in prange(n_samples_x, nogil=True):
        for j in range(n_samples_y):
            out[i, j] = _manhattan_1d(&X[i, 0], &Y[j, 0], n_features)


cdef floating _manhattan_1d(floating *x, floating *y, int n_features) nogil:
    cdef:
        int i
        int n = n_features // 4
        int rem = n_features % 4
        floating result = 0

    # Unrolled by 4 so that gcc can vectorize the accumulation.
    for i in range(n):
        result += (fabs(x[0] - y[0])
                   + fabs(x[1] - y[1])
                   + fabs(x[2] - y[2])
                   + fabs(x[3] - y[3]))
        x += 4
        y += 4

    # Handle the remaining n_features % 4 elements.
    for i in range(rem):
        result += fabs(x[i] - y[i])

    return result

With this you can match (even slightly outperform) scipy, and scalability is good.

However, there's already pairwise_distances, which supports metric='manhattan' and n_jobs, so you can already have parallel Manhattan distances. Isn't that enough for you?
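
For reference, that existing parallel path is a one-liner (minimal usage sketch):

import numpy as np
from sklearn.metrics import pairwise_distances

X = np.random.RandomState(0).random_sample((1000, 100))
Y = np.random.RandomState(1).random_sample((1500, 100))

# n_jobs=-1 distributes the computation over all cores via joblib.
D = pairwise_distances(X, Y, metric='manhattan', n_jobs=-1)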

@ptocca (Contributor, author) commented on Sep 19, 2019:

@jeremiedbb, thanks for looking into this and for pointing out the vectorization opportunities, of which I was not aware.

Regarding the motivation for this PR: I was actually trying to improve the performance of the laplacian kernel computation (in metrics/pairwise.py), which uses the manhattan_distances() function:

def laplacian_kernel(X, Y=None, gamma=None):
    X, Y = check_pairwise_arrays(X, Y)
    if gamma is None:
        gamma = 1.0 / X.shape[1]

    K = -gamma * manhattan_distances(X, Y)
    np.exp(K, K)  # exponentiate K in-place
    return K

@jeremiedbb (Member) commented:

OK, makes sense. In that case, what would be even faster is to re-implement the whole kernel computation, because this

K = -gamma * manhattan_distances(X, Y)
np.exp(K, K)

is sub-optimal: it involves two extra loops over the distance matrix.

Re-using the above code, it would be more efficient to do something like this:

from cython cimport floating
from cython.parallel import prange
from libc.math cimport exp
cimport numpy as np


def _dense_laplacian_kernel(floating[:, :] X, floating[:, :] Y,
                            floating[:, :] out, floating gamma):
    # Note: gamma must be a floating type, since it is typically
    # 1.0 / n_features; an int would truncate it to zero.
    cdef:
        int n_samples_x = X.shape[0]
        int n_samples_y = Y.shape[0]
        int n_features = X.shape[1]
        np.npy_intp i, j
        floating tmp

    for i in prange(n_samples_x, nogil=True):
        for j in range(n_samples_y):
            # Fuse the distance computation and the exponentiation
            # into a single pass over the data.
            tmp = _manhattan_1d(&X[i, 0], &Y[j, 0], n_features)
            out[i, j] = exp(-gamma * tmp)

@jeremiedbb (Member) commented:

Anyway, I think these considerations can be moved to another PR. It will be easier to review if you stick to speeding up sparse Manhattan in this PR, to actually fix the original issue.

@ptocca (Contributor, author) commented on Sep 20, 2019:

OK, I will reinstate the existing implementation of manhattan_distances() for the dense matrix case, and perhaps create another PR with your suggestions. (By the way, the title of the PR is now inaccurate. Can one change it?)

Regarding the sparse case, there is one detail that I am not sure about.
As noted, the proposed _sparse_manhattan() assumes that the input CSR matrices have sorted indices.
To ensure that, I sort them in place prior to calling _sparse_manhattan().
I chose to do it in place because I prefer to avoid copies whenever possible, but it is debatable whether it is OK to modify something that is passed as an argument. What is your view?
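
For reference, scipy's CSR matrices expose both alternatives; a sketch of the trade-off under discussion (the helper name _ensure_sorted_indices is mine):

def _ensure_sorted_indices(X, copy=False):
    # X is a scipy.sparse CSR matrix; the has_sorted_indices flag is
    # cached by scipy, so this check is cheap.
    if not X.has_sorted_indices:
        if copy:
            return X.sorted_indices()  # new matrix; argument untouched
        X.sort_indices()               # sorts in place; no extra memory
    return X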

ptocca added a commit to ptocca/scikit-learn that referenced this pull request Sep 21, 2019
@ptocca (Contributor, author) commented on Sep 21, 2019:

I created PR #15049 to cover just the sparse matrix case.
I suppose this PR can be closed, and any further discussion on the dense case can go into a new PR.

Base automatically changed from master to main January 22, 2021 10:51
@jeremiedbb (Member) commented:

Closing in favor of #15049 (already merged). Thanks @ptocca for your contribution!

@jeremiedbb closed this on Jan 26, 2021.